
Introducing looqs - FTS for your files with previews for search results


2022-06-16

looqs - Why?


Almost everyone has experienced the following: you have lots of documents, such as PDF files, lying around. Sometimes you need to find the page or presentation slide in one of them that contains the information you need. The problem is that you don't know exactly which file to look in.

So you have several options: you can make educated guesses, or you can open several files in your document viewer and search each of them. Or you can try "pdfgrep". I've observed another common pattern where people use tools to merge several PDF files into one, so that at least there is only a single document to search. Even then, depending on the viewer, search is often rather slow. And this workaround doesn't work for all your files anyway. What if you want to search all your files?

In summary, the problem is these options are neither convenient nor particularly fast. Can we do better?

Screenshots




[Screenshot: looqs - Preview functionality]

[Screenshot: looqs - Short form of filters and booleans]

looqs to the rescue


looqs is a tool that looks for your files and also takes a look inside them. Firstly, it creates a full-text index of your files. Secondly, it allows you to search them. Finally, it does not only tell you which file contains your keywords, it also gives you a preview/snippet: it renders the pages or portions where your hits have been found and highlights your keywords. You can then just click on a rendered page, and your favourite document viewer or editor opens the document immediately at the relevant page.

In more general terms, the goal of looqs is to help you find the information you have in files stored on your local drive quickly and seamlessly. Say goodbye to pdfgrep and manual searches - find what you need without effort.

Challenges


Quality of results
A lot of time can be spent on improving the quality of results. For now, looqs uses BM25 as provided by SQLite to gauge the relevance of documents to your search terms. As a rule of thumb, the more words you enter, the more relevant the results will be. In combination with the preview functionality, I generally find what I need quite quickly. Sometimes you know approximately which documents contain the information you need. In that case, the search can be narrowed to certain paths to limit the results you have to review.
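To illustrate the idea, here is a minimal sketch of BM25 ranking with SQLite's FTS5 extension, including a path restriction. It is not looqs's actual schema or query code: the table name "pages", its columns, and the helper functions are made up for this example, and it assumes your Python's SQLite build includes FTS5.

```python
import sqlite3


def build_index(pages):
    """Build an in-memory FTS5 index from (path, page_number, text) tuples.

    The table layout is hypothetical, chosen only for this illustration.
    """
    con = sqlite3.connect(":memory:")
    # path and page are stored but not tokenized; only content is searchable.
    con.execute(
        "CREATE VIRTUAL TABLE pages USING fts5("
        "path UNINDEXED, page UNINDEXED, content)"
    )
    con.executemany(
        "INSERT INTO pages(path, page, content) VALUES (?, ?, ?)", pages
    )
    return con


def search(con, query, path_prefix=None):
    """Return (path, page, score) rows, best match first.

    FTS5's bm25() returns lower (more negative) values for better matches,
    so an ascending sort puts the most relevant rows first.
    """
    sql = (
        "SELECT path, page, bm25(pages) AS score "
        "FROM pages WHERE pages MATCH ?"
    )
    params = [query]
    if path_prefix is not None:
        # Narrow the search to files below a certain directory.
        sql += " AND substr(path, 1, ?) = ?"
        params += [len(path_prefix), path_prefix]
    sql += " ORDER BY score"
    return con.execute(sql, params).fetchall()


SAMPLE_PAGES = [
    ("/home/user/docs/a.pdf", 1, "sandbox sandbox seccomp filter for the sandbox"),
    ("/home/user/docs/b.pdf", 3, "cooking recipes and a short note on seccomp"),
    ("/home/user/notes/c.txt", 1, "the sandbox uses seccomp"),
]

con = build_index(SAMPLE_PAGES)
for path, page, score in search(con, "sandbox seccomp"):
    print(path, page, round(score, 3))
```

Entering both words excludes the cooking document, which only mentions one of them, and passing path_prefix="/home/user/notes/" would limit the results to that directory.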

Complexity and Security
Naturally, looqs has to handle lots of files. Indexing is a challenge for some file formats, rendering even more so.

looqs has to be able to open many file formats. That is not a problem for plain text files, but the more complex the file format, the more complex the parser. Parser bugs that may ultimately lead to code execution, and thus to a potential system compromise, have to be reckoned with.

looqs mitigates this problem with a multi-process architecture: parsing and rendering run in separate processes that are sandboxed using exile.h. The sandboxed processes are isolated from the network and do not have permission to write to the filesystem. The set of system calls they can issue is limited.

Since looqs won't be able to parse and preview every single file format there is, it will support external content processors. Instead of looqs doing the work itself, small binaries/scripts can handle the task, taking complexity away from looqs. Currently, I aim to handle the rendering of simple text files, .pdf and .epub within the looqs core logic.
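As a rough sketch of what delegating to an external content processor could look like, the snippet below runs a helper program on a file and reads the extracted text from its standard output. The protocol here (file path as the last argument, text on stdout, non-zero exit code on failure) is purely an assumption for illustration; how looqs will actually talk to such processors is not settled here. The two "processors" are stand-ins implemented with the Python interpreter itself.

```python
import os
import subprocess
import sys
import tempfile


def extract_text(processor_cmd, path, timeout=10.0):
    """Run a (hypothetical) external processor and return the text it prints.

    Returns None if the processor cannot be started, times out, or exits
    with a non-zero status, so a broken helper cannot take the caller down.
    """
    try:
        proc = subprocess.run(
            [*processor_cmd, path],
            capture_output=True,
            text=True,
            timeout=timeout,
            check=False,
        )
    except (OSError, subprocess.TimeoutExpired):
        return None
    if proc.returncode != 0:
        return None
    return proc.stdout


# Stand-in processors for the demo: one uppercases the file, one always fails.
UPPERCASE_PROCESSOR = [
    sys.executable,
    "-c",
    "import sys; sys.stdout.write(open(sys.argv[1]).read().upper())",
]
FAILING_PROCESSOR = [sys.executable, "-c", "import sys; sys.exit(3)"]

with tempfile.TemporaryDirectory() as tmp:
    sample = os.path.join(tmp, "note.txt")
    with open(sample, "w") as fh:
        fh.write("hello looqs\n")
    demo_text = extract_text(UPPERCASE_PROCESSOR, sample)
    demo_failed = extract_text(FAILING_PROCESSOR, sample)

print(repr(demo_text), demo_failed)
```

Keeping the interface this small means a processor can be written in any language, and a crash or hang in one of them only costs a single file.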

No bloated daemons
looqs should not depend on running daemons. Bad experiences with other "content trackers" have taught me that this leads to trouble. Though you can limit their priority, somehow they always end up causing high CPU usage, slowdowns, etc. I also want to avoid heavy-weight database solutions, and I see the benefit of having a CLI.

CLI
A CLI is provided that can perform searches and display the paths where results have been found. This way, looqs can also be used by tools such as adhocify or be employed in cron jobs. It can also speed up finding a directory to navigate to in a shell.

Status


looqs is at an early stage; the first releases are taking the necessary steps for looqs to mature. Indexing is implemented for several file formats, and the preview functionality is available for PDF and plaintext files. Packages are available for Ubuntu 21.10 and 22.04.

Nevertheless, looqs still has a long way to go in terms of usability and functionality.

Roadmap


Many ideas are on the table: better presentation of results, more metadata scanning, improved file format support, OCR for screenshots, ...

Resources


looqs repository: github.com. All development happens here.
Documents: See the README for instructions on how to download/build looqs. HACKING.md covers some technical aspects. Of course, there is also a user manual.