jaytaph · CodeMusings

This weekend I created a proof-of-concept system that crawls PDFs, extracts their text with Tika, stores it in Elasticsearch, and offers a simple frontend for searching the PDFs. In essence, it is a very simple system that we might use for an upcoming project at work.
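To make the crawl, extract, and store steps concrete, here is a minimal sketch and not the actual proof-of-concept code. It assumes a Tika server listening on its default port at `http://localhost:9998`, and the index name `pdfs`, the field names `path` and `content`, and all function names are made up for illustration. It finds the PDFs, asks Tika for plain text, and builds the newline-delimited JSON body that Elasticsearch's `_bulk` endpoint accepts.

```python
import json
import urllib.request
from pathlib import Path
from typing import Callable, Iterable, List

TIKA_URL = "http://localhost:9998/tika"  # assumption: local tika-server, default port
INDEX_NAME = "pdfs"                      # hypothetical index name


def find_pdfs(root: Path) -> List[Path]:
    """Return every PDF below root, sorted so runs are repeatable."""
    return sorted(
        p for p in root.rglob("*")
        if p.is_file() and p.suffix.lower() == ".pdf"
    )


def extract_with_tika(pdf: Path) -> str:
    """PUT the raw PDF to the Tika server and return the extracted plain text."""
    request = urllib.request.Request(
        TIKA_URL,
        data=pdf.read_bytes(),
        method="PUT",
        headers={"Content-Type": "application/pdf", "Accept": "text/plain"},
    )
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")


def build_bulk_body(
    pdfs: Iterable[Path],
    extract: Callable[[Path], str] = extract_with_tika,
) -> str:
    """Build an NDJSON _bulk body: one action line plus one document line per PDF."""
    lines = []
    for pdf in pdfs:
        # Using the file path as the document id means a re-crawl overwrites
        # the existing document instead of creating a duplicate.
        lines.append(json.dumps({"index": {"_index": INDEX_NAME, "_id": str(pdf)}}))
        lines.append(json.dumps({"path": str(pdf), "content": extract(pdf)}))
    return "".join(line + "\n" for line in lines)
```

The resulting string can be POSTed to the cluster's `_bulk` endpoint with the content type `application/x-ndjson`. Passing the extractor in as a parameter keeps the crawl logic testable without a running Tika server.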

There are still some things I need to add before it works correctly, though. The major one is that the PDFs will contain a lot of scanned letters, which need Tesseract for OCR. This should be relatively straightforward, but I will be spending some evenings this week setting that up as well.
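One way the OCR step could look, as a rough sketch and not the final setup: scanned pages are first rendered to images and then fed to Tesseract. This assumes the `pdftoppm` (from Poppler) and `tesseract` command-line tools are installed. The function names, the 300 DPI setting, and the `eng` language are assumptions for illustration.

```python
import subprocess
from pathlib import Path
from typing import List


def rasterize_command(pdf: Path, prefix: Path, dpi: int = 300) -> List[str]:
    """Command that renders each PDF page to a PNG whose name starts with prefix."""
    return ["pdftoppm", "-r", str(dpi), "-png", str(pdf), str(prefix)]


def ocr_command(image: Path, language: str = "eng") -> List[str]:
    """Command that OCRs one image and writes the recognized text to stdout."""
    return ["tesseract", str(image), "stdout", "-l", language]


def ocr_image(image: Path, language: str = "eng") -> str:
    """Run Tesseract on a single page image and return the recognized text."""
    result = subprocess.run(
        ocr_command(image, language),
        capture_output=True,
        text=True,
        check=True,  # raise if tesseract exits with an error
    )
    return result.stdout
```

The text returned for each page can then be joined and indexed in the same way as text extracted from non-scanned PDFs.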

Elasticsearch also has an option to import PDFs automatically these days. I haven't experimented with it yet, so I might try it and see whether it works well enough.
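Assuming the option in question is Elasticsearch's ingest attachment processor (which, depending on the version, may need to be installed as a plugin), a first experiment could look roughly like the sketch below. It only builds the two JSON bodies involved. The field name `data` and the function names are illustrative choices.

```python
import base64
import json
from typing import Dict


def attachment_pipeline(field: str = "data") -> Dict:
    """Body for creating an ingest pipeline that runs the attachment processor."""
    return {
        "description": "Extract text from base64-encoded files",
        "processors": [{"attachment": {"field": field}}],
    }


def attachment_document(raw: bytes, field: str = "data") -> Dict:
    """Document body carrying the file as base64, which the processor expects."""
    return {field: base64.b64encode(raw).decode("ascii")}


if __name__ == "__main__":
    print(json.dumps(attachment_pipeline(), indent=2))
    print(json.dumps(attachment_document(b"%PDF-1.4 example bytes")))
```

The first body would be sent with `PUT _ingest/pipeline/<name>`, and the second would be indexed with `?pipeline=<name>` on the request. The extracted text then ends up under the `attachment.content` field of the stored document.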