Going crazy with some proof-of-concepts

Created at 2023-04-24 06:32:01

This weekend I created a proof-of-concept of a system that crawls PDFs, extracts the text with Tika, stores it in Elasticsearch, and provides a simple frontend to search the PDFs. In essence, it is a very simple system that we might use for an upcoming project at work.
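To make the flow concrete, here is a minimal sketch of the crawl, extract, index, and search steps. It is not the actual proof-of-concept code. It assumes a Tika Server on localhost:9998 and Elasticsearch on localhost:9200 without authentication; the index name `pdfs`, the field names `path` and `content`, and all function names are hypothetical. "Crawling" is reduced to walking a local directory.

```python
import hashlib
import json
import urllib.request
from pathlib import Path

# Assumed local endpoints; adjust to wherever Tika Server and Elasticsearch run.
TIKA_URL = "http://localhost:9998/tika"
ES_URL = "http://localhost:9200"
INDEX = "pdfs"  # hypothetical index name


def find_pdfs(root):
    """The 'crawl' step, reduced to walking a local directory tree."""
    return sorted(p for p in Path(root).rglob("*.pdf") if p.is_file())


def tika_request(pdf_bytes):
    """Build the request that asks Tika Server for the plain text of a PDF."""
    return urllib.request.Request(
        TIKA_URL,
        data=pdf_bytes,
        method="PUT",
        headers={"Accept": "text/plain", "Content-Type": "application/pdf"},
    )


def index_request(doc_id, path, text):
    """Build the request that stores one extracted document in Elasticsearch."""
    body = json.dumps({"path": str(path), "content": text}).encode("utf-8")
    return urllib.request.Request(
        f"{ES_URL}/{INDEX}/_doc/{doc_id}",
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


def search_request(terms):
    """Build a full-text 'match' query against the extracted content."""
    body = json.dumps({"query": {"match": {"content": terms}}}).encode("utf-8")
    return urllib.request.Request(
        f"{ES_URL}/{INDEX}/_search",
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )


def send(request):
    """Perform a request and return the response body as text."""
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")


def ingest(root):
    """Extract and index every PDF found below root."""
    for pdf in find_pdfs(root):
        text = send(tika_request(pdf.read_bytes()))
        # A hash of the path gives a stable, URL-safe document id.
        doc_id = hashlib.sha1(str(pdf).encode("utf-8")).hexdigest()
        send(index_request(doc_id, pdf, text))
```

With both services running, `ingest("/path/to/pdfs")` would fill the index, and `json.loads(send(search_request("invoice")))` would return the matching documents for a frontend to render.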

There are some things I still need to add for it to function correctly, though. The major one is that the PDFs will contain a lot of scanned letters, which need Tesseract for OCR. This should be relatively straightforward, but I will be spending some evenings this week setting that up as well.
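One possible way to wire in OCR is to let Tika call Tesseract itself. The sketch below assumes Tesseract is installed on the machine running Tika Server and that the server accepts the `X-Tika-PDFOcrStrategy` and `X-Tika-OCRLanguage` request headers. The "almost no text means it is a scan" heuristic, the 40-character threshold, and the function names are my own assumptions, not part of the setup described above.

```python
import urllib.request

TIKA_URL = "http://localhost:9998/tika"  # assumed local Tika Server


def looks_scanned(text, min_chars=40):
    """Guess that a PDF is a scan when plain extraction returned almost no text."""
    return len(text.strip()) < min_chars


def tika_plain_request(pdf_bytes):
    """Ordinary text extraction, no OCR headers."""
    return urllib.request.Request(
        TIKA_URL,
        data=pdf_bytes,
        method="PUT",
        headers={"Accept": "text/plain", "Content-Type": "application/pdf"},
    )


def tika_ocr_request(pdf_bytes, language="eng"):
    """Ask Tika to render the pages and run them through Tesseract."""
    return urllib.request.Request(
        TIKA_URL,
        data=pdf_bytes,
        method="PUT",
        headers={
            "Accept": "text/plain",
            "Content-Type": "application/pdf",
            "X-Tika-PDFOcrStrategy": "ocr_only",
            "X-Tika-OCRLanguage": language,
        },
    )


def extract_text(pdf_bytes, send, language="eng"):
    """Try plain extraction first and fall back to OCR for likely scans.

    `send` is any callable that performs a request and returns the body text.
    """
    text = send(tika_plain_request(pdf_bytes))
    if looks_scanned(text):
        text = send(tika_ocr_request(pdf_bytes, language))
    return text
```

Trying plain extraction first keeps the slow OCR pass limited to documents that appear to need it; for letters in another language, `language` would be the matching Tesseract language code.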

There is also an option in Elasticsearch to import PDFs automatically these days, but I haven't experimented with that yet, so I might try it and see if it works well enough.
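I am assuming the option in question is the ingest attachment processor, which extracts text from base64-encoded files inside an ingest pipeline. A rough sketch of how that could be tried, with a hypothetical pipeline name `pdf_attachment`, index `pdfs`, and field `data`, against an unauthenticated local Elasticsearch:

```python
import base64
import json
import urllib.request

ES_URL = "http://localhost:9200"  # assumed local Elasticsearch
PIPELINE = "pdf_attachment"       # hypothetical pipeline name
INDEX = "pdfs"                    # hypothetical index name


def pipeline_request():
    """Create an ingest pipeline that runs the attachment processor on 'data'."""
    body = json.dumps({
        "description": "Extract text from uploaded PDFs",
        "processors": [{"attachment": {"field": "data"}}],
    }).encode("utf-8")
    return urllib.request.Request(
        f"{ES_URL}/_ingest/pipeline/{PIPELINE}",
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


def attachment_doc_request(doc_id, pdf_bytes):
    """Index a PDF as base64 and let the pipeline do the extraction."""
    body = json.dumps({
        "data": base64.b64encode(pdf_bytes).decode("ascii"),
    }).encode("utf-8")
    return urllib.request.Request(
        f"{ES_URL}/{INDEX}/_doc/{doc_id}?pipeline={PIPELINE}",
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )
```

Sending these with `urllib.request.urlopen` would leave the extracted text in the `attachment.content` field of each document. Whether this route handles scanned letters well is exactly the part that still needs testing.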

elasticsearch pdf tika tesseract

About jaytaph

Codemuser extraordinaire

Loves building crazy and insane stuff. Happiest when left alone. All I wanted was a Pepsi, just a Pepsi.