Going crazy with some proof-of-concepts
This weekend I built a proof-of-concept of a system that crawls PDFs, extracts their text with Tika, and stores it in Elasticsearch, plus a simple frontend to search the PDFs. In essence, it is a very simple system that we might use for an upcoming project at work.
There are still some things I need to add for it to function correctly, though. The major one is that the PDFs will contain a lot of scanned letters, which need Tesseract for OCR. This should be relatively straightforward, but I will be spending some evenings this week setting that up as well.
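To give an idea of the crawl, extract and store steps, here is a minimal sketch in Python. It is not the actual proof-of-concept code: the function names, the `pdfs` index name and the document fields are all made up for illustration, and `extract_with_tika` assumes a Tika server listening on its default port (9998) on localhost. The result is a newline-delimited JSON body in the shape that Elasticsearch's `_bulk` endpoint expects.

```python
import json
import urllib.request
from pathlib import Path
from typing import Callable, Iterator


def find_pdfs(root: Path) -> Iterator[Path]:
    """Yield every PDF below root (case-insensitive extension match)."""
    for path in sorted(root.rglob("*")):
        if path.is_file() and path.suffix.lower() == ".pdf":
            yield path


def extract_with_tika(pdf: Path, tika_url: str = "http://localhost:9998/tika") -> str:
    """Send the PDF to a Tika server and return the extracted plain text.

    Assumption: a Tika server is running locally on its default port.
    """
    request = urllib.request.Request(
        tika_url,
        data=pdf.read_bytes(),
        method="PUT",
        headers={"Accept": "text/plain", "Content-Type": "application/pdf"},
    )
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")


def build_bulk_body(
    root: Path,
    extract_text: Callable[[Path], str],
    index: str = "pdfs",  # hypothetical index name
) -> str:
    """Build an NDJSON body for Elasticsearch's _bulk endpoint."""
    lines = []
    for pdf in find_pdfs(root):
        # Each document takes two lines: an action line, then the source.
        action = {"index": {"_index": index, "_id": pdf.relative_to(root).as_posix()}}
        doc = {"filename": pdf.name, "content": extract_text(pdf)}
        lines.append(json.dumps(action))
        lines.append(json.dumps(doc))
    # The bulk API requires the body to end with a newline.
    return ("\n".join(lines) + "\n") if lines else ""
```

Calling `build_bulk_body(Path("some/dir"), extract_with_tika)` gives a string that can be POSTed to `/_bulk` with the `Content-Type: application/x-ndjson` header. Passing the extractor in as an argument keeps the crawl testable without a running Tika server.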
There is also an option in Elasticsearch to import PDFs automatically these days, but I haven't experimented with that yet, so I might try and see if that works well enough.
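I'm assuming the option in question is the ingest attachment processor, which I haven't tried yet. If so, a rough sketch of what it needs could look like this: a pipeline definition and documents that carry the file as base64. The helper names and the `data` and `filename` fields are my own choices here, not something from the proof-of-concept.

```python
import base64


def attachment_pipeline(field: str = "data") -> dict:
    """Body for an ingest pipeline that runs the attachment processor.

    Assumption: `field` is the document field holding the base64 file.
    """
    return {
        "description": "Extract text from base64-encoded files",
        "processors": [{"attachment": {"field": field}}],
    }


def attachment_document(raw: bytes, filename: str, field: str = "data") -> dict:
    """Wrap raw file bytes in a document the pipeline above can process."""
    return {
        "filename": filename,
        # The attachment processor expects the file content as base64 text.
        field: base64.b64encode(raw).decode("ascii"),
    }
```

The pipeline body would be sent with `PUT _ingest/pipeline/<name>`, and documents would then be indexed with `?pipeline=<name>`, after which the extracted text should show up under `attachment.content`. Depending on the Elasticsearch version, the processor may have to be installed as a plugin first.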
About jaytaph
Codemuser extraordinaire