I created a tool to search all released documents - open source
Got tired of manually searching PDFs, so I built something.
**What it does:**
- Full-text search across all HOC releases
- OCR for scanned documents
- Date range filtering
- Named entity extraction
- Export results to CSV
GitHub: [link]
Built with Python, using Tesseract for OCR and Elasticsearch for indexing. 4,287 documents indexed so far.
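For anyone curious how full-text search and date range filtering can fit together on the Elasticsearch side, here's a rough sketch of a query body in the standard query DSL. The field names (`text`, `date`) and the helper `build_search_query` are made up for illustration and may not match what the repo actually uses.

```python
from datetime import date
from typing import Optional


def build_search_query(terms: str,
                       start: Optional[date] = None,
                       end: Optional[date] = None) -> dict:
    """Build an Elasticsearch query body: full-text match plus optional date range.

    The field names "text" and "date" are assumptions for illustration only.
    """
    # Full-text match against the (assumed) OCR/extracted text field.
    query = {"bool": {"must": [{"match": {"text": terms}}]}}

    # Only add range bounds the caller actually supplied.
    date_range = {}
    if start is not None:
        date_range["gte"] = start.isoformat()
    if end is not None:
        date_range["lte"] = end.isoformat()
    if date_range:
        # A filter clause narrows results without affecting relevance scoring.
        query["bool"]["filter"] = [{"range": {"date": date_range}}]

    return {"query": query}


if __name__ == "__main__":
    print(build_search_query("flight log", date(1999, 1, 1), date(2005, 12, 31)))
```

The returned dict would be passed as the body of a search request. Putting the date range in `filter` rather than `must` keeps it out of scoring, which is usually what you want for a plain date cutoff.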
Pull requests welcome. Looking for help with:
- Better date parsing (formats are inconsistent)
- Handwriting recognition
- UI improvements
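On the date parsing item, one simple starting point is to try a list of known formats in order and fall back to "unknown" if none match. This is only a sketch using the standard library: the format list and the name `parse_loose_date` are assumptions, not what the repo currently does, and the real documents will likely need more formats than these.

```python
from datetime import date, datetime
from typing import Optional

# Candidate formats, tried in order. This list is an assumption; extend it
# as new formats turn up. Slash dates are read as US-style month/day/year.
CANDIDATE_FORMATS = (
    "%Y-%m-%d",   # 2003-07-14
    "%m/%d/%Y",   # 07/14/2003
    "%m/%d/%y",   # 7/14/03
    "%B %d, %Y",  # July 14, 2003
    "%b %d, %Y",  # Jul 14, 2003
    "%d %B %Y",   # 14 July 2003
    "%d %b %Y",   # 14 Jul 2003
)


def parse_loose_date(raw: str) -> Optional[date]:
    """Return the first successful parse of `raw`, or None if nothing matches."""
    # Collapse runs of whitespace, which OCR output often contains.
    cleaned = " ".join(raw.split())
    for fmt in CANDIDATE_FORMATS:
        try:
            return datetime.strptime(cleaned, fmt).date()
        except ValueError:
            continue
    return None


if __name__ == "__main__":
    for sample in ["2003-07-14", "7/14/03", "July  14, 2003", "not a date"]:
        print(repr(sample), "->", parse_loose_date(sample))
```

Month names are matched according to the current locale (English by default), and anything that fails to parse comes back as `None` so it can be flagged for manual review instead of being silently misdated.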
Running a public instance here: [link] - be patient, it's on a cheap server.