I created a tool to search all released documents - open source

Got tired of manually searching PDFs so I built something.

**What it does:**

- Full-text search across all HOC releases
- OCR for scanned documents
- Date range filtering
- Named entity extraction
- Export results to CSV

GitHub: [link]

Built with Python, uses Tesseract for OCR and Elasticsearch for indexing. 4,287 documents indexed so far.

Pull requests welcome. Looking for help with:

- Better date parsing (formats are inconsistent)
- Handwriting recognition
- UI improvements

Running a public instance here: [link] - be patient, it's on a cheap server.
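For anyone who wants to pick up the date parsing item, here is a rough sketch of one possible approach: try a list of known formats in order and return the first that matches. This is not the code in the repo. The function name `parse_date` and the format list are hypothetical, it assumes English month names, and it assumes slash dates are month-first, which may not hold for every document.

```python
from datetime import date, datetime
from typing import Optional

# Hypothetical list of formats; the real documents may need more.
CANDIDATE_FORMATS = (
    "%Y-%m-%d",   # 1975-03-14
    "%m/%d/%Y",   # 03/14/1975 (assumes month-first)
    "%B %d, %Y",  # March 14, 1975
    "%d %B %Y",   # 14 March 1975
    "%b %d, %Y",  # Mar 14, 1975
    "%d %b %Y",   # 14 Mar 1975
)


def parse_date(text: str) -> Optional[date]:
    """Return the first successful parse of text, or None if nothing matches."""
    # Collapse runs of whitespace, which OCR output often contains.
    cleaned = " ".join(text.split())
    for fmt in CANDIDATE_FORMATS:
        try:
            return datetime.strptime(cleaned, fmt).date()
        except ValueError:
            continue  # Not this format; try the next one.
    return None
```

Returning `None` instead of raising lets the indexer keep a document even when its date can't be read, so it still shows up in full-text search and can be fixed later.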
