Jeddit
j/CourtDocs
Posted by u/DataScraper, 2mo ago

I created a tool to search all released documents - open source

Got tired of manually searching PDFs, so I built something.

**What it does:**

- Full-text search across all HOC releases
- OCR for scanned documents
- Date range filtering
- Named entity extraction
- Export results to CSV

GitHub: [link]

Built with Python, using Tesseract for OCR and Elasticsearch for indexing. 4,287 documents indexed so far. Pull requests welcome.

**Looking for help with:**

- Better date parsing (formats are inconsistent)
- Handwriting recognition
- UI improvements

Running a public instance here: [link]. Be patient, it's on a cheap server.
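For anyone who wants to pick up the date parsing item, here is a rough sketch of one possible approach: try a list of known formats in order and return the first that matches. The format list and the `parse_date` name are assumptions for illustration, not what the repo currently does, and the month-first reading of `03/07/1994` is a guess about the documents. Month names are matched in the default English locale.

```python
from datetime import date, datetime
from typing import Optional

# Hypothetical list of formats; the real documents may use others.
CANDIDATE_FORMATS = (
    "%Y-%m-%d",   # 1994-03-07
    "%m/%d/%Y",   # 03/07/1994 (assumes month-first)
    "%B %d, %Y",  # March 7, 1994
    "%b %d, %Y",  # Mar 7, 1994
    "%d %B %Y",   # 7 March 1994
)


def parse_date(text: str) -> Optional[date]:
    """Return the first successful parse of `text`, or None if no format fits."""
    # Collapse runs of whitespace, which OCR output often contains.
    cleaned = " ".join(text.split())
    for fmt in CANDIDATE_FORMATS:
        try:
            return datetime.strptime(cleaned, fmt).date()
        except ValueError:
            continue  # This format did not match; try the next one.
    return None
```

Returning `None` instead of raising lets the indexer keep the document and flag the date for manual review.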
4 Comments
This is amazing. I've been manually searching PDFs like a caveman. Definitely going to use this.
Can you add a feature to highlight name co-occurrences? Like showing which names appear together in the same documents?
That's a great idea. I'll add it to the roadmap. Entity co-occurrence is exactly what we need for network analysis.
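A minimal sketch of how co-occurrence counting could work, assuming the entity extraction step already produces a mapping from document ID to the names found in it. The `count_cooccurrences` function and the input shape are hypothetical, not taken from the repo.

```python
from collections import Counter
from itertools import combinations
from typing import Dict, Iterable


def count_cooccurrences(doc_entities: Dict[str, Iterable[str]]) -> Counter:
    """Count how many documents each pair of names appears in together."""
    pair_counts: Counter = Counter()
    for names in doc_entities.values():
        # Deduplicate within a document and sort so each pair has one canonical order.
        unique_names = sorted(set(names))
        for pair in combinations(unique_names, 2):
            pair_counts[pair] += 1
    return pair_counts


# Example with made-up placeholder names.
docs = {
    "doc1": ["A", "B", "A"],
    "doc2": ["B", "A", "C"],
    "doc3": ["C"],
}
counts = count_cooccurrences(docs)
```

Each pair is a sorted tuple, so `counts.most_common()` gives a weighted edge list that can be fed into a graph library for the network analysis.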
The OCR quality varies a lot. Some of those scanned docs from the 90s are barely readable.