The original post: /r/datahoarder by /u/alexlazar98 on 2025-03-14 10:37:22.

https://preview.redd.it/zp9vlha0vmoe1.png?width=1200&format=png&auto=webp&s=25233afd4d8804e65b7d6dff7bab03f33fe6ef53

I want to start a personal project where I scan old books, OCR them, and index the resulting markdown. This is a book with ALL of Romania’s roads as of 1974. It has tables, maps, and all sorts of other interesting historical data points.

I already know a bit about data engineering. I’m a software engineer and I’ve built a project that helps with RAG, search, and indexing of markdown files (even very big ones). My problem is the OCR part. Any tips?