Help me with OCR and indexing of old books with tables, data, etc.

bOt@zerobytes.monster to It's A Digital Disease!@zerobytes.monster · 19 days ago

The original post: /r/datahoarder by /u/alexlazar98 on 2025-03-14 10:37:22.

https://preview.redd.it/zp9vlha0vmoe1.png?width=1200&format=png&auto=webp&s=25233afd4d8804e65b7d6dff7bab03f33fe6ef53

I want to start a personal project where I scan and OCR old books, then index the resulting markdown. This one is a book with ALL of Romania’s roads as of 1974. It has tables, maps, and all sorts of other interesting historical data points.

I already have some background in data engineering. I’m a software engineer, and I’ve built a project that helps with RAG, search, and indexing of markdown files (even very big ones). My problem is the OCR part. Any tips?
