
I like to use textual anchors for things like "line starts with," "line ends with," or "file ends with," and to combine them with Levenshtein distance plus some normalization (joining adjacent strings in various patterns to account for OCR wonkiness). It turns into building lists of anchors that can be built upon. Of all the things I've tried, including image hashing and the like, it's been the most effective generalized "tool".
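
A minimal Python sketch of that anchor approach, for illustration only. The function names, the 0.8 similarity threshold, and the rule of joining each line with the one after it are all assumptions, not details from the comment above.

```python
import re
import unicodedata


def normalize(text):
    # Fold compatibility forms (ligatures, full-width chars), lowercase,
    # and collapse runs of whitespace so OCR spacing noise matters less.
    text = unicodedata.normalize("NFKC", text).lower()
    return re.sub(r"\s+", " ", text).strip()


def levenshtein(a, b):
    # Classic dynamic-programming edit distance, one row at a time.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def similarity(a, b):
    # Edit distance scaled to 0..1, where 1.0 means identical.
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def line_starts_with(line, anchor, threshold=0.8):
    line, anchor = normalize(line), normalize(anchor)
    return similarity(line[:len(anchor)], anchor) >= threshold


def line_ends_with(line, anchor, threshold=0.8):
    line, anchor = normalize(line), normalize(anchor)
    start = max(0, len(line) - len(anchor))
    return similarity(line[start:], anchor) >= threshold


def file_ends_with(lines, anchor, threshold=0.8):
    # Compare against the last non-blank line of the document.
    non_empty = [ln for ln in lines if ln.strip()]
    return bool(non_empty) and line_ends_with(non_empty[-1], anchor, threshold)


def find_anchor(lines, anchor, match=line_starts_with, threshold=0.8):
    # Try each line alone and joined with the next one (with and without
    # a space), since OCR sometimes splits one printed line in two.
    for i, line in enumerate(lines):
        candidates = [line]
        if i + 1 < len(lines):
            candidates += [line + lines[i + 1], line + " " + lines[i + 1]]
        if any(match(c, anchor, threshold) for c in candidates):
            return i
    return None
```

For example, `find_anchor(page_lines, "Invoice Number:")` would return the index of the first line that fuzzily starts with that label, even if OCR read the "I" as "l" or broke the label across two lines. Once one anchor is located, further anchors can be searched for relative to it, which is roughly how a list of anchors gets built up.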

But also, I hold the strong philosophy that it's important to actually read the documents that are being scanned. In that way, OCR tends to be more of a procedural step than anything.

Really, it ultimately depends on your goals.
