02 January 2024

Using OCR for Transcribing Newspaper Articles


Image by PublicDomainPictures from Pixabay
Newspaper articles and obituaries can provide a lot of information. You're likely to find names of relatives and friends, possibly birth/death dates and places, residence, occupation/retirement or even cause of death. I personally like to transcribe newspaper clippings so I can easily work with the details...but transcribing can be a time-consuming process depending on where you find the image. Enter optical character recognition (OCR). It's not perfect, but it can make the job a lot easier.

The majority of newspaper clippings are .jpg images (photos), so it's not as easy as copy/paste. If you're finding the image on a newspaper archive site, there's a reasonable chance an OCR transcription is already provided. Those can typically be copied and pasted (though you can't edit out errors directly on the webpage). Be sure to read both the source document and the transcription. Computers aren't infallible, and sometimes they have trouble with certain characters, old/faded images or column borders. If you accidentally "transcribe" the column border, or catch some random characters from the next column, you'll have a lot of i's and l's and other gobbledygook where it doesn't belong. Articles spanning multiple columns may need to be transcribed in separate steps for each column. No matter how you cut it, OCR speeds up the process...just don't expect it to be perfect. Once you've copied and pasted the text, feel free to make any necessary edits before saving it in your tree or using it to mine the details.
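If you're comfortable with a little Python, the column-border junk described above can be filtered out before you start proofreading. This is only a minimal sketch, not a feature of any of the tools in this post. The function name `strip_border_noise` and the `BORDER_CHARS` list are my own assumptions about what border junk tends to look like, and the filter will also drop a real line that contains nothing but those characters (a lone "I" or "ill", for example), so always compare the result against the original image.

```python
import re

# Characters a column rule is commonly misread as.
# This list is an assumption; adjust it to match what you see in your own pastes.
BORDER_CHARS = "|Il1i!:;.,'`-_ "


def strip_border_noise(text: str) -> str:
    """Drop lines made only of border look-alikes and trim stray bars at line edges."""
    cleaned = []
    for line in text.splitlines():
        stripped = line.strip()
        # Keep blank lines so paragraph breaks survive.
        if not stripped:
            cleaned.append("")
            continue
        # A line with nothing but border look-alikes is almost certainly noise.
        if all(ch in BORDER_CHARS for ch in stripped):
            continue
        # Remove vertical bars (and the spaces around them) at either edge of a real line.
        stripped = re.sub(r"^[|\s]+|[|\s]+$", "", stripped)
        cleaned.append(stripped)
    return "\n".join(cleaned)


sample = "| John Smith, 84, died\nl i l\n\nTuesday at his home. |"
print(strip_border_noise(sample))
```

Paste your OCR text in place of `sample` (the obituary text here is made up) and the junk-only lines are dropped while the real lines and paragraph breaks are kept.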

Snipping tool image
If you don't have a "built-in" transcription with the image, there are tools to make the job easier. My new favorite is the built-in Windows Snipping Tool. Windows users can search for the app on their computer, and I highly suggest pinning it to your taskbar for easy access. OCR was added in the last few months. Open the image you want to transcribe, and select the area with the text. Then, click the OCR button as shown in the image. All the text will be highlighted and you'll see an option to Copy All Text. This will put the text on your clipboard and you can paste it into the program of your choice (Notepad, Word, Google Docs, etc.). For additional information on using the Windows Snipping Tool for OCR, please watch the video linked at the end of this post. The biggest caveat is that all the text needs to appear on your screen before snipping. Snipping Tool won't let you scroll to capture text. This can be problematic with longer articles/obits, because fitting them on the screen makes the text too small for the computer to "read" it.

If you have a longer article, where the text is too small for Snipping Tool to work well, try using Google Docs! (Yup, it works! I use it all the time.) Simply upload the .jpg from your computer to Google Drive. Right-click it, select Open with and choose Google Docs. You'll see the image at the top, and when you scroll down, the OCR transcription will be below. You can edit the transcription as needed right in Google Docs. More detail on using Google Docs for OCR can be found here.

If you happen to be lucky enough to have a PDF of the document you're working with, but can't just copy/paste the info (thank you Adobe and other PDF editors for locking down some documents), you can use the same Google Docs trick. You can also open PDFs in Word (though I've had mixed success), or you can use the PDFCandy.com OCR tool for free (though you are limited in how often you can use it at no charge; see my previous post here for the limitations).

We all use transcriptions a little differently. There's no right or wrong way to go about it. If you like to do it old school and enjoy reading, then typing, then reading, then typing, then...well, you get the idea...that's fine! I personally prefer to let the computer do the heavy lifting, then I'll just do a little housekeeping to tidy up the text. Please keep copyright in mind if you're planning on posting/sharing any transcriptions, and always credit the newspaper, poster, etc. appropriately. If you have another suggestion for OCR transcription, connect with me on my socials. I'd love to hear how you handle transcribing from images and PDFs.
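For anyone who wants to automate part of that housekeeping, here is a minimal Python sketch of one tidy-up pass. It assumes the OCR text comes back with the newspaper's narrow column line breaks intact, which is common but not guaranteed. The function name `tidy_transcription` is hypothetical, and the hyphen rule will also join genuinely hyphenated words that happen to break across lines (like "son-in-law"), so a read-through against the original is still needed.

```python
import re


def tidy_transcription(text: str) -> str:
    """Rejoin words hyphenated across line breaks and unwrap lines into paragraphs."""
    # "daugh-\nter" becomes "daughter". Real hyphenated words split at a
    # line end are joined too, so check those by hand.
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # A blank line marks a paragraph break; keep those.
    paragraphs = re.split(r"\n\s*\n", text)
    # Within a paragraph, collapse line breaks and runs of spaces to a single space.
    unwrapped = [re.sub(r"\s+", " ", p).strip() for p in paragraphs]
    return "\n\n".join(p for p in unwrapped if p)


sample = (
    "Mrs. Jones is sur-\nvived by her daugh-\nter, Mary.\n\n"
    "Services will be\nheld Friday."
)
print(tidy_transcription(sample))
```

The sample text is invented for illustration. The output is two clean paragraphs that can be pasted straight into your notes or tree.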


YouTube video: https://www.youtube.com/watch?v=k5tDJCLnShw
