Tesseract OCR on macOS Tutorial 1: Installation

OCR, or Optical Character Recognition, is the electronic or mechanical conversion of images of typed, handwritten, or printed text into machine-encoded text.

In textual research, the ability to search through a library of documents is a valuable reference tool. Many public domain documents in a wide variety of languages are readily available on the internet, but many of these PDFs do not include machine-encoded text with their pages. They are simply flat images of text that cannot be edited, copied and pasted, or, most importantly for research, thoroughly searched by electronic means. Thankfully, there is a freely available open source tool called Tesseract OCR. In this tutorial I will provide a rundown on how to install the Tesseract Open Source OCR Engine on macOS.

Disclaimer: I do not support the use of restricted documents in this way. We are dealing with public domain documents here. Newer books and other materials still under copyright are unlikely to be flat image scans if they were obtained legitimately from their copyright holders.

A Word on macOS Package Management

First you need to choose a package manager. A package manager lets you install open source software from the command line in macOS: you type a command in a terminal, and the package manager downloads, builds, and installs the software for you. The two package managers available for macOS are Homebrew and MacPorts. Both are great projects with lots of community support behind them, and both have Tesseract available for download. The biggest advantage of MacPorts is that its repository also includes ScanTailor, an amazing program to use in conjunction with Tesseract. ScanTailor makes it easy to clean up scanned documents; its many features include trimming, splitting, and deskewing pages. Homebrew, unfortunately, does not have ScanTailor available, and using both package managers on the same system is not advised. For this reason, I highly recommend MacPorts over Homebrew if you want to clean up your old scanned documents and make them searchable and easier to skim and reference.

Install MacPorts

MacPorts has a great installation guide, but I’ll walk through the essentials here.

Download and install Xcode from the App Store.

Open a terminal: ⌘+Space, and type “terminal”. The top hit should be the terminal app. Press return with the terminal app highlighted.

Accept the End User License Agreement for Xcode from the terminal. To do that, go ahead and type the following command in the terminal, then press return.

sudo xcodebuild -license

Install Apple’s Command Line Developer Tools, again from the terminal using the following.

sudo xcode-select --install

Download and install the MacPorts package for your OS version. You just need the pkg file from the assets list on the page; scroll down a bit to find it. To check your OS version, click the Apple logo in the menu bar ➝ About This Mac.
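If you already have the terminal open, the OS version can be read there as well. This is a minimal sketch that assumes the standard macOS `sw_vers` utility is on your PATH; the guard and the `os_version` variable name are my own additions so the snippet does not error out on other systems.

```shell
# sw_vers ships with macOS; -productVersion prints just the version number.
if command -v sw_vers >/dev/null 2>&1; then
    os_version="$(sw_vers -productVersion)"
else
    # Fallback message so the snippet still runs where sw_vers is missing.
    os_version="unknown (sw_vers not found; is this macOS?)"
fi
echo "macOS version: $os_version"
```

Match the number it prints against the macOS release names on the MacPorts download page.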

Update MacPorts and install Tesseract

Back at the terminal type the following, pressing return after each line.

sudo port -v selfupdate
sudo port install tesseract
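Once those two commands finish, a quick check can confirm that everything landed where the shell can find it. The sketch below is only a suggestion: `check_tool` is a hypothetical helper name of mine, while `port installed` and `tesseract --version` are the regular commands.

```shell
check_tool() {
    # Succeeds if the named command is on the PATH; otherwise prints a hint and fails.
    if command -v "$1" >/dev/null 2>&1; then
        return 0
    fi
    echo "$1 not found on PATH; try opening a new terminal window" >&2
    return 1
}

# List the installed versions of the tesseract port.
if check_tool port; then port installed tesseract; fi
# Ask the tesseract binary itself for its version.
if check_tool tesseract; then tesseract --version; fi
```

If `port` is reported as missing right after installing MacPorts, a fresh terminal window is usually worth trying first, since the installer edits your shell profile.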

Finally, install the language packs you need. The commands below install English, Icelandic, Danish, Norwegian, Swedish, German, and Latin.

sudo port install tesseract-eng
sudo port install tesseract-isl
sudo port install tesseract-dan
sudo port install tesseract-nor
sudo port install tesseract-swe
sudo port install tesseract-deu
sudo port install tesseract-lat
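If you are installing several language packs, the repeated commands can be collapsed into one, since `port install` accepts more than one port name. The following is a rough sketch rather than a required step; the variable names are mine, and it only prints the combined command so you can look it over before running it.

```shell
# Same language codes as the individual commands above.
langs="eng isl dan nor swe deu lat"

# Build the list of port names: tesseract-eng tesseract-isl ...
ports=""
for lang in $langs; do
    ports="$ports tesseract-$lang"
done
ports="${ports# }"   # drop the leading space

# Print the combined install command instead of running it.
echo "sudo port install $ports"

# After installing, the language packs Tesseract can see are listed with:
#   tesseract --list-langs
```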

For a list of available language packs, you can search for Tesseract on the MacPorts ➤ Available Ports page. Unfortunately, its search function returns only a limited number of hits. For example, Swedish doesn’t appear. This page has a full list, split across two pages.
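The same list can also be pulled up locally with `port search`, which is not subject to the web search's cutoff. A minimal sketch, assuming MacPorts is already installed; the `pattern` variable is just for illustration.

```shell
# --name restricts matching to port names; --glob treats the pattern as a shell glob.
pattern='tesseract-*'
if command -v port >/dev/null 2>&1; then
    port search --name --glob "$pattern"
else
    echo "port not found on PATH; install MacPorts first" >&2
fi
```

Keep the pattern quoted so the shell hands it to `port` unchanged instead of expanding it against files in the current directory.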

For easy access to all variants of a package on the command line, I recommend installing the bash-completion package and following the instructions provided after install. It should give you a few lines to put in .profile, a hidden file in your home directory. After quitting the terminal and starting up a new instance you can type tesseract in the terminal and press the tab key twice to get a list of every possible match.
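The package itself is installed with `sudo port install bash-completion`, and the notes it prints afterwards are the authoritative instructions. As a rough idea of what the lines for .profile tend to look like, here is a sketch that assumes the default MacPorts prefix of /opt/local; the path and the `completion_script` variable are assumptions of mine, so prefer whatever your own install prints.

```shell
# Assumed location under the default MacPorts prefix /opt/local.
completion_script=/opt/local/etc/profile.d/bash_completion.sh

# Load the completion definitions only if the file is actually there.
if [ -f "$completion_script" ]; then
    . "$completion_script"
fi
```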

[Screenshot: the options I see with the `bash-completion` package installed.]

Congratulations! If you made it this far, you can now try out Tesseract OCR for yourself. For a quick rundown on basic usage, see this page. The next step in turning a book-sized PDF into a fully fledged OCR document is extracting the image files from the PDF. Stay tuned for the next tutorial in this series.
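As a first test of the install, a single page image can be run through Tesseract from the terminal. This is only a sketch: `page.png` and `page_out` are hypothetical file names, and it assumes the English language pack from earlier is installed.

```shell
# page.png and page_out are hypothetical names; -l selects an installed language pack.
input="page.png"
output_base="page_out"   # tesseract appends .txt to this base name

if command -v tesseract >/dev/null 2>&1 && [ -f "$input" ]; then
    tesseract "$input" "$output_base" -l eng
    echo "Recognized text written to $output_base.txt"
else
    echo "Need tesseract on PATH and an image named $input to run this"
fi
```

Swap `eng` for any other installed language code, such as `isl` or `deu`, to recognize text in that language.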
