More OCR in a quest for New York Times Spelling Bees
In my last post I started building a dataset that compiles information from the NYTimes Spelling Bee. The work essentially parses this excellent fan website to create a tidy dataset. I discovered in that post that parsing an image to get information on the daily game board was a little tricky. In this post I talk about how I fixed the OCR process and then look at some of the results.
My last post found that a naive use of OpenCV and pytesseract was not going to correctly parse my images, which look like this:
In particular, the code was struggling to identify circular letters O and Q, it was failing to find X, and it was finding too many letters in some cases.
To rectify this issue I did three things:
I should have mentioned that in my first post I was using RStudio for all of my R and Python development. I wanted to see how far RStudio could take me in Python development. The result was pleasant: RStudio provides a Python environment pane for viewing objects as you execute code. This environment pane is really handy for interactive data science tasks, because it lets you look at objects without requiring a special debug mode. However, in my case, the core part of my code was a loop:
for contour in contours:
    # select cropped section
    # classify cropped section as letter
To debug what was going wrong with certain characters I needed a fast and easy way to pause this loop and inspect the cropped image, the prediction, etc. This type of workflow is exactly what debuggers were built for, and in my case, that meant turning to VS Code because RStudio does not currently have a Python debugger.
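As a rough sketch of that pause-and-inspect workflow, the loop below stops in the debugger only when a prediction disagrees with a known answer. The names `classify_crops`, `classify`, `expected`, and `pause` are hypothetical and not taken from the post's repo; the only real API used is Python's built-in `breakpoint()`, which most editors with a Python debugger, VS Code included, can attach to.

```python
def classify_crops(crops, classify, expected=None, pause=breakpoint):
    """Classify each cropped image, pausing when a label looks wrong.

    crops    -- sequence of cropped images (any objects `classify` accepts)
    classify -- hypothetical callable returning a label for one crop
    expected -- optional sequence of known-correct labels, same order as crops
    pause    -- called on a mismatch; defaults to the built-in breakpoint()
    """
    labels = []
    for i, crop in enumerate(crops):
        label = classify(crop)
        if expected is not None and label != expected[i]:
            # On a mismatch, `crop`, `label`, and `i` are all in scope,
            # so they can be inspected from the debugger prompt.
            pause()
        labels.append(label)
    return labels
```

Passing a different `pause` callable (for example, one that logs the mismatch) keeps the same loop usable in non-interactive runs.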
Once I was set up with a debugger and test cases, it was relatively easy to see why classification was failing for certain characters. I realized right away that I had made a mistake in my original code. I was cropping and classifying the original image, even though I had generated contours off of a grayscale transformation of the image. Fixing that problem eliminated some of the cases where OCR found too many characters, and it allowed tesseract to correctly identify X. In my original code I had used a simple rule to determine if a contour was a character or another part of the image. I solved another set of my test cases by making this rule more precise, eliminating contours that were picking up partial characters within other characters. Finally, I discovered that the OpenCV contour, which I was using to crop one character out of my larger image, was not working well for O and Q.
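One way such a rule could look, as a sketch rather than the rule actually used in the repo: keep only bounding boxes in a plausible size range, then drop any box that sits entirely inside another kept box (for example, the inner loop of a letter). The function name `keep_letter_boxes` is hypothetical. The thresholds borrow the 50 and 300 bounds that appear in the post's code, but here they are applied to bounding-box area rather than contour area, so treat them as illustrative.

```python
def keep_letter_boxes(boxes, min_area=50, max_area=300):
    """Filter (x, y, w, h) boxes, the tuple shape cv2.boundingRect returns.

    Assumption: letters fall in a known size range, and a box fully
    contained in another kept box is a fragment of that larger character.
    """
    def area(box):
        return box[2] * box[3]

    def inside(inner, outer):
        ix, iy, iw, ih = inner
        ox, oy, ow, oh = outer
        return (
            inner != outer
            and ix >= ox and iy >= oy
            and ix + iw <= ox + ow
            and iy + ih <= oy + oh
        )

    # Step 1: size filter, ignoring specks and large shapes.
    sized = [b for b in boxes if min_area < area(b) < max_area]
    # Step 2: drop boxes nested inside another surviving box.
    return [b for b in sized if not any(inside(b, other) for other in sized)]
```

The nested-box check is quadratic in the number of boxes, which should be fine for a game board with only a handful of contours.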
At this point I had an interesting thought. In my case, all of my images were essentially the same. The letters were positioned in exactly the same locations, and each of the 26 letters looked the same whenever it appeared in an image. This realization meant I didn’t actually need to be doing ML at all. I decided to take this shortcut by saving samples of O and Q. I then added a simple case statement to check my cropped image against those two samples, providing 100% accuracy for my classification of those two letters. I could have followed the same procedure for the other 24 letters, but at this point the ML approach was working for everything else and I decided to let it be.
The resulting code can be seen in the post’s GitHub repo: https://github.com/slopp/nytbee. A summary of the core OCR algorithm is:
# use contours to dissect the parts of our image
contours, _ = cv2.findContours(gray, cv2.RETR_TREE, cv2.CHAIN_APPROX_SIMPLE)
ocr_letters = list()
for cnt in contours:
    # for each contour rough out the area
    # we know the approx size of our letters, so ignore everything else
    area = cv2.contourArea(cnt)
    if area < 300 and area > 50:
        # if the contour looks like a letter and is the right size, create a cropped
        # image that contains just the character
        x, y, w, h = cv2.boundingRect(cnt)
        cropped = gray[y:y + h, x:x + w]
        # if the cropped image matches an O or a Q, assign the label
        if np.array_equal(cropped, o_img):
            ocr_letters.append('O')
        elif np.array_equal(cropped, q_img):
            ocr_letters.append('Q')
        else:
            # otherwise, have pytesseract tell us what the image is
            ocr_letters.append(pytesseract.image_to_string(cropped, config="--psm 10"))
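A small post-processing step can catch the "too many letters" failure early. The sketch below is not from the repo: `clean_ocr_letter` and `board_letters` are hypothetical helpers. It assumes the raw strings from OCR may carry trailing whitespace or newline characters, and it relies on the fact that a Spelling Bee board has exactly 7 distinct letters.

```python
import string


def clean_ocr_letter(raw):
    """Return a single uppercase letter, or None if the OCR text is not one.

    Assumption: raw OCR output may include surrounding whitespace,
    newlines, or form feeds, all of which str.strip() removes.
    """
    text = raw.strip().upper()
    if len(text) == 1 and text in string.ascii_uppercase:
        return text
    return None


def board_letters(raw_results, expected=7):
    """Normalize OCR results and check that the board has `expected` letters."""
    cleaned = (clean_ocr_letter(r) for r in raw_results)
    unique = sorted({letter for letter in cleaned if letter is not None})
    if len(unique) != expected:
        raise ValueError(
            f"expected {expected} distinct letters, got {len(unique)}: {unique}"
        )
    return unique
```

Failing loudly here makes a bad parse show up at the image where it happens, rather than later as an odd row in the dataset.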
With my data corrected, I was able to return to my original analysis. Are there certain letters, required letters, or genius scores that are predictive of the number of pangrams?
# A tibble: 6 x 8
# Groups: date
  date     total_letters_f… num_pangram max_score max_words min_genius
  <chr>               <int>       <dbl>     <dbl>     <dbl>      <dbl>
1 Friday,…                7           1        92        37         64
2 Friday,…                7           1        92        37         64
3 Friday,…                7           1        92        37         64
4 Friday,…                7           1        92        37         64
5 Friday,…                7           1        92        37         64
6 Friday,…                7           1        92        37         64
# … with 2 more variables: letters <chr>, req_letter <chr>
Using this data we can learn one thing… games with many pangrams are outliers, and predicting outliers is hard.
data %>%
  ggplot() +
  geom_histogram(aes(num_pangram)) +
  theme_minimal() +
  labs(
    x = "Number of Pangrams",
    y = "Games with that Many Pangrams",
    title = "Games with more than 3 pangrams are extreme outliers"
  )
data %>%
  select(-letters) %>%
  unique() %>%
  ggplot(aes(x = reorder(req_letter, num_pangram), y = num_pangram)) +
  geom_boxplot() +
  coord_flip() +
  theme_minimal() +
  labs(
    x = "Required Letter",
    y = "Number of Pangrams",
    title = "No required letter is a greater predictor of the number of pangrams"
  )
data %>%
  unique() %>%
  ggplot(aes(x = reorder(letters, num_pangram), y = num_pangram)) +
  geom_boxplot() +
  coord_flip() +
  theme_minimal() +
  labs(
    x = "Letter",
    y = "Number of Pangrams",
    title = "Non-required letters aren't good pangram predictors either"
  )
We can refine this view a bit with a model:
library(tidymodels)
lm_mod <- logistic_reg()
data$num_pangram <- as.factor(data$num_pangram)
lm_fit <- lm_mod %>%
  fit(num_pangram ~ min_genius + req_letter + letters, data = data)
views <- tidy(lm_fit)
views %>%
  arrange(desc(estimate))
# A tibble: 47 x 5
   term        estimate std.error statistic    p.value
   <chr>          <dbl>     <dbl>     <dbl>      <dbl>
 1 lettersQ       1.86      1.59      1.17  0.242
 2 req_letterW    1.82      0.386     4.72  0.00000231
 3 req_letterK    1.26      0.376     3.35  0.000795
 4 req_letterD    0.915     0.276     3.31  0.000936
 5 req_letterU    0.661     0.311     2.12  0.0336
 6 req_letterY    0.661     0.312     2.12  0.0340
 7 req_letterB    0.503     0.278     1.81  0.0702
 8 req_letterH    0.387     0.339     1.14  0.253
 9 req_letterP    0.280     0.252     1.11  0.265
10 req_letterF    0.189     0.455     0.416 0.678
# … with 37 more rows