Different feature engineering approaches using TidyTuesday’s Chopped data
I recently discovered a popular set of Python packages for automated feature generation created by FeatureLabs, an MIT spin-off that is now part of Alteryx.
The core package is called featuretools, which generates automated features and is well suited to time series classification problems, e.g. “which customers will churn this month?” or “which wind turbine will fail next week?”.
I found their blogs a bit “markety” - they seem to be fond of coining terms that make relatively simple concepts like “write parameterized functions” sound fancy: “prediction engineering”. But perhaps this coining is best interpreted as ML salesmanship? Either way, I digress. The package has excellent reviews and appears to be the real deal (read: supported and scalable).
In researching the Python package I found an excellent reticulate-based implementation in R called featuretoolsR. Finding the R package inspired me to give the framework a spin. It was also a good excuse to explore a fun TidyTuesday dataset from earlier this year, Chopped episode reviews!
A post on feature engineering that uses data from a show built around ingredients… let the analogies and puns begin!
As I played more with the data I ended up generating all kinds of features in a variety of different ways. This post explores the following:
A time series regression problem: can we predict the next episode’s rating? Spoiler: not really; the final models I created were not great.
Feature engineering: features built manually, features from recipes, and features from featuretoolsR.
Package versions used: featuretoolsR 0.4.4 with Featuretools 0.21.0.
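To make the "predict the next episode's rating" framing concrete, here is a minimal sketch of manually built lag features. The toy data and the column names rating_lag_1 and rating_prev_3_mean are hypothetical and are not taken from the Chopped dataset or from the models in this post; the point is only that every feature for an episode is built from episodes that aired before it.

```r
library(dplyr)

# Hypothetical toy data: one row per episode, one episode per week.
episodes <- tibble(
  air_date = as.Date("2020-01-07") + 7 * 0:5,
  episode_rating = c(8.5, 8.1, 8.8, 7.9, 8.4, 8.6)
)

lagged <- episodes %>%
  arrange(air_date) %>%
  mutate(
    # Rating of the previous episode (NA for the first episode)
    rating_lag_1 = lag(episode_rating, 1),
    # Mean rating of the three episodes that aired before this one
    # (NA until three earlier episodes exist)
    rating_prev_3_mean = (lag(episode_rating, 1) +
      lag(episode_rating, 2) +
      lag(episode_rating, 3)) / 3
  )

lagged
```

Because each feature only looks backwards, a model trained on these columns never sees the rating it is trying to predict.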
The Chopped data consists of a time series (episodes aired on a certain date), episode ratings (the quantity to predict), and a variety of episode data well suited to generating features.
To begin the exploratory analysis, I took a look at the distribution of ratings across episodes. (I removed any episodes with no rating when I loaded the data.)
ggplot(chopped_rating) +
  geom_histogram(aes(episode_rating)) +
  labs(
    title = "Predicting Chopped Episode Ratings",
    y = NULL,
    x = "Rating"
  )
As expected, Chopped had generally stellar ratings!
Next, I wanted to take a look at some of the features in the data to continue the exploratory analysis and see if any attributes may have the potential to predict episode rating.
One place to start is to see if episode rating varied by season. Who knew Chopped has 43 seasons?!
# Number of rated episodes per season, used to label the plot
chopped_season <- chopped_rating %>%
  group_by(season) %>%
  summarize(episode_count = n())

# geom_text_repel() comes from the ggrepel package
chopped_rating %>%
  ggplot() +
  geom_boxplot(aes(x = as.factor(season), y = episode_rating)) +
  geom_smooth(aes(x = season, y = episode_rating)) +
  geom_text_repel(data = chopped_season, aes(season, 6, label = episode_count)) +
  labs(x = 'season')
In this case it appears the ratings decreased gradually over time, with the exception of the final seasons, which had far fewer episodes with rating data. It is possible that incorporating season into the model could help predict the episode rating.
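One quick way to check whether season carries any signal is to fit a simple linear trend before building anything fancier. The sketch below uses hypothetical toy data (not the Chopped ratings) and base R's lm(); it illustrates the check rather than the model used in this post.

```r
# Hypothetical toy data: three rated episodes in each of five seasons,
# with ratings drifting downwards over time.
toy <- data.frame(
  season = rep(1:5, each = 3),
  episode_rating = c(
    9.0, 8.8, 8.9,
    8.7, 8.6, 8.8,
    8.5, 8.6, 8.4,
    8.3, 8.4, 8.2,
    8.1, 8.2, 8.0
  )
)

# Treat season as a numeric predictor and fit a straight line
fit <- lm(episode_rating ~ season, data = toy)

# A negative slope means ratings fall as seasons progress
season_slope <- coef(fit)[["season"]]
season_slope
```

On real data the boxplot above is still the better first look, since a single slope hides the behaviour of the sparse final seasons.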
Next I looked at ratings by judge. I have favorite judges after all; maybe most people do?
# One row per judge appearance, with each judge's average rating and appearance count
by_judge <- chopped_rating %>%
  select(episode_rating, season, starts_with("judge")) %>%
  pivot_longer(starts_with("judge")) %>%
  group_by(value) %>%
  mutate(avg = mean(episode_rating), appearances = n())

by_judge %>%
  ggplot(aes(reorder(value, avg), episode_rating)) +
  geom_boxplot() +
  coord_flip()
This plot is a bit hard to read, but it does suggest a few things:
There are some judges who have only done one episode. This observation makes “judge” a hard variable to use for predictions for two reasons. For existing data, many guest judges appear only once, so they offer no predictive insight. For out-of-sample data, we would likely see new judges we know nothing about, which is also not helpful for predictions.
There may be differences between the recurring judges who have been on many episodes. Let’s take a deeper look:
# Keep only judges with more than 10 appearances
influential_judges <- by_judge %>%
  select(value, avg, appearances) %>%
  unique() %>%
  filter(appearances > 10) %>%
  pull(value)

by_judge %>%
  filter(value %in% influential_judges) %>%
  ggplot(aes(reorder(value, avg), episode_rating)) +
  geom_boxplot() +
  coord_flip()
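Filtering works for a plot, but for modeling, one common way to handle the one-off judges described above is to lump them into a single catch-all level instead of dropping them. The sketch below is an assumption on my part rather than the approach used in this post: it uses forcats::fct_lump_min() on hypothetical toy data with placeholder judge names and a low threshold of 3 appearances.

```r
library(dplyr)
library(forcats)

# Hypothetical toy data: one row per judge appearance.
judges <- tibble(
  judge = c(rep("Judge A", 4), rep("Judge B", 3), "Guest X", "Guest Y"),
  episode_rating = c(8.5, 8.7, 8.4, 8.6, 8.2, 8.3, 8.1, 9.0, 7.8)
)

lumped <- judges %>%
  mutate(
    # Judges with fewer than 3 appearances are collapsed into "Other judge"
    judge_lumped = fct_lump_min(judge, min = 3, other_level = "Other judge")
  )

count(lumped, judge_lumped)
```

This also gives unseen judges in new data somewhere to go, since they can be mapped to the same catch-all level. The recipes package offers step_other() for the same idea inside a preprocessing recipe.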