Loppsided: Tasty Models

Sean Lopp

Intro

I recently discovered a popular set of Python packages for automated feature generation created by FeatureLabs, a MIT spin off that is now a part of Alteryx.

The core package is called featuretools, which generates automated features and is well suited to time series classification problems; e.g. “which customers will churn this month?” or “which wind turbine will fail next week?”.

I found their blogs a bit “markety” - they seem to be fond of coining terms that make relatively simple concepts like “write parameterized functions” sound fancy: “prediction engineering”. But perhaps this coining is best interpreted as ML salesmanship? Either way, I digress. The package has excellent reviews and appears to be the real deal (read: supported and scalable).

In researching the Python package I found an excellent reticulate-based implementation in R called featuretoolsR. Finding the R package inspired me to give the framework a spin. It was also a good excuse to explore a fun TidyTuesday dataset from earlier this year, Chopped episode reviews!

A post on feature engineering that uses data from a show built around ingredients… let the analogies and puns begin!

As I played more with the data I ended up generating all kinds of features in a variety of different ways. This post explores the following:

A time series regression problem: can we predict the next episode’s rating? Spoiler: not really, the end models I created were not great.
Feature engineering: features built manually, features from recipes, and features from featuretoolsR.


╔═════════════════════════╗
║   featuretoolsR 0.4.4   ║
╚═════════════════════════╝
✓ Using Featuretools 0.21.0

The Chopped data consists of a time series (episodes aired on a certain date), episode ratings (the quantity to predict), and a variety of episode data well suited to generating features.


paged_table(chopped_rating)

To begin the exploratory analysis, I took a look at the ratings across shows. (I removed any shows with no rating when I loaded the data).


ggplot(chopped_rating) + 
  geom_histogram(aes(episode_rating)) + 
  labs(
    title = "Predicting Chopped Episode Ratings",
    y = NULL,
    x = "Rating"
  )

As expected, Chopped had generally stellar ratings!

Features based on EDA

Next, I wanted to take a look at some of the features in the data to continue the exploratory analysis and see if any attributes may have the potential to predict episode rating.

EDA by Season

One place to start is to see if episode rating varied by season. Who knew Chopped has 43 seasons?!


chopped_season <- chopped_rating %>% 
  group_by(season) %>% 
  summarize(episode_count = n())
chopped_rating %>%
  ggplot() + 
    geom_boxplot(aes(x = as.factor(season), y =episode_rating)) + 
    geom_smooth(aes(x = season, y = episode_rating)) + 
    geom_text_repel(data = chopped_season, aes(season, 6, label=episode_count)) + 
  labs(
    x = 'season'
  )

In this case it appears the ratings decreased gradually overtime, with the exception of the final seasons which had much fewer episodes with rating data. It is possible that incorporating season into the model could help predict the episode rating.

EDA by Judge

Next I looked at ratings by judge. I have favorite judges afterall, maybe most people do?


by_judge <- chopped_rating %>% 
  select(episode_rating, season, starts_with("judge")) %>% 
  pivot_longer(starts_with("judge")) %>% 
  group_by(value) %>% 
  mutate(avg = mean(episode_rating),
         appearances = n()) 

by_judge %>% 
  ggplot(aes(reorder(value, avg), episode_rating)) + 
  geom_boxplot() + 
  coord_flip()

This plot is a bit hard to read, but it does suggest a few things:

There are some judges who have only done one episode. This observation makes “judge” a hard variable to use for predictions for two reasons. For existing data, many guest judges represent singular values, no predictive insight there. For out of sample data, it is likely we would see new judges we know nothing about, also not helpful for predictions.
There may be differences between the recurring judges who have been on many episodes. Let’s take a deeper look:


influential_judges <- by_judge %>% 
  select(value, avg, appearances) %>% 
  unique() %>% 
  filter(appearances > 10) %>% 
  pull(value)

by_judge %>% 
  filter(value %in% influential_judges) %>% 
  ggplot(aes(reorder(value, avg), episode_rating)) + 
  geom_boxplot() + 
  coord_flip()

This plot suggests that the “recurring” judges all have ranges that are pretty similar (all the boxplots are overlapping), which means the specifics of a recurring judge may not be that predictive. However something may be better than nothing, so I figured it would be worthwhile to add judges as a feature. How? One approach is create a set of dummy variables representing the recurring judges on each show:


encode_judge <- function(judges) {
  judges
  encode = tibble(
    judge = influential_judges,
    in_episode = 0
  )
  encode %>% 
    mutate(in_episode = judge %in% judges) %>% 
    pivot_wider(names_from = judge, values_from = in_episode)
}

chopped_rating[1:10,] %>%
  rowwise() %>% 
  mutate(encode_judge(c_across(starts_with("judge")))) %>% 
  paged_table()

One thing to note about this approach; we essentially had to run a “forward pass” across all of our data to compute the list of influential judges. The encoding then uses that fixed list. While this approach works, it is also risky because future data might drift - the judges might stop judging and new judges (not encoded here) could become part of the data! A similar problem occurs whenever you are training factors on data where new factors could potentially come into play.

This encoding function creates dummy variables across all the possible influential judge outcomes. Another approach would have been to create 3 variables: judge1, judge2, and judge3, and have each be a factor. In this case I went for the dummy encoding because I know judges are mutually exclusive (judge1 != judge2).

EDA Entity Extraction

There are a few variables in our data that are quite wordy. We may want to take, for instance, the episode notes and create a sentiment score (did something bad happen in highly rated episodes?). Or we could try to extract entities like title or location from the contestant info. Or we could play with the ingredients, for instance, to see if a particular ingredient was most often associated with high ratings. So many options!

I decided to start with _info fields, which contain free text information about the contestants:


chopped_rating %>% 
  select(ends_with('info')) %>% 
  paged_table()

Some of this information may be interesting for the model, but it is unlikely that the free form fields themselves will be useful. The info appears to commonly include location and title. These two bits can be extracted in a NLP process called entity extraction. Think of NLP entity extraction as a magic mutate call that can take free form text and pull out values of interest such locations.


library(spacyr)
# this launches the python session where the spacy nlp functions
# actually run. hooray reticulate!
spacy_initialize()

# lets take a look at just the first contestants
entities <- spacy_extract_entity(chopped_rating$contestant1_info)
paged_table(entities)

It looks like spacy is correctly identifying cities (GPE is short for geopolitial entity), but is getting confused by titles and persons. I’ll take what I can get, so lets create a function that will pull out cities. Unlike judges, I will create this encoding so that each contestant’s city is represented in a column contestant1_city, contestant2_city… Then later on we will use a recipe step to turn these values into factors. I use this encoding approach here because there can be multiple cities represented in the same show (in fact that is quite common).


encode_cities <- function(chefs_info){
  # attempt to extract entities from the info for all chefs in a given episode
  entities <- map_df(chefs_info, spacy_extract_entity, .id = 'contestant')
  if("ent_type" %in% colnames(entities)) {
    # pick the first city entity for each contestant
    identified <- entities %>% 
     filter(ent_type == "GPE") %>% 
     select(contestant, text) %>% 
     group_by(contestant) %>% 
     summarize(text = first(text))
  } else {
    identified <- tibble(contestant = NA, text = NA)
  }
    # we may not have found entities, so add defaults to be sure
    # we always have a column for each contestant 
    defaults <- tribble(
      ~contestant, ~text,
      "1", NA,
      "2", NA,
      "3", NA,
      "4", NA
    )
    results <- left_join(defaults, identified, by=c(contestant = "contestant")) %>% 
      mutate(location = coalesce(text.x, text.y)) %>% 
      select(contestant, location) %>% 
      pivot_wider(
        names_from = contestant,
        names_prefix = "city_",
        values_from = location
      )
    return(results)
}

chopped_rating[1:10,] %>%
  # pull out cities
  rowwise() %>% 
  mutate(encode_cities(c_across(ends_with("info")))) %>% 
  paged_table()

EDA Sentiment

Another field that might be of interest is the episode notes. While these notes are highly variable, the sentiment of the notes might be interesting. Are neutral shows boring? Are positive or negative notes revealing of a more interesting show?


library(tidytext)
sentiments <- get_sentiments("afinn")
series_sentiments <- chopped_rating %>% 
  unnest_tokens(word, episode_notes) %>% 
  left_join(sentiments) %>% 
  group_by(series_episode) %>% 
  summarize(note_sentiment = mean(value, na.rm = TRUE))

chopped_with_sentiments <- chopped_rating %>% 
  left_join(series_sentiments)

ggplot(chopped_with_sentiments) + 
  geom_jitter(aes(episode_rating, note_sentiment))

Add EDA Features

We can now finally add all these features we’ve built through our traditional EDA based approach:


chopped_plus_eda <- chopped_with_sentiments %>%
  # add judges
  rowwise() %>% 
  mutate(encode_judge(c_across(starts_with("judge")))) %>% 
  # entity extraction for cities
  mutate(encode_cities(c_across(ends_with("info"))))

Automated Features with featuretools

Our next step is do some automated feature generation using featuretools!

The package is designed to help automate feature creation through a process called “deep feature synthesis”. In my opinion this title is designed to make a pretty simple concept sound fancy. The core idea in featuretools is that you can take primitive operations to create new features, and then sometimes you can combine those operations to create even more new features.

For example, in our Chopped data set you could take primitives like “week_day” and “mode” to generate a number of new feature columns: week_day(episode), mode(season_rating), and then combinations like: mode(week_day(episode)).

There are two types of primitives, “transform” and “aggregation”. For tidyverse fans, an easy way to understand the difference is to think of the dplyr functions mutate and summarize. Transform primitives are functions that fit within a mutate; they return one value for each row. Aggregation primitives, on the other hand, require some type of grouping. They are the equivalent of a summarize function. So, in our example above, which type is week_day and mode?

Did you think about it? week_day is a transform primitive, it takes a date and for each date returns a weekday. mode is as an aggregation primitive, it returns the mode of an input column per group.

Now the part that makes “deep feature synthesis” fancy is that combinations of the two types of primitives can also be used to generate new features. For example, mode(week_day()).

Let’s apply the concept to our data:


# convert date to date time for correct time handling below
chopped_plus_eda$air_date <- mdy_hms(paste0(chopped_plus_eda$air_date, "20:00:00"))

# first we have to create an entity, which is the building block of featuretools
chopped_entity <- 
  as_entityset(
    chopped_plus_eda,
    index = "series_episode",
    time_index = "air_date",
    entity_id = "Chopped"
  )

# next we have to teach featuretools about the relationship between 
# seasons and episodes
chopped_seasons <- tibble(
  season = unique(chopped_plus_eda$season)
)

chopped_entity <- chopped_entity %>% 
  add_entity(
    df = chopped_seasons,
    entity_id = "seasons",
    index = "season"
  ) %>% 
  add_relationship(
    parent_set = "seasons",
    child_set = "Chopped",
    parent_idx = "season",
    child_idx = "season"
  )

# finally we have to be careful to handle time
# we don't want featuretools to leak data
# for instance, if we create a feature MAX(season.ingredients)
# that feature must actually be a "rolling" max over the course
# of the season, since we don't know how many ingredients a show in
# the future would have
cutoff_times <- chopped_plus_eda %>% 
  select(series_episode, air_date)

# now we perform the automatic feature generation
chopped_new_features_entity <- chopped_entity %>%
  dfs(
    target_entity = "Chopped", 
    trans_primitives = c("year",
                         "time_since_previous", 
                         "num_characters", 
                         "week", 
                         "weekday", 
                         "month", 
                         "num_words"),
    agg_primitives = c("max",
                       "mode", 
                       "num_unique", 
                       "mean", 
                       "sum"),
    max_depth = 3,
    include_cutoff_time = TRUE,
    cutoff_time = cutoff_times,
    # be sure to ignore the episode_rating since that is our target
    # we don't want it leaking into the generated features
    # also ignore some variables we don't want to use for feature generation
    ignore_variables = list(Chopped = list("episode_rating",
                                           "contestant1",
                                           "contestant2",
                                           "contestant3",
                                           "contestant4",
                                           "contestant1_info",
                                           "contestant2_info",
                                           "contestant3_info",
                                           "contestant4_info",
                                           "episode_notes"))
  )
chopped_new_features <- tidy_feature_matrix( chopped_new_features_entity)

# add back our index
chopped_all <- 
  chopped_plus_eda %>% 
  select(series_episode, 
         air_date, 
         episode_rating,
         season,
         season_episode) %>% 
  left_join(chopped_new_features, 
            by = c(season = "season", 
                   season_episode = "season_episode"))

paged_table(chopped_all)

We created many new features, and then combinations of those features! Will these be useful for modeling? That is like asking whether you can have too many ingredients in your pantry… maybe yes, maybe no…

Recipes

Now that we have our giant set of features we are ready to create a recipe and do some modeling. For fun, we can even create a few more features along the way…

One critical bit as we build our recipe is to keep in mind the time element; just like in our automated feature engineering we need to be sure we don’t leak data. We also have to be careful about time when we make our test/training split - we wouldn’t want to accidentally train on the future and test on the past!


library(timetk)
library(tidymodels)

# hold out some test data, going back
# far enough to have test data from full seasons
splits <- initial_time_split(chopped_all, prop = 0.3)
train_data <- training(splits)
test_data  <- testing(splits)

It was at this point I made an interesting discovery… the series_episode index did not always line up with the air_date.

A digression…

In other words, there were cases like this:


chopped_all %>% 
  filter(series_episode > 145, series_episode < 151) %>% 
  select(season, season_episode, series_episode, air_date, episode_name) %>% 
  paged_table()

Essentially there appears to be overlap between the beginning of one season and the end of another. This makes sense, because there are 43 seasons over the course of only 11 years. We can look at some of the details:


out_of_order = 0
out_of_order_idx = rep(FALSE, nrow(chopped_all))
for(i in 1:nrow(chopped_all)) {
  max_date <- max(chopped_all$air_date[1:(i-1)])
  if(chopped_all$air_date[i] < max_date) {
    out_of_order <- out_of_order + 1
    out_of_order_idx[i] <- TRUE
  }
}

chopped_all %>% 
  select(season, season_episode, air_date, series_episode) %>% 
  cbind(out_of_order_idx) %>% 
  paged_table()

Total out of order is 162. While surprising, I realized this out of order index didn’t make a huge difference, it just meant in order to avoid “leaking” data about the future, we would need to arrange by air_date when making test and training splits.

Note to future self: don’t take time and monotonically increasing indexes for granted!

Returning to the recipe and model fitting:


# arrange by time
# remove some variables that won't allow models to converge
# because they have a 1:1 mapping with episode rating
# fix the names created by feature tools or some models get upset
chopped_all <- chopped_all %>%
  arrange(air_date) %>% 
  select(-starts_with("judge"),
         -episode_name,
         -series_episode) %>% 
  set_tidy_names(syntactic = TRUE)
        
  

# hold out some test data, going back
# far enough to have test data from full seasons
splits <- initial_time_split(chopped_all, prop = 0.7)
train_data <- training(splits)
test_data  <- testing(splits)

# verify no overlap this time
max(train_data$air_date)


[1] "2016-05-24 20:00:00 UTC"


min(test_data$air_date)


[1] "2016-05-31 20:00:00 UTC"


# now we can create our recipe
# our goal for the recipe is to specify the formula for our predictive problem
# we also convert some of data to the appropriate type (factor, etc)
# and impute missing values
chopped_rec <- 
  recipe(episode_rating ~ ., data = train_data) %>%
  step_num2factor(season, 
                  ordered = FALSE, 
                  levels = as.character(1:43)) %>% 
  step_num2factor(season_episode, 
                  ordered = TRUE, 
                  levels = as.character(1:20))  %>%
  # this is a really important step
  # it helps our model handle cases where new cities in the test data
  # don't "surprise" the model, by assigning them automatically to a 
  # factor called "other"
  # it also assigns any cities that appear < 3 times to "other"
  step_other(all_nominal(), -season, -season_episode, threshold = 3) %>% 
  step_normalize(all_numeric(), -episode_rating) %>% 
  step_meanimpute(all_numeric(), -season_episode) %>% 
  step_zv(all_predictors())


# take a look
chopped_rec


Data Recipe

Inputs:

      role #variables
   outcome          1
 predictor         82

Operations:

Factor variables from season
Factor variables from season_episode
Collapsing factor levels for all_nominal(), ...
Centering and scaling for all_numeric(), -episode_rating
Mean Imputation for all_numeric(), -season_episode
Zero variance filter on all_predictors()


chopped_train_vals <- juice(prep(chopped_rec, train_data))
paged_table(chopped_train_vals)

Model Fitting

Now that we have our recipe and an absurd number of variables, we can do our model fitting. The goal of this post is not to compare too many model structures, but to instead get a feel for whether our added features are worthwhile!


# specify a random foerst model
mod <- rand_forest(mode = "regression") %>% 
  set_engine("ranger", importance = "impurity")

# prepare our recipe
chopped_rec_prepped <- prep(chopped_rec, train_data)

mod_fit <- 
  fit(mod, episode_rating ~ ., data = bake(chopped_rec_prepped, train_data))

train_res <- 
  predict(mod_fit, bake(chopped_rec_prepped, train_data)) %>% 
  bind_cols(train_data$episode_rating)

train_res %>% 
  metrics(.pred, `...2`)


# A tibble: 3 x 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 rmse    standard       0.162
2 rsq     standard       0.927
3 mae     standard       0.120


test_res <- 
  predict(mod_fit, bake(chopped_rec_prepped, test_data)) %>% 
  bind_cols(test_data$episode_rating)

test_res %>% 
  metrics(.pred, `...2`)


# A tibble: 3 x 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 rmse    standard      0.579 
2 rsq     standard      0.0291
3 mae     standard      0.446


ggplot(test_res, aes(.pred, `...2`)) + 
  geom_point() + 
  geom_abline(slope = 1, intercept = 0) + 
  labs(
    x = "Predicted", 
    y = "Actual"
  )

The results are bleak. While our model performs really well on the training data, it performs poorly on the test data. This behavior could be for a few reasons. I am suspicious we may be over-fitting based on our over-abundance of inputs relative to our training data. There may also be structural changes between our training data (early seasons) and our test data (later seasons). I also didn’t tune the model or think much about model structure!

It is always risky to assume more ingredients make for a better meal!

Model performance aside, we can use feature importance for tree models to get a sense for which of the many features are making the biggest impact in the model.


library(vip)
vip::vip(mod_fit)

Looks like our auto-generated features came in handy, along with some of the EDA based work. As expected, the hard work we did extracting judges didn’t help the model much, whereas the features we got for free (season and air date) were important. The model did use our sentiment data, but didn’t do much with extracted cities. Interestingly, it appears both appetizers, entree, and dessert columns were useful thanks to our auto-generated features!

What if the model only used features available in the original data set?


mod_less <- rand_forest(mode = "regression") %>% 
  set_engine("ranger", importance = "impurity")

mod_less_fit <- 
  fit(mod_less, episode_rating ~ air_date + season + season_episode, data = bake(chopped_rec_prepped, train_data))

train_less_res <- 
  predict(mod_less_fit, bake(chopped_rec_prepped, train_data)) %>% 
  bind_cols(train_data$episode_rating)

train_less_res %>% 
  metrics(.pred, `...2`)


# A tibble: 3 x 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 rmse    standard       0.240
2 rsq     standard       0.731
3 mae     standard       0.180


test_less_res <- 
  predict(mod_less_fit, bake(chopped_rec_prepped, test_data)) %>% 
  bind_cols(test_data$episode_rating)

test_less_res %>% 
  metrics(.pred, `...2`)


# A tibble: 3 x 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 rmse    standard      0.620 
2 rsq     standard      0.0717
3 mae     standard      0.492

The less complicated model doesn’t perform as well as the original model, but it does perform in the same ball park; a rmse of 0.62 vs 0.58 for the significantly more complicated model. Is the complexity worth it? In this case, most likely not. Perhaps the lesson here is that model fitting really is like cooking on Chopped - keep things simple, focus on your main ingredients, and then execute.

Notes on Production Pipelines

For future out of sample predictions we’d have to generate a pipeline to create all our features since our current recipe doesn’t capture all the things we did to prep our data. As mentioned in the EDA section, we would need to be careful to watch for never before seen “influential” judges as well as new cities. The auto engineered features would also need to be computed for each new value - if anyone is interested in a recipe step for featuretools please tweet me @lopp_sean.

Other Things to Try

I didn’t spend much time on modelling here, it could be interesting to try more models and to try fitting with cross validation instead of just a single training / testing split. To do cross validation here we would need to watch for time, the timetk package has a new function to help with this: timetk::time_series_cv.

If you don’t think we did enough feature generation there is one other avenue to try, which is lagging or rolling the episode rating. It is possible show ratings carry “momentum” that could be predictive of the next show’s rating.

Tasty Models