Data Splitting and Overfitting

Application exercise
library(tidyverse)
library(tidymodels)
library(schrute)
library(lubridate)
library(kableExtra)
set.seed(1234)

Setup

To start, create a new repository within our class GitHub organization, clone that repository into RStudio, and create a new Quarto file with a title and an author. Then render, commit, and push back to GitHub.

Exercises

Use theoffice data from the schrute package to predict IMDB scores for episodes of The Office.

glimpse(theoffice)
Rows: 55,130
Columns: 12
$ index            <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16…
$ season           <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ episode          <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ episode_name     <chr> "Pilot", "Pilot", "Pilot", "Pilot", "Pilot", "Pilot",…
$ director         <chr> "Ken Kwapis", "Ken Kwapis", "Ken Kwapis", "Ken Kwapis…
$ writer           <chr> "Ricky Gervais;Stephen Merchant;Greg Daniels", "Ricky…
$ character        <chr> "Michael", "Jim", "Michael", "Jim", "Michael", "Micha…
$ text             <chr> "All right Jim. Your quarterlies look very good. How …
$ text_w_direction <chr> "All right Jim. Your quarterlies look very good. How …
$ imdb_rating      <dbl> 7.6, 7.6, 7.6, 7.6, 7.6, 7.6, 7.6, 7.6, 7.6, 7.6, 7.6…
$ total_votes      <int> 3706, 3706, 3706, 3706, 3706, 3706, 3706, 3706, 3706,…
$ air_date         <chr> "2005-03-24", "2005-03-24", "2005-03-24", "2005-03-24…

Fix air_date for later use.

theoffice <- theoffice %>%
  mutate(air_date = ymd(as.character(air_date)))

We will

  • engineer features based on episode scripts
  • train a model
  • make predictions
  • get performance metrics

Note: The episodes listed in theoffice don’t match the ones listed in the data we used in the cross validation lesson.

theoffice %>%
  distinct(season, episode)
# A tibble: 186 × 2
   season episode
    <int>   <int>
 1      1       1
 2      1       2
 3      1       3
 4      1       4
 5      1       5
 6      1       6
 7      2       1
 8      2       2
 9      2       3
10      2       4
# ℹ 176 more rows

Exercise 1 - Calculate the percentage of lines spoken by Jim, Pam, Michael, and Dwight for each episode of The Office.
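
One possible approach, sketched on a made-up stand-in for theoffice (one row per spoken line): group by episode and take the share of rows that belong to each character. The names toy_lines, toy_office_lines, and the lines_* columns are hypothetical, chosen only for illustration.

```r
library(tidyverse)

# Toy stand-in for theoffice: one row per spoken line (values are made up)
toy_lines <- tibble(
  season       = c(1, 1, 1, 1, 1, 1),
  episode      = c(1, 1, 1, 1, 2, 2),
  episode_name = c("Pilot", "Pilot", "Pilot", "Pilot",
                   "Diversity Day", "Diversity Day"),
  character    = c("Michael", "Jim", "Michael", "Oscar", "Dwight", "Pam")
)

# mean() of a logical vector is the share of TRUE values,
# so 100 * mean(...) is the percentage of lines in that episode
toy_office_lines <- toy_lines %>%
  group_by(season, episode, episode_name) %>%
  summarise(
    lines_jim     = 100 * mean(character == "Jim"),
    lines_pam     = 100 * mean(character == "Pam"),
    lines_michael = 100 * mean(character == "Michael"),
    lines_dwight  = 100 * mean(character == "Dwight"),
    .groups = "drop"
  )

toy_office_lines
```

The result has one row per episode, keyed by season, episode, and episode_name, which is the shape the later join on those three columns expects.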

Exercise 2 - Identify episodes that touch on Halloween, Valentine’s Day, and Christmas. For each holiday, create an indicator variable that records whether or not an episode touches on it.
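
A minimal sketch of one way to flag a holiday, assuming that "touches on" means the word appears somewhere in the script text. The toy data and the name toy_halloween are hypothetical; the same pattern would be repeated for Valentine’s Day and Christmas.

```r
library(tidyverse)

# Toy stand-in for theoffice: one row per spoken line (values are made up)
toy_lines <- tibble(
  episode_name = c("A", "A", "B", "C"),
  text = c("Happy Halloween!",
           "Nice costume.",
           "Merry Christmas, everyone.",
           "Quarterly reports are due.")
)

# Keep lines that mention the holiday (case-insensitive),
# reduce to one row per episode, and add a 0/1-style indicator
toy_halloween <- toy_lines %>%
  filter(str_detect(str_to_lower(text), "halloween")) %>%
  distinct(episode_name) %>%
  mutate(halloween = 1)

toy_halloween
```

A single passing mention is enough to flag an episode here, so a stricter rule (for example, a minimum number of mentions) may be a better fit for the real scripts. Episodes that are not flagged get no row, which is why the join in Exercise 3 fills the missing indicators with 0.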

Exercise 3 - Modify the following code to also create a new indicator variable called michael, which takes the value 1 if Michael Scott (Steve Carell) was there, and 0 if not. Note: Michael Scott (Steve Carell) left the show at the end of Season 7. (Make sure to remove eval: true.)

office_df <- theoffice %>%
  select(season, episode, episode_name, imdb_rating, total_votes, air_date) %>%
  distinct(season, episode, .keep_all = TRUE) %>%
  left_join(halloween_episodes, by = "episode_name") %>% 
  left_join(valentine_episodes, by = "episode_name") %>% 
  left_join(christmas_episodes, by = "episode_name") %>% 
  replace_na(list(halloween = 0, valentine = 0, christmas = 0)) %>%
  mutate(michael = if_else(season > 7, 0, 1)) %>%
  ### add the new variable here
  left_join(office_lines, by = c("season", "episode", "episode_name"))

Exercise 4 - Split the data into training (75%) and testing (25%).

set.seed(1122)
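
A minimal sketch of a 75/25 split using initial_split() from rsample (loaded with tidymodels), shown on a made-up data frame rather than office_df. The names toy_df, toy_split, toy_train, and toy_test are hypothetical.

```r
library(tidymodels)

set.seed(1122)

# Made-up data with 100 rows, standing in for the episode-level data
toy_df <- tibble(id = 1:100, y = rnorm(100))

# prop = 0.75 puts 75% of the rows in the training set
toy_split <- initial_split(toy_df, prop = 0.75)
toy_train <- training(toy_split)
toy_test  <- testing(toy_split)

dim(toy_train)
dim(toy_test)
```

Setting the seed immediately before the split is what makes the same rows land in the training set each time the document is rendered.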

Exercise 5 - Specify a linear regression model.
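
As a sketch, a linear regression specification in parsnip pairs the model type with a fitting engine. The object name office_mod is hypothetical.

```r
library(tidymodels)

# Model type: linear regression; engine: ordinary least squares via stats::lm()
office_mod <- linear_reg() %>%
  set_engine("lm")

office_mod
```

Nothing is fitted at this point; the specification only records what kind of model to fit once it is given data.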

Exercise 6 - Create a recipe that updates the role of episode_name to not be a predictor, removes air_date as a predictor, uses season as a factor, and removes all zero variance predictors.
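
The four recipe steps named in this exercise could be chained roughly as follows. This is a sketch on a made-up training set: toy_train, toy_rec, and toy_baked are hypothetical names, and the levels passed to step_num2factor() cover only the three seasons in the toy data, so they would need to be adjusted for the real data.

```r
library(tidymodels)

# Made-up training data with the same kinds of columns as office_df
toy_train <- tibble(
  episode_name = paste("Episode", 1:6),
  air_date     = as.Date("2005-03-24") + 7 * (0:5),
  season       = c(1L, 1L, 2L, 2L, 3L, 3L),
  lines_jim    = c(0.20, 0.25, 0.10, 0.30, 0.15, 0.22),
  halloween    = 0,  # constant column, so it has zero variance
  imdb_rating  = c(7.6, 8.3, 7.9, 8.1, 8.4, 8.0)
)

toy_rec <- recipe(imdb_rating ~ ., data = toy_train) %>%
  # keep episode_name in the data, but not as a predictor
  update_role(episode_name, new_role = "ID") %>%
  # drop air_date entirely
  step_rm(air_date) %>%
  # treat season as a factor rather than a number
  step_num2factor(season, levels = as.character(1:3)) %>%
  # remove predictors that take a single value
  step_zv(all_predictors())

# prep() estimates the steps on the training data; bake() applies them
toy_baked <- toy_rec %>%
  prep() %>%
  bake(new_data = NULL)

names(toy_baked)
```

In the baked data, air_date and the constant halloween column are gone, season is a factor, and episode_name is still present for identification.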

Exercise 7 - Build a workflow for fitting the model specified earlier and using the recipe you developed to preprocess the data.
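
A workflow bundles a model specification with a recipe so both travel together. The sketch below uses a made-up training set and a deliberately small recipe; toy_train and toy_wflow are hypothetical names.

```r
library(tidymodels)

# Made-up training data (values are illustrative only)
toy_train <- tibble(
  lines_jim   = c(0.20, 0.25, 0.10, 0.30, 0.15, 0.22),
  imdb_rating = c(7.6, 8.3, 7.9, 8.1, 8.4, 8.0)
)

toy_wflow <- workflow() %>%
  add_model(linear_reg() %>% set_engine("lm")) %>%
  add_recipe(
    recipe(imdb_rating ~ ., data = toy_train) %>%
      step_zv(all_predictors())
  )

toy_wflow
```

Printing the workflow lists the preprocessing steps and the model, which is a quick check that both pieces were attached before fitting.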

Exercise 8 - Fit the model to training data and interpret a couple of the slope coefficients.
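
Fitting and reading off coefficients might look like the sketch below. The data are simulated, so the estimates say nothing about the real show; toy_train, toy_fit, and toy_coefs are hypothetical names.

```r
library(tidymodels)

set.seed(1122)

# Simulated training data: rating depends on Jim's share of lines
# and on whether the episode is a Christmas episode, plus noise
toy_train <- tibble(
  lines_jim   = runif(40, 0, 0.4),
  christmas   = rep(c(0, 1), times = 20),
  imdb_rating = 8 + 1.5 * lines_jim + 0.3 * christmas + rnorm(40, sd = 0.2)
)

toy_fit <- workflow() %>%
  add_model(linear_reg() %>% set_engine("lm")) %>%
  add_recipe(recipe(imdb_rating ~ ., data = toy_train)) %>%
  fit(data = toy_train)

# One row per coefficient: term, estimate, std.error, statistic, p.value
toy_coefs <- tidy(toy_fit)
toy_coefs
```

Each slope is read holding the other predictors constant. For an indicator such as christmas, the estimate is the predicted difference in rating between episodes where it is 1 and episodes where it is 0.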

Exercise 9 - Use your model to make predictions for the testing data and calculate the R² and the RMSE.
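
A sketch of the predict-then-score pattern with yardstick, again on simulated data. The names toy_df, toy_split, toy_fit, toy_pred, toy_rsq, and toy_rmse are hypothetical, and a formula stands in for the recipe to keep the example short.

```r
library(tidymodels)

set.seed(1122)

# Simulated data standing in for the episode-level data
toy_df <- tibble(
  lines_jim   = runif(80, 0, 0.4),
  imdb_rating = 8 + 1.5 * lines_jim + rnorm(80, sd = 0.2)
)

toy_split <- initial_split(toy_df, prop = 0.75)

toy_fit <- workflow() %>%
  add_model(linear_reg() %>% set_engine("lm")) %>%
  add_formula(imdb_rating ~ lines_jim) %>%
  fit(data = training(toy_split))

# predict() returns a tibble with a .pred column;
# bind the observed ratings next to it for scoring
toy_pred <- predict(toy_fit, new_data = testing(toy_split)) %>%
  bind_cols(testing(toy_split) %>% select(imdb_rating))

toy_rsq  <- rsq(toy_pred, truth = imdb_rating, estimate = .pred)
toy_rmse <- rmse(toy_pred, truth = imdb_rating, estimate = .pred)

toy_rsq
toy_rmse
```

Both metrics are computed on the testing data only, so they estimate how the model performs on episodes it has not seen.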