library(tidyverse)
library(tidymodels)
library(schrute)
library(lubridate)
library(kableExtra)
set.seed(1234)
Data Splitting and Overfitting
Application exercise
Setup
To start, create a new repository within our class GitHub organization, clone that repository into RStudio, and create a new Quarto file with a title and an author. Then render, commit, and push back to GitHub.
Exercises
Use the theoffice data from the schrute package to predict IMDB scores for episodes of The Office.
glimpse(theoffice)
Rows: 55,130
Columns: 12
$ index <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16…
$ season <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ episode <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ episode_name <chr> "Pilot", "Pilot", "Pilot", "Pilot", "Pilot", "Pilot",…
$ director <chr> "Ken Kwapis", "Ken Kwapis", "Ken Kwapis", "Ken Kwapis…
$ writer <chr> "Ricky Gervais;Stephen Merchant;Greg Daniels", "Ricky…
$ character <chr> "Michael", "Jim", "Michael", "Jim", "Michael", "Micha…
$ text <chr> "All right Jim. Your quarterlies look very good. How …
$ text_w_direction <chr> "All right Jim. Your quarterlies look very good. How …
$ imdb_rating <dbl> 7.6, 7.6, 7.6, 7.6, 7.6, 7.6, 7.6, 7.6, 7.6, 7.6, 7.6…
$ total_votes <int> 3706, 3706, 3706, 3706, 3706, 3706, 3706, 3706, 3706,…
$ air_date <chr> "2005-03-24", "2005-03-24", "2005-03-24", "2005-03-24…
Convert air_date from character to a date for later use.
theoffice <- theoffice %>%
  mutate(air_date = ymd(as.character(air_date)))
We will
- engineer features based on episode scripts
- train a model
- make predictions
- get performance metrics
Note: The episodes listed in theoffice don’t match the ones listed in the data we used in the cross validation lesson.
theoffice %>%
  distinct(season, episode)
# A tibble: 186 × 2
season episode
<int> <int>
1 1 1
2 1 2
3 1 3
4 1 4
5 1 5
6 1 6
7 2 1
8 2 2
9 2 3
10 2 4
# ℹ 176 more rows
Exercise 1 - Calculate the percentage of lines spoken by Jim, Pam, Michael, and Dwight for each episode of The Office.
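One way to approach Exercise 1 is sketched below. It assumes the character column stores first names exactly as they appear in the glimpse output ("Jim", "Pam", "Michael", "Dwight"). The object name office_lines matches the data frame joined in Exercise 3; the lines_* column names are assumptions chosen for illustration.

```r
library(tidyverse)
library(schrute)

# One row per episode, with the share of lines (in percent) spoken by
# each of the four characters. mean() of a logical vector is the
# proportion of TRUE values, so multiplying by 100 gives a percentage.
office_lines <- theoffice %>%
  group_by(season, episode, episode_name) %>%
  summarize(
    lines_jim     = 100 * mean(character == "Jim"),
    lines_pam     = 100 * mean(character == "Pam"),
    lines_michael = 100 * mean(character == "Michael"),
    lines_dwight  = 100 * mean(character == "Dwight"),
    .groups = "drop"
  )

office_lines
```

Grouping by season, episode, and episode_name keeps all three keys that the later left_join() uses.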
Exercise 2 - Identify episodes that touch on Halloween, Valentine’s Day, and Christmas. For each holiday, create an indicator variable that marks whether or not an episode touches on it.
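A rough sketch for Exercise 2 follows. It assumes that an episode "touches on" a holiday when the holiday's name appears in more than one line of dialogue; both the keyword patterns and the threshold are illustrative choices, not the only reasonable definition. The object names match the data frames joined in Exercise 3.

```r
library(tidyverse)
library(schrute)

# Flag an episode when the keyword appears in more than one line
# (assumed threshold, meant to skip passing mentions).
halloween_episodes <- theoffice %>%
  filter(str_detect(text, regex("halloween", ignore_case = TRUE))) %>%
  count(episode_name) %>%
  filter(n > 1) %>%
  mutate(halloween = 1) %>%
  select(-n)

valentine_episodes <- theoffice %>%
  filter(str_detect(text, regex("valentine", ignore_case = TRUE))) %>%
  count(episode_name) %>%
  filter(n > 1) %>%
  mutate(valentine = 1) %>%
  select(-n)

christmas_episodes <- theoffice %>%
  filter(str_detect(text, regex("christmas", ignore_case = TRUE))) %>%
  count(episode_name) %>%
  filter(n > 1) %>%
  mutate(christmas = 1) %>%
  select(-n)
```

Each result holds only the flagged episodes with a value of 1, so episodes without a match come out of a left join as NA and can then be set to 0 with replace_na().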
Exercise 3 - Modify the following code to also create a new indicator variable called michael, which takes the value 1 if Michael Scott (Steve Carell) was there, and 0 if not. Note: Michael Scott (Steve Carell) left the show at the end of Season 7. (make sure to remove eval: true)
office_df <- theoffice %>%
  select(season, episode, episode_name, imdb_rating, total_votes, air_date) %>%
  distinct(season, episode, .keep_all = TRUE) %>%
  left_join(halloween_episodes, by = "episode_name") %>%
  left_join(valentine_episodes, by = "episode_name") %>%
  left_join(christmas_episodes, by = "episode_name") %>%
  replace_na(list(halloween = 0, valentine = 0, christmas = 0)) %>%
  mutate(michael = if_else(season > 7, 0, 1)) %>%
  ### add the new variable here
  left_join(office_lines, by = c("season", "episode", "episode_name"))
Exercise 4 - Split the data into training (75%) and testing (25%).
set.seed(1122)
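The split itself could look like the sketch below, which uses initial_split() from rsample (loaded with tidymodels). So that the block runs on its own, it splits a stand-in episode-level data frame called episodes rather than office_df; the names episodes, office_split, office_train, and office_test are assumptions for illustration.

```r
library(tidyverse)
library(tidymodels)
library(schrute)

# Stand-in for office_df: one row per episode (assumption for illustration).
episodes <- theoffice %>%
  distinct(season, episode, .keep_all = TRUE) %>%
  select(season, episode, episode_name, imdb_rating, total_votes)

# Set the seed first so the random split is reproducible.
set.seed(1122)
office_split <- initial_split(episodes, prop = 0.75)

office_train <- training(office_split)  # about 75% of the episodes
office_test  <- testing(office_split)   # the remaining episodes

dim(office_train)
dim(office_test)
```

With the real exercise data, office_df would take the place of episodes in the call to initial_split().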