AE 04: NYC flights + data wrangling
library(tidyverse)
library(nycflights13)
Exercise 1
Your turn: Fill in the blanks:
The flights data frame has ___ rows. Each row represents a ___.
Exercise 2
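A minimal sketch of how a data frame's size and column names can be inspected. It uses a small made-up tibble (toy, with invented values) rather than flights, and loads only dplyr instead of the full tidyverse to stay self-contained.

```r
library(dplyr)

# Hypothetical toy data; the values are invented for illustration
toy <- tibble(
  tailnum   = c("N001", "N002", "N003"),
  dep_delay = c(-2, 15, 0)
)

n_rows    <- nrow(toy)   # number of rows (here 3)
col_names <- names(toy)  # character vector of column names

glimpse(toy)             # compact overview: dimensions, names, types
```

The same calls should work unchanged on a larger data frame such as flights.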
Your turn: What are the names of the variables in flights?
# add code here
Exercise 3 - select()
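Before working through the prompts, here is a sketch of the four select() patterns this exercise relies on. It runs on a small made-up tibble (toy is a hypothetical name with invented values), so it illustrates the syntax without answering the prompts for flights.

```r
library(dplyr)

# Hypothetical toy data; column names mimic flights, values are invented
toy <- tibble(
  year      = c(2013, 2013),
  dep_time  = c(800, 832),
  dep_delay = c(-10, -8),
  arr_time  = c(949, 1006),
  arr_delay = c(-6, -24)
)

keep_two  <- toy |> select(dep_delay, arr_delay)  # only the named columns
drop_one  <- toy |> select(-dep_delay)            # everything except dep_delay
dep_range <- toy |> select(year:dep_delay)        # contiguous range, inclusive
arr_cols  <- toy |> select(contains("arr_"))      # names containing "arr_"
```

Note that year:dep_delay depends on column order in the data frame, while contains() depends only on the column names.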
- Demo: Make a data frame that only contains the variables dep_delay and arr_delay.
# add code here
- Demo: Make a data frame that keeps every variable except dep_delay.
# add code here
- Demo: Make a data frame that includes all variables from year through dep_delay (inclusive). These are all variables that provide information about the departure of each flight.
# add code here
- Demo: Use the select helper contains() to make a data frame that includes the variables associated with the arrival, i.e., those whose names contain the string "arr_".
# add code here
Exercise 4 - slice()
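As a warm-up for the prompts in this exercise, the sketch below shows two ways to pick rows by position. The tibble toy and its values are made up for illustration.

```r
library(dplyr)

# Hypothetical toy data with eight rows
toy <- tibble(
  id        = 1:8,
  dep_delay = c(5, -3, 12, 0, 40, -1, 7, 2)
)

first_five <- toy |> slice(1:5)            # rows 1 to 5
last_two   <- toy |> slice((n() - 1):n())  # n() is the number of rows

# Equivalent helpers that avoid computing positions by hand
first_five_alt <- toy |> slice_head(n = 5)
last_two_alt   <- toy |> slice_tail(n = 2)
```

slice_head() and slice_tail() tend to read more clearly when the positions are relative to the start or end of the data frame.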
- Demo: Display the first five rows of the flights data frame.
# add code here
- Demo: Display the last two rows of the flights data frame.
# add code here
Exercise 5 - arrange()
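The sketch below shows ascending and descending sorts with arrange() on a small made-up tibble (toy, with invented tail numbers and delays). It also includes a missing value to show where NA ends up.

```r
library(dplyr)

# Hypothetical toy data; tail numbers and delays are invented
toy <- tibble(
  tailnum   = c("N001", "N002", "N003", "N004"),
  dep_delay = c(15, -4, 120, NA)
)

shortest_first <- toy |> arrange(dep_delay)        # smallest value on top
longest_first  <- toy |> arrange(desc(dep_delay))  # largest value on top

# In both results the row with NA is sorted to the end
```

Because arrange() places missing values last in either direction, the top row of a descending sort is the largest observed value, not NA.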
- Demo: Let’s arrange the data by departure delay, so the flights with the shortest departure delays will be at the top of the data frame.
# add code here
- Question: What does it mean for dep_delay to have a negative value?
Add your response here.
- Demo: Arrange the data by descending departure delay, so the flights with the longest departure delays will be at the top.
# add code here
- Your turn: Create a data frame that only includes the plane tail number (tailnum), carrier (carrier), and departure delay for the flight with the longest departure delay. What is the plane tail number (tailnum) for this flight?
# add code here
Exercise 6 - filter()
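A short sketch of how filter() combines conditions, using a made-up tibble (toy, with invented values). Conditions separated by commas are combined with AND, and rows where a condition evaluates to NA are dropped.

```r
library(dplyr)

# Hypothetical toy data; values are invented
toy <- tibble(
  dest      = c("RDU", "RDU", "GSO", "BOS", "RDU"),
  arr_delay = c(-5, 10, -2, -8, NA)
)

to_rdu    <- toy |> filter(dest == "RDU")                 # 3 rows
rdu_early <- toy |> filter(dest == "RDU", arr_delay < 0)  # 1 row

# A comma between conditions behaves like &
rdu_early_amp <- toy |> filter(dest == "RDU" & arr_delay < 0)
```

The fifth row goes to RDU but has a missing arr_delay, so arr_delay < 0 is NA for it and filter() leaves it out.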
- Demo: Filter for all rows where the destination airport is RDU.
# add code here
- Demo: Filter for all rows where the destination airport is RDU and the arrival delay is less than 0.
# add code here
- Your turn: Describe what the code is doing in words.
Add response here.
flights |>
  filter(
    dest %in% c("RDU", "GSO"),
    arr_delay < 0 | dep_delay < 0
  )
# A tibble: 6,203 × 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     1      800            810       -10      949            955
 2  2013     1     1      832            840        -8     1006           1030
 3  2013     1     1      851            851         0     1032           1036
 4  2013     1     1      917            920        -3     1052           1108
 5  2013     1     1     1024           1030        -6     1204           1215
 6  2013     1     1     1127           1129        -2     1303           1309
 7  2013     1     1     1157           1205        -8     1342           1345
 8  2013     1     1     1317           1325        -8     1454           1505
 9  2013     1     1     1449           1450        -1     1651           1640
10  2013     1     1     1505           1510        -5     1654           1655
# ℹ 6,193 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>
Hint: Logical operators in R:
| operator | definition |
|---|---|
| < | is less than? |
| <= | is less than or equal to? |
| > | is greater than? |
| >= | is greater than or equal to? |
| == | is exactly equal to? |
| != | is not equal to? |
| x & y | is x AND y? |
| x \| y | is x OR y? |
| is.na(x) | is x NA? |
| !is.na(x) | is x not NA? |
| x %in% y | is x in y? |
| !(x %in% y) | is x not in y? |
| !x | is not x? (only makes sense if x is TRUE or FALSE) |
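To make the table concrete, the sketch below applies several of these operators to a short made-up vector x in base R. The variable names are illustrative only.

```r
# Hypothetical vector with a negative value, a zero, a positive value, and NA
x <- c(-10, 0, 5, NA)

is_negative <- x < 0               # TRUE FALSE FALSE NA
is_zero     <- x == 0              # FALSE TRUE FALSE NA
is_missing  <- is.na(x)            # FALSE FALSE FALSE TRUE
in_set      <- x %in% c(0, 5)      # FALSE TRUE TRUE FALSE
nonzero     <- (x < 0) | (x > 0)   # TRUE FALSE TRUE NA
```

Comparisons with NA return NA, while is.na() and %in% always return TRUE or FALSE. This is why is.na(x) is used to test for missing values rather than x == NA.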
Exercise 7 - count()
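A brief sketch of count() on a made-up tibble (toy, with invented values), showing a one-variable frequency table, sorting by frequency, and counting by two variables at once.

```r
library(dplyr)

# Hypothetical toy data; values are invented
toy <- tibble(
  month = c(1, 1, 2, 2, 2, 3),
  day   = c(1, 1, 1, 2, 2, 5),
  dest  = c("RDU", "BOS", "RDU", "RDU", "ORD", "BOS")
)

by_dest  <- toy |> count(dest)                    # one row per dest, counts in n
by_month <- toy |> count(month, sort = TRUE)      # most frequent month first
by_date  <- toy |> count(month, day, sort = TRUE) # one row per month-day pair
```

sort = TRUE puts the largest count first. To see the smallest count first, one option is to pipe the result into arrange(n).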
- Demo: Create a frequency table of the destination locations for flights from New York.
# add code here
- Demo: In which month were there the fewest flights? How many flights were there in that month?
# add code here
- Your turn: On which date (month + day) was there the largest number of flights? How many flights were there on that day?
# add code here
Exercise 8 - mutate()
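The sketch below shows two common uses of mutate(): deriving a new column from existing ones, and turning counts into proportions. Both tibbles (toy and counts) are made up, with values chosen so the arithmetic is easy to check by hand.

```r
library(dplyr)

# Hypothetical toy data: minutes in the air and miles flown
toy <- tibble(
  air_time = c(60, 90, 120),
  distance = c(400, 600, 1000)
)

# Later expressions in one mutate() call can use columns created earlier in it
with_speed <- toy |>
  mutate(
    hours = air_time / 60,   # 1, 1.5, 2
    mph   = distance / hours # 400, 400, 500
  )

# Hypothetical counts, such as those produced by count()
counts <- tibble(month = c(1, 2, 3), n = c(20, 30, 50))

with_prop <- counts |>
  mutate(prop = n / sum(n))  # 0.2, 0.3, 0.5
```

Dividing by sum(n) works here because sum() is computed over the whole column, so the proportions add up to 1.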
- Demo: Convert air_time (minutes in the air) to hours and then create a new variable, mph, the miles per hour of the flight.
# add code here
- Your turn: First, count the number of flights each month, and then calculate the proportion of flights in each month. What proportion of flights take place in July?
# add code here
- Demo: Create a new variable, rdu_bound, which indicates whether the flight is to RDU or not. Then, for each departure airport (origin), calculate what proportion of flights originating from that airport are to RDU.
# add code here
Exercise 9 - summarize()
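One thing worth checking before computing a mean is how missing values are handled. The sketch below uses a made-up tibble (toy, with invented delays) to show summarize() with and without na.rm = TRUE.

```r
library(dplyr)

# Hypothetical toy data with one missing arrival delay
toy <- tibble(arr_delay = c(-10, 5, NA, 20))

# Without na.rm, a single NA makes the mean NA
mean_with_na <- toy |>
  summarize(mean_arr_delay = mean(arr_delay))

# With na.rm = TRUE, the mean is taken over the three observed values: 5
mean_no_na <- toy |>
  summarize(mean_arr_delay = mean(arr_delay, na.rm = TRUE))
```

summarize() collapses the data frame to a single row here, with one column per summary statistic requested.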
- Demo: Find the mean arrival delay for all flights.
# add code here
Exercise 10 - group_by()
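A sketch of the group_by() plus summarize() pattern on a made-up tibble (toy, with invented delays): the summary is computed once per group instead of once for the whole data frame.

```r
library(dplyr)

# Hypothetical toy data; delays are invented
toy <- tibble(
  origin    = c("EWR", "EWR", "JFK", "JFK", "LGA"),
  dep_delay = c(10, NA, -2, 4, 0)
)

by_origin <- toy |>
  group_by(origin) |>
  summarize(
    n_flights        = n(),                            # rows per group
    median_dep_delay = median(dep_delay, na.rm = TRUE) # ignores the NA
  )
# One row per origin: EWR (2 rows, median 10), JFK (2, 1), LGA (1, 0)
```

n() counts every row in the group, including the one with a missing delay, so the row count and the number of values behind the median can differ.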
- Demo: Find the mean arrival delay for each month.
# add code here
- Your turn: What is the median departure delay for each of the airports around NYC (origin)? Which airport has the shortest median departure delay?
# add code here
Additional Practice
Try these on your own, either in class if you finish early, or after class.
- Create a new dataset that only contains flights that do not have a missing departure time. Include the columns year, month, day, dep_time, dep_delay, and dep_delay_hours (the departure delay in hours). Hint: Note you may need to use mutate() to make one or more of these variables.
# add code here
- For each airplane (uniquely identified by tailnum), use a group_by() paired with summarize() to find the sample size, mean, and standard deviation of flight distances. Then include only the top 5 and bottom 5 airplanes in terms of mean distance traveled per flight in the final data frame.
# add code here