Visualizing various types of data

Tyler George

Cornell College
DSC 223 - Spring 2024 Block 7

Announcements

Questions from last time?

FAQs

  • Is there any code in the videos that is not in the readings? Yes and no. No substantial functionality is introduced in the videos that is not also in the readings; however, the examples in the videos differ from those in the readings.

  • What are all of the geoms we need to know? You don’t need to “memorize” or even “know” all of the geoms available in the ggplot2 package, but you can find a list of them on the ggplot2 cheat sheet or on the reference page.

  • Could you please clarify what situations it would be appropriate to use each geom function? Today’s topic! Think of it as “which plot should I make for which type of variable?”

From last time

library(tidyverse)
library(palmerpenguins)
library(ggthemes)

Violin plots

ggplot(
  penguins,
  aes(
    x = species,
    y = body_mass_g
    )
  ) +
  geom_violin()
Warning: Removed 2 rows containing non-finite outside the scale range
(`stat_ydensity()`).
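
The warning above appears because two penguins have no recorded body mass, so ggplot2 drops those rows before drawing. A minimal sketch of how to check this and remove the rows explicitly, assuming the same penguins data; the object names n_missing, penguins_complete, and p_violin are made up for illustration:

```r
library(tidyverse)
library(palmerpenguins)

# Count the rows where body mass is missing
n_missing <- sum(is.na(penguins$body_mass_g))
n_missing

# Drop those rows up front so the plot has nothing left to remove
penguins_complete <- drop_na(penguins, body_mass_g)

p_violin <- ggplot(
  penguins_complete,
  aes(
    x = species,
    y = body_mass_g
    )
  ) +
  geom_violin()
p_violin
```

The plot looks the same either way; dropping the rows yourself just makes the choice visible in the code.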

Multiple geoms

ggplot(
  penguins,
  aes(
    x = species,
    y = body_mass_g
    )
  ) +
  geom_violin() +
  geom_point()

Multiple geoms

ggplot(
  penguins,
  aes(
    x = species,
    y = body_mass_g
    )
  ) +
  geom_violin() +
  geom_jitter()

Multiple geoms + aesthetics

ggplot(
  penguins,
  aes(
    x = species,
    y = body_mass_g,
    color = species
    )
  ) +
  geom_violin() +
  geom_jitter()

Multiple geoms + aesthetics

ggplot(
  penguins,
  aes(
    x = species,
    y = body_mass_g,
    color = species
    )
  ) +
  geom_violin() +
  geom_jitter() +
  theme(
    legend.position = "none"
  )

Multiple geoms + aesthetics

ggplot(
  penguins,
  aes(
    x = species,
    y = body_mass_g,
    color = species
    )
  ) +
  geom_violin() +
  geom_jitter() +
  theme(
    legend.position = "none"
  ) +
  scale_color_colorblind()

From last time

ae-02-bechdel-dataviz

If you followed along with the application exercise…

Go to the project navigator in RStudio (top right corner of your RStudio window) and open the project called ae. If there are any uncommitted files, commit them so you can start with a clean slate.

If you didn’t clone the repo:

Go to the course GitHub org and find your ae-02 repo (repo name will be suffixed with your GitHub name).

Recap of this AE

  • Construct plots with ggplot().
  • Layers of ggplots are separated by +s.
  • The formula is (almost) always as follows:
ggplot(DATA, aes(x = X-VAR, y = Y-VAR, ...)) +
  geom_XXX()
  • Aesthetic attributes of geometries (color, size, transparency, etc.) can be mapped to variables in the data or set by the user, e.g. color = binary vs. color = "pink".
  • Use facet_wrap() when faceting (creating small multiples) by one variable and facet_grid() when faceting by two variables.
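
The last two points can be sketched as follows. This is an illustration only: it uses the penguins data rather than the data from the AE, and the object names (p_mapped, p_set, p_wrap, p_grid) are made up for this example.

```r
library(tidyverse)
library(palmerpenguins)

# Mapped: color depends on a variable, so it goes inside aes()
p_mapped <- ggplot(
  penguins,
  aes(
    x = flipper_length_mm,
    y = body_mass_g,
    color = species
    )
  ) +
  geom_point()

# Set: color is one fixed value, so it goes in the geom, outside aes()
p_set <- ggplot(
  penguins,
  aes(
    x = flipper_length_mm,
    y = body_mass_g
    )
  ) +
  geom_point(color = "pink")

# Faceting by one variable: one panel per species
p_wrap <- p_set +
  facet_wrap(~species)

# Faceting by two variables: islands in rows, species in columns
p_grid <- p_set +
  facet_grid(island ~ species)
```

Print any of the four objects (e.g. p_grid) to draw the plot.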

Visualizing various types of data

Identifying variable types

Identify the type of each of the following variables.

  • Favorite food
  • Number of classes you’re taking this block
  • Zip code
  • Age

The way data is displayed matters

What do these three plots show?

Visualizing penguins

library(tidyverse)
library(palmerpenguins)
library(ggthemes)

penguins
# A tibble: 344 × 8
   species island   bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex    year
   <fct>   <fct>             <dbl>         <dbl>             <int>       <int> <fct> <int>
 1 Adelie  Torgers…           39.1          18.7               181        3750 male   2007
 2 Adelie  Torgers…           39.5          17.4               186        3800 fema…  2007
 3 Adelie  Torgers…           40.3          18                 195        3250 fema…  2007
 4 Adelie  Torgers…           NA            NA                  NA          NA <NA>   2007
 5 Adelie  Torgers…           36.7          19.3               193        3450 fema…  2007
 6 Adelie  Torgers…           39.3          20.6               190        3650 male   2007
 7 Adelie  Torgers…           38.9          17.8               181        3625 fema…  2007
 8 Adelie  Torgers…           39.2          19.6               195        4675 male   2007
 9 Adelie  Torgers…           34.1          18.1               193        3475 <NA>   2007
10 Adelie  Torgers…           42            20.2               190        4250 <NA>   2007
# ℹ 334 more rows

Univariate analysis

Analyzing a single variable:

  • Numerical: histogram, box plot, density plot, etc.

  • Categorical: bar plot, pie chart, etc.
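
The steps that follow build plots for a numerical variable. For a categorical variable, a bar plot might be sketched like this, assuming the same penguins data and using species as the example variable; the name p_bar is made up for illustration:

```r
library(tidyverse)
library(palmerpenguins)

# geom_bar() counts the rows in each category, so only x is needed
p_bar <- ggplot(
  penguins,
  aes(x = species)
  ) +
  geom_bar() +
  labs(
    x = "Species",
    y = "Count"
  )
p_bar
```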

Histogram - Step 1

ggplot(
  penguins
  )

Histogram - Step 2

ggplot(
  penguins,
  aes(x = body_mass_g)
  )

Histogram - Step 3

ggplot(
  penguins,
  aes(x = body_mass_g)
  ) +
  geom_histogram()
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Warning: Removed 2 rows containing non-finite outside the scale range (`stat_bin()`).

Histogram - Step 4

ggplot(
  penguins,
  aes(x = body_mass_g)
  ) +
  geom_histogram(
    binwidth = 250
  )

Histogram - Step 5

ggplot(
  penguins,
  aes(x = body_mass_g)
  ) +
  geom_histogram(
    binwidth = 250
  ) +
  labs(
    title = "Weights of penguins",
    x = "Weight (grams)",
    y = "Count"
  )

Boxplot - Step 1

ggplot(
  penguins
  )

Boxplot - Step 2

ggplot(
  penguins,
  aes(x = body_mass_g)
  )

Boxplot - Step 3

ggplot(
  penguins,
  aes(y = body_mass_g)
  ) +
  geom_boxplot()

Boxplot - Step 4

ggplot(
  penguins,
  aes(x = body_mass_g)
  ) +
  geom_boxplot()

Boxplot - Step 5

ggplot(
  penguins,
  aes(x = body_mass_g)
  ) +
  geom_boxplot() +
  labs(
    x = "Weight (grams)",
    y = NULL
  )

Density plot - Step 1

ggplot(
  penguins
  )

Density plot - Step 2

ggplot(
  penguins,
  aes(x = body_mass_g)
  )

Density plot - Step 3

ggplot(
  penguins,
  aes(x = body_mass_g)
  ) +
  geom_density()

Density plot - Step 4

ggplot(
  penguins,
  aes(x = body_mass_g)
  ) +
  geom_density(
    fill = "darkslategray1"
  )

Density plot - Step 5

ggplot(
  penguins,
  aes(x = body_mass_g)
  ) +
  geom_density(
    fill = "darkslategray1",
    linewidth = 2
  )

Density plot - Step 6

ggplot(
  penguins,
  aes(x = body_mass_g)
  ) +
  geom_density(
    fill = "darkslategray1",
    linewidth = 2,
    color = "darkorchid3"
  )

Density plot - Step 7

ggplot(
  penguins,
  aes(x = body_mass_g)
  ) +
  geom_density(
    fill = "darkslategray1",
    linewidth = 2,
    color = "darkorchid3",
    alpha = 0.5
  )

Weights of penguins

TRUE / FALSE

  • The distribution of penguin weights in this sample is left skewed.
  • The distribution of penguin weights in this sample is unimodal.

Bivariate analysis

Analyzing the relationship between two variables:

  • Numerical + numerical: scatterplot

  • Numerical + categorical: side-by-side box plots, violin plots, etc.

  • Categorical + categorical: stacked bar plots

  • Using an aesthetic (e.g., fill, color, shape, etc.) or facets to represent the second variable in any plot
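
The slides that follow cover the numerical + categorical case. The other two combinations could be sketched as below, assuming the same penguins data; the variable pairings and the names p_scatter and p_stacked are chosen only for illustration.

```r
library(tidyverse)
library(palmerpenguins)

# Numerical + numerical: scatterplot
p_scatter <- ggplot(
  penguins,
  aes(
    x = flipper_length_mm,
    y = body_mass_g
    )
  ) +
  geom_point()

# Categorical + categorical: stacked bar plot
# (fill splits each island's bar by species; bars stack by default)
p_stacked <- ggplot(
  penguins,
  aes(
    x = island,
    fill = species
    )
  ) +
  geom_bar()
```

Print p_scatter or p_stacked to draw each plot.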

Side-by-side box plots

ggplot(
  penguins,
  aes(
    x = body_mass_g,
    y = species
    )
  ) +
  geom_boxplot()

Density plots

ggplot(
  penguins,
  aes(
    x = body_mass_g,
    color = species
    )
  ) +
  geom_density()

Density plots

ggplot(
  penguins,
  aes(
    x = body_mass_g,
    color = species,
    fill = species
    )
  ) +
  geom_density()

Density plots

ggplot(
  penguins,
  aes(
    x = body_mass_g,
    color = species,
    fill = species
    )
  ) +
  geom_density(
    alpha = 0.5
  )

Density plots

ggplot(
  penguins,
  aes(
    x = body_mass_g,
    color = species,
    fill = species
    )
  ) +
  geom_density(
    alpha = 0.5
  ) +
  theme(
    legend.position = "bottom"
  )