library(tidyverse) # for data wrangling and visualization
library(scales) # for pretty axis breaksAE 06: Joining country populations with continents
Suggested answers
These are suggested answers. This document should be used as reference only, it’s not designed to be an exhaustive key.
Goal
Our ultimate goal in this application exercise is to create a bar plot of total populations of continents, where the input data are:
- Countries and populations
- Countries and continents
Data
Countries and populations
These data come from The World Bank and reflect population counts as of 2022.
population <- read_csv("https://sta199-s24.github.io/data/world-pop-2022.csv")Let’s take a look at the data.
population# A tibble: 217 × 3
country year population
<chr> <dbl> <dbl>
1 Afghanistan 2022 41129.
2 Albania 2022 2778.
3 Algeria 2022 44903.
4 American Samoa 2022 44.3
5 Andorra 2022 79.8
6 Angola 2022 35589.
7 Antigua and Barbuda 2022 93.8
8 Argentina 2022 46235.
9 Armenia 2022 2780.
10 Aruba 2022 106.
# ℹ 207 more rows
Continents
These data come from Our World in Data.
continents <- read_csv("https://sta199-s24.github.io/data/continents.csv")Let’s take a look at the data.
continents# A tibble: 285 × 4
entity code year continent
<chr> <chr> <dbl> <chr>
1 Abkhazia OWID_ABK 2015 Asia
2 Afghanistan AFG 2015 Asia
3 Akrotiri and Dhekelia OWID_AKD 2015 Asia
4 Aland Islands ALA 2015 Europe
5 Albania ALB 2015 Europe
6 Algeria DZA 2015 Africa
7 American Samoa ASM 2015 Oceania
8 Andorra AND 2015 Europe
9 Angola AGO 2015 Africa
10 Anguilla AIA 2015 North America
# ℹ 275 more rows
Exercises
Think out loud:
- Which variable(s) will we use to join the
populationandcontinentsdata frames?
countryfrompopulationandentityfromcontinents- We want to create a new data frame that keeps all rows and columns from
populationand brings in the corresponding information fromcontinents. Which join function should we use?
left_join()withpopulationon the left.- Which variable(s) will we use to join the
Demo: Join the two data frames and name assign the joined data frame to a new data frame
population_continents.
population_continent <- population |>
left_join(continents, by = join_by(country == entity))- Demo: Take a look at the newly created
population_continentdata frame. There are some countries that were not incontinents. First, identify which countries these are (they will haveNAvalues forcontinent).
population_continent |>
filter(is.na(continent))# A tibble: 6 × 6
country year.x population code year.y continent
<chr> <dbl> <dbl> <chr> <dbl> <chr>
1 Congo, Dem. Rep. 2022 99010. <NA> NA <NA>
2 Congo, Rep. 2022 5970. <NA> NA <NA>
3 Hong Kong SAR, China 2022 7346. <NA> NA <NA>
4 Korea, Dem. People's Rep. 2022 26069. <NA> NA <NA>
5 Korea, Rep. 2022 51628. <NA> NA <NA>
6 Kyrgyz Republic 2022 6975. <NA> NA <NA>
- Demo: All of these countries are actually in the
continentsdata frame, but under different names. So, let’s clean that data first by updating the country names in thepopulationdata frame in a way they will match thecontinentsdata frame, and then joining them, using acase_when()statement inmutate(). At the end, check that all countries now have continent information.
population_continent <- population |>
mutate(country = case_when(
country == "Congo, Dem. Rep." ~ "Democratic Republic of Congo",
country == "Congo, Rep." ~ "Congo",
country == "Hong Kong SAR, China" ~ "Hong Kong",
country == "Korea, Dem. People's Rep." ~ "North Korea",
country == "Korea, Rep." ~ "South Korea",
country == "Kyrgyz Republic" ~ "Kyrgyzstan",
.default = country
)
) |>
left_join(continents, by = join_by(country == entity))
population_continent |>
filter(is.na(continent))# A tibble: 0 × 6
# ℹ 6 variables: country <chr>, year.x <dbl>, population <dbl>, code <chr>,
# year.y <dbl>, continent <chr>
- Think out loud: Which continent do you think has the highest population? Which do you think has the second highest? The lowest?
Add your response here.
- Demo: Create a new data frame called
population_summarythat contains a row for each continent and a column for the total population for that continent, in descending order of population. Note that the function for calculating totals in R issum().
population_summary <- population_continent |>
group_by(continent) |>
summarize(total_pop = sum(population)) |>
arrange(desc(total_pop))- Your turn: Make a bar plot with total population on the y-axis and continent on the x-axis, where the height of each bar represents the total population in that continent.
ggplot(population_summary, aes(x = continent, y = total_pop)) +
geom_col()
- Your turn: Recreate the following plot, which is commonly referred to as a lollipop plot. Hint: Start with the points, then try adding the
segments, then add axis labels andcaption, and finally, as a stretch goal, update the x scale (which will require a function we haven’t introduced in lectures or labs yet!).

ggplot(population_summary, aes()) +
geom_point(aes(x = total_pop, y = continent)) +
geom_segment(aes(y = continent, yend = continent, x = 0, xend = total_pop)) +
scale_x_continuous(labels = label_number(scale = 1/1000000, suffix = " bil")) +
theme_minimal() +
labs(
x = "Total population",
y = "Continent",
title = "World population",
subtitle = "As of 2022",
caption = "Data sources: The World Bank and Our World in Data"
)
- Think out loud: What additional improvements would you like to make to this plot.
Answers may vary. Ordering the continents in decreasing order of population.