Merging Data Frames

Sarah Cassie Burnett

September 25, 2025

What is a Join

Horizontal Join (Merge)

Often we have data from two different sources
Results in two data frames
How can we combine them into one?

Illustration

Source: R for HR

Types of Joins in `dplyr`

Mutating versus filtering joins
Four types of mutating joins
- inner_join()
- full_join()
- left_join()
- right_join()
For the most part we will use left_join()

`inner_join()`

Source: R for HR

`full_join()`

Source: R for HR

`left_join()`

Source: R for HR

`right_join()`

Source: R for HR

Exercise

Scenario 1: NOAA weather station data

Load Libraries and data

library(readr)
library(dplyr)
library(tidyr)
library(ggplot2)

# NOAA GHCN Monthly (local subset for speed)
# https://www.ncei.noaa.gov/products/land-based-station/global-historical-climatology-network-monthly
temps <- read_csv("temps.csv", show_col_types = FALSE) # or temps_small.csv if temps.csv is too big
# show_col_types = FALSE doesn't do much, just supresses the message read out
# This is good for the including code in the quarto documents


# Station metadata
# https://www.philchodrow.prof/
stations <- read_csv("https://raw.githubusercontent.com/PhilChodrow/PIC16B/master/datasets/noaa-ghcn/station-metadata.csv",
                     show_col_types = FALSE)

# FIPS 10-4 to ISO + Names
# https://www.mysociety.org/
countries <- read_csv("https://raw.githubusercontent.com/mysociety/gaze/master/data/fips-10-4-to-iso-country-codes.csv",
                      show_col_types = FALSE)

03:00

Check out the data

glimpse(temps)
dim(temps)          # rows, columns
colnames(temps)     # column names
head(temps)         # preview

colnames(temps)
colnames(stations)
colnames(countries)

temps |> arrange(Year)
temps |> arrange(desc(Year))

Try it out!

Write the wrangling code to get the temperatures at each person’s birth month and year.
Try adding the names of the month as labels instead of numbers.

After you finish, take a look at the column names and values to see if there are any variables you could potential use as a key.

03:00

Remove duplicates from country dataset

# Count duplicates by FIPS and filter to see them
countries |> add_count(`FIPS 10-4`) |> filter(n > 1)

# Deduplicate by FIPS code (keep first in the tibble)
countries <- countries |> distinct(`FIPS 10-4`, .keep_all = TRUE)

Join Plan

1. Keys

Do they match between tables?
- temps.ID and stations.ID
- substr(ID, 1, 2) and countries."FIPS 10-4"

2. Join Type

What rows do we want to keep?
- Keep all rows from temps
- Use left joins

Join: temps + stations

temps_st <- temps %>%
  left_join(stations, by = "ID")

Create FIPS from ID

substr("AFM00040911", 1, 2)
substr("AFM00040911", 1, 2) == "AF"

temps_st %>%
  mutate(`FIPS 10-4` = substr(ID, 1, 2))

Join: add countries

temps_cty <- temps_st %>%
  mutate(`FIPS 10-4` = substr(ID, 1, 2)) %>%
  left_join(countries, by = "FIPS 10-4")

Clean data from missing values and reorder

mytemps <- temps_cty %>%
  mutate(
    Temp = na_if(Temp, -9999)   # -9999 -> NA (hundredths of °C)
  ) %>%
  arrange(desc(Year))

Plot: Temperature Over Time

# sample 1% for interactive/plots
mytemps_small <- mytemps %>% sample_frac(0.01)

p <- ggplot(mytemps_small, aes(x = Year, y = Temp)) +
  geom_line(color = "chartreuse4") +
  labs(
    title = "Temperature over time",
    x = "Date",
    y = "Temperature (hundredths of °C)"
  ) +
  theme_minimal()

p

Save Plot

ggsave("temperature_over_time.png", plot = p, width = 8, height = 4, dpi = 300)

Temperature over time across the world

Scenario 2: Merge World Bank and V-Dem

We want to merge two data frames
One is from the World Bank
The other is from V-Dem
How do we do it?

Grab Some WB Data

# Load packages
library(wbstats) 
library(dplyr) 
library(janitor) 

# Store the list of indicators in an object
indicators <- c("flfp" = "SL.TLF.CACT.FE.ZS", "women_rep" = "SG.GEN.PARL.ZS") 

# Download the data  
wb_dta <- wb_data(indicators, mrv = 25) |> # most recent 25 years
  select(!iso2c) |> 
  rename(year = date) |> 
  mutate(
    flfp = round_to_fraction(flfp, denominator = 100), # round to nearest 100th
    women_rep = round_to_fraction(women_rep, denominator = 100) 
  )

# View the data
glimpse(wb_dta)

Grab Some V-Dem Data

library(vdemlite)

vdem_dta <- fetchdem(indicators = c("v2x_gender", "v2x_gencl", "e_regionpol_6C") |>
                         start_year = 2000, end_year = 2020) |> # 20 year span
                 rename(
                   women_polemp = v2x_gender,
                   women_civlib = v2x_gencl,
                   region = e_regionpol_6C
                 ) |>
            mutate(
              region = case_match(region, 
                                  1 ~ "Eastern Europe", 
                                  2 ~ "Latin America",  
                                  3 ~ "Middle East",   
                                  4 ~ "Africa", 
                                  5 ~ "The West", 
                                  6 ~ "Asia")
              )

glimpse(vdem_dta)

Key Questions

What is the unit of analysis?
What is/are the corresponding identifier variables?
Are the identifier variables in common?
Or do they have to be added/transformed to match?

Merging WB and V-Dem Data

These are both time-series, country-level data
Need to merge by country-year
Year is easy
But there are many different country codes
Can use countrycode package to assign country codes

Use `countrycode`

# Load countrycode
library(countrycode)

# Create new iso3c variable
vdem_data <- vdem_data |>    
  mutate(iso3c = countrycode(sourcevar = country_id, # what we are converting
        origin = "vdem",         # we are converting from vdem
        destination = "wb"))  |> # and converting to the WB iso3c code 
  relocate(iso3c, .after = country_id) # move iso3c 

# View the data
glimpse(dem_data)

Try it Yourself

Using your democracy data frame from the last lesson
Use mutate() and countrycode() to add iso3c country codes
Use relocate to move your iso3c code to the “front” of your data frame (optional)

Use `left_join()` to Merge

# Perform left join using common iso3c variable and year
dem_wb <- left_join(vdem_dta, wb_dta, by = c("iso3c", "year")) |> #join
  select(-c(country_text_id, country_id, country)) # drop extra country variables
  
# View the data
glimpse(dem_wb)

Try it Yourself

Take your V-Dem data frame and your World Bank data frame
Using left_join() to merge on country code and year
Along the way, use rename() and select() to insure you have just one country name
Try inner_join(), full_join(), and right_join() as time allows

Summarize the Data

Do a group, summarize, arrange sequence on your merged data frame
Group and summarize by country (mean or median)
Try using across() to summarize multiple columns at once

Create a Scatter Plot

Now you have one data point per country
Use ggplot2 to create a scatter plot

Merging Data Frames

What is a Join

Horizontal Join (Merge)

Illustration

Types of Joins in dplyr

inner_join()

full_join()

left_join()

right_join()

Exercise

Scenario 1: NOAA weather station data

Load Libraries and data

Check out the data

Try it out!

Remove duplicates from country dataset

Join Plan

1. Keys

2. Join Type

Join: temps + stations

Create FIPS from ID

Join: add countries

Clean data from missing values and reorder

Plot: Temperature Over Time

Save Plot

Scenario 2: Merge World Bank and V-Dem

Grab Some WB Data

Grab Some V-Dem Data

Key Questions

Merging WB and V-Dem Data

Use countrycode

Try it Yourself

Use left_join() to Merge

Try it Yourself

Summarize the Data

Create a Scatter Plot

Types of Joins in `dplyr`

`inner_join()`

`full_join()`

`left_join()`

`right_join()`

Use `countrycode`

Use `left_join()` to Merge