Merging Data Frames

Sarah Cassie Burnett

September 25, 2025

What is a Join

Horizontal Join (Merge)


  • Often we have data from two different sources
  • Results in two data frames
  • How can we combine them into one?

Illustration

Source: R for HR

Types of Joins in dplyr

  • Mutating versus filtering joins
  • Four types of mutating joins
    • inner_join()
    • full_join()
    • left_join()
    • right_join()
  • For the most part we will use left_join()

inner_join()


Source: R for HR

full_join()


Source: R for HR

left_join()


Source: R for HR

right_join()


Source: R for HR

Exercise

Scenario 1: NOAA weather station data

Load Libraries and data

library(readr)
library(dplyr)
library(tidyr)
library(ggplot2)

# NOAA GHCN Monthly (local subset for speed)
# https://www.ncei.noaa.gov/products/land-based-station/global-historical-climatology-network-monthly
temps <- read_csv("temps.csv", show_col_types = FALSE) # or temps_small.csv if temps.csv is too big
# show_col_types = FALSE doesn't do much, just supresses the message read out
# This is good for the including code in the quarto documents


# Station metadata
# https://www.philchodrow.prof/
stations <- read_csv("https://raw.githubusercontent.com/PhilChodrow/PIC16B/master/datasets/noaa-ghcn/station-metadata.csv",
                     show_col_types = FALSE)

# FIPS 10-4 to ISO + Names
# https://www.mysociety.org/
countries <- read_csv("https://raw.githubusercontent.com/mysociety/gaze/master/data/fips-10-4-to-iso-country-codes.csv",
                      show_col_types = FALSE)
03:00

Check out the data


glimpse(temps)
dim(temps)          # rows, columns
colnames(temps)     # column names
head(temps)         # preview

colnames(temps)
colnames(stations)
colnames(countries)

temps |> arrange(Year)
temps |> arrange(desc(Year))

Try it out!


  1. Write the wrangling code to get the temperatures at each person’s birth month and year.

  2. Try adding the names of the month as labels instead of numbers.

After you finish, take a look at the column names and values to see if there are any variables you could potential use as a key.

03:00

Remove duplicates from country dataset

# Count duplicates by FIPS and filter to see them
countries |> add_count(`FIPS 10-4`) |> filter(n > 1)

# Deduplicate by FIPS code (keep first in the tibble)
countries <- countries |> distinct(`FIPS 10-4`, .keep_all = TRUE)

Join Plan

1. Keys

  • Do they match between tables?
    • temps.ID and stations.ID
    • substr(ID, 1, 2) and countries."FIPS 10-4"

2. Join Type

  • What rows do we want to keep?
    • Keep all rows from temps
    • Use left joins

Join: temps + stations


temps_st <- temps %>%
  left_join(stations, by = "ID")

Create FIPS from ID


substr("AFM00040911", 1, 2)
substr("AFM00040911", 1, 2) == "AF"

temps_st %>%
  mutate(`FIPS 10-4` = substr(ID, 1, 2)) 

Join: add countries


temps_cty <- temps_st %>%
  mutate(`FIPS 10-4` = substr(ID, 1, 2)) %>%
  left_join(countries, by = "FIPS 10-4")

Clean data from missing values and reorder


mytemps <- temps_cty %>%
  mutate(
    Temp = na_if(Temp, -9999)   # -9999 -> NA (hundredths of °C)
  ) %>%
  arrange(desc(Year))

Plot: Temperature Over Time

# sample 1% for interactive/plots
mytemps_small <- mytemps %>% sample_frac(0.01)

p <- ggplot(mytemps_small, aes(x = Year, y = Temp)) +
  geom_line(color = "chartreuse4") +
  labs(
    title = "Temperature over time",
    x = "Date",
    y = "Temperature (hundredths of °C)"
  ) +
  theme_minimal()

p

Save Plot

ggsave("temperature_over_time.png", plot = p, width = 8, height = 4, dpi = 300)

Temperature over time across the world

Scenario 2: Merge World Bank and V-Dem


  • We want to merge two data frames
  • One is from the World Bank
  • The other is from V-Dem
  • How do we do it?

Grab Some WB Data


# Load packages
library(wbstats) 
library(dplyr) 
library(janitor) 

# Store the list of indicators in an object
indicators <- c("flfp" = "SL.TLF.CACT.FE.ZS", "women_rep" = "SG.GEN.PARL.ZS") 

# Download the data  
wb_dta <- wb_data(indicators, mrv = 25) |> # most recent 25 years
  select(!iso2c) |> 
  rename(year = date) |> 
  mutate(
    flfp = round_to_fraction(flfp, denominator = 100), # round to nearest 100th
    women_rep = round_to_fraction(women_rep, denominator = 100) 
  )

# View the data
glimpse(wb_dta) 

Grab Some V-Dem Data


library(vdemlite)

vdem_dta <- fetchdem(indicators = c("v2x_gender", "v2x_gencl", "e_regionpol_6C") |>
                         start_year = 2000, end_year = 2020) |> # 20 year span
                 rename(
                   women_polemp = v2x_gender,
                   women_civlib = v2x_gencl,
                   region = e_regionpol_6C
                 ) |>
            mutate(
              region = case_match(region, 
                                  1 ~ "Eastern Europe", 
                                  2 ~ "Latin America",  
                                  3 ~ "Middle East",   
                                  4 ~ "Africa", 
                                  5 ~ "The West", 
                                  6 ~ "Asia")
              )

glimpse(vdem_dta)

Key Questions


  • What is the unit of analysis?
  • What is/are the corresponding identifier variables?
  • Are the identifier variables in common?
  • Or do they have to be added/transformed to match?

Merging WB and V-Dem Data


  • These are both time-series, country-level data
  • Need to merge by country-year
  • Year is easy
  • But there are many different country codes
  • Can use countrycode package to assign country codes

Use countrycode


# Load countrycode
library(countrycode)

# Create new iso3c variable
vdem_data <- vdem_data |>    
  mutate(iso3c = countrycode(sourcevar = country_id, # what we are converting
        origin = "vdem",         # we are converting from vdem
        destination = "wb"))  |> # and converting to the WB iso3c code 
  relocate(iso3c, .after = country_id) # move iso3c 

# View the data
glimpse(dem_data)

Try it Yourself


  • Using your democracy data frame from the last lesson
  • Use mutate() and countrycode() to add iso3c country codes
  • Use relocate to move your iso3c code to the “front” of your data frame (optional)

Use left_join() to Merge


# Perform left join using common iso3c variable and year
dem_wb <- left_join(vdem_dta, wb_dta, by = c("iso3c", "year")) |> #join
  select(-c(country_text_id, country_id, country)) # drop extra country variables
  
# View the data
glimpse(dem_wb)  

Try it Yourself

  • Take your V-Dem data frame and your World Bank data frame
  • Using left_join() to merge on country code and year
  • Along the way, use rename() and select() to insure you have just one country name
  • Try inner_join(), full_join(), and right_join() as time allows

Summarize the Data


  • Do a group, summarize, arrange sequence on your merged data frame
  • Group and summarize by country (mean or median)
  • Try using across() to summarize multiple columns at once

Create a Scatter Plot


  • Now you have one data point per country
  • Use ggplot2 to create a scatter plot