Working With Data

Sarah Cassie Burnett

September 18, 2025

Announcements

  • Online quiz posted at the end of class, due tomorrow.
  • Coding Assignment 1 is due by 11:59 pm tonight.
    • Re-submit the assignment on Gradescope if it wasn’t uploaded correctly.
  • Minor final project assignment due tonight.

Scan and fill out for participation credit

https://forms.gle/97zZmDjALU8GkJEfA

How Do We Get Tidy/Clean Data?


  • Get lucky and find it (for example, on Kaggle)
  • Wrangle it ourselves
  • Use a package where it has been wrangled for us
  • Download via an API

This Lesson

  • Practice with World Bank and V-Dem data
  • World Bank data through wbstats
    • There is another package called WDI
    • Both packages access data through the World Bank (WB) API
  • Varieties of Democracy (V-Dem) through vdemlite
    • There is also a package called vdemdata
    • vdemlite offers more functionality and works better in the cloud

filter(), select(), mutate()


Along the way we will practice some important dplyr verbs:


  • filter() is used to select observations based on their values
  • select() is used to select variables
  • mutate() is used to create new variables or modify existing ones


As well as some helpful functions from the janitor package.
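
A minimal sketch of the three verbs, assuming a small made-up tibble (the `students` data and its column names are invented for illustration):

```r
library(dplyr)
library(tibble)

# Hypothetical data, invented for this example
students <- tibble(
  name      = c("Alice", "Bob", "Charlie"),
  age       = c(25, 30, 35),
  height_cm = c(160, 170, 180)
)

# filter(): keep only the rows (observations) where the condition is TRUE
over_26 <- filter(students, age > 26)

# select(): keep only the named columns (variables)
names_and_ages <- select(students, name, age)

# mutate(): add a new column computed from an existing one
with_meters <- mutate(students, height_m = height_cm / 100)

over_26
names_and_ages
with_meters
```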

dplyr Basics

Before we discuss their individual differences, note what they have in common:

  • The first argument is always a data frame.
  • The subsequent arguments typically describe which columns to operate on (using variable names, without quotes).
  • The output is always a new data frame.
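
The three shared properties can be checked directly. This is a sketch on invented data (the `scores` tibble is hypothetical), not part of the course datasets:

```r
library(dplyr)
library(tibble)

# Hypothetical data, invented for this example
scores <- tibble(
  student = c("Ana", "Ben", "Cy"),
  score   = c(72, 88, 95)
)

# The first argument is the data frame; columns are named without quotes
high_scores <- filter(scores, score >= 80)

# The same call written with the pipe, which passes `scores` as the first argument
high_scores_piped <- scores |> filter(score >= 80)

identical(high_scores, high_scores_piped)  # both forms give the same result
is.data.frame(high_scores)                 # the output is a data frame
nrow(scores)                               # the input still has all 3 rows
```

Because the output is a new data frame, the result has to be assigned to a name (or piped onward) to be kept.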

Data Frames

What is a Data Frame?


  • Special kind of tabular data used in data science
  • Each column can be a different data type
  • Data frames are the most common data structure in R

What is a Tibble?


  • Modern data frames in R
  • Offers better printing and subsetting behaviors
  • Displays only the first 10 rows and as many columns as fit on screen
  • Column names are preserved exactly, even if they contain spaces
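
To see the point about column names, compare base `data.frame()` with `tibble()`. The column names and values below are invented for illustration; `clean_names()` is one of the janitor helpers and is shown here as one possible way to tidy such names:

```r
library(tibble)
library(janitor)

# Hypothetical column names and values, invented for this example
df  <- data.frame(`Life Expectancy` = c(78.5, 82.1), `Country Name` = c("A", "B"))
tbl <- tibble(`Life Expectancy` = c(78.5, 82.1), `Country Name` = c("A", "B"))

names(df)   # base data.frame() rewrites the names to be syntactic
names(tbl)  # tibble() keeps the names exactly as typed

# clean_names() converts names to snake_case, which is easier to type
tbl_clean <- clean_names(tbl)
names(tbl_clean)
```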

Creating a Tibble


  • When you read data into R with readr you automatically get a tibble
  • You can create a tibble using tibble() from the tibble package:
library(tibble)

# Create a tibble
my_tibble <- tibble(
  name = c("Alice", "Bob", "Charlie"),
  age = c(25, 30, 35),
  height = c(160, 170, 180),
  is_student = c(TRUE, FALSE, FALSE)
)

# Print the tibble
my_tibble
# A tibble: 3 × 4
  name      age height is_student
  <chr>   <dbl>  <dbl> <lgl>     
1 Alice      25    160 TRUE      
2 Bob        30    170 FALSE     
3 Charlie    35    180 FALSE     
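
Two other common routes to a tibble, sketched with made-up values. The example assumes readr 2.0 or later, where `I()` marks a string as literal data rather than a file path:

```r
library(readr)
library(tibble)

# Read CSV text directly; I() tells readr the string is data, not a path
csv_text <- "name,age\nAlice,25\nBob,30"
from_readr <- read_csv(I(csv_text), show_col_types = FALSE)

# Convert a base R data frame to a tibble
from_base <- as_tibble(data.frame(name = c("Alice", "Bob"), age = c(25, 30)))

is_tibble(from_readr)
is_tibble(from_base)
```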

Common Data Types

  • <chr> (Character): Stores text strings
    • Example: "hello", "R programming"
  • <dbl> (Double): Stores decimal (floating-point) numbers
    • Example: 3.14, -1.0
  • <int> (Integer): Stores whole numbers (integers)
    • Example: 1, -100, 42
  • <lgl> (Logical): Stores boolean values (TRUE, FALSE, NA)
    • Example: TRUE, FALSE, NA
  • <fct> (Factor): Stores categorical variables with fixed levels
    • Example: factor(c("low", "medium", "high"))
  • <date> (Date): Stores calendar dates, printed in the “YYYY-MM-DD” format
    • Example: as.Date("2024-09-05")
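
A small sketch that puts one column of each common type into a tibble, so the abbreviations can be matched to R classes. All values are invented for illustration:

```r
library(tibble)

# Hypothetical values, one column per type in the list above
types_demo <- tibble(
  chr_col  = c("hello", "R programming"),
  dbl_col  = c(3.14, -1.0),
  int_col  = c(1L, 42L),   # the L suffix makes an integer rather than a double
  lgl_col  = c(TRUE, NA),
  fct_col  = factor(c("low", "high"), levels = c("low", "medium", "high")),
  date_col = as.Date(c("2024-09-05", "2024-09-06"))
)

types_demo                 # the header shows <chr>, <dbl>, <int>, <lgl>, <fct>, <date>
sapply(types_demo, class)  # the corresponding R class of each column
```

Note that whole numbers typed without the `L` suffix, such as `42`, are stored as doubles.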

Other Data Types


  • <dttm> (Date-Time or POSIXct): Stores date-time objects (both date and time).
    • Example: as.POSIXct("2024-09-05 14:30:00")
  • <time> (Time): Stores time-of-day values (rarely seen without a date)
    • Example: hms::as_hms("14:30:00")
  • <list> (List): Stores lists, where each entry can be a complex object.
    • Example: list(c(1, 2, 3), c("a", "b", "c"))
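
A brief sketch of a date-time column and a list column side by side. The values are invented, and the time zone is set to UTC only so that the result does not depend on the local machine:

```r
library(tibble)

other_types <- tibble(
  when   = as.POSIXct(c("2024-09-05 14:30:00", "2024-09-06 09:00:00"), tz = "UTC"),
  values = list(c(1, 2, 3), c("a", "b", "c"))  # list column: each entry is its own vector
)

other_types                   # columns print as <dttm> and <list>
class(other_types$when)       # POSIXct is the class behind <dttm>
lengths(other_types$values)   # number of elements in each list entry
```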

Dates and Times with lubridate

  • lubridate is an R package that makes it easier to work with dates and times

  • Use convenient functions to store dates in different formats

library(lubridate)
  
# Store a date
my_date <- ymd("2024-09-05")
my_date2 <- mdy("09-05-2024")
my_date3 <- dmy("05-09-2024")
  
# Print in long form
format(my_date, "%B %d, %Y")
[1] "September 05, 2024"
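
Once a date is stored, lubridate can pull it apart or do arithmetic on it. A short sketch reusing the same date as above:

```r
library(lubridate)

my_date  <- ymd("2024-09-05")
my_date3 <- dmy("05-09-2024")

# Different input formats, same stored date
my_date == my_date3

# Pull out components
year(my_date)
month(my_date, label = TRUE)
wday(my_date, label = TRUE)

# Date arithmetic: 30 days later
later <- my_date + days(30)
later
```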

APIs



  • API stands for “Application Programming Interface”
  • Way for two computers to talk to each other
  • In our case, we will use APIs to download social science data

APIs in R

  • APIs are accessed through packages in R
  • Sometimes there can be more than one package for an API
  • Much easier than reading in data from a messy flat file!
  • We will use a few API packages in this course
    • World Bank data through wbstats (or WDI)
    • fredr for Federal Reserve Economic Data
    • tidycensus for US Census data
  • But there are many APIs out there (please explore!)
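
To make "two computers talking" concrete, here is a rough sketch of the kind of request a package like wbstats sends for us. It assumes the World Bank's public v2 URL pattern; the helper name `wb_url` is hypothetical, and the packages handle paging, caching, and tidying that this sketch ignores:

```r
# Build the request URL by hand (wb_url is a hypothetical helper)
wb_url <- function(country, indicator) {
  paste0(
    "https://api.worldbank.org/v2/country/", country,
    "/indicator/", indicator,
    "?format=json"
  )
}

url <- wb_url("USA", "SP.DYN.LE00.IN")
url

# Sending the request needs an internet connection and the jsonlite package,
# so it only runs in an interactive session
if (interactive() && requireNamespace("jsonlite", quietly = TRUE)) {
  response <- jsonlite::fromJSON(url)
  str(response, max.level = 1)
}
```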

Searching for WB Indicators


library(wbstats) 
life_indicators <- wb_search("life expectancy")   # Search for life expectancy
print(life_indicators, n = 20)                    # Show first 20 results

# want: Individuals using the Internet (% of population)
internet_indicators <- wb_search("internet")      # Search for internet usage
print(internet_indicators, n = 20)                # Show first 20 results
# Don't see what you're looking for? Browse the full table in the viewer:
View(internet_indicators)
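
Search results are a tibble, so they can be narrowed with dplyr instead of scrolling. The sketch below uses a tiny stand-in table so that it runs offline; it assumes the real `wb_search()` output has `indicator_id` and `indicator` columns, and real results have many more rows:

```r
library(dplyr)
library(tibble)

# Stand-in for the output of wb_search("internet")
search_results <- tibble(
  indicator_id = c("IT.NET.USER.ZS", "IT.NET.BBND", "IT.NET.SECR"),
  indicator    = c(
    "Individuals using the Internet (% of population)",
    "Fixed broadband subscriptions",
    "Secure Internet servers"
  )
)

# Keep only indicators whose name mentions a share of the population
pct_indicators <- search_results |>
  filter(grepl("% of population", indicator, fixed = TRUE)) |>
  select(indicator_id, indicator)

pct_indicators
```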

wbstats Example

# Load packages
library(wbstats) # for downloading WB data
library(dplyr) # for selecting, renaming and mutating
library(janitor) # for rounding

# Store the list of indicators in an object
indicators <- c(
  life_exp       = "SP.DYN.LE00.IN",
  internet_users = "IT.NET.USER.ZS"
)

# Download the data  
wb_data_clean <- wb_data(indicators, mrv = 30) |>
  select(!iso2c) |>          # Drop the 2-letter country code (not needed)
  rename(year = date) |>     # Rename 'date' column to 'year'
  mutate(
    life_exp       = round_to_fraction(life_exp, denominator = 10),   # ~0.1 precision
    internet_users = round_to_fraction(internet_users, denominator = 100) # ~0.01 precision
  )

# View the structure of the dataset
glimpse(wb_data_clean)
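
The rounding step in the pipeline above can be tried on its own. A small sketch with arbitrary numbers (not World Bank values) showing how the `denominator` argument of `round_to_fraction()` sets the precision:

```r
library(janitor)

x <- c(78.5391, 67.8912)   # arbitrary numbers chosen for illustration

tenths     <- round_to_fraction(x, denominator = 10)    # nearest 1/10
hundredths <- round_to_fraction(x, denominator = 100)   # nearest 1/100
quarters   <- round_to_fraction(x, denominator = 4)     # nearest 1/4

tenths
hundredths
quarters
```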

Try it out!


  • Search for a WB indicator
  • Download the data

V-Dem Data

The V-Dem Dataset


  • V-Dem stands for Varieties of Democracy
  • It is a dataset that measures democracy around the world
  • Based on expert assessments of the quality of democracy in each country
  • Two packages we will explore: vdemlite and vdemdata

vdemlite


  • Covers a few hundred commonly used indicators and indices from 1970 onward
  • By contrast, the full V-Dem dataset has 4000+ indicators from the 18th century onward
  • Covers everything in this document
  • Adds some functionality for working with the data
  • Easier to work with in the cloud and apps

vdemlite functions


  • fetchdem() to download the data
  • summarizedem() provides a searchable table of indicators with summary stats
  • searchdem() to search for specific indicators or all indicators used to construct an index
  • See the vdemlite documentation for more details

fetchdem()


# Load packages
library(vdemlite) # to download V-Dem data

# Polyarchy and clean elections index for USA and Sweden for 2000-2020
dem_indicators <- fetchdem(indicators = c("v2x_polyarchy", "v2xel_frefair"),
                           countries = c("USA", "SWE"),
                           start_year = 2000, end_year = 2020)

# View the data
glimpse(dem_indicators)

summarizedem()


# Summary statistics for the polyarchy index
summarizedem(indicator = "v2x_polyarchy")

searchdem()


searchdem()

Your Turn


  • Look at the vdemlite documentation
  • Try using searchdem() to find an indicator you are interested in using
  • Use summarizedem() to get summary statistics for that variable
  • Use fetchdem() to download the data for that variable for a country or countries of interest
  • Try using mutate() to add region codes to the data