Working With Data

Sarah Cassie Burnett

September 18, 2025

Announcements

Online quiz posted at the end of class, due tomorrow.
Coding Assignment 1 is due by 11:59 pm tonight.
- Re-submit the assignment on Gradescope if it wasn’t uploaded correctly.
Minor final project assignment due tonight.

Scan and fill out for participation credit

https://forms.gle/97zZmDjALU8GkJEfA

How Do We Get Tidy/Clean Data?

Get lucky and find it (like on Kaggle)
Wrangle it ourselves
Use a package where it has been wrangled for us
Download via an API

This Lesson

Practice with World Bank and V-Dem data
World Bank data through wbstats
- There is another package called WDI
- Both packages access data through the World Bank API
Varieties of Democracy (V-Dem) through vdemlite
- There is also a package called vdemdata
- vdemlite offers more functionality and works better in the cloud

`filter()`, `select()`, `mutate()`

Along the way we will practice some important dplyr verbs:

filter() is used to select observations based on their values
select() is used to select variables
mutate() is used to create new variables or modify existing ones

As well as some helpful functions from the janitor package.
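
A minimal sketch of the three verbs on a small made-up tibble. The `countries` data and its column names are invented for illustration; they are not World Bank or V-Dem data.

```r
library(dplyr)
library(tibble)

# Hypothetical toy data, invented for illustration
countries <- tibble(
  country = c("A", "B", "C"),
  year    = c(2019, 2020, 2020),
  gdp     = c(100, 250, 400),
  pop     = c(10, 20, 25)
)

# filter(): keep only the rows (observations) from 2020
recent <- filter(countries, year == 2020)

# select(): keep only the columns (variables) we need
slim <- select(recent, country, gdp, pop)

# mutate(): create a new variable from existing ones
result <- mutate(slim, gdp_per_cap = gdp / pop)

result
```

Each step returns a new tibble, so the intermediate objects (`recent`, `slim`) can be inspected one at a time while learning.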

dplyr Basics

Before we discuss their individual differences, note what they have in common:

The first argument is always a data frame.
The subsequent arguments typically describe which columns to operate on (using variable names, without quotes).
The output is always a new data frame.
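
These shared conventions are what make the verbs easy to chain. A short sketch, using an invented `scores` tibble, that checks two things: the input is left unchanged, and the pipe simply passes the data frame in as the first argument.

```r
library(dplyr)
library(tibble)

# Hypothetical toy data, invented for illustration
scores <- tibble(
  student = c("Ana", "Ben", "Cy"),
  score   = c(72, 88, 95)
)

# Data frame first, then bare (unquoted) column names
high <- filter(scores, score > 80)

# The pipe passes the left-hand side as the first argument,
# so this is the same call written left to right
high_piped <- scores |> filter(score > 80)

nrow(scores)                 # the original still has all 3 rows
identical(high, high_piped)  # both forms give the same new data frame
```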

Data Frames

What is a Data Frame?

Special kind of tabular data used in data science
Each column can be a different data type
Data frames are the most common data structure in R

What is a Tibble?

Modern data frames in R
Offers better printing and subsetting behaviors
Displays only the first 10 rows and as many columns as fit on screen
Column names are preserved exactly, even if they contain spaces

Creating a Tibble

When you read data into R with readr you automatically get a tibble
You can create a tibble using tibble() from the tibble package:

library(tibble)

# Create a tibble
my_tibble <- tibble(
  name = c("Alice", "Bob", "Charlie"),
  age = c(25, 30, 35),
  height = c(160, 170, 180),
  is_student = c(TRUE, FALSE, FALSE)
)

# Print the tibble
my_tibble

# A tibble: 3 × 4
  name      age height is_student
  <chr>   <dbl>  <dbl> <lgl>     
1 Alice      25    160 TRUE      
2 Bob        30    170 FALSE     
3 Charlie    35    180 FALSE
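
To see how a tibble differs from a base data frame in practice, here is a small sketch comparing the two on column names and subsetting. The column name `life exp` and its values are invented for illustration.

```r
library(tibble)

# Base data.frame() repairs non-syntactic names by default (check.names = TRUE)
df <- data.frame(`life exp` = c(70, 80))
names(df)   # "life.exp"

# tibble() keeps the name exactly as written
tb <- tibble(`life exp` = c(70, 80))
names(tb)   # "life exp"

# Subsetting one column with [ , 1]:
class(tb[, 1])   # still a tibble
class(df[, 1])   # a base data frame drops to a plain numeric vector
```

Names with spaces must be wrapped in backticks when you refer to them, which is one reason to clean column names early.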

Common Data Types

<chr> (Character): Stores text strings
- Example: "hello", "R programming"
<dbl> (Double): Stores decimal (floating-point) numbers
- Example: 3.14, -1.0
<int> (Integer): Stores whole numbers (integers)
- Example: 1L, -100L, 42L (the L suffix marks an integer; a plain 1 is stored as a double)
<lgl> (Logical): Stores boolean values (TRUE, FALSE, NA)
- Example: TRUE, FALSE, NA
<fct> (Factor): Stores categorical variables with fixed levels
- Example: factor(c("low", "medium", "high"))
<date> (Date): Stores dates in the “YYYY-MM-DD” format
- Example: as.Date("2024-09-05")

Other Data Types

<dttm> (Date-Time or POSIXct): Stores date-time objects (both date and time).
- Example: as.POSIXct("2024-09-05 14:30:00")
<time> (Time): Specifically stores time-of-day values (rarely seen without a date)
- Example: hms::as_hms("14:30:00")
<list> (List): Stores lists, where each entry can be a complex object.
- Example: list(c(1, 2, 3), c("a", "b", "c"))
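
One way to connect these abbreviations to real columns is to build a tibble with one column per type and print it. This is a sketch with invented values; the column names are hypothetical.

```r
library(tibble)

# Hypothetical example values, one column per type
types_demo <- tibble(
  text    = c("hello", "R programming"),             # <chr>
  decimal = c(3.14, -1.0),                           # <dbl>
  whole   = c(1L, 42L),                              # <int> (note the L suffix)
  flag    = c(TRUE, NA),                             # <lgl>
  level   = factor(c("low", "high")),                # <fct>
  day     = as.Date(c("2024-09-05", "2024-09-06")),  # <date>
  moment  = as.POSIXct(
    c("2024-09-05 14:30:00", "2024-09-06 09:00:00"),
    tz = "UTC"
  ),                                                 # <dttm>
  nested  = list(c(1, 2, 3), c("a", "b", "c"))       # <list>
)

types_demo                 # the header row shows each column's type
sapply(types_demo, class)  # the same information as R class names
```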

Dates and Times with `lubridate`

lubridate is an R package that makes it easier to work with dates and times
Use convenient functions to parse dates written in different formats

library(lubridate)
  
# Store a date
my_date <- ymd("2024-09-05")
my_date2 <- mdy("09-05-2024")
my_date3 <- dmy("05-09-2024")
  
# Print in long form
format(my_date, "%B %d, %Y")

[1] "September 05, 2024"
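
A short follow-up sketch: the three parsers above read different spellings of the same calendar date, and lubridate can then extract components or do date arithmetic. The object names `d1`, `d2`, `d3` are just for this example.

```r
library(lubridate)

# Three spellings of the same calendar date
d1 <- ymd("2024-09-05")
d2 <- mdy("09-05-2024")
d3 <- dmy("05-09-2024")

# All three parse to the same Date
d1 == d2 && d2 == d3

# Pull out components
year(d1)    # 2024
month(d1)   # 9
wday(d1, label = TRUE)   # day of the week as a labelled factor

# Date arithmetic: 30 days later
d1 + days(30)
```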

APIs

API stands for “Application Programming Interface”
Way for two computers to talk to each other
In our case, we will use APIs to download social science data

APIs in R

APIs are accessed through packages in R
Sometimes there can be more than one package for an API
Much easier than reading in data from a messy flat file!
We will use a few API packages in this course
- World Bank data through wbstats (or WDI)
- fredr for Federal Reserve Economic Data
- tidycensus for US Census data
But there are many APIs out there (please explore!)

Searching for WB Indicators

library(wbstats) 
life_indicators <- wb_search("life expectancy")   # Search for life expectancy
print(life_indicators, n = 20)                    # Show first 20 results

# want: Individuals using the Internet (% of population)
internet_indicators <- wb_search("internet")      # Search for internet usage
print(internet_indicators, n = 20)                # Show first 20 results
# Don't see what you're looking for? Try:
View(internet_indicators)

`wbstats` Example

# Load packages
library(wbstats) # for downloading WB data
library(dplyr) # for selecting, renaming and mutating
library(janitor) # for rounding

# Store the list of indicators in an object
indicators <- c(
  life_exp       = "SP.DYN.LE00.IN",
  internet_users = "IT.NET.USER.ZS"
)

# Download the data  
wb_data_clean <- wb_data(indicators, mrv = 30) |>
  select(!iso2c) |>          # Drop the 2-letter country code (not needed)
  rename(year = date) |>     # Rename 'date' column to 'year'
  mutate(
    life_exp       = round_to_fraction(life_exp, denominator = 10),   # ~0.1 precision
    internet_users = round_to_fraction(internet_users, denominator = 100) # ~0.01 precision
  )

# View the structure of the dataset
glimpse(wb_data_clean)
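
Once the data are downloaded, the same verbs apply. The sketch below uses a tiny stand-in tibble (`wb_mock`, a hypothetical name) with the main column names of `wb_data_clean`, so it runs without an internet connection. The values are made up and are not actual World Bank figures.

```r
library(dplyr)
library(tibble)

# Stand-in for wb_data_clean: same main column names, invented values
wb_mock <- tibble(
  iso3c          = c("AAA", "AAA", "BBB", "BBB"),
  country        = c("Country A", "Country A", "Country B", "Country B"),
  year           = c(2019, 2020, 2019, 2020),
  life_exp       = c(70.1, 70.4, 80.2, 80.5),
  internet_users = c(45.25, 50.75, 88.5, 90)
)

# Keep one year, keep three columns, and flag high internet use
wb_2020 <- wb_mock |>
  filter(year == 2020) |>
  select(country, life_exp, internet_users) |>
  mutate(mostly_online = internet_users > 75)

wb_2020
```

Swapping `wb_mock` for `wb_data_clean` should work the same way, assuming the download produced those column names.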

Try it out!

Search for a WB indicator
Download the data

V-Dem Data

The V-Dem Dataset

V-Dem stands for Varieties of Democracy
It is a dataset that measures democracy around the world
Based on expert assessments of the quality of democracy in each country
Two packages we will explore: vdemlite and vdemdata

`vdemlite`

Covers a few hundred commonly used indicators and indices from 1970 onward
Covers everything in this document
As opposed to the 4000+ indicators from the 18th century onward in the full V-Dem dataset
Adds some functionality for working with the data
Easier to work with in the cloud and apps

`vdemlite` functions

fetchdem() to download the data
summarizedem() provides a searchable table of indicators with summary stats
searchdem() to search for specific indicators or all indicators used to construct an index
See the vdemlite documentation for more details

`fetchdem()`

# Load packages
library(vdemlite) # to download V-Dem data

# Polyarchy and clean elections index for USA and Sweden for 2000-2020
dem_indicators <- fetchdem(indicators = c("v2x_polyarchy", "v2xel_frefair"),
                           start_year = 2000, end_year = 2020,
                           countries = c("USA", "SWE"))

# View the data
glimpse(dem_indicators)

`summarizedem()`

# Summary statistics for the polyarchy index
summarizedem(indicator = "v2x_polyarchy")

`searchdem()`

searchdem()

Your Turn

Look at the vdemlite documentation
Try using searchdem() to find an indicator you are interested in using
Use summarizedem() to get summary statistics for that variable
Use fetchdem() to download the data for that variable for a country or countries of interest
Try using mutate() to add region codes to the data
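
For the last step, one possible approach is `mutate()` combined with `case_when()`. The sketch below runs offline on a stand-in tibble: the column name `country_text_id`, the index values, and the region labels are all assumptions made for illustration, so check them against what `fetchdem()` actually returns.

```r
library(dplyr)
library(tibble)

# Stand-in for fetchdem() output; column names and values are assumptions
dem_mock <- tibble(
  country_text_id = c("USA", "SWE", "USA", "SWE"),
  year            = c(2019, 2019, 2020, 2020),
  v2x_polyarchy   = c(0.8, 0.9, 0.8, 0.9)
)

# mutate() + case_when(): assign a hand-written region label to each country
dem_regions <- dem_mock |>
  mutate(
    region = case_when(
      country_text_id == "USA" ~ "Americas",
      country_text_id == "SWE" ~ "Europe",
      TRUE                     ~ "Other"
    )
  )

dem_regions
```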
