What is tidy data?

Sarah Cassie Burnett

September 16, 2025

Preliminaries

Where Does Data Come From?


  • Your boss or a client sends you a file
  • Survey data collected by you or someone else
  • You can download it from a website
  • You can scrape it from a website
  • A data package (e.g. unvotes)
  • You can access it through an API

Getting Started with Data


  • Tabular data is data that is organized into rows and columns
    • a.k.a. rectangular data
  • A data frame is a special kind of tabular data used in data science
  • A variable is something you can measure
  • An observation is a single unit or case in your data set
  • The unit of analysis is the level at which you are measuring
    • In a cross-section: country, state, county, city, individual, etc.
    • In a time-series: year, month, day, etc.
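These terms can be made concrete with a small data frame. This is a minimal sketch with invented values; the object name `gdp` and its columns are hypothetical and only serve to illustrate the vocabulary.

```r
library(tibble)

# Hypothetical data (values invented for illustration).
# The unit of analysis here is the country-year.
gdp <- tibble(
  country = c("Ghana", "Ghana", "Kenya", "Kenya"),
  year    = c(2000, 2010, 2000, 2010),
  gdp_pc  = c(1100, 1900, 1300, 1700)
)

ncol(gdp)  # number of variables (columns)
nrow(gdp)  # number of observations (rows)
```

Each row is one observation (one country in one year), and each column is one variable measured for that observation.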

Adjectives for Your Data

The Concept of “Tidy Data”

Tidy data is data that can be used directly with the packages of the tidyverse

There are four basic principles to a tidy API:

  • Reuse existing data structures.
  • Compose simple functions with the pipe.
  • Embrace functional programming.
  • Design for humans.
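One way to see "compose simple functions with the pipe" in practice is a short dplyr chain. This sketch uses the built-in `mtcars` data frame; the object name `four_cyl` is only illustrative.

```r
library(dplyr)

# Each step is a small function; the pipe passes the result along.
four_cyl <- mtcars |>
  filter(cyl == 4) |>                # keep only 4-cylinder cars
  summarize(mean_mpg = mean(mpg))    # collapse to one summary row

four_cyl
```

Both functions take a data frame and return a data frame, which is what makes them easy to chain.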

The Concept of “Tidy Data”


  • Each column represents a single variable
  • Each row represents a single observation
  • Each cell represents a single value
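A common violation of these rules is a "wide" table where values of a variable (such as the year) are used as column names. The sketch below, with invented values and hypothetical object names, reshapes such a table with `pivot_longer()` from tidyr.

```r
library(tibble)
library(tidyr)

# Not tidy: the year variable is spread across two column names
wide <- tibble(
  country = c("Ghana", "Kenya"),
  `2000`  = c(1100, 1300),
  `2010`  = c(1900, 1700)
)

# Tidy: one column per variable, one row per country-year
tidy <- pivot_longer(
  wide,
  cols            = c(`2000`, `2010`),
  names_to        = "year",
  values_to       = "gdp_pc",
  names_transform = list(year = as.integer)  # store year as a number
)

tidy
```

After reshaping, `country`, `year`, and `gdp_pc` are each a single column, and each row is a single observation.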

Tidy Data Example

What are Clean Data?

  • Column names are easy to work with and are not duplicated
  • Missing values have been dealt with
  • There are no repeated observations or columns
  • There are no blank observations or columns
  • The data are in the proper format, for example dates should be formatted as dates
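Several of these checks can be sketched in a few lines of dplyr. The data frame `messy` and its values are invented for illustration, and this is only one possible cleaning sequence, not a fixed recipe.

```r
library(tibble)
library(dplyr)

# Hypothetical messy data: awkward names, a repeated row,
# a fully blank row, and dates stored as text
messy <- tibble(
  `Country Name` = c("Ghana", "Ghana", "Kenya", NA),
  `Survey Date`  = c("2024-01-15", "2024-01-15", "2024-02-01", NA),
  Score          = c(3.5, 3.5, NA, NA)
)

clean <- messy |>
  rename(
    country     = `Country Name`,
    survey_date = `Survey Date`,
    score       = Score
  ) |>
  filter(!if_all(everything(), is.na)) |>      # drop fully blank rows
  distinct() |>                                # drop repeated observations
  mutate(survey_date = as.Date(survey_date))   # store dates as dates

clean
```

The remaining missing `score` is left in place on purpose: whether to drop, impute, or keep a missing value depends on the analysis.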

Messy Data Example

Which of These is Likely Tidy/Clean?


  • Your boss or a client sends you a file
  • Survey data collected by you or someone else
  • You can download it from a website
  • You can scrape it from a website
  • A curated collection (e.g. unvotes)
  • You can access it through an API

How Do We Get Tidy/Clean Data?


  • Get lucky and find it
  • Wrangle it ourselves
  • Use a package where it has been wrangled for us
  • Download via an API

Reading Data

Read Data into R


  • Use the read_csv() function from the readr package
  • The readr package is part of the tidyverse
  • Compared with base R's read.csv(), it is typically faster and returns a tibble

R Code Review


  • <- is the assignment operator
    • Use it to assign values to objects
  • # is the comment operator
    • Use it to comment out code or add comments
    • In Markdown text, # marks a heading instead
  • To load a package, use library() and the name of the package
    • The name does not have to be in quotes, e.g. library(readr)
    • Quotes are required only when you install it, e.g. install.packages("readr")
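The three ideas above fit in a few lines. This is a minimal sketch; the object names `x` and `y` are arbitrary.

```r
# install.packages("readr")  # run once; the package name is quoted
library(readr)                # run every session; quotes are optional

x <- 5        # assign the value 5 to the object x
y <- x * 2    # objects can be reused in later expressions
y             # typing an object's name prints its value
```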

Read Data into R


# load libraries
library(readr)
library(dplyr)

films <- read_csv("data/film_cleanish.csv") #notice file path

glimpse(films)

Viewing the Data in R


Use glimpse() to see the columns and data types:

# load libraries
library(readr)
library(dplyr)

films <- read_csv("data/film_cleanish.csv")

glimpse(films)
Rows: 1,659
Columns: 9
$ Year         <dbl> 1990, 1991, 1983, 1979, 1978, 1983, 1984, 1989, 1985, 199…
$ Length       <dbl> 111, 113, 104, 122, 94, 140, 101, 99, 104, 149, 188, 117,…
$ Title        <chr> "Tie Me Up! Tie Me Down!", "High Heels", "Dead Zone, The"…
$ Genre        <chr> "Comedy", "Comedy", "Horror", "Action", "Drama", "Action"…
$ `Lead Man`   <chr> "Banderas, Antonio", "Bosé, Miguel", "Walken, Christopher…
$ `Lead Woman` <chr> "Abril, Victoria", "Abril, Victoria", "Adams, Brooke", "A…
$ Director     <chr> "Almodóvar, Pedro", "Almodóvar, Pedro", "Cronenberg, Davi…
$ Popularity   <dbl> 68, 68, 79, 6, 14, 68, 14, 28, 6, 32, 81, 17, 46, 49, 6, …
$ Awards       <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, F…

Or use View() or click on the name of the object in your Environment tab to see the data in a spreadsheet.

Write a New CSV File


Now try writing the same data to a file with a different name

write_csv(films, "data/film_cleanish_new.csv") 
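To check that a write worked, you can read the file back in. The sketch below uses a small invented data frame and a temporary file so that it runs without the course data; `films_small`, `out_path`, and `films_back` are hypothetical names.

```r
library(readr)
library(tibble)

# Small stand-in for the films data (values invented)
films_small <- tibble(
  Title = c("Film A", "Film B"),
  Year  = c(1990, 1991)
)

# Write to a temporary location, then read the file back
out_path <- file.path(tempdir(), "films_small.csv")
write_csv(films_small, out_path)

films_back <- read_csv(out_path, show_col_types = FALSE)

file.exists(out_path)                           # was the file created?
identical(films_small$Title, films_back$Title)  # same titles after the round trip?
```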

Excel Files

Read in Excel File


library(readxl)

# not a real example
your_data <- read_excel("path/to/your/data.xlsx")

glimpse(your_data)
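For a version that runs as written, readxl ships with example workbooks. This sketch assumes the bundled `datasets.xlsx` file and reads one of its worksheets by name; `example_path` and `mtcars_xl` are illustrative names.

```r
library(readxl)

# Path to an example workbook installed with readxl
example_path <- readxl_example("datasets.xlsx")

# List the worksheets in the workbook
excel_sheets(example_path)

# Read one worksheet by name with the sheet argument
mtcars_xl <- read_excel(example_path, sheet = "mtcars")

dim(mtcars_xl)  # rows and columns
```

If you leave out `sheet`, read_excel() reads the first worksheet in the file.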

Google Sheets

Import Data from Google Sheets


  • Can use googlesheets4
  • Have a look at these Gapminder data
  • Use gs4_deauth() to skip authentication (this works for public sheets)
  • Then use read_sheet() to read in the data

Example Code


library(googlesheets4)

# Deauthorize to access public sheets without credentials
gs4_deauth()

# Read in the gapminder Africa data
gapminder_data <- read_sheet("1U6Cf_qEOhiR9AZqTqS3mbMF3zt2db48ZP5v3rkrAEJY")

Or…


library(googlesheets4)

# Deauthorize to access public sheets without credentials
gs4_deauth()

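# Look up a sheet named "gapminder" in your own Google Drive, then read it
# (drive_get() needs Google Drive access; gs4_deauth() does not cover it)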
gapminder_data <- googledrive::drive_get("gapminder") |>
  read_sheet()

Try It Out!


  • Use the code above to read in the data
  • Try reading in Gapminder data for a different continent (hint: the sheet argument of read_sheet())

Find Your Own Data


  • Visit kaggle.com
  • Find a dataset you like
  • Download it as a CSV
  • Create an R project and put it in the directory
  • Read it into R
  • Explore with glimpse() and View()