Data Frame in R

Sarah Cassie Burnett

September 4, 2025

Before we get started…


  • Check your version of the tidyverse package in the Console:
packageVersion("tidyverse")
[1] '2.0.0'

or install it during lecture:

install.packages("tidyverse")

Scan and fill out for participation credit

https://forms.gle/neFgQu8wcrbWXGQ9A

Announcements

  • Quiz today.
  • Next in-class quiz on Thursday.
  • Coding Assignment 1 will be posted today and is due next Thursday at 11:59 pm.


Some Common Base R Functions

Aggregate functions

(these take a whole vector and return a single summary value)

  • mean() mean of a set of numbers
  • median() median of a set of numbers
  • sd() standard deviation of a set of numbers
  • sum() sum of a set of numbers
  • length() length of a vector
  • max() / min() maximum and minimum values of a vector
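
As a quick illustration of the aggregate functions listed above, here is a minimal sketch. The vector scores and its values are made up for this example; each call takes the whole vector and returns one number.

```r
# Hypothetical example vector (values are made up for illustration)
scores <- c(2, 4, 4, 6, 9)

mean(scores)    # 5
median(scores)  # 4
sd(scores)      # sample standard deviation: sqrt(7), about 2.65
sum(scores)     # 25
length(scores)  # 5
max(scores)     # 9
min(scores)     # 2
```

Note that sd() divides by n - 1 (the sample standard deviation), not by n.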

Some Common Base R Functions

Elementwise functions

(these apply a transformation to each element of a vector)

  • round() round to a specified number of decimal places
  • sqrt() square root
  • log() natural logarithm
  • exp() exponential
  • abs() absolute value
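
The elementwise functions above can be sketched the same way. The vector x is a made-up example; the point is that the result has one value for each input value.

```r
# Hypothetical example vector (values are made up for illustration)
x <- c(-4, 1, 2.25, 9)

abs(x)                                  # 4, 1, 2.25, 9
sqrt(abs(x))                            # 2, 1, 1.5, 3
round(c(3.14159, 2.71828), digits = 2)  # 3.14, 2.72
exp(c(0, 1))                            # 1 and e (about 2.718)
log(exp(c(0, 1)))                       # log() undoes exp(): 0, 1

# The output has the same length as the input
length(abs(x)) == length(x)             # TRUE
```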

Indexing in R


  • Use indexing to extract a subset of your container
  • You can access elements of a vector in several ways:
    • By position
    • By negative index
    • By boolean/logical expressions
    • By name
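
The four ways of indexing listed above might look like this in practice. The named vector heights and its values are hypothetical, chosen only to keep the example small.

```r
# Hypothetical named vector (names and values are made up)
heights <- c(ana = 160, ben = 175, cy = 182, dee = 168)

heights[2]              # by position: ben (175)
heights[c(1, 3)]        # several positions: ana (160) and cy (182)
heights[-1]             # negative index: everything except the first element
heights[heights > 170]  # logical expression: ben (175) and cy (182)
heights["dee"]          # by name: dee (168)
```

R counts positions starting at 1, so heights[1] is the first element.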

Packages

From Functions to Packages


  • A function is a set of instructions
    • read_csv() is a function
    • sample() is a function
  • A package is a collection of functions
    • readr is a package that contains the read_csv() function
    • ggplot2 is a package that contains the ggplot() function
  • Use install.packages() to install packages
  • Use library() to load packages
  • You can install packages from CRAN

A Data Science Workflow


The Tidyverse

  • The Tidyverse is a collection of data science packages
  • It is also considered a dialect of R
  • In this class, we will be using many Tidyverse packages
    • readr for reading data
    • tidyr for data tidying
    • dplyr for data manipulation
    • ggplot2 for data visualization
  • The full list of packages is on the Tidyverse website

Working with Tidyverse packages


  • At first we will load the packages individually, e.g. library(ggplot2)
  • Later we will load them all at once with library(tidyverse)
  • Another way to call a function from a package, without loading the package first, is with ::, e.g. ggplot2::ggplot()
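
To compare library() with ::, here is a small sketch. As an assumption made so the code runs without installing anything, it uses stats and utils, two packages that ship with R, in place of the Tidyverse packages; the pattern is the same for ggplot2 or dplyr.

```r
# Made-up values for illustration
values <- c(3, 1, 2)

# package::function() calls a function without attaching the package
stats::median(values)    # 2
utils::head(letters, 3)  # "a" "b" "c"

# After library(), the package prefix is optional.
# (stats is attached by default, so this call changes nothing here.)
library(stats)
median(values)           # 2
```

The :: form is also handy when two loaded packages define a function with the same name, because it states which one you mean.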

Dataframe

What is this?

name <- c("Cars", "WALL-E", "The Lego Movie", "PAW Patrol: The Movie")
lead_person <- c("Lightning McQueen (Owen Wilson)",
                 "WALL-E (Ben Burtt)",
                 "Emmet Brickowski (Chris Pratt)",
                 "Ryder (Will Brisbin)")
length_minutes <- c(120, 97, 101, 86)
award <- c(TRUE, TRUE, TRUE, FALSE)

df <- data.frame(
  name,
  lead_person,
  length_minutes,
  award
)
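
Once a data frame exists, a few base R functions help inspect it. The sketch below rebuilds a smaller version of the data frame above (three of the four columns) so that it runs on its own.

```r
# Smaller copy of the data frame above, rebuilt so this block is self-contained
df <- data.frame(
  name = c("Cars", "WALL-E", "The Lego Movie", "PAW Patrol: The Movie"),
  length_minutes = c(120, 97, 101, 86),
  award = c(TRUE, TRUE, TRUE, FALSE)
)

nrow(df)                 # 4 rows
ncol(df)                 # 3 columns
names(df)                # "name" "length_minutes" "award"
str(df)                  # prints one line per column with its type

df$length_minutes        # a column is a vector: 120 97 101 86
df[2, "name"]            # row 2 of the name column: WALL-E
mean(df$length_minutes)  # 101
```

Each column is a vector of a single type, and all columns have the same length; that is what makes the object a data frame.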

Reading Data into R


library(readr)

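# read_csv2() reads semicolon-delimited files (comma as the decimal mark)
films <- read_csv2("data/film.csv")
# or, for a comma-delimited file, use read_csv()
films <- read_csv("data/film_cleanish.csv")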
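
The difference between read_csv() and read_csv2() can be sketched with two tiny temporary files. The file contents below are made up for illustration and are not the course data; the show_col_types argument assumes readr 2.0 or later.

```r
library(readr)

# Hypothetical comma-delimited file: "," separates columns, "." marks decimals
comma_file <- tempfile(fileext = ".csv")
writeLines(c("Title,Length", "Film A,111.5", "Film B,94"), comma_file)

# Hypothetical semicolon-delimited file: ";" separates columns, "," marks decimals
semi_file <- tempfile(fileext = ".csv")
writeLines(c("Title;Length", "Film A;111,5", "Film B;94"), semi_file)

films_comma <- read_csv(comma_file, show_col_types = FALSE)
films_semi  <- read_csv2(semi_file, show_col_types = FALSE)

# Both calls produce the same numeric column
films_comma$Length
films_semi$Length
```

If a file read with read_csv() comes back as a single column full of semicolons, that is a sign it should be read with read_csv2() instead.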

Let’s Look at the Data


One way to do this is with the base R head() function

head(films)
# A tibble: 6 × 9
   Year Length Title    Genre `Lead Man` `Lead Woman` Director Popularity Awards
  <dbl>  <dbl> <chr>    <chr> <chr>      <chr>        <chr>         <dbl> <lgl> 
1  1990    111 Tie Me … Come… Banderas,… Abril, Vict… Almodóv…         68 FALSE 
2  1991    113 High He… Come… Bosé, Mig… Abril, Vict… Almodóv…         68 FALSE 
3  1983    104 Dead Zo… Horr… Walken, C… Adams, Broo… Cronenb…         79 FALSE 
4  1979    122 Cuba     Acti… Connery, … Adams, Broo… Lester,…          6 FALSE 
5  1978     94 Days of… Drama Gere, Ric… Adams, Broo… Malick,…         14 FALSE 
6  1983    140 Octopus… Acti… Moore, Ro… Adams, Maud  Glen, J…         68 FALSE 

Use View()


Another way to look at the data is with View(), or by clicking the name of the data frame in the Environment pane.

View(films)

Using glimpse() from dplyr


Another way to look at the data is with glimpse() from the dplyr package.

library(dplyr)

glimpse(films)
Rows: 1,659
Columns: 9
$ Year         <dbl> 1990, 1991, 1983, 1979, 1978, 1983, 1984, 1989, 1985, 199…
$ Length       <dbl> 111, 113, 104, 122, 94, 140, 101, 99, 104, 149, 188, 117,…
$ Title        <chr> "Tie Me Up! Tie Me Down!", "High Heels", "Dead Zone, The"…
$ Genre        <chr> "Comedy", "Comedy", "Horror", "Action", "Drama", "Action"…
$ `Lead Man`   <chr> "Banderas, Antonio", "Bosé, Miguel", "Walken, Christopher…
$ `Lead Woman` <chr> "Abril, Victoria", "Abril, Victoria", "Adams, Brooke", "A…
$ Director     <chr> "Almodóvar, Pedro", "Almodóvar, Pedro", "Cronenberg, Davi…
$ Popularity   <dbl> 68, 68, 79, 6, 14, 68, 14, 28, 6, 32, 81, 17, 46, 49, 6, …
$ Awards       <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, F…

Try it out!


  • Read in the film_cleanish.csv file
  • Use the three methods we discussed to view the data

A Few More Basic dplyr Functions


Use select() to choose columns.

films_select <- films |>
  select(Title, Awards)

glimpse(films_select)
Rows: 1,659
Columns: 2
$ Title  <chr> "Tie Me Up! Tie Me Down!", "High Heels", "Dead Zone, The", "Cub…
$ Awards <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, …

A Few More Basic dplyr Functions

Use filter() to choose rows.

films <- read_csv("data/film_cleanish.csv")

films_select <- films |>
  filter(Awards == TRUE)

# or
films_select <- films |>
  filter(Awards)

Note

Assigning the result to the name of an existing data frame overwrites that data frame. If you want to keep the original, assign the result to a different name.
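
The note can be illustrated with a small sketch. The data frame toy_films is a hypothetical stand-in for the films data, and the code assumes R 4.1 or later for the |> pipe.

```r
library(dplyr)

# Hypothetical stand-in for the films data (values are made up)
toy_films <- data.frame(
  Title  = c("Film A", "Film B", "Film C"),
  Awards = c(TRUE, FALSE, TRUE)
)

# New name: toy_films still has all 3 rows afterwards
toy_winners <- toy_films |>
  filter(Awards)

nrow(toy_films)    # 3
nrow(toy_winners)  # 2

# Same name: the 3-row version is replaced by the filtered one
toy_films <- toy_films |>
  filter(Awards)

nrow(toy_films)    # 2
```

After the second assignment, the only way to get the dropped row back is to recreate or re-read the data.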

A Few More Basic dplyr Functions

Use mutate() to create new columns.

films_new <- films |>
  mutate(Length_hours = Length / 60)

glimpse(films_new)
Rows: 1,659
Columns: 10
$ Year         <dbl> 1990, 1991, 1983, 1979, 1978, 1983, 1984, 1989, 1985, 199…
$ Length       <dbl> 111, 113, 104, 122, 94, 140, 101, 99, 104, 149, 188, 117,…
$ Title        <chr> "Tie Me Up! Tie Me Down!", "High Heels", "Dead Zone, The"…
$ Genre        <chr> "Comedy", "Comedy", "Horror", "Action", "Drama", "Action"…
$ `Lead Man`   <chr> "Banderas, Antonio", "Bosé, Miguel", "Walken, Christopher…
$ `Lead Woman` <chr> "Abril, Victoria", "Abril, Victoria", "Adams, Brooke", "A…
$ Director     <chr> "Almodóvar, Pedro", "Almodóvar, Pedro", "Cronenberg, Davi…
$ Popularity   <dbl> 68, 68, 79, 6, 14, 68, 14, 28, 6, 32, 81, 17, 46, 49, 6, …
$ Awards       <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, F…
$ Length_hours <dbl> 1.850000, 1.883333, 1.733333, 2.033333, 1.566667, 2.33333…
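
The three verbs can also be chained in a single pipeline. This is a minimal sketch on a made-up data frame (toy_films is hypothetical, not the course data), assuming R 4.1 or later for the |> pipe.

```r
library(dplyr)

# Hypothetical data frame (values are made up for illustration)
toy_films <- data.frame(
  Title  = c("Film A", "Film B", "Film C"),
  Length = c(90, 120, 150),
  Awards = c(TRUE, FALSE, TRUE)
)

toy_long <- toy_films |>
  select(Title, Length) |>            # keep two columns
  filter(Length >= 120) |>            # keep films of 120 minutes or more
  mutate(Length_hours = Length / 60)  # add a new column

toy_long  # 2 rows: Film B (2 hours) and Film C (2.5 hours)
```

Each step receives the result of the previous one, so the order of the verbs matters: a column dropped by select() is no longer available to later steps.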

Try it out!


  • Use your new dplyr verbs to manipulate the data
  • Select columns, filter rows, and create new columns

Basic Data Viz with ggplot2


  • ggplot2 is a powerful data visualization package
  • It is based on the grammar of graphics
  • We will talk about this more in depth later

Basic Data Viz with ggplot2

  • For now, let’s make a simple column chart
library(ggplot2)

ggplot(data = films, aes(x = Genre, y = Popularity)) +
  geom_col(fill = "dodgerblue") 

Try it out!


  • Use ggplot2 to make a simple column chart
  • Choose a different variable to plot
  • Change the color of the bars