Extracting Data

Sarah Cassie Burnett

October 2, 2025

Keyword Search in NSF Awards

Read a CSV of NSF Awards from around the country

  • Goal: Get a Data Science internship / starting job
  • Challenge: Data science is interdisciplinary. Every science produces data, and every science needs to analyze it.
  • Let’s search NSF-funded projects for relevant keywords.

Download the CSV files from the webpage

library(readr)
nsf_awards <- read_csv('data/Awards.csv')
colnames(nsf_awards)

stringr checks


library(stringr)

str_detect("REU Site: Big Data Analytics", "data|science|analysis")
# FALSE: matching is case sensitive ("data" does not match "Data")

str_detect(str_to_lower("REU Site: Big Data Analytics"), 
"data|science|analysis")
# TRUE

str_detect(
  str_to_lower("REU Site: Digital Forensic Research on Emerging Computing Environments"),
  "data|science|analysis"
)
# FALSE
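As an alternative to lowercasing the input, stringr's regex() helper can make the pattern itself case-insensitive. The sketch below reuses the two titles from the checks above; the "Database Systems" string is a made-up example added only to show that a plain substring pattern also matches inside longer words, which a word boundary (\\b) avoids.

```r
library(stringr)

titles <- c(
  "REU Site: Big Data Analytics",
  "REU Site: Digital Forensic Research on Emerging Computing Environments"
)

# regex(ignore_case = TRUE) makes the match case-insensitive
# without changing the input strings.
pattern <- regex("data|science|analysis", ignore_case = TRUE)
title_hits <- str_detect(titles, pattern)
title_hits
# TRUE FALSE

# Hypothetical title: "data" as a substring matches inside "Database",
# but \\bdata\\b only matches "data" as a whole word.
substring_hit <- str_detect("Database Systems", regex("data", ignore_case = TRUE))
whole_word_hit <- str_detect("Database Systems", regex("\\bdata\\b", ignore_case = TRUE))
substring_hit   # TRUE
whole_word_hit  # FALSE
```

Whether whole-word matching is preferable depends on the search: it is stricter, so it will also skip plurals or compound words unless they are listed in the pattern.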

Build the keyword flag


library(dplyr)
library(lubridate)

# Goal: Check the title and abstract for the keywords
awards <- nsf_awards %>%
  select(Title, Abstract, PrincipalInvestigator, StartDate, EndDate) %>%
  mutate(
    keyword_match = str_detect(
      str_to_lower(paste(Title, Abstract)),
      "data analysis|data science|data|science|analysis"
    )
  )
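To see how the flag behaves before running it on the full file, it can help to try it on a tiny table. The sketch below uses two made-up rows (toy_awards is hypothetical, not NSF data) and adds a keyword_hits column with str_count(), which is one possible way to gauge how strongly each row matches.

```r
library(dplyr)
library(stringr)

# Hypothetical toy rows standing in for nsf_awards
toy_awards <- tibble(
  Title = c("REU Site: Big Data Analytics", "Coral Reef Ecology"),
  Abstract = c(
    "Students learn data science and data analysis.",
    "Field study of reef fish populations."
  )
)

toy_flagged <- toy_awards |>
  mutate(
    text = str_to_lower(paste(Title, Abstract)),
    # TRUE if any keyword appears at least once
    keyword_match = str_detect(text, "data|science|analysis"),
    # Number of keyword occurrences in the combined text
    keyword_hits = str_count(text, "data|science|analysis")
  )

toy_flagged |> select(Title, keyword_match, keyword_hits)
```

Because "data" on its own already matches any text containing "data science" or "data analysis", the longer phrases in an alternation add nothing to the TRUE/FALSE flag; they matter only if the single words are removed from the pattern.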

Clean dates and sort


awards <- nsf_awards %>%
  select(Title, Abstract, PrincipalInvestigator, StartDate, EndDate) %>%
  mutate(
    keyword_match = str_detect(
      str_to_lower(paste(Title, Abstract)),
      "data analysis|data science|data|science|analysis"
    ),
    StartDate = mdy(StartDate),
    EndDate   = mdy(EndDate)
  ) 

awards |> arrange(EndDate)
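Once the dates are parsed, they can be compared like numbers. The sketch below uses made-up rows (toy_dates is hypothetical) and an assumed cutoff of June 1, 2026 to keep only awards that would still be running in summer 2026, with the latest end date first.

```r
library(dplyr)
library(lubridate)

# Hypothetical rows with dates stored as month/day/year text
toy_dates <- tibble(
  Title = c("Award A", "Award B", "Award C"),
  EndDate = c("08/31/2026", "05/15/2025", "12/31/2027")
)

still_active <- toy_dates |>
  mutate(EndDate = mdy(EndDate)) |>          # text -> Date
  filter(EndDate >= ymd("2026-06-01")) |>    # assumed cutoff, adjust as needed
  arrange(desc(EndDate))                     # latest end date first

still_active
```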

Filter to awards matching keywords


filtered_awards <- awards |> filter(keyword_match)

Math REUs

https://docs.google.com/spreadsheets/d/1U-27BeHMSJCWumbNByal2tHyYo9wRVud9WoRE70E47Y/edit?gid=655344530#gid=655344530

library(googlesheets4)

# Deauthorize to access public sheets without credentials
gs4_deauth()

# Read in the math REU data: the default (first) sheet, then the sheet named "2025"
mathREU2026data <- read_sheet("1U-27BeHMSJCWumbNByal2tHyYo9wRVud9WoRE70E47Y")
mathREU2025data <- read_sheet("1U-27BeHMSJCWumbNByal2tHyYo9wRVud9WoRE70E47Y", sheet="2025")

filtered_mathREUs <- mathREU2025data |>
  mutate(
    keyword_match = str_detect(
      str_to_lower(`Topic(s) / Notes`),
      "data science")) |> 
  filter(keyword_match)
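One detail worth knowing: str_detect() returns NA when the text is missing, and filter() silently drops rows where the condition is NA. The sketch below uses made-up rows (toy_reus and the Program column are hypothetical) and coalesce() to turn a missing match into an explicit FALSE.

```r
library(dplyr)
library(stringr)

# Hypothetical rows; only the `Topic(s) / Notes` column name comes from the sheet
toy_reus <- tibble(
  Program = c("REU A", "REU B", "REU C"),
  `Topic(s) / Notes` = c("Data Science and statistics", NA, "Number theory")
)

toy_reus_flagged <- toy_reus |>
  mutate(
    # NA notes produce NA here
    raw_match = str_detect(str_to_lower(`Topic(s) / Notes`), "data science"),
    # Treat "no notes" as "no match"
    keyword_match = coalesce(raw_match, FALSE)
  )

toy_reus_flagged |> select(Program, raw_match, keyword_match)
```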

Try it out!

  • Choose keywords based on your own interests.
  • Replace the pattern inside str_detect() with your domain terms.

# Sample extraction / contact
filtered_awards$Abstract[3]
filtered_awards$PrincipalInvestigator[3] 

# Google and email this person to see if they have 
# 2026 summer opportunities. 
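One possible way to swap in your own terms is to keep them in a character vector and join them with "|" to form the pattern. The keywords and the title below are made-up placeholders, not taken from the awards data.

```r
library(stringr)

# Hypothetical interests; replace with your own (lowercase) terms
my_keywords <- c("machine learning", "genomics", "climate")

# Join the terms with "|" so the pattern means "any of these"
my_pattern <- str_c(my_keywords, collapse = "|")
my_pattern
# "machine learning|genomics|climate"

# Made-up title, lowercased to match the lowercase keywords
climate_hit <- str_detect(str_to_lower("REU Site: Climate Modeling"), my_pattern)
climate_hit
# TRUE
```

The resulting my_pattern can then be used in place of the pattern string inside str_detect() when building keyword_match.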