Homework 3: Functional Programming and Bootstrapping

Published

November 3, 2025

Due: November 18, 2025. I recommend attempting to complete it by November 11, 2025.

In this assignment you will practice with functions by creating a few of your own data science functions.

Instructions

1. Make NOAA Data Wrangling Functions

Create five functions for the NOAA data.

The function create_joined_temp_data() will take in the pathname to the file with the temperature data you wish to use for the analysis. You can use the URL method to source the other datasets. I will also include fips-10-4-to-iso-country-codes.csv for countries and station-metadf.csv for station names so you can source them, assuming they are in the same directory as your script. This function should perform the cleaning and joining tasks outlined in Coding Assignment 2 up to the specific filtering.

The function select_specific_temps() will have several parameters: the joined temperatures data frame, the country, the year to begin the data, the year to end the data (inclusive), and the month number. It will select the subset of the data that meets those specifications.

The functions summarize_by_year(), coverage_by_year(), and coverage_by_country() should take in the joined temperature data and return the respective tibbles.

You can see the video posted on Blackboard for a close example of what I’m expecting for the first two functions.

There will be a few differences from the video in how your functions perform.

  • The first function takes one parameter: the pathname in quotes.
  • The functions should be pure functions: no printing, file I/O, plotting, or changing globals. This might mean dropping some of the specs that don't meet the definition of a pure function.

When you begin writing functions, it might be tempting to have each one do many things at once. This isn't usually the best approach. Each function should be designed with a single, clear purpose.

In functional programming, we decide what to compute. Programs are built from pure functions that take input and return output without side effects. Side effects can include:

  • Changing global variables from inside the function
  • Printing output from inside the function
  • Writing to files
  • Plotting from inside the function

These behaviors are fine for quick exploration, but when we share code, they make it harder to use and test. Pure functions return an object; the user can then decide when to print, write to a file, or plot.
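
For example, here is a minimal sketch of the difference. The function names and data are hypothetical; the point is the pattern of returning a value rather than printing it:

library(dplyr)

# Not pure: prints from inside the function instead of returning a value.
mean_temp_impure <- function(data) {
  result <- summarize(data, mean_temp = mean(Temp, na.rm = TRUE))
  print(result)  # side effect
}

# Pure: returns the tibble; the caller decides whether to print, save, or plot it.
mean_temp_pure <- function(data) {
  summarize(data, mean_temp = mean(Temp, na.rm = TRUE))
}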

Avoid these behaviors in the functions that you write and intend to share. If you have any print statements for debugging or code chunks where you test the functions by calling them outside the function definition, remove them before you turn the assignment in.

Tasks 1-5 (10 points)

Create five functions for the NOAA data based on the code in Coding Assignment 2.

  1. create_joined_temp_data(pathname)
    • Input: a single character string pathname to the temperatures CSV.
    • Other files: assume fips-10-4-to-iso-country-codes.csv and station-metadf.csv are in the same directory as your script (or read via URL).
    • Output: the joined and cleaned tibble (as from Coding Assignment 2).
  2. select_specific_temps(joined_temps, country, year_begin, year_end, month)
    • Input: the joined temperatures tibble, the country, the year to begin the data, the year to end the data (inclusive), the month number.
    • Output: a tibble with the specific data.
  3. summarize_by_year(data)
    • For each year, calculate the mean, minimum, and maximum temperature across all stations.
    • Input: the joined temperatures data frame
    • Output: tibble with columns by these exact names:
      • Year
      • mean_temp (mean of Temp)
      • min_temp (min of Temp)
      • max_temp (max of Temp)
  4. coverage_by_year(data)
    • For each year, calculate the number of measurements recorded across all stations. Arrange by decreasing year.
    • Input: the joined temperatures data frame
    • Output: tibble with columns by these exact names:
      • Year
      • n_stations
  5. coverage_by_country(data)
    • For each country, calculate the total number of measurements. Arrange by decreasing number of stations.
    • Input columns required: Country.
    • Output: tibble with columns by these exact names:
      • Country
      • n_stations
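
As a rough sketch (not the required implementation), summarize_by_year() might follow this shape, assuming the joined data contains Year and Temp columns; the na.rm = TRUE arguments are an assumption in case any missing temperatures remain after cleaning:

library(dplyr)

summarize_by_year <- function(data) {
  data |>
    group_by(Year) |>
    summarize(
      mean_temp = mean(Temp, na.rm = TRUE),
      min_temp  = min(Temp, na.rm = TRUE),
      max_temp  = max(Temp, na.rm = TRUE),
      .groups = "drop"
    )
}

The other functions should follow the same pattern: take a data frame (or pathname) in, return a tibble, and do nothing else.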

2. Create Data Exploration Functions

Create four functions for estimating standard error and confidence intervals of proportions.

Estimating confidence intervals of a proportion from a sample

Create a function called SE_p_hat that will estimate the standard error of a sample proportion. It will have parameter n for number of samples and p for the proportion.

Create a function called CI_p_hat that computes a two‑sided confidence interval for a sample proportion. Parameters: p (proportion), n (sample size), and level (confidence level as a decimal). It should return a tibble with columns lower_ci and upper_ci in that order.

You can compute the appropriate z‑value from the confidence level using:

# example: 95% CI
level <- 0.95
qnorm(1 - (1 - level)/2)
#> [1] 1.959964

When you test your function, try different level values (e.g., 0.90, 0.95, 0.99) and different n to see how the interval width changes.
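
For instance, computing the z-value for several levels at once shows how it grows with the confidence level:

levels <- c(0.90, 0.95, 0.99)
qnorm(1 - (1 - levels)/2)
#> [1] 1.644854 1.959964 2.575829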

Tasks 6-7 (5 points)
  • Write the function SE_p_hat() according to the requirements.
  • Write the function CI_p_hat() according to the requirements.
    Your implementation should call SE_p_hat() internally.

Simulating confidence intervals of a proportion

To test the functions in this next part, use the infer package workflow to create a bootstrap distribution for a proportion from categorical data.

Your pipeline should include:

  • specify(response = ...)
  • generate(reps = ..., type = "bootstrap")
  • calculate(stat = "prop") (or stat = "mean" if your response is numeric 0/1 data)
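
As a sketch, a pipeline along these lines produces the bootstrap distribution; the survey data frame and its answer column are made up for illustration:

library(infer)
library(tibble)

# Hypothetical categorical data: 100 yes/no responses.
survey <- tibble(answer = factor(sample(c("yes", "no"), size = 100, replace = TRUE)))

boot_dist <- survey |>
  specify(response = answer, success = "yes") |>
  generate(reps = 1000, type = "bootstrap") |>
  calculate(stat = "prop")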

You will write functions to compute a percentile confidence interval from the bootstrap distribution and its bootstrap standard error.

Bootstrap standard error (formula)

For a bootstrap distribution with \(R\) replicates (or trials) yielding proportions \(\{\hat{p}_1, \hat{p}_2, \ldots, \hat{p}_R\}\), the bootstrap standard error is:

\[ \mathrm{SE}_{\text{boot}} = \sqrt{ \frac{1}{R - 1} \sum_{i = 1}^{R} \big(\hat{p}_i - \overline{p}\big)^2 }, \quad \text{where } \overline{p} = \frac{1}{R} \sum_{i = 1}^{R} \hat{p}_i. \]

In general, you get the standard error by taking the standard deviation of the stat values in your bootstrap tibble.
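
For example, assuming boot_dist was created with calculate() as above (which names the statistic column stat), the bootstrap standard error is:

sd(boot_dist$stat)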

You may compute the confidence interval from the bootstrap distribution using either technique shown in class.
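
For instance, a percentile interval keeps the middle level share of the bootstrap statistics; a sketch for a 95% interval, reusing boot_dist from above:

level <- 0.95
alpha <- 1 - level
quantile(boot_dist$stat, probs = c(alpha / 2, 1 - alpha / 2))
# equivalently: infer::get_confidence_interval(boot_dist, level = level, type = "percentile")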

Tasks 8-9 (5 points)
  • Write a function SE_p_boot() that takes in a bootstrap distribution and returns a single numeric equal to the bootstrap standard error of the statistic.
  • Write a function CI_p_boot() that takes in a bootstrap distribution and a confidence level (in decimal form) and returns a tibble with columns lower_ci and upper_ci containing the confidence interval computed from the bootstrap distribution.

Reminder of general rules:

  • All functions must be pure (no printing, plotting, file I/O, or mutation of globals).
  • Except for the standard error functions, all functions must return a tibble.
  • Keep function names and column names exactly as specified for grading.

3. Submission

In this assignment, rather than writing a report, you are creating your own package of functions that can be used by another user. The assignment will be graded through Gradescope, so it's important that you use the same function names, variable names, and column names as in the assignment. There is a script in Gradescope that tests your functions and compares their outputs with the outputs from the instructor's functions.

  • Submit a single file named hw3.R containing all function definitions (in the order listed above).
    You can start with this template: hw3.R
  • Your code should run with library(dplyr) and any other packages you used listed at the top of the file.
  • Test locally: Before submitting, run your script in RStudio to make sure every function works as expected.
  • Submit definitions only: Remove any test calls (e.g., lines where you run your functions).
  • The autograder will automatically source("hw3.R") and call your functions for testing.

4. Reporting (Optional)

There’s no required written report for this homework since at this time you’ll be preparing a report and a poster for the final project.

If you’d like to demonstrate your work and provide a short tutorial for your code, you may do so by submitting a rendered Quarto document as your third blog post.

There’s no extra credit for this, but it’s a good way to practice documenting your code and explaining your workflow. You might also find it useful later as personal notes or a reference for future programming projects.