```r
# example: 95% CI
level <- 0.95
qnorm(1 - (1 - level)/2)
#> [1] 1.959964
```
November 3, 2025
Due: November 18, 2025. I recommend attempting to complete it by November 11, 2025.
In this assignment you will practice with functions by creating a few of your own data science functions.
Create five functions for the NOAA data.
The function `create_joined_temp_data()` will take in the pathname to the file with the temperature data you wish to use for the analysis. You can use the URL method to source the other datasets. I will also include `fips-10-4-to-iso-country-codes.csv` for countries and `station-metadf.csv` for station names so you can source them, assuming they are in the same directory as your script. This function should perform the cleaning and joining tasks outlined in Coding Assignment 2, up to the specific filtering.
The function `select_specific_temps()` will have several parameters: the joined temperatures data frame, the country, the year to begin the data, the year to end the data (inclusive), and the month number. It will select the subset of the data that meets these specifications.
The functions `summarize_by_year()`, `coverage_by_year()`, and `coverage_by_country()` should take in the joined temperature data and return the respective tibbles.
You can see the video posted on Blackboard for a close example of what I’m expecting for the first two functions.
There will be a few differences in how it performs from the video.
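For orientation, here is a hedged sketch of what two of these functions might look like. The column names `Country.Country`, `Year`, `Month`, and `Temp` are assumptions based on names that appear elsewhere in this assignment; your joined data may differ, and this is a sketch, not the required implementation:

```r
library(dplyr)

# Sketch only: assumes the joined data has columns
# Country.Country, Year, Month, and Temp.
select_specific_temps <- function(joined_temps, country, year_begin,
                                  year_end, month) {
  joined_temps |>
    filter(Country.Country == country,
           Year >= year_begin,
           Year <= year_end,
           Month == month)
}

summarize_by_year <- function(data) {
  data |>
    group_by(Year) |>
    summarize(mean_temp = mean(Temp),
              min_temp = min(Temp),
              max_temp = max(Temp))
}
```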
When you begin coding with functions, it can be tempting to make each one do many things at once. This usually isn't the best approach. Each function should be designed with a single, clear purpose.
In functional programming, we describe what to compute. Programs are built from pure functions that take input and return output without side effects. Side effects include things like printing to the console, writing files, and producing plots.
These behaviors are fine for quick exploration, but when we share code, they make it harder to use and test. A pure function returns an object; the user can then decide when to print it, write it to a file, or plot it.
Avoid these behaviors in the functions that you write and intend to share. If you have any print statements for debugging or code chunks where you test the functions by calling them outside the function definition, remove them before you turn the assignment in.
Create five functions for the NOAA data based on the code in Coding Assignment 2.
- `create_joined_temp_data(pathname)`
  - `pathname`: path to the temperatures CSV.
  - Assumes `fips-10-4-to-iso-country-codes.csv` and `station-metadf.csv` are in the same directory as your script (or read via URL).
- `select_specific_temps(joined_temps, country, year_begin, year_end, month)`
- `summarize_by_year(data)`, returning a tibble with columns:
  - `Year`
  - `mean_temp` (mean of `Temp`)
  - `min_temp` (min of `Temp`)
  - `max_temp` (max of `Temp`)
- `coverage_by_year(data)`, returning a tibble with columns:
  - `Year`
  - `n_stations`
- `coverage_by_country(data)`, returning a tibble with columns:
  - `Country.Country`
  - `n_stations`

Create four functions for estimating the standard error and confidence intervals of proportions.
Create a function called `SE_p_hat` that estimates the standard error of a sample proportion. It will have parameters `n` for the number of samples and `p` for the proportion.
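A minimal sketch, assuming the usual large-sample formula \(\sqrt{p(1-p)/n}\) for the standard error of a sample proportion:

```r
# A sketch, not necessarily the required implementation:
# uses the large-sample standard error formula sqrt(p * (1 - p) / n).
SE_p_hat <- function(n, p) {
  sqrt(p * (1 - p) / n)
}

SE_p_hat(n = 100, p = 0.5)  # 0.05
```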
Create a function called CI_p_hat that computes a two‑sided confidence interval for a sample proportion. Parameters: p (proportion), n (sample size), and level (confidence level as a decimal). It should return a tibble with columns lower_ci and upper_ci in that order.
You can compute the appropriate z-value from the confidence level using `qnorm(1 - (1 - level)/2)` (see the example at the top of this page).
When you test your function, try different `level` values (e.g., 0.90, 0.95, 0.99) and different `n` to see how the interval width changes.
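A hedged sketch of `CI_p_hat()` under these requirements. The standard error is inlined here so the snippet is self-contained; the assignment asks you to call `SE_p_hat()` instead:

```r
library(tibble)

CI_p_hat <- function(p, n, level = 0.95) {
  z <- qnorm(1 - (1 - level) / 2)            # two-sided critical value
  se <- sqrt(p * (1 - p) / n)                # use SE_p_hat(n, p) in your version
  tibble(lower_ci = p - z * se,
         upper_ci = p + z * se)
}

CI_p_hat(p = 0.5, n = 100, level = 0.95)
```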
- Write `SE_p_hat()` according to the requirements.
- Write `CI_p_hat()` according to the requirements. It should use `SE_p_hat()` internally.

To test this next part, use the infer package workflow to create a bootstrap distribution for a proportion based on categorical data.
Your pipeline should include:
- `specify(response = ...)`
- `generate(reps = ..., type = "bootstrap")`
- `calculate(stat = "prop")` (or `stat = "mean"` for 0/1 data)

Write a function to compute a percentile confidence interval from the bootstrap distribution, and compute its bootstrap standard error.
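The pipeline above might be sketched as follows. The data frame `survey` and its `vote` column are made up for illustration; substitute your own categorical data:

```r
library(dplyr)
library(infer)

# Toy data, made up for illustration: 100 "yes"/"no" responses.
set.seed(42)
survey <- tibble(vote = sample(c("yes", "no"), 100, replace = TRUE))

boot_dist <- survey |>
  specify(response = vote, success = "yes") |>
  generate(reps = 1000, type = "bootstrap") |>
  calculate(stat = "prop")
```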
For a bootstrap distribution with \(R\) replicates (or trials) with proportions \(\{\hat{p}_1, \hat{p}_2, \ldots, \hat{p}_R\}\), the bootstrap standard error is:
\[ \mathrm{SE}_{\text{boot}} = \sqrt{ \frac{1}{R - 1} \sum_{i = 1}^{R} \big(\hat{p}_i - \overline{p}\big)^2 }, \]
where \(\overline{p}\) is the mean of the \(R\) bootstrap proportions.
In general, you get the standard error by taking the standard deviation of the `stat` values in your bootstrap tibble.
You may compute the confidence interval from the bootstrap distribution using either technique shown in class.
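As one illustration, here is a hedged sketch using the percentile technique; it assumes the bootstrap tibble has a `stat` column, as produced by `calculate()`:

```r
library(tibble)

# Bootstrap SE: the standard deviation of the replicate statistics.
SE_p_boot <- function(boot_dist) {
  sd(boot_dist$stat)
}

# Percentile CI: cut off alpha/2 in each tail of the bootstrap distribution.
CI_p_boot <- function(boot_dist, level = 0.95) {
  alpha <- 1 - level
  qs <- quantile(boot_dist$stat, probs = c(alpha / 2, 1 - alpha / 2))
  tibble(lower_ci = qs[[1]], upper_ci = qs[[2]])
}
```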
- `SE_p_boot()` takes in a bootstrap distribution and returns a single numeric equal to the bootstrap standard error of the statistic.
- `CI_p_boot()` takes in a bootstrap distribution and a confidence level (in decimal form) and returns a tibble with columns `lower_ci` and `upper_ci` containing the confidence interval computed from `boot_dist`.

Reminder of general rules:
In this assignment, rather than writing a report, you are creating your own package of functions that can be used by another user. The assignment will be graded through Gradescope, so it's important that you use the same variable names and column names as in the assignment. A script in Gradescope tests your functions and compares their outputs with the outputs from the instructor's functions.
- `hw3.R` containing all function definitions (in the order listed above).
- `library(dplyr)` and any other packages you used, listed at the top of the file.
- The grading script will run `source("hw3.R")` and call your functions for testing.

There's no required written report for this homework, since at this time you'll be preparing a report and a poster for the final project.
If you'd like to demonstrate your work and provide a short tutorial for your code, you may do so by submitting a rendered Quarto document as your third blog post.
There’s no extra credit for this, but it’s a good way to practice documenting your code and explaining your workflow. You might also find it useful later as personal notes or a reference for future programming projects.