Final Project Tasks
Overview
For your group project, your team should already have agreed on a dataset to explore and identified the questions you want to answer. Now, each group member will choose one task to focus on that demonstrates a different aspect of your data-science skills.
Your group does not need to complete every task. For instance, you may find that little data wrangling is required. Some overlap is also fine. For example, multiple members might test different hypotheses or model different relationships using the same dataset.
By the end, your group should have a shared understanding of everyone’s contributions and benefit from discussing your approaches and findings with one another.
Tasks
1. Data Wrangling
Tackle a data-cleaning challenge using R. Work with a dataset in an unusual format (for example, .dat, fixed-width text, or nested JSON) that needs wrangling, and convert it into a clean tibble ready for analysis.
Write a function that reads, reformats, and cleans the data. Include it in a small package with brief documentation and a short tutorial explaining how to use it.
For example, the NOAA weather station data are .dat files that require custom parsing to become tidy tables. Your task would be similar: turn messy files into tidy data.
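As a rough sketch of what such a function might look like, the snippet below reads a hypothetical fixed-width .dat file with readr and returns a tidy tibble. The column widths, names, and derived fields are placeholder assumptions; replace them with your file's actual layout.

```r
library(readr)
library(dplyr)

# Hypothetical layout: adjust the widths, names, and types to your actual file.
read_station_dat <- function(path) {
  raw <- read_fwf(
    path,
    col_positions = fwf_widths(
      widths    = c(6, 8, 5),
      col_names = c("station_id", "date", "temp_f")
    ),
    col_types = cols(
      station_id = col_character(),
      date       = col_character(),
      temp_f     = col_double()
    )
  )

  raw |>
    mutate(
      date   = as.Date(date, format = "%Y%m%d"),  # parse YYYYMMDD strings
      temp_c = (temp_f - 32) * 5 / 9              # derive a metric column
    ) |>
    filter(!is.na(temp_f))                        # drop rows that failed to parse
}

# Usage: stations <- read_station_dat("data/stations.dat")
```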
2. Exploratory Data Analysis
Write an introduction and overview of your data. Your analysis should include descriptions of the different variables, as well as the following (a short plotting sketch appears after this list):
- Histograms of continuous variables with commentary on the shape of the distribution (Is it unimodal? Skewed?).
- If a distribution is not unimodal, group the variable by a category and check whether the subgroups have fewer modes.
- Adjust bins or binwidths as needed to reveal structure.
- Summaries of continuous variables based on their distribution (use medians for skewed data).
- Identify any outliers.
- Visualize proportions of categorical variables or other meaningful summaries.
- Include boxplots comparing continuous variables across categories.

The goal is to report on all relevant variables, then choose the most meaningful visualizations for your final poster.
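Here is a minimal ggplot2/dplyr sketch of these plot types. The dataset and the names `my_data`, `height`, and `group` are made up for illustration; substitute your own tibble and variables.

```r
library(ggplot2)
library(dplyr)

# Toy data standing in for your own dataset (replace with your tibble)
set.seed(1)
my_data <- data.frame(
  group  = rep(c("A", "B"), each = 100),
  height = c(rnorm(100, 165, 7), rnorm(100, 178, 7))
)

# Histogram of a continuous variable; try several binwidths to reveal structure
ggplot(my_data, aes(x = height)) +
  geom_histogram(binwidth = 2) +
  labs(x = "Height (cm)", y = "Count")

# If the distribution is not unimodal, facet by a category to look for subgroups
ggplot(my_data, aes(x = height)) +
  geom_histogram(binwidth = 2) +
  facet_wrap(~ group)

# Skew-robust summaries and a quick outlier check
my_data |>
  group_by(group) |>
  summarize(median = median(height), iqr = IQR(height), n = n())

# Boxplots comparing the continuous variable across categories
ggplot(my_data, aes(x = group, y = height)) +
  geom_boxplot()
```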
3. Hypothesis Testing: Testing a Null Statistic
Test a single statistic against a null value using your sample data. Choose a statistic (such as a mean or proportion) and test whether your sample differs significantly from a known or expected population parameter. Use bootstrapping to compute the distribution of that statistic in the null world.
- Null hypotheses are always stated in terms of population parameters, not sample statistics.
- Examples:
- \(H_0: p = 0.5\) — the true population proportion is 50%.
- \(H_0: \mu = 100\) — the true mean in the population is 100.
- Your null hypothesis represents what you would expect in the population if there were no real effect or no difference.
- The population parameter should be based on something realistic or documented, not arbitrary.
- If your variable has a known population value (e.g., from a census or official dataset), cite it from a credible source (e.g., BLS, CDC/WHO, World Bank/OECD, Pew/Gallup, Statista).
- If not, justify your expected value (e.g., “assuming equal proportions,” “no difference between groups,” or “using the previous year’s data as the status quo”).
Example (annual average reference)
According to the U.S. Bureau of Labor Statistics, the average unemployment rate in 2024 was 4.1% (CPS, annual average).
We test whether our sample of recent graduates shows a significantly lower unemployment rate.
\[ H_0: p = 0.041 \quad \text{vs.} \quad H_A: p < 0.041 \]
- Propose an alternative hypothesis based on your dataset. Use bootstrapping to create the null world, compute the p-value, and decide whether you can reject the null hypothesis. A code sketch of this workflow appears below.
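One way to build the null world is a simple simulation in base R, sketched below for the BLS example. The sample size and observed proportion are illustrative assumptions, not real results; swap in your own data and reference value.

```r
set.seed(2024)

n      <- 250     # hypothetical sample size of recent graduates
p_hat  <- 0.028   # hypothetical observed unemployment proportion in the sample
p_null <- 0.041   # reference value from the BLS annual average

# Null world: 5,000 samples of size n drawn from a population where H0 is true
null_props <- replicate(5000, mean(rbinom(n, size = 1, prob = p_null)))

# One-sided p-value for H_A: p < 0.041 -- how often does the null world
# produce a proportion at least as low as the one we observed?
p_value <- mean(null_props <= p_hat)
p_value

# Plot the null distribution with the observed statistic marked
hist(null_props, main = "Null distribution of the sample proportion",
     xlab = "Proportion unemployed")
abline(v = p_hat, col = "red", lwd = 2)
```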
4. Hypothesis Testing: Permutation Test
Design and run a randomization (permutation) test to assess whether there is a relationship between an explanatory variable and a response in your data (a code sketch follows this list).
- Pick one treatment effect: difference in proportions (0/1 outcome, 2 groups), difference in means (numeric outcome, 2 groups), or correlation (two numeric variables).
- State clear hypotheses in terms of population parameters (e.g., \(H_0: p_1 - p_2 = 0\), \(H_0: \mu_1 - \mu_2 = 0\), or \(H_0: \rho = 0\)).
- Compute the observed statistic from your sample.
- Create a null distribution by permuting labels (or pairing) many times (e.g., 5,000) and recomputing the statistic.
- Report the p-value and include a plot of the null distribution with the observed statistic marked.
- Write a brief conclusion in context (avoid causal language unless your design supports it).
- Add a bootstrap 95% CI for the same statistic and comment on how it aligns with your permutation test.
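Below is a base-R sketch of this workflow for a difference in means. The data frame, its column names (`outcome`, `group`), and the simulated values are placeholders for your own variables.

```r
set.seed(2024)

# Toy data: a numeric outcome and a two-level grouping variable
# (replace `outcome` and `group` with your own columns)
dat <- data.frame(
  outcome = c(rnorm(40, mean = 50, sd = 10), rnorm(40, mean = 54, sd = 10)),
  group   = rep(c("A", "B"), each = 40)
)

# Observed statistic: difference in group means (B minus A)
obs_diff <- diff(tapply(dat$outcome, dat$group, mean))

# Null distribution: shuffle the group labels 5,000 times and recompute
perm_diffs <- replicate(5000, {
  shuffled <- sample(dat$group)
  diff(tapply(dat$outcome, shuffled, mean))
})

# Two-sided p-value and a plot of the null distribution
p_value <- mean(abs(perm_diffs) >= abs(obs_diff))
hist(perm_diffs, main = "Permutation null distribution",
     xlab = "Difference in means (B - A)")
abline(v = obs_diff, col = "red", lwd = 2)

# Bootstrap 95% CI for the same statistic (resampling within each group)
boot_diffs <- replicate(5000, {
  a <- sample(dat$outcome[dat$group == "A"], replace = TRUE)
  b <- sample(dat$outcome[dat$group == "B"], replace = TRUE)
  mean(b) - mean(a)
})
quantile(boot_diffs, c(0.025, 0.975))
```

Resampling within each group keeps the group sizes fixed, which is one reasonable design choice for a two-group bootstrap CI; compare the interval with your permutation p-value when you write your conclusion.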
5. Modeling Relationships (Correlation and Regression)
Examine a predictive or potentially causal relationship between two quantitative variables using correlation and linear regression.
- Identify two variables where you suspect one influences or predicts the other (e.g., education and income, GDP and democracy, study time and exam score).
- Clearly state which variable is explanatory (predictor) and which is the response (outcome).
- Look at the correlation. Report and interpret the correlation coefficient \(r\) in practical terms (direction, strength). Does a linear association look appropriate for these variables?
- Fit a line to the data with linear regression. Report and interpret the slope (in the units of your variables) and \(R^2\).
- Check whether any preprocessing (for example, transforming or filtering the data) gives a better \(r\) or \(R^2\).
- Visualize the relationship with a scatterplot and a fitted line from a linear model.
- Discuss whether the relationship appears causal or merely associative, and justify your reasoning.
- Use bootstrapping to create a 95% confidence interval for the slope or correlation and comment on how it supports your interpretation.
- If you are working with time-series data and the correlation is weak, try adjusting the time window to see whether the relationship strengthens. A code sketch of this workflow follows the list.
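A compact sketch of the correlation-plus-regression steps is shown below. The variables `x_var` and `y_var` are hypothetical stand-ins for your explanatory and response variables, and the simulated data only illustrate the mechanics.

```r
library(ggplot2)

# Toy data standing in for your two quantitative variables
set.seed(2024)
dat <- data.frame(x_var = runif(120, 0, 20))
dat$y_var <- 3 + 1.5 * dat$x_var + rnorm(120, sd = 5)

# Correlation: direction and strength
cor(dat$x_var, dat$y_var)

# Linear regression: slope (response units per unit of x) and R^2
fit <- lm(y_var ~ x_var, data = dat)
summary(fit)

# Scatterplot with the fitted line
ggplot(dat, aes(x = x_var, y = y_var)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)

# Bootstrap 95% CI for the slope: resample rows and refit the model
boot_slopes <- replicate(5000, {
  resampled <- dat[sample(nrow(dat), replace = TRUE), ]
  coef(lm(y_var ~ x_var, data = resampled))[["x_var"]]
})
quantile(boot_slopes, c(0.025, 0.975))
```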
Once you have completed your tasks, you will bring them together in the final poster presentation and report.