Data Proposal
DATS 1001 — Research Project First Draft
Due: ~~October 28~~ November 4, 2025
Overview
This is your initial step toward your group’s final report and poster presentation. Your goal is to explore potential data sources, understand their structure, and begin preparing to combine and analyze them.
You don’t need to conduct your full analysis yet — this stage is focused on data exploration, documentation, and planning.
Data analysis and data science are only as powerful as the data we have to work with.
Good data provides the foundation for insight and knowledge.
All stages of this project are meant to be worked on in a group. Before you begin, I encourage you to read over this page on working with a group.
- Choose and explore your data.
- Document your data sources and structure.
- Answer the exploration questions provided below.
- Submit a draft report to Gradescope (one submission per group).
How to get started
Each group will prepare one shared Quarto document named data-proposal.qmd.
You’ll work together in Posit Cloud and submit both a .qmd and a .pdf version (data-proposal.pdf) of that document to Gradescope.
Set up Posit Cloud
- Every group member should create a free account at posit.cloud.
- One person (the project lead) should create a New Space by clicking the option in the left bar. Name the space as you’d like and add your team members as members.
- Create a new Posit Cloud project.
- Inside the shared project, create a new Quarto file named `data-proposal.qmd` and have it render as a PDF. Add all members’ names under `author` at the top of the document.
Your Quarto document might start out looking similar to this:
---
title: "Data Proposal"
author:
- Sarah Cassie Burnett
- Johnny Pickles
format: pdf
---
Render to PDF in Posit Cloud
- Open your `data-proposal.qmd` file in the shared project.
- In the top toolbar, click the Render button (blue “Render” icon).
- Wait for rendering to finish; a new file named something like `data-proposal.pdf` will appear in the Files pane on the right.
- Click the gear icon labeled More, choose Export…, and download the PDF to your computer as `data-proposal.pdf`.
- Upload the PDF to Gradescope — only one group member should submit.
Before submitting
- The PDF includes your group number and names.
- All plots, tables, and text appear correctly.
- The file name follows this format: `data-proposal.pdf`.
- Submit one copy per group to Gradescope.
1. Find at least three datasets
- Ideally, each dataset should have ≥ 1,000 rows (observations).
  Take note of the datasets you find, the variables you’re interested in using, and the number of observations, even if a dataset has fewer than 1,000.
- Include at least 5 variables/columns overall, of different types (continuous, categorical, date/time, etc.).
- At least two datasets should share a variable that can link them (e.g., `country`, `year`, `state`, `id`), or you should be able to create a key from existing variables.
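One way to check a candidate dataset against these guidelines is to look at its size and column types in R. The sketch below uses R's built-in `airquality` data as a stand-in for a dataset you have found; the name `candidate` is just a placeholder, and in practice you would load your own file (for example with `read.csv()`).

```r
# Stand-in for a dataset you downloaded; replace with your own data,
# e.g. candidate <- read.csv("your-file.csv")  (hypothetical file name)
candidate <- airquality  # built-in R dataset of daily air quality readings

n_rows <- nrow(candidate)               # number of observations (rows)
n_cols <- ncol(candidate)               # number of variables (columns)
col_types <- sapply(candidate, class)   # storage type of each column

n_rows
n_cols
col_types

# Does this dataset meet the suggested size of at least 1,000 rows?
n_rows >= 1000
```

Note that the storage type R reports (such as `integer` or `numeric`) is not always the same as the statistical type you would list in your notes (such as categorical or continuous), so treat the output as a starting point.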
Here are some links to datasets.
2. Document your datasets
Create a small data dictionary for each dataset:
| Variable | Type | Description | Units / Example |
|---|---|---|---|
| country | categorical | Country name | "United States" |
| year | integer | Year of observation | 2022 |
| gdp_per_capita | continuous | GDP per person | 65342.1 |
Also include:
- Dataset title and source URL
- Unit of observation (what does one row represent?)
- Time period covered
- Geographic scope (if applicable)
Use Google Sheets to collect this information together.
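If it helps, you can generate a rough skeleton for a data dictionary directly from a data frame and then fill in the descriptions by hand. This is only a sketch: it uses the built-in `airquality` data as a placeholder, and the column names of `dictionary` are chosen for illustration.

```r
# Placeholder data; swap in your own data frame
dat <- airquality

# One row per variable: name, storage type, an empty description to fill in,
# and the first value in the column as an example
dictionary <- data.frame(
  variable = names(dat),
  type = sapply(dat, class),
  description = "",
  example = sapply(dat, function(col) as.character(col[1])),
  row.names = NULL
)

dictionary
```

The `type` column here is R's storage type, so you will likely want to rewrite it by hand as categorical, continuous, date/time, and so on before copying the table into your shared sheet.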
3. Assess data quality
Answer these questions as a group:
- What is the range of your data (min/max for each numeric variable)?
- How many observations do you have for each variable?
- Do you have any missing or unusable values (`NA`, `NaN`, `"unknown"`, etc.)?
- For variables with missing values, how many valid observations remain?
- Are there any duplicate rows?
- Are there variables that could have inconsistent or messy measurements (e.g., units, capitalization, categories)?
Then outline a brief cleaning plan: How will you handle missing values? How will you standardize categories or units?
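The questions above can be answered with a few lines of base R. The following is a minimal sketch, again using the built-in `airquality` data as a placeholder for your own dataset; the `messy` vector at the end is made up purely to illustrate inconsistent labels.

```r
# Placeholder data; swap in your own data frame
dat <- airquality

# Min and max of every numeric column, ignoring missing values
num_cols <- sapply(dat, is.numeric)
ranges <- sapply(dat[num_cols], range, na.rm = TRUE)

# Missing and valid (non-missing) observations per column
n_missing <- colSums(is.na(dat))
n_valid <- colSums(!is.na(dat))

# Number of rows that exactly repeat an earlier row
n_dupes <- sum(duplicated(dat))

ranges
n_missing
n_valid
n_dupes

# Made-up example of inconsistent labels, standardized by
# trimming whitespace and converting to lower case
messy <- c("USA", "usa ", " Usa")
clean <- tolower(trimws(messy))
unique(clean)
```

Values coded as text, such as `"unknown"`, are not counted by `is.na()`, so you may need to check for them separately or convert them to `NA` as part of your cleaning plan.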
4. Merge plan
If you plan to combine datasets:
- Which variable(s) will you use to link them?
- What join will you use (e.g., `left_join`, `inner_join`)?
- If a key variable doesn’t exist, how could you construct one?
  (For example: combine `country` + `year`.)
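To make the difference between the two joins concrete, here is a small sketch using dplyr. It assumes the dplyr package is installed, and the two tables (`gdp` and `pop`) contain made-up values for illustration only.

```r
library(dplyr)

# Made-up toy tables; the values are not real data
gdp <- data.frame(
  country = c("A", "A", "B"),
  year = c(2020, 2021, 2020),
  gdp_per_capita = c(100, 110, 200)
)
pop <- data.frame(
  country = c("A", "B", "B"),
  year = c(2020, 2020, 2021),
  population = c(10, 20, 21)
)

# left_join keeps every row of gdp; rows with no match in pop
# get NA for population
left <- left_join(gdp, pop, by = c("country", "year"))

# inner_join keeps only rows whose country-year pair is in both tables
inner <- inner_join(gdp, pop, by = c("country", "year"))

# If you prefer a single linking column, you can construct a key
gdp$key <- paste(gdp$country, gdp$year, sep = "_")

left
inner
gdp$key
```

Comparing the number of rows before and after a join is a quick way to see how many observations failed to match.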
If you do not have any datasets you would merge, discuss:
- What other data would be useful to add, and why would it help your analysis?
- Can you make a new variable from your existing data to stand in for something you don’t have?
- How can you tell if your current dataset already gives you enough information to answer your question?
5. Early visual exploration
Create at least one simple summary visualization.
For example:
- Histogram or boxplot for continuous variables
- Bar chart for categorical variables
- Scatter plot for potential relationships
Add short captions describing what the plot shows.
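As one possible starting point, the sketch below draws a histogram with ggplot2. It assumes the ggplot2 package is installed and uses the built-in `airquality` data as a placeholder. The lines beginning with `#|` are Quarto chunk options that set a label and a caption when the code sits in an `{r}` chunk of your `.qmd` file; plain R treats them as comments.

```r
#| label: fig-temp-hist
#| fig-cap: "Histogram of daily temperature readings in the built-in airquality data."

library(ggplot2)

# Histogram of a continuous variable (daily temperature), in 5-degree bins
p <- ggplot(airquality, aes(x = Temp)) +
  geom_histogram(binwidth = 5) +
  labs(
    title = "Distribution of daily temperature",
    x = "Temperature (degrees F)",
    y = "Number of days"
  )

p
```

For a categorical variable you could swap `geom_histogram()` for `geom_bar()`, and for a relationship between two variables, `geom_point()` with both `x` and `y` mapped.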
6. Research questions
Each group member should write five questions they’re curious about based on the data.
Together, choose 3–5 that feel most promising for your eventual project.
7. Group logistics
- Assign roles (data leads, cleaning, documentation, visualization).
- Decide where you’ll store data and work together (GitHub, Google Drive, Posit Cloud, etc.).
- Summarize your plan in a short Markdown or Quarto file (`data-proposal.qmd`) and render it as a PDF (`data-proposal.pdf`). Submit the PDF to Gradescope.