Data Proposal

DATS 1001 — Research Project First Draft

Due: November 4, 2025 (previously October 28)

Overview

This is your initial step toward your group’s final report and poster presentation. Your goal is to explore potential data sources, understand their structure, and begin preparing to combine and analyze them.

You don’t need to conduct your full analysis yet — this stage is focused on data exploration, documentation, and planning.

Data analysis and data science are only as powerful as the data we have to work with.
Good data provides the foundation for insight and knowledge.

All stages of this project are meant to be worked on in a group. Before you begin, I encourage you to read over this page on working with a group.

Task

Choose and explore your data.
Document your data sources and structure.
Answer the exploration questions provided below.
Submit a draft report to Gradescope (one submission per group).

How to get started

Each group will prepare one shared Quarto document named data-proposal.qmd.
You’ll work together in Posit Cloud and submit both a .qmd and a .pdf version (data-proposal.pdf) of that document to Gradescope.

Set up Posit Cloud

  1. Every group member should create a free account at posit.cloud.
  2. One person (the project lead) should create a New Space by clicking the option in the left bar. Name the space as you’d like and add your team members as members.
  3. In that space, create a new Posit Cloud project.
  4. Inside the shared project, create a new Quarto file named data-proposal.qmd and have it render as a PDF. Add all group members’ names under author at the top of the document.

Your Quarto document might start out looking similar to this:

---
title: "Data Proposal"
author:
  - Sarah Cassie Burnett
  - Johnny Pickles
format: pdf
---

Render to PDF in Posit Cloud

  1. Open your data-proposal.qmd file in the shared project.
  2. In the top toolbar, click the Render button (blue “Render” icon).
  3. Wait for rendering to finish — a new file named something like
    data-proposal.pdf will appear in the Files pane on the right.
  4. In the Files pane, click the gear icon labeled More, choose Export…, and then click Download to save the PDF to your computer as data-proposal.pdf.
  5. Upload the PDF to Gradescope — only one group member should submit.

Before submitting

  • The PDF includes your group number and names.
  • All plots, tables, and text appear correctly.
  • The file name follows this format: data-proposal.pdf.
  • Submit one copy per group to Gradescope.

1. Find at least three datasets

  • Ideally, each dataset should have ≥ 1,000 rows (observations).
    Take note of the datasets you find, the variables you’re interested in using, and the number of observations, even if a dataset has fewer than 1,000.
  • Across your datasets, include at least 5 variables (columns) of different types (continuous, categorical, date/time, etc.).
  • Try to find at least two datasets that share a variable that can link them (e.g., country, year, state, id), or for which you can create a key from existing variables.

Tip

Here are some links to datasets.

2. Document your datasets

Create a small data dictionary for each dataset:

Variable        Type         Description          Units / Example
country         categorical  Country name         "United States"
year            integer      Year of observation  2022
gdp_per_capita  continuous   GDP per person       65342.1
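
The variable names, storage types, and example values in a dictionary like this can be drafted from the data frame itself. The following is a minimal sketch in base R, assuming a hypothetical data frame called gdp whose values are made up for illustration. R reports storage types (character, integer, numeric) rather than labels like categorical or continuous, so the Type and Description columns still need to be written by hand.

```r
# Toy data frame standing in for a real dataset (values are made up)
gdp <- data.frame(
  country = c("United States", "Canada", "Mexico"),
  year = c(2022L, 2022L, 2022L),
  gdp_per_capita = c(65342.1, 52000.5, 11000.3)
)

# One row per variable: its name, R storage type, and first value as an example
data_dictionary <- data.frame(
  variable = names(gdp),
  type = vapply(gdp, function(col) class(col)[1], character(1)),
  example = vapply(gdp, function(col) as.character(col[1]), character(1)),
  row.names = NULL
)

print(data_dictionary)
```

The resulting table can be pasted into a shared spreadsheet and extended with descriptions and units.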

Also include:

  • Dataset title and source URL
  • Unit of observation (what does one row represent?)
  • Time period covered
  • Geographic scope (if applicable)

Tip

Use Google Sheets to collect this information together.

3. Assess data quality

Answer these questions as a group:

  1. What is the range of your data (min/max for each numeric variable)?
  2. How many observations do you have for each variable?
  3. Do you have any missing or useless information (NA, NaN, "unknown", etc.)?
  4. For variables with missing values, how many valid observations remain?
  5. Are there any duplicate rows?
  6. Are there variables that could have inconsistent or messy measurements (e.g., units, capitalization, categories)?
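
Most of these questions can be answered with a few lines of base R. The sketch below is one possible approach, not a required method; the data frame df and its values are made up so that each check has something to find. Note that is.na() only detects true NA values, so placeholder strings such as "unknown" need a separate check.

```r
# Toy data with deliberate problems (made-up values, for illustration only)
df <- data.frame(
  country = c("USA", "usa", "Canada", "Canada", NA),
  year = c(2020, 2021, 2021, 2021, 2022),
  gdp_per_capita = c(63000, NA, 52000, 52000, 10000)
)

# Question 1: min and max of each numeric variable, ignoring missing values
numeric_cols <- df[sapply(df, is.numeric)]
ranges <- sapply(numeric_cols, range, na.rm = TRUE)

# Questions 2 and 4: valid (non-missing) observations per variable
n_valid <- colSums(!is.na(df))

# Question 3: missing values per variable
n_missing <- colSums(is.na(df))

# Question 5: number of rows that exactly repeat an earlier row
n_duplicates <- sum(duplicated(df))

# Question 6: distinct spellings of a categorical variable
country_levels <- unique(df$country)

print(ranges)
print(n_missing)
print(n_duplicates)
print(country_levels)
```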

Then outline a brief cleaning plan: How will you handle missing values? How will you standardize categories or units?
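
As an illustration of what such a plan might translate to, here is a small base R sketch. The data frame raw, its columns, and its values are hypothetical. Dropping incomplete rows is only one possible choice; your group may decide to keep them or fill them in another way.

```r
# Toy data (made-up values): mixed capitalization, stray whitespace,
# a placeholder string, a missing number, and a duplicate row
raw <- data.frame(
  state = c("Ohio", "ohio ", "OHIO", "unknown", "Texas", "Texas"),
  income = c(50, 55, NA, 60, 70, 70)
)

clean <- raw

# Standardize categories: trim whitespace and use one capitalization
clean$state <- toupper(trimws(clean$state))

# Treat the placeholder string as a missing value
clean$state[clean$state == "UNKNOWN"] <- NA

# Drop exact duplicate rows, then rows with any missing value
clean <- clean[!duplicated(clean), ]
clean <- clean[complete.cases(clean), ]

print(clean)
```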

4. Merge plan

If you plan to combine datasets:

  • Which variable(s) will you use to link them?
  • What join will you use (e.g., left_join, inner_join)?
  • If a key variable doesn’t exist, how could you construct one?
    (For example: combine country + year.)
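
The difference between the two joins, and the idea of a constructed key, can be seen in a short example. This sketch assumes the dplyr package is installed; the tables gdp and life and all of their values are invented for illustration.

```r
library(dplyr)

# Two toy tables (made-up values) that share country and year
gdp <- data.frame(
  country = c("A", "A", "B"),
  year = c(2020, 2021, 2020),
  gdp_per_capita = c(10, 11, 20)
)
life <- data.frame(
  country = c("A", "B", "B"),
  year = c(2020, 2020, 2021),
  life_expectancy = c(70, 75, 76)
)

# Construct a single key column by pasting country and year together
gdp <- mutate(gdp, key = paste(country, year, sep = "_"))
life <- mutate(life, key = paste(country, year, sep = "_"))

# inner_join keeps only the keys that appear in both tables
both <- inner_join(gdp, select(life, key, life_expectancy), by = "key")

# left_join keeps every row of gdp; rows without a match get NA
all_gdp <- left_join(gdp, select(life, key, life_expectancy), by = "key")

print(both)
print(all_gdp)
```

dplyr joins also accept several columns at once, for example by = c("country", "year"), so a pasted key is mainly useful when the linking information has to be assembled from differently formatted columns.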

If you do not have any datasets you would merge, discuss:

  • What other data would be useful to add, and why would it help your analysis?
  • Can you make a new variable from your existing data to stand in for something you don’t have?
  • How can you tell if your current dataset already gives you enough information to answer your question?

5. Early visual exploration

Create at least one simple summary visualization.
For example:

  • Histogram or boxplot for continuous variables
  • Bar chart for categorical variables
  • Scatter plot for potential relationships

Add short captions describing what the plot shows.
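
A first plot can be very simple. The sketch below uses base R graphics and R's built-in mtcars data as a stand-in for your own dataset. The two lines starting with #| are Quarto chunk options; they take effect when the code is placed in an executable {r} chunk of a .qmd file, where fig-cap supplies the caption. The label and caption text here are only examples.

```r
#| label: fig-mpg-hist
#| fig-cap: "Distribution of fuel efficiency (miles per gallon) across the 32 cars in mtcars."

# Histogram of one continuous variable from the built-in mtcars data
mpg_hist <- hist(
  mtcars$mpg,
  main = "Fuel efficiency in mtcars",
  xlab = "Miles per gallon",
  col = "grey80"
)
```

Swapping hist() for boxplot() or barplot(table(...)) gives the other chart types listed above with the same structure.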

6. Research questions

Each group member should write five questions they’re curious about based on the data.
Together, choose 3–5 that feel most promising for your eventual project.

7. Group logistics

  • Assign roles (data leads, cleaning, documentation, visualization).
  • Decide where you’ll store data and work together (GitHub, Google Drive, Posit Cloud, etc.).
  • Summarize your plan in a short Markdown or Quarto file (data-proposal.qmd) and render it as a PDF (data-proposal.pdf). Submit the PDF to Gradescope.