Homework 4: Linear regression lines

Published

November 14, 2025

Due: November 21, 2025.

In this assignment you will practice writing functions for simple linear regression.

You will write functions to:

  • convert a numeric vector to standard units (z-scores),
  • compute the correlation coefficient,
  • compute the slope and intercept of the least squares regression line,
  • compute predicted values for new x-values, and
  • construct a plot of the fitted regression line.

All of your functions should work with any numeric variables, not just a specific dataset.

Instructions

1. Setup

You may assume the following packages are loaded at the top of your script:

library(dplyr)
library(ggplot2)

All of your functions should be pure functions:

  • No printing inside the functions (no calls to print(), cat(), show(), etc.).
  • No file I/O (no read_*, write_*, etc. inside functions).
  • No plotting inside functions except for the final plot_regression_line() function, which should return a ggplot object (not print it).
  • No changes to global variables.

Your script should define the following functions in this order:

standard_units <- function(x) { }

correlation <- function(data, x, y) { }

slope <- function(data, x, y) { }

intercept <- function(data, x, y) { }

y_hat <- function(data, x, y, x_new) { }

plot_regression_line <- function(data, x, y) { }

Task 0

Start from the hw4.R template posted on Blackboard. Also download the hybrid car dataset hybrid.csv for your own testing.

2. Standard units

The first function converts a numeric vector to standard units (also called z-scores).

In standard units, each value is measured in terms of how many standard deviations it is above or below the mean.

For a numeric vector \(x_1, x_2, \dots, x_n\), the value in standard units is

\[ \text{SU}(x_i) = \frac{x_i - \bar{x}}{s_x}, \]

where \(\bar{x}\) is the sample mean of \(x\) and \(s_x\) is the sample standard deviation of \(x\), defined as follows.

The sample mean of \(x\) is

\[\bar{x} \;=\; \frac{1}{n} \sum_{i=1}^{n} x_i.\]

The sample standard deviation of \(x\) is

\[s_x \;=\; \sqrt{ \frac{1}{\,n - 1\,} \sum_{i=1}^{n} (x_i - \bar{x})^2 }.\]

Task 1 (3 points)

Write a function standard_units(x) that:

  • Takes a numeric vector x.
  • Returns a numeric vector of the same length giving each element in standard units.
  • Uses the sample mean and sample standard deviation (mean() and sd()).
  • Does not modify any global variables or print anything.

You may assume x has length at least 2 and contains at least two non-missing values.
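
For example, here is a quick interactive check (keep test code like this out of the hw4.R you submit). On a toy vector the answer is easy to verify by hand, and base R's scale() uses the same mean()/sd() convention, so it should agree with your function:

x_toy <- c(1, 2, 3)              # toy vector chosen for illustration

standard_units(x_toy)            # should give -1, 0, 1
as.numeric(scale(x_toy))         # scale() also centers by mean() and divides by sd()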

3. Correlation

The Pearson correlation coefficient \(r\) measures the strength and direction of a linear relationship between two quantitative variables.

\[ \begin{aligned} r &= \frac{\text{cov}(x, y)}{s_x s_y} \\ &= \frac{ \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) }{ \sqrt{ \sum_{i=1}^{n} (x_i - \bar{x})^2 } \sqrt{ \sum_{i=1}^{n} (y_i - \bar{y})^2 } } \end{aligned} \]

The correlation \(r\) between two numeric variables can also be defined using standard units:

  • Convert \(x\) to standard units: \(\text{SU}(x_i)\).
  • Convert \(y\) to standard units: \(\text{SU}(y_i)\).
  • Multiply the standard units pairwise, sum the products, and divide by \(n - 1\):

\[ r = \frac{1}{n-1} \sum_{i=1}^n \text{SU}(x_i) \, \text{SU}(y_i). \]

We will compute correlation using this formula.
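
As a small worked example, take the toy data \(x = (1, 2, 3)\) and \(y = (2, 4, 6)\). Then \(\text{SU}(x) = (-1, 0, 1)\) and \(\text{SU}(y) = (-1, 0, 1)\), so

\[ r = \frac{1}{3 - 1} \big[ (-1)(-1) + (0)(0) + (1)(1) \big] = \frac{2}{2} = 1, \]

as expected for points that lie exactly on a line with positive slope.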

Task 2 (4 points)

Write a function correlation(data, x, y) that:

  • Takes a data frame or tibble data.
  • The arguments x and y are column names passed using tidy evaluation (for example correlation(df, height, weight)).
  • Extracts the x and y columns from data, converts each to standard units using your standard_units() function, and computes the correlation using the formula above: sum the product of standard units and divide by \(n - 1\).
  • Returns a single numeric value (not a tibble).

Hints:

  • Use pull(data, {{ x }}) and pull(data, {{ y }}) to extract the columns.
  • You may use sum(), nrow() or length(), and vectorized multiplication.

Do not call cor(); the goal is to compute the correlation manually using standard units. You can check your result with cor() outside the function.
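
For instance, an interactive check might look like the following (check_df and its column values are made up here purely for illustration):

check_df <- tibble(height = c(60, 62, 65, 70, 74),      # toy data for checking only;
                   weight = c(115, 120, 140, 155, 170)) # tibble() is available once dplyr is loaded

correlation(check_df, height, weight)    # your implementation
cor(check_df$height, check_df$weight)    # base R, for comparison outside your function

The two values should agree up to floating-point rounding.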

4. Slope and intercept of the regression line

For the regression of a response variable \(y\) on an explanatory variable \(x\), the least squares regression line is

\[ \hat{y} = b_0 + b_1 x, \]

where

  • \(b_1\) is the slope
  • \(b_0\) is the intercept

These can be computed from sample statistics:

\[ b_1 = r \cdot \frac{s_y}{s_x} \quad\text{and}\quad b_0 = \bar{y} - b_1 \bar{x}. \]

Here \(r\) is the correlation between \(x\) and \(y\).
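
Continuing the toy example from Section 3 (\(x = (1, 2, 3)\), \(y = (2, 4, 6)\)): there \(r = 1\), and the sample statistics are \(\bar{x} = 2\), \(\bar{y} = 4\), \(s_x = 1\), \(s_y = 2\), so

\[ b_1 = 1 \cdot \frac{2}{1} = 2 \quad\text{and}\quad b_0 = 4 - 2 \cdot 2 = 0, \]

giving the line \(\hat{y} = 2x\), which passes through all three points exactly.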

Task 3 (3 points)

Write a function slope(data, x, y) that:

  • Takes a data frame or tibble data.

  • The arguments x and y are column names (tidy evaluation).

  • Computes the slope \(b_1\) using:

    • the correlation from your correlation() function, and
    • the sample standard deviations sd() of the x and y columns.
  • Returns a single numeric value equal to the slope \(b_1\).

You should not call lm(); compute the slope from \(r\), \(s_x\), and \(s_y\).

Task 4 (3 points)

Write a function intercept(data, x, y) that:

  • Takes the same arguments as slope().

  • Computes the slope \(b_1\) by calling your slope() function.

  • Computes the intercept

    \[ b_0 = \bar{y} - b_1 \bar{x}, \]

    where \(\bar{x}\) and \(\bar{y}\) are the sample means of the x and y columns.

  • Returns a single numeric value equal to the intercept \(b_0\).

Again, do not call lm() inside this function.
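
As with cor(), you may use lm() interactively, outside your functions, to check your answers. For example, reusing the hypothetical check_df from Section 3:

slope(check_df, height, weight)
intercept(check_df, height, weight)

coef(lm(weight ~ height, data = check_df))   # comparison only; lm() is not allowed inside your functions

The "(Intercept)" and "height" entries returned by coef() should match your intercept and slope.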

5. Computing predicted values

Once you have \(b_0\) and \(b_1\), you can compute the predicted value \(\hat{y}\) for any new \(x\) value using

\[ \hat{y} = b_0 + b_1 x_{\text{new}}. \]

Task 5 (4 points)

Write a function y_hat(data, x, y, x_new) that:

  • Takes:

    • data: a data frame or tibble containing the original x and y variables,
    • x, y: column names (tidy evaluation),
    • x_new: a numeric vector of new x-values at which you want predictions.
  • Computes the regression slope and intercept by calling your slope() and intercept() functions.

  • Uses those to compute predicted values \(\hat{y}\) for each element of x_new.

  • Returns a tibble with columns:

    • x_new: the input values,
    • y_hat: the predicted values.
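
For example, using the hypothetical check_df from Section 3, a call and the expected shape of the result might look like:

y_hat(check_df, height, weight, x_new = c(63, 68, 72))
# a tibble with 3 rows and two columns, x_new and y_hat, where each
# y_hat value equals intercept(check_df, height, weight) +
# slope(check_df, height, weight) * x_new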

6. Plotting the regression line

Finally, we will write a function that plots the data and the fitted regression line.

Task 6 (3 points)

Write a function plot_regression_line(data, x, y) that:

  • Takes a data frame or tibble data.

  • The arguments x and y are column names (tidy evaluation).

  • Computes the slope and intercept of the regression line by calling your slope() and intercept() functions.

  • Uses ggplot2 to construct a plot with:

    • scatter points of the original data, and
    • the fitted regression line.

To pass the column names through to ggplot2, use:

ggplot(data, aes(x = {{ x }}, y = {{ y }})).
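
One possible structure for the body of the function, shown here only as a sketch (your implementation may differ as long as it meets the requirements above), is to compute b0 and b1 with your own functions and draw the line with geom_abline():

b1 <- slope(data, {{ x }}, {{ y }})        # inside plot_regression_line()
b0 <- intercept(data, {{ x }}, {{ y }})

ggplot(data, aes(x = {{ x }}, y = {{ y }})) +
  geom_point() +                           # scatter of the original data
  geom_abline(intercept = b0, slope = b1)  # fitted line from your b0 and b1

Remember to return the resulting ggplot object rather than printing it.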

7. General rules

  • All functions must be pure, except that plot_regression_line() may construct a plot but must still return the plot object instead of printing it.
  • Do not use lm() or cor() in your implementations. The goal is to compute correlation and the regression line from definitions.
  • Use the exact function names and argument names given in this assignment for autograding.
  • Gradescope will have these packages installed: dplyr, readr, ggplot2.

8. Submission

This homework will be graded through Gradescope. It is important that you use the exact function names and argument names as specified.

  • Submit a single file named hw4.R containing all function definitions in the order listed above.

  • At the top of the file, include:

    library(dplyr)
    library(ggplot2)
  • Your script should contain only function definitions. Remove any test calls or exploratory code (for example, lines where you call the functions to print output).

  • The autograder will automatically run:

    source("hw4.R")

    and then call your functions on test data.

Before submitting:

  • Run source("hw4.R") in RStudio.
  • Manually test your functions on a small dataset (see the bottom of the template, and hybrid.csv on Blackboard); a sketch of such a check is shown below.
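
For example, an interactive check in the console might use a built-in dataset such as mtcars (used here only because it ships with R; any small dataset works). Keep this code out of the hw4.R you submit:

source("hw4.R")

cars_df <- as_tibble(mtcars)                   # as_tibble() is available once dplyr is loaded
correlation(cars_df, wt, mpg)                  # compare with cor(mtcars$wt, mtcars$mpg)
slope(cars_df, wt, mpg)
intercept(cars_df, wt, mpg)
y_hat(cars_df, wt, mpg, x_new = c(2.5, 3.5))
plot_regression_line(cars_df, wt, mpg)         # returns a ggplot; print it to display the plot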

9. Optional reporting

There is no required written report for this homework.

If you would like additional practice, you may create a fourth Quarto blog post where you:

  • Load a small dataset (built-in or your own),
  • Use your functions to compute the correlation and regression line, and
  • Plot the regression line on top of the data.

You may keep this document for your own notes or share it as an extra resource, but it will not be graded for this assignment.