Measuring Distributions

Sarah Cassie Burnett

October 16, 2025

Recap: Describing a Distribution


When summarizing a dataset, report all four key features:

  • Center — e.g., mean or median
  • Spread — e.g., IQR or standard deviation
  • Shape — e.g., skewness, modality (unimodal, bimodal)
  • Unusual points — e.g., outliers or extreme values

Recap: Describing Distributions


Last time, we focused on two main features of a distribution:

  • Center — the typical value (mean or median)
  • Spread — how far values vary from that center

Together, center and spread give a first summary of a dataset’s pattern; shape and unusual points complete the picture.

Center: Mean vs. Median

  • Mean works well when the data are symmetric and have no outliers.
  • Median is robust — it’s better when the data are skewed or have extreme values.
  • For bimodal or multimodal data, a single center can mislead:
    • the data may need to be separated into groups first, or
    • the data may be undersampled.
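
To see why the median is called robust, here is a minimal sketch in base R. The vectors `clean` and `with_outlier` are made-up values for illustration, not part of the penguin data. Adding one extreme value moves the mean a lot and the median only a little.

```r
# Illustrative values only (not from the lecture data)
clean        <- c(2, 4, 6, 8, 10)
with_outlier <- c(clean, 100)   # add one extreme value

mean(clean)            # 6
median(clean)          # 6

mean(with_outlier)     # about 21.7, pulled toward the extreme value
median(with_outlier)   # 7, barely moves
```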

Undersampled vs. Well-sampled

Let’s look at penguin flipper length.
Here’s a histogram across three species:

library(tidyverse)
library(palmerpenguins)
ggplot(penguins, aes(x = flipper_length_mm)) +
  geom_histogram(fill = "chartreuse4", color = "white", bins = 12, alpha = 0.8) +
  labs(
    title = "All penguins",
    x = "Flipper Length (mm)",
    y = "Count"
  ) +
  theme_minimal(base_size = 16)

Look at one group’s distribution

Let’s look at penguin flipper length.
Filter for only ‘Adelie’.

library(tidyverse)
library(palmerpenguins)
set.seed(7)

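# Keep only Adelie penguins with a recorded flipper length
# (151 of the 152 Adelie rows), so the plot title matches the data
Adelie_penguins <- penguins |>
  filter(species == "Adelie") |>
  drop_na(flipper_length_mm)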

ggplot(Adelie_penguins, aes(x = flipper_length_mm)) +
  geom_histogram(fill = "chartreuse4", color = "white", bins = 12, alpha = 0.8) +
  labs(
    title = "151 Adelie penguins",
    x = "Flipper Length (mm)",
    y = "Count"
  ) +
  theme_minimal(base_size = 16)

Undersampled vs. Well-sampled

Let’s look at penguin flipper length.
We’ll compare a tiny random sample to a large one.

library(tidyverse)
library(palmerpenguins)
set.seed(7)


Adelie_penguins <- penguins |> 
  filter(species == "Adelie") |> 
  drop_na(flipper_length_mm)

# Undersampled (n = 10)
small_sample <- Adelie_penguins |> 
  sample_n(10)
# Plot side by side
p_small <- ggplot(small_sample, aes(x = flipper_length_mm)) +
  geom_histogram(fill = "chartreuse4", color = "white", bins = 8, alpha = 0.8) +
  labs(
    title = "Undersampled (n = 10)",
    x = "Flipper Length (mm)",
    y = "Count"
  ) +
  theme_minimal(base_size = 16)

p_large <- ggplot(Adelie_penguins, aes(x = flipper_length_mm)) +
  geom_histogram(fill = "chartreuse4", color = "white", bins = 12, alpha = 0.8) +
  labs(
    title = "151 Adelie penguins",
    x = "Flipper Length (mm)",
    y = "Count"
  ) +
  theme_minimal(base_size = 16)

# Combine the two plots side by side with patchwork:
library(patchwork)
p_small + p_large

Why Measure Spread?

Two datasets can have the same center but very different variability.
We measure spread to understand how values differ from one another.

Common measures of spread so far:

  • Range: max − min
  • IQR (Interquartile Range): spread of the middle 50% of data
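
As a small sketch of the "same center, different variability" idea, the two vectors below (`a` and `b`, made-up values) share a mean and median of 10 but differ in both measures of spread.

```r
# Illustrative values only
a <- c(9, 10, 11)
b <- c(0, 10, 20)

c(mean(a), mean(b))       # 10 10
c(median(a), median(b))   # 10 10

# Range as a single number: max - min
range_a <- diff(range(a))   # 2
range_b <- diff(range(b))   # 20

# IQR: spread of the middle 50%
iqr_a <- IQR(a)   # 1
iqr_b <- IQR(b)   # 10
```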

Range vs. IQR

Measure | Description                        | Robust to Outliers?
Range   | Distance from minimum to maximum   | NO
IQR     | Spread of the middle 50% of values | YES

The IQR doesn’t change much even if the dataset has extreme values.
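
A quick way to check this claim is to replace the largest value of a small dataset with something extreme and recompute both measures. The values below are illustrative; `extreme` is a hypothetical variant in which the maximum has been changed to 220.

```r
# Illustrative values: same data, but the maximum is made extreme
base_x  <- c(5, 9, 11, 14, 18, 21, 22)
extreme <- c(5, 9, 11, 14, 18, 21, 220)

diff(range(base_x))    # 17
diff(range(extreme))   # 215, the range reacts strongly

IQR(base_x)            # 9.5
IQR(extreme)           # 9.5, the middle 50% is untouched
```

In this particular example the IQR does not change at all, because the extreme value sits outside the middle half of the data.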

Example: IQR

We’ll use this data:

5, 9, 11, 14, 18, 21, 22

  1. Find the median (Q2)
  • There are 7 observations (odd number).
  • Median = 14

Example: IQR

Data: 5, 9, 11, 14, 18, 21, 22

  1. Split the data. In the inclusive method, we include the median in both halves.

    • Lower half is 5, 9, 11, 14
    • Upper half is 14, 18, 21, 22
  2. Find Q1 and Q3

    • Q1 = median of lower half = (9 + 11)/2 = 10
    • Q3 = median of upper half = (18 + 21)/2 = 19.5
  3. Compute the IQR: \(Q3 - Q1 = 19.5 - 10 = 9.5\)

Example: IQR

Why “inclusive” method?

We include the median in both halves so that each half represents half of the data up to and including the center.

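With an odd number of observations, as here, this gives the same quartiles as the default method in R, Excel, and Google Sheets. With an even number of observations, the software interpolates between values and can differ slightly from the hand method.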
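
The hand method can be sketched in base R and compared with `quantile()`. The helper name `quartiles_by_hand()` and the vector `even_x` are hypothetical, introduced only for this illustration. R's default `quantile()` (type 7) interpolates between sorted values, so the two approaches agree for the 7-value example but not necessarily when the number of observations is even.

```r
# Hypothetical helper: quartiles as "median of each half".
# When n is odd, the median is included in both halves (inclusive method).
quartiles_by_hand <- function(x) {
  x <- sort(x)
  n <- length(x)
  lower <- x[1:ceiling(n / 2)]
  upper <- x[(floor(n / 2) + 1):n]
  c(q1 = median(lower), q3 = median(upper))
}

odd_x <- c(5, 9, 11, 14, 18, 21, 22)
quartiles_by_hand(odd_x)                           # q1 = 10, q3 = 19.5
quantile(odd_x, c(0.25, 0.75), names = FALSE)      # 10, 19.5 (same)

even_x <- c(2, 4, 6, 8)                            # illustrative even-sized data
quartiles_by_hand(even_x)                          # q1 = 3, q3 = 7
quantile(even_x, c(0.25, 0.75), names = FALSE)     # 3.5, 6.5 (interpolated)
```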

Summary

library(tidyverse)
data <- tibble(x = c(5, 9, 11, 14, 18, 21, 22))

# Summary of quartiles, IQR, range, etc.
data |>
  summarize(
    min_x  = min(x),
    q1_x   = quantile(x, 0.25),
    median = median(x),
    q3_x   = quantile(x, 0.75),
    max_x  = max(x),
    iqr_x  = IQR(x)
  )
# A tibble: 1 × 6
  min_x  q1_x median  q3_x max_x iqr_x
  <dbl> <dbl>  <dbl> <dbl> <dbl> <dbl>
1     5    10     14  19.5    22   9.5

Boxplot

ggplot(data, aes(y = x)) +
  geom_boxplot(fill = "chartreuse4", color = "black", width = 0.3) +
  labs(title = "Boxplot of Data: 5, 9, 11, 14, 18, 21, 22", y = "Value") +
  theme_minimal(base_size = 18)

Example: Fences and Outliers

Let’s continue with the same data:

5, 9, 11, 14, 18, 21, 22

  1. We already found that: Q1 = 10, Q3 = 19.5, IQR = 9.5

  2. Compute the fences using the 1.5×IQR rule:

\[\text{Lower Fence} = Q1 - 1.5 \times IQR = 10 - 1.5(9.5) = -4.25\]

\[\text{Upper Fence} = Q3 + 1.5 \times IQR = 19.5 + 1.5(9.5) = 33.75\]

  3. Any data values below -4.25 or above 33.75 would be considered outliers.

Check in R

library(tidyverse)
data <- tibble(x = c(5, 9, 11, 14, 18, 21, 22))

data |>
  summarize(
    min_x  = min(x),
    q1_x   = quantile(x, 0.25),
    median = median(x),
    q3_x   = quantile(x, 0.75),
    max_x  = max(x),
    iqr_x  = IQR(x),
    lower_fence = q1_x - 1.5 * iqr_x,
    upper_fence = q3_x + 1.5 * iqr_x
  )
# A tibble: 1 × 8
  min_x  q1_x median  q3_x max_x iqr_x lower_fence upper_fence
  <dbl> <dbl>  <dbl> <dbl> <dbl> <dbl>       <dbl>       <dbl>
1     5    10     14  19.5    22   9.5       -4.25        33.8

Visualizing Outliers

library(ggplot2)

ggplot(data, aes(y = x)) +
  geom_boxplot(fill = "chartreuse4", color = "black", width = 0.3, outlier.color = "red", outlier.shape = 19) +
  labs(
    title = "Boxplot of Data with Fences and Potential Outliers",
    y = "Value"
  ) +
  theme_minimal(base_size = 18)

This dataset has no outliers, since all points fall between the fences (−4.25 and 33.75).
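
The same check can be done without a plot. The sketch below uses a hypothetical helper, `find_outliers()`, that returns any values outside the 1.5×IQR fences. The second vector is a made-up variant with one large value, included only to show what a flagged point looks like.

```r
# Hypothetical helper: values outside the k x IQR fences (k = 1.5 by default)
find_outliers <- function(x, k = 1.5) {
  q   <- quantile(x, c(0.25, 0.75), names = FALSE)
  iqr <- q[2] - q[1]
  x[x < q[1] - k * iqr | x > q[2] + k * iqr]
}

no_out   <- find_outliers(c(5, 9, 11, 14, 18, 21, 22))   # numeric(0): nothing flagged
with_out <- find_outliers(c(5, 9, 11, 14, 18, 21, 60))   # 60 lies above the upper fence (33.75)
```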

Boxplots are good for comparing distributions across categories

library(palmerpenguins)
ggplot(penguins, aes(x = species, y = body_mass_g)) +
  geom_boxplot(fill = "chartreuse4", outlier.color = "red") +
  labs(
    x = "Species",
    y = "Body Mass (g)",
    title = "Distribution of Penguin Body Mass",
    caption = "Source: palmerpenguins"
  ) +
  theme_minimal(base_size = 20)

Try it out! Compute and Visualize by Hand

Dataset: 10, 2, 25, 8, 6, 12, 4

Pair up and compute the following by hand:

  1. Median
  2. Q1 and Q3
  3. Interquartile Range (IQR = Q3 − Q1)
  4. Lower and Upper Fences (1.5×IQR rule)
  5. Identify any potential outliers

Then sketch a rough boxplot of the data.


Check with R

data <- tibble(x = c(10, 2, 25, 8, 6, 12, 4))

data |> 
  summarize(
    min_x  = min(x),
    q1_x   = quantile(x, 0.25),
    median = median(x),
    q3_x   = quantile(x, 0.75),
    max_x  = max(x),
    iqr_x  = IQR(x),
    lower_fence = q1_x - 1.5 * iqr_x,
    upper_fence = q3_x + 1.5 * iqr_x
  )
# A tibble: 1 × 8
  min_x  q1_x median  q3_x max_x iqr_x lower_fence upper_fence
  <dbl> <dbl>  <dbl> <dbl> <dbl> <dbl>       <dbl>       <dbl>
1     2     5      8    11    25     6          -4          20

library(ggplot2)

ggplot(data, aes(y = x)) +
  geom_boxplot(fill = "chartreuse4", color = "black", width = 0.3, outlier.color = "red") +
  labs(
    title = "Boxplot of Data: 10, 2, 25, 8, 6, 12, 4",
    y = "Value"
  ) +
  theme_minimal(base_size = 18)

Variance and Standard Deviation


Variance and standard deviation measure how far values spread around the mean.

Variance

The variance (sample variance) is roughly the average squared distance from the mean: the squared distances are summed and divided by \(n - 1\) rather than \(n\).

\[ \sigma^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1} \]

Where:

  • \(x_i\) — an individual data point
  • \(\bar{x}\) — the sample mean (average of all data points)
  • \(n\) — the number of observations in the sample
  • \(\sigma^2\) — the sample variance (a measure of spread)
  • \((x_i - \bar{x})\) — the deviation of each observation from the mean

Variance increases when data points are farther from the mean — it reflects how “spread out” the dataset is overall.

Standard Deviation

  • The standard deviation (SD) is the square root of the variance.
  • It’s in the same units as the data.

\[ \sigma = \sqrt{\sigma^2} \]

Standard Deviation


  • A low standard deviation indicates that the values tend to be close to the mean
  • A high standard deviation indicates that the values are spread out over a wider range

Calculating Deviation from the Mean

  • First, calculate the mean (\(\bar{x}\)) of the dataset.
  • For each data point (\(x_i\)), calculate its deviation from the mean: \[x_i - \bar{x}\]

Square the Deviations

  • Square each deviation to eliminate negative values: \[(x_i - \bar{x})^2\]
  • Sum up all squared deviations: \[\sum_{i=1}^{n} (x_i - \bar{x})^2\]
  • This sum represents the total squared deviation from the mean.

Calculating the Variance

  • Divide the total squared deviation by \((n-1)\) rather than \(n\), because this is a sample variance: \[\sigma^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2\]
  • Using \((n-1)\) ensures an unbiased estimate of the population variance when calculating from a sample.
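
To make the choice of denominator concrete, this short sketch compares dividing by \(n - 1\) and by \(n\) for the values 2, 4, 6, 8, and checks the first against R's built-in `var()`. The variable names are illustrative.

```r
x  <- c(2, 4, 6, 8)
n  <- length(x)
ss <- sum((x - mean(x))^2)   # total squared deviation: 20

sample_var <- ss / (n - 1)   # about 6.67
pop_style  <- ss / n         # 5, always smaller than the n - 1 version

var(x)                       # about 6.67: var() divides by n - 1
```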

Deriving the Standard Deviation


  • The standard deviation is the square root of the variance: \[\sigma = \sqrt{\sigma^2}\]
  • Taking the square root converts the variance back to the units of the original data.

Example: Variance and SD by Hand

Data: 2, 4, 6, 8

  1. Mean = (2 + 4 + 6 + 8) / 4 = 5
  2. Compute each deviation:
    \[ x_i - \bar{x} = \begin{cases} 2 - 5 = -3 \\ 4 - 5 = -1 \\ 6 - 5 = +1 \\ 8 - 5 = +3 \end{cases} \]

Example: Variance and SD by Hand

Data: 2, 4, 6, 8

  1. Deviations: −3, −1, +1, +3

  2. Square each deviation to remove negatives:

    \[ (x_i - \bar{x})^2 = \begin{cases} (-3)^2 = 9 \\ (-1)^2 = 1 \\ (+1)^2 = 1 \\ (+3)^2 = 9 \end{cases} \]

Example: Variance and SD by Hand

Data: 2, 4, 6, 8

  1. Squared deviations: 9, 1, 1, 9
  2. Variance: \(\sigma^2 = \frac{9 + 1 + 1 + 9}{4 - 1} = \frac{20}{3} \approx 6.7\)
  3. Standard deviation: \(\sigma = \sqrt{6.7} \approx 2.6\)

Standard Deviation Calculation in R


x <- c(2, 4, 6, 8)
spread <- x - mean(x)
spread
[1] -3 -1  1  3

Standard Deviation Calculation in R


x <- c(2, 4, 6, 8)
spread_squared <- spread^2
spread_squared
[1] 9 1 1 9

Standard Deviation Calculation in R


x <- c(2, 4, 6, 8)
sum_spread_squared <- sum(spread_squared)
sum_spread_squared
[1] 20

Standard Deviation Calculation in R


n <- length(x)
variance <- sum_spread_squared/(n-1)
variance
[1] 6.666667

Standard Deviation Calculation in R


standard_dev <- sqrt(variance)
standard_dev
[1] 2.581989
sd(x)
[1] 2.581989

Interpreting SD

  • Small SD means data are tightly clustered around the mean
  • Large SD means data are spread out
  • Units: always the same as the data itself (e.g., cm, dollars, points)
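
The point about units can be checked by converting the same measurements to a different unit. The heights below are made-up values. Rescaling by 100 rescales the SD by 100, while the variance changes by a factor of 100², because it is in squared units.

```r
# Illustrative heights, in centimeters and in meters
height_cm <- c(150, 160, 170, 180)
height_m  <- height_cm / 100

sd(height_cm)    # about 12.9  (cm)
sd(height_m)     # about 0.129 (m)

var(height_cm)   # about 166.7  (cm^2)
var(height_m)    # about 0.0167 (m^2)
```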

Visual Comparison

library(tidyverse)   # tibble() and ggplot()
library(patchwork)
set.seed(7)          # make the random draws reproducible
x <- tibble(x = rnorm(1000, mean = 0, sd = 2))
a <- ggplot(x, aes(x = x )) +
  geom_histogram(binwidth = .5, fill = "chartreuse4") + theme_bw(base_size = 14) +
  geom_vline(xintercept = mean(x$x), color = "red", linewidth = 1) +
  geom_vline(xintercept = mean(x$x) - sd(x$x), color = "blue", linetype = "dashed") +
  geom_vline(xintercept = mean(x$x) + sd(x$x), color = "blue", linetype = "dashed") +
   labs(
    x = "Value", 
    y = "Count", 
    title = "A Distribution with Mean = 0, SD = 2"
  ) + xlim(-20, 20)
x <- tibble(x = rnorm(1000, mean = 0, sd = 10))
b <- ggplot(x, aes(x = x )) +
  geom_histogram(binwidth = .5, fill = "chartreuse4") + theme_bw(base_size = 14) +
  geom_vline(xintercept = mean(x$x), color = "red", linewidth = 1) +
  geom_vline(xintercept = mean(x$x) - sd(x$x), color = "blue", linetype = "dashed") +
  geom_vline(xintercept = mean(x$x) + sd(x$x), color = "blue", linetype = "dashed") +
   labs(
    x = "Value", 
    y = "Count", 
    title = "A Distribution with Mean = 0, SD = 10"
  ) + xlim(-20, 20)
a + b

Summary

Measure            | What it Describes          | Units         | Robust to Outliers?
Range              | Entire spread              | Same as data  | NO
IQR                | Middle 50% of data         | Same as data  | YES
Variance           | Average squared distance   | Squared units | NO
Standard Deviation | Typical distance from mean | Same as data  | NO

Z-Scores

A z-score tells how many standard deviations a value \(x\) is from the mean.

\[ z = \frac{x - \bar{x}}{\sigma} \]

  • Positive is above the mean
  • Negative is below the mean
  • Values with \(|z| > 2\) are often considered unusual

Example: Calculating Z-Scores

Dataset: 2, 4, 6, 8
Mean = 5, SD ≈ 2.58

x | z-score             | Interpretation
2 | (2−5)/2.58 ≈ −1.16  | below average
4 | (4−5)/2.58 ≈ −0.39  | slightly below
6 | (6−5)/2.58 ≈ +0.39  | slightly above
8 | (8−5)/2.58 ≈ +1.16  | above average

Compute in R

x <- c(2, 4, 6, 8)
z <- (x - mean(x)) / sd(x)

# Display both values and z-scores
tibble(Value = x, Z_Score = round(z, 2))
# A tibble: 4 × 2
  Value Z_Score
  <dbl>   <dbl>
1     2   -1.16
2     4   -0.39
3     6    0.39
4     8    1.16

Interpreting Z-Scores

  • A z-score of 0 means the value equals the mean.
  • A z-score of +1 means 1 SD above the mean.
  • A z-score of −1 means 1 SD below the mean.
  • Extreme z-scores (e.g., beyond ±2 or ±3) may indicate outliers or rare events.
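
As a sketch of how the ±2 rule of thumb might be applied, the code below computes z-scores for a made-up vector and keeps the values with \(|z| > 2\). It also checks the manual formula against base R's `scale()`, which centers by the mean and divides by the SD by default.

```r
# Illustrative values: six similar measurements and one large one
x <- c(10, 12, 11, 13, 12, 11, 40)

z <- (x - mean(x)) / sd(x)
round(z, 2)

unusual <- x[abs(z) > 2]   # 40 (its z-score is about 2.26)

# scale() gives the same z-scores as the manual formula
same_as_scale <- isTRUE(all.equal(z, as.numeric(scale(x))))   # TRUE
```

Because one extreme value also inflates the mean and SD, z-scores in small samples cannot get very large, so a cutoff like 2 is a rough guide rather than a strict rule.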

Visual Example

library(tidyverse)   # provides both tibble() and ggplot()
set.seed(7)
data <- tibble(x = rnorm(500, mean = 0, sd = 4))

mean_x <- mean(data$x)
sd_x <- sd(data$x)

ggplot(data, aes(x)) +
  geom_histogram(bins = 30, fill = "chartreuse4", color = "white", alpha = 0.8) +
  geom_vline(xintercept = mean_x, color = "red", linewidth = 1) +
  geom_vline(xintercept = mean_x + c(-1, 1) * sd_x, color = "blue", linetype = "dashed") +
  geom_vline(xintercept = mean_x + c(-2, 2) * sd_x, color = "purple", linetype = "dotted") +
  labs(
    title = "Normal Distribution (Mean = 0, SD = 4) with Z-Scores",
    x = "Value",
    y = "Count"
  ) +
  theme_minimal(base_size = 18) +
  annotate("text", x = mean_x - 1.3, y = 40, label = "Mean (z = 0)", color = "red", vjust = -0.5) +
  annotate("text", x = mean_x + sd_x -.7 , y = 40, label = "z = +1", color = "blue", vjust = -0.5) +
  annotate("text", x = mean_x - sd_x -.7, y = 40, label = "z = -1", color = "blue", vjust = -0.5) +
  annotate("text", x = mean_x + 2 * sd_x -.7, y = 40, label = "z = +2", color = "purple", vjust = -0.5) +
  annotate("text", x = mean_x - 2 * sd_x - .7, y = 40, label = "z = -2", color = "purple", vjust = -0.5)

Visual Example: Z-score vs. IQR

library(ggplot2)
library(dplyr)

set.seed(7)
data <- tibble(x = rnorm(500, mean = 0, sd = 4))

# Mean and standard deviation (for z-scores)
mean_x <- mean(data$x)
sd_x <- sd(data$x)

# Quartiles and IQR (for fences)
Q1 <- quantile(data$x, 0.25)
Q3 <- quantile(data$x, 0.75)
IQR_val <- IQR(data$x)
lower_fence <- Q1 - 1.5 * IQR_val
upper_fence <- Q3 + 1.5 * IQR_val

ggplot(data, aes(x)) +
  geom_histogram(bins = 30, fill = "chartreuse4", color = "white", alpha = 0.8) +
  
  # Mean and z-score lines
  geom_vline(xintercept = mean_x, color = "red", linewidth = 1) +
  geom_vline(xintercept = mean_x + c(-1, 1) * sd_x, color = "blue", linetype = "dashed") +
  geom_vline(xintercept = mean_x + c(-2, 2) * sd_x, color = "purple", linetype = "dotted") +
  
  # IQR and fence lines
  geom_vline(xintercept = c(Q1, Q3), color = "orange", linetype = "solid", linewidth = 1) +
  geom_vline(xintercept = c(lower_fence, upper_fence), color = "brown", linetype = "dashed") +
  
  labs(
    title = "Comparing Z-Scores and IQR Fences",
    x = "Value",
    y = "Count"
  ) +
  theme_minimal(base_size = 18) +

  # Z-score annotations
  annotate("text", x = mean_x -1.3, y = 40, label = "Mean (z = 0)", color = "red", vjust = -0.5) +
  annotate("text", x = mean_x + sd_x - .7 , y = 40, label = "z = +1", color = "blue", vjust = -0.5) +
  annotate("text", x = mean_x - sd_x - .7, y = 40, label = "z = -1", color = "blue", vjust = -0.5) +
  annotate("text", x = mean_x + 2 * sd_x - .7, y = 40, label = "z = +2", color = "purple", vjust = -0.5) +
  annotate("text", x = mean_x - 2 * sd_x - .7, y = 40, label = "z = -2", color = "purple", vjust = -0.5) +

  # IQR annotations
  annotate("text", x = Q1, y = 45, label = "Q1", color = "orange", vjust = -0.5) +
  annotate("text", x = Q3, y = 45, label = "Q3", color = "orange", vjust = -0.5) +
  annotate("text", x = lower_fence, y = 30, label = "Lower Fence", color = "brown", angle = 90, vjust = -0.4) +
  annotate("text", x = upper_fence, y = 30, label = "Upper Fence", color = "brown", angle = 90, vjust = -0.4)

Summary of Outliers

Type           | Formula                        | Purpose                                                      | Common Use
Mild Fence     | Q1 − 1.5 × IQR, Q3 + 1.5 × IQR | Detects mild outliers                                        | Standard boxplots
Extreme Fence  | Q1 − 3 × IQR, Q3 + 3 × IQR     | Detects extreme outliers                                     | Detailed exploration
2 SD Threshold | Mean ± 2 × SD                  | Contains ~95% of normally distributed data                   | Quality control, z-scores
3 SD Threshold | Mean ± 3 × SD                  | Contains ~99.7% of normal data, so values outside it are rare | Detecting anomalies
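
The percentages for the SD thresholds come from the normal distribution and can be reproduced with `pnorm()`. The sketch below also works out, under the assumption of perfectly normal data, where the 1.5×IQR upper fence falls in z-score units. The variable names are illustrative.

```r
# Share of a normal distribution within 2 and 3 SDs of the mean
within_2sd <- pnorm(2) - pnorm(-2)   # about 0.954
within_3sd <- pnorm(3) - pnorm(-3)   # about 0.997

# For a standard normal, Q3 is qnorm(0.75) and, by symmetry, IQR = 2 * Q3
q3          <- qnorm(0.75)              # about 0.674
upper_fence <- q3 + 1.5 * (2 * q3)      # about 2.70, in z-score units

# Share of normal data expected outside both 1.5 x IQR fences
outside_fences <- 2 * pnorm(-upper_fence)   # about 0.007
```

So for roughly normal data, the mild fences sit between the 2 SD and 3 SD thresholds and flag a bit under 1% of values.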