Sarah Cassie Burnett
October 14, 2025
With a fair die, each outcome (1, 2, 3, 4, 5, or 6) is equally likely, with probability 1/6.
Outcome:
Event:
Outcomes are individual results; Events are groups of outcomes we assign probabilities to.
Simulation is the process of using a computer to mimic a physical experiment.
Let’s simulate 10 die rolls… 1000 die rolls.
The Law of Large Numbers (LLN) is a fundamental principle in probability and statistics: as the number of trials increases, the average of the observed results gets closer to the expected value, so the results of random events become more predictable.
10 die rolls
1000 die rolls
1000 die rolls as a proportion
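A minimal sketch of the simulation described above, using base R only. The helper name `roll_proportions` and the seed are illustrative assumptions, not taken from the slides.

```r
# Simulate fair die rolls and compare observed proportions with 1/6.
set.seed(7)  # arbitrary seed, only for reproducibility

# Hypothetical helper: proportion of each face in n simulated rolls
roll_proportions <- function(n) {
  rolls <- sample(1:6, size = n, replace = TRUE)
  # factor() keeps all six faces, even if one never appears
  counts <- table(factor(rolls, levels = 1:6))
  as.numeric(counts) / n
}

props_10 <- roll_proportions(10)
props_1000 <- roll_proportions(1000)

# Largest gap between an observed proportion and the theoretical 1/6
max(abs(props_10 - 1 / 6))
max(abs(props_1000 - 1 / 6))
```

With 1000 rolls the gap is typically much smaller than with 10 rolls, which is the Law of Large Numbers at work.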
library(ggplot2)
library(tibble)
set.seed(7)
N <- 1000
dist <- tibble(x = sample(1:6, size = N, replace = TRUE))
hist <- ggplot(dist, aes(x)) +
geom_histogram(aes(y = after_stat(count / sum(count))), bins = 6, fill = "chartreuse4", color = "white") +
labs(
title = "Uniform Distribution (sampled from 1 to 6)",
x = "Value",
y = "Proportion"
) + theme_minimal(base_size = 20)
hist
In practice, for discrete outcomes, geom_bar() is preferred.
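A sketch of the same die-roll plot drawn with geom_bar(). The object names `rolls` and `bar` and the plot title are illustrative choices. Converting `x` to a factor is an assumption made here so that each face gets its own bar.

```r
library(ggplot2)
library(tibble)

set.seed(7)
rolls <- tibble(x = sample(1:6, size = 1000, replace = TRUE))

# factor(x) treats each face as a category, so there is one bar per face
bar <- ggplot(rolls, aes(x = factor(x))) +
  geom_bar(aes(y = after_stat(count / sum(count))),
           fill = "chartreuse4", color = "white") +
  labs(title = "Die Rolls (1000 samples)", x = "Value", y = "Proportion") +
  theme_minimal(base_size = 20)
bar
```

Unlike geom_histogram(), no `bins` argument is needed, because the bars follow the categories rather than numeric intervals.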
Many real-world variables follow non-uniform distributions.
In R, there are built-in functions to generate samples from these distributions:
runif(): Uniform distribution
rnorm(): Normal (Gaussian) distribution
rpois(): Poisson distribution
rexp(): Exponential distribution
Uniform: All outcomes are equally likely.
Normal: Also known as the bell curve or Gaussian distribution. It is one of the most widely used distributions in statistics.
Poisson: Used for counts of random, independent events (e.g., number of emails per hour or bus arrivals).
Exponential: Used for waiting times between random events.
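A short sketch that draws a sample from each of the four distributions. The parameter values (`lambda = 3`, `rate = 1`, and so on) are arbitrary choices for illustration.

```r
set.seed(7)  # arbitrary seed, only for reproducibility
n <- 1000

samples <- list(
  uniform     = runif(n, min = 0, max = 1),   # flat between 0 and 1
  normal      = rnorm(n, mean = 0, sd = 1),   # bell-shaped around 0
  poisson     = rpois(n, lambda = 3),         # non-negative integer counts
  exponential = rexp(n, rate = 1)             # non-negative waiting times
)

# Mean and median of each sample, one column per distribution
sapply(samples, function(x) c(mean = mean(x), median = median(x)))
```

Passing any element of `samples` to hist() gives a quick look at its shape.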
Which distribution best describes the shape of each variable?
palmerpenguins::penguins$body_mass_g (body mass of penguins).
faithful$waiting (waiting time between Old Faithful eruptions).
ggplot2::diamonds$price (price of diamonds).
Challenge: How would you describe ggplot2::diamonds$cut (cut of diamonds)?
(See the next slide for the code for the plots.)
#1
library(ggplot2)
library(palmerpenguins)
ggplot(penguins, aes(x = body_mass_g)) +
geom_histogram(bins = 20, fill = "chartreuse4", color = "white") +
labs(title = "Penguin Body Mass", x = "Body Mass (g)", y = "Count")
#2
ggplot(faithful, aes(x = waiting)) +
geom_histogram(bins = 30, fill = "chartreuse4", color = "white") +
labs(title = "Old Faithful Waiting Times", x = "Minutes", y = "Count")
#3
ggplot(diamonds, aes(x = price)) +
geom_histogram(bins = 50, fill = "chartreuse4", color = "white") +
labs(title = "Diamond Prices", x = "Price (USD)", y = "Count")
#4
ggplot(diamonds, aes(x = cut)) +
geom_bar(fill = "chartreuse4", color = "white") +
labs(title = "Diamond Cut Quality", x = "Cut", y = "Count")
How do we find where our data is centered?
Example:
Data: [2, 5, 3, 25, 5]
The mean is the average of the values. It is a common measure of central tendency, but it is sensitive to outliers.
Mean = (2 + 3 + 5 + 5 + 25) / 5 = 8
The median is the value that separates the higher half from the lower half of the data.
Ordered Data: [2, 3, 5, 5, 25], Median = 5
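The example above can be checked in R. The second half, which drops the value 25, is an added illustration of how each statistic reacts to an outlier.

```r
x <- c(2, 5, 3, 25, 5)

mean_x <- mean(x)      # (2 + 5 + 3 + 25 + 5) / 5 = 8
median_x <- median(x)  # middle value of the sorted data = 5

# Drop the outlier (25) and recompute both statistics
x_trim <- x[x != 25]
mean_trim <- mean(x_trim)      # (2 + 5 + 3 + 5) / 4 = 3.75
median_trim <- median(x_trim)  # average of the two middle values, 3 and 5 = 4
```

Removing one value moves the mean from 8 to 3.75, while the median only moves from 5 to 4.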
set.seed(7)
dist <- tibble(x = rnorm(2000, mean = 0, sd = 1))
stats <- dist |>
summarize(
mean_x = mean(x),
median_x = median(x)
)
ggplot(dist, aes(x)) +
geom_histogram(bins = 30, fill = "chartreuse4", color = "white") +
geom_vline(xintercept = stats$mean_x, color = "red", linewidth = 1) +
geom_vline(xintercept = stats$median_x, color = "blue", linewidth = 1, linetype = "dashed") +
labs(
title = "Normal Distribution (with Mean & Median)",
x = "Value",
y = "Count"
) +
annotate("text", x = stats$mean_x + 0.3, y = 200, label = "Mean", color = "red") +
annotate("text", x = stats$median_x - 0.3, y = 200, label = "Median", color = "blue") + theme_minimal(base_size = 16)
library(ggplot2)
library(tibble)
library(dplyr)
set.seed(1)
dist <- tibble(x = c(rnorm(10000, mean = -2, sd = 1),
rnorm(10000, mean = 3, sd = 1)))
stats <- dist |>
summarize(
mean_x = mean(x),
median_x = median(x)
)
ggplot(dist, aes(x)) +
geom_histogram(bins = 20, fill = "chartreuse4", color = "white") +
geom_vline(xintercept = stats$mean_x, color = "red", linewidth = 1) +
geom_vline(xintercept = stats$median_x, color = "blue", linewidth = 1, linetype = "dashed") +
labs(title = "Bimodal Distribution (with Mean & Median)",
x = "Value", y = "Count") +
annotate("text", x = stats$mean_x + 0.4, y = 1500, label = "Mean", color = "red") +
annotate("text", x = stats$median_x - 0.4, y = 1500, label = "Median", color = "blue") + theme_minimal(base_size = 16)
Data like this might be a combination of two groups. The overall mean is not meaningful until the groups are separated.
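A sketch of what separating the groups could look like. The `group` column with labels "A" and "B" is hypothetical: the simulated data above carries no labels, so they are attached here by construction.

```r
set.seed(1)

# Same two normal components as above, now with a hypothetical group label
dist_grouped <- data.frame(
  group = rep(c("A", "B"), each = 10000),
  x = c(rnorm(10000, mean = -2, sd = 1),
        rnorm(10000, mean = 3, sd = 1))
)

overall_mean <- mean(dist_grouped$x)                            # falls between the two peaks
group_means <- tapply(dist_grouped$x, dist_grouped$group, mean) # one mean per group

overall_mean
group_means
```

The overall mean lands near 0.5, where there is little data, while the per-group means sit close to the two peaks at -2 and 3.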
dist <- tibble(x = rbeta(10000, 1, 8))
stats <- dist |>
summarize(
mean_x = mean(x),
median_x = median(x)
)
ggplot(dist, aes(x)) +
geom_histogram(bins = 20, fill = "chartreuse4", color = "white") +
geom_vline(xintercept = stats$mean_x, color = "red", linewidth = 1) +
geom_vline(xintercept = stats$median_x, color = "blue", linewidth = 1, linetype = "dashed") +
labs(title = "Right-Skewed Distribution", x = "Value", y = "Count") +
annotate("text", x = stats$mean_x + 0.1, y = 1800, label = "Mean →", color = "red") +
annotate("text", x = stats$median_x - 0.1, y = 1800, label = "Median", color = "blue") + theme_minimal(base_size = 16)
dist <- tibble(x = rbeta(10000, 8, 1))
stats <- dist |>
summarize(
mean_x = mean(x),
median_x = median(x)
)
ggplot(dist, aes(x)) +
geom_histogram(bins = 20, fill = "chartreuse4", color = "white") +
geom_vline(xintercept = stats$mean_x, color = "red", linewidth = 1) +
geom_vline(xintercept = stats$median_x, color = "blue", linewidth = 1, linetype = "dashed") +
labs(title = "Left-Skewed Distribution", x = "Value", y = "Count") +
annotate("text", x = stats$mean_x - 0.1, y = 1800, label = "Mean ←", color = "red") +
annotate("text", x = stats$median_x + 0.1, y = 1800, label = "Median", color = "blue") + theme_minimal(base_size = 16)
Maximum (max())
Minimum (min())
Good to know, but they give no insight into the shape of the distribution or whether there are outliers.
How do we find where most of the data lie?
Example:
Data: [12, 5, 2, 4, 7, 15, 8, 10]
Quantiles split the ordered data into equal parts.
Ordered Data: [2, 4, 5, 7, 8, 10, 12, 15]
The Interquartile Range (IQR) measures the spread of the middle 50% of data.
\[\text{IQR} = Q_3 - Q_1 = 11 - 4.5 = 6.5\]
The IQR tells us how spread out the central half of the data is. It is a robust measure of spread because it ignores extreme values like the min and max.
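A sketch that reproduces the quartiles above by taking the median of each half of the ordered data. This median-of-halves approach is one common convention and is assumed here because it matches the slide's numbers. R's IQR() uses a different default definition (quantile type 7).

```r
x <- c(12, 5, 2, 4, 7, 15, 8, 10)
sorted <- sort(x)        # 2 4 5 7 8 10 12 15
n <- length(sorted)      # even here, so both halves have the same size

lower_half <- sorted[1:(n / 2)]        # 2 4 5 7
upper_half <- sorted[(n / 2 + 1):n]    # 8 10 12 15

q1 <- median(lower_half)   # 4.5
q3 <- median(upper_half)   # 11
iqr_halves <- q3 - q1      # 6.5

# R's default quantile definition interpolates between data points instead
iqr_default <- IQR(x)      # 5.75
```

Both answers are valid. They differ only in how a quartile is defined when it falls between two observations, and the difference shrinks as the sample grows.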
Example:
Data: [2, 4, 5, 7, 8, 10, 12, 15, 50]
Whiskers extend up to \(1.5 \times \text{IQR}\) beyond the box. Anything farther out is flagged as an outlier and plotted as a red point.
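The 1.5 × IQR rule can be checked by hand for the example data. The names `lower_fence` and `upper_fence` are illustrative, and the quartiles use R's default quantile() definition.

```r
x <- c(2, 4, 5, 7, 8, 10, 12, 15, 50)

q1 <- quantile(x, 0.25, names = FALSE)   # 5
q3 <- quantile(x, 0.75, names = FALSE)   # 12
iqr <- q3 - q1                           # 7

# Points beyond these limits are flagged as outliers
lower_fence <- q1 - 1.5 * iqr            # -5.5
upper_fence <- q3 + 1.5 * iqr            # 22.5

outliers <- x[x < lower_fence | x > upper_fence]
outliers                                 # 50
```

Only 50 falls outside the limits, so it is the single point drawn beyond the whiskers.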
library(ggplot2)
library(tibble)
data <- tibble(x = c(2, 4, 5, 7, 8, 10, 12, 15, 50))
ggplot(data, aes(x = "", y = x)) +
geom_boxplot(fill = "chartreuse4", color = "black", outlier.color = "red", outlier.shape = 19, width = 0.3) +
labs(
title = "Boxplot with Outlier",
x = "",
y = "Value"
) +
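theme_minimal(base_size = 20)
# Summary statistics for penguin body mass (drop_na() comes from tidyr)
library(dplyr)
library(tidyr)
library(palmerpenguins)
penguins |>
drop_na(body_mass_g) |>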
summarize(
mean_mass = mean(body_mass_g),
median_mass = median(body_mass_g),
q25_mass = quantile(body_mass_g, 0.25),
q75_mass = quantile(body_mass_g, 0.75),
sd_mass = sd(body_mass_g),
iqr_mass = IQR(body_mass_g),
min_mass = min(body_mass_g),
max_mass = max(body_mass_g),
range_mass = max(body_mass_g) - min(body_mass_g),
n_obs = n()
)
# A tibble: 1 × 10
mean_mass median_mass q25_mass q75_mass sd_mass iqr_mass min_mass max_mass
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int> <int>
1 4202. 4050 3550 4750 802. 1200 2700 6300
# ℹ 2 more variables: range_mass <int>, n_obs <int>
Next lecture