
Sarah Cassie Burnett
October 23, 2025
In data analysis, we are usually interested in saying something about a Target Population.
In most instances, we only have a sample to work with.
We cannot talk to all college students.
We cannot monitor all french fry orders.
On December 19, 2014, the front page of Spanish national newspaper El País read “Catalan public opinion swings toward ‘no’ for independence, says survey”.1

The probability of the tiny difference between the ‘No’ and ‘Yes’ being just due to random chance is very high.1

\[CI = \bar{x} \pm Z \left( \frac{\sigma}{\sqrt{n}} \right)\]
We rarely have the true \(\sigma\)…
This part here represents the standard error:
\[\left( \frac{\sigma}{\sqrt{n}} \right)\]
| Confidence Level | Z-Value (±) |
|---|---|
| 80% | 1.28 |
| 90% | 1.645 |
| 95% | 1.96 |
| 99% | 2.576 |
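The Z-values in the table can be recovered from the standard normal quantile function. A minimal sketch in base R, where `conf_levels` and `z_values` are illustrative names that do not appear in the slides:

```r
# A two-sided interval at confidence level C leaves (1 - C) / 2 in each tail,
# so the critical value is the (1 + C) / 2 quantile of the standard normal.
conf_levels <- c(0.80, 0.90, 0.95, 0.99)
z_values <- qnorm((1 + conf_levels) / 2)

data.frame(confidence_level = conf_levels, z_value = round(z_values, 3))
```

The table rounds the 80% value to two decimals (1.28); the other entries match to three decimals.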
A confidence interval gives us a range of plausible values for a population parameter.
A 95% confidence interval means that if we repeated our sampling many times, about 95% of the intervals constructed this way would contain the true population parameter.
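The repeated-sampling reading of a confidence level can be checked by simulation. The sketch below is only an illustration: the population values `mu` and `sigma`, the seed, and the number of repetitions are all hypothetical choices, not figures from the slides.

```r
set.seed(1)   # arbitrary seed, for reproducibility only

mu <- 6       # hypothetical true population mean
sigma <- 1.5  # hypothetical known population standard deviation
n <- 100      # sample size of each simulated study
z <- qnorm(0.975)

# Draw one sample, build its 95% CI, and report whether the CI captures mu
covers_mu <- function() {
  x <- rnorm(n, mean = mu, sd = sigma)
  half_width <- z * sigma / sqrt(n)
  abs(mean(x) - mu) <= half_width
}

# Share of 5000 simulated intervals that contain the true mean
coverage <- mean(replicate(5000, covers_mu()))
coverage  # should land close to 0.95
```

Any single interval either contains the true mean or it does not; the 95% describes how often the procedure succeeds across repeated samples.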
Suppose we survey 100 students about their average daily screen time (in hours) and find a sample mean of 6.2 hours with a standard error of 0.15.
The 95% confidence interval is:
\[ 6.2 \pm 1.96 \times 0.15 = (5.91, 6.49) \]
We are 95% confident that the true average daily screen time among all students is between 5.9 and 6.5 hours.
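The arithmetic in the screen-time example can be reproduced in a few lines of base R. The variable names are illustrative; the numbers are the ones in the formula above.

```r
x_bar <- 6.2  # sample mean used in the example
se <- 0.15    # standard error used in the example
z <- 1.96     # Z-value for 95% confidence

# Lower and upper bounds: x_bar minus / plus the margin of error
ci <- x_bar + c(-1, 1) * z * se
round(ci, 2)  # 5.91 6.49

# With n = 100, a standard error of 0.15 implies a standard deviation of 1.5
n <- 100
implied_sd <- se * sqrt(n)
```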
\[CI = \bar{x} \pm Z \left( \frac{\sigma}{\sqrt{n}} \right)\]
This is therefore a parametric method of calculating the CI. It relies on assuming that the sampling distribution is normal and that we know the parameters \(\sigma\) and \(\bar{x}\).
What proportion of Russians believe their country interfered in the 2016 US presidential election?
openintro package
For this example, we will use data from the openintro package. Install that package before running this code chunk.
Let’s use mutate() to recode the qualitative variable as a numeric one…
Now let’s calculate the mean and standard deviation of the try_influence variable…
And finally let’s draw a bar plot…
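The three steps above might look like the following sketch. It uses a small made-up data frame so that it runs without the survey data: `toy_data` is a hypothetical stand-in, and the column name `influence_2016` with levels "Did try" and "Did not try" is an assumption about how the openintro data is coded.

```r
library(dplyr)
library(ggplot2)

# Hypothetical stand-in for the survey data (values are made up)
toy_data <- tibble(
  influence_2016 = c("Did try", "Did not try", "Did not try", "Did try",
                     "Did not try", "Did not try", "Did not try", "Did not try")
)

# Step 1: recode the qualitative answer as a 0/1 numeric variable
toy_data <- toy_data |>
  mutate(try_influence = ifelse(influence_2016 == "Did try", 1, 0))

# Step 2: mean and standard deviation; the mean of a 0/1 variable
# is the sample proportion answering "Did try"
toy_summary <- toy_data |>
  summarize(mean = mean(try_influence), sd = sd(try_influence))

# Step 3: bar plot of the raw answers
toy_plot <- ggplot(toy_data, aes(x = influence_2016)) +
  geom_bar(fill = "steelblue4") +
  labs(x = "Response", y = "Count") +
  theme_bw()
```

With the real data, the same pipeline would start from the openintro data frame instead of `toy_data` and store the result as `russiaData`, the name the bootstrap code expects.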
tidymodels
Install tidymodels before running this code chunk…
#install.packages("tidymodels")
library(tidymodels)
set.seed(66)
boot_dist <- russiaData |>
  specify(response = try_influence) |>           # specify the variable of interest
  generate(reps = 10000, type = "bootstrap") |>  # generate 10000 bootstrap samples
  calculate(stat = "mean")                       # calculate the mean of each bootstrap sample
glimpse(boot_dist)
Rows: 10,000
Columns: 2
$ replicate <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 1…
$ stat <dbl> 0.1146245, 0.1442688, 0.1343874, 0.1877470, 0.1521739, 0.138…
Calculate the mean of the bootstrap distribution (of the means of the individual draws)…
Calculate the confidence interval. A 95% confidence interval is bounded by the middle 95% of the bootstrap distribution.
Create upper and lower bounds for visualization.
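One way to obtain the middle 95% is the percentile method: take the 2.5th and 97.5th percentiles of the bootstrap means. The sketch below is self-contained base R and rebuilds a small bootstrap distribution from a made-up 0/1 vector `x`, so its numbers are not the survey results. It reuses the names `boot_dist`, `stat`, `lower_bound` and `upper_bound` from the slides.

```r
set.seed(66)

# Hypothetical 0/1 sample standing in for try_influence (values are made up)
x <- rep(c(1, 0), times = c(30, 170))

# Resample with replacement and record the mean of each bootstrap sample
boot_dist <- data.frame(
  replicate = 1:2000,
  stat = replicate(2000, mean(sample(x, replace = TRUE)))
)

# Percentile method: cut 2.5% from each tail of the bootstrap distribution
lower_bound <- quantile(boot_dist$stat, 0.025)
upper_bound <- quantile(boot_dist$stat, 0.975)
c(lower_bound, upper_bound)
```

With the infer workflow, `get_ci(boot_dist, level = 0.95)` should return an equivalent percentile interval.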
Visualize with a histogram
ggplot(data = boot_dist, mapping = aes(x = stat)) +
  geom_histogram(binwidth = 0.01, fill = "steelblue4") +
  # linewidth replaces the deprecated size argument for lines (ggplot2 >= 3.4.0)
  geom_vline(xintercept = c(lower_bound, upper_bound), color = "darkgrey", linewidth = 1, linetype = "dashed") +
  labs(title = "Bootstrap distribution of means",
       subtitle = "and 95% confidence interval",
       x = "Estimate",
       y = "Frequency") +
  theme_bw()
infer package
The 95% confidence interval was calculated as (lower_bound, upper_bound). Which of the following is the correct interpretation of this interval?
(a) 95% of the time the percentage of Russians who believe that Russia interfered in the 2016 US elections is between lower_bound and upper_bound.
(b) 95% of all Russians believe that the chance Russia interfered in the 2016 US elections is between lower_bound and upper_bound.
(c) We are 95% confident that the proportion of Russians who believe that Russia interfered in the 2016 US election is between lower_bound and upper_bound.
(d) We are 95% confident that the proportion of Russians who supported interfering in the 2016 US elections is between lower_bound and upper_bound.
Answer: (c) We are 95% confident that the proportion of Russians who believe that Russia interfered in the 2016 US election is between lower_bound and upper_bound.
Given that the mean of 0-1 encoded data is equal to the proportion, use the following formula to estimate the confidence interval based on the sample of 506 Russians.
We estimate the standard error using:
\[ SE = \sqrt{ \frac{ \hat{p}(1 - \hat{p}) }{n} } \]
Tips:
nrow(your_data) (number of rows, which gives n)
your_data$col_name (extracts a single column as a vector)
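The standard error formula plugs into the same \(\bar{x} \pm Z \times SE\) pattern used earlier, with \(\hat{p}\) in place of \(\bar{x}\). A minimal sketch, where `prop_ci` and `example_x` are hypothetical names and the example vector is made up so the exercise answer is not given away:

```r
# Normal-approximation confidence interval for a proportion
prop_ci <- function(x, z = 1.96) {
  n <- length(x)
  p_hat <- mean(x)                     # mean of 0/1 data = sample proportion
  se <- sqrt(p_hat * (1 - p_hat) / n)  # standard error of the proportion
  c(lower = p_hat - z * se, upper = p_hat + z * se)
}

# Hypothetical 0/1 vector: 50 ones out of 100 (values are made up)
example_x <- rep(c(1, 0), each = 50)
prop_ci(example_x)  # lower 0.402, upper 0.598
```

For the exercise, the same function could be applied to the recoded 0/1 column of the survey data, and the result compared with the bootstrap interval.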