Histograms are used to visualize the distribution of a single variable
The x-axis represents the values of the variable of interest
The y-axis represents the frequency of the values in each bin
Purpose of Histograms
They are generally used for continuous variables (e.g., income, age, etc.)
A continuous variable is one that can take on any value within a range (e.g., 0.5, 1.2, 3.7, etc.)
A discrete variable is one that can only take on certain values (e.g., 1, 2, 3, etc.)
Typically, the height of the bar represents the number of observations which fall in that bin
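To make the idea of a bin concrete, here is a minimal sketch in base R that counts observations per bin by hand. The `lengths` vector is made up for illustration (it is not the film data), and the bin edges are an arbitrary choice.

```r
# Hypothetical movie lengths in minutes (made-up values, not the film data)
lengths <- c(88, 92, 95, 101, 104, 110, 118, 121, 135, 142)

# Split the range into bins of width 20; right = FALSE makes each bin [a, b)
breaks <- seq(80, 160, by = 20)
bins <- cut(lengths, breaks = breaks, right = FALSE)

# Count the observations in each bin; these counts are the bar heights
bin_counts <- table(bins)
print(bin_counts)
```

A histogram draws one bar per bin, with the bar height set to the corresponding count.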
Example
Histogram Code
# load packages: dplyr for filter(), readr for read_csv(), ggplot2 for plotting
library(dplyr)
library(readr)
library(ggplot2)

# load data
films <- read_csv("data/film_cleanish.csv")

# filter for the films with awards (Awards is assumed to be a logical column)
films_w_awards <- films |>
  filter(Awards)

# create histogram
ggplot(films_w_awards, aes(x = Length)) +
  geom_histogram(fill = "chartreuse4") +
  labs(
    x = "Length of the Movie",
    y = "Number of Movies",
    title = "Movie Lengths of Award Winning Classic Films",
    caption = "Source: Telecom ParisTech"
  ) +
  theme_minimal()
Note that you only need to specify the x-axis variable in the aes() function. For a histogram, ggplot2 automatically computes the counts shown on the y-axis.
Change Number of Bins
Change the number of bins (bars) with the bins or binwidth argument (the default is 30 bins):
ggplot(films_w_awards, aes(x = Length)) +
  geom_histogram(bins = 50, fill = "chartreuse4") +
  labs(
    x = "Length of the Movie",
    y = "Number of Movies",
    title = "Movie Lengths of Award Winning Classic Films",
    caption = "Source: Telecom ParisTech"
  ) +
  theme_minimal()
At 50 bins…
At 100 bins…probably too many!
Using binwidth instead of bins…
ggplot(films_w_awards, aes(x = Length)) +
  geom_histogram(binwidth = 20, fill = "chartreuse4") +
  labs(
    x = "Length of the Movie",
    y = "Number of Movies",
    title = "Movie Lengths of Award Winning Classic Films",
    caption = "Source: Telecom ParisTech"
  ) +
  theme_minimal()
Setting binwidth to 2…
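The bins and binwidth arguments are two ways of controlling the same thing, linked through the range of the data. The rough sketch below uses made-up lengths rather than the film data, and `approx_bins` is a hypothetical helper, not a ggplot2 function. ggplot2 may draw one bin more or fewer than this estimate, depending on how it aligns the bin edges.

```r
# Hypothetical movie lengths in minutes (made-up values, not the film data)
lengths <- c(88, 92, 95, 101, 104, 110, 118, 121, 135, 142)

# Width of the interval that the bins have to cover
value_range <- diff(range(lengths))

# Hypothetical helper: approximate number of bins for a given bin width
approx_bins <- function(binwidth) ceiling(value_range / binwidth)

print(approx_bins(20))  # wide bins give few bars
print(approx_bins(2))   # narrow bins give many bars
```

A smaller binwidth therefore has the same effect as a larger bins value: more, narrower bars.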
Change from Count to Density
ggplot(films_w_awards, aes(x = Length, y = after_stat(density))) +
  geom_histogram(fill = "chartreuse4") +
  labs(
    x = "Length of the Movie",
    y = "Density",
    title = "Movie Lengths of Award Winning Classic Films",
    caption = "Source: Telecom ParisTech"
  ) +
  theme_minimal()
For densities, the total area of the bars sums to 1. The area of a bar (its height times the bin width) represents the proportion of observations in that bin; the height itself is a density rather than a count.
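The area property can be checked numerically. This is a minimal sketch using base R's hist() with plot = FALSE, on made-up lengths rather than the film data; the bin edges are an arbitrary choice.

```r
# Hypothetical movie lengths in minutes (made-up values, not the film data)
lengths <- c(88, 92, 95, 101, 104, 110, 118, 121, 135, 142)

# Compute the histogram without drawing it
h <- hist(lengths, breaks = seq(80, 160, by = 20), plot = FALSE)

# Area of each bar = density (bar height) times bin width
bar_areas <- h$density * diff(h$breaks)

print(bar_areas)       # proportion of observations in each bin
print(sum(bar_areas))  # total area of all bars
```

Each bar's area matches the share of observations in its bin, so the areas add up to 1 however wide the bins are.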
Which gives us…
Try it out!
Pick a variable that you want to explore the distribution of