Categorical vs. Continuous Data

Sarah Cassie Burnett

October 7, 2025

Today

What do we mean by categorical vs continuous?
Why types matter for summaries and visuals

What are some ways we can classify data?

anecdotal vs. representative
census vs. sample
observational vs. experimental
categorical vs. numerical
discrete vs. continuous
cross-sectional vs. time series
longitudinal vs. panel
unstructured vs. structured

Variable Types

Categorical
- Binary - two categories
- Nominal - multiple unordered categories
- Ordinal - multiple ordered categories
Numerical
- Continuous - a range of numbers (measurement data)
- Discrete - take on separate, distinct values (countable data)
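
As a rough illustration of how these types tend to show up in R, the sketch below builds one toy vector per type and inspects its class. The vector names and values are hypothetical and are not part of the dataset used later.

```r
# Hypothetical toy vectors, one per variable type (assumed for illustration)
binary     <- c(TRUE, FALSE, TRUE)                  # binary: two categories
nominal    <- factor(c("Walk", "Bike", "Car"))      # nominal: unordered labels
ordinal    <- factor(c("Low", "High", "Medium"),
                     levels = c("Low", "Medium", "High"),
                     ordered = TRUE)                # ordinal: ordered labels
continuous <- c(170.2, 165.8, 181.0)                # continuous: measurements
discrete   <- c(0L, 2L, 5L)                         # discrete: integer counts

# Inspect how R stores each one
class(binary)
class(nominal)
class(ordinal)
class(continuous)
class(discrete)

# Ordered factors support comparisons against a level
ordinal < "High"
```

The class R assigns is only a hint. Whether a variable is categorical or numerical depends on what the values mean, not only on how they are stored.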

Create a (random) daily-life dataset

library(tidyverse) # let's load in everything
set.seed(7) # we are going to create something random
            # the 'seed' is so my random looks like your random

N <- 60 #number of observations

lifelog <- tibble(
  id = 1:N,                                    # identifier (categorical by meaning)
  age = sample(18:35, N, replace = TRUE),      # numeric (discrete integer)
  height_cm = round(rnorm(N, mean = 170, sd = 10), 1),  # continuous
  commute_mode = sample(c("Walk","Bike","Transit","Car"), N, replace = TRUE), # categorical (nominal)
  coffee_cups = sample(0:5, N, replace = TRUE),      # counts (discrete numeric)
  coffee_today = coffee_cups > 0,               # logical/binary (categorical by meaning)
  study_hours = round(runif(N, 0, 6), 1),      # continuous
  mood = factor(sample(c("Low","Medium","High"), N, replace = TRUE),
                levels = c("Low","Medium","High"), ordered = TRUE), # categorical (ordinal)
  zip_code = sample(c("20001","20002","20037","20052"), N, replace = TRUE) # numeric-looking categorical
)

The Two Big Families

Family	Meaning	R classes you’ll see	Typical summaries	Typical visuals
Categorical	labels/groups (nominal or ordered)	`factor`, `ordered`, `character`, `logical`	counts, proportions	bar charts, stacked bars
Continuous	measurements in a range	`numeric`, `double`, `integer`	mean/median, sd/IQR, quantiles	histogram, line plot, scatterplot, boxplot

Categorical Data

Classify

What counts as categorical?

Nominal: commute_mode, zip_code
Ordinal: mood (Low < Medium < High)
Binary: coffee_today (TRUE/FALSE)

Summarize one categorical variable

# counts & proportions
lifelog |>
  count(commute_mode) |>
  mutate(prop = n / sum(n)) |>
  arrange(desc(n))

# A tibble: 4 × 3
  commute_mode     n  prop
  <chr>        <int> <dbl>
1 Car             17 0.283
2 Walk            17 0.283
3 Transit         15 0.25 
4 Bike            11 0.183

Visualize

Single categorical variable: use bar chart

ggplot(lifelog, aes(x = commute_mode)) +
  geom_bar(fill="chartreuse4") +
  labs(x = "Commute mode", y = "Count")

Visualize

Two categorical variables: dodged bars

ggplot(lifelog, aes(x = commute_mode, fill = mood)) +
  geom_bar(position = "dodge") +
  labs(x = "Commute mode", y = "Count", fill = "Mood (ordinal)")

Visualize

Two categorical variables: stacked bars scaled to proportions (each bar sums to 100%)

ggplot(lifelog, aes(x = commute_mode, fill = mood)) +
  geom_bar(position = "fill") +
  scale_y_continuous(labels = scales::percent) +
  labs(x = "Commute mode", y = "Proportion", fill = "Mood")
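
The proportions that `position = "fill"` draws can also be computed as a table. The sketch below uses a small hypothetical tibble (`toy`) in place of `lifelog` so it runs on its own; the same pipeline should apply to `lifelog` unchanged.

```r
library(dplyr)

# Hypothetical mini dataset standing in for lifelog
toy <- tibble(
  commute_mode = c("Walk", "Walk", "Walk", "Car", "Car", "Bike"),
  mood = factor(c("Low", "High", "High", "Medium", "Medium", "Low"),
                levels = c("Low", "Medium", "High"), ordered = TRUE)
)

# Count each (commute_mode, mood) pair, then convert the counts
# to proportions within each commute mode
mood_by_mode <- toy |>
  count(commute_mode, mood) |>
  group_by(commute_mode) |>
  mutate(prop = n / sum(n)) |>
  ungroup()

mood_by_mode
```

Within each commute mode the `prop` column sums to 1, which is what the filled bars display.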

Continuous Data

Classify

What counts as continuous?

Measurements: height_cm, study_hours
Counts: coffee_cups (discrete rather than continuous; integer-valued, but often summarized with the same tools)
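
A count variable sits between the two families, so it can be summarized both ways. The sketch below assumes a small hypothetical vector of coffee counts rather than the `lifelog` column.

```r
library(dplyr)

# Hypothetical counts of coffee cups for ten people
cups <- tibble(coffee_cups = c(0L, 1L, 1L, 2L, 0L, 3L, 1L, 5L, 2L, 1L))

# Frequency table: natural when there are only a few distinct values
cup_table <- cups |> count(coffee_cups)
cup_table

# Numeric summaries still make sense because the values are real quantities
cup_summary <- cups |>
  summarize(mean_cups = mean(coffee_cups), median_cups = median(coffee_cups))
cup_summary
```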

Summarize

lifelog |>
  summarize(
    mean_height   = mean(height_cm),
    sd_height     = sd(height_cm),
    median_height = median(height_cm),
    q25_height    = quantile(height_cm, 0.25),
    q75_height    = quantile(height_cm, 0.75),
    mean_study    = mean(study_hours),
    sd_study      = sd(study_hours),
    median_study  = median(study_hours),
    q25_study     = quantile(study_hours, 0.25),
    q75_study     = quantile(study_hours, 0.75)
  )

# A tibble: 1 × 10
  mean_height sd_height median_height q25_height q75_height mean_study sd_study
        <dbl>     <dbl>         <dbl>      <dbl>      <dbl>      <dbl>    <dbl>
1        172.      8.19           174       167.       178.       2.93     1.81
# ℹ 3 more variables: median_study <dbl>, q25_study <dbl>, q75_study <dbl>
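
One way to avoid repeating the same summary functions for each column is `across()` from dplyr. This is a sketch on a hypothetical five-row tibble, not the slide's own approach; the output column names follow dplyr's default `{column}_{function}` pattern.

```r
library(dplyr)

# Hypothetical measurements standing in for lifelog
toy <- tibble(
  height_cm   = c(160.5, 172.0, 168.3, 181.2, 175.4),
  study_hours = c(1.5, 3.0, 2.2, 4.8, 0.7)
)

# Apply the same set of summaries to both continuous columns
toy_summary <- toy |>
  summarize(across(
    c(height_cm, study_hours),
    list(mean = mean, sd = sd, median = median, iqr = IQR)
  ))

names(toy_summary)
```

Piping a wide one-row result like this into `glimpse()` prints it vertically, so no columns are hidden by the console width.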

Visualize

One continuous variable: histogram

# histogram
ggplot(lifelog, aes(x = height_cm)) +
  geom_histogram(binwidth = 5, fill="chartreuse4") +
  labs(x = "Height (cm)", y = "Count")

Visualize

Two continuous variables: scatterplots

ggplot(lifelog, aes(x = height_cm, y = study_hours)) +
  geom_point(color = "darkorange", alpha = 0.7) +
  labs(x = "Height (cm)", y = "Study hours (per day)")

If one of the variables were time, a line chart would be the better choice here.
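
For the time case, a minimal sketch might look like the following. The `study_log` data frame is hypothetical (the `lifelog` dataset has no time variable), and the values are made up for illustration.

```r
library(ggplot2)

# Hypothetical two-week log of study hours (not part of lifelog)
study_log <- data.frame(
  day = 1:14,
  study_hours = c(2.0, 2.5, 1.0, 3.5, 4.0, 0.5, 0.0,
                  2.5, 3.0, 1.5, 3.5, 4.5, 1.0, 0.5)
)

# Connect observations in time order; points mark each day's value
p <- ggplot(study_log, aes(x = day, y = study_hours)) +
  geom_line(color = "darkorange") +
  geom_point(color = "darkorange") +
  labs(x = "Day", y = "Study hours (per day)")

p
```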

Visualize

What if you want to visualize continuous distributions across different categories?

Visualize

Overlaid histograms of a continuous variable by category (hard to read)

ggplot(lifelog, aes(x = study_hours, fill = commute_mode)) +
  geom_histogram(binwidth = 0.5, alpha = 0.6, position = "identity", color = "white") +
  labs(
    x = "Study hours (per day)",
    y = "Count",
    fill = "Commute mode",
    title = "Distribution of study hours by commute mode"
  )

Visualize

Boxplot of continuous by categorical

ggplot(lifelog, aes(x = commute_mode, y = study_hours)) +
  geom_boxplot() +
  labs(x = "Commute mode", y = "Study hours (per day)")
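
The numbers behind a boxplot (median and quartiles per group) can be tabulated with a grouped summary. The sketch below uses a hypothetical six-row tibble so it is self-contained; the same pipeline should work on `lifelog`.

```r
library(dplyr)

# Hypothetical data standing in for lifelog
toy <- tibble(
  commute_mode = c("Walk", "Walk", "Walk", "Car", "Car", "Car"),
  study_hours  = c(1.0, 2.0, 3.0, 2.0, 4.0, 6.0)
)

# One row per commute mode with the quantities a boxplot draws
by_mode <- toy |>
  group_by(commute_mode) |>
  summarize(
    n      = n(),
    median = median(study_hours),
    q25    = quantile(study_hours, 0.25),
    q75    = quantile(study_hours, 0.75)
  )

by_mode
```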

Try it out!

Best visualization for each:
- height_cm → ?
- study_hours → ?
- coffee_cups (discrete count) → ?

Classify: continuous or categorical (nominal, ordinal, binary)
- coffee_today → ?
- mood → ?
- id, zip_code → ?

Summarize:
- Categorical → ?
- Continuous → ?

Try it out!

Best visualization for each:
- height_cm → histogram, boxplot
- study_hours → histogram, boxplot
- coffee_cups (discrete count) → histogram, boxplot

Classify: continuous or categorical (nominal, ordinal, binary)
- coffee_today → binary categorical
- mood → ordinal categorical
- id, zip_code → categorical identifiers

Summarize:
- Categorical → counts, proportions
- Continuous → mean/median, SD/IQR, quantiles
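
To see why `zip_code` is classified as categorical even though it looks numeric, consider what happens when ZIP codes are stored as numbers. The vector below is hypothetical.

```r
# Hypothetical ZIP codes stored as numbers by mistake
zip_num <- c(20001, 20002, 20037, 20052, 20001)

# R will compute this without complaint, but the result is not a meaningful ZIP code
mean(zip_num)

# Treat the values as labels instead: convert to character and count
zip_chr <- as.character(zip_num)
zip_counts <- table(zip_chr)
zip_counts
```

Counts per ZIP code answer a real question (how many people live where); an average ZIP code does not.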

Other practices (continuous)

Don’t take means of numbers that are used as labels, such as IDs or ZIP codes
Check for outliers

# quick outlier peek for height
lifelog |>
  summarize(q05 = quantile(height_cm, .05), q95 = quantile(height_cm, .95))

# A tibble: 1 × 2
    q05   q95
  <dbl> <dbl>
1  160.  184.
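
Beyond looking at the 5th and 95th percentiles, a common convention is to flag values more than 1.5 × IQR beyond the quartiles, the same rule `geom_boxplot()` uses by default to draw individual points. The sketch below applies that rule to a hypothetical vector of heights; the threshold of 1.5 is a convention, not something taken from the slides.

```r
# Hypothetical heights in cm, with one suspicious value
height <- c(158.2, 163.5, 167.0, 170.1, 171.4, 174.8, 178.3, 181.0, 215.0)

# Quartiles and interquartile range
q1  <- quantile(height, 0.25, names = FALSE)
q3  <- quantile(height, 0.75, names = FALSE)
iqr <- q3 - q1

# Fences at 1.5 * IQR beyond the quartiles (a convention, assumed here)
lower <- q1 - 1.5 * iqr
upper <- q3 + 1.5 * iqr

# Values outside the fences are candidates for a closer look
flagged <- height[height < lower | height > upper]
flagged
```

A flagged value is a prompt to check the data, not proof of an error.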

Other practices (categorical)

Unseen levels: set factor levels explicitly so that ordering stays consistent, even for levels that do not appear in the data

Factor vs. vector of characters

tbl <- tibble(month = c("March", "January", "February"))

# Default: character columns arrange alphabetically
tbl |> arrange(month)

# A tibble: 3 × 1
  month   
  <chr>   
1 February
2 January 
3 March

tbl <- tbl |>
  mutate(month = factor(month, levels = c("January", "February", "March")))

# Now arrange() follows factor levels
tbl |> arrange(month)

# A tibble: 3 × 1
  month   
  <fct>   
1 January 
2 February
3 March
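
Explicit levels also help with levels that never appear in the data. In this sketch, the hypothetical `responses` tibble has no "Medium" rows; `count()` drops that level by default, while `.drop = FALSE` keeps it with a count of zero.

```r
library(dplyr)

# Hypothetical responses: "Medium" is a valid level but was never observed
responses <- tibble(
  mood = factor(c("Low", "High", "High"),
                levels = c("Low", "Medium", "High"))
)

# Default: only observed levels appear
observed_only <- responses |> count(mood)
observed_only

# Keep every declared level, including those with zero observations
all_levels <- responses |> count(mood, .drop = FALSE)
all_levels
```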