Classification

November 18, 2025

Big Picture Recap

So far, we have focused on prediction/estimation using linear regression.
We used features (inputs, \(X\)) to predict an outcome (response, \(Y\)).
Today we:
- Connect regression to classification.
- Set up a machine learning framework for training and testing models.

Linear Regression (Recap)

Goal: predict a numeric outcome \(Y\) from one or more predictors \(X\).
Examples:
- Predict exam score from hours studied.
- Predict body mass from flipper length.
Setup:
- Numeric \(X\) → directly into the model.
- Categorical \(X\) → convert to numeric first, then use regression.

If \(X\) is Categorical

We encode categories as numeric variables.

Binary category:
- Convert to 0/1 and proceed with regression.
- Example: Smoker
  - Smoker = 0 for “No”
  - Smoker = 1 for “Yes”
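
A minimal sketch of this 0/1 encoding in R. The `smoker` vector holds made-up survey responses and `smoker_01` is a hypothetical name used only for illustration.

```r
# Hypothetical survey responses (made-up values)
smoker <- c("No", "Yes", "Yes", "No")

# Encode "Yes" as 1 and "No" as 0 so the variable can enter a regression
smoker_01 <- ifelse(smoker == "Yes", 1, 0)
smoker_01
```

In practice `lm()` performs an equivalent conversion automatically when a predictor is stored as a factor or character column.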

If \(X\) is Categorical

Multi-level categorical predictors:
- Use one-hot encoding (indicator / dummy variables).
- Example: Color with levels: Red, Blue, Green
  - Create:
    - Color_Blue = 1 if Blue, 0 otherwise
    - Color_Green = 1 if Green, 0 otherwise
  - Baseline: Red (when both indicators are 0).
- With Red as the baseline, the encoding turns each category into a vector of indicators (e.g., Red = (0,0), Blue = (1,0), Green = (0,1)).
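
One way to see these indicator columns in R is `model.matrix()`, sketched below on a made-up data frame. The `paint` data frame is an assumption for illustration, and R names the columns `ColorBlue` and `ColorGreen` rather than `Color_Blue` and `Color_Green`.

```r
# Hypothetical data: one categorical column with three levels
paint <- data.frame(
  Color = factor(
    c("Red", "Blue", "Green", "Blue"),
    levels = c("Red", "Blue", "Green")  # the first level (Red) is the baseline
  )
)

# model.matrix() builds the indicator columns that lm() would use internally
indicators <- model.matrix(~ Color, data = paint)
indicators
```

The first row (Red) has 0 in both indicator columns, which matches the baseline described above.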

Multiple Linear Regression

Extends simple linear regression to multiple predictors.
Goal: predict a numeric outcome \(Y\) using several features:

\[\widehat{Y} = b_0 + b_1 X_1 + b_2 X_2 + \dots + b_p X_p\]

model <- lm(exam_score ~ hours_studied + hours_slept, data = my_data)
summary(model)

Many factors can contribute to an outcome.
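
The `lm()` call above assumes a data frame named `my_data` already exists. The sketch below fills in a small made-up version of it (the values are invented for illustration) so the fit and a prediction can be run end to end.

```r
# Hypothetical data (made-up values) with the columns used in the model
my_data <- data.frame(
  hours_studied = c(1, 2, 3, 4, 5, 6),
  hours_slept   = c(6, 8, 5, 7, 9, 6),
  exam_score    = c(52, 61, 60, 70, 80, 78)
)

model <- lm(exam_score ~ hours_studied + hours_slept, data = my_data)

# Estimated b_0, b_1, b_2 from the equation above
coef(model)

# Predicted score for a new student who studied 4 hours and slept 8
new_student <- data.frame(hours_studied = 4, hours_slept = 8)
predicted_score <- predict(model, newdata = new_student)
predicted_score
```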

Guessing the Value of a Variable

We often have to guess a value based on incomplete information.
One way to make a prediction (estimate):
- To predict an outcome for an individual,
- find others who are like that individual
- and whose outcomes you know.
- Use those outcomes as the basis of your prediction.
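
The steps above can be sketched as a small "nearest neighbors" average. This is only an illustration under stated assumptions: the `known` data frame is made up, and `predict_from_neighbors()` with its `k` argument is a hypothetical helper, not a function from any package.

```r
# Hypothetical individuals whose outcomes we already know (made-up values)
known <- data.frame(
  hours = c(1, 2, 4, 6, 8),
  score = c(55, 60, 70, 82, 90)
)

# Predict a score by averaging the k most similar known individuals
predict_from_neighbors <- function(new_hours, data, k = 2) {
  distance <- abs(data$hours - new_hours)   # how unlike each person is
  nearest  <- order(distance)[seq_len(k)]   # row numbers of the k closest
  mean(data$score[nearest])                 # their outcomes drive the guess
}

predict_from_neighbors(5, known)
```

For a student who studied 5 hours, the two closest known students studied 4 and 6 hours, so the prediction is the average of their scores, 76.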

Two Types of Prediction

- Regression: the outcome is numeric.
- Classification: the outcome is categorical.

Classification

Spam or not spam? Why do you think so?

Machine Learning Algorithm

A machine learning algorithm is a mathematical model, calculated from sample data (the “training data”), that makes predictions or decisions without being explicitly programmed to perform the task.
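
To check how a model behaves on data it was not calculated from, some labelled rows are usually held out for testing. The sketch below shows one simple way to do that in base R. The `patients` data frame is made up, and the 70/30 split and the seed are arbitrary choices for illustration.

```r
set.seed(42)  # arbitrary seed so the random split is reproducible

# Hypothetical labelled data (made-up values)
patients <- data.frame(
  hemoglobin = c(15.2, 11.1, 14.8, 9.6, 13.9, 16.0, 10.4, 15.5, 12.2, 14.1),
  class = c("No CKD", "CKD", "No CKD", "CKD", "No CKD",
            "No CKD", "CKD", "No CKD", "CKD", "No CKD")
)

# Randomly pick about 70% of the rows for training; hold out the rest for testing
n <- nrow(patients)
train_rows <- sample(seq_len(n), size = round(0.7 * n))
train <- patients[train_rows, ]
test  <- patients[-train_rows, ]

nrow(train)
nrow(test)
```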

Classification Examples: Text

Output: (Spam, Not Spam)

Classification Examples: Image

Output: (Car, Road, Tree, Sky, Traffic Sign)

Classification Examples: Videos

Output: (In, Out)

CKD Data

library(readr)
library(dplyr)
ckd <- read_csv("data/ckd.csv") |>
  rename(Glucose = `Blood Glucose Random`) |>
  mutate(
    Class = factor(
      Class,
      levels = c(0, 1),
      labels = c("No CKD", "CKD")
    )
  )

ckd |> count(Class)

# A tibble: 2 × 2
  Class      n
  <fct>  <int>
1 No CKD   115
2 CKD       43

Hemoglobin and Glucose

library(ggplot2)

ggplot(ckd, aes(x = Hemoglobin, y = Glucose, color = Class)) +
  geom_point() +
  scale_color_manual(
    values = c("No CKD" = "gold", "CKD" = "darkblue")
  ) +
  theme_minimal(base_size = 16)

White Blood Cell Count and Glucose

ggplot(ckd, aes(x = `White Blood Cell Count`, y = Glucose, color = Class)) +
  geom_point() +
  scale_color_manual(
    values = c("No CKD" = "gold", "CKD" = "darkblue")
  ) +
  theme_minimal(base_size = 16)

Manual Classifier

max_gluc_0 <- ckd |>
  filter(Class == "No CKD") |>
  summarize(max_glucose = max(Glucose)) |>
  pull(max_glucose)

min_hemo_0 <- ckd |>
  filter(Class == "No CKD") |>
  summarize(min_hemo = min(Hemoglobin)) |>
  pull(min_hemo)

max_gluc_0

[1] 140

min_hemo_0

[1] 13

Manual Classifier

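# Predict CKD (1) when hemoglobin is lower, or glucose is higher,
# than any value observed among the "No CKD" patients; otherwise predict 0.
classify_manually <- function(hemoglobin, glucose) {
  # `||` compares single values, so pass one patient at a time
  if (hemoglobin < min_hemo_0 || glucose > max_gluc_0) {
    return(1)
  } else {
    return(0)
  }
}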

classify_manually(15, 100)

[1] 0

classify_manually(10, 300)

[1] 1
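
A possible next step is to apply the same rule to several patients at once and count how often it is right. The sketch below is an assumption-laden illustration: `classify_rule()` is a hypothetical vectorized variant of the function above, the thresholds are the values 13 and 140 printed earlier, and the `toy` patients are made up rather than taken from `ckd.csv`.

```r
# Thresholds printed earlier in the slides
min_hemo_0 <- 13
max_gluc_0 <- 140

# Vectorized variant of the rule: `|` works element-wise, unlike `||`
classify_rule <- function(hemoglobin, glucose) {
  ifelse(hemoglobin < min_hemo_0 | glucose > max_gluc_0, "CKD", "No CKD")
}

# Hypothetical new patients (made-up values, not from ckd.csv)
toy <- data.frame(
  hemoglobin = c(15, 10, 14, 12.5),
  glucose    = c(100, 300, 120, 130),
  truth      = c("No CKD", "CKD", "No CKD", "No CKD")
)

predicted <- classify_rule(toy$hemoglobin, toy$glucose)
accuracy  <- mean(predicted == toy$truth)  # proportion classified correctly
accuracy
```

The last made-up patient has no CKD but a hemoglobin of 12.5, so the rule labels them "CKD" and the accuracy is 0.75. A rule tuned to match every patient already seen can still make mistakes on new ones.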

Classifiers

Regression vs Classification (Summary)

Regression:
- Output: numeric \(Y\) (e.g., income, temperature, score).
- Model: usually a line or curve.
- Fit by minimizing squared errors.
Classification:
- Output: category (e.g., spam / not spam).
- Model: decision boundary between classes.
- Fit by minimizing classification errors (or related loss).
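
The two fitting criteria can be compared side by side. This is a minimal sketch with made-up actual and predicted values; the variable names are hypothetical.

```r
# Regression: hypothetical actual and predicted numeric outcomes
y_actual    <- c(70, 82, 90)
y_predicted <- c(68, 85, 90)
mse <- mean((y_actual - y_predicted)^2)   # mean squared error

# Classification: hypothetical actual and predicted categories
class_actual    <- c("Spam", "Not Spam", "Spam", "Spam")
class_predicted <- c("Spam", "Spam", "Spam", "Not Spam")
error_rate <- mean(class_actual != class_predicted)  # share misclassified

mse
error_rate
```

Here the squared errors are 4, 9, and 0, so the mean squared error is 13/3. Two of the four labels are wrong, so the classification error rate is 0.5.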