This study postulated a quadratic relation between the weight lifted and the shot put distance.
Add the quadratic term to the model:

```r
library(broom)  # for augment()

linear_fit <- lm(`Shot Put Distance` ~ `Weight Lifted`, data = shotput)
quad_fit   <- lm(`Shot Put Distance` ~ `Weight Lifted` + I(`Weight Lifted`^2),
                 data = shotput)

# augment() returns the original data + fitted values and residuals
aug_lin  <- augment(linear_fit)
aug_quad <- augment(quad_fit)
```
Example: Shotput
Compare the linear and quadratic fits visually.
```r
library(patchwork)  # for combining plots with `+`

p1 <- ggplot(shotput, aes(x = `Weight Lifted`, y = `Shot Put Distance`)) +
  geom_point(alpha = 0.6) +
  geom_line(data = aug_lin, aes(y = .fitted), color = "#8E2C90", linewidth = 1) +
  labs(
    title = "Linear Fit",
    x = "Weight Lifted (kg)",
    y = "Shot Put Distance (m)"
  ) +
  theme_bw(base_size = 18)

p2 <- ggplot(shotput, aes(x = `Weight Lifted`, y = `Shot Put Distance`)) +
  geom_point(alpha = 0.6) +
  geom_line(data = aug_quad, aes(y = .fitted), color = "steelblue4", linewidth = 1) +
  labs(
    title = "Quadratic Fit",
    x = "Weight Lifted (kg)",
    y = "Shot Put Distance (m)"
  ) +
  theme_bw(base_size = 18)

p1 + p2
```
Least squares still minimizes SSR, but if the model form is wrong (e.g., a line fit to a curved relation), even the minimized SSR can be large. Choose features and terms that reflect the pattern in the data.
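A quick check on the fits above: the quadratic model should attain a noticeably smaller SSR than the line.

```r
# Compare the minimized SSR of the two model forms
c(
  SSR_linear    = sum(residuals(linear_fit)^2),
  SSR_quadratic = sum(residuals(quad_fit)^2)
)
```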
Try it out!
Replace `Weight Lifted` with a log transform using log(), or standardize it with scale(); refit both models and compare SSR.
Try a cubic fit by adding I(`Weight Lifted`^3). Does adjusted \(R^2\) improve meaningfully?
Try it out!
```r
log_fit <- lm(`Shot Put Distance` ~ log(`Weight Lifted`), data = shotput)
std_fit <- lm(scale(`Shot Put Distance`) ~ scale(`Weight Lifted`), data = shotput)

# Note: std_fit standardizes the response as well, so its SSR is on a
# different scale and is not directly comparable to SSR_log
c(
  SSR_log          = sum(residuals(log_fit)^2),
  SSR_standardized = sum(residuals(std_fit)^2)
)
```

```
         SSR_log SSR_standardized 
       27.600432         5.469267 
```
Try it out!
```r
cubic_fit <- lm(
  `Shot Put Distance` ~ `Weight Lifted` + I(`Weight Lifted`^2) + I(`Weight Lifted`^3),
  data = shotput
)

summary(cubic_fit)$adj.r.squared
#> [1] 0.871463
summary(quad_fit)$adj.r.squared
#> [1] 0.8581642
summary(linear_fit)$adj.r.squared
#> [1] 0.7896436
```

The cubic term adds only about 0.013 in adjusted \(R^2\) over the quadratic fit, so the extra flexibility is hard to justify.
Beyond Fit: Training vs. Testing Data
A model that fits the training data extremely well may not perform well on new data.
Training data: used to estimate the model parameters (fit).
Testing data: used to evaluate how well the model generalizes.
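A minimal sketch of this split in R, assuming the `shotput` data from above (the 80/20 fraction and the seed are arbitrary illustrative choices):

```r
set.seed(123)                                  # reproducible split
n         <- nrow(shotput)
train_idx <- sample(n, size = floor(0.8 * n)) # 80% of rows for training

train <- shotput[train_idx, ]
test  <- shotput[-train_idx, ]

# Fit on the training data only
fit <- lm(`Shot Put Distance` ~ `Weight Lifted` + I(`Weight Lifted`^2),
          data = train)

# Evaluate generalization on the held-out test data
test_pred <- predict(fit, newdata = test)
sum((test$`Shot Put Distance` - test_pred)^2)  # test SSR
```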
Overfitting vs. Underfitting
Model Validation Trends
Training error keeps decreasing (or at least never increases) as model complexity grows.
Test error eventually increases, a sign of overfitting.
Always check generalization using test data, cross-validation, or resampling.
The goal isn’t the lowest SSR on training data — it’s the lowest error on unseen data.
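As a rough sketch, k-fold cross-validation can be written directly in base R (the fold count and seed below are illustrative assumptions):

```r
set.seed(123)
k     <- 5
folds <- sample(rep(1:k, length.out = nrow(shotput)))  # random fold labels

# For each fold: fit on the other folds, score SSR on the held-out fold
cv_sse <- sapply(1:k, function(i) {
  train <- shotput[folds != i, ]
  test  <- shotput[folds == i, ]
  fit   <- lm(`Shot Put Distance` ~ `Weight Lifted` + I(`Weight Lifted`^2),
              data = train)
  sum((test$`Shot Put Distance` - predict(fit, newdata = test))^2)
})
sum(cv_sse)  # total held-out SSR across all folds
```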
When Least Squares Doesn't Work
When the model is nonlinear in the parameters, there's no algebraic (closed-form) way to solve for the \(b\)'s; the coefficients must instead be found by iterative numerical search.
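A minimal sketch with simulated data and R's built-in nls() (the exponential model, seed, and starting-value trick are illustrative assumptions, not from the study above):

```r
# Simulated data from a model that is nonlinear in its parameters:
# y = b0 * exp(b1 * x) + noise
set.seed(1)
x <- seq(1, 10, length.out = 50)
y <- 2 * exp(0.3 * x) + rnorm(50, sd = 1)

# A log-linear lm() fit supplies rough starting values for the search
init <- coef(lm(log(y) ~ x))

# nls() minimizes the SSR by iterative numerical search; there is no
# closed-form solution for b0 and b1 here
nl_fit <- nls(y ~ b0 * exp(b1 * x),
              start = list(b0 = exp(init[[1]]), b1 = init[[2]]))
coef(nl_fit)
```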