Introduction
This tutorial demonstrates how to perform and interpret a simple
multiple linear regression in R.
We will predict a basketball player’s Points per Game
(Points) from several predictors such as Minutes
Played, Field Goals Made,
Assists, and Rebounds.
All code is commented for beginners. We use a simulated dataset — replace with your real dataset if available (CSV with same column names).
Packages
Install packages if you haven’t already (run once)
install.packages(c(“tidyverse”,“broom”,“car”,“MASS”,“caret”,“ggpubr”))
library(tidyverse)
library(broom) # tidy model output
library(car) # VIF & diagnostic plots
library(MASS) # stepAIC for model selection
library(caret) # cross-validation
library(ggpubr) # nicer ggplots
1. Create example dataset (or load your own)
set.seed(2025)
# Simulated player-match level data (n = 120)
n <- 120
player_data <- tibble(
PlayerID = sample(1:30, n, replace = TRUE),
Minutes = round(rnorm(n, mean = 30, sd = 5),1),
FieldGoals = rpois(n, lambda = 6), # count of field goals made
Assists = rpois(n, lambda = 4),
Rebounds = rpois(n, lambda = 6),
Turnovers = rpois(n, lambda = 2),
Age = sample(18:36, n, replace = TRUE)
)
# Create Points as a function of predictors + noise
player_data <- player_data %>%
mutate(
Points = round( 0.8*Minutes + 3.5*FieldGoals + 1.8*Assists + 1.1*Rebounds - 2.0*Turnovers + rnorm(n,0,6), 1)
)
conflicts()
## [1] "%>%" "%>%" "%>%" "group_by"
## [5] "intersect" "mutate" "n" "recode"
## [9] "select" "setdiff" "union" "%>%"
## [13] "lift" "some" "%>%" "all_of"
## [17] "any_of" "as_tibble" "contains" "ends_with"
## [21] "everything" "last_col" "matches" "num_range"
## [25] "one_of" "starts_with" "tibble" "tribble"
## [29] "%>%" "add_row" "as_data_frame" "as_tibble"
## [33] "data_frame" "glimpse" "lst" "tibble"
## [37] "tribble" "type_sum" "as_label" "enexpr"
## [41] "enexprs" "enquo" "enquos" "ensym"
## [45] "ensyms" "expr" "quo" "quo_name"
## [49] "quos" "sym" "syms" "vars"
## [53] "filter" "lag" "npk" "Arith"
## [57] "Compare" "show" "%||%" "as.difftime"
## [61] "body<-" "date" "intersect" "kronecker"
## [65] "plot" "Position" "setdiff" "setequal"
## [69] "union"
# Quick glimpse and group summary
library(tidyverse)
library(skimr)
player_data %>%
dplyr::select(Minutes, FieldGoals, Assists, Rebounds, Turnovers, Age, Points) %>%
skim()
| Name | Piped data |
| Number of rows | 120 |
| Number of columns | 7 |
| _______________________ | |
| Column type frequency: | |
| numeric | 7 |
| ________________________ | |
| Group variables | None |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| Minutes | 0 | 1 | 29.33 | 5.09 | 17.7 | 25.50 | 29.30 | 33.35 | 41.3 | ▂▇▇▆▂ |
| FieldGoals | 0 | 1 | 5.94 | 2.14 | 0.0 | 5.00 | 6.00 | 7.00 | 12.0 | ▁▃▇▂▁ |
| Assists | 0 | 1 | 4.05 | 1.97 | 0.0 | 3.00 | 4.00 | 6.00 | 10.0 | ▆▇▆▂▁ |
| Rebounds | 0 | 1 | 6.30 | 2.68 | 1.0 | 4.00 | 6.00 | 8.00 | 14.0 | ▂▇▅▃▁ |
| Turnovers | 0 | 1 | 2.08 | 1.36 | 0.0 | 1.00 | 2.00 | 3.00 | 7.0 | ▇▇▅▂▁ |
| Age | 0 | 1 | 26.71 | 5.24 | 18.0 | 22.00 | 27.00 | 31.00 | 36.0 | ▆▃▆▇▃ |
| Points | 0 | 1 | 55.45 | 11.57 | 25.1 | 47.03 | 55.55 | 63.62 | 81.0 | ▁▅▇▆▂ |
2. Exploratory data analysis (EDA)
Summary table by variable
# Load required libraries
library(tidyverse)
library(kableExtra)
# Summarise and format data
player_data %>%
summarise(across(
c(Minutes, FieldGoals, Assists, Rebounds, Turnovers, Age, Points),
list(N = ~n(), Mean = ~mean(.), SD = ~sd(.))
)) %>%
pivot_longer(everything()) %>%
kable() %>%
kable_styling()
| name | value |
|---|---|
| Minutes_N | 120.000000 |
| Minutes_Mean | 29.327500 |
| Minutes_SD | 5.091119 |
| FieldGoals_N | 120.000000 |
| FieldGoals_Mean | 5.941667 |
| FieldGoals_SD | 2.135321 |
| Assists_N | 120.000000 |
| Assists_Mean | 4.050000 |
| Assists_SD | 1.965458 |
| Rebounds_N | 120.000000 |
| Rebounds_Mean | 6.300000 |
| Rebounds_SD | 2.683908 |
| Turnovers_N | 120.000000 |
| Turnovers_Mean | 2.075000 |
| Turnovers_SD | 1.360688 |
| Age_N | 120.000000 |
| Age_Mean | 26.708333 |
| Age_SD | 5.239469 |
| Points_N | 120.000000 |
| Points_Mean | 55.448333 |
| Points_SD | 11.566301 |
#Scatterplots of Points vs a few predictors
p1 <- ggplot(player_data, aes(x = Minutes, y = Points)) + geom_point() + geom_smooth(method="lm", se=FALSE) + labs(title="Points vs Minutes")
p2 <- ggplot(player_data, aes(x = FieldGoals, y = Points)) + geom_jitter(width = 0.2, height = 0) + geom_smooth(method="lm", se=FALSE) + labs(title="Points vs FieldGoals")
p3 <- ggplot(player_data, aes(x = Assists, y = Points)) + geom_jitter(width = 0.2, height = 0) + geom_smooth(method="lm", se=FALSE) + labs(title="Points vs Assists")
ggarrange(p1, p2, p3, ncol = 1, nrow = 3)
Interpretation: Look for linear relationships and obvious outliers. These plots suggest positive relationships between Points and Minutes/FieldGoals/Assists.
3. Fit the initial multiple linear regression model
# Fit model: Points predicted by Minutes, FieldGoals, Assists, Rebounds, Turnovers, Age
mod_full <- lm(Points ~ Minutes + FieldGoals + Assists + Rebounds + Turnovers + Age, data = player_data)
summary(mod_full) %>% broom::tidy() %>% kable() %>% kable_styling()
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | -0.1688318 | 5.5149965 | -0.0306132 | 0.9756320 |
| Minutes | 0.9478291 | 0.1237780 | 7.6574923 | 0.0000000 |
| FieldGoals | 3.4687991 | 0.2874936 | 12.0656586 | 0.0000000 |
| Assists | 1.4007610 | 0.3113271 | 4.4993226 | 0.0000166 |
| Rebounds | 1.2249986 | 0.2286981 | 5.3564002 | 0.0000005 |
| Turnovers | -2.1670162 | 0.4659160 | -4.6510873 | 0.0000090 |
| Age | -0.0630797 | 0.1164661 | -0.5416146 | 0.5891503 |
The multiple linear regression model examined the factors influencing player scoring performance (Points) using Minutes, FieldGoals, Assists, Rebounds, Turnovers, and Age as predictors. The results showed that FieldGoals (β = 3.47, p < 0.001), Minutes (β = 0.95, p < 0.001), Assists (β = 1.40, p < 0.001), and Rebounds (β = 1.22, p < 0.001) were all significant positive predictors of points scored, indicating that increased offensive involvement strongly contributes to higher performance. Turnovers (β = -2.17, p < 0.001) had a significant negative effect, suggesting that more mistakes reduce scoring outcomes. Age (β = -0.06, p = 0.589) was not significant, implying that age does not meaningfully influence scoring in this dataset. Overall, the model highlights that on-court activity and efficiency measures are the main drivers of a player’s point performance.
4. Check assumptions
4.1 Linearity & homoscedasticity — residuals vs fitted
plot(mod_full, which = 1) # Residuals vs Fitted
4.2 Normality of residuals — QQ plot & Shapiro-Wilk
plot(mod_full, which = 2) # Normal QQ
shapiro.test(residuals(mod_full)) # formal test (note: sensitive to large n)
##
## Shapiro-Wilk normality test
##
## data: residuals(mod_full)
## W = 0.98222, p-value = 0.1145
4.3 Multicollinearity — Variance Inflation Factor (VIF)
vif_values <- car::vif(mod_full)
vif_values %>% as_tibble(rownames = "predictor") %>% rename(VIF = value) %>% kable() %>% kable_styling()
| predictor | VIF |
|---|---|
| Minutes | 1.093360 |
| FieldGoals | 1.037606 |
| Assists | 1.030887 |
| Rebounds | 1.037314 |
| Turnovers | 1.106578 |
| Age | 1.025235 |
Notes:
VIF > 5 (or > 10) indicates problematic multicollinearity.
If homoscedasticity violated → consider robust SEs or transform the outcome.
5. Influence & outliers
plot(mod_full, which = 4) # Cook's distance
# List any observations with high influence
cooks <- cooks.distance(mod_full)
which(cooks > (4/length(cooks))) # indices of potentially influential points
## 4 44 47 48 72 73 98 117
## 4 44 47 48 72 73 98 117
If a few points are highly influential, inspect them and decide whether they are data errors or true extremes
Cook’s Distance was used to identify influential observations in the model. The plot and calculated values showed that observations 4, 44, 47, 48, 72, 73, 98, and 117 have relatively high influence compared to others. These data points may have a stronger impact on the regression coefficients and could indicate potential outliers or unusual player performances. However, their influence should be checked further before deciding whether to remove them.
6. Model simplification / selection (stepwise AIC)
mod_step <- stepAIC(mod_full, direction = "both", trace = FALSE)
summary(mod_step)
##
## Call:
## lm(formula = Points ~ Minutes + FieldGoals + Assists + Rebounds +
## Turnovers, data = player_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.6350 -4.1476 -0.3008 3.2340 19.2602
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.7305 4.6866 -0.369 0.713
## Minutes 0.9498 0.1233 7.700 5.39e-12 ***
## FieldGoals 3.4587 0.2860 12.093 < 2e-16 ***
## Assists 1.3836 0.3087 4.481 1.77e-05 ***
## Rebounds 1.2253 0.2280 5.375 4.12e-07 ***
## Turnovers -2.1922 0.4622 -4.743 6.14e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.554 on 114 degrees of freedom
## Multiple R-squared: 0.6924, Adjusted R-squared: 0.6789
## F-statistic: 51.33 on 5 and 114 DF, p-value: < 2.2e-16
broom::tidy(mod_step) %>% kable() %>% kable_styling()
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | -1.7304907 | 4.6866340 | -0.3692396 | 0.7126341 |
| Minutes | 0.9497511 | 0.1233430 | 7.7000778 | 0.0000000 |
| FieldGoals | 3.4587464 | 0.2860032 | 12.0933827 | 0.0000000 |
| Assists | 1.3835595 | 0.3087415 | 4.4812876 | 0.0000177 |
| Rebounds | 1.2253310 | 0.2279874 | 5.3745568 | 0.0000004 |
| Turnovers | -2.1921544 | 0.4621594 | -4.7432869 | 0.0000061 |
To simplify the regression model, a stepwise selection based on AIC was performed. The final model kept the variables Minutes, FieldGoals, Assists, Rebounds, and Turnovers, while Age was removed because it did not significantly improve the model.
The simplified model explained about 68% of the variation in player points (Adjusted R² = 0.679), showing that these performance measures are strong predictors of scoring. The AIC value for the stepwise model was slightly lower than that of the full model, indicating a better fit with fewer predictors.
All remaining variables were highly significant (p < 0.001), and the residual standard error (6.55) suggests that the model predicts points with good accuracy. Overall, the stepwise model is more efficient and interpretable without losing much explanatory power.
7. Model performance metrics (R², adj-R², RMSE)
# Full model metrics
glance_full <- broom::glance(mod_full)
glance_step <- broom::glance(mod_step)
tibble(
Model = c("Full", "Stepwise"),
R2 = c(glance_full$r.squared, glance_step$r.squared),
Adj_R2 = c(glance_full$adj.r.squared, glance_step$adj.r.squared),
AIC = c(glance_full$AIC, glance_step$AIC)
) %>% kable() %>% kable_styling()
| Model | R2 | Adj_R2 | AIC |
|---|---|---|---|
| Full | 0.6932112 | 0.6769215 | 801.2926 |
| Stepwise | 0.6924148 | 0.6789242 | 799.6037 |
# RMSE (residual standard error)
sigma(mod_full) # RMSE for full
## [1] 6.574285
sigma(mod_step) # RMSE for stepwise
## [1] 6.553877
The performance of the full and stepwise regression models was compared using R², Adjusted R², AIC, and RMSE. The full model achieved an Adjusted R² of 0.679 with an RMSE of 6.57, while the stepwise model showed a very similar Adjusted R² (0.679) and a slightly lower RMSE (6.55). The AIC value for the stepwise model was also marginally lower, indicating a slightly better trade-off between model fit and complexity. Overall, both models performed almost identically, but the stepwise model is preferred because it is simpler, more efficient, and retains only the most important predictors without losing predictive accuracy.
8. Simple k-fold cross-validation (estimate RMSE on unseen data)
set.seed(2025)
train_control <- trainControl(method = "cv", number = 5)
# caret formula using mod_step predictors
formula_step <- formula(mod_step)
cv_model <- train(formula_step, data = player_data,
method = "lm",
trControl = train_control,
na.action = na.omit)
cv_model$results # includes RMSE and Rsquared
## intercept RMSE Rsquared MAE RMSESD RsquaredSD MAESD
## 1 TRUE 6.556177 0.6845598 5.181276 1.154802 0.08904792 0.9220103
A simple k-fold cross-validation was conducted to evaluate the predictive performance of the regression model on unseen data. The results showed an average RMSE of 6.56, indicating that the model’s point predictions typically deviate by about 6.5 points from the actual values. The R² value of 0.68 suggests that the model explains around 68% of the variability in player points, which aligns well with the in-sample performance. The Mean Absolute Error (MAE) was 5.18, showing relatively small average errors. Overall, the cross-validation results confirm that the model performs consistently and generalizes well to new data.
9. Final model interpretation
summary(mod_step)
##
## Call:
## lm(formula = Points ~ Minutes + FieldGoals + Assists + Rebounds +
## Turnovers, data = player_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.6350 -4.1476 -0.3008 3.2340 19.2602
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.7305 4.6866 -0.369 0.713
## Minutes 0.9498 0.1233 7.700 5.39e-12 ***
## FieldGoals 3.4587 0.2860 12.093 < 2e-16 ***
## Assists 1.3836 0.3087 4.481 1.77e-05 ***
## Rebounds 1.2253 0.2280 5.375 4.12e-07 ***
## Turnovers -2.1922 0.4622 -4.743 6.14e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.554 on 114 degrees of freedom
## Multiple R-squared: 0.6924, Adjusted R-squared: 0.6789
## F-statistic: 51.33 on 5 and 114 DF, p-value: < 2.2e-16
confint(mod_step)
## 2.5 % 97.5 %
## (Intercept) -11.0146765 7.553695
## Minutes 0.7054094 1.194093
## FieldGoals 2.8921762 4.025317
## Assists 0.7719450 1.995174
## Rebounds 0.7736898 1.676972
## Turnovers -3.1076886 -1.276620
broom::tidy(mod_step, conf.int = TRUE) %>% kable() %>% kable_styling()
| term | estimate | std.error | statistic | p.value | conf.low | conf.high |
|---|---|---|---|---|---|---|
| (Intercept) | -1.7304907 | 4.6866340 | -0.3692396 | 0.7126341 | -11.0146765 | 7.553695 |
| Minutes | 0.9497511 | 0.1233430 | 7.7000778 | 0.0000000 | 0.7054094 | 1.194093 |
| FieldGoals | 3.4587464 | 0.2860032 | 12.0933827 | 0.0000000 | 2.8921762 | 4.025317 |
| Assists | 1.3835595 | 0.3087415 | 4.4812876 | 0.0000177 | 0.7719450 | 1.995174 |
| Rebounds | 1.2253310 | 0.2279874 | 5.3745568 | 0.0000004 | 0.7736898 | 1.676972 |
| Turnovers | -2.1921544 | 0.4621594 | -4.7432869 | 0.0000061 | -3.1076886 | -1.276620 |
The final multiple linear regression model identified Minutes, FieldGoals, Assists, Rebounds, and Turnovers as significant predictors of player points. The coefficients indicate that players who play more minutes, make more field goals, provide more assists, and grab more rebounds tend to score higher points. In contrast, an increase in turnovers is associated with a significant decrease in total points.
All predictors were highly significant (p < 0.001), confirming their strong influence on scoring performance. The model explains about 69% of the total variation in player points (Adjusted R² = 0.679), showing a good overall fit. The confidence intervals further support the reliability of the estimates for example, the effect of FieldGoals (3.46, 95% CI [2.89, 4.03]) shows a consistently strong positive relationship with points, while Turnovers (-2.19, 95% CI [-3.11, -1.28]) confirms a clear negative effect.
Overall, the final model suggests that player performance outcomes are mainly driven by game involvement and efficiency, while errors such as turnovers reduce scoring potential.
10.Conclusion and Practical Implications
This analysis aimed to identify the key factors that influence player scoring performance. The final regression model showed that minutes played, field goals, assists, and rebounds all have strong positive effects on total points scored, while turnovers negatively affect performance. The model demonstrated good predictive accuracy, explaining around 69% of the variation in points, and performed consistently during cross-validation.
From a practical perspective, these findings suggest that coaches and performance analysts should focus on improving players’ shooting efficiency, playmaking ability, and rebounding efforts, while minimizing turnovers during games. By targeting these specific areas, teams can enhance overall scoring performance and player contribution. This model provides a useful analytical tool for understanding how different aspects of gameplay impact success on the court.