Multiple Linear Regression

Introduction

This tutorial demonstrates how to perform and interpret a simple multiple linear regression in R.
We will predict a basketball player’s Points per Game (Points) from several predictors such as Minutes Played, Field Goals Made, Assists, and Rebounds.

All code is commented for beginners. We use a simulated dataset — replace with your real dataset if available (CSV with same column names).

Packages

Install packages if you haven’t already (run once)

install.packages(c(“tidyverse”,“broom”,“car”,“MASS”,“caret”,“ggpubr”))

library(tidyverse)
library(broom)      # tidy model output
library(car)        # VIF & diagnostic plots
library(MASS)       # stepAIC for model selection
library(caret)      # cross-validation
library(ggpubr)     # nicer ggplots

1. Create example dataset (or load your own)

set.seed(2025)

# Simulated player-match level data (n = 120)

n <- 120
player_data <- tibble(
PlayerID = sample(1:30, n, replace = TRUE),
Minutes = round(rnorm(n, mean = 30, sd = 5),1),
FieldGoals = rpois(n, lambda = 6),       # count of field goals made
Assists = rpois(n, lambda = 4),
Rebounds = rpois(n, lambda = 6),
Turnovers = rpois(n, lambda = 2),
Age = sample(18:36, n, replace = TRUE)
)

# Create Points as a function of predictors + noise

player_data <- player_data %>%
mutate(
Points = round( 0.8*Minutes + 3.5*FieldGoals + 1.8*Assists + 1.1*Rebounds - 2.0*Turnovers + rnorm(n,0,6), 1)
)

conflicts()

##  [1] "%>%"           "%>%"           "%>%"           "group_by"     
##  [5] "intersect"     "mutate"        "n"             "recode"       
##  [9] "select"        "setdiff"       "union"         "%>%"          
## [13] "lift"          "some"          "%>%"           "all_of"       
## [17] "any_of"        "as_tibble"     "contains"      "ends_with"    
## [21] "everything"    "last_col"      "matches"       "num_range"    
## [25] "one_of"        "starts_with"   "tibble"        "tribble"      
## [29] "%>%"           "add_row"       "as_data_frame" "as_tibble"    
## [33] "data_frame"    "glimpse"       "lst"           "tibble"       
## [37] "tribble"       "type_sum"      "as_label"      "enexpr"       
## [41] "enexprs"       "enquo"         "enquos"        "ensym"        
## [45] "ensyms"        "expr"          "quo"           "quo_name"     
## [49] "quos"          "sym"           "syms"          "vars"         
## [53] "filter"        "lag"           "npk"           "Arith"        
## [57] "Compare"       "show"          "%||%"          "as.difftime"  
## [61] "body<-"        "date"          "intersect"     "kronecker"    
## [65] "plot"          "Position"      "setdiff"       "setequal"     
## [69] "union"

# Quick glimpse and group summary

library(tidyverse)
library(skimr)

player_data %>%
  dplyr::select(Minutes, FieldGoals, Assists, Rebounds, Turnovers, Age, Points) %>%
  skim()

Data summary
Name	Piped data
Number of rows	120
Number of columns	7
_______________________
Column type frequency:
numeric	7
________________________
Group variables	None

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
Minutes	1	29.33	5.09	17.7	25.50	29.30	33.35	41.3	▂▇▇▆▂
FieldGoals	1	5.94	2.14	0.0	5.00	6.00	7.00	12.0	▁▃▇▂▁
Assists	1	4.05	1.97	0.0	3.00	4.00	6.00	10.0	▆▇▆▂▁
Rebounds	1	6.30	2.68	1.0	4.00	6.00	8.00	14.0	▂▇▅▃▁
Turnovers	1	2.08	1.36	0.0	1.00	2.00	3.00	7.0	▇▇▅▂▁
Age	1	26.71	5.24	18.0	22.00	27.00	31.00	36.0	▆▃▆▇▃
Points	1	55.45	11.57	25.1	47.03	55.55	63.62	81.0	▁▅▇▆▂

2. Exploratory data analysis (EDA)

Summary table by variable

# Load required libraries
library(tidyverse)
library(kableExtra)

# Summarise and format data
player_data %>%
  summarise(across(
    c(Minutes, FieldGoals, Assists, Rebounds, Turnovers, Age, Points),
    list(N = ~n(), Mean = ~mean(.), SD = ~sd(.))
  )) %>%
  pivot_longer(everything()) %>%
  kable() %>%
  kable_styling()

name	value
Minutes_N	120.000000
Minutes_Mean	29.327500
Minutes_SD	5.091119
FieldGoals_N	120.000000
FieldGoals_Mean	5.941667
FieldGoals_SD	2.135321
Assists_N	120.000000
Assists_Mean	4.050000
Assists_SD	1.965458
Rebounds_N	120.000000
Rebounds_Mean	6.300000
Rebounds_SD	2.683908
Turnovers_N	120.000000
Turnovers_Mean	2.075000
Turnovers_SD	1.360688
Age_N	120.000000
Age_Mean	26.708333
Age_SD	5.239469
Points_N	120.000000
Points_Mean	55.448333
Points_SD	11.566301

#Scatterplots of Points vs a few predictors



p1 <- ggplot(player_data, aes(x = Minutes, y = Points)) + geom_point() + geom_smooth(method="lm", se=FALSE) + labs(title="Points vs Minutes")
p2 <- ggplot(player_data, aes(x = FieldGoals, y = Points)) + geom_jitter(width = 0.2, height = 0) + geom_smooth(method="lm", se=FALSE) + labs(title="Points vs FieldGoals")
p3 <- ggplot(player_data, aes(x = Assists, y = Points)) + geom_jitter(width = 0.2, height = 0) + geom_smooth(method="lm", se=FALSE) + labs(title="Points vs Assists")




ggarrange(p1, p2, p3, ncol = 1, nrow = 3)

Interpretation: Look for linear relationships and obvious outliers. These plots suggest positive relationships between Points and Minutes/FieldGoals/Assists.

3. Fit the initial multiple linear regression model

# Fit model: Points predicted by Minutes, FieldGoals, Assists, Rebounds, Turnovers, Age

mod_full <- lm(Points ~ Minutes + FieldGoals + Assists + Rebounds + Turnovers + Age, data = player_data)
summary(mod_full) %>% broom::tidy() %>% kable() %>% kable_styling()

term	estimate	std.error	statistic	p.value
(Intercept)	-0.1688318	5.5149965	-0.0306132	0.9756320
Minutes	0.9478291	0.1237780	7.6574923	0.0000000
FieldGoals	3.4687991	0.2874936	12.0656586	0.0000000
Assists	1.4007610	0.3113271	4.4993226	0.0000166
Rebounds	1.2249986	0.2286981	5.3564002	0.0000005
Turnovers	-2.1670162	0.4659160	-4.6510873	0.0000090
Age	-0.0630797	0.1164661	-0.5416146	0.5891503

The multiple linear regression model examined the factors influencing player scoring performance (Points) using Minutes, FieldGoals, Assists, Rebounds, Turnovers, and Age as predictors. The results showed that FieldGoals (β = 3.47, p < 0.001), Minutes (β = 0.95, p < 0.001), Assists (β = 1.40, p < 0.001), and Rebounds (β = 1.22, p < 0.001) were all significant positive predictors of points scored, indicating that increased offensive involvement strongly contributes to higher performance. Turnovers (β = -2.17, p < 0.001) had a significant negative effect, suggesting that more mistakes reduce scoring outcomes. Age (β = -0.06, p = 0.589) was not significant, implying that age does not meaningfully influence scoring in this dataset. Overall, the model highlights that on-court activity and efficiency measures are the main drivers of a player’s point performance.

4. Check assumptions

4.1 Linearity & homoscedasticity — residuals vs fitted

plot(mod_full, which = 1)  # Residuals vs Fitted

4.2 Normality of residuals — QQ plot & Shapiro-Wilk

plot(mod_full, which = 2)  # Normal QQ

shapiro.test(residuals(mod_full))  # formal test (note: sensitive to large n)

## 
##  Shapiro-Wilk normality test
## 
## data:  residuals(mod_full)
## W = 0.98222, p-value = 0.1145

4.3 Multicollinearity — Variance Inflation Factor (VIF)

vif_values <- car::vif(mod_full)
vif_values %>% as_tibble(rownames = "predictor") %>% rename(VIF = value) %>% kable() %>% kable_styling()

predictor	VIF
Minutes	1.093360
FieldGoals	1.037606
Assists	1.030887
Rebounds	1.037314
Turnovers	1.106578
Age	1.025235

Notes:

VIF > 5 (or > 10) indicates problematic multicollinearity.
If homoscedasticity violated → consider robust SEs or transform the outcome.

5. Influence & outliers

plot(mod_full, which = 4) # Cook's distance

# List any observations with high influence

cooks <- cooks.distance(mod_full)
which(cooks > (4/length(cooks)))  # indices of potentially influential points

##   4  44  47  48  72  73  98 117 
##   4  44  47  48  72  73  98 117

If a few points are highly influential, inspect them and decide whether they are data errors or true extremes

Cook’s Distance was used to identify influential observations in the model. The plot and calculated values showed that observations 4, 44, 47, 48, 72, 73, 98, and 117 have relatively high influence compared to others. These data points may have a stronger impact on the regression coefficients and could indicate potential outliers or unusual player performances. However, their influence should be checked further before deciding whether to remove them.

6. Model simplification / selection (stepwise AIC)

mod_step <- stepAIC(mod_full, direction = "both", trace = FALSE)
summary(mod_step)

## 
## Call:
## lm(formula = Points ~ Minutes + FieldGoals + Assists + Rebounds + 
##     Turnovers, data = player_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12.6350  -4.1476  -0.3008   3.2340  19.2602 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -1.7305     4.6866  -0.369    0.713    
## Minutes       0.9498     0.1233   7.700 5.39e-12 ***
## FieldGoals    3.4587     0.2860  12.093  < 2e-16 ***
## Assists       1.3836     0.3087   4.481 1.77e-05 ***
## Rebounds      1.2253     0.2280   5.375 4.12e-07 ***
## Turnovers    -2.1922     0.4622  -4.743 6.14e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.554 on 114 degrees of freedom
## Multiple R-squared:  0.6924, Adjusted R-squared:  0.6789 
## F-statistic: 51.33 on 5 and 114 DF,  p-value: < 2.2e-16

broom::tidy(mod_step) %>% kable() %>% kable_styling()

term	estimate	std.error	statistic	p.value
(Intercept)	-1.7304907	4.6866340	-0.3692396	0.7126341
Minutes	0.9497511	0.1233430	7.7000778	0.0000000
FieldGoals	3.4587464	0.2860032	12.0933827	0.0000000
Assists	1.3835595	0.3087415	4.4812876	0.0000177
Rebounds	1.2253310	0.2279874	5.3745568	0.0000004
Turnovers	-2.1921544	0.4621594	-4.7432869	0.0000061

To simplify the regression model, a stepwise selection based on AIC was performed. The final model kept the variables Minutes, FieldGoals, Assists, Rebounds, and Turnovers, while Age was removed because it did not significantly improve the model.

The simplified model explained about 68% of the variation in player points (Adjusted R² = 0.679), showing that these performance measures are strong predictors of scoring. The AIC value for the stepwise model was slightly lower than that of the full model, indicating a better fit with fewer predictors.

All remaining variables were highly significant (p < 0.001), and the residual standard error (6.55) suggests that the model predicts points with good accuracy. Overall, the stepwise model is more efficient and interpretable without losing much explanatory power.

7. Model performance metrics (R², adj-R², RMSE)

# Full model metrics

glance_full <- broom::glance(mod_full)
glance_step <- broom::glance(mod_step)

tibble(
Model = c("Full", "Stepwise"),
R2 = c(glance_full$r.squared, glance_step$r.squared),
Adj_R2 = c(glance_full$adj.r.squared, glance_step$adj.r.squared),
AIC = c(glance_full$AIC, glance_step$AIC)
) %>% kable() %>% kable_styling()

Model	R2	Adj_R2	AIC
Full	0.6932112	0.6769215	801.2926
Stepwise	0.6924148	0.6789242	799.6037

# RMSE (residual standard error)

sigma(mod_full)  # RMSE for full

## [1] 6.574285

sigma(mod_step)  # RMSE for stepwise

## [1] 6.553877

The performance of the full and stepwise regression models was compared using R², Adjusted R², AIC, and RMSE. The full model achieved an Adjusted R² of 0.679 with an RMSE of 6.57, while the stepwise model showed a very similar Adjusted R² (0.679) and a slightly lower RMSE (6.55). The AIC value for the stepwise model was also marginally lower, indicating a slightly better trade-off between model fit and complexity. Overall, both models performed almost identically, but the stepwise model is preferred because it is simpler, more efficient, and retains only the most important predictors without losing predictive accuracy.

8. Simple k-fold cross-validation (estimate RMSE on unseen data)

set.seed(2025)
train_control <- trainControl(method = "cv", number = 5)

# caret formula using mod_step predictors

formula_step <- formula(mod_step)

cv_model <- train(formula_step, data = player_data,
method = "lm",
trControl = train_control,
na.action = na.omit)

cv_model$results  # includes RMSE and Rsquared

##   intercept     RMSE  Rsquared      MAE   RMSESD RsquaredSD     MAESD
## 1      TRUE 6.556177 0.6845598 5.181276 1.154802 0.08904792 0.9220103

A simple k-fold cross-validation was conducted to evaluate the predictive performance of the regression model on unseen data. The results showed an average RMSE of 6.56, indicating that the model’s point predictions typically deviate by about 6.5 points from the actual values. The R² value of 0.68 suggests that the model explains around 68% of the variability in player points, which aligns well with the in-sample performance. The Mean Absolute Error (MAE) was 5.18, showing relatively small average errors. Overall, the cross-validation results confirm that the model performs consistently and generalizes well to new data.

9. Final model interpretation

summary(mod_step)

## 
## Call:
## lm(formula = Points ~ Minutes + FieldGoals + Assists + Rebounds + 
##     Turnovers, data = player_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12.6350  -4.1476  -0.3008   3.2340  19.2602 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -1.7305     4.6866  -0.369    0.713    
## Minutes       0.9498     0.1233   7.700 5.39e-12 ***
## FieldGoals    3.4587     0.2860  12.093  < 2e-16 ***
## Assists       1.3836     0.3087   4.481 1.77e-05 ***
## Rebounds      1.2253     0.2280   5.375 4.12e-07 ***
## Turnovers    -2.1922     0.4622  -4.743 6.14e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.554 on 114 degrees of freedom
## Multiple R-squared:  0.6924, Adjusted R-squared:  0.6789 
## F-statistic: 51.33 on 5 and 114 DF,  p-value: < 2.2e-16

confint(mod_step)

##                   2.5 %    97.5 %
## (Intercept) -11.0146765  7.553695
## Minutes       0.7054094  1.194093
## FieldGoals    2.8921762  4.025317
## Assists       0.7719450  1.995174
## Rebounds      0.7736898  1.676972
## Turnovers    -3.1076886 -1.276620

broom::tidy(mod_step, conf.int = TRUE) %>% kable() %>% kable_styling()

term	estimate	std.error	statistic	p.value	conf.low	conf.high
(Intercept)	-1.7304907	4.6866340	-0.3692396	0.7126341	-11.0146765	7.553695
Minutes	0.9497511	0.1233430	7.7000778	0.0000000	0.7054094	1.194093
FieldGoals	3.4587464	0.2860032	12.0933827	0.0000000	2.8921762	4.025317
Assists	1.3835595	0.3087415	4.4812876	0.0000177	0.7719450	1.995174
Rebounds	1.2253310	0.2279874	5.3745568	0.0000004	0.7736898	1.676972
Turnovers	-2.1921544	0.4621594	-4.7432869	0.0000061	-3.1076886	-1.276620

The final multiple linear regression model identified Minutes, FieldGoals, Assists, Rebounds, and Turnovers as significant predictors of player points. The coefficients indicate that players who play more minutes, make more field goals, provide more assists, and grab more rebounds tend to score higher points. In contrast, an increase in turnovers is associated with a significant decrease in total points.

All predictors were highly significant (p < 0.001), confirming their strong influence on scoring performance. The model explains about 69% of the total variation in player points (Adjusted R² = 0.679), showing a good overall fit. The confidence intervals further support the reliability of the estimates for example, the effect of FieldGoals (3.46, 95% CI [2.89, 4.03]) shows a consistently strong positive relationship with points, while Turnovers (-2.19, 95% CI [-3.11, -1.28]) confirms a clear negative effect.

Overall, the final model suggests that player performance outcomes are mainly driven by game involvement and efficiency, while errors such as turnovers reduce scoring potential.

10.Conclusion and Practical Implications

This analysis aimed to identify the key factors that influence player scoring performance. The final regression model showed that minutes played, field goals, assists, and rebounds all have strong positive effects on total points scored, while turnovers negatively affect performance. The model demonstrated good predictive accuracy, explaining around 69% of the variation in points, and performed consistently during cross-validation.

From a practical perspective, these findings suggest that coaches and performance analysts should focus on improving players’ shooting efficiency, playmaking ability, and rebounding efforts, while minimizing turnovers during games. By targeting these specific areas, teams can enhance overall scoring performance and player contribution. This model provides a useful analytical tool for understanding how different aspects of gameplay impact success on the court.

Multiple Linear Regression

THEDINI JAYAWARDANA

2025-10-30