DATA605 Wk 12 Discussion 12

Using R, build a multiple regression model for data that interests you. Include in this model at least one quadratic term, one dichotomous term, and one dichotomous vs. quantitative interaction term. Interpret all coefficients. Conduct residual analysis. Was the linear model appropriate? Why or why not?

library(tidyverse)
library(kableExtra)
library(corrplot)
library(broom)

This dataset will be used to predict average scores made by each player based on their height, weight, successful goals and free throws.

Preview of Dataset

Before adding terms

basketball <- read.csv('basketball.csv', sep = ',')
colnames(basketball)<- c("height", "weight", "success_goals", "free_throws", "avg_scores")

dim(basketball)

## [1] 54  5

kable(basketball) %>%
  kable_styling() %>%
  scroll_box(width = "700px", height = "400px")

height	weight	success_goals	free_throws	avg_scores
6.8	225	0.442	0.672	9.2
6.3	180	0.435	0.797	11.7
6.4	190	0.456	0.761	15.8
6.2	180	0.416	0.651	8.6
6.9	205	0.449	0.900	23.2
6.4	225	0.431	0.780	27.4
6.3	185	0.487	0.771	9.3
6.8	235	0.469	0.750	16.0
6.9	235	0.435	0.818	4.7
6.7	210	0.480	0.825	12.5
6.9	245	0.516	0.632	20.1
6.9	245	0.493	0.757	9.1
6.3	185	0.374	0.709	8.1
6.1	185	0.424	0.782	8.6
6.2	180	0.441	0.775	20.3
6.8	220	0.503	0.880	25.0
6.5	194	0.503	0.833	19.2
7.6	225	0.425	0.571	3.3
6.3	210	0.371	0.816	11.2
7.1	240	0.504	0.714	10.5
6.8	225	0.400	0.765	10.1
7.3	263	0.482	0.655	7.2
6.4	210	0.475	0.244	13.6
6.8	235	0.428	0.728	9.0
7.2	230	0.559	0.721	24.6
6.4	190	0.441	0.757	12.6
6.6	220	0.492	0.747	5.6
6.8	210	0.402	0.739	8.7
6.1	180	0.415	0.713	7.7
6.5	235	0.492	0.742	24.1
6.4	185	0.484	0.861	11.7
6.0	175	0.387	0.721	7.7
6.0	192	0.436	0.785	9.6
7.3	263	0.482	0.655	7.2
6.1	180	0.340	0.821	12.3
6.7	240	0.516	0.728	8.9
6.4	210	0.475	0.846	13.6
5.8	160	0.412	0.813	11.2
6.9	230	0.411	0.595	2.8
7.0	245	0.407	0.573	3.2
7.3	228	0.445	0.726	9.4
5.9	155	0.291	0.707	11.9
6.2	200	0.449	0.804	15.4
6.8	235	0.546	0.784	7.4
7.0	235	0.480	0.744	18.9
5.9	105	0.359	0.839	7.9
6.1	180	0.528	0.790	12.2
5.7	185	0.352	0.701	11.0
7.1	245	0.414	0.778	2.8
5.8	180	0.425	0.872	11.8
7.4	240	0.599	0.713	17.1
6.8	225	0.482	0.701	11.6
6.8	215	0.457	0.734	5.8
7.0	230	0.435	0.764	8.3

After adding terms

# adding an extra column with dichotomous term
basketball$success <- ifelse(basketball$success_goals >=0.5, 1, 0)
basketball <- basketball[, c(1,2,3,6,4,5)] #re-arrange columns

#Quadratic
basketball$success_goals <- basketball$success_goals^2

#dichotomous vs. quantitative
basketball$dvsq <- round(basketball$height * basketball$success, 2)

dim(basketball)

## [1] 54  7

kable(basketball) %>%
  kable_styling() %>%
  scroll_box(width = "700px", height = "400px")

height	weight	success_goals	success	free_throws	avg_scores	dvsq
6.8	225	0.195364	0	0.672	9.2	0.0
6.3	180	0.189225	0	0.797	11.7	0.0
6.4	190	0.207936	0	0.761	15.8	0.0
6.2	180	0.173056	0	0.651	8.6	0.0
6.9	205	0.201601	0	0.900	23.2	0.0
6.4	225	0.185761	0	0.780	27.4	0.0
6.3	185	0.237169	0	0.771	9.3	0.0
6.8	235	0.219961	0	0.750	16.0	0.0
6.9	235	0.189225	0	0.818	4.7	0.0
6.7	210	0.230400	0	0.825	12.5	0.0
6.9	245	0.266256	1	0.632	20.1	6.9
6.9	245	0.243049	0	0.757	9.1	0.0
6.3	185	0.139876	0	0.709	8.1	0.0
6.1	185	0.179776	0	0.782	8.6	0.0
6.2	180	0.194481	0	0.775	20.3	0.0
6.8	220	0.253009	1	0.880	25.0	6.8
6.5	194	0.253009	1	0.833	19.2	6.5
7.6	225	0.180625	0	0.571	3.3	0.0
6.3	210	0.137641	0	0.816	11.2	0.0
7.1	240	0.254016	1	0.714	10.5	7.1
6.8	225	0.160000	0	0.765	10.1	0.0
7.3	263	0.232324	0	0.655	7.2	0.0
6.4	210	0.225625	0	0.244	13.6	0.0
6.8	235	0.183184	0	0.728	9.0	0.0
7.2	230	0.312481	1	0.721	24.6	7.2
6.4	190	0.194481	0	0.757	12.6	0.0
6.6	220	0.242064	0	0.747	5.6	0.0
6.8	210	0.161604	0	0.739	8.7	0.0
6.1	180	0.172225	0	0.713	7.7	0.0
6.5	235	0.242064	0	0.742	24.1	0.0
6.4	185	0.234256	0	0.861	11.7	0.0
6.0	175	0.149769	0	0.721	7.7	0.0
6.0	192	0.190096	0	0.785	9.6	0.0
7.3	263	0.232324	0	0.655	7.2	0.0
6.1	180	0.115600	0	0.821	12.3	0.0
6.7	240	0.266256	1	0.728	8.9	6.7
6.4	210	0.225625	0	0.846	13.6	0.0
5.8	160	0.169744	0	0.813	11.2	0.0
6.9	230	0.168921	0	0.595	2.8	0.0
7.0	245	0.165649	0	0.573	3.2	0.0
7.3	228	0.198025	0	0.726	9.4	0.0
5.9	155	0.084681	0	0.707	11.9	0.0
6.2	200	0.201601	0	0.804	15.4	0.0
6.8	235	0.298116	1	0.784	7.4	6.8
7.0	235	0.230400	0	0.744	18.9	0.0
5.9	105	0.128881	0	0.839	7.9	0.0
6.1	180	0.278784	1	0.790	12.2	6.1
5.7	185	0.123904	0	0.701	11.0	0.0
7.1	245	0.171396	0	0.778	2.8	0.0
5.8	180	0.180625	0	0.872	11.8	0.0
7.4	240	0.358801	1	0.713	17.1	7.4
6.8	225	0.232324	0	0.701	11.6	0.0
6.8	215	0.208849	0	0.734	5.8	0.0
7.0	230	0.189225	0	0.764	8.3	0.0

After including the said terms in the dataframe I took a closer look at dvsq and noticed something.

df <- subset(basketball, basketball$success==1)
kable(df) %>%
  kable_styling() %>%
  scroll_box(width = "700px", height = "400px")

	height	weight	success_goals	success	free_throws	avg_scores	dvsq
11	6.9	245	0.266256	1	0.632	20.1	6.9
16	6.8	220	0.253009	1	0.880	25.0	6.8
17	6.5	194	0.253009	1	0.833	19.2	6.5
20	7.1	240	0.254016	1	0.714	10.5	7.1
25	7.2	230	0.312481	1	0.721	24.6	7.2
36	6.7	240	0.266256	1	0.728	8.9	6.7
44	6.8	235	0.298116	1	0.784	7.4	6.8
47	6.1	180	0.278784	1	0.790	12.2	6.1
51	7.4	240	0.358801	1	0.713	17.1	7.4

Players who are taller than 6ft with matched weight have higher success in scoring more goals. Let’s see what the model will tell us.

Visualization

par(mfrow=c(1,2))
b_cor <- cor(basketball)
corrplot(b_cor, method = "circle")
corrplot(b_cor, method = "number")

pairs(basketball, pch=21)

By looking at the above plots we can see relationships between the variables. For instance, with height and weight, we can observe that there is somewhat a strong relationship between them while with success_goals and avg_scores there seems to be a very weak relationship.

Identifying Potential Predictors and Evaluating the Model

basketball_lm <- lm(avg_scores ~ height + weight + success_goals + success + free_throws + dvsq, data = basketball)
summary(basketball_lm)

## 
## Call:
## lm(formula = avg_scores ~ height + weight + success_goals + success + 
##     free_throws + dvsq, data = basketball)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.8629 -3.3563 -0.8627  1.8945 15.0850 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)  
## (Intercept)    21.37380   15.52754   1.377   0.1752  
## height         -4.84320    3.03984  -1.593   0.1178  
## weight          0.02133    0.04615   0.462   0.6462  
## success_goals  42.16688   23.25897   1.813   0.0762 .
## success       -46.68407   35.62309  -1.311   0.1964  
## free_throws    11.93155    7.81288   1.527   0.1334  
## dvsq            7.14273    5.24398   1.362   0.1797  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.364 on 47 degrees of freedom
## Multiple R-squared:  0.2668, Adjusted R-squared:  0.1731 
## F-statistic:  2.85 on 6 and 47 DF,  p-value: 0.01891

In creating this model, we want to keep the predictors with a p-value that does not exeed the threshold of 0.05. We can see that weight has a p-value of 0.6462 which encourages us to remove that variable from the model. Weight is not that significant after all.

Removing weight…

basketball_lm <- update(basketball_lm, .~. - weight, data = basketball)
summary(basketball_lm)

## 
## Call:
## lm(formula = avg_scores ~ height + success_goals + success + 
##     free_throws + dvsq, data = basketball)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.6431 -3.6178 -0.9641  2.0449 15.6501 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)  
## (Intercept)     18.830     14.399   1.308   0.1972  
## height          -3.786      1.985  -1.908   0.0624 .
## success_goals   44.743     22.395   1.998   0.0514 .
## success        -45.257     35.197  -1.286   0.2047  
## free_throws     11.331      7.640   1.483   0.1446  
## dvsq             6.911      5.177   1.335   0.1882  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.32 on 48 degrees of freedom
## Multiple R-squared:  0.2634, Adjusted R-squared:  0.1867 
## F-statistic: 3.433 on 5 and 48 DF,  p-value: 0.009874

Removing success with 0.2047

basketball_lm <- update(basketball_lm, .~. - success, data = basketball)
summary(basketball_lm)

## 
## Call:
## lm(formula = avg_scores ~ height + success_goals + free_throws + 
##     dvsq, data = basketball)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.9480 -3.3230 -0.6853  2.4111 15.7896 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)  
## (Intercept)    14.6685    14.1241   1.039   0.3041  
## height         -3.1059     1.9255  -1.613   0.1132  
## success_goals  45.1652    22.5416   2.004   0.0507 .
## free_throws    10.8074     7.6803   1.407   0.1657  
## dvsq            0.2739     0.4020   0.681   0.4990  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.356 on 49 degrees of freedom
## Multiple R-squared:  0.2381, Adjusted R-squared:  0.1759 
## F-statistic: 3.827 on 4 and 49 DF,  p-value: 0.008745

Removing dvsq with p_value of 0.4990

basketball_lm <- update(basketball_lm, .~. - dvsq, data = basketball)
summary(basketball_lm)

## 
## Call:
## lm(formula = avg_scores ~ height + success_goals + free_throws, 
##     data = basketball)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.3412 -3.4562 -0.9371  2.7112 15.6373 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)   
## (Intercept)     13.545     13.952   0.971  0.33630   
## height          -3.242      1.905  -1.702  0.09501 . 
## success_goals   55.456     16.640   3.333  0.00162 **
## free_throws     11.106      7.627   1.456  0.15159   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.327 on 50 degrees of freedom
## Multiple R-squared:  0.2308, Adjusted R-squared:  0.1847 
## F-statistic: 5.002 on 3 and 50 DF,  p-value: 0.004126

Removing free_throws with p_value of 0.15463

basketball_lm <- update(basketball_lm, .~. - free_throws, data = basketball)
summary(basketball_lm)

## 
## Call:
## lm(formula = avg_scores ~ height + success_goals, data = basketball)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.9911 -3.5787 -0.9251  2.2237 15.9712 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     26.384     10.932   2.414 0.019428 *  
## height          -4.036      1.845  -2.187 0.033333 *  
## success_goals   58.534     16.686   3.508 0.000952 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.385 on 51 degrees of freedom
## Multiple R-squared:  0.1982, Adjusted R-squared:  0.1668 
## F-statistic: 6.304 on 2 and 51 DF,  p-value: 0.003577

All of the predictors have a p_value below the threshold which brings us to our final model:

\[avg\_scores = 26.384 - (4.036 \times height) + (58.534 \times success_goals)\]

Since this is a Multiple Linear Regression model, we’ll look at the \(Adjusted \ R^2\) since it removes any noise or bias.

Notice with each update, the \(Adjusted \ R^2\) value decreased as well as the p_value but the F-statisticincreased. This indicates model doesn’t explain much of variation of the data but it is still significant due to the p_value being 0.003577. It is better than not having a model at all.

Residual Analysis

basketballlm_df <- augment(basketball_lm)
ggplot(basketballlm_df, aes(x = .fitted, y = .resid)) + geom_point() + geom_hline(yintercept=0, color = 'brown', size = 1) + ggtitle('Residual vs Fitted')

The residuals are randomly plotted above and below the residual line which shows that the model could be a good fit.

ggplot(basketballlm_df, aes(x=.std.resid)) + geom_histogram(aes(y=..density..), bins = 10, colour="black")+
 geom_density(alpha=.2, fill="green") + ggtitle('Histogram of Residuals')

Based on the histogram, most of the residuals are right skewed.

qplot(sample =.std.resid, data = basketballlm_df) + geom_abline()+ ggtitle('Normal Q-Q Plot')

We have to be a little cautious here. There’s a pattern here which confirms what we saw on the histogram with the residuals being right skewed, however we would not reject the model.

Conclusion

It does make sense for when predicting average scores, it may be based on height and how often the play scores a goal. Based on the p value our model is significant, however based on the \(R^2\), there may be other variables that may contribute to average scores a player gets.