# install.packages("ggpubr")
# install.packages("QF") 
library("ggpubr")
## Loading required package: ggplot2
library(datasets)
library(ggplot2)
# 1. Find the simple correlations of all of the predictors with each other and with the outcome.
left_right <- read.csv("/Users/yahavmanor/Desktop/ENP164/Left_Right_Recode (1).csv")
time_to_sit.full = lm(left_right$Time.Sample.1..s. ~ left_right$Left.or.Right + left_right$Row + left_right$Chair.Style)
summary(time_to_sit.full)
## 
## Call:
## lm(formula = left_right$Time.Sample.1..s. ~ left_right$Left.or.Right + 
##     left_right$Row + left_right$Chair.Style)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.7718 -1.8392 -0.1494  2.1052  3.8880 
## 
## Coefficients:
##                               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    12.8898     2.7674   4.658 6.56e-05 ***
## left_right$Left.or.RightLeft   -4.1257     2.5486  -1.619   0.1163    
## left_right$Left.or.RightRight   1.0804     2.5694   0.421   0.6772    
## left_right$Row                 -0.8784     0.3990  -2.202   0.0358 *  
## left_right$Chair.Style          0.2628     0.5507   0.477   0.6369    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.48 on 29 degrees of freedom
## Multiple R-squared:  0.5656, Adjusted R-squared:  0.5056 
## F-statistic: 9.438 on 4 and 29 DF,  p-value: 5.172e-05
x <- left_right$L_R_Code          # numeric recoding of seating side (Left.or.Right)
y <- left_right$Row               # row the student sat in
z <- left_right$Chair.Style       # chair style code
j <- left_right$Time.Sample.1..s. # time to sit, in seconds

# Choosing to conduct Pearson tests, since these are parametric tests
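
# As a quick sanity check on that assumption, the outcome can be inspected for
# approximate normality before running the Pearson tests. A minimal sketch using
# base R (shapiro.test and qqnorm are only rough guides at n = 34):
shapiro.test(j)  # null hypothesis: the data are normally distributed
qqnorm(j)        # points falling near the reference line suggest normality
qqline(j)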

cor.test(x, j, method = c("pearson"))
## 
##  Pearson's product-moment correlation
## 
## data:  x and j
## t = 5.5199, df = 32, p-value = 4.376e-06
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.4716026 0.8385205
## sample estimates:
##       cor 
## 0.6983893
# The p-value for this Pearson correlation is below the criterion value of 0.05, so the correlation is statistically significant. This means there is a meaningful relationship between which side of the room students sat on and their time to sit.
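
# As a rough effect-size check (not part of the assignment output), squaring r
# gives the proportion of variance the two variables share:
0.6983893^2  # ~0.49: seat side accounts for roughly half the variance in time to sit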

cor.test(y, j, method = c("pearson"))
## 
##  Pearson's product-moment correlation
## 
## data:  y and j
## t = -0.86564, df = 32, p-value = 0.3931
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.4656127  0.1969774
## sample estimates:
##        cor 
## -0.1512642
# The p-value for this Pearson correlation is above the criterion value of 0.05, so the correlation is not statistically significant. This means there is no meaningful relationship between the row students sat in and their time to sit.

cor.test(z, j, method = c("pearson"))
## 
##  Pearson's product-moment correlation
## 
## data:  z and j
## t = 0.21895, df = 32, p-value = 0.8281
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.3034589  0.3719763
## sample estimates:
##        cor 
## 0.03867563
# The p-value for this Pearson correlation is above the criterion value of 0.05, so the correlation is not statistically significant. This means there is no meaningful relationship between the chair style students sat in and their time to sit.
# 2. Try different combinations of the three variables above to see which ones yield the best result. Does the Row variable add meaningfully to the prediction of time to sit? What do the initial correlations tell you about the predictive value of each of the variables?

time_to_sit.norow = lm(left_right$Time.Sample.1..s. ~ left_right$L_R_Code + left_right$Chair.Style)
summary(time_to_sit.norow)
## 
## Call:
## lm(formula = left_right$Time.Sample.1..s. ~ left_right$L_R_Code + 
##     left_right$Chair.Style)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.6023 -1.5693 -0.3517  1.5831  5.1403 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              7.0637     1.2048   5.863 1.81e-06 ***
## left_right$L_R_Code      4.9261     0.9010   5.467 5.62e-06 ***
## left_right$Chair.Style  -0.2535     0.5190  -0.488    0.629    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.595 on 31 degrees of freedom
## Multiple R-squared:  0.4917, Adjusted R-squared:  0.4589 
## F-statistic: 14.99 on 2 and 31 DF,  p-value: 2.789e-05
anova(time_to_sit.full, time_to_sit.norow)
## Analysis of Variance Table
## 
## Model 1: left_right$Time.Sample.1..s. ~ left_right$Left.or.Right + left_right$Row + 
##     left_right$Chair.Style
## Model 2: left_right$Time.Sample.1..s. ~ left_right$L_R_Code + left_right$Chair.Style
##   Res.Df    RSS Df Sum of Sq      F Pr(>F)
## 1     29 178.40                           
## 2     31 208.75 -2    -30.35 2.4668 0.1025
# Adjusted r-squared decreased from 0.51 (in the full model) to 0.46 when Row was removed, so the fit is somewhat worse without it. However, the model-comparison F-test is not statistically significant (p = 0.10), so the evidence that Row adds meaningfully to the prediction of time to sit is weak. (Note that this comparison also swaps the Left.or.Right factor for the numeric L_R_Code, so it is not a test of Row alone.)
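
# As an additional, hedged check, the two models can also be compared with AIC
# (base R), where the lower value indicates the better fit after penalizing
# extra parameters:
AIC(time_to_sit.full, time_to_sit.norow)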

time_to_sit.nocode = lm(left_right$Time.Sample.1..s. ~ left_right$Row + left_right$Chair.Style)
summary(time_to_sit.nocode)
## 
## Call:
## lm(formula = left_right$Time.Sample.1..s. ~ left_right$Row + 
##     left_right$Chair.Style)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.5054 -3.1118  0.5358  2.6937  6.5179 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              9.7243     1.9729   4.929 2.63e-05 ***
## left_right$Row          -0.6095     0.5701  -1.069    0.293    
## left_right$Chair.Style   0.5329     0.7903   0.674    0.505    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.572 on 31 degrees of freedom
## Multiple R-squared:  0.03701,    Adjusted R-squared:  -0.02512 
## F-statistic: 0.5956 on 2 and 31 DF,  p-value: 0.5574
anova(time_to_sit.full, time_to_sit.nocode)
## Analysis of Variance Table
## 
## Model 1: left_right$Time.Sample.1..s. ~ left_right$Left.or.Right + left_right$Row + 
##     left_right$Chair.Style
## Model 2: left_right$Time.Sample.1..s. ~ left_right$Row + left_right$Chair.Style
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
## 1     29 178.40                                  
## 2     31 395.46 -2   -217.06 17.642 9.713e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Adjusted r-squared decreased from 0.51 (in the full model) to below zero (-0.03) when the left/right predictor was removed, and the model-comparison F-test is highly significant (p < .001), so L_R_Code has a statistically significant impact on time to sit. This is a far larger loss of fit than in the previous comparison, where Row was removed, which shows that L_R_Code is the most impactful predictor of time to sit.
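
# The F statistic above can be reproduced by hand from the residual sums of
# squares in the ANOVA table, which makes explicit what the comparison tests:
rss_full    <- 178.40  # RSS of the full model, 29 residual df
rss_reduced <- 395.46  # RSS of the model without the left/right predictor, 31 df
((rss_reduced - rss_full) / 2) / (rss_full / 29)  # = 17.64, matching the table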

# The initial correlations showed a statistically significant correlation between time to sit and L_R_Code, but no meaningful correlation of time to sit with chair style or with row. The ANOVA model comparisons above support this: removing L_R_Code produced a large, statistically significant loss of fit, while removing the other predictors did not. Overall, both the correlation tests and the model comparisons indicate that L_R_Code is the most meaningful predictor of time to sit (at least, more so than chair style or row).
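
# A compact way to see all of the pairwise correlations at once is a correlation
# matrix over the numeric columns (column names as used earlier in this script):
cor(data.frame(L_R_Code    = left_right$L_R_Code,
               Row         = left_right$Row,
               Chair.Style = left_right$Chair.Style,
               Time        = left_right$Time.Sample.1..s.))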
# 3. Find another dataset, and demonstrate your understanding of the use of correlation, linear regression, and multiple regression.

grades <- read.csv("/Users/yahavmanor/Desktop/ENP164/Correlation_data.csv")
# Demonstrating understanding of correlation

x <- grades$Time.slept
y <- grades$Hours.studied
z <- grades$Current.GPA
j <- grades$Exam.grade

# Choosing to conduct Pearson tests, since these are parametric tests
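
# Before testing, a scatterplot matrix gives a quick visual check that the
# relationships look roughly linear, which Pearson's r assumes:
pairs(data.frame(Time.slept = x, Hours.studied = y, Current.GPA = z, Exam.grade = j))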

cor.test(x, j, method = c("pearson"))
## 
##  Pearson's product-moment correlation
## 
## data:  x and j
## t = 2.6971, df = 13, p-value = 0.01829
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.1251291 0.8503386
## sample estimates:
##       cor 
## 0.5989969
# The p-value for this Pearson correlation is below the criterion value of 0.05, so the correlation is statistically significant: there is a meaningful relationship between time slept and exam grade.

cor.test(y, j, method = c("pearson"))
## 
##  Pearson's product-moment correlation
## 
## data:  y and j
## t = 4.1742, df = 13, p-value = 0.001091
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.3993068 0.9145101
## sample estimates:
##       cor 
## 0.7567719
# The p-value for this Pearson correlation is below the criterion value of 0.05, so the correlation is statistically significant: there is a meaningful relationship between hours studied and exam grade.

cor.test(z, j, method = c("pearson"))
## 
##  Pearson's product-moment correlation
## 
## data:  z and j
## t = 3.4054, df = 13, p-value = 0.004694
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.2689874 0.8869317
## sample estimates:
##      cor 
## 0.686637
# The p-value for this Pearson correlation is below the criterion value of 0.05, so the correlation is statistically significant: there is a meaningful relationship between current GPA and exam grade.
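
# Since all three predictors correlate with the outcome, it is worth checking how
# strongly they correlate with each other; high intercorrelations
# (multicollinearity) will matter once they enter a multiple regression together:
cor(data.frame(Time.slept = x, Hours.studied = y, Current.GPA = z))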
# Demonstrating understanding of linear regression

# Plotting linear regression model between exam grade and time slept
lmGrade_Slept = lm(grades$Exam.grade ~ grades$Time.slept)

ggplot(grades, aes(x = Time.slept, y = Exam.grade)) +
  geom_point(size = 4, shape = 21) +
  geom_smooth(method = lm, se = FALSE) +
  stat_regline_equation(label.x.npc = "left", label.y.npc = 0.95) +
  stat_regline_equation(label.x.npc = "left", label.y.npc = 0.85, aes(label = after_stat(rr.label)))
## `geom_smooth()` using formula = 'y ~ x'

# The code above plots exam grade as a function of time slept. The best-fitting straight line is drawn, along with its equation and r-squared value.
          
plot(grades$Time.slept, grades$Exam.grade, xlab="time slept", ylab="grade")
abline(lm(Exam.grade~Time.slept,data=grades),col='red') 
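
# The fitted line can also be inspected numerically. This sketch refits the model
# with a data argument (the hypothetical name fit_slept is used here) so that
# predict() can accept new data, then predicts a grade for 8 hours of sleep:
fit_slept <- lm(Exam.grade ~ Time.slept, data = grades)
coef(fit_slept)                                         # intercept and slope
predict(fit_slept, newdata = data.frame(Time.slept = 8))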

#-----------------------------------------------------------------------------------------------------

# Plotting linear regression model between exam grade and hours studied
lmGrade_Studied = lm(grades$Exam.grade ~ grades$Hours.studied)

ggplot(grades, aes(x = Hours.studied, y = Exam.grade)) +
  geom_point(size = 4, shape = 21) +
  geom_smooth(method = lm, se = FALSE) +
  stat_regline_equation(label.x.npc = "left", label.y.npc = 0.95) +
  stat_regline_equation(label.x.npc = "left", label.y.npc = 0.85, aes(label = after_stat(rr.label)))
## `geom_smooth()` using formula = 'y ~ x'

# The code above plots exam grade as a function of hours studied. The best-fitting straight line is drawn, along with its equation and r-squared value.
          
plot(grades$Hours.studied, grades$Exam.grade, xlab="hours studied", ylab="grade")
abline(lm(Exam.grade~Hours.studied,data=grades),col='blue') 

#-----------------------------------------------------------------------------------------------------

# Plotting linear regression model between exam grade and current GPA
lmGrade_GPA = lm(grades$Exam.grade ~ grades$Current.GPA)

ggplot(grades, aes(x = Current.GPA, y = Exam.grade)) +
  geom_point(size = 4, shape = 21) +
  geom_smooth(method = lm, se = FALSE) +
  stat_regline_equation(label.x.npc = "left", label.y.npc = 0.95) +
  stat_regline_equation(label.x.npc = "left", label.y.npc = 0.85, aes(label = after_stat(rr.label)))
## `geom_smooth()` using formula = 'y ~ x'

# The code above plots exam grade as a function of current GPA. The best-fitting straight line is drawn, along with its equation and r-squared value.
          
plot(grades$Current.GPA, grades$Exam.grade, xlab="current GPA", ylab="grade")
abline(lm(Exam.grade~Current.GPA,data=grades),col='green') 

# Demonstrating understanding of multiple regression

grades.full = lm(grades$Exam.grade ~ grades$Time.slept + grades$Hours.studied + grades$Current.GPA)

summary(grades.full)    #prints the outcome from this model.
## 
## Call:
## lm(formula = grades$Exam.grade ~ grades$Time.slept + grades$Hours.studied + 
##     grades$Current.GPA)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.5162  -2.4083   0.7531   4.0327  11.6100 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)   
## (Intercept)           47.6701    12.9147   3.691  0.00356 **
## grades$Time.slept      2.0808     2.0298   1.025  0.32733   
## grades$Hours.studied   1.1443     0.6216   1.841  0.09277 . 
## grades$Current.GPA     2.9819     4.8250   0.618  0.54914   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.221 on 11 degrees of freedom
## Multiple R-squared:  0.6414, Adjusted R-squared:  0.5437 
## F-statistic:  6.56 on 3 and 11 DF,  p-value: 0.008353
cat("r-squared or multiple r-squared is:") 
## r-squared or multiple r-squared is:
summary(grades.full)$r.squared
## [1] 0.6414499
cat("adjusted-squared is:")
## adjusted-squared is:
summary(grades.full)$adj.r.squared
## [1] 0.5436635
# Adjusted r-squared is 0.543
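
# Adjusted r-squared can be reproduced by hand from multiple r-squared, the
# sample size (n = 15, from the 11 residual df plus 3 predictors plus 1), and
# the number of predictors (p = 3):
r2 <- summary(grades.full)$r.squared
n  <- 15
p  <- 3
1 - (1 - r2) * (n - 1) / (n - p - 1)  # = 0.5437, matching the summary above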

#-----------------------------------------------------------------------------------------------------

#The model below contains only two of the original predictors.
grades.nosleep = lm(grades$Exam.grade ~ grades$Hours.studied + grades$Current.GPA)

summary(grades.nosleep)
## 
## Call:
## lm(formula = grades$Exam.grade ~ grades$Hours.studied + grades$Current.GPA)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -21.189  -2.195   1.885   4.346   9.728 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           54.1868    11.2655   4.810 0.000426 ***
## grades$Hours.studied   1.2507     0.6142   2.036 0.064413 .  
## grades$Current.GPA     4.6666     4.5460   1.027 0.324892    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.238 on 12 degrees of freedom
## Multiple R-squared:  0.6072, Adjusted R-squared:  0.5417 
## F-statistic: 9.275 on 2 and 12 DF,  p-value: 0.003673
# Adjusted r-squared is 0.542

#Passing the two fitted models to the anova() function compares them directly.
anova(grades.full, grades.nosleep)
## Analysis of Variance Table
## 
## Model 1: grades$Exam.grade ~ grades$Time.slept + grades$Hours.studied + 
##     grades$Current.GPA
## Model 2: grades$Exam.grade ~ grades$Hours.studied + grades$Current.GPA
##   Res.Df    RSS Df Sum of Sq      F Pr(>F)
## 1     11 743.39                           
## 2     12 814.41 -1   -71.018 1.0509 0.3273
# The ANOVA comparing the grades.full model with the grades.nosleep model is not statistically significant (p = 0.33), and adjusted r-squared is essentially unchanged (0.544 vs 0.542). This means that adding the third variable, Time.slept, did not improve the fit of the model. Therefore, we would conclude that time slept is not useful for predicting exam grades once hours studied and current GPA are in the model.

#-----------------------------------------------------------------------------------------------------

#The model below contains only two of the original predictors.
grades.nostudy = lm(grades$Exam.grade ~ grades$Time.slept + grades$Current.GPA)

summary(grades.nostudy)
## 
## Call:
## lm(formula = grades$Exam.grade ~ grades$Time.slept + grades$Current.GPA)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -22.310  -4.577   0.875   5.778  11.284 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)  
## (Intercept)          37.707     12.840   2.937   0.0124 *
## grades$Time.slept     2.705      2.191   1.234   0.2407  
## grades$Current.GPA    8.595      4.095   2.099   0.0576 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.002 on 12 degrees of freedom
## Multiple R-squared:  0.531,  Adjusted R-squared:  0.4528 
## F-statistic: 6.793 on 2 and 12 DF,  p-value: 0.01064
# Adjusted r-squared is 0.453

#Passing the two fitted models to the anova() function compares them directly.
anova(grades.full, grades.nostudy)
## Analysis of Variance Table
## 
## Model 1: grades$Exam.grade ~ grades$Time.slept + grades$Hours.studied + 
##     grades$Current.GPA
## Model 2: grades$Exam.grade ~ grades$Time.slept + grades$Current.GPA
##   Res.Df    RSS Df Sum of Sq      F  Pr(>F)  
## 1     11 743.39                              
## 2     12 972.38 -1   -228.99 3.3884 0.09277 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# The ANOVA comparing the grades.full model with the grades.nostudy model is not statistically significant at the 0.05 level (p = 0.093), although adjusted r-squared drops noticeably when Hours.studied is removed (0.544 to 0.453). Adding Hours.studied therefore improves the fit only marginally by this test; the evidence that hours studied helps predict exam grades is suggestive but not conclusive.

#-----------------------------------------------------------------------------------------------------

#The model below contains only two of the original predictors.
grades.nogpa = lm(grades$Exam.grade ~ grades$Time.slept + grades$Hours.studied)

summary(grades.nogpa)
## 
## Call:
## lm(formula = grades$Exam.grade ~ grades$Time.slept + grades$Hours.studied)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -15.753  -4.148   1.973   4.247  10.634 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           52.1720    10.3859   5.023 0.000297 ***
## grades$Time.slept      2.5081     1.8587   1.349 0.202114    
## grades$Hours.studied   1.3871     0.4692   2.956 0.012003 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.006 on 12 degrees of freedom
## Multiple R-squared:  0.629,  Adjusted R-squared:  0.5672 
## F-statistic: 10.17 on 2 and 12 DF,  p-value: 0.002608
# Adjusted r-squared is 0.567

#Passing the two fitted models to the anova() function compares them directly.
anova(grades.full, grades.nogpa)
## Analysis of Variance Table
## 
## Model 1: grades$Exam.grade ~ grades$Time.slept + grades$Hours.studied + 
##     grades$Current.GPA
## Model 2: grades$Exam.grade ~ grades$Time.slept + grades$Hours.studied
##   Res.Df    RSS Df Sum of Sq      F Pr(>F)
## 1     11 743.39                           
## 2     12 769.21 -1   -25.813 0.3819 0.5491
# The ANOVA comparing the grades.full model with the grades.nogpa model is not statistically significant (p = 0.55), and adjusted r-squared is actually higher without Current.GPA (0.567 vs 0.544). This means that adding the third variable, Current.GPA, did not improve the fit of the model. Therefore, we would conclude that current GPA adds little unique value for predicting exam grades once time slept and hours studied are in the model.

#-----------------------------------------------------------------------------------------------------

# Overall, the model comparisons suggest that hours studied is the predictor with the greatest unique contribution to exam grades: dropping it produces the largest loss of fit (p = 0.093), while dropping time slept (p = 0.33) or current GPA (p = 0.55) costs almost nothing. This is particularly interesting since the correlation tests showed a statistically significant correlation between each predictor variable and exam grades. The difference likely reflects multicollinearity (strong correlations among the predictor variables), so much of the variance each predictor explains on its own is shared with the others; it could also reflect a confounding variable not present in the dataset (such as grades on past exams in the class). Essentially, correlation tests can suggest significance for each predictor individually, while the full model reveals which predictors contribute uniquely to explaining variance in the outcome.
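
# A direct way to quantify the multicollinearity suspected above is the variance
# inflation factor. A minimal sketch, assuming the car package is available (it
# is not loaded elsewhere in this script); VIFs well above ~5 suggest a predictor
# is largely redundant with the others:
# install.packages("car")
library(car)
vif(grades.full)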