Regression in R

In this lab, we’ll go over how to use R to do regression analysis. Let’s start by bringing in the openintro package and the datasets we’ll be using.

require(openintro)
require(ggplot2)
# NOTE: CHANGE THE NAME OF "satGPA" to "satgpa" in the next line to correct error
data(satgpa)
data(ncbirths)
# Load the data
data(satgpa)
# Check the structure of the data
str(satgpa)
## tibble [1,000 × 6] (S3: tbl_df/tbl/data.frame)
##  $ sex    : int [1:1000] 1 2 2 1 1 2 1 1 2 1 ...
##  $ sat_v  : int [1:1000] 65 58 56 42 55 55 57 53 67 41 ...
##  $ sat_m  : int [1:1000] 62 64 60 53 52 56 65 62 77 44 ...
##  $ sat_sum: int [1:1000] 127 122 116 95 107 111 122 115 144 85 ...
##  $ hs_gpa : num [1:1000] 3.4 4 3.75 3.75 4 4 2.8 3.8 4 2.6 ...
##  $ fy_gpa : num [1:1000] 3.18 3.33 3.25 2.42 2.63 2.91 2.83 2.51 3.82 2.54 ...
# Check the first few rows of the data
head(satgpa)
# Factor sex
satgpa$sex <- as.factor(satgpa$sex)
summary(satgpa)
##  sex         sat_v           sat_m         sat_sum          hs_gpa     
##  1:516   Min.   :24.00   Min.   :29.0   Min.   : 53.0   Min.   :1.800  
##  2:484   1st Qu.:43.00   1st Qu.:49.0   1st Qu.: 93.0   1st Qu.:2.800  
##          Median :49.00   Median :55.0   Median :103.0   Median :3.200  
##          Mean   :48.93   Mean   :54.4   Mean   :103.3   Mean   :3.198  
##          3rd Qu.:54.00   3rd Qu.:60.0   3rd Qu.:113.0   3rd Qu.:3.700  
##          Max.   :76.00   Max.   :77.0   Max.   :144.0   Max.   :4.500  
##      fy_gpa     
##  Min.   :0.000  
##  1st Qu.:1.980  
##  Median :2.465  
##  Mean   :2.468  
##  3rd Qu.:3.020  
##  Max.   :4.000
table(satgpa$sex)
## 
##   1   2 
## 516 484
# Create a scatterplot of sat_sum and fy_gpa by sex

ggplot(satgpa, aes(x = sat_sum, y = fy_gpa, color = sex)) +
  geom_point(size = 2, alpha = .5) +
  geom_smooth(method = "lm", se = FALSE, color = "black") +
  # increase text size of title
  ggtitle("Relationship between SAT scores and Final Year School GPA") +
  # increase text size of x and y axis labels
  labs(x = "SAT Sum Score", y = "Final Year GPA")+
  scale_colour_brewer(palette = "Dark2")+
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'

# Find the outliers and annotate them 
# Are these high leverage points?

ggplot(satgpa, aes(x = sat_sum, y = fy_gpa, color = sex)) +
  # highlight outliers above and below 2.5 SD
  geom_point(data = subset(satgpa, fy_gpa > mean(fy_gpa) + 2.5*sd(fy_gpa) | fy_gpa < mean(fy_gpa) - 2.5*sd(fy_gpa)), aes(x = sat_sum, y = fy_gpa), color = "maroon") +
  geom_smooth(method = "lm", se = FALSE, color = "black") +
  # increase text size of title
  ggtitle("Relationship between SAT scores and Final Year School GPA") +
  # increase text size of x and y axis labels
  labs(x = "SAT Sum Score", y = "Final Year GPA")+
  #scale_colour_brewer(palette = "Dark2")+
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'

# Find the outliers in Final Year GPA who had fy_gpa but low sat_sum

satgpa_outliers <- subset(satgpa, fy_gpa > mean(fy_gpa) + 2.5*sd(fy_gpa) | fy_gpa < mean(fy_gpa) - 2.5*sd(fy_gpa))

satgpa_outliers
# Find outliers above and below 2.5 SD and annotate them

satgpa_outliers <- subset(satgpa, fy_gpa > mean(fy_gpa) + 2.5*sd(fy_gpa) | fy_gpa < mean(fy_gpa) - 2.5*sd(fy_gpa))
satgpa_outliers
# Find the correlation between sat_um and fy_gpa
cor(satgpa$sat_sum, satgpa$fy_gpa)
## [1] 0.460281
# Compute the correlation between sat_sum and fy_gpa and construct a 95% Confidence Interval

cor.test(satgpa$sat_sum, satgpa$fy_gpa, conf.level = 0.95)
## 
##  Pearson's product-moment correlation
## 
## data:  satgpa$sat_sum and satgpa$fy_gpa
## t = 16.379, df = 998, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.4099866 0.5077848
## sample estimates:
##      cor 
## 0.460281
# Interpret the correlation coefficient

# The correlation coefficient is 0.46, which means that there is a moderate positive correlation between SAT scores and high school GPA. This means that as SAT scores increase, high school GPA also tends to increase.
# Fit a linear regression model

satgpa_mod <- lm(fy_gpa ~ sat_sum, data = satgpa)
summary(satgpa_mod)
## 
## Call:
## lm(formula = fy_gpa ~ sat_sum, data = satgpa)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.1976 -0.4495  0.0315  0.4557  1.6115 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.001927   0.151991   0.013     0.99    
## sat_sum     0.023866   0.001457  16.379   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.658 on 998 degrees of freedom
## Multiple R-squared:  0.2119, Adjusted R-squared:  0.2111 
## F-statistic: 268.3 on 1 and 998 DF,  p-value: < 2.2e-16
# Interpret the output above

# The slope is 0.0012, which means that for each additional point in SAT score, the high school GPA increases by 0.0012 points.

# The intercept is 0.001, which means that if a student had a SAT score of 0, their high school GPA would be predicted to be 0.001.

# The R-squared value is 0.21, which means that 21% of the variation in high school GPA can be explained by SAT scores.

# The p-value is less than 0.001, which means that the relationship between SAT scores and high school GPA is statistically significant.
# Use ANOVA to predict sat_sum using fy_gpa

satgpa_mod2 <- aov(sat_sum ~ fy_gpa, data = satgpa)
summary(satgpa_mod2)
##              Df Sum Sq Mean Sq F value Pr(>F)    
## fy_gpa        1  43203   43203   268.3 <2e-16 ***
## Residuals   998 160722     161                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Interpret the output above: ANOVA
# The ANOVA table shows that the model is significant (p < 0.001), which means that there is a significant relationship between SAT scores and high school GPA.
# The F-statistic is 268.3, which means that the model is a good fit for the data.
plot(satgpa_mod)

library(performance)

# Check the conditions for the regression
check_model(satgpa_mod)

satagpamod2 <- lm(fy_gpa ~ sat_sum * hs_gpa, data = satgpa)
summary(satagpamod2)
## 
## Call:
## lm(formula = fy_gpa ~ sat_sum * hs_gpa, data = satgpa)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.10116 -0.33940  0.01683  0.41035  1.56987 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)  
## (Intercept)     0.875881   0.792732   1.105   0.2695  
## sat_sum        -0.002729   0.007763  -0.352   0.7253  
## hs_gpa          0.032836   0.246334   0.133   0.8940  
## sat_sum:hs_gpa  0.005300   0.002359   2.247   0.0249 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5929 on 996 degrees of freedom
## Multiple R-squared:  0.3613, Adjusted R-squared:  0.3594 
## F-statistic: 187.8 on 3 and 996 DF,  p-value: < 2.2e-16

Handout: The interaction is sig, even though none of the main effects were significant. This sometimes happens! Notice that the interaction is significant, even though none of the main effects are significant. This does sometimes happen, and that’s because the “main effect” doesn’t have the same interpretation anymore in the presence of an interaction. The coefficient for each just represent the effect when the other is at 0. So, for example, -0.002729 is the effect of SATSum when HSGPA is 0. The interaction being significant in this case is actually an indication that the effect of SATSum is significant for other values of HSGPA

Linear Regression with ncbirths

Now, let’s use the ncbirths dataset to run some linear regressions.

Questions

For the following questions, use the ncbirths dataset. Make sure you include all relevant code used to reach the answer.