In this lab, we’ll go over how to use R to do regression analysis.
Let’s start by bringing in the openintro package and the
datasets we’ll be using.
require(openintro)
require(ggplot2)
# NOTE: CHANGE THE NAME OF "satGPA" to "satgpa" in the next line to correct error
data(satgpa)
data(ncbirths)
# Load the data
data(satgpa)
# Check the structure of the data
str(satgpa)
## tibble [1,000 × 6] (S3: tbl_df/tbl/data.frame)
## $ sex : int [1:1000] 1 2 2 1 1 2 1 1 2 1 ...
## $ sat_v : int [1:1000] 65 58 56 42 55 55 57 53 67 41 ...
## $ sat_m : int [1:1000] 62 64 60 53 52 56 65 62 77 44 ...
## $ sat_sum: int [1:1000] 127 122 116 95 107 111 122 115 144 85 ...
## $ hs_gpa : num [1:1000] 3.4 4 3.75 3.75 4 4 2.8 3.8 4 2.6 ...
## $ fy_gpa : num [1:1000] 3.18 3.33 3.25 2.42 2.63 2.91 2.83 2.51 3.82 2.54 ...
# Check the first few rows of the data
head(satgpa)
# Factor sex
satgpa$sex <- as.factor(satgpa$sex)
summary(satgpa)
## sex sat_v sat_m sat_sum hs_gpa
## 1:516 Min. :24.00 Min. :29.0 Min. : 53.0 Min. :1.800
## 2:484 1st Qu.:43.00 1st Qu.:49.0 1st Qu.: 93.0 1st Qu.:2.800
## Median :49.00 Median :55.0 Median :103.0 Median :3.200
## Mean :48.93 Mean :54.4 Mean :103.3 Mean :3.198
## 3rd Qu.:54.00 3rd Qu.:60.0 3rd Qu.:113.0 3rd Qu.:3.700
## Max. :76.00 Max. :77.0 Max. :144.0 Max. :4.500
## fy_gpa
## Min. :0.000
## 1st Qu.:1.980
## Median :2.465
## Mean :2.468
## 3rd Qu.:3.020
## Max. :4.000
table(satgpa$sex)
##
## 1 2
## 516 484
# Create a scatterplot of sat_sum and fy_gpa by sex
ggplot(satgpa, aes(x = sat_sum, y = fy_gpa, color = sex)) +
geom_point(size = 2, alpha = .5) +
geom_smooth(method = "lm", se = FALSE, color = "black") +
# increase text size of title
ggtitle("Relationship between SAT scores and Final Year School GPA") +
# increase text size of x and y axis labels
labs(x = "SAT Sum Score", y = "Final Year GPA")+
scale_colour_brewer(palette = "Dark2")+
theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
# Find the outliers and annotate them
# Are these high leverage points?
ggplot(satgpa, aes(x = sat_sum, y = fy_gpa, color = sex)) +
# highlight outliers above and below 2.5 SD
geom_point(data = subset(satgpa, fy_gpa > mean(fy_gpa) + 2.5*sd(fy_gpa) | fy_gpa < mean(fy_gpa) - 2.5*sd(fy_gpa)), aes(x = sat_sum, y = fy_gpa), color = "maroon") +
geom_smooth(method = "lm", se = FALSE, color = "black") +
# increase text size of title
ggtitle("Relationship between SAT scores and Final Year School GPA") +
# increase text size of x and y axis labels
labs(x = "SAT Sum Score", y = "Final Year GPA")+
#scale_colour_brewer(palette = "Dark2")+
theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
# Find the outliers in Final Year GPA who had fy_gpa but low sat_sum
satgpa_outliers <- subset(satgpa, fy_gpa > mean(fy_gpa) + 2.5*sd(fy_gpa) | fy_gpa < mean(fy_gpa) - 2.5*sd(fy_gpa))
satgpa_outliers
# Find outliers above and below 2.5 SD and annotate them
satgpa_outliers <- subset(satgpa, fy_gpa > mean(fy_gpa) + 2.5*sd(fy_gpa) | fy_gpa < mean(fy_gpa) - 2.5*sd(fy_gpa))
satgpa_outliers
# Find the correlation between sat_um and fy_gpa
cor(satgpa$sat_sum, satgpa$fy_gpa)
## [1] 0.460281
# Compute the correlation between sat_sum and fy_gpa and construct a 95% Confidence Interval
cor.test(satgpa$sat_sum, satgpa$fy_gpa, conf.level = 0.95)
##
## Pearson's product-moment correlation
##
## data: satgpa$sat_sum and satgpa$fy_gpa
## t = 16.379, df = 998, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.4099866 0.5077848
## sample estimates:
## cor
## 0.460281
# Interpret the correlation coefficient
# The correlation coefficient is 0.46, which means that there is a moderate positive correlation between SAT scores and high school GPA. This means that as SAT scores increase, high school GPA also tends to increase.
# Fit a linear regression model
satgpa_mod <- lm(fy_gpa ~ sat_sum, data = satgpa)
summary(satgpa_mod)
##
## Call:
## lm(formula = fy_gpa ~ sat_sum, data = satgpa)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.1976 -0.4495 0.0315 0.4557 1.6115
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.001927 0.151991 0.013 0.99
## sat_sum 0.023866 0.001457 16.379 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.658 on 998 degrees of freedom
## Multiple R-squared: 0.2119, Adjusted R-squared: 0.2111
## F-statistic: 268.3 on 1 and 998 DF, p-value: < 2.2e-16
# Interpret the output above
# The slope is 0.0012, which means that for each additional point in SAT score, the high school GPA increases by 0.0012 points.
# The intercept is 0.001, which means that if a student had a SAT score of 0, their high school GPA would be predicted to be 0.001.
# The R-squared value is 0.21, which means that 21% of the variation in high school GPA can be explained by SAT scores.
# The p-value is less than 0.001, which means that the relationship between SAT scores and high school GPA is statistically significant.
# Use ANOVA to predict sat_sum using fy_gpa
satgpa_mod2 <- aov(sat_sum ~ fy_gpa, data = satgpa)
summary(satgpa_mod2)
## Df Sum Sq Mean Sq F value Pr(>F)
## fy_gpa 1 43203 43203 268.3 <2e-16 ***
## Residuals 998 160722 161
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Interpret the output above: ANOVA
# The ANOVA table shows that the model is significant (p < 0.001), which means that there is a significant relationship between SAT scores and high school GPA.
# The F-statistic is 268.3, which means that the model is a good fit for the data.
plot(satgpa_mod)
library(performance)
# Check the conditions for the regression
check_model(satgpa_mod)
satagpamod2 <- lm(fy_gpa ~ sat_sum * hs_gpa, data = satgpa)
summary(satagpamod2)
##
## Call:
## lm(formula = fy_gpa ~ sat_sum * hs_gpa, data = satgpa)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.10116 -0.33940 0.01683 0.41035 1.56987
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.875881 0.792732 1.105 0.2695
## sat_sum -0.002729 0.007763 -0.352 0.7253
## hs_gpa 0.032836 0.246334 0.133 0.8940
## sat_sum:hs_gpa 0.005300 0.002359 2.247 0.0249 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5929 on 996 degrees of freedom
## Multiple R-squared: 0.3613, Adjusted R-squared: 0.3594
## F-statistic: 187.8 on 3 and 996 DF, p-value: < 2.2e-16
Handout: The interaction is sig, even though none of the main effects were significant. This sometimes happens! Notice that the interaction is significant, even though none of the main effects are significant. This does sometimes happen, and that’s because the “main effect” doesn’t have the same interpretation anymore in the presence of an interaction. The coefficient for each just represent the effect when the other is at 0. So, for example, -0.002729 is the effect of SATSum when HSGPA is 0. The interaction being significant in this case is actually an indication that the effect of SATSum is significant for other values of HSGPA
ncbirthsNow, let’s use the ncbirths dataset to run some linear
regressions.
For the following questions, use the ncbirths dataset.
Make sure you include all relevant code used to reach the answer.