library(tidyverse)
library(mosaic)
library(ggplot2)
library(ggformula)
Higher_edu <- read_csv("https://raw.githubusercontent.com/lebebr01/psqf_6243/main/data/salary_tuition.csv")
The research question to be answered with linear regression is the influence of percentage of student body in STEM on early career income. We want to explore the relationship between percentage of student body in STEM and early career pay.
summary_early <- favstats(~early_career_pay, data=Higher_edu)
summary_stem <- favstats(~stem_percent, data=Higher_edu)
as.data.frame((rbind(summary_early,summary_stem)), row.names=c("Early Career Pay", "Stem Percentage"))
## min Q1 median Q3 max mean sd n
## Early Career Pay 32500 45900 49700 55000 88800 51150.47490 7974.05800 737
## Stem Percentage 0 8 14 22 97 17.45862 15.12111 737
## missing
## Early Career Pay 0
## Stem Percentage 0
cor(Higher_edu$stem_percent, Higher_edu$early_career_pay)
## [1] 0.6580194
gf_point(early_career_pay~stem_percent, data = Higher_edu, size = 4, alpha = .5) %>%
gf_smooth()%>%
gf_smooth(method = 'lm', linetype = 2, color = 'red')%>%
gf_labs(x = "Stem Percent",
y = "Early Career Pay")
## `geom_smooth()` using method = 'loess'
Stem percent and early career pay have a correlation value of 0.66, this shows a fairly strong, positive and linear relationship.The scatter plot with the smoother lines also supports this linear relationmship between the two variables.
stem_earlypay_model <- lm(early_career_pay~stem_percent, data = Higher_edu)
coef(stem_earlypay_model)
## (Intercept) stem_percent
## 45092.2678 347.0039
The intercept is $45092.27 and the slope is $347. The estimated early career pay values will be:
\[ \hat{Early\_career\_pay} = 45092.27 + 347 (stem\_ percent) \] For the intercept, this means that when a college has no student in STEM, the average early career pay for student from such institution is $45,092.27 on average. The slope estimate on the other hand means that for every percentage increase in number of student in STEM, early career pay increases by $347 on average.
The intercept is very much interpretable as it provides the average early career pay for students from institutions with zero STEM student body. This result is practical because there are colleges with zero percent STEM student body in our dataset. In fact, we have twenty of such colleges in our dataset.
summary(stem_earlypay_model)
##
## Call:
## lm(formula = early_career_pay ~ stem_percent, data = Higher_edu)
##
## Residuals:
## Min 1Q Median 3Q Max
## -21396 -3515 -715 2803 35214
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 45092.27 338.20 133.33 <2e-16 ***
## stem_percent 347.00 14.65 23.69 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6009 on 735 degrees of freedom
## Multiple R-squared: 0.433, Adjusted R-squared: 0.4322
## F-statistic: 561.3 on 1 and 735 DF, p-value: < 2.2e-16
The residual standard error and multiple R-squared statistics provide information about how well our model is performing. The standard residual error, \(\hat\sigma\) in the model above means our observed data far from the predicted outcome by $6009 while the multiple R-squared, \(R^2\) states that 43% of variation in outcome is explained by our predictor variable. In problem context, this means that On average, the observed early career pay values are about $6009 away from the predicted early career pay. 43% of variation in early career pay can be explained by percentage of student body in STEM.
Higher_edu$residuals <- resid(stem_earlypay_model)
gf_density(~ residuals, data = Higher_edu) %>%
gf_labs(x = "Residuals")
ggplot(Higher_edu, aes(sample = residuals)) +
stat_qq(size = 5) +
stat_qq_line(size = 2)
The density plot and QQ of residuals both suggest that our residual distribution is approximately symmetric. Although there is a small deviation at both tails of the QQ plot, this is usually not a problem in meeting regression assumption of normality of residual.
library(broom)
residuals_diag <- augment(stem_earlypay_model)
gf_point(.resid ~ .fitted, data = residuals_diag, size = 5, alpha = .15) %>%
gf_smooth(method = 'loess', size = 2) %>%
gf_labs(x = 'Fitted Values',
y = 'Residuals')
gf_point(.std.resid ~ .fitted, data = residuals_diag, size = 5, alpha = .15) %>%
gf_smooth(method = 'loess', size = 2) %>%
gf_labs(x = 'Fitted Values',
y = 'Standardized Residuals')
gf_point(.std.resid ~ .fitted, data = residuals_diag, size = 4, alpha = .15) %>%
gf_hline(yintercept = ~ 2, color = 'red', size = 1) %>%
gf_hline(yintercept = ~ -2, color = 'red', size = 1) %>%
gf_smooth(method = 'loess', size = 2) %>%
gf_labs(x = 'Fitted Values',
y = 'Standardized Residuals')
Based on the residual plots above. We can conclude that the homogeneity assumption is met. The residual seems not to have a trend, which supports the claim that the homogeneity assumption is not violated.
The null and alternative hypotheses are given below for \(\beta_{0}\) and \(\beta_{1}\)
\(\mathbf{\beta_{0}}\)
\[ H_{0}: \beta_{0} = 0.\ The\ population\ yintercept\ equals\ 0. \]
\[ H_{1}: \beta_{0} \ne 0.\ The\ population\ yintercept\ is\ not\ equal\ to\ 0. \] \(\mathbf{\beta_{1}}\)
\[ H_{1}: \beta_{1} = 0.\ The\ population\ slope\ equals\ 0. \]
\[ H_{1}: \beta_{1} \ne 0.\ The\ population\ slope\ is\ not\ equal\ to\ 0. \]
From
summary(stem_earlypay_model)
##
## Call:
## lm(formula = early_career_pay ~ stem_percent, data = Higher_edu)
##
## Residuals:
## Min 1Q Median 3Q Max
## -21396 -3515 -715 2803 35214
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 45092.27 338.20 133.33 <2e-16 ***
## stem_percent 347.00 14.65 23.69 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6009 on 735 degrees of freedom
## Multiple R-squared: 0.433, Adjusted R-squared: 0.4322
## F-statistic: 561.3 on 1 and 735 DF, p-value: < 2.2e-16
We have p-values that shows that the probability of our {0} and {1} having one or more value as extreme as zero is very low. It therefore provides evidence against the null hypotheses that \(\beta_{0}=0\) and \(\beta_{1}=0\). The p-values are approximately zero for both estimates and we can on this basis, conclude that evidence against the null hypotheses is strong.
The result from question 8 showing that the estimates of {0} and {1} are significant implies that there is there is a relationship between the predictor variable (Percent of student body in STEM) and the outcome variable (early career pay). Practically, the result makes sense because {0} cannot be zero as college graduate earn an actual early career pay regardless of the number of student in STEM in their college. Also, colleges without STEM degree exist and graduates from such colleges have paid jobs. The {1} being significant also makes sense due to the fact that STEM degrees are known to offer opportunities for high paying job, it therefore explains why a percentage increase in number of students in STEM in a college increases their early career pay by $342.