Loading data into R

library(tidyverse)
library(mosaic)
library(ggplot2)
library(ggformula)
# Read the salary and tuition data directly from GitHub
Higher_edu <- read_csv("https://raw.githubusercontent.com/lebebr01/psqf_6243/main/data/salary_tuition.csv")

Question 1

The research question to be answered with linear regression is whether the percentage of the student body in STEM influences early career pay. In other words, we want to explore the relationship between the percentage of the student body in STEM and early career pay.

Question 2

summary_early <- favstats(~ early_career_pay, data = Higher_edu)
summary_stem <- favstats(~ stem_percent, data = Higher_edu)
as.data.frame(rbind(summary_early, summary_stem), row.names = c("Early Career Pay", "Stem Percentage"))
##                    min    Q1 median    Q3   max        mean         sd   n
## Early Career Pay 32500 45900  49700 55000 88800 51150.47490 7974.05800 737
## Stem Percentage      0     8     14    22    97    17.45862   15.12111 737
##                  missing
## Early Career Pay       0
## Stem Percentage        0
cor(Higher_edu$stem_percent, Higher_edu$early_career_pay)
## [1] 0.6580194
gf_point(early_career_pay ~ stem_percent, data = Higher_edu, size = 4, alpha = .5) %>% 
  gf_smooth() %>%
  gf_smooth(method = 'lm', linetype = 2, color = 'red') %>%
  gf_labs(x = "Stem Percent",
          y = "Early Career Pay")
## `geom_smooth()` using method = 'loess'

STEM percent and early career pay have a correlation of 0.66, which indicates a fairly strong, positive, and linear relationship. The scatterplot with the smoother lines also supports a linear relationship between the two variables.
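As a complementary check (a sketch, not part of the original output), cor.test() would attach a test and a 95% confidence interval to this correlation:

# Correlation test with a 95% confidence interval
cor.test(~ stem_percent + early_career_pay, data = Higher_edu)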

Question 3

stem_earlypay_model <- lm(early_career_pay~stem_percent, data = Higher_edu)
coef(stem_earlypay_model)
##  (Intercept) stem_percent 
##   45092.2678     347.0039

The intercept is $45,092.27 and the slope is $347.00. The estimated early career pay values are given by:

\[ \hat{Early\_career\_pay} = 45092.27 + 347.00(stem\_percent) \]

For the intercept, this means that when a college has no students in STEM, the average early career pay for students from that institution is $45,092.27. The slope estimate means that for every one percentage point increase in the share of the student body in STEM, early career pay increases by $347.00 on average.
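As an illustration (a sketch, not part of the original analysis), the fitted model can be used to produce predicted early career pay at a few example STEM percentages, including 0% to match the intercept interpretation:

# Predicted early career pay at 0%, 10%, and 25% STEM (values chosen for illustration)
predict(stem_earlypay_model, newdata = data.frame(stem_percent = c(0, 10, 25)))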

Question 4

The intercept is directly interpretable because it gives the average early career pay for students from institutions where none of the student body is in STEM. This interpretation is practical because colleges with zero percent of the student body in STEM exist in our dataset; in fact, there are twenty such colleges.
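That count can be verified directly from the data; a quick sketch, not part of the original output:

# Count colleges whose student body is 0% STEM
Higher_edu %>%
  filter(stem_percent == 0) %>%
  nrow()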

Question 5

summary(stem_earlypay_model)
## 
## Call:
## lm(formula = early_career_pay ~ stem_percent, data = Higher_edu)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -21396  -3515   -715   2803  35214 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  45092.27     338.20  133.33   <2e-16 ***
## stem_percent   347.00      14.65   23.69   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6009 on 735 degrees of freedom
## Multiple R-squared:  0.433,  Adjusted R-squared:  0.4322 
## F-statistic: 561.3 on 1 and 735 DF,  p-value: < 2.2e-16

The residual standard error and multiple R-squared statistics provide information about how well our model is performing. The residual standard error, \(\hat\sigma\), indicates that the observed outcomes are, on average, about $6,009 away from the values predicted by the model, while the multiple R-squared, \(R^2\), indicates that about 43% of the variation in the outcome is explained by our predictor variable. In the problem context: on average, the observed early career pay values are about $6,009 away from the predicted early career pay, and about 43% of the variation in early career pay can be explained by the percentage of the student body in STEM.
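If needed, these two quantities can be pulled directly from the fitted model object; a minimal sketch, not part of the original output:

# Extract the residual standard error (sigma-hat) and R-squared from the fit
summary(stem_earlypay_model)$sigma
summary(stem_earlypay_model)$r.squared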

Question 6

Checking the assumption of normality of residuals

Higher_edu$residuals <- resid(stem_earlypay_model)
gf_density(~ residuals, data = Higher_edu) %>%
  gf_labs(x = "Residuals")

ggplot(Higher_edu, aes(sample = residuals)) + 
  stat_qq(size = 5) + 
  stat_qq_line(size = 2)

The density plot and Q-Q plot of the residuals both suggest that the residual distribution is approximately symmetric. Although there is some deviation in both tails of the Q-Q plot, it is small enough that the assumption of normality of residuals appears reasonably met.
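A formal test can supplement the graphical checks if desired; for example, a Shapiro-Wilk test on the residuals (a sketch, not part of the original analysis; with large samples even small departures can be flagged, so the plots remain the primary evidence):

# Shapiro-Wilk test of normality applied to the model residuals
shapiro.test(resid(stem_earlypay_model))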

Checking the assumption of homogeneity of variance

library(broom)
residuals_diag <- augment(stem_earlypay_model)
gf_point(.resid ~ .fitted, data = residuals_diag, size = 5, alpha = .15) %>%
  gf_smooth(method = 'loess', size = 2) %>%
  gf_labs(x = 'Fitted Values',
          y = 'Residuals')

gf_point(.std.resid ~ .fitted, data = residuals_diag, size = 5, alpha = .15) %>%
  gf_smooth(method = 'loess', size = 2) %>%
  gf_labs(x = 'Fitted Values',
          y = 'Standardized Residuals')

gf_point(.std.resid ~ .fitted, data = residuals_diag, size = 4, alpha = .15) %>%
  gf_hline(yintercept = ~ 2, color = 'red', size = 1) %>%
  gf_hline(yintercept = ~ -2, color = 'red', size = 1) %>%
  gf_smooth(method = 'loess', size = 2) %>%
  gf_labs(x = 'Fitted Values',
          y = 'Standardized Residuals')

Based on the residual plots above, we can conclude that the homogeneity of variance assumption is met. The residuals do not show a clear trend or fanning pattern across the fitted values, which supports the claim that the assumption is not violated.
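A numeric check of constant variance could supplement these plots; the sketch below assumes the lmtest package is installed and uses the Breusch-Pagan test, which was not part of the original analysis:

library(lmtest)
# Breusch-Pagan test: the null hypothesis is constant residual variance
bptest(stem_earlypay_model)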

Question 7

The null and alternative hypotheses for \(\beta_{0}\) and \(\beta_{1}\) are given below.

\(\mathbf{\beta_{0}}\)

\[ H_{0}: \beta_{0} = 0.\ The\ population\ y\text{-}intercept\ equals\ 0. \]

Vs

\[ H_{1}: \beta_{0} \ne 0.\ The\ population\ y\text{-}intercept\ is\ not\ equal\ to\ 0. \]

\(\mathbf{\beta_{1}}\)

\[ H_{0}: \beta_{1} = 0.\ The\ population\ slope\ equals\ 0. \]

Vs

\[ H_{1}: \beta_{1} \ne 0.\ The\ population\ slope\ is\ not\ equal\ to\ 0. \]
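For reference, each of these hypotheses is tested with a t statistic of the form estimate divided by its standard error. Using the slope estimate and standard error from the model summary shown in Question 8:

\[ t = \frac{\hat{\beta}_{1}}{SE(\hat{\beta}_{1})} = \frac{347.00}{14.65} \approx 23.69 \]

which matches the t value reported for stem_percent in that output.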

Question 8

From the model summary:

summary(stem_earlypay_model)
## 
## Call:
## lm(formula = early_career_pay ~ stem_percent, data = Higher_edu)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -21396  -3515   -715   2803  35214 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  45092.27     338.20  133.33   <2e-16 ***
## stem_percent   347.00      14.65   23.69   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6009 on 735 degrees of freedom
## Multiple R-squared:  0.433,  Adjusted R-squared:  0.4322 
## F-statistic: 561.3 on 1 and 735 DF,  p-value: < 2.2e-16

The p-values show that, if the null hypotheses were true, the probability of obtaining estimates of \(\beta_{0}\) and \(\beta_{1}\) at least as extreme as the ones observed would be very low. This provides evidence against the null hypotheses that \(\beta_{0}=0\) and \(\beta_{1}=0\). The p-values are approximately zero for both estimates, so we conclude that the evidence against the null hypotheses is strong.
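As a complementary way to express this evidence (a sketch, not part of the original output), 95% confidence intervals for both coefficients could be computed with confint():

# 95% confidence intervals for the intercept and slope
confint(stem_earlypay_model)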

Question 9

The results from Question 8, showing that the estimates of \(\beta_{0}\) and \(\beta_{1}\) are statistically significant, imply that there is a relationship between the predictor variable (percentage of the student body in STEM) and the outcome variable (early career pay). Practically, the result makes sense: \(\beta_{0}\) should not be zero because college graduates earn an actual early career pay regardless of how many students in their college are in STEM, and colleges without STEM degrees exist whose graduates still hold paid jobs. The significant \(\beta_{1}\) also makes sense because STEM degrees are known to lead to higher-paying jobs, which explains why each one percentage point increase in the share of students in STEM is associated with an average increase of about $347 in early career pay.