Comment #1 In other words, you are trying to explain variation in tuition based on early career pay, right? I would recommend trying to articulate the research questions a bit more directly, that can help readers.
New Guiding Question:
\[ How\ is\ the\ tuition\ for\ in-state\ residents\ in\ USD\ related\ to\ estimated\ early\ career\ pay\ in\ USD? \]
Comment #2 Also, for your scatterplots, I believe you have the x and y axes flipped in your description. For the formula notation, the y-axis is to the left of the ~.
salary_tuition <- readr::read_csv("https://raw.githubusercontent.com/lebebr01/psqf_6243/main/data/salary_tuition.csv")
library(tidyverse)
library(ggformula)
library(ggplot2)
library(lubridate)
library(mosaic)
library(bayesrules)
library(e1071)
library(car)
library(AICcmodavg)
theme_set(theme_bw(base_size = 18))
head(salary_tuition)
## # A tibble: 6 x 13
## name state_name early_career_pay mid_career_pay make_world_bett~ stem_percent
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Aubu~ Alabama 54400 104500 51 31
## 2 Univ~ Alabama 57500 103900 59 45
## 3 Tusk~ Alabama 54500 93500 61 30
## 4 Samf~ Alabama 48400 90500 52 3
## 5 Spri~ Alabama 46600 89100 53 12
## 6 Birm~ Alabama 49100 88300 48 27
## # ... with 7 more variables: type <chr>, degree_length <chr>,
## # room_and_board <dbl>, in_state_tuition <dbl>, in_state_total <dbl>,
## # out_of_state_tuition <dbl>, out_of_state_total <dbl>
gf_point(early_career_pay ~ in_state_tuition, data = salary_tuition, size = 4, alpha = .3) %>%
gf_labs(x = "Tuition for in-state residents in USD",
y = "Estimated early career pay in USD")
Comment #3 Your linear regression run in R, does not match with your model formula. See the comment above regarding the scatterplot, but the outcome should be to the left of the ~.
\[ early\_career\_pay = \beta_{0} + \beta_{1}\ in\_state\_tuition + \epsilon \]
Question #1: Based on your first assignment, are there attributes that could have been missing from the linear regression fitted in assignment 1? That is, is the assumption of all attributes being included in the model appropriate? Why or why not? If not, what other attributes may be of interest?
Based on the first assignment, the linear model had low value of \(R^2 = 0.218\) this would be alarming as the predictor attribute could explain only low level of the variability in the outcome attribute. As a result, it would be beneficial to add more predictors that could give better predictions for the outcome attribute. Including all of the 13 attributes won’t be a good idea, as it requires a total of \(2^{13} = 8192\) models! that contain subsets of 13 attributes, this is not practical. Therefore, we cannot consider all the 13 attributes and we have to choose a smaller set of models to consider. Consequently, we will choose one of the classical approaches for this task: Forward, Backward, or Mixed selection method. The following attributes may be interesting to added to the model:
Question #2: Add one or more of the attributes identified in question 1 to create a multiple regression model. That is, add one or more of the attributes from question 1 while keeping the attribute from assignment 1 into the regression model. Summarize the interpretation of the regression coefficient estimates for this multiple regression model.
The linear model that was done in assignment #1 was as follows:
\[ \hat{early\_career\_pay} = 45,250 + 0.0227\ in\_state\_tuition \]
Before adding more attributes the following two points have to be considered:
There is a linear relationship between each of the added predictors and the early_career_pay attribute. To test this, create scatterplots and verify a linear shape.
There aren’t high levels of correlation between any of the chosen predictors. If these attributes are correlated both to each other and the early_career_pay, this qualifies as multicollinearity and makes it more difficult to unpack the relationships.
Thus: The correlations between the attributes could be shown in pairs as following:
library(GGally)
ggpairs(salary_tuition[c('early_career_pay', 'in_state_tuition', 'stem_percent')])
This would show moderate, positive relationships between each of the two predictors in_state_tuition, stem_percent and the outcome early_career_pay. Moreover, there is a weak correlation between the two predictors themselves.
Now we can fit the model
epay_mult_reg <- lm(early_career_pay ~ in_state_tuition + stem_percent, data = salary_tuition)
summary(epay_mult_reg)
##
## Call:
## lm(formula = early_career_pay ~ in_state_tuition + stem_percent,
## data = salary_tuition)
##
## Residuals:
## Min 1Q Median 3Q Max
## -18168 -3406 -57 2908 33095
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.146e+04 4.075e+02 101.75 <2e-16 ***
## in_state_tuition 1.660e-01 1.243e-02 13.35 <2e-16 ***
## stem_percent 3.082e+02 1.346e+01 22.89 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5393 on 734 degrees of freedom
## Multiple R-squared: 0.5438, Adjusted R-squared: 0.5426
## F-statistic: 437.5 on 2 and 734 DF, p-value: < 2.2e-16
According to the above summary, the multiple regression model would be:
\[ \hat{early\_career\_pay} = \beta_0 + \beta_1*\ in\_state\_tuition\ +\ \beta_2*\ stem\_percent \\ \hat{early\_career\_pay} = 41,460 + 0.166\ in\_state\_tuition\ +\ 308.2\ stem\_percent \]
Additionally;
\(\beta_0:\) could be interpreted as; on average the early career pay is \(\$41,460\) when in_state_tuition and stem_percent are equal to zero.
\(\beta_1:\) could be interpreted as; the outcome attribute early_career_pay would change on average by \(\$166\) for each \(\$1,000\) increase in the in_state_tuition holding the stem_percent value constant.
\(\beta_2:\) could be interpreted as; the outcome attribute early_career_pay would change on average by \(\$308.2\) for each \(1\%\) increase in the stem_percent holding the in_state_tuition value constant.
Question #3: Estimate model fit indices to compare the model from assignment 1 to the multiple regression fitted in question 2. Does the model improve model fit based on that from assignment 1? Why or why not?
In this context, the simple regression model and the multiple regression model are considered to be nested models. Therefore, there are a variety of statistics used to provide statistical evidence for competing models. Variance decomposition can be used to determine if the added predictor stem-percent helped to explain significant variation over and above the simpler model.
In this situation, another F statistic can be derived where:
epay_smpl_reg <- lm(early_career_pay ~ in_state_tuition, data = salary_tuition)
summary(epay_smpl_reg)
##
## Call:
## lm(formula = early_career_pay ~ in_state_tuition, data = salary_tuition)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16908 -4857 -975 3628 30618
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.525e+04 4.872e+02 92.88 <2e-16 ***
## in_state_tuition 2.274e-01 1.588e-02 14.32 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7056 on 735 degrees of freedom
## Multiple R-squared: 0.2181, Adjusted R-squared: 0.217
## F-statistic: 205 on 1 and 735 DF, p-value: < 2.2e-16
epay_mult_reg <- lm(early_career_pay ~ in_state_tuition + stem_percent, data = salary_tuition)
summary(epay_mult_reg)
##
## Call:
## lm(formula = early_career_pay ~ in_state_tuition + stem_percent,
## data = salary_tuition)
##
## Residuals:
## Min 1Q Median 3Q Max
## -18168 -3406 -57 2908 33095
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.146e+04 4.075e+02 101.75 <2e-16 ***
## in_state_tuition 1.660e-01 1.243e-02 13.35 <2e-16 ***
## stem_percent 3.082e+02 1.346e+01 22.89 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5393 on 734 degrees of freedom
## Multiple R-squared: 0.5438, Adjusted R-squared: 0.5426
## F-statistic: 437.5 on 2 and 734 DF, p-value: < 2.2e-16
First: Compared to the value of \(R^2=0.2181\) in the simple regression model which indicates that the in_state_tuition explains \(21.81\%\) of the variability in the early_career_pay attribute. However, in the multiple regression model the value of \(R^2=0.5438\) indicates that the attributes in_state_tuition and stem_percent explain \(54.38\%\) of the variability in the early_career_pay attribute, which shows a high improvement in the model. As known the higher value of \(R^2\) the better model is.
Second: Compared to the standard residual error of \(7,056\) in the simple regression model, the value of standard residual error in the multiple regression model is \(5,393\). Consequently, it proves that the multiple regression model is performing better than the simple regression model. As known the smaller value of \(SS_{res}\) the better model is.
anova(epay_smpl_reg, epay_mult_reg)
## Analysis of Variance Table
##
## Model 1: early_career_pay ~ in_state_tuition
## Model 2: early_career_pay ~ in_state_tuition + stem_percent
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 735 3.6592e+10
## 2 734 2.1349e+10 1 1.5243e+10 524.07 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Third: The variance decomposition can be used to determine if the added predictors helped to explain significant variation over and the simpler model. As shown in the analysis of variance test (ANOVA) the P-value for F-statistics shows that as significant value, so we could say that we should prefer the multiple regression model compared to simple regression model.
Question #4: Explore and evaluate the model assumptions regarding the residual/error term. Summarize how well the model is meeting the assumptions, citing specific statistics or visualizations to justify your conclusions. In addition, do the model assumptions seem better met compared to those from the regression in assignment 1?
First: we check the the normality of residuals through the following steps:
head(resid(epay_mult_reg))
## 1 2 3 4 5 6
## 1511.6170 389.8062 111.6432 760.1084 -5110.8531 -3613.4611
salary_tuition$residuals <- resid(epay_mult_reg)
head(salary_tuition)
## # A tibble: 6 x 14
## name state_name early_career_pay mid_career_pay make_world_bett~ stem_percent
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Aubu~ Alabama 54400 104500 51 31
## 2 Univ~ Alabama 57500 103900 59 45
## 3 Tusk~ Alabama 54500 93500 61 30
## 4 Samf~ Alabama 48400 90500 52 3
## 5 Spri~ Alabama 46600 89100 53 12
## 6 Birm~ Alabama 49100 88300 48 27
## # ... with 8 more variables: type <chr>, degree_length <chr>,
## # room_and_board <dbl>, in_state_tuition <dbl>, in_state_total <dbl>,
## # out_of_state_tuition <dbl>, out_of_state_total <dbl>, residuals <dbl>
gf_density( ~ residuals, data = salary_tuition) %>%
gf_labs(x = "Residuals")
ggplot(salary_tuition, aes(sample = residuals)) +
stat_qq(size = 5) +
stat_qq_line(size = 2)
Based on the two upper figures, we can claim that the normality assumption is met. As in density plot, it could be noticed that the curve is slightly positive/right skewed, not ideally as normal distribution, and for the QQ plot the plotted y values do not coincide with the fitted line as the plots on the two extremely terminal sides are alarming. However, due to the sample size (n=2,500) and CLT theorem we can rely on the estimates from the model. For this reasons, the results will not be impacted by violating this assumption.
Second: Assessing the residuals’ homogeneity of variance:
library(broom)
resid_diagnostics <- augment(epay_mult_reg)
head(resid_diagnostics)
## # A tibble: 6 x 9
## early_career_pay in_state_tuition stem_percent .fitted .resid .hat .sigma
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 54400 11276 31 52888. 1512. 0.00414 5397.
## 2 57500 10714 45 57110. 390. 0.00836 5397.
## 3 54500 22170 30 54388. 112. 0.00253 5397.
## 4 48400 31650 3 47640. 760. 0.00304 5397.
## 5 46600 39464 12 51711. -5111. 0.00270 5393.
## 6 49100 17650 27 52713. -3613. 0.00249 5395.
## # ... with 2 more variables: .cooksd <dbl>, .std.resid <dbl>
library(broom)
norm_residuals <- augment(epay_mult_reg)
gf_point(.resid ~ .fitted, data = resid_diagnostics, size = 5, alpha = .15) %>%
gf_smooth(method = 'loess', size = 2) %>%
gf_labs(x = 'Fitted Values',
y = 'Residuals')
gf_point(.std.resid ~ .fitted, data = norm_residuals, size = 4, alpha = .15) %>%
gf_hline(yintercept = ~ 2, color = 'red', size = 1) %>%
gf_hline(yintercept = ~ -2, color = 'red', size = 1) %>%
gf_smooth(method = 'loess', size = 2) %>%
gf_labs(x = 'Fitted Values',
y = 'Standardized Residuals')
Based on the figure above, approximately \(95\%\) of the residuals/standardized residuals fall within \(\pm2\) standard deviations from the mean residuals/standard residuals (ZERO). Plus, the residuals are nearly equally distributed around the loess line, as its shape is neither fanned in nor out. Moreover, the fitted line by residuals/standardized residuals seems to be flattened, which means that the residuals’ homogeneity of variance is met in this model. Finally, the multiple regression model seems to be better than the single regression model in meeting this assumption.
Third: Multicollinearity is an assumption of regression that explores whether the predictor attributes (X) are correlated. Linear regression assumes that the predictor attributes are uncorrelated, but in practice this assumption is never met. Multicollinearity can impact the the linear regression model. for example, in case of high positive correlation between two predictors, the two predictors could be combined to form a new predictor with a single metric.
cor(in_state_tuition ~ stem_percent, data = salary_tuition)
## [1] 0.2157148
As shown above there is a weak positive correlation between the two predictors in_state_tuition and stem_percent.
vif(epay_mult_reg)
## in_state_tuition stem_percent
## 1.048804 1.048804
The smallest possible value of VIF is one (absence of multicollinearity). As a rule of thumb, a VIF value that exceeds 5 or 10 indicates a problematic amount of collinearity (James et al. 2014). When faced to multicollinearity, the concerned variables should be removed, since the presence of multicollinearity implies that the information that this variable provides about the response is redundant in the presence of the other variables (James et al. 2014,P. Bruce and Bruce (2017)). Thus we keep both attributes in_state_tuition and stem_percent.
In general, the the multiple regression model met the assumptions better than the single linear regression model.
Question #5: Summarize the statistical evidence surrounding the null or alternative hypotheses that are being explored for the coefficients entered into the model. Note, I did not explicitly ask you for null/alternative hypotheses, but you may want to write those down for your reference. In a few sentences, describe if the evidence provides support for or against the null hypothesis. Provide specific statistical evidence to support your justification.
Briefly, we want to make inferences that there is a test statistic which aims to test if the model is explaining variation over and above a simple mean. More specifically, this omnibus hypothesis tests the following:
\[ H_{0}: All\ \beta = 0 \\ H_{A}: at\ least\ one\ \beta \neq 0 \]
This hypothesis can be formally tested with an F-statistic which is distributed as an F distribution with \(p\) predictors attributes and \(n - p - 1\) degrees of freedom.
summary(epay_mult_reg)
##
## Call:
## lm(formula = early_career_pay ~ in_state_tuition + stem_percent,
## data = salary_tuition)
##
## Residuals:
## Min 1Q Median 3Q Max
## -18168 -3406 -57 2908 33095
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.146e+04 4.075e+02 101.75 <2e-16 ***
## in_state_tuition 1.660e-01 1.243e-02 13.35 <2e-16 ***
## stem_percent 3.082e+02 1.346e+01 22.89 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5393 on 734 degrees of freedom
## Multiple R-squared: 0.5438, Adjusted R-squared: 0.5426
## F-statistic: 437.5 on 2 and 734 DF, p-value: < 2.2e-16
When testing the null hypothesis that there is no linear association between early_career_pay, in_state_tuition, and stem_percent, we reject the null hypothesis (\(F_{734}\) = 437.5, p-value < 2.2e-16). Additionally, it could be noticed that in_state_tuition, and stem_percent explain \(54.4\%\) of the variability in early_career_pay attribute.
Question #6: Create confidence intervals for the coefficients in the multiple regression model. Justify your confidence level and interpret the confidence intervals in the context of the data. What do these confidence intervals suggest about the magnitude of effect?
Confidence interval can be computed. Confidence intervals take the following general form:
\[ \hat{\beta} \pm C * SE \] By choosing a confidence interval \(95\%\), we assign cut score for the confidence intervals \(C=1.96\)
First: the confidence interval for the \(\beta_0\) as the following
4.146e+04 + c(-1, 1) * 1.96 * 4.075e+02
## [1] 40661.3 42258.7
Second: the confidence interval for the \(\beta_1\) as the following
1.660e-01 + c(-1, 1) * 1.96 * 1.243e-02
## [1] 0.1416372 0.1903628
Third: the confidence interval for the \(\beta_2\) as the following
3.082e+02 + c(-1, 1) * 1.96 * 1.346e+01
## [1] 281.8184 334.5816
In conclusion, a precise of confidence interval can give a very good estimate of the population parameter. i.e. there is a \(95\%\) chance of the fitted model over re-sampling, the value of \(\beta_0\) ranges between \(\$40,661\) and \(\$42,258\). Similarly, there is a \(95\%\) chance of the fitted model over re-sampling, the value of \(\beta_1\) ranges between \(\$142\) and \(\$190\) and there is a \(95\%\) chance of the fitted model over re-sampling, the value of \(\beta_2\) ranges between \(\$282\) and \(\$335\). Added to this all of these intervals do not contain zero, this could be another proof on rejecting the null hypothesis.
Question #7: Based on the justification in #5 and #6, what practical implications does this result have? That is, are the statistical results that you have been describing/summarizing throughout this assignment practically relevant? Be as specific as you can in your discussion about why you believe the results are practically useful or not.
Based of the implications found in 5th and 6th questions; it could be said that this model in overall doing a good job in predicting the value of early career pay, as the two predictors can explain around 54% of the variability in the early career pay. In this assignment I relied my choice for the predictors on their type, as I chose continuous variables only. Practically, we could have a better performance of this model if we added some categorical variables.