First: the following lines show the comments on assignment #1 and the changes made in response before starting the second assignment.

Comment #1 In other words, you are trying to explain variation in tuition based on early career pay, right? I would recommend trying to articulate the research questions a bit more directly; that can help readers.

New Guiding Question:

\[ How\ is\ the\ tuition\ for\ in-state\ residents\ in\ USD\ related\ to\ estimated\ early\ career\ pay\ in\ USD? \]

Comment #2 Also, for your scatterplots, I believe you have the x and y axes flipped in your description. For the formula notation, the y-axis is to the left of the ~.

# Load packages (some are loaded here for use later in the document)
library(tidyverse)    # data import and manipulation
library(ggformula)    # gf_* plotting interface
library(ggplot2)
library(lubridate)
library(mosaic)       # formula interface for cor()
library(bayesrules)
library(e1071)
library(car)          # vif() for multicollinearity checks
library(AICcmodavg)
theme_set(theme_bw(base_size = 18))
# Read the tuition and salary data
salary_tuition <- readr::read_csv("https://raw.githubusercontent.com/lebebr01/psqf_6243/main/data/salary_tuition.csv")
head(salary_tuition)
## # A tibble: 6 x 13
##   name  state_name early_career_pay mid_career_pay make_world_bett~ stem_percent
##   <chr> <chr>                 <dbl>          <dbl>            <dbl>        <dbl>
## 1 Aubu~ Alabama               54400         104500               51           31
## 2 Univ~ Alabama               57500         103900               59           45
## 3 Tusk~ Alabama               54500          93500               61           30
## 4 Samf~ Alabama               48400          90500               52            3
## 5 Spri~ Alabama               46600          89100               53           12
## 6 Birm~ Alabama               49100          88300               48           27
## # ... with 7 more variables: type <chr>, degree_length <chr>,
## #   room_and_board <dbl>, in_state_tuition <dbl>, in_state_total <dbl>,
## #   out_of_state_tuition <dbl>, out_of_state_total <dbl>
gf_point(early_career_pay ~ in_state_tuition, data = salary_tuition, size = 4, alpha = .3) %>%
  gf_labs(x = "Tuition for in-state residents in USD",
          y = "Estimated early career pay in USD")

Comment #3 Your linear regression run in R does not match your model formula. See the comment above regarding the scatterplot, but the outcome should be to the left of the ~.

\[ early\_career\_pay = \beta_{0} + \beta_{1}\ in\_state\_tuition + \epsilon \]


Second: Assignment #2 Solution

Question #1: Based on your first assignment, are there attributes that could have been missing from the linear regression fitted in assignment 1? That is, is the assumption of all attributes being included in the model appropriate? Why or why not? If not, what other attributes may be of interest?

Based on the first assignment, the linear model had a low \(R^2 = 0.218\), which is concerning because the single predictor explains only a small share of the variability in the outcome attribute. It would therefore be beneficial to add predictors that improve the model's predictions of the outcome. Considering every possible subset of attributes is not practical, however: with 13 attributes there are \(2^{13} = 8192\) candidate models. We therefore have to choose a smaller set of models to consider, using one of the classical approaches: forward, backward, or mixed (stepwise) selection. Attributes such as stem_percent, room_and_board, and out_of_state_tuition may be worth adding to the model; a forward-selection sketch follows.
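Below is a minimal sketch of how forward selection could be run with base R's step(); the candidate predictors in scope are illustrative picks from this data set (and assume no missing values among them), not a definitive choice:

# Forward selection sketch: start from the assignment-1 model and let
# step() add candidate predictors one at a time based on AIC.
base_mod <- lm(early_career_pay ~ in_state_tuition, data = salary_tuition)
forward_mod <- step(base_mod,
                    scope = ~ in_state_tuition + stem_percent +
                      room_and_board + out_of_state_tuition,
                    direction = "forward")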


Question #2: Add one or more of the attributes identified in question 1 to create a multiple regression model. That is, add one or more of the attributes from question 1 while keeping the attribute from assignment 1 into the regression model. Summarize the interpretation of the regression coefficient estimates for this multiple regression model.

The linear model fitted in assignment #1 was as follows:

\[ \hat{early\_career\_pay} = 45,250 + 0.227\ in\_state\_tuition \]

Before adding more attributes, two points have to be considered: how strongly each candidate predictor relates to the outcome, and how strongly the predictors relate to each other. The pairwise correlations address both:

library(GGally)

ggpairs(salary_tuition[c('early_career_pay', 'in_state_tuition', 'stem_percent')])

This shows moderate, positive relationships between each of the two predictors (in_state_tuition and stem_percent) and the outcome early_career_pay, and only a weak correlation between the two predictors themselves.

Now we can fit the multiple regression model:

epay_mult_reg <- lm(early_career_pay ~ in_state_tuition + stem_percent, data = salary_tuition)

summary(epay_mult_reg)
## 
## Call:
## lm(formula = early_career_pay ~ in_state_tuition + stem_percent, 
##     data = salary_tuition)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -18168  -3406    -57   2908  33095 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      4.146e+04  4.075e+02  101.75   <2e-16 ***
## in_state_tuition 1.660e-01  1.243e-02   13.35   <2e-16 ***
## stem_percent     3.082e+02  1.346e+01   22.89   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5393 on 734 degrees of freedom
## Multiple R-squared:  0.5438, Adjusted R-squared:  0.5426 
## F-statistic: 437.5 on 2 and 734 DF,  p-value: < 2.2e-16

According to the above summary, the multiple regression model would be:

\[ \hat{early\_career\_pay} = \beta_0 + \beta_1\ in\_state\_tuition + \beta_2\ stem\_percent \\ \hat{early\_career\_pay} = 41,460 + 0.166\ in\_state\_tuition + 308.2\ stem\_percent \]

Interpreting the coefficients: the intercept of \(\$41,460\) is the predicted early career pay for an institution with \(\$0\) in-state tuition and \(0\%\) STEM degrees (an extrapolation beyond the data). Holding stem_percent constant, each additional dollar of in-state tuition is associated with a \(\$0.166\) increase in predicted early career pay. Holding in_state_tuition constant, each additional percentage point of STEM degrees is associated with a \(\$308.2\) increase in predicted early career pay.
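For example, using the rounded estimates, a hypothetical institution charging \(\$20,000\) in in-state tuition with \(30\%\) STEM degrees would have a predicted early career pay of roughly \(41,460 + 0.166 \times 20,000 + 308.2 \times 30 \approx \$54,026\); predict() returns the exact fitted value:

# Predicted early career pay for a hypothetical institution
# (values chosen for illustration only)
predict(epay_mult_reg,
        newdata = data.frame(in_state_tuition = 20000, stem_percent = 30))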


Question #3: Estimate model fit indices to compare the model from assignment 1 to the multiple regression fitted in question 2. Does the model improve model fit based on that from assignment 1? Why or why not?

In this context, the simple regression model and the multiple regression model are nested models, so a variety of statistics can provide evidence for one model over the other. Variance decomposition can be used to determine whether the added predictor stem_percent explains significant variation over and above the simpler model.

In this situation, another F statistic can be derived to compare the two nested models.
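Written in extra-sum-of-squares form, and plugging in the residual sums of squares from the ANOVA table shown further below:

\[ F = \frac{(SS_{res}^{reduced} - SS_{res}^{full})\ /\ (df_{reduced} - df_{full})}{SS_{res}^{full}\ /\ df_{full}} = \frac{(3.6592 \times 10^{10} - 2.1349 \times 10^{10})\ /\ 1}{2.1349 \times 10^{10}\ /\ 734} \approx 524.07 \]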

epay_smpl_reg <- lm(early_career_pay ~ in_state_tuition, data = salary_tuition)
summary(epay_smpl_reg)
## 
## Call:
## lm(formula = early_career_pay ~ in_state_tuition, data = salary_tuition)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -16908  -4857   -975   3628  30618 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      4.525e+04  4.872e+02   92.88   <2e-16 ***
## in_state_tuition 2.274e-01  1.588e-02   14.32   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7056 on 735 degrees of freedom
## Multiple R-squared:  0.2181, Adjusted R-squared:  0.217 
## F-statistic:   205 on 1 and 735 DF,  p-value: < 2.2e-16
epay_mult_reg <- lm(early_career_pay ~ in_state_tuition + stem_percent, data = salary_tuition)

summary(epay_mult_reg)
## 
## Call:
## lm(formula = early_career_pay ~ in_state_tuition + stem_percent, 
##     data = salary_tuition)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -18168  -3406    -57   2908  33095 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      4.146e+04  4.075e+02  101.75   <2e-16 ***
## in_state_tuition 1.660e-01  1.243e-02   13.35   <2e-16 ***
## stem_percent     3.082e+02  1.346e+01   22.89   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5393 on 734 degrees of freedom
## Multiple R-squared:  0.5438, Adjusted R-squared:  0.5426 
## F-statistic: 437.5 on 2 and 734 DF,  p-value: < 2.2e-16

First: in the simple regression model, \(R^2 = 0.2181\) indicates that in_state_tuition explains \(21.81\%\) of the variability in early_career_pay. In the multiple regression model, \(R^2 = 0.5438\) indicates that in_state_tuition and stem_percent together explain \(54.38\%\) of that variability, a substantial improvement. All else being equal, the higher the \(R^2\), the better the model.
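As a quick arithmetic check, both \(R^2\) values tie back to the residual sums of squares in the ANOVA table below. Using \(SS_{total} = SS_{res}^{simple} / (1 - 0.2181) \approx 4.680 \times 10^{10}\):

\[ R^2_{multiple} = 1 - \frac{SS_{res}^{multiple}}{SS_{total}} = 1 - \frac{2.1349 \times 10^{10}}{4.680 \times 10^{10}} \approx 0.5438 \]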

Second: the residual standard error drops from \(7,056\) in the simple regression model to \(5,393\) in the multiple regression model, further evidence that the multiple regression model performs better. All else being equal, the smaller the residual variation (\(SS_{res}\)), the better the model.

anova(epay_smpl_reg, epay_mult_reg)
## Analysis of Variance Table
## 
## Model 1: early_career_pay ~ in_state_tuition
## Model 2: early_career_pay ~ in_state_tuition + stem_percent
##   Res.Df        RSS Df  Sum of Sq      F    Pr(>F)    
## 1    735 3.6592e+10                                   
## 2    734 2.1349e+10  1 1.5243e+10 524.07 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Third: the variance decomposition determines whether the added predictor explains significant variation over and above the simpler model. The analysis of variance (ANOVA) table shows a significant F statistic (\(F(1, 734) = 524.07\), \(p < 2.2 \times 10^{-16}\)), so we should prefer the multiple regression model over the simple regression model.


Question #4: Explore and evaluate the model assumptions regarding the residual/error term. Summarize how well the model is meeting the assumptions, citing specific statistics or visualizations to justify your conclusions. In addition, do the model assumptions seem better met compared to those from the regression in assignment 1?

First: we check the normality of the residuals through the following steps:

head(resid(epay_mult_reg))
##          1          2          3          4          5          6 
##  1511.6170   389.8062   111.6432   760.1084 -5110.8531 -3613.4611
salary_tuition$residuals <- resid(epay_mult_reg)
head(salary_tuition)
## # A tibble: 6 x 14
##   name  state_name early_career_pay mid_career_pay make_world_bett~ stem_percent
##   <chr> <chr>                 <dbl>          <dbl>            <dbl>        <dbl>
## 1 Aubu~ Alabama               54400         104500               51           31
## 2 Univ~ Alabama               57500         103900               59           45
## 3 Tusk~ Alabama               54500          93500               61           30
## 4 Samf~ Alabama               48400          90500               52            3
## 5 Spri~ Alabama               46600          89100               53           12
## 6 Birm~ Alabama               49100          88300               48           27
## # ... with 8 more variables: type <chr>, degree_length <chr>,
## #   room_and_board <dbl>, in_state_tuition <dbl>, in_state_total <dbl>,
## #   out_of_state_tuition <dbl>, out_of_state_total <dbl>, residuals <dbl>
gf_density( ~ residuals, data = salary_tuition) %>%
  gf_labs(x = "Residuals")

ggplot(salary_tuition, aes(sample = residuals)) + 
  stat_qq(size = 5) + 
  stat_qq_line(size = 2)

Based on the two figures above, the normality assumption is only approximately met: the density plot is slightly right-skewed rather than ideally normal, and in the QQ plot the points at both extremes depart from the fitted line. However, given the sample size (\(n = 737\)) and the central limit theorem, we can still rely on the estimates from the model; the results should not be materially impacted by this mild violation.
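As a supplementary check (not part of the original assignment output), a Shapiro-Wilk test can quantify the departure from normality, though with \(n = 737\) it flags even trivial deviations as significant, so the plots above remain the primary evidence:

# Shapiro-Wilk normality test on the model residuals; very sensitive
# at this sample size, so interpret alongside the density and QQ plots.
shapiro.test(resid(epay_mult_reg))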

Second: Assessing the residuals’ homogeneity of variance:

library(broom)

resid_diagnostics <- augment(epay_mult_reg)
head(resid_diagnostics)
## # A tibble: 6 x 9
##   early_career_pay in_state_tuition stem_percent .fitted .resid    .hat .sigma
##              <dbl>            <dbl>        <dbl>   <dbl>  <dbl>   <dbl>  <dbl>
## 1            54400            11276           31  52888.  1512. 0.00414  5397.
## 2            57500            10714           45  57110.   390. 0.00836  5397.
## 3            54500            22170           30  54388.   112. 0.00253  5397.
## 4            48400            31650            3  47640.   760. 0.00304  5397.
## 5            46600            39464           12  51711. -5111. 0.00270  5393.
## 6            49100            17650           27  52713. -3613. 0.00249  5395.
## # ... with 2 more variables: .cooksd <dbl>, .std.resid <dbl>
gf_point(.resid ~ .fitted, data = resid_diagnostics, size = 5, alpha = .15) %>%
  gf_smooth(method = 'loess', size = 2) %>%
  gf_labs(x = 'Fitted Values',
          y = 'Residuals')

gf_point(.std.resid ~ .fitted, data = resid_diagnostics, size = 4, alpha = .15) %>%
  gf_hline(yintercept = ~ 2, color = 'red', size = 1) %>%
  gf_hline(yintercept = ~ -2, color = 'red', size = 1) %>%
  gf_smooth(method = 'loess', size = 2) %>%
  gf_labs(x = 'Fitted Values',
          y = 'Standardized Residuals')

Based on the figures above, approximately \(95\%\) of the standardized residuals fall within \(\pm 2\) standard deviations of zero. The residuals are also spread roughly evenly around the loess line, with no fanning in or out, and the loess line itself is nearly flat. This indicates that the residuals' homogeneity of variance is met in this model. Finally, the multiple regression model appears to meet this assumption better than the simple regression model.
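As a supplementary numeric check (not part of the original assignment output), the car package loaded above provides ncvTest(), a score test for non-constant error variance; a non-significant result would corroborate the flat loess line:

# Score test for non-constant error variance (Breusch-Pagan type);
# a large p-value is consistent with homogeneous residual variance.
ncvTest(epay_mult_reg)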

Third: multicollinearity concerns whether the predictor attributes (X) are correlated with one another. Linear regression assumes the predictors are uncorrelated, though in practice this assumption is never perfectly met. Strong multicollinearity can distort the linear regression model; for example, in the case of a high positive correlation between two predictors, the two could be combined into a single new predictor.

cor(in_state_tuition ~ stem_percent, data = salary_tuition)
## [1] 0.2157148

As shown above, there is a weak positive correlation (\(r = 0.216\)) between the two predictors in_state_tuition and stem_percent.

vif(epay_mult_reg)
## in_state_tuition     stem_percent 
##         1.048804         1.048804

The smallest possible value of VIF is one (complete absence of multicollinearity). As a rule of thumb, a VIF value that exceeds 5 or 10 indicates a problematic amount of collinearity (James et al. 2014). When faced with multicollinearity, the affected variables are usually removed, since multicollinearity implies that the information a variable provides about the response is redundant given the other variables (James et al. 2014; P. Bruce and Bruce 2017). With VIF values near 1 here, we keep both attributes in_state_tuition and stem_percent.
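Since there are only two predictors here, \(R_j^2\) is simply the squared pairwise correlation computed above, so the reported VIF can be verified by hand:

\[ VIF_j = \frac{1}{1 - R_j^2} = \frac{1}{1 - 0.2157^2} \approx 1.0488 \]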

In general, the multiple regression model meets the assumptions better than the simple linear regression model.


Question #5: Summarize the statistical evidence surrounding the null or alternative hypotheses that are being explored for the coefficients entered into the model. Note, I did not explicitly ask you for null/alternative hypotheses, but you may want to write those down for your reference. In a few sentences, describe if the evidence provides support for or against the null hypothesis. Provide specific statistical evidence to support your justification.

Briefly, we want to test whether the model explains variation over and above a simple mean. More specifically, this omnibus hypothesis tests the following:

\[ H_{0}: \beta_1 = \beta_2 = 0 \\ H_{A}: \text{at least one } \beta_j \neq 0 \]

This hypothesis can be formally tested with an F statistic, which follows an F distribution with \(p\) and \(n - p - 1\) degrees of freedom, where \(p\) is the number of predictor attributes.

summary(epay_mult_reg)
## 
## Call:
## lm(formula = early_career_pay ~ in_state_tuition + stem_percent, 
##     data = salary_tuition)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -18168  -3406    -57   2908  33095 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      4.146e+04  4.075e+02  101.75   <2e-16 ***
## in_state_tuition 1.660e-01  1.243e-02   13.35   <2e-16 ***
## stem_percent     3.082e+02  1.346e+01   22.89   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5393 on 734 degrees of freedom
## Multiple R-squared:  0.5438, Adjusted R-squared:  0.5426 
## F-statistic: 437.5 on 2 and 734 DF,  p-value: < 2.2e-16

When testing the null hypothesis that there is no linear association between early_career_pay and the predictors in_state_tuition and stem_percent, we reject the null hypothesis (\(F(2, 734) = 437.5\), \(p < 2.2 \times 10^{-16}\)). Additionally, in_state_tuition and stem_percent together explain \(54.4\%\) of the variability in the early_career_pay attribute.

Question #6: Create confidence intervals for the coefficients in the multiple regression model. Justify your confidence level and interpret the confidence intervals in the context of the data. What do these confidence intervals suggest about the magnitude of effect?

Confidence intervals take the following general form:

\[ \hat{\beta} \pm C \times SE \]

Choosing a \(95\%\) confidence level gives a critical value of \(C = 1.96\) (with 734 degrees of freedom, the t critical value is essentially the same as the normal one).

First: the confidence interval for \(\beta_0\) (the intercept) is computed as follows:

4.146e+04  + c(-1, 1) * 1.96 * 4.075e+02
## [1] 40661.3 42258.7

Second: the confidence interval for \(\beta_1\) (in_state_tuition) is computed as follows:

1.660e-01 + c(-1, 1) * 1.96 * 1.243e-02
## [1] 0.1416372 0.1903628

Third: the confidence interval for \(\beta_2\) (stem_percent) is computed as follows:

3.082e+02  + c(-1, 1) * 1.96 * 1.346e+01
## [1] 281.8184 334.5816
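As a cross-check on the hand computations above, R's confint() returns the exact t-based intervals directly (these differ only slightly from the normal-approximation intervals):

# Exact t-based 95% confidence intervals for all coefficients
confint(epay_mult_reg, level = 0.95)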

In conclusion, these confidence intervals give precise estimates of the population parameters. Over repeated sampling, \(95\%\) of intervals constructed this way would contain the true parameter: for \(\beta_0\), the interval runs from \(\$40,661\) to \(\$42,259\); for \(\beta_1\), from \(\$0.14\) to \(\$0.19\) per additional dollar of tuition (about \(\$142\) to \(\$190\) per additional \(\$1,000\)); and for \(\beta_2\), from \(\$282\) to \(\$335\) per additional percentage point of STEM degrees. Moreover, none of these intervals contains zero, which is further evidence against the null hypothesis.

Question #7: Based on the justification in #5 and #6, what practical implications does this result have? That is, are the statistical results that you have been describing/summarizing throughout this assignment practically relevant? Be as specific as you can in your discussion about why you believe the results are practically useful or not.

Based on the implications found in questions 5 and 6, this model is overall doing a good job of predicting early career pay, as the two predictors explain around 54% of its variability. In this assignment I based my choice of predictors on their type, selecting continuous attributes only. Practically, the model's performance could likely be improved by also adding categorical attributes, as sketched below.
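As a sketch of that suggestion (an illustrative extension rather than part of the assignment; output not shown), categorical attributes already in the data such as type or degree_length can be entered directly, since lm() expands them into indicator variables automatically:

# Hypothetical extension: add the categorical attribute `type`
# (e.g., public vs. private); lm() creates dummy codes automatically.
epay_ext_reg <- lm(early_career_pay ~ in_state_tuition + stem_percent + type,
                   data = salary_tuition)
summary(epay_ext_reg)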