Education Attainment vs. Urban Education


Part 1 - Introduction

This project focuses on multiple linear regression. A multiple linear regression model contains at least three variables: one response and two or more predictors. The response variable must be quantitative, but the predictor variables can be quantitative or categorical.

First, I’ll investigate what the data set looks like and explore how we can visualize an uncomplicated multiple linear regression equation. Next, we will demonstrate how R can assist us with model building, though we will also build models manually throughout this guide. Then, we’ll dig deeper into the process of creating a multiple linear regression model and interpreting it. We will build a model using both quantitative and categorical variables, discussing how categorical variables enter a multiple linear regression model. Additionally, we will interpret the coefficients of our variables and determine whether each predictor has a significant relationship with the response. We will also discuss t-tests, F-tests and partial F-tests, and make predictions using our regression equation.

Part 2 - Data

The data set we will use for this project is CollegeDistance from the R package AER. This is a cross-sectional data set from a survey conducted by the Department of Education in 1980, with a follow-up in 1986, containing 14 variables describing the students surveyed, their families and the areas in which they live. Though the data set contains 14 different variables, our models will only use score, the achievement test score obtained during the student’s senior year of high school; urban, whether the student’s high school is located in an urban area; and education, the number of years of education attained 6 years after high school graduation. Of these chosen variables, score and education are quantitative and urban is categorical.

library(AER) # the CollegeDistance data set is included in this package
data(CollegeDistance)
head(CollegeDistance)
##   gender ethnicity score fcollege mcollege home urban unemp wage distance
## 1   male     other 39.15      yes       no  yes   yes   6.2 8.09      0.2
## 2 female     other 48.87       no       no  yes   yes   6.2 8.09      0.2
## 3   male     other 48.74       no       no  yes   yes   6.2 8.09      0.2
## 4   male      afam 40.40       no       no  yes   yes   6.2 8.09      0.2
## 5 female     other 40.48       no       no   no   yes   5.6 8.09      0.4
## 6   male     other 54.71       no       no  yes   yes   5.6 8.09      0.4
##   tuition education income region
## 1 0.88915        12   high  other
## 2 0.88915        12    low  other
## 3 0.88915        12    low  other
## 4 0.88915        12    low  other
## 5 0.88915        13    low  other
## 6 0.88915        12    low  other
names(CollegeDistance)
##  [1] "gender"    "ethnicity" "score"     "fcollege"  "mcollege" 
##  [6] "home"      "urban"     "unemp"     "wage"      "distance" 
## [11] "tuition"   "education" "income"    "region"

Part 3 - Exploratory data analysis

Cases

Cross-section data from the High School and Beyond survey conducted by the Department of Education in 1980, with a follow-up in 1986. The cases are the individual students surveyed: a data frame containing 4,739 observations on 14 variables.

nrow(CollegeDistance)
## [1] 4739
summary(CollegeDistance)
##     gender        ethnicity        score       fcollege   mcollege  
##  male  :2139   other   :3050   Min.   :28.95   no :3753   no :4088  
##  female:2600   afam    : 786   1st Qu.:43.92   yes: 986   yes: 651  
##                hispanic: 903   Median :51.19                        
##                                Mean   :50.89                        
##                                3rd Qu.:57.77                        
##                                Max.   :72.81                        
##   home      urban          unemp             wage           distance     
##  no : 852   no :3635   Min.   : 1.400   Min.   : 6.590   Min.   : 0.000  
##  yes:3887   yes:1104   1st Qu.: 5.900   1st Qu.: 8.850   1st Qu.: 0.400  
##                        Median : 7.100   Median : 9.680   Median : 1.000  
##                        Mean   : 7.597   Mean   : 9.501   Mean   : 1.803  
##                        3rd Qu.: 8.900   3rd Qu.:10.150   3rd Qu.: 2.500  
##                        Max.   :24.900   Max.   :12.960   Max.   :20.000  
##     tuition         education      income       region    
##  Min.   :0.2575   Min.   :12.00   low :3374   other:3796  
##  1st Qu.:0.4850   1st Qu.:12.00   high:1365   west : 943  
##  Median :0.8245   Median :13.00                           
##  Mean   :0.8146   Mean   :13.81                           
##  3rd Qu.:1.1270   3rd Qu.:16.00                           
##  Max.   :1.4042   Max.   :18.00
hist(CollegeDistance$score)

We can see from the histogram that score can be considered to be approximately normally distributed.

Checking Multicollinearity

One of the very first things we want to do before building any model is check for multicollinearity between our quantitative predictor variables. Multicollinearity is a problem because it inflates the standard errors of the coefficient estimates and makes the parameter estimates unstable.

So, to look at multicollinearity, let’s first look at the pairwise scatterplots for each of our quantitative predictor variables.

Pairwise Scatterplots

We want to look at pairwise scatterplots of our predictor variables so that we can determine visually if any pairs of predictors appear highly correlated. Specify the columns to use within your data frame like this:

plot(CollegeDistance[c(3, 10, 11)]) # columns for score, distance and tuition

Looking at our plot, it does not appear that any of our quantitative predictor variables are highly correlated, or have a strong linear relationship with one another.

Correlation Matrices

Just to be certain there is no strong correlation, let’s calculate the correlation matrix of the three variables using cor().

cor(CollegeDistance[c(3, 10, 11)])
##                score    distance    tuition
## score     1.00000000 -0.06797927  0.1298585
## distance -0.06797927  1.00000000 -0.1009806
## tuition   0.12985848 -0.10098058  1.0000000

We see that the pairwise correlations between our quantitative predictor variables are very low. We only consider multicollinearity to be a problem when the absolute value of the correlation is greater than .9, so we would say that there is no multicollinearity issue in this model.

Variance Inflation Factor

There is a third way to detect multicollinearity in a model. We consider the variance inflation factor (VIF) of a variable by first making a model using that variable as the response and all other predictor variables in the original model as the predictors. For example, to calculate the VIF of achievement test score we need to create this linear model:

score.mod <- lm(score ~ distance + tuition, data = CollegeDistance) # named score.mod to avoid masking the score column

Then, we need to plug the R-squared value of that model into our VIF equation, \(VIF = \frac{1}{1 - R^2}\)

1/(1 - summary(score.mod)$r.squared)
## [1] 1.020309

So the VIF of achievement test score for this model is around 1.02. This is great news since we only consider multicollinearity to be an issue when VIF>10. We could then continue checking VIF by following the same process for our distance and tuition variables.
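
Rather than repeating this process by hand, the vif() function from the car package (loaded automatically with AER) computes the VIF of every predictor in a fitted model at once. A minimal sketch, using an illustrative model (quant.mod is just a name chosen here) of education on the three quantitative predictors:

# vif() reports the variance inflation factor of each predictor in the model;
# the value for score should match our hand calculation above.
quant.mod <- lm(education ~ score + distance + tuition, data = CollegeDistance)
vif(quant.mod)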

Visualizing the Data

Now that we know there is no evidence of multicollinearity between our predictor variables, we can move on to visualizing some of the data. In many multiple linear regression situations, the variables we are using cannot be simultaneously represented two-dimensionally so exploring the data visually is far more difficult. Nevertheless, using an example, I will demonstrate that we can represent a fairly uncomplicated multiple linear regression situation two-dimensionally. For this, we will need a quantitative response variable and two predictor variables, one quantitative and one categorical.

Plotting

We can plot the number of years of education a student attained, the quantitative response, against the student’s achievement test score, the quantitative predictor, varying the shape of each plotted point by whether or not the student’s high school is located in an urban area, the categorical predictor. This setup satisfies our variable requirements.

plot(CollegeDistance$score, CollegeDistance$education, xlab = "Achievement Test Score", ylab = "Years of Education", pch = as.numeric(CollegeDistance$urban))

The circles represent students whose high school is in a non-urban area, and the triangles represent students whose high school is in an urban area (the urban factor’s levels are no and yes, which map to plotting characters 1 and 2). From this plot we can’t really tell whether there is any difference between the two groups because many of the data points overlap. We’ll try to fix this by adding noise to, or slightly increasing the variation of, the scatterplot so that we can see any patterns in the data more clearly.

Adding Noise to the Scatterplot

We can add more variation to the scatterplot by using the command jitter on our predictor and response variables, where factor is the argument specifying how much noise to add to the plot, and then using these altered variables as our new predictor and response like this:

newscore <- jitter(CollegeDistance$score, factor = 2)
newed <- jitter(CollegeDistance$education, factor = 2)
plot(newscore, newed, xlab = "Achievement Test Score", ylab = "Years of Education", pch = as.numeric(CollegeDistance$urban))

We can now more distinctly see two types of data points, circles and triangles. However, there doesn’t appear to be a distinct separation of the data points. This might lead us to believe that there is not a significant difference between the educational attainment of students at urban and non-urban schools, but we can confirm whether or not this is true using a test later on.

Part 4 - Inference

Multiple Linear Regression

Criteria for Comparing all Kinds of Models

There are many criteria we can use to determine whether one model of a data set is better than another model of that same data set. First, when comparing any two models we can use the Akaike Information Criterion (AIC) or Mallows’ Cp. When looking at AIC, we want to weigh the number of parameters in a model together with the AIC value. For example, if we have a model with an AIC value of X and 4 parameters, in order to prefer a model with more than 4 parameters, we would want the AIC value of that model to be at least 10 units less than X. Generally, lower AIC values correspond to better models. One important note on these comparison tools is that they have no units, so they can only compare models built from the same data set, not models from different data sets.
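
As a quick illustration, base R’s AIC() function reports the AIC of fitted models directly (the model names m1 and m2 below are purely illustrative):

# AIC() returns the AIC of each fitted model; the lower value is preferred.
m1 <- lm(education ~ score, data = CollegeDistance)
m2 <- lm(education ~ score + distance, data = CollegeDistance)
AIC(m1, m2)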

For Comparing Nested Models

If we are comparing nested models, i.e. one model’s predictors are a subset of the other model’s predictors, we have additional comparison options. For comparing nested models, we can use the p-values from t-tests or partial F-tests, only choosing the more complex model if the p-value of the t-test or F-test is below .05. We can also use adjusted R-squared as a comparison tool; however, I would discourage anyone from using adjusted R-squared as their comparison criteria because the other methods of comparison are more rigorous.
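
As a preview, a partial F-test for nested models can be run with anova(); a minimal sketch comparing models with and without urban (the names reduced and full are illustrative):

# anova() on two nested fits performs a partial F-test of the extra predictor.
reduced <- lm(education ~ score + tuition + distance, data = CollegeDistance)
full <- lm(education ~ score + tuition + distance + urban, data = CollegeDistance)
anova(reduced, full)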

Creating the model

Stepwise Selection

Now, on to the topic of actually building the model. First, we can discuss the idea of building a model in a stepwise fashion. There are two ways of creating models stepwise, either forward selection or backwards elimination. In forward selection, you start with the simplest desired model and add predictors to the model until your chosen criteria indicate that adding more predictors to the model would actually worsen the model’s abilities. In backwards elimination, you start with the most complex model acceptable and remove predictors from the model until your chosen criteria indicate that removing more predictors from the model would worsen the model’s abilities.

By their nature, these models will always be nested so any of the comparison methods previously mentioned can be used.

As an example, I’ll be illustrating the backwards elimination method to create a model that predicts educational attainment 6 years after graduation using all or a subset of the variables discussed earlier and AIC as my model comparison criteria. I will specify the simplest model as a model using only the intercept to model educational attainment 6 years after graduation.

library(MASS)
starting.model <- lm(education ~ score + urban + distance + tuition, data = CollegeDistance)
simple.model <- lm(education ~ 1, data = CollegeDistance)
stepAIC(starting.model, scope = list(upper = starting.model, lower = simple.model), direction = "backward")
## Start:  AIC=4339.19
## education ~ score + urban + distance + tuition
## 
##            Df Sum of Sq   RSS    AIC
## - urban     1       0.5 11815 4337.4
## <none>                  11815 4339.2
## - tuition   1      10.8 11826 4341.5
## - distance  1      53.3 11868 4358.5
## - score     1    3178.2 14993 5466.2
## 
## Step:  AIC=4337.39
## education ~ score + distance + tuition
## 
##            Df Sum of Sq   RSS    AIC
## <none>                  11815 4337.4
## - tuition   1        11 11826 4339.8
## - distance  1        62 11877 4360.2
## - score     1      3205 15020 5472.8
## 
## Call:
## lm(formula = education ~ score + distance + tuition, data = CollegeDistance)
## 
## Coefficients:
## (Intercept)        score     distance      tuition  
##     9.15680      0.09547     -0.05013     -0.14369

We can see from the output that our final model contains all of the chosen variables except whether or not a student’s high school is located in an urban area. One thing to note about this process: although the two models’ AIC values differ by less than 2 units, stepAIC selects the model with the lower AIC, which here is the model with fewer predictor variables, reflecting the balance between accuracy and complexity that AIC enforces.

We could easily use forward selection instead by starting from simple.model, the simplest model we would deem acceptable, and changing direction to “forward”. However, forward selection might produce a different model than backwards elimination because each process takes a different path through the candidate models.
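
For reference, a sketch of the equivalent forward selection call:

# Forward selection starts from the intercept-only model and adds predictors.
stepAIC(simple.model, scope = list(upper = starting.model, lower = simple.model), direction = "forward")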

Creating a New Model

We’ve looked at our data set, gone through a short example of how one might visualize a relatively simple multiple linear regression situation and discussed some sophisticated techniques of model building as well as how to compare models. Now, we can create a multiple linear regression model manually. For this model, I will use achievement test score, average tuition for a four-year college in the student’s state, distance from a four-year college and whether or not the student’s high school was located in an urban area to predict the number of years of education the student had attained six years after graduation.

schoolmod <- lm(education ~ score + tuition + distance + urban, data = CollegeDistance)

Checking Model Assumptions

At this point, we need to check the assumptions of our model to be sure that the model provides accurate results.

plot(schoolmod)
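
By default, plot() on an lm object draws the diagnostic plots one at a time; if you prefer to see all four at once, you can set up a 2-by-2 grid first:

# Arrange the four diagnostic plots in a 2 x 2 grid, then reset the layout.
par(mfrow = c(2, 2))
plot(schoolmod)
par(mfrow = c(1, 1))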

The first plot is the residual plot, a comparison of the residuals of our model against the fitted values produced by the model. It is the most important plot because it can reveal trends in the residuals, evidence of non-constant variance and possible outliers. The plot for this model indicates that our model systematically underpredicts the lower values of educational attainment and systematically overpredicts the higher values. Although the residuals are not evenly spread around 0 at every fitted value, the range of the residuals appears roughly the same across fitted values, so we can conclude there is no evidence of heteroskedasticity. Finally, this plot suggests there are likely no outliers because no points are well separated from the rest.

The next plot is the QQ-plot. Most of the points fall on the line, which indicates that our residuals come from a normal distribution, but some points stray from the line in the lower and upper quantiles of the plot. It is possible that these points do not come from a normal distribution, but since most of the points do, there is not much to worry about here.

The third plot created is the scale-location plot. This plot is similar to the residual plot, but uses the square root of the standardized residuals instead of the residuals themselves. This makes trends in residuals more evident and, from this plot, we can see that there is likely a U-shaped trend in our residuals.

Outliers

Although our data did not appear to have any outliers, it’s still important to know how to detect them and what to do when they appear. The fourth diagnostic plot, the leverage plot, helps with detection: it graphs the standardized residuals against their leverage and overlays the Cook’s distance boundaries. Any point outside those boundaries would be an influential outlier in the x direction. Since we cannot even see the boundaries on our plot, we can conclude that we have no such outliers. If an outlier does appear, there is unfortunately no easy universal answer: check first whether it is a data-entry error and, if it is a legitimate observation, report the model fit both with and without it rather than silently dropping it.
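
We can also inspect influence numerically. Cook’s distances are available directly from the fitted model; one common rule of thumb (an assumption here, not the only convention) flags points with distance greater than 4/n:

# Compute Cook's distance for every observation and flag any point
# exceeding the common 4/n threshold.
cd <- cooks.distance(schoolmod)
which(cd > 4 / nrow(CollegeDistance))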

Indicator Variables

summary(schoolmod)
## 
## Call:
## lm(formula = education ~ score + tuition + distance + urban, 
##     data = CollegeDistance)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -3.826 -1.181 -0.246  1.219  5.367 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  9.141015   0.148905  61.388  < 2e-16 ***
## score        0.095596   0.002679  35.686  < 2e-16 ***
## tuition     -0.142627   0.068517  -2.082   0.0374 *  
## distance    -0.048723   0.010539  -4.623 3.88e-06 ***
## urbanyes     0.025619   0.057090   0.449   0.6536    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.58 on 4734 degrees of freedom
## Multiple R-squared:  0.221,  Adjusted R-squared:  0.2203 
## F-statistic: 335.7 on 4 and 4734 DF,  p-value: < 2.2e-16

We should first take note of how our categorical variable urban appears in the summary of the model. In the model, this variable is treated as an indicator variable, meaning it can take one of two values, 0 or 1. In the summary, it is labeled urbanyes: the variable is given the value 1 if a student’s school is in an urban area and 0 if it is not. Since this variable has only 2 categories, yes and no, it needs only one indicator variable. A categorical variable with more than 2 categories needs more: to be exact, one fewer indicator variables than it has categories.
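
We can confirm the 0/1 coding that R assigns to the urban factor with contrasts():

# contrasts() shows the indicator coding R uses for a factor;
# the yes column confirms urbanyes = 1 for urban schools and 0 otherwise.
contrasts(CollegeDistance$urban)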

The Regression Equation

From our summary, we can also get the multiple linear regression equations:

\(\widehat{EducationalAttainment} = 9.141 + .0956 \cdot AchievementScore - .1426 \cdot Tuition - .0487 \cdot Distance + .0256 \cdot I(urban = yes)\)

Since we have an indicator variable, urban, in our equation, this equation can actually be broken down into two separate equations, one for if the student’s school is located in an urban area and another for if the student’s school is in a non-urban area. For students going to school in urban areas:

\(\widehat{EducationalAttainment} = (9.141 + .0256) + .0956 \cdot AchievementScore - .1426 \cdot Tuition - .0487 \cdot Distance\)

For students going to school in non-urban areas:

\(\widehat{EducationalAttainment} = 9.141 + .0956 \cdot AchievementScore - .1426 \cdot Tuition - .0487 \cdot Distance\)

Notice that the only change in the two equations is the intercept.

Interpreting Regression Coefficients

Returning again to our summary, we can interpret the coefficients for each of our variables much as we did for simple linear regression. For example, for every 1-point increase in achievement test score, the predicted years of education of a student increases by .0956 years when that student’s state’s average tuition for a 4-year college, distance from a 4-year college and location of school (urban or non-urban) all stay the same. The only change from the simple linear regression interpretation is that we must state that all of the other predictors in the model are held constant.

We can also interpret the coefficient of our indicator function in a similar way. For example, the predicted years of education of a student increases by .0256 when their high school is located in an urban location as opposed to a non-urban location when their achievement test score, average state tuition for a four-year college and distance from a four-year college all remain the same.

Finally, thinking about the intercept in context, it would not make sense to interpret it. Every student in this data set graduated from high school and therefore has at least 12 years of education, yet the intercept is less than 12: even a student who scored zero on the achievement test, had an average state tuition for a 4-year college of zero dollars, lived zero miles from a 4-year college and went to a non-urban high school would still have at least 12 years of education. We cannot use this model to determine anything about students who did not graduate.

Confidence Intervals for Regression Coefficients

We recognize that these regression coefficient values given by our summary are simply point estimates so we’ll need to create confidence intervals for them as well. This will generate 95% confidence intervals for all of the coefficients in our model.

confint(schoolmod)
##                   2.5 %       97.5 %
## (Intercept)  8.84909261  9.432937106
## score        0.09034461  0.100848185
## tuition     -0.27695296 -0.008301033
## distance    -0.06938312 -0.028062221
## urbanyes    -0.08630305  0.137541361

As an example, we can interpret the confidence intervals for achievement test score and the indicator variable for whether a student’s high school is located in an urban area. We interpret the confidence interval for achievement test score, a quantitative variable, like this: we are 95% confident that an increase by 1 point in achievement test score will increase the number of years of education a student achieves 6 years after high school graduation by between .09 and .1 years on average, all other variables held constant. We interpret the confidence interval for our indicator variable, a categorical variable, like this: we are 95% confident that a student going to high school in an urban area will achieve between -.086 and .138 years of education more than a student going to a high school in a non-urban area on average, all other variables held constant. Again, we cannot interpret our intercept because it does not make sense within the limits of our data set.
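
If a confidence level other than 95% is needed, confint() also accepts a level argument:

# 90% confidence intervals for the same regression coefficients.
confint(schoolmod, level = 0.90)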

T-Tests

From our summary, we can also determine the significance of each of our predictor variables through the results of a t-test. A t-test tests the null hypothesis that \(\beta = 0\) against the alternative hypothesis that \(\beta \neq 0\), where \(\beta\) is the regression coefficient. The only difference for multiple linear regression is that we have multiple \(\beta\)s, so a separate t-test is run for each regression coefficient. In our summary table, we see the t-statistic for each of our predictor variables, as well as the intercept, and a corresponding p-value for each t-statistic. The results tell us that all of the variables in our model are significant predictors of a student’s number of years of education except whether their high school is in an urban or non-urban area, since that is the only p-value greater than .05. This means we can remove the urban variable from the model because it has no significant relationship with the number of years of education a student achieves 6 years after high school graduation. This matches the conclusion drawn from our automated model-building procedure earlier.

Part 5 - Conclusions

Making Predictions

Next, we can make some predictions using our model. We can use our regression equation to predict the years of education a student would have attained if they had a test score of 52, the average college tuition for their state was $900, they lived five miles from a four-year college and they attended an urban high school. In this data set, tuition is recorded in thousands of dollars and distance in tens of miles, so these values enter the equation as .900 and .5.

u <- (9.141+.0256) + .0956*52 - .1426*.900 - .0487*.5
u
## [1] 13.98511

From our equation, we can see that this student would have likely pursued just shy of 2 years of education after graduating high school. Now, let’s compare that value with a similar student who only differed by where they went to high school. This student did not go to high school in an urban area.

nu <- 9.141 + .0956*52 - .1426*.900 - .0487*.5
nu
## [1] 13.95951

From our equation, it seems that this student would have also pursued just shy of 2 years of education after graduating high school. However, take note of the difference between the two predicted years of education.

abs(u-nu)
## [1] 0.0256

The difference between the two educational attainment levels is exactly the regression coefficient estimated for the urban indicator variable. Since this coefficient is small, the difference between the two students is also small. We know that a point estimate is usually not adequate when making a prediction from a regression model because it gives a very specific answer that is likely not exactly correct. Therefore, we might want to make a prediction interval. We can do this using the command predict(model, new.dataframe, interval), similar to how we made prediction intervals for simple linear regression models, with the interval argument specified as “predict”.

The only difference for the multiple linear regression models is that the new data frame must contain specified values for all of the variables in your regression model. So, our new data frame, corresponding to the example student that attended a high school in an urban area, would look like this:

newdata <- data.frame(score = 52, tuition = .9, distance = .5, urban = 'yes')
newdata
##   score tuition distance urban
## 1    52     0.9      0.5   yes

Now, we can use this data frame to create a 95% prediction interval for the student attending a high school in an urban area that we made a point prediction for previously.

predict(schoolmod, newdata, interval = "predict")

We can see that the point estimate given by this prediction interval is slightly different from the value we calculated by hand, which can be attributed to rounding in our hand calculation. We see that our 95% prediction interval is [10.886, 17.083]; thus, I am 95% confident that an individual student with an achievement score of 52, an average state tuition for four-year colleges of $900, living five miles from a four-year college and going to a high school in an urban area will have achieved a total of between 10.886 and 17.083 years of education 6 years after graduating from high school. In this case, we have a slightly odd interval because we know that any student graduating from high school will have attained at least 12 years of education.

Confidence Intervals

In addition to prediction intervals, we can also create confidence intervals for the mean years of education students with specific characteristics might attain 6 years after high school graduation. We can use a similar predict(model, new.dataframe, interval) command to do this, changing the interval argument from “predict” to “confidence”. We will use the same data frame that we created previously for the prediction interval.

predict(schoolmod, newdata, interval = "confidence")
##        fit      lwr      upr
## 1 13.98492 13.89019 14.07965

The last word

We get the same point prediction as we did for the prediction interval, which is to be expected. Also as we would expect, we see a smaller interval than we did for the prediction interval. From this interval we can say we are 95% confident that the mean number of years of education a student would attain 6 years after high school graduation, given that they had an achievement test score of 52, an average state tuition for four-year colleges of $900, live five miles from the nearest four-year college and went to a high school in an urban area, is between 13.89 and 14.08 years.

If we take the original research question:

“Is Education Attainment (years) related to if the school is in an urban area or not?”

The answer is no in the case of the urban variable: it showed no significant relationship with educational attainment and was discarded both by the manual t-test check and by the AIC-based stepwise selection, while the remaining variables were retained as relevant predictors.

References

Cross-section data from the High School and Beyond survey conducted by the Department of Education in 1980, with a follow-up in 1986. The survey included students from approximately 1,100 high schools. Source: online complements to Stock and Watson (2007).

Rouse, C.E. (1995). Democratization or Diversion? The Effect of Community Colleges on Educational Attainment. Journal of Business & Economic Statistics, 12, 217–224.

Stock, J.H. and Watson, M.W. (2007). Introduction to Econometrics, 2nd ed. Boston: Addison Wesley.