Using the regress command in Stata and the NELS data provided in class, estimate the effects of educational expectations (bys45), family income (byfaminc), and sex (sex) on 8th grade GPA (bygrads). Interpret both the unstandardized coefficients as well as the standardized coefficients (when it is appropriate to do so). Finally, create a figure showing how the expected value of GPA varies by educational expectations, holding family income and sex constant.
A one-unit increase in educational expectations is significantly associated with about a 0.236-point increase in 8th grade GPA, holding all other variables constant.
A one-unit increase in family income is significantly associated with about a 0.035-point increase in 8th grade GPA, holding all other variables constant.
Females have about a 0.101-point higher 8th grade GPA than males, holding all other variables constant. The result is significant at the p < .05 level.
A one-standard-deviation increase in educational expectations is significantly associated with about a 0.404-standard-deviation increase in 8th grade GPA, holding all other variables constant. The result is significant at the p < .05 level.
A one-standard-deviation increase in family income is significantly associated with about a 0.124-standard-deviation increase in 8th grade GPA, holding all other variables constant. The result is significant at the p < .05 level.
Females have about a 0.05-standard-deviation higher GPA than males, holding all other variables constant. The result is significant at the p < .05 level.
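The estimation chunk is hidden. A minimal sketch of how both sets of coefficients could be reproduced in R (the data-frame name nels is an assumption; the variable names follow the question):
# A sketch, not the original chunk; `nels` is an assumed name for the NELS extract
m1 <- lm(bygrads ~ bys45 + byfaminc + sex, data = nels)
summary(m1)  # unstandardized coefficients

# Standardized (beta) coefficients: b * sd(x) / sd(y), on the estimation sample
d <- m1$model
betas <- coef(m1)[-1] * sapply(d[-1], sd) / sd(d$bygrads)
betas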
There are a few points to discuss before making the margins plot.
First, the following two plots show the difference between treating bys45 as continuous (Graph 1) or ordinal (Graph 2). Because model2 is less restrictive than model1, the prediction from model2 is not a straight line.
Second, Nan and I also discussed how we should “average” the sex variable, which is nominal. Nan suggested summing the sex column and dividing by the number of rows, giving a number between 1 and 2. I think that since sex is strictly nominal, it is meaningless to apply any mathematical operation (mean, SD, etc.) to it. Therefore, my margins plot shows a separate model prediction for males and females; a sketch of the prediction grid behind it follows.
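A minimal sketch (reusing the assumed m1 and nels from above; the sex coding 1 = male, 2 = female is also an assumption):
# Hold byfaminc at its mean, vary bys45, and predict separately by sex
grid <- expand.grid(
  bys45    = sort(unique(nels$bys45)),
  byfaminc = mean(nels$byfaminc, na.rm = TRUE),
  sex      = c(1, 2)
)
grid$yhat <- predict(m1, newdata = grid)  # one line per sex in the plot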
Here is the plot holding all predictors at their means:
[Figure: predicted 8th grade GPA by educational expectations, with all other predictors held at their means]
And here are the plots if we take the means of family income and educational expectations but show the relationship for both sexes. Figure 1 shows the model prediction treating educational expectations as continuous, and Figure 2 presents the prediction treating them as a factor.
[Figures 1 and 2: predicted GPA by educational expectations, shown separately for males and females; Figure 1 treats bys45 as continuous, Figure 2 treats it as a factor]
First, estimate a model showing the effects of educational attainment (f4hhdg) on yearly income (f4hi99). Be sure to exclude the approximately 6% of respondents who reported no income in the past year. Is the effect of educational attainment on income non-linear? Should we be concerned with outliers on the income measure? Next, construct dummy variables for partnership status (f4gmrs) and add these variables to the regression equation. Does this set of dummy variables contribute significantly to the variance of the outcome variable? Interpret the coefficients in the model. What is the expected yearly income of a respondent who is divorced with a Bachelor’s degree?
Check the variables (code chunks and output are hidden):
Now, let’s fit a simple model to see the association between the two variables:
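The fitting chunk is hidden; it presumably looks something like this (a sketch; fit and df match the object names used in later chunks):
# A sketch of the hidden chunk; `df` is the data-frame name used below
library(stargazer)
fit <- lm(f4hi99 ~ f4hhdg, data = df, subset = f4hi99 != 0)  # drop zero incomes
stargazer(fit, type = 'text', title = "Table2: OLS Model Coefficients")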
##
## Table2: OLS Model Coefficients
## ===============================================
## Dependent variable:
## ---------------------------
## f4hi99
## -----------------------------------------------
## f4hhdg 2,002.185***
## (145.105)
##
## Constant 22,081.010***
## (436.708)
##
## -----------------------------------------------
## Observations 8,341
## R2 0.022
## Adjusted R2 0.022
## Residual Std. Error 19,121.430 (df = 8339)
## F Statistic 190.389*** (df = 1; 8339)
## ===============================================
## Note: *p<0.1; **p<0.05; ***p<0.01
We need to plot the two variables to see whether their relationship is linear.
The flat violin plot layered with the box plot suggests that the relationship is not linear, but the visual evidence alone is not conclusive.
One way to check our intuition is to run a local regression, which gives us a smooth curve describing the relationship between the two variables. We run the regression and plot the curve over our violin and box plot. The local regression suggests a curvilinear relationship between the two variables.
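The plotting chunk is hidden; it might look roughly like this (a sketch: geom_violin() stands in for the flat-violin helper the author likely used, and df is the assumed data-frame name):
# A sketch of the hidden plotting chunk
library(ggplot2)
ggplot(subset(df, f4hi99 != 0 & f4hi99 <= 200000),
       aes(x = as.numeric(f4hhdg), y = f4hi99)) +
  geom_violin(aes(group = f4hhdg)) +
  geom_boxplot(aes(group = f4hhdg), width = 0.2) +
  geom_smooth(method = "loess")  # the local regression curve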
[Figure: yearly income by educational attainment; violin and box plots overlaid with a loess curve]
First, we add a squared term to the model:
# Fit a square term
fit_square <- lm(f4hi99 ~ f4hhdg + I(as.numeric(f4hhdg^2)),
data = fit1$model, subset = f4hi99 != 0)
# Model Summary
summary(fit_square)
##
## Call:
## lm(formula = f4hi99 ~ f4hhdg + I(as.numeric(f4hhdg^2)), data = fit1$model,
## subset = f4hi99 != 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -34573 -10047 -2007 6364 469993
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 23459.7 880.5 26.645 <2e-16 ***
## f4hhdg 539.2 824.2 0.654 0.5130
## I(as.numeric(f4hhdg^2)) 274.4 152.2 1.803 0.0714 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 19120 on 8338 degrees of freedom
## Multiple R-squared: 0.0227, Adjusted R-squared: 0.02247
## F-statistic: 96.85 on 2 and 8338 DF, p-value: < 2.2e-16
# comparing models
anova(fit, fit_square)
## Analysis of Variance Table
##
## Model 1: f4hi99 ~ f4hhdg
## Model 2: f4hi99 ~ f4hhdg + I(as.numeric(f4hhdg^2))
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 8339 3.0490e+12
## 2 8338 3.0478e+12 1 1188618865 3.2518 0.07138 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The ANOVA test of the two models shows that the polynomial model with a quadratic term fits the data slightly better than the linear model. However, the F statistic is not significant at the .05 level, which indicates that we are safe sticking with the linear model.
In class, we also compared treating educational attainment as ordinal versus treating it as continuous. We use the BIC to compare the two specifications:
fit_c <- lm(f4hi99 ~ as.factor(f4hhdg), data = fit1$model, subset = f4hi99 != 0)
# Check out BIC statistics
BIC(fit) # for treating education as continuous
## [1] 188156.4
BIC(fit_c) # for treating education as ordinal
## [1] 188147.1
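As a sanity check, the BIC can also be computed by hand from the log-likelihood (a short sketch):
# BIC = -2*logLik + k*log(n); smaller is better
k <- attr(logLik(fit), "df")  # parameter count, including the error variance
n <- nobs(fit)
-2 * as.numeric(logLik(fit)) + k * log(n)  # should match BIC(fit)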
We found that the BIC for the model that treats education as ordinal (BIC = 188147) is smaller than for the model that treats it as continuous (BIC = 188156). Therefore, we treat educational attainment as an ordinal variable in the following analysis.
Notice that in the plot above, we excluded all income values over 200,000. To see whether we need to worry about outliers, let’s draw the same plot again, this time including the income values over 200,000.
The violin plot, the box plot, and especially the local regression estimate show that including these outliers reduces the curvilinear association we saw previously. The data points with annual incomes over 200,000 clearly exert strong influence on the non-parametric model.
[Figure: yearly income by educational attainment, including incomes over 200,000; violin and box plots overlaid with a loess curve]
However, the local regression model is non-linear; how the potential outliers affect the linear model might be a different story. First, let’s draw a Residuals vs Leverage plot for the OLS model we fitted previously:
plot(fit_c, which = 5)  # Residuals vs Leverage plot
The plot shows three highly influential data points in the linear model, identified by their row numbers: 699, 2462, and 12141. Let’s create a new data frame excluding them and fit the same OLS model again:
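The refit chunk is hidden; one way to do it (a sketch: the row labels come from the diagnostic plot, and the default row names of df are assumed to match them):
# Drop the three flagged rows and refit the same specification
df_trim <- df[!rownames(df) %in% c("699", "2462", "12141"), ]
fit_c_trim <- lm(f4hi99 ~ as.factor(f4hhdg), data = df_trim, subset = f4hi99 != 0)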
##
## Table2b: OLS Model Coefficients
## ===============================================
## Dependent variable:
## ---------------------------
## f4hi99
## -----------------------------------------------
## as.factor(f4hhdg)2 -977.868
## (741.756)
##
## as.factor(f4hhdg)3 934.939
## (755.043)
##
## as.factor(f4hhdg)4 6,481.834***
## (476.443)
##
## as.factor(f4hhdg)5 4,357.509***
## (1,066.233)
##
## as.factor(f4hhdg)6 6,647.254***
## (2,553.537)
##
## Constant 24,624.170***
## (338.915)
##
## -----------------------------------------------
## Observations 8,341
## R2 0.027
## Adjusted R2 0.027
## Residual Std. Error 18,939.870 (df = 8335)
## F Statistic 46.809*** (df = 5; 8335)
## ===============================================
## Note: *p<0.1; **p<0.05; ***p<0.01
From the table above, we can see that the coefficients do not change much compared to our original model. This is because our sample is so large that it dilutes the influence of these three outliers.
Why, then, do the potential outliers have more influence on the local regression? The local regression (the smooth curve drawn over our violin plots) combines locally weighted regression with a nearest-neighbors windowing scheme, which makes it a much closer fit (likely an overfit) to the sample data. The more closely a model fits the data, the more it absorbs the influence of every data point. This is why the outliers in our sample affect the local regression curve more than the linear model.
Now, let’s take a look at the marriage variable and break it down into indicator (dummy) variables.
We chose the level single, never married as the reference level, so we generate five indicator variables, each representing one marital status, with 1 indicating yes and 0 indicating no. Note that R and Stata can do this automatically if we recode f4gmrs as a factor variable. However, to demonstrate what happens under the hood, we do it manually this time:
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1.00 1.00 1.00 1.58 2.00 6.00 42
## [1] "marital status in 2000"
##
## Labels:
## value label
## -7 {not reached-partial/abbrev interview}
## -2 {refused}
## -1 {don^t know}
## 1 single, never married
## 2 married
## 3 divorced
## 4 separated
## 5 widowed
## 6 in marriage-like relationship
##
## 1 2 3 4 5 6 <NA>
## 6455 4796 563 171 5 112 42
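The recode chunk is hidden; based on the value labels above and the indicator names used in the model below, it presumably does something like this (a sketch):
# Single (f4gmrs == 1) is the reference level, so it gets no indicator
df$married       <- as.numeric(df$f4gmrs == 2)
df$divorced      <- as.numeric(df$f4gmrs == 3)
df$separated     <- as.numeric(df$f4gmrs == 4)
df$widowed       <- as.numeric(df$f4gmrs == 5)
df$marriage_like <- as.numeric(df$f4gmrs == 6)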
Finally, let’s add all the indicator variables to the restricted model to obtain the new unrestricted model:
# New Model
fit1 <- lm(f4hi99 ~ as.factor(f4hhdg) + married + divorced +
separated + widowed + marriage_like,
data = df, subset = f4hi99 != 0)
stargazer(fit1, type = 'text',
          title = "Table3: OLS Model Coefficients of Full Model",
          covariate.labels = c("Certificate/license", "Associate's degree",
                               "Bachelor's degree", "Master's degree/equivalent",
                               "Ph.D. or professional degree", "Married",
                               "Divorced", "Separated", "Widowed", "Marriage-like"))
##
## Table3: OLS Model Coefficients of Full Model
## =========================================================
## Dependent variable:
## ---------------------------
## f4hi99
## ---------------------------------------------------------
## Certificate/license -1,053.449
## (748.773)
##
## Associate's degree 909.253
## (760.637)
##
## Bachelor's degree 6,508.562***
## (485.366)
##
## Master's degree/equivalent 4,836.675***
## (1,074.691)
##
## Ph.D. or professional degree 9,635.544***
## (2,528.651)
##
## Married 1,190.712***
## (444.212)
##
## Divorced -961.496
## (1,123.807)
##
## Separated -3,142.071
## (2,223.104)
##
## Widowed -19,156.350
## (19,075.090)
##
## Marriage-like 4,333.772*
## (2,250.139)
##
## Constant 24,209.800***
## (398.190)
##
## ---------------------------------------------------------
## Observations 8,341
## R2 0.029
## Adjusted R2 0.028
## Residual Std. Error 19,061.780 (df = 8330)
## F Statistic 25.285*** (df = 10; 8330)
## =========================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
From the modeling results, we can see that the adjusted R-squared increases to 0.028 in the full model, from 0.022 in the simple linear model; not a sharp increase. We follow this with an F-test of the two nested models, shown below. The F statistic is about 3.15, significant at the p < .05 level (p ≈ .008). This means the set of dummy variables we added to the model does contribute significantly to explaining variance in the outcome variable.
anova(fit_c, fit1, test='F')
## Analysis of Variance Table
##
## Model 1: f4hi99 ~ as.factor(f4hhdg)
## Model 2: f4hi99 ~ as.factor(f4hhdg) + married + divorced + separated +
## widowed + marriage_like
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 8335 3.0324e+12
## 2 8330 3.0267e+12 5 5720329598 3.1486 0.007655 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The dummy variable married is significant at the p < .05 level, which indicates that, compared to single people, married people on average earn 1,190.71 more in yearly income, holding education constant.
Similarly, compared to single people, those in a marriage-like relationship earn 4,333.77 more annually, holding education constant. Note, however, that this coefficient carries only one star, so it is significant at the p < .10 level but not at p < .05.
People with a bachelor’s degree earn about 6,508.56 more annually than those with no degree, holding marital status constant. The result is significant at the p < .05 level.
People with a master’s degree earn about 4,836.68 more annually than those with no degree, holding marital status constant. The result is significant at the p < .05 level.
People with a Ph.D. or professional degree earn about 9,635.54 more annually than those with no degree, holding marital status constant. The result is significant at the p < .05 level.
To get the model’s prediction of the yearly income of a respondent who is divorced with a bachelor’s degree, we simply create a new data frame with the individual’s information and feed it into the full model.
From the Table3 coefficients, the expected yearly income is the constant plus the bachelor’s and divorced coefficients: 24,209.80 + 6,508.56 - 961.50 ≈ 29,756.87.
# Create a new dataset to make predictions on
newdata <- data.frame(f4hhdg = c(4), married= c(0), divorced = c(1),
separated = c(0),widowed = c(0), marriage_like = c(0))
# Make predictions on the new data using the full model (fit1);
# fit_c has no marital-status terms, so it would ignore `divorced`
predictions <- predict(fit1, newdata = newdata)
# Print the predictions for divorced with a Bachelor's degree
print(predictions)
## 1
## 29756.87
First, estimate a model showing the effects of number of children (f4gnch), partnership status (f4gmrs), and sex (sex) on employment status (f4aempl). What is the range of predicted probabilities in your model? What is the predicted probability of employment for the (a) “average” respondent (i.e., mean on all values); and (b) a married father with 2 children. Interpret the coefficients in terms of changes in the odds. For instance, how much do the odds of being employed decrease with each additional child?
Again, we look at the variables at hand (code not shown):
We notice that our dependent variable is dichotomous, coded 0 and 1 (in R, it is a factor with two levels). In this case, we opt for the logistic model, a member of the GLM family.
# Fitting
bifit <- glm(f4aempl ~ f4gnch + factor(f4gmrs) + factor(sex),
data = df, family = binomial())
# Table results
stargazer(bifit, type = 'text', title = "Table4: ML Logistic Model Coefficients",
          covariate.labels = c("Num of children", "Married", "Divorced",
                               "Separated", "Widowed", "Marriage-like", "Female"))
##
## Table4: ML Logistic Model Coefficients
## =============================================
## Dependent variable:
## ---------------------------
## f4aempl
## ---------------------------------------------
## Num of children -0.440***
## (0.027)
##
## Married 0.064
## (0.063)
##
## Divorced 0.325**
## (0.138)
##
## Separated 0.131
## (0.220)
##
## Widowed 0.337
## (1.162)
##
## Marriage-like -0.167
## (0.271)
##
## Female -0.696***
## (0.060)
##
## Constant 2.570***
## (0.056)
##
## ---------------------------------------------
## Observations 11,331
## Log Likelihood -4,255.418
## Akaike Inf. Crit. 8,526.836
## =============================================
## Note: *p<0.1; **p<0.05; ***p<0.01
Now, let’s do model predictions.
Range of all predicted probabilities:
predictions <- predict(bifit, type='response')
range(predictions)
## [1] 0.2302173 0.9476340
We can take the means of all predictors to get the model prediction for an “average” respondent; the result is about 0.88. (code hidden)
## 1
## 0.8767825
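One way the hidden chunk could compute this (a sketch): average every column of the fitted design matrix, dummies included, and apply the inverse logit to the resulting linear predictor.
# "Average respondent": mean of every design-matrix column (dummies included)
X <- model.matrix(bifit)
plogis(sum(coef(bifit) * colMeans(X)))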
Again, Nan and I discussed how to take the average of marital status (f4gmrs) and sex. I think the averages of these two nominal variables have no mathematical meaning, and hence cannot be used to generate a model prediction. Instead, I took the average of the number of children (f4gnch) only, and then show the model prediction at that average for every permutation of marital status and sex.
For a respondent who has an average number of children: (code hidden)
## single males single females married males
## 0.9081272 0.8313221 0.9133532
## married females divorced males divorced females
## 0.8401480 0.9319140 0.8721959
## separated males separated females widowed males
## 0.9184760 0.8488829 0.9326375
## widowed females marriage-like males marriage-like females
## 0.8734679 0.8931648 0.8065163
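The hidden chunk presumably builds a prediction grid like this (a sketch; the sex coding 1 = male, 2 = female is an assumption):
# One row per combination of marital status and sex, children held at the mean
grid <- expand.grid(
  f4gnch = mean(df$f4gnch, na.rm = TRUE),
  f4gmrs = 1:6,
  sex    = 1:2
)
grid$p <- predict(bifit, newdata = grid, type = "response")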
For a married father with 2 children (code hidden), the model prediction is about 0.85.
## married father 2 kids
## 0.852474
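This prediction can also be checked by hand from the Table4 coefficients (a short sketch):
# Married father with 2 children: father = male = reference sex, so no sex term
plogis(coef(bifit)["(Intercept)"] +
       coef(bifit)["factor(f4gmrs)2"] +   # married
       2 * coef(bifit)["f4gnch"])         # two children
# plogis(2.570 + 0.064 - 2 * 0.440) = plogis(1.754) ≈ 0.852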
First, let’s look at the coefficients in terms of multiplicative changes in the odds:
Each additional child is associated with a decrease in the odds of being employed by a factor of about 0.64, holding sex and marital status constant. The association is significant at the p < .05 level.
Being divorced (versus single) is associated with an increase in the odds of being employed by a factor of about 1.39, holding sex and the number of children constant. The association is significant at the p < .05 level.
The odds of being employed for females are lower than the odds for males by a factor of about 0.50, holding marital status and the number of children constant. The association is significant at the p < .05 level.
## Change in odds P < .05
## intercept 13.0687306 TRUE
## num of children 0.6439119 TRUE
## married 1.0664155 FALSE
## divorce 1.3847074 TRUE
## separate 1.1397848 FALSE
## widowed 1.4006676 FALSE
## marriage-like 0.8457802 FALSE
## females 0.4985990 TRUE
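The chunk that built this table is hidden, but the next chunk reuses its result as a111; it presumably looks something like this (a sketch):
# Change in odds = exp(coefficient); flag significance at the .05 level
a111 <- data.frame(
  exp(bifit$coefficients),
  summary(bifit)$coefficients[, "Pr(>|z|)"] < .05
)
colnames(a111) <- c("Change in odds", "P < .05")
rownames(a111) <- c('intercept', "num of children", 'married', 'divorce',
                    'separate', 'widowed', 'marriage-like', 'females')
a111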
And then in terms of percentage changes in the odds:
Each additional child is associated with about a 36% decrease in the odds of being employed, holding sex and marital status constant. The association is significant at the p < .05 level.
Being divorced (versus single) is associated with about a 38% increase in the odds of being employed, holding sex and the number of children constant. The association is significant at the p < .05 level.
The odds of being employed are about 50% lower for females than for males, holding marital status and the number of children constant. The association is significant at the p < .05 level.
## Coefficients in terms of percentage change in odds
a2 <- as.data.frame((exp(bifit$coefficients) - 1) * 100)
a2 <- cbind(a2, a111[, 2])  # reuse the p < .05 flags from a111 above
colnames(a2) <- c("Percentage change in odds", "P < .05")
rownames(a2) <- c('intercept', "num of children", 'married',
                  'divorce', 'separate', 'widowed',
                  'marriage-like', 'females')
a2
## Percentage change in odds P < .05
## intercept 1206.873057 TRUE
## num of children -35.608812 TRUE
## married 6.641545 FALSE
## divorce 38.470745 TRUE
## separate 13.978478 FALSE
## widowed 40.066759 FALSE
## marriage-like -15.421981 FALSE
## females -50.140095 TRUE