Question 1

Using the regress command in Stata and the NELS data provided in class, estimate the effects of educational expectations (bys45), family income (byfaminc), and sex (sex) on 8th grade GPA (bygrads). Interpret both the unstandardized coefficients as well as the standardized coefficients (when it is appropriate to do so). Finally, create a figure showing how the expected value of GPA varies by educational expectations, holding family income and sex constant.

Tabulate these variables first (Code and Output are hidden)



Looking into these variables, we find that we are regressing a continuous variable (bygrads) on a factor variable (sex) and two ordered, count-like variables (bys45 and byfaminc). An OLS linear model seems to be a good fit for the task at hand.

Nan and I discussed whether educational expectations (bys45) should be treated as a continuous variable (model1) or as a factor (model2). We check both specifications below; the choice barely changes the coefficients on family income and sex.

However, the two specifications impose different model assumptions. While model1 assumes the effect is linear across levels, model2 is less restrictive in this regard. The margins plots will demonstrate this difference more clearly. Personally, I prefer model1, because bys45 has a natural order and gradation of levels. Treating it as continuous keeps the model parsimonious (fewer variables, more degrees of freedom) and more interpretable (no need to interpret five dummies against a reference level).

The following section shows both models, but I only interpret model1, which treats bys45 as a continuous variable.

Model1: Treating educational expectation as continuous

# Regress GPA on educational expectations, family income, and sex
library(stargazer)   # for formatted regression tables

## Treating bys45 as continuous
model1 <- lm(bygrads ~ bys45 + byfaminc + factor(sex), data = df)

## Summary for model1
stargazer(model1, type = 'text', title = "Table1: OLS Model Coefficients", 
          covariate.labels = c("edu expectation", "family income", "female"))
## 
## Table1: OLS Model Coefficients
## ===============================================
##                         Dependent variable:    
##                     ---------------------------
##                               bygrads          
## -----------------------------------------------
## edu expectation              0.236***          
##                               (0.005)          
##                                                
## family income                0.035***          
##                               (0.003)          
##                                                
## female                       0.101***          
##                               (0.013)          
##                                                
## Constant                     1.496***          
##                               (0.031)          
##                                                
## -----------------------------------------------
## Observations                  10,200           
## R2                             0.217           
## Adjusted R2                    0.217           
## Residual Std. Error     0.647 (df = 10196)     
## F Statistic         940.779*** (df = 3; 10196) 
## ===============================================
## Note:               *p<0.1; **p<0.05; ***p<0.01


Fully Standardized Coefficients for education and family income:

## For model1
library(lm.beta)   # for standardized (beta) coefficients
lm.beta(model1)
## 
## Call:
## lm(formula = bygrads ~ bys45 + byfaminc + factor(sex), data = df)
## 
## Standardized Coefficients::
##  (Intercept)        bys45     byfaminc factor(sex)2 
##           NA    0.4044418    0.1243482    0.0690323

Partial standardized coefficient for gender (standardizing only the outcome):

## Standardize only the outcome, then pull the coefficient for the female dummy
model1ps <- lm(scale(bygrads) ~ bys45 + byfaminc + factor(sex), data = df)
summary(model1ps)$coefficients["factor(sex)2", 1]
## [1] 0.139


Model2: Treating educational expectation as factor

## Treating bys45 as ordinal
model2 <- lm(bygrads ~ factor(bys45) + byfaminc + factor(sex), data = df)

## Summary for model2
stargazer(model2, type = 'text', title = "Table1b: OLS Model Coefficients", 
          covariate.labels = c("will finish h.s", "voc,trd,bus aftr h.s", "will attend college", "will finish college", "higher sch aftr coll", "family income", "female"))
## 
## Table1b: OLS Model Coefficients
## ================================================
##                          Dependent variable:    
##                      ---------------------------
##                                bygrads          
## ------------------------------------------------
## will finish h.s               0.244***          
##                                (0.064)          
##                                                 
## voc,trd,bus aftr h.s          0.522***          
##                                (0.064)          
##                                                 
## will attend college           0.569***          
##                                (0.063)          
##                                                 
## will finish college           0.937***          
##                                (0.061)          
##                                                 
## higher sch aftr coll          1.178***          
##                                (0.062)          
##                                                 
## family income                 0.034***          
##                                (0.003)          
##                                                 
## female                        0.101***          
##                                (0.013)          
##                                                 
## Constant                      1.756***          
##                                (0.064)          
##                                                 
## ------------------------------------------------
## Observations                   10,200           
## R2                              0.221           
## Adjusted R2                     0.221           
## Residual Std. Error      0.645 (df = 10192)     
## F Statistic          413.927*** (df = 7; 10192) 
## ================================================
## Note:                *p<0.1; **p<0.05; ***p<0.01


Beta Coefficients:

## For model2
lm.beta(model2)
## 
## Call:
## lm(formula = bygrads ~ factor(bys45) + byfaminc + factor(sex), 
##     data = df)
## 
## Standardized Coefficients::
##    (Intercept) factor(bys45)2 factor(bys45)3 factor(bys45)4 factor(bys45)5 
##             NA     0.09511896     0.20068507     0.25999917     0.63526342 
## factor(bys45)6       byfaminc   factor(sex)2 
##     0.69844678     0.12112095     0.06911794


Interpretation of the model coefficients

  • A one-unit increase in educational expectations is significantly associated with an increase of about 0.236 points in 8th grade GPA, holding all other variables constant.

  • A one-unit increase in family income is significantly associated with an increase of about 0.035 points in 8th grade GPA, holding all other variables constant.

  • Females have about 0.101 points higher 8th grade GPA than males, holding all other variables constant. The result is significant at the p < .05 level.


Interpretation of the model beta coefficients

  • A one standard deviation increase in educational expectations is significantly associated with about a 0.404 standard deviation increase in 8th grade GPA, holding all other variables constant. The result is significant at the p < .05 level.

  • A one standard deviation increase in family income is significantly associated with about a 0.124 standard deviation increase in 8th grade GPA, holding all other variables constant. The result is significant at the p < .05 level.

  • Females have about a 0.14 standard deviation higher GPA than males, holding all other variables constant. The result is significant at the p < .05 level.



Margins Plot

A few points I want to discuss before making the margins plot.

First, the following two plots show the difference between treating bys45 as continuous (Graph1) or ordinal (Graph2). Because model2 is less restrictive than model1, its predictions need not fall on a straight line.

Second, Nan and I also discussed how we should "average" the sex variable, which is nominal. Nan suggested summing the sex column and dividing by the number of rows, giving a number between 1 and 2. I think that since sex is strictly nominal, it is meaningless to apply mathematical operations (mean, sd, etc.) to it. Therefore, my margins plot shows a separate model prediction for males and for females.

Here is the plot taking means of all predictors:

(Figure: predicted GPA by educational expectations, with all predictors held at their means; plotting code hidden.)


And here are the plots where we hold family income at its mean and show the relationship between educational expectations and predicted GPA separately for males and females. Graph1 shows the model prediction when educational expectations are treated as continuous, and Graph2 presents the prediction when they are treated as a factor.

(Graph1 and Graph2: predicted GPA by educational expectations for males and females, holding family income at its mean; plotting code hidden.)



Question 2

First, estimate a model showing the effects of educational attainment (f4hhdg) on yearly income (f4hi99). Be sure to exclude the approximately 6% of respondents who reported no income in the past year. Is the effect of educational attainment on income non-linear? Should we be concerned with outliers on the income measure? Next, construct dummy variables for partnership status (f4gmrs) and add these variables to the regression equation. Does this set of dummy variables contribute significantly to the variance of the outcome variable? Interpret the coefficients in the model. What is the expected yearly income of a respondent who is divorced with a Bachelor’s degree?


Check the variables (code chunks and output are hidden)


Now, let’s fit a simplistic model to see the association between the two:

## 
## Table2: OLS Model Coefficients
## ===============================================
##                         Dependent variable:    
##                     ---------------------------
##                               f4hi99           
## -----------------------------------------------
## f4hhdg                     2,002.185***        
##                              (145.105)         
##                                                
## Constant                   22,081.010***       
##                              (436.708)         
##                                                
## -----------------------------------------------
## Observations                   8,341           
## R2                             0.022           
## Adjusted R2                    0.022           
## Residual Std. Error   19,121.430 (df = 8339)   
## F Statistic          190.389*** (df = 1; 8339) 
## ===============================================
## Note:               *p<0.1; **p<0.05; ***p<0.01


Check Linearity: Method 1 Plotting

We need to plot the two variables against each other to see whether the relationship is linear.

The flat violin plot layered with the box plot seems to suggest that the relationship is not linear. However, we are not sure.

One way to check our intuition statistically is to run a local regression, which gives a smooth curve describing the relationship between the two variables. We run the regression and then plot the curve over our violin and box plot. The local regression suggests a curvilinear relationship between the two variables.

(Figure: flat violin and box plots of yearly income by educational attainment, with the local regression curve overlaid; plotting code hidden.)
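The plotting code is hidden; a rough sketch with ggplot2 (the original layers a flat violin plot; a plain violin is used here for simplicity):

library(ggplot2)

# Drop zero incomes and, as in the plot above, incomes over 200,000
plot_df <- subset(df, f4hi99 != 0 & f4hi99 <= 200000)

ggplot(plot_df, aes(x = as.numeric(f4hhdg), y = f4hi99)) +
  geom_violin(aes(group = f4hhdg), alpha = 0.4) +
  geom_boxplot(aes(group = f4hhdg), width = 0.15, outlier.shape = NA) +
  geom_smooth(method = 'loess') +   # the local regression curve
  labs(x = 'educational attainment (f4hhdg)', y = 'yearly income (f4hi99)')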


Checking Linearity: Method 2 Fitting a Polynomial Model

First, we fit a model with a squared term:

# Fit a squared term
fit_square <- lm(f4hi99 ~ f4hhdg + I(as.numeric(f4hhdg^2)), 
                 data = fit$model, subset = f4hi99 != 0)
# Model Summary
summary(fit_square)
## 
## Call:
## lm(formula = f4hi99 ~ f4hhdg + I(as.numeric(f4hhdg^2)), data = fit$model, 
##     subset = f4hi99 != 0)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -34573 -10047  -2007   6364 469993 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              23459.7      880.5  26.645   <2e-16 ***
## f4hhdg                     539.2      824.2   0.654   0.5130    
## I(as.numeric(f4hhdg^2))    274.4      152.2   1.803   0.0714 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 19120 on 8338 degrees of freedom
## Multiple R-squared:  0.0227, Adjusted R-squared:  0.02247 
## F-statistic: 96.85 on 2 and 8338 DF,  p-value: < 2.2e-16
# comparing models
anova(fit, fit_square)
## Analysis of Variance Table
## 
## Model 1: f4hi99 ~ f4hhdg
## Model 2: f4hi99 ~ f4hhdg + I(as.numeric(f4hhdg^2))
##   Res.Df        RSS Df  Sum of Sq      F  Pr(>F)  
## 1   8339 3.0490e+12                               
## 2   8338 3.0478e+12  1 1188618865 3.2518 0.07138 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The ANOVA test comparing the two models shows that the polynomial model with a quadratic term fits slightly better than the linear model. However, the F statistic is not significant at the .05 level, which indicates we are safe sticking with the linear model.

In class, we also compared treating educational attainment as ordinal versus treating it as continuous. We use the BIC to compare the two specifications:

# Treat education as a factor instead of continuous
fit_c <- lm(f4hi99 ~ as.factor(f4hhdg), data = fit$model, subset = f4hi99 != 0) 

# Check out BIC statistics
BIC(fit) # for treating education as continuous
## [1] 188156.4
BIC(fit_c) # for treating education as ordinal
## [1] 188147.1

We find that the BIC for the model treating education as ordinal (188,147) is smaller than that for the model treating it as continuous (188,156). Therefore, we treat educational attainment as an ordinal variable in the following analysis.


Outliers

Notice that in the plot above we excluded all income values above 200,000. To see whether we need to worry about outliers, let's draw the same plot again, this time including the income values over 200,000.

The violin plot, the box plot, and especially the local regression estimate show that including these outliers reduces the curvilinear association we saw previously. The data points with annual incomes over 200,000 do have a strong influence (on the non-parametric model).

(Figure: violin and box plots of yearly income by educational attainment, now including incomes above 200,000, with the local regression curve overlaid; plotting code hidden.)


However, the local regression model is non-linear; how potential outliers affect the linear model might be a different story. First, let's draw a Residuals vs Leverage plot using the OLS model we fitted previously:

plot(fit_c, which = 5) # Residuals vs Leverage plot


We can see that in the linear model there are three highly influential data points, identified in the Residuals vs Leverage plot by their row numbers: 699, 2462, and 12141. Let's create a new data frame excluding them and fit the same OLS model again:
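The exclusion code is hidden; a sketch, assuming the numbers flagged in the plot are row names of df:

# Exclude the three influential observations by their row names
df_no_out <- df[!(rownames(df) %in% c('699', '2462', '12141')), ]

# Refit the same OLS model on the reduced data
fit_c2 <- lm(f4hi99 ~ as.factor(f4hhdg), data = df_no_out, subset = f4hi99 != 0)
stargazer(fit_c2, type = 'text', title = 'Table2: OLS Model Coefficients')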

## 
## Table2: OLS Model Coefficients
## ===============================================
##                         Dependent variable:    
##                     ---------------------------
##                               f4hi99           
## -----------------------------------------------
## as.factor(f4hhdg)2           -977.868          
##                              (741.756)         
##                                                
## as.factor(f4hhdg)3            934.939          
##                              (755.043)         
##                                                
## as.factor(f4hhdg)4         6,481.834***        
##                              (476.443)         
##                                                
## as.factor(f4hhdg)5         4,357.509***        
##                             (1,066.233)        
##                                                
## as.factor(f4hhdg)6         6,647.254***        
##                             (2,553.537)        
##                                                
## Constant                   24,624.170***       
##                              (338.915)         
##                                                
## -----------------------------------------------
## Observations                   8,341           
## R2                             0.027           
## Adjusted R2                    0.027           
## Residual Std. Error   18,939.870 (df = 8335)   
## F Statistic          46.809*** (df = 5; 8335)  
## ===============================================
## Note:               *p<0.1; **p<0.05; ***p<0.01


From the table above, we can see that the coefficients do not change much compared to our original model. This is because we have such a large sample size, which dilutes the influence of these three outliers.

But why do the potential outliers have more influence on the local regression? The local regression (the smooth curve we drew over our two violin plots) fits a weighted regression within a neighborhood of each point, in the spirit of the k-nearest-neighbors algorithm, which makes it a much closer fit (likely an overfit) to the sample data. The more closely a model fits the data, the more it absorbs the influence of every single data point. This is why the outliers in our sample affect the local regression curve more than the linear model.
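To see this concretely, one can vary the loess span (the share of the data used in each local fit); the smaller the span, the more closely the curve chases individual points, outliers included. A sketch:

# Smaller span = more local fitting, hence more influence for single points
inc_df <- subset(df, f4hi99 != 0)   # keep the high incomes this time
lo_smooth <- loess(f4hi99 ~ as.numeric(f4hhdg), data = inc_df, span = 0.9)
lo_wiggly <- loess(f4hi99 ~ as.numeric(f4hhdg), data = inc_df, span = 0.2)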


Now, let’s take a look at the marriage variable, and break it down into factor variables.

We chose the level single as the reference, so we generate five indicator variables, each representing one partnership status, with 1 indicating yes and 0 indicating no.

Do note that R and Stata can do this automatically if we recode f4gmrs as a factor variable. However, to demonstrate what happens under the hood, we do it manually this time (a sketch of the recoding follows the tabulation below):

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    1.00    1.00    1.00    1.58    2.00    6.00      42
## [1] "marital status in 2000"
## 
## Labels:
##  value                                  label
##     -7 {not reached-partial/abbrev interview}
##     -2                              {refused}
##     -1                           {don^t know}
##      1                  single, never married
##      2                                married
##      3                               divorced
##      4                              separated
##      5                                widowed
##      6          in marriage-like relationship
## 
##    1    2    3    4    5    6 <NA> 
## 6455 4796  563  171    5  112   42
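A minimal sketch of the manual dummy coding, assuming the value labels above (the negative codes become missing; single, the reference level, gets no indicator):

# Recode the negative codes to NA, then build one indicator per status
valid            <- ifelse(df$f4gmrs %in% 1:6, as.numeric(df$f4gmrs), NA)
df$married       <- as.numeric(valid == 2)
df$divorced      <- as.numeric(valid == 3)
df$separated     <- as.numeric(valid == 4)
df$widowed       <- as.numeric(valid == 5)
df$marriage_like <- as.numeric(valid == 6)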


Finally, let’s put all indicators variables in our restricted model and have the new unrestricted model:

# New model: education plus the partnership-status dummies
fit1 <- lm(f4hi99 ~ as.factor(f4hhdg) + married + divorced + 
            separated + widowed + marriage_like, 
           data = df, subset = f4hi99 != 0)

stargazer(fit1, type = 'text', title = "Table3: OLS Model Coefficients of Full Model",
          covariate.labels = c("certificate/license", "associate's degree",
                               "bachelor's degree", "master's degree/equivalent",
                               "ph.d or a professional degree", "Married",
                               "Divorced", "Separated", "Widowed", "Marriagelike"))
## 
## Table3: OLS Model Coefficients of Full Model
## =========================================================
##                                   Dependent variable:    
##                               ---------------------------
##                                         f4hi99           
## ---------------------------------------------------------
## certificate/license                   -1,053.449         
##                                        (748.773)         
##                                                          
## associate's degree                      909.253          
##                                        (760.637)         
##                                                          
## bachelor's degree                    6,508.562***        
##                                        (485.366)         
##                                                          
## master's degree/equivalent           4,836.675***        
##                                       (1,074.691)        
##                                                          
## ph.d or a professional degree        9,635.544***        
##                                       (2,528.651)        
##                                                          
## Married                              1,190.712***        
##                                        (444.212)         
##                                                          
## Divorced                               -961.496          
##                                       (1,123.807)        
##                                                          
## Separated                             -3,142.071         
##                                       (2,223.104)        
##                                                          
## Widowed                               -19,156.350        
##                                      (19,075.090)        
##                                                          
## Marriagelike                          4,333.772*         
##                                       (2,250.139)        
##                                                          
## Constant                             24,209.800***       
##                                        (398.190)         
##                                                          
## ---------------------------------------------------------
## Observations                             8,341           
## R2                                       0.029           
## Adjusted R2                              0.028           
## Residual Std. Error             19,061.780 (df = 8330)   
## F Statistic                    25.285*** (df = 10; 8330) 
## =========================================================
## Note:                         *p<0.1; **p<0.05; ***p<0.01


From the modelling results, we can see that the adjusted R-squared increases from 0.022 in the restricted model to 0.028 in the full model: not a sharp increase. We follow this with an F-test of the two nested models, as shown below. The F statistic is about 3.15, significant at the p < .05 level (p ≈ .008). This means the set of dummy variables we added does contribute significantly to explaining the variance of the outcome variable.

anova(fit_c, fit1, test='F')
## Analysis of Variance Table
## 
## Model 1: f4hi99 ~ as.factor(f4hhdg)
## Model 2: f4hi99 ~ as.factor(f4hhdg) + married + divorced + separated + 
##     widowed + marriage_like
##   Res.Df        RSS Df  Sum of Sq      F   Pr(>F)   
## 1   8335 3.0324e+12                                 
## 2   8330 3.0267e+12  5 5720329598 3.1486 0.007655 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
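As a sanity check, the F statistic can be reproduced by hand from the ANOVA table: the drop in RSS divided by its 5 degrees of freedom, over the full model's residual mean square.

# Hand check of the F statistic from the ANOVA table above
(5720329598 / 5) / (3.0267e12 / 8330)   # ~ 3.15, matching the table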


Interpreting the coefficients

  1. The dummy variable married is significant at the p < .05 level, which indicates that, compared to single people, married people on average have about 1,190.71 more in yearly income, holding education constant.

  2. Compared to single people, those who are in a marriage-like relationship earn about 4,333.77 more annually, holding education constant. This result, however, is significant only at the p < .1 level (note the single star in Table3).

  3. People who have a bachelor's degree earn about 6,508.56 more annually than those with no degree, holding partnership status constant. The result is significant at the p < .05 level.

  4. People who have a master's degree earn about 4,836.68 more annually than those with no degree, holding partnership status constant. The result is significant at the p < .05 level.

  5. People who have a Ph.D. or a professional degree earn about 9,635.54 more annually than those with no degree, holding partnership status constant. The result is significant at the p < .05 level.


To get the model's prediction of the yearly income of a respondent who is divorced and holds a bachelor's degree, we simply create a new data frame with the individual's information and feed it to the full model.

Based on the full model, the expected yearly income is about 29,756.87: the intercept plus the bachelor's degree and divorced coefficients (24,209.80 + 6,508.56 - 961.50).

# Create a new data frame describing the respondent
newdata <- data.frame(f4hhdg = 4, married = 0, divorced = 1,
                      separated = 0, widowed = 0, marriage_like = 0)

# Make the prediction with the full model (fit1), which includes the
# partnership-status dummies
predictions <- predict(fit1, newdata = newdata)

# Print the prediction for a divorced respondent with a bachelor's degree
print(predictions)
##        1 
## 29756.87



Question 3

First, estimate a model showing the effects of number of children (f4gnch), partnership status (f4gmrs), and sex (sex) on employment status (f4aempl). What is the range of predicted probabilities in your model? What is the predicted probability of employment for the (a) “average” respondent (i.e., mean on all values); and (b) a married father with 2 children. Interpret the coefficients in terms of changes in the odds. For instance, how much do the odds of being employed decrease with each additional child?


Again, we look into the variables at hand (code not shown):


We notice that our dependent variable is dichotomous, coded 0 and 1 (in R, a factor with two levels). In this case, we opt for the logistic model, a member of the GLM family.

# Fitting
bifit <- glm(f4aempl ~ f4gnch + factor(f4gmrs) + factor(sex),
             data = df, family = binomial())

# Table results

stargazer(bifit, type = 'text', title = "Table4: ML Logistic Model Coefficients", 
          covariate.labels = c("Num of children", "Married", "Divorced", "Separated", "Widowed", "Marriagelike", "Female"))
## 
## Table4: ML Logistic Model Coefficients
## =============================================
##                       Dependent variable:    
##                   ---------------------------
##                             f4aempl          
## ---------------------------------------------
## Num of children            -0.440***         
##                             (0.027)          
##                                              
## Married                      0.064           
##                             (0.063)          
##                                              
## Divorced                    0.325**          
##                             (0.138)          
##                                              
## Separated                    0.131           
##                             (0.220)          
##                                              
## Widowed                      0.337           
##                             (1.162)          
##                                              
## Marriagelike                -0.167           
##                             (0.271)          
##                                              
## Female                     -0.696***         
##                             (0.060)          
##                                              
## Constant                   2.570***          
##                             (0.056)          
##                                              
## ---------------------------------------------
## Observations                11,331           
## Log Likelihood            -4,255.418         
## Akaike Inf. Crit.          8,526.836         
## =============================================
## Note:             *p<0.1; **p<0.05; ***p<0.01


Now, let’s do model predictions.

Range of all predicted probabilities:

predictions <- predict(bifit, type='response')

range(predictions)
## [1] 0.2302173 0.9476340


We can take the means of all the variables to see the model prediction for an "average" respondent: the result is about 0.88 (code hidden; a sketch follows the output below).

##         1 
## 0.8767825
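One way such an "average respondent" prediction can be computed is at the means of the model-matrix columns, so that each dummy is set to its sample proportion (a sketch; the hidden chunk may differ):

# Prediction at the means of all design-matrix columns; plogis() is the
# inverse logit
xbar <- colMeans(model.matrix(bifit))
plogis(sum(xbar * coef(bifit)))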


Again, Nan and I discussed how to take the average of marital status (f4gmrs) and sex. I think the average of these two nominal variables has no mathematical meaning, and hence should not be used to generate the model prediction. Instead, I took the average of the number of children (f4gnch) only, and then show the model predictions at that average for every permutation of marital status and sex.

For a respondent with the average number of children: (code hidden)

##          single males        single females         married males 
##             0.9081272             0.8313221             0.9133532 
##       married females        divorced males      divorced females 
##             0.8401480             0.9319140             0.8721959 
##       separated males     separated females         widowed males 
##             0.9184760             0.8488829             0.9326375 
##       widowed females   marriage-like males marriage-like females 
##             0.8734679             0.8931648             0.8065163
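A sketch of how these twelve predictions could be generated (the sex coding 1 = male, 2 = female is an assumption; the marital-status codes follow the codebook shown in Question 2):

# Mean number of children crossed with every marital status and sex
grid <- expand.grid(
  f4gnch = mean(as.numeric(df$f4gnch), na.rm = TRUE),
  f4gmrs = 1:6,
  sex    = c(1, 2)
)
probs <- predict(bifit, newdata = grid, type = 'response')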


For a married father with 2 children (code hidden), the model prediction is about 0.85.

## married father 2 kids 
##              0.852474
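A sketch for this prediction (again assuming sex = 1 means male):

# Married male respondent with two children
newdad <- data.frame(f4gnch = 2, f4gmrs = 2, sex = 1)
predict(bifit, newdata = newdad, type = 'response')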


Interpreting the Coefficients


First, let’s see the coefficients in terms of change in odds (factor)


  • Each additional child is associated with a decrease in the odds of being employed by a factor of about 0.64, holding sex and marital status constant. The association is significant at the p < .05 level.

  • Being divorced (relative to being single) is associated with an increase in the odds of being employed by a factor of about 1.39, holding sex and number of children constant. The association is significant at the p < .05 level.

  • The odds of females being employed are about half (a factor of about 0.50 of) the odds of males being employed, holding number of children and marital status constant. The association is significant at the p < .05 level.


##                 Change in odds P < .05
## intercept           13.0687306    TRUE
## num of children      0.6439119    TRUE
## married              1.0664155   FALSE
## divorce              1.3847074    TRUE
## separate             1.1397848   FALSE
## widowed              1.4006676   FALSE
## marriage-like        0.8457802   FALSE
## females              0.4985990    TRUE
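The table above comes from a hidden chunk; it could be built roughly as follows. The object a111 matches the name reused in the code below, but its construction here is an assumption.

# Odds ratios with a significance flag at the .05 level
a111 <- data.frame(
  'Change in odds' = exp(coef(bifit)),
  'P < .05'        = summary(bifit)$coefficients[, 4] < .05,
  check.names = FALSE
)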


and then in terms of the percentage change in odds:


  • Each additional child is associated with a decrease in the odds of being employed of about 36%, holding sex and marital status constant. The association is significant at the p < .05 level.

  • Being divorced (relative to being single) is associated with an increase in the odds of being employed of about 38%, holding sex and number of children constant. The association is significant at the p < .05 level.

  • The odds of females being employed are about 50% lower than the odds of males being employed, holding number of children and marital status constant. The association is significant at the p < .05 level.

## Coefficients in terms of percentage change in odds

a2 <- as.data.frame((exp(bifit$coefficients) - 1) * 100)

# a111 (built above) supplies the significance flags in its second column
a2 <- cbind(a2, a111[, 2])

colnames(a2) <- c("Percentage change in odds", "P < .05")

rownames(a2) <- c('intercept', 'num of children', 'married',
                  'divorce', 'separate', 'widowed', 
                  'marriage-like', 'females')
a2
##                 Percentage change in odds P < .05
## intercept                     1206.873057    TRUE
## num of children                -35.608812    TRUE
## married                          6.641545   FALSE
## divorce                         38.470745    TRUE
## separate                        13.978478   FALSE
## widowed                         40.066759   FALSE
## marriage-like                  -15.421981   FALSE
## females                        -50.140095    TRUE