Research Question for Homework #1 – ELS:2002 Dataset

I am interested in the effects of immigration on students' expectations of success. Much research suggests that immigrant adolescents, with the exception of undocumented immigrants, generally expect to achieve a high level of academic success after high school. It is probably impossible to estimate the proportion of undocumented students in the ELS dataset, which makes the actual educational expectations among migrant students in this dataset difficult to predict.

My hypothesis is that the proportion of undocumented students in the ELS dataset is likely very low, and consequently migrant students will have higher mean educational expectations than non-migrant students.

To test this, we will need to construct a number of binary variables, and understand some serious caveats which go along with these new variables.

I propose to create the following binary variables:

-Educational Expectations of the Student (Dependent)
-Immigration Proxy (Independent)
-Parent’s Educational Level (Independent)
-Parent’s Expectations of the Student (Independent)

Because parental expectations and experiences likely have a strong effect upon the student’s expectations, I expect that both the parent’s educational level (college graduate or above) and parent’s expectations (typically quite high) of the student should have a positive effect upon the student’s educational self-expectation independent of the student’s immigration status.

Dependent Variable: Educational Expectations

First, I created a variable called “college expectations”, which will be “1” if the student expects to graduate college and “0” if they do not. Missing data is “NA”.

els<-els %>% mutate(college_expectations=case_when(.$bystexp %in% c(5:7) ~ 1,
                                                   .$bystexp %in% c(1:4,-1) ~ 0,
                                                   .$bystexp %in% c(-8,-4) ~ NA_real_))

Independent Variable # 1 - Immigration Proxy

Then I created a new “immigration proxy” variable. This variable equals “1” if the student was born in either Puerto Rico or outside the US AND does not speak English at home. If the student was born in the US OR speaks English at home, it equals “0”. Missing values or skipped questions are “NA”.

Obviously this is an imperfect proxy since students from Puerto Rico are citizens of the United States. We are assuming, for at least the sake of this homework, that someone born outside the continental United States and who speaks a non-English language at home is likely an immigrant.

els<-els %>% mutate(immigrant_proxy=case_when(.$bygnstat== 1 & .$byhomlng %in% c(2:6) ~ 1,
                                                   .$bygnstat %in% c(2:3) | .$byhomlng==1 ~ 0,
                                                   .$bygnstat %in% c(-8,-4,-9) | .$byhomlng %in% c(-4,-8,-9) ~ NA_real_))
els<-els %>% mutate(immigrant_proxy2=case_when(.$immigrant_proxy==1 ~ "immigrant",
                                               .$immigrant_proxy==0 ~ "non-immigrant"))
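
Because case_when() uses the first condition that matches, it is worth checking how the proxy behaves on edge cases. The sketch below applies the same three rules to a small made-up data frame (the rows in `toy` are hypothetical, not ELS records), and it uses bare column names instead of `.$`, which should be equivalent here. Note that a student with a missing generation status who speaks English at home is coded “0” rather than “NA”, because the second rule matches before the missing-value rule is reached.

```r
library(dplyr)

# Hypothetical rows; the codes follow the conventions used in the recode above
toy <- data.frame(
  bygnstat = c(1, 1, 2, 3, -9, -9),
  byhomlng = c(3, 1, 4, 1,  1,  2)
)

# Same three rules as the immigrant_proxy recode, evaluated top to bottom
toy <- toy %>% mutate(immigrant_proxy = case_when(
  bygnstat == 1 & byhomlng %in% 2:6 ~ 1,
  bygnstat %in% 2:3 | byhomlng == 1 ~ 0,
  bygnstat %in% c(-8, -4, -9) | byhomlng %in% c(-4, -8, -9) ~ NA_real_
))

# Row 5 (missing generation status, English at home) is coded 0, row 6 is NA
toy
```

If that behaviour is not wanted, moving the missing-value rule to the top of the case_when() call would make any missing code produce “NA”.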

Independent Variable # 2 - Parent’s Education

Here I created a new variable to measure parent’s highest level of education. If a student’s parent is a college graduate or higher, the value is “1”. If neither parent has completed college, the value is “0”. Missing values or skipped questions are “NA”.

els<-els %>% mutate(parents_college=case_when(.$bypared %in% c(6:8) ~ 1,
                                              .$bypared %in% c(1:5) ~ 0,
                                              .$bypared %in% c(-4,-8,-9) ~ NA_real_))

Independent Variable # 3 - Parent’s Expectations

I also created a new variable to measure the parent’s expectations of their child’s future level of education. If a student’s parent expects them to graduate college or higher, the value is “1”. If the parent expects their child to not graduate college or high school, the value is “0”. Missing values or skipped questions are “NA”.

els<-els %>% mutate(parents_college_exp=case_when(.$byparasp %in% c(5:7) ~ 1,
                                              .$byparasp %in% c(1:4) ~ 0,
                                              .$byparasp %in% c(-4) ~ NA_real_))
els2<-els %>% select(stu_id,bystuwt,strat_id,immigrant_proxy,immigrant_proxy2,college_expectations,parents_college,parents_college_exp)
els3<-els2[complete.cases(els2),]

Descriptive Statistics

In the following tables, I’ve provided some brief and unweighted descriptive statistics for college expectations, parent’s college level and parent’s college expectations by “immigration proxy”. Non-immigrant parents seem slightly more likely to have graduated from college, while immigrant parents seem slightly more likely to expect their children to attend college. Beyond this, the simple analysis appears inconclusive.

els3 %>% group_by(immigrant_proxy2) %>% summarise(college_expectations=mean(college_expectations,na.rm = T),parents_college=mean(parents_college,na.rm = T),parents_college_exp=mean(parents_college_exp,na.rm=T),n())

college expectations

“0” if the student believes they will not graduate from college, “1” if they believe that they will.

table(els3$college_expectations,els3$immigrant_proxy2)
##    
##     immigrant non-immigrant
##   0       276          3535
##   1       749         10170
prop.table(table(els3$college_expectations,els3$immigrant_proxy2),margin=2)
##    
##     immigrant non-immigrant
##   0 0.2692683     0.2579351
##   1 0.7307317     0.7420649

parents education level

“0” if neither parent has graduated from college, “1” if a parent has graduated from college.

table(els3$parents_college,els3$immigrant_proxy2)
##    
##     immigrant non-immigrant
##   0       643          7997
##   1       382          5708
prop.table(table(els3$parents_college,els3$immigrant_proxy2),margin=2)
##    
##     immigrant non-immigrant
##   0 0.6273171     0.5835097
##   1 0.3726829     0.4164903

parents college expectations for student

“0” if the parent believes their child will not graduate from college, “1” if they believe that they will.

table(els3$parents_college_exp,els3$immigrant_proxy2)
##    
##     immigrant non-immigrant
##   0        89          1708
##   1       936         11997
prop.table(table(els3$parents_college_exp,els3$immigrant_proxy2),margin = 2)
##    
##      immigrant non-immigrant
##   0 0.08682927    0.12462605
##   1 0.91317073    0.87537395

Create Survey Design

Because we are dissatisfied with this simplistic analysis, I have created a survey design which will hopefully shed more light upon these results:

des<-svydesign(ids=~1,strata =~strat_id,weights=~bystuwt,data=els3[is.na(els3$bystuwt)==F,])

Survey Design Weighted Analysis

Now, we can re-run some of the descriptive statistics using statistical weights provided by the ELS:2002 dataset. In this example, we will re-run the prop.table for college expectations vs. immigration proxy:

prop.table(wtd.table(els3$college_expectations,els3$immigrant_proxy2,weights=els3$bystuwt),margin=2)
##   immigrant non-immigrant
## 0 0.2747658     0.2804260
## 1 0.7252342     0.7195740

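As you can see, the figures are visibly different. With weights, the share of students expecting to graduate from college falls from about 73.1% to 72.5% among immigrants and from about 74.2% to 72.0% among non-immigrants, so the small gap between the two groups reverses direction.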
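
To make explicit what a weighted proportion table computes, here is a minimal sketch on made-up data (the `toy` data frame and its weights are hypothetical). It uses base R's xtabs() to sum the weights in each cell instead of wtd.table(), which I assume produces the same cell totals; each column is then divided by its total weight.

```r
# Hypothetical data: y is the 0/1 outcome, g the group, w the sampling weight
toy <- data.frame(
  y = c(1, 1, 0, 1, 0, 0),
  g = c("immigrant", "immigrant", "immigrant",
        "non-immigrant", "non-immigrant", "non-immigrant"),
  w = c(10, 10, 80, 50, 25, 25)
)

# Unweighted: every row counts once
unweighted <- prop.table(table(toy$y, toy$g), margin = 2)

# Weighted: every row counts w times, so cells hold sums of weights
weighted <- prop.table(xtabs(w ~ y + g, data = toy), margin = 2)

unweighted
weighted
```

In this toy example the heavily weighted immigrant row with y = 0 pulls the weighted immigrant proportion well below the unweighted one, which is the same mechanism at work in the ELS tables.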

Standard Errors with and without survey design:

Without survey design:

# Unweighted SE of each proportion: sqrt(p*(1-p)/n), with n counted per group
n<-table(els3$immigrant_proxy2)
p<-prop.table(table(els3$college_expectations,els3$immigrant_proxy2),margin=2)
se<-sqrt(sweep(p*(1-p),2,n,"/"))
data.frame(proportion=p,se=se)
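
For reference, the textbook standard error of a proportion under simple random sampling is sqrt(p(1-p)/n), with n taken within each group. The sketch below applies it to the unweighted counts from the college expectations table above. The object names are mine, and the calculation ignores the weights and strata entirely.

```r
# Unweighted counts taken from the college expectations table above
expects_college <- c(immigrant = 749, `non-immigrant` = 10170)
group_n         <- c(immigrant = 276 + 749, `non-immigrant` = 3535 + 10170)

# Proportion expecting to graduate college, per group
p_hat <- expects_college / group_n

# Simple-random-sampling standard error: sqrt(p * (1 - p) / n)
se_srs <- sqrt(p_hat * (1 - p_hat) / group_n)

round(cbind(proportion = p_hat, se = se_srs), 4)
```

The immigrant group has a noticeably larger standard error simply because it is much smaller.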

With survey design:

sv.table<-svyby(formula=~college_expectations,by=~immigrant_proxy2,design=des,FUN=svymean,na.rm=T)
sv.table

As you can see, with the survey design the standard errors are larger, and likely more accurate, because they account for the stratification and the unequal weights.

Regression Analysis: No Survey design vs. Survey Design

Now we will perform a regression analysis on these variables, using no weights or designs, using weights only and finally using the full survey design.

First, with no weights or design:

fit1<-lm(college_expectations~immigrant_proxy+parents_college+parents_college_exp,data=els3)
summary(fit1)
## 
## Call:
## lm(formula = college_expectations ~ immigrant_proxy + parents_college + 
##     parents_college_exp, data = els3)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.8678 -0.3101  0.1322  0.2547  0.7123 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          0.310145   0.009656  32.119   <2e-16 ***
## immigrant_proxy     -0.022414   0.013148  -1.705   0.0883 .  
## parents_college      0.122489   0.006922  17.696   <2e-16 ***
## parents_college_exp  0.435133   0.010417  41.772   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4057 on 14726 degrees of freedom
## Multiple R-squared:  0.1421, Adjusted R-squared:  0.1419 
## F-statistic: 812.9 on 3 and 14726 DF,  p-value: < 2.2e-16

It appears that the immigrant proxy has a weakly negative effect upon a student’s college expectations, but only at p = 0.088, which falls short of the conventional 0.05 threshold and is significant only at the 0.10 level.

Now, we run the same analysis using weights provided by the ELS:2002 dataset:

fit2<-lm(college_expectations~immigrant_proxy+parents_college+parents_college_exp,data=els3,weights=bystuwt)
summary(fit2)
## 
## Call:
## lm(formula = college_expectations ~ immigrant_proxy + parents_college + 
##     parents_college_exp, data = els3, weights = bystuwt)
## 
## Weighted Residuals:
##     Min      1Q  Median      3Q     Max 
## -24.991  -3.129   2.090   3.690  19.639 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          0.301486   0.009386  32.120   <2e-16 ***
## immigrant_proxy     -0.001719   0.014959  -0.115    0.909    
## parents_college      0.121359   0.007193  16.873   <2e-16 ***
## parents_college_exp  0.431280   0.010168  42.414   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.243 on 14726 degrees of freedom
## Multiple R-squared:  0.1412, Adjusted R-squared:  0.141 
## F-statistic: 806.9 on 3 and 14726 DF,  p-value: < 2.2e-16

Now we see that even the marginal significance of the immigration proxy has disappeared when weights are added: the coefficient shrinks to nearly zero (p = 0.909).

To further improve the model, we run a regression analysis using the full survey design:

fit3<-svyglm(college_expectations~immigrant_proxy+parents_college+parents_college_exp,des,family=gaussian)
summary(fit3)
## 
## Call:
## svyglm(formula = college_expectations ~ immigrant_proxy + parents_college + 
##     parents_college_exp, des, family = gaussian)
## 
## Survey design:
## svydesign(ids = ~1, strata = ~strat_id, weights = ~bystuwt, data = els3[is.na(els3$bystuwt) == 
##     F, ])
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          0.301486   0.012568  23.989   <2e-16 ***
## immigrant_proxy     -0.001719   0.018233  -0.094    0.925    
## parents_college      0.121359   0.008330  14.568   <2e-16 ***
## parents_college_exp  0.431280   0.013656  31.581   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.1731935)
## 
## Number of Fisher Scoring iterations: 2

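The coefficient estimates are identical to those of the weighted model, because the survey design changes only how the variance is estimated. The immigration proxy remains insignificant, while the standard errors have increased and the t-values have shrunk considerably.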
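
The relationship between the weighted lm() and svyglm() results can be checked on simulated data. The sketch below is only an illustration under assumptions: the `toy` data frame, its four strata and its weights are all made up, and the model has a single predictor. It confirms that the two fits share the same coefficients and differ only in their standard errors.

```r
library(survey)

set.seed(1)
n <- 200

# Hypothetical stratified sample with unequal weights
toy <- data.frame(
  strat = rep(1:4, each = 50),
  x     = rbinom(n, 1, 0.4),
  w     = runif(n, 1, 10)
)
toy$y <- rbinom(n, 1, 0.3 + 0.3 * toy$x)

# Weighted least squares, treating the weights as precision weights
fit_w <- lm(y ~ x, data = toy, weights = w)

# Design-based fit: same weights, variance estimated from the design
des_toy <- svydesign(ids = ~1, strata = ~strat, weights = ~w, data = toy)
fit_s   <- svyglm(y ~ x, design = des_toy, family = gaussian)

# Coefficients agree; standard errors generally do not
all.equal(unname(coef(fit_w)), unname(coef(fit_s)), tolerance = 1e-6)
se_compare <- cbind(lm_se  = sqrt(diag(vcov(fit_w))),
                    svy_se = sqrt(diag(vcov(fit_s))))
se_compare
```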

These three regression results can be seen arrayed side-by-side in the table below:

stargazer(fit1,fit2,fit3,style="demography",type="html",
column.labels=c("No Design","Weights","Survey Design"),
title="Regression Models for College Expectations by Immigration Status, Parent's Level of Education and Parent's College Expectations - ELS:2002",keep.stat="n",model.names=F,align=T,ci=T)
Regression Models for College Expectations by Immigration Status, Parent’s Level of Education and Parent’s College Expectations - ELS:2002

                                   college_expectations
                       No Design         Weights           Survey Design
                       Model 1           Model 2           Model 3
immigrant_proxy        -0.022            -0.002            -0.002
                       (-0.048, 0.003)   (-0.031, 0.028)   (-0.037, 0.034)
parents_college        0.122***          0.121***          0.121***
                       (0.109, 0.136)    (0.107, 0.135)    (0.105, 0.138)
parents_college_exp    0.435***          0.431***          0.431***
                       (0.415, 0.456)    (0.411, 0.451)    (0.405, 0.458)
Constant               0.310***          0.301***          0.301***
                       (0.291, 0.329)    (0.283, 0.320)    (0.277, 0.326)
N                      14,730            14,730            14,730

*p < .05; **p < .01; ***p < .001

Conclusion: Weights and Survey design make a difference!