A. Define binary outcome variable:

The dichotomous dependent variable is called move and records whether the individual moved in the past year. A value of 1 indicates that they moved, and a value of 0 indicates that they did not.
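
As a minimal sketch of how such an indicator could be coded, the lines below recode a yes/no response into 0/1. The data frame `toy` and the column `moved_raw` are hypothetical stand-ins, not names from the PSID file.

```r
# Hypothetical raw responses; the actual PSID variable name and coding are not shown here
toy <- data.frame(moved_raw = c("yes", "no", "no", "yes", NA))

# 1 = moved in the past year, 0 = did not move; missing responses stay NA
toy$move <- ifelse(toy$moved_raw == "yes", 1, 0)

# Check the recode, keeping missing values visible
table(toy$move, useNA = "ifany")
```

Keeping missing responses as NA, rather than folding them into 0, avoids counting non-respondents as people who did not move.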

The data used for this homework assignment come from the University of Michigan’s 2009 Panel Study of Income Dynamics.

B. State research question about what factors you believe will affect your outcome variable.

Can a person’s health and education determine how likely they are to have moved in the past year?

C. Define at least two predictor variables, based on your research question.

The two predictors used in this homework assignment are health, which has the levels good, fair and poor:

table(hw1$health)
## 
## fair good poor 
##  624  974  278

and educ (education), which has the levels lths (less than high school), hs (high school) and college (at least some college):

table(hw1$educ)
## 
## college      hs    lths 
##     600     589     691

D. Perform a descriptive analysis of the outcome variable by each of the variables defined in part B (also calculate descriptive statistics for the non-weighted, weighted and full survey designs. Calculate percentages and display them in tables. Discuss the differences):

Number within each group:

table(hw1$health, hw1$educ)
##       
##        college  hs lths
##   fair     176 210  238
##   good     357 282  335
##   poor      64  96  118

Proportion within each group:

round(prop.table(table(hw1$health, hw1$educ), margin = 2), digits = 3)
##       
##        college    hs  lths
##   fair   0.295 0.357 0.344
##   good   0.598 0.480 0.485
##   poor   0.107 0.163 0.171

And finally, a chi-square test of whether health status is associated with an individual’s education:

chisq.test(table(hw1$health, hw1$educ))
## 
##  Pearson's Chi-squared test
## 
## data:  table(hw1$health, hw1$educ)
## X-squared = 24.455, df = 4, p-value = 6.473e-05
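
To see where the statistic comes from, the sketch below rebuilds the table from the counts printed above and inspects the expected counts and standardized residuals. The object names `counts` and `chi` are illustrative only.

```r
# Cell counts copied from the health-by-education table above
counts <- matrix(
  c(176, 210, 238,
    357, 282, 335,
     64,  96, 118),
  nrow = 3, byrow = TRUE,
  dimnames = list(health = c("fair", "good", "poor"),
                  educ   = c("college", "hs", "lths"))
)

chi <- chisq.test(counts)

# Expected counts if health and education were independent
round(chi$expected, 1)

# Standardized residuals: large absolute values mark the cells that drive the statistic
round(chi$stdres, 2)
```

The test indicates an association between the two predictors; it does not by itself show that education affects health.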

For the last part, I’m not a hundred percent sure how to answer a, b, c and d, so I’m going to calculate descriptive statistics for the unweighted sample, the weighted sample and the full survey design, and compare the three.

Looking at the unweighted sample:

round(prop.table(table(hw1$health, hw1$educ), margin = 2), digits = 4)
##       
##        college     hs   lths
##   fair  0.2948 0.3571 0.3444
##   good  0.5980 0.4796 0.4848
##   poor  0.1072 0.1633 0.1708

and the weighted sample:

round(prop.table(wtd.table(hw1$health, hw1$educ, weights = hw1$wt), margin = 2), digits = 4)  ### simple weighted
##      college     hs   lths
## fair  0.2968 0.3449 0.3349
## good  0.5987 0.4877 0.5088
## poor  0.1045 0.1675 0.1563

We can see that the weighted and unweighted proportions are close but not identical, so the 2009 individual cross-sectional weights have had an effect on the estimates.
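
A weighted table sums the weights in each cell instead of counting rows. The following sketch illustrates the idea with a small made-up data frame (`toy`, with invented weights) and base R's `xtabs()` standing in for `wtd.table()`.

```r
# Made-up example data; the weights are invented for illustration only
toy <- data.frame(
  health = c("good", "good", "fair", "poor", "good", "fair"),
  educ   = c("college", "hs", "hs", "lths", "lths", "college"),
  wt     = c(1200, 800, 1500, 900, 1100, 700)
)

# Unweighted: each row counts once
unweighted <- prop.table(table(toy$health, toy$educ), margin = 2)

# Weighted: each row contributes its weight to its cell
weighted <- prop.table(xtabs(wt ~ health + educ, data = toy), margin = 2)

round(unweighted, 3)
round(weighted, 3)
```

The two tables differ whenever the weights vary across the rows that fall in the same column.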

Here we will examine the standard errors within the weighted sample:

n <- table(!is.na(hw1$health))  # TRUE = health is not missing
n
## 
## FALSE  TRUE 
##     4  1876
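# weighted proportions of health within each education group
p <- prop.table(wtd.table(hw1$health, hw1$educ, weights = hw1$wt), margin = 2)
# variance of each proportion, with the total non-missing n (n[2]) as the denominator;
# despite the name, this is the variance, and the square root is taken below
se <- (p*(1-p))/n[2]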

stargazer(data.frame(proportion = p, se = sqrt(se)), summary = F, type = "text", digits = 4)
## 
## =========================================================================
##   proportion.Var1 proportion.Var2 proportion.Freq se.Var1 se.Var2 se.Freq
## -------------------------------------------------------------------------
## 1      fair           college         0.2968       fair   college 0.0105 
## 2      good           college         0.5987       good   college 0.0113 
## 3      poor           college         0.1045       poor   college 0.0071 
## 4      fair             hs            0.3449       fair     hs    0.0110 
## 5      good             hs            0.4877       good     hs    0.0115 
## 6      poor             hs            0.1675       poor     hs    0.0086 
## 7      fair            lths           0.3349       fair    lths   0.0109 
## 8      good            lths           0.5088       good    lths   0.0115 
## 9      poor            lths           0.1563       poor    lths   0.0084 
## -------------------------------------------------------------------------
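
The calculation above divides by the total number of non-missing cases (1,876). Because the proportions are computed within each education group, one alternative is to divide by that group's own sample size. The sketch below, which is not what the code above does, compares the two using the weighted proportions and the unweighted column totals printed earlier. Both versions still ignore the survey design.

```r
# Weighted column proportions copied from the output above
p <- matrix(
  c(0.2968, 0.3449, 0.3349,
    0.5987, 0.4877, 0.5088,
    0.1045, 0.1675, 0.1563),
  nrow = 3, byrow = TRUE,
  dimnames = list(health = c("fair", "good", "poor"),
                  educ   = c("college", "hs", "lths"))
)

# Unweighted column totals from the health-by-education counts
n_col <- c(college = 176 + 357 + 64, hs = 210 + 282 + 96, lths = 238 + 335 + 118)

# Version used above: one denominator (total non-missing n) for every cell
se_total <- sqrt(p * (1 - p) / sum(n_col))

# Alternative: each column uses its own group size
se_group <- sqrt(sweep(p * (1 - p), 2, n_col, "/"))

round(se_total, 4)
round(se_group, 4)
```

Since every group is smaller than the full sample, the group-specific standard errors are larger than the ones in the table above.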

And here we will compare the full survey design with the simple weights. First, the weights:

# weighted cell counts (the name avoids masking base R's cat())
wtd_counts <- wtd.table(hw1$health, hw1$educ, weights = hw1$wt)
print(wtd_counts)
##      college      hs    lths
## fair 1932391 1924493 2294313
## good 3897847 2721148 3485907
## poor  680154  934473 1070680

and the survey design:

# weighted cell counts from the survey design object
svy_counts <- svytable(~health + educ, design = des)
svy_counts
##       educ
## health college      hs    lths
##   fair 1932391 1924493 2294313
##   good 3897847 2721148 3485907
##   poor  680154  934473 1070680
stargazer(data.frame(prop.table(svytable(~health + educ, design = des), margin = 2)), summary = F, type = "text", digits = 4)
## 
## =======================
##   health  educ    Freq 
## -----------------------
## 1  fair  college 0.2968
## 2  good  college 0.5987
## 3  poor  college 0.1045
## 4  fair    hs    0.3449
## 5  good    hs    0.4877
## 6  poor    hs    0.1675
## 7  fair   lths   0.3349
## 8  good   lths   0.5088
## 9  poor   lths   0.1563
## -----------------------
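
The design object `des` is created outside the code shown here. As a rough sketch of how design-based standard errors for these proportions could be obtained with the survey package, the example below uses a simulated data set. The column names `strata`, `psu` and `wt` and all values are assumptions for illustration, not the PSID variables.

```r
library(survey)

# Simulated data: 4 strata, 2 PSUs per stratum, invented weights
set.seed(1)
n_obs <- 400
toy <- data.frame(
  strata = rep(1:4, each = 100),
  psu    = rep(rep(1:2, each = 50), times = 4),
  wt     = runif(n_obs, 500, 5000),
  health = factor(sample(c("fair", "good", "poor"), n_obs, replace = TRUE)),
  educ   = factor(sample(c("college", "hs", "lths"), n_obs, replace = TRUE))
)

# Full design: clustering, stratification and weights together
des_toy <- svydesign(ids = ~psu, strata = ~strata, weights = ~wt,
                     data = toy, nest = TRUE)

# Proportion in each health category within each education group,
# with standard errors that reflect the design
prop_se <- svyby(~health, ~educ, design = des_toy, FUN = svymean)
print(prop_se)
```

The point estimates match a simple weighted table, while the standard errors account for clustering and stratification.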

From the results of this analysis we can see that the weighted sample has slightly different proportions from the unweighted sample and exactly the same proportions as the full survey design. The standard errors of the weighted sample are slightly different from those of the full survey design and, surprisingly, the weighted sample has slightly smaller standard errors.

In conclusion, the results of this analysis have shown that applying the survey weights changes the estimated proportions (in this case, slightly), and that only the full survey design gives the true standard error, which must be used for any