Part II. R Application (52 points)
The dataset for this part of the exam comes from the Panel Study of Income Dynamics (PSID) and its 2014 Child Development Supplement study. Collected and managed by the Survey Research Center of the Institute for Social Research at the University of Michigan, the PSID is an ongoing longitudinal survey of a nationally representative sample that started in 1968 with roughly 5,000 families and over 18,000 individuals. Individuals in these families and their descendants were followed annually until 1997 and have been followed biennially since. In 2013, the PSID collected data on 24,952 individuals (of whom 17,785 are PSID “sample persons”) within 9,063 families. To provide a database of children and their families with which to study the dynamic process of early human capital formation, the PSID in 1997 initiated a data collection effort for a cohort of 3,563 children under the age of 13 from 2,394 families in the Child Development Supplement (CDS). Because all the original CDS cohort children had reached adulthood, the CDS was relaunched with an entirely new sample covering all children in PSID households aged 0-17 years in 2014. The 2014 CDS no longer follows a single cohort of children; instead, it obtains information on a wide range of child measures and family dynamics for all children in PSID families in 2014. Most information in the CDS is collected from the participant’s primary caregiver, and in the majority of cases (over 90%) the primary caregiver respondent is the child’s biological mother.
For this analysis, I selected children whose primary caregiver is their biological mother. Below is a list of variables and the variable labels from the dataset called psid_cds.dta.
The research question you are going to analyze is how family socioeconomic status predicts child’s body mass index. Body mass index is a strong indicator for child development. Higher body mass index generally indicates a higher risk of obesity or overweight.
To prepare your analysis, you always need to clean your data. Although I have done most of the data cleaning for you, you will need to do a couple of additional data preparation steps.
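# Load the packages used below (read_data() is assumed here to come from sjlabelled)
library(dplyr)
library(tidyr)
library(sjlabelled)
# Import the dataset
Examdata <- read_data("psid_cds.dta")
# Recode tenure to a 0/1 indicator (5 -> 0, 1 -> 1, anything else -> NA),
# label race, sex and low birth weight, and drop rows with missing model variables
examdata <- Examdata %>%
  mutate(tenure = ifelse(tenure == 5, 0,
                         ifelse(tenure == 1, 1, NA))) %>%
  mutate(race = car::Recode(crace3, recodes = "1='White';2='Black';3='Other'")) %>%
  mutate(sex = car::Recode(csex, recodes = "1='Male'; 2='Female'")) %>%
  mutate(rlbw = car::Recode(lbw, recodes = "0='No'; 1='Yes'")) %>%
  drop_na(cbmi, adjfinc, educ, adjwlth2, sex, race, rlbw, cage)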
head(examdata$tenure)
## [1] 0 0 1 1 1 1
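The nested ifelse() recode of tenure can be checked on a toy vector before trusting it on the full data. This is a minimal sketch: the vector `tenure_raw` and its values are made up for illustration and are not taken from psid_cds.dta.

```r
# Hypothetical raw codes; 8 and NA stand in for any value other than 1 or 5
tenure_raw <- c(5, 1, 8, NA, 1)

# Same mapping as in the recode above: 5 -> 0, 1 -> 1, everything else -> NA
tenure_bin <- ifelse(tenure_raw == 5, 0,
                     ifelse(tenure_raw == 1, 1, NA))

# Cross-tabulate old against new codes, keeping NA visible
table(raw = tenure_raw, recoded = tenure_bin, useNA = "ifany")
```

Cross-tabulating the raw and recoded values in this way makes it easy to see whether any code was sent to the wrong category.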
# Checking if crace3 is a factor variable
is.factor(Examdata$crace3)
## [1] FALSE
examdata %>%
  filter(race %in% c("White", "Black", "Other")) %>%
  group_by(race) %>%
  summarise(means = mean(cbmi, na.rm = TRUE), sds = sd(cbmi, na.rm = TRUE), n = n())
## `summarise()` ungrouping output (override with `.groups` argument)
# H0: μ1 = μ2 = μ3
# H1: Means are not all equal
mm <- aov(cbmi~race,examdata)
anova(mm)
# From the analysis above, at the 5% significance level, we fail to reject the null hypothesis: there is no statistically significant difference in mean body mass index (BMI) among children from different racial backgrounds, because the p-value (0.09158) is greater than 0.05.
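The F statistic that anova(mm) reports can be reproduced from the between-group and within-group sums of squares. The following is a minimal sketch on simulated data; the data frame `toy` and its columns `y` and `g` are hypothetical stand-ins for cbmi and race, not the exam data.

```r
set.seed(1)
# Simulated stand-in for cbmi (y) and race (g); not the exam data
toy <- data.frame(
  y = rnorm(90, mean = 16, sd = 4),
  g = factor(rep(c("A", "B", "C"), each = 30))
)

grand_mean  <- mean(toy$y)
group_means <- tapply(toy$y, toy$g, mean)
group_n     <- tapply(toy$y, toy$g, length)

# Between-group and within-group sums of squares
ss_between <- sum(group_n * (group_means - grand_mean)^2)
ss_within  <- sum((toy$y - group_means[toy$g])^2)

df_between <- nlevels(toy$g) - 1
df_within  <- nrow(toy) - nlevels(toy$g)

# F is the ratio of the two mean squares
f_manual <- (ss_between / df_between) / (ss_within / df_within)
p_manual <- pf(f_manual, df_between, df_within, lower.tail = FALSE)

# Compare with the F statistic from the same aov()/anova() calls used above
f_aov <- anova(aov(y ~ g, data = toy))[["F value"]][1]
c(manual = f_manual, aov = f_aov, p = p_manual)
```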
examdata %>%
  group_by(sex) %>%
  summarise(means = mean(cbmi, na.rm = TRUE), sds = sd(cbmi, na.rm = TRUE), n = n())
## `summarise()` ungrouping output (override with `.groups` argument)
# H0: true difference in means is equal to 0
# H1: true difference in means is not equal to 0
t.test(cbmi ~ sex, data=examdata)
##
## Welch Two Sample t-test
##
## data: cbmi by sex
## t = -1.3384, df = 3600, p-value = 0.1809
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.9529798 0.1797414
## sample estimates:
## mean in group Female mean in group Male
## 16.12328 16.50990
# From the analysis above, at the 5% significance level, we fail to reject the null hypothesis: there is no statistically significant difference between the mean body mass index (BMI) of male and female children, because the p-value (0.1809) is greater than 0.05.
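The output above is a Welch test, which does not pool the two group variances. The following is a minimal sketch of how its t statistic, degrees of freedom and p-value are built, using simulated vectors `female` and `male` that are hypothetical stand-ins for the two groups' BMI values.

```r
set.seed(2)
# Simulated stand-ins for the BMI values of the two groups; not the exam data
female <- rnorm(40, mean = 16.1, sd = 4)
male   <- rnorm(45, mean = 16.5, sd = 5)

# Variance of each group mean, kept separate rather than pooled
v_f <- var(female) / length(female)
v_m <- var(male) / length(male)

# t statistic: difference in means over its standard error
t_manual <- (mean(female) - mean(male)) / sqrt(v_f + v_m)

# Welch-Satterthwaite degrees of freedom
df_manual <- (v_f + v_m)^2 /
  (v_f^2 / (length(female) - 1) + v_m^2 / (length(male) - 1))

# Two-sided p-value
p_manual <- 2 * pt(abs(t_manual), df_manual, lower.tail = FALSE)

# t.test() defaults to the Welch test (var.equal = FALSE)
welch <- t.test(female, male)
c(t = t_manual, df = df_manual, p = p_manual)
```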
#1.Adjusted family income
#2.Family wealth (1000s) with equity
#3.Mother’s educational attainment
# References
# 1. Cardel, M., Willig, A. L., Dulin-Keita, A., Casazza, K., Beasley, T. M., & Fernández, J. R. (2012). Parental feeding practices and socioeconomic status are associated with child adiposity in a multi-ethnic sample of children. Appetite, 58(1), 347-353.
# 2. Ahn, M. K., Juon, H. S., & Gittelsohn, J. (2008). Peer Reviewed: Association of Race/Ethnicity, Socioeconomic Status, Acculturation, and Environmental Factors with Risk of Overweight Among Adolescents in California, 2003. Preventing Chronic Disease, 5(3).
#Summary Statistics for Family wealth (1000s) with equity
summarise(examdata, Mean_value=mean(adjwlth2, na.rm = T), Median_value=median(adjwlth2, na.rm = T),
sd_value=sd(adjwlth2, na.rm = T))
#Summary Statistics for Adjusted family income
summarise(examdata, Mean_value=mean(adjfinc, na.rm = T), Median_value=median(adjfinc, na.rm = T),
sd_value=sd(adjfinc, na.rm = T))
#Summary Statistics for Mother’s educational attainment
summarise(examdata, Mean_value=mean(educ, na.rm = T), Median_value=median(educ, na.rm = T),
sd_value=sd(educ, na.rm = T))
# BMI = β0 + β1(family income) + β2(mother's years of education) + β3(family wealth with equity) + ε
fit <- lm( cbmi ~adjfinc + educ + adjwlth2, data=examdata)
summary(fit)
##
## Call:
## lm(formula = cbmi ~ adjfinc + educ + adjwlth2, data = examdata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.113 -1.831 0.595 4.300 38.863
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.712e+01 6.646e-01 25.753 < 2e-16 ***
## adjfinc -4.685e-06 1.515e-06 -3.093 0.00199 **
## educ -4.054e-02 4.826e-02 -0.840 0.40091
## adjwlth2 2.720e-04 4.668e-04 0.583 0.56012
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.662 on 3601 degrees of freedom
## Multiple R-squared: 0.003906, Adjusted R-squared: 0.003076
## F-statistic: 4.707 on 3 and 3601 DF, p-value: 0.002779
# Type II tests for each predictor (Anova() is from the car package)
car::Anova(fit)
# Family income: for each one-unit increase in adjusted family income, the expected child's body mass index decreases by 4.685e-06, holding all else constant. This decrease is statistically significant (p = 0.00199).
# Mother's education: for each additional year of education completed by the child's mother, the expected child's body mass index decreases by 0.04054, holding all else constant. This decrease is not statistically significant (p = 0.401).
# Family wealth with equity: for each additional one thousand dollars of family wealth with equity, the expected child's body mass index increases by 0.000272, holding all else constant. This increase is not statistically significant (p = 0.560).
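A slope such as -4.685e-06 is hard to read because it refers to a single unit of adjfinc; per 10,000 units the same estimate is -0.04685. The following sketch, on simulated data with the hypothetical names `income` and `y`, shows that rescaling a predictor rescales its slope but leaves the t statistic unchanged.

```r
set.seed(3)
# Simulated income and outcome; the names and values are illustrative only
income <- runif(200, min = 10000, max = 150000)
y      <- 17 - 5e-06 * income + rnorm(200, sd = 2)

fit_raw    <- lm(y ~ income)
fit_scaled <- lm(y ~ I(income / 10000))

# The slope per 10,000 units is 10,000 times the slope per single unit
slope_raw    <- unname(coef(fit_raw)[2])
slope_scaled <- unname(coef(fit_scaled)[2])
c(per_unit_times_10000 = slope_raw * 10000, per_10000 = slope_scaled)

# The t statistic (and so the p-value) does not depend on the units
t_raw    <- summary(fit_raw)$coefficients[2, "t value"]
t_scaled <- summary(fit_scaled)$coefficients[2, "t value"]
c(t_raw = t_raw, t_scaled = t_scaled)
```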
#The R2 indicates that only 0.39% of the variation in child's body mass index is explained by the predictors in our model (family income, mother's education and family wealth with equity). This is very low, so predictions from the regression equation are not reliable.
#The p-value (0.002779) for the overall F-test indicates that the model containing all three predictors fits significantly better than a model with no predictors.
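The overall F statistic can be recovered from R2 and the degrees of freedom printed in the summary, which is a useful consistency check on the output. A short worked calculation using only the reported values (R2 = 0.003906, 3 predictors, 3601 residual degrees of freedom):

```r
r2       <- 0.003906  # Multiple R-squared from summary(fit)
k        <- 3         # number of predictors
df_resid <- 3601      # residual degrees of freedom

# F = (explained variation per predictor) / (unexplained variation per residual df)
f_stat <- (r2 / k) / ((1 - r2) / df_resid)
p_val  <- pf(f_stat, k, df_resid, lower.tail = FALSE)

round(f_stat, 3)
signif(p_val, 4)
```

Because R2 is rounded in the printed output, the result agrees with the reported F statistic only up to rounding.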
#Summary Statistics for child’s age
summarise(examdata, Min_value=min(cage, na.rm = T), Max_value=max(cage, na.rm = T),
Mean_value=mean(cage, na.rm = T), sd_value=sd(cage, na.rm = T), Median_value=median(cage, na.rm = T))
# Descriptive Statistics for categorical variable Child's sex using frequency table
examdata$sex <- as.factor(examdata$sex)
sjmisc::frq(examdata$sex)
##
## x <categorical>
## # total N=3605 valid N=3605 mean=1.50 sd=0.50
##
## Value | N | Raw % | Valid % | Cum. %
## ----------------------------------------
## Female | 1817 | 50.40 | 50.40 | 50.40
## Male | 1788 | 49.60 | 49.60 | 100.00
## <NA> | 0 | 0.00 | <NA> | <NA>
#Descriptive Statistics for categorical variable Child's Race using frequency table
examdata$race<- as.factor(examdata$race)
sjmisc::frq(examdata$race)
##
## x <categorical>
## # total N=3605 valid N=3605 mean=2.19 sd=0.96
##
## Value | N | Raw % | Valid % | Cum. %
## ---------------------------------------
## Black | 1379 | 38.25 | 38.25 | 38.25
## Other | 175 | 4.85 | 4.85 | 43.11
## White | 2051 | 56.89 | 56.89 | 100.00
## <NA> | 0 | 0.00 | <NA> | <NA>
#Descriptive Statistics for categorical variable Child's LBW(whether child was born low birth weight) using frequency table
examdata$rlbw<- as.factor(examdata$rlbw)
sjmisc::frq(examdata$rlbw)
##
## x <categorical>
## # total N=3605 valid N=3605 mean=1.09 sd=0.29
##
## Value | N | Raw % | Valid % | Cum. %
## ---------------------------------------
## No | 3278 | 90.93 | 90.93 | 90.93
## Yes | 327 | 9.07 | 9.07 | 100.00
## <NA> | 0 | 0.00 | <NA> | <NA>
fit2 <- lm( cbmi ~adjfinc + educ + adjwlth2 + sex + race + rlbw + cage, data=examdata)
summary(fit2)
##
## Call:
## lm(formula = cbmi ~ adjfinc + educ + adjwlth2 + sex + race +
## rlbw + cage, data = examdata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -22.536 -1.718 1.319 4.009 41.144
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.147e+01 7.035e-01 16.305 < 2e-16 ***
## adjfinc -5.314e-06 1.430e-06 -3.716 0.000206 ***
## educ -9.546e-03 4.628e-02 -0.206 0.836597
## adjwlth2 -2.703e-04 4.408e-04 -0.613 0.539828
## sexMale 1.211e-01 2.717e-01 0.446 0.655747
## raceOther -2.196e+00 6.579e-01 -3.338 0.000851 ***
## raceWhite -2.021e-01 2.945e-01 -0.686 0.492556
## rlbwYes -1.319e-01 4.754e-01 -0.277 0.781532
## cage 6.527e-01 3.018e-02 21.627 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.148 on 3596 degrees of freedom
## Multiple R-squared: 0.1199, Adjusted R-squared: 0.1179
## F-statistic: 61.24 on 8 and 3596 DF, p-value: < 2.2e-16
#Sex: compared to female children, male children have a body mass index that is 0.1211 higher, holding all else constant. The observed difference is not statistically significant (p = 0.656).
#Child's age: for each additional year of the child's age, the expected body mass index increases by 0.6527, holding all else constant. This increase is statistically significant (p < 0.001).
#Race (Other): compared to Black children, children of Other racial background have a body mass index that is 2.196 lower, holding all else constant. The observed difference is statistically significant (p < 0.001).
#Race (White): compared to Black children, White children have a body mass index that is 0.2021 lower, holding all else constant. The observed difference is not statistically significant (p = 0.493).
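Black is the comparison group in these interpretations because R orders factor levels alphabetically and lm() uses the first level as the reference. The following sketch uses a toy factor `race_toy` (hypothetical, not the exam data) to show the default baseline and how relevel() would change it.

```r
# Toy factor with the same three labels as the race variable
race_toy <- factor(c("White", "Black", "Other", "Black"))
levels(race_toy)  # alphabetical order, so "Black" is the baseline

# Make "White" the reference category instead
race_white_ref <- relevel(race_toy, ref = "White")
levels(race_white_ref)

# Dummy columns that lm() would create under each coding
cols_default <- colnames(model.matrix(~ race_toy))
cols_white   <- colnames(model.matrix(~ race_white_ref))
cols_default
cols_white
```

Changing the reference level changes which contrasts are reported, not the fit of the model.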
#Comparing the first and second model.
anova(fit, fit2)
# The second model is the preferred model. Its adjusted R2 (0.1179) is greater than that of the first model (0.003076), so it explains more of the variation in child's body mass index. Also, when the two models were compared using the anova table, the p-value was less than 0.05, suggesting that adding sex, race, low birth weight and child's age significantly improved the model and that the improvement in fit is unlikely to be due to random fluctuation in the data. Thus, it is important to consider child's age and race in particular when modeling how family socioeconomic status predicts child’s body mass index.
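The comparison anova(fit, fit2) is a partial F test on the terms added to the first model (here 3601 - 3596 = 5 extra coefficients). The following is a minimal sketch of the calculation on simulated data; `x1`, `x2`, `x3` and `y` are hypothetical variables, not the exam data.

```r
set.seed(4)
n  <- 300
# Simulated predictors: x1 plays the role of an SES measure, x2 and x3 of controls
x1 <- rnorm(n)
x2 <- rnorm(n)
x3 <- rnorm(n)
y  <- 1 + 0.2 * x1 + 0.8 * x2 + rnorm(n)

small <- lm(y ~ x1)
big   <- lm(y ~ x1 + x2 + x3)

# Residual sums of squares of the nested models
rss_small <- sum(resid(small)^2)
rss_big   <- sum(resid(big)^2)
q         <- df.residual(small) - df.residual(big)  # number of added coefficients

# Drop in RSS per added coefficient, relative to the larger model's residual variance
f_manual <- ((rss_small - rss_big) / q) / (rss_big / df.residual(big))
p_manual <- pf(f_manual, q, df.residual(big), lower.tail = FALSE)

cmp <- anova(small, big)
c(manual = f_manual, anova = cmp$F[2], p = p_manual)
```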
#The R2 (0.1199) indicates that only about 12.0% of the variation in child's body mass index is explained by the predictors in our model: family income, mother's education, family wealth with equity, sex, race, low birth weight and child's age. The adjusted R2 does not count only the significant variables; it applies a penalty for the number of predictors in the model. At 0.1179, it indicates that about 11.8% of the variation is explained once that penalty is applied.
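The adjusted R2 values in both summaries follow from R2, the sample size and the number of slopes. A short worked calculation using the reported values is sketched here; the helper `adj_r2()` is a hypothetical name introduced for illustration.

```r
# Adjusted R-squared: penalizes R-squared for the number of estimated slopes
adj_r2 <- function(r2, n, k) {
  # n = number of observations, k = number of estimated slopes
  1 - (1 - r2) * (n - 1) / (n - k - 1)
}

n <- 3605  # valid N reported in the frequency tables

adj_fit  <- adj_r2(r2 = 0.003906, n = n, k = 3)  # first model
adj_fit2 <- adj_r2(r2 = 0.1199,   n = n, k = 8)  # second model
round(c(fit = adj_fit, fit2 = adj_fit2), 6)
```

The second model has 8 slopes because race contributes two dummy variables.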
attach(examdata)
#Examining the multicollinearity of the continuous variables
round(cor(cbind(cbmi,adjfinc,educ, adjwlth2, cage)),2)
## cbmi adjfinc educ adjwlth2 cage
## cbmi 1.00 -0.06 -0.03 -0.03 0.33
## adjfinc -0.06 1.00 0.24 0.58 0.06
## educ -0.03 0.24 1.00 0.21 -0.02
## adjwlth2 -0.03 0.58 0.21 1.00 0.08
## cage 0.33 0.06 -0.02 0.08 1.00
#Examining the multicollinearity of all variables
car::vif(fit2)
## GVIF Df GVIF^(1/(2*Df))
## adjfinc 1.550801 1 1.245312
## educ 1.109969 1 1.053551
## adjwlth2 1.530454 1 1.237115
## sex 1.002244 1 1.001121
## race 1.096729 2 1.023352
## rlbw 1.012281 1 1.006122
## cage 1.023479 1 1.011672
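The last column of the table rescales each GVIF by its degrees of freedom so that terms with several dummy variables, such as race, are comparable with single-coefficient terms. The sketch below first reproduces two of those values from the table and then computes an ordinary VIF by hand on simulated predictors (`x1`, `x2`, `x3` are hypothetical, not the exam data).

```r
# GVIF^(1/(2*Df)) for two rows of the table above
gvif_adjfinc <- 1.550801^(1 / (2 * 1))  # adjfinc, Df = 1
gvif_race    <- 1.096729^(1 / (2 * 2))  # race, Df = 2
round(c(adjfinc = gvif_adjfinc, race = gvif_race), 6)

# VIF by hand on simulated data
set.seed(5)
x1 <- rnorm(500)
x2 <- 0.6 * x1 + rnorm(500)  # correlated with x1
x3 <- rnorm(500)

# Regress x1 on the other predictors and convert that R-squared to a VIF
r2_x1  <- summary(lm(x1 ~ x2 + x3))$r.squared
vif_x1 <- 1 / (1 - r2_x1)

# Equivalent route: diagonal of the inverse correlation matrix of the predictors
vif_alt <- unname(diag(solve(cor(cbind(x1, x2, x3))))[1])
c(from_regression = vif_x1, from_correlation = vif_alt)
```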
# Examining the assumption of linearity.
pairs(~cbmi+adjfinc+educ+cage+adjwlth2, data=examdata, main='Body mass index scatterplots')
#It appears that the assumption of linearity is violated in the relationship between body mass index and all of our variables except the age variable.
# Hence we can first take the log of each of the independent variables (where their values are strictly positive) to see if that fixes the problem; if not, we can take the log of both the independent and dependent variables.
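The effect of a log transformation on a right-skewed predictor can be illustrated with simulated data. This is only a sketch under the assumption that the outcome is linear in log(x); the variables `x` and `y` are hypothetical, and log() is only defined for strictly positive values, so a variable that can be zero or negative would need a different treatment.

```r
set.seed(6)
# Simulated right-skewed predictor and an outcome that is linear in log(x)
x <- rlnorm(500, meanlog = 10, sdlog = 1)
y <- 5 + 2 * log(x) + rnorm(500, sd = 0.5)

fit_raw <- lm(y ~ x)
fit_log <- lm(y ~ log(x))

# With data built this way, the log specification explains more of the variation
r2_raw <- summary(fit_raw)$r.squared
r2_log <- summary(fit_log)$r.squared
c(raw = r2_raw, log = r2_log)

# Residual checks to run once the functional form looks adequate:
# plot(fit_log, which = 1)  # residuals vs fitted: linearity, constant variance
# plot(fit_log, which = 2)  # normal Q-Q plot of the residuals
```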
#Unfortunately, since the assumption of linearity is violated, we can't assess whether the other assumptions have been violated until we fix the linearity problem. Once it is fixed, we can test for constant variance and normality of the residuals.
# ... body mass index. (5 points)
#Multiple linear regression was carried out to investigate how family socioeconomic status predicts child’s body mass index. In the full model, body mass index was significantly associated with child's age (p < 0.001), family income (p < 0.001) and Other race compared with Black race (p < 0.001).
#Child's age: for each additional year of the child's age, the expected body mass index increases by 0.6527, holding all else constant (p < 0.001).
#Race (Other): compared to Black children, children of Other racial background have a body mass index that is 2.196 lower, holding all else constant (p < 0.001).
#Family income: for each one-unit increase in adjusted family income, the expected body mass index decreases by 5.314e-06, holding all else constant (p < 0.001).
#The R2 (0.1199) indicates that only about 12.0% of the variation in child's body mass index is explained by the predictors in our model: family income, mother's education, family wealth with equity, sex, race, low birth weight and child's age. It is worth noting that family wealth and mother's education were not significant in our model. This differs from what is reported in the literature.