Part II. R Application (52 points)

The dataset for this part of the exam is from the Panel Study of Income Dynamics (PSID) and its 2014 Child Development Supplement (CDS). Collected and managed by the Survey Research Center of the Institute for Social Research at the University of Michigan, the PSID is an ongoing longitudinal survey of a nationally representative sample that started with roughly 5,000 families and over 18,000 individuals in 1968. Individuals in these families and their descendants were followed annually through 1997 and have been followed biennially since then. In 2013, the PSID collected data on 24,952 individuals (of whom 17,785 are PSID “sample persons”) within 9,063 families. To provide a database of children and their families with which to study the dynamic process of early human capital formation, the PSID in 1997 initiated a data collection effort, the Child Development Supplement, for a cohort of 3,563 children under the age of 13 from 2,394 families. Because all the original CDS cohort children had reached adulthood, the CDS was relaunched in 2014 with an entirely new sample covering all children aged 0-17 years in PSID households. The 2014 CDS no longer follows a single cohort of children; instead, it obtains information on a wide range of child measures and family dynamics for all children in PSID families in 2014. Most information in the CDS is collected from the participant’s primary caregiver, and in the majority of cases (over 90%) the primary caregiver respondent is the child’s biological mother.

For this analysis, I selected children whose primary caregiver is their biological mother. Below is a list of variables and the variable labels from the dataset called psid_cds.dta.

The research question you are going to analyze is how family socioeconomic status predicts child’s body mass index. Body mass index is a strong indicator for child development. Higher body mass index generally indicates a higher risk of obesity or overweight.

To prepare for your analysis, you always need to clean your data. Although I have done most of the data cleaning for you, you will need to do a couple of additional data preparation steps.

Question 1. Recode tenure so that 1=owning a house, 0=renting a house, and missing is set to missing (NA). (2 points)

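# Load the packages used in this analysis
library(haven)   # read_dta() for Stata .dta files
library(dplyr)   # %>%, mutate(), group_by(), summarise()
library(tidyr)   # drop_na()
library(car)     # Recode(), Anova(), vif()

# Import the dataset (read_data() is not defined in these packages; read_dta() reads .dta files)
Examdata <- read_dta("psid_cds.dta")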

# Recode tenure (5 -> 0 renting, 1 -> 1 owning, anything else -> NA),
# label race, sex, and low birth weight, and drop rows with missing analysis variables
examdata <- Examdata %>%
  mutate(tenure = ifelse(tenure == 5, 0,
                  ifelse(tenure == 1, 1, NA))) %>%
  mutate(race = car::Recode(crace3, recodes = "1='White';2='Black';3='Other'")) %>%
  mutate(sex  = car::Recode(csex, recodes = "1='Male'; 2='Female'")) %>%
  mutate(rlbw = car::Recode(lbw, recodes = "0='No'; 1='Yes'")) %>%
  drop_na(cbmi, adjfinc, educ, adjwlth2, sex, race, rlbw, cage)
head(examdata$tenure)
## [1] 0 0 1 1 1 1
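
The nested ifelse() recode above can be checked on a small toy vector before it is applied to the real data. This is a minimal sketch: the values in tenure_raw (including the code 8) are made up for illustration and are not taken from psid_cds.dta.

```r
# Hypothetical raw codes: 1 = owns, 5 = rents, 8 = some other code, NA = missing
tenure_raw <- c(1, 5, 8, NA, 1, 5)

# Same logic as the recode above: 1 -> 1, 5 -> 0, everything else -> NA
tenure_own <- ifelse(tenure_raw == 1, 1,
              ifelse(tenure_raw == 5, 0, NA))

# Cross-tabulate old against new values to confirm every code landed where intended
table(raw = tenure_raw, recoded = tenure_own, useNA = "ifany")
```

Cross-tabulating the raw and recoded values is a quick way to catch a code that was silently sent to the wrong category.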

Question 2. Check whether crace3 (child’s race) variable is a factor variable. (1 point)

# Checking if crace3 is a factor variable 
is.factor(Examdata$crace3)
## [1] FALSE
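
Because crace3 is stored as numeric codes, it has to be converted before it can be used as a categorical predictor. The sketch below shows one base R way to do that on a toy vector; crace3_toy and race_toy are hypothetical names, and the label order is assumed from the recode in Question 1.

```r
# Toy numeric codes standing in for crace3 (1 = White, 2 = Black, 3 = Other)
crace3_toy <- c(1, 2, 3, 1, 2)
is.factor(crace3_toy)

# Convert the codes to a labelled factor
race_toy <- factor(crace3_toy,
                   levels = c(1, 2, 3),
                   labels = c("White", "Black", "Other"))
is.factor(race_toy)
levels(race_toy)
```

Note that the first level of a factor becomes the reference category in lm(). With the level order used here that would be White, whereas labels sorted alphabetically (as in the models below) make Black the reference.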

Question 3. Run a test to see if there are significant differences in body mass index among kids of different racial backgrounds. (3 points)

examdata %>%
  filter(race %in% c("White", "Black", "Other")) %>%
  group_by(race) %>%
  summarise(means = mean(cbmi, na.rm = T), sds = sd(cbmi, na.rm = T), n = n())
## `summarise()` ungrouping output (override with `.groups` argument)
# H0: μ1 = μ2 = μ3 
# H1: Means are not all equal

mm <- aov(cbmi~race,examdata)
anova(mm)
# From the analysis above, at the 5% significance level, we fail to reject the null hypothesis: there is no statistically significant difference in mean body mass index (BMI) among children of different racial backgrounds, because the p-value (0.09158) is greater than 0.05.
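
The one-way ANOVA workflow used above can be sketched on simulated data to show where the p-value comes from and how a post hoc comparison would follow. The data frame toy and its group means are invented for illustration, and TukeyHSD() is a follow-up step that was not part of the original answer.

```r
set.seed(1)

# Simulated BMI values for three groups (means chosen arbitrarily)
toy <- data.frame(
  race = factor(rep(c("Black", "Other", "White"), each = 30)),
  cbmi = c(rnorm(30, 17, 3), rnorm(30, 15, 3), rnorm(30, 16.5, 3))
)

# One-way ANOVA: H0 is that all group means are equal
toy_aov   <- aov(cbmi ~ race, data = toy)
aov_table <- summary(toy_aov)[[1]]
p_value   <- aov_table[["Pr(>F)"]][1]   # p-value of the overall F-test
p_value

# If H0 were rejected, pairwise comparisons show which groups differ
tukey <- TukeyHSD(toy_aov)
tukey$race
```

The overall F-test only says whether at least one mean differs; the pairwise table is what identifies the specific pairs.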

Question 4. Run a test to see if there is a significant difference in body mass index between girls and boys. (3 points) *Note: Make sure you state the null and alternative hypotheses for each of these tests, and interpret the outputs thoroughly.

examdata %>%
  group_by(sex) %>%
  summarise(means = mean(cbmi, na.rm = T), sds = sd(cbmi, na.rm = T), n = n())
## `summarise()` ungrouping output (override with `.groups` argument)
# H0: true difference in means is equal to 0 
# H1: true difference in means is not equal to 0
t.test(cbmi ~ sex, data=examdata)
## 
##  Welch Two Sample t-test
## 
## data:  cbmi by sex
## t = -1.3384, df = 3600, p-value = 0.1809
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.9529798  0.1797414
## sample estimates:
## mean in group Female   mean in group Male 
##             16.12328             16.50990
# From the analysis above, at the 5% significance level, we fail to reject the null hypothesis: there is no statistically significant difference in mean body mass index (BMI) between male and female children, because the p-value (0.1809) is greater than 0.05. The 95% confidence interval for the difference (-0.953 to 0.180) also includes 0.
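
As a check on how the pieces of the Welch output fit together, the p-value and confidence interval can be approximately reproduced from the printed t statistic, degrees of freedom, and group means. This is a sketch using the rounded values shown above, so small discrepancies in the last digits are expected.

```r
# Values copied from the t.test() output above (rounded as printed)
t_stat      <- -1.3384
df          <- 3600
mean_female <- 16.12328
mean_male   <- 16.50990

# Two-sided p-value: probability of a |t| at least this large under H0
p_two_sided <- 2 * pt(-abs(t_stat), df = df)
round(p_two_sided, 4)

# Standard error implied by the mean difference and the t statistic
mean_diff <- mean_female - mean_male
se_diff   <- mean_diff / t_stat

# 95% confidence interval: estimate +/- critical value * standard error
ci <- mean_diff + c(-1, 1) * qt(0.975, df) * se_diff
round(ci, 3)
```

The interval straddles zero for the same reason the p-value exceeds 0.05: both are computed from the same estimate and standard error.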

Question 5. Review relevant literature and identify three family socioeconomic variables from the variable list that are relevant to child’s body mass index. Be sure to cite at least TWO references to support your claim. (3 points)

#1.Adjusted family income
#2.Family wealth (1000s) with equity
#3.Mother’s educational attainment 

# References

# 1. Cardel, M., Willig, A. L., Dulin-Keita, A., Casazza, K., Beasley, T. M., & Fernández, J. R. (2012). Parental feeding practices and socioeconomic status are associated with child adiposity in a multi-ethnic sample of children. Appetite, 58(1), 347-353.

# 2. Ahn, M. K., Juon, H. S., & Gittelsohn, J. (2008). Association of race/ethnicity, socioeconomic status, acculturation, and environmental factors with risk of overweight among adolescents in California, 2003. Preventing Chronic Disease, 5(3).

Question 6. Examine the means, medians, and standard deviation for variables you identified in 5). (3 points)

#Summary Statistics for Family wealth (1000s) with equity
summarise(examdata, Mean_value=mean(adjwlth2, na.rm = T), Median_value=median(adjwlth2, na.rm = T), 
         sd_value=sd(adjwlth2, na.rm = T))
#Summary Statistics for Adjusted family income
summarise(examdata, Mean_value=mean(adjfinc, na.rm = T), Median_value=median(adjfinc, na.rm = T), 
         sd_value=sd(adjfinc, na.rm = T))
#Summary Statistics for Mother’s educational attainment
summarise(examdata, Mean_value=mean(educ, na.rm = T), Median_value=median(educ, na.rm = T), 
         sd_value=sd(educ, na.rm = T))
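
The three summarise() calls above repeat the same pattern, so the same statistics can also be produced in one pass. The helper describe_vars() below is a hypothetical base R sketch, and the toy data frame is invented purely to make the example self-contained.

```r
# Hypothetical helper: mean, median, and SD for each named column
describe_vars <- function(data, vars) {
  t(sapply(data[vars], function(x) {
    c(mean   = mean(x, na.rm = TRUE),
      median = median(x, na.rm = TRUE),
      sd     = sd(x, na.rm = TRUE))
  }))
}

# Made-up values standing in for the three socioeconomic variables
toy <- data.frame(
  adjfinc  = c(20000, 45000, 60000, NA, 120000),
  educ     = c(12, 12, 14, 16, 18),
  adjwlth2 = c(-5, 10, 40, 150, 900)
)

stats <- describe_vars(toy, c("adjfinc", "educ", "adjwlth2"))
round(stats, 2)
```

Putting the variables in one table also makes it easier to spot skew: a mean far above the median, as is typical for income and wealth, signals a long right tail.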

Question 7. Estimate a regression model with all independent variables you identified in 5), and interpret the output of regression analysis. (6 points)

# BMI = β0 + β1(Family income) + β2(Mother's years of education) + β3(Family wealth with equity) + ε
fit <- lm( cbmi ~adjfinc + educ + adjwlth2, data=examdata)
summary(fit)
## 
## Call:
## lm(formula = cbmi ~ adjfinc + educ + adjwlth2, data = examdata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -17.113  -1.831   0.595   4.300  38.863 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.712e+01  6.646e-01  25.753  < 2e-16 ***
## adjfinc     -4.685e-06  1.515e-06  -3.093  0.00199 ** 
## educ        -4.054e-02  4.826e-02  -0.840  0.40091    
## adjwlth2     2.720e-04  4.668e-04   0.583  0.56012    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.662 on 3601 degrees of freedom
## Multiple R-squared:  0.003906,   Adjusted R-squared:  0.003076 
## F-statistic: 4.707 on 3 and 3601 DF,  p-value: 0.002779
Anova(fit)
# For family income - For every one-unit increase in adjusted family income, the expected child's body mass index decreases by 4.685e-06, holding all else constant. This decrease is statistically significant (p = 0.00199).

# For mother's education - For every additional year of education completed by the child's mother, the expected child's body mass index decreases by 4.054e-02, holding all else constant. However, this decrease is not statistically significant (p = 0.40091).

# For family wealth with equity - For every additional one thousand dollars of family wealth with equity, the expected child's body mass index increases by 2.720e-04, holding all else constant. However, this increase is not statistically significant (p = 0.56012).

# The R2 indicates that only 0.39% of the variation in child's body mass index is explained by the predictors in our model (family income, mother's education, and family wealth with equity). This is very low, so predictions from the regression equation are not reliable.
# The p-value (0.002779) for the overall F-test indicates that at least one coefficient differs from zero, so the model with these predictors fits significantly better than a model with no predictors.
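
The income coefficient looks tiny only because it is expressed per single unit of income. The sketch below rescales it to a more readable increment; it assumes adjfinc is measured in dollars, which the variable list does not confirm, and uses the rounded estimates printed above.

```r
# Estimate and standard error for adjfinc, copied from summary(fit)
b_income  <- -4.685e-06
se_income <-  1.515e-06

# Assumed unit: dollars. Express the effect per 10,000 units of income instead.
scale      <- 10000
b_per_10k  <- b_income  * scale
se_per_10k <- se_income * scale
b_per_10k

# Rescaling changes the units of the coefficient, not the evidence for it:
# the t statistic is the same before and after
t_original <- b_income  / se_income
t_rescaled <- b_per_10k / se_per_10k
c(t_original, t_rescaled)
```

Under that assumption, a 10,000-unit difference in income corresponds to an expected BMI difference of about 0.047, which is statistically detectable in a sample this large but small in practical terms.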

In addition to the three family socioeconomic background variables you identified from 5), previous research suggests that several demographic variables are also important predictors of body mass index, including child’s age, sex, race, and low birth weight status.

Question 8. Provide appropriate descriptive statistics for child’s age, sex, race, and low birth weight status. (3 points)

#Summary Statistics for child’s age
summarise(examdata, Min_value=min(cage, na.rm = T), Max_value=max(cage, na.rm = T), 
          Mean_value=mean(cage, na.rm = T), sd_value=sd(cage, na.rm = T), Median_value=median(cage, na.rm = T))
# Descriptive Statistics for  categorical variable Child's sex using frequency table
examdata$sex <- as.factor(examdata$sex)
sjmisc::frq(examdata$sex)
## 
## x <categorical>
## # total N=3605  valid N=3605  mean=1.50  sd=0.50
## 
## Value  |    N | Raw % | Valid % | Cum. %
## ----------------------------------------
## Female | 1817 | 50.40 |   50.40 |  50.40
## Male   | 1788 | 49.60 |   49.60 | 100.00
## <NA>   |    0 |  0.00 |    <NA> |   <NA>
#Descriptive Statistics for  categorical variable Child's Race using frequency table
examdata$race<- as.factor(examdata$race)
sjmisc::frq(examdata$race)
## 
## x <categorical>
## # total N=3605  valid N=3605  mean=2.19  sd=0.96
## 
## Value |    N | Raw % | Valid % | Cum. %
## ---------------------------------------
## Black | 1379 | 38.25 |   38.25 |  38.25
## Other |  175 |  4.85 |    4.85 |  43.11
## White | 2051 | 56.89 |   56.89 | 100.00
## <NA>  |    0 |  0.00 |    <NA> |   <NA>
#Descriptive Statistics for  categorical variable Child's LBW(whether child was born low birth weight) using frequency table
examdata$rlbw<- as.factor(examdata$rlbw)
sjmisc::frq(examdata$rlbw)
## 
## x <categorical>
## # total N=3605  valid N=3605  mean=1.09  sd=0.29
## 
## Value |    N | Raw % | Valid % | Cum. %
## ---------------------------------------
## No    | 3278 | 90.93 |   90.93 |  90.93
## Yes   |  327 |  9.07 |    9.07 | 100.00
## <NA>  |    0 |  0.00 |    <NA> |   <NA>

Question 9. Estimate another regression model with all the independent variables you identified in 5) and these child’s demographic variables. (3 points)

fit2 <- lm( cbmi ~adjfinc + educ + adjwlth2 + sex + race + rlbw + cage, data=examdata)
summary(fit2)
## 
## Call:
## lm(formula = cbmi ~ adjfinc + educ + adjwlth2 + sex + race + 
##     rlbw + cage, data = examdata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -22.536  -1.718   1.319   4.009  41.144 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.147e+01  7.035e-01  16.305  < 2e-16 ***
## adjfinc     -5.314e-06  1.430e-06  -3.716 0.000206 ***
## educ        -9.546e-03  4.628e-02  -0.206 0.836597    
## adjwlth2    -2.703e-04  4.408e-04  -0.613 0.539828    
## sexMale      1.211e-01  2.717e-01   0.446 0.655747    
## raceOther   -2.196e+00  6.579e-01  -3.338 0.000851 ***
## raceWhite   -2.021e-01  2.945e-01  -0.686 0.492556    
## rlbwYes     -1.319e-01  4.754e-01  -0.277 0.781532    
## cage         6.527e-01  3.018e-02  21.627  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.148 on 3596 degrees of freedom
## Multiple R-squared:  0.1199, Adjusted R-squared:  0.1179 
## F-statistic: 61.24 on 8 and 3596 DF,  p-value: < 2.2e-16

Question 10. Interpret the coefficients of sex, child’s age and race variables from the model from 9). (8 points)

# For sex - Compared to female children, male children have a body mass index that is 0.1211 higher on average, holding all else constant. The observed difference is not statistically significant (p-value > 0.05).

# For child's age - For every additional year of the child's age, the expected body mass index increases by 0.6527, holding all else constant. This increase is statistically significant (p-value < 0.001).

# For race (Other) - Compared to Black children, children of Other racial background have a body mass index that is 2.196 lower on average, holding all else constant. The observed difference is statistically significant (p-value < 0.001).

# For race (White) - Compared to Black children, White children have a body mass index that is 0.2021 lower on average, holding all else constant. The observed difference is not statistically significant (p-value > 0.05).
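
The race coefficients above are all comparisons against Black because lm() uses the first factor level as the reference category. The following sketch, on simulated data with made-up effects, shows how relevel() changes the reference group without changing the fitted model; the names toy, race_w, m_black, and m_white are hypothetical.

```r
set.seed(2)

# Simulated data: BMI depends on age only, race has no true effect here
toy <- data.frame(
  race = factor(sample(c("Black", "Other", "White"), 200, replace = TRUE)),
  cage = runif(200, 0, 17)
)
toy$cbmi <- 12 + 0.6 * toy$cage + rnorm(200, sd = 3)

# Default reference is the first level alphabetically (Black)
m_black <- lm(cbmi ~ race + cage, data = toy)
names(coef(m_black))

# Make White the reference category instead
toy$race_w <- relevel(toy$race, ref = "White")
m_white <- lm(cbmi ~ race_w + cage, data = toy)
names(coef(m_white))

# Same model, different parameterization: fitted values are unchanged,
# and the Black-vs-White contrast simply flips sign
all.equal(fitted(m_black), fitted(m_white))
c(coef(m_black)["raceWhite"], coef(m_white)["race_wBlack"])
```

Choosing the reference group is a presentation decision, so it is worth stating which group it is whenever a dummy coefficient is interpreted.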

Question 11. Compare the regression model from 7) to the more advanced model from 9), which model is preferred? Why? (Make sure you run proper test to support your claim.) (3 points)

#Comparing the first and second model.
anova(fit, fit2)
# The second model is preferred. Its adjusted R-squared (0.1179) is higher than that of the first model (0.003076), so it explains more of the variation in child's body mass index. In addition, the F-test from anova(fit, fit2) gives a p-value below 0.05, indicating that adding sex, race, low birth weight status, and child's age significantly improves the fit and that the improvement is unlikely to be due to random fluctuation in the data. Thus, it is especially important to account for child's age and race when modeling how family socioeconomic status predicts child’s body mass index.
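
The nested-model F-test that anova(fit, fit2) performs can be approximately reconstructed from the two summary() outputs, which helps show what the test is measuring. This sketch uses the rounded residual standard errors printed above, so the resulting F statistic is an approximation rather than the exact anova() value.

```r
# Residual standard error and residual df, copied from the two summaries
rse1 <- 8.662; df1 <- 3601   # model from Question 7
rse2 <- 8.148; df2 <- 3596   # model from Question 9

# Residual sum of squares = (residual standard error)^2 * residual df
rss1 <- rse1^2 * df1
rss2 <- rse2^2 * df2

# Number of extra coefficients in the larger model (sex, 2 race dummies, rlbw, cage)
q <- df1 - df2

# Partial F: drop in RSS per added coefficient, relative to the
# residual variance of the larger model
f_stat  <- ((rss1 - rss2) / q) / (rss2 / df2)
p_value <- pf(f_stat, q, df2, lower.tail = FALSE)
c(F = f_stat, p = p_value)
```

A large F here means the added demographic variables reduce the unexplained variation far more than five uninformative predictors would be expected to.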

Question 12. Interpret the R-square and adjusted R-square from the preferred model. (4 points)

# The R2 (0.1199) indicates that only about 12.0% of the variation in child's body mass index is explained by the predictors in our model (family income, mother's education, family wealth with equity, sex, race, low birth weight status, and child's age). The adjusted R2 (0.1179) is based on all of the same predictors, but it applies a penalty for the number of coefficients estimated, so it increases only when an added predictor improves the fit by more than would be expected by chance. After this adjustment, about 11.8% of the variation in child's body mass index is explained by the model. The two values are close because the sample (n = 3605) is large relative to the number of predictors.
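
The adjusted R-squared is obtained from the ordinary R-squared with the standard formula 1 - (1 - R2)(n - 1)/(n - p - 1), where n is the sample size and p the number of slope coefficients. The sketch below applies it to the rounded values reported for both models; adjusted_r2() is a hypothetical helper name.

```r
# Adjusted R-squared from R-squared, sample size n, and number of slopes p
adjusted_r2 <- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

n <- 3605   # 3596 residual df + 8 slopes + 1 intercept in the second model

adj_fit  <- adjusted_r2(0.003906, n, p = 3)   # model from Question 7
adj_fit2 <- adjusted_r2(0.1199,   n, p = 8)   # model from Question 9
round(c(fit = adj_fit, fit2 = adj_fit2), 4)
```

Both results should agree with the "Adjusted R-squared" lines in the summaries above up to rounding, which confirms that the adjustment depends only on n and p, not on which predictors are significant.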

Question 13. Evaluate whether the preferred model violates any linear regression assumptions. If violation(s) exists in the preferred model, propose reasonable solution(s). (5 points)

# Examining the multicollinearity of the continuous variables
# (with() is used instead of attach(), which can mask other objects in the workspace)
with(examdata, round(cor(cbind(cbmi, adjfinc, educ, adjwlth2, cage)), 2))
##           cbmi adjfinc  educ adjwlth2  cage
## cbmi      1.00   -0.06 -0.03    -0.03  0.33
## adjfinc  -0.06    1.00  0.24     0.58  0.06
## educ     -0.03    0.24  1.00     0.21 -0.02
## adjwlth2 -0.03    0.58  0.21     1.00  0.08
## cage      0.33    0.06 -0.02     0.08  1.00
# Examining the multicollinearity of all variables
vif(fit2)
##              GVIF Df GVIF^(1/(2*Df))
## adjfinc  1.550801  1        1.245312
## educ     1.109969  1        1.053551
## adjwlth2 1.530454  1        1.237115
## sex      1.002244  1        1.001121
## race     1.096729  2        1.023352
## rlbw     1.012281  1        1.006122
## cage     1.023479  1        1.011672
# All GVIF values are close to 1, so multicollinearity does not appear to be a problem.
# Examining the assumption of linearity.
pairs(~cbmi+adjfinc+educ+cage+adjwlth2, data=examdata, main='Body Mass Index scatterplots')

# It appears that the assumption of linearity is violated in the relationship between body mass index and all of our continuous variables except age.
# Hence we can first take the log of each of the affected independent variables to see if that fixes the problem (variables with zero or negative values, which is possible for wealth, would need a shift or a different transformation). If not, we can take the log of both the independent and dependent variables.
# Until the linearity problem is fixed, checks of the other assumptions are hard to interpret. Once it is fixed, we can test for constant variance and normality of the residuals.
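
The proposed remedy can be sketched on simulated data in which the outcome truly depends on the log of a right-skewed predictor. Everything in this example is invented for illustration (the variables income and bmi, the coefficients, and the sample size); it shows the comparison one would run, not a result for the PSID data.

```r
set.seed(3)
n <- 500

# Right-skewed, strictly positive predictor (so log() is defined everywhere)
income <- rlnorm(n, meanlog = 10.5, sdlog = 0.8)

# Outcome generated as linear in log(income), plus noise
bmi <- 30 - 1.2 * log(income) + rnorm(n, sd = 1)

# Fit the untransformed and the log-transformed specification
m_linear <- lm(bmi ~ income)
m_log    <- lm(bmi ~ log(income))

# Lower AIC indicates the better-fitting specification
AIC(m_linear, m_log)

# After choosing a specification, the usual residual checks would follow, e.g.:
# plot(m_log, which = 1)   # residuals vs fitted: linearity, constant variance
# plot(m_log, which = 2)   # normal Q-Q plot: normality of residuals
```

On real data the transformed model is not guaranteed to fit better, so the comparison and the residual plots should both be inspected before settling on a specification.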

Question 14. Based on the analysis you’ve done so far, write a short paragraph to summarize your findings regarding the relationship between family socioeconomic background and child’s body mass index. (5 points)

# Multiple linear regression was carried out to investigate how family socioeconomic status predicts child’s body mass index. In the preferred model, there were significant relationships between body mass index and child's age (p < 0.001), family income (p < 0.001), and Other race compared to Black race (p < 0.001). For every additional year of the child's age, the expected body mass index increases by 0.6527, holding all else constant. Compared to Black children, children of Other racial background have a body mass index that is 2.196 lower on average, holding all else constant. For every one-unit increase in adjusted family income, the expected body mass index decreases by 5.314e-06, holding all else constant.
# The R2 indicates that only about 12% of the variation in child's body mass index is explained by the predictors in our model (family income, mother's education, family wealth with equity, sex, race, low birth weight status, and child's age). It is worth noting that family wealth and mother's education were not significant in our model. This differs from what is reported in the literature.