Load Libraries

library(haven)
library(readr)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)
library(broom)
library(nortest)
library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
library(lmtest)
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
library(ipumsr)
library(psych)
## 
## Attaching package: 'psych'
## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --
## v tibble  3.0.3     v stringr 1.4.0
## v tidyr   1.1.1     v forcats 0.5.0
## v purrr   0.3.4
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x psych::%+%()    masks ggplot2::%+%()
## x psych::alpha()  masks ggplot2::alpha()
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
## x MASS::select()  masks dplyr::select()
library(Hmisc)
## Loading required package: lattice
## Loading required package: survival
## Loading required package: Formula
## 
## Attaching package: 'Hmisc'
## The following object is masked from 'package:psych':
## 
##     describe
## The following objects are masked from 'package:dplyr':
## 
##     src, summarize
## The following objects are masked from 'package:base':
## 
##     format.pval, units
library(lme4)
## Loading required package: Matrix
## 
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
## 
##     expand, pack, unpack
library(knitr)
Psid<- read_dta("C:/Users/chris/Downloads/psid_cds.dta")
View(Psid)

Create a new dataset without missing data (rows with a missing value in any variable are dropped)

newPsid <- na.omit(Psid)

Question 1). Recode tenure so that 1=owning a house, 0=renting a house, and missing is set to missing (NA)

Psid<- read_dta("C:/Users/chris/Downloads/psid_cds.dta")
Psidrecode <- Psid %>% 
 mutate(tenure=ifelse(tenure==5,0,
         ifelse(tenure==1,1,NA))) 
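
One way to check a recode like this is to run it on a few known values and tabulate the result. The sketch below uses a small made-up data frame (`toy`) rather than the PSID file; the codes 1 and 5 come from the recode above, while the value 8 is a hypothetical "other" code added only for illustration.

```r
library(dplyr)

# Toy data standing in for Psid$tenure (not the real PSID values).
# 1 = own and 5 = rent follow the recode above; 8 is a hypothetical other code.
toy <- tibble(tenure = c(1, 5, 8, NA, 1))

toy_recode <- toy %>%
  mutate(tenure = case_when(
    tenure == 1 ~ 1,    # owns a house
    tenure == 5 ~ 0,    # rents a house
    TRUE ~ NA_real_     # any other code, or an already-missing value, becomes NA
  ))

# Tabulate with NA kept visible to confirm the recode
table(toy_recode$tenure, useNA = "ifany")
```

`case_when()` is used here only as a more readable alternative to nested `ifelse()`; both give the same result on these values.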

Question 2). Check whether crace3 (child’s race) variable is a factor variable.

is.factor(Psid$crace3)
## [1] FALSE
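
Since the check returns FALSE, the variable would need converting before R treats it as categorical. A minimal sketch, assuming the coding 1=white, 2=black, 3=other stated under Question 10, and using a made-up vector in place of `Psid$crace3`:

```r
# Toy vector standing in for Psid$crace3 (values are illustrative only)
crace3 <- c(1, 2, 3, 1, 2)

# Convert to a factor; the labels assume the coding given in Question 10
crace3_f <- factor(crace3,
                   levels = c(1, 2, 3),
                   labels = c("white", "black", "other"))

is.factor(crace3_f)   # now a factor
levels(crace3_f)      # first level is the reference group in a regression
```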

Question 3). Run a test to see if there are significant differences in body mass index among kids of different racial backgrounds.

mbic <- aov(cbmi ~ crace3, data= newPsid)
anova(mbic)
## Analysis of Variance Table
## 
## Response: cbmi
##             Df Sum Sq Mean Sq F value Pr(>F)
## crace3       1    159  158.55  2.1148  0.146
## Residuals 3571 267719   74.97

Null: There are no differences in the body mass index of kids from different racial backgrounds

Research: There are differences in body mass index of kids from different racial backgrounds

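Since the p-value (0.146) is greater than the 0.05 significance level, I fail to reject the null hypothesis: the test (F = 2.1148) gives no statistically significant evidence of differences in the Body Mass Index (BMI) of children from different racial backgrounds. Note that crace3 is not stored as a factor (Question 2), so the model treated it as a single numeric predictor (Df = 1) rather than comparing the three racial groups.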
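
The ANOVA table above reports Df = 1 for crace3, which is what happens when a three-category variable is entered as a number. The sketch below illustrates the difference on simulated data; the data frame `sim` and its values are made up, and only the variable names mirror the document.

```r
set.seed(1)

# Simulated stand-in for newPsid: names mirror the document, values are invented
sim <- data.frame(
  cbmi   = rnorm(300, mean = 16, sd = 3),
  crace3 = sample(1:3, 300, replace = TRUE)
)

# Race entered as a number: one slope, so 1 degree of freedom
fit_numeric <- aov(cbmi ~ crace3, data = sim)

# Race entered as a factor: three groups compared, so 2 degrees of freedom
fit_factor <- aov(cbmi ~ factor(crace3), data = sim)

df_numeric <- anova(fit_numeric)$Df[1]
df_factor  <- anova(fit_factor)$Df[1]
c(numeric = df_numeric, factor = df_factor)
```

Wrapping the variable in `factor()` is what turns the test into a comparison of group means, which is what the question asks for.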

Question 4). Run a test to see if there is a significant difference in body mass index between girls and boys.

t.test(cbmi~csex, data = newPsid)
## 
##  Welch Two Sample t-test
## 
## data:  cbmi by csex
## t = 1.3577, df = 3568.9, p-value = 0.1746
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.1745978  0.9609947
## sample estimates:
## mean in group 1 mean in group 2 
##        16.55198        16.15878

The null: There is no significant difference in body mass index between girls and boys

The research: There is a significant difference in body mass index between girls and boys

Since the p-value (0.1746) is greater than the alpha value (0.05), we fail to reject the null hypothesis and conclude that there is no statistically significant difference between the mean Body Mass Index (BMI) of boys and girls. Boys (group 1) have a mean of 16.55198 while girls (group 2) have a mean of 16.15878. The difference of about 0.39 is not significant, and the 95% confidence interval (-0.17 to 0.96) includes zero.
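
The decision can be read directly off the numbers in the t-test output. The short check below copies the reported values by hand (it does not rerun the test) and confirms that the confidence interval is centred on the difference between the two group means.

```r
# Values copied from the Welch t-test output above
p_value  <- 0.1746
ci       <- c(-0.1745978, 0.9609947)
mean_g1  <- 16.55198
mean_g2  <- 16.15878
alpha    <- 0.05

mean_diff     <- mean_g1 - mean_g2        # difference in group means
reject_null   <- p_value < alpha          # is the result significant at 5%?
ci_spans_zero <- ci[1] < 0 && ci[2] > 0   # does the 95% CI include zero?

mean_diff
reject_null
ci_spans_zero
```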

Question 5). Review relevant literature and identify three family socioeconomic variables from the variable list that are relevant to child’s body mass index. Be sure to cite at least TWO references to support your claim.

1. Mother’s educational attainment

2. Family Income

3. Maternal Employment (emp2)

References

1. Berge, J. M. (2009). A review of familial correlates of child and adolescent obesity: What has the 21st century taught us so far? International Journal of Adolescent Medicine and Health, 21(4), 457–483.

2. Ziol-Guest, K. M., Dunifon, R. E., & Kalil, A. (2013). Parental employment and children’s body weight: Mothers, others, and mechanisms. Social Science & Medicine, 95, 52–59.

Question 6). Examine the means, medians, and standard deviation for variables you identified in 5).

summary(newPsid$adjfinc)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0   20675   42879   58285   74641 4824656
sd(newPsid$adjfinc)
## [1] 118676.6
summary(newPsid$educ)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00   12.00   13.00   13.74   16.00   20.00
sd(newPsid$educ)
## [1] 3.076806
summary(newPsid$emp2)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   1.000   0.663   1.000   1.000
sd(newPsid$emp2)
## [1] 0.4727413

Question 7). Estimate a regression model with all independent variables you identified in 5), and interpret the output of regression analysis.

fit <- lm( cbmi ~adjfinc + educ + emp2, data=newPsid)
summary(fit)
## 
## Call:
## lm(formula = cbmi ~ adjfinc + educ + emp2, data = newPsid)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -17.189  -1.818   0.571   4.235  38.899 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.719e+01  6.730e-01  25.545  < 2e-16 ***
## adjfinc     -4.292e-06  1.256e-06  -3.417 0.000639 ***
## educ        -5.227e-02  4.912e-02  -1.064 0.287407    
## emp2         1.943e-01  3.116e-01   0.624 0.532891    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.645 on 3569 degrees of freedom
## Multiple R-squared:  0.00426,    Adjusted R-squared:  0.003423 
## F-statistic: 5.089 on 3 and 3569 DF,  p-value: 0.001625
anova(fit)
## Analysis of Variance Table
## 
## Response: cbmi
##             Df Sum Sq Mean Sq F value    Pr(>F)    
## adjfinc      1   1041 1041.43 13.9346 0.0001922 ***
## educ         1     71   70.56  0.9442 0.3312718    
## emp2         1     29   29.07  0.3890 0.5328906    
## Residuals 3569 266737   74.74                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

For Family income - For every one-unit increase in family income, the child’s body mass index is expected to decrease by 4.292e-06, holding all else constant. The association between family income and child’s body mass index is statistically significant at the 0.001 level (p = 0.000639).

For Mother’s education - For every additional year of education completed by the child’s mother, the child’s body mass index is expected to decrease by 0.052 (5.227e-02), holding all else constant. The association between mother’s education and child’s body mass index is not statistically significant, based on a p-value of 0.287407, which is higher than 0.05.

For Maternal Employment - Compared to women who are not currently employed, women who are currently employed are expected to have children with a BMI 0.19 higher, holding all other variables constant. The difference is not statistically significant, based on a p-value of 0.532891, which is higher than 0.05.
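
The income coefficient is tiny because it is expressed per single unit of income. Rescaling the predictor changes the units of the coefficient without changing the model. The sketch below shows this on simulated data (the data frame `sim` is made up) and then applies the same arithmetic to the reported coefficient; the choice of 10,000 units is only an illustration.

```r
set.seed(2)

# Simulated stand-in for the income and BMI columns (values are invented)
sim <- data.frame(adjfinc = runif(200, min = 0, max = 150000))
sim$cbmi <- 17 - 4e-06 * sim$adjfinc + rnorm(200, sd = 3)

# Slope per one unit of income
b_unit <- unname(coef(lm(cbmi ~ adjfinc, data = sim))["adjfinc"])

# Slope per 10,000 units of income: same model, rescaled predictor
b_10k <- unname(coef(lm(cbmi ~ I(adjfinc / 10000), data = sim))[2])

# The rescaled slope is exactly 10,000 times the original slope
same_model <- isTRUE(all.equal(b_10k, b_unit * 10000))
same_model

# Applying the same rescaling to the coefficient reported above
reported_per_10k <- -4.292e-06 * 10000
reported_per_10k
```

Read this way, the reported estimate corresponds to a decrease of about 0.043 BMI points for each additional 10,000 units of family income.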

In addition to the three family socioeconomic background variables you identified from 5), previous research suggests that several demographic variables are also important predictors of body mass index, including child’s age, sex, race, and low birth weight status.

Question 8). Provide appropriate descriptive statistics for child’s age, sex, race, and low birth weight status.

summarise(newPsid, Min_value=min(cage, na.rm = T), Max_value=max(cage, na.rm = T), 
          Mean_value=mean(cage, na.rm = T), sd_value=sd(cage, na.rm = T), Median_value=median(cage, na.rm = T))
## # A tibble: 1 x 5
##   Min_value Max_value Mean_value sd_value Median_value
##       <dbl>     <dbl>      <dbl>    <dbl>        <dbl>
## 1         1        17       8.41     4.54            8
summary(newPsid$csex)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   1.000   2.000   1.505   2.000   2.000
sd(newPsid$csex)
## [1] 0.5000432
summary(newPsid$crace3)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   1.000   1.000   1.476   2.000   3.000
sd(newPsid$crace3)
## [1] 0.5870847
summary(newPsid$lbw)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.08928 0.00000 1.00000
sd(newPsid$lbw)
## [1] 0.2851884
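
For categorical variables such as sex, race and low birth weight status, counts and percentages are usually easier to read than a mean and standard deviation of the numeric codes. A minimal sketch on a made-up vector, assuming the coding 1=male, 2=female used in the interpretation under Question 10:

```r
# Toy vector standing in for newPsid$csex (values are illustrative only)
csex <- c(1, 2, 2, 1, 2, 1, 2, 2)

# Label the codes; 1 = male, 2 = female is an assumption taken from Question 10
csex_f <- factor(csex, levels = c(1, 2), labels = c("male", "female"))

counts   <- table(csex_f)                        # frequency of each category
percents <- round(100 * prop.table(counts), 1)   # share of each category in %

counts
percents
```

The same pattern applies to crace3 and lbw with their own level labels.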

Question 9). Estimate another regression model with all the independent variables you identified in 5) and these child’s demographic variables.

fit2 <- lm(cbmi ~ factor(emp2) + factor(csex) + factor(crace3) + factor(lbw) + adjfinc + educ + cage, data = newPsid)
summary(fit2)
## 
## Call:
## lm(formula = cbmi ~ factor(emp2) + factor(csex) + factor(crace3) + 
##     factor(lbw) + adjfinc + educ + cage, data = newPsid)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -22.836  -1.670   1.267   4.034  40.880 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      1.163e+01  7.433e-01  15.644  < 2e-16 ***
## factor(emp2)1   -4.339e-01  2.953e-01  -1.469   0.1418    
## factor(csex)2   -1.290e-01  2.724e-01  -0.474   0.6358    
## factor(crace3)2  1.804e-01  2.953e-01   0.611   0.5412    
## factor(crace3)3 -1.699e+00  6.574e-01  -2.584   0.0098 ** 
## factor(lbw)1    -2.871e-02  4.799e-01  -0.060   0.9523    
## adjfinc         -5.814e-06  1.195e-06  -4.867 1.18e-06 ***
## educ            -5.702e-03  4.705e-02  -0.121   0.9035    
## cage             6.552e-01  3.041e-02  21.550  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.132 on 3564 degrees of freedom
## Multiple R-squared:  0.1202, Adjusted R-squared:  0.1182 
## F-statistic: 60.86 on 8 and 3564 DF,  p-value: < 2.2e-16

Question 10). Interpret the coefficients of sex, child’s age and race variables from the model from 9).

For Sex - Compared to male children, female children have a body mass index that is 0.129 lower, holding all else constant. This difference is not statistically significant, given a p-value (0.6358) that is greater than the alpha value (0.05).

Child’s age - For every additional year of the child’s age, BMI is expected to increase by 0.655, holding all other variables constant, and this association is statistically significant at the 0.001 level based on the p-value of <2e-16.

Child’s race - White is the reference group (child’s race: 1=white, 2=black, 3=other). Compared to white children, black children have a BMI that is 0.18 higher, holding all else constant, and the difference is not statistically significant given that the p-value of 0.5412 is greater than 0.05.

Compared to white children, children of other races have a BMI that is 1.70 lower, holding all else constant, and this difference is statistically significant at the 0.01 level based on the p-value of 0.0098.

Question 11). Compare the regression model from 7) to the more advanced model from 9), which model is preferred? Why? (Make sure you run proper test to support your claim.)

anova(fit, fit2)
## Analysis of Variance Table
## 
## Model 1: cbmi ~ adjfinc + educ + emp2
## Model 2: cbmi ~ factor(emp2) + factor(csex) + factor(crace3) + factor(lbw) + 
##     adjfinc + educ + cage
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
## 1   3569 266737                                  
## 2   3564 235681  5     31055 93.924 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Comparing the two models shows that adding child’s sex, race, low birth weight status and age significantly improves the fit (F = 93.924 on 5 degrees of freedom, p < 2.2e-16, significant at the 0.001 level). The second, more advanced model is therefore preferred when modelling how family socioeconomic status predicts child’s body mass index.
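
The F statistic in the comparison table can be reproduced by hand from the residual sums of squares and degrees of freedom it reports. The values below are copied from that table; because the printed RSS values are rounded, the result matches the reported F only approximately.

```r
# Values copied from the anova(fit, fit2) table above
rss1 <- 266737; df1 <- 3569   # Model 1 (three socioeconomic predictors)
rss2 <- 235681; df2 <- 3564   # Model 2 (adds sex, race, lbw and age)

# F = (drop in RSS per added parameter) / (residual variance of the larger model)
f_stat <- ((rss1 - rss2) / (df1 - df2)) / (rss2 / df2)

# Upper-tail probability under the F distribution
p_val <- pf(f_stat, df1 = df1 - df2, df2 = df2, lower.tail = FALSE)

f_stat
p_val
```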

Question 12). Interpret the R-square and adjusted R-square from the preferred model.

R-squared: 0.1202, Adjusted R-squared: 0.1182 (from the preferred model)

The R-squared of 0.1202 means that about 12% of the total variation in child’s body mass index is explained by family income, mother’s education, maternal employment, and child’s age, sex, race and low birth weight status. The adjusted R-squared of 0.1182 means that, after adjusting for the number of predictors in the model, these variables explain about 11.82% of the variation in child’s body mass index.
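
The adjusted R-squared follows from the R-squared, the sample size and the number of estimated slopes. The sketch below plugs in the values reported for the preferred model; the sample size is inferred from the residual degrees of freedom (3564) plus the 9 estimated coefficients, which is an assumption about how the output was produced.

```r
# Values taken from the summary(fit2) output above
r2        <- 0.1202
resid_df  <- 3564   # residual degrees of freedom
k         <- 8      # number of slopes (model degrees of freedom in the F test)

n <- resid_df + k + 1   # inferred sample size: slopes plus intercept

# Adjusted R-squared penalises R-squared for the number of predictors
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - k - 1)

n
round(adj_r2, 4)
```

With more than 3,500 observations and only 8 slopes, the penalty is small, which is why the two values are so close.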

Question 13). Evaluate whether the preferred model violates any linear regression assumptions. If violation(s) exists in the preferred model, propose reasonable solution(s).

plot(fit2, which=1) 
plot(fit2)

bptest(fit2)
## 
##  studentized Breusch-Pagan test
## 
## data:  fit2
## BP = 113.7, df = 8, p-value < 2.2e-16
ad.test(resid(fit2))
## 
##  Anderson-Darling normality test
## 
## data:  resid(fit2)
## A = 146.32, p-value < 2.2e-16

The residual plots suggest a non-linear relationship between the predictors and child’s body mass index, and they show unusual data points such as outliers. The Breusch-Pagan test (BP = 113.7, p < 2.2e-16) rejects constant variance, and the Anderson-Darling test (A = 146.32, p < 2.2e-16) rejects normality of the residuals. This suggests that the linearity, normality and constant variance assumptions have all been violated. To address these problems, we can apply a log transformation to both sides of the equation, that is, to the dependent variable and to the continuous independent variables such as family income (which has a minimum of 0, so zero values need handling before taking logs). This may reduce both problems; the diagnostics should be rerun on the transformed model to confirm.
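
A log transformation of the kind proposed above could be sketched as follows. This is an illustration on simulated data, not a refit of the PSID model: the data frame `sim` and its coefficients are made up, and `log1p()` (the log of 1 plus the value) is one possible way to handle the zero incomes seen in the Question 6 summary.

```r
set.seed(3)
n <- 500

# Simulated stand-in data: names mirror the document, values are invented.
# One income is set to 0 to mimic the minimum of 0 in the real data.
sim <- data.frame(
  adjfinc = c(0, rlnorm(n - 1, meanlog = 10.5, sdlog = 1)),
  cage    = sample(1:17, n, replace = TRUE)
)
sim$cbmi <- exp(2.5 + 0.03 * sim$cage - 0.01 * log1p(sim$adjfinc) +
                rnorm(n, sd = 0.2))

# Log outcome, log(1 + income) for the skewed predictor, age left as is
fit_log <- lm(log(cbmi) ~ log1p(adjfinc) + cage, data = sim)

# With a logged outcome, a slope b means roughly 100 * (exp(b) - 1) percent
# change in BMI per one-unit increase in the predictor
pct_per_year <- 100 * (exp(coef(fit_log)[["cage"]]) - 1)

coef(fit_log)
pct_per_year
```

After refitting on the real data, `plot()` and the same Breusch-Pagan and Anderson-Darling tests would show whether the transformation actually helped.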

Question 14). Based on the analysis you’ve done so far, write a short paragraph to summarize your findings regarding the relationship between family socioeconomic background and child’s body mass index.

We performed multiple linear regression to investigate how family socioeconomic status predicts a child’s body mass index. We found significant relationships between child’s age and BMI (p < 0.001) and between family income and BMI (p < 0.001), and children of other races had significantly lower BMI than white children (p < 0.01). We also observed that there is no statistically significant difference between the mean BMI of boys and girls, and low birth weight status was not significant. Maternal employment and mother’s education were also not significant in our model, which is contrary to most literature.