library(readr)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
library(ggplot2)
library(haven)
library(broom)
library(tibble)
library(tidyr)
library(stringr)
library(forcats)

The research question you are going to analyze is how family socioeconomic status predicts child’s body mass index. Body mass index is a strong indicator for child development. Higher body mass index generally indicates a higher risk of obesity or overweight.

To prepare your analysis, you always need to clean your data. Although I have done most of the data cleaning for you, you will need to a couple of additional data preparations.

Data:

psid <- read_dta("psid_cds.dta")
View(psid)

1). Recode tenure so that 1=owning a house, 0=renting a house, and missing is set to missing (NA). (2 point)

psid %>% 
  mutate(ten_new = ifelse(tenure == 1, "own home", 
                          ifelse(tenure == 0, "rent home", NA)))
## # A tibble: 4,243 x 18
##     cage adjwlth1 adjwlth2   age tenure marpi  educ   smk   emp adjfinc  csex
##    <dbl>    <dbl>    <dbl> <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl>   <dbl> <dbl>
##  1    14    0          0      28      5     0     9     1     1  33951.     2
##  2     5    0          0      28      5     0     9     1     1  33951.     2
##  3     2   35        350      28      1     1    20     1     6 251015.     1
##  4     7   35        350      28      1     1    20     1     6 251015.     2
##  5     5    0.400     10.8    28      1     0    12     0     1  29265.     2
##  6     8   -2         34.4    40      1     1    16     0     1 124534.     2
##  7    13   -2         34.4    40      1     1    16     0     1 124534.     2
##  8    10   -2         34.4    40      1     1    16     0     1 124534.     1
##  9     2  304.       304.     31      5     1    20     0     1 106489.     2
## 10     1  304.       304.     31      5     1    20     0     1 106489.     2
## # ... with 4,233 more rows, and 7 more variables: lbw <dbl>, cbmi <dbl>,
## #   tkids <dbl>, lifesat <dbl+lbl>, crace3 <dbl+lbl>, emp2 <dbl>, ten_new <chr>

2). Check whether crace3 (child’s race) variable is a factor variable. (1 point)

is.factor(psid$crace3)
## [1] FALSE
# crace3 is not a factor variable

Now, do some initial investigation: 3). Run a test to see if there are significant differences in body mass index among kids of different racial backgrounds. (3 points)

psid_r <- psid %>% 
transmute(race = ifelse(crace3 == 1, "white",
                        ifelse(crace3 == 2, "black",
                               ifelse(crace3 == 3, "other", NA))),
          bmi = cbmi)

one.way <- aov(bmi ~ race, data = psid_r)

summary(one.way)
##               Df Sum Sq Mean Sq F value Pr(>F)  
## race           2    602  300.84   3.969  0.019 *
## Residuals   3935 298230   75.79                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 305 observations deleted due to missingness
#Null hypothesis: There is no difference in bmi between the racial groups.

#Research Hypothesis: There is a difference in bmi between the racial groups.

#Given the high F-value and small p-value, it seems likely that there is a significant difference in child's bmi by race.  

4). Run a test to see if there is a significant difference in body mass index between girls and boys. (3 points)

psid_s <- psid %>% 
  transmute(sex = ifelse(csex == 1,"male",
                         ifelse(csex == 2, "female", NA)),
            bmi = cbmi)

t.test(psid_s$bmi~psid_s$sex)
## 
##  Welch Two Sample t-test
## 
## data:  psid_s$bmi by psid_s$sex
## t = -0.56235, df = 4125.5, p-value = 0.5739
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.6882736  0.3814424
## sample estimates:
## mean in group female   mean in group male 
##             16.29597             16.44939
#Null hypothesis: The difference in bmi between males and females is zero.

#Research hypothesis: The difference in bmi between males and females is not zero.

#The t-statistic is smaller than the critical value, which indicates that we fail to reject the null hypothesis, meaning that the difference in bmi by sex is zero.

Note: Make sure you state the null and alternative hypotheses for each of these tests, and interpret the outputs thoroughly.

5). Review relevant literature and identify three family socioeconomic variables from the variable list that are relevant to child’s body mass index. Be sure to cite at least TWO references to support your claim. (3 points)

# Three family socioeconomic variables relevant to child's bmi are adjusted family income, mother's education, and mother's employment.

#Ettinger, A. K., Riley, A. W., & Price, C. E. (2018). Increasing maternal employment influences child overweight/obesity among ethnically diverse families. Journal of Family Issues, 39(10), 2836-2861. 
#Jin, Y., & Jones-Smith, J. C. (2015). Peer Reviewed: Associations between Family Income and Children’s Physical Fitness and Obesity in California, 2010–2012. Preventing Chronic Disease, 12. 
#Muthuri, S. K., Onywera, V. O., Tremblay, M. S., Broyles, S. T., Chaput, J.-P., Fogelholm, M., . . . Lambert, E. V. (2016). Relationships between parental education and overweight with childhood overweight and physical activity in 9–11 year old children: Results from a 12-country study. PLoS One, 11(8), e0147746.

6). Examine the means, medians, and standard deviation for variables you identified in 5). (3 points)

mean(psid$adjfinc)
## [1] 55378.59
median(psid$adjfinc)
## [1] 40088.82
sd(psid$adjfinc)
## [1] 110723.7
mean(psid$educ, na.rm=T)
## [1] 13.59814
median(psid$educ, na.rm=T)
## [1] 13
sd(psid$educ, na.rm=T)
## [1] 3.047127
mean(psid$emp2, na.rm=T)
## [1] 0.6520198
median(psid$emp2, na.rm=T)
## [1] 1
sd(psid$emp2, na.rm=T)
## [1] 0.476386

7). Estimate a regression model with all independent variables you identified in 5), and interpret the output of regression analysis. (6 points)

psid2 <- psid %>% 
  transmute(sex = ifelse(csex == 1, "male",
                         ifelse(csex ==2, "female", NA)),
            age = cage,
            race = ifelse(crace3 == 1, "white",
                          ifelse(crace3 == 2, "black",
                                 ifelse(crace3 == 3, "other", NA))),
            lbw = ifelse(lbw == 1, "yes",
                         ifelse(lbw == 0, "no", NA)),
            bmi = cbmi,
            faminc = adjfinc,
            educ = educ,
            emp = ifelse(emp2 == 1, "yes",
                         ifelse(emp2 == 0, "no", NA))
            
            )
psid2 <- psid2 %>% 
  mutate(race_blk = ifelse(psid2$race == 'black', 1,0),
         race_othr = ifelse(psid2$race == 'other', 1,0),
         sex_m = ifelse(psid2$sex == 'male', 1,0),
         lbwyes = ifelse(psid2$lbw == 'yes', 1,0),
         empyes = ifelse(psid2$emp == 'yes', 1,0 ))


  

fit <- lm(bmi ~ faminc + educ + empyes, data = psid2)

summary(fit)
## 
## Call:
## lm(formula = bmi ~ faminc + educ + empyes, data = psid2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -17.565  -1.828   0.593   4.390  38.725 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.757e+01  6.334e-01  27.738  < 2e-16 ***
## faminc      -4.266e-06  1.259e-06  -3.389 0.000708 ***
## educ        -7.075e-02  4.668e-02  -1.516 0.129688    
## empyes       4.833e-02  2.924e-01   0.165 0.868728    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.721 on 4073 degrees of freedom
##   (166 observations deleted due to missingness)
## Multiple R-squared:  0.004217,   Adjusted R-squared:  0.003484 
## F-statistic:  5.75 on 3 and 4073 DF,  p-value: 0.0006383
# Results of the linear model show very small coefficient values for each of the indicators. Adjusted family income was the only variable to achieve statistical significance. Each one unit increase of adjusted family income was associated with a 4.2e-6 decrease in child's bmi. The intercept tells the value of the child's bmi when the predictor variables (x) are equal to zero.

In addition to the three family socioeconomic background variables you identified from 5), previous research suggests that several demographic variables are also important predictors of body mass index, including child’s age, sex, race, and low birth weight status.

8). Provide appropriate descriptive statistics for child’s age, sex, race, and low birth weight status. (3 points)

library(Hmisc)
## Loading required package: lattice
## Loading required package: survival
## Loading required package: Formula
## 
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:dplyr':
## 
##     src, summarize
## The following objects are masked from 'package:base':
## 
##     format.pval, units
library(kableExtra)
## 
## Attaching package: 'kableExtra'
## The following object is masked from 'package:dplyr':
## 
##     group_rows
summary(psid$cage)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   1.000   5.000   8.000   8.572  12.000  17.000       7
sd(psid$cage, na.rm =T)
## [1] 4.539147
sex_c <- table(psid_s$sex)/length(psid_s$sex)
sex_prc <- sex_c*100
kable(sex_prc)
Var1 Freq
female 50.69526
male 49.30474
race_c <- table(psid_r$race)/length(psid_r$race)
race_prc <- race_c*100
kable(race_prc)
Var1 Freq
black 39.028989
other 4.619373
white 51.614424
lbw_c <- table(psid$lbw)/length(psid$lbw)
lbw_prc <- lbw_c*100
kable(lbw_prc)
Var1 Freq
0 80.48551
1 11.45416

9). Estimate another regression model with all the independent variables you identified in 5) and these child’s demographic variables. (3 points)

fit1 <- lm(bmi ~ sex_m + race_blk + race_othr + lbwyes + age + faminc + empyes +educ, data = psid2)

na.omit(psid2)
## # A tibble: 3,599 x 13
##    sex     age race  lbw     bmi faminc  educ emp   race_blk race_othr sex_m
##    <chr> <dbl> <chr> <chr> <dbl>  <dbl> <dbl> <chr>    <dbl>     <dbl> <dbl>
##  1 fema~    14 white no      0   3.40e4     9 yes          0         0     0
##  2 fema~     5 white no      0   3.40e4     9 yes          0         0     0
##  3 male      2 white no     16.1 2.51e5    20 no           0         0     1
##  4 fema~     7 white no     17.6 2.51e5    20 no           0         0     0
##  5 fema~     5 white no     19.1 2.93e4    12 yes          0         0     0
##  6 fema~     8 white no     14.5 1.25e5    16 yes          0         0     0
##  7 fema~    13 white yes    19.7 1.25e5    16 yes          0         0     0
##  8 male     10 white no     18.5 1.25e5    16 yes          0         0     1
##  9 fema~     2 white no     14.9 1.06e5    20 yes          0         0     0
## 10 fema~     1 white no      0   1.06e5    20 yes          0         0     0
## # ... with 3,589 more rows, and 2 more variables: lbwyes <dbl>, empyes <dbl>
summary(fit1)
## 
## Call:
## lm(formula = bmi ~ sex_m + race_blk + race_othr + lbwyes + age + 
##     faminc + empyes + educ, data = psid2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -22.815  -1.677   1.290   4.050  40.862 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.151e+01  7.359e-01  15.638  < 2e-16 ***
## sex_m        1.192e-01  2.718e-01   0.438  0.66112    
## race_blk     1.877e-01  2.946e-01   0.637  0.52403    
## race_othr   -2.005e+00  6.499e-01  -3.085  0.00205 ** 
## lbwyes      -1.711e-01  4.759e-01  -0.360  0.71920    
## age          6.520e-01  3.032e-02  21.504  < 2e-16 ***
## faminc      -5.750e-06  1.195e-06  -4.811 1.57e-06 ***
## empyes      -4.430e-01  2.942e-01  -1.506  0.13220    
## educ        -3.600e-03  4.671e-02  -0.077  0.93857    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.143 on 3590 degrees of freedom
##   (644 observations deleted due to missingness)
## Multiple R-squared:  0.1191, Adjusted R-squared:  0.1171 
## F-statistic: 60.67 on 8 and 3590 DF,  p-value: < 2.2e-16

10). Interpret the coefficients of sex, child’s age and race variables from the model from 9). (8 points)

#Compared to females, male children are associated with a 1.3e-01 higher bmi, all other factors being constant. This relationship is not statistically significant.

#Each single unit increase in child's age is associated with a .65 unit increase in child's bmi. This relationship is statistically significant (p<.001). 

#Compared to whites, black children are associated with a .2 higher bmi, all other factors being constant. This relationship is not statistically significant.

#Compared to whites, children of other race are associated with bmi scores that are two points lower, all other factors being constant. This relationship is statistically significant (p<.01).

11). Compare the regression model from 7) to the more advanced model from 9), which model is preferred? Why? (Make sure you run proper test to support your claim.) (3 points)

data(psid2)  # I had to use these next few chunks to fit both models to same dataset.
## Warning in data(psid2): data set 'psid2' not found
psid2[1,2] <- NA
nobs( fit1 <- lm(bmi~sex_m + race_blk + race_othr + lbwyes + age + faminc + empyes +educ, data = psid2) ) 
## [1] 3598
nobs( update(fit, .~.-sex_m - race_blk - race_othr - lbwyes - age) )  
## [1] 4077
nobs( fit <- update(fit1, .~.-sex_m - race_blk - race_othr - lbwyes - age, data=fit1$model) ) 
## [1] 3598
update_nested <- function(object, formula., ..., evaluate = TRUE){
    update(object = object, formula. = formula., data = object$model, ..., evaluate = evaluate)
}
nobs( fit2 <- update_nested(fit1, .~.-sex_m - race_blk - race_othr - lbwyes - age) )
## [1] 3598
all.equal(fit, fit2)  
## [1] "Component \"call\": target, current do not match when deparsed"
identical(fit[-10], fit2[-10])
## [1] TRUE
summary(fit1)
## 
## Call:
## lm(formula = bmi ~ sex_m + race_blk + race_othr + lbwyes + age + 
##     faminc + empyes + educ, data = psid2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -22.817  -1.683   1.289   4.042  40.846 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.155e+01  7.356e-01  15.707  < 2e-16 ***
## sex_m        1.075e-01  2.716e-01   0.396  0.69238    
## race_blk     1.738e-01  2.945e-01   0.590  0.55510    
## race_othr   -2.024e+00  6.494e-01  -3.116  0.00185 ** 
## lbwyes      -1.758e-01  4.755e-01  -0.370  0.71166    
## age          6.535e-01  3.030e-02  21.565  < 2e-16 ***
## faminc      -5.756e-06  1.195e-06  -4.819  1.5e-06 ***
## empyes      -4.334e-01  2.940e-01  -1.474  0.14056    
## educ        -7.072e-03  4.670e-02  -0.151  0.87965    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.137 on 3589 degrees of freedom
##   (645 observations deleted due to missingness)
## Multiple R-squared:  0.1197, Adjusted R-squared:  0.1177 
## F-statistic:    61 on 8 and 3589 DF,  p-value: < 2.2e-16
anova(fit,fit1)
## Analysis of Variance Table
## 
## Model 1: bmi ~ faminc + empyes + educ
## Model 2: bmi ~ sex_m + race_blk + race_othr + lbwyes + age + faminc + 
##     empyes + educ
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
## 1   3594 268830                                  
## 2   3589 237623  5     31207 94.269 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#The more complex model is preffered because the low p-value in the anova comparison shows that the improved fit is likely not due to random chance, and that the demographic variables are important considerations when modeling childhood bmi.

12). Interpret the R-square and adjusted R-square from the preferred model. (4 points)

summary(fit1)
## 
## Call:
## lm(formula = bmi ~ sex_m + race_blk + race_othr + lbwyes + age + 
##     faminc + empyes + educ, data = psid2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -22.817  -1.683   1.289   4.042  40.846 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.155e+01  7.356e-01  15.707  < 2e-16 ***
## sex_m        1.075e-01  2.716e-01   0.396  0.69238    
## race_blk     1.738e-01  2.945e-01   0.590  0.55510    
## race_othr   -2.024e+00  6.494e-01  -3.116  0.00185 ** 
## lbwyes      -1.758e-01  4.755e-01  -0.370  0.71166    
## age          6.535e-01  3.030e-02  21.565  < 2e-16 ***
## faminc      -5.756e-06  1.195e-06  -4.819  1.5e-06 ***
## empyes      -4.334e-01  2.940e-01  -1.474  0.14056    
## educ        -7.072e-03  4.670e-02  -0.151  0.87965    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.137 on 3589 degrees of freedom
##   (645 observations deleted due to missingness)
## Multiple R-squared:  0.1197, Adjusted R-squared:  0.1177 
## F-statistic:    61 on 8 and 3589 DF,  p-value: < 2.2e-16
#About 12% of the variance seen in observations are accounted for by the independent variables used in the model, according to the R squared value.

#The adjusted R squared is slightly lower at 11.7%. The only difference between the two is that adjusted R squared takes into account the number of x variables, in effect keeping a check on the quantity of variables used in models.

13). Evaluate whether the preferred model violates any linear regression assumptions. If violation(s) exists in the preferred model, propose reasonable solution(s). (5 points)

sresid<-studres(fit1)#Histogram of residuals
hist(sresid, freq=FALSE,
     main="Histogram of Residuals")
xfit<-xfit<-seq(min(sresid),max(sresid),length=40) 
yfit<-dnorm(xfit) 
lines(xfit, yfit)

plot(fit1, which=2)

library(nortest)
## Warning: package 'nortest' was built under R version 4.0.3
ad.test(resid(fit1)) #Not normal distribution
## 
##  Anderson-Darling normality test
## 
## data:  resid(fit1)
## A = 147.2, p-value < 2.2e-16
plot(fit1, which=1)

library(lmtest)
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
bptest(fit1) #Not equal resid variance
## 
##  studentized Breusch-Pagan test
## 
## data:  fit1
## BP = 115.78, df = 8, p-value < 2.2e-16
# Since the assumption of normality and constant variance are violated, a dependent variable transformation would be a reasonable solution

14). Based on the analysis you’ve done so far, write a short paragraph to summarize your findings regarding the relationship between family socioeconomic background and child’s body mass index. (5 points)

# After reviewing literature on socioeconomic variables related to child's bmi, I expected to find significant relationships in variables like adjusted family income, mother's education attainment, and mothers employment status. After controlling for these variables and other important demographic variables, adjusted family income was the only predictor out of the initial 3 that had a significant relationship. However, the nested linear regression model violates two assumptions and will need to be adjusted which could influence the final output.