library(readr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
library(ggplot2)
library(haven)
library(broom)
library(tibble)
library(tidyr)
library(stringr)
library(forcats)
The research question you are going to analyze is how family socioeconomic status predicts child’s body mass index. Body mass index is a strong indicator for child development. Higher body mass index generally indicates a higher risk of obesity or overweight.
To prepare your analysis, you always need to clean your data. Although I have done most of the data cleaning for you, you will need to a couple of additional data preparations.
Data:
psid <- read_dta("psid_cds.dta")
View(psid)
1). Recode tenure so that 1=owning a house, 0=renting a house, and missing is set to missing (NA). (2 point)
psid %>%
mutate(ten_new = ifelse(tenure == 1, "own home",
ifelse(tenure == 0, "rent home", NA)))
## # A tibble: 4,243 x 18
## cage adjwlth1 adjwlth2 age tenure marpi educ smk emp adjfinc csex
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 14 0 0 28 5 0 9 1 1 33951. 2
## 2 5 0 0 28 5 0 9 1 1 33951. 2
## 3 2 35 350 28 1 1 20 1 6 251015. 1
## 4 7 35 350 28 1 1 20 1 6 251015. 2
## 5 5 0.400 10.8 28 1 0 12 0 1 29265. 2
## 6 8 -2 34.4 40 1 1 16 0 1 124534. 2
## 7 13 -2 34.4 40 1 1 16 0 1 124534. 2
## 8 10 -2 34.4 40 1 1 16 0 1 124534. 1
## 9 2 304. 304. 31 5 1 20 0 1 106489. 2
## 10 1 304. 304. 31 5 1 20 0 1 106489. 2
## # ... with 4,233 more rows, and 7 more variables: lbw <dbl>, cbmi <dbl>,
## # tkids <dbl>, lifesat <dbl+lbl>, crace3 <dbl+lbl>, emp2 <dbl>, ten_new <chr>
2). Check whether crace3 (child’s race) variable is a factor variable. (1 point)
is.factor(psid$crace3)
## [1] FALSE
# crace3 is not a factor variable
Now, do some initial investigation: 3). Run a test to see if there are significant differences in body mass index among kids of different racial backgrounds. (3 points)
psid_r <- psid %>%
transmute(race = ifelse(crace3 == 1, "white",
ifelse(crace3 == 2, "black",
ifelse(crace3 == 3, "other", NA))),
bmi = cbmi)
one.way <- aov(bmi ~ race, data = psid_r)
summary(one.way)
## Df Sum Sq Mean Sq F value Pr(>F)
## race 2 602 300.84 3.969 0.019 *
## Residuals 3935 298230 75.79
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 305 observations deleted due to missingness
#Null hypothesis: There is no difference in bmi between the racial groups.
#Research Hypothesis: There is a difference in bmi between the racial groups.
#Given the high F-value and small p-value, it seems likely that there is a significant difference in child's bmi by race.
4). Run a test to see if there is a significant difference in body mass index between girls and boys. (3 points)
psid_s <- psid %>%
transmute(sex = ifelse(csex == 1,"male",
ifelse(csex == 2, "female", NA)),
bmi = cbmi)
t.test(psid_s$bmi~psid_s$sex)
##
## Welch Two Sample t-test
##
## data: psid_s$bmi by psid_s$sex
## t = -0.56235, df = 4125.5, p-value = 0.5739
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.6882736 0.3814424
## sample estimates:
## mean in group female mean in group male
## 16.29597 16.44939
#Null hypothesis: The difference in bmi between males and females is zero.
#Research hypothesis: The difference in bmi between males and females is not zero.
#The t-statistic is smaller than the critical value, which indicates that we fail to reject the null hypothesis, meaning that the difference in bmi by sex is zero.
Note: Make sure you state the null and alternative hypotheses for each of these tests, and interpret the outputs thoroughly.
5). Review relevant literature and identify three family socioeconomic variables from the variable list that are relevant to child’s body mass index. Be sure to cite at least TWO references to support your claim. (3 points)
# Three family socioeconomic variables relevant to child's bmi are adjusted family income, mother's education, and mother's employment.
#Ettinger, A. K., Riley, A. W., & Price, C. E. (2018). Increasing maternal employment influences child overweight/obesity among ethnically diverse families. Journal of Family Issues, 39(10), 2836-2861.
#Jin, Y., & Jones-Smith, J. C. (2015). Peer Reviewed: Associations between Family Income and Children’s Physical Fitness and Obesity in California, 2010–2012. Preventing Chronic Disease, 12.
#Muthuri, S. K., Onywera, V. O., Tremblay, M. S., Broyles, S. T., Chaput, J.-P., Fogelholm, M., . . . Lambert, E. V. (2016). Relationships between parental education and overweight with childhood overweight and physical activity in 9–11 year old children: Results from a 12-country study. PLoS One, 11(8), e0147746.
6). Examine the means, medians, and standard deviation for variables you identified in 5). (3 points)
mean(psid$adjfinc)
## [1] 55378.59
median(psid$adjfinc)
## [1] 40088.82
sd(psid$adjfinc)
## [1] 110723.7
mean(psid$educ, na.rm=T)
## [1] 13.59814
median(psid$educ, na.rm=T)
## [1] 13
sd(psid$educ, na.rm=T)
## [1] 3.047127
mean(psid$emp2, na.rm=T)
## [1] 0.6520198
median(psid$emp2, na.rm=T)
## [1] 1
sd(psid$emp2, na.rm=T)
## [1] 0.476386
7). Estimate a regression model with all independent variables you identified in 5), and interpret the output of regression analysis. (6 points)
psid2 <- psid %>%
transmute(sex = ifelse(csex == 1, "male",
ifelse(csex ==2, "female", NA)),
age = cage,
race = ifelse(crace3 == 1, "white",
ifelse(crace3 == 2, "black",
ifelse(crace3 == 3, "other", NA))),
lbw = ifelse(lbw == 1, "yes",
ifelse(lbw == 0, "no", NA)),
bmi = cbmi,
faminc = adjfinc,
educ = educ,
emp = ifelse(emp2 == 1, "yes",
ifelse(emp2 == 0, "no", NA))
)
psid2 <- psid2 %>%
mutate(race_blk = ifelse(psid2$race == 'black', 1,0),
race_othr = ifelse(psid2$race == 'other', 1,0),
sex_m = ifelse(psid2$sex == 'male', 1,0),
lbwyes = ifelse(psid2$lbw == 'yes', 1,0),
empyes = ifelse(psid2$emp == 'yes', 1,0 ))
fit <- lm(bmi ~ faminc + educ + empyes, data = psid2)
summary(fit)
##
## Call:
## lm(formula = bmi ~ faminc + educ + empyes, data = psid2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.565 -1.828 0.593 4.390 38.725
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.757e+01 6.334e-01 27.738 < 2e-16 ***
## faminc -4.266e-06 1.259e-06 -3.389 0.000708 ***
## educ -7.075e-02 4.668e-02 -1.516 0.129688
## empyes 4.833e-02 2.924e-01 0.165 0.868728
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.721 on 4073 degrees of freedom
## (166 observations deleted due to missingness)
## Multiple R-squared: 0.004217, Adjusted R-squared: 0.003484
## F-statistic: 5.75 on 3 and 4073 DF, p-value: 0.0006383
# Results of the linear model show very small coefficient values for each of the indicators. Adjusted family income was the only variable to achieve statistical significance. Each one unit increase of adjusted family income was associated with a 4.2e-6 decrease in child's bmi. The intercept tells the value of the child's bmi when the predictor variables (x) are equal to zero.
In addition to the three family socioeconomic background variables you identified from 5), previous research suggests that several demographic variables are also important predictors of body mass index, including child’s age, sex, race, and low birth weight status.
8). Provide appropriate descriptive statistics for child’s age, sex, race, and low birth weight status. (3 points)
library(Hmisc)
## Loading required package: lattice
## Loading required package: survival
## Loading required package: Formula
##
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:dplyr':
##
## src, summarize
## The following objects are masked from 'package:base':
##
## format.pval, units
library(kableExtra)
##
## Attaching package: 'kableExtra'
## The following object is masked from 'package:dplyr':
##
## group_rows
summary(psid$cage)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1.000 5.000 8.000 8.572 12.000 17.000 7
sd(psid$cage, na.rm =T)
## [1] 4.539147
sex_c <- table(psid_s$sex)/length(psid_s$sex)
sex_prc <- sex_c*100
kable(sex_prc)
| Var1 | Freq |
|---|---|
| female | 50.69526 |
| male | 49.30474 |
race_c <- table(psid_r$race)/length(psid_r$race)
race_prc <- race_c*100
kable(race_prc)
| Var1 | Freq |
|---|---|
| black | 39.028989 |
| other | 4.619373 |
| white | 51.614424 |
lbw_c <- table(psid$lbw)/length(psid$lbw)
lbw_prc <- lbw_c*100
kable(lbw_prc)
| Var1 | Freq |
|---|---|
| 0 | 80.48551 |
| 1 | 11.45416 |
9). Estimate another regression model with all the independent variables you identified in 5) and these child’s demographic variables. (3 points)
fit1 <- lm(bmi ~ sex_m + race_blk + race_othr + lbwyes + age + faminc + empyes +educ, data = psid2)
na.omit(psid2)
## # A tibble: 3,599 x 13
## sex age race lbw bmi faminc educ emp race_blk race_othr sex_m
## <chr> <dbl> <chr> <chr> <dbl> <dbl> <dbl> <chr> <dbl> <dbl> <dbl>
## 1 fema~ 14 white no 0 3.40e4 9 yes 0 0 0
## 2 fema~ 5 white no 0 3.40e4 9 yes 0 0 0
## 3 male 2 white no 16.1 2.51e5 20 no 0 0 1
## 4 fema~ 7 white no 17.6 2.51e5 20 no 0 0 0
## 5 fema~ 5 white no 19.1 2.93e4 12 yes 0 0 0
## 6 fema~ 8 white no 14.5 1.25e5 16 yes 0 0 0
## 7 fema~ 13 white yes 19.7 1.25e5 16 yes 0 0 0
## 8 male 10 white no 18.5 1.25e5 16 yes 0 0 1
## 9 fema~ 2 white no 14.9 1.06e5 20 yes 0 0 0
## 10 fema~ 1 white no 0 1.06e5 20 yes 0 0 0
## # ... with 3,589 more rows, and 2 more variables: lbwyes <dbl>, empyes <dbl>
summary(fit1)
##
## Call:
## lm(formula = bmi ~ sex_m + race_blk + race_othr + lbwyes + age +
## faminc + empyes + educ, data = psid2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -22.815 -1.677 1.290 4.050 40.862
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.151e+01 7.359e-01 15.638 < 2e-16 ***
## sex_m 1.192e-01 2.718e-01 0.438 0.66112
## race_blk 1.877e-01 2.946e-01 0.637 0.52403
## race_othr -2.005e+00 6.499e-01 -3.085 0.00205 **
## lbwyes -1.711e-01 4.759e-01 -0.360 0.71920
## age 6.520e-01 3.032e-02 21.504 < 2e-16 ***
## faminc -5.750e-06 1.195e-06 -4.811 1.57e-06 ***
## empyes -4.430e-01 2.942e-01 -1.506 0.13220
## educ -3.600e-03 4.671e-02 -0.077 0.93857
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.143 on 3590 degrees of freedom
## (644 observations deleted due to missingness)
## Multiple R-squared: 0.1191, Adjusted R-squared: 0.1171
## F-statistic: 60.67 on 8 and 3590 DF, p-value: < 2.2e-16
10). Interpret the coefficients of sex, child’s age and race variables from the model from 9). (8 points)
#Compared to females, male children are associated with a 1.3e-01 higher bmi, all other factors being constant. This relationship is not statistically significant.
#Each single unit increase in child's age is associated with a .65 unit increase in child's bmi. This relationship is statistically significant (p<.001).
#Compared to whites, black children are associated with a .2 higher bmi, all other factors being constant. This relationship is not statistically significant.
#Compared to whites, children of other race are associated with bmi scores that are two points lower, all other factors being constant. This relationship is statistically significant (p<.01).
11). Compare the regression model from 7) to the more advanced model from 9), which model is preferred? Why? (Make sure you run proper test to support your claim.) (3 points)
data(psid2) # I had to use these next few chunks to fit both models to same dataset.
## Warning in data(psid2): data set 'psid2' not found
psid2[1,2] <- NA
nobs( fit1 <- lm(bmi~sex_m + race_blk + race_othr + lbwyes + age + faminc + empyes +educ, data = psid2) )
## [1] 3598
nobs( update(fit, .~.-sex_m - race_blk - race_othr - lbwyes - age) )
## [1] 4077
nobs( fit <- update(fit1, .~.-sex_m - race_blk - race_othr - lbwyes - age, data=fit1$model) )
## [1] 3598
update_nested <- function(object, formula., ..., evaluate = TRUE){
update(object = object, formula. = formula., data = object$model, ..., evaluate = evaluate)
}
nobs( fit2 <- update_nested(fit1, .~.-sex_m - race_blk - race_othr - lbwyes - age) )
## [1] 3598
all.equal(fit, fit2)
## [1] "Component \"call\": target, current do not match when deparsed"
identical(fit[-10], fit2[-10])
## [1] TRUE
summary(fit1)
##
## Call:
## lm(formula = bmi ~ sex_m + race_blk + race_othr + lbwyes + age +
## faminc + empyes + educ, data = psid2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -22.817 -1.683 1.289 4.042 40.846
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.155e+01 7.356e-01 15.707 < 2e-16 ***
## sex_m 1.075e-01 2.716e-01 0.396 0.69238
## race_blk 1.738e-01 2.945e-01 0.590 0.55510
## race_othr -2.024e+00 6.494e-01 -3.116 0.00185 **
## lbwyes -1.758e-01 4.755e-01 -0.370 0.71166
## age 6.535e-01 3.030e-02 21.565 < 2e-16 ***
## faminc -5.756e-06 1.195e-06 -4.819 1.5e-06 ***
## empyes -4.334e-01 2.940e-01 -1.474 0.14056
## educ -7.072e-03 4.670e-02 -0.151 0.87965
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.137 on 3589 degrees of freedom
## (645 observations deleted due to missingness)
## Multiple R-squared: 0.1197, Adjusted R-squared: 0.1177
## F-statistic: 61 on 8 and 3589 DF, p-value: < 2.2e-16
anova(fit,fit1)
## Analysis of Variance Table
##
## Model 1: bmi ~ faminc + empyes + educ
## Model 2: bmi ~ sex_m + race_blk + race_othr + lbwyes + age + faminc +
## empyes + educ
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 3594 268830
## 2 3589 237623 5 31207 94.269 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#The more complex model is preffered because the low p-value in the anova comparison shows that the improved fit is likely not due to random chance, and that the demographic variables are important considerations when modeling childhood bmi.
12). Interpret the R-square and adjusted R-square from the preferred model. (4 points)
summary(fit1)
##
## Call:
## lm(formula = bmi ~ sex_m + race_blk + race_othr + lbwyes + age +
## faminc + empyes + educ, data = psid2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -22.817 -1.683 1.289 4.042 40.846
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.155e+01 7.356e-01 15.707 < 2e-16 ***
## sex_m 1.075e-01 2.716e-01 0.396 0.69238
## race_blk 1.738e-01 2.945e-01 0.590 0.55510
## race_othr -2.024e+00 6.494e-01 -3.116 0.00185 **
## lbwyes -1.758e-01 4.755e-01 -0.370 0.71166
## age 6.535e-01 3.030e-02 21.565 < 2e-16 ***
## faminc -5.756e-06 1.195e-06 -4.819 1.5e-06 ***
## empyes -4.334e-01 2.940e-01 -1.474 0.14056
## educ -7.072e-03 4.670e-02 -0.151 0.87965
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.137 on 3589 degrees of freedom
## (645 observations deleted due to missingness)
## Multiple R-squared: 0.1197, Adjusted R-squared: 0.1177
## F-statistic: 61 on 8 and 3589 DF, p-value: < 2.2e-16
#About 12% of the variance seen in observations are accounted for by the independent variables used in the model, according to the R squared value.
#The adjusted R squared is slightly lower at 11.7%. The only difference between the two is that adjusted R squared takes into account the number of x variables, in effect keeping a check on the quantity of variables used in models.
13). Evaluate whether the preferred model violates any linear regression assumptions. If violation(s) exists in the preferred model, propose reasonable solution(s). (5 points)
sresid<-studres(fit1)#Histogram of residuals
hist(sresid, freq=FALSE,
main="Histogram of Residuals")
xfit<-xfit<-seq(min(sresid),max(sresid),length=40)
yfit<-dnorm(xfit)
lines(xfit, yfit)
plot(fit1, which=2)
library(nortest)
## Warning: package 'nortest' was built under R version 4.0.3
ad.test(resid(fit1)) #Not normal distribution
##
## Anderson-Darling normality test
##
## data: resid(fit1)
## A = 147.2, p-value < 2.2e-16
plot(fit1, which=1)
library(lmtest)
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
bptest(fit1) #Not equal resid variance
##
## studentized Breusch-Pagan test
##
## data: fit1
## BP = 115.78, df = 8, p-value < 2.2e-16
# Since the assumption of normality and constant variance are violated, a dependent variable transformation would be a reasonable solution
14). Based on the analysis you’ve done so far, write a short paragraph to summarize your findings regarding the relationship between family socioeconomic background and child’s body mass index. (5 points)
# After reviewing literature on socioeconomic variables related to child's bmi, I expected to find significant relationships in variables like adjusted family income, mother's education attainment, and mothers employment status. After controlling for these variables and other important demographic variables, adjusted family income was the only predictor out of the initial 3 that had a significant relationship. However, the nested linear regression model violates two assumptions and will need to be adjusted which could influence the final output.