1). Recode tenure so that 1=owning a house, 0=renting a house, and missing is set to missing (NA). (2 points)
# Use case_when() inside mutate() to replace a chain of "if / else if" statements.
# Any value other than 1 or 0 (including missing) falls through to NA.
tenure2 <- pc %>%
  mutate(
    tenure = case_when(
      tenure == 1 ~ "owning",
      tenure == 0 ~ "renting",
      TRUE ~ NA_character_
    )
  )
View(tenure2)
2). Check whether crace3 (child’s race) variable is a factor variable. (1 point)
is.factor(pc$crace3)
## [1] FALSE
#crace3 (Child's race) variable is not a factor variable.
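Because crace3 is stored as labelled numeric codes and not as a factor, the later models wrap it in factor(). A minimal sketch of the conversion, using a small made-up vector (crace3_toy is hypothetical, not the assignment data) and the value labels shown in the output for question 3:

```r
# Hypothetical race codes standing in for pc$crace3 (1 = White, 2 = Black, 3 = other)
crace3_toy <- c(1, 2, 3, 1, NA, 2)

# Convert the numeric codes to a factor with explicit levels and labels
crace3_f <- factor(crace3_toy,
                   levels = c(1, 2, 3),
                   labels = c("White", "Black", "Other"))

is.factor(crace3_f)               # now a factor
table(crace3_f, useNA = "ifany")  # counts per category; NA stays NA
```

Setting the levels explicitly keeps White as the reference category, which matches how the regression output in this report is read.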
3). Run a test to see if there are significant differences in body mass index among kids of different racial backgrounds. (3 points)
pc1 <- na.omit(pc)
View(pc1)
#race: 1 = White, 2 = Black, 3 = other
pc1 %>%
group_by(crace3) %>%
summarise(diffin_bmi=mean(cbmi))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 3 x 2
## crace3 diffin_bmi
## <dbl+lbl> <dbl>
## 1 1 [White] 16.1
## 2 2 [Black] 16.8
## 3 3 [other] 16.0
bmi_race<-lm(cbmi~factor(crace3), data=pc1)
summary(bmi_race)
##
## Call:
## lm(formula = cbmi ~ factor(crace3), data = pc1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.755 -1.812 0.588 4.288 39.188
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 16.1117 0.1915 84.125 <2e-16 ***
## factor(crace3)2 0.6430 0.3029 2.122 0.0339 *
## factor(crace3)3 -0.0640 0.6910 -0.093 0.9262
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.657 on 3570 degrees of freedom
## Multiple R-squared: 0.001322, Adjusted R-squared: 0.0007629
## F-statistic: 2.364 on 2 and 3570 DF, p-value: 0.09423
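#Ho: μ_White = μ_Black = μ_other
#Ha: at least one group mean differs
#Null hypothesis: mean body mass index is the same across child's race (White, Black, other).
#Alternative hypothesis: at least one racial group has a different mean body mass index.
#The overall F-test (F = 2.364 on 2 and 3570 DF) has a p-value of 0.09423, which is greater than 0.05, so we fail to reject the null hypothesis: there is not sufficient evidence that mean body mass index differs across the three groups. The coefficient comparing Black children to White children (0.64, p-value = 0.0339) is below 0.05 on its own, but it is a single pairwise contrast and not the test of the overall hypothesis.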
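The distinction between the overall F-test and a single coefficient's t-test can be sketched on simulated data. Everything in this block (the toy data frame, its group sizes, and the variable names) is made up for illustration; it only shows that anova() on a one-factor lm reproduces the F-test printed at the bottom of summary(), and that TukeyHSD() gives multiplicity-adjusted pairwise comparisons.

```r
set.seed(1)

# Simulated stand-in data: three groups with the same true mean
toy <- data.frame(
  race = factor(rep(c("White", "Black", "Other"), times = c(60, 40, 20)),
                levels = c("White", "Black", "Other")),
  bmi  = rnorm(120, mean = 16, sd = 3)
)

fit <- lm(bmi ~ race, data = toy)

# Omnibus test of "all group means are equal"
omnibus_p <- anova(fit)[["Pr(>F)"]][1]

# The same test, rebuilt from the F-statistic that summary() reports
fstat <- summary(fit)$fstatistic
summary_p <- pf(fstat["value"], fstat["numdf"], fstat["dendf"],
                lower.tail = FALSE)

# Pairwise group comparisons with Tukey's adjustment
pairwise <- TukeyHSD(aov(bmi ~ race, data = toy))
pairwise$race
```

The omnibus p-value answers the question as posed (any difference among groups); the pairwise table is the place to look when a specific contrast is of interest.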
4). Run a test to see if there is a significant difference in body mass index between girls and boys. (3 points)
gender_bmi<-lm(cbmi~factor(csex), data=pc1)
summary(gender_bmi)
##
## Call:
## lm(formula = cbmi ~ factor(csex), data = pc1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.552 -1.859 0.541 4.248 39.141
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 16.5520 0.2059 80.377 <2e-16 ***
## factor(csex)2 -0.3932 0.2897 -1.357 0.175
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.659 on 3571 degrees of freedom
## Multiple R-squared: 0.0005155, Adjusted R-squared: 0.0002356
## F-statistic: 1.842 on 1 and 3571 DF, p-value: 0.1748
#gender 1=male, 2=female
pc1 %>%
group_by(csex) %>%
summarise(diffin_bmi_gender=mean(cbmi))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 2 x 2
## csex diffin_bmi_gender
## <dbl> <dbl>
## 1 1 16.6
## 2 2 16.2
anova(gender_bmi)
## Analysis of Variance Table
##
## Response: cbmi
## Df Sum Sq Mean Sq F value Pr(>F)
## factor(csex) 1 138 138.086 1.8417 0.1748
## Residuals 3571 267740 74.976
coef(gender_bmi)
## (Intercept) factor(csex)2
## 16.5519796 -0.3931985
#Ho: μ_male = μ_female
#Ha: μ_male ≠ μ_female
#Null hypothesis: there is no difference in mean body mass index between boys and girls.
#Alternative hypothesis: mean body mass index differs between boys and girls.
#The analysis shows no statistically significant difference in child's body mass index by gender. Girls' mean body mass index is 0.39 lower than boys', but the p-value of 0.175 is greater than 0.05. We fail to reject the null hypothesis.
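With only two groups, the regression above is equivalent to a pooled-variance two-sample t-test. A small sketch on simulated data (the toy data frame and its values are made up) shows the two p-values agreeing, with Welch's unequal-variance test alongside for comparison:

```r
set.seed(2)

# Simulated stand-in data: sex coded 1 = male, 2 = female, as in the report
toy <- data.frame(
  sex = factor(rep(c(1, 2), each = 50)),
  bmi = rnorm(100, mean = 16, sd = 3)
)

# p-value for the sex coefficient from the regression
lm_p <- summary(lm(bmi ~ sex, data = toy))$coefficients["sex2", "Pr(>|t|)"]

# Pooled-variance t-test: same test, same p-value
t_p <- t.test(bmi ~ sex, data = toy, var.equal = TRUE)$p.value

# Welch's t-test does not assume equal variances, so it can differ slightly
welch_p <- t.test(bmi ~ sex, data = toy)$p.value

c(lm = lm_p, pooled_t = t_p, welch_t = welch_p)
```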
5). Review relevant literature and identify three family socioeconomic variables from the variable list that are relevant to child’s body mass index. Be sure to cite at least TWO references to support your claim. (3 points)
#First family socioeconomic variable: siblings (tkids). A recent study by Griauzde et al. (2019) found that the body mass index trajectory was lower for kids who had a new sibling join the family before kindergarten than for those who did not.
##Citation: Griauzde, D. H., Lumeng, J., Shah, P., & Kaciroti, N. (2019). Lower Body Mass Index Z-Score Trajectory During Early Childhood After the Birth of a Younger Sibling. Academic Pediatrics, 19(1), 51–57. https://doi.org/10.1016/j.acap.2018.06.003
#Second family socioeconomic variable: child's gender (csex). Komiya et al. (2000) reported no significant difference in body mass index between males and females. Before puberty, males and females have similar body mass index.
##Citation: Komiya, S., Eto, C., Otoki, K., Teramoto, K., Shimizu, F., & Shimamoto, H. (2000). Gender differences in body fat of low- and high-body-mass children: relationship with body mass index. European Journal of Applied Physiology, 82(1), 16–23. https://doi.org/10.1007/s004210050646
#Third family socioeconomic variable: whether the mother is currently employed (emp2). The study by Taylor et al. (2012) showed that a mother's employment status does not impact a child's body mass index.
##Citation: Taylor, A., Winefield, H., Kettler, L., & Gill, T. (2012). A Population Study of 5 to 15 Year Olds: Full Time Maternal Employment not Associated with High BMI. The Importance of Screen-Based Activity, Reading for Pleasure and Sleep Duration in Children’s BMI. Maternal and Child Health Journal, 16(3), 587–599. https://doi.org/10.1007/s10995-011-0792-y
6). Examine the means, medians, and standard deviation for variables you identified in 5). (3 points)
describe(pc1$tkids)
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 3573 2.46 1.23 2 2.32 1.48 1 9 8 1.25 2.74 0.02
describe(pc1$csex)
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 3573 1.51 0.5 2 1.51 0 1 2 1 -0.02 -2 0.01
describe(pc1$emp2)
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 3573 0.66 0.47 1 0.7 0 0 1 1 -0.69 -1.52 0.01
7). Estimate a regression model with all independent variables you identified in 5), and interpret the output of regression analysis. (6 points)
tce <- lm(cbmi ~ tkids + factor(csex) + factor(emp2), data=pc1)
summary(tce)
##
## Call:
## lm(formula = cbmi ~ tkids + factor(csex) + factor(emp2), data = pc1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.739 -1.835 0.517 4.305 39.081
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 16.863245 0.438893 38.422 <2e-16 ***
## tkids -0.123908 0.119996 -1.033 0.302
## factor(csex)2 -0.396305 0.289795 -1.368 0.172
## factor(emp2)1 -0.008126 0.310999 -0.026 0.979
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.66 on 3569 degrees of freedom
## Multiple R-squared: 0.0008204, Adjusted R-squared: -1.953e-05
## F-statistic: 0.9767 on 3 and 3569 DF, p-value: 0.4026
coef(tce)
## (Intercept) tkids factor(csex)2 factor(emp2)1
## 16.863244554 -0.123907858 -0.396305277 -0.008125936
anova(tce)
## Analysis of Variance Table
##
## Response: cbmi
## Df Sum Sq Mean Sq F value Pr(>F)
## tkids 1 79 79.398 1.0587 0.3036
## factor(csex) 1 140 140.305 1.8709 0.1715
## factor(emp2) 1 0 0.051 0.0007 0.9792
## Residuals 3569 267658 74.995
#tkids: For each additional sibling, the expected body mass index decreases by 0.12, holding all else constant. This decrease is not statistically significant (p-value = 0.302).
#csex (male=1, female=2): Compared to males, females have an expected body mass index that is 0.40 lower, holding all else constant. With a p-value of 0.172, the difference is not statistically significant.
#emp2 (unemployed=0, employed=1): Compared to children of unemployed mothers, children of employed mothers have an expected body mass index that is 0.008 lower, holding all else constant. The difference is not statistically significant (p-value = 0.979).
In addition to the three family socioeconomic background variables you identified from 5), previous research suggests that several demographic variables are also important predictors of body mass index, including child’s age, sex, race, and low birth weight status.
8). Provide appropriate descriptive statistics for child’s age, sex, race, and low birth weight status. (3 points)
describe(pc1$cage)
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 3573 8.41 4.54 8 8.29 5.93 1 17 16 0.2 -1.06 0.08
describe(pc1$csex)
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 3573 1.51 0.5 2 1.51 0 1 2 1 -0.02 -2 0.01
describe(pc1$crace3)
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 3573 1.48 0.59 1 1.48 0 1 3 2 0.8 -0.35 0.01
describe(pc1$lbw)
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 3573 0.09 0.29 0 0 0 0 1 1 2.88 6.29 0
9). Estimate another regression model with all the independent variables you identified in 5) and these child’s demographic variables. (3 points)
tce2 <- lm(cbmi~tkids+factor(csex)+factor(emp2)+cage+factor(crace3)+lbw, data=pc1)
summary(tce2)
##
## Call:
## lm(formula = cbmi ~ tkids + factor(csex) + factor(emp2) + cage +
## factor(crace3) + lbw, data = pc1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -23.263 -1.640 1.283 4.034 40.766
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12.022825 0.483617 24.860 < 2e-16 ***
## tkids -0.300690 0.113305 -2.654 0.00799 **
## factor(csex)2 -0.144971 0.273057 -0.531 0.59551
## factor(emp2)1 -0.662014 0.295852 -2.238 0.02531 *
## cage 0.651534 0.030467 21.385 < 2e-16 ***
## factor(crace3)2 0.461444 0.288207 1.601 0.10945
## factor(crace3)3 -1.548008 0.654404 -2.366 0.01806 *
## lbw 0.003453 0.480765 0.007 0.99427
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.151 on 3565 degrees of freedom
## Multiple R-squared: 0.1158, Adjusted R-squared: 0.114
## F-statistic: 66.68 on 7 and 3565 DF, p-value: < 2.2e-16
coef(tce2)
## (Intercept) tkids factor(csex)2 factor(emp2)1 cage
## 12.022825321 -0.300690259 -0.144971331 -0.662013900 0.651533717
## factor(crace3)2 factor(crace3)3 lbw
## 0.461444229 -1.548007620 0.003452651
anova(tce2)
## Analysis of Variance Table
##
## Response: cbmi
## Df Sum Sq Mean Sq F value Pr(>F)
## tkids 1 79 79.4 1.1950 0.274397
## factor(csex) 1 140 140.3 2.1117 0.146264
## factor(emp2) 1 0 0.1 0.0008 0.977856
## cage 1 30140 30140.0 453.6329 < 2.2e-16 ***
## factor(crace3) 2 654 327.0 4.9210 0.007341 **
## lbw 1 0 0.0 0.0001 0.994270
## Residuals 3565 236864 66.4
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
10). Interpret the coefficients of sex, child’s age and race variables from the model from 9). (8 points)
#csex: Compared to males, females have an expected body mass index that is 0.14 lower, holding all else constant. The difference is not statistically significant (p-value = 0.59551).
#cage: For each one-unit increase in a child's age, the expected body mass index increases by 0.65, holding all else constant. The increase is statistically significant (p-value < 2e-16).
#crace3: Compared to White children, Black children have an expected body mass index that is 0.46 higher, holding all else constant. The difference is not statistically significant (p-value = 0.10945).
#Compared to White children, children in the other race category have an expected body mass index that is 1.55 lower, holding all else constant. This difference is statistically significant (p-value = 0.01806).
11). Compare the regression model from 7) to the more advanced model from 9), which model is preferred? Why? (Make sure you run proper test to support your claim.) (3 points)
#The second model (tce2), from number 9, is preferred over the model from number 7 (tce). Because tce is nested within tce2, the two models can be compared with a partial F-test, anova(tce, tce2). The test is statistically significant (F = 115.87 on 4 and 3565 DF, p-value < 2.2e-16), so the added variables (cage, crace3, lbw) jointly improve the fit. In addition, the adjusted R-square rises from -1.953e-05 in tce to 0.114 in tce2, and the residual standard error falls from 8.66 to 8.151.
anova(tce,tce2)
## Analysis of Variance Table
##
## Model 1: cbmi ~ tkids + factor(csex) + factor(emp2)
## Model 2: cbmi ~ tkids + factor(csex) + factor(emp2) + cage + factor(crace3) +
## lbw
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 3569 267658
## 2 3565 236864 4 30794 115.87 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
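The F statistic in this table can be rebuilt by hand from the residual sums of squares and degrees of freedom printed above. The sketch below uses only those four numbers; the variable names are introduced here for illustration.

```r
# Values copied from the anova(tce, tce2) table
rss_reduced <- 267658   # RSS of model 1 (tce)
df_reduced  <- 3569     # residual df of model 1
rss_full    <- 236864   # RSS of model 2 (tce2)
df_full     <- 3565     # residual df of model 2

# Number of coefficients added by the larger model
q <- df_reduced - df_full

# Partial F statistic: drop in RSS per added coefficient,
# divided by the residual variance of the larger model
f_stat <- ((rss_reduced - rss_full) / q) / (rss_full / df_full)

# Upper-tail p-value from the F distribution with (q, df_full) df
p_value <- pf(f_stat, q, df_full, lower.tail = FALSE)

round(f_stat, 2)
p_value
```

Up to rounding of the printed sums of squares, this agrees with the F of 115.87 on 4 degrees of freedom reported by anova().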
12). Interpret the R-square and adjusted R-square from the preferred model. (4 points)
#Preferred model: number 9 (tce2).
# R-square: 0.1158. About 11.6 percent of the total variance in child's body mass index is explained by the variables in model tce2.
# Adjusted R-square: 0.114. This is the R-square after a penalty for the number of predictors, so it rises only when an added predictor improves the fit enough to offset that penalty. It is far higher than the adjusted R-square of the tce model (-1.953e-05), which indicates a better fit even after accounting for the four additional coefficients.
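The relationship between the two statistics can be checked directly from the reported values. This sketch assumes n = 3573 observations (the n shown by describe()) and p = 7 slope coefficients (the numerator df of the model F-test); the variable names are introduced here for illustration.

```r
r2 <- 0.1158   # Multiple R-squared reported for tce2
n  <- 3573     # number of observations
p  <- 7        # number of slope coefficients (excluding the intercept)

# Adjusted R-squared shrinks R-squared according to the number of predictors
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)

round(adj_r2, 3)
```

With a sample this large relative to the number of predictors, the penalty is small, which is why the two values are so close.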
13). Evaluate whether the preferred model violates any linear regression assumptions. If violation(s) exists in the preferred model, propose reasonable solution(s). (5 points)
#test for normality
plot(tce2, which=2)
ad.test(resid(tce2))
##
## Anderson-Darling normality test
##
## data: resid(tce2)
## A = 149.2, p-value < 2.2e-16
# Based on the QQ plot and the Anderson-Darling test (p-value < 2.2e-16), we reject the null hypothesis that the residuals are normally distributed, in favor of the alternative hypothesis that they are not. The model violates the normality assumption.
#Test whether the residuals have constant variance
bptest(tce2)
##
## studentized Breusch-Pagan test
##
## data: tce2
## BP = 114.79, df = 7, p-value < 2.2e-16
plot(tce2, which = 1)
#The null hypothesis states that the residuals have constant variance; the alternative hypothesis states that they do not. The Breusch-Pagan test is statistically significant (p-value < 2.2e-16), so there is strong evidence to reject the null hypothesis. The model violates the constant variance assumption.
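# Examine a histogram of the studentized residuals (studres() is from the MASS package)
sresid <- studres(tce2)
hist(sresid, freq = FALSE,
     main = "Histogram of Residuals")
# Overlay the standard normal density for comparison
xtce2 <- seq(min(sresid), max(sresid), length.out = 40)
ytce2 <- dnorm(xtce2)
lines(xtce2, ytce2)
# The histogram of the residuals does not follow the bell curve.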
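#Proposed fix
#To address the violations described above, I would transform the outcome variable 'Y' (cbmi) using a Box-Cox transformation.
#The Box-Cox plot shows the log-likelihood across values of lambda. The estimated lambda is at the peak of the curve (the middle line), and the outer lines mark its 95% confidence interval.
#The convenient value of lambda closest to the estimate determines the transformation to use for 'Y' (for example, lambda near 0 suggests log(Y) and lambda near 0.5 suggests sqrt(Y)).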
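The Box-Cox step described above is not run in this report, so the following is only a sketch of how it could look, using boxcox() from the MASS package on simulated data. The data frame toy, its variables, and the lambda grid are all made up for illustration; the outcome must be strictly positive for the method to apply.

```r
library(MASS)

set.seed(3)

# Simulated stand-in data where log(y) is linear in x, so the true lambda is 0
toy <- data.frame(x = runif(500, min = 0, max = 4))
toy$y <- exp(1 + 0.5 * toy$x + rnorm(500, sd = 0.3))

# Fit on the original scale; y = TRUE stores the response for boxcox()
fit_raw <- lm(y ~ x, data = toy, y = TRUE)

# Profile log-likelihood over a grid of lambda values (no plot drawn here)
bc <- boxcox(fit_raw, lambda = seq(-2, 2, by = 0.05), plotit = FALSE)

# Lambda with the highest log-likelihood
lambda_hat <- bc$x[which.max(bc$y)]
lambda_hat

# A lambda near 0 points to a log transformation; refit on that scale
fit_log <- lm(log(y) ~ x, data = toy)
```

After refitting, the same checks used earlier (QQ plot, ad.test(), bptest()) would show whether the transformation actually resolved the violations.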
14). Based on the analysis you’ve done so far, write a short paragraph to summarize your findings regarding the relationship between family socioeconomic background and child’s body mass index. (5 points)
#Based on the preferred model (tce2), child's age, race, number of siblings (tkids), and mother's employment status (emp2) are statistically significant predictors of child's body mass index at the 0.05 level.
#tkids: Each additional sibling is associated with a decrease of 0.30 in body mass index, holding all else constant (p-value = 0.00799). This is consistent with the study by Griauzde et al., which found that the body mass index trajectory was lower for kids who had a new sibling join the family before kindergarten than for those who did not.
#I did not review literature on age and race as predictors of child's body mass index, but model tce2 suggests that both have a statistically significant relationship with cbmi.
#cage: For each one-unit increase in a child's age, the expected body mass index increases by 0.65, holding all else constant (p-value < 2e-16).
#crace3: Compared to White children, Black children have an expected body mass index that is 0.46 higher, holding all else constant; this difference is not statistically significant (p-value = 0.10945). Compared to White children, children in the other race category have an expected body mass index that is 1.55 lower, holding all else constant; this difference is statistically significant (p-value = 0.01806).
#emp2 (unemployed=0, employed=1): In model tce2, children of employed mothers have an expected body mass index that is 0.66 lower than children of unemployed mothers, holding all else constant (p-value = 0.02531). In the smaller model (tce) the estimate was only -0.008 (p-value = 0.979), so the relationship appears only after adjusting for the demographic variables. This differs from the study by Taylor et al., which showed that a mother's employment status does not impact a child's body mass index.
#csex (male=1, female=2): Child's gender did not have a statistically significant relationship with body mass index in either model (tce: -0.40, p-value = 0.172; tce2: -0.14, p-value = 0.59551). This is consistent with Komiya et al., who reported no significant difference in body mass index between males and females; before puberty, males and females have similar body mass index.