##load data
library(haven)
psid <- read_dta("psid_cds.dta")
View(psid)

##The research question you are going to analyze is how family socioeconomic status predicts child’s body mass index. Body mass index is a strong indicator for child development. Higher body mass index generally indicates a higher risk of obesity or overweight.

##To prepare your analysis, you always need to clean your data. Although I have done most of the data cleaning for you, you will need to do a couple of additional data preparation steps.

##1). Recode tenure so that 1=owning a house, 0=renting a house, and missing is set to missing (NA). (2 points)

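##Recode to a 0/1 indicator: raw code 1 (owning) becomes 1, raw code 5 (renting) becomes 0, and raw code 8 becomes a true missing value (NA) rather than the text "NA".
tenure_raw <- as.numeric(psid$tenure)
tenure <- ifelse(tenure_raw == 1, 1, ifelse(tenure_raw == 5, 0, NA))
table(tenure, useNA = "ifany")
## tenure
##    0    1 <NA> 
## 1977 2046  220
View(tenure)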

##2). Check whether crace3 (child’s race) variable is a factor variable. (1 point)

##No, it is not a factor: str() reports dbl+lbl, so crace3 is a labelled numeric vector imported by haven.
is.factor(psid$crace3)
## [1] FALSE
str(psid$crace3)
##  dbl+lbl [1:4243] 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
##  @ label       : chr "RECODE of g3race"
##  @ format.stata: chr "%9.0g"
##  @ labels      : Named num [1:3] 1 2 3
##   ..- attr(*, "names")= chr [1:3] "White" "Black" "other"

###Now, do some initial investigation:

##3). Run a test to see if there are significant differences in body mass index among kids of different racial backgrounds. (3 points)

m1<-aov(formula=cbmi~crace3, data=psid)

anova(m1)
## Analysis of Variance Table
## 
## Response: cbmi
##             Df Sum Sq Mean Sq F value Pr(>F)
## crace3       1     74  73.666  0.9705 0.3246
## Residuals 3936 298758  75.904
##Null: Mean body mass index is the same for kids of different racial backgrounds
##Research: Mean body mass index differs for at least one racial group

##The p value (0.3246) is higher than .05, so we fail to reject the null. This test gives no evidence of a significant difference in body mass index among kids of different racial backgrounds.
##Caveat: crace3 has only 1 degree of freedom in the table, so R treated it as a numeric code rather than as three groups. A one-way ANOVA across the three groups needs crace3 converted to a factor first, which would give 2 degrees of freedom.
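
##The ANOVA table above shows Df = 1 for crace3, which means the race code was used as a number. A minimal sketch of the same test with race as a three-level factor follows. It was not run for this write-up, the helper name race_anova is made up for illustration, and the level labels are the ones reported by str(crace3).
race_anova <- function(data) {
  # Turn the numeric race code into a factor with the three labelled groups
  data$crace3_f <- factor(as.numeric(data$crace3),
                          levels = c(1, 2, 3),
                          labels = c("White", "Black", "other"))
  # One-way ANOVA: crace3_f should now use 2 degrees of freedom
  aov(cbmi ~ crace3_f, data = data)
}
##Usage (not run here): summary(race_anova(psid))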

##4). Run a test to see if there is a significant difference in body mass index between girls and boys. (3 points)

t.test(cbmi~csex, data = psid)
## 
##  Welch Two Sample t-test
## 
## data:  cbmi by csex
## t = 0.56235, df = 4125.5, p-value = 0.5739
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.3814424  0.6882736
## sample estimates:
## mean in group 1 mean in group 2 
##        16.44939        16.29597
##The null: There is no difference in mean body mass index between girls and boys
##The research: There is a difference in mean body mass index between girls and boys
##Since the p value (0.5739) is higher than .05, we fail to reject the null. The 95 percent confidence interval for the difference (-0.38 to 0.69) includes 0.
##The means are close to each other: 16.44939 in group 1 and 16.29597 in group 2, a difference of about 0.15. If csex is coded 1 = boys and 2 = girls (check the value labels), boys have a slightly higher mean in this sample, but the difference is not statistically significant.

###Note: Make sure you state the null and alternative hypotheses for each of these tests, and interpret the outputs thoroughly.

##5). Review relevant literature and identify three family socioeconomic variables from the variable list that are relevant to child’s body mass index. Be sure to cite at least TWO references to support your claim. (3 points)

##Adjusted family income (adjfinc)
##Housing tenure (tenure): owning versus renting, which shapes quality of life
##Whether the mother is currently employed (emp2)


##Reference : https://www.all4kids.org/news/blog/poverty-and-its-effects-on-children/
##Reference: https://www.healthaffairs.org/doi/full/10.1377/hlthaff.2009.0721

##6). Examine the means, medians, and standard deviation for variables you identified in 5). (3 points)

##Adjusted family income
summary(psid$adjfinc)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0   19520   40089   55379   70502 4824656
sd(psid$adjfinc)
## [1] 110723.7
##housing
summary(psid$tenure)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   1.000   5.000   3.227   5.000   8.000
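sd(psid$tenure)
## [1] 2.24469
##Note: tenure holds category codes (1, 5, 8), so its mean and standard deviation are not meaningful. A frequency table such as table(psid$tenure) describes it better.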
##Mother currently employed
summary(psid$emp2)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.000   0.000   1.000   0.652   1.000   1.000      10
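sd(psid$emp2)
## [1] NA
##sd() returns NA because emp2 has 10 missing values. sd(psid$emp2, na.rm = TRUE) gives the standard deviation of the observed cases.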
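
##The question asks for means, medians, and standard deviations, and sd() returned NA for emp2 because of its missing values. A minimal sketch of a helper that reports all three with missing values removed follows. The name describe_numeric is made up for illustration.
describe_numeric <- function(x) {
  x <- as.numeric(x)  # drop haven labels, keep the underlying numbers
  c(mean      = mean(x, na.rm = TRUE),
    median    = median(x, na.rm = TRUE),
    sd        = sd(x, na.rm = TRUE),
    n_missing = sum(is.na(x)))
}
##Usage (not run here): sapply(psid[c("adjfinc", "emp2")], describe_numeric)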

##7). Estimate a regression model with all independent variables you identified in 5), and interpret the output of regression analysis. (6 points)

bodymass<-lm(cbmi~adjfinc+ tenure + emp2, data=psid)
summary(bodymass)
## 
## Call:
## lm(formula = cbmi ~ adjfinc + tenure + emp2, data = psid)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -16.667  -1.917   0.571   4.411  38.874 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.667e+01  3.387e-01  49.198  < 2e-16 ***
## adjfinc     -4.767e-06  1.253e-06  -3.805 0.000144 ***
## tenure       2.342e-04  6.273e-02   0.004 0.997021    
## emp2        -2.300e-03  2.901e-01  -0.008 0.993675    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.741 on 4117 degrees of freedom
##   (122 observations deleted due to missingness)
## Multiple R-squared:  0.003685,   Adjusted R-squared:  0.002959 
## F-statistic: 5.076 on 3 and 4117 DF,  p-value: 0.001653
##Interpret output:
##Adjusted family income:
##For every one-dollar increase in adjusted family income, the child's body mass index is expected to decrease by 0.000004767 points (about 0.048 points per $10,000), holding all else constant. The association between adjusted family income and child's body mass index is statistically significant at the .001 level.

##Housing:
##For every one-unit increase in the tenure code, the child's body mass index is expected to increase by 0.00023 points, holding all else constant. The association between housing and child's body mass index is not statistically significant (p = 0.997). Note that lm() used psid$tenure with its raw codes (1, 5, 8), not the recoded variable, so this coefficient is not an owner-versus-renter comparison.

##Mother currently employed:
##Children whose mothers are currently employed (emp2 = 1) are expected to have a body mass index 0.0023 points lower than children whose mothers are not employed, holding all else constant. The association between mother's employment and child's body mass index is not statistically significant (p = 0.994).
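
##The income coefficient (-4.767e-06) is tiny because adjfinc is measured in single dollars. A minimal sketch of refitting the same model with income in units of $10,000 follows, so the slope reads as points of body mass index per $10,000. It was not run for this write-up, and the names income_model and adjfinc_10k are made up for illustration.
income_model <- function(data) {
  # Rescale income; the fit is unchanged, only the slope's unit changes
  data$adjfinc_10k <- as.numeric(data$adjfinc) / 10000
  lm(cbmi ~ adjfinc_10k + tenure + emp2, data = data)
}
##Usage (not run here): summary(income_model(psid))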

##In addition to the three family socioeconomic background variables you identified from 5), previous research suggests that several demographic variables are also important predictors of body mass index, including child’s age, sex, race, and low birth weight status.

##8). Provide appropriate descriptive statistics for child’s age, sex, race, and low birth weight status. (3 points)

##age
summary(psid$cage)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   1.000   5.000   8.000   8.572  12.000  17.000       7
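sd(psid$cage)
## [1] NA
##sd() returns NA because cage has 7 missing values. sd(psid$cage, na.rm = TRUE) gives the standard deviation of the observed ages.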
##sex
summary(psid$csex)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   1.000   2.000   1.507   2.000   2.000
sd(psid$csex)
## [1] 0.5000106
##race
summary(psid$crace3)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   1.000   1.000   1.000   1.507   2.000   3.000     201
sd(psid$crace3)
## [1] NA
##low birth weight status
summary(psid$lbw)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  0.0000  0.0000  0.0000  0.1246  0.0000  1.0000     342
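sd(psid$lbw)
## [1] NA
##Note: sex, race, and low birth weight status are category codes, so counts and proportions describe them better than means and standard deviations. The NA results for race and low birth weight come from their missing values (201 and 342).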
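
##Sex, race, and low birth weight status are categorical, so a frequency table is the appropriate descriptive statistic. A minimal sketch of one follows. The helper name freq_table is made up for illustration.
freq_table <- function(x) {
  # Count each code, keeping missing values as their own row
  counts <- table(as.numeric(x), useNA = "ifany")
  data.frame(value      = names(counts),
             n          = as.vector(counts),
             proportion = round(as.vector(counts) / sum(counts), 3))
}
##Usage (not run here): freq_table(psid$csex); freq_table(psid$crace3); freq_table(psid$lbw)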

##9). Estimate another regression model with all the independent variables you identified in 5) and these child’s demographic variables. (3 points)

bodymass2 <-lm(cbmi~adjfinc+tenure + emp2+cage+csex+crace3+lbw, data=psid)
summary(bodymass2)
## 
## Call:
## lm(formula = cbmi ~ adjfinc + tenure + emp2 + cage + csex + crace3 + 
##     lbw, data = psid)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -22.835  -1.743   1.325   4.079  40.980 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.190e+01  6.558e-01  18.149  < 2e-16 ***
## adjfinc     -5.725e-06  1.182e-06  -4.844 1.32e-06 ***
## tenure       1.241e-01  6.431e-02   1.929   0.0538 .  
## emp2        -3.824e-01  2.908e-01  -1.315   0.1886    
## cage         6.506e-01  3.023e-02  21.520  < 2e-16 ***
## csex        -1.157e-01  2.706e-01  -0.428   0.6690    
## crace3      -4.387e-01  2.408e-01  -1.822   0.0685 .  
## lbw         -1.225e-01  4.740e-01  -0.258   0.7960    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.143 on 3623 degrees of freedom
##   (612 observations deleted due to missingness)
## Multiple R-squared:  0.1174, Adjusted R-squared:  0.1157 
## F-statistic: 68.86 on 7 and 3623 DF,  p-value: < 2.2e-16

##10). Interpret the coefficients of sex, child’s age and race variables from the model from 9). (8 points)

##Interpret coefficient
##Sex:
##Moving from csex = 1 to csex = 2, the expected body mass index decreases by 0.116 points, holding all else constant. The association between child's sex and child's body mass index is not statistically significant (p = 0.669).

##Child's age:
##For every one-year increase in a child's age, the expected body mass index increases by 0.651 points, holding all else constant. The association between child's age and child's body mass index is statistically significant at the .001 level.

##race:
##For every one-unit increase in the race code, the expected body mass index decreases by 0.439 points, holding all else constant. The association is not statistically significant at the .05 level (p = 0.0685); it is only marginal at the .10 level. Because crace3 entered the model as a numeric code, this slope assumes equal steps from White to Black to other. Entering race as a factor would estimate a separate difference for each group.

##11). Compare the regression model from 7) to the more advanced model from 9), which model is preferred? Why? (Make sure you run proper test to support your claim.) (3 points)

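##The second model is preferred: its adjusted R-squared (0.1157) is far higher than the first model's (0.002959), and child's age is a strong added predictor.

##anova(bodymass, bodymass2) is the right test, but it stops with an error, which is why the file would not knit with it. The two models were fitted to different numbers of observations (122 versus 612 rows deleted due to missingness), and anova() only compares models fitted to the same rows. Both models need to be refitted on the rows that are complete for every variable in the larger model.

##anova(bodymass, bodymass2)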
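
##anova() can only compare models fitted to the same rows, and the two models above dropped different numbers of observations (122 versus 612). A minimal sketch of the nested-model F-test on a common sample follows. It was not run for this write-up, and the helper name compare_nested is made up for illustration.
compare_nested <- function(data) {
  vars <- c("cbmi", "adjfinc", "tenure", "emp2", "cage", "csex", "crace3", "lbw")
  # Keep only rows with no missing value in any model variable
  common <- as.data.frame(lapply(data[vars], as.numeric))
  common <- common[complete.cases(common), ]
  small <- lm(cbmi ~ adjfinc + tenure + emp2, data = common)
  large <- lm(cbmi ~ adjfinc + tenure + emp2 + cage + csex + crace3 + lbw,
              data = common)
  anova(small, large)  # partial F-test for the four added predictors
}
##Usage (not run here): compare_nested(psid)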

##12). Interpret the R-square and adjusted R-square from the preferred model. (4 points)

##Interpret R-Square
##About 11.7% (R-squared = 0.1174) of the total variation in a child's body mass index is explained by adjusted family income, housing tenure, mother's employment, child's age, child's sex, child's race, and low birth weight status together.

##Interpret adjusted r:
##The adjusted R-squared (0.1157) penalizes R-squared for the number of predictors in the model. After that adjustment, the seven predictors explain about 11.6% of the variation in a child's body mass index. The small gap between the two values shows the fit is not inflated by the number of predictors.

##13). Evaluate whether the preferred model violates any linear regression assumptions. If violation(s) exists in the preferred model, propose reasonable solution(s). (5 points)

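library(lmtest)   # provides bptest()
library(nortest)  # provides ad.test()

plot(bodymass2, which = 1)  # residuals vs fitted: linearity and constant variance
plot(bodymass2, which = 2)  # normal Q-Q plot: normality of residuals

bptest(bodymass2)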
## 
##  studentized Breusch-Pagan test
## 
## data:  bodymass2
## BP = 116.64, df = 7, p-value < 2.2e-16
ad.test(resid(bodymass2))
## 
##  Anderson-Darling normality test
## 
## data:  resid(bodymass2)
## A = 149.54, p-value < 2.2e-16
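##Normality is violated in the model: the Q-Q plot shows a curved pattern, and the Anderson-Darling test rejects normal residuals (A = 149.54, p < 2.2e-16).
##Constant variance is also violated: the Breusch-Pagan test rejects homoskedasticity (BP = 116.64, df = 7, p < 2.2e-16).
##The bp test and ad test only detect these problems; they do not fix them. A reasonable solution is to transform y (for example, model log(cbmi)) or a skewed x such as adjusted family income, then rerun the plots and both tests to see whether the transformation helped.
##If non-constant variance remains after the transformation, heteroskedasticity-robust standard errors can be reported instead of the default ones.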
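
##One way to act on the non-constant variance found by the Breusch-Pagan test is to keep the model and recompute the standard errors with White's heteroskedasticity-consistent (HC0) formula. A minimal base-R sketch follows. It is a swapped-in remedy that was not run for this write-up, and the helper name robust_se is made up for illustration.
robust_se <- function(fit) {
  X <- model.matrix(fit)        # design matrix of the rows actually used
  e <- resid(fit)               # one residual per row of X
  bread <- solve(crossprod(X))  # (X'X)^-1
  meat  <- crossprod(X * e)     # X' diag(e^2) X
  sqrt(diag(bread %*% meat %*% bread))
}
##Usage (not run here): cbind(estimate = coef(bodymass2), robust_se = robust_se(bodymass2))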

##14). Based on the analysis you’ve done so far, write a short paragraph to summarize your findings regarding the relationship between family socioeconomic background and child’s body mass index. (5 points)

##Looking at how family socioeconomic status predicts child's body mass index, adjusted family income is the only socioeconomic variable with a statistically significant association: as adjusted family income increases, a child's body mass index decreases slightly (about 0.057 points per $10,000 in the full model). Housing tenure and mother's employment are not significant at the .05 level. Among the demographic variables, child's age is the strongest predictor: body mass index increases by about 0.65 points for each year of age. Child's sex, race, and low birth weight status are not significant at the .05 level. In conclusion, age and family income are the key predictors of body mass index in these models, but the full model explains only about 12% of the variation and violates the normality and constant variance assumptions, so the results describe associations rather than causes.