library(haven)
psid_cds <- read_dta("C:\\Users\\codar\\OneDrive\\Documents\\Stats 1\\Data\\psid_cds.dta")
View(psid_cds)

#cleanedout the NAs
nona <- na.omit (psid_cds)
  1. Recode tenure so that 1=owning a house, 0=renting a house, and missing is set to missing (NA). (2 point)
psidrecode <-psid_cds %>%
  mutate(
    tenure = case_when 
              (tenure == 0 ~ "rent",
                         tenure == 1 ~ "own",
                         TRUE ~ NA_character_))

# View(psidrecode)
# names(psid_cds)
  1. Check whether crace3 (child’s race) variable is a factor variable. (1 point)
is.factor(psid_cds$crace3)
## [1] FALSE
  1. Run a test to see if there are significant differences in body mass index among kids of different racial backgrounds. (3 points)
#I have run the model after having removed the NA's. 
mbirace <- lm(cbmi ~ crace3, data= nona)
summary(mbirace)
## 
## Call:
## lm(formula = cbmi ~ crace3, data = nona)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -16.900  -1.883   0.617   4.317  39.117 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  15.8237     0.3919  40.374   <2e-16 ***
## crace3        0.3589     0.2468   1.454    0.146    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.659 on 3571 degrees of freedom
## Multiple R-squared:  0.0005919,  Adjusted R-squared:  0.000312 
## F-statistic: 2.115 on 1 and 3571 DF,  p-value: 0.146
  1. Run a test to see if there is a significant difference in body mass index between girls and boys. (3 points)
#I have run the model after having removed the NA's. 
mbisex <- lm(cbmi ~ csex, data= nona)
summary(mbisex)
## 
## Call:
## lm(formula = cbmi ~ csex, data = nona)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -16.552  -1.859   0.541   4.248  39.141 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  16.9452     0.4595  36.875   <2e-16 ***
## csex         -0.3932     0.2897  -1.357    0.175    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.659 on 3571 degrees of freedom
## Multiple R-squared:  0.0005155,  Adjusted R-squared:  0.0002356 
## F-statistic: 1.842 on 1 and 3571 DF,  p-value: 0.1748
  1. Review relevant literature and identify three family socioeconomic variables from the variable list that are relevant to child’s body mass index. Be sure to cite at least TWO references to support your claim. (3 points)
#Family Income (adjfinc)
#According to a study by Jo Young titled, "What Money Can Buy: Family Income and Childhood Obesity" published in Econ Human Biol, family income and childhood obesity are generally negatively correlated, but for children in very low-income families, they are positively correlated. The study is available throught the National Library of Medicine here: https://pubmed.ncbi.nlm.nih.gov/25434511/ 

#Maternal Education level (educ)
#In a study by Yuejing Feng, Lulu Ding, Xue Tang, Yi Wang, and Chengchao Zhou1, published in the International journal of environmental research and public health, the authors examined maternal education and school age children weight status. Their study found a significant association between maternal education and children’s weight status. Mother's with higher level of education are found less likely to be underweight (technical college: Odds Ratio (OR) = 0.223, 95% Confidence Interval (CI) = 0.052–0.956. The study is available here:https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6678504/

#Maternal Employment (emp2)
#According to a study by Taryn W. Morrissey, Rachel E. Dunifon, and Ariel Kalil titled, "Maternal Employment, Work Schedules, and Children’s Body Mass Index", the authors found that an increase in the total time a mother is employed is associated with an increase in her child’s BMI; additionally, they found that the association between maternal employment and children’s weight is much stronger at 6th grade relative to younger ages. This study may be found here: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3070422/?report=classic
  1. Examine the means, medians, and standard deviation for variables you identified in 5. (3 points)
library(psych)
## 
## Attaching package: 'psych'
## The following object is masked from 'package:questionr':
## 
##     describe
## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha
## The following object is masked from 'package:car':
## 
##     logit
describe(psid_cds$adjfinc)
##    vars    n     mean       sd   median  trimmed      mad min     max   range
## X1    1 4243 55378.59 110723.7 40088.82 44814.45 35028.41   0 4824656 4824656
##     skew kurtosis      se
## X1 26.94    982.2 1699.82
describe(psid_cds$educ)
##    vars    n mean   sd median trimmed  mad min max range skew kurtosis   se
## X1    1 4198 13.6 3.05     13   13.37 1.48   0  20    20 0.33      1.1 0.05
describe(psid_cds$emp2)
##    vars    n mean   sd median trimmed mad min max range  skew kurtosis   se
## X1    1 4233 0.65 0.48      1    0.69   0   0   1     1 -0.64    -1.59 0.01
  1. Estimate a regression model with all independent variables you identified in 5, and interpret the output of regression analysis. (6 points)
#As mentioned in the document, the data had some NA's, so I have run a model using complete cases by removing the NA's. The interpretations below will be based on this dataframe. 
mod1.simple <- lm(cbmi ~ factor(emp2) + educ + adjfinc, data = nona)
summary(mod1.simple)
## 
## Call:
## lm(formula = cbmi ~ factor(emp2) + educ + adjfinc, data = nona)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -17.189  -1.818   0.571   4.235  38.899 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    1.719e+01  6.730e-01  25.545  < 2e-16 ***
## factor(emp2)1  1.943e-01  3.116e-01   0.624 0.532891    
## educ          -5.227e-02  4.912e-02  -1.064 0.287407    
## adjfinc       -4.292e-06  1.256e-06  -3.417 0.000639 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.645 on 3569 degrees of freedom
## Multiple R-squared:  0.00426,    Adjusted R-squared:  0.003423 
## F-statistic: 5.089 on 3 and 3569 DF,  p-value: 0.001625
#Interpretation: 

#For adjfinc: Holding all other parameters constant, for every increase in adjfinc, the expected child bmi is going to decrease by 4.29 and this decrease is statistically significant at the .001 level based on the p value of 0.000708

#For educ: Holding all other variables constant, for every additional year of the mother's education, the expected child bmi is anticipated to decrease by 5.2 and this decrease is not  stastically significant based on a p-value greater than .05 of 0.129688. 

#For emp2: Holding all other variables constant, compared to women who are not currently employed, women who are currently employed are expected to have children with a bmi 1.94 higher and this increase is not  statistically significant based on a p-value of 0.532891 which is higher than .05

In addition to the three family socioeconomic background variables you identified from 5), previous research suggests that several demographic variables are also important predictors of body mass index, including child’s age, sex, race, and low birth weight status.

  1. Provide appropriate descriptive statistics for child’s age, sex, race, and low birth weight status. (3 points)
# names(psid_cds)
library(psych)
describe(psid_cds$cage)
##    vars    n mean   sd median trimmed  mad min max range skew kurtosis   se
## X1    1 4236 8.57 4.54      8    8.48 5.93   1  17    16 0.15    -1.08 0.07
describe(psid_cds$csex)
##    vars    n mean  sd median trimmed mad min max range  skew kurtosis   se
## X1    1 4243 1.51 0.5      2    1.51   0   1   2     1 -0.03       -2 0.01
describe(psid_cds$crace3)
##    vars    n mean   sd median trimmed mad min max range skew kurtosis   se
## X1    1 4042 1.51 0.59      1    1.51   0   1   3     2 0.69     -0.5 0.01
describe(psid_cds$lbw)
##    vars    n mean   sd median trimmed mad min max range skew kurtosis   se
## X1    1 3901 0.12 0.33      0    0.03   0   0   1     1 2.27     3.17 0.01
  1. Estimate another regression model with all the independent variables you identified in 5 and these child’s demographic variables. (3 points)
#Running more advanced model using the dataframe excluding NA's
mod2.advanced <- lm(cbmi ~ factor(emp2) + factor(csex) + factor(crace3) + factor(lbw) + adjfinc + educ + cage, data = nona)
summary(mod2.advanced)
## 
## Call:
## lm(formula = cbmi ~ factor(emp2) + factor(csex) + factor(crace3) + 
##     factor(lbw) + adjfinc + educ + cage, data = nona)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -22.836  -1.670   1.267   4.034  40.880 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      1.163e+01  7.433e-01  15.644  < 2e-16 ***
## factor(emp2)1   -4.339e-01  2.953e-01  -1.469   0.1418    
## factor(csex)2   -1.290e-01  2.724e-01  -0.474   0.6358    
## factor(crace3)2  1.804e-01  2.953e-01   0.611   0.5412    
## factor(crace3)3 -1.699e+00  6.574e-01  -2.584   0.0098 ** 
## factor(lbw)1    -2.871e-02  4.799e-01  -0.060   0.9523    
## adjfinc         -5.814e-06  1.195e-06  -4.867 1.18e-06 ***
## educ            -5.702e-03  4.705e-02  -0.121   0.9035    
## cage             6.552e-01  3.041e-02  21.550  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.132 on 3564 degrees of freedom
## Multiple R-squared:  0.1202, Adjusted R-squared:  0.1182 
## F-statistic: 60.86 on 8 and 3564 DF,  p-value: < 2.2e-16
# mrbmi2 <- lm(cbmi ~ factor(emp2) + factor(csex) + factor(crace3) + factor(lbw) + adjfinc + educ + cage, data = psid_cds)
# summary(mrbmi2)
  1. Interpret the coefficients of sex, child’s age and race variables from the model from 9. (8 points)
#child's sex (1=male, 2=female)
#Compared to male children, female children have a bmi that is 1.29 less than and this difference is not statistically significant given a p-value greater than.05 of 0.358

#child's age
#Holding all other variables constant, for every additional year in the child's age, the bmi increases by 6.55 and this association is statistically significant at the .001 level based on the p0value of <2e-16. 

#child's race (child’s race, 1=white, 2=black, 3=other...my base category is 1 aka white so it's omitted in the output)
#Compared to white children, black children have a bmi that is 1.80 higher holding all else constant and the difference is not statistically significant given that the p-value of 0.5412 which is greater than .05.

#Compared to white children, other race children have a bmi that is 1.69 less than white children and this difference is statistically significant at the 0.01 level based on the p-value of 0.0098. 
  1. Compare the regression model from 7 to the more advanced model from 9, which model is preferred? Why? (Make sure you run proper test to support your claim) (3 points)
#reference points
#mod1.simple <- lm(cbmi ~ factor(emp2) + educ + adjfinc, data = nona)
#mod2.advanced <- lm(cbmi ~ factor(emp2) + factor(csex) + factor(crace3) + factor(lbw) + adjfinc + educ + cage, data = nona)
#summary(mod1.simple)

#To compare my two models I will use the anova function/test
anova(mod1.simple, mod2.advanced)
## Analysis of Variance Table
## 
## Model 1: cbmi ~ factor(emp2) + educ + adjfinc
## Model 2: cbmi ~ factor(emp2) + factor(csex) + factor(crace3) + factor(lbw) + 
##     adjfinc + educ + cage
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
## 1   3569 266737                                  
## 2   3564 235681  5     31055 93.924 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#Based on the p-value of < 2.2e-16, which is significant at the 0.001, we can conclude that we do want to include the demographic variables and the better model is our more advanced model. 

12). Interpret the R-square and adjusted R-square from the preferred model. (4 points)

mod2.advanced <- lm(cbmi ~ factor(emp2) + factor(csex) + factor(crace3) + factor(lbw) + adjfinc + educ + cage, data = nona)
summary(mod2.advanced)
## 
## Call:
## lm(formula = cbmi ~ factor(emp2) + factor(csex) + factor(crace3) + 
##     factor(lbw) + adjfinc + educ + cage, data = nona)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -22.836  -1.670   1.267   4.034  40.880 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      1.163e+01  7.433e-01  15.644  < 2e-16 ***
## factor(emp2)1   -4.339e-01  2.953e-01  -1.469   0.1418    
## factor(csex)2   -1.290e-01  2.724e-01  -0.474   0.6358    
## factor(crace3)2  1.804e-01  2.953e-01   0.611   0.5412    
## factor(crace3)3 -1.699e+00  6.574e-01  -2.584   0.0098 ** 
## factor(lbw)1    -2.871e-02  4.799e-01  -0.060   0.9523    
## adjfinc         -5.814e-06  1.195e-06  -4.867 1.18e-06 ***
## educ            -5.702e-03  4.705e-02  -0.121   0.9035    
## cage             6.552e-01  3.041e-02  21.550  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.132 on 3564 degrees of freedom
## Multiple R-squared:  0.1202, Adjusted R-squared:  0.1182 
## F-statistic: 60.86 on 8 and 3564 DF,  p-value: < 2.2e-16
#R-squared interpretation: 12% of the total variance in child bmi is explained by the variables included in our advanced model. 

#Adjusted R-squared interpretation: Another way to compare models is to look at the Adjusted R-squared. The adjusted R-squared for the advanced model is 0.1182 which is higher than the simple model's adjusted R-squared of 0.003423. Therefore, the model that fits the data the best would be our second more advanced model since the value of adjusted r increased after we added more variables to our original model.  

13). Evaluate whether the preferred model violates any linear regression assumptions. If violation(s) exists in the preferred model, propose reasonable solution(s). (5 points)

#Constant Variance Assumption: We can run a visual and formal test for constant variance. 
plot(mod2.advanced, which=1)

library(lmtest)
bptest(mod2.advanced)
## 
##  studentized Breusch-Pagan test
## 
## data:  mod2.advanced
## BP = 113.7, df = 8, p-value < 2.2e-16
#Our null hypothesis is that the variance of error term is constant across values of x (homoscedasticity). The alternative hypothesis is that there is not constant variance. Because our p-value is less than .05, we can reject the null hypothesis. To address constant variance violation we should transform the dependent variable using log. 

#Normality: To test for normality, we can run a QQ plot. We can also plot the residuals in a histogram. 

#First, we begin with a histogram
hist(resid(mod2.advanced))

#The histogram does not appear to show an obvious bell shape, instead displaying multiple peaks.

#We can then use a QQ plot to continue to assess for normality.
plot(mod2.advanced, which=2)

#We find that the residuals waver quite a bit from the diagonal line. We see significant wavering at the bottom tail and wavering at the top. Curved patterns suggest non-normality. This provides visual evidence that the residuals are not normally distributed.  

#We can also conduct a formal test for normality using the Anderson-Darling test (and also the Shapiro test). 
library(nortest)
## Warning: package 'nortest' was built under R version 4.0.3
ad.test(mod2.advanced$residuals)  
## 
##  Anderson-Darling normality test
## 
## data:  mod2.advanced$residuals
## A = 146.32, p-value < 2.2e-16
#Because the results are less than .05, we would reject the null hypothesis of normality of our residuals.

#To correct for this, we could use a transformation. To determine which transformation to use, we could use the box-cox analysis. The boxcox will help determine which transformation we will use based on lambda.

#Lastly, we can also check for multicolinearity. #What we are looking for in our results are for values that are not higher than 10 because values of VIFs higher than 10 indicate problematic near multicollinearity. 

library(car)
vif(mod2.advanced)   
##                    GVIF Df GVIF^(1/(2*Df))
## factor(emp2)   1.052610  1        1.025968
## factor(csex)   1.002331  1        1.001165
## factor(crace3) 1.094738  2        1.022887
## factor(lbw)    1.011952  1        1.005958
## adjfinc        1.085526  1        1.041886
## educ           1.131938  1        1.063926
## cage           1.030186  1        1.014981

14). Based on the analysis you’ve done so far, write a short paragraph to summarize your findings regarding the relationship between family socioeconomic background and child’s body mass index. (5 points)

#Based on the second multiple regression model, there were several socioeconomic factors that were statistically significant in relation to child's body mass index. Of the seven factors included in the second model family income and child's age were significant, holding all other variables constant, based on p-values 0.000708 and < 2e-16, respectively. When it comes to a child's race, we can conclude that there is not a statistically significant difference between white and black children when it comes to bmi. However, for children of other race compared to white children, have a bmi that is 1.69 less than white children and this difference is statistically significant at the 0.01 level based on the p-value of 0.0098. #While some of the academic literature I read discussed the close link between socioeconomic factors and child bmi, my model does not have a strong Adjusted R-Square, overall. In our dataset, we have other variables, such as whether the mother is a smoker or not, which I believe would also add strength to the overall model, given previous research that shows that paternal/maternal smoking strongly associated with a child's bmi. Geographic location may also be an interesting predictor to examine. Overall, my model found statistical significance among the few aforementioned variables.