Task 1 - load NHANES data

if (!require(NHANES)) install.packages('NHANES')
## Loading required package: NHANES
library(NHANES)
data(NHANES)

Task 2 Copy of NHANES data

d <- NHANES

Task 3 recode the PhysActive Variable into a dummy variable

if (!require(fastDummies)) install.packages('fastDummies')
## Loading required package: fastDummies
## Thank you for using fastDummies!
## To acknowledge our work, please cite the package:
## Kaplan, J. & Schlegel, B. (2023). fastDummies: Fast Creation of Dummy (Binary) Columns and Rows from Categorical Variables. Version 1.7.1. URL: https://github.com/jacobkap/fastDummies, https://jacobkap.github.io/fastDummies/.
library(fastDummies)

d <- dummy_cols(d,select_columns = c("PhysActive"))
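
For comparison, a 0/1 indicator like PhysActive_Yes can also be built in base R without an extra package. This is a minimal sketch on a made-up factor; `phys_active` and `phys_active_yes` are hypothetical names, not columns of d.

```r
# Toy factor standing in for d$PhysActive (values are made up)
phys_active <- factor(c("No", "Yes", "Yes", "No", NA), levels = c("No", "Yes"))

# 1 when the answer is "Yes", 0 when it is "No"; a missing answer stays NA
phys_active_yes <- as.integer(phys_active == "Yes")
phys_active_yes
```

On the real data the same idea would read `d$PhysActive_Yes <- as.integer(d$PhysActive == "Yes")`.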

Task 4 - Select 2 continuous variables and 2 categorical variables that may be associated with the outcome of interest, PhysActive

?NHANES

2 continuous variables - BMI, Poverty; 2 categorical variables - AgeDecade, Smoke100

Task 5 Question of Interest

What is the relationship between being physically active and BMI, age (decade), poverty, and smoking history in Americans during the years 2009–2012?

Task 6 removing unnecessary variables from d

d <- d[c("PhysActive","BMI","Poverty","AgeDecade","Smoke100","PhysActive_Yes","PhysActive_No")]
dim(d)
## [1] 10000     7

Task 7 Remove observations with missing data

d <- na.omit(d)

Task 8 Visualizations and Table

Scatterplot of dependent variable PhysActive_Yes and independent variable BMI

plot(d$BMI, d$PhysActive_Yes)

Scatterplot of independent variable Poverty and dependent variable PhysActive_Yes

plot(d$Poverty, d$PhysActive_Yes)

Two-way table for Physical Activity Status and AgeDecade

with(d, table(AgeDecade, PhysActive_Yes))
##          PhysActive_Yes
## AgeDecade   0   1
##     0-9     0   0
##     10-19   0   0
##     20-29 420 825
##     30-39 523 723
##     40-49 593 691
##     50-59 584 625
##     60-69 425 421
##     70+   332 190
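
The 0-9 and 10-19 rows are all zero, presumably because none of the rows kept by na.omit() fall in those age groups while the factor still carries the unused levels. A minimal sketch of dropping unused levels with `droplevels()`, using made-up data (`age` and `active` are hypothetical names):

```r
# Toy data standing in for d after na.omit(); values are made up
age <- factor(c("20-29", "30-39", "30-39", "70+"),
              levels = c("0-9", "10-19", "20-29", "30-39", "70+"))
active <- c(1, 0, 1, 0)

table(age, active)              # keeps all-zero rows for 0-9 and 10-19
table(droplevels(age), active)  # only the levels that actually occur
```

On the real data, `d <- droplevels(d)` applies the same step to every factor column at once.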

Two-way table for Physical Activity Status and Smoking History (Smoke100)

with(d, table(Smoke100, PhysActive_Yes))
##         PhysActive_Yes
## Smoke100    0    1
##      No  1457 2095
##      Yes 1420 1380

Task 9 Logistic regression model

logit1 <- glm(PhysActive_Yes ~ BMI+Poverty+AgeDecade+Smoke100, data = d, family = "binomial")
summary(logit1)
## 
## Call:
## glm(formula = PhysActive_Yes ~ BMI + Poverty + AgeDecade + Smoke100, 
##     family = "binomial", data = d)
## 
## Coefficients:
##                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      1.282240   0.140525   9.125  < 2e-16 ***
## BMI             -0.040636   0.004109  -9.890  < 2e-16 ***
## Poverty          0.287023   0.016794  17.090  < 2e-16 ***
## AgeDecade 30-39 -0.441473   0.087187  -5.063 4.12e-07 ***
## AgeDecade 40-49 -0.711900   0.086950  -8.187 2.67e-16 ***
## AgeDecade 50-59 -0.862360   0.089461  -9.640  < 2e-16 ***
## AgeDecade 60-69 -0.925904   0.097796  -9.468  < 2e-16 ***
## AgeDecade 70+   -1.379051   0.114297 -12.066  < 2e-16 ***
## Smoke100Yes     -0.275341   0.053906  -5.108 3.26e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 8749.4  on 6351  degrees of freedom
## Residual deviance: 8102.2  on 6343  degrees of freedom
## AIC: 8120.2
## 
## Number of Fisher Scoring iterations: 4
exp(cbind(OR = coef(logit1), confint(logit1)))
## Waiting for profiling to be done...
##                        OR     2.5 %    97.5 %
## (Intercept)     3.6047069 2.7395816 4.7528065
## BMI             0.9601785 0.9524374 0.9679047
## Poverty         1.3324553 1.2894456 1.3771997
## AgeDecade 30-39 0.6430886 0.5419081 0.7627425
## AgeDecade 40-49 0.4907110 0.4136270 0.5816419
## AgeDecade 50-59 0.4221648 0.3540740 0.5028227
## AgeDecade 60-69 0.3961733 0.3268991 0.4796579
## AgeDecade 70+   0.2518175 0.2010101 0.3146729
## Smoke100Yes     0.7593132 0.6831689 0.8439255
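
The "Waiting for profiling to be done..." message shows that confint() computed profile-likelihood intervals. As a rough cross-check, an odds ratio and a simpler Wald interval can be computed by hand from the estimate and standard error in the summary(logit1) output. This sketch does so for BMI; the Wald limits should be close to, but not identical to, the profile limits.

```r
# Estimate and standard error for BMI, copied from the summary(logit1) output
est <- -0.040636
se  <- 0.004109

or_bmi  <- exp(est)                                 # odds ratio
wald_ci <- exp(est + c(-1, 1) * qnorm(0.975) * se)  # Wald 95% interval

round(c(OR = or_bmi, lower = wald_ci[1], upper = wald_ci[2]), 4)
```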
if (!require(jtools)) install.packages('jtools')
## Loading required package: jtools
library(jtools)

summ(logit1, exp = TRUE, confint = TRUE)
## MODEL INFO:
## Observations: 6352
## Dependent Variable: PhysActive_Yes
## Type: Generalized linear model
##   Family: binomial 
##   Link function: logit 
## 
## MODEL FIT:
## χ²(8) = 647.18, p = 0.00
## Pseudo-R² (Cragg-Uhler) = 0.13
## Pseudo-R² (McFadden) = 0.07
## AIC = 8120.18, BIC = 8180.99 
## 
## Standard errors: MLE
## ----------------------------------------------------------------
##                         exp(Est.)   2.5%   97.5%   z val.      p
## --------------------- ----------- ------ ------- -------- ------
## (Intercept)                  3.60   2.74    4.75     9.12   0.00
## BMI                          0.96   0.95    0.97    -9.89   0.00
## Poverty                      1.33   1.29    1.38    17.09   0.00
## AgeDecade 30-39              0.64   0.54    0.76    -5.06   0.00
## AgeDecade 40-49              0.49   0.41    0.58    -8.19   0.00
## AgeDecade 50-59              0.42   0.35    0.50    -9.64   0.00
## AgeDecade 60-69              0.40   0.33    0.48    -9.47   0.00
## AgeDecade 70+                0.25   0.20    0.32   -12.07   0.00
## Smoke100Yes                  0.76   0.68    0.84    -5.11   0.00
## ----------------------------------------------------------------
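
Coefficient names such as Smoke100Yes appear because glm() expands each factor predictor into 0/1 indicator columns on its own, leaving out the first level as the reference group. A minimal sketch with made-up data (`toy`, `age_group`, and `smoke100` are hypothetical names) shows the design matrix R builds:

```r
# Toy data frame with two factor predictors (values are made up)
toy <- data.frame(
  age_group = factor(c("20-29", "30-39", "40-49", "20-29")),
  smoke100  = factor(c("No", "Yes", "No", "Yes"))
)

# The design matrix that glm() would build for these predictors
X <- model.matrix(~ age_group + smoke100, data = toy)
colnames(X)
X
```

The reference levels ("20-29" and "No") get no column of their own; they are absorbed into the intercept.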

Task 10 Interpretation in Sample

BMI

For every one-unit increase in BMI, the odds of being physically active are multiplied by 0.96 on average (about a 4% decrease), controlling for all other independent variables.

Poverty

For every one-unit increase in Poverty (the ratio of family income to the poverty guidelines, so higher values mean less poverty), the odds of being physically active are multiplied by 1.33 on average (about a 33% increase), controlling for all other independent variables.

AgeDecade

AgeDecade is categorical, so each odds ratio compares one age group with the reference group, 20-29, rather than describing a one-unit increase.

Compared to individuals aged 20-29, the odds of being physically active for individuals aged 30-39 are 0.64 times as high on average, controlling for all other independent variables.

Compared to individuals aged 20-29, the odds of being physically active for individuals aged 40-49 are 0.49 times as high on average, controlling for all other independent variables.

Compared to individuals aged 20-29, the odds of being physically active for individuals aged 50-59 are 0.42 times as high on average, controlling for all other independent variables.

Compared to individuals aged 20-29, the odds of being physically active for individuals aged 60-69 are 0.40 times as high on average, controlling for all other independent variables.

Compared to individuals aged 20-29, the odds of being physically active for individuals aged 70+ are 0.25 times as high on average, controlling for all other independent variables.

Smoke100

Compared to individuals with no such smoking history, the odds of being physically active for individuals who have smoked at least 100 cigarettes in their life are 0.76 times as high on average, controlling for all other independent variables.
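
An odds ratio can be easier to read as a percent change in the odds, and the effect of a larger change in a continuous predictor follows by raising the odds ratio to a power. A short sketch using odds ratios copied from the output above; the 5-unit BMI increase is an arbitrary illustration.

```r
# Odds ratios copied from the exp(coef(logit1)) output
or <- c(BMI = 0.9601785, Poverty = 1.3324553, Smoke100Yes = 0.7593132)

# Percent change in the odds per one-unit increase (or versus the reference group)
pct_change <- (or - 1) * 100
round(pct_change, 1)

# Odds ratio for a 5-unit increase in BMI: the per-unit OR raised to the 5th power
or_bmi_5 <- unname(or["BMI"]^5)
round(or_bmi_5, 2)
```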

Task 11 Interpretation in Population

BMI

We are 95% confident that in our population of interest, for every one-unit increase in BMI, the odds of being physically active are multiplied by at least 0.95 and at most 0.97, controlling for all other independent variables.

Poverty

We are 95% confident that in our population of interest, for every one-unit increase in Poverty, the odds of being physically active are multiplied by at least 1.29 and at most 1.38, controlling for all other independent variables.

AgeDecade

We are 95% confident that in our population of interest, the odds of being physically active for individuals aged 30-39 are at least 0.54 times and at most 0.76 times the odds for individuals aged 20-29, controlling for all other independent variables.

We are 95% confident that in our population of interest, the odds of being physically active for individuals aged 40-49 are at least 0.41 times and at most 0.58 times the odds for individuals aged 20-29, controlling for all other independent variables.

We are 95% confident that in our population of interest, the odds of being physically active for individuals aged 50-59 are at least 0.35 times and at most 0.50 times the odds for individuals aged 20-29, controlling for all other independent variables.

We are 95% confident that in our population of interest, the odds of being physically active for individuals aged 60-69 are at least 0.33 times and at most 0.48 times the odds for individuals aged 20-29, controlling for all other independent variables.

We are 95% confident that in our population of interest, the odds of being physically active for individuals aged 70+ are at least 0.20 times and at most 0.32 times the odds for individuals aged 20-29, controlling for all other independent variables.

Smoke100

We are 95% confident that in our population of interest, the odds of being physically active for individuals who have smoked at least 100 cigarettes in their life are at least 0.68 times and at most 0.84 times the odds for individuals with no such smoking history, controlling for all other independent variables.

Task 12 Interpret p Value

BMI

The p-value (< 2e-16) is below 0.05, so we reject the null hypothesis that there is no association between BMI and PhysActive, controlling for the other independent variables. This is evidence in favor of the alternative hypothesis, not proof that it is true.

Poverty

The p-value (< 2e-16) is below 0.05, so we reject the null hypothesis that there is no association between Poverty and PhysActive, controlling for the other independent variables.

AgeDecade

Each of the five age-group coefficients has a p-value below 0.05 (the largest is 4.12e-07, for 30-39), so for every age group we reject the null hypothesis that its odds of being physically active equal those of the 20-29 reference group, controlling for the other independent variables.

Smoke100

The p-value (3.26e-07) is below 0.05, so we reject the null hypothesis that there is no association between Smoke100 and PhysActive, controlling for the other independent variables.
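
AgeDecade enters the model as five separate coefficients, each with its own p-value. One way to get a single p-value for a categorical variable as a whole is a likelihood-ratio test with `drop1()`. The sketch below runs on simulated data, not NHANES; `sim`, `age_group`, `bmi`, and `active` are hypothetical names and the generating coefficients are arbitrary.

```r
set.seed(1)  # simulated data, not NHANES
n <- 500
sim <- data.frame(
  age_group = factor(sample(c("20-29", "30-39", "40-49"), n, replace = TRUE)),
  bmi       = rnorm(n, mean = 28, sd = 5)
)
# Arbitrary true model, used only to generate a 0/1 outcome for the sketch
eta <- 1 - 0.04 * sim$bmi - 0.5 * (sim$age_group == "40-49")
sim$active <- rbinom(n, size = 1, prob = plogis(eta))

fit <- glm(active ~ bmi + age_group, data = sim, family = "binomial")

# One likelihood-ratio test per term; age_group is tested as a whole (2 df)
lrt <- drop1(fit, test = "Chisq")
lrt
```

On the real model the equivalent call would be `drop1(logit1, test = "Chisq")`.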

Task 13 Calculate and Interpret Pseudo R^2

if (!require(DescTools)) install.packages('DescTools')
## Loading required package: DescTools
## 
## Attaching package: 'DescTools'
## The following object is masked from 'package:jtools':
## 
##     %nin%
library(DescTools)

PseudoR2(logit1)
##   McFadden 
## 0.07396876

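McFadden's Pseudo R^2 for our regression model logit1 is 0.074, which is low. It means the model improves the log-likelihood by only about 7% relative to the intercept-only model, so logit1 will not reliably predict whether someone is 1 or 0 on the dependent variable, PhysActive_Yes. Unlike R^2 in linear regression, this value is not the proportion of variance explained, and McFadden values tend to run well below linear-regression R^2 values, so a low value on its own does not show that the model fits badly.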
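
McFadden's value can be reproduced by hand from the two deviances printed in the summary(logit1) output, which makes clear what it measures. A minimal sketch; the small difference from the PseudoR2() result comes from the deviances being rounded in the printout.

```r
# Deviances copied from the summary(logit1) output
null_dev  <- 8749.4
resid_dev <- 8102.2

# For a 0/1 outcome the deviance equals -2 * log-likelihood, so McFadden's
# pseudo R^2 reduces to 1 - residual deviance / null deviance
mcfadden <- 1 - resid_dev / null_dev
round(mcfadden, 3)
```

With the fitted model at hand, the same quantity is `1 - logit1$deviance / logit1$null.deviance`.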

Task 14 Run and interpret the results of the Hosmer-Lemeshow test on logit1

if (!require(ResourceSelection)) install.packages('ResourceSelection')
## Loading required package: ResourceSelection
## ResourceSelection 0.3-6   2023-06-27
library(ResourceSelection)

hoslem.test(logit1$y,predict(logit1, type = "response"))
## 
##  Hosmer and Lemeshow goodness of fit (GOF) test
## 
## data:  logit1$y, predict(logit1, type = "response")
## X-squared = 34.11, df = 8, p-value = 3.881e-05

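The p-value for the Hosmer-Lemeshow test on our model logit1 is very low (3.881e-05, below 0.05), so we reject the null hypothesis that the model fits the data well. This is evidence that logit1 is poorly calibrated: the observed and predicted numbers of physically active individuals differ across groups of predicted probability by more than chance would explain.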
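
The Hosmer-Lemeshow test sorts observations by predicted probability, splits them into groups (ten by default in hoslem.test), and compares observed and expected counts per group. The sketch below builds that comparison table by hand on simulated data, not NHANES, to show where a lack of fit would appear; `x`, `y`, `grp`, and `calib` are hypothetical names.

```r
set.seed(2)  # simulated data, not NHANES
n <- 1000
x <- rnorm(n)
y <- rbinom(n, size = 1, prob = plogis(0.3 + 0.8 * x))
fit <- glm(y ~ x, family = "binomial")

p_hat <- fitted(fit)
# Ten groups of roughly equal size, ordered by predicted probability
grp <- cut(p_hat, breaks = quantile(p_hat, probs = seq(0, 1, 0.1)),
           include.lowest = TRUE, labels = FALSE)

calib <- data.frame(
  group    = 1:10,
  n        = as.vector(table(grp)),
  observed = as.vector(tapply(y, grp, sum)),     # observed count of 1s
  expected = as.vector(tapply(p_hat, grp, sum))  # sum of predicted probabilities
)
calib
```

Large gaps between the observed and expected columns in particular groups indicate where the model's predicted probabilities are off.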

Questions

The independent variables AgeDecade and smoking history (Smoke100) needed dummy variables that I did not write code for, so R created them automatically. Is that correct? So do we only make dummy variables for the dependent variable in order to code the regression model? Can you review the p-value and the Pseudo R^2 interpretations?