## Install Tidyverse & Import Dataset ##

library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5     v purrr   0.3.4
## v tibble  3.1.3     v dplyr   1.0.7
## v tidyr   1.1.3     v stringr 1.4.0
## v readr   2.0.0     v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
survey <- read_csv("C:/Users/HP/OneDrive/Desktop/Queens College/Data 306-Data Analysis and Modeling (2022)/NHANES-2017-2018-Demo.csv")
## Rows: 9254 Columns: 46
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## dbl (46): seqn, sddsrvyr, ridstatr, riagendr, ridageyr, ridagemn, ridreth1, ...
## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Converting the variable names to uppercase #
names(survey) <- str_to_upper(names(survey))

# Number of rows and columns in dataset #
dim(survey)
## [1] 9254   46
## Question #1 ##
# The Dependent Variable = INDFMPIR #
# Types of Independent Variables: RIDAGEYR = Continuous, RIAGENDR = Categorical, RIDRETH3 = Categorical, DMDFMSIZ = Discrete Numerical, DMDEDUC2 = Categorical #

# Selecting Variables #
data <- survey %>%
  select(INDFMPIR, RIDAGEYR, RIAGENDR, 
         RIDRETH3, DMDFMSIZ, DMDEDUC2)
         
# Independent Variables #
predictors  <- names(data)[!names(data) %in% 'INDFMPIR']
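
# A quick sanity check on these type assignments (an optional sketch): columns
# with only a handful of distinct values are likely categorical or discrete,
# while many distinct values suggest a continuous variable. #
sapply(data, dplyr::n_distinct)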

## Question #2 ##

# Renaming Variable Names
names(data) <- c('Y', 'age', 'gender', 'race', 'familysize', 'education')

# Replacing 7 and 9 in Education to NA to indicate missing values #

# 7 = Refused to answer, 9 = Don't know #
data$education[data$education %in% c(7,9)] <- NA
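
# Verifying the recode (optional): 7 and 9 should no longer appear, and the NA
# count should now include the recoded rows. #
table(data$education, useNA = 'ifany')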

# Conversion of Categorical Variables to Categorical Data Type #
categorical_vars <- c('gender', 'race', 'education')
data[, categorical_vars] <- lapply(data[categorical_vars], as.factor)
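
# Confirming the conversion (optional): gender, race, and education should now
# report class 'factor'. #
sapply(data, class)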

# Fitting a Linear Regression Model #
Model1 <- lm(Y ~ ., data = data)

# Summary of Model #
summary(Model1)
## 
## Call:
## lm(formula = Y ~ ., data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.1284 -1.0242 -0.1691  1.0534  3.7226 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.216223   0.130171   9.343  < 2e-16 ***
## age          0.007429   0.001238   5.999 2.13e-09 ***
## gender2     -0.172061   0.040447  -4.254 2.14e-05 ***
## race2       -0.149272   0.089537  -1.667 0.095549 .  
## race3       -0.037206   0.070686  -0.526 0.598664    
## race4       -0.267549   0.074498  -3.591 0.000332 ***
## race6        0.224105   0.082830   2.706 0.006842 ** 
## race7       -0.318729   0.108030  -2.950 0.003189 ** 
## familysize  -0.010763   0.013661  -0.788 0.430817    
## education2   0.253833   0.097326   2.608 0.009134 ** 
## education3   0.686155   0.088631   7.742 1.19e-14 ***
## education4   1.207475   0.087220  13.844  < 2e-16 ***
## education5   2.298127   0.090588  25.369  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.393 on 4761 degrees of freedom
##   (4480 observations deleted due to missingness)
## Multiple R-squared:  0.2466, Adjusted R-squared:  0.2447 
## F-statistic: 129.9 on 12 and 4761 DF,  p-value: < 2.2e-16
# Multiple Linear Regression #
# Y = 1.216 + 0.007*age - 0.172*gender2 - 0.149*race2 - 0.037*race3 - 0.268*race4 + 0.224*race6 - 0.319*race7 - 0.011*familysize + 0.254*education2 + 0.686*education3 + 1.207*education4 + 2.298*education5 #

# Part 1 #

# At the 5% level, the predictors significant for the dependent variable are age, gender2, race4, race6, race7, and every education level. Age and the education levels have positive coefficients, while gender2 and most race categories have negative ones. The magnitudes grow much steeper once education comes into play: education2 (up to 12th grade, no diploma) and education3 (high school diploma/GED) raise the predicted family income-to-poverty ratio by about 0.25 and 0.69, respectively, and college (education4 & education5) increases this further, by about 1.21 and 2.30. #
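
# As a worked example of reading this equation (a sketch; the profile below is
# hypothetical): predicted income-to-poverty ratio for a 40-year-old male
# (gender = 1, the reference level), race = 3, family of 3, education = 4. #
example_row <- data.frame(
  age = 40,
  gender = factor(1, levels = levels(data$gender)),
  race = factor(3, levels = levels(data$race)),
  familysize = 3,
  education = factor(4, levels = levels(data$education))
)
predict(Model1, newdata = example_row)
# By hand: 1.216 + 0.007429*40 - 0.037 - 0.011*3 + 1.207, approximately 2.65 #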

# Part 2 #

# Based on the results, the model's ability to explain the variability in the dependent variable is subpar. The R-squared value shows that the independent variables account for only about 24.7% of the variability in the dependent variable, and the adjusted R-squared lowers this slightly, to about 24.5%. #
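
# The quoted figures can be pulled straight from the summary object (optional): #
summary(Model1)$r.squared       # approximately 0.2466
summary(Model1)$adj.r.squared   # approximately 0.2447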

## Question #3 ##
# Linear Regression with Interaction Terms #
Model2 <- lm(Y ~ . + age*familysize, data=data)

summary(Model2)
## 
## Call:
## lm(formula = Y ~ . + age * familysize, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.1949 -1.0183 -0.1789  1.0493  3.7949 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     1.716886   0.160308  10.710  < 2e-16 ***
## age            -0.003421   0.002384  -1.435 0.151282    
## gender2        -0.164542   0.040356  -4.077 4.63e-05 ***
## race2          -0.153364   0.089285  -1.718 0.085916 .  
## race3          -0.024339   0.070525  -0.345 0.730026    
## race4          -0.270746   0.074288  -3.645 0.000271 ***
## race6           0.210405   0.082633   2.546 0.010920 *  
## race7          -0.321319   0.107722  -2.983 0.002870 ** 
## familysize     -0.204506   0.038871  -5.261 1.49e-07 ***
## education2      0.305941   0.097541   3.137 0.001720 ** 
## education3      0.738046   0.088914   8.301  < 2e-16 ***
## education4      1.243412   0.087233  14.254  < 2e-16 ***
## education5      2.327033   0.090492  25.715  < 2e-16 ***
## age:familysize  0.004135   0.000777   5.322 1.08e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.389 on 4760 degrees of freedom
##   (4480 observations deleted due to missingness)
## Multiple R-squared:  0.2511, Adjusted R-squared:  0.249 
## F-statistic: 122.8 on 13 and 4760 DF,  p-value: < 2.2e-16
# Multiple Linear Regression #
# Y = 1.717 - 0.003*age - 0.165*gender2 - 0.153*race2 - 0.024*race3 - 0.271*race4 + 0.210*race6 - 0.321*race7 - 0.205*familysize + 0.306*education2 + 0.738*education3 + 1.243*education4 + 2.327*education5 + 0.004*age*familysize #

# Part 1 #
# Once the age-by-familysize interaction was incorporated, most coefficients shifted modestly. The largest changes are in the interacting terms themselves: the age main effect flipped sign (from +0.007 to -0.003) and the familysize main effect dropped sharply (from -0.011 to -0.205), while the interaction term is positive (+0.004) and highly significant. #
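
# With the interaction, the marginal effect of age depends on family size:
# slope(age) = -0.003421 + 0.004135*familysize, which is already slightly
# positive for a family of one and grows with family size. A sketch of the
# implied slope by family size (assuming sizes run 1 through 7, as in NHANES): #
b2 <- coef(Model2)
data.frame(familysize = 1:7,
           age_slope  = b2['age'] + b2['age:familysize'] * 1:7)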

# Part 2 #
# Upon comparison, Model2 is better: both its R-squared (0.2511 vs. 0.2466) and its adjusted R-squared (0.2490 vs. 0.2447) are higher than Model1's. #

# Part 3: ANOVA between Models 1 and 2 #
anova(Model2, Model1)
## Analysis of Variance Table
## 
## Model 1: Y ~ age + gender + race + familysize + education + age * familysize
## Model 2: Y ~ age + gender + race + familysize + education
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
## 1   4760 9177.3                                  
## 2   4761 9231.9 -1   -54.602 28.321 1.075e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# ANOVA F-test comparing Model1 and Model2: the interaction term is statistically significant (F = 28.32, p-value = 1.075e-07), well below even the 0.1% level. Model2 therefore fits significantly better than Model1. #
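
# The F statistic can be reproduced from the residual sums of squares reported
# above (a sketch): F = ((RSS1 - RSS2)/1) / (RSS2/df2). #
rss1 <- sum(residuals(Model1)^2)   # approximately 9231.9
rss2 <- sum(residuals(Model2)^2)   # approximately 9177.3
((rss1 - rss2) / 1) / (rss2 / df.residual(Model2))   # approximately 28.32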

# Part 5 #
# Prediction #
set.seed(0)

size = 20 # sample size
age.mean <- mean(data$age)
age.lwr <- mean(data$age) - sd(data$age) *1
age.upr <- mean(data$age) + sd(data$age) *1


new_data <- data.frame(
  age = sample(c(age.mean, age.lwr, age.upr), replace = T, size=size),
  gender = factor(rep(2,size)),
  race = factor(rep(4, size)),
  familysize = sample(unique(data$familysize), replace=T, size = size),
  education = factor(rep(3, size))
  )

predictions1 <- predict(object = Model2, 
                       newdata = new_data, 
                       interval = 'confidence', 
                       level = .95
                       ) %>% as_tibble()

# Part 6 #
predictions1 %>% 
  ggplot() + 
  geom_line(aes(x=1:size, y=fit), lty=1, color='steelblue') +
  geom_ribbon(aes(x=1:size, ymin=lwr, ymax=upr), alpha=0.4) +
  theme_light() + 
  theme(plot.title = element_text(face='bold'),
        panel.grid = element_blank(), 
        legend.position = 'top', 
        legend.title  = element_blank()) + 
  labs(x='Number of Samples', y='Predictions', 
       title = 'Model2 Predictions with 95% Confidence Intervals')
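
# Note: these are confidence intervals for the mean response. Intervals for an
# individual observation would be noticeably wider (a sketch): #
predict(Model2, newdata = new_data, interval = 'prediction', level = .95) %>%
  as_tibble() %>% head()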

## Question #4 ##

# Part 1 #

# Model 3 #
Model3 <- lm(Y ~ . + gender*race, data=data)

summary(Model3)
## 
## Call:
## lm(formula = Y ~ . + gender * race, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.2164 -1.0220 -0.1766  1.0571  3.8260 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    1.227797   0.139393   8.808  < 2e-16 ***
## age            0.007430   0.001238   6.002 2.10e-09 ***
## gender2       -0.209942   0.114653  -1.831  0.06715 .  
## race2         -0.185423   0.129718  -1.429  0.15294    
## race3         -0.088081   0.098893  -0.891  0.37315    
## race4         -0.179685   0.105991  -1.695  0.09009 .  
## race6          0.099187   0.118732   0.835  0.40354    
## race7         -0.335456   0.148248  -2.263  0.02369 *  
## familysize    -0.009647   0.013662  -0.706  0.48013    
## education2     0.257259   0.097440   2.640  0.00831 ** 
## education3     0.688911   0.088878   7.751 1.11e-14 ***
## education4     1.215547   0.087472  13.896  < 2e-16 ***
## education5     2.307029   0.090892  25.382  < 2e-16 ***
## gender2:race2  0.068280   0.178509   0.382  0.70211    
## gender2:race3  0.098647   0.132417   0.745  0.45632    
## gender2:race4 -0.169518   0.142606  -1.189  0.23461    
## gender2:race6  0.226825   0.157287   1.442  0.14934    
## gender2:race7  0.028425   0.212395   0.134  0.89354    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.392 on 4756 degrees of freedom
##   (4480 observations deleted due to missingness)
## Multiple R-squared:  0.2482, Adjusted R-squared:  0.2455 
## F-statistic: 92.37 on 17 and 4756 DF,  p-value: < 2.2e-16
# Multiple Linear Regression #
# Y = 1.228 + 0.007*age - 0.210*gender2 - 0.185*race2 - 0.088*race3 - 0.180*race4 + 0.099*race6 - 0.335*race7 - 0.010*familysize + 0.257*education2 + 0.689*education3 + 1.216*education4 + 2.307*education5 + 0.068*gender2*race2 + 0.099*gender2*race3 - 0.170*gender2*race4 + 0.227*gender2*race6 + 0.028*gender2*race7 #

# Part 1 (continued) #
# The coefficients change only minimally from Model1 to Model3. The gender2 and race main effects shift somewhat because, with the interaction present, they describe effects at the reference levels rather than overall effects. #
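
# Because of the interaction, the gender2 coefficient applies only to race1
# (the reference level); the gender gap within any other race category adds the
# matching interaction term. A sketch for race4: #
b3 <- coef(Model3)
b3['gender2'] + b3['gender2:race4']   # approximately -0.379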

# Part 2 #
# Comparing the adjusted R-squared values, Model3 (0.2455) edges out Model1 (0.2447), though the improvement is slight, roughly 0.1 percentage point. #

# Part 3 #
# Compare Model 3 and Model1 #
anova(Model3, Model1)
## Analysis of Variance Table
## 
## Model 1: Y ~ age + gender + race + familysize + education + gender * race
## Model 2: Y ~ age + gender + race + familysize + education
##   Res.Df    RSS Df Sum of Sq      F  Pr(>F)  
## 1   4756 9212.4                              
## 2   4761 9231.9 -5   -19.519 2.0154 0.07329 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Part 3 (continued) #
# ANOVA F-test: the gender-by-race interaction terms are not jointly significant at the 5% level (F = 2.02, p-value = 0.073), so Model3 does not fit meaningfully better than Model1. #
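
# An alternative comparison via information criteria (optional; all three models
# drop the same 4480 incomplete rows, so the AIC values are comparable; lower is
# better): #
AIC(Model1, Model2, Model3)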

# Part 4 #
# Prediction #
set.seed(0)

# New rows with age fixed at its mean, education = 3, and familysize = 4 #
new_data <- data.frame(
  age = rep(age.mean, size), 
  gender = factor(sample(unique(data$gender), replace=T, size=size)),
  race = factor(sample(unique(data$race), replace=T, size=size)),
  familysize = rep(4, size),
  education = factor(rep(3, size))
)
  

predictions2 <- predict(object = Model3, 
                       newdata = new_data, 
                       interval = 'confidence', 
                       level = .95) %>% as_tibble()

# Part 5 #
predictions2 %>% 
  ggplot() + 
  geom_col(aes(x=1:size, fit), fill='steelblue', width=0.7) + 
  geom_errorbar(aes(x=1:size, ymin=lwr, ymax=upr), alpha=0.6,width=0.5) + 
  theme_light() + 
  theme(panel.grid = element_blank(), 
        plot.title = element_text(face='bold')) + 
  labs(title='Model3 Predictions with 95% Confidence Intervals', 
       y='Predictions', x='Number of samples')