library("tidyverse")
library("dplyr")
library("htmltools")
library("forcats")
library("stargazer")
library("sjPlot")
library("car")

1 Describe your substantive interest and the general questions(s) you would like to answer (eg, “Does more education cause people to become more liberal?”). Be sure to frame it in a such a way that you are proposing a hypothesis (or multiple hypotheses) that might be either confirmed or disproven by the results of your analysis. 2 Describe the data set you have found, including its source, its contents, and why it was collected originally. 3 What is your dependent variable? Why are you interested in explaining it? What do you hypothesize are the major factors that influence or cause it? 4 What are your independent variables, and why have you chosen these? Prior to running your regression, what effects do you expect them to have on the dependent variable? Which of these variables do you think a˙ect other of the independent variables, and how might that a˙ect your final results? 5 Explain and show in detail how you rename and recode the variables you are examining, and what units each are measured in. 6 Before running a multiple regression, run a few bivariate regressions of Y on some of your X variables. What do you infer? Which of these do you think might change with the addition of multiple variables? 7 Run your full multiple regression using lm() and present your results using the output from the stargazer R package. Interpret the coefficients. What do they tell you substantively? Which variables seem to have the biggest substantive impact? Which ones could you actually change with some intervention, and how big a di˙erence do you think that could make? 8 How have any of the coefficients changed from the bivariate regressions? What can you infer from that?How do you think your various independent variables interact and a˙ect each other? Try to find an example where a variable appears signficant in the bivariate regression, but not in the full regression. Is this an example of a spurious or a chained causal pathway? 9 How does what you see match, or not, your hypotheses from (4)? Why did/didn’t it match what you expected? 10 What do the R2 and adjusted R2 tell you about your model? 11 How would you use one of the variable selection methods to choose a model with fewer variables? Select one of the methods (either one of the stepwise or criterion-based methods) and show which variables it would lead you to keep. Do you agree with its results? 12 What are your overall conclusions? What are the weaknesses of your results, and how could you improve them with better or different data? 13 Calculations (using R): a. Derive the coeÿcients from your regression using the (X0X)−1X0Y formula. (If you run into problems using solve(), try using ginv() instead, which does the same thing but is a bit more robust.) b. For one of the coeÿcients, confirm its p value as shown in the regression output using the coeÿcient, its standard error, and pt() in R. c. Calculate the R2 and adjusted R2 using R, and confirm that your results match the regression output. d. Calulate the F statistic using R and confirm it against the regression output. 14 Add at least one quadratic term into your model and interpret the results. Is it significant? What is the effect of a 1-unit increase in that variable at its mean value? 15 Add at least one interaction term to you model and interpret the results. Is it significant? What is the effect of a 1-unit increase in one of those interacted variables holding the other at its mean value? 16 Test either the model in 14 or the model in 15 using the F test for nested models. That is, estimate the full model with the variable and quadratic term, or the variable and interaction, and then estimate the reduced model without either, and run the F test to establish whether those variables significantly improve your model.

Read Data Set 1

US <- read_csv("C:\\Users\\Stephen\\Documents\\Northeastern\\PPUA 5301 - Introduction to Computational Statistics\\Homework\\Homework 8_9\\Nutrition__Physical_Activity__and_Obesity_-_Behavioral_Risk_Factor_Surveillance_System.csv")
# names(US)
(Qs <- unique(US$Question))
## [1] "Percent of adults aged 18 years and older who have obesity"                                                                                                                                                                                           
## [2] "Percent of adults aged 18 years and older who have an overweight classification"                                                                                                                                                                      
## [3] "Percent of adults who report consuming fruit less than one time daily"                                                                                                                                                                                
## [4] "Percent of adults who report consuming vegetables less than one time daily"                                                                                                                                                                           
## [5] "Percent of adults who engage in muscle-strengthening activities on 2 or more days a week"                                                                                                                                                             
## [6] "Percent of adults who achieve at least 150 minutes a week of moderate-intensity aerobic physical activity or 75 minutes a week of vigorous-intensity aerobic activity (or an equivalent combination)"                                                 
## [7] "Percent of adults who achieve at least 150 minutes a week of moderate-intensity aerobic physical activity or 75 minutes a week of vigorous-intensity aerobic physical activity and engage in muscle-strengthening activities on 2 or more days a week"
## [8] "Percent of adults who achieve at least 300 minutes a week of moderate-intensity aerobic physical activity or 150 minutes a week of vigorous-intensity aerobic activity (or an equivalent combination)"                                                
## [9] "Percent of adults who engage in no leisure-time physical activity"

Xform Dataset 1

US_Hlth <- US %>% filter(YearStart == 2015 & Question %in% c("Percent of adults who engage in no leisure-time physical activity", 
    "Percent of adults who achieve at least 150 minutes a week of moderate-intensity aerobic physical activity or 75 minutes a week of vigorous-intensity aerobic activity (or an equivalent combination)", 
    "Percent of adults aged 18 years and older who have an overweight classification", 
    "Percent of adults who report consuming vegetables less than one time daily") & 
    is.na(Gender) == F & !LocationDesc %in% c("Guam", "Puerto Rico", "National")) %>% 
    select(YearStart, LocationDesc, Question, Data_Value, Gender) %>% group_by(LocationDesc, 
    Question) %>% arrange(LocationDesc, Question) %>% summarize(Perc = mean(Data_Value))
colnames(US_Hlth) <- c("ST", "Hlth_M", "Perc")
US_Hlth$Hlth_M <- as.factor(US_Hlth$Hlth_M)
US_Hlth$Hlth_M <- US_Hlth$Hlth_M %>% fct_recode(`0` = "Percent of adults who engage in no leisure-time physical activity", 
    `150` = "Percent of adults who achieve at least 150 minutes a week of moderate-intensity aerobic physical activity or 75 minutes a week of vigorous-intensity aerobic activity (or an equivalent combination)", 
    Veg = "Percent of adults who report consuming vegetables less than one time daily", 
    Wgt = "Percent of adults aged 18 years and older who have an overweight classification")
US_Hlth <- US_Hlth %>% ungroup() %>% filter(Hlth_M %in% c("0", "150", "Veg", "Wgt")) %>% 
    spread("Hlth_M", "Perc")

Xform Dataset 1 - Revised

US_Hlth <- US %>% filter(YearStart == 2015 & is.na(Gender) == F & !LocationDesc %in% 
    c("Guam", "Puerto Rico", "National")) %>% select(YearStart, LocationDesc, Question, 
    Data_Value, Gender) %>% group_by(LocationDesc, Question) %>% arrange(LocationDesc, 
    Question) %>% summarize(Perc = mean(Data_Value))
colnames(US_Hlth) <- c("ST", "Hlth_M", "Perc")
US_Hlth$Hlth_M <- as.factor(US_Hlth$Hlth_M)
US_Hlth$Hlth_M <- US_Hlth$Hlth_M %>% fct_recode(wgtObs = "Percent of adults aged 18 years and older who have obesity", 
    wgtOvr = "Percent of adults aged 18 years and older who have an overweight classification", 
    conFru = "Percent of adults who report consuming fruit less than one time daily", 
    conVeg = "Percent of adults who report consuming vegetables less than one time daily", 
    actWL = "Percent of adults who engage in muscle-strengthening activities on 2 or more days a week", 
    act150 = "Percent of adults who achieve at least 150 minutes a week of moderate-intensity aerobic physical activity or 75 minutes a week of vigorous-intensity aerobic activity (or an equivalent combination)", 
    act150WL = "Percent of adults who achieve at least 150 minutes a week of moderate-intensity aerobic physical activity or 75 minutes a week of vigorous-intensity aerobic physical activity and engage in muscle-strengthening activities on 2 or more days a week", 
    act300 = "Percent of adults who achieve at least 300 minutes a week of moderate-intensity aerobic physical activity or 150 minutes a week of vigorous-intensity aerobic activity (or an equivalent combination)", 
    act0 = "Percent of adults who engage in no leisure-time physical activity")
US_Hlth <- US_Hlth %>% ungroup() %>% spread("Hlth_M", "Perc")

Read Dataset 2

Party <- tibble::tribble(~State, ~Republican.lean.Rep., ~No.lean, ~Democrat.lean.Dem., 
    ~Sample.Size, "Alabama", "52%", "13%", "35%", 511, "Alaska", "39%", "29%", "32%", 
    310, "Arizona", "40%", "21%", "39%", 653, "Arkansas", "46%", "16%", "38%", 311, 
    "California", "30%", "21%", "49%", 3697, "Colorado", "41%", "17%", "42%", 504, 
    "Connecticut", "32%", "18%", "50%", 377, "Delaware", "29%", "17%", "55%", 301, 
    "District of Columbia", "11%", "15%", "73%", 303, "Florida", "37%", "19%", "44%", 
    2020, "Georgia", "41%", "18%", "41%", 968, "Hawaii", "28%", "20%", "51%", 312, 
    "Idaho", "49%", "19%", "32%", 320, "Illinois", "33%", "19%", "48%", 1326, "Indiana", 
    "42%", "20%", "37%", 654, "Iowa", "41%", "19%", "40%", 330, "Kansas", "46%", 
    "23%", "31%", 307, "Kentucky", "44%", "13%", "43%", 439, "Louisiana", "41%", 
    "16%", "43%", 465, "Maine", "36%", "17%", "47%", 303, "Maryland", "31%", "14%", 
    "55%", 644, "Massachusetts", "27%", "17%", "56%", 704, "Michigan", "34%", "19%", 
    "47%", 982, "Minnesota", "39%", "15%", "46%", 563, "Mississippi", "44%", "14%", 
    "42%", 309, "Missouri", "41%", "18%", "42%", 642, "Montana", "49%", "21%", "30%", 
    312, "Nebraska", "47%", "17%", "36%", 312, "Nevada", "37%", "18%", "46%", 314, 
    "New Hampshire", "35%", "20%", "44%", 303, "New Jersey", "30%", "19%", "51%", 
    886, "New Mexico", "37%", "15%", "48%", 312, "New York", "28%", "19%", "53%", 
    1966, "North Carolina", "41%", "17%", "43%", 1022, "North Dakota", "50%", "18%", 
    "33%", 338, "Ohio", "42%", "18%", "40%", 1132, "Oklahoma", "45%", "15%", "40%", 
    391, "Oregon", "32%", "21%", "47%", 419, "Pennsylvania", "39%", "15%", "46%", 
    1366, "Rhode Island", "30%", "22%", "48%", 305, "South Carolina", "43%", "18%", 
    "39%", 495, "South Dakota", "53%", "10%", "37%", 305, "Tennessee", "48%", "15%", 
    "36%", 661, "Texas", "39%", "21%", "40%", 2535, "Utah", "54%", "16%", "30%", 
    315, "Vermont", "29%", "14%", "57%", 306, "Virginia", "43%", "18%", "39%", 882, 
    "Washington", "33%", "23%", "44%", 714, "West Virginia", "43%", "16%", "41%", 
    309, "Wisconsin", "42%", "16%", "42%", 600, "Wyoming", "57%", "18%", "25%", 316)

Xform Dataset 2

Party <- lapply(Party, gsub, pattern = "\\%", replacement = "")
Party <- as.data.frame(Party, stringsAsFactors = F, )
reclass <- function(df, vec) {
    df[] <- Map(function(x, f) {
        # switch below shows the accepted values in the vector you can modify it and/or
        # add more
        f <- switch(f, as.is = "force", factor = "as.factor", num = "as.numeric", 
            char = "as.character")
        # takes the name of the function and fetches the function
        f <- get(f)
        # apply the function
        f(x)
    }, df, vec)
    df
}
Party <- reclass(Party, c("char", "num", "num", "num", "num"))
names(Party)
## [1] "State"                "Republican.lean.Rep." "No.lean"             
## [4] "Democrat.lean.Dem."   "Sample.Size"
colnames(Party) <- c("ST", "R", "N", "D", "ns")

Join Tables for Analysis

Pty_Hlth <- left_join(Party, US_Hlth, by = "ST")
rm(US_Hlth, Party)

Prelim Analysis - Revised

Pty_Hlth <- Pty_Hlth %>% arrange(R)
ggact <- ggplot(data = Pty_Hlth, mapping = aes(x = reorder(ST, R), y = R)) + geom_point() + 
    geom_point(mapping = aes(y = actWL), color = "red", size = 2) + geom_point(mapping = aes(y = act150), 
    color = "orange", size = 2) + geom_point(mapping = aes(y = act150WL), color = "#C42B2D", 
    size = 2) + geom_point(mapping = aes(y = act300), color = "blue", size = 2) + 
    geom_point(mapping = aes(y = act0), color = "black", size = 2) + theme(axis.text.x = element_text(angle = 90, 
    hjust = 1, lineheight = 1))
ggwgt <- ggplot(data = Pty_Hlth, mapping = aes(x = reorder(ST, R), y = R)) + geom_point() + 
    geom_point(mapping = aes(y = wgtObs), color = "red", size = 2) + geom_point(mapping = aes(y = wgtOvr), 
    color = "#A32D2D", size = 2) + theme(legend.position = "none", axis.text.x = element_text(angle = 90, 
    hjust = 1, lineheight = 1))
ggcon <- ggplot(data = Pty_Hlth, mapping = aes(x = reorder(ST, R), y = R)) + geom_point() + 
    geom_point(mapping = aes(y = conFru), color = "red", size = 2) + geom_point(mapping = aes(y = conVeg), 
    color = "#B88037", size = 2) + theme(legend.position = "none", axis.text.x = element_text(angle = 90, 
    hjust = 1, lineheight = 1))
# The Wgt variable is obviously not correlated, so this will be left out of
# further analyses
ggact

ggwgt

ggcon

1

Describe your substantive interest and the general questions(s) you would like to answer (eg, “Does more education cause people to become more liberal?”). Be sure to frame it in a such a way that you are proposing a hypothesis (or multiple hypotheses) that might be either confirmed or disproven by the results of your analysis.

Selecting the data for comparison was quite difficult given the restraints of having the data not include temporal or categorical data. All of my interests included data that had a temporal index or categorical data involved. Resolving to include health related measures, I decided to continue with the inquiry from class about partisanship, and combine these two data sets to allow for correlation between these topics. The data within fits the restraints and might yield interesting correlations.
The hypothesis is a question of whether there is a correlation between various health behaviors and party affiliation within states. After Revision
\(H_1\) Is the “Percent of adults aged 18 years and older who have obesity” correlated with the % identifying as Republican? If so, what is the sign of \(\beta_1\)?
\(H_2\) Is the “Percent of adults who report consuming fruit less than one time daily” correlated with the % identifying as Republican? If so, what is the sign of \(\beta_2\)?
\(H_3\) Is the “Percent of adults who achieve at least 300 minutes a week of moderate-intensity aerobic physical activity or 150 minutes a week of vigorous-intensity aerobic activity (or an equivalent combination)” a predictor variable for the % identifying as Republican? If so what is the sign of \(\beta_3\)?

2

Describe the data set you have found, including its source, its contents, and why it was collected originally.

The data health behaviors and risk factors is a subset of the Nutrition, Physical Activity, and Obesity - Behavioral Risk Factor Surveillance System data set1 from the CDC. The data is gathered to track behavioral risk factors across different demographic strata by state.
The data about party affilation by state came from the Party affiliation by State (2014)2 data set provided by the Pew Research Center. This particular data was originally collected as part of the Religious Landscape Study3
1https://catalog.data.gov/dataset/nutrition-physical-activity-and-obesity-behavioral-risk-factor-surveillance-system
2http://www.pewforum.org/religious-landscape-study/compare/party-affiliation/by/state/
3http://www.pewforum.org/religious-landscape-study/

3

What is your dependent variable? Why are you interested in explaining it? What do you hypothesize are the major factors that influence or cause it?

The dependent variable is the percentage of the state pop. identified as Republican, or Republican leaning. The variable is complex with numerous influencing factors. During the in-class explorations we have looked at potential influencing factors such as race, income, and education level on party affiliation via data from the ANES survey.
Intuitively, party affiliation could also be influenced or predicted in part by religion, affiliation of parents, place of residence, affiliation of one’s peers in one’s social network. It could also be related to psychological traits, or, as we will explore, health behaviors.
In this study, we’ll be looking at the percentage of each state’s pop. exhibiting particular health behaviors, these behaviors will be described in further detail below. An state with a high rate of obesity wgtObs might also lack a system of education about health related information and/or a cultural bias that occludes access to quality health related information. The obesity rate could also be related to lack of access to health care services and medical professionals etc. With regard to whether or not the “Percent of adults who report consuming fruit less than one time daily” influences the percent of the state pop. identifying as Republican it is likely that the two are not related. Though the variable might not have much explanatory value, it was the only variable from the dietary behaviors questions that did graphically did not appear to exhibit colinearity with variables from the other classes.

4

What are your independent variables, and why have you chosen these? Prior to running your regression, what effects do you expect them to have on the dependent variable? Which of these variables do you think affect other of the independent variables, and how might that affect your final results?

Original
This brief analysis will look at whether there is correlation between the following independent variables:
  • \(x_1\): % of pop. living sedentary lifestyle
  • \(x_2\): % of pop. living lifestyle meeting minimum physical activity recommendations
  • \(x_3\): % of pop. that consumes vegetables less than 1 time daily
and the dependent variable \(y\): the identification as Republican.
There might be a positive correlation between the percent of a state’s pop. that exhibits health behaviors identified as risk factors \(x_1,x_3\) and the percent of a state’s pop. that identifies as Republican or Republican leaning. There is likely an inverse colinear relationship between \(x_1\) and \(x_2\) because they are mutually exclusive categories along a scale of 5 categories (not all of which are mutually exclusive.) See variable Qs, for the original categories. There could also be a correlation between states with a larger percentage of the population meeting minimum physical activity recommendations, and those whom are consuming vegetables more than 1 time a day, due to a theoretical common variable of health knowledge.
After Revision This brief analysis will look at whether there is correlation between the following independent variables:
  • \(x_1\): % of adults aged 18 years and older who have obesity
  • \(x_2\): % of adults who report consuming fruit less than one time daily
  • \(x_3\): % of adults who achieve at least 300 minutes a week of moderate-intensity aerobic physical activity or 150 minutes a week of vigorous-intensity aerobic activity (or an equivalent combination)

and the dependent variable \(y\): the percent of the state pop. identifying as Republican. The % of adults aged 18 years and older who have obesity (\(x_1\)) might correlate positively with the % of the state pop. identified as Republican. The % of adults who report consuming fruit less than one time daily will likely have no correlation with % of the state pop. identified as Republican. There might be a negative correlation between % of adults who achieve at least 300 minutes a week of moderate-intensity aerobic physical activity or 150 minutes a week of vigorous-intensity aerobic activity (or an equivalent combination). However, the latter two variables are likely more related to the latitude, and longitude of a state (as a predictor of outdoor weather patterns conducive to outdoor activity), as well as the popularity and diversity of various athletic recreational activities within the state.

5

Explain and show in detail how you rename and recode the variables you are examining, and what units each are measured in.

The variables have been recoded in the chunks entitled “Xform Dataset 1 - Revised” & “Xform Dataset 2.” The renaming and recoding procedures are explained below:
Dataset 1
The data is filtered for the year 2015, the only year with data for each question that is closest to the year in which the data was gathered in dataset 2, the relevant questions, the 50 states plus the District of Columbia (as in dataset 2), and the gender stratification. The “Questions” column from the original dataset is renamed Hlth_M (Health Measures). The percentages of the pop. reporting a yes to the question are filtered by the gender stratification, such that only 2 percentages need be averaged to get a single % for the entire population. The answers to the questions are then recoded using the forcats package, see the chunk entitled Xform Dataset 1 - Revised for details on variable names. The factors are then spread into columns.
Dataset 2
The dataset variables were transformed as follows:
  1. R=“Republican/Republican Leaning”
  2. D=“Democrat/Democrat Leaning”
  3. N=“No affiliation”

The variables were reclassed to characters or numbers where necessary. The tables are joined by State. All variables are measured in % of state population.

6

Before running a multiple regression, run a few bivariate regressions of Y on some of your X variables. What do you infer? Which of these do you think might change with the addition of multiple variables?

These linear regressions without results contained in this blockquote were used to identify that the variables in question were colinear. The assignment was then revised. The revised answer to this question begins where the blockquote ends.
Upon further reading, I had an incorrect method for attempting to spot multicolinearity, and the revision may not have even been necessary. 2-Var Regressions

r0 <- summary(lm(R ~ `0`, Pty_Hlth))
r150 <- summary(lm(R ~ `150`, Pty_Hlth))
rv <- summary(lm(R ~ Veg, Pty_Hlth))
`150v0` <- summary(lm(`0` ~ `150`, Pty_Hlth))
150v0

Before running a multiple regression, run a few bivariate regressions of Y on some of your X variables. What do you infer? Which of these do you think might change with the addition of multiple variables?
While there is statistical significance at the 95% confidence level in the correlations between R and 150, and Veg, it can also be noted the \(R^2\) values indicate that little of the variation is explained by the model. As variables are added in a multiple variable regression, the \(R^2\) value will inevitably increase, though it is important to note the low \(R^2\) in these 2-Variable models.
The comparison of 0 and 150 show that the variables are indeed colinear, and will be confounding variables if combined. It may be necessary to add an additional health variable that is not colinear with any of the other variables.

Add Statistic on Obesity for Multiple Linear Regression

USob <- US %>% filter(YearStart == 2015 & Question %in% c("Percent of adults aged 18 years and older who have obesity") & 
    is.na(Gender) == F & !LocationDesc %in% c("Guam", "Puerto Rico", "National")) %>% 
    select(YearStart, LocationDesc, Question, Data_Value, Gender) %>% group_by(LocationDesc, 
    Question) %>% arrange(LocationDesc, Question) %>% summarize(Perc = mean(Data_Value))
USob <- USob[, -2]
names(USob) <- c("ST", "Ob")
Pty_Hlth <- left_join(Pty_Hlth, USob, by = "ST")
ROb <- summary(lm(R ~ Ob, Pty_Hlth))
ROb
rm(USob)
`0Ob` <- summary(lm(`0` ~ Ob, Pty_Hlth))
0Ob

It appears that obesity is also colinear with a sedentary lifestyle, logically, this linkage makes sense. Let’s have a look at the fruit variable. Add Fruit Variable

USF <- US %>% filter(YearStart == 2015 & Question %in% c("Percent of adults who report consuming fruit less than one time daily") & 
    is.na(Gender) == F & !LocationDesc %in% c("Guam", "Puerto Rico", "National")) %>% 
    select(YearStart, LocationDesc, Question, Data_Value, Gender) %>% group_by(LocationDesc, 
    Question) %>% arrange(LocationDesc, Question) %>% summarize(Perc = mean(Data_Value))
USF <- USF[, -2]
names(USF) <- c("ST", "Fruit")
Pty_Hlth <- left_join(Pty_Hlth, USF, by = "ST")
(VF <- summary(lm(Veg ~ Fruit, Pty_Hlth)))
# Colinear
(F150 <- summary(lm(Fruit ~ `150`, Pty_Hlth)))
# Colinear
(FOb <- summary(lm(Fruit ~ Ob, Pty_Hlth)))
# Colinear```

Healthy lifestyle traits show high degrees of correlation, while risk factors are also highly correlated. Add Weight Lifting Variable

USWl <- US %>% filter(YearStart == 2015 & Question %in% c("Percent of adults who engage in muscle-strengthening activities on 2 or more days a week") & 
    is.na(Gender) == F & !LocationDesc %in% c("Guam", "Puerto Rico", "National")) %>% 
    select(YearStart, LocationDesc, Question, Data_Value, Gender) %>% group_by(LocationDesc, 
    Question) %>% arrange(LocationDesc, Question) %>% summarize(Perc = mean(Data_Value))
USWl <- USWl[, -2]
names(USWl) <- c("ST", "Wl")
Pty_Hlth <- left_join(Pty_Hlth, USWl, by = "ST")
(VFWl <- summary(lm(Veg ~ Wl, Pty_Hlth)))
# Colinear
(Wl150 <- summary(lm(Wl ~ `150`, Pty_Hlth)))
# Colinear
(WlOb <- summary(lm(Wl ~ Ob, Pty_Hlth)))
# Colinear
It appears that it will be necessary to start back at the beginning, extract all of the health related questions, spread them to individual variables and graph them. By graphing all of the variables it may allow them to be differentiated to avoid colinearity.
After Revision
After revision, the three variables that appear not to be colinear are:
  1. “% of adults aged 18 years and older who have obesity”, wgtObs
  2. “% of adults who report consuming fruit less than one time daily”, conFru
  3. “% of adults who achieve at least 300 minutes a week of moderate-intensity aerobic physical activity or 150 minutes a week of vigorous-intensity aerobic activity (or an equivalent combination)”, act300

2-Variable Regressions - Revised

Obs150 <- lm(wgtObs ~ act150, Pty_Hlth)
ObsFru <- lm(wgtObs ~ conFru, Pty_Hlth)
Obs300 <- lm(wgtObs ~ act300, Pty_Hlth)
Fru300 <- lm(conFru ~ act300, Pty_Hlth)
RObs <- lm(R ~ wgtObs, Pty_Hlth)
RFru <- lm(R ~ conFru, Pty_Hlth)
R300 <- lm(R ~ act300, Pty_Hlth)

sjt.lm(Obs150)
    wgtObs
    B CI p
(Intercept)   57.96 49.78 – 66.13 <.001
act150   -0.56 -0.71 – -0.40 <.001
Observations   51
R2 / adj. R2   .506 / .496
sjt.lm(ObsFru)
    wgtObs
    B CI p
(Intercept)   2.67 -3.10 – 8.44 .357
conFru   0.65 0.51 – 0.79 <.001
Observations   51
R2 / adj. R2   .639 / .632
sjt.lm(Obs300)
    wgtObs
    B CI p
(Intercept)   47.99 41.21 – 54.77 <.001
act300   -0.58 -0.79 – -0.37 <.001
Observations   51
R2 / adj. R2   .390 / .377
sjt.lm(Fru300)
    conFru
    B CI p
(Intercept)   62.43 53.74 – 71.11 <.001
act300   -0.67 -0.94 – -0.40 <.001
Observations   51
R2 / adj. R2   .339 / .326
sjt.lm(RFru)
    R
    B CI p
(Intercept)   1.64 -16.27 – 19.55 .855
conFru   0.92 0.48 – 1.35 <.001
Observations   51
R2 / adj. R2   .269 / .254
sjt.lm(RObs)
    R
    B CI p
(Intercept)   7.32 -8.67 – 23.31 .362
wgtObs   1.09 0.55 – 1.63 <.001
Observations   51
R2 / adj. R2   .250 / .235
sjt.lm(R300)
    R
    B CI p
(Intercept)   53.97 35.52 – 72.41 <.001
act300   -0.46 -1.03 – 0.11 .112
Observations   51
R2 / adj. R2   .051 / .031

The \(R^2\) values appear to be low enough where the degree of multicollinearity will not be so substantial as to render the predictive value of the model inaccurate. With the addition of multiple variables, the signficance of the individual health behavioral variables will likely decrease. The \(R^2\) will also decrease as the amount of variance will be cumulative as more variables are added to the model.

7

Run your full multiple regression using lm() and present your results using the output from the stargazer R package.
  1. Interpret the coefficients. What do they tell you substantively?
  2. Which variables seem to have the biggest substantive impact?
  3. Which ones could you actually change with some intervention, and how big a difference do you think that could make?

7

RObs300Fru <- lm(R ~ wgtObs + conFru + act300, Pty_Hlth)
vif(RObs300Fru)
##   wgtObs   conFru   act300 
## 3.100245 2.863585 1.691883
sjt.lm(RObs300Fru)
    R
    B CI p
(Intercept)   -21.00 -60.24 – 18.23 .287
wgtObs   0.70 -0.23 – 1.64 .137
conFru   0.66 -0.07 – 1.39 .075
act300   0.39 -0.25 – 1.04 .226
Observations   51
R2 / adj. R2   .311 / .267

7 - Stargazer

stargazer(RObs300Fru, no.space = TRUE, dep.var.labels = c("% identified as Republican"), 
    omit.stat = c("ll"), header = FALSE)

a

The variables from the revision: wgtObs,conFru, and act300 are modeled, and the variables from the originally intended linear model: wgtObs,conVeg, and act150 are modeled for comparison purposes.

Coefficients
\[\begin{aligned} Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \epsilon \end{aligned}\] Where:

  • \(x_1\): % of adults aged 18 years and older who have obesity
  • \(x_2\): % of adults who report consuming fruit less than one time daily
  • \(x_3\): % of adults who achieve at least 300 minutes a week of moderate-intensity aerobic physical activity or 150 minutes a week of vigorous-intensity aerobic activity (or an equivalent combination)

The \(\beta\) coefficient for a given variable indicates that for a unit increment in that variable, the resulting value \(\hat{y}\) will be incremented by the value of \(\beta\).

b

Variables with Substantive Impact
The conFru variable in linear model RObs300Fru comes close to being significant at the 95% confidence level, though the adjusted \(R^2\) value for the model is .267. The wgtObs variable in the R150VegObs model is significant at the 99% confidence level, but the adjusted \(R^2\) value for the model is lower at .235. Given these outcomes, the model is not going to provide an accurate predictive model for the dependent variable.

c

Treatment Scenario to Control Independent Variables
Each of the independent variables are health behaviors, and/or risk factors. An intervention in influence the variable might be implementing mandatory nutrition & health education in grades K-12. If upon graduation, graduates were offered access to insurance coverage conditional with premiums ranging from deeply discounted to full price based on a spectrum of compliance with exercise and diet guidelines determined from statistical averages across the state, then you might be able to substantially drive down the rates of risk factors, and the percentages of people with risk-bearing health behaviors. If it was noted that a steep decline in the aforementioned variables did occur to the degree that it was statistically significant, it could be investigated whether there was also a significant drop in the % of people identifying as Republican. By controlling the health related variables more directly it will show whether party affiliation is effected.
Survey Intervention
If there was a survey conducted with a statistically significant sample of a given population (such as within a state) in which individuals were asked about party affiliation in addition to data on health behaviors, as in the CDC study, it would be a straightforward process to run statistical analyses on the health behavior/risk factor question answers stratified by party affiliation. Maybe the CDC would be up for adding a question to their survey?

8

  1. How have any of the coefficients changed from the bivariate regressions?
  2. What can you infer from that?
  3. How do you think your various independent variables interact and affect each other?
  4. Try to find an example where a variable appears significant in the bivariate regression, but not in the full regression. Is this an example of a spurious or a chained causal pathway?

a

8a

RFru$coefficients
## (Intercept)      conFru 
##   1.6424165   0.9182656
RObs$coefficients
## (Intercept)      wgtObs 
##    7.321541    1.090106
R300$coefficients
## (Intercept)      act300 
##  53.9657706  -0.4583004
RObs300Fru$coefficients
## (Intercept)      wgtObs      conFru      act300 
## -21.0009851   0.7021530   0.6611062   0.3921873

This coefficients have dropped in value from the bivariate to the multivariate. Also, we can see that including all three of these variables has mitigated the negative correlaton with act300 and actually made it a positive correlation. This does not seem to be a spurious or chained causal pathway, but rather the influence of two positively correlated values and one negatively correlated value included in the same linear model.

b

It can be inferred, that were variables with all positive correlations chosen for the multivariate regression, a more pronounced positive effect on the coefficients will be observed. To test this, we can add act0 - “Percent of adults who engage in no leisure-time physical activity” to the model. It is expected that due to the likely multicolinearity between this variable and wgtObs, we will observe larger coefficients, increased significance, but also have a high VIF. To test: 8b

RObsFru0 <- lm(R ~ wgtObs + conFru + act0, Pty_Hlth)
RObs300Fru$coefficients
## (Intercept)      wgtObs      conFru      act300 
## -21.0009851   0.7021530   0.6611062   0.3921873
summary(RObs300Fru)$adj.r.squared
## [1] 0.2671496
RObsFru0$coefficients
## (Intercept)      wgtObs      conFru        act0 
##  -1.2495710   0.7951514   0.8687442  -0.7133401
summary(RObsFru0)$adj.r.squared
## [1] 0.2992062
vif(RObsFru0)
##   wgtObs   conFru     act0 
## 3.085080 3.275400 2.446539
R0 <- lm(R ~ act0, Pty_Hlth)
R0$coefficients
## (Intercept)        act0 
##  25.4810285   0.5340354

As inferred, the coefficients did increase in value, as did the VIF scores. However, unexpectedly, the act0 coefficient is negative, even though it is positive in the bivariate regression with the dependent variable, and with its other IVs. The y-intercept shifted dramatically between the two models, which could indicate that when individuals in act300 are included, the % of individuals identifying as Republican starts at a much lower percent, and goes up by small amounts with each unit shift in the IVs, whereas when act0 is included, there is a larger % of those identifying as Republican, which goes down by 0.71 with each unit shift in X. Admittedly, these results are confusing, and this is just a conjecture.

c

8c

vif(RObs300Fru)
##   wgtObs   conFru   act300 
## 3.100245 2.863585 1.691883
VIF

According to this resource4, the VIF is a multiplier by which the variance could be inflated due to multicollinearity. From our VIF values, we can see that the variance could be significantly inflated in the models due to multicollinearity. Looking at the values for VIF between the models, it appears that the variables in R150VegObs (the originally proposed model) have more distributed multicollinearity between them than in the original model, though the multicollinearity between wgtObs and R in the revised model is more pronounced.

By evaluating the variables, it’s apparent that wgtObs and act300 are going to be mutually exclusive, and likely going to dampen the coefficients and significance of one another when included in the same model. We can confirm this by adding all variables that are likely to all be correlated:

8c - cont

R0VegObs <- lm(R ~ act0 + conVeg + wgtObs, Pty_Hlth)
vif(R0VegObs)
##     act0   conVeg   wgtObs 
## 2.564286 2.461472 2.433079
sjt.lm(R0VegObs)
    R
    B CI p
(Intercept)   6.99 -9.20 – 23.17 .390
act0   -0.59 -1.38 – 0.20 .139
conVeg   0.56 -0.36 – 1.48 .230
wgtObs   1.19 0.35 – 2.03 .006
Observations   51
R2 / adj. R2   .290 / .245

The p-value for the coefficient of wgtObs is now statistically significant. The coefficients are higher in value, and the \(R^2\) is similar to other models. What still remains confusing is the negative coefficient for act0, which when regressed with wgtObs and conVeg, and R individually, are all positive.

Veg0 <- lm(act0 ~ conVeg, Pty_Hlth)
Obs0 <- lm(act0 ~ wgtObs, Pty_Hlth)
Veg0$coefficients
## (Intercept)      conVeg 
##   6.4113886   0.8582117
Obs0$coefficients
## (Intercept)      wgtObs 
##    2.753649    0.784917
R0$coefficients
## (Intercept)        act0 
##  25.4810285   0.5340354
4https://statisticalhorizons.com/multicollinearity

d

The bivariate models indicate numerous statistically significant correlations: 8d

summary(RFru)
summary(RObs)
summary(R300)

None of which are reflected in the multivariate regression, likely due to the reasons mentioned in the previous answer regarding inverse relationships and resulting in suppression of the coefficient values. There may be confounding variables between conFru and wgtObs. If all risk factors and healthy behaviors were grouped together respectively, variables delineating each might be overall health knowledge, access to health information etc. as discussed prior. There is definite mutual exclusivity between wgtObs and act300, which isn’t spurious or chained but could be causiing suppression when the two are included in the same model. It could be conceived that act0 and wgtObs might have chained causation as being obese would necessarily imply a sedentary lifestyle, but sedentary lifestyle is a larger set of people of which those whom are obese are a subset. Previous results indicate that sedentary lifestyle and obesity are not multicolinear, likely for this reason.

9

How does what you see match, or not, your hypotheses from (4)? Why did/didn’t it match what you expected?
The hypothesis from (4) is not supported by the multivariate regression analyses explored above. The hypothesis was a stretch to begin with, and more a result of trying to fit the two datasets into a hypothesis test. The data didn’t fit the hypothesis likely because the logic that lead to the hypothesis was coarse, and the datasets are not ideal for actually exploring these hypotheses. A more direct way of exploring the hypothesis would be the survey intervention proposed in 7c.

10

What do the \(R^2\) and adjusted \(R^2\) tell you about your model?

10

sumRFO3 <- summary(RObs300Fru)
sumRFO3$r.squared
## [1] 0.3111206
sumRFO3$adj.r.squared
## [1] 0.2671496

The \(R^2\) value indicates that ~.31% of the variance in R is explained by the IV’s included in the model. The Adjusted \(R^2\) is more accurate for multivariate analysis as it accounts for the number of variables included in the model.

11

How would you use one of the variable selection methods to choose a model with fewer variables? Select one of the methods (either one of the stepwise or criterion-based methods) and show which variables it would lead you to keep. Do you agree with its results?

From previous discussion, the inverse effect of act300 is likely being suppressed by two variables with positive correlation. If we eliminate this variable from the regression and create the model: 11

RObsFru <- lm(R ~ wgtObs + conFru, Pty_Hlth)
sjt.lm(RObsFru)
    R
    B CI p
(Intercept)   0.26 -17.75 – 18.27 .977
wgtObs   0.52 -0.37 – 1.41 .247
conFru   0.58 -0.14 – 1.30 .111
Observations   51
R2 / adj. R2   .289 / .259

We have a slightly smaller \(R^2\) due to the removal of the variable, but no statistically significant correlations. This implies that our variables might not have strong enough correlations for a 3-variable regression to show fit even without suppression occuring. If we substitute act300 with act0: 11 - cont2

sjt.lm(RObsFru0)
    R
    B CI p
(Intercept)   -1.25 -18.85 – 16.35 .887
wgtObs   0.80 -0.12 – 1.71 .086
conFru   0.87 0.11 – 1.63 .027
act0   -0.71 -1.46 – 0.03 .060
Observations   51
R2 / adj. R2   .341 / .299

we find a higher \(R^2\) and coefficients, as well as greater significance since these could all be considered risk factors. It does appear that wgtObs achieves significance in this 3-Var model, but the adjusted R^2 also declined slightly.
If we look at all the variables systematically, it appears that the three variables with the highest correlation are wgtObs, conVeg, and conFru. If we combine these variables for the best fitting 4 var model: 11 - cont3

RObs <- lm(R ~ wgtObs, Pty_Hlth)
ROvr <- lm(R ~ wgtOvr, Pty_Hlth)
RFru <- lm(R ~ conFru, Pty_Hlth)
RVeg <- lm(R ~ conVeg, Pty_Hlth)
R300 <- lm(R ~ act300, Pty_Hlth)
R150 <- lm(R ~ act150, Pty_Hlth)
RWL <- lm(R ~ actWL, Pty_Hlth)
R150WL <- lm(R ~ act150WL, Pty_Hlth)
R0 <- lm(R ~ act0, Pty_Hlth)

sum_RObs <- summary(RObs)
sum_RObs$adj.r.squared
## [1] 0.2348471
sum_ROvr <- summary(ROvr)
sum_ROvr$adj.r.squared
## [1] -0.01514473
sum_RFru <- summary(RFru)
sum_RFru$adj.r.squared
## [1] 0.2538753
sum_RVeg <- summary(RVeg)
sum_RVeg$adj.r.squared
## [1] 0.1483685
sum_R300 <- summary(R300)
sum_R300$adj.r.squared
## [1] 0.03149323
sum_R150 <- summary(R150)
sum_R150$adj.r.squared
## [1] 0.06556153
sum_RWL <- summary(RWL)
sum_RWL$adj.r.squared
## [1] 0.09364551
sum_R150WL <- summary(R150WL)
sum_R150WL$adj.r.squared
## [1] 0.07239154
sum_R0 <- summary(R0)
sum_R0$adj.r.squared
## [1] 0.05255669
RObsFruVeg <- lm(R ~ wgtObs + conFru + conVeg, Pty_Hlth)
sjt.lm(RObsFruVeg)
    R
    B CI p
(Intercept)   0.17 -18.09 – 18.42 .985
wgtObs   0.53 -0.39 – 1.46 .253
conFru   0.61 -0.21 – 1.43 .144
conVeg   -0.07 -1.00 – 0.87 .889
Observations   51
R2 / adj. R2   .289 / .244

Surprisingly, the adjusted \(R^2\) is actually lower than the model including act0, though the adjusted \(R^2\) was higher across wgtObs, conFru, and conVeg than wgtObs, conFru, and act0.

12

What are your overall conclusions? What are the weaknesses of your results, and how could you improve them with better or different data?

Overall, the data does not allow for robust explanation of the dependent variable. Data with observations on individuals (as one would find in a survey) with continuous variables for given measures such as weight, number of minutes in various activities per week, dietary choices, and party affiliation might yield more significant results, and would make the modeling process much easier.

13

Calculations (using R): a. Derive the coefficients from your regression using the (X0X)−1X0Y formula. (If you run into problems using solve(), try using ginv() instead, which does the same thing but is a bit more robust.) b. For one of the coefficients, confirm its p value as shown in the regression output using the coefficient, its standard error, and pt() in R. c. Calculate the R2 and adjusted R2 using R, and confirm that your results match the regression output. d. Calulate the F statistic using R and confirm it against the regression output.

a

\((X'X)^{-1}X'Y = \beta\) 13a

mRObsFru300 <- as.matrix(cbind(1, Pty_Hlth$wgtObs, Pty_Hlth$conFru, Pty_Hlth$act300))
head(mRObsFru300)
##      [,1]  [,2]  [,3]  [,4]
## [1,]    1 22.00 37.25 34.90
## [2,]    1 24.25 34.20 31.45
## [3,]    1 22.60 41.25 37.05
## [4,]    1 25.05 37.60 28.55
## [5,]    1 29.65 38.40 30.45
## [6,]    1 25.15 34.00 39.10
solve(t(mRObsFru300) %*% mRObsFru300) %*% t(mRObsFru300) %*% Pty_Hlth$R
##             [,1]
## [1,] -21.0009851
## [2,]   0.7021530
## [3,]   0.6611062
## [4,]   0.3921873
RObs300Fru$coefficients
## (Intercept)      wgtObs      conFru      act300 
## -21.0009851   0.7021530   0.6611062   0.3921873

b

13b

# conFru
pt(sumRFO3$coefficients[2, 1]/sumRFO3$coefficients[2, 2], 4, low = F)
## [1] 0.1026241
sumRFO3$coefficients
##                Estimate Std. Error   t value   Pr(>|t|)
## (Intercept) -21.0009851 19.5032351 -1.076795 0.28706806
## wgtObs        0.7021530  0.4646104  1.511273 0.13741392
## conFru        0.6611062  0.3628551  1.821957 0.07482829
## act300        0.3921873  0.3200041  1.225570 0.22646667

The \(p_value\) for conFru seems to be calculated in the linear regression output via a different algorithm than pt() as indicated by the difference in the two values.

c

\(R^{2} = \frac{TSS - SSE}{TSS}\) 13c

ypred <- predict(RObs300Fru)
y <- Pty_Hlth$R
tss <- sum((y - mean(y))^2)
sse <- sum((y - ypred)^2)
r2 <- (tss - sse)/tss
r2
## [1] 0.3111206
sumRFO3$r.squared
## [1] 0.3111206

\(\textrm{adjusted } R^2 = \frac{TSS/df_t - SSE/df_e}{TSS/df_t}\) 13c - cont

n <- length(y)
k <- ncol(mRObsFru300) - 1
dft <- n - 1
dfe <- n - k - 1
(tss/dft - sse/dfe)/(tss/dft)
## [1] 0.2671496
sumRFO3$adj.r.squared
## [1] 0.2671496

d

\(F = \frac{R^2 / k}{(1-R^2)/(n-k-1)}\) 13d

f <- (r2/k)/((1 - r2)/(n - k - 1))
f
## [1] 7.075583
sumRFO3$fstatistic
##     value     numdf     dendf 
##  7.075583  3.000000 47.000000

14

Add at least one quadratic term into your model and interpret the results. Is it significant? What is the effect of a 1-unit increase in that variable at its mean value?

15

Add at least one interaction term to you model and interpret the results. Is it significant? What is the effect of a 1-unit increase in one of those interacted variables holding the other at its mean value?

16

Test either the model in 14 or the model in 15 using the F test for nested models. That is, estimate the full model with the variable and quadratic term, or the variable and interaction, and then estimate the reduced model without either, and run the F test to establish whether those variables significantly improve your model.