library("tidyverse")
library("dplyr")
library("htmltools")
library("forcats")
library("stargazer")
library("sjPlot")
library("car")
1 Describe your substantive interest and the general questions(s) you would like to answer (eg, “Does more education cause people to become more liberal?”). Be sure to frame it in a such a way that you are proposing a hypothesis (or multiple hypotheses) that might be either confirmed or disproven by the results of your analysis. 2 Describe the data set you have found, including its source, its contents, and why it was collected originally. 3 What is your dependent variable? Why are you interested in explaining it? What do you hypothesize are the major factors that influence or cause it? 4 What are your independent variables, and why have you chosen these? Prior to running your regression, what effects do you expect them to have on the dependent variable? Which of these variables do you think a˙ect other of the independent variables, and how might that a˙ect your final results? 5 Explain and show in detail how you rename and recode the variables you are examining, and what units each are measured in. 6 Before running a multiple regression, run a few bivariate regressions of Y on some of your X variables. What do you infer? Which of these do you think might change with the addition of multiple variables? 7 Run your full multiple regression using lm() and present your results using the output from the stargazer R package. Interpret the coefficients. What do they tell you substantively? Which variables seem to have the biggest substantive impact? Which ones could you actually change with some intervention, and how big a di˙erence do you think that could make? 8 How have any of the coefficients changed from the bivariate regressions? What can you infer from that?How do you think your various independent variables interact and a˙ect each other? Try to find an example where a variable appears signficant in the bivariate regression, but not in the full regression. Is this an example of a spurious or a chained causal pathway? 9 How does what you see match, or not, your hypotheses from (4)? Why did/didn’t it match what you expected? 10 What do the R2 and adjusted R2 tell you about your model? 11 How would you use one of the variable selection methods to choose a model with fewer variables? Select one of the methods (either one of the stepwise or criterion-based methods) and show which variables it would lead you to keep. Do you agree with its results? 12 What are your overall conclusions? What are the weaknesses of your results, and how could you improve them with better or different data? 13 Calculations (using R): a. Derive the coeÿcients from your regression using the (X0X)−1X0Y formula. (If you run into problems using solve(), try using ginv() instead, which does the same thing but is a bit more robust.) b. For one of the coeÿcients, confirm its p value as shown in the regression output using the coeÿcient, its standard error, and pt() in R. c. Calculate the R2 and adjusted R2 using R, and confirm that your results match the regression output. d. Calulate the F statistic using R and confirm it against the regression output. 14 Add at least one quadratic term into your model and interpret the results. Is it significant? What is the effect of a 1-unit increase in that variable at its mean value? 15 Add at least one interaction term to you model and interpret the results. Is it significant? What is the effect of a 1-unit increase in one of those interacted variables holding the other at its mean value? 16 Test either the model in 14 or the model in 15 using the F test for nested models. That is, estimate the full model with the variable and quadratic term, or the variable and interaction, and then estimate the reduced model without either, and run the F test to establish whether those variables significantly improve your model.
Read Data Set 1
US <- read_csv("C:\\Users\\Stephen\\Documents\\Northeastern\\PPUA 5301 - Introduction to Computational Statistics\\Homework\\Homework 8_9\\Nutrition__Physical_Activity__and_Obesity_-_Behavioral_Risk_Factor_Surveillance_System.csv")
# names(US)
(Qs <- unique(US$Question))
## [1] "Percent of adults aged 18 years and older who have obesity"
## [2] "Percent of adults aged 18 years and older who have an overweight classification"
## [3] "Percent of adults who report consuming fruit less than one time daily"
## [4] "Percent of adults who report consuming vegetables less than one time daily"
## [5] "Percent of adults who engage in muscle-strengthening activities on 2 or more days a week"
## [6] "Percent of adults who achieve at least 150 minutes a week of moderate-intensity aerobic physical activity or 75 minutes a week of vigorous-intensity aerobic activity (or an equivalent combination)"
## [7] "Percent of adults who achieve at least 150 minutes a week of moderate-intensity aerobic physical activity or 75 minutes a week of vigorous-intensity aerobic physical activity and engage in muscle-strengthening activities on 2 or more days a week"
## [8] "Percent of adults who achieve at least 300 minutes a week of moderate-intensity aerobic physical activity or 150 minutes a week of vigorous-intensity aerobic activity (or an equivalent combination)"
## [9] "Percent of adults who engage in no leisure-time physical activity"
Xform Dataset 1
US_Hlth <- US %>% filter(YearStart == 2015 & Question %in% c("Percent of adults who engage in no leisure-time physical activity",
"Percent of adults who achieve at least 150 minutes a week of moderate-intensity aerobic physical activity or 75 minutes a week of vigorous-intensity aerobic activity (or an equivalent combination)",
"Percent of adults aged 18 years and older who have an overweight classification",
"Percent of adults who report consuming vegetables less than one time daily") &
is.na(Gender) == F & !LocationDesc %in% c("Guam", "Puerto Rico", "National")) %>%
select(YearStart, LocationDesc, Question, Data_Value, Gender) %>% group_by(LocationDesc,
Question) %>% arrange(LocationDesc, Question) %>% summarize(Perc = mean(Data_Value))
colnames(US_Hlth) <- c("ST", "Hlth_M", "Perc")
US_Hlth$Hlth_M <- as.factor(US_Hlth$Hlth_M)
US_Hlth$Hlth_M <- US_Hlth$Hlth_M %>% fct_recode(`0` = "Percent of adults who engage in no leisure-time physical activity",
`150` = "Percent of adults who achieve at least 150 minutes a week of moderate-intensity aerobic physical activity or 75 minutes a week of vigorous-intensity aerobic activity (or an equivalent combination)",
Veg = "Percent of adults who report consuming vegetables less than one time daily",
Wgt = "Percent of adults aged 18 years and older who have an overweight classification")
US_Hlth <- US_Hlth %>% ungroup() %>% filter(Hlth_M %in% c("0", "150", "Veg", "Wgt")) %>%
spread("Hlth_M", "Perc")
Xform Dataset 1 - Revised
US_Hlth <- US %>% filter(YearStart == 2015 & is.na(Gender) == F & !LocationDesc %in%
c("Guam", "Puerto Rico", "National")) %>% select(YearStart, LocationDesc, Question,
Data_Value, Gender) %>% group_by(LocationDesc, Question) %>% arrange(LocationDesc,
Question) %>% summarize(Perc = mean(Data_Value))
colnames(US_Hlth) <- c("ST", "Hlth_M", "Perc")
US_Hlth$Hlth_M <- as.factor(US_Hlth$Hlth_M)
US_Hlth$Hlth_M <- US_Hlth$Hlth_M %>% fct_recode(wgtObs = "Percent of adults aged 18 years and older who have obesity",
wgtOvr = "Percent of adults aged 18 years and older who have an overweight classification",
conFru = "Percent of adults who report consuming fruit less than one time daily",
conVeg = "Percent of adults who report consuming vegetables less than one time daily",
actWL = "Percent of adults who engage in muscle-strengthening activities on 2 or more days a week",
act150 = "Percent of adults who achieve at least 150 minutes a week of moderate-intensity aerobic physical activity or 75 minutes a week of vigorous-intensity aerobic activity (or an equivalent combination)",
act150WL = "Percent of adults who achieve at least 150 minutes a week of moderate-intensity aerobic physical activity or 75 minutes a week of vigorous-intensity aerobic physical activity and engage in muscle-strengthening activities on 2 or more days a week",
act300 = "Percent of adults who achieve at least 300 minutes a week of moderate-intensity aerobic physical activity or 150 minutes a week of vigorous-intensity aerobic activity (or an equivalent combination)",
act0 = "Percent of adults who engage in no leisure-time physical activity")
US_Hlth <- US_Hlth %>% ungroup() %>% spread("Hlth_M", "Perc")
Read Dataset 2
Party <- tibble::tribble(~State, ~Republican.lean.Rep., ~No.lean, ~Democrat.lean.Dem.,
~Sample.Size, "Alabama", "52%", "13%", "35%", 511, "Alaska", "39%", "29%", "32%",
310, "Arizona", "40%", "21%", "39%", 653, "Arkansas", "46%", "16%", "38%", 311,
"California", "30%", "21%", "49%", 3697, "Colorado", "41%", "17%", "42%", 504,
"Connecticut", "32%", "18%", "50%", 377, "Delaware", "29%", "17%", "55%", 301,
"District of Columbia", "11%", "15%", "73%", 303, "Florida", "37%", "19%", "44%",
2020, "Georgia", "41%", "18%", "41%", 968, "Hawaii", "28%", "20%", "51%", 312,
"Idaho", "49%", "19%", "32%", 320, "Illinois", "33%", "19%", "48%", 1326, "Indiana",
"42%", "20%", "37%", 654, "Iowa", "41%", "19%", "40%", 330, "Kansas", "46%",
"23%", "31%", 307, "Kentucky", "44%", "13%", "43%", 439, "Louisiana", "41%",
"16%", "43%", 465, "Maine", "36%", "17%", "47%", 303, "Maryland", "31%", "14%",
"55%", 644, "Massachusetts", "27%", "17%", "56%", 704, "Michigan", "34%", "19%",
"47%", 982, "Minnesota", "39%", "15%", "46%", 563, "Mississippi", "44%", "14%",
"42%", 309, "Missouri", "41%", "18%", "42%", 642, "Montana", "49%", "21%", "30%",
312, "Nebraska", "47%", "17%", "36%", 312, "Nevada", "37%", "18%", "46%", 314,
"New Hampshire", "35%", "20%", "44%", 303, "New Jersey", "30%", "19%", "51%",
886, "New Mexico", "37%", "15%", "48%", 312, "New York", "28%", "19%", "53%",
1966, "North Carolina", "41%", "17%", "43%", 1022, "North Dakota", "50%", "18%",
"33%", 338, "Ohio", "42%", "18%", "40%", 1132, "Oklahoma", "45%", "15%", "40%",
391, "Oregon", "32%", "21%", "47%", 419, "Pennsylvania", "39%", "15%", "46%",
1366, "Rhode Island", "30%", "22%", "48%", 305, "South Carolina", "43%", "18%",
"39%", 495, "South Dakota", "53%", "10%", "37%", 305, "Tennessee", "48%", "15%",
"36%", 661, "Texas", "39%", "21%", "40%", 2535, "Utah", "54%", "16%", "30%",
315, "Vermont", "29%", "14%", "57%", 306, "Virginia", "43%", "18%", "39%", 882,
"Washington", "33%", "23%", "44%", 714, "West Virginia", "43%", "16%", "41%",
309, "Wisconsin", "42%", "16%", "42%", 600, "Wyoming", "57%", "18%", "25%", 316)
Xform Dataset 2
Party <- lapply(Party, gsub, pattern = "\\%", replacement = "")
Party <- as.data.frame(Party, stringsAsFactors = F, )
reclass <- function(df, vec) {
df[] <- Map(function(x, f) {
# switch below shows the accepted values in the vector you can modify it and/or
# add more
f <- switch(f, as.is = "force", factor = "as.factor", num = "as.numeric",
char = "as.character")
# takes the name of the function and fetches the function
f <- get(f)
# apply the function
f(x)
}, df, vec)
df
}
Party <- reclass(Party, c("char", "num", "num", "num", "num"))
names(Party)
## [1] "State" "Republican.lean.Rep." "No.lean"
## [4] "Democrat.lean.Dem." "Sample.Size"
colnames(Party) <- c("ST", "R", "N", "D", "ns")
Join Tables for Analysis
Pty_Hlth <- left_join(Party, US_Hlth, by = "ST")
rm(US_Hlth, Party)
Prelim Analysis - Revised
Pty_Hlth <- Pty_Hlth %>% arrange(R)
ggact <- ggplot(data = Pty_Hlth, mapping = aes(x = reorder(ST, R), y = R)) + geom_point() +
geom_point(mapping = aes(y = actWL), color = "red", size = 2) + geom_point(mapping = aes(y = act150),
color = "orange", size = 2) + geom_point(mapping = aes(y = act150WL), color = "#C42B2D",
size = 2) + geom_point(mapping = aes(y = act300), color = "blue", size = 2) +
geom_point(mapping = aes(y = act0), color = "black", size = 2) + theme(axis.text.x = element_text(angle = 90,
hjust = 1, lineheight = 1))
ggwgt <- ggplot(data = Pty_Hlth, mapping = aes(x = reorder(ST, R), y = R)) + geom_point() +
geom_point(mapping = aes(y = wgtObs), color = "red", size = 2) + geom_point(mapping = aes(y = wgtOvr),
color = "#A32D2D", size = 2) + theme(legend.position = "none", axis.text.x = element_text(angle = 90,
hjust = 1, lineheight = 1))
ggcon <- ggplot(data = Pty_Hlth, mapping = aes(x = reorder(ST, R), y = R)) + geom_point() +
geom_point(mapping = aes(y = conFru), color = "red", size = 2) + geom_point(mapping = aes(y = conVeg),
color = "#B88037", size = 2) + theme(legend.position = "none", axis.text.x = element_text(angle = 90,
hjust = 1, lineheight = 1))
# The Wgt variable is obviously not correlated, so this will be left out of
# further analyses
ggact
ggwgt
ggcon
Describe your substantive interest and the general questions(s) you would like to answer (eg, “Does more education cause people to become more liberal?”). Be sure to frame it in a such a way that you are proposing a hypothesis (or multiple hypotheses) that might be either confirmed or disproven by the results of your analysis.
Selecting the data for comparison was quite difficult given the restraints of having the data not include temporal or categorical data. All of my interests included data that had a temporal index or categorical data involved. Resolving to include health related measures, I decided to continue with the inquiry from class about partisanship, and combine these two data sets to allow for correlation between these topics. The data within fits the restraints and might yield interesting correlations.
The hypothesis is a question of whether there is a correlation between various health behaviors and party affiliation within states. After Revision
\(H_1\) Is the “Percent of adults aged 18 years and older who have obesity” correlated with the % identifying as Republican? If so, what is the sign of \(\beta_1\)?
\(H_2\) Is the “Percent of adults who report consuming fruit less than one time daily” correlated with the % identifying as Republican? If so, what is the sign of \(\beta_2\)?
\(H_3\) Is the “Percent of adults who achieve at least 300 minutes a week of moderate-intensity aerobic physical activity or 150 minutes a week of vigorous-intensity aerobic activity (or an equivalent combination)” a predictor variable for the % identifying as Republican? If so what is the sign of \(\beta_3\)?
Describe the data set you have found, including its source, its contents, and why it was collected originally.
1https://catalog.data.gov/dataset/nutrition-physical-activity-and-obesity-behavioral-risk-factor-surveillance-system
2http://www.pewforum.org/religious-landscape-study/compare/party-affiliation/by/state/
3http://www.pewforum.org/religious-landscape-study/
What is your dependent variable? Why are you interested in explaining it? What do you hypothesize are the major factors that influence or cause it?
The dependent variable is the percentage of the state pop. identified as Republican, or Republican leaning. The variable is complex with numerous influencing factors. During the in-class explorations we have looked at potential influencing factors such as race, income, and education level on party affiliation via data from the ANES survey.
Intuitively, party affiliation could also be influenced or predicted in part by religion, affiliation of parents, place of residence, affiliation of one’s peers in one’s social network. It could also be related to psychological traits, or, as we will explore, health behaviors.
In this study, we’ll be looking at the percentage of each state’s pop. exhibiting particular health behaviors, these behaviors will be described in further detail below. An state with a high rate of obesity wgtObs might also lack a system of education about health related information and/or a cultural bias that occludes access to quality health related information. The obesity rate could also be related to lack of access to health care services and medical professionals etc. With regard to whether or not the “Percent of adults who report consuming fruit less than one time daily” influences the percent of the state pop. identifying as Republican it is likely that the two are not related. Though the variable might not have much explanatory value, it was the only variable from the dietary behaviors questions that did graphically did not appear to exhibit colinearity with variables from the other classes.
What are your independent variables, and why have you chosen these? Prior to running your regression, what effects do you expect them to have on the dependent variable? Which of these variables do you think affect other of the independent variables, and how might that affect your final results?
OriginalAfter Revision This brief analysis will look at whether there is correlation between the following independent variables:
This brief analysis will look at whether there is correlation between the following independent variables:and the dependent variable \(y\): the identification as Republican.
- \(x_1\): % of pop. living sedentary lifestyle
- \(x_2\): % of pop. living lifestyle meeting minimum physical activity recommendations
- \(x_3\): % of pop. that consumes vegetables less than 1 time daily
There might be a positive correlation between the percent of a state’s pop. that exhibits health behaviors identified as risk factors \(x_1,x_3\) and the percent of a state’s pop. that identifies as Republican or Republican leaning. There is likely an inverse colinear relationship between \(x_1\) and \(x_2\) because they are mutually exclusive categories along a scale of 5 categories (not all of which are mutually exclusive.) See variable Qs, for the original categories. There could also be a correlation between states with a larger percentage of the population meeting minimum physical activity recommendations, and those whom are consuming vegetables more than 1 time a day, due to a theoretical common variable of health knowledge.
and the dependent variable \(y\): the percent of the state pop. identifying as Republican. The % of adults aged 18 years and older who have obesity (\(x_1\)) might correlate positively with the % of the state pop. identified as Republican. The % of adults who report consuming fruit less than one time daily will likely have no correlation with % of the state pop. identified as Republican. There might be a negative correlation between % of adults who achieve at least 300 minutes a week of moderate-intensity aerobic physical activity or 150 minutes a week of vigorous-intensity aerobic activity (or an equivalent combination). However, the latter two variables are likely more related to the latitude, and longitude of a state (as a predictor of outdoor weather patterns conducive to outdoor activity), as well as the popularity and diversity of various athletic recreational activities within the state.
Explain and show in detail how you rename and recode the variables you are examining, and what units each are measured in.
The variables were reclassed to characters or numbers where necessary. The tables are joined by State. All variables are measured in % of state population.
Before running a multiple regression, run a few bivariate regressions of Y on some of your X variables. What do you infer? Which of these do you think might change with the addition of multiple variables?
After RevisionThese linear regressions without results contained in this blockquote were used to identify that the variables in question were colinear. The assignment was then revised. The revised answer to this question begins where the blockquote ends.
Upon further reading, I had an incorrect method for attempting to spot multicolinearity, and the revision may not have even been necessary. 2-Var Regressionsr0 <- summary(lm(R ~ `0`, Pty_Hlth)) r150 <- summary(lm(R ~ `150`, Pty_Hlth)) rv <- summary(lm(R ~ Veg, Pty_Hlth)) `150v0` <- summary(lm(`0` ~ `150`, Pty_Hlth)) 150v0
Before running a multiple regression, run a few bivariate regressions of Y on some of your X variables. What do you infer? Which of these do you think might change with the addition of multiple variables?
While there is statistical significance at the 95% confidence level in the correlations between R and 150, and Veg, it can also be noted the \(R^2\) values indicate that little of the variation is explained by the model. As variables are added in a multiple variable regression, the \(R^2\) value will inevitably increase, though it is important to note the low \(R^2\) in these 2-Variable models.
The comparison of0
and150
show that the variables are indeed colinear, and will be confounding variables if combined. It may be necessary to add an additional health variable that is not colinear with any of the other variables.Add Statistic on Obesity for Multiple Linear Regression
USob <- US %>% filter(YearStart == 2015 & Question %in% c("Percent of adults aged 18 years and older who have obesity") & is.na(Gender) == F & !LocationDesc %in% c("Guam", "Puerto Rico", "National")) %>% select(YearStart, LocationDesc, Question, Data_Value, Gender) %>% group_by(LocationDesc, Question) %>% arrange(LocationDesc, Question) %>% summarize(Perc = mean(Data_Value)) USob <- USob[, -2] names(USob) <- c("ST", "Ob") Pty_Hlth <- left_join(Pty_Hlth, USob, by = "ST") ROb <- summary(lm(R ~ Ob, Pty_Hlth)) ROb rm(USob) `0Ob` <- summary(lm(`0` ~ Ob, Pty_Hlth)) 0Ob
It appears that obesity is also colinear with a sedentary lifestyle, logically, this linkage makes sense. Let’s have a look at the fruit variable. Add Fruit Variable
USF <- US %>% filter(YearStart == 2015 & Question %in% c("Percent of adults who report consuming fruit less than one time daily") & is.na(Gender) == F & !LocationDesc %in% c("Guam", "Puerto Rico", "National")) %>% select(YearStart, LocationDesc, Question, Data_Value, Gender) %>% group_by(LocationDesc, Question) %>% arrange(LocationDesc, Question) %>% summarize(Perc = mean(Data_Value)) USF <- USF[, -2] names(USF) <- c("ST", "Fruit") Pty_Hlth <- left_join(Pty_Hlth, USF, by = "ST") (VF <- summary(lm(Veg ~ Fruit, Pty_Hlth))) # Colinear (F150 <- summary(lm(Fruit ~ `150`, Pty_Hlth))) # Colinear (FOb <- summary(lm(Fruit ~ Ob, Pty_Hlth))) # Colinear```
Healthy lifestyle traits show high degrees of correlation, while risk factors are also highly correlated. Add Weight Lifting Variable
It appears that it will be necessary to start back at the beginning, extract all of the health related questions, spread them to individual variables and graph them. By graphing all of the variables it may allow them to be differentiated to avoid colinearity.USWl <- US %>% filter(YearStart == 2015 & Question %in% c("Percent of adults who engage in muscle-strengthening activities on 2 or more days a week") & is.na(Gender) == F & !LocationDesc %in% c("Guam", "Puerto Rico", "National")) %>% select(YearStart, LocationDesc, Question, Data_Value, Gender) %>% group_by(LocationDesc, Question) %>% arrange(LocationDesc, Question) %>% summarize(Perc = mean(Data_Value)) USWl <- USWl[, -2] names(USWl) <- c("ST", "Wl") Pty_Hlth <- left_join(Pty_Hlth, USWl, by = "ST") (VFWl <- summary(lm(Veg ~ Wl, Pty_Hlth))) # Colinear (Wl150 <- summary(lm(Wl ~ `150`, Pty_Hlth))) # Colinear (WlOb <- summary(lm(Wl ~ Ob, Pty_Hlth))) # Colinear
2-Variable Regressions - Revised
Obs150 <- lm(wgtObs ~ act150, Pty_Hlth)
ObsFru <- lm(wgtObs ~ conFru, Pty_Hlth)
Obs300 <- lm(wgtObs ~ act300, Pty_Hlth)
Fru300 <- lm(conFru ~ act300, Pty_Hlth)
RObs <- lm(R ~ wgtObs, Pty_Hlth)
RFru <- lm(R ~ conFru, Pty_Hlth)
R300 <- lm(R ~ act300, Pty_Hlth)
sjt.lm(Obs150)
wgtObs | ||||
B | CI | p | ||
(Intercept) | 57.96 | 49.78 – 66.13 | <.001 | |
act150 | -0.56 | -0.71 – -0.40 | <.001 | |
Observations | 51 | |||
R2 / adj. R2 | .506 / .496 |
sjt.lm(ObsFru)
wgtObs | ||||
B | CI | p | ||
(Intercept) | 2.67 | -3.10 – 8.44 | .357 | |
conFru | 0.65 | 0.51 – 0.79 | <.001 | |
Observations | 51 | |||
R2 / adj. R2 | .639 / .632 |
sjt.lm(Obs300)
wgtObs | ||||
B | CI | p | ||
(Intercept) | 47.99 | 41.21 – 54.77 | <.001 | |
act300 | -0.58 | -0.79 – -0.37 | <.001 | |
Observations | 51 | |||
R2 / adj. R2 | .390 / .377 |
sjt.lm(Fru300)
conFru | ||||
B | CI | p | ||
(Intercept) | 62.43 | 53.74 – 71.11 | <.001 | |
act300 | -0.67 | -0.94 – -0.40 | <.001 | |
Observations | 51 | |||
R2 / adj. R2 | .339 / .326 |
sjt.lm(RFru)
R | ||||
B | CI | p | ||
(Intercept) | 1.64 | -16.27 – 19.55 | .855 | |
conFru | 0.92 | 0.48 – 1.35 | <.001 | |
Observations | 51 | |||
R2 / adj. R2 | .269 / .254 |
sjt.lm(RObs)
R | ||||
B | CI | p | ||
(Intercept) | 7.32 | -8.67 – 23.31 | .362 | |
wgtObs | 1.09 | 0.55 – 1.63 | <.001 | |
Observations | 51 | |||
R2 / adj. R2 | .250 / .235 |
sjt.lm(R300)
R | ||||
B | CI | p | ||
(Intercept) | 53.97 | 35.52 – 72.41 | <.001 | |
act300 | -0.46 | -1.03 – 0.11 | .112 | |
Observations | 51 | |||
R2 / adj. R2 | .051 / .031 |
The \(R^2\) values appear to be low enough where the degree of multicollinearity will not be so substantial as to render the predictive value of the model inaccurate. With the addition of multiple variables, the signficance of the individual health behavioral variables will likely decrease. The \(R^2\) will also decrease as the amount of variance will be cumulative as more variables are added to the model.
7
RObs300Fru <- lm(R ~ wgtObs + conFru + act300, Pty_Hlth)
vif(RObs300Fru)
## wgtObs conFru act300
## 3.100245 2.863585 1.691883
sjt.lm(RObs300Fru)
R | ||||
B | CI | p | ||
(Intercept) | -21.00 | -60.24 – 18.23 | .287 | |
wgtObs | 0.70 | -0.23 – 1.64 | .137 | |
conFru | 0.66 | -0.07 – 1.39 | .075 | |
act300 | 0.39 | -0.25 – 1.04 | .226 | |
Observations | 51 | |||
R2 / adj. R2 | .311 / .267 |
7 - Stargazer
stargazer(RObs300Fru, no.space = TRUE, dep.var.labels = c("% identified as Republican"),
omit.stat = c("ll"), header = FALSE)
The variables from the revision: wgtObs,conFru, and act300 are modeled, and the variables from the originally intended linear model: wgtObs,conVeg, and act150 are modeled for comparison purposes.
Coefficients
\[\begin{aligned} Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \epsilon \end{aligned}\] Where:
The \(\beta\) coefficient for a given variable indicates that for a unit increment in that variable, the resulting value \(\hat{y}\) will be incremented by the value of \(\beta\).
Variables with Substantive Impact
The conFru variable in linear model RObs300Fru comes close to being significant at the 95% confidence level, though the adjusted \(R^2\) value for the model is .267. The wgtObs variable in the R150VegObs model is significant at the 99% confidence level, but the adjusted \(R^2\) value for the model is lower at .235. Given these outcomes, the model is not going to provide an accurate predictive model for the dependent variable.
8a
RFru$coefficients
## (Intercept) conFru
## 1.6424165 0.9182656
RObs$coefficients
## (Intercept) wgtObs
## 7.321541 1.090106
R300$coefficients
## (Intercept) act300
## 53.9657706 -0.4583004
RObs300Fru$coefficients
## (Intercept) wgtObs conFru act300
## -21.0009851 0.7021530 0.6611062 0.3921873
This coefficients have dropped in value from the bivariate to the multivariate. Also, we can see that including all three of these variables has mitigated the negative correlaton with act300 and actually made it a positive correlation. This does not seem to be a spurious or chained causal pathway, but rather the influence of two positively correlated values and one negatively correlated value included in the same linear model.
It can be inferred, that were variables with all positive correlations chosen for the multivariate regression, a more pronounced positive effect on the coefficients will be observed. To test this, we can add act0 - “Percent of adults who engage in no leisure-time physical activity” to the model. It is expected that due to the likely multicolinearity between this variable and wgtObs, we will observe larger coefficients, increased significance, but also have a high VIF. To test: 8b
RObsFru0 <- lm(R ~ wgtObs + conFru + act0, Pty_Hlth)
RObs300Fru$coefficients
## (Intercept) wgtObs conFru act300
## -21.0009851 0.7021530 0.6611062 0.3921873
summary(RObs300Fru)$adj.r.squared
## [1] 0.2671496
RObsFru0$coefficients
## (Intercept) wgtObs conFru act0
## -1.2495710 0.7951514 0.8687442 -0.7133401
summary(RObsFru0)$adj.r.squared
## [1] 0.2992062
vif(RObsFru0)
## wgtObs conFru act0
## 3.085080 3.275400 2.446539
R0 <- lm(R ~ act0, Pty_Hlth)
R0$coefficients
## (Intercept) act0
## 25.4810285 0.5340354
As inferred, the coefficients did increase in value, as did the VIF scores. However, unexpectedly, the act0 coefficient is negative, even though it is positive in the bivariate regression with the dependent variable, and with its other IVs. The y-intercept shifted dramatically between the two models, which could indicate that when individuals in act300 are included, the % of individuals identifying as Republican starts at a much lower percent, and goes up by small amounts with each unit shift in the IVs, whereas when act0 is included, there is a larger % of those identifying as Republican, which goes down by 0.71 with each unit shift in X. Admittedly, these results are confusing, and this is just a conjecture.
8c
vif(RObs300Fru)
## wgtObs conFru act300
## 3.100245 2.863585 1.691883
VIFAccording to this resource4, the VIF is a multiplier by which the variance could be inflated due to multicollinearity. From our VIF values, we can see that the variance could be significantly inflated in the models due to multicollinearity. Looking at the values for VIF between the models, it appears that the variables in R150VegObs (the originally proposed model) have more distributed multicollinearity between them than in the original model, though the multicollinearity between wgtObs and R in the revised model is more pronounced.
By evaluating the variables, it’s apparent that wgtObs and act300 are going to be mutually exclusive, and likely going to dampen the coefficients and significance of one another when included in the same model. We can confirm this by adding all variables that are likely to all be correlated:
8c - cont
R0VegObs <- lm(R ~ act0 + conVeg + wgtObs, Pty_Hlth)
vif(R0VegObs)
## act0 conVeg wgtObs
## 2.564286 2.461472 2.433079
sjt.lm(R0VegObs)
R | ||||
B | CI | p | ||
(Intercept) | 6.99 | -9.20 – 23.17 | .390 | |
act0 | -0.59 | -1.38 – 0.20 | .139 | |
conVeg | 0.56 | -0.36 – 1.48 | .230 | |
wgtObs | 1.19 | 0.35 – 2.03 | .006 | |
Observations | 51 | |||
R2 / adj. R2 | .290 / .245 |
The p-value for the coefficient of wgtObs is now statistically significant. The coefficients are higher in value, and the \(R^2\) is similar to other models. What still remains confusing is the negative coefficient for act0, which when regressed with wgtObs and conVeg, and R individually, are all positive.
Veg0 <- lm(act0 ~ conVeg, Pty_Hlth)
Obs0 <- lm(act0 ~ wgtObs, Pty_Hlth)
Veg0$coefficients
## (Intercept) conVeg
## 6.4113886 0.8582117
Obs0$coefficients
## (Intercept) wgtObs
## 2.753649 0.784917
R0$coefficients
## (Intercept) act0
## 25.4810285 0.5340354
4https://statisticalhorizons.com/multicollinearity
The bivariate models indicate numerous statistically significant correlations: 8d
summary(RFru)
summary(RObs)
summary(R300)
None of which are reflected in the multivariate regression, likely due to the reasons mentioned in the previous answer regarding inverse relationships and resulting in suppression of the coefficient values. There may be confounding variables between conFru and wgtObs. If all risk factors and healthy behaviors were grouped together respectively, variables delineating each might be overall health knowledge, access to health information etc. as discussed prior. There is definite mutual exclusivity between wgtObs and act300, which isn’t spurious or chained but could be causiing suppression when the two are included in the same model. It could be conceived that act0 and wgtObs might have chained causation as being obese would necessarily imply a sedentary lifestyle, but sedentary lifestyle is a larger set of people of which those whom are obese are a subset. Previous results indicate that sedentary lifestyle and obesity are not multicolinear, likely for this reason.
10
sumRFO3 <- summary(RObs300Fru)
sumRFO3$r.squared
## [1] 0.3111206
sumRFO3$adj.r.squared
## [1] 0.2671496
The \(R^2\) value indicates that ~.31% of the variance in R is explained by the IV’s included in the model. The Adjusted \(R^2\) is more accurate for multivariate analysis as it accounts for the number of variables included in the model.
How would you use one of the variable selection methods to choose a model with fewer variables? Select one of the methods (either one of the stepwise or criterion-based methods) and show which variables it would lead you to keep. Do you agree with its results?
From previous discussion, the inverse effect of act300 is likely being suppressed by two variables with positive correlation. If we eliminate this variable from the regression and create the model: 11
RObsFru <- lm(R ~ wgtObs + conFru, Pty_Hlth)
sjt.lm(RObsFru)
R | ||||
B | CI | p | ||
(Intercept) | 0.26 | -17.75 – 18.27 | .977 | |
wgtObs | 0.52 | -0.37 – 1.41 | .247 | |
conFru | 0.58 | -0.14 – 1.30 | .111 | |
Observations | 51 | |||
R2 / adj. R2 | .289 / .259 |
We have a slightly smaller \(R^2\) due to the removal of the variable, but no statistically significant correlations. This implies that our variables might not have strong enough correlations for a 3-variable regression to show fit even without suppression occuring. If we substitute act300 with act0: 11 - cont2
sjt.lm(RObsFru0)
R | ||||
B | CI | p | ||
(Intercept) | -1.25 | -18.85 – 16.35 | .887 | |
wgtObs | 0.80 | -0.12 – 1.71 | .086 | |
conFru | 0.87 | 0.11 – 1.63 | .027 | |
act0 | -0.71 | -1.46 – 0.03 | .060 | |
Observations | 51 | |||
R2 / adj. R2 | .341 / .299 |
we find a higher \(R^2\) and coefficients, as well as greater significance since these could all be considered risk factors. It does appear that wgtObs achieves significance in this 3-Var model, but the adjusted R^2 also declined slightly.
If we look at all the variables systematically, it appears that the three variables with the highest correlation are wgtObs, conVeg, and conFru. If we combine these variables for the best fitting 4 var model: 11 - cont3
RObs <- lm(R ~ wgtObs, Pty_Hlth)
ROvr <- lm(R ~ wgtOvr, Pty_Hlth)
RFru <- lm(R ~ conFru, Pty_Hlth)
RVeg <- lm(R ~ conVeg, Pty_Hlth)
R300 <- lm(R ~ act300, Pty_Hlth)
R150 <- lm(R ~ act150, Pty_Hlth)
RWL <- lm(R ~ actWL, Pty_Hlth)
R150WL <- lm(R ~ act150WL, Pty_Hlth)
R0 <- lm(R ~ act0, Pty_Hlth)
sum_RObs <- summary(RObs)
sum_RObs$adj.r.squared
## [1] 0.2348471
sum_ROvr <- summary(ROvr)
sum_ROvr$adj.r.squared
## [1] -0.01514473
sum_RFru <- summary(RFru)
sum_RFru$adj.r.squared
## [1] 0.2538753
sum_RVeg <- summary(RVeg)
sum_RVeg$adj.r.squared
## [1] 0.1483685
sum_R300 <- summary(R300)
sum_R300$adj.r.squared
## [1] 0.03149323
sum_R150 <- summary(R150)
sum_R150$adj.r.squared
## [1] 0.06556153
sum_RWL <- summary(RWL)
sum_RWL$adj.r.squared
## [1] 0.09364551
sum_R150WL <- summary(R150WL)
sum_R150WL$adj.r.squared
## [1] 0.07239154
sum_R0 <- summary(R0)
sum_R0$adj.r.squared
## [1] 0.05255669
RObsFruVeg <- lm(R ~ wgtObs + conFru + conVeg, Pty_Hlth)
sjt.lm(RObsFruVeg)
R | ||||
B | CI | p | ||
(Intercept) | 0.17 | -18.09 – 18.42 | .985 | |
wgtObs | 0.53 | -0.39 – 1.46 | .253 | |
conFru | 0.61 | -0.21 – 1.43 | .144 | |
conVeg | -0.07 | -1.00 – 0.87 | .889 | |
Observations | 51 | |||
R2 / adj. R2 | .289 / .244 |
Surprisingly, the adjusted \(R^2\) is actually lower than the model including act0, though the adjusted \(R^2\) was higher across wgtObs, conFru, and conVeg than wgtObs, conFru, and act0.
What are your overall conclusions? What are the weaknesses of your results, and how could you improve them with better or different data?
Overall, the data does not allow for robust explanation of the dependent variable. Data with observations on individuals (as one would find in a survey) with continuous variables for given measures such as weight, number of minutes in various activities per week, dietary choices, and party affiliation might yield more significant results, and would make the modeling process much easier.
Calculations (using R): a. Derive the coefficients from your regression using the (X0X)−1X0Y formula. (If you run into problems using solve(), try using ginv() instead, which does the same thing but is a bit more robust.) b. For one of the coefficients, confirm its p value as shown in the regression output using the coefficient, its standard error, and pt() in R. c. Calculate the R2 and adjusted R2 using R, and confirm that your results match the regression output. d. Calulate the F statistic using R and confirm it against the regression output.
\((X'X)^{-1}X'Y = \beta\) 13a
mRObsFru300 <- as.matrix(cbind(1, Pty_Hlth$wgtObs, Pty_Hlth$conFru, Pty_Hlth$act300))
head(mRObsFru300)
## [,1] [,2] [,3] [,4]
## [1,] 1 22.00 37.25 34.90
## [2,] 1 24.25 34.20 31.45
## [3,] 1 22.60 41.25 37.05
## [4,] 1 25.05 37.60 28.55
## [5,] 1 29.65 38.40 30.45
## [6,] 1 25.15 34.00 39.10
solve(t(mRObsFru300) %*% mRObsFru300) %*% t(mRObsFru300) %*% Pty_Hlth$R
## [,1]
## [1,] -21.0009851
## [2,] 0.7021530
## [3,] 0.6611062
## [4,] 0.3921873
RObs300Fru$coefficients
## (Intercept) wgtObs conFru act300
## -21.0009851 0.7021530 0.6611062 0.3921873
13b
# conFru
pt(sumRFO3$coefficients[2, 1]/sumRFO3$coefficients[2, 2], 4, low = F)
## [1] 0.1026241
sumRFO3$coefficients
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -21.0009851 19.5032351 -1.076795 0.28706806
## wgtObs 0.7021530 0.4646104 1.511273 0.13741392
## conFru 0.6611062 0.3628551 1.821957 0.07482829
## act300 0.3921873 0.3200041 1.225570 0.22646667
The \(p_value\) for conFru seems to be calculated in the linear regression output via a different algorithm than pt() as indicated by the difference in the two values.
\(R^{2} = \frac{TSS - SSE}{TSS}\) 13c
ypred <- predict(RObs300Fru)
y <- Pty_Hlth$R
tss <- sum((y - mean(y))^2)
sse <- sum((y - ypred)^2)
r2 <- (tss - sse)/tss
r2
## [1] 0.3111206
sumRFO3$r.squared
## [1] 0.3111206
\(\textrm{adjusted } R^2 = \frac{TSS/df_t - SSE/df_e}{TSS/df_t}\) 13c - cont
n <- length(y)
k <- ncol(mRObsFru300) - 1
dft <- n - 1
dfe <- n - k - 1
(tss/dft - sse/dfe)/(tss/dft)
## [1] 0.2671496
sumRFO3$adj.r.squared
## [1] 0.2671496
\(F = \frac{R^2 / k}{(1-R^2)/(n-k-1)}\) 13d
f <- (r2/k)/((1 - r2)/(n - k - 1))
f
## [1] 7.075583
sumRFO3$fstatistic
## value numdf dendf
## 7.075583 3.000000 47.000000