Load some potentially useful packages:

library(ggplot2)

library(dplyr)


Question 1

Answer the following brief questions on logs, odds, and probabilities:

(a) What is log(10) in base 10? To find this using R, uncomment and run the command below:

 log( 10 , base=10 )
## [1] 1


(b) What is log(10) in base e? To find this using R, uncomment and run the command below:

 log( 10 )
## [1] 2.302585


(c) If log(x) = 2 (in base e), what is x?


(d) If the president of the U.S. has a 30% approval, what are their approval odds?


(e) If the odds of a president being impeached are 2 (Note: This is often reported as 2-to-1), what is the chance of impeachment?



Question 2

Here are the variables available in the WCGS data:


Read in the WCGS data:

WCGS = read.csv("https://raw.githubusercontent.com/cwolock/BIOST311/main/Datasets/wcgs.csv")


(a) What is the appropriate explanatory variable and response variable to explore the main scientific question that this study was interested in?

the appropriate explanatory variable is tabp, and the appropriate response variable is chd


(b) Create a plot summarizing the relationship between the explanatory and response variable you chose in (a). To help with this, consider the code skeleton below, and fill in the appropriate pieces:

 #We can convert numeric binary variables (e.g. 0-1 variables) into "No-Yes" variables, as follows (uncomment the 2 lines below):

 WCGS = WCGS %>% mutate( TypeA = ifelse(tabp==0,"No","Yes") ,
                       CHD = ifelse(chd==0,"No","Yes") )


# Make a barplot of the "No-Yes" explanatory variable, but where we fill the bars based on the "No-Yes" response variable (uncomment and complete the lines below):

 ggplot( data=WCGS , aes(x = TypeA,
                         fill = CHD) ) +
                         geom_bar()


(c) Construct a two-by-two table of the newly created TypeA and CHD variables from (b). To help with this, uncomment and complete the line of code below:

WCGS %>% select( TypeA , CHD ) %>% table( )
##      CHD
## TypeA   No  Yes
##   No  1486   79
##   Yes 1411  178


(d) What is the observed probability that someone develops CHD if they exhibit Type A behavior? What are the odds of CHD if someone has Type A behavior?

P=0.112, odds=0.1262


(e) What is the observed probability that someone develops CHD if they do not exhibit Type A behavior? What are the odds of CHD if someone does not have Type A behavior?

P=0.0505, odds=0.0532


(f) What is the observed odds ratio of CHD, comparing those with Type A behavior to those without?

Odds Ratio=2.3722, the odds of someone developing CHD with type A behavior is a little under 2.5 times more likely than the odds of someone developing CHD without type A behavior



Question 3

(a) Fit a simple logistic regression model with chd as your response variable and tabp as your predictor. Interpret the coefficient on tabp using an odds ratio.

glm(chd~tabp,family=binomial,data=WCGS)
## 
## Call:  glm(formula = chd ~ tabp, family = binomial, data = WCGS)
## 
## Coefficients:
## (Intercept)         tabp  
##     -2.9344       0.8641  
## 
## Degrees of Freedom: 3153 Total (i.e. Null);  3152 Residual
## Null Deviance:       1781 
## Residual Deviance: 1740  AIC: 1744

the coefficient for tabp is the number that is subtracted from the intercept to get the power of e which will give you the odds ratio


(b) What do you notice about your answer to (a), and your answers to Question 2(f)?

The coefficient for tabp is the power of e which represents the odds ratio


(c) Suppose as a secondary research question, we are interested in the association between high blood pressure and cholesterol. When arteries build up with plaque from cholesterol, the heart needs to work harder to pump blood through them. Therefore, we have reason to believe that higher levels of cholesterol are associated with high blood pressure.

For the purposes of this analysis, we will be defining someone with high blood pressure (a binary indicator) as someone who has a diastolic blood pressure greater than 90, AND a systolic blood pressure greater than 140. This is a relatively common definition of high blood pressure (also known as “hypertension”).

Create a binary indicator for high blood pressure in the WCGS dataset. Call this new variable highbp. To help, here is some code that you can uncomment and complete:

WCGS = WCGS %>% mutate(highbp = case_when((sbp>=140 & dbp>=90)~TRUE,(sbp<=140&dbp<=90)~FALSE,(sbp>=140&dbp<=90)~FALSE,(sbp<=140&dbp>=90)~FALSE))


(d) What proportion of people in this data set have high blood pressure?

df<-WCGS
table(WCGS$highbp)
## 
## FALSE  TRUE 
##  2701   453
print(453/3154)
## [1] 0.1436271


(e) Create an exploratory plot that describes the relationship between high blood pressure and cholesterol. How would you describe the relationship between these two variables?

ggplot(data=WCGS,aes(x=chol,y=highbp))+geom_boxplot()
## Warning: Removed 12 rows containing non-finite values (`stat_boxplot()`).

people with higher cholesterol are more likely to have high blood pressure.


(f) Fit a simple logistic regression model with high blood pressure as the outcome and cholesterol as the predictor. Interpret the coefficient on chol using an odds ratio.

glm(highbp~chol,family=binomial,data=WCGS)
## 
## Call:  glm(formula = highbp ~ chol, family = binomial, data = WCGS)
## 
## Coefficients:
## (Intercept)         chol  
##    -3.11001      0.00575  
## 
## Degrees of Freedom: 3141 Total (i.e. Null);  3140 Residual
##   (12 observations deleted due to missingness)
## Null Deviance:       2585 
## Residual Deviance: 2559  AIC: 2563

The chol coefficient represents the number subtracted from the intercept that will get you the power of e that equals the odds ratio, in this case, the odds ratio represents the strength of the relationship between high blood pressure and cholesterol, and since that ratio is equal to about 1, it can be inferred that the relationship between high blood pressure and cholesterol is weak


(g) Returning to the researcher’s initial question regarding the association between Type A behavior and CHD, suppose we have reason to believe that the relationship between Type A behavior and CHD differs depending on smoking status. We will consider 3 scientific questions:

1. For individuals of the same smoking status, what is the association between Type A behavior and CHD?

2. For individuals who do not smoke, what is the association between Type A behavior and CHD? (Note: The intention here is to allow a customized effect of Type A behavior for individuals who do/do not smoke)

3. Among individuals who smoke, for those who smoke the same number of cigarettes per day, what is the association between Type A behavior and CHD?

Create a smoking status variable according to the following definition: a non-smoker is someone who smokes 0 cigarettes per day, and a smoker is someone who smokes at least 1 cigarette per day.

WCGS=WCGS%>%mutate(smoking_status=case_when(ncigs>=1~"smoker",ncigs==0~"non-smoker"))


Part 1: Fit the regression model to answer scientific question 1., and interpret the relevant coefficient (that answers your question) in context.

glm(chd~tabp:smoking_status,family=binomial,data=WCGS)
## 
## Call:  glm(formula = chd ~ tabp:smoking_status, family = binomial, data = WCGS)
## 
## Coefficients:
##                   (Intercept)  tabp:smoking_statusnon-smoker  
##                        -2.934                          0.612  
##     tabp:smoking_statussmoker  
##                         1.070  
## 
## Degrees of Freedom: 3153 Total (i.e. Null);  3151 Residual
## Null Deviance:       1781 
## Residual Deviance: 1732  AIC: 1738

for people of the same smoking status, people with type A behavior are far less likely to develop CHD than those without it.

Part 2: Fit the regression model to answer scientific question 2., and interpret the relevant coefficient (that answers your question) in context.

glm(chd~tabp:smoking_status,family=binomial,data=WCGS)
## 
## Call:  glm(formula = chd ~ tabp:smoking_status, family = binomial, data = WCGS)
## 
## Coefficients:
##                   (Intercept)  tabp:smoking_statusnon-smoker  
##                        -2.934                          0.612  
##     tabp:smoking_statussmoker  
##                         1.070  
## 
## Degrees of Freedom: 3153 Total (i.e. Null);  3151 Residual
## Null Deviance:       1781 
## Residual Deviance: 1732  AIC: 1738

the non smoker coefficient refers to the number that you subtract from the intercept to get the power of e that will get you the odds ratio for non smokers, this odds ratio shows that among non-smokers, those with type A behavior have odds that are about ten times worse than those without type A behavior

Part 3: Fit the regression model to answer scientific question 3., and interpret the relevant coefficient (that answers your question) in context.

WCGS=WCGS%>%
  group_by(ncigs)%>%
  mutate(ncigs_cat=cut(ncigs,42))

I don’t know



Question 4

In 2002, John McCain and John Kerry proposed new auto standards that would have forced manufacturers to increase fuel economy. Carl Levin proposed an amendment to the proposal that would delay any required increase in fuel economy. Hence a YES vote on the Levin amendment was good news for manufacturers. It shouldn’t surprise you that senators also received contributions from auto manufacturers in 2002. How did auto contributions vary by political affiliation and how, if at all, were senators’ votes tied to their political affiliations and the contributions they received from auto manufacturers?

Senate.csv has data on how each of the 100 senators voted (Vote = 1 for a YES vote and Vote = 0 for a NO vote). The other important variables in the data are:


Read in the data:

Senate = read.csv( 'https://raw.githubusercontent.com/vittorioaddona/data/main/Senate.csv' )


Use this data to answer the following questions:

(a) Make a table of how the senators voted and their party affiliation.

Name  |Party|Vote|

:——–:|:—:|:–:| Murkowski R 1 Stevens R 1 Sessions R 1 Shelby R 1 Hutchinson R 1 Lincoln D 1 McCain R 0 Kyl R 1 Boxer D 0 Feinstein D 0 Allard R 1 Campbell R 1 Dodd D 0 Lieberman D 0 Carper D. 1 Biden. D. 0 Graham. D. 0 Nelson. D. 0 Cleland. D. 1 Miller. D. 1 Akaka. D. 0 Inouye. D. 0 Harkin. D. 0 Grassley. R. 1 Craig. R. 1 Crapo. R. 1 Durbin. D. 0 Fitzgerald. R. 1 Bayh. D. 1 Lugar. R. 1 Brownback. R. 1 Roberts. R. 1 Bunning. R. 1 McConnell. R. 1 Breaux. D. 1 Landrieu. D. 1 Kennedy. D. 0 Kerry. D. 0 Sarbanes. D. 0 Mikulski. D. 1 Collins. R. 0 Snowe. R. 0 Levin D. 1 Stabenow. D. 1 Dayton. D. 0 Wellstone. D. 0 Bond. R. 1 Carnahan. D. 1 Cochran. R. 1 Lott. R. 1 Baucus. D. 1 Burns. R. 1 Edwards. D. 0 Helms. R. 1 Conrad. D. 1 Dorgan. D. 1 Hagel. R. 1 Nelson. D. 1 Gregg R. 0 Smith. R. 1 Corzine. D. 0 Torricelli. D. 0 Bingaman. D. 0 Domenici. R. 1 Reid. D. 0 Ensign. R. 1 Clinton. D. 0 Schumer. D 0 DeWine. R. 1 Voinovich. R. 1 Inhofe. R. 1 Nickles. R. 1 Smith. R. 0 Wyden. D. 0 Santorum. R. 1 Specter. R. 1 Chafee. R. 0 Reed. D. 0 Hollings. D. 0 Thurmond. R. 1 Daschle. D. 0 Johnson. D. 1 Frist. R. 1 Thompson. R. 1 Gramm. R. 1 Hutchison. R. 1 Bennett. R. 1 Hatch. R. 1 Allen. R. 1 Warner. R. 1 Jeffords. D. 0 Leahy. D. 0 Cantwell. D. 0 Murray. D. 0 Feingold. D. 1 Kohl. D. 1 Rockefeller. D. 0 Byrd. D. 1 Enzi. R. 1 Thomas. R. 1


(b) What proportion of Democrats voted for the Levin amendment? What proportion of Republicans voted for the Levin amendment?

Democrats=0.387 Republicans=0.877


(c) What were the odds of a YES vote from a Democrat? What were the odds of a YES vote from a Republican?

Dem=0.6333 Rep=7.13008


(d) What was the ratio of the odds of a YES vote between Republicans and Democrats (put the Republican odds in the numerator)? This number is called an odds ratio.

11.294


(e) Fit a logistic regression model for Vote using party affiliation.

glm(Vote~Party,family=binomial,Senate)
## 
## Call:  glm(formula = Vote ~ Party, family = binomial, data = Senate)
## 
## Coefficients:
## (Intercept)       PartyR  
##     -0.5213       2.4907  
## 
## Degrees of Freedom: 99 Total (i.e. Null);  98 Residual
## Null Deviance:       132.8 
## Residual Deviance: 103.8     AIC: 107.8


(f) How could you obtain the odds ratio you found in (d) from the model output in (e)?

add the party r coefficient to the intercept coefficient and set that value as the power of e


(g) Now, fit a logistic regression model for Vote that uses both Party and AutoCont. Provide an interpretation of the coefficient on AutoCont.

glm(Vote~Party+AutoCont,family=binomial,Senate)
## 
## Call:  glm(formula = Vote ~ Party + AutoCont, family = binomial, data = Senate)
## 
## Coefficients:
## (Intercept)       PartyR     AutoCont  
##  -1.4001534    1.8810341    0.0001079  
## 
## Degrees of Freedom: 99 Total (i.e. Null);  97 Residual
## Null Deviance:       132.8 
## Residual Deviance: 89.3  AIC: 95.3

the AutoCont Coefficient shows how the probability of a yes vote changes based on contributions irrespective of party affiliation

(h) Did the odds ratio associated with Party increase or decrease when you added AutoCont to the model? What does this tell us about the contributions made to senators from each party (i.e. which party must have received higher contributions)? Briefly explain your reasoning.

it decreased, meaning that republicans had higher contributions because if you adjust the probability for a yes vote with contributions, democrats and republicans who received similar levels of contributions had a similar chance of voting yes.

(i) Using the model from (h), find the predicted probability of a YES vote for a hypothetical Republican senator who received $20,000 in contributions.

0.7252



fun = function(x,m,b) {  exp(m*x+b)/(1+exp(m*x+b)) }

ggplot( data=Senate, aes(x=AutoCont,y=jitter(Vote,amount=0.03),color=Party) ) +
  geom_point() + 
  stat_function(aes(color="R"),fun=fun,args=list(m=0.0001079,b=0.4808807)) +
  stat_function(aes(color="D"),fun=fun,args=list(m=0.0001079,b=-1.4001534)) +
  coord_cartesian(xlim = c(0,55000)) +
  ylab("Vote")