STAT 155 Topic 7 In-class Activity

Load some potentially useful packages:

library(ggplot2)

library(dplyr)

Question 1

Answer the following brief questions on logs, odds, and probabilities:

(a) What is log(10) in base 10? To find this using R, uncomment and run the command below:

 log( 10 , base=10 )

## [1] 1

(b) What is log(10) in base e? To find this using R, uncomment and run the command below:

 log( 10 )

## [1] 2.302585

(c) If log(x) = 2 (in base e), what is x?

(d) If the president of the U.S. has a 30% approval, what are their approval odds?

(e) If the odds of a president being impeached are 2 (Note: This is often reported as 2-to-1), what is the chance of impeachment?

Question 2

The Western Collaborative Group Study (WCGS) was designed in order to investigate a possible link between Type A behavior and coronary heart disease (CHD), and to develop a framework to select patients for intervention in order to decrease risk of CHD. The study contained 3154 cis men between the ages of 39 and 59 in California who had no history of CHD. They were enrolled in the study in 1960 and 1961, underwent a medical examination and covered their medical history, and they were re-examined annually for interim cardiovascular history.

Here are the variables available in the WCGS data:

id: Patient identification number
age: Age at time of enrollment in the study (years)
height: Height at time of enrollment in the study (inches)
weight: Weight at time of enrollment in the study (lbs)
sbp: Systolic blood pressure
dbp: Diastolic blood pressure
chol: Cholesterol (mg/dL)
ncigs: Number of cigarettes smoked per day
tabp: Presence (1) or absence (0) of Type A behavior
chd: Presence (1) or absence (0) of CHD over followup
typchd: Type of CHD event (0 = none, 1 = MI or SD, 2 = silent MI, 3 = angina)
time: Followup time (days)
arcus: Presence (1) or absence (0) of arcus senilis
bmi: Constructed variable; BMI = weight * 703 / height^2

Read in the WCGS data:

WCGS = read.csv("https://raw.githubusercontent.com/cwolock/BIOST311/main/Datasets/wcgs.csv")

(a) What is the appropriate explanatory variable and response variable to explore the main scientific question that this study was interested in?

the appropriate explanatory variable is tabp, and the appropriate response variable is chd

(b) Create a plot summarizing the relationship between the explanatory and response variable you chose in (a). To help with this, consider the code skeleton below, and fill in the appropriate pieces:

 #We can convert numeric binary variables (e.g. 0-1 variables) into "No-Yes" variables, as follows (uncomment the 2 lines below):

 WCGS = WCGS %>% mutate( TypeA = ifelse(tabp==0,"No","Yes") ,
                       CHD = ifelse(chd==0,"No","Yes") )


# Make a barplot of the "No-Yes" explanatory variable, but where we fill the bars based on the "No-Yes" response variable (uncomment and complete the lines below):

 ggplot( data=WCGS , aes(x = TypeA,
                         fill = CHD) ) +
                         geom_bar()

(c) Construct a two-by-two table of the newly created TypeA and CHD variables from (b). To help with this, uncomment and complete the line of code below:

WCGS %>% select( TypeA , CHD ) %>% table( )

##      CHD
## TypeA   No  Yes
##   No  1486   79
##   Yes 1411  178

(d) What is the observed probability that someone develops CHD if they exhibit Type A behavior? What are the odds of CHD if someone has Type A behavior?

P=0.112, odds=0.1262

(e) What is the observed probability that someone develops CHD if they do not exhibit Type A behavior? What are the odds of CHD if someone does not have Type A behavior?

P=0.0505, odds=0.0532

(f) What is the observed odds ratio of CHD, comparing those with Type A behavior to those without?

Odds Ratio=2.3722, the odds of someone developing CHD with type A behavior is a little under 2.5 times more likely than the odds of someone developing CHD without type A behavior

Question 3

(a) Fit a simple logistic regression model with `chd` as your response variable and `tabp` as your predictor. Interpret the coefficient on `tabp` using an odds ratio.

glm(chd~tabp,family=binomial,data=WCGS)

## 
## Call:  glm(formula = chd ~ tabp, family = binomial, data = WCGS)
## 
## Coefficients:
## (Intercept)         tabp  
##     -2.9344       0.8641  
## 
## Degrees of Freedom: 3153 Total (i.e. Null);  3152 Residual
## Null Deviance:       1781 
## Residual Deviance: 1740  AIC: 1744

the coefficient for tabp is the number that is subtracted from the intercept to get the power of e which will give you the odds ratio

(b) What do you notice about your answer to (a), and your answers to Question 2(f)?

The coefficient for tabp is the power of e which represents the odds ratio

(c) Suppose as a secondary research question, we are interested in the association between high blood pressure and cholesterol. When arteries build up with plaque from cholesterol, the heart needs to work harder to pump blood through them. Therefore, we have reason to believe that higher levels of cholesterol are associated with high blood pressure.

For the purposes of this analysis, we will be defining someone with high blood pressure (a binary indicator) as someone who has a diastolic blood pressure greater than 90, AND a systolic blood pressure greater than 140. This is a relatively common definition of high blood pressure (also known as “hypertension”).

Create a binary indicator for high blood pressure in the `WCGS` dataset. Call this new variable `highbp`. To help, here is some code that you can uncomment and complete:

WCGS = WCGS %>% mutate(highbp = case_when((sbp>=140 & dbp>=90)~TRUE,(sbp<=140&dbp<=90)~FALSE,(sbp>=140&dbp<=90)~FALSE,(sbp<=140&dbp>=90)~FALSE))

(d) What proportion of people in this data set have high blood pressure?

df<-WCGS
table(WCGS$highbp)

## 
## FALSE  TRUE 
##  2701   453

print(453/3154)

## [1] 0.1436271

(e) Create an exploratory plot that describes the relationship between high blood pressure and cholesterol. How would you describe the relationship between these two variables?

ggplot(data=WCGS,aes(x=chol,y=highbp))+geom_boxplot()

## Warning: Removed 12 rows containing non-finite values (`stat_boxplot()`).

people with higher cholesterol are more likely to have high blood pressure.

(f) Fit a simple logistic regression model with high blood pressure as the outcome and cholesterol as the predictor. Interpret the coefficient on `chol` using an odds ratio.

glm(highbp~chol,family=binomial,data=WCGS)

## 
## Call:  glm(formula = highbp ~ chol, family = binomial, data = WCGS)
## 
## Coefficients:
## (Intercept)         chol  
##    -3.11001      0.00575  
## 
## Degrees of Freedom: 3141 Total (i.e. Null);  3140 Residual
##   (12 observations deleted due to missingness)
## Null Deviance:       2585 
## Residual Deviance: 2559  AIC: 2563

The chol coefficient represents the number subtracted from the intercept that will get you the power of e that equals the odds ratio, in this case, the odds ratio represents the strength of the relationship between high blood pressure and cholesterol, and since that ratio is equal to about 1, it can be inferred that the relationship between high blood pressure and cholesterol is weak

(g) Returning to the researcher’s initial question regarding the association between Type A behavior and CHD, suppose we have reason to believe that the relationship between Type A behavior and CHD differs depending on smoking status. We will consider 3 scientific questions:

1. For individuals of the same smoking status, what is the association between Type A behavior and CHD?

2. For individuals who do not smoke, what is the association between Type A behavior and CHD? (Note: The intention here is to allow a customized effect of Type A behavior for individuals who do/do not smoke)

3. Among individuals who smoke, for those who smoke the same number of cigarettes per day, what is the association between Type A behavior and CHD?

Create a smoking status variable according to the following definition: a non-smoker is someone who smokes 0 cigarettes per day, and a smoker is someone who smokes at least 1 cigarette per day.

WCGS=WCGS%>%mutate(smoking_status=case_when(ncigs>=1~"smoker",ncigs==0~"non-smoker"))

Part 1: Fit the regression model to answer scientific question 1., and interpret the relevant coefficient (that answers your question) in context.

glm(chd~tabp:smoking_status,family=binomial,data=WCGS)

## 
## Call:  glm(formula = chd ~ tabp:smoking_status, family = binomial, data = WCGS)
## 
## Coefficients:
##                   (Intercept)  tabp:smoking_statusnon-smoker  
##                        -2.934                          0.612  
##     tabp:smoking_statussmoker  
##                         1.070  
## 
## Degrees of Freedom: 3153 Total (i.e. Null);  3151 Residual
## Null Deviance:       1781 
## Residual Deviance: 1732  AIC: 1738

for people of the same smoking status, people with type A behavior are far less likely to develop CHD than those without it.

Part 2: Fit the regression model to answer scientific question 2., and interpret the relevant coefficient (that answers your question) in context.

glm(chd~tabp:smoking_status,family=binomial,data=WCGS)

## 
## Call:  glm(formula = chd ~ tabp:smoking_status, family = binomial, data = WCGS)
## 
## Coefficients:
##                   (Intercept)  tabp:smoking_statusnon-smoker  
##                        -2.934                          0.612  
##     tabp:smoking_statussmoker  
##                         1.070  
## 
## Degrees of Freedom: 3153 Total (i.e. Null);  3151 Residual
## Null Deviance:       1781 
## Residual Deviance: 1732  AIC: 1738

the non smoker coefficient refers to the number that you subtract from the intercept to get the power of e that will get you the odds ratio for non smokers, this odds ratio shows that among non-smokers, those with type A behavior have odds that are about ten times worse than those without type A behavior

Part 3: Fit the regression model to answer scientific question 3., and interpret the relevant coefficient (that answers your question) in context.

WCGS=WCGS%>%
  group_by(ncigs)%>%
  mutate(ncigs_cat=cut(ncigs,42))

I don’t know

Question 4

In 2002, John McCain and John Kerry proposed new auto standards that would have forced manufacturers to increase fuel economy. Carl Levin proposed an amendment to the proposal that would delay any required increase in fuel economy. Hence a YES vote on the Levin amendment was good news for manufacturers. It shouldn’t surprise you that senators also received contributions from auto manufacturers in 2002. How did auto contributions vary by political affiliation and how, if at all, were senators’ votes tied to their political affiliations and the contributions they received from auto manufacturers?

Senate.csv has data on how each of the 100 senators voted (Vote = 1 for a YES vote and Vote = 0 for a NO vote). The other important variables in the data are:

Party: R = republican and D = democrat
AutoCont: Amount of money received in contributions from auto manufacturers

Read in the data:

Senate = read.csv( 'https://raw.githubusercontent.com/vittorioaddona/data/main/Senate.csv' )

Use this data to answer the following questions:

(a) Make a table of how the senators voted and their party affiliation.

Name  |Party|Vote|

:——–:|:—:|:–:| Murkowski R 1 Stevens R 1 Sessions R 1 Shelby R 1 Hutchinson R 1 Lincoln D 1 McCain R 0 Kyl R 1 Boxer D 0 Feinstein D 0 Allard R 1 Campbell R 1 Dodd D 0 Lieberman D 0 Carper D. 1 Biden. D. 0 Graham. D. 0 Nelson. D. 0 Cleland. D. 1 Miller. D. 1 Akaka. D. 0 Inouye. D. 0 Harkin. D. 0 Grassley. R. 1 Craig. R. 1 Crapo. R. 1 Durbin. D. 0 Fitzgerald. R. 1 Bayh. D. 1 Lugar. R. 1 Brownback. R. 1 Roberts. R. 1 Bunning. R. 1 McConnell. R. 1 Breaux. D. 1 Landrieu. D. 1 Kennedy. D. 0 Kerry. D. 0 Sarbanes. D. 0 Mikulski. D. 1 Collins. R. 0 Snowe. R. 0 Levin D. 1 Stabenow. D. 1 Dayton. D. 0 Wellstone. D. 0 Bond. R. 1 Carnahan. D. 1 Cochran. R. 1 Lott. R. 1 Baucus. D. 1 Burns. R. 1 Edwards. D. 0 Helms. R. 1 Conrad. D. 1 Dorgan. D. 1 Hagel. R. 1 Nelson. D. 1 Gregg R. 0 Smith. R. 1 Corzine. D. 0 Torricelli. D. 0 Bingaman. D. 0 Domenici. R. 1 Reid. D. 0 Ensign. R. 1 Clinton. D. 0 Schumer. D 0 DeWine. R. 1 Voinovich. R. 1 Inhofe. R. 1 Nickles. R. 1 Smith. R. 0 Wyden. D. 0 Santorum. R. 1 Specter. R. 1 Chafee. R. 0 Reed. D. 0 Hollings. D. 0 Thurmond. R. 1 Daschle. D. 0 Johnson. D. 1 Frist. R. 1 Thompson. R. 1 Gramm. R. 1 Hutchison. R. 1 Bennett. R. 1 Hatch. R. 1 Allen. R. 1 Warner. R. 1 Jeffords. D. 0 Leahy. D. 0 Cantwell. D. 0 Murray. D. 0 Feingold. D. 1 Kohl. D. 1 Rockefeller. D. 0 Byrd. D. 1 Enzi. R. 1 Thomas. R. 1

(b) What proportion of Democrats voted for the Levin amendment? What proportion of Republicans voted for the Levin amendment?

Democrats=0.387 Republicans=0.877

(c) What were the odds of a YES vote from a Democrat? What were the odds of a YES vote from a Republican?

Dem=0.6333 Rep=7.13008

(d) What was the ratio of the odds of a YES vote between Republicans and Democrats (put the Republican odds in the numerator)? This number is called an odds ratio.

11.294

(e) Fit a logistic regression model for `Vote` using party affiliation.

glm(Vote~Party,family=binomial,Senate)

## 
## Call:  glm(formula = Vote ~ Party, family = binomial, data = Senate)
## 
## Coefficients:
## (Intercept)       PartyR  
##     -0.5213       2.4907  
## 
## Degrees of Freedom: 99 Total (i.e. Null);  98 Residual
## Null Deviance:       132.8 
## Residual Deviance: 103.8     AIC: 107.8

(f) How could you obtain the odds ratio you found in (d) from the model output in (e)?

add the pary r coefficient to the intercept coefficient and set that value as the power of e

(g) Now, fit a logistic regression model for `Vote` that uses both `Party` and `AutoCont`. Provide an interpretation of the coefficient on `AutoCont`.

glm(Vote~Party+AutoCont,family=binomial,Senate)

## 
## Call:  glm(formula = Vote ~ Party + AutoCont, family = binomial, data = Senate)
## 
## Coefficients:
## (Intercept)       PartyR     AutoCont  
##  -1.4001534    1.8810341    0.0001079  
## 
## Degrees of Freedom: 99 Total (i.e. Null);  97 Residual
## Null Deviance:       132.8 
## Residual Deviance: 89.3  AIC: 95.3

(h) Did the odds ratio associated with `Party` increase or decrease when you added `AutoCont` to the model? What does this tell us about the contributions made to senators from each party (i.e. which party must have received higher contributions)? Briefly explain your reasoning.

decrease, which means that the democrats received more contributions because their odds went up more

(i) Using the model from (h), find the predicted probability of a YES vote for a hypothetical Republican senator who received $20,000 in contributions.

0.7252

This isn’t a question, but here is a visualization of the model in (g):

fun = function(x,m,b) {  exp(m*x+b)/(1+exp(m*x+b)) }

ggplot( data=Senate, aes(x=AutoCont,y=jitter(Vote,amount=0.03),color=Party) ) +
  geom_point() + 
  stat_function(aes(color="R"),fun=fun,args=list(m=0.0001079,b=0.4808807)) +
  stat_function(aes(color="D"),fun=fun,args=list(m=0.0001079,b=-1.4001534)) +
  coord_cartesian(xlim = c(0,55000)) +
  ylab("Vote")