BIO205 week 11 - multiple regression

Introduction

Multiple regression allows us to model how certain factors help predict an outcome. Usually, we use multiple regression when we have three or more measurement variables, of which one we call the response variable. The goal is to determine whether or not using some, or all, of these variables can help us confidently predict a response (dependent variable). For this, we will turn to some examples of the SARS-CoV2.

Part 1. Proteins of SARS-CoV2 entry

SARS-CoV2 is a virus, and as we know can cause no symptoms, symptoms from fevers and difficulty breathing, and even extreme sickness. How does this virus enter our body? It primarily attack the lungs (hence the masks), and the virus takes advantange of proteins in our lung airways to infect hosts.

The study we will examine comes from a study that wants to address whether two proteins – ACE2 and TMPRSS2 – correlate to infection and can predict infection. In the study, ten cultures of lung tissue were exposed to the SARS-CoV2 virus. The undergrad in the lab examined the infection % of each plate, then lysed the cells, extracted protein, and measured protein levels of the two proteins – ACE2 and TMPRSS2. These two proteins are thought to be involved with SARS-CoV2 infection.

ACE2: Angiotensin-Converting Enzyme 2, is a protein found in the lungs that spike proteins on SARS-CoV2 binds, and uses to enter a lung cell

TMPRSS2: Transmembrane Serine Protease 2, is a protein found in lungs that cleaves ACE2 and aids in the delivery of viral proteins into the cell

Part 1a. First, let’s, as always, look at the dataset.

sars <- read.csv("DataACETM.csv")
# View(sars)
str(sars)

## 'data.frame':    10 obs. of  5 variables:
##  $ plate      : int  1 2 3 4 5 6 7 8 9 10
##  $ infection  : int  10 50 60 85 100 14 52 45 68 82
##  $ ACE2.ug.mg.: int  2 6 7 8 12 3 5 4 7 6
##  $ TMPRSS2    : int  11 15 26 29 32 15 20 16 23 29
##  $ camostat   : int  50 25 25 10 10 50 25 25 10 10

Looking at the dataset, we can see that there are 10 plates in this dataset, and each plate has an infection rate, a protein amount for ACE2, and a protein amount for TMPRSS2.

Part 1b. So, we can now look at the data with using ggscatter plots.

library(ggpubr)

## Loading required package: ggplot2

ace <- ggscatter(sars, y = "infection", x = "ACE2.ug.mg.",
                 add = "reg.line", 
                 conf.int = TRUE, 
                 add.params = list(color = "blue"))
tmpr <- ggscatter(sars, y = "infection", x = "TMPRSS2",
                  add = "reg.line", 
                  conf.int = TRUE, 
                  add.params = list(color = "gold"))
ggarrange(ace, tmpr, #rremove("x.text") 
          labels = c("A", "B"),
          ncol = 2, nrow = 1)

## `geom_smooth()` using formula 'y ~ x'

## `geom_smooth()` using formula 'y ~ x'

Cool. You made a nice figure! Note that i combined the scatter plots into on image, using the ggarange() function.

However, let’s say we also wanted to look at ACE2 vs TMPRSS2. We can make another scatter plot, that is not so hard. But, what if we have some datasets that have more than 2 predictor variables. There is a way to look at this quickly.

sars2 <- sars[,-c(1, 5)] # first, let's remove the column that is just the plate number
plot(sars2)

Woah, that is a lot of graphs! As you can see, each variable is plotted against each other. Notice that I took out the column with plate number information (hidden variable), which you should do for hidden variables. Notice that some of the boxes are duplications, with the two variables just swapped in the x and y axis.

Part 1c. Let’s look at some correlations

There are two ways to do this. A simple way.

cor(sars2)

##             infection ACE2.ug.mg.   TMPRSS2
## infection   1.0000000   0.9074260 0.9310635
## ACE2.ug.mg. 0.9074260   1.0000000 0.8567141
## TMPRSS2     0.9310635   0.8567141 1.0000000

Or a prettier way

library(GGally)

## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2

ggcorr(sars2, label = TRUE, label_round = 3)

Great. So it looks like both ACE2 and TMPRSS2 correlate with infection. However, notice also that ACE2 correlates with a strong positive relationship with TMPRSS2 as well. If this is the case, they might have redundant roles in an infection model. We call this phenomenon multicollinearity.

Part 1d. Looking at it all with one function

There is another way to quickly look at many relationships, using the ggpairs() function.

library(GGally)
ggpairs(sars2)

#remember, we took out the cell plate number by taking out column 1 with the line below
# sars2 <- sars[,-c(1,5)]

Neat! Look at all those graphs in one function, and the correlation values. The bottom left graphs give you the scatter, the top left gives you the correlation values, and the diagonal gives you the distribution of your data (to test for normality).

Here is another one!

ggpairs(sars2, lower = list(continuous = "smooth")) #add some regression lines

Part 1e. Multiple linear regression

This is great. But we want to know what the contribution of both these proteins might be on infection rate. We want to come up with an equation that helps us determine the contribution of one, or both, of the paramters to infection. Recall that our equation for linear regression was

y = a + bx

For multiple regression, it is

y = a + b1x + b2x + bNx

infection <- lm(infection ~ ACE2.ug.mg. + TMPRSS2, data = sars)
summary(infection)

## 
## Call:
## lm(formula = infection ~ ACE2.ug.mg. + TMPRSS2, data = sars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -14.478  -5.978   3.026   7.112   9.939 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept) -19.1838    10.5872  -1.812   0.1129  
## ACE2.ug.mg.   4.2565     2.2403   1.900   0.0992 .
## TMPRSS2       2.3261     0.8746   2.660   0.0325 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.805 on 7 degrees of freedom
## Multiple R-squared:  0.9122, Adjusted R-squared:  0.8871 
## F-statistic: 36.35 on 2 and 7 DF,  p-value: 0.0002008

Part 1e continued - the interpretation

What the program gives you in this table is the model for all the variables (intercept, ACE2, and TMPRSS2). The first is the intercept, and this tells you the value of the response variable if both predictors are zero. ACE2 and TMPRSS2 are the slopes of those variables. The null hypothesis would state that the slope is not different than zero.

Here, we have two predictors, so the equation we use is

y = a + b1x1 + b2x2

infection = -19.1838 + 4.2565[ACE2] + 2.3261[TMPRSS2].

Notice that the multiple R-squared is 0.9122. Higher R2, means that a greater percentage of variation in the response (here, infection) is explained by the predictors in the model (ACE2 and TMPRSS2 protein levels). In other words, high R2 means the model fits your data well.

Because adding predictors increases R2, the adjusted R2 is used when you have many predictors. It gives you a more honest estimate of your R2.

The p value is 0.0002. This derives from the F statistic, similar to the ANOVA linear modeling. Here, they help you address whether the predictor/independent variables reliably predict the response/dependent variable. Given the p value, the result suggests that we should reject our null that there the two proteins - ACE2 and TMPRSS2 - DO NOT predict infection. Thus, we can use the equation to help predict infection, if we know our predictor variables.

In this hypothetical example, slope of the TMPRSS2 is signficant, but that for ACE2 is not. Thus, using both ACE2 AND TMPRSS2 is not that different than TMPRSS2 alone. We likely should just use TMPRSS2 in our model.

This can be seen if you run your analyses as if they were simple linear regression models.

ace2 <- lm(infection ~ ACE2.ug.mg., data = sars2)
summary(ace2)

## 
## Call:
## lm(formula = infection ~ ACE2.ug.mg., data = sars2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -14.517  -8.517  -1.961   6.532  25.400 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   0.4333    10.0735   0.043 0.966742    
## ACE2.ug.mg.   9.3611     1.5326   6.108 0.000287 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13 on 8 degrees of freedom
## Multiple R-squared:  0.8234, Adjusted R-squared:  0.8013 
## F-statistic: 37.31 on 1 and 8 DF,  p-value: 0.000287

tmprss2 <- lm(infection ~ TMPRSS2, data = sars2)
summary(tmprss2)

## 
## Call:
## lm(formula = infection ~ TMPRSS2, data = sars2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -17.851  -5.726   1.026   5.713  18.149 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -24.3954    11.7761  -2.072    0.072 .  
## TMPRSS2       3.7498     0.5195   7.218 9.09e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11.29 on 8 degrees of freedom
## Multiple R-squared:  0.8669, Adjusted R-squared:  0.8502 
## F-statistic:  52.1 on 1 and 8 DF,  p-value: 9.086e-05

Both individually yield significant models; however, the multiple regression model suggests, again, that using TMPRSS2 alone is most predictive.

Part 2. Practice

Using the full dataset - DataACETM.csv, or what i called sars when i read the file. Use it to run through the same visualization for graphs (ggpairs) and linear modeling to get an equation (lm).

What is the infection if you have the mean ACE2 and TMPRSS2 protein level and the mean comastat drug amount?

image of the week