Multiple regression allows us to model how certain factors help predict an outcome. Usually, we use multiple regression when we have three or more measurement variables, of which one we call the response variable. The goal is to determine whether or not using some, or all, of these variables can help us confidently predict a response (dependent variable). For this, we will turn to some examples of the SARS-CoV2.
SARS-CoV2 is a virus, and as we know can cause no symptoms, symptoms from fevers and difficulty breathing, and even extreme sickness. How does this virus enter our body? It primarily attack the lungs (hence the masks), and the virus takes advantange of proteins in our lung airways to infect hosts.
The study we will examine comes from a study that wants to address whether two proteins – ACE2 and TMPRSS2 – correlate to infection and can predict infection. In the study, ten cultures of lung tissue were exposed to the SARS-CoV2 virus. The undergrad in the lab examined the infection % of each plate, then lysed the cells, extracted protein, and measured protein levels of the two proteins – ACE2 and TMPRSS2. These two proteins are thought to be involved with SARS-CoV2 infection.
ACE2: Angiotensin-Converting Enzyme 2, is a protein found in the lungs that spike proteins on SARS-CoV2 binds, and uses to enter a lung cell
TMPRSS2: Transmembrane Serine Protease 2, is a protein found in lungs that cleaves ACE2 and aids in the delivery of viral proteins into the cell
sars <- read.csv("DataACETM.csv")
# View(sars)
str(sars)
## 'data.frame': 10 obs. of 5 variables:
## $ plate : int 1 2 3 4 5 6 7 8 9 10
## $ infection : int 10 50 60 85 100 14 52 45 68 82
## $ ACE2.ug.mg.: int 2 6 7 8 12 3 5 4 7 6
## $ TMPRSS2 : int 11 15 26 29 32 15 20 16 23 29
## $ camostat : int 50 25 25 10 10 50 25 25 10 10
Looking at the dataset, we can see that there are 10 plates in this dataset, and each plate has an infection rate, a protein amount for ACE2, and a protein amount for TMPRSS2.
library(ggpubr)
## Loading required package: ggplot2
ace <- ggscatter(sars, y = "infection", x = "ACE2.ug.mg.",
add = "reg.line",
conf.int = TRUE,
add.params = list(color = "blue"))
tmpr <- ggscatter(sars, y = "infection", x = "TMPRSS2",
add = "reg.line",
conf.int = TRUE,
add.params = list(color = "gold"))
ggarrange(ace, tmpr, #rremove("x.text")
labels = c("A", "B"),
ncol = 2, nrow = 1)
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
Cool. You made a nice figure! Note that i combined the scatter plots into on image, using the ggarange() function.
However, let’s say we also wanted to look at ACE2 vs TMPRSS2. We can make another scatter plot, that is not so hard. But, what if we have some datasets that have more than 2 predictor variables. There is a way to look at this quickly.
sars2 <- sars[,-c(1, 5)] # first, let's remove the column that is just the plate number
plot(sars2)
Woah, that is a lot of graphs! As you can see, each variable is plotted against each other. Notice that I took out the column with plate number information (hidden variable), which you should do for hidden variables. Notice that some of the boxes are duplications, with the two variables just swapped in the x and y axis.
There are two ways to do this. A simple way.
cor(sars2)
## infection ACE2.ug.mg. TMPRSS2
## infection 1.0000000 0.9074260 0.9310635
## ACE2.ug.mg. 0.9074260 1.0000000 0.8567141
## TMPRSS2 0.9310635 0.8567141 1.0000000
Or a prettier way
library(GGally)
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
ggcorr(sars2, label = TRUE, label_round = 3)
Great. So it looks like both ACE2 and TMPRSS2 correlate with infection. However, notice also that ACE2 correlates with a strong positive relationship with TMPRSS2 as well. If this is the case, they might have redundant roles in an infection model. We call this phenomenon multicollinearity.
There is another way to quickly look at many relationships, using the ggpairs() function.
library(GGally)
ggpairs(sars2)
#remember, we took out the cell plate number by taking out column 1 with the line below
# sars2 <- sars[,-c(1,5)]
Neat! Look at all those graphs in one function, and the correlation values. The bottom left graphs give you the scatter, the top left gives you the correlation values, and the diagonal gives you the distribution of your data (to test for normality).
Here is another one!
ggpairs(sars2, lower = list(continuous = "smooth")) #add some regression lines
This is great. But we want to know what the contribution of both these proteins might be on infection rate. We want to come up with an equation that helps us determine the contribution of one, or both, of the paramters to infection. Recall that our equation for linear regression was
y = a + bx
For multiple regression, it is
y = a + b1x + b2x + bNx
infection <- lm(infection ~ ACE2.ug.mg. + TMPRSS2, data = sars)
summary(infection)
##
## Call:
## lm(formula = infection ~ ACE2.ug.mg. + TMPRSS2, data = sars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.478 -5.978 3.026 7.112 9.939
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -19.1838 10.5872 -1.812 0.1129
## ACE2.ug.mg. 4.2565 2.2403 1.900 0.0992 .
## TMPRSS2 2.3261 0.8746 2.660 0.0325 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.805 on 7 degrees of freedom
## Multiple R-squared: 0.9122, Adjusted R-squared: 0.8871
## F-statistic: 36.35 on 2 and 7 DF, p-value: 0.0002008
What the program gives you in this table is the model for all the variables (intercept, ACE2, and TMPRSS2). The first is the intercept, and this tells you the value of the response variable if both predictors are zero. ACE2 and TMPRSS2 are the slopes of those variables. The null hypothesis would state that the slope is not different than zero.
Here, we have two predictors, so the equation we use is
y = a + b1x1 + b2x2
or
infection = -19.1838 + 4.2565[ACE2] + 2.3261[TMPRSS2].
Notice that the multiple R-squared is 0.9122. Higher R2, means that a greater percentage of variation in the response (here, infection) is explained by the predictors in the model (ACE2 and TMPRSS2 protein levels). In other words, high R2 means the model fits your data well.
Because adding predictors increases R2, the adjusted R2 is used when you have many predictors. It gives you a more honest estimate of your R2.
The p value is 0.0002. This derives from the F statistic, similar to the ANOVA linear modeling. Here, they help you address whether the predictor/independent variables reliably predict the response/dependent variable. Given the p value, the result suggests that we should reject our null that there the two proteins - ACE2 and TMPRSS2 - DO NOT predict infection. Thus, we can use the equation to help predict infection, if we know our predictor variables.
In this hypothetical example, slope of the TMPRSS2 is signficant, but that for ACE2 is not. Thus, using both ACE2 AND TMPRSS2 is not that different than TMPRSS2 alone. We likely should just use TMPRSS2 in our model.
This can be seen if you run your analyses as if they were simple linear regression models.
ace2 <- lm(infection ~ ACE2.ug.mg., data = sars2)
summary(ace2)
##
## Call:
## lm(formula = infection ~ ACE2.ug.mg., data = sars2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.517 -8.517 -1.961 6.532 25.400
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.4333 10.0735 0.043 0.966742
## ACE2.ug.mg. 9.3611 1.5326 6.108 0.000287 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13 on 8 degrees of freedom
## Multiple R-squared: 0.8234, Adjusted R-squared: 0.8013
## F-statistic: 37.31 on 1 and 8 DF, p-value: 0.000287
tmprss2 <- lm(infection ~ TMPRSS2, data = sars2)
summary(tmprss2)
##
## Call:
## lm(formula = infection ~ TMPRSS2, data = sars2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.851 -5.726 1.026 5.713 18.149
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -24.3954 11.7761 -2.072 0.072 .
## TMPRSS2 3.7498 0.5195 7.218 9.09e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11.29 on 8 degrees of freedom
## Multiple R-squared: 0.8669, Adjusted R-squared: 0.8502
## F-statistic: 52.1 on 1 and 8 DF, p-value: 9.086e-05
Both individually yield significant models; however, the multiple regression model suggests, again, that using TMPRSS2 alone is most predictive.
Using the full dataset - DataACETM.csv, or what i called sars when i read the file. Use it to run through the same visualization for graphs (ggpairs) and linear modeling to get an equation (lm).
What is the infection if you have the mean ACE2 and TMPRSS2 protein level and the mean comastat drug amount?
image of the week