The most common usage when both your disease predictor and outcome are continuous are correlations. Typically these are fit linearly, and if not a best fit line can be established to understand what numerical transformation is needed in order to better conduct a linear regression.
We will need several things, namely a workable data set and some packages. Let’s start with a dataset that’s built into R that a lot of my intro stats class uses: Cars.
data <- mtcars
head(data,10)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
## Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
## Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
## Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
## Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
Let’s also utilize some data visualization packages to make sure that we spruce up our review of the analysis later on. Two common ones are ggplot2 and ggpubr.
Typically you can request these packages from CRAN, but just to be a little fancy we’ll pull directly from their respective github pages.
The latter code was unnessesary because apparently ggplot2 is a part of the ggpubr git, but better safe than sorry! I also ended up just using CRAN because it never gives me an error message if I load a package.
When performing a parametric test such as correlation, you’d want to establish normality in the data set. While a perfectly normal set of data is unrealistic, it at least gives you an idea of how robust your correlation may be beyond just the the R or R - squared value. You effectively have something to reference to justify performing a transformation or using other analysis.
Let’s just start with a scatter plot for now. Let’s compare a car’s horsepower with their miles per gallon.
library("ggpubr")
## Warning: package 'ggpubr' was built under R version 3.5.3
## Loading required package: ggplot2
## Loading required package: magrittr
## Warning: package 'magrittr' was built under R version 3.5.2
ggscatter(data, x= "mpg", y = "hp",
color = "red", fill = "lightgray",
add = "reg.line", conf.int = TRUE,
cor.coef = TRUE, cor.method = "pearson",
xlab = "Miles per Gallon", ylab = "Horsepower")
Look at that, red beauty. No surprise here, lower horsepower nets a higher mpg rating.
Now let’s check for normality.
shapiro.test(data$mpg)
##
## Shapiro-Wilk normality test
##
## data: data$mpg
## W = 0.94756, p-value = 0.1229
shapiro.test(data$hp)
##
## Shapiro-Wilk normality test
##
## data: data$hp
## W = 0.93342, p-value = 0.04881
Looks like horsepower’s p-value is <0.05 meaning it is not normal. Not surprising considering out scatter plot would have a trend line that looks like an inverse log function. But this hardly removes the validity from our analysis given how strong the correlation is.
How do the Q-Q plots look?
ggqqplot(data$mpg, ylab = "mpg")
ggqqplot(data$hp, ylab = "horsepower")
At least the QQ plot for horsepower fits within a normal distribution characterized by the shade of grey.
This means we can move forward with a person correlation test, should these prove to the be grossly non parametic, a Spearman or Kendall rank-based test would be used.
#Pearson Correlation test
pearson <- cor.test(data$hp, data$mpg,
method = "pearson")
kendall <- cor.test(data$hp, data$mpg,
method = "kendall")
## Warning in cor.test.default(data$hp, data$mpg, method = "kendall"): Cannot
## compute exact p-value with ties
spearman <- cor.test(data$hp, data$mpg,
method = "spearman")
## Warning in cor.test.default(data$hp, data$mpg, method = "spearman"): Cannot
## compute exact p-value with ties
pearson
##
## Pearson's product-moment correlation
##
## data: data$hp and data$mpg
## t = -6.7424, df = 30, p-value = 1.788e-07
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.8852686 -0.5860994
## sample estimates:
## cor
## -0.7761684
kendall
##
## Kendall's rank correlation tau
##
## data: data$hp and data$mpg
## z = -5.871, p-value = 4.332e-09
## alternative hypothesis: true tau is not equal to 0
## sample estimates:
## tau
## -0.7428125
spearman
##
## Spearman's rank correlation rho
##
## data: data$hp and data$mpg
## S = 10337, p-value = 5.086e-12
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## -0.8946646
Ahhh, the quintessential linear regression. The workhorse of any basic statistical analysis, and pattern recognition at it’s finest.
Let’s load a data set and some packages.
install.packages("tidyverse", repos="http://cran.rstudio.com/", dependencies=TRUE)
## Installing package into 'C:/Users/Alexander Chang/Documents/R/win-library/3.5'
## (as 'lib' is unspecified)
## package 'tidyverse' successfully unpacked and MD5 sums checked
##
## The downloaded binary packages are in
## C:\Users\Alexander Chang\AppData\Local\Temp\RtmpOu8dJ9\downloaded_packages
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 3.5.3
## -- Attaching packages ------------------------------------------------------------------------------------------------------------------------ tidyverse 1.2.1 --
## v tibble 2.1.1 v purrr 0.3.2
## v tidyr 0.8.3 v dplyr 0.8.0.1
## v readr 1.3.1 v stringr 1.4.0
## v tibble 2.1.1 v forcats 0.3.0
## Warning: package 'tibble' was built under R version 3.5.3
## Warning: package 'tidyr' was built under R version 3.5.3
## Warning: package 'readr' was built under R version 3.5.2
## Warning: package 'purrr' was built under R version 3.5.3
## Warning: package 'dplyr' was built under R version 3.5.3
## Warning: package 'stringr' was built under R version 3.5.3
## Warning: package 'forcats' was built under R version 3.5.2
## -- Conflicts --------------------------------------------------------------------------------------------------------------------------- tidyverse_conflicts() --
## x tidyr::extract() masks magrittr::extract()
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
## x purrr::set_names() masks magrittr::set_names()
library(ggpubr)
theme_set(theme_pubr())
devtools::install_github("kassambara/datarium")
## Skipping install of 'datarium' from a github remote, the SHA1 (5bfbbdd1) has not changed since last install.
## Use `force = TRUE` to force installation
#loading the data
data ("marketing", package = "datarium")
head (marketing, 5)
## youtube facebook newspaper sales
## 1 276.12 45.36 83.04 26.52
## 2 53.40 47.16 54.12 12.48
## 3 20.64 55.08 83.16 11.16
## 4 181.80 49.56 70.20 22.20
## 5 216.96 12.96 70.08 15.48
#Visualize the scatter plot
ggplot(marketing, aes(x = facebook, y = sales)) +
geom_point() +
stat_smooth()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
Look linear to me. Let’s just go through the regression
reg <- lm(sales ~ facebook, data = marketing)
summary(reg)
##
## Call:
## lm(formula = sales ~ facebook, data = marketing)
##
## Residuals:
## Min 1Q Median 3Q Max
## -18.8766 -2.5589 0.9248 3.3330 9.8173
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.17397 0.67548 16.542 <2e-16 ***
## facebook 0.20250 0.02041 9.921 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.13 on 198 degrees of freedom
## Multiple R-squared: 0.332, Adjusted R-squared: 0.3287
## F-statistic: 98.42 on 1 and 198 DF, p-value: < 2.2e-16
confint(reg)
## 2.5 % 97.5 %
## (Intercept) 9.8419062 12.5060253
## facebook 0.1622443 0.2427472
Let’s interpret our residual standard error, we can express this as a percentage.
sigma(reg)*100/mean(marketing$sales)
## [1] 30.48632
about 30% This isn’t really that great considering that our R-squared is only 0.32.87, meaning that roughly 32% of the change in sales was due to facebook. As a predictor of sales this isn’t good, but in reality this would be great as it would mean a significant proportion of your sales is affected by facebook. Oh by the way, this is an advertising budget.
Now let’s take the same dataset and see what a firm’s advertising budget should be spent on according to this data set.
multiple <- lm(sales ~ youtube + facebook + newspaper, data = marketing)
summary (multiple)
##
## Call:
## lm(formula = sales ~ youtube + facebook + newspaper, data = marketing)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.5932 -1.0690 0.2902 1.4272 3.3951
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.526667 0.374290 9.422 <2e-16 ***
## youtube 0.045765 0.001395 32.809 <2e-16 ***
## facebook 0.188530 0.008611 21.893 <2e-16 ***
## newspaper -0.001037 0.005871 -0.177 0.86
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.023 on 196 degrees of freedom
## Multiple R-squared: 0.8972, Adjusted R-squared: 0.8956
## F-statistic: 570.3 on 3 and 196 DF, p-value: < 2.2e-16
According to our summary output, not facebook and youtube seem to have a positive influence on the predicted sales, with facebook giving it the best boost. Newspapers seem to not only not produce any positive impact, but negatively impact sales numbers.
We can’t really talk about multiple linear regression without including one of the key steps: Model selection.
When it comes to predicting behavior and extrapolating future patterns based on previous observations, refining a regression model is one of the more straight forward methods of doing so with low computing power.
library (MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
library (tidyverse)
fullmodel <- lm(Fertility ~ ., data = swiss)
step.model <- stepAIC(fullmodel, direction = "both", trace = FALSE)
head(swiss, 6)
## Fertility Agriculture Examination Education Catholic
## Courtelary 80.2 17.0 15 12 9.96
## Delemont 83.1 45.1 6 9 84.84
## Franches-Mnt 92.5 39.7 5 5 93.40
## Moutier 85.8 36.5 12 7 33.77
## Neuveville 76.9 43.5 17 15 5.16
## Porrentruy 76.1 35.3 9 7 90.57
## Infant.Mortality
## Courtelary 22.2
## Delemont 22.2
## Franches-Mnt 20.2
## Moutier 20.3
## Neuveville 20.6
## Porrentruy 26.6
summary(step.model)
##
## Call:
## lm(formula = Fertility ~ Agriculture + Education + Catholic +
## Infant.Mortality, data = swiss)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.6765 -6.0522 0.7514 3.1664 16.1422
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 62.10131 9.60489 6.466 8.49e-08 ***
## Agriculture -0.15462 0.06819 -2.267 0.02857 *
## Education -0.98026 0.14814 -6.617 5.14e-08 ***
## Catholic 0.12467 0.02889 4.315 9.50e-05 ***
## Infant.Mortality 1.07844 0.38187 2.824 0.00722 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.168 on 42 degrees of freedom
## Multiple R-squared: 0.6993, Adjusted R-squared: 0.6707
## F-statistic: 24.42 on 4 and 42 DF, p-value: 1.717e-10
This stepwise regression looks at the AIC or akaike information criteria, essentially it is an estimator for the relative quality of models. The lower the value, the better. Using this method we can see that all the variables in the data set were included in the model as they all had a low AIC cut off determined.
Let’s use the olsrr package to better visualize the selection process.
install.packages("olsrr", repos="http://cran.rstudio.com/", dependencies=TRUE)
## Installing package into 'C:/Users/Alexander Chang/Documents/R/win-library/3.5'
## (as 'lib' is unspecified)
## package 'olsrr' successfully unpacked and MD5 sums checked
##
## The downloaded binary packages are in
## C:\Users\Alexander Chang\AppData\Local\Temp\RtmpOu8dJ9\downloaded_packages
library(olsrr)
## Warning: package 'olsrr' was built under R version 3.5.3
##
## Attaching package: 'olsrr'
## The following object is masked from 'package:MASS':
##
## cement
## The following object is masked from 'package:datasets':
##
## rivers
visual <- ols_step_both_p(fullmodel)
## Stepwise Selection Method
## ---------------------------
##
## Candidate Terms:
##
## 1. Agriculture
## 2. Examination
## 3. Education
## 4. Catholic
## 5. Infant.Mortality
##
## We are selecting variables based on p value...
##
## Variables Entered/Removed:
##
## - Education added
## - Catholic added
## - Infant.Mortality added
## - Agriculture added
##
## No more variables to be added/removed.
##
##
## Final Model Output
## ------------------
##
## Model Summary
## --------------------------------------------------------------
## R 0.836 RMSE 7.168
## R-Squared 0.699 Coef. Var 10.219
## Adj. R-Squared 0.671 MSE 51.383
## Pred R-Squared 0.620 MAE 5.434
## --------------------------------------------------------------
## RMSE: Root Mean Square Error
## MSE: Mean Square Error
## MAE: Mean Absolute Error
##
## ANOVA
## --------------------------------------------------------------------
## Sum of
## Squares DF Mean Square F Sig.
## --------------------------------------------------------------------
## Regression 5019.885 4 1254.971 24.424 0.0000
## Residual 2158.069 42 51.383
## Total 7177.955 46
## --------------------------------------------------------------------
##
## Parameter Estimates
## ---------------------------------------------------------------------------------------------
## model Beta Std. Error Std. Beta t Sig lower upper
## ---------------------------------------------------------------------------------------------
## (Intercept) 62.101 9.605 6.466 0.000 42.718 81.485
## Education -0.980 0.148 -0.755 -6.617 0.000 -1.279 -0.681
## Catholic 0.125 0.029 0.416 4.315 0.000 0.066 0.183
## Infant.Mortality 1.078 0.382 0.251 2.824 0.007 0.308 1.849
## Agriculture -0.155 0.068 -0.281 -2.267 0.029 -0.292 -0.017
## ---------------------------------------------------------------------------------------------
plot(visual)
As we can see the model is strongest when all four variables are present, meaning that this combination of preditcors in our model produces the most accurate output given out data set. Of course this doesn’t test for one of the common pitfall of multiple regression: Collinearity. This occurs when two variables have a linear relationship with one another, meaning that estimates could have high error. The Variance Inflation Factor (VIF) is a good estimate of that, you typically want it lower than 10 to pass the eye test, and between 1 and 4 to feel confident in a varaible. Fortunately this package also includes that test.
ols_vif_tol(fullmodel)
## # A tibble: 5 x 3
## Variables Tolerance VIF
## <chr> <dbl> <dbl>
## 1 Agriculture 0.438 2.28
## 2 Examination 0.272 3.68
## 3 Education 0.360 2.77
## 4 Catholic 0.516 1.94
## 5 Infant.Mortality 0.903 1.11
We’re in luck! Our dataset suggestions little to no collinearity!