R Stats Project

Continous Predictor & Outcome

The most common usage when both your disease predictor and outcome are continuous are correlations. Typically these are fit linearly, and if not a best fit line can be established to understand what numerical transformation is needed in order to better conduct a linear regression.

We will need several things, namely a workable data set and some packages. Let’s start with a dataset that’s built into R that a lot of my intro stats class uses: Cars.

data <- mtcars 
head(data,10)

##                    mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
## Duster 360        14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
## Merc 240D         24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
## Merc 230          22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
## Merc 280          19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4

Let’s also utilize some data visualization packages to make sure that we spruce up our review of the analysis later on. Two common ones are ggplot2 and ggpubr.

Typically you can request these packages from CRAN, but just to be a little fancy we’ll pull directly from their respective github pages.

The latter code was unnessesary because apparently ggplot2 is a part of the ggpubr git, but better safe than sorry! I also ended up just using CRAN because it never gives me an error message if I load a package.

Performing Diagnostics for correlations tests

When performing a parametric test such as correlation, you’d want to establish normality in the data set. While a perfectly normal set of data is unrealistic, it at least gives you an idea of how robust your correlation may be beyond just the the R or R - squared value. You effectively have something to reference to justify performing a transformation or using other analysis.

Let’s just start with a scatter plot for now. Let’s compare a car’s horsepower with their miles per gallon.

library("ggpubr")

## Warning: package 'ggpubr' was built under R version 3.5.3

## Loading required package: ggplot2

## Loading required package: magrittr

## Warning: package 'magrittr' was built under R version 3.5.2

ggscatter(data, x= "mpg", y = "hp", 
          color = "red", fill = "lightgray",
          add = "reg.line", conf.int = TRUE,
          cor.coef = TRUE, cor.method = "pearson",
          xlab = "Miles per Gallon", ylab = "Horsepower")

Look at that, red beauty. No surprise here, lower horsepower nets a higher mpg rating.

Now let’s check for normality.

shapiro.test(data$mpg)

## 
##  Shapiro-Wilk normality test
## 
## data:  data$mpg
## W = 0.94756, p-value = 0.1229

shapiro.test(data$hp)

## 
##  Shapiro-Wilk normality test
## 
## data:  data$hp
## W = 0.93342, p-value = 0.04881

Looks like horsepower’s p-value is <0.05 meaning it is not normal. Not surprising considering out scatter plot would have a trend line that looks like an inverse log function. But this hardly removes the validity from our analysis given how strong the correlation is.

How do the Q-Q plots look?

ggqqplot(data$mpg, ylab = "mpg")

ggqqplot(data$hp, ylab = "horsepower")

At least the QQ plot for horsepower fits within a normal distribution characterized by the shade of grey.

This means we can move forward with a person correlation test, should these prove to the be grossly non parametic, a Spearman or Kendall rank-based test would be used.

Correlation Tests

#Pearson Correlation test 
pearson <- cor.test(data$hp, data$mpg, 
                     method = "pearson")
kendall <- cor.test(data$hp, data$mpg, 
                     method = "kendall")

## Warning in cor.test.default(data$hp, data$mpg, method = "kendall"): Cannot
## compute exact p-value with ties

spearman <- cor.test(data$hp, data$mpg, 
                     method = "spearman")

## Warning in cor.test.default(data$hp, data$mpg, method = "spearman"): Cannot
## compute exact p-value with ties

pearson

## 
##  Pearson's product-moment correlation
## 
## data:  data$hp and data$mpg
## t = -6.7424, df = 30, p-value = 1.788e-07
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.8852686 -0.5860994
## sample estimates:
##        cor 
## -0.7761684

kendall

## 
##  Kendall's rank correlation tau
## 
## data:  data$hp and data$mpg
## z = -5.871, p-value = 4.332e-09
## alternative hypothesis: true tau is not equal to 0
## sample estimates:
##        tau 
## -0.7428125

spearman

## 
##  Spearman's rank correlation rho
## 
## data:  data$hp and data$mpg
## S = 10337, p-value = 5.086e-12
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##        rho 
## -0.8946646

Linear Regression

Simple Linear Regression

Ahhh, the quintessential linear regression. The workhorse of any basic statistical analysis, and pattern recognition at it’s finest.

Let’s load a data set and some packages.

install.packages("tidyverse", repos="http://cran.rstudio.com/", dependencies=TRUE)

## Installing package into 'C:/Users/Alexander Chang/Documents/R/win-library/3.5'
## (as 'lib' is unspecified)

## package 'tidyverse' successfully unpacked and MD5 sums checked
## 
## The downloaded binary packages are in
##  C:\Users\Alexander Chang\AppData\Local\Temp\RtmpOu8dJ9\downloaded_packages

library(tidyverse)

## Warning: package 'tidyverse' was built under R version 3.5.3

## -- Attaching packages ------------------------------------------------------------------------------------------------------------------------ tidyverse 1.2.1 --

## v tibble  2.1.1       v purrr   0.3.2  
## v tidyr   0.8.3       v dplyr   0.8.0.1
## v readr   1.3.1       v stringr 1.4.0  
## v tibble  2.1.1       v forcats 0.3.0

## Warning: package 'tibble' was built under R version 3.5.3

## Warning: package 'tidyr' was built under R version 3.5.3

## Warning: package 'readr' was built under R version 3.5.2

## Warning: package 'purrr' was built under R version 3.5.3

## Warning: package 'dplyr' was built under R version 3.5.3

## Warning: package 'stringr' was built under R version 3.5.3

## Warning: package 'forcats' was built under R version 3.5.2

## -- Conflicts --------------------------------------------------------------------------------------------------------------------------- tidyverse_conflicts() --
## x tidyr::extract()   masks magrittr::extract()
## x dplyr::filter()    masks stats::filter()
## x dplyr::lag()       masks stats::lag()
## x purrr::set_names() masks magrittr::set_names()

library(ggpubr)
theme_set(theme_pubr())
devtools::install_github("kassambara/datarium")

## Skipping install of 'datarium' from a github remote, the SHA1 (5bfbbdd1) has not changed since last install.
##   Use `force = TRUE` to force installation

#loading the data 
data ("marketing", package = "datarium")
head (marketing, 5)

##   youtube facebook newspaper sales
## 1  276.12    45.36     83.04 26.52
## 2   53.40    47.16     54.12 12.48
## 3   20.64    55.08     83.16 11.16
## 4  181.80    49.56     70.20 22.20
## 5  216.96    12.96     70.08 15.48

#Visualize the scatter plot
ggplot(marketing, aes(x = facebook, y = sales)) + 
  geom_point() + 
  stat_smooth()

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Look linear to me. Let’s just go through the regression

reg <- lm(sales ~ facebook, data = marketing)
summary(reg)

## 
## Call:
## lm(formula = sales ~ facebook, data = marketing)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -18.8766  -2.5589   0.9248   3.3330   9.8173 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 11.17397    0.67548  16.542   <2e-16 ***
## facebook     0.20250    0.02041   9.921   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.13 on 198 degrees of freedom
## Multiple R-squared:  0.332,  Adjusted R-squared:  0.3287 
## F-statistic: 98.42 on 1 and 198 DF,  p-value: < 2.2e-16

confint(reg)

##                 2.5 %     97.5 %
## (Intercept) 9.8419062 12.5060253
## facebook    0.1622443  0.2427472

Let’s interpret our residual standard error, we can express this as a percentage.

sigma(reg)*100/mean(marketing$sales)

## [1] 30.48632

about 30% This isn’t really that great considering that our R-squared is only 0.32.87, meaning that roughly 32% of the change in sales was due to facebook. As a predictor of sales this isn’t good, but in reality this would be great as it would mean a significant proportion of your sales is affected by facebook. Oh by the way, this is an advertising budget.

Multiple Linear Regression

Now let’s take the same dataset and see what a firm’s advertising budget should be spent on according to this data set.

multiple <- lm(sales ~ youtube + facebook + newspaper, data = marketing)
summary (multiple)

## 
## Call:
## lm(formula = sales ~ youtube + facebook + newspaper, data = marketing)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10.5932  -1.0690   0.2902   1.4272   3.3951 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.526667   0.374290   9.422   <2e-16 ***
## youtube      0.045765   0.001395  32.809   <2e-16 ***
## facebook     0.188530   0.008611  21.893   <2e-16 ***
## newspaper   -0.001037   0.005871  -0.177     0.86    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.023 on 196 degrees of freedom
## Multiple R-squared:  0.8972, Adjusted R-squared:  0.8956 
## F-statistic: 570.3 on 3 and 196 DF,  p-value: < 2.2e-16

According to our summary output, not facebook and youtube seem to have a positive influence on the predicted sales, with facebook giving it the best boost. Newspapers seem to not only not produce any positive impact, but negatively impact sales numbers.

Model Selection

We can’t really talk about multiple linear regression without including one of the key steps: Model selection.

When it comes to predicting behavior and extrapolating future patterns based on previous observations, refining a regression model is one of the more straight forward methods of doing so with low computing power.

library (MASS)

## 
## Attaching package: 'MASS'

## The following object is masked from 'package:dplyr':
## 
##     select

library (tidyverse)
fullmodel <- lm(Fertility ~ ., data = swiss)
step.model <- stepAIC(fullmodel, direction = "both", trace = FALSE)
head(swiss, 6)

##              Fertility Agriculture Examination Education Catholic
## Courtelary        80.2        17.0          15        12     9.96
## Delemont          83.1        45.1           6         9    84.84
## Franches-Mnt      92.5        39.7           5         5    93.40
## Moutier           85.8        36.5          12         7    33.77
## Neuveville        76.9        43.5          17        15     5.16
## Porrentruy        76.1        35.3           9         7    90.57
##              Infant.Mortality
## Courtelary               22.2
## Delemont                 22.2
## Franches-Mnt             20.2
## Moutier                  20.3
## Neuveville               20.6
## Porrentruy               26.6

summary(step.model)

## 
## Call:
## lm(formula = Fertility ~ Agriculture + Education + Catholic + 
##     Infant.Mortality, data = swiss)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.6765  -6.0522   0.7514   3.1664  16.1422 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      62.10131    9.60489   6.466 8.49e-08 ***
## Agriculture      -0.15462    0.06819  -2.267  0.02857 *  
## Education        -0.98026    0.14814  -6.617 5.14e-08 ***
## Catholic          0.12467    0.02889   4.315 9.50e-05 ***
## Infant.Mortality  1.07844    0.38187   2.824  0.00722 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.168 on 42 degrees of freedom
## Multiple R-squared:  0.6993, Adjusted R-squared:  0.6707 
## F-statistic: 24.42 on 4 and 42 DF,  p-value: 1.717e-10

This stepwise regression looks at the AIC or akaike information criteria, essentially it is an estimator for the relative quality of models. The lower the value, the better. Using this method we can see that all the variables in the data set were included in the model as they all had a low AIC cut off determined.

Let’s use the olsrr package to better visualize the selection process.

install.packages("olsrr", repos="http://cran.rstudio.com/", dependencies=TRUE)

## Installing package into 'C:/Users/Alexander Chang/Documents/R/win-library/3.5'
## (as 'lib' is unspecified)

## package 'olsrr' successfully unpacked and MD5 sums checked
## 
## The downloaded binary packages are in
##  C:\Users\Alexander Chang\AppData\Local\Temp\RtmpOu8dJ9\downloaded_packages

library(olsrr)

## Warning: package 'olsrr' was built under R version 3.5.3

## 
## Attaching package: 'olsrr'

## The following object is masked from 'package:MASS':
## 
##     cement

## The following object is masked from 'package:datasets':
## 
##     rivers

visual <- ols_step_both_p(fullmodel)

## Stepwise Selection Method   
## ---------------------------
## 
## Candidate Terms: 
## 
## 1. Agriculture 
## 2. Examination 
## 3. Education 
## 4. Catholic 
## 5. Infant.Mortality 
## 
## We are selecting variables based on p value...
## 
## Variables Entered/Removed: 
## 
## - Education added 
## - Catholic added 
## - Infant.Mortality added 
## - Agriculture added 
## 
## No more variables to be added/removed.
## 
## 
## Final Model Output 
## ------------------
## 
##                         Model Summary                          
## --------------------------------------------------------------
## R                       0.836       RMSE                7.168 
## R-Squared               0.699       Coef. Var          10.219 
## Adj. R-Squared          0.671       MSE                51.383 
## Pred R-Squared          0.620       MAE                 5.434 
## --------------------------------------------------------------
##  RMSE: Root Mean Square Error 
##  MSE: Mean Square Error 
##  MAE: Mean Absolute Error 
## 
##                                ANOVA                                 
## --------------------------------------------------------------------
##                 Sum of                                              
##                Squares        DF    Mean Square      F         Sig. 
## --------------------------------------------------------------------
## Regression    5019.885         4       1254.971    24.424    0.0000 
## Residual      2158.069        42         51.383                     
## Total         7177.955        46                                    
## --------------------------------------------------------------------
## 
##                                      Parameter Estimates                                      
## ---------------------------------------------------------------------------------------------
##            model      Beta    Std. Error    Std. Beta      t        Sig      lower     upper 
## ---------------------------------------------------------------------------------------------
##      (Intercept)    62.101         9.605                  6.466    0.000    42.718    81.485 
##        Education    -0.980         0.148       -0.755    -6.617    0.000    -1.279    -0.681 
##         Catholic     0.125         0.029        0.416     4.315    0.000     0.066     0.183 
## Infant.Mortality     1.078         0.382        0.251     2.824    0.007     0.308     1.849 
##      Agriculture    -0.155         0.068       -0.281    -2.267    0.029    -0.292    -0.017 
## ---------------------------------------------------------------------------------------------

plot(visual)

As we can see the model is strongest when all four variables are present, meaning that this combination of preditcors in our model produces the most accurate output given out data set. Of course this doesn’t test for one of the common pitfall of multiple regression: Collinearity. This occurs when two variables have a linear relationship with one another, meaning that estimates could have high error. The Variance Inflation Factor (VIF) is a good estimate of that, you typically want it lower than 10 to pass the eye test, and between 1 and 4 to feel confident in a varaible. Fortunately this package also includes that test.

ols_vif_tol(fullmodel)

## # A tibble: 5 x 3
##   Variables        Tolerance   VIF
##   <chr>                <dbl> <dbl>
## 1 Agriculture          0.438  2.28
## 2 Examination          0.272  3.68
## 3 Education            0.360  2.77
## 4 Catholic             0.516  1.94
## 5 Infant.Mortality     0.903  1.11

We’re in luck! Our dataset suggestions little to no collinearity!