This notebook summarizes DataCamp's Correlation and Regression course in R.
library(openintro)
library(dplyr)
library(ggplot2)
library(gapminder)
library(tidyr)
library(mosaicData)
## Warning: package 'mosaicData' was built under R version 3.4.4
library(readr)
library(broom)
Bivariate relationships
Understanding the relationship between two variables is very important. In a statistical model, the output is called the response or dependent variable. It is usually thought to be explained by the independent or explanatory variable. A scatterplot is a 2D plot in which both variables are numeric (y = response, x = explanatory). You can label the axes using the xlab()/ylab() or scale_x_continuous()/scale_y_continuous() layers.
ggplot(data=possum, aes(y=totalL,x=tailL)) + geom_point() + ggtitle("Total Length of a possum's body as a function of its tail length") + xlab("Tail Length") + ylab("Total Length")
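The scale layers can set the same axis labels; a minimal equivalent sketch of the plot above (an illustration, not from the course):
ggplot(data=possum, aes(y=totalL,x=tailL)) + geom_point() +
  ggtitle("Total Length of a possum's body as a function of its tail length") +
  scale_x_continuous(name="Tail Length") + scale_y_continuous(name="Total Length")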
Boxplots are a special case of scatterplots in which x is discretized. You can discretize a continuous variable using cut(x, breaks = n).
ggplot(data=possum, aes(y=totalL,x=cut(tailL,breaks=5))) + geom_point() + ggtitle("Total Length of a possum's body as a function of its tail length category") + xlab("Tail Length") + ylab("Total Length")
ggplot(data=possum, aes(y=totalL,x=cut(tailL,breaks=5))) + geom_boxplot() + ggtitle("Total Length of a possum's body as a function of its tail length") + xlab("Tail Length") + ylab("Total Length")
# Baby weight vs gestation period in NC in 2004
ggplot(data=ncbirths,aes(x=weeks,y=weight)) + geom_point() + xlab("Gestation period in weeks") + ylab("Birth weight") + ggtitle("Relationship between gestation period and birth weight in NC - 2004")
## Warning: Removed 2 rows containing missing values (geom_point).
# Boxplot of weight vs. weeks with weeks discretized
ggplot(data = ncbirths, aes(x = cut(weeks, breaks = 5), y = weight)) +
geom_boxplot() + xlab("Gestation period in weeks category") + ylab("Birth weight") + ggtitle("Relationship between gestation period week category and birth weight in NC - 2004")
To characterize bivariate relationships, one can look for 4 things:
- Strength: how much scatter; tightly packed or loosely organized
- Direction: positive or negative (the variables tend to move in the same or in opposite directions)
- Form: linear, non-linear, quadratic; basically the shape of the point cloud
- Outliers: how many points don't fit the pattern; these could be erroneous measurements, exceptions, etc.
# Different kinds of patterns in a scatterplot
# Mammals scatterplot
ggplot(data=mammals,aes(x=BodyWt,y=BrainWt)) + geom_point() + xlab("Body Weight") + ylab("Brain weight") + ggtitle("Brain weight of mammals as a function of their body weight")
# Baseball player scatterplot
ggplot(data=mlbBat10,aes(x=OBP,y=SLG)) + geom_point() + xlab("Onbase percentage") + ylab("Slugging percentage") + ggtitle("Slugging percentage of players as a function of their on base percentage")
# Body dimensions scatterplot
ggplot(data=bdims,aes(x=hgt,y=wgt,color=factor(sex))) + geom_point() + xlab("Height") + ylab("Weight") + ggtitle("Weight of a person as a function of their height for females and males")
# Smoking scatterplot
ggplot(data=smoking,aes(x=age,y=amtWeekdays)) + geom_point() + xlab("Age") + ylab("Amount smoked on weekdays") + ggtitle("Amount smoked on weekdays as a function of age")
## Warning: Removed 1270 rows containing missing values (geom_point).
Sometimes relationships are obscured by a poor choice of scale. A log transformation is a common remedy. The coord_trans() function transforms the coordinates of the plot; alternatively, the scale_x_log10() and scale_y_log10() functions perform a base-10 log transformation of each axis.
# Scatterplot with coord_trans()
ggplot(data = mammals, aes(x = BodyWt, y = BrainWt)) +
geom_point() +
coord_trans(x = "log10", y = "log10")
# Scatterplot with scale_x_log10() and scale_y_log10()
ggplot(data = mammals, aes(x = BodyWt, y = BrainWt)) +
geom_point() +
scale_x_log10() + scale_y_log10()
Often points overlap, making patterns hard to see. You can add alpha = 0.x to geom_point() to add transparency. You can also add position = "jitter" to geom_point(), which adds a small amount of random noise and relieves the overplotting that occurs when values are integers.
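For example, a minimal sketch on the baseball data (the alpha value of 0.3 is just an illustrative choice):
ggplot(data=mlbBat10, aes(x=OBP, y=SLG)) +
  geom_point(alpha=0.3, position="jitter")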
Correlation
Correlation helps quantify the relationship between two variables. It lies between -1 and 1: the absolute value tells the strength, and the sign tells the direction. Perfect correlation has absolute value 1. It is important to know that the correlation coefficient only captures the strength of the 'linear' relationship between variables.
# Non linear relation
run10 %>% filter(divPlace<=10) %>%
ggplot(aes(x=age,y=pace,color=gender)) + geom_point()
# Correlation only measures linear relationships; even though the variables above are related, the correlation does not fully capture that (non-linear) relationship
run10_filtered <- run10 %>% filter(divPlace<=10)
cor(run10_filtered$age,run10_filtered$pace)
## [1] 0.6763423
The most commonly computed correlation is the Pearson product-moment correlation. Mathematically, r = S_XY / sqrt(S_XX * S_YY), where S_XX and S_YY are the sums of squared deviations of x and y and S_XY is the sum of their cross products; equivalently, r = cov(x, y) / (sd(x) * sd(y)). The use = "pairwise.complete.obs" argument tells cor() to ignore NAs.
# Compute correlation
ncbirths %>%
summarize(N = n(), r = cor(mage, weight))
## N r
## 1 1000 0.05506589
# Compute correlation for all non-missing pairs
ncbirths %>%
summarize(N = n(), r = cor(weight, weeks, use = "pairwise.complete.obs"))
## N r
## 1 1000 0.6701013
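To connect this back to the definition above, a quick sketch (my own check, not from the course) that recomputes r by hand on the complete cases; it should match the cor() output:
ncbirths %>%
  filter(!is.na(weight), !is.na(weeks)) %>%
  summarize(r_manual = cov(weight, weeks) / (sd(weight) * sd(weeks)),
            r_cor = cor(weight, weeks))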
Anscombe created a synthetic dataset to illustrate the limits of correlation, and it is still used to this day.
# Recreating Anscombe's dataset as in DataCamp: one row per (x, y, set) combination
set1 <- as.data.frame(cbind(y=anscombe$y1,x=anscombe$x1,set=rep(1,nrow(anscombe))))
set2 <- as.data.frame(cbind(y=anscombe$y2,x=anscombe$x2,set=rep(2,nrow(anscombe))))
set3 <- as.data.frame(cbind(y=anscombe$y3,x=anscombe$x3,set=rep(3,nrow(anscombe))))
set4 <- as.data.frame(cbind(y=anscombe$y4,x=anscombe$x4,set=rep(4,nrow(anscombe))))
anscombe_re <- rbind(set1,set2,set3,set4)
ggplot(data=anscombe_re,aes(x=x,y=y)) + geom_point() + facet_wrap(~set)
The points in each facet have the same mean, sd, and correlation. In sets 3 and 4, the presence of a single outlier lowers or raises (respectively) the correlation.
anscombe_re %>%
group_by(set) %>%
summarize(N = n(), mean(x), sd(x), mean(y), sd(y), cor(x,y))
## # A tibble: 4 x 7
## set N `mean(x)` `sd(x)` `mean(y)` `sd(y)` `cor(x, y)`
## <dbl> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1.00 11 9.00 3.32 7.50 2.03 0.816
## 2 2.00 11 9.00 3.32 7.50 2.03 0.816
## 3 3.00 11 9.00 3.32 7.50 2.03 0.816
## 4 4.00 11 9.00 3.32 7.50 2.03 0.817
# Correlation for all baseball players
mlbBat10 %>%
summarize(N = n(), r = cor(OBP, SLG))
## N r
## 1 1199 0.8145628
# Correlation for all players with at least 200 ABs
mlbBat10 %>%
filter(AB>=200) %>%
summarize(N = n(), r = cor(OBP,SLG))
## N r
## 1 329 0.6855364
# Correlation of body dimensions
bdims %>%
group_by(sex) %>%
summarize(N = n(), r = cor(hgt,wgt))
## # A tibble: 2 x 3
## sex N r
## <int> <int> <dbl>
## 1 0 260 0.431
## 2 1 247 0.535
# Correlation among mammals, with and without log
mammals %>%
summarize(N = n(),
r = cor(BodyWt, BrainWt),
r_log = cor(log(BodyWt), log(BrainWt)))
## N r r_log
## 1 62 0.9341638 0.9595748
Autocorrelation or serial correlation measures whether a variable's current measurement is correlated with its past measurements. Spurious correlations occur when two variables tend to move together even though they may be completely unrelated; this is often due to confounding variables, or sometimes just random chance.
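A minimal sketch of serial correlation on a made-up series (an assumed example, not course data): a random walk is strongly autocorrelated, and the lag-1 autocorrelation can be computed by correlating the series with its lagged copy.
set.seed(42)
x <- cumsum(rnorm(100))                               # a random walk
cor(x[-1], x[-length(x)])                             # lag-1 serial correlation
cor(x, dplyr::lag(x), use = "pairwise.complete.obs")  # same idea using dplyr's lag()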
# Getting noise.csv from DataCamp's website, since I couldn't find that dataset in a package
noise <- read.csv("noise.csv")
# Create faceted scatterplot
ggplot(data=noise,aes(x=x,y=y)) + geom_point() + facet_wrap(~z)
# Compute correlations for each dataset
noise_summary <- noise %>%
group_by(z) %>%
summarize(N = n(), spurious_cor = cor(x, y))
print(noise_summary)
## # A tibble: 20 x 3
## z N spurious_cor
## <int> <int> <dbl>
## 1 1 50 0.122
## 2 2 50 0.0998
## 3 3 50 -0.0263
## 4 4 50 0.0650
## 5 5 50 -0.165
## 6 6 50 -0.0949
## 7 7 50 0.240
## 8 8 50 0.0110
## 9 9 50 0.00825
## 10 10 50 -0.0613
## 11 11 50 0.125
## 12 12 50 0.0667
## 13 13 50 0.0392
## 14 14 50 0.188
## 15 15 50 0.309
## 16 16 50 0.218
## 17 17 50 -0.103
## 18 18 50 0.139
## 19 19 50 0.165
## 20 20 50 -0.0926
# Isolate sets with correlations above 0.2 in absolute strength
noise_summary %>%
filter(abs(spurious_cor) > 0.2)
## # A tibble: 3 x 3
## z N spurious_cor
## <int> <int> <dbl>
## 1 7 50 0.240
## 2 15 50 0.309
## 3 16 50 0.218
Visualizing linear models
It is hard to decide which line fits better without a numeric metric and some criterion that defines what "better" means.
ggplot(data=possum,aes(x=tailL,y=totalL))+ geom_point() + geom_abline(intercept = 0,slope=2.5)
ggplot(data=possum,aes(x=tailL,y=totalL))+ geom_point() + geom_abline(intercept = 0,slope=1.7)
ggplot(data=possum,aes(x=tailL,y=totalL))+ geom_point() + geom_abline(intercept=40,slope = 1.3)
The regression line minimizes the sum of squared distances between the points and the line. It is the line of best fit, also called the least squares line. geom_smooth(method = "lm") adds the line of best fit to the plot.
ggplot(data=possum,aes(x=tailL,y=totalL))+ geom_point() + geom_smooth(method="lm")
The blue line is the line of best fit. The grey band is the standard error associated with the line; se = FALSE removes it.
ggplot(data=possum,aes(x=tailL,y=totalL))+ geom_point() + geom_smooth(method="lm",se=FALSE)
# Scatterplot with regression line
ggplot(data = bdims, aes(x = hgt, y = wgt)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE)Usually we assume response_var = f(explanatory) + noise. f is a mathematical function, which explains y in terms of x. noise is random noise. In linear regression, we assume f is a linear expression. So it is modelled in terms of y = mx + c. In regression, the notations used are:
Y = β0 + β1X + ε
The noise ε is assumed to be normally distributed with mean 0 and a fixed standard deviation. Y is the actual observed value of the response, and ŷ (y-hat) is the value predicted by the model. The difference between actual and predicted is called the residual; residuals are realizations of the noise term. While ε is an unknown true quantity, each residual is a known estimate of that quantity. Minimizing the residuals is what determines the line of best fit. Least squares gives an easy, deterministic, and unique solution, and its residuals are guaranteed to sum to zero. The regression slope and the correlation coefficient are closely related.
# Computing all statistics needed to calculate slope of regression line
N <- nrow(bdims)
r <- cor(bdims$hgt, bdims$wgt)
mean_hgt <- mean(bdims$hgt)
sd_hgt <- sd(bdims$hgt)
mean_wgt <- mean(bdims$wgt)
sd_wgt <- sd(bdims$wgt)
bdims_summary <- as.data.frame(cbind(N,r,mean_hgt,sd_hgt,mean_wgt,sd_wgt))
# The point (mean(x), mean(y)) always lies on the line of best fit, so using the y = mx + c form we can solve for the intercept c
# Now calculating slope and intercept and adding to the dataframe
bdims_summary %>% mutate(slope= r*(sd_wgt/sd_hgt),
intercept = mean_wgt - ((r*(sd_wgt/sd_hgt))*mean_hgt))
## N r mean_hgt sd_hgt mean_wgt sd_wgt slope intercept
## 1 507 0.7173011 171.1438 9.407205 69.14753 13.34576 1.017617 -105.0113
Apart from ordinary least squares, other types of regression include ridge, non-parametric, weighted, and generalized regression, among others.
# Making Galton men and women datasets just like datacamp
Galton_men <- Galton %>% filter(sex=="M")
Galton_women <- Galton %>% filter(sex=="F")
# Create a scatterplot of the height of men as a function of their father's height. Add the simple linear regression line and a diagonal line (with slope equal to 1 and intercept equal to 0) to the plot.
ggplot(data = Galton_men, aes(x = father, y = height)) +
geom_point() +
geom_abline(slope = 1, intercept = 0) +
geom_smooth(method = "lm", se = FALSE)# Create a scatterplot of the height of women as a function of their mother's height. Add the simple linear regression line and a diagonal line to the plot.
ggplot(data = Galton_women, aes(x = mother, y = height)) +
geom_point() +
geom_abline(slope = 1, intercept = 0) +
geom_smooth(method = "lm", se = FALSE)Interpreting Regression Coefficients
The intercept tells us the value of Y when X is 0. The slope tells us that for every one-unit increase in X, Y changes by the slope amount.
# What influences the prices of UCLA textbooks? Do advanced courses (with higher course numbers) have higher prices?
# Since the course number is alphanumeric, we extract the numeric part from it
textbooks %>% mutate(course_number = parse_number(course)) %>%
ggplot(aes(x=course_number, y = uclaNew)) + geom_point()
# Appears to have no correlation, or at best a slightly negative (and very weak) one
ggplot(data=textbooks,aes(x = amazNew,y = uclaNew)) + geom_point() + geom_smooth(method= "lm",se=FALSE)
# An almost perfect, strong correlation
lm(uclaNew~amazNew,data=textbooks)
##
## Call:
## lm(formula = uclaNew ~ amazNew, data = textbooks)
##
## Coefficients:
## (Intercept) amazNew
## 0.929 1.199
This tells us that for each additional dollar that Amazon charges for a book, UCLA charges about a dollar and 20 cents (roughly a 20% markup). If a book were free on Amazon, it would cost about 93 cents at the UCLA bookstore (interpreting the intercept).
# Linear model for weight as a function of height
lm(wgt ~ hgt, data = bdims)
##
## Call:
## lm(formula = wgt ~ hgt, data = bdims)
##
## Coefficients:
## (Intercept) hgt
## -105.011 1.018
# Linear model for SLG as a function of OBP
lm(SLG~OBP,data=mlbBat10)
##
## Call:
## lm(formula = SLG ~ OBP, data = mlbBat10)
##
## Coefficients:
## (Intercept) OBP
## 0.009407 1.110323
# Log-linear model for body weight as a function of brain weight
lm(log(BodyWt) ~ log(BrainWt),data=mammals)
##
## Call:
## lm(formula = log(BodyWt) ~ log(BrainWt), data = mammals)
##
## Coefficients:
## (Intercept) log(BrainWt)
## -2.509 1.225
The output of lm() is an object. When you print it, you see the call and the fitted coefficients. coef() returns just the coefficients, and fitted.values() returns the ŷ values. In most cases the length of this vector equals the number of rows in the data frame passed to lm(). However, if X contains missing values, those rows are discarded when fitting the model and the vector of fitted values will be shorter. Actual minus predicted is the residual. To make life easy, the tidyverse has a package called broom: running augment() gives a clean data frame of fitted values, residuals, standard errors, etc.
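For instance, a minimal sketch of the missing-value behaviour (my own illustration, using ncbirths, whose weeks column has a couple of NAs): the vector of fitted values comes back shorter than the data because incomplete rows are dropped before fitting.
mod_nc <- lm(weight ~ weeks, data = ncbirths)
nrow(ncbirths)                 # number of rows in the data
length(fitted.values(mod_nc))  # fewer, since rows with missing weeks were dropped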
mod <- lm(uclaNew~amazNew,data=textbooks)
class(mod)
## [1] "lm"
# Coefficients
coef(mod)
## (Intercept) amazNew
## 0.9289651 1.1990014
# Summary
summary(mod)
##
## Call:
## lm(formula = uclaNew ~ amazNew, data = textbooks)
##
## Residuals:
## Min 1Q Median 3Q Max
## -34.785 -4.574 0.577 4.012 39.002
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.92897 1.93538 0.48 0.633
## amazNew 1.19900 0.02519 47.60 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.47 on 71 degrees of freedom
## Multiple R-squared: 0.9696, Adjusted R-squared: 0.9692
## F-statistic: 2266 on 1 and 71 DF, p-value: < 2.2e-16
# Predicted y values
fitted.values(mod)
## 1 2 3 4 5 6 7
## 34.44105 38.26587 39.29701 14.74146 17.96678 13.12281 24.98093
## 8 9 10 11 12 13 14
## 20.90433 128.32287 16.82772 36.83906 106.54900 23.05054 20.67652
## 15 16 17 18 19 20 21
## 117.68772 57.89352 90.77014 160.12038 146.60764 130.42112 14.92131
## 22 23 24 25 26 27 28
## 23.63805 15.60474 27.24705 38.26587 35.64006 20.29284 46.19127
## 29 30 31 32 33 34 35
## 39.03323 40.46004 37.94214 102.84409 42.83406 118.37115 98.26390
## 36 37 38 39 40 41 42
## 12.31948 13.15878 162.42247 173.28542 211.95321 181.53455 175.26377
## 43 44 45 46 47 48 49
## 209.02765 157.99815 189.98751 165.39599 30.84405 191.90591 28.58993
## 50 51 52 53 54 55 56
## 26.15595 52.10235 48.13365 103.08389 112.59197 81.74166 160.14436
## 57 58 59 60 61 62 63
## 30.07669 30.84405 103.38364 13.01490 79.73933 101.95682 11.24038
## 64 65 66 67 68 69 70
## 70.97463 97.29271 77.77297 45.33998 25.16078 48.09768 32.54663
## 71 72 73
## 29.93281 23.37427 22.77477
# Broom of tidyverse
augment(mod)
## uclaNew amazNew .fitted .se.fit .resid .hat .sigma
## 1 27.67 27.95 34.44105 1.460257 -6.77105468 0.01944321 10.515199
## 2 40.59 31.14 38.26587 1.418184 2.32413081 0.01833896 10.543185
## 3 31.68 32.00 39.29701 1.407411 -7.61701041 0.01806141 10.506820
## 4 16.00 11.52 14.74146 1.720649 1.25853860 0.02699567 10.545810
## 5 18.95 14.21 17.96678 1.673789 0.98322479 0.02554530 10.546241
## 6 14.95 10.17 13.12281 1.744683 1.82719051 0.02775510 10.544587
## 7 24.70 20.06 24.98093 1.577134 -0.28093350 0.02268020 10.546858
## 8 19.50 16.66 20.90433 1.632387 -1.40432868 0.02429719 10.545543
## 9 123.84 106.25 128.32287 1.700434 -4.48286560 0.02636508 10.532925
## 10 17.00 13.26 16.82772 1.690176 0.17227613 0.02604795 10.546892
## 11 31.63 29.95 36.83906 1.433496 -5.20905751 0.01873712 10.528168
## 12 116.00 88.09 106.54900 1.422121 9.45100013 0.01844092 10.485102
## 13 27.67 18.45 23.05054 1.602965 4.61945878 0.02342922 10.532103
## 14 24.70 16.47 20.67652 1.635552 4.02348159 0.02439150 10.535669
## 15 126.67 97.38 117.68772 1.553935 8.98227697 0.02201788 10.490892
## 16 53.90 47.51 57.89352 1.262122 -3.99352239 0.01452488 10.535947
## 17 89.73 74.93 90.77014 1.286151 -1.04014123 0.01508321 10.546168
## 18 171.00 132.77 160.12038 2.216402 10.87961683 0.04479266 10.462654
## 19 152.00 121.50 146.60764 1.986090 5.39236280 0.03596727 10.526465
## 20 124.80 108.00 130.42112 1.731280 -5.62111808 0.02733028 10.524889
## 21 16.00 11.67 14.92131 1.717999 1.07868839 0.02691259 10.546103
## 22 25.95 18.94 23.63805 1.595039 2.31194809 0.02319808 10.543206
## 23 18.00 12.24 15.60474 1.707969 2.39525758 0.02659926 10.542920
## 24 21.73 21.95 27.24705 1.547617 -5.51704618 0.02183919 10.525817
## 25 40.59 31.14 38.26587 1.418184 2.32413081 0.01833896 10.543185
## 26 28.95 28.95 35.64006 1.446719 -6.69005609 0.01908438 10.515966
## 27 19.95 16.15 20.29284 1.640900 -0.34283796 0.02455128 10.546831
## 28 49.45 37.75 46.19127 1.342166 3.25873144 0.01642563 10.539598
## 29 41.09 31.78 39.03323 1.410143 2.05676990 0.01813160 10.543994
## 30 50.95 32.97 40.46004 1.395565 10.48995821 0.01775864 10.470767
## 31 44.50 30.87 37.94214 1.421617 6.55786119 0.01842786 10.517198
## 32 82.45 85.00 102.84409 1.384276 -20.39408550 0.01747249 10.256218
## 33 34.60 34.95 42.83406 1.372416 -8.23406459 0.01717439 10.500089
## 34 88.42 97.95 118.37115 1.562802 -29.95115384 0.02226987 9.906067
## 35 84.00 81.18 98.26390 1.342263 -14.26390008 0.01642800 10.405876
## 36 11.25 9.50 12.31948 1.756734 -1.06947854 0.02813984 10.546115
## 37 15.00 10.20 13.15878 1.744146 1.84122047 0.02773800 10.544551
## 38 180.03 134.69 162.42247 2.256857 17.60753411 0.04644271 10.324375
## 39 174.00 143.75 173.28542 2.451619 0.71458128 0.05480443 10.546547
## 40 188.58 176.00 211.95321 3.181203 -23.37321441 0.09227679 10.131120
## 41 146.75 150.63 181.53455 2.603153 -34.78454847 0.06178867 9.633991
## 42 183.75 145.40 175.26377 2.487702 8.48622894 0.05642951 10.495096
## 43 214.50 173.56 209.02765 3.124576 5.47234904 0.08902084 10.524626
## 44 197.00 131.00 157.99815 2.179394 39.00184934 0.04330931 9.408665
## 45 194.00 157.68 189.98751 2.761088 4.01249154 0.06951368 10.535188
## 46 176.25 137.17 165.39599 2.309560 10.85401060 0.04863716 10.462711
## 47 24.70 24.95 30.84405 1.502672 -6.14405043 0.02058913 10.520777
## 48 188.00 159.28 191.90591 2.797260 -3.90591073 0.07134694 10.535781
## 49 29.70 23.07 28.58993 1.530555 1.11007224 0.02136032 10.546060
## 50 26.24 21.04 26.15595 1.561717 0.08404511 0.02223896 10.546908
## 51 55.13 42.68 52.10235 1.296535 3.02765446 0.01532775 10.540606
## 52 43.56 39.37 48.13365 1.326062 -4.57365085 0.01603384 10.532505
## 53 129.60 85.20 103.08389 1.386624 26.51611422 0.01753183 10.050561
## 54 123.84 93.13 112.59197 1.490523 11.24803299 0.02025756 10.459091
## 55 85.12 67.40 81.74166 1.241909 3.37833944 0.01406335 10.539070
## 56 152.48 132.79 160.14436 2.216822 -7.66436320 0.04480963 10.505180
## 57 29.70 24.31 30.07669 1.512055 -0.37668952 0.02084707 10.546814
## 58 24.70 24.95 30.84405 1.502672 -6.14405043 0.02058913 10.520777
## 59 89.71 85.45 103.38364 1.389580 -13.67363613 0.01760665 10.417222
## 60 10.50 10.08 13.01490 1.746297 -2.51489936 0.02780648 10.542506
## 61 92.88 65.73 79.73933 1.235832 13.14067180 0.01392606 10.427641
## 62 100.88 84.26 101.95682 1.375712 -1.07682445 0.01725697 10.546113
## 63 11.95 8.60 11.24038 1.773045 0.70962274 0.02866482 10.546561
## 64 72.47 58.42 70.97463 1.225979 1.49537216 0.01370489 10.545377
## 65 92.10 80.37 97.29271 1.334076 -5.19270894 0.01622822 10.528333
## 66 78.35 64.09 77.77297 1.231234 0.57703413 0.01382264 10.546684
## 67 42.00 37.04 45.33998 1.349552 -3.33997755 0.01660691 10.539227
## 68 18.75 20.21 25.16078 1.574759 -6.41078371 0.02261195 10.518396
## 69 48.46 39.34 48.09768 1.326351 0.36231919 0.01604082 10.546822
## 70 39.55 26.37 32.54663 1.482266 7.00336756 0.02003375 10.512962
## 71 29.65 24.19 29.93281 1.513827 -0.28280935 0.02089596 10.546857
## 72 23.76 18.72 23.37427 1.598591 0.38572840 0.02330152 10.546809
## 73 27.70 18.22 22.77477 1.606705 4.92522911 0.02353867 10.530074
## .cooksd .std.resid
## 1 4.226829e-03 -0.652942233
## 2 4.686540e-04 0.223993081
## 3 4.954865e-03 -0.734001929
## 4 2.059099e-04 0.121832728
## 5 1.185695e-04 0.095110137
## 6 4.469293e-04 0.176950099
## 7 8.543980e-06 -0.027135635
## 8 2.294765e-04 -0.135757804
## 9 2.548167e-03 -0.433822889
## 10 3.715594e-06 0.016669061
## 11 2.407299e-03 -0.502135911
## 12 7.794440e-03 0.910907676
## 13 2.390079e-03 0.446369024
## 14 1.891346e-03 0.388972632
## 15 8.467715e-03 0.867312889
## 16 1.087459e-03 -0.384138738
## 17 7.669359e-05 -0.100080014
## 18 2.649222e-02 1.062966531
## 19 5.130524e-03 0.524430605
## 20 4.161390e-03 -0.544245474
## 21 1.507729e-04 0.104417885
## 22 5.924816e-04 0.223372468
## 23 7.342945e-04 0.231825498
## 24 3.167444e-03 -0.532667629
## 25 4.686540e-04 0.223993081
## 26 4.047193e-03 -0.645013417
## 27 1.382683e-05 -0.033146791
## 28 8.220236e-04 0.313761594
## 29 3.627276e-04 0.198204669
## 30 9.234223e-03 1.010693480
## 31 3.750028e-03 0.632056531
## 32 3.432050e-02 -1.964656784
## 33 5.495873e-03 -0.793105302
## 34 9.527663e-02 -2.892403464
## 35 1.575174e-02 -1.373377806
## 36 1.553595e-04 -0.103591709
## 37 4.535234e-04 0.178307230
## 38 7.219393e-02 1.721789118
## 39 1.428089e-04 0.070185211
## 40 2.789344e-01 -2.342591363
## 41 3.872219e-01 -3.429184479
## 42 2.080974e-02 0.834223429
## 43 1.464543e-02 0.547487723
## 44 3.281620e-01 3.807625635
## 45 5.893295e-03 0.397204306
## 46 2.886263e-02 1.062605267
## 47 3.694004e-03 -0.592825841
## 48 5.754302e-03 -0.387035126
## 49 1.252980e-04 0.107150609
## 50 7.491239e-07 0.008116166
## 51 6.606739e-04 0.291350168
## 52 1.579366e-03 -0.440278767
## 53 5.822235e-02 2.554497324
## 54 1.217296e-02 1.085114173
## 55 7.527957e-04 0.324887942
## 56 1.315295e-02 -0.748834637
## 57 1.406665e-05 -0.036350726
## 58 3.694004e-03 -0.592825841
## 59 1.555083e-02 -1.317334688
## 60 8.483234e-04 -0.243556130
## 61 1.127520e-02 1.263623353
## 62 9.446134e-05 -0.103724115
## 63 6.975023e-05 0.068753970
## 64 1.436283e-04 0.143780724
## 65 2.061341e-03 -0.499921271
## 66 2.157564e-05 0.055485410
## 67 8.733758e-04 -0.321613882
## 68 4.435124e-03 -0.619202108
## 69 9.915957e-06 0.034878485
## 70 4.664804e-03 0.675547942
## 71 7.948256e-06 -0.027291924
## 72 1.656937e-05 0.037269724
## 73 2.730262e-03 0.475941668
mod <- lm(wgt ~ hgt, data = bdims)
coef(mod)
## (Intercept) hgt
## -105.011254 1.017617
summary(mod)
##
## Call:
## lm(formula = wgt ~ hgt, data = bdims)
##
## Residuals:
## Min 1Q Median 3Q Max
## -18.743 -6.402 -1.231 5.059 41.103
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -105.01125 7.53941 -13.93 <2e-16 ***
## hgt 1.01762 0.04399 23.14 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.308 on 505 degrees of freedom
## Multiple R-squared: 0.5145, Adjusted R-squared: 0.5136
## F-statistic: 535.2 on 1 and 505 DF, p-value: < 2.2e-16
# Mean of weights equal to mean of fitted values?
mean(bdims$wgt) == mean(fitted.values(mod))
## [1] TRUE
# Mean of the residuals
mean(mod$residuals)
## [1] -1.060132e-16
bdims_mod_tidy <- augment(mod)
glimpse(bdims_mod_tidy)
## Observations: 507
## Variables: 9
## $ wgt <dbl> 65.6, 71.8, 80.7, 72.6, 78.8, 74.8, 86.4, 78.4, 62....
## $ hgt <dbl> 174.0, 175.3, 193.5, 186.5, 187.2, 181.5, 184.0, 18...
## $ .fitted <dbl> 72.05406, 73.37697, 91.89759, 84.77427, 85.48661, 7...
## $ .se.fit <dbl> 0.4320546, 0.4520060, 1.0667332, 0.7919264, 0.81834...
## $ .resid <dbl> -6.4540648, -1.5769666, -11.1975919, -12.1742745, -...
## $ .hat <dbl> 0.002154570, 0.002358152, 0.013133942, 0.007238576,...
## $ .sigma <dbl> 9.312824, 9.317005, 9.303732, 9.301360, 9.312471, 9...
## $ .cooksd <dbl> 5.201807e-04, 3.400330e-05, 9.758463e-03, 6.282074e...
## $ .std.resid <dbl> -0.69413418, -0.16961994, -1.21098084, -1.31269063,...
mod <- lm(uclaNew~amazNew,data=textbooks)
# Arrange in descending order of residuals to find specific textbooks that are overpriced or underpriced at UCLA
augment(mod) %>% arrange(desc(.resid)) %>% head()
## uclaNew amazNew .fitted .se.fit .resid .hat .sigma
## 1 197.00 131.00 157.99815 2.179394 39.00185 0.04330931 9.408665
## 2 129.60 85.20 103.08389 1.386624 26.51611 0.01753183 10.050561
## 3 180.03 134.69 162.42247 2.256857 17.60753 0.04644271 10.324375
## 4 92.88 65.73 79.73933 1.235832 13.14067 0.01392606 10.427641
## 5 123.84 93.13 112.59197 1.490523 11.24803 0.02025756 10.459091
## 6 171.00 132.77 160.12038 2.216402 10.87962 0.04479266 10.462654
## .cooksd .std.resid
## 1 0.32816195 3.807626
## 2 0.05822235 2.554497
## 3 0.07219393 1.721789
## 4 0.01127520 1.263623
## 5 0.01217296 1.085114
## 6 0.02649222 1.062967
# This will do in-sample predictions
predict(mod)
## 1 2 3 4 5 6 7
## 34.44105 38.26587 39.29701 14.74146 17.96678 13.12281 24.98093
## 8 9 10 11 12 13 14
## 20.90433 128.32287 16.82772 36.83906 106.54900 23.05054 20.67652
## 15 16 17 18 19 20 21
## 117.68772 57.89352 90.77014 160.12038 146.60764 130.42112 14.92131
## 22 23 24 25 26 27 28
## 23.63805 15.60474 27.24705 38.26587 35.64006 20.29284 46.19127
## 29 30 31 32 33 34 35
## 39.03323 40.46004 37.94214 102.84409 42.83406 118.37115 98.26390
## 36 37 38 39 40 41 42
## 12.31948 13.15878 162.42247 173.28542 211.95321 181.53455 175.26377
## 43 44 45 46 47 48 49
## 209.02765 157.99815 189.98751 165.39599 30.84405 191.90591 28.58993
## 50 51 52 53 54 55 56
## 26.15595 52.10235 48.13365 103.08389 112.59197 81.74166 160.14436
## 57 58 59 60 61 62 63
## 30.07669 30.84405 103.38364 13.01490 79.73933 101.95682 11.24038
## 64 65 66 67 68 69 70
## 70.97463 97.29271 77.77297 45.33998 25.16078 48.09768 32.54663
## 71 72 73
## 29.93281 23.37427 22.77477
# To predict out of sample
new_data <- data.frame(amazNew=8.49)
predict(mod,newdata=new_data)
## 1
## 11.10849
# Visualize predictions
isrs <- augment(mod,newdata=new_data)
ggplot(textbooks,aes(x=amazNew,y=uclaNew)) + geom_point() + geom_smooth(method="lm") +
geom_point(data=isrs,aes(y=.fitted),size=3,color="red")
# Out of sample predictions
ben <- data.frame(wgt=74.8,hgt=182.8)
mod <- lm(wgt ~ hgt, data = bdims)
predict(mod,newdata=ben)
## 1
## 81.00909
# Adding regression line manually
# Add the line to the scatterplot
coefs <- data.frame(Intercept=coef(mod)[1],hgt=coef(mod)[2])
ggplot(data = bdims, aes(x = hgt, y = wgt)) +
geom_point() +
geom_abline(data = coefs,
aes(intercept = Intercept, slope = hgt),
color = "dodgerblue")Model fit
The best line is the one that minimizes the residuals. However, we cannot just add up the raw residuals (they cancel out), so we square the individual residuals so that the sum is not zero; squaring also penalizes large deviations more heavily. SSE captures how much our model missed by. Since the units of SSE make it hard to interpret, the more widely used measure is RMSE (root mean square error), which is the standard deviation of the residuals: sqrt(SSE/(n-2)), where n-2 is the degrees of freedom. It is also called the residual standard error.
mod_possum <- lm(totalL~tailL,data=possum)
mod_possum %>% augment() %>% summarise(SSE = sum(.resid^2), SSE_also = (n()-1)*var(.resid),
                                       RMSE = sqrt(sum(.resid^2)/(n()-2)))
## SSE SSE_also RMSE
## 1 1301.488 1301.488 3.57207
mod <- lm(formula = wgt ~ hgt, data = bdims)
summary(mod)
##
## Call:
## lm(formula = wgt ~ hgt, data = bdims)
##
## Residuals:
## Min 1Q Median 3Q Max
## -18.743 -6.402 -1.231 5.059 41.103
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -105.01125 7.53941 -13.93 <2e-16 ***
## hgt 1.01762 0.04399 23.14 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.308 on 505 degrees of freedom
## Multiple R-squared: 0.5145, Adjusted R-squared: 0.5136
## F-statistic: 535.2 on 1 and 505 DF, p-value: < 2.2e-16
# Compute the mean of the residuals
mean(residuals(mod))
## [1] -1.060132e-16
# Compute RMSE
sqrt(sum(residuals(mod)^2) / df.residual(mod))
## [1] 9.30804
If you are comparing two or more models and trying to judge which is better, RMSE can be confusing, especially if the units of the response differ (e.g. 10 cm vs $3). In that case we compare performance to a null model, which simply predicts the average. The SSE of the null model is called SST (the total sum of squares), and one minus the ratio SSE/SST quantifies the variability explained by our model. This is the coefficient of determination, commonly called R-squared: the proportion of variability in the response that is explained by the model. For simple linear regression, R-squared is simply the square of the correlation coefficient.
# Making a NULL model
mod_null <- lm(totalL~1,data=possum)
mod_null %>% augment() %>% summarize()  # no summary functions were supplied, so this returns an empty data frame
## data frame with 0 columns and 0 rows
# Making lm
mod_possum <- lm(totalL~tailL,data=possum)
mod_possum %>% augment() %>% summarise(SSE=sum(.resid^2))
## SSE
## 1 1301.488
mod <- lm(wgt~hgt,data=bdims)
summary(mod)
##
## Call:
## lm(formula = wgt ~ hgt, data = bdims)
##
## Residuals:
## Min 1Q Median 3Q Max
## -18.743 -6.402 -1.231 5.059 41.103
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -105.01125 7.53941 -13.93 <2e-16 ***
## hgt 1.01762 0.04399 23.14 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.308 on 505 degrees of freedom
## Multiple R-squared: 0.5145, Adjusted R-squared: 0.5136
## F-statistic: 535.2 on 1 and 505 DF, p-value: < 2.2e-16
bdims_tidy <- augment(mod)
# Compute R-squared
bdims_tidy %>%
summarize(var_y = var(wgt), var_e = var(.resid)) %>%
mutate(R_squared = 1-(var_e/var_y))
## var_y var_e R_squared
## 1 178.1094 86.46839 0.5145208
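As noted earlier, for simple linear regression R-squared is just the square of the correlation; a quick check (it should reproduce the R_squared value above):
bdims %>%
  summarize(R_squared_from_cor = cor(hgt, wgt)^2)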
# Making a NULL model
null_mod <- lm(wgt~1,data=bdims)
mod_null <- augment(null_mod)
glimpse(mod_null)
## Observations: 507
## Variables: 8
## $ wgt <dbl> 65.6, 71.8, 80.7, 72.6, 78.8, 74.8, 86.4, 78.4, 62....
## $ .fitted <dbl> 69.14753, 69.14753, 69.14753, 69.14753, 69.14753, 6...
## $ .se.fit <dbl> 0.5927061, 0.5927061, 0.5927061, 0.5927061, 0.59270...
## $ .resid <dbl> -3.5475345, 2.6524655, 11.5524655, 3.4524655, 9.652...
## $ .hat <dbl> 0.001972387, 0.001972387, 0.001972387, 0.001972387,...
## $ .sigma <dbl> 13.35803, 13.35845, 13.34906, 13.35808, 13.35205, 1...
## $ .cooksd <dbl> 1.399179e-04, 7.822033e-05, 1.483780e-03, 1.325192e...
## $ .std.resid <dbl> -0.26607983, 0.19894594, 0.86648293, 0.25894926, 0....
# Making lm
hgt_mod <- lm(wgt~hgt,data=bdims)
mod_hgt <- augment(hgt_mod)
glimpse(mod_hgt)
## Observations: 507
## Variables: 9
## $ wgt <dbl> 65.6, 71.8, 80.7, 72.6, 78.8, 74.8, 86.4, 78.4, 62....
## $ hgt <dbl> 174.0, 175.3, 193.5, 186.5, 187.2, 181.5, 184.0, 18...
## $ .fitted <dbl> 72.05406, 73.37697, 91.89759, 84.77427, 85.48661, 7...
## $ .se.fit <dbl> 0.4320546, 0.4520060, 1.0667332, 0.7919264, 0.81834...
## $ .resid <dbl> -6.4540648, -1.5769666, -11.1975919, -12.1742745, -...
## $ .hat <dbl> 0.002154570, 0.002358152, 0.013133942, 0.007238576,...
## $ .sigma <dbl> 9.312824, 9.317005, 9.303732, 9.301360, 9.312471, 9...
## $ .cooksd <dbl> 5.201807e-04, 3.400330e-05, 9.758463e-03, 6.282074e...
## $ .std.resid <dbl> -0.69413418, -0.16961994, -1.21098084, -1.31269063,...
# Compute SSE for null model (note: var(.resid) is SSE/(n-1), but the ratio to the null-model value is unaffected)
mod_null %>%
summarize(SSE = var(.resid))
## SSE
## 1 178.1094
# Compute SSE for regression model
mod_hgt %>%
summarize(SSE = var(.resid))
## SSE
## 1 86.46839
Leverage depends entirely on the value of the explanatory variable relative to its mean; for simple regression, h(i) = 1/n + (x_i - x̄)^2 / Σ_j (x_j - x̄)^2. In the augmented model output, the .hat variable holds the leverage. Observations with extremely high leverage may or may not have an impact on the slope; if they do, they are called influential observations. Influence comes from the combination of high leverage and a large residual, and it is captured by Cook's distance, available as .cooksd.
regulars <- mlbBat10 %>% filter(AB>400)
mod <- lm(HR~SB,data=regulars)
mod %>% augment() %>% arrange(desc(.hat)) %>% select(HR,SB,.fitted,.resid,.hat) %>% head()
## HR SB .fitted .resid .hat
## 1 1 68 2.382748 -1.382748 0.13081869
## 2 2 52 6.461361 -4.461361 0.07033516
## 3 5 50 6.971187 -1.971187 0.06416940
## 4 19 47 7.735927 11.264073 0.05550188
## 5 5 47 7.735927 -2.735927 0.05550188
## 6 1 42 9.010493 -8.010493 0.04260564
mod <- lm(SLG~OBP,data=mlbBat10)
# Rank points of high leverage
mod %>%
augment() %>%
arrange(desc(.hat)) %>%
head()
## SLG OBP .fitted .se.fit .resid .hat .sigma .cooksd
## 1 1 1 1.11973 0.01858286 -0.1197301 0.01839873 0.1370121 0.007292179
## 2 1 1 1.11973 0.01858286 -0.1197301 0.01839873 0.1370121 0.007292179
## 3 1 1 1.11973 0.01858286 -0.1197301 0.01839873 0.1370121 0.007292179
## 4 1 1 1.11973 0.01858286 -0.1197301 0.01839873 0.1370121 0.007292179
## 5 4 1 1.11973 0.01858286 2.8802699 0.01839873 0.1082501 4.220041822
## 6 1 1 1.11973 0.01858286 -0.1197301 0.01839873 0.1370121 0.007292179
## .std.resid
## 1 -0.8820989
## 2 -0.8820989
## 3 -0.8820989
## 4 -0.8820989
## 5 21.2200793
## 6 -0.8820989
# Rank influential points
mod %>% augment() %>%
arrange(desc(.cooksd)) %>% head()
## SLG OBP .fitted .se.fit .resid .hat .sigma
## 1 4.000 1.000 1.1197301 0.01858286 2.8802699 0.018398730 0.1082501
## 2 0.000 1.000 1.1197301 0.01858286 -1.1197301 0.018398730 0.1331035
## 3 0.000 1.000 1.1197301 0.01858286 -1.1197301 0.018398730 0.1331035
## 4 0.000 1.000 1.1197301 0.01858286 -1.1197301 0.018398730 0.1331035
## 5 0.000 1.000 1.1197301 0.01858286 -1.1197301 0.018398730 0.1331035
## 6 1.667 0.667 0.7499925 0.01126412 0.9170075 0.006760166 0.1344494
## .cooksd .std.resid
## 1 4.2200418 21.220079
## 2 0.6377891 -8.249492
## 3 0.6377891 -8.249492
## 4 0.6377891 -8.249492
## 5 0.6377891 -8.249492
## 6 0.1535068 6.716256
Removing outliers should be justified, and any resulting change to the scope of the study must be considered. There are also other, more robust statistical techniques that can reduce the impact of outliers.
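As an aside (not part of the course), one such alternative is robust regression, which down-weights points with large residuals instead of dropping them; a minimal sketch assuming the MASS package is installed:
# MASS::rlm() fits a robust linear model with the same formula interface as lm()
MASS::rlm(SLG ~ OBP, data = mlbBat10)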
# Create nontrivial_players
nontrivial_players <- mlbBat10 %>% filter(AB>=10,OBP<0.500)
# Fit model to new data
mod_clear <- lm(SLG~OBP,data=nontrivial_players)
# View model summary
summary(mod_clear)
##
## Call:
## lm(formula = SLG ~ OBP, data = nontrivial_players)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.31383 -0.04165 -0.00261 0.03992 0.35819
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.043326 0.009823 -4.411 1.18e-05 ***
## OBP 1.345816 0.033012 40.768 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.07011 on 734 degrees of freedom
## Multiple R-squared: 0.6937, Adjusted R-squared: 0.6932
## F-statistic: 1662 on 1 and 734 DF, p-value: < 2.2e-16
# Visualize new model
ggplot(data=nontrivial_players,aes(x=OBP,y=SLG)) +
geom_point() + geom_smooth(method="lm")# Rank high leverage points
mod %>% augment() %>%
arrange(desc(.hat)) %>% head()
## SLG OBP .fitted .se.fit .resid .hat .sigma .cooksd
## 1 1 1 1.11973 0.01858286 -0.1197301 0.01839873 0.1370121 0.007292179
## 2 1 1 1.11973 0.01858286 -0.1197301 0.01839873 0.1370121 0.007292179
## 3 1 1 1.11973 0.01858286 -0.1197301 0.01839873 0.1370121 0.007292179
## 4 1 1 1.11973 0.01858286 -0.1197301 0.01839873 0.1370121 0.007292179
## 5 4 1 1.11973 0.01858286 2.8802699 0.01839873 0.1082501 4.220041822
## 6 1 1 1.11973 0.01858286 -0.1197301 0.01839873 0.1370121 0.007292179
## .std.resid
## 1 -0.8820989
## 2 -0.8820989
## 3 -0.8820989
## 4 -0.8820989
## 5 21.2200793
## 6 -0.8820989