This is an R Markdown Notebook in which I am going to play around with different datasets, run through regressions, and practise various concepts in the process.
- Starting with this tutorial. The following analysis aims to find out whether there is a relationship, described by a simple linear regression model, between weight and snout vent length.
The dataset in here comes from the book Mathematical Statistics with Applications by Mendenhall, Wackerly and Scheaffer (Fourth Edition – Duxbury 1990).
The data come from a study in central Florida in which 15 alligators were captured and two measurements were made on each of them. The weight (in pounds) was recorded along with the snout vent length (in inches – the distance from the back of the head to the end of the nose).
The two variables – weight and snout vent length – have been analyzed in log form by the author and look like this:
alligator <- data.frame(
  lnLength = c(3.87, 3.61, 4.33, 3.43, 3.81, 3.83, 3.46, 3.76, 3.50, 3.58, 4.19, 3.78,
               3.71, 3.73, 3.78),
  lnWeight = c(4.87, 3.93, 6.46, 3.33, 4.38, 4.70, 3.50, 4.50, 3.58, 3.64, 5.90, 4.43,
               4.38, 4.42, 4.25)
)
Visually, the data looks like this:
require(lattice)
## Loading required package: lattice
xyplot(lnWeight ~ lnLength, data = alligator,
xlab = "Snout vent length (inches) on log scale",
ylab = "Weight (pounds) on log scale",
main = "Alligators in Central Florida",
sub = "Scatter Plot"
)
The graph shows a linear relationship and presents a compelling case for linear regression, at least to begin with.
So let's fit a simple linear regression model to the data first:
fit_alligators <- lm(lnWeight ~ lnLength, data = alligator)
summary(fit_alligators)
##
## Call:
## lm(formula = lnWeight ~ lnLength, data = alligator)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.24348 -0.03186 0.03740 0.07727 0.12669
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -8.4761 0.5007 -16.93 3.08e-10 ***
## lnLength 3.4311 0.1330 25.80 1.49e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1229 on 13 degrees of freedom
## Multiple R-squared: 0.9808, Adjusted R-squared: 0.9794
## F-statistic: 665.8 on 1 and 13 DF, p-value: 1.495e-12
–> lm() - fits the linear model to the data.
–> The result of lm() contains a lot of information about the linear model, e.g. the parameters and the residuals. Now let's break this output down piece by piece:
The estimate for the model intercept is -8.4761 and the estimated slope of the relationship with snout vent length is 3.4311; the standard errors of these estimates are also provided in the Coefficients table.
The tests of significance of the model coefficients (the p-values) are also summarised in that table, and they show strong evidence that the slope coefficient is significantly different from zero – as snout vent length increases, so does weight.
Now let's do some investigation using residual diagnostics to determine whether the various assumptions that underpin linear regression are reasonable for our data, or whether there is evidence that additional variables, or other alterations to the model, are needed to better describe how weight changes with snout vent length.
A plot of the residuals against fitted values is used to determine whether there are any systematic patterns, such as overestimation for most of the large values or increasing spread as the fitted values increase. To create this plot we could use the following code:
fitted() - extracts the fitted values for the original data from the generated model object.
xyplot(resid(fit_alligators) ~ fitted(fit_alligators),
       xlab = "Fitted Values",
       ylab = "Residuals",
       main = "Residual Diagnostic Plot",
       panel = function(x, y, ...) {
         panel.grid(h = -1, v = -1)
         panel.abline(h = 0)
         panel.xyplot(x, y, ...)
       }
)
We create our own custom panel function using the building blocks provided by the lattice package. We start by creating a set of grid lines as the base layer; h = -1 and v = -1 tell lattice to align these with the labels on the axes. We then draw a solid horizontal line to help distinguish between positive and negative residuals. Finally, the points are plotted on the top layer.
The residual diagnostic plot shown above is probably OK, although there are more positive residuals than negative ones, and when we consider a normal probability plot we see that there are some deficiencies in the model.
qqmath( ~ resid(fit_alligators),
xlab = "Theoretical Quantiles",
ylab = "Residuals"
)
If this plot had shown something approaching a straight line it would have supported the model assumption about the distribution of the residuals. This and the other plots therefore suggest that further tweaking is required to improve the model, or a decision needs to be made about whether to report the model as is, with some caveats about its usage.
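As a small supplement (not part of the original tutorial), a formal test of residual normality can back up what the Q-Q plot suggests; this is a minimal sketch using base R's shapiro.test() on the fit_alligators residuals.
# Shapiro-Wilk test of the residuals: a small p-value would flag a departure from normality
shapiro.test(resid(fit_alligators))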
- Another good example I want to pull out is this one.
This article explains regression using the cars dataset available in base R. The dataset has 50 observations (rows) and 2 variables (columns).
A quick look at the data:
head(cars)
## speed dist
## 1 4 2
## 2 4 10
## 3 7 4
## 4 7 22
## 5 8 16
## 6 9 10
It's good practice to analyze and understand the variables before doing any actual modelling. The graphical analysis and correlation study below will help with this.
Scatter Plot – check for linearity
Scatter plots can help visualize any linear relationships between the dependent (response) variable and the independent (predictor) variables. Ideally, if you have multiple predictor variables, a scatter plot is drawn for each of them against the response, along with the line of best fit, as shown below.
attach(cars)
# scatterplot
scatter.smooth(x=speed, y=dist, main="Dist ~ Speed")
The scatter plot along with the smoothing line above suggests a linearly increasing relationship between the ‘dist’ and ‘speed’ variables. This is a good thing, because one of the underlying assumptions of linear regression is that the relationship between the response and predictor variables is linear and additive.
Box Plot – check for outliers
Generally, any data point that lies more than 1.5 x the interquartile range (IQR) beyond the 25th or 75th percentile is considered an outlier, where the IQR is calculated as the distance between the 25th percentile and 75th percentile values for that variable.
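As a small numeric illustration (not in the original article), the 1.5 x IQR rule can be applied directly to the 'dist' variable with quantile(); this is a minimal sketch, assuming the built-in cars data.
# compute the quartiles and the IQR for 'dist'
q <- quantile(cars$dist, c(0.25, 0.75))
iqr <- q[2] - q[1]
# flag values more than 1.5 x IQR beyond the quartiles as outliers
cars$dist[cars$dist < q[1] - 1.5 * iqr | cars$dist > q[2] + 1.5 * iqr]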
attach(cars)
## The following objects are masked from cars (pos = 3):
##
## dist, speed
# divide graph area in 2 columns
par(mfrow=c(1, 2))
# box plot for 'speed'
boxplot(speed, main="Speed", sub=paste("Outlier rows: ", boxplot.stats(speed)$out))
# box plot for 'distance'
boxplot(dist, main="Distance", sub=paste("Outlier rows: ", boxplot.stats(dist)$out))
Density plot – Check for normality of response variable
library(e1071)
## Warning: package 'e1071' was built under R version 3.4.3
# divide graph area in 2 columns
par(mfrow=c(1, 2))
# density plot for 'speed'
plot(density(speed), main="Density Plot: Speed", ylab="Frequency", sub=paste("Skewness:", round(e1071::skewness(speed), 2)))
polygon(density(speed), col="red")
# density plot for 'dist'
plot(density(dist), main="Density Plot: Distance", ylab="Frequency", sub=paste("Skewness:", round(e1071::skewness(dist), 2)))
polygon(density(dist), col="lightblue")
Correlation
Correlation is a statistical measure that suggests the level of linear dependence between two variables that occur in pairs – just like speed and dist here. Correlation can take values between -1 and +1.
A value close to 0 suggests a weak relationship between the variables. A low correlation (-0.2 < x < 0.2) probably suggests that much of the variation in the response variable (Y) is unexplained by the predictor (X), in which case we should probably look for better explanatory variables.
# calculate correlation between speed and distance
cor(cars$speed, cars$dist)
## [1] 0.8068949
# build linear regression model on full data
linearMod <- lm(dist ~ speed, data=cars)
print(linearMod)
##
## Call:
## lm(formula = dist ~ speed, data = cars)
##
## Coefficients:
## (Intercept) speed
## -17.579 3.932
The output of lm() in this case shows the ‘Coefficients’ part, which has two components:
Intercept: -17.579,
speed: 3.932
These are also called the beta coefficients. In other words,
\(dist = Intercept + (\beta \times speed)\)
\(dist = -17.579 + 3.932 \times speed\)
As expected, this implies that distance covered increases as speed increases.
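As a hedged illustration (the speed value of 20 below is purely hypothetical), the fitted equation can be applied by hand and compared with predict():
# extract the beta coefficients from the fitted model
b <- coef(linearMod)
# manual prediction from the equation: -17.579 + 3.932 * 20
b[1] + b[2] * 20
# the same prediction via predict()
predict(linearMod, newdata = data.frame(speed = 20))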
Voila, the linear model is ready, and it looks like we have a formula we can use to predict the dist value when a corresponding speed is known. But wait! Before using a regression model, we have to ensure that it is statistically significant. Let's begin by analyzing the summary statistics for linearMod to ensure it is.
The summary statistics for the model tell us a number of things:
The p value: Checking for statistical significance
The p-values are very important because we can consider a linear model to be statistically significant only when both of these p-values are less than the pre-determined statistical significance level, which is ideally 0.05. This is visually indicated by the significance stars at the end of each coefficient row.
The p value: Testing Hypotheses
When there is a p-value, there is a null and an alternative hypothesis associated with it. In linear regression, the null hypothesis is that the coefficient is equal to zero (i.e. there is no relationship), and the alternative hypothesis is that it is not.
The t-value: How it relates to the p-value
We can interpret the t-value something like this: a larger t-value indicates that it is less likely that the observed coefficient estimate would arise purely by chance if the true coefficient were zero. So, the higher the t-value, the better.
Pr(>|t|) or the p-value is the probability of getting a t-value as high as, or higher than, the observed value (in absolute terms) when the null hypothesis (the β coefficient is equal to zero, i.e. there is no relationship) is true. So if the Pr(>|t|) is low, the coefficient is significant (significantly different from zero). If the Pr(>|t|) is high, the coefficient is not significant.
What does this mean for us? When the p-value is less than the significance level (< 0.05), we can safely reject the null hypothesis that the coefficient β of the predictor is zero.
In our case, linearMod, both p-values are well below the 0.05 threshold, so we can conclude our model is indeed statistically significant. It is important for the model to be statistically significant before we go ahead and use it to predict (or estimate) the dependent variable; otherwise we have little confidence in the predicted values, and any apparent relationship may simply be due to chance.
How to calculate the t Statistic and p-Values?
When the model co-efficients and standard error are known, the formula for calculating t Statistic and p-Value is as follows:
# capture model summary as an object
modelSummary <- summary(linearMod)
# model coefficients
modelCoeffs <- modelSummary$coefficients
# get beta estimate for speed
beta.estimate <- modelCoeffs["speed", "Estimate"]
# get std.error for speed
std.error <- modelCoeffs["speed", "Std. Error"]
# calc t statistic
t_value <- beta.estimate/std.error
# calc p Value
p_value <- 2*pt(-abs(t_value), df=nrow(cars)-ncol(cars))
# f statistic (stored in the model summary, not in the lm object itself)
f_statistic <- summary(linearMod)$fstatistic[1]
# parameters for model p-value calc
f <- summary(linearMod)$fstatistic
model_p <- pf(f[1], f[2], f[3], lower=FALSE)
However, an R user doesn't need to compute any of these by hand, since calling summary() on the lm object returns all of these pieces.
summary(linearMod)
##
## Call:
## lm(formula = dist ~ speed, data = cars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -29.069 -9.525 -2.272 9.215 43.201
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.5791 6.7584 -2.601 0.0123 *
## speed 3.9324 0.4155 9.464 1.49e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared: 0.6511, Adjusted R-squared: 0.6438
## F-statistic: 89.57 on 1 and 48 DF, p-value: 1.49e-12
R-Squared and Adj R-Squared
The actual information in a dataset is the total variation it contains. What R-squared tells us is the proportion of variation in the dependent (response) variable that has been explained by this model.
\(R^2 = 1 - \frac{SSE}{SST}\)
where SSE is the sum of squared errors, given by
\(SSE = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2\), and
\(SST = \sum_{i=1}^{n}(y_i - \bar{y})^2\) is the total sum of squares.
Here, \(\hat{y}_i\) is the fitted value for observation i and \(\bar{y}\) is the mean of Y.
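As a minimal sketch (assuming the linearMod object from above), the R-squared definition can be verified directly from the residuals:
# sum of squared errors and total sum of squares for linearMod
sse <- sum(resid(linearMod)^2)
sst <- sum((cars$dist - mean(cars$dist))^2)
# R-squared from the definition; should match summary(linearMod)$r.squared (~0.6511)
1 - sse / sst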
We don't necessarily discard a model based on a low R-squared value. It's better practice to look at the AIC and the prediction accuracy on a validation sample when deciding on the efficacy of a model.
As you add more X variables to your model, the R-squared of the new, bigger model will always be at least as large as that of the smaller subset. This is because all the variables in the original model are also present in the super-set, so their contribution to explaining the dependent variable is retained, and whatever new variable we add can only add (even if not significantly) to the variation that was already explained. This is where the adjusted R-squared comes in: it penalizes the R-squared value for the number of terms (i.e. predictors) in the model. Therefore, when comparing nested models, it is good practice to look at the adjusted R-squared rather than R-squared.
\(R^2_{adj} = 1 - \frac{MSE}{MST}\)
where MSE is the mean squared error, given by
\(MSE = \frac{SSE}{n - q}\), and
\(MST = \frac{SST}{n - 1}\) is the mean squared total,
where n is the number of observations and q is the number of coefficients in the model.
Therefore, by moving around the numerators and denominators, the relationship between \(R^2\) and \(R^2_{adj}\) becomes:
\(R^2_{adj} = 1 - \frac{(1 - R^2)(n - 1)}{n - q}\)
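As a quick check of this relationship (again only a sketch, using linearMod with n = 50 observations and q = 2 coefficients):
n <- nrow(cars)
q <- 2  # intercept + speed
r2 <- summary(linearMod)$r.squared
# adjusted R-squared from the relationship above; should match summary(linearMod)$adj.r.squared (~0.6438)
1 - (1 - r2) * (n - 1) / (n - q)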
Standard Error and F-Statistic
Both standard errors and F-statistic are measures of goodness of fit.
\(Std.\,Error = \sqrt{MSE} = \sqrt{\frac{SSE}{n - q}}\)
\(F\text{-}statistic = \frac{MSR}{MSE}\)
where n is the number of observations, q is the number of coefficients and MSR is the mean square regression, calculated as
\(MSR = \frac{\sum_{i=1}^{n}(\hat{y}_i - \bar{y})^2}{q - 1} = \frac{SST - SSE}{q - 1}\)
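A minimal sketch (assuming linearMod and the cars data) of the F-statistic computed as MSR/MSE:
n <- nrow(cars)   # number of observations
q <- 2            # number of coefficients (intercept + speed)
sse <- sum(resid(linearMod)^2)
sst <- sum((cars$dist - mean(cars$dist))^2)
msr <- (sst - sse) / (q - 1)  # mean square regression
mse <- sse / (n - q)          # mean squared error
# should match the F-statistic reported by summary(linearMod) (~89.57)
msr / mse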
AIC and BIC
The Akaike’s information criterion - AIC (Akaike, 1974) and the Bayesian information criterion - BIC (Schwarz, 1978) are measures of the goodness of fit of an estimated statistical model and can also be used for model selection. Both criteria depend on the maximized value of the likelihood function L for the estimated model.
The AIC is defined as:
\(AIC = -2 \ln(L) + 2k\)
where, k is the number of model parameters and the BIC is defined as:
\(BIC = -2 \ln(L) + k \ln(n)\)
where, n is the sample size.
For model comparison, the model with the lowest AIC and BIC score is preferred.
AIC(linearMod)
## [1] 419.1569
BIC(linearMod)
## [1] 424.8929
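As a sketch of the formulas above (note that for an lm object R counts the residual variance as an estimated parameter, so k = 3 here: intercept, slope and sigma):
ll <- logLik(linearMod)
k <- attr(ll, "df")   # number of estimated parameters, 3 for this model
n <- nrow(cars)
# AIC and BIC recomputed from the log-likelihood; should match AIC(linearMod) and BIC(linearMod)
-2 * as.numeric(ll) + 2 * k
-2 * as.numeric(ll) + k * log(n)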
The most common metrics to look at while selecting the model are:
| STATISTIC | CRITERION |
|---|---|
| R-Squared | Higher the better (> 0.70) |
| Adj R-Squared | Higher the better |
| F-Statistic | Higher the better |
| Std. Error | Closer to zero the better |
| t-statistic | Should be greater than 1.96 for the p-value to be less than 0.05 |
| AIC | Lower the better |
| BIC | Lower the better |
| Mallows cp | Should be close to the number of predictors in model |
| MAPE (Mean absolute percentage error) | Lower the better |
| MSE (Mean squared error) | Lower the better |
| Min_Max Accuracy | Higher the better |
So far we have learnt the art of building a linear regression model using the whole dataset. In the real world, however, the preferred practice is to split your dataset into an 80:20 sample (training:test), build the model on the 80% sample, and then use the model thus built to predict the dependent variable on the test data.
Doing it this way, we will have the model's predicted values for the 20% (test) data as well as the actuals (from the original dataset). By calculating accuracy measures (like min-max accuracy) and error rates (MAPE or MSE), we can find out the prediction accuracy of the model. Now, let's see how to actually do this.
Step 1: Create the training (development) and test (validation) data samples from original data
# Create Training and Test data -
set.seed(100) # setting seed to reproduce results of random sampling
trainingRowIndex <- sample(1:nrow(cars), 0.8*nrow(cars)) # row indices for training data
trainingData <- cars[trainingRowIndex, ] # model training data
testData <- cars[-trainingRowIndex, ] # test data
Step 2: Develop the model on the training data and use it to predict the distance on test data
# Build the model on training data -
lmMod <- lm(dist ~ speed, data=trainingData) # build the model
distPred <- predict(lmMod, testData) # predict distance
Step 3: Review diagnostic measures.
summary (lmMod) # model summary
##
## Call:
## lm(formula = dist ~ speed, data = trainingData)
##
## Residuals:
## Min 1Q Median 3Q Max
## -23.350 -10.771 -2.137 9.255 42.231
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -22.657 7.999 -2.833 0.00735 **
## speed 4.316 0.487 8.863 8.73e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.84 on 38 degrees of freedom
## Multiple R-squared: 0.674, Adjusted R-squared: 0.6654
## F-statistic: 78.56 on 1 and 38 DF, p-value: 8.734e-11
AIC (lmMod) # Calculate akaike information criterion
## [1] 338.4489
From the model summary, the model p-value and the predictor's p-value are less than the significance level, so we know we have a statistically significant model. Also, the R-squared and adjusted R-squared are comparable to those of the original model built on the full data.
Step 4: Calculate prediction accuracy and error rates
A simple correlation between the actuals and the predicted values can be used as a form of accuracy measure. A higher correlation accuracy implies that the actual and predicted values have similar directional movement, i.e. when the actual values increase the predicted values also increase, and vice versa.
actuals_preds <- data.frame(cbind(actuals=testData$dist, predicteds=distPred))
# make actuals_predicteds dataframe.
correlation_accuracy <- cor(actuals_preds)
head(actuals_preds)
## actuals predicteds
## 1 2 -5.392776
## 4 22 7.555787
## 8 26 20.504349
## 20 26 37.769100
## 26 54 42.085287
## 31 50 50.717663
Now let's calculate the min-max accuracy and MAPE:
\(MinMaxAccuracy = \mathrm{mean}\left(\frac{\min(actuals, predicteds)}{\max(actuals, predicteds)}\right)\)
\(MeanAbsolutePercentageError\ (MAPE) = \mathrm{mean}\left(\frac{\lvert predicteds - actuals \rvert}{actuals}\right)\)
min_max_accuracy <- mean(apply(actuals_preds, 1, min) / apply(actuals_preds, 1, max))
mape <- mean(abs((actuals_preds$predicteds - actuals_preds$actuals))/actuals_preds$actuals)
Suppose the model predicts satisfactorily on the 20% split (test data); is that enough to believe that your model will perform equally well all the time? It is important to test the model's performance as rigorously as possible. One way is to check that the model equation performs well when it is built on a different subset of the training data and used to predict on the remaining data.
How do we do this? Split your data into 'k' mutually exclusive random sample portions. Keeping each portion as test data in turn, we build the model on the remaining (k-1) portions and calculate the mean squared error of the predictions. This is done for each of the 'k' random sample portions, and finally the average of these mean squared errors is computed. We can use this metric to compare different linear models graphically. In doing so, we need to check two things:
library(DAAG)
## Warning: package 'DAAG' was built under R version 3.4.3
# Check if the dashed lines are parallel and small and big symbols are not over dispersed for any one specific color.
#par(new=TRUE); par(bg="aliceblue"); par(mar=c(4,4,4,4)) # set margins
# performs the CV
cvResults<-suppressWarnings(CVlm(data=cars, form.lm=dist ~ speed, m=5, dots=FALSE, seed=29, legend.pos="topleft", printit=FALSE, main="Small symbols are predicted values while bigger ones are actuals."))
# mean squared error
attr(cvResults, 'ms')
## [1] 251.2783
In the resulting plot, are the dashed lines parallel? Are the small and big symbols not over-dispersed for any one particular color?
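For completeness, here is a minimal base-R sketch of the same k-fold idea (random fold assignment, so the number will differ slightly from the CVlm() result above):
set.seed(29)
k <- 5
folds <- sample(rep(1:k, length.out = nrow(cars)))  # assign each row to one of k folds
fold_mse <- sapply(1:k, function(i) {
  fit <- lm(dist ~ speed, data = cars[folds != i, ])   # train on the other k-1 folds
  pred <- predict(fit, newdata = cars[folds == i, ])   # predict the held-out fold
  mean((cars$dist[folds == i] - pred)^2)               # fold mean squared error
})
mean(fold_mse)  # average MSE across the k folds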
Some other useful articles I want to keep in here: