One of the peculiarities of life is that things often change very fast, and by fast I mean within a minute, a second or a day. Fast here is relative and can apply to different scenarios. Now, that's true, but it's not the whole truth.
Over time, you come to understand that whatever we consider fast - a fast change, a fast process, a fast acquisition, a fast build-up, fast anything except fast food - is actually an accumulation of slow processes. They build up and build up (sometimes for years), and at some point they strike or change very rapidly.
A typical example is the East Asian countries (Japan, China, Taiwan, Korea) from the gapminder report. These countries prioritized their growth in the 1970s and, years later, the rewards are being consolidated in that region (economies growing fast). The same holds for individuals. I remember when I started learning Python - in the first few weeks I felt like I wasn't learning anything, then at some point, bang bang bang, it started coming together. It's about trusting the process. There's an aspect of trusting the process that pertains to machine learning too (especially supervised algorithms). This is where linear regression comes in.
While working on the gapminder report, one of the areas I struggled with was deciding which fitted model to choose. Look at the plots below …
… the points are life expectancy (in years) for Nigeria between 1952 and 2007, and the lines are the fitted lines of our models. The lm.fit is a linear model, while the other two are non-linear models (quadratic and cubic). These plots pass the eye test, and it's a no-brainer that model_3 fits better than the other models.
However, just because model_3 fits better doesn't mean it will give better predictions of future values. Having additional knowledge about your data can act as a shortcut here - because I know there is a linear relationship between life expectancy and year, it was easy to deduce that model_3 would not perform very well on new data, and this was confirmed in the same report (no need to go into details; we already covered this in the gapminder part 2 report).
After I finalized the gapminder report, I decided to do a deep dive on linear regression. The linear regression knowledge I had before now was limited (and it still is, by the way), but bang bang bang, it's starting to make much more sense now. Let's dig into the linear regression walkthrough.
Imagine you are a data analyst for an e-commerce company or an online marketplace (you have sellers that sell on your platform). You are most likely going to be dealing with metrics such as product categories, sellers' pageviews, items sold, ratings, valid items and NMV (I will refer to these as features or variables). So one day, your boss asks you a few questions:
These questions are not hard to figure out per se. I remember doing forecasts with just Excel a few years ago, or even doing a forecast with Power BI. Back then, a basic way of doing this calculation was to sum total sales or items sold in the last 3 months, divide by the number of days in the period, use that as the average sales per day, and build a forecast of future figures on top of that - but looking back now, I can say those forecasts didn't help much.
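Just to make that naive approach concrete, here is what it boils down to in R - the numbers are made up purely for illustration:
#the naive "average per day" forecast described above, with made-up figures
items_last_3_months <- c(12000, 13500, 12800)   #hypothetical monthly items sold
avg_per_day <- sum(items_last_3_months) / 90    #roughly 90 days in the period
avg_per_day * 30                                #naive forecast for the next month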
One analysis I definitely didn't do was figuring out the impact of each metric on NMV or items sold. I mean, how would you even do that? I could find my way through it if I needed to, but it would be about as useful as the Excel forecast, meaning it wouldn't help much. This is what makes linear regression outstanding - you can do much more with it than you might imagine. With linear regression expertise, all you have to ask your boss is "when do you need this?"
A basic way to think about linear regression is as building a model to predict or understand the relationship between one metric and another (simple linear regression), or between one metric and several others (multiple linear regression). For example, you might:
Predict or understand the relationship between a seller's NMV and their pageviews (simple linear regression); or
Predict or understand the relationship between a seller's NMV and the seller's pageviews, rating and product categories (multiple linear regression). A quick sketch of both forms is shown below.
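As a rough sketch of what those two forms look like in R - the seller_data data frame and its columns below are made up purely for illustration, they are not part of this walkthrough's dataset:
#simulate a small, purely hypothetical seller dataset
set.seed(42)
seller_data <- data.frame(Pageviews = rpois(100, lambda = 500),
                          Rating    = runif(100, min = 1, max = 5),
                          Items     = rpois(100, lambda = 20))
seller_data$NMV <- 50 * seller_data$Items + 0.5 * seller_data$Pageviews + rnorm(100, sd = 100)

simple_fit   <- lm(NMV ~ Pageviews, data = seller_data)                   #simple linear regression
multiple_fit <- lm(NMV ~ Pageviews + Rating + Items, data = seller_data)  #multiple linear regression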
Building a linear model and interpreting the results is not hard to figure out - that's the easy part. The challenging part is building a good, accurate model that you can use to predict future values. But how do you know if your model is good or accurate? That's our major focus in this walkthrough.
This walkthrough is divided into four sections, and we'll be using the US State Facts and Figures data (the state.x77 dataset in R) as the complementary dataset.
We start the first section by creating a scatterplot matrix and fitting our model, then we talk about the linear model statistical assumptions. The focus here is on verifying whether our linear model passes the "linear regression" test. We can divide this section further into five parts (listed below).
In the second section our focus is on corrective measures: if our model fails some of the linearity assumptions, how can we correct them? Here we focus on:
In the third section, our objective is to find ways of selecting the best model. Here we talk about:
Then in the fourth and final section we talk about cross-validation and relative importance.
Just a reminder of things we won't be doing in this walkthrough: we won't be diving deep into data exploration or feature engineering, nor will we go into training and test datasets. We also won't be talking about logistic regression (that's a topic for another day) or other supervised methods.
Now we start
Let's start by plotting a scatterplot matrix and building our model.
library(car) #scatterplotMatrix() comes from the car package

#convert the dataset to a data frame
states <- as.data.frame(state.x77[, c("Murder", "Population",
                                      "Income", "Illiteracy",
                                      "Frost")])

#explore the relationship among the variables two at a time
par(mfrow = c(2,2))
scatterplotMatrix(states,
                  main = "Scatter Plot Matrix", col = "red")
Just a word on our scatterplot matrix. Here we can see that the variable Murder has a linear relationship with Population and Illiteracy (look at the red fitted lines). The linear relationship between Murder and Illiteracy is much more pronounced than that between Murder and Population, suggesting the relationship between murder and illiteracy is more significant than the one between murder and population. It's also not hard to see that Murder and Frost, and Murder and Income, have a negative relationship (negative correlation). In fact, Frost has a negative relationship with every other variable except Income.
Now let’s fit a model with Murder as our response variable and the other variables as our predictors (this is a case of multiple linear regression).
#Fit a linear model, with murder as the dependent response variable and other features as predictors
fit <- lm(Murder ~ ., data = states)
summary(fit)
##
## Call:
## lm(formula = Murder ~ ., data = states)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.7960 -1.6495 -0.0811 1.4815 7.6210
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.235e+00 3.866e+00 0.319 0.7510
## Population 2.237e-04 9.052e-05 2.471 0.0173 *
## Income 6.442e-05 6.837e-04 0.094 0.9253
## Illiteracy 4.143e+00 8.744e-01 4.738 2.19e-05 ***
## Frost 5.813e-04 1.005e-02 0.058 0.9541
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.535 on 45 degrees of freedom
## Multiple R-squared: 0.567, Adjusted R-squared: 0.5285
## F-statistic: 14.73 on 4 and 45 DF, p-value: 9.133e-08
As a reminder, a p-value below 1% means the relationship is highly significant. Once again, our model is pointing out the significance of the relationship between murder rate and illiteracy rate (p-value well below 1%). In short, states with high murder rates also tend to have high illiteracy rates. There's also a relationship between murder rate and population (p-value above 1% but below 5%), but it is less significant. The p-value for the relationship between Murder and Frost is about 95% (another way of saying no meaningful relationship exists between the two), and the same goes for Murder and Income (a p-value of about 93% - no relationship).
Another way to read this is: a one-unit (one percentage point) increase in illiteracy rate is associated with a 4.143 increase in murder rate, holding the other predictors constant, while a one-unit increase in Frost is associated with a 0.0005813 increase in murder rate (which is negligible).
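If you want the coefficients (and a sense of how uncertain they are) without reading through the whole summary, base R can pull them out directly - notice that the confidence intervals that exclude zero line up with the significant p-values above:
coef(fit)      #the estimates shown in the summary table
confint(fit)   #95% confidence intervals; Illiteracy's interval excludes zero, Income's and Frost's do not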
Like I stated earlier, interpreting our model is not hard to figure out with the summary function. Which brings us to the theme of this section - statistical assumptions.
A word of caution here: things start to get complicated from this point. I'm going to mention some terms (like error terms and residuals) that you might not understand; my advice would be to do a quick Google search just to get a basic understanding of those words. Trust me, they are not that complicated to figure out. I will drop in basic definitions for some terms, but not for most, to avoid a bloated report.
When you fit a linear model to your data, some problems might occur. The idea of testing the statistical assumptions is to ensure your model doesn't have these problems. To evaluate the statistical assumptions of our model, we can use R's base plot function. Testing for statistical assumptions involves testing for:
Normality: our residuals should be normally distributed with a mean of 0, and the points on the graph should lie on the straight 45-degree line - this is evaluated with the Normal Q-Q plot;
Linearity: if our response is linearly related to the predictors, there should be no systematic relationship between our residuals and the fitted values - this is evaluated with the Residuals vs Fitted plot;
Independence: this is not evaluated by any of the base plots. It can also be referred to as correlation of error terms (we will come back to this);
Homoscedasticity: is the variance of our error terms constant, or does it change with the level of the response? If it is not constant, that's heteroscedasticity, and that's a problem - this is evaluated with the Scale-Location plot;
Outliers: an outlier is an observation that is either over-estimated or under-estimated by our model (in short, not predicted well). You can use the Residuals vs Leverage plot to pick out observations with large residuals;
High leverage: when there is a big gap between observation A and the other observations on a predictor, we say observation A is a high-leverage observation. If you are predicting gdpPercap based on Population, then a country like China would be considered a high-leverage observation because of the big disparity between China's population and other countries' populations;
Influential Observations: These are observations that have a big impact on the model.
We can use the base R plot function to check for these assumptions (to have a good view of this, you need to divide your plot canvas into four regions or else you might have to press the enter key to view the plots one after the other).
#Step 1: change screen layout to 2 by 2
par(mfrow = c(2,2))
#Step 2: Plot
plot(fit)
Verdict: from these plots, our model seems to pass the linear model assumption tests.
How can you know this ?
Residuals vs Fitted plot (linearity plot): you want the points to be evenly distributed with no pattern (and if the fitted line is a curve, that's a problem). You don't want a relationship between the residuals and the fitted values. From our plot, we can see that the points are evenly distributed, hence the response is linearly related to the predictors;
Normal Q-Q plot: you want the points on this plot to lie on the straight line. Having many points deviating from that line is a red flag. You can see that most of the points are on the line, hence the normality test is passed;
Scale-Location plot (homoscedasticity plot): you want the points to be evenly distributed around a roughly horizontal line;
Residuals vs Leverage plot (outliers, leverage and influence plot): this one is a different case entirely. From this plot, you want to keep an eye on Alaska and Nevada - what it is telling me is that both states most likely have a significant influence on our model (you should notice this in the Normal Q-Q plot too). One last thing - from this 4th plot (bottom right), Nevada looks like an outlier (high standardized residual) and Alaska looks like a high-leverage observation.
The thing about these base plots is that they are not so hard to figure out, but we can certainly make them much easier to read. And that is what we are going to do next.
The way to go about this is to focus on each assumption with a more advanced plot or function, to get a better idea of what we are looking at. Let's start with normality.
An advanced way of evaluating the normality assumption is to use qqPlot() from the car package. What I like most about qqPlot is that you also get an output of the observations that are out of place, i.e. that fail the normality assumption. In our case, Nevada and Rhode Island both fail the normality test. But that's two observations out of 50 (4%, meaning 96% passed) - far more observations pass than fail, enough for us to conclude that our model passes the normality test.
#we can use qqPlot to test for normality; it gives more detail than the base Normal Q-Q plot covered previously
qqPlot(fit, labels = row.names(states), main = "Q-Q Plot")
## Nevada Rhode Island
## 28 39
Just for curiosity's sake, let's take a quick look at these two observations.
#Nevada
Actual <- states["Nevada", "Murder"]          #actual murder rate
Predicted <- fitted(fit)["Nevada"]            #predicted murder rate
Residuals <- residuals(fit)["Nevada"]         #actual - predicted
Nevada <- cbind(Actual, Predicted, Residuals) #Nevada's actual murder rate is 11.5, but our model predicted 3.9

#Rhode Island
Actual <- states["Rhode Island", "Murder"]    #actual murder rate
Predicted <- fitted(fit)["Rhode Island"]      #predicted murder rate
Residuals <- residuals(fit)["Rhode Island"]   #actual - predicted
Rhode_Island <- cbind(Actual, Predicted, Residuals) #our model over-estimated Rhode Island's murder rate: actual is 2.4 but predicted is 7.2

rbind(Nevada, Rhode_Island)
## Actual Predicted Residuals
## Nevada 11.5 3.878958 7.621042
## Rhode Island 2.4 7.195966 -4.795966
Our model under-estimated Nevada's murder rate (predicted 3.9 whereas the actual figure is 11.5) and over-estimated Rhode Island's murder rate (predicted 7.2 whereas the actual figure is 2.4). Another way to say this is that our model did a poor job of predicting the murder rate in those two states.
To test for Linearity we can use the Component + Residual plots from the car library.
crPlots(fit)
Here we are looking for non-linearity between our response variable and each of the predictors, which is exactly what the component + residual plots show.
Conclusion: the plot confirms that we've met the linearity assumption. While the author (of R in Action) wasn't explicit about how he reached that conclusion, the way I understand these plots is this: Murder has a linear relationship with Population and Illiteracy in the plots above (we already know from the model summary that Murder has a significant relationship with both variables and none with Income and Frost). If there were no linear relationship between Murder and any of the predictors, we would conclude that we had failed the linearity test.
Unlike the other assumptions, which we can assess by inspecting a plot, we don't use a plot for the correlation of error terms. Remember, here we want our error terms to be uncorrelated, i.e. independent. The way to check this is with the durbinWatsonTest.
durbinWatsonTest(fit)
## lag Autocorrelation D-W Statistic p-value
## 1 -0.2006929 2.317691 0.314
## Alternative hypothesis: rho != 0
Again, with the p-value, the conclusion is either that the effect is significant (less than 1%) or not (greater than 5%). In this case the p-value is 0.314 (31.4%), meaning the autocorrelation is not significant; correlation of error terms is not a problem, so the test is passed.
Here, we are testing for non-constant variance of the error terms. We want a situation where the variance of our error terms is constant and doesn't change with the level of the response. Let me digress here and say a few words about error terms.
The accuracy of our predictions depends on reducible error and irreducible error. Reducible error is error that can be improved upon, whereas irreducible error cannot be reduced. Irreducible error reflects characteristics of our response variable that cannot be estimated or predicted by our predictors. There's some part of murder rate that can't be estimated with illiteracy, income or population - in short, our predictions can't be 100% accurate.
Coming back to the assumption: we want the variance of the error terms to be constant rather than changing with the fitted values. In other words, we want to test the hypothesis of constant error variance against the alternative that the error variance changes with the level of the fitted values. We use the ncvTest function for this.
ncvTest(fit)
## Non-constant Variance Score Test
## Variance formula: ~ fitted.values
## Chisquare = 1.746514, Df = 1, p = 0.18632
spreadLevelPlot(fit)
##
## Suggested power transformation: 1.209626
The p-value is about 0.19 - not significant - meaning we've passed the constant variance assumption.
We can also confirm this with the spreadLevelPlot function. If the line in our plot were non-horizontal, we would have failed the constant variance assumption, i.e. we would have a heteroscedasticity issue (in this particular case, the line is horizontal). In situations where we fail this test, what we can do is transform our response variable - we raise it to the power suggested by spreadLevelPlot. The suggested power transformation here is 1.21, but since we passed the constant variance test we don't need any transformation.
To perform a global validation of the linear model assumptions, we use the gvlma function from the gvlma package. This is my go-to function anytime I'm confused or can't reach a conclusion on the linear assumptions.
The Global Stat line tells you whether your model passes the statistical assumptions as a whole. If its p-value is significant, that's a big red flag (your assumptions are not acceptable).
gvmodel <- gvlma(fit)
summary(gvmodel)
##
## Call:
## lm(formula = Murder ~ ., data = states)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.7960 -1.6495 -0.0811 1.4815 7.6210
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.235e+00 3.866e+00 0.319 0.7510
## Population 2.237e-04 9.052e-05 2.471 0.0173 *
## Income 6.442e-05 6.837e-04 0.094 0.9253
## Illiteracy 4.143e+00 8.744e-01 4.738 2.19e-05 ***
## Frost 5.813e-04 1.005e-02 0.058 0.9541
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.535 on 45 degrees of freedom
## Multiple R-squared: 0.567, Adjusted R-squared: 0.5285
## F-statistic: 14.73 on 4 and 45 DF, p-value: 9.133e-08
##
##
## ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
## USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
## Level of Significance = 0.05
##
## Call:
## gvlma(x = fit)
##
## Value p-value Decision
## Global Stat 2.7728 0.5965 Assumptions acceptable.
## Skewness 1.5374 0.2150 Assumptions acceptable.
## Kurtosis 0.6376 0.4246 Assumptions acceptable.
## Link Function 0.1154 0.7341 Assumptions acceptable.
## Heteroscedasticity 0.4824 0.4873 Assumptions acceptable.
It's clear now that our data meet all the statistical assumptions that go with a linear regression model.
Just a basic explanation of multicollinearity. If two variables are highly correlated - say the correlation between predictor A and predictor B is around 0.9 - and we use both as predictors for a response variable C, the linear model will be something like C = b0 + b1*A + b2*B + error.
Because A and B are highly correlated, it can be difficult to figure out which of the two (A or B) has a significant relationship with C. Collinearity therefore reduces the accuracy of our estimates and causes the standard errors to grow. And in a situation where, say, B has a significant relationship with C, having A in the model causes the influence of B to be minimized, less obvious or even insignificant. This is referred to as collinearity, and multicollinearity is when we have multiple predictors that are highly correlated with each other.
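To see that standard-error inflation in action, here is a quick simulation - A, B and C below are made up for this illustration and are not part of the walkthrough's dataset - where only B actually drives C, but A is almost a copy of B:
library(car)                        #for vif()
set.seed(123)
A <- rnorm(100)
B <- A + rnorm(100, sd = 0.1)       #B is almost a copy of A (heavily correlated)
C <- 2 * B + rnorm(100)             #only B truly drives C

summary(lm(C ~ B))$coefficients     #B alone: small standard error, clearly significant
summary(lm(C ~ A + B))$coefficients #with A included: the standard errors blow up
vif(lm(C ~ A + B))                  #both VIFs are very large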
A simple way to detect collinearity and multicollinearity is:
To check the pairwise correlations of the predictor variables - this is not a perfect approach, because we can still miss collinearity this way, but it is one option (a quick sketch of this check follows the list). My advice is to go with the second option;
Another option is to use the VIF (variance inflation factor) - this is the better approach. A common rule of thumb is that a VIF above 5 (some use 10) signals a collinearity problem.
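The first option is just a correlation matrix of the predictors - something like this (no model needed):
#pairwise correlations among the predictors only (Murder, the response, is excluded)
round(cor(states[, c("Population", "Income", "Illiteracy", "Frost")]), 2)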
Let’s test for the collinearity of our model.
vif(fit)
## Population Income Illiteracy Frost
## 1.245282 1.345822 2.165848 2.082547
#or
vif(fit) > 5
## Population Income Illiteracy Frost
## FALSE FALSE FALSE FALSE
We don’t have a collinearity problem.
Outliers are observations that are not predicted well by our model. We saw this with Nevada and Rhode Island in our qqPlot. Observations with standardized residuals greater than 2 or less than -2 are generally considered outliers.
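Before reaching for outlierTest(), you can already do this check by hand with the standardized residuals (outlierTest() itself works with studentized residuals and a Bonferroni correction, so treat this only as a rough first pass):
#observations whose standardized residuals fall outside the +/-2 rule of thumb
std_res <- rstandard(fit)
std_res[abs(std_res) >= 2]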
We can use outlierTest() to confirm the outliers in our dataset.
outlierTest(fit)
## rstudent unadjusted p-value Bonferroni p
## Nevada 3.542929 0.00095088 0.047544
Here Nevada is flagged as an outlier; the Bonferroni p-value is significant (about 0.048, just under the 0.05 threshold).
Note that Rhode Island is not flagged as an outlier because its residual is not as far off as Nevada's (look at our Q-Q plot again: Rhode Island sits on the -2 residual line, unlike Nevada, which sits above the 3 line). Rhode Island's standardized residual is in fact only just below -2, so you can understand why it isn't flagged - the difference is negligible.
One way forward is to delete observations that show up as outliers in order to improve the model, but be careful with this.
Observations with high leverage have an unusual value for one or more predictors. It is important to identify high-leverage observations because they tend to have a greater impact on our regression line. In fact, removing high-leverage observations often has a more substantial impact on the least squares line than removing outliers.
High-leverage observations can be identified through the hat statistic. The average hat value is computed by dividing p (the number of parameters in the model, including the intercept) by n, the sample size - in short, p/n.
Any observation with a hat value 2 or 3 times greater than the average hat value should be examined.
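For our model that works out as follows (p = 5 because the intercept counts as a parameter, n = 50 states):
p <- length(coefficients(fit))               #5 parameters: intercept + 4 predictors
n <- length(fitted(fit))                     #50 observations
p / n                                        #average hat value = 0.1, so the cutoffs are 0.2 and 0.3
hatvalues(fit)[hatvalues(fit) > 2 * p / n]   #states above twice the average hat value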
Let’s examine this by creating a plot and also using a function to segment our plot into high and low leverage zones.
hat.plot <- function(fit){
  p <- length(coefficients(fit))  #number of model parameters (including the intercept)
  n <- length(fitted(fit))        #number of observations
  plot(hatvalues(fit), main = "Index Plot of Hat Values")  #plot the hat values
  #dashed lines at 2x and 3x the average hat value split the plot into low and high leverage zones
  abline(h = c(2, 3) * p/n, col = "red", lty = 2)
  #label each point with its state name
  text(hatvalues(fit), names(hatvalues(fit)), pos = 4)
}
hat.plot(fit)
From the plot, we can see that we have five observations with high leverage. Alaska has the highest leverage, followed by California; Hawaii, New York and Washington are not far behind.
If you recollect, we defined "high leverage" as observations whose predictor values are much higher or lower than those of the other observations in the dataset. We used China as an example: in a dataset of countries, China will definitely be a high-leverage country because of its population. Nigeria is a high-leverage country in Africa for the same reason - our huge population is way above the mean.
It goes both ways: if China is a high-leverage country, then very small countries can also be considered high-leverage, since their populations are way below the mean.
We can confirm this by computing the mean of each variable and comparing it with our high-leverage observations.
Average <- apply(states, 2, mean)
Average <- as.data.frame(Average)
Average <- t(Average)
Alaska <- states["Alaska",] #Low population
California <- states["California",] #High Population
Hawaii <- states["Hawaii",] #No Frost
NYC <- states["New York",] #High population
Washington <- states["Washington",] #very low illiteracy rate
rbind(Average, Alaska, California, Hawaii, NYC, Washington)
## Murder Population Income Illiteracy Frost
## Average 7.378 4246.42 4435.8 1.17 104.46
## Alaska 11.300 365.00 6315.0 1.50 152.00
## California 10.300 21198.00 5114.0 1.10 20.00
## Hawaii 6.200 868.00 4963.0 1.90 0.00
## New York 10.900 18076.00 4903.0 1.40 82.00
## Washington 4.300 3559.00 4864.0 0.60 32.00
From the output above, you can understand why Alaska is considered high leverage (its population is very small and its income is huge compared with the average across all states). California is also high leverage (huge population and above-average income), and the same goes for New York. Hawaii has a very low population and its Frost value is 0, way below the mean.
All these states are atypical compared with other observations in our data set.
Influential observations have a disproportionate impact on the values of the model parameters. Their impact is such that, if you remove them from your model, your model will change significantly. You can use the Cook’s distance or added variable plots to identify influential observations.
Cook’s D values greater than 4/(n-k-1) where k is the number of predictors and n is the sample size indicate influential observations. Let’s create a Cook’s D plot:
cutoff <- 4/(nrow(states) - length(fit$coefficients) -2)
plot(fit, which = 4, cook.levels = cutoff)
abline(h=cutoff, lty = 2, col = "red")
Any state below the red line is not an influential state, and vice versa. The graph identifies Alaska, Hawaii and Nevada as influential observations. Deleting these states would have a notable impact on the values of the intercept and slopes in the regression model. One reservation about Cook's D plots is that, while they help identify influential observations, they don't tell you how those observations affect the model.
This is where added-variable plots can help. The basic idea is to plot the response against each predictor after removing, from both, the effect of all the other predictors; the slope of the resulting line is that predictor's regression coefficient.
avPlots(fit, ask = F, col = "red")
The blue line in each panel is the regression coefficient for that variable. Using the bottom-left plot as an illustration: if we removed Alaska, Washington, Nevada or Rhode Island, the regression line would change noticeably. Those states are the biggest drivers of that regression line.
To end this section, let’s talk about the Influence plot. You can combine the information from outlier, leverage and influence plots into one highly informative plot using the influencePlot() function.
influencePlot(fit, main = "Influence Plot", sub = "Circle size is proportional to Cook's distance")
## StudRes Hat CookD
## Alaska 1.7536917 0.43247319 0.448050997
## California -0.2761492 0.34087628 0.008052956
## Nevada 3.5429286 0.09508977 0.209915743
## Rhode Island -2.0001631 0.04562377 0.035858963
Conclusion on this: states with hat values above 0.2 (twice the average hat value of 0.1) are high-leverage states, and states with studentized residuals greater than 2 or less than -2 are outliers.
A common recommendation is to remove influential observations and refit your model - but do this with care. Because that can be a lot of work later on, my advice is to draw the influence plot as soon as you fit your model and adjust immediately, rather than testing all the linear assumptions first and only then finding out that some observations are messing up your model.
Often your model will fail one or more of the linearity assumption tests. That's not a problem in itself - what's important is knowing what to do next.
There are four broad approaches to dealing with violations of the regression assumptions: deleting observations, transforming variables, adding or deleting variables, and using another regression approach entirely.
I won't go into detail on each of these approaches here; I just want you to be aware of the ways of working around or resolving violations of the linear assumptions.
However, when the model violates the normality assumption, you typically attempt a transformation of the response variable. You can use the powerTransform() function to generate a maximum-likelihood estimate of the power lambda most likely to normalize the variable. Let's apply this to the states data.
summary(powerTransform(states$Murder))
## bcPower Transformation to Normality
## Est Power Rounded Pwr Wald Lwr Bnd Wald Upr Bnd
## states$Murder 0.6055 1 0.0884 1.1227
##
## Likelihood ratio test that transformation parameter is equal to 0
## (log transformation)
## LRT df pval
## LR test, lambda = (0) 5.665991 1 0.017297
##
## Likelihood ratio test that no transformation is needed
## LRT df pval
## LR test, lambda = (1) 2.122763 1 0.14512
This suggests you can normalize the variable Murder by replacing it with Murder^0.6. Because 0.6 is close to 0.5, you could try a square-root transformation to improve the model's fit to normality. However, we don't need to go through this, since we've already confirmed that our model satisfies the normality assumption (and it's also clear from the summary above - the p-value for lambda = 1 is 0.145). So there's no strong evidence that a transformation is needed.
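Purely as a sketch of what that would look like if we had needed it (we don't here), the square-root transformation is just applied to the response when refitting:
#illustration only: refit with a square-root transformed response
sqrt_fit <- lm(sqrt(Murder) ~ Population + Illiteracy + Income + Frost, data = states)
summary(sqrt_fit)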
When the assumption of linearity is violated, a transformation of the predictor variables can help. We can use the boxTidwell() function to generate maximum-likelihood estimates of the predictor powers that would improve linearity. Here is an example of applying the Box-Tidwell transformation to a model that predicts state murder rates from population and illiteracy rates:
boxTidwell(Murder ~ Population + Illiteracy, data = states)
## MLE of lambda Score Statistic (z) Pr(>|z|)
## Population 0.86939 -0.3228 0.7468
## Illiteracy 1.35812 0.6194 0.5357
##
## iterations = 19
The results suggest trying the transformations Population^0.87 and Illiteracy^1.36 to achieve greater linearity. But the score tests for Population (p = .75) and Illiteracy (p = .54) suggest that neither variable actually needs to be transformed (as was already obvious from our previous plot).
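Again purely for illustration (the score tests above say this isn't needed), applying the suggested powers would look like this, using I() to transform the predictors inside the formula:
#illustration only: refit with the Box-Tidwell suggested predictor powers
bt_fit <- lm(Murder ~ I(Population^0.87) + I(Illiteracy^1.36), data = states)
summary(bt_fit)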
Let's take a step back and recall how we built our model. We are predicting the murder rate for each state based on the state's per-capita income, illiteracy rate, population and frost. Murder rate is the response (dependent) variable, and the other variables are the predictors. I have also mentioned on several occasions that murder rate and income have no relationship (likewise frost). So if no relationship exists between murder rate and income, why not exclude income from the model altogether? There's a good chance our model would be better off without predictors that have no relationship with the response.
If we can take out one variable, we can also take out two or more. The question now becomes: how do we know the best combination of predictors to use? What we can do is build different models using different combinations of predictors and then compare them.
We can compare models using the anova function or the Akaike information criterion (AIC). Let's build two models: one using all the predictors (we'll call this fit1) and another with just the significant predictors (we'll call this fit2).
fit1 <- lm(Murder ~ Population + Illiteracy + Income + Frost, data = states)
fit2 <- lm(Murder ~ Population + Illiteracy, data = states)
Choosing the best model between the two linear models using ANOVA
anova(fit2, fit1)
## Analysis of Variance Table
##
## Model 1: Murder ~ Population + Illiteracy
## Model 2: Murder ~ Population + Illiteracy + Income + Frost
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 47 289.25
## 2 45 289.17 2 0.078505 0.0061 0.9939
Here the p-value is non-significant (0.994), meaning Income and Frost add no significant explanatory power over Population and Illiteracy alone, so we are better off dropping them (fit2 wins). I don't love the anova way of comparing models - it can be quite confusing to read.
Another method we can use to compare models is the Akaike information criterion (AIC). I prefer this to the anova function. Models with lower AIC values are preferred.
AIC(fit1, fit2)
## df AIC
## fit1 6 241.6429
## fit2 4 237.6565
Here we can see that the AIC value for fit2 is 237.66, lower than the AIC value for fit1. Hence, the model with Illiteracy and Population as predictors is better than the model with Illiteracy, Population, Income and Frost.
Comparing models with anova or AIC works fine when you only have a handful of candidate models. Once you have many potential predictors - and therefore a large number of possible models - you need a more systematic method of variable selection.
Here, we focus on two approaches for selecting the best predictor variables from a larger pool: stepwise methods and all subsets regression.
Let's start with stepwise selection. I won't go into too much detail on this (the topic is covered more thoroughly in ISLR than in R in Action, so having read about it beforehand helped me figure it out easily).
With stepwise selection, you have three options:
Forward stepwise: you start from scratch, add predictor variables one after the other, and stop when adding further variables no longer improves the model;
Backward stepwise: you start with all the variables included, then remove the least useful variable - one at a time, testing as you go - until removing further variables no longer improves the model;
Stepwise stepwise: (you read that correctly), though this is usually just called stepwise. This approach is called hybrid selection in ISLR. It combines the two previous approaches: we add variables one at a time, but also re-evaluate all the variables at each step to see which ones are no longer relevant so we can remove them. You can add a variable in, say, step 7 and delete that same variable in step 10.
Let’s try the stepwise approach.
library(MASS) #stepAIC() comes from the MASS package
fit <- lm(Murder ~ Population + Illiteracy + Income + Frost,
          data = states)
stepAIC(fit, direction = "backward") #backward stepwise
## Start: AIC=97.75
## Murder ~ Population + Illiteracy + Income + Frost
##
## Df Sum of Sq RSS AIC
## - Frost 1 0.021 289.19 95.753
## - Income 1 0.057 289.22 95.759
## <none> 289.17 97.749
## - Population 1 39.238 328.41 102.111
## - Illiteracy 1 144.264 433.43 115.986
##
## Step: AIC=95.75
## Murder ~ Population + Illiteracy + Income
##
## Df Sum of Sq RSS AIC
## - Income 1 0.057 289.25 93.763
## <none> 289.19 95.753
## - Population 1 43.658 332.85 100.783
## - Illiteracy 1 236.196 525.38 123.605
##
## Step: AIC=93.76
## Murder ~ Population + Illiteracy
##
## Df Sum of Sq RSS AIC
## <none> 289.25 93.763
## - Population 1 48.517 337.76 99.516
## - Illiteracy 1 299.646 588.89 127.311
##
## Call:
## lm(formula = Murder ~ Population + Illiteracy, data = states)
##
## Coefficients:
## (Intercept) Population Illiteracy
## 1.6515497 0.0002242 4.0807366
From the output above, we can see the impact of removing the non-significant variables from our model. The "<none>" row at each step shows the AIC of the model if no further variable is removed.
As can be seen, a model containing Population and Illiteracy (AIC = 93.763) is better than a model containing Population, Illiteracy and Income (AIC = 95.753), which in turn is better than the model with all four predictors including Frost (AIC = 97.749).
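We only ran the backward direction above. For completeness, here is a sketch of how forward selection could look with the same stepAIC function - starting from an intercept-only model and letting it add predictors one at a time (direction = "both" would give the hybrid approach):
#forward stepwise: start with no predictors and add them one at a time
null_fit <- lm(Murder ~ 1, data = states)
stepAIC(null_fit,
        scope = list(lower = ~ 1,
                     upper = ~ Population + Illiteracy + Income + Frost),
        direction = "forward")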
The limitation of this process is that it only finds a good model - not necessarily the best model - because not every possible model is evaluated. To overcome this limitation, let's bring in all subsets regression.
In all subsets regression, every possible model is inspected. This is the same as "best subset selection" in ISLR. Unlike the stepwise approach, we fit a separate least squares regression for each possible combination of our predictors (it leaves no stone unturned). Then, from this collection of models, we pick the best one. The main limitation of all subsets regression is computational: because it loops through every combination of predictors, it can become a problem once you have many predictors - with 10 predictors there are already 2^10 = 1,024 models to consider (you can see what I'm driving at here).
library(leaps)
leaps <- regsubsets(Murder ~ Population + Illiteracy + Income + Frost, data = states)
summary(leaps)
## Subset selection object
## Call: regsubsets.formula(Murder ~ Population + Illiteracy + Income +
## Frost, data = states)
## 4 Variables (and intercept)
## Forced in Forced out
## Population FALSE FALSE
## Illiteracy FALSE FALSE
## Income FALSE FALSE
## Frost FALSE FALSE
## 1 subsets of each size up to 4
## Selection Algorithm: exhaustive
## Population Illiteracy Income Frost
## 1 ( 1 ) " " "*" " " " "
## 2 ( 1 ) "*" "*" " " " "
## 3 ( 1 ) "*" "*" "*" " "
## 4 ( 1 ) "*" "*" "*" "*"
An asterisk (*) indicates that a given variable is included in the corresponding model.
For instance, the best one-variable model has Illiteracy as its predictor, and the best two-variable model contains Illiteracy and Population.
The best three-variable model contains Illiteracy, Population and Income.
R2 is the proportion of variance in the response variable accounted for by the predictor variables.
A word of caution here: because R2 increases with the addition of predictors, and RSS decreases with additional predictors, neither is suitable on its own for selecting the best model. Instead we use adjusted R2, Cp and BIC to choose the best model.
As always, the best way to understand this is by showing an example.
leaps.summary <- summary(leaps)
leaps.summary$rsq
## [1] 0.4941741 0.5668327 0.5669181 0.5669502
As seen above, R2 increases from 0.494 (the one-predictor model) to 0.567 (the all-predictor model). That doesn't mean the all-predictor model is the best; it just means R2 typically increases as you add more predictors.
par(mfrow = c(2, 2)) #split the plot canvas into a 2 x 2 layout
plot(leaps.summary$rss, xlab = "Number of Variables", ylab = "RSS", type = "l", main = "RSS")
points(which.min(leaps.summary$rss), leaps.summary$rss[which.min(leaps.summary$rss)], col = "red", cex=2, pch = 20)
#adjusted r2
plot(leaps.summary$adjr2, xlab = "Number of Variables", ylab = "adjr", type = "l", main = "adjusted R2 Plot")
points(which.max(leaps.summary$adjr2), leaps.summary$adjr2[which.max(leaps.summary$adjr2)], col = "red", cex=2, pch = 20)
#Cp plot
plot(leaps.summary$cp, xlab = "Number of variables", ylab = "Cp", type = "l", main = "Cp Plot")
points(which.min(leaps.summary$cp), leaps.summary$cp[which.min(leaps.summary$cp)], col = "red", cex = 2, pch = 20)
#BIC plot
plot(leaps.summary$bic, xlab = "Number of Variables", ylab = "BIC", type = "l", main = "BIC Plot")
points(which.min(leaps.summary$bic), leaps.summary$bic[which.min(leaps.summary$bic)], col = "red", cex = 2, pch = 20)
All the metrics point to a two-variable model: Cp and BIC are at their lowest, and adjusted R2 is at its highest, with two variables. I have ignored the first plot (the RSS plot) because RSS always decreases with additional predictors, so it's not a suitable criterion for choosing the best model.
Now we know how many predictors fit our model best, but we don't yet know which predictors those are (we just know there are two). We can find out which two by using the built-in plot method for regsubsets objects.
par(mfrow = c(1, 2)) #split the plot canvas into a 1 x 2 layout
plot(leaps, scale = "adjr2", main = "Best model with adjr2")
plot(leaps, scale = "bic", main = "Best model with BIC")
A way to read this: the model with all variables has an adjusted R2 of about 0.53, while the best model has an adjusted R2 of about 0.55, and for that model three regions are shaded (the intercept plus our two variables - Population and Illiteracy). The BIC plot confirms this: the lowest value is -30, and the same three regions are shaded there too.
To round this part up, we can also plot the subsets (the subsets() function comes from the car package).
subsets(leaps, statistic = "cp",
main = "Cp Plot for All Subsets Regression",
legend = c(3.5,6.5))
abline(1, 1, lty = 2, col = "red")
For this we are using the Mallows Cp statistic to choose the best model. Better models will fall close to a line with intercept 1 and slope 1.
The plot suggests we consider either a two-predictor model with Population and Illiteracy or a three-predictor model with Population, Illiteracy and Income. We can reject the other models, which sit further above the line.
In ordinary least squares regression, the model parameters are selected to minimize the sum of squared residuals and maximize the amount of variance accounted for in the response variable (R-squared). Because the equation has been optimized for the given set of data, it won't perform quite as well on a new set of data. We already know our equation won't perform as well on new observations - but how much will we lose? Cross-validation is a useful method for evaluating the generalizability of a regression equation.
"When description is your primary goal, the selection and interpretation of a regression model signals the end of your labor. But when your goal is prediction, you can ask, 'How well will this equation perform in the real world?'" - R in Action.
There are three approaches to cross-validation:
Validation-set approach: you divide your dataset into two parts, build a model with the first part (the training data), and then test how the model performs on unseen data using the other part (the test or validation data). You then use the mean squared error (MSE) to check the quality of fit. This approach is not ideal: it can be highly variable, meaning you can get different results when you repeat the process, and it can also overestimate the test MSE (a minimal sketch of this approach follows the list);
Leave-one-out cross-validation (LOOCV): unlike the validation-set approach, where you divide your dataset into two equal parts, with LOOCV you still split the data into two parts, but only a single observation is used as the validation set. In short, if you have 100 observations, your data is divided into a training part (99 observations) and a validation part (1 observation). You build your model on the 99 observations, then test its fit on the single held-out observation. You repeat this over and over, each time holding out a different observation, and the average of the MSEs is taken as the test MSE. This approach always gives the same result, but it can be time-consuming;
k-fold cross-validation: k-fold is a blend of the other two approaches. Say we have 100 observations; with 10-fold cross-validation, the 100 observations are divided into k = 10 groups (folds). We build our model using 9 of the 10 folds (holding one fold out), then test the model's fit on the held-out fold to get a test MSE. We repeat this for each of the 10 folds - hold one out, build on the rest, test on the one held out - and the average of the test MSEs is taken as our overall test MSE. This has a computational advantage over LOOCV, and I prefer it.
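The LOOCV and k-fold versions are demonstrated next with cv.glm. The validation-set approach isn't shown in the walkthrough, so here is a minimal sketch of it in base R (the 50/50 split and the seed are arbitrary choices of mine; the result changes with the split, because the split itself is random):
#validation-set approach: fit on one half of the states, test on the other half
set.seed(1)
train_idx <- sample(nrow(states), nrow(states) / 2)
train_fit <- lm(Murder ~ Population + Illiteracy, data = states[train_idx, ])
preds <- predict(train_fit, newdata = states[-train_idx, ])
mean((states$Murder[-train_idx] - preds)^2)   #test MSE on the held-out half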
Now let’s go over LOOCV and k-fold by showing an example.
#LOOCV
library(boot) #cv.glm() comes from the boot package
glm.fit <- glm(Murder ~ Population + Illiteracy, data = states)
cv.err <- cv.glm(states, glm.fit)
cv.err$delta
## [1] 6.377163 6.371059
With LOOCV, our test MSE is 6.38. Remember, the lower the test MSE, the better. Another way to interpret it: if we fed our model new population and illiteracy figures, our murder-rate predictions would be off by roughly 2.5 on average (the square root of the MSE).
#k-fold cv test error
glm.fit <- glm(Murder ~ Population + Illiteracy, data = states)
cv.cv <- cv.glm(states, glm.fit, K = 10)
cv.cv$delta
## [1] 6.237320 6.213449
With k-fold cross-validation, our test MSE is 6.24 - lower than with LOOCV, but very close.
In the same way we applied k-fold cross-validation to estimate the test MSE, we can also use it to validate R-squared. We already know our R-squared from the summary table, but that R-squared will definitely not hold for new data. In short, the amount of variance in murder rate our model can account for on a new dataset will most likely be lower, and the best way to find out what the true R-squared might be for new data is k-fold cross-validation.
I will be pasting a function (shrinkage) that cross-validates a model’s R-square statistic (Note: I didn’t write the function).
shrinkage <- function(fit, k = 10){
  require(bootstrap)                    #crossval() comes from the bootstrap package
  theta.fit <- function(x, y){lsfit(x, y)}                      #fitting function
  theta.predict <- function(fit, x){cbind(1, x) %*% fit$coef}   #prediction function
  x <- fit$model[, 2:ncol(fit$model)]   #predictor columns
  y <- fit$model[, 1]                   #response column
  results <- crossval(x, y, theta.fit, theta.predict, ngroup = k)  #k-fold cross-validation
  r2 <- cor(y, fit$fitted.values)^2     #R-square on the original fit
  r2cv <- cor(y, results$cv.fit)^2      #cross-validated R-square
  cat("Original R-square =", r2, "\n")
  cat(k, "Fold Cross-Validated R-square =", r2cv, "\n")
  cat("Change = ", r2 - r2cv, "\n")
}
fit1 <- lm(Murder ~ Population + Illiteracy + Income + Frost, data = states)
fit2 <- lm(Murder ~ Population + Illiteracy, data = states)
shrinkage(fit1)
## Loading required package: bootstrap
##
## Attaching package: 'bootstrap'
## The following object is masked from 'package:modelr':
##
## bootstrap
## Original R-square = 0.5669502
## 10 Fold Cross-Validated R-square = 0.4731644
## Change = 0.09378581
shrinkage(fit2)
## Original R-square = 0.5668327
## 10 Fold Cross-Validated R-square = 0.5336829
## Change = 0.03314983
The shrinkage function compares each model's R-squared with its k-fold cross-validated R-squared. The R-squared of our first model is 0.567, but k-fold cross-validation suggests the R-squared on new data would be closer to 0.473 (about 0.094 less than the model's R-squared). For our second model (fit2), the drop is only about 0.033 - another confirmation that the model with just Population and Illiteracy as predictors generalizes better than the model with all the variables as predictors.
A lot of the time you will come across functions that you didn't write or that you copied from the internet. My advice would be to try to understand what each line in the function is doing. That way, even if you can't write the function off-hand, you can manipulate it and even learn new ways of working with functions. Trust the process.
Finally, you might want to find out which predictors have the biggest influence on your model's outcome. You can do this by comparing the standardized regression coefficients of your model: scale your data, fit the model on the scaled data, and then compare the predictors' coefficients.
scale_states <- scale(states)
scale_states <- as.data.frame(scale_states)
scales_fit <- lm(Murder ~ Population + Income + Illiteracy + Frost, data = scale_states)
coef(scales_fit)
## (Intercept) Population Income Illiteracy Frost
## -2.054026e-16 2.705095e-01 1.072372e-02 6.840496e-01 8.185407e-03
Look out for the biggest value - that's the most important predictor in your model. Here Illiteracy has the biggest value (0.68), meaning a one-standard-deviation increase in illiteracy leads to a 0.68 standard deviation increase in murder rate, holding the other predictors constant. By comparison, a one-standard-deviation increase in Population corresponds to only a 0.27 standard deviation increase in murder rate.
Another approach is to use a function (also copied from the R in Action book) that ranks our predictors based on their contribution to R-squared.
relweights <- function(fit, ...){
  R <- cor(fit$model)                    #correlation matrix of response and predictors
  nvar <- ncol(R)
  rxx <- R[2:nvar, 2:nvar]               #predictor-predictor correlations
  rxy <- R[2:nvar, 1]                    #predictor-response correlations
  svd <- eigen(rxx)                      #eigen-decomposition of the predictor correlations
  evec <- svd$vectors
  ev <- svd$values
  delta <- diag(sqrt(ev))
  lambda <- evec %*% delta %*% t(evec)   #square root of rxx
  lambdasq <- lambda ^ 2
  beta <- solve(lambda) %*% rxy          #regress the response on the orthogonalized predictors
  rsquare <- colSums(beta ^ 2)           #total R-square of the model
  rawwgt <- lambdasq %*% beta ^ 2        #raw relative weights
  import <- (rawwgt / rsquare) * 100     #express each weight as a % of R-square
  import <- as.data.frame(import)
  row.names(import) <- names(fit$model[2:nvar])
  names(import) <- "Weights"
  import <- import[order(import), 1, drop = F]
  dotchart(import$Weights, labels = row.names(import),
           xlab = "% of R-Square", pch = 19,
           main = "Relative importance of Predictor Variables",
           sub = paste("Total R-Square=", round(rsquare, digits = 3)),
           ...)
  return(import)
}
relweights(fit1)
## Weights
## Income 5.488962
## Population 14.723401
## Frost 20.787442
## Illiteracy 59.000195
You can see from the output above that Illiteracy accounts for almost 60% of the R-squared, and Frost accounts for about 20% (I was actually surprised that Frost had more weight than Population), while Population and Income account for the remainder.
We have covered a lot. Let’s go over what we discussed here again:
We started by testing the statistical assumptions with the base R plots, then went through the same process using more advanced alternatives like Q-Q plots and influence plots. Then we moved on to how to rectify a model that fails the statistical assumptions, whether by deleting variables, transforming them or deleting observations. We talked about choosing the best model with anova, AIC, stepwise selection and all subsets regression, covered cross-validation, and ended with the relative importance of variables.
To cap it all, I will be working on two or three examples, just to drive it home.
References: R in Action by Robert Kabacoff (https://www.amazon.com/R-Action-Robert-Kabacoff/dp/1935182390). Most of the concepts covered in this walkthrough are from R in Action. Some others, like cross-validation and regsubsets, are concepts I learnt in the past. I have tried to explain everything based on my own understanding and my previous experience with linear regression.