Introduction :

A model is a simplification or an approximation of reality; that is, it is a mathematical abstraction that captures the essence of reality without describing it perfectly. In general, a model is anything that provides an abstraction of the data: it could be a probability distribution or even a data visualization. According to Wikipedia, a statistical model is a mathematical model that embodies a set of statistical assumptions concerning the generation of sample data. A statistical model represents, often in considerably idealized form, the data-generating process. In other words, a statistical model is a mathematical model based on probabilities that attempts to describe the process that generates the data. In general, a statistical model describes the relationship between two or more variables. A statistical model can be used for description (describing the basic characteristics of the data in question), for inference, for comparison (such as hypothesis testing), and for prediction (making predictions about new, unknown values).

There are two types of statistical models: parametric models and non-parametric models. A parametric model is one in which all of the information is represented within its parameters; in short, the only information needed to predict future or unknown values from the current data is the parameters. For example, the Gaussian distribution is completely described by its mean and standard deviation, that is, \(\mu\) and \(\sigma\), and the Poisson distribution is described by the single parameter \(\lambda\). Non-parametric methods or models refer to statistics that do not require the population being analyzed to satisfy certain assumptions or be described by a fixed set of parameters. Since non-parametric models are somewhat more complex than parametric models, we will focus on parametric models here.
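
To make the idea of a parametric model concrete, here is a small illustrative sketch (an addition, not part of the original analysis) that simulates data and estimates the few parameters that fully describe each distribution:

## Illustrative only: a Gaussian is fully described by its mean and standard
## deviation, so estimating those two parameters "fits" the parametric model.
set.seed(123)
x <- rnorm(1000, mean = 5, sd = 2)       # simulated data from a Gaussian model
c(mu_hat = mean(x), sigma_hat = sd(x))   # the two parameters that describe it

## Similarly, a Poisson distribution is described by the single parameter lambda,
## which is estimated by the sample mean.
y <- rpois(1000, lambda = 3)
mean(y)                                  # estimate of lambda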

Linear Regression :

Simple linear regression is used to predict a quantitative outcome Y on the basis of a single predictor variable X. The goal is to build a mathematical model, or formula, that defines Y as a function of X. The model defines a linear relationship between an explanatory variable (also called the independent variable, the X variable, or the predictor variable; these terms mean the same thing) and the outcome variable (also called the response variable, the dependent variable, or the Y variable). Once we have built a statistically significant model, it is possible to use it to predict future outcomes on the basis of new X values.

Dataset :

We will use one of the openly available datasets called mpg, which contains fuel economy data from 1999 to 2008 for 38 popular models of cars. This dataset contains 234 rows and 11 variables. I have made the data available at the given link ‘DATASET’.

Required Packages :

#install.packages(c("tidyverse", "ggpubr", "broom", "ggfortify"))
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6     ✔ purrr   0.3.4
## ✔ tibble  3.1.8     ✔ dplyr   1.0.9
## ✔ tidyr   1.2.0     ✔ stringr 1.4.0
## ✔ readr   2.1.2     ✔ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(ggpubr)
library(broom)
library(ggfortify)

Importing the Data set :

## Import the mpg.csv dataset
#data <- read.csv ("mpg.csv", header = T, sep = ",")
require(ggplot2)
str(mpg)
## tibble [234 × 11] (S3: tbl_df/tbl/data.frame)
##  $ manufacturer: chr [1:234] "audi" "audi" "audi" "audi" ...
##  $ model       : chr [1:234] "a4" "a4" "a4" "a4" ...
##  $ displ       : num [1:234] 1.8 1.8 2 2 2.8 2.8 3.1 1.8 1.8 2 ...
##  $ year        : int [1:234] 1999 1999 2008 2008 1999 1999 2008 1999 1999 2008 ...
##  $ cyl         : int [1:234] 4 4 4 4 6 6 6 4 4 4 ...
##  $ trans       : chr [1:234] "auto(l5)" "manual(m5)" "manual(m6)" "auto(av)" ...
##  $ drv         : chr [1:234] "f" "f" "f" "f" ...
##  $ cty         : int [1:234] 18 21 20 21 16 18 18 18 16 20 ...
##  $ hwy         : int [1:234] 29 29 31 30 26 26 27 26 25 28 ...
##  $ fl          : chr [1:234] "p" "p" "p" "p" ...
##  $ class       : chr [1:234] "compact" "compact" "compact" "compact" ...

Exploring ‘mpg’ Dataset using different plots :

This notebook explores the mpg dataset available in the ggplot2 package. The goals of this notebook are:

  1. General Understanding of mpg Dataset

  2. Learning about its features.

Manufacturers :

table(mpg$manufacturer)
## 
##       audi  chevrolet      dodge       ford      honda    hyundai       jeep 
##         18         19         37         25          9         14          8 
## land rover    lincoln    mercury     nissan    pontiac     subaru     toyota 
##          4          3          4         13          5         14         34 
## volkswagen 
##         27
qplot(manufacturer,data = mpg , geom = "bar" , fill = manufacturer)

Displacement vs Highway Efficiency:

qplot(displ,hwy,data = mpg,geom = "point",color = class)

# looking at this data separately for each class
ggplot(data = mpg)+geom_point(mapping = aes(x=displ,y=hwy,color = class ))+
  facet_wrap(~class,nrow = 2)

# Creating Facets Vertically on drive
ggplot(data = mpg)+geom_point(mapping = aes(x=displ,y=hwy,color = drv))+
  facet_grid(drv~.)

Estimating a smooth curve for the relationship between displacement and highway mileage :

## Estimating a smooth curve for the relationship between displacement and highway mileage :
ggplot(data = mpg) + 
  geom_smooth(mapping = aes(x = displ, y = hwy))
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

##Separate curve for each type of drive:
ggplot(data = mpg) + 
  geom_smooth(mapping = aes(x = displ, y = hwy, linetype = drv, color=drv))
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Overlaying a smooth curve on top of a scatter plot:

 ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +geom_point(mapping=aes(color=class)) +geom_smooth()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Filtering data for a specific geom:

ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + 
  geom_point(mapping = aes(color = class)) + 
  geom_smooth(data = dplyr::filter(mpg, class == "subcompact"), se = FALSE)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

The se=FALSE setting removes the confidence interval around the estimated curve.

Grouping the data by drive type and then drawing a scatter plot with an estimated curve for each group:

ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = drv)) + 
  geom_point() + 
  geom_smooth(se = FALSE)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

str(mpg)
## tibble [234 × 11] (S3: tbl_df/tbl/data.frame)
##  $ manufacturer: chr [1:234] "audi" "audi" "audi" "audi" ...
##  $ model       : chr [1:234] "a4" "a4" "a4" "a4" ...
##  $ displ       : num [1:234] 1.8 1.8 2 2 2.8 2.8 3.1 1.8 1.8 2 ...
##  $ year        : int [1:234] 1999 1999 2008 2008 1999 1999 2008 1999 1999 2008 ...
##  $ cyl         : int [1:234] 4 4 4 4 6 6 6 4 4 4 ...
##  $ trans       : chr [1:234] "auto(l5)" "manual(m5)" "manual(m6)" "auto(av)" ...
##  $ drv         : chr [1:234] "f" "f" "f" "f" ...
##  $ cty         : int [1:234] 18 21 20 21 16 18 18 18 16 20 ...
##  $ hwy         : int [1:234] 29 29 31 30 26 26 27 26 25 28 ...
##  $ fl          : chr [1:234] "p" "p" "p" "p" ...
##  $ class       : chr [1:234] "compact" "compact" "compact" "compact" ...

This is the overall structure of our dataset. Now we want to check whether there are any missing values, because missing values could be a problem for the analysis. So we do the following.

## Count missing values in the whole dataset
sum(is.na(mpg))
## [1] 0

We also want to check each variable individually to see whether it has any missing values; to do that, we will use the function sapply().

sapply (mpg, function(x)sum(is.na(x)))
## manufacturer        model        displ         year          cyl        trans 
##            0            0            0            0            0            0 
##          drv          cty          hwy           fl        class 
##            0            0            0            0            0

Note: The sapply() function takes a list, a vector, or a data frame as input, applies a function to each element, and simplifies the result to a vector or matrix where possible. It is useful for operations on list-like objects and returns an output of the same length as the original set.
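
As a small illustration of this behaviour (added here, not part of the original analysis), compare sapply() with lapply() on a toy list:

## Toy example of sapply(): apply a function to each element of a list
example_list <- list(a = 1:5, b = c(10, 20, 30))
sapply(example_list, mean)   # simplified to a named numeric vector
lapply(example_list, mean)   # the same computation, but returned as a list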

Now we will visualize the variables; this is an important step before building the statistical model. We want to draw a scatter plot with cty (city miles per gallon) on the x-axis and hwy (highway miles per gallon), our outcome variable, on the y-axis. We will use the ggplot() function.

ggplot(data = mpg,aes(x=cty,y=hwy)) +geom_point() +stat_smooth()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

From the scatter plot we can see that there is a linear, increasing relationship between hwy and cty. This is a good sign, because one important assumption of linear regression is that the relationship between the outcome variable and the predictor variable is linear and additive.

The Level of Association Between Two Variables:

The correlation coefficient measures the level of association between two variables X and Y. Its value ranges between minus one, a perfect negative correlation (when X increases, Y decreases), and plus one, a perfect positive correlation (when X increases, Y also increases). A value close to zero suggests a weak relationship between the variables. A low correlation, say between -0.2 and 0.2, probably suggests that much of the variation of the outcome variable Y is not explained by the predictor variable X, and in such a case we would probably look for a better predictor variable. The correlation coefficient between two variables can be computed with the function cor().

## Find the correlation between the variables
cor(mpg$cty,mpg$hwy)
## [1] 0.9559159

In our example, the correlation coefficient between cty and hwy is about 0.9559, a strong positive correlation, which is large enough that we can continue by building a linear model of y as a function of x.

Note: The scatter plot was a brief check of whether a linear regression model would be a good fit for the data; it shows a linear trend between the two variables. If we had noticed a non-linear trend, we might have had to use other models, such as polynomial models, or apply a data transformation, such as a square-root transformation.

Model Building :

\(Y = \beta_0+\beta_1X+e\)

\(Y = Response\ variable\)

\(X = Regressor \ variable\)

\(\beta_0 = Intercept\)

\(\beta_1 = Slope\)

\(e = Random\ error \ Component\)

Therefore the model will be \(hwy = \beta_0+ \beta_1cty +e\)

So, this is the simple linear regression model we will build: we will try to find the best line that predicts hwy, the highway miles per gallon, on the basis of cty. Our linear model equation is \(hwy = \beta_0+ \beta_1cty +e\).

Mathematically, the beta coefficients \(\beta_0\) and \(\beta_1\) are determined so that the RSS, the residual sum of squares, is as small as possible. The RSS is the sum of the squared residual errors, that is, the squared differences between the observed values and the predicted values. This method of determining the beta coefficients is called least squares regression, or Ordinary Least Squares (OLS) regression. In R, we can use the lm() function to determine the beta coefficients of this linear model. The lm() function takes a formula with the y variable first, so hwy ~ cty says that we want hwy to be modeled as a function of cty.
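
As an optional sanity check (an addition, not in the original text), the OLS coefficients of a simple linear regression have a closed form, \(\beta_1 = cov(x,y)/var(x)\) and \(\beta_0 = \bar{y} - \beta_1\bar{x}\), which should reproduce the lm() estimates below:

## Added sketch: closed-form OLS estimates for the simple regression of hwy on cty
beta1_hat <- cov(mpg$cty, mpg$hwy) / var(mpg$cty)       # slope
beta0_hat <- mean(mpg$hwy) - beta1_hat * mean(mpg$cty)  # intercept
c(intercept = beta0_hat, slope = beta1_hat)             # should be about 0.892 and 1.337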

model<-lm(hwy~cty,data = mpg)
model
## 
## Call:
## lm(formula = hwy ~ cty, data = mpg)
## 
## Coefficients:
## (Intercept)          cty  
##       0.892        1.337

Now let’s recall the model \(hwy = \beta_0+\beta_1cty +e\). In the output, the intercept is \(\beta_0\) and the cty coefficient is \(\beta_1\), so the estimated regression line can be written as hwy = 0.892 + 1.337 cty; that is, 0.892 is \(\beta_0\) and 1.337 is \(\beta_1\). The intercept, 0.892, can be interpreted as the predicted hwy value for zero city miles per gallon: if we set cty to zero in the equation, everything except the intercept vanishes and we are left with 0.892. The regression coefficient for cty, \(\beta_1 = 1.337\), is the slope, and it says that for every one mile per gallon increase in cty, the predicted highway miles per gallon increases by 1.337. For example, setting cty to one gives \(0.892 + 1.337\times1 \approx 2.229\) mpg. That is what \(\beta_0\) and \(\beta_1\) represent.
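
As a quick check of this interpretation (added for illustration), the same numbers can be reproduced from the fitted coefficients:

## Added check: reproduce the interpretation directly from the model object
coef(model)                          # beta_0 and beta_1
coef(model)[1] + coef(model)[2] * 0  # predicted hwy at cty = 0 (the intercept)
coef(model)[1] + coef(model)[2] * 1  # predicted hwy at cty = 1, about 2.229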

ggplot(mpg,aes(x=cty,y=hwy)) +geom_point() +stat_smooth(method = lm)
## `geom_smooth()` using formula 'y ~ x'

We now have a fitted model of hwy as a function of cty. But before using this formula to predict future hwy values, we want to make sure that the model is statistically significant, that is, that there is a statistically significant relationship between the predictor and the outcome variable, and that the model fits the data at hand well.

Assessment & Interpretation:

We will assess and interpret the results of the simple linear regression model to see whether it fits the data well. To do that, we examine the summary of the fitted model using the summary() function: summary(model).

summary(model)
## 
## Call:
## lm(formula = hwy ~ cty, data = mpg)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.3408 -1.2790  0.0214  1.0338  4.0461 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.89204    0.46895   1.902   0.0584 .  
## cty          1.33746    0.02697  49.585   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.752 on 232 degrees of freedom
## Multiple R-squared:  0.9138, Adjusted R-squared:  0.9134 
## F-statistic:  2459 on 1 and 232 DF,  p-value: < 2.2e-16

The summary output shows six components. The Call shows the function call used to compute the regression model. The Residuals section provides a quick view of the distribution of the residuals, which by definition have a mean of zero; therefore, the median should not be far from zero, and the minimum and maximum should be roughly equal in absolute value. The Coefficients section shows the regression beta coefficients we have seen and their statistical significance; predictor variables that are statistically associated with the outcome variable are marked with stars (more on this shortly). Finally, we have the Residual Standard Error (RSE), the R-squared, and the F-statistic, which are metrics used to check how well the model fits our data.

Let us take these components one by one, starting with the residuals. Residuals are essentially the differences between the actual observed response values (hwy in our case) and the response values the model predicted. We will take this further later by plotting the residuals to see whether they are normally distributed, and so on.

The next section of the output is about the coefficients of the model. Theoretically, in simple linear regression the coefficients are two unknown constants that represent the intercept and the slope terms of the linear regression model. For each coefficient we see the estimate and the standard error (SE), which reflects the accuracy of the beta coefficient, that is, how much the coefficient would vary under repeated sampling; it can be used to compute confidence intervals and the t-statistic (t value). The t value and its associated p-value define the statistical significance of the beta coefficients. Let us look at the coefficient standard error first.

The coefficient standard error measures the average amount by which the coefficient estimate varies from the actual average value of the response variable. It measures the variability, or accuracy, of the beta coefficient, and we would ideally like it to be small relative to its coefficient. The standard errors can also be used to compute confidence intervals and to statistically test the hypothesis that a relationship exists between hwy and cty.

We can go on to calculate the confidence interval.

confint(model)
##                   2.5 %   97.5 %
## (Intercept) -0.03189534 1.815978
## cty          1.28431197 1.390599

That is, there is approximately a 95% chance that the interval [1.284, 1.391] contains the true value of \(\beta_1\); similarly, the first row gives the 95% confidence interval for \(\beta_0\).
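
The same interval can be reproduced by hand from the coefficient table, as a small added illustration of where confint() gets its numbers: \(estimate \pm t_{0.975,\ df}\times SE\).

## Added sketch: 95% CI for the cty slope, computed from the coefficient table
est <- summary(model)$coefficients["cty", "Estimate"]
se  <- summary(model)$coefficients["cty", "Std. Error"]
est + c(-1, 1) * qt(0.975, df = df.residual(model)) * se  # approx. 1.284 to 1.391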

The t-statistic and the p-value :

For a given predictor, the t-statistic and its associated p-value test whether or not there is a statistically significant relationship between that predictor and the outcome variable, that is, whether or not the beta coefficient of the predictor (here, cty) is significantly different from zero.

Our statistical hypotheses here are:

\(H_0 : \beta_1 = 0\ (No\ linear\ relationship\ between\ X\ and\ Y)\)

\(H_1 : \beta_1 \neq 0\ (Linear\ relationship\ between\ X\ and\ Y)\)

The t value measures how many standard errors the coefficient estimate is away from zero. We want it to be as far from zero as possible, because that would let us reject the null hypothesis, which says there is no relationship between the predictor and the outcome; in other words, we want to be able to declare that there is a relationship between cty and hwy. In our example, the t-statistic of 49.585 is relatively far from zero.

The t value is also large relative to the standard error, which indicates that a relationship likely exists. In general, t values are used to compute the p-values shown in the next column; the higher the t-statistic and the lower the p-value, the more significant the predictor. The column labeled Pr(>|t|) is the p-value: the probability of observing a t value at least as extreme as the one obtained if the true coefficient were zero. A small p-value indicates that it is unlikely we would observe such a relationship between cty and the response hwy purely by chance. Typically, a p-value of 5% or less is a good cut-off point.

In our model the p-value for cty is very close to zero, and the significance codes line explains the symbols printed beside each estimate: they visually indicate the level of significance. For example, one star means the p-value is between 0.01 and 0.05, and the more stars beside a variable's p-value, the more significant the variable. The three stars beside cty mean that its p-value is highly significant. This is clear because the t value is about 49.585 and the p-value is extremely small, so the cty coefficient is statistically significant, indicating that there is a relationship between the predictor X and the outcome Y.

A small p-value for a coefficient indicates that we can reject the null hypothesis for it. In our example, the p-value for the cty coefficient is highly significant (the intercept, at p ≈ 0.058, is only marginally so), so we reject the null hypothesis and accept the alternative hypothesis: there is a significant association between the predictor and the outcome variable.
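
The t value and the p-value in the coefficient table can also be reproduced by hand (an added sketch, not in the original): t is the estimate divided by its standard error, and the p-value is the two-sided tail probability of a t distribution with 232 degrees of freedom.

## Added sketch: recompute the t value and p-value for the cty coefficient
coefs <- summary(model)$coefficients
t_cty <- coefs["cty", "Estimate"] / coefs["cty", "Std. Error"]   # about 49.585
2 * pt(abs(t_cty), df = df.residual(model), lower.tail = FALSE)  # essentially zero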

This allows us to conclude that there is a relationship between cty and hwy. We already saw from the visualization that the relationship looks linear, but now we can be confident that it is not just due to chance or to the randomness of the data, because we have tested our hypothesis and concluded that the relationship actually exists.

Once you have identified that at least one predictor variable is significantly associated with the outcome, you should continue the diagnostics by checking how well the model fits the data. This process is also referred to as assessing goodness-of-fit. The overall quality of the linear regression fit can be assessed using three quantities displayed in the model summary: the Residual Standard Error (RSE), the R-squared, and the F-statistic.

The Residual Standard Error, also known as the model sigma, is the residual variation, representing the average variation of the observation points around the fitted regression line; it is the standard deviation of the residual errors. The RSE provides an absolute measure of the variation in the data that cannot be explained by the model. When comparing two models, the one with the smaller RSE fits the data better. Dividing the RSE by the average value of the outcome variable gives the prediction error rate, which should be as small as possible. In our example, the RSE is 1.752, meaning that the observed hwy values deviate from the true regression line by approximately 1.75 units on average. Whether an error of about 1.75 units is acceptable is subjective and depends on the problem context. However, we can calculate the percentage error to see how large this error is relative to the outcome.

sigma(model)*100/mean(mpg$hwy)
## [1] 7.475581

The percentage prediction error is about 7.48%, which is quite low and acceptable.

The next metrics are the R-squared and the adjusted R-squared. The R-squared statistic provides a measure of how well the model fits the actual data; it takes the form of a proportion of variance. R-squared measures the strength of the linear relationship between the predictor variable cty and the response variable hwy. It always lies between zero and one: a value near zero indicates a regression that does not explain the variance in the response variable well, while a value close to one indicates that it explains the observed variance well. For simple linear regression, R-squared is the square of the Pearson correlation coefficient. A high R-squared is a good sign; in our example it is 0.9138. Although this is relatively strong, it is hard to say in general what level of R-squared is needed to claim that a model fits well; it varies with the application and the domain studied. That is why we took time to explain the other components in the model summary, so that together they support a more holistic assessment of the model.

A side note: the adjusted R-squared adjusts for degrees of freedom. In a multiple regression setting, the R-squared always increases as more variables are included in the model, which is why the adjusted R-squared is the preferred measure there: it penalizes the model for the number of predictors considered. An adjusted R-squared close to one indicates that a large proportion of the variability in the outcome is explained by the regression model, while a value near zero indicates that the model does not explain much of the variability.

The last metric is the F-statistic, which gives the overall significance of the model: it assesses whether at least one predictor variable has a non-zero coefficient. In simple linear regression this test is not especially interesting, since it duplicates the information in the t value: with one degree of freedom for the predictor, the F-statistic equals the square of the t value (49.585 squared is about 2458.7, which is very close to the reported F-statistic). The F-statistic becomes more important once we use multiple predictors, as in multiple linear regression. A large F-statistic corresponds to a statistically significant p-value of less than 0.05. In our example, the F-statistic is 2459 with a p-value of less than \(2.2\times10^{-16}\), which is extremely small, so the model is highly significant.
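
These relationships can be verified directly in R (an added sketch): for this simple model, the R-squared equals the squared correlation between cty and hwy, and the F-statistic equals the squared t value of the slope.

## Added sketch: verify R-squared = cor^2 and F = t^2 for the simple model
cor(mpg$cty, mpg$hwy)^2            # about 0.9138, matching Multiple R-squared
summary(model)$r.squared
coefs <- summary(model)$coefficients
coefs["cty", "t value"]^2          # about 2459
summary(model)$fstatistic["value"] # the F-statistic reported in the summary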

Model Prediction :

Here we will show how to obtain the fitted values from the model and how to make predictions for new values. The fitted or predicted values are the y values you would expect, given the x values, according to the fitted regression model, or, visually, according to the best-fitting straight line. To find the fitted values of the simple regression model:

## Find the fitted values of the simple regression model
 fitted <-predict.lm(model)
 head(fitted,3)
##        1        2        3 
## 24.96624 28.97861 27.64115

These are the first three fitted values: approximately 24.97, 28.98, and 27.64. We can also easily augment our data with the fitted values and the residuals using the augment() function from the broom package, which we have already loaded. We will store the output in a new object called model_diag_metrics.

model_diag_metrics <- augment(model)
head(model_diag_metrics)
## # A tibble: 6 × 8
##     hwy   cty .fitted .resid    .hat .sigma     .cooksd .std.resid
##   <int> <int>   <dbl>  <dbl>   <dbl>  <dbl>       <dbl>      <dbl>
## 1    29    18    25.0 4.03   0.00458   1.74 0.0123          2.31  
## 2    29    21    29.0 0.0214 0.00834   1.76 0.000000632     0.0123
## 3    31    20    27.6 3.36   0.00661   1.74 0.0123          1.92  
## 4    30    21    29.0 1.02   0.00834   1.75 0.00144         0.585 
## 5    26    16    22.3 3.71   0.00445   1.74 0.0101          2.12  
## 6    26    18    25.0 1.03   0.00458   1.75 0.000805        0.591

We do this because augment() returns several metrics that are useful for regression diagnostics.

In the result, cty is the predictor variable and hwy, the first column, is the observed response variable. The .fitted column holds the fitted hwy values: the first value is approximately 25.0, matching the 24.96624 obtained above; the second is 29.0, matching 28.97861; and the third is 27.6, matching 27.64115. So augment() is more comprehensive: it also returns the residual error in the .resid column, along with several other metrics. Now that we have the fitted values and know that the model fits the data properly, we can make some predictions.

Now, we want to try to create this plot.

ggplot(model_diag_metrics, aes(cty, hwy)) +
  geom_point() +
  stat_smooth(method = lm, se = FALSE) +
  geom_segment(aes(xend = cty, yend = .fitted), color = "red", size = 0.3)
## `geom_smooth()` using formula 'y ~ x'

Basically, this plot draws the residual errors in red: it shows the difference between each observed value and the fitted regression line.

Each red segment represents the residual error between an observed hwy value and the corresponding predicted value. Zooming in, the blue line is the fitted regression line, each point is one of the 234 observations, and each red segment is the residual error, that is, the difference between an observation and the regression line. These differences are exactly what goes into the residual error, the residual sum of squares, and related quantities, so this plot conveys a lot of information about how far each observation lies from the regression line.
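
As a small added check, the residuals and the residual sum of squares described here can be computed directly:

## Added sketch: residual = observed - fitted, and the RSS is what OLS minimizes
res <- mpg$hwy - fitted(model)
all.equal(unname(res), unname(residuals(model)))  # should be TRUE
sum(res^2)                                        # the residual sum of squares (RSS)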

To predict new values, we call predict() with the model as the object argument and pass the new values through the newdata argument as a data frame (because our data is a data frame) containing the new cty values. Here we want to predict hwy for cty values of 21, 27, and 14 mpg, that is, to predict the highway miles per gallon for some new values of the city miles per gallon.

Predict New Values Using the Model:

predict(
   object = model,
   newdata = data.frame(cty=c(21,27,14))
 )
##        1        2        3 
## 28.97861 37.00334 19.61642

So when the city mpg is 21, the predicted highway mpg is 28.97861. You can verify this from the model equation: plugging 21 into \(0.892+1.337\times21\) gives approximately the same value. That is how you make predictions for new values of the predictor.
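
As a quick added check, the same predictions can be reproduced by hand from the fitted coefficients:

## Added check: predictions computed directly from the coefficients
coef(model)[1] + coef(model)[2] * c(21, 27, 14)   # approx. 28.98, 37.00, 19.62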

Diagnostic Plot :

Linear regression makes several assumptions about the data. The first is linearity of the data: the relationship between the predictor X and the outcome Y is assumed to be linear. The second is normality of the residuals: the residual errors are assumed to be normally distributed. The third is homogeneity of the residual variance: the residuals are assumed to have a constant variance, which is called homoscedasticity. The fourth is independence of the residual error terms. You would normally want to check whether or not these assumptions hold. Possible problems include non-linearity of the outcome-predictor relationship; heteroscedasticity, that is, non-constant variance of the error terms; and the presence of influential values in the data, which can be outliers (extreme values in the outcome variable) or high-leverage points (extreme values in the predictor X).

All of these assumptions and potential problems can be checked by producing diagnostic plots that visualize the residual errors. So we want to plot the model, but first we arrange the plots in a 2 × 2 grid using par(mfrow = c(2, 2)), which means two rows and two columns.

par(mfrow = c(2, 2))   
plot(model)

This returns four plots arranged two by two: Residuals vs Fitted, the Normal Q-Q plot, Scale-Location, and Residuals vs Leverage. We will briefly explain what each of these plots means, but first let's use autoplot() for a ggplot2 version of the same plots.

par(mfrow = c(2, 2))   
autoplot(model)

The diagnostic plots show the residuals in four different ways. The first is Residuals vs Fitted, which is used to check the linear relationship assumption; a horizontal line without a distinct pattern is an indication of a linear relationship, which is what we want. The Normal Q-Q plot is used to examine whether the residuals are normally distributed; it is good if the residual points follow the dashed straight line. The Scale-Location (or Spread-Location) plot is used to check the homogeneity of the variance of the residuals, that is, whether they are homoscedastic with a constant variance; a horizontal line with equally spread points is a good indication of homoscedasticity.

The Residuals vs Leverage plot is used to identify influential cases, that is, extreme values that might influence the regression results when included in or excluded from the analysis. We will now plot the first three individually and explain what they mean, focusing on the linearity, normality, and homoscedasticity assumptions.

To check the linearity assumption we look at the Residuals vs Fitted plot, which is the first diagnostic plot, obtained with plot(model, 1).

plot(model, 1)

Ideally, the residual plot shows no fitted pattern, that is, the red line should be approximately horizontal at zero. The presence of a pattern may indicate a problem with some aspect of the linear model. In our example there is no clear pattern in the residual plot, and we have already seen that the relationship looks linear, so we can assume a linear relationship between the predictor and the outcome variable. Still, one might argue that the red line is not perfectly horizontal, so let us show what you can do when the linearity assumption is not satisfied.

model_0 <- lm(hwy ~ sqrt(cty), data = mpg)
plot(model_0, 1)

Whenever the residual plot indicates a non-linear relationship in the data, a simple approach is to use a non-linear transformation of the predictor in the regression model, such as log(x), sqrt(x), or the square of the predictor. That is what we did here: we fitted a model using the square root of the predictor cty. In the resulting residual plot the red line is now very close to horizontal and there is no specific pattern. In our case the original model could already be treated as linear, but if you are ever in a situation where the linearity assumption does not hold, a simple transformation like this can help. Next, the second diagnostic plot, the Q-Q plot of the residuals, can be used to visually check that the normality assumption is satisfied; the points should approximately follow a straight line.

plot(model, 2)

You will notice that the residuals approximately lie on the straight reference line, so we can assume that in our example the residuals are normally distributed. Let us now check the homogeneity of the variance using the Scale-Location (Spread-Location) plot, the third diagnostic plot.

plot(model, 3)

This plot shows whether the residuals are spread equally along the range of the fitted values; a horizontal line with equally spread points is what we want to see. In our example, this is not entirely the case: the variability of the residual points increases with the value of the fitted outcome variable, suggesting non-constant variance in the residual errors, which is called heteroscedasticity. A possible way to reduce the heteroscedasticity problem is to apply a log or square-root transformation to the outcome variable y.
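
As an illustrative sketch of that suggestion (the log transformation here is an example, not something fitted in the original text), we could refit the model with a log-transformed outcome and re-check the Scale-Location plot:

## Illustrative only: one common way to stabilize the residual variance
model_log <- lm(log(hwy) ~ cty, data = mpg)
plot(model_log, 3)   # re-check the Scale-Location plot after the transformation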

Multiple Linear Regression Model:

We now want to build a multiple regression model with hwy as the y variable and cty and cyl as the x, or predictor, variables.

multiple_reg_model <- lm(hwy~cty+cyl,data = mpg)
multiple_reg_model
## 
## Call:
## lm(formula = hwy ~ cty + cyl, data = mpg)
## 
## Coefficients:
## (Intercept)          cty          cyl  
##    -0.07702      1.36425      0.08784
summary(multiple_reg_model)
## 
## Call:
## lm(formula = hwy ~ cty + cyl, data = mpg)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.4735 -1.1952  0.0398  0.9934  4.1691 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.07702    1.40888  -0.055    0.956    
## cty          1.36425    0.04559  29.924   <2e-16 ***
## cyl          0.08784    0.12040   0.730    0.466    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.754 on 231 degrees of freedom
## Multiple R-squared:  0.914,  Adjusted R-squared:  0.9132 
## F-statistic:  1227 on 2 and 231 DF,  p-value: < 2.2e-16

You will see that the cyl variable we added is not significant: its p-value is greater than 0.05, its standard error is 0.1204, and its t value is very low, whereas we want the t value to be as far from zero as possible. So cyl is not a significant variable, and we would not keep it in the model. This is how you can explore: look through your data for another variable, remove cyl, add the new variable, check the assumptions again, and look at which variables are significant (a sketch of this kind of follow-up appears after the diagnostic plots below). Also note the adjusted R-squared: it is slightly lower than the multiple R-squared, because it penalizes the model for each additional predictor variable, just as explained before. You can also run autoplot() on this model.

autoplot(multiple_reg_model)

autoplot() again produces the diagnostic plots; everything is as before, except that the model now has one more variable, so the earlier explanations still apply. You can look at Residuals vs Fitted, check that the residuals still lie roughly on the straight line in the Q-Q plot, examine the Scale-Location plot for homogeneity of the variance, and make the corresponding interpretations.
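
For example, here is a sketch of that kind of follow-up (displ is just a hypothetical choice for illustration; it is not analyzed in the original text):

## Hypothetical follow-up (not in the original): drop cyl and try displ as the
## second predictor, then compare the adjusted R-squared and the p-values as above.
multiple_reg_model_2 <- lm(hwy ~ cty + displ, data = mpg)
summary(multiple_reg_model_2)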
