Simple linear regression (ordinary least squares, OLS)

Load the required libraries using the p_load() function of the pacman package. First install pacman, if needed, by running install.packages('pacman') in the console; you only need to install it once.

pacman::p_load(readxl, jtools, car, tidyverse)

Import the dataset using File > Import Dataset in RStudio, which generates a read_excel() call like the one below:

cheese <- read_excel("DataSets.xlsx", sheet = "correlation_regression", range = "C14:F44")
str(cheese)
## tibble [30 × 4] (S3: tbl_df/tbl/data.frame)
##  $ Taste : num [1:30] 18.1 56.7 18 0.7 39 11.6 15.9 26.5 0.7 54.9 ...
##  $ Acetic: num [1:30] 4.9 5.86 5.25 5.33 5.37 ...
##  $ H2S   : num [1:30] 3.85 10.2 6.17 3.91 5.44 ...
##  $ Lactic: num [1:30] 1.29 2.01 1.63 1.25 1.57 1.46 1.16 1.72 1.06 1.52 ...

This dataset contains 4 variables with 30 observations. Taste of the cheese is the dependent variable. We want to check the effect of acetic acid concentration (ppm) on the taste (a perception score). In simple linear regression there is only one independent variable; in multiple linear regression there is more than one. The dependent variable in linear regression must be continuous or discrete, and the residuals must be normally distributed with equal variance (homoskedastic). We need to estimate the coefficients of the regression equation.

Regression equation: \(y_i = \alpha + \beta X_i + \epsilon_i\)

Fit the model using the lm() function. The model is significant, indicated by F(1,28) = 12.11 with p = 0.00, meaning the model significantly explains the variance in the taste of cheese. Furthermore, \(R^2 = 0.30\) indicates that 30% of the variance in taste can be explained by the acetic acid content of the cheese. The adjusted \(R^2 = 0.28\) indicates that, after accounting for the degrees of freedom lost by including less important variable(s) in the model, it explains 28% of the variance in taste. The estimate (regression coefficient) of acetic acid is 15.65 taste-score units, which is highly significant (p = 0.00). This means that if we increase the concentration of acetic acid by one ppm, the taste score is expected to increase by 15.65. The intercept (\(\alpha\)) indicates the average taste score when the independent variable (acetic acid) is zero.

mod = lm(Taste ~ Acetic, cheese)
summ(mod, confint = TRUE)
Observations:        30
Dependent variable:  Taste
Type:                OLS linear regression

F(1,28):   12.11
R²:        0.30
Adj. R²:   0.28

                Est.     2.5%    97.5%   t val.      p
(Intercept)   -61.50  -112.39   -10.60    -2.48   0.02
Acetic         15.65     6.44    24.86     3.48   0.00

Standard errors: OLS

From the estimated coefficients, we can predict the taste score (\(\hat y\)) with the model equation: \(\hat y = -61.50 + 15.65 \cdot Acetic\)
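To make the equation concrete, we can plug a value into it by hand using coef(); the acetic concentration below is hypothetical, chosen only for illustration.

new_acetic <- 5.5  # hypothetical acetic value, for illustration only
# Manual prediction from the fitted coefficients
coef(mod)["(Intercept)"] + coef(mod)["Acetic"] * new_acetic
# The same prediction via predict()
predict(mod, newdata = data.frame(Acetic = new_acetic))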

We can generate these predictions with predict(). Since we also know the actual taste scores, we can calculate the residuals (errors, \(\epsilon\)) for each of the 30 observations by subtracting the predicted values from the actual values.

actual = cheese$Taste
cheese$predicted = predict(mod) # add to the dataset
cheese$residuals = actual - cheese$predicted # add to the dataset
cheese$residuals
##           1           2           3           4           5           6 
##   2.9558486  26.5809354  -2.6052222 -21.1726913  16.5326935 -21.4608449 
##           7           8           9          10          11          12 
##   2.4927507 -13.0546683  -7.8564414  20.1491963   6.5804286  17.8331049 
##          13          14          15          16          17          18 
##   2.7108059  -6.2032583   6.6019680   2.8005741 -29.6419979   4.0822028 
##          19          20          21          22          23          24 
##   0.4730167   1.6717813  15.3060543   5.4669033  -5.8669261  -1.7467174 
##          25          26          27          28          29          30 
##  19.2831210  -5.6673065 -16.7871038 -15.8897330 -11.7085167   8.1400422

Let us visualize the residuals. The blue dots are the actual observations. The red line is the regression line (the predicted \(\hat y\) values) based on the estimated coefficients. The arrows represent the deviations (residuals) of the observations from the predicted values (the red line). The residuals of the model can also be accessed directly with mod$residuals.

ggplot(cheese, aes(x = Acetic, y = Taste)) +
  geom_smooth(method = 'lm', formula = y ~ x, se = FALSE, linewidth = 2, color = 'red') + # predicted values (regression line)
  geom_segment(aes(xend = Acetic, yend = predicted), color = 'grey50', linewidth = 1, arrow = arrow(type='open', ends = 'first', length = unit(0.1, 'inches'))) +
  geom_point(color = 'blue', shape = 19, size = 5, alpha = 0.5) + theme_test()

For the regression to be valid, the residuals should be normally distributed and homoskedastic (devoid of heteroskedasticity). The fitted model object contains several components.

names(mod)
##  [1] "coefficients"  "residuals"     "effects"       "rank"         
##  [5] "fitted.values" "assign"        "qr"            "df.residual"  
##  [9] "xlevels"       "call"          "terms"         "model"

These components can be accessed with mod$…. commands; for example, mod$fitted.values contains the predicted values.

mod$fitted.values
##         1         2         3         4         5         6         7         8 
## 15.144151 30.119065 20.605222 21.872691 22.467306 33.060845 13.407249 39.554668 
##         9        10        11        12        13        14        15        16 
##  8.556441 34.750804 28.319571 39.366895  9.589194 21.403258 30.698032 38.099426 
##        17        18        19        20        21        22        23        24 
## 35.141998  9.917797 20.526983 19.228219 23.593946 20.433097 11.466926 27.646717 
##        25        26        27        28        29        30 
## 28.616879 22.467306 23.187104 29.289733 33.608517 23.859958

Let us examine the residuals for normality and heteroskedasticity.

boxplot(mod$residuals, col = 'red')

The boxplot suggests that the residuals are approximately normally distributed.

qqPlot(mod)

## [1]  2 17

The QQ plot also suggests that the residuals are approximately normally distributed; the printed indices (2 and 17) identify the two most extreme observations.
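As a complementary formal check (a sketch; the plots above are the tutorial's main evidence), base R's Shapiro-Wilk test can be applied to the residuals; a p-value above 0.05 is consistent with normality.

# Shapiro-Wilk test of the residuals (H0: the residuals are normal)
shapiro.test(mod$residuals)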

plot(mod, which = 1)

The red line should lie along the dotted horizontal line, indicating a linear relationship between the independent and dependent variables; this is one of the important assumptions of linear regression. The plot should also show equal variance of the residuals across the fitted values to meet the assumption of homoskedasticity.

However, the spread of the residuals around the dotted line is not homogeneous across the fitted values. It shows a funnel-shaped spread, which means the variances are heteroskedastic. This is a problem for linear regression. We can estimate heteroskedasticity-robust standard errors, which give more stable p-values; the robust standard errors change only the inference, while the coefficient estimates remain unbiased either way.
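The visual impression can be backed by a formal test; as a sketch, ncvTest() from the already-loaded car package performs a score test for non-constant error variance, where a small p-value indicates heteroskedasticity.

# Score test for non-constant error variance (H0: constant variance)
ncvTest(mod)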

Visualization of the funnel-shaped spread of the residuals.

ggplot(cheese, aes(y = residuals, x = predicted)) +
  geom_point(shape = 1, size = 2) + theme_test() +
  geom_hline(yintercept = 0, linetype = 'dotted', linewidth = 1) +                   # zero line
  geom_hline(yintercept = 15, linetype = 'dashed', linewidth = 1, color = 'blue') +  # upper reference band
  geom_hline(yintercept = -15, linetype = 'dashed', linewidth = 1, color = 'blue') + # lower reference band
  geom_abline(intercept = -7, slope = 1.1) + # hand-drawn funnel edge
  geom_abline(intercept = 1, slope = -1.0)   # hand-drawn funnel edge

To get the robust standard errors when the residual spread is funnel-shaped (heteroskedastic), we can use the summ(mod, robust = TRUE, confint = TRUE) command.

summ(mod, robust = TRUE, confint = TRUE)
Observations:        30
Dependent variable:  Taste
Type:                OLS linear regression

F(1,28):   12.11
R²:        0.30
Adj. R²:   0.28

                Est.     2.5%    97.5%   t val.      p
(Intercept)   -61.50  -109.15   -13.84    -2.64   0.01
Acetic         15.65     6.38    24.91     3.46   0.00

Standard errors: Robust, type = HC3
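The same robust errors can be reproduced outside jtools; a minimal sketch, assuming the sandwich and lmtest packages are installed:

# HC3 heteroskedasticity-robust standard errors, as reported by summ(robust = TRUE)
lmtest::coeftest(mod, vcov. = sandwich::vcovHC(mod, type = "HC3"))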

Multiple linear regression

In multiple linear regression, there is more than one independent variable. The assumptions are:

  1. The dependent (criterion, response) variable is continuous (or discrete) and normally distributed.

  2. The independent (predictor) variables can be continuous (covariate), discrete, or categorical (factor)

  3. The independent variables are orthogonal, meaning they are uncorrelated with each other. In reality they may be correlated, and if the correlation between two variables is strong, a multicollinearity problem arises: either of the two variables can explain the same portion of the variance. We then drop the variable with the higher VIF (variance inflation factor). A high VIF (> 4, 6, or at most 10, depending on the rule of thumb) indicates multicollinearity. The VIF is computed from the coefficient of determination \(R_j^2\) obtained by regressing predictor \(j\) on the other predictors: \(VIF_j = \frac{1}{1-R_j^2}\) (see the short sketch after this list).

  4. The independent variables are linearly related to the dependent variable. This is checked with the residuals vs. fitted plot, which should show a roughly straight line along the zero line.

  5. The variance of the residuals should be homoskedastic. The residuals vs. fitted plot should show a fairly even spread of residuals on both sides of the zero line across the range of fitted values.

  6. The residuals should be normally distributed, which is checked with the residual QQ plot; the points should lie along the diagonal line.

  7. The residuals should be random and uncorrelated with the independent variables. If the QQ plot of the residuals shows any systematic deviation from the diagonal line, systematic variation remains in the residuals and the independent variables have not captured all of the explainable variance in the dependent variable.
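As referenced in assumption 3, here is a minimal sketch of the VIF computation for one predictor of the cheese data, checked against vif() from the already-loaded car package:

# VIF of Acetic by hand: regress Acetic on the other predictors,
# then apply VIF = 1 / (1 - R^2)
r2_acetic <- summary(lm(Acetic ~ H2S + Lactic, data = cheese))$r.squared
1 / (1 - r2_acetic)
# car's vif() returns this for every predictor of a fitted model
vif(lm(Taste ~ Acetic + H2S + Lactic, data = cheese))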

Let us fit a multiple regression model with this cheese dataset.

model = lm(Taste ~ Acetic + H2S + Lactic, cheese)
summ(model, confint = T, vif = T)
Observations:        30
Dependent variable:  Taste
Type:                OLS linear regression

F(3,26):   16.22
R²:        0.65
Adj. R²:   0.61

                Est.    2.5%   97.5%   t val.      p    VIF
(Intercept)   -28.88  -69.44   11.69    -1.46   0.16     NA
Acetic          0.33   -8.84    9.49     0.07   0.94   1.83
H2S             3.91    1.35    6.48     3.13   0.00   1.99
Lactic         19.67    1.93   37.41     2.28   0.03   1.94

Standard errors: OLS

Our \(H_0 : \beta_{Acetic} = \beta_{H2S} = \beta_{Lactic} = 0\). If the associated p-value is > 0.05, we cannot reject \(H_0\). In our example, the \(\beta\) of H2S and Lactic are not 0, as p < 0.05, while Acetic is not significant. The VIFs are less than 2, so there is no multicollinearity issue (and robust standard errors, which address heteroskedasticity rather than multicollinearity, are not needed here).

This result indicates that a one-unit increase in H2S increases the taste score by 3.91, holding the other variables constant. Similarly, a one-unit increase in lactic acid increases the taste score by 19.67, holding the other variables constant. Still, we do not know which variable has the larger effect on the taste score; for this we need to calculate the standardized (unit-free) regression coefficients to compare the variables. We also remove the non-significant variable (Acetic) from the model to preserve degrees of freedom. We see \(R^2 = 0.65\) and \(Adjusted~R^2 = 0.61\); this reduction (penalty) in the coefficient of determination is due to the inclusion of a non-significant variable in the model.

model_reduced = lm(Taste ~ H2S + Lactic, cheese)
summ(model_reduced, confint = T, vif = T, scale = T)
Observations:        30
Dependent variable:  Taste
Type:                OLS linear regression

F(2,27):   25.26
R²:        0.65
Adj. R²:   0.63

                Est.   2.5%   97.5%   t val.      p    VIF
(Intercept)    24.53  20.81   28.26    13.52   0.00     NA
H2S             8.39   3.44   13.35     3.47   0.00   1.71
Lactic          6.04   1.08   10.99     2.50   0.02   1.71

Standard errors: OLS; continuous predictors are mean-centered and scaled by 1 s.d. The outcome variable remains in its original units.

Now we see that the effect of H2S on the taste score is larger than that of lactic acid. The interpretation of a standardized coefficient differs somewhat from an unstandardized one. Here we conclude that a one standard deviation increase in H2S changes the taste score by 8.39 units (the outcome remains in its original units, as the table note says), holding the other variables constant. We must say "holding other variables constant," because this rate of change may not be realized if the other variables also change.
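As a quick cross-check (a sketch, assuming summ()'s default scaling by one standard deviation), the same standardized coefficients can be reproduced by scaling the predictors manually:

# Mean-center and scale each predictor by 1 SD; the coefficients
# should match the scaled summ() table (8.39 for H2S, 6.04 for Lactic)
lm(Taste ~ scale(H2S) + scale(Lactic), data = cheese)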

However, for writing the prediction equation and for easier interpretation, we use the unstandardized regression coefficients.

summ(model_reduced, confint = T, vif = T, scale = F)
Observations:        30
Dependent variable:  Taste
Type:                OLS linear regression

F(2,27):   25.26
R²:        0.65
Adj. R²:   0.63

                Est.    2.5%   97.5%   t val.      p    VIF
(Intercept)   -27.59  -46.02   -9.16    -3.07   0.00     NA
H2S             3.95    1.62    6.28     3.47   0.00   1.71
Lactic         19.89    3.56   36.22     2.50   0.02   1.71

Standard errors: OLS

This model is significant, as indicated by the F value. H2S and lactic acid together explain 65% of the variance in the taste score. We can also estimate the individual contribution of each variable from the anova(model_reduced) output (sums of squares), similar to an ANOVA. The model prediction equation is: \(\hat y = -27.59 + 3.95 \cdot H2S + 19.89 \cdot Lactic\). Using this equation, we can predict an unknown taste score by plugging in the values of the independent variables.
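For example (a sketch; the plugged-in H2S and Lactic values below are purely illustrative), anova() decomposes the sums of squares by variable and predict() scores new observations:

# Sequential (Type I) sums of squares for each predictor
anova(model_reduced)
# Predicted taste score for hypothetical new measurements
predict(model_reduced, newdata = data.frame(H2S = 6, Lactic = 1.5))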

Residual analysis

plot(model_reduced, which = 1)

plot(model_reduced, which = 2)

Distribution of the dependent variable (appears normal):

boxplot(cheese$Taste)

qqPlot(cheese$Taste)

## [1] 12  2

Model quality: our model is a good model because:

  1. The dependent variable is continuous and normally distributed, as checked by the boxplot and QQ plot.
  2. The independent variables are continuous.
  3. The independent variables are effectively orthogonal (uncorrelated with each other); the VIFs are less than 2.
  4. The independent variables are linearly related to the dependent variable, as checked by the residuals vs. fitted plot, which shows a nearly straight line along the zero line.
  5. The variance of the residuals is homoskedastic; the residual plot shows a fairly even spread of residuals on both sides of the zero line.
  6. The residuals are normally distributed, as checked by the residual QQ plot; the points lie along the diagonal line.
  7. The residuals are random and uncorrelated with the independent variables; the QQ plot of the residuals does not show any systematic deviation from the diagonal line.
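As a compact alternative to checking these plots one by one, base R can draw the four standard lm() diagnostic plots in a single panel (a minimal sketch):

# All four standard diagnostic plots in a 2 x 2 layout
op <- par(mfrow = c(2, 2))
plot(model_reduced)
par(op)  # restore the previous plotting layout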