Introduction

Regression analysis


The linear regression model

  • In regression analysis, we wish to express the relationship between the dependent variable and the explanatory variable in a functional form \[ Y = f(X) + \epsilon \]

  • If we observe pairs (X, Y) and the scatterplot shows a fairly linear pattern, we can let \[ f(X) = \beta_0 + \beta_1 X \]

  • Thus, the (simple) linear regression model is given by \[ Y = \beta_0 + \beta_1X + \epsilon \] where:

    • Y is the response variable
    • X is the regressor (predictor)
    • \(\beta_0\) - y-intercept
    • \(\beta_1\) - slope of the regression line; regression coefficient
    • \(\epsilon\) is the random error term
  • For example, X may represent TV advertising and Y may represent sales.
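As a quick illustration of the model (separate from the advertising data analysed below), the following sketch simulates observations from an SLR model with hypothetical parameter values \(\beta_0 = 7\) and \(\beta_1 = 0.05\) and checks that lm() recovers them approximately:

#A minimal simulation from the SLR model (hypothetical parameter values)
set.seed(1)
n <- 100
X <- runif(n, 0, 300)              #e.g., spending on TV ads
e <- rnorm(n, mean = 0, sd = 3)    #random error term
Y <- 7 + 0.05 * X + e              #Y = beta0 + beta1*X + epsilon
coef(lm(Y ~ X))                    #estimates should be close to 7 and 0.05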


Interpretation of the regression parameters of the SLR model

  • \(\beta_0\) is the y-intercept of the regression line. It is the mean response when X = 0

  • \(\beta_0\) is meaningful only if

    • X=0 is within the range of X values in the data, and
    • X=0 is a logical or meaningful value
  • \(\beta_1\) is the slope of the regression line. It is the change in the mean response for every one-unit increase in X


Estimation of regression parameters

  • The regression parameters \(\beta_0\) and \(\beta_1\) are unknown quantities that must be estimated from the data

  • The most common method of estimating the regression parameters is the Method of Least Squares

  • The resulting estimators are called Ordinary Least Squares (OLS) estimators

  • The OLS estimators of \(\beta_0\) and \(\beta_1\) are \(\hat{\beta_0}\) and \(\hat{\beta_1}\), respectively

  • The estimates are often written as \(b_0\) and \(b_1\), i.e., \(\hat{\beta_0} = b_0\) and \(\hat{\beta_1} = b_1\); their closed-form expressions are shown below
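For the SLR model, the least squares method chooses \(b_0\) and \(b_1\) to minimize the sum of squared errors \(\sum_i (Y_i - \beta_0 - \beta_1 X_i)^2\), and the resulting estimators have the familiar closed forms

\[ \hat{\beta}_1 = \frac{\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n}(X_i - \bar{X})^2}, \qquad \hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X} \]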


Assumptions of the linear regression model

Standard assumptions

  • Y is a continuous random variable

  • \(\epsilon_i \sim N(0, \sigma^2), \forall \; i \Longrightarrow\) normality and homogeneous (constant) variance assumptions

  • For two different observations, \(i\) and \(j\), the error terms \(\epsilon_i\) and \(\epsilon_j\) are independent \(\Longrightarrow\) independence assumption

Assumptions on the predictor variables

  • The predictor variables are assumed fixed or selected in advance.

  • The values of the predictors are assumed to be measured without error.

  • The predictor variables are assumed to be linearly independent of each other (no multicollinearity) \(\Longrightarrow\) multicollinearity occurs when two or more of the regressors are strongly correlated

Assumption about the observations

  • All observations are equally important in determining the results and in influencing conclusions (absence of outliers and influential observations)


Overall measures of fit of the model

  • Coefficient of determination (\(R^2\)):

    • the percentage of variation in Y that can be explained or attributed to its linear relation with X

    • the closer to 1 or 100% the better the fit

  • F test of overall effect of X: \(F = \frac{MSR}{MSE}\)

    • A significant F value indicates that Y is significantly related to X; that is, X has a significant effect on Y
  • Root Mean Square Error (RMSE):

    • standard deviation of the residuals (=difference between the observed Y and the predicted Y from the model)

    • the closer to zero the better the fit
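In terms of the usual sums of squares (SST = total, SSR = regression, SSE = error), these measures of fit can be written as

\[ R^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST}, \qquad F = \frac{MSR}{MSE} = \frac{SSR/p}{SSE/(n - p - 1)}, \qquad RMSE = \sqrt{\frac{SSE}{n - p - 1}} \]

where \(p\) is the number of predictors (\(p = 1\) for SLR) and \(n\) is the number of observations.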


Inference on the parameter of a SLR model

  • \(H_0: \beta_1 = 0\) (X has no effect on Y)

  • \(H_1: \beta_1 \neq 0\) (X has an effect on Y)

  • Test statistic: \(t=\frac{\hat{\beta_1}}{se(\hat{\beta_1})}\)
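Under \(H_0\), this statistic follows a t distribution with \(n - 2\) degrees of freedom. A minimal sketch of the corresponding two-sided p-value in R, using the t value and sample size from the advertising example fitted below:

#Two-sided p-value for the slope
t_stat <- 17.67          #t = beta1_hat / se(beta1_hat)
n      <- 200            #sample size
2 * pt(abs(t_stat), df = n - 2, lower.tail = FALSE)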


An example of SLR analysis

#Loading required packages
library(ISLR)
library(readxl)
library(tidyverse)
library(ggpubr)
library(performance)
library(car)
library(caret)
library(olsrr)
library(jtools)
library(moments)
library(lmtest)
library(see)
library(caTools)
library(gtsummary)
library(modelsummary)
library(DescTools)
library(ggeffects)
library(rsample)
library(regclass)
library(sjPlot)

#Importing the data into R
sales.data <- read_excel("Data/Advertising.xlsx")
head(sales.data) #Displays the 1st 6 rows of the data frame
## # A tibble: 6 × 5
##    ...1    TV radio newspaper sales
##   <dbl> <dbl> <dbl>     <dbl> <dbl>
## 1     1 230.   37.8      69.2  22.1
## 2     2  44.5  39.3      45.1  10.4
## 3     3  17.2  45.9      69.3   9.3
## 4     4 152.   41.3      58.5  18.5
## 5     5 181.   10.8      58.4  12.9
## 6     6   8.7  48.9      75     7.2
#Visualization
ggscatter(x = "TV",
          y = "sales",
          data = sales.data,
          color = "darkgreen", 
          shape = 19, size = 1.75, 
          add = "reg.line",
          conf.int = TRUE,
          add.params = list(color = "blue", fill = "lightgray")) +
  labs(x = "Spending on TV ads",
       y = "Sales")

#Code for SLR model
slr <- lm(sales ~ TV,
           data = sales.data)

#Displays the output of the analysis
summary(slr)
## 
## Call:
## lm(formula = sales ~ TV, data = sales.data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.3860 -1.9545 -0.1913  2.0671  7.2124 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 7.032594   0.457843   15.36   <2e-16 ***
## TV          0.047537   0.002691   17.67   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.259 on 198 degrees of freedom
## Multiple R-squared:  0.6119, Adjusted R-squared:  0.6099 
## F-statistic: 312.1 on 1 and 198 DF,  p-value: < 2.2e-16


Checking the assumptions

Before the results of a regression analysis are interpreted, we need to make sure that the assumptions of the model are satisfied. Conclusions drawn from a regression analysis are valid and reliable only if the assumptions are satisfied. Regression diagnostics refers to the detection of violations of the model assumptions; it is also called residual analysis, since many of the assumptions are checked using the residuals. The residuals are the differences between the observed values of Y and the values predicted from the model. They are denoted by \(\hat{\epsilon}\) and are computed as \[ \hat{\epsilon} = Y - \hat{Y} \]
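As a quick check on the fitted slr model above, the residuals returned by R can be reproduced by hand (a minimal sketch):

#Residuals from the fitted model vs. manual computation of Y - Y_hat
e.hat <- sales.data$sales - fitted(slr)
all.equal(unname(e.hat), unname(resid(slr)))   #should be TRUE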


Checking the linearity assumption

library(GGally)
library(lmtest)

#Pairwise scatter plots of DV with IV
sales.data |> 
  select(TV, sales) |> 
  ggpairs()

#Rainbow test for linearity
raintest(slr)
## 
##  Rainbow test
## 
## data:  slr
## Rain = 1.1365, df1 = 100, df2 = 98, p-value = 0.2631

INTERPRETATION: The non-significant result of the rainbow test indicates that a linear model fits the data well.


Checking normality of residuals assumption

#Shapiro-Wilk test for normality
shapiro.test(slr$residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  slr$residuals
## W = 0.99053, p-value = 0.2133

INTERPRETATION: The residuals appear to be normally distributed; this is confirmed by the non-significant Shapiro-Wilk test at the 5% level of significance (p = 0.2133).
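A normal Q-Q plot of the residuals gives a complementary visual check (base R sketch); points falling close to the reference line support the normality assumption.

#Q-Q plot of residuals as a visual check of normality
qqnorm(slr$residuals)
qqline(slr$residuals)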


Checking homogeneity of variance (homoscedasticity) assumption

#Plots
ols_plot_resid_fit(slr)

#Breusch Pagan Test for Heteroskedasticity
ols_test_breusch_pagan(slr)
## 
##  Breusch Pagan Test for Heteroskedasticity
##  -----------------------------------------
##  Ho: the variance is constant            
##  Ha: the variance is not constant        
## 
##               Data                
##  ---------------------------------
##  Response : sales 
##  Variables: fitted values of sales 
## 
##          Test Summary           
##  -------------------------------
##  DF            =    1 
##  Chi2          =    42.73013 
##  Prob > Chi2   =    6.283693e-11

INTERPRETATION: The Residual vs Fitted Values plot shows smaller variance for smaller fitted values and larger variance for larger fitted values. There is an observable funnel-type or V pattern in the plot indicating non-constant or unequal variance (heteroscedasticity). This is confirmed by the Breusch-Pagan test at the 5% level of significance.
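The same conclusion can be reached with check_heteroscedasticity() from the performance package loaded above, which also tests for non-constant error variance:

#Alternative check for non-constant error variance
check_heteroscedasticity(slr)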


Checking independence assumption

According to this assumption, adjacent residuals must be uncorrelated (independent). One way to help satisfy this assumption is to make sure that the data do not exhibit any kind of clustering or hierarchical structure. The Durbin-Watson test is often used to check this assumption; a significant test result indicates a violation of the independence assumption, and the residuals are then said to be autocorrelated.

durbinWatsonTest(slr)
##  lag Autocorrelation D-W Statistic p-value
##    1      0.02342385      1.934689    0.65
##  Alternative hypothesis: rho != 0
check_autocorrelation(slr)
## OK: Residuals appear to be independent and not autocorrelated (p = 0.632).

INTERPRETATION: The Durbin-Watson test result is not significant therefore the residuals are independent.

Checking outliers and influential observations

check_outliers(slr)
## OK: No outliers detected.
## - Based on the following method and threshold: cook (0.7).
## - For variable: (Whole model)

Modifying the model

Let us try transforming the sales variable with a logarithmic transformation and refit the model.

slr1 <- lm(log(sales) ~ TV,
           data = sales.data)

check_heteroscedasticity(slr1)
## Warning: Heteroscedasticity (non-constant error variance) detected (p = 0.033).
shapiro.test(slr1$residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  slr1$residuals
## W = 0.91963, p-value = 5.474e-09

NOTES: The log transformation does not solve the heteroscedasticity problem and even makes the residuals non-normal.


Weighted regression to correct heteroscedasticity

The idea of weighted regression is to assign higher weights to observations with lower variance and lower weights to observations with higher variance. There are several weighting procedures and the choice usually depends on the observed heteroscedasticity.
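Formally, weighted least squares chooses the coefficients that minimize the weighted sum of squared errors \[ \sum_{i=1}^{n} w_i \left( Y_i - \beta_0 - \beta_1 X_i \right)^2, \] so observations with larger error variance (smaller weights \(w_i\)) have less influence on the fit. The code below uses \(w_i = 1/\hat{\sigma}_i^2\), where \(\hat{\sigma}_i\) is estimated by regressing the absolute OLS residuals on the fitted values.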

#Estimate observation-level weights: regress |OLS residuals| on the fitted values,
#then take the inverse of the squared fitted values of that auxiliary regression
sales.data <- sales.data  %>% 
  mutate(wt = 1 / lm(abs(slr$residuals) ~ slr$fitted.values)$fitted.values^2)

head(sales.data)
## # A tibble: 6 × 6
##    ...1    TV radio newspaper sales     wt
##   <dbl> <dbl> <dbl>     <dbl> <dbl>  <dbl>
## 1     1 230.   37.8      69.2  22.1 0.0825
## 2     2  44.5  39.3      45.1  10.4 0.511 
## 3     3  17.2  45.9      69.3   9.3 0.838 
## 4     4 152.   41.3      58.5  18.5 0.148 
## 5     5 181.   10.8      58.4  12.9 0.117 
## 6     6   8.7  48.9      75     7.2 1.01
slr2 <- lm(sales ~ TV,
           weights = wt,
           data = sales.data)

summary(slr2)
## 
## Call:
## lm(formula = sales ~ TV, data = sales.data, weights = wt)
## 
## Weighted Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.2241 -0.8965 -0.1057  1.0011  2.5396 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 6.300885   0.240966   26.15   <2e-16 ***
## TV          0.053361   0.002308   23.12   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.249 on 198 degrees of freedom
## Multiple R-squared:  0.7298, Adjusted R-squared:  0.7284 
## F-statistic: 534.7 on 1 and 198 DF,  p-value: < 2.2e-16
check_heteroscedasticity(slr2)
## OK: Error variance appears to be homoscedastic (p = 0.162).
msummary(list("OLS" = slr, "Weighted OLS" = slr2))
                 OLS        Weighted OLS
(Intercept)      7.033      6.301
                 (0.458)    (0.241)
TV               0.048      0.053
                 (0.003)    (0.002)
Num.Obs.         200        200
R2               0.612      0.730
R2 Adj.          0.610      0.728
AIC              1044.1     1001.1
BIC              1054.0     1011.0
Log.Lik.         -519.046   -497.566
F                312.145    534.672
RMSE             3.24       3.28


Interpretation of the model fit and the regression estimates

A. Model fit

  1. \(R^2 = 73\% \implies\) 73% of the variation in sales can be explained by its linear relationship with TV advertising expenses.

  2. The model significantly fits the data, F(1, 198) = 534.7, p < 2.2e-16.

  3. The \(RMSE = 1.249\) is small relative to the observed sales values, indicating a good fit.

Based on the above overall measures of fit, the model adequately fits the data.

B. Inference on the regression coefficients

Note that the sales variable is in thousands of units and the TV variable is in thousands of dollars. Thus, an increase of $1,000 in the TV advertising budget is associated with an increase in sales of around 53 units. This effect is significant, t(198) = 23.12, p < 0.001.
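Confidence intervals for the coefficients of the weighted model can be obtained directly with confint(); a minimal sketch:

#95% confidence intervals for the regression coefficients of slr2
confint(slr2, level = 0.95)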

C. Predictions

library(ggeffects)
ggeffect(slr2)
## $TV
## # Predicted values of sales
## 
##  TV | Predicted |       95% CI
## ------------------------------
##   0 |      6.30 |  5.83,  6.78
##  40 |      8.44 |  8.07,  8.80
##  80 |     10.57 | 10.23, 10.90
## 120 |     12.70 | 12.31, 13.10
## 140 |     13.77 | 13.32, 14.22
## 180 |     15.91 | 15.32, 16.49
## 220 |     18.04 | 17.30, 18.78
## 300 |     22.31 | 21.23, 23.39
## 
## attr(,"class")
## [1] "ggalleffects" "list"        
## attr(,"model.name")
## [1] "slr2"



WORKSHOP 1

Instructions

  1. Using the Wage data set, fit a linear regression model with wage as the response variable and either year or age as the predictor.

  2. Perform all diagnostic checks of the model assumptions. Discuss your findings.

  3. Assuming all assumptions are met,

    1. Discuss the fit of the model.

    2. Interpret the estimate of the regression parameter \(\hat{\beta}_1\).



Multiple linear regression model

Oftentimes, a regression model with a single predictor variable is not enough to provide an adequate description of the response variable. This occurs when several key variables affect the response variable in important and predictive ways.

The multiple linear regression model is \[ Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p + \epsilon \]

where:

  • \(\beta_0\) is the regression constant or intercept: the mean value of Y when all X’s are equal to zero; it must be interpreted with caution

  • \(\beta_1\) is the change in the mean response for every one-unit change in \(X_1\), holding all other X’s constant

  • \(\beta_2\) is the change in the mean response for every one-unit change in \(X_2\), holding all other X’s constant

  • and so on, up to \(\beta_p\)
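In matrix form the model is \(\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}\), and the OLS estimator of the coefficient vector is \[ \hat{\boldsymbol{\beta}} = (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{y}, \] which is the estimator lm() computes for the models fitted below.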


An example of MLR analysis

#Import another data set
triticum <- read_excel("Data/triticum.xlsx")

head(triticum)
## # A tibble: 6 × 7
##   DSeed Weight Length Diameter Moisture Hardness Maturity
##   <dbl>  <dbl>  <dbl>    <dbl>    <dbl>    <dbl> <chr>   
## 1     1   30.2   3.27     2.09     10.3   -16.6  Young   
## 2     2   35.5   3.65     2.34     10.6    -8.27 Mature  
## 3     3   29.2   3.36     2.15     10.3   -21.4  Mature  
## 4     4   16.8   2.77     1.79     11.0     4.13 Mature  
## 5     5   23.4   2.78     1.8      10.0    -2.05 Young   
## 6     6   31.8   3.37     2.15     10.3   -41.8  Mature
#Examining correlations and density plots
triticum %>% 
  select(-c(DSeed)) %>% 
  ggpairs()

#Regression model fitting
mlr1 <- lm(Weight ~ Length + Diameter + Moisture + Hardness,
           data = triticum)

summary(mlr1)
## 
## Call:
## lm(formula = Weight ~ Length + Diameter + Moisture + Hardness, 
##     data = triticum)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.4465 -1.8739  0.0729  2.0695  7.3834 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -12.96977   10.15445  -1.277 0.203115    
## Length      -40.15453   14.55184  -2.759 0.006373 ** 
## Diameter     90.94293   23.17550   3.924 0.000123 ***
## Moisture     -1.80476    0.93910  -1.922 0.056167 .  
## Hardness     -0.05248    0.01621  -3.238 0.001428 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.754 on 185 degrees of freedom
## Multiple R-squared:  0.8076, Adjusted R-squared:  0.8035 
## F-statistic: 194.2 on 4 and 185 DF,  p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(mlr1)


Checking model assumptions

#Rainbow test for linearity
raintest(mlr1)
## 
##  Rainbow test
## 
## data:  mlr1
## Rain = 0.90798, df1 = 95, df2 = 90, p-value = 0.679
#Shapiro-Wilk test for normality
shapiro.test(mlr1$residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  mlr1$residuals
## W = 0.99241, p-value = 0.4269
check_outliers(mlr1)
## OK: No outliers detected.
## - Based on the following method and threshold: cook (0.9).
## - For variable: (Whole model)
#Breusch Pagan Test for Heteroskedasticity
ols_test_breusch_pagan(mlr1)
## 
##  Breusch Pagan Test for Heteroskedasticity
##  -----------------------------------------
##  Ho: the variance is constant            
##  Ha: the variance is not constant        
## 
##                Data                
##  ----------------------------------
##  Response : Weight 
##  Variables: fitted values of Weight 
## 
##         Test Summary          
##  -----------------------------
##  DF            =    1 
##  Chi2          =    0.02383978 
##  Prob > Chi2   =    0.8772933
check_heteroscedasticity(mlr1)
## OK: Error variance appears to be homoscedastic (p = 0.877).
#Testing autocorrelation
durbinWatsonTest(mlr1)
##  lag Autocorrelation D-W Statistic p-value
##    1      0.07037491      1.855916    0.27
##  Alternative hypothesis: rho != 0
check_autocorrelation(mlr1)
## OK: Residuals appear to be independent and not autocorrelated (p = 0.364).
#Checking multicollinearity
library(regclass)
VIF(mlr1)
##     Length   Diameter   Moisture   Hardness 
## 538.005939 537.888050   1.018578   1.029427

REMARKS:

  1. Multicollinearity occurs when there are strong correlations among the explanatory variables. If multicollinearity is present, the effect of one explanatory variable may be masked by the effects of the other explanatory variables; in other words, one explanatory variable carries much the same information as the others. The variance inflation factor (VIF) measures how much the variance of a regression coefficient is inflated due to multicollinearity in the model (see the formula after this list). In practice, \(VIF < 10\) is considered acceptable.

  2. In the model above, the VIFs of Length and Diameter are far above 10. Thus, multicollinearity is present.

  3. One solution is to retain only one of the strongly correlated regressors. Another approach is to replace them with a (linear) combination of the highly correlated regressors, obtained for example via principal component analysis (PCA) or another logical mathematical form. For example, we can calculate the length-to-diameter ratio (LDR) and use it as a regressor in lieu of Length and Diameter.
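For predictor \(X_j\), the VIF is computed from the coefficient of determination \(R_j^2\) of the regression of \(X_j\) on the remaining predictors: \[ VIF_j = \frac{1}{1 - R_j^2}. \] A VIF of 10 corresponds to \(R_j^2 = 0.9\), i.e., 90% of the variation in \(X_j\) is explained by the other predictors.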

mlr2 <- lm(Weight ~ Length + Moisture + Hardness, 
           data=triticum)

summary(mlr2)
## 
## Call:
## lm(formula = Weight ~ Length + Moisture + Hardness, data = triticum)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.1559 -2.0919 -0.0105  2.0710  8.4907 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -11.26912   10.53056  -1.070  0.28595    
## Length       16.89431    0.65676  25.724  < 2e-16 ***
## Moisture     -1.55934    0.97260  -1.603  0.11057    
## Hardness     -0.05213    0.01682  -3.098  0.00225 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.859 on 186 degrees of freedom
## Multiple R-squared:  0.7916, Adjusted R-squared:  0.7883 
## F-statistic: 235.5 on 3 and 186 DF,  p-value: < 2.2e-16
shapiro.test(rstandard(mlr2))
## 
##  Shapiro-Wilk normality test
## 
## data:  rstandard(mlr2)
## W = 0.99202, p-value = 0.3833
check_outliers(mlr2)
## OK: No outliers detected.
## - Based on the following method and threshold: cook (0.842).
## - For variable: (Whole model)
ols_test_breusch_pagan(mlr2)
## 
##  Breusch Pagan Test for Heteroskedasticity
##  -----------------------------------------
##  Ho: the variance is constant            
##  Ha: the variance is not constant        
## 
##                Data                
##  ----------------------------------
##  Response : Weight 
##  Variables: fitted values of Weight 
## 
##         Test Summary         
##  ----------------------------
##  DF            =    1 
##  Chi2          =    0.5627966 
##  Prob > Chi2   =    0.4531356
durbinWatsonTest(mlr2)
##  lag Autocorrelation D-W Statistic p-value
##    1       0.0676415      1.852136   0.308
##  Alternative hypothesis: rho != 0
VIF(mlr2)
##   Length Moisture Hardness 
## 1.017140 1.014060 1.029396
mlr3 <- lm(Weight ~ Diameter + Moisture + Hardness, 
           data=triticum)

summary(mlr3)
## 
## Call:
## lm(formula = Weight ~ Diameter + Moisture + Hardness, data = triticum)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.8891 -2.0515  0.0676  2.0180  8.1264 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -12.06672   10.32806  -1.168  0.24416    
## Diameter     27.05265    1.02545  26.381  < 2e-16 ***
## Moisture     -1.62809    0.95342  -1.708  0.08937 .  
## Hardness     -0.05198    0.01649  -3.152  0.00189 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.803 on 186 degrees of freedom
## Multiple R-squared:  0.7997, Adjusted R-squared:  0.7965 
## F-statistic: 247.6 on 3 and 186 DF,  p-value: < 2.2e-16
shapiro.test(rstandard(mlr3))
## 
##  Shapiro-Wilk normality test
## 
## data:  rstandard(mlr3)
## W = 0.99171, p-value = 0.3497
check_outliers(mlr3)
## OK: No outliers detected.
## - Based on the following method and threshold: cook (0.842).
## - For variable: (Whole model)
ols_test_breusch_pagan(mlr3)
## 
##  Breusch Pagan Test for Heteroskedasticity
##  -----------------------------------------
##  Ho: the variance is constant            
##  Ha: the variance is not constant        
## 
##                Data                
##  ----------------------------------
##  Response : Weight 
##  Variables: fitted values of Weight 
## 
##         Test Summary         
##  ----------------------------
##  DF            =    1 
##  Chi2          =    0.2892729 
##  Prob > Chi2   =    0.5906869
durbinWatsonTest(mlr3)
##  lag Autocorrelation D-W Statistic p-value
##    1      0.07294521       1.84538   0.294
##  Alternative hypothesis: rho != 0
VIF(mlr3)
## Diameter Moisture Hardness 
## 1.016917 1.013843 1.029302
triticum <- triticum %>% 
  mutate(LDR = Length/Diameter)

mlr4 <- lm(Weight~LDR + Moisture + Hardness, data=triticum)

summary(mlr4)
## 
## Call:
## lm(formula = Weight ~ LDR + Moisture + Hardness, data = triticum)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -12.659  -3.930  -0.976   2.874  20.594 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -336.44119   81.50086  -4.128 5.52e-05 ***
## LDR          246.50433   49.84305   4.946 1.69e-06 ***
## Moisture      -1.92436    1.95413  -0.985  0.32602    
## Hardness      -0.09543    0.03357  -2.842  0.00498 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.737 on 186 degrees of freedom
## Multiple R-squared:  0.1607, Adjusted R-squared:  0.1472 
## F-statistic: 11.87 on 3 and 186 DF,  p-value: 3.777e-07
shapiro.test(rstandard(mlr4))
## 
##  Shapiro-Wilk normality test
## 
## data:  rstandard(mlr4)
## W = 0.94625, p-value = 1.47e-06
check_outliers(mlr4)
## OK: No outliers detected.
## - Based on the following method and threshold: cook (0.842).
## - For variable: (Whole model)
ols_test_breusch_pagan(mlr4)
## 
##  Breusch Pagan Test for Heteroskedasticity
##  -----------------------------------------
##  Ho: the variance is constant            
##  Ha: the variance is not constant        
## 
##                Data                
##  ----------------------------------
##  Response : Weight 
##  Variables: fitted values of Weight 
## 
##         Test Summary         
##  ----------------------------
##  DF            =    1 
##  Chi2          =    2.039188 
##  Prob > Chi2   =    0.1532915
durbinWatsonTest(mlr4)
##  lag Autocorrelation D-W Statistic p-value
##    1     -0.02161564      2.042817   0.784
##  Alternative hypothesis: rho != 0
VIF(mlr4)
##      LDR Moisture Hardness 
## 1.007990 1.016293 1.017896


Model selection

  • Select the most parsimonious model from several competing models

  • Principle of parsimony (Occam’s Razor): among competing explanations that fit the data comparably well, the simplest should be preferred

  • There are many statistics for model selection, e.g., the Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC)
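Both criteria trade off goodness of fit against model complexity; for a model with \(k\) estimated parameters, maximized likelihood \(\hat{L}\), and \(n\) observations, \[ AIC = 2k - 2\ln\hat{L}, \qquad BIC = k\ln(n) - 2\ln\hat{L}, \] and smaller values indicate a better balance of fit and parsimony.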

msummary(list("Model 1" = mlr1, 
              "Model 2" = mlr2, 
              "Model 3" = mlr3, 
              "Model 4" = mlr4))
                Model 1     Model 2     Model 3     Model 4
(Intercept)     -12.970     -11.269     -12.067     -336.441
                (10.154)    (10.531)    (10.328)    (81.501)
Length          -40.155     16.894
                (14.552)    (0.657)
Diameter        90.943                  27.053
                (23.175)                (1.025)
Moisture        -1.805      -1.559      -1.628      -1.924
                (0.939)     (0.973)     (0.953)     (1.954)
Hardness        -0.052      -0.052      -0.052      -0.095
                (0.016)     (0.017)     (0.016)     (0.034)
LDR                                                 246.504
                                                    (49.843)
Num.Obs.        190         190         190         190
R2              0.808       0.792       0.800       0.161
R2 Adj.         0.803       0.788       0.796       0.147
AIC             931.1       944.3       936.8       1209.0
BIC             950.6       960.5       953.0       1225.2
Log.Lik.        -459.545    -467.141    -463.377    -599.500
F               194.187     235.549     247.574     11.871
RMSE            2.72        2.83        2.77        5.68
#Mean absolute error and mean absolute percentage error for each model (DescTools)
mae <- c(MAE(mlr1), MAE(mlr2), MAE(mlr3), MAE(mlr4))
mape <- c(MAPE(mlr1), MAPE(mlr2), MAPE(mlr3), MAPE(mlr4))
data.frame(mae, mape)
##        mae       mape
## 1 2.223117 0.08289007
## 2 2.308773 0.08545903
## 3 2.261153 0.08379099
## 4 4.347989 0.15541644

INTERPRETATION:

  1. Based on the assumption checks and the overall measures of fit (F, R2 Adj., RMSE, AIC, and BIC), Model 3 (with Diameter, Moisture, and Hardness as predictors) is preferred: Model 1 fits marginally better but suffers from severe multicollinearity, while Models 2 and 4 fit worse.

  2. A 1-cm increase in the diameter of a seed corresponds to an increase of 27.05 g in seed weight, holding moisture and hardness fixed. This effect is significant, t(186) = 26.381, p < 0.001.

  3. The signs of the estimates for moisture and hardness are negative, indicating negative effects on weight.



Training and Testing a Model

It is a common practice to split a data set into training and testing sets. The reasons for doing this are:

  1. Prevent Overfitting: By splitting the data, you ensure that the model isn’t just memorizing the training data but is able to generalize well to unseen data.

  2. Unbiased Evaluation: Using a separate test set provides an unbiased evaluation of the model’s performance.

  3. Hyperparameter Tuning: Validation sets help in tuning the model’s hyperparameters to improve performance.

The training set is used to build the model while the test set is used to provide an unbiased evaluation of a final model fit on the training dataset.

#Data splitting
set.seed(123)

split <- initial_split(data = triticum,
                       prop = 0.8)

train.data <- training(split)
test.data <- testing(split)

dim(train.data)
## [1] 152   8
dim(test.data)
## [1] 38  8
#Fitting the model to the training set
mlr5 <- lm(Weight ~ Diameter + Moisture + Hardness + Maturity,
           data = train.data)

summary(mlr5)
## 
## Call:
## lm(formula = Weight ~ Diameter + Moisture + Hardness + Maturity, 
##     data = train.data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.8105 -2.0228  0.0859  2.0659  7.9314 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -13.86163   11.42318  -1.213 0.226898    
## Diameter       27.33861    1.12318  24.340  < 2e-16 ***
## Moisture       -1.56445    1.05866  -1.478 0.141609    
## Hardness       -0.06999    0.01896  -3.691 0.000314 ***
## MaturityYoung   0.31097    0.46589   0.667 0.505521    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.824 on 147 degrees of freedom
## Multiple R-squared:  0.815,  Adjusted R-squared:   0.81 
## F-statistic: 161.9 on 4 and 147 DF,  p-value: < 2.2e-16
#Evaluating the model accuracy in the test set
test.data <- test.data %>% 
  mutate(Pred.wt = predict(mlr5, test.data))
head(test.data)
## # A tibble: 6 × 9
##   DSeed Weight Length Diameter Moisture Hardness Maturity   LDR Pred.wt
##   <dbl>  <dbl>  <dbl>    <dbl>    <dbl>    <dbl> <chr>    <dbl>   <dbl>
## 1     2   35.5   3.65     2.34     10.6    -8.27 Mature    1.56    34.1
## 2    10   28.0   3.2      2.05     10.2   -16.3  Mature    1.56    27.3
## 3    15   29.8   3.53     2.26     10.4   -25.8  Mature    1.56    33.5
## 4    18   20.9   3.08     1.97     11.0    -6.41 Mature    1.56    23.3
## 5    28   26.8   3.09     1.98     10.3     6.31 Young     1.56    24.1
## 6    33   34.5   3.57     2.29     10.6    -9.14 Young     1.56    33.1
# Overall accuracy assessment
postResample(pred = test.data$Pred.wt, 
             obs = test.data$Weight)
##      RMSE  Rsquared       MAE 
## 2.8091612 0.7196884 2.3814392
#Check model assumptions visually
check_model(mlr5)

#Visualize prediction

ggeffect(mlr5) 
## $Diameter
## # Predicted values of Weight
## 
## Diameter | Predicted |       95% CI
## -----------------------------------
##     1.60 |     14.51 | 13.28, 15.74
##     1.70 |     17.25 | 16.22, 18.27
##     1.90 |     22.72 | 22.06, 23.37
##     2.00 |     25.45 | 24.93, 25.97
##     2.10 |     28.18 | 27.73, 28.64
##     2.30 |     33.65 | 33.04, 34.26
##     2.40 |     36.38 | 35.61, 37.16
##     2.70 |     44.59 | 43.21, 45.96
## 
## $Moisture
## # Predicted values of Weight
## 
## Moisture | Predicted |       95% CI
## -----------------------------------
##    10.00 |     29.44 | 28.23, 30.66
##    10.10 |     29.29 | 28.27, 30.31
##    10.30 |     28.97 | 28.30, 29.65
##    10.40 |     28.82 | 28.28, 29.35
##    10.50 |     28.66 | 28.20, 29.12
##    10.70 |     28.35 | 27.78, 28.92
##    10.80 |     28.19 | 27.48, 28.90
##    11.10 |     27.72 | 26.46, 28.98
## 
## $Hardness
## # Predicted values of Weight
## 
## Hardness | Predicted |       95% CI
## -----------------------------------
##      -50 |     31.15 | 29.71, 32.59
##      -40 |     30.45 | 29.36, 31.54
##      -30 |     29.75 | 28.99, 30.52
##      -20 |     29.05 | 28.54, 29.57
##      -10 |     28.35 | 27.88, 28.83
##        0 |     27.65 | 26.97, 28.33
##       10 |     26.95 | 25.96, 27.95
##       30 |     25.55 | 23.86, 27.25
## 
## $Maturity
## # Predicted values of Weight
## 
## Maturity | Predicted |       95% CI
## -----------------------------------
## Mature   |     28.44 | 27.76, 29.11
## Young    |     28.75 | 28.13, 29.37
## 
## 
## attr(,"class")
## [1] "ggalleffects" "list"        
## attr(,"model.name")
## [1] "mlr5"
ggeffect(mlr5) %>% 
  plot() %>% 
  plot_grid()

Reporting the results

tbl_regression(mlr5,
               add_pairwise_contrasts = TRUE,
               pvalue_fun = label_style_pvalue(digits = 2)) %>% 
  bold_p()
Characteristic     Beta    95% CI          p-value
Diameter           27      25, 30          <0.001
Moisture           -1.6    -3.7, 0.53      0.14
Hardness           -0.07   -0.11, -0.03    <0.001
Maturity
  Young - Mature   0.31    -0.61, 1.2      0.51
Abbreviation: CI = Confidence Interval



WORKSHOP 2

Instructions

  1. Filter the Wage data set to include only individuals with wage < 200 and age < 60 using the following code chunk.
Wage.f <- Wage %>% 
  filter(wage < 200 & age < 60)
  2. Split the Wage.f data frame into training and testing sets.

  3. Using the training set, fit a linear regression model with wage as the response variable and year, age, job status, and education as predictors.

  4. Perform all diagnostic checks of the model assumptions. Discuss your findings.

  5. Evaluate the accuracy of the predictions obtained from the model.

  6. Report and interpret the parameter estimates and their significance.