Regression analysis
It is a technique for studying the dependence of one variable (called the dependent variable) on one or more independent variables (called explanatory variables)
The goals of regression analysis are:
Estimating the relationship between the dependent variable and the explanatory variable(s)
Evaluating the effect of each of the explanatory variables on the dependent variable, controlling for the effects of all other explanatory variables
Predicting the value of the dependent variable for given values of the explanatory variables
In regression analysis we wish to express the relationship between the dependent variable and the explanatory variable(s) in functional form \[ Y = f(X) + \epsilon \]
If we observe pairs (X, Y) and the scatterplot shows a fairly linear pattern, we can let \[ f(X) = \beta_0 + \beta_1 X \]
Thus, the (simple) linear regression model is given by \[ Y = \beta_0 + \beta_1X + \epsilon \] where:
For example, X may represent TV advertising and Y may represent sales.
\(\beta_0\) is the y-intercept of the regression line. It is the mean response when X = 0
\(\beta_0\) is meaningful only if X = 0 is a plausible value of the predictor that lies within the range of the observed data
\(\beta_1\) is the slope of the regression line. It is the change in the mean response for every one-unit increase in X
The regression parameters \(\beta_0\) and \(\beta_1\) are unknown quantities that must be estimated from the data
The most common and popular method of estimating regression parameters is the Method of Least Squares
The resulting estimators are called Ordinary Least Squares (OLS) estimators
The OLS estimators of \(\beta_0\) and \(\beta_1\) are \(\hat{\beta_0}\) and \(\hat{\beta_1}\), respectively
\(\hat{\beta_0} = b_0\) and \(\hat{\beta_1} = b_1\)
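For reference, the least squares criterion chooses \(\beta_0\) and \(\beta_1\) to minimize the sum of squared errors \(\sum_{i=1}^{n}(Y_i - \beta_0 - \beta_1 X_i)^2\), which yields the closed-form estimators \[ \hat{\beta}_1 = \frac{\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n}(X_i - \bar{X})^2}, \qquad \hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X} \]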
Standard assumptions
Y is a continuous random variable
\(\epsilon_i \sim N(0, \sigma^2), \forall \; i \Longrightarrow\) normality and homogeneous variance assumptions
For two different observations, \(i\) and \(j\), the error terms \(\epsilon_i\) and \(\epsilon_j\) are independent \(\Longrightarrow\) independence assumption
Assumptions on the predictor variables
The predictor variables are assumed fixed or selected in advance.
The values of the predictors are assumed to be measured without error.
The predictor variables are assumed to be linearly independent of each other (No multicollinearity) \(\Longrightarrow\) multicollinearity occurs if at least 2 of the regressors are strongly and significantly correlated
Assumption about the observations
Coefficient of determination (\(R^2\)):
the percentage of variation in Y that can be explained or attributed to its linear relation with X
the closer to 1 or 100% the better the fit
F test of overall effect of X: \(F = \frac{MSR}{MSE}\)
Root Mean Square Error (RMSE):
standard deviation of the residuals (the differences between the observed values of Y and the values predicted by the model)
the closer to zero the better the fit
\(H_0: \beta_1 = 0\) (X has no effect on Y)
\(H_1: \beta_1 \neq 0\) (X has an effect on Y)
Test statistic: \(t=\frac{\hat{\beta_1}}{se(\hat{\beta_1})}\)
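In symbols, writing \(SST = \sum_i (Y_i - \bar{Y})^2\), \(SSR = \sum_i (\hat{Y}_i - \bar{Y})^2\), and \(SSE = \sum_i (Y_i - \hat{Y}_i)^2\), the quantities above are \[ R^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST}, \qquad RMSE = \sqrt{\frac{SSE}{n}}, \qquad F = \frac{MSR}{MSE} = \frac{SSR/1}{SSE/(n-2)} \] (some software reports the residual standard error \(\sqrt{SSE/(n-2)}\) instead of the RMSE). Under \(H_0\), the t statistic follows a t distribution with \(n-2\) degrees of freedom, and in simple linear regression \(F = t^2\).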
#Loading required packages
library(ISLR)
library(readxl)
library(tidyverse)
library(ggpubr)
library(performance)
library(car)
library(caret)
library(olsrr)
library(jtools)
library(moments)
library(lmtest)
library(see)
library(caTools)
library(gtsummary)
library(modelsummary)
library(DescTools)
library(ggeffects)
library(rsample)
library(regclass)
library(sjPlot)
#Importing the data into R
sales.data <- read_excel("Data/Advertising.xlsx")
head(sales.data) #Displays the 1st 6 rows of the data frame
## # A tibble: 6 × 5
## ...1 TV radio newspaper sales
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 230. 37.8 69.2 22.1
## 2 2 44.5 39.3 45.1 10.4
## 3 3 17.2 45.9 69.3 9.3
## 4 4 152. 41.3 58.5 18.5
## 5 5 181. 10.8 58.4 12.9
## 6 6 8.7 48.9 75 7.2
#Visualization
ggscatter(x = "TV",
y = "sales",
data = sales.data,
color = "darkgreen",
shape = 19, size = 1.75,
add = "reg.line",
conf.int = TRUE,
add.params = list(color = "blue", fill = "lightgray")) +
labs(x = "Spending on TV ads",
y = "Sales")
#Code for SLR model
slr <- lm(sales ~ TV,
data = sales.data)
#Displays the output of the analysis
summary(slr)
##
## Call:
## lm(formula = sales ~ TV, data = sales.data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.3860 -1.9545 -0.1913 2.0671 7.2124
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.032594 0.457843 15.36 <2e-16 ***
## TV 0.047537 0.002691 17.67 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.259 on 198 degrees of freedom
## Multiple R-squared: 0.6119, Adjusted R-squared: 0.6099
## F-statistic: 312.1 on 1 and 198 DF, p-value: < 2.2e-16
Before the results of a regression analysis are interpreted, we need to make sure that the assumptions of the model are satisfied. Conclusions drawn from a regression analysis are valid and reliable only if the assumptions are satisfied. Regression diagnostics refers to the detection of violations of the model assumptions; it is also called residual analysis because many of the assumptions are checked using the residuals. The residuals are the differences between the observed values of Y and the values predicted by the model. They are denoted by \(\hat{\epsilon}\) and are computed as \[ \hat{\epsilon} = Y - \hat{Y} \]
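As a quick illustration, the residuals can be computed by hand from the slr model fitted above and compared with those stored in the lm object:
#Residuals by hand: observed sales minus the values predicted by the model
eps.hat <- sales.data$sales - fitted(slr)
#Should agree with the residuals stored in the lm object (up to rounding)
all.equal(unname(eps.hat), unname(residuals(slr)))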
library(GGally)
library(lmtest)
#Pairwise scatter plots of DV with IV
sales.data |>
select(TV, sales) |>
ggpairs()
#Rainbow test for linearity
raintest(slr)
##
## Rainbow test
##
## data: slr
## Rain = 1.1365, df1 = 100, df2 = 98, p-value = 0.2631
INTERPRETATION: The non-significant result of the rainbow test indicates that a linear model fits the data well.
#Shapiro-Wilk test for normality
shapiro.test(slr$residuals)
##
## Shapiro-Wilk normality test
##
## data: slr$residuals
## W = 0.99053, p-value = 0.2133
INTERPRETATION: The residuals appear to be normally distributed. This is confirmed by the non-significant Shapiro-Wilk test at the 5% level of significance (p = 0.2133).
#Plots
ols_plot_resid_fit(slr)
#Breusch Pagan Test for Heteroskedasticity
ols_test_breusch_pagan(slr)
##
## Breusch Pagan Test for Heteroskedasticity
## -----------------------------------------
## Ho: the variance is constant
## Ha: the variance is not constant
##
## Data
## ---------------------------------
## Response : sales
## Variables: fitted values of sales
##
## Test Summary
## -------------------------------
## DF = 1
## Chi2 = 42.73013
## Prob > Chi2 = 6.283693e-11
INTERPRETATION: The Residual vs Fitted Values plot shows smaller variance for smaller fitted values and larger variance for larger fitted values. There is an observable funnel-type or V pattern in the plot indicating non-constant or unequal variance (heteroscedasticity). This is confirmed by the Breusch-Pagan test at the 5% level of significance.
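The same test is also available as bptest() from the lmtest package (already loaded). Note that bptest() applies the studentized (Koenker) version by default, so its statistic may differ slightly from the olsrr output above:
#Breusch-Pagan test via lmtest (studentized version by default)
bptest(slr)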
According to this assumption, adjacent residuals must be uncorrelated (independent). One way to achieve this is to make sure that the data do not exhibit any kind of clustering or hierarchical structure. The Durbin-Watson test is often used to check this assumption. A significant test result indicates a violation of the independence assumption, and the residuals are then said to be autocorrelated.
durbinWatsonTest(slr)
## lag Autocorrelation D-W Statistic p-value
## 1 0.02342385 1.934689 0.65
## Alternative hypothesis: rho != 0
check_autocorrelation(slr)
## OK: Residuals appear to be independent and not autocorrelated (p = 0.632).
INTERPRETATION: The Durbin-Watson test result is not significant therefore the residuals are independent.
check_outliers(slr)
## OK: No outliers detected.
## - Based on the following method and threshold: cook (0.7).
## - For variable: (Whole model)
Let us try transforming the sales variable with a logarithm and refitting the model.
slr1 <- lm(log(sales) ~ TV,
data = sales.data)
check_heteroscedasticity(slr1)
## Warning: Heteroscedasticity (non-constant error variance) detected (p = 0.033).
shapiro.test(slr1$residuals)
##
## Shapiro-Wilk normality test
##
## data: slr1$residuals
## W = 0.91963, p-value = 5.474e-09
NOTES: The log transformation does not solve the heteroscedasticity problem and even makes the residuals non-normal.
The idea of weighted regression is to assign higher weights to observations with lower variance and lower weights to observations with higher variance. There are several weighting procedures and the choice usually depends on the observed heteroscedasticity.
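In weighted least squares the parameter estimates minimize \[ \sum_{i=1}^{n} w_i (Y_i - \beta_0 - \beta_1 X_i)^2 \] so observations with larger error variance receive smaller weights. In the code below, the weights are taken as \(w_i = 1/\hat{s}_i^2\), where \(\hat{s}_i\) is the fitted value from regressing the absolute OLS residuals on the OLS fitted values, a rough estimate of the error standard deviation for each observation.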
sales.data <- sales.data %>%
mutate(wt = 1 / lm(abs(slr$residuals) ~ slr$fitted.values)$fitted.values^2)
head(sales.data)
## # A tibble: 6 × 6
## ...1 TV radio newspaper sales wt
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 230. 37.8 69.2 22.1 0.0825
## 2 2 44.5 39.3 45.1 10.4 0.511
## 3 3 17.2 45.9 69.3 9.3 0.838
## 4 4 152. 41.3 58.5 18.5 0.148
## 5 5 181. 10.8 58.4 12.9 0.117
## 6 6 8.7 48.9 75 7.2 1.01
slr2 <- lm(sales ~ TV,
weights = wt,
data = sales.data)
summary(slr2)
##
## Call:
## lm(formula = sales ~ TV, data = sales.data, weights = wt)
##
## Weighted Residuals:
## Min 1Q Median 3Q Max
## -5.2241 -0.8965 -0.1057 1.0011 2.5396
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.300885 0.240966 26.15 <2e-16 ***
## TV 0.053361 0.002308 23.12 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.249 on 198 degrees of freedom
## Multiple R-squared: 0.7298, Adjusted R-squared: 0.7284
## F-statistic: 534.7 on 1 and 198 DF, p-value: < 2.2e-16
check_heteroscedasticity(slr2)
## OK: Error variance appears to be homoscedastic (p = 0.162).
msummary(list("OLS" = slr, "Weighted OLS" = slr2))
|             | OLS      | Weighted OLS |
|-------------|----------|--------------|
| (Intercept) | 7.033    | 6.301        |
|             | (0.458)  | (0.241)      |
| TV          | 0.048    | 0.053        |
|             | (0.003)  | (0.002)      |
| Num.Obs.    | 200      | 200          |
| R2          | 0.612    | 0.730        |
| R2 Adj.     | 0.610    | 0.728        |
| AIC         | 1044.1   | 1001.1       |
| BIC         | 1054.0   | 1011.0       |
| Log.Lik.    | -519.046 | -497.566     |
| F           | 312.145  | 534.672      |
| RMSE        | 3.24     | 3.28         |
A. Model fit
\(R^2 = 73\% \implies\) 73% of the variation in sales can be explained by its linear relationship with TV advertising expenses.
The model significantly fits the data, F(1, 198) = 534.7, p < 2.2e-16.
The residual standard error of the weighted fit (1.249) is small and close to zero.
Based on the above overall measures of fit, the model adequately fits the data.
B. Inference on the regression coefficients
Note that the sales variable is in thousands of units, and the TV variable is in thousands of dollars. Thus, an increase of $1,000 in the TV advertising budget is associated with an increase in sales of around 53 units. This effect is significant, t(198) = 23.12, p < 0.001.
C. Predictions
library(ggeffects)
ggeffect(slr2)
## $TV
## # Predicted values of sales
##
## TV | Predicted | 95% CI
## ------------------------------
## 0 | 6.30 | 5.83, 6.78
## 40 | 8.44 | 8.07, 8.80
## 80 | 10.57 | 10.23, 10.90
## 120 | 12.70 | 12.31, 13.10
## 140 | 13.77 | 13.32, 14.22
## 180 | 15.91 | 15.32, 16.49
## 220 | 18.04 | 17.30, 18.78
## 300 | 22.31 | 21.23, 23.39
##
## attr(,"class")
## [1] "ggalleffects" "list"
## attr(,"model.name")
## [1] "slr2"
Instructions
Using the Wage data set, fit a linear regression model with wage as response variable and either one of year and age as predictor.
Perform all diagnostic checking of the model assumptions. Discuss your findings.
Assuming all assumptions are met,
Discuss the fit of the model.
Interpret the estimate of the regression parameter \(\hat{\beta}_1\).
Oftentimes, a regression model with a single predictor variable is not enough to provide an adequate description of the response variable. This occurs when several key variables affect the response variable in important and predictive ways.
The multiple linear regression model is \[ Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p + \epsilon \]
where:
\(\beta_0, \beta_1, \cdots, \beta_p\) are the regression coefficients
\(\epsilon\) is the random error, \(\epsilon \sim N(0, \sigma^2)\)
NOTES:
\(\beta_0\) is the regression constant or intercept and is the mean of Y when all X’s are equal to zero; it must be interpreted with caution
\(\beta_1\) is the change in the mean response for every one-unit change in \(X_1\), holding all other X’s constant
\(\beta_2\) is the change in the mean response for every one-unit change in \(X_2\), holding all other X’s constant
\(\vdots\)
and so on
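In matrix form, with \(\mathbf{X}\) the \(n \times (p+1)\) design matrix whose first column is all ones, the model and its OLS estimator can be written compactly as \[ \mathbf{Y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}, \qquad \hat{\boldsymbol{\beta}} = (\mathbf{X}^\top \mathbf{X})^{-1}\mathbf{X}^\top \mathbf{Y} \]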
#Import another data set
triticum <- read_excel("Data/triticum.xlsx")
head(triticum)
## # A tibble: 6 × 7
## DSeed Weight Length Diameter Moisture Hardness Maturity
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 1 30.2 3.27 2.09 10.3 -16.6 Young
## 2 2 35.5 3.65 2.34 10.6 -8.27 Mature
## 3 3 29.2 3.36 2.15 10.3 -21.4 Mature
## 4 4 16.8 2.77 1.79 11.0 4.13 Mature
## 5 5 23.4 2.78 1.8 10.0 -2.05 Young
## 6 6 31.8 3.37 2.15 10.3 -41.8 Mature
#Examining correlations and density plots
triticum %>%
select(-c(DSeed)) %>%
ggpairs()
#Regression model fitting
mlr1 <- lm(Weight ~ Length + Diameter + Moisture + Hardness,
data = triticum)
summary(mlr1)
##
## Call:
## lm(formula = Weight ~ Length + Diameter + Moisture + Hardness,
## data = triticum)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.4465 -1.8739 0.0729 2.0695 7.3834
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -12.96977 10.15445 -1.277 0.203115
## Length -40.15453 14.55184 -2.759 0.006373 **
## Diameter 90.94293 23.17550 3.924 0.000123 ***
## Moisture -1.80476 0.93910 -1.922 0.056167 .
## Hardness -0.05248 0.01621 -3.238 0.001428 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.754 on 185 degrees of freedom
## Multiple R-squared: 0.8076, Adjusted R-squared: 0.8035
## F-statistic: 194.2 on 4 and 185 DF, p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(mlr1)
#Rainbow test for linearity
raintest(mlr1)
##
## Rainbow test
##
## data: mlr1
## Rain = 0.90798, df1 = 95, df2 = 90, p-value = 0.679
#Shapiro-Wilk test for normality
shapiro.test(mlr1$residuals)
##
## Shapiro-Wilk normality test
##
## data: mlr1$residuals
## W = 0.99241, p-value = 0.4269
check_outliers(mlr1)
## OK: No outliers detected.
## - Based on the following method and threshold: cook (0.9).
## - For variable: (Whole model)
#Breusch Pagan Test for Heteroskedasticity
ols_test_breusch_pagan(mlr1)
##
## Breusch Pagan Test for Heteroskedasticity
## -----------------------------------------
## Ho: the variance is constant
## Ha: the variance is not constant
##
## Data
## ----------------------------------
## Response : Weight
## Variables: fitted values of Weight
##
## Test Summary
## -----------------------------
## DF = 1
## Chi2 = 0.02383978
## Prob > Chi2 = 0.8772933
check_heteroscedasticity(mlr1)
## OK: Error variance appears to be homoscedastic (p = 0.877).
#Testing autocorrelation
durbinWatsonTest(mlr1)
## lag Autocorrelation D-W Statistic p-value
## 1 0.07037491 1.855916 0.27
## Alternative hypothesis: rho != 0
check_autocorrelation(mlr1)
## OK: Residuals appear to be independent and not autocorrelated (p = 0.364).
#Checking multicollinearity
library(regclass)
VIF(mlr1)
## Length Diameter Moisture Hardness
## 538.005939 537.888050 1.018578 1.029427
REMARKS:
Multicollinearity occurs when there are strong correlations between explanatory variables. If multicollinearity is present, the effect of one explanatory variable may be masked by the effects of the other explanatory variables. In other words, one explanatory variable carries (nearly) the same information as the other explanatory variables. The variance inflation factor (VIF) measures how much the variance of a regression coefficient is inflated due to multicollinearity in the model. In practice, \(VIF < 10\) is acceptable.
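For the \(j\)-th regressor, \[ VIF_j = \frac{1}{1 - R_j^2} \] where \(R_j^2\) is the coefficient of determination from regressing \(X_j\) on all the other regressors.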
In the model above, the VIFs of Length and Diameter are far above 10. Thus, multicollinearity is present.
One solution is to retain only one of the strongly correlated regressors. Another approach is to combine the highly correlated regressors into a single variable, say via principal component analysis (PCA) or some other sensible mathematical form. For example, we can calculate the length-to-diameter ratio (LDR) and use it as a regressor in lieu of Length and Diameter.
mlr2 <- lm(Weight ~ Length + Moisture + Hardness,
data=triticum)
summary(mlr2)
##
## Call:
## lm(formula = Weight ~ Length + Moisture + Hardness, data = triticum)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.1559 -2.0919 -0.0105 2.0710 8.4907
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -11.26912 10.53056 -1.070 0.28595
## Length 16.89431 0.65676 25.724 < 2e-16 ***
## Moisture -1.55934 0.97260 -1.603 0.11057
## Hardness -0.05213 0.01682 -3.098 0.00225 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.859 on 186 degrees of freedom
## Multiple R-squared: 0.7916, Adjusted R-squared: 0.7883
## F-statistic: 235.5 on 3 and 186 DF, p-value: < 2.2e-16
shapiro.test(rstandard(mlr2))
##
## Shapiro-Wilk normality test
##
## data: rstandard(mlr2)
## W = 0.99202, p-value = 0.3833
check_outliers(mlr2)
## OK: No outliers detected.
## - Based on the following method and threshold: cook (0.842).
## - For variable: (Whole model)
ols_test_breusch_pagan(mlr2)
##
## Breusch Pagan Test for Heteroskedasticity
## -----------------------------------------
## Ho: the variance is constant
## Ha: the variance is not constant
##
## Data
## ----------------------------------
## Response : Weight
## Variables: fitted values of Weight
##
## Test Summary
## ----------------------------
## DF = 1
## Chi2 = 0.5627966
## Prob > Chi2 = 0.4531356
durbinWatsonTest(mlr2)
## lag Autocorrelation D-W Statistic p-value
## 1 0.0676415 1.852136 0.308
## Alternative hypothesis: rho != 0
VIF(mlr2)
## Length Moisture Hardness
## 1.017140 1.014060 1.029396
mlr3 <- lm(Weight ~ Diameter + Moisture + Hardness,
data=triticum)
summary(mlr3)
##
## Call:
## lm(formula = Weight ~ Diameter + Moisture + Hardness, data = triticum)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.8891 -2.0515 0.0676 2.0180 8.1264
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -12.06672 10.32806 -1.168 0.24416
## Diameter 27.05265 1.02545 26.381 < 2e-16 ***
## Moisture -1.62809 0.95342 -1.708 0.08937 .
## Hardness -0.05198 0.01649 -3.152 0.00189 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.803 on 186 degrees of freedom
## Multiple R-squared: 0.7997, Adjusted R-squared: 0.7965
## F-statistic: 247.6 on 3 and 186 DF, p-value: < 2.2e-16
shapiro.test(rstandard(mlr3))
##
## Shapiro-Wilk normality test
##
## data: rstandard(mlr3)
## W = 0.99171, p-value = 0.3497
check_outliers(mlr3)
## OK: No outliers detected.
## - Based on the following method and threshold: cook (0.842).
## - For variable: (Whole model)
ols_test_breusch_pagan(mlr3)
##
## Breusch Pagan Test for Heteroskedasticity
## -----------------------------------------
## Ho: the variance is constant
## Ha: the variance is not constant
##
## Data
## ----------------------------------
## Response : Weight
## Variables: fitted values of Weight
##
## Test Summary
## ----------------------------
## DF = 1
## Chi2 = 0.2892729
## Prob > Chi2 = 0.5906869
durbinWatsonTest(mlr3)
## lag Autocorrelation D-W Statistic p-value
## 1 0.07294521 1.84538 0.294
## Alternative hypothesis: rho != 0
VIF(mlr3)
## Diameter Moisture Hardness
## 1.016917 1.013843 1.029302
triticum <- triticum %>%
mutate(LDR = Length/Diameter)
mlr4 <- lm(Weight~LDR + Moisture + Hardness, data=triticum)
summary(mlr4)
##
## Call:
## lm(formula = Weight ~ LDR + Moisture + Hardness, data = triticum)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.659 -3.930 -0.976 2.874 20.594
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -336.44119 81.50086 -4.128 5.52e-05 ***
## LDR 246.50433 49.84305 4.946 1.69e-06 ***
## Moisture -1.92436 1.95413 -0.985 0.32602
## Hardness -0.09543 0.03357 -2.842 0.00498 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.737 on 186 degrees of freedom
## Multiple R-squared: 0.1607, Adjusted R-squared: 0.1472
## F-statistic: 11.87 on 3 and 186 DF, p-value: 3.777e-07
shapiro.test(rstandard(mlr4))
##
## Shapiro-Wilk normality test
##
## data: rstandard(mlr4)
## W = 0.94625, p-value = 1.47e-06
check_outliers(mlr4)
## OK: No outliers detected.
## - Based on the following method and threshold: cook (0.842).
## - For variable: (Whole model)
ols_test_breusch_pagan(mlr4)
##
## Breusch Pagan Test for Heteroskedasticity
## -----------------------------------------
## Ho: the variance is constant
## Ha: the variance is not constant
##
## Data
## ----------------------------------
## Response : Weight
## Variables: fitted values of Weight
##
## Test Summary
## ----------------------------
## DF = 1
## Chi2 = 2.039188
## Prob > Chi2 = 0.1532915
durbinWatsonTest(mlr4)
## lag Autocorrelation D-W Statistic p-value
## 1 -0.02161564 2.042817 0.784
## Alternative hypothesis: rho != 0
VIF(mlr4)
## LDR Moisture Hardness
## 1.007990 1.016293 1.017896
Select the most parsimonious model from several competing models
Principle of parsimony (Occam’s Razor): the correct explanation is the simplest explanation
There are many statistics for model selection, e.g., the Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC), both defined below
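For a model with \(k\) estimated parameters and maximized likelihood \(\hat{L}\) fitted to \(n\) observations, \[ AIC = 2k - 2\ln\hat{L}, \qquad BIC = k\ln(n) - 2\ln\hat{L} \] Smaller values of AIC and BIC indicate a better balance between goodness of fit and model complexity.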
msummary(list("Model 1" = mlr1,
"Model 2" = mlr2,
"Model 3" = mlr3,
"Model 4" = mlr4))
|             | Model 1  | Model 2  | Model 3  | Model 4  |
|-------------|----------|----------|----------|----------|
| (Intercept) | -12.970  | -11.269  | -12.067  | -336.441 |
|             | (10.154) | (10.531) | (10.328) | (81.501) |
| Length      | -40.155  | 16.894   |          |          |
|             | (14.552) | (0.657)  |          |          |
| Diameter    | 90.943   |          | 27.053   |          |
|             | (23.175) |          | (1.025)  |          |
| Moisture    | -1.805   | -1.559   | -1.628   | -1.924   |
|             | (0.939)  | (0.973)  | (0.953)  | (1.954)  |
| Hardness    | -0.052   | -0.052   | -0.052   | -0.095   |
|             | (0.016)  | (0.017)  | (0.016)  | (0.034)  |
| LDR         |          |          |          | 246.504  |
|             |          |          |          | (49.843) |
| Num.Obs.    | 190      | 190      | 190      | 190      |
| R2          | 0.808    | 0.792    | 0.800    | 0.161    |
| R2 Adj.     | 0.803    | 0.788    | 0.796    | 0.147    |
| AIC         | 931.1    | 944.3    | 936.8    | 1209.0   |
| BIC         | 950.6    | 960.5    | 953.0    | 1225.2   |
| Log.Lik.    | -459.545 | -467.141 | -463.377 | -599.500 |
| F           | 194.187  | 235.549  | 247.574  | 11.871   |
| RMSE        | 2.72     | 2.83     | 2.77     | 5.68     |
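Two additional accuracy measures, computed below with MAE() and MAPE() (here the DescTools versions, which accept a fitted lm object), are the mean absolute error and the mean absolute percentage error: \[ MAE = \frac{1}{n}\sum_{i=1}^{n}\left|Y_i - \hat{Y}_i\right|, \qquad MAPE = \frac{1}{n}\sum_{i=1}^{n}\left|\frac{Y_i - \hat{Y}_i}{Y_i}\right| \] Smaller values indicate better in-sample predictive accuracy.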
mae <- c(MAE(mlr1),MAE(mlr2), MAE(mlr3), MAE(mlr4))
mape <- c(MAPE(mlr1),MAPE(mlr2), MAPE(mlr3), MAPE(mlr4))
data.frame(mae, mape)
## mae mape
## 1 2.223117 0.08289007
## 2 2.308773 0.08545903
## 3 2.261153 0.08379099
## 4 4.347989 0.15541644
INTERPRETATION
Based on the assumption checks, the measures of fit (F, R2 Adj., RMSE), AIC, and BIC, it is clear that Model 3 (Diameter, Moisture, and Hardness as predictors) generally outperforms the other three models.
A 1-cm increase in the diameter of a seed corresponds to an increase of 27.05 g in seed weight, assuming moisture and hardness are fixed. This effect is significant, t(186) = 26.381, p < 0.001.
The signs of the estimates for Moisture and Hardness are negative, indicating negative effects on weight.
It is a common practice to split a data set into training and testing sets. The reasons for doing this are:
Prevent Overfitting: By splitting the data, you ensure that the model isn’t just memorizing the training data but is able to generalize well to unseen data.
Unbiased Evaluation: Using a separate test set provides an unbiased evaluation of the model’s performance.
Hyperparameter Tuning: Validation sets help in tuning the model’s hyperparameters to improve performance.
The training set is used to build the model while the test set is used to provide an unbiased evaluation of a final model fit on the training dataset.
#Data splitting
set.seed(123)
split <- initial_split(data = triticum,
prop = 0.8)
train.data <- training(split)
test.data <- testing(split)
dim(train.data)
## [1] 152 8
dim(test.data)
## [1] 38 8
#Fitting the model to the training set
mlr5 <- lm(Weight ~ Diameter + Moisture + Hardness + Maturity,
data = train.data)
summary(mlr5)
##
## Call:
## lm(formula = Weight ~ Diameter + Moisture + Hardness + Maturity,
## data = train.data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.8105 -2.0228 0.0859 2.0659 7.9314
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -13.86163 11.42318 -1.213 0.226898
## Diameter 27.33861 1.12318 24.340 < 2e-16 ***
## Moisture -1.56445 1.05866 -1.478 0.141609
## Hardness -0.06999 0.01896 -3.691 0.000314 ***
## MaturityYoung 0.31097 0.46589 0.667 0.505521
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.824 on 147 degrees of freedom
## Multiple R-squared: 0.815, Adjusted R-squared: 0.81
## F-statistic: 161.9 on 4 and 147 DF, p-value: < 2.2e-16
#Evaluating the model accuracy in the test set
test.data <- test.data %>%
mutate(Pred.wt = predict(mlr5, test.data))
head(test.data)
## # A tibble: 6 × 9
## DSeed Weight Length Diameter Moisture Hardness Maturity LDR Pred.wt
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <dbl> <dbl>
## 1 2 35.5 3.65 2.34 10.6 -8.27 Mature 1.56 34.1
## 2 10 28.0 3.2 2.05 10.2 -16.3 Mature 1.56 27.3
## 3 15 29.8 3.53 2.26 10.4 -25.8 Mature 1.56 33.5
## 4 18 20.9 3.08 1.97 11.0 -6.41 Mature 1.56 23.3
## 5 28 26.8 3.09 1.98 10.3 6.31 Young 1.56 24.1
## 6 33 34.5 3.57 2.29 10.6 -9.14 Young 1.56 33.1
# Overall accuracy assessment
postResample(pred = test.data$Pred.wt,
obs = test.data$Weight)
## RMSE Rsquared MAE
## 2.8091612 0.7196884 2.3814392
#Check model assumptions visually
check_model(mlr5)
#Visualize prediction
ggeffect(mlr5)
## $Diameter
## # Predicted values of Weight
##
## Diameter | Predicted | 95% CI
## -----------------------------------
## 1.60 | 14.51 | 13.28, 15.74
## 1.70 | 17.25 | 16.22, 18.27
## 1.90 | 22.72 | 22.06, 23.37
## 2.00 | 25.45 | 24.93, 25.97
## 2.10 | 28.18 | 27.73, 28.64
## 2.30 | 33.65 | 33.04, 34.26
## 2.40 | 36.38 | 35.61, 37.16
## 2.70 | 44.59 | 43.21, 45.96
##
## $Moisture
## # Predicted values of Weight
##
## Moisture | Predicted | 95% CI
## -----------------------------------
## 10.00 | 29.44 | 28.23, 30.66
## 10.10 | 29.29 | 28.27, 30.31
## 10.30 | 28.97 | 28.30, 29.65
## 10.40 | 28.82 | 28.28, 29.35
## 10.50 | 28.66 | 28.20, 29.12
## 10.70 | 28.35 | 27.78, 28.92
## 10.80 | 28.19 | 27.48, 28.90
## 11.10 | 27.72 | 26.46, 28.98
##
## $Hardness
## # Predicted values of Weight
##
## Hardness | Predicted | 95% CI
## -----------------------------------
## -50 | 31.15 | 29.71, 32.59
## -40 | 30.45 | 29.36, 31.54
## -30 | 29.75 | 28.99, 30.52
## -20 | 29.05 | 28.54, 29.57
## -10 | 28.35 | 27.88, 28.83
## 0 | 27.65 | 26.97, 28.33
## 10 | 26.95 | 25.96, 27.95
## 30 | 25.55 | 23.86, 27.25
##
## $Maturity
## # Predicted values of Weight
##
## Maturity | Predicted | 95% CI
## -----------------------------------
## Mature | 28.44 | 27.76, 29.11
## Young | 28.75 | 28.13, 29.37
##
##
## attr(,"class")
## [1] "ggalleffects" "list"
## attr(,"model.name")
## [1] "mlr5"
ggeffect(mlr5) %>%
plot() %>%
plot_grid()
tbl_regression(mlr5,
add_pairwise_contrasts = TRUE,
pvalue_fun = label_style_pvalue(digits = 2)) %>%
bold_p()
| Characteristic | Beta  | 95% CI       | p-value |
|----------------|-------|--------------|---------|
| Diameter       | 27    | 25, 30       | <0.001  |
| Moisture       | -1.6  | -3.7, 0.53   | 0.14    |
| Hardness       | -0.07 | -0.11, -0.03 | <0.001  |
| Maturity       |       |              |         |
| Young - Mature | 0.31  | -0.61, 1.2   | 0.51    |

Abbreviation: CI = Confidence Interval
Instructions
Wage.f <- Wage %>%
filter(wage < 200 & age < 60)
Split the Wage.f data frame into training and testing sets.
Using the training set, fit a linear regression model with wage as response variable and year, age, job status, and education as predictors.
Perform all diagnostic checking of the model assumptions. Discuss your findings.
Evaluate the accuracy of the predictions obtained from the model.
Report and interpret the parameter estimates and their significance.