As we have learnt previously, attribute variables are vectors of class factor in R. Such vectors are encoded numerically, with information about the levels of the variable saved in the levels attribute.
## [1] "factor"
## [1] 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 1 2 2 2 2 1 2 2 2 2 2 2 2 2 2 1 1 2
## [38] 2 2 2 2 2 2 2 2 2 2 1 1 2 2 2 1 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 1 2 2 2 2 2
## [75] 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2
## [112] 2 2 2 1 2 2 2 2 1 2 2 2 1 2 2 2 1 2 2 2 2 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [149] 1 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2
## [186] 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2
## [223] 2 2 2 2 2 2 2 2 1 1 2 1 2 2 2 1 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 1 1 2 2 2 2
## [260] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [297] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 1
## [334] 2 1 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 1 2 2 2 2 2 2 2 2
## [371] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## attr(,"levels")
## [1] "Female" "Male"
Note that the `unclass()` function removes the class attribute of the `sex` variable above, so it is printed using the default print method. This reveals the internal structure of the object: the underlying integer codes together with the levels attribute.
However, when using an attribute variable in a linear regression model it would make no sense to treat its numeric codes as the values of a measured explanatory variable. In the context of linear modelling we need to code them so that they represent the levels of the attribute.
Two-level attribute variables are very easy to code. We simply create an indicator or dummy variable that takes one of two numerical values. Consider the `sex` indicator variable. We can code this using a dummy variable \(d\):
\[\begin{equation}
d=\left\{
\begin{array}{@{}ll@{}}
0, & \text{if female} \\
1, & \text{if male}
\end{array}\right.
\end{equation}\]
💡 This is the default coding used in R. The value zero is assigned to the level that comes first alphabetically, unless this is changed, for example by using the `relevel()` function or by specifying the levels of the factor explicitly.
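As an illustration of changing the reference level, here is a minimal sketch (the object name `sex_releveled` is ours, not from the text):

```r
# Make "Male" the reference (zero) level instead of the alphabetical default "Female"
sex_releveled <- relevel(Salaries$sex, ref = "Male")
contrasts(sex_releveled)
```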
For a simple regression model of `salary` versus `sex`:
\[salary = b_0 + b_1sex + e,\]
this results in the model
\[\begin{equation} salary_i = b_0 + b_1sex_i + e_i=\left\{ \begin{array}{@{}ll@{}} b_0 + b_1 \times 1 + e_i = b_0 + b_1 + e_i, & \text{if the person is male} \\ b_0 + b_1 \times 0 + e_i = b_0 + e_i, & \text{if the person is female} \end{array}\right. \end{equation}\]
where \(b_0\) can be interpreted as the average \(\text{salary}\) for females, and \(b_0 + b_1\) as the average \(\text{salary}\) for males. The value of \(b_1\) represents the average difference in \(\text{salary}\) between males and females.
We can conclude that dealing with an attribute variable with two levels in a linear model is straightforward. In this case, a dummy variable indicates whether an observation has a particular characteristic: yes/no. We can think of it as a "switch" in the model, as this dummy variable can only take the values 0 and 1, where 0 indicates the absence of the effect and 1 indicates its presence. The values 0/1 can be seen as off/on.
The way in which R codes dummy variables is controlled by the `contrasts` option:
## $contrasts
## unordered ordered
## "contr.treatment" "contr.poly"
The output shows how factors are converted into an appropriate set of contrasts: the first entry applies to unordered factors and the second to ordered factors. The former is applicable in our context.
To explicitly identify the coding of a factor, i.e. the dummy variables used by R, we can use the `contrasts()` function.
## Male
## Female 0
## Male 1
## B
## A 0
## B 1
## AssocProf Prof
## AsstProf 0 0
## AssocProf 1 0
## Prof 0 1
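For reference, a sketch of the calls behind the output above, assuming the `Salaries` factors from `carData`:

```r
options("contrasts")            # global coding scheme: contr.treatment for unordered factors
contrasts(Salaries$sex)         # Female = 0, Male = 1
contrasts(Salaries$discipline)  # A = 0, B = 1
contrasts(Salaries$rank)        # two dummy columns: AssocProf and Prof
```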
Note that the applied "contr.treatment" conversion uses only the values 0 and 1, and that for an attribute variable with k levels it creates k-1 dummy variables. One could argue that the printout of the function would be more informative if each printed column were headed by an indexed letter d. For example:
`sex`, where k=2:

| attribute | \(d\) |
|---|---|
| Female | 0 |
| Male | 1 |

`rank`, where k=3:

| attribute | \(d_1\) | \(d_2\) |
|---|---|---|
| AsstProf | 0 | 0 |
| AssocProf | 1 | 0 |
| Prof | 0 | 1 |
There are many different ways of coding attribute variables besides the dummy variable approach explained here. All of these approaches lead to equivalent model fits. What differs are the coefficients, i.e. the model parameters: they are arranged to measure particular contrasts and therefore require different interpretations. The 0/1 coding implemented in R's default `contr.treatment` contrast offers a straightforward interpretation of the associated parameter in the model, which is often not the case with other contrasts.
In the case of measured predictors, we are comfortable with interpreting a linear model coefficient as a slope, which tells us the change in the response (outcome) per unit increase in the explanatory variable. This is not necessarily the right interpretation for attribute explanatory variables.
If we obtain the mean salary for each sex group we will find that for female professors the average salary is \(\$101,002\) and for male professors the average is \(\$115,090\), that is, a difference of \(\$14,088\). If we now look at the parameters of the regression model for `salary` vs `sex`, where females are coded as zero and males as one, we get exactly the same information, implying that the coefficient is the estimated difference in averages between the two groups.
##
## Call:
## lm(formula = salary ~ sex)
##
## Coefficients:
## (Intercept) sexMale
## 101002 14088
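A sketch of the comparison described above, contrasting the group means with the fitted coefficients:

```r
tapply(Salaries$salary, Salaries$sex, mean)  # female and male group means
lm(salary ~ sex, data = Salaries)            # intercept = female mean, sexMale = difference
```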
For more on this topic check the following link: Categorical variables and interaction terms in linear regression
We are interested in the extent to which variation in the response variable `salary` is associated with variation in the explanatory variables available in the `Salaries` data set; that is, we want to fit a multiple linear regression model to the given data. The model we wish to construct should contain enough to explain the relations in the data and at the same time be simple enough to understand, explain to others, and use.
For convenience we will adopt the following notation:

- \(y\): `salary`
- \(x_1\): `yrs.since.phd`
- \(x_2\): `yrs.service`
- \(x_3\): `discipline`
- \(x_4\): `sex`
- \(x_5\): `rank`
Generally, in multiple regression we have a continuous response variable and two or more continuous explanatory variables; however, in this dataset we also have three attribute variables that we wish to include as explanatory variables in the model.
Next, we need to specify the model that embodies our mechanistic understanding of the factors involved and the way that they are related to the response variable. It would make sense to expect that all of the available \(x\) variables may impact the behaviour of \(y\), thus the model we wish to build should reflect our viewpoint, i.e. \(y = f(x_1, x_2, x_3, x_4, x_5)\):
\[y = b_0 + b_1x_1 + b_2x_2 + b_3x_3 + b_4x_4 + b_5x_5 + e\] Our viewpoint states a belief that all explanatory variables have a positive impact on the response. For example, more years in service will cause a higher salary.
Our objective now is to determine the values of the parameters in the model that lead to the best fit of the model to the data. That is, we are not only trying to estimate the parameters of the model, but we are also seeking the minimal adequate model to describe the data.
The best model is the model that produces the least unexplained variation following the principle of parsimony rather than complexity. That is the model should have as few parameters as possible, subject to the constraint that the parameters in the model should all be statistically significant.
For regression modelling in R we use the `lm()` function, which fits a linear model assuming normal errors and constant variance. We specify the model by a formula that uses the arithmetic operators `+`, `-`, `*`, `/` and `^`, which take on a different functionality from their ordinary arithmetic one (see the sketch after this paragraph). But before we dive into statistical modelling of the given data, we need to take a first step and conduct the most fundamental task of the data analysis procedure: Get to Know Our Data.
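As an illustrative, non-exhaustive sketch of how these operators behave inside a formula:

```r
# y ~ x1 + x2        include x1 and x2 as separate terms
# y ~ x1 - 1         remove a term, here the intercept
# y ~ x1 * x2        main effects plus the x1:x2 interaction
# y ~ x1 / x2        x1 plus x2 nested within x1
# y ~ (x1 + x2)^2    all main effects and their two-way interactions
```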
Examining multivariate data is intrinsically more challenging than examining univariate and bivariate data. To get the most in-depth view into multivariate data behaviour we construct scatter plot matrices that display pairwise relationships. In R, scatter plot matrices are produced by the `pairs()` function, which comes as part of the default `graphics` package. Since we wish to include attribute variables in our analysis we are going to use the `GGally::ggpairs()` function, which produces a pairwise comparison of multivariate data for both data types: measured and attribute.
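A minimal sketch of such a plot; the choice and ordering of the columns (with `salary` last) is our assumption:

```r
library(GGally)
ggpairs(Salaries, columns = c("rank", "discipline", "yrs.since.phd",
                              "yrs.service", "sex", "salary"))
```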
This is an information-rich visualisation that includes pairwise relationships of all the variables we want to consider for our model. By focusing on the last column of the plots, we can notice an influence of all explanatory variables on the response, except perhaps for `discipline` and `sex`. We also notice unbalanced representation of the groups for the variables `rank` and `sex`, but for the purpose of our practice in fitting a multi-factor model this isn't too problematic. We need to be especially concerned with the extent of correlations between the explanatory variables, and of particular interest to us is the high multicollinearity between `rank`, `yrs.since.phd` and `yrs.service`, which arises when variables are highly linearly related. As a consequence, we will need to keep an eye on the significance of using all of these variables in the model.
There are no fixed rules when fitting linear models, but there are adopted standards that have proven to work well in practice. We start off by fitting a maximal model and then simplify it by removing non-significant explanatory variables. This needs to be done with caution, making sure that the simplifications make good scientific sense and do not lead to significant reductions in explanatory power. Although this should be the adopted strategy for fitting a model, it does not guarantee finding all the important structure in a complex data frame.
We can summarise our model building procedure as follows:

1. Fit the maximal model containing all of the available explanatory variables.
2. Inspect the coefficients and identify the least significant explanatory variable.
3. Remove it, refit the model and compare the explanatory power (e.g. \(\bar{R}^2\)) with the previous fit; if the reduction is negligible, keep the simpler model.
4. Repeat until all remaining terms are significant, which gives the minimal adequate model.
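The maximal model reported below can be fitted as in this sketch (the object name `model_1` is illustrative):

```r
model_1 <- lm(salary ~ ., data = Salaries)  # all available explanatory variables
summary(model_1)
```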
##
## Call:
## lm(formula = salary ~ ., data = Salaries)
##
## Residuals:
## Min 1Q Median 3Q Max
## -65248 -13211 -1775 10384 99592
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 65955.2 4588.6 14.374 < 2e-16 ***
## rankAssocProf 12907.6 4145.3 3.114 0.00198 **
## rankProf 45066.0 4237.5 10.635 < 2e-16 ***
## disciplineB 14417.6 2342.9 6.154 1.88e-09 ***
## yrs.since.phd 535.1 241.0 2.220 0.02698 *
## yrs.service -489.5 211.9 -2.310 0.02143 *
## sexMale 4783.5 3858.7 1.240 0.21584
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 22540 on 390 degrees of freedom
## Multiple R-squared: 0.4547, Adjusted R-squared: 0.4463
## F-statistic: 54.2 on 6 and 390 DF, p-value: < 2.2e-16
The \(R^2 = 45.47\%\) and the \(\bar{R}^2 = 44.63\%\) are well above the value of zero, allowing us to accept this as a valid model without having to formally test its statistical significance. It manages to explain almost half of the variability in the response variable `salary`.
We identify the `sex` explanatory variable as clearly not significant, which is in line with the conclusion we could draw from the boxplot in the pairwise comparison plot for `salary` vs. `sex`. We will remove it to begin the process of model simplification.
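A sketch of the refit without `sex` (the object name `model_2` matches the one referenced in the exercises later):

```r
model_2 <- lm(salary ~ rank + discipline + yrs.since.phd + yrs.service,
              data = Salaries)
summary(model_2)
```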
##
## Call:
## lm(formula = salary ~ rank + discipline + yrs.since.phd + yrs.service,
## data = Salaries)
##
## Residuals:
## Min 1Q Median 3Q Max
## -65244 -13498 -1455 9638 99682
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 69869.0 3332.1 20.968 < 2e-16 ***
## rankAssocProf 12831.5 4147.7 3.094 0.00212 **
## rankProf 45287.7 4236.7 10.689 < 2e-16 ***
## disciplineB 14505.2 2343.4 6.190 1.52e-09 ***
## yrs.since.phd 534.6 241.2 2.217 0.02720 *
## yrs.service -476.7 211.8 -2.250 0.02497 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 22550 on 391 degrees of freedom
## Multiple R-squared: 0.4525, Adjusted R-squared: 0.4455
## F-statistic: 64.64 on 5 and 391 DF, p-value: < 2.2e-16
We note a slight reduction in \(\bar{R}^2\) from \(44.63\%\) to \(44.55\%\), which we can regard as an insignificant decrease. The next step is to check the coefficients and assess the effect of the remaining variables. We identify `yrs.since.phd` and `yrs.service` as the least influential in explaining the variability of `salary`. To illustrate how to formally assess their effect, we will conduct the t-test for the `yrs.since.phd` variable:
## [1] 1.64876
As \(t_{calc} = 2.217 > t_{crit} = 1.64876\), we reject \(H_0\) in favour of \(H_1\): the coefficient is statistically significant. We will therefore keep the remaining variables, stop the model simplification and focus on interpreting the model.
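The critical value quoted above can be reproduced as follows, assuming a one-sided test at the 5% level with the model's 391 residual degrees of freedom:

```r
qt(0.95, df = 391)   # critical t value, approximately 1.64876
```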
The structure of our final fitted model is:
\[y = b_0 + b_1x_1 + b_2x_2 + b_3x_3 + b_4x_4 + e,\]
where

- \(y\): `salary`
- \(x_1\): `yrs.since.phd`
- \(x_2\): `yrs.service`
- \(x_3\): `discipline`
- \(x_4\): `rank`
We can take a closer look at the coefficients of our fitted model:
## (Intercept) rankAssocProf rankProf disciplineB yrs.since.phd
## 69869.0110 12831.5375 45287.6890 14505.1514 534.6313
## yrs.service
## -476.7179
Examining the output we realise that R has handled the variable `rank` by defining three "sub" dummy variables:
\[
dr_1 =
\begin{cases}
1 & \text{rank is AsstProf} \\
0 & \text{rank is not AsstProf}
\end{cases}
\]
\[ dr_2 = \begin{cases} 1 & \text{rank is AssocProf} \\ 0 & \text{rank is not AssocProf} \end{cases} \]
\[ dr_3 = \begin{cases} 1 & \text{rank is Prof} \\ 0 & \text{rank is not Prof} \end{cases} \]
It has chosen to use the model: \[ y = b_0 + b_1dr_2 + b_2dr_3 + b_3d_1 + b_4x_1 + b_5x_2 + e, \]
where:

- \(y\): `salary`
- \(x_1\): `yrs.since.phd`
- \(x_2\): `yrs.service`
- \(dr_2\), \(dr_3\): the dummy variables for `rank` defined above
- \(d_1\): the dummy variable for `discipline`, as explained earlier

Note that R doesn't need to use \(dr_1\) to create three models; it only needs two dummy variables, since it uses \(dr_1\) as the reference level, also known as the baseline. This subsequently allows R to create three models relating to the `rank` variable:
\[
\begin{aligned}
\text{AsstProf } (dr_2 = dr_3 = 0): \quad & y = b_0 + b_3d_1 + b_4x_1 + b_5x_2 + e \\
\text{AssocProf } (dr_2 = 1,\ dr_3 = 0): \quad & y = (b_0 + b_1) + b_3d_1 + b_4x_1 + b_5x_2 + e \\
\text{Prof } (dr_2 = 0,\ dr_3 = 1): \quad & y = (b_0 + b_2) + b_3d_1 + b_4x_1 + b_5x_2 + e
\end{aligned}
\]
telling us that:

- \(b_0\) acts as the intercept for assistant professors,
- \(b_0 + b_1\) as the intercept for associate professors, and
- \(b_0 + b_2\) as the intercept for professors.
Learning this, we can interpret our final fitted model as follows:

- for each additional year since PhD (`yrs.since.phd`) the average salary (`salary`) will go up by \(\$534.63\), assuming the rest of the variables in the model are held fixed
- for each additional year of service (`yrs.service`) the average salary will go down by \(\$476.72\), assuming the rest of the variables in the model are held fixed
- the average salary of an assistant professor (`rank: AsstProf`) who works in a "theoretical" department is \(\$69,869.01\), and of one who works in an "applied" department it is \(\$84,374.16\); this can vary with the number of years in service and since PhD
- the average salary of an associate professor (`rank: AssocProf`) who works in a "theoretical" department is \(\$82,700.55\), and of one who works in an "applied" department it is \(\$97,205.70\); this can vary with the number of years in service and since PhD
- the average salary of a professor (`rank: Prof`) who works in a "theoretical" department is \(\$115,156.70\), and of one who works in an "applied" department it is \(\$129,661.90\); this can vary with the number of years in service and since PhD

This model explains around \(45\%\) of the variability in the response variable `salary`.
Adding `~ 0` to the `lm()` formula enables R to suppress the intercept. Note that if we remove the intercept, the final fitted model directly provides all "three intercepts", without a base level:
##
## Call:
## lm(formula = salary ~ 0 + rank + discipline + yrs.since.phd +
## yrs.service)
##
## Residuals:
## Min 1Q Median 3Q Max
## -65244 -13498 -1455 9638 99682
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## rankAsstProf 69869.0 3332.1 20.968 < 2e-16 ***
## rankAssocProf 82700.5 3916.7 21.115 < 2e-16 ***
## rankProf 115156.7 4350.9 26.467 < 2e-16 ***
## disciplineB 14505.2 2343.4 6.190 1.52e-09 ***
## yrs.since.phd 534.6 241.2 2.217 0.0272 *
## yrs.service -476.7 211.8 -2.250 0.0250 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 22550 on 391 degrees of freedom
## Multiple R-squared: 0.9638, Adjusted R-squared: 0.9633
## F-statistic: 1736 on 6 and 391 DF, p-value: < 2.2e-16
So far we have seen that by fitting an additive regression model to the data, we aim to identify and understand how the value of a dependent response variable changes when any one of the independent explanatory variables is changed while the other independent variables stay the same. This is a restrictive form of a model as it only allows for linear relationships between the response and the explanatory variables, and the way in which one explanatory variable affects the response is the same for any value of the other explanatory variables used in the model.
We need to add flexibility to overcome these limitations. Introducing the concept of interaction allows the use of linear models for non-linear relationships, in which the effect of one explanatory variable can differ for different values of another explanatory variable. This brings more complexity into the multivariate regression model.
Case Study: The Quality of Red Bordeaux Vintages
To illustrate the multivariate model fitting procedure with interactions we are going to use `wine.csv`, available from Eduardo García Portugués's book: Notes for Predictive Modeling. The dataset contains the auction Price of 27 red Bordeaux vintages, five vintage descriptors (WinterRain, AGST, HarvestRain, Age, Year), and the population of France in the year of the vintage, FrancePop.
- `Year`: year in which grapes were harvested to make wine
- `Price`: logarithm of the average market price for Bordeaux vintages according to 1990–1991 auctions. The price is relative to the price of the 1961 vintage, regarded as the best one ever recorded
- `WinterRain`: winter rainfall (in mm)
- `AGST`: Average Growing Season Temperature (in degrees Celsius)
- `HarvestRain`: harvest rainfall (in mm)
- `Age`: age of the wine, measured as the number of years stored in a cask
- `FrancePop`: population of France at Year (in thousands)

We would like to analyse the quality of a vintage, quantified as its price, and interpret our statistical findings.
Don’t forget!!! 🤔 First things first: Get to Know Data 🤓
We will start this analysis by examining the pairwise plot.
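A sketch of the initial data examination; the file location of `wine.csv` is an assumption:

```r
wine <- read.csv("wine.csv")   # adjust the path or URL as appropriate
summary(wine)
GGally::ggpairs(wine)          # pairwise comparison plot
```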
## Year Price WinterRain AGST HarvestRain
## Min. :1952 Min. :6.205 Min. :376.0 Min. :14.98 Min. : 38.0
## 1st Qu.:1960 1st Qu.:6.508 1st Qu.:543.5 1st Qu.:16.15 1st Qu.: 88.0
## Median :1967 Median :6.984 Median :600.0 Median :16.42 Median :123.0
## Mean :1967 Mean :7.042 Mean :608.4 Mean :16.48 Mean :144.8
## 3rd Qu.:1974 3rd Qu.:7.441 3rd Qu.:705.5 3rd Qu.:17.01 3rd Qu.:185.5
## Max. :1980 Max. :8.494 Max. :830.0 Max. :17.65 Max. :292.0
## Age FrancePop
## Min. : 3.00 Min. :43184
## 1st Qu.: 9.50 1st Qu.:46856
## Median :16.00 Median :50650
## Mean :16.19 Mean :50085
## 3rd Qu.:22.50 3rd Qu.:53511
## Max. :31.00 Max. :55110
What conclusions can we draw:
We can notice a perfect relationship between the variables `Year` and `Age`. This is to be expected, since the data was collected in 1983 and `Age` was calculated as `Age` = 1983 - `Year`. Knowing this, we are going to remove `Year` from the analysis and use `Age`, as it will be easier to interpret.
There is a strong relationship between `Year`, i.e. `Age`, and `FrancePop`, and since we want to impose our viewpoint that the total population does not influence the quality of the wine, we will not consider this variable in the model.
We are going to investigate possible interactions between the rainfall (`WinterRain`) and the growing season temperature (`AGST`). In R these interaction terms are created automatically by using the `*` operator.
Let us build a model:
We will start with the most complicated model that includes the highest-order interaction.
In R we will specify the three-way interaction, which will automatically add all combinations of two-way interactions.
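A sketch of this fit, whose summary is shown below (the object name `model1` is illustrative):

```r
model1 <- lm(Price ~ WinterRain + AGST + HarvestRain + Age +
               WinterRain * AGST * HarvestRain, data = wine)
summary(model1)
```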
##
## Call:
## lm(formula = Price ~ WinterRain + AGST + HarvestRain + Age +
## WinterRain * AGST * HarvestRain, data = wine)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.35058 -0.19462 -0.02645 0.17194 0.52079
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.582e+00 1.924e+01 0.446 0.6609
## WinterRain -1.858e-02 2.896e-02 -0.642 0.5292
## AGST -1.748e-01 1.137e+00 -0.154 0.8795
## HarvestRain -4.713e-02 1.540e-01 -0.306 0.7631
## Age 2.476e-02 8.288e-03 2.987 0.0079 **
## WinterRain:AGST 1.272e-03 1.712e-03 0.743 0.4671
## WinterRain:HarvestRain 7.836e-05 2.600e-04 0.301 0.7665
## AGST:HarvestRain 3.059e-03 9.079e-03 0.337 0.7401
## WinterRain:AGST:HarvestRain -5.446e-06 1.540e-05 -0.354 0.7278
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2833 on 18 degrees of freedom
## Multiple R-squared: 0.8621, Adjusted R-squared: 0.8007
## F-statistic: 14.06 on 8 and 18 DF, p-value: 2.675e-06
The model explains well over \(80\%\) of variability and is clearly a strong model, but the key question is whether we can simplify it. We will start the process of this model simplification by removing the three-way interaction as it is clearly not significant.
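One way to drop the three-way interaction is with `update()`, as in this sketch (the model names are illustrative):

```r
model2 <- update(model1, . ~ . - WinterRain:AGST:HarvestRain)
summary(model2)
```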
##
## Call:
## lm(formula = Price ~ WinterRain + AGST + HarvestRain + Age +
## WinterRain:AGST + WinterRain:HarvestRain + AGST:HarvestRain,
## data = wine)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.35245 -0.19452 0.01643 0.17289 0.51420
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.980e+00 1.066e+01 0.279 0.78293
## WinterRain -9.699e-03 1.408e-02 -0.689 0.49930
## AGST 1.542e-01 6.383e-01 0.242 0.81168
## HarvestRain 6.496e-03 2.610e-02 0.249 0.80610
## Age 2.441e-02 8.037e-03 3.037 0.00678 **
## WinterRain:AGST 7.490e-04 8.420e-04 0.890 0.38484
## WinterRain:HarvestRain -1.350e-05 7.338e-06 -1.840 0.08144 .
## AGST:HarvestRain -1.032e-04 1.520e-03 -0.068 0.94656
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2767 on 19 degrees of freedom
## Multiple R-squared: 0.8611, Adjusted R-squared: 0.8099
## F-statistic: 16.83 on 7 and 19 DF, p-value: 6.523e-07
The \(\bar{R}^2\) has slightly increased in value. Next, we remove the least significant two-way interaction term.
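Continuing the sketch, we drop `AGST:HarvestRain`, the least significant two-way interaction:

```r
model3 <- update(model2, . ~ . - AGST:HarvestRain)
summary(model3)
```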
##
## Call:
## lm(formula = Price ~ WinterRain + AGST + HarvestRain + Age +
## WinterRain:AGST + WinterRain:HarvestRain, data = wine)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.35424 -0.19343 0.01176 0.17161 0.51218
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.518e+00 6.946e+00 0.507 0.61802
## WinterRain -1.017e-02 1.195e-02 -0.851 0.40488
## AGST 1.218e-01 4.138e-01 0.294 0.77147
## HarvestRain 4.752e-03 4.553e-03 1.044 0.30901
## Age 2.451e-02 7.710e-03 3.179 0.00472 **
## WinterRain:AGST 7.769e-04 7.166e-04 1.084 0.29119
## WinterRain:HarvestRain -1.342e-05 7.059e-06 -1.902 0.07174 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2697 on 20 degrees of freedom
## Multiple R-squared: 0.8611, Adjusted R-squared: 0.8194
## F-statistic: 20.66 on 6 and 20 DF, p-value: 1.35e-07
Again, it is reassuring to notice an increase in the \(\bar{R}^2\), but we can still simplify the model further by removing another least significant two-way interaction term.
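Again as a sketch, the next least significant two-way interaction, `WinterRain:AGST`, is removed:

```r
model4 <- update(model3, . ~ . - WinterRain:AGST)
summary(model4)
```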
##
## Call:
## lm(formula = Price ~ WinterRain + AGST + HarvestRain + Age +
## WinterRain:HarvestRain, data = wine)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.5032 -0.1934 0.0109 0.1771 0.4621
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.812e+00 1.598e+00 -2.386 0.026553 *
## WinterRain 2.747e-03 9.471e-04 2.900 0.008560 **
## AGST 5.586e-01 9.495e-02 5.883 7.71e-06 ***
## HarvestRain 4.717e-03 4.572e-03 1.032 0.313877
## Age 2.785e-02 7.094e-03 3.926 0.000774 ***
## WinterRain:HarvestRain -1.349e-05 7.088e-06 -1.903 0.070835 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2708 on 21 degrees of freedom
## Multiple R-squared: 0.8529, Adjusted R-squared: 0.8179
## F-statistic: 24.35 on 5 and 21 DF, p-value: 4.438e-08
There is an insignificant decrease in \(\bar{R}^2\). We notice that `HarvestRain` is now the least significant term, but it is used in the `WinterRain:HarvestRain` interaction, which is significant at the \(\alpha = 10\%\) level, and therefore we should keep it. However, as the concept of parsimony prefers a model without interactions to a model containing interactions between variables, we will remove the remaining interaction term and see if it significantly affects the explanatory power of the model.
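The corresponding refit without the remaining interaction, continuing the sketch:

```r
model5 <- update(model4, . ~ . - WinterRain:HarvestRain)
summary(model5)
```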
##
## Call:
## lm(formula = Price ~ WinterRain + AGST + HarvestRain + Age, data = wine)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.46024 -0.23862 0.01347 0.18601 0.53443
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.6515703 1.6880876 -2.163 0.04167 *
## WinterRain 0.0011667 0.0004820 2.420 0.02421 *
## AGST 0.6163916 0.0951747 6.476 1.63e-06 ***
## HarvestRain -0.0038606 0.0008075 -4.781 8.97e-05 ***
## Age 0.0238480 0.0071667 3.328 0.00305 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2865 on 22 degrees of freedom
## Multiple R-squared: 0.8275, Adjusted R-squared: 0.7962
## F-statistic: 26.39 on 4 and 22 DF, p-value: 4.057e-08
The \(\bar{R}^2\) is reduced by around \(2\%\), but it has all the significant terms and it is easier to interpret. For those reasons and in the spirit of parsimony that argues that a model should be as simple as possible, we can suggest that this should be regarded as the best final fitted model.
We realise that for large numbers of explanatory variables, and many interactions and non-linear terms, the process of model simplification can take a very long time. There are many algorithms for automatic variable selection that can help us to choose the variables to include in a regression model. Stepwise regression and Best Subsets regression are two of the more common variable selection methods.
The stepwise procedure moves from the saturated model (or the maximal model, whichever is appropriate) through a series of simplifications to the minimal adequate model. This progression is made on the basis of deletion tests: F-tests, AIC, t-tests or chi-squared tests that assess the significance of the increase in deviance that results when a given term is removed from the current model.
The best subset regression (BREG), also known as “all possible regressions”, as the name of the procedure indicates, fits a separate least squares regression for each possible combination of the \(p\) predictors, i.e. explanatory variables. After fitting all of the models, BREG then displays the best fitted models with one explanatory variable, two explanatory variables, three explanatory variables, and so on. Usually, either adjusted R-squared or Mallows Cp is the criterion for picking the best fitting models for this process. The result is a display of the best fitted models of different sizes up to the full/maximal model and the final fitted model can be selected by comparing displayed models based on the criteria of parsimony.
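As a hedged illustration of these two approaches on the wine data (neither call is from the original text; the `leaps` package is assumed to be installed):

```r
# Stepwise selection by AIC, starting from the additive model
step(lm(Price ~ WinterRain + AGST + HarvestRain + Age, data = wine),
     direction = "both")

# Best subsets regression: the best model of each size, compared by adjusted R-squared
library(leaps)
best <- regsubsets(Price ~ WinterRain + AGST + HarvestRain + Age, data = wine)
summary(best)$adjr2
```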
“These methods are frequently abused by naive researchers who seek to interpret the order of entry of variables into the regression equation as an index of their ‘importance’. This practice is potentially misleading.” J. Fox and S. Weisberg's book: An R Companion to Applied Regression, Third Edition, Sage (2019)
💡 When selecting a model one should remember the important concept of parsimony!
As M.J. Crawley points out in his well-known editions of "The R Book", we need to remember that models are portrayals of phenomena that should be both "accurate and convenient", and the principle of parsimony is an essential tool for model exploration. As he suggests: "just because we go to the trouble of measuring something does not mean we have to have it in our model."
“Parsimony says that, other things being equal, we prefer:
Useful links:
Best Subset, https://afit-r.github.io/model_selection, https://bookdown.org/tpinto_home/Regularisation/best-subset-selection.html
Summary
For problems ranging from bioinformatics to marketing, many analysts prefer to develop “classifiers” instead of developing predictive models.
Claeskens, G. and Hjort, N.L. (2008) Model Selection and Model Averaging, Cambridge University Press, Cambridge.
Fox, J. (2002) An R and S-Plus Companion to Applied Regression, Sage, Thousand Oaks, CA.
1. `Salaries` data Case Study: Adding `~ 0` to the `lm()` formula enables R to suppress the intercept. Try to fit the following model by removing the intercept.
   `model_2_1 <- lm(salary ~ 0 + rank + discipline + yrs.since.phd + yrs.service)`
   `summary(model_2_1)`
   This will allow you to obtain all "three intercepts" without a reference level.
   - Does this model differ from the previously fitted `model_2`? Provide a justified explanation.
   - Interpret the model.
   - Can the model fitted in the `Salaries` data Case Study be further simplified? Justify your answer.
2. Use the `Prestige` data available from the `carData` package of datasets designed to accompany J. Fox and S. Weisberg's book: An R Companion to Applied Regression, Third Edition, Sage (2019). Fit a multivariate model that explains variation in the response variable `prestige`, explaining the reasons behind the steps taken, with appropriate interpretation of the final fitted model.

`carData::Prestige`: The Canadian occupational prestige data, where the observations are occupations. Justification for treating the 102 occupations as a sample implicitly rests on the claim that they are "typical" of the population, at least with respect to the relationship between prestige and income.
- `education`: Average education of occupational incumbents, years, in 1971.
- `income`: Average income of incumbents, dollars, in 1971.
- `women`: Percentage of incumbents who are women.
- `prestige`: Pineo-Porter prestige score for occupation, from a social survey conducted in the mid-1960s.
- `census`: Canadian Census occupational code.
- `type`: Type of occupation. A factor with levels (note: out of order): `bc`, Blue Collar; `prof`, Professional, Managerial, and Technical; `wc`, White Collar.
: White CollarCreative Commons Attribution-ShareAlike 4.0 International License.