As we have learnt previously, attribute variables are vectors of class factor in R. Such vectors are encoded numerically, with information about the levels of the variable saved in the levels attribute.
## [1] "factor"
## [1] 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 1 2 2 2 2 1 2 2 2 2 2 2 2 2 2 1 1 2
## [38] 2 2 2 2 2 2 2 2 2 2 1 1 2 2 2 1 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 1 2 2 2 2 2
## [75] 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2
## [112] 2 2 2 1 2 2 2 2 1 2 2 2 1 2 2 2 1 2 2 2 2 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [149] 1 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2
## [186] 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2
## [223] 2 2 2 2 2 2 2 2 1 1 2 1 2 2 2 1 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 1 1 2 2 2 2
## [260] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [297] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 1
## [334] 2 1 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 1 2 2 2 2 2 2 2 2
## [371] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## attr(,"levels")
## [1] "Female" "Male"
Note that the `unclass()` function removes the class attribute of the `sex` variable above, so it is printed using the default print method. This reveals the internal structure of the object: the underlying integer codes together with the levels attribute.
However, when using an attribute variable in a linear regression model it would make no sense to treat its numeric codes as the values of a measured explanatory variable. In the context of linear modelling we need to code them so that they represent the levels of the attribute.
Two-level attribute variables are very easy to code. We simply create an indicator or dummy variable that takes one of two numerical values. Consider the `sex` indicator variable. We can code this using a dummy variable \(d\):
\[\begin{equation}
d=\left\{
\begin{array}{@{}ll@{}}
0, & \text{if female} \\
1, & \text{if male}
\end{array}\right.
\end{equation}\]
💡 This is the default coding used in R. The value zero is assigned to the level that comes first alphabetically, unless this is changed, for example by using the `relevel()` function or by specifying the levels of the factor explicitly.
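As an illustration of changing the reference level, here is a minimal sketch (the object name `sex_releveled` is ours, not from the text):

```r
# Make "Male" the reference (zero) level instead of the alphabetical default "Female"
sex_releveled <- relevel(Salaries$sex, ref = "Male")
contrasts(sex_releveled)
```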
For a simple regression model of `salary` versus `sex`:
\[salary = b_0 + b_1sex + e,\]
this results in the model
\[\begin{equation} salary_i = b_0 + b_1sex_i + e_i=\left\{ \begin{array}{@{}ll@{}} b_0 + b_1 \times 1 + e_i = b_0 + b_1 + e_i, & \text{if the person is male} \\ b_0 + b_1 \times 0 + e_i = b_0 + e_i, & \text{if the person is female} \end{array}\right. \end{equation}\]
where \(b_0\) can be interpreted as the average \(\text{salary}\) for females, and \(b_0 + b_1\) as the average \(\text{salary}\) for males. The value of \(b_1\) represents the average difference in \(\text{salary}\) between males and females.
We can conclude that dealing with an attribute variable with two levels in a linear model is straightforward. In this case, a dummy variable indicates whether an observation has a particular characteristic: yes/no. We can think of it as a "switch" in the model, as this dummy variable can only take the values 0 and 1, where 0 indicates the absence of the effect and 1 indicates its presence. The values 0/1 can be seen as off/on.
The way in which R codes dummy variables is controlled by the `contrasts` option:
## $contrasts
## unordered ordered
## "contr.treatment" "contr.poly"
The output shows how factors are converted into an appropriate set of contrasts: the first entry applies to unordered factors and the second to ordered factors. The former is applicable in our context.
To explicitly identify the coding of a factor, i.e. the dummy variables used by R, we can use the `contrasts()` function.
## Male
## Female 0
## Male 1
## B
## A 0
## B 1
## AssocProf Prof
## AsstProf 0 0
## AssocProf 1 0
## Prof 0 1
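For reference, a sketch of the calls behind the output above, assuming the `Salaries` factors from `carData`:

```r
options("contrasts")            # global coding scheme: contr.treatment for unordered factors
contrasts(Salaries$sex)         # Female = 0, Male = 1
contrasts(Salaries$discipline)  # A = 0, B = 1
contrasts(Salaries$rank)        # two dummy columns: AssocProf and Prof
```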
Note that the applied "contr.treatment" conversion uses only the values 0 and 1, and that for an attribute variable with k levels it creates k-1 dummy variables. One could argue that the printout of the function would be more informative if each printed column were headed by an indexed letter d. For example:
`sex`, where k=2:

| attribute | \(d\) |
|---|---|
| Female | 0 |
| Male | 1 |

`rank`, where k=3:

| attribute | \(d_1\) | \(d_2\) |
|---|---|---|
| AsstProf | 0 | 0 |
| AssocProf | 1 | 0 |
| Prof | 0 | 1 |
There are many different ways of coding attribute variables besides the dummy variable approach explained here. All of these approaches lead to equivalent model fits. What differs are the coefficients, i.e. the model parameters: they are arranged to measure particular contrasts and therefore require different interpretations. The 0/1 coding implemented in R's default `contr.treatment` contrast offers a straightforward interpretation of the associated parameter in the model, which is often not the case with other contrasts.
In the case of measured predictors, we are comfortable with interpreting a linear model coefficient as a slope, which tells us the change in the response (outcome) per unit increase in the explanatory variable. This is not necessarily the right interpretation for attribute explanatory variables.
If we obtain the mean salary for each sex group we will find that for female professors the average salary is \(\$101,002\) and for male professors the average is \(\$115,090\), that is, a difference of \(\$14,088\). If we now look at the parameters of the regression model for `salary` vs `sex`, where females are coded as zero and males as one, we get exactly the same information, implying that the coefficient is the estimated difference in averages between the two groups.
##
## Call:
## lm(formula = salary ~ sex)
##
## Coefficients:
## (Intercept) sexMale
## 101002 14088
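A sketch of the comparison described above, contrasting the group means with the fitted coefficients:

```r
tapply(Salaries$salary, Salaries$sex, mean)  # female and male group means
lm(salary ~ sex, data = Salaries)            # intercept = female mean, sexMale = difference
```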
For more on this topic check the following link: Categorical variables and interaction terms in linear regression
We are interested in the extent to which variation in the response variable `salary` is associated with variation in the explanatory variables available in the `Salaries` data set; that is, we want to fit a multiple linear regression model to the given data. The model we wish to construct should contain enough to explain the relations in the data and at the same time be simple enough to understand, explain to others, and use.
For convenience we will adopt the following notation:

- \(y\): `salary`
- \(x_1\): `yrs.since.phd`
- \(x_2\): `yrs.service`
- \(x_3\): `discipline`
- \(x_4\): `sex`
- \(x_5\): `rank`
Generally, in multiple regression we have a continuous response variable and two or more continuous explanatory variables; however, in this dataset we also have three attribute variables that we wish to include as explanatory variables in the model.
Next, we need to specify the model that embodies our mechanistic understanding of the factors involved and the way that they are related to the response variable. It would make sense to expect that all of the available \(x\) variables may impact the behaviour of \(y\), thus the model we wish to build should reflect our viewpoint, i.e. \(y = f(x_1, x_2, x_3, x_4, x_5)\):
\[y = b_0 + b_1x_1 + b_2x_2 + b_3x_3 + b_4x_4 + b_5x_5 + e\] Our viewpoint states a belief that all explanatory variables have a positive impact on the response. For example, more years in service will cause a higher salary.
Our objective now is to determine the values of the parameters in the model that lead to the best fit of the model to the data. That is, we are not only trying to estimate the parameters of the model, but we are also seeking the minimal adequate model to describe the data.
The best model is the model that produces the least unexplained variation following the principle of parsimony rather than complexity. That is the model should have as few parameters as possible, subject to the constraint that the parameters in the model should all be statistically significant.
For regression modelling in R we use the `lm()` function, which fits a linear model assuming normal errors and constant variance. We specify the model by a formula that uses the arithmetic operators `+`, `-`, `*`, `/` and `^`, which take on a different functionality from their ordinary arithmetic one (see the sketch after this paragraph). But before we dive into statistical modelling of the given data, we need to take a first step and conduct the most fundamental task of the data analysis procedure: Get to Know Our Data.
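As an illustrative, non-exhaustive sketch of how these operators behave inside a formula:

```r
# y ~ x1 + x2        include x1 and x2 as separate terms
# y ~ x1 - 1         remove a term, here the intercept
# y ~ x1 * x2        main effects plus the x1:x2 interaction
# y ~ x1 / x2        x1 plus x2 nested within x1
# y ~ (x1 + x2)^2    all main effects and their two-way interactions
```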
Examining multivariate data is intrinsically more challenging than examining univariate and bivariate data. To get the most in-depth view into multivariate data behaviour we construct scatter plot matrices that display pairwise relationships. In R, scatter plot matrices are produced by the `pairs()` function, which comes as part of the default `graphics` package. Since we wish to include attribute variables in our analysis we are going to use the `GGally::ggpairs()` function, which produces a pairwise comparison of multivariate data for both data types: measured and attribute.
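A minimal sketch of such a plot; the choice and ordering of the columns (with `salary` last) is our assumption:

```r
library(GGally)
ggpairs(Salaries, columns = c("rank", "discipline", "yrs.since.phd",
                              "yrs.service", "sex", "salary"))
```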
This is an information-rich visualisation that includes pairwise relationships of all the variables we want to consider for our model. By focusing on the last column of the plots, we can notice an influence of all explanatory variables on the response, except perhaps for `discipline` and `sex`. We also notice unbalanced representation of the groups for the variables `rank` and `sex`, but for the purpose of our practice in fitting a multi-factor model this isn't too problematic. We need to be especially concerned with the extent of correlations between the explanatory variables, and of particular interest to us is the high multicollinearity between `rank`, `yrs.since.phd` and `yrs.service`, which arises when variables are highly linearly related. As a consequence, we will need to keep an eye on the significance of using all of these variables in the model.
There are no fixed rules when fitting linear models, but there are adopted standards that have proven to work well in practice. We start off by fitting a maximal model and then simplify it by removing non-significant explanatory variables. This needs to be done with caution, making sure that the simplifications make good scientific sense and do not lead to significant reductions in explanatory power. Although this should be the adopted strategy for fitting a model, it does not guarantee finding all the important structure in a complex data frame.
We can summarise our model building procedure as follows:

1. Fit the maximal model containing all of the available explanatory variables.
2. Inspect the coefficients and identify the least significant explanatory variable.
3. Remove it, refit the model and compare the explanatory power (e.g. \(\bar{R}^2\)) with the previous fit; if the reduction is negligible, keep the simpler model.
4. Repeat until all remaining terms are significant, which gives the minimal adequate model.
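The maximal model reported below can be fitted as in this sketch (the object name `model_1` is illustrative):

```r
model_1 <- lm(salary ~ ., data = Salaries)  # all available explanatory variables
summary(model_1)
```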
##
## Call:
## lm(formula = salary ~ ., data = Salaries)
##
## Residuals:
## Min 1Q Median 3Q Max
## -65248 -13211 -1775 10384 99592
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 65955.2 4588.6 14.374 < 2e-16 ***
## rankAssocProf 12907.6 4145.3 3.114 0.00198 **
## rankProf 45066.0 4237.5 10.635 < 2e-16 ***
## disciplineB 14417.6 2342.9 6.154 1.88e-09 ***
## yrs.since.phd 535.1 241.0 2.220 0.02698 *
## yrs.service -489.5 211.9 -2.310 0.02143 *
## sexMale 4783.5 3858.7 1.240 0.21584
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 22540 on 390 degrees of freedom
## Multiple R-squared: 0.4547, Adjusted R-squared: 0.4463
## F-statistic: 54.2 on 6 and 390 DF, p-value: < 2.2e-16
The \(R^2 = 45.47\%\) and the \(\bar{R}^2 = 44.63\%\) are well above the value of zero, allowing us to accept this as a valid model without having to formally test its statistical significance. It manages to explain almost half of the variability in the response variable `salary`.
We identify the `sex` explanatory variable as clearly not significant, which is in line with the conclusion we could draw from the boxplot in the pairwise comparison plot for `salary` vs. `sex`. We will remove it to begin the process of model simplification.
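A sketch of the refit without `sex` (the object name `model_2` matches the one referenced in the exercises later):

```r
model_2 <- lm(salary ~ rank + discipline + yrs.since.phd + yrs.service,
              data = Salaries)
summary(model_2)
```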
##
## Call:
## lm(formula = salary ~ rank + discipline + yrs.since.phd + yrs.service,
## data = Salaries)
##
## Residuals:
## Min 1Q Median 3Q Max
## -65244 -13498 -1455 9638 99682
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 69869.0 3332.1 20.968 < 2e-16 ***
## rankAssocProf 12831.5 4147.7 3.094 0.00212 **
## rankProf 45287.7 4236.7 10.689 < 2e-16 ***
## disciplineB 14505.2 2343.4 6.190 1.52e-09 ***
## yrs.since.phd 534.6 241.2 2.217 0.02720 *
## yrs.service -476.7 211.8 -2.250 0.02497 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 22550 on 391 degrees of freedom
## Multiple R-squared: 0.4525, Adjusted R-squared: 0.4455
## F-statistic: 64.64 on 5 and 391 DF, p-value: < 2.2e-16
We note a slight reduction in \(\bar{R}^2\) from \(44.63\%\) to \(44.55\%\), which we can regard as an insignificant decrease. The next step is to check the coefficients and assess the effect of the remaining variables. We identify `yrs.since.phd` and `yrs.service` as the least influential in explaining the variability of `salary`. To illustrate how to formally assess their effect, we will conduct the t-test for the `yrs.since.phd` variable:
## [1] 1.64876
As \(t_{calc} = 2.217 > t_{crit} = 1.64876\), we reject \(H_0\) in favour of \(H_1\): the coefficient is statistically significant. We will therefore keep the remaining variables, stop the model simplification and focus on interpreting the model.
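The critical value quoted above can be reproduced as follows, assuming a one-sided test at the 5% level with the model's 391 residual degrees of freedom:

```r
qt(0.95, df = 391)   # critical t value, approximately 1.64876
```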
The structure of our final fitted model is:
\[y = b_0 + b_1x_1 + b_2x_2 + b_3x_3 + b_4x_4 + e,\]
where

- \(y\): `salary`
- \(x_1\): `yrs.since.phd`
- \(x_2\): `yrs.service`
- \(x_3\): `discipline`
- \(x_4\): `rank`
We can take a closer look at the coefficients of our fitted model:
## (Intercept) rankAssocProf rankProf disciplineB yrs.since.phd
## 69869.0110 12831.5375 45287.6890 14505.1514 534.6313
## yrs.service
## -476.7179
Examining the output we realise that R has handled the variable `rank` by defining three "sub" dummy variables:
\[
dr_1 =
\begin{cases}
1 & \text{rank is AsstProf} \\
0 & \text{rank is not AsstProf}
\end{cases}
\]
\[ dr_2 = \begin{cases} 1 & \text{rank is AssocProf} \\ 0 & \text{rank is not AssocProf} \end{cases} \]
\[ dr_3 = \begin{cases} 1 & \text{rank is Prof} \\ 0 & \text{rank is not Prof} \end{cases} \]
It has chosen to use the model: \[ y = b_0 + b_1dr_2 + b_2dr_3 + b_3d_1 + b_4x_1 + b_5x_2 + e, \]
where:

- \(y\): `salary`
- \(x_1\): `yrs.since.phd`
- \(x_2\): `yrs.service`
- \(dr_2\), \(dr_3\): the dummy variables for `rank` defined above
- \(d_1\): the dummy variable for `discipline`, as explained earlier

Note that R doesn't need to use \(dr_1\) to create three models; it only needs two dummy variables, since it uses \(dr_1\) as the reference level, also known as the baseline. This subsequently allows R to create three models relating to the `rank` variable:
\[
\begin{aligned}
\text{AsstProf } (dr_2 = dr_3 = 0): \quad & y = b_0 + b_3d_1 + b_4x_1 + b_5x_2 + e \\
\text{AssocProf } (dr_2 = 1,\ dr_3 = 0): \quad & y = (b_0 + b_1) + b_3d_1 + b_4x_1 + b_5x_2 + e \\
\text{Prof } (dr_2 = 0,\ dr_3 = 1): \quad & y = (b_0 + b_2) + b_3d_1 + b_4x_1 + b_5x_2 + e
\end{aligned}
\]
telling us that:

- \(b_0\) acts as the intercept for assistant professors,
- \(b_0 + b_1\) as the intercept for associate professors, and
- \(b_0 + b_2\) as the intercept for professors.
Learning this, we can interpret our final fitted model as follows:

- for each additional year since PhD (`yrs.since.phd`) the average salary (`salary`) will go up by \(\$534.63\), assuming the rest of the variables in the model are held fixed
- for each additional year of service (`yrs.service`) the average salary will go down by \(\$476.72\), assuming the rest of the variables in the model are held fixed
- the average salary of an assistant professor (`rank: AsstProf`) who works in a "theoretical" department is \(\$69,869.01\), and of one who works in an "applied" department it is \(\$84,374.16\); this can vary with the number of years in service and since PhD
- the average salary of an associate professor (`rank: AssocProf`) who works in a "theoretical" department is \(\$82,700.55\), and of one who works in an "applied" department it is \(\$97,205.70\); this can vary with the number of years in service and since PhD
- the average salary of a professor (`rank: Prof`) who works in a "theoretical" department is \(\$115,156.70\), and of one who works in an "applied" department it is \(\$129,661.90\); this can vary with the number of years in service and since PhD

This model explains around \(45\%\) of the variability in the response variable `salary`.
Adding `~ 0` to the `lm()` formula enables R to suppress the intercept. Note that if we remove the intercept, the final fitted model directly provides all "three intercepts", without a base level:
##
## Call:
## lm(formula = salary ~ 0 + rank + discipline + yrs.since.phd +
## yrs.service)
##
## Residuals:
## Min 1Q Median 3Q Max
## -65244 -13498 -1455 9638 99682
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## rankAsstProf 69869.0 3332.1 20.968 < 2e-16 ***
## rankAssocProf 82700.5 3916.7 21.115 < 2e-16 ***
## rankProf 115156.7 4350.9 26.467 < 2e-16 ***
## disciplineB 14505.2 2343.4 6.190 1.52e-09 ***
## yrs.since.phd 534.6 241.2 2.217 0.0272 *
## yrs.service -476.7 211.8 -2.250 0.0250 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 22550 on 391 degrees of freedom
## Multiple R-squared: 0.9638, Adjusted R-squared: 0.9633
## F-statistic: 1736 on 6 and 391 DF, p-value: < 2.2e-16
So far we have seen that by fitting an additive regression model to the data, we aim to identify and understand how the value of a dependent response variable changes when any one of the independent explanatory variables is changed while the other independent variables stay the same. This is a restrictive form of a model as it only allows for linear relationships between the response and the explanatory variables, and the way in which one explanatory variable affects the response is the same for any value of the other explanatory variables used in the model.
We need to add flexibility to overcome these limitations. Introducing the concept of interaction allows the use of linear models for non-linear relationships, in which the effect of one explanatory variable can differ for different values of another explanatory variable. This brings more complexity into the multivariate regression model.
Case Study: The Quality of Red Bordeaux Vintages
To illustrate the multivariate model fitting procedure with interactions we are going to use `wine.csv`, available from Eduardo García Portugués's book: Notes for Predictive Modeling. The dataset contains the auction Price of 27 red Bordeaux vintages, five vintage descriptors (WinterRain, AGST, HarvestRain, Age, Year), and the population of France in the year of the vintage, FrancePop.
- `Year`: year in which grapes were harvested to make wine
- `Price`: logarithm of the average market price for Bordeaux vintages according to 1990–1991 auctions. The price is relative to the price of the 1961 vintage, regarded as the best one ever recorded
- `WinterRain`: winter rainfall (in mm)
- `AGST`: Average Growing Season Temperature (in degrees Celsius)
- `HarvestRain`: harvest rainfall (in mm)
- `Age`: age of the wine, measured as the number of years stored in a cask
- `FrancePop`: population of France at Year (in thousands)

We would like to analyse the quality of a vintage, quantified as its price, and interpret our statistical findings.
Don’t forget!!! 🤔 First things first: Get to Know Data 🤓
We will start this analysis by examining the pairwise plot.
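A sketch of the initial data examination; the file location of `wine.csv` is an assumption:

```r
wine <- read.csv("wine.csv")   # adjust the path or URL as appropriate
summary(wine)
GGally::ggpairs(wine)          # pairwise comparison plot
```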
## Year Price WinterRain AGST HarvestRain
## Min. :1952 Min. :6.205 Min. :376.0 Min. :14.98 Min. : 38.0
## 1st Qu.:1960 1st Qu.:6.508 1st Qu.:543.5 1st Qu.:16.15 1st Qu.: 88.0
## Median :1967 Median :6.984 Median :600.0 Median :16.42 Median :123.0
## Mean :1967 Mean :7.042 Mean :608.4 Mean :16.48 Mean :144.8
## 3rd Qu.:1974 3rd Qu.:7.441 3rd Qu.:705.5 3rd Qu.:17.01 3rd Qu.:185.5
## Max. :1980 Max. :8.494 Max. :830.0 Max. :17.65 Max. :292.0
## Age FrancePop
## Min. : 3.00 Min. :43184
## 1st Qu.: 9.50 1st Qu.:46856
## Median :16.00 Median :50650
## Mean :16.19 Mean :50085
## 3rd Qu.:22.50 3rd Qu.:53511
## Max. :31.00 Max. :55110
What conclusions can we draw:
We can notice a perfect relationship between the variables `Year` and `Age`. This is to be expected, since the data was collected in 1983 and `Age` was calculated as `Age` = 1983 - `Year`. Knowing this, we are going to remove `Year` from the analysis and use `Age`, as it will be easier to interpret.
There is a strong relationship between `Year`, i.e. `Age`, and `FrancePop`, and since we want to impose our viewpoint that the total population does not influence the quality of the wine, we will not consider this variable in the model.
We are going to investigate possible interactions between the rainfall (`WinterRain`) and the growing season temperature (`AGST`). In R these interaction terms are created automatically by using the `*` operator.
Let us build a model:
We will start with the most complicated model that includes the highest-order interaction.
In R we will specify the three-way interaction, which will automatically add all combinations of two-way interactions.
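A sketch of this fit, whose summary is shown below (the object name `model1` is illustrative):

```r
model1 <- lm(Price ~ WinterRain + AGST + HarvestRain + Age +
               WinterRain * AGST * HarvestRain, data = wine)
summary(model1)
```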
##
## Call:
## lm(formula = Price ~ WinterRain + AGST + HarvestRain + Age +
## WinterRain * AGST * HarvestRain, data = wine)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.35058 -0.19462 -0.02645 0.17194 0.52079
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.582e+00 1.924e+01 0.446 0.6609
## WinterRain -1.858e-02 2.896e-02 -0.642 0.5292
## AGST -1.748e-01 1.137e+00 -0.154 0.8795
## HarvestRain -4.713e-02 1.540e-01 -0.306 0.7631
## Age 2.476e-02 8.288e-03 2.987 0.0079 **
## WinterRain:AGST 1.272e-03 1.712e-03 0.743 0.4671
## WinterRain:HarvestRain 7.836e-05 2.600e-04 0.301 0.7665
## AGST:HarvestRain 3.059e-03 9.079e-03 0.337 0.7401
## WinterRain:AGST:HarvestRain -5.446e-06 1.540e-05 -0.354 0.7278
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2833 on 18 degrees of freedom
## Multiple R-squared: 0.8621, Adjusted R-squared: 0.8007
## F-statistic: 14.06 on 8 and 18 DF, p-value: 2.675e-06
The model explains well over \(80\%\) of variability and is clearly a strong model, but the key question is whether we can simplify it. We will start the process of this model simplification by removing the three-way interaction as it is clearly not significant.
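One way to drop the three-way interaction is with `update()`, as in this sketch (the model names are illustrative):

```r
model2 <- update(model1, . ~ . - WinterRain:AGST:HarvestRain)
summary(model2)
```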
##
## Call:
## lm(formula = Price ~ WinterRain + AGST + HarvestRain + Age +
## WinterRain:AGST + WinterRain:HarvestRain + AGST:HarvestRain,
## data = wine)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.35245 -0.19452 0.01643 0.17289 0.51420
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.980e+00 1.066e+01 0.279 0.78293
## WinterRain -9.699e-03 1.408e-02 -0.689 0.49930
## AGST 1.542e-01 6.383e-01 0.242 0.81168
## HarvestRain 6.496e-03 2.610e-02 0.249 0.80610
## Age 2.441e-02 8.037e-03 3.037 0.00678 **
## WinterRain:AGST 7.490e-04 8.420e-04 0.890 0.38484
## WinterRain:HarvestRain -1.350e-05 7.338e-06 -1.840 0.08144 .
## AGST:HarvestRain -1.032e-04 1.520e-03 -0.068 0.94656
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2767 on 19 degrees of freedom
## Multiple R-squared: 0.8611, Adjusted R-squared: 0.8099
## F-statistic: 16.83 on 7 and 19 DF, p-value: 6.523e-07
The \(\bar{R}^2\) has slightly increased in value. Next, we remove the least significant two-way interaction term.
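Continuing the sketch, we drop `AGST:HarvestRain`, the least significant two-way interaction:

```r
model3 <- update(model2, . ~ . - AGST:HarvestRain)
summary(model3)
```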
##
## Call:
## lm(formula = Price ~ WinterRain + AGST + HarvestRain + Age +
## WinterRain:AGST + WinterRain:HarvestRain, data = wine)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.35424 -0.19343 0.01176 0.17161 0.51218
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.518e+00 6.946e+00 0.507 0.61802
## WinterRain -1.017e-02 1.195e-02 -0.851 0.40488
## AGST 1.218e-01 4.138e-01 0.294 0.77147
## HarvestRain 4.752e-03 4.553e-03 1.044 0.30901
## Age 2.451e-02 7.710e-03 3.179 0.00472 **
## WinterRain:AGST 7.769e-04 7.166e-04 1.084 0.29119
## WinterRain:HarvestRain -1.342e-05 7.059e-06 -1.902 0.07174 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2697 on 20 degrees of freedom
## Multiple R-squared: 0.8611, Adjusted R-squared: 0.8194
## F-statistic: 20.66 on 6 and 20 DF, p-value: 1.35e-07
Again, it is reassuring to notice an increase in the \(\bar{R}^2\), but we can still simplify the model further by removing another least significant two-way interaction term.
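Again as a sketch, the next least significant two-way interaction, `WinterRain:AGST`, is removed:

```r
model4 <- update(model3, . ~ . - WinterRain:AGST)
summary(model4)
```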
##
## Call:
## lm(formula = Price ~ WinterRain + AGST + HarvestRain + Age +
## WinterRain:HarvestRain, data = wine)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.5032 -0.1934 0.0109 0.1771 0.4621
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.812e+00 1.598e+00 -2.386 0.026553 *
## WinterRain 2.747e-03 9.471e-04 2.900 0.008560 **
## AGST 5.586e-01 9.495e-02 5.883 7.71e-06 ***
## HarvestRain 4.717e-03 4.572e-03 1.032 0.313877
## Age 2.785e-02 7.094e-03 3.926 0.000774 ***
## WinterRain:HarvestRain -1.349e-05 7.088e-06 -1.903 0.070835 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2708 on 21 degrees of freedom
## Multiple R-squared: 0.8529, Adjusted R-squared: 0.8179
## F-statistic: 24.35 on 5 and 21 DF, p-value: 4.438e-08
There is an insignificant decrease in \(\bar{R}^2\). We notice that `HarvestRain` is now the least significant term, but it is used in the `WinterRain:HarvestRain` interaction, which is significant at the \(\alpha = 10\%\) level, and therefore we should keep it. However, as the concept of parsimony prefers a model without interactions to a model containing interactions between variables, we will remove the remaining interaction term and see if it significantly affects the explanatory power of the model.
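The corresponding refit without the remaining interaction, continuing the sketch:

```r
model5 <- update(model4, . ~ . - WinterRain:HarvestRain)
summary(model5)
```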
##
## Call:
## lm(formula = Price ~ WinterRain + AGST + HarvestRain + Age, data = wine)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.46024 -0.23862 0.01347 0.18601 0.53443
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.6515703 1.6880876 -2.163 0.04167 *
## WinterRain 0.0011667 0.0004820 2.420 0.02421 *
## AGST 0.6163916 0.0951747 6.476 1.63e-06 ***
## HarvestRain -0.0038606 0.0008075 -4.781 8.97e-05 ***
## Age 0.0238480 0.0071667 3.328 0.00305 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2865 on 22 degrees of freedom
## Multiple R-squared: 0.8275, Adjusted R-squared: 0.7962
## F-statistic: 26.39 on 4 and 22 DF, p-value: 4.057e-08
The \(\bar{R}^2\) is reduced by around \(2\%\), but it has all the significant terms and it is easier to interpret. For those reasons and in the spirit of parsimony that argues that a model should be as simple as possible, we can suggest that this should be regarded as the best final fitted model.
We realise that for large numbers of explanatory variables, and many interactions and non-linear terms, the process of model simplification can take a very long time. There are many algorithms for automatic variable selection that can help us to choose the variables to include in a regression model. Stepwise regression and Best Subsets regression are two of the more common variable selection methods.
The stepwise procedure moves from the saturated model (or the maximal model, whichever is appropriate) through a series of simplifications to the minimal adequate model. This progression is made on the basis of deletion tests: F-tests, AIC, t-tests or chi-squared tests that assess the significance of the increase in deviance that results when a given term is removed from the current model.
The best subset regression (BREG), also known as “all possible regressions”, as the name of the procedure indicates, fits a separate least squares regression for each possible combination of the \(p\) predictors, i.e. explanatory variables. After fitting all of the models, BREG then displays the best fitted models with one explanatory variable, two explanatory variables, three explanatory variables, and so on. Usually, either adjusted R-squared or Mallows Cp is the criterion for picking the best fitting models for this process. The result is a display of the best fitted models of different sizes up to the full/maximal model and the final fitted model can be selected by comparing displayed models based on the criteria of parsimony.
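As a hedged illustration of these two approaches on the wine data (neither call is from the original text; the `leaps` package is assumed to be installed):

```r
# Stepwise selection by AIC, starting from the additive model
step(lm(Price ~ WinterRain + AGST + HarvestRain + Age, data = wine),
     direction = "both")

# Best subsets regression: the best model of each size, compared by adjusted R-squared
library(leaps)
best <- regsubsets(Price ~ WinterRain + AGST + HarvestRain + Age, data = wine)
summary(best)$adjr2
```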
“These methods are frequently abused by naive researchers who seek to interpret the order of entry of variables into the regression equation as an index of their ‘importance’. This practice is potentially misleading.” J. Fox and S. Weisberg's book: An R Companion to Applied Regression, Third Edition, Sage (2019)
💡 When selecting a model one should remember the important concept of parsimony!
As M.J. Crawley points out in his well-known editions of "The R Book", we need to remember that models are portrayals of phenomena that should be both "accurate and convenient", and the principle of parsimony is an essential tool for model exploration. As he suggests: "just because we go to the trouble of measuring something does not mean we have to have it in our model."
“Parsimony says that, other things being equal, we prefer:
Useful links:
Best Subset, https://afit-r.github.io/model_selection, https://bookdown.org/tpinto_home/Regularisation/best-subset-selection.html
Summary
For problems ranging from bioinformatics to marketing, many analysts prefer to develop “classifiers” instead of developing predictive models.
Claeskens, G. and Hjort, N.L. (2008) Model Selection and Model Averaging, Cambridge University Press, Cambridge.
Fox, J. (2002) An R and S-Plus Companion to Applied Regression, Sage, Thousand Oaks, CA.
1. `Salaries` data Case Study: Adding `~ 0` to the `lm()` formula enables R to suppress the intercept. Try to fit the following model by removing the intercept.
   `model_2_1 <- lm(salary ~ 0 + rank + discipline + yrs.since.phd + yrs.service)`
   `summary(model_2_1)`
   This will allow you to obtain all "three intercepts" without a reference level.
   - Does this model differ from the previously fitted `model_2`? Provide a justified explanation.
   - Interpret the model.
   - Can the model fitted in the `Salaries` data Case Study be further simplified? Justify your answer.
2. Use the `Prestige` data available from the `carData` package of datasets designed to accompany J. Fox and S. Weisberg's book: An R Companion to Applied Regression, Third Edition, Sage (2019). Fit a multivariate model that explains variation in the response variable `prestige`, explaining the reasons behind the steps taken, with appropriate interpretation of the final fitted model.

`carData::Prestige`: The Canadian occupational prestige data, where the observations are occupations. Justification for treating the 102 occupations as a sample implicitly rests on the claim that they are "typical" of the population, at least with respect to the relationship between prestige and income.
- `education`: Average education of occupational incumbents, years, in 1971.
- `income`: Average income of incumbents, dollars, in 1971.
- `women`: Percentage of incumbents who are women.
- `prestige`: Pineo-Porter prestige score for occupation, from a social survey conducted in the mid-1960s.
- `census`: Canadian Census occupational code.
- `type`: Type of occupation. A factor with levels (note: out of order): `bc`, Blue Collar; `prof`, Professional, Managerial, and Technical; `wc`, White Collar.
: White CollarCreative Commons Attribution-ShareAlike 4.0 International License.