Simple Regression Models

Earlier we looked at how to examine sample data to investigate the nature of any relationship that may exist between a measured response variable and an attribute explanatory variable. This section examines how to investigate the nature of any relationship that may exist between a measured response variable and a measured explanatory variable.

The first step is to have a clear idea of what is meant by a connection between the response variable and the explanatory variable. The next step is to use some simple sample descriptive statistics to have a first look at the nature of the link between the response variable and the explanatory variable. This simple approach will lead to one of three conclusions, namely, on the basis of the sample information:

  1. there is very strong evidence to support a link
  2. there is absolutely no evidence of any link
  3. the sample evidence is inconclusive and further, more sophisticated, data analysis is required

This step is called the Initial Data Analysis or the IDA.

If the IDA suggests that Further Data Analysis is required, then this step seeks one of two conclusions:

  1. The sample evidence is consistent with there being no link between the response variable and the explanatory variable.
  2. The sample evidence is consistent with there being a link between the response variable and the explanatory variable.

As we have already seen in the previous sections, this process can be represented diagrammatically as:

The Data-Analysis Methodology seeks to find the answer to the following key question:

  • On the basis of the sample data, is there evidence of a connection/link/relationship between the response variable and the explanatory variable?

The final outcome is one of two conclusions:

  1. There is no evidence of a relationship, labelled as the ‘No’ outcome in the diagram above, in which case the analysis is finished.

  2. There is evidence of a relationship, labelled as the ‘Yes’ outcome in the diagram above, in which case the nature of the relationship needs to be described.

The first step is to have a clear idea of what is meant by a connection between a measured response variable and a measured explanatory variable. Imagine a population under study consisting of a very large number of population members, and on each population member two measurements are made: the value of \(Y\), the response variable, and the value of \(X\), the explanatory variable. For the whole population a graph of \(Y\) against \(X\) could, conceptually, be plotted. If the graph looks as in the diagram below, then there is quite clearly a link between \(Y\) and \(X\): if the value of \(X\) is known, the exact value of \(Y\) can be read off the graph. This is an unlikely scenario in the data-analysis context, because the relationship shown is a deterministic relationship. Deterministic means that if the value of \(X\) is known then the value of \(Y\) can be precisely determined from the relationship between \(Y\) and \(X\).

Scatter Plot

When analysing the relationship between two measured variables we start off by creating a scatter plot. A scatter plot is a graph with one axis for the explanatory variable, commonly known in regression modelling as the predictor and labelled \(X\), and one axis for the response variable, labelled \(Y\) and commonly known as the outcome variable. Thus, each point on the graph represents a single \((X, Y)\) pair. The primary benefit is that the possible relationship between the two variables can be viewed at a glance, and often the nature of a relationship can be determined quickly and easily.

Let us consider a few scatter plots. The following graph represents a perfect linear relationship. All points lie exactly on a straight line. It is easy in this situation to determine the intercept and the slope, i.e. gradient and hence specify the exact mathematical link between the response variable \(Y\) and the explanatory variable \(X\).

Graph 1

Graph 2 shows clearly that as the value of \(X\) increases the value of \(Y\) increases, but not exactly along a straight line as in the previous scatter plot. This is a statistical link: as the value of the explanatory variable \(X\) increases, the value of the response variable \(Y\) also tends to increase. An explanation for this is that the response \(Y\) may depend on a number of different variables, say \(X_1\), \(X_2\), \(X_3\), \(X_4\), \(X_5\), \(X_6\), etc., which could be written as:

\(Y = f(X_1, X_2, X_3, X_4, X_5, X_6, ...)\)

Graph 2

If the nature of the link between \(Y\) and \(X\) is under investigation then this could be represented as:

\[Y = f(X) + \text{effect of all other variables}\]

💡🤓The effect of all other variables is commonly abbreviated to e.

Graph 1 shows a link where the effect of all the other variables is nil: the response \(Y\) depends solely on the variable \(X\). Graph 2 shows a situation where \(Y\) depends on \(X\) but the other variables also have an influence.

Consider the model:

\[Y = f(X) + e\]

Remember 😃, \(e\) is the effect of all other variables!

The influence on the response variable \(Y\) can be thought of as being made up of two components:

  1. the component of \(Y\) that is explained by changes in the value of \(X\), [the part due to changes in \(X\) through \(f(X)\)]
  2. the component of \(Y\) that is explained by changes in the other factors [the part not explained by changes in \(X\)]

Or in more abbreviated forms:

  1. the Variation in \(Y\) Explained by changes in \(X\), or the Explained Variation, and
  2. the Variation in \(Y\) not explained by changes in \(X\) or the Unexplained Variation

The Total Variation in \(Y\) is made up of two components:

  1. Changes in \(Y\) Explained by changes in \(X\) and the
  2. Changes in \(Y\) not explained by changes in \(X\)

Which may be written as:

\[\text{The Total Variation in Y} = \text{Explained Variation} + \text{Unexplained Variation}\]

In Graph 1 the Unexplained Variation is nil, since the value of \(Y\) is completely determined by the value of \(X\). In Graph 2 the Explained Variation is large relative to the Unexplained Variation, since the value of \(Y\) is very largely influenced by the value of \(X\).

Consider Graph 3. Here there is no discernible pattern, and the value of Y seems to be unrelated to the value of X.

Graph 3

If \(Y\) is not related to \(X\) the Explained Variation component is zero and all the changes in \(Y\) are due to the other variables, that is the Unexplained Variation.

Finally, consider Graphs 4 & 5 below:

Graph 4

Graph 5

Graph 4 shows a similar picture to Graph 2, the difference being that as the value of \(X\) increases the value of \(Y\) decreases. The value of \(Y\) is influenced by the value of \(X\), so the Explained Variation is high relative to the Unexplained Variation. Consider the last graph, Graph 5, which is a deterministic relationship. The value of \(Y\) is completely specified by the value of \(X\). Hence the Unexplained Variation is zero.

Graphs Summary:

                             Graph 1   Graph 2   Graph 3   Graph 4   Graph 5
Explained Variation in Y     All       High      Zero      High      All
Unexplained Variation in Y   Zero      Low       All       Low       Zero

In regression the discussion started with the following idea:

\[Y = f(X) + e\]

and to quantify the strength of the link the influence on \(Y\) was broken down into two components:

\[\text{The Total Variation in Y} = \text{Explained Variation} + \text{Unexplained Variation}\]

This presents two issues:

  • A: Can a model of the link be made?
  • B: Can The Total Variation in Y, Explained Variation and the Unexplained Variation be measured?

💡🤓 The simplest form of connection is a straight-line relationship, and the question arises: could a straight-line relationship be matched with the information contained in Graphs 1-5?

  • Clearly it is easy for Graph 1 and Graph 5, the intercept and the gradient can be obtained directly from the graph. The relationship can then be written as:

\[Y = a + bX\] where:

  1. a is the intercept and

  2. b is the gradient

  • Since the Explained Variation is the same as the Total Variation in \(Y\) and the Unexplained Variation is zero, the precise evaluation of these quantities is not necessary.

Developing a Statistical Model

For the statistical relationships as shown in Graphs 2 & 4:

  1. Can the intercept and gradient be measured?

  2. Can the values of the three quantities The Total Variation in Y, Explained Variation and The Unexplained Variation be measured?

It is sufficient to work out any two since:

\[\text{The Total Variation in Y} = \text{Explained Variation} + \text{Unexplained Variation}\]

Fitting a line by eye is subjective. It is unlikely that any two analysts will draw exactly the same line, hence the intercept and gradient will be slightly different from one person to the next. What is needed is an agreed method that will provide an estimate of the intercept and the gradient.

Consider the simple numerical example below:

Suppose we would like to fit a straight-line relationship to the following data:

\(X\) 1 2 3 4 5 6
\(Y\) 7 8 12 13 14 18

The problem is to use this information to measure the intercept and the gradient for this data set.

A simple way to do this is to draw what is considered to be the line of best fit by judgement or guesswork! 😁😉

The intercept can be read off the graph as approximately \(5\), and the gradient can be simply measured: if \(X = 0\) then \(Y = 5\), and if \(X = 5\) then \(Y = 15\), so for a change in \(X\) of \(5\) units (from 0 to 5) \(Y\) changes by \(10\), from \(5\) to \(15\). The definition of the gradient is the change in \(Y\) for a unit increase in \(X\), hence the gradient for this data set is \(10/5 = 2\).

The straight-line relationship obtained by this process is \[\hat{Y} = 5 + 2*X\].
Note, \(\hat{Y}\) is a notation for the value of \(Y\) as predicted by the straight-line relationship.

We can add more information about the predicted values of \(Y\) by the “estimated” model to our table, so that we have

X Y \(\hat{Y}\) \({Y - \hat{Y}}\)
1 7 7.00 0.00
2 8 9.00 -1.00
3 12 11.00 1.00
4 13 13.00 0.00
5 14 15.00 -1.00
6 18 17.00 1.00

Looking at the information contained in the table above, the column headed \(\hat{Y}\) contains the predicted values of \(Y\) for the values of \(X\). For example, the first value of \(\hat{Y}\) is for \(X = 1\), and since \(\hat{Y} = 5 + 2*X\) we get \(\hat{Y} = 5 + 2*1 = 7\). The column headed \((Y - \hat{Y})\) is the difference between the actual value and the value predicted by the line. For example, when \(X = 1\), \(Y = 7\) and \(\hat{Y} = 7\), so the actual value lies exactly on the line, as can be seen in the graph. For the value \(X = 2\) the actual value lies \(1\) unit below the line, as can also be seen from the graph.

The column \((Y - \hat{Y})\) measures the disagreement between the actual data and the line, and a sensible strategy is to make this level of disagreement as small as possible. Referring to the graph and the table above, notice that sometimes the actual data value is below the line and sometimes it is above the line, so on average the value will be close to zero. In this particular example the \((Y - \hat{Y})\) values add up to zero (in a more conventional notation \(\Sigma(Y - \hat{Y}) = 0\)). The quantity \(\Sigma(Y - \hat{Y})\) is therefore not a satisfactory measure of disagreement, because there are a number of different lines with the property \(\Sigma(Y - \hat{Y}) = 0\).

A way of obtaining a satisfactory measure of disagreement is to square the individual \((Y - \hat{Y})\) values and add them up. i.e. obtain the quantity \(\Sigma(Y - \hat{Y})^2\). The result is always a positive number since the square of a negative number is positive. If this quantity is then chosen to be as small as possible then the level of disagreement between the actual data points and the fitted line is the least. This provides a criterion for the choice of the best line.

The quantity \(\Sigma(Y - \hat{Y})^2\) can be easily calculated and added to our table

X Y \(\hat{Y}\) \({Y - \hat{Y}}\) \((Y - \hat{Y})^2\)
1 7 7.00 0.00 0
2 8 9.00 -1.00 1
3 12 11.00 1.00 1
4 13 13.00 0.00 0
5 14 15.00 -1.00 1
6 18 17.00 1.00 1
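
The columns of this table are straightforward to reproduce in R; a minimal sketch (the vector name y_hat is just an illustrative choice):

> x <- c(1:6)
> y <- c(7, 8, 12, 13, 14, 18)
> y_hat <- 5 + 2 * x          # predicted values from the eyeballed line
> y - y_hat                   # the individual disagreements (Y - Yhat)
> sum(y - y_hat)              # these add up to zero
> sum((y - y_hat)^2)          # the squared-disagreement measure, here equal to 4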

The quantity \(\Sigma(Y - \hat{Y})^2\) is a measure of the disagreement between the actual \(Y\) values and the values predicted by the line. If this value is chosen to be as small as possible then the disagreement between the actual Y values and the line is the smallest it could possibly be, hence the line is The line of Best Fit.

This procedure of finding the intercept and the gradient of a line that makes the quantity \(\Sigma(Y - \hat{Y})^2\) a minimum is called The Method of Least Squares.

The Method of Least Squares was developed by C. F. Gauss (1777-1855), a German mathematician, who originated the ideas; the Russian mathematician A. A. Markov (1856-1922) later developed the method further.
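
The idea can also be illustrated numerically: treat the intercept and gradient as unknowns and search for the pair of values that makes \(\Sigma(Y - \hat{Y})^2\) as small as possible. A minimal sketch using R's general-purpose optimiser optim( ), assuming the x and y vectors from the sketch above (the names sse and ls_fit are illustrative); the resulting estimates should be close to those produced by lm( ) below:

> # sum of squared disagreements for a candidate intercept par[1] and gradient par[2]
> sse <- function(par) sum((y - (par[1] + par[2] * x))^2)
> ls_fit <- optim(c(0, 0), sse)   # numerically search for the minimising pair
> ls_fit$par                      # estimated intercept and gradient
> ls_fit$value                    # the minimised sum of squares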

In R we use the lm( ) function to fit linear models as illustrated below.

> x <- c(1:6)
> y <- c(7, 8, 12, 13, 14, 18)
> plot(x, y, pch = 16, main = "y = f(x) + e", xlim = c(0.25, 6), ylim = c(0, 20))
> abline(lm(y~x), lwd = 2, lty = 2, col = 2)
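
The estimated intercept and gradient, together with the minimised sum of squared differences, can be extracted from the fitted object; a minimal sketch (the name fit is an illustrative choice):

> fit <- lm(y ~ x)
> coef(fit)           # the least-squares intercept and gradient
> sum(resid(fit)^2)   # the minimised value of sum((Y - Yhat)^2), approximately 3.77

Note that these least-squares estimates differ slightly from the eyeballed values of \(5\) and \(2\) used in the table above.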

The final issue is to find out how to measure the three quantities:

  1. The Total Variation in \(Y\)
  2. The Explained Variation
  3. The Unexplained Variation

Taking these quantities one at a time they can be measured as follows:

The Unexplained Variation

This turns out to be very simple to measure: the quantity \(\Sigma(Y - \hat{Y})^2\) is a measure of the Unexplained Variation. If the line were a perfect fit to the data, the value predicted by the line and the actual value would be exactly the same and \(\Sigma(Y - \hat{Y})^2\) would be zero. The differences \((Y - \hat{Y})\), known as the residuals, measure the disagreement between the actual \(Y\) values and the predicted values \(\hat{Y}\), i.e. the Unexplained Variation in \(Y\). For the least-squares line fitted above the value of \(\Sigma(Y - \hat{Y})^2\) is \(3.77\) (the eyeballed line \(\hat{Y} = 5 + 2X\) gives the slightly larger value of \(4\)).

The Total Variation in Y

This is related to the measures of variability (spread) introduced earlier in the course and in particular to the standard deviation (\(\sigma\)). To measure The Total Variation in Y requires a measure of spread.

The Total Variation in Y is defined to be the quantity \(\Sigma(Y - \bar{Y})^2\), where \(\bar{Y}\) is the average value of \(Y\) (\(\bar{Y} = \Sigma(Y)/n\)).

In our earlier example \(\bar{Y} = 12\), so we can expand the table to include this calculation

X Y \(\hat{Y}\) \({Y - \hat{Y}}\) \((Y - \hat{Y})^2\) \((Y - \bar{Y})^2\)
1 7 7.00 0.00 0 25.00
2 8 9.00 -1.00 1 16.00
3 12 11.00 1.00 1 0.00
4 13 13.00 0.00 0 1.00
5 14 15.00 -1.00 1 4.00
6 18 17.00 1.00 1 36.00

giving \(\Sigma(Y - \bar{Y})^2 = 82\).

The Explained Variation in Y

If the line were a perfect fit, then the \(Y\) values and the \(\hat{Y}\) values would be exactly the same, and the quantity \(\Sigma(\hat{Y} - \bar{Y})^2\) would equal The Total Variation in Y. If the line is not a perfect match to the actual \(Y\) values then this quantity measures The Explained Variation in Y.

X Y \(\hat{Y}\) \({Y - \hat{Y}}\) \((Y - \hat{Y})^2\) \((Y - \bar{Y})^2\) \((\hat{Y} - \bar{Y})^2\)
1 7 7.00 0.00 0 25.00 27.94
2 8 9.00 -1.00 1 16.00 10.06
3 12 11.00 1.00 1 0.00 1.12
4 13 13.00 0.00 0 1.00 1.12
5 14 15.00 -1.00 1 4.00 10.06
6 18 17.00 1.00 1 36.00 27.94

Incorporating this calculation into the table above (using the fitted values from the least-squares line) enables us to obtain \(\Sigma(\hat{Y} - \bar{Y})^2 = 78.23\).
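
The three quantities, and the relationship between them, can be verified numerically from the fitted model; a minimal sketch, assuming the fit object from the earlier lm( ) sketch (the names total, unexplained and explained are illustrative):

> total       <- sum((y - mean(y))^2)            # The Total Variation in Y (82)
> unexplained <- sum((y - fitted(fit))^2)        # The Unexplained Variation (about 3.77)
> explained   <- sum((fitted(fit) - mean(y))^2)  # The Explained Variation (about 78.23)
> total; explained + unexplained                 # the two totals agree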

Correlation does not imply causation

Once the intercept and slope have been estimated using the method of least squares, various indices are studied to determine the reliability of these estimates. One of the most popular of these reliability indices is the correlation coefficient. A sample correlation coefficient, more specifically known as the Pearson Product Moment correlation coefficient, denoted \(r\), has possible values between \(-1\) and \(+1\), as illustrated in the diagram below.

In fact, the correlation is a parameter of the bivariate normal distribution that is used to describe the association between two variables, and it does not include a cause-and-effect statement. That is, one variable does not depend on the other, i.e. the variables are not labelled as dependent and independent. Rather, they are considered as two random variables that seem to vary together. Hence, it is important to recognise that correlation does not imply causation. In correlation analysis, both \(Y\) and \(X\) are assumed to be random variables, while in linear regression \(Y\) is assumed to be a random variable and \(X\) is assumed to be a fixed variable.

The main characteristics of the correlation coefficient \(r\) are:

  • there are no units for the correlation coefficient
  • it lies between \(-1 \leqslant r \leqslant 1\)
    • when its value is near -1 or +1 it shows a strong linear association
    • a correlation coefficient of -1 or +1 indicates a perfect linear association
    • a value near 0 suggests no linear relationship
  • the sign of the correlation coefficient indicates the direction of the association
    • a positive \(r\) is associated with an estimated positive slope
    • a negative \(r\) is associated with an estimated negative slope

BUT!!!

  • \(r\) is a measure of the strength of a linear association and it should not be used when the relationship is non-linear; therefore \(r\) is NOT used to measure the strength of a curved relationship
  • \(r\) is sensitive to outliers
  • \(r\) does not make a distinction between the response variable and the explanatory variable. That is, the correlation of \(X\) with \(Y\) is the same as the correlation of \(Y\) with \(X\).
  • in simple linear regression, squaring the correlation coefficient, \(r^2\), results in the value of the coefficient of determination \(R^2\)

💡🤓The Spearman rank correlation coefficient is the nonparametric equivalent of the Pearson correlation coefficient. This statistic is computed by replacing the data values with their ranks and applying the Pearson correlation formula to the ranks of the data. Tied values are replaced with the average rank of the ties. Just as in the case of the Pearson coefficient, this one is also really a measure of association rather than correlation, since the ranks are unchanged by a monotonic transformation of the original data. For sample sizes greater than 10, the distribution of the Spearman rank correlation coefficient can be approximated by the distribution of the Pearson correlation coefficient. It is also worth knowing that the Spearman rank correlation coefficient uses weights when weights are specified.
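
Both coefficients are available through R's cor( ) function; a minimal sketch using the small x and y vectors from the earlier example:

> cor(x, y)                        # the Pearson product moment correlation r
> cor(x, y)^2                      # its square equals R-squared in simple linear regression
> cor(x, y, method = "spearman")   # the Spearman rank correlation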

The Coefficient of Determination \(R^2\)

We realise that when fitting a regression model we are seeking to find out how much variance is explained, or is accounted for, by the explanatory variable \(X\) in an outcome variable \(Y\).

In the earlier example above the following has been calculated:

  1. \(\text{The Total Variation in }\)\(Y = 82.00\)
  2. \(\text{The Explained Variation in }\)\(Y = 78.23\)
  3. \(\text{The Unexplained Variation in }\)\(Y = 3.77\)

Notice that the relationship given below is satisfied:

\[\text{The Total Variation in Y = Explained Variation + Unexplained Variation}\]

\[82.00 = 78.23 + 3.77\]

What do these quantities tell us? They are difficult to interpret because they are expressed in the units of the problem.

Consider the following ratio:

\({\text{The Explained Variation in Y} \over \text{The Total Variation in Y}} = {78.23 \over 82.00} = 0.954\)

This is saying that \(0.954\) or \(95.4\%\) of the changes in \(Y\) are explained by changes in \(X\). This is a useful and useable measure of the effectiveness of the match between the actual \(Y\) values and the predicted \(Y\) values.

Reviewing the five scatter plots (Graphs 1 to 5) it can easily be seen that if the line is a perfect fit to the actual \(Y\) values, as in Graphs 1 & 5, then this ratio will have the value \(1\) or \(100\%\).

If there is no link between \(Y\) and \(X\) then the Explained Variation is zero, hence the ratio will be \(0\) or \(0\%\). An example of this is shown in Graph 3.

The remaining graphs: Graphs 2 & 4 show a statistical relationship hence this ratio will lie between \(0\) & \(1\). The closer the ratio is to zero the less strong the link is, whilst the closer the ratio is to \(1\) the stronger the connection is between \(X\) & \(Y\).

The Ratio:

\[R^2 = {\text{The Explained Variation in Y} \over \text{The Total Variation in Y}}\]

is called the Coefficient of Determination, and usually labelled as \(R^2\), and may be defined as the proportion of the changes in \(Y\) explained by changes in \(X\).

This ratio is always on the scale \(0\) to \(1\), but by convention is usually expressed as a percentage, so is regarded as on the scale \(0\) to \(100\%\). The interpretation of this ratio is as follows:

The theoretical minimum \(R^2\) is \(0\). However, since linear regression is based on the best possible fit, \(R^2\) will always be greater than zero, even when the predictor and outcome variables bear no relationship to one another. The definition and interpretation of \(R^2\) is a very important tool in the data analyst’s tool kit for tracking connections between a measured response variable and a measured explanatory variable.
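
For the small worked example this ratio can be computed directly and compared with the value that R itself reports; a minimal sketch, assuming the fit object from the earlier sketches:

> sum((fitted(fit) - mean(y))^2) / sum((y - mean(y))^2)   # Explained / Total Variation
> summary(fit)$r.squared                                  # R's reported value, about 0.954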

Using Sample Data to Track a connection

When working with sample data to investigate any relationship that may exist between a measured response variable and a measured explanatory variable, the information contained within the sample is imperfect, so it has to be interpreted in the light of sampling error. This is particularly important when interpreting the value of \(R^2\) calculated from sample data. The sample \(R^2\) value gives a measure of the strength of the connection, and this can sometimes be difficult to interpret, particularly if the sample size is small.

The data analysis methodology, as set out earlier, is a procedure that enables you to make judgements from sample data. The methodology requires you to know what specific procedures make up the Initial Data Analysis (IDA) and how to interpret the results of the IDA to obtain one of three outcomes:

  1. No evidence of a connection between the response and the explanatory variable
  2. Clear evidence of a connection between the response and the explanatory variable
  3. The IDA is inconclusive and further analysis is required

If Further Data Analysis, FDA, is required then you need to know what constitutes this further analysis and how to interpret it.

Finally, if a connection between the response variable and the explanatory variable is detected, then this connection needs to be described.

To demonstrate the data analysis methodology we’ll go back to the Share Price Study case study.

Case Study: Share Price Study Data

A business analyst is studying share prices of companies from three different business sectors. As part of the study a random sample (n=60) of companies was selected and the following data was collected:

Variable      Description
Share_Price   The market value of a company share (£)
Profit        Company annual profit (£1,000,000)
RD            Company annual spending on research and development (£1,000)
Turnover      Company annual total revenue (£1,000,000)
Competition   A variable coded:
              - 0 if the company operates in a very competitive market
              - 1 if the company has a great deal of monopoly power
Sector        A variable coded:
              - 1 if the company operates in the IT business sector
              - 2 if the company operates in the Finance business sector
              - 3 if the company operates in the Pharmaceutical business sector
Type          A variable coded:
              - 0 if the company does business mainly in Europe
              - 1 if the company trades globally

Let’s start off by investigating the relationship between the variables Share_Price and Profit.

We will adopt the following notation

  • \(Y: Share\_Price\)
  • \(X: Profit\)

\(\text{Model to be estimated: } Y = b_0 + b_1X + e\)

We start by accessing data:

> suppressPackageStartupMessages(library(dplyr))
> # read csv file
> companyd <- read.csv("https://tanjakec.github.io/mydata/SHARE_PRICE.csv", header=T)
> # look at the data
> glimpse(companyd)
## Rows: 60
## Columns: 7
## $ Share_Price <int> 880, 862, 850, 840, 838, 825, 808, 806, 801, 799, 783, 777…
## $ Profit      <dbl> 161.3, 170.5, 140.7, 115.7, 107.9, 138.8, 102.0, 102.7, 10…
## $ RD          <dbl> 152.6, 118.3, 110.6, 87.2, 75.1, 116.2, 91.3, 100.4, 113.5…
## $ Turnover    <dbl> 320.9, 306.3, 279.5, 193.2, 182.4, 265.2, 212.0, 170.3, 23…
## $ Competition <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 1…
## $ Sector      <int> 3, 3, 3, 3, 3, 3, 3, 3, 3, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 2…
## $ Type        <int> 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1…
> # convert into factor type variables
> companyd[, 5] <- as.factor(companyd[, 5])
> companyd[, 6] <- as.factor(companyd[, 6])
> companyd[, 7] <- as.factor(companyd[, 7])
> # glance at data
> summary(companyd)
##   Share_Price        Profit             RD            Turnover     Competition
##  Min.   :101.0   Min.   :  2.90   Min.   : 39.20   Min.   : 30.3   0:30       
##  1st Qu.:501.2   1st Qu.: 59.73   1st Qu.: 75.78   1st Qu.:112.3   1:30       
##  Median :598.5   Median : 88.85   Median : 90.60   Median :173.5              
##  Mean   :602.8   Mean   : 84.76   Mean   : 89.64   Mean   :170.2              
##  3rd Qu.:739.8   3rd Qu.:106.62   3rd Qu.:104.15   3rd Qu.:216.6              
##  Max.   :880.0   Max.   :170.50   Max.   :152.60   Max.   :323.3              
##  Sector Type  
##  1:20   0:30  
##  2:20   1:30  
##  3:20         
##               
##               
## 
> glimpse(companyd)
## Rows: 60
## Columns: 7
## $ Share_Price <int> 880, 862, 850, 840, 838, 825, 808, 806, 801, 799, 783, 777…
## $ Profit      <dbl> 161.3, 170.5, 140.7, 115.7, 107.9, 138.8, 102.0, 102.7, 10…
## $ RD          <dbl> 152.6, 118.3, 110.6, 87.2, 75.1, 116.2, 91.3, 100.4, 113.5…
## $ Turnover    <dbl> 320.9, 306.3, 279.5, 193.2, 182.4, 265.2, 212.0, 170.3, 23…
## $ Competition <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 1…
## $ Sector      <fct> 3, 3, 3, 3, 3, 3, 3, 3, 3, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 2…
## $ Type        <fct> 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1…

The Initial Data Analysis for investigating a connection between a measured response and a measured explanatory variable requires obtaining a graph of the response against the explanatory variable and calculating the value of \(R^2\) from the sample data.

The IDA has three possible outcomes:

  1. No evidence of a connection between the response and the explanatory variable.
  2. Clear evidence of a connection between the response and the explanatory variable.
  3. The IDA is inconclusive and further analysis is required.

The simplest form of connection is a straight-line relationship, and the question arises: could a straight-line relationship be matched with the information contained in the Graphs 1 to 5 discussed earlier?

As part of an informal investigation of the possible relationship between Share_Price and Profit, first we will use R to obtain a scatter plot with the line of best fit. Rather than referring every time to the name of the data set containing the variables of interest, we will attach our data and refer to the variables directly by their names (see the help for the attach( ) function).

> attach(companyd)
> names(companyd)   # shows the names of the variables in the 'companyd' data frame
## [1] "Share_Price" "Profit"      "RD"          "Turnover"    "Competition"
## [6] "Sector"      "Type"
> plot(Share_Price ~ Profit, cex = .6, main = "Share Price = f(Profit) + e")
> abline(lm(Share_Price ~ Profit), lty = 2, col = 2)

Let’s go through the code and the functions we have used to produce this graph.

  • The plot( ) function gives a scatterplot of two numerical variables. The first variable listed will be plotted on the horizontal axis and the second on the vertical axis, i.e. you ‘feed’ it first the variable representing \(X\) and then the variable representing \(Y\): plot(x, y). Considering that we are investigating a relationship between \(X\) and \(Y\) in the form of a regression line \(Y = b_0 + b_1X + e\), we specify that in R as the formula Y ~ X, which can be read as “\(Y\) is modelled as a function of \(X\)”. This means that a formula interface can also be used in R’s plot( ) function, in which case the response variable \(Y\) needs to come before the tilde (\(\sim\)), followed by the \(X\) variable, which will be plotted on the horizontal axis.

  • Next, we fit a line of best fit through our scatterplot using the abline( ) function for the linear model \(Y = b_0 + b_1X\), to see how close the points are to the fitted line. The basic R function for fitting linear models by ordinary least squares is lm( ), which stands for linear model. All we need to feed R when using the lm( ) function is the formula Y ~ X.

The scatterplot shows a fairly strong and reasonably linear relationship between the two variables. In other words, the fit is reasonably good, but it is not perfect and we could do with some more information about it. Let us see what the lm( ) function provides as part of its output.

> lm(Share_Price ~ Profit)
## 
## Call:
## lm(formula = Share_Price ~ Profit)
## 
## Coefficients:
## (Intercept)       Profit  
##     258.924        4.057

First, R displays the fitted model, after which it shows the estimates of the two parameters \(b_0\) & \(b_1\), which it refers to as coefficients.

  • the intercept, \(b_0 = 258.924\)
  • the slope, \(b_1= 4.057\)

We must now take this estimated model and ask a series of questions to decide whether or not our estimated model is good or bad: that is, we have to subject the fitted model to a set of tests designed to check the validity of the model, which is in effect a test of your viewpoint/theory as a modeller.

Examining the scatterplot, we can see that not all of the points lie on the fitted line \(Share\_Price = 258.924 + 4.057Profit\). To explain how strong the relationship is between the two variables we need to obtain the coefficient of determination known as the \(R^2\) parameter.

The coefficient of determination, \(R^2\), is a single number that measures the extent to which the explanatory variable can explain, or account for, the variability in \(Y\) – that is, how well does the explanatory variable explain the variability in, or behaviour of, the phenomenon we are trying to understand.

Earlier, we saw that \(R^2\) is constrained to lie in the following range:

\[0\% \text{ <---------- } R^2 \text{ ----------> } 100\%\]

We realised that the closer \(R^2\) is to \(100\%\) then the better the model is, and conversely, a value of \(R^2\) close to \(0\%\) implies a weak/poor model.

To obtain all of the information about the fitted model we can use the summary( ) function as in the following:

> summary(lm(Share_Price ~ Profit))
## 
## Call:
## lm(formula = Share_Price ~ Profit)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -175.513  -74.826    0.107   67.824  141.358 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 258.9243    28.7465   9.007 1.29e-12 ***
## Profit        4.0567     0.3102  13.077  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 89.98 on 58 degrees of freedom
## Multiple R-squared:  0.7467, Adjusted R-squared:  0.7424 
## F-statistic:   171 on 1 and 58 DF,  p-value: < 2.2e-16

So, what have we got here? 🤔

We will discuss some of the key components of R’s summary( ) function for linear regression models.

  • Call: shows what function and variables were used to create the model.
  • Residuals: the differences between what the model predicted and the actual values of Y. Try to see if you can calculate this ‘residuals’ section by yourself using: summary(Y-model$fitted.values). 🤓
  • Coefficients (see the sketch after this list):
    • Estimate: the estimated parameters for the intercept and the slope, \(b_0\) and \(b_1\)
    • Std. Error: for the slope this is the Residual Standard Error divided by the square root of \(\Sigma(X - \bar{X})^2\), the sum of squared deviations of that particular explanatory variable about its mean
    • t value: the Estimate divided by its Std. Error
    • Pr(>|t|): the p-value, i.e. the probability of obtaining a t value at least as extreme as the one observed, from the t distribution with the residual degrees of freedom
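
These quantities can be reproduced from the stored coefficient table; a minimal sketch, assuming the attached companyd data from above (the name ctab is illustrative):

> ctab <- coef(summary(lm(Share_Price ~ Profit)))   # the Coefficients table as a matrix
> ctab[, "Estimate"] / ctab[, "Std. Error"]         # reproduces the t values
> 2 * pt(-abs(ctab[, "t value"]), df = 58)          # two-sided p-values, df = n - 2 = 58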

{{% notice note %}} Note that in the first section we can find statistics relevant to the estimation of the model’s parameters. In the second part of the output we can find statistics related to the overall goodness of the fitted model. {{% /notice %}}

The summary(lm( )) output includes the Residual Standard Error, the standard deviation of the error term. We know that a standard deviation is the square root of a variance. The Residual Standard Error is very similar, the only difference being that instead of dividing by \(n-1\) you divide by \(n - (1 + k)\), where \(k\) is the number of explanatory variables used in the model. See if you can use and adjust the code below to calculate this statistic. 🤓

# Residual Standard Error
model <- lm(Share_Price ~ Profit)   # a fitted lm object to work with
k <- length(model$coefficients) - 1 # subtract one to ignore the intercept
SSE <- sum(model$residuals^2)       # sum of squared residuals
n <- length(model$residuals)        # number of observations
sqrt(SSE / (n - (1 + k)))           # Residual Standard Error

Next is the coefficient of determination, which helps us determine how well the model fits the data. We have already seen that \(R^2\) is the proportion of the Total Variation in \(Y\) that is explained by the model: the bigger the residual error, the smaller \(R^2\) will be.

For the time being we will skip \(R^2_{adjusted}\) and point out that it is used for models with multiple variables, to which you will be introduced in the next section.

Lastly, the F-statistic is the second “test” that the summary function produces for lm models. The F-statistic is a “global” test that checks whether at least one of your coefficients is nonzero; when dealing with a simple regression model it checks whether the model is worthy of further investigation and interpretation.

Model Interpretation

To go back to our model interpretation. Earlier, after examining the scatter plot, we said that there is a clear link between Share_Price and Profit. This is confirmed by the value of \(R^2 = 74.67\%\). The interpretation of \(R^2\) suggests that \(74.67\%\) of the changes in Share_Price are explained by changes in Profit. Alternatively, \(25.33\%\) of the changes in Share_Price are due to other variables.

For this example the outcome of the IDA is that there is clear evidence of a link between Share_Price and Profit and the nature of the influence being that as Profit increases Share_Price also increases. It only remains to describe the connection, and discuss how effective the model is, and ask if it can be used to predict Share_Price value from the value of Profit? 🤔

Further Data Analysis

The adequacy of \(R^2\) can be judged both informally and formally using hypothesis testing. Like all hypothesis tests, this test is carried out in four stages:

  1. Specify the hypotheses
  2. Define the test parameters and decision rule
  3. Examine the sample evidence
  4. Conclusions

Stage 1: Specify the hypotheses (\(H_0\) & \(H_1\)).

The Coefficient of Determination \(R^2\) is a useful quantity. By definition if \(R^2 = 0\) then there is no connection between the response variable and the explanatory variable. Conversely if the value of \(R^2\) is greater than zero there must be a connection. This enables the formal hypotheses to be defined as:

  • \(H_0 : R^2 = 0\) (There is no relationship between the response variable and the explanatory variable.)
  • \(H_1 : R^2 > 0\) (There is a relationship between the response and the explanatory variable.)

Stage 2: Defining the test parameters and the decision rule.

The decision rule is based on the F statistic. The F distribution has a shape as shown below:

The value of \(F_{crit}\) is the value that divides the distribution with \(95\%\) of the area to the left of the \(F_{crit}\) ordinate, and \(5\%\) to the right. This quantity, obtained from statistical tables of the F distribution, is denoted \(F_{crit}\); the value calculated from the sample data is denoted \(F_{calc}\).

The decision rule is:

  • If the value of \(F_{calc}\) from the sample data is larger than \(F_{crit}\), then the sample evidence favours the decision that there is a connection between the response variable and the explanatory variable. (i.e. favours \(H_1\).)

  • If the value of \(F_{calc}\) is smaller than \(F_{crit}\) then the sample evidence is consistent with no connection between the response variable and the explanatory variable. (i.e. favours \(H_0\))

The decision rule can be summarised as:

If \(F_{calc} < F_{crit}\) then favour \(H_0\), whilst if \(F_{calc} > F_{crit}\) then favour \(H_1\). This decision rule can be represented graphically as below:

The decision rule may be written as:

  • If \(F_{calc} < F_{crit}\) then accept \(H_0\)
  • If \(F_{calc} > F_{crit}\) then accept \(H_1\)

The value of \(F_{crit}\) depends on the amount of data in the sample. The Degrees of Freedom (usually abbreviated to \(df\)) express this dependence on the sample size. For any given set of sample data, having obtained the \(df\) the specific value of \(F_{crit}\) can be obtained from Statistical Tables of the F distribution, or simply by using the qf(p, df1, df2) function in R.

Stage 3: Examining the sample evidence

For the investigation of the Share_Price vs Profit we plot the scatterplot and obtain the summary of the fitted model: \(Share\_Price = b_0 + b_1Profit\)

> # create a scatterplot
> plot(Share_Price ~ Profit, cex = .6, main = "Share Price = f(Profit) + e")
> # fit the model
> model_1 <- lm(Share_Price ~ Profit)
> # add the line of the best fit on the scatterplot
> abline(model_1, lty = 2, col = 2)

> # obtain R_sq of the fitted model
> summary(model_1)$r.squared
## [1] 0.7467388

This is a good model, explaining almost \(3/4\) of the overall variability in the response variable \(Y\). Hence, based on the sample evidence, the variable Profit can explain \(74.67\%\) of the overall variation in the variable Share_Price.

💡🤓 Note that we obtain the full output of the summary(model_1) function regardless of whether FDA is needed. If we conclude that there is a clear relationship, we will need the estimates of the parameters to describe it; and if we need to investigate the significance of the relationship further, we will need the relevant statistics for conducting the FDA.

In the cases where \(R^2 \approx 0\) we will carry out a hypothesis test for which we need \(F_{calc}\) from the summary( ) function.

Let us assume that the \(R^2\) for the given example Share_Price vs Profit is very small and close to zero. In that case, we cannot simply reject it as a statistically insignificant relationship, as \(R^2\) is not zero, but neither can we with confidence accept it as a statistically valid relationship based on the sample evidence we use.

This step requires the calculation of the \(F_{calc}\) value from the sample data, and the determining of the degrees of freedom.

> summary(model_1)
## 
## Call:
## lm(formula = Share_Price ~ Profit)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -175.513  -74.826    0.107   67.824  141.358 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 258.9243    28.7465   9.007 1.29e-12 ***
## Profit        4.0567     0.3102  13.077  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 89.98 on 58 degrees of freedom
## Multiple R-squared:  0.7467, Adjusted R-squared:  0.7424 
## F-statistic:   171 on 1 and 58 DF,  p-value: < 2.2e-16

Stage 4: Conclusion

The associated F statistic, evaluating the significance of the full model, is \(F_{calc} = 171\) with \(df_1=1\) and \(df_2=58\). Thus,

> F_crit <- qf(0.95, 1, 58)
> F_crit
## [1] 4.006873

\(F_{crit} = 4.01\) and using the decision rule given, we get that

\[F_{calc}=171 > F_{crit}=4.01\text{ => accept }H_1\]

i.e. this model is statistically significant and we need to describe it.
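
Rather than reading \(F_{calc}\) off the printed output, it can also be extracted and compared with \(F_{crit}\) programmatically; a minimal sketch, assuming the model_1 object fitted above (the name f_stat is illustrative):

> f_stat <- summary(model_1)$fstatistic                          # F value and its degrees of freedom
> f_stat["value"] > qf(0.95, f_stat["numdf"], f_stat["dendf"])   # TRUE favours H1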

Describe the Relationship

If the resulting outcome from either the Initial Data Analysis (IDA) or the Further Data Analysis (FDA) is that there is a connection between the response variable and the explanatory variable then the last step is to describe this connection.

For the data analysis situation Measured v Measured this is simply stating the line of the best fit. This is a description of the connection between the two variables. Additionally the \(R^2\) value should be quoted as this gives a measure of how well the data and the line of best fit match.

The \(R^2\) value can be interpreted as a measure of the quality of predictions made from the line of best fit.

In our example Share_Price vs Profit we had the following estimations:

  • \(b_0 = 258.9243\)
  • \(b_1 = 4.0567\)

\[Share\_Price = 258.9243 + 4.0567 Profit\]

with \(R^2 = 74.67\%\).

Making predictions

Can the regression line be used to make predictions about the Share_Price for a company with a given value of the Profit?

Suppose we have to make a prediction of \(Share\_Price\) for a company with \(Profit = 137.2\).

The predictions are calculated from the model as follows:

\[Share\_Price = 258.9243 + 4.0567 \times 137.2000 = 815.5035\]

Since the \(R^2\) value is nearly \(75\%\), any predictions about Share_Price made from this model are likely to be of good quality, but there may be issues with predictions for Profit values outside the data range used to fit the model. Profit in the given data set ranges from about \(3\) to \(170\). Over this range any prediction is likely to be of good quality, since the information in the data reflects experience within this range and the \(R^2\) value is nearly \(75\%\).
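
The same prediction can be obtained directly from the fitted model using R's predict( ) function; a minimal sketch, assuming the model_1 object fitted above:

> # predicted Share_Price for a company with Profit = 137.2
> predict(model_1, newdata = data.frame(Profit = 137.2))

The summary of Profit below confirms the range of values over which such predictions are supported.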

> summary(Profit)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2.90   59.73   88.85   84.76  106.62  170.50

A prediction of Share_Price for a company whose Profit lies outside this range of experience is not likely to be very reliable, or of much value. These problems are caused by the fact that the information available about the link between Share_Price and Profit is only valid over the range of values that Profit takes in the data. There is no information about the nature of the connection outside this range, and any predictions made there must be treated with caution.

In general terms when interpreting a regression model the intercept is of little value, since it is generally out of the range of the data. The gradient term is the more important and the formal definition of the gradient provides the interpretation of this information.

The gradient is defined to be the change in \(Y\) for a unit increase in \(X\). For the model developed the gradient is \(4.06\). This is suggesting for every additional unit increase in Profit, Share_Price increases by \(£4.06\).

💡🤓 Avoid extrapolation! Refrain from predicting/interpreting the regression line for X-values outside the range of X in the data set!

YOUR TURN 👇

  1. Investigate the nature of the relationship in Share Price Study data for Share_Price vs RD and Share_Price vs Turnover.

  2. Download The Supermarket data set available at https://github.com/TanjaKec/mydata using the following link https://tanjakec.github.io/mydata/SUPERM.csv.

Create an Rmd report in which you will:

  1. Give a brief explanation of the following statistical terms:
     1. Response Variable
     2. Explanatory Variable
     3. Measured Variable
     4. Attribute Variable
  2. Provide a brief answer for each of the following questions:
     1. What is ‘Data Analysis Methodology’, and why is this needed when working with sample data?
     2. What are the statistical concepts used to investigate the relationship between a measured response variable and an attribute explanatory variable?
     3. What are the statistical concepts used to investigate the relationship between a measured response variable and a measured explanatory variable?
  3. Undertake the Data Analysis for the Supermarket Data Set to investigate the relationships between the response variable and the set of explanatory variables that you regard as worth attention. Present your point of view about the nature of the relationships and give a complete explanation, within the data analysis methodology, of this analysis.

You are expected to bring your report to the next tutorial.

Multiple Linear Regression Model

In the previous section we learnt about simple linear regression as a very simple tool for predicting a quantitative measured response variable. Its simple application, easy interpretation and predictive capabilities make it one of the most popular approaches for supervised learning. Developing a good understanding of linear regression serves as a solid base for learning how to use and adopt other, more sophisticated modelling procedures in the context of machine learning.

Additive Model

The basic idea in regression modelling is that there is a phenomenon of interest whose behaviour we seek to account for or explain. For example a company may be interested in why its sales figures have followed a particular pattern; Government officials may be interested in explaining the behaviour of unemployment statistics.

Let the phenomenon of interest be denoted by \(Y\); as discussed in the previous sections, \(Y\) is also known as the response variable, or the dependent variable. The fundamental nature of \(Y\) is that it displays variability and it is this inherent variability that the data analyst seeks to explain by using regression modelling. Intuitively, if we can account for a large amount of the movements in \(Y\) then in some sense this is good. Conversely, if we put forward an argument which explains only a very small amount of the behaviour of the phenomenon of interest then in some sense this is bad.

In trying to explain, or account for, the behaviour of \(Y\) we often build a regression model. As a first step in building such a model we specify a set of variables, called the explanatory or predictor variables, that we believe to be important in explaining the behaviour of, or the variability in, \(Y\). Specifying this set of variables is, in effect, the first step in developing, as data analysts, our viewpoint, our model, our theory. As such we are entitled to ask where such a viewpoint comes from - for example, it may be the logical outcome of some theoretical argument or it may be a replication of a previous study.

💡🤓 We do well to remember that this statement of a model is a viewpoint, a belief, a theory and need not be correct, i.e. true.

Indeed, it is the very essence of the regression model validation to judge if a viewpoint, a belief, a theory is in any sense acceptable. So, when developing a regression model all we argue is: \(Y = f(X_1, X_2, X_3, … , X_k)\), which represents a belief that the response variable \(Y\) depends upon a set of \(k\) explanatory variables \((X_1, X_2, X_3,… , X_k)\).

Specifying a set of variables is not in itself a complete model. We have to be prepared to indicate and argue more precisely how the explanatory variables are supposed to relate to \(Y\): that is, we must specify the functional form that links the set of variables to \(Y\). The simplest relationship is a linear relationship or a straight line. As we have already learnt for a single explanatory variable, \(X_1\), this is given by:

  • \(Y = b_0 + b_1X_1\) (Within this structure it is important that \(b_0\) and \(b_1\) can be interpreted clearly.)

and for a set of \(k\)-variables this is given by:

  • \(Y = b_0 + b_1X_1 + b_2X_2 +... + b_kX_k\) (Within this structure it is important that \(b_0, b_1, b_2,… , b_k\) can be interpreted clearly.)

The linear models specified so far are exact or deterministic. Such structures do not represent statistical problems. The very nature of statistical data is that there is inherent variability, in that there is some part of the response variable that we cannot explain. Thus there is a random element whose behaviour is by nature and definition unpredictable. We know that a random component is going to impact on \(Y\), but we do not know the size or sign of such random happenings.

This random/stochastic part of the model is captured as follows:

  • \(Y = b_0 + b_1X_1 + e\), for a single explanatory variable model, and

  • \(Y = b_0 + b_1X_1 + b_2X_2 + ... + b_kX_k + e\), for the general \(k\) explanatory variable model

where \(e\) is also known as the error term.

💡🤓 These expressions need to be fully understood and the role and significance of the random/stochastic element has to be fully recognised.

In order to complete the specification of the statistical regression model some assumption has to be made about the underlying process that produces \(e\). The standard approach is to adopt the following distribution structure:

\(e \sim N(0,\sigma^2)\), that is, the error term follows a normal distribution with a mean of \(0\) and a variance of \(\sigma^2\).

Knowing this, we should now be in a position to understand and identify the structure of the standard multiple linear regression model.
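
In R the same lm( ) function handles models with several explanatory variables: extra terms are simply added to the right-hand side of the formula. A minimal sketch using the Share Price case study data purely as an illustration (the name multi_model and this particular choice of explanatory variables are illustrative, not a recommended model):

> # a multiple regression of Share_Price on three measured explanatory variables
> multi_model <- lm(Share_Price ~ Profit + RD + Turnover, data = companyd)
> coef(multi_model)   # the estimates of b0, b1, b2 and b3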

Prior Knowledge

As a statement of belief, the specified model just suggests a list of important explanatory variables within a linear structure. We need to be willing to take an extra conceptual step by specifying our expectations of how the individual explanatory variables are connected to the response variable. We should make assertions about:

  1. the expected sign of each slope (or marginal response coefficient) for each explanatory variable
  2. the expected size of each slope

Again, knowledge of these two potentially tricky areas is connected to theoretical arguments and is often the result of previous similar studies. Such knowledge is sometimes called prior knowledge, or prior views. In order to give a viewpoint some evidential or empirical support:

  • a sample of data has to be collected and
  • the model has to be put to the scrutiny of rigorous diagnostic checking

That is, the proposed theory has to be tested against the available evidence.

🤓 It needs to be fully understood that the model developed so far is no more than a belief.

We may believe passionately that \(Y\) is caused by the structure put forward but we do not know the structure in that we do not know the values of the \(b\) parameters. These \(b\) values are held to exist and have particular true numerical values which we need to find out. In an attempt to give our theory/model/viewpoint empirical content we have to collect a sample of pertinent data. On the basis of the information contained in this particular data, we have to estimate the \(b\) values and draw valid conclusions about the nature of the true but unknown underlying structure which is claimed to be responsible for the behaviour of \(Y\).

Building a Model

The sample evidence, of size \(n\), is seen as coming from the following structure:

\[Y_i = b_0 + b_1X_{i1} + b_2X_{i2} + b_3X_{i3} + \dots + b_kX_{ik} + e_i, \;\;\;\;\; \text{where} \;\; i=1, 2, \dots, n\]

Selecting a sample by definition means that we have imperfect information; that is, we do not have information on the whole population, or we do not see the whole picture. We only see a possibly disjointed part of the whole picture, and thus taking a sample creates obvious difficulties in making sense of what we can see. Taking a sample introduces an interpretation problem: trying to say something sensible about the whole picture, the true model, from a part of the picture.

The data handling process involves two major steps, namely:

  1. model estimation and
  2. model validation

Model Estimation

As we have discussed, the model developed so far is a viewpoint. It represents our view as to the true underlying structure of the world. We believe that this structure exists but unfortunately, we do not know the values of the parameters. These parameters are held to exist but their true values are unknown. Hence there is a need to develop estimates of the true but unknown parameters. These estimates are based on a sample of data and it is to be hoped that they are in some sense acceptable, i.e. good.

The most common estimating approach is the principle of Ordinary Least Squares (OLS). As we have already learnt, this idea is easiest to understand when there is only one explanatory variable and the problem can be easily depicted with a scatter diagram onto which we can superimpose the line of best fit. This process of finding the line of best fit has been programmed in R through the lm() function. Problems involving more than one explanatory variable are handled in a similar fashion. As part of the model fitting process, R produces a great deal of output which should be used to judge the validity, or usefulness, of the fitted model.

Model Validation

On deriving coefficient estimates from a sample of data we end up with a fitted regression model. We must now take this estimated model and ask a series of questions to decide whether or not our estimated model is good or bad. That is, we have to subject the fitted model to a set of tests designed to check the validity of the model, which is in effect a test of our viewpoint.

Test a): Does the fitted model make sense?

This type of testing is based on prior knowledge. We need to be in a position to judge if the parameter estimates are meaningful. This comes in two guises:

  1. Do the estimated coefficients have the correct sign?
  2. Do the estimated coefficients have the correct size?

This test assesses if the marginal response coefficients display both the correct signs and sizes. To assist the analyst with these issues theoretical reasoning and information gained from past studies should be used. For example, if one was relating sales to advertising then a positive connection would be expected and a simple plot of sales against advertising would reveal information concerning the direction and strength of any sample relationship. Similarly, in a study of sales against price then a negative relationship would be expected. The issue of “correct size” is normally more troublesome. Possibly some clue as to size can be drawn from similar, past studies and this can be compared with the current estimated values.

By way of addressing this test the modeller may adopt the following approach:

  1. Fit a regression model for each of the explanatory variables one at a time. Check each one for Test a).
  2. Fit a regression model for all of the explanatory variables together. Check each explanatory variable for Test a).
  3. Check if Test a) from 1. and 2. are consistent. If Yes then great? 😅; if No then…? 😟

Test b) \(R^2 / R^2_{adjusted}\): Overall is the model a good fit?

The coefficient of determination, \(R^2\), is a single number that measures the extent to which the set of explanatory variables can explain, or account for, the variability in \(Y\). That is, how well does the set of explanatory variables explain the behaviour of the phenomenon we are trying to understand? It is looking at the set of explanatory variables as a whole and is making no judgement about the contribution or importance of any individual explanatory variable. By construction, \(R^2\) is constrained to lie in the following range:

\[0\% \text{ <---------- } R^2 \text{ ----------> } 100\%\]

Intuitively the closer \(R^2\) is to \(100\%\) then the better the model is in the sense that the set of explanatory variables can explain a lot of the variability in \(Y\). Conversely, a value of \(R^2\) close to \(0\) implies a weak, i.e. poor model, in that the set of explanatory variables can only account for a limited amount of the behaviour of \(Y\). But what about non-extreme values of \(R^2\)?

The adequacy of \(R^2\) can be judged both formally and informally.

The formal test is given by the result:

  • \(H_0: R^2 = 0\) (that is, the set of explanatory variables are insignificant, or in other words: useless)
  • \(H_1: R^2 > 0\) (that is, at least one explanatory variable is significant, or in other words: important)

and the appropriate test statistic is such that \(F_{calc} \sim F(k, (n-(k+1)))\), obtained using a data set of size \(n\) for a fitted model with \(k\) number of explanatory variables.

The mechanics of an F-test are as follows:

  • Two vital pieces of information are required for an F-test: \(F_{calc}\) and \(F_{crit}\).

  • \(F_{calc}\) is displayed within R’s standard output. \(F_{crit}\), however, is a figure derived from the degrees of freedom elements and a designated level of significance. For hypothesis testing the \(5\%\) significance level (\(\alpha = 5\%\)) is commonly used. The degrees of freedom elements are

    1. Regression: number of explanatory variables in the model
    2. Error: number of observations minus the number of estimated \(b\) coefficients in the fitted regression model
  • The all-important decision-making rule is: if \(F_{calc}\) is less than \(F_{crit}\) then the null hypothesis \(H_0\) is accepted; if \(F_{calc}\) is larger than \(F_{crit}\) then the alternative hypothesis \(H_1\) is accepted:

    if \(F_{calc} < F_{crit} => H_0\)

    if \(F_{calc} > F_{crit} => H_1\)
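By way of a small numerical sketch (the sample size, number of explanatory variables and \(F_{calc}\) below are assumed values, not taken from any particular data set), the comparison can be carried out in R as follows:

> n <- 30; k <- 3                                 # assumed values for illustration
> F_crit <- qf(0.95, df1 = k, df2 = n - (k + 1))  # critical value at the 5% level
> F_calc <- 7.2                                   # assumed value; in practice read from summary() output
> F_calc > F_crit                                 # TRUE => accept H1, FALSE => accept H0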

This formal test involves a rather weak alternative hypothesis, which says only that \(R^2\) is significantly bigger than \(0\). If \(H_1\) is accepted we will have to make a judgement, without the aid of any further formal test, about the usefulness of \(R^2\) and hence the model being studied.

\(R^2\) Adjusted

\(R^2\) can also be used in comparing competing models. If two or more models have been put forward to explain the same response variable then a reasonable rule is to rank the explanatory power of the models in terms of their \(R^2\) values. Thus the model with the highest \(R^2\) value would be the best model.

The information gathered from the individual regressions carried out in Test a) can be used to give an initial judgement and ranking of the relative importance of the various explanatory variables.

However this rule should be used carefully. It is a valid rule when the competing models have the same number of explanatory variables. It is not valid when comparing models that have different numbers of explanatory variables. It is intuitively obvious that a model with more explanatory variables will have a better chance of explaining the \(Y\) variable than a model with fewer variables. Thus the highest \(R^2\) rule is biased in favour of those models with more explanatory variables even when some of these explanatory variables are not very useful. In an attempt to redress the imbalance when comparing models with different numbers of explanatory variables, a different version of \(R^2\) called \(R^2_{adjusted}\) has been developed as follows:

\[\bar{R}^2 = 1 - \frac{(n-1)}{(n-(k+1))}(1-R^2)\] Note that the mathematically adopted way of writing \(R^2_{adjusted}\) is \(\bar{R}^2\).

By examining \(\bar{R}^2\) it can be seen that when \(k\) goes up it is possible for the value of \(\bar{R}^2\) to fall. Thus this statistic gives a better way of comparing models with different values for \(k\). As before the rule for comparing models is straightforward: choose the model with the highest \(\bar{R}^2\).
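The formula above translates directly into a small helper function; the values passed to it below are purely illustrative:

> # Adjusted R-squared from R-squared, sample size n and number of explanatory variables k
> adj_r2 <- function(r2, n, k) 1 - ((n - 1) / (n - (k + 1))) * (1 - r2)
> adj_r2(r2 = 0.45, n = 100, k = 5)   # assumed values for illustration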

Test c) t-tests: Individually, are the explanatory variables important?

Using \(R^2\) to judge the ‘goodness’ of the set of explanatory variables does not tell us anything about the importance of any one single explanatory variable. Just because a set of variables is important does not necessarily mean that each individual variable contributes towards explaining the behaviour of \(Y\). What is needed is a test that enables us to check the validity of each variable one at a time. Such a test procedure is available as follows:

  • \(H_0: b_i = 0\) (explanatory variable \(i\) is not important)
  • \(H_1: b_i > 0\) (explanatory variable \(i\) has a positive influence) or
    • \(H_1: b_i < 0\) (explanatory variable \(i\) has a negative influence) or
    • \(H_1: b_i \neq 0\) (explanatory variable \(i\) has an influence)

The appropriate \(H_1\) for any particular variable is crucially connected with the prior view as to the nature of the connection between any one proposed explanatory variable and the response variable.

The appropriate test statistic involves a t-test based on \((n-(k+1))\) degrees of freedom.

The critical region, associated with \(H_1\), depends crucially on the nature of the alternative hypothesis put forward by the data analyst. \(t_{calc}\) is obtained from the R output. \(t_{crit}\) is determined by the error degrees of freedom and the level of test significance, and is the benchmark used to decide whether to accept or reject the null hypothesis.

If \(t_{calc}\) lies in the critical region then the alternative hypothesis is accepted and the modeller accepts that the explanatory variable does have an influence on the behaviour of \(Y\). If this is not the case and \(t_{calc}\) lies in the \(H_0\) area, we will accept that the explanatory variable is useless and that it should be eliminated from the model.

The hypothesis testing methodology typically adopts the \(\alpha = 5\%\) level of significance and can be illustrated as follows:
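By way of a numerical sketch (the sample size and number of explanatory variables below are assumed purely for illustration), the critical values at the \(5\%\) level can be obtained from the t distribution in R:

> n <- 30; k <- 3              # assumed values
> qt(0.95, df = n - (k + 1))   # one-tailed test, H1: b_i > 0
> qt(0.05, df = n - (k + 1))   # one-tailed test, H1: b_i < 0 (negative critical value)
> qt(0.975, df = n - (k + 1))  # two-tailed test, H1: b_i != 0 (use +/- this value)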

Once all the variables have been individually analysed, the regression should be re-fitted using all the remaining acceptable explanatory variables in order to ascertain the best fitted model.

Some of the decision-making outcomes from the series of t-tests may sit uncomfortably with the information from other tests, and we may have to experiment with various competing fitted models, using subtle combinations of prior views, \(R^2\) and \(\bar{R}^2\) values, and F-test and t-test outcomes, before selecting a best model or a set of equally good models.

There are many other, more sophisticated, tests that can be used to judge the validity, i.e. usefulness of a fitted regression model. Before we familiarise ourselves with them and in order to appreciate their practicality and effectiveness, we will conduct an analysis for fitting a multiple regression model based on the procedure explained above.

\(p\)-Value Approach for Hypothesis Testing

The \(p\)-value is the probability of obtaining the observed value of the test statistic, or a more extreme one, assuming that the null hypothesis is true. Thus, the \(p\)-value approach involves judging “likely” or “unlikely” by calculating the probability, assuming the null hypothesis is true, of observing a test statistic at least as extreme, in the direction of the alternative hypothesis, as the one observed.

  • if the \(p\)-value \(\geqslant\) \(\alpha\), accept \(H_0\) (the observed result is “likely” under \(H_0\))
  • if the \(p\)-value \(<\) \(\alpha\), reject \(H_0\) in favour of \(H_1\) (the observed result is “unlikely” under \(H_0\))

As such, in the context of inference about the regression parameters, the \(p\)-values help determine whether the relationships that you observe in your sample also exist in the larger population. The \(p\)-value for each explanatory variable tests the null hypothesis that the variable has no effect on the response variable; a large \(p\)-value indicates that there is insufficient evidence to conclude that an effect exists at the population level.
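As a minimal sketch of this decision rule (the \(p\)-value below is an assumed value, not taken from any output in this section):

> p_value <- 0.003   # assumed value; in practice read from the Pr(>|t|) column of summary()
> alpha <- 0.05
> p_value < alpha    # TRUE => reject H0 in favour of H1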

Case Study

carData::Salaries: The 2008-09 nine-month academic salary for Assistant Professors, Associate Professors and Professors in a college in the U.S. The data was collected as part of the on-going effort of the college’s administration to monitor salary differences between male and female faculty members.

Format: A data frame with 397 observations on the following 6 variables:

  • rank: a factor with levels:
    • AssocProf
    • AsstProf
    • Prof
  • discipline: a factor with levels:
    • A (“theoretical” departments) or
    • B (“applied” departments)
  • yrs.since.phd: years since PhD
  • yrs.service: years of service
  • sex: a factor with levels
    • Female
    • Male
  • salary: nine-month salary, in dollars

The first set of questions is:

  • which of the measured variables is the response variable?
  • which are the explanatory variables?
  • are the explanatory variables measured or attribute, or a mixture of both?

Considering that the data was collected for the purpose of analysing academic salary, we identify salary as the response variable. The remaining variables can be used as possible factors that may influence the behaviour of the response variable.

The “Standard” regression approach assumes modelling a relationship between measured response and measured explanatory variables. However, often when building multiple regression models we do not want to be limited to the choice of only measured explanatory variables. We also want to be able to include attribute explanatory variables in the multiple regression modelling. In consequence, it is important to learn about supplementary steps that are required to make such models interpretable.

Dummy Variables

As we have learnt previously, attribute variables are vectors of class factor in R. This vector is encoded numerically, with information about the levels of the variable saved in the levels attribute.

> # If you don't have carData installed yet, uncomment and run the line below
> # install.packages("carData")
> library(carData)
> data(Salaries)
> attach(Salaries)
> class(sex)
## [1] "factor"
> unclass(sex)
##   [1] 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 1 2 2 2 2 1 2 2 2 2 2 2 2 2 2 1 1 2
##  [38] 2 2 2 2 2 2 2 2 2 2 1 1 2 2 2 1 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 1 2 2 2 2 2
##  [75] 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2
## [112] 2 2 2 1 2 2 2 2 1 2 2 2 1 2 2 2 1 2 2 2 2 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [149] 1 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2
## [186] 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2
## [223] 2 2 2 2 2 2 2 2 1 1 2 1 2 2 2 1 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 1 1 2 2 2 2
## [260] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [297] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 1
## [334] 2 1 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 1 2 2 2 2 2 2 2 2
## [371] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## attr(,"levels")
## [1] "Female" "Male"

The output shows that the factor is stored internally as a vector of integer codes (1 for Female, 2 for Male), with the level labels kept in the levels attribute. Before such a variable can enter a regression model, R converts it into an appropriate set of contrasts; the default contrasts differ for unordered and ordered factors, and it is the unordered case that applies in our context.

To explicitly identify the coding of the factor, i.e. dummy variable used by R, we can use the contrasts() function.

> contrasts(sex)
##        Male
## Female    0
## Male      1
> contrasts(discipline)
##   B
## A 0
## B 1
> contrasts(rank)
##           AssocProf Prof
## AsstProf          0    0
## AssocProf         1    0
## Prof              0    1

Note that the applied “contr.treatment” coding takes only the values 0 or 1, and that for an attribute variable with k levels it creates k-1 dummy variables. One could argue that the printout of the function would be more informative if each printed column were headed by an indexed letter \(d\). For example:

  • attribute variable sex, where k = 2

        attribute    \(d\)
        Female        0
        Male          1

  • attribute variable rank, where k = 3

        attribute    \(d_1\)   \(d_2\)
        AsstProf        0         0
        AssocProf       1         0
        Prof            0         1

There are many different ways of coding attribute variables besides the dummy variable approach explained here. All of these approaches lead to equivalent model fits; what differs are the coefficients, i.e. the model parameters, as they are arranged to measure particular contrasts and therefore require different interpretations. The 0/1 coding implemented in R’s default contr.treatment contrast offers a straightforward interpretation of the associated parameter in the model, which is often not the case when implementing other contrasts.

Interpreting coefficients of attribute variables

In the case of measured predictors, we are comfortable with interpreting the linear model coefficient as a slope, which tells us the change in the response (outcome) per unit increase in the explanatory variable. This is not necessarily the right interpretation for attribute explanatory variables.

> # average salary values for each sex group
> suppressPackageStartupMessages(library(dplyr))
> Salaries %>% 
+     select(salary, sex) %>%   
+     group_by(sex) %>% 
+     summarise(mean=mean(salary))

Note that the unclass() function used earlier removes the attributes of the sex variable and prints it using the default print method, allowing for easier examination of the internal structure of the object.

However, when using an attribute variable in a linear regression model it would make no sense to treat it as a measured explanatory variable simply because its levels are stored as numbers. In the context of linear modelling we need to code it so as to represent the levels of the attribute.

Two-level attribute variables are very easy to code. We simply create an indicator or dummy variable that takes on two possible dummy numerical values. Consider the sex indicator variable. We can code this using a dummy variable \(d\): \[\begin{equation} d=\left\{ \begin{array}{@{}ll@{}} 0, & \text{if female} \\ 1, & \text{if male} \end{array}\right. \end{equation}\]

💡🤓 This is the default coding used in R. The value zero is assigned to the level that comes first alphabetically, unless this is changed, for example by using the relevel() function or by specifying the levels of the factor variable explicitly.
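For instance, a sketch of changing the baseline level (assuming the Salaries data is loaded as above):

> sex2 <- relevel(Salaries$sex, ref = "Male")   # make Male the reference level
> contrasts(sex2)                               # Female now carries the value 1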

For a simple regression model of salary versus sex:

\[salary = b_0 + b_1sex + e,\]

this results in the model

\[\begin{equation} salary_i = b_0 + b_1sex_i + e_i=\left\{ \begin{array}{@{}ll@{}} b_0 + b_1 \times 1 + e_i = b_0 + b_1 + e_i, & \text{if the person is male} \\ b_0 + b_1 \times 0 + e_i = b_0 + e_i, & \text{if the person is female} \end{array}\right. \end{equation}\]

where \(b_0\) can be interpreted as the average \(\text{salary}\) for females, and \(b_0 + b_1\) as the average \(\text{salary}\) for males. The value of \(b_1\) represents the average difference in \(\text{salary}\) between females and males.

We can conclude that dealing with an attribute variable with two levels in a linear model is straightforward. In this case, a dummy variable indicates whether an observation has a particular characteristic: yes/no. We can observe it as a “switch” in a model, as this dummy variable can only assume the values 0 and 1, where 0 indicates the absence of the effect, and 1 indicates the presence. The values 0/1 can be seen as off/on.
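As a quick check of this 0/1 “switch” coding (a small sketch; the object name d is our own, and Salaries is assumed to be loaded as above):

> d <- ifelse(Salaries$sex == "Male", 1, 0)   # manual 0/1 coding of the sex factor
> table(d, Salaries$sex)                      # confirms 0 corresponds to Female and 1 to Male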

The way in which R codes dummy variables is controlled by the contrast option:

> options("contrasts")
## $contrasts
##         unordered           ordered 
## "contr.treatment"      "contr.poly"

If we obtain the mean salary for each sex group we find that for female professors the average salary is \(\$101,002\) and for male professors the average is \(\$115,090\), a difference of \(\$14,088\). If we now look at the parameters of the regression model for salary vs sex, where females are coded as zero and males as one, we get exactly the same information: the intercept is the female average and the coefficient is the estimated difference in average salary between the two groups.
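A small sketch verifying this equivalence directly (assuming the Salaries data is still loaded):

> # difference in group means...
> with(Salaries, mean(salary[sex == "Male"]) - mean(salary[sex == "Female"]))
> # ...equals the sexMale coefficient of the simple regression
> coef(lm(salary ~ sex, data = Salaries))["sexMale"]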

> # regression model 
> lm(salary ~  sex)
## 
## Call:
## lm(formula = salary ~ sex)
## 
## Coefficients:
## (Intercept)      sexMale  
##      101002        14088

For more on this topic check the following link: Categorical variables and interaction terms in linear regression

Fitting a Multivariate Regression Model

We are interested in the extent to which variation in the response variable salary is associated with variation in the explanatory variables available in the Salaries data set; that is, we want to fit a multiple linear regression model to the given data. The model we wish to construct should contain enough terms to explain the relations in the data, and at the same time be simple enough to understand, explain to others, and use.

For convenience we will adopt the following notation:

  • \(y\): salary
  • \(x_1\): yrs.since.phd
  • \(x_2\): yrs.service
  • \(x_3\): discipline
  • \(x_4\): sex
  • \(x_5\): rank

Generally, in multiple regression we have a continuous response variable and two or more continuous explanatory variables; however, in this data set we have three attribute variables that we also wish to include as explanatory variables in the model.

Next, we need to specify the model that embodies our mechanistic understanding of the factors involved and the way they are related to the response variable. It would make sense to expect that all of the available \(x\) variables may impact the behaviour of \(y\); thus the model we wish to build should reflect our viewpoint, i.e. \(y = f(x_1, x_2, x_3, x_4, x_5)\):

\[y = b_0 + b_1x_1 + b_2x_2 + b_3x_3 + b_4x_4 + b_5x_5 + e\]

Our viewpoint states a belief that all explanatory variables have a positive impact on the response; for example, more years in service is expected to be associated with a higher salary.

Our objective now is to determine the values of the parameters in the model that lead to the best fit of the model to the data. That is, we are not only trying to estimate the parameters of the model, but we are also seeking the minimal adequate model to describe the data.

💡🤓 The best model is the model that produces the least unexplained variation following the principle of parsimony rather than complexity. That is the model should have as few parameters as possible, subject to the constraint that the parameters in the model should all be statistically significant.

For regression modelling in R we use the lm() function, which fits a linear model assuming normal errors and constant variance. We specify the model by a formula that uses the operators +, -, *, / and ^, which take on meanings different from their ordinary arithmetic ones. But before we dive into statistical modelling of the given data, we need to take the first step and conduct the most fundamental task of the data analysis procedure: Get to Know Our Data.
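As a quick reference for the formula operators just mentioned (schematic only; y, x1 and x2 are placeholder names rather than variables from our data set):

> # y ~ x1 + x2        additive model with two explanatory variables
> # y ~ x1 : x2        the interaction between x1 and x2 only
> # y ~ x1 * x2        x1 + x2 + x1:x2, i.e. main effects plus their interaction
> # y ~ (x1 + x2)^2    all main effects and interactions up to second order
> # y ~ x1 - 1         removes the intercept (equivalently y ~ 0 + x1)
> # y ~ x1 + I(x1^2)   arithmetic inside a formula must be wrapped in I()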

Examining multivariate data is intrinsically more challenging than examining univariate and bivariate data. To get the most in-depth view of multivariate data behaviour we construct scatter plot matrices, which display all pairwise relationships. In R, scatter plot matrices are produced by the pairs() function, which is part of the default graphics package. Since we wish to include attribute variables in our analysis, we are going to use the GGally::ggpairs() function, which produces a pairwise comparison of multivariate data for both data types: measured and attribute.

> # If you don't have GGally installed yet, uncomment and run the line below
> # install.packages("GGally")
> suppressPackageStartupMessages(library(GGally))
> ggpairs(Salaries)

This is an information-rich visualisation that includes pairwise relationships of all the variables we want to consider for our model. By focusing on the last column of the plots, we can notice an influence from all explanatory variables on the response, except perhaps for discipline and sex. We also notice an unbalanced representation of the groups for the variables rank and sex, but for the purpose of our practice in fitting a multi-factor model this is not too problematic. We need to be especially concerned with the extent of correlation between the explanatory variables; of particular interest to us is the high multicollinearity between rank, yrs.since.phd and yrs.service, which arises when variables are highly linearly related. As a consequence, we will need to keep an eye on the significance of using all of these variables in the model.

Fitting the Model

There are no fixed rules when fitting linear models, but there are adopted standards that have proven to work well in practice. We start off by fitting a maximal model and then carry on simplifying it by removing non-significant explanatory variables. This needs to be done with caution, making sure that the simplifications make good scientific sense and do not lead to significant reductions in explanatory power. Although this should be the adopted strategy for fitting a model, it does not guarantee finding all the important structures in a complex data frame.

We can summarise our model building procedure algorithm as follows:

  1. Fit the maximal model that includes all the variables
    • Assess the overall significance of the model by checking how big the \(R^2\)/\(\bar{R}^2\) is. If statistically significant, carry on with the model fitting procedure, otherwise stop (F-test)
  2. Remove the least significant terms one at a time
    • Check variables’ \(t_{calculated}\) values and perform a one-tailed or two-tailed t-test depending on your prior view
    • If the deletion causes an insignificant change in \(\bar{R}^{2}\), leave that term out of the model
  3. Keep removing terms from the model until the model contains nothing but significant terms.
> # model_1 <- lm(salary ~ yrs.since.phd + yrs.service + discipline + sex + rank, data = Salaries) #long handed way
> model_1 <- lm(salary ~ ., data = Salaries) # full stop, . , implies: all other variables in data that do not already appear in the formula
> summary(model_1)
## 
## Call:
## lm(formula = salary ~ ., data = Salaries)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -65248 -13211  -1775  10384  99592 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    65955.2     4588.6  14.374  < 2e-16 ***
## rankAssocProf  12907.6     4145.3   3.114  0.00198 ** 
## rankProf       45066.0     4237.5  10.635  < 2e-16 ***
## disciplineB    14417.6     2342.9   6.154 1.88e-09 ***
## yrs.since.phd    535.1      241.0   2.220  0.02698 *  
## yrs.service     -489.5      211.9  -2.310  0.02143 *  
## sexMale         4783.5     3858.7   1.240  0.21584    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 22540 on 390 degrees of freedom
## Multiple R-squared:  0.4547, Adjusted R-squared:  0.4463 
## F-statistic:  54.2 on 6 and 390 DF,  p-value: < 2.2e-16
  1. Overall, is the model a good fit? How big is the \(R^2\)/\(\bar{R}^2\)?

The \(R^2 = 45.47\%\) and the \(\bar{R}^2 = 44.63\%\) are well above the value of zero allowing us to accept this as a valid model without having to formally test it to assess its statistical significance. It manages to explain almost half of the variability in the response variable salary.

  2. Individually, are the explanatory variables important?

We identify the sex explanatory variable as clearly not significant, which is in line with the conclusion we could draw from the boxplot in the pairwise comparison plot for salary vs. sex. We will remove it to begin the process of model simplification.

> #model_1 <- lm(salary ~ yrs.since.phd + yrs.service + discipline + sex + rank, data = Salaries) # long handed method
> model_2 <- update(model_1,~. - sex) # refitting by removing the least significant term
> summary(model_2)
## 
## Call:
## lm(formula = salary ~ rank + discipline + yrs.since.phd + yrs.service, 
##     data = Salaries)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -65244 -13498  -1455   9638  99682 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    69869.0     3332.1  20.968  < 2e-16 ***
## rankAssocProf  12831.5     4147.7   3.094  0.00212 ** 
## rankProf       45287.7     4236.7  10.689  < 2e-16 ***
## disciplineB    14505.2     2343.4   6.190 1.52e-09 ***
## yrs.since.phd    534.6      241.2   2.217  0.02720 *  
## yrs.service     -476.7      211.8  -2.250  0.02497 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 22550 on 391 degrees of freedom
## Multiple R-squared:  0.4525, Adjusted R-squared:  0.4455 
## F-statistic: 64.64 on 5 and 391 DF,  p-value: < 2.2e-16

We note a slight reduction in \(\bar{R}^2\) from \(44.63\%\) to \(44.55\%\), which we can regard as an insignificant decrease. The next step is to check the coefficients and assess the effect of the remaining variables. We identify yrs.since.phd and yrs.service as the least influential in explaining the variability of salary. To illustrate how to assess their effect formally, we will conduct the t-test for the yrs.since.phd variable:

  • \(H_0: b_{ysp} = 0\)
  • \(H_1: b_{ysp} > 0\)

The decision rule, at the \(\alpha = 5\%\) level of significance, is:

  • If \(t_{calc} < t_{crit} => H_0\)
  • If \(t_{calc} > t_{crit} => H_1\)
> qt(0.95, 391)
## [1] 1.64876

As \(t_{calc} = 2.217 > t_{crit} = 1.64876 \Rightarrow H_1\), we keep the variable in the model. We therefore stop the model simplification here and focus on the interpretation of the fitted model.

The structure of our final fitted model is:

\[y = b_0 + b_1x_1 + b_2x_2 + b_3x_3 + b_4x_4 + e,\] where

  • \(y\): salary
  • \(x_1\): yrs.since.phd
  • \(x_2\): yrs.service
  • \(x_3\): discipline
  • \(x_4\): rank

We can take a closer look at the coefficients of our fitted model:

> coef(model_2)
##   (Intercept) rankAssocProf      rankProf   disciplineB yrs.since.phd 
##    69869.0110    12831.5375    45287.6890    14505.1514      534.6313 
##   yrs.service 
##     -476.7179

Examining the output we realise that R has created three “sub” dummy variables for the variable rank: \[ dr_1 = \begin{cases} 1 & \text{rank is AsstProf} \\ 0 & \text{for rank is not AsstProf} \end{cases} \]

\[ dr_2 = \begin{cases} 1 & \text{rank is AssocProf} \\ 0 & \text{rank is not AssocProf} \end{cases} \]

\[ dr_3 = \begin{cases} 1 & \text{rank is Prof} \\ 0 & \text{rank is not Prof} \end{cases} \]

It has chosen to use the model: \[ y = b_0 + b_1dr_2 + b_2dr_3 + b_3d_1 + b_4x_1 + b_5x_2 + e, \]

where:

  • \(y\) is salary
  • \(x_1\) is yrs.since.phd
  • \(x_2\) is yrs.service
  • \(dr_2\) and \(dr_3\) are the dummy variables defined above for the purpose of coding variable rank
  • \(d_1\) is a dummy variable used in the coding of variable discipline as explained earlier

Note that R does not need to use \(dr_1\) to create the three models; it needs only two dummy variables, because the level coded by \(dr_1\) (AsstProf) is used as the reference level, also known as the baseline. This subsequently allows R to create three sub-models relating to the rank variable:

  • AsstProf: \(y = b_0 + b_3d_1 + b_4x_1 + b_5x_2 + e\)
  • AssocProf: \(y = (b_0 + b_1) + b_3d_1 + b_4x_1 + b_5x_2 + e\)
  • Prof: \(y = (b_0 + b_2) + b_3d_1 + b_4x_1 + b_5x_2 + e\)

telling us that:

  • \(b_0\) is the average salary for an Assistant Professor who works in a “theoretical” department and \(b_0 + b_3\) is the average salary for an Assistant Professor who works in an “applied” department
  • \((b_0 + b_1)\) is the average salary for an Associate Professor who works in a “theoretical” department and \((b_0 + b_1) + b_3\) is the average salary for an Associate Professor who works in an “applied” department
  • \((b_0 + b_2)\) is the average salary for a Professor who works in a “theoretical” department and \((b_0 + b_2) + b_3\) is the average salary for a Professor who works in an “applied” department

Learning this we can make an interpretation of our final fitted model as follows:

  • For every additional year since PhD (yrs.since.phd), salary will on average go up by \(\$534.63\), assuming the rest of the variables in the model are held fixed
  • For every additional year in service (yrs.service), salary will on average go down by \(\$476.72\), assuming the rest of the variables in the model are held fixed
  • The average salary of an Assistant Professor (rank: AsstProf) who works in a “theoretical” department is \(\$69,869.01\), and of one who works in an “applied” department \(\$84,374.16\); this will vary with the number of years in service and the number of years since PhD
  • The average salary of an Associate Professor (rank: AssocProf) who works in a “theoretical” department is \(\$82,700.55\), and of one who works in an “applied” department \(\$97,205.70\); this will vary with the number of years in service and the number of years since PhD
  • The average salary of a Professor (rank: Prof) who works in a “theoretical” department is \(\$115,156.70\), and of one who works in an “applied” department \(\$129,661.85\); this will vary with the number of years in service and the number of years since PhD
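These average-salary figures can be reproduced directly from the fitted coefficients (a quick sketch, assuming model_2 from above is still available):

> b <- coef(model_2)
> b["(Intercept)"]                                          # AsstProf, "theoretical" department
> b["(Intercept)"] + b["disciplineB"]                       # AsstProf, "applied" department
> b["(Intercept)"] + b["rankAssocProf"]                     # AssocProf, "theoretical" department
> b["(Intercept)"] + b["rankAssocProf"] + b["disciplineB"]  # AssocProf, "applied" department
> b["(Intercept)"] + b["rankProf"]                          # Prof, "theoretical" department
> b["(Intercept)"] + b["rankProf"] + b["disciplineB"]       # Prof, "applied" department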

This model explains around \(45\%\) of the variability in the response variable salary.

Adding ~ 0 to the lm() formula enables R to suppress the intercept. Note that if we remove the intercept, we can directly obtain all “three intercepts”, without a baseline level, in the final fitted model:

> model_2_1 <- lm(salary ~  0 + rank + discipline + yrs.since.phd + yrs.service)
> summary(model_2_1)
## 
## Call:
## lm(formula = salary ~ 0 + rank + discipline + yrs.since.phd + 
##     yrs.service)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -65244 -13498  -1455   9638  99682 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## rankAsstProf   69869.0     3332.1  20.968  < 2e-16 ***
## rankAssocProf  82700.5     3916.7  21.115  < 2e-16 ***
## rankProf      115156.7     4350.9  26.467  < 2e-16 ***
## disciplineB    14505.2     2343.4   6.190 1.52e-09 ***
## yrs.since.phd    534.6      241.2   2.217   0.0272 *  
## yrs.service     -476.7      211.8  -2.250   0.0250 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 22550 on 391 degrees of freedom
## Multiple R-squared:  0.9638, Adjusted R-squared:  0.9633 
## F-statistic:  1736 on 6 and 391 DF,  p-value: < 2.2e-16

Adding Complexity

So far we have seen that by fitting an additive regression model to the data, we aim to identify and understand how the value of a dependent response variable changes when any one of the independent explanatory variables is changed while the other independent variables stay the same. This is a restrictive form of a model as it only allows for linear relationships between the response and the explanatory variables, and the way in which one explanatory variable affects the response is the same for any value of the other explanatory variables used in the model.

We need to add flexibility to overcome these limitations. Introducing the concept of interaction allows linear models to be used for non-linear relationships and for situations in which the effect of one explanatory variable on the response differs for different values of another explanatory variable. This brings more complexity into the multivariate regression model.

Case Study: The Quality of Red Bordeaux Vintages

To illustrate the multivariate model fitting procedure with interactions we are going to use wine.csv available from Eduardo García Portugués’s book: Notes for Predictive Modeling. The dataset is formed by the auction Price of 27 red Bordeaux vintages, five vintage descriptors (WinterRain, AGST, HarvestRain, Age, Year), and the population of France in the year of the vintage, FrancePop.

  • Year: year in which grapes were harvested to make wine
  • Price: logarithm of the average market price for Bordeaux vintages according to 1990–1991 auctions. The price is relative to the price of the 1961 vintage, regarded as the best one ever recorded
  • WinterRain: winter rainfall (in mm)
  • AGST: Average Growing Season Temperature (in degrees Celsius)
  • HarvestRain: harvest rainfall (in mm)
  • Age: age of the wine measured as the number of years stored in a cask
  • FrancePop: population of France at Year (in thousands)

We would like to analyse the quality of a vintage that has been quantified as the price and make the interpretation of our statistical findings.

💡🤓 Don’t forget!!! 🤔 First things first: Get to Know Data 🤓

We will start this analysis by examining the pairwise plot.

> wine = read.csv("https://raw.githubusercontent.com/egarpor/handy/master/datasets/wine.csv")
> summary(wine)
##       Year          Price         WinterRain         AGST        HarvestRain   
##  Min.   :1952   Min.   :6.205   Min.   :376.0   Min.   :14.98   Min.   : 38.0  
##  1st Qu.:1960   1st Qu.:6.508   1st Qu.:543.5   1st Qu.:16.15   1st Qu.: 88.0  
##  Median :1967   Median :6.984   Median :600.0   Median :16.42   Median :123.0  
##  Mean   :1967   Mean   :7.042   Mean   :608.4   Mean   :16.48   Mean   :144.8  
##  3rd Qu.:1974   3rd Qu.:7.441   3rd Qu.:705.5   3rd Qu.:17.01   3rd Qu.:185.5  
##  Max.   :1980   Max.   :8.494   Max.   :830.0   Max.   :17.65   Max.   :292.0  
##       Age          FrancePop    
##  Min.   : 3.00   Min.   :43184  
##  1st Qu.: 9.50   1st Qu.:46856  
##  Median :16.00   Median :50650  
##  Mean   :16.19   Mean   :50085  
##  3rd Qu.:22.50   3rd Qu.:53511  
##  Max.   :31.00   Max.   :55110
> ggpairs(wine)

What conclusions can we draw?

  • We can notice a perfect relationship between the variables Year and Age. This is to be expected since this data was collected in 1983 and Age was calculated as: Age = 1983 - Year. Knowing this, we are going to remove Year from the analysis and use Age as it will be easier to interpret.

  • There is a strong relationship between Year, i.e. Age, and FrancePop, and since we wish to impose our viewpoint that the total population does not influence the quality of wine, we will not consider this variable in the model.

  • We are going to investigate possible interactions between the rainfall (WinterRain) and growing season temperature (AGST). In R this will be created automatically by using the * operator.

Let us build a model:

We will start with the most complicated model that includes the highest-order interaction.

💡🤓 In R we will specify the three-way interaction, which will automatically add all combinations of two-way interactions.

> model1 <- lm(Price ~ WinterRain + AGST + HarvestRain + Age + WinterRain * AGST * HarvestRain, data = wine)
> summary(model1)
## 
## Call:
## lm(formula = Price ~ WinterRain + AGST + HarvestRain + Age + 
##     WinterRain * AGST * HarvestRain, data = wine)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.35058 -0.19462 -0.02645  0.17194  0.52079 
## 
## Coefficients:
##                               Estimate Std. Error t value Pr(>|t|)   
## (Intercept)                  8.582e+00  1.924e+01   0.446   0.6609   
## WinterRain                  -1.858e-02  2.896e-02  -0.642   0.5292   
## AGST                        -1.748e-01  1.137e+00  -0.154   0.8795   
## HarvestRain                 -4.713e-02  1.540e-01  -0.306   0.7631   
## Age                          2.476e-02  8.288e-03   2.987   0.0079 **
## WinterRain:AGST              1.272e-03  1.712e-03   0.743   0.4671   
## WinterRain:HarvestRain       7.836e-05  2.600e-04   0.301   0.7665   
## AGST:HarvestRain             3.059e-03  9.079e-03   0.337   0.7401   
## WinterRain:AGST:HarvestRain -5.446e-06  1.540e-05  -0.354   0.7278   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2833 on 18 degrees of freedom
## Multiple R-squared:  0.8621, Adjusted R-squared:  0.8007 
## F-statistic: 14.06 on 8 and 18 DF,  p-value: 2.675e-06

The model explains well over \(80\%\) of variability and is clearly a strong model, but the key question is whether we can simplify it. We will start the process of this model simplification by removing the three-way interaction as it is clearly not significant.

> model2 <- update(model1, ~. -WinterRain:AGST:HarvestRain, data =wine)
> summary(model2)
## 
## Call:
## lm(formula = Price ~ WinterRain + AGST + HarvestRain + Age + 
##     WinterRain:AGST + WinterRain:HarvestRain + AGST:HarvestRain, 
##     data = wine)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.35245 -0.19452  0.01643  0.17289  0.51420 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)   
## (Intercept)             2.980e+00  1.066e+01   0.279  0.78293   
## WinterRain             -9.699e-03  1.408e-02  -0.689  0.49930   
## AGST                    1.542e-01  6.383e-01   0.242  0.81168   
## HarvestRain             6.496e-03  2.610e-02   0.249  0.80610   
## Age                     2.441e-02  8.037e-03   3.037  0.00678 **
## WinterRain:AGST         7.490e-04  8.420e-04   0.890  0.38484   
## WinterRain:HarvestRain -1.350e-05  7.338e-06  -1.840  0.08144 . 
## AGST:HarvestRain       -1.032e-04  1.520e-03  -0.068  0.94656   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2767 on 19 degrees of freedom
## Multiple R-squared:  0.8611, Adjusted R-squared:  0.8099 
## F-statistic: 16.83 on 7 and 19 DF,  p-value: 6.523e-07

The \(\bar{R}^2\) has slightly increased in value. Next, we remove the least significant two-way interaction term.

> model3 <- update(model2, ~. -AGST:HarvestRain, data = wine)
> summary(model3)
## 
## Call:
## lm(formula = Price ~ WinterRain + AGST + HarvestRain + Age + 
##     WinterRain:AGST + WinterRain:HarvestRain, data = wine)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.35424 -0.19343  0.01176  0.17161  0.51218 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)   
## (Intercept)             3.518e+00  6.946e+00   0.507  0.61802   
## WinterRain             -1.017e-02  1.195e-02  -0.851  0.40488   
## AGST                    1.218e-01  4.138e-01   0.294  0.77147   
## HarvestRain             4.752e-03  4.553e-03   1.044  0.30901   
## Age                     2.451e-02  7.710e-03   3.179  0.00472 **
## WinterRain:AGST         7.769e-04  7.166e-04   1.084  0.29119   
## WinterRain:HarvestRain -1.342e-05  7.059e-06  -1.902  0.07174 . 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2697 on 20 degrees of freedom
## Multiple R-squared:  0.8611, Adjusted R-squared:  0.8194 
## F-statistic: 20.66 on 6 and 20 DF,  p-value: 1.35e-07

Again, it is reassuring to notice an increase in the \(\bar{R}^2\), but we can still simplify the model further by removing the next least significant two-way interaction term.

> model4 <- update(model3, ~. -WinterRain:AGST, data = wine)
> summary(model4)
## 
## Call:
## lm(formula = Price ~ WinterRain + AGST + HarvestRain + Age + 
##     WinterRain:HarvestRain, data = wine)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.5032 -0.1934  0.0109  0.1771  0.4621 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            -3.812e+00  1.598e+00  -2.386 0.026553 *  
## WinterRain              2.747e-03  9.471e-04   2.900 0.008560 ** 
## AGST                    5.586e-01  9.495e-02   5.883 7.71e-06 ***
## HarvestRain             4.717e-03  4.572e-03   1.032 0.313877    
## Age                     2.785e-02  7.094e-03   3.926 0.000774 ***
## WinterRain:HarvestRain -1.349e-05  7.088e-06  -1.903 0.070835 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2708 on 21 degrees of freedom
## Multiple R-squared:  0.8529, Adjusted R-squared:  0.8179 
## F-statistic: 24.35 on 5 and 21 DF,  p-value: 4.438e-08

There is an insignificant decrease in \(\bar{R}^2\). We notice that HarvestRain is now the least significant main-effect term, but it is involved in the WinterRain:HarvestRain interaction, which is significant at the \(\alpha = 10\%\) level (two-sided \(p \approx 0.07\)), and therefore we should keep it. However, as the concept of parsimony prefers a model without interactions to a model containing interactions between variables, we will remove the remaining interaction term and see whether doing so significantly affects the explanatory power of the model.

> model5 <- update(model4, ~. -WinterRain:HarvestRain, data = wine)
> summary(model5)
## 
## Call:
## lm(formula = Price ~ WinterRain + AGST + HarvestRain + Age, data = wine)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.46024 -0.23862  0.01347  0.18601  0.53443 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -3.6515703  1.6880876  -2.163  0.04167 *  
## WinterRain   0.0011667  0.0004820   2.420  0.02421 *  
## AGST         0.6163916  0.0951747   6.476 1.63e-06 ***
## HarvestRain -0.0038606  0.0008075  -4.781 8.97e-05 ***
## Age          0.0238480  0.0071667   3.328  0.00305 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2865 on 22 degrees of freedom
## Multiple R-squared:  0.8275, Adjusted R-squared:  0.7962 
## F-statistic: 26.39 on 4 and 22 DF,  p-value: 4.057e-08

The \(\bar{R}^2\) is reduced by around \(2\%\), but the model now contains only significant terms and is easier to interpret. For those reasons, and in the spirit of parsimony, which argues that a model should be as simple as possible, we can suggest that this should be regarded as the best final fitted model.

We realise that with large numbers of explanatory variables, and many interactions and non-linear terms, the process of model simplification can take a very long time. There are many algorithms for automatic variable selection that can help us to choose the variables to include in a regression model. Stepwise regression and Best Subsets regression are two of the more common variable selection methods.

The stepwise procedure starts from the saturated model (or the maximal model, whichever is appropriate) and proceeds through a series of simplifications to the minimal adequate model. This progression is made on the basis of deletion tests: F-tests, AIC, t-tests or chi-squared tests that assess the significance of the increase in deviance that results when a given term is removed from the current model.
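As a brief sketch, a backward stepwise search guided by AIC can be run with base R’s step() function, here starting from the maximal wine model fitted above (offered as an illustration of the idea rather than as part of the original analysis):

> model_step <- step(model1, direction = "backward", trace = 0)  # trace = 0 suppresses the step-by-step printout
> summary(model_step)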

The best subset regression (BREG), also known as “all possible regressions”, as the name of the procedure indicates, fits a separate least squares regression for each possible combination of the \(p\) predictors, i.e. explanatory variables. After fitting all of the models, BREG then displays the best fitted models with one explanatory variable, two explanatory variables, three explanatory variables, and so on. Usually, either adjusted R-squared or Mallows Cp is the criterion for picking the best fitting models for this process. The result is a display of the best fitted models of different sizes up to the full/maximal model and the final fitted model can be selected by comparing displayed models based on the criteria of parsimony.
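A corresponding sketch of best subsets regression can be produced with the leaps package (assumed to be installed); again, this is illustrative rather than part of the original analysis:

> # install.packages("leaps")   # if not already installed
> library(leaps)
> breg <- regsubsets(Price ~ WinterRain + AGST + HarvestRain + Age, data = wine)
> summary(breg)$adjr2   # adjusted R-squared of the best model of each size
> summary(breg)$cp      # Mallows Cp of the best model of each size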

“These methods are frequently abused by naive researchers who seek to interpret the order of entry of variables into the regression equation as an index of their ‘importance.’ This practice is potentially misleading.” J. Fox and S. Weisberg, An R Companion to Applied Regression, Third Edition, Sage (2019)

💡🤓 When selecting a model one should remember the important concept of parsimony!

As M.J. Crawley points out in his well-known editions of “The R Book”, we need to remember that models are portrayals of phenomena that should be both “accurate and convenient”, and that the principle of parsimony is an essential tool for model exploration. As he suggests: “just because we go to the trouble of measuring something does not mean we have to have it in our model.”

“Parsimony says that, other things being equal, we prefer:

  • a model with \(n−1\) parameters to a model with n parameters
  • a model with \(k−1\) explanatory variables to a model with k explanatory variables
  • a linear model to a model which is curved
  • a model without a hump to a model with a hump
  • a model without interactions to a model containing interactions between factors” Crawley, M.J. 2013, The R Book. 2nd Edition. John Wiley, New York

YOUR TURN 👇

  1. Go back to the Salaries data Case Study:
     • Adding ~ 0 to the lm() formula enables R to suppress the intercept. Try to fit the following model by removing the intercept.
model_2_1 <- lm(salary ~  0 + rank + discipline + yrs.since.phd + yrs.service)
summary(model_2_1)

This will allow you to obtain all “three intercepts” without a reference level.

  • Does this model differ from the previously fitted model_2? Provide justified explanation

  • Interpret the model

     • Can the final fitted model developed for the Salaries data Case Study be further simplified? Justify your answer.
  2. Use the Prestige data available from the carData package of datasets designed to accompany J. Fox and S. Weisberg’s book An R Companion to Applied Regression, Third Edition, Sage (2019). Fit a multivariate model that explains variation in the response variable prestige, explaining the reasons behind the steps taken, with appropriate interpretation of the final fitted model.

carData::Prestige: The Canadian occupational prestige data, where the observations are occupations. Justification for treating the 102 occupations as a sample implicitly rests on the claim that they are “typical” of the population, at least with respect to the relationship between prestige and income.

  • education: Average education of occupational incumbents, years, in 1971.

  • income: Average income of incumbents, dollars, in 1971.

  • women: Percentage of incumbents who are women.

  • prestige: Pineo-Porter prestige score for occupation, from a social survey conducted in the mid-1960s.

  • census: Canadian Census occupational code.

  • type: Type of occupation. A factor with levels (note: out of order):

    • bc: Blue Collar
    • prof: Professional, Managerial, and Technical
    • wc: White Collar
  3. The dataset Prostate, available in the package lasso2, contains information on 97 men who were about to receive a radical prostatectomy. This data set comes from a study examining the correlation between the prostate specific antigen (lpsa) and a number of other clinical measures.
  • Select the best subset model to predict lpsa using all the available predictors.

Creative Commons Attribution-ShareAlike 4.0 International License.