Earlier we looked at how to examine sample data to investigate the nature of any relationship that may exist between a measured response variable and an attribute explanatory variable. This section examines how to investigate the nature of any relationship that may exist between a measured response variable and a measured explanatory variable.
The first step is to have a clear idea of what is meant by a connection between the response variable and the explanatory variable. The next step is to use some simple sample descriptive statistics to have a first look at the nature of the link between the response variable and the explanatory variable. This simple approach will lead to one of three conclusions, namely, on the basis of the sample information:
This step is called the Initial Data Analysis or the IDA.
If the IDA suggests that Further Data Analysis is required, then this step seeks one of two conclusions:
As we have already seen in the previous sections, this process can be represented diagrammatically as:
The Data-Analysis Methodology seeks to find the answer to the following key question:
The final outcome is one of two conclusions:
There is no evidence of a relationship, labelled as the ‘No’ outcome in the diagram above, in which case the analysis is finished.
There is evidence of a relationship, labelled as the ‘Yes’ outcome in the diagram above, in which case the nature of the relationship needs to be described.
The first step is to have a clear idea of what is meant by a connection between a measured response variable and a measured explanatory variable. Imagine a population under study consisting of a very large number of population members, and on each population member two measurements are made, the value of \(Y\) the response variable and the value of \(X\) the explanatory variable. For the whole population a graph of \(Y\) against \(X\) could be plotted conceptually. If the graph looked as in the diagram below, then there is quite clearly a link between Y and X. If the value of \(X\) is known, the exact value of Y can be read off the graph. This is an unlikely scenario in the data-analysis context, because the relationship shown is a deterministic relationship. Deterministic means that if the value of \(X\) is known then the value of \(Y\) can be precisely determined from the relationship between \(Y\) and \(X\).
When analysing the relationship between the two measured variable we start off by creating a scatter plot. A scatter plot is a graph that shows one axis for the explanatory variable commonly known in the regression modelling as a predictor and labelled with \(X\), and one axis for the response variable, which is labelled with \(Y\) and commonly known as the outcome variable. Thus, each point on the graph represents a single \((X, Y)\) pair. The primary benefit is that the possible relationship between the two variables can be viewed and analysed with one glance and often the nature of a relationship can be determined quickly and easily.
Let us consider a few scatter plots. The following graph represents a perfect linear relationship. All points lie exactly on a straight line. It is easy in this situation to determine the intercept and the slope, i.e. gradient and hence specify the exact mathematical link between the response variable \(Y\) and the explanatory variable \(X\).
The relationship shown in Graph 2 shows clearly that as the value of \(X\) increases the value of \(Y\) indecreases, but not exactly on a straight line as in the previous scatter plot. This is showing a statistical link, as the value of the explanatory variable \(X\) increases the value of the response variable \(Y\) also tends to increase. An explanation for this is that the response \(Y\) may depend on a number of different variables, say \(X_1\), \(X_2\), \(X_3\), \(X_4\), \(X_5\), \(X_6\) etc. which could be written as:
\(Y = f(X_1, X_2, X_3, X_4, X_5, X_6, ...)\)
If the nature of the link between \(Y\) and \(X\) is under investigation then this could be represented as:
\[Y = f(X) + \text{effect of all other variables}\]
💡🤓The effect of all other variables is commonly abbreviated to e.
Graph 1 shows a link where the effect of all the other variables is nil. The response \(Y\) depends solely on the variable \(X\), Graph 2 shows a situation where \(Y\) depends on \(X\) but the other variables also have an influence.
Consider the model:
\[Y = f(X) + e\] Remember 😃, \(\text{e is the effect of all other variables}\)!
The influence on the response variable \(Y\) can be thought of as being made up of two components:
Or in more abbreviated forms:
The Total Variation in \(Y\) is made up of two components:
Which may be written as:
\[\text{The Total Variation in Y} = \text{Explained Variation} + \text{Unexplained Variation}\]
In Graph 1 the Unexplained Variation is nil, since the value of \(Y\) is completely determined by the value of \(X\). In Graph 2 the Explained Variation is large relative to the Unexplained Variation, since the value of \(Y\) is very largely influenced by the value of \(X\).
Consider Graph 3. Here there is no discernible pattern, and the value of Y seems to be unrelated to the value of X.
If \(Y\) is not related to \(X\) the Explained Variation component is zero and all the changes in \(Y\) are due to the other variables, that is the Unexplained Variation.
Finally, consider Graphs 4 & 5 below:
Graph 4 shows a similar picture to Graph 2, the difference being that as the value of \(X\) increases the value of \(Y\) decreases. The value of \(Y\) is influenced by the value of \(X\), so the Explained Variation is high relative to the Unexplained Variation. Consider the last graph, Graph 5, which is a deterministic relationship. The value of \(Y\) is completely specified by the value of \(X\). Hence the Unexplained Variation is zero.
Graphs Summary:
Graph 1 | Graph 2 | Graph 3 | Graph 4 | Graph 5 | |
---|---|---|---|---|---|
Explained Variation in Y | All | High | Zero | High | All |
Unexplained Variation in Y | Zero | Low | All | Low | Zero |
In regression the discussion started with the following idea: \[Y = f(X) + e\] And to quantify the strength of the link the influence on Y was broken down into two components: i. \[\text{The Total Variation in Y} = \text{Explained Variation} + \text{Unexplained Variation}\]
This presents two issues:
💡🤓 The simplest form of connection is a straight-line relationship and the question arises could a straight-line relationship be matched with the information contained in the graphs 1 -5?
\[Y = a + bX\] where:
a is the intercept and
b is the gradient
For the statistical relationships as shown in Graphs 2 & 4:
Can the intercept and gradient be measured?
Can the values of the three quantities The Total Variation in Y, Explained Variation and The Unexplained Variation be measured?
It is sufficient to work out any two since:
\[\text{The Total Variation in Y} = \text{Explained Variation} + \text{Unexplained Variation}\]
Fitting a line by eye is subjective. It is unlikely that any two analysts will draw exactly the same line, hence the intercept and gradient will be slightly different from one person to the next. What is needed is an agreed method that will provide an estimate of the intercept and the gradient.
Consider the simple numerical example below:
Suppose we would like to fit a straight-line relationship to the following data:
\(X\) | 1 | 2 | 3 | 4 | 5 | 6 |
---|---|---|---|---|---|---|
\(Y\) | 7 | 8 | 12 | 13 | 14 | 18 |
The problem is to use this information to measure the intercept and the gradient for this data set.
A simple way to do this is to draw what is considered to be the line of best fit by judgement or guesswork! 😁😉
The intercept can be read off the graph as approximately \(5\), and the gradient can be simply measured since if \(X = 0\) then \(Y = 5\), and if \(X = 5\) then \(Y = 15\), so for a change in \(X\) of \(5\) units (from 0 to 5) \(Y\) changes by \(10\), from \(5\) to \(15\). The definition of the gradient is The change in Y for a unit increase in X hence the gradient for this data set is \(10/5 = 2\).
The straight-line relationship obtained by this process is \[\hat{Y} = 5 + 2*X\].
Note, \(\hat{Y}\) is a notation for the
value of \(Y\) as predicted by the
straight-line relationship.
We can add more information about the predicted values of \(Y\) by the “estimated” model to our table, so that we have
X | Y | \(\hat{Y}\) | \({Y - \hat{Y}}\) |
---|---|---|---|
1 | 7 | 7.00 | 0.00 |
2 | 8 | 9.00 | -1.00 |
3 | 12 | 11.00 | 1.00 |
4 | 13 | 13.00 | 0.00 |
5 | 14 | 15.00 | -1.00 |
6 | 18 | 17.00 | 1.00 |
Looking at the information contained in the previous plot, the column headed \(\hat{Y}\) contains the predicted values of \(Y\) for the values of \(X\). For example, the first value of \(\hat{Y}\) is when \(X = 1\) and \(Yp = 5 + 2*X\) so \(\hat{Y} = 5 +2*1 = 7\). The column headed \((Y - \hat{Y})\) is the difference between the actual value and the value predicted by the line. For example when \(X = 1\), \(Y = 7\) and \(\hat{Y} = 7\) so the predicted value lies on the line, as can be seen in the graph. For the value \(X = 2\) the actual value lies \(1\) unit below the line, as can also be seen from the graph.
The column \((Y - \hat{Y})\) measures the disagreement between the actual data and the line and a sensible strategy is to make this level of disagreement as small as possible. Referring to the graph on the table above notice that sometimes the actual data value is below the line and sometimes it is above the line, so on average the value will be close to zero. In this particular example the sum of the \((Y - \hat{Y})\) values add up to zero (in a more conventional notation \(\Sigma(Y - \hat{Y}) = 0\)). The quantity \(\Sigma(Y - \hat{Y})\) does not seem to be a satisfactory measure of disagreement, because there are a number of different lines with the property \(\Sigma(Y - \hat{Y}) = 0\).
A way of obtaining a satisfactory measure of disagreement is to square the individual \((Y - \hat{Y})\) values and add them up. i.e. obtain the quantity \(\Sigma(Y - \hat{Y})^2\). The result is always a positive number since the square of a negative number is positive. If this quantity is then chosen to be as small as possible then the level of disagreement between the actual data points and the fitted line is the least. This provides a criterion for the choice of the best line.
The quantity \(\Sigma(Y - \hat{Y})^2\) can be easily calculated and added to our table
X | Y | \(\hat{Y}\) | \({Y - \hat{Y}}\) | \((Y - \hat{Y})^2\) |
---|---|---|---|---|
1 | 7 | 7.00 | 0.00 | 0 |
2 | 8 | 9.00 | -1.00 | 1 |
3 | 12 | 11.00 | 1.00 | 1 |
4 | 13 | 13.00 | 0.00 | 0 |
5 | 14 | 15.00 | -1.00 | 1 |
6 | 18 | 17.00 | 1.00 | 1 |
The quantity \(\Sigma(Y - \hat{Y})^2\) is a measure of the disagreement between the actual \(Y\) values and the values predicted by the line. If this value is chosen to be as small as possible then the disagreement between the actual Y values and the line is the smallest it could possibly be, hence the line is The line of Best Fit.
This procedure of finding the intercept and the gradient of a line that makes the quantity \(\Sigma(Y - \hat{Y})^2\) a minimum is called The Method of Least Squares.
The Method of Least Squares was developed by K. F Gauss (1777 - 1855) a German mathematician, Gauss originated the ideas and a Russian mathematician A. A. Markov (1856 - 1922) developed the method.
In R we use the lm( )
function to fit linear models as
illustrated below.
> x <- c(1:6)
> y <- c(7, 8, 12, 13, 14, 18)
> plot(x, y, pch = 16, main = "y = f(x) + e", xlim = c(0.25, 6), ylim = c(0, 20))
> abline(lm(y~x), lwd = 2, lty = 2, col = 2)
The final issue is to find out how to measure the three quantities:
Taking these quantities one at a time they can be measured as follows:
This turns out to be very simple to measure. The quantity: \(\Sigma(Y - \hat{Y})^2\) is a measure of the Unexplained Variation. If the line were a perfect fit to the data, the value predicted by the line and the actual value would be exactly the same and the value of the quantity \(\Sigma(Y - \hat{Y})^2\) would be zero. This quantity is a measure of the disagreement between the actual \(Y\) values and the predicted values \(\hat{Y}\), which are also known as the residuals and are measuring the Unexplained Variation in \(Y\). For the example used above the value of \(\Sigma(Y - \hat{Y})^2\) is \(3.77\).
This is related to the measures of variability (spread) introduced earlier in the course and in particular to the standard deviation (\(\sigma\)). To measure The Total Variation in Y requires a measure of spread.
The Total Variation in Y is defined to be the quantity: \(\Sigma(Y - \bar{Y})^2\) Where \(\bar{Y}\) is the average value of \(Y\) (\(\bar{Y} = \Sigma(Y)/n\)).
In our earlier example \(\bar{Y} = 12\), so we can expand the table to include this calculation
X | Y | \(\hat{Y}\) | \({Y - \hat{Y}}\) | \((Y - \hat{Y})^2\) | \(\Sigma(Y - \bar{Y})^2\) |
---|---|---|---|---|---|
1 | 7 | 7.00 | 0.00 | 0 | 25.00 |
2 | 8 | 9.00 | -1.00 | 1 | 16.00 |
3 | 12 | 11.00 | 1.00 | 1 | 0.00 |
4 | 13 | 13.00 | 0.00 | 0 | 1.00 |
5 | 14 | 15.00 | -1.00 | 1 | 4.00 |
6 | 18 | 17.00 | 1.00 | 1 | 36.00 |
giving \(\Sigma(Y - \bar{Y})^2 = 82\).
If the line was a perfect fit, then the \(Y\) values and the \(\hat{Y}\) values would be exactly the same, and the quantity \((\hat{Y} - \bar{Y})^2\) would measure The Total Variation in Y. If the line is not a perfect match to the actual \(Y\) values then this quantity measures The Explained Variation in Y.
X | Y | \(\hat{Y}\) | \({Y - \hat{Y}}\) | \((Y - \hat{Y})^2\) | \(\Sigma(Y - \bar{Y})^2\) | \((\hat{Y} - \bar{Y})^2\) |
---|---|---|---|---|---|---|
1 | 7 | 7.00 | 0.00 | 0 | 25.00 | 27.94 |
2 | 8 | 9.00 | -1.00 | 1 | 16.00 | 10.06 |
3 | 12 | 11.00 | 1.00 | 1 | 0.00 | 1.12 |
4 | 13 | 13.00 | 0.00 | 0 | 1.00 | 1.12 |
5 | 14 | 15.00 | -1.00 | 1 | 4.00 | 10.06 |
6 | 18 | 17.00 | 1.00 | 1 | 36.00 | 27.94 |
incorporating this calculation into the table above will enable us to get \((\hat{Y} - \bar{Y})^2 = 78.23\).
Once the intercept and slope have been estimated using the method of least squares, various indices are studied to determine the reliability of these estimates. One of the most popular of these reliability indices is the correlation coefficient. A sample correlation coefficient, more specifically known as the Pearson Product Moment correlation coefficient, denoted \(r\), has possible values between \(-1\) and \(+1\), as illustrated in the diagram below.
In fact, the correlation is a parameter of the bivariate normal distribution, that is used to describe the association between two variables, which does not include a cause and effect statement. That is, one variable does not depend on the other, i.e. the variables are not labelled as dependent and independent. Rather, they are considered as two random variables that seem to vary together. Hence, it is important to recognise that correlation does not imply causation. In correlation analysis, both \(Y\) and \(X\) are assumed to be random variables, whiles in linear regression, \(Y\) is assumed to be a random variable and \(X\) is assumed to be a fixed variable.
Main characteristic of the correlation coefficient \(r\) are:
BUT!!!
💡🤓The Spearman rank correlation coefficient is an corresponding nonparametric equivalent of the Pearson correlation coefficient. This statistic is computed by replacing the data values with their ranks and pplying the Pearson correlation formula to the ranks of the data. Tied values are replaced with the average rank of the ties. Just like in the case of Pearson coefficient, this one is also really a measure of association rather than correlation, since the ranks are unchanged by a monotonic transformation of the original data. For the sample sizes greater than 10, the distribution of the Spearman rank correlation coefficient can be approximated by the distribution of the Pearson correlation coefficient. It is also worth knowing that the Spearman rank correlation coefficient uses the weights when weights are specified.
We realise that when fitting a regression model we are seeking to find out how much variance is explained, or is accounted for, by the explanatory variable \(X\) in an outcome variable \(Y\).
In the earlier example above the following has been calculated:
Notice the relationship given below, is satisfied
\[\text{The Total Variation in Y = Explained Variation + Unexplained Variation}\]
\[82.00 = 78.23 + 3.77\] What do these quantities tell us? They are difficult to interpret because they are expressed in the units of the problem.
Consider the following ratio:
\({\text{The Explained Variation in Y} \over \text{The Total Variation in Y}} = {78.23 \over 82.00} = 0.954\)
This is saying that \(0.954\) or \(95.4\%\) of the changes in \(Y\) are explained by changes in \(X\). This is a useful and useable measure of the effectiveness of the match between the actual \(Y\) values and the predicted \(Y\) values.
Reviewing the five scatter prolos (Graphs 1 to 5) it can easily be seen that if the line is a perfect fit to the actual \(Y\) values as in Graphs 1 & 5 then this ratio will have the value \(1\) or \(100\%\).
If there is no link between \(Y\) and \(X\) then the The Explained Variation is zero hence the ratio will be \(0\) or \(0\%\). An example of this is shown in Graph 3.
The remaining graphs: Graphs 2 & 4 show a statistical relationship hence this ratio will lie between \(0\) & \(1\). The closer the ratio is to zero the less strong the link is, whilst the closer the ratio is to \(1\) the stronger the connection is between \(X\) & \(Y\).
The Ratio:
\[R^2 = {\text{The Explained Variation in Y} \over \text{The Total Variation in Y}}\]
is called the Coefficient of Determination, and usually labelled as \(R^2\), and may be defined as the proportion of the changes in \(Y\) explained by changes in \(X\).
This ratio is always on the scale \(0\) to \(1\), but by convention is usually expressed as a percentage, so is regarded as on the scale \(0\) to \(100\%\). The interpretation of this ratio is as follows:
The theoretical minimum \(R^2\) is \(0\). However, since linear regression is based on the best possible fit, \(R^2\) will always be greater than zero, even when the predictor and outcome variables bear no relationship to one another. The definition and interpretation of \(R^2\) is a very important tool in the data analyst’s tool kit for tracking connections between a measured response variable and a measured explanatory variable.
When working with sample data to investigate any relationships that may exist between a measured response variable and a measured explanatory variable, the information contained within the sample is imperfect, so has to be interpreted in the light of sampling error. This is particularly important when interpreting the value of \(R^2\) calculated from sample data. The sample R2 value gives a measure of the strength of the connection, and this can sometimes be difficult to interpret particularly if the sample size is small.
The data analysis methodology as set out earlier, is a procedure that enables you to make judgements from sample data. The methodology requires you to know what specific procedures make up the Initial Data Analysis IDA and how to interpret the results of the IDA to obtain one of three outcomes:
If Further Data Analysis, FDA, is required then you need to know what constitutes this further analysis and how to interpret it.
Finally, if a connection between the response variable and the explanatory variable is detected then a description of the connection is required. In this case the connection needs to be described.
To demonstrate the data analysis methodology we’ll go back to Share Price Study Case Study.
To go back to our model interpretation. Earlier, we said after examining the scatter plot that there is a a clear link between Share_Price and Profit. This is confirmed by the value of \(R^2 = 74.67\%\). The interpretation of \(R^2\) is suggesting that \(74.67\%\) of the changes in Share_Price are explained by changes in Profit. Alternatively, \(25.33\%\) of the changes in Share_Price are due to other variables.
For this example the outcome of the IDA is that there is clear evidence of a link between Share_Price and Profit and the nature of the influence being that as Profit increases Share_Price also increases. It only remains to describe the connection, and discuss how effective the model is, and ask if it can be used to predict Share_Price value from the value of Profit? 🤔
The adequacy of \(R^2\) can be judged both informally and formally using hypothesis testing. Like all hypothesis tests, this test is carried out in four stages:
Stage 1: Specify the hypotheses (\(H_0\) & \(H_1\)).
The Coefficient of Determination \(R^2\) is a useful quantity. By definition if \(R^2 = 0\) then there is no connection between the response variable and the explanatory variable. Conversely if the value of \(R^2\) is greater than zero there must be a connection. This enables the formal hypotheses to be defined as:
Stage 2: Defining the test parameters and the decision rule.
The decision rule is based on the F statistic. The F distribution has a shape as shown below:
The value of \(F_{crit}\) is the value that divides the distribution with \(95\%\) of the area to the left of the \(F_{crit}\) ordinate, and \(5\%\) to the right. The accepted terminology is \(F_{crit}\) but this text will call this quantity \(F_{calc}\) to signify it is a quantity obtained from statistical tables.
The decision rule is:
If the value of \(F_{calc}\) from the sample data is larger than \(F_{crit}\), then the sample evidence favours the decision that there is a connection between the response variable and the explanatory variable. (i.e. favours \(H_1\).)
If the value of \(F_{calc}\) is smaller than \(F_{crit}\) then the sample evidence is consistent with no connection between the response variable and the explanatory variable. (i.e. favours \(H_0\))
The decision rule can be summarised as:
If \(F_{calc} < F_{crit}\) then favour \(H_0\), whilst if \(F_{calc} > F_{crit}\) then favour \(H_1\). This decision rule can be represented graphically as below:
The decision rule maybe written as:
The value of \(F_{crit}\) depends on
the amount of data in the sample. The Degrees of Freedom
(usually abbreviated to \(df\)) express
this dependence on the sample size. For any given set of sample data,
having obtained the \(df\) the specific
value of \(F_{crit}\) can be obtained
from Statistical Tables of the F distribution, or simply by
using the qf(p, df1, df2)
function in R.
Stage 3: Examining the sample evidence
For the investigation of the Share_Price vs Profit we plot the scatterplot and obtain the summary of the fitted model: \(Share\_Price = b_0 + b_1Profit\)
> # create a scatterplot
> plot(Share_Price ~ Profit, cex = .6, main = "Share Price = f(Profit) + e")
> # fit the model
> model_1 <- lm(Share_Price ~ Profit)
> # add the line of the best fit on the scatterplot
> abline(model_1, lty = 2, col = 2)
> # obtain R_sq of the fitted model
> summary(model_1)$r.squared
## [1] 0.7467388
This is a good model explaining almost \(3/4\) of the overall variability in the explanatory variable \(Y\). Hence, based on the sample evidence, variable Profit, can explain \(74.67\%\) of overall variation in the variable Share_Price.
💡🤓 Note, that if we don’t reject that there is a relationship between the two variables we will obtain full output of the
summary(lm(model_1))
function regardless of the need for FDA. If we conclude that there is a clear relationship, we will need the estimates of the parameters to describe it, and if we need further to investigate the significance of the relationship we need relevant statistics for conducting the FDA.
In the cases where \(R^2 \approx 0\)
we will carry out a hypothesis test for which we need \(F_{calc}\) from the summary( )
function.
Let us assume that the \(R^2\) for the given example Share_Price vs Profit is very small and close to zero. In that case, we cannot simply reject it as a statistically insignificant relationship, as \(R^2\) is not zero, but neither can we with confidence accept it as a statistically valid relationship based on the sample evidence we use.
This step requires the calculation of the \(F_{calc}\) value from the sample data, and the determining of the degrees of freedom.
> summary(model_1)
##
## Call:
## lm(formula = Share_Price ~ Profit)
##
## Residuals:
## Min 1Q Median 3Q Max
## -175.513 -74.826 0.107 67.824 141.358
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 258.9243 28.7465 9.007 1.29e-12 ***
## Profit 4.0567 0.3102 13.077 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 89.98 on 58 degrees of freedom
## Multiple R-squared: 0.7467, Adjusted R-squared: 0.7424
## F-statistic: 171 on 1 and 58 DF, p-value: < 2.2e-16
Stage 4: Conclusion
The associated F statistic evaluating the significance of the full model \(F_{calc} = 171\) with \(df_1=1\) and \(df_2=58\). Thus,
> F_crit <- qf(0.95, 1, 58)
> F_crit
## [1] 4.006873
\(F_{crit} = 4.01\) and using the decision rule given, we get that
\[F_{calc}=171 > F_{crit}=4.01\text{ => accept }H_1\] , ie this model is statistically significant and we need to describe it.
Describe the Relationship
If the resulting outcome from either the Initial Data Analysis (IDA) or the Further Data Analysis (FDA) is that there is a connection between the response variable and the explanatory variable then the last step is to describe this connection.
For the data analysis situation Measured v Measured this is simply stating the line of the best fit. This is a description of the connection between the two variables. Additionally the \(R^2\) value should be quoted as this gives a measure of how well the data and the line of best fit match.
The \(R^2\) value can be interpreted as a measure of the quality of predictions made from the line of best fit according to the rule of thumb:
In our example Share_Price vs Profit we had the following estimations:
\[Share\_Price = 258.9243 + 4.0567 Profit\]
with \(R^2 = 74.67\%\).
Making predictions
Can the regression line be used to make predictions about the Share_Price for a company with a given value of the Profit?
Suppose we have to make a prediction of the \(Share\_{Price}\) for company with the \(Profit = 137.2\)
The predictions are calculated from the model as follows:
\[Share\_Price = 258.9243 + 4.0567 \times 137.2000 = 815.5035\] Since the \(R^2\) value is nearly \(75\%\) any predictions about Share_Price made from this model are likely to be of good quality, but there may be some issues with some of these predictions of the Profit values out of the data range used in prediction. The Profit in the given data set range from about \(3\) to \(170\). Over this range any prediction is likely to be of good quality since the information in the data is reflecting experience within this range, and the \(R^2\) value is nearly \(75\%\).
> summary(Profit)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.90 59.73 88.85 84.76 106.62 170.50
The Share_Price for the company with the Profit that lies outside the range of experience is not likely to be very reliable, or of much value. These problems are caused by the fact that the information available about the link between Share_Price and Profit is only valid over the range of values that Profit takes in the data. There is no information about the nature of the connection outside this range and any predictions made must be treated with caution.
In general terms when interpreting a regression model the intercept is of little value, since it is generally out of the range of the data. The gradient term is the more important and the formal definition of the gradient provides the interpretation of this information.
The gradient is defined to be the change in \(Y\) for a unit increase in \(X\). For the model developed the gradient is \(4.06\). This is suggesting for every additional unit increase in Profit, Share_Price increases by \(£4.06\).
💡🤓 Avoid extrapolation! Refrain from predicting/interpreting the regression line for X-values outside the range of X in the data set!
Investigate the nature of the relationship in Share Price Study
data for Share_Price vs RD
and
Share_Price vs Turnover
.
Download The Supermarket data set available at https://github.com/TanjaKec/mydata using the following link https://tanjakec.github.io/mydata/SUPERM.csv.
Crate an Rmd report in which you will:
You are expected to bring your report to into the next tutorial.
In the previous section we learnt about simple linear regression as a very simple tool for predicting quantitative measured response variables. Its simple application, easy interpretation and predictive capabilities make it one of the most popular approaches for supervised learning. Developing good understanding of linear regression serves as a solid base for learning how to use and adopt other more sophisticated modelling procedures in the context of machine learning.
The basic idea in regression modelling is that there is a phenomenon of interest whose behaviour we seek to account for or explain. For example a company may be interested in why its sales figures have followed a particular pattern; Government officials may be interested in explaining the behaviour of unemployment statistics.
Let the phenomenon of interest be denoted by \(Y\); as discussed in the previous sections, \(Y\) is also known as the response variable, or the dependent variable. The fundamental nature of \(Y\) is that it displays variability and it is this inherent variability that the data analyst seeks to explain by using regression modelling. Intuitively, if we can account for a large amount of the movements in \(Y\) then in some sense this is good. Conversely, if we put forward an argument which explains only a very small amount of the behaviour of the phenomenon of interest then in some sense this is bad.
In trying to explain, or account for the behaviour of \(Y\) we often build a regression model. As a first step in building such a model we specify a set of variables, called the explanatory or predictor variables, that we believe to be important in explaining the behaviour of, or the variability in \(Y\). Specifying this set of variables is, in effect, the first step in developing as the data analysts our viewpoint, our model, our theory. As such we are entitled to ask where such a viewpoint comes from - for example, it may be the logical outcome of some theoretical argument or it may be a replication of a previous study.
💡🤓 We do well to remember that this statement of a model is a viewpoint, a belief, a theory and need not be correct, i.e. true.
Indeed, it is the very essence of the regression model validation to judge if a viewpoint, a belief, a theory is in any sense acceptable. So, when developing a regression model all we argue is: \(Y = f(X_1, X_2, X_3, … , X_k)\), which represents a belief that the response variable \(Y\) depends upon a set of \(k\) explanatory variables \((X_1, X_2, X_3,… , X_k)\).
Specifying a set of variables is not in itself a complete model. We have to be prepared to indicate and argue more precisely how the explanatory variables are supposed to relate to \(Y\): that is, we must specify the functional form that links the set of variables to \(Y\). The simplest relationship is a linear relationship or a straight line. As we have already learnt for a single explanatory variable, \(X_1\), this is given by:
and for a set of \(k\)-variables this is given by:
The linear models specified so far are exact or deterministic. Such structures do not represent statistical problems. The very nature of statistical data is that there is inherent variability in that there is some part of the responses variable that we cannot explain. Thus there is a random element whose behaviour is by nature and definition unpredictable. We know that a random component is going to impact on \(Y\), but we do not know the size or sign of such random happenings.
This random/stochastic part of the model is captured as follows:
\(Y = b_0 + b_1X_1 + e\), for a single explanatory variable model, and
\(Y = b_0 + b_1X_1 + b_2X_2 + ... + b_kX_k + e\), for the general \(k\) explanatory variable model
where \(e\) is also known as the error term.
💡🤓 These expressions need to be fully understood and the role and significance of the random/stochastic element has to be fully recognised.
In order to complete the specification of the statistical regression model some assumption has to be made about the underlying process that produces \(e\). The standard approach is to adopt the following distribution structure:
\(e \sim N(0,\sigma^2)\), with the error term from a normal distribution with a mean of \(0\), and a variance of \(\sigma^2\).
Knowing this, we should now be in a position to understand and identify the structure of the standard multiple linear regression model.
As a statement of belief, the specified model just suggests a list of important explanatory variables within a linear structure. We need to be willing to take an extra conceptual step by specifying our expectation on how the individual explanatory variables are connected to the response variable. We should make assertions about:
Again knowledge of these two potential tricky areas are connected to theoretical arguments and are often the results of previous similar studies. Such knowledge is sometimes called prior knowledge, or prior views. In order to give a viewpoint some evidential or empirical support is required:
That is, the proposed theory has to be tested against the available evidence.
🤓 It needs to be fully understood that the model developed so far is no more than a belief.
We may believe passionately that \(Y\) is caused by the structure put forward but we do not know the structure in that we do not know the values of the \(b\) parameters. These \(b\) values are held to exist and have particular true numerical values which we need to find out. In an attempt to give our theory/model/viewpoint empirical content we have to collect a sample of pertinent data. On the basis of the information contained in this particular data, we have to estimate the \(b\) values and draw valid conclusions about the nature of the true but unknown underlying structure which is claimed to be responsible for the behaviour of \(Y\).
The sample, of size \(n\), evidence is seen as coming from the following structure:
\[Y_i = b_0 + b_1X_{i1} + b_2X_{i2} + b_3X_{i3} + … + b_kX_{ik} + e_i, \;\;\;\;\; where \;\; i=1, 2, ...n\] Selecting a sample by definition means that we have imperfect information, that is we do not have information on the whole population or we do not see the whole picture. We only see a possibly disjointed part of the whole picture and thus taking a sample creates obvious difficulties in making sense of what we can see. Taking a sample introduces an interpretation problem of trying to say something sensible about the whole picture, the true model, from a part of the picture.
The data handling process involves two major steps, namely:
As we have discussed, the model developed so far is a viewpoint. It represents our view as to the true underlying structure of the world. We believe that this structure exists but unfortunately, we do not know the values of the parameters. These parameters are held to exist but their true values are unknown. Hence there is a need to develop estimates of the true but unknown parameters. These estimates are based on a sample of data and it is to be hoped that they are in some sense acceptable, i.e. good.
The most common estimating approach is the principle of
Ordinary Least Squares (OLS). As we have
already learnt, this idea is easiest to understand when there is only
one explanatory variable and the problem can be easily depicted with a
scatter diagram onto which we can superimpose the line of best fit. This
process of finding the line of best fit has been programmed in R through
the lm()
function. Problems involving more than one
explanatory variable are handled in a similar fashion. As part of the
model fitting process, R produces a great deal of output which should be
used to judge the validity, or usefulness, of the fitted model.
On deriving coefficient estimates from a sample of data we end up with a fitted regression model. We must now take this estimated model and ask a series of questions to decide whether or not our estimated model is good or bad. That is, we have to subject the fitted model to a set of tests designed to check the validity of the model, which is in effect a test of our viewpoint.
Test a): Does the fitted model make sense?
This type of testing is based on prior knowledge. We need to be in a position to judge if the parameter estimates are meaningful. This comes in two guises:
This test assesses if the marginal response coefficients display both the correct signs and sizes. To assist the analyst with these issues theoretical reasoning and information gained from past studies should be used. For example, if one was relating sales to advertising then a positive connection would be expected and a simple plot of sales against advertising would reveal information concerning the direction and strength of any sample relationship. Similarly, in a study of sales against price then a negative relationship would be expected. The issue of “correct size” is normally more troublesome. Possibly some clue as to size can be drawn from similar, past studies and this can be compared with the current estimated values.
By way of addressing this test the modeller may adopt the following approach:
Test b) \(R^2 / R^2_{adjusted}\): Overall is the model a good fit?
The coefficient of determination, \(R^2\), is a single number that measures the extent to which the set of explanatory variables can explain, or account for, the variability in \(Y\). That is, how well does the set of explanatory variables explain the behaviour of the phenomenon we are trying to understand? It is looking at the set of explanatory variables as a whole and is making no judgement about the contribution or importance of any individual explanatory variable. By construction, \(R^2\) is constrained to lie in the following range:
\[0\% <---------- \;\; R^2 \;\; ----------> 100\%\]
Intuitively the closer \(R^2\) is to \(100\%\) then the better the model is in the sense that the set of explanatory variables can explain a lot of the variability in \(Y\). Conversely, a value of \(R^2\) close to \(0\) implies a weak, i.e. poor model, in that the set of explanatory variables can only account for a limited amount of the behaviour of \(Y\). But what about non-extreme values of \(R^2\)?
The adequacy of \(R^2\) can be judged both formally and informally.
The formal test is given by the result:
and the appropriate test statistic is such that \(F_{calc} \sim F(k, (n-(k+1)))\), obtained using a data set of size \(n\) for a fitted model with \(k\) number of explanatory variables.
The mechanics of an F-test are as follows:
Two vital pieces of information are required for an F-test: \(F_{calc}\) and \(F_{crit}\).
\(F_{calc}\) is displayed within the R’s standard output. \(F_{crit}\) however is a figure derived from degrees of freedom elements and a designated level of significance. For hypothesis testing the \(5\%\) significance (\(\alpha = 5\%\)) level is commonly used. The degrees of freedom elements are
The all important decision making rule is: if \(F_{calc}\) is less than \(F_{crit}\) then the null hypothesis \(H_0\) is accepted; if \(F_{calc}\) is larger than \(F_{crit}\) then the alternative hypothesis \(H_1\) is accepted:
if \(F_{calc} < F_{crit} => H_0\)
if \(F_{calc} > F_{crit} => H_1\)
This formal test involves a rather weak alternative hypothesis, which says only that \(R^2\) is significantly bigger than \(0\). If \(H_1\) is accepted we will have to make a judgement, without the aid of any further formal test, about the usefulness of \(R^2\) and hence the model being studied.
\(R^2\) can also be used in comparing competing models. If two or more models have been put forward to explain the same response variable then a reasonable rule is to rank the explanatory power of the models in terms of their \(R^2\) values. Thus the model with the highest \(R^2\) value would be the best model.
The information gathered from the individual regressions carried out in Test a) can be used to give an initial judgement and ranking of the relative importance of the various explanatory variables.
However this rule should be used carefully. It is a valid rule when the competing models have the same number of explanatory variables. It is not valid when comparing models that have different numbers of explanatory variables. It is intuitively obvious that a model with more explanatory variables will have a better chance of explaining the \(Y\) variable than a model with fewer variables. Thus the highest \(R^2\) rule is biased in favour of those models with more explanatory variables even when some of these explanatory variables are not very useful. In an attempt to redress the imbalance when comparing models with different numbers of explanatory variables, a different version of \(R^2\) called \(R^2_{adjusted}\) has been developed as follows:
\[\bar{R}^2 = 1 - \frac{(n-1)}{(n-(k+1))}(1-R^2)\] Note that the mathematically adopted way of writing \(R^2_{adjusted}\) is \(\bar{R}^2\).
By examining \(\bar{R}^2\) it can be seen that when \(k\) goes up it is possible for the value of \(\bar{R}^2\) to fall. Thus this statistic gives a better way of comparing models with different values for \(k\). As before the rule for comparing models is straightforward: choose the model with the highest \(\bar{R}^2\).
Test c) t-tests: Individually, are the explanatory variables important?
Using \(R^2\) to judge the ‘goodness’ of the set of explanatory variables does not tell us anything about the importance of any one single explanatory variable. Just because a set of variables is important does not necessarily mean that each individual variable is contributing towards explaining the behaviour of \(Y\). What is needed is a test than enables us to check the validity of each variable one at a time. Such a test procedure is available as follows:
The appropriate \(H_1\) for any particular variable is crucially connected with the prior view as to the nature of the connection between any one proposed explanatory variable and the response variable.
The appropriate test statistic involves a t-test based on \((n-(k+1))\) degrees of freedom.
The critical region, denoted by \(H_1\), is crucially dependent upon the nature of the alternative hypothesis as put forward by the data analyst. \(t_{calc}\) is generated from the R output. \(t_{crit}\) is a combination of the Degrees of Freedom from Error and the level of test significance. \(t_{crit}\) is the benchmark helping to determine whether one accepts or rejects the null Hypothesis.
If \(t_{calc}\) lies in the critical region then the alternative hypothesis is accepted and the modeller accepts that the explanatory variable does have an influence on the behaviour of \(Y\). If this is not the case and \(t_{calc}\) lies in the \(H_0\) area, we will accept that the explanatory variable is useless and that it should be eliminated from the model.
The hypothesis testing methodology typically adopts the \(\alpha = 5\%\) level of significance and can be illustrated as follows:
Once all the variables have been individually analysed a further regression command should be given on all remaining acceptable explanatory variables in order to ascertain the best fitted model.
Some of the decision making outcomes from the series of t-test may be uncomfortable in terms of the information from other tests and we may have to experiment with various competing fitted models using subtle combinations of prior views, \(R^2\), \(\bar{R}^2\) values, and \(F\), \(t-test\) outcomes before selecting a best model or a set of equally good models.
There are many other, more sophisticated, tests that can be used to judge the validity, i.e. usefulness of a fitted regression model. Before we familiarise ourselves with them and in order to appreciate their practicality and effectiveness, we will conduct an analysis for fitting a multiple regression model based on the procedure explained above.
The \(p\)-value is a probability of obtaining a value of the test statistic or a more extreme value of the test statistic assuming that the null hypothesis is true. Thus, the \(p\)-value approach involves determining “likely” or “unlikely” by determining the probability, assuming the null hypothesis were true of observing a more extreme test statistic in the direction of the alternative hypothesis than the one observed.
As such, in the context of the inference about the regression parameters, the \(p\)-values help determine whether the relationships that you observe in your sample also exist in the larger population. The \(p\)-value for each independent variable tests the null hypothesis that the variable has no effect on the dependent variable. In other words, there is insufficient evidence to conclude that there is effect at the population level.
carData::Salaries
: The 2008-09
nine-month academic salary for Assistant Professors, Associate
Professors and Professors in a college in the U.S. The data was
collected as part of the on-going effort of the college’s administration
to monitor salary differences between male and female faculty
members.
Format: A data frame with 397 observations on the following 6 variables:
rank
: a factor with levels:
discipline
: a factor with levels:
yrs.since.phd
: years since PhDyrs.service
: years of servicesex
: a factor with levels
salary
: nine-month salary, in dollarsThe first set of questions is:
Considering that the data is collected for the purpose of the
analysis of academic salary we are going to identify salary
as the response variable. The rest of the variables we can use as
possible factors that can influence the behaviour of the response
variable.
The “Standard” regression approach assumes modelling a relationship between measured response and measured explanatory variables. However, often when building multiple regression models we do not want to be limited to the choice of only measured explanatory variables. We also want to be able to include attribute explanatory variables in the multiple regression modelling. In consequence, it is important to learn about supplementary steps that are required to make such models interpretable.
As we have learnt previously, attribute variables are vectors of class factor in R. This vector is encoded numerically, with information about the levels of the variable saved in the levels attribute.
> # If you don't have carData installed yet, uncomment and run the line below
> # install.packages(carData)
> library(carData)
> data(Salaries)
> attach(Salaries)
> class(sex)
## [1] "factor"
> unclass(sex)
## [1] 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 1 2 2 2 2 1 2 2 2 2 2 2 2 2 2 1 1 2
## [38] 2 2 2 2 2 2 2 2 2 2 1 1 2 2 2 1 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 1 2 2 2 2 2
## [75] 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2
## [112] 2 2 2 1 2 2 2 2 1 2 2 2 1 2 2 2 1 2 2 2 2 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [149] 1 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2
## [186] 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2
## [223] 2 2 2 2 2 2 2 2 1 1 2 1 2 2 2 1 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 1 1 2 2 2 2
## [260] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [297] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 1
## [334] 2 1 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 1 2 2 2 2 2 2 2 2
## [371] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## attr(,"levels")
## [1] "Female" "Male"
The output points out the conversion of the factor into an appropriate set of contrasts. In particular, the first one: for unordered factors, and the second one: the ordered factors. The former one is applicable in our context.
To explicitly identify the coding of the factor, i.e. dummy variable
used by R, we can use the contrasts()
function.
> contrasts(sex)
## Male
## Female 0
## Male 1
> contrasts(discipline)
## B
## A 0
## B 1
> contrasts(rank)
## AssocProf Prof
## AsstProf 0 0
## AssocProf 1 0
## Prof 0 1
Note that applied “contr.treatment” conversion takes only the value 0 or 1 and that for an attribute variable with k levels it will create k-1 dummy variables. One can argue that the printout of the function could have been more informative by putting indexed letter d as a header for each of the printed columns. For example:
sex
, where k=2attribute | \(d\) |
---|---|
Female | 0 |
Male | 1 |
rank
, where k=3attribute | \(d_1\) | \(d_2\) |
---|---|---|
AsstProf | 0 | 0 |
AssocProf | 1 | 0 |
Prof | 0 | 1 |
There are many different ways of coding attribute variables besides
the dummy variable approach explained here. All of these different
approaches lead to equivalent model fits. What differ are the
coefficients, i.e. model parameters as they require different
interpretations, arranged to measure particular contrasts. This 0/1
coding implemented in R’s default
contr.treatment
contrast offers
straightforward interpretation of the associated parameter in the model,
which often is not the case when implementing other contrasts.
In the case of measured predictors, we are comfortable with the interpretation of the linear model coefficient as a slope, which tells us what a unit increase in the response variable is, i.e. outcome per unit increase in the explanatory variable. This is not necessarily the right interpretation for attribute explanatory variables.
> # average salary values for each sex group
> suppressPackageStartupMessages(library(dplyr))
> Salaries %>%
+ select(salary, sex) %>%
+ group_by(sex) %>%
+ summarise(mean=mean(salary))
Note, that the unclass()
function removes the attributes
of the sex
variable above and prints it using default print
method, allowing for easier examination of the internal structure of the
object.
However, when using an attribute variable in a linear regression model it would make no sense to treat it as a measured explanatory variable because of its numeric levels. In the context of linear modelling we need to code them to represent the levels of the attribute.
Two-level attribute variables are very easy to code. We simply create
an indicator or dummy variable that
takes on two possible dummy numerical values. Consider the
sex
indicator variable. We can code this using a dummy
variable \(d\): \[\begin{equation}
d=\left\{
\begin{array}{@{}ll@{}}
0, & \text{if female} \\
1, & \text{if male}
\end{array}\right.
\end{equation}\]
💡🤓 This is the default coding used in R. Zero value is assigned to the level which is first alphabetically, unless it is changed by using the
releveld()
function for example, or by specifying the levels of the factor variable specifically.
For a simple regression model of salary
versus
sex
:
\[salary = b_0 + b_1sex + e,\]
this results in the model
\[\begin{equation} salary_i = b_0 + b_1sex_i + e_i=\left\{ \begin{array}{@{}ll@{}} b_0 + b_1 \times 1 + e_i = b_0 + b_1 + e_i, & \text{if the person is male} \\ b_0 + b_1 \times 0 + e_i = b_0 + e_i, & \text{if the person is female} \end{array}\right. \end{equation}\]
where \(b_0\) can be interpreted as the average \(\text{salary}\) for females, and \(b_0 + b_1\) as the average \(\text{salary}\) for males. The value of \(b_1\) represents the average difference in \(\text{salary}\) between females and males.
We can conclude that dealing with an attribute variable with two levels in a linear model is straightforward. In this case, a dummy variable indicates whether an observation has a particular characteristic: yes/no. We can observe it as a “switch” in a model, as this dummy variable can only assume the values 0 and 1, where 0 indicates the absence of the effect, and 1 indicates the presence. The values 0/1 can be seen as off/on.
The way in which R codes dummy variables is controlled by the contrast option:
> options("contrasts")
## $contrasts
## unordered ordered
## "contr.treatment" "contr.poly"
If we obtain the mean salary for each sex group we will find that for
female professors the average salary is \(\$101,002\) and for male professors the
average is \(\$115,090\). That is, a
difference of \(\$14,088\). If we now
look at the parameters of the regression model for salary
vs sex
where females are coded as zero and males as one, we
get exactly the same information, implying that the coefficient is the
estimated difference in average between the two groups.
> # regression model
> lm(salary ~ sex)
##
## Call:
## lm(formula = salary ~ sex)
##
## Coefficients:
## (Intercept) sexMale
## 101002 14088
For more on this topic check the following link: Categorical variables and interaction terms in linear regression
We are interested in the extent to which variation in the response
variable salary
is associated with variation in the
explanatory variables available in the Salaries
data set,
that is we want to fit a multiple linear regression model to the given
data. The model we wish to construct should contain enough to explain
relations in the data and at the same time be simple enough to
understand, explain to others, and use.
For convenience we will adopt the following notation:
salary
yrs.since.phd
yrs.service
discipline
sex
rank
Generally, in multiple regression we have a continuous response variable and two or more continuous explanatory variables, however in this dataset we have three attribute variables that we wish to include as the explanatory variables in the model.
Next, we need to specify the model that embodies our mechanistic understanding of the factors involved and the way that they are related to the response variable. It would make sense to expect that all of the available \(x\) variables may impact the behaviour of \(y\), thus the model we wish to build should reflect our viewpoint, i.e. \(y = f(x_1, x_2, x_3, x_4, x_5)\):
\[y = b_0 + b_1x_1 + b_2x_2 + b_3x_3 + b_4x_4 + b_5x_5 + e\] Our viewpoint states a belief that all explanatory variables have positive impact on the response. For example, more years in service will cause higher salary.
Our objective now is to determine the values of the parameters in the model that lead to the best fit of the model to the data. That is, we are not only trying to estimate the parameters of the model, but we are also seeking the minimal adequate model to describe the data.
💡🤓 The best model is the model that produces the least unexplained variation following the principle of parsimony rather than complexity. That is the model should have as few parameters as possible, subject to the constraint that the parameters in the model should all be statistically significant.
For regression modelling in R we use the lm()
function,
that fits a linear model assuming normal errors and constant variance.
We specify the model by a formula that uses arithmetic operators:
+
, -
, *
, /
and
^
which enable different functionalities from their
ordinary ones. But, before we dive into statistical modelling of the
given data, we need to take a first step and conduct the most
fundamental task of data analysis procedure: Get to Know Our
Data.
Examining multivariate data is intrinsically more challenging than
examining univariate and bivariate data. To get the most in-depth vision
into multivariate data behaviour we construct scatter plot matrices that
enable the display of pairwise relationships. In R, the scatter plot
matrices are composed by the pairs()
function, which comes
as a part of the default graphics
package. Since we wish to
include attribute variables in our analysis we are going to use the
GGally::ggpairs()
function that produces a pairwise
comparison of multivariate data for both data types: measured and
attribute.
> # If you don't have GGally installed yet, uncomment and run the line below
> # install.packages(GGally)
> suppressPackageStartupMessages(library(GGally))
> ggpairs(Salaries)
This is an information rich visualisation that includes pairwise
relationships of all the variables we want to consider for our model. By
focusing on the last column of the plots, we can notice influence from
all explanatory variables onto the response, except maybe for
discipline
and sex
. We also notice unbalanced
representation of the groups for variables rank
and
sex
, but for the purpose of our practice in fitting a multi
factor model this isn’t too problematic. We need to be especially
concerned with the extent of correlations between the explanatory
variables and what is of particular interest to us is the high
multicollinearity between rank
,
yrs.since.phd
and yrs.service
, which happens
when the variables are highly linearly related. As a
consequence, we will need to keep an eye on the significance of using
all of these variables in the model.
There are no fixed rules when fitting the linear models, but there are adopted standards that have proven to work well in practice. We start off by fitting a maximal model then we carry on simplifying it by removing non-significant explanatory variables. This needs to be done with caution, making sure that the simplifications make good scientific sense, and do not lead to significant reductions in explanatory power. Although this should be the adopted strategy for fitting a model it is not a guarantee to finding all the important structures in a complex data frame.
We can summarise our model building procedure algorithm as follows:
> # model_1 <- lm(salary ~ yrs.since.phd + yrs.service + discipline + sex + rank, data = Salaries) #long handed way
> model_1 <- lm(salary ~ ., data = Salaries) # full stop, . , implies: all other variables in data that do not already appear in the formula
> summary(model_1)
##
## Call:
## lm(formula = salary ~ ., data = Salaries)
##
## Residuals:
## Min 1Q Median 3Q Max
## -65248 -13211 -1775 10384 99592
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 65955.2 4588.6 14.374 < 2e-16 ***
## rankAssocProf 12907.6 4145.3 3.114 0.00198 **
## rankProf 45066.0 4237.5 10.635 < 2e-16 ***
## disciplineB 14417.6 2342.9 6.154 1.88e-09 ***
## yrs.since.phd 535.1 241.0 2.220 0.02698 *
## yrs.service -489.5 211.9 -2.310 0.02143 *
## sexMale 4783.5 3858.7 1.240 0.21584
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 22540 on 390 degrees of freedom
## Multiple R-squared: 0.4547, Adjusted R-squared: 0.4463
## F-statistic: 54.2 on 6 and 390 DF, p-value: < 2.2e-16
The \(R^2 = 45.47\%\) and the \(\bar{R}^2 = 44.63\%\) are well above the
value of zero allowing us to accept this as a valid model without having
to formally test it to assess its statistical significance. It manages
to explain almost half of the variability in the response variable
salary
.
We identify the sex
explanatory variable as clearly not
significant, which is in line with the conclusion we could draw from the
boxplot in the pairwise comparison plot for salary
vs. sex
. We will remove it to begin the process of model
simplification.
> #model_1 <- lm(salary ~ yrs.since.phd + yrs.service + discipline + sex + rank, data = Salaries) # long handed method
> model_2 <- update(model_1,~. - sex) # refitting by removing the least significant term
> summary(model_2)
##
## Call:
## lm(formula = salary ~ rank + discipline + yrs.since.phd + yrs.service,
## data = Salaries)
##
## Residuals:
## Min 1Q Median 3Q Max
## -65244 -13498 -1455 9638 99682
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 69869.0 3332.1 20.968 < 2e-16 ***
## rankAssocProf 12831.5 4147.7 3.094 0.00212 **
## rankProf 45287.7 4236.7 10.689 < 2e-16 ***
## disciplineB 14505.2 2343.4 6.190 1.52e-09 ***
## yrs.since.phd 534.6 241.2 2.217 0.02720 *
## yrs.service -476.7 211.8 -2.250 0.02497 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 22550 on 391 degrees of freedom
## Multiple R-squared: 0.4525, Adjusted R-squared: 0.4455
## F-statistic: 64.64 on 5 and 391 DF, p-value: < 2.2e-16
We note a slight reduction in \(\bar{R^2}\) from \(44.63\%\) to \(44.55\%\) which we can regard as an
insignificant decrease. The next step is to check the coefficients and
assess for the effect of the remaining variables. We identify
yrs.since.phd
and yrs.service
as the least
influential in explaining the variability of salary
. To
illustrate how to formally assess their effect, we will conduct the
t-test for the yrs.since.phd
variable:
=========================
> qt(0.95, 391)
## [1] 1.64876
As \(t_{calc} = 2.217 > t_{crit} = 1.64876 => H_1\) we will keep the remaining variable and stop with the model simplification and focus on its interpretation.
We can take a closer look at the coefficients of our fitted model:
> coef(model_2)
## (Intercept) rankAssocProf rankProf disciplineB yrs.since.phd
## 69869.0110 12831.5375 45287.6890 14505.1514 534.6313
## yrs.service
## -476.7179
The structure of our final fitted model is:
\[y = b_0 + b_1x_1 + b_2x_2 + b_3x_3 + b_4x_4 + e,\] where
salary
yrs.since.phd
yrs.service
discipline
rank
We can take a closer look at the coefficients of our fitted model:
> coef(model_2)
## (Intercept) rankAssocProf rankProf disciplineB yrs.since.phd
## 69869.0110 12831.5375 45287.6890 14505.1514 534.6313
## yrs.service
## -476.7179
Examining the output we realise that R
has created three
“sub” dummy variables for the variable rank
: \[
dr_1 =
\begin{cases}
1 & \text{rank is AsstProf} \\
0 & \text{for rank is not AsstProf}
\end{cases}
\]
\[ dr_2 = \begin{cases} 1 & \text{rank is AssocProf} \\ 0 & \text{rank is not AssocProf} \end{cases} \]
\[ dr_3 = \begin{cases} 1 & \text{rank is Prof} \\ 0 & \text{rank is not Prof} \end{cases} \]
It has chosen to use the model: \[ y = b_0 + b_1dr_2 + b_2dr_3 + b_3d_1 + b_4x_1 + b_5x_2 + e, \]
where:
salary
yrs.since.phd
yrs.service
rank
discipline
as explained earlierNote that R
doesn’t need to use \(dr_1\), to create three models; it only
needs two dummy variables since it is using \(dr_1\) as a reference level, also known as
the base line. This subsequently allows R
to
create three models relating to rank
variable:
telling us that:
Learning this we can make an interpretation of our final fitted model as follows:
yrs.since.phd
) on average
salary (salary
) will go up by \(\$534.63\) assuming the rest of the
variables are fixed in the modelyrs.service
) on average
salary (salary
) will go down by \(\$476.72\) assuming the rest of the
variables are fixed in the modelrank: AsstProf
) who works in a “theoretical” department is
\(\$69,869.01\) and who works in an
“applied” department is \(\$84,374.16\); this can vary for the number
of years in service and since PhDrank: AssocProf
) who works in a “theoretical” department
is \(\$82,700.55\) and who works in an
“applied” department is \(\$97,205.70\); this can vary for the number
of years in service and since PhDrank: Prof
) who works in
a “theoretical” department is \(\$115,156.70\) and who works in an
“applied” department is \(\$129,661.90\); this can vary for the
number of years in service and since PhDThis model explains around \(45\%\)
of the variability in the response variable salary
.
Adding ~ 0
to lm()
formula enables
R
to suppress the intercept. Note that if we remove the
intercept, then we can directly obtain all “three intercepts” without a
base level to fit the final fitted model:
> model_2_1 <- lm(salary ~ 0 + rank + discipline + yrs.since.phd + yrs.service)
> summary(model_2_1)
##
## Call:
## lm(formula = salary ~ 0 + rank + discipline + yrs.since.phd +
## yrs.service)
##
## Residuals:
## Min 1Q Median 3Q Max
## -65244 -13498 -1455 9638 99682
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## rankAsstProf 69869.0 3332.1 20.968 < 2e-16 ***
## rankAssocProf 82700.5 3916.7 21.115 < 2e-16 ***
## rankProf 115156.7 4350.9 26.467 < 2e-16 ***
## disciplineB 14505.2 2343.4 6.190 1.52e-09 ***
## yrs.since.phd 534.6 241.2 2.217 0.0272 *
## yrs.service -476.7 211.8 -2.250 0.0250 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 22550 on 391 degrees of freedom
## Multiple R-squared: 0.9638, Adjusted R-squared: 0.9633
## F-statistic: 1736 on 6 and 391 DF, p-value: < 2.2e-16
So far we have seen that by fitting an additive regression model to the data, we aim to identify and understand how the value of a dependent response variable changes when any one of the independent explanatory variables is changed while the other independent variables stay the same. This is a restrictive form of a model as it only allows for linear relationships between the response and the explanatory variables, and the way in which one explanatory variable affects the response is the same for any value of the other explanatory variables used in the model.
We need to add flexibility to accommodate these limitations. This will allow the use of linear models for non-linear relationships and in which the effect of one explanatory variable can be different for different values of the other explanatory variable by introducing the concept of interaction. This brings more complexity into the multivariate regression model.
To illustrate the multivariate model fitting procedure with
interactions we are going to use wine.csv
available from Eduardo
García Portugués’s book: Notes for Predictive Modeling. The dataset
is formed by the auction Price of 27 red Bordeaux vintages, five vintage
descriptors (WinterRain, AGST, HarvestRain, Age, Year), and the
population of France in the year of the vintage, FrancePop.
Year
: year in which grapes were harvested to make
winePrice
: logarithm of the average market price for
Bordeaux vintages according to 1990–1991 auctions. The price is relative
to the price of the 1961 vintage, regarded as the best one ever
recordedWinterRain
: winter rainfall (in mm)AGST
: Average Growing Season Temperature (in degrees
Celsius)HarvestRain
: harvest rainfall (in mm)Age
: age of the wine measured as the number of years
stored in a caskFrancePop
: population of France at Year (in
thousands)We would like to analyse the quality of a vintage that has been quantified as the price and make the interpretation of our statistical findings.
💡🤓 Don’t forget!!! 🤔 First things first: Get to Know Data 🤓
We will start this analysis by examining the pairwise plot.
> wine = read.csv("https://raw.githubusercontent.com/egarpor/handy/master/datasets/wine.csv")
> summary(wine)
## Year Price WinterRain AGST HarvestRain
## Min. :1952 Min. :6.205 Min. :376.0 Min. :14.98 Min. : 38.0
## 1st Qu.:1960 1st Qu.:6.508 1st Qu.:543.5 1st Qu.:16.15 1st Qu.: 88.0
## Median :1967 Median :6.984 Median :600.0 Median :16.42 Median :123.0
## Mean :1967 Mean :7.042 Mean :608.4 Mean :16.48 Mean :144.8
## 3rd Qu.:1974 3rd Qu.:7.441 3rd Qu.:705.5 3rd Qu.:17.01 3rd Qu.:185.5
## Max. :1980 Max. :8.494 Max. :830.0 Max. :17.65 Max. :292.0
## Age FrancePop
## Min. : 3.00 Min. :43184
## 1st Qu.: 9.50 1st Qu.:46856
## Median :16.00 Median :50650
## Mean :16.19 Mean :50085
## 3rd Qu.:22.50 3rd Qu.:53511
## Max. :31.00 Max. :55110
> ggpairs(wine)
What conclusions can we draw:
We can notice a perfect relationship between the variables
Year
and Age
. This is to be expected since
this data was collected in 1983 and Age
was calculated as:
Age
= 1983 - Year
. Knowing this, we are going
to remove Year
from the analysis and use Age
as it will be easier to interpret.
There is a strong relationship between Year
, ie.
Age
and FrancePop
and since we want to impose
our viewpoint that the total population does not influence the quality
of wine we will not consider this variable in the model.
We are going to investigate possible interactions between the
rainfall (WinterRain
) and growing season temperature
(AGST
). In R
this will be created
automatically by using the *
operator.
Let us build a model:
We will start with the most complicated model that includes the highest-order interaction.
💡🤓 In R we will specify the three-way interaction, which will automatically add all combinations of two-way interactions.
> model1 <- lm(Price ~ WinterRain + AGST + HarvestRain + Age + WinterRain * AGST * HarvestRain, data = wine)
> summary(model1)
##
## Call:
## lm(formula = Price ~ WinterRain + AGST + HarvestRain + Age +
## WinterRain * AGST * HarvestRain, data = wine)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.35058 -0.19462 -0.02645 0.17194 0.52079
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.582e+00 1.924e+01 0.446 0.6609
## WinterRain -1.858e-02 2.896e-02 -0.642 0.5292
## AGST -1.748e-01 1.137e+00 -0.154 0.8795
## HarvestRain -4.713e-02 1.540e-01 -0.306 0.7631
## Age 2.476e-02 8.288e-03 2.987 0.0079 **
## WinterRain:AGST 1.272e-03 1.712e-03 0.743 0.4671
## WinterRain:HarvestRain 7.836e-05 2.600e-04 0.301 0.7665
## AGST:HarvestRain 3.059e-03 9.079e-03 0.337 0.7401
## WinterRain:AGST:HarvestRain -5.446e-06 1.540e-05 -0.354 0.7278
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2833 on 18 degrees of freedom
## Multiple R-squared: 0.8621, Adjusted R-squared: 0.8007
## F-statistic: 14.06 on 8 and 18 DF, p-value: 2.675e-06
The model explains well over \(80\%\) of variability and is clearly a strong model, but the key question is whether we can simplify it. We will start the process of this model simplification by removing the three-way interaction as it is clearly not significant.
> model2 <- update(model1, ~. -WinterRain:AGST:HarvestRain, data =wine)
> summary(model2)
##
## Call:
## lm(formula = Price ~ WinterRain + AGST + HarvestRain + Age +
## WinterRain:AGST + WinterRain:HarvestRain + AGST:HarvestRain,
## data = wine)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.35245 -0.19452 0.01643 0.17289 0.51420
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.980e+00 1.066e+01 0.279 0.78293
## WinterRain -9.699e-03 1.408e-02 -0.689 0.49930
## AGST 1.542e-01 6.383e-01 0.242 0.81168
## HarvestRain 6.496e-03 2.610e-02 0.249 0.80610
## Age 2.441e-02 8.037e-03 3.037 0.00678 **
## WinterRain:AGST 7.490e-04 8.420e-04 0.890 0.38484
## WinterRain:HarvestRain -1.350e-05 7.338e-06 -1.840 0.08144 .
## AGST:HarvestRain -1.032e-04 1.520e-03 -0.068 0.94656
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2767 on 19 degrees of freedom
## Multiple R-squared: 0.8611, Adjusted R-squared: 0.8099
## F-statistic: 16.83 on 7 and 19 DF, p-value: 6.523e-07
The \(\bar{R}^2\) has slightly increased in value. Next, we remove the least significant two-way interaction term.
> model3 <- update(model2, ~. -AGST:HarvestRain, data = wine)
> summary(model3)
##
## Call:
## lm(formula = Price ~ WinterRain + AGST + HarvestRain + Age +
## WinterRain:AGST + WinterRain:HarvestRain, data = wine)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.35424 -0.19343 0.01176 0.17161 0.51218
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.518e+00 6.946e+00 0.507 0.61802
## WinterRain -1.017e-02 1.195e-02 -0.851 0.40488
## AGST 1.218e-01 4.138e-01 0.294 0.77147
## HarvestRain 4.752e-03 4.553e-03 1.044 0.30901
## Age 2.451e-02 7.710e-03 3.179 0.00472 **
## WinterRain:AGST 7.769e-04 7.166e-04 1.084 0.29119
## WinterRain:HarvestRain -1.342e-05 7.059e-06 -1.902 0.07174 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2697 on 20 degrees of freedom
## Multiple R-squared: 0.8611, Adjusted R-squared: 0.8194
## F-statistic: 20.66 on 6 and 20 DF, p-value: 1.35e-07
Again, it is reassuring to notice an increase in the \(\bar{R}^2\), but we can still simplify the model further by removing another least significant two-way interaction term.
> model4 <- update(model3, ~. -WinterRain:AGST, data = wine)
> summary(model4)
##
## Call:
## lm(formula = Price ~ WinterRain + AGST + HarvestRain + Age +
## WinterRain:HarvestRain, data = wine)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.5032 -0.1934 0.0109 0.1771 0.4621
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.812e+00 1.598e+00 -2.386 0.026553 *
## WinterRain 2.747e-03 9.471e-04 2.900 0.008560 **
## AGST 5.586e-01 9.495e-02 5.883 7.71e-06 ***
## HarvestRain 4.717e-03 4.572e-03 1.032 0.313877
## Age 2.785e-02 7.094e-03 3.926 0.000774 ***
## WinterRain:HarvestRain -1.349e-05 7.088e-06 -1.903 0.070835 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2708 on 21 degrees of freedom
## Multiple R-squared: 0.8529, Adjusted R-squared: 0.8179
## F-statistic: 24.35 on 5 and 21 DF, p-value: 4.438e-08
There is an insignificant decrease in \(\bar{R}^2\). We notice
HarvestRain
is now the least significant term, but it is
used for the WinterRain:HarvestRain
interaction, which is
significant at \(\alpha = 5\%\) and
therefore we should keep it. However, as the concept of parsimony
prefers a model without interactions to a model containing interactions
between variables, we will remove the remaining interaction term and see
if it significantly affects the explanatory power of the model.
> model5 <- update(model4, ~. -WinterRain:HarvestRain, data = wine)
> summary(model5)
##
## Call:
## lm(formula = Price ~ WinterRain + AGST + HarvestRain + Age, data = wine)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.46024 -0.23862 0.01347 0.18601 0.53443
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.6515703 1.6880876 -2.163 0.04167 *
## WinterRain 0.0011667 0.0004820 2.420 0.02421 *
## AGST 0.6163916 0.0951747 6.476 1.63e-06 ***
## HarvestRain -0.0038606 0.0008075 -4.781 8.97e-05 ***
## Age 0.0238480 0.0071667 3.328 0.00305 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2865 on 22 degrees of freedom
## Multiple R-squared: 0.8275, Adjusted R-squared: 0.7962
## F-statistic: 26.39 on 4 and 22 DF, p-value: 4.057e-08
The \(\bar{R}^2\) is reduced by around \(2\%\), but it has all the significant terms and it is easier to interpret. For those reasons and in the spirit of parsimony that argues that a model should be as simple as possible, we can suggest that this should be regarded as the best final fitted model.
We realise that for the large numbers of explanatory variables, and many interactions and non-linear terms, the process of model simplification can take a very long time. There are many algorithms for automatic variable selection that can help us to chose the variables to include in a regression model. Stepwise regression and Best Subsets regression are two of the more common variable selection methods.
The stepwise procedure starts from the saturated model (or the maximal model, whichever is appropriate) through a series of simplifications to the minimal adequate model. This progression is made on the basis of deletion tests: F tests, AIC, t-tests or chi-squared tests that assess the significance of the increase in deviance that results when a given term is removed from the current model.
The best subset regression (BREG), also known as “all possible regressions”, as the name of the procedure indicates, fits a separate least squares regression for each possible combination of the \(p\) predictors, i.e. explanatory variables. After fitting all of the models, BREG then displays the best fitted models with one explanatory variable, two explanatory variables, three explanatory variables, and so on. Usually, either adjusted R-squared or Mallows Cp is the criterion for picking the best fitting models for this process. The result is a display of the best fitted models of different sizes up to the full/maximal model and the final fitted model can be selected by comparing displayed models based on the criteria of parsimony.
“These methods are frequently are abused by naive researchers who seek to interpret the order of entry of variables into the regression equation as an index of their ‘‘importance.’’ This practice is potentially misleading.” J. Fox and S. Weisberg book: An R Companion to Applied Regression, Third Edition, Sage (2019)
💡🤓 When selecting a model one should remember the important concept of parsimony!
As M.J. Crawley points out in his well know editions of “The R Book” we need to remember that models are portrayals of phenomena that should be both “accurate and convenient” and the principle of parsimony is an essential tool for model exploration. As he suggests: “just because we go to the trouble of measuring something does not mean we have to have it in our model.”
“Parsimony says that, other things being equal, we prefer:
Best Subset, https://afit-r.github.io/model_selection, https://bookdown.org/tpinto_home/Regularisation/best-subset-selection.html
Claeskens, G. and Hjort, N.L. (2008) Model Selection and Model Averaging, Cambridge University Press, Cambridge.
Fox, J. (2002) An R and S-Plus Companion to Applied Regression, Sage, Thousand Oaks, CA.
Salaries
data Case Study:~0
to lm()
formula enables
R
to suppress the intercept. Try to fit the following model
by removing the intercept.model_2_1 <- lm(salary ~ 0 + rank + discipline + yrs.since.phd + yrs.service)
summary(model_2_1)
This will allow you to obtain all “three intercepts” without a reference level.
Does this model differ from the previously fitted
model_2
? Provide justified explanation
Interpret the model
Salaries
data Case Study be further simplified? Justify your answerPrestige
data available from the
carData
package of datasets designed to accompany J. Fox
and S. Weisberg book: An R Companion to Applied Regression, Third
Edition, Sage (2019). Fit a multivariate model that explains
variation in the response variable prestige
, explaining the
reasons behind the steps taken with appropriate interpretation of the
final fitted model.carData::Prestige
: The Canadian occupational prestige
data, where the observations are occupations. Justification for treating
the 102 occupations as a sample implicitly rests on the claim that they
are “typical” of the population, at least with respect to the
relationship between prestige and income.
education
: Average education of occupational
incumbents, years, in 1971.
income
: Average income of incumbents, dollars, in
1971.
women
: Percentage of incumbents who are
women.
prestige
: Pineo-Porter prestige score for
occupation, from a social survey conducted in the mid-1960s.
census
: Canadian Census occupational code.
type
: Type of occupation. A factor with levels
(note: out of order):
bc
: Blue Collarprof
: Professional, Managerial, and Technicalwc
: White CollarProstate
available in the package
lasso2
contains information on 97 men who were about to
receive a radical prostatectomy. This data set comes from a study
examining the correlation between the prostate specific antigen
(lpsa
) and a number of other clinical measures.lpsa
using all
the available predictors.Creative Commons Attribution-ShareAlike 4.0 International License.