In this section you will be introduced to a set of concepts that enable data to be explored with the objective of extracting information that will help in understanding the problem under investigation.
The starting point is to understand what data is.
Can you provide a formal definition of the population and the sample? 😁
The population is the set of all people/objects of interest in the study being undertaken.
In statistical terms the whole data set is called the population. This represents “Perfect Information”, however in practice it is often impossible to enumerate the whole population. The analyst therefore takes a sample drawn from the population and uses this information to make judgements (inferences) about the population.
Clearly, if the results of any analysis are based on a sample drawn from the population, then for the analysis to have any validity the sample should be chosen in a way that is fair and reflects the structure of the population.
The process of sampling to obtain a representative sample is a large area of statistical study. The simplest model of a representative sample is a “random sample”:
A sample chosen in such a way that each item in the population has an equal chance of being included is a random sample.
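As a quick illustration, here is a minimal R sketch of drawing a simple random sample (the population values are invented):

```r
population <- 1:1000                            # identifiers for a hypothetical population
random_sample <- sample(population, size = 50)  # each item has an equal chance of selection
```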
As soon as sample data is used, the information contained within the sample is “Imperfect” and depends on the particular sample chosen. The key problem is to use this sample data to draw valid conclusions about the population with the knowledge of and taking into account the ‘error due to sampling’.
The importance of working with representative samples should be seriously considered; a good way to appreciate this importance is to see the consequences of using unrepresentative samples. A book by Darrell Huff called How to Lie with Statistics, published by Penguin contains several anecdotes of unrepresentative samples and the consequences of treating them as representative.
Usually the data will have been collected in response to some perceived problem, in the hope of being able to glean some pointers from this data that will be helpful in the analysis of the problem. Data is commonly presented to the data analyst in this way with a request to analyse the data.
Before attempting to analyse any data, the analyst should:
Make sure that the problem under investigation is clearly understood, and that the objectives of the investigation have been clearly specified.
Before any analysis is considered the analyst should make sure that the individual variables making up the data set are clearly understood.
The analyst must understand the data before attempting any analysis.
In summary, you should ask yourself:
Do I understand the problem under investigation and are the objectives of the investigation clear? The only way to obtain this information is to ask questions, and keep asking questions until satisfactory answers have been obtained.
Do I understand exactly what each variable is measuring/recording?
A starting point is to examine the characteristics of each individual variable in the data set.
The way to proceed depends upon the type of variable being examined.
Classification of variable types
The variables can be one of two broad types: attribute variables, which record a quality or category of a population member, and measured variables, which record a numerical value.
A common way of handling attribute data is to give it a numerical code. Hence, we often refer to them as coded variables.
There are two types of measured variables. The first is measured on some continuous scale of measurement, e.g. a person’s height, and is called a continuous variable. The other type is a discrete variable, which results from counting; for example ‘the number of passengers on a given flight’.
The concept of statistical distribution is central to statistical analysis.
This concept relates to the population and conceptually assumes that we have perfect information; the exact composition of the population is known.
The ideas and concepts for examining population data provide a
framework for the way of examining data obtained from a sample. The Data
Analyst classifies the variables as either attribute or measured and
examines the statistical distribution of the particular sample variable
from the sample data.
For an attribute variable the number of occurrences of each attribute is
obtained, and for a measured variable the sample descriptive statistics
describing the centre, width and symmetry of the distribution are
calculated.
attribute: [barplot showing the frequency of occurrence of each level of the attribute variable]
measured: [smooth curve showing the distribution of the variable’s values]
For an attribute variable it is very simple. We observe the frequency of occurrence of each level of the attribute variable as shown in the barplot above.
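As a minimal sketch (the attribute values are invented), the frequencies and the barplot can be obtained in R with:

```r
colour <- c("red", "blue", "red", "green", "blue", "red")  # invented attribute data
table(colour)            # frequency of occurrence of each level
barplot(table(colour))   # barplot of the frequencies
```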
For a measured variable the area under the curve from one value to another measures the relative proportion of the population having the outcome value in that range.
A statistical distribution for a measured variable can be summarised using three key descriptions:
The common measures of the centre of a distribution are the Mean (arithmetic average) and the Median. The median value of the variable is defined to be the particular value of the variable such that half the data values are less than the median value and half are greater.
The common measures of the width of a distribution are the Standard Deviation and the Inter-Quartile Range. The Standard Deviation is the square root of the average squared deviation from the mean. The standard deviation is a measure of spread (width): the larger the standard deviation, the wider the distribution. The inter-quartile range is the range over which the middle 50% of the data values varies.
By analogy with the median it is possible to define the quartiles: the lower quartile \(Q_1\) is the value such that \(25\%\) of the data values are less than it, and the upper quartile \(Q_3\) is the value such that \(75\%\) of the data values are less than it (the median is \(Q_2\)).
The diagram below shows this pictorially:
🤓💡 Conventionally the mean and standard deviation are given together as one pair of measures of location and spread, and the median and inter-quartile range as another pair of measures.
There are a number of measures of symmetry; the simplest way to measure symmetry is to compare the mean and the median. For a perfectly symmetrical distribution the mean and the median will be exactly the same. This idea leads to the definition of Pearson’s coefficient of Skewness as:
\[\text{Pearson's coefficient of Skewness} = \frac{3(\text{mean} - \text{median})}{\text{standard deviation}}\]
An alternative measure of Skewness is the Quartile Measure of Skewness defined as:
\[\text{Quartile Measure of Skewness} = \frac{(Q_3 - Q_2) - (Q_2 - Q_1)}{Q_3 - Q_1}\]
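A short R sketch, using invented data, of the descriptive statistics and the two skewness measures defined above:

```r
y <- c(7, 8, 12, 13, 14, 18)   # invented measured data

mean(y); median(y)             # centre: mean and median
sd(y); IQR(y)                  # width: standard deviation and inter-quartile range

q <- quantile(y, c(0.25, 0.50, 0.75))   # Q1, Q2 (the median), Q3

3 * (mean(y) - median(y)) / sd(y)                # Pearson's coefficient of skewness
((q[3] - q[2]) - (q[2] - q[1])) / (q[3] - q[1])  # quartile measure of skewness
```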
Important Key Points:
The descriptive statistics provide a numerical description of the key parameters of the distribution of a measured sample variable.
One of the key steps required of the Data Analyst is to investigate the relationship between variables. This requires a further classification of the variables contained within the data, as either a response variable or an explanatory variable.
A response variable is a variable that measures either directly or indirectly the objectives of the analysis.
An explanatory variable is a variable that may influence the response variable.
In general there are four different combinations of type of Response Variable and type of Explanatory Variable: measured response v measured explanatory, measured response v attribute explanatory, attribute response v measured explanatory, and attribute response v attribute explanatory.
Any investigation of the connections between a response variable and an explanatory variable starts with examining the variables, and defining the response variable, or response variables, and the explanatory variables.
🤓💡: In large empirical investigations there may be a number of objectives and a number of response variables.
The method for investigating the connections between a response variable and an explanatory variable depends on the type of variables. The methodology is different for each of the four combinations listed above, and applying an inappropriate method causes problems. 💡⚡️😩
The first step is to have a clear idea of what is meant by a connection between the response variable and the explanatory variable. This will provide a framework for defining a Data-Analysis process to explore the connection between the two variables, and will utilise the ideas previously developed.
The next step is to use some simple sample descriptive statistics to have a first look at the nature of the link between the variables. This simple approach may allow the analyst to conclude that on the basis of the sample information there is strong evidence to support a link, or there is no evidence of a link, or that the simple approach is inconclusive and further more sophisticated data analysis is required. This step is called the Initial Data Analysis and is sometimes abbreviated to IDA.
If the Initial Data Analysis suggests that Further Data Analysis (FDA) is required, then this step seeks one of two conclusions: there is no evidence of a connection between the two variables,
or
there is evidence of a connection between the two variables.
The outcome of the analysis is one of the two alternatives given above. If the outcome is that there is no evidence of a connection, then no further action is required by the analyst since the analysis is complete.
If however the outcome of the analysis is that there is evidence of a connection, then the nature of the connection between the two variables needs to be described.
🤓💡 The Data-Analysis Methodology described above seeks to find the answer to the following key question:
On the basis of the sample data is there evidence of a connection between the response variable and the explanatory variable?
The outcome is one of two conclusions
No evidence of a relationship
Yes there is evidence of a relationship, in which case the link needs to be described.
This process can be represented diagrammatically as:
For each of the four data analysis situations given, the data analyst needs to know what constitutes the Initial Data Analysis (I.D.A.) and how to undertake and interpret the I.D.A. If Further Data Analysis is required the analyst needs to know how to undertake and interpret the Further Data Analysis.
There is a relationship between a measured response and an attribute explanatory variable if the average value of the response is dependent on the level of the attribute explanatory variable.
Given a measured response and an attribute explanatory variable with two levels, “red” & “blue”, if the statistical distributions of the response variable for attribute level “red” and attribute level “blue” are exactly the same, then the level of the attribute variable has no influence on the value of the response: there is no relationship.
This can be illustrated as below:
The first step is to have a clear idea of what is meant by a connection between a measured response variable and a measured explanatory variable. Imagine a population under study consisting of a very large number of population members, and on each population member two measurements are made, the value of \(Y\) the response variable and the value of \(X\) the explanatory variable. For the whole population a graph of \(Y\) against \(X\) could be plotted conceptually.
If the graph shows a perfect line, then there is quite clearly a link between \(Y\) and \(X\). If the value of \(X\) is known, the exact value of \(Y\) can be read off the graph. This is an unlikely scenario in the data-analysis context, because this kind of relationship is a deterministic relationship. Deterministic means that if the value of \(X\) is known then the value of \(Y\) can be precisely determined from the relationship between \(Y\) and \(X\). What is more likely to happen is that other variables may also have an influence on \(Y\).
If the nature of the link between \(Y\) and \(X\) is under investigation then this could be represented as:
\[Y = f(X) + \text{effect of all other variables}\]
The effect of all other variables is commonly abbreviated to \(e\), so consider the model: \[Y = f(X) + e \text{ [e is the effect of all other variables]}\]
The influence on the response variable Y can be thought of as being made up of two components:
the component of Y that is explained by changes in the value of X, [the part due to changes in \(X\) through \(f(X)\)]
the component of Y that is explained by changes in the other factors. [the part not explained by changes in \(X\)]
Or, in more abbreviated forms, the ‘Variation in Y explained by changes in X’ or ‘Explained Variation’, and the ‘Variation in Y not explained by changes in X’ or the ‘Unexplained Variation’.
In conclusion, the Total Variation in Y is made up of the two components:
Which may be written as: \[\text{The Total Variation in Y = Explained Variation + Unexplained Variation}\]
🤓💡 The discussion started with the following idea:
\[Y = f(X) + e\]
And to quantify the strength of the link, the influence on \(Y\) was broken down into two components: \[\text{The Total Variation in Y = Explained Variation + Unexplained Variation}\]
This presents two issues: how can the Total Variation in Y, the Explained Variation and the Unexplained Variation be measured, and what do these quantities tell us?
Maybe we can observe the proportion of the Explained Variation in Y over the Total Variation in Y. This ratio is always on the scale \(0\) to \(1\), but by convention is usually expressed as a percentage, so is regarded as on the scale \(0\) to \(100\%\). It is called \(R^2\) and the interpretation of this ratio is as follows:
\[R^2: 0\% \text{ (no link) <--------------- } 50\% \text{(Statistical Link) ---------------> }100\%\text{ (Perfect Link)}\]
The definition and interpretation of \(R^2\) is a very important tool in the data analyst’s tool kit for tracking connections between a measured response variable and a measured explanatory variable.
We can put those ideas into our DA Methodology framework as shown below.
🤓💡 Note that you will hardly ever be in a situation in which \(R^2\) is so close to zero that, on the basis of the sample evidence used in the IDA alone, you can conclude that there is no relationship between the two variables. If the \(R^2\) value is very small (for example around \(2\%\)), it needs to be tested further by applying FDA to decide whether the relationship is statistically significant based on the sample evidence.
If the ‘Initial Data Analysis’ is inconclusive then ‘Further Data Analysis’ is required.
The ‘Further Data Analysis’ is a procedure that enables a decision to be made, based on the sample evidence, as to one of two outcomes:
These statistical procedures are called hypothesis tests, which essentially provide a decision rule for choosing between one of the two outcomes: “There is no relationship” or “There is a relationship” based on the sample evidence.
All hypothesis tests are carried out in four stages:
Stage 1: Specifying the hypotheses
Stage 2: Defining the test parameters and the decision rule
Stage 3: Examining the sample evidence
Stage 4: The conclusions
Statistical Models used in FDA
- Measured Response v Attribute Explanatory Variable with exactly two levels: the two-sample t-test
- Measured Response v Attribute Explanatory Variable with more than two levels: one-way Analysis of Variance (ANOVA)
- Measured Response v Measured Explanatory Variable: simple linear regression
- Measured Response v Measured Explanatory Variables: multiple linear regression
- Attribute Response v Attribute Explanatory Variable: the chi-squared test of association
Make sure you can answer the following questions:
What are the underlying ideas that enable a relationship between two variables to be investigated?
What is the purpose of summary statistics?
What is the data analysis methodology for exploring the relationship between:
a measured response variable and an attribute explanatory variable?
a measured response variable and a measured explanatory variable?
Earlier we looked at some basic statistical concepts. This section examines how to investigate the nature of any relationship that may exist between a measured response variable and a measured explanatory variable.
The first step is to have a clear idea of what is meant by a connection between the response variable and the explanatory variable. The next step is to use some simple sample descriptive statistics to have a first look at the nature of the link between the response variable and the explanatory variable. This simple approach will lead to one of three conclusions, namely, on the basis of the sample information: there is strong evidence to support a link, there is no evidence of a link, or the simple approach is inconclusive and further, more sophisticated data analysis is required.
This step is called the Initial Data Analysis or the IDA.
If the IDA suggests that Further Data Analysis is required, then this step seeks one of two conclusions: there is no evidence of a relationship, or there is evidence of a relationship.
As we have already seen in the previous sections, this process can be represented diagrammatically as:
The Data-Analysis Methodology seeks to find the answer to the following key question:
The final outcome is one of two conclusions:
There is no evidence of a relationship, labelled as the ‘No’ outcome in the diagram above, in which case the analysis is finished.
There is evidence of a relationship, labelled as the ‘Yes’ outcome in the diagram above, in which case the nature of the relationship needs to be described.
The first step is to have a clear idea of what is meant by a connection between a measured response variable and a measured explanatory variable. Imagine a population under study consisting of a very large number of population members, and on each population member two measurements are made, the value of \(Y\) the response variable and the value of \(X\) the explanatory variable. For the whole population a graph of \(Y\) against \(X\) could be plotted conceptually. If the graph looked as in the diagram below, then there is quite clearly a link between Y and X. If the value of \(X\) is known, the exact value of Y can be read off the graph. This is an unlikely scenario in the data-analysis context, because the relationship shown is a deterministic relationship. Deterministic means that if the value of \(X\) is known then the value of \(Y\) can be precisely determined from the relationship between \(Y\) and \(X\).
When analysing the relationship between two measured variables we start off by creating a scatter plot. A scatter plot is a graph with one axis for the explanatory variable (commonly known in regression modelling as the predictor and labelled \(X\)) and one axis for the response variable (labelled \(Y\) and commonly known as the outcome variable). Thus, each point on the graph represents a single \((X, Y)\) pair. The primary benefit is that the possible relationship between the two variables can be viewed and analysed at a glance, and often the nature of a relationship can be determined quickly and easily.
Let us consider a few scatter plots. The following graph represents a perfect linear relationship. All points lie exactly on a straight line. It is easy in this situation to determine the intercept and the slope, i.e. gradient and hence specify the exact mathematical link between the response variable \(Y\) and the explanatory variable \(X\).
[Graph 1]
The relationship shown in Graph 2 shows clearly that as the value of \(X\) increases the value of \(Y\) increases, but the points do not lie exactly on a straight line as in the previous scatter plot. This shows a statistical link: as the value of the explanatory variable \(X\) increases, the value of the response variable \(Y\) also tends to increase. An explanation for this is that the response \(Y\) may depend on a number of different variables, say \(X_1\), \(X_2\), \(X_3\), \(X_4\), \(X_5\), \(X_6\) etc., which could be written as:
\(Y = f(X_1, X_2, X_3, X_4, X_5, X_6, ...)\)
[Graph 2]
If the nature of the link between \(Y\) and \(X\) is under investigation then this could be represented as:
\[Y = f(X) + \text{effect of all other variables}\]
{{% notice note %}} The effect of all other variables is commonly abbreviated to e. {{% /notice %}}
Graph 1 shows a link where the effect of all the other variables is nil; the response \(Y\) depends solely on the variable \(X\). Graph 2 shows a situation where \(Y\) depends on \(X\) but the other variables also have an influence.
Consider the model:
\[Y = f(X) + e\] Remember 😃, \(\text{e is the effect of all other variables}\)!
The influence on the response variable \(Y\) can be thought of as being made up of two components: the component of \(Y\) that is explained by changes in the value of \(X\), and the component of \(Y\) that is explained by changes in the other factors.
Or, in more abbreviated forms: the ‘Explained Variation’ and the ‘Unexplained Variation’.
The Total Variation in \(Y\) is made up of two components:
Which may be written as:
\[\text{The Total Variation in Y} = \text{Explained Variation} + \text{Unexplained Variation}\]
In Graph 1 the Unexplained Variation is nil, since the value of \(Y\) is completely determined by the value of \(X\). In Graph 2 the Explained Variation is large relative to the Unexplained Variation, since the value of \(Y\) is very largely influenced by the value of \(X\).
Consider Graph 3. Here there is no discernible pattern, and the value of \(Y\) seems to be unrelated to the value of \(X\).
[Graph 3]
If \(Y\) is not related to \(X\) the Explained Variation component is zero and all the changes in \(Y\) are due to the other variables, that is the Unexplained Variation.
Finally, consider Graphs 4 & 5 below:
[Graph 4]
[Graph 5]
Graph 4 shows a similar picture to Graph 2, the difference being that as the value of \(X\) increases the value of \(Y\) decreases. The value of \(Y\) is influenced by the value of \(X\), so the Explained Variation is high relative to the Unexplained Variation. Consider the last graph, Graph 5, which is a deterministic relationship. The value of \(Y\) is completely specified by the value of \(X\). Hence the Unexplained Variation is zero.
Graphs Summary:
| | Graph 1 | Graph 2 | Graph 3 | Graph 4 | Graph 5 |
|---|---|---|---|---|---|
| Explained Variation in Y | All | High | Zero | High | All |
| Unexplained Variation in Y | Zero | Low | All | Low | Zero |
In regression the discussion started with the following idea: \[Y = f(X) + e\] And to quantify the strength of the link, the influence on \(Y\) was broken down into two components: \[\text{The Total Variation in Y} = \text{Explained Variation} + \text{Unexplained Variation}\]
This presents two issues:
{{% notice note %}} The simplest form of connection is a straight-line relationship and the question arises could a straight-line relationship be matched with the information contained in the graphs 1 -5? {{% /notice %}}
\[Y = a + bX\] where \(a\) is the intercept and \(b\) is the gradient.
For the statistical relationships as shown in Graphs 2 & 4:
Can the intercept and gradient be measured?
Can the values of the three quantities The Total Variation in Y, Explained Variation and The Unexplained Variation be measured?
It is sufficient to work out any two since:
\[\text{The Total Variation in Y} = \text{Explained Variation} + \text{Unexplained Variation}\]
Fitting a line by eye is subjective. It is unlikely that any two analysts will draw exactly the same line, hence the intercept and gradient will be slightly different from one person to the next. What is needed is an agreed method that will provide an estimate of the intercept and the gradient.
Consider the simple numerical example below:
Suppose we would like to fit a straight-line relationship to the following data:
\(X\) | 1 | 2 | 3 | 4 | 5 | 6 |
---|---|---|---|---|---|---|
\(Y\) | 7 | 8 | 12 | 13 | 14 | 18 |
The problem is to use this information to measure the intercept and the gradient for this data set.
A simple way to do this is to draw what is considered to be the line of best fit by judgement or guesswork! 😁😉
The intercept can be read off the graph as approximately \(5\), and the gradient can be simply measured since if \(X = 0\) then \(Y = 5\), and if \(X = 5\) then \(Y = 15\), so for a change in \(X\) of \(5\) units (from 0 to 5) \(Y\) changes by \(10\), from \(5\) to \(15\). The definition of the gradient is ‘the change in \(Y\) for a unit increase in \(X\)’, hence the gradient for this data set is \(10/5 = 2\).
The straight-line relationship obtained by this process is \[\hat{Y} = 5 + 2X\]
Note, \(\hat{Y}\) is a notation for the
value of \(Y\) as predicted by the
straight-line relationship.
We can add the values of \(Y\) predicted by the “estimated” model to our table, so that we have
X | Y | \(\hat{Y}\) | \({Y - \hat{Y}}\) |
---|---|---|---|
1 | 7 | 7.00 | 0.00 |
2 | 8 | 9.00 | -1.00 |
3 | 12 | 11.00 | 1.00 |
4 | 13 | 13.00 | 0.00 |
5 | 14 | 15.00 | -1.00 |
6 | 18 | 17.00 | 1.00 |
Looking at the information contained in the table above, the column headed \(\hat{Y}\) contains the predicted values of \(Y\) for the values of \(X\). For example, the first value of \(\hat{Y}\) is when \(X = 1\) and \(\hat{Y} = 5 + 2X\), so \(\hat{Y} = 5 + 2 \times 1 = 7\). The column headed \((Y - \hat{Y})\) is the difference between the actual value and the value predicted by the line. For example, when \(X = 1\), \(Y = 7\) and \(\hat{Y} = 7\), so the data point lies on the line, as can be seen in the graph. For the value \(X = 2\) the actual value lies \(1\) unit below the line, as can also be seen from the graph.
The column \((Y - \hat{Y})\) measures the disagreement between the actual data and the line, and a sensible strategy is to make this level of disagreement as small as possible. Referring to the table above, notice that sometimes the actual data value is below the line and sometimes it is above the line, so on average the value will be close to zero. In this particular example the \((Y - \hat{Y})\) values add up to zero (in a more conventional notation \(\Sigma(Y - \hat{Y}) = 0\)). The quantity \(\Sigma(Y - \hat{Y})\) is therefore not a satisfactory measure of disagreement, because there are a number of different lines with the property \(\Sigma(Y - \hat{Y}) = 0\).
A way of obtaining a satisfactory measure of disagreement is to square the individual \((Y - \hat{Y})\) values and add them up. i.e. obtain the quantity \(\Sigma(Y - \hat{Y})^2\). The result is always a positive number since the square of a negative number is positive. If this quantity is then chosen to be as small as possible then the level of disagreement between the actual data points and the fitted line is the least. This provides a criterion for the choice of the best line.
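A small R sketch of these two measures of disagreement for the eyeballed line \(\hat{Y} = 5 + 2X\) from the table above:

```r
x <- 1:6
y <- c(7, 8, 12, 13, 14, 18)
y_hat <- 5 + 2 * x      # predictions from the eyeballed line

sum(y - y_hat)          # 0: positive and negative deviations cancel out
sum((y - y_hat)^2)      # 4: the sum of squared deviations is always positive
```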
The quantity \(\Sigma(Y - \hat{Y})^2\) can be easily calculated and added to our table
X | Y | \(\hat{Y}\) | \({Y - \hat{Y}}\) | \((Y - \hat{Y})^2\) |
---|---|---|---|---|
1 | 7 | 7.00 | 0.00 | 0 |
2 | 8 | 9.00 | -1.00 | 1 |
3 | 12 | 11.00 | 1.00 | 1 |
4 | 13 | 13.00 | 0.00 | 0 |
5 | 14 | 15.00 | -1.00 | 1 |
6 | 18 | 17.00 | 1.00 | 1 |
The quantity \(\Sigma(Y - \hat{Y})^2\) is a measure of the disagreement between the actual \(Y\) values and the values predicted by the line. If this value is chosen to be as small as possible then the disagreement between the actual Y values and the line is the smallest it could possibly be, hence the line is The line of Best Fit.
This procedure of finding the intercept and the gradient of a line that makes the quantity \(\Sigma(Y - \hat{Y})^2\) a minimum is called The Method of Least Squares.
The Method of Least Squares originated with C. F. Gauss (1777 - 1855), a German mathematician; the Russian mathematician A. A. Markov (1856 - 1922) later developed the method further.
In R we use the lm( )
function to fit linear models as
illustrated below.
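A minimal sketch, reusing x and y from the example above; note that the least-squares intercept and gradient differ slightly from the eyeballed line:

```r
model <- lm(y ~ x)   # fit the line of best fit by ordinary least squares
coef(model)
## (Intercept)           x
##    4.600000    2.114286
```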
The final issue is to find out how to measure the three quantities: the Total Variation in Y, the Explained Variation and the Unexplained Variation.
Taking these quantities one at a time they can be measured as follows:
The Unexplained Variation
This turns out to be very simple to measure. The quantity \(\Sigma(Y - \hat{Y})^2\) is a measure of the Unexplained Variation. If the line were a perfect fit to the data, the value predicted by the line and the actual value would be exactly the same and the value of the quantity \(\Sigma(Y - \hat{Y})^2\) would be zero. This quantity measures the disagreement between the actual \(Y\) values and the predicted values \(\hat{Y}\); the differences \((Y - \hat{Y})\) are also known as the residuals, and they measure the Unexplained Variation in \(Y\). For the example used above, using the least-squares line \(\hat{Y} = 4.6 + 2.114X\) fitted by lm( ) (rather than the eyeballed line \(\hat{Y} = 5 + 2X\), which gives \(\Sigma(Y - \hat{Y})^2 = 4\)), the value of \(\Sigma(Y - \hat{Y})^2\) is \(3.77\).
The Total Variation in Y
This is related to the measures of variability (spread) introduced earlier in the course and in particular to the standard deviation (\(\sigma\)). To measure The Total Variation in Y requires a measure of spread.
The Total Variation in Y is defined to be the quantity \(\Sigma(Y - \bar{Y})^2\), where \(\bar{Y}\) is the average value of \(Y\) (\(\bar{Y} = \Sigma Y/n\)).
In our earlier example \(\bar{Y} = 12\); using the least-squares fitted values \(\hat{Y}\), we can expand the table to include this calculation
X | Y | \(\hat{Y}\) | \({Y - \hat{Y}}\) | \((Y - \hat{Y})^2\) | \((Y - \bar{Y})^2\) |
---|---|---|---|---|---|
1 | 7 | 6.71 | 0.29 | 0.08 | 25.00 |
2 | 8 | 8.83 | -0.83 | 0.69 | 16.00 |
3 | 12 | 10.94 | 1.06 | 1.12 | 0.00 |
4 | 13 | 13.06 | -0.06 | 0.00 | 1.00 |
5 | 14 | 15.17 | -1.17 | 1.37 | 4.00 |
6 | 18 | 17.29 | 0.71 | 0.51 | 36.00 |
giving \(\Sigma(Y - \bar{Y})^2 = 82\).
The Explained Variation in Y
If the line were a perfect fit, then the \(Y\) values and the \(\hat{Y}\) values would be exactly the same, and the quantity \(\Sigma(\hat{Y} - \bar{Y})^2\) would equal the Total Variation in Y. If the line is not a perfect match to the actual \(Y\) values then this quantity measures the Explained Variation in Y.
X | Y | \(\hat{Y}\) | \({Y - \hat{Y}}\) | \((Y - \hat{Y})^2\) | \((Y - \bar{Y})^2\) | \((\hat{Y} - \bar{Y})^2\) |
---|---|---|---|---|---|---|
1 | 7 | 6.71 | 0.29 | 0.08 | 25.00 | 27.94 |
2 | 8 | 8.83 | -0.83 | 0.69 | 16.00 | 10.06 |
3 | 12 | 10.94 | 1.06 | 1.12 | 0.00 | 1.12 |
4 | 13 | 13.06 | -0.06 | 0.00 | 1.00 | 1.12 |
5 | 14 | 15.17 | -1.17 | 1.37 | 4.00 | 10.06 |
6 | 18 | 17.29 | 0.71 | 0.51 | 36.00 | 27.94 |
Incorporating this calculation into the table above gives \(\Sigma(\hat{Y} - \bar{Y})^2 = 78.23\).
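A short R sketch computing the three quantities for the toy example (model is the lm( ) fit obtained earlier):

```r
y_hat <- fitted(model)    # least-squares predicted values
y_bar <- mean(y)

sum((y - y_bar)^2)        # Total Variation in Y:    82
sum((y - y_hat)^2)        # Unexplained Variation:    3.77
sum((y_hat - y_bar)^2)    # Explained Variation:     78.23
```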
Once the intercept and slope have been estimated using the method of least squares, various indices are studied to determine the reliability of these estimates. One of the most popular of these reliability indices is the correlation coefficient. A sample correlation coefficient, more specifically known as the Pearson Product Moment correlation coefficient, denoted \(r\), has possible values between \(-1\) and \(+1\), as illustrated in the diagram below.
In fact, the correlation is a parameter of the bivariate normal distribution that is used to describe the association between two variables; it does not include a cause-and-effect statement. That is, one variable does not depend on the other, i.e. the variables are not labelled as dependent and independent. Rather, they are considered as two random variables that seem to vary together. Hence, it is important to recognise that correlation does not imply causation. In correlation analysis, both \(Y\) and \(X\) are assumed to be random variables, while in linear regression, \(Y\) is assumed to be a random variable and \(X\) is assumed to be a fixed variable.
The main characteristics of the correlation coefficient \(r\) are: it takes values between \(-1\) and \(+1\); values close to \(+1\) indicate a strong positive linear association and values close to \(-1\) a strong negative linear association; a value close to \(0\) indicates no linear association; and it is unaffected by changes in the units of measurement of either variable.
BUT!!! \(r\) measures only the strength of a linear association; it can be close to zero even when the two variables are strongly related in a nonlinear way.
{{% notice note %}} The Spearman rank correlation coefficient is a corresponding nonparametric equivalent of the Pearson correlation coefficient. This statistic is computed by replacing the data values with their ranks and applying the Pearson correlation formula to the ranks of the data. Tied values are replaced with the average rank of the ties. Just like the Pearson coefficient, this one is also really a measure of association rather than correlation, since the ranks are unchanged by a monotonic transformation of the original data. For sample sizes greater than 10, the distribution of the Spearman rank correlation coefficient can be approximated by the distribution of the Pearson correlation coefficient. It is also worth knowing that the Spearman rank correlation coefficient uses weights when weights are specified. {{% /notice %}}
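A one-line sketch of each coefficient in R, reusing the toy data x and y from the regression example:

```r
cor(x, y)                        # Pearson product moment correlation
cor(x, y, method = "spearman")   # Spearman rank correlation
```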
We realise that when fitting a regression model we are seeking to find out how much variance is explained, or is accounted for, by the explanatory variable \(X\) in an outcome variable \(Y\).
In the earlier example the following values were calculated: the Total Variation in Y \(= 82.00\), the Explained Variation \(= 78.23\) and the Unexplained Variation \(= 3.77\).
Notice that the relationship given below is satisfied:
\[\text{The Total Variation in Y = Explained Variation + Unexplained Variation}\]
\[82.00 = 78.23 + 3.77\] What do these quantities tell us? They are difficult to interpret because they are expressed in the units of the problem.
Consider the following ratio:
\({\text{The Explained Variation in Y} \over \text{The Total Variation in Y}} = {78.23 \over 82.00} = 0.954\)
This is saying that \(0.954\) or \(95.4\%\) of the changes in \(Y\) are explained by changes in \(X\). This is a useful and useable measure of the effectiveness of the match between the actual \(Y\) values and the predicted \(Y\) values.
Reviewing the five scatter plots (Graphs 1 to 5), it can easily be seen that if the line is a perfect fit to the actual \(Y\) values, as in Graphs 1 & 5, then this ratio will have the value \(1\) or \(100\%\).
If there is no link between \(Y\) and \(X\) then the Explained Variation is zero, hence the ratio will be \(0\) or \(0\%\). An example of this is shown in Graph 3.
The remaining graphs: Graphs 2 & 4 show a statistical relationship hence this ratio will lie between \(0\) & \(1\). The closer the ratio is to zero the less strong the link is, whilst the closer the ratio is to \(1\) the stronger the connection is between \(X\) & \(Y\).
The Ratio:
\[R^2 = {\text{The Explained Variation in Y} \over \text{The Total Variation in Y}}\]
is called the Coefficient of Determination, and usually labelled as \(R^2\), and may be defined as the proportion of the changes in \(Y\) explained by changes in \(X\).
This ratio is always on the scale \(0\) to \(1\), but by convention is usually expressed as a percentage, so is regarded as on the scale \(0\) to \(100\%\). The interpretation of this ratio is as follows:
The theoretical minimum \(R^2\) is \(0\). However, since linear regression is based on the best possible fit, \(R^2\) will always be greater than zero, even when the predictor and outcome variables bear no relationship to one another. The definition and interpretation of \(R^2\) is a very important tool in the data analyst’s tool kit for tracking connections between a measured response variable and a measured explanatory variable.
When working with sample data to investigate any relationships that may exist between a measured response variable and a measured explanatory variable, the information contained within the sample is imperfect, so has to be interpreted in the light of sampling error. This is particularly important when interpreting the value of \(R^2\) calculated from sample data. The sample \(R^2\) value gives a measure of the strength of the connection, and this can sometimes be difficult to interpret, particularly if the sample size is small.
The data analysis methodology as set out earlier, is a procedure that enables you to make judgements from sample data. The methodology requires you to know what specific procedures make up the Initial Data Analysis IDA and how to interpret the results of the IDA to obtain one of three outcomes:
If Further Data Analysis, FDA, is required then you need to know what constitutes this further analysis and how to interpret it.
Finally, if a connection between the response variable and the explanatory variable is detected, then the nature of the connection needs to be described.
To demonstrate the data analysis methodology we’ll go back to the Share Price Study case study.
Share Price Study Data
A business analyst is studying share prices of companies from three different business sectors. As part of the study a random sample (n=60) of companies was selected and the following data was collected:
Variable | Description |
---|---|
Share_Price | The market value of a company share (£) |
Profit | Company annual profit (£1.000.000) |
RD | Company annual spending on research and development (£1.000) |
Turnover | Company annual total revenue (£1.000.000) |
Competition | A variable coded: |
- 0 if the company operates in a very competitive market | |
- 1 if the company has a great deal of monopoly power | |
Sector | A variable coded: |
- 1 if the company operates in the IT business sector | |
- 2 if the company operates in the Finance business sector | |
- 3 if the company operates in the Pharmaceutical business sector | |
Type | A variable coded: |
- 0 if the company does business mainly in Europe | |
- 1 if the company trades globally |
Let’s start off by investigating the relationship between the
variables Share_Price
and Profit
.
We will adopt the following notation
\(\text{Model to be estimated: } Y = b_0 + b_1X + e\)
We start by accessing data:
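The code for this step is not shown in the output below; a plausible sketch (the file name SHARE_PRICE.csv is hypothetical) is:

```r
library(dplyr)   # for the glimpse( ) function

mydata <- read.csv("SHARE_PRICE.csv")   # hypothetical file name
glimpse(mydata)

# Convert the coded variables to factors before summarising
mydata$Competition <- as.factor(mydata$Competition)
mydata$Sector      <- as.factor(mydata$Sector)
mydata$Type        <- as.factor(mydata$Type)
summary(mydata)
glimpse(mydata)
```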
## Rows: 60
## Columns: 7
## $ Share_Price <int> 880, 862, 850, 840, 838, 825, 808, 806, 801, 799, 783, 777…
## $ Profit <dbl> 161.3, 170.5, 140.7, 115.7, 107.9, 138.8, 102.0, 102.7, 10…
## $ RD <dbl> 152.6, 118.3, 110.6, 87.2, 75.1, 116.2, 91.3, 100.4, 113.5…
## $ Turnover <dbl> 320.9, 306.3, 279.5, 193.2, 182.4, 265.2, 212.0, 170.3, 23…
## $ Competition <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 1…
## $ Sector <int> 3, 3, 3, 3, 3, 3, 3, 3, 3, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 2…
## $ Type <int> 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1…
## Share_Price Profit RD Turnover Competition
## Min. :101.0 Min. : 2.90 Min. : 39.20 Min. : 30.3 0:30
## 1st Qu.:501.2 1st Qu.: 59.73 1st Qu.: 75.78 1st Qu.:112.3 1:30
## Median :598.5 Median : 88.85 Median : 90.60 Median :173.5
## Mean :602.8 Mean : 84.76 Mean : 89.64 Mean :170.2
## 3rd Qu.:739.8 3rd Qu.:106.62 3rd Qu.:104.15 3rd Qu.:216.6
## Max. :880.0 Max. :170.50 Max. :152.60 Max. :323.3
## Sector Type
## 1:20 0:30
## 2:20 1:30
## 3:20
##
##
##
## Rows: 60
## Columns: 7
## $ Share_Price <int> 880, 862, 850, 840, 838, 825, 808, 806, 801, 799, 783, 777…
## $ Profit <dbl> 161.3, 170.5, 140.7, 115.7, 107.9, 138.8, 102.0, 102.7, 10…
## $ RD <dbl> 152.6, 118.3, 110.6, 87.2, 75.1, 116.2, 91.3, 100.4, 113.5…
## $ Turnover <dbl> 320.9, 306.3, 279.5, 193.2, 182.4, 265.2, 212.0, 170.3, 23…
## $ Competition <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 1…
## $ Sector <fct> 3, 3, 3, 3, 3, 3, 3, 3, 3, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 2…
## $ Type <fct> 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1…
The Initial Data Analysis for investigating a connection between a measured response and a measured explanatory variable requires obtaining a graph of the response against the explanatory variable and to calculate the value of \(R^2\) from the sample data.
The IDA has three possible outcomes: there is clear evidence of a link, there is no evidence of a link, or the IDA is inconclusive and Further Data Analysis is required.
The simplest form of connection is a straight-line relationship and the question arises could a straight-line relationship be matched with the information contained in the Graphs 1 to 5 discussed earlier?
As part of an informal investigation of the possible relationship
between Share_Price
and Profit
, first we will
use R to obtain a scatter plot with the line of the best fit. Rather
than every time referring to the name of the data set containing the
variables of interest, we will attach our data and refer to the
variables directly using only their names (see help for the
attach( )
function).
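A sketch of this step, assuming the data frame is called mydata:

```r
attach(mydata)   # make the variables accessible directly by name
names(mydata)
```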
## [1] "Share_Price" "Profit" "RD" "Turnover" "Competition"
## [6] "Sector" "Type"
Let’s go through the code and the functions we have used to produce this graph.
The plot( ) function gives a scatterplot of two numerical variables. The first variable listed will be plotted on the horizontal axis and the second on the vertical axis; i.e. you ‘feed’ in as arguments first the variable representing \(X\) and then the variable representing \(Y\): plot(x, y). Considering that we are investigating a relationship between \(X\) and \(Y\) in the form of a regression line \(Y = b_0 + b_1X + e\), we can specify this in R as the formula Y ~ X, which can be read as “\(Y\) is modelled as a function of \(X\)”. This means that a formula interface can also be used in R’s plot( ) function, in which case the response variable \(Y\) comes before the tilde (\(\sim\)), followed by the \(X\) variable that will be plotted on the horizontal axis.

Next, we fit a line of best fit through our scatterplot using the abline( ) function for the linear model \(Y = b_0 + b_1X\), to see how close the points are to the fitted line. The basic R function for fitting linear models by ordinary least squares is lm( ), which stands for linear model. All we need to feed R with when using the lm( ) function is the formula Y ~ X.
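A sketch of the plotting code described above (the colour choice is ours):

```r
plot(Share_Price ~ Profit)                     # scatterplot using the formula interface
abline(lm(Share_Price ~ Profit), col = "red")  # add the least-squares line of best fit
```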
The scatterplot shows a fairly strong and reasonably linear relationship between the two variables. In other words, the fit is reasonably good, but it is not perfect and we could do with some more information about it. Let us see what the lm( ) function provides as part of the output.
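A sketch of this step; model is our (assumed) name for the fitted object:

```r
model <- lm(Share_Price ~ Profit)   # fit the simple linear regression
model                               # print the call and the coefficients
```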
##
## Call:
## lm(formula = Share_Price ~ Profit)
##
## Coefficients:
## (Intercept) Profit
## 258.924 4.057
First, R displays the fitted model, after which it shows the estimates of the two parameters \(b_0\) & \(b_1\), which it refers to as coefficients.
- the intercept, \(b_0 = 258.924\)
- the slope, \(b_1= 4.057\)
We must now take this estimated model and ask a series of questions to decide whether our estimated model is good or bad: that is, we have to subject the fitted model to a set of tests designed to check the validity of the model, which is in effect a test of your viewpoint/theory as a modeller.
Examining the scatterplot, we can see that not all of the points lie on the fitted line \(Share\_Price = 258.924 + 4.057Profit\). To explain how strong the relationship is between the two variables we need to obtain the coefficient of determination known as the \(R^2\) parameter.
The coefficient of determination, \(R^2\), is a single number that measures the extent to which the explanatory variable can explain, or account for, the variability in \(Y\) – that is, how well does the explanatory variable explain the variability in, or behaviour of, the phenomenon we are trying to understand.
Earlier, we saw that \(R^2\) is constrained to lie in the following range:
\[0\% \text{ <---------- } R^2 \text{ ----------> } 100\%\]
We realised that the closer \(R^2\) is to \(100\%\) then the better the model is, and conversely, a value of \(R^2\) close to \(0\%\) implies a weak/poor model.
To obtain all of the information about the fitted model we can use
the summary( )
function as in the following:
##
## Call:
## lm(formula = Share_Price ~ Profit)
##
## Residuals:
## Min 1Q Median 3Q Max
## -175.513 -74.826 0.107 67.824 141.358
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 258.9243 28.7465 9.007 1.29e-12 ***
## Profit 4.0567 0.3102 13.077 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 89.98 on 58 degrees of freedom
## Multiple R-squared: 0.7467, Adjusted R-squared: 0.7424
## F-statistic: 171 on 1 and 58 DF, p-value: < 2.2e-16
So, what have we got here? 🤔
We will discuss some of the key components of R’s
summary( )
function for linear regression models.
- Call: shows what function and variables were used to create the model.
- Residuals: difference between what the model predicted and the actual value of Y. Try to see if you can calculate this ‘residuals’ section by yourself using:
summary(Y-model$fitted.values)
. 🤓- Coefficients:
- estimated parameters for intercept and slope: \(b_0\) and \(b_1\)
- Std. Error: for the slope, the Residual Standard Error divided by \(\sqrt{\Sigma(X - \bar{X})^2}\), the square root of the sum of squared deviations of the explanatory variable from its mean.
- t value: Estimate divided by Std. Error
- Pr(>|t|): the p-value, i.e. the probability of obtaining a t value at least this extreme when the true coefficient is zero, obtained from a t distribution with the given degrees of freedom.
{{% notice note %}} Note that in the first section we can find statistics relevant to the estimation of the model’s parameters. In the second part of the output we can find statistics related to the overall goodness of the fitted model. {{% /notice %}}
The summary(lm( )) output produces the standard deviation of the error, the Residual Standard Error. We know that the standard deviation is the square root of the variance. The Residual Standard Error is very similar, the only difference being that instead of dividing by \(n - 1\), you divide by \(n - (1 + k)\), where \(k\) is the number of explanatory variables used in the model. See if you can use and adjust the code below to calculate this statistic. 🤓

```r
# Residual Standard Error
k <- length(model$coefficients) - 1   # subtract one to ignore the intercept
SSE <- sum(model$residuals^2)
n <- length(model$residuals)
sqrt(SSE / (n - (1 + k)))             # Residual Standard Error
```
Next is the coefficient of determination, which helps us determine how well the model fits the data. We have already seen that \(R^2\) compares the unexplained (residual) variation with the total variation in \(Y\): \(R^2 = 1 - \Sigma(Y - \hat{Y})^2 / \Sigma(Y - \bar{Y})^2\). The bigger the residual error, the smaller the proportion of the variation the model explains.
For the time being we will skip \(R^2_{adjusted}\) and point out that it is used for models with multiple variables, to which you will be introduced in the next section.
Lastly, the F-Statistic is the second “test” that the summary function produces for lm models. The F-Statistic is a “global” test that checks whether at least one of the coefficients is nonzero; when dealing with a simple regression model it checks whether the model is worthy of further investigation and interpretation.
To go back to our model interpretation: earlier, after examining the scatter plot, we said that there is a clear link between Share_Price and Profit. This is confirmed by the value of \(R^2 = 74.67\%\). The interpretation of \(R^2\) suggests that \(74.67\%\) of the changes in Share_Price are explained by changes in Profit. Alternatively, \(25.33\%\) of the changes in Share_Price are due to other variables.
For this example the outcome of the IDA is that there is clear evidence of a link between Share_Price and Profit, the nature of the influence being that as Profit increases Share_Price also increases. It only remains to describe the connection, discuss how effective the model is, and ask whether it can be used to predict the Share_Price value from the value of Profit. 🤔
The adequacy of \(R^2\) can be judged both informally and formally using hypothesis testing. Like all hypothesis tests, this test is carried out in four stages:
Stage 1: Specify the hypotheses (\(H_0\) & \(H_1\)).
The Coefficient of Determination \(R^2\) is a useful quantity. By definition, if \(R^2 = 0\) then there is no connection between the response variable and the explanatory variable. Conversely, if the value of \(R^2\) is greater than zero there must be a connection. This enables the formal hypotheses to be defined as: \(H_0: R^2 = 0\) (there is no connection) versus \(H_1: R^2 > 0\) (there is a connection).
Stage 2: Defining the test parameters and the decision rule.
The decision rule is based on the F statistic. The F distribution has a shape as shown below:
The value of \(F_{crit}\) is the value that divides the distribution with \(95\%\) of the area to the left of the \(F_{crit}\) ordinate and \(5\%\) to the right. The subscript crit signifies that it is a critical value obtained from statistical tables, whereas \(F_{calc}\) will denote the value of the F statistic calculated from the sample data.
The decision rule is:
If the value of \(F_{calc}\) from the sample data is larger than \(F_{crit}\), then the sample evidence favours the decision that there is a connection between the response variable and the explanatory variable. (i.e. favours \(H_1\).)
If the value of \(F_{calc}\) is smaller than \(F_{crit}\) then the sample evidence is consistent with no connection between the response variable and the explanatory variable. (i.e. favours \(H_0\))
The decision rule can be summarised as:
If \(F_{calc} < F_{crit}\) then favour \(H_0\), whilst if \(F_{calc} > F_{crit}\) then favour \(H_1\). This decision rule can be represented graphically as below:
The decision rule may be written as: reject \(H_0\) in favour of \(H_1\) if \(F_{calc} > F_{crit}\).
The value of \(F_{crit}\) depends on
the amount of data in the sample. The Degrees of Freedom
(usually abbreviated to \(df\)) express
this dependence on the sample size. For any given set of sample data,
having obtained the \(df\) the specific
value of \(F_{crit}\) can be obtained
from Statistical Tables of the F distribution, or simply by
using the qf(p, df1, df2)
function in R.
Stage 3: Examining the sample evidence
For the investigation of the Share_Price vs Profit we plot the scatterplot and obtain the summary of the fitted model: \(Share\_Price = b_0 + b_1Profit\)
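The \(R^2\) value alone can be extracted from the summary; a sketch, assuming the fitted object model from earlier:

```r
summary(model)$r.squared   # coefficient of determination
```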
## [1] 0.7467388
This is a good model, explaining almost \(3/4\) of the overall variability in the response variable \(Y\). Hence, based on the sample evidence, the variable Profit can explain \(74.67\%\) of the overall variation in the variable Share_Price.
{{% notice note %}} Note that we obtain the full output of the summary(lm(model_1)) function regardless of whether FDA is needed. If we conclude that there is a clear relationship, we need the estimates of the parameters to describe it, and if we need to investigate the significance of the relationship further, we need the relevant statistics for conducting the FDA. {{% /notice %}}
In the cases where \(R^2 \approx 0\)
we will carry out a hypothesis test for which we need \(F_{calc}\) from the summary( )
function.
Let us assume that the \(R^2\) for the given example Share_Price vs Profit is very small and close to zero. In that case, we cannot simply reject it as a statistically insignificant relationship, as \(R^2\) is not zero, but neither can we with confidence accept it as a statistically valid relationship based on the sample evidence we use.
This step requires the calculation of the \(F_{calc}\) value from the sample data, and the determining of the degrees of freedom.
##
## Call:
## lm(formula = Share_Price ~ Profit)
##
## Residuals:
## Min 1Q Median 3Q Max
## -175.513 -74.826 0.107 67.824 141.358
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 258.9243 28.7465 9.007 1.29e-12 ***
## Profit 4.0567 0.3102 13.077 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 89.98 on 58 degrees of freedom
## Multiple R-squared: 0.7467, Adjusted R-squared: 0.7424
## F-statistic: 171 on 1 and 58 DF, p-value: < 2.2e-16
Stage 4: Conclusion
The F statistic evaluating the significance of the full model is \(F_{calc} = 171\) with \(df_1 = 1\) and \(df_2 = 58\). Thus, the corresponding critical value can be obtained:
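A one-line sketch using the qf( ) function mentioned earlier:

```r
qf(0.95, df1 = 1, df2 = 58)   # 95th percentile of the F(1, 58) distribution
```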
## [1] 4.006873
\(F_{crit} = 4.01\) and using the decision rule given, we get that
\[F_{calc} = 171 > F_{crit} = 4.01 \Rightarrow \text{accept } H_1\] i.e. this model is statistically significant and we need to describe it.
If the resulting outcome from either the Initial Data Analysis (IDA) or the Further Data Analysis (FDA) is that there is a connection between the response variable and the explanatory variable then the last step is to describe this connection.
For the data analysis situation Measured v Measured this is simply stating the line of best fit. This is a description of the connection between the two variables. Additionally, the \(R^2\) value should be quoted, as this gives a measure of how well the data and the line of best fit match.
The \(R^2\) value can be interpreted as a measure of the quality of predictions made from the line of best fit: the closer \(R^2\) is to \(100\%\), the better the quality of the predictions.
In our example Share_Price vs Profit we had the following estimations:
\[Share\_Price = 258.9243 + 4.0567 Profit\]
with \(R^2 = 74.67\%\).
Can the regression line be used to make predictions about the Share_Price for a company with a given value of the Profit?
Suppose we have to make a prediction of the \(Share\_Price\) for a company with \(Profit = 137.2\).
The predictions are calculated from the model as follows:
\[Share\_Price = 258.9243 + 4.0567 \times 137.2000 = 815.5035\] Since the \(R^2\) value is nearly \(75\%\), any predictions about Share_Price made from this model are likely to be of good quality, but there may be issues with predictions for Profit values outside the range of the data used to fit the model. The Profit values in the given data set range from about \(3\) to \(170\). Over this range any prediction is likely to be of good quality, since the information in the data reflects experience within this range and the \(R^2\) value is nearly \(75\%\).
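The same prediction can be sketched in R with the predict( ) function, assuming the fitted object model from earlier:

```r
predict(model, newdata = data.frame(Profit = 137.2))   # ~815.5, as calculated above
```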
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.90 59.73 88.85 84.76 106.62 170.50
A prediction of Share_Price for a company whose Profit lies outside this range of experience is not likely to be very reliable, or of much value. These problems are caused by the fact that the information available about the link between Share_Price and Profit is only valid over the range of values that Profit takes in the data. There is no information about the nature of the connection outside this range, and any predictions made there must be treated with caution.
In general terms when interpreting a regression model the intercept is of little value, since it is generally out of the range of the data. The gradient term is the more important and the formal definition of the gradient provides the interpretation of this information.
The gradient is defined to be the change in \(Y\) for a unit increase in \(X\). For the model developed the gradient is \(4.06\). This is suggesting for every additional unit increase in Profit, Share_Price increases by \(£4.06\).
Avoid extrapolation! Refrain from predicting/interpreting the regression line for X-values outside the range of X in the data set!
Investigate the nature of the relationship in Share Price Study
data for Share_Price vs RD
and
Share_Price vs Turnover
.
Download the Supermarket data set available at https://github.com/TanjaKec/mydata using the following link: https://tanjakec.github.io/mydata/SUPERM.csv.
Create a report in which you will:
You are expected to bring your report to the next class.
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.