Basic Statistical Concepts

In this section you will be introduced to a set of concepts that enable data to be explored with the objective of:

  • summarising and understanding the main features of the variables contained within the data, and
  • investigating the nature of any linkages between the variables that may exist.

The starting point is to understand what data is.

  • What is the population?
  • Why do we use samples?

Can you provide a formal definition of the population and the sample? 😁

The population is the set of all people/objects of interest in the study being undertaken.

In statistical terms the whole data set is called the population. This represents “Perfect Information”, however in practice it is often impossible to enumerate the whole population. The analyst therefore takes a sample drawn from the population and uses this information to make judgements (inferences) about the population.

Clearly, if the results of any analysis are based on a sample drawn from the population, then for the sample to have any validity it should be chosen in a way that is fair and reflects the structure of the population.

The process of sampling to obtain a representative sample is a large area of statistical study. The simplest model of a representative sample is a “random sample”:

A sample chosen in such a way that each item in the population has an equal chance of being included is a random sample.

As soon as sample data is used, the information contained within the sample is “Imperfect” and depends on the particular sample chosen. The key problem is to use this sample data to draw valid conclusions about the population while taking into account the ‘error due to sampling’.

The importance of working with representative samples should be seriously considered; a good way to appreciate this importance is to see the consequences of using unrepresentative samples. Darrell Huff’s book How to Lie with Statistics, published by Penguin, contains several anecdotes of unrepresentative samples and the consequences of treating them as representative.

Data Analysis Using Sample Data

Usually the data will have been collected in response to some perceived problem, in the hope of being able to glean some pointers from this data that will be helpful in the analysis of the problem. Data is commonly presented to the data analyst in this way with a request to analyse the data.

Before attempting to analyse any data, the analyst should:

  1. Make sure that the problem under investigation is clearly understood, and that the objectives of the investigation have been clearly specified.

  2. Make sure that the individual variables making up the data set are clearly understood.

The analyst must understand the data before attempting any analysis.

In summary, you should ask yourself:

  1. Do I understand the problem under investigation and are the objectives of the investigation clear? The only way to obtain this information is to ask questions, and keep asking questions until satisfactory answers have been obtained.

  2. Do I understand exactly what each variable is measuring/recording?

Describing Variables

A starting point is to examine the characteristics of each individual variable in the data set.

The way to proceed depends upon the type of variable being examined.

Classification of variable types

The variables can be one of two broad types:

  • Attribute variables: variables that have their outcomes described in terms of characteristics or attributes, e.g.
    • gender
    • day of the week

A common way of handling attribute data is to give it a numerical code. Hence, we often refer to them as coded variables.

  • Measured variables: variables that have their outcomes taken from a numerical scale, so that the resulting outcome is expressed in numerical terms, e.g.
    • weight
    • age

There are two types of measured variable. A variable measured on some continuous scale of measurement, e.g. a person’s height, is called a continuous variable. The other type is a discrete variable, which results from counting; for example ‘the number of passengers on a given flight’.

The Concept of Statistical Distribution

The concept of statistical distribution is central to statistical analysis.

This concept relates to the population and conceptually assumes that we have perfect information; the exact composition of the population is known.

The ideas and concepts for examining population data provide a framework for the way of examining data obtained from a sample. The Data Analyst classifies the variables as either attribute or measured and examines the statistical distribution of the particular sample variable from the sample data.
For an attribute variable the number of occurrences of each attribute is obtained, and for a measured variable the sample descriptive statistics describing the centre, width and symmetry of the distribution are calculated.

attribute: [figure: barplot showing the frequency of each level of the attribute variable]

measured: [figure: curve showing the distribution of a measured variable]

What does the distribution show?

For an attribute variable it is very simple. We observe the frequency of occurrence of each level of the attribute variable as shown in the barplot above.

For a measured variable the area under the curve from one value to another measures the relative proportion of the population having the outcome value in that range.

A statistical distribution for a measured variable can be summarised using three key descriptions:

  • the centre of the distribution;
  • the width of the distribution;
  • the symmetry of the distribution

The common measures of the centre of a distribution are the Mean (arithmetic average) and the Median. The median value of the variable is defined to be the particular value of the variable such that half the data values are less than the median value and half are greater.

The common measures of the width of a distribution are the Standard Deviation and the Inter-Quartile Range. The Standard Deviation is the square root of the average squared deviation from the mean. Ultimately the standard deviation is a relative measure of spread (width); the larger the standard deviation the wider the distribution. The inter-quartile range is the range over which the middle 50% of the data values varies.

By analogy with the median it is possible to define the quartiles:

  • Q1 is the value of the variable that divides the distribution 25% to the left and 75% to the right.
  • Q2 is the value of the variable that divides the distribution 50% to the left and 50% to the right. This is the median by definition.
  • Q3 is the value of the variable that divides the distribution 75% to the left and 25% to the right.
  • The inter-quartile range is the value Q3 - Q1.

The diagram below shows this pictorially:

🤓💡 Conventionally the mean and standard deviation are given together as one pair of measures of location and spread, and the median and inter-quartile range as another pair of measures.
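To make these ideas concrete, here is a minimal sketch in R of how the two conventional pairs of measures can be computed; the values of x are made up for illustration only:

x <- c(12, 15, 9, 22, 17, 14, 19, 11, 25, 16)  # made-up sample values

mean(x)      # centre: the arithmetic average
median(x)    # centre: the middle value
sd(x)        # width: the standard deviation
IQR(x)       # width: the inter-quartile range, Q3 - Q1
quantile(x)  # min, Q1, median (Q2), Q3, max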

There are a number of measures of symmetry; the simplest way to measure symmetry is to compare the mean and the median. For a perfectly symmetrical distribution the mean and the median will be exactly the same. This idea leads to the definition of Pearson’s coefficient of Skewness as:

\[\text{Pearson's coefficient of skewness} = \frac{3(\text{mean} - \text{median})}{\text{standard deviation}}\]

An alternative measure of skewness is the Quartile Measure of Skewness, defined as:

\[\text{Quartile measure of skewness} = \frac{(Q_3 - Q_2) - (Q_2 - Q_1)}{Q_3 - Q_1}\]
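Assuming the same made-up vector x as in the earlier sketch, both skewness measures could be computed as:

pearson_skew <- 3 * (mean(x) - median(x)) / sd(x)                 # Pearson's coefficient of skewness
Q <- quantile(x, c(0.25, 0.5, 0.75))                              # the three quartiles
quartile_skew <- ((Q[3] - Q[2]) - (Q[2] - Q[1])) / (Q[3] - Q[1])  # quartile measure of skewness
pearson_skew
quartile_skew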


Important Key Points:

  • What is Data?
  • Variables
  • Two types of variable:
    • an attribute variable
    • a measured variable
  • The concept of Statistical Distribution:
    • As applied to an attribute variable
    • As applied to a measured variable
  • Descriptive Statistics for a measured variable:
    • Measures of Centre
      • Mean, Median
    • Measures of Width:
      • Standard Deviation
      • Inter-Quartile Range

The descriptive statistics provide a numerical description of the key parameters of the distribution of a measured sample variable.


Investigating the relationship between variables

One of the key steps required of the Data Analyst is to investigate the relationship between variables. This requires a further classification of the variables contained within the data, as either a response variable or an explanatory variable.

A response variable is a variable that measures either directly or indirectly the objectives of the analysis.

An explanatory variable is a variable that may influence the response variable.

Bivariate Relationships

In general there are four different combinations of type of Response Variable and type of Explanatory Variable:

|  | Measured explanatory | Attribute explanatory |
|---|---|---|
| Measured response | Measured v Measured | Measured v Attribute |
| Attribute response | Attribute v Measured | Attribute v Attribute |

Any investigation of the connections between a response variable and an explanatory variable starts with examining the variables, and defining the response variable, or response variables, and the explanatory variables.

🤓💡: In large empirical investigations there may be a number of objectives and a number of response variables.

The method for investigating the connections between a response variable and an explanatory variable depends on the type of the variables. The methodology is different for each combination shown above, and applying an inappropriate method causes problems. 💡⚡️😩

DA Methodology

The first step is to have a clear idea of what is meant by a connection between the response variable and the explanatory variable. This will provide a framework for defining a Data-Analysis process to explore the connection between the two variables, and will utilise the ideas previously developed.

The next step is to use some simple sample descriptive statistics to have a first look at the nature of the link between the variables. This simple approach may allow the analyst to conclude that on the basis of the sample information there is strong evidence to support a link, or there is no evidence of a link, or that the simple approach is inconclusive and further more sophisticated data analysis is required. This step is called the Initial Data Analysis and is sometimes abbreviated to IDA.

If the Initial Data Analysis suggests that Further Data Analysis (FDA) is required, then this step seeks one of two conclusions:

  1. The sample evidence is consistent with there being no link between the response variable and the explanatory variable, or

  2. The sample evidence is consistent with there being a link between the response variable and the explanatory variable.

The outcome of the analysis is one of the two alternatives given above. If the outcome is that there is no evidence of a connection, then no further action is required by the analyst since the analysis is now complete.

If however the outcome of the analysis is that there is evidence of a connection, then the nature of the connection between the two variables needs to be described.

🤓💡 The Data-Analysis Methodology described above seeks to find the answer to the following key question:

On the basis of the sample data is there evidence of a connection between the response variable and the explanatory variable?

The outcome is one of two conclusions

  1. No evidence of a relationship

  2. Yes there is evidence of a relationship, in which case the link needs to be described.

This process can be represented diagrammatically as:

For each of the four data analysis situations given, the data analyst needs to know what constitutes the Initial Data Analysis (I.D.A.) and how to undertake and interpret the I.D.A. If Further Data Analysis is required the analyst needs to know how to undertake and interpret the Further Data Analysis.

Measured Vs Attribute (2 levels)

There is a relationship between a measured response and an attribute explanatory variable if the average value of the response is dependent on the level of the attribute explanatory variable.

Given a measured response and an attribute explanatory variable with two levels, “red” & “blue”: if the statistical distributions of the response variable for attribute level “red” and for attribute level “blue” are exactly the same, then the level of the attribute variable has no influence on the value of the response, and there is no relationship.

This can be illustrated as below:
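As a complement to the illustration, here is a minimal R sketch of the idea, using made-up data (not from this text): compare the distribution of the measured response at each level of the attribute variable.

response <- c(10, 12, 11, 13, 15, 16, 14, 17)                                  # measured response
level <- factor(c("red", "red", "red", "red", "blue", "blue", "blue", "blue")) # attribute variable

tapply(response, level, mean)  # do the group averages differ?
boxplot(response ~ level)      # compare the two distributions visually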

Measured Vs Measured

The first step is to have a clear idea of what is meant by a connection between a measured response variable and a measured explanatory variable. Imagine a population under study consisting of a very large number of population members, and on each population member two measurements are made, the value of \(Y\) the response variable and the value of \(X\) the explanatory variable. For the whole population a graph of \(Y\) against \(X\) could be plotted conceptually.

If the graph shows a perfect line, then there is quite clearly a link between \(Y\) and \(X\). If the value of \(X\) is known, the exact value of \(Y\) can be read off the graph. This is an unlikely scenario in the data-analysis context, because this kind of relationship is a deterministic relationship. Deterministic means that if the value of \(X\) is known then the value of Y can be precisely determined from the relationship between Y and \(X\). What is more likely to happen is that other variables may also have an influence on \(Y\).

If the nature of the link between Y and X is under investigation then this could be represented as:

\[Y = f(X) + \text{effect of all other variables}\]

The effect of all other variables is commonly abbreviated to \(e\), so consider the model: \[Y = f(X) + e\]

The influence on the response variable Y can be thought of as being made up of two components:

  1. the component of Y that is explained by changes in the value of X, [the part due to changes in \(X\) through \(f(X)\)]

  2. the component of Y that is explained by changes in the other factors. [the part not explained by changes in \(X\)]

Or in more abbreviated forms: the ‘Variation in Y Explained by changes in X’ or ‘Explained Variation’, and the ‘Variation in Y not explained by changes in X’ or the ‘Unexplained Variation’.

In conclusion, the Total Variation in Y is made up of the two components:

  • the Changes in Y Explained by changes in X and
  • the Changes in Y not explained by changes in X

Which may be written as: \[\text{The Total Variation in Y = Explained Variation + Unexplained Variation}\]

🤓💡 The discussion started with the following idea:

\[Y = f(X) + e\]

And to quantify the strength of the link, the influence on \(Y\) was broken down into two components: \[\text{The Total Variation in Y = Explained Variation + Unexplained Variation}\]

This presents two issues:

  1. Can a model of the link be made?
  2. Can The Total Variation in Y, Explained Variation and the Unexplained Variation be measured?

What do these quantities tell us?

Consider the ratio of the Explained Variation in Y to the Total Variation in Y. This ratio is always on the scale \(0\) to \(1\), but by convention is usually expressed as a percentage, so is regarded as on the scale \(0\) to \(100\%\). It is called \(R^2\) and the interpretation of this ratio is as follows:

\[R^2: 0\% \text{ (no link) <--------------- } 50\% \text{(Statistical Link) ---------------> }100\%\text{ (Perfect Link)}\]

The definition and interpretation of \(R^2\) is a very important tool in the data analyst’s tool kit for tracking connections between a measured response variable and a measured explanatory variable.

We can put those ideas into our DA Methodology frameworks as shown below.

🤓💡 Note that you will hardly ever be in a situation in which \(R^2\) is so close to zero that, on the basis of the sample evidence used in the IDA, you can conclude that there is no relationship between the two variables. If the \(R^2\) value is very small (for example around \(2\%\)), further testing through FDA is needed to decide whether the relationship is statistically significant based on the sample evidence.

Further Data Analysis

If the Initial Data Analysis is inconclusive then Further Data Analysis is required.

The ‘Further Data Analysis’ is a procedure that enables a decision to be made, based on the sample evidence, as to one of two outcomes:

  • There is no relationship
  • There is a relationship

These statistical procedures are called hypothesis tests, which essentially provide a decision rule for choosing between one of the two outcomes: “There is no relationship” or “There is a relationship” based on the sample evidence.

All hypothesis tests are carried out in four stages:

  • Stage 1: Specifying the hypotheses

  • Stage 2: Defining the test parameters and the decision rule

  • Stage 3: Examining the sample evidence

  • Stage 4: The conclusions

Statistical Models used in FDA

  • Measured Response v Attribute Explanatory Variable with exactly two levels:
    • t-test
  • Measured Response v Attribute Explanatory Variable with more than two levels:
    • One-Way ANOVA
  • Measured Response v Measured Explanatory Variable:
    • Simple Regression Model
  • Measured Response v several Measured Explanatory Variables:
    • Multifactor Regression Model
  • Attribute Response v Attribute Explanatory Variable:
    • Chi-Square Test of Independence

YOUR TURN 👇

Make sure you can answer the following questions:

  1. What are the underlying ideas that enable a relationship between two variables to be investigated?

  2. What is the purpose of summary statistics?

  3. What is the data analysis methodology for exploring the relationship between:

     a. a measured response variable and an attribute explanatory variable?

     b. a measured response variable and a measured explanatory variable?


Regression Modelling

Earlier we looked at some basic statistical concepts. This section examines how to investigate the nature of any relationship that may exist between a measured response variable and a measured explanatory variable.

The first step is to have a clear idea of what is meant by a connection between the response variable and the explanatory variable. The next step is to use some simple sample descriptive statistics to have a first look at the nature of the link between the response variable and the explanatory variable. This simple approach will lead to one of three conclusions, namely, on the basis of the sample information:

  1. there is very strong evidence to support a link
  2. there is absolutely no evidence of any link
  3. the sample evidence is inconclusive and further more sophisticated data analysis is required

This step is called the Initial Data Analysis or the IDA.

If the IDA suggests that Further Data Analysis is required, then this step seeks one of two conclusions:

  1. The sample evidence is consistent with there being no link between the response variable and the explanatory variable.
  2. The sample evidence is consistent with there being a link between the response variable and the explanatory variable.

As we have already seen in the previous sections, this process can be represented diagrammatically as:

The Data-Analysis Methodology seeks to find the answer to the following key question:

  • On the basis of the sample data, is there evidence of a connection/link/relationship between the response variable and the explanatory variable?

The final outcome is one of two conclusions:

  1. There is no evidence of a relationship, labelled as the ‘No’ outcome in the diagram above, in which case the analysis is finished.

  2. There is evidence of a relationship, labelled as the ‘Yes’ outcome in the diagram above, in which case the nature of the relationship needs to be described.

The first step is to have a clear idea of what is meant by a connection between a measured response variable and a measured explanatory variable. Imagine a population under study consisting of a very large number of population members, and on each population member two measurements are made, the value of \(Y\) the response variable and the value of \(X\) the explanatory variable. For the whole population a graph of \(Y\) against \(X\) could be plotted conceptually. If the graph looked as in the diagram below, then there is quite clearly a link between Y and X. If the value of \(X\) is known, the exact value of Y can be read off the graph. This is an unlikely scenario in the data-analysis context, because the relationship shown is a deterministic relationship. Deterministic means that if the value of \(X\) is known then the value of \(Y\) can be precisely determined from the relationship between \(Y\) and \(X\).


Scatter Plot

When analysing the relationship between two measured variables we start off by creating a scatter plot. A scatter plot is a graph with one axis for the explanatory variable, commonly known in regression modelling as the predictor and labelled \(X\), and one axis for the response variable, commonly known as the outcome variable and labelled \(Y\). Thus, each point on the graph represents a single \((X, Y)\) pair. The primary benefit is that the possible relationship between the two variables can be viewed and analysed at a glance, and often the nature of a relationship can be determined quickly and easily.

Let us consider a few scatter plots. The following graph represents a perfect linear relationship. All points lie exactly on a straight line. It is easy in this situation to determine the intercept and the slope, i.e. gradient and hence specify the exact mathematical link between the response variable \(Y\) and the explanatory variable \(X\).

-Graph 1

The relationship shown in Graph 2 shows clearly that as the value of \(X\) increases the value of \(Y\) increases, but not exactly along a straight line as in the previous scatter plot. This shows a statistical link: as the value of the explanatory variable \(X\) increases, the value of the response variable \(Y\) also tends to increase. An explanation for this is that the response \(Y\) may depend on a number of different variables, say \(X_1\), \(X_2\), \(X_3\), \(X_4\), \(X_5\), \(X_6\) etc., which could be written as:

\(Y = f(X_1, X_2, X_3, X_4, X_5, X_6, ...)\)

-Graph 2

If the nature of the link between \(Y\) and \(X\) is under investigation then this could be represented as:

\[Y = f(X) + \text{effect of all other variables}\]

{{% notice note %}} The effect of all other variables is commonly abbreviated to e. {{% /notice %}}

Graph 1 shows a link where the effect of all the other variables is nil; the response \(Y\) depends solely on the variable \(X\). Graph 2 shows a situation where \(Y\) depends on \(X\) but the other variables also have an influence.

Consider the model:

\[Y = f(X) + e\] Remember 😃, \(e\) is the effect of all other variables!

The influence on the response variable \(Y\) can be thought of as being made up of two components:

  1. the component of \(Y\) that is explained by changes in the value of \(X\), [the part due to changes in \(X\) through \(f(X)\)]
  2. the component of \(Y\) that is explained by changes in the other factors [the part not explained by changes in \(X\)]

Or in more abbreviated forms:

  1. the Variation in \(Y\) Explained by changes in \(X\) or Explained Variation and
  2. the Variation in \(Y\) not explained by changes in \(X\) or the Unexplained Variation

The Total Variation in \(Y\) is made up of two components:

  1. Changes in \(Y\) Explained by changes in \(X\) and the
  2. Changes in \(Y\) not explained by changes in \(X\)

Which may be written as:

\[\text{The Total Variation in Y} = \text{Explained Variation} + \text{Unexplained Variation}\]

In Graph 1 the Unexplained Variation is nil, since the value of \(Y\) is completely determined by the value of \(X\). In Graph 2 the Explained Variation is large relative to the Unexplained Variation, since the value of \(Y\) is very largely influenced by the value of \(X\).

Consider Graph 3. Here there is no discernible pattern, and the value of \(Y\) seems to be unrelated to the value of \(X\).

- Graph 3

If \(Y\) is not related to \(X\) the Explained Variation component is zero and all the changes in \(Y\) are due to the other variables, that is the Unexplained Variation.

Finally, consider Graphs 4 & 5 below:

-Graph 4

-Graph 5

Graph 4 shows a similar picture to Graph 2, the difference being that as the value of \(X\) increases the value of \(Y\) decreases. The value of \(Y\) is influenced by the value of \(X\), so the Explained Variation is high relative to the Unexplained Variation. Consider the last graph, Graph 5, which is a deterministic relationship. The value of \(Y\) is completely specified by the value of \(X\). Hence the Unexplained Variation is zero.

Graphs Summary:

|  | Graph 1 | Graph 2 | Graph 3 | Graph 4 | Graph 5 |
|---|---|---|---|---|---|
| Explained Variation in Y | All | High | Zero | High | All |
| Unexplained Variation in Y | Zero | Low | All | Low | Zero |

In regression the discussion started with the following idea: \[Y = f(X) + e\] And to quantify the strength of the link, the influence on \(Y\) was broken down into two components: \[\text{The Total Variation in Y} = \text{Explained Variation} + \text{Unexplained Variation}\]

This presents two issues:

  • A: Can a model of the link be made?
  • B: Can The Total Variation in Y, Explained Variation and the Unexplained Variation be measured?

{{% notice note %}} The simplest form of connection is a straight-line relationship and the question arises could a straight-line relationship be matched with the information contained in the graphs 1 -5? {{% /notice %}}

  • Clearly it is easy for Graph 1 and Graph 5: the intercept and the gradient can be obtained directly from the graph. The relationship can then be written as:

\[Y = a + bX\]

    where \(a\) is the intercept and \(b\) is the gradient.

  • Since the Explained Variation is the same as the Changes in Y, and the Unexplained Variation is zero, the precise evaluation of these quantities is not necessary.

Developing a Statistical Model

For the statistical relationships as shown in Graphs 2 & 4:

  1. Can the intercept and gradient be measured?

  2. Can the values of the three quantities The Total Variation in Y, Explained Variation and The Unexplained Variation be measured?

It is sufficient to work out any two since:

\[\text{The Total Variation in Y} = \text{Explained Variation} + \text{Unexplained Variation}\]

Fitting a line by eye is subjective. It is unlikely that any two analysts will draw exactly the same line, hence the intercept and gradient will be slightly different from one person to the next. What is needed is an agreed method that will provide an estimate of the intercept and the gradient.

Consider the simple numerical example below:

Suppose we would like to fit a straight-line relationship to the following data:

| \(X\) | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|
| \(Y\) | 7 | 8 | 12 | 13 | 14 | 18 |

The problem is to use this information to measure the intercept and the gradient for this data set.

A simple way to do this is to draw what is considered to be the line of best fit by judgement or guesswork! 😁😉

The intercept can be read off the graph as approximately \(5\), and the gradient can be simply measured: if \(X = 0\) then \(Y = 5\), and if \(X = 5\) then \(Y = 15\), so for a change in \(X\) of \(5\) units (from 0 to 5) \(Y\) changes by \(10\), from \(5\) to \(15\). The definition of the gradient is ‘the change in Y for a unit increase in X’, hence the gradient for this data set is \(10/5 = 2\).

The straight-line relationship obtained by this process is \[\hat{Y} = 5 + 2X\]
Note, \(\hat{Y}\) is the notation for the value of \(Y\) as predicted by the straight-line relationship.

We can add more information about the predicted values of \(Y\) by the “estimated” model to our table, so that we have

| X | Y | \(\hat{Y}\) | \(Y - \hat{Y}\) |
|---|---|---|---|
| 1 | 7 | 7.00 | 0.00 |
| 2 | 8 | 9.00 | -1.00 |
| 3 | 12 | 11.00 | 1.00 |
| 4 | 13 | 13.00 | 0.00 |
| 5 | 14 | 15.00 | -1.00 |
| 6 | 18 | 17.00 | 1.00 |

Looking at the information contained in the table above, the column headed \(\hat{Y}\) contains the predicted values of \(Y\) for the values of \(X\). For example, the first value of \(\hat{Y}\) is when \(X = 1\) and \(\hat{Y} = 5 + 2X\), so \(\hat{Y} = 5 + 2 \times 1 = 7\). The column headed \((Y - \hat{Y})\) is the difference between the actual value and the value predicted by the line. For example, when \(X = 1\), \(Y = 7\) and \(\hat{Y} = 7\), so the predicted value lies on the line, as can be seen in the graph. For the value \(X = 2\) the actual value lies \(1\) unit below the line, as can also be seen from the graph.

The column \((Y - \hat{Y})\) measures the disagreement between the actual data and the line, and a sensible strategy is to make this level of disagreement as small as possible. Referring to the table above, notice that sometimes the actual data value is below the line and sometimes it is above the line, so on average the value will be close to zero. In this particular example the \((Y - \hat{Y})\) values add up to zero (in a more conventional notation \(\Sigma(Y - \hat{Y}) = 0\)). The quantity \(\Sigma(Y - \hat{Y})\) is therefore not a satisfactory measure of disagreement, because there are a number of different lines with the property \(\Sigma(Y - \hat{Y}) = 0\).

A way of obtaining a satisfactory measure of disagreement is to square the individual \((Y - \hat{Y})\) values and add them up. i.e. obtain the quantity \(\Sigma(Y - \hat{Y})^2\). The result is always a positive number since the square of a negative number is positive. If this quantity is then chosen to be as small as possible then the level of disagreement between the actual data points and the fitted line is the least. This provides a criterion for the choice of the best line.

The quantity \(\Sigma(Y - \hat{Y})^2\) can be easily calculated and added to our table

| X | Y | \(\hat{Y}\) | \(Y - \hat{Y}\) | \((Y - \hat{Y})^2\) |
|---|---|---|---|---|
| 1 | 7 | 7.00 | 0.00 | 0 |
| 2 | 8 | 9.00 | -1.00 | 1 |
| 3 | 12 | 11.00 | 1.00 | 1 |
| 4 | 13 | 13.00 | 0.00 | 0 |
| 5 | 14 | 15.00 | -1.00 | 1 |
| 6 | 18 | 17.00 | 1.00 | 1 |

The quantity \(\Sigma(Y - \hat{Y})^2\) is a measure of the disagreement between the actual \(Y\) values and the values predicted by the line. If this value is chosen to be as small as possible then the disagreement between the actual Y values and the line is the smallest it could possibly be, hence the line is The line of Best Fit.

This procedure of finding the intercept and the gradient of a line that makes the quantity \(\Sigma(Y - \hat{Y})^2\) a minimum is called The Method of Least Squares.

The Method of Least Squares was developed by C. F. Gauss (1777 - 1855), a German mathematician; Gauss originated the ideas and a Russian mathematician, A. A. Markov (1856 - 1922), developed the method.

In R we use the lm( ) function to fit linear models as illustrated below.
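For instance, applying lm( ) to the small data set above (a sketch; the object name model_toy is simply a name chosen here) gives the least-squares estimates, which differ slightly from the eyeballed line \(\hat{Y} = 5 + 2X\):

X <- 1:6                      # explanatory variable from the example
Y <- c(7, 8, 12, 13, 14, 18)  # response variable from the example
model_toy <- lm(Y ~ X)        # least-squares fit of Y on X
coef(model_toy)               # intercept approx. 4.60, gradient approx. 2.11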

The final issue is to find out how to measure the three quantities:

  1. The Total Variation in \(Y\)
  2. The Explained Variation
  3. The Unexplained Variation

Taking these quantities one at a time they can be measured as follows:

The Unexplained Variation

This turns out to be very simple to measure. The quantity \(\Sigma(Y - \hat{Y})^2\) is a measure of the Unexplained Variation. If the line were a perfect fit to the data, the value predicted by the line and the actual value would be exactly the same and the value of the quantity \(\Sigma(Y - \hat{Y})^2\) would be zero. The differences \((Y - \hat{Y})\), known as the residuals, measure the disagreement between the actual \(Y\) values and the predicted values \(\hat{Y}\), and so measure the Unexplained Variation in \(Y\). For the least-squares line fitted above, the value of \(\Sigma(Y - \hat{Y})^2\) is \(3.77\).

The Total Variation in Y

This is related to the measures of variability (spread) introduced earlier in the course and in particular to the standard deviation (\(\sigma\)). To measure The Total Variation in Y requires a measure of spread.

The Total Variation in Y is defined to be the quantity: \(\Sigma(Y - \bar{Y})^2\) Where \(\bar{Y}\) is the average value of \(Y\) (\(\bar{Y} = \Sigma(Y)/n\)).

In our earlier example \(\bar{Y} = 12\), so we can expand the table (now using the least-squares fitted values \(\hat{Y}\) from lm( )) to include this calculation:

| X | Y | \(\hat{Y}\) | \(Y - \hat{Y}\) | \((Y - \hat{Y})^2\) | \((Y - \bar{Y})^2\) |
|---|---|---|---|---|---|
| 1 | 7 | 6.71 | 0.29 | 0.08 | 25.00 |
| 2 | 8 | 8.83 | -0.83 | 0.69 | 16.00 |
| 3 | 12 | 10.94 | 1.06 | 1.12 | 0.00 |
| 4 | 13 | 13.06 | -0.06 | 0.00 | 1.00 |
| 5 | 14 | 15.17 | -1.17 | 1.37 | 4.00 |
| 6 | 18 | 17.29 | 0.71 | 0.51 | 36.00 |

giving \(\Sigma(Y - \bar{Y})^2 = 82\).

The Explained Variation in Y

If the line were a perfect fit, then the \(Y\) values and the \(\hat{Y}\) values would be exactly the same, and the quantity \(\Sigma(\hat{Y} - \bar{Y})^2\) would measure The Total Variation in Y. If the line is not a perfect match to the actual \(Y\) values then this quantity measures The Explained Variation in Y.

| X | Y | \(\hat{Y}\) | \(Y - \hat{Y}\) | \((Y - \hat{Y})^2\) | \((Y - \bar{Y})^2\) | \((\hat{Y} - \bar{Y})^2\) |
|---|---|---|---|---|---|---|
| 1 | 7 | 6.71 | 0.29 | 0.08 | 25.00 | 27.94 |
| 2 | 8 | 8.83 | -0.83 | 0.69 | 16.00 | 10.06 |
| 3 | 12 | 10.94 | 1.06 | 1.12 | 0.00 | 1.12 |
| 4 | 13 | 13.06 | -0.06 | 0.00 | 1.00 | 1.12 |
| 5 | 14 | 15.17 | -1.17 | 1.37 | 4.00 | 10.06 |
| 6 | 18 | 17.29 | 0.71 | 0.51 | 36.00 | 27.94 |

Incorporating this calculation into the table above gives \(\Sigma(\hat{Y} - \bar{Y})^2 = 78.23\).
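These three quantities can be checked in R from the least-squares fit above (a sketch, reusing model_toy and Y from the earlier code):

Y_hat <- fitted(model_toy)  # least-squares predicted values
sum((Y - mean(Y))^2)        # Total Variation in Y: 82.00
sum((Y_hat - mean(Y))^2)    # Explained Variation: 78.23
sum((Y - Y_hat)^2)          # Unexplained Variation: 3.77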

Correlation does not imply causation

Once the intercept and slope have been estimated using the method of least squares, various indices are studied to determine the reliability of these estimates. One of the most popular of these reliability indices is the correlation coefficient. A sample correlation coefficient, more specifically known as the Pearson Product Moment correlation coefficient, denoted \(r\), has possible values between \(-1\) and \(+1\), as illustrated in the diagram below.

In fact, the correlation is a parameter of the bivariate normal distribution that is used to describe the association between two variables; it does not include a cause and effect statement. That is, one variable does not depend on the other, i.e. the variables are not labelled as dependent and independent. Rather, they are considered as two random variables that seem to vary together. Hence, it is important to recognise that correlation does not imply causation. In correlation analysis, both \(Y\) and \(X\) are assumed to be random variables, while in linear regression, \(Y\) is assumed to be a random variable and \(X\) is assumed to be a fixed variable.

The main characteristics of the correlation coefficient \(r\) are:

  • there are no units for the correlation coefficient
  • its value lies in the range \(-1 \leqslant r \leqslant 1\)
    • a value near -1 or +1 shows a strong linear association
    • a correlation coefficient of exactly -1 or +1 indicates a perfect linear association
    • a value near 0 suggests no linear relationship
  • the sign of the correlation coefficient indicates the direction of the association
    • a positive \(r\) is associated with an estimated positive slope
    • a negative \(r\) is associated with an estimated negative slope

BUT!!!

  • \(r\) is a measure of the strength of a linear association and should not be used when the relationship is non-linear; therefore \(r\) is NOT used to measure the strength of a curved relationship
  • \(r\) is sensitive to outliers
  • \(r\) does not make a distinction between the response variable and the explanatory variable. That is, the correlation of \(X\) with \(Y\) is the same as the correlation of \(Y\) with \(X\).
  • in simple linear regression, squaring the correlation coefficient, \(r^2\), results in the value of the coefficient of determination \(R^2\)

{{% notice note %}} The Spearman rank correlation coefficient is the nonparametric equivalent of the Pearson correlation coefficient. This statistic is computed by replacing the data values with their ranks and applying the Pearson correlation formula to the ranks of the data. Tied values are replaced with the average rank of the ties. Just as in the case of the Pearson coefficient, this one is also really a measure of association rather than correlation, since the ranks are unchanged by a monotonic transformation of the original data. For sample sizes greater than 10, the distribution of the Spearman rank correlation coefficient can be approximated by the distribution of the Pearson correlation coefficient. It is also worth knowing that the Spearman rank correlation coefficient uses weights when weights are specified. {{% /notice %}}
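In R both coefficients are available through the cor( ) function; a sketch using the small X and Y data set from earlier:

cor(X, Y)                       # Pearson product-moment correlation, approx. 0.977
cor(X, Y, method = "spearman")  # Spearman rank correlation
cor(X, Y)^2                     # approx. 0.954: the R^2 of the simple regression below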

The coefficient of Determination \(R^2\)

We realise that when fitting a regression model we are seeking to find out how much variance is explained, or is accounted for, by the explanatory variable \(X\) in an outcome variable \(Y\).

In the earlier example above the following has been calculated:

  1. The Total Variation in \(Y = 82.00\)
  2. The Explained Variation in \(Y = 78.23\)
  3. The Unexplained Variation in \(Y = 3.77\)

Notice that the relationship given below is satisfied:

\[\text{The Total Variation in Y = Explained Variation + Unexplained Variation}\]

\[82.00 = 78.23 + 3.77\]

What do these quantities tell us? They are difficult to interpret because they are expressed in the units of the problem.

Consider the following ratio:

\({\text{The Explained Variation in Y} \over \text{The Total Variation in Y}} = {78.23 \over 82.00} = 0.954\)

This is saying that \(0.954\) or \(95.4\%\) of the changes in \(Y\) are explained by changes in \(X\). This is a useful and useable measure of the effectiveness of the match between the actual \(Y\) values and the predicted \(Y\) values.

Reviewing the five scatter plots (Graphs 1 to 5), it can easily be seen that if the line is a perfect fit to the actual \(Y\) values, as in Graphs 1 & 5, then this ratio will have the value \(1\) or \(100\%\).

If there is no link between \(Y\) and \(X\) then the Explained Variation is zero, hence the ratio will be \(0\) or \(0\%\). An example of this is shown in Graph 3.

The remaining graphs: Graphs 2 & 4 show a statistical relationship hence this ratio will lie between \(0\) & \(1\). The closer the ratio is to zero the less strong the link is, whilst the closer the ratio is to \(1\) the stronger the connection is between \(X\) & \(Y\).

The Ratio:

\[R^2 = {\text{The Explained Variation in Y} \over \text{The Total Variation in Y}}\]

is called the Coefficient of Determination, and usually labelled as \(R^2\), and may be defined as the proportion of the changes in \(Y\) explained by changes in \(X\).

This ratio is always on the scale \(0\) to \(1\), but by convention is usually expressed as a percentage, so is regarded as on the scale \(0\) to \(100\%\). The interpretation of this ratio is as follows:

The theoretical minimum \(R^2\) is \(0\). However, since linear regression is based on the best possible fit, \(R^2\) will always be greater than zero, even when the predictor and outcome variables bear no relationship to one another. The definition and interpretation of \(R^2\) is a very important tool in the data analyst’s tool kit for tracking connections between a measured response variable and a measured explanatory variable.

Using Sample Data to Track a connection

When working with sample data to investigate any relationships that may exist between a measured response variable and a measured explanatory variable, the information contained within the sample is imperfect, so it has to be interpreted in the light of sampling error. This is particularly important when interpreting the value of \(R^2\) calculated from sample data. The sample \(R^2\) value gives a measure of the strength of the connection, and this can sometimes be difficult to interpret, particularly if the sample size is small.

The data analysis methodology as set out earlier is a procedure that enables you to make judgements from sample data. The methodology requires you to know what specific procedures make up the Initial Data Analysis (IDA) and how to interpret the results of the IDA to obtain one of three outcomes:

  1. No evidence of a connection between the response and the explanatory variable
  2. Clear evidence of a connection between the response and the explanatory variable
  3. The IDA is inconclusive and further analysis is required

If Further Data Analysis, FDA, is required then you need to know what constitutes this further analysis and how to interpret it.

Finally, if a connection between the response variable and the explanatory variable is detected, then the nature of the connection needs to be described.

To demonstrate the data analysis methodology we’ll go back to the Share Price Study case study.

Share Price Study Data

A business analyst is studying share prices of companies from three different business sectors. As part of the study a random sample (n=60) of companies was selected and the following data was collected:

| Variable | Description |
|---|---|
| Share_Price | The market value of a company share (£) |
| Profit | Company annual profit (£1,000,000) |
| RD | Company annual spending on research and development (£1,000) |
| Turnover | Company annual total revenue (£1,000,000) |
| Competition | A variable coded: 0 if the company operates in a very competitive market; 1 if the company has a great deal of monopoly power |
| Sector | A variable coded: 1 if the company operates in the IT business sector; 2 if the company operates in the Finance business sector; 3 if the company operates in the Pharmaceutical business sector |
| Type | A variable coded: 0 if the company does business mainly in Europe; 1 if the company trades globally |

Let’s start off by investigating the relationship between the variables Share_Price and Profit.

We will adopt the following notation

  • \(Y: Share\_Price\)
  • \(X: Profit\)

\(\text{Model to be estimated: } Y = b_0 + b_1X + e\)

We start by accessing data:
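One way this step could look in R is sketched below; the file name SHARE_PRICE.csv and the factor-conversion step are assumptions, not taken from the original text, but they would produce output of the kind shown:

library(dplyr)                                  # provides glimpse()
mydata <- read.csv("SHARE_PRICE.csv")           # hypothetical file name
glimpse(mydata)                                 # structure of the raw data
mydata[, 5:7] <- lapply(mydata[, 5:7], factor)  # treat the coded variables as factors
summary(mydata)                                 # descriptive statistics
glimpse(mydata)                                 # structure after conversion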

## Rows: 60
## Columns: 7
## $ Share_Price <int> 880, 862, 850, 840, 838, 825, 808, 806, 801, 799, 783, 777…
## $ Profit      <dbl> 161.3, 170.5, 140.7, 115.7, 107.9, 138.8, 102.0, 102.7, 10…
## $ RD          <dbl> 152.6, 118.3, 110.6, 87.2, 75.1, 116.2, 91.3, 100.4, 113.5…
## $ Turnover    <dbl> 320.9, 306.3, 279.5, 193.2, 182.4, 265.2, 212.0, 170.3, 23…
## $ Competition <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 1…
## $ Sector      <int> 3, 3, 3, 3, 3, 3, 3, 3, 3, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 2…
## $ Type        <int> 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1…
##   Share_Price        Profit             RD            Turnover     Competition
##  Min.   :101.0   Min.   :  2.90   Min.   : 39.20   Min.   : 30.3   0:30       
##  1st Qu.:501.2   1st Qu.: 59.73   1st Qu.: 75.78   1st Qu.:112.3   1:30       
##  Median :598.5   Median : 88.85   Median : 90.60   Median :173.5              
##  Mean   :602.8   Mean   : 84.76   Mean   : 89.64   Mean   :170.2              
##  3rd Qu.:739.8   3rd Qu.:106.62   3rd Qu.:104.15   3rd Qu.:216.6              
##  Max.   :880.0   Max.   :170.50   Max.   :152.60   Max.   :323.3              
##  Sector Type  
##  1:20   0:30  
##  2:20   1:30  
##  3:20         
##               
##               
## 
## Rows: 60
## Columns: 7
## $ Share_Price <int> 880, 862, 850, 840, 838, 825, 808, 806, 801, 799, 783, 777…
## $ Profit      <dbl> 161.3, 170.5, 140.7, 115.7, 107.9, 138.8, 102.0, 102.7, 10…
## $ RD          <dbl> 152.6, 118.3, 110.6, 87.2, 75.1, 116.2, 91.3, 100.4, 113.5…
## $ Turnover    <dbl> 320.9, 306.3, 279.5, 193.2, 182.4, 265.2, 212.0, 170.3, 23…
## $ Competition <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 1…
## $ Sector      <fct> 3, 3, 3, 3, 3, 3, 3, 3, 3, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 2…
## $ Type        <fct> 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1…

The Initial Data Analysis for investigating a connection between a measured response and a measured explanatory variable requires obtaining a graph of the response against the explanatory variable and to calculate the value of \(R^2\) from the sample data.

The IDA has three possible outcomes:

  1. No evidence of a connection between the response and the explanatory variable.
  2. Clear evidence of a connection between the response and the explanatory variable.
  3. The IDA is inconclusive and further analysis is required.

The simplest form of connection is a straight-line relationship and the question arises could a straight-line relationship be matched with the information contained in the Graphs 1 to 5 discussed earlier?

As part of an informal investigation of the possible relationship between Share_Price and Profit, first we will use R to obtain a scatter plot with the line of the best fit. Rather than every time referring to the name of the data set containing the variables of interest, we will attach our data and refer to the variables directly using only their names (see help for the attach( ) function).

## [1] "Share_Price" "Profit"      "RD"          "Turnover"    "Competition"
## [6] "Sector"      "Type"

Let’s go through the code and the functions we have used to produce this graph.

  • The plot( ) function gives a scatterplot of two numerical variables. The first variable listed will be plotted on the horizontal axis and the second on the vertical axis, i.e. you ‘feed’ in as arguments first the variable representing \(X\) and then the variable representing \(Y\): plot(x, y). Considering that we are investigating a relationship between \(X\) and \(Y\) in the form of a regression line \(Y = b_0 + b_1X + e\), we can specify this in R as the formula Y ~ X, which can be read as “\(Y\) is modelled as a function of \(X\)”. This means that in R’s plot( ) function a formula interface can also be used, in which case the response variable \(Y\) comes before the tilde (\(\sim\)) and the \(X\) variable, which will be plotted on the horizontal axis, comes after it.

  • Next, we fit a line of best fit through our scatterplot using the abline( ) function for the linear model \(Y = b_0 + b_1X\), to see how close the points are to the fitted line, as sketched below. The basic R function for fitting linear models by ordinary least squares is lm( ), which stands for linear model. All we need to feed R when using the lm( ) function is the formula Y ~ X.
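A sketch of the two steps just described (assuming the data frame is called mydata and the colour choice is ours):

attach(mydata)                                 # refer to the variables by name
plot(Share_Price ~ Profit)                     # scatter plot via the formula interface
abline(lm(Share_Price ~ Profit), col = "red")  # superimpose the least-squares line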

The scatterplot shows a fairly strong and reasonably linear relationship between the two variables. In other words, the fit is reasonably good, but it is not perfect and we could do with some more information about it. Let us see what the lm( ) function provides as part of the output.
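A sketch of the fitting step; the object name model matches the code fragments used later in this text:

model <- lm(Share_Price ~ Profit)  # fit the simple regression model
model                              # print the estimated coefficients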

## 
## Call:
## lm(formula = Share_Price ~ Profit)
## 
## Coefficients:
## (Intercept)       Profit  
##     258.924        4.057

First, R displays the fitted model, after which it shows the estimates of the two parameters \(b_0\) & \(b_1\), which it refers to as coefficients.

  • the intercept, \(b_0 = 258.924\)
  • the slope, \(b_1= 4.057\)

We must now take this estimated model and ask a series of questions to decide whether our estimated model is good or bad: that is, we have to subject the fitted model to a set of tests designed to check the validity of the model, which is in effect a test of your viewpoint/theory as a modeller.

Examining the scatterplot, we can see that not all of the points lie on the fitted line \(Share\_Price = 258.924 + 4.057Profit\). To explain how strong the relationship is between the two variables we need to obtain the coefficient of determination known as the \(R^2\) parameter.

The coefficient of determination, \(R^2\), is a single number that measures the extent to which the explanatory variable can explain, or account for, the variability in \(Y\) – that is, how well does the explanatory variable explain the variability in, or behaviour of, the phenomenon we are trying to understand.

Earlier, we saw that \(R^2\) is constrained to lie in the following range:

\[0\% \text{ <---------- } R^2 \text{ ----------> } 100\%\]

We realised that the closer \(R^2\) is to \(100\%\) then the better the model is, and conversely, a value of \(R^2\) close to \(0\%\) implies a weak/poor model.

To obtain all of the information about the fitted model we can use the summary( ) function as in the following:
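A sketch of the call that produces the output below:

summary(model)  # full numerical summary of the fitted model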

## 
## Call:
## lm(formula = Share_Price ~ Profit)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -175.513  -74.826    0.107   67.824  141.358 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 258.9243    28.7465   9.007 1.29e-12 ***
## Profit        4.0567     0.3102  13.077  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 89.98 on 58 degrees of freedom
## Multiple R-squared:  0.7467, Adjusted R-squared:  0.7424 
## F-statistic:   171 on 1 and 58 DF,  p-value: < 2.2e-16

So, what have we got here? 🤔

We will discuss some of the key components of R’s summary( ) function for linear regression models.

  • Call: shows what function and variables were used to create the model.
  • Residuals: the differences between what the model predicted and the actual values of Y. Try to see if you can reproduce this ‘residuals’ section yourself using: summary(Y - model$fitted.values). 🤓
  • Coefficients:
    • the estimated parameters for the intercept and slope: \(b_0\) and \(b_1\)
    • Std. Error: the Residual Standard Error divided by the square root of the sum of the squared deviations of that particular explanatory variable from its mean
    • t value: the Estimate divided by the Std. Error
    • Pr(>|t|): the p-value, i.e. the probability of obtaining a t value at least this extreme, for the given degrees of freedom, if the true coefficient were zero

{{% notice note %}} Note that in the first section we can find statistics relevant to the estimation of the model’s parameters. In the second part of the output we can find statistics related to the overall goodness of the fitted model. {{% /notice %}}

The summary(lm( )) output produces the standard deviation of the error, the Residual Standard Error. We know that the standard deviation is the square root of the variance; the Residual Standard Error is very similar, the only difference being that instead of dividing by \(n - 1\), you divide by \(n - (1 + k)\), where \(k\) is the number of explanatory variables used in the model. See if you can use and adjust the code below to calculate this statistic. 🤓

# Residual Standard Error
k <- length(model$coefficients) - 1 #Subtract one to ignore intercept
SSE <- sum(model$residuals^2)
n <- length(model$residuals)
sqrt(SSE / (n-(1+k)) ) # Residual Standard Error

Next is the coefficient of determination, which helps us determine how well the model fits the data. We have already seen that \(R^2\) compares the residual error with the Total Variation in \(Y\): \(R^2 = 1 - \text{Unexplained Variation}/\text{Total Variation}\). The bigger the residual error, the smaller \(R^2\) will be.
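This can be checked directly from the definition (a sketch, reusing the fitted object model and the attached variables):

1 - sum(model$residuals^2) / sum((Share_Price - mean(Share_Price))^2)
# should agree with the Multiple R-squared value of 0.7467 reported above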

For the time being we will skip \(R^2_{adjusted}\) and point out that it is used for models with multiple variables, to which you will be introduced in the next section.

Lastly, the F-Statistic is the second “test” that the summary function produces for lm models. The F-Statistic is a “global” test that checks whether at least one of the coefficients is nonzero, i.e. when dealing with a simple regression model it checks whether the model is worthy of further investigation and interpretation.

Model Interpretation

To go back to our model interpretation. Earlier, after examining the scatter plot, we said that there is a clear link between Share_Price and Profit. This is confirmed by the value of \(R^2 = 74.67\%\). The interpretation of \(R^2\) suggests that \(74.67\%\) of the changes in Share_Price are explained by changes in Profit. Alternatively, \(25.33\%\) of the changes in Share_Price are due to other variables.

For this example the outcome of the IDA is that there is clear evidence of a link between Share_Price and Profit, the nature of the influence being that as Profit increases Share_Price also increases. It only remains to describe the connection, discuss how effective the model is, and ask whether it can be used to predict the Share_Price value from the value of Profit. 🤔

Further Data Analysis

The adequacy of \(R^2\) can be judged both informally and formally using hypothesis testing. Like all hypothesis tests, this test is carried out in four stages:

  1. Specify the hypotheses
  2. Define the test parameters and decision rule
  3. Examine the sample evidence
  4. Conclusions

Stage 1: Specify the hypotheses (\(H_0\) & \(H_1\)).

The Coefficient of Determination \(R^2\) is a useful quantity. By definition if \(R^2 = 0\) then there is no connection between the response variable and the explanatory variable. Conversely if the value of \(R^2\) is greater than zero there must be a connection. This enables the formal hypotheses to be defined as:

  • \(H_0 : R^2 = 0\) (There is no relationship between the response variable and the explanatory variable.)
  • \(H_1 : R^2 > 0\) (There is a relationship between the response and the explanatory variable.)

Stage 2: Defining the test parameters and the decision rule.

The decision rule is based on the F statistic. The F distribution has a shape as shown below:

The value of \(F_{crit}\) is the value that divides the distribution with \(95\%\) of the area to the left of the \(F_{crit}\) ordinate, and \(5\%\) to the right. The subscript in \(F_{crit}\) signifies that it is a critical value obtained from statistical tables (or software), whereas \(F_{calc}\) will denote the value calculated from the sample data.

The decision rule is:

  • If the value of \(F_{calc}\) from the sample data is larger than \(F_{crit}\), then the sample evidence favours the decision that there is a connection between the response variable and the explanatory variable. (i.e. favours \(H_1\).)

  • If the value of \(F_{calc}\) is smaller than \(F_{crit}\) then the sample evidence is consistent with no connection between the response variable and the explanatory variable. (i.e. favours \(H_0\))

The decision rule can be summarised as:

If \(F_{calc} < F_{crit}\) then favour \(H_0\), whilst if \(F_{calc} > F_{crit}\) then favour \(H_1\). This decision rule can be represented graphically as below:

The decision rule may be written as:

  • If \(F_{calc} < F_{crit}\) then accept \(H_0\)
  • If \(F_{calc} > F_{crit}\) then accept \(H_1\)

The value of \(F_{crit}\) depends on the amount of data in the sample. The Degrees of Freedom (usually abbreviated to \(df\)) express this dependence on the sample size. For any given set of sample data, having obtained the \(df\) the specific value of \(F_{crit}\) can be obtained from Statistical Tables of the F distribution, or simply by using the qf(p, df1, df2) function in R.

Stage 3: Examining the sample evidence

For the investigation of Share_Price vs Profit we plot the scatterplot and obtain the summary of the fitted model \(Share\_Price = b_0 + b_1Profit\). The \(R^2\) value can be extracted as sketched below:
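A sketch of extracting the \(R^2\) value from the fitted object:

summary(model)$r.squared  # coefficient of determination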

## [1] 0.7467388

This is a good model, explaining almost \(3/4\) of the overall variability in the response variable \(Y\). Hence, based on the sample evidence, the variable Profit can explain \(74.67\%\) of the overall variation in the variable Share_Price.

{{% notice note %}} Note that we obtain the full output of the summary( ) function regardless of whether FDA is needed. If we conclude that there is a clear relationship, we need the estimates of the parameters to describe it, and if we need to investigate the significance of the relationship further, we need the relevant statistics for conducting the FDA. {{% /notice %}}

In the cases where \(R^2 \approx 0\) we will carry out a hypothesis test for which we need \(F_{calc}\) from the summary( ) function.

Let us assume that the \(R^2\) for the given example Share_Price vs Profit is very small and close to zero. In that case, we cannot simply reject it as a statistically insignificant relationship, as \(R^2\) is not zero, but neither can we with confidence accept it as a statistically valid relationship based on the sample evidence we use.

This step requires the calculation of the \(F_{calc}\) value from the sample data, and the determining of the degrees of freedom.

## 
## Call:
## lm(formula = Share_Price ~ Profit)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -175.513  -74.826    0.107   67.824  141.358 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 258.9243    28.7465   9.007 1.29e-12 ***
## Profit        4.0567     0.3102  13.077  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 89.98 on 58 degrees of freedom
## Multiple R-squared:  0.7467, Adjusted R-squared:  0.7424 
## F-statistic:   171 on 1 and 58 DF,  p-value: < 2.2e-16

Stage 4: Conclusion

The F statistic evaluating the significance of the full model is \(F_{calc} = 171\) with \(df_1 = 1\) and \(df_2 = 58\).
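The corresponding critical value can be obtained in R with the qf( ) function (a sketch):

qf(0.95, df1 = 1, df2 = 58)  # 95% point of the F distribution with df 1 and 58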

## [1] 4.006873

\(F_{crit} = 4.01\), and using the decision rule given, we get that

\[F_{calc} = 171 > F_{crit} = 4.01 \implies \text{accept } H_1\]

i.e. this model is statistically significant and we need to describe it.

Describe the Relationship

If the resulting outcome from either the Initial Data Analysis (IDA) or the Further Data Analysis (FDA) is that there is a connection between the response variable and the explanatory variable then the last step is to describe this connection.

For the data analysis situation Measured v Measured this is simply stating the line of the best fit. This is a description of the connection between the two variables. Additionally the \(R^2\) value should be quoted as this gives a measure of how well the data and the line of best fit match.

The \(R^2\) value can be interpreted as a measure of the quality of predictions made from the line of best fit according to the rule of thumb:

In our example Share_Price vs Profit we had the following estimations:

  • \(b_0 = 258.9243\)
  • \(b_1 = 4.0567\)

\[Share\_Price = 258.9243 + 4.0567 Profit\]

with \(R^2 = 74.67\%\).

Making predictions

Can the regression line be used to make predictions about the Share_Price for a company with a given value of the Profit?

Suppose we have to make a prediction of the \(Share\_Price\) for a company with \(Profit = 137.2\).

The predictions are calculated from the model as follows:

\[Share\_Price = 258.9243 + 4.0567 \times 137.2 = 815.5035\] Since the \(R^2\) value is nearly \(75\%\), any predictions about Share_Price made from this model are likely to be of good quality, but there may be issues with predictions for Profit values outside the data range used to fit the model. The Profit values in the given data set range from about \(3\) to \(170\), as the summary below confirms. Over this range any prediction is likely to be of good quality, since the information in the data reflects experience within this range, and the \(R^2\) value is nearly \(75\%\).

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2.90   59.73   88.85   84.76  106.62  170.50
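The same prediction can be obtained directly with the predict( ) function (a sketch, reusing the fitted object model from above):

predict(model, newdata = data.frame(Profit = 137.2))  # approx. 815.5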

The Share_Price for a company with a Profit value that lies outside the range of experience is not likely to be very reliable, or of much value. These problems are caused by the fact that the information available about the link between Share_Price and Profit is only valid over the range of values that Profit takes in the data. There is no information about the nature of the connection outside this range, and any predictions made there must be treated with caution.

In general terms when interpreting a regression model the intercept is of little value, since it is generally out of the range of the data. The gradient term is the more important and the formal definition of the gradient provides the interpretation of this information.

The gradient is defined to be the change in \(Y\) for a unit increase in \(X\). For the model developed the gradient is \(4.06\). This suggests that for every additional unit increase in Profit (i.e. £1,000,000 of annual profit), Share_Price increases by \(£4.06\).

Avoid extrapolation! Refrain from predicting/interpreting the regression line for X-values outside the range of X in the data set!

YOUR TURN 👇

  1. Investigate the nature of the relationship in Share Price Study data for Share_Price vs RD and Share_Price vs Turnover.

  2. Download The Supermarket data set available at https://github.com/TanjaKec/mydata using the following link https://tanjakec.github.io/mydata/SUPERM.csv.

Create a report in which you will:

  1. Give a brief explanation of the following statistical terms:
     a. Response Variable
     b. Explanatory Variable
     c. Measured Variable
     d. Attribute Variable
  2. Provide a brief answer for each of the following questions:
     a. What is ‘Data Analysis Methodology’, and why is this needed when working with sample data?
     b. What are the statistical concepts used to investigate the relationship between a measured response variable and an attribute explanatory variable?
     c. What are the statistical concepts used to investigate the relationship between a measured response variable and a measured explanatory variable?
  3. Undertake the Data Analysis for the Supermarket Data Set to investigate the relationships between the response variable and the set of explanatory variables that you regard as worth attention. Present your point of view about the nature of the relationships and give a complete explanation, within the data analysis methodology, of this analysis.

You are expected to bring your report to the next class.


Creative Commons Attribution-ShareAlike 4.0 International License.