Data Analytics Module
Lecturer: Hans van der Zwan
Handout 06
Topic: association analysis

Literature
Rumsey D. J. (2010). Statistical Essentials for Dummies. Hoboken: Wiley Publishing.
Ismay C. & Kim A. Y. (2019). ModernDive. Statistical Inference for Data Science. https://moderndive.com.

Recommended literature

Preparation class
See module description

6 Association analysis

In many case data analysis is about analyzing association between variables: measuring the strength of a relationship, testing if the relationship is significant (or can be attributed to chance because the relationship is measured using a random sample), describing the relationship with a mathematical equation.
The objective of this kind of analysis can be predicting or estimating an output based on one or more inputs or just to examine the relationships between different variables and the structure of the data. In this lesson, the emphasis is on the former.

When it is about predicting and estimating there are one response (dependent) variable (the Y-variable) and one or more explanatory (independent) variables (X variables). Problems with a categorical response variable are referred to as classification problems, while those involving a numeric response as regression problems.

6.1 Relationships between two categorical variables

Common ways to examine relationships between two categorical variables:

Graphical: clustered bar chart; stacked bar chart
Descriptive statistics: cross tables
Metric to measure the strength of the relation:
- phi coefficent, in case of two binary variables
- Cramer’s V
  see: http://groups.chass.utoronto.ca/pol242/Labs/LM-3A/LM-3A_content.htm and use http://vassarstats.net menu ‘frequency data’ for calculations
Hypotheses testing:
- tests on difference between proportions
- chi-square tests a test to test if two categorical variables are independent

Example: PPD LONDON, association between OLD/NEW and LEASEHOLD/FREEHOLD
Research sub question: is there a dependency between the variables OLD/NEW and LEASEHOLD/FREEHOLD for properties in London?
The question is answered based on the properties sold in January 2019.

Table 1
Houses sold in London in January 2019,
divided by TYPE and FREEHOLD/LEASHOLD

TYPE	F	L
DETACHED	35	2
FLAT	7	775
SEMI-DETACHED	121	4
TERRACED	420	22

Note. As can be seen in this table the distribution over the Freehold and Leashold categories differ for the different types. The question is, if the association between the two variables is significant.
The strength of the relationship can be measured by Cramer’s V, this metric has a value of 0.949 in this case. This means there is a very strong relationship between TYPE and FREEHOLD/LEASHOLD; if the type is known good predictions can be made for the FREEHOLD/LEASEHOLD category.

Example: crime incidents in 2017 in Washington, D.C.
The file Crime_Incidents_in_2017.csv (source: http://us-city.census.okfn.org/) contains information about crimes in Washington D.C. in 2017.
To analyse if there is a relationship between the variables METHOD, a categorical variable with free unique values (GUN, KNIFE and OTHERS), and SHIFT a categorical variable with three unique values (DAY, EVENING, MIDNIGHT) a contingency table has been created.

Figure 1. Crimes in Washington D.C. in 2017.

Table 2
Crimes in Washington D.C. in 2017

METHOD	DAY	EVENING	MIDNIGHT
GUN	266	585	734
KNIFE	188	306	330

In this example Cramer’s V equals 0.08. The relationship between the two variables is very weak, although it is significant, chisquare(2) = 15.290, p < .001.

6.2 Relationship between categorical and numeric variable

Common ways to examine relationships between two categorical variables:

Graphical: side-by-side boxplots, side-by-side histograms, multiple density curves
Tabulation: five number summary/ descriptive statistis per category in one table
Hypotheses testing:
- t test on difference between means
- Wilcoxon ranksumtest (also called Mann Whitney U-test) (also applicable in the case of small samples)

Example: comparison airpollution between Amsterdam and Rotterdam
A comparison of the difference between airpollution caused by PM10 in Amsterdam, location Einsteinweg and Rotterdam, location Statenweg. The comparison is based on the figures from 2018 as reported on the RIVM website.

Figure 2. Daily average PM10 levels in 2018 on a location in Amsterdam and a location in Rotterdam. Overall the air quality in Rotterdam is worse than in Amsterdam.

Table 3
Summary statistics daily PM10 levels

CITY	LOCATION	COUNT	MINIMUM	Q1	MEDIAN	Q3	MAXIMUM	AVERAGE	SD
Amsterdam	Amsterdam-Einsteinweg	363	5.087	15.783	19.688	26.910	69.146	22.450	10.050
Rotterdam	Rotterdam-Statenweg	365	6.800	17.258	21.721	29.312	81.204	24.320	10.550

A paired t-test is used to investigate whether the daily average values in Amsterdam-Einsteinweg differ from these in Rotteram-Statenweg. Based on this test it can be concluded that the difference between the daily values in Amsterdam (M = 22.5, SD = 10.0) and Rotterdam (M = 24.3, SD = 10.5) is significant, t(362) = -8.847, p < .001.

6.3 Relationship between two numerical variables

Common ways to examine relationships between two numerical variables:

Graphs per variable in order to verify if there are any outliers (boxplots / histograms).
Table: summary statistics per variable; verify if there are missing data.
Graphical: Scatterplot to graph the relationship between the two variables. Verify if the graph supports that there exists a linear relationship and examine if there are outliers.
Measure of linear association: correlation coefficient.
Notice: statistical association is not the same as a causal relation!!
Describing the relationship using a mathematical model: linear regression analysis.

Example: Dutch cars

The RDWregr.xlsx file contains data of a number of vehicles including the mass and the selling price.¹ It may be assumed that there is a relationship between mass and selling price.
1. Generate a table with the most important statistics concerning the two variables.
2. Create boxplots.
3. Create a scatterplot; examine the outliers; does the graph support the assumption about a relationship between mass and selling price.
4. The strength of the relationship can be measured with the correlation coefficient r (see below); what’s the value of r (use Excel function correl() )?
5. The relationship can be described by a mathematical equation; right click on the dots in the graph and choose the option to plot the best fitting line and display the equation on chart.

6.3.1 Correlation coefficient

The correlation coefficient r is a measure for the strength of a linear relationship between two variables. More about how to interpret a correlation coefficient can be found on this website .

6.3.1.1 Significant correlation

Even when there is no correlation between two variables the correlation coefficient calculated based on a sample will not be 0. That’s why it is quite common to use a significance test to test whether the r value differs significantly from 0. More about this test can be found here.

6.3.2 Least square regression

Video.
In Excel, the equation of the OLS (Ordinary Least Squares) regression line can be found in different ways:
- Plotting the regression line and its equation in the scatterplot.
- Using menu choice: data/data analysis/regression.
- Using formulas to calculate the intercept (formula: intercept) and the slope (formula: slope) of the regression line.

6.3.2.1 MS Excel regression output

Figure 3 shows the MS Excel regression output for the Dutch car examples (see abve).

Figure 3. Regression output created with MS Excel, menu: data/Data Analysis/Regression.

The regression model in the example output in figure 4 can be described by the equation:
CATALOG PRICE = -70,849 + 81.30 x MASS, where
MASS is the mass of the car in kg, and
CATALOG PRICE is the catalog price in euro.


Call:
lm(formula = Catalog_price ~ Mass, data = rdw)

Residuals:
   Min     1Q Median     3Q    Max 
-47701  -4551   -765   4342  50948 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -70849.017   4167.561  -17.00   <2e-16 ***
Mass            81.305      2.971   27.36   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 9246 on 205 degrees of freedom
Multiple R-squared:  0.7851,    Adjusted R-squared:  0.784 
F-statistic: 748.7 on 1 and 205 DF,  p-value: < 2.2e-16

Figure 4. Regression output RDW example, using R.

6.3.2.2 Assessing the regression model

After generating a regression model it is important to assess the model. This is done to judge whether the model fits the data and, in case of comparing models, which model fits the data best.

Coefficient of determination R²
In case of simple linear regression, i.e. with just one X-variable, the R2 value is the correlation coefficient squared. In a multiple regression setting it’s more complicated.

In general, the following equation holds: R² = \(\frac{variation\ of\ the\ model\ Yvalues\ around\ the\ mean\ Yvalue}{variation\ of\ the\ Yvalues\ around\ the\ mean\ Yvalue}\) = \(\frac{\Sigma(\widehat{Y} - \bar{Y})^{2}} {\Sigma(Y - \bar{Y})^{2}}\)

The denominator of this expression, \(\Sigma(Y - \bar{Y})^{2}\), measures the total variation in the Y-values around the mean Y-value, while the numerator, \(\Sigma(\widehat{Y} - \bar{Y})^{2}\), measures the variation of the model Y-values (the \(\widehat{Y}\)-values) around the mean Y-value. The latter is the variation in the Y-values that is explained by the regression model.
So R² measures the proportion of the variation in the Y-values that is explained by the regression model. In all cases: 0 <= R² <= 1.

The standard error of the estimate s_e
The standard error of the estimate is a metric for the spread around the regression line.
The value can be used to estimate the so-called margin of error if the model is used to predict a Y-value given the X-value(s). As a rule of thumb, the margin of error that should be taken in account when predicting a Y-value is 2xs_e.
So the smaller the s_e value the better the model can predict a Y-value.
To evaluate the value of s_e, it is compared with \(\bar{Y}\), commonly by dividing s_e by \(\bar{Y}\).

P-values of the coefficient(s)
The P-values of the model coefficients are used to determine whether the different X-variables are significant in the model. Most common interpretation: a P-value less than .05 indicates that the X-variable is significant. The test for which this P-value is calculated is:
H₀: \(\beta\) = 0
H_A: \(\beta\) <> 0, where \(\beta\) is the coefficient of the X-variable in question.
Not rejecting H₀ means the coefficient of the X-variable doesn’t differ significantly from 0, in other words Y doesn’t depend significantly on X.

6.3.3 Assessing the model in the RDW example

R² = .785, which means that 78.5% of the variation in catalog prices is explained by the variation in masses.
s_e = 9246; if the model is used to estimate the catalog price, based on the mass of a car, a margin of error of 18,492 (2 x s_e) should be taken in account. In other words, this model is not a very good model to predict catalog prices.
The regression coeeficient of the MASS variable (81.3) differs significantly from 0, p < .001. MASS is a significant variable.

6.3.4 Multiple linear regression, an example

The buyer of a new car has to pay a special tax. The heigth of this special tax depends on different factors. Aim of this example is to find a model with which the heigth of the special tax for a Toyota can be estimated, based on different characteristics of this car. For this reason a random sample from in the Netherlands registered Toyota’s has been drawn, reference date 2019-06-12; see file toyota_sample.csv.

6.3.4.1 Data description

The sample contains 400 observations on 15 variables.

As a first analysis the correlation coefficients between some of the numeric variables has been calculated. MS Excel: Data/Data Analysis/Correlation.

Figure 4. Correlation matrix generated with MS Excel.

As can be seen in Figure 5 the variable SPECIAL_TAX has the highest correlation with CATALOG_PRICE. That’s why the first regression model is a simple linear refgression model with CATALOG_PRICE as explanatory variable.

Figure 5. Scatterplot, SPECIAL_TAX in euro against CATALOG_PRICE in euro.

6.3.4.2 Simple linear models

A first model uses CATALOG_PRICE as explanatory variable.

Figure 6. MS Excel output simple linear regression model with SPECIAL_TAX as response variable and CATALOG_PRICE as explanatory variable.

A second model uses MASS as explanatory variable.


Call:
lm(formula = SPECIAL_TAX ~ MASS, data = toyota)

Residuals:
    Min      1Q  Median      3Q     Max 
-5453.7 -2541.8  -234.7  1467.7 26311.6 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -6396.8549   702.5876  -9.105   <2e-16 ***
MASS            7.6394     0.5753  13.279   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3080 on 398 degrees of freedom
Multiple R-squared:  0.307, Adjusted R-squared:  0.3053 
F-statistic: 176.3 on 1 and 398 DF,  p-value: < 2.2e-16

6.3.4.3 Multiple linear model: SPECIAL_TAX ~ MASS + CATALOG_PRICE

In the second model two explanaory variables ares used: MASS and CATALOG_PRICE. This model has to be

Figure 7. MS Excel output linear regression model with SPECIAL_TAX as response variable and CATALOG_PRICE and MASS as explanatory variable.

Although this model uses two explanatory variables which are both moderately correlated with the response variable SPECIAL_TAX, the model is not much better than the simple linear models. The reason for this is that the two explanatory variables are highly correlated with each other. In general it is preferable to use explanatory variables which are not correlated to each other.

6.3.4.4 Multiple linear model: SPECIAL_TAX ~ CATALOG_PRICE + ELECTRIC

The third model makes use of a dummy variable ELECTRIC (1 = FUEL_DESCRPTION=“ELECTRICITY”, 0 = FUEL_DESCRPTION<>“ELECTRICITY”).

Figure 8. MS Excel output linear regression model with SPECIAL_TAX as response variable and CATALOG_PRICE and ELECTRIC as explanatory variables. ELECTRIC is a dummy variable which takes on value 1 if the car is an electric car and 0 otherwise.

This model is really an improvement of the simple regression model with CATALOG_PRICE as the only explanatory variable.

Exercise
As can be seen in Figure 5 the data set contains a couple of outliers. Remove these outliers from the data set and generate the different OLS models without these outliers.

The data set is not a random sample from all registered cars in the Netherlands; it is a random sample from registered cars from three brands, KIA, BMW and AUDI; because of didactic reasons, KIA PICANTO’s are excluded from the sample.↩

Association Analysis