Introduction
This is our first lab in which we consider two dimensions at once.
Instead of calculating univariate statistics by groups (or factors) of
another variable, we will measure the joint relationship between two
variables using covariance and correlation coefficients.
Please be very careful when choosing the measure of correlation! If
the two variables are measured on different scales, we have to recode
one of them to the weaker scale.
It would be nice to add some additional plots in the background. Feel
free to add your own sections and use external packages.
Data
This time we are going to use typical credit scoring data with
predefined “default” variables and personal demographic and income data.
Please take a closer look at the headers and descriptions of each
variable.
Scatterplots by groups
We can finally see if there are any differences between risk statuses:

We can also look more closely at whether there are any differences
between those two distributions by adding their estimated density plots:
We can also put those plots together:

Correlation coefficients - Pearson’s linear correlation
OK, let’s move on to some calculations. In R, we can use the cor()
function. It takes the two variables and the method:
cor(x, y, method = "pearson"). For two quantitative variables, with all
assumptions met, we can calculate the simple Pearson coefficient of
linear correlation:
## [1] 0.574346
Ok, what about the percentage of the explained variability?
## [1] 32.98734
So, as we can see, almost 33% of the total variability of log income is
explained by differences in age. The rest (about 67%) is probably
explained by other factors.
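The two numbers above are linked by a simple identity: the percentage of explained variability is 100 times the squared Pearson coefficient. A minimal Python sketch of both steps follows. The `pearson_r` helper and the short `age` / `log_income` lists are made up for illustration and are not the bank data.

```python
import math

def pearson_r(x, y):
    """Pearson's linear correlation between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products and sums of squares of the centered values
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Made-up illustration data (NOT the bank data set)
age = [23, 31, 38, 44, 52, 60]
log_income = [3.1, 3.4, 3.3, 3.9, 3.8, 4.2]

r = pearson_r(age, log_income)
explained_pct = 100 * r ** 2  # share of explained variability, in percent
print(round(r, 3), round(explained_pct, 1))

# The same identity applied to the coefficient reported in this lab
r_lab = 0.574346
print(round(100 * r_lab ** 2, 2))  # 32.99
```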
Partial and semipartial correlation
The partial and semi-partial (also known as part) correlations are
used to express the specific portion of variance explained by
eliminating the effect of other variables when assessing the correlation
between two variables.
Partial correlation holds constant one variable when computing the
relations to others. Suppose we want to know the correlation between X
and Y holding Z constant for both X and Y. That would be the partial
correlation between X and Y controlling for Z.
Semipartial correlation holds Z constant for either X or Y, but not
both, so if we wanted to control X for Z, we could compute the
semipartial correlation between X and Y holding Z constant for X.
Suppose we want to know the correlation between the log of income and
age controlling for years of employment. How highly correlated are these
after controlling for tenure?
Note: there can be more than one control variable.
## [1] 0.3194263
How can we interpret the obtained partial correlation coefficient?
How does it differ from the semi-partial coefficient?
## [1] 1.044703
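Both coefficients can be read as ordinary correlations computed on regression residuals, which is one way to check a result by hand. Because they are correlations, they must lie between -1 and 1, so a value outside that range (such as the 1.044703 printed above) points to a slip in the computation rather than to a real effect. The sketch below is a minimal NumPy version on simulated data. The helper names (`residuals`, `partial_corr`, `semipartial_corr`) are hypothetical, and `control` may hold several columns if there is more than one control variable.

```python
import numpy as np

def residuals(target, control):
    """Residuals of `target` after a least-squares fit on `control` (with intercept)."""
    target = np.asarray(target, dtype=float)
    control = np.asarray(control, dtype=float)
    if control.ndim == 1:
        control = control.reshape(-1, 1)
    design = np.column_stack([np.ones(len(target)), control])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return target - design @ coef

def corr(a, b):
    """Pearson correlation of two 1-D arrays."""
    return float(np.corrcoef(a, b)[0, 1])

def partial_corr(x, y, z):
    """Correlation of X and Y with Z removed from both."""
    return corr(residuals(x, z), residuals(y, z))

def semipartial_corr(x, y, z):
    """Correlation of X and Y with Z removed from X only."""
    return corr(residuals(x, z), y)

# Simulated data (NOT the bank data set): z influences both x and y
rng = np.random.default_rng(0)
z = rng.normal(size=500)
x = 0.8 * z + rng.normal(size=500)
y = 0.5 * z + 0.4 * x + rng.normal(size=500)

pr = partial_corr(x, y, z)
sr = semipartial_corr(x, y, z)
print(round(corr(x, y), 3), round(pr, 3), round(sr, 3))
```

In absolute value the semi-partial coefficient is never larger than the partial one, because Y keeps the part of its variance that is due to Z.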
Rank correlation
For two variables measured on different scales, for example income
vs. education level, we cannot use Pearson’s coefficient. The only
possibility is to rank incomes as well, and lose some of the more
detailed information about them.
First, let’s see boxplots of income by education levels.

Now, let’s see Kendall’s coefficient of rank correlation (which copes
well with ties).
## [1] 0.1577567
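To make the idea of rank correlation with ties concrete, here is a small pure-Python sketch of the tau-b variant of Kendall's coefficient, which counts concordant and discordant pairs and adjusts the denominator for ties. The function name `kendall_tau_b` and the five example observations are made up for illustration. It is a plain O(n²) loop, fine for teaching but not for large samples.

```python
import math
from itertools import combinations

def kendall_tau_b(x, y):
    """Kendall's tau-b: (C - D) / sqrt((C + D + Tx) * (C + D + Ty))."""
    concordant = discordant = ties_x_only = ties_y_only = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue              # tied on both variables: ignored
        elif dx == 0:
            ties_x_only += 1      # tied on x only
        elif dy == 0:
            ties_y_only += 1      # tied on y only
        elif dx * dy > 0:
            concordant += 1       # both variables move in the same direction
        else:
            discordant += 1       # variables move in opposite directions
    pairs = concordant + discordant
    denom = math.sqrt((pairs + ties_x_only) * (pairs + ties_y_only))
    return (concordant - discordant) / denom

# Made-up example: education level (ordinal, with ties) vs. income
educ = [1, 1, 2, 2, 3]
income = [10, 20, 15, 30, 40]
tau = kendall_tau_b(educ, income)
print(round(tau, 4))  # 7 concordant, 1 discordant, 2 ties on educ -> 0.6708
```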
Point-biserial correlation
Let’s try to verify if there is a significant relationship between
incomes and risk status. First, let’s take a look at the boxplot:

If you would like to compare one quantitative variable (income) and one
dichotomous variable (default status, binary), then you can use the
point-biserial coefficient:
##
## Pearson's product-moment correlation
##
## data: log(bank$income) and bank$default
## t = -3.6057, df = 698, p-value = 0.0003334
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.20725185 -0.06174165
## sample estimates:
## cor
## -0.1352258
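The output above is labelled as a Pearson correlation because the point-biserial coefficient is just Pearson's r computed with a 0/1 variable. The sketch below checks that equivalence in Python on made-up numbers (not the bank data) and then recomputes the t statistic from the reported r and degrees of freedom. The function name `point_biserial` is hypothetical.

```python
import math
import numpy as np

def point_biserial(values, flags):
    """(mean of group 1 - mean of group 0) / population sd * sqrt(p * q)."""
    values = np.asarray(values, dtype=float)
    flags = np.asarray(flags)
    group1 = values[flags == 1]
    group0 = values[flags == 0]
    p = group1.size / values.size          # share of observations in group 1
    spread = values.std(ddof=0)            # population standard deviation
    return float((group1.mean() - group0.mean()) / spread * math.sqrt(p * (1 - p)))

# Made-up illustration data
log_income = np.array([3.2, 3.9, 4.1, 3.0, 4.4, 3.6, 3.1, 4.0])
default = np.array([1, 0, 0, 1, 0, 0, 1, 0])

r_pb = point_biserial(log_income, default)
r_pearson = float(np.corrcoef(log_income, default)[0, 1])
print(round(r_pb, 6), round(r_pearson, 6))  # the two values coincide

# t statistic implied by the coefficient and df reported in this lab
r_lab, df = -0.1352258, 698
t_lab = r_lab * math.sqrt(df / (1 - r_lab ** 2))
print(round(t_lab, 3))
```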
Nonlinear correlation - eta coefficient
If you would like to check whether there are any nonlinearities between
two variables, the only possibility (besides transformations and linear
analysis) is to calculate the “eta” coefficient and compare it with
Pearson’s linear coefficient.
## eta.sq eta.sq.part
## age -22.10569 -22.10569
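Eta squared is the ratio of the between-group sum of squares to the total sum of squares, so it has to fall between 0 and 1. A negative value like the one printed above cannot be a valid eta squared, which suggests the call needs to be rechecked. The Python sketch below computes it directly on a made-up, purely quadratic relationship, where Pearson's r is zero but eta squared equals one. The function name `eta_squared` is hypothetical.

```python
import numpy as np

def eta_squared(groups, y):
    """Correlation ratio: between-group sum of squares / total sum of squares."""
    groups = np.asarray(groups)
    y = np.asarray(y, dtype=float)
    grand_mean = y.mean()
    ss_total = ((y - grand_mean) ** 2).sum()
    ss_between = 0.0
    for g in np.unique(groups):
        members = y[groups == g]
        ss_between += members.size * (members.mean() - grand_mean) ** 2
    return float(ss_between / ss_total)

# Made-up data: y depends on x only through x squared
x = np.array([-2, -1, 0, 1, 2, -2, -1, 0, 1, 2])
y = x ** 2

r = float(np.corrcoef(x, y)[0, 1])   # linear correlation misses the pattern
eta2 = eta_squared(x, y)             # each distinct x value forms one group
print(round(abs(r), 6), round(eta2, 6))
```

A large gap between eta squared and the squared Pearson coefficient is the signal of nonlinearity that the comparison is looking for.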
Correlation matrix
We can also prepare the correlation matrix for all quantitative
variables stored in our data frame.
We can use the ggcorr() function:
## Warning in ggcorr(bank): data in column(s) 'def', 'educ' are not numeric and
## were ignored

As you can see, the default correlation matrix is not the best idea
for all measurement scales (including the binary variable “default”).
That’s why we now perform our bivariate analysis with ggpairs() with
grouping.
Correlation matrix with scatterplots
Here is what we are about to calculate:
- The correlation matrix between the age, log_income, employ, address,
  debtinc, creddebt, and othdebt variables, grouped by whether the person
  has a default status or not.
- Plot the distribution of each variable by group.
- Display the scatter plot with the trend by group.
##           age employ address income debtinc creddebt othdebt
## age      1.00   0.54    0.60   0.48    0.02     0.30    0.34
## employ   0.54   1.00    0.32   0.62   -0.03     0.40    0.41
## address  0.60   0.32    1.00   0.32    0.01     0.21    0.23
## income   0.48   0.62    0.32   1.00   -0.03     0.57    0.61
## debtinc  0.02  -0.03    0.01  -0.03    1.00     0.50    0.58
## creddebt 0.30   0.40    0.21   0.57    0.50     1.00    0.63
## othdebt  0.34   0.41    0.23   0.61    0.58     0.63    1.00
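One way to picture a grouped correlation matrix is to split the rows by default status and compute a separate matrix for each part. The sketch below does this in Python with NumPy on simulated data. The variable names mirror three of the columns above, but the numbers are random and purely illustrative, and `corr_by_group` is a hypothetical helper.

```python
import numpy as np

def corr_by_group(data, group):
    """Return {group label: correlation matrix of the columns of `data`}."""
    data = np.asarray(data, dtype=float)
    group = np.asarray(group)
    return {
        int(g): np.corrcoef(data[group == g], rowvar=False)
        for g in np.unique(group)
    }

# Simulated data (NOT the bank data set)
rng = np.random.default_rng(1)
n = 200
age = rng.normal(35, 8, n)
employ = 0.4 * (age - 20) + rng.normal(0, 3, n)
log_income = 3.0 + 0.03 * employ + rng.normal(0, 0.3, n)
default = rng.integers(0, 2, n)          # 0 = no default, 1 = default

names = ["age", "employ", "log_income"]
data = np.column_stack([age, employ, log_income])

matrices = corr_by_group(data, default)
for label, matrix in matrices.items():
    print("default =", label)
    print(names)
    print(np.round(matrix, 2))
```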
Qualitative data
In the case of two variables measured on a nominal scale, or on one
ordinal and one nominal scale, we have to organize a so-called
“contingency” table of frequencies and calculate a correlation
coefficient based on it. This is the so-called “contingency analysis”.
Let’s consider one example based on our data: check whether there is
any significant correlation between education level and credit risk.
table(bank$educ,bank$def) %>%
CramerV()
## [1] 0.1281313
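Cramér's V rescales the chi-square statistic of a contingency table to the range from 0 to 1: V = sqrt(chi² / (n · min(r − 1, c − 1))). The pure-Python sketch below follows that formula without a continuity correction. The function names and the 3x2 table of counts are made up for illustration, and the result is not claimed to match any particular package option.

```python
import math

def chi_square(table):
    """Pearson chi-square statistic (no continuity correction) and sample size."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    return stat, n

def cramers_v(table):
    """Cramér's V for an r x c table of counts."""
    stat, n = chi_square(table)
    k = min(len(table), len(table[0])) - 1   # min(rows - 1, columns - 1)
    return math.sqrt(stat / (n * k))

# Made-up counts: three education levels (rows) by default status (columns)
counts = [[10, 20],
          [20, 10],
          [15, 15]]
v = cramers_v(counts)
print(round(v, 4))
```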
Exercise 1. Contingency analysis.
Do you believe in the Afterlife? https://nationalpost.com/news/canada/millennials-do-you-believe-in-life-after-life
A survey was conducted and a random sample of 1091 questionnaires is
given in the form of the following contingency table:
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency
data_afterlife = np.array([[435, 147], [375, 134]])
rows = ['Female', 'Male']
columns = ['Yes', 'No']
df_afterlife = pd.DataFrame(data_afterlife, index=rows, columns=columns)
chi2_stat, p_val, _, _ = chi2_contingency(data_afterlife)
print(" ")
print("---EX1---")
## ---EX1---
print("Chi-square statistic:", chi2_stat)  # very small: observed counts are close to those expected under independence
## Chi-square statistic: 0.1110272160868229
print("p-value:", p_val)  # high: no evidence of an association; the differences we see are consistent with chance
## p-value: 0.7389776820172238
n = np.sum(data_afterlife)
phi_coeff = np.sqrt(chi2_stat / n)
print("Phi coefficient:", phi_coeff)  # close to 0: practically no association between the two variables
## Phi coefficient: 0.010087936733575699
print("SUMMARY:")
## SUMMARY:
print("No significant association between gender and belief in the afterlife was found.")
## No significant association between gender and belief in the afterlife was found.
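One detail worth knowing when reading the numbers above: for 2x2 tables, SciPy's `chi2_contingency` applies Yates' continuity correction by default (`correction=True`), so the phi coefficient in this exercise is based on the corrected statistic. The pure-Python sketch below recomputes the statistic both ways for the same table. The helper `chi_square_2x2` is hypothetical and written only for this comparison.

```python
import math

observed = [[435, 147],   # Female: Yes, No
            [375, 134]]   # Male:   Yes, No

def chi_square_2x2(table, yates):
    """Chi-square statistic for a 2x2 table, optionally with Yates' correction."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            diff = abs(table[i][j] - expected)
            if yates:
                # Shrink each deviation by 0.5, but never below zero
                diff = max(diff - 0.5, 0.0)
            stat += diff ** 2 / expected
    return stat

n = sum(sum(row) for row in observed)
chi2_yates = chi_square_2x2(observed, yates=True)
chi2_plain = chi_square_2x2(observed, yates=False)
phi_yates = math.sqrt(chi2_yates / n)
phi_plain = math.sqrt(chi2_plain / n)

print("with Yates correction:   ", chi2_yates, phi_yates)
print("without Yates correction:", chi2_plain, phi_plain)
```

Either way the association is negligible here, but the uncorrected statistic is the one usually used in the textbook definition of phi.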
Exercise 2. Contingency analysis for the ‘Titanic’ data.
Let’s consider the Titanic dataset, which contains a complete list of
passengers and crew members on the RMS Titanic. It includes a variable
indicating whether a person survived the sinking of the RMS Titanic
on April 15, 1912. The data frame contains 2456 observations on 14
variables.
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency
data_titanic = pd.read_csv("titanic.csv", sep=';')
# Handy calls for a first look at the data:
# print(data_titanic.describe())
# print(data_titanic.isnull().sum())
# One-way frequency table of Status (a single 'count' column)
contingency_table = pd.crosstab(data_titanic['Status'], columns='count')
print(contingency_table)
## col_0 count
## Status
## Survivor 711
## Victim 1496
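# Chi-square test on the 2x1 table of Status counts.
# With a single column the expected counts equal the observed counts, so the
# statistic is 0 and the p-value is 1 by construction. A test of independence
# needs Status cross-tabulated against a second categorical variable.
chi2, p, _, _ = chi2_contingency(contingency_table['count'].values.reshape(2, 1))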
print("\nChi-square test statistic:", chi2)
##
## Chi-square test statistic: 0.0
print("p-value:", p)
## p-value: 1.0
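The statistic of exactly 0 and p-value of 1 are a property of the table shape, not of the Titanic data: a table with one column has expected counts identical to the observed ones. The pure-Python sketch below shows this for the counts printed above and then builds a genuine two-way table. The second variable (`Class`) and its counts are hypothetical, chosen only to show the shape a contingency analysis needs, and do not come from titanic.csv.

```python
from collections import Counter

def expected_counts(table):
    """Expected counts under independence: row total * column total / n."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]

def chi_square(table):
    """Pearson chi-square statistic without continuity correction."""
    expected = expected_counts(table)
    return sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )

# One-way table from the exercise: Survivor / Victim counts in one column
one_way = [[711], [1496]]
print(expected_counts(one_way))  # identical to the observed counts
print(chi_square(one_way))       # 0.0

# Hypothetical (Status, Class) records, NOT taken from titanic.csv
records = ([("Survivor", "1st")] * 20 + [("Survivor", "3rd")] * 10
           + [("Victim", "1st")] * 10 + [("Victim", "3rd")] * 30)
counts = Counter(records)
statuses = sorted({status for status, _ in records})
classes = sorted({cls for _, cls in records})
two_way = [[counts[(s, c)] for c in classes] for s in statuses]

print(two_way)                   # [[20, 10], [10, 30]]
print(chi_square(two_way) > 0)   # True
```

With the real data, the same idea amounts to passing a second column of the data frame to `pd.crosstab` instead of `columns='count'`.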
