Introduction
This is our first lab in which we consider two dimensions. Instead of
calculating univariate statistics by groups (or factors) of another
variable, we will measure the relationship between two variables using
covariance and correlation coefficients.
*Please be very careful when choosing the measure of correlation! If
the two variables are measured on different scales, we have to recode
one of them into the weaker scale.
It would be nice to add some additional plots in the background. Feel
free to add your own sections and use external packages.
Data
This time we are going to use typical credit scoring data with a
predefined “default” variable plus personal demographic and income data.
Please take a closer look at the headers and the description of each
variable.
Scatterplots
First let’s visualize our quantitative relationships using
scatterplots.

You can also reduce the skewness of the income distribution with a
log transformation:

We can add an estimated linear regression line:
## `geom_smooth()` using formula = 'y ~ x'
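
For readers who want to see the numbers behind such a line, here is a minimal base-R sketch. It assumes the line is an ordinary least-squares fit of log income on age. The data frame `demo` and its coefficients are made up for illustration and are not the lab's `bank` data.

```r
# Synthetic stand-in for the credit data (NOT the real `bank` data frame)
set.seed(1)
n <- 200
age <- round(runif(n, 20, 60))
income <- exp(2.5 + 0.03 * age + rnorm(n, sd = 0.4))  # right-skewed incomes
demo <- data.frame(age = age, income = income, log_income = log(income))

# Ordinary least-squares fit of log income on age
fit <- lm(log_income ~ age, data = demo)

# Intercept and slope of the fitted straight line
print(coef(fit))
```

In base graphics the same line could be drawn over a scatterplot with `plot(demo$age, demo$log_income)` followed by `abline(fit)`.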

Scatterplots by groups
Finally, we can check whether there are any differences between risk statuses:
## `geom_smooth()` using formula = 'y ~ x'

We can also look more closely at whether there are any differences
between those two distributions by adding their estimated density plots:

We can also put those plots together:

Scatterplots with density curves
Again, we can look more closely at whether there are any differences
between those two distributions by adding their estimated density plots:

Correlation coefficients - Pearson’s linear correlation
OK, let’s move on to some calculations. In R, we can use the cor()
function. In its basic form it takes two variables and the method:
cor(x, y, method). For two quantitative variables, with all assumptions
met, we can calculate the simple Pearson’s coefficient of linear
correlation:
## [1] "Pearson's linear correlation: 0.574346033083121"
Ok, what about the percentage of the explained variability?
## [1] "Percentage of the explained variability: 32.9873365718318 %"
So, as we can see, about 32.99% of the total variability of log income
is explained by differences in age (in the sense of a linear model). The
rest (67.01%) is probably explained by other factors.
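
The two steps above can be sketched as follows. The vectors are simulated for illustration only (the variable names mirror the lab's, the numbers do not), so the printed values will differ from the output shown above.

```r
# Simulated data, for illustration only
set.seed(42)
n <- 300
age <- rnorm(n, mean = 35, sd = 8)
log_income <- 2 + 0.04 * age + rnorm(n, sd = 0.45)

# Pearson's linear correlation and the share of explained variability
r <- cor(age, log_income, method = "pearson")
r_squared <- r^2
print(paste("Pearson's linear correlation:", round(r, 3)))
print(paste("Percentage of the explained variability:",
            round(100 * r_squared, 2), "%"))

# r^2 coincides with the R-squared of the simple linear regression
fit <- lm(log_income ~ age)
print(all.equal(r_squared, summary(fit)$r.squared))
```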
Partial and semipartial correlation
The partial and semi-partial (also known as part) correlations are
used to express the specific portion of variance explained by
eliminating the effect of other variables when assessing the correlation
between two variables.
Partial correlation holds constant one variable when computing the
relations to others. Suppose we want to know the correlation between X
and Y holding Z constant for both X and Y. That would be the partial
correlation between X and Y controlling for Z.
Semipartial correlation holds Z constant for either X or Y, but not
both, so if we wanted to control X for Z, we could compute the
semipartial correlation between X and Y holding Z constant for X.
Suppose we want to know the correlation between the log of income and
age controlling for years of employment. How highly correlated are these
after controlling for tenure?
**There can be more than one control variable.
## [1] "Partial correlation: 0.319426338025541"
How can we interpret the obtained partial correlation coefficient?
What is the difference between that one and the semi-partial
coefficient:
## [1] "Semi-partial correlation: 0.22037110188406"
The partial correlation coefficient between the log of income and age,
controlling for years of employment, equals 0.319. After removing the
linear effect of years of employment from both variables, there is still
a positive, moderate relationship between age and the logarithm of
income, although it is clearly weaker than the raw Pearson correlation
of 0.574.
The semi-partial correlation between log income and age, with years of
employment removed from log income only, equals 0.220. So once the
effect of years of employment on the logarithm of income is taken out, a
weaker but still positive relationship between age and log income
remains.
Finally, comparing the two coefficients, the semi-partial value is lower
than the partial one. This is expected: in absolute value the
semi-partial correlation can never exceed the partial correlation,
because it relates the adjusted variable to all of the variability of
the unadjusted one. The size of the gap reflects how strongly years of
employment is related to the variable that was left unadjusted.
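
One way to see what the two coefficients measure is to compute them from regression residuals. The sketch below uses simulated data and base R only. It is an illustration of the definitions, not necessarily the routine used to produce the output above.

```r
# Simulated data: Z = employ (control), X = age, Y = log_income
set.seed(7)
n <- 500
employ <- rnorm(n)
age <- 0.7 * employ + rnorm(n)
log_income <- 0.4 * age + 0.5 * employ + rnorm(n)

# Residuals: the part of each variable not linearly explained by Z
res_age <- resid(lm(age ~ employ))
res_inc <- resid(lm(log_income ~ employ))

# Partial correlation: Z removed from both X and Y
partial <- cor(res_age, res_inc)

# Semi-partial correlation: Z removed from Y (log_income) only
semipartial <- cor(age, res_inc)

# Identity linking the two: partial = semipartial / sqrt(1 - r_xz^2),
# where r_xz is the correlation of the unadjusted variable with Z
r_xz <- cor(age, employ)
print(c(partial = partial, semipartial = semipartial))
print(all.equal(partial, semipartial / sqrt(1 - r_xz^2)))
```

The identity in the last lines is why the semi-partial coefficient is never larger in absolute value than the partial one.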
Rank correlation
For two different measurement scales, for example income vs. education
level, we cannot use Pearson’s coefficient. The only possibility is to
rank incomes as well, and lose some of the more detailed information
about them.
First, let’s see boxplots of income by education levels.


In these box plots we see that people with the highest education
generally earn the most. For the lower education levels the differences
between groups are less pronounced.
Now, let’s see Kendall’s coefficient of rank correlation (robust to
ties).
## [1] "Kendall's rank correlation coefficient: 0.158"
Kendall’s rank correlation coefficient ranges from -1 to 1, where 1
indicates a perfect positive correlation, -1 a perfect negative
correlation and 0 no correlation. The value 0.158 suggests a weak
positive correlation between income and education level.
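
A minimal sketch of a rank correlation between a quantitative and an ordinal variable, using the base `cor()` function with `method = "kendall"`. The data are simulated and the education coding 1 to 5 is an assumption made for the example.

```r
# Simulated data: ordinal education level (1-5) and skewed incomes
set.seed(3)
n <- 200
educ <- sample(1:5, n, replace = TRUE,
               prob = c(0.50, 0.25, 0.15, 0.08, 0.02))
income <- exp(3 + 0.15 * educ + rnorm(n, sd = 0.6))

# Kendall's rank correlation (the ordinal variable produces many ties)
tau <- cor(income, educ, method = "kendall")
print(paste("Kendall's rank correlation coefficient:", round(tau, 3)))

# Rank-based measures ignore monotone transformations such as log()
tau_log <- cor(log(income), educ, method = "kendall")
print(all.equal(tau, tau_log))
```

Because only the ordering of incomes is used, taking the log of income first does not change the coefficient.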
Point-biserial correlation
Let’s try to verify if there is a significant relationship between
incomes and risk status. First, let’s take a look at the boxplot:


If you would like to compare one quantitative variable (income) and one
dichotomous variable (default status, binary), you can use the
point-biserial coefficient:
## [1] "Point-biserial correlation coefficient (income): 0.071"
## [1] "Point-biserial correlation coefficient (log income): 0.135"
We can see that this correlation is positive and weak in both cases,
since the absolute values are below 0.2.
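
The point-biserial coefficient is simply Pearson's correlation with the dichotomous variable coded 0/1. The sketch below checks this against the textbook formula on simulated data (the numbers are illustrative, not the lab's).

```r
# Simulated data: binary default flag and skewed incomes
set.seed(11)
n <- 400
default <- rbinom(n, 1, 0.25)
income <- exp(3.5 + 0.2 * default + rnorm(n, sd = 0.7))

# Point-biserial correlation = Pearson's r with a 0/1 variable
r_pb <- cor(income, default)

# Textbook formula: (m1 - m0) / s * sqrt(p * (1 - p)),
# with s the population standard deviation (divisor n)
m1 <- mean(income[default == 1])
m0 <- mean(income[default == 0])
p <- mean(default)
s_pop <- sqrt(mean((income - mean(income))^2))
r_formula <- (m1 - m0) / s_pop * sqrt(p * (1 - p))

print(c(cor = r_pb, formula = r_formula))
```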
Nonlinear correlation - eta coefficient
If you would like to check whether there are any nonlinearities between
two variables, one possibility (besides transformations and linear
analysis) is to calculate the “eta” coefficient and compare it with
Pearson’s linear coefficient.
## [1] "Eta coefficient (income ~ default): 0.0709696566192232"
## [1] "Eta coefficient (log income ~ default): 0.135225798931301"
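The values of the eta coefficient and of the point-biserial (Pearson)
coefficient are the same. With a dichotomous grouping variable this is
always the case: eta equals the absolute value of the point-biserial
correlation, so this comparison cannot reveal any nonlinearity here. It
becomes informative only when the grouping variable has more than two
levels.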
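
A small sketch of the eta coefficient computed from sums of squares. The helper `eta()` is written here for illustration (it is not a function from the lab or from a package), and both data sets are simulated.

```r
# Hypothetical helper: eta = sqrt(SS_between / SS_total)
eta <- function(y, g) {
  g <- as.factor(g)
  group_means <- ave(y, g)  # each observation replaced by its group mean
  ss_between <- sum((group_means - mean(y))^2)
  ss_total <- sum((y - mean(y))^2)
  sqrt(ss_between / ss_total)
}

set.seed(5)

# Case 1: two groups, eta equals |Pearson's r|
g2 <- rbinom(300, 1, 0.3)
y2 <- rnorm(300) + 0.5 * g2
print(c(eta = eta(y2, g2), abs_r = abs(cor(y2, g2))))

# Case 2: three ordered groups with a U-shaped relationship,
# eta is large while Pearson's r stays close to zero
x3 <- rep(1:3, each = 100)
y3 <- (x3 - 2)^2 + rnorm(300, sd = 0.3)
print(c(eta = eta(y3, x3), abs_r = abs(cor(y3, x3))))
```

A clear gap between eta and the absolute value of r, as in the second case, is the signal of a nonlinear relationship.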
Correlation matrix
We can also prepare the correlation matrix for all quantitative
variables stored in our data frame.
We can use ggcorr() function:
## Warning in ggcorr(bank, method = c("everything", "pearson"), label = TRUE):
## data in column(s) 'ed', 'default', 'def', 'educ' are not numeric and were
## ignored

As you can see, the default correlation matrix is not the best idea
for all measurement scales: non-numeric columns, including the binary
variable “default”, are dropped. That’s why we now perform our bivariate
analysis with ggpairs(), with grouping.
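
As the warning above shows, non-numeric columns are left out of a Pearson correlation matrix. A base-R sketch of making that selection explicit is given below. The data frame `demo` is simulated and only borrows a few column names from the lab data.

```r
# Simulated data frame with numeric columns and one factor
set.seed(21)
n <- 100
demo <- data.frame(
  age = rnorm(n, 35, 8),
  employ = rpois(n, 8),
  debtinc = runif(n, 0, 30),
  default = factor(rbinom(n, 1, 0.25))
)

# Keep only the numeric columns, then compute Pearson correlations
num_cols <- vapply(demo, is.numeric, logical(1))
cor_matrix <- cor(demo[, num_cols], method = "pearson")
print(round(cor_matrix, 2))
```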
Correlation matrix with scatterplots
Here is what we are about to calculate:
- The correlation matrix between the age, log_income, employ, address,
  debtinc, creddebt, and othdebt variables, grouped by whether the
  person has a default status or not.
- Plot the distribution of each variable by group.
- Display the scatter plot with the trend by group.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Qualitative data
In the case of two variables measured on a nominal scale, or one on an
ordinal and one on a nominal scale, we have to organize a so-called
“contingency” table of frequencies and calculate a correlation
coefficient based on it. This is the so-called “contingency analysis”.
Let’s consider one example based on our data: verify, if there is any
significant correlation between education level and credit risk.
bank$ed <- as.factor(bank$ed)
bank$default <- as.factor(bank$default)
contingency_table <- table(bank$ed, bank$default)
print(contingency_table)
##
##       0   1
##   1 293  79
##   2 139  59
##   3  57  30
##   4  24  14
##   5   4   1
chisq.test(contingency_table)
## Warning in chisq.test(contingency_table): Chi-squared approximation may be
## incorrect
##
## Pearson's Chi-squared test
##
## data: contingency_table
## X-squared = 11.492, df = 4, p-value = 0.02155
prop.table(contingency_table)
##
##               0           1
##   1 0.418571429 0.112857143
##   2 0.198571429 0.084285714
##   3 0.081428571 0.042857143
##   4 0.034285714 0.020000000
##   5 0.005714286 0.001428571
cramers_v <- CramerV(contingency_table)
print(paste("Cramer's V:", round(cramers_v, 3)))
## [1] "Cramer's V: 0.128"
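The results suggest that there is a relationship between education
level and credit risk, although it is weak (Cramer’s V = 0.128). With
p = 0.022, the relationship is statistically significant at the 0.05
level. Note, however, the warning that the chi-squared approximation may
be incorrect: the highest education level contains only 5 observations,
so the p-value should be treated with some caution.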
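
Cramer’s V can also be recovered directly from the chi-square statistic as sqrt(X-squared / (n * (min(rows, columns) - 1))). The sketch below re-enters the counts printed above by hand and uses only base R. It is a cross-check, not a replacement for the CramerV() call.

```r
# Counts copied from the contingency table printed above
tab <- matrix(c(293, 79,
                139, 59,
                 57, 30,
                 24, 14,
                  4,  1),
              ncol = 2, byrow = TRUE,
              dimnames = list(ed = as.character(1:5),
                              default = c("0", "1")))

# Chi-square test (the small-count warning is suppressed on purpose)
res <- suppressWarnings(chisq.test(tab))
chi2 <- unname(res$statistic)

# Cramer's V from the chi-square statistic
n <- sum(tab)
k <- min(dim(tab)) - 1
v <- sqrt(chi2 / (n * k))
print(paste("Cramer's V:", round(v, 3)))

# Expected counts: values below 5 explain the approximation warning
print(round(res$expected, 2))
```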
Exercise 1. Contingency analysis.
Do you believe in the Afterlife? https://nationalpost.com/news/canada/millennials-do-you-believe-in-life-after-life
A survey was conducted and a random sample of 1091 questionnaires is
given in the form of the following contingency table:
##         Believe
## Gender   Yes  No
##   Female 435 375
##   Male   147 134

Our task is to check whether there is a significant relationship
between belief in the afterlife and gender. We can do this with the
simple chi-square statistic and a chosen correlation coefficient for
qualitative data (two-way 2x2 table).
## Call: cohen.kappa1(x = x, w = w, n.obs = n.obs, alpha = alpha, levels = levels,
## w.exp = w.exp)
##
## Cohen Kappa and Weighted Kappa correlation coefficients and confidence boundaries
##                   lower estimate upper
## unweighted kappa -0.043    0.011 0.065
## weighted kappa   -0.043    0.011 0.065
##
## Number of subjects = 1091
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: dane
## X-squared = 0.11103, df = 1, p-value = 0.739
##         Believe
## Gender         Yes        No
##   Female 0.3987168 0.3437214
##   Male   0.1347388 0.1228231
As you can see, we can calculate the chi-square statistic really
quickly for 2x2 or larger two-way tables. Now we can standardize this
contingency measure to see how strong the relationship is.
## [1] 0.01218871
## [1] 0.0121878
## [1] 0.01218871
## [1] 0.01218871


The unweighted Cohen’s kappa estimate is 0.011, with confidence bounds
from -0.043 to 0.065. The interval includes zero, so there is no
evidence of agreement beyond chance between belief in an afterlife and
gender. Similarly, the chi-square test indicates no statistically
significant relationship between belief in the afterlife and gender
(p = 0.739 > 0.05). Furthermore, all four coefficients (Phi, the
contingency coefficient, Cramer’s V and Tschuprow’s T) are about 0.012,
which suggests a negligible relationship between this belief and
gender.
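
The four standardized coefficients can be reproduced from the uncorrected chi-square statistic with their usual formulas. The sketch below re-enters the survey counts from the table above and uses base R only. It is meant as a cross-check of the reported values.

```r
# Survey counts copied from the contingency table above
tab <- matrix(c(435, 375,
                147, 134),
              nrow = 2, byrow = TRUE,
              dimnames = list(Gender = c("Female", "Male"),
                              Believe = c("Yes", "No")))

# Chi-square with Yates' correction (the default for 2x2 tables)
chi2_yates <- unname(chisq.test(tab)$statistic)

# Uncorrected chi-square, used for the association coefficients
chi2 <- unname(chisq.test(tab, correct = FALSE)$statistic)
n <- sum(tab)
r <- nrow(tab)
k <- ncol(tab)

phi <- sqrt(chi2 / n)
cont <- sqrt(chi2 / (chi2 + n))
cramer <- sqrt(chi2 / (n * min(r - 1, k - 1)))
tschuprow <- sqrt(chi2 / (n * sqrt((r - 1) * (k - 1))))

print(c(Phi = phi, ContCoef = cont,
        CramerV = cramer, TschuprowT = tschuprow))
```

For a 2x2 table Phi, Cramer’s V and Tschuprow’s T coincide, which is why three of the four values reported above are identical.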
Exercise 2. Contingency analysis for the ‘Titanic’ data.
Let’s consider the titanic dataset, which contains a complete list of
passengers and crew members on the RMS Titanic. It includes a variable
indicating whether a person survived the sinking of the RMS Titanic
on April 15, 1912. The data frame contains 2456 observations on 14
variables.
The website http://www.encyclopedia-titanica.org/ offers detailed
information about passengers and crew members on the RMS Titanic.
According to the website, 1317 passengers and 890 crew members were
aboard.
8 musicians and 9 employees of the shipyard company are listed as
passengers, but travelled with a free ticket, which is why they have NA
values in fare. In addition to that, fare is truly missing for a few
regular passengers.
# your answer here
titanic <- titanic[complete.cases(titanic$Fare.Price), ]
contingency_table_titanic <- table(titanic$Status, titanic$Crew.or.Passenger.)
print(contingency_table_titanic)
##
##            Crew Passenger
##             214        35
##   Survivor  211       500
##   Victim    679       817
prop.table(contingency_table_titanic)
##
##                  Crew  Passenger
##            0.08713355 0.01425081
##   Survivor 0.08591205 0.20358306
##   Victim   0.27646580 0.33265472
chisq.test(contingency_table_titanic)
##
## Pearson's Chi-squared test
##
## data: contingency_table_titanic
## X-squared = 236.24, df = 2, p-value < 2.2e-16
Phi(contingency_table_titanic)
## [1] 0.3101466
ContCoef(contingency_table_titanic)
## [1] 0.2962265
CramerV(contingency_table_titanic)
## [1] 0.3101466
TschuprowT(contingency_table_titanic)
## [1] 0.2608012
mosaicplot(contingency_table_titanic)

barplot(contingency_table_titanic)

The result of the chi-square test indicates a statistically
significant relationship between status (survivor or victim) and crew
or passenger affiliation (p < 0.05). The coefficient values lie between
0.26 and 0.31, which means a moderate association between these two
variables. Among the records with a known status, 500 of 1317 passengers
(about 38%) survived, compared with 211 of 890 crew members (about 24%),
so passengers had a higher chance of surviving than crew members. Note
that the table also contains 249 records with an empty status label.
They form a third row and contribute to the test statistic (df = 2).
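
The survival shares within each group can be read off with column-wise proportions. The sketch below re-enters the counts printed above, labels the row with the empty status as "(blank)" (a label chosen here for readability) and drops it before computing the shares.

```r
# Counts copied from the contingency table printed above;
# "(blank)" stands for the row with an empty Status label
tab <- matrix(c(214,  35,
                211, 500,
                679, 817),
              ncol = 2, byrow = TRUE,
              dimnames = list(Status = c("(blank)", "Survivor", "Victim"),
                              Group = c("Crew", "Passenger")))

# Keep only records with a known status
known <- tab[c("Survivor", "Victim"), ]

# Column-wise proportions: share of survivors and victims within each group
rates <- prop.table(known, margin = 2)
print(round(rates, 3))
```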
