Introduction
This is our first lab that considers two dimensions: instead of
calculating univariate statistics by groups (or factors) of another
variable, we will measure the joint relationship between two variables
using covariance and correlation coefficients.
Please be very careful when choosing the measure of correlation! If the
two variables are measured on different scales, we have to recode one of
them to the weaker scale.
It would be nice to add some additional plots in the background. Feel
free to add your own sections and use external packages.
Data
This time we are going to use a typical credit scoring dataset with a
predefined “default” variable plus personal demographic and income data.
Please take a closer look at the headers and descriptions of each
variable.
Scatterplots
First let’s visualize our quantitative relationships using scatter
plots.

You can also reduce the skewness of the income distribution by taking
logarithms:
## Warning: Using size for a discrete variable is not advised.

## [1] 0.4251568
## [1] 0.1577567
## estimate p.value statistic n gp Method
## 1 0.5842774 3.210675e-65 19.0072 700 1 pearson
## estimate p.value statistic n gp Method
## 1 0.3194263 4.805085e-18 8.899323 700 1 pearson
We can add an estimated linear regression line:
## `geom_smooth()` using formula = 'y ~ x'
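As a minimal sketch of what such a regression line is, the example below fits it with base R's lm() and shows how its slope relates to Pearson's r. The built-in mtcars data stand in for the bank data (an assumption, since the lab's own plotting code is not shown here), and the variable names are illustrative only.

```r
# Illustrative stand-ins for the lab's variables (assumption: not the bank data)
x <- mtcars$wt    # predictor
y <- mtcars$mpg   # response

# Least-squares line y = a + b * x
fit <- lm(y ~ x)

# For simple regression the slope equals r * sd(y) / sd(x)
slope_from_r <- cor(x, y) * sd(y) / sd(x)
intercept_from_r <- mean(y) - slope_from_r * mean(x)

print(coef(fit))
print(c(intercept = intercept_from_r, slope = slope_from_r))
```

Both printed pairs of coefficients should agree, which is why a steeper line does not by itself mean a stronger correlation: the slope also depends on the two standard deviations.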

Scatterplots by groups
Finally, we can see if there are any differences between risk status groups:
## `geom_smooth()` using formula = 'y ~ x'

We can also see more closely if there are any differences between those
two distributions by adding their estimated density plots:
## Warning: package 'ggExtra' was built under R version 4.3.3

We can also put those plots together:

Scatterplots with density curves
We can also see more closely if there are any differences between those
two distributions by adding their estimated density plots:

Correlation coefficients - Pearson’s linear correlation
OK, let’s move on to some calculations. In R, we can use the cor()
function. It takes the two variables and the method: cor(x, y, method).
For two quantitative variables, with all assumptions met, we can
calculate the simple Pearson coefficient of linear correlation:
## [1] 0.574346
Ok, what about the percentage of the explained variability?
## 32.98734 %
As we can see, about 32.99% of the total variability of log income is
explained by differences in age. The remaining 67% is probably explained
by other factors.
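The two steps above (Pearson's r, then the share of explained variability) can be sketched as follows. The built-in mtcars data are used as a stand-in for the bank data, which is an assumption made only so that the example runs on its own.

```r
# Illustrative stand-ins (assumption: mtcars instead of the bank data)
x <- mtcars$wt         # plays the role of age
y <- log(mtcars$mpg)   # plays the role of log income

# Pearson's linear correlation coefficient
r <- cor(x, y, method = "pearson")

# Coefficient of determination: share of variability explained
r_squared <- r^2
cat(sprintf("r = %.3f, explained variability = %.2f %%\n", r, 100 * r_squared))

# Cross-check: in simple linear regression R^2 equals r^2
fit <- lm(y ~ x)
print(all.equal(summary(fit)$r.squared, r_squared))
```

The sign of r is lost when squaring, so r² should always be reported together with the direction of the relationship.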
Partial and semipartial correlation
The partial and semi-partial (also known as part) correlations are
used to express the specific portion of variance explained by
eliminating the effect of other variables when assessing the correlation
between two variables.
Partial correlation holds one variable constant when computing the
relationship between the others. Suppose we want to know the correlation
between X and Y holding Z constant for both X and Y. That would be the
partial correlation between X and Y controlling for Z.
Semipartial correlation holds Z constant for either X or Y, but not
both. If we wanted to control X for Z, we would compute the semipartial
correlation between X and Y holding Z constant for X only.
Suppose we want to know the correlation between the log of income and
age controlling for years of employment. How highly correlated are these
after controlling for tenure?
**There can be more than one control variable.
## Partial Correlation (controlling for years of employment): 0.3194263
## Semipartial Correlation (controlling for years of employment on age): 0.2203711
How can we interpret the obtained partial correlation coefficient?
What is the difference between that one and the semipartial
coefficient?
The partial correlation between logged income and age, controlling
for years of employment, is 0.3194263. This value indicates that the
unique relationship between age and logged income, independent of the
years of employment, is positive: older individuals tend to have higher
logged incomes, even after accounting for their years of employment.
The semipartial correlation, on the other hand, removes the effect of
years of employment from the age variable only and correlates what is
left of age with the full logged income. The coefficient is 0.2203711,
so its square (about 0.049) is the share of the total variance of logged
income that is explained uniquely by age. It is smaller than the partial
correlation because logged income still contains the variability related
to employment duration.
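One way to see the difference between the two coefficients is to compute them from regression residuals. This is a minimal base R sketch on the built-in mtcars data (an assumption; the lab's own code and the package it used are not shown), with x, y and z standing in for logged income, age and years of employment.

```r
# Illustrative stand-ins (assumption: mtcars instead of the bank data)
x <- mtcars$mpg   # role of logged income
y <- mtcars$wt    # role of age
z <- mtcars$hp    # role of years of employment (control variable)

# Remove the linear effect of z from each variable
res_x <- resid(lm(x ~ z))
res_y <- resid(lm(y ~ z))

# Partial correlation: z is removed from BOTH x and y
partial_r <- cor(res_x, res_y)

# Semipartial correlation: z is removed from y only, x is left untouched
semipartial_r <- cor(x, res_y)

# Cross-check with the textbook formulas based on pairwise correlations
rxy <- cor(x, y); rxz <- cor(x, z); ryz <- cor(y, z)
partial_formula <- (rxy - rxz * ryz) / sqrt((1 - rxz^2) * (1 - ryz^2))
semipartial_formula <- (rxy - rxz * ryz) / sqrt(1 - ryz^2)

print(c(partial = partial_r, semipartial = semipartial_r))
```

The semipartial coefficient can never exceed the partial one in absolute value, because its denominator keeps the variability of x that is related to z.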
Rank correlation
For two variables measured on different scales, such as income vs.
education level, we cannot use Pearson’s coefficient. The only
possibility is to rank the incomes as well, and lose some of the more
detailed information about them.
First, let’s see boxplots of income by education levels.

Now, let’s see Kendall’s coefficient of rank correlation (robust to
ties).
## 0.1577567
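A minimal sketch of such a rank correlation, assuming the ordinal variable is stored as an ordered factor. The built-in mtcars data are a stand-in for income and education level (an assumption), and the variable names are illustrative.

```r
# Illustrative stand-ins (assumption: mtcars instead of the bank data)
quantitative <- mtcars$mpg                          # role of income
ordinal <- factor(mtcars$cyl, levels = c(4, 6, 8),
                  ordered = TRUE)                   # role of education level

# cor() needs numbers, so the ordered factor is converted to its level codes
ordinal_codes <- as.numeric(ordinal)

# Kendall's tau uses only the ordering of the values, not their distances
tau <- cor(quantitative, ordinal_codes, method = "kendall")

# Spearman's rho is the other common rank-based choice
rho <- cor(quantitative, ordinal_codes, method = "spearman")

print(c(kendall = tau, spearman = rho))
```

With many tied values, as is typical for a variable with only a few levels, Kendall's tau is usually the safer of the two to report.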
Point-biserial correlation
Let’s try to verify if there is a significant relationship between
incomes and risk status. First, let’s take a look at the boxplot:
## Warning: The following aesthetics were dropped during statistical transformation: fill.
## ℹ This can happen when ggplot fails to infer the correct grouping structure in
## the data.
## ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
## variable into a factor?

If you would like to compare one quantitative variable (income) and one
dichotomous variable (default status - binary), then you can use the
point-biserial coefficient:
##
## Pearson's product-moment correlation
##
## data: bank$income and as.numeric(bank$default)
## t = -1.8797, df = 698, p-value = 0.06056
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.144313546 0.003149751
## sample estimates:
## cor
## -0.07096966
The point-biserial correlation coefficient is about -0.071. It is
negative, which means that as income increases, the likelihood of
defaulting on a loan decreases. This makes sense, because people with
higher incomes are usually less likely to default. However, the
coefficient is very close to zero and the p-value (0.06056) is above the
0.05 level, so the relationship is weak and not statistically
significant at the 5% level; it is weaker than I would have expected.
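The point-biserial coefficient is simply Pearson's r with one variable coded 0/1, which the following sketch checks against the textbook formula. The built-in mtcars data replace the bank data here (an assumption), with am as the binary variable.

```r
# Illustrative stand-ins (assumption: mtcars instead of the bank data)
value <- mtcars$mpg   # quantitative variable, role of income
group <- mtcars$am    # binary variable coded 0/1, role of default status

# Point-biserial correlation = Pearson's r with a 0/1 variable
r_pb <- cor(value, group)

# Textbook formula: (M1 - M0) / s_n * sqrt(p * q),
# where s_n is the standard deviation computed with n in the denominator
m1 <- mean(value[group == 1])
m0 <- mean(value[group == 0])
p <- mean(group == 1)
q <- 1 - p
s_n <- sqrt(mean((value - mean(value))^2))
r_formula <- (m1 - m0) / s_n * sqrt(p * q)

# cor.test() adds the t statistic, p-value and confidence interval
test <- cor.test(value, group)
print(c(r_pb = r_pb, r_formula = r_formula, p_value = test$p.value))
```

The sign depends only on which category is coded 1, so it is worth checking the coding before interpreting the direction.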
Nonlinear correlation - eta coefficient
If you would like to check whether there are any nonlinearities between
two variables, the only possibility (besides transformations and linear
analysis) is to calculate the “eta” coefficient and compare it with
Pearson’s linear coefficient.
## [1] 0.07096966
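The eta coefficient is about 0.071, so in absolute value it is the same
as the previously calculated point-biserial coefficient. With only two
groups this is expected: for a binary grouping variable eta always
equals |r|, so the comparison cannot reveal any nonlinearity here.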
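A minimal sketch of the eta coefficient as the square root of the between-group share of the total sum of squares. The helper name eta() is hypothetical, and the built-in mtcars data are used instead of the bank data (an assumption).

```r
# Hypothetical helper: correlation ratio (eta) of y across the groups in g
eta <- function(y, g) {
  g <- as.factor(g)
  ss_total <- sum((y - mean(y))^2)
  group_means <- ave(y, g)                  # group mean for each observation
  ss_between <- sum((group_means - mean(y))^2)
  sqrt(ss_between / ss_total)
}

# Two groups: eta coincides with |r|
eta_binary <- eta(mtcars$mpg, mtcars$am)
r_binary <- cor(mtcars$mpg, mtcars$am)

# Three groups: eta can exceed |r|, which would signal nonlinearity
eta_multi <- eta(mtcars$mpg, mtcars$cyl)
r_multi <- cor(mtcars$mpg, mtcars$cyl)

print(c(eta_binary = eta_binary, abs_r_binary = abs(r_binary)))
print(c(eta_multi = eta_multi, abs_r_multi = abs(r_multi)))
```

A gap between eta and |r| is only informative when the grouping variable has three or more levels.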
Correlation matrix
We can also prepare the correlation matrix for all quantitative
variables stored in our data frame.
We can use ggcorr() function:

As you can see, the default correlation matrix is not the best idea
for all measurement scales (including the binary variable “default”).
That’s why we will now perform our bivariate analysis with ggpairs()
with grouping.
Correlation matrix with scatterplots
Here is what we are about to calculate:
- The correlation matrix between the age, log_income, employ, address,
  debtinc, creddebt, and othdebt variables, grouped by whether the person
  has a default status or not.
- Plot the distribution of each variable by group
- Display the scatter plot with the trend by group
# first correlation matrices, still using ggcorr
library(GGally)     # ggcorr(), ggpairs(), wrap()
library(gridExtra)  # grid.arrange()

bank$log_income <- log(bank$income)
data_for_matrix <- c("age", "log_income", "employ", "address", "debtinc", "creddebt", "othdebt")

# one correlation matrix per default group
default_groups <- split(bank, bank$default)
corr_matrix_plots <- lapply(names(default_groups), function(group) {
  data_numeric <- default_groups[[group]][, data_for_matrix, drop = FALSE]
  data_numeric <- na.omit(data_numeric)
  ggcorr(data_numeric, label = TRUE, hjust = 1) +
    ggtitle(paste("Correlation matrix:", group))
})

# displaying matrices
grid.arrange(grobs = corr_matrix_plots, ncol = 1)

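# and now using ggpairs :)
# keep the grouping factor in the plotted data frame instead of
# referencing bank$default inside aes()
pairs_data <- bank[, c(data_for_matrix, "default")]
pairs_data$default <- as.factor(pairs_data$default)
pairs_plot <- ggpairs(pairs_data,
                      columns = data_for_matrix,
                      mapping = aes(color = default, alpha = 0.5),
                      upper = list(continuous = wrap("cor", size = 4)),
                      lower = list(continuous = wrap("points", alpha = 0.5)),
                      diag = list(continuous = wrap("barDiag", fill = "white")))
print(pairs_plot)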
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
This matrix shows:
- scatter plots in the lower triangle
- correlation coefficients in the upper triangle
- histograms on the diagonal
Qualitative data
When two variables are measured on a nominal scale, or one on an
ordinal and the other on a nominal scale, we have to organize the
frequencies into a so-called “contingency” table and calculate a
correlation coefficient based on them. This is so-called “contingency
analysis”.
Let’s consider one example based on our data: check whether there is any
significant correlation between education level and credit risk.
contingency_table <- table(bank$ed, bank$default)
chi_square_result <- chisq.test(contingency_table)
## Warning in chisq.test(contingency_table): Chi-squared approximation may be
## incorrect
print(chi_square_result)
##
## Pearson's Chi-squared test
##
## data: contingency_table
## X-squared = 11.492, df = 4, p-value = 0.02155
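Since the p-value (0.02155) is below 0.05, we can reject the hypothesis
that these two variables are independent. However, R warns that the
chi-squared approximation may be incorrect (some expected counts are
probably small), and the test alone does not tell us how strong the
dependence is, so more analysis is needed.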
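Two common follow-ups to such a result are a test that does not rely on the chi-squared approximation and a standardized measure of strength. The sketch below shows both on a table built from the built-in mtcars data (an assumption, since the bank data are not available here); the helper cramers_v() is hypothetical and written only for this illustration.

```r
# Hypothetical helper: Cramer's V from an r x c contingency table
cramers_v <- function(tab) {
  chi2 <- suppressWarnings(chisq.test(tab, correct = FALSE))$statistic
  n <- sum(tab)
  k <- min(nrow(tab), ncol(tab))
  unname(sqrt(chi2 / (n * (k - 1))))
}

# Illustrative table with small expected counts (assumption: not the bank data)
tab <- table(cylinders = mtcars$cyl, gears = mtcars$gear)

# Strength of the association on a 0-1 scale
v <- cramers_v(tab)
print(v)

# Alternatives when expected counts are small:
set.seed(1)                                            # reproducible simulation
sim_test <- chisq.test(tab, simulate.p.value = TRUE, B = 2000)
exact_test <- fisher.test(tab)
print(c(simulated_p = sim_test$p.value, fisher_p = exact_test$p.value))
```

Cramér's V equals 0 for no association and 1 for a perfect one, so it complements the p-value rather than replacing it.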
Exercise 1. Contingency analysis.
Do you believe in the Afterlife? https://nationalpost.com/news/canada/millennials-do-you-believe-in-life-after-life
A survey was conducted and a random sample of 1091 questionnaires is
given in the form of the following contingency table:
## Believe
## Gender Yes No
## Female 435 375
## Male 147 134

Our task is to check if there is a significant relationship between
the belief in the afterlife and gender. We can perform this procedure
with the simple chi-square statistics and chosen qualitative correlation
coefficient (two-way 2x2 table).
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: dane
## X-squared = 0.11103, df = 1, p-value = 0.739
## Believe
## Gender Yes No
## Female 0.3987168 0.3437214
## Male 0.1347388 0.1228231
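As you can see, we can calculate the chi-square statistic really
quickly for two-way tables or larger. With a p-value of 0.739, there is
no evidence of a significant relationship between gender and belief in
the afterlife. Now we can standardize this contingency measure to see
how strong the relationship is.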
## [1] 0.01218871
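The value above appears consistent with the phi coefficient, the square root of the uncorrected chi-square statistic divided by the sample size. The sketch below rebuilds the table from the counts given in the exercise; the object name dane follows the test output, while the exact code used in the lab is an assumption.

```r
# Contingency table from the exercise (counts copied from the text)
dane <- matrix(c(435, 147, 375, 134), nrow = 2,
               dimnames = list(Gender = c("Female", "Male"),
                               Believe = c("Yes", "No")))

# Chi-square statistic without Yates' continuity correction
chi2 <- chisq.test(dane, correct = FALSE)$statistic

# Phi coefficient for a 2x2 table: sqrt(chi^2 / n)
phi <- unname(sqrt(chi2 / sum(dane)))
print(phi)

# Table of joint proportions, as shown in the output above
print(prop.table(dane))
```

A phi this close to zero means the association is negligible, which agrees with the non-significant test result.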


Exercise 2. Contingency analysis for the ‘Titanic’ data.
Let’s consider the titanic dataset, which contains a complete list of
passengers and crew members on the RMS Titanic. It includes a variable
indicating whether a person survived the sinking of the RMS Titanic
on April 15, 1912. The data frame contains 2456 observations on 14
variables.
The website http://www.encyclopedia-titanica.org/ offers detailed
information about passengers and crew members on the RMS Titanic.
According to the website, 1317 passengers and 890 crew members were
aboard.
8 musicians and 9 employees of the shipyard company are listed as
passengers but travelled on a free ticket, which is why they have NA
values in fare. In addition, fare is truly missing for a few
regular passengers.
head(titanic)
## Status Disembarked.at Home.Country Age Year.of.Birth
## DE GRASSE, Mr J. Cherbourg NA NA
## EVANS, Miss Cherbourg NA NA
## MULLEN, Cherbourg NA NA
## WOTTON, Mr Henry Swaffin Cherbourg 54 1858
## BRAND, Mr Cherbourg NA NA
## FLETCHER, Miss N. Cherbourg NA NA
## Crew.or.Passenger. Gender Class...Department
## DE GRASSE, Mr J. Passenger Male 2nd Class
## EVANS, Miss Passenger Female 2nd Class
## MULLEN, Passenger Female 2nd Class
## WOTTON, Mr Henry Swaffin Passenger Male 1st Class
## BRAND, Mr Passenger Male 1st Class
## FLETCHER, Miss N. Passenger Female 1st Class
## Embarked Job Job.details
## DE GRASSE, Mr J. Southampton
## EVANS, Miss Southampton
## MULLEN, Southampton
## WOTTON, Mr Henry Swaffin Southampton Butcher Butcher's Shop Proprietor
## BRAND, Mr Southampton
## FLETCHER, Miss N. Southampton
## Ticket.Number Fare.Price Fare_GBP Fare_today
## DE GRASSE, Mr J. 761 P1 1.0 82.110
## EVANS, Miss 88 P1 1.0 82.110
## MULLEN, 404 P1 1.0 82.110
## WOTTON, Mr Henry Swaffin 86 P1 10s 1.5 123.165
## BRAND, Mr 8 P1 10s 1.5 123.165
## FLETCHER, Miss N. 405 P1 10s 1.5 123.165
## Profile.on.Encyclopedia.Titanica
## DE GRASSE, Mr J. http://www.encyclopedia-titanica.org/titanic-biography/j-de-grasse.html
## EVANS, Miss http://www.encyclopedia-titanica.org/titanic-biography/evans.html
## MULLEN, http://www.encyclopedia-titanica.org/titanic-biography/mullen.html
## WOTTON, Mr Henry Swaffin http://www.encyclopedia-titanica.org/titanic-cross-channel-passenger/henry-swaffin-wotton.html
## BRAND, Mr http://www.encyclopedia-titanica.org/titanic-biography/brand.html
## FLETCHER, Miss N. http://www.encyclopedia-titanica.org/titanic-biography/n-fletcher.html
summary(titanic)
## Status Disembarked.at Home.Country Age
## Length:2456 Length:2456 Length:2456 Min. : 0.17
## Class :character Class :character Class :character 1st Qu.:23.00
## Mode :character Mode :character Mode :character Median :29.00
## Mean :30.59
## 3rd Qu.:38.00
## Max. :74.00
## NA's :32
## Year.of.Birth Crew.or.Passenger. Gender Class...Department
## Min. :1837 Length:2456 Length:2456 Length:2456
## 1st Qu.:1874 Class :character Class :character Class :character
## Median :1882 Mode :character Mode :character Mode :character
## Mean :1881
## 3rd Qu.:1889
## Max. :1973
## NA's :32
## Embarked Job Job.details Ticket.Number
## Length:2456 Length:2456 Length:2456 Length:2456
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## Fare.Price Fare_GBP Fare_today
## Length:2456 Min. : 1.00 Min. : 82.11
## Class :character 1st Qu.: 7.90 1st Qu.: 648.33
## Mode :character Median : 14.45 Median : 1186.83
## Mean : 33.16 Mean : 2722.71
## 3rd Qu.: 31.07 3rd Qu.: 2551.06
## Max. :512.33 Max. :42067.35
## NA's :1136 NA's :1136
## Profile.on.Encyclopedia.Titanica
## Length:2456
## Class :character
## Mode :character
##
##
##
##
# Replace NA in fare (column Fare_GBP) and in age with the respective median
titanic$Fare_GBP[is.na(titanic$Fare_GBP)] <- median(titanic$Fare_GBP, na.rm = TRUE)
titanic$Age[is.na(titanic$Age)] <- median(titanic$Age, na.rm = TRUE)

# Proportion of each status within 5-year age bins
ggplot(titanic, aes(x = Age, fill = factor(Status))) +
  geom_histogram(binwidth = 5, position = "fill", color = "black") +
  labs(y = "Proportion", x = "Age", fill = "Status", title = "Survival by Age") +
  theme_minimal()

# contingency table of survival by country
survival_by_country <- table(titanic$Status, titanic$Home.Country)
chisq.test(survival_by_country)
## Warning in chisq.test(survival_by_country): Chi-squared approximation may be
## incorrect
##
## Pearson's Chi-squared test
##
## data: survival_by_country
## X-squared = 1696.2, df = 84, p-value < 2.2e-16
# contingency table of survival by gender
survival_by_gender <- table(titanic$Status, titanic$Gender)
chisq.test(survival_by_gender)
##
## Pearson's Chi-squared test
##
## data: survival_by_gender
## X-squared = 552.06, df = 2, p-value < 2.2e-16
# contingency table of survival by age
survival_by_age <- table(titanic$Status, titanic$Age)
chisq.test(survival_by_age)
## Warning in chisq.test(survival_by_age): Chi-squared approximation may be
## incorrect
##
## Pearson's Chi-squared test
##
## data: survival_by_age
## X-squared = 308.5, df = 154, p-value = 2.135e-12
From the plot we can see that there is a weak relationship between
age and survival: in general, a larger share of younger than older
people survived, although the share rises again for people around 60
years old. In the chi-squared tests, the p-value is extremely low for
every tested variable, which means that survival status depends
significantly on home country, gender and age. For home country and age,
however, R warns that the chi-squared approximation may be incorrect, so
these two results should be read with caution.
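One possible way to reduce such warnings for a numeric variable like age is to group it into a few classes before building the contingency table. This is a minimal sketch on the built-in infert data (an assumption; it is not the titanic data and is used only to show the mechanics), with hypothetical break points.

```r
# Illustrative data (assumption: infert instead of titanic)
age <- infert$age       # numeric age, many distinct values
status <- infert$case   # binary status

# Raw age gives one column per distinct age, so many cells are sparse
tab_raw <- table(status, age)

# Group age into 5-year classes (break points chosen for illustration)
age_group <- cut(age, breaks = c(20, 25, 30, 35, 40, 45))
tab_binned <- table(status, age_group)

# Compare the smallest expected counts under independence
exp_raw <- suppressWarnings(chisq.test(tab_raw))$expected
exp_binned <- suppressWarnings(chisq.test(tab_binned))$expected
print(c(columns_raw = ncol(tab_raw), columns_binned = ncol(tab_binned)))
print(c(min_expected_raw = min(exp_raw), min_expected_binned = min(exp_binned)))
```

Fewer, wider classes give larger expected counts and fewer degrees of freedom, at the price of losing detail about age.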
