This is our first lab when we are considering 2 dimensions and instead of calculating univariate statistics by groups (or factors) of other variable - we will measure their common relationships based on co-variance and correlation coefficients.
*Please be very careful when choosing the measure of correlation! In case of different measurument scales we have to recode one of the variables into weaker scale.
It would be nice to add some additional plots in the background. Feel free to add your own sections and use external packages.
This time we are going to use a typical credit scoring data with predefined “default” variables and personal demografic and income data. Please take a look closer at headers and descriptions of each variable.
First let’s visualize our quantitative relationships using scatterplots.
## # A tibble: 700 × 14
## age ed employ address income debtinc creddebt othdebt default preddef1
## <dbl> <dbl+l> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl+l> <dbl>
## 1 41 3 [Som… 17 12 176 9.3 11.4 5.01 1 [Yes] 0.808
## 2 27 1 [Did… 10 6 31 17.3 1.36 4.00 0 [No] 0.198
## 3 40 1 [Did… 15 14 55 5.5 0.856 2.17 0 [No] 0.0100
## 4 41 1 [Did… 15 14 120 2.9 2.66 0.821 0 [No] 0.0221
## 5 24 2 [Hig… 2 0 28 17.3 1.79 3.06 1 [Yes] 0.782
## 6 41 2 [Hig… 5 5 25 10.2 0.393 2.16 0 [No] 0.217
## 7 39 1 [Did… 20 9 67 30.6 3.83 16.7 0 [No] 0.186
## 8 43 1 [Did… 12 11 38 3.6 0.129 1.24 0 [No] 0.0147
## 9 24 1 [Did… 3 4 19 24.4 1.36 3.28 1 [Yes] 0.748
## 10 36 1 [Did… 0 13 25 19.7 2.78 2.15 0 [No] 0.815
## # ℹ 690 more rows
## # ℹ 4 more variables: preddef2 <dbl>, preddef3 <dbl>, def <fct>, educ <fct>
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## `geom_smooth()` using formula = 'y ~ x'
## Warning: The following aesthetics were dropped during statistical transformation: size.
## ℹ This can happen when ggplot fails to infer the correct grouping structure in
## the data.
## ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
## variable into a factor?
## [1] 0.574346
## [1] 32.98734
## estimate p.value statistic n gp Method
## 1 0.3194263 4.805085e-18 8.899323 700 1 pearson
correlation by pearson: 0.574346; So as we can see almost 33% of total log of incomes’ variability is explained by differences in age. The rest (67%) is probably explained by other factors.
The partial and semi-partial (also known as part) correlations are used to express the specific portion of variance explained by eliminating the effect of other variables when assessing the correlation between two variables.
Partial correlation holds constant one variable when computing the relations to others. Suppose we want to know the correlation between X and Y holding Z constant for both X and Y. That would be the partial correlation between X and Y controlling for Z.
Semipartial correlation holds Z constant for either X or Y, but not both, so if we wanted to control X for Z, we could compute the semipartial correlation between X and Y holding Z constant for X.
Suppose we want to know the correlation between the log of income and age controlling for years of employment. How highly correlated are these after controlling for tenure?
**There can be more than one control variable.
How can we interpret the obtained partial correlation coefficient? What is the difference between that one and the semi-partial coefficient: difference = 13.4%
Summary: correlation between age and income is not that strong, if we remove employee factor, which determines work experience. Moreover, correlation is sighnificantly less stronger, if income factor is not affected by the work experience. Thus, work experience is more important,than just age.
## [1] 0.1863441
For 2 different scales - like for example this pair of variables: income vs. education levels - we cannot use Pearson’s coefficient. The only possibility is to rank also incomes… and lose some more detailed information about them.
Now, let’s see Kendal’s coefficient of rank correlation (robust for ties).
## [1] 0.1577567
As could be observed, Kendall method doesn’t show significant correlation.
Let’s try to verify if there is a significant relationship between incomes and risk status. First, let’s take a look at the boxplot:
If you would like to compare 1 quantitative variable (income) and 1 dychotomous variable (default status - binary), then you can use point-biserial coefficient:
## [1] 0.07096966
Point-biserial coeffitient doesn’t show any relationship
If you would like to check if there are any nonlinearities between 2 variables, the only possibility (beside transformations and linear analysis) is to calculate “eta” coefficient and compare it with the Pearson’s linear coefficient.
## # Effect Size for ANOVA
##
## Parameter | Eta2 | 95% CI
## -------------------------------
## age | 0.23 | [0.19, 1.00]
##
## - One-sided CIs: upper bound fixed at [1.00].
Eta correlation is weak, so no non-linear correlations observed ## Correlation matrix
We can also prepare the correlation matrix for all quantitative variables stored in our data frame.
We can use ggcorr() function:
## Warning in ggcorr(bank, method = c("pairwise", "pearson"), label = TRUE): data
## in column(s) 'def', 'educ' are not numeric and were ignored
As you can see - the default correlation matrix is not the best idea for all measurement scales (including binary variable “default”).
That’s why now we can perform our bivariate analysis with ggpair with grouping.
Here is what we are about to calculate: - The correlation matrix between age, log_income, employ, address, debtinc, creddebt, and othdebt variable grouped by whether the person has a default status or not. - Plot the distribution of each variable by group - Display the scatter plot with the trend by group
## age logincome employ address debtinc creddebt
## age 1.00000000 0.58098114 0.53260754 0.58720390 0.04378353 0.3396201
## logincome 0.58098114 1.00000000 0.73506990 0.33693923 -0.01177147 0.5458139
## employ 0.53260754 0.73506990 1.00000000 0.28466130 0.01928459 0.4586166
## address 0.58720390 0.33693923 0.28466130 1.00000000 0.04719117 0.2283938
## debtinc 0.04378353 -0.01177147 0.01928459 0.04719117 1.00000000 0.4914285
## creddebt 0.33962012 0.54581388 0.45861655 0.22839380 0.49142847 1.0000000
## othdebt 0.36511656 0.59260331 0.45976770 0.23265114 0.60851671 0.5656813
## othdebt
## age 0.3651166
## logincome 0.5926033
## employ 0.4597677
## address 0.2326511
## debtinc 0.6085167
## creddebt 0.5656813
## othdebt 1.0000000
## age logincome employ address debtinc creddebt othdebt
## age 1.0000000 0.5306996 0.5123606 0.6081712 0.1383360 0.3934238 0.3835157
## logincome 0.5306996 1.0000000 0.6902666 0.4171589 0.1461786 0.7327435 0.7379426
## employ 0.5123606 0.6902666 1.0000000 0.3191779 0.2683807 0.7412512 0.5569447
## address 0.6081712 0.4171589 0.3191779 1.0000000 0.1757972 0.3839202 0.3400486
## debtinc 0.1383360 0.1461786 0.2683807 0.1757972 1.0000000 0.4476288 0.5420365
## creddebt 0.3934238 0.7327435 0.7412512 0.3839202 0.4476288 1.0000000 0.6930729
## othdebt 0.3835157 0.7379426 0.5569447 0.3400486 0.5420365 0.6930729 1.0000000
## Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
## ℹ Please use tidy evaluation idioms with `aes()`.
## ℹ See also `vignette("ggplot2-in-packages")` for more information.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
In case of two variables measured on nominal or ordinal&nominal scale - we are forced to organize so called “contingency” table with frequencies and calculate some kind of the correlation coefficient based on them. This is so called “contingency analysis”.
Let’s consider one example based on our data: verify, if there is any significant correlation between education level and credit risk.
s_data <- data.frame(bank$def, bank$ed)
s_data <- table(bank$def, bank$ed)
s_data_matrix <- as.matrix(s_data)
print(chisq.test(s_data_matrix))
## Warning in chisq.test(s_data_matrix): Chi-squared approximation may be
## incorrect
##
## Pearson's Chi-squared test
##
## data: s_data_matrix
## X-squared = 11.492, df = 4, p-value = 0.02155
Do you believe in the Afterlife? https://nationalpost.com/news/canada/millennials-do-you-believe-in-life-after-life A survey was conducted and a random sample of 1091 questionnaires is given in the form of the following contingency table:
## Believe
## Gender Yes No
## Female 435 375
## Male 147 134
Our task is to check if there is a significant relationship between the belief in the afterlife and gender. We can perform this procedure with the simple chi-square statistics and chosen qualitative correlation coefficient (two-way 2x2 table).
## Call: cohen.kappa1(x = x, w = w, n.obs = n.obs, alpha = alpha, levels = levels,
## w.exp = w.exp)
##
## Cohen Kappa and Weighted Kappa correlation coefficients and confidence boundaries
## lower estimate upper
## unweighted kappa -0.043 0.011 0.065
## weighted kappa -0.043 0.011 0.065
##
## Number of subjects = 1091
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: dane
## X-squared = 0.11103, df = 1, p-value = 0.739
## Believe
## Gender Yes No
## Female 0.3987168 0.3437214
## Male 0.1347388 0.1228231
As you can see we can calculate our chi-square statistic really quickly for two-way tables or larger. Now we can standardize this contingency measure to see if the relationship is significant.
## [1] 0.01218871
## [1] 0.0121878
## [1] 0.01218871
## [1] 0.01218871
As could be observed, no significant relation between gender and the
belief in the afterlife exists.
Let’s consider the titanic dataset which contains a complete list of passengers and crew members on the RMS Titanic. It includes a variable indicating whether a person did survive the sinking of the RMS Titanic on April 15, 1912. A data frame contains 2456 observations on 14 variables.
The website http://www.encyclopedia-titanica.org/ offers detailed information about passengers and crew members on the RMS Titanic. According to the website 1317 passengers and 890 crew member were aboard.
head(titanic)
## Status Disembarked.at Home.Country Age Year.of.Birth
## DE GRASSE, Mr J. Cherbourg NA NA
## EVANS, Miss Cherbourg NA NA
## MULLEN, Cherbourg NA NA
## WOTTON, Mr Henry Swaffin Cherbourg 54 1858
## BRAND, Mr Cherbourg NA NA
## FLETCHER, Miss N. Cherbourg NA NA
## Crew.or.Passenger. Gender Class...Department
## DE GRASSE, Mr J. Passenger Male 2nd Class
## EVANS, Miss Passenger Female 2nd Class
## MULLEN, Passenger Female 2nd Class
## WOTTON, Mr Henry Swaffin Passenger Male 1st Class
## BRAND, Mr Passenger Male 1st Class
## FLETCHER, Miss N. Passenger Female 1st Class
## Embarked Job Job.details
## DE GRASSE, Mr J. Southampton
## EVANS, Miss Southampton
## MULLEN, Southampton
## WOTTON, Mr Henry Swaffin Southampton Butcher Butcher's Shop Proprietor
## BRAND, Mr Southampton
## FLETCHER, Miss N. Southampton
## Ticket.Number Fare.Price Fare_GBP Fare_today
## DE GRASSE, Mr J. 761 P1 1.0 82.110
## EVANS, Miss 88 P1 1.0 82.110
## MULLEN, 404 P1 1.0 82.110
## WOTTON, Mr Henry Swaffin 86 P1 10s 1.5 123.165
## BRAND, Mr 8 P1 10s 1.5 123.165
## FLETCHER, Miss N. 405 P1 10s 1.5 123.165
## Profile.on.Encyclopedia.Titanica
## DE GRASSE, Mr J. http://www.encyclopedia-titanica.org/titanic-biography/j-de-grasse.html
## EVANS, Miss http://www.encyclopedia-titanica.org/titanic-biography/evans.html
## MULLEN, http://www.encyclopedia-titanica.org/titanic-biography/mullen.html
## WOTTON, Mr Henry Swaffin http://www.encyclopedia-titanica.org/titanic-cross-channel-passenger/henry-swaffin-wotton.html
## BRAND, Mr http://www.encyclopedia-titanica.org/titanic-biography/brand.html
## FLETCHER, Miss N. http://www.encyclopedia-titanica.org/titanic-biography/n-fletcher.html
8 musicians and 9 employees of the shipyard company are listed as passengers, but travelled with a free ticket, which is why they have NA values in fare. In addition to that, fare is truely missing for a few regular passengers.
# your answer here
library(naniar)
## Warning: package 'naniar' was built under R version 4.3.3
library(dlookr)
## Warning: package 'dlookr' was built under R version 4.3.3
## Registered S3 methods overwritten by 'dlookr':
## method from
## plot.transform scales
## print.transform scales
##
## Attaching package: 'dlookr'
## The following object is masked from 'package:psych':
##
## describe
## The following object is masked from 'package:tidyr':
##
## extract
## The following object is masked from 'package:base':
##
## transform
vis_miss(titanic, cluster = TRUE)
titanic%>%
miss_case_table()
## # A tibble: 3 × 3
## n_miss_in_case n_cases pct_cases
## <int> <int> <dbl>
## 1 0 1296 52.8
## 2 2 1152 46.9
## 3 4 8 0.326
mean_age <- mean(titanic$Age, na.rm = TRUE)
titanic$Age[is.na(titanic$Age)] <- mean_age
titanic$Crew_or_Passenger_Flag <- ifelse(titanic$Crew.or.Passenger. == 'Passenger', 0, 1)
titanic$Gender_binary <- ifelse(titanic$Gender == 'Male', 0, 1)
titanic$Class_Department_Flag <- ifelse(titanic$Class...Department == '1st Class', 3,
ifelse(titanic$Class...Department == '2nd Class', 2,
ifelse(titanic$Class...Department == '3rd Class', 1, NA)))
titanic$Status_Flag <- ifelse(titanic$Status == 'Victim', 0, 1)
ggplot(titanic, aes(x = Gender)) +
geom_bar(fill = "lightblue", color = "black") +
labs(title = "Distribution of Nominal Variable", x = "Categories", y = "Count") +
theme_minimal()
library(psych)
# Assuming your binary variables are binary_var1 and binary_var2
phi_coefficient_gender <- phi(table(titanic$Status_Flag, titanic$Gender_binary))
print(phi_coefficient_gender)
## Phi (adj.) | 95% CI
## -------------------------
## 0.37 | [0.33, 1.00]
##
## - One-sided CIs: upper bound fixed at [1.00].
s_data <- data.frame(titanic$Status_Flag, titanic$Crew_or_Passenger_Flag)
s_data <- table(titanic$Status_Flag, titanic$Crew_or_Passenger_Flag)
s_data_matrix <- as.matrix(s_data)
print(chisq.test(s_data_matrix))
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: s_data_matrix
## X-squared = 0.25137, df = 1, p-value = 0.6161
s_data <- data.frame(titanic$Status_Flag, titanic$Class_Department_Flag)
s_data <- table(titanic$Status_Flag, titanic$Class_Department_Flag)
s_data_matrix <- as.matrix(s_data)
print(chisq.test(s_data_matrix))
##
## Pearson's Chi-squared test
##
## data: s_data_matrix
## X-squared = 154.78, df = 2, p-value < 2.2e-16
point_biserial <- biserial.cor(titanic$Age, titanic$Status_Flag)
print(point_biserial)
## [1] -0.01473289
Considering results of all correlation tests, the most chances to survive had members of first class. Overall, ticket to better class significantly increased chances of survival for its holder.Also, women had slightly higher probability to evacuate. Between crew and passengers there is almost no difference, as well as while looking at the age of passengers.