Data

This time we are going to use a typical credit scoring dataset with predefined “default” variables plus personal demographic and income data. Please take a close look at the header and description of each variable.

Scatterplots

First let’s visualize our quantitative relationships using scatterplots.

Normalizing the skewed distribution of incomes with a log transformation, then plotting

Creating a scatter plot with the logarithmically transformed data
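
A minimal sketch of the log transformation, using simulated data because the credit scoring file itself is not shown here. The variable names `age` and `income` are assumptions, and base R graphics stand in for whichever plotting package the original used.

```r
set.seed(1)
n <- 500
age <- round(runif(n, 20, 65))                  # hypothetical predictor
income <- rlnorm(n, meanlog = 10, sdlog = 0.8)  # right-skewed, like typical incomes
log_income <- log(income)

# Rough skew check: distance between mean and median, in units of sd
skew_gap_raw <- abs(mean(income) - median(income)) / sd(income)
skew_gap_log <- abs(mean(log_income) - median(log_income)) / sd(log_income)

# Side-by-side scatterplots: raw scale vs. log scale
op <- par(mfrow = c(1, 2))
plot(age, income, main = "Raw income")
plot(age, log_income, main = "log(income)")
par(op)
```

On the log scale the points should no longer be squeezed against the bottom of the panel by a few very large incomes.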

Scatterplots by groups

Estimated Density Plots

To see more closely whether there are any differences between those two distributions, we add their estimated density plots.
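
A sketch of overlaid kernel density estimates for two groups. The data are simulated, and the grouping variable `default` with levels "no" and "yes" is an assumption about which two distributions are being compared.

```r
set.seed(2)
log_income <- c(rnorm(300, mean = 10.0, sd = 0.8),
                rnorm(100, mean = 9.8, sd = 0.8))
default <- factor(rep(c("no", "yes"), times = c(300, 100)))  # hypothetical groups

# Kernel density estimate for each group separately
d_no  <- density(log_income[default == "no"])
d_yes <- density(log_income[default == "yes"])

# Overlay both curves on common axes
plot(d_no, main = "Estimated densities of log(income)",
     xlab = "log(income)",
     xlim = range(d_no$x, d_yes$x),
     ylim = range(0, d_no$y, d_yes$y))
lines(d_yes, lty = 2)
legend("topright", legend = c("no default", "default"), lty = 1:2)
```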

Plots Together:

Giving Density Curves to Scatterplots

## `geom_smooth()` using formula = 'y ~ x'

Correlation coefficients - Pearson’s linear (P.L) correlation:

## [1] -0.02677729

Percentage of the explained variability:

## [1] 0.07170234
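
The percentage above is consistent with the squared Pearson coefficient multiplied by 100. The short check below only reuses the coefficient reported earlier; it is an arithmetic illustration, not the original code.

```r
r <- -0.02677729             # Pearson coefficient reported above
explained_pct <- r^2 * 100   # coefficient of determination, in percent
explained_pct
```

In other words, the linear relationship accounts for well under one tenth of a percent of the variability.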

Difference between the partial coefficient and the semi-partial (S-P) coefficient:

## [1] 0.6018467

How can we interpret the obtained partial correlation coefficient? How does it differ from the semi-partial coefficient?

## [1] 0.6018467
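
One way to see the distinction is to compute both coefficients from regression residuals. The sketch below uses simulated variables `x`, `y` and a control variable `z` (all hypothetical) and base R only; it is not the code that produced the value above.

```r
set.seed(3)
n <- 200
z <- rnorm(n)                      # control variable
x <- 0.6 * z + rnorm(n)
y <- 0.5 * x + 0.4 * z + rnorm(n)

# Partial correlation: remove z from BOTH x and y, then correlate the residuals
partial_r <- cor(resid(lm(x ~ z)), resid(lm(y ~ z)))

# Semi-partial (part) correlation: remove z from x only, keep y as it is
semipartial_r <- cor(y, resid(lm(x ~ z)))

c(partial = partial_r, semipartial = semipartial_r)
```

The partial coefficient relates x to the part of y that z does not explain, while the semi-partial coefficient relates the z-free part of x to all of y, so its absolute value can never exceed that of the partial coefficient.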

Rank correlation

When two variables are measured on different scales, for example income (quantitative) versus education level (ordinal), we cannot use Pearson’s coefficient. The only option is to rank the incomes as well, which loses some of the detailed information they carry.

First, let’s see boxplots of income by education levels.

Kendall’s coefficient of rank correlation:

# (robust to ties)

## [1] -0.01224209
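
A small sketch of rank correlation with base R's `cor()`. The education codes and incomes are made-up values for illustration only.

```r
education <- c(1, 1, 2, 2, 3, 3, 4, 4)   # hypothetical ordinal levels (with ties)
income <- c(2100, 2500, 2300, 3100, 2900, 3600, 3400, 5200)

tau <- cor(education, income, method = "kendall")
rho <- cor(education, income, method = "spearman")

# Rank-based coefficients ignore monotone transformations such as log()
tau_log <- cor(education, log(income), method = "kendall")

c(kendall = tau, spearman = rho, kendall_on_log = tau_log)
```

Because only the ordering of the incomes matters, taking logs leaves Kendall's coefficient unchanged, which is exactly the information loss mentioned above.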

Point-biserial correlation

Comparing a quantitative variable (QVar) with a dichotomous variable (DVar)

## [1] -0.07096966
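
The point-biserial coefficient is simply Pearson's coefficient computed with a 0/1 variable. The sketch below checks that against the textbook formula on simulated data; `income` and `default` are hypothetical stand-ins for QVar and DVar.

```r
set.seed(4)
default <- rbinom(300, size = 1, prob = 0.3)               # dichotomous, coded 0/1
income <- rnorm(300, mean = 10 - 0.2 * default, sd = 0.8)  # quantitative

# Point-biserial correlation via Pearson's r
r_pb <- cor(income, default)

# Same value from group means, group share, and the population-style sd
m1 <- mean(income[default == 1])
m0 <- mean(income[default == 0])
p <- mean(default)
s_n <- sqrt(mean((income - mean(income))^2))
r_formula <- (m1 - m0) / s_n * sqrt(p * (1 - p))

c(r_pb = r_pb, r_formula = r_formula)
```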

Eta Coefficient:

## [1] 0.708378
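
The eta coefficient can be obtained as the square root of the between-group share of the total sum of squares. A base R sketch on simulated data follows; the group labels and means are assumptions chosen for illustration.

```r
set.seed(5)
group <- factor(rep(c("primary", "secondary", "higher"), each = 50))
group_means <- c(primary = 9.5, secondary = 10, higher = 10.6)  # hypothetical
y <- rnorm(150, mean = group_means[as.character(group)], sd = 0.7)

# Between-group and total sums of squares
ss_between <- sum(tapply(y, group,
                         function(v) length(v) * (mean(v) - mean(y))^2))
ss_total <- sum((y - mean(y))^2)

eta <- sqrt(ss_between / ss_total)
eta
```

Squaring `eta` gives the same number as the R-squared of a one-way model `lm(y ~ group)`.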

Correlation matrix
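
A minimal sketch of a correlation matrix in base R. The data frame `df` and its three columns are simulated and purely hypothetical; `pairs()` is used here as a base R stand-in for a scatterplot matrix.

```r
set.seed(6)
n <- 200
age <- rnorm(n, mean = 40, sd = 10)
log_income <- 9 + 0.02 * age + rnorm(n, sd = 0.5)
debt <- 0.3 * log_income + rnorm(n)
df <- data.frame(age, log_income, debt)

# Pairwise Pearson correlations between all numeric columns
cm <- cor(df)
round(cm, 2)

# Matrix of pairwise scatterplots for the same variables
pairs(df)
```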

Correlation matrix with scatterplots


## Warning in warn_if_args_exist(list(...)): Extra arguments: "aes_string" are
## being ignored.  If these are meant to be aesthetics, submit them using the
## 'mapping' variable within ggpairs with ggplot2::aes or ggplot2::aes_string.

Exercise 1. Contingency analysis.

##         Believe
## Gender   Yes  No
##   Female 435 375
##   Male   147 134

## Call: cohen.kappa1(x = x, w = w, n.obs = n.obs, alpha = alpha, levels = levels, 
##     w.exp = w.exp)
## 
## Cohen Kappa and Weighted Kappa correlation coefficients and confidence boundaries 
##                   lower estimate upper
## unweighted kappa -0.043    0.011 0.065
## weighted kappa   -0.043    0.011 0.065
## 
##  Number of subjects = 1091
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  dane
## X-squared = 0.11103, df = 1, p-value = 0.739
##         Believe
## Gender         Yes        No
##   Female 0.3987168 0.3437214
##   Male   0.1347388 0.1228231
## [1] 0.01218871
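
The chi-squared statistic, the table of proportions, and the final value can be reproduced from the counts printed above. The sketch rebuilds the table by hand under the name `tab` (the original object is called `dane`). Reading the last number as the phi coefficient based on the uncorrected statistic is an inference from the arithmetic, not something the output states.

```r
tab <- matrix(c(435, 147, 375, 134), nrow = 2,
              dimnames = list(Gender = c("Female", "Male"),
                              Believe = c("Yes", "No")))
n <- sum(tab)

# Yates' continuity correction is applied by default to 2x2 tables
yates <- chisq.test(tab)
# Uncorrected statistic, used for the phi coefficient
plain <- chisq.test(tab, correct = FALSE)

phi <- sqrt(unname(plain$statistic) / n)

prop.table(tab)   # cell shares of the grand total
yates$p.value
phi
```

With a p-value of about 0.74 and phi close to zero, the counts give no evidence of an association between gender and belief.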

Exercise 2. Contingency analysis for the ‘Titanic’ data.

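Dropping rows with missing values:

library(tidyr)     # drop_na() and the %>% pipe
library(DescTools) # Phi(), ContCoef(), CramerV(), TschuprowT() used below

# Drop rows with missing values
titanic <- titanic %>% drop_na()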

titanic$Status <- as.factor(titanic$Status)
titanic$Gender <- as.factor(titanic$Gender)

# Create Contingency Table and Perform Chi-Square Test
contingency_table <- table(titanic$Gender, titanic$Status)
chi_square_test <- chisq.test(contingency_table)
## Warning in chisq.test(contingency_table): Chi-squared approximation may be
## incorrect

Measures of association for the contingency table:

Phi(contingency_table)
## [1] 0.5257638
ContCoef(contingency_table)
## [1] 0.4653638
CramerV(contingency_table)
## [1] 0.5257638
TschuprowT(contingency_table)
## [1] 0.4421129
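
For reference, the four measures can be sketched from the chi-squared statistic using their usual textbook definitions. The 3x2 table below is hypothetical, and the DescTools functions may differ in details, so treat this as an illustration rather than a reimplementation.

```r
# Hypothetical 3x2 table of counts (not the Titanic data)
tab <- matrix(c(120, 60, 5, 40, 150, 10), nrow = 3,
              dimnames = list(Group = c("a", "b", "c"),
                              Status = c("s1", "s2")))

chi2 <- unname(chisq.test(tab, correct = FALSE)$statistic)
n <- sum(tab)
r <- nrow(tab)
k <- ncol(tab)

phi       <- sqrt(chi2 / n)
cont_coef <- sqrt(chi2 / (chi2 + n))                    # Pearson's C
cramer_v  <- sqrt(chi2 / (n * min(r - 1, k - 1)))
tschuprow <- sqrt(chi2 / (n * sqrt((r - 1) * (k - 1))))

c(phi = phi, C = cont_coef, V = cramer_v, T = tschuprow)
```

Under these definitions V and T coincide for a 2x2 table. The values reported above give V/T ≈ 1.189, which is 2^(1/4), the ratio these formulas produce for a 2x3 or 3x2 table. That suggests the table may contain a third category in one of the factors, which would also fit the chi-squared warning, but this is an inference and worth checking with `levels()`.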

Plotting a mosaic plot:

mosaicplot(contingency_table)

Plotting a bar plot:

barplot(contingency_table)

### According to the data, gender and status are clearly associated (Cramér’s V ≈ 0.53), and it was better to be a woman.

