---
title: 'Descriptive Statistics'
subtitle: 'Bivariate Analysis'
date: "`r Sys.Date()`"
author: "Krzysztof Świrydczuk, Marcin Tyszka, Jakub Tarnawski"
output:
  html_document: 
    theme: cerulean
    highlight: textmate
    fontsize: 10pt
    toc: yes
    code_download: yes
    toc_float:
      collapsed: no
    df_print: default
    toc_depth: 5
editor_options: 
  markdown: 
    wrap: 72
---

```{r setup,	message = FALSE,	warning = FALSE,	include = FALSE}
library(dplyr)
library(tidyverse)
library(HSAUR3)
library(haven)
library(ggplot2)
library(gridExtra)
library(ppcor) # this package computes partial and semipartial correlations.
library(ltm) # this package computes point-biserial correlations.
library(devtools)
# Install "ryouready" from GitHub only if it is not already available.
if (!requireNamespace("ryouready", quietly = TRUE)) {
  install_github("markheckmann/ryouready")
}
library(ryouready) # this package computes nonlinear "eta" correlations.
library(GGally) # this package computes correlation matrix.
library(psych) # this package computes qualitative correlations.
library(DescTools) # this package computes qualitative correlations.
library(ggpubr)
```

## Introduction

This is our first lab in which we consider two dimensions. Instead of calculating univariate statistics by groups (or factors) of another variable, we will measure the relationship between two variables using covariance and correlation coefficients.

\*Please be very careful when choosing the measure of correlation! When the variables are measured on different scales, we have to recode one of them to the weaker scale.

It would be nice to add some additional plots in the background. Feel free to add your own sections and use external packages.

## Data

This time we are going to use typical credit scoring data with a predefined "default" variable plus personal demographic and income data. Please take a closer look at the headers and descriptions of each variable.

```{r load-data, warning=TRUE, include=FALSE}
download.file("https://github.com/kflisikowski/ds/blob/master/bank_defaults.sav?raw=true", destfile ="bank_defaults.sav",mode="wb")
bank_defaults <- read_sav("bank_defaults.sav")
bank<-na.omit(bank_defaults)
bank$def<-as.factor(bank$default)
bank$educ<-as.factor(bank$ed)
```

## Scatterplots

First let's visualize our quantitative relationships using scatterplots.

```{r echo=FALSE, warning=TRUE}
# Basic scatter plot
attach(bank)
ggplot(bank, aes(age, income)) +
  geom_point()

# Scale the point size by income
ggplot(bank, aes(age, income, size = income)) +
  geom_point()

```

You can also reduce the skewness of the income distribution by taking logarithms:

```{r echo=FALSE, warning=TRUE}
# Basic scatter plot with the log of income
income_log <- log(income)
ggplot(bank, aes(age, income_log)) +
  geom_point()
```

We can add an estimated linear regression line:

```{r echo=FALSE, warning=TRUE}
ggplot(bank, aes(age, income_log)) +
  geom_point() +
  geom_smooth(method = "lm")




```

## Scatterplots by groups

Finally, we can check whether there are any differences between the risk-status groups:

```{r echo=FALSE, warning=TRUE}

ggplot(bank, aes(age, income_log, color = def)) +
  geom_point() +
  geom_smooth(method = "lm")



```

We can also look more closely at whether the two distributions differ by adding their estimated density plots:

```{r echo=FALSE, warning=TRUE}
# scatter plot of x and y variables
# colour by groups
ggplot(bank, aes(age, income_log, color = def)) +
  geom_point()


# Marginal density plot of age by risk status (top panel)
plot1 <- ggplot(bank, aes(age, color = def)) +
  geom_density()

# Marginal density plot of log income by risk status (right panel)
plot2 <- ggplot(bank, aes(income_log, color = def)) +
  geom_density()

```

We can also put those plots together:

```{r echo=FALSE, warning=TRUE}
ggarrange(plot1, plot2)


```

## Scatterplots with density curves

We can also look more closely at whether the two distributions differ by adding their estimated density plots:

```{r echo=FALSE, warning=TRUE}

# Repeated task

```

## Correlation coefficients - Pearson's linear correlation

Ok, let's move to some calculations. In R, we can use the `cor()` function. It takes two variables and a method: `cor(x, y, method)`. For two quantitative variables, with all assumptions met, we can calculate Pearson's coefficient of linear correlation:

```{r echo=FALSE, warning=TRUE}
corr1 <- cor(income_log, age, method = "pearson")

corr2 <- cor(income_log, ed, method = "pearson")


```
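
As a rough check of what `cor()` computes, the sketch below rebuilds Pearson's coefficient from the covariance and the two standard deviations. The data are simulated stand-ins (the names `sim_age` and `sim_log_income` are hypothetical and not part of the bank file), so only the equality of the two results matters here, not the values themselves.

```{r}
# Simulated stand-ins for age and log income (hypothetical data, not the bank file).
set.seed(1)
sim_age <- round(runif(200, 20, 60))
sim_log_income <- 2 + 0.03 * sim_age + rnorm(200, sd = 0.5)

# Pearson's r is the covariance divided by the product of the standard deviations.
r_manual <- cov(sim_age, sim_log_income) / (sd(sim_age) * sd(sim_log_income))
r_builtin <- cor(sim_age, sim_log_income, method = "pearson")

c(manual = r_manual, builtin = r_builtin)
```

Both numbers should agree, which shows that the coefficient is simply a standardized covariance.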

Ok, what about the percentage of the explained variability?

```{r echo=FALSE, warning=TRUE}
corr1 * corr1
corr2 * corr2


```

So, as we can see, about 33% of the total variability of log income is explained by differences in age. The remaining 67% is probably explained by other factors.
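
To see why the squared correlation can be read as a share of explained variability, the following sketch compares it with the R-squared of a simple linear regression. It uses simulated data (`demo_x` and `demo_y` are hypothetical names), so it illustrates the identity rather than the lab results.

```{r}
# Simulated pair of quantitative variables (hypothetical data).
set.seed(2)
demo_x <- rnorm(150)
demo_y <- 1 + 0.6 * demo_x + rnorm(150)

demo_r <- cor(demo_x, demo_y)
demo_fit <- lm(demo_y ~ demo_x)

# In a simple regression the coefficient of determination equals r squared.
c(r_squared = demo_r^2,
  lm_r_squared = summary(demo_fit)$r.squared,
  unexplained = 1 - demo_r^2)
```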

## Partial and semipartial correlation

The partial and semi-partial (also known as part) correlations are used to express the specific portion of variance explained by eliminating the effect of other variables when assessing the correlation between two variables.

Partial correlation holds constant one variable when computing the relations to others. Suppose we want to know the correlation between X and Y holding Z constant for both X and Y. That would be the partial correlation between X and Y controlling for Z.

Semipartial correlation holds Z constant for either X or Y, but not both, so if we wanted to control X for Z, we could compute the semipartial correlation between X and Y holding Z constant for X.

Suppose we want to know the correlation between the log of income and age controlling for years of employment. How highly correlated are these after controlling for tenure?

\*\*There can be more than one control variable.

```{r echo=FALSE, warning=FALSE}
library(ppcor)
correlationData <- data.frame(logOfIncome=income_log,
                              age=bank$age,
                              employment=bank$employ)

partial_corr_result <- pcor.test(income_log, bank$age, bank$employ, method="pearson")
partial_corr_result

semipartial_corr_result <- spcor.test(income_log, bank$age, bank$employ, method="pearson")
semipartial_corr_result


```

How can we interpret the obtained partial correlation coefficient? How does it differ from the semipartial coefficient?

```{r echo=FALSE, warning=FALSE}

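# Both coefficients are positive: with years of employment held constant, log income still tends to rise with age. This is an association only; it does not show that age causes higher income.

# The semipartial coefficient (about 0.22) is lower than the partial one (about 0.32): it removes the effect of employment from only one of the two variables, so it can never exceed the partial coefficient in absolute value.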

```
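
The difference between the two coefficients may be easier to see when they are built from regression residuals. The sketch below is a minimal base R illustration on simulated data (all `pc_` names are hypothetical); here Z is removed from X only for the semipartial version, as in the description above. Which variable `ppcor::spcor.test()` residualizes is best checked in its help page.

```{r}
# Simulated stand-ins (hypothetical): x = log income, y = age, z = years of employment.
set.seed(3)
pc_z <- rnorm(300)
pc_x <- 0.6 * pc_z + rnorm(300)
pc_y <- 0.5 * pc_z + 0.3 * pc_x + rnorm(300)

# Remove the linear effect of z by keeping the regression residuals.
pc_x_res <- resid(lm(pc_x ~ pc_z))
pc_y_res <- resid(lm(pc_y ~ pc_z))

partial_xy <- cor(pc_x_res, pc_y_res)   # z removed from both x and y
semipartial_xy <- cor(pc_x_res, pc_y)   # z removed from x only

c(raw = cor(pc_x, pc_y), partial = partial_xy, semipartial = semipartial_xy)
```

The semipartial value is never larger in absolute value than the partial one, because its denominator still contains the full variability of the variable that was not adjusted.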

## Rank correlation

For two variables measured on different scales, such as income and education level, we cannot use Pearson's coefficient. The only option is to rank incomes as well, which loses some of the detailed information they carry.

First, let's see boxplots of income by education levels.

```{r echo=FALSE, warning=TRUE}

ggplot(bank, aes(x = educ, y = income)) +
  geom_boxplot() +
  labs(x = "Education Levels", y = "Income") +
  theme_minimal()


```

Now, let's look at Kendall's coefficient of rank correlation (which handles ties).

```{r echo=FALSE, warning=TRUE}


bank$educnumeric <- as.numeric(bank$educ)
kendall_cor <- cor(bank$income, bank$educnumeric, method = "kendall")

print(kendall_cor)


```
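
The sketch below illustrates what "ranking the incomes" means in practice. It uses simulated data (`sim_educ` and `sim_income` are hypothetical names) and shows two properties of rank coefficients: Spearman's rho is Pearson's r computed on ranks, and Kendall's tau does not change when income is replaced by its logarithm.

```{r}
# Simulated ordinal education levels and skewed incomes (hypothetical data).
set.seed(4)
sim_educ <- sample(1:5, 300, replace = TRUE, prob = c(0.5, 0.25, 0.15, 0.07, 0.03))
sim_income <- exp(3 + 0.15 * sim_educ + rnorm(300, sd = 0.6))

tau <- cor(sim_income, sim_educ, method = "kendall")
rho <- cor(sim_income, sim_educ, method = "spearman")

# Spearman's rho is Pearson's r applied to the ranks of both variables.
rho_by_ranks <- cor(rank(sim_income), rank(sim_educ))

# A monotone transformation such as log() leaves the ranks, and so tau, unchanged.
tau_log <- cor(log(sim_income), sim_educ, method = "kendall")

c(kendall = tau, kendall_log = tau_log, spearman = rho, spearman_ranks = rho_by_ranks)
```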

## Point-biserial correlation

Let's try to verify if there is a significant relationship between incomes and risk status. First, let's take a look at the boxplot:

```{r echo=FALSE, warning=TRUE}
ggplot(bank, aes(x = income_log, y = def)) +
  geom_boxplot()


```

If you would like to compare one quantitative variable (income) and one dichotomous variable (default status, binary), then you can use the point-biserial coefficient:

```{r echo=FALSE, warning=FALSE}

biserial_cor <- biserial.cor(income, def)

print(biserial_cor)
```
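
The point-biserial coefficient is Pearson's r with the binary variable coded as 0 and 1. The sketch below checks this against the textbook formula on simulated data (all `pb_` names are hypothetical). As far as we know, `ltm::biserial.cor()` has a `level` argument that selects the reference category, so its sign may be the opposite of the 0/1 Pearson version; this is worth confirming in its help page.

```{r}
# Simulated default indicator and skewed income (hypothetical data).
set.seed(5)
pb_default <- rbinom(400, 1, 0.25)   # 1 = default, 0 = no default
pb_income <- exp(3.5 - 0.2 * pb_default + rnorm(400, sd = 0.7))

# Pearson's r with the binary variable coded 0/1.
pb_pearson <- cor(pb_income, pb_default)

# Textbook formula: (M1 - M0) / s_n * sqrt(p * q), with s_n the population SD.
pb_m1 <- mean(pb_income[pb_default == 1])
pb_m0 <- mean(pb_income[pb_default == 0])
pb_p <- mean(pb_default)
pb_sd_pop <- sqrt(mean((pb_income - mean(pb_income))^2))
pb_formula <- (pb_m1 - pb_m0) / pb_sd_pop * sqrt(pb_p * (1 - pb_p))

c(pearson = pb_pearson, formula = pb_formula)
```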

## Nonlinear correlation - eta coefficient

If you would like to check whether the relationship between two variables is nonlinear, the only option (besides transformations and linear analysis) is to calculate the "eta" coefficient and compare it with Pearson's linear coefficient.

```{r echo=FALSE, warning=FALSE}
str(income_log)
str(def)

#eta_cor <- eta(def, income_log)

# The call above failed for us and we do not know why: income_log and def are both vectors.

```
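
Since the packaged function did not run here, one possible workaround is to compute eta directly from its definition: the square root of the between-group sum of squares divided by the total sum of squares. The sketch below does this in base R on simulated data with a deliberately quadratic relationship (all `sim_` names and the choice of 10 bins are assumptions made for illustration).

```{r}
# Simulated nonlinear relationship (hypothetical data): y depends on x squared.
set.seed(6)
sim_x <- runif(500, -3, 3)
sim_y <- sim_x^2 + rnorm(500)

# Eta needs groups, so the quantitative x is cut into 10 intervals.
sim_bins <- cut(sim_x, breaks = 10)

# Mean of y within each bin, repeated for every observation in that bin.
sim_group_means <- ave(sim_y, sim_bins)

ss_between <- sum((sim_group_means - mean(sim_y))^2)
ss_total <- sum((sim_y - mean(sim_y))^2)
eta_coef <- sqrt(ss_between / ss_total)

c(pearson = cor(sim_x, sim_y), eta = eta_coef)
```

A Pearson coefficient near zero together with a large eta points to a strong but nonlinear relationship. With the lab data, the same decomposition could be applied with `def` as the grouping variable and `income_log` as the quantitative one.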

## Correlation matrix

We can also prepare the correlation matrix for all quantitative variables stored in our data frame.

We can use ggcorr() function:

```{r echo=FALSE, warning=TRUE}
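# Keep only the numeric columns; avoid naming the object "matrix", which masks base::matrix().
bank_numeric <- bank[, sapply(bank, is.numeric)]

ggcorr(bank_numeric)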

```

As you can see, the default correlation matrix is not well suited to all measurement scales (it includes the binary variable "default").

That's why we now turn to a bivariate analysis with `ggpairs()` and grouping.

## Correlation matrix with scatterplots

Here is what we are about to calculate:

- The correlation matrix between the age, log_income, employ, address, debtinc, creddebt, and othdebt variables, grouped by whether the person has a default status or not.
- The distribution of each variable by group.
- The scatter plot with the trend by group.

```{r echo=FALSE, warning=TRUE}

# We have not been able to do this one.


```
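
One possible way to approach this task is sketched below on a small simulated data frame (`sim_credit` and its columns are hypothetical stand-ins for the bank data). The base R part computes a correlation matrix within each default group; the `ggpairs()` call is only run if GGally is installed and is meant as a starting point, not a finished solution.

```{r}
# Simulated credit data (hypothetical stand-in for the bank data frame).
set.seed(7)
sim_credit <- data.frame(
  age = round(runif(120, 20, 60)),
  employ = rpois(120, 8),
  debtinc = rexp(120, 0.1),
  def = factor(rbinom(120, 1, 0.3))
)

# Correlation matrix of the numeric columns within each default group (base R).
cor_by_group <- lapply(split(sim_credit[, 1:3], sim_credit$def), cor)
cor_by_group

# If GGally is available, ggpairs() combines per-group densities, scatterplots
# with a fitted trend, and per-group correlations in one figure.
if (requireNamespace("GGally", quietly = TRUE)) {
  print(GGally::ggpairs(
    sim_credit,
    columns = 1:3,
    mapping = ggplot2::aes(colour = def),
    lower = list(continuous = "smooth")
  ))
}
```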

## Qualitative data

When two variables are measured on a nominal scale, or on a mix of ordinal and nominal scales, we have to organize the data into a so-called "contingency" table of frequencies and calculate a correlation coefficient based on it. This is called "contingency analysis".

Let's consider one example based on our data: verify whether there is a significant relationship between education level and credit risk.

```{r}
cont_table <-table(ed, def)

chisq.test(cont_table)

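# p = 0.022 is below 0.05, so we reject independence: education level and default status are associated.
# R also warns that the chi-squared approximation may be incorrect, which suggests some expected counts are small.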

```

## Exercise 1. Contingency analysis.

Do you believe in the Afterlife? <https://nationalpost.com/news/canada/millennials-do-you-believe-in-life-after-life> A survey was conducted and a random sample of 1091 questionnaires is given in the form of the following contingency table:

```{r echo=FALSE, warning=FALSE}
x=c(435,147,375,134)
dim(x)=c(2,2)
dane<-as.table(x)
dimnames(dane)=list(Gender=c('Female','Male'),Believe=c('Yes','No'))
dane
fourfoldplot(dane)
```

Our task is to check whether there is a significant relationship between belief in the afterlife and gender. We can do this with a simple chi-square statistic and a chosen correlation coefficient for qualitative data (two-way 2x2 table).

```{r echo=FALSE, warning=FALSE}
yes<-c(435,147)
no<-c(375,134)
#cohen.kappa(cbind(yes,no))
chisq.test(dane)
prop.table(dane)
```
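
To make the chi-square statistic less of a black box, the sketch below recomputes it from observed and expected frequencies for the same afterlife table. The object names (`belief_obs`, `belief_exp`) are introduced here for illustration only. Note that `chisq.test()` applies Yates' continuity correction to 2x2 tables by default, so the comparison uses `correct = FALSE`.

```{r}
# The afterlife table from the exercise, rebuilt so that the chunk is self-contained.
belief_obs <- matrix(c(435, 147, 375, 134), nrow = 2,
                     dimnames = list(Gender = c("Female", "Male"),
                                     Believe = c("Yes", "No")))

# Expected counts under independence: row total * column total / grand total.
belief_exp <- outer(rowSums(belief_obs), colSums(belief_obs)) / sum(belief_obs)

# Chi-square statistic: sum of (observed - expected)^2 / expected over all cells.
chi_sq_manual <- sum((belief_obs - belief_exp)^2 / belief_exp)
chi_sq_builtin <- chisq.test(belief_obs, correct = FALSE)

c(manual = chi_sq_manual, builtin = unname(chi_sq_builtin$statistic))
```

The uncorrected statistic is slightly larger than the Yates-corrected value reported by the default call, but the conclusion is the same: there is no evidence of a relationship.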

As you can see, we can calculate the chi-square statistic very quickly for two-way tables or larger. Now we can standardize this contingency measure to see how strong the relationship is.

```{r echo=FALSE, warning=FALSE}
Phi(dane)
#?ContCoef
#ContCoef(dane)
#CramerV(dane)
#TschuprowT(dane)
mosaicplot(dane)
barplot(dane)
```
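
The standardized coefficients are simple functions of the chi-square statistic and the sample size. The sketch below computes phi, Cramér's V and the contingency coefficient by hand for the afterlife table, using the uncorrected chi-square; the object names are introduced here for illustration and the formulas are the usual textbook ones.

```{r}
# The afterlife table again, so that the chunk runs on its own.
belief_tab <- matrix(c(435, 147, 375, 134), nrow = 2,
                     dimnames = list(Gender = c("Female", "Male"),
                                     Believe = c("Yes", "No")))

belief_chi_sq <- unname(chisq.test(belief_tab, correct = FALSE)$statistic)
belief_n <- sum(belief_tab)

# Phi: square root of chi-square divided by n (intended for 2x2 tables).
phi_manual <- sqrt(belief_chi_sq / belief_n)

# Cramér's V rescales by the smaller table dimension minus one, so it also works for larger tables.
cramers_v_manual <- sqrt(belief_chi_sq / (belief_n * (min(dim(belief_tab)) - 1)))

# Pearson's contingency coefficient.
cont_coef_manual <- sqrt(belief_chi_sq / (belief_chi_sq + belief_n))

c(phi = phi_manual, cramers_v = cramers_v_manual, contingency_c = cont_coef_manual)
```

For a 2x2 table phi and Cramér's V coincide, and a value this close to zero indicates practically no association between gender and belief.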

## Exercise 2. Contingency analysis for the 'Titanic' data.

Let's consider the titanic dataset, which contains a complete list of passengers and crew members on the RMS Titanic. It includes a variable indicating whether a person survived the sinking of the RMS Titanic on April 15, 1912. The data frame contains 2456 observations on 14 variables.

```{r load-data2, warning=TRUE, include=FALSE}
download.file("https://github.com/kflisikowski/ds/blob/master/titanic.csv?raw=true", destfile ="titanic.csv",mode="wb")
titanic <- read.csv("titanic.csv",row.names=1,sep=";")
```

The website <http://www.encyclopedia-titanica.org/> offers detailed information about passengers and crew members on the RMS Titanic. According to the website, 1317 passengers and 890 crew members were aboard.

Eight musicians and nine employees of the shipyard company are listed as passengers but travelled on free tickets, which is why they have NA values in fare. In addition, fare is truly missing for a few regular passengers.

```{r}
# your answer here
# Count the people in each Gender x Status cell.
survivors_males <- sum(titanic$Status == "Survivor" & titanic$Gender == "Male")
survivors_females <- sum(titanic$Status == "Survivor" & titanic$Gender == "Female")
victims_males <- sum(titanic$Status == "Victim" & titanic$Gender == "Male")
victims_females <- sum(titanic$Status == "Victim" & titanic$Gender == "Female")

# Arrange the counts as a 2x2 table (filled column by column).
tit_table <- matrix(
  c(survivors_males, survivors_females, victims_males, victims_females),
  nrow = 2,
  dimnames = list(Gender = c("Male", "Female"), Status = c("Survivor", "Victim"))
)

tit_table

fourfoldplot(tit_table)

chisq.test(tit_table)

prop.table(tit_table)

Phi(tit_table)

mosaicplot(tit_table)

barplot(tit_table)
```
