Introduction

This is our first lab in which we consider two dimensions. Instead of calculating univariate statistics by groups (or factors) of another variable, we will measure the joint relationship between variables using covariance and correlation coefficients.

Please be very careful when choosing the measure of correlation! If the two variables are measured on different scales, we have to recode one of them to the weaker scale.
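The rule above can be illustrated with a minimal sketch on simulated data. The variables `income` and `education` and all parameter values are hypothetical and are not taken from the lab's dataset; the point is only that a quantitative variable paired with an ordinal one calls for a rank-based coefficient, and that a nominal partner calls for recoding into classes.

```r
set.seed(42)

# Hypothetical data: `income` (ratio scale) and `education` (ordinal scale)
n <- 200
income <- rlnorm(n, meanlog = 8, sdlog = 0.5)
education <- cut(log(income) + rnorm(n, sd = 0.4),
                 breaks = c(-Inf, 7.6, 8.0, 8.4, Inf),
                 labels = c("primary", "secondary", "bachelor", "master"),
                 ordered_result = TRUE)

# Ratio vs ordinal: move to the weaker (ordinal) scale and use rank-based measures
rho <- cor(income, as.integer(education), method = "spearman")
tau <- cor(income, as.integer(education), method = "kendall")

# If the second variable were only nominal, the quantitative variable would be
# recoded into classes and the resulting contingency table analysed instead
income_class <- cut(income,
                    breaks = quantile(income, probs = c(0, 1/3, 2/3, 1)),
                    labels = c("low", "medium", "high"),
                    include.lowest = TRUE)
tab <- table(income_class, education)

rho
tau
tab
```

Pearson's coefficient is kept for pairs of quantitative variables; `cor()` switches between the three measures through its `method` argument.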

It would be nice to add some additional plots in the background. Feel free to add your own sections and use external packages.

Data

This time we are going to use a typical credit scoring dataset with a predefined “default” variable and personal demographic and income data. Please take a closer look at the headers and descriptions of each variable.

Scatterplots

First let’s visualize our quantitative relationships using scatterplots.


## [1] 0.574346
##    estimate      p.value statistic   n gp  Method
## 1 0.3194263 4.805085e-18  8.899323 700  1 pearson
## [1] 0.1577567
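The code behind the plots and coefficients above is not shown, so the following is only a sketch of the same kind of workflow on simulated data. The variables `age` and `income` and their parameters are hypothetical; only the sample size of 700 mirrors the output above. Base R graphics are used here as an assumption-free alternative to the plotting package used in the lab.

```r
set.seed(123)

# Hypothetical stand-ins for two quantitative columns of the credit data
n <- 700
age <- round(runif(n, min = 20, max = 60))
income <- 15 + 0.8 * age + rnorm(n, sd = 20)

# Scatterplot with a fitted least-squares line
plot(age, income, pch = 19, col = "grey40",
     xlab = "Age (years)", ylab = "Income (thousands)")
abline(lm(income ~ age), col = "red", lwd = 2)

# Pearson correlation coefficient and its significance test
r <- cor(age, income, method = "pearson")
ct <- cor.test(age, income, method = "pearson")

r
ct$estimate
ct$p.value
```

`cor()` returns only the coefficient, while `cor.test()` also reports the test statistic and p-value, which is the information summarized in the one-row table above.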

Exercise 1. Contingency analysis.

Do you believe in the afterlife? (https://nationalpost.com/news/canada/millennials-do-you-believe-in-life-after-life) A survey was conducted, and a random sample of 1091 questionnaires is summarized in the following contingency table:

##         Believe
## Gender   Yes  No
##   Female 435 375
##   Male   147 134

Our task is to check whether there is a significant relationship between belief in the afterlife and gender. For a two-way 2x2 table, we can do this with a simple chi-square test and a chosen correlation coefficient for qualitative variables.

## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  dane
## X-squared = 0.11103, df = 1, p-value = 0.739
##         Believe
## Gender         Yes        No
##   Female 0.3987168 0.3437214
##   Male   0.1347388 0.1228231
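The test output and the table of proportions above can be reproduced along the following lines. The object name `dane` is taken from the output; how the original object was built is an assumption, and here it is simply rebuilt from the printed counts.

```r
# Rebuild the 2x2 table from the counts shown above (column-wise fill)
dane <- as.table(matrix(c(435, 147, 375, 134), nrow = 2,
                        dimnames = list(Gender = c("Female", "Male"),
                                        Believe = c("Yes", "No"))))

# For 2x2 tables chisq.test() applies Yates' continuity correction by default
test_yates <- chisq.test(dane)
test_yates

# Joint relative frequencies: each cell divided by the grand total of 1091
joint_prop <- prop.table(dane)
joint_prop
```

Passing `margin = 1` to `prop.table()` would instead give row-wise proportions, which compare the share of believers within each gender directly.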

As you can see, the chi-square statistic can be calculated very quickly for two-way tables of size 2x2 or larger. The p-value of 0.739 already indicates that the relationship is not significant. Now we can standardize this contingency measure to see how strong the relationship is.

## [1] 0.01218871
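The value above matches the phi coefficient computed from the chi-square statistic without the continuity correction. Which function produced it in the lab is not shown, so the following is a sketch by hand, with the table rebuilt from the printed counts.

```r
# 2x2 table of gender by belief, rebuilt from the printed counts
dane <- as.table(matrix(c(435, 147, 375, 134), nrow = 2,
                        dimnames = list(Gender = c("Female", "Male"),
                                        Believe = c("Yes", "No"))))

# Chi-square statistic without Yates' correction
chi2 <- unname(chisq.test(dane, correct = FALSE)$statistic)

# Phi coefficient: the statistic standardized by the sample size
n <- sum(dane)
phi <- sqrt(chi2 / n)
phi
```

For a 2x2 table phi coincides with Cramér's V, and a value this close to 0 indicates practically no association.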

Conclusion: with a p-value of 0.739 and a standardized coefficient of about 0.012, there does not seem to be any significant association between gender and belief in the afterlife.

Exercise 2. Contingency analysis for the ‘Titanic’ data.

Let’s consider the Titanic dataset, which contains a complete list of passengers and crew members on the RMS Titanic. It includes a variable indicating whether a person survived the sinking of the RMS Titanic on April 15, 1912. The data frame contains 2456 observations on 14 variables.

The website http://www.encyclopedia-titanica.org/ offers detailed information about passengers and crew members on the RMS Titanic. According to the website, 1317 passengers and 890 crew members were aboard.

Eight musicians and nine employees of the shipyard company are listed as passengers but travelled on a free ticket, which is why they have NA values in fare. In addition, fare is truly missing for a few regular passengers.

# Load the Titanic dataset
download.file("https://github.com/kflisikowski/ds/blob/master/titanic.csv?raw=true", destfile = "titanic.csv", mode = "wb")
titanic <- read.csv("titanic.csv", row.names = 1, sep = ";")

# Show the structure of the data
#str(titanic)

# Summary of the data
#summary(titanic)

# Identify missing values
sapply(titanic, function(x) sum(is.na(x)))
##                           Status                   Disembarked.at 
##                                0                                0 
##                     Home.Country                              Age 
##                                0                               32 
##                    Year.of.Birth               Crew.or.Passenger. 
##                               32                                0 
##                           Gender               Class...Department 
##                                0                                0 
##                         Embarked                              Job 
##                                0                                0 
##                      Job.details                    Ticket.Number 
##                                0                                0 
##                       Fare.Price                         Fare_GBP 
##                                0                             1136 
##                       Fare_today Profile.on.Encyclopedia.Titanica 
##                             1136                                0
# Remove all rows with missing 'Fare_GBP' values (this also drops crew and free-ticket travellers)
titanic_clean <- titanic[!is.na(titanic$Fare_GBP), ]

# Remove the people who disembarked before the sinking (requires dplyr for %>% and filter)
library(dplyr)
titanic_clean <- titanic_clean %>%
  filter(Disembarked.at == "Not Disembarked")

# Confirm the removal of rows with NA 'Fare_GBP' values
sapply(titanic_clean, function(x) sum(is.na(x)))
##                           Status                   Disembarked.at 
##                                0                                0 
##                     Home.Country                              Age 
##                                0                                4 
##                    Year.of.Birth               Crew.or.Passenger. 
##                                4                                0 
##                           Gender               Class...Department 
##                                0                                0 
##                         Embarked                              Job 
##                                0                                0 
##                      Job.details                    Ticket.Number 
##                                0                                0 
##                       Fare.Price                         Fare_GBP 
##                                0                                0 
##                       Fare_today Profile.on.Encyclopedia.Titanica 
##                                0                                0
# Create a contingency table for Survival and Passenger Class
table_survived_class <- table(titanic_clean$Status, titanic_clean$Class...Department)

# Perform chi-square test
chisq_test <- chisq.test(table_survived_class)

# Display the results
print(table_survived_class)
##           
##            1st Class 2nd Class 3rd Class
##   Survivor       200       119       179
##   Victim         117       152       524
print(chisq_test)
## 
##  Pearson's Chi-squared test
## 
## data:  table_survived_class
## X-squared = 134.68, df = 2, p-value < 2.2e-16
# Plotting the contingency table
mosaicplot(table_survived_class, main = "Survival by Passenger Class", shade = TRUE)

# Create a contingency table for Survival and Gender
table_survived_gender <- table(titanic_clean$Status, titanic_clean$Gender)

# Perform chi-square test
chisq_test_gender <- chisq.test(table_survived_gender)

# Display the results
print(table_survived_gender)
##           
##            Female Male
##   Survivor    339  159
##   Victim      127  666
print(chisq_test_gender)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  table_survived_gender
## X-squared = 357.12, df = 1, p-value < 2.2e-16
# Plotting the contingency table
mosaicplot(table_survived_gender, main = "Survival by Gender", shade = TRUE)

In both cases the p-value is below 2.2e-16, so the observed differences in the tables are extremely unlikely to have occurred by random chance alone. Survival is significantly associated with both passenger class and gender.
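Significance alone does not say how strong these associations are. As a possible extension, the sketch below computes Cramér's V for both tables. The helper `cramers_v()` is hypothetical (it is not part of the original analysis), and the tables are rebuilt from the counts printed above rather than from `titanic_clean`.

```r
# Hypothetical helper: Cramér's V from the uncorrected chi-square statistic
cramers_v <- function(tab) {
  chi2 <- unname(chisq.test(tab, correct = FALSE)$statistic)
  n <- sum(tab)
  k <- min(dim(tab)) - 1
  sqrt(chi2 / (n * k))
}

# Survival by passenger class, rebuilt from the printed counts
class_tab <- as.table(matrix(c(200, 117, 119, 152, 179, 524), nrow = 2,
                             dimnames = list(Status = c("Survivor", "Victim"),
                                             Class = c("1st Class", "2nd Class", "3rd Class"))))

# Survival by gender, rebuilt from the printed counts
gender_tab <- as.table(matrix(c(339, 127, 159, 666), nrow = 2,
                              dimnames = list(Status = c("Survivor", "Victim"),
                                              Gender = c("Female", "Male"))))

v_class <- cramers_v(class_tab)
v_gender <- cramers_v(gender_tab)

v_class
v_gender
```

On these counts V is roughly 0.32 for passenger class and roughly 0.53 for gender, which suggests that gender was the stronger of the two factors for survival.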

