credit_DE <- read.table('./german_credit_data.csv',
header = TRUE,
sep = ',',
dec = '.')
head(credit_DE)
## X Age Sex Job Housing Saving.accounts Checking.account Credit.amount
## 1 0 67 male 2 own <NA> little 1169
## 2 1 22 female 2 own little moderate 5951
## 3 2 49 male 1 own little <NA> 2096
## 4 3 45 male 2 free little little 7882
## 5 4 53 male 2 free little little 4870
## 6 5 35 male 1 free <NA> <NA> 9055
## Duration Purpose
## 1 6 radio/TV
## 2 48 radio/TV
## 3 12 education
## 4 42 furniture/equipment
## 5 24 car
## 6 36 education
str(credit_DE)
## 'data.frame': 1000 obs. of 10 variables:
## $ X : int 0 1 2 3 4 5 6 7 8 9 ...
## $ Age : int 67 22 49 45 53 35 53 35 61 28 ...
## $ Sex : chr "male" "female" "male" "male" ...
## $ Job : int 2 2 1 2 2 1 2 3 1 3 ...
## $ Housing : chr "own" "own" "own" "free" ...
## $ Saving.accounts : chr NA "little" "little" "little" ...
## $ Checking.account: chr "little" "moderate" NA "little" ...
## $ Credit.amount : int 1169 5951 2096 7882 4870 9055 2835 6948 3059 5234 ...
## $ Duration : int 6 48 12 42 24 36 24 36 12 30 ...
## $ Purpose : chr "radio/TV" "radio/TV" "education" "furniture/equipment" ...
Unit of observation: a German citizen with credit history.
Sample size: 1000 observations (or people).
Variables:
The source of the data is the UCI Machine Learning Repository. Here is the link.
#IMPORTANT: The function 'sample_n' draws a different random sample every time the code is run. To keep the results of the tests below reproducible, we have to preserve one particular sample. This is why the next lines are commented out with #-signs; they are shown for demonstration only:
#library(dplyr)
#library(tidyr)
#credit_DE130 <- credit_DE[,-c(1,10)] %>%
#drop_na() %>% #some rows have been deleted due to missing data
#mutate(size_of_credit = ifelse(Credit.amount > mean(Credit.amount), 'big', 'small')) %>%
#rename(Gender = Sex,
#'size of credit' = size_of_credit) #new variable has been added
#library(tidyverse)
#credit_DE130 <- sample_n(credit_DE130, 130)
#library(readr)
#write_csv(credit_DE130, 'credit_DE130.csv')
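Another way to make a random draw repeatable is to fix the random seed before sampling. The following is a minimal sketch on made-up data, not part of the original analysis: it uses base R's `sample()` instead of `sample_n`, and the data frame `toy` and the seed value 123 are hypothetical.

```r
# Toy data standing in for the cleaned credit data (hypothetical values)
toy <- data.frame(id = 1:1000,
                  amount = seq(100, by = 10, length.out = 1000))

# Fixing the seed makes the pseudo-random draw repeatable
set.seed(123)                              # arbitrary, hypothetical seed
sample_a <- toy[sample(nrow(toy), 130), ]

set.seed(123)                              # same seed, same draw
sample_b <- toy[sample(nrow(toy), 130), ]

identical(sample_a, sample_b)              # both draws contain the same rows
```

Saving the sample to a CSV file, as done here, remains the more robust choice when the code may be run under different R versions, because the sampling algorithm behind a given seed can change between versions.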
credit_DE130 <- read.table('./credit_DE130.csv',
header=TRUE,
sep = ',',
dec = '.')
credit_DE130$Job <- factor(credit_DE130$Job,
levels = c(0:3),
labels = c('unskilled and non-resident',
'unskilled and resident',
'skilled',
'highly skilled')) #manipulation: converting the numeric job codes into a labelled factor (the other categorical variables are converted below)
credit_DE130$Gender <- factor(credit_DE130$Gender,
levels = c('male','female'),
labels = c('M','F')) # The shorter labels are easier to read. In addition, a factor gives more useful descriptive statistics: the function 'summary', for example, reports the number of observations per level instead of only the length and class of a character vector
credit_DE130$size.of.credit <- factor(credit_DE130$size.of.credit)
credit_DE130$Saving.accounts <- factor(credit_DE130$Saving.accounts)
credit_DE130$Checking.account <- factor(credit_DE130$Checking.account)
credit_DE130$Housing <- factor(credit_DE130$Housing)
credit_DE130 <- credit_DE130[order(credit_DE130$Age),]
str(credit_DE130)
## 'data.frame': 130 obs. of 9 variables:
## $ Age : int 20 21 21 22 22 22 22 22 22 23 ...
## $ Gender : Factor w/ 2 levels "M","F": 1 1 2 1 2 1 2 1 2 2 ...
## $ Job : Factor w/ 4 levels "unskilled and non-resident",..: 3 3 3 3 3 2 3 3 3 3 ...
## $ Housing : Factor w/ 3 levels "free","own","rent": 3 3 3 2 3 3 2 2 3 2 ...
## $ Saving.accounts : Factor w/ 4 levels "little","moderate",..: 1 1 1 3 1 1 1 4 1 3 ...
## $ Checking.account: Factor w/ 3 levels "little","moderate",..: 2 2 1 1 1 2 2 2 1 1 ...
## $ Credit.amount : int 585 2779 1049 2828 3650 276 1670 1007 1366 4281 ...
## $ Duration : int 12 18 18 24 18 9 9 12 9 33 ...
## $ size.of.credit : Factor w/ 2 levels "big","small": 2 2 2 2 1 2 2 2 2 1 ...
head(credit_DE130)
## Age Gender Job Housing Saving.accounts Checking.account
## 99 20 M skilled rent little moderate
## 77 21 M skilled rent little moderate
## 112 21 F skilled rent little little
## 55 22 M skilled own quite rich little
## 74 22 F skilled rent little little
## 86 22 M unskilled and resident rent little moderate
## Credit.amount Duration size.of.credit
## 99 585 12 small
## 77 2779 18 small
## 112 1049 18 small
## 55 2828 24 small
## 74 3650 18 big
## 86 276 9 small
mean(credit_DE130$Age)
## [1] 34.89231
median(credit_DE130$Duration)
## [1] 18
range(credit_DE130$Credit.amount)
## [1] 276 18424
sd(credit_DE130$Credit.amount)
## [1] 2680.965
summary(credit_DE130)
## Age Gender Job Housing
## Min. :20.00 M:90 unskilled and non-resident: 2 free:13
## 1st Qu.:26.00 F:40 unskilled and resident :28 own :93
## Median :31.00 skilled :83 rent:24
## Mean :34.89 highly skilled :17
## 3rd Qu.:40.75
## Max. :70.00
## Saving.accounts Checking.account Credit.amount Duration
## little :100 little :60 Min. : 276 Min. : 6.00
## moderate : 15 moderate:55 1st Qu.: 1332 1st Qu.:12.00
## quite rich: 7 rich :15 Median : 2578 Median :18.00
## rich : 8 Mean : 3178 Mean :20.59
## 3rd Qu.: 4002 3rd Qu.:24.00
## Max. :18424 Max. :60.00
## size.of.credit
## big :44
## small:86
##
##
##
##
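The difference that the factor conversion makes to `summary()`, mentioned in the comment on the Gender variable, can be seen in a small sketch. The vector `gender_chr` holds made-up values and is not taken from the dataset.

```r
# Made-up character vector, standing in for the raw Gender column
gender_chr <- c('male', 'female', 'male', 'male')
summary(gender_chr)        # reports only length, class and mode

# The same values as a factor with shorter labels
gender_fct <- factor(gender_chr,
                     levels = c('male', 'female'),
                     labels = c('M', 'F'))
counts <- summary(gender_fct)   # named vector with one count per level
counts
```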
library(psych)
describeBy(credit_DE130$Credit.amount, group = credit_DE130$Gender)
##
## Descriptive statistics by group
## group: M
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 90 3296.23 2457.67 2627 3002.18 1948.14 276 14896 14620 1.58 3.9
## se
## X1 259.06
## ------------------------------------------------------------
## group: F
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 40 2910.8 3144.02 1858 2252.44 1486.31 683 18424 17741 3.28 12.59
## se
## X1 497.11
library(pastecs)
round(stat.desc(credit_DE130$Credit.amount), 2)
## nbr.val nbr.null nbr.na min max range
## 130.00 0.00 0.00 276.00 18424.00 18148.00
## sum median mean SE.mean CI.mean.0.95 var
## 413093.00 2577.50 3177.64 235.14 465.22 7187571.89
## std.dev coef.var
## 2680.96 0.84
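Three of the values reported by `stat.desc()` can be reproduced by hand from the sample size, mean and standard deviation printed above. This is a sketch using the standard textbook formulas; that `CI.mean.0.95` is the half-width of a t-based interval is an assumption, although it reproduces the reported value.

```r
n <- 130          # nbr.val
m <- 3177.64      # mean
s <- 2680.965     # standard deviation, as returned by sd() above

se <- s / sqrt(n)                   # standard error of the mean
ci <- qt(0.975, df = n - 1) * se    # half-width of the 95% confidence interval
cv <- s / m                         # coefficient of variation

round(c(SE.mean = se, CI.mean.0.95 = ci, coef.var = cv), 2)
```

Under this reading, the 95% confidence interval for the mean credit amount is the mean plus or minus `ci`.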
To begin, we reduce the data to the two variables needed for the analysis:
credit_DE130_short <- credit_DE130[, c('Gender', 'Credit.amount')] #keep only the grouping variable and the outcome
head(credit_DE130_short)
## Gender Credit.amount
## 99 M 585
## 77 M 2779
## 112 F 1049
## 55 M 2828
## 74 F 3650
## 86 M 276
We should also check whether the group sizes in the shortened (long-format) dataset are large enough for the tests to be reliable. This information is available in the output of ‘describeBy()’ run earlier: 40 female and 90 male citizens were observed.
Is there a difference in the credit amounts between male and female German citizens?
We first check whether the credit amounts of male and female German citizens are normally distributed.
As a first step, we look at the histograms of the male and female samples:
library(ggplot2)
##
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
##
## %+%, alpha
ggplot(credit_DE130, aes(x = Credit.amount))+
geom_histogram(binwidth = 700, colour = 'wheat2', fill = 'steelblue', linetype = 'longdash')+
labs(x = 'Credit amount',
y = 'Frequency')+
facet_wrap(~Gender, ncol = 1)+
theme_dark()
Both histograms are right-skewed, although some slight differences between them are visible.
To test formally whether the distributions are normal, we conduct the Shapiro-Wilk test for each group:
H0(M): the distribution of credit amounts taken by male citizens is normal
H1(M): the distribution of credit amounts taken by male citizens isn’t normal
H0(F): the distribution of credit amounts taken by female citizens is normal
H1(F): the distribution of credit amounts taken by female citizens isn’t normal
#install.packages('rstatix')
library(rstatix)
##
## Attaching package: 'rstatix'
## The following object is masked from 'package:stats':
##
## filter
credit_DE130 %>%
group_by(Gender) %>%
shapiro_test(Credit.amount)
## # A tibble: 2 × 4
## Gender variable statistic p
## <fct> <chr> <dbl> <dbl>
## 1 M Credit.amount 0.860 0.0000000995
## 2 F Credit.amount 0.613 0.00000000471
Both p-values are far below 0.001, so the risk of a type I error in rejecting H0 is negligible. We therefore reject both H0 at p < 0.001. The credit amounts are not normally distributed in either the male or the female group (each sample represents a different population).
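The grouped Shapiro-Wilk test can also be written in base R. The sketch below uses simulated right-skewed data rather than the credit data; the data frame `toy`, the seed and the log-normal parameters are all hypothetical.

```r
set.seed(1)   # hypothetical seed, only to make the toy data repeatable

# Simulated data with the same group sizes as the sample (90 M, 40 F)
toy <- data.frame(
  Gender        = factor(rep(c('M', 'F'), times = c(90, 40))),
  Credit.amount = rlnorm(130, meanlog = 8, sdlog = 0.8)  # right-skewed by construction
)

# Run shapiro.test() within each group and keep only the p-values
p_values <- tapply(toy$Credit.amount, toy$Gender,
                   function(x) shapiro.test(x)$p.value)
p_values
```

A small p-value in a group is evidence against normality in that group, which is the same reading as in the `rstatix` output.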
Since each sample is reasonably large (90 and 40 observations), the test results are unlikely to be distorted by a lack of observations. Therefore, we do not strictly need a quantile-quantile plot to examine the distributions more closely.
We draw the Q-Q plot only for demonstration. Although it is not necessary, it gives us a visual confirmation of what we have seen in the Shapiro-Wilk test:
library(ggpubr)
ggqqplot(credit_DE130_short,
'Credit.amount',
facet.by = 'Gender')
Despite these results, for demonstration purposes we conduct the parametric test that corresponds to the research question, which in our case is the independent samples t-test. Besides normality of the distributions, the test requires us to know whether the variances of credit amounts among male and female citizens are equal. As the next step, we therefore conduct Levene's test:
H0: the variances of credit amounts between female and male citizens are equal
H1: the variances of credit amounts between female and male citizens aren’t equal
#install.packages('car')
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:psych':
##
## logit
leveneTest(credit_DE130_short$Credit.amount,
group = credit_DE130_short$Gender)
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 1 0.0206 0.886
## 128
We fail to reject H0 at p = 0.886. There is not enough evidence that the variances of credit amounts differ between the two groups, so we treat them as equal.
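Levene's test with `center = median` amounts to a one-way ANOVA on the absolute deviations of each observation from its group median. The sketch below shows that idea in base R on simulated data; `toy`, the seed and the log-normal parameters are hypothetical and the numbers will not match the output above.

```r
set.seed(2)   # hypothetical seed for the toy data

toy <- data.frame(
  Gender        = factor(rep(c('M', 'F'), times = c(90, 40))),
  Credit.amount = rlnorm(130, meanlog = 8, sdlog = 0.8)
)

# Absolute deviation of each observation from the median of its own group
group_median <- ave(toy$Credit.amount, toy$Gender, FUN = median)
toy$abs_dev  <- abs(toy$Credit.amount - group_median)

# One-way ANOVA on the absolute deviations: its F test is the Levene statistic
levene_by_hand <- anova(lm(abs_dev ~ Gender, data = toy))
levene_by_hand
```

With two groups and 130 observations, the degrees of freedom are 1 and 128, as in the `leveneTest()` output.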
H0: µF - µM = 0
H1: µF - µM ≠ 0
#since we could not reject equality of the group variances, the Welch correction is not needed
t.test(credit_DE130_short$Credit.amount ~ credit_DE130_short$Gender,
var.equal = TRUE,
alternative = 'two.sided')
##
## Two Sample t-test
##
## data: credit_DE130_short$Credit.amount by credit_DE130_short$Gender
## t = 0.75529, df = 128, p-value = 0.4515
## alternative hypothesis: true difference in means between group M and group F is not equal to 0
## 95 percent confidence interval:
## -624.3061 1395.1728
## sample estimates:
## mean in group M mean in group F
## 3296.233 2910.800
According to the results, rejecting H0 would carry a large risk of a type I error, which we cannot disregard. We therefore fail to reject H0 at p = 0.452. We found no evidence of a difference in the mean credit amounts taken by male and female German citizens.
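The t statistic, p-value and confidence interval of the equal-variance t-test can be recomputed from the group summaries reported by `describeBy()`. This is a sketch of the standard pooled-variance formulas; because the standard deviations are entered as rounded values, the results agree with `t.test()` only up to rounding.

```r
# Group summaries copied from the describeBy() and t.test() output
n_m <- 90; mean_m <- 3296.233; sd_m <- 2457.67
n_f <- 40; mean_f <- 2910.800; sd_f <- 3144.02

df_pooled <- n_m + n_f - 2
# Pooled standard deviation: weighted average of the two group variances
sd_pooled <- sqrt(((n_m - 1) * sd_m^2 + (n_f - 1) * sd_f^2) / df_pooled)
se_diff   <- sd_pooled * sqrt(1 / n_m + 1 / n_f)   # standard error of the difference

t_stat  <- (mean_m - mean_f) / se_diff
p_value <- 2 * pt(-abs(t_stat), df = df_pooled)    # two-sided p-value
ci      <- (mean_m - mean_f) + c(-1, 1) * qt(0.975, df_pooled) * se_diff

round(c(t = t_stat, df = df_pooled, p = p_value), 4)
round(ci, 1)
```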
!Because the credit amounts are not normally distributed in either the male or the female group, the parametric test is not appropriate. This leads us to the non-parametric Wilcoxon rank sum test!
H0: the distribution locations of the credit amounts in the male and female groups of German citizens are the same
H1: the distribution locations of the credit amounts in the male and female groups of German citizens aren’t the same
wilcox.test(credit_DE130_short$Credit.amount ~ credit_DE130_short$Gender,
correct = FALSE,
exact = FALSE,
alternative = 'two.sided')
##
## Wilcoxon rank sum test
##
## data: credit_DE130_short$Credit.amount by credit_DE130_short$Gender
## W = 2087.5, p-value = 0.147
## alternative hypothesis: true location shift is not equal to 0
According to the p-value from the test, we fail to reject H0 at p = 0.147. There is therefore no evidence that the distribution locations of the credit amounts differ between the male and female groups of German citizens.
library(effectsize)
##
## Attaching package: 'effectsize'
## The following objects are masked from 'package:rstatix':
##
## cohens_d, eta_squared
## The following object is masked from 'package:psych':
##
## phi
effectsize(wilcox.test(credit_DE130_short$Credit.amount ~ credit_DE130_short$Gender,
correct = FALSE,
exact = FALSE,
alternative = 'two.sided'))
## r (rank biserial) | 95% CI
## ---------------------------------
## 0.16 | [-0.05, 0.36]
The ‘r’ value (0.16) tells us that the difference between the distribution locations is ‘small’; its 95% confidence interval also includes zero.
To summarise the results: we found no statistically significant difference between the credit amounts taken by male and female citizens in Germany, and the observed difference is small.
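The rank-biserial correlation can be recovered from the W statistic and the group sizes reported above. The sketch uses the standard formula r = 2W / (n1 * n2) - 1; whether `effectsize` computes it in exactly this way internally is an assumption, although the formula reproduces the reported value.

```r
W   <- 2087.5   # Wilcoxon rank sum statistic from the output above
n_m <- 90       # number of male citizens
n_f <- 40       # number of female citizens

# Share of all male-female pairs in which the male amount is larger
# (ties counted as one half)
prob_superiority <- W / (n_m * n_f)

# Rank-biserial correlation: rescales that share from [0, 1] to [-1, 1]
r_rb <- 2 * prob_superiority - 1
round(r_rb, 2)   # 0.16
```

A value of 0 would mean that neither group tends to have larger amounts; here a randomly chosen male credit amount exceeds a randomly chosen female one in about 58% of pairs.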