TASK 1-3

credit_DE <- read.table('./german_credit_data.csv',
                        header = TRUE,
                        sep = ',',
                        dec = '.')

head(credit_DE)

##   X Age    Sex Job Housing Saving.accounts Checking.account Credit.amount
## 1 0  67   male   2     own            <NA>           little          1169
## 2 1  22 female   2     own          little         moderate          5951
## 3 2  49   male   1     own          little             <NA>          2096
## 4 3  45   male   2    free          little           little          7882
## 5 4  53   male   2    free          little           little          4870
## 6 5  35   male   1    free            <NA>             <NA>          9055
##   Duration             Purpose
## 1        6            radio/TV
## 2       48            radio/TV
## 3       12           education
## 4       42 furniture/equipment
## 5       24                 car
## 6       36           education

str(credit_DE)

## 'data.frame':    1000 obs. of  10 variables:
##  $ X               : int  0 1 2 3 4 5 6 7 8 9 ...
##  $ Age             : int  67 22 49 45 53 35 53 35 61 28 ...
##  $ Sex             : chr  "male" "female" "male" "male" ...
##  $ Job             : int  2 2 1 2 2 1 2 3 1 3 ...
##  $ Housing         : chr  "own" "own" "own" "free" ...
##  $ Saving.accounts : chr  NA "little" "little" "little" ...
##  $ Checking.account: chr  "little" "moderate" NA "little" ...
##  $ Credit.amount   : int  1169 5951 2096 7882 4870 9055 2835 6948 3059 5234 ...
##  $ Duration        : int  6 48 12 42 24 36 24 36 12 30 ...
##  $ Purpose         : chr  "radio/TV" "radio/TV" "education" "furniture/equipment" ...

TASK 4

Unit of observation: a German citizen with a credit history.
Sample size: 1000 observations (people).
Variables:
- X, the number of a citizen in the list (nominal; not considered further),
- Age (ratio),
- Sex (nominal),
- Job (nominal),
- Housing (nominal),
- Saving.accounts (ordinal),
- Checking.account (ordinal),
- Credit.amount (ratio),
- Duration (ratio),
- Purpose (nominal).
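
The ordinal level of variables such as Saving.accounts can be made explicit in R with an ordered factor. The sketch below uses a small hypothetical vector rather than the real data, and it assumes the order little < moderate < quite rich < rich, which the data set itself does not state:

```r
# Hypothetical values for illustration only (not taken from the data set)
savings <- c('little', 'rich', 'moderate', 'quite rich', 'little')

# ASSUMPTION: the levels are ordered from the smallest to the largest balance
savings_ord <- factor(savings,
                      levels = c('little', 'moderate', 'quite rich', 'rich'),
                      ordered = TRUE)

# Comparisons are meaningful only for ordered factors
savings_ord < 'rich'

# Counts are reported in the declared order of the levels
table(savings_ord)
```

With a nominal variable such as Purpose, a plain (unordered) factor is enough, because its levels have no natural order.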

TASK 5

The data come from the UCI Machine Learning Repository. Here is the link.

TASK 6

1. Data manipulation

#IMPORTANT: The function 'sample_n' draws a different random sample every time the code is run. To keep the results of the later tests constant, one particular sample has to be preserved. For this reason the following lines are commented out with #-signs and serve only as a demonstration of how the sample was created:

#library(dplyr)
#library(tidyr)
#credit_DE130 <- credit_DE[,-c(1,10)] %>% 
  #drop_na() %>% #some rows have been deleted due to missing data
  #mutate(size_of_credit = ifelse(Credit.amount > mean(Credit.amount), 'big', 'small')) %>% 
  #rename(Gender = Sex,
         #'size of credit' = size_of_credit) #new variable has been added

#library(tidyverse)
#credit_DE130 <- sample_n(credit_DE130, 130)

#library(readr)
#write_csv(credit_DE130, 'credit_DE130.csv')
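
An alternative to storing the sample in a CSV file is to fix the seed of the random number generator before sampling. The following is only a sketch of that idea: it uses the built-in 'mtcars' data instead of the credit data, and the helper 'draw_sample' and the seed value are hypothetical names chosen for illustration:

```r
# Hypothetical helper: draws n random rows after fixing the seed
draw_sample <- function(data, n, seed) {
  set.seed(seed)                    # fix the state of the random number generator
  data[sample(nrow(data), n), ]     # draw n rows without replacement
}

# Two draws with the same seed select exactly the same rows
first  <- draw_sample(mtcars, 10, seed = 2025)
second <- draw_sample(mtcars, 10, seed = 2025)
identical(first, second)
```

The saved file remains the safer choice when results must stay identical across R versions, since the default sampling algorithm has changed between releases.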

credit_DE130 <- read.table('./credit_DE130.csv', 
                           header=TRUE,
                           sep = ',',
                           dec = '.')

credit_DE130$Job <- factor(credit_DE130$Job,
                         levels = c(0:3),
                         labels = c('unskilled and non-resident',
                                    'unskilled and resident',
                                    'skilled',
                                    'highly skilled')) # data manipulation: converting a numeric variable into a factor (the same is done below for several character variables)

credit_DE130$Gender <- factor(credit_DE130$Gender,
                            levels = c('male','female'),
                            labels = c('M','F'))  # Shorter value labels are easier to read. Stored as a factor, the variable is also treated differently by descriptive functions: 'summary', for example, then reports the number of observations per level

credit_DE130$size.of.credit <- factor(credit_DE130$size.of.credit)

credit_DE130$Saving.accounts <- factor(credit_DE130$Saving.accounts)

credit_DE130$Checking.account <- factor(credit_DE130$Checking.account)

credit_DE130$Housing <- factor(credit_DE130$Housing)

credit_DE130 <- credit_DE130[order(credit_DE130$Age),]

str(credit_DE130)

## 'data.frame':    130 obs. of  9 variables:
##  $ Age             : int  20 21 21 22 22 22 22 22 22 23 ...
##  $ Gender          : Factor w/ 2 levels "M","F": 1 1 2 1 2 1 2 1 2 2 ...
##  $ Job             : Factor w/ 4 levels "unskilled and non-resident",..: 3 3 3 3 3 2 3 3 3 3 ...
##  $ Housing         : Factor w/ 3 levels "free","own","rent": 3 3 3 2 3 3 2 2 3 2 ...
##  $ Saving.accounts : Factor w/ 4 levels "little","moderate",..: 1 1 1 3 1 1 1 4 1 3 ...
##  $ Checking.account: Factor w/ 3 levels "little","moderate",..: 2 2 1 1 1 2 2 2 1 1 ...
##  $ Credit.amount   : int  585 2779 1049 2828 3650 276 1670 1007 1366 4281 ...
##  $ Duration        : int  12 18 18 24 18 9 9 12 9 33 ...
##  $ size.of.credit  : Factor w/ 2 levels "big","small": 2 2 2 2 1 2 2 2 2 1 ...

head(credit_DE130)

##     Age Gender                    Job Housing Saving.accounts Checking.account
## 99   20      M                skilled    rent          little         moderate
## 77   21      M                skilled    rent          little         moderate
## 112  21      F                skilled    rent          little           little
## 55   22      M                skilled     own      quite rich           little
## 74   22      F                skilled    rent          little           little
## 86   22      M unskilled and resident    rent          little         moderate
##     Credit.amount Duration size.of.credit
## 99            585       12          small
## 77           2779       18          small
## 112          1049       18          small
## 55           2828       24          small
## 74           3650       18            big
## 86            276        9          small

2. A few estimates of parameters

mean(credit_DE130$Age)

## [1] 34.89231

median(credit_DE130$Duration)

## [1] 18

range(credit_DE130$Credit.amount)

## [1]   276 18424

sd(credit_DE130$Credit.amount)

## [1] 2680.965

3. Descriptive statistics

The advantage of ‘summary’ is that it gives descriptive statistics for the numeric variables and, for every categorical variable stored as a factor, the number of observations at each level.

summary(credit_DE130)

##       Age        Gender                         Job     Housing  
##  Min.   :20.00   M:90   unskilled and non-resident: 2   free:13  
##  1st Qu.:26.00   F:40   unskilled and resident    :28   own :93  
##  Median :31.00          skilled                   :83   rent:24  
##  Mean   :34.89          highly skilled            :17            
##  3rd Qu.:40.75                                                   
##  Max.   :70.00                                                   
##    Saving.accounts Checking.account Credit.amount      Duration    
##  little    :100    little  :60      Min.   :  276   Min.   : 6.00  
##  moderate  : 15    moderate:55      1st Qu.: 1332   1st Qu.:12.00  
##  quite rich:  7    rich    :15      Median : 2578   Median :18.00  
##  rich      :  8                     Mean   : 3178   Mean   :20.59  
##                                     3rd Qu.: 4002   3rd Qu.:24.00  
##                                     Max.   :18424   Max.   :60.00  
##  size.of.credit
##  big  :44      
##  small:86      
##                
##                
##                
##

‘describeBy’ reports several descriptive statistics for a numeric variable, grouped by the values of a categorical variable. For example, here are the statistics for the credit amount within each gender group:

library(psych)
describeBy(credit_DE130$Credit.amount, group = credit_DE130$Gender)

## 
##  Descriptive statistics by group 
## group: M
##    vars  n    mean      sd median trimmed     mad min   max range skew kurtosis
## X1    1 90 3296.23 2457.67   2627 3002.18 1948.14 276 14896 14620 1.58      3.9
##        se
## X1 259.06
## ------------------------------------------------------------ 
## group: F
##    vars  n   mean      sd median trimmed     mad min   max range skew kurtosis
## X1    1 40 2910.8 3144.02   1858 2252.44 1486.31 683 18424 17741 3.28    12.59
##        se
## X1 497.11

‘stat.desc’ is another useful function: it returns a considerable number of descriptive statistics for a numeric variable or for a whole data set (it cannot be applied to a categorical variable).

library(pastecs)
round(stat.desc(credit_DE130$Credit.amount), 2)

##      nbr.val     nbr.null       nbr.na          min          max        range 
##       130.00         0.00         0.00       276.00     18424.00     18148.00 
##          sum       median         mean      SE.mean CI.mean.0.95          var 
##    413093.00      2577.50      3177.64       235.14       465.22   7187571.89 
##      std.dev     coef.var 
##      2680.96         0.84
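
Several of these values can be reproduced by hand from the sample size, the mean and the standard deviation. The sketch below copies the three inputs from the output above and assumes that CI.mean.0.95 is the half-width of the usual t-based confidence interval:

```r
# Inputs copied from the output shown earlier in this report
n    <- 130        # nbr.val
s    <- 2680.965   # sd(credit_DE130$Credit.amount)
xbar <- 3177.64    # mean

se_mean  <- s / sqrt(n)                        # standard error of the mean
ci_half  <- qt(0.975, df = n - 1) * se_mean    # ASSUMPTION: t-based half-width
coef_var <- s / xbar                           # coefficient of variation

round(c(SE.mean = se_mean, CI.mean.0.95 = ci_half, coef.var = coef_var), 2)
```

The rounded results should agree with SE.mean, CI.mean.0.95 and coef.var in the table, so the 95% confidence interval for the mean credit amount is roughly 3177.64 ± 465.22.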

TASK 7

0. Preparation

To begin, we simplify the data by reducing it to only 2 variables:

credit_DE130_short <- credit_DE130[, c('Gender', 'Credit.amount')] # keep only the two variables needed for the test

head(credit_DE130_short)

##     Gender Credit.amount
## 99       M           585
## 77       M          2779
## 112      F          1049
## 55       M          2828
## 74       F          3650
## 86       M           276

Additionally, we should check whether the group sizes in our shortened (long format) dataset are large enough for the tests to be reliable. This information can be found in one of the descriptive functions run earlier, namely ‘describeBy()’: there are 40 female and 90 male citizens in the sample.

1. The research question

Is there a difference in the credit amounts between male and female German citizens?

2. Normality of distribution

We first check whether the credit amounts of male and female German citizens are normally distributed:

2.1. A plot

First of all, we can look at the histograms of the male and female samples:

library(ggplot2)

## 
## Attaching package: 'ggplot2'

## The following objects are masked from 'package:psych':
## 
##     %+%, alpha

ggplot(credit_DE130, aes(x = Credit.amount))+
  geom_histogram(binwidth = 700, colour = 'wheat2', fill = 'steelblue', linetype = 'longdash')+
  labs(x = 'Credit amount',
       y = 'Frequency')+
  facet_wrap(~Gender, ncol = 1)+
  theme_dark()

Both histograms are skewed to the right, although some slight differences between them are visible.

2.2. Shapiro-Wilk test

To test formally whether the distributions are normal, we conduct the Shapiro-Wilk test separately for each group:

H0(M): the distribution of credit amounts taken by male citizens is normal

H1(M): the distribution of credit amounts taken by male citizens isn’t normal

H0(F): the distribution of credit amounts taken by female citizens is normal

H1(F): the distribution of credit amounts taken by female citizens isn’t normal

#install.packages('rstatix')
library(rstatix)

## 
## Attaching package: 'rstatix'

## The following object is masked from 'package:stats':
## 
##     filter

credit_DE130 %>% 
  group_by(Gender) %>% 
  shapiro_test(Credit.amount)

## # A tibble: 2 × 4
##   Gender variable      statistic             p
##   <fct>  <chr>             <dbl>         <dbl>
## 1 M      Credit.amount     0.860 0.0000000995 
## 2 F      Credit.amount     0.613 0.00000000471

Both p-values are far below 0.001, so the risk of a type I error in rejecting the null hypotheses is negligible. We therefore reject both H0 at p<0.001: the credit amounts are not normally distributed in either the male or the female group (each sample representing a separate population).

Each sample is large enough that these results cannot be attributed to a lack of observations. Therefore, we do not strictly need a quantile-quantile plot to examine the distributions more closely.
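
The same grouped test can be sketched in base R with 'tapply' and 'shapiro.test'. The example below runs on simulated right-skewed data, not on the credit data; the variable names, the seed and the log-normal parameters are assumptions made only for illustration:

```r
set.seed(1)  # simulated data, NOT the credit sample

# Right-skewed amounts for 90 'M' and 40 'F' observations
amount <- c(rlnorm(90, meanlog = 8, sdlog = 1),
            rlnorm(40, meanlog = 8, sdlog = 1))
gender <- factor(rep(c('M', 'F'), times = c(90, 40)), levels = c('M', 'F'))

# One Shapiro-Wilk p-value per group
p_values <- tapply(amount, gender, function(v) shapiro.test(v)$p.value)
p_values
```

For strongly skewed data of this kind, both p-values are expected to be very small, which mirrors the pattern seen in the real sample.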

2.3. The quantile-quantile plot

Since it is not strictly required, we draw the qq-plot only for demonstration. It gives a visual confirmation of what we’ve already seen in the Shapiro-Wilk test:

library(ggpubr)
ggqqplot(credit_DE130_short,
         'Credit.amount',
         facet.by = 'Gender')

3. The performance of the test

3.1. Assumptions for a parametric test

For demonstration purposes, we first conduct the parametric test that corresponds to the research question, which in our case is the independent samples t-test. Besides normality of the distributions, it requires us to know whether the variances of credit amounts among male and female citizens are equal. To check this, we conduct Levene’s test:

H0: the variances of credit amounts between female and male citizens are equal

H1: the variances of credit amounts between female and male citizens aren’t equal

#install.packages('car')
library(car)

## Loading required package: carData

## 
## Attaching package: 'car'

## The following object is masked from 'package:psych':
## 
##     logit

leveneTest(credit_DE130_short$Credit.amount,
           group = credit_DE130_short$Gender)

## Levene's Test for Homogeneity of Variance (center = median)
##        Df F value Pr(>F)
## group   1  0.0206  0.886
##       128

We fail to reject H0 at p = 0.886. There is no evidence that the variances of credit amounts differ between the groups, so we treat them as equal.
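
Levene’s test with center = median is equivalent to a one-way ANOVA on the absolute deviations of each observation from its group median. A minimal base R sketch of this idea follows; it uses simulated data rather than the credit sample, and all names and parameters in it are assumptions for illustration:

```r
set.seed(2)  # simulated data, NOT the credit sample

amount <- c(rexp(90, rate = 1 / 3000), rexp(40, rate = 1 / 3000))
gender <- factor(rep(c('M', 'F'), times = c(90, 40)), levels = c('M', 'F'))

# Absolute deviation of every observation from the median of its own group
abs_dev <- abs(amount - ave(amount, gender, FUN = median))

# One-way ANOVA on the deviations: its F test is the Levene statistic
levene_tab <- anova(lm(abs_dev ~ gender))
levene_tab
```

With two groups and 130 observations the table has 1 and 128 degrees of freedom, the same layout as in the ‘leveneTest’ output.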

3.2. Demonstration of (parametric) independent samples t-test

H0: µM - µF = 0

H1: µM - µF ≠ 0

#since the group variances can be treated as equal, there is no need for the Welch correction
t.test(credit_DE130_short$Credit.amount ~ credit_DE130_short$Gender,
       var.equal = TRUE,
       alternative = 'two.sided')

## 
##  Two Sample t-test
## 
## data:  credit_DE130_short$Credit.amount by credit_DE130_short$Gender
## t = 0.75529, df = 128, p-value = 0.4515
## alternative hypothesis: true difference in means between group M and group F is not equal to 0
## 95 percent confidence interval:
##  -624.3061 1395.1728
## sample estimates:
## mean in group M mean in group F 
##        3296.233        2910.800

The p-value is large, so rejecting H0 would carry a high risk of a type I error. We therefore fail to reject H0 at p=0.452: the test gives no evidence of a difference in the mean credit amounts taken by male and female German citizens.
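
The t statistic can also be rebuilt from the group summaries reported by ‘describeBy’ and ‘t.test’. The sketch below copies the group sizes, means and standard deviations from the earlier output and applies the standard pooled-variance formula; small rounding differences are possible because the inputs are rounded:

```r
# Group summaries copied from the output shown earlier in this report
n_M <- 90; mean_M <- 3296.233; sd_M <- 2457.67
n_F <- 40; mean_F <- 2910.800; sd_F <- 3144.02

# Pooled variance under the equal-variance assumption
df_pooled  <- n_M + n_F - 2
var_pooled <- ((n_M - 1) * sd_M^2 + (n_F - 1) * sd_F^2) / df_pooled

# Standard error of the difference in means, t statistic, two-sided p-value
se_diff <- sqrt(var_pooled * (1 / n_M + 1 / n_F))
t_stat  <- (mean_M - mean_F) / se_diff
p_value <- 2 * pt(-abs(t_stat), df = df_pooled)

round(c(t = t_stat, df = df_pooled, p = p_value), 4)
```

The result should agree with t = 0.75529, df = 128 and p-value = 0.4515 from the ‘t.test’ output.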

!Because the credit amounts of male and female German citizens are not normally distributed, the assumptions of the parametric test are not met and its result should not be relied on. This leads us to the non-parametric Wilcoxon rank sum test!

3.3. (non-parametric) Wilcoxon rank sum test

H0: the locations of the credit amount distributions in the male and female groups of German citizens are the same

H1: the locations of the credit amount distributions in the male and female groups of German citizens are not the same

wilcox.test(credit_DE130_short$Credit.amount ~ credit_DE130_short$Gender,
            correct = FALSE,
            exact = FALSE,
            alternative = 'two.sided')

## 
##  Wilcoxon rank sum test
## 
## data:  credit_DE130_short$Credit.amount by credit_DE130_short$Gender
## W = 2087.5, p-value = 0.147
## alternative hypothesis: true location shift is not equal to 0

According to the p-value from the test, we fail to reject H0 at p=0.147. There is therefore no evidence that the locations of the credit amount distributions differ between the male and female groups of German citizens.
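
The W statistic reported by ‘wilcox.test’ is the sum of the ranks of the first group minus its smallest possible value, n1(n1 + 1)/2. The following sketch shows this on two small hypothetical vectors, chosen only so that the ranks are easy to follow by hand:

```r
# Hypothetical toy data without ties (not from the credit sample)
x <- c(1.1, 2.5, 3.9, 7.2)   # first group
y <- c(0.5, 3.0, 4.8)        # second group

# Rank all observations together, then sum the ranks of the first group
r <- rank(c(x, y))
w_manual <- sum(r[seq_along(x)]) - length(x) * (length(x) + 1) / 2

# The built-in test reports the same statistic
w_builtin <- unname(wilcox.test(x, y)$statistic)

c(manual = w_manual, builtin = w_builtin)   # both are 7
```

Equivalently, W counts the pairs in which an observation from the first group exceeds one from the second group, here 7 out of 12 pairs.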

4. Effect size

library(effectsize)

## 
## Attaching package: 'effectsize'

## The following objects are masked from 'package:rstatix':
## 
##     cohens_d, eta_squared

## The following object is masked from 'package:psych':
## 
##     phi

effectsize(wilcox.test(credit_DE130_short$Credit.amount ~ credit_DE130_short$Gender,
                       correct = FALSE,
                       exact = FALSE,
                       alternative = 'two.sided'))

## r (rank biserial) |        95% CI
## ---------------------------------
## 0.16              | [-0.05, 0.36]

The ‘r’ value tells us that the difference between the distribution locations is ‘small’; its 95% confidence interval also includes 0.
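
The rank-biserial correlation can be recovered from the W statistic and the group sizes. The sketch below assumes the usual definition r = 2W / (n1 * n2) - 1 and copies W and the group sizes from the output above:

```r
# Values copied from the Wilcoxon output and the group sizes reported earlier
W   <- 2087.5
n_M <- 90
n_F <- 40

# ASSUMPTION: standard rank-biserial formula for two independent groups
r_rb <- 2 * W / (n_M * n_F) - 1
round(r_rb, 2)
```

The rounded value should match the 0.16 reported by ‘effectsize’. The sign is positive because group M, the first factor level, tends to have the larger credit amounts.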

5. Conclusions

The results give no evidence that the credit amounts taken by male and female citizens in Germany differ: the difference observed in the sample is small and not statistically significant.

HOMEWORK 2

Oleh Burdukov

2025-03-26