#TASK1-3

credit_DE <- read.table('./german_credit_data.csv',
                        header = TRUE,
                        sep = ',',
                        dec = '.')

head(credit_DE)
##   X Age    Sex Job Housing Saving.accounts Checking.account Credit.amount
## 1 0  67   male   2     own            <NA>           little          1169
## 2 1  22 female   2     own          little         moderate          5951
## 3 2  49   male   1     own          little             <NA>          2096
## 4 3  45   male   2    free          little           little          7882
## 5 4  53   male   2    free          little           little          4870
## 6 5  35   male   1    free            <NA>             <NA>          9055
##   Duration             Purpose
## 1        6            radio/TV
## 2       48            radio/TV
## 3       12           education
## 4       42 furniture/equipment
## 5       24                 car
## 6       36           education
str(credit_DE)
## 'data.frame':    1000 obs. of  10 variables:
##  $ X               : int  0 1 2 3 4 5 6 7 8 9 ...
##  $ Age             : int  67 22 49 45 53 35 53 35 61 28 ...
##  $ Sex             : chr  "male" "female" "male" "male" ...
##  $ Job             : int  2 2 1 2 2 1 2 3 1 3 ...
##  $ Housing         : chr  "own" "own" "own" "free" ...
##  $ Saving.accounts : chr  NA "little" "little" "little" ...
##  $ Checking.account: chr  "little" "moderate" NA "little" ...
##  $ Credit.amount   : int  1169 5951 2096 7882 4870 9055 2835 6948 3059 5234 ...
##  $ Duration        : int  6 48 12 42 24 36 24 36 12 30 ...
##  $ Purpose         : chr  "radio/TV" "radio/TV" "education" "furniture/equipment" ...

#TASK4
1. Unit of observation: a German borrower with a credit history.
2. Sample size: 1000 observations (people).
3. Variables: X, the row index of the person in the list (nominal; not considered further), Age (ratio), Sex (nominal), Job (nominal), Housing (nominal), Saving.accounts (ordinal), Checking.account (ordinal), Credit.amount (ratio), Duration (ratio), Purpose (nominal).

#TASK5 The source of the data is the UCI Machine Learning Repository.

#TASK6

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tidyr)
credit_DE2 <- credit_DE[, -c(1, 10)] %>% # drop the index column X and the column Purpose
  drop_na() %>% # rows with missing values are removed
  mutate(size_of_credit = ifelse(Credit.amount > mean(Credit.amount), 'big', 'small')) %>% # new variable: is the credit above the mean amount?
  rename(Gender = Sex,
         'size of credit' = size_of_credit) # some variables get new names
credit_DE2$Job <- factor(credit_DE2$Job,
                         levels = c(0:3),
                         labels = c('unskilled and non-resident',
                                    'unskilled and resident',
                                    'skilled',
                                    'highly skilled')) # turn the numeric codes into a labelled factor (the other categorical variables are converted below)

credit_DE2$Gender <- factor(credit_DE2$Gender,
                            levels = c('male','female'),
                            labels = c('M','F')) # shorter value labels; stored as a factor, the variable is also summarised differently, e.g. 'summary' reports the number of observations per level

credit_DE2$`size of credit` <- factor(credit_DE2$`size of credit`)

credit_DE2$Saving.accounts <- factor(credit_DE2$Saving.accounts)

credit_DE2$Checking.account <- factor(credit_DE2$Checking.account)

credit_DE2$Housing <- factor(credit_DE2$Housing)

credit_DE2 <- credit_DE2[order(credit_DE2$Age),]
str(credit_DE2)
## 'data.frame':    522 obs. of  9 variables:
##  $ Age             : int  19 20 20 20 20 20 20 20 20 21 ...
##  $ Gender          : Factor w/ 2 levels "M","F": 2 2 2 1 2 1 2 2 1 1 ...
##  $ Job             : Factor w/ 4 levels "unskilled and non-resident",..: 2 3 3 3 3 3 3 2 4 3 ...
##  $ Housing         : Factor w/ 3 levels "free","own","rent": 3 3 2 2 2 3 3 3 3 3 ...
##  $ Saving.accounts : Factor w/ 4 levels "little","moderate",..: 4 1 4 2 1 1 1 1 1 2 ...
##  $ Checking.account: Factor w/ 3 levels "little","moderate",..: 2 1 2 1 2 2 1 2 1 2 ...
##  $ Credit.amount   : int  983 1282 1577 674 1967 585 2039 2718 1107 3031 ...
##  $ Duration        : int  12 12 11 12 24 12 18 24 12 45 ...
##  $ size of credit  : Factor w/ 2 levels "big","small": 2 2 2 2 2 2 2 2 2 2 ...
head(credit_DE2)
##     Age Gender                    Job Housing Saving.accounts Checking.account
## 207  19      F unskilled and resident    rent            rich         moderate
## 89   20      F                skilled    rent          little           little
## 95   20      F                skilled     own            rich         moderate
## 108  20      M                skilled     own        moderate           little
## 217  20      F                skilled     own          little         moderate
## 268  20      M                skilled    rent          little         moderate
##     Credit.amount Duration size of credit
## 207           983       12          small
## 89           1282       12          small
## 95           1577       11          small
## 108           674       12          small
## 217          1967       24          small
## 268           585       12          small
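
The drop from 1000 to 522 observations comes from the rows removed by `drop_na()`. The following is a minimal sketch of how the loss could be checked before dropping anything. It uses a small made-up data frame (`toy` and its values are assumptions, not the real data) and base R only.

```r
# Toy data frame (made-up values) with missing entries in two columns
toy <- data.frame(
  Age = c(67, 22, 49, 45),
  Saving.accounts = c(NA, "little", "little", NA),
  Checking.account = c("little", "moderate", NA, "little"),
  stringsAsFactors = FALSE
)

# Number of missing values in each column
na_per_column <- colSums(is.na(toy))
print(na_per_column)

# Number of rows without any missing value, i.e. the rows that
# tidyr::drop_na() would keep
complete_rows <- sum(complete.cases(toy))
print(complete_rows)
```

Applied to `credit_DE`, the same two calls would show which columns are responsible for the removed rows.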

#TASK7

mean(credit_DE2$Age)
## [1] 34.88889
median(credit_DE2$Duration)
## [1] 18
range(credit_DE2$Credit.amount)
## [1]   276 18424
sd(credit_DE2$Credit.amount)
## [1] 2929.155

Now we come to functions that offer more extensive descriptive statistics on the variables. 1. The advantage of ‘summary’ is that it returns descriptive statistics for numeric variables and the number of observations per value for categorical variables, provided the latter are stored as factors.

summary(credit_DE2)
##       Age        Gender                          Job      Housing   
##  Min.   :19.00   M:354   unskilled and non-resident: 14   free: 65  
##  1st Qu.:26.00   F:168   unskilled and resident    :116   own :349  
##  Median :31.50           skilled                   :313   rent:108  
##  Mean   :34.89           highly skilled            : 79             
##  3rd Qu.:41.00                                                      
##  Max.   :75.00                                                      
##    Saving.accounts Checking.account Credit.amount      Duration    
##  little    :412    little  :245     Min.   :  276   Min.   : 6.00  
##  moderate  : 64    moderate:224     1st Qu.: 1298   1st Qu.:12.00  
##  quite rich: 23    rich    : 53     Median : 2326   Median :18.00  
##  rich      : 23                     Mean   : 3279   Mean   :21.34  
##                                     3rd Qu.: 3971   3rd Qu.:26.75  
##                                     Max.   :18424   Max.   :72.00  
##  size of credit
##  big  :181     
##  small:341     
##                
##                
##                
## 
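
The difference between a character column and a factor in `summary` can be seen on a tiny example. This is only a sketch with made-up values (`housing` is a hypothetical vector, not taken from the data set).

```r
# Made-up housing values stored as plain character strings
housing <- c("own", "own", "free", "rent", "own")

# For a character vector, summary() reports only length, class and mode
print(summary(housing))

# After conversion to a factor, summary() counts the observations per level
housing_counts <- summary(factor(housing))
print(housing_counts)
```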

2. ’describeBy’ (package psych) returns several descriptive statistics for a numeric variable, grouped by the values of a categorical variable. For example, here are the statistics for the credit amount within each job group:

library(psych)
describeBy(credit_DE2$Credit.amount, group = credit_DE2$Job)
## 
##  Descriptive statistics by group 
## group: unskilled and non-resident
##    vars  n    mean     sd median trimmed    mad min  max range skew kurtosis
## X1    1 14 1767.86 1088.9 1325.5 1698.58 762.06 609 3758  3149 0.77    -1.04
##        se
## X1 291.02
## ------------------------------------------------------------ 
## group: unskilled and resident
##    vars   n    mean      sd median trimmed     mad min   max range skew
## X1    1 116 2250.72 2032.63 1533.5 1871.33 1142.34 276 11998 11722 2.36
##    kurtosis     se
## X1     6.82 188.73
## ------------------------------------------------------------ 
## group: skilled
##    vars   n    mean     sd median trimmed    mad min   max range skew kurtosis
## X1    1 313 3129.13 2515.5   2415 2715.61 1746.5 338 15945 15607 2.06      5.7
##        se
## X1 142.18
## ------------------------------------------------------------ 
## group: highly skilled
##    vars  n    mean      sd median trimmed     mad  min   max range skew
## X1    1 79 5648.78 4236.66   4439 5118.11 3805.83 1050 18424 17374 0.98
##    kurtosis     se
## X1     0.14 476.66
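
A few of the grouped statistics above (n, mean, median) can also be reproduced without psych. The following sketch uses base R `tapply` on a made-up data frame; `toy` and its values are assumptions for illustration only.

```r
# Made-up credit amounts for two job groups
toy <- data.frame(
  Job = c("skilled", "skilled", "unskilled", "unskilled", "unskilled"),
  Credit.amount = c(1000, 3000, 500, 700, 900)
)

# One statistic per group; tapply() returns a vector named by group
group_n      <- tapply(toy$Credit.amount, toy$Job, length)
group_mean   <- tapply(toy$Credit.amount, toy$Job, mean)
group_median <- tapply(toy$Credit.amount, toy$Job, median)

# Collect the results in one table with the groups as row names
group_stats <- data.frame(
  n = as.vector(group_n),
  mean = as.vector(group_mean),
  median = as.vector(group_median),
  row.names = names(group_n)
)
print(group_stats)
```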

3. ’stat.desc’ (package pastecs) is another useful function. It returns a considerable number of descriptive statistics for a numeric variable or for a whole data set (it cannot be applied to categorical variables).

library(pastecs)
## 
## Attaching package: 'pastecs'
## The following object is masked from 'package:tidyr':
## 
##     extract
## The following objects are masked from 'package:dplyr':
## 
##     first, last
round(stat.desc(credit_DE2$Credit.amount), 2)
##      nbr.val     nbr.null       nbr.na          min          max        range 
##       522.00         0.00         0.00       276.00     18424.00     18148.00 
##          sum       median         mean      SE.mean CI.mean.0.95          var 
##   1711505.00      2326.50      3278.75       128.21       251.86   8579950.05 
##      std.dev     coef.var 
##      2929.16         0.89
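
Three of the values reported above (SE.mean, CI.mean.0.95 and coef.var) are derived from the mean and the standard deviation. The sketch below recomputes them by hand, assuming the usual definitions: standard error = sd / sqrt(n), half-width of the 95% confidence interval = t quantile times the standard error, coefficient of variation = sd / mean. The vector `x` holds only the first six credit amounts shown by `head(credit_DE)`, so the numbers differ from the full-sample output.

```r
# First six credit amounts from head(credit_DE)
x <- c(1169, 5951, 2096, 7882, 4870, 9055)
n <- length(x)

# Standard error of the mean
se_mean <- sd(x) / sqrt(n)

# Half-width of the 95% confidence interval for the mean (t distribution)
ci_half_width <- qt(0.975, df = n - 1) * se_mean

# Coefficient of variation: standard deviation relative to the mean
coef_var <- sd(x) / mean(x)

print(round(c(SE.mean = se_mean,
              CI.mean.0.95 = ci_half_width,
              coef.var = coef_var), 2))
```

With the full-sample values from the output above, sd / sqrt(n) = 2929.16 / sqrt(522) gives the reported standard error of about 128.21.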

#TASK8 1.Histogram

#install.packages('ggplot2')
library(ggplot2)
## 
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
## 
##     %+%, alpha
ggplot(credit_DE2, aes(x = Age)) +
  geom_histogram(binwidth = 5, colour = 'grey', fill = 'darkolivegreen', linetype = 'longdash') +
  ylab('Frequency')+
  xlab('Age')+ 
  theme_dark()

The histogram is skewed to the right. In this context it means that most Germans who take out credit are between 20 and about 40-45 years old, while older borrowers form a long, thin tail.
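
Right skew can also be checked numerically: the mean lies above the median (34.89 versus 31.50 for Age in the summary output), and the moment-based skewness is positive. The following is a sketch on made-up ages; the vector `ages` is an assumption, not the real data.

```r
# Made-up ages with a long right tail
ages <- c(19, 20, 22, 24, 25, 27, 30, 35, 48, 75)

# In a right-skewed sample the mean is pulled above the median
mean_above_median <- mean(ages) > median(ages)

# Moment-based skewness: third central moment divided by the
# second central moment raised to the power 1.5
centered <- ages - mean(ages)
skewness <- mean(centered^3) / mean(centered^2)^1.5

print(mean_above_median)
print(skewness > 0)
```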

#install.packages('ggplot2')
library(ggplot2)
ggplot(credit_DE2, aes(x = Age)) + # no fill mapping here: the constant fill below would override it
  geom_histogram(binwidth = 5, colour = 'grey', fill = 'darkolivegreen', linetype = 'longdash') +
  ylab('Frequency')+
  xlab('Age')+
  facet_wrap(~Gender, ncol = 1)+
  theme_dark()

#install.packages('ggplot2')
library(ggplot2)
ggplot(credit_DE2, aes(x = Age, fill = Gender)) +
  geom_histogram(binwidth = 5, colour = 'grey', linetype = 'longdash', position = position_dodge(width = 7)) +
  ylab('Frequency')+
  xlab('Age')+ 
  theme_dark()

2.Boxplot

library(ggplot2)
ggplot(credit_DE2, aes(y = Duration, x = Gender, fill = Gender)) +
  geom_boxplot() +
  xlab('Gender')

As we can see, quite a few observations (individual female German borrowers) have a duration of 48 months or more, so it makes no sense to exclude them. However, we can delete the observations with a duration of 60 months or more:

head(credit_DE2[order(-credit_DE2$Duration),])
##     Age Gender            Job Housing Saving.accounts Checking.account
## 359  24      M        skilled     own        moderate         moderate
## 177  24      F highly skilled     own        moderate         moderate
## 379  27      M highly skilled     own          little         moderate
## 508  36      M        skilled    rent          little           little
## 490  42      M        skilled    free          little         moderate
## 200  60      F highly skilled    free        moderate         moderate
##     Credit.amount Duration size of credit
## 359          5595       72            big
## 177          7408       60            big
## 379         14027       60            big
## 508          7297       60            big
## 490          6288       60            big
## 200         14782       60            big
credit_DE2BP <- credit_DE2[credit_DE2$Duration < 60, ]
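
The cutoff of 60 months above is chosen by eye. For comparison, boxplots conventionally draw points as outliers when they lie more than 1.5 times the interquartile range beyond the quartiles. The sketch below computes that upper fence on made-up durations; `durations` is an assumption, and the exact quartile method used by a plotting function may differ slightly from `quantile()`.

```r
# Made-up durations in months
durations <- c(6, 12, 12, 18, 18, 24, 24, 36, 48, 72)

# First and third quartile (default quantile type)
q <- quantile(durations, c(0.25, 0.75))
iqr <- q[[2]] - q[[1]]

# Upper fence: values above it would be drawn as outlier points
upper_fence <- q[[2]] + 1.5 * iqr

# Observations flagged by the rule
flagged <- durations[durations > upper_fence]
print(upper_fence)
print(flagged)
```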

library(ggplot2)
ggplot(credit_DE2BP, aes(y = Duration, x = Gender, fill = Gender))+
  geom_boxplot()+
  xlab('Gender')

Explanation: regardless of the borrower's gender, 50% of the borrowing periods are shorter than about 18 months, while the other 50% are 18 months or longer. Additionally, the highest values in the plot occur among male Germans. Moreover, the interquartile range (q1 to q3, not Q1 to Q3, because this is a sample and not the whole population) is somewhat smaller among female borrowers than among male borrowers, which means that the duration of their credits varies within a narrower range.

3.Scatter plot

library(ggplot2)
ggplot(credit_DE2, aes(x = Credit.amount, y = Duration))+
  geom_point(colour = 'skyblue4')

Another way to create a scatter plot (without ggplot):

library(car)
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:psych':
## 
##     logit
## The following object is masked from 'package:dplyr':
## 
##     recode
scatterplot(Duration~Credit.amount | Gender,
            ylab = 'Duration',
            xlab = 'Credit.amount',
            data = credit_DE2,
            smooth = FALSE)
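
With a grouping variable, the scatterplot is read by comparing one fitted line per group (the per-group least-squares lines are assumed here to be the default of `scatterplot`; check the car documentation for the installed version). The sketch below fits such lines with base R `lm` on made-up data; `toy` and its values are assumptions chosen so that the two lines cross.

```r
# Made-up data: same credit amounts for both groups, different durations
toy <- data.frame(
  Gender = rep(c("M", "F"), each = 4),
  Credit.amount = c(1000, 2000, 3000, 4000, 1000, 2000, 3000, 4000),
  Duration = c(10, 14, 18, 22, 6, 14, 22, 30)
)

# Fit one least-squares line per gender and keep the coefficients
fits <- lapply(split(toy, toy$Gender), function(d) {
  coef(lm(Duration ~ Credit.amount, data = d))
})

# Slope per group: additional months of duration per unit of credit amount
slopes <- sapply(fits, function(b) b[["Credit.amount"]])
print(slopes)
```

In this toy example the line for F is steeper and starts lower, so it lies below the line for M at small amounts and above it at large amounts.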

The scatterplot shows that, for higher credit amounts, female Germans borrow on average for a longer period of time than male Germans. On the other hand, for lower credit amounts, the duration among female Germans is slightly shorter than among male Germans. In addition, we can create an entire matrix of such scatter plots:

library(car)
scatterplotMatrix(credit_DE2[,-c(2:6,9)],
                  smooth = FALSE)
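
A scatter plot matrix is often read together with the pairwise correlations of the same variables. This is a minimal sketch with base R `cor` on made-up values for the three numeric columns; `toy` is an assumption, not the real data.

```r
# Made-up values for the three numeric variables
toy <- data.frame(
  Age = c(25, 32, 47, 51, 38),
  Credit.amount = c(1200, 2500, 4000, 8000, 3000),
  Duration = c(12, 18, 24, 48, 24)
)

# Pearson correlation for every pair of columns
cor_matrix <- cor(toy)
print(round(cor_matrix, 2))
```

For the real data, the same call on `credit_DE2[, -c(2:6, 9)]` would cover exactly the columns shown in the matrix of scatter plots.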