#TASK1-3
credit_DE <- read.table('./german_credit_data.csv',
header = TRUE,
sep = ',',
dec = '.')
head(credit_DE)
## X Age Sex Job Housing Saving.accounts Checking.account Credit.amount
## 1 0 67 male 2 own <NA> little 1169
## 2 1 22 female 2 own little moderate 5951
## 3 2 49 male 1 own little <NA> 2096
## 4 3 45 male 2 free little little 7882
## 5 4 53 male 2 free little little 4870
## 6 5 35 male 1 free <NA> <NA> 9055
## Duration Purpose
## 1 6 radio/TV
## 2 48 radio/TV
## 3 12 education
## 4 42 furniture/equipment
## 5 24 car
## 6 36 education
str(credit_DE)
## 'data.frame': 1000 obs. of 10 variables:
## $ X : int 0 1 2 3 4 5 6 7 8 9 ...
## $ Age : int 67 22 49 45 53 35 53 35 61 28 ...
## $ Sex : chr "male" "female" "male" "male" ...
## $ Job : int 2 2 1 2 2 1 2 3 1 3 ...
## $ Housing : chr "own" "own" "own" "free" ...
## $ Saving.accounts : chr NA "little" "little" "little" ...
## $ Checking.account: chr "little" "moderate" NA "little" ...
## $ Credit.amount : int 1169 5951 2096 7882 4870 9055 2835 6948 3059 5234 ...
## $ Duration : int 6 48 12 42 24 36 24 36 12 30 ...
## $ Purpose : chr "radio/TV" "radio/TV" "education" "furniture/equipment" ...
#TASK4
1. Unit of observation: a German citizen with a credit history.
2. Sample size: 1000 observations (people).
3. Variables: X (index of the citizen in the list, nominal, not considered further), Age (ratio), Sex (nominal), Job (nominal), Housing (nominal), Saving.accounts (ordinal), Checking.account (ordinal), Credit.amount (ratio), Duration (ratio), Purpose (nominal).
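The ordinal scale of the savings and checking accounts can be made explicit in R with an ordered factor. The following is only a sketch on a made-up vector (`savings` is a hypothetical name, not a column of the data set), and it assumes the rank order little < moderate < quite rich < rich:

```r
# Toy vector; the values mirror the levels of Saving.accounts,
# but the data themselves are made up for illustration.
savings <- c("little", "rich", "moderate", "quite rich", "little")

# ordered = TRUE tells R that the levels have a rank order
# (assumed here: little < moderate < quite rich < rich).
savings_ord <- factor(savings,
                      levels = c("little", "moderate", "quite rich", "rich"),
                      ordered = TRUE)

is.ordered(savings_ord)      # checks whether the factor carries an order
savings_ord >= "moderate"    # comparisons are allowed on ordered factors
table(savings_ord)           # counts are reported in rank order
```

With a plain (unordered) factor the comparison in the second-to-last line would return NA with a warning, which is why the ordering matters for ordinal variables.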
#TASK5 The source of the data is the UCI Machine Learning Repository.
#TASK6
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyr)
credit_DE2 <- credit_DE[,-c(1,10)] %>%
drop_na() %>% #some rows have been deleted due to missing data
mutate(size_of_credit = ifelse(Credit.amount > mean(Credit.amount), 'big', 'small')) %>%
rename(Gender = Sex,
'size of credit' = size_of_credit) #some variables have been given new names
#a new variable (size of credit) has been added
credit_DE2$Job <- factor(credit_DE2$Job,
levels = c(0:3),
labels = c('unskilled and non-resident',
'unskilled and resident',
'skilled',
'highly skilled')) #manipulation: turning the numeric job code into a labelled factor (the other categorical variables are converted below)
credit_DE2$Gender <- factor(credit_DE2$Gender,
levels = c('male','female'),
labels = c('M','F')) #shorter labels for the values; moreover, once the variable is a factor, functions such as 'summary' report the number of observations per level instead of only the length and class
credit_DE2$`size of credit` <- factor(credit_DE2$`size of credit`)
credit_DE2$Saving.accounts <- factor(credit_DE2$Saving.accounts)
credit_DE2$Checking.account <- factor(credit_DE2$Checking.account)
credit_DE2$Housing <- factor(credit_DE2$Housing)
credit_DE2 <- credit_DE2[order(credit_DE2$Age),]
str(credit_DE2)
## 'data.frame': 522 obs. of 9 variables:
## $ Age : int 19 20 20 20 20 20 20 20 20 21 ...
## $ Gender : Factor w/ 2 levels "M","F": 2 2 2 1 2 1 2 2 1 1 ...
## $ Job : Factor w/ 4 levels "unskilled and non-resident",..: 2 3 3 3 3 3 3 2 4 3 ...
## $ Housing : Factor w/ 3 levels "free","own","rent": 3 3 2 2 2 3 3 3 3 3 ...
## $ Saving.accounts : Factor w/ 4 levels "little","moderate",..: 4 1 4 2 1 1 1 1 1 2 ...
## $ Checking.account: Factor w/ 3 levels "little","moderate",..: 2 1 2 1 2 2 1 2 1 2 ...
## $ Credit.amount : int 983 1282 1577 674 1967 585 2039 2718 1107 3031 ...
## $ Duration : int 12 12 11 12 24 12 18 24 12 45 ...
## $ size of credit : Factor w/ 2 levels "big","small": 2 2 2 2 2 2 2 2 2 2 ...
head(credit_DE2)
## Age Gender Job Housing Saving.accounts Checking.account
## 207 19 F unskilled and resident rent rich moderate
## 89 20 F skilled rent little little
## 95 20 F skilled own rich moderate
## 108 20 M skilled own moderate little
## 217 20 F skilled own little moderate
## 268 20 M skilled rent little moderate
## Credit.amount Duration size of credit
## 207 983 12 small
## 89 1282 12 small
## 95 1577 11 small
## 108 674 12 small
## 217 1967 24 small
## 268 585 12 small
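The sample shrinks from 1000 to 522 observations because every row with at least one missing value is removed. A minimal sketch, on a made-up data frame (`toy` and its columns are hypothetical), of how such a loss can be counted beforehand and how the mean-based split into 'big' and 'small' behaves:

```r
# Made-up data: one row has a missing savings value
toy <- data.frame(
  amount  = c(1000, 5000, 2000, 8000),
  savings = c(NA, "little", "little", "rich"),
  stringsAsFactors = FALSE
)

# complete.cases() flags the rows without any NA
n_incomplete <- sum(!complete.cases(toy))
n_incomplete

# na.omit() is the base R counterpart of tidyr::drop_na()
toy_clean <- na.omit(toy)

# Same rule as in the pipeline: above the mean -> 'big', otherwise 'small'.
# The mean is computed on the rows that remain after dropping the NAs.
toy_clean$size <- ifelse(toy_clean$amount > mean(toy_clean$amount), "big", "small")
toy_clean
```

Note that a value exactly equal to the mean is labelled 'small', because the comparison is strict.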
#TASK7
mean(credit_DE2$Age)
## [1] 34.88889
median(credit_DE2$Duration)
## [1] 18
range(credit_DE2$Credit.amount)
## [1] 276 18424
sd(credit_DE2$Credit.amount)
## [1] 2929.155
Now we come to the functions that offer more extensive descriptive statistics on the variables. 1. The advantage of ‘summary’ is that it gives descriptive statistics for the numeric variables and, for every categorical variable that is stored as a factor, the number of observations per value.
summary(credit_DE2)
## Age Gender Job Housing
## Min. :19.00 M:354 unskilled and non-resident: 14 free: 65
## 1st Qu.:26.00 F:168 unskilled and resident :116 own :349
## Median :31.50 skilled :313 rent:108
## Mean :34.89 highly skilled : 79
## 3rd Qu.:41.00
## Max. :75.00
## Saving.accounts Checking.account Credit.amount Duration
## little :412 little :245 Min. : 276 Min. : 6.00
## moderate : 64 moderate:224 1st Qu.: 1298 1st Qu.:12.00
## quite rich: 23 rich : 53 Median : 2326 Median :18.00
## rich : 23 Mean : 3279 Mean :21.34
## 3rd Qu.: 3971 3rd Qu.:26.75
## Max. :18424 Max. :72.00
## size of credit
## big :181
## small:341
##
##
##
##
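The effect of the factor conversion on ‘summary’ can be seen on a small made-up vector (`sex_chr` and `sex_fct` are hypothetical names used only for this sketch):

```r
sex_chr <- c("male", "female", "male", "male")

# On a character vector, summary() only reports length, class and mode
summary(sex_chr)

# On a factor, summary() counts the observations per level
sex_fct <- factor(sex_chr, levels = c("male", "female"), labels = c("M", "F"))
summary(sex_fct)
```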
2. ‘describeBy’ reports several descriptive statistics for a numeric variable, grouped by the values of a categorical variable. For example, here are the statistics on the credit amount within each job group:
library(psych)
describeBy(credit_DE2$Credit.amount, group = credit_DE2$Job)
##
## Descriptive statistics by group
## group: unskilled and non-resident
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 14 1767.86 1088.9 1325.5 1698.58 762.06 609 3758 3149 0.77 -1.04
## se
## X1 291.02
## ------------------------------------------------------------
## group: unskilled and resident
## vars n mean sd median trimmed mad min max range skew
## X1 1 116 2250.72 2032.63 1533.5 1871.33 1142.34 276 11998 11722 2.36
## kurtosis se
## X1 6.82 188.73
## ------------------------------------------------------------
## group: skilled
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 313 3129.13 2515.5 2415 2715.61 1746.5 338 15945 15607 2.06 5.7
## se
## X1 142.18
## ------------------------------------------------------------
## group: highly skilled
## vars n mean sd median trimmed mad min max range skew
## X1 1 79 5648.78 4236.66 4439 5118.11 3805.83 1050 18424 17374 0.98
## kurtosis se
## X1 0.14 476.66
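If the psych package is not available, a reduced version of such a grouped summary can be sketched in base R. The data frame `toy` below is made up and the choice of statistics is only an assumption about what is of interest; it is not a replacement for the full ‘describeBy’ output:

```r
# Made-up data: credit amounts for two job groups
toy <- data.frame(
  amount = c(1000, 3000, 2000, 6000, 4000, 8000),
  job    = factor(c("skilled", "skilled", "skilled",
                    "highly skilled", "highly skilled", "highly skilled"))
)

# One statistic per group
group_means <- tapply(toy$amount, toy$job, mean)
group_means

# Several statistics per group in a single call
aggregate(amount ~ job, data = toy,
          FUN = function(x) c(n = length(x), mean = mean(x), sd = sd(x)))
```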
3. ‘stat.desc’ is another useful function, which returns a considerable number of descriptive statistics for a numeric variable or for the whole data set (it cannot be applied to a categorical variable).
library(pastecs)
##
## Attaching package: 'pastecs'
## The following object is masked from 'package:tidyr':
##
## extract
## The following objects are masked from 'package:dplyr':
##
## first, last
round(stat.desc(credit_DE2$Credit.amount), 2)
## nbr.val nbr.null nbr.na min max range
## 522.00 0.00 0.00 276.00 18424.00 18148.00
## sum median mean SE.mean CI.mean.0.95 var
## 1711505.00 2326.50 3278.75 128.21 251.86 8579950.05
## std.dev coef.var
## 2929.16 0.89
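Some of these values can be reproduced by hand from the mean, the standard deviation and the number of observations. The sketch below assumes that CI.mean.0.95 is the half-width of a t-based 95% confidence interval; the inputs are the rounded values from the output above, so the results should agree with it only up to rounding:

```r
# Values taken from the output above
n <- 522       # nbr.val
m <- 3278.75   # mean
s <- 2929.16   # std.dev

se       <- s / sqrt(n)                 # standard error of the mean
ci_half  <- qt(0.975, df = n - 1) * se  # half-width of the 95% confidence interval
coef_var <- s / m                       # coefficient of variation

round(c(SE.mean = se, CI.mean.0.95 = ci_half, coef.var = coef_var), 2)
```

The confidence interval for the mean credit amount is then the mean plus and minus `ci_half`.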
#TASK8 1.Histogram
#install.packages('ggplot2')
library(ggplot2)
##
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
##
## %+%, alpha
ggplot(credit_DE2, aes(x = Age)) +
geom_histogram(binwidth = 5, colour = 'grey', fill = 'darkolivegreen', linetype = 'longdash') +
ylab('Frequency')+
xlab('Age')+
theme_dark()
The histogram is skewed to the right. In the context of this data set
this means that most of the Germans who take out a loan are between 20
and roughly 40-45 years old, while older borrowers are much less frequent.
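The right skew can also be checked numerically: in the summary above the mean age (34.89) lies above the median (31.50), which is typical for a right-skewed distribution. A minimal sketch with a hypothetical helper `skewness_moment` and made-up ages:

```r
# Hypothetical helper: moment-based sample skewness (third standardized moment)
skewness_moment <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

ages <- c(20, 22, 23, 25, 26, 28, 30, 35, 50, 70)  # made-up, long right tail

mean(ages) > median(ages)   # mean above median hints at right skew
skewness_moment(ages) > 0   # a positive coefficient indicates right skew
```

For the real data the same helper could be applied to `credit_DE2$Age`; the skew column of the psych output is based on a comparable moment formula, although the exact variant may differ.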
#install.packages('ggplot2')
library(ggplot2)
ggplot(credit_DE2, aes(x = Age, fill = Gender)) +
geom_histogram(binwidth = 5, colour = 'grey', fill = 'darkolivegreen', linetype = 'longdash') +
ylab('Frequency')+
xlab('Age')+
facet_wrap(~Gender, ncol = 1)+
theme_dark()
#install.packages('ggplot2')
library(ggplot2)
ggplot(credit_DE2, aes(x = Age, fill = Gender)) +
geom_histogram(binwidth = 5, colour = 'grey', linetype = 'longdash', position = position_dodge(width = 7)) +
ylab('Frequency')+
xlab('Age')+
theme_dark()
2.Boxplot
library(ggplot2)
ggplot(credit_DE2, aes(y = Duration, x = Gender, fill = Gender)) +
geom_boxplot() +
xlab('Gender')
As we can see, there are quite a lot of observations (especially among
female German citizens who have taken out a loan) with a duration of
48 months or more, so it makes no sense to exclude them.
However, we can delete the observations with a duration of 60 months or
more:
head(credit_DE2[order(-credit_DE2$Duration),])
## Age Gender Job Housing Saving.accounts Checking.account
## 359 24 M skilled own moderate moderate
## 177 24 F highly skilled own moderate moderate
## 379 27 M highly skilled own little moderate
## 508 36 M skilled rent little little
## 490 42 M skilled free little moderate
## 200 60 F highly skilled free moderate moderate
## Credit.amount Duration size of credit
## 359 5595 72 big
## 177 7408 60 big
## 379 14027 60 big
## 508 7297 60 big
## 490 6288 60 big
## 200 14782 60 big
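For comparison, the points that a boxplot draws individually are those lying more than 1.5 interquartile ranges beyond the box. The sketch below computes this upper limit from the quartiles of Duration reported by ‘summary’ above (for the whole sample, not per gender) and then shows the same rule on made-up durations; the quartile definition used by the plotting functions may differ slightly:

```r
# Quartiles of Duration for the whole sample, taken from the summary() output
q1 <- 12
q3 <- 26.75

# Upper limit of the 1.5 * IQR rule
upper_fence <- q3 + 1.5 * (q3 - q1)
upper_fence   # durations above this value would be drawn as individual points

# boxplot.stats() applies the same rule; made-up durations for illustration
durations <- c(6, 12, 12, 18, 18, 24, 24, 30, 36, 72)
boxplot.stats(durations)$out
```

The cut-off of 60 months used here is therefore a deliberate choice that is stricter than the boxplot rule, which keeps the many loans of around 48 months in the data.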
credit_DE2BP <- credit_DE2[credit_DE2$Duration < 60,]
library(ggplot2)
ggplot(credit_DE2BP, aes(y = Duration, x = Gender, fill = Gender))+
geom_boxplot()+
xlab('Gender')
Explanation: regardless of the gender of a German citizen with a loan,
50% of the borrowing periods are shorter than about 18 months, while the
other 50% are longer than 18 months. Additionally, the longest borrowing
periods are observed among male Germans. Moreover, the interquartile
range (q1 to q3, not Q1 to Q3, because it is a sample and not the whole
population) is somewhat smaller among female citizens than among male
citizens, which means that the duration of their loans mostly varies
within a narrower range.
3.Scatter plot
library(ggplot2)
ggplot(credit_DE2, aes(x = Credit.amount, y = Duration))+
geom_point(colour = 'skyblue4')
Another way to create a scatter plot (without ggplot):
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:psych':
##
## logit
## The following object is masked from 'package:dplyr':
##
## recode
scatterplot(Duration~Credit.amount | Gender,
ylab = 'Duration',
xlab = 'Credit.amount',
data = credit_DE2,
smooth = FALSE)
The scatter plot shows that, for higher credit amounts, female Germans
borrow on average for a longer period of time than male Germans. On the
other hand, for lower credit amounts, the duration of the loan is
slightly shorter among female Germans than among male Germans.
In addition, we can create an entire matrix of such scatter plots:
library(car)
scatterplotMatrix(credit_DE2[,-c(2:6,9)],
smooth = FALSE)
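The scatter plot matrix has a numeric counterpart in the correlation matrix, which puts one number on each pairwise relationship. A minimal sketch on made-up values (the column names are borrowed from the data set, the numbers are not):

```r
# Made-up data with the same three numeric columns as in the matrix above
toy <- data.frame(
  Age           = c(22, 35, 41, 28, 53, 60),
  Credit.amount = c(1200, 5000, 2500, 8000, 3000, 10000),
  Duration      = c(6, 24, 12, 36, 18, 48)
)

# Pairwise Pearson correlations; the matrix is symmetric with 1 on the diagonal
cor_mat <- cor(toy)
round(cor_mat, 2)
```

For the real data the same call could be applied to `credit_DE2[,-c(2:6,9)]`, i.e. to the columns shown in the scatter plot matrix.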