Some preparations need to be made before we show more examples of data visualization using the bank credit data. The preparation usually starts with loading the R packages or libraries. Install the packages locally first if you have not already done so, e.g. install.packages("flexdashboard").
# load necessary library and packages for the analysis
library(dplyr)
library(tidyr)
library(readxl)
library(RColorBrewer)
library(ggplot2)
library(knitr)
library(png)
library(animation)
library(gifski)
library(table1)
library(arsenal)
library(gtsummary)
library(gt)
library(corrplot)
library(xtable)
library(rcompanion)
library(car)

In R, we can simply simulate the data generating process (DGP), i.e. randomly draw numbers from a known distribution with preset parameters. Simply speaking, we are “playing God” here. R provides a complete set of functions for each family of probability distributions, to name a few: normal, uniform and exponential for continuous random variables; Bernoulli, binomial and Poisson for discrete ones, etc. To simulate data from a probability distribution, say the normal distribution, call the distribution’s “random number generator”, e.g. rnorm(n = 1000, mean = 5, sd = 2.1), which simulates 1000 data points from a normally distributed random variable with mean 5 and standard deviation 2.1, as shown in the r-chunk below.
# take 1000 samples from a normal distribution with mean of 5 and standard deviation of 2.1
gaussian_sample = rnorm(n = 1000, mean = 5, sd = 2.1)
# take 1000 samples from a uniform distribution between -5 and 20
unif_sample = runif(n = 1000, min = -5, max = 20)
# take 1000 samples from a poisson distribution with lambda = 3
pois_sample = rpois(n = 1000, lambda = 3)
# take 1000 samples from an exponential distribution with rate = 2
exp_sample = rexp(n = 1000, rate = 2)

We can also let R compute the density curve for us. dnorm() computes the probability density function of the normal distribution, i.e. \(f(x)\), which can be used to plot or visualize the density curve.
# first, create a `grid` for x
x = seq(-5, 15, length=1000)
# second, compute the probability density function f(x)
gx = dnorm(x,mean = 5, sd = 2.1)
# plot f(x) vs. x, i.e. pdf
plot(x, gx, type="l", lty=1, xlab="x", col = 'red',
ylab="Probability Density")
title("Normal Distribution", line = 0.5) ### Import and Preparing Data We shall continue with our bank credit risk analysis using data set of “Credit Risk Data.xlsx”. We have produced a scatter plot of
Savings against Age and a grouped bar chart show frequencies of Loan Purpose by Gender. A simplified version of R script codes is copied here in the r-chunk below. See last week’s “L1Rscript.R” for details.
## Data Preparation:
# Import the data set we need and make the necessary preparations.
# change to desired working directory with all relevant files including R script, Rmarkdown, data file, etc.
#setwd("~/Dropbox/Teaching/NUS IS5126/Rscripts/probStats/")
# Data Input and Basic Data Preparation
# import data using the readxl::read_excel function; make sure the Excel file 'Credit Risk Data.xlsx' is in the current directory.
credit = read_excel(path = 'Credit Risk Data.xlsx', sheet = 'Base Data', skip = 2)
# display the first few rows of the data; 'credit' is a data frame with 425 obs. of 12 variables.
head(credit)
# convert a string variable to a factor (binary).
credit$`Credit Risk` = as.factor(credit$`Credit Risk`)
# similarly, convert the string variable 'credit$`Job`' to an ordered factor (categorical); the order is manually specified
credit$Job = factor(credit$Job, levels = c("Unemployed", "Unskilled", "Skilled", "Management"), ordered = TRUE)
# prepare all other data types
credit$`Loan Purpose` = as.factor(credit$`Loan Purpose`)
credit$Gender = as.factor(credit$Gender)
# change the labels for categorical variable `Gender`
levels(credit$Gender)[levels(credit$Gender) == "F"] = "Female"
levels(credit$Gender)[levels(credit$Gender) == "M"] = "Male"
credit$`Marital Status` = as.factor(credit$`Marital Status`)
credit$Housing = as.factor(credit$Housing)

### Law of Large Numbers and Central Limit Theorem

The law of large numbers (LLN) and the central limit theorem (CLT) are two cornerstones of modern statistics. These two theorems are key to understanding why we always prefer a bigger data set and why the most common probability distribution in statistics is the normal distribution (hence the name). In short, speaking in terms of the sample mean: the LLN says that \(\bar{X}\) converges to the true mean as the sample size grows, while the CLT says that the sampling distribution of \(\bar{X}\) approaches a normal distribution.
The first simulation shows how the law of large numbers works. The random variable \(X\) itself has a normal distribution. Our goal is to show that the sample mean \(\bar{X}\) (again, it is key to understand that \(\bar{X}\) is a random variable) converges in probability to the true mean \(\mu_X\) as the sample size \(n\) gets larger. The simulation repeatedly draws samples of increasing size \(n\), computes the sample mean for each, and plots the sample means against \(n\); a sketch is given after the next paragraph.
Observe that realizations of the sample mean \(\bar{X}\) become closer to the true mean as \(n\) gets larger. In other words, the standard deviation of the sampling distribution of \(\bar{X}\) becomes smaller as \(n\) increases.
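A minimal sketch of such a simulation is given below; it is illustrative rather than the original classroom script, and names such as n_grid and xbar are made up here.

# LLN sketch (illustrative): realizations of the sample mean settle around the true mean as n grows
set.seed(2021)
n_grid = seq(10, 10000, by = 10)          # increasing sample sizes
xbar = sapply(n_grid, function(n) mean(rnorm(n, mean = 5, sd = 2.1)))
plot(n_grid, xbar, type = "l", xlab = "sample size n", ylab = "sample mean",
     main = "LLN: sample mean converges to the true mean 5")
abline(h = 5, col = "red", lty = 2)       # true mean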
The second simulation shows how the central limit theorem works. The random variable \(X\) itself has an exponential distribution (nothing like a normal distribution). Our goal is to show that the sample mean \(\bar{X}\) converges in distribution to a normal distribution as the sample size \(n\) gets larger. The simulation repeatedly draws samples of size \(n\), computes the sample mean of each, and compares the resulting empirical density with the theoretical normal curve; a sketch is given after the next paragraph.
Observe that the sampling distribution of \(\bar{X}\) looks more and more like a normal distribution as \(n\) gets larger. Put another way, the empirical density curve (red solid) gets closer to the theoretical normal density curve (blue dashed) as \(n\) increases.
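Again, a minimal sketch is given below; it assumes \(X \sim \text{Exponential}(\text{rate} = 2)\) as in the earlier r-chunk (true mean 1/2, true standard deviation 1/2), and the names nsim and xbar are illustrative.

# CLT sketch (illustrative): sampling distribution of the mean of exponential data
set.seed(2021)
nsim = 2000        # number of repeated samples
n    = 50          # sample size; increase n to see an even better normal approximation
xbar = replicate(nsim, mean(rexp(n, rate = 2)))
hist(xbar, breaks = 40, freq = FALSE, xlab = expression(bar(X)),
     main = "Sampling distribution of the sample mean (n = 50)")
lines(density(xbar), col = "red", lwd = 2, lty = 1)          # empirical density ("red solid")
curve(dnorm(x, mean = 1/2, sd = (1/2)/sqrt(n)), add = TRUE,
      col = "blue", lwd = 2, lty = 2)                        # theoretical normal ("blue dashed")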
A summary statistics table collects the key statistical measures for the variables of interest in one table. The following R code shows how to produce one with the gtsummary package.
## Summary Statistics Tables
# summary statistics table with `table1`
#sumstat1 = table1::table1(~ Checking + Savings + `Marital Status` + `Credit Risk` | Gender, data = credit)
#print(sumstat1)
# summary statistics table with `arsenal`
#sumstat2 = arsenal::tableby(~ Savings + Age + Housing , data = credit)
#summary(sumstat2, text=FALSE)
# summary statistics table with `gtsummary`
sumstat3 =
credit %>%
select(Checking, Savings, Age, Gender) %>%
tbl_summary(
by = Gender,
type = all_continuous() ~ "continuous2",
statistic = all_continuous() ~ c("{mean} ({sd})", "{median} ({p25}, {p75})" , "{min}, {max}"),
missing = "no"
) %>%
modify_header(stat_by = md("**{level}**<br>N = {n} ({style_percent(p)}%)")) %>%
modify_spanning_header(all_stat_cols() ~ "**Gender**") %>%
add_n() %>%
add_p() %>%
bold_labels() %>%
italicize_levels()
sumtable = as_gt(sumstat3)
sumtable
| Characteristic | N | Gender: Female, N = 135 (32%) | Gender: Male, N = 290 (68%) | p-value¹ |
|---|---|---|---|---|
| Checking | 425 |  |  | >0.9 |
| Mean (SD) |  | 952 (2,938) | 1,093 (3,244) |  |
| Median (IQR) |  | 0 (0, 569) | 0 (0, 554) |  |
| Range |  | 0, 19,766 | 0, 19,812 |  |
| Savings | 425 |  |  | 0.14 |
| Mean (SD) |  | 2,075 (3,917) | 1,690 (3,439) |  |
| Median (IQR) |  | 680 (324, 964) | 558 (208, 898) |  |
| Range |  | 0, 19,568 | 0, 19,811 |  |
| Age | 425 |  |  | <0.001 |
| Mean (SD) |  | 32 (12) | 36 (10) |  |
| Median (IQR) |  | 28 (23, 36) | 34 (27, 41) |  |
| Range |  | 18, 66 | 19, 73 |  |

¹ Wilcoxon rank sum test
A Pearson correlation matrix computes and shows Pearson’s \(\rho\) between all pairs of continuous variables. The cor() function in base R easily does the job, and we further use the corrplot package to visualize the correlation matrix in a nice-looking plot.
## Pearson's correlation matrix
# note that Pearson's correlation only applies to continuous numeric data
credit.cont = credit %>%
select(Checking, Savings, `Months Customer`, `Months Employed`, `Age`)
# use base R's `cor()` function to produce the matrix
credit.corrmat = cor(credit.cont)
credit.corrmat
## Checking Savings Months Customer Months Employed
## Checking 1.000000000 0.02007924 -0.036226269 -0.009190981
## Savings 0.020079239 1.00000000 -0.050440712 0.048977962
## Months Customer -0.036226269 -0.05044071 1.000000000 0.054455798
## Months Employed -0.009190981 0.04897796 0.054455798 1.000000000
## Age -0.002369302 -0.02828589 -0.003785909 0.306798529
## Age
## Checking -0.002369302
## Savings -0.028285886
## Months Customer -0.003785909
## Months Employed 0.306798529
## Age 1.000000000
# visualize correlation matrix with "corrplot" package
corrplot(credit.corrmat, type = "full", order = "hclust", method = "shade",
         tl.col = "black", tl.srt = 45)

As another illustration, the same workflow can be applied to the built-in mtcars data set, this time arranging the lower triangle of the correlation matrix into a table with xtable.

# build a rounded correlation matrix for the continuous mtcars variables
mtcars.corrmat = mtcars %>%
select(-c("vs", "am","gear","carb")) %>%
cor() %>%
round(2)
lower = mtcars.corrmat
lower[upper.tri(lower)] = ""
lower = as.data.frame(lower)
View(lower)
corrtbl = xtable(lower, type = "html")
print(corrtbl)
## % latex table generated in R 4.0.3 by xtable 1.8-4 package
## % Fri Jan 22 16:39:00 2021
## \begin{table}[ht]
## \centering
## \begin{tabular}{rlllllll}
## \hline
## & mpg & cyl & disp & hp & drat & wt & qsec \\
## \hline
## mpg & 1 & & & & & & \\
## cyl & -0.85 & 1 & & & & & \\
## disp & -0.85 & 0.9 & 1 & & & & \\
## hp & -0.78 & 0.83 & 0.79 & 1 & & & \\
## drat & 0.68 & -0.7 & -0.71 & -0.45 & 1 & & \\
## wt & -0.87 & 0.78 & 0.89 & 0.66 & -0.71 & 1 & \\
## qsec & 0.42 & -0.59 & -0.43 & -0.71 & 0.09 & -0.17 & 1 \\
## \hline
## \end{tabular}
## \end{table}
# visualize correlation matrix with "corrplot" package
corrplot(mtcars.corrmat, type = "lower", order = "hclust", method = "circle",
tl.col = "black", tl.srt = 45)Sales Transactions.xlsxSales Transactions.xlsx contains the records of all sale transactions for a day, July 14. Each of the column is defined as follows:
CustID : Unique identifier for a customerRegion: Region of customer’s home addressPayment: Mode of payment used for the sales transactionTransction Code: Numerical code for the sales transactionSource: Source of the sales (whether it is through the Web or email)Amount: Sales amountProduct: Product bought by customerTime Of Day: time in which the sale transaction took place.#import data file into RStudio
ST <- read_excel("Sales Transactions.xlsx", col_types = c("numeric", "text", "text", "numeric", "text", "numeric", "text", "date")) Let’s use this Sales Transaction data to conduct interval estimation and hypotheses testing.
A confidence interval with confidence level \((1-\alpha)\) is a range of values between which the parameter is believed to lie, together with a statement of how often intervals constructed this way capture the true parameter. The general "formula" for a CI is:

\[ \text{point est. } \pm \text{ crit. val.(sampling dist., } \alpha) \times \text{S.E.} \]

Question: how do we understand the concept of a 95% confidence interval? What does it mean to be "95% confident that the parameter is in between"?
Let's look at a simulation for a 50% confidence interval; a sketch is given below.
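This is a coverage sketch, not the original classroom code; the distribution and parameter values (normal with mean 10 and standard deviation 3, sample size 30) are assumptions for illustration only. It repeatedly draws a sample, builds a 50% CI from it, and records whether the interval captures the true mean.

# coverage sketch (illustrative): roughly half of all 50% CIs should capture the true mean
set.seed(2021)
mu = 10; sigma = 3; n = 30; nsim = 1000
covered = replicate(nsim, {
  x = rnorm(n, mean = mu, sd = sigma)
  half.width = qt(0.75, df = n - 1) * sd(x) / sqrt(n)   # 50% CI => alpha = 0.5, use the 75th percentile
  (mean(x) - half.width <= mu) & (mu <= mean(x) + half.width)
})
mean(covered)   # empirical coverage, close to 0.5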
Consider how to obtain the following interval estimates:

- Compute the 95% confidence interval for the mean of Amount.
- Compute the 99% confidence interval for the proportion of book sales transactions with sales amount greater than $50.
Let’s first compute the 95% confidence interval for the mean of Amount.
#compute manually 95% CI for mean `Amount`
uCIamt95 = mean(ST$Amount) -
qt(0.025, df=nrow(ST)-1) * sd(ST$Amount)/sqrt(nrow(ST)) #upper CI
#NOTE: qt(.025) gives a negative value here so we use mean - error to compute upper CI
lCIamt95 = mean(ST$Amount) +
qt(0.025, df=nrow(ST)-1) * sd(ST$Amount)/sqrt(nrow(ST)) #lower CI
#print result
print(cbind(lCIamt95, uCIamt95), digits=4)
## lCIamt95 uCIamt95
## [1,] 34.76 45.13
Now let's compute the 99% confidence interval for the proportion of book sales transactions with sales amount greater than $50.
Q: How is the confidence interval for proportion different from that of mean?
#compute 99% CI for the proportion of book transactions with Amount > 50
book = ST %>% filter(Product=="Book")
bk50 = book %>% filter(Amount>50)
pbk50 = nrow(bk50)/nrow(book) # proportion of books which cost greater than $50
# compute the lower and upper bounds of CI
lCIpbk50 <- pbk50 + (qnorm(0.005) * sqrt(pbk50*(1-pbk50)/nrow(book)))
uCIpbk50 <- pbk50 - (qnorm(0.005) * sqrt(pbk50*(1-pbk50)/nrow(book)))
# print CI
print(cbind(lCIpbk50, uCIpbk50), digits=3)
## lCIpbk50 uCIpbk50
## [1,] 0.139 0.267
Suppose the manager would like to draw some conclusions from the sales transactions data. He does not believe there are outliers in the data, so you may leave the data as is. He would like to set up and test several hypotheses. The first: is the Amount greater than $50 in at least 10 percent of book sales transactions? There are two approaches to conducting the one-sided test here: compare the p-value with the significance level, or compare the test statistic with the critical value.
Let's see how to conduct the one-sided test on the proportion with the second approach. Write out the hypothesis system: \[\begin{align*} &H_0: p \ge 0.1 \\ &H_A: p < 0.1 \end{align*}\]

If the null hypothesis is true, any observed proportion at or above 10% is perfectly plausible. The observed proportion has to be significantly lower than 10% (in the left critical region) for the null hypothesis to be rejected.
# compute z-statistic for proportion. The next 3 lines have been executed earlier.
# book<-ST %>% filter(Product=="Book")
# bk50<- book %>% filter(Amount>50)
# pbk50<-nrow(bk50)/nrow(book)
z = (pbk50 - 0.10) / sqrt(0.1*(1-0.1)/nrow(book))
z
## [1] 5.550227
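The two outputs below were presumably produced by comparing z against the left-tail critical value, along the lines of:

crit = qnorm(0.05)   # left-tail critical value at the 5% significance level
crit
z < crit             # TRUE would mean z falls in the left critical region (reject H0)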
## [1] -1.644854
## [1] FALSE
H0: Proportion of book sales transactions with Amount greater than $50 >= 0.1;
H1: Proportion of book sales transactions with Amount greater than $50 < 0.1

Since the test statistic \(z\) is larger than the (left-tail) critical value, we fail to reject the null hypothesis. The data are therefore consistent with the claim that the proportion of book sales transactions with Amount greater than $50 is at least 10%.
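The manager also asks whether the mean Amount differs between Books and DVDs. The Welch two-sample t-test output below was presumably produced by a call along these lines (assuming Product takes only the two values "Book" and "DVD"):

# Welch two-sample t-test comparing mean Amount between the two product groups
t.test(ST$Amount ~ ST$Product)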
##
## Welch Two Sample t-test
##
## data: ST$Amount by ST$Product
## t = 8.0304, df = 260.96, p-value = 3.344e-14
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 27.47079 45.31916
## sample estimates:
## mean in group Book mean in group DVD
## 56.21559 19.82062
H0: Mean amount for Books = Mean amount for DVD;
H1: Mean amount for Books != Mean amount for DVD
From the t-test output, the p-value is extremely small, in particular smaller than any conventional cutoff. Therefore, the data show very strong evidence against the null hypothesis, and we conclude that the population mean amounts for Books and DVDs are very unlikely to be the same.
To test the equality of means across multiple samples (or groups), the one-way ANOVA test is the top choice given its better performance, i.e. higher test power. Parametric tests such as ANOVA, however, require stronger assumptions (no free lunch!).
ANOVA Assumption 1: Check Normality!
# First check if ANOVA assumptions are met
# check normality
# plot histogram
par(mfcol=c(2,2))
E <-ST %>% filter(Region=="East")
W <-ST %>% filter(Region=="West")
N <-ST %>% filter(Region=="North")
S <-ST %>% filter(Region=="South")
hist(E$Amount, main="Histogram for `East`", xlab="Amount")
hist(W$Amount, main="Histogram for `West`", xlab="Amount")
hist(N$Amount, main="Histogram for `North`", xlab="Amount")
hist(S$Amount, main="Histogram for `South`", xlab="Amount")
# plot qqplots
par(mfcol=c(2,2))
qqnorm(E$Amount, main="QQplot for `East`", xlab="Amount")
qqline(E$Amount)
qqnorm(W$Amount, main="QQplot for `West`", xlab="Amount")
qqline(W$Amount)
qqnorm(N$Amount, main="QQplot for `North`", xlab="Amount")
qqline(N$Amount)
qqnorm(S$Amount, main="QQplot for `South`", xlab="Amount")
qqline(S$Amount)
# Shapiro-Wilk test
lapply(list(E,W,N,S),
function(sa)
{
shapiro.test(sa$Amount)
}) # applying the test to each region's Amount in the list
## [[1]]
##
## Shapiro-Wilk normality test
##
## data: sa$Amount
## W = 0.46655, p-value < 2.2e-16
##
##
## [[2]]
##
## Shapiro-Wilk normality test
##
## data: sa$Amount
## W = 0.41315, p-value < 2.2e-16
##
##
## [[3]]
##
## Shapiro-Wilk normality test
##
## data: sa$Amount
## W = 0.36983, p-value < 2.2e-16
##
##
## [[4]]
##
## Shapiro-Wilk normality test
##
## data: sa$Amount
## W = 0.45291, p-value < 2.2e-16
##
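The regional group sizes shown below were presumably obtained from a simple frequency table:

# number of transactions per region
table(ST$Region)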
## East North South West
## 98 85 99 190
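The boxplot referred to in the next paragraph can be reproduced with a call along these lines (a presumed reconstruction, not the original chunk):

# compare the distribution of Amount across the four regions
boxplot(Amount ~ Region, data = ST, xlab = "Region", ylab = "Amount")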
The boxplot shows that the mean amounts of the four regions are very close, which we need to test formally.
From both the diagnostic graphics and the Shapiro-Wilk tests, we conclude that the normality assumption is violated, mainly because of clusters of outliers. Keep in mind that the ANOVA test is somewhat robust to non-normality but rather sensitive to heteroskedasticity, i.e. unequal variances across groups, which we check next.
ANOVA Assumption 2: Check Equal Variance!
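The homogeneity-of-variance tests discussed below were presumably run along these lines (leveneTest() comes from the car package loaded earlier; only the Fligner-Killeen output is reproduced here):

# Fligner-Killeen test of homogeneity of variances (robust to non-normality)
fligner.test(Amount ~ Region, data = ST)
# Levene's test from the `car` package
leveneTest(ST$Amount, as.factor(ST$Region))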
##
## Fligner-Killeen test of homogeneity of variances
##
## data: Amount by Region
## Fligner-Killeen:med chi-squared = 5.9472, df = 3, p-value = 0.1142
Given the departure from normality, we choose to conduct the Fligner-Killeen and Levene's tests for equal variances. Both fail to reject the null hypothesis (the null: equal variances), so we are fairly confident that the homoskedasticity assumption is met. One-way ANOVA is the way to go:
Run One-Way ANOVA
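The ANOVA table below was presumably produced with a call along these lines:

# one-way ANOVA of Amount on Region
summary(aov(ST$Amount ~ as.factor(ST$Region)))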
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(ST$Region) 3 6549 2183 0.663 0.575
## Residuals 468 1540965 3293
H0: Mean amount is the same for all regions;
H1: Mean amount for at least one region differs from the others
The rather large p-value from the one-way ANOVA shows little evidence against the null hypothesis. We conclude that the mean amounts of the four regions are very likely the same, which confirms our preliminary observation from the boxplot at the very beginning.