Statistical Metrics for Descriptive Analytics

Preparation

Some preparations need to be made before we show more examples of data visualization using the bank credit data. The preparation usually starts with loading the R packages or libraries. Install the packages locally if they have not been installed yet, e.g. install.packages("flexdashboard").

Loading Packages

# load necessary library and packages for the analysis
library(dplyr) 
library(tidyr)
library(readxl)
library(RColorBrewer)
library(ggplot2)
library(knitr)
library(png)

library(animation)
library(gifski)
library(table1)
library(arsenal)
library(gtsummary)
library(gt)
library(corrplot)
library(xtable)
library(rcompanion)
library(car)

Probability Distributions

In R, we could simply simulate the data generating process (DGP), i.e. randomly draw numbers from a known distribution with preset parameters. Simply speaking, we are “playing God” here. R provides a complete set of functions associated with each family of probability distributions, to name a few: normal, uniform and exponential for continuous r.v.; Bernoulli, binomial and Poisson for discrete ones, etc. To simulate data from a probability distribution, say the normal distribution, call the distribution’s “random number generator”: rnorm(n = 1000, mean = 5, sd = 2.1) simulates 1000 data points from a normally distributed random variable with mean 5 and standard deviation 2.1, as shown in the r-chunk below.

# take 1000 samples from a normal distribution with mean of 5 and standard deviation of 2.1
gaussian_sample = rnorm(n = 1000, mean = 5, sd = 2.1)

# take 1000 samples from a uniform distribution between -5 and 20
unif_sample = runif(n = 1000, min = -5, max = 20)

# take 1000 samples from a poisson distribution with lambda = 3
pois_sample = rpois(n = 1000, lambda = 3)

# take 1000 samples from an exponential distribution of rate = 2
exp_sample = rexp(n = 1000, rate = 2)

We could also let R compute the density curve for us. dnorm() computes the probability density function of the normal distribution, i.e. \(f(x)\), which can be used to plot or visualize the density curve.

# first, create a `grid` for x
x = seq(-5, 15, length=1000)
# second, compute the probability density function f(x)
gx = dnorm(x,mean = 5, sd = 2.1)
# plot f(x) vs. x, i.e. pdf
plot(x, gx, type="l", lty=1, xlab="x", col = 'red',
     ylab="Probability Density")
title("Normal Distribution", line = 0.5)

Import and Prepare Data

We shall continue with our bank credit risk analysis using the data set “Credit Risk Data.xlsx”. We have previously produced a scatter plot of Savings against Age and a grouped bar chart showing frequencies of Loan Purpose by Gender. A simplified version of the R script is copied in the r-chunk below. See last week’s “L1Rscript.R” for details.

## Data Preparation: 
# Import the data set we need and make the necessary preparations. 
# change to desired working directory with all relevant files including R script, Rmarkdown, data file, etc.
#setwd("~/Dropbox/Teaching/NUS IS5126/Rscripts/probStats/") 

# Data Input and Basic Data Preparation
# import data using readxl::read.excel function and make sure there is the excel file 'Credit Risk Data.xlsx' in current directory.
credit = read_excel(path = 'Credit Risk Data.xlsx', sheet = 'Base Data', skip = 2)
# display the first few rows of 'credit', a dataframe with 425 obs. and 12 variables.
head(credit)
# converting a string variable to factor (binary).
credit$`Credit Risk` = as.factor(credit$`Credit Risk`)
# similarly, convert the string variable 'credit$`Job`' to an ordered factor variable (categorical); the order is manually specified
credit$Job = factor(credit$Job, levels = c("Unemployed", "Unskilled", "Skilled", "Management"), ordered = TRUE)
# prepare all other data types
credit$`Loan Purpose` = as.factor(credit$`Loan Purpose`)
credit$Gender = as.factor(credit$Gender)
# change the labels for categorical variable `Gender`
levels(credit$Gender)[levels(credit$Gender) == "F"] = "Female"
levels(credit$Gender)[levels(credit$Gender) == "M"] = "Male"
credit$`Marital Status` = as.factor(credit$`Marital Status`)
credit$Housing = as.factor(credit$Housing)

Some Fun Simulations for LLN and CLT

The law of large numbers (LLN) and the central limit theorem (CLT) are two cornerstones of modern statistical theory. These two theorems are key to understanding why we always prefer a larger data set and why the most common probability distribution in statistics is the normal distribution (thus the name). In short (stated in terms of the sample mean):

  • LLN says that the sample mean \(\bar{X}\) comes closer to the true mean \(\mu_X\) when the sample size gets larger.
  • CLT says that the shape of the sample mean \(\bar{X}\)’s probability distribution (i.e. the sampling distribution of \(\bar{X}\)) comes closer to a normal distribution when the sample size gets larger, regardless of the shape of \(X\)’s distribution.

Simulation for Law of Large Numbers (LLN)

The first simulation shows how the law of large numbers works. The random variable \(X\) itself has a normal distribution. Our goal is to show that the sample mean \(\bar{X}\) (again, it is key to understand that \(\bar{X}\) is itself a random variable) converges in probability to the true mean \(\mu_X\) when the sample size \(n\) gets larger. The simulation does the following:

  1. Fixing \(n\), draw a sample of \(n\) observations (i.e. the sample size) from \(X\)’s distribution, i.e. a normal distribution with true mean \(\mu_X = 0\).
  2. Draw 25 such samples independently from the same normal distribution of \(X\).
  3. Compute the 25 sample means \(\bar{X}\) from those 25 independent samples.
  4. Plot those 25 sample means’ locations relative to the true mean \(\mu_X = 0\) (the “red line”).
  5. Repeat steps 1 to 4, starting from \(n = 1\) and increasing by 1 until \(n = nmax = 175\). We therefore have 175 plots with increasing sample size \(n = 1, 2, 3, \ldots, 175\).

Observe that the realizations of the sample mean \(\bar{X}\) become closer to the true mean as \(n\) gets larger. That is, the standard deviation of the sampling distribution of \(\bar{X}\) becomes smaller as \(n\) becomes larger.

par(bg = "white")
# see other interesting examples using "animation" package in R
ani.options(interval = 0.01, nmax = 175)
lln.ani(pch = 1, np = 25, col.poly = "steelblue2", col.mu = "orangered2")
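
For readers who prefer to see the mechanics without the animation package, here is a minimal sketch of steps 1 to 4 for one fixed sample size (the names n, n_samples and xbar are our own illustrative choices, not part of lln.ani):

# draw 25 independent samples of size n from N(0, 1), compute their sample means,
# and plot them against the true mean (red line)
set.seed(42)
n = 50          # sample size for this single frame
n_samples = 25  # number of independent samples
mu_X = 0        # true mean
xbar = replicate(n_samples, mean(rnorm(n, mean = mu_X, sd = 1)))
plot(xbar, rep(1, n_samples), xlim = c(-1, 1), yaxt = "n",
     xlab = "sample means", ylab = "")
abline(v = mu_X, col = "red")  # the "red line": true mean
# increasing n shrinks the spread of the plotted sample means around the red line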

Simulation for Central Limit Theorem (CLT)

The second simulation shows how the central limit theorem works. The random variable \(X\) itself has an exponential distribution (nothing like a normal distribution). Our goal is to show that the sample mean \(\bar{X}\) converges in distribution to a normal distribution when the sample size \(n\) gets larger. The simulation does the following:

  1. Fixing \(n\), draw a sample of \(n\) observations (i.e. the sample size) from \(X\)’s distribution, i.e. an exponential distribution with \(\lambda = 1\).
  2. Draw 300 such samples independently from the same exponential distribution of \(X\).
  3. Compute the 300 sample means \(\bar{X}\) from those 300 independent samples.
  4. Plot the histogram of the 300 sample means \(\bar{X}\), which gives us the empirical sampling distribution of \(\bar{X}\) at sample size \(n\). The “red” curve is the smoothed density curve estimated from the histogram.
  5. Repeat steps 1 to 4, starting from \(n = 1\) and increasing by 1 until \(n = nmax = 150\). We therefore have 150 histogram plots with increasing sample size \(n = 1, 2, 3, \ldots, 150\).

Observe that the sampling distribution of \(\bar{X}\) looks more and more like a normal distribution as \(n\) gets larger. Put another way, the empirical density curve (solid red) gets closer to the theoretical normal density curve (dashed blue) as \(n\) increases.

# see other interesting examples using "animation" package in R
par(bg = "white")
ani.options(interval = 0.1, nmax = 150)
op = par(mar = c(3, 3, 1, 0.5), mgp = c(1.5, 0.5, 0), tcl = -0.3)
clt.ani(type = "s", col = c("darkolivegreen2", "red", "blue", "black"), bg="white")
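
Similarly, a minimal sketch of steps 1 to 4 of the CLT simulation for one fixed \(n\) (again, variable names are illustrative; clt.ani handles all of this internally and animates over \(n\)):

# draw 300 independent samples of size n from Exp(rate = 1), compute the 300
# sample means, and compare their histogram with the theoretical normal curve
set.seed(42)
n = 30           # sample size for this single frame
n_samples = 300  # number of independent samples
xbar = replicate(n_samples, mean(rexp(n, rate = 1)))
hist(xbar, freq = FALSE, main = paste("Sampling distribution of the mean, n =", n),
     xlab = expression(bar(X)))
lines(density(xbar), col = "red")  # empirical density (cf. the red curve)
# theoretical normal: mean 1/lambda = 1, sd = 1/(lambda * sqrt(n))
curve(dnorm(x, mean = 1, sd = 1 / sqrt(n)), add = TRUE, col = "blue", lty = 2)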

Table of Summary Statistics and Correlation Matrix

How to Produce a Summary Statistics Table

A summary statistics table collects the key statistical measures for all variables of interest in one table. The following R code shows how to produce one using the gtsummary package.

## Summary Statistics Tables

# summary statistics table with `table1`
#sumstat1 = table1::table1(~ Checking + Savings + `Marital Status` + `Credit Risk` | Gender, data = credit)
#print(sumstat1)

# summary statistics table with `arsenal`
#sumstat2 = arsenal::tableby(~  Savings + Age + Housing , data = credit)
#summary(sumstat2, text=FALSE)

# summary statistics table with `gtsummary`
sumstat3 = 
  credit %>% 
  select(Checking, Savings, Age, Gender) %>%
  tbl_summary(
    by = Gender,
    type = all_continuous() ~ "continuous2",
    statistic = all_continuous() ~ c("{mean} ({sd})", "{median} ({p25}, {p75})" , "{min}, {max}"),
    missing = "no"
    ) %>%
  modify_header(stat_by = md("**{level}**<br>N =  {n} ({style_percent(p)}%)")) %>%
  modify_spanning_header(all_stat_cols() ~ "**Gender**") %>%
  add_n() %>%
  add_p() %>%
  bold_labels() %>%
  italicize_levels()

sumtable = as_gt(sumstat3)
sumtable

| Characteristic | N | Gender: Female, N = 135 (32%) | Gender: Male, N = 290 (68%) | p-value¹ |
|---|---|---|---|---|
| Checking | 425 |  |  | >0.9 |
| Mean (SD) |  | 952 (2,938) | 1,093 (3,244) |  |
| Median (IQR) |  | 0 (0, 569) | 0 (0, 554) |  |
| Range |  | 0, 19,766 | 0, 19,812 |  |
| Savings | 425 |  |  | 0.14 |
| Mean (SD) |  | 2,075 (3,917) | 1,690 (3,439) |  |
| Median (IQR) |  | 680 (324, 964) | 558 (208, 898) |  |
| Range |  | 0, 19,568 | 0, 19,811 |  |
| Age | 425 |  |  | <0.001 |
| Mean (SD) |  | 32 (12) | 36 (10) |  |
| Median (IQR) |  | 28 (23, 36) | 34 (27, 41) |  |
| Range |  | 18, 66 | 19, 73 |  |

¹ Wilcoxon rank sum test

How to Produce and Visualize Pearson’s Correlation Matrix

A Pearson’s correlation matrix computes and shows the pairwise Pearson’s \(\rho\) among all continuous data variables. The cor() function in base R easily does the job, and we further use the corrplot package to visualize the correlation matrix in a nice-looking plot.

## Pearson's correlation matrix
# note that Pearson's correlation only applies to continuous numeric data 
credit.cont = credit %>% 
  select(Checking, Savings, `Months Customer`, `Months Employed`, `Age`)
  # use base R's `cor()` function to produce the matrix
credit.corrmat = cor(credit.cont)
credit.corrmat
##                     Checking     Savings Months Customer Months Employed
## Checking         1.000000000  0.02007924    -0.036226269    -0.009190981
## Savings          0.020079239  1.00000000    -0.050440712     0.048977962
## Months Customer -0.036226269 -0.05044071     1.000000000     0.054455798
## Months Employed -0.009190981  0.04897796     0.054455798     1.000000000
## Age             -0.002369302 -0.02828589    -0.003785909     0.306798529
##                          Age
## Checking        -0.002369302
## Savings         -0.028285886
## Months Customer -0.003785909
## Months Employed  0.306798529
## Age              1.000000000
# visualize correlation matrix with "corrplot" package
corrplot(credit.corrmat, type = "full", order = "hclust", method = "shade",
         tl.col = "black", tl.srt = 45)

# let's use another data set for a quick illustration
data(mtcars)
head(mtcars)
mtcars.corrmat = mtcars %>%
  select(-c("vs", "am","gear","carb")) %>%
  cor() %>%
  round(2)

lower = mtcars.corrmat
lower[upper.tri(lower)] = ""
lower = as.data.frame(lower)
View(lower)

corrtbl = xtable(lower)
# print() renders LaTeX by default (as shown below); use print(corrtbl, type = "html") for HTML output
print(corrtbl)
## % latex table generated in R 4.0.3 by xtable 1.8-4 package
## % Fri Jan 22 16:39:00 2021
## \begin{table}[ht]
## \centering
## \begin{tabular}{rlllllll}
##   \hline
##  & mpg & cyl & disp & hp & drat & wt & qsec \\ 
##   \hline
## mpg & 1 &  &  &  &  &  &  \\ 
##   cyl & -0.85 & 1 &  &  &  &  &  \\ 
##   disp & -0.85 & 0.9 & 1 &  &  &  &  \\ 
##   hp & -0.78 & 0.83 & 0.79 & 1 &  &  &  \\ 
##   drat & 0.68 & -0.7 & -0.71 & -0.45 & 1 &  &  \\ 
##   wt & -0.87 & 0.78 & 0.89 & 0.66 & -0.71 & 1 &  \\ 
##   qsec & 0.42 & -0.59 & -0.43 & -0.71 & 0.09 & -0.17 & 1 \\ 
##    \hline
## \end{tabular}
## \end{table}
# visualize correlation matrix with "corrplot" package
corrplot(mtcars.corrmat, type = "lower", order = "hclust", method = "circle",
         tl.col = "black", tl.srt = 45)

Statistical Inference and Hypothesis: An Example of Sales Transactional Data

  • Dataset required: Sales Transactions.xlsx

Sales Transactions.xlsx contains the records of all sales transactions for one day, July 14. Each column is defined as follows:

  • CustID : Unique identifier for a customer
  • Region: Region of customer’s home address
  • Payment: Mode of payment used for the sales transaction
  • Transaction Code: Numerical code for the sales transaction
  • Source: Source of the sales (whether it is through the Web or email)
  • Amount: Sales amount
  • Product: Product bought by customer
  • Time Of Day: Time at which the sales transaction took place.

# import the data file into R
ST <- read_excel("Sales Transactions.xlsx", col_types = c("numeric", "text", "text", "numeric", "text", "numeric", "text", "date")) 

Let’s use this Sales Transaction data to conduct interval estimation and hypotheses testing.

Computing Interval Estimates

A confidence interval with \((1-\alpha)\) level of confidence is a range of values between which the parameter is believed to lie, together with a probability that the CI correctly captures the true parameter. The “formula” for a CI is:

\[ \text{point est. }\pm \text{ crit.val.(sampling dist., } \alpha)\times \text{S.E.} \]

Question: how do we understand the concept of a 95% confidence interval? What does it mean to be “95% confident that the parameter is in between”?

Let’s see another simulation for a 50% confidence interval.
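
One way to produce such a simulation (a sketch, assuming the animation package’s conf.int() demo with a level argument) is to repeatedly draw samples and plot the resulting intervals; with level = 0.5, roughly half of the intervals should miss the true mean:

# animated demo of repeated sampling: confidence intervals that fail to cover the
# true mean are highlighted; with a 50% level, about half of them should miss
ani.options(interval = 0.1, nmax = 100)
conf.int(level = 0.5)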

Consider how to obtain the following interval estimates:

  • Compute the 95% confidence interval for the mean of Amount.
  • Compute the 99% confidence interval for the proportion of book sales transactions with sales amount greater than $50.

Let’s first compute the 95% confidence interval for the mean of Amount.

#compute manually 95% CI for mean `Amount`
uCIamt95 = mean(ST$Amount) - 
    qt(0.025, df=nrow(ST)-1)  * sd(ST$Amount)/sqrt(nrow(ST)) #upper CI
#NOTE: qt(.025) gives a negative value here so we use mean - error to compute upper CI
lCIamt95 = mean(ST$Amount) + 
    qt(0.025, df=nrow(ST)-1) * sd(ST$Amount)/sqrt(nrow(ST)) #lower CI
#print result
print(cbind(lCIamt95, uCIamt95), digits=4) 
##      lCIamt95 uCIamt95
## [1,]    34.76    45.13
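
As a sanity check, base R’s t.test() reports the same 95% confidence interval for the mean in its one-sample output:

# the one-sample t-test output includes the 95% CI for the mean of Amount
t.test(ST$Amount, conf.level = 0.95)$conf.int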

Now let’s compute the 99% confidence interval for the proportion of book sales transactions with sales amount greater than $50.

Q: How is the confidence interval for proportion different from that of mean?

#compute 99% CI for proportion of book sales with Amount > 50
book = ST %>% filter(Product=="Book")
bk50 = book %>% filter(Amount>50)
pbk50 = nrow(bk50)/nrow(book) # proportion of books which cost greater than $50
# compute the lower and upper bounds of CI
lCIpbk50 <- pbk50 + (qnorm(0.005) * sqrt(pbk50*(1-pbk50)/nrow(book)))
uCIpbk50 <- pbk50 - (qnorm(0.005) * sqrt(pbk50*(1-pbk50)/nrow(book)))
# print CI
print(cbind(lCIpbk50, uCIpbk50),digits=3)
##      lCIpbk50 uCIpbk50
## [1,]    0.139    0.267
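
For comparison, prop.test() also produces an interval estimate for a proportion; note that it uses the Wilson score method (with continuity correction by default), so its bounds will differ slightly from the normal-approximation interval computed above:

# 99% CI for the proportion via prop.test (Wilson score interval)
prop.test(x = nrow(bk50), n = nrow(book), conf.level = 0.99)$conf.int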

Conducting Hypothesis Tests

Suppose the manager would like to draw some conclusions from the sales transaction data. He does not believe there are outliers in the data so you may leave all the data as is. He would like to set up and test the following hypotheses:

  • Is the proportion of book sales transactions with Amount greater than $50 at least 10 percent?
  • Is the mean sales amount for books the same as for DVDs?
  • Is the mean sales amount the same across all 4 regions?

One-sided test on proportion.

There are two approaches to conduct one-sided test here:

  1. The null says that \(H_0: p \le 0.1\), and we compare the test statistic with the critical value on the right tail.
  2. The null says that \(H_0: p \ge 0.1\), and we compare the test statistic with the critical value on the left tail.

Let’s see how we conduct the one-sided test on proportion with the second approach. Write out the hypothesis system: \[\begin{align*} &H_0: p \ge 0.1 \\ &H_A: p < 0.1 \end{align*}\]

Assuming the null hypothesis is true, any observed proportion at or above 10% is perfectly consistent with it. The observed proportion has to be significantly lower than 10% (falling in the left critical region) for the null hypothesis to be rejected.

# compute z-statistic for proportion. The next 3 lines have been executed earlier.
# book<-ST %>% filter(Product=="Book")
# bk50<- book %>% filter(Amount>50)
# pbk50<-nrow(bk50)/nrow(book)

z = (pbk50 - 0.10) / sqrt(0.1*(1-0.1)/nrow(book))
z
## [1] 5.550227
#compute critical value; note this is a one-sided hypothesis test
cv95 = qnorm(0.05)
cv95
## [1] -1.644854
z < cv95 # check if z passed the threshold
## [1] FALSE

H0: Proportion of book orders with amount greater than $50 >= 0.1;
H1: Proportion of book orders with amount greater than $50 < 0.1

Since the test statistic \(z\) is larger than the critical value (left tail), we fail to reject the null hypothesis. The data are therefore consistent with the claim that the proportion of book orders with amount greater than $50 is at least 10%.
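
For reference, the same one-sided test can be run with the built-in prop.test(); it reports a chi-squared statistic (the square of \(z\)) rather than \(z\) itself, but with correct = FALSE it matches the manual calculation and leads to the same conclusion:

# one-sided test of H0: p >= 0.1 vs HA: p < 0.1; correct = FALSE disables the
# continuity correction so that the reported statistic equals z^2 from above
prop.test(x = nrow(bk50), n = nrow(book), p = 0.1,
          alternative = "less", correct = FALSE)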

Two-sided test on means from two samples

t.test(ST$Amount~ST$Product)
## 
##  Welch Two Sample t-test
## 
## data:  ST$Amount by ST$Product
## t = 8.0304, df = 260.96, p-value = 3.344e-14
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  27.47079 45.31916
## sample estimates:
## mean in group Book  mean in group DVD 
##           56.21559           19.82062

H0: Mean amount for Books = Mean amount for DVDs;
H1: Mean amount for Books != Mean amount for DVDs

From the t-test output, the p-value is extremely small, smaller than any conventional cutoff. The data therefore show very strong evidence against the null hypothesis, and we conclude that the population mean amounts for books and DVDs are very unlikely to be the same.

Two-sided test on multiple means.

To test the equality of means across multiple samples (or groups), the one-way ANOVA test is the top choice given its better performance, i.e. higher test power. A parametric test such as ANOVA, however, requires stronger assumptions (no free lunch!):

  1. Independent samples.
  2. Population random variables are normally distributed.
  3. Homoskedasticity, i.e. equal variances across groups.

ANOVA Assumption 1: Check Normality!

# First check if ANOVA assumptions are met
# check normality
# plot histogram 
par(mfcol=c(2,2))
E <-ST %>% filter(Region=="East") 
W <-ST %>% filter(Region=="West")
N <-ST %>% filter(Region=="North") 
S <-ST %>% filter(Region=="South") 


hist(E$Amount, main="Histogram for `East`", xlab="Amount")
hist(W$Amount, main="Histogram for `West`", xlab="Amount")
hist(N$Amount, main="Histogram for `North`", xlab="Amount")
hist(S$Amount, main="Histogram for `South`", xlab="Amount")

# plot qqplots
par(mfcol=c(2,2))
qqnorm(E$Amount, main="QQplot for `East`", xlab="Amount")
qqline(E$Amount)
qqnorm(W$Amount, main="QQplot for `West`", xlab="Amount")
qqline(W$Amount)
qqnorm(N$Amount, main="QQplot for `North`", xlab="Amount")
qqline(N$Amount)
qqnorm(S$Amount, main="QQplot for `South`", xlab="Amount")
qqline(S$Amount)

# plot boxplots
par(mfcol= c(1,1))
boxplot(ST$Amount ~ ST$Region)

# Shapiro-Wilk Test
lapply(list(E,W,N,S), 
       function(sa) 
         {
         shapiro.test(sa$Amount)
         }) # applying test on each vector in list
## [[1]]
## 
##  Shapiro-Wilk normality test
## 
## data:  sa$Amount
## W = 0.46655, p-value < 2.2e-16
## 
## 
## [[2]]
## 
##  Shapiro-Wilk normality test
## 
## data:  sa$Amount
## W = 0.41315, p-value < 2.2e-16
## 
## 
## [[3]]
## 
##  Shapiro-Wilk normality test
## 
## data:  sa$Amount
## W = 0.36983, p-value < 2.2e-16
## 
## 
## [[4]]
## 
##  Shapiro-Wilk normality test
## 
## data:  sa$Amount
## W = 0.45291, p-value < 2.2e-16
# check sample sizes across regions
table(ST$Region)
## 
##  East North South  West 
##    98    85    99   190

The boxplots show that the centers of the four regional groups are very close, which we still need to test formally.

From both the diagnostic graphics and the Shapiro-Wilk normality tests, we conclude that the normality assumption is violated, mainly because of clusters of outliers. Keep in mind that the ANOVA test is somewhat robust to non-normality but rather sensitive to heteroskedasticity, i.e. unequal variances across groups, which we shall check next.

ANOVA Assumption 2: Check Equal Variance!

# check equal variance assumption with Fligner's
fligner.test(Amount ~ Region, ST)
## 
##  Fligner-Killeen test of homogeneity of variances
## 
## data:  Amount by Region
## Fligner-Killeen:med chi-squared = 5.9472, df = 3, p-value = 0.1142
# check equal variance assumption with Levene's
leveneTest(Amount~ as.factor(Region), ST)

Given the departure from normality, we choose to conduct the Fligner-Killeen and Levene’s tests for equal variances. Both fail to reject the null hypothesis (the null: equal variances). We are quite confident that the homoskedasticity assumption is met. One-way ANOVA is the way to go:

Run One-Way Anova

# Conduct Anova
aov.amt <- aov(ST$Amount ~ as.factor(ST$Region)) 
summary(aov.amt)
##                       Df  Sum Sq Mean Sq F value Pr(>F)
## as.factor(ST$Region)   3    6549    2183   0.663  0.575
## Residuals            468 1540965    3293

H0: Mean amounts for all regions are the same;
H1: Mean amount for at least one region is different from the others

The rather large p-value from the one-way ANOVA test shows little evidence against the null hypothesis. We conclude that the mean amounts across the four regions are very likely the same, which confirms our preliminary observation from the boxplots at the very beginning.