Problem statement

To assess the normality of a single variable from a dataset with over 1000 observations, sourced from Kaggle.

Normality - definition

In statistics, normality tests are used to determine if a data set is well-modeled by a normal distribution and to compute how likely it is for a random variable underlying the data set to be normally distributed.
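A formal normality test can be sketched as follows. This is a minimal illustration using the Shapiro-Wilk test from base R (`shapiro.test`), which is not part of the original analysis. The vector `x` is simulated stand-in data, not the Kaggle dataset. Because `shapiro.test` accepts between 3 and 5000 observations, a random subsample is drawn when the input is larger.

```r
# Simulated stand-in data (assumption: this is NOT the Kaggle data)
set.seed(123)
x <- rnorm(10000, mean = 650, sd = 97)

# shapiro.test() accepts 3 to 5000 non-missing values, so subsample if needed
x_sub <- if (length(x) > 5000) sample(x, 5000) else x

# Null hypothesis: the sample comes from a normal distribution
sw <- shapiro.test(x_sub)
print(sw)

# Conventional reading: a p-value below 0.05 is taken as evidence against normality
is_rejected <- sw$p.value < 0.05
```

With large samples, such tests can flag very small departures from normality, so the result is usually read alongside plots rather than on its own.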


knitr::include_graphics("C:/MBA/Sem - 3/normal_dist.png")
Figure 1: Normal distribution & empirical rule
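The empirical rule named in the figure caption can be checked numerically. The sketch below uses only base R's `pnorm` (the standard normal cumulative distribution function) to compute the probability within 1, 2 and 3 standard deviations of the mean. The variable names are illustrative.

```r
# Probability mass of a normal distribution within k standard deviations of the mean
k <- 1:3
coverage <- pnorm(k) - pnorm(-k)
names(coverage) <- paste0("within_", k, "_sd")

# Roughly 68%, 95% and 99.7%, which is why this is also called the 68-95-99.7 rule
print(round(coverage, 4))
```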



Loading Data - Bank Customer Churn Modelling Dataset

Source: (https://www.kaggle.com/barelydedicated/bank-customer-churn-modeling)

df = read.csv("C:/MBA/Sem - 3/Churn_Modelling.csv")
head(df,15)
##    RowNumber CustomerId  Surname CreditScore Geography Gender Age Tenure
## 1          1   15634602 Hargrave         619    France Female  42      2
## 2          2   15647311     Hill         608     Spain Female  41      1
## 3          3   15619304     Onio         502    France Female  42      8
## 4          4   15701354     Boni         699    France Female  39      1
## 5          5   15737888 Mitchell         850     Spain Female  43      2
## 6          6   15574012      Chu         645     Spain   Male  44      8
## 7          7   15592531 Bartlett         822    France   Male  50      7
## 8          8   15656148   Obinna         376   Germany Female  29      4
## 9          9   15792365       He         501    France   Male  44      4
## 10        10   15592389       H?         684    France   Male  27      2
## 11        11   15767821   Bearce         528    France   Male  31      6
## 12        12   15737173  Andrews         497     Spain   Male  24      3
## 13        13   15632264      Kay         476    France Female  34     10
## 14        14   15691483     Chin         549    France Female  25      5
## 15        15   15600882    Scott         635     Spain Female  35      7
##      Balance NumOfProducts HasCrCard IsActiveMember EstimatedSalary Exited
## 1       0.00             1         1              1       101348.88      1
## 2   83807.86             1         0              1       112542.58      0
## 3  159660.80             3         1              0       113931.57      1
## 4       0.00             2         0              0        93826.63      0
## 5  125510.82             1         1              1        79084.10      0
## 6  113755.78             2         1              0       149756.71      1
## 7       0.00             2         1              1        10062.80      0
## 8  115046.74             4         1              0       119346.88      1
## 9  142051.07             2         0              1        74940.50      0
## 10 134603.88             1         1              1        71725.73      0
## 11 102016.72             2         0              0        80181.12      0
## 12      0.00             2         1              0        76390.01      0
## 13      0.00             2         1              0        26260.98      0
## 14      0.00             2         0              0       190857.79      0
## 15      0.00             2         1              1        65951.65      0

Using base plotting functions: Histogram


# Basic histogram (frequency counts) using base graphics
hist(df$CreditScore)

# Density-scaled histogram with a kernel density estimate overlaid in red
hist(df$CreditScore, freq=FALSE, xlim= c(0,1000))
lines(x=density(df$CreditScore), col="red")
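To compare the observed shape against a true normal curve, a fitted normal density can be drawn over the same histogram. This is a sketch under assumptions: `scores` is simulated stand-in data (in the analysis it would be `df$CreditScore`), and the blue dashed curve is a normal density using the sample mean and standard deviation.

```r
# Simulated stand-in for df$CreditScore (assumption: not the real data)
set.seed(42)
scores <- rnorm(1000, mean = 650, sd = 97)

# Density-scaled histogram with the kernel density estimate in red
hist(scores, freq = FALSE, main = "Histogram with fitted normal curve")
lines(density(scores), col = "red")

# Normal density with the sample mean and sd, drawn as a blue dashed line
curve(dnorm(x, mean = mean(scores), sd = sd(scores)),
      add = TRUE, col = "blue", lty = 2)
```

The closer the red kernel density estimate follows the blue dashed curve, the more plausible the normal model is for the variable.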

Using ggplot library


# Creating histogram using ggplot
library(ggplot2)

ggplot(df) + aes(x=CreditScore) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

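# Using ggplot histogram and density functions combined
# (after_stat() and linewidth replace the deprecated ..density.. and size; requires ggplot2 >= 3.4.0)
ggplot(df) + aes(x = CreditScore) +
  geom_histogram(aes(y = after_stat(density)), colour = "black", fill = "white") +
  geom_density(linewidth = 0.7, fill = "blue", alpha = 0.4)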
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.


  • In the graph below, the vertical red line marks the mean of CreditScore.


ggplot(df) + aes(x = CreditScore) +
  geom_histogram(aes(y = after_stat(density)), colour = "black", fill = "white") +
  geom_density(linewidth = 0.7, fill = "blue", alpha = 0.4) +
  geom_vline(xintercept = mean(df$CreditScore), colour = "red")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.


Basic statistics - Mean, Standard Deviation

paste("Calculated Mean of column 'CreditScore':", round(mean(df$CreditScore),2))
## [1] "Calculated Mean of column 'CreditScore': 650.53"
paste("Calculated Standard Deviation of column 'CreditScore':", round(sd(df$CreditScore),2))
## [1] "Calculated Standard Deviation of column 'CreditScore': 96.65"
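The mean and standard deviation can be tied back to the empirical rule by measuring what share of observations falls within 1, 2 and 3 standard deviations of the mean. The helper `sd_coverage` below is hypothetical (not from the original analysis), and `scores` is simulated with the mean and standard deviation reported above, not the real column.

```r
# Share of observations within k sample standard deviations of the sample mean
sd_coverage <- function(x, k = 1:3) {
  z <- abs(x - mean(x)) / sd(x)
  sapply(k, function(kk) mean(z <= kk))
}

# Simulated stand-in using the reported mean and sd (assumption: not the real data)
set.seed(1)
scores <- rnorm(10000, mean = 650.53, sd = 96.65)

cov_shares <- sd_coverage(scores)
print(round(cov_shares, 3))

# On the real data one would call: sd_coverage(df$CreditScore)
```

Shares close to 68%, 95% and 99.7% are consistent with a normal distribution; large deviations from those figures point to skew or heavy tails.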

Findings:

1. Graphical interpretation:

  • The histogram and density plot indicate that the column CreditScore is approximately normally distributed, judging by the bell-shaped curve, which reflects the spread and shape of the distribution.

2. Basic statistical interpretation:

  • A standard deviation (or σ) is a measure of how dispersed the data is in relation to the mean. Low standard deviation means data are clustered around the mean, and high standard deviation indicates data are more spread out.

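  • From the basic statistical analysis, the standard deviation (96.65) is small relative to the mean (650.53), about 15% of it, so the values are clustered fairly closely around the mean. A small standard deviation describes spread rather than shape, so it does not establish normality by itself; the assumption that the data is normally distributed rests mainly on the graphical evidence.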
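The findings above could be cross-checked with a Q-Q plot and two shape statistics. This is a sketch that is not part of the original analysis: `scores` is simulated stand-in data (on the real data it would be `df$CreditScore`), and skewness and excess kurtosis are computed with simple moment formulas rather than a dedicated package.

```r
# Simulated stand-in for df$CreditScore (assumption: not the real data)
set.seed(7)
scores <- rnorm(10000, mean = 650.53, sd = 96.65)

# Q-Q plot: points lying close to the reference line suggest approximate normality
qqnorm(scores, main = "Normal Q-Q plot")
qqline(scores, col = "red")

# Simple moment-based shape summaries; both are near 0 for normal data
z <- (scores - mean(scores)) / sd(scores)
skewness <- mean(z^3)
excess_kurtosis <- mean(z^4) - 3
print(round(c(skewness = skewness, excess_kurtosis = excess_kurtosis), 3))
```

Systematic curvature in the Q-Q plot, or skewness and excess kurtosis far from zero, would argue against treating the variable as normal even when the histogram looks bell-shaped.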

References