Objective: to assess the normality of a single variable from a dataset with more than 1,000 observations, sourced from Kaggle.
In statistics, normality tests are used to determine if a data set is well-modeled by a normal distribution and to compute how likely it is for a random variable underlying the data set to be normally distributed.
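As a minimal illustration of what such a test reports, the sketch below applies the Shapiro-Wilk test (`shapiro.test()` from base R's stats package) to two simulated samples. The samples, the seed, and the variable names are assumptions made for this example and are not part of the dataset analysed here.

```r
# Illustrative only: simulated samples, not the Kaggle data
set.seed(42)
normal_sample <- rnorm(500, mean = 650, sd = 100)  # drawn from a normal distribution
skewed_sample <- rexp(500, rate = 1 / 650)         # drawn from a right-skewed distribution

# Null hypothesis of the test: the sample comes from a normal distribution
normal_test <- shapiro.test(normal_sample)
skewed_test <- shapiro.test(skewed_sample)

# A small p-value is evidence against normality
normal_test$p.value
skewed_test$p.value
```

`shapiro.test()` only accepts between 3 and 5000 observations, so a larger variable would need to be subsampled first, for example with `sample(x, 5000)`.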
knitr::include_graphics("C:/MBA/Sem - 3/normal_dist.png")
Figure 1: Normal distribution & empirical rule
# Load the dataset and preview the first 15 rows
df <- read.csv("C:/MBA/Sem - 3/Churn_Modelling.csv")
head(df, 15)
## RowNumber CustomerId Surname CreditScore Geography Gender Age Tenure
## 1 1 15634602 Hargrave 619 France Female 42 2
## 2 2 15647311 Hill 608 Spain Female 41 1
## 3 3 15619304 Onio 502 France Female 42 8
## 4 4 15701354 Boni 699 France Female 39 1
## 5 5 15737888 Mitchell 850 Spain Female 43 2
## 6 6 15574012 Chu 645 Spain Male 44 8
## 7 7 15592531 Bartlett 822 France Male 50 7
## 8 8 15656148 Obinna 376 Germany Female 29 4
## 9 9 15792365 He 501 France Male 44 4
## 10 10 15592389 H? 684 France Male 27 2
## 11 11 15767821 Bearce 528 France Male 31 6
## 12 12 15737173 Andrews 497 Spain Male 24 3
## 13 13 15632264 Kay 476 France Female 34 10
## 14 14 15691483 Chin 549 France Female 25 5
## 15 15 15600882 Scott 635 Spain Female 35 7
## Balance NumOfProducts HasCrCard IsActiveMember EstimatedSalary Exited
## 1 0.00 1 1 1 101348.88 1
## 2 83807.86 1 0 1 112542.58 0
## 3 159660.80 3 1 0 113931.57 1
## 4 0.00 2 0 0 93826.63 0
## 5 125510.82 1 1 1 79084.10 0
## 6 113755.78 2 1 0 149756.71 1
## 7 0.00 2 1 1 10062.80 0
## 8 115046.74 4 1 0 119346.88 1
## 9 142051.07 2 0 1 74940.50 0
## 10 134603.88 1 1 1 71725.73 0
## 11 102016.72 2 0 0 80181.12 0
## 12 0.00 2 1 0 76390.01 0
## 13 0.00 2 1 0 26260.98 0
## 14 0.00 2 0 0 190857.79 0
## 15 0.00 2 1 1 65951.65 0
# Histogram using base R graphics
hist(df$CreditScore)
# Density-scaled histogram with a kernel density curve overlaid
hist(df$CreditScore, freq = FALSE, xlim = c(0, 1000))
lines(density(df$CreditScore), col = "red")
# Histogram using ggplot2
library(ggplot2)
ggplot(df) + aes(x = CreditScore) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
# Histogram and density curve combined (after_stat() and linewidth need ggplot2 >= 3.4.0)
ggplot(df) + aes(x = CreditScore) +
  geom_histogram(aes(y = after_stat(density)), colour = "black", fill = "white") +
  geom_density(linewidth = 0.7, fill = "blue", alpha = 0.4)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
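# Same plot with a vertical line at the sample mean (after_stat() and linewidth need ggplot2 >= 3.4.0)
ggplot(df) + aes(x = CreditScore) +
  geom_histogram(aes(y = after_stat(density)), colour = "black", fill = "white") +
  geom_density(linewidth = 0.7, fill = "blue", alpha = 0.4) +
  geom_vline(xintercept = mean(df$CreditScore), colour = "red")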
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
paste("Calculated Mean of column 'CreditScore':", round(mean(df$CreditScore), 2))
## [1] "Calculated Mean of column 'CreditScore': 650.53"
paste("Calculated Standard Deviation of column 'CreditScore':", round(sd(df$CreditScore),2))
## [1] "Calculated Standard Deviation of column 'CreditScore': 96.65"
A standard deviation (or σ) is a measure of how dispersed the data is in relation to the mean. Low standard deviation means data are clustered around the mean, and high standard deviation indicates data are more spread out.
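One way to connect the standard deviation to the empirical rule in Figure 1 is to count how many observations fall within 1, 2, and 3 standard deviations of the mean and compare those shares with the values a normal distribution would give. The sketch below is a minimal example: `empirical_coverage()` is a hypothetical helper written for this illustration, and the simulated vector `x` is a stand-in that would be replaced by `df$CreditScore` in the report.

```r
# Share of observations within k standard deviations of the mean
# (empirical_coverage is a hypothetical helper, not a library function)
empirical_coverage <- function(x, k = 1:3) {
  m <- mean(x)
  s <- sd(x)
  sapply(k, function(ki) mean(abs(x - m) <= ki * s))
}

# Stand-in data (assumption); in the report, pass df$CreditScore instead
set.seed(1)
x <- rnorm(10000, mean = 650, sd = 100)

observed <- empirical_coverage(x)
expected <- pnorm(1:3) - pnorm(-(1:3))  # about 0.683, 0.954, 0.997 under normality
round(rbind(observed, expected), 3)
```

Large gaps between the observed and expected rows would suggest that the variable departs from a normal distribution.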
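From the basic statistical analysis, the standard deviation (96.65) is about 15% of the mean (650.53), so the credit scores are moderately concentrated around the mean. The size of the standard deviation alone, however, says nothing about the shape of the distribution. Any judgement that the data is approximately normally distributed has to rest on the shape of the histogram and density curve plotted above, ideally supplemented by a Q-Q plot or a formal normality test.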
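A normal Q-Q plot gives a more direct visual check of normality than summary statistics. The sketch below uses base R's `qqnorm()` and `qqline()`; the simulated `scores` vector is a stand-in assumed for this example and would be replaced by `df$CreditScore` in the report.

```r
# Stand-in data (assumption); in the report, use df$CreditScore instead
set.seed(7)
scores <- rnorm(10000, mean = 650, sd = 100)

# Q-Q plot: sample quantiles against theoretical normal quantiles
qq <- qqnorm(scores, main = "Normal Q-Q plot")
# Reference line through the first and third quartiles
qqline(scores, col = "red")

# Correlation between theoretical and sample quantiles; values near 1
# mean the points lie close to a straight line
cor(qq$x, qq$y)
```

Points that follow the red line closely support approximate normality, while systematic curvature or heavy departures in the tails point to skewness or outliers.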