Testing the normality of datasets

Introduction

The adult dataset is from the 1994 Census database. It is also known as “Census Income” dataset. Details of this dataset can be found at UCI Machine Learning Repository. We will use this dataset to explore normality of certain kind of data.

install.packages("moments")

## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.4'
## (as 'lib' is unspecified)

library(moments) # loading the package 
install.packages("dplyr")

## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.4'
## (as 'lib' is unspecified)

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

install.packages("goft")

## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.4'
## (as 'lib' is unspecified)

library(goft)

## Loading required package: fitdistrplus

## Loading required package: MASS

## 
## Attaching package: 'MASS'

## The following object is masked from 'package:dplyr':
## 
##     select

## Loading required package: survival

## Loading required package: sn

## Loading required package: stats4

## 
## Attaching package: 'sn'

## The following object is masked from 'package:stats':
## 
##     sd

install.packages("devtools")

## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.4'
## (as 'lib' is unspecified)

library(devtools)

## Loading required package: usethis

rm(list = ls())

dataset<-read.table("http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data",header=FALSE,sep=",",stringsAsFactors = FALSE,na.strings="NA")
 
# naming the variables of data set

names(dataset)<-c("age","workclass","fnlwgt","education","education_number","marital_status","occupation","relationship","race","sex","capital_gain","capital_loss","hours_per_week","native_country","class")

# looking at summary of the data

str(dataset)

## 'data.frame':    32561 obs. of  15 variables:
##  $ age             : int  39 50 38 53 28 37 49 52 31 42 ...
##  $ workclass       : chr  " State-gov" " Self-emp-not-inc" " Private" " Private" ...
##  $ fnlwgt          : int  77516 83311 215646 234721 338409 284582 160187 209642 45781 159449 ...
##  $ education       : chr  " Bachelors" " Bachelors" " HS-grad" " 11th" ...
##  $ education_number: int  13 13 9 7 13 14 5 9 14 13 ...
##  $ marital_status  : chr  " Never-married" " Married-civ-spouse" " Divorced" " Married-civ-spouse" ...
##  $ occupation      : chr  " Adm-clerical" " Exec-managerial" " Handlers-cleaners" " Handlers-cleaners" ...
##  $ relationship    : chr  " Not-in-family" " Husband" " Not-in-family" " Husband" ...
##  $ race            : chr  " White" " White" " White" " Black" ...
##  $ sex             : chr  " Male" " Male" " Male" " Male" ...
##  $ capital_gain    : int  2174 0 0 0 0 0 0 0 14084 5178 ...
##  $ capital_loss    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ hours_per_week  : int  40 13 40 40 40 40 16 45 50 40 ...
##  $ native_country  : chr  " United-States" " United-States" " United-States" " United-States" ...
##  $ class           : chr  " <=50K" " <=50K" " <=50K" " <=50K" ...

Age data: Descriptive Stats and plots

Next let’s look at two variables in this dataset: age and capital_gain, where age is self explanatory and capital_gain presents whether the household have any gains from investment sources other than wage/salary.

Let’s first explore some standard descriptive statistics of these two varitables

age_data <- dataset$age
mean_age<-mean(age_data)
cat("mean age is", mean_age, "\n")

## mean age is 38.58165

sd_age<-sd(age_data)
cat("standard deviation of age is", sd_age, "\n")

## standard deviation of age is 13.64043

kurtosis_age <-kurtosis(age_data)
cat("kurtosis of age is", kurtosis_age, "\n")

## kurtosis of age is 2.833714

Let’s further examine how the datasets look like by plotting plotting both datasets. After all, we want to see what we have. Considering that both variables are discrete variables, we can just scatterplot them by placing index (observation) on the x axis and the value on the y axis.

plot(age_data, main = "Age data", xlab = "Observations", ylab = "age")

This doesn’t look very informative so let’s analyze the data a bit. We could use a histogram, but since the data is “discrete” I’ll just count the occurences of each age present in the dataset.

age_counts <- table(age_data)
plot(age_counts)

## Age Data: Further confirmation of its normality

Now we can fairly easily see that the dataset (population) behaves somehwat “normally”, and this means that if we were to take standard reach samples with a 95% confidence level with a margin of error of 10%, we can fairly confidentally say that our sample indeed represents the reality=population. But before that I want to quickly confirm the population is indeed normally distributed, i.e. 68% of the values fall between one standard deviation from the mean.

lower_limit <- mean_age - sd_age
upper_limit <- mean_age + sd_age
between_sd_values <- age_data[age_data >= lower_limit & age_data <= upper_limit]
percentage_between_sd_values <- length(between_sd_values)/length(age_data)
cat("And as we see, ", percentage_between_sd_values, " falls within one standard deviation from the mean, meaning our dataset is indeed fairly normal. As a further confirmation its kurtosis, i.e. fatness of the tail, is ", kurtosis_age, " and since it's below 3 we should have anything to worry.")

## And as we see,  0.6634931  falls within one standard deviation from the mean, meaning our dataset is indeed fairly normal. As a further confirmation its kurtosis, i.e. fatness of the tail, is  2.833714  and since it's below 3 we should have anything to worry.

Age data: What normality of the underlying data means for your sample

Now, the assumption that the data at hand, I mean the actual real life population data (which you know very little about), is normal is ESSENTIAL to anything you do with your data. As we are now privileged to “know” our dataset, it’s not only an assumption but rather knowledge and let’s see what does this mean by taking a REACH sample of 100 observations randomly:

for (i in 1:5){
  REACH_sample1<-sample(age_data, size = 100, replace = FALSE)
  mean_REACH_sample <- mean(REACH_sample1)
  cat("Mean of the REACH sample is ", mean_REACH_sample, " \n while the actual mean is   ", mean_age, "\n with a difference of ", mean_age - mean_REACH_sample, " which is     pretty good... as expected", "\n")
  }

## Mean of the REACH sample is  37.76  
##  while the actual mean is    38.58165 
##  with a difference of  0.8216468  which is     pretty good... as expected 
## Mean of the REACH sample is  38.67  
##  while the actual mean is    38.58165 
##  with a difference of  -0.08835324  which is     pretty good... as expected 
## Mean of the REACH sample is  37.48  
##  while the actual mean is    38.58165 
##  with a difference of  1.101647  which is     pretty good... as expected 
## Mean of the REACH sample is  36.77  
##  while the actual mean is    38.58165 
##  with a difference of  1.811647  which is     pretty good... as expected 
## Mean of the REACH sample is  39.65  
##  while the actual mean is    38.58165 
##  with a difference of  -1.068353  which is     pretty good... as expected

##But what if you don’t know and keep assuming normality

Now, as we’ve worked a bit on what does it mean that a variable is normally distributed, let’s look at another variable, namely capital_gain which tells us annual gains from financial instruments outside of standard wages etc.

Let’s further assume that we don’t know anything about the underlying “population” but rather just take a sample and go for it. Let’s take a sample of 100 observations from this and get our factsheet ready.

capital_data <-dataset$capital_gain
REACH_sample2<-sample(capital_data, size = 100, replace = FALSE)
  mean_REACH_sample2 <- mean(REACH_sample2)
  cat("Mean capital gain is ", mean_REACH_sample2, " and we are good to go.")

## Mean capital gain is  502.37  and we are good to go.

All looks good unless you are REALLY smart, or you happen to run the above calculation a couple of times. Let’s do it.

REACH_sample2<-sample(capital_data, size = 100, replace = FALSE)
mean_REACH_sample2 <- mean(REACH_sample2)
cat("Mean capital gain is ", mean_REACH_sample2, "\n")

## Mean capital gain is  341.85

for (i in 1:5){
  REACH_sample2<-sample(capital_data, size = 100, replace = FALSE)
  mean_REACH_sample2 <- mean(REACH_sample2)
  cat("Mean capital gain is ", mean_REACH_sample2, " WTF! \n")
}

## Mean capital gain is  2664.05  WTF! 
## Mean capital gain is  338.72  WTF! 
## Mean capital gain is  1384  WTF! 
## Mean capital gain is  1156.5  WTF! 
## Mean capital gain is  1625.24  WTF!

If you had (easily) now reported that with a 95% confidence, the sample mean is within +-10% from the true mean, you’d be clearly lying as - first of all - our sample mean seem to be all over the place. There’s something fishy going on with this data. You’d next probably wanna plot this out and see what the #%#ck is going on.

for (i in 1:5){
  plot(sample(capital_data, size = 100, replace = FALSE))
}

It’s starting to look there’s nothing very normal at all with our samples which begs the question whether the underlying data is normally distributed and whether we can use our normal tools at all. In this case, it’s certain that we should not rely on our standard sample size calculation which is based on the (now wrong) assumption that the underlying data is normally distributed.

##Capital Gain Data: What is going on

Let’s go back into the population data and look what’s actually goin on at the actual population data by examining the capital gains.Let’s first look at the standard statistics of the capital population and then calculate how much data lies within one standard deviation from the true mean.

#Descriptive stats
capital_data <-dataset$capital_gain
mean_capital <-mean(capital_data)
cat("mean capital gain is", mean_capital, "\n")

## mean capital gain is 1077.649

sd_capital <-sd(capital_data)
cat("standard deviation of capital gain is", sd_capital, "\n")

## standard deviation of capital gain is 7385.292

kurtosis_capital <- kurtosis(capital_data)
cat("kurtosis of capital gain is", kurtosis_capital, "\n")

## kurtosis of capital gain is 157.7755

#Calculating how much data is within one SD from the mean
lower_limit <- mean_capital - sd_capital
upper_limit <- mean_capital + sd_capital
between_sd_values <- capital_data[capital_data >= lower_limit & age_data <= upper_limit]
percentage_between_sd_values <- length(between_sd_values)/length(capital_data)
cat("And as we see, ", percentage_between_sd_values, " % of the values fall within one standard deviation from the mean, meaning our dataset is not normal which is further illustrated by the kurtosis ", kurtosis_capital, " indicating a fat tail, which should make us worry")

## And as we see,  1  % of the values fall within one standard deviation from the mean, meaning our dataset is not normal which is further illustrated by the kurtosis  157.7755  indicating a fat tail, which should make us worry

I’ll further order the dataset from smallest to the highest and plot those in a diagram.

ordered_capital_data_asc <- capital_data[order(capital_data, decreasing = FALSE)]
plot(ordered_capital_data_asc, xlab="ordered observations", ylab="gain in USD")

From the ordered dataset, you can see that the dataset is not nicely grouped around the mean. It’s called power law distribution (Pareto etc.) and

Conclusion

The main point here is that unless the underlying population data is normally distributed, nothing what you expect to work does not work. There are indeed some sophisticted models what can be used to tackle such phenomena but those are out of our reach and thus we should be careful. Lending from Nassim Taleb, these phenomenas follow ‘fractal’ or power law properties and should be thus handled differently.

To be continued….

Note that the echo = FALSE parameter was added to the code chunk to prevent printing of the R code that generated the plot.

Testing the normality of datasets

Juha Ronkainen

2023-10-13

Introduction

Age data: Descriptive Stats and plots

Age data: What normality of the underlying data means for your sample

Conclusion