Section 1 :- Problem Statement :- To compare the theoretical normal distribution with the empirical distribution of the data using a histogram with a normal distribution overlay.
Data Chosen :- The column chosen for this assignment was waist girth in men and women from the dataset provided.
Approach :- Normal distribution fitting was done by plotting a basic histogram and drawing a normal distribution overlay on top of it.
Section 2 :- The packages required for this assignment were readxl for importing the dataset and dplyr for selecting the required columns. The graphs were plotted with base R graphics functions (hist() and lines()).
Section 3 :- The data is imported from the dataset bdims.csv.xlsx using the following code.
#Load Package readxl
library(readxl)
#Import dataset
bdims_csv <- read_excel("~/Master of Data Science (RMIT)/Semester 1/Applied Analytics/Practice/bdims.csv.xlsx")
View(bdims_csv)
The attributes waist girth (wai.gi) and gender (sex) were selected from the dataset and stored in a separate variable named “bdims_waistgirth”.
#Load Package dplyr
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
#Selecting the columns waist girth and sex
bdims_waistgirth <- select(bdims_csv,sex,wai.gi)
View(bdims_waistgirth)
The observations in the column sex were assigned labels as female for 0 and male for 1.
#Adding labels to the column sex
bdims_waistgirth$sex <- factor(bdims_waistgirth$sex,levels = c(0,1), labels = c("female","male"))
View(bdims_waistgirth)
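The labelling step above can be checked on a small vector. The following is a minimal sketch, assuming a made-up vector of codes (sex_codes is hypothetical and not taken from the dataset); it only illustrates that factor() matches levels to labels by position.

```r
# Hypothetical codes standing in for the sex column (0 = female, 1 = male)
sex_codes <- c(0, 1, 1, 0, 1)

# Map numeric codes to labels; levels and labels are matched by position
sex_labelled <- factor(sex_codes, levels = c(0, 1), labels = c("female", "male"))

# Count the observations per label to confirm the mapping
label_counts <- table(sex_labelled)
print(label_counts)
```

Comparing the counts per label with the counts of the original codes is a quick way to confirm that no observation was mislabelled.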
The waist girth data (wai.gi) were separated into two variables according to sex, i.e. male and female.
#Storing data of waist girth of males and females in different variables
bdims_waistgirth_male <- bdims_waistgirth[bdims_waistgirth$sex %in% "male",]
View(bdims_waistgirth_male)
bdims_waistgirth_female <- bdims_waistgirth[bdims_waistgirth$sex %in% "female",]
View(bdims_waistgirth_female)
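The subsetting above uses %in% rather than ==. The sketch below, which assumes a small made-up data frame (toy is hypothetical, including its missing value), shows one practical difference: %in% treats a missing sex as a non-match, whereas == returns NA and produces an extra all-NA row.

```r
# Hypothetical data frame with one missing value in sex
toy <- data.frame(
  sex = factor(c("female", "male", NA, "male")),
  wai.gi = c(68.3, 83.4, 75.0, 90.0)
)

# %in% returns FALSE for the missing value, so only the two male rows are kept
toy_male_in <- toy[toy$sex %in% "male", ]

# == returns NA for the missing value, which adds a row filled with NA
toy_male_eq <- toy[toy$sex == "male", ]

print(nrow(toy_male_in))
print(nrow(toy_male_eq))
```

If the sex column has no missing values, both forms should give the same result.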
Section 4 :- The summary statistics for women are as follows.
#Summary statistics of waist girth of females
bdims_waistgirth_female$wai.gi %>% summary()
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 57.90 64.75 68.30 69.80 72.75 101.50
The summary statistics for men are as follows.
#Summary statistics of waist girth of males
bdims_waistgirth_male$wai.gi %>% summary()
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 67.10 77.90 83.40 84.53 90.00 113.20
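The two summaries above could also be produced in one call by grouping on sex. This is a minimal sketch using base R's tapply() on made-up values (the toy data frame is hypothetical and does not reproduce the figures above).

```r
# Hypothetical data with the same column names as bdims_waistgirth
toy <- data.frame(
  sex = factor(c("female", "female", "female", "male", "male", "male")),
  wai.gi = c(64.0, 68.0, 72.0, 78.0, 84.0, 90.0)
)

# Apply summary() to waist girth separately for each level of sex
group_summaries <- tapply(toy$wai.gi, toy$sex, summary)
print(group_summaries)

# The same pattern works for a single statistic such as the mean
group_means <- tapply(toy$wai.gi, toy$sex, mean)
print(group_means)
```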
Section 5 :-
The data for men and women were plotted as separate histograms with a normal distribution overlay. The plots suggest that the waist girth data of men and women closely fit the normal distribution when taken separately.
The normal distribution overlay was plotted using the lines() function, with xfit being a sequence of evenly spaced values spanning the range of the data and yfit being the normal density at those values, scaled by the bin width and the number of observations so that it matches the frequency scale of the histogram.
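The scaling of the overlay can be sketched on simulated data. The example below assumes made-up values drawn with rnorm() (not the waist girth data) and shows why the density is multiplied by the bin width and the sample size: a frequency histogram has total area n times the bin width, while a density curve has area 1.

```r
# Simulated stand-in for a waist girth column (assumption, not real data)
set.seed(1)
x <- rnorm(200, mean = 80, sd = 8)

# Compute the histogram without drawing it, to inspect its bins
h <- hist(x, breaks = 10, plot = FALSE)
bin_width <- diff(h$breaks[1:2])

# Evenly spaced grid over the data range, and the fitted normal density on it
xfit <- seq(min(x), max(x), length.out = 200)
yfit <- dnorm(xfit, mean = mean(x), sd = sd(x))

# Rescale from density (area 1) to frequency (area n * bin width)
yfit <- yfit * bin_width * length(x)

# Total area of the frequency histogram
hist_area <- sum(h$counts) * bin_width

# Draw the histogram and the rescaled curve
plot(h, col = "grey", main = "Simulated data with normal overlay", xlab = "x")
lines(xfit, yfit, col = "blue", lwd = 3)
```

Because the bins have equal width, the spacing between bin midpoints (h$mids) equals the spacing between bin edges (h$breaks), so either can be used for the bin width.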
Observation :-
Men :- The normal distribution overlay fits the histogram for men closely. The mean is 84.533, the median is 83.4 and the standard deviation is 8.782. The mean and the median are close to each other. In theory, the mean and median of a normal distribution are equal; however, exact equality is rarely observed in real-world data. The standard deviation is small relative to the mean, which indicates that the majority of the observations are clustered around the mean.
#Histogram for Men
x <- bdims_waistgirth_male$wai.gi
mean(x)
## [1] 84.5332
median(x)
## [1] 83.4
sd(x)
## [1] 8.782241
h<-hist(x, breaks=10, col="yellow", xlab="Waist Girth in Men",
main="Histogram with Normal Curve for Waist Girth in Men")
xfit<-seq(min(x),max(x),length=247)
yfit<-dnorm(xfit,mean=mean(x),sd=sd(x))
yfit <- yfit*diff(h$mids[1:2])*length(x)
lines(xfit, yfit, col="blue", lwd=3)
Women :- The histogram plotted for women was very close to the theoretical normal distribution bell curve shown by the overlay. The mean is 69.803, the median is 68.3 and the standard deviation is 7.587. The mean and the median are close to each other, which suggests that the distribution is roughly symmetric, as expected of a normal distribution. The small standard deviation suggests little variation of the observations around the average value.
#Histogram for Women
x <- bdims_waistgirth_female$wai.gi
mean(x)
## [1] 69.80346
median(x)
## [1] 68.3
sd(x)
## [1] 7.587748
h<-hist(x, breaks=10, col="orange", xlab="Waist Girth in Women",
main="Histogram with Normal Curve for Waist Girth in Women")
xfit<-seq(min(x),max(x),length=260)
yfit<-dnorm(xfit,mean=mean(x),sd=sd(x))
yfit <- yfit*diff(h$mids[1:2])*length(x)
lines(xfit, yfit, col="blue", lwd=3)
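Since the same plotting steps are repeated for men and women, they could be wrapped in a function. The following is a minimal sketch, not the method used in this report: plot_hist_with_normal is a hypothetical helper name, and the call at the end uses simulated values rather than the waist girth data.

```r
# Hypothetical helper: histogram with a fitted normal curve on the frequency scale
plot_hist_with_normal <- function(x, label, fill = "grey") {
  h <- hist(x, breaks = 10, col = fill, xlab = label,
            main = paste("Histogram with Normal Curve for", label))
  # One grid point per observation, spanning the range of the data
  xfit <- seq(min(x), max(x), length.out = length(x))
  # Normal density rescaled by bin width and sample size
  yfit <- dnorm(xfit, mean = mean(x), sd = sd(x)) * diff(h$mids[1:2]) * length(x)
  lines(xfit, yfit, col = "blue", lwd = 3)
  # Return the pieces invisibly so they can be inspected if needed
  invisible(list(hist = h, xfit = xfit, yfit = yfit))
}

# Usage on simulated values (assumption, not the real data)
set.seed(42)
sim <- rnorm(250, mean = 84.5, sd = 8.8)
res <- plot_hist_with_normal(sim, "Simulated Waist Girth", fill = "yellow")
```

Using length(x) for the grid size avoids hard-coding the number of observations for each group.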
Section 6 :-
Interpretation :-
The comparison of the histograms for men and women shows that both data sets can be considered examples of approximately normally distributed data; however, the plot for waist girth in women appears to fit the definition of a normal distribution more accurately than the plot for waist girth in men. The women's data also show less spread, as their standard deviation (7.587) is smaller than that of the men (8.782).