Ankit Munot - s3764950
We want to understand the distribution of one body measurement separately in men and women and compare it with a normal distribution. We use the data set bdims.csv, and the body measurement considered is wri.di (Respondent’s wrist diameter in cm, measured as the sum of two wrists). The assignment involves importing the data set and calculating summary statistics for men and women separately. We then study the distribution of wri.di in men and women by plotting, for each group, a histogram with a normal distribution overlay.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
bdims <- read.csv("G:/Semester -1/Introduction to Statistics/bdims.csv")
bdims$sex <- factor(bdims$sex, levels= c(0,1), labels = c("Female","Male"))
head(bdims)
## bia.di bii.di bit.di che.de che.di elb.di wri.di kne.di ank.di sho.gi
## 1 42.9 26.0 31.5 17.7 28.0 13.1 10.4 18.8 14.1 106.2
## 2 43.7 28.5 33.5 16.9 30.8 14.0 11.8 20.6 15.1 110.5
## 3 40.1 28.2 33.3 20.9 31.7 13.9 10.9 19.7 14.1 115.1
## 4 44.3 29.9 34.0 18.4 28.2 13.9 11.2 20.9 15.0 104.5
## 5 42.5 29.9 34.0 21.5 29.4 15.2 11.6 20.7 14.9 107.5
## 6 43.3 27.0 31.5 19.6 31.3 14.0 11.5 18.8 13.9 119.8
## che.gi wai.gi nav.gi hip.gi thi.gi bic.gi for.gi kne.gi cal.gi ank.gi
## 1 89.5 71.5 74.5 93.5 51.5 32.5 26.0 34.5 36.5 23.5
## 2 97.0 79.0 86.5 94.8 51.5 34.4 28.0 36.5 37.5 24.5
## 3 97.5 83.2 82.9 95.0 57.3 33.4 28.8 37.0 37.3 21.9
## 4 97.0 77.8 78.8 94.0 53.0 31.0 26.2 37.0 34.8 23.0
## 5 97.5 80.0 82.5 98.5 55.4 32.0 28.4 37.7 38.6 24.4
## 6 99.9 82.5 80.1 95.3 57.5 33.0 28.0 36.6 36.1 23.5
## wri.gi age wgt hgt sex
## 1 16.5 21 65.6 174.0 Male
## 2 17.0 23 71.8 175.3 Male
## 3 16.9 28 80.7 193.5 Male
## 4 16.6 23 72.6 186.5 Male
## 5 18.0 22 78.8 187.2 Male
## 6 16.9 21 74.8 181.5 Male
In the above chunk, I read the data from the CSV file using read.csv(); the head() function shows the first six rows of the data.
The data frame includes 507 observations of 25 variables.
I have converted the sex variable into a factor, labelling 0 as Female and 1 as Male.
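As a small illustration of how this recoding behaves, the sketch below applies the same factor() call to a toy vector. The vector sex_codes is made up for the example and is not taken from bdims.csv.

```r
# Toy 0/1 codes for illustration only; these are NOT rows from bdims.csv
sex_codes <- c(1, 1, 0, 1, 0)

# Same recoding as in the report: 0 becomes "Female", 1 becomes "Male"
sex_factor <- factor(sex_codes, levels = c(0, 1), labels = c("Female", "Male"))

# The level order follows the order given in `levels`, not the order in the data
print(levels(sex_factor))
print(table(sex_factor))
```

Because the levels are given explicitly, Female is always the first level even when the first row of the data is Male, which is why Female is listed first in the grouped summary.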
bdims %>% group_by(sex) %>% summarise(Mean = mean(wri.di, na.rm = TRUE),
Median = median(wri.di, na.rm = TRUE),
SD = sd(wri.di, na.rm = TRUE),
Q1 = quantile(wri.di,probs = .25,na.rm = TRUE),
Q3 = quantile(wri.di,probs = .75,na.rm = TRUE),
IQR= IQR(wri.di, na.rm = TRUE),
Min = min(wri.di,na.rm = TRUE),
Max = max(wri.di,na.rm = TRUE),
Missing = sum(is.na(wri.di)))
## # A tibble: 2 x 10
## sex Mean Median SD Q1 Q3 IQR Min Max Missing
## <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
## 1 Female 9.87 9.8 0.662 9.4 10.4 1 8.1 12.2 0
## 2 Male 11.2 11.2 0.636 10.8 11.6 0.850 9.8 13.3 0
Male_respond<- bdims %>% filter(sex=="Male")
Female_respond <- bdims %>% filter(sex=="Female")
table(Female_respond$wri.di)
##
## 8.1 8.3 8.4 8.5 8.6 8.7 8.9 9 9.1 9.2 9.3 9.4 9.5 9.6 9.7
## 1 1 2 1 3 3 7 6 8 20 4 20 7 22 5
## 9.8 9.9 10 10.1 10.2 10.3 10.4 10.5 10.6 10.7 10.8 10.9 11 11.2 11.5
## 23 9 20 4 18 5 19 12 10 4 9 3 9 1 3
## 12.2
## 1
table(Male_respond$wri.di)
##
## 9.8 9.9 10 10.1 10.2 10.3 10.4 10.5 10.6 10.7 10.8 10.9 11 11.1 11.2
## 1 1 3 3 5 4 13 6 11 8 16 4 22 10 22
## 11.3 11.4 11.5 11.6 11.7 11.8 11.9 12 12.2 12.3 12.4 12.5 12.6 12.8 12.9
## 14 9 19 14 10 14 3 18 3 2 3 2 1 3 1
## 13.2 13.3
## 1 1
In the above chunk, I have calculated the mean, median, standard deviation, first and third quartiles, interquartile range, minimum and maximum of the wrist diameter, grouped by sex.
- The na.rm = TRUE argument tells each function to ignore missing (NA) values; the Missing column shows that there are none in wri.di.
- I have captured the male respondents in Male_respond and the female respondents in Female_respond.
- I chose the variable wri.di because many of its distinct values occur with high frequency.
- Using the table() function, I looked at the most frequent value (the mode) for each group: it is 9.8 for females (23 observations) and 11.2 for males (22 observations, tied with 11.0).
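Reading a mode off a printed table is easy to get slightly wrong when two values tie. As a minimal sketch, the helper all_modes() below returns every value that attains the highest count. It is a hypothetical function, not part of the original analysis. To keep the example self-contained, the vector male_wri is rebuilt from the male frequency table printed above instead of being read from bdims.csv.

```r
# Hypothetical helper (not in the original analysis): returns every value that
# attains the highest frequency, so ties are reported instead of hidden.
all_modes <- function(x) {
  freq <- table(x)
  as.numeric(names(freq)[freq == max(freq)])
}

# Male wrist diameters rebuilt from the frequency table shown above
male_values <- c(9.8, 9.9, 10, 10.1, 10.2, 10.3, 10.4, 10.5, 10.6, 10.7, 10.8,
                 10.9, 11, 11.1, 11.2, 11.3, 11.4, 11.5, 11.6, 11.7, 11.8, 11.9,
                 12, 12.2, 12.3, 12.4, 12.5, 12.6, 12.8, 12.9, 13.2, 13.3)
male_counts <- c(1, 1, 3, 3, 5, 4, 13, 6, 11, 8, 16,
                 4, 22, 10, 22, 14, 9, 19, 14, 10, 14, 3,
                 18, 3, 2, 3, 2, 1, 3, 1, 1, 1)
male_wri <- rep(male_values, times = male_counts)

# 11.0 and 11.2 tie with 22 observations each
print(all_modes(male_wri))
```

With the real data frame, the same helper could be called as all_modes(Male_respond$wri.di).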
mean_male = mean(Male_respond$wri.di)
sd_male = sd(Male_respond$wri.di)
hist(Male_respond$wri.di, breaks = 30, prob = TRUE,xlab = "Wrist diameter of male population in cm, measured as sum of two wrists", main= "Normal Curve over Histogram for Male Respondents",col="lightblue")
Male <- seq(min(Male_respond$wri.di),max(Male_respond$wri.di),0.1)
Male_Normal_Distribution <- dnorm(Male, mean_male, sd_male)
points(Male,Male_Normal_Distribution, type = 'l',col="blue",lwd = 3)
mean_female = mean(Female_respond$wri.di)
sd_female = sd(Female_respond$wri.di)
hist(Female_respond$wri.di, breaks = 30, prob = TRUE,xlab = "Wrist diameter of female population in cm, measured as sum of two wrists", main= "Normal Curve over Histogram for Female Respondents",col="lightgreen")
Female <- seq(min(Female_respond$wri.di),max(Female_respond$wri.di),0.1)
Female_Normal_Distribution <- dnorm(Female, mean_female, sd_female)
points(Female,Female_Normal_Distribution, type = 'l',col="hotpink4",lwd = 3)
I plotted a normal curve over the histogram of the empirical data for both males and females. Normal distributions are symmetric, unimodal, and asymptotic, and the mean, median, and mode are all equal. In the empirical data the mean, median and mode are approximately equal: about 9.8 for females and about 11.2 for males (where the mode is tied between 11.0 and 11.2). This is consistent with normally distributed data. The histograms are also roughly symmetric and bell-shaped, which further supports approximate normality. In a normal distribution, approximately 99.7% of the data fall within three standard deviations of the mean. Both plots are consistent with this: only a handful of observations lie in the extreme tails, which is easiest to see on the right side of each plot. In conclusion, wrist diameter appears to be approximately normally distributed for both males and females, although a graphical check such as a histogram offers no guarantee of normality.
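The three-standard-deviation statement can also be checked numerically instead of by eye. The sketch below is a hedged illustration and not part of the original analysis. The helper within_k_sd() is a hypothetical name. The vector female_wri is rebuilt from the female frequency table printed earlier so that the example runs without bdims.csv.

```r
# Hypothetical helper: proportion of observations lying within k sample
# standard deviations of the sample mean.
within_k_sd <- function(x, k) {
  mean(abs(x - mean(x)) <= k * sd(x))
}

# Female wrist diameters rebuilt from the frequency table shown earlier
female_values <- c(8.1, 8.3, 8.4, 8.5, 8.6, 8.7, 8.9, 9, 9.1, 9.2, 9.3, 9.4,
                   9.5, 9.6, 9.7, 9.8, 9.9, 10, 10.1, 10.2, 10.3, 10.4, 10.5,
                   10.6, 10.7, 10.8, 10.9, 11, 11.2, 11.5, 12.2)
female_counts <- c(1, 1, 2, 1, 3, 3, 7, 6, 8, 20, 4, 20,
                   7, 22, 5, 23, 9, 20, 4, 18, 5, 19, 12,
                   10, 4, 9, 3, 9, 1, 3, 1)
female_wri <- rep(female_values, times = female_counts)

# Observed proportions within 1, 2 and 3 SDs; a normal distribution would give
# roughly 0.68, 0.95 and 0.997
observed <- sapply(1:3, function(k) within_k_sd(female_wri, k))
names(observed) <- c("1 SD", "2 SD", "3 SD")
print(round(observed, 3))
```

Proportions close to the normal benchmarks support the visual impression but, like the histogram, they do not prove normality. The same call works on Male_respond$wri.di or Female_respond$wri.di from the real data frame.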