MATH1324 Assignment 1

Student Details

Ankit Munot - s3764950

Problem Statement

We want to understand the distribution of one body measurement, separately for men and women, and compare it with a normal distribution. We use the bdims.csv data set, and the measurement considered is wri.di (respondent’s wrist diameter in cm, measured as the sum of the two wrists). The assignment involves importing the data set and calculating summary statistics for men and women separately. We then study the distribution of wri.di in men and women by plotting a histogram with a normal distribution overlay.

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

I have used the dplyr package to transform and summarise tabular data. It provides the filter(), group_by() and summarise() functions and the %>% (pipe) operator used in my code.

bdims <- read.csv("G:/Semester -1/Introduction to Statistics/bdims.csv")



bdims$sex <- factor(bdims$sex, levels= c(0,1), labels = c("Female","Male"))
head(bdims)

##   bia.di bii.di bit.di che.de che.di elb.di wri.di kne.di ank.di sho.gi
## 1   42.9   26.0   31.5   17.7   28.0   13.1   10.4   18.8   14.1  106.2
## 2   43.7   28.5   33.5   16.9   30.8   14.0   11.8   20.6   15.1  110.5
## 3   40.1   28.2   33.3   20.9   31.7   13.9   10.9   19.7   14.1  115.1
## 4   44.3   29.9   34.0   18.4   28.2   13.9   11.2   20.9   15.0  104.5
## 5   42.5   29.9   34.0   21.5   29.4   15.2   11.6   20.7   14.9  107.5
## 6   43.3   27.0   31.5   19.6   31.3   14.0   11.5   18.8   13.9  119.8
##   che.gi wai.gi nav.gi hip.gi thi.gi bic.gi for.gi kne.gi cal.gi ank.gi
## 1   89.5   71.5   74.5   93.5   51.5   32.5   26.0   34.5   36.5   23.5
## 2   97.0   79.0   86.5   94.8   51.5   34.4   28.0   36.5   37.5   24.5
## 3   97.5   83.2   82.9   95.0   57.3   33.4   28.8   37.0   37.3   21.9
## 4   97.0   77.8   78.8   94.0   53.0   31.0   26.2   37.0   34.8   23.0
## 5   97.5   80.0   82.5   98.5   55.4   32.0   28.4   37.7   38.6   24.4
## 6   99.9   82.5   80.1   95.3   57.5   33.0   28.0   36.6   36.1   23.5
##   wri.gi age  wgt   hgt  sex
## 1   16.5  21 65.6 174.0 Male
## 2   17.0  23 71.8 175.3 Male
## 3   16.9  28 80.7 193.5 Male
## 4   16.6  23 72.6 186.5 Male
## 5   18.0  22 78.8 187.2 Male
## 6   16.9  21 74.8 181.5 Male

In the above chunk, I read the data from the CSV file using read.csv(); the head() function shows the first six rows of the data.
The data frame includes 507 observations of 25 variables.
I have converted the sex variable into a factor, labelling 0 as Female and 1 as Male.
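
The recoding step can be checked on a small example. The sketch below uses a hypothetical toy data frame (toy, not the bdims data) and assumes the same 0/1 coding; table() then confirms how many rows fall under each label.

```r
# Hypothetical toy data, assuming the same coding as bdims: 0 = Female, 1 = Male
toy <- data.frame(sex = c(1, 0, 0, 1, 0),
                  wri.di = c(11.2, 9.8, 10.0, 11.0, 9.6))

# Attach labels to the numeric codes; label order follows the order of `levels`
toy$sex <- factor(toy$sex, levels = c(0, 1), labels = c("Female", "Male"))

dim(toy)        # rows and columns, the same kind of check that gives 507 x 25 for bdims
table(toy$sex)  # number of rows under each label
```

Running dim(bdims) and table(bdims$sex) on the real data gives the same kind of check after import.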

bdims %>% group_by(sex) %>% summarise(Mean = mean(wri.di, na.rm = TRUE),
                                      Median = median(wri.di, na.rm = TRUE),
                                      SD = sd(wri.di, na.rm = TRUE),
                                      Q1 = quantile(wri.di,probs = .25,na.rm = TRUE),
                                      Q3 = quantile(wri.di,probs = .75,na.rm = TRUE),
                                      IQR= IQR(wri.di, na.rm = TRUE),
                                      Min = min(wri.di,na.rm = TRUE),
                                      Max = max(wri.di,na.rm = TRUE),
                                      Missing = sum(is.na(wri.di)))

## # A tibble: 2 x 10
##   sex     Mean Median    SD    Q1    Q3   IQR   Min   Max Missing
##   <fct>  <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>   <int>
## 1 Female  9.87    9.8 0.662   9.4  10.4 1       8.1  12.2       0
## 2 Male   11.2    11.2 0.636  10.8  11.6 0.850   9.8  13.3       0

Male_respond<- bdims %>% filter(sex=="Male")
Female_respond <- bdims %>% filter(sex=="Female")

table(Female_respond$wri.di)

## 
##  8.1  8.3  8.4  8.5  8.6  8.7  8.9    9  9.1  9.2  9.3  9.4  9.5  9.6  9.7 
##    1    1    2    1    3    3    7    6    8   20    4   20    7   22    5 
##  9.8  9.9   10 10.1 10.2 10.3 10.4 10.5 10.6 10.7 10.8 10.9   11 11.2 11.5 
##   23    9   20    4   18    5   19   12   10    4    9    3    9    1    3 
## 12.2 
##    1

table(Male_respond$wri.di)

## 
##  9.8  9.9   10 10.1 10.2 10.3 10.4 10.5 10.6 10.7 10.8 10.9   11 11.1 11.2 
##    1    1    3    3    5    4   13    6   11    8   16    4   22   10   22 
## 11.3 11.4 11.5 11.6 11.7 11.8 11.9   12 12.2 12.3 12.4 12.5 12.6 12.8 12.9 
##   14    9   19   14   10   14    3   18    3    2    3    2    1    3    1 
## 13.2 13.3 
##    1    1

In the above chunk, I have calculated the mean, median, standard deviation, first and third quartiles, interquartile range, minimum and maximum of wrist diameter, along with the number of missing values, grouped by sex.
The na.rm = TRUE argument excludes missing (NA) values from each calculation; the Missing column shows that there are none for either sex.
I have stored the male respondents in Male_respond and the female respondents in Female_respond.
I chose the variable wri.di because several of its values occur with high frequency.
Using the table() function, I looked at the most frequent value (the mode) for each sex: it is 9.8 for females (23 observations), while for males 11.0 and 11.2 are tied (22 observations each).
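
Reading the mode off a printed table is easy to get wrong when two values tie. As a minimal sketch, the helper below (modes is a hypothetical name, not a base R function) returns every value that reaches the highest frequency; the input vector is made up for illustration.

```r
# Return every value that attains the highest frequency (so ties are kept)
modes <- function(x) {
  tab <- table(x)
  as.numeric(names(tab)[tab == max(tab)])
}

# Hypothetical values chosen so that two values tie for most frequent
toy_wrist <- c(10.8, 11.0, 11.0, 11.2, 11.2, 11.5)
modes(toy_wrist)
```

Applied to the data, this would be modes(Male_respond$wri.di). A shortcut based on which.max() would report only the first of the tied values.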

mean_male = mean(Male_respond$wri.di) 
sd_male = sd(Male_respond$wri.di) 
hist(Male_respond$wri.di, breaks = 30, prob = TRUE,xlab = "Wrist diameter of male population in cm, measured as sum of two wrists", main= "Normal Curve over Histogram for Male Respondents",col="lightblue")
Male <- seq(min(Male_respond$wri.di),max(Male_respond$wri.di),0.1) 
Male_Normal_Distribution <- dnorm(Male, mean_male, sd_male)  
points(Male,Male_Normal_Distribution, type = 'l',col="blue",lwd = 3)

mean_female = mean(Female_respond$wri.di) 
sd_female = sd(Female_respond$wri.di) 
hist(Female_respond$wri.di, breaks = 30, prob = TRUE,xlab = "Wrist diameter of female population in cm, measured as sum of two wrists", main= "Normal Curve over Histogram for Female Respondents",col="lightgreen")
Female <- seq(min(Female_respond$wri.di),max(Female_respond$wri.di),0.1) 
Female_Normal_Distribution <- dnorm(Female, mean_female, sd_female)  
points(Female,Female_Normal_Distribution, type = 'l',col="hotpink4",lwd = 3)

I calculate the mean and standard deviation of wrist diameter for males and females. A histogram is then plotted for each sex, with wrist diameter on the x axis and density on the y axis.
The seq() function generates a regular sequence from the minimum value to the maximum value in steps of 0.1, as specified in the code above.
The dnorm() function gives the density of the normal distribution with mean and standard deviation equal to those of the data.
The points() function, called with type = 'l', joins these (x, density) coordinates with a line, which draws the normal curve over the histogram.
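
The overlay only lines up because prob = TRUE puts the histogram on the density scale, where the bar areas sum to 1 just like the area under the dnorm() curve. The sketch below illustrates this on simulated data; the vector x and its parameters are assumptions for illustration, not the bdims measurements.

```r
set.seed(1)
# Simulated stand-in for a wrist-diameter column (not the bdims data)
x <- rnorm(250, mean = 11.2, sd = 0.64)

# plot = FALSE returns the bin breaks and densities without drawing anything
h <- hist(x, breaks = 30, plot = FALSE)

# On the density scale the bar areas (height times width) sum to 1
sum(h$density * diff(h$breaks))

# Draw the histogram on the density scale and add the fitted normal curve
hist(x, breaks = 30, prob = TRUE, main = "Simulated data", xlab = "x")
grid_x <- seq(min(x), max(x), length.out = 200)
lines(grid_x, dnorm(grid_x, mean(x), sd(x)), lwd = 2)
```

Here lines() does the same job as points(..., type = 'l'); with the default frequency scale the curve would sit far below the bars.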

Interpretation

I plotted a normal curve over the histogram of the empirical data for both males and females. Normal distributions are symmetric, unimodal and asymptotic, and their mean, median and mode are all equal. In the empirical data the mean, median and mode are approximately equal: for females they are 9.87, 9.8 and 9.8, and for males the mean and median are both 11.2 with modes at 11.0 and 11.2, which is consistent with normally distributed data. Both histograms are also roughly symmetric and bell shaped, which further supports approximate normality. In a normal distribution, approximately 99.7% of the data fall within three standard deviations of the mean. The frequency tables agree with this: only one female observation (12.2) and two male observations (13.2 and 13.3) lie more than three standard deviations from the mean, all in the right tail of the plots. In conclusion, wrist diameter appears to be approximately normally distributed for both males and females, although a graphical check such as a histogram offers no guarantee.
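
The three-standard-deviation check can also be done numerically rather than by eye. The sketch below rebuilds the male measurements from the table(Male_respond$wri.di) output shown earlier; within_k_sd is a hypothetical helper name, and the Q-Q plot and Shapiro-Wilk test are suggested additional checks that were not part of the original analysis.

```r
# Male wrist diameters rebuilt from the table(Male_respond$wri.di) output
values <- c(9.8, 9.9, 10, 10.1, 10.2, 10.3, 10.4, 10.5, 10.6, 10.7, 10.8,
            10.9, 11, 11.1, 11.2, 11.3, 11.4, 11.5, 11.6, 11.7, 11.8, 11.9,
            12, 12.2, 12.3, 12.4, 12.5, 12.6, 12.8, 12.9, 13.2, 13.3)
counts <- c(1, 1, 3, 3, 5, 4, 13, 6, 11, 8, 16,
            4, 22, 10, 22, 14, 9, 19, 14, 10, 14, 3,
            18, 3, 2, 3, 2, 1, 3, 1, 1, 1)
male_wrist <- rep(values, times = counts)

# Share of observations within k standard deviations of the mean
within_k_sd <- function(x, k) mean(abs(x - mean(x)) <= k * sd(x))

# Compare with the normal benchmarks of roughly 0.68, 0.95 and 0.997
sapply(1:3, function(k) within_k_sd(male_wrist, k))

# Normal Q-Q plot: points lying close to the line are consistent with normality
qqnorm(male_wrist)
qqline(male_wrist)

# Shapiro-Wilk test of the null hypothesis that the data are normal
shapiro.test(male_wrist)
```

Because the measurements are recorded to one decimal place, the data contain many ties, so the test result is best read alongside the Q-Q plot rather than on its own.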
