MATH1324 Assignment 1

Student Details

Lai Teng Wong (s3714421)

Problem Statement

The Body Measurements Dataset (bdims.csv) from Exploring Relationships in Body Dimensions consists of 507 observations of 25 variables, taken from 247 physically active men and 260 physically active women. The variables of the data set include nine skeletal measurements, twelve girth measurements, as well as age, weight, height and gender.

However, we will not look at every variable in the data set. Instead, we focus on two: the respondent’s weight in kilograms, denoted by wgt, and the respondent’s gender (1 for male, 0 for female), denoted by sex. We will investigate whether wgt fits a normal distribution for men and for women separately by comparing its empirical distribution to a normal distribution, using a histogram with a normal distribution overlay and a Q-Q plot.

Load Packages

library(readr)
library(dplyr)
library(magrittr)
library(ggplot2)
library(tidyverse)
library(knitr)
library(kableExtra)
library(colourpicker)
library(RColorBrewer)
library(moments)

Data

setwd("C:/Users/laite/Desktop/Applied Analytics/Assignment 1") #Set working directory
bdim <- read_csv("bdims.csv") #Read csv

## Rows: 507 Columns: 25

## -- Column specification --------------------------------------------------------
## Delimiter: ","
## dbl (25): bia.di, bii.di, bit.di, che.de, che.di, elb.di, wri.di, kne.di, an...

## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.

any(is.na(bdim)) #Check if the data set has any missing values

## [1] FALSE

str(bdim$sex) #Check the structure of sex variable

##  num [1:507] 1 1 1 1 1 1 1 1 1 1 ...

str(bdim$wgt) #Check the structure of wgt variable

##  num [1:507] 65.6 71.8 80.7 72.6 78.8 74.8 86.4 78.4 62 81.6 ...

bdim$sex <- bdim$sex %>% factor(levels = c(0,1), labels = c("Female","Male"))
levels(bdim$sex)

## [1] "Female" "Male"

bdim <- bdim %>% select(sex, wgt) #Select our focus variable: sex and wgt
bdim_female <- bdim %>% filter(sex == "Female") #Filter data set by Female
bdim_male <- bdim %>% filter(sex == "Male") #Filter data set by Male

The original data set bdims.csv was stored as a data frame under bdim and does not have any missing values, as any(is.na(bdim)) returned output: FALSE.

sex was imported into R as numeric and has been converted into an unordered factor with 2 levels: 0 labelled “Female” and 1 labelled “Male”.

wgt was imported correctly as numeric.

Since we are only focusing on sex and wgt, I used the select() function to select these 2 variables, stored them under bdim, and filtered them to 2 separate data sets: bdim_male for Male and bdim_female for Female.
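The recoding described above can be checked on a small toy vector. This is only a sketch: sex_codes and sex_factor are hypothetical names and the values are made up, not taken from bdims.csv. It shows that factor() with levels = c(0,1) and labels = c("Female","Male") maps 0 to "Female" and 1 to "Male".

```r
# Toy codes standing in for the sex column (hypothetical values, not from bdims.csv)
sex_codes <- c(1, 0, 0, 1, 1)

# Same recoding as in the report: level 0 -> "Female", level 1 -> "Male"
sex_factor <- factor(sex_codes, levels = c(0, 1), labels = c("Female", "Male"))

print(sex_factor)         # the relabelled values
print(table(sex_factor))  # number of observations per level
```

Running table() on the real sex column in the same way is a quick check that the group sizes match the 260 women and 247 men stated in the problem statement.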

Summary Statistics

The summary statistics computed below include the mean, median, standard deviation, first and third quartiles, interquartile range, minimum and maximum of wgt grouped by sex. I have also calculated skewness and kurtosis to facilitate the interpretation of the distributions.

summary(bdim_female) #Summary statistics for Female

##      sex           wgt       
##  Female:260   Min.   : 42.0  
##  Male  :  0   1st Qu.: 54.5  
##               Median : 59.0  
##               Mean   : 60.6  
##               3rd Qu.: 65.6  
##               Max.   :105.2

summary(bdim_male) #Summary statistics for Male

##      sex           wgt        
##  Female:  0   Min.   : 53.90  
##  Male  :247   1st Qu.: 70.95  
##               Median : 77.30  
##               Mean   : 78.14  
##               3rd Qu.: 85.50  
##               Max.   :116.40

#Show the summary statistics for both groups in table form
bdim %>% group_by(sex) %>% 
  summarise(Mean=mean(wgt) %>% round(3), 
            SD = sd(wgt) %>% round(3), 
            Median = median(wgt), 
            First_Quartile = quantile(wgt,probs=.25), 
            Third_Quartile = quantile(wgt,probs=.75), 
            Interquartile_Range= IQR(wgt,na.rm=TRUE), 
            Min = min(wgt), 
            Max = max(wgt)) %>% 
  kbl() %>% kable_styling(latex_options="scale_down", bootstrap_options= "condensed")

sex	Mean	SD	Median	First_Quartile	Third_Quartile	Interquartile_Range	Min	Max
Female	60.600	9.616	59.0	54.50	65.6	11.10	42.0	105.2
Male	78.145	10.513	77.3	70.95	85.5	14.55	53.9	116.4

skewness(bdim_female$wgt) #Calculate the skewness of weight in females

## [1] 1.141751

kurtosis(bdim_female$wgt) #Calculate the kurtosis of weight in females

## [1] 5.590288

skewness(bdim_male$wgt) #Calculate the skewness of weight in males

## [1] 0.2911787

kurtosis(bdim_male$wgt) #Calculate the kurtosis of weight in males

## [1] 3.150565
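To make the skewness and kurtosis figures easier to interpret, the sketch below writes the moment-based definitions out in base R. As far as I understand, this is the definition the moments package uses (Pearson's kurtosis, where a normal distribution scores 3 rather than 0), but treat that as an assumption. The helper names and the toy vectors are hypothetical and are not part of the data set.

```r
# k-th central moment, dividing by n (not n - 1)
central_moment <- function(x, k) mean((x - mean(x))^k)

# Skewness: third central moment scaled by the variance to the power 3/2
skewness_by_hand <- function(x) central_moment(x, 3) / central_moment(x, 2)^1.5

# Kurtosis (Pearson, not excess): fourth central moment over squared variance
kurtosis_by_hand <- function(x) central_moment(x, 4) / central_moment(x, 2)^2

toy_symmetric  <- c(1, 2, 3, 4, 5)      # made-up symmetric values
toy_right_skew <- c(1, 1, 2, 2, 3, 10)  # made-up values with one large observation

skewness_by_hand(toy_symmetric)   # zero: the values are symmetric about the mean
skewness_by_hand(toy_right_skew)  # positive: long right tail
kurtosis_by_hand(toy_right_skew)  # raised by the single extreme value
```

A positive skewness means the large deviations sit above the mean, and a kurtosis above 3 means the tails are heavier than those of a normal distribution.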

Distribution Fitting

#Plot a histogram with normal distribution overlay for weight in females
ggplot(bdim_female, aes(x = wgt)) + 
  geom_histogram(aes(y = after_stat(density)), 
                 colour = "black", fill = "thistle1", 
                 breaks = seq(40, 110, by = 2))+
  stat_function(fun = dnorm,
                args = list(mean = mean(bdim_female$wgt),
                            sd = sd(bdim_female$wgt)),
                col = "red", size = 1) + 
  xlab("Weight of Female Respondents (kg)") + 
  ylab("Density") + 
  theme_bw() +
  ggtitle("Distribution of Weight in Female Respondents") + 
  theme(plot.title = element_text(hjust = 0.5)) +
  scale_x_continuous(breaks = seq(40, 110, by = 10)) +
  geom_vline(xintercept=mean(bdim_female$wgt), size=1, color="blue") +  
  geom_vline(xintercept=median(bdim_female$wgt), size=1, color="green")

#Plot a normal Q-Q plot for weight in females
ggplot(bdim_female, aes(sample = wgt)) +
  geom_qq(color = "maroon3") + 
  geom_qq_line(color = "orange", size =1) + 
  labs( y = "Weight of Female Respondents (kg)", x = "Theoretical Quantile") + 
  theme_bw() + 
  ggtitle("Normal Q-Q Plot for Female Respondents") + 
  theme(plot.title = element_text(hjust = 0.5))

#Plot a histogram and normal distribution overlay for weight in males
ggplot(bdim_male, aes(x = wgt)) + 
  geom_histogram(aes(y = after_stat(density)), 
                 colour = "black", fill = "lightcyan", 
                 breaks = seq(50, 120, by = 2)) +
  stat_function(fun = dnorm,
                args = list(mean = mean(bdim_male$wgt),
                            sd = sd(bdim_male$wgt)),
                col = "red", size = 1) + 
  xlab("Weight of Male Respondents (kg)") + 
  ylab("Density") + 
  theme_bw() +
  ggtitle("Distribution of Weight in Male Respondents") + 
  theme(plot.title = element_text(hjust = 0.5)) +
  scale_x_continuous(breaks = seq(50, 120, by = 10)) +
  geom_vline(xintercept=mean(bdim_male$wgt), size=1, color="blue") +
  geom_vline(xintercept=median(bdim_male$wgt), size=1, color="green")

#Plot a normal Q-Q plot for weight in males
ggplot(bdim_male, aes(sample = wgt)) +
  geom_qq(color = "dodgerblue") + 
  geom_qq_line(color = "orange", size =1) + 
  labs( y = "Weight of Male Respondents (kg)", x = "Theoretical Quantile") + 
  theme_bw() + 
  ggtitle("Normal Q-Q Plot for Male Respondents") + 
  theme(plot.title = element_text(hjust = 0.5))

To compare the empirical distribution of wgt to a normal distribution separately for men and women, the weights of males and females were plotted above on two separate density histograms, each overlaid with a normal distribution curve, shown as a red bell-shaped curve. In each histogram, the green line marks the median weight and the blue line marks the mean weight. In a normal distribution the mean, median and mode are equal, so a gap between the two lines suggests a departure from normality. Since a normal distribution is a continuous probability distribution, density rather than frequency/count was plotted on the y-axis so that the normal probability density function shares the same vertical scale as the histogram. The total area of the bars in each density histogram adds up to 1.
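The claim that the bars of a density histogram have a total area of 1 can be checked numerically. The sketch below uses simulated weights (toy_wgt is a hypothetical stand-in, not the bdims data) and base R's hist() rather than ggplot2, with the same bin width of 2 used in the plots.

```r
set.seed(1)
# Simulated weights standing in for wgt (assumption: not the real data)
toy_wgt <- rnorm(200, mean = 60, sd = 10)

# Bin the data with a bin width of 2 without drawing the plot
h <- hist(toy_wgt, breaks = seq(0, 120, by = 2), plot = FALSE)

# Area of each bar = density (bar height) * bin width
total_area <- sum(h$density * diff(h$breaks))
total_area
```

Because each bar height is the bin count divided by (n * bin width), the areas sum to 1, which is what lets the dnorm() curve be drawn on the same axis.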

To further assess whether each distribution fits a normal distribution, a Quantile-Quantile (Q-Q) plot was drawn for males and for females. An orange reference line was added to each normal Q-Q plot to help judge whether the points lie close to a straight diagonal line.
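The coordinates behind a normal Q-Q plot can be computed by hand, which may help when reading the plots. This is a sketch on simulated data (toy_wgt is hypothetical). It assumes the reference line passes through the first and third quartiles, which is the default for geom_qq_line() and for base R's qqline().

```r
set.seed(1)
# Simulated weights standing in for wgt (assumption: not the real data)
toy_wgt <- rnorm(100, mean = 60, sd = 10)
n <- length(toy_wgt)

# y-axis: ordered sample values; x-axis: matching standard normal quantiles
sample_q      <- sort(toy_wgt)
theoretical_q <- qnorm(ppoints(n))

# Reference line through the first and third quartiles of both axes
y_quart <- unname(quantile(toy_wgt, c(0.25, 0.75)))
x_quart <- qnorm(c(0.25, 0.75))
slope     <- diff(y_quart) / diff(x_quart)
intercept <- y_quart[1] - slope * x_quart[1]

# How far each point sits from the line; large values at the ends signal heavy tails
deviation <- sample_q - (intercept + slope * theoretical_q)
head(deviation)
```

For data that really are normal, the deviations stay small along the whole line. Points that bend away at one or both ends are the pattern discussed in the interpretation.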

Interpretation

Distribution of Weight in Females

Looking at the histogram for females, most of the observations fall in bins on the left side of the histogram, with fewer on the right. The distribution deviates from the bell-shaped curve: it is not symmetric and has a long right tail, indicating that it is right-skewed. This is also reflected in the mean (60.6), shown by the blue line, being greater than the median (59.0), shown by the green line.

Based on the Q-Q plot for females, the points in the middle lie close to the line, but the points at both ends do not. They curve upwards away from the line at both ends, with some outliers, indicating that the distribution of weight in females is closer to a right-skewed distribution than to a normal distribution.

Turning to the skewness and kurtosis calculations, a normal distribution has a skewness of 0 and a kurtosis of 3. The distribution of weight in females has a positive skewness of 1.14, which indicates a right skew, and a kurtosis of 5.59, which is much greater than 3. The high kurtosis indicates heavier tails than a normal distribution, consistent with the extreme values in the long right tail. Both statistics support the conclusion that the distribution is right-skewed rather than normal.

Distribution of Weight in Males

Looking at the histogram for males, the distribution deviates a little from the bell-shaped curve at the center of the histogram, but it appears much closer to normal than the distribution of weight in females. There is a short tail with a few small bins on the right side of the histogram, which suggests that the distribution is slightly right-skewed. The mean (78.145), shown by the blue line, is larger than the median (77.3), shown by the green line, but the difference is less than 1 kg.

Based on the Q-Q plot for males, the majority of the points lie very close to the straight line, except for a few points at both ends. This suggests that the distribution is approximately normal.

Turning to the skewness and kurtosis calculations, a normal distribution has a skewness of 0 and a kurtosis of 3. The distribution of weight in males has a positive skewness of 0.29, which is close to 0 and indicates that the distribution is approximately symmetric, and a kurtosis of 3.15, which is very close to 3 and indicates that its tails are about as heavy as those of a normal distribution.

Conclusion

The initial problem statement was to determine if wgt fits a normal distribution for men and women. In the real world, it is very uncommon to come across a perfect normal distribution where mean=median=mode.

Comparing the two groups, the distribution of weight in females clearly does not fit a normal distribution. Whether to assume normality for the distribution of weight in males is debatable, given that it exhibits a very slight right skew. In this case, I will assume normality for males, because the histogram, Q-Q plot, skewness and kurtosis are all very close to what a normal distribution would produce.

If we assume normality for the distribution of weight in males, the sampling distribution of the mean weight in males will also be normally distributed, regardless of sample size. Even without that assumption, the Central Limit Theorem tells us that when the sample size is large enough (a common rule of thumb is more than 30), the sampling distribution of the mean weight will be approximately normal for both males and females, even though the underlying distribution of weight in females is right-skewed (not normal).
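The Central Limit Theorem argument can be illustrated with a short simulation. This is only a sketch under stated assumptions: the population is a simulated right-skewed (exponential) variable, not the bdims weights, and skew() is a hypothetical helper using the moment-based skewness formula.

```r
set.seed(42)

# Moment-based skewness (hypothetical helper, base R only)
skew <- function(x) mean((x - mean(x))^3) / mean((x - mean(x))^2)^1.5

# Simulated right-skewed population (assumption: exponential, not real weights)
population <- rexp(100000, rate = 1)

# Draw 5000 samples of size 30 and keep the mean of each one
sample_means <- replicate(5000, mean(sample(population, size = 30)))

skew(population)    # strongly right-skewed
skew(sample_means)  # much closer to 0, i.e. closer to a normal shape
```

The sample means are far more symmetric than the population they were drawn from, which is why the sampling distribution of the mean can be treated as approximately normal even when the underlying variable is skewed.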

In general, a normal distribution is easy to work with, and many real-world variables are approximately normally distributed. It is therefore useful to find out whether a variable fits a normal distribution. If it does, we can perform statistical tests and simple probability calculations on it, and compare empirical probabilities with their theoretical counterparts. This can lead to many useful insights.
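As an example of comparing an empirical probability with its theoretical value, the sketch below uses the male mean, standard deviation and third quartile from the summary table. The variable names are hypothetical, and the empirical proportion of 0.25 is taken from the definition of the third quartile rather than recounted from the data.

```r
# Values copied from the summary statistics table for males
male_mean      <- 78.145
male_sd        <- 10.513
third_quartile <- 85.5

# Theoretical P(wgt > 85.5) if male weight were exactly normal
p_theoretical <- pnorm(third_quartile, mean = male_mean, sd = male_sd,
                       lower.tail = FALSE)

# Empirical P(wgt > 85.5): about a quarter of observations lie above the third quartile
p_empirical <- 0.25

c(theoretical = p_theoretical, empirical = p_empirical)
```

The closer the two values are, the better the normal model describes that part of the distribution.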

Reference List

Baglin, J. (2021). Module 4 Probability Distributions: Random, but Predictable. [Module Webpage]. Canvas @ RMIT University. https://astral-theory-157510.appspot.com/secured/MATH1324_Module_04.html

Baglin, J. (2021). Module 5 Sampling: Randomly Representative. [Module Webpage]. Canvas @ RMIT University. https://astral-theory-157510.appspot.com/secured/MATH1324_Module_05.html

Heinz, G., Peterson, L. J., Johnson, R. W., & Kerk, C. J. (2003). Exploring Relationships in Body Dimensions. Journal of Statistics Education, 11(2). http://jse.amstat.org/v11n2/datasets.heinz.html