1. Student Details


Rashbir Singh Kohli (s3810585)


2. Problem Statement



3. Load Packages


Two packages are used in this assignment: data.table (for fread()) and ggpubr (for ggqqplot()).

library('data.table')
library("ggpubr")

4. Data


data <- fread('bdims.csv')
df <- transform(data[, c('bic.gi', 'sex')], sex=factor(sex))
df_male <- df[sex == 1, 'bic.gi']
df_female <- df[sex == 0, 'bic.gi']

5. Summary Statistics


To calculate the summary statistics (mean, median, standard deviation, first and third quartiles, interquartile range, minimum and maximum), I used the summary() function and stored the results separately for each sex in a data frame. summary() does not report the standard deviation or the interquartile range, so these are added in sections 5.3 and 5.4.

5.1 Creating Female information dataframe

infoDfFe <- data.frame(trimws(sub(".*:", "", summary(df_female)[1:length(summary(df_female))])), metric = c('Min', '1Q', 'Median', 'Mean', '3Q', 'Max'), stringsAsFactors = FALSE)
names(infoDfFe) <- c('Value', 'metric')
infoDfFe <- transform(infoDfFe, Value = as.numeric(Value))

5.2 Creating Male information dataframe

infoDfMale <- data.frame(trimws(sub(".*:", "", summary(df_male)[1:length(summary(df_male))])), metric = c('Min', '1Q', 'Median', 'Mean', '3Q', 'Max'), stringsAsFactors = FALSE)
names(infoDfMale) <- c('Value', 'metric')
infoDfMale <- transform(infoDfMale, Value = as.numeric(Value))

5.3 Adding standard deviation

  • Calculated the standard deviation using the sd() function.
  • Appended the result to each data frame.
infoDfFe <- rbind(infoDfFe, list(sd(df_female$bic.gi), 'Std'))
infoDfMale <- rbind(infoDfMale, list(sd(df_male$bic.gi), 'Std'))

5.4 Adding IQR (IQR = Q3 - Q1)

  • Calculated the interquartile range using the IQR() function.
  • Appended the result to each data frame.
infoDfFe <- rbind(infoDfFe, list(IQR(df_female$bic.gi), 'IQR'))
infoDfMale <- rbind(infoDfMale, list(IQR(df_male$bic.gi), 'IQR'))
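
As an alternative to parsing the printed output of summary(), the same eight statistics can be computed directly as numbers. The sketch below is only an illustration: girth_summary and example_girth are hypothetical names that do not appear in the assignment, and the values in example_girth are made up. One reason to consider this route is that summary() formats its output to about four significant digits by default, so values parsed from its text are already rounded.

```r
# Hypothetical helper (not part of the original report): collects all eight
# statistics as numbers, so no text parsing of summary() output is needed.
girth_summary <- function(x) {
  q <- quantile(x, probs = c(0.25, 0.75), names = FALSE)
  data.frame(
    metric = c("Min", "1Q", "Median", "Mean", "3Q", "Max", "Std", "IQR"),
    Value  = c(min(x), q[1], median(x), mean(x), q[2], max(x), sd(x), IQR(x)),
    stringsAsFactors = FALSE
  )
}

# Made-up values for illustration only; in the report this would be
# df_female$bic.gi or df_male$bic.gi.
example_girth <- c(24.5, 26.0, 27.8, 28.3, 30.1, 33.4)
example_stats <- girth_summary(example_girth)
print(example_stats)
```

The result has the same two columns (metric and Value) as infoDfFe and infoDfMale, so the cat() calls in section 5.5 would work on it unchanged.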

5.5 Output

  • This section shows the summary statistics for ‘bic.gi’ (bicep girth) for males and females.

5.5.1 Female

  • The summary statistics for females are as follows:
    1. Mean - 28.1
    2. Median - 27.8
    3. Standard Deviation - 2.709477
    4. First Quartile - 26.4
    5. Third Quartile - 29.8
    6. Interquartile Range - 3.4
    7. Minimum value - 22.4
    8. Maximum value - 40.3
cat(' Mean value for female: ', infoDfFe$Value[infoDfFe$metric == 'Mean'], '\n',
    'Median value for female: ', infoDfFe$Value[infoDfFe$metric == 'Median'], '\n',
    'Standard Deviation value for female: ', infoDfFe$Value[infoDfFe$metric == 'Std'], '\n',
    'First Quartile value for female: ', infoDfFe$Value[infoDfFe$metric == '1Q'], '\n',
    'Third Quartile value for female: ', infoDfFe$Value[infoDfFe$metric == '3Q'], '\n',
    'Interquartile Range value for female: ', infoDfFe$Value[infoDfFe$metric == 'IQR'], '\n',
    'Minimum value for female: ', infoDfFe$Value[infoDfFe$metric == 'Min'], '\n',
    'Maximum value for female: ', infoDfFe$Value[infoDfFe$metric == 'Max'])
##  Mean value for female:  28.1 
##  Median value for female:  27.8 
##  Standard Deviation value for female:  2.709477 
##  First Quartile value for female:  26.4 
##  Third Quartile value for female:  29.8 
##  Interquartile Range value for female:  3.4 
##  Minimum value for female:  22.4 
##  Maximum value for female:  40.3

5.5.2 Male

  • The summary statistics for males are as follows:
    1. Mean - 34.4
    2. Median - 34.4
    3. Standard Deviation - 2.982037
    4. First Quartile - 32.5
    5. Third Quartile - 36.4
    6. Interquartile Range - 3.9
    7. Minimum value - 25.6
    8. Maximum value - 42.4
cat(' Mean value for male: ', infoDfMale$Value[infoDfMale$metric == 'Mean'], '\n',
    'Median value for male: ', infoDfMale$Value[infoDfMale$metric == 'Median'], '\n',
    'Standard Deviation value for male: ', infoDfMale$Value[infoDfMale$metric == 'Std'], '\n',
    'First Quartile value for male: ', infoDfMale$Value[infoDfMale$metric == '1Q'], '\n',
    'Third Quartile value for male: ', infoDfMale$Value[infoDfMale$metric == '3Q'], '\n',
    'Interquartile Range value for male: ', infoDfMale$Value[infoDfMale$metric == 'IQR'], '\n',
    'Minimum value for male: ', infoDfMale$Value[infoDfMale$metric == 'Min'], '\n',
    'Maximum value for male: ', infoDfMale$Value[infoDfMale$metric == 'Max'])
##  Mean value for male:  34.4 
##  Median value for male:  34.4 
##  Standard Deviation value for male:  2.982037 
##  First Quartile value for male:  32.5 
##  Third Quartile value for male:  36.4 
##  Interquartile Range value for male:  3.9 
##  Minimum value for male:  25.6 
##  Maximum value for male:  42.4
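
The two sets of results above can also be placed side by side to make the comparison between the sexes easier to read. The sketch below is an illustration under stated assumptions: the names female, male and comparison are hypothetical, and the numbers are copied by hand from the outputs in sections 5.5.1 and 5.5.2 rather than recomputed from the data.

```r
# Values copied from the printed output in sections 5.5.1 and 5.5.2.
metric <- c("Min", "1Q", "Median", "Mean", "3Q", "Max", "Std", "IQR")
female <- data.frame(metric,
                     Value = c(22.4, 26.4, 27.8, 28.1, 29.8, 40.3, 2.709477, 3.4))
male   <- data.frame(metric,
                     Value = c(25.6, 32.5, 34.4, 34.4, 36.4, 42.4, 2.982037, 3.9))

# Join on the metric name; the suffixes label which sex each column belongs to.
comparison <- merge(female, male, by = "metric",
                    suffixes = c("_female", "_male"), sort = FALSE)

# Male value minus female value for every statistic.
comparison$Difference <- comparison$Value_male - comparison$Value_female
print(comparison)
```

In the report itself the same merge() call could be applied to infoDfFe and infoDfMale, since both already have a metric column.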

6. Distribution Fitting


6.1 Distribution Fitting for Female

hist(df_female$bic.gi, breaks = 20, probability = TRUE, xlab = 'Bicep girth', ylab = 'Density', main = 'Histogram for Female Bicep Girth')
# Kernel density estimate of the sample
lines(density(df_female$bic.gi), col = 'Blue', lwd = 2)
# Normal curve with the sample mean and standard deviation
curve(dnorm(x, mean = mean(df_female$bic.gi), sd = sd(df_female$bic.gi)), lty = "dotted", col = "darkgreen", lwd = 4, add = TRUE)
op <- par(cex = 0.7)
legend("topright", legend = c("Density Curve for Female Sample", "Normal Curve"), bty = "n", text.col = "black", horiz = FALSE, lty = c("solid", "dotted"), lwd = c(2, 4), col = c('Blue', "darkgreen"))
par(op)  # restore the previous text size

6.2 Distribution Fitting for Male

hist(df_male$bic.gi, breaks = 20, probability = TRUE, xlab = 'Bicep girth', ylab = 'Density', main = 'Histogram for Male Bicep Girth')
# Kernel density estimate of the sample
lines(density(df_male$bic.gi), col = 'Blue', lwd = 2)
# Normal curve with the sample mean and standard deviation
curve(dnorm(x, mean = mean(df_male$bic.gi), sd = sd(df_male$bic.gi)), lty = "dotted", col = "darkgreen", lwd = 4, add = TRUE)
op <- par(cex = 0.7)
legend("topright", legend = c("Density Curve for Male Sample", "Normal Curve"), bty = "n", text.col = "black", horiz = FALSE, lty = c("solid", "dotted"), lwd = c(2, 4), col = c('Blue', "darkgreen"))
par(op)  # restore the previous text size


7. Interpretation


7.1 Interpretation based on section 5 and 6

  • From the summary statistics for female bicep girth in section 5.5.1, it can be interpreted that:
    1. The mean (28.1) is greater than the median (27.8), which suggests that the data are right-skewed.
    2. The standard deviation is 2.709477, which is under 10% of the mean, so the values are fairly tightly clustered.
    3. The first quartile is 26.4, so 25% of the values are at or below 26.4.
    4. The third quartile is 29.8, so 75% of the values are at or below 29.8.
    5. The interquartile range is 3.4, so the middle 50% of the values span only 3.4 and the spread is small.
  • From the summary statistics for male bicep girth in section 5.5.2, it can be interpreted that:
    1. The mean equals the median (both 34.4), which suggests that the data are approximately symmetric.
    2. The standard deviation is 2.982037, which is under 10% of the mean, so the values are fairly tightly clustered.
    3. The first quartile is 32.5, so 25% of the values are at or below 32.5.
    4. The third quartile is 36.4, so 75% of the values are at or below 36.4.
    5. The interquartile range is 3.9, so the middle 50% of the values span only 3.9 and the spread is small.
  • From the histogram in section 6.1, shown with the density curve and the normal curve, it can be interpreted that:
    1. The histogram is right-skewed.
    2. The histogram is narrow, which is consistent with the small standard deviation.
    3. There is a single peak near the mean, and the overall shape is close to a bell curve.
    4. The majority of the data lie between 25 and 30, which agrees with the quartiles in section 5.5.1.
    5. Checking against the 68-95-99.7 rule for a normal distribution:
      • 69.62% of the values lie within 1 standard deviation of the mean.
      • 95.38% of the values lie within 2 standard deviations of the mean.
      • 99.23% of the values lie within 3 standard deviations of the mean.
    6. It can be said that bicep girth for females closely follows a normal distribution.
print('68-95-99.7 rule')
## [1] "68-95-99.7 rule"
(length(df_female$bic.gi[df_female$bic.gi > (28.1 - 2.709477) & df_female$bic.gi < (28.1 + 2.709477)]) / length(df_female$bic.gi)) * 100
## [1] 69.61538
(length(df_female$bic.gi[df_female$bic.gi > (28.1 - 2*(2.709477)) & df_female$bic.gi < (28.1 + 2*(2.709477))]) / length(df_female$bic.gi)) * 100
## [1] 95.38462
(length(df_female$bic.gi[df_female$bic.gi > (28.1 - 3*(2.709477)) & df_female$bic.gi < (28.1 + 3*(2.709477))]) / length(df_female$bic.gi)) * 100
## [1] 99.23077
  • From the histogram in section 6.2, shown with the density curve and the normal curve, it can be interpreted that:
    1. The histogram is symmetric.
    2. The histogram is narrow, which is consistent with the small standard deviation.
    3. There is a single peak near the mean, and the overall shape is close to a bell curve.
    4. The majority of the data lie between 31 and 38, which agrees with the quartiles in section 5.5.2.
    5. Checking against the 68-95-99.7 rule for a normal distribution:
      • Roughly 68.02% of the values lie within 1 standard deviation of the mean.
      • Roughly 96.36% of the values lie within 2 standard deviations of the mean.
      • 100% of the values lie within 3 standard deviations of the mean.
    6. It can be said that bicep girth for males follows a normal distribution.
print('68-95-99.7 rule')
## [1] "68-95-99.7 rule"
(length(df_male$bic.gi[df_male$bic.gi > (34.4 - 2.982037) & df_male$bic.gi < (34.4 + 2.982037)]) / length(df_male$bic.gi)) * 100
## [1] 68.01619
(length(df_male$bic.gi[df_male$bic.gi > (34.4 - 2*(2.982037)) & df_male$bic.gi < (34.4 + 2*(2.982037))]) / length(df_male$bic.gi)) * 100
## [1] 96.35628
(length(df_male$bic.gi[df_male$bic.gi > (34.4 - 3*(2.982037)) & df_male$bic.gi < (34.4 + 3*(2.982037))]) / length(df_male$bic.gi)) * 100
## [1] 100
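
The 68-95-99.7 checks above repeat the same expression with the mean and standard deviation typed in by hand. A minimal sketch of how the check could be wrapped in a function is given below. It rests on assumptions: within_k_sd is a hypothetical name, and the data are simulated with rnorm() purely so that the example runs on its own. Because the function calls mean() and sd() on the data instead of using the rounded values 28.1 and 34.4, its results on the real data may differ slightly from the percentages reported above if any observation sits close to a boundary.

```r
# Hypothetical helper: percentage of values lying strictly within
# k standard deviations of the sample mean.
within_k_sd <- function(x, k) {
  m <- mean(x)
  s <- sd(x)
  mean(abs(x - m) < k * s) * 100
}

# Simulated data for illustration only; in the report this would be
# df_female$bic.gi or df_male$bic.gi.
set.seed(1)
simulated <- rnorm(1000, mean = 28.1, sd = 2.709477)

coverage <- sapply(1:3, function(k) within_k_sd(simulated, k))
names(coverage) <- c("1 SD", "2 SD", "3 SD")
print(round(coverage, 2))
```

For data that are close to normal, the three percentages should land near 68, 95 and 99.7.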

7.2 Interpretation based on Q-Q Plot

  • Another way of checking the normality of the data is a Q-Q plot.
  • A Q-Q plot is a scatter plot of the sample quantiles against the theoretical normal quantiles, with a diagonal reference line.
  • If the data are normal, the points lie close to the diagonal line.
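
To make the description above concrete, the sketch below builds the Q-Q points with base R instead of ggpubr. It is an illustration under assumptions: the data are simulated with rnorm(), and the names simulated, qq and qq_cor are hypothetical. The correlation between the theoretical and sample quantiles is included only as a rough numeric companion to the plot, not as a formal normality test.

```r
# Simulated data for illustration only; in the report this would be
# df_female$bic.gi or df_male$bic.gi.
set.seed(42)
simulated <- rnorm(200, mean = 34.4, sd = 2.982037)

# qqnorm() returns the theoretical quantiles (x) and the sample values (y).
qq <- qqnorm(simulated, plot.it = FALSE)

# Points that lie close to a straight line give a correlation near 1.
qq_cor <- cor(qq$x, qq$y)
print(qq_cor)

# Draw the scatter and add the reference line through the quartiles.
plot(qq$x, qq$y, xlab = "Theoretical quantiles", ylab = "Sample quantiles",
     main = "Q-Q plot (simulated data)")
qqline(simulated, col = "darkgreen", lwd = 2)
```

Points that bend away from the line at one end, as described for the female data in section 7.2.1, indicate a longer tail on that side.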

7.2.1 Q-Q Plot for female

  • Some of the points in the top-right corner fall away from the diagonal line, which is consistent with the right skew seen in section 6.1.
  • The majority of the points are close to the diagonal line, so it is reasonable to say that bicep girth for females closely follows a normal distribution.
ggqqplot(df_female$bic.gi, main = '                                  QQ Plot for Female Bicep Girth')

7.2.2 Q-Q Plot for male

  • The majority of the points are close to the diagonal line, so it is reasonable to say that bicep girth for males follows a normal distribution.
ggqqplot(df_male$bic.gi, main = '                                 QQ Plot for Male Bicep Girth')