Instruction

There are three questions. Each question contains multiple tasks. To receive a full mark in this part, you should correctly solve all tasks and add appropriate labels to your graphical summaries.

Do not modify the header of this file.

Format: All assignment tasks are marked by the prompt Task. You should enter your solution either as embedded R code below Task or as text after the prompt Answer. See examples in the Import data section.

Submission: Upon completion, you must render this worksheet (using Knit in RStudio) into an html file and submit the html file. Make sure the file extension “html” is in lower case.

Import data

In this assignment, we will use a data set collected from a survey for MATH1005 students at the start of 2022 Semester 2. There are a total of 928 students enrolled in this course.

Download the data file A1MATH1005.csv in your data folder within your MATH1062 folder. This R Markdown file (Assignment1Worksheet.Rmd) should also be saved under your MATH1062 folder.

Then import the csv file into a variable called data:

### write your code here. Here is a sample solution 
data = read.csv("A1MATH1005.csv", header = TRUE)
### the following displays the dimension of the data
dim(data)

## [1] 281   8

If you save the data file and the worksheet correctly, you should be able to load the data file and see that the data contain 281 observations and 8 variables.

Task: How many survey responses are there? How many variables are there?

Answer: There are 281 survey responses and 8 variables.
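
The import check above can be sketched on a small stand-in file. This is only an illustration: the temporary csv and the name `toy` are hypothetical and are not the survey data.

```r
# Write a tiny hypothetical csv to a temporary file so the sketch is self-contained
path = tempfile(fileext = ".csv")
writeLines(c("Gender,Height", "Female,154", "Male,182", "Male,"), path)

# Guard against a wrong path or working directory before importing
stopifnot(file.exists(path))
toy = read.csv(path, header = TRUE)

# dim() returns the number of rows (responses), then columns (variables)
dim(toy)
nrow(toy)
ncol(toy)
```

If `file.exists()` returns FALSE for the real file, the usual cause is that the csv and the worksheet are saved in different folders.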

Name and ID

Task: Please enter your name and ID.

Answer: Nguyen Nhat Dan Bui, SID 510009603

====START OF ASSIGNMENT QUESTIONS====

1 Histogram

Task: Display the structure of data. Then make an appropriate histogram for the variable Height on the density scale. Here you can use the default number of class intervals.

Hint: you may need to convert the Height variable into the correct type in R to generate the histogram. You also need to perform data cleaning here to get a sensible histogram for this case.
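
One possible reading of the hint, sketched on a toy vector that only mimics the kinds of entries found in Height (it is not the real data, and the names `raw`, `cleaned` and `height` are made up for illustration). It assumes that a negative entry is a sign typo rather than an invalid response.

```r
# Toy vector mimicking the raw entries (not the real data)
raw = c("154", "150cm", "171 cm", "164.5", "", "-175")

# Keep only digits, the decimal point and the minus sign
cleaned = gsub("[^0-9.-]", "", raw)

# Convert to numeric (the empty string becomes NA), then repair the sign
height = abs(as.numeric(cleaned))
height

# Count how many entries could not be converted
sum(is.na(height))
```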

### write your code here
data = read.csv("A1MATH1005.csv")
### display the structure of 'data' and 'Height' specifically
str(data)

## 'data.frame':    281 obs. of  8 variables:
##  $ Gender       : chr  "Female" "Female" "Female" "Male" ...
##  $ International: chr  "Domestic" "Domestic" "International" "Domestic" ...
##  $ Major        : chr  "Physics, Chemistry" "Biomedical Engineering" "Statistics" "Transport" ...
##  $ Height       : chr  "154" "182" "156" "172" ...
##  $ ShoeSize     : chr  "9.5" "9 AUS" "US 9" "8" ...
##  $ Age          : int  18 22 18 29 18 19 19 18 18 20 ...
##  $ Country      : chr  "Australia" "Australia" "India" "Nepal" ...
##  $ Language     : chr  "English " "English" "Hindi" "Nepali" ...

data$Height

##   [1] "154"    "182"    "156"    "172"    "193"    "167"    ""       "183"   
##   [9] "150cm"  "181"    "183"    "167"    "188"    "164.5"  "176"    "178"   
##  [17] "176"    "160"    "158"    "176"    "175"    "153"    "174"    "183"   
##  [25] "160"    "169"    "175"    "172"    "179"    "171 cm" "173"    "181"   
##  [33] "175"    "171"    "170"    "170"    "170"    "156"    "167"    "170"   
##  [41] "165"    "168"    "170"    "173"    "178"    "180"    "173"    "184"   
##  [49] "180"    "180"    "148"    "179"    "165"    "169"    "172"    "174"   
##  [57] "178"    "162"    "180"    "170"    "178"    "156"    "186"    "171"   
##  [65] "144"    "166"    "180"    "189cm"  "178"    "180"    "183"    "183"   
##  [73] "181"    "177"    "178"    "176"    "178"    "180"    "173"    "173"   
##  [81] "198"    "160"    "190"    "163"    "181"    "175"    "168"    "167"   
##  [89] "181"    "181"    "161"    "182"    "183"    "175"    "178"    "191"   
##  [97] "161"    "165"    "150"    "175"    "166"    "185"    "176"    "172"   
## [105] "183"    "168"    "160"    "171"    "164"    "187"    "189"    "164"   
## [113] "165"    "173"    "170"    "183"    "163"    "162"    "177"    "183"   
## [121] "172"    "172"    "170"    "170"    "186"    "167"    "176"    "181"   
## [129] "150"    "145"    "172"    "165"    "168"    "169"    "177"    "174"   
## [137] "175"    "173"    "161"    "151"    "183"    "184"    "155"    "158"   
## [145] "162"    "177"    "185"    "172"    "155"    "180"    "170"    "161"   
## [153] "192"    "169"    "190.5"  "188"    "181"    "191"    "184"    "172"   
## [161] "175"    "163"    "171"    "180"    "-175"   "162"    "164"    "192"   
## [169] "162"    "188"    "178"    "178"    "188"    "185"    "181"    "165"   
## [177] "175"    "180"    "155"    "174"    "184"    "174"    "171"    "175"   
## [185] "163"    "165"    "168"    "166"    "163"    "176"    "174"    "185"   
## [193] "170"    "156"    "161"    "178"    "178"    "165"    "162"    "163"   
## [201] "193"    "175"    "180"    "184"    "174"    "176"    "158"    "180"   
## [209] "181"    "181"    "178"    "164"    "181"    "168"    "178"    "164"   
## [217] "169"    "171"    "182"    "182"    "175"    "185"    "170"    "170"   
## [225] "182"    "176"    "168"    "173"    "164"    "189"    "179"    "161.5" 
## [233] "180"    "172"    "168"    "181"    "183"    "182"    "182"    "148"   
## [241] "170"    "164"    "180"    "180"    "170"    "168"    "178"    "178"   
## [249] "160"    "180"    "175"    "178"    "163"    "164"    "167"    "183"   
## [257] "185"    "181"    "180"    "186"    "178"    "183"    "173"    "186"   
## [265] "167"    "176"    "158"    "175"    "175"    "163"    "183"    "182"   
## [273] "178"    "173"    "189"    "180"    "158"    "185.4"  "158"    "182"   
## [281] "170"

# clean the data by stripping letters (such as the unit "cm") from the entries
data$Height = gsub("[a-zA-Z]", "", data$Height)
# convert 'Height' from character to numeric (empty entries become NA)
data$Height = as.numeric(data$Height)
# use abs() to make all values positive (the entry -175 is treated as a sign typo)
abs(data$Height)

##   [1] 154.0 182.0 156.0 172.0 193.0 167.0    NA 183.0 150.0 181.0 183.0 167.0
##  [13] 188.0 164.5 176.0 178.0 176.0 160.0 158.0 176.0 175.0 153.0 174.0 183.0
##  [25] 160.0 169.0 175.0 172.0 179.0 171.0 173.0 181.0 175.0 171.0 170.0 170.0
##  [37] 170.0 156.0 167.0 170.0 165.0 168.0 170.0 173.0 178.0 180.0 173.0 184.0
##  [49] 180.0 180.0 148.0 179.0 165.0 169.0 172.0 174.0 178.0 162.0 180.0 170.0
##  [61] 178.0 156.0 186.0 171.0 144.0 166.0 180.0 189.0 178.0 180.0 183.0 183.0
##  [73] 181.0 177.0 178.0 176.0 178.0 180.0 173.0 173.0 198.0 160.0 190.0 163.0
##  [85] 181.0 175.0 168.0 167.0 181.0 181.0 161.0 182.0 183.0 175.0 178.0 191.0
##  [97] 161.0 165.0 150.0 175.0 166.0 185.0 176.0 172.0 183.0 168.0 160.0 171.0
## [109] 164.0 187.0 189.0 164.0 165.0 173.0 170.0 183.0 163.0 162.0 177.0 183.0
## [121] 172.0 172.0 170.0 170.0 186.0 167.0 176.0 181.0 150.0 145.0 172.0 165.0
## [133] 168.0 169.0 177.0 174.0 175.0 173.0 161.0 151.0 183.0 184.0 155.0 158.0
## [145] 162.0 177.0 185.0 172.0 155.0 180.0 170.0 161.0 192.0 169.0 190.5 188.0
## [157] 181.0 191.0 184.0 172.0 175.0 163.0 171.0 180.0 175.0 162.0 164.0 192.0
## [169] 162.0 188.0 178.0 178.0 188.0 185.0 181.0 165.0 175.0 180.0 155.0 174.0
## [181] 184.0 174.0 171.0 175.0 163.0 165.0 168.0 166.0 163.0 176.0 174.0 185.0
## [193] 170.0 156.0 161.0 178.0 178.0 165.0 162.0 163.0 193.0 175.0 180.0 184.0
## [205] 174.0 176.0 158.0 180.0 181.0 181.0 178.0 164.0 181.0 168.0 178.0 164.0
## [217] 169.0 171.0 182.0 182.0 175.0 185.0 170.0 170.0 182.0 176.0 168.0 173.0
## [229] 164.0 189.0 179.0 161.5 180.0 172.0 168.0 181.0 183.0 182.0 182.0 148.0
## [241] 170.0 164.0 180.0 180.0 170.0 168.0 178.0 178.0 160.0 180.0 175.0 178.0
## [253] 163.0 164.0 167.0 183.0 185.0 181.0 180.0 186.0 178.0 183.0 173.0 186.0
## [265] 167.0 176.0 158.0 175.0 175.0 163.0 183.0 182.0 178.0 173.0 189.0 180.0
## [277] 158.0 185.4 158.0 182.0 170.0

Height = abs(data$Height)
# histogram of the variable 'Height' on the density scale
hist(Height, freq = FALSE, right = FALSE, xlab = "Height (cm)", main = "Histogram for Height of MATH1005 students in 2022 Semester 2")
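
On the density scale, the area of each bar is the proportion of observations in that class interval, so the bar areas add up to 1. A minimal check of this, using made-up values and hypothetical names (`x`, `h`, `areas`):

```r
# Toy heights (not the real data)
x = c(150, 160, 165, 170, 175, 180, 185, 190, 200)

# Compute the histogram without drawing it; the result holds breaks and densities
h = hist(x, right = FALSE, plot = FALSE)

# Area of each bar = height on the density scale * width of the class interval
areas = h$density * diff(h$breaks)
sum(areas)
```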

2 Sample mean and sample median

Here our overall aim is to use sample mean and median to determine the shape of the distribution of the observed height.

Task: Calculate the sample mean and sample median of the observed height.

Hint: you need to remove data entries with value NA, which are missing data. One possible way is to use the function is.na() to identify data entries with value NA.

### write your code here
###
# remove NA values
Height = na.omit(Height)
mean(Height)

## [1] 173.2996

median(Height)

## [1] 175
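
The hint mentions is.na(); a sketch of that route is shown here on toy values (not the real data, with hypothetical names), alongside the `na.rm = TRUE` shortcut that gives the same result.

```r
# Toy heights with one missing value (not the real data)
height = c(154, 182, NA, 172, 193)

# Logical index: TRUE where the entry is missing
missing = is.na(height)

# Keep only the observed entries
observed = height[!missing]
mean(observed)
median(observed)

# Equivalent shortcut that skips the NA without creating a new vector
mean(height, na.rm = TRUE)
median(height, na.rm = TRUE)
```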

Task: Replot the histogram and use abline to indicate the sample mean and sample median on the histogram.

### write your code here
###
hist(Height, freq = FALSE, right = FALSE, xlab = "Height (cm)", main = "Histogram for Height of MATH1005 students in 2022 Semester 2")
# green line: sample mean; purple line: sample median
abline(v = mean(Height), col = "green")
abline(v = median(Height), col = "purple")

Task: By comparing the sample mean and sample median, describe the shape of the data.

Answer: The sample mean (173.3) is smaller than the sample median (175), which suggests that the distribution of height is left-skewed.
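
The reasoning can be illustrated with a small made-up sample: a few unusually low values pull the mean down but barely move the median. This is a rule of thumb rather than a proof of skewness, and the vector `x` is purely hypothetical.

```r
# Toy left-skewed sample: most values are high, a few are much lower
x = c(140, 150, 170, 174, 175, 176, 178, 180, 181)

mean(x)
median(x)

# A negative difference is consistent with a left skew
mean(x) - median(x)
```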

3 Box plot

3.1 Sample median and interquartile range

The overall aim of this sub-question is to check your understanding of the box plot and summarizing the spread of a data set using interquartile range.

Task: Calculate the first quartile, third quartile, and interquartile range (IQR) of the observed height.

### write your code here
###
summary(Height) #this is to check if the below calculation is correct

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   144.0   167.0   175.0   173.3   181.0   198.0

# calculate the first quartile (25%), median (50%) and third quartile (75%)
quantile(Height, probs = c(0.25, 0.5, 0.75), type = 1)

## 25% 50% 75% 
## 167 175 181

# calculate the interquartile range (IQR() uses quantile type 7 by default)
IQR(Height)

## [1] 14
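
The IQR is the third quartile minus the first quartile. A short check on toy values (not the real data; `x` and `q` are hypothetical names) that the manual difference agrees with IQR() when both use the default quantile definition (type 7):

```r
# Toy heights (not the real data)
x = c(150, 160, 165, 170, 175, 180, 185, 190, 200)

# First and third quartiles with the default quantile type
q = quantile(x, probs = c(0.25, 0.75))
q

# IQR by hand, then with the built-in function
unname(q[2] - q[1])
IQR(x)
```

Other `type` values in quantile() can give slightly different quartiles, so it is safest to use the same type for the quartiles and the IQR.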

Task: Make a box plot for the observed height, and then use abline to indicate the location of the sample median and the interquartile range on the box plot.

### write your code here
###
boxplot(Height, ylab = "Height (cm)", main = "Box plot for Height of MATH1005 students")
# red line: sample median
abline(h = median(Height), col = "red")
# purple dashed lines: first and third quartiles; the gap between them is the IQR
q = quantile(Height, probs = c(0.25, 0.75))
abline(h = q, col = "purple", lty = 2)
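
To see which numbers a box plot actually draws, boxplot.stats() returns them directly. The sketch below uses toy values (not the real data). The box edges are hinges, which can differ slightly from the quartiles returned by quantile().

```r
# Toy heights with one unusually large value (not the real data)
x = c(150, 160, 165, 170, 175, 180, 185, 190, 250)

s = boxplot.stats(x)

# Lower whisker, lower hinge, median, upper hinge, upper whisker
s$stats

# Points lying more than 1.5 box lengths beyond the box are flagged as outliers
s$out
```

The box spans `s$stats[2]` to `s$stats[4]`, so drawing horizontal lines at those two values marks the interquartile range on the plot.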

3.2 Comparative box plot

Here our goal is to understand the height differences between biological sexes using the comparative box plot.

Task: Use tools such as the frequency table to check how many categories are in the Gender variable and how many data points are available in each of the categories. Then, discuss what variables should be included in the subsequent analysis and what variables should be excluded. You should state the reason.

Put both your code and your discussion below.

Hint: the empty gender "" indicates missing data.

### write your code here
###
# frequency table to check how many categories and how many data points in 'Gender' variable 
table (data$Gender)

## 
##                              Female              Male Prefer not to say 
##                 1               111               167                 2

# Clean data by omitting the missing data and 'Prefer not to say' option 
data1 = na.omit(data[data$Gender !="" & data$Gender != "Prefer not to say",])

Answer: Only the two categories Female and Male should be included in the subsequent analysis. The single empty entry is missing data, and “Prefer not to say” does not record the respondent's biological sex, so neither can be assigned to a group when comparing heights between the sexes. “Prefer not to say” also has only 2 responses, which is too few to summarise reliably. Excluding both avoids adding noise to the analysis and reduces the risk of misinterpreting the results.
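
A sketch of the same selection on a toy data frame (hypothetical values and names, not the survey data). Filtering rows of the data frame, rather than each column separately, keeps Gender and Height aligned.

```r
# Toy data frame (hypothetical values, not the survey data)
toy = data.frame(
  Gender = c("Female", "Male", "", "Male", "Prefer not to say", "Female"),
  Height = c(160, 178, 170, NA, 165, 172)
)

# Frequency table of every category, including the empty string
table(toy$Gender)

# Keep rows with a usable gender category and an observed height
keep = toy$Gender %in% c("Female", "Male") & !is.na(toy$Height)
toy1 = toy[keep, ]

table(toy1$Gender)
nrow(toy1)
```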

Task: Make a comparative box plot for the observed height by splitting it by the variable Gender (the recorded biological sex).

Hint: after data cleaning and selection, you need to make sure the height variable and the gender variable have the same size

### write your code here
###
# Height and Gender both come from data1, so they already have the same length
stopifnot(length(data1$Height) == length(data1$Gender))
# Make sure all observations in the Height variable are positive
nHeight = abs(data1$Height)
# Comparative box plot
data1$Gender = factor(data1$Gender, levels = c("Female", "Male"))
Gender = data1$Gender
boxplot(nHeight ~ Gender, horizontal = TRUE, main = "Height differences between male and female students", xlab = "Height (cm)")

Task: What does the comparative box plot reveal about the height of students for biological males and biological females?

Answer: The median height of male students is noticeably higher than that of female students. The interquartile range for female students is greater than that for male students. The overall range (the distance between the ends of the two whiskers) is also greater for female students.
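
Statements read off a comparative box plot can be backed up with group-wise summaries. A minimal sketch using tapply() on toy values (hypothetical, not the survey data):

```r
# Toy data frame (hypothetical values, not the survey data)
toy = data.frame(
  Gender = factor(c("Female", "Female", "Female", "Male", "Male", "Male")),
  Height = c(158, 165, 172, 170, 178, 183)
)

# Median height within each gender category
med = tapply(toy$Height, toy$Gender, median)
med

# Interquartile range within each gender category
spread = tapply(toy$Height, toy$Gender, IQR)
spread
```

Applied to the cleaned data, the same two calls would give the medians and IQRs that the box plot displays graphically.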

====END OF THE WORKSHEET====

MATH1062 (Part B) / MATH1005 Assignment 1

© University of Sydney MATH1062 (Statistics Part) / MATH1005

17 March 2024