There are three questions. Each question contains multiple tasks. To receive a full mark in this part, you should correctly solve all tasks and add appropriate labels to your graphical summaries.
Do not modify the header of this file.
Format: All assignment tasks are marked by the
prompt Task. You should enter your
solution either as embedded R code below
Task or as text after the prompt
Answer. See examples in the Import data section.
Submission: Upon completion, you must render this
worksheet (using Knit in R Studio) into an html file and
submit the html file. Make sure the file extension “html” is in lower
case.
In this assignment, we will use a data set collected from a survey for MATH1005 students at the start of 2022 Semester 2. There are a total of 928 students enrolled in this course.
Download the data file A1MATH1005.csv in your
data folder within your MATH1062 folder. This
R Markdown file (Assignment1Worksheet.Rmd) should also be
saved under your MATH1062 folder.
Then import the csv file into a variable called
data:
### write your code here. Here is a sample solution
data = read.csv("A1MATH1005.csv", header = TRUE)
### the following displays the dimensions of the data
dim(data)
## [1] 281 8
If you save the data file and the worksheet correctly, you should be able to load the data file and see that the data contain 281 observations and 8 variables.
Task: How many survey responses are
there? How many variables are there?
Answer: There are 281 survey responses
and 8 variables.
Task: Please enter your name and
ID.
Answer: Nguyen Nhat Dan Bui, SID
510009603
====START OF ASSIGNMENT QUESTIONS====
Task: Display the structure of
data. Then make an appropriate histogram for the variable
Height on the density scale. Here you can use the default
number of class intervals.
Hint: you may need to convert the Height variable
into the correct type in R to generate the histogram. You also need to
perform data cleaning here to get a sensible histogram for this
case.
### write your code here
data = read.csv("A1MATH1005.csv")
### display the structure of 'data', then the 'Height' variable specifically
str(data)
## 'data.frame': 281 obs. of 8 variables:
## $ Gender : chr "Female" "Female" "Female" "Male" ...
## $ International: chr "Domestic" "Domestic" "International" "Domestic" ...
## $ Major : chr "Physics, Chemistry" "Biomedical Engineering" "Statistics" "Transport" ...
## $ Height : chr "154" "182" "156" "172" ...
## $ ShoeSize : chr "9.5" "9 AUS" "US 9" "8" ...
## $ Age : int 18 22 18 29 18 19 19 18 18 20 ...
## $ Country : chr "Australia" "Australia" "India" "Nepal" ...
## $ Language : chr "English " "English" "Hindi" "Nepali" ...
data$Height
## [1] "154" "182" "156" "172" "193" "167" "" "183"
## [9] "150cm" "181" "183" "167" "188" "164.5" "176" "178"
## [17] "176" "160" "158" "176" "175" "153" "174" "183"
## [25] "160" "169" "175" "172" "179" "171 cm" "173" "181"
## [33] "175" "171" "170" "170" "170" "156" "167" "170"
## [41] "165" "168" "170" "173" "178" "180" "173" "184"
## [49] "180" "180" "148" "179" "165" "169" "172" "174"
## [57] "178" "162" "180" "170" "178" "156" "186" "171"
## [65] "144" "166" "180" "189cm" "178" "180" "183" "183"
## [73] "181" "177" "178" "176" "178" "180" "173" "173"
## [81] "198" "160" "190" "163" "181" "175" "168" "167"
## [89] "181" "181" "161" "182" "183" "175" "178" "191"
## [97] "161" "165" "150" "175" "166" "185" "176" "172"
## [105] "183" "168" "160" "171" "164" "187" "189" "164"
## [113] "165" "173" "170" "183" "163" "162" "177" "183"
## [121] "172" "172" "170" "170" "186" "167" "176" "181"
## [129] "150" "145" "172" "165" "168" "169" "177" "174"
## [137] "175" "173" "161" "151" "183" "184" "155" "158"
## [145] "162" "177" "185" "172" "155" "180" "170" "161"
## [153] "192" "169" "190.5" "188" "181" "191" "184" "172"
## [161] "175" "163" "171" "180" "-175" "162" "164" "192"
## [169] "162" "188" "178" "178" "188" "185" "181" "165"
## [177] "175" "180" "155" "174" "184" "174" "171" "175"
## [185] "163" "165" "168" "166" "163" "176" "174" "185"
## [193] "170" "156" "161" "178" "178" "165" "162" "163"
## [201] "193" "175" "180" "184" "174" "176" "158" "180"
## [209] "181" "181" "178" "164" "181" "168" "178" "164"
## [217] "169" "171" "182" "182" "175" "185" "170" "170"
## [225] "182" "176" "168" "173" "164" "189" "179" "161.5"
## [233] "180" "172" "168" "181" "183" "182" "182" "148"
## [241] "170" "164" "180" "180" "170" "168" "178" "178"
## [249] "160" "180" "175" "178" "163" "164" "167" "183"
## [257] "185" "181" "180" "186" "178" "183" "173" "186"
## [265] "167" "176" "158" "175" "175" "163" "183" "182"
## [273] "178" "173" "189" "180" "158" "185.4" "158" "182"
## [281] "170"
# clean the data by stripping letters (units such as "cm") from the entries
data$Height = gsub("[a-zA-Z]","", data$Height)
# convert 'Height' from chr to num; empty entries become NA
data$Height = as.numeric(data$Height)
# use abs to make all values positive
abs(data$Height)
## [1] 154.0 182.0 156.0 172.0 193.0 167.0 NA 183.0 150.0 181.0 183.0 167.0
## [13] 188.0 164.5 176.0 178.0 176.0 160.0 158.0 176.0 175.0 153.0 174.0 183.0
## [25] 160.0 169.0 175.0 172.0 179.0 171.0 173.0 181.0 175.0 171.0 170.0 170.0
## [37] 170.0 156.0 167.0 170.0 165.0 168.0 170.0 173.0 178.0 180.0 173.0 184.0
## [49] 180.0 180.0 148.0 179.0 165.0 169.0 172.0 174.0 178.0 162.0 180.0 170.0
## [61] 178.0 156.0 186.0 171.0 144.0 166.0 180.0 189.0 178.0 180.0 183.0 183.0
## [73] 181.0 177.0 178.0 176.0 178.0 180.0 173.0 173.0 198.0 160.0 190.0 163.0
## [85] 181.0 175.0 168.0 167.0 181.0 181.0 161.0 182.0 183.0 175.0 178.0 191.0
## [97] 161.0 165.0 150.0 175.0 166.0 185.0 176.0 172.0 183.0 168.0 160.0 171.0
## [109] 164.0 187.0 189.0 164.0 165.0 173.0 170.0 183.0 163.0 162.0 177.0 183.0
## [121] 172.0 172.0 170.0 170.0 186.0 167.0 176.0 181.0 150.0 145.0 172.0 165.0
## [133] 168.0 169.0 177.0 174.0 175.0 173.0 161.0 151.0 183.0 184.0 155.0 158.0
## [145] 162.0 177.0 185.0 172.0 155.0 180.0 170.0 161.0 192.0 169.0 190.5 188.0
## [157] 181.0 191.0 184.0 172.0 175.0 163.0 171.0 180.0 175.0 162.0 164.0 192.0
## [169] 162.0 188.0 178.0 178.0 188.0 185.0 181.0 165.0 175.0 180.0 155.0 174.0
## [181] 184.0 174.0 171.0 175.0 163.0 165.0 168.0 166.0 163.0 176.0 174.0 185.0
## [193] 170.0 156.0 161.0 178.0 178.0 165.0 162.0 163.0 193.0 175.0 180.0 184.0
## [205] 174.0 176.0 158.0 180.0 181.0 181.0 178.0 164.0 181.0 168.0 178.0 164.0
## [217] 169.0 171.0 182.0 182.0 175.0 185.0 170.0 170.0 182.0 176.0 168.0 173.0
## [229] 164.0 189.0 179.0 161.5 180.0 172.0 168.0 181.0 183.0 182.0 182.0 148.0
## [241] 170.0 164.0 180.0 180.0 170.0 168.0 178.0 178.0 160.0 180.0 175.0 178.0
## [253] 163.0 164.0 167.0 183.0 185.0 181.0 180.0 186.0 178.0 183.0 173.0 186.0
## [265] 167.0 176.0 158.0 175.0 175.0 163.0 183.0 182.0 178.0 173.0 189.0 180.0
## [277] 158.0 185.4 158.0 182.0 170.0
Height = abs(data$Height)
# histogram of the variable 'Height' on the density scale
hist(Height, freq = FALSE, right = FALSE, xlab = "Height (cm)", main = "Histogram for Height of MATH1005 students in 2022 Semester 2")
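The cleaning steps above can be checked on a handful of entries before trusting the full column. This is a minimal sketch, assuming a small vector `raw` (a hypothetical name) that holds a few of the Height entries printed earlier:

```r
# A few raw entries copied from the Height column printed above
raw = c("154", "150cm", "171 cm", "164.5", "-175", "")

# Strip letters (units such as "cm"), then convert to numeric;
# the empty string becomes NA
cleaned = as.numeric(gsub("[a-zA-Z]", "", raw))

# Treat the negative sign as a typing error and take absolute values
cleaned = abs(cleaned)
cleaned
```

Treating `-175` as a typing error is an assumption; the worksheet makes the same choice by applying `abs()`.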
Here our overall aim is to use the sample mean and sample median to determine the shape of the distribution of the observed height.
Task: Calculate the sample mean and
sample median of the observed height.
Hint: you need to remove data entries with value NA,
which are missing data. One possible way is to use the function
is.na() to identify data entries with value
NA.
### write your code here
###
# remove NA values
Height = na.omit(Height)
mean(Height)
## [1] 173.2996
median(Height)
## [1] 175
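The hint mentions `is.na()`, while the solution above uses `na.omit()`. The sketch below shows the `is.na()` route on a toy vector (assumed values, not the survey data), together with the `na.rm = TRUE` shortcut, which should give the same results:

```r
# Toy vector with one missing value (assumed values, not the survey data)
x = c(154, 182, NA, 172)

# Keep only the entries that are not NA, as the hint suggests
x_obs = x[!is.na(x)]
mean(x_obs)
median(x_obs)

# Equivalent shortcut: let mean() and median() drop the NA themselves
mean(x, na.rm = TRUE)
median(x, na.rm = TRUE)
```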
Task: Replot the histogram and use
abline to indicate the sample mean and sample median on the
histogram.
### write your code here
###
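hist(Height, freq = FALSE, right = FALSE, xlab = "Height (cm)", main = "Histogram for Height of MATH1005 students in 2022 Semester 2")
# mark the sample mean (green) and the sample median (purple)
abline(v = mean(Height), col = "green")
abline(v = median(Height), col = "purple")
legend("topleft", legend = c("Mean", "Median"), col = c("green", "purple"), lty = 1)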
Task: By comparing the sample mean and
sample median, describe the shape of the data.
Answer: The sample mean (173.3) is smaller than the
sample median (175). A mean below the median suggests a longer tail of
small values, so the data are left-skewed.
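The comparison used in this answer can be written as a small rule of thumb. The helper `skew_hint` below is hypothetical (it is not part of the worksheet or of base R), and the comparison is only a heuristic, not a formal test of skewness:

```r
# Hypothetical helper: compare mean and median to suggest a skew direction.
# This is only a rule of thumb, not a formal test.
skew_hint = function(x) {
  x = x[!is.na(x)]
  if (mean(x) < median(x)) {
    "left-skewed"
  } else if (mean(x) > median(x)) {
    "right-skewed"
  } else {
    "roughly symmetric"
  }
}

# Toy sample with a long left tail (assumed values)
skew_hint(c(140, 170, 175, 176, 178, 180))
```

In practice the histogram should be read alongside this comparison, since a small gap between mean and median may not reflect visible skew.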
The overall aim of this sub-question is to check your understanding of the box plot and summarizing the spread of a data set using interquartile range.
Task: Calculate the first quartile,
third quartile, and interquartile range (IQR) of the observed
height.
### write your code here
###
summary(Height) #this is to check if the below calculation is correct
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 144.0 167.0 175.0 173.3 181.0 198.0
# calculate the first quartile (25%), median (50%) and third quartile (75%)
quantile(Height, probs = c(.25, .5, .75), type = 1)
## 25% 50% 75%
## 167 175 181
#calculate interquartile range
IQR(Height)
## [1] 14
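The IQR is the third quartile minus the first quartile. The sketch below checks this on toy values (assumed, not the survey data). Note that the worksheet calls `quantile()` with `type = 1`, while `IQR()` and `summary()` use type 7 by default; for the height data the two types give the same quartiles (167 and 181), but they can differ on other data.

```r
# Toy heights (assumed values, not the survey data)
x = c(160, 165, 170, 175, 180, 185, 190)

# quantile() and IQR() both default to type = 7
q = quantile(x, probs = c(0.25, 0.75))
iqr_manual = unname(q[2] - q[1])
iqr_manual
IQR(x)
```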
Task: Make a box plot for the observed
height, and then use abline to indicate the location of the
sample median and the interquartile range on the box plot.
### write your code here
###
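boxplot(Height, ylab = "Height (cm)")
# sample median
abline(h = median(Height), col = "red")
# interquartile range: third quartile minus first quartile
iqr = quantile(Height)[4] - quantile(Height)[2]
# fences at 1.5 * IQR below the first quartile and above the third quartile
abline(h = quantile(Height)[2] - 1.5*iqr, col = "purple")
abline(h = quantile(Height)[4] + 1.5*iqr, col = "purple")
# sample minimum
abline(h = quantile(Height)[1], col ="green")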
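The lines drawn above mark the median, the 1.5 × IQR fences and the minimum. An alternative reading of the task is to mark the quartiles themselves, since the IQR is the distance between them. A minimal sketch of that reading on toy values (assumed, not the survey data):

```r
# Toy heights (assumed values, not the survey data)
x = c(150, 160, 165, 170, 175, 180, 185, 190, 200)

q = quantile(x, probs = c(0.25, 0.5, 0.75))

boxplot(x, ylab = "Height (cm)",
        main = "Box plot with median and quartiles marked")
# sample median
abline(h = q[2], col = "red")
# first and third quartiles; the IQR is the gap between these two lines
abline(h = q[c(1, 3)], col = "blue", lty = 2)
legend("topright", legend = c("Median", "Q1 and Q3"),
       col = c("red", "blue"), lty = c(1, 2))
```

`boxplot()` draws the box at the hinges from `fivenum()`, which can differ slightly from the `quantile()` quartiles on some data sets, so the dashed lines may not sit exactly on the box edges.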
Here our goal is to understand the height differences between biological sexes using the comparative box plot.
Task: Use tools such as the frequency
table to check how many categories are in the Gender
variable and how many data points are available in each of the
categories. Then, discuss what variables should be included in the
subsequent analysis and what variables should be excluded. You should
state the reason.
Put both your code and your discussion below.
Hint: the empty gender "" indicates missing
data.
### write your code here
###
# frequency table to check how many categories the 'Gender' variable has and how many data points fall in each
table(data$Gender)
##
##                              Female              Male Prefer not to say
##                 1               111               167                 2
# Clean the data by dropping the missing gender, the 'Prefer not to say' option, and rows with NA values
data1 = na.omit(data[data$Gender != "" & data$Gender != "Prefer not to say",])
Answer: The Gender variable has four categories: Female (111
responses), Male (167), “Prefer not to say” (2) and missing (1). Only
the Female and Male categories should be included in the subsequent
analysis. “Prefer not to say” does not identify the biological sex of
the respondent, and the missing entry carries no information, so
neither can be compared with the Female and Male groups. With only
three such responses in total, including them would add noise and risk
misinterpretation of the results. Of the variables, only Height and
Gender are needed to compare heights between the sexes.
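The selection described in this answer can be sketched on a toy data frame. The values below are assumptions for illustration, not the survey data, and `toy` and `kept` are hypothetical names:

```r
# Toy data frame (assumed values) mimicking the Gender column
toy = data.frame(
  Gender = c("Female", "Male", "", "Prefer not to say", "Male"),
  Height = c(160, 180, 170, 165, 175)
)

# Count every category, including the empty string
table(toy$Gender)

# Keep only the rows whose Gender is Female or Male
kept = toy[toy$Gender %in% c("Female", "Male"), ]
table(kept$Gender)
```

Using `%in%` lists the categories to keep, which is one alternative to excluding each unwanted category with `!=`.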
Task: Make a comparative box plot for
the observed height by splitting it by the variable Gender
(the recorded biological sex).
Hint: after data cleaning and selection, you need to make sure the height variable and the gender variable have the same size.
### write your code here
###
# Height and Gender come from the same cleaned data frame, so they must have the same size
stopifnot(length(data1$Height) == length(data1$Gender))
# Make sure all observations in the Height variable are positive
nHeight = abs(data1$Height)
# Comparative box plot
data1$Gender = factor(data1$Gender, levels = c("Female", "Male"))
Gender = data1$Gender
boxplot(nHeight ~ Gender, horizontal = TRUE, main = "Height differences between male and female students", xlab = "Height (cm)")
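One way to keep the two variables the same size is to filter whole rows of the data frame, so that each height stays paired with its own gender. A minimal sketch on toy data (assumed values; `toy`, `keep` and `toy_clean` are hypothetical names):

```r
# Toy data frame (assumed values, not the survey data)
toy = data.frame(
  Gender = c("Female", "Male", "Female", "Male", "Prefer not to say"),
  Height = c(160, 180, NA, 175, 170)
)

# Select rows, not individual vectors, so the pairing is preserved
keep = toy$Gender %in% c("Female", "Male") & !is.na(toy$Height)
toy_clean = toy[keep, ]
toy_clean$Gender = factor(toy_clean$Gender, levels = c("Female", "Male"))

# Both columns now have the same length by construction
length(toy_clean$Gender) == length(toy_clean$Height)

boxplot(Height ~ Gender, data = toy_clean, horizontal = TRUE,
        xlab = "Height (cm)", ylab = "Gender")
```

Truncating one vector to the length of the other would also equalise the sizes, but it can pair a height with the wrong gender, which is why row filtering is used here.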
Task: What does the comparative box
plot reveal about the height of students for biological males and
biological females?
Answer: The median height of male
students is noticeably higher than that of female students. The
interquartile range for female students is greater than that for male
students. The overall range of the data (the distance between the ends
of the two whiskers) is also greater for female students.
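Statements read off a comparative box plot can be cross-checked with numeric summaries per group. The sketch below uses `tapply()` on toy data (assumed values, not the survey data), so its numbers say nothing about the actual survey:

```r
# Toy data (assumed values, not the survey data)
toy = data.frame(
  Gender = c("Female", "Female", "Female", "Male", "Male", "Male"),
  Height = c(155, 162, 170, 172, 178, 185)
)

# Median, IQR and full range of height within each group
medians = tapply(toy$Height, toy$Gender, median)
iqrs    = tapply(toy$Height, toy$Gender, IQR)
ranges  = tapply(toy$Height, toy$Gender, function(h) diff(range(h)))

medians
iqrs
ranges
```

The full range computed here includes any outliers, whereas the whiskers of a box plot stop at the most extreme points within the 1.5 × IQR fences, so the two can differ.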
====END OF THE WORKSHEET====