Instruction

There are three questions. Each question contains multiple tasks. To receive a full mark in this part, you should correctly solve all tasks and add appropriate labels to your graphical summaries.

Do not modify the header of this file.

Format: All assignment tasks are marked by the prompt Task. You should enter your solution either as embedded R code below Task or as text after the prompt Answer. See examples in the Import data section.

Submission: Upon completion, you must render this worksheet (using Knit in R Studio) into an html file and submit the html file. Make sure the file extension “html” is in lower case.

Import data

In this assignment, we will use a data set collected from a survey for MATH1005 students at the start of 2022 Semester 2. There are a total of 928 students enrolled in this course.

Download the data file A1MATH1005.csv into the data folder inside your MATH1062 folder. This R Markdown file (Assignment1Worksheet.Rmd) should also be saved under your MATH1062 folder.

Then import the csv file into a variable called data:

### write your code here. Here is a sample solution 
data = read.csv("data/A1MATH1005.csv", header = T)
### the following displays the dimensions of the data
dim(data)

## [1] 281   8

If you save the data file and the worksheet correctly, you should be able to load the data file and see that the data contain 281 observations and 8 variables.

Task: How many survey responses are there? How many variables are there?

Answer: There are 281 survey responses and 8 variables.

====START OF ASSIGNMENT QUESTIONS====

1 Histogram

Task: Display the structure of data. Then make an appropriate histogram for the variable Height on the density scale. Here you can use the default number of class intervals.

Hint: you may need to convert the Height variable into the correct type in R to generate the histogram. You also need to perform data cleaning here to get a sensible histogram for this case.

### write your code here
# data structure 
str(data)

## 'data.frame':    281 obs. of  8 variables:
##  $ Gender       : chr  "Female" "Female" "Female" "Male" ...
##  $ International: chr  "Domestic" "Domestic" "International" "Domestic" ...
##  $ Major        : chr  "Physics, Chemistry" "Biomedical Engineering" "Statistics" "Transport" ...
##  $ Height       : chr  "154" "182" "156" "172" ...
##  $ ShoeSize     : chr  "9.5" "9 AUS" "US 9" "8" ...
##  $ Age          : int  18 22 18 29 18 19 19 18 18 20 ...
##  $ Country      : chr  "Australia" "Australia" "India" "Nepal" ...
##  $ Language     : chr  "English " "English" "Hindi" "Nepali" ...

# Height data 
unique(data$Height)

##  [1] "154"    "182"    "156"    "172"    "193"    "167"    ""       "183"   
##  [9] "150cm"  "181"    "188"    "164.5"  "176"    "178"    "160"    "158"   
## [17] "175"    "153"    "174"    "169"    "179"    "171 cm" "173"    "171"   
## [25] "170"    "165"    "168"    "180"    "184"    "148"    "162"    "186"   
## [33] "144"    "166"    "189cm"  "177"    "198"    "190"    "163"    "161"   
## [41] "191"    "150"    "185"    "164"    "187"    "189"    "145"    "151"   
## [49] "155"    "192"    "190.5"  "-175"   "161.5"  "185.4"

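# strip the "cm" suffix from the entries that carry a unit
data$Height <- ifelse(data$Height == "150cm", 150, data$Height)
# the raw value is "171 cm" (with a space), so this comparison matches nothing;
# that entry stays as text and becomes NA in the conversion below
data$Height <- ifelse(data$Height == "171cm", 171, data$Height)
data$Height <- ifelse(data$Height == "189cm", 189, data$Height)
# convert to numeric
numeric_data <- as.numeric(data$Height)

## Warning: NAs introduced by coercion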

# drop missing values (NA)
clear_data <- na.omit(numeric_data)
# drop impossible negative heights (e.g. -175)
positive_data <- clear_data[clear_data >= 0]
# distinct values after cleaning
unique(positive_data)

##  [1] 154.0 182.0 156.0 172.0 193.0 167.0 183.0 150.0 181.0 188.0 164.5 176.0
## [13] 178.0 160.0 158.0 175.0 153.0 174.0 169.0 179.0 173.0 171.0 170.0 165.0
## [25] 168.0 180.0 184.0 148.0 162.0 186.0 144.0 166.0 189.0 177.0 198.0 190.0
## [37] 163.0 161.0 191.0 185.0 164.0 187.0 145.0 151.0 155.0 192.0 190.5 161.5
## [49] 185.4

str(positive_data)

##  num [1:278] 154 182 156 172 193 167 183 150 181 183 ...
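
The exact-match replacements above only catch unit suffixes that are listed by hand. A more general cleaning step could be sketched as follows. The helper name clean_height is hypothetical (it is not part of the worksheet), and treating negative values as missing is an assumption about what counts as an invalid height.

```r
# Hypothetical helper (not part of the worksheet): convert raw height strings to numbers.
clean_height <- function(x) {
  # remove a trailing "cm" unit, with or without a space before it
  x <- trimws(sub("[[:space:]]*cm$", "", trimws(x), ignore.case = TRUE))
  # empty strings become NA
  h <- as.numeric(x)
  # assumption: a negative height is a data-entry error, so treat it as missing
  h[!is.na(h) & h < 0] <- NA
  h
}

# a few raw values taken from the unique() output above
raw <- c("154", "150cm", "171 cm", "", "-175", "164.5")
cleaned <- clean_height(raw)
cleaned
```

With this approach the vector keeps the same length as the original column, which makes it easier to pair heights with other variables such as Gender later on.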

# histogram on the density scale
hist(positive_data, freq = FALSE, main = "Height distribution", xlab = "Height (cm)", ylab = "Density")

2 Sample mean and sample median

Here our overall aim is to use the sample mean and median to determine the shape of the distribution of the observed height.

Task: Calculate the sample mean and sample median of the observed height.

Hint: you need to remove data entries with value NA, which are missing data. One possible way is to use the function is.na() to identify data entries with value NA.

### write your code here
# sample mean of the cleaned heights (missing and negative entries already removed)
mean(positive_data)

# sample median of the cleaned heights
median(positive_data)

Task: Replot the histogram and use abline to indicate the sample mean and sample median on the histogram.

### write your code here
# histogram on the density scale
hist(positive_data, freq = FALSE, main = "Height distribution", xlab = "Height (cm)", ylab = "Density")

# vertical line at the sample mean
abline(v = mean(positive_data), col = "red")

# vertical line at the sample median
abline(v = median(positive_data), col = "blue")

# legend; lty = 1 is needed so that the coloured lines appear in the key
legend("topright", legend = c("Mean", "Median"), col = c("red", "blue"), lty = 1)

Task: By comparing the sample mean and sample median, describe the shape of the data.

Answer: If the mean is greater than the median, the distribution is right-skewed (positively skewed). If the mean is less than the median, the distribution is left-skewed (negatively skewed). If the mean is approximately equal to the median, the distribution is roughly symmetric, as a normal distribution would be.
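
The comparison rule in the answer above can be sketched as a small function. The name describe_shape and the tolerance tol are hypothetical choices for illustration only; the worksheet does not prescribe a threshold for "approximately equal".

```r
# Hypothetical helper (not part of the worksheet): describe skewness from mean and median.
describe_shape <- function(x, tol = 0.01) {
  x <- x[!is.na(x)]
  m <- mean(x)
  med <- median(x)
  # tol is an illustrative threshold, relative to the spread of the data
  if (abs(m - med) <= tol * sd(x)) {
    "roughly symmetric"
  } else if (m > med) {
    "right-skewed"
  } else {
    "left-skewed"
  }
}

# made-up sample: one large value pulls the mean above the median
toy <- c(1, 2, 2, 3, 3, 3, 4, 20)
describe_shape(toy)
```

On the worksheet data, the same call could be made as describe_shape(positive_data) once the heights have been cleaned.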

3 Box plot

3.1 Sample median and interquartile range

The overall aim of this sub-question is to check your understanding of the box plot and summarizing the spread of a data set using interquartile range.

Task: Calculate the first quartile, third quartile, and interquartile range (IQR) of the observed height.

### write your code here
# interquartile range
IQR_positive_data <- IQR(positive_data, na.rm = TRUE)
print(paste("Interquartile range (IQR) of Height:", IQR_positive_data))

## [1] "Interquartile range (IQR) of Height: 14"

# first quartile
Q1 <- quantile(positive_data, probs = 0.25, na.rm = TRUE)
print(paste("First Quartile (Q1) of Height:", Q1))

## [1] "First Quartile (Q1) of Height: 167"

# third quartile
Q3 <- quantile(positive_data, probs = 0.75, na.rm = TRUE)
print(paste("Third Quartile (Q3) of Height:", Q3))

## [1] "Third Quartile (Q3) of Height: 181"
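
As a quick check of how the three quantities relate, the IQR is the difference between the third and first quartiles. The sketch below uses the two quartile values printed above; the 1.5 × IQR thresholds are the usual rule of thumb for flagging potential outliers and are added here only as an illustration. The variable names are made up for this example.

```r
# quartiles as printed above for the cleaned heights
q1 <- 167
q3 <- 181

# the interquartile range is the width of the middle 50% of the data
iqr <- q3 - q1
iqr  # 14

# rule-of-thumb thresholds: values beyond these are often treated as potential outliers
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr
c(lower_fence, upper_fence)  # 146 202
```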

Task: Make a box plot for the observed height, and then use abline to indicate the location of the sample median and the interquartile range on the box plot.

### write your code here
boxplot(positive_data, main = "Box Plot of Observed Height", ylab = "Height (cm)")

# median and quartiles for the annotations
median_height <- median(positive_data, na.rm = TRUE)
quartiles <- quantile(positive_data, probs = c(0.25, 0.75), na.rm = TRUE)

# horizontal line at the sample median
abline(h = median_height, col = "red")
# horizontal lines at Q1 and Q3; the distance between them is the IQR
abline(h = quartiles, col = "blue", lty = 2)

3.2 Comparative box plot

Here our goal is to understand the height differences between biological sexes using the comparative box plot.

Task: Use tools such as the frequency table to check how many categories are in the Gender variable and how many data points are available in each of the categories. Then, discuss what variables should be included in the subsequent analysis and what variables should be excluded. You should state the reason.

You need to put your code below and your discussion below as well.

Hint: the empty gender "" indicates missing data.

### write your code here
# frequency table of the Gender categories; the empty string "" marks missing data
gender_frequency <- table(data$Gender)
gender_frequency

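Answer: The subsequent analysis should include only the variables that are actually used, Height and Gender. Responses with a missing gender ("") or an invalid height should be excluded, because they cannot be assigned to a group or plotted.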

Task: Make a comparative box plot for the observed height by splitting it by the variable Gender (the recorded biological sex).

Hint: after data cleaning and selection, you need to make sure the height variable and the gender variable have the same size.

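### write your code here
# keep rows with a valid, non-negative height and a recorded gender of Female or Male,
# so that the height and gender vectors have the same length
keep <- !is.na(numeric_data) & numeric_data >= 0 & data$Gender %in% c("Female", "Male")
Gender <- factor(data$Gender[keep], levels = c("Female", "Male"))
boxplot(numeric_data[keep] ~ Gender, horizontal = TRUE, col = c("lightblue", "lightgreen"), main = "Height distribution by gender", xlab = "Height (cm)", ylab = "Gender")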

Task: What does the comparative box plot reveal about the height of students for biological males and biological females?

Answer: The comparative box plot reveals that biological males tend to be taller than biological females. The median, the quartiles and the interquartile range all differ between the two groups.

====END OF THE WORKSHEET====

MATH1062 (Part B) / MATH1005 Assignment 1

© University of Sydney MATH1062 (Statistics Part) / MATH1005

17 March 2024
