Instruction

There are several questions. Each question may contain multiple tasks. To receive a full mark in this part, you should correctly solve all tasks, justify your solution in the space provided in case necessary, and add appropriate labels to your graphical summaries.

Do NOT modify the header of this file. Do NOT delete or alter any text description from this file. Only work in the space provided.

Format: All assignment tasks have either a field for writing embedded R code, an answer field marked by the prompt Answer to Task x.x, or both. You should enter your solution either as embedded R code or as text after the prompt Answer to Task x.x.

Submission: Upon completion, you must render this worksheet (using Knit in R Studio) into an html file and submit the html file. Make sure the file extension “html” is in lower case. Your html file MUST contain all the R code you have written in the worksheet.

Task 0.0: The data story of Chick Weight

This is a sample task demonstrating how to answer assignment questions using this R Markdown worksheet. Please read this data story carefully. You do NOT need to answer the question in this task. Your tasks start at Question 1 below. 

In this assignment, we will use the ChickWeight data, which is a built-in data set in R. This dataset records the body weights of chicks at birth and every second day thereafter until day 20, with additional measurements taken on day 21. The chicks were divided into four groups based on different protein diets. We will focus on the following three variables for this assignment:

  • the variable weight contains the measured body weights of chicks in grams.
  • the variable Time records the number of days since birth when the measurement was made.
  • the variable Diet records the group of protein diet.

The variables weight and Time are numerical. The variable Diet is categorical. Note that the variable names and the dataframe name are case sensitive.

Write R code in the following code block to display the dimension of the data, variable names, and the first several rows of the data set. How many variables in this data set? What is the sample size? Write your comment after the Answer to Task 0.0 prompt provided below.

### Write your code here. The code is completed in Task 1.0 for demonstration.
dim(ChickWeight) # the dimension of the ChickWeight data set
## [1] 578   4
names(ChickWeight) # display variable names
## [1] "weight" "Time"   "Chick"  "Diet"
head(ChickWeight) # display the first several variables
##   weight Time Chick Diet
## 1     42    0     1    1
## 2     51    2     1    1
## 3     59    4     1    1
## 4     64    6     1    1
## 5     76    8     1    1
## 6     93   10     1    1

Answer to Task 0.0: (Write your answer here.) There are 4 variables. The sample size is 578.

====START OF ASSIGNMENT QUESTIONS====

1 Barplot, histogram, and skewness

Task 1.1

Produce a frequency table and a barplot for the variable Diet. Write your code in the following code block. Among all diet groups, which one has the highest frequency? Write your comment after the Answer to Task 1.1 prompt provided below.

### Code for Task 1.1. Write your code here
###

Answer to Task 1.1: (Write your answer here.)

Task 1.2

The overall aim of this task is to use sample mean and median to determine the shape of a data distribution.

In the following code block, create an appropriate histogram for the variable weight on the density scale. Here you can use the default number of class intervals. Calculate the sample mean and the sample median of the variable weight, and then use abline to indicate the locations of the sample median and the sample mean on the histogram.

Based on your findings, comment on the skewness of the variable weight and justify your answer. Write your answer after the Answer to Task 1.2 prompt provided below.

### Code for Task 1.2. Write your code here
###

Answer to Task 1.2: (Write your answer here.)

2 Boxplot, data selection, and outliers

Task 2.1

We want to understand the effectiveness of the Diet on the weight of the chicks using the comparative boxplot. Here we consider the weight of chicks after 18 days (including day 18) since birth. In the following code block, first select data points in the Diet and weight variables corresponding to 18 days and later (see the Time variable). Then, make a comparative boxplot for the selected data points from weight by splitting it by the corresponding Diet.

Based on the reported centres of the comparative boxplot, comment on which diet group is most effective on chick weight and justify your answer. Write your answer after the Answer to Task 2.1 prompt provided below.

### Code for Task 2.1. Write your code here
###

Answer to Task 2.1: (Write your answer here.)

Task 2.2

The rest of this question is to check your understanding of the boxplot, numerical summaries used for constructing a boxplot, and how to identify outliers. We will focus on the variable weight.

  • Calculate median of weight and the quartiles used for identifying the middle 50% of data points.
  • Make a boxplot (preferbaly a horizontal one).
  • Use abline to indicate the location of the sample median and the interquartile range on the boxplot.
### Code for Task 2.2.  Write your code here
###

Task 2.3

Are there any outliers? Write your answer after the Answer to Task 2.3 prompt provided below.

Answer to Task 2.3: (Write your answer here.)

In the following code block, write R code to count the number of outliers in this data set if there is any. Hint: you may need to calculate the location of the whiskers in the boxplot (the lower and upper thresholds in Tukey’s convention) first.

### Code for Task 2.3.  Write your code here
###

3 Normal curve

Task 3.1

We consider the population of chicks aged between 18 and 21 days (inclusive of day 18 and day 21) after birth. We want to apply the normal curve to estimate the proportion of chicks in this population weighing above 400 grams. The data selected in Task 2.1 will serve as the sample drawn from this population.

In the following code block, calculate the sample mean and sample standard deviation. Construct a normal curve using these values, and subsequently determine the proportion of chicks weighing above 400 grams. What percentage of chicks will weigh above 400 grams? Please also write your answer after the Answer to Task 3.1 prompt provided below, rounding your answer (in percentage) to two decimal places.

Hint: If you encountered difficulties selecting data in Task 2.1, you can use the sample mean \(206\) and the sample SD \(66\) instead. You will not be penalised for using these values in this question.

### Code for Task 3.1.  Write your code here
###

Answer to Task 3.1: (Write your answer here.)

Task 3.2

In the following code block, calculate the 30% percentile of the population of chick weights based on the normal curve constructed above. Please also provide your answer after the Answer to Task 3.2 prompt provided below, rounding your answer (in percentage) to two decimal places.

### Code for Task 3.2.  Write your code here
###

Answer to Task 3.2: (Write your answer here.)

Task 3.3

In our lectures, we learned about the distinction between the population standard deviation (SD) and the sample SD. Additionally, we learned that variance = SD\(^2\). R has built-in functions sd() and var() for computing the sample SD and the sample variance. Here we want to write our own R function to compute the population variance and apply it to the ChickWeight data set.

In the following, we provide the function definition for my_pop_var(X), where X is the input data. Complete this function so it can compute the population variance for the input data X.

### Code for Task 3.3. 
###
my_pop_var = function(X){
  ###  Write your code below for population variance
  
  
  
}

Task 3.4

Apply your function written above to compute the population variance of the variable weight in ChickWeight.

### Code for Task 3.4.  Write your code here
###

====END OF THE WORKSHEET====