There are several questions. Each question may contain multiple tasks. To receive full marks in this part, you should correctly solve all tasks, justify your solution in the space provided where necessary, and add appropriate labels to your graphical summaries.
Do NOT modify the header of this file. Do NOT delete or alter any text description from this file. Only work in the space provided.
Format: All assignment tasks have either a field for
writing embedded R code, an answer field marked by the prompt
Answer to Task x.x, or both. You should
enter your solution either as embedded R code or as text after the
prompt Answer to Task x.x.
Submission: Upon completion, you must render this
worksheet (using Knit in RStudio) into an html file and
submit the html file. Make sure the file extension “html” is in lower
case. Your html file MUST contain all the R code you
have written in the worksheet.
Task 0.0: The data story of Chick Weight
This is a sample task demonstrating how to answer assignment questions using this R Markdown worksheet. Please read this data story carefully. You do NOT need to answer the question in this task. Your tasks start at Question 1 below.
In this assignment, we will use the ChickWeight data,
which is a built-in data set in R. This dataset records the body weights
of chicks at birth and every second day thereafter until day 20, with
additional measurements taken on day 21. The chicks were divided into
four groups based on different protein diets. We will focus on the
following three variables for this assignment:
weight contains the measured body weights of chicks in grams.
Time records the number of days since birth when the measurement was made.
Diet records the group of protein diet.
The variables weight and Time are
numerical. The variable Diet is categorical. Note
that the variable names and the dataframe name are case
sensitive.
Write R code in the following code block to display the dimension of
the data, variable names, and the first several rows of the data set.
How many variables are in this data set? What is the sample size? Write your
comment after the Answer to Task 0.0
prompt provided below.
### Write your code here. The code is completed in Task 0.0 for demonstration.
dim(ChickWeight) # the dimension of the ChickWeight data set
## [1] 578 4
names(ChickWeight) # display variable names
## [1] "weight" "Time" "Chick" "Diet"
head(ChickWeight) # display the first several rows
##   weight Time Chick Diet
## 1     42    0     1    1
## 2     51    2     1    1
## 3     59    4     1    1
## 4     64    6     1    1
## 5     76    8     1    1
## 6     93   10     1    1
Answer to Task 0.0: (Write your answer
here.) There are 4 variables. The sample size is 578.
====START OF ASSIGNMENT QUESTIONS====
Task 1.1
Produce a frequency table and a barplot for the variable
Diet. Write your code in the following code block. Among
all diet groups, which one has the highest frequency? Write your comment
after the Answer to Task 1.1 prompt
provided below.
### Code for Task 1.1. Write your code here
###
Answer to Task 1.1: (Write your answer
here.)
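As a hedged illustration of the technique only, the sketch below builds a frequency table and a labelled barplot for a small made-up categorical variable. The vector `grp` and its levels are hypothetical and are not part of the ChickWeight data.

```r
# Hypothetical toy data: a categorical variable with three levels
grp <- factor(c("A", "B", "B", "C", "B", "A", "C", "B"))

# Frequency table: counts of each level
freq <- table(grp)
print(freq)

# Barplot of the counts, with axis labels and a title
barplot(freq, xlab = "Group", ylab = "Frequency",
        main = "Frequency of each group (toy data)")

# Name of the level with the highest frequency
top <- names(freq)[which.max(freq)]
print(top)
```

The same pattern should carry over to any factor column of a data frame, accessed with the `$` operator.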
Task 1.2
The overall aim of this task is to use the sample mean and median to determine the shape of a data distribution.
In the following code block, create an appropriate histogram for the
variable weight on the density scale. Here
you can use the default number of class intervals. Calculate the sample
mean and the sample median of the variable weight, and then
use abline to indicate the locations of the sample median
and the sample mean on the histogram.
Based on your findings, comment on the skewness of the variable
weight and justify your answer. Write your answer after the
Answer to Task 1.2 prompt provided
below.
### Code for Task 1.2. Write your code here
###
Answer to Task 1.2: (Write your answer
here.)
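The following sketch shows one way to draw a density-scale histogram and mark the mean and median with `abline`. It uses a small made-up vector `x` (hypothetical, not the assignment data); the colours and line types are arbitrary choices.

```r
# Hypothetical toy data with a long right tail
x <- c(1, 2, 2, 3, 3, 3, 4, 5, 8, 19)

m   <- mean(x)    # sample mean
med <- median(x)  # sample median

# freq = FALSE puts the histogram on the density scale
hist(x, freq = FALSE, xlab = "Value", main = "Histogram of toy data")

# Vertical lines marking the mean and the median
abline(v = m,   col = "red",  lwd = 2)
abline(v = med, col = "blue", lwd = 2, lty = 2)
legend("topright", legend = c("mean", "median"),
       col = c("red", "blue"), lty = c(1, 2), lwd = 2)
```

In this toy example the mean (5) lies to the right of the median (3), which is the pattern usually associated with a right-skewed distribution.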
Task 2.1
We want to understand the effectiveness of the Diet on
the weight of the chicks using the comparative boxplot.
Here we consider the weight of chicks after 18 days (including
day 18) since birth. In the following code block, first select
data points in the Diet and weight variables
corresponding to 18 days and later (see the Time variable).
Then, make a comparative boxplot for the selected data points from
weight by splitting it by the corresponding
Diet.
Based on the reported centres of the comparative
boxplot, comment on which diet group is most effective on chick weight
and justify your answer. Write your answer after the
Answer to Task 2.1 prompt provided
below.
### Code for Task 2.1. Write your code here
###
Answer to Task 2.1: (Write your answer
here.)
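A minimal sketch of the two steps described above (selecting rows by a condition, then drawing a comparative boxplot) is given here on a made-up data frame. The names `toy`, `value`, `day` and `group` are hypothetical and chosen only for illustration.

```r
# Hypothetical toy data frame: a measurement, a day, and a group label
toy <- data.frame(
  value = c(10, 12, 15, 20, 22, 30, 11, 14, 18, 25, 27, 35),
  day   = c(0, 5, 10, 15, 20, 21, 0, 5, 10, 15, 20, 21),
  group = factor(rep(c("A", "B"), each = 6))
)

# Keep only the rows measured on day 15 or later
late <- toy[toy$day >= 15, ]

# Comparative boxplot: one box per group, using the formula interface
boxplot(value ~ group, data = late,
        xlab = "Group", ylab = "Value",
        main = "Value by group, day 15 onwards (toy data)")

# Group medians, i.e. the centres drawn inside each box
med <- tapply(late$value, late$group, median)
print(med)
```

Comparing the medians printed by `tapply` with the centre lines of the boxes is a quick check that the subset and the plot agree.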
Task 2.2
The rest of this question is to check your understanding of the
boxplot, numerical summaries used for constructing a boxplot, and how to
identify outliers. We will focus on the variable
weight.
Calculate the sample median of weight and the quartiles used for
identifying the middle 50% of data points.
Use abline to indicate the location of the sample
median and the interquartile range on the boxplot.
### Code for Task 2.2. Write your code here
###
Task 2.3
Are there any outliers? Write your answer after the
Answer to Task 2.3 prompt provided
below.
Answer to Task 2.3: (Write your answer
here.)
In the following code block, write R code to count the number of outliers in this data set if there are any. Hint: you may need to calculate the location of the whiskers in the boxplot (the lower and upper thresholds in Tukey’s convention) first.
### Code for Task 2.3. Write your code here
###
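The hint above can be sketched as follows on a made-up vector `x` (hypothetical data). This assumes Tukey's thresholds are taken as Q1 - 1.5 × IQR and Q3 + 1.5 × IQR, with quartiles from R's default `quantile()`. Note that `boxplot()` itself uses the hinges from `fivenum()`, which can differ slightly from these quartiles.

```r
# Hypothetical toy data with one unusually large value
x <- c(1, 2, 2, 3, 3, 3, 4, 5, 8, 19)

# Quartiles and interquartile range (default quantile type)
q1  <- unname(quantile(x, 0.25))
q3  <- unname(quantile(x, 0.75))
iqr <- q3 - q1

# Lower and upper thresholds in Tukey's convention
lower <- q1 - 1.5 * iqr
upper <- q3 + 1.5 * iqr

# Count the points falling outside the thresholds
n_out <- sum(x < lower | x > upper)
print(c(lower = lower, upper = upper, outliers = n_out))
```

Summing a logical vector counts its `TRUE` entries, which is why `sum()` returns the number of outliers here.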
Task 3.1
We consider the population of chicks aged between 18 and 21 days
(inclusive of day 18 and day 21) after birth. We want to apply the
normal curve to estimate the proportion of chicks in this population
weighing above 400 grams. The data selected in
Task 2.1 will serve as the sample drawn
from this population.
In the following code block, calculate the sample mean and sample
standard deviation. Construct a normal curve using these values, and
subsequently determine the proportion of chicks weighing above 400
grams. What percentage of chicks will weigh above 400 grams? Please also
write your answer after the
Answer to Task 3.1 prompt provided below,
rounding your answer (in percentage) to two decimal places.
Hint: If you encountered difficulties selecting data
in Task 2.1, you can use the sample mean
\(206\) and the sample SD \(66\) instead. You will not be penalised for
using these values in this question.
### Code for Task 3.1. Write your code here
###
Answer to Task 3.1: (Write your answer
here.)
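As an illustration of the upper-tail calculation, the sketch below uses made-up values for the mean, SD and cut-off (`mu`, `sigma` and `cutoff` are hypothetical and deliberately not the assignment's values). It relies on `pnorm()` with `lower.tail = FALSE`.

```r
# Hypothetical normal curve: mean 100, SD 15, and a cut-off of 130
mu     <- 100
sigma  <- 15
cutoff <- 130

# Area under the normal curve to the right of the cut-off
p_above <- pnorm(cutoff, mean = mu, sd = sigma, lower.tail = FALSE)

# Express as a percentage rounded to two decimal places
pct <- round(100 * p_above, 2)
print(pct)  # 2.28
```

Equivalently, `1 - pnorm(cutoff, mu, sigma)` gives the same area; the `lower.tail` argument simply avoids the subtraction.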
Task 3.2
In the following code block, calculate the 30th percentile of the
population of chick weights based on the normal curve constructed above.
Please also provide your answer after the
Answer to Task 3.2 prompt provided below,
rounding your answer (in grams) to two decimal places.
### Code for Task 3.2. Write your code here
###
Answer to Task 3.2: (Write your answer
here.)
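A percentile of a normal curve can be obtained with `qnorm()`, the inverse of `pnorm()`. The sketch below uses made-up parameters (`mu` and `sigma` are hypothetical, not the assignment's values).

```r
# Hypothetical normal curve: mean 100, SD 15
mu    <- 100
sigma <- 15

# 30th percentile: the value with 30% of the area to its left
q30 <- qnorm(0.30, mean = mu, sd = sigma)
print(round(q30, 2))  # 92.13

# Sanity check: the area to the left of q30 should be 0.30
print(pnorm(q30, mean = mu, sd = sigma))
```

The result is on the same scale as the data, not a percentage.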
Task 3.3
In our lectures, we learned about the distinction between the
population standard deviation (SD) and the sample SD. Additionally, we
learned that variance = SD\(^2\). R has
built-in functions sd() and var() for
computing the sample SD and the sample variance. Here we want to write
our own R function to compute the population variance
and apply it to the ChickWeight data set.
In the following, we provide the function definition for
my_pop_var(X), where X is the input data.
Complete this function so it can compute the population variance for the
input data X.
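For reference, the standard definitions behind this distinction are written out below, assuming the data are \(x_1, \dots, x_n\) with mean \(\bar{x}\). The symbols \(\sigma^2_{\text{pop}}\) and \(s^2\) are notation chosen here and may differ from the lecture notes.

```latex
\sigma^2_{\text{pop}} = \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^2,
\qquad
s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2,
\qquad\text{so}\qquad
\sigma^2_{\text{pop}} = \frac{n-1}{n}\, s^2 .
```

The last identity relates the population variance to the sample variance that `var()` reports.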
### Code for Task 3.3.
###
my_pop_var = function(X){
### Write your code below for population variance
}
Task 3.4
Apply your function written above to compute the population variance
of the variable weight in ChickWeight.
### Code for Task 3.4. Write your code here
###
====END OF THE WORKSHEET====