There are several questions. Each question may contain multiple tasks. To receive a full mark in this part, you should correctly solve all tasks, justify your solution in the space provided in case necessary, and add appropriate labels to your graphical summaries.
Do NOT modify the header of this file. Do NOT delete or alter any text description from this file. Only work in the space provided.
Format: All assignment tasks have either a field for
writing embedded R code, an answer field marked by the prompt
Answer to Task x.x, or both. You should
enter your solution either as embedded R code or as text after the
prompt Answer to Task x.x.
Submission: Upon completion, you must render this
worksheet (using Knit in R Studio) into an html file and
submit the html file. Make sure the file extension “html” is in lower
case. Your html file MUST contain all the R code you
have written in the worksheet.
Task 0.0: The data
story of Chick WeightThis is a sample task demonstrating how to answer assignment questions using this R Markdown worksheet. Please read this data story carefully. You do NOT need to answer the question in this task. Your tasks start at Question 1 below.
In this assignment, we will use the ChickWeight data,
which is a built-in data set in R. This dataset records the body weights
of chicks at birth and every second day thereafter until day 20, with
additional measurements taken on day 21. The chicks were divided into
four groups based on different protein diets. We will focus on the
following three variables for this assignment:
weight contains the measured body weights
of chicks in grams.Time records the number of days since
birth when the measurement was made.Diet records the group of protein
diet.The variables weight and Time are
numerical. The variable Diet is categorical. Note
that the variable names and the dataframe name are case
sensitive.
Write R code in the following code block to display the dimension of
the data, variable names, and the first several rows of the data set.
How many variables in this data set? What is the sample size? Write your
comment after the Answer to Task 0.0
prompt provided below.
### Write your code here. The code is completed in Task 1.0 for demonstration.
dim(ChickWeight) # the dimension of the ChickWeight data set
## [1] 578 4
names(ChickWeight) # display variable names
## [1] "weight" "Time" "Chick" "Diet"
head(ChickWeight) # display the first several variables
## weight Time Chick Diet
## 1 42 0 1 1
## 2 51 2 1 1
## 3 59 4 1 1
## 4 64 6 1 1
## 5 76 8 1 1
## 6 93 10 1 1
Answer to Task 0.0: (Write your answer
here.) There are 4 variables. The sample size is 578.
====START OF ASSIGNMENT QUESTIONS====
Task 1.1Produce a frequency table and a barplot for the variable
Diet. Write your code in the following code block. Among
all diet groups, which one has the highest frequency? Write your comment
after the Answer to Task 1.1 prompt
provided below.
### Code for Task 1.1. Write your code here
###
data("ChickWeight")
diet_freq <- table(ChickWeight$Diet) #frequency
print(diet_freq)
##
## 1 2 3 4
## 220 120 120 118
barplot(diet_freq, main = "frequency", xlab = "groups", ylab = "frequency") #barlot
Answer to Task 1.1: (Write your answer
here.) Group 1 has the highest frequency.
Task 1.2The overall aim of this task is to use sample mean and median to determine the shape of a data distribution.
In the following code block, create an appropriate histogram for the
variable weight on the density scale. Here
you can use the default number of class intervals. Calculate the sample
mean and the sample median of the variable weight, and then
use abline to indicate the locations of the sample median
and the sample mean on the histogram.
Based on your findings, comment on the skewness of the variable
weight and justify your answer. Write your answer after the
Answer to Task 1.2 prompt provided
below.
### Code for Task 1.2. Write your code here
###
weight <- ChickWeight$weight
hist(weight, main = "Weight Distribution", xlab = "Weight", col = "lightblue", breaks = "Sturges", prob = TRUE) #histogram
mean_weight <- mean(weight) #mean
median_weight <- median(weight) #median
abline(v = mean_weight, col = "red", lwd = 2, lty = 2) # red shows mean
abline(v = median_weight, col = "blue", lwd = 2, lty = 2) # blue shows median
legend("topright", legend = c(paste("Mean =", round(mean_weight, 2)),
paste("Median =", round(median_weight, 2))),
col = c("red", "blue"), lty = 2, lwd = 2) #locations
Answer to Task 1.2: (Write your answer
here.) Mean equals to 121.82 and median equals to 103.
Task 2.1We want to understand the effectiveness of the Diet on
the weight of the chicks using the comparative boxplot.
Here we consider the weight of chicks after 18 days (including
day 18) since birth. In the following code block, first select
data points in the Diet and weight variables
corresponding to 18 days and later (see the Time variable).
Then, make a comparative boxplot for the selected data points from
weight by splitting it by the corresponding
Diet.
Based on the reported centres of the comparative
boxplot, comment on which diet group is most effective on chick weight
and justify your answer. Write your answer after the
Answer to Task 2.1 prompt provided
below.
### Code for Task 2.1. Write your code here
###
subset_data <- ChickWeight[ChickWeight$Time >= 18, ]
# boxplot
boxplot(weight ~ Diet, data = subset_data,
main = "Comparison of Chick Weights by Diet at Day 18 and Beyond",
xlab = "Groups",
ylab = "Weight",
col = "lightblue",
border = "black")
legend("topright", legend = levels(subset_data$Diet),
fill = "lightblue", border = "black")
#average
mean_weights <- tapply(subset_data$weight, subset_data$Diet, mean)
print(mean_weights)
## 1 2 3 4
## 168.8600 202.6667 254.1000 224.3214
Answer to Task 2.1: (Write your answer
here.) Group 1 is most effective.
Task 2.2The rest of this question is to check your understanding of the
boxplot, numerical summaries used for constructing a boxplot, and how to
identify outliers. We will focus on all data entries in
the variable weight.
weight and the quartiles used for
identifying the middle 50% of data points.abline to indicate the location of the sample
median and the interquartile range on the boxplot.### Code for Task 2.2. Write your code here
###
weight <- ChickWeight$weight
# caculate
median_weight <- median(weight)
Q1 <- quantile(weight, 0.25)
Q3 <- quantile(weight, 0.75)
IQR <- Q3 - Q1
# print
print(paste("median: ", median_weight))
## [1] "median: 103"
print(paste("first quartiles(Q1): ", Q1))
## [1] "first quartiles(Q1): 63"
print(paste("third quartiles(Q3): ", Q3))
## [1] "third quartiles(Q3): 163.75"
print(paste("distance (IQR): ", IQR))
## [1] "distance (IQR): 100.75"
#boxplot
boxplot(weight, horizontal = TRUE,
main = "Boxplot of Chick Weights",
xlab = "Weight",
col = "lightblue",
border = "black")
# add lines
abline(v = median_weight, col = "red", lwd = 2, lty = 2)
abline(v = Q1, col = "blue", lwd = 2, lty = 2)
abline(v = Q3, col = "green", lwd = 2, lty = 2)
#show picture
legend("topright", legend = c("Median", "Q1", "Q3"),
col = c("red", "blue", "green"), lty = 2, lwd = 2)
Task 2.3Are there any outliers? Write your answer after the
Answer to Task 2.3 prompt provided
below.
Answer to Task 2.3: (Write your answer
here.) There is an outlier in group 4.
In the following code block, write R code to count the number of outliers in this data set if there is any. Hint: you may need to calculate the location of the whiskers in the boxplot (the lower and upper thresholds in Tukey’s convention) first.
### Code for Task 2.3. Write your code here
###
subset_data <- ChickWeight[ChickWeight$Time >= 18, ]
outliers <- data.frame()
# caculate outliers in each group
for (diet in unique(subset_data$Diet)) {
diet_data <- subset(subset_data, Diet == diet)$weight
Q1 <- quantile(diet_data, 0.25)
Q3 <- quantile(diet_data, 0.75)
IQR <- Q3 - Q1
lower_bound <- Q1 - 1.5 * IQR
upper_bound <- Q3 + 1.5 * IQR
diet_outliers <- subset(subset_data, Diet == diet & (weight < lower_bound | weight > upper_bound))
outliers <- rbind(outliers, diet_outliers)
}
print(outliers)
## weight Time Chick Diet
## 554 322 21 48 4
#boxplot
boxplot(weight ~ Diet, data = subset_data,
main = "Chick Weights by Diet",
xlab = "Group",
ylab = "Weight",
col = "lightblue")
if (nrow(outliers) > 0) {
points(as.numeric(outliers$Diet), outliers$weight, col = "red", pch = 19) #red for outliers
}
Task 3.1We consider the population of chicks aged between 18 and 21 days
(inclusive of day 18 and day 21) after birth. We want to apply the
normal curve to estimate the proportion of chicks in this population
weighing above 400 grams. The data selected in
Task 2.1 will serve as the sample drawn
from this population.
In the following code block, calculate the sample mean and sample
standard deviation. Construct a normal curve using these values, and
subsequently determine the proportion of chicks weighing above 400
grams. What percentage of chicks will weigh above 400 grams? Please also
write your answer after the
Answer to Task 3.1 prompt provided below,
rounding your answer (in percentage) to two decimal places.
Hint: If you encountered difficulties selecting data
in Task 2.1, you can use the sample mean
\(206\) and the sample SD \(66\) instead. You will not be penalised for
using these values in this question.
### Code for Task 3.1. Write your code here
###
subset_data <- ChickWeight[ChickWeight$Time >= 18 & ChickWeight$Time <= 21, ]
# Calculate the sample mean and standard deviation
mean_weight <- mean(subset_data$weight)
sd_weight <- sd(subset_data$weight)
print(paste("Sample Mean: ", mean_weight))
## [1] "Sample Mean: 205.992753623188"
print(paste("Sample Standard Deviation: ", sd_weight))
## [1] "Sample Standard Deviation: 65.9148418071712"
x_values <- seq(mean_weight - 4 * sd_weight, mean_weight + 4 * sd_weight, length.out = 100)
# Calculate
y_values <- dnorm(x_values, mean = mean_weight, sd = sd_weight)
# Plot the normal curve
plot(x_values, y_values, type = "l", main = "Normal Distribution of Chick Weights",
xlab = "Weight (grams)", ylab = "Density", col = "blue")
# Add a vertical line
abline(v = 400, col = "red", lwd = 2, lty = 2)
# Calculate the proportion of chicks weighing more than 400 grams
probability_more_than_400 <- 1 - pnorm(400, mean = mean_weight, sd = sd_weight)
rounded_probability <- round(probability_more_than_400 * 100, 2)
print(paste("Proportion of chicks weighing more than 400 grams: ", rounded_probability, "%"))
## [1] "Proportion of chicks weighing more than 400 grams: 0.16 %"
# Highlight
x_fill <- seq(400, mean_weight + 4 * sd_weight, length.out = 100)
y_fill <- dnorm(x_fill, mean = mean_weight, sd = sd_weight)
polygon(c(400, x_fill, mean_weight + 4 * sd_weight), c(0, y_fill, 0), col = "lightblue", border = NA)
Answer to Task 3.1: (Write your answer
here.) There will be 0.16% chickens weigh above 400 grams.
Task 3.2In the following code block, calculate the 30% percentile of the
population of chick weights based on the normal curve constructed above.
Please also provide your answer after the
Answer to Task 3.2 prompt provided below,
rounding your answer to two decimal places.
### Code for Task 3.2. Write your code here
###
subset_data <- ChickWeight[ChickWeight$Time >= 18 & ChickWeight$Time <= 21, ]
# Calculate
mean_weight <- mean(subset_data$weight)
sd_weight <- sd(subset_data$weight)
# Calculate the 30th percentile using qnorm
percentile_30 <- qnorm(0.30, mean = mean_weight, sd = sd_weight)
# Round the result
rounded_percentile_30 <- round(percentile_30, 2)
print(paste("30th percentile of chick weights: ", rounded_percentile_30, " grams"))
## [1] "30th percentile of chick weights: 171.43 grams"
Answer to Task 3.2: (Write your answer
here.) 30% percentile of the population of chick weights is 171.43g.
Task 3.3In our lectures, we learned about the distinction between the
population standard deviation (SD) and the sample SD. Additionally, we
learned that variance = SD\(^2\). R has
built-in functions sd() and var() for
computing the sample SD and the sample variance. Here we want to write
our own R function to compute the population variance
and apply it to the ChickWeight data set.
In the following, we provide the function definition for
my_pop_var(X), where X is the input data.
Complete this function so it can compute the population variance for the
input data X.
### Code for Task 3.3.
###
my_pop_var = function(X){
### Write your code below for population variance
mu <- mean(X)
# Calculate the sum
squared_deviations <- (X - mu)^2
sum_squared_deviations <- sum(squared_deviations)
# Calculate the population variance
population_variance <- sum_squared_deviations / length(X)
return(population_variance)
}
# Test the function
data("ChickWeight")
weight_data <- ChickWeight$weight
# Apply the custom function to calculate the population variance of weights
population_variance_weight <- my_pop_var(weight_data)
print(paste("Population Variance of chick weights: ", population_variance_weight))
## [1] "Population Variance of chick weights: 5042.4843003556"
Task 3.4Apply your function written above to compute the population variance
of the variable weight in ChickWeight.
### Code for Task 3.4. Write your code here
###
my_pop_var = function(X) {
mu <- mean(X)
# Calculate the sum
squared_deviations <- (X - mu)^2
sum_squared_deviations <- sum(squared_deviations)
# Calculate the population variance
population_variance <- sum_squared_deviations / length(X)
return(population_variance)
}
data("ChickWeight")
# Extract the weight data
weight_data <- ChickWeight$weight
# Apply the custom function to calculate the population variance of weights
population_variance_weight <- my_pop_var(weight_data)
print(paste("Population Variance of chick weights: ", population_variance_weight))
## [1] "Population Variance of chick weights: 5042.4843003556"
====END OF THE WORKSHEET====