Instruction

There are several questions. Each question may contain multiple tasks. To receive a full mark in this part, you should correctly solve all tasks, justify your solution in the space provided in case necessary, and add appropriate labels to your graphical summaries.

Do NOT modify the header of this file. Do NOT delete or alter any text description from this file. Only work in the space provided.

Format: All assignment tasks have either a field for writing embedded R code, an answer field marked by the prompt Answer to Task x.x, or both. You should enter your solution either as embedded R code or as text after the prompt Answer to Task x.x.

Submission: Upon completion, you must render this worksheet (using Knit in R Studio) into an html file and submit the html file. Make sure the file extension “html” is in lower case. Your html file MUST contain all the R code you have written in the worksheet.

Task 0.0: The data story of Chick Weight

This is a sample task demonstrating how to answer assignment questions using this R Markdown worksheet. Please read this data story carefully. You do NOT need to answer the question in this task. Your tasks start at Question 1 below. 

In this assignment, we will use the ChickWeight data, which is a built-in data set in R. This dataset records the body weights of chicks at birth and every second day thereafter until day 20, with additional measurements taken on day 21. The chicks were divided into four groups based on different protein diets. We will focus on the following three variables for this assignment:

  • the variable weight contains the measured body weights of chicks in grams.
  • the variable Time records the number of days since birth when the measurement was made.
  • the variable Diet records the group of protein diet.

The variables weight and Time are numerical. The variable Diet is categorical. Note that the variable names and the dataframe name are case sensitive.

Write R code in the following code block to display the dimension of the data, variable names, and the first several rows of the data set. How many variables in this data set? What is the sample size? Write your comment after the Answer to Task 0.0 prompt provided below.

### Write your code here. The code is completed in Task 1.0 for demonstration.
dim(ChickWeight) # the dimension of the ChickWeight data set
## [1] 578   4
names(ChickWeight) # display variable names
## [1] "weight" "Time"   "Chick"  "Diet"
head(ChickWeight) # display the first several variables
##   weight Time Chick Diet
## 1     42    0     1    1
## 2     51    2     1    1
## 3     59    4     1    1
## 4     64    6     1    1
## 5     76    8     1    1
## 6     93   10     1    1

Answer to Task 0.0: (Write your answer here.) There are 4 variables. The sample size is 578.

====START OF ASSIGNMENT QUESTIONS====

1 Barplot, histogram, and skewness

Task 1.1

Produce a frequency table and a barplot for the variable Diet. Write your code in the following code block. Among all diet groups, which one has the highest frequency? Write your comment after the Answer to Task 1.1 prompt provided below.

### Code for Task 1.1. Write your code here
###
data("ChickWeight")

diet_freq <- table(ChickWeight$Diet) #frequency
print(diet_freq)
## 
##   1   2   3   4 
## 220 120 120 118
barplot(diet_freq, main = "frequency", xlab = "groups", ylab = "frequency") #barlot

Answer to Task 1.1: (Write your answer here.) Group 1 has the highest frequency.

Task 1.2

The overall aim of this task is to use sample mean and median to determine the shape of a data distribution.

In the following code block, create an appropriate histogram for the variable weight on the density scale. Here you can use the default number of class intervals. Calculate the sample mean and the sample median of the variable weight, and then use abline to indicate the locations of the sample median and the sample mean on the histogram.

Based on your findings, comment on the skewness of the variable weight and justify your answer. Write your answer after the Answer to Task 1.2 prompt provided below.

### Code for Task 1.2. Write your code here
###
weight <- ChickWeight$weight


hist(weight, main = "Weight Distribution", xlab = "Weight", col = "lightblue", breaks = "Sturges", prob = TRUE) #histogram

mean_weight <- mean(weight) #mean
median_weight <- median(weight) #median


abline(v = mean_weight, col = "red", lwd = 2, lty = 2)  # red shows mean
abline(v = median_weight, col = "blue", lwd = 2, lty = 2)  # blue shows median

legend("topright", legend = c(paste("Mean =", round(mean_weight, 2)), 
                              paste("Median =", round(median_weight, 2))),
       col = c("red", "blue"), lty = 2, lwd = 2) #locations

Answer to Task 1.2: (Write your answer here.) Mean equals to 121.82 and median equals to 103.

2 Boxplot, data selection, and outliers

Task 2.1

We want to understand the effectiveness of the Diet on the weight of the chicks using the comparative boxplot. Here we consider the weight of chicks after 18 days (including day 18) since birth. In the following code block, first select data points in the Diet and weight variables corresponding to 18 days and later (see the Time variable). Then, make a comparative boxplot for the selected data points from weight by splitting it by the corresponding Diet.

Based on the reported centres of the comparative boxplot, comment on which diet group is most effective on chick weight and justify your answer. Write your answer after the Answer to Task 2.1 prompt provided below.

### Code for Task 2.1. Write your code here
###
subset_data <- ChickWeight[ChickWeight$Time >= 18, ]

# boxplot
boxplot(weight ~ Diet, data = subset_data,
        main = "Comparison of Chick Weights by Diet at Day 18 and Beyond",
        xlab = "Groups",
        ylab = "Weight",
        col = "lightblue",
        border = "black")

legend("topright", legend = levels(subset_data$Diet),
       fill = "lightblue", border = "black")

#average
mean_weights <- tapply(subset_data$weight, subset_data$Diet, mean)
print(mean_weights)
##        1        2        3        4 
## 168.8600 202.6667 254.1000 224.3214

Answer to Task 2.1: (Write your answer here.) Group 1 is most effective.

Task 2.2

The rest of this question is to check your understanding of the boxplot, numerical summaries used for constructing a boxplot, and how to identify outliers. We will focus on all data entries in the variable weight.

  • Calculate median of weight and the quartiles used for identifying the middle 50% of data points.
  • Make a boxplot (preferbaly a horizontal one).
  • Use abline to indicate the location of the sample median and the interquartile range on the boxplot.
### Code for Task 2.2.  Write your code here
###
weight <- ChickWeight$weight

# caculate
median_weight <- median(weight)
Q1 <- quantile(weight, 0.25)
Q3 <- quantile(weight, 0.75)
IQR <- Q3 - Q1

# print
print(paste("median: ", median_weight))
## [1] "median:  103"
print(paste("first quartiles(Q1): ", Q1))
## [1] "first quartiles(Q1):  63"
print(paste("third quartiles(Q3): ", Q3))
## [1] "third quartiles(Q3):  163.75"
print(paste("distance (IQR): ", IQR))
## [1] "distance (IQR):  100.75"
#boxplot
boxplot(weight, horizontal = TRUE,
        main = "Boxplot of Chick Weights",
        xlab = "Weight",
        col = "lightblue",
        border = "black")

# add lines
abline(v = median_weight, col = "red", lwd = 2, lty = 2)
abline(v = Q1, col = "blue", lwd = 2, lty = 2)
abline(v = Q3, col = "green", lwd = 2, lty = 2)

#show picture
legend("topright", legend = c("Median", "Q1", "Q3"),
       col = c("red", "blue", "green"), lty = 2, lwd = 2)

Task 2.3

Are there any outliers? Write your answer after the Answer to Task 2.3 prompt provided below.

Answer to Task 2.3: (Write your answer here.) There is an outlier in group 4.

In the following code block, write R code to count the number of outliers in this data set if there is any. Hint: you may need to calculate the location of the whiskers in the boxplot (the lower and upper thresholds in Tukey’s convention) first.

### Code for Task 2.3.  Write your code here
###
subset_data <- ChickWeight[ChickWeight$Time >= 18, ]

outliers <- data.frame()

# caculate outliers in each group
for (diet in unique(subset_data$Diet)) {
  diet_data <- subset(subset_data, Diet == diet)$weight
  Q1 <- quantile(diet_data, 0.25)
  Q3 <- quantile(diet_data, 0.75)
  IQR <- Q3 - Q1
  lower_bound <- Q1 - 1.5 * IQR
  upper_bound <- Q3 + 1.5 * IQR
  
  diet_outliers <- subset(subset_data, Diet == diet & (weight < lower_bound | weight > upper_bound))
  outliers <- rbind(outliers, diet_outliers)
}
print(outliers)
##     weight Time Chick Diet
## 554    322   21    48    4
#boxplot
boxplot(weight ~ Diet, data = subset_data,
        main = "Chick Weights by Diet",
        xlab = "Group",
        ylab = "Weight",
        col = "lightblue")
if (nrow(outliers) > 0) {
  points(as.numeric(outliers$Diet), outliers$weight, col = "red", pch = 19)  #red for outliers
}

3 Normal curve

Task 3.1

We consider the population of chicks aged between 18 and 21 days (inclusive of day 18 and day 21) after birth. We want to apply the normal curve to estimate the proportion of chicks in this population weighing above 400 grams. The data selected in Task 2.1 will serve as the sample drawn from this population.

In the following code block, calculate the sample mean and sample standard deviation. Construct a normal curve using these values, and subsequently determine the proportion of chicks weighing above 400 grams. What percentage of chicks will weigh above 400 grams? Please also write your answer after the Answer to Task 3.1 prompt provided below, rounding your answer (in percentage) to two decimal places.

Hint: If you encountered difficulties selecting data in Task 2.1, you can use the sample mean \(206\) and the sample SD \(66\) instead. You will not be penalised for using these values in this question.

### Code for Task 3.1.  Write your code here
###
subset_data <- ChickWeight[ChickWeight$Time >= 18 & ChickWeight$Time <= 21, ]

# Calculate the sample mean and standard deviation
mean_weight <- mean(subset_data$weight)
sd_weight <- sd(subset_data$weight)
print(paste("Sample Mean: ", mean_weight))
## [1] "Sample Mean:  205.992753623188"
print(paste("Sample Standard Deviation: ", sd_weight))
## [1] "Sample Standard Deviation:  65.9148418071712"
x_values <- seq(mean_weight - 4 * sd_weight, mean_weight + 4 * sd_weight, length.out = 100)

# Calculate
y_values <- dnorm(x_values, mean = mean_weight, sd = sd_weight)

# Plot the normal curve
plot(x_values, y_values, type = "l", main = "Normal Distribution of Chick Weights",
     xlab = "Weight (grams)", ylab = "Density", col = "blue")

# Add a vertical line
abline(v = 400, col = "red", lwd = 2, lty = 2)

# Calculate the proportion of chicks weighing more than 400 grams
probability_more_than_400 <- 1 - pnorm(400, mean = mean_weight, sd = sd_weight)
rounded_probability <- round(probability_more_than_400 * 100, 2)
print(paste("Proportion of chicks weighing more than 400 grams: ", rounded_probability, "%"))
## [1] "Proportion of chicks weighing more than 400 grams:  0.16 %"
# Highlight
x_fill <- seq(400, mean_weight + 4 * sd_weight, length.out = 100)
y_fill <- dnorm(x_fill, mean = mean_weight, sd = sd_weight)
polygon(c(400, x_fill, mean_weight + 4 * sd_weight), c(0, y_fill, 0), col = "lightblue", border = NA)

Answer to Task 3.1: (Write your answer here.) There will be 0.16% chickens weigh above 400 grams.

Task 3.2

In the following code block, calculate the 30% percentile of the population of chick weights based on the normal curve constructed above. Please also provide your answer after the Answer to Task 3.2 prompt provided below, rounding your answer to two decimal places.

### Code for Task 3.2.  Write your code here
###
subset_data <- ChickWeight[ChickWeight$Time >= 18 & ChickWeight$Time <= 21, ]

# Calculate
mean_weight <- mean(subset_data$weight)
sd_weight <- sd(subset_data$weight)

# Calculate the 30th percentile using qnorm
percentile_30 <- qnorm(0.30, mean = mean_weight, sd = sd_weight)

# Round the result
rounded_percentile_30 <- round(percentile_30, 2)
print(paste("30th percentile of chick weights: ", rounded_percentile_30, " grams"))
## [1] "30th percentile of chick weights:  171.43  grams"

Answer to Task 3.2: (Write your answer here.) 30% percentile of the population of chick weights is 171.43g.

Task 3.3

In our lectures, we learned about the distinction between the population standard deviation (SD) and the sample SD. Additionally, we learned that variance = SD\(^2\). R has built-in functions sd() and var() for computing the sample SD and the sample variance. Here we want to write our own R function to compute the population variance and apply it to the ChickWeight data set.

In the following, we provide the function definition for my_pop_var(X), where X is the input data. Complete this function so it can compute the population variance for the input data X.

### Code for Task 3.3. 
###
my_pop_var = function(X){
  ###  Write your code below for population variance
  mu <- mean(X)
  
  # Calculate the sum
  squared_deviations <- (X - mu)^2
  sum_squared_deviations <- sum(squared_deviations)
  
  # Calculate the population variance
  population_variance <- sum_squared_deviations / length(X)
  return(population_variance)
}

# Test the function
data("ChickWeight")
weight_data <- ChickWeight$weight

# Apply the custom function to calculate the population variance of weights
population_variance_weight <- my_pop_var(weight_data)
print(paste("Population Variance of chick weights: ", population_variance_weight))
## [1] "Population Variance of chick weights:  5042.4843003556"

Task 3.4

Apply your function written above to compute the population variance of the variable weight in ChickWeight.

### Code for Task 3.4.  Write your code here
###
my_pop_var = function(X) {
  mu <- mean(X)
  
  # Calculate the sum
  squared_deviations <- (X - mu)^2
  sum_squared_deviations <- sum(squared_deviations)
  
  # Calculate the population variance
  population_variance <- sum_squared_deviations / length(X)
  return(population_variance)
}
data("ChickWeight")

# Extract the weight data
weight_data <- ChickWeight$weight

# Apply the custom function to calculate the population variance of weights
population_variance_weight <- my_pop_var(weight_data)
print(paste("Population Variance of chick weights: ", population_variance_weight))
## [1] "Population Variance of chick weights:  5042.4843003556"

====END OF THE WORKSHEET====