Instruction

There are several questions. Each question may contain multiple tasks. To receive a full mark in this part, you should correctly solve all tasks, justify your solution in the space provided in case necessary, and add appropriate labels to your graphical summaries.

Do NOT modify the header of this file. Do NOT delete or alter any text description from this file. Only work in the space provided.

Format: All assignment tasks have either a field for writing embedded R code, an answer field marked by the prompt Answer to Task x.x, or both. You should enter your solution either as embedded R code or as text after the prompt Answer to Task x.x.

Submission: Upon completion, you must render this worksheet (using Knit in R Studio) into an html file and submit the html file. Make sure the file extension “html” is in lower case. Your html file MUST contain all the R code you have written in the worksheet.

Task 0.0: The data story of Chick Weight

This is a sample task demonstrating how to answer assignment questions using this R Markdown worksheet. Please read this data story carefully. You do NOT need to answer the question in this task. Your tasks start at Question 1 below. 

In this assignment, we will use the ChickWeight data, which is a built-in data set in R. This dataset records the body weights of chicks at birth and every second day thereafter until day 20, with additional measurements taken on day 21. The chicks were divided into four groups based on different protein diets. We will focus on the following three variables for this assignment:

  • the variable weight contains the measured body weights of chicks in grams.
  • the variable Time records the number of days since birth when the measurement was made.
  • the variable Diet records the group of protein diet.

The variables weight and Time are numerical. The variable Diet is categorical. Note that the variable names and the dataframe name are case sensitive.

Write R code in the following code block to display the dimension of the data, variable names, and the first several rows of the data set. How many variables are in this data set? What is the sample size? Write your comment after the Answer to Task 0.0 prompt provided below.

### Write your code here. The code is completed in Task 0.0 for demonstration.
dim(ChickWeight) # the dimension of the ChickWeight data set
## [1] 578   4
names(ChickWeight) # display variable names
## [1] "weight" "Time"   "Chick"  "Diet"
head(ChickWeight) # display the first several rows
##   weight Time Chick Diet
## 1     42    0     1    1
## 2     51    2     1    1
## 3     59    4     1    1
## 4     64    6     1    1
## 5     76    8     1    1
## 6     93   10     1    1

Answer to Task 0.0: (Write your answer here.) There are 4 variables. The sample size is 578.

====START OF ASSIGNMENT QUESTIONS====

1 Barplot, histogram, and skewness

Task 1.1

Produce a frequency table and a barplot for the variable Diet. Write your code in the following code block. Among all diet groups, which one has the highest frequency? Write your comment after the Answer to Task 1.1 prompt provided below.

### Code for Task 1.1. Write your code here
###

dietofchicks = ChickWeight$Diet      
frequency_table <- table(dietofchicks)    #creating frequency table
frequency_table
## dietofchicks
##   1   2   3   4 
## 220 120 120 118
barplot(frequency_table, main="Barplot of Diet Frequency", xlab="Diet", ylab="Frequency")  #creating barplot with title, x-axis and y-axis label

Answer to Task 1.1: (Write your answer here.) Diet 1 has the highest frequency, with 220 observations.
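
The frequency table counts rows (measurements), not individual chicks, because each chick is weighed on several days. A minimal sketch of how one might count distinct chicks per diet instead is given here; `chicks_per_diet` is a name introduced for illustration and is not part of the worksheet.

```r
# Count distinct chicks in each diet group (each chick appears on many rows).
# `chicks_per_diet` is an illustrative name, not part of the worksheet.
chicks_per_diet <- tapply(ChickWeight$Chick, ChickWeight$Diet,
                          function(x) length(unique(x)))
chicks_per_diet
```

Comparing this with the frequency table shows how many repeated measurements each group contributes.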

Task 1.2

The overall aim of this task is to use sample mean and median to determine the shape of a data distribution.

In the following code block, create an appropriate histogram for the variable weight on the density scale. Here you can use the default number of class intervals. Calculate the sample mean and the sample median of the variable weight, and then use abline to indicate the locations of the sample median and the sample mean on the histogram.

Based on your findings, comment on the skewness of the variable weight and justify your answer. Write your answer after the Answer to Task 1.2 prompt provided below.

### Code for Task 1.2. Write your code here
###
mean_weight <- mean(ChickWeight$weight)
median_weight <- median(ChickWeight$weight)


hist(ChickWeight$weight, freq = FALSE, xlab = "Weight (g)", ylab = "Density", main = "Histogram of Chick Weight")
abline(v = mean_weight, col = "red")      # sample mean
abline(v = median_weight, col = "green")  # sample median

legend("topright", legend = c("Mean", "Median"), col = c("red", "green"), lty = 1, lwd = 1)

Answer to Task 1.2: (Write your answer here.) As the mean of about 122 g (red line) is higher than the median of 103 g (green line), the distribution of chick weight is positively (right) skewed. Most values are clustered at the lower end, with a long tail extending to the right, as the histogram shows.
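
The mean-versus-median comparison can also be checked numerically. The sketch below is one possible check, not part of the required solution; the moment-based skewness formula is one common definition (others differ slightly), and `w` and `skewness` are names introduced here.

```r
w <- ChickWeight$weight

# A positive gap between mean and median is consistent with right skew.
mean(w) - median(w)

# Moment-based sample skewness: third central moment divided by
# the 1.5 power of the second central moment.
skewness <- mean((w - mean(w))^3) / mean((w - mean(w))^2)^1.5
skewness
```

A positive value of `skewness` agrees with the conclusion drawn from the histogram.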

2 Boxplot, data selection, and outliers

Task 2.1

We want to understand the effectiveness of the Diet on the weight of the chicks using the comparative boxplot. Here we consider the weight of chicks after 18 days (including day 18) since birth. In the following code block, first select data points in the Diet and weight variables corresponding to 18 days and later (see the Time variable). Then, make a comparative boxplot for the selected data points from weight by splitting it by the corresponding Diet.

Based on the reported centres of the comparative boxplot, comment on which diet group is most effective on chick weight and justify your answer. Write your answer after the Answer to Task 2.1 prompt provided below.

### Code for Task 2.1. Write your code here
###
weight_day18onwards <- ChickWeight[ChickWeight$Time >= 18,]
boxplot(weight ~ Diet, data=weight_day18onwards, main="Boxplot of Weight by Diet", xlab="Diet", ylab="Weight (grams)")

Answer to Task 2.1: (Write your answer here.) Diet 3 is the most effective at producing a high chick weight: it has the highest median and the highest maximum weight. It also has the highest first and third quartiles of all diets.
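
The centres read off the comparative boxplot can be confirmed with numerical summaries. The following is a sketch under the same selection rule (Time of 18 days or later); `late`, `group_quartiles`, and `group_medians` are illustrative names.

```r
# Same selection as the boxplot: measurements from day 18 onwards.
late <- ChickWeight[ChickWeight$Time >= 18, ]

# Quartiles of weight within each diet group.
group_quartiles <- tapply(late$weight, late$Diet, quantile,
                          probs = c(0.25, 0.5, 0.75))
group_quartiles

# Medians only, and the diet with the largest median.
group_medians <- tapply(late$weight, late$Diet, median)
group_medians
names(which.max(group_medians))
```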

Task 2.2

The rest of this question is to check your understanding of the boxplot, numerical summaries used for constructing a boxplot, and how to identify outliers. We will focus on all data entries in the variable weight.

  • Calculate median of weight and the quartiles used for identifying the middle 50% of data points.
  • Make a boxplot (preferably a horizontal one).
  • Use abline to indicate the location of the sample median and the interquartile range on the boxplot.
### Code for Task 2.2.  Write your code here
###
median_weight <- median(ChickWeight$weight)
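Q1 <- quantile(ChickWeight$weight, probs = 0.25)  # first quartile
Q3 <- quantile(ChickWeight$weight, probs = 0.75)  # third quartile

boxplot(ChickWeight$weight, main = "Boxplot of Chick Weight", xlab = "Weight (g)", horizontal = TRUE)
abline(v = median_weight, col = "red")  # sample median
abline(v = Q1, col = "blue")            # lower end of the interquartile range
abline(v = Q3, col = "green")           # upper end of the interquartile range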
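
One detail worth checking: `boxplot()` draws its box from the hinges returned by `fivenum()`, which can differ slightly from the default `quantile()` quartiles. The sketch below simply prints the three sets of summaries side by side for comparison; `w` is an illustrative name.

```r
w <- ChickWeight$weight

quantile(w, probs = c(0.25, 0.5, 0.75))  # default (type 7) quartiles
fivenum(w)              # min, lower hinge, median, upper hinge, max
boxplot.stats(w)$stats  # lower whisker, lower hinge, median, upper hinge, upper whisker
```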

Task 2.3

Are there any outliers? Write your answer after the Answer to Task 2.3 prompt provided below.

Answer to Task 2.3: (Write your answer here.) There are outliers, as indicated by the open circles to the far right of the boxplot. Under Tukey's convention, outliers are values more than 1.5 × IQR above the third quartile (Q3) or below the first quartile (Q1).

In the following code block, write R code to count the number of outliers in this data set if there is any. Hint: you may need to calculate the location of the whiskers in the boxplot (the lower and upper thresholds in Tukey’s convention) first.

### Code for Task 2.3.  Write your code here
###
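iqr_weight <- Q3 - Q1                     # avoid masking the built-in IQR() function
lower_threshold <- Q1 - 1.5 * iqr_weight  # Tukey lower threshold
upper_threshold <- Q3 + 1.5 * iqr_weight  # Tukey upper threshold (about 315)
sum(ChickWeight$weight < lower_threshold | ChickWeight$weight > upper_threshold)
## [1] 9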

In this dataset, an outlier is any observation with a weight above the upper threshold of about 315 g; there are 9 such outliers. There are no outliers on the lower side, as the boxplot in Task 2.2 shows.
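
As a cross-check, base R's `boxplot.stats()` returns the points that `boxplot()` itself draws beyond the whiskers. This is only a sketch: `boxplot.stats()` works from hinges rather than `quantile()` quartiles, so in general its thresholds (and occasionally its count) can differ slightly from a hand calculation. `w` and `flagged` are illustrative names.

```r
w <- ChickWeight$weight

flagged <- boxplot.stats(w)$out  # points drawn beyond the whiskers
length(flagged)                  # number of flagged points
range(flagged)                   # smallest and largest flagged weight
```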

3 Normal curve

Task 3.1

We consider the population of chicks aged between 18 and 21 days (inclusive of day 18 and day 21) after birth. We want to apply the normal curve to estimate the proportion of chicks in this population weighing above 400 grams. The data selected in Task 2.1 will serve as the sample drawn from this population.

In the following code block, calculate the sample mean and sample standard deviation. Construct a normal curve using these values, and subsequently determine the proportion of chicks weighing above 400 grams. What percentage of chicks will weigh above 400 grams? Please also write your answer after the Answer to Task 3.1 prompt provided below, rounding your answer (in percentage) to two decimal places.

Hint: If you encountered difficulties selecting data in Task 2.1, you can use the sample mean \(206\) and the sample SD \(66\) instead. You will not be penalised for using these values in this question.

### Code for Task 3.1.  Write your code here
###
datatobeconsidered <- ChickWeight[ChickWeight$Time >= 18 & ChickWeight$Time <=21, ]
meanweight_18to21 <- mean(datatobeconsidered$weight)
standarddeviation_18to21 <- sd(datatobeconsidered$weight)

prop_above_400 <- 1 - pnorm(400, mean = meanweight_18to21, sd = standarddeviation_18to21)  # upper-tail area
Propabove400_aspercentage <- prop_above_400 * 100  # proportion as a percentage

# drawing the normal curve

x_vals <- seq(min(ChickWeight$weight), max(ChickWeight$weight))
y_vals <- dnorm(x_vals, mean=meanweight_18to21, sd=standarddeviation_18to21)




plot(x_vals, y_vals, type="l", 
     main="Normal Curve for Chick Weights (18-21 Days)", 
     xlab="Weight (grams)", ylab="Density")

Answer to Task 3.1: (Write your answer here.) The sample mean is 205.99 g and the sample standard deviation is 65.91 g.

Based on the normal curve, about 0.16% of chicks weigh above 400 g.

All values are rounded to 2 d.p.
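
The same calculation can be written in standard units. The sketch below uses the rounded values from the worksheet hint (mean 206, SD 66) rather than the selected data, so it runs on its own; `m`, `s`, `z`, and `pct_above` are illustrative names.

```r
m <- 206  # sample mean from the worksheet hint
s <- 66   # sample SD from the worksheet hint

z <- (400 - m) / s                               # 400 g in standard units
pct_above <- 100 * pnorm(z, lower.tail = FALSE)  # upper-tail area as a percentage
round(pct_above, 2)
```

Using `lower.tail = FALSE` gives the same area as `1 - pnorm(...)`.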

Task 3.2

In the following code block, calculate the 30th percentile of the population of chick weights based on the normal curve constructed above. Please also provide your answer after the Answer to Task 3.2 prompt provided below, rounding your answer to two decimal places.

### Code for Task 3.2.  Write your code here
###

percentile = qnorm(0.3, mean = meanweight_18to21, sd = standarddeviation_18to21)
round(percentile, 2) #output
## [1] 171.43

Answer to Task 3.2: (Write your answer here.)

The 30th percentile is 171.43 g (2 d.p.).
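
A quick way to sanity-check a percentile is to feed it back into `pnorm()`, which should return the original probability. The sketch below again assumes the hint values (mean 206, SD 66), so its result differs slightly from the answer above; `m`, `s`, and `p30` are illustrative names.

```r
m <- 206  # sample mean from the worksheet hint
s <- 66   # sample SD from the worksheet hint

p30 <- qnorm(0.3, mean = m, sd = s)  # 30th percentile of the normal curve
pnorm(p30, mean = m, sd = s)         # recovers 0.3
m + qnorm(0.3) * s                   # same percentile via standard units
```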

Task 3.3

In our lectures, we learned about the distinction between the population standard deviation (SD) and the sample SD. Additionally, we learned that variance = SD\(^2\). R has built-in functions sd() and var() for computing the sample SD and the sample variance. Here we want to write our own R function to compute the population variance and apply it to the ChickWeight data set.

In the following, we provide the function definition for my_pop_var(X), where X is the input data. Complete this function so it can compute the population variance for the input data X.

### Code for Task 3.3. 
###
my_pop_var = function(X){
  ###  Write your code below for population variance
  n <- length(X)
  mean_X <- mean(X)
  pop_var <- sum((X - mean_X)^2) / n
  return(pop_var)
}
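
One way to test a function like this is to compare it with the built-in `var()`, which divides by n - 1 instead of n, so the population variance equals `var(X) * (n - 1) / n`. The sketch below repeats the function so that it runs on its own and uses a small made-up vector `x` (not worksheet data).

```r
# Population variance: average squared deviation from the mean.
my_pop_var <- function(X) {
  sum((X - mean(X))^2) / length(X)
}

x <- c(2, 4, 4, 4, 5, 5, 7, 9)  # toy data, mean 5
n <- length(x)

my_pop_var(x)         # 32 / 8 = 4
var(x) * (n - 1) / n  # rescaled sample variance, same value
```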

Task 3.4

Apply your function written above to compute the population variance of the variable weight in ChickWeight.

### Code for Task 3.4.  Write your code here
###
pop_varianceofweight <- my_pop_var(ChickWeight$weight)
round(pop_varianceofweight,2) #output
## [1] 5042.48

The population variance of weight is therefore 5042.48 (2 d.p.).

====END OF THE WORKSHEET====