Instruction

There are several questions. Each question may contain multiple tasks. To receive full marks in this part, you should correctly solve all tasks, justify your solution in the space provided where necessary, and add appropriate labels to your graphical summaries.

Do NOT modify the header of this file. Do NOT delete or alter any text description from this file. Only work in the space provided.

Format: All assignment tasks have either a field for writing embedded R code, an answer field marked by the prompt Answer to Task x.x, or both. You should enter your solution either as embedded R code or as text after the prompt Answer to Task x.x.

Submission: Upon completion, you must render this worksheet (using Knit in RStudio) into an html file and submit the html file. Make sure the file extension “html” is in lower case. Your html file MUST contain all the R code you have written in the worksheet.

Task 0.0: The data story of Motor Trend Car Road Tests

This is a sample task demonstrating how to answer assignment questions using this R Markdown worksheet. Please read this data story carefully. You do NOT need to answer the question in this task. Your tasks start at Question 1 below. 

In this assignment, we will use the mtcars data, which is a built-in data set in R. This dataset was extracted from the 1974 Motor Trend US magazine, and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973–74 models). We will focus on the following four variables for this assignment:

  • the variable disp contains the measured engine displacement (in cubic inches)
  • the variable hp records the gross engine power (in horsepower)
  • the variable qsec records the time for completing 1/4 mile (in seconds)
  • the variable am records the type of transmission (0 = automatic, 1 = manual)

The variables disp, hp and qsec are numerical. The variable am is categorical (although given in the form of integers). Note that the variable names and the dataframe name are case sensitive.

Write R code in the following code block to display the dimension of the data, the variable names, and the first several rows of the data set. How many variables are in this data set? What is the sample size? Write your comment after the Answer to Task 0.0 prompt provided below.

### Write your code here. The code is completed in Task 0.0 for demonstration.
dim(mtcars) # the dimension of the mtcars data set
## [1] 32 11
names(mtcars) # display variable names
##  [1] "mpg"  "cyl"  "disp" "hp"   "drat" "wt"   "qsec" "vs"   "am"   "gear"
## [11] "carb"
head(mtcars) # display the first several rows
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

Answer to Task 0.0: (Write your answer here.) There are 11 variables. The sample size is 32.

====START OF ASSIGNMENT QUESTIONS====

1 Histogram and skewness

Task 1.1

The overall aim of this task is to use the sample mean and the sample median to determine the shape of a data distribution.

In the following code block, create an appropriate histogram for the variable hp on the density scale. Here you can use the default number of class intervals. Calculate the sample mean and the sample median of the variable hp, and then use the function abline to indicate the locations of the sample median and the sample mean on the histogram.

Based on your findings, comment on the skewness of the variable hp and justify your answer. Write your answer after the Answer to Task 1.1 prompt provided below.

### Code for Task 1.1. Write your code here
###
hist(mtcars$hp, 
     main = "Histogram of Horsepower (hp)",
     xlab = "Horsepower",
     prob = TRUE, 
     col = "lightblue")
mean_hp <- mean(mtcars$hp)
median_hp <- median(mtcars$hp)
abline(v = mean_hp, col = "red", lwd = 2, lty = 2)
abline(v = median_hp, col = "blue", lwd = 2, lty = 2)
legend("topright", 
       legend = c("Mean", "Median"),
       col = c("red", "blue"), 
       lty = 2, 
       lwd = 2)

Answer to Task 1.1: (Write your answer here.) The sample mean of horsepower (hp) is 146.69 and the sample median is 123. The mean is greater than the median, and the histogram shows a longer right tail. This indicates that the distribution of horsepower is skewed to the right (positively skewed).
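
The code block above stores the mean and median but does not display them. A minimal sketch for printing the two values behind the skewness argument follows; the names `hp_summary` and `mean_minus_median` are illustrative and not part of the worksheet.

```r
# Sketch: print the two summary statistics used in the skewness argument.
# Uses only the built-in mtcars data; object names are illustrative.
hp <- mtcars$hp
hp_summary <- c(mean = mean(hp), median = median(hp))
print(round(hp_summary, 2))

# A positive difference (mean above median) is consistent with right skew,
# although it is a rule of thumb and should be read alongside the histogram.
mean_minus_median <- hp_summary[["mean"]] - hp_summary[["median"]]
print(mean_minus_median > 0)
```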

2 Boxplot, data selection, and outliers

Task 2.1

We want to understand the effect of am (transmission type) on qsec (time for 1/4 mile; the shorter the time, the faster the car) using a comparative boxplot. Here we consider only cars with displacement (disp) of more than 130 cubic inches. In the following code block, first select the data points in the am and qsec variables according to disp (>130). Then, make a comparative boxplot of the selected qsec values, split by the corresponding am.

Based on the reported centers of the comparative boxplot, comment on which transmission type is faster (1 for manual and 0 for automatic) in general and justify your answer. Write your answer after the Answer to Task 2.1 prompt provided below.

### Code for Task 2.1. Write your code here
###
subset_data <- subset(mtcars, disp > 130, select = c(am, qsec))
boxplot(qsec ~ am, 
        data = subset_data,
        horizontal = TRUE,
        main = "1/4 Mile Time (qsec) by Transmission Type (disp > 130)",
        xlab = "1/4 Mile Time (seconds)",
        ylab = "Transmission Type",
        names = c("Automatic (0)", "Manual (1)"),
        col = "lightgreen")

median_auto <- median(subset_data$qsec[subset_data$am == 0])
median_manual <- median(subset_data$qsec[subset_data$am == 1])

Answer to Task 2.1: (Write your answer here.) The median quarter-mile time for cars with an automatic transmission (am = 0) is 17.71 seconds, and the median for cars with a manual transmission (am = 1) is 15.50 seconds. Because manual cars have the shorter median 1/4 mile time, they are generally faster than automatic cars among vehicles with an engine displacement of more than 130 cubic inches.
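
The medians in the code block above are stored but never printed. As a hedged sketch, the group sizes and group medians for the same selection can be displayed in one step with `table()` and `tapply()`; the object names `big_engines`, `group_sizes` and `group_medians` are illustrative.

```r
# Sketch: group sizes and group medians of qsec for cars with disp > 130.
# Uses only the built-in mtcars data; object names are illustrative.
big_engines <- subset(mtcars, disp > 130, select = c(am, qsec))

group_sizes <- table(big_engines$am)                 # cars per transmission type
group_medians <- tapply(big_engines$qsec, big_engines$am, median)

print(group_sizes)
print(group_medians)                                 # names "0" and "1" are the am codes
```

Printing the group sizes alongside the medians shows how many cars each center is based on, which is worth keeping in mind when comparing the two boxes.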

Task 2.2

The rest of this question is to check your understanding of the boxplot, numerical summaries used for constructing a boxplot, and how to identify outliers. We will use all data entries in the variable qsec.

  • Calculate the median of qsec and the quartiles used for identifying the middle 50% of data points.
  • Make a boxplot (preferably a horizontal one).
  • Use abline to indicate the locations of the sample median and the quartiles (the ends of the interquartile range) on the boxplot.
### Code for Task 2.2.  Write your code here
###
qsec_median <- median(mtcars$qsec)
qsec_quartiles <- quantile(mtcars$qsec, c(0.25, 0.75))
iqr <- qsec_quartiles[2] - qsec_quartiles[1]

boxplot(mtcars$qsec,
        horizontal = TRUE,
        main = "Boxplot of 1/4 Mile Time (qsec)",
        xlab = "Seconds",
        col = "orange")
abline(v = qsec_median, col = "purple", lwd = 2)
abline(v = c(qsec_quartiles[1], qsec_quartiles[2]), 
       col = "darkgray", 
       lty = 2)

outliers <- boxplot.stats(mtcars$qsec)$out

Are there any outliers? Write your answer after the Answer to Task 2.2 prompt provided below.

Answer to Task 2.2: (Write your answer here.) The median qsec is 17.71 seconds, and the quartiles are Q1 = 16.89 and Q3 = 18.90, giving an interquartile range (IQR) of 2.01. The fences are approximately Q1 - 1.5 IQR = 13.88 and Q3 + 1.5 IQR = 21.91. Yes, there is one outlier: the largest value, 22.90 seconds (Merc 230), lies above the upper fence. No values fall below the lower fence.
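
The 1.5 IQR rule used in this task can be checked directly. The sketch below computes the fences from the default `quantile()` quartiles and lists any rows that fall outside them; `lower_fence`, `upper_fence` and `flagged` are illustrative names. Note that `boxplot()` and `boxplot.stats()` use hinges rather than `quantile()`'s default quartiles, so their fences can differ slightly.

```r
# Sketch: 1.5 * IQR rule for qsec using quantile()'s default (type 7) quartiles.
# Uses only the built-in mtcars data; object names are illustrative.
qsec <- mtcars$qsec
q <- quantile(qsec, c(0.25, 0.75), names = FALSE)
iqr_qsec <- q[2] - q[1]

lower_fence <- q[1] - 1.5 * iqr_qsec
upper_fence <- q[2] + 1.5 * iqr_qsec
print(round(c(lower = lower_fence, upper = upper_fence), 2))

# Rows outside the fences, if any (row names identify the cars).
flagged <- mtcars[qsec < lower_fence | qsec > upper_fence, "qsec", drop = FALSE]
print(flagged)

# Points that boxplot() itself would draw as outliers (hinge-based fences).
print(boxplot.stats(qsec)$out)
```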

3 Normal curve

Task 3.1

We consider all data entries in mtcars as a sample collected from all available cars on the market. First, we examine the variable qsec and aim to use the normal curve to estimate the proportion of cars on the market with 1/4 mile time exceeding 20 seconds.

In the following code block, calculate the sample mean and sample standard deviation. Construct a normal curve using these values, and subsequently determine the proportion of cars having 1/4 mile time exceeding 20 seconds. What percentage of cars have 1/4 mile time exceeding 20 seconds? Please also write your answer after the Answer to Task 3.1 prompt provided below, rounding your answer (in percentage) to two decimal places.

### Code for Task 3.1.  Write your code here
###
mean_qsec <- mean(mtcars$qsec)
sd_qsec <- sd(mtcars$qsec)
proportion_above_20 <- (1 - pnorm(20, mean = mean_qsec, sd = sd_qsec)) * 100

hist(mtcars$qsec, 
     main = "1/4 Mile Time (qsec) with Normal Curve",
     xlab = "1/4 Mile Time (seconds)",
     ylab = "Density",
     prob = TRUE,               
     col = "lightblue",
     breaks = 10)      

curve(dnorm(x, mean = mean_qsec, sd = sd_qsec), 
      col = "red", 
      lwd = 2,
      add = TRUE)

abline(v = 20, col = "darkgreen", lwd = 2, lty = 2)

x_fill <- seq(20, max(mtcars$qsec), length.out = 100)
y_fill <- dnorm(x_fill, mean = mean_qsec, sd = sd_qsec)
polygon(c(20, x_fill, max(mtcars$qsec)), 
        c(0, y_fill, 0), 
        col = rgb(0, 0.5, 0, 0.3), 
        border = NA)

legend("topright",
       legend = c("Normal Curve", "20s Threshold", "Area >20s"),
       col = c("red", "darkgreen", rgb(0, 0.5, 0, 0.3)),
       lwd = c(2, 2, NA),
       lty = c(1, 2, NA),
       pch = c(NA, NA, 15),
       pt.cex = 2)

Answer to Task 3.1: (Write your answer here.) Using the normal approximation (mean = 17.85, SD = 1.79), the estimated percentage of cars taking more than 20 seconds for a quarter mile is 11.43%. In the graph, the red curve is the fitted normal density, the green dashed line marks the 20-second threshold, and the shaded area represents the proportion of cars above 20 seconds.
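
The percentage is computed in the code block above but not printed. A minimal sketch of the same upper-tail calculation, done once directly and once through standard units as a cross-check, is given below; the names `m`, `s`, `z`, `pct_direct` and `pct_from_z` are illustrative.

```r
# Sketch: upper-tail area above 20 seconds under the fitted normal curve.
# Uses only the built-in mtcars data; object names are illustrative.
m <- mean(mtcars$qsec)
s <- sd(mtcars$qsec)            # sample SD, as asked for in the task

z <- (20 - m) / s               # 20 seconds expressed in standard units

# Two equivalent routes to the same tail area, in percent.
pct_direct <- 100 * pnorm(20, mean = m, sd = s, lower.tail = FALSE)
pct_from_z <- 100 * pnorm(z, lower.tail = FALSE)

print(round(c(z = z, percent = pct_direct), 2))
```

Using the unrounded mean and SD matters here: looking up a z value rounded to 1.20 in a normal table gives a slightly different percentage than `pnorm()` applied to the exact values.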

Task 3.2

In the following code block, calculate the 30-th percentile of the 1/4 mile time of cars based on the normal curve constructed above. Please also provide your answer after the Answer to Task 3.2 prompt provided below, rounding your answer to two decimal places.

### Code for Task 3.2.  Write your code here
###
percentile_30 <- qnorm(0.3, mean = mean_qsec, sd = sd_qsec)

Answer to Task 3.2: (Write your answer here.) The 30th percentile of the 1/4 mile time based on the normal curve is 16.91 seconds.
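
As a sketch of how the percentile can be displayed and sanity-checked, the block below prints the value, recomputes it from standard units, and confirms that 30% of the fitted normal curve lies below it. The names `p30` and `p30_from_z` are illustrative.

```r
# Sketch: 30th percentile of qsec under the fitted normal curve, with checks.
# Uses only the built-in mtcars data; object names are illustrative.
m <- mean(mtcars$qsec)
s <- sd(mtcars$qsec)

p30 <- qnorm(0.30, mean = m, sd = s)
p30_from_z <- m + qnorm(0.30) * s     # same value via standard units

print(round(p30, 2))
print(pnorm(p30, mean = m, sd = s))   # area below p30; should recover 0.3
```

Because the 30th percentile is below the 50th, the result must be smaller than the mean of the fitted curve, which gives a quick plausibility check on the reported value.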

Task 3.3

In our lectures, we learned about the distinction between the population standard deviation (SD) and the sample SD. Additionally, we learned that variance = SD\(^2\). R has built-in functions sd() and var() for computing the sample SD and the sample variance. Here we want to write our own R function to compute the population variance and apply it to the mtcars data set.

In the following, we provide the function definition for my_pop_var(X), where X is the input data. Complete this function so it can compute the population variance for the input data X.

### Code for Task 3.3. 
###
my_pop_var <- function(X) {
  mu <- mean(X)                  
  sum_sq_dev <- sum((X - mu)^2)   
  pop_var <- sum_sq_dev / length(X)  
  return(pop_var)
}
test_data <- c(1, 2, 3, 4, 5)
cat("Test Case:\n",
    "Manual Population Variance: 2.0\n",
    "my_pop_var Result:", my_pop_var(test_data), "\n\n")
## Test Case:
##  Manual Population Variance: 2.0
##  my_pop_var Result: 2
###  Write your code below for population variance

Task 3.4

Apply your function written above to compute the population variance of the variable qsec in mtcars.

### Code for Task 3.4.  Write your code here
###
pop_var_qsec <- my_pop_var(mtcars$qsec)
cat("Population Variance of qsec:", pop_var_qsec, "\n")
## Population Variance of qsec: 3.09338
n <- length(mtcars$qsec)
sample_var <- var(mtcars$qsec)
cat("Sample Variance of qsec (var()):", sample_var, "\n",
    "Population Variance * n/(n-1):", pop_var_qsec * n/(n-1), "\n")
## Sample Variance of qsec (var()): 3.193166 
##  Population Variance * n/(n-1): 3.193166
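
The agreement between the last two output lines follows from the two definitions, which differ only in the divisor. Written out as a short derivation (the notation \(\sigma^2_{\text{pop}}\), \(s^2\) and \(\bar{x}\) is chosen here for illustration and may differ from the lecture notation):

```latex
\[
\sigma^2_{\text{pop}} = \frac{1}{n}\sum_{i=1}^{n}(x_i-\bar{x})^2,
\qquad
s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})^2
    = \frac{n}{n-1}\,\sigma^2_{\text{pop}} .
\]
% For qsec, n = 32, so s^2 = (32/31) * 3.09338, which is approximately 3.193166.
```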

====END OF THE WORKSHEET====