There are several questions. Each question may contain multiple tasks. To receive a full mark in this part, you should correctly solve all tasks, justify your solution in the space provided in case necessary, and add appropriate labels to your graphical summaries.
Do NOT modify the header of this file. Do NOT delete or alter any text description from this file. Only work in the space provided.
Format: All assignment tasks have either a field for writing embedded R code, an answer field marked by the prompt Answer to Task x.x, or both. You should enter your solution either as embedded R code or as text after the prompt Answer to Task x.x.
Submission: Upon completion, you must render this worksheet (using Knit in R Studio) into an html file and submit the html file. Make sure the file extension “html” is in lower case. Your html file MUST contain all the R code you have written in the worksheet.
Task 0.0: The data story of Motor Trend Car Road Tests
This is a sample task demonstrating how to answer assignment questions using this R Markdown worksheet. Please read this data story carefully. You do NOT need to answer the question in this task. Your tasks start at Question 1 below.
In this assignment, we will use the mtcars data, which is a built-in data set in R. This dataset was extracted from the 1974 Motor Trend US magazine, and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973–74 models). We will focus on the following four variables for this assignment:
disp contains the measured engine displacement (in cubic inches).
hp records the gross engine power (in horsepower).
qsec records the time for completing 1/4 mile (in seconds).
am records the type of transmission (0 = automatic, 1 = manual).
The variables disp, hp and qsec are numerical. The variable am is categorical (although given in the form of integers). Note that the variable names and the dataframe name are case sensitive.
Write R code in the following code block to display the dimension of the data, variable names, and the first several rows of the data set. How many variables are in this data set? What is the sample size? Write your comment after the Answer to Task 0.0 prompt provided below.
### Write your code here. The code is completed in Task 0.0 for demonstration.
dim(mtcars) # the dimension of the mtcars data set
## [1] 32 11
names(mtcars) # display variable names
## [1] "mpg" "cyl" "disp" "hp" "drat" "wt" "qsec" "vs" "am" "gear"
## [11] "carb"
head(mtcars) # display the first several rows
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
Answer to Task 0.0:
(Write your answer
here.) There are 11 variables. The sample size is 32.
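Since am is stored as integers but is categorical, a quick way to make that explicit is to recode it as a labelled factor. The following is a minimal sketch; the object name am_factor and the label strings are illustrative choices, not part of the worksheet.

```r
# Recode the 0/1 transmission indicator as a labelled factor.
# am_factor is a hypothetical name used only for this illustration.
am_factor <- factor(mtcars$am,
                    levels = c(0, 1),
                    labels = c("automatic", "manual"))

# Inspect how R stores the four variables of interest (am is numeric).
str(mtcars[, c("disp", "hp", "qsec", "am")])

# Count the cars of each transmission type.
table(am_factor)
```

The original mtcars data frame is left unchanged, so later tasks that use am as 0/1 still work.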
====START OF ASSIGNMENT QUESTIONS====
Task 1.1
The overall aim of this task is to use sample mean and median to determine the shape of a data distribution.
In the following code block, create an appropriate histogram for the variable hp on the density scale. Here you can use the default number of class intervals. Calculate the sample mean and the sample median of the variable hp, and then use the function abline to indicate the locations of the sample median and the sample mean on the histogram.
Based on your findings, comment on the skewness of the variable hp and justify your answer. Write your answer after the Answer to Task 1.1 prompt provided below.
### Code for Task 1.1. Write your code here
###
hist(mtcars$hp,
     main = "Histogram of Horsepower (hp)",
     xlab = "Horsepower",
     prob = TRUE,
     col = "lightblue")
mean_hp <- mean(mtcars$hp)
median_hp <- median(mtcars$hp)
abline(v = mean_hp, col = "red", lwd = 2, lty = 2)
abline(v = median_hp, col = "blue", lwd = 2, lty = 2)
legend("topright",
       legend = c("Mean", "Median"),
       col = c("red", "blue"),
       lty = 2,
       lwd = 2)
Answer to Task 1.1:
(Write your answer
here.) The sample mean for horsepower (hp) was 146.69, with a median of
123. The mean is greater than the median, and the histogram shows a
longer right tail. This indicates that the distribution of horsepower is
skewed to the right (positive).
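The mean-versus-median comparison used in this answer can also be checked numerically without the plot. The sketch below is only a rule-of-thumb check (a mean above the median suggests a longer right tail); the names gap and shape are illustrative.

```r
# Compare the sample mean and median of hp as a rough skewness indicator.
hp <- mtcars$hp
mean_hp <- mean(hp)
median_hp <- median(hp)
gap <- mean_hp - median_hp   # positive gap suggests a longer right tail

cat("mean:", round(mean_hp, 2),
    " median:", median_hp,
    " mean - median:", round(gap, 2), "\n")

# Heuristic label only; the histogram should still be inspected.
shape <- if (gap > 0) "right-skewed" else
         if (gap < 0) "left-skewed" else "roughly symmetric"
shape
```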
Task 2.1
We want to understand the effect of am (transmission type) on qsec (time for 1/4 mile; the shorter, the faster) using a comparative boxplot. Here we consider cars with displacement (disp) of more than 130 cubic inches. In the following code block, first select data points in the am and qsec variables according to disp (>130). Then, make a comparative boxplot for the selected data points from qsec by splitting it by the corresponding am.
Based on the reported centers of the comparative boxplot, comment on which transmission type is faster (1 for manual and 0 for automatic) in general and justify your answer. Write your answer after the Answer to Task 2.1 prompt provided below.
### Code for Task 2.1. Write your code here
###
subset_data <- subset(mtcars, disp > 130, select = c(am, qsec))
boxplot(qsec ~ am,
        data = subset_data,
        horizontal = TRUE,
        main = "1/4 Mile Time (qsec) by Transmission Type (disp > 130)",
        xlab = "1/4 Mile Time (seconds)",
        ylab = "Transmission Type",
        names = c("Automatic (0)", "Manual (1)"),
        col = "lightgreen")
median_auto <- median(subset_data$qsec[subset_data$am == 0])
median_manual <- median(subset_data$qsec[subset_data$am == 1])
Answer to Task 2.1:
(Write your answer here.) The median quarter-mile time for automatic transmission cars (am = 0) is 17.71 seconds, and the median for manual transmission cars (am = 1) is 15.50 seconds. Because manual transmission cars have the shorter median 1/4 mile time, they are faster in general than automatic transmission cars among vehicles with more than 130 cubic inches of engine displacement.
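The group centers shown by the comparative boxplot can be printed directly, which makes the comparison easy to verify. This is a minimal sketch; big_engine and group_medians are illustrative names, and tapply is one of several ways to compute per-group medians.

```r
# Keep only cars with displacement above 130 cubic inches.
big_engine <- subset(mtcars, disp > 130, select = c(am, qsec))

# Median qsec within each transmission group (am = 0, then am = 1).
group_medians <- tapply(big_engine$qsec, big_engine$am, median)
names(group_medians) <- c("automatic", "manual")
round(group_medians, 2)

# Group sizes, useful context when comparing medians.
table(big_engine$am)
```

A smaller median qsec means a faster quarter mile, so the group with the lower value is the faster one in general.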
Task 2.2
The rest of this question is to check your understanding of the boxplot, numerical summaries used for constructing a boxplot, and how to identify outliers. We will use all data entries in the variable qsec.
Calculate the sample median of qsec and the quartiles used for identifying the middle 50% of data points.
Make a boxplot of qsec and use abline to indicate the location of the sample median and the interquartile range on the boxplot.
### Code for Task 2.2. Write your code here
###
qsec_median <- median(mtcars$qsec)
qsec_quartiles <- quantile(mtcars$qsec, c(0.25, 0.75))
iqr <- qsec_quartiles[2] - qsec_quartiles[1]
boxplot(mtcars$qsec,
        horizontal = TRUE,
        main = "Boxplot of 1/4 Mile Time (qsec)",
        xlab = "Seconds",
        col = "orange")
abline(v = qsec_median, col = "purple", lwd = 2)
abline(v = c(qsec_quartiles[1], qsec_quartiles[2]),
       col = "darkgray",
       lty = 2)
outliers <- boxplot.stats(mtcars$qsec)$out
Are there any outliers? Write your answer after the
Answer to Task 2.2
prompt provided below.
Answer to Task 2.2:
(Write your answer here.) The median qsec is 17.71 seconds, and the quartiles are Q1 = 16.89 and Q3 = 18.90, so the interquartile range (IQR) is 2.01. The fences are Q1 - 1.5 IQR = 13.88 and Q3 + 1.5 IQR = 21.91. There is one outlier: the largest value, 22.90 seconds, lies above the upper fence. No data points fall below the lower fence.
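The 1.5 IQR rule can be applied explicitly to list any flagged values. The sketch below uses quantile() for the quartiles; note that boxplot() is based on hinges, which can differ slightly from quantile(), so the last line cross-checks against boxplot.stats(). The names lower_fence, upper_fence and flagged are illustrative.

```r
# Quartiles and IQR of qsec (default quantile type).
q <- quantile(mtcars$qsec, c(0.25, 0.75))
iqr <- unname(q[2] - q[1])

# Fences of the 1.5 * IQR rule.
lower_fence <- unname(q[1]) - 1.5 * iqr
upper_fence <- unname(q[2]) + 1.5 * iqr
round(c(lower = lower_fence, upper = upper_fence), 2)

# Values falling outside the fences, if any.
flagged <- mtcars$qsec[mtcars$qsec < lower_fence | mtcars$qsec > upper_fence]
flagged

# Cross-check with the values boxplot() itself would draw as outliers.
boxplot.stats(mtcars$qsec)$out
```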
Task 3.1
We consider all data entries in mtcars as a sample collected from all available cars on the market. First, we examine the variable qsec and aim to use the normal curve to estimate the proportion of cars on the market with 1/4 mile time exceeding 20 seconds.
In the following code block, calculate the sample mean and sample standard deviation. Construct a normal curve using these values, and subsequently determine the proportion of cars having 1/4 mile time exceeding 20 seconds. What percentage of cars have 1/4 mile time exceeding 20 seconds? Please also write your answer after the Answer to Task 3.1 prompt provided below, rounding your answer (in percentage) to two decimal places.
### Code for Task 3.1. Write your code here
###
mean_qsec <- mean(mtcars$qsec)
sd_qsec <- sd(mtcars$qsec)
proportion_above_20 <- (1 - pnorm(20, mean = mean_qsec, sd = sd_qsec)) * 100
hist(mtcars$qsec,
     main = "1/4 Mile Time (qsec) with Normal Curve",
     xlab = "1/4 Mile Time (seconds)",
     ylab = "Density",
     prob = TRUE,
     col = "lightblue",
     breaks = 10)
curve(dnorm(x, mean = mean_qsec, sd = sd_qsec),
      col = "red",
      lwd = 2,
      add = TRUE)
abline(v = 20, col = "darkgreen", lwd = 2, lty = 2)
x_fill <- seq(20, max(mtcars$qsec), length.out = 100)
y_fill <- dnorm(x_fill, mean = mean_qsec, sd = sd_qsec)
polygon(c(20, x_fill, max(mtcars$qsec)),
        c(0, y_fill, 0),
        col = rgb(0, 0.5, 0, 0.3),
        border = NA)
legend("topright",
       legend = c("Normal Curve", "20s Threshold", "Area >20s"),
       col = c("red", "darkgreen", rgb(0, 0.5, 0, 0.3)),
       lwd = c(2, 2, NA),
       lty = c(1, 2, NA),
       pch = c(NA, NA, 15),
       pt.cex = 2)
Answer to Task 3.1:
(Write your answer here.) Using the normal approximation (mean = 17.85, SD = 1.79), the estimated percentage of cars taking more than 20 seconds for a quarter mile is 11.43%. The red curve in the graph is the fitted normal distribution, the green dashed line marks the 20-second threshold, and the shaded area represents the probability of exceeding 20 seconds.
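The upper-tail calculation can be written in two equivalent ways, either on the original scale or after converting 20 seconds to standard units. The sketch below shows both and, for comparison, the raw proportion observed in the sample; the names m, s, z and the p_* objects are illustrative.

```r
# Sample mean and sample SD of qsec define the fitted normal curve.
m <- mean(mtcars$qsec)
s <- sd(mtcars$qsec)

# Upper-tail area beyond 20 seconds, computed directly.
p_direct <- pnorm(20, mean = m, sd = s, lower.tail = FALSE)

# Same area via standard units: z = (value - mean) / SD.
z <- (20 - m) / s
p_via_z <- pnorm(z, lower.tail = FALSE)

# Raw sample proportion above 20 seconds (no normal assumption).
p_empirical <- mean(mtcars$qsec > 20)

round(100 * c(normal = p_direct, via_z = p_via_z, empirical = p_empirical), 2)
```

Using unrounded values of the mean and SD avoids the small discrepancy that appears when z is rounded before looking up the area.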
Task 3.2
In the following code block, calculate the 30th percentile of the 1/4 mile time of cars based on the normal curve constructed above. Please also provide your answer after the Answer to Task 3.2 prompt provided below, rounding your answer to two decimal places.
### Code for Task 3.2. Write your code here
###
percentile_30 <- qnorm(0.3, mean = mean_qsec, sd = sd_qsec)
Answer to Task 3.2:
(Write your answer here.) The 30th percentile of 1/4 mile time based on the normal curve is 16.91 seconds.
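A percentile from the normal curve can be sanity-checked by feeding it back into pnorm(), which should return the original probability. The sketch below also prints the sample-based 30th percentile for comparison; the names p30_normal, round_trip and p30_empirical are illustrative.

```r
# Fitted normal curve parameters.
m <- mean(mtcars$qsec)
s <- sd(mtcars$qsec)

# 30th percentile under the normal curve.
p30_normal <- qnorm(0.30, mean = m, sd = s)
round(p30_normal, 2)

# Round trip: the area to the left of that value should be 0.30.
round_trip <- pnorm(p30_normal, mean = m, sd = s)
round_trip

# Sample-based 30th percentile, which need not match the normal one.
p30_empirical <- quantile(mtcars$qsec, 0.30)
p30_empirical
```

Because 0.30 is below 0.50, the normal-curve percentile must lie below the sample mean.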
Task 3.3
In our lectures, we learned about the distinction between the population standard deviation (SD) and the sample SD. Additionally, we learned that variance = \(\text{SD}^2\). R has built-in functions sd() and var() for computing the sample SD and the sample variance. Here we want to write our own R function to compute the population variance and apply it to the mtcars data set.
In the following, we provide the function definition for my_pop_var(X), where X is the input data. Complete this function so it can compute the population variance for the input data X.
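For reference, the two variance definitions can be written side by side. The notation below (\(n\) observations \(x_1, \dots, x_n\) with sample mean \(\bar{x}\)) is the usual textbook convention and is an assumption about the lecture notation, not something stated in this worksheet.

```latex
\sigma^2_{\text{pop}} = \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^2,
\qquad
s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2,
\qquad
\sigma^2_{\text{pop}} = \frac{n-1}{n}\, s^2 .
```

The only difference is the divisor, so the population variance is always slightly smaller than the sample variance computed by var().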
### Code for Task 3.3.
###
my_pop_var <- function(X) {
  mu <- mean(X)
  sum_sq_dev <- sum((X - mu)^2)
  pop_var <- sum_sq_dev / length(X)
  return(pop_var)
}
test_data <- c(1, 2, 3, 4, 5)
cat("Test Case:\n",
    "Manual Population Variance: 2.0\n",
    "my_pop_var Result:", my_pop_var(test_data), "\n\n")
## Test Case:
## Manual Population Variance: 2.0
## my_pop_var Result: 2
### Write your code below for population variance
Task 3.4
Apply your function written above to compute the population variance of the variable qsec in mtcars.
### Code for Task 3.4. Write your code here
###
pop_var_qsec <- my_pop_var(mtcars$qsec)
cat("Population Variance of qsec:", pop_var_qsec, "\n")
## Population Variance of qsec: 3.09338
n <- length(mtcars$qsec)
sample_var <- var(mtcars$qsec)
cat("Sample Variance of qsec (var()):", sample_var, "\n",
"Population Variance * n/(n-1):", pop_var_qsec * n/(n-1), "\n")
## Sample Variance of qsec (var()): 3.193166
## Population Variance * n/(n-1): 3.193166
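The same consistency check can be repeated for the other numerical variables in one pass. The sketch below restates the population variance as a one-line helper (pop_var is a hypothetical name that computes the same quantity as my_pop_var) so that the block runs on its own.

```r
# Population variance: mean squared deviation from the mean.
# pop_var is an illustrative stand-in for my_pop_var defined above.
pop_var <- function(x) mean((x - mean(x))^2)

vars <- c("disp", "hp", "qsec")
n <- nrow(mtcars)

# For each variable, compare the population variance with the
# sample variance rescaled by (n - 1) / n; the two rows should agree.
check <- sapply(mtcars[vars], function(x) {
  c(population = pop_var(x),
    rescaled_sample = var(x) * (n - 1) / n)
})
check
```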
====END OF THE WORKSHEET====