There are several questions. Each question may contain multiple tasks. To receive a full mark in this part, you should correctly solve all tasks, justify your solution in the space provided in case necessary, and add appropriate labels to your graphical summaries.
Do NOT modify the header of this file. Do NOT delete or alter any text description from this file. Only work in the space provided.
Format: All assignment tasks have either a field for
writing embedded R code, an answer field marked by the prompt
Answer to Task x.x, or both. You should
enter your solution either as embedded R code or as text after the
prompt Answer to Task x.x.
Submission: Upon completion, you must render this
worksheet (using Knit in R Studio) into an html file and
submit the html file. Make sure the file extension “html” is in lower
case. Your html file MUST contain all the R code you
have written in the worksheet.
Task 0.0: The data story of Motor Trend Car Road Tests
This is a sample task demonstrating how to answer assignment questions using this R Markdown worksheet. Please read this data story carefully. You do NOT need to answer the question in this task. Your tasks start at Question 1 below.
In this assignment, we will use the mtcars data, which
is a built-in data set in R. This dataset was extracted from the 1974
Motor Trend US magazine, and comprises fuel consumption and 10 aspects
of automobile design and performance for 32 automobiles (1973–74
models). We will focus on the following four variables for this
assignment:
disp contains the measured engine displacement (in cubic inches).
hp records the gross engine power (in horsepower).
qsec records the time for completing 1/4 mile (in seconds).
am records the type of transmission (0 = automatic, 1 = manual).
The variables disp, hp and qsec are numerical. The variable am is
categorical (although given in the form of integers). Note that
the variable names and the dataframe name are case
sensitive.
Write R code in the following code block to display the dimension of
the data, variable names, and the first several rows of the data set.
How many variables are in this data set? What is the sample size? Write your
comment after the Answer to Task 0.0
prompt provided below.
### Write your code here. The code is completed in Task 0.0 for demonstration.
dim(mtcars) # the dimension of the mtcars data set
## [1] 32 11
names(mtcars) # display variable names
## [1] "mpg" "cyl" "disp" "hp" "drat" "wt" "qsec" "vs" "am" "gear"
## [11] "carb"
head(mtcars) # display the first several rows
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
Answer to Task 0.0: (Write your answer
here.) There are 11 variables. The sample size is 32.
====START OF ASSIGNMENT QUESTIONS====
Task 1.1
The overall aim of this task is to use the sample mean and median to determine the shape of a data distribution.
In the following code block, create an appropriate histogram for the
variable hp on the density scale. Here you
can use the default number of class intervals. Calculate the sample mean
and the sample median of the variable hp, and then use the
function abline to indicate the locations of the sample
median and the sample mean on the histogram.
Based on your findings, comment on the skewness of the variable
hp and justify your answer. Write your answer after the
Answer to Task 1.1 prompt provided
below.
### Code for Task 1.1. Write your code here
###
HorsePower=mtcars$hp
mean(HorsePower)
## [1] 146.6875
median(HorsePower)
## [1] 123
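# freq=FALSE puts the histogram on the density scale (the density argument only controls shading lines)
hist(HorsePower, freq=FALSE, right=FALSE, xlab="Gross Engine Power (in Horsepower)", ylab="Density", main="Histogram for Horsepower of Automobiles in US (for 1973-74 models)")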
abline(v=mean(mtcars$hp),col="green")
abline(v=median(mtcars$hp),col="blue")
Answer to Task 1.1: (Write your answer
here.) The distribution of hp is positively (right) skewed. In a
positively skewed distribution, the long right tail pulls the mean above
the median. For the variable hp, the sample median of 123 is less than
the sample mean of 146.6875, so the data are positively skewed.
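The mean-versus-median comparison used in this answer can also be checked numerically. The following is a minimal sketch, assuming the built-in mtcars data; the names hp_mean, hp_median, gap and skew_direction are illustrative and not part of the worksheet.

```r
# Compare the sample mean and median of hp from the built-in mtcars data.
hp <- datasets::mtcars$hp
hp_mean <- mean(hp)
hp_median <- median(hp)

# A positive gap (mean above median) suggests a longer right tail.
gap <- hp_mean - hp_median
skew_direction <- if (gap > 0) {
  "right (positive)"
} else if (gap < 0) {
  "left (negative)"
} else {
  "roughly symmetric"
}
cat("mean =", hp_mean, " median =", hp_median, " skew:", skew_direction, "\n")
```

This rule of thumb is only a rough guide; the histogram remains the main evidence for the shape of the distribution.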
Task 2.1
We want to understand the effect of am
(transmission type) on qsec (time for 1/4 mile; the
shorter, the faster) using a comparative boxplot. Here we
consider cars with displacement (disp) more than
130 cubic inches. In the following code block, first select
data points in the am and qsec variables
according to disp (>130). Then, make a comparative
boxplot for the selected data points from qsec by splitting
it by the corresponding am.
Based on the reported centers of the comparative
boxplot, comment on which transmission type is faster (1 for manual and
0 for automatic) in general and justify your answer. Write your answer
after the Answer to Task 2.1 prompt
provided below.
### Code for Task 2.1. Write your code here
###
mtcars$am=as.factor(mtcars$am)
mtcars1=mtcars[mtcars$disp>130,]
Transmission=mtcars1$am
QMiletime=mtcars1$qsec
levels(Transmission)<-c("Automatic","Manual")
boxplot(QMiletime~Transmission, horizontal=T, xlab="Time (in seconds)", ylab="Type of Transmission", main="Time taken to complete 1/4 mile by Type of Transmission")
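Answer to Task 2.1: (Write your answer
here.) Manual transmission (1) is faster in general. A shorter 1/4 mile
time means a faster car, and among cars with disp > 130 the median qsec
of the manual group (about 15.5 seconds) is lower than that of the
automatic group (about 17.7 seconds).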
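The centers shown in the comparative boxplot can also be read off numerically. The sketch below assumes a fresh copy of the built-in data (so am is still coded 0/1); the names cars, big and group_medians are illustrative. The group with the smaller median qsec is the faster one.

```r
# Work on a fresh copy so the built-in data set is left untouched.
cars <- datasets::mtcars

# Keep only cars with displacement above 130 cubic inches.
big <- cars[cars$disp > 130, ]

# Median 1/4 mile time per transmission type (0 = automatic, 1 = manual).
group_medians <- tapply(big$qsec, big$am, median)
names(group_medians) <- c("Automatic", "Manual")
print(group_medians)
```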
Task 2.2
The rest of this question is to check your understanding of the
boxplot, numerical summaries used for constructing a boxplot, and how to
identify outliers. We will use all data entries in the
variable qsec.
Calculate the sample median of qsec and the quartiles used for
identifying the middle 50% of data points.
Make a boxplot of qsec and use abline to indicate the location of the sample
median and the interquartile range on the boxplot.
### Code for Task 2.2. Write your code here
###
sort(mtcars$qsec)
## [1] 14.50 14.60 15.41 15.50 15.84 16.46 16.70 16.87 16.90 17.02 17.02 17.05
## [13] 17.30 17.40 17.42 17.60 17.82 17.98 18.00 18.30 18.52 18.60 18.61 18.90
## [25] 18.90 19.44 19.47 19.90 20.00 20.01 20.22 22.90
length(mtcars$qsec)
## [1] 32
median(mtcars$qsec)
## [1] 17.71
summary(mtcars$qsec)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 14.50 16.89 17.71 17.85 18.90 22.90
IQR(mtcars$qsec)
## [1] 2.0075
boxplot(mtcars$qsec,horizontal=T, xlab="Time (in seconds)", main="Time taken to complete 1/4 Mile")
iqr=quantile(mtcars$qsec)[4]-quantile(mtcars$qsec)[2]
abline(v=median(mtcars$qsec), col="green")
abline(v=quantile(mtcars$qsec)[2],col="red")
abline(v=quantile(mtcars$qsec)[4],col="red")
abline(v=quantile(mtcars$qsec)[2]-1.5*iqr, col="purple")
abline(v=quantile(mtcars$qsec)[4]+1.5*iqr,col="purple")
Are there any outliers? Write your answer after the
Answer to Task 2.2 prompt provided
below.
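Answer to Task 2.2: (Write your answer
here.) Yes, there is one outlier: 22.90 seconds. The upper threshold is
Q3 + 1.5*IQR = 18.90 + 1.5*2.0075, which is about 21.91 seconds, and the
maximum of 22.90 lies beyond it. The lower threshold is
Q1 - 1.5*IQR = 16.8925 - 1.5*2.0075, which is about 13.88 seconds, and the
minimum of 14.50 lies above it, so there are no outliers at the lower end.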
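The 1.5*IQR rule can be applied directly in code rather than read off the plot. This is a minimal sketch, assuming the default quantile() definition of the quartiles; lower_fence, upper_fence and outliers are illustrative names. Note that boxplot() itself uses hinges, which can differ slightly from these quartiles, so boxplot.stats() is included as a cross-check.

```r
qsec <- datasets::mtcars$qsec

# Quartiles and interquartile range (default quantile type).
q1 <- unname(quantile(qsec, 0.25))
q3 <- unname(quantile(qsec, 0.75))
iqr <- q3 - q1

# Thresholds beyond which a point is flagged as an outlier.
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr
outliers <- qsec[qsec < lower_fence | qsec > upper_fence]

print(c(lower = lower_fence, upper = upper_fence))
print(outliers)

# Cross-check against the points that boxplot() would draw individually.
print(boxplot.stats(qsec)$out)
```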
Task 3.1
We consider all data entries in mtcars as a sample
collected from all available cars on the market. First, we examine the
variable qsec and aim to use the normal curve to estimate
the proportion of cars on the market with 1/4 mile time
exceeding 20 seconds.
In the following code block, calculate the sample mean and sample
standard deviation. Construct a normal curve using these values, and
subsequently determine the proportion of cars having 1/4 mile time
exceeding 20 seconds. What percentage of cars have 1/4 mile time
exceeding 20 seconds? Please also write your answer after the
Answer to Task 3.1 prompt provided below,
rounding your answer (in percentage) to two decimal places.
### Code for Task 3.1. Write your code here
###
qsec=mtcars$qsec
mean(qsec)
## [1] 17.84875
sd(qsec)
## [1] 1.786943
m=mean(qsec)
s=sd(qsec)
curve(dnorm(x,m,s),xlim=c(10,25))
pnorm(20,m,s)
## [1] 0.8856804
1-pnorm(20,m,s)
## [1] 0.1143196
Answer to Task 3.1: (Write your answer
here.) The sample mean is 17.85 seconds and the sample SD is 1.79
seconds. Based on the normal curve, about 11.43% of cars have a 1/4 mile
time exceeding 20 seconds.
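The same upper-tail area can be obtained by first converting 20 seconds to a standard unit (z-score). The sketch below, with illustrative names z, p_direct and p_standard, shows that both routes give the same proportion.

```r
qsec <- datasets::mtcars$qsec
m <- mean(qsec)   # sample mean
s <- sd(qsec)     # sample standard deviation

# Route 1: upper tail of the normal curve with mean m and SD s.
p_direct <- 1 - pnorm(20, m, s)

# Route 2: standardise 20 seconds, then use the standard normal curve.
z <- (20 - m) / s
p_standard <- pnorm(z, lower.tail = FALSE)

# Percentage of cars, rounded to two decimal places.
print(round(100 * p_direct, 2))
```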
Task 3.2
In the following code block, calculate the 30-th percentile of the
1/4 mile time of cars based on the normal curve constructed above.
Please also provide your answer after the
Answer to Task 3.2 prompt provided below,
rounding your answer to two decimal places.
### Code for Task 3.2. Write your code here
###
qnorm(0.3,m,s)
## [1] 16.91168
Answer to Task 3.2: (Write your answer
here.) Based on the normal curve, the 30th percentile of the 1/4 mile
time is 16.91 seconds.
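Because qnorm is the inverse of pnorm, the percentile can be verified by feeding it back into pnorm. This is a small sketch with illustrative names p30 and area_below; it should recover an area of 0.3 to the left of the percentile.

```r
qsec <- datasets::mtcars$qsec
m <- mean(qsec)
s <- sd(qsec)

# 30th percentile under the fitted normal curve.
p30 <- qnorm(0.3, m, s)

# Area to the left of that value; this should recover 0.3.
area_below <- pnorm(p30, m, s)
print(c(percentile = round(p30, 2), area_below = area_below))
```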
Task 3.3
In our lectures, we learned about the distinction between the
population standard deviation (SD) and the sample SD. Additionally, we
learned that variance = SD\(^2\). R has
built-in functions sd() and var() for
computing the sample SD and the sample variance. Here we want to write
our own R function to compute the population variance
and apply it to the mtcars data set.
In the following, we provide the function definition for
my_pop_var(X), where X is the input data.
Complete this function so it can compute the population variance for the
input data X.
### Code for Task 3.3.
###
my_pop_var = function(X){
  ### Write your code below for population variance
  m=sum(X)/length(X)          # mean of the data
  v=sum((X-m)^2)/length(X)    # average squared deviation (divide by n, not n-1)
  return(v)
}
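As a quick sanity check, the same formula can be tried on a small vector where the arithmetic is easy to follow by hand. In this sketch, pop_var_sketch is a hypothetical name that repeats the logic of my_pop_var so the block runs on its own, and the eight numbers are made up for illustration.

```r
# Population variance: average squared deviation from the mean (divide by n).
pop_var_sketch <- function(X) {
  m <- sum(X) / length(X)
  sum((X - m)^2) / length(X)
}

x <- c(2, 4, 4, 4, 5, 5, 7, 9)   # mean is 5, squared deviations sum to 32
n <- length(x)

pop_v <- pop_var_sketch(x)        # 32 / 8
samp_v <- var(x)                  # 32 / 7, R divides by n - 1

# The two differ only by the factor (n - 1) / n.
print(c(population = pop_v, sample = samp_v, rescaled = samp_v * (n - 1) / n))
```

The rescaled sample variance matching the population variance is the same relationship that the sd()-based check in Task 3.4 relies on.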
Task 3.4
Apply your function written above to compute the population variance
of the variable qsec in mtcars.
### Code for Task 3.4. Write your code here
###
my_pop_var(mtcars$qsec)
## [1] 3.09338
n=length(qsec)
(sd(qsec)*sqrt((n-1)/n))^2
## [1] 3.09338
====END OF THE WORKSHEET====