Instruction

There are several questions. Each question may contain multiple tasks. To receive full marks in this part, you should correctly solve all tasks, justify your solution in the space provided where necessary, and add appropriate labels to your graphical summaries.

Do NOT modify the header of this file. Do NOT delete or alter any text description from this file. Only work in the space provided.

Format: All assignment tasks have either a field for writing embedded R code, an answer field marked by the prompt Answer to Task x.x, or both. You should enter your solution either as embedded R code or as text after the prompt Answer to Task x.x.

Submission: Upon completion, you must render this worksheet (using Knit in RStudio) into an html file and submit the html file. Make sure the file extension “html” is in lower case. Your html file MUST contain all the R code you have written in the worksheet.

`Task 0.0:` The data story of Motor Trend Car Road Tests

This is a sample task demonstrating how to answer assignment questions using this R Markdown worksheet. Please read this data story carefully. You do NOT need to answer the question in this task. Your tasks start at Question 1 below.

In this assignment, we will use the mtcars data, which is a built-in data set in R. This dataset was extracted from the 1974 Motor Trend US magazine, and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973–74 models). We will focus on the following four variables for this assignment:

the variable disp contains the measured engine displacement (in cubic inches).
the variable hp records the gross engine power (in horsepower).
the variable qsec records the time for completing 1/4 mile (in seconds).
the variable am records the type of transmission (0 = automatic, 1 = manual).

The variables disp, hp and qsec are numerical. The variable am is categorical (although given in the form of integers). Note that the variable names and the dataframe name are case sensitive.
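As a minimal sketch of the point about am, the following checks how R stores the four variables and shows one way to relabel am as a factor. The names `cars` and `am_factor` are illustrative and are not part of the worksheet.

```r
# Work on a copy so the built-in mtcars data frame is left untouched
cars <- datasets::mtcars

# All four variables are stored as numeric, including the 0/1 coded am
sapply(cars[, c("disp", "hp", "qsec", "am")], class)

# Treat am as categorical by converting it to a labelled factor
am_factor <- factor(cars$am, levels = c(0, 1), labels = c("Automatic", "Manual"))

# Count how many cars fall into each transmission type
table(am_factor)
```

Working on a copy is a design choice here: it avoids changing mtcars itself, so later code that expects the original 0/1 coding still behaves as expected.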

Write R code in the following code block to display the dimension of the data, variable names, and the first several rows of the data set. How many variables are in this data set? What is the sample size? Write your comment after the Answer to Task 0.0 prompt provided below.

### Write your code here. The code is completed in Task 0.0 for demonstration.
dim(mtcars) # the dimension of the mtcars data set

## [1] 32 11

names(mtcars) # display variable names

##  [1] "mpg"  "cyl"  "disp" "hp"   "drat" "wt"   "qsec" "vs"   "am"   "gear"
## [11] "carb"

head(mtcars) # display the first several rows

##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

Answer to Task 0.0: (Write your answer here.) There are 11 variables. The sample size is 32.

====START OF ASSIGNMENT QUESTIONS====

1 Histogram and skewness

`Task 1.1`

The overall aim of this task is to use the sample mean and the sample median to determine the shape of a data distribution.

In the following code block, create an appropriate histogram for the variable hp on the density scale. Here you can use the default number of class intervals. Calculate the sample mean and the sample median of the variable hp, and then use the function abline to indicate the locations of the sample median and the sample mean on the histogram.

Based on your findings, comment on the skewness of the variable hp and justify your answer. Write your answer after the Answer to Task 1.1 prompt provided below.

### Code for Task 1.1. Write your code here
###
HorsePower=mtcars$hp
mean(HorsePower)

## [1] 146.6875

median(HorsePower)

## [1] 123

# freq=FALSE puts the histogram on the density scale (total area 1)
hist(HorsePower, freq=FALSE, right=FALSE, xlab="Gross Engine Power (in Horsepower)", ylab="Density", main="Histogram for Horsepower of Automobiles in US (for 1973-74 models)")
abline(v=mean(HorsePower), col="green")   # sample mean
abline(v=median(HorsePower), col="blue")  # sample median

Answer to Task 1.1: (Write your answer here.) The distribution of hp is positively (right) skewed. In a positively skewed distribution the long right tail pulls the mean above the median. For the variable hp, the sample median of 123 is less than the sample mean of 146.6875, so the data are positively skewed.
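One way to confirm that a histogram is on the density scale is to check that the bar areas sum to 1. The sketch below does this for hp without drawing the plot; the names `hp`, `h` and `area` are illustrative.

```r
hp <- datasets::mtcars$hp

# plot = FALSE returns the histogram object without drawing it
h <- hist(hp, plot = FALSE)

# On the density scale, bar height times bar width sums to 1
area <- sum(h$density * diff(h$breaks))
area

# A mean above the median is consistent with a longer right tail
mean(hp) > median(hp)
```

The same `h$density` values are what `hist(hp, freq = FALSE)` draws, so this check applies to the plotted histogram as well.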

2 Boxplot, data selection, and outliers

`Task 2.1`

We want to understand the effect of am (transmission type) on qsec (time for 1/4 mile; the shorter the time, the faster the car) using a comparative boxplot. Here we consider only cars with displacement (disp) of more than 130 cubic inches. In the following code block, first select the data points in the am and qsec variables according to disp (>130). Then, make a comparative boxplot for the selected data points from qsec, split by the corresponding am.

Based on the reported centers of the comparative boxplot, comment on which transmission type is faster (1 for manual and 0 for automatic) in general and justify your answer. Write your answer after the Answer to Task 2.1 prompt provided below.

### Code for Task 2.1. Write your code here
###
mtcars$am=as.factor(mtcars$am)
mtcars1=mtcars[mtcars$disp>130,]

Transmission=mtcars1$am
QMiletime=mtcars1$qsec

levels(Transmission)<-c("Automatic","Manual")

boxplot(QMiletime~Transmission, horizontal=T, xlab="Time (in seconds)", ylab="Type of Transmission", main="Time taken to complete 1/4 mile by Type of Transmission")

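Answer to Task 2.1: (Write your answer here.) The Manual transmission type is faster than Automatic in general. Among cars with disp > 130, the median 1/4 mile time for manual cars (about 15.5 seconds) is lower than the median for automatic cars (about 17.7 seconds), and a shorter time means a faster car.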
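The centers shown in a comparative boxplot are the group medians, so they can be checked numerically. The following is a small sketch under that assumption; `cars`, `big` and `group_medians` are illustrative names.

```r
# Work on a copy of the built-in data
cars <- datasets::mtcars

# Keep only cars with displacement above 130 cubic inches
big <- cars[cars$disp > 130, ]

# Median 1/4 mile time for each transmission type (0 = automatic, 1 = manual)
group_medians <- tapply(big$qsec, big$am, median)
group_medians

# Group sizes, useful context when comparing the two boxes
table(big$am)
```

The group sizes matter for interpretation: a median based on only a few cars is less stable than one based on many.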

`Task 2.2`

The rest of this question is to check your understanding of the boxplot, numerical summaries used for constructing a boxplot, and how to identify outliers. We will use all data entries in the variable qsec.

Calculate median of qsec and the quartiles used for identifying the middle 50% of data points.
Make a boxplot (preferably a horizontal one).
Use abline to indicate the location of the sample median and the interquartile range on the boxplot.

### Code for Task 2.2.  Write your code here
###
sort(mtcars$qsec)

##  [1] 14.50 14.60 15.41 15.50 15.84 16.46 16.70 16.87 16.90 17.02 17.02 17.05
## [13] 17.30 17.40 17.42 17.60 17.82 17.98 18.00 18.30 18.52 18.60 18.61 18.90
## [25] 18.90 19.44 19.47 19.90 20.00 20.01 20.22 22.90

length(mtcars$qsec)

## [1] 32

median(mtcars$qsec)

## [1] 17.71

summary(mtcars$qsec)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   14.50   16.89   17.71   17.85   18.90   22.90

IQR(mtcars$qsec)

## [1] 2.0075

boxplot(mtcars$qsec,horizontal=T, xlab="Time (in seconds)", main="Time taken to complete 1/4 Mile")
iqr=quantile(mtcars$qsec)[4]-quantile(mtcars$qsec)[2]
abline(v=median(mtcars$qsec), col="green")
abline(v=quantile(mtcars$qsec)[2],col="red")
abline(v=quantile(mtcars$qsec)[4],col="red")
abline(v=quantile(mtcars$qsec)[2]-1.5*iqr, col="purple")
abline(v=quantile(mtcars$qsec)[4]+1.5*iqr,col="purple")

Are there any outliers? Write your answer after the Answer to Task 2.2 prompt provided below.

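Answer to Task 2.2: (Write your answer here.) Yes, there is one outlier. With Q1 = 16.8925, Q3 = 18.90 and IQR = 2.0075, the lower threshold is Q1 - 1.5*IQR = 13.88 and the upper threshold is Q3 + 1.5*IQR = 21.91. The largest observation, 22.90 seconds, lies above the upper threshold, so it is an outlier. The smallest observation, 14.50 seconds, is above the lower threshold, so there are no outliers at the lower end.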
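The 1.5*IQR rule can also be applied directly in code rather than read off the plot. This sketch computes the two thresholds from quantile() and lists any points outside them; the names `lower`, `upper` and `outliers` are illustrative.

```r
qsec <- datasets::mtcars$qsec

# Quartiles as reported by quantile() with its default method
q <- quantile(qsec, c(0.25, 0.75))
iqr <- unname(q[2] - q[1])

# Thresholds of the 1.5*IQR rule
lower <- unname(q[1]) - 1.5 * iqr
upper <- unname(q[2]) + 1.5 * iqr

# Observations lying outside the thresholds
outliers <- qsec[qsec < lower | qsec > upper]
outliers

# boxplot.stats() applies the same rule using hinges, which can differ
# slightly from the quartiles returned by quantile()
boxplot.stats(qsec)$out
```

Comparing the hand-computed result with boxplot.stats() is a useful sanity check, since the points drawn individually on a boxplot come from the latter.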

3 Normal curve

`Task 3.1`

We consider all data entries in mtcars as a sample collected from all available cars on the market. First, we examine the variable qsec and aim to use the normal curve to estimate the proportion of cars on the market with 1/4 mile time exceeding 20 seconds.

In the following code block, calculate the sample mean and sample standard deviation. Construct a normal curve using these values, and subsequently determine the proportion of cars having 1/4 mile time exceeding 20 seconds. What percentage of cars have 1/4 mile time exceeding 20 seconds? Please also write your answer after the Answer to Task 3.1 prompt provided below, rounding your answer (in percentage) to two decimal places.

### Code for Task 3.1.  Write your code here
###
qsec=mtcars$qsec

mean(qsec)

## [1] 17.84875

sd(qsec)

## [1] 1.786943

m=mean(qsec)
s=sd(qsec)
curve(dnorm(x,m,s),xlim=c(10,25))

pnorm(20,m,s)

## [1] 0.8856804

1-pnorm(20,m,s)

## [1] 0.1143196

Answer to Task 3.1: (Write your answer here.) The sample mean is 17.85 seconds and the sample SD is 1.79 seconds. Based on the normal curve with these values, 11.43% of cars have a 1/4 mile time exceeding 20 seconds.
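The upper-tail area can be obtained in more than one equivalent way. The sketch below uses the lower.tail argument of pnorm and, separately, a z-score on the standard normal curve; `p_upper`, `z` and `p_z` are illustrative names. The last line gives the observed share of cars in the sample for comparison, which need not match the normal approximation.

```r
qsec <- datasets::mtcars$qsec
m <- mean(qsec)
s <- sd(qsec)

# Upper-tail area requested directly, equivalent to 1 - pnorm(20, m, s)
p_upper <- pnorm(20, m, s, lower.tail = FALSE)

# Same area after standardising 20 seconds to a z-score
z <- (20 - m) / s
p_z <- pnorm(z, lower.tail = FALSE)

# Percentage rounded to two decimal places
round(100 * p_upper, 2)

# Observed proportion of the sampled cars slower than 20 seconds
mean(qsec > 20)
```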

`Task 3.2`

In the following code block, calculate the 30-th percentile of the 1/4 mile time of cars based on the normal curve constructed above. Please also provide your answer after the Answer to Task 3.2 prompt provided below, rounding your answer to two decimal places.

### Code for Task 3.2.  Write your code here
###

qnorm(0.3,m,s)

## [1] 16.91168

Answer to Task 3.2: (Write your answer here.) Based on the normal curve, the 30th percentile of the 1/4 mile time is 16.91 seconds, meaning about 30% of cars are expected to have a time at or below 16.91 seconds.
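A percentile from qnorm can be checked by feeding it back into pnorm, which should return the original probability. The sketch below does this and also rebuilds the percentile from the standard normal quantile; `q30` and `q30_z` are illustrative names.

```r
qsec <- datasets::mtcars$qsec
m <- mean(qsec)
s <- sd(qsec)

# 30th percentile of the fitted normal curve
q30 <- qnorm(0.3, m, s)

# Same value built from the standard normal quantile: mean + z * SD
q30_z <- m + s * qnorm(0.3)

# Round trip: the area below q30 should be 0.3
pnorm(q30, m, s)

round(q30, 2)
```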

`Task 3.3`

In our lectures, we learned about the distinction between the population standard deviation (SD) and the sample SD. Additionally, we learned that variance = SD\(^2\). R has built-in functions sd() and var() for computing the sample SD and the sample variance. Here we want to write our own R function to compute the population variance and apply it to the mtcars data set.

In the following, we provide the function definition for my_pop_var(X), where X is the input data. Complete this function so it can compute the population variance for the input data X.

### Code for Task 3.3. 
###
my_pop_var = function(X){
  ###  Write your code below for population variance
  m=sum(X)/length(X)          # mean of the data
  v=sum((X-m)^2)/length(X)    # average squared deviation, dividing by n rather than n-1
  return(v)
}
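A function like this can be sanity-checked on a small vector whose answer is easy to work out by hand. The sketch below repeats the definition so that it runs on its own and uses the made-up data 1, 2, 3, 4 (mean 2.5, squared deviations summing to 5), which is an assumption for illustration only.

```r
# Population variance: average squared deviation from the mean
my_pop_var <- function(X) {
  m <- sum(X) / length(X)
  sum((X - m)^2) / length(X)
}

# Small hand-checkable example (hypothetical data)
x <- c(1, 2, 3, 4)
n <- length(x)

# Squared deviations sum to 5, so the population variance is 5 / 4
my_pop_var(x)

# var() divides by n - 1, so rescaling by (n - 1) / n should agree
var(x) * (n - 1) / n
```

The factor (n - 1) / n is the only difference between the two definitions, so the gap between them shrinks as the sample size grows.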

`Task 3.4`

Apply your function written above to compute the population variance of the variable qsec in mtcars.

### Code for Task 3.4.  Write your code here
###
my_pop_var(mtcars$qsec)

## [1] 3.09338

# Cross-check: rescale the sample SD from sd() by sqrt((n-1)/n), then square it
n=length(qsec)
(sd(qsec)*sqrt((n-1)/n))^2

## [1] 3.09338

====END OF THE WORKSHEET====

MATH1062 (Part B) / MATH1005 Assignment 1 Worksheet

University of Sydney MATH1062 (Statistics) / MATH1005

07 March 2025
