Environment Setting

Package loading

This report uses the “mtcars” dataset to analyze car performance. We first extract the “mtcars” data and set up the environment. The analysis relies on three packages: “psych”, “tidyverse”, and “sm”.

# Generate practical dataset 
# data <- mtcars
# write.csv(data, 'data/data.csv')
library(psych)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6      ✔ purrr   0.3.4 
## ✔ tibble  3.1.8      ✔ dplyr   1.0.10
## ✔ tidyr   1.2.1      ✔ stringr 1.4.1 
## ✔ readr   2.1.2      ✔ forcats 0.5.2 
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ ggplot2::%+%()   masks psych::%+%()
## ✖ ggplot2::alpha() masks psych::alpha()
## ✖ dplyr::filter()  masks stats::filter()
## ✖ dplyr::lag()     masks stats::lag()
library(sm)
## Package 'sm', version 2.2-5.7: type help(sm) for summary information

Loading data

As mentioned before, this report uses the “mtcars” dataset, so next we read the data from the file. The “mtcars” dataset describes 11 aspects of automobile design and performance for 32 automobiles, that is, 32 observations of 11 variables. This report mainly analyzes the automobiles’ speed (quarter-mile time) and the factors that influence it.

data <- read.csv('../data/data.csv')
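The summary output later in the report lists a twelfth column, “X”, alongside the 11 variables. The sketch below is one way to see where it comes from, assuming the CSV was written with the default `write.csv` settings shown in the commented-out lines above. The temporary file and the names `tmp`, `data_check`, and `data_named` are illustrative and not part of the original analysis.

```r
# Round-trip the built-in mtcars data through a temporary CSV file.
tmp <- tempfile(fileext = ".csv")
write.csv(mtcars, tmp)            # row names (car models) are written as the first column

data_check <- read.csv(tmp)       # the unnamed first column is read back as "X"
dim(data_check)                   # 32 rows, 12 columns
names(data_check)[1]

# Optional: restore the car models as row names so only the 11 variables remain.
data_named <- read.csv(tmp, row.names = 1)
dim(data_named)                   # 32 rows, 11 columns
```

Either form works for the analysis below, since none of the later steps use the “X” column.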

Analysis

Descriptive Statistics

For the starting point, this report uses descriptive statistics to analyze the overall situation of the dataset “mtcars”, helping to learn about the general performance of automobiles in the sample from the perspectives of central tendency and variability of the 11 car variables.

summary(data)
##       X                  mpg             cyl             disp      
##  Length:32          Min.   :10.40   Min.   :4.000   Min.   : 71.1  
##  Class :character   1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8  
##  Mode  :character   Median :19.20   Median :6.000   Median :196.3  
##                     Mean   :20.09   Mean   :6.188   Mean   :230.7  
##                     3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0  
##                     Max.   :33.90   Max.   :8.000   Max.   :472.0  
##        hp             drat             wt             qsec      
##  Min.   : 52.0   Min.   :2.760   Min.   :1.513   Min.   :14.50  
##  1st Qu.: 96.5   1st Qu.:3.080   1st Qu.:2.581   1st Qu.:16.89  
##  Median :123.0   Median :3.695   Median :3.325   Median :17.71  
##  Mean   :146.7   Mean   :3.597   Mean   :3.217   Mean   :17.85  
##  3rd Qu.:180.0   3rd Qu.:3.920   3rd Qu.:3.610   3rd Qu.:18.90  
##  Max.   :335.0   Max.   :4.930   Max.   :5.424   Max.   :22.90  
##        vs               am              gear            carb      
##  Min.   :0.0000   Min.   :0.0000   Min.   :3.000   Min.   :1.000  
##  1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:3.000   1st Qu.:2.000  
##  Median :0.0000   Median :0.0000   Median :4.000   Median :2.000  
##  Mean   :0.4375   Mean   :0.4062   Mean   :3.688   Mean   :2.812  
##  3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:4.000   3rd Qu.:4.000  
##  Max.   :1.0000   Max.   :1.0000   Max.   :5.000   Max.   :8.000

We first summarize the data information of the 11 automobile indicators to show a big picture of a car’s performance. The output displays the “mean, minimum, 1st quartile, median, 3rd quartile, maximum” of 32 automobiles’ “Miles per gallon (mpg), Number of cylinders (cyl), Displacement (disp), Gross horsepower (hp), Rear axle ratio (drat), Weight (wt), 1/4 mile time (qsec), Engine (vs), Transmission (am), Number of forward gears (gear), and Number of carburetors (carb).”

describe(data)
##      vars  n   mean     sd median trimmed    mad   min    max  range  skew
## X*      1 32  16.50   9.38  16.50   16.50  11.86  1.00  32.00  31.00  0.00
## mpg     2 32  20.09   6.03  19.20   19.70   5.41 10.40  33.90  23.50  0.61
## cyl     3 32   6.19   1.79   6.00    6.23   2.97  4.00   8.00   4.00 -0.17
## disp    4 32 230.72 123.94 196.30  222.52 140.48 71.10 472.00 400.90  0.38
## hp      5 32 146.69  68.56 123.00  141.19  77.10 52.00 335.00 283.00  0.73
## drat    6 32   3.60   0.53   3.70    3.58   0.70  2.76   4.93   2.17  0.27
## wt      7 32   3.22   0.98   3.33    3.15   0.77  1.51   5.42   3.91  0.42
## qsec    8 32  17.85   1.79  17.71   17.83   1.42 14.50  22.90   8.40  0.37
## vs      9 32   0.44   0.50   0.00    0.42   0.00  0.00   1.00   1.00  0.24
## am     10 32   0.41   0.50   0.00    0.38   0.00  0.00   1.00   1.00  0.36
## gear   11 32   3.69   0.74   4.00    3.62   1.48  3.00   5.00   2.00  0.53
## carb   12 32   2.81   1.62   2.00    2.65   1.48  1.00   8.00   7.00  1.05
##      kurtosis    se
## X*      -1.31  1.66
## mpg     -0.37  1.07
## cyl     -1.76  0.32
## disp    -1.21 21.91
## hp      -0.14 12.12
## drat    -0.71  0.09
## wt      -0.02  0.17
## qsec     0.34  0.32
## vs      -2.00  0.09
## am      -1.92  0.09
## gear    -1.07  0.13
## carb     1.26  0.29

Then, we use the “describe” function to add further descriptive measures for the 11 car indicators: standard deviation, trimmed mean, median absolute deviation, range, skewness, kurtosis, and standard error of the mean.
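To make a few of these measures concrete, the following sketch recomputes some of them for “qsec” with base R. It uses the built-in `mtcars` object rather than the CSV file, and it assumes a 10% trim for the trimmed mean, which is our understanding of the “describe” default.

```r
x <- mtcars$qsec
n <- length(x)

sd_qsec      <- sd(x)               # standard deviation
trimmed_qsec <- mean(x, trim = 0.1) # mean after dropping 10% of values from each tail (assumed default)
mad_qsec     <- mad(x)              # median absolute deviation, scaled by 1.4826
range_qsec   <- max(x) - min(x)     # range
se_qsec      <- sd_qsec / sqrt(n)   # standard error of the mean

round(c(sd = sd_qsec, trimmed = trimmed_qsec, mad = mad_qsec,
        range = range_qsec, se = se_qsec), 2)
```

The standard error is simply the standard deviation divided by the square root of the sample size, which is why it is so much smaller than the standard deviation in the table above.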

Data Manipulation

After reviewing the descriptive results and getting a general picture of automobile performance, this report focuses on the relationship between a car’s speed (1/4 mile time) and its transmission type.

table(data$am)
## 
##  0  1 
## 19 13
data_a <- filter(data, am == 0)
data_b <- filter(data, am != 0)

In the dataset, there are 19 automobiles with automatic transmissions, represented by “0” and 13 automobiles with manual transmissions, represented by “1”. Since we want to analyze the performance of cars with different transmission types, we manipulate the data by dividing the data sample into two categories according to the two transmission types. “data_a” represents automatic transmission cars and “data_b” represents manual transmission cars.

mean(data_a$qsec)
## [1] 18.18316
mean(data_b$qsec)
## [1] 17.36

The average quarter-mile time of automatic transmission cars is 18.18316 seconds, and the average for manual transmission cars is 17.36 seconds. On average, then, cars with manual transmissions complete the quarter mile in less time than cars with automatic transmissions. However, the gap is only about 0.82 seconds, so the difference between the two types of cars is not large.
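The same group means can be obtained in one step without splitting the data first. This is a minimal sketch using the built-in `mtcars` object; the names `group_means` and `mean_gap` are illustrative.

```r
# Mean quarter-mile time per transmission type (0 = automatic, 1 = manual).
group_means <- tapply(mtcars$qsec, mtcars$am, mean)
group_means

# Difference between the automatic and manual group means, in seconds.
mean_gap <- unname(group_means["0"] - group_means["1"])
mean_gap
```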

Visualization

After calculating the speed difference in the cars with two types of transmissions, we then utilize different types of graphs to illustrate the relationship between speed and car transmission type clearly.

Histogram

In order to find out the relationship between quarter-mile time and cars’ transmission type, we first draw the histogram of cars’ 1/4 mile time.

hist(data$qsec,main="Histogram of 1/4 Mile Time",xlab="1/4 mile time",ylab="Frequency")

This histogram shows the quarter-mile times of the 32 automobiles in the sample. The bin between 17 and 18 seconds contains the most cars, and the counts fall off on both sides of this range. Quarter-mile times between 17 and 18 seconds are therefore the most common.
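The bin counts behind this reading can be tabulated directly. The sketch below assumes one-second bins from 14 to 23 seconds, closed on the right; the bins that `hist` picks by default may differ.

```r
# Count cars per one-second interval of quarter-mile time.
bin_counts <- table(cut(mtcars$qsec, breaks = 14:23))
bin_counts

# The interval holding the most cars.
busiest_bin <- names(which.max(bin_counts))
busiest_bin
```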

Kernel Density

plot(density(data$qsec),main="Kernel Density of 1/4 Mile Time") 

The kernel density estimate gives a smoother picture of the distribution of quarter-mile times than the histogram, and it leads to a similar conclusion: most automobiles in the sample have a 1/4 mile time of around 17 to 18 seconds.

sm.density.compare(data$qsec, data$am, xlab="1/4 mile time",model='equal')
## Test of equal densities:  p-value =  0.58
title(main="1/4 Mile Time Distribution by Car Transmission")
am.f <- factor(data$am,levels=c(0,1),labels=c("Automatic transmission","Manual transmission"))
colfill <- c(2:(2+length(levels(am.f))))
legend('topright',levels(am.f),fill=colfill,cex=1)

The “sm.density.compare” function compares the quarter-mile time distributions of automobiles with the two types of transmissions. The graph shows that the two distributions differ in shape. The density for automatic transmission cars peaks at around 17.5 seconds, while the density for manual transmission cars is flatter, spread fairly evenly between about 16 and 20 seconds. Even so, the test of equal densities reports a p-value of 0.58, so the sample gives no evidence that the two distributions actually differ.

Boxplot

boxplot(qsec ~ am, data = data, xlab="Transmission type",ylab="1/4 mile time")  

data %>%
  ggplot( 
    aes(x=am, y=qsec, group = am)
  ) +
  geom_boxplot() +
  geom_jitter(color="black", size=0.4, alpha=0.9)

We also draw boxplots to show the difference in quarter-mile time between cars with the two types of transmissions, and they agree with the kernel density plot. The automatic transmission group has a higher maximum, minimum, and median 1/4 mile time than the manual transmission group. Note that the automatic transmission group contains one outlier with an extremely high quarter-mile time, which should not be over-interpreted. In general, cars with automatic transmissions tend to have higher 1/4 mile times than cars with manual transmissions, but the difference is not large.
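The outlier can be identified numerically with the same rule the boxplot uses (points more than 1.5 times the interquartile range beyond the hinges). This is a small sketch on the built-in `mtcars` object; the names `auto_qsec` and `auto_stats` are illustrative.

```r
# Quarter-mile times of the automatic transmission cars (am == 0).
auto_qsec <- mtcars$qsec[mtcars$am == 0]

# Five-number summary and outliers as drawn by boxplot().
auto_stats <- boxplot.stats(auto_qsec)
auto_stats$stats   # lower whisker, lower hinge, median, upper hinge, upper whisker
auto_stats$out     # values flagged as outliers

# Which car the outlier belongs to.
rownames(mtcars)[mtcars$am == 0 & mtcars$qsec %in% auto_stats$out]
```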

The kernel density plot and the boxplots both show a difference in 1/4 mile time between cars with the two types of transmissions. However, judging by the group means and the graphs, the difference is small.

Modeling

Independent 2-group T-test

Since our sample is small, we use a t-test to check whether the difference in car speed (1/4 mile time) between the 2 types of transmissions found in the sample is statistically significant.

m1 <- t.test(qsec ~ am, data=data)
m1 # p-value (0.2093) is larger than 0.05, so we cannot reject the null hypothesis. The 95% confidence interval for the mean difference is about -0.49 to 2.14.
## 
##  Welch Two Sample t-test
## 
## data:  qsec by am
## t = 1.2878, df = 25.534, p-value = 0.2093
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
##  -0.4918522  2.1381679
## sample estimates:
## mean in group 0 mean in group 1 
##        18.18316        17.36000

The null hypothesis of the t-test is “The difference in means between the two groups is equal to 0”, and the alternative hypothesis is “The difference in means between the two groups is not equal to 0.” The test gives a p-value of 0.2093, which is larger than 0.05, so we cannot reject the null hypothesis. The 95 percent confidence interval for the difference, from -0.4918522 to 2.1381679, includes 0, which leads to the same conclusion. The t-test therefore checks our earlier finding that automobiles with automatic transmissions and those with manual transmissions have different quarter-mile times, and it shows that the observed difference is not statistically significant.
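For readers who want to see where the reported numbers come from, the following sketch recomputes the Welch statistic, its degrees of freedom (Welch-Satterthwaite approximation), and the two-sided p-value by hand. It uses the built-in `mtcars` object, and all variable names are illustrative.

```r
auto   <- mtcars$qsec[mtcars$am == 0]
manual <- mtcars$qsec[mtcars$am == 1]

# Squared standard error of each group mean.
v_auto   <- var(auto)   / length(auto)
v_manual <- var(manual) / length(manual)

# Welch t statistic: difference in means over its standard error.
t_stat <- (mean(auto) - mean(manual)) / sqrt(v_auto + v_manual)

# Welch-Satterthwaite approximation to the degrees of freedom.
df_welch <- (v_auto + v_manual)^2 /
  (v_auto^2 / (length(auto) - 1) + v_manual^2 / (length(manual) - 1))

# Two-sided p-value from the t distribution.
p_value <- 2 * pt(-abs(t_stat), df_welch)

c(t = t_stat, df = df_welch, p = p_value)
```

The non-integer degrees of freedom in the output above are a consequence of this approximation, which does not assume equal variances in the two groups.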

Regression Modeling

Finally, we use the “lm” function to fit a linear regression model of 1/4 mile time on the automobiles’ transmission type to describe their relationship.

m2 <- lm(qsec ~ am, data=data)
summary(m2)
## 
## Call:
## lm(formula = qsec ~ am, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.8600 -0.9583 -0.3516  1.2517  4.7168 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  18.1832     0.4056  44.833   <2e-16 ***
## am           -0.8232     0.6363  -1.294    0.206    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.768 on 30 degrees of freedom
## Multiple R-squared:  0.05284,    Adjusted R-squared:  0.02126 
## F-statistic: 1.674 on 1 and 30 DF,  p-value: 0.2057

The summary of the regression fit shows that the coefficient estimate of am (transmission type) is -0.8232. Because am only takes the values 0 and 1, this means that manual transmission cars (am=1) have, on average, a 1/4 mile time 0.8232 seconds lower than automatic transmission cars (am=0). The estimated regression equation can be written as “1/4 mile time = 18.1832 - 0.8232*am (am = 0, 1)”. The p-value is 0.2057, which is larger than 0.05, so the regression model as a whole is not statistically significant, and the R-squared of 0.05284 shows that transmission type explains only about 5% of the variation in 1/4 mile time. The linear regression model therefore indicates that automatic transmission automobiles have higher 1/4 mile times than manual transmission cars, but the difference is not significant, which is consistent with our previous findings.
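With a single 0/1 predictor, the regression is a comparison of two group means in another form. The sketch below, using the built-in `mtcars` object and illustrative variable names, checks that the slope equals the difference in group means and that the regression p-value matches a t-test that assumes equal variances. That assumption is why it differs slightly from the Welch p-value of 0.2093 reported earlier.

```r
fit <- lm(qsec ~ am, data = mtcars)

# Slope of the regression versus the raw difference in group means.
group_means <- tapply(mtcars$qsec, mtcars$am, mean)
slope       <- unname(coef(fit)["am"])
mean_diff   <- unname(group_means["1"] - group_means["0"])
c(slope = slope, mean_diff = mean_diff)

# p-value of the slope versus a pooled-variance (equal variances) t-test.
lm_p   <- summary(fit)$coefficients["am", "Pr(>|t|)"]
pooled <- t.test(qsec ~ am, data = mtcars, var.equal = TRUE)
c(lm = lm_p, pooled_t_test = pooled$p.value)
```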

In conclusion, this report uses the “mtcars” dataset to analyze automobile performance, focusing on the relationship between car speed (1/4 mile time) and transmission type. Based on the graphs of 1/4 mile time and the models, we conclude that automatic transmission cars in the sample have higher 1/4 mile times than manual transmission cars, which suggests that manual transmission cars are faster, but the difference is not statistically significant.