Cars.csv is used for this exercise. The data set contains the following attributes of cars from the year 2004:
Make – the auto manufacturer
Model – name of the vehicle
Type – SUV, sedan, sports, truck, or wagon
Origin – continent of the manufacturer; Europe, Asia, or USA
Invoice – price (dollars) that the manufacturer sends to the dealer upon delivery of the car
Horsepower – the car’s engine power (horsepower)
MPG_City – miles per gallon (fuel efficiency) during city driving
MPG_Highway – miles per gallon during highway driving
Wheelbase – distance (inches) between the centers of the front and rear wheels
Length – distance (inches) from the nose to the tail of the car
# Read the CSV file
file.cars <- read.csv("CARS.csv")
# Create the combined MPG variable (60% city, 40% highway)
MPG_Combo <- 0.6 * file.cars$MPG_City + 0.4 * file.cars$MPG_Highway
# Append MPG_Combo as the last column of the table
file.cars.mpg_combo <- cbind(file.cars, MPG_Combo)
# Draw a box plot of MPG_Combo
boxplot(file.cars.mpg_combo$MPG_Combo,
        main = "Combined MPG (60% City, 40% Highway)",
        xlab = "Combo", ylab = "MPG",
        col = "aquamarine3", border = "aquamarine4")
# Mark the mean with a blue asterisk (the single box is drawn at x = 1)
points(1, mean(file.cars.mpg_combo$MPG_Combo, na.rm = TRUE), col = "blue", pch = 8)
According to the above box plot of the combined MPG:
- The mean and median are close, so the distribution is roughly symmetric and might be normal.
- The middle 50% of cars (Q1 to Q3) get about 20 to 25 MPG.
- There are some outliers beyond both whiskers, above the upper one and below the lower one.
- To check normality, we should use a Q-Q plot or a quantitative test.
- Excluding outliers, the minimum is around 14 MPG and the maximum is around 32 MPG.
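The quartiles, whisker ends, and outliers read off a box plot can also be obtained numerically. The following is a minimal sketch, assuming a made-up vector `mpg_combo_demo` in place of the real `MPG_Combo` column; `boxplot.stats()` is the base R function that computes the statistics a box plot displays.

```r
# Hypothetical values standing in for file.cars.mpg_combo$MPG_Combo
mpg_combo_demo <- c(12, 18, 20, 21, 22, 23, 24, 25, 26, 40)

# $stats holds: lower whisker, lower hinge (about Q1), median,
# upper hinge (about Q3), upper whisker; $out holds the flagged outliers
bp <- boxplot.stats(mpg_combo_demo)
bp$stats
bp$out

# Compare mean and median, which the asterisk on the plot does visually
c(mean = mean(mpg_combo_demo), median = median(mpg_combo_demo))
```

Running the same calls on the real column would give exact values for the figures that the bullets above only estimate from the plot.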
# Draw box plots of MPG_Combo by Type
boxplot(MPG_Combo ~ Type, data = file.cars.mpg_combo,
        xlab = "Car Type", ylab = "MPG", main = "MPG_Combo by Type of Cars",
        border = c("aquamarine4", "cyan4", "coral4", "darkgoldenrod4", "darkorchid4"),
        col = c("aquamarine3", "cyan3", "coral3", "darkgoldenrod3", "darkorchid3"),
        horizontal = FALSE)
Basic descriptive statistics for Horsepower:
# Calculate descriptive statistics
a <- psych::describe(file.cars$Horsepower)
knitr::kable(a, format = "markdown", caption = "Statistical Values", align = "c")
| | vars | n | mean | sd | median | trimmed | mad | min | max | range | skew | kurtosis | se |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| X1 | 1 | 425 | 216.76 | 71.31586 | 210 | 211.5367 | 66.717 | 100 | 500 | 400 | 0.9528091 | 1.546674 | 3.459327 |
According to the above table:
- The mean (216.76) is greater than the median (210) and the skewness (0.95) is positive, so we expect a long right tail.
- To confirm this, we inspect the distribution visually.
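The two checks above (mean versus median, sign of the skewness) can be illustrated on a small sample. This is a sketch only: `hp_demo` is made-up data, and `skewness_demo()` is a hypothetical helper that uses one common moment-based definition of skewness, so its value may differ slightly from the `skew` column of `psych::describe()`, which can apply a small-sample adjustment.

```r
# Hypothetical right-skewed sample (not the real Horsepower column)
hp_demo <- c(100, 150, 170, 190, 200, 210, 230, 260, 340, 500)

# Moment-based skewness: third central moment divided by
# the second central moment raised to the power 3/2
skewness_demo <- function(x) {
  x <- x[!is.na(x)]
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

mean(hp_demo) > median(hp_demo)  # mean is pulled toward the long right tail
skewness_demo(hp_demo) > 0       # a positive value indicates right skew
```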
# Arrange three plots in one row (1 x 3 layout)
par(mfrow = c(1, 3))
# Histogram
hist(file.cars$Horsepower, col = "cyan3", xlab = "Horsepower", ylab = "Frequency", main = "Histogram of Horsepower")
# Horizontal box plot
boxplot(file.cars$Horsepower, main = "Box Plot of Horsepower", xlab = "Horsepower", col = "aquamarine3", border = "aquamarine4", horizontal = TRUE)
# Mark the mean; in a horizontal box plot the value is on the x-axis and the box sits at y = 1
points(mean(file.cars$Horsepower, na.rm = TRUE), 1, col = "blue", pch = 8)
# Q-Q plot with a reference line
qqnorm.var <- qqnorm(file.cars$Horsepower)
qqline(file.cars$Horsepower, col = "red")
The histogram shows right skewness with a long right tail.
The box plot shows more outliers on the right side.
In the Q-Q plot, most points lie near the red line, but some deviate from it. To check normality formally, we use the Shapiro-Wilk test below.
# calculate Shapiro value
shapiro.results=shapiro.test(file.cars$Horsepower)
print(shapiro.results[2])
## $p.value
## [1] 2.320249e-11
The p-value is very small and far below the significance level (0.05), so we reject the null hypothesis of normality:
the Horsepower data do not follow a normal distribution.
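The decision rule used here can be written out explicitly. The sketch below assumes a significance level of 0.05 and uses a simulated, clearly non-normal sample (`x_demo`, drawn with `rexp()`) rather than the real Horsepower column.

```r
set.seed(1)
# Hypothetical skewed sample drawn from an exponential distribution
x_demo <- rexp(200)

alpha <- 0.05                 # assumed significance level
res <- shapiro.test(x_demo)

# H0: the sample comes from a normal distribution
if (res$p.value < alpha) {
  decision <- "reject H0: the data are not consistent with normality"
} else {
  decision <- "fail to reject H0: no evidence against normality"
}
decision
```

Failing to reject H0 would not prove normality; it would only mean the test found no evidence against it.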
# Load magrittr for the %>% pipe
library(magrittr)
# Arrange three plots in one row (1 x 3 layout)
par(mfrow = c(1, 3))
# Horsepower of Sports cars
hp.sports <- file.cars[file.cars$Type == "Sports", ]$Horsepower
hp.sports %>%
hist(col = "cadetblue3", xlab = "Horsepower", ylab = "Frequency", main = "Histogram-Sport")
hp.sports %>%
boxplot(main = "Box plot-Sport", xlab = "Horsepower", col = "aquamarine3", border = "aquamarine4", horizontal = TRUE)
# Mark the Sports mean (x = value, y = 1 for a horizontal box plot)
points(mean(hp.sports, na.rm = TRUE), 1, col = "blue", pch = 8)
hp.sports %>%
qqnorm(main = "QQ Plot-Sport")
hp.sports %>%
qqline(col = "red")
From the visual methods:
- The histogram and box plot show slight right skewness with a long right tail.
- The Q-Q plot shows that most points lie close to the red line, but many points are away from it.
Therefore, the distribution may or may not be normal. To decide, we use a quantitative test (the Shapiro-Wilk test).
# calculate Shapiro value
file.cars[file.cars$Type == "Sports","Horsepower"] %>%
shapiro.test() %>%
print()
##
## Shapiro-Wilk normality test
##
## data: .
## W = 0.94276, p-value = 0.01898
p-value (0.01898) < significance level (0.05)
The p-value is smaller than the significance level, hence
we have enough evidence to reject H0 (the null hypothesis of normality). We conclude that the Sports Horsepower data do not follow a normal distribution.
# Arrange three plots in one row (1 x 3 layout)
par(mfrow = c(1, 3))
# Horsepower of SUV cars
hp.suv <- file.cars[file.cars$Type == "SUV", ]$Horsepower
hp.suv %>%
hist(col = "cadetblue3", xlab = "Horsepower", ylab = "Frequency", main = "Histogram-SUV")
hp.suv %>%
boxplot(main = "Box plot-SUV", xlab = "Horsepower", col = "aquamarine3", border = "aquamarine4", horizontal = TRUE)
# Mark the SUV mean (x = value, y = 1 for a horizontal box plot)
points(mean(hp.suv, na.rm = TRUE), 1, col = "blue", pch = 8)
hp.suv %>%
qqnorm(main = "QQ Plot-SUV")
hp.suv %>%
qqline(col = "red")
From the visual methods:
- The histogram and box plot show slight left skewness with a long left tail.
- The Q-Q plot shows that most points lie close to the red line, but many points are farther from it.
Therefore, we cannot be sure about normality, so we use a quantitative test (the Shapiro-Wilk test).
# calculate Shapiro value
file.cars[file.cars$Type=="SUV","Horsepower"] %>%
shapiro.test() %>%
print()
##
## Shapiro-Wilk normality test
##
## data: .
## W = 0.95945, p-value = 0.04423
p-value (0.04423) < significance level (0.05)
The p-value is smaller than the significance level, although only slightly, hence
we have enough evidence to reject H0 (the null hypothesis of normality). We conclude that the SUV Horsepower data do not follow a normal distribution.
# Arrange three plots in one row (1 x 3 layout)
par(mfrow = c(1, 3))
# Horsepower of Truck cars
hp.truck <- file.cars[file.cars$Type == "Truck", ]$Horsepower
hp.truck %>%
hist(col = "cadetblue3", xlab = "Horsepower", ylab = "Frequency", main = "Histogram-Truck")
hp.truck %>%
boxplot(main = "Box plot-Truck", xlab = "Horsepower", col = "aquamarine3", border = "aquamarine4", horizontal = TRUE)
# Mark the Truck mean (x = value, y = 1 for a horizontal box plot)
points(mean(hp.truck, na.rm = TRUE), 1, col = "blue", pch = 8)
hp.truck %>%
qqnorm(main = "QQ Plot-Truck")
hp.truck %>%
qqline(col = "red")
From the visual methods:
- The histogram and box plot suggest that the data are not normal.
- The Q-Q plot shows many points far from the red line.
The data do not look normally distributed, but we use a quantitative test (the Shapiro-Wilk test) to confirm.
# calculate Shapiro value
file.cars[file.cars$Type=="Truck","Horsepower"] %>%
shapiro.test() %>%
print()
##
## Shapiro-Wilk normality test
##
## data: .
## W = 0.8951, p-value = 0.01697
p-value (0.01697) < significance level (0.05)
The p-value is smaller than the significance level, hence
we have enough evidence to reject H0 (the null hypothesis of normality). We conclude that the Truck Horsepower data do not follow a normal distribution.
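The per-type tests repeat the same steps, so they could also be run in one pass. The following is a minimal sketch on simulated data: `cars_demo` is a hypothetical data frame that only mimics the `Type` and `Horsepower` columns, and the 0.05 threshold is the significance level assumed throughout.

```r
set.seed(42)
# Hypothetical data frame mimicking the Type and Horsepower columns
cars_demo <- data.frame(
  Type = rep(c("Sports", "SUV", "Truck"), each = 30),
  Horsepower = c(rnorm(30, 300, 40), rnorm(30, 240, 35), rexp(30, 1 / 220))
)

# Split Horsepower by Type and run shapiro.test() on each group
p_values <- sapply(split(cars_demo$Horsepower, cars_demo$Type),
                   function(x) shapiro.test(x)$p.value)

# One row per type with the p-value and the decision at alpha = 0.05
data.frame(Type = names(p_values),
           p_value = unname(p_values),
           reject_normality = unname(p_values) < 0.05)
```

With the real data, `cars_demo` would be replaced by `file.cars`, and the same call would reproduce the three p-values reported above.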
Comment:
According to the box plot of MPG_Combo by Type:
- Wagons and sedans have the highest fuel efficiency.
- Trucks and SUVs are about the same and have the lowest efficiency.
- Sports cars are slightly more efficient than trucks and SUVs but less efficient than sedans and wagons.
- Sedans have more outliers on the high-efficiency side.
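The ranking read from the box plots can be backed by group medians. This sketch uses a few made-up rows (`mpg_demo`), not values from CARS.csv, and applies the same 60/40 weighting as `MPG_Combo`.

```r
# Hypothetical rows; the real values come from CARS.csv
mpg_demo <- data.frame(
  Type        = c("Sedan", "Sedan", "SUV", "SUV", "Truck", "Wagon"),
  MPG_City    = c(24, 20, 16, 18, 15, 22),
  MPG_Highway = c(32, 28, 21, 23, 20, 29)
)
# Same weighting as MPG_Combo: 60% city, 40% highway
mpg_demo$MPG_Combo <- 0.6 * mpg_demo$MPG_City + 0.4 * mpg_demo$MPG_Highway

# Median combined MPG per type, sorted from most to least efficient
med_by_type <- sort(tapply(mpg_demo$MPG_Combo, mpg_demo$Type, median),
                    decreasing = TRUE)
med_by_type
```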