Cars.csv will be used for Exercise. The variables in the data are included below in the table. The variables in the data set are the following attributes of cars in the year 2004:

 Make – the auto manufacturer
 Model – name of the vehicle
 Type – SUV, sedan, sports, truck, or wagon
 Origin – continent of the manufacturer; Europe, Asia, or USA
 Invoice – price (dollars) that the manufacturer sends to the dealer upon delivery of the car
 Horsepower – amount of the car’s power
 MPG_City – miles per gallon (fuel efficiency) during city driving
 MPG_Highway – miles per gallon during highway driving
 Wheelbase – distance (inches) between the centers of the front and rear wheels
 Length – distance (inches) from the nose to the tail of the car

Exercise: Descriptive Statistics


Question (a):

  • Create a combined mpg variable called MPG_Combo which combines 60% of the MPG_City and 40% of the MPG_Highway. Obtain a box plot for MPG_Combo and comment on what the plot tells us about fuel efficiencies.

Answer -:

# Read the CSV file
file.cars=read.csv("CARS.csv") 
# Make new MPG_Combo variable
MPG_Combo <- 0.6*file.cars$MPG_City +0.4*file.cars$MPG_Highway

#Add MPG_Combo variable to the end of tabale
file.cars.mpg_combo <- cbind(file.cars,MPG_Combo)



#Draw Box plot for the MPG_Combo variable and trim box plot
boxplot(file.cars.mpg_combo$MPG_Combo,main="Combined MPG (60% in City and 40% in Highway)",xlab="Combo",ylab="MPG",col = "aquamarine3",border ="aquamarine4")

#Point the mean value with a blue asterisk sign
points(mean(file.cars.mpg_combo$MPG_Combo,na.rm = TRUE),col="blue",pch=8)

Comment:

According to the above box plot, in combined mode:
- The “Mean” and “Median” values are almost close, then distribution might be normal.
- 50% of cars (Q1 to Q3) can travel 20 to 25 MPG.
- There are some outliers above the maximum and below the minimum
- To ensure about the normality, we must use QQ-Plot or perform a Quantitative test
- Minimum is around 14 MPG and maximum is around 32 MPG


Question (b):

  • Obtain box plots for MPG_Combo by Type and comment on any differences you notice between the different vehicle types combined fuel efficiency.

Answer -:

#Draw box plot for MPG_Combo by Type 
boxplot(file.cars.mpg_combo$MPG_Combo~file.cars.mpg_combo$Type,xlab = "Car Type",ylab = "MPG",main="MPG_Combo by Type of Cars",border=(c("aquamarine4","cyan4","coral4","darkgoldenrod4","darkorchid4")),horizontal = FALSE,col=(c("aquamarine3","cyan3","coral3","darkgoldenrod3","darkorchid3")))

Comment:

According to the above box plot:
- Wagon and Sedan travel maximum efficiency
- Truck and SUV are about same and have minimum efficiency
- Sports is little more efficiency than Truck and SUV but less than Sedan and Wagon
- Sedan has more outliers on maximum efficiency side


Question (c):

  • Obtain basic descriptive statistics for Horsepower for all vehicles. Comment on any general features and statistics of the data. Use visual and quantitative methods to comment on whether an assumption of Normality would be reasonable for Horsepower variable.

Answer -:

Basic descriptive statistics comprise:

# Calculate descriptive stattistics:
a=psych:: describe(file.cars$Horsepower)
knitr::kable(a,format = "markdown",caption = "Statistical Values",align = "c")
Statistical Values
vars n mean sd median trimmed mad min max range skew kurtosis se
X1 1 425 216.76 71.31586 210 211.5367 66.717 100 500 400 0.9528091 1.546674 3.459327
Comment:

According to above table:
- Mean > Median and Skewness > 0, thus we guess we will have a right long tail.
- To ensure, we will use visual method.

Visual method
Histogram Plot / Box Plot / QQ Plot
# Prepare the sheet for 1X3 diagrams
par(mfrow=c(1,3))

# Plot the Histogram
hist(file.cars$Horsepower,col="cyan3",xlab = "Box plot for Hoursepower",ylab = "Value",main = "Histogram of Housepower")

# draw Box Plot
boxplot(file.cars$Horsepower,main="Box Plot for Hoursepower",xlab="Value",ylab="Housepower",col = "aquamarine3",border ="aquamarine4",horizontal = TRUE)
points(mean(file.cars$Horsepower,na.rm = TRUE),col="blue",pch=8)

# draw QQ Plot
qqnorm.var=qqnorm(file.cars$Horsepower)
qqline(file.cars$Horsepower,col="red")

Comments:

As histogram diagram, we can see the right skewness and right long tail.
As box plot diagram, we can see more outliers at the right side.
As QQ plot, we can see majority near the red line but some other outliers, to ensure about the normality, we must use the “Shapiro-Wilk” test as below.

Quantitative method
Shapiro-Wilk Test
# calculate Shapiro value
shapiro.results=shapiro.test(file.cars$Horsepower)
print(shapiro.results[2])
## $p.value
## [1] 2.320249e-11

Conclusion:

P-Value is very small and less than significant value, hence
Horsepower data is not follow normal distribution.


Question (d):

  • Use visual and quantitative methods to comment on whether an assumption of normality would be reasonable for Horsepower variable by Type, especially for Sports, SUV, and Truck (i.e., check normality of Horsepower from Type of i) Sport, ii) SUV, and iii) Truck.

Answer -:

i) Type=Sport

Visual method
# Prepare the sheet for 1X3 diagrams
par(mfrow=c(1,3))

# Visual diagrams for SPORT cars
file.cars[file.cars$Type=="Sports",]$Horsepower %>% 
hist(col="cadetblue3",xlab = "Hoursepower",ylab = "Value",main = "Histogram-Sport")

file.cars[file.cars$Type=="Sports",]$Horsepower %>%
boxplot(main="Box plot-Sport",xlab="Value",ylab="Housepower",col = "aquamarine3",border ="aquamarine4",horizontal = TRUE)
points(mean(file.cars$Horsepower,na.rm = TRUE),col="blue",pch=8)

file.cars[file.cars$Type=="Sports",]$Horsepower %>% 
qqnorm(main="QQ Plot-Sport")

file.cars[file.cars$Type=="Sports",]$Horsepower %>%
qqline(col="red")

Comments:

As the visual method:
- Histogram and Box plot show a little right skewed and right long tail.
- QQ plot shows there are a majority of data closed to red line, but there are a lot of points away from the red line.
Therefor, it may be normal distribution or not, to be ensure, we use the quantitative test (Shapiro-Wilk Test)

Quantitave method
Shapiro-Wilk Test
  • H0: Data follow normal distribution
  • H1: Data doesn’t follow normal distribution
# calculate Shapiro value

file.cars[file.cars$Type == "Sports","Horsepower"] %>% 
  shapiro.test() %>% 
  print()
## 
##  Shapiro-Wilk normality test
## 
## data:  .
## W = 0.94276, p-value = 0.01898

Conclusion:

P-Value < Significant Value (0.05)

P-Value is smaller than significant value, hence
We have enough evidence to reject the H0 (Null Hypothesis), thus our conclusion is data doesn’t follow normal distribution.

ii) Type=SUV

Visual method
# Prepare the sheet for 1X3 diagrams
par(mfrow=c(1,3))

# Visual diagrams for SUV cars
file.cars[file.cars$Type=="SUV",]$Horsepower %>% 
hist(col="cadetblue3",xlab = "Hoursepower",ylab = "Value",main = "Histogram-SUV")

file.cars[file.cars$Type=="SUV",]$Horsepower %>%
boxplot(main="Box plot-SUV",xlab="Value",ylab="Housepower",col = "aquamarine3",border ="aquamarine4",horizontal = TRUE)
points(mean(file.cars$Horsepower,na.rm = TRUE),col="blue",pch=8)

file.cars[file.cars$Type=="SUV",]$Horsepower %>% 
qqnorm(main="QQ Plot-SUV")

file.cars[file.cars$Type=="SUV",]$Horsepower %>%
qqline(col="red")

Comments:

As the visual method:
- Histogram and Box plot show a little left skewed and left long tail.
_ QQ plot shows there are a majority of data closed to red line, but there are a lot of points with more distance to red line.
Therefor, it we can not ensure about normality, thus we have to use the quantitatively test (Shapiro-Wilk Test)

Quantitave method
Shapiro-Wilk Test
  • H0: Data follow normal distribution
  • H1: Data doesn’t follow normal distribution
# calculate Shapiro value

file.cars[file.cars$Type=="SUV","Horsepower"] %>% 
  shapiro.test() %>% 
  print()
## 
##  Shapiro-Wilk normality test
## 
## data:  .
## W = 0.95945, p-value = 0.04423

Conclusion:

P-Value < Significant Value (0.05)

P-Value is smaller than significant value, hence
We have enough evidence to reject the H0 (Null Hypothesis), thus our conclusion is data doesn’t follow normal distribution.

iiii) Type=Truck

Visual method
# Prepare the sheet for 1X3 diagrams
par(mfrow=c(1,3))

# Visual diagrams for Truck cars
file.cars[file.cars$Type=="Truck",]$Horsepower %>% 
hist(col="cadetblue3",xlab = "Hoursepower",ylab = "Value",main = "Histogram-Truck")

file.cars[file.cars$Type=="Truck",]$Horsepower %>%
boxplot(main="Box plot-Truck",xlab="Value",ylab="Housepower",col = "aquamarine3",border ="aquamarine4",horizontal = TRUE)
points(mean(file.cars$Horsepower,na.rm = TRUE),col="blue",pch=8)

file.cars[file.cars$Type=="Truck",]$Horsepower %>% 
qqnorm(main="QQ Plot-Truck")

file.cars[file.cars$Type=="Truck",]$Horsepower %>%
qqline(col="red")

Comments:

As the visual method:
- Histogram and Box plot show a data is not normal.
_ QQ plot shows there are a lot of points with more distance of red line.
Data looks is not normal distributed, but we will use the quantitatively test (Shapiro-Wilk Test) to ensure.

Quantitavly method
Shapiro-Wilk Test
  • H0: Data follow normal distribution
  • H1: Data doesn’t follow normal distribution
# calculate Shapiro value

file.cars[file.cars$Type=="Truck","Horsepower"] %>% 
  shapiro.test() %>% 
  print()
## 
##  Shapiro-Wilk normality test
## 
## data:  .
## W = 0.8951, p-value = 0.01697

Conclusion:

P-Value < Significant Value (0.05)

P-Value is smaller than significant value, then
We have enough evidence to reject the H0 (NULL Hypothesis), thus our conclusion is data doesn’t follow normal distribution.