Author:490089392 Lab:F16A

Executive Summary

This report analyses motor vehicle fuel consumption. Many families now own one or two motor vehicles, and for both dealers and customers a vehicle's value for money is reflected largely in its fuel consumption, which is also central to evaluating vehicle performance. To identify the factors that dealers and customers should weigh, this report takes a data set of vehicle fuel economy and analyses which variables influence City Fuel Efficiency.

IDA for sourced data

The data on motor vehicle fuel economy come from https://fueleconomy.gov/ [1], located through the website http://bigdata-madesimple.com/70-websites-to-get-large-data-repositories-for-free/. They are loaded below as the mpg data set that ships with ggplot2. The data set contains fuel economy information for 38 popular vehicle models, in 234 rows and 11 variables. The variables are:

Manufacturer: name of the car manufacturer
Model: name of the vehicle model
Displ: engine displacement, in litres
Year: year of production
Cyl: number of cylinders
Trans: type of transmission
Drv: type of drive train (f = front-wheel drive, r = rear-wheel drive, 4 = 4wd)
Cty: city miles per gallon
Hwy: highway miles per gallon
Fl: fuel type
Class: type of vehicle (for example compact, suv, pickup)

The model column is dropped before the analysis, which leaves 10 variables.

# install.packages("ggplot2")
# install.packages("reshape")
# install.packages("corrplot")

library(ggplot2)
library(reshape)
library(corrplot)
## corrplot 0.84 loaded
# Drop the second column (model); the remaining 10 variables are analysed
df<-mpg[,-2]
df
## # A tibble: 234 x 10
##    manufacturer displ  year   cyl trans      drv     cty   hwy fl    class 
##    <chr>        <dbl> <int> <int> <chr>      <chr> <int> <int> <chr> <chr> 
##  1 audi           1.8  1999     4 auto(l5)   f        18    29 p     compa~
##  2 audi           1.8  1999     4 manual(m5) f        21    29 p     compa~
##  3 audi           2    2008     4 manual(m6) f        20    31 p     compa~
##  4 audi           2    2008     4 auto(av)   f        21    30 p     compa~
##  5 audi           2.8  1999     6 auto(l5)   f        16    26 p     compa~
##  6 audi           2.8  1999     6 manual(m5) f        18    26 p     compa~
##  7 audi           3.1  2008     6 auto(av)   f        18    27 p     compa~
##  8 audi           1.8  1999     4 manual(m5) 4        18    26 p     compa~
##  9 audi           1.8  1999     4 auto(l5)   4        16    25 p     compa~
## 10 audi           2    2008     4 manual(m6) 4        20    28 p     compa~
## # ... with 224 more rows
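
The size and structure described above can be checked directly. This is a minimal sketch that assumes only that ggplot2 is installed; it is a quick initial data check, not part of the original analysis.

```r
library(ggplot2)  # provides the mpg data set used in this report

dim(mpg)                     # number of rows and columns in the full data set
length(unique(mpg$model))    # number of distinct vehicle models
sapply(mpg, class)           # storage type of each variable
colSums(is.na(mpg))          # count of missing values per column
```

Checking the column types first is useful because cor() and lm() below only make sense for the numeric columns.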

Exploring data

First, I am interested in whether different car manufacturers produce the same classes of car. For example, Volkswagen is known for its many compact and subcompact models. I therefore draw a bar chart of vehicle class, with each bar coloured by manufacturer, covering all manufacturers in the data.

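# reshape::rename takes a named vector of the form c(old = "new")
df<- rename(df,c(trans= "Transmission"))
df<- rename(df,c(manufacturer="Vehicle Manufacturer"))
g<-ggplot(df,aes(class))
# Refer to the column by name inside aes() rather than through df$...
g+geom_bar(aes(fill = `Vehicle Manufacturer`))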

From this figure, we can see that most manufacturers produce SUVs, and SUV is the most common class in the data. Audi concentrates on compact vehicles and, together with Toyota and Volkswagen, is one of the main producers of compacts. Only Chevrolet produces 2seater vehicles in this data set. Midsize vehicles come from a mix of manufacturers, including Japanese producers such as Toyota and Nissan. Dodge accounts for most of the pickups and all of the minivans. Some manufacturers, such as Jeep, produce only SUVs. In conclusion, large producers like Volkswagen and Toyota appear in several classes, whereas Jeep and Mercury specialise in one.
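
The counts behind the bar chart can also be tabulated, which makes the statements about who produces which class easier to verify. The following is a small sketch on the original mpg columns; the names class_counts and classes_per_maker are introduced here for illustration.

```r
library(ggplot2)  # provides the mpg data set

# Rows: manufacturers, columns: vehicle classes, cells: number of rows in mpg
class_counts <- table(mpg$manufacturer, mpg$class)
class_counts

# Number of distinct classes in which each manufacturer appears
classes_per_maker <- rowSums(class_counts > 0)
sort(classes_per_maker, decreasing = TRUE)

# Total number of rows per class, largest first
sort(colSums(class_counts), decreasing = TRUE)
```

Note that the counts are rows in the data set (model, year and transmission combinations), not sales figures.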

With a preliminary picture of the manufacturers, we now look at fuel efficiency by manufacturer. In the box plots, the horizontal axis is fuel efficiency (miles/gallon) and the vertical axis is the manufacturer. The further right a box lies, the higher the efficiency, that is, the lower the fuel consumption.

df<- rename(df,c(hwy='Highway Efficiency Fuel(miles/gallon)'))
# Refer to columns by name inside aes() rather than through df$...
p <-ggplot(df,aes(`Vehicle Manufacturer`,`Highway Efficiency Fuel(miles/gallon)`))
# coord_flip() puts the manufacturers on the vertical axis
p+geom_boxplot()+coord_flip()+theme(axis.line=element_line(colour = "black"),panel.background = element_rect(fill=NA))

df<- rename(df,c(cty='City Efficiency Fuel(miles/gallon)'))
p <-ggplot(df,aes(`Vehicle Manufacturer`,`City Efficiency Fuel(miles/gallon)`))
p+geom_boxplot()+coord_flip()+theme(axis.line=element_line(colour = "black"),panel.background = element_rect(fill=NA))

According to the figures, Volkswagen, Toyota and Honda vehicles are relatively fuel-efficient in both city and highway driving. Volkswagen even has three outlying cars that consume markedly less fuel than its typical model. Taken together with the earlier chart of manufacturers by class, manufacturers of smaller cars such as Volkswagen, Toyota and Honda show relatively low fuel consumption, while manufacturers of larger vehicles such as Dodge, Jeep and Land Rover show markedly higher fuel consumption. Manufacturers that cover many classes span a wider range: Toyota's vehicles appear among both the highest and the lowest fuel consumption groups, although overall its fuel consumption is low.
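
The ranking read off the box plots can be cross-checked with a numeric summary. This sketch computes the median city and highway efficiency per manufacturer on the original mpg column names; the name med is an illustrative choice, and the median is used here because it is the centre line a box plot draws.

```r
library(ggplot2)  # provides the mpg data set

# Median city and highway efficiency (miles/gallon) for each manufacturer
med <- aggregate(cbind(cty, hwy) ~ manufacturer, data = mpg, FUN = median)

# Most efficient manufacturers in city driving first
med[order(-med$cty), ]
```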

df_cor<-cbind(df$displ,df$year,df$cyl,df$`City Efficiency Fuel(miles/gallon)`,df$`Highway Efficiency Fuel(miles/gallon)`)
colnames(df_cor)<-c("displ","year","cyl","City","Highway")
res <- cor(df_cor)
corrplot(res, type = "upper", order = "hclust", tl.col = "black", tl.srt = 45)
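
The coefficients drawn in the plot can also be printed as numbers, which is helpful when two colours look similar. A minimal sketch on the original mpg column names follows; num_cols and r are names introduced here.

```r
library(ggplot2)  # provides the mpg data set

# Keep only the numeric columns that enter the correlation plot
num_cols <- as.data.frame(mpg[, c("displ", "year", "cyl", "cty", "hwy")])

# Pearson correlation matrix, rounded for reading
r <- cor(num_cols)
round(r, 2)
```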

Before modelling city fuel efficiency, we draw a heat map of the correlation coefficients between the numeric variables. Blue represents positive correlation and red represents negative correlation; the deeper the colour, the stronger the correlation. As the figure shows, the variable most strongly positively correlated with city fuel efficiency is highway fuel efficiency. In contrast, the larger displ and cyl are, the lower the fuel efficiency. Year is almost uncorrelated with fuel efficiency. This lays the groundwork for the next step: displ, cyl and Highway are the natural predictors for a regression of city fuel efficiency. The model fitted below uses every column of df_cor, so year is included as well.

To check for over-fitting, I first split the data at random into 75% training data and 25% test data. The linear regression model is obtained as follows:

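set.seed(12345)
# Randomly assign 75% of the rows to the training set; the rest form the test set
row.number<-sample(x=1:nrow(df_cor),size = 0.75*nrow(df_cor))
train=df_cor[row.number,]
test=df_cor[-row.number,]
train =data.frame(train)
test= data.frame(test)
# Regress City on all other columns; name the response directly instead of train$City
lm_fit<-lm(City ~., data = train) 
summary(lm_fit)
## 
## Call:
## lm(formula = City ~ ., data = train)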
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.0892 -0.5910 -0.0633  0.7821  4.9112 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 44.18534   41.48173   1.065 0.288307    
## displ        0.14476    0.20298   0.713 0.476728    
## year        -0.01924    0.02079  -0.926 0.355973    
## cyl         -0.56700    0.15864  -3.574 0.000458 ***
## Highway      0.59901    0.02409  24.866  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.209 on 170 degrees of freedom
## Multiple R-squared:  0.9255, Adjusted R-squared:  0.9238 
## F-statistic: 528.3 on 4 and 170 DF,  p-value: < 2.2e-16

Each coefficient gives the change in predicted City Fuel Efficiency for a one-unit increase in the corresponding variable, holding the other variables fixed. For example, each additional cylinder is associated with about 0.57 miles/gallon lower city efficiency, and each additional mile/gallon of highway efficiency with about 0.60 miles/gallon higher city efficiency. The adjusted R-squared of 0.92 shows that the model fits the training data very well, and the p-value of the overall F-test is almost zero, so the model as a whole is statistically significant: City Fuel Efficiency can be explained by the chosen variables. Cyl and Highway are the only significant predictors (p < 0.001); displ and year are not significant once these two are in the model. Since highway efficiency is itself a measured outcome rather than a design choice, the only significant parameter a buyer can choose directly is the number of cylinders.
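
To make the interpretation of the coefficients concrete, the fitted equation can be applied by hand. The sketch below copies the estimates from the summary output; the car it describes is a hypothetical example with illustrative values, not a row quoted from the data.

```r
# Coefficient estimates copied from the summary output of lm_fit
coefs <- c(intercept = 44.18534, displ = 0.14476, year = -0.01924,
           cyl = -0.56700, Highway = 0.59901)

# Hypothetical car: 2.0 litre, built 2008, 4 cylinders, 30 mpg on the highway
car <- c(displ = 2.0, year = 2008, cyl = 4, Highway = 30)

# Prediction = intercept + sum of (coefficient * value)
predicted_city <- coefs[["intercept"]] + sum(coefs[names(car)] * car)
predicted_city   # about 21.5 miles/gallon

# Same car with six cylinders: only the cyl term changes
car6 <- replace(car, "cyl", 6)
predicted_city6 <- coefs[["intercept"]] + sum(coefs[names(car6)] * car6)
predicted_city - predicted_city6   # 2 * 0.567 = 1.134 miles/gallon lower
```

With the model object available, predict(lm_fit, newdata = ...) gives the same result without copying numbers by hand.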

pre<-predict(lm_fit,newdata=test)
obs<-test$City
SSE<-sum((obs-pre)^2)
SST<-sum((obs-mean(obs))^2)
r2<-1-SSE/SST
rmse<-sqrt(mean((pre-obs)^2))
r2
## [1] 0.9344154
rmse
## [1] 0.9764112
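
For reference, the two test-set measures computed above can be written out as formulas. The notation is introduced here and is not part of the original code: $y_i$ is the observed city efficiency of test vehicle $i$ (obs), $\hat{y}_i$ its prediction (pre), $\bar{y}$ the mean of obs, and $n$ the number of test vehicles.

```latex
R^2 = 1 - \frac{\mathrm{SSE}}{\mathrm{SST}}
    = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2},
\qquad
\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)^2}
```

R-squared is unitless and is better the closer it is to 1, whereas RMSE is in miles/gallon and is better the closer it is to 0.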

Finally, we compare the model's predictions with the actual values in the test set. The test R-squared of 0.93 is close to 1, and the RMSE of about 0.98 miles/gallon means a typical prediction error of roughly one mile per gallon, so the model performs as well on unseen data as on the training data. The model can therefore give a rough prediction of a vehicle's City Fuel Efficiency.

In conclusion, the exploratory analysis first gave a general picture of the vehicle classes and fuel consumption of the different manufacturers. Next, the correlation matrix was used to judge which variables to select. Finally, a linear regression gave predictions and identified the factors influencing City Fuel Efficiency. Based on this analysis, I offer the following suggestion to consumers and dealers: if City Fuel Efficiency matters, the most important indicator to consider is the number of cylinders. For a given number of cylinders, engine displacement and year of production have no significant additional effect on city fuel efficiency.