First of all, I am most interested in whether different car manufacturers produce the same classes of cars. For example, Volkswagen is well known for producing many compact and subcompact models. I therefore drew a bar chart of vehicle class, with each bar split by manufacturer, covering all the manufacturers in the data set.
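library(ggplot2)
library(plyr)  # assumed source of rename(): the c(old_name = "new name") form is plyr-style
df <- rename(df, c(trans = "Transmission"))
df <- rename(df, c(manufacturer = "Vehicle Manufacturer"))
# Bar chart of vehicle class, with each bar filled by manufacturer
g <- ggplot(df, aes(class))
g + geom_bar(aes(fill = `Vehicle Manufacturer`))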
The figure shows that almost all manufacturers produce SUVs, which are also one of the most popular classes among consumers. Audi clearly favors the compact class, and Toyota and Volkswagen are the two main producers of compact cars. Chevrolet is the only manufacturer of 2seater vehicles in this data set. The midsize class is produced mainly by Japanese manufacturers such as Toyota and Honda. Dodge’s lineup is not rich in the pickup and minivan classes. Some manufacturers, such as Jeep, produce only SUVs. In conclusion, large producers like Volkswagen and Toyota are active in almost every class, whereas Jeep and Mercury specialize in a single one.
Now that we have a preliminary understanding of the manufacturers, let's look at their fuel consumption. The horizontal axis is fuel efficiency (miles/gallon) and the vertical axis is the manufacturer. The further to the right a box lies, the higher the fuel efficiency, that is, the lower the fuel consumption.
df <- rename(df, c(hwy = "Highway Efficiency Fuel(miles/gallon)"))
# Highway fuel efficiency by manufacturer, drawn as horizontal box plots
p <- ggplot(df, aes(`Vehicle Manufacturer`, `Highway Efficiency Fuel(miles/gallon)`))
p + geom_boxplot() + coord_flip() + theme(axis.line = element_line(colour = "black"), panel.background = element_rect(fill = NA))
df <- rename(df, c(cty = "City Efficiency Fuel(miles/gallon)"))
# Same plot for city fuel efficiency
p <- ggplot(df, aes(`Vehicle Manufacturer`, `City Efficiency Fuel(miles/gallon)`))
p + geom_boxplot() + coord_flip() + theme(axis.line = element_line(colour = "black"), panel.background = element_rect(fill = NA))
According to the figures, Volkswagen, Toyota and Honda make relatively fuel-efficient vehicles in both city and highway driving. Volkswagen even has three cars that consume significantly less fuel than the average model. Taken together with the earlier chart of classes by manufacturer, small-car manufacturers such as Volkswagen, Toyota and Honda have relatively low fuel consumption, while SUV manufacturers such as Dodge, Jeep and Land Rover have significantly higher fuel consumption than the others. Manufacturers that cover a wide range of classes, such as Toyota, also span a wider range of fuel consumption: Toyota has vehicles in both the highest and the lowest fuel consumption groups, but overall its fuel consumption is low.
library(corrplot)
# Collect the numeric variables into a matrix and compute their pairwise correlations
df_cor <- cbind(df$displ, df$year, df$cyl, df$`City Efficiency Fuel(miles/gallon)`, df$`Highway Efficiency Fuel(miles/gallon)`)
colnames(df_cor) <- c("displ", "year", "cyl", "City", "Highway")
res <- cor(df_cor)
corrplot(res, type = "upper", order = "hclust", tl.col = "black", tl.srt = 45)
Before analyzing city fuel efficiency, we first draw a heat map of the correlation coefficients between several related variables. Blue represents positive correlation and red represents negative correlation; the deeper the color, the stronger the correlation. As the figure shows, the variable most strongly positively correlated with city fuel efficiency is highway fuel efficiency. In contrast, the higher the values of displ and cyl, the lower the fuel efficiency. The year is almost uncorrelated with fuel efficiency. This lays the groundwork for the next step: displ, cyl and Highway are the natural explanatory variables for a regression analysis of city fuel efficiency (the model fitted below also includes year). To check the linear regression model for over-fitting, I first split the data into 75% training data and 25% test data. The linear regression model is obtained as follows:
set.seed(12345)
# Randomly assign 75% of the rows to the training set; the remaining rows form the test set
row.number <- sample(x = 1:nrow(df_cor), size = 0.75 * nrow(df_cor))
train <- data.frame(df_cor[row.number, ])
test <- data.frame(df_cor[-row.number, ])
lm_fit<-lm(train$City ~., data = train)
summary(lm_fit)
##
## Call:
## lm(formula = train$City ~ ., data = train)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -3.0892 -0.5910 -0.0633  0.7821  4.9112
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) 44.18534   41.48173   1.065 0.288307
## displ        0.14476    0.20298   0.713 0.476728
## year        -0.01924    0.02079  -0.926 0.355973
## cyl         -0.56700    0.15864  -3.574 0.000458 ***
## Highway      0.59901    0.02409  24.866  < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.209 on 170 degrees of freedom
## Multiple R-squared: 0.9255, Adjusted R-squared: 0.9238
## F-statistic: 528.3 on 4 and 170 DF, p-value: < 2.2e-16
Each coefficient in the model belongs to one explanatory variable: it is the change in the predicted value when that variable increases by one unit and the other variables are held fixed. The adjusted R-squared of 0.92 shows that the model fits the training data very well. Furthermore, the p-value of the F-test is almost zero, which shows that the model as a whole is highly significant, so City Fuel Efficiency can be explained by the variables we chose. The coefficients of cyl and Highway are statistically significant, which indicates that these two variables are the ones most likely to affect City Fuel Efficiency, whereas displ and year are not significant. Of the two significant variables, the only one we can actually choose when buying a car is the number of cylinders.
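To make the meaning of the coefficients concrete, the following sketch recomputes a prediction by hand from the rounded estimates printed in the summary. The two cars (2.0 L, model year 2008, 30 miles/gallon on the highway, with 4 or 6 cylinders) and the names coefs, predict_city, car4 and car6 are hypothetical and chosen only for illustration. Because the printed estimates are rounded, the results are approximate.

```r
# Rounded coefficient estimates copied from the summary output
coefs <- c(intercept = 44.18534, displ = 0.14476, year = -0.01924,
           cyl = -0.56700, Highway = 0.59901)

# Hypothetical helper: intercept plus the sum of coefficient * value
predict_city <- function(car) {
  unname(coefs["intercept"] + sum(coefs[names(car)] * car))
}

# Two hypothetical cars that differ only in the number of cylinders
car4 <- c(displ = 2.0, year = 2008, cyl = 4, Highway = 30)
car6 <- c(displ = 2.0, year = 2008, cyl = 6, Highway = 30)

pred4 <- predict_city(car4)
pred6 <- predict_city(car6)

pred4          # predicted city efficiency of the 4-cylinder car, about 21.5 miles/gallon
pred4 - pred6  # 2 * 0.567: two extra cylinders cost about 1.1 miles/gallon
```

With everything else held fixed, the gap between the two predictions depends only on the cyl coefficient, which is exactly what "change per unit" means here.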
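# Evaluate the fitted model on the held-out test set
pre <- predict(lm_fit, newdata = test)
obs <- test$City
SSE <- sum((obs - pre)^2)        # residual sum of squares
SST <- sum((obs - mean(obs))^2)  # total sum of squares
r2 <- 1 - SSE / SST
rmse <- sqrt(mean((pre - obs)^2))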
r2
## [1] 0.9344154
rmse
## [1] 0.9764112
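For reference, the two test-set metrics computed above follow the usual definitions. The notation is introduced here for illustration and is not the author's: y_i is the observed city efficiency of test vehicle i, \hat{y}_i is the model's prediction for it, \bar{y} is the mean of the observed test values, and n is the number of test vehicles.

```latex
\begin{aligned}
\mathrm{SSE} &= \sum_{i=1}^{n} \left(y_i - \hat{y}_i\right)^2, &
\mathrm{SST} &= \sum_{i=1}^{n} \left(y_i - \bar{y}\right)^2, \\
R^2 &= 1 - \frac{\mathrm{SSE}}{\mathrm{SST}}, &
\mathrm{RMSE} &= \sqrt{\frac{\mathrm{SSE}}{n}} .
\end{aligned}
```

R^2 is unitless and has an upper limit of 1, whereas RMSE is expressed in the unit of the response and has no natural upper limit.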
Finally, we compare the predictions of the trained model with the actual values in the test set. The two most important indicators are both good: the test-set R-squared of 0.93 is close to 1, and the RMSE is about 0.98 miles/gallon, which shows that the model generalizes well to data it was not trained on. The model is therefore meaningful in practice and can roughly predict a vehicle’s City Fuel Efficiency.
In conclusion, in exploring the data we first gained a general picture of the classes each manufacturer produces and of their fuel consumption. Next, the correlation coefficient matrix was used to assess the candidate variables. Finally, linear regression was used to produce predictions and to identify the factors that influence City Fuel Efficiency. On this basis, I offer some suggestions to consumers and dealers. For consumers who care about City Fuel Efficiency, the most important indicator to look at is the number of cylinders. For a given number of cylinders, engine displacement and model year have no significant effect on city fuel consumption in this model.