Having had the opportunity to work with Mercedes-Benz Korea (MBK), the Korean branch of Daimler AG in Germany, I became curious about the car lineup. This is a simple analysis of that lineup. In this small project, I present some descriptive analysis of the lineup and examine the correlation between car attributes such as weight, city_mpg, gear, and others. What kind of car is the best one for you? This article may help anyone who wants to buy a new car choose one. I am personally interested to explore:
To answer the two questions, I performed exploratory and descriptive analyses and used linear regression to explain the results. I also ran both a simple linear analysis on the whole dataset and a multivariate linear analysis by car type. Let’s get started.
For this analysis, I created the Mercedes_2017 dataset from data on the mbusa.com website. The dataset comprises 16 columns (the model name and 15 attributes) for 82 Mercedes-Benz models.
The first three records of the dataset are shown below:
library(xlsx)
## Loading required package: rJava
## Loading required package: xlsxjars
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
library(data.table)
## -------------------------------------------------------------------------
## data.table + dplyr code now lives in dtplyr.
## Please library(dtplyr)!
## -------------------------------------------------------------------------
##
## Attaching package: 'data.table'
## The following objects are masked from 'package:dplyr':
##
## between, first, last
library(knitr)
# install.packages("printr")
library(printr)
# install.packages("ggpubr")
library(ggpubr)
## Loading required package: magrittr
library(Hmisc)
## Loading required package: lattice
## Loading required package: survival
## Loading required package: Formula
##
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:dplyr':
##
## combine, src, summarize
## The following objects are masked from 'package:base':
##
## format.pval, round.POSIXt, trunc.POSIXt, units
library(psych)
##
## Attaching package: 'psych'
## The following object is masked from 'package:Hmisc':
##
## describe
## The following objects are masked from 'package:ggplot2':
##
## %+%, alpha
mercedes_2017_df <- read.csv(file = "Mercedes_2017.csv", stringsAsFactors = FALSE)
mercedes_2017_df[mercedes_2017_df==""] <- NA
kable(x = head(mercedes_2017_df, 3), align = 'c')
| model | category | price | pax | trunk | gear | cyl | electric | hp | rpm | accel | wt | city_mpg | city_mpge | highway_mpg | highway_mpge |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| C300 | Sedan | 39500 | 5 | 12.6 | 7 | 4 | No | 241 | 5550 | 6.0 | 3417 | 24 | NA | 34 | NA |
| C300 4MATIC | Sedan | 41500 | 5 | 12.6 | 7 | 4 | No | 241 | 5550 | 6.0 | 3594 | 24 | NA | 31 | NA |
| C350e Pulg-in Hybrid | Sedan | 46050 | 5 | 11.8 | 7 | 4 | Yes | 275 | NA | 5.8 | 4057 | NA | 45 | NA | 61 |
Notice that each row of mercedes_2017_df represents one car model. Each column is one attribute of that car, such as city fuel efficiency, the number of cylinders, and so on.
Interestingly, the Sedan, SUV, and Coupe types each have more than 15 car models, whereas the Electric and Wagon types have only one model each. This dataset therefore needs cleaning before exploratory analysis. Let’s take a look.
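The empty-string-to-NA step in the loading code above can also be done at read time. The following is a minimal sketch, assuming a small hypothetical file in place of Mercedes_2017.csv (the rows and the `demo_df` name are invented for illustration); it relies on the `na.strings` argument of `read.csv()`.

```r
# Hypothetical three-row file standing in for Mercedes_2017.csv
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "model,category,electric,city_mpg,city_mpge",
  "Gas A,Sedan,No,24,",
  "Gas B,Sedan,,22,",
  "Hybrid C,Sedan,Yes,,45"
), csv_path)

# Treat both empty fields and the literal string "NA" as missing while reading
demo_df <- read.csv(csv_path, stringsAsFactors = FALSE,
                    na.strings = c("", "NA"))

# Number of missing values per column
colSums(is.na(demo_df))
```

Handling missing values at read time keeps numeric columns numeric and avoids a second pass over the data frame.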
count_df <- mercedes_2017_df %>%
  select(model, category) %>%
  group_by(category) %>%
  summarise(total = n())
kable(count_df, align = "c")
| category | total |
|---|---|
| Cabriolet | 10 |
| Coupe | 22 |
| Electric | 1 |
| Roadster | 8 |
| Sedan | 18 |
| SUV | 22 |
| Wagon | 1 |
In this chapter, the dataset is cleaned so that the correlation between variables can be analyzed more accurately. To do so, the Wagon and Electric types are removed from the dataset, as is any model with “NA” or empty values. We also select only the variables used in the analysis.
# cyl is treated as a categorical variable; weight is rescaled to 1000 lbs
mercedes_2017_df$cyl <- as.factor(mercedes_2017_df$cyl)
mercedes_2017_df$model <- as.character(mercedes_2017_df$model)
mercedes_2017_df$wt <- mercedes_2017_df$wt / 1000
# Keep the analysis variables, then drop the single-model categories and incomplete rows
selected_df <- mercedes_2017_df %>%
  select(model, category, wt, cyl, gear, hp, accel, city_mpg, highway_mpg, rpm) %>%
  filter(!category %in% c("Wagon", "Electric")) %>%
  na.omit()
# Make sure the measurement columns are numeric
num_cols <- c("wt", "rpm", "gear", "hp", "city_mpg", "highway_mpg")
selected_df[num_cols] <- lapply(selected_df[num_cols], as.numeric)
# Variables used for the correlation analysis (model and category excluded)
analysis_data <- selected_df[, c("wt", "cyl", "gear", "hp", "accel", "city_mpg", "highway_mpg", "rpm")]
# unused dataset: rows excluded from the analysis (Wagon, Electric, or any missing value)
candidate_df <- mercedes_2017_df %>%
  select(model, category, wt, cyl, gear, hp, accel, city_mpg, highway_mpg, rpm)
unused_df <- candidate_df %>%
  filter(category %in% c("Wagon", "Electric") | !complete.cases(candidate_df))
We begin with some initial exploratory analysis to find existing patterns between the variables in the dataset. Graphs are a very effective tool for this. Looking at the data as a whole reveals the overall shape of the dataset.
par(mfrow = c(1, 2))
# Histogram with Normal Curve
dist_mpg <- analysis_data$city_mpg
h <-hist(dist_mpg, breaks=10, col="red", xlab="Miles Per Gallon",
main="Histogram of Miles per Gallon")
xfit <-seq(min(dist_mpg),max(dist_mpg),length=40)
yfit <-dnorm(xfit,mean=mean(dist_mpg),sd=sd(dist_mpg))
yfit <- yfit*diff(h$mids[1:2])*length(dist_mpg)
lines(xfit, yfit, col="blue", lwd=2)
# Kernel Density Plot
den_mpg <- density(dist_mpg)
plot(den_mpg, xlab = "MPG", main ="Density Plot of MPG")
pairs.panels(analysis_data,
method = "pearson", # correlation method
hist.col = "#00AFBB",
density = TRUE, # show density plots
ellipses = TRUE # show correlation ellipses
)
The function pairs.panels() [in the psych package] creates a scatter plot matrix, with bivariate scatter plots below the diagonal, histograms on the diagonal, and the Pearson correlations above the diagonal. Interestingly, each pair of variables traces a different fitted line, which raises the question of why the relationships vary so much between variables. According to the correlation coefficients, some pairs are negatively correlated, some are positively correlated, and some show almost no correlation.
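The coefficients shown in the upper triangle of such a plot can also be inspected as a plain matrix. The sketch below is only an illustration: it uses R's built-in mtcars dataset as a stand-in for analysis_data (an assumption, since the Mercedes file is not reproduced here), and the 0.7 cut-off for a "strong" correlation is an arbitrary choice.

```r
# Stand-in data: built-in mtcars, with columns analogous to the Mercedes variables
num_vars <- mtcars[, c("wt", "hp", "gear", "mpg")]

# Pearson correlation matrix, rounded for readability
cor_mat <- round(cor(num_vars, method = "pearson"), 2)
cor_mat

# List each pair once (upper triangle) whose |r| meets the arbitrary 0.7 cut-off
strong <- which(abs(cor_mat) >= 0.7 & upper.tri(cor_mat), arr.ind = TRUE)
strong_pairs <- data.frame(
  x = rownames(cor_mat)[strong[, "row"]],
  y = colnames(cor_mat)[strong[, "col"]],
  r = cor_mat[strong]
)
strong_pairs
```

Reading the matrix directly makes it easier to separate the negative, positive, and near-zero pairs than reading them off the plot.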
ggscatter(analysis_data, x = "wt", y = "city_mpg",
add = "reg.line", conf.int = TRUE,
cor.coef = TRUE, cor.method = "pearson",
xlab = "Weight (1000 lbs)", ylab = "Miles/(US) gallon")
Is the covariation linear? Yes: from the plot above, the relationship between the two variables is linear and negative. Had the scatter plot shown a curved pattern, we would be dealing with a nonlinear association between the two variables.
The Pearson correlation (r) measures the linear dependence between two variables (x and y). It is also referred to as Pearson’s correlation or simply the correlation coefficient. More importantly, if the p-value is below 0.05, the correlation between x and y is significant. In R, the correlation coefficient can be computed with the functions cor() or cor.test().
Let’s check the test assumption that the two variables follow a normal distribution, using the Shapiro-Wilk normality test. The R function is shapiro.test().
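The regression line drawn in the scatter plot corresponds to a simple linear model, which can also be fitted directly. This is a minimal sketch, again assuming the built-in mtcars data (wt and mpg) as a stand-in for wt and city_mpg in analysis_data.

```r
# Simple linear regression of fuel economy on weight (stand-in data: mtcars)
fit <- lm(mpg ~ wt, data = mtcars)

# Intercept and slope; a negative slope means heavier cars get fewer miles per gallon
coef(fit)

# For a one-predictor model, R-squared equals the squared Pearson correlation
r_squared <- summary(fit)$r.squared
r_squared
```

The slope gives the expected change in miles per gallon for each additional 1000 lbs, which is easier to interpret than the correlation alone.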
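To make the definition concrete, the coefficient can be computed from its formula, r = sum((x - mean(x)) * (y - mean(y))) / sqrt(sum((x - mean(x))^2) * sum((y - mean(y))^2)), and compared with cor(). The sketch assumes the built-in mtcars data as a stand-in for the Mercedes variables.

```r
# Stand-in vectors: weight and fuel economy from the built-in mtcars data
x <- mtcars$wt
y <- mtcars$mpg

# Pearson's r from its definition: covariation divided by the product of spreads
dx <- x - mean(x)
dy <- y - mean(y)
r_manual <- sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))

# The same quantity from the built-in function
r_builtin <- cor(x, y, method = "pearson")

c(manual = r_manual, builtin = r_builtin)
```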
shapiro.test(analysis_data$wt)
##
## Shapiro-Wilk normality test
##
## data: analysis_data$wt
## W = 0.9477, p-value = 0.01215
shapiro.test(analysis_data$city_mpg)
##
## Shapiro-Wilk normality test
##
## data: analysis_data$city_mpg
## W = 0.94782, p-value = 0.01231
Both p-values are less than 0.05, implying that each distribution differs significantly from a normal distribution.
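As a rough illustration of how to read these p-values, the sketch below applies shapiro.test() to two simulated samples of 60 values, one drawn from a normal distribution and one from a strongly skewed (exponential) distribution. The data, seed, and variable names are made up for the example and are not part of the Mercedes dataset.

```r
set.seed(42)  # arbitrary seed so the simulated samples are reproducible

# Simulated samples: one normal, one clearly right-skewed
normal_sample <- rnorm(60, mean = 3.8, sd = 0.5)
skewed_sample <- rexp(60, rate = 1)

# A small p-value means the sample is unlikely to come from a normal distribution
p_normal <- shapiro.test(normal_sample)$p.value
p_skewed <- shapiro.test(skewed_sample)$p.value

c(normal = p_normal, skewed = p_skewed)
```

The test becomes more sensitive as the sample grows, so with larger samples even mild departures from normality can produce p-values below 0.05.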
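qq_mpg <- ggqqplot(analysis_data$city_mpg, ylab = "CITY_MPG")
qq_wt <- ggqqplot(analysis_data$wt, ylab = "WT")
# ggqqplot() returns ggplot objects, which ignore par(mfrow); ggarrange() places them side by side
ggarrange(qq_mpg, qq_wt, ncol = 2)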
From the normality plots, however, both variables appear to follow a normal distribution closely enough, so we proceed with the Pearson test despite the Shapiro-Wilk results.
res <- cor.test(analysis_data$wt, analysis_data$city_mpg,
method = "pearson")
res
##
## Pearson's product-moment correlation
##
## data: analysis_data$wt and analysis_data$city_mpg
## t = -13.971, df = 58, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.9255852 -0.8031619
## sample estimates:
## cor
## -0.8780207
The p-value of the test is less than 2.2e-16, which is below the significance level alpha = 0.05. We can conclude that wt and city_mpg are significantly correlated, with a correlation coefficient of -0.88.
In this chapter, we analyze the relationship in more detail by car type and by number of cylinders. Returning to the research question, the main purpose of this analysis is to find out whether there is a negative relationship between weight and city_mpg across all car types. Let’s see below.
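The reported test statistic can be checked by hand from the correlation and the degrees of freedom, using t = r * sqrt(df) / sqrt(1 - r^2) with df = n - 2. The sketch below plugs in the values printed by cor.test() above; it should reproduce a t of about -13.97.

```r
# Values copied from the cor.test() output above
r  <- -0.8780207
df <- 58                      # n - 2, so n = 60 cars

# t statistic of the Pearson correlation test
t_stat <- r * sqrt(df) / sqrt(1 - r^2)
round(t_stat, 3)

# Two-sided p-value from the t distribution
p_value <- 2 * pt(-abs(t_stat), df = df)

# R prints "< 2.2e-16" when the p-value is below machine epsilon
p_value < .Machine$double.eps

# Share of variance in city_mpg associated with wt (roughly 0.77)
r^2
```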
wt_mpg_category <- ggplot(selected_df, aes(x = wt, y = city_mpg)) + geom_point(aes(shape = category, colour = category))
wt_mpg_category +
stat_smooth(method = "lm") +
facet_grid(~ category) +
xlab("Weight (1000 lbs)") +
ylab("City miles/(US) gallon") +
ggtitle("Mercedes car comparison")
As the graph above shows, every car type shows a negative relationship between the two variables. This is not very different from the overall graph in chapter 4 above.
wt_mpg_cyl <- ggplot(selected_df, aes(x = wt, y = city_mpg)) + geom_point(aes(shape = factor(cyl), colour = factor(cyl)))
wt_mpg_cyl +
stat_smooth(method = "lm") +
facet_grid(~ cyl) +
xlab("Weight (1000 lbs)") +
ylab("City miles/(US) gallon") +
ggtitle("Mercedes car comparison")
As the graph above shows, the two variables are also related within each cylinder group. This, too, is not very different from the graph in chapter 4 above.
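The faceted plots can be complemented with one number per group. The following is a minimal sketch, assuming the built-in mtcars data as a stand-in (grouped by cyl, with wt and mpg in place of wt and city_mpg); the `group_summary` name is introduced only for this example. With the Mercedes data, the same pattern could be applied to selected_df split by category or cyl.

```r
# Split the stand-in data by number of cylinders
by_cyl <- split(mtcars, mtcars$cyl)

# For each group: size, Pearson correlation, and slope of mpg on wt
group_summary <- do.call(rbind, lapply(names(by_cyl), function(g) {
  d <- by_cyl[[g]]
  fit <- lm(mpg ~ wt, data = d)
  data.frame(
    group = g,
    n     = nrow(d),
    r     = cor(d$wt, d$mpg),
    slope = coef(fit)[["wt"]]
  )
}))
group_summary
```

Comparing the slopes and correlations across groups shows whether the direction is the same everywhere and how much the strength differs, although groups with few cars give unstable estimates.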
For both questions I set out to explore, the answer is that weight and city_mpg are negatively correlated. However, the strength of the relationship differs from one kind of car to another.