Chapter_1. Introduction

While working with Mercedes-Benz Korea (MBK), the Korean branch of Daimler AG of Germany, I became curious about the company's car lineup. This is a simple analysis of that lineup. In this small project, I present some descriptive statistics for the lineup and a correlation analysis between vehicle attributes such as weight, city MPG, number of gears, and others. Which car is the best one for you? I hope this article helps anyone who is choosing a new car. I am personally interested in exploring:

“Is weight (wt) negatively correlated with miles per (US) gallon in the city (city_mpg) across all car types?”
“Is weight (wt) negatively correlated with miles per (US) gallon in the city (city_mpg) across all cylinder counts?”

To answer these two questions, I performed exploratory and descriptive analyses and used Pearson correlation and linear regression lines to explain the results. I looked at the data both as a whole and broken down by car type and by number of cylinders. Let’s get started.

Chapter_2. Looking at the data set

For this analysis, I created the Mercedes_2017 dataset from information on the mbusa.com website. The dataset comprises the model name and 15 attributes of 82 Mercedes-Benz vehicles.

Point_1. Car Terminology

A brief definition of the variables in the dataset is below:

col_1. model: vehicle’s names

col_2. category: vehicle’s types

col_3. price: current value of each vehicle in 2017

col_4. pax: the number of passengers in each car

col_5. trunk: the vehicle’s storage in each car

col_6. gear: Number of forward gears

col_7. cyl: Number of cylinders

col_8. electric: Hybrid type or not

col_9. hp: Gross horsepower

col_10. rpm: Revolutions per minute

col_11. accel: a vehicle’s capacity to gain speed within a short time

col_12. wt: Weight (lbs in the raw data; rescaled to 1000 lbs in Chapter_3)

col_13. city_mpg: Miles per (US) gallon in the city

col_14. city_mpge: Miles per gallon equivalent (MPGe) in the city

col_15. highway_mpg: Miles per (US) gallon on the highway

col_16. highway_mpge: Miles per gallon equivalent (MPGe) on the highway

The first three records of the dataset are shown below:

library(xlsx)
library(dplyr)
library(ggplot2)
library(data.table)
library(knitr)
# install.packages("printr")
library(printr)
# install.packages("ggpubr")
library(ggpubr)
library(Hmisc)
library(psych)

mercedes_2017_df <- read.csv(file = "Mercedes_2017.csv", stringsAsFactors = FALSE)
mercedes_2017_df[mercedes_2017_df==""] <- NA
kable(x = head(mercedes_2017_df, 3), align = 'c')

model	category	price	pax	trunk	gear	cyl	electric	hp	rpm	accel	wt	city_mpg	city_mpge	highway_mpg	highway_mpge
C300	Sedan	39500	5	12.6	7	4	No	241	5550	6.0	3417	24	NA	34	NA
C300 4MATIC	Sedan	41500	5	12.6	7	4	No	241	5550	6.0	3594	24	NA	31	NA
C350e Pulg-in Hybrid	Sedan	46050	5	11.8	7	4	Yes	275	NA	5.8	4057	NA	45	NA	61

Notice that each row of mercedes_2017_df represents one car model. Each column is one attribute of that car, such as city fuel efficiency, the number of cylinders, and so on.

Point_2. Summary of Types & Models

Interestingly, the Sedan, SUV, and Coupe types each have more than 15 car models, whereas the Electric and Wagon types have only one model each. A category with a single model cannot support a within-category correlation, so the dataset needs cleaning before the exploratory analysis. Let’s take a look.

# Count the number of models in each category
count_df <- mercedes_2017_df %>% 
    group_by(category) %>% 
    summarise(total = n())

kable(count_df, align = "c")

category	total
Cabriolet	10
Coupe	22
Electric	1
Roadster	8
Sedan	18
SUV	22
Wagon	1

Chapter_3. Data Cleaning

In this chapter, the dataset is cleaned so that the correlation between variables can be analyzed more accurately. The Wagon and Electric categories, which contain only one model each, are removed. Models with “NA” or empty values are removed as well, and only the variables used in the analysis are kept.

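# Treat the cylinder count as categorical and rescale weight from lbs to 1000 lbs
mercedes_2017_df$cyl <- as.factor(mercedes_2017_df$cyl)
mercedes_2017_df$model <- as.character(mercedes_2017_df$model)
mercedes_2017_df$wt <- mercedes_2017_df$wt / 1000

# Keep the variables of interest, drop the single-model categories,
# then drop every model that has a missing value
selected_df <- mercedes_2017_df %>% 
    select(model, category, wt, cyl, gear, hp, accel, city_mpg, highway_mpg, rpm) %>% 
    filter(category != "Wagon" & category != "Electric") %>% 
    na.omit()

# Make sure the measurement columns are numeric
selected_df$wt <- as.numeric(selected_df$wt)
selected_df$rpm <- as.numeric(selected_df$rpm)
selected_df$gear <- as.numeric(selected_df$gear)
selected_df$hp <- as.numeric(selected_df$hp)
selected_df$city_mpg <- as.numeric(selected_df$city_mpg)
selected_df$highway_mpg <- as.numeric(selected_df$highway_mpg)

# Analysis variables only (model and category are left out); selected by name rather than position
analysis_data <- selected_df[, c("wt", "cyl", "gear", "hp", "accel", "city_mpg", "highway_mpg", "rpm")]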

# Unused dataset: the exact complement of selected_df, i.e. models with any
# missing value plus the single-model categories (Wagon and Electric)
unused_df <- mercedes_2017_df %>% 
    select(model, category, wt, cyl, gear, hp, accel, city_mpg, highway_mpg, rpm) %>% 
    filter(!complete.cases(.) | category %in% c("Wagon", "Electric"))
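To see how the cleaning step partitions the data, here is a minimal base-R sketch on a small made-up data frame. The `toy` rows and their values are hypothetical and are not taken from Mercedes_2017.csv; the sketch only illustrates that dropping the single-model categories and the incomplete rows splits the data into a kept set and an unused set with no overlap.

```r
# Hypothetical toy data; values are illustrative only
toy <- data.frame(
  model    = c("A", "B", "C", "D", "E"),
  category = c("Sedan", "Sedan", "Wagon", "SUV", "Electric"),
  wt       = c(3.4, 3.6, 3.9, NA, 4.1),
  city_mpg = c(24, 24, 22, 19, NA),
  stringsAsFactors = FALSE
)

# A row is dropped if it belongs to a single-model category
# or if any of its values is missing
drop_row <- toy$category %in% c("Wagon", "Electric") | !complete.cases(toy)

kept   <- toy[!drop_row, ]   # rows that go into the analysis
unused <- toy[drop_row, ]    # rows that are set aside

nrow(kept)
nrow(unused)
```

Because both subsets come from the same logical vector, every row ends up in exactly one of them.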

Chapter_4. Exploratory Analysis

I begin with some initial exploratory analysis to find existing patterns between the variables in the dataset. Plots with fitted lines are a very effective tool for this. Looking at the data as a whole first reveals its overall tendencies.

Point_1. City_MPG Distribution

par(mfrow = c(1, 2))
# Histogram with normal curve
dist_mpg <- analysis_data$city_mpg
h <- hist(dist_mpg, breaks = 10, col = "red", xlab = "Miles Per Gallon",
          main = "Histogram of Miles per Gallon")
xfit <- seq(min(dist_mpg), max(dist_mpg), length.out = 40)
yfit <- dnorm(xfit, mean = mean(dist_mpg), sd = sd(dist_mpg))
# Rescale the density to the count scale: bin width times number of observations
yfit <- yfit * diff(h$mids[1:2]) * length(dist_mpg)
lines(xfit, yfit, col = "blue", lwd = 2)

# Kernel density plot
den_mpg <- density(dist_mpg)
plot(den_mpg, xlab = "MPG", main = "Density Plot of MPG")
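The rescaling of the normal curve in the histogram code may look arbitrary, so here is a small sketch of the reasoning, using simulated values as a hypothetical stand-in for city_mpg. A density curve encloses an area of 1, while a count histogram with equal-width bins encloses an area of n times the bin width, so multiplying the density by both factors puts the curve on the same scale as the bars.

```r
set.seed(1)
x <- rnorm(60, mean = 20, sd = 4)   # simulated stand-in, not the real city_mpg values

# Compute the histogram without drawing it
h <- hist(x, breaks = 10, plot = FALSE)
bin_width <- diff(h$mids[1:2])      # bins are equally spaced, so any two neighbours work

# Total area of the count histogram: number of observations times bin width
bar_area <- sum(h$counts) * bin_width

# Normal density evaluated on a grid, then rescaled to the count scale
xfit <- seq(min(x), max(x), length.out = 40)
yfit <- dnorm(xfit, mean = mean(x), sd = sd(x)) * bin_width * length(x)

bar_area
```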

Point_2. Scatter Plot Matrix

pairs.panels(analysis_data, 
             method = "pearson", # correlation method
             hist.col = "#00AFBB",
             density = TRUE,  # show density plots
             ellipses = TRUE # show correlation ellipses
             )

The function pairs.panels() from the psych package creates a scatter plot matrix, with bivariate scatter plots below the diagonal, histograms on the diagonal, and Pearson correlations above the diagonal. Interestingly, the fitted lines differ considerably from one pair of variables to another, which raises the question of why the relationships vary so much. According to the correlation coefficients, some pairs are negatively correlated, some are positively correlated, and some show almost no correlation.

Point_3. Scatter Plots between two variables (wt vs city_mpg)

ggscatter(analysis_data, x = "wt", y = "city_mpg", 
          add = "reg.line", conf.int = TRUE, 
          cor.coef = TRUE, cor.method = "pearson",
          xlab = "Weight (1000 lbs)", ylab = "Miles/(US) gallon")

Is the covariation linear? Yes: from the plot above, the relationship between the two variables is linear and negative. Had the scatter plot shown a curved pattern, we would be dealing with a nonlinear association between the two variables.

Point_4. Method of Correlation Analysis.

The Pearson correlation (r) measures the linear dependence between two variables (x and y). It is also referred to as Pearson’s correlation or simply as the correlation coefficient. If the p-value of the test is below 0.05, the correlation between x and y is considered statistically significant. In R, the correlation coefficient can be computed with cor(), and cor.test() additionally reports the significance test.
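As a quick illustration of what cor() computes, the sketch below builds the Pearson coefficient by hand as the sum of cross-products of the deviations divided by the square root of the product of the sums of squared deviations, and compares it with the built-in function. The vectors `x` and `y` are hypothetical values chosen for the example, not rows of the Mercedes dataset.

```r
x <- c(2.9, 3.4, 3.6, 4.1, 4.8, 5.5)   # hypothetical weights (1000 lbs)
y <- c(27, 24, 23, 20, 17, 14)         # hypothetical city MPG

# Pearson r from its definition
dx <- x - mean(x)
dy <- y - mean(y)
r_manual <- sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))

# Same quantity from the built-in function
r_builtin <- cor(x, y, method = "pearson")

c(manual = r_manual, builtin = r_builtin)
```

The two numbers agree, and the sign is negative because y falls as x rises.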

First, let’s check the test assumption that both variables follow a normal distribution, using the Shapiro-Wilk normality test (R function: shapiro.test()).

shapiro.test(analysis_data$wt)

## 
##  Shapiro-Wilk normality test
## 
## data:  analysis_data$wt
## W = 0.9477, p-value = 0.01215

shapiro.test(analysis_data$city_mpg)

## 
##  Shapiro-Wilk normality test
## 
## data:  analysis_data$city_mpg
## W = 0.94782, p-value = 0.01231

Both p-values (about 0.012) are less than 0.05, so the test rejects the hypothesis that either variable is exactly normally distributed.

# par(mfrow) does not affect ggplot-based plots, so ggarrange() places them side by side
ggarrange(ggqqplot(analysis_data$city_mpg, ylab = "CITY_MPG"),
          ggqqplot(analysis_data$wt, ylab = "WT"),
          ncol = 2)

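From the normality plots, I treat both variables as approximately normal, although the Shapiro-Wilk tests above reject exact normality. The Pearson results that follow should therefore be read with some caution.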
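When normality is in doubt, one common cross-check is the rank-based Spearman correlation, which only assumes a monotone relationship. This is not part of the original analysis; the sketch below simply shows how it could be run next to Pearson, using hypothetical `wt` and `mpg` vectors rather than the real data.

```r
# Hypothetical values: strictly decreasing, but not perfectly linear
wt  <- c(2.9, 3.1, 3.4, 3.6, 4.1, 4.8, 5.5, 6.0)
mpg <- c(30, 26, 24, 23, 20, 18, 17, 16.5)

pearson  <- cor(wt, mpg, method = "pearson")    # sensitive to the curvature
spearman <- cor(wt, mpg, method = "spearman")   # depends only on the ranks

# Significance test for the rank correlation; exact = FALSE uses the
# asymptotic approximation, which also works when the data contain ties
res_s <- cor.test(wt, mpg, method = "spearman", exact = FALSE)

c(pearson = pearson, spearman = spearman)
res_s$p.value
```

If both coefficients point in the same direction with similar strength, the conclusion does not hinge on the normality assumption.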

(a) Pearson correlation test

res <- cor.test(analysis_data$wt, analysis_data$city_mpg, 
                    method = "pearson")
res

## 
##  Pearson's product-moment correlation
## 
## data:  analysis_data$wt and analysis_data$city_mpg
## t = -13.971, df = 58, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.9255852 -0.8031619
## sample estimates:
##        cor 
## -0.8780207

(b) Interpretation of the result

The p-value of the test is less than 2.2e-16, which is far below the significance level alpha = 0.05. We can conclude that wt and city_mpg are significantly correlated, with a correlation coefficient of -0.88 (95% confidence interval: -0.93 to -0.80).
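The numbers in the cor.test() output are linked by standard formulas, and recomputing them is a useful sanity check. The sketch below takes only the reported correlation and degrees of freedom as inputs, derives the t statistic as r * sqrt(df) / sqrt(1 - r^2), and builds the 95% confidence interval with the Fisher z transformation. The variable names are my own.

```r
r  <- -0.8780207   # sample correlation reported by cor.test()
df <- 58           # degrees of freedom reported by cor.test()
n  <- df + 2       # number of models in the analysis

# t statistic for testing that the true correlation is zero
t_stat <- r * sqrt(df) / sqrt(1 - r^2)

# Two-sided p-value from the t distribution
p_val <- 2 * pt(-abs(t_stat), df)

# 95% confidence interval via the Fisher z transformation
z  <- atanh(r)
se <- 1 / sqrt(n - 3)
ci <- tanh(z + c(-1, 1) * qnorm(0.975) * se)

round(t_stat, 3)   # agrees with the reported t = -13.971
round(ci, 4)       # agrees with the reported interval up to rounding
p_val
```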

Chapter_5. More Details on Chapter_4

In this chapter, we look in more detail at the relationship between weight and city MPG by car type and by number of cylinders. As stated in the research questions, the main purpose is to find out whether weight and city_mpg are negatively related across all car types and cylinder counts. Let’s see below.

Point_1. Weight and City_MPG by Category.

wt_mpg_category <- ggplot(selected_df, aes(x = wt, y = city_mpg)) + 
  geom_point(aes(shape = category, colour = category))
wt_mpg_category + 
  stat_smooth(method = "lm") + 
  facet_grid(~ category) + 
  xlab("Weight (1000 lbs)") +
  ylab("Miles/(US) gallon on City") +
  ggtitle("Mercedes car comparison")

As the graph above shows, every car type displays a negative relationship between the two variables. The pattern is not very different from the overall plot in Chapter_4.
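The faceted plot shows the direction of the relationship in each category, but not its strength as a number. A minimal way to quantify it would be one Pearson coefficient per category, as sketched below in base R. The `toy` data frame is hypothetical and only mimics the layout of selected_df (columns category, wt, and city_mpg).

```r
# Hypothetical data with the same column names as selected_df
toy <- data.frame(
  category = rep(c("Sedan", "SUV", "Coupe"), each = 4),
  wt       = c(3.4, 3.6, 4.0, 4.4,  4.2, 4.6, 5.1, 5.8,  3.5, 3.8, 4.1, 4.5),
  city_mpg = c(24, 23, 21, 19,      21, 19, 17, 13,      23, 21, 20, 17)
)

# Split the rows by category and compute one correlation per group
by_cat <- sapply(split(toy, toy$category),
                 function(d) cor(d$wt, d$city_mpg, method = "pearson"))

by_cat
```

Applying the same call to selected_df would give one coefficient per car type, which can then be compared directly.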

Point_2. Weight and City_MPG by Cylinders.

wt_mpg_cyl <- ggplot(selected_df, aes(x = wt, y = city_mpg)) + 
  geom_point(aes(shape = factor(cyl), colour = factor(cyl)))
wt_mpg_cyl + 
  stat_smooth(method = "lm") + 
  facet_grid(~ cyl) + 
  xlab("Weight (1000 lbs)") +
  ylab("Miles/(US) gallon on City") +
  ggtitle("Mercedes car comparison")

As the graph above shows, there is a negative relationship between the two variables for each number of cylinders. The pattern is not very different from the overall plot in Chapter_4.
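Each facet line drawn by stat_smooth(method = "lm") is a separate simple regression of city_mpg on wt. The sketch below reproduces that idea numerically by fitting lm() within each cylinder group and extracting the slope, which is the change in city MPG per additional 1000 lbs. The `toy` data frame is hypothetical and stands in for selected_df.

```r
# Hypothetical data with the same column names as selected_df
toy <- data.frame(
  cyl      = factor(rep(c(4, 6, 8), each = 4)),
  wt       = c(3.3, 3.5, 3.8, 4.1,  3.9, 4.3, 4.7, 5.2,  4.4, 4.8, 5.3, 5.9),
  city_mpg = c(25, 24, 22, 21,      21, 19, 18, 16,      17, 16, 14, 13)
)

# One simple linear regression per cylinder group; keep only the wt slope
slopes <- sapply(split(toy, toy$cyl),
                 function(d) coef(lm(city_mpg ~ wt, data = d))[["wt"]])

slopes
```

A negative slope in every group corresponds to a downward line in every facet, and differences between the slopes show how the strength of the effect varies by group.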

Chapter_6. Conclusion

The two questions I set out to explore were:

“Is weight (wt) negatively correlated with miles per (US) gallon in the city (city_mpg) across all car types?”
“Is weight (wt) negatively correlated with miles per (US) gallon in the city (city_mpg) across all cylinder counts?”

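The answer to both questions is yes: weight is negatively correlated with city MPG across car types and across cylinder counts. However, judging from the plots, the strength of the relationship differs from group to group.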

Correlation_Analysis_Mercedes_2017

Evan

June 29, 2017