Exploratory data analysis

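We have two small CSV files that describe some characteristics of vehicles. In cars_multi.csv we have the columns: ID, mpg, cylinders, displacement, horsepower, weight, acceleration, model, origin and car_name.

In cars_price.csv we have the columns: ID and price.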

My task is to understand how the data in these columns relate to each other, to uncover interesting things, and to communicate those findings. I’m going to focus on the correlation between mpg and the other properties.

# Import libraries
library(dplyr)
library(ggplot2)
library(corrplot)

# Load data
# Note: the Factor columns shown in the output further down appear when
# strings are read as factors (the read.csv default before R 4.0)
cars_multi <- read.csv("cars_multi.csv")
cars_price <- read.csv("cars_price.csv")

str(cars_price)
## 'data.frame':    398 obs. of  2 variables:
##  $ ID   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ price: num  25562 24221 27241 33685 20000 ...

Looking at the first 6 observations of each dataset

# Head of the dataset
head(cars_multi)
##   ID mpg cylinders displacement horsepower weight acceleration model
## 1  1  18         8          307        130   3504         12.0    70
## 2  2  15         8          350        165   3693         11.5    70
## 3  3  18         8          318        150   3436         11.0    70
## 4  4  16         8          304        150   3433         12.0    70
## 5  5  17         8          302        140   3449         10.5    70
## 6  6  15         8          429        198   4341         10.0    70
##   origin                  car_name
## 1      1 chevrolet chevelle malibu
## 2      1         buick skylark 320
## 3      1        plymouth satellite
## 4      1             amc rebel sst
## 5      1               ford torino
## 6      1          ford galaxie 500
head(cars_price)
##   ID    price
## 1  1 25561.59
## 2  2 24221.42
## 3  3 27240.84
## 4  4 33684.97
## 5  5 20000.00
## 6  6 30000.00

Dimensions

# Dimensions of the dataset
dim(cars_multi)
## [1] 398  10
dim(cars_price)
## [1] 398   2

We decided to merge the two datasets so that we can work with a single one

# Join the two datasets
cars <- left_join(cars_multi, cars_price, by = "ID")
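Before relying on a merged table, it can help to check that every ID in one file has a match in the other. The following is only a sketch: toy_multi and toy_price are made-up stand-ins for the real files, and their values are invented for illustration.

```r
library(dplyr)

# Hypothetical stand-ins for cars_multi and cars_price
toy_multi <- data.frame(ID = 1:4, mpg = c(18, 15, 18, 16))
toy_price <- data.frame(ID = c(1L, 2L, 3L, 5L),
                        price = c(25561.59, 24221.42, 27240.84, 20000))

# Rows of toy_multi that have no matching price
unmatched <- anti_join(toy_multi, toy_price, by = "ID")

# left_join keeps every row of toy_multi and fills missing prices with NA
toy_cars <- left_join(toy_multi, toy_price, by = "ID")

nrow(unmatched)             # number of IDs without a price
sum(is.na(toy_cars$price))  # the same count, seen as NA after the join
```

If both counts are zero on the real data, the left join loses nothing and adds no NA values.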

Now we have the following columns

colnames(cars)
##  [1] "ID"           "mpg"          "cylinders"    "displacement"
##  [5] "horsepower"   "weight"       "acceleration" "model"       
##  [9] "origin"       "car_name"     "price"

Checking missing cases

# Count rows that are not complete cases
sum(!complete.cases(cars))
## [1] 0

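There are 0 rows with missing values, apparently. Note that complete.cases() only detects NA, so missing values encoded in another way would not show up here.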
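A check with complete.cases() can miss values that are encoded as placeholder text instead of NA. As a rough sketch, assuming a made-up data frame toy and an illustrative list of placeholder strings, we can count such placeholders per column:

```r
# Hypothetical toy data: "?" marks an unknown horsepower
toy <- data.frame(
  mpg = c(18, 25, 21),
  horsepower = c("130", "?", "95"),
  stringsAsFactors = TRUE
)

# complete.cases() only looks for NA, so the "?" row counts as complete
n_incomplete <- sum(!complete.cases(toy))

# Count placeholder strings per column instead
# (this list of placeholders is an assumption, adjust it to the data)
placeholders <- c("?", "", "NA")
n_placeholder <- sapply(toy, function(col) sum(as.character(col) %in% placeholders))

n_incomplete
n_placeholder
```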

Overview of our dataset

summary(cars)
##        ID             mpg          cylinders      displacement  
##  Min.   :  1.0   Min.   : 9.00   Min.   :3.000   Min.   : 68.0  
##  1st Qu.:100.2   1st Qu.:17.50   1st Qu.:4.000   1st Qu.:104.2  
##  Median :199.5   Median :23.00   Median :4.000   Median :148.5  
##  Mean   :199.5   Mean   :23.51   Mean   :5.455   Mean   :193.4  
##  3rd Qu.:298.8   3rd Qu.:29.00   3rd Qu.:8.000   3rd Qu.:262.0  
##  Max.   :398.0   Max.   :46.60   Max.   :8.000   Max.   :455.0  
##                                                                 
##    horsepower      weight      acceleration       model      
##  150    : 22   Min.   :1613   Min.   : 8.00   Min.   :70.00  
##  90     : 20   1st Qu.:2224   1st Qu.:13.82   1st Qu.:73.00  
##  88     : 19   Median :2804   Median :15.50   Median :76.00  
##  110    : 18   Mean   :2970   Mean   :15.57   Mean   :76.01  
##  100    : 17   3rd Qu.:3608   3rd Qu.:17.18   3rd Qu.:79.00  
##  75     : 14   Max.   :5140   Max.   :24.80   Max.   :82.00  
##  (Other):288                                                 
##      origin                car_name       price      
##  Min.   :1.000   ford pinto    :  6   Min.   : 1598  
##  1st Qu.:1.000   amc matador   :  5   1st Qu.:23110  
##  Median :1.000   ford maverick :  5   Median :30000  
##  Mean   :1.573   toyota corolla:  5   Mean   :29684  
##  3rd Qu.:2.000   amc gremlin   :  4   3rd Qu.:36430  
##  Max.   :3.000   amc hornet    :  4   Max.   :53746  
##                  (Other)       :369

Structure

Our data frame has the structure

# Structure of the dataset
str(cars)
## 'data.frame':    398 obs. of  11 variables:
##  $ ID          : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ mpg         : num  18 15 18 16 17 15 14 14 14 15 ...
##  $ cylinders   : int  8 8 8 8 8 8 8 8 8 8 ...
##  $ displacement: num  307 350 318 304 302 429 454 440 455 390 ...
##  $ horsepower  : Factor w/ 94 levels "?","100","102",..: 17 35 29 29 24 42 47 46 48 40 ...
##  $ weight      : int  3504 3693 3436 3433 3449 4341 4354 4312 4425 3850 ...
##  $ acceleration: num  12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
##  $ model       : int  70 70 70 70 70 70 70 70 70 70 ...
##  $ origin      : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ car_name    : Factor w/ 305 levels "amc ambassador brougham",..: 50 37 232 15 162 142 55 224 242 2 ...
##  $ price       : num  25562 24221 27241 33685 20000 ...

Looking at each variable

MPG

MPG means miles per gallon, and we want to know its most common value

ggplot(cars, aes(mpg)) + 
  geom_histogram(binwidth = 5) +
  labs(title = "Histogram of MPG", y = "Count") +
  theme_classic()

We can see that the most common mpg is somewhere between 15 and 20

Cylinders

ggplot(cars, aes(cylinders)) + 
  geom_bar() +
  labs(title = "Cylinders", y = "Count") +
  theme_classic()

For cylinders we can see that 4-cylinder cars appear about twice as often as 8-cylinder cars

Displacement

Using box plot we can see:

boxplot(cars$displacement, main = "Box Plot Displacement",
    xlab = "", ylab = "Displacement")

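The distribution is skewed to the right: the median (148.5) sits well below the mean (193.4), so the values above the median are much more spread out than the values below it.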
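One way to check how the values sit around the average is to compare the mean with the median and to compute the share of values above the mean. This is a minimal sketch that uses the disp column of the built-in mtcars data as a stand-in for our displacement column, so the numbers it prints are not the ones from our dataset.

```r
# Stand-in data: mtcars ships with R; disp is engine displacement
x <- mtcars$disp

x_mean <- mean(x)
x_median <- median(x)

# Share of observations above the mean
share_above_mean <- mean(x > x_mean)

c(mean = x_mean, median = x_median, share_above_mean = share_above_mean)
```

A mean above the median together with a share below 0.5 points to a right-skewed distribution.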

Horsepower

Here we realize that horsepower has some missing values, encoded as "?"

count(cars[as.character(cars$horsepower) == "?",])
## # A tibble: 1 x 1
##       n
##   <int>
## 1     6

In fact, we have 6 missing values in horsepower.
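An alternative to spotting the "?" afterwards is to declare it as a missing-value marker when reading the file, using the na.strings argument of read.csv. The sketch below reads a small made-up CSV from a string (csv_text is hypothetical; the real file is cars_multi.csv).

```r
# Hypothetical CSV content, invented for illustration
csv_text <- "ID,mpg,horsepower
1,18,130
2,25,?
3,21,95"

# Treat "?" as NA while reading, so horsepower is parsed as a number
toy <- read.csv(text = csv_text, na.strings = "?")

is.numeric(toy$horsepower)  # TRUE
sum(is.na(toy$horsepower))  # 1
```

With this approach complete.cases() would also report the affected rows.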

Weight

ggplot(cars, aes(weight)) + 
  geom_histogram(binwidth = 5) +
  labs(title = "Histogram of Weight", y = "Count") +
  theme_classic()

For weight we see that the most common values are somewhere between 2000 and 3000. More importantly, with a bin width of only 5 almost every bar has a height of one, which tells us that the majority of the cars have a unique weight value.
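How much a histogram reveals depends on the bin width relative to the scale of the variable. As a rough illustration, the sketch below counts observations per bin for a narrow and a wide bin width, using the weights from the built-in mtcars data as a stand-in; bin_counts is a hypothetical helper written for this example.

```r
# Stand-in data: mtcars ships with R; wt is in units of 1000 lbs
weight <- round(mtcars$wt * 1000)

# Hypothetical helper: number of observations in each bin of a given width
bin_counts <- function(x, width) {
  breaks <- seq(floor(min(x) / width) * width,
                ceiling(max(x) / width) * width + width,
                by = width)
  as.vector(table(cut(x, breaks = breaks, right = FALSE)))
}

narrow <- bin_counts(weight, 5)
wide <- bin_counts(weight, 500)

max(narrow)  # tallest bar with a bin width of 5
max(wide)    # tallest bar with a bin width of 500
```

With the narrow width most bars hold a single car, while the wide width shows where the weights concentrate.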

Acceleration

ggplot(cars, aes(acceleration)) + 
  geom_density() +
  labs(title = "Density of Acceleration") +
  theme_classic()

We see that the density of acceleration is concentrated around 15

Model

to_Plot <- as.data.frame(table(cars$model))
colnames(to_Plot) <- c("Model", "Frequency")

ggplot(to_Plot, aes(x = Model, y = Frequency)) + 
  geom_bar(stat = "identity") +
  labs(title = "Model") +
  theme_classic()

As we can see, the sample is well balanced across models

Origin

ggplot(cars, aes(origin)) + 
  geom_bar() +
  labs(title = "Origin", y = "Count") +
  theme_classic()

The majority of the cars are from origin 1

Price

For price we decided to do some preprocessing to make it simpler. We are going to keep only the value before the decimal point. For example, if we have 1598.07337 we are going to keep only 1598.

We made this decision because we believe that the value after the decimal point is meaningless

cars$price <- as.integer(cars$price)
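Note that as.integer() truncates toward zero and does not round. A tiny sketch with a few sample prices (the vector prices is made up from values shown earlier) makes the difference visible:

```r
prices <- c(1598.07337, 25561.59, 24221.42)

truncated <- as.integer(prices)  # drops the fractional part: 1598 25561 24221
rounded <- round(prices)         # rounds to the nearest integer: 1598 25562 24221

truncated
rounded
```

For this analysis the difference of at most one unit is negligible, but it explains why a price displayed as 25562 by str() becomes 25561 after the conversion.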

boxplot(cars$price, main = "Price Box Plot",
    xlab = "", ylab = "Price")

Looking at the box plot of price, we can see that there is only one outlier

to_Plot <- as.data.frame(table(cars$price))
colnames(to_Plot) <- c("Price", "Frequency")

ggplot(head(to_Plot[ order(-to_Plot[,2]), ]), aes(x = reorder(Price, Frequency), y = Frequency)) + 
  geom_bar(stat = "identity") +
  labs(title = "Common Price", x = "Price") +
  theme_classic() + 
  coord_flip()

With this visualization we can see that 3 prices are repeated more than 40 times each. We have 219 unique prices. This could be a problem if we have to predict the price of the cars, because the data is unbalanced
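The same information can be read from a frequency table instead of a bar chart. A minimal sketch with dplyr, using a made-up price vector (the data frame toy is hypothetical):

```r
library(dplyr)

# Hypothetical prices, invented for illustration
toy <- data.frame(price = c(20000, 30000, 30000, 25561, 30000, 20000, 33684))

# Frequency of each price, most frequent first
price_counts <- toy %>%
  count(price, sort = TRUE)

n_unique <- n_distinct(toy$price)

head(price_counts, 3)
n_unique
```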

Correlation

In this plot we can see the correlation between all features. Two features can have a positive correlation, a negative correlation or a neutral correlation. A red dot means a negative correlation: when one value gets higher, the other gets lower. The bigger the dot, the more negative the value. A blue dot means a positive correlation. A blank cell means that the two variables have no correlation.

# Transforming from factor to numeric
cars$horsepower <- as.numeric(as.character(cars$horsepower))

# Removing incomplete rows
cars <- cars[complete.cases(cars),]

# Removing the ID
cars <- cars[,-1]

nums <- sapply(cars, is.numeric)
correlations <- cor(cars[,nums])
corrplot(correlations, order = "hclust")
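A sorted vector of the correlations with mpg can be easier to read than the dots alone. The sketch below assumes the built-in mtcars data as a stand-in, so its column names (wt, hp, qsec and so on) and values differ from those of our cars data.

```r
# Stand-in data: mtcars ships with R
nums <- sapply(mtcars, is.numeric)
correlations <- cor(mtcars[, nums])

# Correlation of every other numeric column with mpg, most negative first
mpg_cor <- sort(correlations["mpg", colnames(correlations) != "mpg"])

round(mpg_cor, 2)
```

The same two lines applied to the correlations matrix of the cars data would rank horsepower, weight, cylinders and the other features by their correlation with mpg.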

Looking at MPG, we can see that it has a negative correlation with horsepower, weight, cylinders and displacement, which makes total sense. On the other hand, MPG has a positive correlation with origin, acceleration and model. The correlation between price and mpg is neutral.

Correlation: Some Individual Visualizations

To make positive, negative and neutral correlation clearer, we are going to look at some examples from our dataset.

Positive Correlation

We saw that MPG and Acceleration have a positive correlation, and you can confirm that in the next graphic:

ggplot(cars, aes(mpg, acceleration)) + 
  geom_jitter() + 
  theme_classic() +
  geom_smooth(method = "lm", se = FALSE)

You can see that the line that best fits our data is an increasing line, which is what we expected. If MPG and Acceleration had a perfect positive correlation, we would see something like this:

ggplot(cars, aes(mpg, mpg)) + 
  geom_jitter() + 
  theme_classic() +
  labs(title = "Best Positive Correlation", y = "", x = "") +
  geom_smooth(method = "lm", se = FALSE)

Negative Correlation

We saw that MPG and Horsepower have a strong negative correlation, as you can see next

ggplot(cars, aes(mpg, horsepower)) + 
  geom_jitter() + 
  theme_classic() +
  geom_smooth(method = "lm", se = FALSE)

You can see that the line that best fits our data is a decreasing line, which is what we expected. If MPG and Horsepower had a perfect negative correlation, we would see something like this:

ggplot(cars, aes(mpg, -mpg)) + 
  geom_jitter() + 
  theme_classic() +
  labs(title = "Best Negative Correlation", y = "", x = "") +
  geom_smooth(method = "lm", se = FALSE)

Neutral Correlation

We saw that MPG and Price have a neutral correlation, as you can see next

ggplot(cars, aes(mpg, price)) + 
  geom_jitter() + 
  theme_classic() +
  geom_smooth(method = "lm", se = FALSE)

You can see that the line that best fits our data is nearly horizontal.
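The three cases can also be checked numerically with cor(), and the sign of the fitted slope from lm() follows the sign of the correlation. This is only a sketch on the built-in mtcars data, used as a stand-in; there hp is horsepower and qsec is the quarter-mile time.

```r
# Stand-in data: mtcars ships with R
x <- mtcars$mpg

perfect_pos <- cor(x, x)    # perfect positive correlation
perfect_neg <- cor(x, -x)   # perfect negative correlation
neg <- cor(x, mtcars$hp)    # negative in mtcars
pos <- cor(x, mtcars$qsec)  # positive in mtcars

# The slope of the fitted line has the same sign as the correlation
fit <- lm(hp ~ mpg, data = mtcars)
slope <- coef(fit)[["mpg"]]

c(perfect_pos = perfect_pos, perfect_neg = perfect_neg,
  neg = neg, pos = pos, slope = slope)
```

A correlation close to zero would correspond to a fitted line that is close to horizontal.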

Conclusion

I don't have any background with cars, but the negative correlation between mpg and horsepower, weight, cylinders and displacement made sense. However, I didn't understand why mpg and acceleration have a positive correlation; I was expecting a negative one.

Another surprise for me during the exploratory data analysis was that price and mpg have no correlation.

For the next step I’m excited to create and run some models to predict the price of the cars based on this dataset. How good can my model be with only this data? Will the R-squared that I can reach be good enough?