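We have two small CSV files that describe some characteristics of vehicles. In cars_multi.csv we have the columns: ID, mpg, cylinders, displacement, horsepower, weight, acceleration, model, origin, and car_name.
In cars_price.csv we have the columns: ID and price.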
My task is to understand how the data in these columns relate to each other, to uncover interesting things, and to communicate those findings. I’m going to focus on the correlation between mpg and the other properties.
# Importing libraries
library(dplyr)
library(ggplot2)
library(corrplot)
# Load data
cars_multi <- read.csv("cars_multi.csv")
cars_price <- read.csv("cars_price.csv")
str(cars_price)
## 'data.frame': 398 obs. of 2 variables:
## $ ID : int 1 2 3 4 5 6 7 8 9 10 ...
## $ price: num 25562 24221 27241 33685 20000 ...
Looking at the first six observations of each dataset
# Head of the dataset
head(cars_multi)
##   ID mpg cylinders displacement horsepower weight acceleration model
## 1  1  18         8          307        130   3504         12.0    70
## 2  2  15         8          350        165   3693         11.5    70
## 3  3  18         8          318        150   3436         11.0    70
## 4  4  16         8          304        150   3433         12.0    70
## 5  5  17         8          302        140   3449         10.5    70
## 6  6  15         8          429        198   4341         10.0    70
##   origin                  car_name
## 1      1 chevrolet chevelle malibu
## 2      1         buick skylark 320
## 3      1        plymouth satellite
## 4      1             amc rebel sst
## 5      1               ford torino
## 6      1          ford galaxie 500
head(cars_price)
##   ID    price
## 1  1 25561.59
## 2  2 24221.42
## 3  3 27240.84
## 4  4 33684.97
## 5  5 20000.00
## 6  6 30000.00
Dimensions
# Dimensions of the dataset
dim(cars_multi)
## [1] 398 10
dim(cars_price)
## [1] 398 2
We decided to merge the two datasets so that we can work with a single one
# Join the two datasets
cars <- left_join(cars_multi, cars_price, by = "ID")
Now we have the following columns
colnames(cars)
## [1] "ID" "mpg" "cylinders" "displacement"
## [5] "horsepower" "weight" "acceleration" "model"
## [9] "origin" "car_name" "price"
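Before trusting a join like this, it can be worth checking that ID really matches the rows one to one. The following is only a sketch: `multi_toy` and `price_toy` are made-up stand-ins for the real files, and base R's merge() with all.x = TRUE is used as an equivalent of a left join so the example runs without extra packages.

```r
# Toy stand-ins for cars_multi and cars_price (values are illustrative only)
multi_toy <- data.frame(ID = 1:4, mpg = c(18, 15, 18, 16))
price_toy <- data.frame(ID = 1:4, price = c(25561.59, 24221.42, 27240.84, 33684.97))

# No ID should appear twice on either side
ids_unique <- anyDuplicated(multi_toy$ID) == 0 && anyDuplicated(price_toy$ID) == 0

# Both tables should contain exactly the same set of IDs
ids_match <- setequal(multi_toy$ID, price_toy$ID)

# Base R equivalent of a left join on ID
joined <- merge(multi_toy, price_toy, by = "ID", all.x = TRUE)

# A one-to-one join keeps the row count and leaves no price missing
rows_kept <- nrow(joined) == nrow(multi_toy)
no_missing_price <- !anyNA(joined$price)
```

If any of these checks came back FALSE on the real data, the merged data frame would contain duplicated or unmatched rows.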
Checking for missing cases
# Complete cases
sum(!complete.cases(cars))
## [1] 0
Apparently there are no rows with missing values.
Overview of our dataset
summary(cars)
## ID mpg cylinders displacement
## Min. : 1.0 Min. : 9.00 Min. :3.000 Min. : 68.0
## 1st Qu.:100.2 1st Qu.:17.50 1st Qu.:4.000 1st Qu.:104.2
## Median :199.5 Median :23.00 Median :4.000 Median :148.5
## Mean :199.5 Mean :23.51 Mean :5.455 Mean :193.4
## 3rd Qu.:298.8 3rd Qu.:29.00 3rd Qu.:8.000 3rd Qu.:262.0
## Max. :398.0 Max. :46.60 Max. :8.000 Max. :455.0
##
## horsepower weight acceleration model
## 150 : 22 Min. :1613 Min. : 8.00 Min. :70.00
## 90 : 20 1st Qu.:2224 1st Qu.:13.82 1st Qu.:73.00
## 88 : 19 Median :2804 Median :15.50 Median :76.00
## 110 : 18 Mean :2970 Mean :15.57 Mean :76.01
## 100 : 17 3rd Qu.:3608 3rd Qu.:17.18 3rd Qu.:79.00
## 75 : 14 Max. :5140 Max. :24.80 Max. :82.00
## (Other):288
## origin car_name price
## Min. :1.000 ford pinto : 6 Min. : 1598
## 1st Qu.:1.000 amc matador : 5 1st Qu.:23110
## Median :1.000 ford maverick : 5 Median :30000
## Mean :1.573 toyota corolla: 5 Mean :29684
## 3rd Qu.:2.000 amc gremlin : 4 3rd Qu.:36430
## Max. :3.000 amc hornet : 4 Max. :53746
## (Other) :369
Our data frame has the structure
# Structure of the dataset
str(cars)
## 'data.frame': 398 obs. of 11 variables:
## $ ID : int 1 2 3 4 5 6 7 8 9 10 ...
## $ mpg : num 18 15 18 16 17 15 14 14 14 15 ...
## $ cylinders : int 8 8 8 8 8 8 8 8 8 8 ...
## $ displacement: num 307 350 318 304 302 429 454 440 455 390 ...
## $ horsepower : Factor w/ 94 levels "?","100","102",..: 17 35 29 29 24 42 47 46 48 40 ...
## $ weight : int 3504 3693 3436 3433 3449 4341 4354 4312 4425 3850 ...
## $ acceleration: num 12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
## $ model : int 70 70 70 70 70 70 70 70 70 70 ...
## $ origin : int 1 1 1 1 1 1 1 1 1 1 ...
## $ car_name : Factor w/ 305 levels "amc ambassador brougham",..: 50 37 232 15 162 142 55 224 242 2 ...
## $ price : num 25562 24221 27241 33685 20000 ...
mpg stands for miles per gallon, and we want to know which values are most common
ggplot(cars, aes(mpg)) +
geom_histogram(binwidth = 5) +
labs(title = "Histogram of MPG", y = "Count") +
theme_classic()
We can see that the most common mpg values fall roughly between 15 and 20
ggplot(cars, aes(cylinders)) +
geom_bar() +
labs(title = "Cylinders", y = "Count") +
theme_classic()
For cylinders, we can see that 4-cylinder cars are about twice as common as 8-cylinder cars
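Using a box plot we can see:
boxplot(cars$displacement, main = "Box Plot Displacement",
xlab = "", ylab = "Displacement")
The distribution is right-skewed: the median (148.5) sits well below the mean (193.4), and the values above the median are spread over a much wider range than the values below it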
Here we realize that horsepower has some missing values, recorded as "?"
count(cars[as.character(cars$horsepower) == "?",])
## # A tibble: 1 x 1
## n
## <int>
## 1 6
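In fact, we have 6 missing values in horsepower. The complete.cases() check did not catch them because "?" is stored as an ordinary value, not as NA.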
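One way to make placeholders like this visible from the start is the na.strings argument of read.csv(). The sketch below is a minimal illustration on a made-up three-row CSV passed through the text argument (`csv_text` is hypothetical, not the real file):

```r
# A tiny made-up CSV in which "?" marks a missing horsepower value
csv_text <- "ID,horsepower\n1,130\n2,?\n3,150\n"

# Default read: "?" is kept as an ordinary string, so no row looks incomplete
raw <- read.csv(text = csv_text, stringsAsFactors = FALSE)
missing_raw <- sum(!complete.cases(raw))

# Telling read.csv that "?" means missing turns it into NA
clean <- read.csv(text = csv_text, na.strings = "?")
missing_clean <- sum(!complete.cases(clean))

# With the placeholder gone, the column is read as a number
hp_is_numeric <- is.numeric(clean$horsepower)
```

Here `missing_raw` is 0 while `missing_clean` is 1, which mirrors what happened with the real data.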
ggplot(cars, aes(weight)) +
geom_histogram(binwidth = 5) +
labs(title = "Histogram of Weight", y = "Count") +
theme_classic()
For weight, we see that the most common values are somewhere between 2000 and 3000. More importantly, because the bin width is only 5, most bars have a height of one, which tells us that most cars have a weight that no other car shares
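How much a histogram reveals depends heavily on the bin width. As a rough sketch, the counts behind two histograms can be compared with hist(plot = FALSE); the `weights` vector is a small made-up sample, and the two bin widths (5 and 500) are arbitrary choices for illustration.

```r
# A small made-up sample of weights (not the full dataset)
weights <- c(1613, 1800, 2130, 2130, 2264, 2804, 2970,
             3433, 3436, 3449, 3504, 5140)

# Very narrow bins: almost every bin holds zero or one car
narrow <- hist(weights, breaks = seq(1610, 5145, by = 5), plot = FALSE)$counts

# Wide bins: the overall shape of the distribution becomes visible
wide <- hist(weights, breaks = seq(1500, 5500, by = 500), plot = FALSE)$counts

# Number of distinct weights, a direct check of the "unique weight" observation
n_unique <- length(unique(weights))
```

With the narrow bins the tallest bar has height 2 (the repeated 2130), while the wide bins group up to three cars together.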
ggplot(cars, aes(acceleration)) +
geom_density() +
labs(title = "Density of Acceleration") +
theme_classic()
We see that the density of acceleration is concentrated around 15
to_Plot <- as.data.frame(table(cars$model))
colnames(to_Plot) <- c("Model", "Frequency")
ggplot(to_Plot, aes(x = Model, y = Frequency)) +
geom_bar(stat = "identity") +
labs(title = "Model") +
theme_classic()
As we can see, the sample is well balanced across the values of model
ggplot(cars, aes(origin)) +
geom_bar() +
labs(title = "Origin", y = "Count") +
theme_classic()
The majority of the cars come from origin 1
For price, we decided to do some preprocessing to make it simpler. We are going to keep only the value before the decimal point. For example, if we have 1598.07337 we are going to keep only 1598.
We made this decision because we believe that the value after the decimal point is meaningless
cars$price <- as.integer(cars$price)
boxplot(cars$price, main = "Price Box Plot",
xlab = "", ylab = "Price")
Looking at the box plot of price, we can see that there is only one outlier
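Note that as.integer() truncates rather than rounds. A small sketch with a few of the prices seen earlier shows the difference; whether truncation or rounding is preferable is a judgment call, and round() is shown only for comparison.

```r
# A few prices taken from the examples above
prices <- c(1598.07337, 25561.59, 24221.42, 33684.97)

# as.integer() drops everything after the decimal point
truncated <- as.integer(prices)   # 1598 25561 24221 33684

# round() goes to the nearest whole number instead
rounded <- round(prices)          # 1598 25562 24221 33685
```

The two results differ by at most 1, which is negligible at this price scale.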
to_Plot <- as.data.frame(table(cars$price))
colnames(to_Plot) <- c("Price", "Frequency")
ggplot(head(to_Plot[ order(-to_Plot[,2]), ]), aes(x = reorder(Price, Frequency), y = Frequency)) +
geom_bar(stat = "identity") +
labs(title = "Common Price", x = "Price") +
theme_classic() +
coord_flip()
With this visualization, we can see that 3 prices are repeated more than 40 times each. We have 219 unique prices. This could be a problem if we have to predict the price of the cars, because the data is unbalanced
The plot below shows the correlation between all numeric features. Two features can have a positive correlation, a negative correlation, or a neutral correlation. A red dot means a negative correlation: when one value gets higher, the other gets lower. The bigger the red dot, the more negative the value. A blue dot means a positive correlation. A blank (or nearly blank) cell means that the two variables have little or no correlation
# Transforming from factor to numeric
cars$horsepower <- as.numeric(as.character(cars$horsepower))
# Removing incomplete rows
cars <- cars[complete.cases(cars),]
# Removing the ID
cars <- cars[,-1]
nums <- sapply(cars, is.numeric)
correlations <- cor(cars[,nums])
corrplot(correlations, order = "hclust")
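The detour through as.character() in the conversion above matters. Calling as.numeric() directly on a factor returns the internal level codes, not the numbers printed on screen. A minimal sketch on a made-up factor (`hp` is hypothetical):

```r
# A made-up factor that mimics the horsepower column, including the "?" marker
hp <- factor(c("130", "165", "?", "150"))

# Direct conversion gives the integer codes of the levels, not the values
codes <- as.numeric(hp)

# Converting to character first recovers the real numbers; "?" becomes NA
# (the coercion warning is expected, so it is silenced here)
values <- suppressWarnings(as.numeric(as.character(hp)))
```

`values` holds 130, 165, NA, 150, while `codes` holds small integers between 1 and 4 whose order depends on how the levels are sorted.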
Looking at MPG, we can see that it has a negative correlation with horsepower, weight, cylinders, and displacement, which makes total sense. On the other hand, MPG has a positive correlation with origin, acceleration, and model. The correlation between price and MPG is neutral.
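The same reading can be done numerically by pulling one row out of the correlation matrix and sorting it. The sketch below uses mtcars, a different car dataset that ships with R, purely as a stand-in so that it runs without the CSV files. Its column names (cyl, disp, hp, wt, qsec) and its values are not those of our data.

```r
# mtcars ships with R; it only stands in for the cars data frame here
stand_in_cor <- cor(mtcars[, c("mpg", "cyl", "disp", "hp", "wt", "qsec")])

# Keep the mpg row and drop the trivial correlation of mpg with itself
mpg_cor <- stand_in_cor["mpg", colnames(stand_in_cor) != "mpg"]

# Sort from most negative to most positive
mpg_sorted <- sort(mpg_cor)
```

In this stand-in data the pattern is similar: weight, cylinders, displacement, and horsepower are negatively correlated with mpg, while qsec (quarter-mile time) is positively correlated.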
To make positive, negative, and neutral correlation clearer, we are going to look at some examples from our dataset.
We saw that MPG and Acceleration have a positive correlation. You can confirm that in the next graphic:
ggplot(cars, aes(mpg, acceleration)) +
geom_jitter() +
theme_classic() +
geom_smooth(method = "lm", se = FALSE)
You can see that the line that best fits our data is an increasing line, which is what we expected. If MPG and Acceleration had a perfect positive correlation, we would see something like this:
ggplot(cars, aes(mpg, mpg)) +
geom_jitter() +
theme_classic() +
labs(title = "Best Positive Correlation", y = "", x = "") +
geom_smooth(method = "lm", se = FALSE)
We saw that MPG and Horsepower have a strong negative correlation, as you can see next
ggplot(cars, aes(mpg, horsepower)) +
geom_jitter() +
theme_classic() +
geom_smooth(method = "lm", se = FALSE)
You can see that the line that best fits our data is a decreasing line, which is what we expected. If MPG and Horsepower had a perfect negative correlation, we would see something like this:
ggplot(cars, aes(mpg, -mpg)) +
geom_jitter() +
theme_classic() +
labs(title = "Best Negative Correlation", y = "", x = "") +
geom_smooth(method = "lm", se = FALSE)
We saw that MPG and Price have a neutral correlation, as you can see next
ggplot(cars, aes(mpg, price)) +
geom_jitter() +
theme_classic() +
geom_smooth(method = "lm", se = FALSE)
You can see that the line that best fits our data is almost flat: price barely changes as MPG increases.
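The three situations above can also be checked with cor() directly. This is only a sketch on made-up vectors (`x`, `z`, and `z_squared` are hypothetical), chosen so that the results are easy to reason about:

```r
# A made-up set of mpg-like values
x <- c(9, 15, 18, 23, 29, 46.6)

# A variable compared with itself: perfect positive correlation
perfect_pos <- cor(x, x)

# A variable compared with its negative: perfect negative correlation
perfect_neg <- cor(x, -x)

# A symmetric pattern with no linear trend: correlation of zero
z <- c(-2, -1, 0, 1, 2)
z_squared <- z^2
no_linear <- cor(z, z_squared)
```

The last case is a reminder that a correlation of zero only rules out a linear relationship; `z_squared` is fully determined by `z`.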
I don't have any background in cars, but the negative correlation between mpg and horsepower, weight, cylinders, and displacement made sense. What I didn't understand is why mpg and acceleration have a positive correlation; I was expecting a negative one.
Another surprise for me during the exploratory data analysis was that price and mpg have no correlation.
For the next step, I’m excited to create and run some models to predict the price of the cars based on this dataset. How good can my model be with only this data? Will the R-squared I can get be good enough?
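As a pointer for that next step, here is a minimal sketch of how an R-squared can be read from a linear model in R. It again uses mtcars as a stand-in and predicts mpg from weight, which is not the price model planned above; the formula and the dataset are assumptions made only so that the example runs on its own.

```r
# mtcars ships with R and is only a stand-in for the cars data here
fit <- lm(mpg ~ wt, data = mtcars)

# Share of the variance in mpg explained by the model
r_squared <- summary(fit)$r.squared

# Adjusted version, which penalizes extra predictors
adj_r_squared <- summary(fit)$adj.r.squared

# With a single predictor, R-squared equals the squared correlation
squared_cor <- cor(mtcars$mpg, mtcars$wt)^2
```

Given how flat the MPG versus price plot is, a model for price would likely need more than mpg to reach a useful R-squared.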