The data set used in this project aggregates 2,000 rows of vehicle information. However, the data set does not contain all of a vehicle’s descriptive qualities, such as the many available luxury features like cruise control and automatic windows or safety features like airbags or anti-lock brakes. Regardless, the project will follow a deliberate process of cleaning, manipulating, exploratory data exploration, and modeling to answer several questions.
Analysis is focused on the following questions:
Is there a connection between any of the descriptive variables in each row and the price of the vehicle? If so, what variables are most closely related to the price of the vehicle?
Is there a most popular vehicle model contained in the data set?
What are the ranges of price that could be expected for vehicles in the dataset?
If prices follow a pattern, what information can be imputed to provide an understanding of the value for the purchase or sale of the vehicle?
Is it possible to predict the cost of the vehicle based on the information in the dataset?
Analysis for this project begins by importing the data set and gathering the basic properties of the data set. Each row within the data set contains numerous descriptive variables. Variables such as Age, KM, HP, Doors, Weight, and CC describe vehicle features in numeric format. The fuel type variable represents the type of fuel needed to power the vehicle and contains character values of “CNG” (Compressed Natural Gas), “Diesel”, and “Petrol”. Two additional numeric variables are used as categorical feature encoding. In the MetColor and Automatic variables, a value of 1 denotes that the vehicle is metallic or an automatic transmission vehicle. A value of 0 in those variables indicates that the vehicle is not metallic colored or is a manual transmission vehicle. At the onset of this project, the data set is 2036 rows and 10 columns.
#import data set
data <- read.csv("ToyotaCorolla2.csv")
# Call basic summary statistics of the data set
summary(data)
## Price Age KM FuelType
## Min. : 4350.00 Min. : 1.00000 Min. : 1.00 Length:2036
## 1st Qu.: 8750.00 1st Qu.:41.00000 1st Qu.: 39136.75 Class :character
## Median :10322.50 Median :54.00000 Median : 61135.50 Mode :character
## Mean :11166.29 Mean :52.44745 Mean : 67298.85
## 3rd Qu.:12450.00 3rd Qu.:67.00000 3rd Qu.: 88541.75
## Max. :32500.00 Max. :80.00000 Max. :243000.00
## HP MetColor Automatic CC
## Min. : 69.0000 Min. :0.0000000 Min. :0.00000000 Min. :1300.000
## 1st Qu.: 97.0000 1st Qu.:0.0000000 1st Qu.:0.00000000 1st Qu.:1400.000
## Median :110.0000 Median :1.0000000 Median :0.00000000 Median :1600.000
## Mean :101.4224 Mean :0.6807466 Mean :0.05009823 Mean :1575.857
## 3rd Qu.:110.0000 3rd Qu.:1.0000000 3rd Qu.:0.00000000 3rd Qu.:1600.000
## Max. :192.0000 Max. :1.0000000 Max. :1.00000000 Max. :2000.000
## Doors Weight
## Min. :2.000000 Min. :1000.000
## 1st Qu.:3.000000 1st Qu.:1040.000
## Median :4.000000 Median :1070.000
## Mean :4.083988 Mean :1075.027
## 3rd Qu.:5.000000 3rd Qu.:1095.000
## Max. :5.000000 Max. :1615.000
# Gather dimensions of the data set
dim(data)
## [1] 2036 10
Analysis for this project begins by importing the data set and gathering an understanding of the structure and dimensions of the data set. Then project underwent effort to understand the quality of data by eliminating duplicated observations.
# Visualize the location and frequency of missing data
vis_miss(data)
Initial inspection showed no instances of missing values, but there are numerous duplicated rows. While it is possible that vehicles in a data set could contain many shared values, the connection of each variable aligned with price and odometer reading raised suspicions about the quality of each of those instances and was therefore removed.
# Identify the duplicated rows and save the duplicates as a new data frame
duplicated <- data[duplicated(data), ]
# Create new data frame entirely of unique entries across all rows and columns
cars <- data %>% distinct()
The data set now contains 1435 distinct rows and is of high enough quality to begin exploratory analysis.
Because it is known that vehicle manufacturers discontinue models of vehicles and release new models, understanding the age of the year of production could illuminate information crucial to answering established research questions. Therefore, it was necessary to impute a value that describes the year of production. The year of production is expected to assist in identifying manufacturing trends and future vehicle classification.
# Creates a Year column based on the values of the Age Column.
cars <- cars %>%
mutate(year = 2004 - floor((Age - 1 + 6)/12))
# Mutate variables into a factor format
cars$FuelType <- as.factor(cars$FuelType)
To gain a further understanding of the data’s contents, the research progressed by gathering measures of central tendency and data distribution per variable. In every box plot in this paper, the white circle with a black outline denotes the mean value of that variable.
summary(cars)
## Price Age KM FuelType
## Min. : 4350.00 Min. : 1.00000 Min. : 1.00 CNG : 17
## 1st Qu.: 8450.00 1st Qu.:44.00000 1st Qu.: 43000.00 Diesel: 154
## Median : 9900.00 Median :61.00000 Median : 63451.00 Petrol:1264
## Mean :10720.92 Mean :55.98049 Mean : 68571.78
## 3rd Qu.:11950.00 3rd Qu.:70.00000 3rd Qu.: 87041.50
## Max. :32500.00 Max. :80.00000 Max. :243000.00
## HP MetColor Automatic CC
## Min. : 69.000 Min. :0.0000000 Min. :0.00000000 Min. :1300.000
## 1st Qu.: 90.000 1st Qu.:0.0000000 1st Qu.:0.00000000 1st Qu.:1400.000
## Median :110.000 Median :1.0000000 Median :0.00000000 Median :1600.000
## Mean :101.492 Mean :0.6745645 Mean :0.05574913 Mean :1566.526
## 3rd Qu.:110.000 3rd Qu.:1.0000000 3rd Qu.:0.00000000 3rd Qu.:1600.000
## Max. :192.000 Max. :1.0000000 Max. :1.00000000 Max. :2000.000
## Doors Weight year
## Min. :2.000000 Min. :1000.000 Min. :1997.000
## 1st Qu.:3.000000 1st Qu.:1040.000 1st Qu.:1998.000
## Median :4.000000 Median :1070.000 Median :1999.000
## Mean :4.032753 Mean :1072.287 Mean :1999.385
## 3rd Qu.:5.000000 3rd Qu.:1085.000 3rd Qu.:2000.000
## Max. :5.000000 Max. :1615.000 Max. :2004.000
# Histogram of the price of the vehicle in the data set
hist_price <- cars %>%
ggplot(aes(x = Price)) +
geom_histogram(color = "steelblue4", fill = "lightblue2") +
labs(title = "Histogram of Price",
subtitle = "Toyota Corolla Dataset",
x = "Price",
y= "Frequency") +
theme_minimal()
box_price <- cars %>%
ggplot(aes(x = "Category", y = Price)) +
geom_boxplot(color = "steelblue4", outlier.colour = "red") +
stat_summary(fun = "mean", shape = 21, color = "black") +
labs(title = "Boxplot of Price",
subtitle = "Toyota Corolla Dataset",
x = "Vehicle",
y = "Price") +
theme_minimal()
combined_price <- grid.arrange(hist_price, box_price, ncol = 2)
As seen in the above histogram, the price of the vehicles in the data set is not uniformly or normally distributed and contains a right skewness. The bulk of the data set is comprised of vehicles ranging from 8,500 to 12,000 in price. Because of the right skewness of the data, the mean price is higher than the median price. Still, it is currently unknown what qualities of each vehicle gather a significantly higher price.
# Histogram of the age of vehicle in months
hist_age <- cars %>%
ggplot(aes(x = Age)) +
geom_histogram(color = "steelblue4", fill = "lightblue2") +
labs(title = "Histogram of Age",
subtitle = "Toyota Corolla Dataset",
x = "Age of Vehicle in Months",
y = "Frequency") +
theme_minimal()
box_age <- cars %>%
ggplot(aes(x = "Category", y = Age)) +
geom_boxplot(color = "steelblue4", outlier.colour = "red") +
stat_summary(fun = "mean", shape = 21, color = "black") +
labs(title = "Boxplot of Age",
subtitle = "Toyota Corolla Dataset",
x = "Vehicle",
y = "Age (months)") +
theme_minimal()
combined_age <- grid.arrange(hist_age, box_age, ncol = 2)
From the histogram and box plot above, it is clear that the majority of the data set is comprised of used vehicles, and the distribution of the age of the vehicle follows a left-skewed distribution.
cars %>% group_by(year) %>% count(year) %>%
ggplot(aes(x = year, y = n)) +
geom_bar(stat = "identity", color = "steelblue4", fill = "lightblue2") +
labs(title = "Production Year of Vehicle",
subtitle = "Toyota Corolla Dataset",
x = "Year",
y = "Number of Vehicles") +
theme_minimal()
year_table <- cars %>% group_by %>% count(year)
year_table %>% kbl(caption = "Number of Vehicels by Year") %>% kable_classic_2("hover")
| year | n |
|---|---|
| 1997 | 90 |
| 1998 | 409 |
| 1999 | 377 |
| 2000 | 226 |
| 2001 | 158 |
| 2002 | 108 |
| 2003 | 59 |
| 2004 | 8 |
While it is informative to gather descriptive statistics of the mean and median vehicle age in months, a more effective means of understanding the age of the vehicle is to count the vehicles by year of production. The bar chart and table above show the number of vehicles by production year in the data set.
hist_KM <- cars %>%
ggplot(aes(x = KM)) +
geom_histogram(color = "steelblue4", fill = "lightblue2") +
labs(title = "Histogram of Odometer (KM)",
subtitle = "Toyota Corolla Dataset",
x = "Odometer Reading (KM)",
y = "Frequency") +
theme_minimal()
box_KM <- cars %>%
ggplot(aes(x = "Category", y = KM)) +
geom_boxplot(color = "steelblue4", outlier.colour = "red") +
stat_summary(fun = "mean", shape = 21, color = "black") +
labs(title = "Boxplot of Odometer (KM)",
subtitle = "Toyota Corolla Dataset",
x = "Vehicle",
y = "Odometer Reading (KM)") +
theme_minimal()
combined_KM <- grid.arrange(hist_KM, box_KM, ncol = 2)
The histogram and boxplot of the odometer reading of each vehicle provide a descriptive measure of the quality of the vehicle. It is estimated that the odometer reading will be a crucial explanatory factor in understanding the price of the vehicle because it speaks to the “newness” of the vehicle. However, the understanding of quality derived from the odometer reading does not apply to every vehicle because they vary in the age of production. For example, a car that is one year old and has traveled 100,000 miles is assumed to have been used very differently than a vehicle that is 10 years old and traveled 100,000 miles.
# Extract all unique variables in the Fuel Type Column
unique(cars$FuelType)
## [1] Diesel Petrol CNG
## Levels: CNG Diesel Petrol
cars %>% group_by(FuelType) %>% count(FuelType) %>%
ggplot(aes(x = reorder(FuelType, -n), y = n)) +
geom_bar(stat = "identity", color = "steelblue4", fill = "lightblue2") +
labs(title = "Number of Vehicles by Fuel Type",
subtitle = "Toyota Corolla Dataset",
x = "Fuel Type",
y = "Number of Vehicles") +
theme_minimal()
| FuelType | n |
|---|---|
| Petrol | 1264 |
| Diesel | 154 |
| CNG | 17 |
Based on the bar chart of fuel type, it is apparent that the data set is comprised mainly of petrol vehicles. A vehicle that is powered by CNG is very rare in comparison to both petrol and diesel powered vehicles.
unique(cars$HP)
## [1] 90 192 69 110 97 71 116 98 86 72 107 73
cars %>% group_by(HP) %>% count(HP) %>%
ggplot(aes(x = HP, y = n)) +
geom_bar(stat = "identity", color = "steelblue4", fill = "lightblue2") +
labs(title = "Number of Vehicles by Horsepower",
subtitle = "Toyota Dataset",
x = "Vehicle Horsepower",
y = "Number of Vehicles") +
theme_minimal()
| HP | n |
|---|---|
| 110 | 835 |
| 86 | 249 |
| 97 | 164 |
| 72 | 73 |
| 90 | 36 |
| 69 | 34 |
| 107 | 21 |
| 192 | 11 |
| 116 | 8 |
| 98 | 2 |
| 71 | 1 |
| 73 | 1 |
The horsepower bar chart raises questions about data quality and challenges the creation of vehicle price assessments later in this project. The data set contains 12 different horsepower values, but several occur infrequently. For example, is a vehicle annotated as having 71 or 73 horsepower an other kind of car than the vehicles with 72 horsepower?
cars %>% group_by(MetColor) %>% count(MetColor) %>%
ggplot(aes(x = MetColor, y =n)) +
geom_bar(stat = "identity", color = "steelblue4", fill = "lightblue2") +
labs(title = "Number of Cars with Metallic Paint",
subtitle = "Toyota Dataset",
x = "Vehicle Paint Color",
y = "Number of Vehicles") +
theme_minimal()
met_table <- cars %>% group_by(MetColor) %>% count(MetColor)
met_table %>% kbl(caption = "Number of Vehicels by Paint Color") %>% kable_classic_2("hover")
| MetColor | n |
|---|---|
| 0 | 467 |
| 1 | 968 |
The vehicles in the data set are comprised of a mix of metallic color paint and non-metallic color paint. Vehicles with a metallic color occur more than twice as frequently as those without a metallic color. At this time, it is unknown if metallic color is more popular or if the manufacturer produced more vehicles without metallic paint as an option. The bar chart provides information about the paint color per vehicle within the data set and is not descriptive of larger trends.
cars %>% group_by(Automatic) %>% count(Automatic) %>%
ggplot(aes(x = Automatic, y = n)) +
geom_bar(stat = "identity", color = "steelblue4", fill = "lightblue2") +
labs(title = "Number of Vehicles by Transmission Type",
subtitle = "Toyota Dataset",
x = "Vehicle Transmission Type",
y = "Number of Vehicles") +
theme_minimal()
trans_table <- cars %>% group_by(Automatic) %>% count(Automatic)
trans_table %>% kbl(caption = "Number of Vehicels by Transmission Type") %>%
kable_classic_2("hover")
| Automatic | n |
|---|---|
| 0 | 1355 |
| 1 | 80 |
Similar to metallic paint, little information in the transmission type variable could be applied to inform a greater understanding of trends. Instead, the bar chart above presents the frequency of occurrence of automatic and manual transmission vehicles.
cars %>% group_by(CC) %>% count(CC)
## # A tibble: 12 × 2
## # Groups: CC [12]
## CC n
## <int> <int>
## 1 1300 248
## 2 1332 2
## 3 1398 2
## 4 1400 164
## 5 1587 4
## 6 1598 4
## 7 1600 846
## 8 1800 14
## 9 1900 30
## 10 1975 1
## 11 1995 2
## 12 2000 118
cars %>% group_by(CC) %>% count(CC) %>%
ggplot(aes(x = CC, y =n)) +
geom_bar(stat = "identity", color = "steelblue4", fill = "lightblue2") +
labs(title = "Number of Cars by Engine Displacement",
subtitle = "Toyota Dataset",
x = "Engine Size",
y = "Number of Vehicles") +
theme_minimal()
| CC | n |
|---|---|
| 1600 | 846 |
| 1300 | 248 |
| 1400 | 164 |
| 2000 | 118 |
| 1900 | 30 |
| 1800 | 14 |
| 1587 | 4 |
| 1598 | 4 |
| 1332 | 2 |
| 1398 | 2 |
| 1995 | 2 |
| 1975 | 1 |
The “CC” variable raises further questions about data quality in the data set. Engine displacement sizes are fairly consistent, but the data set shows many more than expected. Additionally, some engine sizes occur at such low frequencies that they create questions. For example, is an engine CC size of 1598 a data entry error or a distinct engine made by the manufacturer? Further analysis will aim to minimize the effects of potentially erroneous information while simultaneously understanding patterns of vehicles based on shared traits.
cars %>% group_by(Doors) %>% count(Doors) %>%
ggplot(aes(x = Doors, y =n)) +
geom_bar(stat = "identity", color = "steelblue4", fill = "lightblue2") +
labs(title = "Number of Vehicles by Door Count",
subtitle = "Toyota Dataset",
x = "Number of Doors",
y = "Number of Vehicles") +
theme_minimal()
| Doors | n |
|---|---|
| 5 | 673 |
| 3 | 622 |
| 4 | 138 |
| 2 | 2 |
The bar chart above describes the frequency of the number of doors per vehicle in the data set. Vehicles with five and four doors are significantly more represented in the data set than a three or 2 door vehicle.
hist_Weight <- cars %>%
ggplot(aes(x = Weight)) +
geom_histogram(color = "steelblue4", fill = "lightblue2") +
labs(title = "Histogram of Vehicle Weight",
subtitle = "Toyota Dataset",
x = "Vehicle Weight",
y = "Number of Vehicles") +
theme_minimal()
box_Weight <- cars %>%
ggplot(aes(x = "Category", y = Weight)) +
geom_boxplot(color = "steelblue4", outlier.colour = "red") +
stat_summary(fun = "mean", shape = 21, color = "black") +
labs(title = "Boxplot of Vehicle Weight",
subtitle = "Toyota Dataset",
x = "Vehicle",
y = "Weight") +
theme_minimal()
combined_HP <- grid.arrange(hist_Weight, box_Weight, ncol = 2)
While there may be questions raised in the horsepower and CC variables, the weight variable is assumed to be accurate because of the various options unseen in the data set that can affect the weight of the vehicle, such as wheel size, roof racks, and other features. Regardless, the histogram shows that vehicles less than 1,100 contain over 75% of the rows in the data set. Heavier vehicles occur at meager rates.
cars$year <- as.factor(cars$year)
cars %>%
ggplot(aes(x = year, y = Price, fill = year)) +
geom_boxplot() +
stat_summary(fun = "mean", shape = 21, fill = "white", color = "black") +
labs(title = "Boxplot of Price per Year",
subtitle = "Toyota Dataset",
x = "Year of Production",
y = "Price of Vehicle") +
scale_y_continuous(limits = c(0, max(cars$Price))) +
theme_minimal() +
theme(legend.position = "none")
The box plot of the vehicle price grouped by the vehicle’s year illuminates a trend in the data set. First, it appears as though the newer the vehicle, the higher the cost of the vehicle. However, there are outliers in all but two years of production. Therefore, it is believed that other factors beyond the year of production inform price. Further analysis will aim to understand what factors in the dataset, when coupled with age, create predictability in the price of the vehicle.
Based on the research goals of this project, further analysis was required to understand how each of the variables in the dataset were related to one another. Bivariate analysis began with establishing the correlations between each of the variables and then created visualizations of several notable pairings of variables.
# Coerce all variables to numeric format, remove NA values
cars_numeric <- cars
cars_numeric[] <- lapply(cars, as.numeric)
cars_numeric <- subset(cars_numeric, select = -c(FuelType))
# Create visualization of correlations
corrplot(corrs,
order = "hclust",
tl.col = "black",
tl.srt = 45)
| Price | Age | KM | HP | MetColor | Automatic | CC | Doors | Weight | year | |
|---|---|---|---|---|---|---|---|---|---|---|
| Price | 1.0000000000 | -0.8762733805 | -0.5694195943 | 0.3141341299 | 0.1076036677 | 0.0339275078 | 0.1598967139 | 0.1836039137 | 0.5758694098 | 0.8582304170 |
| Age | -0.8762733805 | 1.0000000000 | 0.5045745614 | -0.1552931956 | -0.1071686211 | 0.0313536648 | -0.1295636623 | -0.1469288098 | -0.4664843289 | -0.9805906788 |
| KM | -0.5694195943 | 0.5045745614 | 1.0000000000 | -0.3329041773 | -0.0798635631 | -0.0821675939 | 0.3053305003 | -0.0351930973 | -0.0239687883 | -0.4932347233 |
| HP | 0.3141341299 | -0.1552931956 | -0.3329041773 | 1.0000000000 | 0.0582721811 | 0.0133125134 | 0.0494304626 | 0.0918027208 | 0.0871433318 | 0.1542135794 |
| MetColor | 0.1076036677 | -0.1071686211 | -0.0798635631 | 0.0582721811 | 1.0000000000 | -0.0192215676 | 0.0338704207 | 0.0847964976 | 0.0560966665 | 0.1052297259 |
| Automatic | 0.0339275078 | 0.0313536648 | -0.0821675939 | 0.0133125134 | -0.0192215676 | 1.0000000000 | -0.0690599888 | -0.0274924593 | 0.0584987130 | -0.0306542997 |
| CC | 0.1598967139 | -0.1295636623 | 0.3053305003 | 0.0494304626 | 0.0338704207 | -0.0690599888 | 1.0000000000 | 0.1254096400 | 0.6500983692 | 0.1251221338 |
| Doors | 0.1836039137 | -0.1469288098 | -0.0351930973 | 0.0918027208 | 0.0847964976 | -0.0274924593 | 0.1254096400 | 1.0000000000 | 0.3017335071 | 0.1408475515 |
| Weight | 0.5758694098 | -0.4664843289 | -0.0239687883 | 0.0871433318 | 0.0560966665 | 0.0584987130 | 0.6500983692 | 0.3017335071 | 1.0000000000 | 0.4542064758 |
| year | 0.8582304170 | -0.9805906788 | -0.4932347233 | 0.1542135794 | 0.1052297259 | -0.0306542997 | 0.1251221338 | 0.1408475515 | 0.4542064758 | 1.0000000000 |
The correlation matrix above shows several variables highly related to one another and several almost completely unrelated variables. For example, the strongest relationship in the data set occurs between the Age and Price variables. Other strong relationships arise between the Age of the vehicle and the Odometer reading (KM) and the vehicle’s price with the vehicle’s weight.
# Age vs KM for all cars
cars %>%
ggplot(aes(x = Age, y = KM)) +
geom_point() +
labs(title = "Scatterplot of Age and Odometer Reading",
subtitle = "Toyota Dataset",
x = "Age of Vehicle (months)",
y = "Odometer Reading (KM)") +
geom_smooth(method = "lm") +
theme_minimal()
The scatter plot above shows several interesting patterns. First, as expected, the older the vehicle is, the higher the number of KM driven. Second, not every vehicle is used at the same rate, and the difference of how vehicles are used becomes greater over time.
cars %>%
ggplot(aes(x = Age, y = KM, color = FuelType)) +
geom_point() +
labs(title = "Scatterplot of Age and Odeomter Reading by Fuel Type",
subtitle = "Toyota Dataset",
x = "Age of Vehicle (months)",
y = "Odometer Reading (KM)") +
theme_minimal()
Interestingly, when the age of the vehicle is plotted with a coloring of the fuel type, a pattern becomes clear that highlights a difference between vehicles in the dataset. Generally speaking, diesel-powered vehicles have a higher odometer reading across the entire dataset.
cars %>%
ggplot(aes(x = Age, y = KM, color = HP)) +
geom_point() +
labs(title = "Scatterplot of Age and Odometer Reading by Horsepower",
subtitle = "Toyota Dataset",
x = "Age of Vehicle (months)",
y = "Odometer Reading (KM)") +
theme_minimal()
cars %>%
ggplot(aes(x = Age, y = KM, color = Doors)) +
geom_point() +
labs(title = "Scatterplot of Age and Odometer Reading by Number of Doors",
subtitle = "Toyota Dataset",
x = "Age of Vehicle (months)",
y = "Odometer Reading (KM)") +
theme_minimal()
cars %>%
ggplot(aes(x = Age, y = KM, color = Weight)) +
geom_point() +
labs(title = "Scatterplot of Age and Odometer Reading by Vehicle Weight",
subtitle = "Toyota Dataset",
x = "Age of Vehicle (months)",
y = "Odometer Reading (KM)") +
theme_minimal()
An interesting pattern becomes apparent when evaluating the age of the vehicle and its odometer reading, which is aligned with what was observed in the correlation matrix. As the vehicle becomes older, the odometer reading increases, but not at the same rate for each vehicle. Although the number of doors, the weight of the vehicle, and the horsepower are descriptive qualities of the vehicle, no clear pattern appears when inspecting how each of those qualities is connected to the age-odometer relationship.
cars %>%
ggplot(aes(x = Age, y = Weight)) +
geom_point() +
labs(title = "Scatterplot of Age and Vehicle Weight",
subtitle = "Toyota Dataset",
x = "Age of Vehicle (months)",
y = "Vehicle Weight") +
geom_smooth(method = "lm") +
scale_x_continuous(limits = c(0, max(cars$Age))) +
scale_y_continuous(limits = c(0, max(cars$Weight))) +
theme_minimal()
As identified in the correlation matrix, there is a connection between the vehicle’s age and the vehicle’s weight. The scatter plot above shows that connection and it appears as though manufacturers are making heavier vehicles now than they were compared to the oldest vehicles in the data set.
cars %>%
ggplot(aes(x = Age, y = Price)) +
geom_point() +
labs(title = "Scatterplot of Age and Price",
subtitle = "Toyota Dataset",
x = "Age of Vehicle (months)",
y = "Vehicle Price") +
geom_smooth(method = "lm") +
scale_x_continuous(limits = c(0, max(cars$Age))) +
scale_y_continuous(limits = c(0, max(cars$Price))) +
theme_minimal()
cars %>%
ggplot(aes(x = Age, y = Price, color = FuelType)) +
geom_point() +
labs(title = "Scatterplot of Age and Price by Fuel Type",
subtitle = "Toyota Dataset",
x = "Age of Vehicle (months)",
y = "Vehicle Price") +
scale_x_continuous(limits = c(0, max(cars$Age))) +
scale_y_continuous(limits = c(0, max(cars$Price))) +
theme_minimal()
cars %>%
ggplot(aes(x = Age, y = Price, color = HP)) +
geom_point() +
labs(title = "Scatterplot of Age and Price by Vehicle Horsepower",
subtitle = "Toyota Dataset",
x = "Age of Vehicle (months)",
y = "Vehicle Price") +
scale_x_continuous(limits = c(0, max(cars$Age))) +
scale_y_continuous(limits = c(0, max(cars$Price))) +
theme_minimal()
cars %>%
ggplot(aes(x = Age, y = Price, color = Doors)) +
geom_point() +
labs(title = "Scatterplot of Age and Price by NUmber of Doors",
subtitle = "Toyota Dataset",
x = "Age of Vehicle (months)",
y = "Vehicle Price") +
scale_x_continuous(limits = c(0, max(cars$Age))) +
scale_y_continuous(limits = c(0, max(cars$Price))) +
theme_minimal()
cars %>%
ggplot(aes(x = Age, y = Price, color = Weight)) +
geom_point() +
labs(title = "Scatterplot of Age and Price by Vehicle Weight",
subtitle = "Toyota Dataset",
x = "Age of Vehicle (months)",
y = "Vehicle Price") +
scale_x_continuous(limits = c(0, max(cars$Age))) +
scale_y_continuous(limits = c(0, max(cars$Price))) +
theme_minimal()
The strongest connection found in the correlation matrix occurred between the vehicle price and the vehicle age. It is believed that the vehicle age could be highly predictive of vehicle price. While visualizing the age of vehicle and the price of vehicle in the data set, and interesting pattern becamce apparent, the higher end of horsepower vehicles occurs clustered within a small age range. Based on that information, horsepower could be valuable in determining the model of the vehicle, which could become valuable in predicting the price of the vehicle later in this project.
cars %>%
ggplot(aes(x = year, y = Price, color = FuelType)) +
geom_point() +
labs(title = "Scatterplot of Year and Price by Fuel Type",
subtitle = "Toyota Dataset",
x = "Year of Vehicle",
y = "Vehicle Price") +
theme_minimal()
cars %>%
ggplot(aes(x = year, y = Price, color = HP)) +
geom_point() +
labs(title = "Scatterplot of Year and Price by Horsepower",
subtitle = "Toyota Dataset",
x = "Year of Vehicle",
y = "Vehicle Price") +
theme_minimal()
cars %>%
ggplot(aes(x = year, y = Price, color = Doors)) +
geom_point() +
labs(title = "Scatterplot of Age and Price by Number of Doors",
subtitle = "Toyota Dataset",
x = "Year of Vehicle",
y = "Vehicle Price") +
theme_minimal()
Even though year and age describe the same quality in the vehicle, arranging the data by year aims to understand patterns that are not immediately obvious. While the number of doors is not very informative when aligned in a scatter plot of Year and Price, other patterns appear when evaluating horsepower and fuel type.
First, there appears to be a trend where the most expensive newer vehicles are diesel-powered, and the cheapest older vehicles are diesel-powered. When taking the previously established understanding gained on the odometer reading per fuel type stating that diesel vehicles travel more kilometers than their petrol counterparts, further evidence is available that the odometer reading is highly predictive of the vehicle price.
Second, the pattern of the highest horsepower vehicles is apparent in a very small section of the scatter plot of years and price, and those vehicles garner the highest price. This pattern further illustrates that more than one factor informs price.
cars %>%
ggplot(aes(x = KM, y = Price)) +
geom_point() +
labs(title = "Scatterplot of Odometer Reading and Price",
subtitle = "Toyota Dataset",
x = "Odometer Reading (KM)",
y = "Vehicle Price") +
scale_x_continuous(limits = c(0, max(cars$KM))) +
scale_y_continuous(limits = c(0, max(cars$Price))) +
theme_minimal()
cars %>%
ggplot(aes(x = KM, y = Price, color = Age)) +
geom_point() +
labs(title = "Scatterplot of Odometer Reading and Price by Age",
subtitle = "Toyota Dataset",
x = "Odometer Reading (KM)",
y = "Vehicle Price") +
scale_x_continuous(limits = c(0, max(cars$KM))) +
scale_y_continuous(limits = c(0, max(cars$Price))) +
theme_minimal()
As expected, based on knowledge gained in other areas of this project, the scatter plot shows the connection between the odometer reading and the price of the vehicle. However, when paired with the age of the vehicle, it is apparent that not every vehicle is used at the same rate. There are very new vehicles with high and very old ones with low odometer readings.
Based on information in the dataset, it is apparent there are multiple types of vehicles in the data set. When attempting to produce a prediction of the price of the vehicle, it is assumed that the type of vehicle may improve the predictive abilities of the model. If clustering or grouping of the vehicles by type did not occur, the analysis would create comparisons on significantly different items. Therefore, k-means clustering was conducted to quantitatively establish the types of vehicles in the data set and associate each vehicle with that type.
Features selected for clustering were descriptive measures that could not be applied to other types of vehicles and are assumed to be the most mutually exclusive variables. For example, an automatic transmission vehicle or a vehicle with metallic paint is likely an option for all models.
# Create new data frame with selected features for Clustering
features <- cars %>% select(HP, CC, Doors, Weight)
scaled_features <- scale(features)
# Function to determine within cluster sum of squares
wcss <- function(k) {
kmeans_result <- kmeans(scaled_features, centers = k)
return(kmeans_result$tot.withinss)}
# Determine number of clusters by using within cluster sum of squares function
k_values <- 1:12
wcss_values <- sapply(k_values, wcss)
# Create elbow plot to determine optimal number of clusters
elbow_plot <- data.frame(k = k_values, WCSS = wcss_values)
ggplot(elbow_plot, aes(x = k, y = WCSS)) +
geom_line() +
geom_point() +
labs(title = "Elbow Plot for Optimal k",
x = "Number of Clusters (k)",
y = "Within-Cluster Sum of Squares (WCSS)")
An elbow curve was created to establish the optimal number of clusters within the data set based on the chosen features. The elbow curve plot above shows a flattening in the within sum of squares values between clusters at 6. Therefore, it has been estimated that there are six significantly different types of models of vehicles in the data set. Next, the model type was attached to each row to provide a further descriptive quality.
# Use optimal k to assign model to each vehicle in dataset
k <- 6
result <- kmeans(scaled_features, centers = k)
cars$Model <- result$cluster
# Convert Model into Factor
cars$Model <- as.factor(cars$Model)
cars %>% group_by(Model) %>% count(Model) %>%
ggplot(aes(x = Model, y = n)) +
geom_bar(stat = "identity", color = "steelblue4", fill = "lightblue2") +
labs(title = "Number of Vehicles per Model",
subtitle = "Toyota Dataset",
x = "Model",
y = "number of Vehicles") +
theme_minimal()
| Model | n |
|---|---|
| 1 | 422 |
| 2 | 22 |
| 3 | 144 |
| 4 | 152 |
| 5 | 241 |
| 6 | 454 |
After clustering and classifying each of the cars based on their values of weight, engine displacement, doors, and horsepower, the bar chart above was created to understand the frequency of each type of model in the dataset.
# Visualize differences in models by HP and weight
cars %>%
ggplot(aes(x = Age, y = Price, color = Model)) +
geom_point() +
labs(title = "Scatterplot of Age and Price by Vehicle Model",
subtitle = "Toyota Dataset",
x = "Vehicle Age (months)",
y = "Vehicle Price") +
theme_minimal()
While initially difficult to extract meaningful information fromt the above scatterplot, several logical conclusions can be drawn when evaluating the age of the vehicle and its price arranged by model. First, one model is one of the cheapest models of vehicle. Next, there is largely more expensive model than other models.
# Visualize differences in models by HP and weight
cars %>%
ggplot(aes(x = KM, y = Price, color = Model)) +
geom_point() +
labs(title = "Scatterplot of Odometer Reading and Price by Vehicle Model",
subtitle = "Toyota Dataset",
x = "Odometer Reading (KM)",
y = "Vehicle Price") +
theme_minimal()
To further explain the separation between vehicles, an interesting trait becomes apparent when evaluating the Odometer reading and price. Across the top section of the entire plat, one stands out as gathering some of the highest prices. Therefore, it is reasonable that the vehicle model could add to future predictive efforts.
cars %>%
ggplot(aes(x = Model, y = Price, fill = Model)) +
geom_boxplot() +
stat_summary(fun = "mean", shape = 21, fill = "white", color = "black") +
labs(title = "Boxplot of Price per Model",
subtitle = "Toyota Dataset",
x = "Model of Vehicle",
y = "Price of Vehicle") +
scale_y_continuous(limits = c(0, max(cars$Price))) +
theme_minimal() +
theme(legend.position = "none")
This section has provided a range of prices for each vehicle model as it was separated by specific features within the data set. Coupled with additional information, a range of expected price will likely be achieved.
Once additional insight was created on the types of vehicles in the data set, the focus was shifted to making price range expectations for each car model. Among other factors, such as KM, the age or year of the car is proven to be highly related to the price of the vehicle. Therefore, a descriptive measure of the value of each car was created by first grouping the data set but Model and Year of production, then extracting the price quantiles. Next, if a car fell between the 25th and 75th quantile, a value of average was assigned per row. If the vehicle price occurred above the 75th quantile, it was given a categorization of expensive. Finally, the vehicle was categorized as cheap if the price was below the 25th quantile.
# Group data frame by Model of car and extract quantiles of price
quartiles_per_model <- cars %>%
group_by(Model, year) %>%
summarize(Q1 = quantile(Price, 0.25),
Q3 = quantile(Price, 0.75),
.groups = "drop")
# Create Function for attaching the descriptive value of "cheap", "expensive", or "average" per model.
categorize_price <- function(price, year, q1, q3) {
if (price < q1) {
return("cheap")
} else if (price > q3) {
return("expensive")
} else {
return("average")
}
}
# Apply function to data frame
cars <- cars %>%
left_join(quartiles_per_model, by = c("Model", "year")) %>%
mutate(Price_Category = mapply(categorize_price, Price, year, Q1, Q3))
# Visualize distribution of value along Age and Price variables
cars %>%
ggplot(aes(x = Age, y = Price, color = Price_Category)) +
geom_point() +
labs(title = "Scatterplot of Age and Price by Price Category",
subtitle = "Toyota Dataset",
x = "Vehicle Age (months)",
y = "Vehicle Price") +
theme_minimal()
After categorizing based on the model and year, a scatter plot was made to understand the distribution of price and age when grouped by year. Unsurprisingly, the older the vehicle, the less overlap in prices; the newer the vehicle, the more overlap in prices. It is hypothesized that this is occurring because of several factors. First, the data set is more populated by older vehicles, as seen earlier in this paper. Second, the difference between other descriptive factors of quality wanes as the vehicle ages. For example, a 50,000 km difference in a two-year-old car may not be as important as a 50,000 km difference in a much older vehicle.
# Visualize distribution of value along Age and Price variables
cars %>%
ggplot(aes(x = year, y = Price, color = Price_Category)) +
geom_point() +
labs(title = "Scatterplot of Year and Price by Value",
subtitle = "Toyota Dataset",
x = "Vehicle Year",
y = "Vehicle Price") +
theme_minimal()
The same phenomenon exists when the same information as the previous plot is arranged per production year instead of age in months. The older the vehicle, the tighter the distribution of price.
Categorizing each model provides price ranges that a customer or a seller can expect based purely on the model and year of production. Further discriminating factors must be considered to price a vehicle accurately, but this was a crucial step in understanding the price of a vehicle and aiding in finding predictability of price.
The final step in this project was to create a linear regression model to predict the price of the vehicle. During this research phase, several models were created to find the most accurate price prediction method. Then, each model was evaluated using a testing and training set validation methodology.
cars$Model <- as.numeric(cars$Model)
# Create linear model to predict price of vehicle based on all data
model_A <- lm(Price ~ Age + KM + HP + CC + Doors + Weight + Model,
data = cars)
summary(model_A)
##
## Call:
## lm(formula = Price ~ Age + KM + HP + CC + Doors + Weight + Model,
## data = cars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12132.3243 -771.8593 -0.1815 765.2973 6266.0423
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5929.669261178 1004.412318170 -5.90362 0.0000000044371
## Age -122.144491049 2.591762396 -47.12797 < 0.000000000000000222
## KM -0.016717145 0.001287522 -12.98397 < 0.000000000000000222
## HP 32.411068165 2.579988713 12.56248 < 0.000000000000000222
## CC -1.724614107 0.289060851 -5.96627 0.0000000030568
## Doors -32.367976053 50.882053025 -0.63614 0.52479
## Weight 22.612449033 1.067795897 21.17675 < 0.000000000000000222
## Model -18.402473919 23.499313758 -0.78311 0.43369
##
## (Intercept) ***
## Age ***
## KM ***
## HP ***
## CC ***
## Doors
## Weight ***
## Model
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1333.703 on 1427 degrees of freedom
## Multiple R-squared: 0.86408, Adjusted R-squared: 0.8634133
## F-statistic: 1295.975 on 7 and 1427 DF, p-value: < 0.00000000000000022204
# Create linear model on selected data
model_B <- lm(Price ~ Age + KM + Model, data = cars)
summary(model_B)
##
## Call:
## lm(formula = Price ~ Age + KM + Model, data = cars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6795.4164 -990.5365 -83.7145 837.9680 12605.2367
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 20568.394833597 165.300223084 124.43053 < 0.0000000000000002 ***
## Age -153.607387536 2.728547372 -56.29640 < 0.0000000000000002 ***
## KM -0.016572145 0.001354638 -12.23363 < 0.0000000000000002 ***
## Model -29.592725590 21.418209513 -1.38166 0.16729
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1655.986 on 1431 degrees of freedom
## Multiple R-squared: 0.7898669, Adjusted R-squared: 0.7894263
## F-statistic: 1792.989 on 3 and 1431 DF, p-value: < 0.00000000000000022204
# Create linear model to predict price of vehicle based on all data
model_C <- lm(Price ~ Age + KM + HP + CC + Weight + Doors,
data = cars)
summary(model_C)
##
## Call:
## lm(formula = Price ~ Age + KM + HP + CC + Weight + Doors, data = cars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12079.6802 -759.6196 -2.5351 757.0873 6246.0535
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6026.569337503 996.626276545 -6.04697 0.0000000018813
## Age -122.138790417 2.591401189 -47.13234 < 0.000000000000000222
## KM -0.016703458 0.001287229 -12.97629 < 0.000000000000000222
## HP 32.742225792 2.544752007 12.86657 < 0.000000000000000222
## CC -1.663921809 0.278439621 -5.97588 0.0000000028856
## Weight 22.612569864 1.067651281 21.17973 < 0.000000000000000222
## Doors -57.877015407 39.083172384 -1.48087 0.13886
##
## (Intercept) ***
## Age ***
## KM ***
## HP ***
## CC ***
## Weight ***
## Doors
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1333.522 on 1428 degrees of freedom
## Multiple R-squared: 0.8640216, Adjusted R-squared: 0.8634503
## F-statistic: 1512.278 on 6 and 1428 DF, p-value: < 0.00000000000000022204
# Partition data into test and training
samples <- cars$Price %>%
createDataPartition(p = .8, list = FALSE)
train.data <- cars[samples, ]
test.data <- cars[-samples, ]
# Construct model A and model B from training data
model_A_validate <- lm(Price ~ Age + KM + HP + CC + Doors + Weight + Model,
data = train.data)
model_B_validate <- lm(Price ~ Age + KM + Model, data = train.data)
model_C_validate <- lm(Price ~ Age + KM + HP + CC + Weight + Doors, data = train.data)
# Evaluate the quality of the model_A through R2, RSME, and MAE
predictions_model_A <- model_A_validate %>% predict(test.data)
predictions_model_A <- data.frame(Model = "A",
R2 = R2(predictions_model_A, test.data$Price),
RMSE = RMSE(predictions_model_A, test.data$Price),
MAE = MAE(predictions_model_A, test.data$Price))
# Evaluate the quality of the model model_B through R2, RSME, and MAE
predictions_model_B <- model_B_validate %>% predict(test.data)
predictions_model_B <- data.frame(Model = "B",
R2 = R2(predictions_model_B, test.data$Price),
RMSE = RMSE(predictions_model_B, test.data$Price),
MAE = MAE(predictions_model_B, test.data$Price))
# Evaluate the quality of the model model_B through R2, RSME, and MAE
predictions_model_C <- model_C_validate %>% predict(test.data)
predictions_model_C <- data.frame(Model = "C",
R2 = R2(predictions_model_C, test.data$Price),
RMSE = RMSE(predictions_model_C, test.data$Price),
MAE = MAE(predictions_model_C, test.data$Price))
# Merge model statistics into single data frame
validation <- bind_rows(predictions_model_A, predictions_model_B, predictions_model_C)
validation %>% kbl(caption = "Evaluation of Model Effectiveness") %>%
kable_classic_2("hover")
| Model | R2 | RMSE | MAE |
|---|---|---|---|
| A | 0.8926599223 | 1316.466083 | 966.6009983 |
| B | 0.7806615114 | 1862.734272 | 1249.7517477 |
| C | 0.8926465011 | 1316.542173 | 966.6350931 |
After validation, it was found that there is a moderate level of predictability in the price of a vehicle based on the information in the data set. The first model considered each of the variables and achieved the best predictive powers measured through analysis of residuals and the mean absolute error of those residuals. The second model eliminated many variables and asked if the model, age, and odometer reading were enough information to predict the price of the vehicle. The third model tested the car model’s importance, informing the vehicle’s price. All other variables from the data set were included in model “C” except the vehicle model. Accuracy suffered very slightly by excluding the vehicle model in the predictive model. While it is informative to understand which model of the vehicle was purchased, there is enough overlap in the prices per model to minimize the effect. To gain further accuracy in predicting the cost of the vehicle, additional features must be added to the data set. Luxury features such as entertainment systems or security and safety features are expected to comprise the information needed to create a more accurate predictive model.
There is a connection between several variables in the dataset. The most closely related variables are price and KM.
There is a most popular vehicle model in the data set, and not every vehicle in the data set contains the same features. Within the data set are vehicles with high horsepower, low horsepower, large engines, and small engines, as well as 3, 4, and 5-door models. The research did not focus on the descriptive factors of each model but sought to identify the existence of models and has achieved that goal.
The most accurate method of understanding the range of price was achieved through a combination of the model of vehicle and the year of production of the vehicle. When coupling the two descirptive factors, a starting point is achieved that will allow the customer and retailed to further refine price based on additional features.
There is a pattern that describes the expected price of the vehicle. A price category has been imputed into the data set that describes each vehicle as cheap, average, or expensive based on comparing other vehicles of the same model and year.
While model and year are valid starting points for the price range per vehicle, additional features provide accuracy in predicting the price of the vehicle. However, because the data set is very shallow in descriptive features, predictability is not 100% accurate. Additional information is required to gain further predictive accuracy.