Toyota Corolla Dataset

Introduction

The data set used in this project aggregates 2,000 rows of vehicle information. However, the data set does not contain all of a vehicle’s descriptive qualities, such as the many available luxury features like cruise control and automatic windows or safety features like airbags or anti-lock brakes. Regardless, the project will follow a deliberate process of cleaning, manipulating, exploratory data exploration, and modeling to answer several questions.

Analysis is focused on the following questions:

Is there a connection between any of the descriptive variables in each row and the price of the vehicle? If so, what variables are most closely related to the price of the vehicle?
Is there a most popular vehicle model contained in the data set?
What are the ranges of price that could be expected for vehicles in the dataset?
If prices follow a pattern, what information can be imputed to provide an understanding of the value for the purchase or sale of the vehicle?
Is it possible to predict the cost of the vehicle based on the information in the dataset?

Analysis for this project begins by importing the data set and gathering the basic properties of the data set. Each row within the data set contains numerous descriptive variables. Variables such as Age, KM, HP, Doors, Weight, and CC describe vehicle features in numeric format. The fuel type variable represents the type of fuel needed to power the vehicle and contains character values of “CNG” (Compressed Natural Gas), “Diesel”, and “Petrol”. Two additional numeric variables are used as categorical feature encoding. In the MetColor and Automatic variables, a value of 1 denotes that the vehicle is metallic or an automatic transmission vehicle. A value of 0 in those variables indicates that the vehicle is not metallic colored or is a manual transmission vehicle. At the onset of this project, the data set is 2036 rows and 10 columns.

#import data set
data <- read.csv("ToyotaCorolla2.csv")

# Call basic summary statistics of the data set
summary(data)

##      Price               Age                 KM              FuelType        
##  Min.   : 4350.00   Min.   : 1.00000   Min.   :     1.00   Length:2036       
##  1st Qu.: 8750.00   1st Qu.:41.00000   1st Qu.: 39136.75   Class :character  
##  Median :10322.50   Median :54.00000   Median : 61135.50   Mode  :character  
##  Mean   :11166.29   Mean   :52.44745   Mean   : 67298.85                     
##  3rd Qu.:12450.00   3rd Qu.:67.00000   3rd Qu.: 88541.75                     
##  Max.   :32500.00   Max.   :80.00000   Max.   :243000.00                     
##        HP              MetColor           Automatic                CC          
##  Min.   : 69.0000   Min.   :0.0000000   Min.   :0.00000000   Min.   :1300.000  
##  1st Qu.: 97.0000   1st Qu.:0.0000000   1st Qu.:0.00000000   1st Qu.:1400.000  
##  Median :110.0000   Median :1.0000000   Median :0.00000000   Median :1600.000  
##  Mean   :101.4224   Mean   :0.6807466   Mean   :0.05009823   Mean   :1575.857  
##  3rd Qu.:110.0000   3rd Qu.:1.0000000   3rd Qu.:0.00000000   3rd Qu.:1600.000  
##  Max.   :192.0000   Max.   :1.0000000   Max.   :1.00000000   Max.   :2000.000  
##      Doors              Weight        
##  Min.   :2.000000   Min.   :1000.000  
##  1st Qu.:3.000000   1st Qu.:1040.000  
##  Median :4.000000   Median :1070.000  
##  Mean   :4.083988   Mean   :1075.027  
##  3rd Qu.:5.000000   3rd Qu.:1095.000  
##  Max.   :5.000000   Max.   :1615.000

# Gather dimensions of the data set
dim(data)

## [1] 2036   10

Analysis of Data Quality

Analysis for this project begins by importing the data set and gathering an understanding of the structure and dimensions of the data set. Then project underwent effort to understand the quality of data by eliminating duplicated observations.

# Visualize the location and frequency of missing data
vis_miss(data)

Initial inspection showed no instances of missing values, but there are numerous duplicated rows. While it is possible that vehicles in a data set could contain many shared values, the connection of each variable aligned with price and odometer reading raised suspicions about the quality of each of those instances and was therefore removed.

# Identify the duplicated rows and save the duplicates as a new data frame
duplicated <- data[duplicated(data), ]

# Create new data frame entirely of unique entries across all rows and columns
cars <- data %>% distinct()

The data set now contains 1435 distinct rows and is of high enough quality to begin exploratory analysis.

Data Manipulation

Because it is known that vehicle manufacturers discontinue models of vehicles and release new models, understanding the age of the year of production could illuminate information crucial to answering established research questions. Therefore, it was necessary to impute a value that describes the year of production. The year of production is expected to assist in identifying manufacturing trends and future vehicle classification.

# Creates a Year column based on the values of the Age Column.
cars <- cars %>%
  mutate(year = 2004 - floor((Age - 1 + 6)/12))

# Mutate variables into a factor format
cars$FuelType <- as.factor(cars$FuelType)

Univariate Analysis

To gain a further understanding of the data’s contents, the research progressed by gathering measures of central tendency and data distribution per variable. In every box plot in this paper, the white circle with a black outline denotes the mean value of that variable.

Price

summary(cars)

##      Price               Age                 KM              FuelType   
##  Min.   : 4350.00   Min.   : 1.00000   Min.   :     1.00   CNG   :  17  
##  1st Qu.: 8450.00   1st Qu.:44.00000   1st Qu.: 43000.00   Diesel: 154  
##  Median : 9900.00   Median :61.00000   Median : 63451.00   Petrol:1264  
##  Mean   :10720.92   Mean   :55.98049   Mean   : 68571.78                
##  3rd Qu.:11950.00   3rd Qu.:70.00000   3rd Qu.: 87041.50                
##  Max.   :32500.00   Max.   :80.00000   Max.   :243000.00                
##        HP             MetColor           Automatic                CC          
##  Min.   : 69.000   Min.   :0.0000000   Min.   :0.00000000   Min.   :1300.000  
##  1st Qu.: 90.000   1st Qu.:0.0000000   1st Qu.:0.00000000   1st Qu.:1400.000  
##  Median :110.000   Median :1.0000000   Median :0.00000000   Median :1600.000  
##  Mean   :101.492   Mean   :0.6745645   Mean   :0.05574913   Mean   :1566.526  
##  3rd Qu.:110.000   3rd Qu.:1.0000000   3rd Qu.:0.00000000   3rd Qu.:1600.000  
##  Max.   :192.000   Max.   :1.0000000   Max.   :1.00000000   Max.   :2000.000  
##      Doors              Weight              year         
##  Min.   :2.000000   Min.   :1000.000   Min.   :1997.000  
##  1st Qu.:3.000000   1st Qu.:1040.000   1st Qu.:1998.000  
##  Median :4.000000   Median :1070.000   Median :1999.000  
##  Mean   :4.032753   Mean   :1072.287   Mean   :1999.385  
##  3rd Qu.:5.000000   3rd Qu.:1085.000   3rd Qu.:2000.000  
##  Max.   :5.000000   Max.   :1615.000   Max.   :2004.000

# Histogram of the price of the vehicle in the data set
hist_price <- cars %>%
  ggplot(aes(x = Price)) +
  geom_histogram(color = "steelblue4", fill = "lightblue2") +
  labs(title = "Histogram of Price",
       subtitle = "Toyota Corolla Dataset",
       x = "Price",
       y= "Frequency") +
  theme_minimal()

box_price <- cars %>% 
  ggplot(aes(x = "Category", y = Price)) +
  geom_boxplot(color = "steelblue4", outlier.colour = "red") +
  stat_summary(fun = "mean", shape = 21, color = "black") +
  labs(title = "Boxplot of Price",
       subtitle = "Toyota Corolla Dataset",
       x = "Vehicle",
       y = "Price") +
  theme_minimal()

combined_price <- grid.arrange(hist_price, box_price, ncol = 2)

As seen in the above histogram, the price of the vehicles in the data set is not uniformly or normally distributed and contains a right skewness. The bulk of the data set is comprised of vehicles ranging from 8,500 to 12,000 in price. Because of the right skewness of the data, the mean price is higher than the median price. Still, it is currently unknown what qualities of each vehicle gather a significantly higher price.

Age

# Histogram of the age of vehicle in months
hist_age <- cars %>%
  ggplot(aes(x = Age)) +
  geom_histogram(color = "steelblue4", fill = "lightblue2") +
  labs(title = "Histogram of Age",
       subtitle = "Toyota Corolla Dataset",
       x = "Age of Vehicle in Months",
       y = "Frequency") +
  theme_minimal()

box_age <- cars %>%
  ggplot(aes(x = "Category", y = Age)) +
  geom_boxplot(color = "steelblue4", outlier.colour = "red") +
  stat_summary(fun = "mean", shape = 21, color = "black") +
  labs(title = "Boxplot of Age",
       subtitle = "Toyota Corolla Dataset",
       x = "Vehicle",
       y = "Age (months)") +
  theme_minimal()

combined_age <- grid.arrange(hist_age, box_age, ncol = 2)

From the histogram and box plot above, it is clear that the majority of the data set is comprised of used vehicles, and the distribution of the age of the vehicle follows a left-skewed distribution.

Year

cars %>% group_by(year) %>% count(year) %>%
  ggplot(aes(x = year, y = n)) +
  geom_bar(stat = "identity", color = "steelblue4", fill = "lightblue2") +
  labs(title = "Production Year of Vehicle",
       subtitle = "Toyota Corolla Dataset",
       x = "Year",
       y = "Number of Vehicles") +
  theme_minimal()

year_table <- cars %>% group_by %>% count(year)
year_table %>% kbl(caption = "Number of Vehicels by Year") %>% kable_classic_2("hover")

Number of Vehicels by Year
year	n
1997	90
1998	409
1999	377
2000	226
2001	158
2002	108
2003	59
2004	8

While it is informative to gather descriptive statistics of the mean and median vehicle age in months, a more effective means of understanding the age of the vehicle is to count the vehicles by year of production. The bar chart and table above show the number of vehicles by production year in the data set.

KM

hist_KM <- cars %>%
  ggplot(aes(x = KM)) +
  geom_histogram(color = "steelblue4", fill = "lightblue2") +
  labs(title = "Histogram of Odometer (KM)",
       subtitle = "Toyota Corolla Dataset",
       x = "Odometer Reading (KM)",
       y = "Frequency") +
  theme_minimal()

box_KM <- cars %>%
  ggplot(aes(x = "Category", y = KM)) +
  geom_boxplot(color = "steelblue4", outlier.colour = "red") +
  stat_summary(fun = "mean", shape = 21, color = "black") +
  labs(title = "Boxplot of Odometer (KM)",
       subtitle = "Toyota Corolla Dataset",
       x = "Vehicle",
       y = "Odometer Reading (KM)") +
  theme_minimal()

combined_KM <- grid.arrange(hist_KM, box_KM, ncol = 2)

The histogram and boxplot of the odometer reading of each vehicle provide a descriptive measure of the quality of the vehicle. It is estimated that the odometer reading will be a crucial explanatory factor in understanding the price of the vehicle because it speaks to the “newness” of the vehicle. However, the understanding of quality derived from the odometer reading does not apply to every vehicle because they vary in the age of production. For example, a car that is one year old and has traveled 100,000 miles is assumed to have been used very differently than a vehicle that is 10 years old and traveled 100,000 miles.

Fuel Type

# Extract all unique variables in the Fuel Type Column
unique(cars$FuelType)

## [1] Diesel Petrol CNG   
## Levels: CNG Diesel Petrol

cars %>% group_by(FuelType) %>% count(FuelType) %>%
  ggplot(aes(x = reorder(FuelType, -n), y = n)) +
  geom_bar(stat = "identity", color = "steelblue4", fill = "lightblue2") +
  labs(title = "Number of Vehicles by Fuel Type",
       subtitle = "Toyota Corolla Dataset",
       x = "Fuel Type",
       y = "Number of Vehicles") +
  theme_minimal()

Number of Vehicels by Fuel Type
FuelType	n
Petrol	1264
Diesel	154
CNG	17

Based on the bar chart of fuel type, it is apparent that the data set is comprised mainly of petrol vehicles. A vehicle that is powered by CNG is very rare in comparison to both petrol and diesel powered vehicles.

HP

unique(cars$HP)

##  [1]  90 192  69 110  97  71 116  98  86  72 107  73

cars %>% group_by(HP) %>% count(HP) %>%
  ggplot(aes(x = HP, y = n)) +
  geom_bar(stat = "identity", color = "steelblue4", fill = "lightblue2") +
  labs(title = "Number of Vehicles by Horsepower",
       subtitle = "Toyota Dataset",
       x = "Vehicle Horsepower",
       y = "Number of Vehicles") +
  theme_minimal()

Number of Vehicels by HP
HP	n
110	835
86	249
97	164
72	73
90	36
69	34
107	21
192	11
116	8
98	2
71	1
73	1

The horsepower bar chart raises questions about data quality and challenges the creation of vehicle price assessments later in this project. The data set contains 12 different horsepower values, but several occur infrequently. For example, is a vehicle annotated as having 71 or 73 horsepower an other kind of car than the vehicles with 72 horsepower?

MetColor

cars %>% group_by(MetColor) %>% count(MetColor) %>%
  ggplot(aes(x = MetColor, y =n)) +
  geom_bar(stat = "identity", color = "steelblue4", fill = "lightblue2") +
  labs(title = "Number of Cars with Metallic Paint",
       subtitle = "Toyota Dataset",
       x = "Vehicle Paint Color",
       y = "Number of Vehicles") +
  theme_minimal()

met_table <- cars %>% group_by(MetColor) %>% count(MetColor)
met_table %>% kbl(caption = "Number of Vehicels by Paint Color") %>% kable_classic_2("hover")

Number of Vehicels by Paint Color
MetColor	n
0	467
1	968

The vehicles in the data set are comprised of a mix of metallic color paint and non-metallic color paint. Vehicles with a metallic color occur more than twice as frequently as those without a metallic color. At this time, it is unknown if metallic color is more popular or if the manufacturer produced more vehicles without metallic paint as an option. The bar chart provides information about the paint color per vehicle within the data set and is not descriptive of larger trends.

Automatic

cars %>% group_by(Automatic) %>% count(Automatic) %>%
  ggplot(aes(x = Automatic, y = n)) +
  geom_bar(stat = "identity", color = "steelblue4", fill = "lightblue2") +
  labs(title = "Number of Vehicles by Transmission Type",
       subtitle = "Toyota Dataset",
       x = "Vehicle Transmission Type",
       y = "Number of Vehicles") +
  theme_minimal()

trans_table <- cars %>% group_by(Automatic) %>% count(Automatic)
trans_table %>% kbl(caption = "Number of Vehicels by Transmission Type") %>%
  kable_classic_2("hover")

Number of Vehicels by Transmission Type
Automatic	n
0	1355
1	80

Similar to metallic paint, little information in the transmission type variable could be applied to inform a greater understanding of trends. Instead, the bar chart above presents the frequency of occurrence of automatic and manual transmission vehicles.

CC

cars %>% group_by(CC) %>% count(CC)

## # A tibble: 12 × 2
## # Groups:   CC [12]
##       CC     n
##    <int> <int>
##  1  1300   248
##  2  1332     2
##  3  1398     2
##  4  1400   164
##  5  1587     4
##  6  1598     4
##  7  1600   846
##  8  1800    14
##  9  1900    30
## 10  1975     1
## 11  1995     2
## 12  2000   118

cars %>% group_by(CC) %>% count(CC) %>%
  ggplot(aes(x = CC, y =n)) +
  geom_bar(stat = "identity", color = "steelblue4", fill = "lightblue2") +
  labs(title = "Number of Cars by Engine Displacement",
       subtitle = "Toyota Dataset",
       x = "Engine Size",
       y = "Number of Vehicles") +
  theme_minimal()

Number of Vehicels by CC
CC	n
1600	846
1300	248
1400	164
2000	118
1900	30
1800	14
1587	4
1598	4
1332	2
1398	2
1995	2
1975	1

The “CC” variable raises further questions about data quality in the data set. Engine displacement sizes are fairly consistent, but the data set shows many more than expected. Additionally, some engine sizes occur at such low frequencies that they create questions. For example, is an engine CC size of 1598 a data entry error or a distinct engine made by the manufacturer? Further analysis will aim to minimize the effects of potentially erroneous information while simultaneously understanding patterns of vehicles based on shared traits.

Doors

cars %>% group_by(Doors) %>% count(Doors) %>%
  ggplot(aes(x = Doors, y =n)) +
  geom_bar(stat = "identity", color = "steelblue4", fill = "lightblue2") +
  labs(title = "Number of Vehicles by Door Count",
       subtitle = "Toyota Dataset",
       x = "Number of Doors",
       y = "Number of Vehicles") +
  theme_minimal()

Number of Vehicels by Doors
Doors	n
5	673
3	622
4	138
2	2

The bar chart above describes the frequency of the number of doors per vehicle in the data set. Vehicles with five and four doors are significantly more represented in the data set than a three or 2 door vehicle.

Weight

hist_Weight <- cars %>%
  ggplot(aes(x = Weight)) +
  geom_histogram(color = "steelblue4", fill = "lightblue2") +
  labs(title = "Histogram of Vehicle Weight",
       subtitle = "Toyota Dataset",
       x = "Vehicle Weight",
       y = "Number of Vehicles") +
  theme_minimal()

box_Weight <- cars %>%
  ggplot(aes(x = "Category", y = Weight)) +
  geom_boxplot(color = "steelblue4", outlier.colour = "red") +
  stat_summary(fun = "mean", shape = 21, color = "black") +
  labs(title = "Boxplot of Vehicle Weight",
       subtitle = "Toyota Dataset",
       x = "Vehicle",
       y = "Weight") +
  theme_minimal()

combined_HP <- grid.arrange(hist_Weight, box_Weight, ncol = 2)

While there may be questions raised in the horsepower and CC variables, the weight variable is assumed to be accurate because of the various options unseen in the data set that can affect the weight of the vehicle, such as wheel size, roof racks, and other features. Regardless, the histogram shows that vehicles less than 1,100 contain over 75% of the rows in the data set. Heavier vehicles occur at meager rates.

Year

cars$year <- as.factor(cars$year)

cars %>% 
  ggplot(aes(x = year, y = Price, fill = year)) +
  geom_boxplot() +
  stat_summary(fun = "mean", shape = 21, fill = "white", color = "black") +
  labs(title = "Boxplot of Price per Year",
       subtitle = "Toyota Dataset",
       x = "Year of Production",
       y = "Price of Vehicle") +
  scale_y_continuous(limits = c(0, max(cars$Price))) +
  theme_minimal() +
  theme(legend.position = "none")

The box plot of the vehicle price grouped by the vehicle’s year illuminates a trend in the data set. First, it appears as though the newer the vehicle, the higher the cost of the vehicle. However, there are outliers in all but two years of production. Therefore, it is believed that other factors beyond the year of production inform price. Further analysis will aim to understand what factors in the dataset, when coupled with age, create predictability in the price of the vehicle.

Bivariate Analysis

Based on the research goals of this project, further analysis was required to understand how each of the variables in the dataset were related to one another. Bivariate analysis began with establishing the correlations between each of the variables and then created visualizations of several notable pairings of variables.

Correlations

# Coerce all variables to numeric format, remove NA values
cars_numeric <- cars
cars_numeric[] <- lapply(cars, as.numeric)
cars_numeric <- subset(cars_numeric, select = -c(FuelType))

# Create visualization of correlations
  corrplot(corrs, 
         order = "hclust",
         tl.col = "black",
         tl.srt = 45)

Correlations
	Price	Age	KM	HP	MetColor	Automatic	CC	Doors	Weight	year
Price	1.0000000000	-0.8762733805	-0.5694195943	0.3141341299	0.1076036677	0.0339275078	0.1598967139	0.1836039137	0.5758694098	0.8582304170
Age	-0.8762733805	1.0000000000	0.5045745614	-0.1552931956	-0.1071686211	0.0313536648	-0.1295636623	-0.1469288098	-0.4664843289	-0.9805906788
KM	-0.5694195943	0.5045745614	1.0000000000	-0.3329041773	-0.0798635631	-0.0821675939	0.3053305003	-0.0351930973	-0.0239687883	-0.4932347233
HP	0.3141341299	-0.1552931956	-0.3329041773	1.0000000000	0.0582721811	0.0133125134	0.0494304626	0.0918027208	0.0871433318	0.1542135794
MetColor	0.1076036677	-0.1071686211	-0.0798635631	0.0582721811	1.0000000000	-0.0192215676	0.0338704207	0.0847964976	0.0560966665	0.1052297259
Automatic	0.0339275078	0.0313536648	-0.0821675939	0.0133125134	-0.0192215676	1.0000000000	-0.0690599888	-0.0274924593	0.0584987130	-0.0306542997
CC	0.1598967139	-0.1295636623	0.3053305003	0.0494304626	0.0338704207	-0.0690599888	1.0000000000	0.1254096400	0.6500983692	0.1251221338
Doors	0.1836039137	-0.1469288098	-0.0351930973	0.0918027208	0.0847964976	-0.0274924593	0.1254096400	1.0000000000	0.3017335071	0.1408475515
Weight	0.5758694098	-0.4664843289	-0.0239687883	0.0871433318	0.0560966665	0.0584987130	0.6500983692	0.3017335071	1.0000000000	0.4542064758
year	0.8582304170	-0.9805906788	-0.4932347233	0.1542135794	0.1052297259	-0.0306542997	0.1251221338	0.1408475515	0.4542064758	1.0000000000

The correlation matrix above shows several variables highly related to one another and several almost completely unrelated variables. For example, the strongest relationship in the data set occurs between the Age and Price variables. Other strong relationships arise between the Age of the vehicle and the Odometer reading (KM) and the vehicle’s price with the vehicle’s weight.

Age & KM

# Age vs KM for all cars
cars %>%
  ggplot(aes(x = Age, y = KM)) +
  geom_point() +
  labs(title = "Scatterplot of Age and Odometer Reading",
       subtitle = "Toyota Dataset",
       x = "Age of Vehicle (months)",
       y = "Odometer Reading (KM)") +
  geom_smooth(method = "lm") +
  theme_minimal()

The scatter plot above shows several interesting patterns. First, as expected, the older the vehicle is, the higher the number of KM driven. Second, not every vehicle is used at the same rate, and the difference of how vehicles are used becomes greater over time.

cars %>% 
  ggplot(aes(x = Age, y = KM, color = FuelType)) +
  geom_point() +
  labs(title = "Scatterplot of Age and Odeomter Reading by Fuel Type",
       subtitle = "Toyota Dataset",
       x = "Age of Vehicle (months)",
       y = "Odometer Reading (KM)") +
  theme_minimal()

Interestingly, when the age of the vehicle is plotted with a coloring of the fuel type, a pattern becomes clear that highlights a difference between vehicles in the dataset. Generally speaking, diesel-powered vehicles have a higher odometer reading across the entire dataset.

cars %>%
  ggplot(aes(x = Age, y = KM, color = HP)) +
  geom_point() +
  labs(title = "Scatterplot of Age and Odometer Reading by Horsepower",
       subtitle = "Toyota Dataset",
       x = "Age of Vehicle (months)",
       y = "Odometer Reading (KM)") +
  theme_minimal()

cars %>%
  ggplot(aes(x = Age, y = KM, color = Doors)) +
  geom_point() +
  labs(title = "Scatterplot of Age and Odometer Reading by Number of Doors",
       subtitle = "Toyota Dataset",
       x = "Age of Vehicle (months)",
       y = "Odometer Reading (KM)") +
  theme_minimal()

cars %>%
  ggplot(aes(x = Age, y = KM, color = Weight)) +
  geom_point() +
  labs(title = "Scatterplot of Age and Odometer Reading by Vehicle Weight",
       subtitle = "Toyota Dataset",
       x = "Age of Vehicle (months)",
       y = "Odometer Reading (KM)") +  
  theme_minimal()

Age & KM Conclusions

An interesting pattern becomes apparent when evaluating the age of the vehicle and its odometer reading, which is aligned with what was observed in the correlation matrix. As the vehicle becomes older, the odometer reading increases, but not at the same rate for each vehicle. Although the number of doors, the weight of the vehicle, and the horsepower are descriptive qualities of the vehicle, no clear pattern appears when inspecting how each of those qualities is connected to the age-odometer relationship.

Age & Weight

cars %>%
  ggplot(aes(x = Age, y = Weight)) +
  geom_point() +
  labs(title = "Scatterplot of Age and Vehicle Weight",
       subtitle = "Toyota Dataset",
       x = "Age of Vehicle (months)",
       y = "Vehicle Weight") +
  geom_smooth(method = "lm") +   
  scale_x_continuous(limits = c(0, max(cars$Age))) +
  scale_y_continuous(limits = c(0, max(cars$Weight))) +
  theme_minimal()

Age & Weight Conclusions

As identified in the correlation matrix, there is a connection between the vehicle’s age and the vehicle’s weight. The scatter plot above shows that connection and it appears as though manufacturers are making heavier vehicles now than they were compared to the oldest vehicles in the data set.

Price & Age

cars %>%
  ggplot(aes(x = Age, y = Price)) +
  geom_point() +
  labs(title = "Scatterplot of Age and Price",
       subtitle = "Toyota Dataset",
       x = "Age of Vehicle (months)",
       y = "Vehicle Price") +
  geom_smooth(method = "lm") + 
  scale_x_continuous(limits = c(0, max(cars$Age))) +
  scale_y_continuous(limits = c(0, max(cars$Price))) +
  theme_minimal()

cars %>%
  ggplot(aes(x = Age, y = Price, color = FuelType)) +
  geom_point() +
  labs(title = "Scatterplot of Age and Price by Fuel Type",
       subtitle = "Toyota Dataset",
       x = "Age of Vehicle (months)",
       y = "Vehicle Price") +
  scale_x_continuous(limits = c(0, max(cars$Age))) +
  scale_y_continuous(limits = c(0, max(cars$Price))) +
  theme_minimal()

cars %>%
  ggplot(aes(x = Age, y = Price, color = HP)) +
  geom_point() +
  labs(title = "Scatterplot of Age and Price by Vehicle Horsepower",
       subtitle = "Toyota Dataset",
       x = "Age of Vehicle (months)",
       y = "Vehicle Price") +
  scale_x_continuous(limits = c(0, max(cars$Age))) +
  scale_y_continuous(limits = c(0, max(cars$Price))) +
  theme_minimal()

cars %>%
  ggplot(aes(x = Age, y = Price, color = Doors)) +
  geom_point() +
  labs(title = "Scatterplot of Age and Price by NUmber of Doors",
       subtitle = "Toyota Dataset",
       x = "Age of Vehicle (months)",
       y = "Vehicle Price") +
  scale_x_continuous(limits = c(0, max(cars$Age))) +
  scale_y_continuous(limits = c(0, max(cars$Price))) +
  theme_minimal()

cars %>%
  ggplot(aes(x = Age, y = Price, color = Weight)) +
  geom_point() +
  labs(title = "Scatterplot of Age and Price by Vehicle Weight",
       subtitle = "Toyota Dataset",
       x = "Age of Vehicle (months)",
       y = "Vehicle Price") +
  scale_x_continuous(limits = c(0, max(cars$Age))) +
  scale_y_continuous(limits = c(0, max(cars$Price))) +
  theme_minimal()

Price & Age Conclusions

The strongest connection found in the correlation matrix occurred between the vehicle price and the vehicle age. It is believed that the vehicle age could be highly predictive of vehicle price. While visualizing the age of vehicle and the price of vehicle in the data set, and interesting pattern becamce apparent, the higher end of horsepower vehicles occurs clustered within a small age range. Based on that information, horsepower could be valuable in determining the model of the vehicle, which could become valuable in predicting the price of the vehicle later in this project.

Price & Year

cars %>% 
  ggplot(aes(x = year, y = Price, color = FuelType)) +
  geom_point() +
  labs(title = "Scatterplot of Year and Price by Fuel Type",
       subtitle = "Toyota Dataset",
       x = "Year of Vehicle",
       y = "Vehicle Price") +
  theme_minimal()

cars %>% 
  ggplot(aes(x = year, y = Price, color = HP)) +
  geom_point() +
  labs(title = "Scatterplot of Year and Price by Horsepower",
       subtitle = "Toyota Dataset",
       x = "Year of Vehicle",
       y = "Vehicle Price") +
  theme_minimal()

cars %>% 
  ggplot(aes(x = year, y = Price, color = Doors)) +
  geom_point() +
  labs(title = "Scatterplot of Age and Price by Number of Doors",
       subtitle = "Toyota Dataset",
       x = "Year of Vehicle",
       y = "Vehicle Price") +
  theme_minimal()

Price & Year Conclusions

Even though year and age describe the same quality in the vehicle, arranging the data by year aims to understand patterns that are not immediately obvious. While the number of doors is not very informative when aligned in a scatter plot of Year and Price, other patterns appear when evaluating horsepower and fuel type.

First, there appears to be a trend where the most expensive newer vehicles are diesel-powered, and the cheapest older vehicles are diesel-powered. When taking the previously established understanding gained on the odometer reading per fuel type stating that diesel vehicles travel more kilometers than their petrol counterparts, further evidence is available that the odometer reading is highly predictive of the vehicle price.

Second, the pattern of the highest horsepower vehicles is apparent in a very small section of the scatter plot of years and price, and those vehicles garner the highest price. This pattern further illustrates that more than one factor informs price.

Price & KM

cars %>% 
  ggplot(aes(x = KM, y = Price)) +
  geom_point() +
  labs(title = "Scatterplot of Odometer Reading and Price",
       subtitle = "Toyota Dataset",
       x = "Odometer Reading (KM)",
       y = "Vehicle Price") +
  scale_x_continuous(limits = c(0, max(cars$KM))) +
  scale_y_continuous(limits = c(0, max(cars$Price))) +
  theme_minimal()

cars %>% 
  ggplot(aes(x = KM, y = Price, color = Age)) +
  geom_point() +
  labs(title = "Scatterplot of Odometer Reading and Price by Age",
       subtitle = "Toyota Dataset",
       x = "Odometer Reading (KM)",
       y = "Vehicle Price") +
  scale_x_continuous(limits = c(0, max(cars$KM))) +
  scale_y_continuous(limits = c(0, max(cars$Price))) +
  theme_minimal()

Price & KM Conclusions

As expected, based on knowledge gained in other areas of this project, the scatter plot shows the connection between the odometer reading and the price of the vehicle. However, when paired with the age of the vehicle, it is apparent that not every vehicle is used at the same rate. There are very new vehicles with high and very old ones with low odometer readings.

Clustering

Based on information in the dataset, it is apparent there are multiple types of vehicles in the data set. When attempting to produce a prediction of the price of the vehicle, it is assumed that the type of vehicle may improve the predictive abilities of the model. If clustering or grouping of the vehicles by type did not occur, the analysis would create comparisons on significantly different items. Therefore, k-means clustering was conducted to quantitatively establish the types of vehicles in the data set and associate each vehicle with that type.

Features selected for clustering were descriptive measures that could not be applied to other types of vehicles and are assumed to be the most mutually exclusive variables. For example, an automatic transmission vehicle or a vehicle with metallic paint is likely an option for all models.

# Create new data frame with selected features for Clustering
features <- cars %>% select(HP, CC, Doors, Weight)
scaled_features <- scale(features)

# Function to determine within cluster sum of squares
wcss <- function(k) {
  kmeans_result <- kmeans(scaled_features, centers = k)
  return(kmeans_result$tot.withinss)}

# Determine number of clusters by using within cluster sum of squares function
k_values <- 1:12
wcss_values <- sapply(k_values, wcss)

# Create elbow plot to determine optimal number of clusters
elbow_plot <- data.frame(k = k_values, WCSS = wcss_values)

ggplot(elbow_plot, aes(x = k, y = WCSS)) +
  geom_line() +
  geom_point() +
  labs(title = "Elbow Plot for Optimal k",
       x = "Number of Clusters (k)",
       y = "Within-Cluster Sum of Squares (WCSS)")

An elbow curve was created to establish the optimal number of clusters within the data set based on the chosen features. The elbow curve plot above shows a flattening in the within sum of squares values between clusters at 6. Therefore, it has been estimated that there are six significantly different types of models of vehicles in the data set. Next, the model type was attached to each row to provide a further descriptive quality.

# Use optimal k to assign model to each vehicle in dataset
k <- 6
result <- kmeans(scaled_features, centers = k)
cars$Model <- result$cluster

# Convert Model into Factor
cars$Model <- as.factor(cars$Model)

Vehicle Model Analysis

cars %>% group_by(Model) %>% count(Model) %>%
  ggplot(aes(x = Model, y = n)) +
  geom_bar(stat = "identity", color = "steelblue4", fill = "lightblue2") +
  labs(title = "Number of Vehicles per Model",
       subtitle = "Toyota Dataset",
       x = "Model",
       y = "number of Vehicles") +
  theme_minimal()

Number of Vehicels by Model
Model	n
1	422
2	22
3	144
4	152
5	241
6	454

After clustering and classifying each of the cars based on their values of weight, engine displacement, doors, and horsepower, the bar chart above was created to understand the frequency of each type of model in the dataset.

# Visualize differences in models by HP and weight
cars %>% 
  ggplot(aes(x = Age, y = Price, color = Model)) +
  geom_point() +
  labs(title = "Scatterplot of Age and Price by Vehicle Model",
       subtitle = "Toyota Dataset",
       x = "Vehicle Age (months)",
       y = "Vehicle Price") +
  theme_minimal()

While initially difficult to extract meaningful information fromt the above scatterplot, several logical conclusions can be drawn when evaluating the age of the vehicle and its price arranged by model. First, one model is one of the cheapest models of vehicle. Next, there is largely more expensive model than other models.

# Visualize differences in models by HP and weight
cars %>% 
  ggplot(aes(x = KM, y = Price, color = Model)) +
  geom_point() +
  labs(title = "Scatterplot of Odometer Reading and Price by Vehicle Model",
       subtitle = "Toyota Dataset",
       x = "Odometer Reading (KM)",
       y = "Vehicle Price") +
  theme_minimal()

To further explain the separation between vehicles, an interesting trait becomes apparent when evaluating the Odometer reading and price. Across the top section of the entire plat, one stands out as gathering some of the highest prices. Therefore, it is reasonable that the vehicle model could add to future predictive efforts.

cars %>% 
  ggplot(aes(x = Model, y = Price, fill = Model)) +
  geom_boxplot() +
  stat_summary(fun = "mean", shape = 21, fill = "white", color = "black") +
  labs(title = "Boxplot of Price per Model",
       subtitle = "Toyota Dataset",
       x = "Model of Vehicle",
       y = "Price of Vehicle") +
  scale_y_continuous(limits = c(0, max(cars$Price))) +
  theme_minimal() +
  theme(legend.position = "none")

Vehicle Model Conclusions

This section has provided a range of prices for each vehicle model as it was separated by specific features within the data set. Coupled with additional information, a range of expected price will likely be achieved.

Price Categorization

Once additional insight was created on the types of vehicles in the data set, the focus was shifted to making price range expectations for each car model. Among other factors, such as KM, the age or year of the car is proven to be highly related to the price of the vehicle. Therefore, a descriptive measure of the value of each car was created by first grouping the data set but Model and Year of production, then extracting the price quantiles. Next, if a car fell between the 25th and 75th quantile, a value of average was assigned per row. If the vehicle price occurred above the 75th quantile, it was given a categorization of expensive. Finally, the vehicle was categorized as cheap if the price was below the 25th quantile.

# Group data frame by Model of car and extract quantiles of price
quartiles_per_model <- cars %>%
  group_by(Model, year) %>% 
  summarize(Q1 = quantile(Price, 0.25),
            Q3 = quantile(Price, 0.75),
            .groups = "drop")

# Create Function for attaching the descriptive value of "cheap", "expensive", or "average" per model.
categorize_price <- function(price, year, q1, q3) {
  if (price < q1) {
    return("cheap")
  } else if (price > q3) {
    return("expensive")
  } else {
    return("average")
  }
}

# Apply function to data frame
cars <- cars %>%
  left_join(quartiles_per_model, by = c("Model", "year")) %>%
  mutate(Price_Category = mapply(categorize_price, Price, year, Q1, Q3))

# Visualize distribution of value along Age and Price variables
cars %>%
  ggplot(aes(x = Age, y = Price, color = Price_Category)) +
  geom_point() +
  labs(title = "Scatterplot of Age and Price by Price Category",
       subtitle = "Toyota Dataset",
       x = "Vehicle Age (months)",
       y = "Vehicle Price") +
  theme_minimal()

After categorizing based on the model and year, a scatter plot was made to understand the distribution of price and age when grouped by year. Unsurprisingly, the older the vehicle, the less overlap in prices; the newer the vehicle, the more overlap in prices. It is hypothesized that this is occurring because of several factors. First, the data set is more populated by older vehicles, as seen earlier in this paper. Second, the difference between other descriptive factors of quality wanes as the vehicle ages. For example, a 50,000 km difference in a two-year-old car may not be as important as a 50,000 km difference in a much older vehicle.

# Visualize distribution of value along Age and Price variables
cars %>%
  ggplot(aes(x = year, y = Price, color = Price_Category)) +
  geom_point() +
  labs(title = "Scatterplot of Year and Price by Value",
       subtitle = "Toyota Dataset",
       x = "Vehicle Year",
       y = "Vehicle Price") +
  theme_minimal()

The same phenomenon exists when the same information as the previous plot is arranged per production year instead of age in months. The older the vehicle, the tighter the distribution of price.

Price Categorization Conclusion

Categorizing each model provides price ranges that a customer or a seller can expect based purely on the model and year of production. Further discriminating factors must be considered to price a vehicle accurately, but this was a crucial step in understanding the price of a vehicle and aiding in finding predictability of price.

Linear Regression

The final step in this project was to create a linear regression model to predict the price of the vehicle. During this research phase, several models were created to find the most accurate price prediction method. Then, each model was evaluated using a testing and training set validation methodology.

cars$Model <- as.numeric(cars$Model)

# Create linear model to predict price of vehicle based on all data
model_A <- lm(Price ~ Age + KM + HP + CC + Doors + Weight + Model,
              data = cars)
summary(model_A)

## 
## Call:
## lm(formula = Price ~ Age + KM + HP + CC + Doors + Weight + Model, 
##     data = cars)
## 
## Residuals:
##         Min          1Q      Median          3Q         Max 
## -12132.3243   -771.8593     -0.1815    765.2973   6266.0423 
## 
## Coefficients:
##                    Estimate      Std. Error   t value               Pr(>|t|)
## (Intercept) -5929.669261178  1004.412318170  -5.90362        0.0000000044371
## Age          -122.144491049     2.591762396 -47.12797 < 0.000000000000000222
## KM             -0.016717145     0.001287522 -12.98397 < 0.000000000000000222
## HP             32.411068165     2.579988713  12.56248 < 0.000000000000000222
## CC             -1.724614107     0.289060851  -5.96627        0.0000000030568
## Doors         -32.367976053    50.882053025  -0.63614                0.52479
## Weight         22.612449033     1.067795897  21.17675 < 0.000000000000000222
## Model         -18.402473919    23.499313758  -0.78311                0.43369
##                
## (Intercept) ***
## Age         ***
## KM          ***
## HP          ***
## CC          ***
## Doors          
## Weight      ***
## Model          
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1333.703 on 1427 degrees of freedom
## Multiple R-squared:   0.86408,   Adjusted R-squared:  0.8634133 
## F-statistic: 1295.975 on 7 and 1427 DF,  p-value: < 0.00000000000000022204

# Create linear model on selected data
model_B <- lm(Price ~ Age + KM + Model, data = cars)
summary(model_B)

## 
## Call:
## lm(formula = Price ~ Age + KM + Model, data = cars)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -6795.4164  -990.5365   -83.7145   837.9680 12605.2367 
## 
## Coefficients:
##                    Estimate      Std. Error   t value             Pr(>|t|)    
## (Intercept) 20568.394833597   165.300223084 124.43053 < 0.0000000000000002 ***
## Age          -153.607387536     2.728547372 -56.29640 < 0.0000000000000002 ***
## KM             -0.016572145     0.001354638 -12.23363 < 0.0000000000000002 ***
## Model         -29.592725590    21.418209513  -1.38166              0.16729    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1655.986 on 1431 degrees of freedom
## Multiple R-squared:  0.7898669,  Adjusted R-squared:  0.7894263 
## F-statistic: 1792.989 on 3 and 1431 DF,  p-value: < 0.00000000000000022204

# Create linear model to predict price of vehicle based on all data
model_C <- lm(Price ~ Age + KM + HP + CC + Weight + Doors,
              data = cars)
summary(model_C)

## 
## Call:
## lm(formula = Price ~ Age + KM + HP + CC + Weight + Doors, data = cars)
## 
## Residuals:
##         Min          1Q      Median          3Q         Max 
## -12079.6802   -759.6196     -2.5351    757.0873   6246.0535 
## 
## Coefficients:
##                    Estimate      Std. Error   t value               Pr(>|t|)
## (Intercept) -6026.569337503   996.626276545  -6.04697        0.0000000018813
## Age          -122.138790417     2.591401189 -47.13234 < 0.000000000000000222
## KM             -0.016703458     0.001287229 -12.97629 < 0.000000000000000222
## HP             32.742225792     2.544752007  12.86657 < 0.000000000000000222
## CC             -1.663921809     0.278439621  -5.97588        0.0000000028856
## Weight         22.612569864     1.067651281  21.17973 < 0.000000000000000222
## Doors         -57.877015407    39.083172384  -1.48087                0.13886
##                
## (Intercept) ***
## Age         ***
## KM          ***
## HP          ***
## CC          ***
## Weight      ***
## Doors          
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1333.522 on 1428 degrees of freedom
## Multiple R-squared:  0.8640216,  Adjusted R-squared:  0.8634503 
## F-statistic: 1512.278 on 6 and 1428 DF,  p-value: < 0.00000000000000022204

Linear Regression Validation

# Partition data into test and training
samples <- cars$Price %>%
  createDataPartition(p = .8, list = FALSE)
train.data <- cars[samples, ]
test.data <- cars[-samples, ]

# Construct model A and model B from training data
model_A_validate <- lm(Price ~ Age + KM + HP + CC + Doors + Weight + Model,
              data = train.data)
model_B_validate <- lm(Price ~ Age + KM + Model, data = train.data)
model_C_validate <- lm(Price ~ Age + KM + HP + CC + Weight + Doors, data = train.data)

# Evaluate the quality of the model_A through R2, RSME, and MAE
predictions_model_A <- model_A_validate %>% predict(test.data)
predictions_model_A <- data.frame(Model = "A",
           R2 = R2(predictions_model_A, test.data$Price),
           RMSE = RMSE(predictions_model_A, test.data$Price),
           MAE = MAE(predictions_model_A, test.data$Price))

# Evaluate the quality of the model model_B through R2, RSME, and MAE
predictions_model_B <- model_B_validate %>% predict(test.data)
predictions_model_B <- data.frame(Model = "B",
           R2 = R2(predictions_model_B, test.data$Price),
           RMSE = RMSE(predictions_model_B, test.data$Price),
           MAE = MAE(predictions_model_B, test.data$Price))

# Evaluate the quality of the model model_B through R2, RSME, and MAE
predictions_model_C <- model_C_validate %>% predict(test.data)
predictions_model_C <- data.frame(Model = "C",
           R2 = R2(predictions_model_C, test.data$Price),
           RMSE = RMSE(predictions_model_C, test.data$Price),
           MAE = MAE(predictions_model_C, test.data$Price))

# Merge model statistics into single data frame
validation <- bind_rows(predictions_model_A, predictions_model_B, predictions_model_C)

validation %>% kbl(caption = "Evaluation of Model Effectiveness") %>% 
  kable_classic_2("hover")

Evaluation of Model Effectiveness
Model	R2	RMSE	MAE
A	0.8926599223	1316.466083	966.6009983
B	0.7806615114	1862.734272	1249.7517477
C	0.8926465011	1316.542173	966.6350931

Regression Conclusion

After validation, it was found that there is a moderate level of predictability in the price of a vehicle based on the information in the data set. The first model considered each of the variables and achieved the best predictive powers measured through analysis of residuals and the mean absolute error of those residuals. The second model eliminated many variables and asked if the model, age, and odometer reading were enough information to predict the price of the vehicle. The third model tested the car model’s importance, informing the vehicle’s price. All other variables from the data set were included in model “C” except the vehicle model. Accuracy suffered very slightly by excluding the vehicle model in the predictive model. While it is informative to understand which model of the vehicle was purchased, there is enough overlap in the prices per model to minimize the effect. To gain further accuracy in predicting the cost of the vehicle, additional features must be added to the data set. Luxury features such as entertainment systems or security and safety features are expected to comprise the information needed to create a more accurate predictive model.

Conclusion

Question 1

Is there a connection between any of the descriptive variables in each row and the price of the vehicle? If so, what variables are most closely related to the price of the vehicle?

There is a connection between several variables in the dataset. The most closely related variables are price and KM.

Question 2

Is there a most popular vehicle model contained in the data set?

There is a most popular vehicle model in the data set, and not every vehicle in the data set contains the same features. Within the data set are vehicles with high horsepower, low horsepower, large engines, and small engines, as well as 3, 4, and 5-door models. The research did not focus on the descriptive factors of each model but sought to identify the existence of models and has achieved that goal.

Question 3

What are the ranges of price that could be expected for vehicles in the dataset?

The most accurate method of understanding the range of price was achieved through a combination of the model of vehicle and the year of production of the vehicle. When coupling the two descirptive factors, a starting point is achieved that will allow the customer and retailed to further refine price based on additional features.

Question 4

If prices follow a pattern, what information can be imputed to provide an understanding of the value for the purchase or sale of the vehicle?

There is a pattern that describes the expected price of the vehicle. A price category has been imputed into the data set that describes each vehicle as cheap, average, or expensive based on comparing other vehicles of the same model and year.

Question 5

Is it possible to predict the cost of the vehicle based on the information in the dataset?

While model and year are valid starting points for the price range per vehicle, additional features provide accuracy in predicting the price of the vehicle. However, because the data set is very shallow in descriptive features, predictability is not 100% accurate. Additional information is required to gain further predictive accuracy.

Toyota Corolla Dataset

David Curtis

2023-11-13

Introduction

Analysis of Data Quality

Data Manipulation

Univariate Analysis

Price

Age

Year

KM

Fuel Type

HP

MetColor

Automatic

CC

Doors

Weight

Year

Bivariate Analysis

Correlations

Age & KM

Age & KM Conclusions

Age & Weight

Age & Weight Conclusions

Price & Age

Price & Age Conclusions

Price & Year

Price & Year Conclusions

Price & KM

Price & KM Conclusions

Clustering

Vehicle Model Analysis

Vehicle Model Conclusions

Price Categorization

Price Categorization Conclusion

Linear Regression

Linear Regression Validation

Regression Conclusion

Conclusion

Question 1

Question 2

Question 3

Question 4

Question 5