In the following report, we’ve prepared the data contained in the Corolla data set which englobes different variables including prices, manufacturing year, mileage, and others. In this report, we follow the steps of cleaning the data to remove unnecessary outliers and execute an EDA analysis process to understand the given data and relate the variables, and in future steps of our challenge project, have a prepared dataset with the correct information to carry on further analyses. Overall, this report aims to provide a detailed panorama on the procedures followed to clean the data, as well as explain the reasons as to why these steps were taken. We aim to demonstrate concise and clear explanations, with the purpose of achieving a proper understanding of the procedure.
As part of our two first steps, we initiated by loading our dataset into the program, and storing it into a variable that we decided to call “data”. As well, we applied a “gsub” function to our dataset that proceeds to replace any leading question mark at the beginning of stings in the “Model” column with an empty string (““), which basically means that the question marks found are being removed throughout this function. Once this step is made, a summarized demonstration of the data is displayed by applying the skim function. Lastly but not least, by using the”“colnames” function to “data”, this line returns the names of all columns in the data frame as a character vector, listing each variable name foun in our dataset after any data manipulation.
In shorter words, these steps are taken not only to initiate with our data preparation process, but most importantly, it helps to clean, summarize, and verify the structure of “data”.
data = read.csv("corolla2.csv")
data$Model <- gsub("^\\?", "", data$Model)
skim(data)
| Name | data |
| Number of rows | 1436 |
| Number of columns | 37 |
| _______________________ | |
| Column type frequency: | |
| character | 2 |
| numeric | 35 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| Model | 0 | 1 | 15 | 75 | 0 | 319 | 0 |
| Fuel_Type | 0 | 1 | 3 | 6 | 0 | 3 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| Id | 0 | 1 | 721.56 | 416.48 | 1 | 361.75 | 721.5 | 1081.25 | 1442 | ▇▇▇▇▇ |
| Price | 0 | 1 | 10730.82 | 3626.96 | 4350 | 8450.00 | 9900.0 | 11950.00 | 32500 | ▇▅▁▁▁ |
| Age_08_04 | 0 | 1 | 55.95 | 18.60 | 1 | 44.00 | 61.0 | 70.00 | 80 | ▁▂▃▆▇ |
| Mfg_Month | 0 | 1 | 5.55 | 3.35 | 1 | 3.00 | 5.0 | 8.00 | 12 | ▇▅▅▃▅ |
| Mfg_Year | 0 | 1 | 1999.63 | 1.54 | 1998 | 1998.00 | 1999.0 | 2001.00 | 2004 | ▇▂▂▁▁ |
| KM | 0 | 1 | 68533.26 | 37506.45 | 1 | 43000.00 | 63389.5 | 87020.75 | 243000 | ▅▇▂▁▁ |
| HP | 0 | 1 | 101.50 | 14.98 | 69 | 90.00 | 110.0 | 110.00 | 192 | ▃▇▁▁▁ |
| Met_Color | 0 | 1 | 0.67 | 0.47 | 0 | 0.00 | 1.0 | 1.00 | 1 | ▃▁▁▁▇ |
| Automatic | 0 | 1 | 0.06 | 0.23 | 0 | 0.00 | 0.0 | 0.00 | 1 | ▇▁▁▁▁ |
| cc | 0 | 1 | 1576.86 | 424.39 | 1300 | 1400.00 | 1600.0 | 1600.00 | 16000 | ▇▁▁▁▁ |
| Doors | 0 | 1 | 4.03 | 0.95 | 2 | 3.00 | 4.0 | 5.00 | 5 | ▁▇▁▂▇ |
| Cylinders | 0 | 1 | 4.00 | 0.00 | 4 | 4.00 | 4.0 | 4.00 | 4 | ▁▁▇▁▁ |
| Gears | 0 | 1 | 5.03 | 0.19 | 3 | 5.00 | 5.0 | 5.00 | 6 | ▁▁▁▇▁ |
| Quarterly_Tax | 0 | 1 | 87.12 | 41.13 | 19 | 69.00 | 85.0 | 85.00 | 283 | ▇▇▁▂▁ |
| Weight | 0 | 1 | 1072.46 | 52.64 | 1000 | 1040.00 | 1070.0 | 1085.00 | 1615 | ▇▁▁▁▁ |
| Mfr_Guarantee | 0 | 1 | 0.41 | 0.49 | 0 | 0.00 | 0.0 | 1.00 | 1 | ▇▁▁▁▆ |
| BOVAG_Guarantee | 0 | 1 | 0.90 | 0.31 | 0 | 1.00 | 1.0 | 1.00 | 1 | ▁▁▁▁▇ |
| Guarantee_Period | 0 | 1 | 3.82 | 3.01 | 3 | 3.00 | 3.0 | 3.00 | 36 | ▇▁▁▁▁ |
| ABS | 0 | 1 | 0.81 | 0.39 | 0 | 1.00 | 1.0 | 1.00 | 1 | ▂▁▁▁▇ |
| Airbag_1 | 0 | 1 | 0.97 | 0.17 | 0 | 1.00 | 1.0 | 1.00 | 1 | ▁▁▁▁▇ |
| Airbag_2 | 0 | 1 | 0.72 | 0.45 | 0 | 0.00 | 1.0 | 1.00 | 1 | ▃▁▁▁▇ |
| Airco | 0 | 1 | 0.51 | 0.50 | 0 | 0.00 | 1.0 | 1.00 | 1 | ▇▁▁▁▇ |
| Automatic_airco | 0 | 1 | 0.06 | 0.23 | 0 | 0.00 | 0.0 | 0.00 | 1 | ▇▁▁▁▁ |
| Boardcomputer | 0 | 1 | 0.29 | 0.46 | 0 | 0.00 | 0.0 | 1.00 | 1 | ▇▁▁▁▃ |
| CD_Player | 0 | 1 | 0.22 | 0.41 | 0 | 0.00 | 0.0 | 0.00 | 1 | ▇▁▁▁▂ |
| Central_Lock | 0 | 1 | 0.58 | 0.49 | 0 | 0.00 | 1.0 | 1.00 | 1 | ▆▁▁▁▇ |
| Powered_Windows | 0 | 1 | 0.56 | 0.50 | 0 | 0.00 | 1.0 | 1.00 | 1 | ▆▁▁▁▇ |
| Power_Steering | 0 | 1 | 0.98 | 0.15 | 0 | 1.00 | 1.0 | 1.00 | 1 | ▁▁▁▁▇ |
| Radio | 0 | 1 | 0.15 | 0.35 | 0 | 0.00 | 0.0 | 0.00 | 1 | ▇▁▁▁▂ |
| Mistlamps | 0 | 1 | 0.26 | 0.44 | 0 | 0.00 | 0.0 | 1.00 | 1 | ▇▁▁▁▃ |
| Sport_Model | 0 | 1 | 0.30 | 0.46 | 0 | 0.00 | 0.0 | 1.00 | 1 | ▇▁▁▁▃ |
| Backseat_Divider | 0 | 1 | 0.77 | 0.42 | 0 | 1.00 | 1.0 | 1.00 | 1 | ▂▁▁▁▇ |
| Metallic_Rim | 0 | 1 | 0.20 | 0.40 | 0 | 0.00 | 0.0 | 0.00 | 1 | ▇▁▁▁▂ |
| Radio_cassette | 0 | 1 | 0.15 | 0.35 | 0 | 0.00 | 0.0 | 0.00 | 1 | ▇▁▁▁▂ |
| Tow_Bar | 0 | 1 | 0.28 | 0.45 | 0 | 0.00 | 0.0 | 1.00 | 1 | ▇▁▁▁▃ |
colnames(data)
## [1] "Id" "Model" "Price" "Age_08_04"
## [5] "Mfg_Month" "Mfg_Year" "KM" "Fuel_Type"
## [9] "HP" "Met_Color" "Automatic" "cc"
## [13] "Doors" "Cylinders" "Gears" "Quarterly_Tax"
## [17] "Weight" "Mfr_Guarantee" "BOVAG_Guarantee" "Guarantee_Period"
## [21] "ABS" "Airbag_1" "Airbag_2" "Airco"
## [25] "Automatic_airco" "Boardcomputer" "CD_Player" "Central_Lock"
## [29] "Powered_Windows" "Power_Steering" "Radio" "Mistlamps"
## [33] "Sport_Model" "Backseat_Divider" "Metallic_Rim" "Radio_cassette"
## [37] "Tow_Bar"
Afterwards, we apply the command of “str” to data in order to provide a structured summary of the data frame, and give out a quick overview of its internal structure. This line of code is helpful to quickly check data types, detect (if any) unexpected types, and overall ensure the data looks as expected before proceeding to further steps and analysis.
str(data)
## 'data.frame': 1436 obs. of 37 variables:
## $ Id : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Model : chr "TOYOTA Corolla 2.0 D4D HATCHB TERRA 2/3-Doors" "TOYOTA Corolla 2.0 D4D HATCHB TERRA 2/3-Doors" "TOYOTA Corolla 2.0 D4D HATCHB TERRA 2/3-Doors" "TOYOTA Corolla 2.0 D4D HATCHB TERRA 2/3-Doors" ...
## $ Price : int 13500 13750 13950 14950 13750 12950 16900 18600 21500 12950 ...
## $ Age_08_04 : int 23 23 24 26 30 32 27 30 27 23 ...
## $ Mfg_Month : int 10 10 9 7 3 1 6 3 6 10 ...
## $ Mfg_Year : int 2002 2002 2002 2002 2002 2002 2002 2002 2002 2002 ...
## $ KM : int 46986 72937 41711 48000 38500 61000 94612 75889 19700 71138 ...
## $ Fuel_Type : chr "Diesel" "Diesel" "Diesel" "Diesel" ...
## $ HP : int 90 90 90 90 90 90 90 90 192 69 ...
## $ Met_Color : int 1 1 1 0 0 0 1 1 0 0 ...
## $ Automatic : int 0 0 0 0 0 0 0 0 0 0 ...
## $ cc : int 2000 2000 2000 2000 2000 2000 2000 2000 1800 1900 ...
## $ Doors : int 3 3 3 3 3 3 3 3 3 3 ...
## $ Cylinders : int 4 4 4 4 4 4 4 4 4 4 ...
## $ Gears : int 5 5 5 5 5 5 5 5 5 5 ...
## $ Quarterly_Tax : int 210 210 210 210 210 210 210 210 100 185 ...
## $ Weight : int 1165 1165 1165 1165 1170 1170 1245 1245 1185 1105 ...
## $ Mfr_Guarantee : int 0 0 1 1 1 0 0 1 0 0 ...
## $ BOVAG_Guarantee : int 1 1 1 1 1 1 1 1 1 1 ...
## $ Guarantee_Period: int 3 3 3 3 3 3 3 3 3 3 ...
## $ ABS : int 1 1 1 1 1 1 1 1 1 1 ...
## $ Airbag_1 : int 1 1 1 1 1 1 1 1 1 1 ...
## $ Airbag_2 : int 1 1 1 1 1 1 1 1 0 1 ...
## $ Airco : int 0 1 0 0 1 1 1 1 1 1 ...
## $ Automatic_airco : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Boardcomputer : int 1 1 1 1 1 1 1 1 0 1 ...
## $ CD_Player : int 0 1 0 0 0 0 0 1 0 0 ...
## $ Central_Lock : int 1 1 0 0 1 1 1 1 1 0 ...
## $ Powered_Windows : int 1 0 0 0 1 1 1 1 1 0 ...
## $ Power_Steering : int 1 1 1 1 1 1 1 1 1 1 ...
## $ Radio : int 0 0 0 0 0 0 0 0 1 0 ...
## $ Mistlamps : int 0 0 0 0 1 1 0 0 0 0 ...
## $ Sport_Model : int 0 0 0 0 0 0 1 0 0 0 ...
## $ Backseat_Divider: int 1 1 1 1 1 1 1 1 0 1 ...
## $ Metallic_Rim : int 0 0 0 0 0 0 0 0 1 0 ...
## $ Radio_cassette : int 0 0 0 0 0 0 0 0 1 0 ...
## $ Tow_Bar : int 0 0 0 0 0 0 0 0 0 0 ...
As part of our fourth step, we know that histograms are essential for understanding the distribution of continuous variables in a dataset. Therefore, we proceeded to start visualizing our data through means of histograms. We decided to use histograms within our cleaning process, since it can allow us to identify factors such as skewness, multimodalities, or irregularities in the data distribution. In return, this results helpful for our data cleaning process. From the results obtained, we are able to state the following:
This histogram depicts a right-skewed distribution of car prices, showing that most cars are priced towards the lower end, but there are a few very expensive cars. This is important to take in consideration as it can suggest outliers that may need attention during the data cleaning
The histogram indicates the age of the cars in months corresponding to August of 2004. The distribution depicts a relatively uniform spread, with no significant concetration in any specific age range. There is a slight peak around 50.70 month range, which might suggest that most cars are in this middle age category. This is helpful for us to understand the age distribution and can help to identify whether certain age groups need special attention for data cleaning.
This histogramshows a left skewed distribution, with most vehicles being relatively recent, but a few older vehicles that might need to be checked for anomalies
The distribution of mileage also shows to be right-skewed, indicating that there are many carshaving lower mileage, but few with exceptionally high mileage. High mileage cars could be candidates for further scrunity or removal, as these also represent outliers.
The horsepower distribution appears to be more uniform but also slightly right skewed. A uniform distribution in this context might indicate no strong concentration og vehicles around specific horsepower values.
We can obsserve that the majority of the vehicles have engine sizes around 1400 to 1600 cc. There are a few cars with significantly larger engine sizes, showing thhe presence of outliers.
In this histograms most cars have lower quarterly taxes and a few have much higher taxes. This indicates that the histogram is right-skewed. This skewness can suggest potential outliers at the high end, which may need attention during data cleaning.
The distribution of weight is likely centered around a specific range, with most cars having moderate weights and fewer vehicles at the extremes. As seen in other variables, cars with significantly higher weights can be outliers that can be and should be taken in considerarion for further data cleaning. Understanding the distribution of car weights can help ensure consistency within the dataset and can also identify any data points that don’t fit the overall pattern.
ggplot(data = data, aes(x = Price)) +
geom_histogram(bins = 50, fill = "blue", color = "black") +
labs(title = "Price Distribution of Toyota Corolla")
ggplot(data = data, aes(x = Age_08_04)) +
geom_histogram(bins = 50, fill = "green", color = "black") +
labs(title = "Age of Seller Distribution of Toyota Corolla")
ggplot(data = data, aes(x = Mfg_Year)) +
geom_histogram(bins = 50, fill = "red", color = "black") +
labs(title = "Year of Model Distribution of Toyota Corolla")
ggplot(data = data, aes(x = KM)) +
geom_histogram(bins = 50, fill = "purple", color = "black") +
labs(title = "KM Distribution of Toyota Corolla")
ggplot(data = data, aes(x = HP)) +
geom_histogram(bins = 50, fill = "pink", color = "black") +
labs(title = "Horsepower Distribution of Toyota Corolla")
ggplot(data = data, aes(x = cc)) +
geom_histogram(bins = 50, fill = "grey", color = "black") +
labs(title = "Cubic Capacity Distribution of Toyota Corolla")
ggplot(data = data, aes(x = Quarterly_Tax)) +
geom_histogram(bins = 50, fill = "magenta", color = "black") +
labs(title = "Quaterly Tax Distribution of Toyota Corolla")
ggplot(data = data, aes(x = Weight)) +
geom_histogram(bins = 50, fill = "orange", color = "black") +
labs(title = "Weight Distribution of Toyota Corolla")
Next, we proceeded to create a correaltion matrix showing the relaionships between numerical variables in the dataset. This way, we can more easily identify which variables are strongly or weakly related. This step is useful during data cleaning because highly correlated variables may contain redundant information that can guide future selection or transformation.
From the correlation matrix, the strongest relationships observed are:
Price and HP: Higher horsepower vehicles tend to incline towards more expensive prices
Price and KM: There is likely a negative correlation here, since higher mileage cars tend to have lower prices
Price and Mfg_Year: Newer cars correlate to having a higher price
num_vars = data %>% select("Price","Mfg_Year","KM","HP", "cc","Quarterly_Tax","Weight")
cor_matrix <- cor(num_vars, use = "complete.obs")
corrplot(cor_matrix, type = "full")
Afterwards we created to identify outliers, and therefore remove these from “data” to clean our datatset and prevent them from further affect any future analyses and achieve appropriate prediction models, and overall improve data quality.
ggplot(data = data, mapping = aes(x = "", y = data[, 3])) +
geom_boxplot() +
xlab(names(data)[3]) + ylab(names(data)[3])
ggplot(data = data, mapping = aes(x = "", y = data[, 7])) +
geom_boxplot() +
xlab(names(data)[7]) + ylab(names(data)[7])
ggplot(data = data, mapping = aes(x = "", y = data[, 16])) +
geom_boxplot() +
xlab(names(data)[16]) + ylab(names(data)[16])
When analyzing our dataset, we must remove outliers for several reasons. First, outliers can negatively affect the accuracy of models like linear regression by distorting parameters, leading to poor model fit and reduced predictive capabilities. They can also distort the data structure, resulting in misleading coefficients and inaccurate insights. Additionally, linear regression relies on key assumptions like normality and linearity, and outliers often violate these assumptions, producing unreliable estimates. During exploratory data analysis (EDA), we identified several variables that required revision: KM, Price, Weight, and Gears. To remove outliers, we used the boxplot.stats function with the $out parameter, which helped us identify outliers for each variable. We then created a vector with the row numbers of these outliers and deleted those rows for Price, KM, and Weight. Since Gears had a limited set of possible values (as identified by the unique function), we used a different approach for this variable. We manually identified and deleted the rows containing outlier values. Finally, we checked for missing data to ensure the model could be executed correctly.
out_Price = boxplot.stats(data$Price)$out
outlier_Price = which(data$Price %in% out_Price)
data <- data[-outlier_Price, ]
out_KM = boxplot.stats(data$KM)$out
outlier_KM = which(data$KM %in% out_KM)
data <- data[-outlier_KM, ]
out_Weight = boxplot.stats(data$Weight)$out
outlier_Weight = which(data$Weight %in% out_Weight)
data <- data[-outlier_Weight, ]
row_index_Gears1 <- which(data$Gears == 3)
row_index_Gears1
## [1] 789 823
row_index_Gears2 <- which(data$Gears == 4)
row_index_Gears2
## [1] 1229
data = data[-c(957, 992, 1427), ]
rows_with_na <- which(!complete.cases(data))
data[rows_with_na, ]
## [1] Id Model Price Age_08_04
## [5] Mfg_Month Mfg_Year KM Fuel_Type
## [9] HP Met_Color Automatic cc
## [13] Doors Cylinders Gears Quarterly_Tax
## [17] Weight Mfr_Guarantee BOVAG_Guarantee Guarantee_Period
## [21] ABS Airbag_1 Airbag_2 Airco
## [25] Automatic_airco Boardcomputer CD_Player Central_Lock
## [29] Powered_Windows Power_Steering Radio Mistlamps
## [33] Sport_Model Backseat_Divider Metallic_Rim Radio_cassette
## [37] Tow_Bar
## <0 rows> (or 0-length row.names)
# No missing data in the df :)
The md.pattern() function is used to analyze the pattern of missing data in both, numerical and non-numerical variables (binary or categorical). The code generates a visual pattern for missing data for the selected numerical varibales found within the dataset. The output shows which variables have missing values and the combinations of missing data across records. This is an important step, as it helps to understand if any variables have significant missing values. As well, a new subset is created by excluding the numerical values and keeping only the categorical or binary variables. In this case, we found that no data is missing, which was a positive finding to continue the preparation of our dataset.
md.pattern(num_vars)
## /\ /\
## { `---' }
## { O O }
## ==> V <== No need for mice. This data set is completely observed.
## \ \|/ /
## `-----'
## Price Mfg_Year KM HP cc Quarterly_Tax Weight
## 1436 1 1 1 1 1 1 1 0
## 0 0 0 0 0 0 0 0
#Binary and categorical data
other_vars <- data %>%
select(-c("Price","Mfg_Year","KM","HP", "cc","Quarterly_Tax","Weight"))
md.pattern(other_vars)
## /\ /\
## { `---' }
## { O O }
## ==> V <== No need for mice. This data set is completely observed.
## \ \|/ /
## `-----'
## Id Model Age_08_04 Mfg_Month Fuel_Type Met_Color Automatic Doors Cylinders
## 1236 1 1 1 1 1 1 1 1 1
## 0 0 0 0 0 0 0 0 0
## Gears Mfr_Guarantee BOVAG_Guarantee Guarantee_Period ABS Airbag_1 Airbag_2
## 1236 1 1 1 1 1 1 1
## 0 0 0 0 0 0 0
## Airco Automatic_airco Boardcomputer CD_Player Central_Lock Powered_Windows
## 1236 1 1 1 1 1 1
## 0 0 0 0 0 0
## Power_Steering Radio Mistlamps Sport_Model Backseat_Divider Metallic_Rim
## 1236 1 1 1 1 1 1
## 0 0 0 0 0 0
## Radio_cassette Tow_Bar
## 1236 1 1 0
## 0 0 0
Step 9 consisted on making scatter plots for 4 variables at a time, we used the ggpairs() function to plot them and compare them, however as we’re comparing price, it was the only variable that repeated in both occasions.
For the first scatter plot, we used the variables price, manufacturing year, kilometers driven (or mileages) and horsepower. The main results obtained in the first scatter plot show the strong positive correlation between price and manufacturing year of about 0.885, meaning newer cars are more expensive. In contrast, price and km have a strong negative correlation -0.570, meaning higher mileage leads to lower prices
On the other hand, the second scatter plot included variables such as price, cc, quarterly tax and weight. The main results obtained in this second case showed that the strongest positive correlation is between weight and quarterly tax, its coefficient is of 0.626 and it indicates that as the weight increases, the quarterly tax rises. For price and weight there was also a strong positive relationship of 0.581, meaning that higher weights usually come to higher prices, nonetheless the correlation for price and cc is on the weaker side with a correlation of 0.126
ggpairs(num_vars, columns = c("Price", "Mfg_Year", "KM", "HP"),
title = "Scatter Plot Matrix of Price vs Mfg_Year, KM, HP")
# Scatter plot matrix for the next group of 4 variables (replace with appropriate names)
ggpairs(num_vars, columns = c("Price", "cc", "Quarterly_Tax", "Weight"),
title = "Scatter Plot Matrix of Price vs cc, Quarterly_Tax, Weight")
In step 10, we focused on creating QQ-plots using the function qqnorm() that would help us visualize whether our data follows a normal distribution and we also added a reference line with the function qqline(). In this step we used 5 different variables, price, manufacturing year, kilometers driven, horsepower and weight. Each result will now be explained:
For the QQ-Plot for Price, the graph shows that majority of the data fall along the reference line in the middle part, indicating a good fit in this area, but there are slight deviations at both tails, the lower left side deviate below the line, suggesting that actual prices are lower than expected. On the other hand, the upper right side has points deviating above the line, which indicates that the actual prices are higher than expected.
For the QQ-Plot for Mfg_Year, the result indicates that there are some sort of steps throughout the reference line, typically this means that the data is limited and/or discrete. In this case, each “step” corresponds to a specific year where a significant number of cars were manufactured. This also indicates the the “steeper” portions represent the transitions between the years. In this plot its observable the non-continuity nature of the manufacturing years.
For the QQ-Plot for KM, mileage follows a normal distributions with some deviations. The central part of the plot seems to be a good fit to the reference line, however there’s a clear deviation at the upper right side of the graph, indicating that the actual mileage for some cars is higher than expected.
For the QQ-Plot for HP, similarly to the manufacturing years, horsepower doesn’t follow a normal distribution, but instead a kind of steps, indicating the concentrated values at certain levels. The flat sections in each step represent the high number of cars with the same HP and the steeper sections show the jumps between these values. The plot also has minor deviations specifically on the higher end, which could mean there are few cars with higher HP.
For the QQ-Plot for Weight, this plot has minor deviations, a step-like pattern at some points and normal distribution all at once, first from the lower left side end, the point deviate below the line, meaning some car’s weight is lighter than expected, then there’s a subtle stepped pattern, this suggests that car weights tend to cluster at certain values, mainly because of common classes. Lastly, on the upper right end, the points deviate above the line, maybe suggesting there are a few heavier cars than expected
# QQ-Plot for Price
qqnorm(data$Price, main = "QQ Plot of Price")
qqline(data$Price, col = "blue", lwd = 2)
#QQ-Plot for Mfg_Year
qqnorm(data$Mfg_Year, main = "QQ Plot of Mfg_Year")
qqline(data$Mfg_Year, col = "red", lwd = 2)
#QQ-Plot for KM
qqnorm(data$KM, main = "QQ Plot of KM")
qqline(data$KM, col = "purple", lwd = 2)
#QQ-Plot for HP
qqnorm(data$HP, main = "QQ Plot of HP")
qqline(data$HP, col = "pink", lwd = 2)
#QQ-Plot for Weight
qqnorm(data$Weight, main = "QQ Plot of Weight")
qqline(data$Weight, col = "orange", lwd = 2)
In this step, the goal is to identify unique categories within some variables. For practicity purposes, the function “head” is applied to the functions of “unique” to identify the number of distinct categories or classes found in categorical attributes, which results useful when analyzing categorical data for models.
head(unique(data$Model))
## [1] "TOYOTA Corolla 1.9 D HATCHB TERRA 2/3-Doors"
## [2] "TOYOTA Corolla 1.6 16v L.SOL 2/3-Doors"
## [3] "TOYOTA Corolla 1.6 16V VVT I 3DR TERRA 2/3-Doors"
## [4] "TOYOTA Corolla 1.6 16V VVT I 3DR SOL 2/3-Doors"
## [5] "TOYOTA Corolla 1.4 VVTI Linea Terra 2/3-Doors"
## [6] "TOYOTA Corolla 1.4 16V VVT I 3DR TERRA COMFORT 2/3-Doors"
head(unique(data$Model))
## [1] "TOYOTA Corolla 1.9 D HATCHB TERRA 2/3-Doors"
## [2] "TOYOTA Corolla 1.6 16v L.SOL 2/3-Doors"
## [3] "TOYOTA Corolla 1.6 16V VVT I 3DR TERRA 2/3-Doors"
## [4] "TOYOTA Corolla 1.6 16V VVT I 3DR SOL 2/3-Doors"
## [5] "TOYOTA Corolla 1.4 VVTI Linea Terra 2/3-Doors"
## [6] "TOYOTA Corolla 1.4 16V VVT I 3DR TERRA COMFORT 2/3-Doors"
head(unique(data$Price))
## [1] 12950 16750 16950 15950 16250 15750
head(unique(data$Mfg_Year))
## [1] 2002 2003 2001 2000 1999 1998
head(unique(data$HP))
## [1] 69 110 97 116 86 72
head(unique(data$Weight))
## [1] 1105 1065 1120 1100 1110 1075
This step produces a summary of the dataset, demonstrating a comprehensive overview of the dataset to help spot any potential issues in both categorical and numerical values. It helps to overall understand the data distribution and spot, if any, anomalies.
summary(data)
## Id Model Price Age_08_04
## Min. : 10.0 Length:1236 Min. : 4450 Min. :13.00
## 1st Qu.: 459.8 Class :character 1st Qu.: 8450 1st Qu.:50.00
## Median : 789.5 Mode :character Median : 9750 Median :62.00
## Mean : 784.3 Mean : 9988 Mean :59.34
## 3rd Qu.:1130.2 3rd Qu.:10995 3rd Qu.:71.00
## Max. :1442.0 Max. :17200 Max. :80.00
## Mfg_Month Mfg_Year KM Fuel_Type
## Min. : 1.000 Min. :1998 Min. : 1 Length:1236
## 1st Qu.: 3.000 1st Qu.:1998 1st Qu.: 45975 Class :character
## Median : 5.000 Median :1999 Median : 64371 Mode :character
## Mean : 5.584 Mean :1999 Mean : 67554
## 3rd Qu.: 8.000 3rd Qu.:2000 3rd Qu.: 85000
## Max. :12.000 Max. :2003 Max. :151000
## HP Met_Color Automatic cc
## Min. : 69.0 Min. :0.0000 Min. :0.00000 Min. :1300
## 1st Qu.: 97.0 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:1400
## Median :110.0 Median :1.0000 Median :0.00000 Median :1600
## Mean :101.8 Mean :0.6715 Mean :0.05583 Mean :1536
## 3rd Qu.:110.0 3rd Qu.:1.0000 3rd Qu.:0.00000 3rd Qu.:1600
## Max. :116.0 Max. :1.0000 Max. :1.00000 Max. :2000
## Doors Cylinders Gears Quarterly_Tax Weight
## Min. :2.000 Min. :4 Min. :3.000 Min. : 19.00 Min. :1000
## 1st Qu.:3.000 1st Qu.:4 1st Qu.:5.000 1st Qu.: 69.00 1st Qu.:1035
## Median :4.000 Median :4 Median :5.000 Median : 69.00 Median :1060
## Mean :3.999 Mean :4 Mean :5.025 Mean : 79.47 Mean :1060
## 3rd Qu.:5.000 3rd Qu.:4 3rd Qu.:5.000 3rd Qu.: 85.00 3rd Qu.:1075
## Max. :5.000 Max. :4 Max. :6.000 Max. :197.00 Max. :1135
## Mfr_Guarantee BOVAG_Guarantee Guarantee_Period ABS
## Min. :0.0000 Min. :0.0000 Min. : 3.000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:1.0000 1st Qu.: 3.000 1st Qu.:1.0000
## Median :0.0000 Median :1.0000 Median : 3.000 Median :1.0000
## Mean :0.4215 Mean :0.9013 Mean : 3.699 Mean :0.7994
## 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.: 3.000 3rd Qu.:1.0000
## Max. :1.0000 Max. :1.0000 Max. :24.000 Max. :1.0000
## Airbag_1 Airbag_2 Airco Automatic_airco
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.00000
## 1st Qu.:1.0000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.00000
## Median :1.0000 Median :1.0000 Median :0.0000 Median :0.00000
## Mean :0.9693 Mean :0.7087 Mean :0.4644 Mean :0.01375
## 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:0.00000
## Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.00000
## Boardcomputer CD_Player Central_Lock Powered_Windows
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000
## Median :0.0000 Median :0.0000 Median :1.0000 Median :1.0000
## Mean :0.2451 Mean :0.1845 Mean :0.5469 Mean :0.5332
## 3rd Qu.:0.0000 3rd Qu.:0.0000 3rd Qu.:1.0000 3rd Qu.:1.0000
## Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000
## Power_Steering Radio Mistlamps Sport_Model
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:1.0000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000
## Median :1.0000 Median :0.0000 Median :0.0000 Median :0.0000
## Mean :0.9757 Mean :0.1481 Mean :0.2387 Mean :0.2727
## 3rd Qu.:1.0000 3rd Qu.:0.0000 3rd Qu.:0.0000 3rd Qu.:1.0000
## Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000
## Backseat_Divider Metallic_Rim Radio_cassette Tow_Bar
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:1.0000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000
## Median :1.0000 Median :0.0000 Median :0.0000 Median :0.0000
## Mean :0.7702 Mean :0.1934 Mean :0.1472 Mean :0.2969
## 3rd Qu.:1.0000 3rd Qu.:0.0000 3rd Qu.:0.0000 3rd Qu.:1.0000
## Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000
After analyzing the situation as a team, we determined that a good approach to identify the most relevant variables affecting vehicle prices is to analyze different car models individually. Therefore, we decided to create separate dataframes for the most popular types of cars to identify the factors influencing the price ranges of each vehicle. We used the table function to determine how many times each model appeared in the Model column of the dataset. Then, we identified the top three most repeated models and created dataframes that included only these models. This allowed us to analyze each car model individually, providing better insight into the factors that affect vehicle prices.
model_counts <- table(data$Model)
top_model <- names(which.max(model_counts))
top_model_data <- data[data$Model == top_model, ]
head(top_model_data)
## Id Model Price Age_08_04
## 247 248 TOYOTA Corolla 1.6 16V HATCHB LINEA TERRA 2/3-Doors 11950 42
## 293 294 TOYOTA Corolla 1.6 16V HATCHB LINEA TERRA 2/3-Doors 10500 35
## 479 481 TOYOTA Corolla 1.6 16V HATCHB LINEA TERRA 2/3-Doors 10500 54
## 484 486 TOYOTA Corolla 1.6 16V HATCHB LINEA TERRA 2/3-Doors 9900 56
## 526 528 TOYOTA Corolla 1.6 16V HATCHB LINEA TERRA 2/3-Doors 10250 52
## 579 582 TOYOTA Corolla 1.6 16V HATCHB LINEA TERRA 2/3-Doors 11500 55
## Mfg_Month Mfg_Year KM Fuel_Type HP Met_Color Automatic cc Doors
## 247 3 2001 55968 Petrol 110 0 0 1600 3
## 293 10 2001 43000 Petrol 110 0 0 1600 3
## 479 3 2000 63135 Petrol 110 1 0 1600 3
## 484 1 2000 62751 Petrol 110 0 0 1600 3
## 526 5 2000 49432 Petrol 110 1 0 1600 3
## 579 2 2000 33230 Petrol 110 0 0 1600 3
## Cylinders Gears Quarterly_Tax Weight Mfr_Guarantee BOVAG_Guarantee
## 247 4 5 69 1050 1 1
## 293 4 5 69 1050 0 1
## 479 4 5 69 1050 1 1
## 484 4 5 69 1050 0 1
## 526 4 5 69 1050 1 1
## 579 4 5 69 1050 0 1
## Guarantee_Period ABS Airbag_1 Airbag_2 Airco Automatic_airco Boardcomputer
## 247 3 1 1 1 1 0 0
## 293 3 1 1 1 0 0 0
## 479 3 1 1 1 1 0 0
## 484 3 1 1 1 1 0 0
## 526 3 1 1 1 1 0 0
## 579 3 1 1 1 1 0 0
## CD_Player Central_Lock Powered_Windows Power_Steering Radio Mistlamps
## 247 0 0 0 1 0 0
## 293 1 1 0 1 0 0
## 479 0 1 1 1 0 0
## 484 0 0 0 1 0 0
## 526 0 1 1 1 1 0
## 579 0 1 1 1 0 0
## Sport_Model Backseat_Divider Metallic_Rim Radio_cassette Tow_Bar
## 247 1 1 0 0 0
## 293 1 1 1 0 0
## 479 0 1 0 0 1
## 484 1 1 0 0 1
## 526 0 1 0 1 1
## 579 1 1 0 0 0
second_model <- names(sort(model_counts, decreasing = TRUE))[2]
second_model_data <- data[data$Model == second_model, ]
head(second_model_data)
## Id Model Price Age_08_04
## 420 422 TOYOTA Corolla 1.3 16V HATCHB LINEA TERRA 2/3-Doors 8950 55
## 647 650 TOYOTA Corolla 1.3 16V HATCHB LINEA TERRA 2/3-Doors 7250 65
## 694 697 TOYOTA Corolla 1.3 16V HATCHB LINEA TERRA 2/3-Doors 9900 65
## 705 708 TOYOTA Corolla 1.3 16V HATCHB LINEA TERRA 2/3-Doors 8450 65
## 713 716 TOYOTA Corolla 1.3 16V HATCHB LINEA TERRA 2/3-Doors 8750 65
## 724 727 TOYOTA Corolla 1.3 16V HATCHB LINEA TERRA 2/3-Doors 7950 61
## Mfg_Month Mfg_Year KM Fuel_Type HP Met_Color Automatic cc Doors
## 420 2 2000 94122 Petrol 86 1 0 1300 3
## 647 4 1999 117609 Petrol 86 0 0 1300 3
## 694 4 1999 97173 Petrol 86 0 0 1300 3
## 705 4 1999 93408 Petrol 86 0 0 1300 3
## 713 4 1999 91246 Petrol 86 1 0 1300 3
## 724 8 1999 88127 Petrol 86 0 0 1300 3
## Cylinders Gears Quarterly_Tax Weight Mfr_Guarantee BOVAG_Guarantee
## 420 4 5 69 1015 0 1
## 647 4 5 69 1015 0 1
## 694 4 5 69 1015 1 1
## 705 4 5 69 1015 0 1
## 713 4 5 69 1015 0 1
## 724 4 5 69 1015 1 1
## Guarantee_Period ABS Airbag_1 Airbag_2 Airco Automatic_airco Boardcomputer
## 420 3 1 1 1 0 0 0
## 647 3 1 1 1 0 0 0
## 694 3 1 1 1 0 0 0
## 705 3 1 1 1 0 0 0
## 713 3 1 1 1 0 0 0
## 724 3 1 1 1 0 0 0
## CD_Player Central_Lock Powered_Windows Power_Steering Radio Mistlamps
## 420 0 1 1 1 0 0
## 647 0 0 0 1 0 0
## 694 0 0 0 1 0 0
## 705 0 0 0 1 0 0
## 713 0 0 0 1 0 0
## 724 0 0 0 1 1 0
## Sport_Model Backseat_Divider Metallic_Rim Radio_cassette Tow_Bar
## 420 1 1 0 0 1
## 647 1 1 0 0 0
## 694 1 1 0 0 0
## 705 1 1 0 0 0
## 713 1 1 0 0 0
## 724 1 1 0 1 0
third_model <- names(sort(model_counts, decreasing = TRUE))[3]
third_model_data <- data[data$Model == third_model, ]
head(third_model_data)
## Id Model Price Age_08_04
## 406 408 TOYOTA Corolla 1.6 16V LIFTB LINEA LUNA 4/5-Doors 9950 54
## 497 499 TOYOTA Corolla 1.6 16V LIFTB LINEA LUNA 4/5-Doors 11750 54
## 625 628 TOYOTA Corolla 1.6 16V LIFTB LINEA LUNA 4/5-Doors 7950 67
## 654 657 TOYOTA Corolla 1.6 16V LIFTB LINEA LUNA 4/5-Doors 9950 64
## 665 668 TOYOTA Corolla 1.6 16V LIFTB LINEA LUNA 4/5-Doors 9950 62
## 667 670 TOYOTA Corolla 1.6 16V LIFTB LINEA LUNA 4/5-Doors 9500 68
## Mfg_Month Mfg_Year KM Fuel_Type HP Met_Color Automatic cc Doors
## 406 3 2000 103454 Petrol 110 1 0 1600 5
## 497 3 2000 58530 Petrol 110 0 0 1600 5
## 625 2 1999 135337 CNG 110 1 0 1600 5
## 654 5 1999 114846 Petrol 110 1 0 1600 5
## 665 7 1999 109547 Petrol 110 0 0 1600 5
## 667 1 1999 109150 Petrol 110 0 0 1600 5
## Cylinders Gears Quarterly_Tax Weight Mfr_Guarantee BOVAG_Guarantee
## 406 4 5 85 1075 1 1
## 497 4 5 85 1075 0 1
## 625 4 5 197 1120 0 1
## 654 4 5 85 1075 1 1
## 665 4 5 85 1075 1 1
## 667 4 5 85 1075 0 1
## Guarantee_Period ABS Airbag_1 Airbag_2 Airco Automatic_airco Boardcomputer
## 406 3 1 1 1 1 0 0
## 497 3 1 1 1 1 0 0
## 625 3 1 1 1 1 0 0
## 654 3 1 1 1 1 0 0
## 665 3 1 1 1 1 0 0
## 667 3 1 1 1 1 0 0
## CD_Player Central_Lock Powered_Windows Power_Steering Radio Mistlamps
## 406 0 1 1 1 0 1
## 497 0 1 1 1 0 1
## 625 1 1 1 1 0 1
## 654 0 1 1 1 0 1
## 665 1 1 1 1 0 1
## 667 0 1 1 1 0 1
## Sport_Model Backseat_Divider Metallic_Rim Radio_cassette Tow_Bar
## 406 0 1 1 0 0
## 497 0 1 1 0 0
## 625 0 1 1 0 0
## 654 0 1 1 0 0
## 665 0 1 1 0 0
## 667 0 1 0 0 0
Onto the exploration of top model, first we used the function ggpairs() as a visualization tool, creating a scatter plot matrix, that relates certain car characteristics such as price, manufacturing year, mileage, horsepower and weight. The result of this matrix showed the distribution of each variable along the diagonal, with the variables relationship on the lower triangle and their corresponding correlation coefficients in the upper triangle. Based on this we found out that newer cars tend to be pricier and cars with very high mileage have lower prices. It’s also observable that there’s a weak almost negligible positive relationship between price and weight, and not a real correlation between horsepower and mileage.
The next thing we did was to use a correlation matrix plot, first we selected the chosen columns, in this case “Price”, “Mfg_Year”, “KM”,“HP”, “cc”,“Quarterly_Tax” and “Weight”. By using the function cor() to calculate the correlation matrix and then plotted the results. The resulting plot showed different-colored circles to represent the correlation coefficients between each variable pair. This tool helped us identify patters and relationship between variables, so we quickly found a strong and positive relationship between the price and the manufacturing year which are represented in the dark blue circles, as well as a strong negative correlation between the quaterly tax and weight.
Finally for this step, we created a QQ-plot using qqnorm() to compare the quantiles of the price variable to the quantiles of a standard distribution, we also added the function qqline() to be able to add a reference line and analyze the closeness of the data points with the expected pattern. Based on our results, the QQ-plot of price deviates a bit when the prices get higher, meaning there are less cars with very high prices in contrast with the rest of the line indicating there are more cars with lower prices. All of this is supported in our summary statistics that shows there’s a skewed distribution.
ggpairs(top_model_data, columns = c("Price", "Mfg_Year", "KM", "HP","Weight"),
title = "Scatter Plot Matrix of Price vs Mfg_Year, KM, HP, Weight")
## Warning in cor(x, y): the standard deviation is zero
## Warning in cor(x, y): the standard deviation is zero
## Warning in cor(x, y): the standard deviation is zero
## Warning in cor(x, y): the standard deviation is zero
top_model_data_num = top_model_data %>% select("Price","Mfg_Year","KM","HP", "cc","Quarterly_Tax","Weight")
cor_matrix <- cor(top_model_data_num, use = "complete.obs")
## Warning in cor(top_model_data_num, use = "complete.obs"): the standard
## deviation is zero
corrplot(cor_matrix, type = "full")
qqnorm(top_model_data$Price, main = "QQ Plot of Price of Top Model")
qqline(top_model_data$Price, col = "blue", lwd = 2)
head(summary(top_model_data))
## Id Model Price Age_08_04
## Min. : 248 Length:107 Min. : 5950 Min. :35.00
## 1st Qu.: 852 Class :character 1st Qu.: 7750 1st Qu.:62.50
## Median :1017 Mode :character Median : 8750 Median :68.00
## Mean :1027 Mean : 8624 Mean :67.62
## 3rd Qu.:1264 3rd Qu.: 9250 3rd Qu.:74.00
## Max. :1429 Max. :11950 Max. :80.00
## Mfg_Month Mfg_Year KM Fuel_Type
## Min. : 1.000 Min. :1998 Min. : 22835 Length:107
## 1st Qu.: 4.000 1st Qu.:1998 1st Qu.: 54328 Class :character
## Median : 6.000 Median :1999 Median : 65000 Mode :character
## Mean : 5.981 Mean :1999 Mean : 70753
## 3rd Qu.: 8.000 3rd Qu.:1999 3rd Qu.: 87919
## Max. :12.000 Max. :2001 Max. :137050
## HP Met_Color Automatic cc Doors
## Min. :110 Min. :0.0000 Min. :0 Min. :1600 Min. :3
## 1st Qu.:110 1st Qu.:0.0000 1st Qu.:0 1st Qu.:1600 1st Qu.:3
## Median :110 Median :1.0000 Median :0 Median :1600 Median :3
## Mean :110 Mean :0.6449 Mean :0 Mean :1600 Mean :3
## 3rd Qu.:110 3rd Qu.:1.0000 3rd Qu.:0 3rd Qu.:1600 3rd Qu.:3
## Max. :110 Max. :1.0000 Max. :0 Max. :1600 Max. :3
## Cylinders Gears Quarterly_Tax Weight Mfr_Guarantee
## Min. :4 Min. :5 Min. : 69.0 Min. :1050 Min. :0.0000
## 1st Qu.:4 1st Qu.:5 1st Qu.: 69.0 1st Qu.:1050 1st Qu.:0.0000
## Median :4 Median :5 Median : 69.0 Median :1050 Median :0.0000
## Mean :4 Mean :5 Mean : 70.2 Mean :1050 Mean :0.4486
## 3rd Qu.:4 3rd Qu.:5 3rd Qu.: 69.0 3rd Qu.:1050 3rd Qu.:1.0000
## Max. :4 Max. :5 Max. :197.0 Max. :1078 Max. :1.0000
## BOVAG_Guarantee Guarantee_Period ABS Airbag_1
## Min. :0.0000 Min. :3 Min. :0.0000 Min. :1
## 1st Qu.:1.0000 1st Qu.:3 1st Qu.:1.0000 1st Qu.:1
## Median :1.0000 Median :3 Median :1.0000 Median :1
## Mean :0.9907 Mean :3 Mean :0.8598 Mean :1
## 3rd Qu.:1.0000 3rd Qu.:3 3rd Qu.:1.0000 3rd Qu.:1
## Max. :1.0000 Max. :3 Max. :1.0000 Max. :1
## Airbag_2 Airco Automatic_airco Boardcomputer
## Min. :0.0000 Min. :0.0000 Min. :0 Min. :0
## 1st Qu.:1.0000 1st Qu.:0.0000 1st Qu.:0 1st Qu.:0
## Median :1.0000 Median :0.0000 Median :0 Median :0
## Mean :0.8037 Mean :0.3364 Mean :0 Mean :0
## 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:0 3rd Qu.:0
## Max. :1.0000 Max. :1.0000 Max. :0 Max. :0
## CD_Player Central_Lock Powered_Windows Power_Steering
## Min. :0.00000 Min. :0.0000 Min. :0.0000 Min. :1
## 1st Qu.:0.00000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:1
## Median :0.00000 Median :0.0000 Median :0.0000 Median :1
## Mean :0.07477 Mean :0.3925 Mean :0.3458 Mean :1
## 3rd Qu.:0.00000 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:1
## Max. :1.00000 Max. :1.0000 Max. :1.0000 Max. :1
## Radio Mistlamps Sport_Model Backseat_Divider
## Min. :0.0000 Min. :0.00000 Min. :0.0000 Min. :1
## 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:0.0000 1st Qu.:1
## Median :0.0000 Median :0.00000 Median :1.0000 Median :1
## Mean :0.1402 Mean :0.02804 Mean :0.6729 Mean :1
## 3rd Qu.:0.0000 3rd Qu.:0.00000 3rd Qu.:1.0000 3rd Qu.:1
## Max. :1.0000 Max. :1.00000 Max. :1.0000 Max. :1
## Metallic_Rim Radio_cassette Tow_Bar
## Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000
## Median :0.0000 Median :0.0000 Median :0.0000
## Mean :0.1589 Mean :0.1402 Mean :0.3178
## 3rd Qu.:0.0000 3rd Qu.:0.0000 3rd Qu.:1.0000
## Max. :1.0000 Max. :1.0000 Max. :1.0000
Then, we did an exploration of a second car model using the same conditions as for the top model, including the same variables and functions such as ggpairs() for a scatter plot matrix, cor() for the correlation matrix and for plotting the result, and qqnorm() to compare the QQ-plots and the qqline() to add a reference line to the plot.
For the scatter plot matrix, it was shown that price is heavily influenced by manufacturing year and mileage in case of used cars. For example, newer cars with lower mileage tend to have higher prices, this is noticeable in the strong positive correlation, having a coefficient of about 0.529, on the other hand for used cars, there’s a moderate negative correlation of around (-0.501). There was also a weak correlation between manufacturing year and mileage and a negligible relationship between horsepower and weight. This result is very similar to the top model.
Next, the correlation matrix showed that there’s a pattern and a high positive correlation between weight and engine displacement indicated by the dark blue circles (HP has also a strong positive correlation), however there was also a negative correlation between price and the kilometers driven, indicating that as the kilometers driven (mileage) increases, price usually decreases.
Lastly, this second model also displays a QQ-plot to compare the distribution of the observed price data to a theoretical normal distribution (obtained by the qqline() function). So in the results we can see that most of the data are along the blue line, specially in the middle of the point, nonetheless this suggests that in both ends the points fall either above or below the line, indicating that the tails of the data are heavier or data is skewed. In other words, there are more extreme prices (high or low) in the results.
ggpairs(second_model_data, columns = c("Price", "Mfg_Year", "KM", "HP","Weight"),
title = "Scatter Plot Matrix of Price vs Mfg_Year, KM, HP, Weight of the 2nd car model")
## Warning in cor(x, y): the standard deviation is zero
## Warning in cor(x, y): the standard deviation is zero
## Warning in cor(x, y): the standard deviation is zero
## Warning in cor(x, y): the standard deviation is zero
## Warning in cor(x, y): the standard deviation is zero
## Warning in cor(x, y): the standard deviation is zero
## Warning in cor(x, y): the standard deviation is zero
second_model_data_num = second_model_data %>% select("Price","Mfg_Year","KM","HP", "cc","Quarterly_Tax","Weight")
cor_matrix <- cor(second_model_data_num, use = "complete.obs")
## Warning in cor(second_model_data_num, use = "complete.obs"): the standard
## deviation is zero
corrplot(cor_matrix, type = "full")
qqnorm(second_model_data$Price, main = "QQ Plot of Price of Second Model")
qqline(second_model_data$Price, col = "blue", lwd = 2)
head(summary(second_model_data))
## Id Model Price Age_08_04
## Min. : 422.0 Length:80 Min. : 5600 Min. :55.00
## 1st Qu.: 812.5 Class :character 1st Qu.: 7500 1st Qu.:65.00
## Median :1015.5 Mode :character Median : 7972 Median :68.00
## Mean :1041.0 Mean : 8158 Mean :69.14
## 3rd Qu.:1248.0 3rd Qu.: 8950 3rd Qu.:75.00
## Max. :1441.0 Max. :10845 Max. :80.00
## Mfg_Month Mfg_Year KM Fuel_Type
## Min. : 1.000 Min. :1998 Min. : 16916 Length:80
## 1st Qu.: 3.000 1st Qu.:1998 1st Qu.: 55804 Class :character
## Median : 4.000 Median :1999 Median : 73440 Mode :character
## Mean : 5.112 Mean :1999 Mean : 69592
## 3rd Qu.: 7.000 3rd Qu.:1999 3rd Qu.: 84420
## Max. :12.000 Max. :2000 Max. :146197
## HP Met_Color Automatic cc Doors
## Min. :86 Min. :0.0000 Min. :0 Min. :1300 Min. :3
## 1st Qu.:86 1st Qu.:0.0000 1st Qu.:0 1st Qu.:1300 1st Qu.:3
## Median :86 Median :1.0000 Median :0 Median :1300 Median :3
## Mean :86 Mean :0.5625 Mean :0 Mean :1300 Mean :3
## 3rd Qu.:86 3rd Qu.:1.0000 3rd Qu.:0 3rd Qu.:1300 3rd Qu.:3
## Max. :86 Max. :1.0000 Max. :0 Max. :1332 Max. :3
## Cylinders Gears Quarterly_Tax Weight Mfr_Guarantee
## Min. :4 Min. :5 Min. :69 Min. :1015 Min. :0.0000
## 1st Qu.:4 1st Qu.:5 1st Qu.:69 1st Qu.:1015 1st Qu.:0.0000
## Median :4 Median :5 Median :69 Median :1015 Median :0.0000
## Mean :4 Mean :5 Mean :69 Mean :1015 Mean :0.3875
## 3rd Qu.:4 3rd Qu.:5 3rd Qu.:69 3rd Qu.:1015 3rd Qu.:1.0000
## Max. :4 Max. :5 Max. :69 Max. :1015 Max. :1.0000
## BOVAG_Guarantee Guarantee_Period ABS Airbag_1
## Min. :0.00 Min. :3 Min. :0.000 Min. :0.00
## 1st Qu.:1.00 1st Qu.:3 1st Qu.:0.000 1st Qu.:1.00
## Median :1.00 Median :3 Median :1.000 Median :1.00
## Mean :0.95 Mean :3 Mean :0.725 Mean :0.95
## 3rd Qu.:1.00 3rd Qu.:3 3rd Qu.:1.000 3rd Qu.:1.00
## Max. :1.00 Max. :3 Max. :1.000 Max. :1.00
## Airbag_2 Airco Automatic_airco Boardcomputer CD_Player
## Min. :0.000 Min. :0 Min. :0 Min. :0 Min. :0.0000
## 1st Qu.:1.000 1st Qu.:0 1st Qu.:0 1st Qu.:0 1st Qu.:0.0000
## Median :1.000 Median :0 Median :0 Median :0 Median :0.0000
## Mean :0.775 Mean :0 Mean :0 Mean :0 Mean :0.0125
## 3rd Qu.:1.000 3rd Qu.:0 3rd Qu.:0 3rd Qu.:0 3rd Qu.:0.0000
## Max. :1.000 Max. :0 Max. :0 Max. :0 Max. :1.0000
## Central_Lock Powered_Windows Power_Steering Radio Mistlamps
## Min. :0.0 Min. :0.00 Min. :0.00 Min. :0.000 Min. :0.0000
## 1st Qu.:0.0 1st Qu.:0.00 1st Qu.:1.00 1st Qu.:0.000 1st Qu.:0.0000
## Median :0.0 Median :0.00 Median :1.00 Median :0.000 Median :0.0000
## Mean :0.1 Mean :0.15 Mean :0.95 Mean :0.125 Mean :0.0125
## 3rd Qu.:0.0 3rd Qu.:0.00 3rd Qu.:1.00 3rd Qu.:0.000 3rd Qu.:0.0000
## Max. :1.0 Max. :1.00 Max. :1.00 Max. :1.000 Max. :1.0000
## Sport_Model Backseat_Divider Metallic_Rim Radio_cassette
## Min. :0.0000 Min. :0.00 Min. :0.0000 Min. :0.000
## 1st Qu.:0.0000 1st Qu.:1.00 1st Qu.:0.0000 1st Qu.:0.000
## Median :1.0000 Median :1.00 Median :0.0000 Median :0.000
## Mean :0.5375 Mean :0.95 Mean :0.1125 Mean :0.125
## 3rd Qu.:1.0000 3rd Qu.:1.00 3rd Qu.:0.0000 3rd Qu.:0.000
## Max. :1.0000 Max. :1.00 Max. :1.0000 Max. :1.000
## Tow_Bar
## Min. :0.000
## 1st Qu.:0.000
## Median :0.000
## Mean :0.175
## 3rd Qu.:0.000
## Max. :1.000
For a third car model, we used the same conditions as the top and second model, including the same variables and functions such as ggpairs() for a scatter plot matrix, cor() for the correlation matrix and for plotting the result, and qqnorm() to compare the QQ-plots and the qqline() to add a reference line to the plot.
First for the scatter plot, the results showed that the correlation between price and manufacturing year is strong and positive, with around (0.662), also suggesting that newer cars tend to higher prices. Additionally, price and kilometers driven showed a negative correlation of -0.403, so higher mileage lower prices. Overall, the general overview is very similar to the top and second models
Next, the correlation matrix for the third model show a strong positive correlation between price and manufacturing year, represented in dark blue circles, in contrast with the strong negative correlation between price and kilometers driven, there are also some correlations that are either weak or negligible, but can be observed in the graph.
Finally, for this third model we obtained a QQ-plot representing the comparison of the actual price of top model to the theoretical normal distribution. The results showed that the prices in the middle range are almost normally distributed, however at the left lower end, the point deviate below the line, meaning that the prices are lower than expected, contrary to the upper right side, where point deviate above the line, indicating prices are higher than expected
ggpairs(third_model_data, columns = c("Price", "Mfg_Year", "KM", "HP","Weight"),
title = "Scatter Plot Matrix of Price vs Mfg_Year, KM, HP, Weight of the 3rd car model")
## Warning in cor(x, y): the standard deviation is zero
## Warning in cor(x, y): the standard deviation is zero
## Warning in cor(x, y): the standard deviation is zero
## Warning in cor(x, y): the standard deviation is zero
third_model_data_num = third_model_data %>% select("Price","Mfg_Year","KM","HP", "cc","Quarterly_Tax","Weight")
cor_matrix <- cor(third_model_data_num, use = "complete.obs")
## Warning in cor(third_model_data_num, use = "complete.obs"): the standard
## deviation is zero
corrplot(cor_matrix, type = "full")
qqnorm(third_model_data$Price, main = "QQ Plot of Price of Top Model")
qqline(third_model_data$Price, col = "blue", lwd = 2)
head(summary(third_model_data))
## Id Model Price Age_08_04
## Min. : 408.0 Length:79 Min. : 6950 Min. :54.00
## 1st Qu.: 781.5 Class :character 1st Qu.: 8875 1st Qu.:62.00
## Median : 922.0 Mode :character Median : 9500 Median :65.00
## Mean : 967.1 Mean : 9499 Mean :67.03
## 3rd Qu.:1213.5 3rd Qu.: 9950 3rd Qu.:72.00
## Max. :1405.0 Max. :11950 Max. :80.00
## Mfg_Month Mfg_Year KM Fuel_Type
## Min. : 1.000 Min. :1998 Min. : 22705 Length:79
## 1st Qu.: 2.000 1st Qu.:1998 1st Qu.: 59506 Class :character
## Median : 5.000 Median :1999 Median : 71825 Mode :character
## Mean : 5.468 Mean :1999 Mean : 72995
## 3rd Qu.: 8.000 3rd Qu.:1999 3rd Qu.: 85704
## Max. :12.000 Max. :2000 Max. :135337
## HP Met_Color Automatic cc Doors
## Min. :110 Min. :0.0000 Min. :0 Min. :1600 Min. :5
## 1st Qu.:110 1st Qu.:0.0000 1st Qu.:0 1st Qu.:1600 1st Qu.:5
## Median :110 Median :1.0000 Median :0 Median :1600 Median :5
## Mean :110 Mean :0.7089 Mean :0 Mean :1600 Mean :5
## 3rd Qu.:110 3rd Qu.:1.0000 3rd Qu.:0 3rd Qu.:1600 3rd Qu.:5
## Max. :110 Max. :1.0000 Max. :0 Max. :1600 Max. :5
## Cylinders Gears Quarterly_Tax Weight Mfr_Guarantee
## Min. :4 Min. :5 Min. : 85.00 Min. :1075 Min. :0.0000
## 1st Qu.:4 1st Qu.:5 1st Qu.: 85.00 1st Qu.:1075 1st Qu.:0.0000
## Median :4 Median :5 Median : 85.00 Median :1075 Median :0.0000
## Mean :4 Mean :5 Mean : 86.42 Mean :1076 Mean :0.3291
## 3rd Qu.:4 3rd Qu.:5 3rd Qu.: 85.00 3rd Qu.:1075 3rd Qu.:1.0000
## Max. :4 Max. :5 Max. :197.00 Max. :1120 Max. :1.0000
## BOVAG_Guarantee Guarantee_Period ABS Airbag_1
## Min. :0.0000 Min. :3 Min. :0.0000 Min. :1
## 1st Qu.:1.0000 1st Qu.:3 1st Qu.:1.0000 1st Qu.:1
## Median :1.0000 Median :3 Median :1.0000 Median :1
## Mean :0.9494 Mean :3 Mean :0.8608 Mean :1
## 3rd Qu.:1.0000 3rd Qu.:3 3rd Qu.:1.0000 3rd Qu.:1
## Max. :1.0000 Max. :3 Max. :1.0000 Max. :1
## Airbag_2 Airco Automatic_airco Boardcomputer
## Min. :0.0000 Min. :0.0000 Min. :0 Min. :0
## 1st Qu.:1.0000 1st Qu.:1.0000 1st Qu.:0 1st Qu.:0
## Median :1.0000 Median :1.0000 Median :0 Median :0
## Mean :0.8481 Mean :0.8734 Mean :0 Mean :0
## 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:0 3rd Qu.:0
## Max. :1.0000 Max. :1.0000 Max. :0 Max. :0
## CD_Player Central_Lock Powered_Windows Power_Steering
## Min. :0.00000 Min. :1 Min. :1 Min. :1
## 1st Qu.:0.00000 1st Qu.:1 1st Qu.:1 1st Qu.:1
## Median :0.00000 Median :1 Median :1 Median :1
## Mean :0.07595 Mean :1 Mean :1 Mean :1
## 3rd Qu.:0.00000 3rd Qu.:1 3rd Qu.:1 3rd Qu.:1
## Max. :1.00000 Max. :1 Max. :1 Max. :1
## Radio Mistlamps Sport_Model Backseat_Divider
## Min. :0.00000 Min. :0.0000 Min. :0 Min. :1
## 1st Qu.:0.00000 1st Qu.:1.0000 1st Qu.:0 1st Qu.:1
## Median :0.00000 Median :1.0000 Median :0 Median :1
## Mean :0.06329 Mean :0.8987 Mean :0 Mean :1
## 3rd Qu.:0.00000 3rd Qu.:1.0000 3rd Qu.:0 3rd Qu.:1
## Max. :1.00000 Max. :1.0000 Max. :0 Max. :1
## Metallic_Rim Radio_cassette Tow_Bar
## Min. :0.0000 Min. :0.00000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:0.0000
## Median :1.0000 Median :0.00000 Median :0.0000
## Mean :0.7215 Mean :0.06329 Mean :0.3165
## 3rd Qu.:1.0000 3rd Qu.:0.00000 3rd Qu.:1.0000
## Max. :1.0000 Max. :1.00000 Max. :1.0000
Since all the values found in the “Cylinders” variable are the same in the dataset, these are removed to prevent them from affecting any further analyses.
data = data %>% select(-Cylinders)
# All values in Cylinders are the same, not relevant.
In conclusion, this report meticulously details the processes used to clean and analyze the Corolla dataset, providing an overview of each step taken to enhance the quality of the data. From the initial loading and cleaning of the dataset, including the removal of leading question marks and outliers, to the application of exploratory data analysis techniques such as histograms, correlation matrices, scatter plots, and QQ-plots, we have approached the challenge of understanding the relationships between various variables. The insights obtained from these analyses have shown us key patterns, particularly the strong positive correlation between vehicle price and manufacturing year, as well as the negative correlation with mileage.
Furthermore, we addressed the implications of outliers and missing data, ensuring a good dataset that can be used for future predictive modeling. Our exploration of multiple car models highlights consistent trends across the dataset, reinforcing our findings and identifying the importance of data cleaning and analysis in deriving conclusions. Finally, this report serves not only to show our methodical approach to data preparation but also as a resource for subsequent analyses, allowing us to build predictive models with great accuracy and reliability. The rigorous examination of the data ensures that we are well-equipped to address the complexities in automotive pricing, ultimately making it easier to have more informed decision-making in future reports.