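# Correlation chart of the numeric columns: num_fotos, ano_de_fabricacao, ano_modelo, hodometro, num_portas, preco
PerformanceAnalytics::chart.Correlation(cars_train[, c(1, 5:7, 9, 28)], histogram = TRUE)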
The correlation matrix shows how the numeric variables relate to one another. “num_fotos” and “num_portas” correlate only weakly with all the other variables, including price.
Interestingly, “num_portas” shows a slight negative correlation with price. This may seem counterintuitive, given the common perception that cars with more doors are larger and therefore more expensive. The data behave this way because they include two-door luxury sports cars with very high prices.
The other numeric variables follow an expected pattern. “ano_de_fabricacao” has a strong positive correlation with “ano_modelo”, and both have a positive correlation with price, indicating that the newer the car, the higher its price tends to be.
The “hodometro” (odometer) variable also follows an expected pattern, showing a negative correlation with “ano_de_fabricacao”, “ano_modelo”, and “preco”. Older cars tend to have higher odometer readings, and the higher the odometer reading, the lower the price tends to be.
This brief analysis shows that the relationships among the numeric variables are, in general, coherent and consistent with what we would expect in reality.
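The same check can be sketched with columns selected by name instead of by position, which is less fragile if the column order changes. This is a minimal illustration using base R `cor()`; the data frame `toy` and all of its values are invented and are not taken from the dataset. `use = "pairwise.complete.obs"` is included because “num_fotos” contains missing values in the real data.

```r
# Toy stand-in for cars_train (values invented for illustration only)
toy <- data.frame(
  num_fotos  = c(8, 8, NA, 14, 8, 21, 8, 14),
  ano_modelo = c(2012, 2015, 2016, 2018, 2019, 2020, 2021, 2023),
  hodometro  = c(150000, 120000, 95000, 70000, 52000, 40000, 21000, 3000),
  num_portas = c(4, 4, 4, 4, 2, 4, 4, 2),
  preco      = c(30000, 42000, 50000, 68000, 310000, 90000, 110000, 450000)
)

num_vars <- c("num_fotos", "ano_modelo", "hodometro", "num_portas", "preco")

# Pairwise-complete correlations tolerate the NA in num_fotos
cor_mat <- cor(toy[, num_vars], use = "pairwise.complete.obs")
round(cor_mat, 2)

# Correlation of every numeric variable with price, strongest first
cor_price <- cor_mat[, "preco"]
cor_price[order(abs(cor_price), decreasing = TRUE)]
```

In the toy data the two-door rows carry the highest prices, which reproduces the negative sign between “num_portas” and “preco” discussed above.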
# One-way ANOVA: does the mean price differ across brands?
anova_marca <- aov(preco ~ marca, data = cars_train)
summary(anova_marca)
##                Df    Sum Sq   Mean Sq F value Pr(>F)
## marca          39 5.703e+13 1.462e+12     308 <2e-16 ***
## Residuals   29544 1.403e+14 4.747e+09
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
To analyze a variable other than “marca”, simply replace “marca” with the desired variable’s name.
Analysis of variance (ANOVA) tests whether the mean of a numeric variable, here price, differs across the categories of another variable. Combined with the correlation analysis, it gives a more complete overview of how the variables relate to price.
Using the “aov” function, we analyzed each independent variable in relation to the dependent variable “preco” (price).
The results indicated that virtually all variables were significant for price formation at the 5% significance level (95% confidence). The only exceptions were “elegivel_revisao”, “ipva_pago”, and “veiculo_alienado”. For two of these the result is unsurprising: in the summary output below, “elegivel_revisao” is always FALSE and “veiculo_alienado” is entirely NA, so neither can explain any variation in price.
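Repeating the ANOVA for every predictor can be automated. The following is a minimal sketch, not necessarily how the analysis above was run: the helper `anova_p()` and the data frame `toy` are hypothetical, and the values are invented for illustration. Predictors with fewer than two observed values are skipped, since no effect can be estimated for them.

```r
# Toy stand-in for cars_train (values invented for illustration only)
toy <- data.frame(
  marca            = rep(c("FIAT", "BMW", "PORSCHE"), each = 6),
  cambio           = rep(c("Manual", "Auto"), times = 9),
  elegivel_revisao = FALSE,  # constant, as in the real data
  veiculo_alienado = NA,     # entirely missing, as in the real data
  preco            = rep(c(50000, 150000, 400000), each = 6) +
                     rep(c(-3000, -1000, 0, 1000, 2000, 4000), times = 3)
)

# One-way ANOVA p-value of the response against a single predictor.
# Returns NA when the predictor has fewer than two observed values.
anova_p <- function(var, data, response = "preco") {
  x <- data[[var]]
  if (length(unique(x[!is.na(x)])) < 2) return(NA_real_)
  fit <- aov(reformulate(var, response = response), data = data)
  summary(fit)[[1]][["Pr(>F)"]][1]
}

predictors <- setdiff(names(toy), "preco")
p_values <- sapply(predictors, anova_p, data = toy)
p_values

# Predictors significant at the 5% level
names(p_values)[!is.na(p_values) & p_values < 0.05]
```

Returning `NA` for constant or all-missing columns keeps them visible in the result instead of letting the loop stop with an error.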
summary(cars_train)
## num_fotos marca modelo versao
## Min. : 8.00 Length:29584 Length:29584 Length:29584
## 1st Qu.: 8.00 Class :character Class :character Class :character
## Median : 8.00 Mode :character Mode :character Mode :character
## Mean :10.32
## 3rd Qu.:14.00
## Max. :21.00
## NA's :177
## ano_de_fabricacao ano_modelo hodometro cambio
## Min. :1985 Min. :1997 Min. : 100 Length:29584
## 1st Qu.:2015 1st Qu.:2016 1st Qu.: 31214 Class :character
## Median :2018 Median :2018 Median : 57434 Mode :character
## Mean :2017 Mean :2018 Mean : 58431
## 3rd Qu.:2019 3rd Qu.:2020 3rd Qu.: 81954
## Max. :2022 Max. :2023 Max. :390065
##
## num_portas tipo blindado cor
## Min. :2.000 Length:29584 Length:29584 Length:29584
## 1st Qu.:4.000 Class :character Class :character Class :character
## Median :4.000 Mode :character Mode :character Mode :character
## Mean :3.941
## 3rd Qu.:4.000
## Max. :4.000
##
## tipo_vendedor cidade_vendedor estado_vendedor anunciante
## Length:29584 Length:29584 Length:29584 Length:29584
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## entrega_delivery troca elegivel_revisao dono_aceita_troca
## Mode :logical Mode :logical Mode :logical Length:29584
## FALSE:23601 FALSE:24523 FALSE:29584 Class :character
## TRUE :5983 TRUE :5061 Mode :character
##
##
##
##
## veiculo_único_dono revisoes_concessionaria ipva_pago
## Length:29584 Length:29584 Length:29584
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## veiculo_licenciado garantia_de_fábrica revisoes_dentro_agenda veiculo_alienado
## Length:29584 Length:29584 Length:29584 Mode:logical
## Class :character Class :character Class :character NA's:29584
## Mode :character Mode :character Mode :character
##
##
##
##
## preco
## Min. : 9870
## 1st Qu.: 76572
## Median : 114356
## Mean : 133024
## 3rd Qu.: 163680
## Max. :1359813
##
summary(cars_test)
## num_fotos marca modelo versao
## Min. : 8.00 Length:9862 Length:9862 Length:9862
## 1st Qu.: 8.00 Class :character Class :character Class :character
## Median : 8.00 Mode :character Mode :character Mode :character
## Mean :10.32
## 3rd Qu.:14.00
## Max. :21.00
## NA's :60
## ano_de_fabricacao ano_modelo hodometro cambio
## Min. :1988 Min. :2007 Min. : 100 Length:9862
## 1st Qu.:2015 1st Qu.:2016 1st Qu.: 31323 Class :character
## Median :2018 Median :2018 Median : 56742 Mode :character
## Mean :2017 Mean :2018 Mean : 58237
## 3rd Qu.:2019 3rd Qu.:2020 3rd Qu.: 81784
## Max. :2022 Max. :2023 Max. :381728
##
## num_portas tipo blindado cor
## Min. :2.000 Length:9862 Length:9862 Length:9862
## 1st Qu.:4.000 Class :character Class :character Class :character
## Median :4.000 Mode :character Mode :character Mode :character
## Mean :3.943
## 3rd Qu.:4.000
## Max. :4.000
##
## tipo_vendedor cidade_vendedor estado_vendedor anunciante
## Length:9862 Length:9862 Length:9862 Length:9862
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## entrega_delivery troca elegivel_revisao dono_aceita_troca
## Mode :logical Mode :logical Mode :logical Length:9862
## FALSE:7907 FALSE:8217 FALSE:9862 Class :character
## TRUE :1955 TRUE :1645 Mode :character
##
##
##
##
## veiculo_único_dono revisoes_concessionaria ipva_pago
## Length:9862 Length:9862 Length:9862
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## veiculo_licenciado garantia_de_fábrica revisoes_dentro_agenda veiculo_alienado
## Length:9862 Length:9862 Length:9862 Mode:logical
## Class :character Class :character Class :character NA's:9862
## Mode :character Mode :character Mode :character
##
##
##
##
Through the summary function, we can view statistical information about the variables in the ‘cars_train’ and ‘cars_test’ datasets. It reports the type of each variable and, for the numeric variables, the range, mean, median, and quartiles, along with the number of missing values (NA's).
After analyzing the dataset using the ‘summary’ function, we can observe some relevant information. For example, the mean price of the cars is $133,024, well above the median of $114,356, which indicates a right-skewed price distribution with a small number of very expensive cars (the maximum is $1,359,813). The dataset also spans a wide range of ages, from very recent cars with a model year of 2023 to older cars manufactured as early as 1985.
We also observe considerable variation in the odometer readings, which range from 100 to 390,065 in ‘cars_train’. Some cars have very low readings, suggesting limited use, while others have quite high readings, indicating more extensive use.
Thus, using summary allows us to get an overview of the
characteristics and distribution of the variables. These statistical
insights are helpful for better understanding the data and identifying
potential discrepancies, outliers, or patterns in the variable
values.
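Some of these checks can be made explicit rather than read off the summary by eye. The sketch below counts missing values per column, finds columns that are constant or entirely missing, and flags high prices with the usual boxplot rule. The data frame `toy` is hypothetical: its “hodometro” and “preco” columns reuse the five summary statistics printed above, and the remaining values are invented. The boxplot rule is one common convention, not a method stated in this analysis.

```r
# Toy stand-in for cars_train (illustration only)
toy <- data.frame(
  num_fotos        = c(8, NA, 14, 8, NA),
  hodometro        = c(100, 31214, 57434, 81954, 390065),
  elegivel_revisao = FALSE,
  veiculo_alienado = NA,
  preco            = c(9870, 76572, 114356, 163680, 1359813)
)

# Number of missing values per column
na_count <- colSums(is.na(toy))
na_count

# Number of distinct non-missing values per column
n_distinct <- sapply(toy, function(x) length(unique(x[!is.na(x)])))

# Columns that cannot help a model: entirely missing or constant
uninformative <- names(toy)[n_distinct < 2]
uninformative

# Prices above Q3 + 1.5 * IQR (boxplot rule for high outliers)
upper <- quantile(toy$preco, 0.75) + 1.5 * IQR(toy$preco)
high_prices <- toy$preco[toy$preco > upper]
high_prices
```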
levels(factor(cars_train$marca))
## [1] "ALFA ROMEO" "AUDI" "BMW" "BRM"
## [5] "CHERY" "CHEVROLET" "CHRYSLER" "CITROËN"
## [9] "DODGE" "EFFA" "FERRARI" "FIAT"
## [13] "FORD" "HONDA" "HYUNDAI" "IVECO"
## [17] "JAC" "JAGUAR" "JEEP" "KIA"
## [21] "LAND ROVER" "LEXUS" "LIFAN" "MASERATI"
## [25] "MERCEDES-BENZ" "MINI" "MITSUBISHI" "NISSAN"
## [29] "PEUGEOT" "PORSCHE" "RAM" "RENAULT"
## [33] "SMART" "SSANGYONG" "SUBARU" "SUZUKI"
## [37] "TOYOTA" "TROLLER" "VOLKSWAGEN" "VOLVO"
levels(factor(cars_test$marca))
## [1] "ALFA ROMEO" "AUDI" "BMW" "CHERY"
## [5] "CHEVROLET" "CHRYSLER" "CITROËN" "DODGE"
## [9] "EFFA" "FERRARI" "FIAT" "FORD"
## [13] "HONDA" "HYUNDAI" "JAC" "JAGUAR"
## [17] "JEEP" "KIA" "LAMBORGHINI" "LAND ROVER"
## [21] "LEXUS" "LIFAN" "MASERATI" "MERCEDES-BENZ"
## [25] "MINI" "MITSUBISHI" "NISSAN" "PEUGEOT"
## [29] "PORSCHE" "RAM" "RENAULT" "SMART"
## [33] "SSANGYONG" "SUBARU" "SUZUKI" "TOYOTA"
## [37] "TROLLER" "VOLKSWAGEN" "VOLVO"
If we wish to analyze a variable other than ‘marca’, simply replace ‘marca’ with the desired variable’s name.
The “levels” function lists the categories (levels) present in a qualitative/categorical variable, which also tells us how many there are. Combined with other analyses, this gives a good sense of the complexity of the data we are dealing with.
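The number of categories in every character column can also be obtained in a single pass. A minimal sketch, assuming the categorical columns are stored as character vectors (as the summary output suggests); the data frame `toy` and its version labels are invented for illustration.

```r
# Toy stand-in for cars_train (values invented for illustration only)
toy <- data.frame(
  marca  = c("FIAT", "FIAT", "BMW", "BMW", "AUDI", "AUDI"),
  versao = c("v1", "v2", "v3", "v4", "v5", "v5"),
  preco  = c(40000, 45000, 150000, 160000, 120000, 125000),
  stringsAsFactors = FALSE
)

# Keep only the character (categorical) columns
is_cat <- sapply(toy, is.character)

# Count distinct categories per categorical column
n_levels <- sapply(toy[is_cat], function(x) length(unique(x)))

# Columns with the most categories first
sort(n_levels, decreasing = TRUE)
```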
In this particular case, some of the qualitative variables have a large number of categories, most notably the “versao” (version) variable.
This makes training a predictive model more challenging: these variables increase the complexity of the model, yet they should be included in the training process. We know, both from the analyses conducted here and from real-world experience, that variables such as model, version, and brand have a direct influence on price.
It’s also worth noting that the categories of the qualitative variables are not exactly the same in both datasets. For example, in the “marca” (brand) variable, “BRM” and “IVECO” appear only in “cars_train”, while “LAMBORGHINI” appears only in “cars_test”. The latter case is the more problematic one, since a model trained on “cars_train” has never seen that category. Therefore, special data treatment will be necessary to make price predictions in the “cars_test” dataset using a model trained on the “cars_train” dataset.
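Differences between the two sets of categories can be listed with `setdiff()`. The sketch below uses shortened toy vectors built from brand names in the output above. Recoding unseen test categories to a catch-all “OTHER” label is only one possible treatment and is an assumption here, not the treatment used in this analysis.

```r
# Shortened toy versions of cars_train$marca and cars_test$marca
train_marca <- c("AUDI", "BMW", "BRM", "FIAT", "IVECO", "FIAT", "BMW")
test_marca  <- c("AUDI", "BMW", "FIAT", "LAMBORGHINI", "FIAT")

# Categories present in one dataset but not in the other
only_in_train <- setdiff(train_marca, test_marca)
only_in_test  <- setdiff(test_marca, train_marca)
only_in_train
only_in_test

# Recode test categories never seen in training to a catch-all label
known <- unique(train_marca)
test_clean <- ifelse(test_marca %in% known, test_marca, "OTHER")

# Build the test factor on the training levels plus the catch-all
test_factor <- factor(test_clean, levels = c(sort(known), "OTHER"))
table(test_factor)
```

A model can only make use of the “OTHER” level if some training rows carry it too, for example after grouping rare training brands under the same label.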
table(cars_train$marca)
##
##    ALFA ROMEO          AUDI           BMW           BRM         CHERY
##             9          1698          1784             1           153
##     CHEVROLET      CHRYSLER       CITROËN         DODGE          EFFA
##          3020            30           194            37             1
##       FERRARI          FIAT          FORD         HONDA       HYUNDAI
##             1          1918          1060          1586          2043
##         IVECO           JAC        JAGUAR          JEEP           KIA
##             2             3           148          2000           408
##    LAND ROVER         LEXUS         LIFAN      MASERATI MERCEDES-BENZ
##           760            75             8             7          1125
##          MINI    MITSUBISHI        NISSAN       PEUGEOT       PORSCHE
##           137           862           438          1675           349
##           RAM       RENAULT         SMART     SSANGYONG        SUBARU
##           168           538            12            14            41
##        SUZUKI        TOYOTA       TROLLER    VOLKSWAGEN         VOLVO
##            41          2180           177          4594           287
table(cars_test$marca)
##
##    ALFA ROMEO          AUDI           BMW         CHERY     CHEVROLET
##             1           593           591            49          1000
##      CHRYSLER       CITROËN         DODGE          EFFA       FERRARI
##            10            54            10             1             1
##          FIAT          FORD         HONDA       HYUNDAI           JAC
##           605           385           511           697             2
##        JAGUAR          JEEP           KIA   LAMBORGHINI    LAND ROVER
##            64           667           157             1           267
##         LEXUS         LIFAN      MASERATI MERCEDES-BENZ          MINI
##            16             2             5           376            45
##    MITSUBISHI        NISSAN       PEUGEOT       PORSCHE           RAM
##           286           145           571           121            53
##       RENAULT         SMART     SSANGYONG        SUBARU        SUZUKI
##           173             4             6            13            10
##        TOYOTA       TROLLER    VOLKSWAGEN         VOLVO
##           702            46          1546            76
To analyze a variable other than “marca” (brand), simply replace “marca” with the name of the desired variable.
The table function is also very useful for obtaining information about qualitative variables. It gives the count of observations in each category of a qualitative variable.
When analyzing the data, we noticed that variables with a larger number of categories have some categories with very few observations. The “marca” counts above already show this: “BRM”, “EFFA”, and “FERRARI” each appear only once in “cars_train”. The “versao” (version) variable in particular has several categories with few observations.
As mentioned earlier, this characteristic in a dataset can make predictions more challenging, especially when we cannot simply exclude the variables that exhibit these characteristics.
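One common way to keep such variables while limiting the effect of sparse categories is to group rare levels under a single label. The sketch below is an illustration under stated assumptions, not the treatment used in this analysis: the vector `toy_marca` is invented, and the threshold `min_n` is a hypothetical choice that would need tuning on the real data.

```r
# Toy brand column (counts invented for illustration only)
toy_marca <- c(rep("VOLKSWAGEN", 6), rep("CHEVROLET", 4), rep("TOYOTA", 3),
               "BRM", "EFFA", "FERRARI")

# Observations per category, most frequent first
counts <- table(toy_marca)
sort(counts, decreasing = TRUE)

# Hypothetical threshold: categories with fewer rows are treated as rare
min_n <- 3
rare <- names(counts)[as.vector(counts) < min_n]
rare

# Replace rare categories with a single catch-all label
marca_lumped <- ifelse(toy_marca %in% rare, "OTHER", toy_marca)
table(marca_lumped)
```

Grouping trades detail for stability: the model no longer estimates a separate effect from one or two observations, at the cost of treating very different rare brands alike.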