Calculating and displaying the correlation matrix

library(PerformanceAnalytics)  # provides chart.Correlation()
chart.Correlation(cars_train[, c(1, 5:7, 9, 28)], histogram = TRUE)

The correlation matrix shows how the numeric variables relate to one another. We can see that “num_fotos” and “num_portas” are only weakly correlated with all the other variables, including the price of the cars (“preco”).

Interestingly, “num_portas” shows a slight negative correlation with price. This may seem counterintuitive, given the common perception that cars with more doors are larger and therefore more expensive. A likely explanation is the presence of observations representing two-door luxury sports cars with very high prices.
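One way to probe this explanation is to compare the mean and the median price per number of doors: a few extreme prices pull the mean up but barely move the median. The sketch below uses a small invented data frame (`toy`) in place of ‘cars_train’; only the column names come from the dataset.

```r
# Invented values for illustration only; column names mirror cars_train
toy <- data.frame(
  num_portas = c(2, 2, 2, 4, 4, 4, 4),
  preco      = c(900000, 50000, 60000, 80000, 120000, 100000, 95000)
)

# Mean is sensitive to a single very expensive two-door car, median is not
mean_by_doors   <- tapply(toy$preco, toy$num_portas, mean)
median_by_doors <- tapply(toy$preco, toy$num_portas, median)

print(round(mean_by_doors))
print(median_by_doors)
```

If the same pattern appears on the real data (two-door mean above the four-door mean, two-door median below it), it would support the idea that a few expensive sports cars drive the negative correlation.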

The other numeric variables follow an expected pattern. “ano_de_fabricacao” has a strong positive correlation with “ano_modelo”, and both have a positive correlation with price, indicating that the newer the car, the higher its price tends to be.

The “hodometro” variable also follows an expected pattern, showing a negative correlation with “ano_de_fabricacao”, “ano_modelo”, and “preco”. We can infer that older cars tend to have a higher odometer (“hodometro” in Portuguese) reading and that, the higher the odometer reading, the lower the price of the automobile.

This brief analysis shows that the relationships among the numeric variables are, in general, coherent with what we would expect in reality.
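The chart is convenient for visual inspection, but the coefficients can also be read as plain numbers with base R's `cor()`. The following is a minimal sketch on an invented data frame (`toy`); since “num_fotos” contains missing values in the real data, the example assumes pairwise-complete observations are an acceptable way to handle them.

```r
# Invented values standing in for a few numeric columns of cars_train
toy <- data.frame(
  ano_modelo = c(2012, 2015, 2018, 2020, 2022, 2016),
  hodometro  = c(120000, 90000, 60000, 30000, 5000, NA),
  preco      = c(40000, 60000, 90000, 120000, 150000, 70000)
)

# Pearson correlations; each pair uses only the rows where both values are present
cor_mat <- cor(toy, use = "pairwise.complete.obs")
print(round(cor_mat, 2))
```

On the real data, the equivalent call would be applied to the same column subset passed to `chart.Correlation`.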

Analysis of Variance

anova <- aov(preco ~ marca, data = cars_train) 
summary(anova)                                      
##                Df    Sum Sq   Mean Sq F value Pr(>F)    
## marca          39 5.703e+13 1.462e+12     308 <2e-16 ***
## Residuals   29544 1.403e+14 4.747e+09                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

If we wish to analyze a variable other than “marca”, simply replace “marca” with the desired variable’s name.

Analysis of variance (ANOVA) tests whether the mean of a numeric variable (here, “preco”) differs across the categories of another variable (here, “marca”). In the output above, the F value of 308 with a p-value below 2e-16 indicates that the mean price differs significantly between brands. When used in conjunction with the correlation analysis, ANOVA gives us a comprehensive overview of the relationships between variables.

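Using the “aov” function, it was possible to analyze each independent variable in relation to the dependent variable “preco” (price).

The results indicated that virtually all variables were significant for price formation at the 5% significance level (95% confidence). The only variables that were not significant were “elegivel_revisao”, “ipva_pago”, and “veiculo_alienado”. As the summary output in the next section shows, “elegivel_revisao” is FALSE for every observation and “veiculo_alienado” is entirely NA, so these two variables contain no variation with which to explain price.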
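Rather than editing the formula by hand for each variable, the test can be repeated in a loop. The sketch below is one possible way to do it, not necessarily how the results above were produced: it runs on an invented data frame (`toy`), skips columns with fewer than two distinct values (where no test is possible), and collects the p-value of each one-way ANOVA.

```r
set.seed(1)

# Invented data: price depends on marca; elegivel_revisao is constant, as in cars_train
toy <- data.frame(
  preco            = c(rnorm(20, 50000, 5000), rnorm(20, 90000, 5000)),
  marca            = rep(c("A", "B"), each = 20),
  troca            = rep(c(TRUE, FALSE), times = 20),
  elegivel_revisao = FALSE
)

predictors <- setdiff(names(toy), "preco")

p_values <- sapply(predictors, function(v) {
  x <- toy[[v]]
  # A variable with fewer than two observed categories cannot be tested
  if (length(unique(x[!is.na(x)])) < 2) return(NA_real_)
  fit <- aov(toy$preco ~ factor(x))
  # First row of the ANOVA table holds the p-value of the factor
  summary(fit)[[1]][["Pr(>F)"]][1]
})

print(p_values)
```

Variables whose p-value is below 0.05 would be considered significant at the 5% level; an `NA` marks a variable that could not be tested.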

Displaying statistics of the variables

summary(cars_train)
##    num_fotos        marca              modelo             versao         
##  Min.   : 8.00   Length:29584       Length:29584       Length:29584      
##  1st Qu.: 8.00   Class :character   Class :character   Class :character  
##  Median : 8.00   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :10.32                                                           
##  3rd Qu.:14.00                                                           
##  Max.   :21.00                                                           
##  NA's   :177                                                             
##  ano_de_fabricacao   ano_modelo     hodometro         cambio         
##  Min.   :1985      Min.   :1997   Min.   :   100   Length:29584      
##  1st Qu.:2015      1st Qu.:2016   1st Qu.: 31214   Class :character  
##  Median :2018      Median :2018   Median : 57434   Mode  :character  
##  Mean   :2017      Mean   :2018   Mean   : 58431                     
##  3rd Qu.:2019      3rd Qu.:2020   3rd Qu.: 81954                     
##  Max.   :2022      Max.   :2023   Max.   :390065                     
##                                                                      
##    num_portas        tipo             blindado             cor           
##  Min.   :2.000   Length:29584       Length:29584       Length:29584      
##  1st Qu.:4.000   Class :character   Class :character   Class :character  
##  Median :4.000   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :3.941                                                           
##  3rd Qu.:4.000                                                           
##  Max.   :4.000                                                           
##                                                                          
##  tipo_vendedor      cidade_vendedor    estado_vendedor     anunciante       
##  Length:29584       Length:29584       Length:29584       Length:29584      
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##  entrega_delivery   troca         elegivel_revisao dono_aceita_troca 
##  Mode :logical    Mode :logical   Mode :logical    Length:29584      
##  FALSE:23601      FALSE:24523     FALSE:29584      Class :character  
##  TRUE :5983       TRUE :5061                       Mode  :character  
##                                                                      
##                                                                      
##                                                                      
##                                                                      
##  veiculo_único_dono revisoes_concessionaria  ipva_pago        
##  Length:29584       Length:29584            Length:29584      
##  Class :character   Class :character        Class :character  
##  Mode  :character   Mode  :character        Mode  :character  
##                                                               
##                                                               
##                                                               
##                                                               
##  veiculo_licenciado garantia_de_fábrica revisoes_dentro_agenda veiculo_alienado
##  Length:29584       Length:29584        Length:29584           Mode:logical    
##  Class :character   Class :character    Class :character       NA's:29584      
##  Mode  :character   Mode  :character    Mode  :character                       
##                                                                                
##                                                                                
##                                                                                
##                                                                                
##      preco        
##  Min.   :   9870  
##  1st Qu.:  76572  
##  Median : 114356  
##  Mean   : 133024  
##  3rd Qu.: 163680  
##  Max.   :1359813  
## 
summary(cars_test)
##    num_fotos        marca              modelo             versao         
##  Min.   : 8.00   Length:9862        Length:9862        Length:9862       
##  1st Qu.: 8.00   Class :character   Class :character   Class :character  
##  Median : 8.00   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :10.32                                                           
##  3rd Qu.:14.00                                                           
##  Max.   :21.00                                                           
##  NA's   :60                                                              
##  ano_de_fabricacao   ano_modelo     hodometro         cambio         
##  Min.   :1988      Min.   :2007   Min.   :   100   Length:9862       
##  1st Qu.:2015      1st Qu.:2016   1st Qu.: 31323   Class :character  
##  Median :2018      Median :2018   Median : 56742   Mode  :character  
##  Mean   :2017      Mean   :2018   Mean   : 58237                     
##  3rd Qu.:2019      3rd Qu.:2020   3rd Qu.: 81784                     
##  Max.   :2022      Max.   :2023   Max.   :381728                     
##                                                                      
##    num_portas        tipo             blindado             cor           
##  Min.   :2.000   Length:9862        Length:9862        Length:9862       
##  1st Qu.:4.000   Class :character   Class :character   Class :character  
##  Median :4.000   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :3.943                                                           
##  3rd Qu.:4.000                                                           
##  Max.   :4.000                                                           
##                                                                          
##  tipo_vendedor      cidade_vendedor    estado_vendedor     anunciante       
##  Length:9862        Length:9862        Length:9862        Length:9862       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##  entrega_delivery   troca         elegivel_revisao dono_aceita_troca 
##  Mode :logical    Mode :logical   Mode :logical    Length:9862       
##  FALSE:7907       FALSE:8217      FALSE:9862       Class :character  
##  TRUE :1955       TRUE :1645                       Mode  :character  
##                                                                      
##                                                                      
##                                                                      
##                                                                      
##  veiculo_único_dono revisoes_concessionaria  ipva_pago        
##  Length:9862        Length:9862             Length:9862       
##  Class :character   Class :character        Class :character  
##  Mode  :character   Mode  :character        Mode  :character  
##                                                               
##                                                               
##                                                               
##                                                               
##  veiculo_licenciado garantia_de_fábrica revisoes_dentro_agenda veiculo_alienado
##  Length:9862        Length:9862         Length:9862            Mode:logical    
##  Class :character   Class :character    Class :character       NA's:9862       
##  Mode  :character   Mode  :character    Mode  :character                       
##                                                                                
##                                                                                
##                                                                                
## 

Through the summary function, we can view statistical information about the variables in the ‘cars_train’ and ‘cars_test’ datasets. It provides the type of each variable, as well as the range and statistical information such as mean, median, and quartiles for the numeric variables.

After analyzing the dataset using the ‘summary’ function, we can observe some relevant information. For example, the mean price of cars in ‘cars_train’ is $133,024, well above the median of $114,356, which suggests a right-skewed price distribution (the maximum is $1,359,813). Additionally, we notice that the dataset contains observations representing very recent cars with a model year of 2023, as well as observations related to older cars with a manufacturing year as early as 1985.

We also observe considerable variation in the odometer readings of the cars in the dataset: in ‘cars_train’ they range from 100 to 390,065, with a median of 57,434. Some cars have very low mileage, suggesting limited use, while others have quite high mileage, indicating more extensive use.

Thus, using summary allows us to get an overview of the characteristics and distribution of the variables. It also exposes missing values: “num_fotos” has 177 NAs in ‘cars_train’ and 60 in ‘cars_test’. These statistical insights are helpful for better understanding the data and identifying potential discrepancies, outliers, or patterns in the variable values.
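The missing values and constant columns visible in the summary can also be listed programmatically. This is a minimal sketch on an invented data frame (`toy`) whose columns imitate the patterns seen above; the names `na_count`, `n_distinct`, and `uninformative` are introduced only for this example.

```r
# Invented rows; the NA and constant patterns imitate those seen in cars_train
toy <- data.frame(
  num_fotos        = c(8, NA, 14, 8),
  veiculo_alienado = c(NA, NA, NA, NA),
  elegivel_revisao = c(FALSE, FALSE, FALSE, FALSE),
  preco            = c(50000, 60000, 70000, 80000)
)

# Number of missing values per column
na_count <- colSums(is.na(toy))

# Number of distinct non-missing values per column
n_distinct <- sapply(toy, function(x) length(unique(x[!is.na(x)])))

# Columns that are entirely NA or hold a single value carry no information
uninformative <- names(toy)[n_distinct < 2]

print(na_count)
print(uninformative)
```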

Displaying the categories/levels of qualitative variables

levels(factor(cars_train$marca)) 
##  [1] "ALFA ROMEO"    "AUDI"          "BMW"           "BRM"          
##  [5] "CHERY"         "CHEVROLET"     "CHRYSLER"      "CITROËN"      
##  [9] "DODGE"         "EFFA"          "FERRARI"       "FIAT"         
## [13] "FORD"          "HONDA"         "HYUNDAI"       "IVECO"        
## [17] "JAC"           "JAGUAR"        "JEEP"          "KIA"          
## [21] "LAND ROVER"    "LEXUS"         "LIFAN"         "MASERATI"     
## [25] "MERCEDES-BENZ" "MINI"          "MITSUBISHI"    "NISSAN"       
## [29] "PEUGEOT"       "PORSCHE"       "RAM"           "RENAULT"      
## [33] "SMART"         "SSANGYONG"     "SUBARU"        "SUZUKI"       
## [37] "TOYOTA"        "TROLLER"       "VOLKSWAGEN"    "VOLVO"
levels(factor(cars_test$marca))  
##  [1] "ALFA ROMEO"    "AUDI"          "BMW"           "CHERY"        
##  [5] "CHEVROLET"     "CHRYSLER"      "CITROËN"       "DODGE"        
##  [9] "EFFA"          "FERRARI"       "FIAT"          "FORD"         
## [13] "HONDA"         "HYUNDAI"       "JAC"           "JAGUAR"       
## [17] "JEEP"          "KIA"           "LAMBORGHINI"   "LAND ROVER"   
## [21] "LEXUS"         "LIFAN"         "MASERATI"      "MERCEDES-BENZ"
## [25] "MINI"          "MITSUBISHI"    "NISSAN"        "PEUGEOT"      
## [29] "PORSCHE"       "RAM"           "RENAULT"       "SMART"        
## [33] "SSANGYONG"     "SUBARU"        "SUZUKI"        "TOYOTA"       
## [37] "TROLLER"       "VOLKSWAGEN"    "VOLVO"

If we wish to analyze a variable other than ‘marca’, simply replace ‘marca’ with the desired variable’s name.

The “levels” function lists the categories/levels present in a qualitative/categorical variable, and the length of that list gives the number of categories. Because the qualitative columns are stored as character vectors, they are converted with “factor” first. This is useful because, when combined with other analyses, it provides us with a good understanding of the complexity of the data we are dealing with.

In this particular case, we observe that some of the qualitative variables have a large number of categories, with particular emphasis on the “versao” (version) variable.
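To compare the number of categories across all qualitative variables at once, the count can be computed for every character column. The sketch below assumes an invented data frame (`toy`) with two of the dataset's column names; the values are made up.

```r
# Invented rows; only the column names come from the dataset
toy <- data.frame(
  marca  = c("FIAT", "FIAT", "AUDI", "BMW"),
  versao = c("1.0 FLEX", "1.4 FLEX", "2.0 TFSI", "3.0 TURBO"),
  preco  = c(1, 2, 3, 4),
  stringsAsFactors = FALSE
)

# Select the character columns and count distinct values in each
is_char  <- sapply(toy, is.character)
n_levels <- sapply(toy[is_char], function(x) length(unique(x)))

# Variables with the most categories first
print(sort(n_levels, decreasing = TRUE))
```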

This indicates that training a predictive model will be challenging: these variables increase the complexity of the model, yet they cannot simply be left out of the training process. We know, both from the analyses conducted here and from our real-world experience, that variables such as model, version, and car brand have a direct influence on the price.

It’s also worth noting that the categories of the qualitative variables are not exactly the same in both datasets. For example, in the “marca” (brand) variable, the “BRM” and “IVECO” categories appear only in the “cars_train” dataset, while “LAMBORGHINI” appears only in “cars_test”. The latter case is the more delicate one, because a model trained on “cars_train” has never seen that brand. Therefore, special data treatment will be necessary to make price predictions in the “cars_test” dataset using a model trained on the “cars_train” dataset.
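Such mismatches can be found with `setdiff`. The following sketch uses two short invented vectors in place of `cars_train$marca` and `cars_test$marca`, and also shows one possible treatment: encoding both as factors over a shared set of levels.

```r
# Short invented vectors standing in for cars_train$marca and cars_test$marca
train_marca <- c("AUDI", "BMW", "BRM", "IVECO", "FIAT")
test_marca  <- c("AUDI", "BMW", "FIAT", "LAMBORGHINI")

# Categories present in one dataset but not in the other
only_in_train <- setdiff(unique(train_marca), unique(test_marca))
only_in_test  <- setdiff(unique(test_marca), unique(train_marca))

# One possible treatment: give both factors the same level set
all_levels <- sort(union(train_marca, test_marca))
train_f <- factor(train_marca, levels = all_levels)
test_f  <- factor(test_marca,  levels = all_levels)

print(only_in_train)
print(only_in_test)
```

A shared level set keeps the encoding consistent, but it does not give the model any information about a brand that is absent from the training data; that case still needs an explicit decision, such as mapping it to a generic category.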

Displaying the frequencies of each category/level of the qualitative variables

table(cars_train$marca)         
## 
##    ALFA ROMEO          AUDI           BMW           BRM         CHERY 
##             9          1698          1784             1           153 
##     CHEVROLET      CHRYSLER       CITROËN         DODGE          EFFA 
##          3020            30           194            37             1 
##       FERRARI          FIAT          FORD         HONDA       HYUNDAI 
##             1          1918          1060          1586          2043 
##         IVECO           JAC        JAGUAR          JEEP           KIA 
##             2             3           148          2000           408 
##    LAND ROVER         LEXUS         LIFAN      MASERATI MERCEDES-BENZ 
##           760            75             8             7          1125 
##          MINI    MITSUBISHI        NISSAN       PEUGEOT       PORSCHE 
##           137           862           438          1675           349 
##           RAM       RENAULT         SMART     SSANGYONG        SUBARU 
##           168           538            12            14            41 
##        SUZUKI        TOYOTA       TROLLER    VOLKSWAGEN         VOLVO 
##            41          2180           177          4594           287
table(cars_test$marca)          
## 
##    ALFA ROMEO          AUDI           BMW         CHERY     CHEVROLET 
##             1           593           591            49          1000 
##      CHRYSLER       CITROËN         DODGE          EFFA       FERRARI 
##            10            54            10             1             1 
##          FIAT          FORD         HONDA       HYUNDAI           JAC 
##           605           385           511           697             2 
##        JAGUAR          JEEP           KIA   LAMBORGHINI    LAND ROVER 
##            64           667           157             1           267 
##         LEXUS         LIFAN      MASERATI MERCEDES-BENZ          MINI 
##            16             2             5           376            45 
##    MITSUBISHI        NISSAN       PEUGEOT       PORSCHE           RAM 
##           286           145           571           121            53 
##       RENAULT         SMART     SSANGYONG        SUBARU        SUZUKI 
##           173             4             6            13            10 
##        TOYOTA       TROLLER    VOLKSWAGEN         VOLVO 
##           702            46          1546            76

If we want to analyze a different variable than “marca” (brand), simply replace “marca” with the name of the desired variable.

The table function is also very useful for obtaining information about qualitative variables. It allows us to check the count of observations in each of the categories of a qualitative variable.

When analyzing the data, we notice that variables with a larger number of categories have some categories with a very small number of observations. In “marca”, for example, BRM, EFFA, and FERRARI each appear only once in ‘cars_train’. The “versao” (version) variable in particular has several categories with few observations.

As mentioned earlier, this characteristic in a dataset can make predictions more challenging, especially when we cannot simply exclude the variables that exhibit these characteristics.
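A common way to soften this problem, shown here only as an illustration and not as the treatment used in this analysis, is to group rare categories under a single label. The sketch assumes an invented vector in place of `cars_train$marca`, a hypothetical threshold `min_count`, and a hypothetical label "OTHER".

```r
# Invented vector standing in for cars_train$marca
marca <- c(rep("VOLKSWAGEN", 6), rep("FIAT", 4), "BRM", "EFFA", "FERRARI")

min_count <- 3  # hypothetical threshold; would need tuning on the real data

# Categories observed fewer than min_count times
counts <- table(marca)
rare   <- names(counts)[counts < min_count]

# Replace rare categories with a single catch-all label
marca_lumped <- ifelse(marca %in% rare, "OTHER", marca)

print(table(marca_lumped))
```

The trade-off is that very different cars (for example, a FERRARI and an EFFA) end up in the same group, so this kind of grouping may hide exactly the price differences the model is supposed to learn.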