Loading packages and dataset

pacotes <- c("readxl", "dplyr", "ggplot2" )
lapply(pacotes, library, character.only = TRUE)
cars_train <- read_excel("cars_train.xlsx")

Analyses

Using table to look at the available brands:

table(cars_train$marca)
## 
##    ALFA ROMEO          AUDI           BMW           BRM         CHERY 
##             9          1698          1784             1           153 
##     CHEVROLET      CHRYSLER       CITROËN         DODGE          EFFA 
##          3020            30           194            37             1 
##       FERRARI          FIAT          FORD         HONDA       HYUNDAI 
##             1          1918          1060          1586          2043 
##         IVECO           JAC        JAGUAR          JEEP           KIA 
##             2             3           148          2000           408 
##    LAND ROVER         LEXUS         LIFAN      MASERATI MERCEDES-BENZ 
##           760            75             8             7          1125 
##          MINI    MITSUBISHI        NISSAN       PEUGEOT       PORSCHE 
##           137           862           438          1675           349 
##           RAM       RENAULT         SMART     SSANGYONG        SUBARU 
##           168           538            12            14            41 
##        SUZUKI        TOYOTA       TROLLER    VOLKSWAGEN         VOLVO 
##            41          2180           177          4594           287

I’ll consider the term “popular” as a reference to brands that have more affordable cars in Brazil.

Aggregating observations by “estado_vendedor” and showing a ranking of states with the highest number of observations

vendas_por_estado <- populares %>%
  group_by(estado_vendedor) %>%
  summarise(total_vendas = n()) %>%
  arrange(desc(total_vendas))
  top_10_estados <- head(vendas_por_estado, 10)
  top_10_estados
## # A tibble: 10 × 2
##    estado_vendedor        total_vendas
##    <chr>                         <int>
##  1 São Paulo (SP)                11827
##  2 Rio de Janeiro (RJ)            1829
##  3 Paraná (PR)                    1759
##  4 Santa Catarina (SC)            1663
##  5 Minas Gerais (MG)              1298
##  6 Rio Grande do Sul (RS)         1107
##  7 Goiás (GO)                      588
##  8 Bahia (BA)                      490
##  9 Pernambuco (PE)                 255
## 10 Alagoas (AL)                    120

Getting the highest price for each “estado_vendedor”

maior_preco_por_estado <- populares %>%
  group_by(estado_vendedor) %>%
  summarize(maior_preco = max(preco)) %>%
arrange(desc(maior_preco))
maior_preco_por_estado
## # A tibble: 25 × 2
##    estado_vendedor        maior_preco
##    <chr>                        <dbl>
##  1 São Paulo (SP)             781750.
##  2 Minas Gerais (MG)          653173.
##  3 Goiás (GO)                 586482.
##  4 Paraná (PR)                550147.
##  5 Santa Catarina (SC)        545973.
##  6 Rio Grande do Sul (RS)     540420.
##  7 Rio de Janeiro (RJ)        484355.
##  8 Sergipe (SE)               400378.
##  9 Bahia (BA)                 372539.
## 10 Alagoas (AL)               345007.
## # … with 15 more rows

Calculating the standard deviation for each “estado_vendedor”

desvio_padrao_por_estado <- populares %>%
  group_by(estado_vendedor) %>%
  summarize(desvio_padrao = sd(preco)) %>%
arrange(desc(desvio_padrao))
desvio_padrao_por_estado
## # A tibble: 25 × 2
##    estado_vendedor        desvio_padrao
##    <chr>                          <dbl>
##  1 Sergipe (SE)                 110768.
##  2 Rio Grande do Sul (RS)        65032.
##  3 Minas Gerais (MG)             64713.
##  4 Paraná (PR)                   63789.
##  5 Goiás (GO)                    62744.
##  6 Rio de Janeiro (RJ)           62449.
##  7 Bahia (BA)                    61629.
##  8 Santa Catarina (SC)           60929.
##  9 São Paulo (SP)                60881.
## 10 Alagoas (AL)                  59437.
## # … with 15 more rows

Calculating the average price for each state

populares_top_estados <- filter(populares, estado_vendedor %in% top_10_estados$estado_vendedor)
media_preco_por_estado <- populares_top_estados %>%
  group_by(estado_vendedor) %>%
  summarize(media_preco = mean(preco))
  media_preco_por_estado
## # A tibble: 10 × 2
##    estado_vendedor        media_preco
##    <chr>                        <dbl>
##  1 Alagoas (AL)               124097.
##  2 Bahia (BA)                 114218.
##  3 Goiás (GO)                 132822.
##  4 Minas Gerais (MG)          119491.
##  5 Paraná (PR)                123895.
##  6 Pernambuco (PE)            105228.
##  7 Rio Grande do Sul (RS)     122820.
##  8 Rio de Janeiro (RJ)        115898.
##  9 Santa Catarina (SC)        115379.
## 10 São Paulo (SP)             110091.

Creating a boxplot of prices for the top 5 states with the most sales

siglas_estados <- c("AL", "BA", "GO", "MG", "PR", "PE", "RJ", "RS", "SC", "SP" )
ggplot(populares_top_estados, aes(x = estado_vendedor, y = preco)) +
  geom_boxplot() +
  labs(title = "Boxplot dos Preços dos Carros Populares",
       x = "Estado Vendedor",
       y = "Preço") +
  scale_x_discrete(labels = siglas_estados) +
  scale_y_continuous(labels = scales::comma)

We can observe that the state of São Paulo had a significantly higher number of observations compared to other states. Furthermore, although the average selling price is similar among the top ten states with the most observations, São Paulo has the highest selling price for a popular brand car. The standard deviation also indicates reasonable price variability, suggesting that sellers can get good deals when selling their cars in the state of São Paulo.

Given that the difference in average prices for popular cars between states is relatively small, it is more advantageous to sell the car in the state with the highest number of observations. In this case, the seller would have an easier time closing the sale and would likely receive more offers to choose from, which may be more attractive.

Therefore, based on the analysis of the dataset, we can conclude that São Paulo is the most favorable state to sell a popular brand car, mainly due to the significant difference in the number of observations compared to other states.

It’s worth noting that this analysis considered only the “popular” brands in the “cars_train” dataset, which are: “CHEVROLET,” “FIAT,” “FORD,” “HONDA,” “HYUNDAI,” “JEEP,” “KIA,” “NISSAN,” “PEUGEOT,” “RENAULT,” “TOYOTA,” and “VOLKSWAGEN.”