Scraping and analyzing https://www.autoscout24.com/

Alima Dzhanybaeva

2022-12-26

Scraping

First, I created a function that scrapes information about one particular car (name, i.e. the car brand, model, price, and all other variables contained in the table) and returns a list with the obtained data.

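library(rvest)       # read_html, html_nodes, html_text, html_attr (also provides %>%)
library(data.table)  # rbindlist
library(dplyr)       # select, group_by, summarise, rename
library(tidyr)       # separate
library(ggplot2)     # plots
library(scales)      # label_comma

get_one_car <- function(url) {
  t <- read_html(url)
  t_list <- list()

  # Brand, model and price come from fixed elements on the offer page
  t_list[['name']] <- t %>% html_nodes('.StageTitle_boldClassifiedInfo__L7JmO:nth-child(1)') %>% html_text()
  t_list[['model']] <- t %>% html_nodes('.StageTitle_model__pG_6i') %>% html_text()
  t_list[['price']] <- t %>% html_nodes('.StandardPrice_price__X_zzU') %>% html_text()

  # The data table is read as two parallel vectors: labels and values
  keys <- t %>% html_nodes('.DataGrid_defaultDtStyle__yzRR_') %>% html_text()
  values <- t %>% html_nodes('.DataGrid_fontBold__r__dO') %>% html_text()

  # seq_along() skips the loop when no keys are found (1:length(keys) would not)
  for (i in seq_along(keys)) {
    t_list[[keys[i]]] <- values[i]
  }
  return(t_list)
}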
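
The key/value pairing inside get_one_car can be tried without a network call. The sketch below uses made-up keys and values (not taken from a real listing) and a hypothetical helper name, pair_fields. The guard for empty or unequal vectors is an assumption about how a page might deviate, not something the site is known to do.

```r
# Hypothetical helper: turn parallel vectors of labels and values into a named list.
pair_fields <- function(keys, values) {
  # Use only as many pairs as both vectors can supply
  n <- min(length(keys), length(values))
  if (n == 0) return(list())
  as.list(setNames(values[seq_len(n)], keys[seq_len(n)]))
}

# Made-up example data standing in for the scraped table
keys   <- c("Body type", "Mileage", "Colour")
values <- c("Sedan", "53,150 km", "Black")

car <- pair_fields(keys, values)
str(car)
```

Each label becomes a column name once the lists are stacked, which is why labels that are missing on some pages later need to be filled with NA.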

Next, I wrote another function that first finds all the links on a results page and then keeps only those that point to individual car offers.

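get_links <- function(url) {
  t <- read_html(url)
  all_links <- t %>% html_nodes('a') %>% html_attr('href')
  # Drop anchors without an href, then keep only links to individual offers
  all_links <- all_links[!is.na(all_links)]
  links_for_cars <- all_links[startsWith(all_links, '/offers/')]
  # Return early if the page lists no offers
  if (length(links_for_cars) == 0) return(character(0))
  # paste0 is vectorised, so no loop is needed
  paste0('https://www.autoscout24.com', links_for_cars)
}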
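
The filtering step can be checked on a plain character vector. This is a minimal sketch with made-up hrefs; offer_urls is a hypothetical name, and dropping duplicates with unique() is an optional extra that the original function does not do.

```r
# Hypothetical helper: keep only offer links and make them absolute.
offer_urls <- function(hrefs, base = "https://www.autoscout24.com") {
  hrefs  <- hrefs[!is.na(hrefs)]                  # anchors without an href yield NA
  offers <- hrefs[startsWith(hrefs, "/offers/")]  # relative links to single offers
  if (length(offers) == 0) return(character(0))   # paste0 would otherwise return the bare base URL
  unique(paste0(base, offers))                    # optional: drop repeated links
}

# Made-up hrefs standing in for html_attr('href') output
hrefs <- c("/offers/bmw-320-abc", NA, "/lst/bmw",
           "/offers/bmw-320-abc", "/offers/audi-a4-xyz")
urls <- offer_urls(hrefs)
urls
```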

Next:

  1. I created a variable that contains all result pages for one particular brand (bmw_links),

  2. applied the get_links function to it and unlisted the result (all_bmw_links),

  3. applied the get_one_car function to the links for all cars of that brand and combined the results with rbindlist (bmw_df),

  4. selected only the necessary variables: Name, Model, Price, Body type, Type, Mileage, Fuel consumption, and Colour.

# BMW
# Result pages 1 to 20 for one brand
bmw_links <- paste0('https://www.autoscout24.com/lst/bmw?sort=standard&desc=0&ustate=N%2CU&atype=C&search_id=27pikzseqjf&page=', 1:20)
# Collect the offer links from every result page
all_bmw_links <- lapply(bmw_links, get_links)
all_bmw_links <- unlist(all_bmw_links)
# Scrape each offer and stack the results; fill = TRUE pads missing fields with NA
bmw_df <- lapply(all_bmw_links, get_one_car)
bmw_df <- rbindlist(bmw_df, fill = TRUE)

bmw <- bmw_df %>% select(c('name', 'model', 'price', 'Body type', 'Type',
                           'Mileage', 'Fuel consumption', 'Colour'))

All these steps were also applied to other car brands (Audi, Ford, Mercedes Benz, Opel, and Renault), so in the end, we have 6 separate dataframes.
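
Repeating the same steps for six brands could be wrapped in a function. The sketch below is one possible way to do that, not the code used for this report: brand_pages and scrape_brand are hypothetical names, and the query string omits the search_id parameter on the assumption that it is not required. The two scraping functions are passed in as arguments so the control flow can be tried with stand-ins instead of real requests.

```r
# Hypothetical helper: build the result-page URLs for one brand.
# The query string mirrors the BMW example but omits search_id (an assumption).
brand_pages <- function(brand, pages = 1:20) {
  paste0("https://www.autoscout24.com/lst/", brand,
         "?sort=standard&desc=0&ustate=N%2CU&atype=C&page=", pages)
}

# Hypothetical wrapper: collect offer links from every page, then scrape each offer.
scrape_brand <- function(brand, pages, get_links, get_one_car) {
  links <- unlist(lapply(brand_pages(brand, pages), get_links))
  lapply(links, get_one_car)
}

# Stand-ins that return fixed values, used only to show the control flow
fake_links <- function(url) paste0(url, c("#car1", "#car2"))
fake_car   <- function(url) list(name = "Opel", url = url)

cars <- scrape_brand("opel", 1:2, fake_links, fake_car)
length(cars)
```

With the real get_links and get_one_car, the returned list could be passed to rbindlist(..., fill = TRUE) as in the BMW example.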

Data cleaning

Next, I combined the previously generated tables for the 6 car brands into a new dataframe that contains 2400 observations.

df <- rbind(bmw, audi, ford, mercedes, opel, renault)

To make price, mileage, and fuel consumption suitable for analysis, I split the raw text variables into several new ones and kept only the parts that contain numbers.

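# Price: split at the space, then at the trailing '.-', then drop thousands separators
# (sep is a regular expression, so '.-' means any character followed by '-')
df <- separate(df, price, into = c('euro', 'price'), sep = ' ')
df <- separate(df, price, into = c('price', 'del'), sep = '.-')
df$price <- as.numeric(gsub(',', '', df$price))

# Mileage: keep the number before the unit
df <- separate(df, Mileage, into = c('mileage', 'km'), sep = ' ')
df$mileage <- as.numeric(gsub(',', '', df$mileage))

# Fuel consumption: keep the combined figure, i.e. the number before 'l/100 km'
df <- separate(df, 'Fuel consumption', into = c('fuel_cons', 'city'), sep = '(comb.)')
df <- separate(df, fuel_cons, into = c('fuel_cons', 'km'), sep = '/')
df <- separate(df, fuel_cons, into = c('fuel_cons', 'l'), sep = ' ')
df$fuel_cons <- as.numeric(df$fuel_cons)

# Drop the helper columns created by the splits
df <- select(df, -c('euro', 'del', 'km', 'city', 'km', 'l'))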
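
An alternative to repeated splitting is to extract the first number from each text field with a regular expression. This is a minimal sketch, not the method used above: first_number is a hypothetical helper, and the example strings are assumptions about the raw format, inferred from the separators in the cleaning code.

```r
# Hypothetical helper: pull the first number out of each text field.
first_number <- function(x) {
  x   <- gsub(",", "", x)                      # drop thousands separators
  hit <- regexpr("[0-9]+(\\.[0-9]+)?", x)      # first run of digits, optional decimals
  ok  <- !is.na(hit) & hit > 0                 # fields that contain a number
  out <- rep(NA_real_, length(x))
  out[ok] <- as.numeric(regmatches(x, hit))    # regmatches returns matched fields only
  out
}

# Assumed raw formats (not copied from the site)
raw <- c("\u20ac 30,990.-", "53,150 km", "6.4 l/100 km (comb.)", "-", NA)
parsed <- first_number(raw)
parsed
```

Fields without any digits come back as NA instead of raising an error, which keeps missing values visible in later summaries.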

Analysis

The table below shows summary statistics for the dataframe.

Descriptive statistics

                                 Mean    Median        SD     Min        Max
Price (in EUR)               48191.04  30990.00  46810.85  950.00  558800.00
Mileage (in km)              67185.83  53150.00  63935.17    0.00  406000.00
Fuel consumption (L/100 km)      7.07      6.40      3.19    0.00      20.00
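
The code behind the table is not shown in this report. One way such a table could be produced is sketched below with made-up values; describe is a hypothetical helper, and ignoring missing values with na.rm = TRUE is an assumption.

```r
# Hypothetical helper: the five statistics from the table for one numeric vector.
describe <- function(x) {
  c(Mean   = mean(x, na.rm = TRUE),
    Median = median(x, na.rm = TRUE),
    SD     = sd(x, na.rm = TRUE),
    Min    = min(x, na.rm = TRUE),
    Max    = max(x, na.rm = TRUE))
}

# Made-up values, not the scraped data
toy <- data.frame(price     = c(10000, 20000, 60000),
                  mileage   = c(0, 50000, 100000),
                  fuel_cons = c(5, 6, NA))

# One row per variable, one column per statistic
stats <- t(sapply(toy, describe))
round(stats, 2)
```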

  1. The first two graphs show how car prices differ across the 6 most popular car brands.

df1 <- df %>% 
  group_by(name) %>% 
  summarise(
    price = mean(price),
    fuel_cons = mean(fuel_cons, na.rm = TRUE))

# Bar chart of the average price per brand, ordered from highest to lowest
ggplot(df1, aes(x = reorder(name, -price), y = price, fill = name)) + 
  geom_bar(stat = 'identity') +
  geom_text(aes(label = round(price)), vjust = 1.5, colour = "white") +
  # The bars use the fill aesthetic, so the palette is set with scale_fill_brewer
  scale_fill_brewer(palette = "Dark2") +
  labs(x = 'Car brand', y = 'Price (in EUR)', title = 'Average price by car brand') +
  theme_bw() +
  theme(legend.position = "none")

# Box plot of the price distribution per brand
ggplot(df, aes(x = reorder(name, -price), y = price, fill = name)) +
  geom_boxplot(outlier.colour = "black", outlier.shape = 16,
               outlier.size = 2, notch = FALSE) +
  scale_fill_brewer(palette = "Dark2") +
  scale_y_continuous(breaks = seq(0, 600000, by = 100000), labels = label_comma()) + 
  labs(x = 'Car brand', y = 'Price (in EUR)', title = 'Prices by car brand') +
  theme_bw() +
  theme(legend.position = "none")

As the bar chart shows, the average price is highest for Mercedes Benz. The box plot for this brand is also the widest (indicating more variation in prices), and it has some very prominent extreme values that go up to 550,000 euros.

In contrast, the “cheapest” brands are Opel and Renault, with very narrow box plots and extreme values not exceeding 100,000 euros.
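
The box-plot reading can be backed by numbers: the height of a box is the interquartile range (IQR), and points are drawn as outliers when they lie more than 1.5 times the IQR beyond the box. The sketch below uses made-up prices for two placeholder brands, A and B, and base R's boxplot.stats, whose quartile definition can differ slightly from the one ggplot2 uses.

```r
# Made-up prices for two placeholder brands, not the scraped data
toy <- data.frame(
  name  = rep(c("A", "B"), each = 5),
  price = c(10000, 20000, 30000, 40000, 200000,
             9000, 10000, 11000, 12000, 13000)
)

# Box height: interquartile range per brand
iqr_by_brand <- tapply(toy$price, toy$name, IQR)
iqr_by_brand

# Points beyond 1.5 * IQR from the box (the boxplot.stats default)
outliers <- tapply(toy$price, toy$name, function(p) boxplot.stats(p)$out)
outliers
```

In this toy data, brand A has the taller box and the only outlier, which is the same pattern described for Mercedes Benz above.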

  2. The next graph shows the average fuel consumption by car brand.

ggplot(df1, aes(x = reorder(name, -fuel_cons), y = fuel_cons, fill = name)) +
  geom_bar(stat = 'identity') +
  geom_text(aes(label = round(fuel_cons, 2)), vjust = 1.5, colour = "white") +
  # The bars use the fill aesthetic, so the palette is set with scale_fill_brewer
  scale_fill_brewer(palette = "Dark2") +
  labs(x = 'Car brand', y = 'Fuel consumption (L/100km)', title = 'Fuel consumption by car brand') +
  theme_bw() +
  theme(legend.position = "none")

Mercedes Benz turned out to be not only ‘the most expensive’ but also the most fuel-consuming brand with 8.45 liters per 100 kilometers.

At the same time, ‘the cheapest’ cars turned out to be the most economical (5.37 L/100 km for Opel and 4.78 L/100 km for Renault). Moreover, the gap in fuel consumption between these two and the other brands is quite noticeable.

  3. The scatter plot below shows the relationship between a car's price and its mileage. Each point is the average price within one of 50 mileage bins.

ggplot(df, aes(x=mileage, y=price)) + 
  stat_summary_bin(fun = 'mean', bins = 50, geom = 'point', color = 'red', size = 2) +
  scale_x_continuous(labels = label_comma()) +
  labs(x = 'Mileage (km)', y = 'Price (in EUR)', title = "Prices by cars' mileage") +
  theme_bw()

As the graph shows, the price of a car decreases as its mileage increases. The steepest fall in prices occurs between 0 and 100,000 km.
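
The points in the plot are binned means: the mileage range is cut into equal-width intervals and the price is averaged within each one. The sketch below reproduces that idea in base R on made-up data with only two bins. The exact bin edges chosen by cut() can differ from those of stat_summary_bin, so this illustrates the technique and does not reproduce the plot.

```r
# Made-up data, not the scraped listings
toy <- data.frame(mileage = c(0, 10000, 40000, 60000, 90000, 100000),
                  price   = c(50000, 40000, 30000, 26000, 21000, 19000))

# Split the mileage range into equal-width bins and average the price in each
bins   <- cut(toy$mileage, breaks = 2, include.lowest = TRUE)
binned <- tapply(toy$price, bins, mean)
binned
```

Averaging within bins smooths out individual listings, which makes the overall downward trend easier to see than in a raw scatter of all 2400 cars.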

  4. Additionally, I looked at how average prices vary across two categorical variables: color and body type.

# Av price by color
df2 <- df %>% 
  group_by(Colour) %>% 
  summarise(
    price = mean(price))
# Drop row 15, which is not among the 14 named colours below (presumably the missing-colour group)
df2 <- df2[-15,]

ggplot(df2, aes(x=reorder(Colour, -price), y=price, fill=Colour)) + 
  geom_bar(stat = 'identity') +
  geom_text(aes(label = round(price)), vjust = -0.75, colour = "black") +
  scale_fill_manual("legend", values = c("Green" = "green3", "Orange" = "orange1", "Gold" = "goldenrod1",
                                         "Black" = "black", "Red" = "red", "Grey" = "gray48", "Silver" = "azure3",
                                         "Blue" = "royalblue3", "Yellow" = "yellow", "White" = "white", 
                                         "Beige" = "wheat1", "Violet" = "darkorchid3", "Brown" = "tan4",
                                         "Bronze" = "chocolate2"))+
  ylim(0, 80000) +
  labs(x = 'Car colour', y = 'Price (in EUR)', title='Average prices by colour of the cars') +
  theme_bw() +
  theme(legend.position="none")

# Av price by body type
df3 <- df %>% 
  rename(body_type = 'Body type') %>%
  group_by(body_type) %>% 
  summarise(
    price = mean(price),
    fuel_cons = mean(fuel_cons, na.rm = TRUE))

ggplot(df3, aes(x=reorder(body_type, -price), y=price, fill=body_type)) + 
  geom_bar(stat = 'identity') +
  geom_text(aes(label = round(price)), vjust = 1.5, colour = "white") +
  scale_color_brewer(palette="Dark2") +
  labs(x = 'Body type', y = 'Price (in EUR)', title = 'Average price by body type of the car') +
  theme_bw() +
  theme(legend.position="none")

Surprisingly, the highest average price, around 70,000 euros, is for green, orange, and gold cars. Brown, bronze, and violet cars, by contrast, have the lowest average price.

As for the average price by body type, the highest value (71,900 euros) is for coupes, and the lowest (16,927 euros) is for compact vehicles.
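
Group sizes are not reported here, and a group mean can rest on very few listings, which may be one reason an uncommon colour ranks at the top. The sketch below, on made-up data, shows how the number of observations could be placed next to each mean. It is an optional check and was not part of the analysis above.

```r
# Made-up data, not the scraped listings
toy <- data.frame(Colour = c("Green", "Black", "Black", "Black", "Grey", "Grey"),
                  price  = c(70000, 20000, 30000, 40000, 25000, 35000))

# Mean price and number of listings per colour
means  <- tapply(toy$price, toy$Colour, mean)
counts <- tapply(toy$price, toy$Colour, length)

by_colour <- data.frame(Colour     = names(means),
                        mean_price = as.vector(means),
                        n          = as.vector(counts))
by_colour
```

In the toy data, the most expensive colour is backed by a single car, so its average says little about that colour in general.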

Conclusion

In summary, on www.autoscout24.com the best deals are found among Opel and Renault cars, as these brands have the lowest average price and the lowest average fuel consumption. On the other hand, Mercedes Benz has the highest average value for both price and fuel consumption.

The biggest drop in prices is for cars with a mileage between 0 and 100,000 km; after that point, the decrease becomes less substantial.

The lowest average price is for brown, bronze, and violet cars and for the compact body type; the highest average price is for green, orange, and gold vehicles and for the coupe body type.