I have found a dataset that contains information about cars on the German automotive market.
df <- read.table("germany_auto_industry_dataset.csv", header = TRUE, sep = ",")
head(df)
## Brand Model Year Mileage Fuel.Type Fuel.Consumption..L.100km.
## 1 Audi Q7 2006 260886 Diesel 9.5
## 2 Opel Corsa 2015 72505 Hybrid 6.5
## 3 Mercedes C-Class 2007 125356 Electric 9.9
## 4 Volkswagen Polo 2009 130867 Electric 4.1
## 5 Opel Astra 2022 57482 Electric 5.2
## 6 Volkswagen Tiguan 2011 107269 Electric 10.1
## Horsepower..HP. Transmission Price City
## 1 213 Automatic 12063.27 Cologne
## 2 335 Automatic 33890.58 Berlin
## 3 445 Automatic 92639.12 Berlin
## 4 165 Automatic 88003.50 Munich
## 5 145 Manual 26028.97 Hamburg
## 6 449 Automatic 15308.15 Berlin
General characteristics of the data:
sample size is \(n = 500\)
unit of observation is one car
Variables and units of measurements (if applicable):
Brand: represents the car’s brand
Model: the specific model of the brand
Year: manufacturing year
Mileage: the total kilometers each car has traveled
Fuel.Type: type of fuel the car uses (Petrol/Diesel/Electric/Hybrid)
Fuel.Consumption..L.100km.: average fuel consumption per 100km (in liters)
Horsepower..HP.: engine’s power rating (in horsepower)
Transmission: type of transmission (Manual/Automatic)
Price: price of the vehicle (in Euros)
City: location where vehicle is available
The data was obtained from Kaggle and is available under the following URL:
https://www.kaggle.com/datasets/heidarmirhajisadati/german-vehicle-price-and-efficiency-dataset/data
First, I will rename some of the variables because of the odd column names.
library(dplyr)
df <- rename(df,
Fuel_Type = Fuel.Type,
Fuel_Consumption = Fuel.Consumption..L.100km.,
Horsepower = Horsepower..HP.)
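The dotted names most likely come from R itself: read.table() by default (check.names = TRUE) passes the column headers through make.names(), which replaces characters that are not valid in R names with dots. A minimal sketch, assuming the raw CSV headers were spelled as below (the raw header strings are not shown in this report, so they are a guess):

```r
# Assumed raw CSV headers (hypothetical; the file itself is not shown here)
raw_headers <- c("Fuel Type", "Fuel Consumption (L/100km)", "Horsepower (HP)")

# make.names() is what read.table() applies when check.names = TRUE
syntactic <- make.names(raw_headers)
print(syntactic)
```

If the assumption about the headers holds, the printed names match the ones that are renamed above.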
I am also interested in the average horsepower of hybrid vehicles in the dataset. For that, I will create a separate data frame and calculate the mean horsepower.
# first I make sure that Fuel_Type is treated as a factor
df$Fuel_Type <- as.factor(df$Fuel_Type)
# then I create the new data frame
df_hybrid <- df[df$Fuel_Type == "Hybrid", ]
# check the data frame
head(df_hybrid)
## Brand Model Year Mileage Fuel_Type Fuel_Consumption Horsepower
## 2 Opel Corsa 2015 72505 Hybrid 6.5 335
## 15 Opel Corsa 2005 235306 Hybrid 10.0 74
## 18 Volkswagen Golf 2005 35558 Hybrid 7.6 111
## 24 Mercedes GLE 2018 53607 Hybrid 7.6 171
## 26 Audi A3 2015 59165 Hybrid 6.6 157
## 29 Mercedes E-Class 2019 294184 Hybrid 7.9 110
## Transmission Price City
## 2 Automatic 33890.58 Berlin
## 15 Automatic 53162.29 Munich
## 18 Automatic 21605.04 Frankfurt
## 24 Automatic 97407.64 Cologne
## 26 Automatic 40829.05 Cologne
## 29 Manual 17642.96 Munich
# calculate mean horsepower
mean(df_hybrid$Horsepower)
## [1] 271.7724
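The group mean can also be obtained without a separate data frame, by computing the mean for every fuel type in one call. A minimal sketch with tapply(), using only the six rows printed by head(df) above as a toy data frame (the names toy and hp_means are introduced here for illustration):

```r
# Toy data: the six rows shown by head(df), reduced to two columns
toy <- data.frame(
  Fuel_Type  = c("Diesel", "Hybrid", "Electric", "Electric", "Electric", "Electric"),
  Horsepower = c(213, 335, 445, 165, 145, 449)
)

# Mean horsepower for every fuel type at once
hp_means <- tapply(toy$Horsepower, toy$Fuel_Type, mean)
print(hp_means)

# Pick out a single group by name
hp_means[["Hybrid"]]
```

Applied to the full df instead of toy, the "Hybrid" entry should reproduce the mean calculated above.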
I chose the variable Horsepower for presenting descriptive statistics. I will inspect the statistics for the whole sample, not only for the subsample of hybrid cars.
summary(df$Horsepower)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 70.0 175.0 279.5 281.1 385.0 500.0
The summary() function computes several statistics. From the output, we can see that the mean horsepower in the sample is \(281.1\) and the median is \(279.5\). It also shows, for example, the third quartile, which is \(385\). The interpretation is that \(75\%\) of the observations fall at or below \(385\), and \(25\%\) at or above. We can also calculate the range by subtracting the minimum observation from the maximum. In our case, this gives: \[ 500 - 70 = 430 \]
We can also double-check this with the min() and max() functions.
# use a name that does not mask base R's range() function
hp_range <- max(df$Horsepower) - min(df$Horsepower)
hp_range
## [1] 430
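The quartile interpretation and the range calculation can be checked directly in R. The following is a minimal sketch on a toy vector, namely the Horsepower values of the six rows printed by head(df) above, so the numbers differ from those of the full sample (hp, q3, share_below and hp_spread are illustrative names):

```r
# Toy vector: the Horsepower values of the six rows shown by head(df)
hp <- c(213, 335, 445, 165, 145, 449)

# Third quartile (R's default quantile definition, type 7)
q3 <- quantile(hp, probs = 0.75)
print(q3)

# Share of observations at or below the third quartile
share_below <- mean(hp <= q3)
print(share_below)

# range() returns c(min, max); diff() turns that into max - min
hp_spread <- diff(range(hp))
print(hp_spread)
```

With only six values the share at or below the third quartile is not exactly \(75\%\); in larger samples it comes close to it.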
As Brand is a categorical variable, we can use a barplot to inspect its distribution.
library(ggplot2)
# first make sure it is treated as a factor
df$Brand <- as.factor(df$Brand)
ggplot(df, aes(x = Brand)) + geom_bar(color = "black", fill = "forestgreen") +
geom_text(aes(label = after_stat(count)), stat = "count", vjust = 1.3, colour = "black")
We can observe that the most frequent brand is Volkswagen with 97 instances, and the least frequent is Audi with 64.
df$Model <- as.factor(df$Model)
ggplot(df, aes(x = Model, fill = Brand)) +
geom_bar() +
geom_text(aes(label = after_stat(count)), stat = "count", vjust = 1.3, colour = "black") +
theme(axis.text.x = element_text(angle = 90, hjust = 0.5)) # rotate labels
A more meaningful way to plot this is to sort the data by the most frequent model.
df_model <- df
# set the order of the most frequent model as the level of the factor
df_model$Model <- factor(df$Model, levels = names(sort(table(df$Model), decreasing = TRUE)))
ggplot(df_model, aes(x = Model, fill = Brand)) +
geom_bar() +
geom_text(aes(label = after_stat(count)), stat = "count", vjust = 1.3, colour = "black") +
theme(axis.text.x = element_text(angle = 90, hjust = 0.5)) # rotate labels
We see that the Volkswagen Tiguan is the most frequent model in our data, followed by the Porsche Cayenne. The least frequent is Audi Q5.
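The reordering step relies on the fact that ggplot2 draws bars in the order of the factor levels. A minimal sketch of what the sort(table(...)) expression does, on a small made-up vector (the counts are hypothetical and only serve as illustration):

```r
# Toy vector of model names (hypothetical counts, for illustration only)
models <- c("Tiguan", "Corsa", "Tiguan", "Polo", "Corsa", "Tiguan")

# Count each model, sort the counts in decreasing order, keep the names
freq_order <- names(sort(table(models), decreasing = TRUE))
print(freq_order)

# Rebuild the factor with levels in that order instead of alphabetical order
models_sorted <- factor(models, levels = freq_order)
print(levels(models_sorted))
```

Without the explicit levels argument, factor() would order the levels alphabetically, which is why the first plot is not sorted by frequency.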
As Year is a numeric variable, we can make a histogram and inspect the distribution.
ggplot(df, aes(x = Year)) + geom_histogram(color = "black", fill = "forestgreen", binwidth = 2)
We see that the largest number of cars were manufactured between 2005 and 2007. The distribution does not resemble a normal distribution.
As with Year, we can make a histogram for Mileage.
ggplot(df, aes(x = Mileage)) + geom_histogram(color = "black", fill = "forestgreen", bins = 25)
We see that the most frequent mileage is about \(200{,}000\ \text{km}\). The variable does not seem to be normally distributed either.
Just like with Brand, a barplot nicely represents the fuel type.
df$Fuel_Type <- as.factor(df$Fuel_Type)
ggplot(df, aes(x = Fuel_Type)) + geom_bar(col = "black", fill = "forestgreen") +
geom_text(aes(label = after_stat(count)), stat = "count", vjust = 1.3, colour = "black")
The most frequent fuel type is petrol, while the least frequent is diesel. The distribution of the categories is more or less balanced.
To inspect the variable Fuel_Consumption, we can make a boxplot.
ggplot(df, aes(x = Fuel_Consumption)) + geom_boxplot(fill = "forestgreen")
The median seems to be a bit above \(8\ \text{L}/100\ \text{km}\). There do not seem to be any outliers.
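The absence of outliers can also be checked numerically. geom_boxplot() draws as separate points all values that lie more than \(1.5\) times the interquartile range beyond the quartiles. A minimal sketch of that rule, applied to the fuel consumption values of the six rows printed by head(df) above (a toy vector, not the full sample; all variable names are illustrative):

```r
# Toy vector: the fuel consumption values of the six rows shown by head(df)
fc <- c(9.5, 6.5, 9.9, 4.1, 5.2, 10.1)

# Quartiles and interquartile range
q <- quantile(fc, probs = c(0.25, 0.75))
iqr <- q[[2]] - q[[1]]

# Fences: anything outside them would be drawn as a separate point
lower_fence <- q[[1]] - 1.5 * iqr
upper_fence <- q[[2]] + 1.5 * iqr

outliers <- fc[fc < lower_fence | fc > upper_fence]
print(c(lower = lower_fence, upper = upper_fence))
print(outliers)
```

Replacing fc with df$Fuel_Consumption would apply the same check to the whole sample.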
We can inspect the distribution of horsepower with a histogram.
ggplot(df, aes(x = Horsepower)) + geom_histogram(fill = "forestgreen", color = "black", binwidth = 20)
We see that the distribution does not resemble the normal distribution at all. In fact, it seems to be multimodal.
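A visual impression like this could be backed up with a formal test. The sketch below runs the Shapiro-Wilk test from base R on simulated data, which is only a stand-in for a flat, non-normal variable and is not the real Horsepower column (hp_sim, sw and the uniform distribution are assumptions made for the illustration):

```r
# Simulated stand-in for a flat, non-normal variable (NOT the real data)
set.seed(1)
hp_sim <- runif(500, min = 70, max = 500)

# Shapiro-Wilk test: the null hypothesis is that the data are normally distributed
sw <- shapiro.test(hp_sim)
print(sw)

# A small p-value speaks against normality
sw$p.value < 0.05
```

For the actual variable, one would call shapiro.test(df$Horsepower) and interpret the p-value in the same way.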
We can also plot the horsepower with relation to price.
ggplot(df, aes(x = Price, y = Horsepower)) + geom_point()
There does not seem to be a clear relationship between horsepower and price. This is surprising, but it suggests that other factors have more influence on the price. We can check the correlation in R:
cor(df$Price, df$Horsepower)
## [1] -0.01994441
We even see a negative correlation, but the value is very close to 0.
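By default, cor() returns the Pearson correlation coefficient, that is, the covariance of the two variables divided by the product of their standard deviations. A minimal sketch that computes it by hand and compares it with cor(), using Price and Horsepower of the six rows printed by head(df) above (a toy sample, so the value differs from the one for the full data):

```r
# Toy data: Price and Horsepower of the six rows shown by head(df)
price <- c(12063.27, 33890.58, 92639.12, 88003.50, 26028.97, 15308.15)
hp    <- c(213, 335, 445, 165, 145, 449)

# Pearson correlation: covariance divided by the product of standard deviations
r_manual <- cov(price, hp) / (sd(price) * sd(hp))

# Built-in equivalent
r_builtin <- cor(price, hp)

print(c(manual = r_manual, builtin = r_builtin))
```

Both values agree up to floating-point precision.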
We can inspect the distribution of horsepower with respect to the fuel type.
ggplot(df, aes(x = Fuel_Type, y = Horsepower)) + geom_boxplot()
We see that the median horsepower is the highest for diesel cars in our sample. However, the medians are rather close to each other, so one would need to conduct further statistical analysis to draw inferences. Nevertheless, we can check the medians of the four groups, among other statistics, with the describeBy() function.
library(psych)
describeBy(df$Horsepower, group = df$Fuel_Type )
##
## Descriptive statistics by group
## group: Diesel
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 113 284.06 121.45 295 287.25 142.33 70 489 419 -0.19 -1.11
## se
## X1 11.42
## ------------------------------------------------------------
## group: Electric
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 129 277.27 126.1 270 276.34 157.16 79 500 421 0.04 -1.28 11.1
## ------------------------------------------------------------
## group: Hybrid
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 123 271.77 122.59 271 269.23 160.12 70 500 430 0.12 -1.17
## se
## X1 11.05
## ------------------------------------------------------------
## group: Petrol
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 135 290.82 115.23 281 292.43 136.4 74 491 417 -0.05 -1.11
## se
## X1 9.92
We can see that the median is \(295\) for diesel, \(270\) for electric, \(271\) for hybrid, and \(281\) for petrol cars.
We can look at a barplot to find out the distribution of Transmission.
ggplot(df, aes(x = Transmission)) + geom_bar(col = "black", fill = "forestgreen") +
geom_text(aes(label = after_stat(count)), stat = "count", vjust = 1.3, colour = "black")
We see that there are 20 more vehicles with manual than with automatic transmission in our dataset, but the distribution is again more or less balanced.
Probably the main variable of interest, Price, can be nicely inspected by a histogram.
ggplot(df, aes(x = Price)) + geom_histogram(fill = "forestgreen", color = "black", bins = 20)
It does not resemble a normal distribution either, but rather a multimodal one, with peaks at about \(€25000\), \(€62000\) and \(€95000\).
Finally, let us look at the variable city. First, we can make a table and see how many observations we have in each category.
table(df$City)
##
## Berlin Cologne Frankfurt Hamburg Munich
## 99 100 108 99 94
The distribution seems to be rather balanced. Nevertheless, we can make a barplot to illustrate it.
ggplot(df, aes(x = City)) + geom_bar(col = "black", fill = "forestgreen") +
geom_text(aes(label = after_stat(count)), stat = "count", vjust = 1.3, colour = "black")
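How balanced the cities are is easier to judge from shares than from counts. A minimal sketch that converts the counts from the table(df$City) output above into percentages with prop.table() (city_counts and city_share are illustrative names):

```r
# City counts copied from the table(df$City) output above
city_counts <- c(Berlin = 99, Cologne = 100, Frankfurt = 108, Hamburg = 99, Munich = 94)

# Convert counts to percentages of the whole sample
city_share <- round(100 * prop.table(city_counts), 1)
print(city_share)
```

On the full data frame, the same result should follow from prop.table(table(df$City)).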