Homework 1

Task 1 - 2

I have found a dataset that contains information about cars on the German automotive market.

df <- read.table("germany_auto_industry_dataset.csv", header = TRUE, sep = ",")

Task 3

head(df)

##        Brand   Model Year Mileage Fuel.Type Fuel.Consumption..L.100km.
## 1       Audi      Q7 2006  260886    Diesel                        9.5
## 2       Opel   Corsa 2015   72505    Hybrid                        6.5
## 3   Mercedes C-Class 2007  125356  Electric                        9.9
## 4 Volkswagen    Polo 2009  130867  Electric                        4.1
## 5       Opel   Astra 2022   57482  Electric                        5.2
## 6 Volkswagen  Tiguan 2011  107269  Electric                       10.1
##   Horsepower..HP. Transmission    Price    City
## 1             213    Automatic 12063.27 Cologne
## 2             335    Automatic 33890.58  Berlin
## 3             445    Automatic 92639.12  Berlin
## 4             165    Automatic 88003.50  Munich
## 5             145       Manual 26028.97 Hamburg
## 6             449    Automatic 15308.15  Berlin

Task 4

General characteristics of the data:

sample size is \(n = 500\)
unit of observation is one car

Variables and units of measurements (if applicable):

Brand: represents the car’s brand
Model: the specific model of the brand
Year: manufacturing year
Mileage: the total kilometers each car has traveled
Fuel.Type: type of fuel the car uses (Petrol/Diesel/Electric/Hybrid)
Fuel.Consumption..L.100km.: average fuel consumption per 100km (in liters)
Horsepower..HP.: engine’s power rating (in horsepower)
Transmission: type of transmission (Manual/Automatic)
Price: price of the vehicle (in Euros)
City: location where vehicle is available

Task 5

The data was obtained from Kaggle and is available under the following URL:

https://www.kaggle.com/datasets/heidarmirhajisadati/german-vehicle-price-and-efficiency-dataset/data

Task 6

First, I will rename some of the variables because of the odd column names.

library(dplyr)
df <- rename(df,
       Fuel_Type = Fuel.Type,
       Fuel_Consumption = Fuel.Consumption..L.100km.,
       Horsepower = Horsepower..HP.)
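
The odd names most likely come from R itself: by default, read.table() passes the header through make.names(), which replaces characters that are not valid in R names with dots. The following is a minimal sketch, assuming the raw CSV headers looked like the strings below (the exact header text is an assumption and was not checked against the file).

```r
# Assumed raw CSV headers (hypothetical; not verified against the file)
raw_headers <- c("Fuel Type", "Fuel Consumption (L/100km)", "Horsepower (HP)")

# make.names() replaces every character that is invalid in an R name with "."
clean_headers <- make.names(raw_headers)
print(clean_headers)
```

Passing check.names = FALSE to read.table() would keep the original headers, at the cost of having to quote them with backticks later on.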

I am also interested in the average horsepower for hybrid vehicles in the dataset. For that, I will create a separate data frame and calculate the mean horsepower.

# first I make sure that Fuel_Type is treated as a factor
df$Fuel_Type <- as.factor(df$Fuel_Type)

# then I create the new dataframe
df_hybrid <- df[df$Fuel_Type == "Hybrid", ]

# check the data frame
head(df_hybrid)

##         Brand   Model Year Mileage Fuel_Type Fuel_Consumption Horsepower
## 2        Opel   Corsa 2015   72505    Hybrid              6.5        335
## 15       Opel   Corsa 2005  235306    Hybrid             10.0         74
## 18 Volkswagen    Golf 2005   35558    Hybrid              7.6        111
## 24   Mercedes     GLE 2018   53607    Hybrid              7.6        171
## 26       Audi      A3 2015   59165    Hybrid              6.6        157
## 29   Mercedes E-Class 2019  294184    Hybrid              7.9        110
##    Transmission    Price      City
## 2     Automatic 33890.58    Berlin
## 15    Automatic 53162.29    Munich
## 18    Automatic 21605.04 Frankfurt
## 24    Automatic 97407.64   Cologne
## 26    Automatic 40829.05   Cologne
## 29       Manual 17642.96    Munich

# calculate mean horsepower

mean(df_hybrid$Horsepower)

## [1] 271.7724
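
One caveat with logical row indexing: if the filter column contained missing values, the comparison would return NA for those rows and the subset would include all-NA rows. Whether this dataset has missing fuel types is not checked here, so the sketch below uses a small toy data frame with made-up values to show the difference between bracket indexing and subset().

```r
# Toy data (hypothetical values), with one missing fuel type
toy <- data.frame(
  Fuel_Type  = c("Hybrid", "Diesel", NA, "Hybrid"),
  Horsepower = c(100, 200, 300, 150)
)

# Logical indexing keeps an all-NA row for the missing fuel type
by_index <- toy[toy$Fuel_Type == "Hybrid", ]

# subset() drops rows where the condition evaluates to NA
by_subset <- subset(toy, Fuel_Type == "Hybrid")

mean_index  <- mean(by_index$Horsepower)   # NA because of the all-NA row
mean_subset <- mean(by_subset$Horsepower)  # mean of 100 and 150
```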

Task 7

I chose the variable Horsepower for presenting descriptive statistics. I will inspect the statistics for the whole sample, not only for the subsample of hybrid cars.

summary(df$Horsepower)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    70.0   175.0   279.5   281.1   385.0   500.0

The summary() function computes several statistics at once. From the output, we can read off that the mean horsepower in the sample is \(281.1\) and the median is \(279.5\). It also shows, for example, the third quartile, which is \(385\): \(75\%\) of the observations fall at or below \(385\), and \(25\%\) at or above it. We can also calculate the range by subtracting the minimum observation from the maximum. In our case, this leads to: \[ 500 - 70 = 430 \]

We can also double check with the min() and max() functions.

# use a name other than "range" to avoid masking the base function range()
hp_range <- max(df$Horsepower) - min(df$Horsepower)
hp_range

## [1] 430
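
The same quantities can also be obtained with quantile() and range(). This is a minimal sketch on a toy vector that simply reuses the five summary values reported above; it is not the real Horsepower column.

```r
# Toy vector built from the five summary values (not the real data)
x <- c(70, 175, 279.5, 385, 500)

# Third quartile: the value below which about 75% of the observations lie
q3 <- quantile(x, probs = 0.75)

# range() returns c(min, max); diff() turns that into max - min
toy_range <- diff(range(x))

print(q3)
print(toy_range)
```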

Task 8

Brand

As Brand is a categorical variable, we can use a barplot to inspect its distribution.

library(ggplot2)

# first make sure it is treated as a factor
df$Brand <- as.factor(df$Brand)

ggplot(df, aes(x = Brand)) + geom_bar(color = "black", fill = "forestgreen") +
    geom_text(aes(label = after_stat(count)), stat = "count", vjust = 1.3, colour = "black")

We can observe that the most frequent brand is Volkswagen with 97 instances, and the least frequent is Audi with 64.

Model

df$Model <- as.factor(df$Model)

ggplot(df, aes(x = Model, fill = Brand)) + 
      geom_bar() +
      geom_text(aes(label = after_stat(count)), stat = "count", vjust = 1.3, colour = "black") +
      theme(axis.text.x = element_text(angle = 90, hjust = 0.5))  # rotate labels

A more meaningful way to plot this is to sort the data by the most frequent model.

df_model <- df

# set the order of the most frequent model as the level of the factor
df_model$Model <- factor(df$Model, levels = names(sort(table(df$Model), decreasing = TRUE)))


ggplot(df_model, aes(x = Model, fill = Brand)) + 
      geom_bar() +
      geom_text(aes(label = after_stat(count)), stat = "count", vjust = 1.3, colour = "black") +
      theme(axis.text.x = element_text(angle = 90, hjust = 0.5))  # rotate labels

We see that the Volkswagen Tiguan is the most frequent model in our data, followed by the Porsche Cayenne. The least frequent is Audi Q5.
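
The reordering step can be seen in isolation on a small example. The sketch below uses a made-up vector of model names; only the factor() and sort(table()) pattern is taken from the code above.

```r
# Toy vector of model names (hypothetical)
models <- c("Tiguan", "Polo", "Tiguan", "Golf", "Tiguan", "Polo")

# Default factor levels are alphabetical
default_levels <- levels(factor(models))

# Count each model, sort by frequency, and use that order as the levels
freq <- sort(table(models), decreasing = TRUE)
models_sorted <- factor(models, levels = names(freq))

print(default_levels)
print(levels(models_sorted))
```

Because ggplot2 draws discrete axes in level order, changing the levels is enough to change the order of the bars.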

Year

As Year is a numeric variable, we can make a histogram and inspect the distribution.

ggplot(df, aes(x = Year)) + geom_histogram(color = "black", fill = "forestgreen", binwidth = 2)

We see that the majority of the cars were manufactured between 2005 and 2007. The distribution does not resemble a normal distribution.

Mileage

As with Year, we can make a histogram.

ggplot(df, aes(x = Mileage)) + geom_histogram(color = "black", fill = "forestgreen", bins = 25)

We see that the most frequent mileage is about \(200000\ \text{km}\). The variable does not seem to be normally distributed either.

Fuel_Type

Just like with Brand, a barplot nicely represents the fuel type.

df$Fuel_Type <- as.factor(df$Fuel_Type)

ggplot(df, aes(x = Fuel_Type)) + geom_bar(col = "black", fill = "forestgreen") +
  geom_text(aes(label = after_stat(count)), stat = "count", vjust = 1.3, colour = "black")

The most frequent fuel type is petrol, while the least frequent is diesel. The distribution of the categories is more or less balanced.

Fuel_Consumption

To inspect the variable Fuel_Consumption, we can make a boxplot.

ggplot(df, aes(x = Fuel_Consumption)) + geom_boxplot(fill = "forestgreen")

The median seems to be a bit above \(8\ \text{L}/100\ \text{km}\). There do not seem to be any outliers.
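
The absence of outliers can also be checked numerically. By default, a boxplot flags points that lie more than 1.5 times the interquartile range beyond the hinges, and boxplot.stats() applies that rule. The sketch below uses made-up consumption values, not the real column.

```r
# Toy fuel consumption values in L/100 km (hypothetical)
consumption <- c(4.1, 5.2, 6.5, 8.0, 9.5, 9.9, 10.1)

# $out holds the points beyond 1.5 * IQR from the hinges
out_clean <- boxplot.stats(consumption)$out

# Adding one extreme value makes the rule flag it
out_extreme <- boxplot.stats(c(consumption, 30))$out

print(out_clean)
print(out_extreme)
```

On the real data, boxplot.stats(df$Fuel_Consumption)$out would list the points that the plot draws as outliers.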

Horsepower

Histogram

We can inspect the distribution of horsepower with a histogram.

ggplot(df, aes(x = Horsepower)) + geom_histogram(fill = "forestgreen", color = "black", binwidth = 20)

We see that the distribution does not resemble the normal distribution at all. In fact, it seems to be multimodal.
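
A visual impression of non-normality can be backed up with a formal test. The sketch below applies the Shapiro-Wilk test to simulated data; the uniform sample is only a stand-in chosen for illustration and says nothing about the real Horsepower values.

```r
# Simulated stand-in data: 500 uniform draws between 70 and 500
set.seed(42)
sim_hp <- runif(500, min = 70, max = 500)

# Shapiro-Wilk test: the null hypothesis is that the sample is normal
sw <- shapiro.test(sim_hp)

# A small p-value speaks against normality
print(sw$p.value)
```

The same call on the real column would be shapiro.test(df$Horsepower).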

Scatterplot

We can also plot the horsepower with relation to price.

ggplot(df, aes(x = Price, y = Horsepower)) + geom_point()

There does not seem to be a clear relationship between horsepower and price. This is surprising, and it suggests that other factors have more influence on the price. We can check the correlation in R:

cor(df$Price, df$Horsepower)

## [1] -0.01994441

The correlation is even slightly negative, but the value is very close to 0, so there is practically no linear relationship.
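
By default, cor() returns the Pearson correlation coefficient, i.e. the covariance of the two variables scaled by their standard deviations. The sketch below computes it by hand on two short made-up vectors and compares the result with cor().

```r
# Toy vectors (hypothetical values)
x <- c(1, 2, 3, 4, 5)
y <- c(2, 1, 4, 3, 5)

# Pearson's r: sum of cross-deviations over the root of the
# product of the two sums of squared deviations
r_manual <- sum((x - mean(x)) * (y - mean(y))) /
  sqrt(sum((x - mean(x))^2) * sum((y - mean(y))^2))

r_builtin <- cor(x, y)

print(c(manual = r_manual, builtin = r_builtin))
```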

Boxplot

We can inspect the distribution of horsepower with respect to the fuel type.

ggplot(df, aes(x = Fuel_Type, y = Horsepower)) + geom_boxplot()

We see that the median horsepower is highest for diesel cars in our sample. However, the medians are rather close to each other, so one would need to conduct further statistical analysis to draw inferences. Nevertheless, we can check the medians of the four groups, among other statistics, with the describeBy() function.

library(psych)
describeBy(df$Horsepower, group = df$Fuel_Type )

## 
##  Descriptive statistics by group 
## group: Diesel
##    vars   n   mean     sd median trimmed    mad min max range  skew kurtosis
## X1    1 113 284.06 121.45    295  287.25 142.33  70 489   419 -0.19    -1.11
##       se
## X1 11.42
## ------------------------------------------------------------ 
## group: Electric
##    vars   n   mean    sd median trimmed    mad min max range skew kurtosis   se
## X1    1 129 277.27 126.1    270  276.34 157.16  79 500   421 0.04    -1.28 11.1
## ------------------------------------------------------------ 
## group: Hybrid
##    vars   n   mean     sd median trimmed    mad min max range skew kurtosis
## X1    1 123 271.77 122.59    271  269.23 160.12  70 500   430 0.12    -1.17
##       se
## X1 11.05
## ------------------------------------------------------------ 
## group: Petrol
##    vars   n   mean     sd median trimmed   mad min max range  skew kurtosis
## X1    1 135 290.82 115.23    281  292.43 136.4  74 491   417 -0.05    -1.11
##      se
## X1 9.92

We can see that the median is \(295\) for diesel, \(270\) and \(271\) for electric and hybrid respectively, and \(281\) for petrol cars.
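
If only the group medians are needed, base R's tapply() is enough and avoids the extra package. This is a minimal sketch on a toy data frame with made-up values.

```r
# Toy data (hypothetical values)
toy <- data.frame(
  Fuel_Type  = c("Diesel", "Diesel", "Diesel", "Petrol", "Petrol", "Petrol"),
  Horsepower = c(100, 300, 200, 150, 250, 400)
)

# Apply median() to Horsepower within each fuel type
medians <- tapply(toy$Horsepower, toy$Fuel_Type, median)

print(medians)
```

On the real data, the equivalent call would be tapply(df$Horsepower, df$Fuel_Type, median).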

Transmission

We can look at a barplot to find out the distribution of Transmission.

ggplot(df, aes(x = Transmission)) + geom_bar(col = "black", fill = "forestgreen") +
  geom_text(aes(label = after_stat(count)), stat = "count", vjust = 1.3, colour = "black")

We see that there are 20 more vehicles with manual transmission in our dataset, but the distribution is again more or less balanced.

Price

Probably the main variable of interest, Price, can be nicely inspected by a histogram.

ggplot(df, aes(x = Price)) + geom_histogram(fill = "forestgreen", color = "black", bins = 20)

It does not resemble a normal distribution either. Instead, it looks multimodal, with peaks at roughly \(€25000\), \(€62000\) and \(€95000\).

City

Finally, let us look at the variable City. First, we can make a table and see how many observations we have in each category.

table(df$City)

## 
##    Berlin   Cologne Frankfurt   Hamburg    Munich 
##        99       100       108        99        94

The distribution seems to be rather balanced. Nevertheless, we can make a barplot to illustrate it.

ggplot(df, aes(x = City)) + geom_bar(col = "black", fill = "forestgreen") +
  geom_text(aes(label = after_stat(count)), stat = "count", vjust = 1.3, colour = "black")
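
To put a number on how balanced the cities are, the counts can be converted to percentage shares. The sketch below retypes the counts from the table above by hand rather than reading them from df.

```r
# Counts copied from the table(df$City) output above
counts <- c(Berlin = 99, Cologne = 100, Frankfurt = 108, Hamburg = 99, Munich = 94)

# Share of each city in percent, rounded to one decimal place
shares <- round(100 * counts / sum(counts), 1)

print(sum(counts))
print(shares)
```

Each city accounts for roughly a fifth of the sample. On the data frame itself, prop.table(table(df$City)) gives the same proportions.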