I have found a dataset that contains information about cars on the German automotive market.
df <- read.table("germany_auto_industry_dataset.csv", header = TRUE, sep = ",")
I check the number of observations in my data.
nrow(df)
## [1] 500
As 500 observations are a bit too many for hypothesis testing, I draw a random subsample of 250 rows without replacement. I also set the seed for reproducibility.
library(dplyr)
set.seed(999)
df <- df %>% dplyr::sample_n(size = 250, replace = FALSE)
# check rows
nrow(df)
## [1] 250
head(df)
## Brand Model Year Mileage Fuel.Type Fuel.Consumption..L.100km.
## 1 BMW 5 Series 2018 261788 Diesel 9.6
## 2 BMW 3 Series 2020 54423 Petrol 4.0
## 3 Porsche Panamera 2021 279748 Hybrid 9.7
## 4 Porsche Panamera 2015 203688 Petrol 6.9
## 5 Volkswagen Polo 2017 74333 Electric 11.7
## 6 Opel Insignia 2014 70986 Petrol 4.9
## Horsepower..HP. Transmission Price City
## 1 240 Automatic 54091.03 Cologne
## 2 133 Automatic 93746.84 Munich
## 3 167 Manual 14451.87 Munich
## 4 134 Automatic 23729.79 Frankfurt
## 5 180 Automatic 54905.58 Munich
## 6 102 Manual 81634.86 Frankfurt
General characteristics of the data:
sample size is \(n = 500\) originally, now reduced to \(250\)
unit of observation is one car
Variables and units of measurements (if applicable):
Brand: represents the car’s brand
Model: the specific model of the brand
Year: manufacturing year
Mileage: the total kilometers each car has traveled
Fuel.Type: type of fuel the car uses (Petrol/Diesel/Electric/Hybrid)
Fuel.Consumption..L.100km.: average fuel consumption per 100km (in liters)
Horsepower..HP.: engine’s power rating (in horsepower)
Transmission: type of transmission (Manual/Automatic)
Price: price of the vehicle (in Euros)
City: location where vehicle is available
The data was obtained from Kaggle and is available under the following URL:
https://www.kaggle.com/datasets/heidarmirhajisadati/german-vehicle-price-and-efficiency-dataset/data
First, I will rename some of the variables because of the odd column names.
# use rename function from dplyr
df <- rename(df,
Fuel_Type = Fuel.Type,
Fuel_Consumption = Fuel.Consumption..L.100km.,
Horsepower = Horsepower..HP.)
Then I will make sure that the categorical variables I will use are treated as factors. For that, I will look at the structure of the data frame.
str(df)
## 'data.frame': 250 obs. of 10 variables:
## $ Brand : chr "BMW" "BMW" "Porsche" "Porsche" ...
## $ Model : chr "5 Series" "3 Series" "Panamera" "Panamera" ...
## $ Year : int 2018 2020 2021 2015 2017 2014 2007 2018 2022 2018 ...
## $ Mileage : int 261788 54423 279748 203688 74333 70986 203660 258142 25732 63602 ...
## $ Fuel_Type : chr "Diesel" "Petrol" "Hybrid" "Petrol" ...
## $ Fuel_Consumption: num 9.6 4 9.7 6.9 11.7 4.9 6.1 8.3 10.5 5.8 ...
## $ Horsepower : int 240 133 167 134 180 102 352 336 270 230 ...
## $ Transmission : chr "Automatic" "Automatic" "Manual" "Automatic" ...
## $ Price : num 54091 93747 14452 23730 54906 ...
## $ City : chr "Cologne" "Munich" "Munich" "Frankfurt" ...
Then I will convert these variables into factors.
df$Brand <- factor(df$Brand)
df$Model <- factor(df$Model)
df$Fuel_Type <- factor(df$Fuel_Type)
# set the levels for transmission so manual is the baseline
df$Transmission <- factor(df$Transmission, levels = c("Manual", "Automatic"))
df$City <- factor(df$City)
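The effect of the levels argument can be seen on a small made-up vector (the values below are hypothetical and not taken from the dataset). By default, factor() orders the levels alphabetically, so Automatic would come first. Listing Manual first makes it the reference group, which is presumably also why the t-test further below reports the difference as Manual minus Automatic.

```r
# Toy vector for illustration only (not the real data)
x <- c("Automatic", "Manual", "Manual", "Automatic")

# Default: levels are sorted alphabetically, so "Automatic" is first
default_levels <- levels(factor(x))

# Explicit order: "Manual" becomes the first (baseline) level
custom_levels <- levels(factor(x, levels = c("Manual", "Automatic")))

default_levels
custom_levels
```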
I also check whether there are any missing values in the data.
# returns the number of NAs
sum(is.na(df))
## [1] 0
As there are none, we can continue with the analysis.
I chose the variable Price for presenting descriptive statistics.
library(psych)
describe(df$Price)
## vars n mean sd median trimmed mad min max
## X1 1 250 49132.99 26978.34 49039.22 48630.48 34153.84 5308.85 99981.48
## range skew kurtosis se
## X1 94672.63 0.11 -1.2 1706.26
The output shows that the mean price in the sample is about \(€49133\). The median is \(€49039.22\), which means that half of the prices in the data are lower than or equal to \(€49039.22\), while the other half are higher. The output also reports the minimum and maximum values of Price, \(€5308.85\) and \(€99981.48\) respectively. Subtracting the minimum from the maximum gives the range, \(€94672.63\), which is also shown in the output.
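As a quick check, two of the reported values can be recomputed by hand from the others. The standard error formula used here, \(se = sd / \sqrt{n}\), is the usual standard error of the mean and is assumed to be what describe() reports:
\[ \text{range} = 99981.48 - 5308.85 = 94672.63 \] \[ se = \frac{sd}{\sqrt{n}} = \frac{26978.34}{\sqrt{250}} \approx 1706.26 \]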
I have formulated the following research question from my data:
Is there a difference between the price of cars with manual transmission and automatic transmission?
As the assignment requires both parametric and non-parametric tests to be performed, I will do the testing first. Then, I will check whether the assumptions for the parametric test hold, and argue which test is more suitable.
For the parametric test, it makes sense to compare the arithmetic means of the two groups. Formally, this translates to:
\[ H_0: \mu_{manual} = \mu_{automatic} \] \[ H_1: \mu_{manual} \neq \mu_{automatic} \]
As my two samples, cars with manual and automatic transmissions, come from two different populations, they are independent. A car cannot have manual and automatic transmission at the same time, so there can be no overlap between the populations.
Thus, I will perform an independent samples t-test.
However, first I will explore the variances of the variable Price within the two groups. The classical (Student's) t-test requires equal variances. If this assumption is violated, the Welch correction can be applied instead.
describeBy(df$Price, group = df$Transmission)
##
## Descriptive statistics by group
## group: Manual
## vars n mean sd median trimmed mad min max
## X1 1 142 49784.16 26677.55 54528.65 49568.67 34299.25 5308.85 99981.48
## range skew kurtosis se
## X1 94672.63 0.02 -1.2 2238.73
## ------------------------------------------------------------
## group: Automatic
## vars n mean sd median trimmed mad min max
## X1 1 108 48276.83 27469.95 44070.04 47446.25 33617.34 5560.13 99797.19
## range skew kurtosis se
## X1 94237.06 0.23 -1.21 2643.3
We can see that the standard deviations are quite close to each other. In order to be sure, we can perform a Levene test. In the context of this test, \(H_0\) is that the variances in the two groups are equal, \(H_1\) is that they are different.
library(car)
leveneTest(df$Price, group = df$Transmission)
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 1 0.1258 0.7231
## 248
As the \(p\) value is very high (\(p = 0.7231\)), \(H_0\) cannot be rejected. Thus, I assume the variances are equal.
I proceed with the t-test.
t.test(df$Price ~ df$Transmission,
var.equal = TRUE,
alternative = "two.sided")
##
## Two Sample t-test
##
## data: df$Price by df$Transmission
## t = 0.43689, df = 248, p-value = 0.6626
## alternative hypothesis: true difference in means between group Manual and group Automatic is not equal to 0
## 95 percent confidence interval:
## -5287.973 8302.626
## sample estimates:
## mean in group Manual mean in group Automatic
## 49784.16 48276.83
As the \(p\) value (\(0.6626\)) is well above \(0.05\), we cannot reject \(H_0\).
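To make the t-test output less of a black box, the statistic can be rebuilt from the group summaries alone. The sketch below is only a hand calculation under the equal-variance assumption, using the rounded means, standard deviations and group sizes from the describeBy() output. The variable names are my own, and the results should land close to, but not exactly on, the values printed by t.test(). The last lines show the Welch version for comparison.

```r
# Summary statistics copied from the describeBy() output (rounded values)
n1 <- 142; m1 <- 49784.16; s1 <- 26677.55   # Manual
n2 <- 108; m2 <- 48276.83; s2 <- 27469.95   # Automatic

# Pooled variance and standard error of the difference in means
sp2 <- ((n1 - 1) * s1^2 + (n2 - 1) * s2^2) / (n1 + n2 - 2)
se_pooled <- sqrt(sp2 * (1 / n1 + 1 / n2))

# Test statistic, degrees of freedom and two-sided p value
t_stat <- (m1 - m2) / se_pooled
df_t   <- n1 + n2 - 2
p_val  <- 2 * pt(-abs(t_stat), df = df_t)

# 95% confidence interval for the difference (Manual minus Automatic)
ci <- (m1 - m2) + c(-1, 1) * qt(0.975, df = df_t) * se_pooled

# Welch version: no pooling, each group keeps its own variance
se_welch <- sqrt(s1^2 / n1 + s2^2 / n2)
t_welch  <- (m1 - m2) / se_welch

round(c(t = t_stat, df = df_t, p = p_val, t_welch = t_welch), 4)
round(ci, 1)
```

Because the two standard deviations are so similar, the pooled and the Welch statistic barely differ here.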
Among the non-parametric tests, the suitable one in this case is the Wilcoxon Rank Sum Test. However, arithmetic means cannot be compared with this test. Instead, we compare the distribution locations. To state the hypotheses:
\(H_0\): The distribution location of Price is the same for manual and automatic transmission
\(H_1\): The distribution location of Price is not the same for manual and automatic transmission
Performing the test, I get:
wilcox.test(df$Price ~ df$Transmission,
correct = FALSE,
exact = FALSE,
alternative = "two.sided")
##
## Wilcoxon rank sum test
##
## data: df$Price by df$Transmission
## W = 7893, p-value = 0.6912
## alternative hypothesis: true location shift is not equal to 0
Again, the \(p\) value (\(0.6912\)) is a lot higher than \(0.05\), so \(H_0\) cannot be rejected.
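Since the test was run with exact = FALSE and correct = FALSE, the \(p\) value comes from a plain normal approximation, which can be reproduced by hand. The sketch below assumes there are no tied prices (so no tie correction is needed) and takes \(W\) and the group sizes from the outputs above. The variable names are my own.

```r
W  <- 7893   # statistic reported by wilcox.test()
n1 <- 142    # number of Manual cars
n2 <- 108    # number of Automatic cars

# Mean and standard deviation of W under H0 (no tie correction)
mu_W <- n1 * n2 / 2
sd_W <- sqrt(n1 * n2 * (n1 + n2 + 1) / 12)

# z score without continuity correction and two-sided p value
z     <- (W - mu_W) / sd_W
p_val <- 2 * pnorm(-abs(z))

round(c(z = z, p = p_val), 4)
```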
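For an independent samples t-test, the following assumptions have to be met: the variable is numeric, the variances in the two groups are equal, and the variable is normally distributed in both populations.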
As Price is clearly a numeric variable measured in Euro, the first assumption is met.
I have already checked the assumption for equal variances above, and it holds. Even if it did not, Welch correction could be applied and the t-test would still be suitable.
I need to check whether Price is normally distributed within the populations of manual and automatic transmission.
# make a plot
library(ggplot2)
ggplot(df, aes(x = Price)) + geom_histogram(color = "black", bins = 15) +
facet_wrap(~Transmission, nrow = 1)
Based on the plots, the distribution of Price does not appear to be normal in either group.
# qq plot
library(ggpubr)
ggqqplot(df, "Price", facet.by = "Transmission")
Similarly to the histograms, the quantile-quantile plots suggest a serious violation of the normality assumption.
In order to formally test the normality, the Shapiro-Wilk test can be conducted. Under this test, the null hypothesis is that the variable is normally distributed.
library(rstatix)
df %>% group_by(Transmission) %>% shapiro_test(Price)
## # A tibble: 2 × 4
## Transmission variable statistic p
## <fct> <chr> <dbl> <dbl>
## 1 Manual Price 0.954 0.000116
## 2 Automatic Price 0.946 0.000259
It can be concluded that \(H_0\) is rejected at \(p < 0.001\) in both groups. As a result, the non-parametric Wilcoxon Rank Sum Test is the one that should be used.
To find out the effect size in the Wilcoxon Rank Sum Test, the rank-biserial correlation can be calculated.
library(effectsize)
effectsize(wilcox.test(df$Price ~ df$Transmission,
correct = FALSE,
exact = FALSE,
alternative = "two.sided"))
## r (rank biserial) | 95% CI
## ---------------------------------
## 0.03 | [-0.11, 0.17]
interpret_rank_biserial(0.03)
## [1] "tiny"
## (Rules: funder2019)
The effect size is tiny, and its 95% confidence interval \([-0.11, 0.17]\) includes zero, which is consistent with the null hypothesis not being rejected.
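The reported value can also be checked by hand. Assuming the usual relationship between the rank-biserial correlation and the \(W\) statistic from wilcox.test() (which is the Mann-Whitney \(U\) of the first group), and using the group sizes \(n_{manual} = 142\) and \(n_{automatic} = 108\) from the descriptive statistics:
\[ r = \frac{2W}{n_{manual} \, n_{automatic}} - 1 = \frac{2 \cdot 7893}{142 \cdot 108} - 1 = \frac{15786}{15336} - 1 \approx 0.03 \]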
To answer the research question, based on the sample data, we did not find any significant differences in the distribution locations of Price for manual and automatic transmission cars.