Homework 2

Task 1 - 2

I have found a dataset that contains information about cars on the German automotive market.

df <- read.table("germany_auto_industry_dataset.csv", header = TRUE, sep = ",")

I check the number of observations in my data.

nrow(df)

## [1] 500

As 500 observations are more than I need for hypothesis testing, I draw a random subsample of 250 rows without replacement. I also set the seed for reproducibility.

library(dplyr)
set.seed(999)
df <- df %>% dplyr::sample_n(size = 250, replace = FALSE)

# check rows
nrow(df)

## [1] 250
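
The role of the seed can be illustrated with a minimal base-R sketch. It uses a hypothetical toy data frame (`toy`) instead of the actual dataset and `sample()` instead of `dplyr::sample_n()`; the point is only that re-setting the same seed reproduces the same subsample.

```r
# Toy data frame standing in for the real dataset (hypothetical values)
toy <- data.frame(id = 1:20, value = (1:20)^2)

# Draw 5 rows without replacement after fixing the seed
set.seed(999)
first_draw <- toy[sample(nrow(toy), size = 5, replace = FALSE), ]

# Re-setting the same seed reproduces exactly the same rows
set.seed(999)
second_draw <- toy[sample(nrow(toy), size = 5, replace = FALSE), ]

identical(first_draw, second_draw)
```

Without the second `set.seed()` call, the two draws would generally differ.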

Task 3

head(df)

##        Brand    Model Year Mileage Fuel.Type Fuel.Consumption..L.100km.
## 1        BMW 5 Series 2018  261788    Diesel                        9.6
## 2        BMW 3 Series 2020   54423    Petrol                        4.0
## 3    Porsche Panamera 2021  279748    Hybrid                        9.7
## 4    Porsche Panamera 2015  203688    Petrol                        6.9
## 5 Volkswagen     Polo 2017   74333  Electric                       11.7
## 6       Opel Insignia 2014   70986    Petrol                        4.9
##   Horsepower..HP. Transmission    Price      City
## 1             240    Automatic 54091.03   Cologne
## 2             133    Automatic 93746.84    Munich
## 3             167       Manual 14451.87    Munich
## 4             134    Automatic 23729.79 Frankfurt
## 5             180    Automatic 54905.58    Munich
## 6             102       Manual 81634.86 Frankfurt

Task 4

General characteristics of the data:

sample size is \(n = 500\) originally, now reduced to \(250\)
unit of observation is one car

Variables and units of measurement (if applicable):

Brand: represents the car’s brand
Model: the specific model of the brand
Year: manufacturing year
Mileage: the total kilometers each car has traveled
Fuel.Type: type of fuel the car uses (Petrol/Diesel/Electric/Hybrid)
Fuel.Consumption..L.100km.: average fuel consumption per 100km (in liters)
Horsepower..HP.: engine’s power rating (in horsepower)
Transmission: type of transmission (Manual/Automatic)
Price: price of the vehicle (in Euros)
City: location where vehicle is available

Task 5

The data was obtained from Kaggle and is available under the following URL:

https://www.kaggle.com/datasets/heidarmirhajisadati/german-vehicle-price-and-efficiency-dataset/data

Task 6

Data manipulation

First, I will rename some of the variables because of the odd column names.

# use rename function from dplyr

df <- rename(df,
       Fuel_Type = Fuel.Type,
       Fuel_Consumption = Fuel.Consumption..L.100km.,
       Horsepower = Horsepower..HP.)
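
The odd names most likely come from `read.table()`, which by default (`check.names = TRUE`) passes the header through `make.names()` and replaces characters that are not valid in R names with dots. A small sketch, assuming the original CSV headers looked like the strings below (this is a guess and is not confirmed by the source):

```r
# Assumed original CSV headers (hypothetical, not taken from the file)
raw_names <- c("Fuel Type", "Fuel Consumption (L/100km)", "Horsepower (HP)")

# make.names() turns spaces, brackets and slashes into dots
clean_names <- make.names(raw_names)
clean_names
```

Under this assumption the result matches the column names shown by `head(df)`.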

Then I will make sure that the categorical variables I will use are treated as factors. For that, I will look at the structure of the data frame.

str(df)

## 'data.frame':    250 obs. of  10 variables:
##  $ Brand           : chr  "BMW" "BMW" "Porsche" "Porsche" ...
##  $ Model           : chr  "5 Series" "3 Series" "Panamera" "Panamera" ...
##  $ Year            : int  2018 2020 2021 2015 2017 2014 2007 2018 2022 2018 ...
##  $ Mileage         : int  261788 54423 279748 203688 74333 70986 203660 258142 25732 63602 ...
##  $ Fuel_Type       : chr  "Diesel" "Petrol" "Hybrid" "Petrol" ...
##  $ Fuel_Consumption: num  9.6 4 9.7 6.9 11.7 4.9 6.1 8.3 10.5 5.8 ...
##  $ Horsepower      : int  240 133 167 134 180 102 352 336 270 230 ...
##  $ Transmission    : chr  "Automatic" "Automatic" "Manual" "Automatic" ...
##  $ Price           : num  54091 93747 14452 23730 54906 ...
##  $ City            : chr  "Cologne" "Munich" "Munich" "Frankfurt" ...

Then I will convert these variables into factors.

df$Brand <- factor(df$Brand)
df$Model <- factor(df$Model)
df$Fuel_Type <- factor(df$Fuel_Type)
# set the levels for transmission so manual is the baseline
df$Transmission <- factor(df$Transmission, levels = c("Manual", "Automatic"))
df$City <- factor(df$City)
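
Why the `levels` argument matters can be seen in a short sketch with a hypothetical character vector `x`: by default `factor()` sorts the levels alphabetically, so "Automatic" would become the first (reference) level.

```r
# Hypothetical transmission values
x <- c("Automatic", "Manual", "Automatic")

# Default: levels are sorted alphabetically
default_levels <- levels(factor(x))

# Explicit levels: "Manual" becomes the first (reference) level
manual_first <- levels(factor(x, levels = c("Manual", "Automatic")))

default_levels
manual_first
```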

I also check whether there are any missing values in the data.

# returns the number of NAs
sum(is.na(df))

## [1] 0

As there are none, we can continue with the analysis.
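
If the total had not been zero, a per-column count would show where the gaps are. A minimal sketch on a hypothetical toy data frame (`toy`, with missing values inserted on purpose):

```r
# Toy data frame with two deliberately missing cells (hypothetical values)
toy <- data.frame(Price = c(54091.03, NA, 14451.87),
                  City  = c("Cologne", "Munich", NA))

total_na  <- sum(is.na(toy))      # total number of missing cells
na_by_col <- colSums(is.na(toy))  # missing values per column

total_na
na_by_col
```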

Descriptive statistics

I chose the variable Price for presenting descriptive statistics.

library(psych)
describe(df$Price)

##    vars   n     mean       sd   median  trimmed      mad     min      max
## X1    1 250 49132.99 26978.34 49039.22 48630.48 34153.84 5308.85 99981.48
##       range skew kurtosis      se
## X1 94672.63 0.11     -1.2 1706.26

From the output, the mean price in the sample is approximately \(€49133\). The median is \(€49039.22\), which means that half of the prices in the data are lower than or equal to \(€49039.22\), while the other half are higher. The output also shows the minimum and maximum values of Price, \(€5308.85\) and \(€99981.48\) respectively. Subtracting the minimum from the maximum gives the range, \(€94672.63\), which is also reported in the output.
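
Two of the reported quantities can be recomputed by hand from the others. The sketch below copies the rounded summary values from the output (the variable names are my own) and recovers the range and the standard error of the mean, \(se = sd / \sqrt{n}\).

```r
# Summary values copied from the describe() output
n     <- 250
sd_p  <- 26978.34
min_p <- 5308.85
max_p <- 99981.48

se_p    <- sd_p / sqrt(n)   # standard error of the mean
range_p <- max_p - min_p    # range

round(se_p, 2)     # 1706.26
round(range_p, 2)  # 94672.63
```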

Task 7

Research Question

I have formulated the following research question from my data:

Is there a difference between the price of cars with manual transmission and automatic transmission?

As the assignment requires both parametric and non-parametric tests to be performed, I will do the testing first. Then, I will check whether the assumptions for the parametric test hold, and argue which test is more suitable.

Parametric test

For the parametric test, it makes sense to compare the arithmetic means of the two groups. Formally, this translates to:

\[ H_0: \mu_{manual} = \mu_{automatic} \] \[ H_1: \mu_{manual} \neq \mu_{automatic} \]

As my two samples, cars with manual and automatic transmissions, come from two different populations, they are independent. A car cannot have manual and automatic transmission at the same time, so there can be no overlap between the populations.

Thus, I will perform an independent samples t-test.

However, first I will explore the variances of the variable Price within the two groups. The classical t-test requires equal variances; if this assumption is violated, the Welch correction can be applied instead.

describeBy(df$Price, group = df$Transmission)

## 
##  Descriptive statistics by group 
## group: Manual
##    vars   n     mean       sd   median  trimmed      mad     min      max
## X1    1 142 49784.16 26677.55 54528.65 49568.67 34299.25 5308.85 99981.48
##       range skew kurtosis      se
## X1 94672.63 0.02     -1.2 2238.73
## ------------------------------------------------------------ 
## group: Automatic
##    vars   n     mean       sd   median  trimmed      mad     min      max
## X1    1 108 48276.83 27469.95 44070.04 47446.25 33617.34 5560.13 99797.19
##       range skew kurtosis     se
## X1 94237.06 0.23    -1.21 2643.3

We can see that the standard deviations are quite close to each other. In order to be sure, we can perform a Levene test. In the context of this test, \(H_0\) is that the variances in the two groups are equal, \(H_1\) is that they are different.

library(car)
leveneTest(df$Price, group = df$Transmission)

## Levene's Test for Homogeneity of Variance (center = median)
##        Df F value Pr(>F)
## group   1  0.1258 0.7231
##       248

As the \(p\) value is very high (\(p = 0.7231\)), \(H_0\) cannot be rejected. Thus, I assume the variances are equal.
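
For intuition, the Levene test with `center = median` can be thought of as a one-way ANOVA on the absolute deviations of each observation from its group median. The sketch below illustrates this idea on simulated data (hypothetical values, not the car prices), so its numbers are not comparable with the output above.

```r
# Simulated prices for two hypothetical groups (not the real data)
set.seed(1)
price <- c(rnorm(30, mean = 50000, sd = 27000),
           rnorm(25, mean = 48000, sd = 27000))
group <- factor(rep(c("Manual", "Automatic"), times = c(30, 25)))

# Absolute deviation of each observation from its own group median
abs_dev <- abs(price - ave(price, group, FUN = median))

# One-way ANOVA on the deviations: F statistic and p value of the test
levene_tab <- anova(lm(abs_dev ~ group))
levene_tab
```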

I proceed with the t-test.

t.test(df$Price ~ df$Transmission,
       var.equal = TRUE,
       alternative = "two.sided")

## 
##  Two Sample t-test
## 
## data:  df$Price by df$Transmission
## t = 0.43689, df = 248, p-value = 0.6626
## alternative hypothesis: true difference in means between group Manual and group Automatic is not equal to 0
## 95 percent confidence interval:
##  -5287.973  8302.626
## sample estimates:
##    mean in group Manual mean in group Automatic 
##                49784.16                48276.83

As the \(p\) value (\(0.6626\)) is well above \(0.05\), we cannot reject \(H_0\).
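
As a cross-check, the test statistic and the confidence interval can be reconstructed from the group summaries alone. The sketch below uses the rounded values from the `describeBy()` output (variable names are my own), so small rounding differences are possible. It also computes the Welch version to show how little the equal-variance assumption matters here.

```r
# Group summaries copied from the describeBy() output
n1 <- 142; m1 <- 49784.16; s1 <- 26677.55   # Manual
n2 <- 108; m2 <- 48276.83; s2 <- 27469.95   # Automatic

# Pooled-variance (Student) t-test
sp2       <- ((n1 - 1) * s1^2 + (n2 - 1) * s2^2) / (n1 + n2 - 2)
se_pooled <- sqrt(sp2 * (1 / n1 + 1 / n2))
t_pooled  <- (m1 - m2) / se_pooled
p_pooled  <- 2 * pt(-abs(t_pooled), df = n1 + n2 - 2)
ci_pooled <- (m1 - m2) + c(-1, 1) * qt(0.975, df = n1 + n2 - 2) * se_pooled

# Welch t-test with Satterthwaite degrees of freedom
v1 <- s1^2 / n1
v2 <- s2^2 / n2
t_welch  <- (m1 - m2) / sqrt(v1 + v2)
df_welch <- (v1 + v2)^2 / (v1^2 / (n1 - 1) + v2^2 / (n2 - 1))
p_welch  <- 2 * pt(-abs(t_welch), df = df_welch)

c(t_pooled = t_pooled, p_pooled = p_pooled)
ci_pooled
c(t_welch = t_welch, df_welch = df_welch, p_welch = p_welch)
```

The pooled results reproduce the `t.test()` output up to rounding, and the Welch statistic is very close to the pooled one, so both versions lead to the same decision.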

Non-parametric test

Among the non-parametric tests, the suitable one in this case is the Wilcoxon Rank Sum Test. However, arithmetic means cannot be compared with this test. Instead, we compare the distribution locations. To state the hypotheses:

\(H_0\): The distribution location of Price is the same for manual and automatic transmission

\(H_1\): The distribution location of Price is not the same for manual and automatic transmission

Performing the test, I get:

wilcox.test(df$Price ~ df$Transmission,
            correct = FALSE,
            exact = FALSE,
            alternative = "two.sided")

## 
##  Wilcoxon rank sum test
## 
## data:  df$Price by df$Transmission
## W = 7893, p-value = 0.6912
## alternative hypothesis: true location shift is not equal to 0

Again, the \(p\) value (\(0.6912\)) is a lot higher than \(0.05\), so \(H_0\) cannot be rejected.
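
With `exact = FALSE` and `correct = FALSE`, the \(p\) value comes from a normal approximation of \(W\) without continuity correction. The sketch below recomputes it from the reported \(W\) and the group sizes; it assumes there are no tied prices (no tie correction is applied), which I have not verified on the data.

```r
# Values taken from the outputs above
n1 <- 142    # Manual
n2 <- 108    # Automatic
W  <- 7893   # reported test statistic

# Mean and standard deviation of W under H0 (assuming no ties)
mu_W    <- n1 * n2 / 2
sigma_W <- sqrt(n1 * n2 * (n1 + n2 + 1) / 12)

# Two-sided p value from the normal approximation
z <- (W - mu_W) / sigma_W
p <- 2 * pnorm(-abs(z))

c(z = z, p = p)
```

Under these assumptions the \(p\) value comes out at about \(0.69\), in line with the test output.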

Checking the assumptions

For an independent samples t-test, the following assumptions have to be met:

Variable is numeric
The distribution of the variable is normal in both populations
Variable has the same variance in both populations

As Price is clearly a numeric variable measured in Euros, the first assumption is met.

I have already checked the assumption for equal variances above, and it holds. Even if it did not, Welch correction could be applied and the t-test would still be suitable.

I need to check whether Price is normally distributed within the populations of manual and automatic transmission.

# make a plot
library(ggplot2)

ggplot(df, aes(x = Price)) + geom_histogram(color = "black", bins = 15) +
  facet_wrap(~Transmission, nrow = 1)

Based on the plots, the distribution of Price does not appear to be normal in either of the groups.

# qq plot
library(ggpubr)
ggqqplot(df, "Price", facet.by = "Transmission")

Similarly to the histograms, the quantile-quantile plots suggest a serious violation of the normality assumption.

In order to formally test the normality, the Shapiro-Wilk test can be conducted. Under this test, the null hypothesis is that the variable is normally distributed.

library(rstatix)
df %>% group_by(Transmission) %>% shapiro_test(Price)

## # A tibble: 2 × 4
##   Transmission variable statistic        p
##   <fct>        <chr>        <dbl>    <dbl>
## 1 Manual       Price        0.954 0.000116
## 2 Automatic    Price        0.946 0.000259

It can be concluded that \(H_0\) is rejected at \(p < 0.001\) in both groups. As a result, the non-parametric Wilcoxon Rank Sum Test is the one that should be used.

Effect size

To find out the effect size in the Wilcoxon Rank Sum Test, the rank-biserial correlation can be calculated.

library(effectsize)

effectsize(wilcox.test(df$Price ~ df$Transmission,
            correct = FALSE,
            exact = FALSE,
            alternative = "two.sided"))

## r (rank biserial) |        95% CI
## ---------------------------------
## 0.03              | [-0.11, 0.17]

interpret_rank_biserial(0.03)

## [1] "tiny"
## (Rules: funder2019)

The effect size is tiny, but this was expected, as the null hypothesis was not rejected.
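
As a rough check, the rank-biserial correlation can also be obtained by hand from the test statistic \(W\) and the two group sizes. The formula below is the standard textbook form, which I assume corresponds to what the package computes; the result agrees with the reported value after rounding.

\[ r = \frac{2W}{n_{manual} \cdot n_{automatic}} - 1 = \frac{2 \cdot 7893}{142 \cdot 108} - 1 = \frac{15786}{15336} - 1 \approx 0.029 \]

A value of \(0\) would mean that a randomly chosen manual car is equally likely to be more or less expensive than a randomly chosen automatic car, so \(0.029\) indicates almost no tendency in either direction.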

Conclusion

To answer the research question, based on the sample data, we did not find any significant differences in the distribution locations of Price for manual and automatic transmission cars.