CarMarket <- read.table("~/IMB/Multivariate analysis/HOMEWORK MVA/bmw.csv", header=TRUE, sep=",", dec=".")
CarMarket <- CarMarket[order(CarMarket$price, decreasing = TRUE), ] 

I sorted the data by price in descending order.

head(CarMarket)
##          model year  price transmission mileage fuelType tax  mpg engineSize
## 3639  2 Series 2015 123456    Semi-Auto   33419   Diesel  20 68.9        2.0
## 5363        M4 2016  99950    Automatic     771   Petrol 300 33.2        3.0
## 2910        M4 2017  89990    Semi-Auto    1336   Petrol 145 33.2        3.0
## 4777        M5 2019  89900    Semi-Auto    2269   Petrol 145 24.1        4.4
## 1814  8 Series 2019  88980    Semi-Auto      88   Petrol 145 24.4        4.4
## 722   8 Series 2019  84898    Semi-Auto    3185   Petrol 145 24.4        4.4

This data set covers the British used car market, specifically used BMWs. The unit of observation is an individual car, and its properties are recorded as the variables of the data set. The sample consists of 10781 observations and 9 variables:

- model: the specific BMW model (1 Series through 8 Series, X1 through X7, plus the M, Z and i variants)
- year: the year the car was produced
- price: the price of the car in GBP
- transmission: the type of transmission (manual, automatic or semi-automatic)
- mileage: the distance the car has already travelled, in miles
- fuelType: the type of fuel powering the drivetrain
- tax: the tax due on the vehicle, in GBP
- mpg: fuel economy, measured in miles driven per imperial gallon (about 4.55 litres) of fuel
- engineSize: the engine displacement, in litres

This data is sourced from Kaggle: https://www.kaggle.com/datasets/adityadesai13/used-car-dataset-ford-and-mercedes?select=bmw.csv
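
A quick structural check could confirm these figures directly; a minimal sketch (output not shown here):

str(CarMarket)           # variable names and types
dim(CarMarket)           # should show 10781 rows and 9 columns, per the description above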


We are going to test the following hypotheses:

H0: The mean prices across the different fuel types of BMWs in the sample are the same.

H1: The mean price differs for at least one of the fuel types.

unique(CarMarket$fuelType)
## [1] "Diesel"   "Petrol"   "Hybrid"   "Other"    "Electric"

Here I used the unique() function to see which fuel types are present in the dataset.
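
To see how the groups are balanced, a simple frequency table could be added here (a minimal sketch using base R; output not shown):

table(CarMarket$fuelType)   # number of cars per fuel type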

library(tidyr)
CarMarket <- drop_na(CarMarket)

Here I removed the rows that contain missing (NA) values.

CarMarket$fuelTypef <- factor(CarMarket$fuelType, levels = c("Diesel", "Petrol", "Hybrid", "Other"), labels = c("1", "2", "3", "4"))

I coerced the fuel type into a factor for further analysis; note that "Electric" is not among the listed levels, so those rows are recoded as NA.
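
A quick tabulation (sketch; output not shown) would confirm the recoding and show the Electric cars appearing as NA:

table(CarMarket$fuelTypef, useNA = "ifany")   # counts per coded level, including NA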

CarMarketfit2019andup <- CarMarket[CarMarket$year >= 2019,]

I restricted the dataset to vehicles manufactured in 2019 or later, so that the analysis covers the more recent and more active part of the market.
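
A short sanity check (sketch; output not shown) that the filter behaved as intended:

range(CarMarketfit2019andup$year)   # should start at 2019
nrow(CarMarketfit2019andup)         # remaining sample size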

library(psych)
describeBy(CarMarketfit2019andup$price, CarMarketfit2019andup$fuelTypef)
## 
##  Descriptive statistics by group 
## group: 1
##    vars    n     mean       sd median  trimmed     mad   min   max range skew
## X1    1 2347 32623.41 11167.56  29480 31044.17 7401.14 14498 79991 65493 1.46
##    kurtosis     se
## X1     2.19 230.52
## ------------------------------------------------------------ 
## group: 2
##    vars    n     mean       sd median  trimmed     mad   min   max range skew
## X1    1 1748 30409.94 10999.97  27476 28607.31 6796.24 11995 89900 77905 1.79
##    kurtosis    se
## X1     3.81 263.1
## ------------------------------------------------------------ 
## group: 3
##    vars   n     mean       sd median  trimmed     mad   min   max range skew
## X1    1 123 35229.74 12477.58  31995 32977.16 5181.69 20850 74226 53376 1.72
##    kurtosis      se
## X1     2.37 1125.07
## ------------------------------------------------------------ 
## group: 4
## NULL

I then described price by fuel type group to get a rough picture of how prices differ across the fuel type variable.
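
A boxplot of price by fuel type would show the same comparison graphically; a minimal ggplot2 sketch (ggplot2 is loaded later in this report, and rows with an NA fuel type are excluded):

library(ggplot2)
# Price by fuel type group: 1 = Diesel, 2 = Petrol, 3 = Hybrid (sketch)
ggplot(subset(CarMarketfit2019andup, !is.na(fuelTypef)), aes(x = fuelTypef, y = price)) +
  geom_boxplot() +
  labs(x = "Fuel type group", y = "Price (GBP)") +
  theme_linedraw()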

library(dplyr)
library(rstatix)
CarMarketfit2019andup %>%
  group_by(fuelTypef) %>%
  shapiro_test(price)
## # A tibble: 3 × 4
##   fuelTypef variable statistic        p
##   <fct>     <chr>        <dbl>    <dbl>
## 1 1         price        0.873 4.86e-40
## 2 2         price        0.837 4.73e-39
## 3 3         price        0.773 1.63e-12

After conducting the Shapiro-Wilk test, I could see that prices are not normally distributed within the fuel type groups, which means I must use a non-parametric test for the comparison to be valid.
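
A normal Q-Q plot gives the same message graphically; a base-R sketch for the Diesel group (the other groups could be checked in the same way):

# Q-Q plot of Diesel prices against a normal distribution (sketch)
diesel_prices <- CarMarketfit2019andup$price[which(CarMarketfit2019andup$fuelTypef == "1")]
qqnorm(diesel_prices, main = "Q-Q plot: Diesel prices, 2019 and newer")
qqline(diesel_prices)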

kruskal.test(price ~ fuelTypef, 
             data = CarMarketfit2019andup)
## 
##  Kruskal-Wallis rank sum test
## 
## data:  price by fuelTypef
## Kruskal-Wallis chi-squared = 95.252, df = 2, p-value < 2.2e-16

The test I chose was the Kruskal-Wallis rank sum test. The p-value is far below 0.05, so the null hypothesis can be rejected and the alternative accepted: the price differs for at least one of the fuel types. (Strictly speaking, the Kruskal-Wallis test is rank-based and compares the price distributions rather than the means directly.)
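
Because the Kruskal-Wallis test only indicates that at least one group differs, a pairwise post-hoc comparison such as Dunn's test (available in the rstatix package used above) could identify which fuel types differ from one another; a sketch, output not shown:

# Pairwise Dunn post-hoc comparisons between fuel types, Bonferroni-adjusted (sketch)
CarMarketfit2019andup %>%
  filter(!is.na(fuelTypef)) %>%
  dunn_test(price ~ fuelTypef, p.adjust.method = "bonferroni")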

kruskal_effsize(price ~ fuelTypef, 
                data = CarMarketfit2019andup)
## # A tibble: 1 × 5
##   .y.       n effsize method  magnitude
## * <chr> <int>   <dbl> <chr>   <ord>    
## 1 price  4218  0.0221 eta2[H] small

Lastly, I computed eta squared, which measures the effect size of the grouping variable in an ANOVA, or in this case in the Kruskal-Wallis rank sum test. It indicates the share of the variance in the dependent variable explained by the independent variable. Here it amounts to approximately 0.022, which corresponds to a small effect.
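
The reported value can be reproduced by hand from the Kruskal-Wallis output above, since eta squared for this test is computed as (H - k + 1) / (n - k), where H is the chi-squared statistic, k the number of groups and n the number of observations:

# Recompute eta squared from the Kruskal-Wallis statistic (sketch)
H <- 95.252                 # chi-squared statistic reported above
k <- 3                      # fuel type groups with data
n <- 4218                   # observations used
(H - k + 1) / (n - k)       # approximately 0.0221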


The next hypothesis test concerns the prices of specific models in the BMW lineup. To be precise:

H0: There is no statistically significant difference in mean price between the 3 Series and the 5 Series.

H1: There is a statistically significant difference in mean price between the 3 Series and the 5 Series.

# Note: the model names in the raw data carry a leading space
CarMarketfit2019andup$modelf <- factor(CarMarketfit2019andup$model,
  levels = c(" 3 Series", " 5 Series", " M5", " 8 Series", " X7", " i8", " X5",
             " X6", " X3", " M3", " X4", " M4", " 7 Series", " 6 Series", " Z4",
             " M2", " 4 Series", " X2", " 1 Series", " 2 Series", " X1"),
  labels = as.character(1:21))

I've coerced the model variable into a factor for further analysis.

CarMarketfit2019andup <- drop_na(CarMarketfit2019andup)

I've omitted the rows containing NA values.

CarMarketfit2019andup35 <-  filter(CarMarketfit2019andup, modelf %in% c("1", "2"))
head(CarMarketfit2019andup35)
##       model year price transmission mileage fuelType tax  mpg engineSize
## 1  3 Series 2020 71990    Semi-Auto     150   Diesel 150 47.1        3.0
## 2  5 Series 2020 54845    Semi-Auto     450   Diesel 145 60.1        3.0
## 3  3 Series 2020 53988    Automatic    5000   Petrol 145 34.9        3.0
## 4  3 Series 2020 53260    Semi-Auto     347   Petrol 145 33.6        3.0
## 5  3 Series 2020 51000    Semi-Auto    5000   Diesel 150 43.5        3.0
## 6  3 Series 2020 45000    Semi-Auto    2500   Petrol 145 33.6        3.0
##   fuelTypef modelf
## 1         1      1
## 2         1      2
## 3         2      1
## 4         2      1
## 5         1      1
## 6         2      1

I've excluded all other models from the sample so that only the 3 Series and 5 Series remain for the analysis.

library(psych)
describeBy(CarMarketfit2019andup35$price, CarMarketfit2019andup35$modelf)
## 
##  Descriptive statistics by group 
## group: 1
##    vars   n     mean      sd median  trimmed     mad   min   max range skew
## X1    1 810 29461.03 5611.13  28450 29005.67 5131.28 16698 71990 55292 1.21
##    kurtosis     se
## X1     4.44 197.16
## ------------------------------------------------------------ 
## group: 2
##    vars   n     mean      sd median  trimmed     mad   min   max range skew
## X1    1 398 31198.62 4585.61  30497 30782.44 4448.54 21000 54845 33845 0.94
##    kurtosis     se
## X1     1.53 229.86
## ------------------------------------------------------------ 
## group: 3
## NULL
## ------------------------------------------------------------ 
## group: 4
## NULL
## ------------------------------------------------------------ 
## group: 5
## NULL
## ------------------------------------------------------------ 
## group: 6
## NULL
## ------------------------------------------------------------ 
## group: 7
## NULL
## ------------------------------------------------------------ 
## group: 8
## NULL
## ------------------------------------------------------------ 
## group: 9
## NULL
## ------------------------------------------------------------ 
## group: 10
## NULL
## ------------------------------------------------------------ 
## group: 11
## NULL
## ------------------------------------------------------------ 
## group: 12
## NULL
## ------------------------------------------------------------ 
## group: 13
## NULL
## ------------------------------------------------------------ 
## group: 14
## NULL
## ------------------------------------------------------------ 
## group: 15
## NULL
## ------------------------------------------------------------ 
## group: 16
## NULL
## ------------------------------------------------------------ 
## group: 17
## NULL
## ------------------------------------------------------------ 
## group: 18
## NULL
## ------------------------------------------------------------ 
## group: 19
## NULL
## ------------------------------------------------------------ 
## group: 20
## NULL
## ------------------------------------------------------------ 
## group: 21
## NULL

I described price by model factor in order to compare the price distributions of the two groups.
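
The long run of NULL groups above appears because the subset still carries all 21 model levels even though only two of them contain data; dropping the unused levels would shorten the output (a sketch using a temporary variable, modelf_trimmed, so the original column is left untouched):

# Drop unused model levels so only the 3 Series and 5 Series groups are summarised (sketch)
modelf_trimmed <- droplevels(CarMarketfit2019andup35$modelf)
describeBy(CarMarketfit2019andup35$price, modelf_trimmed)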

library(ggplot2)

Series_3 <- ggplot(CarMarketfit2019andup35[CarMarketfit2019andup35$modelf=="1",  ], aes(x = price)) +
  theme_linedraw() + 
  geom_histogram() +
  ggtitle("3 Series price")

Series_5 <- ggplot(CarMarketfit2019andup35[CarMarketfit2019andup35$modelf=="2",  ], aes(x = price)) +
  theme_linedraw() + 
  geom_histogram() +
  ggtitle("5 Series price")

library(ggpubr)
## Warning: package 'ggpubr' was built under R version 4.2.2
ggarrange(Series_3, Series_5,
          ncol= 2, nrow = 1)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

I've plotted the price distributions of the two samples in order to get an intuitive view of their shape.

CarMarketfit2019andup35 %>% 
group_by(modelf) %>%
shapiro_test(price)
## # A tibble: 2 × 4
##   modelf variable statistic        p
##   <fct>  <chr>        <dbl>    <dbl>
## 1 1      price        0.939 1.17e-17
## 2 2      price        0.949 1.75e-10

The Shapiro-Wilk test shows that the distributions are not normal, which calls for a non-parametric test; in this case we use the Wilcoxon rank sum test.

wilcox.test(CarMarketfit2019andup35$price ~ CarMarketfit2019andup35$modelf,
            paired = FALSE,
            correct = FALSE,
            exact = FALSE,
            alternative = "two.sided")
## 
##  Wilcoxon rank sum test
## 
## data:  CarMarketfit2019andup35$price by CarMarketfit2019andup35$modelf
## W = 123845, p-value = 5.645e-11
## alternative hypothesis: true location shift is not equal to 0

Because the p-value is below 0.05, we can reject the null hypothesis and accept the alternative: there is a statistically significant difference in price between the 3 Series and the 5 Series.
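
To see the direction of the difference, the group medians can be compared directly (a dplyr sketch; the values should match the describeBy output above, with the 5 Series priced higher):

# Median price and group size per model: 1 = 3 Series, 2 = 5 Series (sketch)
CarMarketfit2019andup35 %>%
  group_by(modelf) %>%
  summarise(median_price = median(price), n = n())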

library(effectsize)
## Warning: package 'effectsize' was built under R version 4.2.2
## 
## Attaching package: 'effectsize'
## The following objects are masked from 'package:rstatix':
## 
##     cohens_d, eta_squared
## The following object is masked from 'package:psych':
## 
##     phi
rb <- rank_biserial(price ~ modelf, data = CarMarketfit2019andup35)
print(rb)
## r (rank biserial) |         95% CI
## ----------------------------------
## -0.23             | [-0.30, -0.17]
interpret_rank_biserial(-0.23)
## [1] "medium"
## (Rules: funder2019)

According to the rank-biserial correlation, the effect size is medium, meaning there is a medium-sized difference between the price distributions of the 3 Series and 5 Series BMWs in the 2019-and-newer sample.


The last hypothesis test checks whether the ratio of 3 Series to 5 Series cars manufactured in 2019 or later in the sample matches the ratio of units sold over the same period. Since 2019, approximately 250,000 3 Series and approximately 150,000 5 Series have been sold in the EU, so the 3 Series accounts for roughly 0.6 (more precisely 0.625) of the combined sales of the two models; the test below uses 0.6 as the null proportion.

Therefore my hypotheses are as follows:

H0: The proportion of 3 Series among the 3 and 5 Series cars in the sample is 0.6, matching the market share of units sold.

H1: The proportion of 3 Series among the 3 and 5 Series cars in the sample is greater than 0.6.

CarMarketfit2019andup3 <-  filter(CarMarketfit2019andup, modelf %in% c("1"))
head(CarMarketfit2019andup3)
##       model year price transmission mileage fuelType tax  mpg engineSize
## 1  3 Series 2020 71990    Semi-Auto     150   Diesel 150 47.1        3.0
## 2  3 Series 2020 53988    Automatic    5000   Petrol 145 34.9        3.0
## 3  3 Series 2020 53260    Semi-Auto     347   Petrol 145 33.6        3.0
## 4  3 Series 2020 51000    Semi-Auto    5000   Diesel 150 43.5        3.0
## 5  3 Series 2020 45000    Semi-Auto    2500   Petrol 145 33.6        3.0
## 6  3 Series 2020 44320    Semi-Auto     160   Diesel 145 49.6        2.0
##   fuelTypef modelf
## 1         1      1
## 2         2      1
## 3         2      1
## 4         1      1
## 5         2      1
## 6         1      1
nrow(CarMarketfit2019andup3)
## [1] 810
nrow(CarMarketfit2019andup35)
## [1] 1208
binom.test(x = 810,
           n = 1208,
           p = 0.6, 
           alternative = "greater")
## 
##  Exact binomial test
## 
## data:  810 and 1208
## number of successes = 810, number of trials = 1208, p-value = 2.441e-07
## alternative hypothesis: true probability of success is greater than 0.6
## 95 percent confidence interval:
##  0.6475801 1.0000000
## sample estimates:
## probability of success 
##              0.6705298

Because the p-value is below 0.05, we can reject the null hypothesis and accept the alternative: the proportion of 3 Series among the 3 and 5 Series cars in the sample (about 0.67) is greater than the 0.6 share of units sold on the market.