In this project we have analysed two different datasets. The first one contained information about used cars, their characteristics and prices. The other one stored data about numerous songs, their traits and their possible position in one of the world's most important music rankings. Using different machine learning algorithms we were able to predict the features that were most interesting for us. We have compared the models' performance and learned how to work with real-life data that stems from internet sources and web scraping. The project was a way to learn modern approaches in supervised data analytics. Our results also have an additional value: they can be used not only for educational purposes, but, due to their high accuracy, also to give predictions about new entities. The following sections of this report contain all the code necessary to replicate our analysis and can serve as a guideline for a similar research project.
The market for used vehicles is growing dynamically and is expected to continue its growth in the future1. It is also an interesting market from the point of view of theoretical economic analysis – it is close to perfect competition, in which private sellers are matched with interested buyers. The products may differ in many ways, but at the same time there are almost always some comparable offers that a potential buyer can choose from. In the market for used goods there is always a serious problem of asymmetric information. Some researchers say that, in a situation of uncertainty, the price of such a good can be taken as an indicator of quality2 – it cannot be too low, which could indicate a lower value, but at the same time it cannot be too high, as the buyers can always choose another offer. A good, generalized model for the price of a used car could tell us not only what the market assessment of a particular car model is, but also a lot about the way the market works – what influences the price the most and what is less important. Thus in the first part of the project we will attempt to predict the price of a used car based on its characteristics.
The dataset autos contains information about used cars offered for sale on the auction site. Observations were scraped with Python's Scrapy library by Orges Leka. The data, together with all the information about its author, is available here: https://www.kaggle.com/orgesleka/used-cars-database/data#autos.csv.
In its currently available state, the set contains twenty variables of different types and 189,349 observations (the product of the numbers of rows and columns exceeds 3,000,000), so it can be treated as a big dataset.
autos <- read.csv('autos.csv')
nrow(autos)*ncol(autos) #large dataset
## [1] 3786980
The data was retrieved from a German site (Ebay-Kleinanzeigen), so before the first use it needs to be translated. Also, because it is a result of scraping, there are many missing values, encoded only as empty strings, that also need to be changed before use. The format of the data needs to be changed as well, as many of the variables are recognized as characters where factors should be in place. The detailed code for those first transformations is included below.
autos$seller <- factor(autos$seller)
levels(autos$seller) <- c('dealer', 'private')
autos$offerType <- factor(autos$offerType)
levels(autos$offerType) <- c('offer', 'request')
autos$abtest <- factor(autos$abtest)
autos$vehicleType <- factor(autos$vehicleType)
levels(autos$vehicleType) <- c('NULL', 'other', 'bus', 'cabriolet', 'coupe',
'small_car', 'kombi', 'limousine', 'suv')
autos$gearbox <- factor(autos$gearbox)
levels(autos$gearbox) <- c('NULL', 'automatic', 'manual')
autos$model <- factor(autos$model)
levels(autos$model)[1] <- 'NULL'
levels(autos$model)[42] <- 'other'
autos$monthOfRegistration <- factor(autos$monthOfRegistration)
levels(autos$monthOfRegistration) <-c ('NULL', 'Jan', 'Feb', 'Mar', 'Apr', 'May',
'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec')
autos$fuelType <- factor(autos$fuelType)
levels(autos$fuelType) <- c('NULL', 'other', 'gasoline', 'cng', 'diesel',
'electro','hybrid', 'lpg')
autos$brand <- factor(autos$brand)
autos$notRepairedDamage <- factor(autos$notRepairedDamage)
levels(autos$notRepairedDamage) <- c('NULL', 'yes', 'no')
autos$dateCrawled <- as.POSIXct(autos$dateCrawled, format = '%Y-%m-%d %H:%M:%OS', tz = "GMT")
autos$dateCreated <- as.POSIXct(autos$dateCreated, format = '%Y-%m-%d %H:%M:%OS', tz = "GMT")
autos$lastSeen <- as.POSIXct(autos$lastSeen, format = '%Y-%m-%d %H:%M:%OS', tz = "GMT")
The dataset contains twenty different variables that describe a car's characteristics found in the ad.
The class of each variable in a properly read dataset is presented below. In total we have one factor with a very large number of levels (name, the ad title), three variables representing dates (POSIXct class), ten other unordered factors and six columns with integers.
#the initial state of the properly read dataset
str(autos)
## 'data.frame': 189349 obs. of 20 variables:
## $ dateCrawled : POSIXct, format: "2016-03-24 11:52:17" "2016-03-24 10:58:45" ...
## $ name : Factor w/ 128113 levels "–_ein_Kombi_+_4WD__perfekt_fuer_alle_Lebenslagen__",..: 43604 2399 50197 44742 94997 15678 80970 118441 34980 119726 ...
## $ seller : Factor w/ 2 levels "dealer","private": 2 2 2 2 2 2 2 2 2 2 ...
## $ offerType : Factor w/ 2 levels "offer","request": 1 1 1 1 1 1 1 1 1 1 ...
## $ price : int 480 18300 9800 1500 3600 650 2200 0 14500 999 ...
## $ abtest : Factor w/ 2 levels "control","test": 2 2 2 2 2 2 2 2 1 2 ...
## $ vehicleType : Factor w/ 9 levels "NULL","other",..: 1 5 9 6 6 8 4 8 3 6 ...
## $ yearOfRegistration : int 1993 2011 2004 2001 2008 1995 2004 1980 2014 1998 ...
## $ gearbox : Factor w/ 3 levels "NULL","automatic",..: 3 3 2 3 3 3 3 3 3 3 ...
## $ powerPS : int 0 190 163 75 69 102 109 50 125 101 ...
## $ model : Factor w/ 250 levels "NULL","1_reihe",..: 119 1 120 119 104 13 9 42 58 119 ...
## $ kilometer : int 150000 125000 125000 150000 90000 150000 150000 40000 30000 150000 ...
## $ monthOfRegistration: Factor w/ 13 levels "NULL","Jan","Feb",..: 1 6 9 7 8 11 9 8 9 1 ...
## $ fuelType : Factor w/ 8 levels "NULL","other",..: 3 5 5 3 5 3 3 3 3 1 ...
## $ brand : Factor w/ 40 levels "alfa_romeo","audi",..: 39 2 15 39 32 3 26 39 11 39 ...
## $ notRepairedDamage : Factor w/ 3 levels "NULL","yes","no": 1 2 1 3 3 2 3 3 1 1 ...
## $ dateCreated : POSIXct, format: "2016-03-24" "2016-03-24" ...
## $ nrOfPictures : int 0 0 0 0 0 0 0 0 0 0 ...
## $ postalCode : int 70435 66954 90480 91074 60437 33775 67112 19348 94505 27472 ...
## $ lastSeen : POSIXct, format: "2016-04-07 03:16:57" "2016-04-07 01:46:50" ...
summary(autos)
## dateCrawled name
## Min. :2016-03-05 14:06:22 Ford_Fiesta : 336
## 1st Qu.:2016-03-13 13:48:27 Volkswagen_Golf_1.4: 335
## Median :2016-03-21 18:44:48 BMW_318i : 334
## Mean :2016-03-21 14:11:47 Opel_Corsa : 316
## 3rd Qu.:2016-03-29 15:39:13 BMW_316i : 258
## Max. :2016-04-07 14:36:58 BMW_320i : 256
## (Other) :187514
## seller offerType price abtest
## dealer : 2 offer :189341 Min. : 0 control:91131
## private:189347 request: 8 1st Qu.: 1150 test :98218
## Median : 2950
## Mean : 10895
## 3rd Qu.: 7200
## Max. :99999999
##
## vehicleType yearOfRegistration gearbox powerPS
## limousine:48701 Min. :1000 NULL : 10395 Min. : 0.0
## small_car:40759 1st Qu.:1999 automatic: 39220 1st Qu.: 70.0
## kombi :34498 Median :2003 manual :139734 Median : 105.0
## NULL :19437 Mean :2005 Mean : 114.7
## bus :15532 3rd Qu.:2008 3rd Qu.: 150.0
## cabriolet:11668 Max. :9999 Max. :19208.0
## (Other) :18754
## model kilometer monthOfRegistration fuelType
## golf : 15286 Min. : 5000 NULL :19025 gasoline:114106
## other : 13459 1st Qu.:125000 Mar :18490 diesel : 54968
## 3er : 10528 Median :150000 Jun :16892 NULL : 16936
## NULL : 10398 Mean :125640 Apr :15762 lpg : 2732
## polo : 6714 3rd Qu.:150000 May :15759 cng : 308
## corsa : 6415 Max. :150000 Jul :14858 hybrid : 142
## (Other):126549 (Other):88563 (Other) : 157
## brand notRepairedDamage dateCreated
## volkswagen :40687 NULL: 36542 Min. :2014-03-10 00:00:00
## bmw :20545 yes : 18410 1st Qu.:2016-03-13 00:00:00
## opel :20427 no :134397 Median :2016-03-21 00:00:00
## mercedes_benz:17931 Mean :2016-03-20 20:19:40
## audi :16676 3rd Qu.:2016-03-29 00:00:00
## ford :13093 Max. :2016-04-07 00:00:00
## (Other) :59990
## nrOfPictures postalCode lastSeen
## Min. :0 Min. : 1067 Min. :2016-03-05 14:15:08
## 1st Qu.:0 1st Qu.:30559 1st Qu.:2016-03-23 11:53:24
## Median :0 Median :49661 Median :2016-04-03 23:17:40
## Mean :0 Mean :50892 Mean :2016-03-30 03:46:56
## 3rd Qu.:0 3rd Qu.:71577 3rd Qu.:2016-04-06 10:45:49
## Max. :0 Max. :99998 Max. :2016-04-07 14:58:51
##
The general descriptive statistics presented in the code chunk above help to assess the quality of the data. As we can see, the target variable price reaches some troubling levels at its extremes. We can expect that the zeros and the 99999999 values represent missing information. We can see some strange values for yearOfRegistration as well - the values 1000 and 9999 indicate problems with the data. We also have some zeros in the powerPS variable, which should not happen for a standard vehicle. The variables abtest and nrOfPictures provide no insight into the car's characteristics, so they should probably be dropped from further analysis. We can observe almost no variation in the factors seller and offerType - that could be an issue in future modelling. We also have some factor levels called 'NULL' - those represent missing information for a particular characteristic.
Even a very simplistic EDA gives us some outline of what we can expect in our data and what points we need to focus on before we start the modelling process.
First we will take care of the missing and incorrect values in the dataset. The most obvious change is to recode the values 0 and 99999999 of price to NA (those cannot be considered realistic prices offered by a seller). Such incorrect values could distort our analysis and worsen the predictions.
tail(autos[order(autos$price), 'price'], 50) #those values seem strange, most probably they are incorrect (made up prices so as not to give the proper price tag in the ad itself - those are not informative for us)
## [1] 745000 745000 820000 849000 911911 999990 999999
## [8] 999999 999999 999999 999999 999999 999999 999999
## [15] 1000000 1111111 1234566 1600000 1600000 2995000 3890000
## [22] 3895000 9999999 9999999 10000000 10000000 10000000 10000000
## [29] 10000000 10000000 10000000 10010011 11111111 11111111 11111111
## [36] 11111111 11111111 11111111 11111111 12345678 12345678 12345678
## [43] 27322222 99000000 99999999 99999999 99999999 99999999 99999999
## [50] 99999999
#let's clean the dataset of the most unbelievable prices
autos[autos$price== 99999999,]$price <- 0
autos[autos$price== 999999,]$price <- 0
autos[autos$price== 11111111,]$price <- 0
autos[autos$price== 12345678,]$price <- 0
autos[autos$price== 0,]$price <- NA
As we could also see from the general descriptive statistics of the price variable, there are some extreme values that inflate its mean. They cause a high skewness of that feature. We will overcome that by applying a log transformation, which makes the distribution closer to normal. For now we will keep both the original and the transformed variable, but it is possible that we will model only the transformed one.
autos$logprice<-log(autos$price)
hist(autos$logprice)
We will verify our previous observation using the Box-Cox procedure. As we can see from the output below, the optimal lambda is close to 0, so the log transformation of price should be enough to get better results in the modelling phase. Thanks to that confirmation, in the next phases we will use only logprice as our target variable.
require('bestNormalize')
price_boxcox <- boxcox(autos$price)
price_boxcox$lambda
## [1] 0.1181524
Then we will clean up values that also seem problematic. The year of the first registration of a car should not be lower than 1900 (such cars are museum artifacts and are not sold on Ebay). It should also not be higher than the year the ad was published, which was 2016 (the last scraped data is from April 2016). That is why we have decided to recode those troubling values into missing values.
autos[autos$yearOfRegistration== 9999,]$yearOfRegistration <- 0
autos[autos$yearOfRegistration > 2016,]$yearOfRegistration <- 0 #newer than the data was scrapped
autos[autos$yearOfRegistration < 1900,]$yearOfRegistration <- 0
autos[autos$yearOfRegistration== 0,]$yearOfRegistration <- NA
The same applies to the horsepower of a vehicle - there should be no car with a power of 0. Those are incorrect data.
autos[autos$powerPS== 0,]$powerPS<-NA
We will also take care of the NULL categories in gearbox and notRepairedDamage - those will be changed to their default values (manual and no, respectively).
#gearbox and notRepairedDamage
autos[autos$gearbox == 'NULL',]$gearbox <- 'manual' #change to defaults
autos[autos$notRepairedDamage == 'NULL',]$notRepairedDamage <- 'no'
#drop unused factor levels
autos$gearbox <- droplevels(autos$gearbox)
autos$notRepairedDamage <- droplevels(autos$notRepairedDamage)
There was an attempt to convert German postal codes to more general categories, such as the federal states (Länder), but that was not a straightforward task. Postal codes in Germany do not correspond well with the borders of the NUTS1 regions3. The only possible way was to read German GIS structures and try to somehow overlap those with the postal codes. However, due to the additional effort and the uncertain outcome, the idea was dropped. It would, though, be a good inspiration for some future analysis of this dataset4.
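As a rough alternative (only a sketch, not used in the rest of this analysis), the first digit of the postal code - the so-called Leitzone (0-9) - could serve as a coarse regional grouping, even though it does not correspond exactly to the federal states. The plz_zone object below is purely illustrative and is not added to the modelling dataset.
#coarse regional grouping by the first digit of the postal code ("Leitzone")
#sprintf() restores the leading zero that is lost when the code is stored as an integer
plz_zone <- factor(substr(sprintf("%05d", autos$postalCode), 1, 1))
table(plz_zone)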
During a careful analysis of the dataset some odd patterns were found. There were some ads that did not offer any car for sale, but rather indicated an eagerness to buy a vehicle with a given specification. Some exchange offers were present as well. Those usually contain missing or untrue values, as they are only an outline of the request and not an actual sale offer. We have decided to correct them in the dataset - we believe that their category (offerType) should be set to request.
We managed to tackle them with some manipulation of the dataset. Those troublesome ads had titles starting with 'seek' or 'exchange' (in German: 'suche', 'tausche'). We extracted the first word from each title and converted it to lowercase letters. Then, for the observations where the words 'suche', 'tausch' or 'tausche' were present, the offerType was changed to request.
#Suche, SUCHE, suche, tausch, Tausch, TAUSCH, tausche - ads starting with those words are not offers for sale and are likely to contain wrong values
autos$subName <- tolower(sub("_.*", "", autos$name)) #get the first word from the ad title
#change those troublesome offers to their proper type, which should be request
autos[autos$subName == 'suche',]$offerType <- 'request' #I am looking for...
autos[autos$subName == 'tausch',]$offerType <- 'request' #I will exchange...
autos[autos$subName == 'tausche',]$offerType <- 'request' #I will exchange...
#clear the dataset of the temp variable
autos$subName <- NULL
In the next step we will exclude requests from our dataset, as they do not represent the sale ads that are the focus of this paper. The name variable will be excluded as well, as it is not comparable across observations and, instead of adding to the explainability, it may only cause computational problems. All important cases have already been cleaned with the use of this covariate, so we will no longer need the original name variable in our dataset.
#exclude requests from database
autos <- autos[autos$offerType == 'offer',]
#drop the unused variable
autos$offerType <- NULL
autos$name <- NULL
Variables lastSeen and dateCreated on their own do not give us any information about the offer. However, they can be used to approximate the time the ad was active on the market. That feature could be useful in determining how attractive the offer was. It could also be linked to the market matching mechanism - if the price was relatively low, the offer should be active for a short time, while a longer presence of the ad on the website could indicate small demand, possibly caused by too high a price.
#calculate time difference in days
autos$howLong <- difftime(autos$lastSeen, autos$dateCreated, units = c("days"))
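A side note: difftime objects are not treated as numeric by is.numeric(), so the new howLong variable would be skipped by numeric-only steps such as the correlation check performed later. If that matters, it can optionally be converted to a plain number of days; the conversion below is only a sketch and was not applied in the original analysis.
#optional: store the duration as a plain numeric number of days, so that it is
#picked up by is.numeric()-based variable selection and scaling later on
autos$howLong <- as.numeric(autos$howLong, units = "days")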
We have decided to remove abtest and nrOfPictures, as those variables give no information about the item and are artifacts of a faulty data generation process or old processing. Also the already exploited (and no longer useful) variables lastSeen, dateCrawled and dateCreated were dropped from the analysis.
autos$abtest <- NULL
autos$nrOfPictures <- NULL
autos$lastSeen <- NULL
autos$dateCrawled <- NULL
autos$dateCreated <- NULL
Our dataset contains lots of observations, some of which are incorrect and misleading. Our data stems from ads on Ebay, which are created by individual users - that causes problems with the consistency of the data and may be an issue for the modelling. We have spotted some of the incorrect observations and changed them to missing values. We believe, though, that despite spotting and treating some of them, there still might be more incorrect observations, and those are likely to be tied with missing values in some categories (not fully filled offers, non-standard vehicle types, etc.). From the point of view of our research goal, those are not the observations we would like to include in our model. To predict the prices of used cars, we would like to have the best and most reliable data we can get.
#percentage of missing values
(1-length(which(complete.cases(autos)))/nrow(autos))*100
## [1] 15.21267
#number of non-missing cases - data of good quality
length(which(complete.cases(autos)))
## [1] 159390
Missing data is an issue for about 15% of the observations. If we wanted to impute all of them, that would lead to problems with the reliability of the model - for 15% of the dataset we would be working with artificial values. That is unacceptable if we want a model as close to reality as possible. On the other hand, even after removing all those problematic cases, we would still have almost 160 thousand observations. Such a size can still be considered a big dataset, sufficient for a proper analysis.
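Before dropping the incomplete cases, it can be helpful to see which variables drive the missingness. A minimal check (its output is not reproduced here) could look as follows.
#number of missing values per variable - a quick diagnostic, not used further
colSums(is.na(autos))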
Given that, we have decided to simply exclude the incomplete cases from our analysis.
#remove the missings in the dataset
autos <- autos[which(complete.cases(autos)),]
save(list = "autos", file = "autos_prepared.RData")
First we will check whether some numeric variables are highly correlated with each other (which would cause a collinearity problem).
require(dplyr)
autos_numeric_vars <- sapply(autos, is.numeric) %>% which() %>% names()
autos_correlations <- cor(autos[,autos_numeric_vars], use = "pairwise.complete.obs")
require(corrplot)
corrplot(autos_correlations)
As we can see, there is not much correlation among the explanatory variables - collinearity is not an issue for the numeric features. However, we do not see high correlations with the price variable either (there is some correlation with the kilometer variable only). We can see better results when we consider the log-transformed variable - which gives more hope for the modelling phase - there is a relation of logprice with yearOfRegistration and powerPS, and a stronger one with kilometer. Also, the signs of the relations are as we would expect them to be. The postalCode has almost no influence on our variable, which is expected, as on its own it does not give any beneficial information in a non-spatial analysis. We will exclude it from the dataset.
autos$postalCode <- NULL
We will now use ANOVA for the categorical variables to check whether the mean of the dependent variable differs across the subgroups indicated by their levels.
autos_categorical_vars <- sapply(autos, is.factor) %>% which() %>% names()
# The higher the F-statistic (or the lower its p-value) the stronger we reject H0 (of no difference between groups)
autos_F_anova <- function(categorical_var) {
anova_ <- aov(autos$logprice ~ autos[[categorical_var]])
return(summary(anova_)[[1]][1, 4]) #get the F statistic
}
sapply(autos_categorical_vars, autos_F_anova) %>%
sort(decreasing = TRUE) -> autos_anova_all_categorical
autos_anova_all_categorical
save(list = c('autos_anova_all_categorical'), file = "autos_Ftest.RData")
## gearbox notRepairedDamage vehicleType
## 1.489636e+04 1.230559e+04 3.448338e+03
## fuelType brand monthOfRegistration
## 3.012229e+03 7.300495e+02 5.144894e+02
## model seller
## 2.729511e+02 7.516753e-01
The output above presents the F test statistics for each variable. The higher the F-statistic, the more strongly we reject H0 (of no difference between groups). It seems that our variables have a significant effect on the logarithmic price, except for the seller variable. That was expected, as it has almost no variation. Thus we will exclude it from our analysis.
autos$seller <- NULL
Finally we will check whether our dataset includes any other variables with a very small variance. To do this, we will use the nearZeroVar function from the caret package. As we can see from the output below, we have successfully cleaned the dataset of the variables that do not provide enough variation - the variables remaining in the set (both original and transformed) are not susceptible to the issue of near zero variance. We can successfully move on to the modelling phase.
require(caret)
nearZeroVar(autos, saveMetrics = TRUE) -> autos_nzv_stats
save(list = c('autos_nzv_stats'), file = 'autos_nzv.RData')
autos_nzv_stats
## freqRatio percentUnique zeroVar nzv
## price 1.003150 2.564778217 FALSE FALSE
## vehicleType 1.236917 0.005646527 FALSE FALSE
## yearOfRegistration 1.028812 0.052073530 FALSE FALSE
## gearbox 3.515425 0.001254784 FALSE FALSE
## powerPS 1.495796 0.382081686 FALSE FALSE
## model 1.142278 0.156220591 FALSE FALSE
## kilometer 5.969149 0.008156095 FALSE FALSE
## monthOfRegistration 1.103725 0.008156095 FALSE FALSE
## fuelType 2.011839 0.005019135 FALSE FALSE
## brand 1.881764 0.025095677 FALSE FALSE
## notRepairedDamage 9.802440 0.001254784 FALSE FALSE
## logprice 1.003150 2.564778217 FALSE FALSE
## howLong 1.000000 60.437292176 FALSE FALSE
At this step, with our data prepared and preprocessed, we will split it into train and test samples in a 60/40 proportion. The createDataPartition function will be used to obtain a similar distribution of the target variable in both subsamples.
set.seed(111) #seed for reproducibility
autos_which_train <- createDataPartition(autos$logprice, p = 0.6,list = FALSE)
autos_train <- autos[autos_which_train,]
autos_test <- autos[-autos_which_train,]
save(list = c("autos_train", "autos_test"), file = "autos_traintest.RData")
The larger sample will be used to train three different types of models: linear regression, KNN and ridge regression. After each model is properly fitted and tuned, we will compare the performance of the final models from the different classes.
We will start the modelling phase by estimating a linear regression model on the logarithm of the car price. Our initial model will include all the variables present in the dataset after the pre-processing and feature selection (except for the original variable price).
options(scipen=10)
options(contrasts = c("contr.treatment", "contr.treatment"))
autos_lm1 <- lm(logprice ~ howLong + notRepairedDamage + powerPS + kilometer +
yearOfRegistration + gearbox + vehicleType + fuelType +
monthOfRegistration + brand + model,
data = autos_train)
save(list = c("autos_lm1"), file = "autos_lm.RData")
summary(autos_lm1)
##
## Call:
## lm(formula = logprice ~ howLong + notRepairedDamage + powerPS +
## kilometer + yearOfRegistration + gearbox + vehicleType +
## fuelType + monthOfRegistration + brand + model, data = autos_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.6051 -0.3072 0.0489 0.3621 9.8067
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) -135.31152049106 0.92327592863 -146.556
## howLong 0.00946701349 0.00027459914 34.476
## notRepairedDamageno 0.66386164587 0.00822396845 80.723
## powerPS 0.00046085944 0.00001865219 24.708
## kilometer -0.00000729647 0.00000007116 -102.535
## yearOfRegistration 0.07086283965 0.00045680168 155.128
## gearboxmanual -0.18731183755 0.00690147242 -27.141
## vehicleTypeother 0.85120621085 0.03035277568 28.044
## vehicleTypebus 0.98311758358 0.01925202750 51.066
## vehicleTypecabriolet 1.49231461582 0.01795714873 83.104
## vehicleTypecoupe 1.30975179098 0.01848236960 70.865
## vehicleTypesmall_car 0.84991765817 0.01571834408 54.072
## vehicleTypekombi 0.99301710276 0.01547030806 64.189
## vehicleTypelimousine 1.04462383390 0.01498533741 69.710
## vehicleTypesuv 1.25153286070 0.02505844963 49.945
## fuelTypeother -0.47030499764 0.11548898271 -4.072
## fuelTypegasoline 0.10233805051 0.01195059448 8.563
## fuelTypecng 0.25573040134 0.05960287243 4.291
## fuelTypediesel 0.40093158500 0.01279980903 31.323
## fuelTypeelectro -0.41337620219 0.15406719770 -2.683
## fuelTypehybrid 0.14120763384 0.09187054437 1.537
## fuelTypelpg 0.16333749481 0.02248365706 7.265
## monthOfRegistrationJan 0.26913575255 0.01373999375 19.588
## monthOfRegistrationFeb 0.28146712180 0.01396367516 20.157
## monthOfRegistrationMar 0.27672025475 0.01271610717 21.761
## monthOfRegistrationApr 0.27291923976 0.01308224625 20.862
## monthOfRegistrationMay 0.27438324984 0.01306964946 20.994
## monthOfRegistrationJun 0.26580712325 0.01291863899 20.575
## monthOfRegistrationJul 0.31289508188 0.01321787368 23.672
## monthOfRegistrationAug 0.27421753277 0.01384484963 19.806
## monthOfRegistrationSep 0.31207207223 0.01366419572 22.839
## monthOfRegistrationOct 0.29537966097 0.01338182033 22.073
## monthOfRegistrationNov 0.31087139765 0.01365863463 22.760
## monthOfRegistrationDec 0.27841743542 0.01361576333 20.448
## brandaudi 0.58678381973 0.06218332445 9.436
## brandbmw 0.48311567639 0.06637493611 7.279
## brandchevrolet 0.27272541440 0.07021690087 3.884
## brandchrysler -0.27697114537 0.08443821773 -3.280
## brandcitroen -0.61653464053 0.06546740973 -9.417
## branddacia 0.01476819156 0.22712337753 0.065
## branddaewoo -0.90808809785 0.11060591129 -8.210
## branddaihatsu -0.53807376866 0.13529715956 -3.977
## brandfiat -0.46751450083 0.06361390546 -7.349
## brandford -0.27390168638 0.06374083825 -4.297
## brandhonda -0.13332923953 0.08672830207 -1.537
## brandhyundai -0.66121691061 0.06795679366 -9.730
## brandjaguar 0.86623864656 0.09801562890 8.838
## brandjeep 0.23789259749 0.10008364830 2.377
## brandkia -0.62302393032 0.07985831905 -7.802
## brandlada 0.58866119327 0.33098847938 1.778
## brandlancia -0.22539098411 0.13387469894 -1.684
## brandland_rover 1.11553340515 0.25185247394 4.429
## brandmazda -0.44330689701 0.06479430257 -6.842
## brandmercedes_benz 0.70345334722 0.05997302611 11.729
## brandmini 0.76256776218 0.13870002070 5.498
## brandmitsubishi -0.44207499142 0.07444471787 -5.938
## brandnissan -0.07037554138 0.07260380765 -0.969
## brandopel -0.00528808555 0.06280271830 -0.084
## brandpeugeot -0.38018378713 0.06339038229 -5.997
## brandporsche 1.64499827552 0.08745759855 18.809
## brandrenault -0.09096493619 0.06780466349 -1.342
## brandrover -0.86902568710 0.08625862109 -10.075
## brandsaab -0.37482709976 0.09031086638 -4.150
## brandseat -0.66723008097 0.10677686650 -6.249
## brandskoda -0.59711620434 0.08978822485 -6.650
## brandsmart -0.33096290069 0.11311112291 -2.926
## brandsonstige_autos 1.38715022064 0.06484032018 21.393
## brandsubaru 0.32366648882 0.15614439723 2.073
## brandsuzuki -0.47949633793 0.06893267786 -6.956
## brandtoyota 0.03951973047 0.07138343134 0.554
## brandtrabant 0.94960837043 0.17059271798 5.567
## brandvolkswagen 0.29677346209 0.06235196581 4.760
## brandvolvo 0.19724043882 0.07916580622 2.491
## model1_reihe -0.01138263410 0.05096343079 -0.223
## model100 -0.00211364383 0.07454984455 -0.028
## model145 -0.61798594725 0.21840607264 -2.830
## model147 -0.19296581462 0.08216201178 -2.349
## model156 -0.33274515998 0.08366161827 -3.977
## model159 0.61240034746 0.10644311988 5.753
## model1er 0.24981629020 0.04076760161 6.128
## model2_reihe 0.31480054616 0.03884368391 8.104
## model200 1.13265737268 0.23275216546 4.866
## model3_reihe 0.30765396443 0.03849649791 7.992
## model300c 1.17703878685 0.12829834785 9.174
## model3er 0.07446703779 0.03559611465 2.092
## model4_reihe 0.33960660893 0.06588841994 5.154
## model5_reihe 1.12770538575 0.07312855271 15.421
## model500 0.84808770637 0.05662537533 14.977
## model5er 0.25027795569 0.03748302881 6.677
## model6_reihe 0.52601874876 0.05323659098 9.881
## model601 -0.40348084241 0.18578309157 -2.172
## model6er 1.07777543937 0.08977679546 12.005
## model7er 0.46995107865 0.05412986999 8.682
## model80 -0.49583113994 0.04802119112 -10.325
## model850 -0.30156607848 0.11802028778 -2.555
## model90 0.06469455706 0.16222779980 0.399
## model900 0.71740727007 0.14857829323 4.828
## model9000 0.96030380460 0.21537194830 4.459
## model911 0.99576857501 0.08311563872 11.981
## modela_klasse -0.60040052912 0.03567852340 -16.828
## modela1 0.05770501834 0.06357269857 0.908
## modela2 0.22876869936 0.07882727051 2.902
## modela3 -0.02131173524 0.03521603285 -0.605
## modela4 -0.04678581941 0.03361120562 -1.392
## modela5 0.36080504720 0.05277814820 6.836
## modela6 0.07038604380 0.03559965610 1.977
## modela8 0.31389022817 0.06582527212 4.769
## modelaccord 0.53351557832 0.10037915996 5.315
## modelagila -0.16526933581 0.07615420796 -2.170
## modelalhambra 1.14066684927 0.11731021942 9.724
## modelalmera -0.22733280140 0.07823784762 -2.906
## modelaltea 1.20592834071 0.11682730284 10.322
## modelamarok 0.35453068179 0.20515855008 1.728
## modelother 0.50060627180 0.01935210730 25.868
## modelantara 0.36169233779 0.13729944320 2.634
## modelarosa 0.39594268526 0.10400900940 3.807
## modelastra -0.04118940526 0.03312972853 -1.243
## modelauris 0.37462959514 0.10246635844 3.656
## modelavensis 0.48306476091 0.07515888591 6.427
## modelaveo -0.19693226678 0.16238139142 -1.213
## modelaygo 0.02605914955 0.08385466445 0.311
## modelb_klasse 0.02872039290 0.05585359082 0.514
## modelb_max 0.62861973716 0.24590376924 2.556
## modelbeetle 0.19108317359 0.05561390829 3.436
## modelberlingo 0.64340958555 0.06441685163 9.988
## modelbora -0.00590453390 0.05507913747 -0.107
## modelboxster -0.45897514955 0.09731227350 -4.717
## modelbravo 0.17407733884 0.09338843724 1.864
## modelc_klasse -0.24650132798 0.03120203154 -7.900
## modelc_max 0.72258800802 0.06425850299 11.245
## modelc_reihe 0.23720200762 0.11493461124 2.064
## modelc1 0.56590889752 0.07949412441 7.119
## modelc2 0.48492477133 0.08323529456 5.826
## modelc3 0.66539075271 0.07380051739 9.016
## modelc4 0.84829345683 0.07767474473 10.921
## modelc5 0.69209681728 0.07627616414 9.074
## modelcaddy 0.12557517641 0.04452345952 2.820
## modelcalibra -0.06958037859 0.12241049050 -0.568
## modelcaptiva 0.14364394636 0.11841799077 1.213
## modelcarisma -0.16008189328 0.09408891303 -1.701
## modelcarnival 0.35300000833 0.10767681703 3.278
## modelcayenne -0.29428510358 0.09995329872 -2.944
## modelcc 0.51631973433 0.09603701849 5.376
## modelceed 0.81689418628 0.09981558521 8.184
## modelcharade 0.05823465678 0.34922547377 0.167
## modelcherokee 0.35676190318 0.13356579843 2.671
## modelcitigo 0.54575078085 0.15162728733 3.599
## modelcivic 0.22995605884 0.07994966509 2.876
## modelcl 0.37488398383 0.08267049754 4.535
## modelclio -0.07002734060 0.04785112019 -1.463
## modelclk -0.17733444484 0.04340758225 -4.085
## modelclubman 0.02003214521 0.15972724345 0.125
## modelcolt 0.11389678435 0.07206919551 1.580
## modelcombo -0.08803794344 0.08124916023 -1.084
## modelcooper 0.00061191669 0.13115086675 0.005
## modelcordoba 0.24843443879 0.11667825827 2.129
## modelcorolla 0.14750599164 0.07558710963 1.951
## modelcorsa -0.22914560085 0.03305933751 -6.931
## modelcr_reihe 0.64571685418 0.10124467359 6.378
## modelcroma 0.53663167821 0.19824304755 2.707
## modelcrossfire 0.86750162280 0.19419248985 4.467
## modelcuore -0.01999841870 0.14887021185 -0.134
## modelcx_reihe 0.97660553711 0.11766240088 8.300
## modeldefender 0.29585709255 0.26781479069 1.105
## modeldelta 1.13205053755 0.28572962557 3.962
## modeldiscovery -0.14265305291 0.27668237951 -0.516
## modeldoblo 0.48878497153 0.08814620838 5.545
## modelducato 0.83461390481 0.07980698452 10.458
## modelduster 0.05012119020 0.24175829977 0.207
## modele_klasse -0.03631460016 0.03180895325 -1.142
## modelelefantino -0.97019267128 0.38490925503 -2.521
## modeleos 0.05335054903 0.06395116927 0.834
## modelescort -0.56993760981 0.05971651145 -9.544
## modelespace 0.11259546498 0.07766751676 1.450
## modelexeo 1.19940612955 0.15626475189 7.675
## modelfabia 0.80201865499 0.07807338203 10.273
## modelfiesta 0.06568576073 0.03933846585 1.670
## modelfocus 0.24919523074 0.03863671951 6.450
## modelforester 0.13771439076 0.18509634718 0.744
## modelforfour 0.49532013421 0.11937975728 4.149
## modelfortwo 0.08375617971 0.09982457413 0.839
## modelfox -0.23525151522 0.05758899879 -4.085
## modelfreelander -0.65763670539 0.25861733776 -2.543
## modelfusion 0.38445604191 0.09146094498 4.203
## modelg_klasse 1.43485835118 0.13341890520 10.755
## modelgalant -0.15323957352 0.11854251234 -1.293
## modelgalaxy 0.57842716458 0.05662397170 10.215
## modelgetz 0.48915401575 0.08379963025 5.837
## modelgl 0.63496639142 0.16302512176 3.895
## modelglk 0.12392414402 0.10049429084 1.233
## modelgolf -0.00150506197 0.02754359736 -0.055
## modelgrand 0.51226540152 0.07521422490 6.811
## modeli_reihe 0.86113207795 0.05841815200 14.741
## modeli3 0.44830386666 0.36831664569 1.217
## modelibiza 0.74432077551 0.09486638696 7.846
## modelimpreza 0.14665699921 0.17723824196 0.827
## modelinsignia 0.62556128835 0.06114340396 10.231
## modeljazz 0.47834305266 0.10787804561 4.434
## modeljetta 0.24393765758 0.07865660445 3.101
## modeljimny 0.78734816818 0.10790867200 7.296
## modeljuke 0.54514905724 0.13641931243 3.996
## modeljusty -0.99981199419 0.18767016824 -5.327
## modelka -0.41109860598 0.04487342824 -9.161
## modelkadett 0.75438052837 0.09419195048 8.009
## modelkaefer 2.15500041047 0.07759486153 27.772
## modelkalina -1.51524865402 0.79927793164 -1.896
## modelkalos 0.67087057770 0.17769454962 3.775
## modelkangoo -0.07778740359 0.06411749990 -1.213
## modelkappa 0.37937521570 0.30193363738 1.256
## modelkuga 0.73209168126 0.07385210095 9.913
## modellaguna -0.26350599224 0.05609689164 -4.697
## modellancer 0.68569793870 0.11057423511 6.201
## modellanos 0.13521239484 0.18318855475 0.738
## modellegacy -0.47009937036 0.18671504258 -2.518
## modelleon 1.15329650036 0.09843781098 11.716
## modellodgy 0.00590871736 0.30483360993 0.019
## modellogan -0.20545312467 0.23370109060 -0.879
## modellupo -0.36548836769 0.03932296750 -9.295
## modellybra -0.68136136429 0.23675729985 -2.878
## modelm_klasse 0.09672105874 0.05371718928 1.801
## modelm_reihe 1.15249014196 0.07866256102 14.651
## modelmateria 1.10104901770 0.32280327705 3.411
## modelmatiz -0.44828843200 0.08595353724 -5.215
## modelmegane -0.12458141898 0.04896760284 -2.544
## modelmeriva 0.22437842235 0.05444550512 4.121
## modelmicra -0.21265783967 0.06139609767 -3.464
## modelmii 0.60916634713 0.18343524668 3.321
## modelmodus 0.14961522078 0.08850118261 1.691
## modelmondeo 0.02831434065 0.04158796989 0.681
## modelmove 0.01954244787 0.22100021604 0.088
## modelmusa 0.37194596010 0.27239389445 1.365
## modelmustang 2.13228566589 0.08881319656 24.009
## modelmx_reihe 0.70119390195 0.05639453648 12.434
## modelnavara 0.90256301268 0.15490457286 5.827
## modelniva -0.98250970548 0.35867780824 -2.739
## modelnote 0.28410918920 0.13164536554 2.158
## modelnubira 0.47766905918 0.20167494931 2.369
## modeloctavia 1.12225086542 0.07760313224 14.461
## modelomega -0.32428092342 0.04987537826 -6.502
## modelone -0.08453142536 0.13566869383 -0.623
## modeloutlander 1.03339017205 0.13265009979 7.790
## modelpajero 1.06410808801 0.11437228156 9.304
## modelpanda 0.40426708828 0.06660627973 6.070
## modelpassat 0.01937765709 0.03010286931 0.644
## modelphaeton 0.54230070629 0.09816435157 5.524
## modelpicanto 0.57809241238 0.09001535598 6.422
## modelpolo -0.32059925090 0.02958975727 -10.835
## modelprimera -0.17810144631 0.07638565626 -2.332
## modelptcruiser 0.30763059538 0.10338333271 2.976
## modelpunto 0.10317506523 0.04344852730 2.375
## modelq3 0.04007188510 0.09258705211 0.433
## modelq5 0.45775977171 0.07220845386 6.339
## modelq7 0.58097307359 0.08319798891 6.983
## modelqashqai 0.63320560333 0.07502092613 8.440
## modelr19 -0.43631218156 0.15443495400 -2.825
## modelrange_rover -0.00661712923 0.28078960282 -0.024
## modelrange_rover_evoque -0.18424921195 0.32381663741 -0.569
## modelrange_rover_sport -0.03110997103 0.28604303921 -0.109
## modelrangerover 2.21153102268 0.42705951346 5.179
## modelrav 0.53049619888 0.09255330167 5.732
## modelrio 0.50409409984 0.09564287436 5.271
## modelroadster 0.32510687296 0.13536296630 2.402
## modelroomster 0.94613116257 0.11112148148 8.514
## modelrx_reihe 0.85497354811 0.12151252433 7.036
## models_klasse 0.62515953685 0.05120031322 12.210
## models_max 1.01104751822 0.07037696580 14.366
## models_type -0.33573436933 0.14998886588 -2.238
## models60 0.38567313374 0.12173367209 3.168
## modelsandero -0.02586361114 0.23563080704 -0.110
## modelsanta 0.95148884714 0.08758484167 10.864
## modelscenic -0.07969605446 0.05576278248 -1.429
## modelscirocco 0.38563996038 0.06319582281 6.102
## modelseicento -0.21163196538 0.06934070937 -3.052
## modelserie_2 1.81287487611 0.57177920757 3.171
## modelsharan 0.16792436507 0.04517939086 3.717
## modelsignum 0.34489192838 0.07404186516 4.658
## modelsirion 0.54353280120 0.17404561120 3.123
## modelsl 1.10110035526 0.06178985761 17.820
## modelslk -0.05144399168 0.04812059730 -1.069
## modelsorento 0.95588403301 0.10089506781 9.474
## modelspark -0.51316613272 0.12889681686 -3.981
## modelspider 1.19146571998 0.11558148296 10.308
## modelsportage 0.97317285963 0.10699847928 9.095
## modelsprinter 0.06101353943 0.05896821954 1.035
## modelstilo 0.22684268083 0.07096706326 3.196
## modelsuperb 1.28252388400 0.09939729185 12.903
## modelswift 0.46524956577 0.07016641812 6.631
## modelterios 0.86578849746 0.23766595758 3.643
## modeltigra -0.26719046220 0.05733049736 -4.661
## modeltiguan 0.32481301824 0.05606803389 5.793
## modeltoledo 0.44184114116 0.12354198395 3.576
## modeltouareg 0.56545447702 0.06049451241 9.347
## modeltouran 0.37178893147 0.03746092077 9.925
## modeltransit 0.57126154230 0.06382909675 8.950
## modeltransporter 0.86363392084 0.03428942391 25.187
## modeltt 0.29875317241 0.04806923586 6.215
## modeltucson 1.01778799903 0.10068198818 10.109
## modeltwingo -0.33506599807 0.04651773063 -7.203
## modelup -0.20781844970 0.08155868167 -2.548
## modelv_klasse -0.03855527468 0.12550436019 -0.307
## modelv40 -0.42733173631 0.07707635449 -5.544
## modelv50 0.36288892142 0.10495363662 3.458
## modelv60 0.59377796755 0.20383938050 2.913
## modelv70 0.16649610176 0.08038160258 2.071
## modelvectra -0.28614615736 0.03749596610 -7.631
## modelverso 0.62108545294 0.08699316669 7.139
## modelviano 0.40425952407 0.07833953721 5.160
## modelvito -0.13591008342 0.05553892429 -2.447
## modelvivaro 0.58948787061 0.08807654906 6.693
## modelvoyager -0.14520840198 0.10334463109 -1.405
## modelwrangler 0.89538443532 0.14250098856 6.283
## modelx_reihe 0.48331106619 0.04598194670 10.511
## modelx_trail 0.53771850742 0.10767064122 4.994
## modelx_type -0.88737391998 0.13666680643 -6.493
## modelxc_reihe 0.60611612730 0.09748027477 6.218
## modelyaris 0.22036479606 0.06436806786 3.424
## modelyeti 0.85618967337 0.12050066999 7.105
## modelypsilon -0.07928501556 0.15991262269 -0.496
## modelz_reihe 0.47876539790 0.06020964108 7.952
## modelzafira 0.27531323077 0.04111960039 6.695
## Pr(>|t|)
## (Intercept) < 2e-16 ***
## howLong < 2e-16 ***
## notRepairedDamageno < 2e-16 ***
## powerPS < 2e-16 ***
## kilometer < 2e-16 ***
## yearOfRegistration < 2e-16 ***
## gearboxmanual < 2e-16 ***
## vehicleTypeother < 2e-16 ***
## vehicleTypebus < 2e-16 ***
## vehicleTypecabriolet < 2e-16 ***
## vehicleTypecoupe < 2e-16 ***
## vehicleTypesmall_car < 2e-16 ***
## vehicleTypekombi < 2e-16 ***
## vehicleTypelimousine < 2e-16 ***
## vehicleTypesuv < 2e-16 ***
## fuelTypeother 4.66e-05 ***
## fuelTypegasoline < 2e-16 ***
## fuelTypecng 1.78e-05 ***
## fuelTypediesel < 2e-16 ***
## fuelTypeelectro 0.007296 **
## fuelTypehybrid 0.124290
## fuelTypelpg 3.77e-13 ***
## monthOfRegistrationJan < 2e-16 ***
## monthOfRegistrationFeb < 2e-16 ***
## monthOfRegistrationMar < 2e-16 ***
## monthOfRegistrationApr < 2e-16 ***
## monthOfRegistrationMay < 2e-16 ***
## monthOfRegistrationJun < 2e-16 ***
## monthOfRegistrationJul < 2e-16 ***
## monthOfRegistrationAug < 2e-16 ***
## monthOfRegistrationSep < 2e-16 ***
## monthOfRegistrationOct < 2e-16 ***
## monthOfRegistrationNov < 2e-16 ***
## monthOfRegistrationDec < 2e-16 ***
## brandaudi < 2e-16 ***
## brandbmw 3.40e-13 ***
## brandchevrolet 0.000103 ***
## brandchrysler 0.001038 **
## brandcitroen < 2e-16 ***
## branddacia 0.948156
## branddaewoo 2.24e-16 ***
## branddaihatsu 6.98e-05 ***
## brandfiat 2.01e-13 ***
## brandford 1.73e-05 ***
## brandhonda 0.124218
## brandhyundai < 2e-16 ***
## brandjaguar < 2e-16 ***
## brandjeep 0.017459 *
## brandkia 6.17e-15 ***
## brandlada 0.075326 .
## brandlancia 0.092263 .
## brandland_rover 9.46e-06 ***
## brandmazda 7.87e-12 ***
## brandmercedes_benz < 2e-16 ***
## brandmini 3.85e-08 ***
## brandmitsubishi 2.89e-09 ***
## brandnissan 0.332393
## brandopel 0.932896
## brandpeugeot 2.01e-09 ***
## brandporsche < 2e-16 ***
## brandrenault 0.179738
## brandrover < 2e-16 ***
## brandsaab 3.32e-05 ***
## brandseat 4.15e-10 ***
## brandskoda 2.94e-11 ***
## brandsmart 0.003434 **
## brandsonstige_autos < 2e-16 ***
## brandsubaru 0.038187 *
## brandsuzuki 3.52e-12 ***
## brandtoyota 0.579836
## brandtrabant 2.61e-08 ***
## brandvolkswagen 1.94e-06 ***
## brandvolvo 0.012723 *
## model1_reihe 0.823264
## model100 0.977381
## model145 0.004663 **
## model147 0.018846 *
## model156 6.98e-05 ***
## model159 8.78e-09 ***
## model1er 8.94e-10 ***
## model2_reihe 5.37e-16 ***
## model200 1.14e-06 ***
## model3_reihe 1.35e-15 ***
## model300c < 2e-16 ***
## model3er 0.036441 *
## model4_reihe 2.55e-07 ***
## model5_reihe < 2e-16 ***
## model500 < 2e-16 ***
## model5er 2.45e-11 ***
## model6_reihe < 2e-16 ***
## model601 0.029874 *
## model6er < 2e-16 ***
## model7er < 2e-16 ***
## model80 < 2e-16 ***
## model850 0.010614 *
## model90 0.690050
## model900 1.38e-06 ***
## model9000 8.25e-06 ***
## model911 < 2e-16 ***
## modela_klasse < 2e-16 ***
## modela1 0.364038
## modela2 0.003707 **
## modela3 0.545067
## modela4 0.163934
## modela5 8.18e-12 ***
## modela6 0.048027 *
## modela8 1.86e-06 ***
## modelaccord 1.07e-07 ***
## modelagila 0.029995 *
## modelalhambra < 2e-16 ***
## modelalmera 0.003666 **
## modelaltea < 2e-16 ***
## modelamarok 0.083977 .
## modelother < 2e-16 ***
## modelantara 0.008432 **
## modelarosa 0.000141 ***
## modelastra 0.213769
## modelauris 0.000256 ***
## modelavensis 1.31e-10 ***
## modelaveo 0.225218
## modelaygo 0.755979
## modelb_klasse 0.607107
## modelb_max 0.010579 *
## modelbeetle 0.000591 ***
## modelberlingo < 2e-16 ***
## modelbora 0.914630
## modelboxster 2.40e-06 ***
## modelbravo 0.062323 .
## modelc_klasse 2.81e-15 ***
## modelc_max < 2e-16 ***
## modelc_reihe 0.039039 *
## modelc1 1.10e-12 ***
## modelc2 5.70e-09 ***
## modelc3 < 2e-16 ***
## modelc4 < 2e-16 ***
## modelc5 < 2e-16 ***
## modelcaddy 0.004797 **
## modelcalibra 0.569752
## modelcaptiva 0.225123
## modelcarisma 0.088873 .
## modelcarnival 0.001045 **
## modelcayenne 0.003238 **
## modelcc 7.62e-08 ***
## modelceed 2.78e-16 ***
## modelcharade 0.867564
## modelcherokee 0.007563 **
## modelcitigo 0.000319 ***
## modelcivic 0.004025 **
## modelcl 5.78e-06 ***
## modelclio 0.143350
## modelclk 4.40e-05 ***
## modelclubman 0.900196
## modelcolt 0.114023
## modelcombo 0.278565
## modelcooper 0.996277
## modelcordoba 0.033238 *
## modelcorolla 0.051004 .
## modelcorsa 4.19e-12 ***
## modelcr_reihe 1.80e-10 ***
## modelcroma 0.006792 **
## modelcrossfire 7.93e-06 ***
## modelcuore 0.893138
## modelcx_reihe < 2e-16 ***
## modeldefender 0.269289
## modeldelta 7.44e-05 ***
## modeldiscovery 0.606146
## modeldoblo 2.94e-08 ***
## modelducato < 2e-16 ***
## modelduster 0.835761
## modele_klasse 0.253604
## modelelefantino 0.011718 *
## modeleos 0.404149
## modelescort < 2e-16 ***
## modelespace 0.147142
## modelexeo 1.66e-14 ***
## modelfabia < 2e-16 ***
## modelfiesta 0.094970 .
## modelfocus 1.13e-10 ***
## modelforester 0.456869
## modelforfour 3.34e-05 ***
## modelfortwo 0.401453
## modelfox 4.41e-05 ***
## modelfreelander 0.010995 *
## modelfusion 2.63e-05 ***
## modelg_klasse < 2e-16 ***
## modelgalant 0.196119
## modelgalaxy < 2e-16 ***
## modelgetz 5.33e-09 ***
## modelgl 9.83e-05 ***
## modelglk 0.217524
## modelgolf 0.956423
## modelgrand 9.77e-12 ***
## modeli_reihe < 2e-16 ***
## modeli3 0.223543
## modelibiza 4.34e-15 ***
## modelimpreza 0.407980
## modelinsignia < 2e-16 ***
## modeljazz 9.26e-06 ***
## modeljetta 0.001927 **
## modeljimny 2.98e-13 ***
## modeljuke 6.44e-05 ***
## modeljusty 9.98e-08 ***
## modelka < 2e-16 ***
## modelkadett 1.17e-15 ***
## modelkaefer < 2e-16 ***
## modelkalina 0.057993 .
## modelkalos 0.000160 ***
## modelkangoo 0.225056
## modelkappa 0.208943
## modelkuga < 2e-16 ***
## modellaguna 2.64e-06 ***
## modellancer 5.62e-10 ***
## modellanos 0.460452
## modellegacy 0.011813 *
## modelleon < 2e-16 ***
## modellodgy 0.984535
## modellogan 0.379334
## modellupo < 2e-16 ***
## modellybra 0.004004 **
## modelm_klasse 0.071775 .
## modelm_reihe < 2e-16 ***
## modelmateria 0.000648 ***
## modelmatiz 1.84e-07 ***
## modelmegane 0.010956 *
## modelmeriva 3.77e-05 ***
## modelmicra 0.000533 ***
## modelmii 0.000898 ***
## modelmodus 0.090927 .
## modelmondeo 0.495981
## modelmove 0.929537
## modelmusa 0.172108
## modelmustang < 2e-16 ***
## modelmx_reihe < 2e-16 ***
## modelnavara 5.68e-09 ***
## modelniva 0.006159 **
## modelnote 0.030919 *
## modelnubira 0.017862 *
## modeloctavia < 2e-16 ***
## modelomega 7.97e-11 ***
## modelone 0.533238
## modeloutlander 6.75e-15 ***
## modelpajero < 2e-16 ***
## modelpanda 1.29e-09 ***
## modelpassat 0.519762
## modelphaeton 3.31e-08 ***
## modelpicanto 1.35e-10 ***
## modelpolo < 2e-16 ***
## modelprimera 0.019723 *
## modelptcruiser 0.002925 **
## modelpunto 0.017568 *
## modelq3 0.665159
## modelq5 2.32e-10 ***
## modelq7 2.91e-12 ***
## modelqashqai < 2e-16 ***
## modelr19 0.004726 **
## modelrange_rover 0.981199
## modelrange_rover_evoque 0.569363
## modelrange_rover_sport 0.913393
## modelrangerover 2.24e-07 ***
## modelrav 9.97e-09 ***
## modelrio 1.36e-07 ***
## modelroadster 0.016319 *
## modelroomster < 2e-16 ***
## modelrx_reihe 1.99e-12 ***
## models_klasse < 2e-16 ***
## models_max < 2e-16 ***
## models_type 0.025198 *
## models60 0.001534 **
## modelsandero 0.912597
## modelsanta < 2e-16 ***
## modelscenic 0.152951
## modelscirocco 1.05e-09 ***
## modelseicento 0.002273 **
## modelserie_2 0.001522 **
## modelsharan 0.000202 ***
## modelsignum 3.20e-06 ***
## modelsirion 0.001791 **
## modelsl < 2e-16 ***
## modelslk 0.285044
## modelsorento < 2e-16 ***
## modelspark 6.86e-05 ***
## modelspider < 2e-16 ***
## modelsportage < 2e-16 ***
## modelsprinter 0.300819
## modelstilo 0.001392 **
## modelsuperb < 2e-16 ***
## modelswift 3.36e-11 ***
## modelterios 0.000270 ***
## modeltigra 3.16e-06 ***
## modeltiguan 6.93e-09 ***
## modeltoledo 0.000348 ***
## modeltouareg < 2e-16 ***
## modeltouran < 2e-16 ***
## modeltransit < 2e-16 ***
## modeltransporter < 2e-16 ***
## modeltt 5.15e-10 ***
## modeltucson < 2e-16 ***
## modeltwingo 5.93e-13 ***
## modelup 0.010833 *
## modelv_klasse 0.758690
## modelv40 2.96e-08 ***
## modelv50 0.000545 ***
## modelv60 0.003581 **
## modelv70 0.038331 *
## modelvectra 2.34e-14 ***
## modelverso 9.44e-13 ***
## modelviano 2.47e-07 ***
## modelvito 0.014402 *
## modelvivaro 2.20e-11 ***
## modelvoyager 0.159998
## modelwrangler 3.33e-10 ***
## modelx_reihe < 2e-16 ***
## modelx_trail 5.92e-07 ***
## modelx_type 8.46e-11 ***
## modelxc_reihe 5.06e-10 ***
## modelyaris 0.000618 ***
## modelyeti 1.21e-12 ***
## modelypsilon 0.620035
## modelz_reihe 1.86e-15 ***
## modelzafira 2.16e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7293 on 95316 degrees of freedom
## Multiple R-squared: 0.6622, Adjusted R-squared: 0.6611
## F-statistic: 585.7 on 319 and 95316 DF, p-value: < 2.2e-16
The model formula is not that long, but due to the many dummies created from the categorical variables, the number of estimates is very large. What is satisfying, though, is that our model includes mostly significant variables - only some particular dummies are not statistically different from zero. Also, the R-squared statistic is fairly high - we have managed to explain about 66% of the logprice variance.
#Examine the error of the model
autos_predicted <- predict(autos_lm1)
ggplot(data.frame(error = autos_train$logprice - autos_predicted),
aes(x = error)) + geom_histogram(fill = "blue", bins = 100) + theme_bw()
The plot above shows the distribution of the model's residuals. The errors seem to be normally distributed around zero. There are some outliers with values around -7, but there are not too many of them. The largest part of the distribution is highly concentrated around the mean. Our model seems to work pretty well.
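As a complementary diagnostic (a sketch only, not part of the original analysis), the near-normality of the residuals could also be inspected with a quantile-quantile plot.
#Q-Q plot of the residuals against theoretical normal quantiles;
#points lying close to the line support the visual impression of normality
qqnorm(autos_train$logprice - autos_predicted)
qqline(autos_train$logprice - autos_predicted)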
In the next stage we will see whether there are any variables that could still be dropped from the model. Due to the large size of the model, we will use two automatic methods - one based on p-values and the other based on the AIC.
require("olsrr")
ols_step_backward_p(autos_lm1, prem = 0.05, progress = FALSE) -> autos_lm1_backward_p
ols_step_backward_aic(autos_lm1, progress = FALSE) -> autos_lm1_backward_AIC
save(list = c("autos_lm1_backward_p", "autos_lm1_backward_AIC"), file = "autos_backward.RData")
# a list of removed terms from p-value evaluation should be available here
autos_lm1_backward_p$removed
## NULL
# a list of removed terms from AIC evaluation should be available here
autos_lm1_backward_AIC$predictors
## NULL
Both methods indicated that no variables should be dropped. We can say that our linear model is complete. Let's save it into the final object and move on to other methods.
autos_regression <- autos_lm1
We will use the KNN method to predict the logarithmic price of a car. First we will set k equal to the square root of n. We will also estimate a variant with the scale preprocessing option. After that, the k parameter will be optimised using cross-validation with three folds.
require('dplyr')
ctrl_nocv <- trainControl(method = "none")
sroot <- sqrt(nrow(autos_train))
k_value <- data.frame(k = round(sroot))
paste0(c('The k parameter is equal to: '), round(sroot))
set.seed(111)
autos_train_knn <-
train(logprice ~ .,
data = autos_train %>%
dplyr::select(-price),
method = "knn",
trControl = ctrl_nocv,
tuneGrid = k_value)
autos_train_knn_scaled <-
train(logprice ~ .,
data = autos_train %>%
dplyr::select(-price),
method = "knn",
trControl = ctrl_nocv,
tuneGrid = k_value,
preProcess = c("scale"))
save(list = c("autos_train_knn", "autos_train_knn_scaled"), file = "autos_knn.RData")
Due to complexity issues, the cross-validation needed to be done on a subsample of the initial training dataset. We decided to run it on 50% of the observations from that dataset. That was computable, but still computationally intensive (it took 6 hours to train the two models below on a machine with 32 GB of RAM, with parallel computation on 5 cores).
set.seed(111) #seed for reproducibility
#create a subsample of the training dataset for the crossvalidation
autos_which_train2 <- createDataPartition(autos_train$logprice, p = 0.5,list = FALSE)
autos_trainSub1 <- autos_train[autos_which_train2,]
autos_trainSub2 <- autos_train[-autos_which_train2,]
save(list = c("autos_trainSub1", "autos_trainSub2"), file = "autos_train_subsample.RData")
require('doParallel')
cl <- makePSOCKcluster(5)
registerDoParallel(cl)
different_k <- data.frame(k = seq(300, 315, 5))
ctrl_cv3 <- trainControl(method = "cv", number = 3)
set.seed(111)
autos_train_knn_tune <-
train(logprice ~ .,
data = autos_trainSub1 %>%
dplyr::select(-price),
method = "knn",
trControl = ctrl_cv3,
tuneGrid = different_k)
autos_train_knn_scaled_tune <-
train(logprice ~ .,
data = autos_trainSub1 %>%
dplyr::select(-price),
method = "knn",
trControl = ctrl_cv3,
tuneGrid = different_k,
# data transformation
preProcess = c("scale"))
save(list = c("autos_train_knn_tune", "autos_train_knn_scaled_tune"), file = "autos_knn_cv.RData")
stopCluster(cl)
In the next step we will compare the performance of the four competing KNN models on the train sample. To do so we will use the regressionMetrics wrapper function written by dr hab. Piotr Wójcik.
#computes a set of standard regression error measures for given real and predicted values
regressionMetrics <- function(real, predicted) {
MSE <- mean((real - predicted)^2)
RMSE <- sqrt(MSE)
MAE <- mean(abs(real - predicted))
MedAE <- median(abs(real - predicted))
MSLE <- mean((log(1 + real) - log(1 + predicted))^2)
TSS <- sum((real - mean(real))^2)
RSS <- sum((predicted - real)^2)
R2 <- 1 - RSS/TSS
result <- data.frame(MSE, RMSE, MAE, MedAE, MSLE, R2)
return(result)
}
In the first step we will make predictions from those four models on the full train sample. Surprisingly, this process also took a lot of time for the KNN models, so it was also executed using the doParallel package.
cl <- makePSOCKcluster(5)
registerDoParallel(cl)
autos_train_knn_p = predict(autos_train_knn , autos_train)
autos_train_knn_scaled_p = predict(autos_train_knn_scaled, autos_train)
save(list = c("autos_train_knn_p", "autos_train_knn_scaled_p"), file = "autos_prediction_normal.RData")
autos_train_knn_tune_p = predict(autos_train_knn_tune, autos_train)
autos_train_knn_scaled_tune_p = predict(autos_train_knn_scaled_tune, autos_train)
save(list = c("autos_train_knn_tune_p", "autos_train_knn_scaled_tune_p"), file = "autos_prediction_cv.RData")
stopCluster(cl)
autos_train_predictions_knn <-
data.frame(autos_train_knn_p,
autos_train_knn_scaled_p,
autos_train_knn_tune_p,
autos_train_knn_scaled_tune_p)
knn_comparison <- sapply(autos_train_predictions_knn,
function(x) regressionMetrics(real = autos_train$logprice,
predicted = x)) %>%
t()
save(list = c("knn_comparison"), file = "autos_prediction_quality.RData")
knn_comparison
## MSE RMSE MAE MedAE
## autos_train_knn_p 0.5669199 0.7529408 0.4953571 0.3564015
## autos_train_knn_scaled_p 0.8609624 0.9278806 0.7058453 0.5944562
## autos_train_knn_tune_p 0.6146339 0.7839859 0.5281713 0.3900608
## autos_train_knn_scaled_tune_p 0.9320884 0.9654472 0.7451484 0.6394451
## MSLE R2
## autos_train_knn_p 0.01527081 0.638773
## autos_train_knn_scaled_p 0.01877466 0.4514166
## autos_train_knn_tune_p 0.0159009 0.6083709
## autos_train_knn_scaled_tune_p 0.0195606 0.4060969
As we can see from the table above, it seems that the simplest model is the best one. It has the smallest errors and the highest R-squared statistic. We will save it in a new object and then use it for the final model comparison.
autos_knn <- autos_train_knn
In the last step we will estimate a ridge regression in two variants: with and without cross-validation used to tune the lambda parameter.
Let’s estimate both models and then compare their efficiency on the training dataset to choose the best one.
require('dplyr')
ctrl_nocv <- trainControl(method = "none")
set.seed(111)
autos_ridge_nocv <- train(logprice ~ .,
data = autos_train %>%
dplyr::select(-price),
method = "glmnet",
trControl = ctrl_nocv)
require("caret")
lambdas <- exp(log(10)*seq(-2, 9, length.out = 200))
ctrl_cv3 <- trainControl(method = "cv", number = 3)
parameters_ridge <- expand.grid(alpha = 0, lambda = lambdas)
set.seed(111)
autos_ridge_cv <- train(logprice ~ .,
data = autos_train %>%
dplyr::select(-price),
method = "glmnet",
tuneGrid = parameters_ridge,
trControl = ctrl_cv3)
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info =
## trainInfo, : There were missing values in resampled performance measures.
autos_train_ridge_nocv_p = predict(autos_ridge_nocv, autos_train)
autos_train_ridge_cv_p = predict(autos_ridge_cv, autos_train)
autos_train_predictions_ridge <-
data.frame(autos_train_ridge_nocv_p,
autos_train_ridge_cv_p)
ridge_comparison <- sapply(autos_train_predictions_ridge,
function(x) regressionMetrics(real = autos_train$logprice,
predicted = x)) %>%
t()
ridge_comparison
## MSE RMSE MAE MedAE
## autos_train_ridge_nocv_p 0.5348312 0.7313216 0.4759086 0.3437378
## autos_train_ridge_cv_p 0.544384 0.7378238 0.4846693 0.3535332
## MSLE R2
## autos_train_ridge_nocv_p 0.01472735 0.6592191
## autos_train_ridge_cv_p 0.0148823 0.6531323
It seems that the standard model, without cross-validation, works best in this case. Once again we will store it for the final analysis.
autos_ridge <- autos_ridge_nocv
To compare the out-of-sample accuracy we will check the results of models from the three families on the test dataset. We will make forecasts on the new sample and then compare them to the real values.
autos_lm_f = predict(autos_regression, autos_test)
autos_ridge_f = predict(autos_ridge, autos_test)
autos_knn_f = predict(autos_knn, autos_test)
We compare the forecasts with the real values and compute the measures of fit.
autos_forecasts <-
data.frame(autos_lm_f, autos_knn_f, autos_ridge_f)
final_comparison <- sapply(autos_forecasts,
function(x) regressionMetrics(real = autos_test$logprice,
predicted = x)) %>%
t()
final_comparison
## MSE RMSE MAE MedAE MSLE R2
## autos_lm_f 0.5456356 0.7386715 0.4728525 0.3367709 0.01524542 0.6526545
## autos_knn_f 0.5796483 0.7613464 0.4966286 0.355761 0.01594923 0.6310024
## autos_ridge_f 0.550027 0.7416381 0.4767605 0.3423969 0.0153393 0.649859
It turns out that the standard linear regression performs best when forecasting the logarithmic car price on a new dataset. It has the highest R-squared value and gives the lowest errors. One also needs to keep in mind that in our case the linear regression provided even more benefits than the other methods - it not only gives the best predictions and forecasts, but also has a form that is easy to interpret. Our case is proof that sometimes the simplest models are the best ones.
Due to the development of non-physical digital audio files and the increasing accessibility of online resources of downloadable music, a small revolution in the music industry started at the turn of the second and third millennium. Nowadays over 60% of the revenues from recorded music in the USA come from streaming services such as Spotify, Tidal or Apple Music, and the upward trend of this sector doesn’t stop5. Streaming platforms enable access to millions of songs for little or no monthly fee. This makes them extremely convenient for music listeners, but also for data analysts. Thanks to the large databases collected by the streaming companies it is possible to analyze the music market in a way previously unobtainable, also for songs recorded long before the popularity of the streaming platforms. The details about all of the songs, their structure and character are perfect for discovering which songs are destined for success and which ones will fail to reach the top. Can we reveal the secret of commercial success? In this article we use the characteristics of tracks from the beginning of the third millennium to predict whether a song was a hit or not.
library(dplyr)
library(data.table)
library(stringr)
library(ggplot2)
library(stargazer)
library(corrplot)
library(caret)
library(smbinning)
library(bestNormalize)
library(lmtest)
library(verification)
library(knitr)
The dataset consists of 5872 observations and 19 features of tracks from the years 2000-2009. The data was fetched with spotipy, a Python module for Spotify’s API, and with the analogous module billboard for Billboard’s API. The billboard module was used to extract information about the Billboard Hot 100 weekly record chart, published by Billboard magazine. This chart is created based on radio play, sales and online streaming of songs in the USA and is one of the best indicators of a song’s commercial success in the United States. The dataset was created by Farooq Ansari. More information about the dataset and the author can be found here.
spotify <- read.csv('spotify.csv')
glimpse(spotify)
## Observations: 5,872
## Variables: 19
## $ track <fct> "Lucky Man", "On The Hotline", "Clouds Of Dem...
## $ artist <fct> Montgomery Gentry, Pretty Ricky, Candlemass, ...
## $ uri <fct> spotify:track:4GiXBCUF7H6YfNQsnBRIzl, spotify...
## $ danceability <dbl> 0.578, 0.704, 0.162, 0.188, 0.630, 0.726, 0.3...
## $ energy <dbl> 0.471, 0.854, 0.836, 0.994, 0.764, 0.837, 0.9...
## $ key <int> 4, 10, 9, 4, 2, 11, 1, 11, 10, 7, 8, 2, 1, 0,...
## $ loudness <dbl> -7.270, -5.477, -3.009, -3.745, -4.353, -7.22...
## $ mode <int> 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 0, ...
## $ speechiness <dbl> 0.0289, 0.1830, 0.0473, 0.1660, 0.0275, 0.096...
## $ acousticness <dbl> 0.36800000, 0.01850000, 0.00011100, 0.0000073...
## $ instrumentalness <dbl> 0.00000000, 0.00000000, 0.00457000, 0.0784000...
## $ liveness <dbl> 0.1590, 0.1480, 0.1740, 0.1920, 0.1250, 0.136...
## $ valence <dbl> 0.5320, 0.6880, 0.3000, 0.3330, 0.6310, 0.969...
## $ tempo <dbl> 133.061, 92.988, 86.964, 148.440, 112.098, 13...
## $ duration_ms <int> 196707, 242587, 338893, 255667, 193760, 19272...
## $ time_signature <int> 4, 4, 4, 4, 4, 4, 4, 4, 4, 3, 4, 4, 4, 4, 4, ...
## $ chorus_hit <dbl> 30.88059, 41.51106, 65.32887, 58.59528, 22.62...
## $ sections <int> 13, 10, 13, 9, 10, 10, 4, 10, 11, 16, 9, 9, 9...
## $ target <int> 1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 1, 0, ...
The variables in the dataset correspond to the audio features stored and shared by Spotify; more details about them can be found here. The Billboard Hot 100 for every week since the first release of the chart on the 4th August 1958 is available here.
summary(spotify)
## track artist
## Breathe : 5 Toby Keith : 27
## Angel : 4 Rascal Flatts: 24
## Beautiful : 4 Tim McGraw : 24
## Closer : 4 Iron Maiden : 23
## Forever : 4 Kenny Chesney: 23
## Girlfriend: 4 George Strait: 22
## (Other) :5847 (Other) :5729
## uri danceability
## spotify:track:0t9Jd84JnsV8HRMaQzHUom: 2 Min. :0.0588
## spotify:track:0Uhnzk5zI46IRlQ04LNOtc: 2 1st Qu.:0.4160
## spotify:track:1mJ05BN0So26a14iib85aI: 2 Median :0.5560
## spotify:track:1qHRFZE8qykNXYZadzmi1m: 2 Mean :0.5429
## spotify:track:2aIdVb8v9KTpEZnftkz2mD: 2 3rd Qu.:0.6810
## spotify:track:3f3omU8n47Mqyab5nCaGyT: 2 Max. :0.9860
## (Other) :5860
## energy key loudness mode
## Min. :0.000348 Min. : 0.000 Min. :-47.327 Min. :0.0000
## 1st Qu.:0.567000 1st Qu.: 2.000 1st Qu.: -8.315 1st Qu.:0.0000
## Median :0.744000 Median : 5.000 Median : -6.042 Median :1.0000
## Mean :0.694511 Mean : 5.276 Mean : -7.449 Mean :0.6451
## 3rd Qu.:0.885000 3rd Qu.: 8.000 3rd Qu.: -4.562 3rd Qu.:1.0000
## Max. :0.999000 Max. :11.000 Max. : 1.137 Max. :1.0000
##
## speechiness acousticness instrumentalness liveness
## Min. :0.02240 Min. :0.000000 Min. :0.0000000 Min. :0.0193
## 1st Qu.:0.03617 1st Qu.:0.004553 1st Qu.:0.0000000 1st Qu.:0.0937
## Median :0.05270 Median :0.060300 Median :0.0000218 Median :0.1310
## Mean :0.09236 Mean :0.214374 Mean :0.1509172 Mean :0.1961
## 3rd Qu.:0.10700 3rd Qu.:0.312000 3rd Qu.:0.0472750 3rd Qu.:0.2630
## Max. :0.95000 Max. :0.996000 Max. :0.9980000 Max. :0.9870
##
## valence tempo duration_ms time_signature
## Min. :0.0000 Min. : 46.76 Min. : 15920 Min. :0.000
## 1st Qu.:0.2780 1st Qu.: 96.98 1st Qu.: 206813 1st Qu.:4.000
## Median :0.4860 Median :120.00 Median : 238006 Median :4.000
## Mean :0.4823 Mean :121.61 Mean : 258171 Mean :3.914
## 3rd Qu.:0.6870 3rd Qu.:141.50 3rd Qu.: 279160 3rd Qu.:4.000
## Max. :0.9820 Max. :213.23 Max. :4170227 Max. :5.000
##
## chorus_hit sections target
## Min. : 0.00 Min. : 1.00 Min. :0.0
## 1st Qu.: 27.51 1st Qu.: 9.00 1st Qu.:0.0
## Median : 36.04 Median : 10.00 Median :0.5
## Mean : 40.73 Mean : 11.06 Mean :0.5
## 3rd Qu.: 47.89 3rd Qu.: 12.00 3rd Qu.:1.0
## Max. :262.62 Max. :169.00 Max. :1.0
##
spotify %>%
is.na() %>%
colSums() %>%
sort()
## track artist uri danceability
## 0 0 0 0
## energy key loudness mode
## 0 0 0 0
## speechiness acousticness instrumentalness liveness
## 0 0 0 0
## valence tempo duration_ms time_signature
## 0 0 0 0
## chorus_hit sections target
## 0 0 0
There are no missing values in the dataset, so we don’t have to worry about that.
table(duplicated(spotify$uri))
##
## FALSE TRUE
## 5855 17
As stated in the variable description, uri is the unique identifier of a song. However, some of the uri values repeat, so this needs to be checked.
setDT(spotify)[, .N, uri] %>% merge(spotify, by = "uri") %>% .[.$N>1] -> duplicates
kable(head(duplicates[,c(3:13)],10))
track | artist | danceability | energy | key | loudness | mode | speechiness | acousticness | instrumentalness | liveness |
---|---|---|---|---|---|---|---|---|---|---|
I Don’t Care (feat. Adam Gontier) | Apocalyptica | 0.311 | 0.744 | 2 | -4.615 | 0 | 0.0436 | 0.00998 | 0.0000000 | 0.0799 |
I Don’t Care | Apocalyptica Featuring Adam Gontier | 0.311 | 0.744 | 2 | -4.615 | 0 | 0.0436 | 0.00998 | 0.0000000 | 0.0799 |
Hey Ma | Cam’ron | 0.791 | 0.666 | 8 | -5.484 | 1 | 0.3410 | 0.02560 | 0.0000063 | 0.0796 |
Hey Ma | Cam’Ron Featuring Juelz Santana, Freekey Zekey & Toya | 0.791 | 0.666 | 8 | -5.484 | 1 | 0.3410 | 0.02560 | 0.0000063 | 0.0796 |
All I Want To Do | Sugarland | 0.709 | 0.725 | 4 | -4.999 | 1 | 0.0556 | 0.20800 | 0.0000454 | 0.1580 |
Want To | Sugarland | 0.709 | 0.725 | 4 | -4.999 | 1 | 0.0556 | 0.20800 | 0.0000454 | 0.1580 |
Give It Away | George Strait | 0.674 | 0.646 | 5 | -5.876 | 1 | 0.0270 | 0.48600 | 0.0000837 | 0.2620 |
It Just Comes Natural | George Strait | 0.674 | 0.646 | 5 | -5.876 | 1 | 0.0270 | 0.48600 | 0.0000837 | 0.2620 |
Buy U A Drank (Shawty Snappin’) | T-Pain Featuring Yung Joc | 0.451 | 0.550 | 1 | -8.137 | 1 | 0.2620 | 0.01080 | 0.0000000 | 0.0737 |
Shawty | Plies Featuring T-Pain | 0.451 | 0.550 | 1 | -8.137 | 1 | 0.2620 | 0.01080 | 0.0000000 | 0.0737 |
To make the output clear only selected variables are presented, but it still looks like these doubled observations may be the same songs, despite some slight differences in the track or artist names. If the rest of the features are the same, we can simply remove the duplicates. Let’s check whether, and which, observations differ more than that.
for (i in seq(1, nrow(duplicates) - 1, 2)) {
if(!identical(duplicates[i, 5:20], duplicates[i+1, 5:20])) {
print(paste("observations:", i, "and", i+1))
for (j in 5:ncol(duplicates)){
if(duplicates[i, ..j] != duplicates[i + 1, ..j]){
print(paste("difference in: ", colnames(duplicates)[j]))
}
}
}
}
## [1] "observations: 1 and 2"
## [1] "difference in: target"
## [1] "observations: 3 and 4"
## [1] "difference in: target"
## [1] "observations: 15 and 16"
## [1] "difference in: target"
## [1] "observations: 25 and 26"
## [1] "difference in: target"
## [1] "observations: 29 and 30"
## [1] "difference in: target"
## [1] "observations: 31 and 32"
## [1] "difference in: target"
table(duplicates$target)
##
## 0 1
## 6 28
Many duplicates have exactly the same features, but there are 6 pairs of observations that differ. The only difference within those pairs is in the target variable. We can also see that there are only 6 observations with the target value of 0, and these are exactly the ones forming the differing pairs.
kable(duplicates[c(1:4, 15:16, 25:26, 29:32), 3:4])
track | artist |
---|---|
I Don’t Care (feat. Adam Gontier) | Apocalyptica |
I Don’t Care | Apocalyptica Featuring Adam Gontier |
Hey Ma | Cam’ron |
Hey Ma | Cam’Ron Featuring Juelz Santana, Freekey Zekey & Toya |
South Side | Moby |
South Side | Moby Featuring Gwen Stefani |
No One Knows | Queens Of The Stone Age |
No One Knows | Queens of the Stone Age |
When Love Takes Over | David Guetta Featuring Kelly Rowland |
When Love Takes Over (feat. Kelly Rowland) | David Guetta |
Jai Ho! (You Are My Destiny) | A R Rahman & The Pussycat Dolls Featuring Nicole Scherzinger |
Jai Ho! (You Are My Destiny) | A.R. Rahman |
These pairs are obviously the same songs coded twice, so we can easily verify on the official site of Billboard magazine that each of these songs did enter the chart. All pairs share the same problem and all of them can be found on the Billboard Hot 100 chart. In most cases these are songs recorded as a collaboration of several artists, and this information is sometimes stored in the artist variable, sometimes in the track variable and sometimes nowhere at all. The absence of some of the participating artists in the artist variable results in a wrong value of the target variable. Therefore we should remove the observation with target equal to 0 whenever such a case occurs in a pair, and either observation otherwise. As we don’t need the exact names of tracks and artists, it will be easier to remove the duplicates regardless of the target variable and then set its value to 1 for the remaining halves of the pairs.
spotify <- distinct(spotify, uri, .keep_all = TRUE)
spotify[spotify$uri %in% duplicates$uri,]$target <- 1
setDF(spotify)
In this analysis we focus on songs, as these are the tracks taken into account in the Billboard charts. However, on the Spotify platform there are also tracks containing speeches, stand-up comedy and other audio files that should be removed. We can do this using the speechiness variable. Values between 0.33 and 0.66 represent tracks with many words, for example rap songs, while values over 0.66 represent tracks which are probably made entirely of spoken words, like speeches or other recordings that aren’t songs. Based on that definition we will remove the twelve observations with speechiness over 0.66.
sum(spotify$speechiness > 0.66)
## [1] 12
spotify <- spotify[which(spotify$speechiness <= 0.66), ]
table(spotify$time_signature)
##
## 0 1 3 4 5
## 1 52 423 5284 83
There is also only one observation with time_signature equal to 0. No time signature is possible and is often denoted as 0/0, but with just one observation this level wouldn’t be meaningful and we can’t group it with the other levels, so this observation will also be removed8.
spotify <- spotify[which(spotify$time_signature!=0), ]
spotify$time_signature <- as.factor(spotify$time_signature)
spotify$time_signature <- droplevels(spotify$time_signature)
summary(spotify$time_signature)
## 1 3 4 5
## 52 423 5284 83
kable(head(spotify[order(spotify$chorus_hit), c(1:2, 17:18)], 15))
track | artist | chorus_hit | sections | |
---|---|---|---|---|
146 | Full Metal Swimsuit | Agoraphobic Nosebleed | 0.00000 | 2 |
256 | Interzone, Pt. 1: Part I: Why don’t you have a sode pop | Enno Poppe | 0.00000 | 2 |
632 | Ten | Parts & Labor | 0.00000 | 2 |
667 | Oxygen Corrosion | Insect Warfare | 0.00000 | 2 |
1592 | Concrete Waves | Parts & Labor | 0.00000 | 2 |
2123 | 10,000 Angry Hardcore Kids - Wormwood Album Version | Few Left Standing | 0.00000 | 2 |
2219 | Under an Ancient Oak | Tears of Mankind | 0.00000 | 2 |
4571 | American Hardcore | Weekend Nachos | 0.00000 | 2 |
4947 | Mega Armageddon Death Extented “Live” | Electro Hippies | 0.00000 | 2 |
4958 | Assimilated Pollutants | Magrudergrind | 0.00000 | 1 |
5054 | Heavy Crowns | Half the World | 0.00000 | 2 |
5124 | You’re the New Metalcore | Tiefighter | 0.00000 | 2 |
4179 | Suffer | Rudimentary Peni | 4.98552 | 3 |
3798 | Inner Metal Cage | Blut Aus Nord | 7.11360 | 3 |
744 | Few and Far Between | Shutdown | 7.67978 | 3 |
The last problem concerns the chorus_hit variable. There are some observations in which its value suggests that the chorus starts at the very beginning of the track. Keeping in mind that this variable denotes the start of the third section, these values seem rather unrealistic. One can see that all those observations have fewer than 3 sections, so in fact these are missing values arising when there is no third section in the song. These observations must be removed from the data.
spotify <- spotify[which(spotify$sections>3), ]
ggplot(spotify,
aes(x = speechiness)) +
geom_histogram(fill = "green",
bins = 100) +
theme_bw()
ggplot(spotify,
aes(x = chorus_hit)) +
geom_histogram(fill = "green",
bins = 100) +
theme_bw()
Continuous variables in this dataset really differ in their distributions. Two of them, speechiness and chorus_hit, are clearly right-skewed and we might consider a logarithmic or another transformation.
speech_yeojohnson <- yeojohnson(spotify$speechiness)
speech_yeojohnson$lambda
## [1] -4.99994
The Yeo-Johnson procedure suggests transforming the speechiness variable with a lambda close to -5. This transformation will be hard to interpret, but the distribution of the transformed variable, although still far from perfect, is likely to be much better for modelling.
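For non-negative values such as speechiness, the Yeo-Johnson transformation with lambda different from zero is simply ((x + 1)^lambda - 1) / lambda. A minimal sketch of that formula is shown below; note that bestNormalize::yeojohnson() additionally standardizes the transformed values by default, so its x.t is centred and scaled.
# Hedged sketch of the Yeo-Johnson formula for non-negative x and lambda != 0.
# Up to centring and scaling this should agree with speech_yeojohnson$x.t.
yj_nonneg <- function(x, lambda) ((x + 1)^lambda - 1) / lambda
head(scale(yj_nonneg(spotify$speechiness, speech_yeojohnson$lambda)))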
spotify$speechiness_yeo <-
speech_yeojohnson$x.t
ggplot(spotify,
aes(x = speechiness_yeo)) +
geom_histogram(fill = "green",
bins = 100) +
theme_bw()
chorus_yeojohnson <- yeojohnson(spotify$chorus_hit)
chorus_yeojohnson$lambda
## [1] -0.4587033
The same procedure repeated for the chorus_hit variable results in a lambda of about -0.46. This transformation gives a much more symmetric distribution, so this variable will also be used in its transformed form.
spotify$chorus_yeo <-
chorus_yeojohnson$x.t
ggplot(spotify,
aes(x = chorus_yeo)) +
geom_histogram(fill = "green",
bins = 100) +
theme_bw()
summary(spotify$duration_ms)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 54040 207554 238560 259319 279560 4170227
sum(head(spotify[order(spotify$duration_ms), "target"], 20) == "1")
## [1] 0
sum(tail(spotify[order(spotify$duration_ms), "target"], 20) == "1")
## [1] 0
The shortest song in the dataset lasts 54 seconds, while the longest over one hour. There is not a single song with target equal to 1 among the 20 shortest or the 20 longest songs. We can expect that neither the shortest nor the longest songs have the slightest chance to appear on the Billboard charts. The relationship between duration_ms and target is therefore likely to be nonlinear, perhaps much closer to quadratic. That’s why we decided to do supervised binning based on Weight of Evidence (WOE). Thanks to that, instead of the original variable we can use a set of three WOE values which represent the predictive power of the groups (a by-hand sketch of a single WoE value follows the table below).
duration_binned <-
smbinning(df = spotify,
y = "target",
x = "duration_ms",
p = 0.1)
spotify <-
smbinning.gen(spotify,
duration_binned,
"duration_woe")
levels(spotify$duration_woe) <-
duration_binned$ivtable$WoE[1:nlevels(spotify$duration_woe)]
spotify$duration_woe <-
as.numeric(levels(spotify$duration_woe))[spotify$duration_woe]
table(spotify$duration_woe)
##
## -1.6969 -1.5174 0.5252
## 980 614 4200
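For reference, a single bin’s WoE value can be reproduced by hand, assuming the usual convention WoE = ln(share of hits falling into the bin / share of non-hits falling into the bin); smbinning’s exact sign convention and rounding may differ slightly, and we assume here that duration_binned$cuts holds the cutpoints it found.
# Hedged, by-hand sketch of one bin's Weight of Evidence (convention assumed above).
# Run at this point, i.e. before duration_ms is dropped and target turned into a factor.
woe_manual <- function(in_bin, target) {
  pct_hits     <- sum(target == 1 & in_bin) / sum(target == 1)
  pct_non_hits <- sum(target == 0 & in_bin) / sum(target == 0)
  log(pct_hits / pct_non_hits)
}
# e.g. the WoE of the bin above the highest cutpoint:
woe_manual(spotify$duration_ms > max(duration_binned$cuts), spotify$target)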
At the end we will also convert the rest of the nominal variables to factors and remove the columns which are redundant for the further analysis.
spotify$key <- as.factor(spotify$key)
spotify$mode <- as.factor(spotify$mode)
spotify$target <- as.factor(spotify$target)
spotify$track <- NULL
spotify$artist <- NULL
spotify$uri <- NULL
spotify$speechiness <- NULL
spotify$duration_ms <- NULL
spotify$chorus_hit <- NULL
In this part the relations between the variables will be checked. We will begin with the correlation between the continuous variables.
spotify_numeric_vars <- sapply(spotify, is.numeric) %>% which() %>% names()
spotify_correlations <- cor(spotify[,spotify_numeric_vars],
use = "pairwise.complete.obs")
corrplot(spotify_correlations)
There are some strong correlations between the variables. The strongest relations seem to be among three of them: energy, loudness and acousticness. This is quite reasonable and we expected from the beginning that the first two might be correlated with each other. The variable valence might also be correlated with danceability. Surprisingly, there is no relation observed between danceability and tempo. We will check the correlations with a cutoff of 0.75 to find the variables most correlated with the rest.
findCorrelation(spotify_correlations,
cutoff = 0.75,
names = TRUE)
## [1] "loudness"
spotify_correlations[2:4,]
## danceability energy loudness acousticness
## energy 0.07451541 1.0000000 0.7744462 -0.7411570
## loudness 0.29352601 0.7744462 1.0000000 -0.6809118
## acousticness -0.18719132 -0.7411570 -0.6809118 1.0000000
## instrumentalness liveness valence tempo sections
## energy -0.2318692 0.2071754 0.2901205 0.2182530 -0.1711454
## loudness -0.4445925 0.1225952 0.3381099 0.1578119 -0.1799497
## acousticness 0.2594117 -0.1030944 -0.1887238 -0.1516302 0.1628709
## speechiness_yeo chorus_yeo duration_woe
## energy 0.2242508 -0.03238815 0.1400527
## loudness 0.1524275 -0.07569779 0.2722363
## acousticness -0.1350651 0.02826504 -0.2062271
The findCorrelation function suggests removing loudness. We can see that the correlation between energy and loudness is the one above 0.75. We will keep that in mind when making the final decision about the variables selected for the analysis. In the next step we will perform an analysis of variance (ANOVA) in order to find out whether our continuous variables differ across the levels of the dependent variable target.
spotify_F_anova <- function(numeric_var) {
anova_ <- aov(spotify[[numeric_var]] ~
spotify$target)
return(summary(anova_)[[1]][1, 4])
}
sapply(spotify_numeric_vars,
spotify_F_anova) %>%
sort(decreasing = TRUE) -> spotify_anova_all
spotify_anova_all
## instrumentalness danceability duration_woe loudness
## 1668.30039 1534.52400 1211.53978 794.92089
## valence acousticness sections chorus_yeo
## 494.77053 305.00229 74.34824 48.81317
## energy liveness speechiness_yeo tempo
## 41.33762 29.23026 13.75730 12.66193
All variables from this part have rather high values of the F statistic, which means they should have a significant effect on the dependent variable. Nonetheless, energy has a much lower F statistic than loudness or acousticness, so it will probably be better to exclude energy, despite the suggestion we obtained before. We will also check whether mode, associated with the mood of the song, has indeed any relation with valence.
aov(spotify$valence ~ spotify$mode) ->
spotify_anova
summary(spotify_anova)[[1]][1, 4:5]
## F value Pr(>F)
## spotify$mode 0.339 0.5604
The F statistic is low and we can’t reject the null hypothesis that there is no difference between the groups, so we don’t need to worry about the relation between those two independent variables. It is also possible to investigate the relation between our dependent variable and the other categorical variables with Cramer’s V.
DescTools::CramerV(spotify$target,
spotify$key)
## [1] 0.08031027
DescTools::CramerV(spotify$target,
spotify$mode)
## [1] 0.09801108
DescTools::CramerV(spotify$target,
spotify$time_signature)
## [1] 0.1752306
Key and mode have small Cramer’s coefficients, close to 0. The coefficient for time_signature is slightly higher, but overall none of them is impressive.
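For transparency, Cramer’s V can also be computed by hand from the chi-squared statistic of the contingency table, using the textbook definition V = sqrt(chi-squared / (n * (min(rows, columns) - 1))). The sketch below is only a cross-check; DescTools may apply minor corrections, so the values can differ marginally.
# Minimal sketch of Cramer's V from the chi-squared statistic (no continuity correction)
cramer_v_manual <- function(x, y) {
  tab  <- table(x, y)
  chi2 <- suppressWarnings(chisq.test(tab, correct = FALSE)$statistic)
  sqrt(as.numeric(chi2) / (sum(tab) * (min(dim(tab)) - 1)))
}
cramer_v_manual(spotify$target, spotify$time_signature)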
nearZeroVar(spotify, saveMetrics = TRUE)
## freqRatio percentUnique zeroVar nzv
## danceability 1.100000 15.17086641 FALSE FALSE
## energy 1.095238 17.27649292 FALSE FALSE
## key 1.003190 0.20711080 FALSE FALSE
## loudness 1.000000 75.52640663 FALSE FALSE
## mode 1.815355 0.03451847 FALSE FALSE
## acousticness 1.076923 46.78978253 FALSE FALSE
## instrumentalness 218.400000 39.31653435 FALSE FALSE
## liveness 1.033898 20.57300656 FALSE FALSE
## valence 1.058824 19.79634104 FALSE FALSE
## tempo 1.000000 92.99275112 FALSE FALSE
## time_signature 12.552632 0.06903693 FALSE FALSE
## sections 1.094645 0.93199862 FALSE FALSE
## target 1.017409 0.03451847 FALSE FALSE
## speechiness_yeo 1.095238 18.34656541 FALSE FALSE
## chorus_yeo 1.000000 99.63755609 FALSE FALSE
## duration_woe 4.285714 0.05177770 FALSE FALSE
The last factor we will consider in the feature selection part is the variance of the variables. Fortunately we have no variables left in the dataset with a variance close to zero, so we don’t have to do anything about it. Finally, we exclude energy. Both levels of target occur in an almost identical number of observations, so resampling can be omitted. Nevertheless, we should remember that in reality only a small share of all existing songs ever reaches the Billboard charts and becomes a hit in the way we interpret it here. After all transformations and selections the basic statistics of the continuous variables are as follows:
spotify$energy <- NULL
table(spotify$target)
##
## 0 1
## 2872 2922
stargazer(
spotify[, spotify_numeric_vars[-2]], type = "text",
summary.stat = c("min", "p25", "median", "p75", "max", "mean", "sd")
)
##
## ===========================================================================
## Statistic Min Pctl(25) Median Pctl(75) Max Mean St. Dev.
## ---------------------------------------------------------------------------
## danceability 0.059 0.418 0.557 0.682 0.986 0.544 0.190
## loudness -47.327 -8.278 -6.033 -4.553 1.137 -7.417 5.068
## acousticness 0.000 0.005 0.060 0.309 0.996 0.213 0.295
## instrumentalness 0 0 0.000 0.05 1 0.149 0.300
## liveness 0.019 0.094 0.130 0.262 0.987 0.195 0.161
## valence 0.011 0.281 0.487 0.687 0.982 0.484 0.254
## tempo 46.755 97.012 120.002 141.633 213.233 121.707 30.115
## sections 4 9 10 12 169 11.115 5.670
## speechiness_yeo -1.051 -0.749 -0.414 0.482 3.241 0.000 1.000
## chorus_yeo -3.256 -0.689 0.012 0.680 3.123 -0.000 1.000
## duration_woe -1.697 -1.517 0.525 0.525 0.525 -0.067 0.963
## ---------------------------------------------------------------------------
At the beginning we need to divide our data into train and test samples, preserving the proportions of the levels of the dependent variable. The split will be made in a 60/40 proportion. We will use the train sample to fit the models, and the best models will be compared based on their unbiased evaluation on the test sample.
set.seed(2346) # seed for reproducibility
spotify_which_train <- createDataPartition(spotify$target, p = 0.6,list = FALSE)
spotify_train <- spotify[spotify_which_train,]
spotify_test <- spotify[-spotify_which_train,]
save(list = c("spotify_train", "spotify_test"), file = "spotify_traintest.RData")
The first models we will create are the two types of regression dedicated to a dichotomous dependent variable. Both methods are comparable but they use different link functions. Results are usually very similar unless extreme observations in the data have a relatively big impact on them.
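To illustrate the difference, the short sketch below plots the two inverse link functions these models use: the logistic CDF (logit) and the standard normal CDF (probit). Both map the linear predictor into a probability and differ mainly in the tails, which is why their results usually agree.
# Base-R illustration of the logit and probit link functions
xb <- seq(-4, 4, by = 0.01)
plot(xb, plogis(xb), type = "l",
     xlab = "linear predictor", ylab = "P(target = 1)")
lines(xb, pnorm(xb), lty = 2)
legend("topleft", legend = c("logit (logistic CDF)", "probit (normal CDF)"),
       lty = 1:2, bty = "n")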
spotify_logit <- glm(target ~ .,
family = binomial(link = "logit"),
data = spotify_train)
lrtest(spotify_logit)
## Likelihood ratio test
##
## Model 1: target ~ danceability + key + loudness + mode + acousticness +
## instrumentalness + liveness + valence + tempo + time_signature +
## sections + speechiness_yeo + chorus_yeo + duration_woe
## Model 2: target ~ 1
## #Df LogLik Df Chisq Pr(>Chisq)
## 1 27 -1407.3
## 2 1 -2410.6 -26 2006.8 < 0.00000000000000022 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
spotify_logit_fitted <- predict(spotify_logit,
type = "response")
table(spotify_train$target,
ifelse(spotify_logit_fitted > 0.5, 1, 0))
##
## 0 1
## 0 1289 435
## 1 197 1557
summary(spotify_logit)
##
## Call:
## glm(formula = target ~ ., family = binomial(link = "logit"),
## data = spotify_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.8077 -0.5398 0.2328 0.6768 3.6683
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.61877017 0.75438783 -3.471 0.000518 ***
## danceability 6.87140325 0.40027952 17.167 < 0.0000000000000002 ***
## key1 0.37237204 0.20510149 1.816 0.069439 .
## key2 -0.11073593 0.19878182 -0.557 0.577478
## key3 0.76518067 0.31477112 2.431 0.015061 *
## key4 0.40216173 0.21691531 1.854 0.063739 .
## key5 0.35974300 0.22024550 1.633 0.102391
## key6 0.52476801 0.22534449 2.329 0.019873 *
## key7 0.08494842 0.19298805 0.440 0.659811
## key8 0.49382893 0.23006551 2.146 0.031835 *
## key9 -0.01026590 0.20425963 -0.050 0.959916
## key10 0.43647410 0.22874730 1.908 0.056378 .
## key11 0.16236115 0.21414338 0.758 0.448338
## loudness 0.20449600 0.02211942 9.245 < 0.0000000000000002 ***
## mode1 0.60945705 0.10495003 5.807 0.00000000636 ***
## acousticness -0.13959151 0.23949628 -0.583 0.559991
## instrumentalness -6.26531284 0.54152552 -11.570 < 0.0000000000000002 ***
## liveness -0.76623723 0.30202758 -2.537 0.011181 *
## valence -1.18922232 0.24047991 -4.945 0.00000076064 ***
## tempo 0.00297467 0.00162380 1.832 0.066964 .
## time_signature3 0.21099704 0.70198979 0.301 0.763742
## time_signature4 0.33514386 0.67175196 0.499 0.617843
## time_signature5 1.22575990 0.84423716 1.452 0.146525
## sections 0.00008904 0.01429587 0.006 0.995031
## speechiness_yeo -0.13338790 0.04787412 -2.786 0.005333 **
## chorus_yeo -0.09015483 0.04958231 -1.818 0.069020 .
## duration_woe 0.65979252 0.05512671 11.969 < 0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 4821.3 on 3477 degrees of freedom
## Residual deviance: 2814.5 on 3451 degrees of freedom
## AIC: 2868.5
##
## Number of Fisher Scoring iterations: 7
Let’s begin with the logit model. Based on the likelihood ratio test our model is jointly significant. Most of the observations are predicted correctly on the train sample, but not all of them. There are also some insignificant variables, so we will try to remove all of them at once and estimate a second, reduced version of the logit model.
spotify_logit_reduced <- glm(target ~ danceability + key + loudness + mode + instrumentalness + liveness + valence
+ speechiness_yeo + duration_woe,
family = binomial(link = "logit"),
data = spotify_train)
lrtest(spotify_logit, spotify_logit_reduced)
## Likelihood ratio test
##
## Model 1: target ~ danceability + key + loudness + mode + acousticness +
## instrumentalness + liveness + valence + tempo + time_signature +
## sections + speechiness_yeo + chorus_yeo + duration_woe
## Model 2: target ~ danceability + key + loudness + mode + instrumentalness +
## liveness + valence + speechiness_yeo + duration_woe
## #Df LogLik Df Chisq Pr(>Chisq)
## 1 27 -1407.3
## 2 20 -1413.2 -7 11.787 0.1078
spotify_logit_reduced_fitted <- predict(spotify_logit_reduced,
type = "response")
table(spotify_train$target,
ifelse(spotify_logit_reduced_fitted > 0.5, 1, 0))
##
## 0 1
## 0 1289 435
## 1 201 1553
summary(spotify_logit_reduced)
##
## Call:
## glm(formula = target ~ danceability + key + loudness + mode +
## instrumentalness + liveness + valence + speechiness_yeo +
## duration_woe, family = binomial(link = "logit"), data = spotify_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.7172 -0.5452 0.2376 0.6741 3.6341
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.850801 0.273645 -6.764 0.0000000000135 ***
## danceability 6.690324 0.377163 17.739 < 0.0000000000000002 ***
## key1 0.384604 0.204040 1.885 0.0594 .
## key2 -0.102110 0.197775 -0.516 0.6057
## key3 0.770114 0.314112 2.452 0.0142 *
## key4 0.403440 0.216519 1.863 0.0624 .
## key5 0.365104 0.218615 1.670 0.0949 .
## key6 0.537323 0.224506 2.393 0.0167 *
## key7 0.086904 0.191790 0.453 0.6505
## key8 0.482088 0.229380 2.102 0.0356 *
## key9 0.003608 0.203650 0.018 0.9859
## key10 0.444972 0.227633 1.955 0.0506 .
## key11 0.177770 0.213041 0.834 0.4040
## loudness 0.211012 0.019580 10.777 < 0.0000000000000002 ***
## mode1 0.619307 0.103998 5.955 0.0000000026013 ***
## instrumentalness -6.238622 0.538220 -11.591 < 0.0000000000000002 ***
## liveness -0.768980 0.299941 -2.564 0.0104 *
## valence -1.097356 0.236321 -4.644 0.0000034254614 ***
## speechiness_yeo -0.120720 0.047127 -2.562 0.0104 *
## duration_woe 0.659464 0.053705 12.279 < 0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 4821.3 on 3477 degrees of freedom
## Residual deviance: 2826.3 on 3458 degrees of freedom
## AIC: 2866.3
##
## Number of Fisher Scoring iterations: 7
A p-value higher than 5% indicates that the variables acousticness, tempo, time_signature, sections and chorus_yeo are jointly insignificant and can be removed from the model. We can also check whether the variable key, which has only some significant levels, should be kept or excluded.
spotify_logit3 <- glm(target ~ danceability + loudness + mode + instrumentalness + liveness + valence
+ speechiness_yeo + duration_woe,
family = binomial(link = "logit"),
data = spotify_train)
lrtest(spotify_logit, spotify_logit3)
## Likelihood ratio test
##
## Model 1: target ~ danceability + key + loudness + mode + acousticness +
## instrumentalness + liveness + valence + tempo + time_signature +
## sections + speechiness_yeo + chorus_yeo + duration_woe
## Model 2: target ~ danceability + loudness + mode + instrumentalness +
## liveness + valence + speechiness_yeo + duration_woe
## #Df LogLik Df Chisq Pr(>Chisq)
## 1 27 -1407.3
## 2 9 -1425.0 -18 35.518 0.008128 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
This time we reject the null hypothesis, so we can’t reduce the model any further. We need to keep key in the model. Now we can see the difference between the initial and the reduced logit models. There are some dissimilarities in the predictions, but they are not very big.
table(ifelse(spotify_logit_fitted > 0.5, 1, 0),
ifelse(spotify_logit_reduced_fitted > 0.5, 1, 0))
##
## 0 1
## 0 1458 28
## 1 32 1960
spotify_probit <- glm(target ~ .,
family = binomial(link = "probit"),
data = spotify_train)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
lrtest(spotify_probit)
## Likelihood ratio test
##
## Model 1: target ~ danceability + key + loudness + mode + acousticness +
## instrumentalness + liveness + valence + tempo + time_signature +
## sections + speechiness_yeo + chorus_yeo + duration_woe
## Model 2: target ~ 1
## #Df LogLik Df Chisq Pr(>Chisq)
## 1 27 -1419.0
## 2 1 -2410.6 -26 1983.3 < 0.00000000000000022 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
spotify_probit_fitted <- predict(spotify_probit,
type = "response")
table(spotify_train$target,
ifelse(spotify_probit_fitted > 0.5, 1, 0))
##
## 0 1
## 0 1282 442
## 1 197 1557
summary(spotify_probit)
##
## Call:
## glm(formula = target ~ ., family = binomial(link = "probit"),
## data = spotify_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.9771 -0.5861 0.1951 0.6936 4.1143
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.5480255 0.4343054 -3.564 0.000365 ***
## danceability 3.9270060 0.2204860 17.811 < 0.0000000000000002 ***
## key1 0.2035270 0.1164755 1.747 0.080571 .
## key2 -0.0560372 0.1152518 -0.486 0.626815
## key3 0.5084385 0.1804435 2.818 0.004837 **
## key4 0.2319933 0.1245455 1.863 0.062502 .
## key5 0.2166584 0.1266122 1.711 0.087045 .
## key6 0.3255703 0.1292885 2.518 0.011797 *
## key7 0.0605921 0.1113585 0.544 0.586360
## key8 0.3179609 0.1320795 2.407 0.016069 *
## key9 0.0297994 0.1176938 0.253 0.800118
## key10 0.2585331 0.1311275 1.972 0.048653 *
## key11 0.1064298 0.1232864 0.863 0.387987
## loudness 0.1193937 0.0123880 9.638 < 0.0000000000000002 ***
## mode1 0.3460934 0.0600691 5.762 0.00000000833 ***
## acousticness -0.0401308 0.1375350 -0.292 0.770450
## instrumentalness -3.0212385 0.2268848 -13.316 < 0.0000000000000002 ***
## liveness -0.4107594 0.1734247 -2.369 0.017860 *
## valence -0.6690570 0.1373806 -4.870 0.00000111542 ***
## tempo 0.0015413 0.0009417 1.637 0.101702
## time_signature3 0.2061709 0.4059382 0.508 0.611532
## time_signature4 0.2559065 0.3883175 0.659 0.509887
## time_signature5 0.7640957 0.4788073 1.596 0.110526
## sections 0.0001089 0.0080648 0.014 0.989222
## speechiness_yeo -0.0779418 0.0274283 -2.842 0.004488 **
## chorus_yeo -0.0548367 0.0283778 -1.932 0.053312 .
## duration_woe 0.3877349 0.0317594 12.209 < 0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 4821.3 on 3477 degrees of freedom
## Residual deviance: 2838.0 on 3451 degrees of freedom
## AIC: 2892
##
## Number of Fisher Scoring iterations: 7
We will perform the same steps for the probit model. The initial model is jointly significant, just like the logit model, and the same variables are insignificant. We will try to exclude them again.
spotify_probit_reduced <- glm(target ~ danceability + key + loudness + mode + instrumentalness + liveness + valence
+ speechiness_yeo + duration_woe,
family = binomial(link = "probit"),
data = spotify_train)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
lrtest(spotify_probit, spotify_probit_reduced)
## Likelihood ratio test
##
## Model 1: target ~ danceability + key + loudness + mode + acousticness +
## instrumentalness + liveness + valence + tempo + time_signature +
## sections + speechiness_yeo + chorus_yeo + duration_woe
## Model 2: target ~ danceability + key + loudness + mode + instrumentalness +
## liveness + valence + speechiness_yeo + duration_woe
## #Df LogLik Df Chisq Pr(>Chisq)
## 1 27 -1419.0
## 2 20 -1424.5 -7 11.037 0.137
spotify_probit_reduced_fitted <- predict(spotify_probit_reduced,
type = "response")
table(spotify_train$target,
ifelse(spotify_probit_reduced_fitted > 0.5, 1, 0))
##
## 0 1
## 0 1277 447
## 1 202 1552
summary(spotify_probit_reduced)
##
## Call:
## glm(formula = target ~ danceability + key + loudness + mode +
## instrumentalness + liveness + valence + speechiness_yeo +
## duration_woe, family = binomial(link = "probit"), data = spotify_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.8564 -0.5911 0.2089 0.6885 4.0342
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.07503 0.15681 -6.855 0.00000000000711 ***
## danceability 3.84387 0.20792 18.487 < 0.0000000000000002 ***
## key1 0.20892 0.11604 1.800 0.07181 .
## key2 -0.05101 0.11487 -0.444 0.65699
## key3 0.50976 0.18035 2.827 0.00471 **
## key4 0.23097 0.12439 1.857 0.06334 .
## key5 0.21762 0.12590 1.728 0.08391 .
## key6 0.33257 0.12895 2.579 0.00991 **
## key7 0.05952 0.11089 0.537 0.59140
## key8 0.30927 0.13183 2.346 0.01898 *
## key9 0.03537 0.11746 0.301 0.76334
## key10 0.26585 0.13071 2.034 0.04195 *
## key11 0.11381 0.12282 0.927 0.35411
## loudness 0.12139 0.01089 11.151 < 0.0000000000000002 ***
## mode1 0.35259 0.05961 5.915 0.00000000332374 ***
## instrumentalness -3.01077 0.22514 -13.373 < 0.0000000000000002 ***
## liveness -0.41620 0.17255 -2.412 0.01587 *
## valence -0.62165 0.13525 -4.596 0.00000429792782 ***
## speechiness_yeo -0.07265 0.02705 -2.686 0.00724 **
## duration_woe 0.38700 0.03095 12.506 < 0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 4821.3 on 3477 degrees of freedom
## Residual deviance: 2849.0 on 3458 degrees of freedom
## AIC: 2889
##
## Number of Fisher Scoring iterations: 7
spotify_probit3 <- glm(target ~ danceability + loudness + mode + instrumentalness + liveness + valence
+ speechiness_yeo + duration_woe + chorus_yeo,
family = binomial(link = "probit"),
data = spotify_train)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
lrtest(spotify_probit, spotify_probit3)
## Likelihood ratio test
##
## Model 1: target ~ danceability + key + loudness + mode + acousticness +
## instrumentalness + liveness + valence + tempo + time_signature +
## sections + speechiness_yeo + chorus_yeo + duration_woe
## Model 2: target ~ danceability + loudness + mode + instrumentalness +
## liveness + valence + speechiness_yeo + duration_woe + chorus_yeo
## #Df LogLik Df Chisq Pr(>Chisq)
## 1 27 -1419.0
## 2 10 -1434.9 -17 31.919 0.0154 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
table(ifelse(spotify_probit_fitted > 0.5, 1, 0),
ifelse(spotify_probit_reduced_fitted > 0.5, 1, 0))
##
## 0 1
## 0 1450 29
## 1 29 1970
Again we can exclude the variables mentioned above, and again we have to keep the key variable in the model. In the end our reduced probit model has the same variables as the reduced logit. Differences between the predictions of the initial and reduced probit models are small but present. We will check the efficiency of predictions from the two logit and two probit models all at once. To do so, we will use the summary_binary function created by dr hab. Piotr Wójcik, which returns selected statistics based on the confusionMatrix function from the caret package.
summary_binary <- function(predicted_probs,
real,
cutoff = 0.5,
level_positive = "Yes",
level_negative = "No") {
# save the classification table
ctable <- confusionMatrix(as.factor(ifelse(predicted_probs > cutoff,
level_positive,
level_negative)),
real,
level_positive)
# extract selected statistics into a vector
stats <- round(c(ctable$overall[1],
ctable$byClass[c(1:4, 7, 11)]),
5)
# and return them as a function result
return(stats)
}
spotify_logprob_predictions <-
data.frame(spotify_logit_fitted,
spotify_logit_reduced_fitted,
spotify_probit_fitted,
spotify_probit_reduced_fitted)
(spotify_logprob_comparison <- sapply(spotify_logprob_predictions,
function(x) summary_binary(real = spotify_train$target,
predicted = x,
level_positive = "1",
level_negative = "0")) %>%
t())
## Accuracy Sensitivity Specificity
## spotify_logit_fitted 0.81829 0.88769 0.74768
## spotify_logit_reduced_fitted 0.81714 0.88540 0.74768
## spotify_probit_fitted 0.81627 0.88769 0.74362
## spotify_probit_reduced_fitted 0.81340 0.88483 0.74072
## Pos Pred Value Neg Pred Value F1
## spotify_logit_fitted 0.78163 0.86743 0.83129
## spotify_logit_reduced_fitted 0.78119 0.86510 0.83004
## spotify_probit_fitted 0.77889 0.86680 0.82974
## spotify_probit_reduced_fitted 0.77639 0.86342 0.82707
## Balanced Accuracy
## spotify_logit_fitted 0.81768
## spotify_logit_reduced_fitted 0.81654
## spotify_probit_fitted 0.81565
## spotify_probit_reduced_fitted 0.81278
It seems that the first logit model, with all variables included, is also the best one. Accuracy, sensitivity and specificity are all pretty high. Surprisingly, in both models the reduction of the insignificant variables resulted in slightly lower accuracy. The logit models are generally better than their probit equivalents. We’ll use the first model, spotify_logit, in the further analysis. These measures were obtained with the default cutoff point of 0.5. We can also use the ROC curve to assess the quality of the model regardless of the selected cutoff point. The Area Under the Curve for our best model is equal to 0.89. It is a very promising result which indicates great predictive power of the model.
roc.plot(ifelse(spotify_train$target == "1", 1, 0),
spotify_logit_fitted)
## Warning in roc.plot.default(ifelse(spotify_train$target == "1", 1,
## 0), spotify_logit_fitted): Large amount of unique predictions used as
## thresholds. Consider specifying thresholds.
roc.area(ifelse(spotify_train$target == "1", 1, 0),
spotify_logit_fitted)
## $A
## [1] 0.8945437
##
## $n.total
## [1] 3478
##
## $n.events
## [1] 1754
##
## $n.noevents
## [1] 1724
##
## $p.value
## [1] 0
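As a sanity check, the AUC reported above can also be reproduced directly from the fitted probabilities with the rank-sum (Mann-Whitney) identity: the AUC equals the probability that a randomly chosen hit receives a higher fitted probability than a randomly chosen non-hit. The snippet below should agree with roc.area up to tie handling.
# AUC via the Mann-Whitney identity, computed from the fitted probabilities
y <- ifelse(spotify_train$target == "1", 1, 0)
r <- rank(spotify_logit_fitted)
(sum(r[y == 1]) - sum(y) * (sum(y) + 1) / 2) / (sum(y) * sum(1 - y))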
Now let’s move to some other classification methods. KNN is an algorithm which classifies a given observation based on the classes of the observations closest to it. The main aspect one has to consider when applying K-nearest neighbours is the value of the parameter K, which indicates the number of observations treated as the nearest neighbours. One way to obtain the optimal value of K is to use cross-validation and try different candidate values of the parameter. In this part we will use cross-validation with 5 folds. A popular rule of thumb is to start with the parameter equal to the square root of the number of observations, which in our case is almost 59. We include this value in the set of analysed parameters, but we will also check many higher and lower values.
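As a toy illustration of the mechanism (separate from the actual modelling below), the knn function from the class package can classify a single observation from its five nearest neighbours; here we use only the raw numeric features, whereas the caret models tune K by cross-validation on the full formula.
# Toy example: classify the first training observation by its 5 nearest neighbours,
# using the raw (unscaled) numeric features only - just to show the KNN idea
library(class)
num_vars <- sapply(spotify_train, is.numeric)
knn(train = spotify_train[-1, num_vars],
    test  = spotify_train[1, num_vars, drop = FALSE],
    cl    = spotify_train$target[-1],
    k     = 5)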
ctrl_cv5 <- trainControl(method = "cv",
number = 5)
sqrt(nrow(spotify_train))
## [1] 58.97457
k <- 59
different_k <- data.frame(k = seq(1 , k + 40, 2))
set.seed(2346)
spotify_knn <-
train(target ~ .,
data = spotify_train,
method = "knn",
trControl = ctrl_cv5,
tuneGrid = different_k)
save("spotify_knn", file = "spotify_knn.RData")
spotify_knn
## k-Nearest Neighbors
##
## 3478 samples
## 14 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 2782, 2784, 2782, 2782, 2782
## Resampling results across tuning parameters:
##
## k Accuracy Kappa
## 1 0.6676231 0.3346136
## 3 0.6837225 0.3666163
## 5 0.6935125 0.3859301
## 7 0.6860313 0.3707489
## 9 0.6900485 0.3786158
## 11 0.6865961 0.3715661
## 13 0.6842948 0.3668596
## 15 0.6891873 0.3765265
## 17 0.6851660 0.3684282
## 19 0.6817111 0.3613800
## 21 0.6797078 0.3572630
## 23 0.6866077 0.3710952
## 25 0.6811372 0.3600725
## 27 0.6811355 0.3599656
## 29 0.6825690 0.3628108
## 31 0.6825764 0.3627967
## 33 0.6837217 0.3650323
## 35 0.6822891 0.3620863
## 37 0.6782653 0.3539636
## 39 0.6759639 0.3493009
## 41 0.6690599 0.3354017
## 43 0.6681979 0.3336342
## 45 0.6696363 0.3364898
## 47 0.6658998 0.3289665
## 49 0.6673383 0.3318443
## 51 0.6673383 0.3317852
## 53 0.6693506 0.3358023
## 55 0.6670501 0.3311381
## 57 0.6641740 0.3253642
## 59 0.6621609 0.3212494
## 61 0.6615820 0.3200200
## 63 0.6601444 0.3171009
## 65 0.6607224 0.3182484
## 67 0.6581338 0.3130152
## 69 0.6552585 0.3072100
## 71 0.6518094 0.3002360
## 73 0.6512322 0.2990509
## 75 0.6526674 0.3019391
## 77 0.6541083 0.3048089
## 79 0.6512281 0.2989785
## 81 0.6489284 0.2943343
## 83 0.6483554 0.2932050
## 85 0.6520951 0.3007184
## 87 0.6483545 0.2932178
## 89 0.6486435 0.2937880
## 91 0.6500820 0.2966535
## 93 0.6518045 0.3000943
## 95 0.6486402 0.2937245
## 97 0.6480663 0.2925414
## 99 0.6472051 0.2907704
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 5.
plot(spotify_knn)
It turns out that the best K is equal to 5. It is a much smaller number than 59, and we can see that the square root of the number of observations would be far from optimal. Next we will try to improve the results with variable standardization. Many of the features used are already in the range from 0 to 1, but we also have variables which take values from completely different ranges. There are also some categorical features, so maybe we should try range standardization to have all variables in the same range.
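For clarity, the range standardization requested below via preProcess = "range" simply rescales every numeric feature to the [0, 1] interval; a one-line sketch of the same transformation:
# Min-max rescaling to [0, 1], the transformation behind preProcess = "range"
range01 <- function(x) (x - min(x)) / (max(x) - min(x))
summary(range01(spotify_train$loudness)) # e.g. loudness rescaled by hand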
set.seed(2346)
spotify_knn_range <-
train(target ~ .,
data = spotify_train,
method = "knn",
trControl = ctrl_cv5,
tuneGrid = different_k,
preProcess = c("range"))
save("spotify_knn_range", file = "spotify_knn_range.RData")
spotify_knn_range
## k-Nearest Neighbors
##
## 3478 samples
## 14 predictor
## 2 classes: '0', '1'
##
## Pre-processing: re-scaling to [0, 1] (26)
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 2782, 2784, 2782, 2782, 2782
## Resampling results across tuning parameters:
##
## k Accuracy Kappa
## 1 0.7429627 0.4856000
## 3 0.7570522 0.5135994
## 5 0.7593552 0.5180646
## 7 0.7702773 0.5399142
## 9 0.7699899 0.5393015
## 11 0.7650999 0.5294653
## 13 0.7676869 0.5346245
## 15 0.7662476 0.5316713
## 17 0.7645235 0.5281784
## 19 0.7596343 0.5183298
## 21 0.7579085 0.5148192
## 23 0.7550333 0.5089716
## 25 0.7518715 0.5026274
## 27 0.7510086 0.5008023
## 29 0.7495702 0.4978936
## 31 0.7489955 0.4967025
## 33 0.7495752 0.4978348
## 35 0.7518724 0.5023810
## 37 0.7466958 0.4919383
## 39 0.7449717 0.4884450
## 41 0.7449709 0.4884374
## 43 0.7449700 0.4884014
## 45 0.7464051 0.4912816
## 47 0.7466925 0.4918469
## 49 0.7435299 0.4854667
## 51 0.7441055 0.4866099
## 53 0.7403690 0.4790955
## 55 0.7397926 0.4779138
## 57 0.7351900 0.4686366
## 59 0.7372031 0.4726529
## 61 0.7369174 0.4720454
## 63 0.7340422 0.4662479
## 65 0.7314560 0.4610278
## 67 0.7323181 0.4627458
## 69 0.7320291 0.4621680
## 71 0.7294420 0.4569532
## 73 0.7311695 0.4604098
## 75 0.7271407 0.4522853
## 77 0.7274297 0.4528513
## 79 0.7265676 0.4511160
## 81 0.7257047 0.4493769
## 83 0.7251317 0.4482214
## 85 0.7251317 0.4482160
## 87 0.7251308 0.4481926
## 89 0.7274322 0.4528174
## 91 0.7268575 0.4516421
## 93 0.7265709 0.4510803
## 95 0.7248451 0.4475984
## 97 0.7251300 0.4481802
## 99 0.7251284 0.4481772
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 7.
plot(spotify_knn_range)
The new optimal parameter is higher and equal to 7. Again the best results are obtained for small numbers of nearest neighbours, although this time K equal to 5 would be a much worse option. We can already see that with the range standardization we achieved higher accuracy than before. To compare those two results we will use a slightly different version of the function introduced in the previous part, summary_binary_class, also written by dr hab. Piotr Wójcik, and assess the correctness of predictions on the train data.
summary_binary_class <- function(predicted_classes,
real,
level_positive = "Yes",
level_negative = "No") {
# save the classification table
ctable <- confusionMatrix(as.factor(predicted_classes),
real,
level_positive)
# extract selected statistics into a vector
stats <- round(c(ctable$overall[1],
ctable$byClass[c(1:4, 7, 11)]),
5)
# and return them as a function result
return(stats)
}
spotify_knn_predictions <-
data.frame(spotify_knn = predict(spotify_knn,
spotify_train),
spotify_knn_range = predict(spotify_knn_range,
spotify_train)
)
(spotify_knn_comparison <- sapply(spotify_knn_predictions,
function(x) summary_binary_class(real = spotify_train$target,
predicted = x,
level_positive = "1",
level_negative = "0")) %>%
t())
## Accuracy Sensitivity Specificity Pos Pred Value
## spotify_knn 0.80247 0.87286 0.73086 0.76742
## spotify_knn_range 0.81829 0.89111 0.74420 0.77994
## Neg Pred Value F1 Balanced Accuracy
## spotify_knn 0.84963 0.81675 0.80186
## spotify_knn_range 0.87042 0.83183 0.81765
The difference is rather subtle, but the range standardization indeed improved the results and this is the model we will keep for the final comparison.
The last algorithm will be SVM. It is a method of finding a hyperplane which divides observations into groups. We will use three different kernel functions to create three models, each one with five-fold cross-validation repeated 2 times. We will start with the most basic, linear kernel and a set of 10 different values of the cost parameter.
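For reference, the three kernel functions compared in this part can be written out explicitly; the sketch below shows their standard forms as parametrized in the kernlab package that caret calls, and should be treated as illustrative (internal scaling details may differ).
# Standard forms of the three kernels used below (illustrative sketch)
k_linear <- function(x, y) sum(x * y)                          # svmLinear
k_poly   <- function(x, y, scale, degree, offset = 1) {        # svmPoly
  (scale * sum(x * y) + offset)^degree
}
k_rbf    <- function(x, y, sigma) exp(-sigma * sum((x - y)^2)) # svmRadial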
ctrl_cv5x2 <- trainControl(method = "repeatedcv",
number = 5,
repeats = 2)
parametersLinear <- data.frame(C = c(0.01, 0.05,
0.1, 0.5, 1, 2, 5, 10, 15, 20))
set.seed(2346)
spotify_svm_lin <- train(target ~ .,
data = spotify_train,
method = "svmLinear",
tuneGrid = parametersLinear,
trControl = ctrl_cv5x2)
save("spotify_svm_lin", file = "spotify_svm_lin.RData")
spotify_svm_lin
## Support Vector Machines with Linear Kernel
##
## 3478 samples
## 14 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 2 times)
## Summary of sample sizes: 2782, 2784, 2782, 2782, 2782, 2782, ...
## Resampling results across tuning parameters:
##
## C Accuracy Kappa
## 0.01 0.8021840 0.6037115
## 0.05 0.8116738 0.6228179
## 0.10 0.8133984 0.6262906
## 0.50 0.8162740 0.6320554
## 1.00 0.8161303 0.6317668
## 2.00 0.8161301 0.6317665
## 5.00 0.8161301 0.6317662
## 10.00 0.8164177 0.6323419
## 15.00 0.8161303 0.6317684
## 20.00 0.8161305 0.6317706
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was C = 10.
The best model is the one with the cost parameter equal to 10. Now we can move to the more advanced functions. We will try to find the optimal model for the polynomial kernel function. This time we have three different parameters to optimize, so in order to compute the models in a reasonable amount of time we reduced the number of variants for each parameter compared to the previous model.
parametersPoly <- expand.grid(C = c(0.001, 1),
degree = 2:5,
scale = c(1, 2))
set.seed(2346)
spotify_svm_poly <- train(target ~ .,
data = spotify_train,
method = "svmPoly",
tuneGrid = parametersPoly,
trControl = ctrl_cv5x2)
save("spotify_svm_poly", file = "spotify_svm_poly.RData")
spotify_svm_poly
## Support Vector Machines with Polynomial Kernel
##
## 3478 samples
## 14 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 2 times)
## Summary of sample sizes: 2782, 2784, 2782, 2782, 2782, 2782, ...
## Resampling results across tuning parameters:
##
## C degree scale Accuracy Kappa
## 0.001 2 1 0.8092288 0.6178963
## 0.001 2 2 0.8141180 0.6277800
## 0.001 3 1 0.7949978 0.5895695
## 0.001 3 2 0.7742970 0.5483236
## 0.001 4 1 0.7535970 0.5069233
## 0.001 4 2 0.7374916 0.4747470
## 0.001 5 1 0.7448283 0.4894262
## 0.001 5 2 0.7412312 0.4822324
## 1.000 2 1 0.8188608 0.6374328
## 1.000 2 2 0.8182863 0.6362952
## 1.000 3 1 0.7444035 0.4887826
## 1.000 3 2 0.7406623 0.4812980
## 1.000 4 1 0.7400789 0.4799228
## 1.000 4 2 0.7354779 0.4706896
## 1.000 5 1 0.7422392 0.4842720
## 1.000 5 2 0.7412312 0.4822324
##
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were degree = 2, scale = 1 and C = 1.
The optimal parameters from all verified combinations are degree = 2, scale = 1 and C = 1. Finally, we’ll do the same with the Gaussian radial basis function, which requires two different parameters.
parametersRadial <-
expand.grid(C = c(5, 10, 15, 20, 25),
sigma = c(0.001, 0.01, 0.1, 1))
set.seed(2346)
spotify_svm_radial <- train(target ~ .,
data = spotify_train,
method = "svmRadial",
tuneGrid = parametersRadial,
trControl = ctrl_cv5x2)
save("spotify_svm_radial", file = "spotify_svm_radial.RData")
spotify_svm_radial
## Support Vector Machines with Radial Basis Function Kernel
##
## 3478 samples
## 14 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 2 times)
## Summary of sample sizes: 2782, 2784, 2782, 2782, 2782, 2782, ...
## Resampling results across tuning parameters:
##
## C sigma Accuracy Kappa
## 5 0.001 0.8026149 0.6045883
## 5 0.010 0.8192944 0.6380919
## 5 0.100 0.8021892 0.6041310
## 5 1.000 0.7298725 0.4612187
## 10 0.001 0.8076471 0.6147153
## 10 0.010 0.8260509 0.6516700
## 10 0.100 0.7971572 0.5941563
## 10 1.000 0.7297289 0.4609334
## 15 0.001 0.8098050 0.6190576
## 15 0.010 0.8276304 0.6548511
## 15 0.100 0.7892499 0.5783929
## 15 1.000 0.7297289 0.4609334
## 20 0.001 0.8118179 0.6231003
## 20 0.010 0.8263371 0.6522651
## 20 0.100 0.7875241 0.5749633
## 20 1.000 0.7297289 0.4609334
## 25 0.001 0.8132545 0.6259777
## 25 0.010 0.8267704 0.6531526
## 25 0.100 0.7836435 0.5672075
## 25 1.000 0.7297289 0.4609334
##
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were sigma = 0.01 and C = 15.
The sigma parameter is equal to 0.01 and C is equal to 15. Now we will make predictions on the same sample, compare all three SVM models and decide which kernel function probably suits our data best.
spotify_svm_predictions <-
data.frame(spotify_svm_lin = predict(spotify_svm_lin,
spotify_train),
spotify_svm_poly = predict(spotify_svm_poly,
spotify_train),
spotify_svm_radial = predict(spotify_svm_radial,
spotify_train)
)
(spotify_svm_comparison <- sapply(spotify_svm_predictions,
function(x) summary_binary_class(real = spotify_train$target,
predicted = x,
level_positive = "1",
level_negative = "0")) %>%
t())
## Accuracy Sensitivity Specificity Pos Pred Value
## spotify_svm_lin 0.81944 0.89453 0.74304 0.77982
## spotify_svm_poly 0.87378 0.90707 0.83991 0.85217
## spotify_svm_radial 0.86429 0.92873 0.79872 0.82439
## Neg Pred Value F1 Balanced Accuracy
## spotify_svm_lin 0.87381 0.83324 0.81878
## spotify_svm_poly 0.89882 0.87876 0.87349
## spotify_svm_radial 0.91678 0.87346 0.86373
In contrast to the previous algorithms, the differences between the respective models are larger, especially between the model with the linear kernel and the rest. Although the sensitivity is quite similar, the specificity is the measure which matters here. The best model is the one with the polynomial kernel function. It is also the most balanced one in terms of sensitivity and specificity. Nonetheless, all three models give good predictions and none of them turned out to be a failure.
We have created models with three completely different methods and selected the best one for every type of algorithm. Now we will compare their forecasts on the test sample.
spotify_final_forecasts <-
data.frame(spotify_logit = ifelse(predict(spotify_logit,
spotify_test,
type = "response")>0.5,1,0),
spotify_knn_range = predict(spotify_knn_range,
spotify_test),
spotify_svm_poly = predict(spotify_svm_poly,
spotify_test)
)
(spotify_final_comparison <- sapply(spotify_final_forecasts,
function(x) summary_binary_class(real = spotify_test$target,
predicted = x,
level_positive = "1",
level_negative = "0")) %>%
t())
## Accuracy Sensitivity Specificity Pos Pred Value
## spotify_logit 0.82081 0.87757 0.76307 0.79029
## spotify_knn_range 0.79231 0.87243 0.71080 0.75426
## spotify_svm_poly 0.82168 0.85616 0.78659 0.80321
## Neg Pred Value F1 Balanced Accuracy
## spotify_logit 0.85967 0.83164 0.82032
## spotify_knn_range 0.84560 0.80905 0.79162
## spotify_svm_poly 0.84314 0.82884 0.82137
In spite of the essential differences between the models, their accuracy is very similar. The highest overall accuracy is achieved by the forecasts from the SVM model, but the accuracy of the logit model is in fact only negligibly smaller. Only the results from KNN are noticeably worse, mainly because of the lower specificity. On the other hand, the forecasts from SVM have slightly worse sensitivity than the rest, while the logit model is much more interpretable, so the selection of the best model should depend on the details of its further application. Most importantly, all these models are quite good, so we managed to achieve the goal of this analysis.
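For illustration, this is how either of the final models could be used to score a new track. Here we simply reuse one row of the test set as the hypothetical new song, because in a real application the raw Spotify features would first need exactly the same transformations as above (Yeo-Johnson, WoE binning, factor conversion).
# Scoring a single (hypothetical) new track with the selected models
new_song <- spotify_test[1, ] # stand-in for a freshly prepared track
predict(spotify_logit, new_song, type = "response") # estimated probability of being a hit
predict(spotify_svm_poly, new_song) # predicted class from the SVM model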
Predicting which songs are hits and which are not, using only their characteristics, is possible. There is no single method that leads to success; rather, it is the effort and many different attempts that guarantee a quality model.
Mordor Intelligence (2019). Used Car Market – growth, trends and forecast (2020-2025). Online: https://www.mordorintelligence.com/industry-reports/global-used-car-market-growth-trends-and-forecast-2019-2024.↩
Tellis, G. J., & Wernerfelt, B. (1987). Competitive price and quality under asymmetric information. Marketing science, 6(3), 240-253.↩
https://en.wikipedia.org/wiki/List_of_postal_codes_in_Germany↩
https://www.r-bloggers.com/case-study-mapping-german-zip-codes-in-r/↩
News and Notes on 2017 RIAA Revenue Statistics. Online: http://www.riaa.com/wp-content/uploads/2018/03/RIAA-Year-End-2017-News-and-Notes.pdf.↩
https://www.researchgate.net/post/How_and_what_can_be_changed_to_a_cyclic_variable_so_that_it_can_be_used_for_modeling_in_R↩