How can different machine learning techniques be used to achieve your research goal?

In this project we have analysed two different datasets. The first one contained information about used cars, their characteristics and prices. The other one stored data about numerous songs, their traits and their possible position in one of the world's most important music rankings. Using different machine learning algorithms we were able to predict the features that were most interesting for us. We have compared the models' efficiency and learned how to work with real-life data that stems from internet sources and data scraping. The project was a way to learn modern approaches in supervised data analysis. Our results also have an additional value: thanks to their high efficiency they can be used not only for educational purposes, but also to generate predictions about new entities. The following sections of this report contain all the code necessary to replicate our analysis and can serve as a guideline for a similar research project.

How much are German used cars worth?

Used cars dataset - outline of the problem

The market for used vehicles is growing dynamically and is expected to continue growing in the future1. It is also a curious market from the point of view of theoretical economic analysis – it is a market with nearly perfect competition, in which private sellers are matched with interested buyers. The products can differ in many ways, but at the same time there are almost always some comparable offers that a prospective buyer can choose from. In the market for used goods there is always a considerable problem of asymmetric information. Some researchers argue that, in a situation of uncertainty, the price of such a good can be taken as an indicator of quality2 – it cannot be too low, which would suggest a lower value, but at the same time it cannot be too high, as buyers can always choose another offer. A good, generalized model of the price of a used car would tell us not only what the market's assessment of a particular car model is, but also a lot about how the market works – what influences the price the most and what matters less. Thus, in the first part of the project we will attempt to predict the price of a used car based on its characteristics.

Description of data and problem analysed

The dataset autos contains information about used cars offered for sale on the auction site. The observations were scraped using Python's Scrapy library by Orges Leka. The data, together with all the information about its author, is available here: https://www.kaggle.com/orgesleka/used-cars-database/data#autos.csv.

In its currently available state, the set contains twenty variables of different types and 189'349 observations (the product of the numbers of rows and columns is above 3'000'000), so it can be treated as a large dataset.

autos <- read.csv('autos.csv')
nrow(autos)*ncol(autos) #large dataset
## [1] 3786980

The data was retrieved from a German site (eBay Kleinanzeigen), so before its first use it needs to be translated. Also, because it is a result of scraping, there are many missing values, encoded only as empty strings, which also need to be handled before use. The format of the data needs to be adjusted as well, as many of the variables are read in as character strings where factors should be used. The detailed code for these first transformations is included below.
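Before these conversions, it may be helpful to see how many cells in each column are in fact such empty strings. A minimal sketch of that check (run on the freshly loaded data, before the factor transformations below):

#count empty-string cells per column - these are de facto missing values
sapply(autos, function(x) sum(x == "", na.rm = TRUE))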

autos$seller <- factor(autos$seller)
levels(autos$seller) <- c('dealer', 'private')

autos$offerType <- factor(autos$offerType)
levels(autos$offerType) <- c('offer', 'request')

autos$abtest <- factor(autos$abtest)

autos$vehicleType <- factor(autos$vehicleType)
levels(autos$vehicleType) <- c('NULL', 'other', 'bus', 'cabriolet', 'coupe', 
                             'small_car', 'kombi', 'limousine', 'suv')

autos$gearbox <- factor(autos$gearbox)
levels(autos$gearbox) <- c('NULL', 'automatic', 'manual')

autos$model <- factor(autos$model)
levels(autos$model)[1] <- 'NULL'
levels(autos$model)[42] <- 'other'

autos$monthOfRegistration <- factor(autos$monthOfRegistration)
levels(autos$monthOfRegistration) <- c('NULL', 'Jan', 'Feb', 'Mar', 'Apr', 'May',
                                       'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec')

autos$fuelType <- factor(autos$fuelType)
levels(autos$fuelType) <- c('NULL', 'other', 'gasoline', 'cng', 'diesel', 
                          'electro','hybrid', 'lpg')

autos$brand <- factor(autos$brand)

autos$notRepairedDamage <- factor(autos$notRepairedDamage)
levels(autos$notRepairedDamage) <- c('NULL', 'yes', 'no')

autos$dateCrawled <- as.POSIXct(autos$dateCrawled, format = '%Y-%m-%d %H:%M:%OS', tz = "GMT")
autos$dateCreated <- as.POSIXct(autos$dateCreated, format = '%Y-%m-%d %H:%M:%OS', tz = "GMT")
autos$lastSeen <- as.POSIXct(autos$lastSeen, format = '%Y-%m-%d %H:%M:%OS', tz = "GMT")

Initial descriptive analysis of the data

The dataset contains twenty different variables that describe a car's characteristics as found in the ad:

  • dateCrawled: when this ad was first crawled, all field-values are taken from this date
  • name: “name” of the car taken from the title of an advertisement
  • seller: type of a seller - private or dealer
  • offerType: type of an ad - offer or request
  • price: the selling price stated in the ad (in euros)
  • abtest: a variable with two levels (test/control) that is not meaningful for the car itself - probably an artifact of the author's initial modelling process
  • vehicleType: type of the vehicle, relating to its size and general characteristics
  • yearOfRegistration: the year in which the car was first registered
  • gearbox: whether the car has an automatic or a manual gearbox (contains missing values, which probably indicate a manual gearbox, the non-reported default option)
  • powerPS: power of the car in horsepower (ger. Pferdestärke)
  • model: a model of a car (a categorical variable with many levels)
  • kilometer: how many kilometers the car has driven
  • monthOfRegistration: the month in which the car was first registered
  • fuelType: type of fuel needed to power the car
  • brand: a brand that produced this car
  • notRepairedDamage: whether the car has damage that has not been repaired yet (contains missing values, which probably indicate the absence of major unrepaired damage - the non-reported default option)
  • dateCreated: the date on which the ad was created on eBay
  • nrOfPictures: number of pictures in the ad (this field contains only 0's due to a problem with the crawler)
  • postalCode: a German postal code identifying the location of the seller
  • lastSeenOnline: when the crawler saw this ad last online

The class of each variable in the properly read dataset is presented below. In total we have three variables representing dates (POSIXct class), eleven unordered factors (including the car name) and six integer columns.

#the initial state of the properly read dataset
str(autos)
## 'data.frame':    189349 obs. of  20 variables:
##  $ dateCrawled        : POSIXct, format: "2016-03-24 11:52:17" "2016-03-24 10:58:45" ...
##  $ name               : Factor w/ 128113 levels "–_ein_Kombi_+_4WD__perfekt_fuer_alle_Lebenslagen__",..: 43604 2399 50197 44742 94997 15678 80970 118441 34980 119726 ...
##  $ seller             : Factor w/ 2 levels "dealer","private": 2 2 2 2 2 2 2 2 2 2 ...
##  $ offerType          : Factor w/ 2 levels "offer","request": 1 1 1 1 1 1 1 1 1 1 ...
##  $ price              : int  480 18300 9800 1500 3600 650 2200 0 14500 999 ...
##  $ abtest             : Factor w/ 2 levels "control","test": 2 2 2 2 2 2 2 2 1 2 ...
##  $ vehicleType        : Factor w/ 9 levels "NULL","other",..: 1 5 9 6 6 8 4 8 3 6 ...
##  $ yearOfRegistration : int  1993 2011 2004 2001 2008 1995 2004 1980 2014 1998 ...
##  $ gearbox            : Factor w/ 3 levels "NULL","automatic",..: 3 3 2 3 3 3 3 3 3 3 ...
##  $ powerPS            : int  0 190 163 75 69 102 109 50 125 101 ...
##  $ model              : Factor w/ 250 levels "NULL","1_reihe",..: 119 1 120 119 104 13 9 42 58 119 ...
##  $ kilometer          : int  150000 125000 125000 150000 90000 150000 150000 40000 30000 150000 ...
##  $ monthOfRegistration: Factor w/ 13 levels "NULL","Jan","Feb",..: 1 6 9 7 8 11 9 8 9 1 ...
##  $ fuelType           : Factor w/ 8 levels "NULL","other",..: 3 5 5 3 5 3 3 3 3 1 ...
##  $ brand              : Factor w/ 40 levels "alfa_romeo","audi",..: 39 2 15 39 32 3 26 39 11 39 ...
##  $ notRepairedDamage  : Factor w/ 3 levels "NULL","yes","no": 1 2 1 3 3 2 3 3 1 1 ...
##  $ dateCreated        : POSIXct, format: "2016-03-24" "2016-03-24" ...
##  $ nrOfPictures       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ postalCode         : int  70435 66954 90480 91074 60437 33775 67112 19348 94505 27472 ...
##  $ lastSeen           : POSIXct, format: "2016-04-07 03:16:57" "2016-04-07 01:46:50" ...
summary(autos)
##   dateCrawled                                   name       
##  Min.   :2016-03-05 14:06:22   Ford_Fiesta        :   336  
##  1st Qu.:2016-03-13 13:48:27   Volkswagen_Golf_1.4:   335  
##  Median :2016-03-21 18:44:48   BMW_318i           :   334  
##  Mean   :2016-03-21 14:11:47   Opel_Corsa         :   316  
##  3rd Qu.:2016-03-29 15:39:13   BMW_316i           :   258  
##  Max.   :2016-04-07 14:36:58   BMW_320i           :   256  
##                                (Other)            :187514  
##      seller         offerType          price              abtest     
##  dealer :     2   offer  :189341   Min.   :       0   control:91131  
##  private:189347   request:     8   1st Qu.:    1150   test   :98218  
##                                    Median :    2950                  
##                                    Mean   :   10895                  
##                                    3rd Qu.:    7200                  
##                                    Max.   :99999999                  
##                                                                      
##     vehicleType    yearOfRegistration      gearbox          powerPS       
##  limousine:48701   Min.   :1000       NULL     : 10395   Min.   :    0.0  
##  small_car:40759   1st Qu.:1999       automatic: 39220   1st Qu.:   70.0  
##  kombi    :34498   Median :2003       manual   :139734   Median :  105.0  
##  NULL     :19437   Mean   :2005                          Mean   :  114.7  
##  bus      :15532   3rd Qu.:2008                          3rd Qu.:  150.0  
##  cabriolet:11668   Max.   :9999                          Max.   :19208.0  
##  (Other)  :18754                                                          
##      model          kilometer      monthOfRegistration     fuelType     
##  golf   : 15286   Min.   :  5000   NULL   :19025       gasoline:114106  
##  other  : 13459   1st Qu.:125000   Mar    :18490       diesel  : 54968  
##  3er    : 10528   Median :150000   Jun    :16892       NULL    : 16936  
##  NULL   : 10398   Mean   :125640   Apr    :15762       lpg     :  2732  
##  polo   :  6714   3rd Qu.:150000   May    :15759       cng     :   308  
##  corsa  :  6415   Max.   :150000   Jul    :14858       hybrid  :   142  
##  (Other):126549                    (Other):88563       (Other) :   157  
##            brand       notRepairedDamage  dateCreated                 
##  volkswagen   :40687   NULL: 36542       Min.   :2014-03-10 00:00:00  
##  bmw          :20545   yes : 18410       1st Qu.:2016-03-13 00:00:00  
##  opel         :20427   no  :134397       Median :2016-03-21 00:00:00  
##  mercedes_benz:17931                     Mean   :2016-03-20 20:19:40  
##  audi         :16676                     3rd Qu.:2016-03-29 00:00:00  
##  ford         :13093                     Max.   :2016-04-07 00:00:00  
##  (Other)      :59990                                                  
##   nrOfPictures   postalCode       lastSeen                  
##  Min.   :0     Min.   : 1067   Min.   :2016-03-05 14:15:08  
##  1st Qu.:0     1st Qu.:30559   1st Qu.:2016-03-23 11:53:24  
##  Median :0     Median :49661   Median :2016-04-03 23:17:40  
##  Mean   :0     Mean   :50892   Mean   :2016-03-30 03:46:56  
##  3rd Qu.:0     3rd Qu.:71577   3rd Qu.:2016-04-06 10:45:49  
##  Max.   :0     Max.   :99998   Max.   :2016-04-07 14:58:51  
## 

The general descriptive statistics presented in the code chunk above help assess the quality of the data. As we can see, the target variable price reaches some troubling levels at its extremes. We can expect that the zeros and the 99999999 values represent missing information. We can also see some strange values for yearOfRegistration - years of 1000 and 9999 indicate problems with the data. We also have some zeros in the powerPS variable, which should not happen for a standard vehicle. The variables abtest and nrOfPictures provide no insight into the car's characteristics, so they should probably be dropped from further analysis. We can observe almost no variation in the factors seller and offerType - that could be an issue in future modelling. We also have a 'NULL' class in several factors - it represents missing information for the given characteristic.

Even this very simplistic EDA gives us an outline of what to expect in the data and which issues we need to address before we start the modelling process.
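The suspicious values spotted in the summary can also be counted directly. A minimal sketch, using the thresholds discussed above:

#how many observations carry the suspicious values spotted in the summary
sum(autos$price == 0 | autos$price == 99999999)
sum(autos$yearOfRegistration < 1900 | autos$yearOfRegistration > 2016)
sum(autos$powerPS == 0)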

Variable transformations and cleaning of the dataset

First we will take care of the missing and incorrect values in the dataset. The most obvious change is to recode the price values 0 and 99999999 to NA, as these cannot be considered realistic prices offered by a seller. Left in place, such values would distort our analysis and worsen the predictions.

Price

tail(autos[order(autos$price), 'price'], 50) #these values seem strange and are most probably incorrect (made-up prices used instead of a proper price tag in the ad - not informative for us)
##  [1]   745000   745000   820000   849000   911911   999990   999999
##  [8]   999999   999999   999999   999999   999999   999999   999999
## [15]  1000000  1111111  1234566  1600000  1600000  2995000  3890000
## [22]  3895000  9999999  9999999 10000000 10000000 10000000 10000000
## [29] 10000000 10000000 10000000 10010011 11111111 11111111 11111111
## [36] 11111111 11111111 11111111 11111111 12345678 12345678 12345678
## [43] 27322222 99000000 99999999 99999999 99999999 99999999 99999999
## [50] 99999999
#let's clean the dataset of the most unbelievable prices
autos[autos$price== 99999999,]$price <- 0
autos[autos$price== 999999,]$price <- 0
autos[autos$price== 11111111,]$price <- 0
autos[autos$price== 12345678,]$price <- 0

autos[autos$price== 0,]$price <- NA

As we could also see from the general descriptive statistics of the price variable, there are some extreme values that inflate its mean. They are the cause of the high skewness of this feature. We will overcome that by applying a log transformation, which brings the distribution closer to normal. For now we will keep both the original and the transformed variable, but it is possible that we will model only the transformed one.

autos$logprice<-log(autos$price)
hist(autos$logprice)
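The effect of the transformation can also be checked numerically by comparing the skewness of both variables. A minimal sketch, assuming the e1071 package is available (any other skewness implementation would work equally well):

require(e1071)
skewness(autos$price, na.rm = TRUE)    #raw price: strong right skew expected
skewness(autos$logprice, na.rm = TRUE) #log price: much closer to symmetry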

We will verify our previous observation using the Box-Cox procedure. As we can see from the output below, the optimal lambda is close to 0, so a log transformation of price should be enough to get better results in the modelling phase. Thanks to that confirmation, in the next phases we will use only logprice as our target variable.

require('bestNormalize')
price_boxcox <- boxcox(autos$price)

price_boxcox$lambda
## [1] 0.1181524

Next we will clean up values that are problematic in themselves. The year of the first registration of a car should not be lower than 1900 (such cars are museum artifacts and are not sold on eBay). It should also not be higher than the year in which the ad was published, which was 2016 (the last scraped data is from April 2016). That is why we have decided to recode those troubling values into missing values.

autos[autos$yearOfRegistration== 9999,]$yearOfRegistration <- 0
autos[autos$yearOfRegistration > 2016,]$yearOfRegistration <- 0 #newer than the date the data was scraped
autos[autos$yearOfRegistration < 1900,]$yearOfRegistration <- 0
autos[autos$yearOfRegistration== 0,]$yearOfRegistration <- NA

The same applies to the horsepower of a vehicle - no car should have a power of 0. Those entries are incorrect.

autos[autos$powerPS== 0,]$powerPS<-NA

We will also take care of the NULL categories in gearbox and notRepairedDamage - these will be changed to their default values (manual and no, respectively).

#gearbox and notRepairedDamage
autos[autos$gearbox == 'NULL',]$gearbox <- 'manual' #change to defaults
autos[autos$notRepairedDamage == 'NULL',]$notRepairedDamage <- 'no'

#drop unused factor levels
autos$gearbox <- droplevels(autos$gearbox)
autos$notRepairedDamage <- droplevels(autos$notRepairedDamage)

PostalCode to LAND - an inspiration for future analysis

There was an attempt to convert German postal codes into more general categories, such as the federal states (Länder), but that was not a straightforward task. Postal codes in Germany do not correspond well with the borders of the NUTS1 regions3. The only viable way would be to read German GIS structures and try to overlay them with the postal codes. Due to the additional effort and the uncertain outcome, the idea was dropped. It would, however, be a good inspiration for a future analysis of this dataset4.
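As a rough illustration of a much coarser (but straightforward) alternative, the first digit of a German postal code defines one of ten broad postal zones, which could serve as a simple regional factor. A minimal sketch, where plzZone is a hypothetical helper variable that is not used further in this analysis:

#first digit of the (zero-padded) postal code = one of ten broad postal zones
autos$plzZone <- factor(substr(sprintf("%05d", autos$postalCode), 1, 1))
table(autos$plzZone)
autos$plzZone <- NULL #kept only as an illustration, dropped again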

During a careful analysis of the dataset some odd patterns were found. Some ads did not offer any car for sale, but rather indicated an eagerness to buy a vehicle with a given specification. Some exchange offers were present as well. These usually contain missing or untrue values, as they are only an outline of a request and not an actual sale offer. We have decided to correct them in the dataset - we believe their offerType should be set to request.

We managed to tackle them with a few manipulations of our dataset. The troublesome ads had titles starting with 'seek' or 'exchange' (in German: 'suche', 'tausche'). We extracted the first word from each title and converted it to lowercase. Then, for the observations where the words 'suche', 'tausch' or 'tausche' were present, the offerType was changed to request.

#Suche, SUCHE, suche, tausch, Tausch, TAUSCH, tausche - ads starting with these words are likely to contain wrong values - they are not offers for sale

autos$subName <- tolower(sub("_.*", "", autos$name)) #get the first word from the ad title

#change those troublesome offers to their proper type, which should be request
autos[autos$subName == 'suche',]$offerType <- 'request' #I am looking for...
autos[autos$subName == 'tausch',]$offerType <- 'request' #I will exchange...
autos[autos$subName == 'tausche',]$offerType <- 'request' #I will exchange...

#clear the dataset of the temp variable
autos$subName <- NULL

In the next step we will exclude requests from our dataset, as they do not represent the sale ads which are of interest in this paper. The name variable will be excluded as well: it is not comparable across observations, and instead of adding explanatory power it may only cause computational problems. Moreover, all the important cases have already been cleaned with the help of this covariate, so we will no longer need the original name variable in our dataset.

#exclude requests from database
autos <- autos[autos$offerType == 'offer',]
#drop the unused variable
autos$offerType <- NULL
autos$name <- NULL

The variables lastSeen and dateCreated on their own do not give us any information about the offer. However, they can be used to approximate how long the ad was open on the market. That feature could be useful in determining how attractive the offer was, and it can also be linked to the market matching mechanism: if the price was relatively low, the offer should stay active for a short time, while a longer presence of the ad on the website could indicate low demand, possibly caused by a price that was too high.

#calculate time difference in days
autos$howLong <- difftime(autos$lastSeen, autos$dateCreated, units = c("days"))
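One technical note: a difftime column is not reported as numeric by functions such as is.numeric(), so if howLong is to enter purely numeric summaries or correlation checks, an explicit numeric copy may be needed. A minimal, optional sketch, where howLongDays is a hypothetical helper that is not used further:

#difftime carries its own class; a plain numeric copy (in days) can be handy
howLongDays <- as.numeric(autos$howLong, units = "days")
summary(howLongDays)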

We have decided to remove abtest and nrOfPictures, as these variables give no information about the item and are artifacts of a faulty data generation process or earlier processing. The already exploited (and no longer useful) variables lastSeen, dateCrawled and dateCreated were dropped from the analysis as well.

autos$abtest <- NULL
autos$nrOfPictures <- NULL
autos$lastSeen <- NULL
autos$dateCrawled <- NULL
autos$dateCreated <- NULL

Treating the NAs in the dataset

Our dataset contains a lot of observations, but some of them are incorrect and misleading. Our data stems from ads on eBay, which are created by individual users - that causes problems with the consistency of the data and may be an issue for the modelling. We have spotted some of the incorrect observations and changed them to missing values. We believe, however, that there might still be more incorrect observations, and these are likely to be tied to missing values in some categories (an incompletely filled offer, a non-standard vehicle type, etc.). From the point of view of our research goal, these are not observations we would like to include in our model. To predict the prices of used cars, we want the best and most truthful data we can get.

#percentage of missing values
(1-length(which(complete.cases(autos)))/nrow(autos))*100 
## [1] 15.21267
#number of non-missing cases - data of good quality
length(which(complete.cases(autos)))
## [1] 159390

Missing data is an issue for about 15% of the observations. If we wanted to impute all of them, we would introduce reliability problems: for 15% of the dataset we would be working with made-up values. That is unacceptable if we want a model as close to reality as possible. On the other hand, even after removing all the problematic cases, we would still have almost 160 thousand observations. A dataset of this size can still be considered large and is enough for a proper analysis.
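Before dropping the incomplete rows, it is worth checking which variables contribute the most missing values. A minimal sketch of that check:

#number of missing values per column, sorted from the most affected
sort(colSums(is.na(autos)), decreasing = TRUE)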

Given that, we have decided simply to exclude the incomplete cases from our analysis.

#remove the missings in the dataset
autos <- autos[which(complete.cases(autos)),]

save(list = "autos", file = "autos_prepared.RData")

Variable selection methods

First we will check whether some numeric variables are highly correlated with each other (which would cause a collinearity problem).

require(dplyr)
autos_numeric_vars <- sapply(autos, is.numeric) %>%  which() %>%  names()
autos_correlations <- cor(autos[,autos_numeric_vars], use = "pairwise.complete.obs")
require(corrplot)
corrplot(autos_correlations)

As we can see, there is not much correlation among the variables - collinearity is not an issue for the numeric features. However, we do not see high correlations with the price variable either (there is some correlation with kilometer only). The results look better for the log-transformed variable, which gives more hope for the modelling phase: logprice is related to yearOfRegistration and powerPS, and more strongly to kilometer. The signs of these relations are also as we would expect. The postalCode variable has almost no influence on our target, which is expected, as on its own it does not provide any useful information in a non-spatial analysis. We will exclude it from the dataset.

autos$postalCode <- NULL

We will now use ANOVA for the categorical variables to check whether the mean of the dependent variable differs across the subgroups defined by their levels.

autos_categorical_vars <- sapply(autos, is.factor) %>% which() %>% names()

# The higher the F-statistic (or the lower its p-value) the stronger we reject H0 (of no difference between groups)

autos_F_anova <- function(categorical_var) {
  anova_ <- aov(autos$logprice ~ autos[[categorical_var]]) 
  return(summary(anova_)[[1]][1, 4]) #get the F statistic 
}

sapply(autos_categorical_vars, autos_F_anova) %>% 
  sort(decreasing = TRUE) -> autos_anova_all_categorical

autos_anova_all_categorical
save(list = c('autos_anova_all_categorical'), file = "autos_Ftest.RData")
##             gearbox   notRepairedDamage         vehicleType 
##        1.489636e+04        1.230559e+04        3.448338e+03 
##            fuelType               brand monthOfRegistration 
##        3.012229e+03        7.300495e+02        5.144894e+02 
##               model              seller 
##        2.729511e+02        7.516753e-01

The results above present the F test statistic for each variable. The higher the F-statistic, the more strongly we reject H0 (of no difference between groups). It seems that our variables have a significant effect on the logarithmic price, except for the seller variable. That was expected, as it has almost no variation. Thus we will exclude it from our analysis.

autos$seller <- NULL 

Finally we will check whether our dataset includes any other variables with very small variance. To do this, we will use the nearZeroVar function from the caret package. As we can see from the output below, we have successfully cleaned the dataset of variables that do not provide enough variation: none of the remaining variables (both original and transformed) suffer from the near-zero-variance issue. We can now move on to the modelling phase.

require(caret)
nearZeroVar(autos, saveMetrics = TRUE) -> autos_nzv_stats
save(list = c('autos_nzv_stats'), file = 'autos_nzv.RData')
autos_nzv_stats
##                     freqRatio percentUnique zeroVar   nzv
## price                1.003150   2.564778217   FALSE FALSE
## vehicleType          1.236917   0.005646527   FALSE FALSE
## yearOfRegistration   1.028812   0.052073530   FALSE FALSE
## gearbox              3.515425   0.001254784   FALSE FALSE
## powerPS              1.495796   0.382081686   FALSE FALSE
## model                1.142278   0.156220591   FALSE FALSE
## kilometer            5.969149   0.008156095   FALSE FALSE
## monthOfRegistration  1.103725   0.008156095   FALSE FALSE
## fuelType             2.011839   0.005019135   FALSE FALSE
## brand                1.881764   0.025095677   FALSE FALSE
## notRepairedDamage    9.802440   0.001254784   FALSE FALSE
## logprice             1.003150   2.564778217   FALSE FALSE
## howLong              1.000000  60.437292176   FALSE FALSE

Modelling phase

Train/test split

At this step, with our data prepared and preprocessed, we will split it into training and test samples in a 60/40 proportion. The createDataPartition function will be used to obtain a similar distribution of the target variable in both subsamples.

set.seed(111) #seed for reproducibility
autos_which_train <- createDataPartition(autos$logprice, p = 0.6,list = FALSE) 

autos_train <- autos[autos_which_train,]
autos_test <- autos[-autos_which_train,]

save(list = c("autos_train", "autos_test"), file = "autos_traintest.RData")

The larger sample will be used to train three different models: linear regression, KNN and SVM. After each model has been properly fitted and tuned, we will compare the performance of the final models from the different classes, as sketched below.
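The comparison itself can be wrapped in a small helper applied to every fitted model on the test sample. A minimal sketch using postResample from the already loaded caret package (fitted_model is a placeholder for any of the trained models, assuming it supports predict on new data):

#returns RMSE, R squared and MAE of a model's predictions on the test set
evaluate_on_test <- function(fitted_model) {
  preds <- predict(fitted_model, newdata = autos_test)
  postResample(pred = preds, obs = autos_test$logprice)
}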

Linear regression model

We will start the modelling phase by estimating a linear regression model for the logarithm of the car price. Our initial model will include all the variables that remain in the dataset after the pre-processing and feature selection (except for the original price variable).

options(scipen=10)
options(contrasts = c("contr.treatment", "contr.treatment")) 
autos_lm1 <- lm(logprice ~ howLong + notRepairedDamage + powerPS +  kilometer +
                  yearOfRegistration + gearbox + vehicleType + fuelType +
                   monthOfRegistration +  brand + model,  
                   data = autos_train) 
save(list = c("autos_lm1"), file = "autos_lm.RData")
summary(autos_lm1)
## 
## Call:
## lm(formula = logprice ~ howLong + notRepairedDamage + powerPS + 
##     kilometer + yearOfRegistration + gearbox + vehicleType + 
##     fuelType + monthOfRegistration + brand + model, data = autos_train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10.6051  -0.3072   0.0489   0.3621   9.8067 
## 
## Coefficients:
##                                 Estimate       Std. Error  t value
## (Intercept)             -135.31152049106    0.92327592863 -146.556
## howLong                    0.00946701349    0.00027459914   34.476
## notRepairedDamageno        0.66386164587    0.00822396845   80.723
## powerPS                    0.00046085944    0.00001865219   24.708
## kilometer                 -0.00000729647    0.00000007116 -102.535
## yearOfRegistration         0.07086283965    0.00045680168  155.128
## gearboxmanual             -0.18731183755    0.00690147242  -27.141
## vehicleTypeother           0.85120621085    0.03035277568   28.044
## vehicleTypebus             0.98311758358    0.01925202750   51.066
## vehicleTypecabriolet       1.49231461582    0.01795714873   83.104
## vehicleTypecoupe           1.30975179098    0.01848236960   70.865
## vehicleTypesmall_car       0.84991765817    0.01571834408   54.072
## vehicleTypekombi           0.99301710276    0.01547030806   64.189
## vehicleTypelimousine       1.04462383390    0.01498533741   69.710
## vehicleTypesuv             1.25153286070    0.02505844963   49.945
## fuelTypeother             -0.47030499764    0.11548898271   -4.072
## fuelTypegasoline           0.10233805051    0.01195059448    8.563
## fuelTypecng                0.25573040134    0.05960287243    4.291
## fuelTypediesel             0.40093158500    0.01279980903   31.323
## fuelTypeelectro           -0.41337620219    0.15406719770   -2.683
## fuelTypehybrid             0.14120763384    0.09187054437    1.537
## fuelTypelpg                0.16333749481    0.02248365706    7.265
## monthOfRegistrationJan     0.26913575255    0.01373999375   19.588
## monthOfRegistrationFeb     0.28146712180    0.01396367516   20.157
## monthOfRegistrationMar     0.27672025475    0.01271610717   21.761
## monthOfRegistrationApr     0.27291923976    0.01308224625   20.862
## monthOfRegistrationMay     0.27438324984    0.01306964946   20.994
## monthOfRegistrationJun     0.26580712325    0.01291863899   20.575
## monthOfRegistrationJul     0.31289508188    0.01321787368   23.672
## monthOfRegistrationAug     0.27421753277    0.01384484963   19.806
## monthOfRegistrationSep     0.31207207223    0.01366419572   22.839
## monthOfRegistrationOct     0.29537966097    0.01338182033   22.073
## monthOfRegistrationNov     0.31087139765    0.01365863463   22.760
## monthOfRegistrationDec     0.27841743542    0.01361576333   20.448
## brandaudi                  0.58678381973    0.06218332445    9.436
## brandbmw                   0.48311567639    0.06637493611    7.279
## brandchevrolet             0.27272541440    0.07021690087    3.884
## brandchrysler             -0.27697114537    0.08443821773   -3.280
## brandcitroen              -0.61653464053    0.06546740973   -9.417
## branddacia                 0.01476819156    0.22712337753    0.065
## branddaewoo               -0.90808809785    0.11060591129   -8.210
## branddaihatsu             -0.53807376866    0.13529715956   -3.977
## brandfiat                 -0.46751450083    0.06361390546   -7.349
## brandford                 -0.27390168638    0.06374083825   -4.297
## brandhonda                -0.13332923953    0.08672830207   -1.537
## brandhyundai              -0.66121691061    0.06795679366   -9.730
## brandjaguar                0.86623864656    0.09801562890    8.838
## brandjeep                  0.23789259749    0.10008364830    2.377
## brandkia                  -0.62302393032    0.07985831905   -7.802
## brandlada                  0.58866119327    0.33098847938    1.778
## brandlancia               -0.22539098411    0.13387469894   -1.684
## brandland_rover            1.11553340515    0.25185247394    4.429
## brandmazda                -0.44330689701    0.06479430257   -6.842
## brandmercedes_benz         0.70345334722    0.05997302611   11.729
## brandmini                  0.76256776218    0.13870002070    5.498
## brandmitsubishi           -0.44207499142    0.07444471787   -5.938
## brandnissan               -0.07037554138    0.07260380765   -0.969
## brandopel                 -0.00528808555    0.06280271830   -0.084
## brandpeugeot              -0.38018378713    0.06339038229   -5.997
## brandporsche               1.64499827552    0.08745759855   18.809
## brandrenault              -0.09096493619    0.06780466349   -1.342
## brandrover                -0.86902568710    0.08625862109  -10.075
## brandsaab                 -0.37482709976    0.09031086638   -4.150
## brandseat                 -0.66723008097    0.10677686650   -6.249
## brandskoda                -0.59711620434    0.08978822485   -6.650
## brandsmart                -0.33096290069    0.11311112291   -2.926
## brandsonstige_autos        1.38715022064    0.06484032018   21.393
## brandsubaru                0.32366648882    0.15614439723    2.073
## brandsuzuki               -0.47949633793    0.06893267786   -6.956
## brandtoyota                0.03951973047    0.07138343134    0.554
## brandtrabant               0.94960837043    0.17059271798    5.567
## brandvolkswagen            0.29677346209    0.06235196581    4.760
## brandvolvo                 0.19724043882    0.07916580622    2.491
## model1_reihe              -0.01138263410    0.05096343079   -0.223
## model100                  -0.00211364383    0.07454984455   -0.028
## model145                  -0.61798594725    0.21840607264   -2.830
## model147                  -0.19296581462    0.08216201178   -2.349
## model156                  -0.33274515998    0.08366161827   -3.977
## model159                   0.61240034746    0.10644311988    5.753
## model1er                   0.24981629020    0.04076760161    6.128
## model2_reihe               0.31480054616    0.03884368391    8.104
## model200                   1.13265737268    0.23275216546    4.866
## model3_reihe               0.30765396443    0.03849649791    7.992
## model300c                  1.17703878685    0.12829834785    9.174
## model3er                   0.07446703779    0.03559611465    2.092
## model4_reihe               0.33960660893    0.06588841994    5.154
## model5_reihe               1.12770538575    0.07312855271   15.421
## model500                   0.84808770637    0.05662537533   14.977
## model5er                   0.25027795569    0.03748302881    6.677
## model6_reihe               0.52601874876    0.05323659098    9.881
## model601                  -0.40348084241    0.18578309157   -2.172
## model6er                   1.07777543937    0.08977679546   12.005
## model7er                   0.46995107865    0.05412986999    8.682
## model80                   -0.49583113994    0.04802119112  -10.325
## model850                  -0.30156607848    0.11802028778   -2.555
## model90                    0.06469455706    0.16222779980    0.399
## model900                   0.71740727007    0.14857829323    4.828
## model9000                  0.96030380460    0.21537194830    4.459
## model911                   0.99576857501    0.08311563872   11.981
## modela_klasse             -0.60040052912    0.03567852340  -16.828
## modela1                    0.05770501834    0.06357269857    0.908
## modela2                    0.22876869936    0.07882727051    2.902
## modela3                   -0.02131173524    0.03521603285   -0.605
## modela4                   -0.04678581941    0.03361120562   -1.392
## modela5                    0.36080504720    0.05277814820    6.836
## modela6                    0.07038604380    0.03559965610    1.977
## modela8                    0.31389022817    0.06582527212    4.769
## modelaccord                0.53351557832    0.10037915996    5.315
## modelagila                -0.16526933581    0.07615420796   -2.170
## modelalhambra              1.14066684927    0.11731021942    9.724
## modelalmera               -0.22733280140    0.07823784762   -2.906
## modelaltea                 1.20592834071    0.11682730284   10.322
## modelamarok                0.35453068179    0.20515855008    1.728
## modelother                 0.50060627180    0.01935210730   25.868
## modelantara                0.36169233779    0.13729944320    2.634
## modelarosa                 0.39594268526    0.10400900940    3.807
## modelastra                -0.04118940526    0.03312972853   -1.243
## modelauris                 0.37462959514    0.10246635844    3.656
## modelavensis               0.48306476091    0.07515888591    6.427
## modelaveo                 -0.19693226678    0.16238139142   -1.213
## modelaygo                  0.02605914955    0.08385466445    0.311
## modelb_klasse              0.02872039290    0.05585359082    0.514
## modelb_max                 0.62861973716    0.24590376924    2.556
## modelbeetle                0.19108317359    0.05561390829    3.436
## modelberlingo              0.64340958555    0.06441685163    9.988
## modelbora                 -0.00590453390    0.05507913747   -0.107
## modelboxster              -0.45897514955    0.09731227350   -4.717
## modelbravo                 0.17407733884    0.09338843724    1.864
## modelc_klasse             -0.24650132798    0.03120203154   -7.900
## modelc_max                 0.72258800802    0.06425850299   11.245
## modelc_reihe               0.23720200762    0.11493461124    2.064
## modelc1                    0.56590889752    0.07949412441    7.119
## modelc2                    0.48492477133    0.08323529456    5.826
## modelc3                    0.66539075271    0.07380051739    9.016
## modelc4                    0.84829345683    0.07767474473   10.921
## modelc5                    0.69209681728    0.07627616414    9.074
## modelcaddy                 0.12557517641    0.04452345952    2.820
## modelcalibra              -0.06958037859    0.12241049050   -0.568
## modelcaptiva               0.14364394636    0.11841799077    1.213
## modelcarisma              -0.16008189328    0.09408891303   -1.701
## modelcarnival              0.35300000833    0.10767681703    3.278
## modelcayenne              -0.29428510358    0.09995329872   -2.944
## modelcc                    0.51631973433    0.09603701849    5.376
## modelceed                  0.81689418628    0.09981558521    8.184
## modelcharade               0.05823465678    0.34922547377    0.167
## modelcherokee              0.35676190318    0.13356579843    2.671
## modelcitigo                0.54575078085    0.15162728733    3.599
## modelcivic                 0.22995605884    0.07994966509    2.876
## modelcl                    0.37488398383    0.08267049754    4.535
## modelclio                 -0.07002734060    0.04785112019   -1.463
## modelclk                  -0.17733444484    0.04340758225   -4.085
## modelclubman               0.02003214521    0.15972724345    0.125
## modelcolt                  0.11389678435    0.07206919551    1.580
## modelcombo                -0.08803794344    0.08124916023   -1.084
## modelcooper                0.00061191669    0.13115086675    0.005
## modelcordoba               0.24843443879    0.11667825827    2.129
## modelcorolla               0.14750599164    0.07558710963    1.951
## modelcorsa                -0.22914560085    0.03305933751   -6.931
## modelcr_reihe              0.64571685418    0.10124467359    6.378
## modelcroma                 0.53663167821    0.19824304755    2.707
## modelcrossfire             0.86750162280    0.19419248985    4.467
## modelcuore                -0.01999841870    0.14887021185   -0.134
## modelcx_reihe              0.97660553711    0.11766240088    8.300
## modeldefender              0.29585709255    0.26781479069    1.105
## modeldelta                 1.13205053755    0.28572962557    3.962
## modeldiscovery            -0.14265305291    0.27668237951   -0.516
## modeldoblo                 0.48878497153    0.08814620838    5.545
## modelducato                0.83461390481    0.07980698452   10.458
## modelduster                0.05012119020    0.24175829977    0.207
## modele_klasse             -0.03631460016    0.03180895325   -1.142
## modelelefantino           -0.97019267128    0.38490925503   -2.521
## modeleos                   0.05335054903    0.06395116927    0.834
## modelescort               -0.56993760981    0.05971651145   -9.544
## modelespace                0.11259546498    0.07766751676    1.450
## modelexeo                  1.19940612955    0.15626475189    7.675
## modelfabia                 0.80201865499    0.07807338203   10.273
## modelfiesta                0.06568576073    0.03933846585    1.670
## modelfocus                 0.24919523074    0.03863671951    6.450
## modelforester              0.13771439076    0.18509634718    0.744
## modelforfour               0.49532013421    0.11937975728    4.149
## modelfortwo                0.08375617971    0.09982457413    0.839
## modelfox                  -0.23525151522    0.05758899879   -4.085
## modelfreelander           -0.65763670539    0.25861733776   -2.543
## modelfusion                0.38445604191    0.09146094498    4.203
## modelg_klasse              1.43485835118    0.13341890520   10.755
## modelgalant               -0.15323957352    0.11854251234   -1.293
## modelgalaxy                0.57842716458    0.05662397170   10.215
## modelgetz                  0.48915401575    0.08379963025    5.837
## modelgl                    0.63496639142    0.16302512176    3.895
## modelglk                   0.12392414402    0.10049429084    1.233
## modelgolf                 -0.00150506197    0.02754359736   -0.055
## modelgrand                 0.51226540152    0.07521422490    6.811
## modeli_reihe               0.86113207795    0.05841815200   14.741
## modeli3                    0.44830386666    0.36831664569    1.217
## modelibiza                 0.74432077551    0.09486638696    7.846
## modelimpreza               0.14665699921    0.17723824196    0.827
## modelinsignia              0.62556128835    0.06114340396   10.231
## modeljazz                  0.47834305266    0.10787804561    4.434
## modeljetta                 0.24393765758    0.07865660445    3.101
## modeljimny                 0.78734816818    0.10790867200    7.296
## modeljuke                  0.54514905724    0.13641931243    3.996
## modeljusty                -0.99981199419    0.18767016824   -5.327
## modelka                   -0.41109860598    0.04487342824   -9.161
## modelkadett                0.75438052837    0.09419195048    8.009
## modelkaefer                2.15500041047    0.07759486153   27.772
## modelkalina               -1.51524865402    0.79927793164   -1.896
## modelkalos                 0.67087057770    0.17769454962    3.775
## modelkangoo               -0.07778740359    0.06411749990   -1.213
## modelkappa                 0.37937521570    0.30193363738    1.256
## modelkuga                  0.73209168126    0.07385210095    9.913
## modellaguna               -0.26350599224    0.05609689164   -4.697
## modellancer                0.68569793870    0.11057423511    6.201
## modellanos                 0.13521239484    0.18318855475    0.738
## modellegacy               -0.47009937036    0.18671504258   -2.518
## modelleon                  1.15329650036    0.09843781098   11.716
## modellodgy                 0.00590871736    0.30483360993    0.019
## modellogan                -0.20545312467    0.23370109060   -0.879
## modellupo                 -0.36548836769    0.03932296750   -9.295
## modellybra                -0.68136136429    0.23675729985   -2.878
## modelm_klasse              0.09672105874    0.05371718928    1.801
## modelm_reihe               1.15249014196    0.07866256102   14.651
## modelmateria               1.10104901770    0.32280327705    3.411
## modelmatiz                -0.44828843200    0.08595353724   -5.215
## modelmegane               -0.12458141898    0.04896760284   -2.544
## modelmeriva                0.22437842235    0.05444550512    4.121
## modelmicra                -0.21265783967    0.06139609767   -3.464
## modelmii                   0.60916634713    0.18343524668    3.321
## modelmodus                 0.14961522078    0.08850118261    1.691
## modelmondeo                0.02831434065    0.04158796989    0.681
## modelmove                  0.01954244787    0.22100021604    0.088
## modelmusa                  0.37194596010    0.27239389445    1.365
## modelmustang               2.13228566589    0.08881319656   24.009
## modelmx_reihe              0.70119390195    0.05639453648   12.434
## modelnavara                0.90256301268    0.15490457286    5.827
## modelniva                 -0.98250970548    0.35867780824   -2.739
## modelnote                  0.28410918920    0.13164536554    2.158
## modelnubira                0.47766905918    0.20167494931    2.369
## modeloctavia               1.12225086542    0.07760313224   14.461
## modelomega                -0.32428092342    0.04987537826   -6.502
## modelone                  -0.08453142536    0.13566869383   -0.623
## modeloutlander             1.03339017205    0.13265009979    7.790
## modelpajero                1.06410808801    0.11437228156    9.304
## modelpanda                 0.40426708828    0.06660627973    6.070
## modelpassat                0.01937765709    0.03010286931    0.644
## modelphaeton               0.54230070629    0.09816435157    5.524
## modelpicanto               0.57809241238    0.09001535598    6.422
## modelpolo                 -0.32059925090    0.02958975727  -10.835
## modelprimera              -0.17810144631    0.07638565626   -2.332
## modelptcruiser             0.30763059538    0.10338333271    2.976
## modelpunto                 0.10317506523    0.04344852730    2.375
## modelq3                    0.04007188510    0.09258705211    0.433
## modelq5                    0.45775977171    0.07220845386    6.339
## modelq7                    0.58097307359    0.08319798891    6.983
## modelqashqai               0.63320560333    0.07502092613    8.440
## modelr19                  -0.43631218156    0.15443495400   -2.825
## modelrange_rover          -0.00661712923    0.28078960282   -0.024
## modelrange_rover_evoque   -0.18424921195    0.32381663741   -0.569
## modelrange_rover_sport    -0.03110997103    0.28604303921   -0.109
## modelrangerover            2.21153102268    0.42705951346    5.179
## modelrav                   0.53049619888    0.09255330167    5.732
## modelrio                   0.50409409984    0.09564287436    5.271
## modelroadster              0.32510687296    0.13536296630    2.402
## modelroomster              0.94613116257    0.11112148148    8.514
## modelrx_reihe              0.85497354811    0.12151252433    7.036
## models_klasse              0.62515953685    0.05120031322   12.210
## models_max                 1.01104751822    0.07037696580   14.366
## models_type               -0.33573436933    0.14998886588   -2.238
## models60                   0.38567313374    0.12173367209    3.168
## modelsandero              -0.02586361114    0.23563080704   -0.110
## modelsanta                 0.95148884714    0.08758484167   10.864
## modelscenic               -0.07969605446    0.05576278248   -1.429
## modelscirocco              0.38563996038    0.06319582281    6.102
## modelseicento             -0.21163196538    0.06934070937   -3.052
## modelserie_2               1.81287487611    0.57177920757    3.171
## modelsharan                0.16792436507    0.04517939086    3.717
## modelsignum                0.34489192838    0.07404186516    4.658
## modelsirion                0.54353280120    0.17404561120    3.123
## modelsl                    1.10110035526    0.06178985761   17.820
## modelslk                  -0.05144399168    0.04812059730   -1.069
## modelsorento               0.95588403301    0.10089506781    9.474
## modelspark                -0.51316613272    0.12889681686   -3.981
## modelspider                1.19146571998    0.11558148296   10.308
## modelsportage              0.97317285963    0.10699847928    9.095
## modelsprinter              0.06101353943    0.05896821954    1.035
## modelstilo                 0.22684268083    0.07096706326    3.196
## modelsuperb                1.28252388400    0.09939729185   12.903
## modelswift                 0.46524956577    0.07016641812    6.631
## modelterios                0.86578849746    0.23766595758    3.643
## modeltigra                -0.26719046220    0.05733049736   -4.661
## modeltiguan                0.32481301824    0.05606803389    5.793
## modeltoledo                0.44184114116    0.12354198395    3.576
## modeltouareg               0.56545447702    0.06049451241    9.347
## modeltouran                0.37178893147    0.03746092077    9.925
## modeltransit               0.57126154230    0.06382909675    8.950
## modeltransporter           0.86363392084    0.03428942391   25.187
## modeltt                    0.29875317241    0.04806923586    6.215
## modeltucson                1.01778799903    0.10068198818   10.109
## modeltwingo               -0.33506599807    0.04651773063   -7.203
## modelup                   -0.20781844970    0.08155868167   -2.548
## modelv_klasse             -0.03855527468    0.12550436019   -0.307
## modelv40                  -0.42733173631    0.07707635449   -5.544
## modelv50                   0.36288892142    0.10495363662    3.458
## modelv60                   0.59377796755    0.20383938050    2.913
## modelv70                   0.16649610176    0.08038160258    2.071
## modelvectra               -0.28614615736    0.03749596610   -7.631
## modelverso                 0.62108545294    0.08699316669    7.139
## modelviano                 0.40425952407    0.07833953721    5.160
## modelvito                 -0.13591008342    0.05553892429   -2.447
## modelvivaro                0.58948787061    0.08807654906    6.693
## modelvoyager              -0.14520840198    0.10334463109   -1.405
## modelwrangler              0.89538443532    0.14250098856    6.283
## modelx_reihe               0.48331106619    0.04598194670   10.511
## modelx_trail               0.53771850742    0.10767064122    4.994
## modelx_type               -0.88737391998    0.13666680643   -6.493
## modelxc_reihe              0.60611612730    0.09748027477    6.218
## modelyaris                 0.22036479606    0.06436806786    3.424
## modelyeti                  0.85618967337    0.12050066999    7.105
## modelypsilon              -0.07928501556    0.15991262269   -0.496
## modelz_reihe               0.47876539790    0.06020964108    7.952
## modelzafira                0.27531323077    0.04111960039    6.695
##                         Pr(>|t|)    
## (Intercept)              < 2e-16 ***
## howLong                  < 2e-16 ***
## notRepairedDamageno      < 2e-16 ***
## powerPS                  < 2e-16 ***
## kilometer                < 2e-16 ***
## yearOfRegistration       < 2e-16 ***
## gearboxmanual            < 2e-16 ***
## vehicleTypeother         < 2e-16 ***
## vehicleTypebus           < 2e-16 ***
## vehicleTypecabriolet     < 2e-16 ***
## vehicleTypecoupe         < 2e-16 ***
## vehicleTypesmall_car     < 2e-16 ***
## vehicleTypekombi         < 2e-16 ***
## vehicleTypelimousine     < 2e-16 ***
## vehicleTypesuv           < 2e-16 ***
## fuelTypeother           4.66e-05 ***
## fuelTypegasoline         < 2e-16 ***
## fuelTypecng             1.78e-05 ***
## fuelTypediesel           < 2e-16 ***
## fuelTypeelectro         0.007296 ** 
## fuelTypehybrid          0.124290    
## fuelTypelpg             3.77e-13 ***
## monthOfRegistrationJan   < 2e-16 ***
## monthOfRegistrationFeb   < 2e-16 ***
## monthOfRegistrationMar   < 2e-16 ***
## monthOfRegistrationApr   < 2e-16 ***
## monthOfRegistrationMay   < 2e-16 ***
## monthOfRegistrationJun   < 2e-16 ***
## monthOfRegistrationJul   < 2e-16 ***
## monthOfRegistrationAug   < 2e-16 ***
## monthOfRegistrationSep   < 2e-16 ***
## monthOfRegistrationOct   < 2e-16 ***
## monthOfRegistrationNov   < 2e-16 ***
## monthOfRegistrationDec   < 2e-16 ***
## brandaudi                < 2e-16 ***
## brandbmw                3.40e-13 ***
## brandchevrolet          0.000103 ***
## brandchrysler           0.001038 ** 
## brandcitroen             < 2e-16 ***
## branddacia              0.948156    
## branddaewoo             2.24e-16 ***
## branddaihatsu           6.98e-05 ***
## brandfiat               2.01e-13 ***
## brandford               1.73e-05 ***
## brandhonda              0.124218    
## brandhyundai             < 2e-16 ***
## brandjaguar              < 2e-16 ***
## brandjeep               0.017459 *  
## brandkia                6.17e-15 ***
## brandlada               0.075326 .  
## brandlancia             0.092263 .  
## brandland_rover         9.46e-06 ***
## brandmazda              7.87e-12 ***
## brandmercedes_benz       < 2e-16 ***
## brandmini               3.85e-08 ***
## brandmitsubishi         2.89e-09 ***
## brandnissan             0.332393    
## brandopel               0.932896    
## brandpeugeot            2.01e-09 ***
## brandporsche             < 2e-16 ***
## brandrenault            0.179738    
## brandrover               < 2e-16 ***
## brandsaab               3.32e-05 ***
## brandseat               4.15e-10 ***
## brandskoda              2.94e-11 ***
## brandsmart              0.003434 ** 
## brandsonstige_autos      < 2e-16 ***
## brandsubaru             0.038187 *  
## brandsuzuki             3.52e-12 ***
## brandtoyota             0.579836    
## brandtrabant            2.61e-08 ***
## brandvolkswagen         1.94e-06 ***
## brandvolvo              0.012723 *  
## model1_reihe            0.823264    
## model100                0.977381    
## model145                0.004663 ** 
## model147                0.018846 *  
## model156                6.98e-05 ***
## model159                8.78e-09 ***
## model1er                8.94e-10 ***
## model2_reihe            5.37e-16 ***
## model200                1.14e-06 ***
## model3_reihe            1.35e-15 ***
## model300c                < 2e-16 ***
## model3er                0.036441 *  
## model4_reihe            2.55e-07 ***
## model5_reihe             < 2e-16 ***
## model500                 < 2e-16 ***
## model5er                2.45e-11 ***
## model6_reihe             < 2e-16 ***
## model601                0.029874 *  
## model6er                 < 2e-16 ***
## model7er                 < 2e-16 ***
## model80                  < 2e-16 ***
## model850                0.010614 *  
## model90                 0.690050    
## model900                1.38e-06 ***
## model9000               8.25e-06 ***
## model911                 < 2e-16 ***
## modela_klasse            < 2e-16 ***
## modela1                 0.364038    
## modela2                 0.003707 ** 
## modela3                 0.545067    
## modela4                 0.163934    
## modela5                 8.18e-12 ***
## modela6                 0.048027 *  
## modela8                 1.86e-06 ***
## modelaccord             1.07e-07 ***
## modelagila              0.029995 *  
## modelalhambra            < 2e-16 ***
## modelalmera             0.003666 ** 
## modelaltea               < 2e-16 ***
## modelamarok             0.083977 .  
## modelother               < 2e-16 ***
## modelantara             0.008432 ** 
## modelarosa              0.000141 ***
## modelastra              0.213769    
## modelauris              0.000256 ***
## modelavensis            1.31e-10 ***
## modelaveo               0.225218    
## modelaygo               0.755979    
## modelb_klasse           0.607107    
## modelb_max              0.010579 *  
## modelbeetle             0.000591 ***
## modelberlingo            < 2e-16 ***
## modelbora               0.914630    
## modelboxster            2.40e-06 ***
## modelbravo              0.062323 .  
## modelc_klasse           2.81e-15 ***
## modelc_max               < 2e-16 ***
## modelc_reihe            0.039039 *  
## modelc1                 1.10e-12 ***
## modelc2                 5.70e-09 ***
## modelc3                  < 2e-16 ***
## modelc4                  < 2e-16 ***
## modelc5                  < 2e-16 ***
## modelcaddy              0.004797 ** 
## modelcalibra            0.569752    
## modelcaptiva            0.225123    
## modelcarisma            0.088873 .  
## modelcarnival           0.001045 ** 
## modelcayenne            0.003238 ** 
## modelcc                 7.62e-08 ***
## modelceed               2.78e-16 ***
## modelcharade            0.867564    
## modelcherokee           0.007563 ** 
## modelcitigo             0.000319 ***
## modelcivic              0.004025 ** 
## modelcl                 5.78e-06 ***
## modelclio               0.143350    
## modelclk                4.40e-05 ***
## modelclubman            0.900196    
## modelcolt               0.114023    
## modelcombo              0.278565    
## modelcooper             0.996277    
## modelcordoba            0.033238 *  
## modelcorolla            0.051004 .  
## modelcorsa              4.19e-12 ***
## modelcr_reihe           1.80e-10 ***
## modelcroma              0.006792 ** 
## modelcrossfire          7.93e-06 ***
## modelcuore              0.893138    
## modelcx_reihe            < 2e-16 ***
## modeldefender           0.269289    
## modeldelta              7.44e-05 ***
## modeldiscovery          0.606146    
## modeldoblo              2.94e-08 ***
## modelducato              < 2e-16 ***
## modelduster             0.835761    
## modele_klasse           0.253604    
## modelelefantino         0.011718 *  
## modeleos                0.404149    
## modelescort              < 2e-16 ***
## modelespace             0.147142    
## modelexeo               1.66e-14 ***
## modelfabia               < 2e-16 ***
## modelfiesta             0.094970 .  
## modelfocus              1.13e-10 ***
## modelforester           0.456869    
## modelforfour            3.34e-05 ***
## modelfortwo             0.401453    
## modelfox                4.41e-05 ***
## modelfreelander         0.010995 *  
## modelfusion             2.63e-05 ***
## modelg_klasse            < 2e-16 ***
## modelgalant             0.196119    
## modelgalaxy              < 2e-16 ***
## modelgetz               5.33e-09 ***
## modelgl                 9.83e-05 ***
## modelglk                0.217524    
## modelgolf               0.956423    
## modelgrand              9.77e-12 ***
## modeli_reihe             < 2e-16 ***
## modeli3                 0.223543    
## modelibiza              4.34e-15 ***
## modelimpreza            0.407980    
## modelinsignia            < 2e-16 ***
## modeljazz               9.26e-06 ***
## modeljetta              0.001927 ** 
## modeljimny              2.98e-13 ***
## modeljuke               6.44e-05 ***
## modeljusty              9.98e-08 ***
## modelka                  < 2e-16 ***
## modelkadett             1.17e-15 ***
## modelkaefer              < 2e-16 ***
## modelkalina             0.057993 .  
## modelkalos              0.000160 ***
## modelkangoo             0.225056    
## modelkappa              0.208943    
## modelkuga                < 2e-16 ***
## modellaguna             2.64e-06 ***
## modellancer             5.62e-10 ***
## modellanos              0.460452    
## modellegacy             0.011813 *  
## modelleon                < 2e-16 ***
## modellodgy              0.984535    
## modellogan              0.379334    
## modellupo                < 2e-16 ***
## modellybra              0.004004 ** 
## modelm_klasse           0.071775 .  
## modelm_reihe             < 2e-16 ***
## modelmateria            0.000648 ***
## modelmatiz              1.84e-07 ***
## modelmegane             0.010956 *  
## modelmeriva             3.77e-05 ***
## modelmicra              0.000533 ***
## modelmii                0.000898 ***
## modelmodus              0.090927 .  
## modelmondeo             0.495981    
## modelmove               0.929537    
## modelmusa               0.172108    
## modelmustang             < 2e-16 ***
## modelmx_reihe            < 2e-16 ***
## modelnavara             5.68e-09 ***
## modelniva               0.006159 ** 
## modelnote               0.030919 *  
## modelnubira             0.017862 *  
## modeloctavia             < 2e-16 ***
## modelomega              7.97e-11 ***
## modelone                0.533238    
## modeloutlander          6.75e-15 ***
## modelpajero              < 2e-16 ***
## modelpanda              1.29e-09 ***
## modelpassat             0.519762    
## modelphaeton            3.31e-08 ***
## modelpicanto            1.35e-10 ***
## modelpolo                < 2e-16 ***
## modelprimera            0.019723 *  
## modelptcruiser          0.002925 ** 
## modelpunto              0.017568 *  
## modelq3                 0.665159    
## modelq5                 2.32e-10 ***
## modelq7                 2.91e-12 ***
## modelqashqai             < 2e-16 ***
## modelr19                0.004726 ** 
## modelrange_rover        0.981199    
## modelrange_rover_evoque 0.569363    
## modelrange_rover_sport  0.913393    
## modelrangerover         2.24e-07 ***
## modelrav                9.97e-09 ***
## modelrio                1.36e-07 ***
## modelroadster           0.016319 *  
## modelroomster            < 2e-16 ***
## modelrx_reihe           1.99e-12 ***
## models_klasse            < 2e-16 ***
## models_max               < 2e-16 ***
## models_type             0.025198 *  
## models60                0.001534 ** 
## modelsandero            0.912597    
## modelsanta               < 2e-16 ***
## modelscenic             0.152951    
## modelscirocco           1.05e-09 ***
## modelseicento           0.002273 ** 
## modelserie_2            0.001522 ** 
## modelsharan             0.000202 ***
## modelsignum             3.20e-06 ***
## modelsirion             0.001791 ** 
## modelsl                  < 2e-16 ***
## modelslk                0.285044    
## modelsorento             < 2e-16 ***
## modelspark              6.86e-05 ***
## modelspider              < 2e-16 ***
## modelsportage            < 2e-16 ***
## modelsprinter           0.300819    
## modelstilo              0.001392 ** 
## modelsuperb              < 2e-16 ***
## modelswift              3.36e-11 ***
## modelterios             0.000270 ***
## modeltigra              3.16e-06 ***
## modeltiguan             6.93e-09 ***
## modeltoledo             0.000348 ***
## modeltouareg             < 2e-16 ***
## modeltouran              < 2e-16 ***
## modeltransit             < 2e-16 ***
## modeltransporter         < 2e-16 ***
## modeltt                 5.15e-10 ***
## modeltucson              < 2e-16 ***
## modeltwingo             5.93e-13 ***
## modelup                 0.010833 *  
## modelv_klasse           0.758690    
## modelv40                2.96e-08 ***
## modelv50                0.000545 ***
## modelv60                0.003581 ** 
## modelv70                0.038331 *  
## modelvectra             2.34e-14 ***
## modelverso              9.44e-13 ***
## modelviano              2.47e-07 ***
## modelvito               0.014402 *  
## modelvivaro             2.20e-11 ***
## modelvoyager            0.159998    
## modelwrangler           3.33e-10 ***
## modelx_reihe             < 2e-16 ***
## modelx_trail            5.92e-07 ***
## modelx_type             8.46e-11 ***
## modelxc_reihe           5.06e-10 ***
## modelyaris              0.000618 ***
## modelyeti               1.21e-12 ***
## modelypsilon            0.620035    
## modelz_reihe            1.86e-15 ***
## modelzafira             2.16e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7293 on 95316 degrees of freedom
## Multiple R-squared:  0.6622, Adjusted R-squared:  0.6611 
## F-statistic: 585.7 on 319 and 95316 DF,  p-value: < 2.2e-16

The model formula is not that long, but due to the many dummies created from the categorical variables the number of estimates is very large. What is satisfying, though, is that our model includes mostly significant variables - only some particular dummies are not statistically different from zero. The R-squared statistic is also pretty high - we have managed to explain about 66.2% of the logprice variance.

#Examine the error of the model
autos_predicted <- predict(autos_lm1)

ggplot(data.frame(error = autos_train$logprice - autos_predicted),
       aes(x = error)) + geom_histogram(fill = "blue", bins = 100) + theme_bw()

The plot above shows the distribution of the model’s residuals. The errors seem to be normally distributed around zero. There are some outliers with values around -7, but there are not too many of them. The largest part of the distribution is highly concentrated around the mean. Our model seems to work pretty well.
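
As an optional additional check (not part of the original workflow), a normal Q-Q plot of the same residuals can be drawn to see how closely they follow the normal distribution:

# optional check: normal Q-Q plot of the model's residuals
ggplot(data.frame(error = autos_train$logprice - autos_predicted),
       aes(sample = error)) +
  stat_qq() +
  stat_qq_line() +
  theme_bw()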

In the next stage we will see if there are any variables that could still be dropped from the model. Due to the large size of the model, we will use two automatic methods - one based on the p-value and the other based on AIC.

require("olsrr")

ols_step_backward_p(autos_lm1, prem = 0.05, progress = FALSE) -> autos_lm1_backward_p
ols_step_backward_aic(autos_lm1, progress =  FALSE) -> autos_lm1_backward_AIC

save(list = c("autos_lm1_backward_p", "autos_lm1_backward_AIC"), file = "autos_backward.RData")
# a list of removed terms from p-value evaluation should be available here
autos_lm1_backward_p$removed
## NULL
# a list of removed terms from AIC evaluation should be available here
autos_lm1_backward_AIC$predictors
## NULL

Both methods indicated that no variables should be dropped. We can say that our linear model is complete. Let’s save it into the final object and move on to other methods.

autos_regression <- autos_lm1

K-nearest neighbours

We will use the KNN method to predict the logarithmic price of a car. First we will set k equal to the square root of n. We will then repeat the estimation with the scale preprocessing option. After that, the k parameter will be tuned using cross-validation with three folds.

require('dplyr')
ctrl_nocv <- trainControl(method = "none")

sroot <- sqrt(nrow(autos_train))
k_value <- data.frame(k = round(sroot))

paste0('The k parameter is equal to: ', round(sroot))

set.seed(111)
autos_train_knn <- 
  train(logprice ~ ., 
        data = autos_train %>% 
          dplyr::select(-price),
        method = "knn",
        trControl = ctrl_nocv,
        tuneGrid = k_value)

autos_train_knn_scaled <-
  train(logprice ~ ., 
        data = autos_train %>% 
          dplyr::select(-price),
        method = "knn",
        trControl = ctrl_nocv,
        tuneGrid = k_value,
        preProcess = c("scale"))

save(list = c("autos_train_knn", "autos_train_knn_scaled"), file = "autos_knn.RData")

Due to complexity issues the cross-validation needed to be done on a subsample of the initial training dataset. We have decided to run it on 50% of the observations from the dataset. That was computable, but still numerically demanding (it took 6 hours to train the two models below, using a machine with 32 GB of RAM and parallel computation on 5 cores).

set.seed(111) # seed for reproducibility

#create a subsample of the training dataset for the crossvalidation
autos_which_train2 <- createDataPartition(autos_train$logprice, p = 0.5,list = FALSE) 

autos_trainSub1 <- autos_train[autos_which_train2,]
autos_trainSub2 <- autos_train[-autos_which_train2,]

save(list = c("autos_trainSub1", "autos_trainSub2"), file = "autos_train_subsample.RData")
require('doParallel')

cl <- makePSOCKcluster(5)
registerDoParallel(cl)


different_k <- data.frame(k = seq(300, 315, 5))
ctrl_cv3 <- trainControl(method = "cv", number = 3)

set.seed(111)
autos_train_knn_tune <- 
  train(logprice ~ ., 
        data = autos_trainSub1 %>% 
          dplyr::select(-price),
        method = "knn",
        trControl = ctrl_cv3,
        tuneGrid = different_k)

autos_train_knn_scaled_tune <-
  train(logprice ~ ., 
        data = autos_trainSub1 %>% 
          dplyr::select(-price),
        method = "knn",
        trControl = ctrl_cv3,
        tuneGrid = different_k,
        # data transformation
        preProcess = c("scale"))

save(list = c("autos_train_knn_tune", "autos_train_knn_scaled_tune"), file = "autos_knn_cv.RData")

stopCluster(cl)

In the next step we will compare the efficiency of four competing KNN models on the train sample. To do so we will use the regressionMetrics wrap-up function made by dr hab. Piotr Wójcik.

# wrap-up function returning several error measures for a numeric prediction
regressionMetrics <- function(real, predicted) {
  MSE <- mean((real - predicted)^2)       # mean squared error
  RMSE <- sqrt(MSE)                       # root mean squared error
  MAE <- mean(abs(real - predicted))      # mean absolute error
  MedAE <- median(abs(real - predicted))  # median absolute error
  MSLE <- mean((log(1 + real) - log(1 + predicted))^2) # mean squared log error
  TSS <- sum((real - mean(real))^2)       # total sum of squares
  RSS <- sum((predicted - real)^2)        # residual sum of squares
  R2 <- 1 - RSS/TSS                       # coefficient of determination
  
  result <- data.frame(MSE, RMSE, MAE, MedAE, MSLE, R2)
  return(result)
}

In the first step we will make predictions from those four models on the full train sample. Surprisingly, that process also took a lot of time for the KNN models, so it was likewise executed using the doParallel package.

cl <- makePSOCKcluster(5)
registerDoParallel(cl)

autos_train_knn_p = predict(autos_train_knn , autos_train)
autos_train_knn_scaled_p = predict(autos_train_knn_scaled, autos_train)
save(list = c("autos_train_knn_p", "autos_train_knn_scaled_p"), file = "autos_prediction_normal.RData")

autos_train_knn_tune_p = predict(autos_train_knn_tune, autos_train)
autos_train_knn_scaled_tune_p = predict(autos_train_knn_scaled_tune, autos_train)
save(list = c("autos_train_knn_tune_p", "autos_train_knn_scaled_tune_p"), file = "autos_prediction_cv.RData")

stopCluster(cl)
autos_train_predictions_knn <- 
  data.frame(autos_train_knn_p,
             autos_train_knn_scaled_p,
             autos_train_knn_tune_p,
             autos_train_knn_scaled_tune_p)

knn_comparison <- sapply(autos_train_predictions_knn,
       function(x) regressionMetrics(real = autos_train$logprice,
                                     predicted = x)) %>% 
  t()

save(list = c("knn_comparison"), file = "autos_prediction_quality.RData")

knn_comparison
##                               MSE       RMSE      MAE       MedAE    
## autos_train_knn_p             0.5669199 0.7529408 0.4953571 0.3564015
## autos_train_knn_scaled_p      0.8609624 0.9278806 0.7058453 0.5944562
## autos_train_knn_tune_p        0.6146339 0.7839859 0.5281713 0.3900608
## autos_train_knn_scaled_tune_p 0.9320884 0.9654472 0.7451484 0.6394451
##                               MSLE       R2       
## autos_train_knn_p             0.01527081 0.638773 
## autos_train_knn_scaled_p      0.01877466 0.4514166
## autos_train_knn_tune_p        0.0159009  0.6083709
## autos_train_knn_scaled_tune_p 0.0195606  0.4060969

As we can see from the table above, it seems that the simplest model is the best one. It has the smallest error values and the highest R^2 statistic. We will save it in a new object and then use it for the final model comparison.

autos_knn <- autos_train_knn

Ridge regression

In the last step we will estimate a ridge regression in two variants - with and without cross-validation used to tune the lambda parameter, which controls the strength of the penalty put on the sum of squared coefficients.

Let’s estimate both models and then compare their efficiency on the training dataset to choose the best one.

require('dplyr')
ctrl_nocv <- trainControl(method = "none")

set.seed(111)
autos_ridge_nocv <- train(logprice ~ ., 
                      data = autos_train %>% 
                        dplyr::select(-price), 
                      method = "glmnet", 
                      trControl = ctrl_nocv)
require("caret")
lambdas <- exp(log(10)*seq(-2, 9, length.out = 200))

ctrl_cv3 <- trainControl(method = "cv", number = 3)

parameters_ridge <- expand.grid(alpha = 0, lambda = lambdas)

set.seed(111)
autos_ridge_cv <- train(logprice ~ ., 
                      data = autos_train %>% 
                        dplyr::select(-price), 
                      method = "glmnet", 
                      tuneGrid = parameters_ridge,
                      trControl = ctrl_cv3)
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info =
## trainInfo, : There were missing values in resampled performance measures.
autos_train_ridge_nocv_p = predict(autos_ridge_nocv, autos_train)
autos_train_ridge_cv_p = predict(autos_ridge_cv, autos_train)
autos_train_predictions_ridge <- 
  data.frame(autos_train_ridge_nocv_p,
             autos_train_ridge_cv_p)

ridge_comparison <- sapply(autos_train_predictions_ridge,
       function(x) regressionMetrics(real = autos_train$logprice,
                                     predicted = x)) %>% 
  t()

ridge_comparison
##                          MSE       RMSE      MAE       MedAE    
## autos_train_ridge_nocv_p 0.5348312 0.7313216 0.4759086 0.3437378
## autos_train_ridge_cv_p   0.544384  0.7378238 0.4846693 0.3535332
##                          MSLE       R2       
## autos_train_ridge_nocv_p 0.01472735 0.6592191
## autos_train_ridge_cv_p   0.0148823  0.6531323

It seems that the standard model, without cross-validation, works best in this case. Once again we will store it for the final analysis.

autos_ridge <- autos_ridge_nocv

Compare accuracy of 3 different methods

To compare the out-of-sample accuracy we will check the results of models from three families on the test dataset. To get to that point we will make forecasts on the new sample and then compare them to the real values.

autos_lm_f = predict(autos_regression, autos_test)
autos_ridge_f = predict(autos_ridge, autos_test)
autos_knn_f = predict(autos_knn, autos_test)

We compare the forecasts with the real values and compute the measures of fit.

autos_forecasts <- 
  data.frame(autos_lm_f, autos_knn_f, autos_ridge_f)

final_comparison <- sapply(autos_forecasts,
       function(x) regressionMetrics(real = autos_test$logprice,
                                     predicted = x)) %>% 
  t()

final_comparison
##               MSE       RMSE      MAE       MedAE     MSLE       R2       
## autos_lm_f    0.5456356 0.7386715 0.4728525 0.3367709 0.01524542 0.6526545
## autos_knn_f   0.5796483 0.7613464 0.4966286 0.355761  0.01594923 0.6310024
## autos_ridge_f 0.550027  0.7416381 0.4767605 0.3423969 0.0153393  0.649859

Summary and conclusions

It turns out that the standard linear regression performs best when forecasting the logarithmic car price on a new dataset. It has the highest R-squared value and gives the lowest errors. One also needs to keep in mind that in our case the linear regression provided even more benefits than the other methods - it not only gives the best predictions and forecasts, but also has a form that is easy to interpret. Our case shows that sometimes the simplest models are the best ones.
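
As a final illustration of how the chosen model could be applied in practice, the short sketch below (not part of the original analysis) produces a price prediction for a single offer; here the first row of the test set stands in for a new listing, and the exponent transforms the prediction back from the log scale.

# illustrative use of the final model on a single "new" offer
new_offer <- autos_test[1, ]
predicted_logprice <- predict(autos_regression, newdata = new_offer)
exp(predicted_logprice) # predicted price on the original scale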

Hit or flop? Predicting the commercial success of the songs

Introduction

Due to the development of non-physical digital audio files and the increasing accessibility of online resources of downloadable music, at the turn of the second and third millennium a small revolution in the music industry started. Nowadays over 60% of the revenues from recorded music in the USA comes from streaming services, such as Spotify, Tidal or AppleMusic, and the upward trend of this sector doesn’t stop5. Streaming platforms enable access to millions of songs for no or little monthly fee. This makes them extremely convenient for music listeners, but also for data analysts. Thanks to the large databases collected by the streaming companies it is possible to analyze the music market in ways previously unattainable, also for songs recorded long before the popularity of streaming platforms. The details about all of the songs, their structure and character are perfect for discovering which songs are destined for success and which ones will fail to reach the top. Can we reveal the secret of commercial success? In this article we use the characteristics of tracks from the beginning of the third millennium to predict whether a song was a hit or not.

Libraries used

library(dplyr)
library(data.table)
library(stringr)
library(ggplot2)
library(stargazer)
library(corrplot)
library(caret)
library(smbinning)
library(bestNormalize)
library(lmtest)
library(verification)
library(knitr)

Dataset

The dataset consists of 5872 observations and 19 features of tracks from the years 2000-2009. The data was fetched with spotipy, a Python module for Spotify’s API, and with the analogous billboard module for Billboard’s API. The billboard module was used to extract the information about the Billboard Hot 100 weekly record chart, published by Billboard magazine. This music chart is based on the radio play, sales and online streaming of songs in the USA and is one of the best indicators of a song’s commercial success in the United States. The dataset was created by Farooq Ansari. More information about the dataset and the author can be found here.

Variables description

spotify <- read.csv('spotify.csv')

glimpse(spotify)
## Observations: 5,872
## Variables: 19
## $ track            <fct> "Lucky Man", "On The Hotline", "Clouds Of Dem...
## $ artist           <fct> Montgomery Gentry, Pretty Ricky, Candlemass, ...
## $ uri              <fct> spotify:track:4GiXBCUF7H6YfNQsnBRIzl, spotify...
## $ danceability     <dbl> 0.578, 0.704, 0.162, 0.188, 0.630, 0.726, 0.3...
## $ energy           <dbl> 0.471, 0.854, 0.836, 0.994, 0.764, 0.837, 0.9...
## $ key              <int> 4, 10, 9, 4, 2, 11, 1, 11, 10, 7, 8, 2, 1, 0,...
## $ loudness         <dbl> -7.270, -5.477, -3.009, -3.745, -4.353, -7.22...
## $ mode             <int> 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 0, ...
## $ speechiness      <dbl> 0.0289, 0.1830, 0.0473, 0.1660, 0.0275, 0.096...
## $ acousticness     <dbl> 0.36800000, 0.01850000, 0.00011100, 0.0000073...
## $ instrumentalness <dbl> 0.00000000, 0.00000000, 0.00457000, 0.0784000...
## $ liveness         <dbl> 0.1590, 0.1480, 0.1740, 0.1920, 0.1250, 0.136...
## $ valence          <dbl> 0.5320, 0.6880, 0.3000, 0.3330, 0.6310, 0.969...
## $ tempo            <dbl> 133.061, 92.988, 86.964, 148.440, 112.098, 13...
## $ duration_ms      <int> 196707, 242587, 338893, 255667, 193760, 19272...
## $ time_signature   <int> 4, 4, 4, 4, 4, 4, 4, 4, 4, 3, 4, 4, 4, 4, 4, ...
## $ chorus_hit       <dbl> 30.88059, 41.51106, 65.32887, 58.59528, 22.62...
## $ sections         <int> 13, 10, 13, 9, 10, 10, 4, 10, 11, 16, 9, 9, 9...
## $ target           <int> 1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 1, 0, ...

The dataset consists of the following variables:

  • track: title of the song, redundant for our analysis.
  • artist: name of the artist or band, also unnecessary.
  • uri: unique identifier that can be used to find the certain track in the Spotify platform.
  • danceability: measures the suitability of a song for dancing based on some other features e.g. tempo, beat strength and regularity.
  • energy: intensity, activity of the track, may be correlated with the loudness variable.
  • key: key of the song expressed with the standard Pitch Class notation. We know the order and even the exact differences between pitches, so the first idea is to treat this variable as ordinal. However, pitch classes represent all pitches under octave equivalence, so in fact we can’t say whether a certain key is higher or lower than another in terms of tone frequency. Having that in mind we could still treat them as an ordered pitch class representation, but another problem is that this is a cyclic order, which means that increasing the pitch class by one step from 11 wraps it back to 0.6 It is like the problem of a circular distribution, but for a discrete variable. There are many different approaches to dealing with circular distributions.7 In the case of this variable it seems that the most reasonable way is to simply leave it as a nominal variable (a short illustration of this cyclic behaviour follows after this list).
  • loudness: averaged loudness in decibels, so in fact it is a feature already expressed in a logarithmic scale.
  • mode: binary variable which represents the modality of a track, where 1 means a major scale and 0 means a minor scale. It is widely believed that songs played in a minor scale are more sad and depressing, so it may be connected with changes in the valence variable.
  • speechiness: measures the presence of spoken words.
  • acousticness: describes the confidence that the track is acoustic.
  • instrumentalness: represents the probability that a track is instrumental; values from 0.5 to 1 are intended to represent tracks with no vocals.
  • liveness: measures the presence of an audience in the recording. Values above 0.8 are intended to represent a high probability that the track was performed live.
  • valence: describes the positiveness of the track.
  • tempo: estimated pace of the track expressed in beats per minute (BPM).
  • duration_ms: duration measured in milliseconds.
  • time_signature: estimated time signature (also known as meter), measures the number of beats in every bar. Although the meter is normally denoted with two numbers, often presented as an irreducible fraction, here we only have single numbers. The lack of the number 2 suggests that this is a simplified, reduced time signature with no regard to note lengths, which isn’t consistent with the rule that a meter can’t be reduced. The variable can still be meaningful, but because of these issues it will be treated as a nominal variable.
  • chorus_hit: estimate of the start of the first chorus in the track, created by the author of the database, expressed in seconds. In fact this variable represents the timestamp of the start of the third section.
  • sections: the number of sections in the track, created by the author of the database.
  • target: binary variable. Value 1 means that the song featured at least once in the Billboard Hot 100 in the years 2000-2009. The author’s additional conditions used for selecting tracks with the target variable equal to 0 were that not only the song but also the artist must not have featured in the Billboard Hot 100 in those years, the song must have belonged to a non-mainstream or avant-garde genre, and it must have been released in the USA. Initially the dataset contains an equal number of observations with target values 0 and 1.
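
The cyclic nature of the key variable can be illustrated with a short sketch (purely illustrative, not part of the modelling). Treating pitch classes as plain numbers suggests that keys 0 and 11 are far apart, while on the pitch-class circle they are direct neighbours, which is exactly why a simple ordinal coding would be misleading.

# illustrative only: naive numeric distance vs. circular distance between pitch classes
naive_dist    <- function(a, b) abs(a - b)
circular_dist <- function(a, b) pmin(abs(a - b), 12 - abs(a - b))

naive_dist(0, 11)    # 11 - looks far apart
circular_dist(0, 11) # 1 - in fact adjacent on the pitch-class circle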

More details about the features stored and shared by Spotify can be found here. Billboard Hot 100 for every week since the first release of the chart on the 4th August 1958 is available here.

Data preparation

Dataset cleaning

summary(spotify)
##         track                artist    
##  Breathe   :   5   Toby Keith   :  27  
##  Angel     :   4   Rascal Flatts:  24  
##  Beautiful :   4   Tim McGraw   :  24  
##  Closer    :   4   Iron Maiden  :  23  
##  Forever   :   4   Kenny Chesney:  23  
##  Girlfriend:   4   George Strait:  22  
##  (Other)   :5847   (Other)      :5729  
##                                    uri        danceability   
##  spotify:track:0t9Jd84JnsV8HRMaQzHUom:   2   Min.   :0.0588  
##  spotify:track:0Uhnzk5zI46IRlQ04LNOtc:   2   1st Qu.:0.4160  
##  spotify:track:1mJ05BN0So26a14iib85aI:   2   Median :0.5560  
##  spotify:track:1qHRFZE8qykNXYZadzmi1m:   2   Mean   :0.5429  
##  spotify:track:2aIdVb8v9KTpEZnftkz2mD:   2   3rd Qu.:0.6810  
##  spotify:track:3f3omU8n47Mqyab5nCaGyT:   2   Max.   :0.9860  
##  (Other)                             :5860                   
##      energy              key            loudness            mode       
##  Min.   :0.000348   Min.   : 0.000   Min.   :-47.327   Min.   :0.0000  
##  1st Qu.:0.567000   1st Qu.: 2.000   1st Qu.: -8.315   1st Qu.:0.0000  
##  Median :0.744000   Median : 5.000   Median : -6.042   Median :1.0000  
##  Mean   :0.694511   Mean   : 5.276   Mean   : -7.449   Mean   :0.6451  
##  3rd Qu.:0.885000   3rd Qu.: 8.000   3rd Qu.: -4.562   3rd Qu.:1.0000  
##  Max.   :0.999000   Max.   :11.000   Max.   :  1.137   Max.   :1.0000  
##                                                                        
##   speechiness       acousticness      instrumentalness       liveness     
##  Min.   :0.02240   Min.   :0.000000   Min.   :0.0000000   Min.   :0.0193  
##  1st Qu.:0.03617   1st Qu.:0.004553   1st Qu.:0.0000000   1st Qu.:0.0937  
##  Median :0.05270   Median :0.060300   Median :0.0000218   Median :0.1310  
##  Mean   :0.09236   Mean   :0.214374   Mean   :0.1509172   Mean   :0.1961  
##  3rd Qu.:0.10700   3rd Qu.:0.312000   3rd Qu.:0.0472750   3rd Qu.:0.2630  
##  Max.   :0.95000   Max.   :0.996000   Max.   :0.9980000   Max.   :0.9870  
##                                                                           
##     valence           tempo         duration_ms      time_signature 
##  Min.   :0.0000   Min.   : 46.76   Min.   :  15920   Min.   :0.000  
##  1st Qu.:0.2780   1st Qu.: 96.98   1st Qu.: 206813   1st Qu.:4.000  
##  Median :0.4860   Median :120.00   Median : 238006   Median :4.000  
##  Mean   :0.4823   Mean   :121.61   Mean   : 258171   Mean   :3.914  
##  3rd Qu.:0.6870   3rd Qu.:141.50   3rd Qu.: 279160   3rd Qu.:4.000  
##  Max.   :0.9820   Max.   :213.23   Max.   :4170227   Max.   :5.000  
##                                                                     
##    chorus_hit        sections          target   
##  Min.   :  0.00   Min.   :  1.00   Min.   :0.0  
##  1st Qu.: 27.51   1st Qu.:  9.00   1st Qu.:0.0  
##  Median : 36.04   Median : 10.00   Median :0.5  
##  Mean   : 40.73   Mean   : 11.06   Mean   :0.5  
##  3rd Qu.: 47.89   3rd Qu.: 12.00   3rd Qu.:1.0  
##  Max.   :262.62   Max.   :169.00   Max.   :1.0  
## 
spotify %>%
  is.na() %>%
  colSums() %>%
  sort()
##            track           artist              uri     danceability 
##                0                0                0                0 
##           energy              key         loudness             mode 
##                0                0                0                0 
##      speechiness     acousticness instrumentalness         liveness 
##                0                0                0                0 
##          valence            tempo      duration_ms   time_signature 
##                0                0                0                0 
##       chorus_hit         sections           target 
##                0                0                0

There are no missing values in the dataset, so we don’t have to worry about that.

table(duplicated(spotify$uri))
## 
## FALSE  TRUE 
##  5855    17

As stated in the variable description, uri is the unique identifier of the song. However, some of the uri values repeat, so this needs to be checked.

setDT(spotify)[, .N, uri] %>% merge(spotify, by = "uri") %>% .[.$N>1]  -> duplicates

kable(head(duplicates[,c(3:13)],10)) 
track artist danceability energy key loudness mode speechiness acousticness instrumentalness liveness
I Don’t Care (feat. Adam Gontier) Apocalyptica 0.311 0.744 2 -4.615 0 0.0436 0.00998 0.0000000 0.0799
I Don’t Care Apocalyptica Featuring Adam Gontier 0.311 0.744 2 -4.615 0 0.0436 0.00998 0.0000000 0.0799
Hey Ma Cam’ron 0.791 0.666 8 -5.484 1 0.3410 0.02560 0.0000063 0.0796
Hey Ma Cam’Ron Featuring Juelz Santana, Freekey Zekey & Toya 0.791 0.666 8 -5.484 1 0.3410 0.02560 0.0000063 0.0796
All I Want To Do Sugarland 0.709 0.725 4 -4.999 1 0.0556 0.20800 0.0000454 0.1580
Want To Sugarland 0.709 0.725 4 -4.999 1 0.0556 0.20800 0.0000454 0.1580
Give It Away George Strait 0.674 0.646 5 -5.876 1 0.0270 0.48600 0.0000837 0.2620
It Just Comes Natural George Strait 0.674 0.646 5 -5.876 1 0.0270 0.48600 0.0000837 0.2620
Buy U A Drank (Shawty Snappin’) T-Pain Featuring Yung Joc 0.451 0.550 1 -8.137 1 0.2620 0.01080 0.0000000 0.0737
Shawty Plies Featuring T-Pain 0.451 0.550 1 -8.137 1 0.2620 0.01080 0.0000000 0.0737

To keep the output clear only selected variables are presented, but it looks like these doubled observations may be the same songs, despite some slight differences in the track or artist names. If the rest of the features are the same, we can simply remove the duplicates. Let’s check whether any observations differ by more than that, and in which columns.

# compare each adjacent pair of duplicated rows (merge() sorted them by uri)
# and report which columns, if any, differ within the pair
for (i in seq(1, nrow(duplicates) - 1, 2)) {
  if(!identical(duplicates[i, 5:20], duplicates[i+1, 5:20])) {
    print(paste("observations:", i, "and", i+1))
    for (j in 5:ncol(duplicates)){
      if(duplicates[i, ..j] != duplicates[i + 1, ..j]){
        print(paste("difference in: ", colnames(duplicates)[j]))
      }
    }
  }
}
## [1] "observations: 1 and 2"
## [1] "difference in:  target"
## [1] "observations: 3 and 4"
## [1] "difference in:  target"
## [1] "observations: 15 and 16"
## [1] "difference in:  target"
## [1] "observations: 25 and 26"
## [1] "difference in:  target"
## [1] "observations: 29 and 30"
## [1] "difference in:  target"
## [1] "observations: 31 and 32"
## [1] "difference in:  target"
table(duplicates$target)
## 
##  0  1 
##  6 28

Many duplicates have exactly the same features, but there are 6 pairs of observations that differ. The only difference within those pairs is in the target variable. We can also see that there are only 6 observations with the target value 0.

kable(duplicates[c(1:4, 15:16, 25:26, 29:32), 3:4]) 
track artist
I Don’t Care (feat. Adam Gontier) Apocalyptica
I Don’t Care Apocalyptica Featuring Adam Gontier
Hey Ma Cam’ron
Hey Ma Cam’Ron Featuring Juelz Santana, Freekey Zekey & Toya
South Side Moby
South Side Moby Featuring Gwen Stefani
No One Knows Queens Of The Stone Age
No One Knows Queens of the Stone Age
When Love Takes Over David Guetta Featuring Kelly Rowland
When Love Takes Over (feat. Kelly Rowland) David Guetta
Jai Ho! (You Are My Destiny) A R Rahman & The Pussycat Dolls Featuring Nicole Scherzinger
Jai Ho! (You Are My Destiny) A.R. Rahman

These pairs are obviously the same songs coded twice, and on the official site of the Billboard magazine we can easily verify that each of these songs did enter the chart. All pairs share the same problem and all of them can be found on the Billboard Hot 100 chart. In most cases these are songs recorded as a collaboration of several artists, and this information is sometimes stored in the artist variable, sometimes in the track variable, and sometimes nowhere at all. The absence of some of the participating artists from the artist variable results in a wrong value of the target variable. Therefore we should remove the observation with target equal to 0 whenever a pair contains one, and either observation otherwise. As we don’t need the proper names of tracks and artists, it will be easier to remove the duplicates regardless of the target variable and then change its value to 1 for the remaining halves of the pairs.

spotify <- distinct(spotify, uri, .keep_all = TRUE)
spotify[spotify$uri %in% duplicates$uri,]$target <- 1
setDF(spotify)

In this analysis we focus on songs, as they are the tracks taken into account in the Billboard charts. However, on the Spotify platform there are also tracks containing speeches, stand-up comedy and other audio files that should be removed. We can do it using the speechiness variable. Values between 0.33 and 0.66 represent tracks with many words, like for example rap songs, while values over 0.66 represent tracks which are probably made entirely of words, like speeches or other types that aren’t songs. Based on that definition we will remove the twelve observations which have speechiness over 0.66.

sum(spotify$speechiness > 0.66)
## [1] 12
spotify <- spotify[which(spotify$speechiness <= 0.66), ]
table(spotify$time_signature)
## 
##    0    1    3    4    5 
##    1   52  423 5284   83

There is also only one observation with time_signature equal to 0. No time signature is possible and it is often denoted as 0/0, but with just one observation it wouldn’t be meaningful and we can’t group it with other levels, so this observation will also be removed8.

spotify <- spotify[which(spotify$time_signature!=0), ]

spotify$time_signature <- as.factor(spotify$time_signature)

spotify$time_signature <- droplevels(spotify$time_signature)

summary(spotify$time_signature)
##    1    3    4    5 
##   52  423 5284   83
kable(head(spotify[order(spotify$chorus_hit), c(1:2, 17:18)], 15))
track artist chorus_hit sections
146 Full Metal Swimsuit Agoraphobic Nosebleed 0.00000 2
256 Interzone, Pt. 1: Part I: Why don’t you have a sode pop Enno Poppe 0.00000 2
632 Ten Parts & Labor 0.00000 2
667 Oxygen Corrosion Insect Warfare 0.00000 2
1592 Concrete Waves Parts & Labor 0.00000 2
2123 10,000 Angry Hardcore Kids - Wormwood Album Version Few Left Standing 0.00000 2
2219 Under an Ancient Oak Tears of Mankind 0.00000 2
4571 American Hardcore Weekend Nachos 0.00000 2
4947 Mega Armageddon Death Extented “Live” Electro Hippies 0.00000 2
4958 Assimilated Pollutants Magrudergrind 0.00000 1
5054 Heavy Crowns Half the World 0.00000 2
5124 You’re the New Metalcore Tiefighter 0.00000 2
4179 Suffer Rudimentary Peni 4.98552 3
3798 Inner Metal Cage Blut Aus Nord 7.11360 3
744 Few and Far Between Shutdown 7.67978 3

The last problem is with the chorus_hit variable. There are some observations in which the value of this variable suggests that the chorus starts at the very beginning of the track. Having in mind that this variable denotes the start of the third section, these values seem rather unrealistic. One can see that all those observations have fewer than 3 sections, so in fact these are missing values arising when the song has no third section. These observations must be removed from the data.

spotify <- spotify[which(spotify$sections>3), ]

Variables transformation

ggplot(spotify,
       aes(x = speechiness)) +
  geom_histogram(fill = "green",
                 bins = 100) +
  theme_bw()

ggplot(spotify,
       aes(x = chorus_hit)) +
  geom_histogram(fill = "green",
                 bins = 100) +
  theme_bw()

Continuous variables in this dataset differ greatly in their distributions. Two of them, speechiness and chorus_hit, are clearly right-skewed, so we might consider a logarithmic or another transformation.

Speechiness

speech_yeojohnson <- yeojohnson(spotify$speechiness)
speech_yeojohnson$lambda
## [1] -4.99994

The Yeo-Johnson procedure suggests transforming the speechiness variable with a lambda close to -5. This transformation will be hard to interpret, but the distribution of the transformed variable, although still far from perfect, is likely to be much better for modeling.
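
For reference, for non-negative values and a non-zero lambda the Yeo-Johnson transformation reduces to ((x + 1)^lambda - 1) / lambda. The sketch below applies this formula directly; note that yeojohnson() from bestNormalize by default also standardizes the transformed values, so speechiness_yeo stores a centred and scaled version of this quantity.

# sketch of the Yeo-Johnson formula for non-negative x and lambda != 0
# (for negative x the transformation uses a separate branch, not needed here)
yeo_johnson_pos <- function(x, lambda) ((x + 1)^lambda - 1) / lambda
head(yeo_johnson_pos(spotify$speechiness, speech_yeojohnson$lambda))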

spotify$speechiness_yeo <- 
  speech_yeojohnson$x.t

ggplot(spotify,
       aes(x = speechiness_yeo)) +
  geom_histogram(fill = "green",
                 bins = 100) +
  theme_bw()

Chorus_hit

chorus_yeojohnson <- yeojohnson(spotify$chorus_hit)
chorus_yeojohnson$lambda
## [1] -0.4587033

The same procedure repeated for the chorus_hit variable results in a lambda equal to about -0.46. This transformation gives a much more symmetric distribution, so this variable will also be used in its transformed form.

spotify$chorus_yeo <- 
  chorus_yeojohnson$x.t

ggplot(spotify,
       aes(x = chorus_yeo)) +
  geom_histogram(fill = "green",
                 bins = 100) +
  theme_bw()

Binning of duration_ms

summary(spotify$duration_ms)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   54040  207554  238560  259319  279560 4170227
sum(head(spotify[order(spotify$duration_ms), "target"], 20) == "1")
## [1] 0
sum(tail(spotify[order(spotify$duration_ms), "target"], 20) == "1")
## [1] 0

The shortest song in the dataset lasts 54 seconds, while the longest over one hour. There is not a single song with target equal to 1 among the 20 shortest or the 20 longest songs. We can expect that neither the shortest nor the longest songs have the slightest chance to appear on the Billboard chart. The relationship between duration_ms and target is likely to be nonlinear, maybe much closer to quadratic. That’s why we decided to do supervised binning based on Weight of Evidence (WoE). Thanks to that, instead of the original variable we can use a set of three WoE values which represent the predictive power of the groups.

duration_binned <-
  smbinning(df = spotify, 
            y = "target", 
            x = "duration_ms", 
            p = 0.1) 

spotify <- 
  smbinning.gen(spotify,
                duration_binned,
                "duration_woe") 


levels(spotify$duration_woe) <- 
  duration_binned$ivtable$WoE[1:nlevels(spotify$duration_woe)]

spotify$duration_woe <-
  as.numeric(levels(spotify$duration_woe))[spotify$duration_woe] 

table(spotify$duration_woe)
## 
## -1.6969 -1.5174  0.5252 
##     980     614    4200
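
As a sanity check (not part of the original pipeline), each WoE value can be recomputed by hand, assuming smbinning follows the conventional definition WoE = ln(share of target = 1 observations falling into the bin / share of target = 0 observations falling into the bin); the results should be close to the values printed above.

# recompute the WoE of each duration bin from the raw counts
spotify %>%
  group_by(duration_woe) %>%
  summarise(hits = sum(target == 1),
            non_hits = sum(target == 0)) %>%
  mutate(woe_check = log((hits / sum(hits)) / (non_hits / sum(non_hits))))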

At the end we will also convert the rest of the nominal variables to factors and remove the columns which are redundant for the further analysis.

spotify$key <- as.factor(spotify$key)
spotify$mode <- as.factor(spotify$mode)
spotify$target <- as.factor(spotify$target)

spotify$track <- NULL
spotify$artist <- NULL
spotify$uri <- NULL
spotify$speechiness <- NULL
spotify$duration_ms <- NULL
spotify$chorus_hit <- NULL

Features selection

In this part the relations between the variables will be checked. We will begin with the correlation between the continuous variables.

spotify_numeric_vars <- sapply(spotify, is.numeric) %>%  which() %>%  names()

spotify_correlations <-   cor(spotify[,spotify_numeric_vars],
                              use = "pairwise.complete.obs")

corrplot(spotify_correlations)

There are some strong correlations between variables. The strongest relations seem to be among three of them: energy, loudness and acousticness. This is quite reasonable, and we expected from the beginning that the first two might be correlated with each other. The valence variable might also be correlated with danceability. Surprisingly, there is no relation observed between danceability and tempo. We will check the correlations with a cutoff equal to 0.75 to find the variables most correlated with the rest.

findCorrelation(spotify_correlations,
                cutoff = 0.75,
                names = TRUE)
## [1] "loudness"
spotify_correlations[2:4,]
##              danceability     energy   loudness acousticness
## energy         0.07451541  1.0000000  0.7744462   -0.7411570
## loudness       0.29352601  0.7744462  1.0000000   -0.6809118
## acousticness  -0.18719132 -0.7411570 -0.6809118    1.0000000
##              instrumentalness   liveness    valence      tempo   sections
## energy             -0.2318692  0.2071754  0.2901205  0.2182530 -0.1711454
## loudness           -0.4445925  0.1225952  0.3381099  0.1578119 -0.1799497
## acousticness        0.2594117 -0.1030944 -0.1887238 -0.1516302  0.1628709
##              speechiness_yeo  chorus_yeo duration_woe
## energy             0.2242508 -0.03238815    0.1400527
## loudness           0.1524275 -0.07569779    0.2722363
## acousticness      -0.1350651  0.02826504   -0.2062271

The findCorrelation function suggests removing loudness. We can see that the correlation between energy and loudness is the one with a value above 0.75. We will keep that in mind when making the final decision about the variables selected for the analysis. In the next step we will perform the analysis of variance (ANOVA) in order to find out whether our continuous variables differ across the levels of the dependent variable target.

spotify_F_anova <- function(numeric_var) {
  anova_ <- aov(spotify[[numeric_var]] ~ 
                  spotify$target) 
  
  return(summary(anova_)[[1]][1, 4])
}

sapply(spotify_numeric_vars,
       spotify_F_anova) %>% 
  sort(decreasing = TRUE) -> spotify_anova_all

spotify_anova_all
## instrumentalness     danceability     duration_woe         loudness 
##       1668.30039       1534.52400       1211.53978        794.92089 
##          valence     acousticness         sections       chorus_yeo 
##        494.77053        305.00229         74.34824         48.81317 
##           energy         liveness  speechiness_yeo            tempo 
##         41.33762         29.23026         13.75730         12.66193

All variables from this part have rather high values of the F statistic, which means our variables should have a significant effect on the dependent variable. Nonetheless, energy has a much lower value of the F statistic than loudness or acousticness, so it will probably be better to exclude energy, despite the suggestion we obtained before. We will also check whether modality, associated with the mood of the song, indeed has any relation with valence.

aov(spotify$valence ~ spotify$mode) ->
  spotify_anova

summary(spotify_anova)[[1]][1, 4:5]
##              F value Pr(>F)
## spotify$mode   0.339 0.5604

The F statistic is low and we can’t reject the null hypothesis that there is no difference between the groups, so we don’t need to worry about the relation between those two independent variables. It is also possible to investigate the relation between our dependent variable and the other qualitative variables with Cramer’s V.

DescTools::CramerV(spotify$target,
                   spotify$key)
## [1] 0.08031027
DescTools::CramerV(spotify$target,
                   spotify$mode)
## [1] 0.09801108
DescTools::CramerV(spotify$target,
                   spotify$time_signature)
## [1] 0.1752306

Key and mode have small Cramer’s coefficients, close to 0. The coefficient for time_signature is slightly higher, but overall none of them is impressive.

nearZeroVar(spotify, saveMetrics = TRUE)
##                   freqRatio percentUnique zeroVar   nzv
## danceability       1.100000   15.17086641   FALSE FALSE
## energy             1.095238   17.27649292   FALSE FALSE
## key                1.003190    0.20711080   FALSE FALSE
## loudness           1.000000   75.52640663   FALSE FALSE
## mode               1.815355    0.03451847   FALSE FALSE
## acousticness       1.076923   46.78978253   FALSE FALSE
## instrumentalness 218.400000   39.31653435   FALSE FALSE
## liveness           1.033898   20.57300656   FALSE FALSE
## valence            1.058824   19.79634104   FALSE FALSE
## tempo              1.000000   92.99275112   FALSE FALSE
## time_signature    12.552632    0.06903693   FALSE FALSE
## sections           1.094645    0.93199862   FALSE FALSE
## target             1.017409    0.03451847   FALSE FALSE
## speechiness_yeo    1.095238   18.34656541   FALSE FALSE
## chorus_yeo         1.000000   99.63755609   FALSE FALSE
## duration_woe       4.285714    0.05177770   FALSE FALSE

The last factor we will consider in the feature selection part is the variance of the variables. Fortunately, we have no variables left in the dataset with a variance close to zero, so we don’t have to do anything about it. Finally, we exclude energy. Both levels of target occur in an almost identical number of observations, so resampling can be omitted. Nevertheless, we should remember that in reality only a small share of all existing songs ever reaches the Billboard charts and becomes a hit in the way we interpret it here. After all transformations and selections the basic statistics of the continuous variables are as follows:

spotify$energy <- NULL
table(spotify$target)
## 
##    0    1 
## 2872 2922
stargazer(
  spotify[, spotify_numeric_vars[-2]], type = "text", 
  summary.stat = c("min", "p25", "median", "p75", "max", "mean", "sd")
)
## 
## ===========================================================================
## Statistic          Min   Pctl(25) Median  Pctl(75)   Max    Mean   St. Dev.
## ---------------------------------------------------------------------------
## danceability      0.059   0.418    0.557   0.682    0.986   0.544   0.190  
## loudness         -47.327  -8.278  -6.033   -4.553   1.137  -7.417   5.068  
## acousticness      0.000   0.005    0.060   0.309    0.996   0.213   0.295  
## instrumentalness    0       0      0.000    0.05      1     0.149   0.300  
## liveness          0.019   0.094    0.130   0.262    0.987   0.195   0.161  
## valence           0.011   0.281    0.487   0.687    0.982   0.484   0.254  
## tempo            46.755   97.012  120.002 141.633  213.233 121.707  30.115 
## sections            4       9       10       12      169   11.115   5.670  
## speechiness_yeo  -1.051   -0.749  -0.414   0.482    3.241   0.000   1.000  
## chorus_yeo       -3.256   -0.689   0.012   0.680    3.123  -0.000   1.000  
## duration_woe     -1.697   -1.517   0.525   0.525    0.525  -0.067   0.963  
## ---------------------------------------------------------------------------

Modelling

Split into train and test sample

At the beginning we need to divide our data into train and test samples with regard to the equal occurrence of the levels of the dependent variable. The split will be made in a 60/40 proportion. We will use the train sample to fit the models, and the best models will be compared based on their unbiased evaluation on the test sample.

set.seed(2346) # seed for reproducibility

spotify_which_train <- createDataPartition(spotify$target, p = 0.6,list = FALSE) 


spotify_train <- spotify[spotify_which_train,]
spotify_test <- spotify[-spotify_which_train,]

save(list = c("spotify_train", "spotify_test"), file = "spotify_traintest.RData")

Logit/probit models

The first models we will create are the two types of regression dedicated to a dichotomous dependent variable. Both methods are comparable, but they use different link functions. Results are usually very similar unless extreme observations in the data have a relatively big impact on them.
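
As a brief illustration (not part of the estimation itself), the two link functions can be compared directly: the logit model passes the linear predictor through the logistic CDF, while the probit model uses the standard normal CDF, which has slightly thinner tails.

# illustrative comparison of the logistic (logit) and standard normal (probit) CDFs
link_df <- data.frame(xb = seq(-4, 4, by = 0.01))
ggplot(link_df) +
  geom_line(aes(x = xb, y = plogis(xb)), colour = "blue") +
  geom_line(aes(x = xb, y = pnorm(xb)), colour = "red") +
  labs(x = "linear predictor", y = "P(target = 1)") +
  theme_bw()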

spotify_logit <- glm(target ~ ., 
                       family =  binomial(link = "logit"),
                    data = spotify_train)

lrtest(spotify_logit)
## Likelihood ratio test
## 
## Model 1: target ~ danceability + key + loudness + mode + acousticness + 
##     instrumentalness + liveness + valence + tempo + time_signature + 
##     sections + speechiness_yeo + chorus_yeo + duration_woe
## Model 2: target ~ 1
##   #Df  LogLik  Df  Chisq            Pr(>Chisq)    
## 1  27 -1407.3                                     
## 2   1 -2410.6 -26 2006.8 < 0.00000000000000022 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
spotify_logit_fitted <- predict(spotify_logit,
                                type = "response")

table(spotify_train$target,
      ifelse(spotify_logit_fitted > 0.5, 1, 0))
##    
##        0    1
##   0 1289  435
##   1  197 1557
summary(spotify_logit)
## 
## Call:
## glm(formula = target ~ ., family = binomial(link = "logit"), 
##     data = spotify_train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.8077  -0.5398   0.2328   0.6768   3.6683  
## 
## Coefficients:
##                     Estimate  Std. Error z value             Pr(>|z|)    
## (Intercept)      -2.61877017  0.75438783  -3.471             0.000518 ***
## danceability      6.87140325  0.40027952  17.167 < 0.0000000000000002 ***
## key1              0.37237204  0.20510149   1.816             0.069439 .  
## key2             -0.11073593  0.19878182  -0.557             0.577478    
## key3              0.76518067  0.31477112   2.431             0.015061 *  
## key4              0.40216173  0.21691531   1.854             0.063739 .  
## key5              0.35974300  0.22024550   1.633             0.102391    
## key6              0.52476801  0.22534449   2.329             0.019873 *  
## key7              0.08494842  0.19298805   0.440             0.659811    
## key8              0.49382893  0.23006551   2.146             0.031835 *  
## key9             -0.01026590  0.20425963  -0.050             0.959916    
## key10             0.43647410  0.22874730   1.908             0.056378 .  
## key11             0.16236115  0.21414338   0.758             0.448338    
## loudness          0.20449600  0.02211942   9.245 < 0.0000000000000002 ***
## mode1             0.60945705  0.10495003   5.807        0.00000000636 ***
## acousticness     -0.13959151  0.23949628  -0.583             0.559991    
## instrumentalness -6.26531284  0.54152552 -11.570 < 0.0000000000000002 ***
## liveness         -0.76623723  0.30202758  -2.537             0.011181 *  
## valence          -1.18922232  0.24047991  -4.945        0.00000076064 ***
## tempo             0.00297467  0.00162380   1.832             0.066964 .  
## time_signature3   0.21099704  0.70198979   0.301             0.763742    
## time_signature4   0.33514386  0.67175196   0.499             0.617843    
## time_signature5   1.22575990  0.84423716   1.452             0.146525    
## sections          0.00008904  0.01429587   0.006             0.995031    
## speechiness_yeo  -0.13338790  0.04787412  -2.786             0.005333 ** 
## chorus_yeo       -0.09015483  0.04958231  -1.818             0.069020 .  
## duration_woe      0.65979252  0.05512671  11.969 < 0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 4821.3  on 3477  degrees of freedom
## Residual deviance: 2814.5  on 3451  degrees of freedom
## AIC: 2868.5
## 
## Number of Fisher Scoring iterations: 7

Let’s begin with the logit model. Based on the likelihood ratio test, our model is jointly significant. Most of the observations are predicted correctly on the train sample, but not all of them. There are also some insignificant variables, so we will try to remove them all at once and estimate a second, reduced version of the logit model.

spotify_logit_reduced <- glm(target ~ danceability + key + loudness + mode + instrumentalness + liveness + valence
                      + speechiness_yeo + duration_woe, 
                     family =  binomial(link = "logit"),
                     data = spotify_train)

lrtest(spotify_logit, spotify_logit_reduced)
## Likelihood ratio test
## 
## Model 1: target ~ danceability + key + loudness + mode + acousticness + 
##     instrumentalness + liveness + valence + tempo + time_signature + 
##     sections + speechiness_yeo + chorus_yeo + duration_woe
## Model 2: target ~ danceability + key + loudness + mode + instrumentalness + 
##     liveness + valence + speechiness_yeo + duration_woe
##   #Df  LogLik Df  Chisq Pr(>Chisq)
## 1  27 -1407.3                     
## 2  20 -1413.2 -7 11.787     0.1078
spotify_logit_reduced_fitted <- predict(spotify_logit_reduced,
                                type = "response")

table(spotify_train$target,
      ifelse(spotify_logit_reduced_fitted > 0.5, 1, 0))
##    
##        0    1
##   0 1289  435
##   1  201 1553
summary(spotify_logit_reduced)
## 
## Call:
## glm(formula = target ~ danceability + key + loudness + mode + 
##     instrumentalness + liveness + valence + speechiness_yeo + 
##     duration_woe, family = binomial(link = "logit"), data = spotify_train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.7172  -0.5452   0.2376   0.6741   3.6341  
## 
## Coefficients:
##                   Estimate Std. Error z value             Pr(>|z|)    
## (Intercept)      -1.850801   0.273645  -6.764      0.0000000000135 ***
## danceability      6.690324   0.377163  17.739 < 0.0000000000000002 ***
## key1              0.384604   0.204040   1.885               0.0594 .  
## key2             -0.102110   0.197775  -0.516               0.6057    
## key3              0.770114   0.314112   2.452               0.0142 *  
## key4              0.403440   0.216519   1.863               0.0624 .  
## key5              0.365104   0.218615   1.670               0.0949 .  
## key6              0.537323   0.224506   2.393               0.0167 *  
## key7              0.086904   0.191790   0.453               0.6505    
## key8              0.482088   0.229380   2.102               0.0356 *  
## key9              0.003608   0.203650   0.018               0.9859    
## key10             0.444972   0.227633   1.955               0.0506 .  
## key11             0.177770   0.213041   0.834               0.4040    
## loudness          0.211012   0.019580  10.777 < 0.0000000000000002 ***
## mode1             0.619307   0.103998   5.955      0.0000000026013 ***
## instrumentalness -6.238622   0.538220 -11.591 < 0.0000000000000002 ***
## liveness         -0.768980   0.299941  -2.564               0.0104 *  
## valence          -1.097356   0.236321  -4.644      0.0000034254614 ***
## speechiness_yeo  -0.120720   0.047127  -2.562               0.0104 *  
## duration_woe      0.659464   0.053705  12.279 < 0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 4821.3  on 3477  degrees of freedom
## Residual deviance: 2826.3  on 3458  degrees of freedom
## AIC: 2866.3
## 
## Number of Fisher Scoring iterations: 7

A p-value higher than 5% indicates that the variables acousticness, tempo, time_signature, sections and chorus_yeo are jointly insignificant and can be removed from the model. We can also check whether the key variable, for which only some levels are significant, should be kept or excluded.

spotify_logit3 <- glm(target ~ danceability + loudness + mode + instrumentalness + liveness + valence
                      + speechiness_yeo + duration_woe, 
                      family =  binomial(link = "logit"),
                      data = spotify_train)

lrtest(spotify_logit, spotify_logit3)
## Likelihood ratio test
## 
## Model 1: target ~ danceability + key + loudness + mode + acousticness + 
##     instrumentalness + liveness + valence + tempo + time_signature + 
##     sections + speechiness_yeo + chorus_yeo + duration_woe
## Model 2: target ~ danceability + loudness + mode + instrumentalness + 
##     liveness + valence + speechiness_yeo + duration_woe
##   #Df  LogLik  Df  Chisq Pr(>Chisq)   
## 1  27 -1407.3                         
## 2   9 -1425.0 -18 35.518   0.008128 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

This time we reject the null hypothesis, so we cannot reduce the model any further and need to keep key in it. Now we can compare the initial and the reduced logit models. There are some dissimilarities between their predictions, but they are small.

table(ifelse(spotify_logit_fitted > 0.5, 1, 0),
      ifelse(spotify_logit_reduced_fitted > 0.5, 1, 0))
##    
##        0    1
##   0 1458   28
##   1   32 1960
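
As an additional cross-check (not part of the original comparison), the contribution of key alone could also be tested by comparing the two nested reduced specifications directly, since they differ only by this variable:

lrtest(spotify_logit_reduced, spotify_logit3)
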
spotify_probit <- glm(target ~ ., 
                     family =  binomial(link = "probit"),
                     data = spotify_train)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
lrtest(spotify_probit)
## Likelihood ratio test
## 
## Model 1: target ~ danceability + key + loudness + mode + acousticness + 
##     instrumentalness + liveness + valence + tempo + time_signature + 
##     sections + speechiness_yeo + chorus_yeo + duration_woe
## Model 2: target ~ 1
##   #Df  LogLik  Df  Chisq            Pr(>Chisq)    
## 1  27 -1419.0                                     
## 2   1 -2410.6 -26 1983.3 < 0.00000000000000022 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
spotify_probit_fitted <- predict(spotify_probit,
                                type = "response")
table(spotify_train$target,
      ifelse(spotify_probit_fitted > 0.5, 1, 0))
##    
##        0    1
##   0 1282  442
##   1  197 1557
summary(spotify_probit)
## 
## Call:
## glm(formula = target ~ ., family = binomial(link = "probit"), 
##     data = spotify_train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.9771  -0.5861   0.1951   0.6936   4.1143  
## 
## Coefficients:
##                    Estimate Std. Error z value             Pr(>|z|)    
## (Intercept)      -1.5480255  0.4343054  -3.564             0.000365 ***
## danceability      3.9270060  0.2204860  17.811 < 0.0000000000000002 ***
## key1              0.2035270  0.1164755   1.747             0.080571 .  
## key2             -0.0560372  0.1152518  -0.486             0.626815    
## key3              0.5084385  0.1804435   2.818             0.004837 ** 
## key4              0.2319933  0.1245455   1.863             0.062502 .  
## key5              0.2166584  0.1266122   1.711             0.087045 .  
## key6              0.3255703  0.1292885   2.518             0.011797 *  
## key7              0.0605921  0.1113585   0.544             0.586360    
## key8              0.3179609  0.1320795   2.407             0.016069 *  
## key9              0.0297994  0.1176938   0.253             0.800118    
## key10             0.2585331  0.1311275   1.972             0.048653 *  
## key11             0.1064298  0.1232864   0.863             0.387987    
## loudness          0.1193937  0.0123880   9.638 < 0.0000000000000002 ***
## mode1             0.3460934  0.0600691   5.762        0.00000000833 ***
## acousticness     -0.0401308  0.1375350  -0.292             0.770450    
## instrumentalness -3.0212385  0.2268848 -13.316 < 0.0000000000000002 ***
## liveness         -0.4107594  0.1734247  -2.369             0.017860 *  
## valence          -0.6690570  0.1373806  -4.870        0.00000111542 ***
## tempo             0.0015413  0.0009417   1.637             0.101702    
## time_signature3   0.2061709  0.4059382   0.508             0.611532    
## time_signature4   0.2559065  0.3883175   0.659             0.509887    
## time_signature5   0.7640957  0.4788073   1.596             0.110526    
## sections          0.0001089  0.0080648   0.014             0.989222    
## speechiness_yeo  -0.0779418  0.0274283  -2.842             0.004488 ** 
## chorus_yeo       -0.0548367  0.0283778  -1.932             0.053312 .  
## duration_woe      0.3877349  0.0317594  12.209 < 0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 4821.3  on 3477  degrees of freedom
## Residual deviance: 2838.0  on 3451  degrees of freedom
## AIC: 2892
## 
## Number of Fisher Scoring iterations: 7

We perform the same steps for the probit model. The initial model is jointly significant, just as the logit model was, and the same variables are individually insignificant. We will again try to exclude them.

spotify_probit_reduced <- glm(target ~ danceability + key + loudness + mode + instrumentalness + liveness + valence
                             + speechiness_yeo + duration_woe, 
                             family =  binomial(link = "probit"),
                             data = spotify_train)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
lrtest(spotify_probit, spotify_probit_reduced)
## Likelihood ratio test
## 
## Model 1: target ~ danceability + key + loudness + mode + acousticness + 
##     instrumentalness + liveness + valence + tempo + time_signature + 
##     sections + speechiness_yeo + chorus_yeo + duration_woe
## Model 2: target ~ danceability + key + loudness + mode + instrumentalness + 
##     liveness + valence + speechiness_yeo + duration_woe
##   #Df  LogLik Df  Chisq Pr(>Chisq)
## 1  27 -1419.0                     
## 2  20 -1424.5 -7 11.037      0.137
spotify_probit_reduced_fitted <- predict(spotify_probit_reduced,
                                        type = "response")

table(spotify_train$target,
      ifelse(spotify_probit_reduced_fitted > 0.5, 1, 0))
##    
##        0    1
##   0 1277  447
##   1  202 1552
summary(spotify_probit_reduced)
## 
## Call:
## glm(formula = target ~ danceability + key + loudness + mode + 
##     instrumentalness + liveness + valence + speechiness_yeo + 
##     duration_woe, family = binomial(link = "probit"), data = spotify_train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.8564  -0.5911   0.2089   0.6885   4.0342  
## 
## Coefficients:
##                  Estimate Std. Error z value             Pr(>|z|)    
## (Intercept)      -1.07503    0.15681  -6.855     0.00000000000711 ***
## danceability      3.84387    0.20792  18.487 < 0.0000000000000002 ***
## key1              0.20892    0.11604   1.800              0.07181 .  
## key2             -0.05101    0.11487  -0.444              0.65699    
## key3              0.50976    0.18035   2.827              0.00471 ** 
## key4              0.23097    0.12439   1.857              0.06334 .  
## key5              0.21762    0.12590   1.728              0.08391 .  
## key6              0.33257    0.12895   2.579              0.00991 ** 
## key7              0.05952    0.11089   0.537              0.59140    
## key8              0.30927    0.13183   2.346              0.01898 *  
## key9              0.03537    0.11746   0.301              0.76334    
## key10             0.26585    0.13071   2.034              0.04195 *  
## key11             0.11381    0.12282   0.927              0.35411    
## loudness          0.12139    0.01089  11.151 < 0.0000000000000002 ***
## mode1             0.35259    0.05961   5.915     0.00000000332374 ***
## instrumentalness -3.01077    0.22514 -13.373 < 0.0000000000000002 ***
## liveness         -0.41620    0.17255  -2.412              0.01587 *  
## valence          -0.62165    0.13525  -4.596     0.00000429792782 ***
## speechiness_yeo  -0.07265    0.02705  -2.686              0.00724 ** 
## duration_woe      0.38700    0.03095  12.506 < 0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 4821.3  on 3477  degrees of freedom
## Residual deviance: 2849.0  on 3458  degrees of freedom
## AIC: 2889
## 
## Number of Fisher Scoring iterations: 7
spotify_probit3 <- glm(target ~ danceability + loudness + mode + instrumentalness + liveness + valence
                      + speechiness_yeo + duration_woe + chorus_yeo, 
                      family =  binomial(link = "probit"),
                      data = spotify_train)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
lrtest(spotify_probit, spotify_probit3)
## Likelihood ratio test
## 
## Model 1: target ~ danceability + key + loudness + mode + acousticness + 
##     instrumentalness + liveness + valence + tempo + time_signature + 
##     sections + speechiness_yeo + chorus_yeo + duration_woe
## Model 2: target ~ danceability + loudness + mode + instrumentalness + 
##     liveness + valence + speechiness_yeo + duration_woe + chorus_yeo
##   #Df  LogLik  Df  Chisq Pr(>Chisq)  
## 1  27 -1419.0                        
## 2  10 -1434.9 -17 31.919     0.0154 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
table(ifelse(spotify_probit_fitted > 0.5, 1, 0),
      ifelse(spotify_probit_reduced_fitted > 0.5, 1, 0))
##    
##        0    1
##   0 1450   29
##   1   29 1970

Again we can exclude the variables mentioned above, and again we have to leave the key variable in the model. In the end our reduced probit model contains the same variables as the reduced logit. The differences between the predictions of the initial and reduced probit models are small but present. We will now check the predictive performance of the two logit and two probit models all at once. To do so, we will use the function summary_binary created by dr hab. Piotr Wójcik, which returns selected statistics based on the confusionMatrix function from the caret package.

summary_binary <- function(predicted_probs,
                           real,
                           cutoff = 0.5,
                           level_positive = "Yes",
                           level_negative = "No") {
  # save the classification table
  ctable <- confusionMatrix(as.factor(ifelse(predicted_probs > cutoff, 
                                             level_positive, 
                                             level_negative)), 
                            real, 
                            level_positive) 
  # extract selected statistics into a vector
  stats <- round(c(ctable$overall[1],
                   ctable$byClass[c(1:4, 7, 11)]),
                 5)
  # and return them as a function result
  return(stats)
}
spotify_logprob_predictions <- 
  data.frame(spotify_logit_fitted,
             spotify_logit_reduced_fitted,
             spotify_probit_fitted,
             spotify_probit_reduced_fitted)

(spotify_logprob_comparison <- sapply(spotify_logprob_predictions,
                                     function(x) summary_binary(real = spotify_train$target,
                                                                predicted = x,
                                                                level_positive = "1",
                                                                level_negative = "0")) %>% 
  t())
##                               Accuracy Sensitivity Specificity
## spotify_logit_fitted           0.81829     0.88769     0.74768
## spotify_logit_reduced_fitted   0.81714     0.88540     0.74768
## spotify_probit_fitted          0.81627     0.88769     0.74362
## spotify_probit_reduced_fitted  0.81340     0.88483     0.74072
##                               Pos Pred Value Neg Pred Value      F1
## spotify_logit_fitted                 0.78163        0.86743 0.83129
## spotify_logit_reduced_fitted         0.78119        0.86510 0.83004
## spotify_probit_fitted                0.77889        0.86680 0.82974
## spotify_probit_reduced_fitted        0.77639        0.86342 0.82707
##                               Balanced Accuracy
## spotify_logit_fitted                    0.81768
## spotify_logit_reduced_fitted            0.81654
## spotify_probit_fitted                   0.81565
## spotify_probit_reduced_fitted           0.81278

It seems that the first logit model, with all variables included, is also the best one. Accuracy, sensitivity and specificity are all fairly high. Surprisingly, in both cases removing the insignificant variables resulted in slightly lower accuracy. The logit models are generally better than their probit equivalents, so we will use the first model, spotify_logit, in the further analysis. These measures were obtained with the default cutoff point of 0.5. We can also use the ROC curve to assess the quality of the model regardless of the selected cutoff point. The Area Under the Curve for our best model equals 0.89, a very promising result which indicates strong predictive power.

roc.plot(ifelse(spotify_train$target == "1", 1, 0),
         spotify_logit_fitted) 
## Warning in roc.plot.default(ifelse(spotify_train$target == "1", 1,
## 0), spotify_logit_fitted): Large amount of unique predictions used as
## thresholds. Consider specifying thresholds.

roc.area(ifelse(spotify_train$target == "1", 1, 0),
         spotify_logit_fitted)
## $A
## [1] 0.8945437
## 
## $n.total
## [1] 3478
## 
## $n.events
## [1] 1754
## 
## $n.noevents
## [1] 1724
## 
## $p.value
## [1] 0
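
As a side illustration (not part of the original analysis), the summary_binary function defined earlier could also be used to see how the chosen cutoff affects these measures, for example by sweeping several values for the full logit model:

cutoffs <- seq(0.3, 0.7, by = 0.1)
cutoff_check <- sapply(cutoffs,
                       function(ct) summary_binary(spotify_logit_fitted,
                                                   real = spotify_train$target,
                                                   cutoff = ct,
                                                   level_positive = "1",
                                                   level_negative = "0"))
colnames(cutoff_check) <- cutoffs
cutoff_check
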

K-nearest neighbors

Now let’s move to other classification methods. KNN is an algorithm which classifies a given observation based on the classes of the observations closest to it. The main aspect to consider when applying K-nearest neighbors is the value of the parameter K, which indicates how many observations are treated as the nearest neighbors. One way to find the optimal value of K is to use cross-validation and try different candidate values of the parameter; here we will use cross-validation with 5 folds. A popular rule of thumb is to start with K equal to the square root of the number of observations, which in our case is almost 59. We include this value in the set of analysed parameters, but we also check many values above and below it.
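
To illustrate the idea, a minimal sketch is shown below: it classifies the first few training songs with the knn function from the class package, using only a handful of numeric features and an arbitrary K of 5. This is an illustration only, separate from the tuned models that follow.

library(class)

# illustrative feature subset (numeric columns only, as class::knn needs numeric input)
knn_demo_features <- c("danceability", "loudness", "valence", "tempo")

# classify the first five songs by a majority vote of their 5 nearest neighbours
knn_demo <- knn(train = spotify_train[, knn_demo_features],
                test  = spotify_train[1:5, knn_demo_features],
                cl    = spotify_train$target,
                k     = 5)
knn_demo
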

ctrl_cv5 <- trainControl(method = "cv",
                         number = 5)

sqrt(nrow(spotify_train))
## [1] 58.97457
k <- 59

different_k <- data.frame(k = seq(1 , k + 40, 2))
set.seed(2346)

spotify_knn <- 
  train(target ~ ., 
        data = spotify_train,
        method = "knn",
        trControl = ctrl_cv5,
        tuneGrid = different_k)

save("spotify_knn", file = "spotify_knn.RData")
spotify_knn
## k-Nearest Neighbors 
## 
## 3478 samples
##   14 predictor
##    2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 2782, 2784, 2782, 2782, 2782 
## Resampling results across tuning parameters:
## 
##   k   Accuracy   Kappa    
##    1  0.6676231  0.3346136
##    3  0.6837225  0.3666163
##    5  0.6935125  0.3859301
##    7  0.6860313  0.3707489
##    9  0.6900485  0.3786158
##   11  0.6865961  0.3715661
##   13  0.6842948  0.3668596
##   15  0.6891873  0.3765265
##   17  0.6851660  0.3684282
##   19  0.6817111  0.3613800
##   21  0.6797078  0.3572630
##   23  0.6866077  0.3710952
##   25  0.6811372  0.3600725
##   27  0.6811355  0.3599656
##   29  0.6825690  0.3628108
##   31  0.6825764  0.3627967
##   33  0.6837217  0.3650323
##   35  0.6822891  0.3620863
##   37  0.6782653  0.3539636
##   39  0.6759639  0.3493009
##   41  0.6690599  0.3354017
##   43  0.6681979  0.3336342
##   45  0.6696363  0.3364898
##   47  0.6658998  0.3289665
##   49  0.6673383  0.3318443
##   51  0.6673383  0.3317852
##   53  0.6693506  0.3358023
##   55  0.6670501  0.3311381
##   57  0.6641740  0.3253642
##   59  0.6621609  0.3212494
##   61  0.6615820  0.3200200
##   63  0.6601444  0.3171009
##   65  0.6607224  0.3182484
##   67  0.6581338  0.3130152
##   69  0.6552585  0.3072100
##   71  0.6518094  0.3002360
##   73  0.6512322  0.2990509
##   75  0.6526674  0.3019391
##   77  0.6541083  0.3048089
##   79  0.6512281  0.2989785
##   81  0.6489284  0.2943343
##   83  0.6483554  0.2932050
##   85  0.6520951  0.3007184
##   87  0.6483545  0.2932178
##   89  0.6486435  0.2937880
##   91  0.6500820  0.2966535
##   93  0.6518045  0.3000943
##   95  0.6486402  0.2937245
##   97  0.6480663  0.2925414
##   99  0.6472051  0.2907704
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 5.
plot(spotify_knn)

It turns out that the best K equals 5. This is a much smaller number than 59, so the square-root rule would have been far from optimal here. Next we will try to improve the results by standardizing the variables. Many of the features are already in the range from 0 to 1, but some variables take values from completely different ranges, and there are also categorical features, so range standardization should bring all variables onto the same scale.
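
As a sketch of what range standardization does (applied here manually to a single variable for illustration), each value is rescaled as (x - min) / (max - min), which is essentially what caret applies when preProcess = c("range") is requested:

rng <- range(spotify_train$loudness)
loudness_rescaled <- (spotify_train$loudness - rng[1]) / (rng[2] - rng[1])
summary(loudness_rescaled)  # values now lie between 0 and 1
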

set.seed(2346)

spotify_knn_range <- 
  train(target ~ ., 
        data = spotify_train,
        method = "knn",
        trControl = ctrl_cv5,
        tuneGrid = different_k,
        preProcess = c("range"))

save("spotify_knn_range", file = "spotify_knn_range.RData")
spotify_knn_range
## k-Nearest Neighbors 
## 
## 3478 samples
##   14 predictor
##    2 classes: '0', '1' 
## 
## Pre-processing: re-scaling to [0, 1] (26) 
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 2782, 2784, 2782, 2782, 2782 
## Resampling results across tuning parameters:
## 
##   k   Accuracy   Kappa    
##    1  0.7429627  0.4856000
##    3  0.7570522  0.5135994
##    5  0.7593552  0.5180646
##    7  0.7702773  0.5399142
##    9  0.7699899  0.5393015
##   11  0.7650999  0.5294653
##   13  0.7676869  0.5346245
##   15  0.7662476  0.5316713
##   17  0.7645235  0.5281784
##   19  0.7596343  0.5183298
##   21  0.7579085  0.5148192
##   23  0.7550333  0.5089716
##   25  0.7518715  0.5026274
##   27  0.7510086  0.5008023
##   29  0.7495702  0.4978936
##   31  0.7489955  0.4967025
##   33  0.7495752  0.4978348
##   35  0.7518724  0.5023810
##   37  0.7466958  0.4919383
##   39  0.7449717  0.4884450
##   41  0.7449709  0.4884374
##   43  0.7449700  0.4884014
##   45  0.7464051  0.4912816
##   47  0.7466925  0.4918469
##   49  0.7435299  0.4854667
##   51  0.7441055  0.4866099
##   53  0.7403690  0.4790955
##   55  0.7397926  0.4779138
##   57  0.7351900  0.4686366
##   59  0.7372031  0.4726529
##   61  0.7369174  0.4720454
##   63  0.7340422  0.4662479
##   65  0.7314560  0.4610278
##   67  0.7323181  0.4627458
##   69  0.7320291  0.4621680
##   71  0.7294420  0.4569532
##   73  0.7311695  0.4604098
##   75  0.7271407  0.4522853
##   77  0.7274297  0.4528513
##   79  0.7265676  0.4511160
##   81  0.7257047  0.4493769
##   83  0.7251317  0.4482214
##   85  0.7251317  0.4482160
##   87  0.7251308  0.4481926
##   89  0.7274322  0.4528174
##   91  0.7268575  0.4516421
##   93  0.7265709  0.4510803
##   95  0.7248451  0.4475984
##   97  0.7251300  0.4481802
##   99  0.7251284  0.4481772
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 7.
plot(spotify_knn_range)

The new optimal parameter is higher, equal to 7. Again the best values are small numbers of nearest neighbors, although this time K equal to 5 would be a noticeably worse option. We can already see that with the range standardization we achieved higher accuracy than before. To compare the two results we will use a slightly different version of the function introduced in the previous part, summary_binary_class, also written by dr hab. Piotr Wójcik, and assess the correctness of the predictions on the training data.

summary_binary_class <- function(predicted_classes,
                                 real,
                                 level_positive = "Yes",
                                 level_negative = "No") {
  # save the classification table
  ctable <- confusionMatrix(as.factor(predicted_classes), 
                            real, 
                            level_positive) 
  # extract selected statistics into a vector
  stats <- round(c(ctable$overall[1],
                   ctable$byClass[c(1:4, 7, 11)]),
                 5)
  # and return them as a function result
  return(stats)
}
spotify_knn_predictions <-
  data.frame(spotify_knn = predict(spotify_knn,
                                   spotify_train),
             spotify_knn_range = predict(spotify_knn_range,
                                   spotify_train)
  )

(spotify_knn_comparison <- sapply(spotify_knn_predictions,
                                 function(x) summary_binary_class(real = spotify_train$target,
                                                                  predicted = x,
                                                                  level_positive = "1",
                                                                  level_negative = "0")) %>%
  t())
##                   Accuracy Sensitivity Specificity Pos Pred Value
## spotify_knn        0.80247     0.87286     0.73086        0.76742
## spotify_knn_range  0.81829     0.89111     0.74420        0.77994
##                   Neg Pred Value      F1 Balanced Accuracy
## spotify_knn              0.84963 0.81675           0.80186
## spotify_knn_range        0.87042 0.83183           0.81765

The difference is rather subtle, but the range standardization indeed improved the results, and this is the model we will keep for the final comparison.

Support Vector Machines

The last algorithm will be SVM. It is a method of finding a hyperplane which divides the observations into groups. We will use three different kernel functions to create three models, each tuned with five-fold cross-validation repeated 2 times. We will start with the most basic, linear kernel and a set of 10 different values of the cost parameter.
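
For reference, the three kernels compared below can be written out for two feature vectors x and y roughly as follows (a sketch based on the kernlab definitions that caret uses for svmLinear, svmPoly and svmRadial; degree, scale and sigma are the parameters tuned later):

# linear kernel: plain dot product, only the cost C is tuned
k_linear <- function(x, y) sum(x * y)

# polynomial kernel: tuned over degree and scale (offset fixed at 1 in kernlab's default)
k_poly <- function(x, y, degree, scale) (scale * sum(x * y) + 1)^degree

# Gaussian radial basis kernel: tuned over sigma
k_radial <- function(x, y, sigma) exp(-sigma * sum((x - y)^2))
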

ctrl_cv5x2 <- trainControl(method = "repeatedcv",
                           number = 5,
                           repeats = 2)

parametersLinear <- data.frame(C = c(0.01, 0.05, 
                                0.1, 0.5, 1, 2, 5, 10, 15, 20))

set.seed(2346)

spotify_svm_lin <- train(target ~ ., 
                           data = spotify_train, 
                           method = "svmLinear",
                           tuneGrid = parametersLinear,
                           trControl = ctrl_cv5x2)

save("spotify_svm_lin", file = "spotify_svm_lin.RData")
spotify_svm_lin
## Support Vector Machines with Linear Kernel 
## 
## 3478 samples
##   14 predictor
##    2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 2 times) 
## Summary of sample sizes: 2782, 2784, 2782, 2782, 2782, 2782, ... 
## Resampling results across tuning parameters:
## 
##   C      Accuracy   Kappa    
##    0.01  0.8021840  0.6037115
##    0.05  0.8116738  0.6228179
##    0.10  0.8133984  0.6262906
##    0.50  0.8162740  0.6320554
##    1.00  0.8161303  0.6317668
##    2.00  0.8161301  0.6317665
##    5.00  0.8161301  0.6317662
##   10.00  0.8164177  0.6323419
##   15.00  0.8161303  0.6317684
##   20.00  0.8161305  0.6317706
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was C = 10.

The best model is selected for the cost parameter equal to 10. Now we can move to more advanced functions and try to find the optimal model for the polynomial kernel. This time we have three different parameters to optimize, so to keep the computation time reasonable we reduced the number of variants for each parameter compared to the previous model.

parametersPoly <- expand.grid(C = c(0.001, 1),
                              degree = 2:5,
                              scale = c(1, 2))

set.seed(2346)

spotify_svm_poly <- train(target ~ ., 
                         data = spotify_train, 
                         method = "svmPoly",
                         tuneGrid = parametersPoly,
                         trControl = ctrl_cv5x2)

save("spotify_svm_poly", file = "spotify_svm_poly.RData")
spotify_svm_poly
## Support Vector Machines with Polynomial Kernel 
## 
## 3478 samples
##   14 predictor
##    2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 2 times) 
## Summary of sample sizes: 2782, 2784, 2782, 2782, 2782, 2782, ... 
## Resampling results across tuning parameters:
## 
##   C      degree  scale  Accuracy   Kappa    
##   0.001  2       1      0.8092288  0.6178963
##   0.001  2       2      0.8141180  0.6277800
##   0.001  3       1      0.7949978  0.5895695
##   0.001  3       2      0.7742970  0.5483236
##   0.001  4       1      0.7535970  0.5069233
##   0.001  4       2      0.7374916  0.4747470
##   0.001  5       1      0.7448283  0.4894262
##   0.001  5       2      0.7412312  0.4822324
##   1.000  2       1      0.8188608  0.6374328
##   1.000  2       2      0.8182863  0.6362952
##   1.000  3       1      0.7444035  0.4887826
##   1.000  3       2      0.7406623  0.4812980
##   1.000  4       1      0.7400789  0.4799228
##   1.000  4       2      0.7354779  0.4706896
##   1.000  5       1      0.7422392  0.4842720
##   1.000  5       2      0.7412312  0.4822324
## 
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were degree = 2, scale = 1 and C = 1.

The optimal parameters among all verified combinations are: degree=2, scale=1 and C=1. Finally, we’ll do the same with the Gaussian radial basis function, which requires two parameters.

parametersRadial <- 
  expand.grid(C = c(5, 10, 15, 20, 25),
              sigma = c(0.001, 0.01, 0.1, 1))

set.seed(2346)

spotify_svm_radial <- train(target ~ ., 
                           data = spotify_train, 
                           method = "svmRadial",
                           tuneGrid = parametersRadial,
                           trControl = ctrl_cv5x2)

save("spotify_svm_radial", file = "spotify_svm_radial.RData")
spotify_svm_radial
## Support Vector Machines with Radial Basis Function Kernel 
## 
## 3478 samples
##   14 predictor
##    2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 2 times) 
## Summary of sample sizes: 2782, 2784, 2782, 2782, 2782, 2782, ... 
## Resampling results across tuning parameters:
## 
##   C   sigma  Accuracy   Kappa    
##    5  0.001  0.8026149  0.6045883
##    5  0.010  0.8192944  0.6380919
##    5  0.100  0.8021892  0.6041310
##    5  1.000  0.7298725  0.4612187
##   10  0.001  0.8076471  0.6147153
##   10  0.010  0.8260509  0.6516700
##   10  0.100  0.7971572  0.5941563
##   10  1.000  0.7297289  0.4609334
##   15  0.001  0.8098050  0.6190576
##   15  0.010  0.8276304  0.6548511
##   15  0.100  0.7892499  0.5783929
##   15  1.000  0.7297289  0.4609334
##   20  0.001  0.8118179  0.6231003
##   20  0.010  0.8263371  0.6522651
##   20  0.100  0.7875241  0.5749633
##   20  1.000  0.7297289  0.4609334
##   25  0.001  0.8132545  0.6259777
##   25  0.010  0.8267704  0.6531526
##   25  0.100  0.7836435  0.5672075
##   25  1.000  0.7297289  0.4609334
## 
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were sigma = 0.01 and C = 15.

The optimal sigma parameter equals 0.01 and C equals 15. Now we will make predictions on the same sample, compare all three SVM models and decide which kernel function probably suits our data best.

spotify_svm_predictions <-
  data.frame(spotify_svm_lin = predict(spotify_svm_lin,
                                       spotify_train),
             spotify_svm_poly = predict(spotify_svm_poly,
                                        spotify_train),
             spotify_svm_radial = predict(spotify_svm_radial,
                                          spotify_train)
  )



(spotify_svm_comparison <- sapply(spotify_svm_predictions,
                                 function(x) summary_binary_class(real = spotify_train$target,
                                                                  predicted = x,
                                                                  level_positive = "1",
                                                                  level_negative = "0")) %>% 
  t())
##                    Accuracy Sensitivity Specificity Pos Pred Value
## spotify_svm_lin     0.81944     0.89453     0.74304        0.77982
## spotify_svm_poly    0.87378     0.90707     0.83991        0.85217
## spotify_svm_radial  0.86429     0.92873     0.79872        0.82439
##                    Neg Pred Value      F1 Balanced Accuracy
## spotify_svm_lin           0.87381 0.83324           0.81878
## spotify_svm_poly          0.89882 0.87876           0.87349
## spotify_svm_radial        0.91678 0.87346           0.86373

In contrast to the previous algorithms, the differences between the respective models are larger, especially between the model with the linear kernel and the rest. Although the sensitivity is quite similar, the specificity is what sets them apart here. The best model is the one with the polynomial kernel function; it is also the most balanced in terms of sensitivity and specificity. Nonetheless, all three models produce good predictions and none of them turned out to be a failure.

Comparison of the forecasts

We have created models with three completely different methods and selected the best model for each algorithm. Now we will compare their forecasts on the test sample.

spotify_final_forecasts <-
  data.frame(spotify_logit = ifelse(predict(spotify_logit,
                                            spotify_test,
                                            type = "response")>0.5,1,0),
             spotify_knn_range = predict(spotify_knn_range,
                                         spotify_test),
             spotify_svm_poly = predict(spotify_svm_poly,
                                        spotify_test)
  )



(spotify_final_comparison <- sapply(spotify_final_forecasts,
                                 function(x) summary_binary_class(real = spotify_test$target,
                                                                  predicted = x,
                                                                  level_positive = "1",
                                                                  level_negative = "0")) %>% 
  t())
##                   Accuracy Sensitivity Specificity Pos Pred Value
## spotify_logit      0.82081     0.87757     0.76307        0.79029
## spotify_knn_range  0.79231     0.87243     0.71080        0.75426
## spotify_svm_poly   0.82168     0.85616     0.78659        0.80321
##                   Neg Pred Value      F1 Balanced Accuracy
## spotify_logit            0.85967 0.83164           0.82032
## spotify_knn_range        0.84560 0.80905           0.79162
## spotify_svm_poly         0.84314 0.82884           0.82137

In spite of the essential differences between the models, their accuracy is very similar. The highest overall accuracy is achieved by the forecasts from the SVM model, but the accuracy of the logit model is only negligibly smaller. Only the results from KNN are noticeably worse, mainly because of the lower specificity. On the other hand, the SVM forecasts have slightly worse sensitivity than the rest, while the logit model is much more interpretable, so the selection of the best model should depend on the details of its further application. Most importantly, all these models perform quite well, so we managed to achieve the goal of this analysis.
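
As a side check (only possible for the logit model, whose forecasts are probabilities rather than classes), the cutoff-free ROC assessment used on the training sample could also be repeated on the test sample:

roc.area(ifelse(spotify_test$target == "1", 1, 0),
         predict(spotify_logit, spotify_test, type = "response"))
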

Conclusion

Predicting which songs are hits and which are not, using only their characteristics, is possible. There is no single method that guarantees success; rather, it is the effort of trying many different approaches that leads to a quality model.