Pitchfork is one of the most widely read online music magazines, best known for its review scores and its ‘Best New Music’ accolade. Each album reviewed by one of its writers is assigned a score from 0 to 10, to one decimal place. An album that scores higher than 8.0 will often receive ‘Best New Music’ status, an indicator that it is of particularly high quality and an award that carries considerable influence over an album’s sales and credibility.
The aim of this project is to explore a dataset of Pitchfork reviews and, ultimately, to build a model that attempts to predict the score an album will receive when reviewed by Pitchfork. Below I note some aims, expectations and points of interest.
What is the average score awarded to an album and how does the distribution of scores look?
If biases exist, where do they lie? In the genre? In the review’s author?
My expectation is that the previous scores given to the artist will provide the best indication of score.
What predictors can be created to improve prediction? How much can prediction accuracy be improved from using just one predictor (based on previous score(s)) to using multiple?
During the analysis I found that the original data needed reshaping in several ways to suit the prediction model. Below I remove some unnecessary variables and show what the data looks like.
URL <- "https://dl.dropboxusercontent.com/s/cqf7cgxh91eoeyn/pitchfork.csv"
if(!file.exists("./pitchfork.csv")){download.file(URL, "./pitchfork.csv")}
library(tidyverse)
p4kdata <- read.csv("./pitchfork.csv")
p4kdata <- p4kdata %>% select(-link, -review)
head(p4kdata)
## artist album
## 1 David Byrne “…The Best Live Show of All Time” — NME EP
## 2 DJ Healer Lost Lovesongs / Lostsongs Vol. 2
## 3 Jorge Velez Roman Birds
## 4 Chandra Transportation EPs
## 5 The Chainsmokers Sick Boy
## 6 Silent Servant Shadows of Death and Desire
## genre score date author role bnm
## 1 Rock 5.5 January 11 2019 Andy Beta Contributor 0
## 2 Electronic 6.2 January 11 2019 Chal Ravens Contributor 0
## 3 Electronic 7.9 January 10 2019 Philip Sherburne Contributing Editor 0
## 4 Rock 7.8 January 10 2019 Andy Beta Contributor 0
## 5 Electronic 3.1 January 9 2019 Larry Fitzmaurice Contributor 0
## 6 Electronic 7.8 January 9 2019 Harley Brown Contributor 0
## label release_year
## 1 Nonesuch 2018
## 2 Planet Uterus 2019
## 3 Self-released 2019
## 4 Telephone Explosion 2018
## 5 Disruptor,Columbia 2018
## 6 Hospital Productions 2018
Here I change data classes and formats, reorder the data by review date (most recent first) and fix some formatting issues.
p4kdata$artist <- as.character(p4kdata$artist)
p4kdata$date <- as.Date(p4kdata$date, format = "%B %d %Y")
# Order reviews from most recent to oldest
p4kdata <- p4kdata[order(desc(p4kdata$date)),]
# Strip spaces, periods, hyphens and the accented e from author names
p4kdata$author <- p4kdata$author %>%
  gsub("\\s+|\\.|-|é", "", .)
# Trim trailing whitespace from role and convert it to a factor
p4kdata$role <- p4kdata$role %>%
  gsub("\\s$", "", .) %>%
  as.factor(.)
# Empty genre tags become NA
p4kdata$genre[p4kdata$genre==""] <- NA
For the label variable, a variety of formatting issues arose when creating further variables that depend on it. Fixing them meant changing punctuation, making use of spaces, and handling specific cases, namely one-letter record label names and albums co-released on two labels.
p4kdata$label <- gsub(" ", "", p4kdata$label)
p4kdata$label <- gsub(",", " ", p4kdata$label)
p4kdata$label <- tolower(p4kdata$label)
for(x in 1:length(p4kdata$label)){
p4kdata$label[x] <- paste(unique(strsplit(p4kdata$label[x], " ")[[1]]), collapse = " ")
}
p4kdata$label[grep("^$", p4kdata$label)] <- "Unspecified"
p4kdata[grep("^k$", p4kdata$label),]$label <-"krecords"
p4kdata[grep("^a$", p4kdata$label),]$label <-"arecords"
p4kdata$label <- gsub("^k ", "krecords ", p4kdata$label)
p4kdata[grep("father/daughter", p4kdata$label),]$label <-"fatherdaughter"
p4kdata[grep("ever/never", p4kdata$label),]$label <-"evernever"
p4kdata[grep("i/am/me", p4kdata$label),]$label <-"iamme"
p4kdata[grep("bar/none", p4kdata$label),]$label <-"barnone"
p4kdata[grep("local331/3", p4kdata$label),]$label <-"local3313"
p4kdata[grep("sciona/v", p4kdata$label),]$label <-"scionav"
p4kdata[grep("^feel$", p4kdata$label),]$label <-"feelrecords"
p4kdata[grep("20/20/20", p4kdata$label),]$label <-"202020"
p4kdata[grep("fe/hardboiled", p4kdata$label),]$label <-"fehardboiled"
p4kdata[grep("anti-", p4kdata$label),]$label <-"antiminus"
p4kdata$label <- gsub("/", " ", p4kdata$label)
p4kdata$label <- gsub("[^0-9a-z[[:space:]]]", "", p4kdata$label)
p4kdata$label <- gsub("self-released ", "", p4kdata$label)
p4kdata$label <- gsub(" self-released", "", p4kdata$label)
Some misspellings of the review authors’ names also had to be corrected. Finally, in the last line of code, artists with no name are assigned the placeholder "NA".
p4kdata$author[p4kdata$author=="DrAndyBeta"] <- "AndyBeta"
p4kdata$author[p4kdata$author=="DrewGaerig"] <- "AndrewGaerig"
p4kdata$author[p4kdata$author=="StephenMDeusner"] <- "StephenDeusner"
p4kdata$author[p4kdata$author=="StephenMDuesner"] <- "StephenDeusner"
p4kdata$author[p4kdata$author=="MarkRichardSan"] <- "MarkRichardson"
p4kdata$author[p4kdata$author=="SavyReyesKulkarni"] <- "SabyReyesKulkarni"
p4kdata$author[p4kdata$author=="PaulAThompson"] <- "PaulThompson"
p4kdata$author[p4kdata$author=="SeanFennessy"] <- "SeanFennessey"
p4kdata$author[p4kdata$author=="AlexLindhart"] <- "AlexLinhardt"
p4kdata$author[p4kdata$author=="CoryDByrom"] <- "CoryByrom"
p4kdata$author[p4kdata$author=="MarcusJMoore"] <- "MarcusMoore"
p4kdata$author[p4kdata$author=="JeremyDLarson"] <- "JeremyLarson"
p4kdata$artist[p4kdata$artist==""] <- "NA"
Next I turn to creating new variables that will, potentially, become predictors in my model. This involves writing functions, using for-loops and so on. Here I create a vector of all the artists’ names.
artistnames <- unique(p4kdata$artist)
First, I will create a variable PrevTotal which has the total number of previous reviews the artist has had.
PrevTotalfn <- function(name){
  # All reviews for this artist, ordered most recent first
  artist <- p4kdata[p4kdata$artist==name,]
  y <- NULL
  for(x in 1:nrow(artist)){
    # Number of the artist's reviews dated before this one
    z <- sum(artist$date < artist$date[x])
    y <- c(y,z)
  }
  cbind(artist, PrevTotal = y)
}
p4kdata2 <- lapply(artistnames, PrevTotalfn)
p4kdata2 <- bind_rows(p4kdata2)
Now a variable PrevBNMTotal which has the total number of BNMs (Best New Music) previously given to the artist.
PrevBNMTotalfn <- function(name){
  artist <- p4kdata2[p4kdata2$artist==name,]
  y <- NULL
  for(x in 1:nrow(artist)){
    a <- artist[artist$date < artist$date[x],]
    b <- sum(a$bnm)
    y <- c(y,b)
  }
  cbind(artist, PrevBNMTotal = y)
}
p4kdata3 <- lapply(artistnames, PrevBNMTotalfn)
p4kdata3 <- bind_rows(p4kdata3)
Now a variable PrevScore which is the score given to the artist’s most recently reviewed album.
PrevScorefn <- function(name){
  artist <- p4kdata3[p4kdata3$artist==name,]
  y <- NULL
  if(nrow(artist)==1){
    y <- NA
  } else {
    for(x in 1:(nrow(artist)-1)){
      a <- artist[x+1,]
      b <- a$score
      y <- c(y,b)
    }
    y <- c(y, NA)
  }
  cbind(artist, PrevScore = y)
}
p4kdata4 <- lapply(artistnames, PrevScorefn)
p4kdata4 <- bind_rows(p4kdata4)
Now a variable PrevScoreAvg which is the mean of all of the artist’s previously received scores.
PrevScoreAvgfn <- function(name){
  artist <- p4kdata4[p4kdata4$artist==name,]
  y <- NULL
  if(nrow(artist)==1){
    y <- NA
  } else {
    for(x in 1:(nrow(artist)-1)){
      a <- artist[(x+1):nrow(artist),]
      b <- mean(a$score)
      y <- c(y,b)
    }
    y <- c(y, NA)
  }
  cbind(artist, PrevScoreAvg = y)
}
p4kdata5 <- lapply(artistnames, PrevScoreAvgfn)
p4kdata5 <- bind_rows(p4kdata5)
Now a variable PrevAuthorSame which indicates whether the previous review for the artist was written by the same author or not.
PrevAuthorSamefn <- function(name){
  artist <- p4kdata5[p4kdata5$artist==name,]
  y <- NULL
  if(nrow(artist)==1){
    y <- FALSE
  } else {
    for(x in 1:(nrow(artist)-1)){
      b <- identical(artist$author[x], artist$author[x+1])
      y <- c(y,b)
    }
    y <- c(y, FALSE)
  }
  cbind(artist, PrevAuthorSame = y)
}
p4kdata6 <- lapply(artistnames, PrevAuthorSamefn)
p4kdata6 <- bind_rows(p4kdata6)
Now a variable PrevAuthorSameTotal which indicates the number of times the same author has previously reviewed this artist.
PrevAuthorSameTotalfn <- function(name){
  artist <- p4kdata6[p4kdata6$artist==name,]
  y <- NULL
  if(nrow(artist)==1){
    y <- 0
  } else {
    for(x in 1:(nrow(artist)-1)){
      b <- sum(artist$author[x]==artist$author[(x+1):nrow(artist)])
      y <- c(y,b)
    }
    y <- c(y, 0)
  }
  cbind(artist, PrevAuthorSameTotal = y)
}
p4kdata7 <- lapply(artistnames, PrevAuthorSameTotalfn)
p4kdata7 <- bind_rows(p4kdata7)
Now a variable TimeSincePrev which shows how many days have elapsed since the artist was last reviewed.
TimeSincePrevfn <- function(name){
  artist <- p4kdata7[p4kdata7$artist==name,]
  y <- NULL
  if(nrow(artist)==1){
    y <- NA
  } else {
    for(x in 1:(nrow(artist)-1)){
      b <- artist$date[x]-artist$date[x+1]
      y <- c(y,b)
    }
    y <- c(y, NA)
  }
  cbind(artist, TimeSincePrev = y)
}
p4kdata8 <- lapply(artistnames, TimeSincePrevfn)
p4kdata8 <- bind_rows(p4kdata8)
Now a variable PrevTwoScoreChange which captures the growth or decline across the artist’s two prior reviews, computed as the ratio of the previous score to the one before it (for example, a 7.7 following a 7.0 gives 1.1, a 10% increase).
PrevTwoScoreChangefn <- function(name){
  artist <- p4kdata8[p4kdata8$artist==name,]
  y <- NULL
  if(nrow(artist) <= 2){
    y <- NA
  } else {
    for(x in 1:(nrow(artist)-2)){
      # Ratio of the previous score to the one before it
      b <- artist$score[x+1]/artist$score[x+2]
      y <- c(y,b)
    }
    y <- c(y, NA, NA)
  }
  cbind(artist, PrevTwoScoreChange = y)
}
p4kdata9 <- lapply(artistnames, PrevTwoScoreChangefn)
p4kdata9 <- bind_rows(p4kdata9)
Now a variable PrevTwoScoreAppreciated which indicates whether the score increased across the two prior reviews (i.e. whether PrevTwoScoreChange is greater than 1).
p4kdata9 <- p4kdata9 %>% mutate(PrevTwoScoreAppreciated = PrevTwoScoreChange > 1)
Now for the genre variables. The data originally contains a single genre variable, but problems arise because, in many cases, more than one genre is assigned to an album. For instance, see the original entry below.
p4kdata[104,]
## artist album genre score date author
## 104 Mariah Carey Caution Pop/R&B 7.5 November 22 2018 Maura Johnston
## role bnm label release_year
## 104 Contributor 0 Epic Records 2018
Thus I create a separate variable for each genre indicating whether this genre, be it alone or alongside other genres, appears in the album’s genre tag.
p4kdata9 <- p4kdata9 %>% mutate(GenreElectronic = grepl("Electronic", p4kdata9$genre),
                                GenreExperimental = grepl("Experimental", p4kdata9$genre),
                                GenreFolk = grepl("Folk", p4kdata9$genre),
                                GenreCountry = grepl("Country", p4kdata9$genre),
                                GenreGlobal = grepl("Global", p4kdata9$genre),
                                GenreRock = grepl("Rock", p4kdata9$genre),
                                GenreMetal = grepl("Metal", p4kdata9$genre),
                                GenreRandB = grepl("R&B", p4kdata9$genre),
                                GenrePop = grepl("Pop", p4kdata9$genre),
                                GenreRap = grepl("Rap", p4kdata9$genre),
                                GenreJazz = grepl("Jazz", p4kdata9$genre))
Now, using for-loops this time, I create a variable LabelAvg giving the average of all scores previously received by the album’s label (or labels).
p4kdata9 <- p4kdata9 %>% add_column(LabelAvg = NA)
for(x in 1:nrow(p4kdata9)){
  # All other reviews up to and including this review's date
  datecutset <- p4kdata9 %>% slice(-x) %>% filter(date <= p4kdata9[x,]$date)
  labelnames <- strsplit(p4kdata9[x,]$label, " ")[[1]]
  a <- NULL
  for(y in 1:length(labelnames)){
    # Reviews sharing any of this album's labels (whole-word match)
    b <- datecutset[grepl(paste0("\\b", labelnames[y], "\\b"), datecutset$label),]
    a <- rbind(a, b)
  }
  a <- a[!duplicated(a),]
  p4kdata9[x,]$LabelAvg <- mean(a$score)
}
Now a variable LabelPrev which shows the score that this label’s previous review received.
p4kdata9 <- p4kdata9 %>% add_column(LabelPrev = NA)
for(x in 1:nrow(p4kdata9)){
  datecutset <- p4kdata9 %>% slice(-x) %>% filter(date <= p4kdata9[x,]$date)
  labelnames <- strsplit(p4kdata9[x,]$label, " ")[[1]]
  a <- NULL
  for(y in 1:length(labelnames)){
    b <- datecutset[grepl(paste0("\\b", labelnames[y], "\\b"), datecutset$label),]
    a <- rbind(a, b)
  }
  a <- a[!duplicated(a),]
  a <- a %>% arrange(desc(date))
  p4kdata9[x,]$LabelPrev <- a[1,]$score
}
Now a variable LabelTotal which shows the total of how many reviews the label has previously received.
p4kdata9 <- p4kdata9 %>% add_column(LabelTotal = NA)
for(x in 1:nrow(p4kdata9)){
  datecutset <- p4kdata9 %>% slice(-x) %>% filter(date <= p4kdata9[x,]$date)
  labelnames <- strsplit(p4kdata9[x,]$label, " ")[[1]]
  a <- NULL
  for(y in 1:length(labelnames)){
    b <- datecutset[grepl(paste0("\\b", labelnames[y], "\\b"), datecutset$label),]
    a <- rbind(a, b)
  }
  a <- a[!duplicated(a),]
  p4kdata9[x,]$LabelTotal <- nrow(a)
}
p4kdata9[p4kdata9$label=="self-released",]$LabelTotal <- 0
I save the result as an R object for ease of future use.
saveRDS(p4kdata9, "./p4knewdata.rds")
Finally, let’s take a look at the same entry we saw earlier with the new variables added.
p4knewdata <- readRDS("./p4knewdata.rds")
p4knewdata[which(p4knewdata$album==p4kdata[104,]$album),]
## artist album genre score date author role
## 1081 Mariah Carey Caution Pop/R&B 7.5 2018-11-22 MauraJohnston Contributor
## bnm label release_year PrevTotal PrevBNMTotal PrevScore PrevScoreAvg
## 1081 0 epicrecords 2018 2 0 7.9 7.8
## PrevAuthorSame PrevAuthorSameTotal TimeSincePrev PrevTwoScoreChange
## 1081 FALSE 0 347 1.025974
## PrevTwoScoreAppreciated GenreElectronic GenreExperimental GenreFolk
## 1081 TRUE FALSE FALSE FALSE
## GenreCountry GenreGlobal GenreRock GenreMetal GenreRandB GenrePop GenreRap
## 1081 FALSE FALSE FALSE FALSE TRUE TRUE FALSE
## GenreJazz LabelAvg LabelPrev LabelTotal
## 1081 FALSE 6.82 7.1 5
To begin with, I perform some final cleaning by removing albums by Various Artists and dropping the variables PrevTwoScoreAppreciated and PrevTwoScoreChange. These two variables contain too many NA values; removing the NAs would make the datasets too small and the models less reliable. I then split the data by review date into a training set, a testing set for measuring model performance during development, and a final test set (finaltest) for a last demonstration of the chosen model. The sets are saved as R objects and reloaded below, followed by the proportional size of the training set.
p4knewdata <- p4knewdata[p4knewdata$artist!="Various Artists",]
finaltest <- p4knewdata %>% filter(date < "2019-02-01", date > "2018-01-01") %>% select(-PrevTwoScoreAppreciated, -PrevTwoScoreChange)
testing <- p4knewdata %>% filter(date <= "2018-01-01", date > "2014-01-01") %>% select(-PrevTwoScoreAppreciated, -PrevTwoScoreChange)
training <- p4knewdata %>% filter(date <= "2014-01-01", date > "2003-01-01") %>% select(-PrevTwoScoreAppreciated, -PrevTwoScoreChange)
training <- na.omit(training)
testing <- na.omit(testing)
finaltest <- na.omit(finaltest)
# Save the splits so they can be reloaded in later sessions
saveRDS(training, "./p4ktraining.rds")
saveRDS(testing, "./p4ktesting.rds")
saveRDS(finaltest, "./p4kfinaltest.rds")
p4knewdata <- readRDS("./p4knewdata.rds")
training <- readRDS("./p4ktraining.rds")
testing <- readRDS("./p4ktesting.rds")
finaltest <- readRDS("./p4kfinaltest.rds")
nrow(training)/(nrow(training)+nrow(testing))
## [1] 0.7138658
Next I confirm some of the features of the data.
mean(p4knewdata$score)
## [1] 7.030189
var(p4knewdata$score)
## [1] 1.591397
quantile(p4knewdata$score, probs=c(0.025,0.975))
## 2.5% 97.5%
## 3.8 9.0
So Pitchfork reviews have a mean score of 7.0 with a variance of about 1.6 around that mean, and 95% of reviews fall between scores of 3.8 and 9.0. Next I look at a plot of the distribution of the scores.
upper <- mean(p4knewdata$score)+2*sd(p4knewdata$score)
lower <- mean(p4knewdata$score)-2*sd(p4knewdata$score)
inbounds <- p4knewdata %>% filter(score > lower, score < upper) %>% nrow
inbounds/nrow(p4knewdata)
## [1] 0.9463487
p4knewdata %>% ggplot(aes(x=score)) +
  geom_histogram(aes(y=..density..), binwidth=.5, color="black", fill="white") +
  geom_density(alpha=.2, fill="#BB1111", bw=0.2) +
  geom_vline(xintercept=lower) +
  geom_vline(xintercept=upper)
The scores are approximately normally distributed and, as confirmed above, around 95% of them fall within two standard deviations of the mean.
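One informal way to check this is with a normal Q-Q plot; a minimal sketch using base R:
# Compare the empirical score quantiles against a theoretical normal distribution
qqnorm(p4knewdata$score, main = "Normal Q-Q Plot of Scores")
qqline(p4knewdata$score)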
Next I plot each individual variable against score in scatter plots. I won’t repeat the code for plots g2 to g12 as it closely resembles that of g1; a sketch of how the full set of panels could be generated programmatically follows the grid.
library(gridExtra)
library(RColorBrewer)
g1 <- p4knewdata %>% ggplot(aes(x=score, y=PrevTotal)) +
  geom_point(position=position_jitter(h=0.1, w=0.1), alpha=.1, color="#BB1111") +
  geom_smooth(method='lm', col="black")
grid.arrange(g1, g2, g3, g4, g5, g6, g7, g8, g9, g10, g11, g12, nrow = 4)
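For reference, a sketch of how the full set of panels could be generated programmatically rather than copying the g1 code for each variable; the vector of predictor names below is assumed from the variables created earlier, and as.numeric() simply lets the logical indicators share the same kind of axis.
# One jittered scatter plot per created predictor, arranged in a grid
plotvars <- c("PrevTotal", "PrevBNMTotal", "PrevScore", "PrevScoreAvg",
              "PrevAuthorSame", "PrevAuthorSameTotal", "TimeSincePrev",
              "PrevTwoScoreChange", "PrevTwoScoreAppreciated",
              "LabelAvg", "LabelPrev", "LabelTotal")
plots <- lapply(plotvars, function(v){
  p4knewdata %>% ggplot(aes(x=score, y=as.numeric(.data[[v]]))) +
    geom_point(position=position_jitter(h=0.1, w=0.1), alpha=.1, color="#BB1111") +
    geom_smooth(method='lm', col="black") +
    labs(y = v)
})
do.call(grid.arrange, c(plots, nrow = 4))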
As expected, the previous score and the average of all previous scores appear to have the strongest correlations with score. The time elapsed since the previous review and the indicator of appreciation over the two prior scores show little or no correlation. The remaining variables show some, albeit weaker, correlation. Stronger correlations would have given more hope of building a highly accurate model later on, but the correlations visible in the plots suggest these variables should not be dismissed.
Next I similarly plot the genre variables against score. Again, I only share the code that produces the first plot, with a corresponding sketch after the grid.
colours <- brewer.pal(n = 12, name = "Paired")
gg1 <- p4knewdata %>% mutate(GenreElectronic = as.numeric(GenreElectronic)) %>%
  ggplot(aes(x=score, y=GenreElectronic)) +
  geom_point(position=position_jitter(h=0.1, w=0.1), alpha=.1, color=colours[1]) +
  geom_smooth(method='lm', col="black")
grid.arrange(gg1, gg2, gg3, gg4, gg5, gg6, gg7, gg8, gg9, gg10, gg11, nrow=4)
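A corresponding sketch for the genre panels, looping over the indicator columns created earlier and colouring each panel from the Paired palette:
genrevars <- grep("^Genre", names(p4knewdata), value = TRUE)
genreplots <- lapply(seq_along(genrevars), function(i){
  p4knewdata %>% ggplot(aes(x=score, y=as.numeric(.data[[genrevars[i]]]))) +
    geom_point(position=position_jitter(h=0.1, w=0.1), alpha=.1, color=colours[i]) +
    geom_smooth(method='lm', col="black") +
    labs(y = genrevars[i])
})
do.call(grid.arrange, c(genreplots, nrow = 4))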
Here there are both negative and positive correlations, with experimental, rock and electronic appearing to be the most strongly correlated genres.
To confirm what the regression lines in the previous plots suggested, I calculate and tabulate the correlation between each variable and score. I first have to remove some infinite values from the PrevTwoScoreChange variable (these arise from division by a previous score of 0). I also save the table as an R object for later use.
p4knewdata2 <- p4knewdata[-which(is.infinite(p4knewdata$PrevTwoScoreChange)),]
interestedvars <- p4knewdata %>% select(-score, -artist, -album, -genre, -date, -author, -role, -bnm, -label, -release_year) %>% names()
corTable <- NULL
for(x in interestedvars){
  y <- p4knewdata2 %>% select(score, all_of(x)) %>% na.omit() %>% cor()
  corTable <- rbind(corTable, y)
}
# Keep only the correlation of each variable with score, then sort
corTable <- corTable[-seq(1, 2*length(interestedvars), by = 2),-2]
corTable <- corTable %>% as.data.frame()
colnames(corTable) <- "Correlation"
corTable <- corTable %>% arrange(desc(Correlation))
saveRDS(corTable, "./corTable.rds")
corTable
## Correlation
## PrevScoreAvg 0.3541280775
## PrevScore 0.3433678955
## LabelAvg 0.1789222358
## LabelPrev 0.1110331396
## PrevBNMTotal 0.0958369433
## GenreExperimental 0.0892275016
## PrevAuthorSameTotal 0.0623383514
## PrevAuthorSame 0.0525909464
## PrevTotal 0.0449685862
## GenreJazz 0.0434258311
## GenreGlobal 0.0401799480
## LabelTotal 0.0399137947
## PrevTwoScoreAppreciated 0.0396887312
## GenreFolk 0.0296663980
## GenreCountry 0.0296663980
## GenreMetal -0.0006806591
## TimeSincePrev -0.0069245876
## PrevTwoScoreChange -0.0239133312
## GenreRandB -0.0249390197
## GenrePop -0.0249390197
## GenreRap -0.0259418147
## GenreElectronic -0.0317382747
## GenreRock -0.0479664216
Along with confirming the earlier observations, the table shows that the label’s average score and the score of the label’s previous review are also reasonably well correlated with score.
Thus it can be concluded that the previous scores the artist received are the best indicators, the label that released the album is the next best, and the genre is the third. The author of the review may also provide some indication.
Some interesting points in terms of genre: experimental music generally receives the most favorable scoring, rock music fares the worst, global, jazz, folk and country receive slight positive favoring, and electronic, rap, pop and R&B receive slightly negative treatment. It should be noted, though, that the genres with positive correlations are, as the previous plots show, mostly the least-reviewed ones. The exception is experimental music, which is reviewed a fair amount and is the standout correlated genre, making it by far the most interesting case. Metal is the only genre with an absolute correlation below 0.02; it is in fact so close to 0 that it would be fair to conclude there is no bias for or against metal music in the scoring.
A final interesting and surprising observation is that the number of ‘Best New Music’ awards previously given to an artist has a correlation with score of less than 0.1. Since ‘Best New Music’ goes to music of an excellent standard, in reviews scoring over 8.0, one might have expected it to correlate with score about as strongly as the previous-score variables. This could be explained by the fact that a large proportion of the data simply has 0 previous BNM awards. In spite of this, the variable is correlated enough to remain of interest.
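That explanation is easy to check; a one-line sketch giving the share of reviews for which the artist had no previous BNM award (the exact figure is not reported here):
mean(p4knewdata$PrevBNMTotal == 0, na.rm = TRUE)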
The goal of the model-fitting stage is to see how much improvement can be made on a simple linear model that uses, as its sole predictor, the average of the artist’s previous review scores, the variable most highly correlated with score. So first I load the data and create the simple model.
library(caret)
p4knewdata <- readRDS("./p4knewdata.rds")
training <- readRDS("./p4ktraining.rds")
testing <- readRDS("./p4ktesting.rds")
finaltest <- readRDS("./p4kfinaltest.rds")
corTable <- readRDS("./corTable.rds")
PREVSCOREAVGONLYfit <- lm(score~PrevScoreAvg, data=training)
PREVSCOREAVGONLYpred <- predict(PREVSCOREAVGONLYfit, testing)
mean(abs(PREVSCOREAVGONLYpred-testing$score))
## [1] 0.7801489
So the simple linear model with a single predictor has an average (mean absolute) error of around 0.78 when tested against the testing dataset.
Next I use the step function to select the best combination of predictors.
corTable <- corTable[-grep("PrevTwoScore", rownames(corTable)),,drop=FALSE]
interestedvars2 <- paste(rownames(corTable), collapse="+")
step(lm(score~PrevScoreAvg+PrevScore+LabelAvg+LabelPrev+PrevBNMTotal+GenreExperimental+PrevAuthorSameTotal+PrevAuthorSame+PrevTotal+GenreJazz+GenreGlobal+LabelTotal+GenreFolk+GenreCountry+GenreMetal+TimeSincePrev+GenreRandB+GenrePop+GenreRap+GenreElectronic+GenreRock,
data=training),
direction = "both", trace=0)
##
## Call:
## lm(formula = score ~ PrevScoreAvg + PrevScore + LabelAvg + PrevBNMTotal +
## GenreExperimental + PrevAuthorSameTotal + LabelTotal + GenreRandB +
## GenreRap + GenreElectronic + GenreRock, data = training)
##
## Coefficients:
## (Intercept) PrevScoreAvg PrevScore
## 3.102904 0.258185 0.118789
## LabelAvg PrevBNMTotal GenreExperimentalTRUE
## 0.185719 0.044228 0.178785
## PrevAuthorSameTotal LabelTotal GenreRandBTRUE
## 0.085085 0.001241 -0.265318
## GenreRapTRUE GenreElectronicTRUE GenreRockTRUE
## -0.274023 -0.246727 -0.245493
I proceed to build the prediction model using the variables selected by the step function.
STEPfit <- lm(formula = score ~ PrevScoreAvg + PrevScore + LabelAvg + PrevBNMTotal +
GenreExperimental + PrevAuthorSameTotal + LabelTotal + GenreRandB +
GenreRap + GenreElectronic + GenreRock, data = training)
STEPpred <- predict(STEPfit, testing)
mean(abs(STEPpred-testing$score))
## [1] 0.7466902
In this case the linear model has an average error of around 0.747 against the testing data, an improvement of about 0.033 over the single-predictor model. It is a small improvement, but, especially once rounding to one decimal place is taken into consideration, it could be meaningful.
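One rough way to judge whether an improvement of this size matters in practice is to compare how often each model lands within a fixed distance of the true score; a small sketch, with the half-point threshold chosen arbitrarily:
# Proportion of test-set predictions within 0.5 of the actual score
mean(abs(PREVSCOREAVGONLYpred - testing$score) <= 0.5)
mean(abs(STEPpred - testing$score) <= 0.5)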
Next I try some more complicated models. First, I set a control object so that cross-validation is performed for each model, and I remove the unnecessary variables so that only the predictors and score remain.
set.seed(211009)
control <- trainControl(method="cv", number=3, verboseIter=F)
training2 <- training %>% select(-artist, -album, -genre, -date, -author, -role, -bnm, -label, -release_year)
First I try a decision tree model (rpart), then a random forest, and then a gradient boosting model (gbm).
TREEfit <- train(score~., method="rpart", data=training2, tuneLength = 50, trControl=control)
TREEpred <- predict(TREEfit, testing)
mean(abs(TREEpred-testing$score))
## [1] 0.7923031
RFfit <- train(score~., method="rf", data=training2, trControl=control)
RFpred <- predict(RFfit, testing)
mean(abs(RFpred-testing$score))
## [1] 0.7647829
GBMfit <- train(score~., method="gbm", data=training2, trControl=control, verbose=FALSE)
GBMpred <- predict(GBMfit, testing)
mean(abs(GBMpred-testing$score))
## [1] 0.7498256
Of these, the boosting model has the lowest average error, around 0.75, similar to that of the linear model built earlier and again an improvement of about 0.03 over the single-predictor model: small, but potentially meaningful. I proceed with both the linear model and the gbm model.
Next I look at some model summaries to see whether any predictors should be removed. First, an analysis of variance on the linear model.
anova(STEPfit)
## Analysis of Variance Table
##
## Response: score
## Df Sum Sq Mean Sq F value Pr(>F)
## PrevScoreAvg 1 1299.3 1299.27 900.1075 < 2.2e-16 ***
## PrevScore 1 59.3 59.31 41.0867 1.571e-10 ***
## LabelAvg 1 86.4 86.44 59.8834 1.179e-14 ***
## PrevBNMTotal 1 12.6 12.59 8.7203 0.0031594 **
## GenreExperimental 1 36.1 36.11 25.0172 5.849e-07 ***
## PrevAuthorSameTotal 1 30.3 30.32 21.0066 4.673e-06 ***
## LabelTotal 1 19.5 19.55 13.5425 0.0002353 ***
## GenreRandB 1 2.3 2.28 1.5786 0.2090153
## GenreRap 1 1.1 1.13 0.7855 0.3755063
## GenreElectronic 1 22.4 22.35 15.4863 8.407e-05 ***
## GenreRock 1 49.1 49.06 33.9892 5.840e-09 ***
## Residuals 5816 8395.1 1.44
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The high p-values suggest that the GenreRandB and GenreRap variables could be removed.
STEPfit2 <- lm(formula = score ~ PrevScoreAvg + PrevScore + LabelAvg + PrevBNMTotal +
GenreExperimental + PrevAuthorSameTotal + LabelTotal + GenreElectronic + GenreRock, data = training)
STEPpred2 <- predict(STEPfit2, testing)
mean(abs(STEPpred2-testing$score))
## [1] 0.7436192
This marginally improves the accuracy of the model against the testing set.
Next I look at the summary of the gbm model.
summary(GBMfit, plotit=FALSE)
## var rel.inf
## PrevScoreAvg PrevScoreAvg 40.71256432
## PrevScore PrevScore 17.63936822
## LabelAvg LabelAvg 14.38092889
## TimeSincePrev TimeSincePrev 8.95865228
## LabelTotal LabelTotal 5.03388899
## LabelPrev LabelPrev 4.80790142
## GenreExperimentalTRUE GenreExperimentalTRUE 1.74720149
## PrevAuthorSameTotal PrevAuthorSameTotal 1.72848705
## GenreElectronicTRUE GenreElectronicTRUE 1.45558921
## PrevTotal PrevTotal 1.44241967
## GenreRockTRUE GenreRockTRUE 0.87626951
## GenreRapTRUE GenreRapTRUE 0.44561368
## GenreFolkTRUE GenreFolkTRUE 0.33900091
## GenreRandBTRUE GenreRandBTRUE 0.22390725
## GenreMetalTRUE GenreMetalTRUE 0.10860397
## PrevBNMTotal PrevBNMTotal 0.09960313
## PrevAuthorSameTRUE PrevAuthorSameTRUE 0.00000000
## GenreCountryTRUE GenreCountryTRUE 0.00000000
## GenreGlobalTRUE GenreGlobalTRUE 0.00000000
## GenrePopTRUE GenrePopTRUE 0.00000000
## GenreJazzTRUE GenreJazzTRUE 0.00000000
Five of the predictors appear to have no influence on the model, so they can be removed. Note that the stochastic nature of gbm fitting means the reported accuracy will differ slightly each time the model is run.
GBMfit2 <- train(score~PrevScoreAvg+PrevScore+LabelAvg+TimeSincePrev+LabelTotal+LabelPrev+
                   GenreExperimental+PrevAuthorSameTotal+GenreElectronic+PrevTotal+GenreRock+
                   GenreRap+GenreFolk+GenreRandB+GenreMetal+PrevBNMTotal,
                 method="gbm", data=training2, trControl=control, verbose=FALSE)
GBMpred2 <- predict(GBMfit2, testing)
mean(abs(GBMpred2-testing$score))
## [1] 0.753722
The linear and gbm models have similar accuracy, so I will proceed with both and run a final test on the previously designated finaltest data.
I perform the final test with the two models and note the accuracy, also including the single-predictor linear model for a final comparison.
PREVSCOREAVGONLYfinalpred <- predict(PREVSCOREAVGONLYfit, finaltest)
mean(abs(PREVSCOREAVGONLYfinalpred-finaltest$score))
## [1] 0.7991907
STEPfinalpred <- predict(STEPfit2, finaltest)
mean(abs(STEPfinalpred-finaltest$score))
## [1] 0.7643821
GBMfinalpred <- predict(GBMfit2, finaltest)
mean(abs(GBMfinalpred-finaltest$score))
## [1] 0.7612905
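For illustration, here is a sketch of how the gbm model could be used to score a hypothetical new release; every feature value below is invented purely for the example.
# Hypothetical album: an artist with two prior reviews on a mid-sized label
newalbum <- data.frame(
  PrevScoreAvg = 7.4, PrevScore = 7.6, PrevTotal = 2, PrevBNMTotal = 0,
  PrevAuthorSame = FALSE, PrevAuthorSameTotal = 0, TimeSincePrev = 400,
  LabelAvg = 7.1, LabelPrev = 7.3, LabelTotal = 20,
  GenreElectronic = FALSE, GenreExperimental = TRUE, GenreFolk = FALSE,
  GenreCountry = FALSE, GenreGlobal = FALSE, GenreRock = FALSE,
  GenreMetal = FALSE, GenreRandB = FALSE, GenrePop = FALSE,
  GenreRap = FALSE, GenreJazz = FALSE
)
predict(GBMfit2, newalbum)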
The gbm model offers the best accuracy on the final test, if only marginally, so I select it as my final prediction model. Here are my concluding statements.
As expected, Pitchfork review scores are best predicted by the previous scores which the artist received.
The label the album was released on serves as a reasonable predictor, while the genre has some influence and the author of the review has a little.
Within the genres, there is positive bias towards any album that has a genre tag that includes ‘experimental’.
A simple model with just one predictor, the average of the artist’s previous review scores, sees an average error of around 0.8.
A boosting model involving a host of predictors can improve average prediction accuracy by between 0.03 and 0.04.
Whilst I may have hoped for a greater improvement from the models that include the predictors I created, the improvement achieved shows that tendencies and potential biases do exist among Pitchfork reviews. A final note is that there are of course outside influences that could affect a score, but, as gathering and selecting such data would be quite a task, I chose to stick to the data afforded by the Pitchfork reviews themselves.