Pitchfork is one of the most widely read online music magazines, best known for its review scores and its ‘Best New Music’ accolade. Each album reviewed by one of its writers is assigned a score from 0 to 10, to one decimal place. An album that scores higher than 8.0 will often receive ‘Best New Music’ status, an indicator that it is of particularly high quality and an award that carries considerable influence over an album’s sales and credibility.
The aim of this project is to explore a dataset of Pitchfork reviews and, ultimately, to build a model that attempts to predict the score an album will receive when reviewed by Pitchfork. Below I note some aims, expectations and points of interest.
What is the average score awarded to an album and how does the distribution of scores look?
If biases exist, where do they lie? In the genre? In the review’s author?
My expectation is that the previous scores given to the artist will provide the best indication of score.
What predictors can be created to improve prediction? How much can prediction accuracy be improved from using just one predictor (based on previous score(s)) to using multiple?
During the analysis I found that the original data needed reshaping in several ways to suit the prediction model. Below I remove some unnecessary variables and show what the data looks like.
URL <- "https://dl.dropboxusercontent.com/s/cqf7cgxh91eoeyn/pitchfork.csv"
if(!file.exists("./pitchfork.csv")){download.file(URL, "./pitchfork.csv")}
library(tidyverse)
p4kdata <- read.csv("./pitchfork.csv")
p4kdata <- p4kdata %>% select(-link, -review)
head(p4kdata)
## artist album
## 1 David Byrne “…The Best Live Show of All Time” — NME EP
## 2 DJ Healer Lost Lovesongs / Lostsongs Vol. 2
## 3 Jorge Velez Roman Birds
## 4 Chandra Transportation EPs
## 5 The Chainsmokers Sick Boy
## 6 Silent Servant Shadows of Death and Desire
## genre score date author role bnm
## 1 Rock 5.5 January 11 2019 Andy Beta Contributor 0
## 2 Electronic 6.2 January 11 2019 Chal Ravens Contributor 0
## 3 Electronic 7.9 January 10 2019 Philip Sherburne Contributing Editor 0
## 4 Rock 7.8 January 10 2019 Andy Beta Contributor 0
## 5 Electronic 3.1 January 9 2019 Larry Fitzmaurice Contributor 0
## 6 Electronic 7.8 January 9 2019 Harley Brown Contributor 0
## label release_year
## 1 Nonesuch 2018
## 2 Planet Uterus 2019
## 3 Self-released 2019
## 4 Telephone Explosion 2018
## 5 Disruptor,Columbia 2018
## 6 Hospital Productions 2018
Here I change data classes and formats, reorder the data by review date (most recent first) and fix some formatting issues.
p4kdata$artist <- as.character(p4kdata$artist)
p4kdata$date <- as.Date(p4kdata$date, format = "%B %d %Y")
# Order reviews from most recent to oldest
p4kdata <- p4kdata[order(desc(p4kdata$date)),]
# Strip spaces, periods, hyphens and the accented e from author names
p4kdata$author <- p4kdata$author %>%
  gsub("\\s+|\\.|-|é", "", .)
# Trim trailing whitespace from role and convert it to a factor
p4kdata$role <- p4kdata$role %>%
  gsub("\\s$", "", .) %>%
  as.factor(.)
# Empty genre tags become NA
p4kdata$genre[p4kdata$genre==""] <- NA
For the label variable, a variety of formatting issues arose when creating further variables that depend on it. Fixing them meant changing punctuation, making use of spaces, and handling specific cases, namely one-letter record label names and albums co-released on two labels.
p4kdata$label <- gsub(" ", "", p4kdata$label)
p4kdata$label <- gsub(",", " ", p4kdata$label)
p4kdata$label <- tolower(p4kdata$label)
for(x in 1:length(p4kdata$label)){
p4kdata$label[x] <- paste(unique(strsplit(p4kdata$label[x], " ")[[1]]), collapse = " ")
}
p4kdata$label[grep("^$", p4kdata$label)] <- "Unspecified"
p4kdata[grep("^k$", p4kdata$label),]$label <-"krecords"
p4kdata[grep("^a$", p4kdata$label),]$label <-"arecords"
p4kdata$label <- gsub("^k ", "krecords ", p4kdata$label)
p4kdata[grep("father/daughter", p4kdata$label),]$label <-"fatherdaughter"
p4kdata[grep("ever/never", p4kdata$label),]$label <-"evernever"
p4kdata[grep("i/am/me", p4kdata$label),]$label <-"iamme"
p4kdata[grep("bar/none", p4kdata$label),]$label <-"barnone"
p4kdata[grep("local331/3", p4kdata$label),]$label <-"local3313"
p4kdata[grep("sciona/v", p4kdata$label),]$label <-"scionav"
p4kdata[grep("^feel$", p4kdata$label),]$label <-"feelrecords"
p4kdata[grep("20/20/20", p4kdata$label),]$label <-"202020"
p4kdata[grep("fe/hardboiled", p4kdata$label),]$label <-"fehardboiled"
p4kdata[grep("anti-", p4kdata$label),]$label <-"antiminus"
p4kdata$label <- gsub("/", " ", p4kdata$label)
p4kdata$label <- gsub("[^0-9a-z[[:space:]]]", "", p4kdata$label)
p4kdata$label <- gsub("self-released ", "", p4kdata$label)
p4kdata$label <- gsub(" self-released", "", p4kdata$label)
Some misspellings of the review authors’ names also had to be corrected. Finally, in the last line of code, artists with no name are assigned the placeholder "NA".
p4kdata$author[p4kdata$author=="DrAndyBeta"] <- "AndyBeta"
p4kdata$author[p4kdata$author=="DrewGaerig"] <- "AndrewGaerig"
p4kdata$author[p4kdata$author=="StephenMDeusner"] <- "StephenDeusner"
p4kdata$author[p4kdata$author=="StephenMDuesner"] <- "StephenDeusner"
p4kdata$author[p4kdata$author=="MarkRichardSan"] <- "MarkRichardson"
p4kdata$author[p4kdata$author=="SavyReyesKulkarni"] <- "SabyReyesKulkarni"
p4kdata$author[p4kdata$author=="PaulAThompson"] <- "PaulThompson"
p4kdata$author[p4kdata$author=="SeanFennessy"] <- "SeanFennessey"
p4kdata$author[p4kdata$author=="AlexLindhart"] <- "AlexLinhardt"
p4kdata$author[p4kdata$author=="CoryDByrom"] <- "CoryByrom"
p4kdata$author[p4kdata$author=="MarcusJMoore"] <- "MarcusMoore"
p4kdata$author[p4kdata$author=="JeremyDLarson"] <- "JeremyLarson"
p4kdata$artist[p4kdata$artist==""] <- "NA"
Next I turn to creating new variables that will, potentially, become predictors in my model. This involves writing functions, using for-loops and so on. Here I create a vector of all the artists’ names.
artistnames <- unique(p4kdata$artist)
First, I will create a variable PrevTotal which has the total number of previous reviews the artist has had.
PrevTotalfn <- function(name){
  # All reviews for this artist, ordered most recent first
  artist <- p4kdata[p4kdata$artist==name,]
  y <- NULL
  for(x in 1:nrow(artist)){
    # Number of the artist's reviews dated before this one
    z <- sum(artist$date < artist$date[x])
    y <- c(y,z)
  }
  cbind(artist, PrevTotal = y)
}
p4kdata2 <- lapply(artistnames, PrevTotalfn)
p4kdata2 <- bind_rows(p4kdata2)
Now a variable PrevBNMTotal which has the total number of BNMs (Best New Music) previously given to the artist.
PrevBNMTotalfn <- function(name){
  artist <- p4kdata2[p4kdata2$artist==name,]
  y <- NULL
  for(x in 1:nrow(artist)){
    a <- artist[artist$date < artist$date[x],]
    b <- sum(a$bnm)
    y <- c(y,b)
  }
  cbind(artist, PrevBNMTotal = y)
}
p4kdata3 <- lapply(artistnames, PrevBNMTotalfn)
p4kdata3 <- bind_rows(p4kdata3)
Now a variable PrevScore which is the score given to the artist’s most recently reviewed album.
PrevScorefn <- function(name){
  artist <- p4kdata3[p4kdata3$artist==name,]
  y <- NULL
  if(nrow(artist)==1){
    y <- NA
  } else {
    for(x in 1:(nrow(artist)-1)){
      a <- artist[x+1,]
      b <- a$score
      y <- c(y,b)
    }
    y <- c(y, NA)
  }
  cbind(artist, PrevScore = y)
}
p4kdata4 <- lapply(artistnames, PrevScorefn)
p4kdata4 <- bind_rows(p4kdata4)
Now a variable PrevScoreAvg which is the mean of all of the artist’s previously received scores.
PrevScoreAvgfn <- function(name){
  artist <- p4kdata4[p4kdata4$artist==name,]
  y <- NULL
  if(nrow(artist)==1){
    y <- NA
  } else {
    for(x in 1:(nrow(artist)-1)){
      a <- artist[(x+1):nrow(artist),]
      b <- mean(a$score)
      y <- c(y,b)
    }
    y <- c(y, NA)
  }
  cbind(artist, PrevScoreAvg = y)
}
p4kdata5 <- lapply(artistnames, PrevScoreAvgfn)
p4kdata5 <- bind_rows(p4kdata5)
Now a variable PrevAuthorSame which indicates whether the previous review for the artist was written by the same author or not.
PrevAuthorSamefn <- function(name){
  artist <- p4kdata5[p4kdata5$artist==name,]
  y <- NULL
  if(nrow(artist)==1){
    y <- FALSE
  } else {
    for(x in 1:(nrow(artist)-1)){
      b <- identical(artist$author[x], artist$author[x+1])
      y <- c(y,b)
    }
    y <- c(y, FALSE)
  }
  cbind(artist, PrevAuthorSame = y)
}
p4kdata6 <- lapply(artistnames, PrevAuthorSamefn)
p4kdata6 <- bind_rows(p4kdata6)
Now a variable PrevAuthorSameTotal which indicates the number of times the same author has previously reviewed this artist.
PrevAuthorSameTotalfn <- function(name){
  artist <- p4kdata6[p4kdata6$artist==name,]
  y <- NULL
  if(nrow(artist)==1){
    y <- 0
  } else {
    for(x in 1:(nrow(artist)-1)){
      b <- sum(artist$author[x]==artist$author[(x+1):nrow(artist)])
      y <- c(y,b)
    }
    y <- c(y, 0)
  }
  cbind(artist, PrevAuthorSameTotal = y)
}
p4kdata7 <- lapply(artistnames, PrevAuthorSameTotalfn)
p4kdata7 <- bind_rows(p4kdata7)
Now a variable TimeSincePrev which shows how many days have elapsed since the artist was last reviewed.
TimeSincePrevfn <- function(name){
  artist <- p4kdata7[p4kdata7$artist==name,]
  y <- NULL
  if(nrow(artist)==1){
    y <- NA
  } else {
    for(x in 1:(nrow(artist)-1)){
      b <- artist$date[x]-artist$date[x+1]
      y <- c(y,b)
    }
    y <- c(y, NA)
  }
  cbind(artist, TimeSincePrev = y)
}
p4kdata8 <- lapply(artistnames, TimeSincePrevfn)
p4kdata8 <- bind_rows(p4kdata8)
Now a variable PrevTwoScoreChange which captures the growth or decline across the artist’s two prior reviews, computed as the ratio of the previous score to the one before it (for example, a 7.7 following a 7.0 gives 1.1, a 10% increase).
PrevTwoScoreChangefn <- function(name){
  artist <- p4kdata8[p4kdata8$artist==name,]
  y <- NULL
  if(nrow(artist) <= 2){
    y <- NA
  } else {
    for(x in 1:(nrow(artist)-2)){
      # Ratio of the previous score to the one before it
      b <- artist$score[x+1]/artist$score[x+2]
      y <- c(y,b)
    }
    y <- c(y, NA, NA)
  }
  cbind(artist, PrevTwoScoreChange = y)
}
p4kdata9 <- lapply(artistnames, PrevTwoScoreChangefn)
p4kdata9 <- bind_rows(p4kdata9)
Now a variable PrevTwoScoreAppreciated which indicates whether the score increased across the two prior reviews (i.e. whether PrevTwoScoreChange is greater than 1).
p4kdata9 <- p4kdata9 %>% mutate(PrevTwoScoreAppreciated = PrevTwoScoreChange > 1)
Now for the genre variables. The data originally contains a single genre variable, but problems arise because, in many cases, more than one genre is assigned to an album. For instance, see the original entry below.
p4kdata[104,]
## artist album genre score date author
## 104 Mariah Carey Caution Pop/R&B 7.5 November 22 2018 Maura Johnston
## role bnm label release_year
## 104 Contributor 0 Epic Records 2018
Thus I create a separate variable for each genre indicating whether this genre, be it alone or alongside other genres, appears in the album’s genre tag.
p4kdata9 <- p4kdata9 %>% mutate(GenreElectronic = grepl("Electronic", p4kdata9$genre),
                                GenreExperimental = grepl("Experimental", p4kdata9$genre),
                                GenreFolk = grepl("Folk", p4kdata9$genre),
                                GenreCountry = grepl("Country", p4kdata9$genre),
                                GenreGlobal = grepl("Global", p4kdata9$genre),
                                GenreRock = grepl("Rock", p4kdata9$genre),
                                GenreMetal = grepl("Metal", p4kdata9$genre),
                                GenreRandB = grepl("R&B", p4kdata9$genre),
                                GenrePop = grepl("Pop", p4kdata9$genre),
                                GenreRap = grepl("Rap", p4kdata9$genre),
                                GenreJazz = grepl("Jazz", p4kdata9$genre))
Now, using for-loops this time, I create a variable LabelAvg giving the average of all scores previously received by the album’s label (or labels).
p4kdata9 <- p4kdata9 %>% add_column(LabelAvg = NA)
for(x in 1:nrow(p4kdata9)){
  # All other reviews up to and including this review's date
  datecutset <- p4kdata9 %>% slice(-x) %>% filter(date <= p4kdata9[x,]$date)
  labelnames <- strsplit(p4kdata9[x,]$label, " ")[[1]]
  a <- NULL
  for(y in 1:length(labelnames)){
    # Reviews sharing any of this album's labels (whole-word match)
    b <- datecutset[grepl(paste0("\\b", labelnames[y], "\\b"), datecutset$label),]
    a <- rbind(a, b)
  }
  a <- a[!duplicated(a),]
  p4kdata9[x,]$LabelAvg <- mean(a$score)
}
Now a variable LabelPrev which shows the score that this label’s previous review received.
p4kdata9 <- p4kdata9 %>% add_column(LabelPrev = NA)
for(x in 1:nrow(p4kdata9)){
  datecutset <- p4kdata9 %>% slice(-x) %>% filter(date <= p4kdata9[x,]$date)
  labelnames <- strsplit(p4kdata9[x,]$label, " ")[[1]]
  a <- NULL
  for(y in 1:length(labelnames)){
    b <- datecutset[grepl(paste0("\\b", labelnames[y], "\\b"), datecutset$label),]
    a <- rbind(a, b)
  }
  a <- a[!duplicated(a),]
  a <- a %>% arrange(desc(date))
  p4kdata9[x,]$LabelPrev <- a[1,]$score
}
Now a variable LabelTotal which shows the total of how many reviews the label has previously received.
p4kdata9 <- p4kdata9 %>% add_column(LabelTotal = NA)
for(x in 1:nrow(p4kdata9)){
  datecutset <- p4kdata9 %>% slice(-x) %>% filter(date <= p4kdata9[x,]$date)
  labelnames <- strsplit(p4kdata9[x,]$label, " ")[[1]]
  a <- NULL
  for(y in 1:length(labelnames)){
    b <- datecutset[grepl(paste0("\\b", labelnames[y], "\\b"), datecutset$label),]
    a <- rbind(a, b)
  }
  a <- a[!duplicated(a),]
  p4kdata9[x,]$LabelTotal <- nrow(a)
}
p4kdata9[p4kdata9$label=="self-released",]$LabelTotal <- 0
I save the result as an R object for ease of future use.
saveRDS(p4kdata9, "./p4knewdata.rds")
Finally, let’s take a look at the same entry we saw earlier with the new variables added.
p4knewdata <- readRDS("./p4knewdata.rds")
p4knewdata[which(p4knewdata$album==p4kdata[104,]$album),]
## artist album genre score date author role
## 1081 Mariah Carey Caution Pop/R&B 7.5 2018-11-22 MauraJohnston Contributor
## bnm label release_year PrevTotal PrevBNMTotal PrevScore PrevScoreAvg
## 1081 0 epicrecords 2018 2 0 7.9 7.8
## PrevAuthorSame PrevAuthorSameTotal TimeSincePrev PrevTwoScoreChange
## 1081 FALSE 0 347 1.025974
## PrevTwoScoreAppreciated GenreElectronic GenreExperimental GenreFolk
## 1081 TRUE FALSE FALSE FALSE
## GenreCountry GenreGlobal GenreRock GenreMetal GenreRandB GenrePop GenreRap
## 1081 FALSE FALSE FALSE FALSE TRUE TRUE FALSE
## GenreJazz LabelAvg LabelPrev LabelTotal
## 1081 FALSE 6.82 7.1 5
To begin with, I perform some final cleaning by removing albums by Various Artists and dropping the variables PrevTwoScoreAppreciated and PrevTwoScoreChange. These two variables contain too many NA values; removing the NAs would make the datasets too small and the models less reliable. I then split the data by review date into a training set, a testing set for measuring model performance during development, and a final test set (finaltest) for a last demonstration of the chosen model. The sets are saved as R objects and reloaded below, followed by the proportional size of the training set.
p4knewdata <- p4knewdata[p4knewdata$artist!="Various Artists",]
finaltest <- p4knewdata %>% filter(date < "2019-02-01", date > "2018-01-01") %>% select(-PrevTwoScoreAppreciated, -PrevTwoScoreChange)
testing <- p4knewdata %>% filter(date <= "2018-01-01", date > "2014-01-01") %>% select(-PrevTwoScoreAppreciated, -PrevTwoScoreChange)
training <- p4knewdata %>% filter(date <= "2014-01-01", date > "2003-01-01") %>% select(-PrevTwoScoreAppreciated, -PrevTwoScoreChange)
training <- na.omit(training)
testing <- na.omit(testing)
finaltest <- na.omit(finaltest)
# Save the splits so they can be reloaded in later sessions
saveRDS(training, "./p4ktraining.rds")
saveRDS(testing, "./p4ktesting.rds")
saveRDS(finaltest, "./p4kfinaltest.rds")
p4knewdata <- readRDS("./p4knewdata.rds")
training <- readRDS("./p4ktraining.rds")
testing <- readRDS("./p4ktesting.rds")
finaltest <- readRDS("./p4kfinaltest.rds")
nrow(training)/(nrow(training)+nrow(testing))
## [1] 0.7138658
Next I confirm some of the features of the data.
mean(p4knewdata$score)
## [1] 7.030189
var(p4knewdata$score)
## [1] 1.591397
quantile(p4knewdata$score, probs=c(0.025,0.975))
## 2.5% 97.5%
## 3.8 9.0
So Pitchfork reviews have a mean score of 7.0 with a variance of about 1.6 around that mean, and 95% of reviews fall between scores of 3.8 and 9.0. Next I look at a plot of the distribution of the scores.
upper <- mean(p4knewdata$score)+2*sd(p4knewdata$score)
lower <- mean(p4knewdata$score)-2*sd(p4knewdata$score)
inbounds <- p4knewdata %>% filter(score > lower, score < upper) %>% nrow
inbounds/nrow(p4knewdata)
## [1] 0.9463487
p4knewdata %>% ggplot(aes(x=score)) +
  geom_histogram(aes(y=..density..), binwidth=.5, color="black", fill="white") +
  geom_density(alpha=.2, fill="#BB1111", bw=0.2) +
  geom_vline(xintercept=lower) +
  geom_vline(xintercept=upper)
The scores are approximately normally distributed and, as confirmed above, around 95% of them fall within two standard deviations of the mean.
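One informal way to check this is with a normal Q-Q plot; a minimal sketch using base R:
# Compare the empirical score quantiles against a theoretical normal distribution
qqnorm(p4knewdata$score, main = "Normal Q-Q Plot of Scores")
qqline(p4knewdata$score)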
Next I plot each individual variable against score in scatter plots. I won’t repeat the code for plots g2 to g12 as it closely resembles that of g1; a sketch of how the full set of panels could be generated programmatically follows the grid.
library(gridExtra)
library(RColorBrewer)
g1 <- p4knewdata %>% ggplot(aes(x=score, y=PrevTotal)) +
  geom_point(position=position_jitter(h=0.1, w=0.1), alpha=.1, color="#BB1111") +
  geom_smooth(method='lm', col="black")
grid.arrange(g1, g2, g3, g4, g5, g6, g7, g8, g9, g10, g11, g12, nrow = 4)
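For reference, a sketch of how the full set of panels could be generated programmatically rather than copying the g1 code for each variable; the vector of predictor names below is assumed from the variables created earlier, and as.numeric() simply lets the logical indicators share the same kind of axis.
# One jittered scatter plot per created predictor, arranged in a grid
plotvars <- c("PrevTotal", "PrevBNMTotal", "PrevScore", "PrevScoreAvg",
              "PrevAuthorSame", "PrevAuthorSameTotal", "TimeSincePrev",
              "PrevTwoScoreChange", "PrevTwoScoreAppreciated",
              "LabelAvg", "LabelPrev", "LabelTotal")
plots <- lapply(plotvars, function(v){
  p4knewdata %>% ggplot(aes(x=score, y=as.numeric(.data[[v]]))) +
    geom_point(position=position_jitter(h=0.1, w=0.1), alpha=.1, color="#BB1111") +
    geom_smooth(method='lm', col="black") +
    labs(y = v)
})
do.call(grid.arrange, c(plots, nrow = 4))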
As expected, the previous score and the average of all previous scores appear to have the strongest correlations with score. The time elapsed since the previous review and the indicator of appreciation over the two prior scores show little or no correlation. The remaining variables show some, albeit weaker, correlation. Stronger correlations would have given more hope of building a highly accurate model later on, but the correlations visible in the plots suggest these variables should not be dismissed.
Next I similarly plot the genre variables against score. Again, I only share the code that produces the first plot, with a corresponding sketch after the grid.
colours <- brewer.pal(n = 12, name = "Paired")
gg1 <- p4knewdata %>% mutate(GenreElectronic = as.numeric(GenreElectronic)) %>%
  ggplot(aes(x=score, y=GenreElectronic)) +
  geom_point(position=position_jitter(h=0.1, w=0.1), alpha=.1, color=colours[1]) +
  geom_smooth(method='lm', col="black")
grid.arrange(gg1, gg2, gg3, gg4, gg5, gg6, gg7, gg8, gg9, gg10, gg11, nrow=4)
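A corresponding sketch for the genre panels, looping over the indicator columns created earlier and colouring each panel from the Paired palette:
genrevars <- grep("^Genre", names(p4knewdata), value = TRUE)
genreplots <- lapply(seq_along(genrevars), function(i){
  p4knewdata %>% ggplot(aes(x=score, y=as.numeric(.data[[genrevars[i]]]))) +
    geom_point(position=position_jitter(h=0.1, w=0.1), alpha=.1, color=colours[i]) +
    geom_smooth(method='lm', col="black") +
    labs(y = genrevars[i])
})
do.call(grid.arrange, c(genreplots, nrow = 4))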
Here there are both negative and positive correlations, with experimental, rock and electronic appearing to be the most strongly correlated genres.
To confirm what the regression lines in the previous plots suggested, I calculate and tabulate the correlation between each variable and score. I first have to remove some infinite values from the PrevTwoScoreChange variable (these arise from division by a previous score of 0). I also save the table as an R object for later use.
p4knewdata2 <- p4knewdata[-which(is.infinite(p4knewdata$PrevTwoScoreChange)),]
interestedvars <- p4knewdata %>% select(-score, -artist, -album, -genre, -date, -author, -role, -bnm, -label, -release_year) %>% names()
corTable <- NULL
for(x in interestedvars){
  y <- p4knewdata2 %>% select(score, all_of(x)) %>% na.omit() %>% cor()
  corTable <- rbind(corTable, y)
}
# Keep only the correlation of each variable with score, then sort
corTable <- corTable[-seq(1, 2*length(interestedvars), by = 2),-2]
corTable <- corTable %>% as.data.frame()
colnames(corTable) <- "Correlation"
corTable <- corTable %>% arrange(desc(Correlation))
saveRDS(corTable, "./corTable.rds")
corTable
## Correlation
## PrevScoreAvg 0.3541280775
## PrevScore 0.3433678955
## LabelAvg 0.1789222358
## LabelPrev 0.1110331396
## PrevBNMTotal 0.0958369433
## GenreExperimental 0.0892275016
## PrevAuthorSameTotal 0.0623383514
## PrevAuthorSame 0.0525909464
## PrevTotal 0.0449685862
## GenreJazz 0.0434258311
## GenreGlobal 0.0401799480
## LabelTotal 0.0399137947
## PrevTwoScoreAppreciated 0.0396887312
## GenreFolk 0.0296663980
## GenreCountry 0.0296663980
## GenreMetal -0.0006806591
## TimeSincePrev -0.0069245876
## PrevTwoScoreChange -0.0239133312
## GenreRandB -0.0249390197
## GenrePop -0.0249390197
## GenreRap -0.0259418147
## GenreElectronic -0.0317382747
## GenreRock -0.0479664216
Along with confirming the earlier observations, the table shows that the label’s average score and the score of the label’s previous review are also reasonably well correlated with score.
Thus it can be concluded that the previous scores the artist received are the best indicators, the label that released the album is the next best, and the genre is the third. The author of the review may also provide some indication.
Some interesting points in terms of genre: experimental music generally receives the most favorable scoring, rock music fares the worst, global, jazz, folk and country receive slight positive favoring, and electronic, rap, pop and R&B receive slightly negative treatment. It should be noted, though, that the genres with positive correlations are, as the previous plots show, mostly the least-reviewed ones. The exception is experimental music, which is reviewed a fair amount and is the standout correlated genre, making it by far the most interesting case. Metal is the only genre with an absolute correlation below 0.02; it is in fact so close to 0 that it would be fair to conclude there is no bias for or against metal music in the scoring.
A final interesting and surprising observation is that the number of ‘Best New Music’ awards previously given to an artist has a correlation with score of less than 0.1. Since ‘Best New Music’ goes to music of an excellent standard, in reviews scoring over 8.0, one might have expected it to correlate with score about as strongly as the previous-score variables. This could be explained by the fact that a large proportion of the data simply has 0 previous BNM awards. In spite of this, the variable is correlated enough to remain of interest.
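That explanation is easy to check; a one-line sketch giving the share of reviews for which the artist had no previous BNM award (the exact figure is not reported here):
mean(p4knewdata$PrevBNMTotal == 0, na.rm = TRUE)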
The goal of the model-fitting stage is to see how much improvement can be made on a simple linear model that uses, as its sole predictor, the average of the artist’s previous review scores, the variable most highly correlated with score. So first I load the data and create the simple model.
library(caret)
p4knewdata <- readRDS("./p4knewdata.rds")
training <- readRDS("./p4ktraining.rds")
testing <- readRDS("./p4ktesting.rds")
finaltest <- readRDS("./p4kfinaltest.rds")
corTable <- readRDS("./corTable.rds")
PREVSCOREAVGONLYfit <- lm(score~PrevScoreAvg, data=training)
PREVSCOREAVGONLYpred <- predict(PREVSCOREAVGONLYfit, testing)
mean(abs(PREVSCOREAVGONLYpred-testing$score))
## [1] 0.7801489
So the simple linear model with a single predictor has an average (mean absolute) error of around 0.78 when tested against the testing dataset.
Next I use the step function to select the best combination of predictors.
corTable <- corTable[-grep("PrevTwoScore", rownames(corTable)),,drop=FALSE]
interestedvars2 <- paste(rownames(corTable), collapse="+")
step(lm(score~PrevScoreAvg+PrevScore+LabelAvg+LabelPrev+PrevBNMTotal+GenreExperimental+PrevAuthorSameTotal+PrevAuthorSame+PrevTotal+GenreJazz+GenreGlobal+LabelTotal+GenreFolk+GenreCountry+GenreMetal+TimeSincePrev+GenreRandB+GenrePop+GenreRap+GenreElectronic+GenreRock,
data=training),
direction = "both", trace=0)
##
## Call:
## lm(formula = score ~ PrevScoreAvg + PrevScore + LabelAvg + PrevBNMTotal +
## GenreExperimental + PrevAuthorSameTotal + LabelTotal + GenreRandB +
## GenreRap + GenreElectronic + GenreRock, data = training)
##
## Coefficients:
## (Intercept) PrevScoreAvg PrevScore
## 3.102904 0.258185 0.118789
## LabelAvg PrevBNMTotal GenreExperimentalTRUE
## 0.185719 0.044228 0.178785
## PrevAuthorSameTotal LabelTotal GenreRandBTRUE
## 0.085085 0.001241 -0.265318
## GenreRapTRUE GenreElectronicTRUE GenreRockTRUE
## -0.274023 -0.246727 -0.245493
I proceed to build the prediction model using the variables selected by the step function.
STEPfit <- lm(formula = score ~ PrevScoreAvg + PrevScore + LabelAvg + PrevBNMTotal +
GenreExperimental + PrevAuthorSameTotal + LabelTotal + GenreRandB +
GenreRap + GenreElectronic + GenreRock, data = training)
STEPpred <- predict(STEPfit, testing)
mean(abs(STEPpred-testing$score))
## [1] 0.7466902
In this case the linear model has an average error of around 0.747 against the testing data, an improvement of about 0.033 over the single-predictor model. It is a small improvement, but, especially once rounding to one decimal place is taken into consideration, it could be meaningful.
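One rough way to judge whether an improvement of this size matters in practice is to compare how often each model lands within a fixed distance of the true score; a small sketch, with the half-point threshold chosen arbitrarily:
# Proportion of test-set predictions within 0.5 of the actual score
mean(abs(PREVSCOREAVGONLYpred - testing$score) <= 0.5)
mean(abs(STEPpred - testing$score) <= 0.5)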
Next I try some more complicated models. First, I set a control object so that cross-validation is performed for each model, and I remove the unnecessary variables so that only the predictors and score remain.
set.seed(211009)
control <- trainControl(method="cv", number=3, verboseIter=F)
training2 <- training %>% select(-artist, -album, -genre, -date, -author, -role, -bnm, -label, -release_year)
First I try a decision tree model (rpart), then a random forest, and then a gradient boosting model (gbm).
TREEfit <- train(score~., method="rpart", data=training2, tuneLength = 50, trControl=control)
TREEpred <- predict(TREEfit, testing)
mean(abs(TREEpred-testing$score))
## [1] 0.7923031
RFfit <- train(score~., method="rf", data=training2, trControl=control)
RFpred <- predict(RFfit, testing)
mean(abs(RFpred-testing$score))
## [1] 0.7647829
GBMfit <- train(score~., method="gbm", data=training2, trControl=control, verbose=FALSE)
GBMpred <- predict(GBMfit, testing)
mean(abs(GBMpred-testing$score))
## [1] 0.7498256
Of these, the boosting model has the lowest average error, around 0.75, similar to that of the linear model built earlier and again an improvement of about 0.03 over the single-predictor model: small, but potentially meaningful. I proceed with both the linear model and the gbm model.
Next I look at some model summaries to see whether any predictors should be removed. First, an analysis of variance on the linear model.
anova(STEPfit)
## Analysis of Variance Table
##
## Response: score
## Df Sum Sq Mean Sq F value Pr(>F)
## PrevScoreAvg 1 1299.3 1299.27 900.1075 < 2.2e-16 ***
## PrevScore 1 59.3 59.31 41.0867 1.571e-10 ***
## LabelAvg 1 86.4 86.44 59.8834 1.179e-14 ***
## PrevBNMTotal 1 12.6 12.59 8.7203 0.0031594 **
## GenreExperimental 1 36.1 36.11 25.0172 5.849e-07 ***
## PrevAuthorSameTotal 1 30.3 30.32 21.0066 4.673e-06 ***
## LabelTotal 1 19.5 19.55 13.5425 0.0002353 ***
## GenreRandB 1 2.3 2.28 1.5786 0.2090153
## GenreRap 1 1.1 1.13 0.7855 0.3755063
## GenreElectronic 1 22.4 22.35 15.4863 8.407e-05 ***
## GenreRock 1 49.1 49.06 33.9892 5.840e-09 ***
## Residuals 5816 8395.1 1.44
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The high p-values suggest that the GenreRandB and GenreRap variables could be removed.
STEPfit2 <- lm(formula = score ~ PrevScoreAvg + PrevScore + LabelAvg + PrevBNMTotal +
GenreExperimental + PrevAuthorSameTotal + LabelTotal + GenreElectronic + GenreRock, data = training)
STEPpred2 <- predict(STEPfit2, testing)
mean(abs(STEPpred2-testing$score))
## [1] 0.7436192
This marginally improves the accuracy of the model against the testing set.
Next I look at the summary of the gbm model.
summary(GBMfit, plotit=FALSE)
## var rel.inf
## PrevScoreAvg PrevScoreAvg 40.71256432
## PrevScore PrevScore 17.63936822
## LabelAvg LabelAvg 14.38092889
## TimeSincePrev TimeSincePrev 8.95865228
## LabelTotal LabelTotal 5.03388899
## LabelPrev LabelPrev 4.80790142
## GenreExperimentalTRUE GenreExperimentalTRUE 1.74720149
## PrevAuthorSameTotal PrevAuthorSameTotal 1.72848705
## GenreElectronicTRUE GenreElectronicTRUE 1.45558921
## PrevTotal PrevTotal 1.44241967
## GenreRockTRUE GenreRockTRUE 0.87626951
## GenreRapTRUE GenreRapTRUE 0.44561368
## GenreFolkTRUE GenreFolkTRUE 0.33900091
## GenreRandBTRUE GenreRandBTRUE 0.22390725
## GenreMetalTRUE GenreMetalTRUE 0.10860397
## PrevBNMTotal PrevBNMTotal 0.09960313
## PrevAuthorSameTRUE PrevAuthorSameTRUE 0.00000000
## GenreCountryTRUE GenreCountryTRUE 0.00000000
## GenreGlobalTRUE GenreGlobalTRUE 0.00000000
## GenrePopTRUE GenrePopTRUE 0.00000000
## GenreJazzTRUE GenreJazzTRUE 0.00000000
Five of the predictors appear to have no influence on the model, so they can be removed. Note that the stochastic nature of gbm fitting means the reported accuracy will differ slightly each time the model is run.
GBMfit2 <- train(score~PrevScoreAvg+PrevScore+LabelAvg+TimeSincePrev+LabelTotal+LabelPrev+
                   GenreExperimental+PrevAuthorSameTotal+GenreElectronic+PrevTotal+GenreRock+
                   GenreRap+GenreFolk+GenreRandB+GenreMetal+PrevBNMTotal,
                 method="gbm", data=training2, trControl=control, verbose=FALSE)
GBMpred2 <- predict(GBMfit2, testing)
mean(abs(GBMpred2-testing$score))
## [1] 0.753722
The linear and gbm models have similar accuracy, so I will proceed with both and run a final test on the previously designated finaltest data.
I perform the final test with the two models and note the accuracy, also including the single-predictor linear model for a final comparison.
PREVSCOREAVGONLYfinalpred <- predict(PREVSCOREAVGONLYfit, finaltest)
mean(abs(PREVSCOREAVGONLYfinalpred-finaltest$score))
## [1] 0.7991907
STEPfinalpred <- predict(STEPfit2, finaltest)
mean(abs(STEPfinalpred-finaltest$score))
## [1] 0.7643821
GBMfinalpred <- predict(GBMfit2, finaltest)
mean(abs(GBMfinalpred-finaltest$score))
## [1] 0.7612905
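For illustration, here is a sketch of how the gbm model could be used to score a hypothetical new release; every feature value below is invented purely for the example.
# Hypothetical album: an artist with two prior reviews on a mid-sized label
newalbum <- data.frame(
  PrevScoreAvg = 7.4, PrevScore = 7.6, PrevTotal = 2, PrevBNMTotal = 0,
  PrevAuthorSame = FALSE, PrevAuthorSameTotal = 0, TimeSincePrev = 400,
  LabelAvg = 7.1, LabelPrev = 7.3, LabelTotal = 20,
  GenreElectronic = FALSE, GenreExperimental = TRUE, GenreFolk = FALSE,
  GenreCountry = FALSE, GenreGlobal = FALSE, GenreRock = FALSE,
  GenreMetal = FALSE, GenreRandB = FALSE, GenrePop = FALSE,
  GenreRap = FALSE, GenreJazz = FALSE
)
predict(GBMfit2, newalbum)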
The gbm model offers the best accuracy on the final test, if only marginally, so I select it as my final prediction model. Here are my concluding statements.
As expected, Pitchfork review scores are best predicted by the previous scores which the artist received.
The label the album was released on serves as a reasonable predictor, while the genre has some influence and the author of the review has a little.
Within the genres, there is positive bias towards any album that has a genre tag that includes ‘experimental’.
A simple model with just one predictor, the average of the artist’s previous review scores, sees an average error of around 0.8.
A boosting model involving a host of predictors can improve average prediction accuracy by between 0.03 and 0.04.
Whilst I may have hoped for a greater improvement from the models that include the predictors I created, the improvement achieved shows that tendencies and potential biases do exist among Pitchfork reviews. A final note is that there are of course outside influences that could affect a score, but, as gathering and selecting such data would be quite a task, I chose to stick to the data afforded by the Pitchfork reviews themselves.