options(width=100)
knitr::opts_chunk$set(out.width='1000px',dpi=200,message=FALSE,warning=FALSE)
#load packages
library(ggplot2)
library(dplyr)
library(gridExtra)
library(Amelia)
library(corrplot)
library(MASS)
library(caret)

1 Motivation

I like videogames and I like numbers, analyzing data and making some nice visualizations. Most of the games I played the last few years were like Borderlands 2, ie shooter-looter where the end goal is actually getting the best loot (gears, weapons) possible. These games have key-points in common :

As the end goal of these games is to reach the level cap as soon as possible, in order to engage in end games activities (raid bosses, raid, rifts,etc …)), the progression to reach these caps is relatively fast.

Also, as a gamer, I always found the progression in the early levels very satisfactory because you actually see your character getting stronger (in stats and visually (more important gears/armors/weapons)) and also it gives the feeling that you are making actual progression. (I’m looking at you the Witcher 3)

So I decided to qualify this feeling by looking at the data and see how it translates in term of level progression vs. time played.

disclaimer :

2 Destiny

2.1 Intro and Methodology

There are 2 levels to max in Destiny : character level (LVL) and Light level (LIGHT). LVL increases by doing activities (story missions, strikes, patrols) whereas LIGHT is an aggregate of the LIGHT levels of all you weapons/gears.

LVL is a soft level cap and the LIGHT is the hard level cap, meaning you reach the cap quicker, whereas reaching the hard level cap requires more grinding.

The data are retrieved via Bungie’s API (HTTP request along with personal API key) under a JSON format :

  • use a python script to get the end-points of the API and navigate through the Manifest.
  • LVL as well as all the reputations ranks are coded as float to get better precision.
  • script is run every minute via cronjob and appends data to a csv file.
  • analysis, plots are done using R.
  • as I could not analyzed data from my existing characters, I have re-started several characters from scratch.
The character I used for these data analysis (Female Warlock) looked like the below picture :
id1 ='4611686018428669871'
id2 ='2305843009399642403'

#get character summary
fullName = 'http://www.bungie.net/Platform/Destiny/2/Account/'+(id1)+'/Character/'+ (id2)

2.2 Plots

There are several features I saved and looked throughout the progression :

  • LVL and LIGHT levels, as well as reputations levels (these are levels you can increase by doing activities attached to a certain NPC)
  • tot time played, time played per session (minutes, secondes)
  • type, name of activities
destinyData<-read.csv('destiny/destiny_Warlock.csv',header=TRUE,sep=',')
head(destinyData,2)
##   X        date       time LVL LIGHT time_played time_played_session     hashId Activity
## 1 1 2016-09-05   15:33:08    1     5           1                   1 1846390409   Story 
## 2 2 2016-09-05   15:34:02    1     5           1                   1 1846390409   Story 
##      Location  Planet               Name Vanguard Crucible Dead.orbit New.monarchy Fwc Crypto
## 1  Old Russia  Earth   A Guardian Rises         0        0          0            0   0      0
## 2  Old Russia  Earth   A Guardian Rises         0        0          0            0   0      0
##   Gunsmith Crota House Queen session
## 1        0     0     0     0       1
## 2        0     0     0     0       1
h1 <- ggplot(data=destinyData,aes(x=time_played,y=LVL)) + 
  geom_point(aes(color=factor(session)),size=2) + 
  xlab('time played [min]') + ylab('LVL') + 
  theme(legend.position="none",axis.text=element_text(size=12),axis.title=element_text(size=14)) + 
  geom_hline(yintercept = 40,size=1,alpha=.5,color='black')

g1 <- ggplot(data=destinyData,aes(x=time_played,y=LIGHT)) + 
  geom_point(aes(color=factor(session)),size=3) + 
  xlab('Time played [min]') + ylab('LIGHT')  + ylim(0,350) +
  theme(legend.position="none",axis.text=element_text(size=12),axis.title=element_text(size=14)) +
  geom_hline(yintercept = 335,size=1,alpha=.5,color='black')

model<-lm(destinyData$LIGHT~ (destinyData$time_played + I(destinyData$time_played^2)))
myfit<-function(x) model$coefficient[3]*x^2 + model$coefficient[2]*x + model$coefficient[1]
g1 <- g1 + stat_function(fun=myfit, colour="black",size=1)

grid.arrange(h1,g1,ncol=2)

Comments :

  • we see the soft level cap reached earlier than the hard level cap(symbolized by horizontal lines).
  • at the time of this study, the hard level cap was 335 so we see that I haven’t reached it.
  • The color code is based on my game session, meaning each time I played the game (often by chunk of 1 to 2 hours).
  • the fit on the right panel is a polynomial degree 2, meaning it fits : LIGHT = a + b*TimePlayed + c*(TimePlayed)^{2}

Another useful plot is the LVL progression per session, meaning how the LVL increases each time played.

I defined a (gameplay) session each time I play this character ; most of the time I played for 1-2 hours. Having this quantity helps better to have an understanding of how the levels increased vs time/time_played_in_session.

What is shown here is the LIGHT increase vs the time played for each session, where session is in chronological order. Lines are linear fits.

  • We see that the slopes of the linear fits decrease as the session index increases
  • it means that at the beginning, the progression is very fast
  • towards the LIGHT cap (session ~19-20), the increase in LIGHT level is very small for the time played ( = level cap, grind)

2.3 Validity of the fit/model prediction

R has some packages to plot the residuals (true - fit) and some other diagnostic plots. But we also see that this simple polynomial function achieves a R-squared relatively good (0.996) so it is enough to use it for predictions, meaning how long do I need to play to reach a certain LVL (note : I do this in the Borderlands 2 section)

summary(model)
## 
## Call:
## lm(formula = destinyData$LIGHT ~ (destinyData$time_played + I(destinyData$time_played^2)))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13.6327  -3.4388  -0.6347   2.8556  18.7959 
## 
## Coefficients:
##                                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                   5.788e+00  3.497e-01   16.55   <2e-16 ***
## destinyData$time_played       2.409e-01  7.310e-04  329.58   <2e-16 ***
## I(destinyData$time_played^2) -5.228e-05  3.177e-07 -164.53   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.277 on 2026 degrees of freedom
## Multiple R-squared:  0.996,  Adjusted R-squared:  0.996 
## F-statistic: 2.525e+05 on 2 and 2026 DF,  p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(model)

2.4 Other plots

timeAct<-ggplot(data=destinyData,aes(x=Activity)) + geom_bar(aes(fill=factor(Planet))) + ylab("Time played [min]") + xlab("Activities") + scale_fill_manual(name="Planet",values=c("#3B9AB2", "#78B7C5", "#EBCC2A", "#E1AF00", "#F21A00","#9A8822", "#F5CDB4"))
print(timeAct)

Knowing these type of data is very valuable for the player because it gives insights on how he is playing and where he can improve on his character progression. For Destiny, I was surprised to notice how much I spent in Social Activities ( = Tower, Reef). Clearly I spent a lot of time for gears and weapons management/optimization. It is also reflected in the plots LIGHT vs. total time played where we see small peaks at recurrent times and traduces how I play : session of 1-2 hours of strikes/story missions then I go to the Tower to decrypt engrams, hence the increase in LIGHT level.

ggplot(data=destinyData,aes(x=time_played,y=LVL)) + 
  geom_point(aes(color=factor(Activity)),size=2) + 
  xlab('time played [min]') + ylab('LVL') + 
  theme(axis.text=element_text(size=12),axis.title=element_text(size=14),legend.position='top') + 
  scale_color_manual(name="Activity",values=c("#3B9AB2", "#78B7C5", "#EBCC2A", "#E1AF00", "#F21A00","#9A8822")) + 
  geom_hline(yintercept = 40,size=.5,alpha=.5,color='black')

3 The Division

3.1 Intro and Methodology

The game has 3 main characteristics, which also are a combination of your gears and weapons levels :

  • primary DPS.
  • Health.
  • Skill power.

For this study I used the game API to collect the data. However, contrary to Destiny, the API is hosted on Uplay : Ubisoft games portal, which made this analysis less precise because :

  • data is not on a character basis, but on an account basis. Having played the game, my LVL stats were already maxed out from previous character
  • end point of the API are related to your activities in the game, like number of enemies killed, etc … and nothing was related to the character stats itself.

As for Destiny :

  • a cron job is run every minute to record data from the API.
  • new features (like time played, number of kills per session) are created later.

3.2 Some progression plots

divisionData<-read.csv('thedivision/theDivision_character.csv',sep=',')
head(divisionData,2)
##   X        date       time playtime maxLevel mainStory maxDZ maxUGRank TotalKills ItemsExtracted
## 1 1 2017-01-01   07:20:02         0       30       100    12         0          0              0
## 2 2 2017-01-01   07:21:01         0       30       100    12         0          0              0
##   SkillKills RogueAgentsKilled               tempo Session SessionTimePlayedSec TotTimePlayedSec
## 1          0                 0 2017-01-01 07:20:02       1                    0                0
## 2          0                 0 2017-01-01 07:21:01       1                   59               59
##   SessionTimePlayedMin KillsPerSession PVE TotTimePlayedMin
## 1            0.0000000               0   1        0.0000000
## 2            0.9833333               0   1        0.9833333
g1 <- ggplot(data=divisionData, aes(x=SessionTimePlayedMin, y=SkillKills)) + 
  geom_line(aes(color=factor(Session))) +  
  xlab('Time played (min)') + ylab('Total Kills') + 
  theme(text = element_text(size=8))

#fit of kills per session with intercept at starting kills
g2 <- ggplot(data=divisionData, aes(x=SessionTimePlayedMin, y=TotalKills, color=factor(Session))) + 
  xlab('Time played (min)') + ylab('Total Kills') + 
  theme(text = element_text(size=8)) 
g2 <- g2 + geom_smooth(method='lm') + geom_point(size=1)
g2

Contrary to the character EXP / LVL ,the number of TotalKills is relatively identical throughout the sessions played.

tt<-as.data.frame(divisionData %>% 
                    dplyr::group_by(Session) %>% 
                    dplyr::select(Session, KillsPerSession) %>% 
                    dplyr::summarise(session = mean(Session), maxKills = max(KillsPerSession)))

ggplot(data=tt, aes(x=Session, y=maxKills)) + 
  geom_bar(stat="identity") + 
  xlab('Session') + 
  theme(text = element_text(size=8))
#fit of kills per session with intercept = 0
k1 <- ggplot(data=divisionData, aes(x=SessionTimePlayedMin, y=KillsPerSession, color=factor(Session))) + 
  xlab('Time played (min)') + ylab('Kills per Session') + 
  theme(text = element_text(size=8),legend.position="top")
k1 <- k1 + geom_point(size=1) + geom_smooth(formula = y ~ x+ x*x)

#all kills vs all time, but fit independent per session
k2 <- ggplot(data=divisionData, aes(x= TotTimePlayedMin, y=TotalKills, color=factor(Session))) + 
  xlab('Time played (min)') + ylab('Total Kills') + 
  theme(text = element_text(size=8),legend.position="top")
k2 <- k2 + geom_point(size=1) + geom_smooth(formula = y ~ x+ x*x)

grid.arrange(k1,k2,ncol=2)

4 Diablo 3

4.1 Intro and Methodology

Diablo 3 characters have a lot of features to manage, making this analysis very enjoyable.

Diablo 3 has a soft level cap at 70, then further progression increases the parangon level. (similar to BL2 BAR)

4.2 Some progression plots, and correlation plots

d3Data<-read.csv('d3/d3-wizard.csv',sep=',')
head(d3Data,2)
##   X       date     time act LVL difficulty strength dexterity intelligence vitality damage
## 1 1 2017-03-19 07:19:15   1   2       hard        9         9           13       11      3
## 2 2 2017-03-19 07:21:04   1   2       hard        9         9           13       11      3
##   toughness healing    gold Parangon               tempo hour min Session SessionTimePlayedSec
## 1       205       0 6001668       87 2017-04-19 07:19:15    7  19       1                    0
## 2       228      18 5994237       87 2017-04-19 07:21:04    7  21       1                  109
##   TotTimePlayedSec SessionTimePlayedMin TotTimePlayedMin
## 1                0             0.000000         0.000000
## 2              109             1.816667         1.816667

The plot below shows the LVL progression vs. the TimePlayed (left is for the full time played, right is to exclude the linear part due to LVL cap, for fitting purpose). Linear smooth is used for fitting.

l1<-ggplot(data=d3Data,aes(x= TotTimePlayedMin,y=LVL)) + 
  geom_point(aes(color=factor(Session))) + 
  theme(legend.position="top") + 
  geom_hline(yintercept = 70, size=.5,alpha=.5,color='black')

l2<-ggplot(data=d3Data,aes(x= TotTimePlayedMin,y=LVL)) + 
  geom_point(aes(color=factor(Session))) + 
  geom_smooth(fullrange=FALSE,color='black') + 
  theme(legend.position="top") + xlim(0,900) +
  geom_hline(yintercept = 70, size=.5,alpha=.5,color='black')
grid.arrange(l1,l2,ncol=2)

The plot below shows the comparison with other hardcore attempts (monk_v2,monk_v3). The interesting thing here is the difference in level progression : we see that the progression :

  • from monk_v2 to monk_v3 is better : it was due to a weapon that does 100% critical hit damage, hence increasing the XP obtained from enemies.
  • from monk to wizard : due to a better optimization of the difficulty settings throughout the story playtime. Increasing the difficulty in D3 increases the XP gained.
monk2<-read.csv('d3/monk-v2.csv',sep=',')
monk3<-read.csv('d3/monk-v3.csv',sep=',')
levDif = c('hard','expert','master','torment_1')
monk2$difficulty_reorder<- factor(monk2$difficulty, levels = levDif)
monk3$difficulty_reorder<- factor(monk3$difficulty, levels = levDif)
d3Data$difficulty_reorder<- factor(d3Data$difficulty, levels = levDif)
ggplot() + 
  geom_line(data=monk2,aes(x=TotTimePlayedMin,y=LVL,color="monk_v2"),size=1) +
  geom_line(data=monk3,aes(x=TotTimePlayedMin,y=LVL,color="monk_v3"),size=1) +
  geom_line(data=d3Data,aes(x=TotTimePlayedMin,y=LVL,color="current_wiz"),size=1) +
  scale_color_manual(name="Character",values=c(current_wiz="#46ACC8",monk_v2="#F21A00",monk_v3="#EBCC2A")) + 
  geom_hline(yintercept = 70, size=.5,alpha=.5,color='black')

The plot below shows the correlation matrices of the different numerical features. It’s useful to know since it gives insights of which features is correlated (as some features are a combination of others) with another. The thinner the ellipse is, the more correlated the 2 features are. A positive correlation (blue in this plot) means the 2 features increases in the same direction.

resNum <- d3Data %>% dplyr::select(difficulty,LVL,strength,dexterity,intelligence,vitality,damage,toughness,healing,gold,TotTimePlayedMin)
corNum <- sapply(resNum,is.numeric)
corData<-cor(resNum[,corNum])
corrPlot<-corrplot(corData,method='ellipse')

5 Borderlands 2

5.1 Intro and Methodology

I will explain here briefly how I tracked the data in BL2 since there was no API available :

  • every time I leveled up (or after a certain amount of time), I took a screenshot with the PS4 feature
  • by default the PS4 saves pictures as name_of_game.date.time.png so the title of the each picture gives me the date / time information
  • I actually 3 pictures each time : one for shield, life data ; another for current level progression as well as the level of the curent mission ; the last one is for the money,eridium
  • then I report this into a csv file I used later for plotting data
df<-read.csv('./bl2/data_updated.csv')

#convert to date format
df$tempo <- as.POSIXct(paste(df$date,df$time), format="%Y-%M-%d %H:%M:%S")
df$Time <- as.POSIXct(df$time, format="%H:%M:%S")
df$hour<-as.numeric(format(as.POSIXct(df$time,format="%H:%M:%S"),"%H"))
df$min<-as.numeric(format(as.POSIXct(df$time,format="%H:%M:%S"),"%M"))

#calculate the time played in each session
diffSec<-function(mycol){
        start = 0
        myvec<-c()
        myvec[1]=start
        for(iter in 2:length(mycol)){
        start = start + abs(as.numeric(difftime(mycol[iter],mycol[iter-1],units="sec")))
        myvec[iter]=start
        }
        return(myvec)
}
df$trueLevel<-(df$current / df$maxCurrent) + df$level
#define sessions and save result in a single dataframe
cnt<-0
mylist<-list()
mylist[[1]]<-df %>% filter(date=="2017-03-29")
mylist[[2]]<-df %>% filter(date=="2017-03-30")
mylist[[3]]<-df %>% filter(date=="2017-03-31")
mylist[[4]]<-df %>% filter(date=="2017-04-01" & hour<=12)
mylist[[5]]<-df %>% filter(date=="2017-04-01" & hour >12)
mylist[[6]]<-df %>% filter(date=="2017-04-02" & hour<=12)
mylist[[7]]<-df %>% filter(date=="2017-04-02" & hour>12)
mylist[[8]]<-df %>% filter(date=="2017-04-07" & hour<=12)
mylist[[9]]<-df %>% filter(date=="2017-04-07" & hour>12)
mylist[[10]]<-df %>% filter(date=="2017-04-08" & hour<=12)
mylist[[11]]<-df %>% filter(date=="2017-04-08" & hour>12)
mylist[[12]]<-df %>% filter(date=="2017-04-09" & hour<=12)
mylist[[13]]<-df %>% filter(date=="2017-04-09" & hour>12)
mylist[[14]]<-df %>% filter(date=="2017-04-11")
mylist[[15]]<-df %>% filter(date=="2017-04-12")
mylist[[16]]<-df %>% filter(date=="2017-04-13")
mylist[[17]]<-df %>% filter(date=="2017-04-14")
mylist[[18]]<-df %>% filter(date=="2017-04-15" & hour<=12)
mylist[[19]]<-df %>% filter(date=="2017-04-15" & hour>12)
mylist[[20]]<-df %>% filter(date=="2017-04-16" & hour<=12)
mylist[[21]]<-df %>% filter(date=="2017-04-16" & hour>12)
mylist[[22]]<-df %>% filter(date=="2017-04-17")
mylist[[23]]<-df %>% filter(date=="2017-04-18")
mylist[[24]]<-df %>% filter(date=="2017-04-19")
mylist[[25]]<-df %>% filter(date=="2017-04-20")
mylist[[26]]<-df %>% filter(date=="2017-04-21")
mylist[[27]]<-df %>% filter(date=="2017-04-22" & hour<=12)
mylist[[28]]<-df %>% filter(date=="2017-04-22" & hour>12)
mylist[[29]]<-df %>% filter(date=="2017-04-23")
mylist[[30]]<-df %>% filter(date=="2017-04-25")
mylist[[31]]<-df %>% filter(date=="2017-04-26")
mylist[[32]]<-df %>% filter(date=="2017-04-27")

currentMax<-0
totMax<-0

makeSession<-function(x,num){
    session<-rep(num,x[1],length(x))
    return(session)
}

#calculate seesion features per dataframe
for(i in 1:length(mylist)){
    mylist[[i]]$Session<-makeSession(mylist[[i]]$Time,i)
    mylist[[i]]$SessionTimePlayedSec <- diffSec(mylist[[i]]$Time)
    mylist[[i]]$TotTimePlayedSec <- mylist[[i]]$SessionTimePlayedSec + totMax
    mylist[[i]]$SessionTimePlayedMin <- mylist[[i]]$SessionTimePlayedSec / 60
    currentMax<-mylist[[i]]$SessionTimePlayedSec[length(mylist[[i]]$time)]
    totMax<-totMax + currentMax
}

#merge all datsers for summary
dataBL2<-do.call(rbind,mylist)
dataBL2$TotTimePlayedMin <- dataBL2$TotTimePlayedSec / 60

5.2 Plots

5.2.1 EXP needed per level

RES<-data.frame('lvl'=unique(dataBL2$level)+1, 'exp'= unique(dataBL2$maxCurrent))
makeDiff<-function(x){
    res<-c()
    res[1]<-x[1]
    for(i in 2:length(x)){
        res[i]<-x[i] - x[i-1]
    }
    return(res)
}
RES$toNextLevel<-makeDiff(RES$exp)

ggplot() + 
  geom_histogram(data=RES,aes(x=lvl,y=toNextLevel),stat='identity',alpha=.5) + 
  geom_point(data=RES,aes(x=lvl,y=toNextLevel)) + 
  stat_smooth(data=RES,aes(x=lvl,y=toNextLevel),method = "lm", formula = y ~ poly(x, 2), size = 1,color='#F21A00')

This plot is very interesting because it gives the needed EXP to level up. The fit is a polynomial degree 2 which matches pretty well the data.

5.2.2 true level vs. time played

Having the current level of the character and the EXP needed to level is invaluable because we can calculate a true level, meaning a floating point data instead of integer. It gives a better progression curve description.

g<-ggplot(data=dataBL2,aes(x=TotTimePlayedMin,y=trueLevel)) + geom_point(aes(color=factor(Session)),size=4)
g <- g + stat_smooth(method = "lm", formula = y ~ sqrt(x), size = .5 , color='black')
g

The fit is trueLevel = sqrt(time_played). More on the validity of the fit in the next section.

  • We see the LVL progression starts quickly
  • then starts to reach a plateau

5.2.3 levelMission vs. TRUE level

One another interesting feature to look at is the difference between the LVL required for a mission and the actual character LVL. What is interesting is the plot below :

  • left panel : correlation between character LVL and mission LVL ; black line is perfect correlation. We see tha I tend to be over-levelup, meaning to be 1 or 2 LVL higher than the one required. This can have an impact on the progression since doing such missions will reward in less EXP points.
  • right panel : character LVL and mission LVL vs. TimePlayed ; during all this playthrough I’m above the LVL required.
f1<-ggplot(data=dataBL2,aes(x=levelMission,y=trueLevel)) + geom_point(aes(color=factor(Session)),size=4) + geom_abline(intercept = 0, slope = 1,color='black') + theme(legend.position='top')
f2<-ggplot() + 
  geom_line(data=dataBL2,aes(x=TotTimePlayedMin,y=trueLevel,color="character")) +
  geom_line(data=dataBL2,aes(x=TotTimePlayedMin,y=levelMission,color="mission_required")) +
  scale_color_manual(name="LVL",values=c(character="#EBCC2A",mission_required="#F21A00")) + 
  ylab('LVL') + theme(legend.position='top')
grid.arrange(f1, f2,ncol=2)

5.2.4 True LEVEL vs. Games modes

g1<-ggplot(data=dataBL2,aes(x=TotTimePlayedMin,y=trueLevel)) + geom_point(aes(color=factor(Session)),size=1)
g2<-ggplot(data=dataBL2,aes(x=TotTimePlayedMin,y=trueLevel)) + geom_point(aes(color=factor(mode)),size=1)
g3<-ggplot(data=dataBL2,aes(x=TotTimePlayedMin,y=trueLevel)) + geom_point(aes(color=factor(type)),size=1)
grid.arrange(g1,g2,g3,ncol=1)

5.3 Modeling / Fits

5.3.1 LVL prediction

With this data, we can try different fit to estimate the trueLevel as a function of time_played. First I create 2 samples :

  • a train sample on which I will look at the model accuracy (residuals)
  • a test sample on which I will test the model
train<-data.frame(dataBL2 %>% filter(Session<24))
test<-data.frame(dataBL2 %>% filter(Session>=24))
model<-lm(formula = trueLevel ~ sqrt(TotTimePlayedMin),data=train)
summary(model)
## 
## Call:
## lm(formula = trueLevel ~ sqrt(TotTimePlayedMin), data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.77363 -0.93707 -0.07234  0.95870  2.41917 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            -1.419168   0.175064  -8.107 4.26e-14 ***
## sqrt(TotTimePlayedMin)  0.847083   0.005681 149.113  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.046 on 210 degrees of freedom
## Multiple R-squared:  0.9906, Adjusted R-squared:  0.9906 
## F-statistic: 2.223e+04 on 1 and 210 DF,  p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(model)

Then we can predict the trueLevel on the test sample :

pred = predict(model, new=test, interval="confidence",level=0.95)
resModel<-data.frame(cbind(pred,true=test$trueLevel))
true_vs_fit<-ggplot(data=resModel,aes(x=true,y=fit))  + geom_point(size=2) + geom_abline(slope=1,color='red') + ggtitle("fitted LVL vs. true values")
true_fit_residuals<-ggplot(data=resModel,aes(x=true - fit))  + geom_histogram(bins=15) + ggtitle("LVL residuals")

grid.arrange(true_vs_fit, true_fit_residuals,ncol=2)

We see that :

  • the fitted values are closed to the true LVL (perfect correlation is symbolized by the red line)
  • the residuals are not centred at 0 (however I would need more data for this plot to conclude here)
test$fit<- resModel$fit

ggplot() + 
  geom_point(data=train,aes(x=TotTimePlayedMin,y=trueLevel,color="train"),size=3) + 
  geom_point(data=test,aes(x=TotTimePlayedMin,y=trueLevel,color="test"),size=3) + 
  geom_line(data=test,aes(x=TotTimePlayedMin,y=fit,color="prediction"),linetype=2,size=1) + 
  scale_color_manual(name="data sample",values=c(train="#EBCC2A",test="#F21A00",prediction="#46ACC8"))

Comments :

  • the prediction from the fit (blue line) looks visually in agreement with the true LVL
  • even if there are some fluctuations from the test data around the predicted values, the line fit is in good agreement with the continuation of the train sample.

5.3.2 Polynomial degree 4 model

Given the curve of trueLevel vs. timePlayed, we could have think to use a polynomial function instead of the sqrt function.

g<-ggplot(data=dataBL2,aes(x=TotTimePlayedMin,y=trueLevel)) + geom_point(aes(color=factor(Session)),size=4)
g <- g + stat_smooth(method = "lm", formula = y ~ poly(x, 4), size = .5 , color='black')
g

modelPol4<-lm(trueLevel ~ (poly(TotTimePlayedMin,4)),data=train)
#modelPol4<-lm(trueLevel ~ (TotTimePlayedMin + I(TotTimePlayedMin^2) + I(TotTimePlayedMin^3) + I(TotTimePlayedMin^4)),data=train)
summary(modelPol4)
## 
## Call:
## lm(formula = trueLevel ~ (poly(TotTimePlayedMin, 4)), data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.54270 -0.36255  0.01055  0.41131  1.28770 
## 
## Coefficients:
##                            Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                 22.3882     0.0418  535.64   <2e-16 ***
## poly(TotTimePlayedMin, 4)1 154.8124     0.6086  254.38   <2e-16 ***
## poly(TotTimePlayedMin, 4)2 -19.2013     0.6086  -31.55   <2e-16 ***
## poly(TotTimePlayedMin, 4)3   8.1195     0.6086   13.34   <2e-16 ***
## poly(TotTimePlayedMin, 4)4  -7.5035     0.6086  -12.33   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6086 on 207 degrees of freedom
## Multiple R-squared:  0.9969, Adjusted R-squared:  0.9968 
## F-statistic: 1.651e+04 on 4 and 207 DF,  p-value: < 2.2e-16

The R-squared score is a bit better, however it completely fails to predict unknown data (test sample)

pred2 = predict(modelPol4, new=test, interval="confidence",level=0.95)
resModel2<-data.frame(cbind(pred2,true=test$trueLevel))
test$fit<- resModel2$fit

ggplot() + 
  geom_point(data=train,aes(x=TotTimePlayedMin,y=trueLevel,color="train"),size=3) + 
  geom_point(data=test,aes(x=TotTimePlayedMin,y=trueLevel,color="test"),size=3) + 
  geom_line(data=test,aes(x=TotTimePlayedMin,y=fit,color="prediction"),linetype=2,size=1) + 
  scale_color_manual(name="data sample",values=c(train="#EBCC2A",test="#F21A00",prediction="#46ACC8"))

anova(model,modelPol4)

6 Conclusions

History :