Summary

Like most MMO players, I wanted to optimize my DPS, which led me to look into which stats to prioritize as a PLD. What I found most often were spreadsheets about the gain brought by each stat, which is not exactly the same as stat weights: those take into account DPS rotations and their interaction with the stats. The takeaway was that, in theory, crit is the stat to focus on as a PLD for that small DPS boost that is sometimes needed in tough Savage raids.

But I was still interested in stat weights. Since they are normally calculated by running many simulations that reveal which stats contribute most to simulated DPS, my attempt instead focused on collecting logs from boss fights to see if I could estimate stat weights with a linear regression.

The data come from PLD DPS logs on Ramuh, the fight with the highest DPS values (you can keep a high uptime on that boss once the mechanics are mastered).

Unfortunately, player skill seems to contribute a lot, and the error is so high that my model only explains roughly a third of the variation in DPS at best, the rest being statistical noise.

Ideally, I would use a large number of DPS logs against training dummies to have a better model fit. Do not hesitate to send me those if you have some.

Data Cleaning & Model Building

I collected data from combat logs in FFlogs : https://www.fflogs.com/zone/rankings/33#metric=dps&class=Global&boss=-1&spec=Paladin&dpstype=pdps

I’ll put a quick description of how I collected the data at the end, for those who are interested.

I ended up with a clean data set like this :

clean_df <- read.csv("./travaux persos/clean_df.csv")
head(clean_df)

##                  name     hp  STR tenacity crit   DH deter skill_speed     DPS
## 1        Kyoko Terai  161047 4734      528 3863 1760  2282        1017 12142.2
## 2         Meekus Cat  161079 4737      528 3863 1820  2342         897 11758.8
## 3        Sleepy Chan  161047 4734      528 3863 1700  2342        1017 11841.1
## 4      Hina Gamiluna  161016 4734      528 3863 1760  2282        1017 12017.2
## 5   Seronin Ashekage  161079 4737      528 3863 1700  2282        1077 11795.1
## 6 Crystalyn Romanoff  161079 4737      528 3863 1760  2282        1017 12110.3

In order to build the model, I do a tiny bit of pre-processing :

I remove the HP and Name variables
I remove outliers that are outside of a 2-standard deviation interval (I saw some logs at 19k DPS for a PLD which I assume are measurement errors)

DPSsd <- sd(clean_df$DPS)
DPSmean <- mean(clean_df$DPS)

library(dplyr)
model_df <- clean_df %>% select(-c(name,hp)) %>% filter(DPS <= DPSmean + 2*DPSsd) %>%
   filter(DPS >= DPSmean - 2*DPSsd)
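To see what the 2-standard-deviation filter does, here is a small sketch on made-up numbers. The `toy_dps` values are hypothetical and not taken from the logs. As in the code above, the suspicious value is still included when the mean and standard deviation are computed.

```r
# Hypothetical DPS values: seven plausible logs and one suspicious 19k entry
toy_dps <- c(10500, 10800, 11000, 11200, 11500, 10900, 11100, 19000)

toy_mean <- mean(toy_dps)
toy_sd <- sd(toy_dps)

# Keep only the values within mean +/- 2 standard deviations
in_range <- toy_dps >= toy_mean - 2 * toy_sd & toy_dps <= toy_mean + 2 * toy_sd
kept <- toy_dps[in_range]

print(kept)
```

In this toy case only the 19k entry falls outside the interval, so the seven other values are kept.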

We have the following DPS distribution :

library(ggplot2)
ggplot(data = model_df, aes(x = DPS)) + geom_histogram(fill = "lightblue")

I split my data in two: 80% for model training and 20% for testing.

I set an RNG seed so that you can try it at home and obtain the same results.

library(caret)
set.seed(69420)
inTrain <- createDataPartition(y = model_df$DPS, p = 0.8, list = FALSE)

training <- model_df[inTrain, ]
testing <- model_df[-inTrain, ]

I build my linear model on the training data frame :

modlm <- train(DPS ~ ., data = training, method = "lm")

One way of checking the model’s accuracy is to :

Use the data from the testing set to predict the DPS using our linear model
Compare the predicted value to the actual measured DPS
Calculate the root mean squared error (RMSE) :

\[ RMSE = \sqrt{\frac{1}{n} ~ \sum_{i=1}^{n} (predicted~ value_i - true~ value_i)^2} \]

My RMSE is :

RMSE(predict(modlm, testing), testing$DPS)

## [1] 765.1144

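Compared to the summed DPS of the whole test set, it is about 0.35%. This divides by the sum over all test logs; relative to the mean DPS of a single log, the error is closer to 7% :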

RMSE(predict(modlm, testing), testing$DPS)/sum(testing$DPS)

## [1] 0.003519549

Here’s a quick plot of the actual values vs predicted values. If the model was perfect, all the dots would be on the red line :

qplot(predict(modlm, testing), testing$DPS) + geom_abline(intercept = 0, slope = 1, colour = "red")

And here’s the data frame of predicted values vs measured values :

confirm <- data.frame(predicted = predict(modlm, testing), real = testing$DPS)
head(confirm, 20)

##    predicted    real
## 2  11238.650 11758.8
## 10 11229.787 11723.3
## 15 10850.366 11624.7
## 16  9946.778 11598.7
## 19 10104.310 11311.3
## 21 11228.172 11389.9
## 41 11172.946 11108.1
## 42 11138.794 11363.6
## 44 10955.385 10655.6
## 46 11177.791 11415.1
## 48 10812.692 10745.7
## 49 10897.517 11014.3
## 66 10504.166 10469.2
## 67 10574.578 10516.7
## 68 10192.969 10238.5
## 79 10564.740  9431.9
## 80 10922.667  9401.0
## 92 11176.176  9745.3
## 95 11224.942 11898.3
## 97 10218.967  9979.9

So we have a model that lands in the right range yet is far from perfect: several predictions are off by more than 1,000 DPS.
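As a sanity check, the RMSE can be recomputed by hand from the table above. This is only a sketch: it assumes the 20 rows printed are the whole test set, the values are retyped from the output, and the vector names `predicted` and `real` simply mirror the columns of `confirm`.

```r
# Values retyped from the head(confirm, 20) output
predicted <- c(11238.650, 11229.787, 10850.366, 9946.778, 10104.310,
               11228.172, 11172.946, 11138.794, 10955.385, 11177.791,
               10812.692, 10897.517, 10504.166, 10574.578, 10192.969,
               10564.740, 10922.667, 11176.176, 11224.942, 10218.967)
real <- c(11758.8, 11723.3, 11624.7, 11598.7, 11311.3,
          11389.9, 11108.1, 11363.6, 10655.6, 11415.1,
          10745.7, 11014.3, 10469.2, 10516.7, 10238.5,
          9431.9, 9401.0, 9745.3, 11898.3, 9979.9)

# Root mean squared error: square the errors, average them, take the square root
rmse <- sqrt(mean((predicted - real)^2))

# Error relative to the summed DPS and to the mean DPS of the test rows
rel_to_sum <- rmse / sum(real)
rel_to_mean <- rmse / mean(real)

print(c(rmse = rmse, rel_to_sum = rel_to_sum, rel_to_mean = rel_to_mean))
```

The hand-computed RMSE should agree with the value returned by RMSE(). Dividing by the mean rather than the sum expresses the error on the scale of a single log.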

Coefficients

Now let’s check what coefficient our model puts for each stat :

summary(modlm)

## 
## Call:
## lm(formula = .outcome ~ ., data = dat)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1773.19  -279.31   -46.64   398.31  1137.79 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 12687.7786  2248.5696   5.643    3e-07 ***
## STR            -1.6149     0.7126  -2.266  0.02641 *  
## tenacity       -0.5265     0.6070  -0.867  0.38861    
## crit            1.4090     0.5111   2.757  0.00737 ** 
## DH              0.3380     0.4645   0.728  0.46924    
## deter           0.1331     0.5438   0.245  0.80730    
## skill_speed     0.1213     0.5880   0.206  0.83713    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 597.4 on 73 degrees of freedom
## Multiple R-squared:  0.3841, Adjusted R-squared:  0.3335 
## F-statistic: 7.588 on 6 and 73 DF,  p-value: 2.406e-06

Crit seems to be the top stat for PLD. However, the model gives a negative coefficient to STR.

That might be explained by the fact that players that have high values of strength performed poorly in comparison to players with lower values of STR :

qplot(STR, DPS, data = model_df)

Overall, the contribution of the other stats is impossible to tell because their estimated coefficients are not statistically significant (p-values far above 5%).
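To make the significance statement concrete, 95% confidence intervals can be rebuilt from the estimates and standard errors printed in the summary. This is a sketch with the values retyped from the output above; the data frame `coefs` and its column names are just for illustration.

```r
# Estimates and standard errors retyped from summary(modlm)
coefs <- data.frame(
  stat     = c("STR", "tenacity", "crit", "DH", "deter", "skill_speed"),
  estimate = c(-1.6149, -0.5265, 1.4090, 0.3380, 0.1331, 0.1213),
  se       = c(0.7126, 0.6070, 0.5111, 0.4645, 0.5438, 0.5880),
  stringsAsFactors = FALSE
)

# Residual degrees of freedom reported by the summary
df_resid <- 73

# t statistic = estimate / standard error
coefs$t_value <- coefs$estimate / coefs$se

# Two-sided 95% interval: estimate +/- t quantile * standard error
t_crit <- qt(0.975, df_resid)
coefs$lower <- coefs$estimate - t_crit * coefs$se
coefs$upper <- coefs$estimate + t_crit * coefs$se

# A coefficient is significant at the 5% level when its interval excludes 0
coefs$significant <- coefs$lower > 0 | coefs$upper < 0

print(coefs)
```

Only the intervals for STR and crit exclude zero; for the other stats the data are compatible with either a positive or a negative weight.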

As said in the summary, the data I got was from Ramuh instead of dummies and therefore heavily relies on player skill.

Looking at the R-squared value, my model only explains about 38% of the variance in DPS, the rest being noise, and even that is inflated since I use 6 regressors. Going by the adjusted R-squared, my model explains, at best, about a third of the variance.
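The correction for the number of regressors can be written out from the summary output. With n = 80 training rows (73 residual degrees of freedom plus the 7 estimated coefficients) and p = 6 regressors :

\[ R^2_{adj} = 1 - (1 - R^2) ~ \frac{n - 1}{n - p - 1} = 1 - (1 - 0.3841) \times \frac{79}{73} \approx 0.3335 \]

This matches the adjusted R-squared reported by summary().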

So far, without proper data, the only thing that might be confirmed from this is that crit is indeed the top stat for PLD.

Data collection code (for those interested)

I collect the data using the RSelenium package for R.

I make Selenium go to the main page :

link <- "https://www.fflogs.com/zone/rankings/33#metric=dps&class=Global&boss=-1&spec=Paladin&dpstype=pdps"

library(RSelenium)
eCaps <- list(chromeOptions = list(
   args = c('--headless', '--disable-gpu', '--window-size=1280,800', '--no-sandbox')
))
remDr <- remoteDriver(remoteServerAddr = "192.168.99.100", port = 4445L, browserName = "chrome",
                      extraCapabilities = eCaps)
remDr$open()

remDr$navigate(link)

From that, I get all the links in the page. Since all the links to player logs have the same pattern, I filter them using grepl (regular expressions) for the patterns “zone=33” (Eden) and “/character/id/” :

temp2 <- remDr$findElements(using = 'css selector', value = 'a')

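# Collect the href of every link on the page.
# seq_along() avoids hard-coding the number of links (434 when this was run).
hreflinks <- list()
for (i in seq_along(temp2)){
   
      hreflinks <- c(hreflinks, temp2[[i]]$getElementAttribute("href"))
   
}

# getElementAttribute() returns a list, so flatten everything into a character vector
hreflinks2 <- unlist(hreflinks)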

sshref <- grepl(pattern = "zone=33", hreflinks2)
hreflinks2 <- hreflinks2[sshref] 
sshref2 <- grepl(pattern = "/character/id/", hreflinks2)
hreflinks2 <- hreflinks2[sshref2]

All the links are now in the “hreflinks2” vector. You can either loop this process over every page, or repeat it manually.
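Here is the two-pattern filter on a few made-up links, to show which ones survive. The URLs below are hypothetical examples that only imitate the pattern of the real ones.

```r
# Hypothetical links imitating what the rankings page contains
links <- c(
  "https://www.fflogs.com/character/id/123?zone=33",  # player log, Eden
  "https://www.fflogs.com/zone/rankings/33",          # navigation link
  "https://www.fflogs.com/character/id/456?zone=29"   # player log, other zone
)

# fixed = TRUE treats the patterns as plain strings rather than regular expressions
is_eden <- grepl("zone=33", links, fixed = TRUE)
is_player <- grepl("/character/id/", links, fixed = TRUE)

# Keep only the links that match both patterns
player_logs <- links[is_eden & is_player]

print(player_logs)
```

Only the first link matches both patterns, so it is the only one kept.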

Now once we have those links, we extract the name, DPS, and stats with a while() loop.

I used a while() loop instead of for() because then I know, from the current “m” value, where to start again if the extraction fails or crashes (and it crashed a lot).

There are so many incomplete logs that make the collection crash that it was actually easier to build the “hreflinks2” vector manually by copy-pasting valid links.

library(stringr)
name <- character()
hp <- numeric()
dps <- numeric()
str <- numeric()
tenacity <- numeric()
crit <- numeric()
DH <- numeric()
determination <- numeric()
skill_speed <- numeric()

m <- 1


while (m <= length(hreflinks2)){
   
   print(paste0("getting data n°", m))
   remDr$navigate(hreflinks2[m])
   
   hpxpath <- remDr$findElement(using = 'xpath', value = '//*[@id="stats-box-contents"]/div[1]/div[2]')
   hp <- c(hp, as.numeric(hpxpath$getElementText()[[1]]))

   # grab the name
   nom <- strsplit(remDr$getTitle()[[1]], "-")
   nom <- nom[[1]]
   nom <- nom[1]
   name <- c(name, nom)
   
   # grab the DPS
   deeps <- remDr$findElement(using = 'xpath', value = '//*[@id="boss-table-33"]/tbody/tr[1]/td[3]')
   dps <- c(dps, as.numeric(str_remove(deeps$getElementText()[[1]], ",")))
   
   
   
   Strxpath <- remDr$findElement(using = 'xpath', value = '//*[@id="stats-box-contents"]/div[2]/div[2]')
   str <- c(str, as.numeric(Strxpath$getElementText()[[1]]))
   
   Tenaxpath <- remDr$findElement(using = 'xpath', value = '//*[@id="stats-box-contents"]/div[3]/div[2]')
   tenacity <- c(tenacity, as.numeric(Tenaxpath$getElementText()[[1]]))
   
   Critxpath <- remDr$findElement(using = 'xpath', value = '//*[@id="stats-box-contents"]/div[5]/div[2]')
   crit <- c(crit, as.numeric(Critxpath$getElementText()[[1]]))
   
   DHxpath <- remDr$findElement(using = 'xpath', value = '//*[@id="stats-box-contents"]/div[6]/div[2]')
   DH <- c(DH, as.numeric(DHxpath$getElementText()[[1]]))
   
   Detxpath <- remDr$findElement(using = 'xpath', value = '//*[@id="stats-box-contents"]/div[7]/div[2]')
   determination <- c(determination, as.numeric(Detxpath$getElementText()[[1]]))
   
   Skspxpath <- remDr$findElement(using = 'xpath', value = '//*[@id="stats-box-contents"]/div[8]/div[2]')
   skill_speed <- c(skill_speed, as.numeric(Skspxpath$getElementText()[[1]]))
   
   m <- m+1
}


full_data_frame <- data.frame(name = name, hp = hp, STR = str, tenacity = tenacity,
                              crit = crit, DH = DH, deter = determination, 
                              skill_speed = skill_speed, DPS = dps)

This gets you the raw data frame. Sometimes characters that aren’t PLD are in PLD logs so I filtered these out by deleting any observation in which the HP < 125000.

In addition, rows 31 and 32 had some missing values so I did some manual cleaning :

clean_df <- full_data_frame %>% filter(hp > 125000)
clean_df[31, 8] <- 1017
clean_df[32, 4:8] <- c(round(mean(clean_df$tenacity), 0), 3863, 1520, 2282, 1257)

And this is the clean data I started the work from.
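Since incomplete logs made the extraction crash, one option would be to wrap each page in tryCatch() so that a failure is recorded and the loop moves on. This is a minimal sketch and not the code used here: `scrape_one()` is a hypothetical stand-in for the Selenium calls, and the URLs are placeholders.

```r
# Hypothetical stand-in for the per-page extraction; fails on "incomplete" logs
scrape_one <- function(url) {
  if (grepl("broken", url, fixed = TRUE)) stop("incomplete log")
  data.frame(url = url, dps = 11000, stringsAsFactors = FALSE)
}

# Placeholder URLs, one of which will fail
urls <- c("log-a", "broken-log", "log-c")

results <- list()
failed <- character()

m <- 1
while (m <= length(urls)) {
  # Return NULL instead of stopping the whole loop when a page fails
  row <- tryCatch(scrape_one(urls[m]), error = function(e) NULL)
  if (is.null(row)) {
    failed <- c(failed, urls[m])
  } else {
    results[[length(results) + 1]] <- row
  }
  m <- m + 1
}

# Bind the successful rows together; the failed URLs can be inspected afterwards
scraped <- do.call(rbind, results)
print(scraped)
print(failed)
```

Collecting whole rows rather than one vector per stat also keeps the columns aligned when a page fails halfway through.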

An enthusiast’s attempt to figuring out Statsweights

Neilyo Brahmen - ODIN EU

17/08/2020
