#for analyses and visualizations
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0      ✔ purrr   1.0.1 
## ✔ tibble  3.1.8      ✔ dplyr   1.0.10
## ✔ tidyr   1.3.0      ✔ stringr 1.5.0 
## ✔ readr   2.1.3      ✔ forcats 0.5.2 
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(ggimage)
library(png)
library(knitr)

#for textual analysis / wordcloud / sentiment
library(wordcloud)
## Loading required package: RColorBrewer
library(tm)
## Loading required package: NLP
## 
## Attaching package: 'NLP'
## 
## The following object is masked from 'package:ggplot2':
## 
##     annotate
library(SnowballC)
library(syuzhet)

Pulling in data

Now that I have my Reddit data secured, I can begin constructing analyses with it.

df_reddit <- read.csv("./data/reddit_top100_posts.csv")

I’m going to make a new dataframe tallying the total Reddit score for each player:

df_rank <- df_reddit %>%
  group_by(player_name) %>%
  summarize(total_score = sum(data.score)) %>%
  arrange(desc(total_score))

And now I can rank them according to that total score.

df_rank <- df_rank %>%
  mutate(reddit_rank = row_number())

This gives me a comparison point for the Ringer rankings, which I’ll pull in now:

df_ringer <- read.csv("./data/nba_ringer_top100_TIDY.csv")

Because I already have the field “player_rank” in my Ringer data, I can now bring that in as a variable in my overall ranking dataframe. I just need to conduct a merge using the player_names, which are identical across dataframes:

df_ringer <- df_ringer %>%
  rename(ringer_rank = player_rank)

df_ringer_merge <- df_ringer %>%
  select(c(player_name, ringer_rank))

df_rank <- merge(df_rank, df_ringer_merge, by = "player_name")

For an initial visualization, let’s compare reddit_rank to ringer_rank, to see if, generally speaking, players considered better by the ringer are also more popular discussion topics in the r/nba community:

df_rank %>%
  ggplot(aes(x = ringer_rank,
             y = reddit_rank)) +
  geom_point() +
  geom_smooth(method = "lm")
## `geom_smooth()` using formula = 'y ~ x'

While this is useful for overall understanding, it’s easy to forget that each dot on this plot represents a player. For continued analysis, it would be useful (and, let’s face it, fun) to see pictures of each player represented on the plot. This is possible with a package called ggimage.

df_pic_files <- read.csv("data/nba_ringer_top100_pic_files.csv")
df_rank_pics <- merge(df_rank, df_pic_files, by = "player_name")
df_rank_pics %>%
  ggplot(aes(x = ringer_rank,
             y = reddit_rank,
             image = picture_file_path)) +
  geom_point() +
  geom_smooth(method = "lm") +
  geom_image(size = 0.05)
## `geom_smooth()` using formula = 'y ~ x'

Awesome! Now let’s see the results of this model:

model <- lm(reddit_rank ~ ringer_rank, data = df_rank)
model_summary <- summary(model)

model_summary
## 
## Call:
## lm(formula = reddit_rank ~ ringer_rank, data = df_rank)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -69.655 -13.229   0.014  13.070  46.852 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 14.93697    4.17175   3.581 0.000536 ***
## ringer_rank  0.70422    0.07172   9.819 2.99e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 20.7 on 98 degrees of freedom
## Multiple R-squared:  0.4959, Adjusted R-squared:  0.4908 
## F-statistic: 96.41 on 1 and 98 DF,  p-value: 2.989e-16

The model appears to fit decently well, meaning that, broadly, the “better” players as ranked by the Ringer tend to be discussed and upvoted more on r/nba. However, there are clearly many departures from this pattern hindering the model. These departures are reflected in the residuals produced by the model, which we can insert back into our table and see our biggest outliers:

df_rank <- df_rank %>%
  cbind(residual = residuals(model))
df_rank %>%
  arrange(desc(residual))
##                 player_name total_score reddit_rank ringer_rank    residual
## 82             O.G. Anunoby       22640          97          50  46.8521092
## 62          Khris Middleton       58920          86          41  42.1900750
## 12           Brandon Ingram       64868          81          34  42.1196040
## 87      Robert Williams III       26775          95          55  41.3310171
## 21           Darius Garland       83682          72          29  36.6406961
## 29            Devin Vassell        7166         100          76  31.5424302
## 35             Franz Wagner       39998          90          64  29.9930513
## 49             Jerami Grant       64174          82          53  29.7394539
## 98       Wendell Carter Jr.       13693          99          78  29.1339934
## 6           Anfernee Simons       52833          87          61  29.1057066
## 10         Bojan Bogdanovic       47571          89          68  26.1761776
## 58           Keldon Johnson       22640          96          79  25.4297750
## 84            Pascal Siakam      127763          54          21  24.2744434
## 46            Jarrett Allen      110359          64          38  22.3027303
## 64       Kristaps Porzingis       69875          77          58  21.2183618
## 73       Michael Porter Jr.       65426          80          63  20.6972697
## 96        Tyrese Haliburton      148879          49          19  20.6828803
## 67              LaMelo Ball      100376          67          45  20.3732013
## 60            Kevin Huerter       59410          85          71  20.0635224
## 34              Evan Mobley      110700          62          39  19.5985119
## 19             Clint Capela       32007          94          85  19.2044644
## 9               Bam Adebayo      142896          51          25  18.4575698
## 15          Cameron Johnson       14023          98          94  16.8664986
## 27             Desmond Bane      118407          57          36  16.7111671
## 81            Norman Powell       36773          92          90  13.6833723
## 31         Domantas Sabonis      170364          44          23  12.8660066
## 53              Josh Giddey       61426          84          81  12.0213381
## 90  Shai Gilgeous-Alexander      199643          33          10  11.0208461
## 54             Jrue Holiday      151068          48          32  10.5280408
## 14          Cade Cunningham       62791          83          82  10.3171197
## 77              Nic Claxton      102925          66          59   9.5141434
## 80           Nikola Vucevic       39575          91          96   8.4580618
## 78            Nicolas Batum       35543          93         100   7.6411881
## 68          Lauri Markkanen      173516          43          30   6.9364776
## 1              Aaron Gordon      117517          58          52   6.4436724
## 56       Karl-Anthony Towns      168276          46          37   5.0069487
## 41             Jakob Poeltl       49376          88          97   4.7538434
## 71          Malcolm Brogdon       85754          71          74   3.9508671
## 99              Zach LaVine      116468          59          57   3.9225803
## 55            Julius Randle      180388          42          33   3.8238224
## 76             Myles Turner      115046          61          60   3.8099250
## 97             Tyrese Maxey       70818          75          80   3.7255566
## 57            Kawhi Leonard      307176          25           9   3.7250645
## 37    Giannis Antetokounmpo      374956          20           2   3.6545935
## 95              Tyler Herro      110021          65          66   3.5846145
## 38              Ivica Zubac       66149          79          86   3.5002460
## 28             Devin Booker      294771          26          11   3.3166277
## 25            DeMar DeRozan      186761          39          31   2.2322592
## 85              Paul George      247095          28          18   0.3870987
## 65               Kyle Kuzma       97256          68          75   0.2466487
## 43             Jamal Murray      170057          45          43  -0.2183618
## 20           Damian Lillard      340562          22          12  -1.3875908
## 3               Alex Caruso       73371          74          87  -2.2039724
## 13              Brook Lopez      142016          52          56  -2.3732013
## 4            Alperen Sengun       70735          76          92  -3.7250645
## 86               RJ Barrett       68603          78          95  -3.8377198
## 45        Jaren Jackson Jr.      193077          35          35  -4.5846145
## 22             De'Aaron Fox      246435          29          27  -4.9508671
## 17           Christian Wood       92839          69          84  -5.0913171
## 18              CJ McCollum      122969          55          65  -5.7111671
## 48             Jayson Tatum      454345          14           7  -5.8664986
## 94               Trae Young      281845          27          26  -6.2466487
## 79             Nikola Jokic      559613           9           1  -6.6411881
## 92            Stephen Curry      487225          11           4  -6.7538434
## 93            Tobias Harris       76997          73          93  -7.4292829
## 100         Zion Williamson      406236          18          15  -7.5002460
## 74            Mikal Bridges      182389          41          49  -8.4436724
## 7             Anthony Davis      438738          16          16 -10.2044644
## 51              Joel Embiid      560216           8           6 -11.1622802
## 8           Anthony Edwards      336207          23          28 -11.6550855
## 70              Luka Doncic      629388           5           3 -12.0496250
## 32         Donovan Mitchell      476719          12          13 -12.0918092
## 42            Jalen Brunson      205254          31          40 -12.1057066
## 11             Bradley Beal      184112          40          54 -12.9647645
## 75              Mike Conley      121911          56          77 -13.1617882
## 47             Jaylen Brown      434457          17          22 -13.4297750
## 40          Jaden McDaniels       92363          70          98 -13.9503750
## 26            Derrick White      110472          63          89 -14.6124092
## 24          Dejounte Murray      192636          36          51 -14.8521092
## 36            Fred VanVleet      161393          47          67 -15.1196040
## 59             Kevin Durant      744193           3           5 -15.4580618
## 89           Scottie Barnes      145442          50          73 -16.3449145
## 39                Ja Morant      602249           6          14 -18.7960276
## 52          Jordan Clarkson      115981          60          91 -19.0208461
## 44             James Harden      546603          10          20 -19.0213381
## 69             LeBron James      948350           1           8 -19.5707171
## 50             Jimmy Butler      563268           7          17 -19.9086829
## 91        Spencer Dinwiddie      133387          53          83 -20.3870987
## 5            Andrew Wiggins      332856          24          46 -23.3310171
## 16               Chris Paul      359007          21          44 -24.9225803
## 72             Marcus Smart      203091          32          62 -26.5985119
## 2                Al Horford      192560          37          72 -28.6406961
## 88              Rudy Gobert      397564          19          48 -29.7394539
## 66             Kyrie Irving      917210           2          24 -29.8382118
## 83           Paolo Banchero      194663          34          70 -30.2322592
## 23            Deandre Ayton      232287          30          69 -33.5280408
## 63            Klay Thompson      460380          13          47 -35.0352355
## 61             Kevon Looney      190094          38          88 -38.9081908
## 33           Draymond Green      679734           4          42 -40.5141434
## 30            Dillon Brooks      452599          15          99 -69.6545935

Sorting this table by residuals is a great, quick way to understand discussion patterns relative to perceived player ability. Players with very low residuals (meaning high-ranking in Reddit discussions despite relatively low Ringer ranking) are perhaps discussed more popularly on Reddit than their skill might suggest. Players with high residuals are thought to be better players, but are perhaps under-discussed in spite of that. While I have some qualitative hunches for why this might be, continued analysis could help shed light on these patterns.

Wordcloud

For any given player, it would be interesting to see common words coming up in Reddit threads involving them. Here, I define a function to do just that.

make_player_wordcloud <-  function(player_name_wc) {

  player_name_wc <- gsub("[[:punct:]]", "", player_name_wc)
  player_name_split <- strsplit(player_name_wc, " ")[[1]]
  player_name_split_lower <- tolower(player_name_split)
  player_name_split_lower
  
  #limit text to particular player
  text_col <- df_reddit %>%
    filter(player_name == player_name_wc) %>%
    select(data.title)
  
  #store text as corpus
  corpus <- Corpus(VectorSource(text_col))
  
  #cleaning and processing the corpus
  corpus <- tm_map(corpus, content_transformer(tolower))
  corpus <- tm_map(corpus, removePunctuation)
  corpus <- tm_map(corpus, removeNumbers)
  corpus <- tm_map(corpus, removeWords, stopwords("english"))
  corpus <- tm_map(corpus, stripWhitespace)
  
  #creating term matrix
  dtm <- TermDocumentMatrix(corpus)
  
  # Convert the matrix to a dataframe
  m <- as.matrix(dtm)
  v <- sort(rowSums(m),decreasing=TRUE)
  
  # Create a data frame with words and their frequencies
  d <- data.frame(word = names(v),freq=v)
  
  #remove individual player's names from word list to not crowd out cloud
  d <- d[!(d$word %in% player_name_split_lower),]
  
  # Generate the wordcloud
  wordcloud(words = d$word, freq = d$freq, min.freq = 2,
            max.words=400, random.order=FALSE, rot.per=0.35, 
            colors=brewer.pal(8, "Dark2"))

}
#suppressing warnings because some words can't fit on the page
suppressWarnings(make_player_wordcloud("Draymond Green"))

df_reddit %>%
  filter(player_name == "Draymond Green") %>%
  head() %>%
  select(c("data.title", "data.score"))
##                                                                                                                                                                                                                                                                                            data.title
## 1                                                                                                                                                                                                                                             Video of Draymond Green violently punching Jordan Poole
## 2                                                                                                                                                                                                                                                [Highlight] Draymond Green stomps on Sabonis's chest
## 3                                                                                                                                                                                                                                            [Charania] Draymond Green has been suspended for Game 3.
## 4 Draymond Green on Lebron: "I recently said he’s the greatest face the NBA has ever had. Think about the day and age we live in – cameras, social media – he’s never had a scandal. Never been arrested. Never photo’d drunk. Those things go hand in hand with durability, longevity, consistency."
## 5                    I love the energy Boston fans bring to the game," #NBA Commissioner Adam Silver says in response to the Garden crowd screaming F-- You Draymond [Green]. “I want fans to enjoy themselves, of course from the league office you want to see it done with respect, but I get it."
## 6                                                                                                                                                                                                                            How Draymond Green Was after hitting Jordan Poole in practice [RDC Skit]
##   data.score
## 1      51782
## 2      36140
## 3      32325
## 4      22424
## 5      21020
## 6      15360

Sentiment Analysis

It seems that one of the things that might help shed light on why certain player’s are “over-discussed” is whether they are associated with negative news stories. Therefore, some light sentiment analysis might be useful in understanding potential correlation here.

df_reddit$post_title_sentiment <- get_sentiment(df_reddit$data.title)

Here, I’ll make a new dataframe to track the average post sentiment for each player.

df_reddit_sentiment <- df_reddit %>%
  group_by(player_name) %>%
  summarize(avg_sentiment = mean(post_title_sentiment))

Now, I’ll add this average sentiment score to df_rank, to facilitate more correlation work.

df_rank <- merge(df_rank, df_reddit_sentiment, by = 'player_name')

Finally, I can correlate residuals with average sentiment to see if text sentiment can help predict whether a player may be over- or under-discussed.

df_rank %>%
  ggplot(aes(x = avg_sentiment, 
             y = residual)) +
  geom_point() +
  geom_smooth(method = "lm")
## `geom_smooth()` using formula = 'y ~ x'

Sentiment–or at least, this particular measure of it, is clearly not a strong predictor of residuals in this case. More qualitative analysis may have to suffice for now.