#for analyses and visualizations
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0 ✔ purrr 1.0.1
## ✔ tibble 3.1.8 ✔ dplyr 1.0.10
## ✔ tidyr 1.3.0 ✔ stringr 1.5.0
## ✔ readr 2.1.3 ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(ggimage)
library(png)
library(knitr)
#for textual analysis / wordcloud / sentiment
library(wordcloud)
## Loading required package: RColorBrewer
library(tm)
## Loading required package: NLP
##
## Attaching package: 'NLP'
##
## The following object is masked from 'package:ggplot2':
##
## annotate
library(SnowballC)
library(syuzhet)
Now that I have my Reddit data secured, I can begin constructing analyses with it.
df_reddit <- read.csv("./data/reddit_top100_posts.csv")
I’m going to make a new dataframe tallying the total Reddit score for each player:
df_rank <- df_reddit %>%
group_by(player_name) %>%
summarize(total_score = sum(data.score)) %>%
arrange(desc(total_score))
And now I can rank them according to that total score.
df_rank <- df_rank %>%
mutate(reddit_rank = row_number())
This gives me a comparison point for the Ringer rankings, which I’ll pull in now:
df_ringer <- read.csv("./data/nba_ringer_top100_TIDY.csv")
Because I already have the field “player_rank” in my Ringer data, I can now bring that in as a variable in my overall ranking dataframe. I just need to conduct a merge using the player_names, which are identical across dataframes:
df_ringer <- df_ringer %>%
rename(ringer_rank = player_rank)
df_ringer_merge <- df_ringer %>%
select(c(player_name, ringer_rank))
df_rank <- merge(df_rank, df_ringer_merge, by = "player_name")
For an initial visualization, let’s compare reddit_rank to ringer_rank, to see if, generally speaking, players considered better by the ringer are also more popular discussion topics in the r/nba community:
df_rank %>%
ggplot(aes(x = ringer_rank,
y = reddit_rank)) +
geom_point() +
geom_smooth(method = "lm")
## `geom_smooth()` using formula = 'y ~ x'
While this is useful for overall understanding, it’s easy to forget that each dot on this plot represents a player. For continued analysis, it would be useful (and, let’s face it, fun) to see pictures of each player represented on the plot. This is possible with a package called ggimage.
df_pic_files <- read.csv("data/nba_ringer_top100_pic_files.csv")
df_rank_pics <- merge(df_rank, df_pic_files, by = "player_name")
df_rank_pics %>%
ggplot(aes(x = ringer_rank,
y = reddit_rank,
image = picture_file_path)) +
geom_point() +
geom_smooth(method = "lm") +
geom_image(size = 0.05)
## `geom_smooth()` using formula = 'y ~ x'
Awesome! Now let’s see the results of this model:
model <- lm(reddit_rank ~ ringer_rank, data = df_rank)
model_summary <- summary(model)
model_summary
##
## Call:
## lm(formula = reddit_rank ~ ringer_rank, data = df_rank)
##
## Residuals:
## Min 1Q Median 3Q Max
## -69.655 -13.229 0.014 13.070 46.852
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 14.93697 4.17175 3.581 0.000536 ***
## ringer_rank 0.70422 0.07172 9.819 2.99e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 20.7 on 98 degrees of freedom
## Multiple R-squared: 0.4959, Adjusted R-squared: 0.4908
## F-statistic: 96.41 on 1 and 98 DF, p-value: 2.989e-16
The model appears to fit decently well, meaning that, broadly, the “better” players as ranked by the Ringer tend to be discussed and upvoted more on r/nba. However, there are clearly many departures from this pattern hindering the model. These departures are reflected in the residuals produced by the model, which we can insert back into our table and see our biggest outliers:
df_rank <- df_rank %>%
cbind(residual = residuals(model))
df_rank %>%
arrange(desc(residual))
## player_name total_score reddit_rank ringer_rank residual
## 82 O.G. Anunoby 22640 97 50 46.8521092
## 62 Khris Middleton 58920 86 41 42.1900750
## 12 Brandon Ingram 64868 81 34 42.1196040
## 87 Robert Williams III 26775 95 55 41.3310171
## 21 Darius Garland 83682 72 29 36.6406961
## 29 Devin Vassell 7166 100 76 31.5424302
## 35 Franz Wagner 39998 90 64 29.9930513
## 49 Jerami Grant 64174 82 53 29.7394539
## 98 Wendell Carter Jr. 13693 99 78 29.1339934
## 6 Anfernee Simons 52833 87 61 29.1057066
## 10 Bojan Bogdanovic 47571 89 68 26.1761776
## 58 Keldon Johnson 22640 96 79 25.4297750
## 84 Pascal Siakam 127763 54 21 24.2744434
## 46 Jarrett Allen 110359 64 38 22.3027303
## 64 Kristaps Porzingis 69875 77 58 21.2183618
## 73 Michael Porter Jr. 65426 80 63 20.6972697
## 96 Tyrese Haliburton 148879 49 19 20.6828803
## 67 LaMelo Ball 100376 67 45 20.3732013
## 60 Kevin Huerter 59410 85 71 20.0635224
## 34 Evan Mobley 110700 62 39 19.5985119
## 19 Clint Capela 32007 94 85 19.2044644
## 9 Bam Adebayo 142896 51 25 18.4575698
## 15 Cameron Johnson 14023 98 94 16.8664986
## 27 Desmond Bane 118407 57 36 16.7111671
## 81 Norman Powell 36773 92 90 13.6833723
## 31 Domantas Sabonis 170364 44 23 12.8660066
## 53 Josh Giddey 61426 84 81 12.0213381
## 90 Shai Gilgeous-Alexander 199643 33 10 11.0208461
## 54 Jrue Holiday 151068 48 32 10.5280408
## 14 Cade Cunningham 62791 83 82 10.3171197
## 77 Nic Claxton 102925 66 59 9.5141434
## 80 Nikola Vucevic 39575 91 96 8.4580618
## 78 Nicolas Batum 35543 93 100 7.6411881
## 68 Lauri Markkanen 173516 43 30 6.9364776
## 1 Aaron Gordon 117517 58 52 6.4436724
## 56 Karl-Anthony Towns 168276 46 37 5.0069487
## 41 Jakob Poeltl 49376 88 97 4.7538434
## 71 Malcolm Brogdon 85754 71 74 3.9508671
## 99 Zach LaVine 116468 59 57 3.9225803
## 55 Julius Randle 180388 42 33 3.8238224
## 76 Myles Turner 115046 61 60 3.8099250
## 97 Tyrese Maxey 70818 75 80 3.7255566
## 57 Kawhi Leonard 307176 25 9 3.7250645
## 37 Giannis Antetokounmpo 374956 20 2 3.6545935
## 95 Tyler Herro 110021 65 66 3.5846145
## 38 Ivica Zubac 66149 79 86 3.5002460
## 28 Devin Booker 294771 26 11 3.3166277
## 25 DeMar DeRozan 186761 39 31 2.2322592
## 85 Paul George 247095 28 18 0.3870987
## 65 Kyle Kuzma 97256 68 75 0.2466487
## 43 Jamal Murray 170057 45 43 -0.2183618
## 20 Damian Lillard 340562 22 12 -1.3875908
## 3 Alex Caruso 73371 74 87 -2.2039724
## 13 Brook Lopez 142016 52 56 -2.3732013
## 4 Alperen Sengun 70735 76 92 -3.7250645
## 86 RJ Barrett 68603 78 95 -3.8377198
## 45 Jaren Jackson Jr. 193077 35 35 -4.5846145
## 22 De'Aaron Fox 246435 29 27 -4.9508671
## 17 Christian Wood 92839 69 84 -5.0913171
## 18 CJ McCollum 122969 55 65 -5.7111671
## 48 Jayson Tatum 454345 14 7 -5.8664986
## 94 Trae Young 281845 27 26 -6.2466487
## 79 Nikola Jokic 559613 9 1 -6.6411881
## 92 Stephen Curry 487225 11 4 -6.7538434
## 93 Tobias Harris 76997 73 93 -7.4292829
## 100 Zion Williamson 406236 18 15 -7.5002460
## 74 Mikal Bridges 182389 41 49 -8.4436724
## 7 Anthony Davis 438738 16 16 -10.2044644
## 51 Joel Embiid 560216 8 6 -11.1622802
## 8 Anthony Edwards 336207 23 28 -11.6550855
## 70 Luka Doncic 629388 5 3 -12.0496250
## 32 Donovan Mitchell 476719 12 13 -12.0918092
## 42 Jalen Brunson 205254 31 40 -12.1057066
## 11 Bradley Beal 184112 40 54 -12.9647645
## 75 Mike Conley 121911 56 77 -13.1617882
## 47 Jaylen Brown 434457 17 22 -13.4297750
## 40 Jaden McDaniels 92363 70 98 -13.9503750
## 26 Derrick White 110472 63 89 -14.6124092
## 24 Dejounte Murray 192636 36 51 -14.8521092
## 36 Fred VanVleet 161393 47 67 -15.1196040
## 59 Kevin Durant 744193 3 5 -15.4580618
## 89 Scottie Barnes 145442 50 73 -16.3449145
## 39 Ja Morant 602249 6 14 -18.7960276
## 52 Jordan Clarkson 115981 60 91 -19.0208461
## 44 James Harden 546603 10 20 -19.0213381
## 69 LeBron James 948350 1 8 -19.5707171
## 50 Jimmy Butler 563268 7 17 -19.9086829
## 91 Spencer Dinwiddie 133387 53 83 -20.3870987
## 5 Andrew Wiggins 332856 24 46 -23.3310171
## 16 Chris Paul 359007 21 44 -24.9225803
## 72 Marcus Smart 203091 32 62 -26.5985119
## 2 Al Horford 192560 37 72 -28.6406961
## 88 Rudy Gobert 397564 19 48 -29.7394539
## 66 Kyrie Irving 917210 2 24 -29.8382118
## 83 Paolo Banchero 194663 34 70 -30.2322592
## 23 Deandre Ayton 232287 30 69 -33.5280408
## 63 Klay Thompson 460380 13 47 -35.0352355
## 61 Kevon Looney 190094 38 88 -38.9081908
## 33 Draymond Green 679734 4 42 -40.5141434
## 30 Dillon Brooks 452599 15 99 -69.6545935
Sorting this table by residuals is a great, quick way to understand discussion patterns relative to perceived player ability. Players with very low residuals (meaning high-ranking in Reddit discussions despite relatively low Ringer ranking) are perhaps discussed more popularly on Reddit than their skill might suggest. Players with high residuals are thought to be better players, but are perhaps under-discussed in spite of that. While I have some qualitative hunches for why this might be, continued analysis could help shed light on these patterns.
For any given player, it would be interesting to see common words coming up in Reddit threads involving them. Here, I define a function to do just that.
make_player_wordcloud <- function(player_name_wc) {
player_name_wc <- gsub("[[:punct:]]", "", player_name_wc)
player_name_split <- strsplit(player_name_wc, " ")[[1]]
player_name_split_lower <- tolower(player_name_split)
player_name_split_lower
#limit text to particular player
text_col <- df_reddit %>%
filter(player_name == player_name_wc) %>%
select(data.title)
#store text as corpus
corpus <- Corpus(VectorSource(text_col))
#cleaning and processing the corpus
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, stripWhitespace)
#creating term matrix
dtm <- TermDocumentMatrix(corpus)
# Convert the matrix to a dataframe
m <- as.matrix(dtm)
v <- sort(rowSums(m),decreasing=TRUE)
# Create a data frame with words and their frequencies
d <- data.frame(word = names(v),freq=v)
#remove individual player's names from word list to not crowd out cloud
d <- d[!(d$word %in% player_name_split_lower),]
# Generate the wordcloud
wordcloud(words = d$word, freq = d$freq, min.freq = 2,
max.words=400, random.order=FALSE, rot.per=0.35,
colors=brewer.pal(8, "Dark2"))
}
#suppressing warnings because some words can't fit on the page
suppressWarnings(make_player_wordcloud("Draymond Green"))
df_reddit %>%
filter(player_name == "Draymond Green") %>%
head() %>%
select(c("data.title", "data.score"))
## data.title
## 1 Video of Draymond Green violently punching Jordan Poole
## 2 [Highlight] Draymond Green stomps on Sabonis's chest
## 3 [Charania] Draymond Green has been suspended for Game 3.
## 4 Draymond Green on Lebron: "I recently said he’s the greatest face the NBA has ever had. Think about the day and age we live in – cameras, social media – he’s never had a scandal. Never been arrested. Never photo’d drunk. Those things go hand in hand with durability, longevity, consistency."
## 5 I love the energy Boston fans bring to the game," #NBA Commissioner Adam Silver says in response to the Garden crowd screaming F-- You Draymond [Green]. “I want fans to enjoy themselves, of course from the league office you want to see it done with respect, but I get it."
## 6 How Draymond Green Was after hitting Jordan Poole in practice [RDC Skit]
## data.score
## 1 51782
## 2 36140
## 3 32325
## 4 22424
## 5 21020
## 6 15360
It seems that one of the things that might help shed light on why certain player’s are “over-discussed” is whether they are associated with negative news stories. Therefore, some light sentiment analysis might be useful in understanding potential correlation here.
df_reddit$post_title_sentiment <- get_sentiment(df_reddit$data.title)
Here, I’ll make a new dataframe to track the average post sentiment for each player.
df_reddit_sentiment <- df_reddit %>%
group_by(player_name) %>%
summarize(avg_sentiment = mean(post_title_sentiment))
Now, I’ll add this average sentiment score to df_rank, to facilitate more correlation work.
df_rank <- merge(df_rank, df_reddit_sentiment, by = 'player_name')
Finally, I can correlate residuals with average sentiment to see if text sentiment can help predict whether a player may be over- or under-discussed.
df_rank %>%
ggplot(aes(x = avg_sentiment,
y = residual)) +
geom_point() +
geom_smooth(method = "lm")
## `geom_smooth()` using formula = 'y ~ x'
Sentiment–or at least, this particular measure of it, is clearly not a strong predictor of residuals in this case. More qualitative analysis may have to suffice for now.