Background
Playing console games while working from home is one of the best ways to entertain ourselves in these trying times. However, there are too many games to choose from, and we have to stay within budget when deciding which ones to buy. Let’s find out which popular past games are worth buying in the upcoming PS Store sale!
About the Data
This dataset is from Kaggle and contains real data scraped from vgchartz.com as of April 12th, 2019. Attention: this data has many N/A values, so any game recommendation from this analysis comes from the available data and may not represent actual game popularity.
Data Read
## Rank Name Genre ESRB_Rating Platform
## 1 1 Wii Sports Sports E Wii
## 2 2 Super Mario Bros. Platform NES
## 3 3 Mario Kart Wii Racing E Wii
## 4 4 PlayerUnknown's Battlegrounds Shooter PC
## 5 5 Wii Sports Resort Sports E Wii
## 6 6 Pokemon Red / Green / Blue Version Role-Playing E GB
## Publisher Developer Critic_Score User_Score Total_Shipped
## 1 Nintendo Nintendo EAD 7.7 NA 82.86
## 2 Nintendo Nintendo EAD 10.0 NA 40.24
## 3 Nintendo Nintendo EAD 8.2 9.1 37.14
## 4 PUBG Corporation PUBG Corporation NA NA 36.60
## 5 Nintendo Nintendo EAD 8.0 8.8 33.09
## 6 Nintendo Game Freak 9.4 NA 31.38
## Global_Sales NA_Sales PAL_Sales JP_Sales Other_Sales Year
## 1 NA NA NA NA NA 2006
## 2 NA NA NA NA NA 1985
## 3 NA NA NA NA NA 2008
## 4 NA NA NA NA NA 2017
## 5 NA NA NA NA NA 2009
## 6 NA NA NA NA NA 1998
Rank - Ranking of overall sales
Name - Name of the game
Platform - Platform of the game (i.e. PC, PS4, XOne, etc.)
Genre - Genre of the game
ESRB Rating - ESRB Rating of the game
Publisher - Publisher of the game
Developer - Developer of the game
Critic Score - Critic score of the game, out of 10
User Score - User score of the game, out of 10
Total Shipped - Total shipped copies of the game
Global_Sales - Total worldwide sales (in millions)
NA_Sales - Sales in North America (in millions)
PAL_Sales - Sales in Europe (in millions)
JP_Sales - Sales in Japan (in millions)
Other_Sales - Sales in the rest of the world (in millions)
Year - Year of release of the game
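Because the dataset has so many N/A values, it helps to count them per column before trusting any ranking. The sketch below is only an illustration: it rebuilds three columns from the six rows printed above (the name `first_rows` is made up for this example) and counts the missing values. On the full data, the same two calls could be applied to `sales`.

```r
# First six rows of three columns, copied from the head() output above
first_rows <- data.frame(
  Critic_Score = c(7.7, 10.0, 8.2, NA, 8.0, 9.4),
  User_Score   = c(NA, NA, 9.1, NA, 8.8, NA),
  Global_Sales = rep(NA_real_, 6)
)

# Number and share of missing values in each column
na_count <- colSums(is.na(first_rows))
na_share <- round(colMeans(is.na(first_rows)), 2)
print(rbind(na_count, na_share))
```

Even in these six rows, `Global_Sales` is entirely missing and `User_Score` is missing for four games, which is why the results later on should be read with care.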
Data Preparation
## ── Attaching packages ─────────────────────────────────────────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.0 ✓ purrr 0.3.3
## ✓ tibble 2.1.3 ✓ dplyr 0.8.5
## ✓ tidyr 1.0.2 ✓ stringr 1.4.0
## ✓ readr 1.3.1 ✓ forcats 0.5.0
## ── Conflicts ────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
To simplify the analysis, I want to drop regional sales variables and only take the “global” sales variable:
# drop regional sales variables
sales <- sales %>%
select(-NA_Sales, -PAL_Sales, -JP_Sales, -Other_Sales)
str(sales)
## 'data.frame': 55792 obs. of 12 variables:
## $ Rank : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Name : Factor w/ 37102 levels "_summer ##","- Arcane preRaise -",..: 35588 30375 18586 23666 35590 23822 21482 31512 21486 19620 ...
## $ Genre : Factor w/ 20 levels "Action","Action-Adventure",..: 18 11 13 16 18 14 11 12 11 7 ...
## $ ESRB_Rating : Factor w/ 9 levels "","AO","E","E10",..: 3 1 3 1 3 3 3 3 3 1 ...
## $ Platform : Factor w/ 74 levels "2600","3DO","3DS",..: 65 42 65 48 65 25 21 25 65 48 ...
## $ Publisher : Factor w/ 3069 levels "][ Games","@unepic_fran",..: 1883 1883 1883 2151 1883 1883 1883 1883 1883 1753 ...
## $ Developer : Factor w/ 8065 levels "",".theprodukkt",..: 4984 4984 4984 5635 4984 2726 4984 1159 4984 4665 ...
## $ Critic_Score : num 7.7 10 8.2 NA 8 9.4 9.1 NA 8.6 10 ...
## $ User_Score : num NA NA 9.1 NA 8.8 NA 8.1 NA 9.2 NA ...
## $ Total_Shipped: num 82.9 40.2 37.1 36.6 33.1 ...
## $ Global_Sales : num NA NA NA NA NA NA NA NA NA NA ...
## $ Year : num 2006 1985 2008 2017 2009 ...
Because we’re concerned with “which PS games to play”, the logical decision is to filter Platform down to “PS4” only (the latest console in the PlayStation line at the time of writing):
# filter platform
ps4 <- sales %>%
filter(Platform == "PS4")
# replace NA with 0 instead of dropping rows, to avoid discarding more data
ps4[is.na(ps4)] <- 0
# look at the prepared data
str(ps4)
## 'data.frame': 1755 obs. of 12 variables:
## $ Rank : int 21 35 46 51 69 77 85 98 101 103 ...
## $ Name : Factor w/ 37102 levels "_summer ##","- Arcane preRaise -",..: 12802 4895 25336 4917 10770 10769 14373 4897 10771 18727 ...
## $ Genre : Factor w/ 20 levels "Action","Action-Adventure",..: 1 16 2 16 18 18 1 16 18 2 ...
## $ ESRB_Rating : Factor w/ 9 levels "","AO","E","E10",..: 7 7 7 7 3 3 9 7 3 9 ...
## $ Platform : Factor w/ 74 levels "2600","3DO","3DS",..: 54 54 54 54 54 54 54 54 54 54 ...
## $ Publisher : Factor w/ 3069 levels "][ Games","@unepic_fran",..: 2282 75 2282 75 749 779 2491 75 779 2491 ...
## $ Developer : Factor w/ 8065 levels "",".theprodukkt",..: 5982 7335 5978 6436 2119 2098 3001 7335 2116 3477 ...
## $ Critic_Score : num 9.7 0 9.8 8 8.3 8.9 9.1 0 0 9.1 ...
## $ User_Score : num 0 0 0 0 0 0 8 0 0 0 ...
## $ Total_Shipped: num 0 0 0 0 0 0 10 0 0 9 ...
## $ Global_Sales : num 19.4 15.1 13.9 13.4 11.8 ...
## $ Year : num 2014 2015 2018 2017 2017 ...
We have narrowed our observations down to 1,755 games.
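Replacing N/A with 0 keeps every row, but a 0 then looks like a real score or a real sales figure. As a rough sketch (toy data with made-up game names, not the actual `ps4` table), the snippet below fills only the numeric columns, keeps a flag for what was missing, and shows how the zeros pull an average down.

```r
# Toy data: hypothetical games, for illustration only
toy <- data.frame(
  Name         = factor(c("Game A", "Game B", "Game C")),
  Critic_Score = c(9.7, NA, 8.0),
  Global_Sales = c(19.4, 15.1, NA)
)

# Replace NA with 0 in numeric columns only, leaving factor columns untouched
num_cols <- vapply(toy, is.numeric, logical(1))
filled <- toy
filled[num_cols] <- lapply(filled[num_cols], function(x) replace(x, is.na(x), 0))

# Remember which critic scores were originally missing
filled$Critic_Missing <- is.na(toy$Critic_Score)

mean_with_zeros <- mean(filled$Critic_Score)            # zeros count as scores
mean_without_na <- mean(toy$Critic_Score, na.rm = TRUE) # missing values skipped
print(c(mean_with_zeros = mean_with_zeros, mean_without_na = mean_without_na))
```

The flag column makes it possible to filter out games without a score later, instead of treating them as games that scored 0.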
Exploratory Data Analysis
Every gamer usually has a favorite genre, although they may sometimes try games from other genres too. But how do these genres compete with each other in terms of sales?
# sales data based on genre
genre <- ps4 %>%
group_by(Genre) %>%
summarise(total.sales = sum(Global_Sales), n = n()) %>%
arrange(total.sales)
tail(genre)
## # A tibble: 6 x 3
## Genre total.sales n
## <fct> <dbl> <int>
## 1 Racing 24.8 79
## 2 Action-Adventure 51.2 112
## 3 Role-Playing 54.4 216
## 4 Sports 109. 113
## 5 Action 117. 355
## 6 Shooter 145. 154
Top-Selling Genre in PS4 Games
library(ggplot2)
genre %>%
  mutate(Genre = factor(Genre, levels = Genre)) %>% # fix factor levels so the bars follow the sales order
  arrange(total.sales) %>%
  ggplot(aes(x = Genre, y = total.sales, label = n)) +
  labs(title = "Sales per Genre and Number of Titles",
       y = "Total Sales (in millions)") +
  geom_bar(stat = "identity", aes(fill = total.sales)) +
  geom_text(size = 3, position = position_stack(vjust = 0.6)) + # label = n, the number of titles
  scale_fill_gradient(low = "blue", high = "red") +
  coord_flip()
Who knew the Shooter genre has the highest sales among all genres, with about 145 million in sales from ‘only’ 154 titles? The label on each bar is n, the number of games in the genre, not the number of copies sold. Action (355 titles) and Role-Playing (216 titles) have far more games, and together with Sports and Action-Adventure they round out the five best-selling genres among gamers worldwide.
But why do Shooter games have the highest sales with fewer titles? My assumption, which this dataset cannot confirm, is that they are rarely sold at a discount (e.g. not included in the PS Store “Summer Sale”).
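Since `n` in the genre summary counts rows (one row per game), dividing total sales by `n` gives a rough average per title, which is one way to compare genres of very different sizes. The sketch below re-enters the six rows printed by `tail(genre)` by hand; `genre_top` and `sales.per.title` are names made up for this example, and the inputs are the rounded values from that output, so the results are approximate.

```r
# The six rows shown in the tail(genre) output, typed in by hand
genre_top <- data.frame(
  Genre       = c("Racing", "Action-Adventure", "Role-Playing",
                  "Sports", "Action", "Shooter"),
  total.sales = c(24.8, 51.2, 54.4, 109, 117, 145),
  n           = c(79, 112, 216, 113, 355, 154)
)

# Average sales per title (in millions), from the rounded totals
genre_top$sales.per.title <- round(genre_top$total.sales / genre_top$n, 2)

# Sort from the highest to the lowest average
genre_top <- genre_top[order(-genre_top$sales.per.title), ]
print(genre_top)
```

Seen this way, a genre with many titles, such as Action, can have large total sales while each individual game sells far less on average than a typical Shooter or Sports title.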
Genre Popularities on Each Year
# Filter top genres and remove games without Year information
# (%in% tests membership; == would recycle the vector and silently drop rows)
topgenre <- ps4 %>%
  filter(Genre %in% c("Shooter", "Action", "Sports", "Role-Playing", "Action-Adventure"),
         Year != 0)
# Create plot
ggplot(data = topgenre, mapping = aes(x = Year, y = Global_Sales)) +
  geom_bar(stat = "identity", mapping = aes(fill = Genre, color = Genre), size = .1, alpha = .8) +
  facet_wrap(~Genre) +
  theme_bw() +
  ylab("Global Sales per Genre (in millions)") +
  theme(
    legend.position = "none",
    strip.text.x = element_text(margin = margin(3, 3, 3, 3), size = 10, face = "bold", color = "black"),
    plot.title = element_text(size = 10, face = "bold", hjust = .5),
    axis.text.x = element_text(size = 10, face = "bold"),
    axis.text.y = element_text(size = 10, face = "bold"),
    axis.title.y = element_text(size = 10))
Does any particular year catch your attention? For me, 2016 was the golden year for the Role-Playing genre (the genre I usually play), because these games launched that year:
# The most popular Role-Playing in 2016
topgenre %>%
filter(Year == 2016,
Genre == "Role-Playing") %>%
arrange(-Global_Sales) %>%
select(Name, Global_Sales)
## Name Global_Sales
## 1 Final Fantasy XV 5.07
## 2 Star Ocean 5: Integrity and Faithlessness 0.45
## 3 Rainbow Moon 0.00
## 4 Superdimension Neptune vs Sega Hard Girls 0.00
## 5 The Banner Saga 2 0.00
With global sales of about 5 million (from the PS4 version alone!), “Final Fantasy XV” is still a popular game with a strong fanbase all over the world.
Final Fantasy XV
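A note on filtering a column against several genres at once: `%in%` tests whether each value belongs to a set, while `==` recycles the comparison vector and compares position by position, so it can silently drop matching rows. A minimal sketch with a made-up vector of genres (not the real data):

```r
# Hypothetical genre column and the genres we want to keep
genres <- c("Action", "Shooter", "Action", "Sports", "Shooter", "Action")
wanted <- c("Shooter", "Action")

# == recycles `wanted` as Shooter, Action, Shooter, Action, ...
# so a row only matches if it lines up with the recycled position
by_equals <- genres[genres == wanted]

# %in% keeps every row whose genre appears anywhere in `wanted`
by_in <- genres[genres %in% wanted]

print(length(by_equals))
print(length(by_in))
```

Here `==` keeps only the rows that happen to line up with the recycled vector, while `%in%` keeps every Action and Shooter row. The same difference applies inside `filter()`.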
Greatest Game Developer Ever
# Filter developers that create the most popular genres
# (%in% tests membership; == would recycle the vector and silently drop rows)
topdev <- ps4 %>%
  filter(Genre %in% c("Shooter", "Action", "Sports", "Role-Playing", "Action-Adventure"),
         Global_Sales != 0) %>%
  select(Developer, Global_Sales) %>%
  arrange(-Global_Sales)
# Create plot from the 7 best-selling rows
plotdev <- topdev %>%
  head(7) %>%
  ggplot(mapping = aes(x = Developer, y = Global_Sales)) +
  geom_bar(stat = "identity", mapping = aes(fill = Developer), alpha = .7, size = 1) +
  geom_label(mapping = aes(label = Global_Sales), fill = "gray", size = 4, color = "black", fontface = "bold", hjust = .7) +
  ggtitle("Top Developers with the most sales (in millions)") +
  xlab(" ") +
  ylab("") +
  theme(
    plot.title = element_text(size = 14, hjust = .5, face = "bold"),
    axis.title.x = element_text(size = 10, hjust = .5, face = "italic"),
    axis.title.y = element_text(size = 10, hjust = .5, face = "italic"),
    axis.text.x = element_text(size = 10, face = "bold", angle = 15),
    axis.text.y = element_text(size = 10, face = "bold"),
    legend.position = "none",
    panel.background = element_blank())
# Recolor plot using wesanderson color palette
library(wesanderson)
plotdev + scale_fill_manual(values = wes_palette("BottleRocket1", n = 7))
What did Sledgehammer Games create that gave them 7.53 million in sales? Here is the answer:
topgenre %>%
filter(Developer == "Sledgehammer Games") %>%
arrange(-Global_Sales) %>%
select(Name, Global_Sales)
## Name Global_Sales
## 1 Call of Duty: Advanced Warfare 7.53
Call of Duty: Advanced Warfare
Knowing these developers can come in handy for gamers, as we can follow them on social media to keep up with their next projects! For example, if you like playing Call of Duty, you might want to pay attention to Sledgehammer Games' announcements.
Safe Games for Kids
# filter the games with "E" (Everyone) rating
e.rating <- ps4 %>%
  filter(ESRB_Rating == "E") %>%
  select(Name, Global_Sales, Genre) %>%
  mutate(Name = as.character(Name)) %>%
  arrange(-Global_Sales)
# create the plot from the 5 best-selling games
# (named eplot so it does not mask base R's plot() function)
eplot <- ggplot(data = head(e.rating, 5), aes(x = Name, y = Global_Sales)) +
  geom_bar(stat = "identity", mapping = aes(fill = Name), alpha = 0.6, size = 1) +
  geom_label(mapping = aes(label = Global_Sales), fill = "sky blue", size = 4, color = "white", fontface = "bold") +
  ggtitle("Top-Selling Games Rated 'E (Everyone)'") +
  xlab("") +
  ylab("") +
  theme(plot.title = element_text(size = 12, face = "bold"),
        axis.text.x = element_text(size = 10, angle = 18),
        panel.background = element_blank(),
        legend.position = "none"
  )
# Recolor the plot using the "Darjeeling1" palette (wes_palette() comes from the wesanderson package)
eplot + scale_fill_manual(values = wes_palette("Darjeeling1", n = 5))
Not surprisingly, the safest games to play with our little ones come from the Sports genre!
Highest-Rated Games
The ultimate, no-brainer way to recommend a game is to pick the ones with the highest ratings, both from fellow gamers and from critics. And since this information is not available on the PlayStation Store, we might as well find out here which games entertain people the most.
Disclaimer: many of these scores (critic and user) are N/A or 0. Any recommendation made from the analysis below comes from an incomplete dataset.
# Top 7 Games according to critic score
topcritic <- ps4 %>%
select(Name, Genre, Critic_Score) %>%
arrange(-Critic_Score) %>%
head(7)
criticplot <- ggplot(data = topcritic, aes(x = Name, y = Critic_Score, fill = Critic_Score))+
geom_bar(stat = "identity", mapping = aes(fill = Name), alpha = 0.6, size = 1)+
coord_flip()+
geom_label(mapping = aes(label = Critic_Score), fill = "sky blue", size = 4, color = "white", fontface = "bold")+
ggtitle("7 Games with the Highest Critic Score Ratings")+
xlab("")+
ylab("")+
theme(plot.title = element_text(size = 12, face = "bold"),
axis.text.x = element_text(size = 10),
axis.text.y = element_text(size = 10),
panel.background = element_blank(),
legend.position = "none"
)
criticplot
# Top 7 Games according to user score
topuser <- ps4 %>%
select(Name, Genre, User_Score) %>%
arrange(-User_Score) %>%
head(7)
ggplot(data = topuser, aes(x = Name, y = User_Score, fill = User_Score))+
geom_bar(stat = "identity", mapping = aes(fill = Name), alpha = 0.6, size = 1)+
coord_flip()+
geom_label(mapping = aes(label = User_Score), fill = "sky blue", size = 4, color = "white", fontface = "bold")+
ggtitle("7 Games with the Highest User Score Ratings")+
xlab("")+
ylab("")+
theme(plot.title = element_text(size = 12, face = "bold"),
axis.text.x = element_text(size = 8),
axis.text.y = element_text(size = 10),
panel.background = element_blank(),
legend.position = "none"
)
If you’re a fellow gamer, I think we can both agree that the “user score” in this dataset DOES NOT represent gamers’ favorites. I’d suggest looking at the first chart instead (derived from the critic scores), as it features more well-known games.
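One way to make the two scores easier to compare is to keep only games that have both a critic score and a user score, then rank them by the average of the two. The sketch below uses toy data with made-up game names, and treats 0 as "missing" because N/A values were replaced with 0 earlier in this analysis; `Mean_Score` is a name invented for this example, not a column in the dataset.

```r
# Toy data: hypothetical games, where 0 stands for a missing score
scores <- data.frame(
  Name         = c("Game A", "Game B", "Game C", "Game D"),
  Critic_Score = c(9.5, 0, 8.0, 9.0),
  User_Score   = c(0, 9.8, 8.5, 8.0),
  stringsAsFactors = FALSE
)

# Keep only games that have both scores
both <- scores[scores$Critic_Score > 0 & scores$User_Score > 0, ]

# Average the two scores and rank from highest to lowest
both$Mean_Score <- (both$Critic_Score + both$User_Score) / 2
both <- both[order(-both$Mean_Score), ]
print(both)
```

Games with only one of the two scores drop out of this ranking, so a title with a single very high score can no longer top the list on its own.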
Conclusion
When it comes to games, there is no one-size-fits-all recommendation. Even if you like a particular genre, there is no guarantee that you will like other games from the same genre. However, to sum up the EDA above:
- If you’re looking to try new games, look at the “Shooter”, “Action”, “Sports”, “Role-Playing”, and “Action-Adventure” genres, as they are the best-selling ones.
- Another way to find new games is to watch for the next release from popular game developers. If you like Call of Duty, you may want to follow Sledgehammer Games' social media accounts for their next releases.
- Want to play with kids? Sports games are the way to go!
- The ultimate way to look for game recommendations is to compare User Score and Critic Score. While my analysis above doesn’t represent a real recommendation (because many scores are missing), you can always look up score information online.