“Video Games Sales” is a dataset taken from Zenodo (https://zenodo.org/records/5898311). It was generated by scraping vgchartz.com, a video game sales website. The dataset covers video games that sold more than 100,000 copies and were released between 1980 and 2020, which makes it well suited for studying popular games. It contains 16,598 records.
Rank - Ranking of overall sales
Name - The game's name
Platform - Platform of the game's release (e.g. PC, PS4)
Year - Year of the game’s release
Genre - Genre of the game
Publisher - Publisher of the game
NA_Sales - Sales in North America (in millions)
EU_Sales - Sales in Europe (in millions)
JP_Sales - Sales in Japan (in millions)
Other_Sales - Sales in the rest of the world (in millions)
Global_Sales - Total worldwide sales (in millions)
library(factoextra)
## Loading required package: ggplot2
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(cluster)
library(ggplot2)
library(tidyr)
library(clusterCrit)
sales <- read.csv("vgsales.csv")
head(sales)
## Rank Name Platform Year Genre Publisher NA_Sales
## 1 1 Wii Sports Wii 2006 Sports Nintendo 41.49
## 2 2 Super Mario Bros. NES 1985 Platform Nintendo 29.08
## 3 3 Mario Kart Wii Wii 2008 Racing Nintendo 15.85
## 4 4 Wii Sports Resort Wii 2009 Sports Nintendo 15.75
## 5 5 Pokemon Red/Pokemon Blue GB 1996 Role-Playing Nintendo 11.27
## 6 6 Tetris GB 1989 Puzzle Nintendo 23.20
## EU_Sales JP_Sales Other_Sales Global_Sales
## 1 29.02 3.77 8.46 82.74
## 2 3.58 6.81 0.77 40.24
## 3 12.88 3.79 3.31 35.82
## 4 11.01 3.28 2.96 33.00
## 5 8.89 10.22 1.00 31.37
## 6 2.26 4.22 0.58 30.26
range(as.numeric(sales$Year), na.rm = TRUE)
## Warning: NAs introduced by coercion
## [1] 1980 2020
unique(sales$Platform)
## [1] "Wii" "NES" "GB" "DS" "X360" "PS3" "PS2" "SNES" "GBA" "3DS"
## [11] "PS4" "N64" "PS" "XB" "PC" "2600" "PSP" "XOne" "GC" "WiiU"
## [21] "GEN" "DC" "PSV" "SAT" "SCD" "WS" "NG" "TG16" "3DO" "GG"
## [31] "PCFX"
unique(sales$Genre)
## [1] "Sports" "Platform" "Racing" "Role-Playing" "Puzzle"
## [6] "Misc" "Shooter" "Simulation" "Action" "Fighting"
## [11] "Adventure" "Strategy"
unique(sales$Publisher)
## [1] "Nintendo"
## [2] "Microsoft Game Studios"
## [3] "Take-Two Interactive"
## [4] "Sony Computer Entertainment"
## [5] "Activision"
## [6] "Ubisoft"
## [7] "Bethesda Softworks"
## [8] "Electronic Arts"
## [9] "Sega"
## [10] "SquareSoft"
## [11] "Atari"
## [12] "505 Games"
## [13] "Capcom"
## [14] "GT Interactive"
## [15] "Konami Digital Entertainment"
## [16] "Sony Computer Entertainment Europe"
## [17] "Square Enix"
## [18] "LucasArts"
## [19] "Virgin Interactive"
## [20] "Warner Bros. Interactive Entertainment"
## [21] "Universal Interactive"
## [22] "Eidos Interactive"
## [23] "RedOctane"
## [24] "Vivendi Games"
## [25] "Enix Corporation"
## [26] "Namco Bandai Games"
## [27] "Palcom"
## [28] "Hasbro Interactive"
## [29] "THQ"
## [30] "Fox Interactive"
## [31] "Acclaim Entertainment"
## [32] "MTV Games"
## [33] "Disney Interactive Studios"
## [34] "N/A"
## [35] "Majesco Entertainment"
## [36] "Codemasters"
## [37] "Red Orb"
## [38] "Level 5"
## [39] "Arena Entertainment"
## [40] "Midway Games"
## [41] "JVC"
## [42] "Deep Silver"
## [43] "989 Studios"
## [44] "NCSoft"
## [45] "UEP Systems"
## [46] "Parker Bros."
## [47] "Maxis"
## [48] "Imagic"
## [49] "Tecmo Koei"
## [50] "Valve Software"
## [51] "ASCII Entertainment"
## [52] "Mindscape"
## [53] "Infogrames"
## [54] "Unknown"
## [55] "Square"
## [56] "Valve"
## [57] "Activision Value"
## [58] "Banpresto"
## [59] "D3Publisher"
## [60] "Oxygen Interactive"
## [61] "Red Storm Entertainment"
## [62] "Video System"
## [63] "Hello Games"
## [64] "Global Star"
## [65] "Gotham Games"
## [66] "Westwood Studios"
## [67] "GungHo"
## [68] "Crave Entertainment"
## [69] "Hudson Soft"
## [70] "Coleco"
## [71] "Rising Star Games"
## [72] "Atlus"
## [73] "TDK Mediactive"
## [74] "ASC Games"
## [75] "Zoo Games"
## [76] "Accolade"
## [77] "Sony Online Entertainment"
## [78] "3DO"
## [79] "RTL"
## [80] "Natsume"
## [81] "Focus Home Interactive"
## [82] "Alchemist"
## [83] "Black Label Games"
## [84] "SouthPeak Games"
## [85] "Mastertronic"
## [86] "Ocean"
## [87] "Zoo Digital Publishing"
## [88] "Psygnosis"
## [89] "City Interactive"
## [90] "Empire Interactive"
## [91] "Success"
## [92] "Compile"
## [93] "Russel"
## [94] "Taito"
## [95] "Agetec"
## [96] "GSP"
## [97] "Microprose"
## [98] "Play It"
## [99] "Slightly Mad Studios"
## [100] "Tomy Corporation"
## [101] "Sammy Corporation"
## [102] "Koch Media"
## [103] "Game Factory"
## [104] "Titus"
## [105] "Marvelous Entertainment"
## [106] "Genki"
## [107] "Mojang"
## [108] "Pinnacle"
## [109] "CTO SpA"
## [110] "TalonSoft"
## [111] "Crystal Dynamics"
## [112] "SCi"
## [113] "Quelle"
## [114] "mixi, Inc"
## [115] "Rage Software"
## [116] "Ubisoft Annecy"
## [117] "Scholastic Inc."
## [118] "Interplay"
## [119] "Mystique"
## [120] "ChunSoft"
## [121] "Square EA"
## [122] "20th Century Fox Video Games"
## [123] "Avanquest Software"
## [124] "Hudson Entertainment"
## [125] "Nordic Games"
## [126] "Men-A-Vision"
## [127] "Nobilis"
## [128] "Big Ben Interactive"
## [129] "Touchstone"
## [130] "Spike"
## [131] "Jester Interactive"
## [132] "Nippon Ichi Software"
## [133] "LEGO Media"
## [134] "Quest"
## [135] "Illusion Softworks"
## [136] "Tigervision"
## [137] "Funbox Media"
## [138] "Rocket Company"
## [139] "Metro 3D"
## [140] "Mattel Interactive"
## [141] "IE Institute"
## [142] "Rondomedia"
## [143] "Sony Computer Entertainment America"
## [144] "Universal Gamex"
## [145] "Ghostlight"
## [146] "Wizard Video Games"
## [147] "BMG Interactive Entertainment"
## [148] "PQube"
## [149] "Trion Worlds"
## [150] "Laguna"
## [151] "Ignition Entertainment"
## [152] "Takara"
## [153] "Kadokawa Shoten"
## [154] "Destineer"
## [155] "Enterbrain"
## [156] "Xseed Games"
## [157] "Imagineer"
## [158] "System 3 Arcade Software"
## [159] "CPG Products"
## [160] "Aruze Corp"
## [161] "Gamebridge"
## [162] "Midas Interactive Entertainment"
## [163] "Jaleco"
## [164] "Answer Software"
## [165] "XS Games"
## [166] "Activision Blizzard"
## [167] "Pack In Soft"
## [168] "Rebellion"
## [169] "Xplosiv"
## [170] "Ultravision"
## [171] "GameMill Entertainment"
## [172] "Wanadoo"
## [173] "NovaLogic"
## [174] "Telltale Games"
## [175] "Epoch"
## [176] "BAM! Entertainment"
## [177] "Knowledge Adventure"
## [178] "Mastiff"
## [179] "Tetris Online"
## [180] "Harmonix Music Systems"
## [181] "ESP"
## [182] "TYO"
## [183] "Telegames"
## [184] "Mud Duck Productions"
## [185] "Screenlife"
## [186] "Pioneer LDC"
## [187] "Magical Company"
## [188] "Mentor Interactive"
## [189] "Kemco"
## [190] "Human Entertainment"
## [191] "Avanquest"
## [192] "Data Age"
## [193] "Electronic Arts Victor"
## [194] "Black Bean Games"
## [195] "Jack of All Games"
## [196] "989 Sports"
## [197] "Takara Tomy"
## [198] "Media Rings"
## [199] "Elf"
## [200] "Kalypso Media"
## [201] "Starfish"
## [202] "Zushi Games"
## [203] "Jorudan"
## [204] "Destination Software, Inc"
## [205] "New"
## [206] "Brash Entertainment"
## [207] "ITT Family Games"
## [208] "PopCap Games"
## [209] "Home Entertainment Suppliers"
## [210] "Ackkstudios"
## [211] "Starpath Corp."
## [212] "P2 Games"
## [213] "BPS"
## [214] "Gathering of Developers"
## [215] "NewKidCo"
## [216] "Storm City Games"
## [217] "CokeM Interactive"
## [218] "CBS Electronics"
## [219] "Magix"
## [220] "Marvelous Interactive"
## [221] "Nihon Falcom Corporation"
## [222] "Wargaming.net"
## [223] "Angel Studios"
## [224] "Arc System Works"
## [225] "Playmates"
## [226] "SNK Playmore"
## [227] "Hamster Corporation"
## [228] "From Software"
## [229] "Nippon Columbia"
## [230] "Nichibutsu"
## [231] "Little Orbit"
## [232] "Conspiracy Entertainment"
## [233] "DTP Entertainment"
## [234] "Hect"
## [235] "Mumbo Jumbo"
## [236] "Pacific Century Cyber Works"
## [237] "Indie Games"
## [238] "Liquid Games"
## [239] "NEC"
## [240] "Axela"
## [241] "ArtDink"
## [242] "Sunsoft"
## [243] "Gust"
## [244] "SNK"
## [245] "NEC Interchannel"
## [246] "FuRyu"
## [247] "Xing Entertainment"
## [248] "ValuSoft"
## [249] "Victor Interactive"
## [250] "Detn8 Games"
## [251] "American Softworks"
## [252] "Nordcurrent"
## [253] "Bomb"
## [254] "Falcom Corporation"
## [255] "AQ Interactive"
## [256] "CCP"
## [257] "Milestone S.r.l."
## [258] "Sears"
## [259] "JoWood Productions"
## [260] "Seta Corporation"
## [261] "On Demand"
## [262] "NCS"
## [263] "Aspyr"
## [264] "Gremlin Interactive Ltd"
## [265] "Agatsuma Entertainment"
## [266] "Compile Heart"
## [267] "Culture Brain"
## [268] "Mad Catz"
## [269] "Shogakukan"
## [270] "Merscom LLC"
## [271] "Rebellion Developments"
## [272] "Nippon Telenet"
## [273] "TDK Core"
## [274] "bitComposer Games"
## [275] "Foreign Media Games"
## [276] "Astragon"
## [277] "SSI"
## [278] "Kadokawa Games"
## [279] "Idea Factory"
## [280] "Performance Designed Products"
## [281] "Asylum Entertainment"
## [282] "Core Design Ltd."
## [283] "PlayV"
## [284] "UFO Interactive"
## [285] "Idea Factory International"
## [286] "Playlogic Game Factory"
## [287] "Essential Games"
## [288] "Adeline Software"
## [289] "Funcom"
## [290] "Panther Software"
## [291] "Blast! Entertainment Ltd"
## [292] "Game Life"
## [293] "DSI Games"
## [294] "Avalon Interactive"
## [295] "Popcorn Arcade"
## [296] "Neko Entertainment"
## [297] "Vir2L Studios"
## [298] "Aques"
## [299] "Syscom"
## [300] "White Park Bay Software"
## [301] "System 3"
## [302] "Vatical Entertainment"
## [303] "Daedalic"
## [304] "EA Games"
## [305] "Media Factory"
## [306] "Vic Tokai"
## [307] "The Adventure Company"
## [308] "Game Arts"
## [309] "Broccoli"
## [310] "Acquire"
## [311] "General Entertainment"
## [312] "Excalibur Publishing"
## [313] "Imadio"
## [314] "Swing! Entertainment"
## [315] "Sony Music Entertainment"
## [316] "Aqua Plus"
## [317] "Paradox Interactive"
## [318] "Hip Interactive"
## [319] "DreamCatcher Interactive"
## [320] "Tripwire Interactive"
## [321] "Sting"
## [322] "Yacht Club Games"
## [323] "SCS Software"
## [324] "Bigben Interactive"
## [325] "Havas Interactive"
## [326] "Slitherine Software"
## [327] "Graffiti"
## [328] "Funsta"
## [329] "Telstar"
## [330] "U.S. Gold"
## [331] "DreamWorks Interactive"
## [332] "Data Design Interactive"
## [333] "MTO"
## [334] "DHM Interactive"
## [335] "FunSoft"
## [336] "SPS"
## [337] "Bohemia Interactive"
## [338] "Reef Entertainment"
## [339] "Tru Blu Entertainment"
## [340] "Moss"
## [341] "T&E Soft"
## [342] "O-Games"
## [343] "Aksys Games"
## [344] "NDA Productions"
## [345] "Data East"
## [346] "Time Warner Interactive"
## [347] "Gainax Network Systems"
## [348] "Daito"
## [349] "O3 Entertainment"
## [350] "Gameloft"
## [351] "Xicat Interactive"
## [352] "Simon & Schuster Interactive"
## [353] "Valcon Games"
## [354] "PopTop Software"
## [355] "TOHO"
## [356] "HMH Interactive"
## [357] "5pb"
## [358] "Cave"
## [359] "CDV Software Entertainment"
## [360] "Microids"
## [361] "PM Studios"
## [362] "Paon"
## [363] "Micro Cabin"
## [364] "GameTek"
## [365] "Benesse"
## [366] "Type-Moon"
## [367] "Enjoy Gaming ltd."
## [368] "Asmik Corp"
## [369] "Interplay Productions"
## [370] "Asmik Ace Entertainment"
## [371] "inXile Entertainment"
## [372] "Image Epoch"
## [373] "Phantom EFX"
## [374] "Evolved Games"
## [375] "responDESIGN"
## [376] "Culture Publishers"
## [377] "Griffin International"
## [378] "Hackberry"
## [379] "Hearty Robin"
## [380] "Nippon Amuse"
## [381] "Origin Systems"
## [382] "Seventh Chord"
## [383] "Mitsui"
## [384] "Milestone"
## [385] "Abylight"
## [386] "Flight-Plan"
## [387] "Glams"
## [388] "Locus"
## [389] "Warp"
## [390] "Daedalic Entertainment"
## [391] "Alternative Software"
## [392] "Myelin Media"
## [393] "Mercury Games"
## [394] "Irem Software Engineering"
## [395] "Sunrise Interactive"
## [396] "Elite"
## [397] "Evolution Games"
## [398] "Tivola"
## [399] "Global A Entertainment"
## [400] "Edia"
## [401] "Athena"
## [402] "Aria"
## [403] "Gamecock"
## [404] "Tommo"
## [405] "Altron"
## [406] "Happinet"
## [407] "iWin"
## [408] "Media Works"
## [409] "Fortyfive"
## [410] "Revolution Software"
## [411] "Imax"
## [412] "Crimson Cow"
## [413] "10TACLE Studios"
## [414] "Groove Games"
## [415] "Pack-In-Video"
## [416] "Insomniac Games"
## [417] "Ascaron Entertainment GmbH"
## [418] "Asgard"
## [419] "Ecole"
## [420] "Yumedia"
## [421] "Phenomedia"
## [422] "HAL Laboratory"
## [423] "Grand Prix Games"
## [424] "DigiCube"
## [425] "Creative Core"
## [426] "Kaga Create"
## [427] "WayForward Technologies"
## [428] "LSP Games"
## [429] "ASCII Media Works"
## [430] "Coconuts Japan"
## [431] "Arika"
## [432] "Ertain"
## [433] "Marvel Entertainment"
## [434] "Prototype"
## [435] "TopWare Interactive"
## [436] "Phantagram"
## [437] "1C Company"
## [438] "The Learning Company"
## [439] "TechnoSoft"
## [440] "Vap"
## [441] "Misawa"
## [442] "Tradewest"
## [443] "Team17 Software"
## [444] "Yeti"
## [445] "Pow"
## [446] "Navarre Corp"
## [447] "MediaQuest"
## [448] "Max Five"
## [449] "Comfort"
## [450] "Monte Christo Multimedia"
## [451] "Pony Canyon"
## [452] "Riverhillsoft"
## [453] "Summitsoft"
## [454] "Milestone S.r.l"
## [455] "Playmore"
## [456] "MLB.com"
## [457] "Kool Kizz"
## [458] "Flashpoint Games"
## [459] "49Games"
## [460] "Legacy Interactive"
## [461] "Alawar Entertainment"
## [462] "CyberFront"
## [463] "Cloud Imperium Games Corporation"
## [464] "Societa"
## [465] "Virtual Play Games"
## [466] "Interchannel"
## [467] "Sonnet"
## [468] "Experience Inc."
## [469] "Zenrin"
## [470] "Iceberg Interactive"
## [471] "Ivolgamus"
## [472] "2D Boy"
## [473] "MC2 Entertainment"
## [474] "Kando Games"
## [475] "Just Flight"
## [476] "Office Create"
## [477] "Mamba Games"
## [478] "Fields"
## [479] "Princess Soft"
## [480] "Maximum Family Games"
## [481] "Berkeley"
## [482] "Fuji"
## [483] "Dusenberry Martin Racing"
## [484] "imageepoch Inc."
## [485] "Big Fish Games"
## [486] "Her Interactive"
## [487] "Kamui"
## [488] "ASK"
## [489] "Headup Games"
## [490] "KSS"
## [491] "Cygames"
## [492] "KID"
## [493] "Quinrose"
## [494] "Sunflowers"
## [495] "dramatic create"
## [496] "TGL"
## [497] "Encore"
## [498] "Extreme Entertainment Group"
## [499] "Intergrow"
## [500] "G.Rev"
## [501] "Sweets"
## [502] "Kokopeli Digital Studios"
## [503] "Number None"
## [504] "Nexon"
## [505] "id Software"
## [506] "BushiRoad"
## [507] "Tryfirst"
## [508] "Strategy First"
## [509] "7G//AMES"
## [510] "GN Software"
## [511] "Yuke's"
## [512] "Easy Interactive"
## [513] "Licensed 4U"
## [514] "FuRyu Corporation"
## [515] "Lexicon Entertainment"
## [516] "Paon Corporation"
## [517] "Kids Station"
## [518] "GOA"
## [519] "Graphsim Entertainment"
## [520] "King Records"
## [521] "Introversion Software"
## [522] "Minato Station"
## [523] "Devolver Digital"
## [524] "Blue Byte"
## [525] "Gaga"
## [526] "Yamasa Entertainment"
## [527] "Plenty"
## [528] "Views"
## [529] "fonfun"
## [530] "NetRevo"
## [531] "Codemasters Online"
## [532] "Quintet"
## [533] "Phoenix Games"
## [534] "Dorart"
## [535] "Marvelous Games"
## [536] "Focus Multimedia"
## [537] "Imageworks"
## [538] "Karin Entertainment"
## [539] "Aerosoft"
## [540] "Technos Japan Corporation"
## [541] "Gakken"
## [542] "Mirai Shounen"
## [543] "Datam Polystar"
## [544] "Saurus"
## [545] "HuneX"
## [546] "Revolution (Japan)"
## [547] "Giza10"
## [548] "Visco"
## [549] "Alvion"
## [550] "Mycom"
## [551] "Giga"
## [552] "Warashi"
## [553] "System Soft"
## [554] "Sold Out"
## [555] "Lighthouse Interactive"
## [556] "Masque Publishing"
## [557] "RED Entertainment"
## [558] "Michaelsoft"
## [559] "Media Entertainment"
## [560] "New World Computing"
## [561] "Genterprise"
## [562] "Interworks Unlimited, Inc."
## [563] "Boost On"
## [564] "Stainless Games"
## [565] "EON Digital Entertainment"
## [566] "Epic Games"
## [567] "Naxat Soft"
## [568] "Ascaron Entertainment"
## [569] "Piacci"
## [570] "Nitroplus"
## [571] "Paradox Development"
## [572] "Otomate"
## [573] "Ongakukan"
## [574] "Commseed"
## [575] "Inti Creates"
## [576] "Takuyo"
## [577] "Interchannel-Holon"
## [578] "Rain Games"
## [579] "UIG Entertainment"
I check the key variables to see which values they take. Platform shows many different gaming consoles as well as PC. I also looked at which genres exist and how Genre is stored. Finally, I printed the unique publishers: there are 579 distinct values, although two of them ("N/A" and "Unknown") are placeholders rather than real publishers.
For clustering, I wanted to group the games by “taste”, i.e. to analyze what shapes the taste of gamers. Is it the country or region they live in? Do they have distinct favorite publishers? Is there a genre they enjoy more than others in a certain region? Which gaming platforms are used most in each group?
To answer these questions, I compute each region's share of a game's global sales, so that every feature is a number between 0 and 1. This keeps the clustering “fair”: the features are already on a common scale, so no separate normalization step is needed. Additionally, since the full dataset is large for distance-based clustering, I take a random subset of 1000 records. I chose K-Means, PAM, and Hierarchical Clustering because the data is simple and its structure does not call for something like DBSCAN or spectral clustering.
sales$NA_Share <- sales$NA_Sales / sales$Global_Sales
sales$EU_Share <- sales$EU_Sales / sales$Global_Sales
sales$JP_Share <- sales$JP_Sales / sales$Global_Sales
sales$Other_Share <- sales$Other_Sales / sales$Global_Sales
set.seed(123)
sales_subset <- sales %>%
  sample_n(1000)
subset_shares <- sales_subset %>%
  select(NA_Share, EU_Share, JP_Share, Other_Share)
summary(subset_shares)
## NA_Share EU_Share JP_Share Other_Share
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.00000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.00000
## Median :0.5088 Median :0.1773 Median :0.0000 Median :0.06000
## Mean :0.4750 Mean :0.2171 Mean :0.2292 Mean :0.06646
## 3rd Qu.:0.7626 3rd Qu.:0.3670 3rd Qu.:0.2783 3rd Qu.:0.10372
## Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :0.75000
str(subset_shares)
## 'data.frame': 1000 obs. of 4 variables:
## $ NA_Share : num 0.838 0.856 0.55 0.875 0.537 ...
## $ EU_Share : num 0.0294 0.1261 0.3833 0 0.2927 ...
## $ JP_Share : num 0 0 0 0 0 0 0 0 1 0 ...
## $ Other_Share: num 0.1324 0.027 0.0667 0.125 0.1463 ...
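As a sanity check on the share columns, the four shares of a game should add up to roughly 1. The sketch below is a minimal illustration that uses three rows copied by hand from the `head(sales)` output rather than the CSV file; the `toy` data frame and the loop are illustrative and not part of the original analysis. For Mario Kart Wii the regional figures add up to 35.83 against a listed global total of 35.82, so small deviations from 1 are to be expected from rounding in the source data.

```r
# Toy rows copied from the head(sales) output above (sales in millions).
toy <- data.frame(
  Name         = c("Wii Sports", "Super Mario Bros.", "Mario Kart Wii"),
  NA_Sales     = c(41.49, 29.08, 15.85),
  EU_Sales     = c(29.02, 3.58, 12.88),
  JP_Sales     = c(3.77, 6.81, 3.79),
  Other_Sales  = c(8.46, 0.77, 3.31),
  Global_Sales = c(82.74, 40.24, 35.82)
)

# Same share definition as in the analysis: regional sales / global sales.
regions <- c("NA", "EU", "JP", "Other")
for (r in regions) {
  toy[[paste0(r, "_Share")]] <- toy[[paste0(r, "_Sales")]] / toy$Global_Sales
}

# Row sums should be close to 1; deviations come from rounding in the source.
share_sum <- rowSums(toy[paste0(regions, "_Share")])
round(share_sum, 4)
```

On the real data, the same `rowSums()` call on `subset_shares` would show how far individual games deviate from 1.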
set.seed(123)
hopkins_result <- get_clust_tendency(subset_shares, n = nrow(subset_shares) - 1)
hopkins_result
## $hopkins_stat
## [1] 0.9330738
##
## $plot
The Hopkins statistic is ~0.93. As reported by get_clust_tendency in recent factoextra versions, values well above 0.5 and close to 1 indicate a strong clustering tendency, so the data is highly clusterable and we can proceed and look for insights.
K-Means is a natural algorithm to start with. To choose the number of clusters, I plot the average silhouette width and the gap statistic. Two clusters give the highest silhouette score, but the gap statistic points to 6 as the best choice. Since the gap statistic for 2 clusters is much worse while the silhouette score for 6 clusters is fairly close to that for 2, I go with k = 6. Moreover, to analyze the taste of gamers I expect at least 4 clusters, one per region, so 2 clusters would not make much sense.
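To make the Hopkins statistic less of a black box, here is a minimal hand-rolled sketch in base R. It is an illustration under stated assumptions, not factoextra's exact implementation: the function name `hopkins_sketch`, the sample size `m`, and the toy data are all hypothetical. The idea is to compare nearest-neighbour distances of uniformly random points (`u`) with those of real data points (`w`); when the data is clustered, real points sit much closer to their neighbours, so the ratio approaches 1.

```r
set.seed(1)

# Hand-rolled Hopkins-style statistic (a sketch, not factoextra's exact code).
# u: distances from uniform random points to their nearest data point
# w: distances from sampled data points to their nearest *other* data point
hopkins_sketch <- function(x, m = 50) {
  x  <- as.matrix(x)
  lo <- apply(x, 2, min)
  hi <- apply(x, 2, max)
  nearest <- function(p, pts) min(sqrt(rowSums(sweep(pts, 2, p)^2)))

  idx <- sample(nrow(x), m)
  w <- vapply(idx, function(i) nearest(x[i, ], x[-i, , drop = FALSE]), numeric(1))
  u <- vapply(seq_len(m), function(k) nearest(runif(ncol(x), lo, hi), x), numeric(1))
  sum(u) / (sum(u) + sum(w))
}

# Two tight blobs versus structureless uniform noise.
clustered <- rbind(matrix(rnorm(200, mean = 0, sd = 0.05), ncol = 2),
                   matrix(rnorm(200, mean = 3, sd = 0.05), ncol = 2))
uniform   <- matrix(runif(400), ncol = 2)

h_clustered <- hopkins_sketch(clustered)  # expected to be close to 1
h_uniform   <- hopkins_sketch(uniform)    # expected to be near 0.5
```

One caveat for share data: many games have identical share vectors (for example, sold only in Japan), which makes some `w` distances exactly 0 and pushes the statistic upwards.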
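The silhouette criterion behind this choice can be sketched in base R. The helper `avg_silhouette` and the toy blobs below are hypothetical and written only for illustration; the helper follows the standard silhouette definition that `cluster::silhouette` also implements, and the loop mimics in spirit what `fviz_nbclust(..., method = "silhouette")` plots.

```r
set.seed(1)

# Three well-separated toy blobs in 2D.
x <- rbind(matrix(rnorm(100, 0, 0.3), ncol = 2),
           matrix(rnorm(100, 3, 0.3), ncol = 2),
           cbind(rnorm(50, 0, 0.3), rnorm(50, 3, 0.3)))

# Average silhouette width: for each point, a = mean distance to its own
# cluster, b = smallest mean distance to another cluster, s = (b - a) / max(a, b).
avg_silhouette <- function(x, cl) {
  d      <- as.matrix(dist(x))
  labels <- unique(cl)
  s <- vapply(seq_len(nrow(d)), function(i) {
    own <- cl == cl[i]
    if (sum(own) == 1) return(0)          # singleton clusters get s = 0
    a <- sum(d[i, own]) / (sum(own) - 1)  # exclude the point itself
    b <- min(vapply(setdiff(labels, cl[i]),
                    function(k) mean(d[i, cl == k]), numeric(1)))
    (b - a) / max(a, b)
  }, numeric(1))
  mean(s)
}

# Score each candidate k and keep the one with the widest average silhouette.
ks  <- 2:6
sil <- vapply(ks, function(k) {
  avg_silhouette(x, kmeans(x, centers = k, nstart = 10)$cluster)
}, numeric(1))
best_k <- ks[which.max(sil)]
```

On clean toy data the maximum is unambiguous; on the sales shares the curve is flatter, which is why a second criterion such as the gap statistic is consulted as well.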
fviz_nbclust(
  subset_shares,
  kmeans,
  method = "silhouette"
) +
  labs(title = "Optimal Number of Clusters Using Silhouette Method")
fviz_nbclust(
  subset_shares,
  kmeans,
  method = "gap_stat"
) +
  labs(title = "Optimal Number of Clusters Using GAP Method")
km <- kmeans(subset_shares, centers = 6, nstart = 25)
fviz_cluster(km, data = subset_shares, geom = "point")
To visualize the clusters, fviz_cluster projects the data onto the first two principal components, which together capture about 82% of the variance, a good coverage. The most distinct clusters are 3 and 5.
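The "share of variance covered" can be reproduced with `prcomp`. The sketch below uses randomly generated toy shares, not the sales data, and it assumes that the plot standardizes the columns before PCA (the default `stand = TRUE` of `fviz_cluster`, as far as I know). A side effect worth noticing: because the four shares sum to 1, one principal component carries essentially no variance.

```r
set.seed(1)

# Toy compositional data: four non-negative columns that sum to 1 per row.
raw    <- matrix(rexp(400), ncol = 4)
shares <- raw / rowSums(raw)
colnames(shares) <- c("NA_Share", "EU_Share", "JP_Share", "Other_Share")

# PCA on standardized columns (assumed to mirror the plotting default).
pca <- prcomp(shares, center = TRUE, scale. = TRUE)
var_explained <- pca$sdev^2 / sum(pca$sdev^2)

# Cumulative share of variance; the second entry is what a 2D plot covers.
round(cumsum(var_explained), 3)
```

On the real data, the sum of the first two entries of `var_explained` should correspond to the roughly 82% read off the plot axes.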
sil <- silhouette(km$cluster, dist(subset_shares))
fviz_silhouette(sil)
## cluster size ave.sil.width
## 1 1 80 0.55
## 2 2 238 0.51
## 3 3 206 0.91
## 4 4 202 0.36
## 5 5 211 0.63
## 6 6 63 0.30
The silhouette plot confirms my impression of which clusters are distinct: cluster 3 has the highest average silhouette width at 0.91 and cluster 5 the second highest at 0.63. Clusters 4 and 6 have the lowest widths (0.36 and 0.30), indicating overlap and weakly separated structure.
sales_subset$Cluster <- factor(km$cluster)
ggplot(sales_subset, aes(x = Cluster)) +
geom_bar(fill = "steelblue") +
labs(title = "Number of Games per Cluster", y = "Count")
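The later sections describe clusters by their dominant region ("Japanese", "European", and so on). One way to read such labels off a K-Means fit is to inspect the cluster centers. The sketch below is a hedged illustration on toy data: `make_group`, `toy_shares`, and `toy_km` are hypothetical names and the data is synthetic.

```r
set.seed(1)
cols <- c("NA_Share", "EU_Share", "JP_Share", "Other_Share")

# Toy share rows in which one region clearly dominates.
make_group <- function(n, dominant) {
  m <- matrix(runif(n * 4, 0, 0.1), ncol = 4, dimnames = list(NULL, cols))
  m[, dominant] <- m[, dominant] + 1
  m / rowSums(m)
}
toy_shares <- rbind(make_group(40, "NA_Share"),
                    make_group(40, "EU_Share"),
                    make_group(40, "JP_Share"))

toy_km <- kmeans(toy_shares, centers = 3, nstart = 25)

# Each row of $centers is the mean share profile of one cluster;
# the largest entry names the region that dominates that cluster.
round(toy_km$centers, 2)
dominant_region <- apply(toy_km$centers, 1, function(r) names(r)[which.max(r)])
```

Applied to the analysis, `round(km$centers, 2)` would give the regional profile of each of the six clusters.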
Before hierarchical clustering, I check the correlation between the variables that go into the distance matrix.
cor(subset_shares)
## NA_Share EU_Share JP_Share Other_Share
## NA_Share 1.00000000 -0.2376522 -0.7221810 0.06620558
## EU_Share -0.23765221 1.0000000 -0.4531807 0.30628892
## JP_Share -0.72218102 -0.4531807 1.0000000 -0.38871250
## Other_Share 0.06620558 0.3062889 -0.3887125 1.00000000
Only one pair, NA/JP, is strongly negatively correlated (-0.72), but it is not an extreme case, so I will proceed.
d <- dist(subset_shares, method = "euclidean")
hc <- agnes(d, method = "complete")
hc$ac
## [1] 0.9868188
pltree(hc, cex = 0.6, hang = -1, main = "Dendrogram", labels = FALSE)
rect.hclust(hc, k = 4, border = 2:5)
hc_clusters <- cutree(hc, k = 4)
sales_subset$hc_cluster <- hc_clusters
fviz_cluster(list(data = subset_shares, cluster = hc_clusters))
Cutting the tree at 4 clusters looks like a good choice: further down there are too many small groupings, while these 4 are quite distinct. This method also gives a very high agglomerative coefficient (~0.99). On the plot, which captures ~82% of the variance, the result looks good: there is some overlap, but overall the clusters are fairly distinct.
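The effect of the cut height can be explored by cutting the same tree at several values of k and comparing cluster sizes. The sketch below uses `stats::hclust` as a base-R stand-in for `cluster::agnes` (both perform agglomerative clustering; here with complete linkage) on synthetic blobs, so names such as `tree` and `sizes_by_k` are illustrative only.

```r
set.seed(1)

# Three well-separated toy blobs of 50 points each.
x <- rbind(matrix(rnorm(100, 0, 0.3), ncol = 2),
           matrix(rnorm(100, 5, 0.3), ncol = 2),
           cbind(rnorm(50, 0, 0.3), rnorm(50, 5, 0.3)))

# Complete-linkage agglomerative clustering on Euclidean distances.
tree <- hclust(dist(x, method = "euclidean"), method = "complete")

# Cluster sizes for several candidate cuts of the same tree.
sizes_by_k <- lapply(2:5, function(k) table(cutree(tree, k = k)))
names(sizes_by_k) <- paste0("k=", 2:5)
sizes_by_k[["k=3"]]
```

A cut that starts producing very small clusters, as k = 4 and k = 5 do on this toy data, is a hint that the tree is being cut too low.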
fviz_nbclust(
  subset_shares,
  pam,
  method = "silhouette"
) +
  labs(title = "Optimal Number of Clusters Using Silhouette Method")
fviz_nbclust(
  subset_shares,
  pam,
  method = "gap_stat"
) +
  labs(title = "Optimal Number of Clusters Using GAP Method")
set.seed(123)
pam_model <- pam(subset_shares, k = 6)
sales_subset$pam_cluster <- pam_model$clustering
pam_model$silinfo$avg.width
## [1] 0.5791628
plot(pam_model)
Even though the gap statistic keeps increasing and the silhouette suggests 2 clusters, I chose 6 clusters again: the silhouette at 6 is relatively good, and following the gap statistic would lead to too many clusters.
data_matrix <- as.matrix(subset_shares)
ch_km <- intCriteria(traj = data_matrix,
                     part = as.integer(sales_subset$Cluster),
                     crit = "Calinski_Harabasz")$calinski_harabasz
ch_km
## [1] 3388.761
ch_hc <- intCriteria(traj = data_matrix,
                     part = sales_subset$hc_cluster,
                     crit = "Calinski_Harabasz")$calinski_harabasz
ch_hc
## [1] 2932.91
ch_pam <- intCriteria(traj = data_matrix,
                      part = sales_subset$pam_cluster,
                      crit = "Calinski_Harabasz")$calinski_harabasz
ch_pam
## [1] 3346.91
According to the Calinski-Harabasz (CH) index, K-Means gives the best clustering (3388.8), closely followed by PAM (3346.9). Hierarchical clustering scores lower (2932.9) than the other two.
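For reference, the CH index is the ratio of between-cluster to within-cluster dispersion, each divided by its degrees of freedom; higher is better. The sketch below computes it by hand on toy data and cross-checks it against the sums of squares that `kmeans()` stores. The function `calinski_harabasz` is a hypothetical helper written for this illustration, not the clusterCrit implementation.

```r
set.seed(1)
x <- rbind(matrix(rnorm(100, 0, 0.3), ncol = 2),
           matrix(rnorm(100, 3, 0.3), ncol = 2))
fit <- kmeans(x, centers = 2, nstart = 10)

# CH = (between-cluster SS / (k - 1)) / (within-cluster SS / (n - k))
calinski_harabasz <- function(x, cl) {
  x     <- as.matrix(x)
  n     <- nrow(x)
  k     <- length(unique(cl))
  grand <- colMeans(x)
  wss <- 0
  bss <- 0
  for (g in unique(cl)) {
    xg  <- x[cl == g, , drop = FALSE]
    cg  <- colMeans(xg)
    wss <- wss + sum(sweep(xg, 2, cg)^2)          # spread around own centroid
    bss <- bss + nrow(xg) * sum((cg - grand)^2)   # centroid vs. grand mean
  }
  (bss / (k - 1)) / (wss / (n - k))
}

ch_manual <- calinski_harabasz(x, fit$cluster)
# kmeans() already stores both sums of squares, which gives the same value.
ch_kmeans <- (fit$betweenss / (2 - 1)) / (fit$tot.withinss / (nrow(x) - 2))
```

Because the index depends on k through both denominators, comparing a 4-cluster solution with 6-cluster solutions, as done above, mixes the effect of the algorithm with the effect of k.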
sales_subset %>%
  group_by(Cluster) %>%
  summarise(Avg_Global = mean(Global_Sales)) %>%
  ggplot(aes(x = Cluster, y = Avg_Global, fill = Cluster)) +
  geom_col() +
  labs(title = "K-means: Avg Global Sales")
sales_subset %>%
  group_by(hc_cluster) %>%
  summarise(Avg_Global = mean(Global_Sales)) %>%
  ggplot(aes(x = hc_cluster, y = Avg_Global, fill = factor(hc_cluster))) +
  geom_col() +
  labs(title = "Hierarchical: Avg Global Sales")
sales_subset %>%
  group_by(pam_cluster) %>%
  summarise(Avg_Global = mean(Global_Sales)) %>%
  ggplot(aes(x = pam_cluster, y = Avg_Global, fill = factor(pam_cluster))) +
  geom_col() +
  labs(title = "PAM: Avg Global Sales")
Comparing the clustering techniques by average global sales, PAM and K-Means are similar. They differ only in clusters 4 and 5: in K-Means, cluster 4 has far higher sales than cluster 5, and the opposite is true in PAM. Cluster 6 is the highest selling in both techniques, followed by cluster 2 with about half of those sales. For hierarchical clustering, cluster 2 is the highest selling, followed by cluster 1, while clusters 3 and 4 have far lower sales.
# Plot the most frequent publishers in each cluster (ties are kept).
make_publisher_plot <- function(df, cluster_col, title_prefix) {
  df %>%
    count({{ cluster_col }}, Publisher) %>%
    group_by({{ cluster_col }}) %>%
    slice_max(n, n = 5) %>%
    ggplot(aes(x = reorder(Publisher, n), y = n, fill = factor({{ cluster_col }}))) +
    geom_col() +
    facet_wrap(vars({{ cluster_col }}), scales = "free", ncol = 2) +
    coord_flip() +
    labs(title = paste(title_prefix, "Top Publishers"),
         x = "Publisher", y = "Count")
}
make_publisher_plot(sales_subset, Cluster, "K-means")
make_publisher_plot(sales_subset, hc_cluster, "Hierarchical")
make_publisher_plot(sales_subset, pam_cluster, "PAM")
Starting with K-Means, clusters 3 and 6 stand out immediately for their high number of games by Japanese publishers. Recall from the earlier plot that cluster 3 consists almost exclusively of Japanese share, while cluster 6 is more interesting because it appears in all markets. An interesting insight is that games by Namco Bandai Games are by a huge margin the most popular in Japan, while Nintendo is more widespread but less prominent in Japan specifically. The mostly European cluster 1 has Electronic Arts as the biggest publisher, followed by Ubisoft and Codemasters. Cluster 2 is split between North America and Europe; EA is the most popular there, alongside Activision, Ubisoft, and Sony Computer Entertainment. Interestingly, Sony is more popular in NA and Europe than in Japan. Cluster 4 is a “North America” cluster with small shares in EU and Other. Here THQ is the most popular, with EA, Activision, Ubisoft, and Microsoft behind it. THQ appears only in clusters 2, 4, and 5, which are the American-dominated clusters, and THQ is an American company. Cluster 5 is even more “North American”: EA comes first, and Activision, Ubisoft, and THQ are close to one another in popularity. Take-Two Interactive makes the top list only in this cluster, indicating another American company making games mostly for Americans.
After analyzing the Hierarchical clustering results on these plots, I did not spot any new valuable insights. Additionally, since this solution has only 4 clusters, it provides fewer insights than K-Means.
Again, the PAM results are practically identical to K-Means, so no separate analysis is required.
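Whether two clusterings really agree can be checked numerically instead of by eye. The sketch below is a minimal base-R illustration with a hypothetical helper `rand_index` and made-up label vectors; it computes the plain (unadjusted) Rand index, which ignores how the clusters happen to be numbered.

```r
# Rand index: fraction of point pairs on which two labelings agree
# (both put the pair together, or both keep it apart). Label names do not matter.
rand_index <- function(a, b) {
  tab     <- table(a, b)
  n_pairs <- function(v) sum(choose(v, 2))
  total   <- choose(length(a), 2)
  same_both <- n_pairs(tab)            # pairs together in both labelings
  same_a    <- n_pairs(rowSums(tab))   # pairs together in a
  same_b    <- n_pairs(colSums(tab))   # pairs together in b
  (total + 2 * same_both - same_a - same_b) / total
}

labels_a <- c(1, 1, 2, 2, 3, 3)
labels_b <- c(2, 2, 3, 3, 1, 1)  # same grouping, different cluster numbers
labels_c <- c(1, 2, 1, 2, 1, 2)  # a genuinely different grouping

table(labels_a, labels_b)        # cross-tabulation shows the one-to-one mapping
ri_same <- rand_index(labels_a, labels_b)
ri_diff <- rand_index(labels_a, labels_c)
```

In this analysis the call would be `rand_index(sales_subset$Cluster, sales_subset$pam_cluster)`; a value close to 1 would support treating the PAM and K-Means solutions as interchangeable.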
make_genre_plot <- function(df, cluster_col, title_prefix) {
  df %>%
    count({{ cluster_col }}, Genre) %>%
    group_by({{ cluster_col }}) %>%
    slice_max(n, n = 5) %>%
    ggplot(aes(x = reorder(Genre, n), y = n, fill = factor({{ cluster_col }}))) +
    geom_col() +
    facet_wrap(vars({{ cluster_col }}), scales = "free_y", ncol = 2) +
    coord_flip() +
    labs(title = paste(title_prefix, "Top Genres per Cluster"),
         x = "Genre", y = "Count")
}
make_genre_plot(sales_subset, Cluster, "K-means")
make_genre_plot(sales_subset, hc_cluster, "Hierarchical")
make_genre_plot(sales_subset, pam_cluster, "PAM")
In the genre plots for K-Means, Japan's two prominent clusters, 3 and 6, are led by Role-Playing games, which indicates that this is the most popular genre in Japan, followed by Adventure and Action. Cluster 5, the North American cluster, has the most Sports games, which is unique, since Sports is not the leading genre in any other region. All other clusters are led by Action, while also containing many Shooter, Sports, Racing, and Misc games. European cluster 1 also lists Simulation and Strategy as its 5th and 6th genres. Cluster 4 has the largest gap between the first genre and the rest: Action is the most popular and Platform comes second; this cluster is mainly a North American base of gamers with a smaller share from Europe and the rest of the world. Cluster 2 does not give any specific insight, because it spans the three regions other than Japan, and typical genres like Action, Sports, and Shooter are the most popular.
Again, the PAM plots are practically identical to K-Means, and the Hierarchical plots lose some insights rather than adding new ones, so I do not analyze them separately.
make_top_games_plot <- function(df, cluster_col, title_prefix) {
  df %>%
    group_by({{ cluster_col }}) %>%
    slice_max(Global_Sales, n = 5) %>%
    ggplot(aes(x = reorder(Name, Global_Sales), y = Global_Sales, fill = factor({{ cluster_col }}))) +
    geom_col() +
    facet_wrap(vars({{ cluster_col }}), scales = "free_y", ncol = 2) +
    coord_flip() +
    labs(title = paste(title_prefix, "Top 5 Games by Global Sales per Cluster"),
         x = "Game", y = "Global Sales (Millions)")
}
make_top_games_plot(sales_subset, Cluster, "K-means")
make_top_games_plot(sales_subset, hc_cluster, "Hierarchical")
make_top_games_plot(sales_subset, pam_cluster, "PAM")
Japanese cluster 6 is unique again: Pokemon Yellow and Wii Party are its most popular games, and neither appears in any other cluster's top 5. Cluster 5 was the North American cluster where Sports games were most appreciated, and the top 5 reflects this, with NFL games and one NBA game on the list. In the other American-European cluster, cluster 4, where Action games were most popular, GTA 4 and Halo 3 are the top 2. Surprisingly, even though Action was more popular than Sports in European cluster 1, FIFA Soccer 06 is in first place and PES 2008 in third, and two more football games are in the top 5 together with Art Academy. This is consistent with the familiar pattern that Americans play American football while Europeans play association football (soccer). Cluster 2, which is shared almost equally by Europe and North America (with a small share from the rest of the world), has Call of Duty: Black Ops as its most popular game, followed by Forza Motorsport 3 and other Action and Sports games. This is typical, since the most popular genres in this cluster were Action, Sports, and Racing. This cluster is where European and American tastes overlap.
make_platform_plot <- function(df, cluster_col, title_prefix) {
  df %>%
    count({{ cluster_col }}, Platform) %>%
    group_by({{ cluster_col }}) %>%
    slice_max(n, n = 5) %>%
    ggplot(aes(x = reorder(Platform, n), y = n, fill = factor({{ cluster_col }}))) +
    geom_col() +
    facet_wrap(vars({{ cluster_col }}), scales = "free_y", ncol = 2) +
    coord_flip() +
    labs(title = paste(title_prefix, "Top Platforms per Cluster"),
         x = "Platform", y = "Count")
}
make_platform_plot(sales_subset, Cluster, "K-means")
make_platform_plot(sales_subset, hc_cluster, "Hierarchical")
make_platform_plot(sales_subset, pam_cluster, "PAM")
Looking at the most popular platforms, in cluster 1 (Europe) PC is by far the most popular, with PlayStation 3 next. In cluster 6, many platforms have approximately the same number of games, but PSP and 3DS lead. In cluster 5, which has the highest and almost exclusively North American share, DS and Wii are the two most popular, while PSP, X360, and PS2 are also in the top 5. Cluster 4, which has mostly NA share with some European share, has the most games on GBA, followed by XB, GC, X360, and DS. In the almost exclusively Japanese cluster 3, PSP and DS have by far the highest number of games; PS2, PS, and PSV are also in the top 5 but far behind. All five of these platforms are of Japanese origin, once again indicating Japanese exclusiveness. Cluster 2, which has sizable shares in both North America and Europe and some share in the rest of the world, has the most games on PS2 and PS; lower on the list are X360, PS3, and DS.
Overall, across all top-5 lists the only non-Japanese platforms are Xbox 360 (X360), Xbox (XB), and PC, and they appear only in the top lists of non-Japanese markets. Almost all other top platforms (mainly consoles) are made by either Nintendo or Sony.
Interesting fact: PSP (PlayStation Portable), PSV (PlayStation Vita), PS, PS2, and PS3 were created by Sony; DS, Wii, GC, GB, 3DS, and GBA were created by Nintendo.
Since the tests above favored K-Means and it is computationally cheap, I ran K-Means on the whole dataset to check whether it gives the same results as the subset. I expected all plots and insights to be nearly the same, except the top 5 games, the top 5 platforms, and maybe the publishers, since the whole dataset might include more popular games. This is therefore also a good way to see the actual 5 most popular games in each cluster. I only test K-Means, because it is the only one of the 3 algorithms I chose that is fast enough.
The gap statistic is extremely slow to compute on the whole set, but since I got the same silhouette graph, I assume the gap statistic would also be approximately the same, leading to the previous choice of 6 clusters.
sales_shares <- sales %>%
  select(NA_Share, EU_Share, JP_Share, Other_Share)
fviz_nbclust(
  sales_shares,
  kmeans,
  method = "silhouette"
) +
  labs(title = "Optimal Number of Clusters Using Silhouette Method")
We obtain the same average silhouette width of 0.58 as before, and the same cluster structure, although the much larger number of points may make it harder to spot. Some clusters' silhouette widths have improved and some are a little worse. Overall, this is a good result.
km <- kmeans(sales_shares, centers = 6, nstart = 25)
fviz_cluster(km, data = sales_shares, geom = "point")
sil <- silhouette(km$cluster, dist(sales_shares))
fviz_silhouette(sil)
## cluster size ave.sil.width
## 1 1 4230 0.45
## 2 2 1466 0.54
## 3 3 2971 0.62
## 4 4 939 0.36
## 5 5 3532 0.94
## 6 6 3460 0.41
sales$Cluster <- km$cluster
To conclude, my aim was to have at least 4 clusters, so I tried 4 for Hierarchical clustering and 6 for K-Means and PAM, and 6 proved more informative. That is also why I did not try other numbers of clusters: more than 6 would be hard to interpret and give worse results, and fewer than 4 would not give much information. The only choice not checked is 5, but its silhouette and gap values are a little worse than for 6, and I would expect similar results.
Regarding insights, I learned that the Japanese market is much more exclusive than the others in terms of games played, genres, and consoles. In this study, Microsoft's Xbox consoles were mainly used in Europe and North America, while Japan mainly used Nintendo and Sony consoles. Nevertheless, Nintendo games were played more in NA and EU, while Japan preferred Namco Bandai. Role-Playing was the most popular genre in Japan, alongside Adventure. Action was generally the most popular genre worldwide, and North America enjoyed Sports more than other regions.
The study proved useful, answering all the questions asked and more. A combination of domain knowledge and interest in the dataset helped to achieve these results. Hopefully, the reader found this study interesting as well and learned some new facts. Thank you for reading!