Introduction

“Video Games Sales” is a dataset taken from Zenodo (https://zenodo.org/records/5898311). It was generated by scraping vgchartz.com, a website that tracks video game sales. The dataset contains information about video games that sold more than 100,000 copies between 1980 and 2020, which makes it well suited to studying popular games. There are 16,598 records in the set.

Field Description

  • Rank - Ranking of overall sales

  • Name - The game’s name

  • Platform - Platform of the game’s release (e.g. PC, PS4)

  • Year - Year of the game’s release

  • Genre - Genre of the game

  • Publisher - Publisher of the game

  • NA_Sales - Sales in North America (in millions)

  • EU_Sales - Sales in Europe (in millions)

  • JP_Sales - Sales in Japan (in millions)

  • Other_Sales - Sales in the rest of the world (in millions)

  • Global_Sales - Total worldwide sales (in millions)
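
As a quick illustration of how Global_Sales relates to the four regional columns, the sketch below adds the regional figures for the six best-selling games and compares the result with the reported total. The values are copied from the head(sales) output in the Preliminary Analysis section; the names top6, regional_total, gap, and max_gap are placeholders introduced only for this example.

```r
# First six rows of the dataset, copied from the head(sales) output.
# `top6` is an illustrative name, not an object used elsewhere in the report.
top6 <- data.frame(
  Name = c("Wii Sports", "Super Mario Bros.", "Mario Kart Wii",
           "Wii Sports Resort", "Pokemon Red/Pokemon Blue", "Tetris"),
  NA_Sales     = c(41.49, 29.08, 15.85, 15.75, 11.27, 23.20),
  EU_Sales     = c(29.02,  3.58, 12.88, 11.01,  8.89,  2.26),
  JP_Sales     = c( 3.77,  6.81,  3.79,  3.28, 10.22,  4.22),
  Other_Sales  = c( 8.46,  0.77,  3.31,  2.96,  1.00,  0.58),
  Global_Sales = c(82.74, 40.24, 35.82, 33.00, 31.37, 30.26)
)

# Sum of the four regional columns for each game
regional_total <- with(top6, NA_Sales + EU_Sales + JP_Sales + Other_Sales)

# Difference from the reported global figure, rounded to the data's precision
gap <- round(regional_total - top6$Global_Sales, 2)
max_gap <- max(abs(gap))

print(data.frame(Name = top6$Name, regional_total, gap))
```

For these six rows the regional sum never differs from Global_Sales by more than 0.01, which is consistent with each column being rounded to two decimals independently.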

library(factoextra)
## Loading required package: ggplot2
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(cluster)
library(ggplot2)
library(tidyr)
library(clusterCrit)

Preliminary Analysis

sales <- read.csv("vgsales.csv")
head(sales)
##   Rank                     Name Platform Year        Genre Publisher NA_Sales
## 1    1               Wii Sports      Wii 2006       Sports  Nintendo    41.49
## 2    2        Super Mario Bros.      NES 1985     Platform  Nintendo    29.08
## 3    3           Mario Kart Wii      Wii 2008       Racing  Nintendo    15.85
## 4    4        Wii Sports Resort      Wii 2009       Sports  Nintendo    15.75
## 5    5 Pokemon Red/Pokemon Blue       GB 1996 Role-Playing  Nintendo    11.27
## 6    6                   Tetris       GB 1989       Puzzle  Nintendo    23.20
##   EU_Sales JP_Sales Other_Sales Global_Sales
## 1    29.02     3.77        8.46        82.74
## 2     3.58     6.81        0.77        40.24
## 3    12.88     3.79        3.31        35.82
## 4    11.01     3.28        2.96        33.00
## 5     8.89    10.22        1.00        31.37
## 6     2.26     4.22        0.58        30.26
range(as.numeric(sales$Year), na.rm = TRUE)
## Warning: NAs introduced by coercion
## [1] 1980 2020
unique(sales$Platform)
##  [1] "Wii"  "NES"  "GB"   "DS"   "X360" "PS3"  "PS2"  "SNES" "GBA"  "3DS" 
## [11] "PS4"  "N64"  "PS"   "XB"   "PC"   "2600" "PSP"  "XOne" "GC"   "WiiU"
## [21] "GEN"  "DC"   "PSV"  "SAT"  "SCD"  "WS"   "NG"   "TG16" "3DO"  "GG"  
## [31] "PCFX"
unique(sales$Genre)
##  [1] "Sports"       "Platform"     "Racing"       "Role-Playing" "Puzzle"      
##  [6] "Misc"         "Shooter"      "Simulation"   "Action"       "Fighting"    
## [11] "Adventure"    "Strategy"
unique(sales$Publisher)
##   [1] "Nintendo"                              
##   [2] "Microsoft Game Studios"                
##   [3] "Take-Two Interactive"                  
##   [4] "Sony Computer Entertainment"           
##   [5] "Activision"                            
##   [6] "Ubisoft"                               
##   [7] "Bethesda Softworks"                    
##   [8] "Electronic Arts"                       
##   [9] "Sega"                                  
##  [10] "SquareSoft"                            
##  [11] "Atari"                                 
##  [12] "505 Games"                             
##  [13] "Capcom"                                
##  [14] "GT Interactive"                        
##  [15] "Konami Digital Entertainment"          
##  [16] "Sony Computer Entertainment Europe"    
##  [17] "Square Enix"                           
##  [18] "LucasArts"                             
##  [19] "Virgin Interactive"                    
##  [20] "Warner Bros. Interactive Entertainment"
##  [21] "Universal Interactive"                 
##  [22] "Eidos Interactive"                     
##  [23] "RedOctane"                             
##  [24] "Vivendi Games"                         
##  [25] "Enix Corporation"                      
##  [26] "Namco Bandai Games"                    
##  [27] "Palcom"                                
##  [28] "Hasbro Interactive"                    
##  [29] "THQ"                                   
##  [30] "Fox Interactive"                       
##  [31] "Acclaim Entertainment"                 
##  [32] "MTV Games"                             
##  [33] "Disney Interactive Studios"            
##  [34] "N/A"                                   
##  [35] "Majesco Entertainment"                 
##  [36] "Codemasters"                           
##  [37] "Red Orb"                               
##  [38] "Level 5"                               
##  [39] "Arena Entertainment"                   
##  [40] "Midway Games"                          
##  [41] "JVC"                                   
##  [42] "Deep Silver"                           
##  [43] "989 Studios"                           
##  [44] "NCSoft"                                
##  [45] "UEP Systems"                           
##  [46] "Parker Bros."                          
##  [47] "Maxis"                                 
##  [48] "Imagic"                                
##  [49] "Tecmo Koei"                            
##  [50] "Valve Software"                        
##  [51] "ASCII Entertainment"                   
##  [52] "Mindscape"                             
##  [53] "Infogrames"                            
##  [54] "Unknown"                               
##  [55] "Square"                                
##  [56] "Valve"                                 
##  [57] "Activision Value"                      
##  [58] "Banpresto"                             
##  [59] "D3Publisher"                           
##  [60] "Oxygen Interactive"                    
##  [61] "Red Storm Entertainment"               
##  [62] "Video System"                          
##  [63] "Hello Games"                           
##  [64] "Global Star"                           
##  [65] "Gotham Games"                          
##  [66] "Westwood Studios"                      
##  [67] "GungHo"                                
##  [68] "Crave Entertainment"                   
##  [69] "Hudson Soft"                           
##  [70] "Coleco"                                
##  [71] "Rising Star Games"                     
##  [72] "Atlus"                                 
##  [73] "TDK Mediactive"                        
##  [74] "ASC Games"                             
##  [75] "Zoo Games"                             
##  [76] "Accolade"                              
##  [77] "Sony Online Entertainment"             
##  [78] "3DO"                                   
##  [79] "RTL"                                   
##  [80] "Natsume"                               
##  [81] "Focus Home Interactive"                
##  [82] "Alchemist"                             
##  [83] "Black Label Games"                     
##  [84] "SouthPeak Games"                       
##  [85] "Mastertronic"                          
##  [86] "Ocean"                                 
##  [87] "Zoo Digital Publishing"                
##  [88] "Psygnosis"                             
##  [89] "City Interactive"                      
##  [90] "Empire Interactive"                    
##  [91] "Success"                               
##  [92] "Compile"                               
##  [93] "Russel"                                
##  [94] "Taito"                                 
##  [95] "Agetec"                                
##  [96] "GSP"                                   
##  [97] "Microprose"                            
##  [98] "Play It"                               
##  [99] "Slightly Mad Studios"                  
## [100] "Tomy Corporation"                      
## [101] "Sammy Corporation"                     
## [102] "Koch Media"                            
## [103] "Game Factory"                          
## [104] "Titus"                                 
## [105] "Marvelous Entertainment"               
## [106] "Genki"                                 
## [107] "Mojang"                                
## [108] "Pinnacle"                              
## [109] "CTO SpA"                               
## [110] "TalonSoft"                             
## [111] "Crystal Dynamics"                      
## [112] "SCi"                                   
## [113] "Quelle"                                
## [114] "mixi, Inc"                             
## [115] "Rage Software"                         
## [116] "Ubisoft Annecy"                        
## [117] "Scholastic Inc."                       
## [118] "Interplay"                             
## [119] "Mystique"                              
## [120] "ChunSoft"                              
## [121] "Square EA"                             
## [122] "20th Century Fox Video Games"          
## [123] "Avanquest Software"                    
## [124] "Hudson Entertainment"                  
## [125] "Nordic Games"                          
## [126] "Men-A-Vision"                          
## [127] "Nobilis"                               
## [128] "Big Ben Interactive"                   
## [129] "Touchstone"                            
## [130] "Spike"                                 
## [131] "Jester Interactive"                    
## [132] "Nippon Ichi Software"                  
## [133] "LEGO Media"                            
## [134] "Quest"                                 
## [135] "Illusion Softworks"                    
## [136] "Tigervision"                           
## [137] "Funbox Media"                          
## [138] "Rocket Company"                        
## [139] "Metro 3D"                              
## [140] "Mattel Interactive"                    
## [141] "IE Institute"                          
## [142] "Rondomedia"                            
## [143] "Sony Computer Entertainment America"   
## [144] "Universal Gamex"                       
## [145] "Ghostlight"                            
## [146] "Wizard Video Games"                    
## [147] "BMG Interactive Entertainment"         
## [148] "PQube"                                 
## [149] "Trion Worlds"                          
## [150] "Laguna"                                
## [151] "Ignition Entertainment"                
## [152] "Takara"                                
## [153] "Kadokawa Shoten"                       
## [154] "Destineer"                             
## [155] "Enterbrain"                            
## [156] "Xseed Games"                           
## [157] "Imagineer"                             
## [158] "System 3 Arcade Software"              
## [159] "CPG Products"                          
## [160] "Aruze Corp"                            
## [161] "Gamebridge"                            
## [162] "Midas Interactive Entertainment"       
## [163] "Jaleco"                                
## [164] "Answer Software"                       
## [165] "XS Games"                              
## [166] "Activision Blizzard"                   
## [167] "Pack In Soft"                          
## [168] "Rebellion"                             
## [169] "Xplosiv"                               
## [170] "Ultravision"                           
## [171] "GameMill Entertainment"                
## [172] "Wanadoo"                               
## [173] "NovaLogic"                             
## [174] "Telltale Games"                        
## [175] "Epoch"                                 
## [176] "BAM! Entertainment"                    
## [177] "Knowledge Adventure"                   
## [178] "Mastiff"                               
## [179] "Tetris Online"                         
## [180] "Harmonix Music Systems"                
## [181] "ESP"                                   
## [182] "TYO"                                   
## [183] "Telegames"                             
## [184] "Mud Duck Productions"                  
## [185] "Screenlife"                            
## [186] "Pioneer LDC"                           
## [187] "Magical Company"                       
## [188] "Mentor Interactive"                    
## [189] "Kemco"                                 
## [190] "Human Entertainment"                   
## [191] "Avanquest"                             
## [192] "Data Age"                              
## [193] "Electronic Arts Victor"                
## [194] "Black Bean Games"                      
## [195] "Jack of All Games"                     
## [196] "989 Sports"                            
## [197] "Takara Tomy"                           
## [198] "Media Rings"                           
## [199] "Elf"                                   
## [200] "Kalypso Media"                         
## [201] "Starfish"                              
## [202] "Zushi Games"                           
## [203] "Jorudan"                               
## [204] "Destination Software, Inc"             
## [205] "New"                                   
## [206] "Brash Entertainment"                   
## [207] "ITT Family Games"                      
## [208] "PopCap Games"                          
## [209] "Home Entertainment Suppliers"          
## [210] "Ackkstudios"                           
## [211] "Starpath Corp."                        
## [212] "P2 Games"                              
## [213] "BPS"                                   
## [214] "Gathering of Developers"               
## [215] "NewKidCo"                              
## [216] "Storm City Games"                      
## [217] "CokeM Interactive"                     
## [218] "CBS Electronics"                       
## [219] "Magix"                                 
## [220] "Marvelous Interactive"                 
## [221] "Nihon Falcom Corporation"              
## [222] "Wargaming.net"                         
## [223] "Angel Studios"                         
## [224] "Arc System Works"                      
## [225] "Playmates"                             
## [226] "SNK Playmore"                          
## [227] "Hamster Corporation"                   
## [228] "From Software"                         
## [229] "Nippon Columbia"                       
## [230] "Nichibutsu"                            
## [231] "Little Orbit"                          
## [232] "Conspiracy Entertainment"              
## [233] "DTP Entertainment"                     
## [234] "Hect"                                  
## [235] "Mumbo Jumbo"                           
## [236] "Pacific Century Cyber Works"           
## [237] "Indie Games"                           
## [238] "Liquid Games"                          
## [239] "NEC"                                   
## [240] "Axela"                                 
## [241] "ArtDink"                               
## [242] "Sunsoft"                               
## [243] "Gust"                                  
## [244] "SNK"                                   
## [245] "NEC Interchannel"                      
## [246] "FuRyu"                                 
## [247] "Xing Entertainment"                    
## [248] "ValuSoft"                              
## [249] "Victor Interactive"                    
## [250] "Detn8 Games"                           
## [251] "American Softworks"                    
## [252] "Nordcurrent"                           
## [253] "Bomb"                                  
## [254] "Falcom Corporation"                    
## [255] "AQ Interactive"                        
## [256] "CCP"                                   
## [257] "Milestone S.r.l."                      
## [258] "Sears"                                 
## [259] "JoWood Productions"                    
## [260] "Seta Corporation"                      
## [261] "On Demand"                             
## [262] "NCS"                                   
## [263] "Aspyr"                                 
## [264] "Gremlin Interactive Ltd"               
## [265] "Agatsuma Entertainment"                
## [266] "Compile Heart"                         
## [267] "Culture Brain"                         
## [268] "Mad Catz"                              
## [269] "Shogakukan"                            
## [270] "Merscom LLC"                           
## [271] "Rebellion Developments"                
## [272] "Nippon Telenet"                        
## [273] "TDK Core"                              
## [274] "bitComposer Games"                     
## [275] "Foreign Media Games"                   
## [276] "Astragon"                              
## [277] "SSI"                                   
## [278] "Kadokawa Games"                        
## [279] "Idea Factory"                          
## [280] "Performance Designed Products"         
## [281] "Asylum Entertainment"                  
## [282] "Core Design Ltd."                      
## [283] "PlayV"                                 
## [284] "UFO Interactive"                       
## [285] "Idea Factory International"            
## [286] "Playlogic Game Factory"                
## [287] "Essential Games"                       
## [288] "Adeline Software"                      
## [289] "Funcom"                                
## [290] "Panther Software"                      
## [291] "Blast! Entertainment Ltd"              
## [292] "Game Life"                             
## [293] "DSI Games"                             
## [294] "Avalon Interactive"                    
## [295] "Popcorn Arcade"                        
## [296] "Neko Entertainment"                    
## [297] "Vir2L Studios"                         
## [298] "Aques"                                 
## [299] "Syscom"                                
## [300] "White Park Bay Software"               
## [301] "System 3"                              
## [302] "Vatical Entertainment"                 
## [303] "Daedalic"                              
## [304] "EA Games"                              
## [305] "Media Factory"                         
## [306] "Vic Tokai"                             
## [307] "The Adventure Company"                 
## [308] "Game Arts"                             
## [309] "Broccoli"                              
## [310] "Acquire"                               
## [311] "General Entertainment"                 
## [312] "Excalibur Publishing"                  
## [313] "Imadio"                                
## [314] "Swing! Entertainment"                  
## [315] "Sony Music Entertainment"              
## [316] "Aqua Plus"                             
## [317] "Paradox Interactive"                   
## [318] "Hip Interactive"                       
## [319] "DreamCatcher Interactive"              
## [320] "Tripwire Interactive"                  
## [321] "Sting"                                 
## [322] "Yacht Club Games"                      
## [323] "SCS Software"                          
## [324] "Bigben Interactive"                    
## [325] "Havas Interactive"                     
## [326] "Slitherine Software"                   
## [327] "Graffiti"                              
## [328] "Funsta"                                
## [329] "Telstar"                               
## [330] "U.S. Gold"                             
## [331] "DreamWorks Interactive"                
## [332] "Data Design Interactive"               
## [333] "MTO"                                   
## [334] "DHM Interactive"                       
## [335] "FunSoft"                               
## [336] "SPS"                                   
## [337] "Bohemia Interactive"                   
## [338] "Reef Entertainment"                    
## [339] "Tru Blu Entertainment"                 
## [340] "Moss"                                  
## [341] "T&E Soft"                              
## [342] "O-Games"                               
## [343] "Aksys Games"                           
## [344] "NDA Productions"                       
## [345] "Data East"                             
## [346] "Time Warner Interactive"               
## [347] "Gainax Network Systems"                
## [348] "Daito"                                 
## [349] "O3 Entertainment"                      
## [350] "Gameloft"                              
## [351] "Xicat Interactive"                     
## [352] "Simon & Schuster Interactive"          
## [353] "Valcon Games"                          
## [354] "PopTop Software"                       
## [355] "TOHO"                                  
## [356] "HMH Interactive"                       
## [357] "5pb"                                   
## [358] "Cave"                                  
## [359] "CDV Software Entertainment"            
## [360] "Microids"                              
## [361] "PM Studios"                            
## [362] "Paon"                                  
## [363] "Micro Cabin"                           
## [364] "GameTek"                               
## [365] "Benesse"                               
## [366] "Type-Moon"                             
## [367] "Enjoy Gaming ltd."                     
## [368] "Asmik Corp"                            
## [369] "Interplay Productions"                 
## [370] "Asmik Ace Entertainment"               
## [371] "inXile Entertainment"                  
## [372] "Image Epoch"                           
## [373] "Phantom EFX"                           
## [374] "Evolved Games"                         
## [375] "responDESIGN"                          
## [376] "Culture Publishers"                    
## [377] "Griffin International"                 
## [378] "Hackberry"                             
## [379] "Hearty Robin"                          
## [380] "Nippon Amuse"                          
## [381] "Origin Systems"                        
## [382] "Seventh Chord"                         
## [383] "Mitsui"                                
## [384] "Milestone"                             
## [385] "Abylight"                              
## [386] "Flight-Plan"                           
## [387] "Glams"                                 
## [388] "Locus"                                 
## [389] "Warp"                                  
## [390] "Daedalic Entertainment"                
## [391] "Alternative Software"                  
## [392] "Myelin Media"                          
## [393] "Mercury Games"                         
## [394] "Irem Software Engineering"             
## [395] "Sunrise Interactive"                   
## [396] "Elite"                                 
## [397] "Evolution Games"                       
## [398] "Tivola"                                
## [399] "Global A Entertainment"                
## [400] "Edia"                                  
## [401] "Athena"                                
## [402] "Aria"                                  
## [403] "Gamecock"                              
## [404] "Tommo"                                 
## [405] "Altron"                                
## [406] "Happinet"                              
## [407] "iWin"                                  
## [408] "Media Works"                           
## [409] "Fortyfive"                             
## [410] "Revolution Software"                   
## [411] "Imax"                                  
## [412] "Crimson Cow"                           
## [413] "10TACLE Studios"                       
## [414] "Groove Games"                          
## [415] "Pack-In-Video"                         
## [416] "Insomniac Games"                       
## [417] "Ascaron Entertainment GmbH"            
## [418] "Asgard"                                
## [419] "Ecole"                                 
## [420] "Yumedia"                               
## [421] "Phenomedia"                            
## [422] "HAL Laboratory"                        
## [423] "Grand Prix Games"                      
## [424] "DigiCube"                              
## [425] "Creative Core"                         
## [426] "Kaga Create"                           
## [427] "WayForward Technologies"               
## [428] "LSP Games"                             
## [429] "ASCII Media Works"                     
## [430] "Coconuts Japan"                        
## [431] "Arika"                                 
## [432] "Ertain"                                
## [433] "Marvel Entertainment"                  
## [434] "Prototype"                             
## [435] "TopWare Interactive"                   
## [436] "Phantagram"                            
## [437] "1C Company"                            
## [438] "The Learning Company"                  
## [439] "TechnoSoft"                            
## [440] "Vap"                                   
## [441] "Misawa"                                
## [442] "Tradewest"                             
## [443] "Team17 Software"                       
## [444] "Yeti"                                  
## [445] "Pow"                                   
## [446] "Navarre Corp"                          
## [447] "MediaQuest"                            
## [448] "Max Five"                              
## [449] "Comfort"                               
## [450] "Monte Christo Multimedia"              
## [451] "Pony Canyon"                           
## [452] "Riverhillsoft"                         
## [453] "Summitsoft"                            
## [454] "Milestone S.r.l"                       
## [455] "Playmore"                              
## [456] "MLB.com"                               
## [457] "Kool Kizz"                             
## [458] "Flashpoint Games"                      
## [459] "49Games"                               
## [460] "Legacy Interactive"                    
## [461] "Alawar Entertainment"                  
## [462] "CyberFront"                            
## [463] "Cloud Imperium Games Corporation"      
## [464] "Societa"                               
## [465] "Virtual Play Games"                    
## [466] "Interchannel"                          
## [467] "Sonnet"                                
## [468] "Experience Inc."                       
## [469] "Zenrin"                                
## [470] "Iceberg Interactive"                   
## [471] "Ivolgamus"                             
## [472] "2D Boy"                                
## [473] "MC2 Entertainment"                     
## [474] "Kando Games"                           
## [475] "Just Flight"                           
## [476] "Office Create"                         
## [477] "Mamba Games"                           
## [478] "Fields"                                
## [479] "Princess Soft"                         
## [480] "Maximum Family Games"                  
## [481] "Berkeley"                              
## [482] "Fuji"                                  
## [483] "Dusenberry Martin Racing"              
## [484] "imageepoch Inc."                       
## [485] "Big Fish Games"                        
## [486] "Her Interactive"                       
## [487] "Kamui"                                 
## [488] "ASK"                                   
## [489] "Headup Games"                          
## [490] "KSS"                                   
## [491] "Cygames"                               
## [492] "KID"                                   
## [493] "Quinrose"                              
## [494] "Sunflowers"                            
## [495] "dramatic create"                       
## [496] "TGL"                                   
## [497] "Encore"                                
## [498] "Extreme Entertainment Group"           
## [499] "Intergrow"                             
## [500] "G.Rev"                                 
## [501] "Sweets"                                
## [502] "Kokopeli Digital Studios"              
## [503] "Number None"                           
## [504] "Nexon"                                 
## [505] "id Software"                           
## [506] "BushiRoad"                             
## [507] "Tryfirst"                              
## [508] "Strategy First"                        
## [509] "7G//AMES"                              
## [510] "GN Software"                           
## [511] "Yuke's"                                
## [512] "Easy Interactive"                      
## [513] "Licensed 4U"                           
## [514] "FuRyu Corporation"                     
## [515] "Lexicon Entertainment"                 
## [516] "Paon Corporation"                      
## [517] "Kids Station"                          
## [518] "GOA"                                   
## [519] "Graphsim Entertainment"                
## [520] "King Records"                          
## [521] "Introversion Software"                 
## [522] "Minato Station"                        
## [523] "Devolver Digital"                      
## [524] "Blue Byte"                             
## [525] "Gaga"                                  
## [526] "Yamasa Entertainment"                  
## [527] "Plenty"                                
## [528] "Views"                                 
## [529] "fonfun"                                
## [530] "NetRevo"                               
## [531] "Codemasters Online"                    
## [532] "Quintet"                               
## [533] "Phoenix Games"                         
## [534] "Dorart"                                
## [535] "Marvelous Games"                       
## [536] "Focus Multimedia"                      
## [537] "Imageworks"                            
## [538] "Karin Entertainment"                   
## [539] "Aerosoft"                              
## [540] "Technos Japan Corporation"             
## [541] "Gakken"                                
## [542] "Mirai Shounen"                         
## [543] "Datam Polystar"                        
## [544] "Saurus"                                
## [545] "HuneX"                                 
## [546] "Revolution (Japan)"                    
## [547] "Giza10"                                
## [548] "Visco"                                 
## [549] "Alvion"                                
## [550] "Mycom"                                 
## [551] "Giga"                                  
## [552] "Warashi"                               
## [553] "System Soft"                           
## [554] "Sold Out"                              
## [555] "Lighthouse Interactive"                
## [556] "Masque Publishing"                     
## [557] "RED Entertainment"                     
## [558] "Michaelsoft"                           
## [559] "Media Entertainment"                   
## [560] "New World Computing"                   
## [561] "Genterprise"                           
## [562] "Interworks Unlimited, Inc."            
## [563] "Boost On"                              
## [564] "Stainless Games"                       
## [565] "EON Digital Entertainment"             
## [566] "Epic Games"                            
## [567] "Naxat Soft"                            
## [568] "Ascaron Entertainment"                 
## [569] "Piacci"                                
## [570] "Nitroplus"                             
## [571] "Paradox Development"                   
## [572] "Otomate"                               
## [573] "Ongakukan"                             
## [574] "Commseed"                              
## [575] "Inti Creates"                          
## [576] "Takuyo"                                
## [577] "Interchannel-Holon"                    
## [578] "Rain Games"                            
## [579] "UIG Entertainment"

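I checked the key variables to see which values they take. Year is stored as text, and converting it to a number produces NAs for entries that are not valid years. Platform covers a wide range of gaming consoles as well as PC. I also looked at which genres exist and how the variable is stored: Genre is a character column with 12 distinct values. Finally, I printed the unique publishers and noted that the data contains 579 of them, including the placeholder values “N/A” and “Unknown”.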
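
The same checks can be made more compactly by counting distinct values rather than printing them all. The sketch below is one possible way to do that, not the code used in this report. It runs on a toy data frame (toy, an illustrative name) built from the six rows shown by head(sales) plus one hypothetical row, which is not part of the dataset and is added only to demonstrate how “N/A” placeholders behave.

```r
library(dplyr)

# Toy stand-in for `sales`: six rows from head(sales) plus ONE HYPOTHETICAL
# row (the last one) that carries "N/A" placeholders for demonstration.
toy <- data.frame(
  Name = c("Wii Sports", "Super Mario Bros.", "Mario Kart Wii",
           "Wii Sports Resort", "Pokemon Red/Pokemon Blue", "Tetris",
           "Hypothetical Game"),
  Platform  = c("Wii", "NES", "Wii", "Wii", "GB", "GB", "PC"),
  Year      = c("2006", "1985", "2008", "2009", "1996", "1989", "N/A"),
  Genre     = c("Sports", "Platform", "Racing", "Sports",
                "Role-Playing", "Puzzle", "Misc"),
  Publisher = c(rep("Nintendo", 6), "N/A"),
  stringsAsFactors = FALSE
)

# Number of distinct values in each categorical column
distinct_counts <- toy %>%
  summarise(across(c(Platform, Genre, Publisher), n_distinct))

# Turn the "N/A" strings into real NAs before converting, so as.numeric()
# does not raise a coercion warning
toy <- toy %>%
  mutate(Year_num = as.numeric(na_if(Year, "N/A")))

n_missing_year <- sum(is.na(toy$Year_num))
year_range <- range(toy$Year_num, na.rm = TRUE)

print(distinct_counts)
print(n_missing_year)
print(year_range)
```

On the full dataset the same pattern would apply to sales directly; whether “Unknown” should also be treated as missing is a judgment call that this sketch leaves open.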

Data preparation

For clustering, I wanted to group the games by “taste”, i.e. to analyze what shapes the taste of gamers. Is it the country or region they live in? Do they have distinct favorite game publishers? Is there a genre they enjoy more than others in a certain region? Which gaming platforms are used most in each group?

To answer these questions, I computed each region’s share of a game’s global sales. Because every share is a proportion between 0 and 1, all four variables are already on the same scale, so the clustering is “fair” without a separate normalization step. Additionally, since the full dataset is too large for the slower clustering methods, I took a random subset of 1000 records. K-Means, PAM, and Hierarchical Clustering were chosen because the data is simple and its structure does not call for something like DBSCAN or Spectral clustering.

sales$NA_Share <- sales$NA_Sales / sales$Global_Sales
sales$EU_Share <- sales$EU_Sales / sales$Global_Sales
sales$JP_Share <- sales$JP_Sales / sales$Global_Sales
sales$Other_Share <- sales$Other_Sales / sales$Global_Sales
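As a quick sanity check on the share variables, here is a minimal sketch on a made-up three-row data frame (the `toy` values are invented for illustration and are not taken from vgsales.csv). The four shares of a game should add up to roughly 1. In the real data, small deviations are to be expected because the published sales figures are rounded, and the 0.05 tolerance below is only an assumption.

```r
# Toy data frame with the same column names as vgsales.csv (values are invented)
toy <- data.frame(
  NA_Sales     = c(4.00, 0.10, 0.00),
  EU_Sales     = c(3.00, 0.05, 0.00),
  JP_Sales     = c(2.00, 0.00, 0.30),
  Other_Sales  = c(1.00, 0.05, 0.00),
  Global_Sales = c(10.00, 0.20, 0.30)
)

region_cols <- c("NA_Sales", "EU_Sales", "JP_Sales", "Other_Sales")

# Divide every regional column by the global total, row by row
toy_shares <- toy[region_cols] / toy$Global_Sales
names(toy_shares) <- sub("_Sales$", "_Share", region_cols)

# Each row should sum to about 1
share_sums <- rowSums(toy_shares)
print(share_sums)

# Flag rows whose shares deviate noticeably from 1 (tolerance is an assumption)
suspicious <- which(abs(share_sums - 1) > 0.05)
print(suspicious)
```

On the real data, the same check could be run on the four `*_Share` columns of `sales` before sampling.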

set.seed(123)
sales_subset <- sales %>%
  sample_n(1000)

subset_shares <- sales_subset %>%
  select(NA_Share, EU_Share, JP_Share, Other_Share)
summary(subset_shares)
##     NA_Share         EU_Share         JP_Share       Other_Share     
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.00000  
##  1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.00000  
##  Median :0.5088   Median :0.1773   Median :0.0000   Median :0.06000  
##  Mean   :0.4750   Mean   :0.2171   Mean   :0.2292   Mean   :0.06646  
##  3rd Qu.:0.7626   3rd Qu.:0.3670   3rd Qu.:0.2783   3rd Qu.:0.10372  
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :0.75000
str(subset_shares)
## 'data.frame':    1000 obs. of  4 variables:
##  $ NA_Share   : num  0.838 0.856 0.55 0.875 0.537 ...
##  $ EU_Share   : num  0.0294 0.1261 0.3833 0 0.2927 ...
##  $ JP_Share   : num  0 0 0 0 0 0 0 0 1 0 ...
##  $ Other_Share: num  0.1324 0.027 0.0667 0.125 0.1463 ...
set.seed(123) 
hopkins_result <- get_clust_tendency(subset_shares, n = nrow(subset_shares) - 1)
## Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
## ℹ Please use tidy evaluation idioms with `aes()`.
## ℹ See also `vignette("ggplot2-in-packages")` for more information.
## ℹ The deprecated feature was likely used in the factoextra package.
##   Please report the issue at <https://github.com/kassambara/factoextra/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
hopkins_result
## $hopkins_stat
## [1] 0.9330738
## 
## $plot

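The Hopkins statistic is ~0.93. Under the convention used here, where values close to 1 indicate strong cluster tendency and values near 0.5 indicate uniformly scattered data, the data is highly clusterable, so we can proceed and look for insights.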
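To make the idea behind the statistic concrete, here is a rough base R sketch of one common formulation. It is not the factoextra implementation, which may differ in details (for example, some definitions raise the distances to the power of the number of dimensions). The function name `hopkins_sketch` and the toy data are hypothetical.

```r
# Hopkins statistic, one common form: H = sum(u) / (sum(u) + sum(w))
#   w: distance from m sampled data points to their nearest other data point
#   u: distance from m uniform random points to their nearest data point
hopkins_sketch <- function(x, m = 20) {
  x <- as.matrix(x)
  n <- nrow(x)
  idx <- sample(n, m)

  # Nearest-neighbour distance among the real points (self-distance excluded)
  d_real <- as.matrix(dist(x))
  diag(d_real) <- Inf
  w <- apply(d_real[idx, , drop = FALSE], 1, min)

  # Uniform points drawn inside the bounding box of the data (m rows, one column per variable)
  unif <- sapply(seq_len(ncol(x)), function(j) runif(m, min(x[, j]), max(x[, j])))
  u <- apply(unif, 1, function(p) min(sqrt(colSums((t(x) - p)^2))))

  sum(u) / (sum(u) + sum(w))
}

set.seed(1)
# Two tight, well-separated blobs (toy data)
clustered <- rbind(
  matrix(rnorm(100, mean = 0, sd = 0.05), ncol = 2),
  matrix(rnorm(100, mean = 1, sd = 0.05), ncol = 2)
)
# Structureless points for comparison
uniform <- matrix(runif(200), ncol = 2)

h_clustered <- hopkins_sketch(clustered)
h_uniform <- hopkins_sketch(uniform)
print(c(clustered = h_clustered, uniform = h_uniform))
```

With clustered data, the random points tend to land far from any observation while real points sit close to their neighbours, which pushes the ratio towards 1.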

K-Means Clustering

There is no better algorithm to start with than good old K-Means. To choose the number of clusters, I plot the average silhouette width and the gap statistic. Two clusters have the highest silhouette score, but the gap statistic indicates 6 as the best choice. Since the gap statistic for 2 clusters is much worse, while the silhouette score for 6 clusters is fairly close to that for 2, I chose k = 6. Moreover, to analyze the taste of gamers I expect 4 or more clusters, at least one per region, so 2 clusters would not make much sense.

fviz_nbclust(
  subset_shares,
  kmeans,
  method = "silhouette"
) +
  labs(title = "Optimal Number of Clusters Using Silhouette Method")

fviz_nbclust(
  subset_shares,
  kmeans,
  method = "gap_stat"
) +
  labs(title = "Optimal Number of Clusters Using GAP Method")
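The silhouette comparison can also be sketched without factoextra, which makes it easier to see what the plot summarizes. The snippet below is an illustration on invented toy data (three well-separated blobs), using only `kmeans()` and `cluster::silhouette()`; the names `toy`, `avg_sil`, and `best_k` are hypothetical.

```r
library(cluster)  # silhouette()

set.seed(42)
# Three well-separated toy blobs in two dimensions
toy <- rbind(
  matrix(rnorm(60, mean = 0, sd = 0.3), ncol = 2),
  matrix(rnorm(60, mean = 5, sd = 0.3), ncol = 2),
  cbind(rnorm(30, mean = 10, sd = 0.3), rnorm(30, mean = 0, sd = 0.3))
)

d <- dist(toy)
ks <- 2:6

# Average silhouette width for each candidate k
avg_sil <- sapply(ks, function(k) {
  km_k <- kmeans(toy, centers = k, nstart = 25)
  mean(silhouette(km_k$cluster, d)[, "sil_width"])
})
names(avg_sil) <- ks
print(round(avg_sil, 3))

# The k with the highest average silhouette width
best_k <- ks[which.max(avg_sil)]
print(best_k)
```

On the real shares the maximum is at k = 2, so the vector of average widths is what matters here: it shows how much silhouette is given up by moving to a larger k.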

km <- kmeans(subset_shares, centers = 6, nstart = 25)
fviz_cluster(km, data = subset_shares, geom = "point")

To visualize the clusters, the data was projected onto its first two principal components, which together cover about 82% of the variance, which is good. We can see that the most distinct clusters are 3 and 5.
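The variance figure on the plot axes can be reproduced with `prcomp()`. The sketch below uses invented compositional data (four parts that sum to 1 per row) rather than the real shares, and it assumes the columns are standardized before the projection, which I believe matches the default of `fviz_cluster` but should be treated as an assumption.

```r
set.seed(7)
# Toy compositional data: four non-negative parts that sum to 1 per row
raw <- matrix(rexp(400), ncol = 4,
              dimnames = list(NULL, c("NA_Share", "EU_Share", "JP_Share", "Other_Share")))
toy_shares <- raw / rowSums(raw)

# PCA on standardized columns
pca <- prcomp(toy_shares, center = TRUE, scale. = TRUE)

# Proportion of variance per component and the cumulative share of the first two
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
print(round(var_explained, 3))
first_two <- sum(var_explained[1:2])
print(round(first_two, 3))
```

Because the four shares are constrained to sum to 1, the data only spans three dimensions, so the last component carries essentially no variance. This is one reason a two-dimensional projection can capture so much of the structure.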

sil <- silhouette(km$cluster, dist(subset_shares))
fviz_silhouette(sil)
##   cluster size ave.sil.width
## 1       1   80          0.55
## 2       2  238          0.51
## 3       3  206          0.91
## 4       4  202          0.36
## 5       5  211          0.63
## 6       6   63          0.30

The silhouette plot confirms my impression of the distinct clusters: cluster 3 has the highest average silhouette width at 0.91 and cluster 5 the second highest at 0.63. Clusters 4 and 6 have the lowest widths (0.36 and 0.30), indicating overlap and weakly distinct structure.

sales_subset$Cluster <- factor(km$cluster)
ggplot(sales_subset, aes(x = Cluster)) +
  geom_bar(fill = "steelblue") +
  labs(title = "Number of Games per Cluster", y = "Count")

Hierarchical Clustering

Before clustering, I want to check the correlation between the variables that go into the distance matrix.

cor(subset_shares)
##                NA_Share   EU_Share   JP_Share Other_Share
## NA_Share     1.00000000 -0.2376522 -0.7221810  0.06620558
## EU_Share    -0.23765221  1.0000000 -0.4531807  0.30628892
## JP_Share    -0.72218102 -0.4531807  1.0000000 -0.38871250
## Other_Share  0.06620558  0.3062889 -0.3887125  1.00000000

Only one pair, NA/JP, is strongly negatively correlated (about -0.72), but this is not an extreme case, so I will proceed.

d <- dist(subset_shares, method = "euclidean")
hc <- agnes(d, method = "complete")
hc$ac
## [1] 0.9868188
pltree(hc, cex = 0.6, hang = -1, main = "Dendrogram", labels = FALSE)
rect.hclust(hc, k = 4, border = 2:5)

hc_clusters <- cutree(hc, k = 4)
sales_subset$hc_cluster <- hc_clusters
fviz_cluster(list(data = subset_shares, cluster = hc_clusters))

Cutting at 4 clusters looks like a good choice: cutting lower in the tree produces too many small groupings, and these 4 are quite distinct. In addition, this method gives a very high agglomerative coefficient (0.99). The plot, which captures ~82% of the variance, shows a good result: there are some overlaps, but overall the clusters are fairly distinct.
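The agglomerative coefficient can also be used to compare linkage methods before settling on one. The following is a small sketch on invented two-blob data, using `cluster::agnes()`; the list of linkages and the names `toy`, `ac`, and `groups` are illustrative choices, not part of the analysis above. Note that the coefficient tends to grow with the number of observations, so it is best compared across methods on the same data rather than read as an absolute score.

```r
library(cluster)  # agnes()

set.seed(3)
# Two well-separated toy blobs
toy <- rbind(
  matrix(rnorm(40, mean = 0, sd = 0.4), ncol = 2),
  matrix(rnorm(40, mean = 4, sd = 0.4), ncol = 2)
)

linkages <- c("average", "single", "complete", "ward")

# Agglomerative coefficient per linkage; values closer to 1 suggest stronger structure
ac <- sapply(linkages, function(m) agnes(toy, method = m)$ac)
print(round(ac, 3))

# Cut the complete-linkage tree into two groups via its hclust representation
hc_toy <- as.hclust(agnes(toy, method = "complete"))
groups <- cutree(hc_toy, k = 2)
print(table(groups))
```

Converting with `as.hclust()` before `cutree()` is a conservative choice here, since `cutree()` is documented for trees produced by `hclust()`.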

PAM Clustering

fviz_nbclust(
  subset_shares,
  pam,
  method = "silhouette"
) +
  labs(title = "Optimal Number of Clusters Using Silhouette Method")

fviz_nbclust(
  subset_shares,
  pam,
  method = "gap_stat"
) +
  labs(title = "Optimal Number of Clusters Using GAP Method")

set.seed(123)
pam_model <- pam(subset_shares, k = 6)

sales_subset$pam_cluster <- pam_model$clustering

pam_model$silinfo$avg.width
## [1] 0.5791628
plot(pam_model)

Even though the gap statistic keeps increasing and the silhouette suggests 2 clusters, I chose 6 clusters again: the average silhouette width is still relatively good (0.58), and the gap statistic suggests too many clusters.

Calinski-Harabasz Analysis

data_matrix <- as.matrix(subset_shares)
ch_km <- intCriteria(traj = data_matrix,
                     part = as.integer(sales_subset$Cluster),
                     crit = "Calinski_Harabasz")$calinski_harabasz
ch_km
## [1] 3388.761
ch_hc <- intCriteria(traj = data_matrix,
                     part = sales_subset$hc_cluster,
                     crit = "Calinski_Harabasz")$calinski_harabasz
ch_hc
## [1] 2932.91
ch_pam <- intCriteria(traj = data_matrix,
                      part = sales_subset$pam_cluster,
                      crit = "Calinski_Harabasz")$calinski_harabasz
ch_pam
## [1] 3346.91

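According to the CH index, K-Means gives the best clustering (3388.8), closely followed by PAM (3346.9). Hierarchical clustering scores lower than the other two (2932.9), although it was cut at 4 clusters rather than 6, so the comparison is not entirely like for like.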
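For readers unfamiliar with the index, here is a minimal base R sketch of what it measures: the ratio of between-cluster to within-cluster dispersion, each divided by its degrees of freedom. The helper `ch_index` and the toy data are hypothetical, and the sketch is not the clusterCrit implementation, though on the same inputs the two should agree.

```r
# Calinski-Harabasz index: CH = (B / (k - 1)) / (W / (n - k))
#   B: between-cluster sum of squares, W: within-cluster sum of squares
ch_index <- function(x, labels) {
  x <- as.matrix(x)
  n <- nrow(x)
  k <- length(unique(labels))
  overall <- colMeans(x)

  W <- 0
  B <- 0
  for (g in unique(labels)) {
    members <- x[labels == g, , drop = FALSE]
    center <- colMeans(members)
    W <- W + sum(sweep(members, 2, center)^2)           # spread around the cluster mean
    B <- B + nrow(members) * sum((center - overall)^2)  # weighted distance to the grand mean
  }
  (B / (k - 1)) / (W / (n - k))
}

set.seed(11)
toy <- rbind(
  matrix(rnorm(60, mean = 0, sd = 0.5), ncol = 2),
  matrix(rnorm(60, mean = 3, sd = 0.5), ncol = 2),
  cbind(rnorm(30, mean = 6, sd = 0.5), rnorm(30, mean = 0, sd = 0.5))
)
km_toy <- kmeans(toy, centers = 3, nstart = 25)

ch_manual <- ch_index(toy, km_toy$cluster)
# The same quantity from the sums of squares that kmeans() already reports
ch_from_km <- (km_toy$betweenss / (3 - 1)) / (km_toy$tot.withinss / (nrow(toy) - 3))
print(c(manual = ch_manual, from_kmeans = ch_from_km))
```

Higher values mean tighter, better separated clusters. Because the index depends on k through both degrees-of-freedom terms, comparing partitions with different numbers of clusters should be done with some care.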

sales_subset %>%
  group_by(Cluster) %>%
  summarise(Avg_Global = mean(Global_Sales)) %>%
  ggplot(aes(x = Cluster, y = Avg_Global, fill = Cluster)) +
  geom_col() +
  labs(title = "K-means: Avg Global Sales")

sales_subset %>%
  group_by(hc_cluster) %>%
  summarise(Avg_Global = mean(Global_Sales)) %>%
  ggplot(aes(x = hc_cluster, y = Avg_Global, fill = factor(hc_cluster))) +
  geom_col() +
  labs(title = "Hierarchical: Avg Global Sales")

sales_subset %>%
  group_by(pam_cluster) %>%
  summarise(Avg_Global = mean(Global_Sales)) %>%
  ggplot(aes(x = pam_cluster, y = Avg_Global, fill = factor(pam_cluster))) +
  geom_col() +
  labs(title = "PAM: Avg Global Sales")

Comparing the clustering techniques by average Global Sales, we can see that PAM and K-Means are similar. The only difference is in clusters 4 and 5: in K-Means, cluster 4 has far higher sales than cluster 5, and the opposite is true in PAM. Cluster 6 is the highest selling in both techniques, followed by cluster 2 with about half of those sales. For hierarchical clustering, cluster 2 is the highest selling, followed by cluster 1, while clusters 3 and 4 have far lower sales.

Average Regional Share Per Cluster

make_region_plot <- function(df, cluster_col, title_prefix) {
  df %>%
    group_by({{ cluster_col }}) %>%
    summarise(across(c(NA_Share, EU_Share, JP_Share, Other_Share), mean)) %>%
    pivot_longer(-{{ cluster_col }}) %>%
    ggplot(aes(x = name, y = value, fill = factor({{ cluster_col }}))) +
    geom_col(position = "dodge") +
    labs(title = paste(title_prefix, "Avg Regional Share"),
         x = "Region", y = "Share")
}

make_region_plot(sales_subset, Cluster, "K-means")

make_region_plot(sales_subset, hc_cluster, "Hierarchical")

make_region_plot(sales_subset, pam_cluster, "PAM")

Now we come to the interesting part.

In K-Means, Cluster 1 is dominated by the EU share, while Cluster 3 is dominated by Japan. Only 2 clusters are prominent in the Japanese share, 3 and 6, indicating Japanese uniqueness. Clusters 4 and 5 are mostly seen in North America. Clusters 2 and 6 are moderate ones containing some share from each region; however, cluster 2 is almost invisible in the Japanese market.

For Hierarchical, the 4th cluster is almost exclusively Japanese and the 3rd cluster is dominant in Europe. Clusters 1 and 2 appear mostly in the North American share, while also appearing in Europe and a little in Other parts of the world. Cluster 3 also has small shares in Other and NA.

Lastly, PAM shows almost the same structure of Regional Shares as K-Means; only the cluster labels differ.
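Whether two methods found the same groups under different labels can be checked with a simple cross-tabulation. The sketch below runs on invented toy blobs, not on the sales data, and the names `agreement`, `best_match`, and `matched_share` are hypothetical.

```r
library(cluster)  # pam()

set.seed(5)
# Three well-separated toy blobs
toy <- rbind(
  matrix(rnorm(60, mean = 0, sd = 0.3), ncol = 2),
  matrix(rnorm(60, mean = 4, sd = 0.3), ncol = 2),
  cbind(rnorm(30, mean = 8, sd = 0.3), rnorm(30, mean = 0, sd = 0.3))
)

km_labels  <- kmeans(toy, centers = 3, nstart = 25)$cluster
pam_labels <- pam(toy, k = 3)$clustering

# Rows: K-Means labels, columns: PAM labels
agreement <- table(kmeans = km_labels, pam = pam_labels)
print(agreement)

# For each K-Means cluster, the PAM cluster that shares the most points
best_match <- apply(agreement, 1, which.max)
print(best_match)

# Share of points that fall in matched pairs (1 means identical partitions)
matched_share <- sum(apply(agreement, 1, max)) / sum(agreement)
print(matched_share)
```

Applied to `sales_subset$Cluster` and `sales_subset$pam_cluster`, the same table would show which PAM label corresponds to which K-Means label and how many games the two methods assign differently.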

Top genres per cluster

make_genre_plot <- function(df, cluster_col, title_prefix) {
  df %>%
    count({{ cluster_col }}, Genre) %>%
    group_by({{ cluster_col }}) %>%
    slice_max(n, n = 5) %>%
    ggplot(aes(x = reorder(Genre, n), y = n, fill = factor({{ cluster_col }}))) +
    geom_col() +
    facet_wrap(vars({{ cluster_col }}), scales = "free_y", ncol = 2) +
    coord_flip() +
    labs(title = paste(title_prefix, "Top Genres per Cluster"),
         x = "Genre", y = "Count")
}

make_genre_plot(sales_subset, Cluster, "K-means")

make_genre_plot(sales_subset, hc_cluster, "Hierarchical")

make_genre_plot(sales_subset, pam_cluster, "PAM")

In the genre plots for K-Means, Japan’s two prominent clusters, 3 and 6, are led by Role-Playing games, which indicates that this is the most popular genre in Japan, followed by Adventure and Action. Cluster 5, the North American cluster, has Sports as its top genre, which is quite unique, since Sports is not the most popular genre in any other cluster. All other clusters mostly contain Action games, along with many Shooter, Sports, Racing, and Misc titles. European cluster 1 also has Simulation and Strategy as its 5th and 6th genres. Cluster 4 has the largest gap between the 1st genre and the rest: Action is the most popular here, with Platform in 2nd position. This is mainly a North American base of gamers, with a smaller share of European and Other World gamers. The second cluster does not give any specific insight, because it appears in all three regions other than Japan and its most popular genres are the typical Action, Sports, and Shooter.

Again, the PAM plots are nearly identical to K-Means, and the Hierarchical plots omit some insights rather than adding new information, so I do not include an analysis of those.

Top games per cluster

make_top_games_plot <- function(df, cluster_col, title_prefix) {
  df %>%
    group_by({{ cluster_col }}) %>%
    slice_max(Global_Sales, n = 5) %>%
    ggplot(aes(x = reorder(Name, Global_Sales), y = Global_Sales, fill = factor({{ cluster_col }}))) +
    geom_col() +
    facet_wrap(vars({{ cluster_col }}), scales = "free_y", ncol = 2) +
    coord_flip() +
    labs(title = paste(title_prefix, "Top 5 Games by Global Sales per Cluster"),
         x = "Game", y = "Global Sales (Millions)")
}

make_top_games_plot(sales_subset, Cluster, "K-means")

make_top_games_plot(sales_subset, hc_cluster, "Hierarchical")

make_top_games_plot(sales_subset, pam_cluster, "PAM")

Japanese cluster 6 is unique again: its most popular games are Pokemon Yellow and Wii Party, which do not appear in any other cluster’s top 5. Cluster 5 was the North American cluster where Sports games were most appreciated, and the top 5 list reflects that, with NFL games and one NBA game. In the other American-European cluster, 4, where Action games were most popular, GTA 4 and Halo 3 are the top 2. Surprisingly, even though Action was more popular than Sports in European cluster 1, FIFA Soccer 06 is in first place and PES 2008 in third, and two more football games are in the top 5 together with Art Academy. This suggests once again that Americans play American football, while Europeans play the original football (soccer). Cluster 2, which is shared almost equally by Europe and North America (with a small share of Other World), has Call of Duty: Black Ops as its most popular game, followed by Forza Motorsport 3 and some other Action and Sports games. This is very typical, since the most popular genres in this cluster were Action, Sports, and Racing. This cluster is where European and American tastes overlap.

Top platforms per cluster

make_platform_plot <- function(df, cluster_col, title_prefix) {
  df %>%
    count({{ cluster_col }}, Platform) %>%
    group_by({{ cluster_col }}) %>%
    slice_max(n, n = 5) %>%
    ggplot(aes(x = reorder(Platform, n), y = n, fill = factor({{ cluster_col }}))) +
    geom_col() +
    facet_wrap(vars({{ cluster_col }}), scales = "free_y", ncol = 2) +
    coord_flip() +
    labs(title = paste(title_prefix, "Top Platforms per Cluster"),
         x = "Platform", y = "Count")
}

make_platform_plot(sales_subset, Cluster, "K-means")

make_platform_plot(sales_subset, hc_cluster, "Hierarchical")

make_platform_plot(sales_subset, pam_cluster, "PAM")

Looking at the most popular platforms, in Cluster 1 (Europe) PC is by far the most popular, and PlayStation 3 comes next. In Cluster 6, many platforms have approximately the same number of games, but PSP and 3DS are the most popular. In cluster 5, which has the highest and almost exclusively North American shares, DS and Wii are the two most popular, while PSP, X360, and PS2 are also in the top 5. Cluster 4, which has mostly NA and some European share, has the most games on GBA, followed by XB, GC, X360, and DS. In the almost exclusively Japanese cluster 3, PSP and DS have by far the highest number of games; PS2, PS, and PSV are also in the top 5, but far behind. All 5 of these platforms are of Japanese origin, once again indicating Japanese exclusiveness. Cluster 2, which has fair shares in both North America and Europe and some share in Other parts of the world, has the most games on PS2 and PS; lower on the list are X360, PS3, and DS.

Overall, across all the top 5 lists, the only non-Japanese platforms are Xbox 360 (X360), Xbox (XB), and PC, and these appear only in the top lists of non-Japanese markets. Almost all other top platforms (mainly consoles) are made by either Nintendo or Sony.

Interesting fact: PSP (PlayStation Portable), PSV (PlayStation Vita), PS, PS2, and PS3 were created by Sony. DS, Wii, GC, GB, 3DS, and GBA were created by Nintendo.

Proof of subset insights validity

Since K-Means scored best on the CH index and is computationally cheap, I ran K-Means on the whole dataset to check whether it shows the same results as the subset. I expected all plots and insights to be nearly the same, except the top 5 games, top 5 platforms, and maybe publishers, since the whole dataset might include more popular games. This is therefore also a good way to see the actual 5 most popular games in each cluster. I only test K-Means, because it is the only one of the 3 algorithms I chose that is fast enough.

First Proof - Same Silhouette graph

The gap statistic is extremely slow to compute on the whole set, but since I got the same silhouette graph, I assume the gap statistic would also be approximately the same, supporting the previous choice of 6 clusters.

sales_shares <- sales %>% 
  select(NA_Share, EU_Share, JP_Share, Other_Share)

fviz_nbclust(
  sales_shares,
  kmeans,
  method = "silhouette"
) +
  labs(title = "Optimal Number of Clusters Using Silhouette Method")

Second Proof - Similar Clusters Structure and Silhouette

We obtained the same average silhouette width of 0.58 as before, and the cluster structure is the same, even though the far larger number of points may make it harder to spot. Some clusters’ silhouette widths have improved and some are a little worse. Overall, this is a good result.

km <- kmeans(sales_shares, centers = 6, nstart = 25)

fviz_cluster(km, data = sales_shares, geom = "point")

sil <- silhouette(km$cluster, dist(sales_shares))
fviz_silhouette(sil)
##   cluster size ave.sil.width
## 1       1 4230          0.45
## 2       2 1466          0.54
## 3       3 2971          0.62
## 4       4  939          0.36
## 5       5 3532          0.94
## 6       6 3460          0.41

sales$Cluster <- km$cluster
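Keep in mind that K-Means numbers its clusters arbitrarily, so cluster 5 in this run need not correspond to cluster 5 in the subset run. One way to keep the comparison readable is to name each cluster by the region with the largest mean share. The sketch below uses an invented `centers` matrix for illustration; with a fitted model, `km$centers` would take its place.

```r
# Hypothetical cluster centers (rows) over the four share variables; values are invented
centers <- rbind(
  c(0.85, 0.05, 0.02, 0.08),
  c(0.02, 0.01, 0.96, 0.01),
  c(0.10, 0.75, 0.03, 0.12)
)
colnames(centers) <- c("NA_Share", "EU_Share", "JP_Share", "Other_Share")
rownames(centers) <- 1:3

# Region with the largest mean share in each cluster
dominant <- colnames(centers)[apply(centers, 1, which.max)]
names(dominant) <- rownames(centers)
print(dominant)

# Attach the region name to a vector of cluster labels (one label per game)
labels <- c(1, 3, 2, 2, 1)
region_of_game <- dominant[as.character(labels)]
print(unname(region_of_game))
```

This only gives a coarse name: clusters with mixed shares, such as the shared European and North American ones discussed here, would still need a closer look at the full center vector.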

Plots of Publishers, Genres, Shares

sales %>%
    group_by(Cluster) %>%
    summarise(across(c(NA_Share, EU_Share, JP_Share, Other_Share), mean)) %>%
    pivot_longer(-Cluster) %>%
    ggplot(aes(x = name, y = value, fill = factor(Cluster))) +
    geom_col(position = "dodge") +
    labs(title = "Avg Regional Share",
         x = "Region", y = "Share")

sales %>%
    count(Cluster, Publisher) %>%
    group_by(Cluster) %>%
    slice_max(n, n = 5) %>%
    ggplot(aes(x = reorder(Publisher, n), y = n, fill = factor(Cluster))) +
    geom_col() +
    facet_wrap(vars(Cluster), scales = "free", ncol = 2) +
    coord_flip() +
    labs(title = "Top Publishers",
         x = "Publisher", y = "Count")

sales %>%
    count(Cluster, Genre) %>%
    group_by(Cluster) %>%
    slice_max(n, n = 5) %>%
    ggplot(aes(x = reorder(Genre, n), y = n, fill = factor(Cluster))) +
    geom_col() +
    facet_wrap(vars(Cluster), scales = "free_y", ncol = 2) +
    coord_flip() +
    labs(title = "Top Genres per Cluster",
        x = "Genre", y = "Count")

sales %>%
    group_by(Cluster) %>%
    slice_max(Global_Sales, n = 5) %>%
    ggplot(aes(x = reorder(Name, Global_Sales), y = Global_Sales, fill = factor(Cluster))) +
    geom_col() +
    facet_wrap(vars(Cluster), scales = "free_y", ncol = 2) +
    coord_flip() +
    labs(title = "Top 5 Games by Global Sales per Cluster",
        x = "Game", y = "Global Sales (Millions)")

sales %>%
    count(Cluster, Platform) %>%
    group_by(Cluster) %>%
    slice_max(n, n = 5) %>%
    ggplot(aes(x = reorder(Platform, n), y = n, fill = factor(Cluster))) +
    geom_col() +
    facet_wrap(vars(Cluster), scales = "free_y", ncol = 2) +
    coord_flip() +
    labs(title = "Top Platforms per Cluster",
         x = "Platform", y = "Count")

To sum everything up, some of the shares in North America and Europe have changed, but this does not indicate a strong difference in taste, because Europe and North America have a lot more in common than not. All of the top publishers and genres stayed the same. As expected, the top 5 games are a completely different list, because we now look at the whole set. The top games in cluster 6 sold much more than those in other clusters. Still, European cluster 5 has “good old” FIFA in first place (now FIFA 16 instead of FIFA 06), and North American cluster 1 has American football games, with Pac-Man and Duck Hunt added. Cluster 2 is very interesting because its top 5 games are all Pokemon, yet it is not a Japanese-only cluster (its shares are worldwide). Cluster 3 is a Japanese cluster whose top 5 consists only of Adventure games, corresponding to the Japanese genre taste. Cluster 4 contains 3 Super Mario games, Kinect Adventures, and Tetris; this is an American-European cluster, and these are games I did not see in the subset analysis, which is great. Kinect is an Xbox-specific technology, confirming the relative popularity of Xbox in these regions. Judging by its top 5 games, cluster 6 is a Nintendo cluster, which is really interesting, as it has almost only European and North American shares even though Nintendo is a Japanese publisher. In the Platform plot, I did not spot any new insights compared with the subset analysis.

Conclusion

To conclude, my aim was to obtain at least 4 clusters, so I tried 4 for Hierarchical and 6 for K-Means and PAM, and 6 proved to give more information. That is also why I did not try any other number of clusters: more than 6 would be hard to interpret and give worse results, and fewer than 4 would not give much information. The only choice that was not checked is 5, but its silhouette and gap statistic are a little worse than for 6, and I expect similar results.

Regarding insights, I learnt that the Japanese market is much more exclusive than the others in terms of games played, genres, and consoles. In this study, Microsoft’s Xbox consoles were mainly used in Europe and North America, while Japan mainly used Japanese-made consoles from Nintendo and Sony. Nevertheless, Nintendo games were played more in NA and EU, while Japan preferred Namco Bandai. Role-Playing was the most popular genre in Japan, alongside Adventure. Action was generally the most popular genre worldwide, and North America enjoyed Sports more than other regions.

The study proved useful, answering all the questions asked and more. A combination of domain knowledge and interest in the dataset helped to achieve these results. Hopefully, the reader found this study interesting as well and learnt some new facts. Thank you for reading!