Introduction

In this post, I carry out an exploratory data analysis (EDA) of a dataset listing video games with sales greater than 100,000 copies. The data was collected from various e-commerce merchants as part of a research study. Let's get started!

Input Data

Make sure the data file is already in the R project folder. We can then import it as follows:

vgsales <- read.csv("vgsales.csv")
vgsales
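In this file, missing years appear to be stored as the text "N/A", which would explain why `str()` later reports `Year` as a character column. A minimal sketch, assuming that encoding, of letting `read.csv()` convert such entries to `NA` at import time. The small CSV written to a temporary file is made up for illustration:

```r
# Hypothetical three-row sample that mimics the layout of vgsales.csv
csv_lines <- c(
  "Rank,Name,Year,Global_Sales",
  "1,Game A,2006,82.74",
  "2,Game B,N/A,40.24",
  "3,Game C,1985,35.82"
)
tmp <- tempfile(fileext = ".csv")
writeLines(csv_lines, tmp)

# na.strings tells read.csv() which text values to treat as missing
demo <- read.csv(tmp, na.strings = c("N/A", "NA"))
str(demo)
```

Note that `na.strings` applies to every column, so any other column containing "N/A" text would gain missing values too, and the row counts further down could differ.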

Data Inspection

Once the data is loaded, we do a simple inspection to find out its structure and dimensions:

dim(vgsales)
#> [1] 16598    11

This dataset has 16,598 rows and 11 columns.

Let’s see which columns are in the dataset:

names(vgsales)
#>  [1] "Rank"         "Name"         "Platform"     "Year"         "Genre"       
#>  [6] "Publisher"    "NA_Sales"     "EU_Sales"     "JP_Sales"     "Other_Sales" 
#> [11] "Global_Sales"

View the first 5 rows of the data frame:

head(vgsales,5)

View the last 5 rows of the data frame:

tail(vgsales,5)

Data Cleansing

We need to clean the data first so that it can be processed later. So as not to alter the original data, we will work on a new object, “vgsales2”.

vgsales2 <- vgsales

View the data type of each column:

str(vgsales2)
#> 'data.frame':    16598 obs. of  11 variables:
#>  $ Rank        : int  1 2 3 4 5 6 7 8 9 10 ...
#>  $ Name        : chr  "Wii Sports" "Super Mario Bros." "Mario Kart Wii" "Wii Sports Resort" ...
#>  $ Platform    : chr  "Wii" "NES" "Wii" "Wii" ...
#>  $ Year        : chr  "2006" "1985" "2008" "2009" ...
#>  $ Genre       : chr  "Sports" "Platform" "Racing" "Sports" ...
#>  $ Publisher   : chr  "Nintendo" "Nintendo" "Nintendo" "Nintendo" ...
#>  $ NA_Sales    : num  41.5 29.1 15.8 15.8 11.3 ...
#>  $ EU_Sales    : num  29.02 3.58 12.88 11.01 8.89 ...
#>  $ JP_Sales    : num  3.77 6.81 3.79 3.28 10.22 ...
#>  $ Other_Sales : num  8.46 0.77 3.31 2.96 1 0.58 2.9 2.85 2.26 0.47 ...
#>  $ Global_Sales: num  82.7 40.2 35.8 33 31.4 ...

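Based on the output above, we need to apply explicit coercion, that is, convert each column to the appropriate data type. The columns to be changed are:

- Genre -> factor
- Publisher -> factor
- Platform -> factor
- Year -> integer

Values in Year that cannot be parsed as a number are turned into NA by as.integer(), with a warning.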

vgsales2$Genre <- as.factor(vgsales2$Genre)
vgsales2$Publisher <- as.factor(vgsales2$Publisher)
vgsales2$Platform <- as.factor(vgsales2$Platform)
vgsales2$Year <- as.integer(vgsales2$Year)

str(vgsales2)
#> 'data.frame':    16598 obs. of  11 variables:
#>  $ Rank        : int  1 2 3 4 5 6 7 8 9 10 ...
#>  $ Name        : chr  "Wii Sports" "Super Mario Bros." "Mario Kart Wii" "Wii Sports Resort" ...
#>  $ Platform    : Factor w/ 31 levels "2600","3DO","3DS",..: 26 12 26 26 6 6 5 26 26 12 ...
#>  $ Year        : int  2006 1985 2008 2009 1996 1989 2006 2006 2009 1984 ...
#>  $ Genre       : Factor w/ 12 levels "Action","Adventure",..: 11 5 7 11 8 6 5 4 5 9 ...
#>  $ Publisher   : Factor w/ 579 levels "10TACLE Studios",..: 369 369 369 369 369 369 369 369 369 369 ...
#>  $ NA_Sales    : num  41.5 29.1 15.8 15.8 11.3 ...
#>  $ EU_Sales    : num  29.02 3.58 12.88 11.01 8.89 ...
#>  $ JP_Sales    : num  3.77 6.81 3.79 3.28 10.22 ...
#>  $ Other_Sales : num  8.46 0.77 3.31 2.96 1 0.58 2.9 2.85 2.26 0.47 ...
#>  $ Global_Sales: num  82.7 40.2 35.8 33 31.4 ...
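The same coercion can be written more compactly by converting several columns at once. A minimal sketch on a made-up data frame (the `toy` object and its values are illustrative, not taken from the dataset):

```r
# Toy data frame standing in for vgsales2 (values are illustrative only)
toy <- data.frame(
  Genre     = c("Sports", "Platform", "Racing"),
  Publisher = c("Nintendo", "Nintendo", "Sega"),
  Platform  = c("Wii", "NES", "GEN"),
  Year      = c("2006", "1985", "N/A"),
  stringsAsFactors = FALSE
)

# Convert several character columns to factors in one step
factor_cols <- c("Genre", "Publisher", "Platform")
toy[factor_cols] <- lapply(toy[factor_cols], as.factor)

# Non-numeric text such as "N/A" becomes NA; the coercion warning is
# silenced here because the NA is expected
toy$Year <- suppressWarnings(as.integer(toy$Year))
str(toy)
```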

Next we handle missing values. To check whether there are any, we can use:

anyNA(vgsales2)
#> [1] TRUE

To see the number of missing values in each column, we can use:

colSums(is.na(vgsales2))
#>         Rank         Name     Platform         Year        Genre    Publisher 
#>            0            0            0          271            0            0 
#>     NA_Sales     EU_Sales     JP_Sales  Other_Sales Global_Sales 
#>            0            0            0            0            0

For simplicity, we will remove the rows that contain missing values as follows:

vgsales2 <- na.omit(vgsales2) # Remove rows with missing values
colSums(is.na(vgsales2)) # Check again whether any missing values remain
#>         Rank         Name     Platform         Year        Genre    Publisher 
#>            0            0            0            0            0            0 
#>     NA_Sales     EU_Sales     JP_Sales  Other_Sales Global_Sales 
#>            0            0            0            0            0
dim(vgsales2)
#> [1] 16327    11

By removing rows with missing values, the number of rows drops from 16,598 to 16,327.

The missing values have been removed, and the data is ready to be processed.
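Note that `na.omit()` drops a row if any of its columns is missing. When only one column matters, subsetting on that column is an alternative. A small sketch on made-up data (the `toy` data frame is hypothetical):

```r
# Toy data: row 2 lacks a Year, row 3 lacks a Publisher
toy <- data.frame(
  Name      = c("Game A", "Game B", "Game C"),
  Year      = c(2006L, NA, 1985L),
  Publisher = c("Nintendo", "Sega", NA)
)

# na.omit() drops every row that has an NA in any column
all_complete <- na.omit(toy)

# Subsetting on one column drops only rows where that column is missing
year_known <- toy[!is.na(toy$Year), ]

nrow(all_complete)  # 1
nrow(year_known)    # 2
```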

Data Explanation

View summary statistics for each column:

summary(vgsales2)
#>       Rank           Name              Platform         Year     
#>  Min.   :    1   Length:16327       DS     :2133   Min.   :1980  
#>  1st Qu.: 4136   Class :character   PS2    :2127   1st Qu.:2003  
#>  Median : 8295   Mode  :character   PS3    :1304   Median :2007  
#>  Mean   : 8293                      Wii    :1290   Mean   :2006  
#>  3rd Qu.:12442                      X360   :1235   3rd Qu.:2010  
#>  Max.   :16600                      PSP    :1197   Max.   :2020  
#>                                     (Other):7041                 
#>           Genre                             Publisher        NA_Sales      
#>  Action      :3253   Electronic Arts             : 1339   Min.   : 0.0000  
#>  Sports      :2304   Activision                  :  966   1st Qu.: 0.0000  
#>  Misc        :1710   Namco Bandai Games          :  928   Median : 0.0800  
#>  Role-Playing:1471   Ubisoft                     :  918   Mean   : 0.2654  
#>  Shooter     :1282   Konami Digital Entertainment:  823   3rd Qu.: 0.2400  
#>  Adventure   :1276   THQ                         :  712   Max.   :41.4900  
#>  (Other)     :5031   (Other)                     :10641                    
#>     EU_Sales          JP_Sales         Other_Sales        Global_Sales    
#>  Min.   : 0.0000   Min.   : 0.00000   Min.   : 0.00000   Min.   : 0.0100  
#>  1st Qu.: 0.0000   1st Qu.: 0.00000   1st Qu.: 0.00000   1st Qu.: 0.0600  
#>  Median : 0.0200   Median : 0.00000   Median : 0.01000   Median : 0.1700  
#>  Mean   : 0.1476   Mean   : 0.07866   Mean   : 0.04832   Mean   : 0.5402  
#>  3rd Qu.: 0.1100   3rd Qu.: 0.04000   3rd Qu.: 0.04000   3rd Qu.: 0.4800  
#>  Max.   :29.0200   Max.   :10.22000   Max.   :10.57000   Max.   :82.7400  
#> 

Based on the summary above, we can conclude several things:

  1. This dataset covers games released from 1980 to 2020.
  2. The publisher with the most titles over that period is Electronic Arts.
  3. Action is the genre with the most titles.
  4. The platforms with the most titles are the Nintendo DS and the PS2.
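The factor counts shown by `summary()` are numbers of titles per level, and only the six largest levels are listed. A short sketch, on made-up values, of how a full ranking could be obtained with `table()` and `sort()`:

```r
# Illustrative platform labels, not the real data
platform <- factor(c("DS", "PS2", "DS", "Wii", "PS2", "DS"))

# Count titles per platform and order from most to fewest
counts <- sort(table(platform), decreasing = TRUE)
counts

# Name of the platform with the most titles
names(counts)[1]
```

On the real data, the same pattern would be applied to `vgsales2$Platform`, `vgsales2$Genre`, or `vgsales2$Publisher`.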

Data Manipulation

We will now look for other interesting insights in this data.

  1. Total sales by region?
# Sum each sales column; Global_Sales is included for comparison
totalsales <- c(
  sum(vgsales2$Global_Sales),
  sum(vgsales2$NA_Sales),
  sum(vgsales2$EU_Sales),
  sum(vgsales2$JP_Sales),
  sum(vgsales2$Other_Sales)
)

market <- c("Global_Sales","NA_Sales","EU_Sales","JP_Sales","Other_Sales")
region <- data.frame(market,totalsales)
region
graphics::barplot(xtabs(totalsales~market,region))

Answer: North America recorded the largest regional sales, with 4392.95 out of a global total of 8920.44 (both in millions of copies), or roughly 49% of global sales.
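Regional shares are easier to compare as percentages. A minimal sketch, using the two totals quoted in the answer and a made-up `toy` data frame for the column-wise version:

```r
# Share of North America, from the totals quoted in the answer
na_share <- round(100 * 4392.95 / 8920.44, 1)
na_share  # 49.2

# Same idea for all regions at once, on illustrative values
toy <- data.frame(
  NA_Sales    = c(4.0, 1.0),
  EU_Sales    = c(2.0, 1.0),
  JP_Sales    = c(1.0, 0.5),
  Other_Sales = c(0.5, 0.0)
)
regional <- colSums(toy)                           # total per region
share <- round(100 * regional / sum(regional), 1)  # percent of all regions
share
```

Here the denominator is the sum of the four regional columns; in the real data this may differ slightly from the sum of `Global_Sales` because of rounding.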

  2. Publisher with the highest sales?
# Total global sales per publisher, sorted from highest to lowest
pub.sales <- aggregate(Global_Sales ~ Publisher, data = vgsales2, FUN = sum)
pub.sales <- pub.sales[order(pub.sales$Global_Sales, decreasing = TRUE), ]
head(pub.sales, 5) # 5 publishers with the highest sales

Answer: Nintendo is the publisher with the highest sales.

  3. Genre with the highest sales?
# Total global sales per genre, sorted from highest to lowest
gen.sales <- aggregate(Global_Sales ~ Genre, data = vgsales2, FUN = sum)
gen.sales <- gen.sales[order(gen.sales$Global_Sales, decreasing = TRUE), ]
head(gen.sales, 5) # 5 genres with the highest sales

Answer: Action is the genre with the highest sales.

  4. Game with the highest sales?
# Sum sales per title, so a game released on several platforms counts once
game.sales <- aggregate(Global_Sales ~ Name, data = vgsales2, FUN = sum)
game.sales <- game.sales[order(game.sales$Global_Sales, decreasing = TRUE), ]
head(game.sales, 5) # 5 games with the highest sales

Answer: “Wii Sports” is the highest-selling game.
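Aggregating by `Name` adds up the sales of a title across every platform it was released on, which can give a different ranking from looking at single rows. A small sketch on made-up data showing the difference:

```r
# Illustrative data: "Game A" appears on two platforms
toy <- data.frame(
  Name         = c("Game A", "Game A", "Game B"),
  Platform     = c("PS3", "X360", "Wii"),
  Global_Sales = c(10, 8, 15)
)

# Per-title totals: Game A = 18, Game B = 15
by_title <- aggregate(Global_Sales ~ Name, data = toy, FUN = sum)
by_title <- by_title[order(by_title$Global_Sales, decreasing = TRUE), ]
by_title

# Best single row (one title on one platform): Game B on Wii
single <- toy[which.max(toy$Global_Sales), ]
single
```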

  5. View game sales by decade of release
# Map a release year to its decade label; 1990 belongs to "1980-1990"
pw <- function(x) {
  if (x <= 1990) {
    "1980-1990"
  } else if (x <= 2000) {
    "1991-2000"
  } else if (x <= 2010) {
    "2001-2010"
  } else {
    "2011-2020"
  }
}
vgsales2$decade <- sapply(vgsales2$Year, pw)
head(vgsales2)
graphics::barplot(xtabs(Global_Sales~decade,vgsales2))

Answer: Game sales were highest in the 2001-2010 decade.
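The decade labels can also be assigned without a loop. A minimal sketch using `cut()`, whose intervals are closed on the right by default, so that 1990 falls in "1980-1990" and 2000 in "1991-2000". The `years` vector is made up for illustration:

```r
# Illustrative release years, including the boundary years
years <- c(1985L, 1990L, 1991L, 2000L, 2005L, 2016L)

# Breaks define (-Inf, 1990], (1990, 2000], (2000, 2010], (2010, Inf)
decade <- cut(
  years,
  breaks = c(-Inf, 1990, 2000, 2010, Inf),
  labels = c("1980-1990", "1991-2000", "2001-2010", "2011-2020")
)
as.character(decade)
```

Unlike the `sapply()` approach, this returns a factor, which keeps the decades in chronological order in tables and plots.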

  6. Average number of games released by each publisher per year
library(tidyr)
# Count titles per publisher and decade, then spread decades into columns
vgsales3 <- aggregate(Name ~ Publisher + decade, data = vgsales2, FUN = length)
vgsales3 <- pivot_wider(data = vgsales3, names_from = decade, values_from = Name)
# Publisher/decade combinations with no titles come out as NA; set them to 0
vgsales3[is.na(vgsales3)] <- 0
vgsales3$sum <- rowSums(vgsales3[, c("1980-1990", "1991-2000", "2001-2010", "2011-2020")])
vgsales3 <- vgsales3[order(vgsales3$sum, decreasing = TRUE), ]
# Average per year over the 40 years covered
vgsales3$ave <- vgsales3$sum / 40
vgsales3

Answer: Electronic Arts has released the most games over the last 40 years, with a total of 1339 games, an average of about 33 games per year.
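A quick check of the arithmetic, using the total of 1339 games and the 40-year divisor from the code above. The variable names are only for illustration:

```r
total_games   <- 1339  # Electronic Arts total reported in the answer
years_covered <- 40    # divisor used for the "ave" column

per_year   <- total_games / years_covered         # 33.475
per_decade <- total_games / (years_covered / 10)  # 334.75

per_year
per_decade
```

So dividing by 40 gives a per-year figure; a per-decade average would be ten times larger.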

  7. What is the trend of game genres in each decade?
# Count titles per genre and decade, largest counts first
vgsales4 <- aggregate(Name ~ Genre + decade, data = vgsales2, FUN = length)
vgsales4 <- vgsales4[order(vgsales4$Name, decreasing = TRUE), ]
vgsales4
# visualization
library(ggplot2)
ggplot(data = vgsales4, mapping = aes(x = Genre, y = Name)) +
  geom_col(aes(fill = decade), position = "dodge") +
  facet_wrap(vars(decade), scales = "free_y") +
  scale_x_discrete(guide = guide_axis(angle = 90))

Answer: In decade 1 (1980-1990), Action was the genre with the most titles. In decade 2 (1991-2000), it was overtaken by the Sports genre, while in decade 3 (2001-2010) and decade 4 (2011-2020), Action was once again on top.
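The leading genre in each decade can also be read off programmatically rather than from the plot. A minimal sketch on made-up counts with the same column layout as `vgsales4` (the numbers are illustrative only):

```r
# Illustrative counts of titles per genre and decade
counts <- data.frame(
  Genre  = c("Action", "Sports", "Action", "Sports"),
  decade = c("1980-1990", "1980-1990", "1991-2000", "1991-2000"),
  Name   = c(5, 3, 4, 9)
)

# Split by decade and keep the row with the largest count in each group
top_by_decade <- do.call(
  rbind,
  lapply(split(counts, counts$decade), function(d) d[which.max(d$Name), ])
)
top_by_decade
```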

Conclusion

Over the last 40 years (1980–2020), Nintendo has been the publisher with the highest sales, with a total of 1784.43 (in millions of copies). In line with this, the best-selling title was Wii Sports, which was also published by Nintendo. In terms of genre, Action games are the most in demand. However, in the 1991–2000 decade, the Sports genre led, judging by the number of sports titles released in that decade. Electronic Arts is the most prolific publisher, releasing an average of 33 games per year, while Nintendo is in 7th place with an average of 17 games per year.

Reference

Thanks to GREGORYSMITH for the dataset: https://www.kaggle.com/datasets/gregorut/videogamesales