On this occasion, I will do an exploratory data analysis (EDA) on a dataset listing video games with sales greater than 100,000 copies. The data was collected from various e-commerce merchants as part of a research study. Let's try it!
Make sure the data file is already in the R project folder. We can import the data as follows:
vgsales <- read.csv("vgsales.csv")
vgsales
Once the data is displayed, we do a simple inspection to find out its structure and dimensions:
dim(vgsales)
#> [1] 16598 11
This dataset has 16,598 rows and 11 columns.
Let’s try to see what columns are in the dataset:
names(vgsales)
#> [1] "Rank" "Name" "Platform" "Year" "Genre"
#> [6] "Publisher" "NA_Sales" "EU_Sales" "JP_Sales" "Other_Sales"
#> [11] "Global_Sales"
View the top 5 rows of the data frame:
head(vgsales, 5)
View the bottom 5 rows of the data frame:
tail(vgsales, 5)
We need to clean the data first so that it can be processed later. So as not to alter the original data, we will work on a new object, “vgsales2”.
vgsales2 <- vgsales
View the data type of each column:
str(vgsales2)
#> 'data.frame': 16598 obs. of 11 variables:
#> $ Rank : int 1 2 3 4 5 6 7 8 9 10 ...
#> $ Name : chr "Wii Sports" "Super Mario Bros." "Mario Kart Wii" "Wii Sports Resort" ...
#> $ Platform : chr "Wii" "NES" "Wii" "Wii" ...
#> $ Year : chr "2006" "1985" "2008" "2009" ...
#> $ Genre : chr "Sports" "Platform" "Racing" "Sports" ...
#> $ Publisher : chr "Nintendo" "Nintendo" "Nintendo" "Nintendo" ...
#> $ NA_Sales : num 41.5 29.1 15.8 15.8 11.3 ...
#> $ EU_Sales : num 29.02 3.58 12.88 11.01 8.89 ...
#> $ JP_Sales : num 3.77 6.81 3.79 3.28 10.22 ...
#> $ Other_Sales : num 8.46 0.77 3.31 2.96 1 0.58 2.9 2.85 2.26 0.47 ...
#> $ Global_Sales: num 82.7 40.2 35.8 33 31.4 ...
Based on the output above, we need explicit coercion, that is, converting each column to the appropriate data type. The columns to be changed are:
- Genre -> factor
- Publisher -> factor
- Platform -> factor
- Year -> integer
vgsales2$Genre <- as.factor(vgsales2$Genre)
vgsales2$Publisher <- as.factor(vgsales2$Publisher)
vgsales2$Platform <- as.factor(vgsales2$Platform)
vgsales2$Year <- as.integer(vgsales2$Year)
str(vgsales2)
#> 'data.frame': 16598 obs. of 11 variables:
#> $ Rank : int 1 2 3 4 5 6 7 8 9 10 ...
#> $ Name : chr "Wii Sports" "Super Mario Bros." "Mario Kart Wii" "Wii Sports Resort" ...
#> $ Platform : Factor w/ 31 levels "2600","3DO","3DS",..: 26 12 26 26 6 6 5 26 26 12 ...
#> $ Year : int 2006 1985 2008 2009 1996 1989 2006 2006 2009 1984 ...
#> $ Genre : Factor w/ 12 levels "Action","Adventure",..: 11 5 7 11 8 6 5 4 5 9 ...
#> $ Publisher : Factor w/ 579 levels "10TACLE Studios",..: 369 369 369 369 369 369 369 369 369 369 ...
#> $ NA_Sales : num 41.5 29.1 15.8 15.8 11.3 ...
#> $ EU_Sales : num 29.02 3.58 12.88 11.01 8.89 ...
#> $ JP_Sales : num 3.77 6.81 3.79 3.28 10.22 ...
#> $ Other_Sales : num 8.46 0.77 3.31 2.96 1 0.58 2.9 2.85 2.26 0.47 ...
#> $ Global_Sales: num 82.7 40.2 35.8 33 31.4 ...
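A column of years that is read as character usually contains some entries that are not numeric. The following is a minimal sketch on a toy vector (the values and the "N/A" marker are assumptions for illustration, not taken from the file) of how `as.integer()` turns such entries into `NA`, which would explain missing values appearing in `Year` after the coercion:

```r
# Toy vector standing in for the Year column; "N/A" is an assumed placeholder.
year_chr <- c("2006", "1985", "N/A", "2009")

# as.integer() cannot parse "N/A", so it returns NA for it (with a warning,
# which is suppressed here to keep the output clean).
year_int <- suppressWarnings(as.integer(year_chr))

year_int
sum(is.na(year_int))  # number of entries that failed to convert
```

Inspecting the raw strings before coercing, for example with `unique()`, is a cheap way to see which values will be lost.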
Handling missing values: to check whether the data contains any missing values, we can use:
anyNA(vgsales2)
#> [1] TRUE
To see the number of missing values in each column, we can use:
colSums(is.na(vgsales2))
#> Rank Name Platform Year Genre Publisher
#> 0 0 0 271 0 0
#> NA_Sales EU_Sales JP_Sales Other_Sales Global_Sales
#> 0 0 0 0 0
For simplicity, we will drop the rows that contain missing values:
vgsales2 <- na.omit(vgsales2) # Drop rows with missing values
colSums(is.na(vgsales2)) # Check again whether any missing values remain
#> Rank Name Platform Year Genre Publisher
#> 0 0 0 0 0 0
#> NA_Sales EU_Sales JP_Sales Other_Sales Global_Sales
#> 0 0 0 0 0
dim(vgsales2)
#> [1] 16327 11
By eliminating rows with missing values, the number of data rows is reduced from 16,598 to 16,327.
The missing values have been removed, and the data is ready to be processed.
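Note that `na.omit()` drops a row if any of its columns is missing. As a minimal sketch on a toy data frame (all values are hypothetical), the same rows can also be removed by filtering on `Year` alone, which makes the reason for each dropped row explicit:

```r
# Toy data frame (hypothetical values) with one missing Year.
toy <- data.frame(
  Name = c("A", "B", "C"),
  Year = c(2001L, NA, 2010L),
  Global_Sales = c(1.2, 0.5, 0.3)
)

# na.omit() drops every row that has an NA in any column.
toy_all <- na.omit(toy)

# Alternative: drop rows only when Year itself is missing.
toy_year <- toy[!is.na(toy$Year), ]

nrow(toy)       # rows before
nrow(toy_all)   # rows after na.omit()
nrow(toy_year)  # rows after filtering on Year only
```

When only one column has missing values, as in this toy example, both approaches keep the same rows.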
View a summary of the data:
summary(vgsales2)
#> Rank Name Platform Year
#> Min. : 1 Length:16327 DS :2133 Min. :1980
#> 1st Qu.: 4136 Class :character PS2 :2127 1st Qu.:2003
#> Median : 8295 Mode :character PS3 :1304 Median :2007
#> Mean : 8293 Wii :1290 Mean :2006
#> 3rd Qu.:12442 X360 :1235 3rd Qu.:2010
#> Max. :16600 PSP :1197 Max. :2020
#> (Other):7041
#> Genre Publisher NA_Sales
#> Action :3253 Electronic Arts : 1339 Min. : 0.0000
#> Sports :2304 Activision : 966 1st Qu.: 0.0000
#> Misc :1710 Namco Bandai Games : 928 Median : 0.0800
#> Role-Playing:1471 Ubisoft : 918 Mean : 0.2654
#> Shooter :1282 Konami Digital Entertainment: 823 3rd Qu.: 0.2400
#> Adventure :1276 THQ : 712 Max. :41.4900
#> (Other) :5031 (Other) :10641
#> EU_Sales JP_Sales Other_Sales Global_Sales
#> Min. : 0.0000 Min. : 0.00000 Min. : 0.00000 Min. : 0.0100
#> 1st Qu.: 0.0000 1st Qu.: 0.00000 1st Qu.: 0.00000 1st Qu.: 0.0600
#> Median : 0.0200 Median : 0.00000 Median : 0.01000 Median : 0.1700
#> Mean : 0.1476 Mean : 0.07866 Mean : 0.04832 Mean : 0.5402
#> 3rd Qu.: 0.1100 3rd Qu.: 0.04000 3rd Qu.: 0.04000 3rd Qu.: 0.4800
#> Max. :29.0200 Max. :10.22000 Max. :10.57000 Max. :82.7400
#>
Based on the summary above, we can conclude several things:
1. The dataset covers games released from 1980 to 2020.
2. The publisher with the most titles over that period is Electronic Arts.
3. Action is the genre with the most titles.
4. The platforms with the most titles are the Nintendo DS and the PS2.
We will try to find insights into other interesting things based on this data.
totalsales <- c(sum(vgsales2$Global_Sales),sum(vgsales2$NA_Sales),
sum(vgsales2$EU_Sales),
sum(vgsales2$JP_Sales),
sum(vgsales2$Other_Sales)
)
market <- c("Global_Sales","NA_Sales","EU_Sales","JP_Sales","Other_Sales")
region <- data.frame(market,totalsales)
region
graphics::barplot(xtabs(totalsales~market,region))
Answer: North America recorded the largest regional sales, 4392.95 million copies out of a global total of 8920.44 million copies, or roughly 49% of global sales.
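Because the global figure is roughly the sum of the regional ones, the regions are easier to compare as shares of the total. A minimal sketch, using made-up regional totals rather than the real figures:

```r
# Hypothetical regional totals in millions of copies (not the real figures).
totals <- c(NA_Sales = 50, EU_Sales = 30, JP_Sales = 15, Other_Sales = 5)

# Share of each region in the combined total, in percent.
share <- round(100 * totals / sum(totals), 1)

share
names(which.max(share))  # region with the largest share
```

With the real data, `totals` could be built by applying `colSums()` to the four regional sales columns.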
pub.sales <- aggregate(Global_Sales ~ Publisher, vgsales2, sum)
pub.sales <- pub.sales[order(pub.sales$Global_Sales, decreasing = TRUE), ]
head(pub.sales, 5) # The 5 publishers with the highest sales
Answer: Nintendo is the publisher with the highest sales.
gen.sales <- aggregate(Global_Sales ~ Genre, vgsales2, sum)
gen.sales <- gen.sales[order(gen.sales$Global_Sales, decreasing = TRUE), ]
head(gen.sales, 5) # The 5 genres with the highest sales
Answer: Action is the highest-selling genre.
# Sum sales per title (a title released on several platforms is counted once)
game.sales <- aggregate(Global_Sales ~ Name, vgsales2, sum)
game.sales <- game.sales[order(game.sales$Global_Sales, decreasing = TRUE), ]
head(game.sales, 5) # The 5 games with the highest sales
Answer: “Wii Sports” is the highest-selling game.
# Map a release year to a period label
pw <- function(x) {
  if (x <= 1990) {
    "1980-1990"
  } else if (x <= 2000) {
    "1991-2000"
  } else if (x <= 2010) {
    "2001-2010"
  } else {
    "2011-2020"
  }
}
temp <- sapply(vgsales2$Year, pw)
vgsales2$decade <- temp
head(vgsales2)
graphics::barplot(xtabs(Global_Sales ~ decade, vgsales2))
Answer: Game sales were highest in the 2001-2010 decade.
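The same binning can be done without a loop. A minimal sketch using base R's `cut()`, assuming the intended periods are those named in the labels (the example years are made up), so that 1990 is assigned to "1980-1990":

```r
# Example years (made up for illustration).
years <- c(1985L, 1990L, 1991L, 2000L, 2005L, 2016L)

# Intervals are right-closed by default:
# (1979,1990], (1990,2000], (2000,2010], (2010,2020]
decade <- cut(
  years,
  breaks = c(1979, 1990, 2000, 2010, 2020),
  labels = c("1980-1990", "1991-2000", "2001-2010", "2011-2020")
)

table(decade)
```

`cut()` returns a factor, so the periods keep their chronological order in tables and plots.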
library(tidyr)
# Count titles per publisher and period
vgsales3 <- aggregate(Name ~ Publisher + decade, vgsales2, length)
vgsales3 <- pivot_wider(data = vgsales3, names_from = decade, values_from = Name)
# Publishers with no titles in a period get NA; replace those with 0
vgsales3[is.na(vgsales3)] <- 0
vgsales3
vgsales3$sum <- rowSums(vgsales3[, c("1980-1990", "1991-2000", "2001-2010", "2011-2020")])
vgsales3 <- vgsales3[order(vgsales3$sum, decreasing = TRUE), ]
vgsales3$ave <- vgsales3$sum / 40 # average number of titles per year over 40 years
vgsales3
Answer: Electronic Arts has produced the most games in the last 40 years, with a total of 1339 games, an average of about 33 games per year.
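A publisher-by-period count table can also be built in base R alone. A minimal sketch with `xtabs()` on a toy data frame (the publishers and periods are hypothetical); combinations that never occur are filled with 0 automatically:

```r
# Toy data (hypothetical publishers and periods), one row per title.
toy <- data.frame(
  Publisher = c("P1", "P1", "P2", "P1", "P2"),
  decade = c("1991-2000", "2001-2010", "2001-2010", "2001-2010", "2011-2020")
)

# Cross-tabulate: one row per publisher, one column per period.
counts <- xtabs(~ Publisher + decade, data = toy)

counts
rowSums(counts)  # total titles per publisher
```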
# Count titles per genre and period
vgsales4 <- aggregate(Name ~ Genre + decade, vgsales2, length)
vgsales4 <- vgsales4[order(vgsales4$Name, decreasing = TRUE), ]
vgsales4
# Visualization
library(ggplot2)
ggplot(data = vgsales4, mapping = aes(x = Genre, y = Name)) +
  geom_col(aes(fill = decade), position = "dodge") +
  facet_wrap(vars(decade), scales = "free_y") +
  scale_x_discrete(guide = guide_axis(angle = 90))
Answer: Genre trends by decade. In decade 1 (1980-1990), Action was the most popular genre. In decade 2 (1991-2000), it was replaced by Sports, while in decade 3 (2001-2010) and decade 4 (2011-2020), Action was once again the most popular.
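The leading genre in each period can also be read off programmatically rather than from the plot. A minimal sketch on a toy table of counts (the numbers are hypothetical), keeping the row with the largest count in each period:

```r
# Toy counts of titles per genre and period (hypothetical numbers).
toy <- data.frame(
  Genre  = c("Action", "Sports", "Action", "Sports"),
  decade = c("1991-2000", "1991-2000", "2001-2010", "2001-2010"),
  Name   = c(10, 12, 30, 20)
)

# Split by period, then keep the row with the largest count in each piece.
top <- do.call(
  rbind,
  lapply(split(toy, toy$decade), function(d) d[which.max(d$Name), ])
)

top[, c("decade", "Genre", "Name")]
```

`which.max()` returns only the first maximum, so ties within a period would need separate handling.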
Over the last 40 years (1980–2020), Nintendo has been the publisher with the highest sales, with a total of 1784.43 million copies. In line with this, the best-selling title was Wii Sports, which was also published by Nintendo. In terms of genre, Action games are the most in demand. However, in the 1991–2000 decade, the Sports genre was more in demand, judging by the number of sports titles released in that decade. Electronic Arts is the most prolific publisher, producing an average of 33 games per year, while Nintendo is in 7th place with an average of 17 games per year.
Thanks to GREGORYSMITH for the dataset: https://www.kaggle.com/datasets/gregorut/videogamesales