This is a brief analysis of the “Videogame with Sales Data” dataset, which is available on Kaggle. The analysis uses the variables Platform, Year_of_Release, Publisher, NA_Sales, EU_Sales, JP_Sales, Other_Sales and Global_Sales. The dataset offers other variables that are not used here.
library(ggplot2)
library(dplyr)
library(lattice)
mydata <- read.csv("Video_Games_Sales.csv")
mydata <- mydata[-1,]
Wii Sports, the first row of the data, was sold bundled with the Wii console, so its sales largely reflect console sales. It is therefore excluded from the analysis.
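Dropping the first row by position works only while Wii Sports happens to be the first record. A minimal sketch of removing it by title instead, assuming the dataset has a `Name` column holding game titles (the column name is an assumption, and the toy rows below are made up for illustration):

```r
# Toy stand-in for the real data; titles and figures are illustrative only.
games <- data.frame(
  Name         = c("Wii Sports", "Game A", "Game B"),
  Platform     = c("Wii", "NES", "PS2"),
  Global_Sales = c(80, 40, 20),
  stringsAsFactors = FALSE
)

# Keep every row whose title is not "Wii Sports".
# %in% treats a missing title as "no match", so such rows are kept.
games_filtered <- games[!(games$Name %in% "Wii Sports"), ]

nrow(games_filtered)
```

Filtering by value keeps the exclusion correct even if the file is later re-sorted or rows are added above it.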
First we look at summary statistics.
summary(mydata[6:10])
##     NA_Sales          EU_Sales          JP_Sales          Other_Sales
##  Min.   : 0.0000   Min.   : 0.0000   Min.   : 0.00000   Min.   : 0.00000
##  1st Qu.: 0.0000   1st Qu.: 0.0000   1st Qu.: 0.00000   1st Qu.: 0.00000
##  Median : 0.0800   Median : 0.0200   Median : 0.00000   Median : 0.01000
##  Mean   : 0.2609   Mean   : 0.1433   Mean   : 0.07738   Mean   : 0.04683
##  3rd Qu.: 0.2400   3rd Qu.: 0.1100   3rd Qu.: 0.04000   3rd Qu.: 0.03000
##  Max.   :29.0800   Max.   :12.7600   Max.   :10.22000   Max.   :10.57000
##   Global_Sales
##  Min.   : 0.0100
##  1st Qu.: 0.0600
##  Median : 0.1700
##  Mean   : 0.5286
##  3rd Qu.: 0.4700
##  Max.   :40.2400
cor(mydata$NA_Sales, mydata$JP_Sales)
## [1] 0.4511037
cor(mydata$NA_Sales, mydata$EU_Sales)
## [1] 0.7176671
cor(mydata$NA_Sales, mydata$Global_Sales)
## [1] 0.9300055
cor(mydata$JP_Sales, mydata$EU_Sales)
## [1] 0.4414561
Sales in North America and Japan are only moderately correlated (about 0.45): games popular in North America are not necessarily popular in Japan, and vice versa. North American and European sales are more closely aligned (about 0.72). The high correlation between North American and global sales (0.93) is expected, since North American sales are a component of the global total.
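The pairwise calls above can also be collapsed into a single correlation matrix. The sketch below uses made-up regional figures rather than the Kaggle data, and adds Spearman rank correlations as an optional cross-check, on the assumption that the heavy skew visible in the summary statistics could let a few blockbuster titles dominate the Pearson values:

```r
# Toy regional sales (million units); values are made up for illustration.
sales <- data.frame(
  NA_Sales = c(0.5, 1.2, 0.1, 3.0, 0.8),
  EU_Sales = c(0.3, 0.9, 0.1, 2.1, 0.4),
  JP_Sales = c(0.0, 0.2, 0.6, 0.5, 0.1)
)

# Pearson correlations for every pair of columns at once.
pearson <- cor(sales)

# Spearman correlations are computed on ranks, so extreme values weigh less.
spearman <- cor(sales, method = "spearman")

round(pearson, 2)
round(spearman, 2)
```

With the real data, the same calls could be applied to `mydata[, c("NA_Sales", "EU_Sales", "JP_Sales")]`.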
Let’s take a look at units sold for each platform.
g <- ggplot(data = mydata, aes(x= Global_Sales, y = Platform))
g + geom_point(alpha = 0.25, col = "blue") + xlab("Global Sales (mil.units) for each game") + labs(title = "Individual Game Sales by Platform")
With Wii Sports excluded, the best-selling game in the data is an NES title. Unsurprisingly, it is Super Mario Bros.
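The plot shows the platform of the top seller but not its title. One way to look up the row directly is `which.max()`. This is a sketch on made-up rows, and the `Name` column is an assumption about the dataset's layout:

```r
# Toy data; titles and sales figures are illustrative only.
games <- data.frame(
  Name         = c("Game A", "Game B", "Game C"),
  Platform     = c("PS2", "NES", "Wii"),
  Global_Sales = c(20.5, 40.2, 35.1),
  stringsAsFactors = FALSE
)

# which.max() returns the index of the first maximum value.
best <- games[which.max(games$Global_Sales), ]

best
```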
Now let’s take a look at total game sales by platform.
sales_table <- mydata %>% group_by(Platform) %>% summarise(sales_by = sum(Global_Sales))
gg <- ggplot(data = sales_table, aes(x = sales_by, y = Platform))
gg + geom_point(shape = 22, colour = "red", fill = "white", size = 3, stroke = 2) + xlab("Total Global Sales (mil. units)") + labs(title = "Total Sales by Platform")
The platform with the highest total game sales is the PS2, followed by the Xbox 360 and the PS3.
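The ranking is easier to read when the platforms are sorted by their totals. A base R sketch of the same group-and-sum step on made-up rows, followed by an explicit ordering of the factor levels so that a plot axis would follow the ranking:

```r
# Toy data; the sales figures are illustrative only.
games <- data.frame(
  Platform     = c("PS2", "PS2", "X360", "PS3", "X360"),
  Global_Sales = c(3.0, 2.5, 2.0, 1.5, 1.0),
  stringsAsFactors = FALSE
)

# Sum sales within each platform (base R counterpart of group_by + summarise).
totals <- aggregate(Global_Sales ~ Platform, data = games, FUN = sum)

# Sort from the largest total to the smallest.
totals <- totals[order(-totals$Global_Sales), ]

# Reverse the order for the factor levels: ggplot2 draws the first level
# at the bottom of a discrete y axis, so the largest total ends up on top.
totals$Platform <- factor(totals$Platform, levels = rev(totals$Platform))

totals
```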
Now we can take a look at which publishers sold the most units.
publisher_sales_table <- mydata %>% group_by(Publisher) %>% summarise(publisher_sales = sum(Global_Sales))
publisher_sales_table <- publisher_sales_table[order(- publisher_sales_table$publisher_sales),]
publisher_sales_table <- as.data.frame(publisher_sales_table)
head(publisher_sales_table, n = 15)
## Publisher publisher_sales
## 1 Nintendo 1706.28
## 2 Electronic Arts 1116.96
## 3 Activision 731.16
## 4 Sony Computer Entertainment 606.48
## 5 Ubisoft 471.61
## 6 Take-Two Interactive 403.82
## 7 THQ 338.44
## 8 Konami Digital Entertainment 282.39
## 9 Sega 270.35
## 10 Namco Bandai Games 254.62
## 11 Microsoft Game Studios 248.32
## 12 Capcom 200.02
## 13 Atari 156.83
## 14 Warner Bros. Interactive Entertainment 151.79
## 15 Square Enix 144.35
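Raw totals do not show how concentrated the market is. As a possible extension, each publisher's share of the total and the cumulative share can be added to a table of this shape. The sketch below uses hypothetical publishers and made-up totals, not the figures above:

```r
# Toy totals; publisher names and figures are illustrative only.
publisher_totals <- data.frame(
  Publisher       = c("Publisher B", "Publisher A", "Publisher C"),
  publisher_sales = c(30, 50, 20),
  stringsAsFactors = FALSE
)

# Sort from the largest to the smallest total.
publisher_totals <- publisher_totals[order(-publisher_totals$publisher_sales), ]

# Share of all units sold, and running share down the ranking.
publisher_totals$share <- publisher_totals$publisher_sales /
  sum(publisher_totals$publisher_sales)
publisher_totals$cum_share <- cumsum(publisher_totals$share)

publisher_totals
```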
So far we haven’t checked how sales are distributed by year. Let’s do so now.
year_sales_table_global <- mydata %>% group_by(Year_of_Release) %>% summarise(Global_Sales = sum(Global_Sales))
year_sales_table_global <- year_sales_table_global[-c(39,40),]
tempdata <- table(mydata$Year_of_Release)
tempdata <- as.data.frame(tempdata)
tempdata <- tempdata[-c(39,40),]
colnames(tempdata) <- c("Year_of_Release", "Games_Released")
newData <- merge(tempdata, year_sales_table_global)
colors <- c("darkblue", "red")
barchart(Games_Released + Global_Sales ~ Year_of_Release, data = newData,
         auto.key = list(space = "bottom"),
         ylab = "Games released and mil. units sold",
         scales = list(x = list(rot = 45)),
         par.settings = list(superpose.polygon = list(col = colors)))
The number of releases and global sales per year follow the same pattern; in fact, the correlation between the two is 0.98. Let’s take a look at their scatterplot.
ggplot(data = newData, aes(x = Global_Sales, y = Games_Released)) + geom_point() + xlab("Sales (Mil. Units)") + ylab("Games Released")
As the scatterplot shows, there is a strong positive relationship between the number of games released in a year and the units sold that year. This phenomenon merits a deeper analysis.
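The 0.98 correlation is quoted without the code that produces it. A minimal sketch of how such a figure could be computed, using made-up rows rather than the real data. It assumes that the year column is stored as text and may contain a non-year label such as "N/A" (an assumption about the file), and it filters those entries by value instead of by row position:

```r
# Toy data; years and sales figures are illustrative only.
games <- data.frame(
  Year_of_Release = c("2005", "2005", "2006", "2006", "2006", "2007", "N/A"),
  Global_Sales    = c(1.0, 0.5, 2.0, 1.0, 0.5, 0.4, 0.3),
  stringsAsFactors = FALSE
)

# Convert years to integers; labels that are not numbers become NA.
games$Year <- suppressWarnings(as.integer(games$Year_of_Release))
games <- games[!is.na(games$Year), ]

# Number of titles released per year.
released <- aggregate(Global_Sales ~ Year, data = games, FUN = length)
names(released)[2] <- "Games_Released"

# Total units sold per year.
sold <- aggregate(Global_Sales ~ Year, data = games, FUN = sum)

# One row per year with both measures, then their Pearson correlation.
by_year <- merge(released, sold, by = "Year")
r <- cor(by_year$Games_Released, by_year$Global_Sales)

by_year
r
```

On the data frame built in this report, the equivalent call would be `cor(newData$Games_Released, newData$Global_Sales)`.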