This is a brief analysis of the “Videogame with Sales Data” dataset, which is available on Kaggle. We use the following variables: Platform, Publisher, Year_of_Release, NA_Sales, EU_Sales, JP_Sales, Other_Sales, and Global_Sales.

The dataset offers other variables that are not used in this analysis.

library(ggplot2)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(lattice)
# Load the CSV from the working directory
mydata <- read.csv("Video_Games_Sales.csv")
# Drop the first row (Wii Sports)
mydata <- mydata[-1, ]

Since Wii Sports was sold along with the console, it is excluded from the analysis.
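Dropping the first row by position only works while the file keeps its original ordering. As a minimal sketch, the same exclusion can be done by title. This assumes the dataset has a Name column holding game titles (an assumption, since that column is not shown here) and uses a small made-up data frame in place of mydata:

```r
# Toy stand-in for mydata; titles other than "Wii Sports" and all values
# are illustrative. Assumption: the real file has a `Name` column.
toy <- data.frame(
  Name = c("Wii Sports", "Game A", "Game B"),
  Global_Sales = c(10, 5, 3),
  stringsAsFactors = FALSE
)

# Keep every row whose title is not "Wii Sports"
toy_filtered <- toy[toy$Name != "Wii Sports", ]

nrow(toy_filtered)
```

Filtering by value keeps the exclusion correct even if the rows are re-sorted or the file is updated.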

First we look at summary statistics.

summary(mydata[6:10])
##     NA_Sales          EU_Sales          JP_Sales         Other_Sales      
##  Min.   : 0.0000   Min.   : 0.0000   Min.   : 0.00000   Min.   : 0.00000  
##  1st Qu.: 0.0000   1st Qu.: 0.0000   1st Qu.: 0.00000   1st Qu.: 0.00000  
##  Median : 0.0800   Median : 0.0200   Median : 0.00000   Median : 0.01000  
##  Mean   : 0.2609   Mean   : 0.1433   Mean   : 0.07738   Mean   : 0.04683  
##  3rd Qu.: 0.2400   3rd Qu.: 0.1100   3rd Qu.: 0.04000   3rd Qu.: 0.03000  
##  Max.   :29.0800   Max.   :12.7600   Max.   :10.22000   Max.   :10.57000  
##   Global_Sales    
##  Min.   : 0.0100  
##  1st Qu.: 0.0600  
##  Median : 0.1700  
##  Mean   : 0.5286  
##  3rd Qu.: 0.4700  
##  Max.   :40.2400
cor(mydata$NA_Sales, mydata$JP_Sales)
## [1] 0.4511037
cor(mydata$NA_Sales, mydata$EU_Sales)
## [1] 0.7176671
cor(mydata$NA_Sales, mydata$Global_Sales)
## [1] 0.9300055
cor(mydata$JP_Sales, mydata$EU_Sales)
## [1] 0.4414561

It is interesting to note that sales in North America and Japan are only moderately correlated (about 0.45): games popular in North America are not necessarily popular in Japan, and vice versa. Sales in North America and Europe move together more closely (about 0.72).
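The pairwise calls above can also be collapsed into a single correlation matrix. The sketch below uses a small made-up data frame so that it runs on its own; the values are illustrative and not taken from the dataset.

```r
# Toy regional sales (illustrative values, not taken from the dataset)
toy_sales <- data.frame(
  NA_Sales = c(1.0, 2.0, 3.0, 4.0, 5.0),
  EU_Sales = c(0.8, 1.9, 2.5, 4.2, 4.9),
  JP_Sales = c(2.0, 0.1, 1.5, 0.3, 1.0)
)

# Pearson correlation for every pair of columns in one call
cor_matrix <- cor(toy_sales)
round(cor_matrix, 2)
```

On the real data, the equivalent call would be cor(mydata[, c("NA_Sales", "EU_Sales", "JP_Sales", "Global_Sales")]), which should reproduce the four coefficients shown above in a single table.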

Let’s take a look at units sold for each platform.

g <- ggplot(data = mydata, aes(x = Global_Sales, y = Platform))
g + geom_point(alpha = 0.25, col = "blue") +
  xlab("Global Sales (mil. units) for each game") +
  labs(title = "Individual Game Sales by Platform")

Notably, with Wii Sports excluded, the best-selling game is an NES title at about 40 million units. Unsurprisingly, it is Super Mario Bros.

Now let’s take a look at total game sales by platform.

sales_table <- mydata %>% group_by(Platform) %>% summarise(sales_by = sum(Global_Sales))

gg <- ggplot(data = sales_table, aes(x = sales_by, y = Platform))
gg + geom_point(shape = 22, colour = "red", fill = "white", size = 3, stroke = 2) + xlab("Total Global Sales (mil. units)") + labs(title = "Total Sales by Platform")

The platform with the highest total game sales is the PS2, followed by the Xbox 360 and the PS3.
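The ranking can also be read from a sorted table rather than from the plot. The following is a minimal base R sketch of the same group-and-sum step, run on made-up data with hypothetical platform labels:

```r
# Toy per-game sales (illustrative values and hypothetical platform labels)
toy <- data.frame(
  Platform = c("A", "B", "A", "C", "B", "A"),
  Global_Sales = c(1.5, 0.7, 2.0, 0.4, 1.1, 0.5)
)

# Sum sales per platform, then sort from highest to lowest total
totals <- aggregate(Global_Sales ~ Platform, data = toy, FUN = sum)
totals <- totals[order(-totals$Global_Sales), ]
totals
```

Applied to sales_table, the same ordering step (sorting by sales_by in decreasing order) would list the top platforms first.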

Now we can look at which publishers sold the most units.

publisher_sales_table <- mydata %>% group_by(Publisher) %>% summarise(publisher_sales = sum(Global_Sales))

publisher_sales_table <- publisher_sales_table[order(- publisher_sales_table$publisher_sales),]

publisher_sales_table <- as.data.frame(publisher_sales_table)
head(publisher_sales_table, n = 15)
##                                 Publisher publisher_sales
## 1                                Nintendo         1706.28
## 2                         Electronic Arts         1116.96
## 3                              Activision          731.16
## 4             Sony Computer Entertainment          606.48
## 5                                 Ubisoft          471.61
## 6                    Take-Two Interactive          403.82
## 7                                     THQ          338.44
## 8            Konami Digital Entertainment          282.39
## 9                                    Sega          270.35
## 10                     Namco Bandai Games          254.62
## 11                 Microsoft Game Studios          248.32
## 12                                 Capcom          200.02
## 13                                  Atari          156.83
## 14 Warner Bros. Interactive Entertainment          151.79
## 15                            Square Enix          144.35

So far we haven’t checked how sales are distributed across years. Let’s do so now.

year_sales_table_global <- mydata %>% group_by(Year_of_Release) %>% summarise(Global_Sales = sum(Global_Sales))


year_sales_table_global <- year_sales_table_global[-c(39,40),]

tempdata <- table(mydata$Year_of_Release)
tempdata <- as.data.frame(tempdata)
tempdata <- tempdata[-c(39,40),]
colnames(tempdata) <- c("Year_of_Release", "Games_Released")

newData <- merge(tempdata, year_sales_table_global)

colors <- c("darkblue", "red")
barchart(Games_Released + Global_Sales ~ Year_of_Release, data = newData, 
         auto.key=list(space='bottom'), ylab = "Games released and mil. units sold", scales=list(x=list(rot=45)),
         par.settings=list(superpose.polygon=list(col=colors)))
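The code above removes rows 39 and 40 by position, which silently breaks if the set of years in the file changes. A more robust alternative is to filter by value. Which entries those two rows hold is not stated here, so this sketch assumes they are a missing-year label and a year beyond the collection date. The cutoff year is hypothetical and the data are made up.

```r
# Toy yearly table (illustrative values). Assumption: invalid entries
# appear as "N/A" or as years later than the data collection date.
toy_years <- data.frame(
  Year_of_Release = c("2014", "2015", "2016", "2020", "N/A"),
  Global_Sales = c(3.1, 2.6, 1.3, 0.3, 0.9),
  stringsAsFactors = FALSE
)

# Convert to numeric; non-numeric labels such as "N/A" become NA
year_num <- suppressWarnings(as.numeric(toy_years$Year_of_Release))

# Keep rows with a valid year up to a chosen cutoff (hypothetical: 2016)
cutoff <- 2016
valid <- toy_years[!is.na(year_num) & year_num <= cutoff, ]
valid
```

The same condition could be applied to both the sales table and the release counts before merging, so the two stay aligned without hard-coded indices.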

The number of releases and global sales follow a strikingly similar pattern over time. In fact, the correlation between the two is 0.98. Let’s take a look at their scatterplot.

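# qplot() is deprecated in recent ggplot2 releases; ggplot() + geom_point() draws the same scatterplot
ggplot(data = newData, aes(x = Global_Sales, y = Games_Released)) + geom_point() + xlab("Sales (mil. units)") + ylab("Games Released")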

As the scatterplot shows, there is a strong positive relationship between the number of games released in a year and that year’s total sales. This pattern merits deeper analysis.
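The code that produced the 0.98 correlation is not shown. Presumably it is a call to cor() on the two columns of newData. A minimal sketch on made-up yearly data (it will not reproduce the 0.98 figure):

```r
# Toy stand-in for newData (illustrative values only)
toy_new <- data.frame(
  Games_Released = c(10, 50, 200, 600, 1400),
  Global_Sales   = c(11, 30, 120, 330, 680)
)

# Pearson correlation between yearly release counts and yearly sales
r <- cor(toy_new$Games_Released, toy_new$Global_Sales)
r
```

On the real data, the call would be cor(newData$Games_Released, newData$Global_Sales).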