2023-03-26

Video Game Data

This data provides a table of popular video games (with over 100,000 sold copies.) The following parameters are covered:

  • Rank of Game Sales
  • Name of the Game
  • Platform of Game (ex: PC, Wii, etc.)
  • Year of Game Release
  • Genre of Game
  • Publisher of Game
  • Sales in North America (NA), Europe (EU), Japan (JP), and the rest of the world (other) (in millions)
  • Total sales in the World

The data set and its details can be found at this page: https://www.kaggle.com/datasets/gregorut/videogamesales

Previewing the Data

sales = read.csv("vgsales.csv")
colnames(sales)
##  [1] "Rank"         "Name"         "Platform"     "Year"         "Genre"       
##  [6] "Publisher"    "NA_Sales"     "EU_Sales"     "JP_Sales"     "Other_Sales" 
## [11] "Global_Sales"
head(sales)
##   Rank                     Name Platform Year        Genre Publisher NA_Sales
## 1    1               Wii Sports      Wii 2006       Sports  Nintendo    41.49
## 2    2        Super Mario Bros.      NES 1985     Platform  Nintendo    29.08
## 3    3           Mario Kart Wii      Wii 2008       Racing  Nintendo    15.85
## 4    4        Wii Sports Resort      Wii 2009       Sports  Nintendo    15.75
## 5    5 Pokemon Red/Pokemon Blue       GB 1996 Role-Playing  Nintendo    11.27
## 6    6                   Tetris       GB 1989       Puzzle  Nintendo    23.20
##   EU_Sales JP_Sales Other_Sales Global_Sales
## 1    29.02     3.77        8.46        82.74
## 2     3.58     6.81        0.77        40.24
## 3    12.88     3.79        3.31        35.82
## 4    11.01     3.28        2.96        33.00
## 5     8.89    10.22        1.00        31.37
## 6     2.26     4.22        0.58        30.26

Sample Question to Introduce Data:

What are the top 5 most popular publishers? How many games were produced by the rest?

Top Game Publishing Companies (by # of popular games produced)

Second Interpretation

Many would argue that the success of a gaming company is better marked by the success of the produced games rather than the number of games produced. The box plot shows the success of the top 5 publishers (by number of games produced).

Using this data from one company’s perspective

The data can be filtered to better understand a single companies success by country, platform, and relative to other companies. For this project, THQ will be selected as the practice company.

Where does THQ rank by the number of games produced?

index = which(sort_publisher$Publisher == "THQ")
index
## [1] 6
sort_publisher[index,2]
## [1] 715

THQ ranks 6th in companies that sell the greatest number of popular video games, with 715 games produced.

How data about other companies can be applied:

Using the data from other companies, THQ could discover which platforms it is most popular with to determine whether it should focus on certain platforms or try to catch up on others. The following slides use data for ALL companies.

Top Platforms (by # of popular games produced)

Graphing this data

To be more specific towards THQ purposes, the data will be filtered to show the most popular platforms for THQ games.

# filter the platform data for THQ and recalculate the values
THQ = filter(sales, Publisher == "THQ")
platforms_THQ = as.data.frame(table(THQ$Platform), stringsAsFactors = FALSE)
colnames(platforms_THQ) = c("Platform", "Frequency")
sort_platforms_THQ = platforms_THQ[order(platforms_THQ$Frequency, decreasing = TRUE),]
rownames(sort_platforms_THQ) = c(1:nrow(platforms_THQ))

# save the top 5 platforms and group the rest
n = 5
bottom = sort_platforms_THQ[(n+1):nrow(sort_platforms_THQ),]
other = sum(bottom$Frequency)
platforms_final_THQ = sort_platforms_THQ[1:n,]
platforms_final_THQ[(n+1),] = c("Other", other)

platforms_final_THQ
##   Platform Frequency
## 1       DS       115
## 2      GBA       110
## 3      PS2       100
## 4      Wii        76
## 5     X360        64
## 6    Other       250

## Analyzing the results By filtering the data to THQ produced-games, a new platform, GBA, is shown with greater significance to the company.

Applying knowledge as a consumer

Data analysis can be very valuable to those with two specific interests: video games and the stock market.

Tracking the production of a company over time and its success can help an individual decide whether they’d like to invest in it.

Graph of Game Sales Over Time

# calculate and plot the line of best fit
line_THQ = lm(THQ$Global_Sales~THQ$Year)
summary(line_THQ)
## 
## Call:
## lm(formula = THQ$Global_Sales ~ THQ$Year)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.61909 -0.33942 -0.17948  0.08632  2.96080 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 47.212233  12.131227   3.892 0.000109 ***
## THQ$Year    -0.023297   0.006047  -3.852 0.000128 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5689 on 710 degrees of freedom
## Multiple R-squared:  0.02047,    Adjusted R-squared:  0.0191 
## F-statistic: 14.84 on 1 and 710 DF,  p-value: 0.0001276

The formula is as follows:


\(Global\;Sales\;(in\;millions) = -0.023297*Year + 47.2122\)


The \(R^2\) value is 0.02047, which suggests that there is a very weak linear correlation between time and sales. There is a lot of variability above and below the plot line. Log-transforming the data did not provide stronger linear trends.

From this data, the company appears to be holding a steady production of games with a slight decline; however, the weak \(R^2\) does not provide confidence to this conclusion.

Sales in North America vs. the World

A second linear interpretation can be performed to see if THQ sales in the United States are indicative of their sales in the rest of the world using a simple linear regression model.

Graph Model

Calculate values

THQ_line = lm(THQ$Global_Sales~THQ$NA_Sales)
summary(THQ_line)
## 
## Call:
## lm(formula = THQ$Global_Sales ~ THQ$NA_Sales)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.90914 -0.07868 -0.03032  0.04154  1.17178 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.007986   0.010122   0.789     0.43    
## THQ$NA_Sales 1.604765   0.022760  70.508   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2032 on 710 degrees of freedom
## Multiple R-squared:  0.875,  Adjusted R-squared:  0.8749 
## F-statistic:  4971 on 1 and 710 DF,  p-value: < 2.2e-16

The formula is as follows:


\(Global\;Sales= 1.60476*N.A.\;Sales\; + 0.007986\)


with sales in millions.

The \(R^2\) value is 0.875, which suggests that there is a strong linear correlation between sale performance in each grouping. The p-value is also \(2.2*10^{-6}\), so we fail to reject the null hypothesis that there is a relationship between a video game’s sale in North America and the world.

Further testing

# collect residual values
res_THQ = residuals(THQ_line)
# create Normal Q-Q plot of residual values
qqnorm(res_THQ)

The points curve towards the ends and have a linear trend in the middle, so other than the extremities, the data has a strong linear nature. Other than the very high and very low sales, THQ sales in North America and the world can be reasonably predicted with a linear model.