Introduction

Video games are an important part of the entertainment industry. Since a young age I have been fascinated by the world of video games and what it has to offer. Most of us enjoyed playing video games as kids, and many of us still do as adults. That is why I decided to analyze this gaming data. With it we will break down trends such as the most popular platform, the most popular genre, and the biggest market. Unfortunately, the data covers physical sales only and the most recent years are incomplete, so it has real limitations. Nevertheless, there is enough data to analyze, and we will explore it as far as possible in a simple and informative way.

I) Data Input

We are using a data file that contains a list of video games with their sales. The file was generated by a scrape of vgchartz.com combined with another web scrape from Metacritic. The file is called vgsales.csv, and we will cite it as the source in the caption of every figure.

data <- read.csv("/cloud/project/vgsales.csv", stringsAsFactors = FALSE)

library(tidyverse)
library(RColorBrewer)
library(ggplot2)
library(hrbrthemes)
library(viridis)
library(plotly)
library(ggthemes)
library(dplyr)
library(psych)
library(lubridate)

II) Inspecting the Data

dim(data)
## [1] 16598    11
names(data)
##  [1] "Rank"         "Name"         "Platform"     "Year"         "Genre"       
##  [6] "Publisher"    "NA_Sales"     "EU_Sales"     "JP_Sales"     "Other_Sales" 
## [11] "Global_Sales"

From our inspection we can conclude:

  • The data contain 16598 rows and 11 columns.
  • The column names are “Rank”, “Name”, “Platform”, “Year”, “Genre”, “Publisher”, “NA_Sales”, “EU_Sales”, “JP_Sales”, “Other_Sales”, “Global_Sales”.

III) Cleaning the Data

The file is loaded into a data frame for analysis. As mentioned, the data for the most recent years (2017 and 2020 in this file) are incomplete, so we remove those years along with the rows that have missing values. This will make the analysis more reliable.

#Checking the data
str(data)
## 'data.frame':    16598 obs. of  11 variables:
##  $ Rank        : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Name        : chr  "Wii Sports" "Super Mario Bros." "Mario Kart Wii" "Wii Sports Resort" ...
##  $ Platform    : chr  "Wii" "NES" "Wii" "Wii" ...
##  $ Year        : chr  "2006" "1985" "2008" "2009" ...
##  $ Genre       : chr  "Sports" "Platform" "Racing" "Sports" ...
##  $ Publisher   : chr  "Nintendo" "Nintendo" "Nintendo" "Nintendo" ...
##  $ NA_Sales    : num  41.5 29.1 15.8 15.8 11.3 ...
##  $ EU_Sales    : num  29.02 3.58 12.88 11.01 8.89 ...
##  $ JP_Sales    : num  3.77 6.81 3.79 3.28 10.22 ...
##  $ Other_Sales : num  8.46 0.77 3.31 2.96 1 0.58 2.9 2.85 2.26 0.47 ...
##  $ Global_Sales: num  82.7 40.2 35.8 33 31.4 ...
data[ , c("Platform", "Year", "Genre", "Publisher")] <- lapply(data[ , c("Platform", "Year", "Genre", "Publisher")], as.factor)

str(data)
## 'data.frame':    16598 obs. of  11 variables:
##  $ Rank        : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Name        : chr  "Wii Sports" "Super Mario Bros." "Mario Kart Wii" "Wii Sports Resort" ...
##  $ Platform    : Factor w/ 31 levels "2600","3DO","3DS",..: 26 12 26 26 6 6 5 26 26 12 ...
##  $ Year        : Factor w/ 40 levels "1980","1981",..: 27 6 29 30 17 10 27 27 30 5 ...
##  $ Genre       : Factor w/ 12 levels "Action","Adventure",..: 11 5 7 11 8 6 5 4 5 9 ...
##  $ Publisher   : Factor w/ 579 levels "10TACLE Studios",..: 369 369 369 369 369 369 369 369 369 369 ...
##  $ NA_Sales    : num  41.5 29.1 15.8 15.8 11.3 ...
##  $ EU_Sales    : num  29.02 3.58 12.88 11.01 8.89 ...
##  $ JP_Sales    : num  3.77 6.81 3.79 3.28 10.22 ...
##  $ Other_Sales : num  8.46 0.77 3.31 2.96 1 0.58 2.9 2.85 2.26 0.47 ...
##  $ Global_Sales: num  82.7 40.2 35.8 33 31.4 ...
#checking missing value
data[data == "N/A"]<-NA
colSums(is.na(data))
##         Rank         Name     Platform         Year        Genre    Publisher 
##            0            0            0          271            0           58 
##     NA_Sales     EU_Sales     JP_Sales  Other_Sales Global_Sales 
##            0            0            0            0            0
#Proportion of missing values per column
colSums(is.na(data))/nrow(data)
##         Rank         Name     Platform         Year        Genre    Publisher 
##  0.000000000  0.000000000  0.000000000  0.016327268  0.000000000  0.003494397 
##     NA_Sales     EU_Sales     JP_Sales  Other_Sales Global_Sales 
##  0.000000000  0.000000000  0.000000000  0.000000000  0.000000000
#Dropping missing values
data <- data %>% 
  drop_na(Year, Publisher)
anyNA(data)
## [1] FALSE
data <- data[data$Year != "N/A" & data$Year != "2017" & data$Year != "2020", ]
data$Year <- factor(data$Year)

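The data are now in the desired form: the categorical columns are factors, rows with a missing Year or Publisher are removed, and the incomplete years 2017 and 2020 are excluded, leaving 16287 rows.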
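The cleaning steps above could also be done at import time. The following is a minimal sketch, assuming the missing entries are stored as the literal text "N/A" as in this file; it uses a tiny inline sample with made-up rows (`sample_csv`, `games` and `clean` are illustrative names), not the real vgsales.csv.

```r
# Tiny inline sample standing in for vgsales.csv (rows are made up for illustration)
sample_csv <- "Name,Year,Publisher,Global_Sales
Game A,2006,Nintendo,1.50
Game B,N/A,Activision,0.40
Game C,2017,N/A,0.10
Game D,2010,Ubisoft,0.75"

# na.strings turns the literal text "N/A" into NA while reading
games <- read.csv(text = sample_csv, na.strings = "N/A", stringsAsFactors = FALSE)

# With no non-numeric text left, Year is read as an integer column
str(games$Year)

# Keep rows with a known Year and Publisher, and drop the incomplete years
keep <- complete.cases(games[, c("Year", "Publisher")]) &
  !(games$Year %in% c(2017, 2020))
clean <- games[keep, ]
clean
```

With this approach the "N/A" text never becomes a factor level, so no unused level has to be dropped afterwards.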

IV) Data Summary

#Describe Data
describe(data)
##              vars     n    mean      sd  median trimmed     mad  min      max
## Rank            1 16287 8288.97 4792.14 8291.00 8288.15 6157.24 1.00 16600.00
## Name*           2 16287 5707.19 3276.85 5776.00 5720.84 4219.48 1.00 11322.00
## Platform*       3 16287   16.73    8.27   17.00   16.68   10.38 1.00    31.00
## Year*           4 16287   27.40    5.83   28.00   27.86    5.93 1.00    37.00
## Genre*          5 16287    5.93    3.76    6.00    5.86    5.93 1.00    12.00
## Publisher*      6 16287  297.92  181.75  328.00  302.21  268.35 1.00   579.00
## NA_Sales        7 16287    0.27    0.82    0.08    0.13    0.12 0.00    41.49
## EU_Sales        8 16287    0.15    0.51    0.02    0.06    0.03 0.00    29.02
## JP_Sales        9 16287    0.08    0.31    0.00    0.02    0.00 0.00    10.22
## Other_Sales    10 16287    0.05    0.19    0.01    0.02    0.01 0.00    10.57
## Global_Sales   11 16287    0.54    1.57    0.17    0.28    0.21 0.01    82.74
##                 range  skew kurtosis    se
## Rank         16599.00  0.00    -1.20 37.55
## Name*        11321.00 -0.03    -1.21 25.68
## Platform*       30.00 -0.05    -0.99  0.06
## Year*           36.00 -1.01     1.85  0.05
## Genre*          11.00  0.07    -1.43  0.03
## Publisher*     578.00 -0.14    -1.40  1.42
## NA_Sales        41.49 18.74   642.49  0.01
## EU_Sales        29.02 18.77   745.95  0.00
## JP_Sales        10.22 11.12   191.08  0.00
## Other_Sales     10.57 24.10  1011.31  0.00
## Global_Sales    82.73 17.30   595.62  0.01
#Data summary
summary(data)
##       Rank           Name              Platform         Year     
##  Min.   :    1   Length:16287       DS     :2130   2009   :1431  
##  1st Qu.: 4132   Class :character   PS2    :2127   2008   :1428  
##  Median : 8291   Mode  :character   PS3    :1304   2010   :1257  
##  Mean   : 8289                      Wii    :1290   2007   :1201  
##  3rd Qu.:12438                      X360   :1234   2011   :1136  
##  Max.   :16600                      PSP    :1197   2006   :1008  
##                                     (Other):7005   (Other):8826  
##           Genre                             Publisher        NA_Sales      
##  Action      :3250   Electronic Arts             : 1339   Min.   : 0.0000  
##  Sports      :2304   Activision                  :  966   1st Qu.: 0.0000  
##  Misc        :1686   Namco Bandai Games          :  928   Median : 0.0800  
##  Role-Playing:1468   Ubisoft                     :  917   Mean   : 0.2657  
##  Shooter     :1282   Konami Digital Entertainment:  823   3rd Qu.: 0.2400  
##  Adventure   :1274   THQ                         :  712   Max.   :41.4900  
##  (Other)     :5023   (Other)                     :10602                    
##     EU_Sales          JP_Sales         Other_Sales        Global_Sales   
##  Min.   : 0.0000   Min.   : 0.00000   Min.   : 0.00000   Min.   : 0.010  
##  1st Qu.: 0.0000   1st Qu.: 0.00000   1st Qu.: 0.00000   1st Qu.: 0.060  
##  Median : 0.0200   Median : 0.00000   Median : 0.01000   Median : 0.170  
##  Mean   : 0.1478   Mean   : 0.07885   Mean   : 0.04844   Mean   : 0.541  
##  3rd Qu.: 0.1100   3rd Qu.: 0.04000   3rd Qu.: 0.04000   3rd Qu.: 0.480  
##  Max.   :29.0200   Max.   :10.22000   Max.   :10.57000   Max.   :82.740  
## 

Data Summary

  • Action is the most common genre (3250 titles).
  • The Nintendo DS is the platform with the most titles (2130), just ahead of the PS2 (2127).
  • 2009 is the year in which the most video games were published (1431).
  • Electronic Arts published the most titles (1339).
  • NA accounts for about half of global video game sales (mean of 0.27 versus 0.54 per title).

V) Data Visualization

We will plot some visuals to see where NA stands compared to the other markets.

# Plotting Sales of the Markets
data %>%  
  select(Platform,Year,Genre, NA_Sales,EU_Sales, JP_Sales,Other_Sales,Global_Sales) %>%summary()
##     Platform         Year               Genre         NA_Sales      
##  DS     :2130   2009   :1431   Action      :3250   Min.   : 0.0000  
##  PS2    :2127   2008   :1428   Sports      :2304   1st Qu.: 0.0000  
##  PS3    :1304   2010   :1257   Misc        :1686   Median : 0.0800  
##  Wii    :1290   2007   :1201   Role-Playing:1468   Mean   : 0.2657  
##  X360   :1234   2011   :1136   Shooter     :1282   3rd Qu.: 0.2400  
##  PSP    :1197   2006   :1008   Adventure   :1274   Max.   :41.4900  
##  (Other):7005   (Other):8826   (Other)     :5023                    
##     EU_Sales          JP_Sales         Other_Sales        Global_Sales   
##  Min.   : 0.0000   Min.   : 0.00000   Min.   : 0.00000   Min.   : 0.010  
##  1st Qu.: 0.0000   1st Qu.: 0.00000   1st Qu.: 0.00000   1st Qu.: 0.060  
##  Median : 0.0200   Median : 0.00000   Median : 0.01000   Median : 0.170  
##  Mean   : 0.1478   Mean   : 0.07885   Mean   : 0.04844   Mean   : 0.541  
##  3rd Qu.: 0.1100   3rd Qu.: 0.04000   3rd Qu.: 0.04000   3rd Qu.: 0.480  
##  Max.   :29.0200   Max.   :10.22000   Max.   :10.57000   Max.   :82.740  
## 
market_means <- data.frame(Mean = c(mean(data$NA_Sales), mean(data$EU_Sales), mean(data$JP_Sales), mean(data$Other_Sales), mean(data$Global_Sales)))
row.names(market_means) <- c("North America ", "Europe", "Japan", "Rest of the world ", "Worldwide")
market_means$Mean_round <- round(market_means$Mean ,digit=2)
market_means
##                          Mean Mean_round
## North America      0.26569534       0.27
## Europe             0.14776754       0.15
## Japan              0.07884939       0.08
## Rest of the world  0.04843679       0.05
## Worldwide          0.54102229       0.54
theme_set(theme_bw())

ggplot(data = market_means, mapping = aes(x=row.names(market_means), y=Mean_round)) + 
geom_boxplot() + geom_segment(aes(x=row.names(market_means), 
                   xend=row.names(market_means), 
                   y=0, 
                   yend=Mean_round)) +
 geom_label(mapping = aes(label=Mean_round), fill = "darkblue", size = 3.5, color = "white", fontface = "bold", hjust=.5) +
  ggtitle("Sales share on the Markets") +
  xlab("Markets") +
  ylab("Mean of Sales") +
  labs(caption="source: vgsales.csv") +
  theme(
    plot.title = element_text(size = 24, hjust = .5, face = "bold"),
    axis.title.x = element_text(size = 18, hjust = .5, face = "italic"),
    axis.title.y = element_text(size = 18, hjust = .5, face = "italic"),
    axis.text.x = element_text(size = 10, face = "bold", angle = 0),
    axis.text.y = element_text(size = 10, face = "bold"),
    legend.position = "none")

According to the graph, NA accounts for about 1/2 of worldwide sales (0.27 of 0.54), followed by EU with about 1/4 (0.15) and JP with about 1/7 (0.08).
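The fractions quoted above can be checked directly from the table of means. This is a small sketch that copies the values from the market_means output; the names `region_means`, `worldwide_mean` and `share` are ours.

```r
# Mean sales per title, copied from the market_means output above
region_means <- c("North America" = 0.26569534,
                  "Europe"        = 0.14776754,
                  "Japan"         = 0.07884939,
                  "Rest of world" = 0.04843679)
worldwide_mean <- 0.54102229

# Share of each market relative to the worldwide mean, in percent
share <- round(100 * region_means / worldwide_mean, 1)
share
```

The shares come out at roughly 49%, 27%, 15% and 9%, which is consistent with the rounded fractions in the text.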

5.1 Genre Frequency Distribution

We will construct a genre frequency distribution.

# Construct a frequency distribution, sum of the numbers in each category (Genre) based on how many times it shows up. 
freq_genre <- data.frame(cbind(Frequency = table(data$Genre), Percent = prop.table(table(data$Genre)) * 100))
freq_genre <- freq_genre[order(freq_genre$Frequency, decreasing = T), ]
freq_genre
##              Frequency   Percent
## Action            3250 19.954565
## Sports            2304 14.146252
## Misc              1686 10.351814
## Role-Playing      1468  9.013324
## Shooter           1282  7.871308
## Adventure         1274  7.822189
## Racing            1225  7.521336
## Platform           875  5.372383
## Simulation         847  5.200467
## Fighting           836  5.132928
## Strategy           670  4.113710
## Puzzle             570  3.499724
# Plot

ggplot(data = freq_genre, mapping = aes(x = Frequency, y = row.names(freq_genre))) +
  geom_bar(stat = "identity", mapping = aes(fill = row.names(freq_genre), color = row.names(freq_genre)), alpha = .7, size = 1.1) +
  geom_label(mapping = aes(label=Frequency), fill = "darkblue", size = 3.5, color = "white", fontface = "bold", hjust=.5) +
  ggtitle("Genre Frequency Distribution") +
  xlab("Frequency") +
  ylab("Genres") +
  labs(caption="source: vgsales.csv") +
  theme(
    plot.title = element_text(size = 24, hjust = .5, face = "bold"),
    axis.title.x = element_text(size = 18, hjust = .5, face = "italic"),
    axis.title.y = element_text(size = 18, hjust = .5, face = "italic"),
    axis.text.x = element_text(size = 10, face = "bold", angle = 0),
    axis.text.y = element_text(size = 10, face = "bold"),
    legend.position = "none")

We can see that Action is the most frequent genre, which agrees with our earlier summary. As a side note, some of the genres overlap and could be combined, but we will leave them as they are for now.
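If we did want to combine genres, one possible approach is a lookup vector that maps each genre to a broader group. The sketch below uses the counts from the frequency table above; the grouping in `merge_map` is purely hypothetical and only illustrates the mechanics, it is not a recommendation for which genres belong together.

```r
# Title counts per genre, copied from the frequency table above
genre_counts <- c(Action = 3250, Sports = 2304, Misc = 1686,
                  "Role-Playing" = 1468, Shooter = 1282, Adventure = 1274,
                  Racing = 1225, Platform = 875, Simulation = 847,
                  Fighting = 836, Strategy = 670, Puzzle = 570)

# Hypothetical grouping (an assumption for illustration only)
merge_map <- c(Action = "Action/Adventure", Adventure = "Action/Adventure",
               Strategy = "Strategy/Puzzle", Puzzle = "Strategy/Puzzle")

# Genres listed in merge_map get the new label; all others keep their name
genres <- names(genre_counts)
group <- ifelse(genres %in% names(merge_map), merge_map[genres], genres)

# Sum the counts within each group and sort from largest to smallest
merged <- sort(tapply(genre_counts, group, sum), decreasing = TRUE)
merged
```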

5.2 Comparing Number of sales per genre

We will use a heat map to see how total sales in each market vary by genre.

options(repr.plot.width = 20, repr.plot.height = 5)

sales_comp_gen <- data %>%
  select(Genre, NA_Sales, EU_Sales, JP_Sales, Other_Sales) %>% 
  group_by(Genre) %>% 
  summarise(NA_Sales = sum(NA_Sales), 
            EU_Sales = sum(EU_Sales), 
            JP_Sales = sum(JP_Sales), 
            Other_Sales = sum(Other_Sales))

sales_comp_gen <- pivot_longer(data =  sales_comp_gen, 
                               cols = c("NA_Sales", "EU_Sales", "JP_Sales", "Other_Sales"))

ggplot(data = sales_comp_gen, aes(x = name, y = Genre, fill = value))+
  geom_tile(aes(fill = value))+
  geom_text(aes(label = value), position = position_dodge(width = .1), color = "black")+ 
  labs(title = "Sales Comparison by Genre",
       subtitle = "Video Games Sales Data",
       x = "Market",
       y = NULL,
       fill = NULL)+
  theme_minimal()+ labs(caption="source: vgsales.csv") +
  theme(legend.position = "right")+
  scale_fill_distiller(palette = "Spectral")

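We can see that total sales of the Action genre are very high in NA and EU, followed by the Sports genre. NA has the highest sales in almost every genre; the per-genre means in section 5.5 suggest that Role-Playing is the exception, with Japan slightly ahead.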

5.3 Gaming Company Distribution WorldWide

We will group the consoles by the company behind each platform to see which company's platforms have the most titles.

# Construct a frequency distribution, sum of the numbers in each category (Platform) based on how many times it shows up. 

freq_platform <- data.frame(cbind(Frequency = table(data$Platform), Percent = prop.table(table(data$Platform)) * 100))
freq_platform <- freq_platform[order(freq_platform$Frequency, decreasing = T), ]
freq_platform
##      Frequency      Percent
## DS        2130 13.077914901
## PS2       2127 13.059495303
## PS3       1304  8.006385461
## Wii       1290  7.920427335
## X360      1234  7.576594830
## PSP       1197  7.349419783
## PS        1189  7.300300853
## PC         938  5.759194450
## XB         803  4.930312519
## GBA        786  4.825934795
## GC         542  3.327807454
## 3DS        499  3.063793209
## PSV        408  2.505065390
## PS4        335  2.056855161
## N64        316  1.940197704
## SNES       239  1.467428010
## XOne       213  1.307791490
## SAT        173  1.062196844
## WiiU       143  0.878000860
## 2600       116  0.712224474
## NES         98  0.601706883
## GB          97  0.595567017
## DC          52  0.319273040
## GEN         27  0.165776386
## NG          12  0.073678394
## SCD          6  0.036839197
## WS           6  0.036839197
## 3DO          3  0.018419598
## TG16         2  0.012279732
## GG           1  0.006139866
## PCFX         1  0.006139866
# Regroup platform as Platform_type
# Take the platform codes from the row names so they stay aligned with the sorted counts
freq_platform$Platform <- row.names(freq_platform)
pc <- c("PC")
xbox <- c("X360", "XB", "XOne")
# SCD (Sega CD) is not a Nintendo platform, so it is left to fall under "Others"
nintendo <- c("Wii", "WiiU", "N64", "GC", "NES", "3DS", "DS", "SNES", "GBA", "GB")
playstation <- c("PS", "PS2", "PS3", "PS4", "PSP", "PSV")
platforms <- freq_platform %>%
  mutate(Platform_type = ifelse(Platform %in% pc, "PC",
                                ifelse(Platform %in% xbox, "Xbox",
                                       ifelse(Platform %in% nintendo, "Nintendo", 
                                              ifelse(Platform %in% playstation, "Playstation", "Others")))))

ggplot(data = platforms, mapping = aes(x = Frequency, y = Platform_type)) +
  geom_bar(stat = "identity", mapping = aes(fill = Platform_type, color = Platform_type), alpha = 0.7, size = 0.3) +
  ggtitle("Gaming Company Frequency Distribution") + 
  xlab("Frequency") +
  ylab("Company") + 
  coord_flip() + labs(caption="source: vgsales.csv")+ theme(
    plot.title = element_text(size = 19, hjust = .5, face = "bold"),
    axis.title.x = element_text(size = 18, hjust = .5, face = "italic"),
    axis.title.y = element_text(size = 18, hjust = .5, face = "italic"),
    axis.text.x = element_text(size = 10, face = "bold", angle = 0),
    axis.text.y = element_text(size = 10, face = "bold"),
    legend.position = "none")

We can see that PlayStation and Nintendo platforms dominate the catalogue. Nintendo trails PlayStation by only a small margin, which looks about right given the history of video games.
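The bar heights can be cross-checked by summing the platform counts per company. The sketch below copies the counts from the frequency table above and uses a lookup vector (`company_of`, a name introduced here) instead of nested ifelse calls. As an assumption of this sketch, SCD (Sega CD) is counted under "Others" rather than Nintendo.

```r
# Title counts per platform, copied from the frequency table above
counts <- c(DS = 2130, PS2 = 2127, PS3 = 1304, Wii = 1290, X360 = 1234,
            PSP = 1197, PS = 1189, PC = 938, XB = 803, GBA = 786, GC = 542,
            "3DS" = 499, PSV = 408, PS4 = 335, N64 = 316, SNES = 239,
            XOne = 213, SAT = 173, WiiU = 143, "2600" = 116, NES = 98,
            GB = 97, DC = 52, GEN = 27, NG = 12, SCD = 6, WS = 6,
            "3DO" = 3, TG16 = 2, GG = 1, PCFX = 1)

# Lookup vector: platform code -> company
company_of <- c(PC = "PC",
                X360 = "Xbox", XB = "Xbox", XOne = "Xbox",
                Wii = "Nintendo", WiiU = "Nintendo", N64 = "Nintendo",
                GC = "Nintendo", NES = "Nintendo", "3DS" = "Nintendo",
                DS = "Nintendo", SNES = "Nintendo", GBA = "Nintendo",
                GB = "Nintendo",
                PS = "Playstation", PS2 = "Playstation", PS3 = "Playstation",
                PS4 = "Playstation", PSP = "Playstation", PSV = "Playstation")

# Platforms missing from the lookup come back as NA and become "Others"
company <- unname(company_of[names(counts)])
company[is.na(company)] <- "Others"

# Total number of titles per company, largest first
totals <- sort(tapply(counts, company, sum), decreasing = TRUE)
totals
```

Under these assumptions PlayStation platforms have 6560 titles and Nintendo platforms 6140, so the gap between the two is a few hundred titles.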

5.4 Top Publisher based on Sales

Visualizing the top publisher

data$Year <- as.Date(as.character(data$Year), format="%Y")
data$Year <- year(data$Year)

data$Name <- as.character(data$Name)

data.publisher.sales <- aggregate(
  Global_Sales~Publisher+Year,
  data,
  sum
)
  
data.publisher.sales.clean <- aggregate(
  Global_Sales~Publisher,
  data,
  sum
)
data.publisher.sales.clean <- data.publisher.sales.clean[
  order(data.publisher.sales.clean$Global_Sales, decreasing=T),
]

ggplot(data.publisher.sales,
  aes(
    x=Global_Sales,
    y=reorder(Publisher, Global_Sales),
    fill=Year
  )
) +
  geom_bar(stat="identity") +
  scale_fill_continuous(low="red", high="blue") +
  scale_y_discrete(limits=head(data.publisher.sales.clean, 10)$Publisher) + labs(caption="source: vgsales.csv")+
  labs(
    y="Publisher",
    x="Global Sales")
## Warning: Removed 2054 rows containing missing values (position_stack).

The top publisher by global sales is Nintendo, with a variety of games spanning from the 1980s to the 2000s. This makes sense, since the best-selling titles in the data, such as Wii Sports and Super Mario Bros., are Nintendo games.

  • Re-checking the graph against the aggregated totals
#Total global sales per publisher
total_sales_publisher <- aggregate.data.frame(x = list(Total_Sales = data$Global_Sales),
                                 by = list(Publisher = data$Publisher),
                                 FUN = sum)

total_sales_publisher <- total_sales_publisher[order(total_sales_publisher$Total_Sales, decreasing = T), ]
head(total_sales_publisher, 10)
##                        Publisher Total_Sales
## 368                     Nintendo     1784.43
## 139              Electronic Arts     1093.39
## 17                    Activision      721.41
## 463  Sony Computer Entertainment      607.28
## 531                      Ubisoft      473.25
## 497         Take-Two Interactive      399.30
## 513                          THQ      340.44
## 282 Konami Digital Entertainment      278.56
## 451                         Sega      270.66
## 351           Namco Bandai Games      253.65

The re-check confirms that the totals are consistent with the ranking shown in the graph.
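The warning about removed rows appears because the y scale is limited to the top ten publishers, so the rows of every other publisher are dropped at plot time. One way to avoid it is to filter the data before plotting. The sketch below shows the idea in base R on made-up numbers (publishers "A" to "D" and `top_n` are illustrative); the filtered data frame could then be passed to ggplot without setting limits on the scale.

```r
# Made-up yearly sales for four hypothetical publishers (illustration only)
sales <- data.frame(
  Publisher    = c("A", "A", "B", "B", "C", "D"),
  Year         = c(2001, 2002, 2001, 2002, 2001, 2002),
  Global_Sales = c(5, 7, 3, 1, 10, 0.5),
  stringsAsFactors = FALSE
)

# Total sales per publisher, sorted from highest to lowest
totals <- aggregate(Global_Sales ~ Publisher, data = sales, FUN = sum)
totals <- totals[order(totals$Global_Sales, decreasing = TRUE), ]

# Keep only the rows that belong to the top publishers
top_n <- 2
top_publishers <- head(totals$Publisher, top_n)
top_sales <- sales[sales$Publisher %in% top_publishers, ]

# Order the factor levels so the best seller is drawn at the top of a horizontal bar chart
top_sales$Publisher <- factor(top_sales$Publisher, levels = rev(top_publishers))
top_sales
```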

5.5 Hypothesis

For the hypothesis tests, we will compare three of the markets: NA, JP and Global.

a) Hypothesis

  • Is there a significant difference in NA_Sales with respect to genre? (The model below uses Genre only; platform is not included.)
min(data$NA_Sales)
## [1] 0
max(data$NA_Sales)
## [1] 41.49
median(data$NA_Sales)
## [1] 0.08
# In the output below, genres with a p-value < 0.05 differ significantly from the baseline genre (Action); the rest do not.

fit <- lm( NA_Sales ~ Genre , data = data)
summary(fit)
## 
## Call:
## lm(formula = NA_Sales ~ Genre, data = data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -0.510 -0.235 -0.154 -0.015 41.199 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        0.265160   0.014323  18.513  < 2e-16 ***
## GenreAdventure    -0.185152   0.026990  -6.860 7.14e-12 ***
## GenreFighting     -0.001117   0.031665  -0.035    0.972    
## GenreMisc         -0.029739   0.024507  -1.213    0.225    
## GenrePlatform      0.244543   0.031098   7.864 3.97e-15 ***
## GenrePuzzle       -0.051107   0.037079  -1.378    0.168    
## GenreRacing        0.026211   0.027375   0.957    0.338    
## GenreRole-Playing -0.042749   0.025677  -1.665    0.096 .  
## GenreShooter       0.183483   0.026930   6.813 9.87e-12 ***
## GenreSimulation   -0.050862   0.031501  -1.615    0.106    
## GenreSports        0.025678   0.022238   1.155    0.248    
## GenreStrategy     -0.163921   0.034645  -4.732 2.25e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8165 on 16275 degrees of freedom
## Multiple R-squared:  0.01519,    Adjusted R-squared:  0.01452 
## F-statistic: 22.82 on 11 and 16275 DF,  p-value: < 2.2e-16

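For our first hypothesis, the overall F-test (F = 22.82, p-value < 2.2e-16) shows that mean NA_Sales differ significantly across genres. Compared with the baseline genre Action, Adventure and Strategy sell significantly less per title, and Platform and Shooter significantly more (p-value < 0.05); the other genres do not differ significantly from Action. Genre explains only about 1.5% of the variance (R-squared 0.015), and platform was not part of this model, so we cannot draw a conclusion about it.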
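The question above mentions platform, but the fitted model only includes Genre. A minimal sketch of how both could be tested is shown below, using simulated data rather than the real file (the data frame `toy`, its values, and the model names are assumptions for illustration). The first anova() call gives the overall F-test for Genre, and the second compares the nested models to see whether Platform adds anything once Genre is in the model.

```r
set.seed(42)

# Simulated stand-in data: three genres and two platforms (not the real sales)
toy <- data.frame(
  Genre    = factor(rep(c("Action", "Puzzle", "Shooter"), each = 60)),
  Platform = factor(rep(c("DS", "PS2"), times = 90))
)

# Shooter gets a higher true mean, so there is a genre effect to detect
toy$Sales <- rnorm(180, mean = ifelse(toy$Genre == "Shooter", 1.0, 0.3), sd = 0.5)

fit_genre <- lm(Sales ~ Genre, data = toy)
fit_both  <- lm(Sales ~ Genre + Platform, data = toy)

# Overall F-test: a small p-value means the genre means are not all equal
genre_test <- anova(fit_genre)
genre_test

# Nested-model comparison: does Platform improve the fit beyond Genre alone?
platform_test <- anova(fit_genre, fit_both)
platform_test
```

On the real data the same two calls would be applied to NA_Sales with the full Genre and Platform factors.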

b) Hypothesis

  • Is there a significant difference in JP_Sales with respect to genre? (The model below uses Genre only; platform is not included.)
min(data$JP_Sales)
## [1] 0
max(data$JP_Sales)
## [1] 10.22
median(data$JP_Sales)
## [1] 0
# In the output below, genres with a p-value < 0.05 differ significantly from the baseline genre (Action); the rest do not.

fit <- lm( JP_Sales ~ Genre , data = data)
summary(fit)
## 
## Call:
## lm(formula = JP_Sales ~ Genre, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.2386 -0.0633 -0.0488 -0.0288  9.9814 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        0.048812   0.005381   9.072  < 2e-16 ***
## GenreAdventure    -0.008004   0.010139  -0.789 0.429900    
## GenreFighting      0.055434   0.011895   4.660 3.19e-06 ***
## GenreMisc          0.014456   0.009206   1.570 0.116393    
## GenrePlatform      0.100502   0.011683   8.603  < 2e-16 ***
## GenrePuzzle        0.050626   0.013929   3.635 0.000279 ***
## GenreRacing       -0.002600   0.010284  -0.253 0.800406    
## GenreRole-Playing  0.189778   0.009646  19.674  < 2e-16 ***
## GenreShooter      -0.019031   0.010117  -1.881 0.059971 .  
## GenreSimulation    0.026205   0.011834   2.214 0.026812 *  
## GenreSports        0.009677   0.008354   1.158 0.246719    
## GenreStrategy      0.024471   0.013015   1.880 0.060091 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3067 on 16275 degrees of freedom
## Multiple R-squared:  0.03354,    Adjusted R-squared:  0.03289 
## F-statistic: 51.35 on 11 and 16275 DF,  p-value: < 2.2e-16

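For our second hypothesis, the overall F-test (F = 51.35, p-value < 2.2e-16) shows that mean JP_Sales differ significantly across genres. Compared with the baseline genre Action, Fighting, Platform, Puzzle, Role-Playing and Simulation sell significantly more per title in Japan (p-value < 0.05), with Role-Playing showing the largest difference; the other genres do not differ significantly from Action. Genre explains only about 3.4% of the variance (R-squared 0.034).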
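Because Genre is the only predictor and Action is the baseline level, the intercept is the mean for Action and each coefficient is the difference between a genre's mean and the Action mean. The sketch below rebuilds a few per-genre means from coefficients copied from the NA and JP outputs above; the helper `genre_mean` is a name introduced here for illustration.

```r
# Coefficients copied from the summary(fit) outputs above (Action is the baseline)
na_coef <- c(Intercept = 0.265160, Platform = 0.244543, "Role-Playing" = -0.042749)
jp_coef <- c(Intercept = 0.048812, Platform = 0.100502, "Role-Playing" = 0.189778)

# Mean per genre = intercept (Action) + that genre's coefficient
genre_mean <- function(coefs) {
  offsets <- coefs[names(coefs) != "Intercept"]
  c(Action = coefs[["Intercept"]], coefs[["Intercept"]] + offsets)
}

means <- rbind(NA_Sales = genre_mean(na_coef), JP_Sales = genre_mean(jp_coef))
round(means, 3)
```

The Role-Playing mean comes out at roughly 0.24 for JP and 0.22 for NA, which is why Japan stands out for that genre even though NA is the larger market overall.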

c) Hypothesis

  • Is there a significant difference in Global_Sales with respect to genre? (The model below uses Genre only; platform is not included.)
min(data$Global_Sales)
## [1] 0.01
max(data$Global_Sales)
## [1] 82.74
median(data$Global_Sales)
## [1] 0.17
# In the output below, genres with a p-value < 0.05 differ significantly from the baseline genre (Action); the rest do not.

fit <- lm( Global_Sales ~ Genre , data = data)
summary(fit)
## 
## Call:
## lm(formula = Global_Sales ~ Genre, data = data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -0.938 -0.461 -0.310 -0.039 82.172 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        0.530102   0.027338  19.391  < 2e-16 ***
## GenreAdventure    -0.345965   0.051516  -6.716 1.93e-11 ***
## GenreFighting      0.001059   0.060438   0.018    0.986    
## GenreMisc         -0.061614   0.046776  -1.317    0.188    
## GenrePlatform      0.417476   0.059357   7.033 2.10e-12 ***
## GenrePuzzle       -0.105172   0.070772  -1.486    0.137    
## GenreRacing        0.063172   0.052251   1.209    0.227    
## GenreRole-Playing  0.099183   0.049010   2.024    0.043 *  
## GenreShooter       0.270366   0.051400   5.260 1.46e-07 ***
## GenreSimulation   -0.070019   0.060125  -1.165    0.244    
## GenreSports        0.038145   0.042445   0.899    0.369    
## GenreStrategy     -0.271490   0.066126  -4.106 4.05e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.559 on 16275 degrees of freedom
## Multiple R-squared:  0.01214,    Adjusted R-squared:  0.01147 
## F-statistic: 18.18 on 11 and 16275 DF,  p-value: < 2.2e-16

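For our last hypothesis, the overall F-test (F = 18.18, p-value < 2.2e-16) shows that mean Global_Sales also differ significantly across genres. Compared with the baseline genre Action, Adventure and Strategy sell significantly less per title, and Platform, Shooter and Role-Playing significantly more (p-value < 0.05); the other genres do not differ significantly from Action. Genre explains only about 1.2% of the variance (R-squared 0.012), and platform was again not part of the model.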

VI) Conclusion

In conclusion, as a long-time gamer, I'm very impressed with the results we were able to see in this report. They seem to match the real world closely, credit to those who worked on the data.

On the heat map, the Platform genre scores high. This might cause some confusion, but Platform here is a genre (platform games such as Super Mario Bros., which the data lists under Genre "Platform"), not a console. Based on the sales achieved, North America is the largest market and leads in almost every genre. Most titles were released on PlayStation and Nintendo platforms, and Action is the dominant genre in every market except Japan, which seems to prefer the Role-Playing genre.

We also acknowledge that the claims of our analysis are limited. The data was narrowed down (for example, we removed the rows with a missing year or publisher), it was not fully complete, and it does not include phones as a platform. Because all of our observations are made relative to a few regional markets and platforms, the results would be expected to change if more regions and platforms were included. Thus our claims are not absolute for all games.

In the future, it would be interesting to study the number of sales on different platforms and regions.