Executive Summary

This project will analyze the games sales of X-Box One and PlayStation 4 that will help to define interesting metrics in answering business questions as follow:

  1. List the Top 10 Games that make the most sales.
  2. What Group of Genre that make the most sales?
  3. Which Game Publisher is leading in the market?
  4. What’s the average of game sold in each continent?
  5. Display the total sales by year.

As we know that bad data could lead to bad decision-making, therefore, performing data wrangling before analysis is very important. Data wrangling is a process of transforming messy data into an insighful information. In a simple way, below is the image of data wrangling steps:

The pre-processing steps involved in this project are as follow:
1. Collect 2 datasets from Kaggle using appropriate functions and library packages.
2. Merge/combine the datasets into one and select useful variables to work with.
3. Discover the data to provide description of each variable.
4. Scan the data to find any missing values, errors, or special values, then find a suitable methodology to deal with it.
5. Scan for any outliers using summary statistics, apply appropriate plots on the data, explain the methodology and transform it for better insight.
6. Perform Tidy and Manipulate technique such as mutate a new variable with calculation, reshape the data from untidy to tidy format for better analysis.

1. Discovering

The first dataset is X-box One Games Sales, collected from Kaggle Open Data which can be downloaded from the link: https://www.kaggle.com/sidtwr/videogames-sales-dataset?select=XboxOne_GameSales.csv.

And the second dataset is Playstation 4 Games Sales, collected also from Kaggle Open Data which can be downloaded from the link: https://www.kaggle.com/sidtwr/videogames-sales-dataset?select=PS4_GamesSales.csv.

Both datasets have 9 main variables and described as follow:

  1. Game: name/title of the game
  2. Year: year when the game was published
  3. Genre: genre of the game
  4. Publisher: name of the company that publish the game
  5. North America: sales in North America (in Million)
  6. Europe: sales in Europe (in Million)
  7. Japan: sales in Japan (in Million)
  8. Rest of World: sales in the Rest of the world (in Million)
  9. Global: sales made globally
# Required Packages
library(readr)
library(dplyr)
library(tidyr)
library(outliers)
library(ggplot2)
library(lattice)

The two datasets are merged into one using the full_join(), based on Game, Year, Publisher, and Genre.

After joined the datasets, we ended up with only 12 variables where:
4 variables are Game, Year, Genre, and Publisher,
4 variables are the X-box One Sales in each continent and,
4 variables are the Playstation 4 sales in each continent.

##### DATASET 1: XBOX ONE #####
# Import and read the dataset
xbox <- read_csv("XboxOne_GameSales.csv")

# Select the columns for analysis
xbox <- xbox %>% select("Year", "Game", "Publisher", "Genre", "North America", "Europe", "Japan", "Rest of World")

##### DATASET 2: PLAYSTATION 4 #####
# Import and read the dataset
ps4 <- read_csv("PS4_GamesSales.csv")

# Select the columns for analysis
ps4 <- ps4 %>% select("Year", "Game", "Publisher", "Genre", "North America", "Europe", "Japan", "Rest of World")

# Merging the two datasets using full_join by Game, Year, Publisher, Genre
joinxp <- full_join(xbox, ps4, by = c("Game", "Year", "Publisher", "Genre"))

# Check the head of the dataset
joinxp
## # A tibble: 1,177 x 12
##    Year  Game  Publisher Genre `North America.~ Europe.x Japan.x
##    <chr> <chr> <chr>     <chr>            <dbl>    <dbl>   <dbl>
##  1 2014  Gran~ Rockstar~ Acti~             4.7      3.25    0.01
##  2 2015  Call~ Activisi~ Shoo~             4.63     2.04    0.02
##  3 2017  Call~ Activisi~ Shoo~             3.75     1.91    0   
##  4 2018  Red ~ Rockstar~ Acti~             3.76     1.47    0   
##  5 2014  Mine~ Microsof~ Misc              3.23     1.71    0   
##  6 2014  Call~ Activisi~ Shoo~             3.25     1.49    0.01
##  7 2016  Batt~ Electron~ Shoo~             3.37     1.26    0.02
##  8 2015  Fall~ Bethesda~ Role~             2.94     1.62    0.02
##  9 2015  Halo~ Microsof~ Shoo~             2.94     1.49    0.03
## 10 2016  Call~ Activisi~ Shoo~             2.91     1.44    0   
## # ... with 1,167 more rows, and 5 more variables: `Rest of World.x` <dbl>,
## #   `North America.y` <dbl>, Europe.y <dbl>, Japan.y <dbl>, `Rest of
## #   World.y` <dbl>

2. Structuring

We need to check the data structure of the joined table to ensure that the variables have appropriate data format and will be ready for analysis.
During data conversion, we also need to check the value using unique() function to understand the dataset.
There are combination of numeric and characters datatypes. Some of them need to be converted as follow:
1. Year: convert from Char to Factor, relabelled and ordered as True.
2. Genre: convert from Char to Factor, the order is not important.

After the conversion, we check the class() and levels() again to ensure the datatype are converted correctly and in order.

# Check the data structure
str(joinxp)
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 1177 obs. of  12 variables:
##  $ Year           : chr  "2014" "2015" "2017" "2018" ...
##  $ Game           : chr  "Grand Theft Auto V" "Call of Duty: Black Ops 3" "Call of Duty: WWII" "Red Dead Redemption 2" ...
##  $ Publisher      : chr  "Rockstar Games" "Activision" "Activision" "Rockstar Games" ...
##  $ Genre          : chr  "Action" "Shooter" "Shooter" "Action-Adventure" ...
##  $ North America.x: num  4.7 4.63 3.75 3.76 3.23 3.25 3.37 2.94 2.94 2.91 ...
##  $ Europe.x       : num  3.25 2.04 1.91 1.47 1.71 1.49 1.26 1.62 1.49 1.44 ...
##  $ Japan.x        : num  0.01 0.02 0 0 0 0.01 0.02 0.02 0.03 0 ...
##  $ Rest of World.x: num  0.76 0.68 0.57 0.54 0.49 0.48 0.48 0.45 0.45 0.44 ...
##  $ North America.y: num  6.06 6.18 4.67 5.26 NA 2.84 2.2 2.91 NA 3.11 ...
##  $ Europe.y       : num  9.71 6.05 6.21 6.21 NA 3.34 3.65 3.97 NA 3.83 ...
##  $ Japan.y        : num  0.6 0.41 0.4 0.21 NA 0.14 0.29 0.27 NA 0.19 ...
##  $ Rest of World.y: num  3.02 2.44 2.12 2.26 NA 1.22 1.12 1.34 NA 1.36 ...
# Check for any unique value in the Year and Genre
unique(joinxp$Year)
## [1] "2014" "2015" "2017" "2018" "2016" "2013" "N/A"  "2019" "2020"
unique(joinxp$Genre)
##  [1] "Action"           "Shooter"          "Action-Adventure" "Misc"            
##  [5] "Role-Playing"     "Racing"           "Sports"           "Fighting"        
##  [9] "Adventure"        "MMO"              "Music"            "Simulation"      
## [13] "Strategy"         "Platform"         "Puzzle"           "Visual Novel"    
## [17] "Party"
##### Data conversion Steps #####
# Convert Year datatype to factor and put it in order
joinxp$Year <- factor(joinxp$Year, levels = c("2013", "2014", "2015", "2016", "2017", "2018", "2019", "2020"), ordered = T)

# Convert Genre datatype to factor
joinxp$Genre <- factor(joinxp$Genre, levels = c("Action", "Shooter", "Action-Adventure", "Misc", "Role-Playing", "Racing", "Sports", "Fighting", "Adventure", "MMO", "Music", "Simulation", "Strategy", "Platform", "Puzzle", "Visual Novel", "Party"))

# Check the class and levels
class(joinxp$Year)
## [1] "ordered" "factor"
levels(joinxp$Year)
## [1] "2013" "2014" "2015" "2016" "2017" "2018" "2019" "2020"
class(joinxp$Genre)
## [1] "factor"
levels(joinxp$Genre)
##  [1] "Action"           "Shooter"          "Action-Adventure" "Misc"            
##  [5] "Role-Playing"     "Racing"           "Sports"           "Fighting"        
##  [9] "Adventure"        "MMO"              "Music"            "Simulation"      
## [13] "Strategy"         "Platform"         "Puzzle"           "Visual Novel"    
## [17] "Party"

3. Enriching

Data Wrangling steps don’t always have to be sequential as depicted on the picture above. As long as the Data Analyst understand the business goal and workflow of the analysis, it is fine to do it the other way around. We will perform the data enriching first then data cleaning after.

Enriched data will help create personalized relationships that provide a more holistic understanding of the business’ needs.

Noticed that the joined table ended up with 2 similar continent sales (e.g. North America.x for X-box One and North America.y for Playstation 4). Since we are interested in the sales from both consoles, we need to combine the 2 continent sales into one using the mutate() function and call the new variables as Continent_Total_Sales (eg. North_America_Total_Sales, Europe_Total_Sales, Japan_Total_Sales, and Rest_Of_World_Total_Sales).

After that, we can drop the individual continent sales from our joined table and focus only on each continent total sales.

# Mutate new variables by adding the two same continents
joinxp <- joinxp %>% mutate(North_America_Total_Sales = `North America.x` + `North America.y`,
                            Europe_Total_Sales = Europe.x + Europe.y, 
                            Japan_Total_Sales = Japan.x + Japan.y,
                            Rest_Of_World_Total_Sales = `Rest of World.x` + `Rest of World.y`)

# Select only the continents that are already combined
joinxp_filter <- joinxp %>% select(-c("North America.x", "North America.y",
                                      "Europe.x", "Europe.y", "Japan.x", "Japan.y", 
                                      "Rest of World.x", "Rest of World.y"))

# Check the head of the new filtered dataset
joinxp_filter %>% head(3)
## # A tibble: 3 x 8
##   Year  Game  Publisher Genre North_America_T~ Europe_Total_Sa~ Japan_Total_Sal~
##   <ord> <chr> <chr>     <fct>            <dbl>            <dbl>            <dbl>
## 1 2014  Gran~ Rockstar~ Acti~            10.8             13.0              0.61
## 2 2015  Call~ Activisi~ Shoo~            10.8              8.09             0.43
## 3 2017  Call~ Activisi~ Shoo~             8.42             8.12             0.4 
## # ... with 1 more variable: Rest_Of_World_Total_Sales <dbl>

For better analysis, we need to manipulate it by following the tidy data principle. The continent sales are currently in wide format, we can unite these and have one variable that represent as continent.

In case if we are interested with the total sales only regardless of where the games were sold, having four continents may be difficult to see. We can reshape the data to have only 2 variables; Continents and Sales.

# Gather the sales data from separate continents into single value
joinxp_tidy <- gather(joinxp_filter, key = "Continents", value = "Sales", 5:8)

# Convert datatype to Factor and relabelled it
joinxp_tidy$Continents <- factor(joinxp_tidy$Continents, levels = c("North_America_Total_Sales", "Europe_Total_Sales", "Japan_Total_Sales", "Rest_Of_World_Total_Sales"), labels = c("North America", "Europe", "Japan", "Rest of the World" ))

# Check the head of the new reshaped dataset
joinxp_tidy
## # A tibble: 4,708 x 6
##    Year  Game                    Publisher       Genre        Continents   Sales
##    <ord> <chr>                   <chr>           <fct>        <fct>        <dbl>
##  1 2014  Grand Theft Auto V      Rockstar Games  Action       North Ameri~ 10.8 
##  2 2015  Call of Duty: Black Op~ Activision      Shooter      North Ameri~ 10.8 
##  3 2017  Call of Duty: WWII      Activision      Shooter      North Ameri~  8.42
##  4 2018  Red Dead Redemption 2   Rockstar Games  Action-Adve~ North Ameri~  9.02
##  5 2014  MineCraft               Microsoft Stud~ Misc         North Ameri~ NA   
##  6 2014  Call of Duty: Advanced~ Activision      Shooter      North Ameri~  6.09
##  7 2016  Battlefield 1           Electronic Arts Shooter      North Ameri~  5.57
##  8 2015  Fallout 4               Bethesda Softw~ Role-Playing North Ameri~  5.85
##  9 2015  Halo 5: Guardians       Microsoft Stud~ Shooter      North Ameri~ NA   
## 10 2016  Call of Duty: Infinite~ Activision      Shooter      North Ameri~  6.02
## # ... with 4,698 more rows

Have a look at the difference after converted the data from wide format to long format. By reducing certain similar variables that can be grouped as one, the analysis will become much easier.

4. Cleaning

Scan Missing values, special values, or obvious errors.

There are 3 variables with missing values detected using colsums() function. Below are the detail and solution to handle it:

  1. Year: 1068 missing values Solution: Since Year is a Factor datatype, a new level name “Unknown” is added to replace the missing value. Assuming that the publisher did not provide it since the beginning.

  2. Publisher: 1068 missing values Solution: Replacing with string: “Unknown”. Assuming that the publisher did not provide it since the beginning.

  3. Sales: 2828 missing values Solution: The missing values occurence are due to the combined datasets from X-box One and Playstation 4 where both continents are duplicated. In this report, we will omit the value.

There are no special values or obvious errors found.

# Check for infinite or NaN or NA values
# Create a function to detect any infinite or nan or na values
is.specialorNA <- function(x){
if (is.numeric(x)) (is.infinite(x) | is.nan(x) | is.na(x))
}

# Apply the function to the dataset
sapply(joinxp_tidy, function(x) sum(is.na(x)))
##       Year       Game  Publisher      Genre Continents      Sales 
##       1068          0       1068          0          0       2828
# Create a copy for scanning purposes
joinxp_scan <- joinxp_tidy

### Year ###
# Observe the Missing values
Empty_Year <- joinxp_scan %>% filter(is.na(Year))
Empty_Year %>% head(3)
## # A tibble: 3 x 6
##   Year  Game                     Publisher Genre     Continents    Sales
##   <ord> <chr>                    <chr>     <fct>     <fct>         <dbl>
## 1 <NA>  Dance Central: Spotlight <NA>      Music     North America    NA
## 2 <NA>  A Boy and His Blob       <NA>      Platform  North America     0
## 3 <NA>  Another World            <NA>      Adventure North America     0
# Adding a new level to a factor
levels(joinxp_scan$Year) <- c(levels(joinxp_scan$Year), "Unknown")
levels(joinxp_scan$Year)
## [1] "2013"    "2014"    "2015"    "2016"    "2017"    "2018"    "2019"   
## [8] "2020"    "Unknown"
# Replace with Unknown through labels
joinxp_scan <- joinxp_scan %>% mutate(Year = replace(Year, is.na(Year), "Unknown"))


### Publisher ###
# Observe the Missing values
Empty_Publisher <- joinxp_scan %>% filter(is.na(Publisher))
Empty_Publisher %>% head(3)
## # A tibble: 3 x 6
##   Year    Game                     Publisher Genre     Continents    Sales
##   <ord>   <chr>                    <chr>     <fct>     <fct>         <dbl>
## 1 Unknown Dance Central: Spotlight <NA>      Music     North America    NA
## 2 Unknown A Boy and His Blob       <NA>      Platform  North America     0
## 3 Unknown Another World            <NA>      Adventure North America     0
# Replace with Unknown
joinxp_scan <- joinxp_scan %>% mutate(Publisher = replace(Publisher, is.na(Publisher), "Unknown"))


### Sales ###
# Dealing with Missing values in Sales
Empty_sales <- joinxp_scan %>% filter(is.na(Sales))
Empty_sales %>% head(3)
## # A tibble: 3 x 6
##   Year  Game                        Publisher         Genre   Continents   Sales
##   <ord> <chr>                       <chr>             <fct>   <fct>        <dbl>
## 1 2014  MineCraft                   Microsoft Studios Misc    North Ameri~    NA
## 2 2015  Halo 5: Guardians           Microsoft Studios Shooter North Ameri~    NA
## 3 2015  Star Wars Battlefront (201~ Electronic Arts   Shooter North Ameri~    NA
# Remove all missing values due to redundant data
joinxp_scan <- na.omit(joinxp_scan)

# Recheck again if still any missing values
colSums(is.na(joinxp_scan))
##       Year       Game  Publisher      Genre Continents      Sales 
##          0          0          0          0          0          0

Scan Outliers

We are interested to observe the total games sales. For any non-sale games or 0 sales, we will remove it from the data.

Methodology:
1. The statistic summary is performed and grouped by Continents. As can be seen, the Standard Deviation for Japan and Rest of the World are much lower than North America and Europe. This mean the market in North America and Europe performed much better.

#Subset the sales with amount greater than 0
joinxp_scan <- subset(joinxp_scan, Sales > 0)

# Summary Statistics of Sales grouped by Continents 
joinxp_scan %>% group_by(Continents) %>% summarise(
  Mean=mean(Sales),
  Median=median(Sales),
  SD=sd(Sales),
  Q1=quantile(Sales, probs=.25),
  Q3=quantile(Sales,prob=.75),
  IQR=IQR(Sales),
  Min=min(Sales),
  Max=max(Sales),
)
## # A tibble: 4 x 9
##   Continents          Mean Median    SD    Q1    Q3   IQR   Min   Max
##   <fct>              <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 North America     0.773    0.23 1.42   0.1   0.78  0.68  0.01 10.8 
## 2 Europe            0.842    0.3  1.64   0.08  0.8   0.72  0.01 13.0 
## 3 Japan             0.0979   0.04 0.203  0.02  0.11  0.09  0.01  2.17
## 4 Rest of the World 0.238    0.07 0.447  0.02  0.24  0.22  0.01  3.78
# Boxplot for Sales by Continents
boxplot(joinxp_scan$Sales ~ joinxp_scan$Continents, main = "Sales by Continent (in Million Dollar)", ylab = "Sales", xlab = "Continent", col="skyblue")

# Checking outliers using Z-score
z.scores <- joinxp_scan$Sales %>% scores(type = "z")
summary(z.scores)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -0.43348 -0.40858 -0.33387  0.00000 -0.03503 10.31631
length(which(abs(z.scores) > 3))
## [1] 24
  1. Upon checking the outliers using boxplot, there seems to be so many outliers.
  2. When performing the Z-score method, it has detected 33 outliers in the Sales data.

Applying log transformation is the best way to reduce outliers and decrease the skewness of the empirical distribution.

# Check for Outliers using boxplot after log
boxplot(log(joinxp_scan$Sales) ~ joinxp_scan$Continents, main = "Sales by Continent", ylab = "Sales Frequency", xlab = "Continent", col = "skyblue")

5. Validating

When applying log transformation to the boxplot, the outliers are reduced and only 1 outlier is detected which is normal as it is not fall far away from the Maximum value.
Data validation is the process of authenticating your data and confirming that it is standardized, consistent, and high quality. Verify that it is clean and regularly structured. This process often uses automated programs to check the accuracy and validity of cells or fields by cross-checking data, although it may also be done manually.

We can also perform comparison of Sales by Genre using the same method.

### Additional work: Sales comparison by Genre
# Histogram
h3 <- joinxp_scan %>%  histogram(~log(Sales)|Genre, 
                    col="maroon",
                    data=.,
                    freq=TRUE,
                    xlab = "Sales",
                    main=("Sales by Genre")
                    )
h3

And the following is the analysis for Sales in Year.

### Additional work: Sales comparison by Year
# Histogram
h4 <- joinxp_scan %>%  histogram(~log(Sales)|Year, 
                    col="darkgreen",
                    data=.,
                    freq=TRUE,
                    xlab = "Sales",
                    main=("Sales by Year")
                    )
h4

# Boxplot
boxplot(log(joinxp_scan$Sales) ~ joinxp_scan$Year, main = "Sales by Year", ylab = "Sales Frequency", xlab = "Year", col = "darkgreen")

6. Publishing

All the data points have been standardized, making the data easier to analyze. Let’s compare the raw data to the ones that have already been cleaned: Raw data.

Tidy data.

Tidy data.

See how the data is cleaned after performing data wrangling. Now, we can start analysing the data and answer the business questions as we determined at the beginning.

Analyze

1. List the Top 10 Games that make the most sales.

top_games_sales <- joinxp_scan %>% group_by(Game) %>% summarize(Tot_Sales = sum(Sales))%>% arrange(desc(Tot_Sales)) %>% top_n(10)
## Selecting by Tot_Sales
top_games_sales
## # A tibble: 10 x 2
##    Game                           Tot_Sales
##    <chr>                              <dbl>
##  1 Grand Theft Auto V                  28.1
##  2 Call of Duty: Black Ops 3           22.4
##  3 Red Dead Redemption 2               19.7
##  4 Call of Duty: WWII                  19.6
##  5 FIFA 18                             14.9
##  6 FIFA 17                             14.6
##  7 Fallout 4                           13.5
##  8 Call of Duty: Infinite Warfare      13.3
##  9 Call of Duty: Advanced Warfare      12.8
## 10 Battlefield 1                       12.4

The result of our analysis shows that “Grand Theft Auto V” is the game with the highest sales with a total of $28,110,000.

2. What Group of Genre that makes the most sales?

top_genres_sales <- joinxp_scan %>% group_by(Genre) %>% summarize(Tot_Sales = sum(Sales))%>% arrange(desc(Tot_Sales)) %>% top_n(1)
## Selecting by Tot_Sales
top_genres_sales
## # A tibble: 1 x 2
##   Genre   Tot_Sales
##   <fct>       <dbl>
## 1 Shooter      186.

The result of our analysis shows that “Shooter” is the genre with the highest sales with a total of $182,480,000.

3. Which Game Publisher is leading in the market?

top_publisher <- joinxp_scan %>% group_by(Publisher) %>% summarize(Tot_Sales = sum(Sales))%>% arrange(desc(Tot_Sales)) %>% top_n(1)
## Selecting by Tot_Sales
top_publisher
## # A tibble: 1 x 2
##   Publisher  Tot_Sales
##   <chr>          <dbl>
## 1 Activision      106.

The result of our analysis shows that “Activision” is the top leading Publisher with a total sales of $105,720,000.

4. What’s the average game sold in each continent?

ave_sales <- joinxp_scan %>%  group_by(Continents) %>% summarize(Average_Sales = mean(Sales)) %>% arrange(desc(Average_Sales))
ave_sales
## # A tibble: 4 x 2
##   Continents        Average_Sales
##   <fct>                     <dbl>
## 1 Europe                   0.842 
## 2 North America            0.773 
## 3 Rest of the World        0.238 
## 4 Japan                    0.0979
ggplot(ave_sales,aes(x = Continents, y=Average_Sales))+geom_bar(stat ="identity")

5. Display the total sales by year.

yearly_sales <- joinxp_scan %>%  group_by(Year) %>% summarize(Total_Sales = sum(Sales))
yearly_sales
## # A tibble: 6 x 2
##   Year  Total_Sales
##   <ord>       <dbl>
## 1 2013         30.9
## 2 2014        132. 
## 3 2015        145. 
## 4 2016        145. 
## 5 2017        133. 
## 6 2018         40.1
ggplot(yearly_sales,aes(x= Year,Total_Sales))+geom_bar(stat ="identity")

Conclusion

European countries perform the best sales among the other continents followed by North America, the rest of world, and Japan.
We can assume that many gamers are based in Europe, or possibly because there are more countries in Europe than the rest of the world.
For marketing strategy, it is probably the best to hold more events in European and North American countries.