PROJECT SUMMARY

Dataset: Video Games Sales Dataset

data <- read.csv("C:\\Users\\gajaw\\OneDrive\\Desktop\\STATS\\vgsales.csv")

Source - Kaggle

Link - https://www.kaggle.com/gregorut/videogamesales?select=vgsales.csv

The dataset I am working with is a collection of global video game sales, featuring information about each game’s name, platform, genre, publisher, release year, and sales in various regions (North America, Europe, Japan, and others). It also includes the total global sales for each game. The dataset can be accessed at Kaggle. The documentation details the sources and format of the data, providing background on the collection process and variable descriptions.

Summarizing the data

Exploring the structure of the dataset

str(data)
## 'data.frame':    16598 obs. of  11 variables:
##  $ Rank        : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Name        : chr  "Wii Sports" "Super Mario Bros." "Mario Kart Wii" "Wii Sports Resort" ...
##  $ Platform    : chr  "Wii" "NES" "Wii" "Wii" ...
##  $ Year        : chr  "2006" "1985" "2008" "2009" ...
##  $ Genre       : chr  "Sports" "Platform" "Racing" "Sports" ...
##  $ Publisher   : chr  "Nintendo" "Nintendo" "Nintendo" "Nintendo" ...
##  $ NA_Sales    : num  41.5 29.1 15.8 15.8 11.3 ...
##  $ EU_Sales    : num  29.02 3.58 12.88 11.01 8.89 ...
##  $ JP_Sales    : num  3.77 6.81 3.79 3.28 10.22 ...
##  $ Other_Sales : num  8.46 0.77 3.31 2.96 1 0.58 2.9 2.85 2.26 0.47 ...
##  $ Global_Sales: num  82.7 40.2 35.8 33 31.4 ...

The dataset's columns describe the sales performance, release information, and platform of each video game:

  • Rank: The position of the game based on its global sales (1 is the highest-selling game).

  • Name: The title of the video game.

  • Platform: The gaming system or console the game was released on (e.g., PlayStation, Xbox, Wii).

  • Year: The year the game was released.

  • Genre: The type or category of the game (e.g., Action, Sports, Puzzle).

  • Publisher: The company that produced and distributed the game (e.g., Nintendo, EA).

  • NA_Sales: Total sales of the game in North America, in millions of units.

  • EU_Sales: Total sales of the game in Europe, in millions of units.

  • JP_Sales: Total sales of the game in Japan, in millions of units.

  • Other_Sales: Total sales of the game in other regions (outside North America, Europe, and Japan), in millions of units.

  • Global_Sales: Total sales of the game worldwide, in millions of units.

This dataset is useful for understanding which factors, such as platform, genre, and region, contribute most to a video game’s global sales, helping to analyze trends in the gaming industry over time.

Summary Statistics of the data:

summary(data)
##       Rank           Name             Platform             Year          
##  Min.   :    1   Length:16598       Length:16598       Length:16598      
##  1st Qu.: 4151   Class :character   Class :character   Class :character  
##  Median : 8300   Mode  :character   Mode  :character   Mode  :character  
##  Mean   : 8301                                                           
##  3rd Qu.:12450                                                           
##  Max.   :16600                                                           
##     Genre            Publisher            NA_Sales          EU_Sales      
##  Length:16598       Length:16598       Min.   : 0.0000   Min.   : 0.0000  
##  Class :character   Class :character   1st Qu.: 0.0000   1st Qu.: 0.0000  
##  Mode  :character   Mode  :character   Median : 0.0800   Median : 0.0200  
##                                        Mean   : 0.2647   Mean   : 0.1467  
##                                        3rd Qu.: 0.2400   3rd Qu.: 0.1100  
##                                        Max.   :41.4900   Max.   :29.0200  
##     JP_Sales         Other_Sales        Global_Sales    
##  Min.   : 0.00000   Min.   : 0.00000   Min.   : 0.0100  
##  1st Qu.: 0.00000   1st Qu.: 0.00000   1st Qu.: 0.0600  
##  Median : 0.00000   Median : 0.01000   Median : 0.1700  
##  Mean   : 0.07778   Mean   : 0.04806   Mean   : 0.5374  
##  3rd Qu.: 0.04000   3rd Qu.: 0.04000   3rd Qu.: 0.4700  
##  Max.   :10.22000   Max.   :10.57000   Max.   :82.7400

A simple breakdown of the summary for each column:

  1. Rank:

    • Games are ranked by global sales, with rank 1 being the highest-selling game. Ranks run from 1 to 16,600 while the dataset has 16,598 rows, so a few rank values do not appear in this file.
  2. Name, Platform, Genre, Publisher:

    • These are text columns that show the name of the game, the platform it was released on (e.g., Wii, PlayStation), the genre (e.g., Action, Sports), and the publisher (e.g., Nintendo, EA). No statistical summaries apply here as they are non-numeric.
  3. Year:

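    • Year is read in as character (chr in the str() output above), so summary() reports no numeric statistics for it. The figures below assume the column has been converted to numeric.

    • Games in the dataset were released between 1980 and 2020.

    • The median release year is 2007, with a mean around 2006. Most games were released between 2003 and 2010, which suggests that this was a high-activity period in the gaming industry.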

  4. NA_Sales, EU_Sales, JP_Sales, Other_Sales, Global_Sales:

    • These represent sales in millions of units for different regions (North America, Europe, Japan, other regions) and globally.

    • The average global sales per game is 0.54 million units.

    • Some games had very high sales, with the maximum global sales reaching 82.74 million units, while many games had minimal or no sales in certain regions, as indicated by the minimum values of 0 in several columns.

  5. deviation_total_sales:

    • This is a derived column (it is not among the 11 variables listed by str(data)). It shows how far each game’s global sales deviate from the average global sales.

    • The maximum deviation is 82.2 million units (82.74 - 0.54), indicating a large gap between the best-selling game and the average game.

  6. deviation_year:

    • This is also a derived column. It measures how far each game’s release year deviates from the average release year (around 2006).

    • The largest deviation is around 26 years, which corresponds to the earliest games from 1980. The latest release year in the dataset is 2020, about 14 years after the average.
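
The two deviation columns in items 5 and 6 are not produced by any code shown in this summary. The sketch below shows one way they might be computed, assuming each is the absolute difference from the column mean and that Year must first be converted from character. The `toy` data frame is a stand-in for `data`: its first four rows use the values printed by str(data), and the fifth row, including its non-numeric year, is hypothetical.

```r
# Toy stand-in for `data`: the first four rows as printed by str(data),
# plus one hypothetical row whose Year is not a number.
toy <- data.frame(
  Name         = c("Wii Sports", "Super Mario Bros.", "Mario Kart Wii",
                   "Wii Sports Resort", "Hypothetical Game"),
  Year         = c("2006", "1985", "2008", "2009", "N/A"),
  Global_Sales = c(82.7, 40.2, 35.8, 33.0, 0.5),
  stringsAsFactors = FALSE
)

# Convert Year to numeric; entries that are not numbers become NA.
toy$Year <- suppressWarnings(as.numeric(toy$Year))

# Assumed definition: absolute distance from the column mean.
toy$deviation_total_sales <- abs(toy$Global_Sales - mean(toy$Global_Sales))
toy$deviation_year <- abs(toy$Year - mean(toy$Year, na.rm = TRUE))

print(toy)
```

Rows with a missing year keep NA in deviation_year, so they drop out of any later summary that uses na.rm = TRUE.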

Key Insights:

  • The majority of games in this dataset were released between 2003 and 2010, suggesting that this period was highly active in the gaming industry.

  • The sales figures are highly skewed, with a few games achieving massive global sales while most games remain closer to the lower end of the scale.

  • There is a significant deviation in both sales and release years, pointing to outliers in both categories (e.g., blockbuster games and games released far from the average release period).
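
The skew and outlier points above can be checked numerically. The sketch below compares the mean with the median and flags high outliers with Tukey's 1.5 × IQR rule. That rule is an assumption here, not a method stated in this report, and `flag_high_outliers` and the `sales` values are hypothetical stand-ins for data$Global_Sales.

```r
# Flag values above Q3 + 1.5 * IQR (Tukey's rule for high outliers).
flag_high_outliers <- function(x) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  x > q[2] + 1.5 * (q[2] - q[1])
}

# Hypothetical sales values in millions of units; replace with data$Global_Sales.
sales <- c(0.01, 0.05, 0.06, 0.10, 0.17, 0.20, 0.47, 0.60, 1.20, 82.74)

# A mean far above the median is a quick sign of right skew.
cat("mean:", mean(sales), " median:", median(sales), "\n")

# Logical vector marking the values that exceed the upper fence.
is_outlier <- flag_high_outliers(sales)
print(sales[is_outlier])
```

On a distribution as skewed as this one, the rule is likely to flag a large share of successful games, so a stricter multiplier or a log scale may be worth trying.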

Main Question

The main question for my project is:

“What factors contribute most to the global success of video games, and how do regional sales, platform, genre, and release timing impact overall sales performance?”

Purpose:

The goal is to analyze the dataset to identify the key drivers behind high-selling video games, understand trends in sales across different regions, and explore how platform, genre, and release year influence a game’s success. The project aims to uncover insights that can help predict future game sales based on these factors.

Visualizations for at least two interesting aspects of the data worth further investigation

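library(ggplot2)  # provides ggplot(), geom_boxplot(), labs(), theme()

# One box per genre; x-axis labels are rotated so the genre names do not overlap
ggplot(data, aes(x = Genre, y = Global_Sales)) +
  geom_boxplot(fill = "green") +
  labs(title = "Global Sales Distribution by Genre", 
       x = "Genre", 
       y = "Global Sales (in millions)") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))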

The boxplot above shows how global sales are distributed within each genre. Only the genre view is shown here; platform is the second aspect worth investigating in the same way.
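
For the platform side, a per-platform summary table is a possible starting point. The sketch below is an assumption about how that view might be built, not code from the original analysis. In the `toy` data frame, the first four rows use the values printed by str(data) and the "DS" row is hypothetical.

```r
# Toy stand-in for `data`; the last row is hypothetical.
toy <- data.frame(
  Platform     = c("Wii", "NES", "Wii", "Wii", "DS"),
  Global_Sales = c(82.7, 40.2, 35.8, 33.0, 0.5),
  stringsAsFactors = FALSE
)

# tapply() returns one value per platform, in the same (alphabetical) order each time.
total_sales  <- tapply(toy$Global_Sales, toy$Platform, sum)
median_sales <- tapply(toy$Global_Sales, toy$Platform, median)
n_rows       <- tapply(toy$Global_Sales, toy$Platform, length)

platform_summary <- data.frame(
  Platform     = names(total_sales),
  releases     = as.integer(n_rows),
  median_sales = as.numeric(median_sales),
  total_sales  = as.numeric(total_sales)
)

# Sort so the platform with the highest total sales comes first.
platform_summary <- platform_summary[order(-platform_summary$total_sales), ]
print(platform_summary, row.names = FALSE)
```

The resulting table could be passed to ggplot() with geom_col() to draw the platform counterpart of the genre plot.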

Plan Moving Forward:

In the upcoming milestones, I plan to do the following to achieve the goal of this project:

Hypothesis Testing - Test hypotheses around factors like platform, genre, and release year affecting global sales.

Regional Sales Analysis - Perform a deeper analysis to compare sales across different regions.

Time Series Analysis - Examine historical sales trends and the evolution of sales across various platforms and genres. Keep an eye out for trends or shifts in release tactics and how they affect sales.

Outlier Detection - Identify games whose sales are significant anomalies and investigate the factors, such as timing, marketing, or platform dominance, that contributed to their success.

Presentation and Visualization - Create visualizations that summarize the key findings for each region, platform, and genre, and use these insights to explain trends in the gaming industry.
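
As a first step toward the regional sales analysis in this plan, each region's share of total sales can be computed from the four regional columns. The sketch below uses only the first four rows as printed by str(data), so the shares are illustrative and not results for the full dataset.

```r
# First four rows of the regional columns, as printed by str(data).
toy <- data.frame(
  NA_Sales    = c(41.5, 29.1, 15.8, 15.8),
  EU_Sales    = c(29.02, 3.58, 12.88, 11.01),
  JP_Sales    = c(3.77, 6.81, 3.79, 3.28),
  Other_Sales = c(8.46, 0.77, 3.31, 2.96)
)

region_cols <- c("NA_Sales", "EU_Sales", "JP_Sales", "Other_Sales")

# Total units per region, then each region's percentage of the combined total.
region_totals <- colSums(toy[region_cols])
region_share  <- round(100 * region_totals / sum(region_totals), 1)

print(region_share)
```

Running the same lines on `data` instead of `toy` would give the shares for all games. Grouping by Genre or Platform first would show where regional preferences differ.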

Initial Findings

Hypothesis 1

  • Games released on multiple platforms have higher global sales compared to games released on only one platform.

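    # Label each title by the number of distinct platforms it appears on.
    # The Platform value alone cannot tell whether a game was multi-platform.
    # Assumes rows with an identical Name are the same game on different platforms.
    n_platforms <- ave(as.integer(factor(data$Platform)), data$Name,
                       FUN = function(p) length(unique(p)))
    data$Platform_Type <- ifelse(n_platforms > 1, "Multiple", "Single")
    ggplot(data, aes(x = Platform_Type, y = Global_Sales)) +
      geom_boxplot(fill = "yellow") +
      labs(title = "Global Sales Distribution: Single vs. Multiple Platforms", 
           x = "Platform Type", 
           y = "Global Sales (in millions)")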

This boxplot shows the distribution of global sales for games released on a single platform versus multiple platforms. It shows the range and variation in sales, not just the averages.
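
A boxplot alone does not say whether the difference between the two groups is larger than chance would produce. Because sales are heavily skewed, a rank-based test is one reasonable option. The sketch below runs a one-sided Wilcoxon rank-sum test with wilcox.test() from base R. The choice of test is an assumption, and the `toy` values are hypothetical, not taken from the dataset.

```r
# Hypothetical sales (millions of units) for six games in each group.
toy <- data.frame(
  Platform_Type = rep(c("Multiple", "Single"), each = 6),
  Global_Sales  = c(0.9, 1.4, 2.1, 3.3, 0.7, 5.0,
                    0.1, 0.3, 0.2, 0.8, 0.05, 1.0),
  stringsAsFactors = FALSE
)

# One-sided test: are "Multiple" sales shifted above "Single" sales?
# With the formula interface, the first factor level ("Multiple", alphabetically)
# is the first group, so alternative = "greater" tests Multiple > Single.
# exact = FALSE uses the normal approximation, which also copes with ties.
result <- wilcox.test(Global_Sales ~ Platform_Type, data = toy,
                      alternative = "greater", exact = FALSE)

print(result$statistic)
print(result$p.value)
```

A small p-value would support Hypothesis 1 for the rows tested. It would not show that releasing on more platforms causes higher sales.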

Hypothesis 2

  • Games released in recent years generate higher global sales due to advancements in technology and marketing, compared to games released earlier.

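    # str(data) shows Year as character, so convert it to numeric before
    # fitting a linear trend; non-numeric entries become NA and are dropped
    data$Year <- suppressWarnings(as.numeric(data$Year))
    ggplot(subset(data, !is.na(Year)), aes(x = Year, y = Global_Sales)) +
      geom_point(alpha = 0.5, color = "darkgreen") +
      geom_smooth(method = "lm", se = FALSE, color = "blue") +
      labs(title = "Global Sales Over the Years", 
           x = "Year", 
           y = "Global Sales (in millions)")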
    ## `geom_smooth()` using formula = 'y ~ x'

This scatter plot shows individual game sales by release year, with a linear trend line for the overall pattern across time. It highlights both the trend and the spread of individual game sales.
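
The scatter plot shows per-game sales, so it can hide changes in how many games were released each year. The sketch below aggregates by year and reads off the slope of the same straight-line model that geom_smooth(method = "lm") fits. In the `toy` data frame, the first four rows use the values printed by str(data) and the last two rows are hypothetical.

```r
# Toy stand-in for `data`; the last two rows are hypothetical.
toy <- data.frame(
  Year         = c("2006", "1985", "2008", "2009", "2008", "N/A"),
  Global_Sales = c(82.7, 40.2, 35.8, 33.0, 1.0, 0.5),
  stringsAsFactors = FALSE
)

# Convert Year to numeric and drop rows where the year is unknown.
toy$Year <- suppressWarnings(as.numeric(toy$Year))
toy <- toy[!is.na(toy$Year), ]

# Per-year release count, mean sales per game, and total sales.
yearly <- data.frame(
  Year        = sort(unique(toy$Year)),
  releases    = as.integer(tapply(toy$Global_Sales, toy$Year, length)),
  mean_sales  = as.numeric(tapply(toy$Global_Sales, toy$Year, mean)),
  total_sales = as.numeric(tapply(toy$Global_Sales, toy$Year, sum))
)
print(yearly)

# Slope of the per-game linear trend: change in sales per additional year.
fit <- lm(Global_Sales ~ Year, data = toy)
print(coef(fit)[["Year"]])
```

A positive slope would support Hypothesis 2 and a negative one would count against it. A single straight line cannot show a change of direction partway through the period, so the yearly table is worth reading alongside it.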

Insights Gathered

  • Multi-platform games appear to have higher global sales, plausibly because they reach a broader audience, but some single-platform games still perform well, possibly due to exclusivity.

  • Recent game releases (after 2010) appear to show an upward trend in global sales, which would be consistent with industry growth driven by technology and marketing advancements. This still needs to be confirmed with a formal test.