PROJECT SUMMARY
Dataset: Video Games Sales Dataset
data <- read.csv("C:\\Users\\gajaw\\OneDrive\\Desktop\\STATS\\vgsales.csv")
Source - Kaggle
Link - https://www.kaggle.com/gregorut/videogamesales?select=vgsales.csv
The dataset I am working with is a collection of global video game
sales, featuring information about each game’s name, platform, genre,
publisher, release year, and sales in various regions (North America,
Europe, Japan, and others). It also includes the total global sales for
each game. The dataset can be accessed at Kaggle. The documentation
details the sources and format of the data, providing background on the
collection process and variable descriptions.
Summarizing the data
Exploring the structure of the dataset
str(data)
## 'data.frame': 16598 obs. of 11 variables:
## $ Rank : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Name : chr "Wii Sports" "Super Mario Bros." "Mario Kart Wii" "Wii Sports Resort" ...
## $ Platform : chr "Wii" "NES" "Wii" "Wii" ...
## $ Year : chr "2006" "1985" "2008" "2009" ...
## $ Genre : chr "Sports" "Platform" "Racing" "Sports" ...
## $ Publisher : chr "Nintendo" "Nintendo" "Nintendo" "Nintendo" ...
## $ NA_Sales : num 41.5 29.1 15.8 15.8 11.3 ...
## $ EU_Sales : num 29.02 3.58 12.88 11.01 8.89 ...
## $ JP_Sales : num 3.77 6.81 3.79 3.28 10.22 ...
## $ Other_Sales : num 8.46 0.77 3.31 2.96 1 0.58 2.9 2.85 2.26 0.47 ...
## $ Global_Sales: num 82.7 40.2 35.8 33 31.4 ...
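The str() output lists Year as chr rather than a number, which usually means the column holds some non-numeric entries. A minimal sketch of one way to convert it, assuming missing years are stored as the string "N/A" (an assumption about the raw file, not something the output above confirms). The data frame `toy` is a hypothetical stand-in for `data`.

```r
# Toy stand-in for the real data; the "N/A" marker is an assumption.
toy <- data.frame(
  Name = c("Game A", "Game B", "Game C", "Game D"),
  Year = c("2006", "1985", "N/A", "2008"),
  stringsAsFactors = FALSE
)

# Recode the assumed missing-value marker, then convert to integer.
toy$Year[toy$Year == "N/A"] <- NA
toy$Year <- as.integer(toy$Year)

str(toy$Year)      # Year is now an integer vector containing one NA
summary(toy$Year)  # numeric summary; NA's are reported separately
```

If the marker really is "N/A", passing na.strings = "N/A" to read.csv() would likely achieve the same at load time, though it would also turn any "N/A" entries in other text columns into NA.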
The dataset's columns describe the sales performance, release
information, and platform of each video game:
Rank: The position of the game based on its
global sales (1 is the highest-selling game).
Name: The title of the video game.
Platform: The gaming system or console the game
was released on (e.g., PlayStation, Xbox, Wii).
Year: The year the game was released.
Genre: The type or category of the game (e.g.,
Action, Sports, Puzzle).
Publisher: The company that produced and
distributed the game (e.g., Nintendo, EA).
NA_Sales: Total sales of the game in North
America, in millions of units.
EU_Sales: Total sales of the game in Europe, in
millions of units.
JP_Sales: Total sales of the game in Japan, in
millions of units.
Other_Sales: Total sales of the game in other
regions (outside North America, Europe, and Japan), in millions of
units.
Global_Sales: Total sales of the game worldwide,
in millions of units.
This dataset is useful for understanding which factors, such as
platform, genre, and region, contribute most to a video game’s global
sales, helping to analyze trends in the gaming industry over time.
Summary Statistics of the data:
summary(data)
##       Rank            Name             Platform             Year
##  Min.   :    1   Length:16598       Length:16598       Length:16598
##  1st Qu.: 4151   Class :character   Class :character   Class :character
##  Median : 8300   Mode  :character   Mode  :character   Mode  :character
##  Mean   : 8301
##  3rd Qu.:12450
##  Max.   :16600
##     Genre            Publisher            NA_Sales          EU_Sales
##  Length:16598       Length:16598       Min.   : 0.0000   Min.   : 0.0000
##  Class :character   Class :character   1st Qu.: 0.0000   1st Qu.: 0.0000
##  Mode  :character   Mode  :character   Median : 0.0800   Median : 0.0200
##                                        Mean   : 0.2647   Mean   : 0.1467
##                                        3rd Qu.: 0.2400   3rd Qu.: 0.1100
##                                        Max.   :41.4900   Max.   :29.0200
##     JP_Sales         Other_Sales        Global_Sales
##  Min.   : 0.00000   Min.   : 0.00000   Min.   : 0.0100
##  1st Qu.: 0.00000   1st Qu.: 0.00000   1st Qu.: 0.0600
##  Median : 0.00000   Median : 0.01000   Median : 0.1700
##  Mean   : 0.07778   Mean   : 0.04806   Mean   : 0.5374
##  3rd Qu.: 0.04000   3rd Qu.: 0.04000   3rd Qu.: 0.4700
##  Max.   :10.22000   Max.   :10.57000   Max.   :82.7400
A simple breakdown of the summary for each column:
Rank:
- The dataset ranks games by global sales, with rank 1 representing the
highest-selling game. Ranks run from 1 to 16,600 although there are only
16,598 rows, so a couple of rank values appear to be absent.
Name, Platform, Genre, Publisher:
- These are text columns that show the name of the game, the platform
it was released on (e.g., Wii, PlayStation), the genre (e.g., Action,
Sports), and the publisher (e.g., Nintendo, EA). No statistical
summaries apply here as they are non-numeric.
Year:
Year is read in as text (chr in the str() output above), so the
figures below require converting it to a number first.
Games in the dataset were released between 1980 and
2020.
The median release year is 2007, with a mean around 2006. Most
games were released between 2003 and 2010, indicating that this was a
high-activity period in the gaming industry.
NA_Sales, EU_Sales, JP_Sales, Other_Sales,
Global_Sales:
These represent sales in millions of units for different regions
(North America, Europe, Japan, other regions) and globally.
The average global sales per game is 0.54 million units.
Some games had very high sales, with the maximum global sales
reaching 82.74 million units, while many games had minimal or no sales
in certain regions, as indicated by the minimum values of 0 in several
columns.
deviation_total_sales:
This derived column (it is not part of the raw file shown by str())
measures how far each game’s global sales deviate from the average
global sales.
The maximum deviation is 82.2 million units (82.74 minus the mean
of 0.54), indicating a large gap between the best-selling game and the
average title.
deviation_year:
This derived column measures how far each game’s release year
deviates from the average release year (around 2006).
The largest deviation is around 26 years, which corresponds to the
earliest releases in 1980; the latest release year in the dataset, 2020,
lies about 14 years above the average.
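The two deviation columns are not in the raw file, so they have to be computed. A minimal sketch of how they could be derived, assuming each is the signed distance from the column mean (this definition is inferred from the descriptions above, not confirmed). The data frame `toy` and its values are hypothetical.

```r
# Toy stand-in for the real data (values are made up for illustration).
toy <- data.frame(
  Global_Sales = c(82.74, 0.50, 0.10, 0.02),
  Year         = c(2006, 1985, 2010, 2015)
)

# Assumed definition: signed distance from the column mean.
toy$deviation_total_sales <- toy$Global_Sales - mean(toy$Global_Sales, na.rm = TRUE)
toy$deviation_year        <- toy$Year - mean(toy$Year, na.rm = TRUE)

# Largest absolute deviation in each column.
max(abs(toy$deviation_total_sales), na.rm = TRUE)
max(abs(toy$deviation_year), na.rm = TRUE)
```

On the real data, Year would first need to be numeric for the second column to be computable.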
Key Insights:
The majority of games in this dataset were released between 2003
and 2010, suggesting that this period was highly active in the gaming
industry.
The sales figures are highly right-skewed: median global sales are
0.17 million units against a mean of 0.54 million, so a few games
achieve massive global sales while most remain near the lower end of
the scale.
There is a significant deviation in both sales and release years,
pointing to outliers in both categories (e.g., blockbuster games and
games released far from the average release period).
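The skew noted above can be quantified with a couple of simple numbers. The following is only a sketch on a made-up sales vector: it compares the mean with the median and computes the share of total sales contributed by titles above the 90th percentile. The variable names are illustrative.

```r
# Toy right-skewed sales vector (made-up values, in millions of units).
sales <- c(40, 5, 1, 0.5, 0.3, 0.2, 0.1, 0.1, 0.05, 0.05)

mean_sales   <- mean(sales)
median_sales <- median(sales)

# Share of total sales contributed by titles above the 90th percentile.
cutoff    <- quantile(sales, 0.9)
top_share <- sum(sales[sales > cutoff]) / sum(sales)

c(mean = mean_sales, median = median_sales, top_share = top_share)
```

A mean well above the median, together with a large top share, is the pattern one would expect from a few blockbuster titles; running the same lines on data$Global_Sales would show how strong it is in the real dataset.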
Main Question
The main question for my project is:
“What factors contribute most to the global success of video
games, and how do regional sales, platform, genre, and release timing
impact overall sales performance?”
Purpose:
The goal is to analyze the dataset to identify the key drivers behind
high-selling video games, understand trends in sales across different
regions, and explore how platform, genre, and release year influence a
game’s success. The project aims to uncover insights that can help
predict future game sales based on these factors.
Visualizations for at least two interesting aspects of the data
worth further investigation
A bar chart showing the average global sales for each
platform.
This will reveal which platforms tend to produce the highest-selling
games. It will be useful to explore if newer platforms (e.g., PS4, Xbox
One) dominate sales or if older platforms like NES or GameBoy still
contribute significantly.
library(ggplot2)
ggplot(data, aes(x = Platform, y = Global_Sales)) +
  stat_summary(fun = mean, geom = "bar", fill = "red") +
  labs(title = "Average Global Sales by Platform",
       x = "Platform",
       y = "Average Global Sales (in millions)") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
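An average can be misleading when a platform has only a handful of titles, so it may help to look at the number of games behind each bar. A minimal base-R sketch of such a table, using a hypothetical data frame `toy` with made-up rows in place of `data`:

```r
# Toy stand-in for the real data (rows are illustrative only).
toy <- data.frame(
  Platform     = c("Wii", "Wii", "NES", "PS4", "PS4", "PS4"),
  Global_Sales = c(82.74, 35.82, 40.24, 1.00, 2.00, 3.00),
  stringsAsFactors = FALSE
)

# Mean sales and number of titles per platform.
mean_sales <- tapply(toy$Global_Sales, toy$Platform, mean)
n_titles   <- tapply(toy$Global_Sales, toy$Platform, length)

by_platform <- data.frame(
  Platform   = names(mean_sales),
  mean_sales = as.vector(mean_sales),
  n_titles   = as.vector(n_titles)
)

# Sort so the platforms with the highest average come first.
by_platform[order(-by_platform$mean_sales), ]
```

Platforms with a high mean but very few titles are the ones whose bars deserve the most caution.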

A boxplot displaying the distribution of global sales across
different game genres
Certain genres, such as Action and Sports, may perform better in
terms of sales, while others like Puzzle or Role-Playing might show
different trends. This can help determine if certain types of games are
consistently more successful worldwide.
ggplot(data, aes(x = Genre, y = Global_Sales)) +
  geom_boxplot(fill = "green") +
  labs(title = "Global Sales Distribution by Genre",
       x = "Genre",
       y = "Global Sales (in millions)") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
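Because global sales are heavily right-skewed, boxes drawn on a linear axis tend to be squashed near zero. A sketch of the per-genre summaries that a boxplot encodes, including the median on a log10 scale, is shown here on a hypothetical data frame `toy` with made-up values. The log transform is safe for this column because the summary reports a minimum of 0.01, not 0.

```r
# Toy stand-in for the real data (values are made up for illustration).
toy <- data.frame(
  Genre        = c("Action", "Action", "Action", "Puzzle", "Puzzle", "Puzzle"),
  Global_Sales = c(0.10, 1.00, 10.00, 0.01, 0.10, 1.00),
  stringsAsFactors = FALSE
)

# Median and interquartile range of global sales per genre.
genre_median <- tapply(toy$Global_Sales, toy$Genre, median)
genre_iqr    <- tapply(toy$Global_Sales, toy$Genre, IQR)

# Medians on a log10 scale, which is what a log-scaled boxplot would display.
genre_log_median <- tapply(log10(toy$Global_Sales), toy$Genre, median)

genre_median
genre_iqr
genre_log_median
```

In the plot itself, adding scale_y_log10() to the ggplot call is one way to get the same effect visually.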

Together, the two visualizations above give a first look at how global
sales vary by platform and by genre.
Plan Moving Forward:
In the upcoming milestones I plan to do the following to achieve the
goal of this project:
Hypothesis Testing - Test hypotheses around factors
like platform, genre, and release year affecting global sales.
Regional Sales Analysis - Perform a deeper analysis
to compare sales across different regions.
Time Series Analysis - Examine historical sales
trends and the evolution of sales across various platforms and genres,
watching for shifts in release strategies and how they
affect sales.
Outlier Detection - Identify games that are
significant sales outliers and investigate the factors, such as timing,
marketing, or platform dominance, that contributed to their success.
Presentation and Visualization - Create
visualizations that summarize the key findings for each region,
platform, and genre, and use these insights to explain trends in the
gaming industry.
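Two of the planned steps, the regional comparison and the outlier detection, can be sketched in a few lines of base R. This is only an illustration under stated assumptions: `toy` is a hypothetical data frame with made-up values, Global_Sales is computed here as the sum of the regions (the real file already provides it), and outliers are flagged with the common Q3 + 1.5 × IQR rule, which is one possible choice among several.

```r
# Toy stand-in for the real data (values are made up for illustration).
toy <- data.frame(
  Name        = c("A", "B", "C", "D", "E", "F"),
  NA_Sales    = c(40.0, 0.5, 0.2, 0.1, 0.1, 0.1),
  EU_Sales    = c(30.0, 0.3, 0.1, 0.1, 0.0, 0.1),
  JP_Sales    = c(4.0, 0.1, 0.1, 0.0, 0.1, 0.0),
  Other_Sales = c(8.0, 0.1, 0.0, 0.0, 0.0, 0.0),
  stringsAsFactors = FALSE
)
region_cols <- c("NA_Sales", "EU_Sales", "JP_Sales", "Other_Sales")
toy$Global_Sales <- rowSums(toy[, region_cols])

# Regional analysis: each region's share of all units sold.
region_share <- colSums(toy[, region_cols]) / sum(toy$Global_Sales)

# Outlier detection: flag titles above Q3 + 1.5 * IQR.
q3        <- quantile(toy$Global_Sales, 0.75)
threshold <- q3 + 1.5 * IQR(toy$Global_Sales)
outliers  <- toy$Name[toy$Global_Sales > threshold]

round(region_share, 3)
outliers
```

With sales as skewed as in this dataset, the IQR rule will likely flag a large number of titles, so a stricter cutoff or a log scale may be worth considering.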
Initial Findings
Hypothesis 1
Games released on multiple platforms have higher global sales
compared to games released on only one platform.
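# A title counts as "Multiple" if it appears on more than one platform in the data
n_platforms <- tapply(data$Platform, data$Name, function(p) length(unique(p)))
data$Platform_Type <- ifelse(as.vector(n_platforms[data$Name]) > 1, "Multiple", "Single")
ggplot(data, aes(x = Platform_Type, y = Global_Sales)) +
  geom_boxplot(fill = "yellow") +
  labs(title = "Global Sales Distribution: Single vs. Multiple Platforms",
       x = "Platform Type",
       y = "Global Sales (in millions)")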

The boxplot shows the distribution of global sales for games released
on a single platform versus multiple platforms. This allows us to see the
range and variation in sales, not just the averages.
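A boxplot gives a visual impression, but Hypothesis 1 can also be checked with a formal test. The following is a sketch, not the project's final method: it assumes each row is one title on one platform, collapses the rows to one per title, and applies a one-sided Wilcoxon rank-sum test, chosen here because sales are strongly skewed. The data frame `toy` and its values are hypothetical.

```r
# Toy stand-in: one row per (title, platform); values are made up.
toy <- data.frame(
  Name         = c("A", "A", "B", "B", "C", "C", "D", "E", "F", "G"),
  Platform     = c("PS4", "X360", "PS4", "PC", "PS3", "X360", "Wii", "DS", "PS2", "GB"),
  Global_Sales = c(3.0, 2.5, 1.5, 0.4, 2.0, 1.8, 0.9, 0.3, 0.6, 0.2),
  stringsAsFactors = FALSE
)

# Collapse to one row per title: total sales and number of distinct platforms.
total_sales <- tapply(toy$Global_Sales, toy$Name, sum)
n_platforms <- tapply(toy$Platform, toy$Name, function(p) length(unique(p)))
titles <- data.frame(
  Name          = names(total_sales),
  Global_Sales  = as.vector(total_sales),
  Platform_Type = ifelse(as.vector(n_platforms) > 1, "Multiple", "Single"),
  stringsAsFactors = FALSE
)

# One-sided Wilcoxon rank-sum test: do "Multiple" titles tend to sell more?
test <- wilcox.test(
  titles$Global_Sales[titles$Platform_Type == "Multiple"],
  titles$Global_Sales[titles$Platform_Type == "Single"],
  alternative = "greater"
)
test$p.value
```

Summing sales across platforms compares whole titles, not individual platform releases; whether that is the right unit depends on how the hypothesis is meant to be read.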
Hypothesis 2
Games released in recent years generate higher global sales due
to advancements in technology and marketing, compared to games released
earlier.
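# Year is read in as text; convert it so the x-axis is continuous and lm can fit a line
data$Year <- suppressWarnings(as.numeric(data$Year))
ggplot(subset(data, !is.na(Year)), aes(x = Year, y = Global_Sales)) +
  geom_point(alpha = 0.5, color = "darkgreen") +
  geom_smooth(method = "lm", se = FALSE, color = "blue") +
  labs(title = "Global Sales Over the Years",
       x = "Year",
       y = "Global Sales (in millions)")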
## `geom_smooth()` using formula = 'y ~ x'

The scatter plot shows individual game sales over the years, with a
linear trend line to indicate the overall pattern across time. It
highlights not only the trend but also the spread of individual game
sales.
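The direction of the trend line can be checked numerically. A minimal sketch, assuming Year must first be converted from text and that missing years are stored as "N/A" (an assumption about the raw file): it fits the same linear model that the trend line uses and adds a rank-based correlation, which is less sensitive to a few blockbuster titles. The data frame `toy` holds made-up values, so its result says nothing about the real dataset.

```r
# Toy stand-in for the real data (values are made up for illustration).
toy <- data.frame(
  Year         = c("1985", "1996", "2006", "2008", "2010", "N/A", "2015"),
  Global_Sales = c(40.2, 31.4, 82.7, 35.8, 0.5, 0.3, 0.2),
  stringsAsFactors = FALSE
)

# Convert Year to numeric; the assumed "N/A" marker becomes NA.
toy$Year[toy$Year == "N/A"] <- NA
toy$Year <- as.numeric(toy$Year)
complete <- toy[!is.na(toy$Year), ]

# Linear trend of per-title sales on release year.
fit   <- lm(Global_Sales ~ Year, data = complete)
slope <- coef(fit)[["Year"]]

# Rank-based (Spearman) correlation between year and sales.
rho <- cor(complete$Year, complete$Global_Sales, method = "spearman")

c(slope = slope, spearman_rho = rho)
```

A positive slope and correlation would support Hypothesis 2, and negative values would contradict it; summary(fit) reports whether the slope differs significantly from zero.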
Insights Gathered
Multi-platform games appear to have higher global sales, plausibly
because they reach a broader audience, but some single-platform games
still perform well, possibly owing to exclusivity.
Recent game releases (after 2010) appear to show an upward trend in
global sales, which would be consistent with industry growth driven by
technology and marketing advancements; the slope of the fitted line
still needs to be tested formally.