Capstone Report: Video Game Dataset

Introduction

The capstone project aimed to analyze video game sales data to extract insights into trends, regional preferences, and market penetration of genres/platforms. The project utilized a comprehensive dataset encompassing sales figures, genres, platforms, and other pertinent details.

Methodology

The analysis began with data cleaning and processing in RStudio, focusing on columns relevant to the study—game names, sales, genres, platforms, and regions. This involved addressing missing values, standardizing formats, and preparing the data for analysis.

Data Overview

The dataset includes game names, release years, genres, platforms, publishers, ratings, and sales figures by region (North America, Europe, Japan, and Other) along with a global total. These columns formed the basis for the subsequent analyses.

Analysis Findings

1. Data Cleaning and Processing

The cleaning process involved handling missing values, ensuring data consistency, and structuring the dataset for analysis. Transformations were applied to ensure uniformity in data formats.
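
As a minimal sketch of the kind of missing-value handling described above, the base R example below reads a small invented CSV (the rows and the temporary file are illustrative assumptions, not the project data), treats empty strings and the literal "N/A" as missing, and counts missing values per column before filtering.

```r
# Toy CSV standing in for the real file. Column names mirror the dataset;
# the rows are invented for illustration only.
csv_text <- c(
  "Name,Year_of_Release,Genre,NA_Sales",
  "Game A,2006,Sports,41.36",
  "Game B,N/A,Action,0.50",
  "Game C,2010,,1.20"
)
tmp <- tempfile(fileext = ".csv")
writeLines(csv_text, tmp)

# Treat empty strings and the literal "N/A" as missing while reading.
games <- read.csv(tmp, na.strings = c("", "N/A"), stringsAsFactors = FALSE)

# Count missing values per column to see what needs attention.
missing_per_column <- colSums(is.na(games))
print(missing_per_column)

# Keep only rows that have both a release year and a genre.
games_complete <- games[complete.cases(games[, c("Year_of_Release", "Genre")]), ]
print(games_complete)
```

Whether dropping incomplete rows is appropriate depends on the question being asked; the analysis code in this report instead keeps all rows and passes na.rm = TRUE when summing.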

##Loading packages and data file##
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)
library(janitor)
## 
## Attaching package: 'janitor'
## 
## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test
library(here)
## here() starts at C:/Users/Ross/Documents
library(skimr)
library(rmarkdown)
video_games <- read.csv("C:\\Users\\Ross\\Downloads\\archive (6)\\Video_Games.csv")
##Viewing Data File To Overview Data##
view(video_games)

##Beginning Process Of Cleaning And Sorting Data From Dataset##
#Main Set Of Columns To Work With#
game_main_data <- video_games %>%
  select("Name", "Year_of_Release", "Genre", "Rating", "Platform", "Publisher")
view(game_main_data)

#All Sales Columns# 
#Note: Sales Figures Are Assumed To Be In Millions Of Units (e.g. 0.5 = 500,000 Copies)#
all_sales <- video_games %>%
  select("Global_Sales", "NA_Sales", "EU_Sales", "JP_Sales", "Other_Sales")
view(all_sales)
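
The note on sales units can be made concrete with a small conversion. This is a sketch under the assumption that the sales columns are expressed in millions of copies; the three values are made up for illustration.

```r
# Assumption: sales columns are in millions of copies sold.
sales_millions <- c(41.36, 0.5, 0.01)

# Convert to individual units for readability.
sales_units <- sales_millions * 1e6

# Format with thousands separators instead of scientific notation.
print(format(sales_units, big.mark = ",", scientific = FALSE))
```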

2. Sales Analysis

Sales data analysis revealed top-selling games, trends in sales over different years, and insights into global sales patterns.

##Beginning Analysis Process##
# Check the structure of sales columns
str(video_games[c('NA_Sales', 'EU_Sales', 'JP_Sales', 'Other_Sales', 'Global_Sales')])
## 'data.frame':    16928 obs. of  5 variables:
##  $ NA_Sales    : num  41.4 29.1 15.7 15.6 11.3 ...
##  $ EU_Sales    : num  28.96 3.58 12.76 10.93 8.89 ...
##  $ JP_Sales    : num  3.77 6.81 3.79 3.28 10.22 ...
##  $ Other_Sales : num  8.45 0.77 3.29 2.95 1 0.58 2.88 2.84 2.24 0.47 ...
##  $ Global_Sales: num  82.5 40.2 35.5 32.8 31.4 ...
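#Totaling all sales
#Note: Global_Sales appears to already be the sum of the four regional columns, so including it here roughly doubles each total (rankings are largely unaffected)
video_games$total_sales <- rowSums(video_games[, c('NA_Sales', 'EU_Sales', 'JP_Sales', 'Other_Sales', 'Global_Sales')], na.rm = TRUE)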

#Checking the newly added total_sales column
head(video_games$total_sales)
## [1] 165.07  80.48  71.04  65.54  62.75  60.52
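
The first total printed above (165.07) is about twice the first Global_Sales value (82.5), which suggests that Global_Sales already contains the regional figures. The sketch below shows one way to check this before summing; the rows are invented for illustration and the 0.02 tolerance is an assumed allowance for rounding.

```r
# Illustrative rows (not taken from the file) in which Global_Sales
# equals the sum of the four regional columns.
sales <- data.frame(
  NA_Sales     = c(10.0, 2.5),
  EU_Sales     = c(5.0, 1.0),
  JP_Sales     = c(1.0, 0.5),
  Other_Sales  = c(2.0, 0.25),
  Global_Sales = c(18.0, 4.25)
)
regional_cols <- c("NA_Sales", "EU_Sales", "JP_Sales", "Other_Sales")

# Sum of the regional columns only.
regional_total <- rowSums(sales[, regional_cols], na.rm = TRUE)

# Does Global_Sales already match the regional sum (within rounding)?
matches_global <- abs(regional_total - sales$Global_Sales) < 0.02

# Summing all five columns counts each sale twice when the check passes.
five_column_total <- rowSums(sales, na.rm = TRUE)

print(data.frame(regional_total, matches_global, five_column_total))
```

If the check passes on the real data, either Global_Sales or the regional sum alone could serve as the per-game total.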
#Sorting games by total sales, top sellers first
top_selling_games <- video_games[order(video_games$total_sales, decreasing = TRUE), ]

#Total sales per publisher
publisher_sales <- video_games %>%
  group_by(Publisher) %>%
  summarize(Total_Sales = sum(total_sales, na.rm = TRUE)) %>%
  arrange(desc(Total_Sales))

#Count of games by platform
games_per_platform <- table(video_games$Platform)

#Total sales per platform
platform_sales <- video_games %>%
  group_by(Platform) %>%
  summarize(Total_Sales = sum(total_sales, na.rm = TRUE)) %>%
  arrange(desc(Total_Sales))
#Bar chart of the top 10 publishers by total sales
top_publishers <- head(publisher_sales, 10)

ggplot(top_publishers, aes(x = reorder(Publisher, Total_Sales), y = Total_Sales)) +
  geom_bar(stat = "identity", fill = "skyblue") +
  labs(title = "Top Publishers by Total Sales", x = "Publisher", y = "Total Sales") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

#Number of games released per year
games_per_year <- table(video_games$Year_of_Release)
games_per_year_df <- as.data.frame(games_per_year)
colnames(games_per_year_df) <- c("Year", "GamesReleased")

#Total sales per year
sales_per_year <- video_games %>%
  group_by(Year_of_Release) %>%
  summarize(Total_Sales = sum(total_sales, na.rm = TRUE)) %>%
  arrange(Year_of_Release)

#Identify missing years or discrepancies
unique_years <- union(games_per_year_df$Year, sales_per_year$Year_of_Release)
missing_years <- setdiff(unique_years, intersect(games_per_year_df$Year, sales_per_year$Year_of_Release))
missing_years
## [1] NA
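
One likely explanation for the single NA above, assuming some rows have no release year: table() drops missing values by default, whereas dplyr's group_by() keeps NA as its own group, so NA appears in the sales summary but not in the counts. The base R sketch below illustrates the table() side of this with invented years.

```r
# Invented release years, one of them missing.
years <- c(1980, 1980, 1981, NA)

# Default behaviour: the NA is silently dropped from the counts.
counts_default <- table(years)

# useNA = "ifany" keeps a separate NA category when missing values exist.
counts_with_na <- table(years, useNA = "ifany")

print(counts_default)
print(counts_with_na)
```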
#Checking the contents of games_per_year_df and sales_per_year
head(games_per_year_df)
##   Year GamesReleased
## 1 1980             9
## 2 1981            46
## 3 1982            39
## 4 1983            17
## 5 1984            14
## 6 1985            14
head(sales_per_year)
## # A tibble: 6 × 2
##   Year_of_Release Total_Sales
##             <dbl>       <dbl>
## 1            1980        22.8
## 2            1981        71.4
## 3            1982        61.5
## 4            1983        33.6
## 5            1984       101. 
## 6            1985       108.
#Merge the datasets for games released and total sales
yearly_trends <- merge(games_per_year_df, sales_per_year, by.x = "Year", by.y = "Year_of_Release", all = TRUE)
#Plotting the yearly trends

ggplot(yearly_trends, aes(x = Year, group = 1)) +
  geom_line(aes(y = GamesReleased, color = "Games Released"), linewidth = 1, linetype = "solid") +
  geom_line(aes(y = Total_Sales, color = "Total Sales"), linewidth = 1, linetype = "solid") +
  labs(title = "Yearly Trends: Games Released vs. Total Sales", x = "Year", y = "Count/Sales") +
  scale_color_manual(values = c("blue", "red")) +
  theme_minimal() +
  theme(axis.text.x = element_blank(), axis.ticks.x = element_blank())
## Warning: Removed 1 row containing missing values (`geom_line()`).

#Years start at 1980 and end in 2020
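
Because the Year column comes from table(), it is a factor rather than a number, which is probably why the plot needs group = 1 and hides the x-axis labels. The sketch below, using invented years, shows the usual conversion and the pitfall of calling as.numeric() directly on a factor.

```r
# Build a small counts table the same way the report does.
year_counts <- as.data.frame(table(c(1980, 1980, 1981, 1983)))
colnames(year_counts) <- c("Year", "GamesReleased")

# Year is a factor here, not a number.
print(is.factor(year_counts$Year))

# Pitfall: as.numeric() on a factor returns the level codes (1, 2, 3).
codes <- as.numeric(year_counts$Year)

# Converting through character recovers the actual years.
year_counts$YearNum <- as.numeric(as.character(year_counts$Year))

print(codes)
print(year_counts)
```

With a numeric year column, ggplot2 can draw a continuous x-axis with readable tick labels, so blanking the axis text may no longer be needed.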

3. Regional Analysis

The examination of regional sales unveiled distinct preferences in different regions, identifying top-selling genres/platforms in North America, Europe, Japan, and other regions.

#Regional sales analysis#
#Total sales across regions
sales_regions <- video_games %>%
  summarize(
    Total_NA_Sales = sum(NA_Sales, na.rm = TRUE),
    Total_EU_Sales = sum(EU_Sales, na.rm = TRUE),
    Total_JP_Sales = sum(JP_Sales, na.rm = TRUE),
    Total_Other_Sales = sum(Other_Sales, na.rm = TRUE)
  )
# Top-selling genres or platforms in each region
top_genres_by_region <- video_games %>%
  group_by(Genre) %>%
  summarize(
    Total_NA_Sales = sum(NA_Sales, na.rm = TRUE),
    Total_EU_Sales = sum(EU_Sales, na.rm = TRUE),
    Total_JP_Sales = sum(JP_Sales, na.rm = TRUE),
    Total_Other_Sales = sum(Other_Sales, na.rm = TRUE)
  ) %>%
  ungroup() %>%
  arrange(desc(Total_NA_Sales)) 
top_platforms_by_region <- video_games %>%
  group_by(Platform) %>%
  summarize(
    Total_NA_Sales = sum(NA_Sales, na.rm = TRUE),
    Total_EU_Sales = sum(EU_Sales, na.rm = TRUE),
    Total_JP_Sales = sum(JP_Sales, na.rm = TRUE),
    Total_Other_Sales = sum(Other_Sales, na.rm = TRUE)
  ) %>%
  ungroup() %>%
  arrange(desc(Total_NA_Sales)) 
##Starting more in-depth analysis##
#Stacked bar chart for genre sales across regions
genre_sales <- video_games %>%
  group_by(Genre) %>%
  summarize(
    NA_Sales = sum(NA_Sales, na.rm = TRUE),
    EU_Sales = sum(EU_Sales, na.rm = TRUE),
    JP_Sales = sum(JP_Sales, na.rm = TRUE),
    Other_Sales = sum(Other_Sales, na.rm = TRUE)
  ) %>%
  gather(key = "Region", value = "Sales", -Genre)
ggplot(genre_sales, aes(x = Genre, y = Sales, fill = Region)) +
  geom_bar(stat = "identity", position = "stack") +
  labs(title = "Genre Sales Comparison across Regions", x = "Genre", y = "Sales") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

# Check the structure or glimpse of the top genres for North America
str(top_genres_by_region)
## tibble [13 × 5] (S3: tbl_df/tbl/data.frame)
##  $ Genre            : chr [1:13] "Action" "Sports" "Shooter" "Platform" ...
##  $ Total_NA_Sales   : num [1:13] 894 697 608 457 422 ...
##  $ Total_EU_Sales   : num [1:13] 526 381 329 207 221 ...
##  $ Total_JP_Sales   : num [1:13] 163.5 135.8 40.6 133.8 109.4 ...
##  $ Total_Other_Sales: num [1:13] 187 135.8 108.3 52.5 77.5 ...
head(top_genres_by_region)
## # A tibble: 6 × 5
##   Genre    Total_NA_Sales Total_EU_Sales Total_JP_Sales Total_Other_Sales
##   <chr>             <dbl>          <dbl>          <dbl>             <dbl>
## 1 Action             894.           526.          164.              187. 
## 2 Sports             697.           381.          136.              136. 
## 3 Shooter            608.           329.           40.6             108. 
## 4 Platform           457            207.          134.               52.5
## 5 Misc               422.           221.          109.               77.5
## 6 Racing             373.           247.           59.2              78.6
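
The table above is sorted by North American sales only, so taking its first rows does not give the top genres for other regions. The sketch below re-ranks per region using the rounded values printed above for these six genres; rank_by is a hypothetical helper introduced for illustration, and the result covers only the six genres shown, not all 13.

```r
# Values copied from the printed head() output above (rounded as displayed).
top_genres <- data.frame(
  Genre          = c("Action", "Sports", "Shooter", "Platform", "Misc", "Racing"),
  Total_NA_Sales = c(894, 697, 608, 457, 422, 373),
  Total_JP_Sales = c(164, 136, 40.6, 134, 109, 59.2)
)

# Hypothetical helper: genre names ordered by the chosen sales column,
# highest first.
rank_by <- function(df, column) {
  df$Genre[order(df[[column]], decreasing = TRUE)]
}

na_ranking <- rank_by(top_genres, "Total_NA_Sales")
jp_ranking <- rank_by(top_genres, "Total_JP_Sales")

print(na_ranking)
print(jp_ranking)
```

Among these six genres, Shooter ranks third in North America but last in Japan, which is the kind of regional difference a single NA-sorted table can hide.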
#Top genres in North America (top_genres_by_region is already sorted by NA sales)
top_genres_NA <- top_genres_by_region

#Bar chart of the top 10 genres in North America by total sales
ggplot(top_genres_NA[1:10, ], aes(x = Genre, y = Total_NA_Sales, fill = Genre)) +
  geom_bar(stat = "identity") +
  labs(title = "Top Genres Trend in North America", x = "Genre", y = "Total Sales") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

#Rank genres by Japanese sales (top_genres_by_region is sorted by NA sales)
top_genres_JP <- top_genres_by_region %>%
  arrange(desc(Total_JP_Sales))

#Bar chart of the top 10 genres in Japan by total sales
ggplot(top_genres_JP[1:10, ], aes(x = Genre, y = Total_JP_Sales, fill = Genre)) +
  geom_bar(stat = "identity") +
  labs(title = "Top Genres Trend in Japan", x = "Genre", y = "Total Sales") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

#Rank genres by European sales (top_genres_by_region is sorted by NA sales)
top_genres_EU <- top_genres_by_region %>%
  arrange(desc(Total_EU_Sales))

#Bar chart of the top 10 genres in Europe by total sales
ggplot(top_genres_EU[1:10, ], aes(x = Genre, y = Total_EU_Sales, fill = Genre)) +
  geom_bar(stat = "identity") +
  labs(title = "Top Genres Trend in Europe", x = "Genre", y = "Total Sales") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

#Rank genres by sales in 'Other' regions (top_genres_by_region is sorted by NA sales)
top_genres_other <- top_genres_by_region %>%
  arrange(desc(Total_Other_Sales))

#Bar chart of the top 10 genres in 'Other' by total sales
ggplot(top_genres_other[1:10, ], aes(x = Genre, y = Total_Other_Sales, fill = Genre)) +
  geom_bar(stat = "identity") +
  labs(title = "Top Genres Trend in Other", x = "Genre", y = "Total Sales") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

4. Comparative Analysis

Comparative analysis highlighted genre/platform dominance across regions, showcasing market penetration levels and the varying popularity of genres/platforms in different regions.

Strategy Development Insights

The findings provided valuable insights for business strategies, emphasizing the importance of tailoring marketing campaigns and game development strategies to align with regional preferences.

Visualizations

Bar charts, pie charts, and line plots were used to present the findings and make trends and regional preferences easier to see.

##Market Analysis##
#Pie chart for genre market penetration in North America
genre_sales <- video_games %>%
  group_by(Genre) %>%
  summarize(
    NA_Sales = sum(NA_Sales, na.rm = TRUE),
    EU_Sales = sum(EU_Sales, na.rm = TRUE),
    JP_Sales = sum(JP_Sales, na.rm = TRUE),
    Other_Sales = sum(Other_Sales, na.rm = TRUE)
  ) %>%
  gather(key = "Region", value = "Sales", -Genre)
#Filter data for North America
genre_sales_NA <- genre_sales %>% filter(Region == "NA_Sales")

ggplot(genre_sales_NA, aes(x = "", y = Sales, fill = Genre)) +
  geom_bar(stat = "identity", width = 1) +
  coord_polar("y", start = 0) +
  labs(title = "Market Penetration of Genres in North America", fill = "Genre") +
  theme_void() +
  theme(legend.position = "right")

#Filter data for EU
genre_sales_EU <- genre_sales %>% filter(Region == "EU_Sales")

ggplot(genre_sales_EU, aes(x = "", y = Sales, fill = Genre)) +
  geom_bar(stat = "identity", width = 1) +
  coord_polar("y", start = 0) +
  labs(title = "Market Penetration of Genres in EU", fill = "Genre") +
  theme_void() +
  theme(legend.position = "right")

#Filter data for Japan
genre_sales_JP <- genre_sales %>% filter(Region == "JP_Sales")

ggplot(genre_sales_JP, aes(x = "", y = Sales, fill = Genre)) +
  geom_bar(stat = "identity", width = 1) +
  coord_polar("y", start = 0) +
  labs(title = "Market Penetration of Genres in Japan", fill = "Genre") +
  theme_void() +
  theme(legend.position = "right")

#Filter data for 'Other'
genre_sales_Other <- genre_sales %>% filter(Region == "Other_Sales")

ggplot(genre_sales_Other, aes(x = "", y = Sales, fill = Genre)) +
  geom_bar(stat = "identity", width = 1) +
  coord_polar("y", start = 0) +
  labs(title = "Market Penetration of Genres in 'Other'", fill = "Genre") +
  theme_void() +
  theme(legend.position = "right")
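
The pie charts show relative slice sizes but no percentages. As a sketch of how within-region shares and slice labels could be computed, the example below uses three genres and two regions with rounded values taken from the regional table earlier in this report. The shares are relative to the listed genres only, so they are not true market shares, and the Share and Label columns are illustrative additions.

```r
# Long-format toy data: three genres, two regions (rounded values from
# the regional summary printed earlier in the report).
genre_sales_long <- data.frame(
  Genre  = rep(c("Action", "Shooter", "Platform"), times = 2),
  Region = rep(c("NA_Sales", "JP_Sales"), each = 3),
  Sales  = c(894, 608, 457, 164, 40.6, 134)
)

# Share of each genre within its own region: sales divided by the
# regional total, computed per group with ave().
genre_sales_long$Share <- with(
  genre_sales_long,
  Sales / ave(Sales, Region, FUN = sum)
)

# Text labels that could be attached to pie slices or a legend.
genre_sales_long$Label <- sprintf(
  "%s (%.1f%%)", genre_sales_long$Genre, 100 * genre_sales_long$Share
)

print(genre_sales_long)
```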

Conclusion

Overall, the analyses showcased the nuanced landscape of video game preferences across regions. The identification of top-selling genres/platforms and insights into market penetration lay a strong foundation for targeted business strategies and game development efforts.

Future Directions

Future analyses could delve deeper into demographic-specific preferences within regions, conduct sentiment analysis, or incorporate additional datasets for a more comprehensive understanding of the gaming market.

Analysis Notes from project code

Utilizing Findings for Strategy Development

Targeted Marketing Campaigns: Region-specific Advertising: Tailor marketing campaigns based on the most preferred genres or platforms in each region. Emphasize these preferences in advertising to better resonate with the target audience in each region.

Game Development Strategies: Regional Preference Integration: Incorporate the most popular genres or platforms identified in the analysis into the development pipeline. Create games that align with the preferences of specific regions to enhance market acceptance.

Market Entry Strategies: Market Penetration Planning: Identify regions where certain genres or platforms have high potential but are currently underrepresented. Develop entry strategies to introduce or emphasize these genres/platforms in those markets.

Partnership and Distribution: Partnerships with Regional Players: Collaborate with regional distributors or game platforms to promote and distribute games aligned with the preferences of each region.

User Engagement Strategies: Enhanced User Experience: Tailor user experience elements in games based on regional preferences, ensuring better engagement and satisfaction.

Testing and Feedback Loops: Iterative Development: Use market insights as a basis for iterative game development. Continuously test and gather feedback from specific regions to adapt games accordingly.

References

The analyses were conducted in RStudio using the tidyverse, janitor, here, skimr, and rmarkdown packages, with the dataset sourced from Kaggle (https://www.kaggle.com/code/napatprapavasit/videogame-global-sales-prediction).