Overview of the Dataset

Dataset used: vgsales.csv

This project explores the video game sales dataset, looking for patterns across genre, platform, release year, and regional sales.

Main variables used:

  • Genre: game category
  • Platform: platform the game was released on
  • Year: year it was released
  • NA_Sales: North America sales
  • EU_Sales: Europe sales
  • JP_Sales: Japan sales
  • Global_Sales: total sales

R Code for Preparing the Data

The code below loads the required libraries, reads in the dataset, and cleans the Year column.

#load required libraries
library(dplyr)
library(ggplot2)
library(plotly)

#load the data
vg_df = read.csv("vgsales.csv")

#remove rows with missing year values
vg_df = vg_df %>%
  filter(!is.na(Year))

#convert year to numeric
vg_df$Year = as.numeric(vg_df$Year)
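One thing worth checking at this point is how Year was read in. The sketch below assumes, without confirmation from the file itself, that missing years are stored as text such as "N/A" rather than as true NA values. In that case is.na() would not catch them until after the conversion to numeric. The data frame toy_df is made up for illustration and only stands in for vg_df.

```r
# toy_df is a hypothetical stand-in for vg_df; the "N/A" entry is an assumption
toy_df = data.frame(
  Name = c("Game A", "Game B", "Game C"),
  Year = c("2006", "N/A", "2010"),
  stringsAsFactors = FALSE
)

# is.na() does not treat the text "N/A" as missing
n_missing_before = sum(is.na(toy_df$Year))

# converting turns non-numeric text into NA (R warns about the coercion)
toy_df$Year = suppressWarnings(as.numeric(toy_df$Year))
n_missing_after = sum(is.na(toy_df$Year))

# filtering after the conversion drops the row with the unparseable year
toy_clean = toy_df[!is.na(toy_df$Year), ]
```

If n_missing_after is larger than n_missing_before on the real data, running the filter after the conversion, or passing na.strings = c("NA", "N/A") to read.csv, would remove those rows. Dropping rows changes the row count, so any statistics computed afterwards would need to be rerun.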

ggplot Bar Chart: Games by Genre

Action and Sports contain the most games of any genre in the dataset.
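A chart like this could be built roughly as follows. This is a minimal sketch rather than the exact code behind the slide: genre_df is made-up example data, and the title and axis labels are assumptions.

```r
library(ggplot2)

# genre_df is made-up example data, not rows from vgsales.csv
genre_df = data.frame(
  Genre = c("Action", "Action", "Action", "Sports", "Sports", "Puzzle")
)

# number of games per genre, sorted from most to fewest
genre_counts = sort(table(genre_df$Genre), decreasing = TRUE)

# geom_bar() counts the rows in each Genre by default
genre_plot = ggplot(genre_df, aes(x = Genre)) +
  geom_bar() +
  labs(title = "Games by Genre", x = "Genre", y = "Number of games")
```

Printing genre_counts gives the same information as the bar heights, which makes it easy to confirm which genres come first.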

ggplot Boxplot: Global Sales by Genre

Global sales are spread much more widely in some genres than in others.
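The spread can also be put into numbers. The sketch below uses made-up values in sales_df (not the real data) and computes the interquartile range per genre, which corresponds to the height of each box, next to a basic ggplot boxplot. The labels are assumptions.

```r
library(ggplot2)

# sales_df is made-up example data (values are illustrative only)
sales_df = data.frame(
  Genre = c("Action", "Action", "Action", "Action",
            "Puzzle", "Puzzle", "Puzzle", "Puzzle"),
  Global_Sales = c(0.1, 0.5, 2.0, 8.0, 0.1, 0.2, 0.3, 0.4)
)

# interquartile range per genre: a numeric counterpart to the box heights
spread_by_genre = tapply(sales_df$Global_Sales, sales_df$Genre, IQR)

# one box per genre
box_plot = ggplot(sales_df, aes(x = Genre, y = Global_Sales)) +
  geom_boxplot() +
  labs(title = "Global Sales by Genre", x = "Genre", y = "Global sales")
```

In this toy example the Action box is much taller than the Puzzle box, and spread_by_genre shows the same difference as a number.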

Plotly Scatter: NA Sales vs Global Sales

Games with high NA sales usually also have high global sales. This is expected, since NA sales are one part of the global total.
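The strength of this relationship could be checked with a correlation. The sketch below is not from the original analysis: na_global_df holds made-up values, and the share calculation is an added illustration of how much of each game's total comes from North America.

```r
# na_global_df is made-up example data standing in for vg_df
na_global_df = data.frame(
  NA_Sales     = c(0.1, 0.5, 1.0, 4.0, 10.0),
  Global_Sales = c(0.3, 0.9, 2.5, 7.0, 21.0)
)

# Pearson correlation: values near 1 mean a strong positive linear relationship
na_global_cor = cor(na_global_df$NA_Sales, na_global_df$Global_Sales)

# share of each game's global sales that comes from North America
na_share = na_global_df$NA_Sales / na_global_df$Global_Sales
```

On the real data, the same two lines with vg_df in place of na_global_df would give the actual correlation and shares.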

3D Plotly: NA Sales, EU Sales, and JP Sales

3D Plotly Data Explained

Main Findings:

  • NA Sales: Most games cluster at the low end of the sales range, meaning the majority sell modestly in North America.

  • EU Sales: EU sales are low for most games, but a few titles sell far more than the rest.

  • JP Sales: For most games, JP sales are lower than both NA and EU sales.

The 3D plot shows how games are distributed across the three regional sales figures. Most games cluster near the bottom of the figure, and only a few achieve high sales in one or more regions.
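A 3D scatter of this kind could be drawn with plot_ly as sketched below. The data in region_df is made up, and the cutoff low_cutoff is a hypothetical threshold chosen only to show how the "most games sit near the bottom" observation might be counted.

```r
library(plotly)

# region_df is made-up example data standing in for vg_df
region_df = data.frame(
  NA_Sales = c(0.05, 0.10, 0.20, 5.00),
  EU_Sales = c(0.02, 0.05, 0.10, 3.00),
  JP_Sales = c(0.00, 0.01, 0.02, 1.50)
)

# one marker per game, with one regional sales figure on each axis
region_plot = plot_ly(
  region_df,
  x = ~NA_Sales, y = ~EU_Sales, z = ~JP_Sales,
  type = "scatter3d", mode = "markers"
)

# share of games below a hypothetical low-sales cutoff in every region
low_cutoff = 0.5
share_low = mean(region_df$NA_Sales < low_cutoff &
                 region_df$EU_Sales < low_cutoff &
                 region_df$JP_Sales < low_cutoff)
```

In the toy data three of the four games fall below the cutoff in all three regions, which mirrors the cluster near the origin described above.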

Findings

The bar chart shows the number of games in each genre. Action and Sports appear most often, so the dataset contains more games in those two genres than in any other.

The boxplot shows that some genres have a wider spread of global sales than others, and that in some genres a few games sell far more than the rest.

The scatter plot shows a positive relationship between a game’s NA sales and its global sales. The 3D plot shows that most games have sales near the minimum in all three regions, while only a few have much higher sales in one or more regions.

Statistical Analysis: Summary Statistics

#summary statistics for sales by region
sales_summary = data.frame(
  Variable = c("NA_Sales", "EU_Sales", "JP_Sales"),
  Mean = c(mean(vg_df$NA_Sales), mean(vg_df$EU_Sales), mean(vg_df$JP_Sales)),
  Median = c(median(vg_df$NA_Sales), median(vg_df$EU_Sales), 
             median(vg_df$JP_Sales)),
  SD = c(sd(vg_df$NA_Sales), sd(vg_df$EU_Sales), sd(vg_df$JP_Sales))
)

sales_summary
##   Variable       Mean Median        SD
## 1 NA_Sales 0.26466743   0.08 0.8166830
## 2 EU_Sales 0.14665201   0.02 0.5053512
## 3 JP_Sales 0.07778166   0.00 0.3092906

Statistical Analysis: Interpretation

  • NA Sales: NA sales have the highest mean at 0.265. This shows that, on average, games sell more in North America than in the other two regions.

  • EU Sales: EU sales have a mean of 0.147, which places Europe between North America and Japan.

  • JP Sales: JP sales have the lowest mean at 0.078 and a median of 0.00. A median of zero means at least half of the games have recorded JP sales of 0.00.

  • Spread of Sales: NA sales also have the largest standard deviation at 0.817. This means sales in North America vary more from game to game than they do in Europe or Japan.

  • Overall Result: Based on the summary statistics, North America has the strongest sales overall, while Japan has the weakest, with many games recording little or no sales there.
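In every region the mean is well above the median, which suggests the sales are right-skewed: a few big sellers pull the average up. The sketch below shows one way this could be checked. It is an illustration, not part of the original analysis, and jp_sales is a made-up vector standing in for vg_df$JP_Sales.

```r
# jp_sales is a made-up vector standing in for vg_df$JP_Sales
jp_sales = c(0, 0, 0, 0, 0.01, 0.02, 0.05, 1.20)

# a mean above the median points to a right-skewed distribution
mean_above_median = mean(jp_sales) > median(jp_sales)

# proportion of games with recorded sales of exactly zero
share_zero = mean(jp_sales == 0)

# upper quantiles show how far the top sellers sit from the typical game
upper_quantiles = quantile(jp_sales, probs = c(0.5, 0.9, 0.99))
```

Running the same checks on NA_Sales and EU_Sales would show whether the skew is similar in all three regions.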

Source and Tools