Video Game Sales Data

The data set I used covers video game sales from 1980 to 2020. It contains more than 16,500 games and includes each game's year, genre, publisher, platform, and sales in different regions of the world. The data set made for some interesting statistics and visualizations, especially since there were so many years of sales to look at.

Findings

These are my findings and graphs for the data I explored. The most obvious things I found were that action games have almost always been the most popular genre and that North America tends to have the highest sales per year.

Top Made Genres of Video Games

This graph shows how many games have been made in each genre. In this visualization, you can see that far more action games have been made than games in any other genre. Action is a very broad genre and tends to have the highest sales too, which is most likely why these games are made so consistently. The genres around the mean tend to be made much less frequently, most likely because of the longevity of each game and the time it takes to make one. Unlike sports and action games, adventure and role-playing games tend to take a long time to produce because of how large they are and the creativity that goes into them. Also, unlike other genres, shooters and racing games tend to have a “replayability” to them, so people are content with years between releases since there is not much to add.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import warnings
import math
import statistics

import os
os.environ['QT_QPA_PLATFORM_PLUGIN_PATH'] = 'C:/Users/Neil/anaconda3/Library/plugins/platforms'

#warnings.filterwarnings("ignore")

path = "C:/Users/Neil/Documents/Python Project/"
filename = path + 'vgsales.csv'

videogames = pd.read_csv(filename)
videogame = videogames.dropna(axis = 0)

# Count the games in each genre and sort from most to least common
x = videogame.groupby(['Genre'])['Genre'].count().reset_index(name = 'Count')
x = x.sort_values('Count', ascending = False)
x = x.reset_index(drop = True)

def pick_colors_mean(this_data):
    colors = []
    avg = this_data.Count.mean()
    for each in this_data.Count:
        if each > avg*1.10:
            colors.append('purple')
        elif each < avg*0.90:
            colors.append('blue')
        else:
            colors.append('black')
    return colors
  
import matplotlib.patches as mpatches

top1 = 11
bottom1 = 0

my_colors1 = pick_colors_mean(x)

Above = mpatches.Patch(color = 'purple', label = 'Above Average')
At = mpatches.Patch(color = 'black', label = 'Within 10% of Average')
Below = mpatches.Patch(color = 'blue', label = 'Below Average')

fig = plt.figure(figsize=(18, 16))

ax1 = fig.add_subplot(2, 1, 1)
ax1.bar(x.Genre, x.Count, label = 'Count', color = my_colors1)
ax1.legend(handles = [Above, At, Below], fontsize = 14)
ax1.axhline(x.Count.mean(), color = 'black', linestyle='dashed')
ax1.set_title('Top Made Genres of Video Games', size = 20)
ax1.text(top1-0.3, x.Count.mean()+50, 'Mean = ' + str(round(x.Count.mean())), rotation = 0, fontsize = 14)

plt.show()
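
The color rule above (more than 10% above the mean, more than 10% below it, or in between) can also be written without a loop. This is only a sketch of the same idea using `np.select`; the function name `band_by_mean`, the `tolerance` parameter, and the counts in `toy` are made up for illustration and are not taken from vgsales.csv.

```python
import numpy as np
import pandas as pd

def band_by_mean(counts, tolerance=0.10):
    """Label each value as 'above', 'within', or 'below' a band around the mean."""
    counts = pd.Series(counts, dtype=float)
    avg = counts.mean()
    conditions = [
        counts > avg * (1 + tolerance),   # more than 10% above the mean
        counts < avg * (1 - tolerance),   # more than 10% below the mean
    ]
    labels = np.select(conditions, ['above', 'below'], default='within')
    return pd.Series(labels, index=counts.index)

# Made-up genre counts, for illustration only
toy = pd.Series({'A': 200, 'B': 100, 'C': 95, 'D': 5})
bands = band_by_mean(toy)
print(bands.to_dict())
```

The labels could then be mapped to 'purple', 'black', and 'blue' to get the same bar colors as in the graph.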

Regional Sales per Year

This graph compares the sales per year in different regions of the world. North America is consistently higher than the other regions, especially from 2000 to around 2013. From about 1980 to 1995, video games were not as popular and were still not widely used in homes. In the 1990s, a lot of consoles and games started to come out, which boosted the market. Consoles such as the original PlayStation, the Nintendo 64, the Sega Dreamcast, and a few others all came out in the 1990s. In the 2000s, there was a huge spike in video game sales. Almost every kid now has some kind of gaming device, and I do not see this trend going down anytime soon. The 2010s look so low on the graph only because the data set does not include many modern games.

# Load only the year and sales columns, then drop rows with missing values
y = pd.read_csv(filename, usecols = ['Year', 'NA_Sales', 'EU_Sales', 'JP_Sales', 'Other_Sales', 'Global_Sales'])
y = y.dropna(axis = 0)

# Total sales per year for each region
y1 = y.groupby(['Year'])['NA_Sales'].sum().reset_index(name = 'Total Sales')

y2 = y.groupby(['Year'])['EU_Sales'].sum().reset_index(name = 'Total Sales')

y3 = y.groupby(['Year'])['JP_Sales'].sum().reset_index(name = 'Total Sales')

y4 = y.groupby(['Year'])['Other_Sales'].sum().reset_index(name = 'Total Sales')

fig = plt.figure(figsize = (18, 10))
ax = fig.add_subplot(1, 1, 1)

y1.plot(ax = ax, kind = 'line', x = 'Year', y = 'Total Sales', color = 'blue', label = 'NA Sales', marker = '8')
y2.plot(ax = ax, kind = 'line', x = 'Year', y = 'Total Sales', color = 'indianred', label = 'EU Sales', marker = '8')
y3.plot(ax = ax, kind = 'line', x = 'Year', y = 'Total Sales', color = 'green', label = 'JP Sales', marker = '8')
y4.plot(ax = ax, kind = 'line', x = 'Year', y = 'Total Sales', color = 'purple', label = 'Other Sales', marker = '8')
plt.title('Region Sales Per Year', fontsize = 20)
ax.set_xlabel('Year', fontsize = 18)
ax.set_ylabel('Total Sales (millions)', fontsize = 18, labelpad = 20)
ax.tick_params(axis='x', labelsize = 14, rotation = 0)
ax.tick_params(axis='y', labelsize = 14, rotation = 0)

plt.show()
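
One way to check whether the low sales in later years come from missing games rather than a real drop would be to count how many titles the data set has for each year. This is a rough sketch, not part of my original analysis; the helper name `titles_per_year` and the rows in `toy` are made up for illustration.

```python
import pandas as pd

def titles_per_year(df, year_col='Year'):
    """Count how many rows (titles) the data set has for each year."""
    return df.dropna(subset=[year_col]).groupby(year_col).size()

# Made-up rows, for illustration only (not from vgsales.csv)
toy = pd.DataFrame({
    'Year': [2008, 2008, 2008, 2016, None],
    'NA_Sales': [1.0, 0.5, 0.2, 0.1, 0.3],
})
counts = titles_per_year(toy)
print(counts)
```

If the counts for recent years are much smaller than for the 2000s, that would support the idea that the drop is mostly about coverage.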

Regional Sales by Genre

This graph shows the sales of each video game genre across the different regions of the world. As we can see again on this graph, North America is still the highest in sales. One interesting thing seen on this graph is that role-playing games seem to be very popular in Japan. That is the only genre where Japan has a very significant number of sales. Europe is very close to North America in a lot of these categories. This graph also explains why there are so many sports and action games every year: they consistently sell the most and make companies a lot of money.


stacked_df = pd.read_csv(filename, usecols = ['Genre', 'NA_Sales', 'EU_Sales', 'JP_Sales', 'Other_Sales'])
stacked_df = stacked_df.dropna(axis = 0)
stacked_df = pd.DataFrame(stacked_df)
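# Select several columns with a list (double brackets); newer pandas versions reject the tuple form
stacked_df = stacked_df.groupby(['Genre'])[['NA_Sales', 'EU_Sales', 'JP_Sales', 'Other_Sales']].sum()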
fig = plt.figure(figsize=(18, 10))
ax = fig.add_subplot(1, 1, 1)

stacked_df.plot(kind = 'bar', stacked = True, ax=ax)

plt.title('Regional Sales by Genre', fontsize = 20)
plt.ylabel('Total Sales (Millions)', fontsize = 18, labelpad = 10)
ax.set_xlabel('Genre', fontsize = 18)
plt.xticks(rotation = 0, horizontalalignment = 'center', fontsize = 14)
plt.yticks(fontsize = 14)
plt.show()
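
Because North America's totals are so much larger, it can be hard to compare regional tastes on raw sales. A possible follow-up is to look at each genre's share of a region's own sales, which would make something like Japan's preference for role-playing games easier to see. This is just a sketch: the function name `genre_share_by_region` and the numbers in `toy` are made up for illustration.

```python
import pandas as pd

REGIONS = ['NA_Sales', 'EU_Sales', 'JP_Sales', 'Other_Sales']

def genre_share_by_region(df):
    """Return each genre's percentage of every region's total sales."""
    totals = df.groupby('Genre')[REGIONS].sum()
    # Divide each column by its own total so every region sums to 100
    return totals.div(totals.sum(axis=0), axis=1) * 100

# Made-up rows, for illustration only (not from vgsales.csv)
toy = pd.DataFrame({
    'Genre': ['Action', 'Action', 'Role-Playing', 'Sports'],
    'NA_Sales': [4.0, 2.0, 1.0, 3.0],
    'EU_Sales': [3.0, 1.0, 1.0, 5.0],
    'JP_Sales': [0.5, 0.5, 3.0, 1.0],
    'Other_Sales': [1.0, 0.5, 0.5, 2.0],
})
shares = genre_share_by_region(toy)
print(shares.round(1))
```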

Top 10 Video Game Publishers by Sales

This doughnut chart shows the top 10 video game publishers in terms of sales. This chart is interesting because Nintendo is far ahead of all the other companies, yet it does not make many action games, which you might expect since that is the top genre by far. Nintendo does, however, make games in a wide variety of genres, so it has covered almost every market over the years. Electronic Arts (EA) is second behind Nintendo because it makes most of the sports games on the market, and sports is the second most popular genre. The rest of the sales are fairly evenly distributed among the other 8 top publishers. After the top 10, sales drop off quite a bit, which is why I only included the top 10 publishers. Staying among the top publishers takes a lot: these companies either cover many genres or make the top games in one or two genres.


pub = pd.read_csv(filename, usecols = ['Publisher', 'Global_Sales'])
pub = pub.dropna(axis = 0)
pub = pd.DataFrame(pub)
pub = pub.groupby(['Publisher'])['Global_Sales'].sum()
pub = pd.DataFrame(pub)
pub = pub.sort_values('Global_Sales', ascending = False)
pub = pub.reset_index()
pub = pub[0:10]

outside_color = len(pub.Publisher.unique())
outside_color_ref = np.arange(outside_color)*2

fig = plt.figure(figsize=(10,12))
ax = fig.add_subplot(1, 1, 1)

colormap = plt.get_cmap("tab20c")
outer_colors = colormap(outside_color_ref)

all_sales = pub.Global_Sales.sum()

pub.groupby(['Publisher'])['Global_Sales'].sum().plot(
    kind = 'pie', radius = 1, colors = outer_colors, pctdistance = 0.85, labeldistance = 1.1, 
    wedgeprops = dict(edgecolor = 'white'), textprops = {'fontsize':15},
    autopct = lambda p: '{:.2f}%\n(${:.1f}M)'.format(p,(p/100)*all_sales),
    startangle=90)

hole = plt.Circle((0,0), 0.3, fc='white')
fig1 = plt.gcf()
fig1.gca().add_artist(hole)

ax.yaxis.set_visible(False)
plt.title('Top 10 Video Game Companies Sales', fontsize = 20)

ax.text(0, 0, 'Total Sales\n' + '$' + str(round(all_sales,2)) + 'M', size = 18, ha = 'center', va = 'center')

ax.axis('equal')
plt.show()
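
To put a number on how much sales drop off after the top 10, one option is to compute the fraction of all global sales that the top publishers account for. This is a sketch under my own assumptions; the function name `top_n_share` and the values in `toy` are made up for illustration.

```python
import pandas as pd

def top_n_share(df, n=10, group_col='Publisher', value_col='Global_Sales'):
    """Fraction of total sales that belongs to the n largest groups."""
    totals = df.groupby(group_col)[value_col].sum().sort_values(ascending=False)
    return totals.head(n).sum() / totals.sum()

# Made-up rows, for illustration only (not from vgsales.csv)
toy = pd.DataFrame({
    'Publisher': ['P1', 'P1', 'P2', 'P3', 'P4'],
    'Global_Sales': [5.0, 3.0, 6.0, 4.0, 2.0],
})
share = top_n_share(toy, n=2)
print(round(share, 2))
```

On the real data, the same call with `n=10` would need the full table of publishers, before it is cut down to the top 10 rows.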

Top 10 Platforms by Region

This graph shows the top 10 platforms by sales across four regions of the world. In the graph, we can see that the PlayStation 2 (PS2) is, surprisingly, the most successful platform. This is probably because it was a big step forward in video game consoles when it came out, and there were not many platforms that could compete with its selection of games or the popularity of certain PS2 games. We can also see that North America is still above all the other regions in sales for almost all the consoles, except for the PlayStation 4 (PS4) and the personal computer (PC), where Europe generates more sales. I am not exactly sure why this is, because looking at the data, there are only a few games that would really carry the European market for these platforms, but apparently it was enough.


plat = pd.read_csv(filename, usecols = ['Platform', 'NA_Sales', 'EU_Sales', 'JP_Sales', 'Other_Sales', 'Global_Sales'])
plat = plat.dropna(axis = 0)
plat = pd.DataFrame(plat)
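# Select several columns with a list (double brackets); newer pandas versions reject the tuple form
plat = plat.groupby(['Platform'])[['NA_Sales', 'EU_Sales', 'JP_Sales', 'Other_Sales', 'Global_Sales']].sum()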
plat = plat.sort_values('Global_Sales', ascending = False)
plat = plat.reset_index()
plat = plat[0:10]

fig = plt.figure(figsize=(18, 10))
ax = fig.add_subplot(1, 1, 1)

plat.plot(kind = 'line', ax=ax, x = 'Platform', marker = 'o', markeredgewidth = 1, linewidth = 5,
          markersize = 18, markerfacecolor = 'white') 

# Take the tick labels from the sorted data so they always match the plotted order
plat_order = plat['Platform'].tolist()

num_rows = plat.shape[0]
num_cols = plat.shape[1]

plt.ylabel('Total Sales ($M)', fontsize = 18, labelpad = 10)
plt.title('Top 10 Selling Platforms by Region', fontsize = 20, pad = 15)

plt.xticks(np.arange(num_rows), plat_order, fontsize = 14)
#plt.yticks(range(1, num_cols+1, 1), fontsize = 14)
ax.set_xlabel('Platform', fontsize = 18)

plt.show()
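
To find the platforms where Europe outsells North America without reading them off the graph, the regional totals can be compared directly. This is only a sketch; the function name `platforms_where_eu_leads` and the rows in `toy` are made up for illustration.

```python
import pandas as pd

def platforms_where_eu_leads(df):
    """List the platforms whose total EU sales exceed their total NA sales."""
    totals = df.groupby('Platform')[['NA_Sales', 'EU_Sales']].sum()
    return totals[totals['EU_Sales'] > totals['NA_Sales']].index.tolist()

# Made-up rows, for illustration only (not from vgsales.csv)
toy = pd.DataFrame({
    'Platform': ['PS2', 'PS2', 'PS4', 'PC'],
    'NA_Sales': [5.0, 3.0, 1.0, 0.5],
    'EU_Sales': [2.0, 1.0, 2.0, 1.0],
})
eu_led = platforms_where_eu_leads(toy)
print(eu_led)
```

Filtering the real data to one of those platforms and sorting by `EU_Sales` would then show which games carry the European numbers.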