This is my very first project in R. I purposely chose a small data set so I could get comfortable with the code, but I believe that if I can do it with a small data set, I can do it with a big one as well!
Here we use data cleaning and visualization tools to highlight differences in sales over the years on the Nintendo 64 game console, between 1996 and 2000.
Some lines of code may look confusing since I'm still learning; my apologies.
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0 ✔ purrr 1.0.0
## ✔ tibble 3.1.8 ✔ dplyr 1.0.10
## ✔ tidyr 1.2.1 ✔ stringr 1.5.0
## ✔ readr 2.1.3 ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
# ggplot2 and dplyr are already attached by library(tidyverse); loading them again is harmless
library(ggplot2)
library(dplyr)
df <- read.csv('best-selling-nintendo64.csv')
head(df)
## Game Developer.s. Publisher.s. Release.date
## 1 Super Mario 64 Nintendo EAD Nintendo 1996-06-23
## 2 Mario Kart 64 Nintendo EAD Nintendo 1996-12-14
## 3 GoldenEye 007 Rare Nintendo 1997-08-25
## 4 The Legend of Zelda: Ocarina of Time Nintendo EAD Nintendo 1998-11-21
## 5 Super Smash Bros. HAL Laboratory Nintendo 1999-01-21
## 6 Pokémon Stadium Nintendo EAD Nintendo 1999-04-30
## Sales
## 1 11910000
## 2 9870000
## 3 8090000
## 4 7600000
## 5 5550000
## 6 5460000
summary(df)
## Game Developer.s. Publisher.s. Release.date
## Length:46 Length:46 Length:46 Length:46
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## Sales
## Min. : 1040000
## 1st Qu.: 1417500
## Median : 2100000
## Mean : 2956408
## 3rd Qu.: 3295000
## Max. :11910000
summary(is.na(df))
## Game Developer.s. Publisher.s. Release.date
## Mode :logical Mode :logical Mode :logical Mode :logical
## FALSE:46 FALSE:46 FALSE:46 FALSE:46
## Sales
## Mode :logical
## FALSE:46
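A more compact way to run the same missing-value check is to count the `NA`s per column. The sketch below is only an illustration: `sample_df` is a hypothetical three-row sample taken from the `head(df)` output, and one `Sales` value is blanked out on purpose so that the check has something to find.

```r
# Hypothetical sample: three rows from head(df), with one Sales value
# deliberately set to NA for the purpose of the demonstration.
sample_df <- data.frame(
  Game  = c("Super Mario 64", "Mario Kart 64", "GoldenEye 007"),
  Sales = c(11910000, NA, 8090000)
)

# is.na() returns a logical matrix; colSums() counts the TRUE values per column
na_counts <- colSums(is.na(sample_df))
print(na_counts)
```

On the real data set every count would be 0, which matches the `FALSE:46` lines in the summary.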
table(df$Game)
##
## 007: The World Is Not Enough 1080° Snowboarding
## 1 1
## Banjo-Kazooie Banjo-Tooie
## 1 1
## Cruis'n USA Diddy Kong Racing
## 1 1
## Donkey Kong 64 Excitebike 64
## 1 1
## F-1 World Grand Prix F-Zero X
## 1 1
## GoldenEye 007 Hey You, Pikachu!
## 1 1
## Jet Force Gemini Kirby 64: The Crystal Shards
## 1 1
## Kobe Bryant in NBA Courtside Mario Golf
## 1 1
## Mario Kart 64 Mario Party
## 1 1
## Mario Party 2 Mario Party 3
## 1 1
## Mario Tennis Namco Museum 64
## 1 1
## Paper Mario Perfect Dark
## 1 1
## Pilotwings 64 Pocket Monsters' Stadium
## 1 1
## Pokémon Snap Pokémon Stadium
## 1 1
## Pokémon Stadium 2 Star Fox 64
## 1 1
## Star Wars Episode I: Racer Star Wars: Rogue Squadron
## 1 1
## Star Wars: Shadows of the Empire Super Mario 64
## 1 1
## Super Smash Bros. The Legend of Zelda: Majora's Mask
## 1 1
## The Legend of Zelda: Ocarina of Time Tony Hawk's Pro Skater
## 1 1
## Turok 2: Seeds of Evil Turok: Dinosaur Hunter
## 1 1
## Wave Race 64 WCW vs. nWo: World Tour
## 1 1
## WCW/nWo Revenge WWF No Mercy
## 1 1
## WWF WrestleMania 2000 Yoshi's Story
## 1 1
table(df$Publisher.s.)
##
## Acclaim Entertainment Activision Electronic Arts
## 2 1 1
## LucasArts Namco Nintendo
## 1 1 33
## Rare THQ
## 3 4
table(df$Developer.s.)
##
## AKI Corporation and Asmik Ace Entertainment
## 4
## Ambrella
## 1
## Camelot Software Planning
## 2
## Edge of Reality
## 1
## Eurocom
## 1
## Factor 5 and LucasArts
## 1
## HAL Laboratory
## 2
## HAL Laboratory and Pax Softonica
## 1
## Hudson Soft
## 3
## Iguana Entertainment
## 2
## Intelligent Systems
## 1
## Left Field Productions
## 2
## LucasArts
## 2
## Mass Media Games
## 1
## Nintendo EAD
## 11
## Nintendo EAD / Nintendo R&D3 / Paradigm Entertainment
## 1
## Nintendo EAD and HAL Laboratory
## 1
## Paradigm Entertainment
## 1
## Rare
## 7
## Williams
## 1
table(df$Release.date)
##
## 1996-06-23 1996-09-27 1996-12-03 1996-12-14 1997-03-04 1997-04-27 1997-08-25
## 2 1 2 1 1 1 1
## 1997-11-14 1997-11-30 1997-12-21 1998-02-28 1998-04-27 1998-06-29 1998-07-14
## 1 1 1 1 1 1 1
## 1998-07-31 1998-08-01 1998-10-21 1998-10-26 1998-11-21 1998-12-07 1998-12-12
## 1 1 1 1 1 1 1
## 1998-12-18 1999-01-21 1999-03-21 1999-04-30 1999-06-11 1999-10-11 1999-10-12
## 1 1 1 2 1 1 1
## 1999-10-31 1999-11-22 1999-12-17 2000-02-29 2000-03-24 2000-04-27 2000-04-30
## 1 1 1 1 1 1 1
## 2000-05-22 2000-07-21 2000-08-11 2000-10-17 2000-11-17 2000-11-20 2000-12-07
## 1 1 1 1 1 1 1
## 2000-12-14
## 1
table(df$Sales)
##
## 1040000 1080000 1094765 1100000 1120000 1140000 1160000 1190000
## 1 1 1 1 1 1 1 2
## 1300000 1370000 1400000 1470000 1500000 1600000 1610000 1720000
## 1 1 1 1 1 1 1 1
## 1770000 1830000 1880000 1910000 2000000 2030000 2170000 2320000
## 1 1 1 1 1 1 1 1
## 2480000 2520000 2540000 2600000 2700000 2850000 2940000 3000000
## 1 1 1 1 1 1 1 1
## 3100000 3360000 3630000 3650000 4000000 4880000 5270000 5460000
## 1 1 1 1 1 1 1 1
## 5550000 7600000 8090000 9870000 11910000
## 1 1 1 1 1
colnames(df)[2] <- "Developper"
colnames(df)[3] <- "Publisher"
colnames(df)[4] <- "Release_date"
df$Release_date <- as.Date(df$Release_date)
df$Year_Release <- as.numeric(format(df$Release_date, "%Y"))
print(df$Year_Release)
## [1] 1996 1996 1997 1998 1999 1999 1999 1997 1997 1998 1999 2000 1999 2000 1996
## [16] 1997 1998 1996 2000 2000 1999 2000 1998 1998 2000 2000 1998 1998 2000 1996
## [31] 2000 1998 1997 1999 1998 2000 1997 1998 2000 1999 1999 1996 1998 1998 2000
## [46] 1999
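The date conversion and year extraction can be checked on a handful of values. This is a minimal sketch using the six release dates shown by `head(df)`; the names `release`, `dates`, `years` and `games_per_year` are only illustrative.

```r
# Release dates of the first six games, as plain text
release <- c("1996-06-23", "1996-12-14", "1997-08-25",
             "1998-11-21", "1999-01-21", "1999-04-30")

# Strings in yyyy-mm-dd form are parsed by as.Date() without a format argument
dates <- as.Date(release)

# format(..., "%Y") returns the year as text, so convert it to a number
years <- as.integer(format(dates, "%Y"))
print(years)

# Number of games per release year in this small sample
games_per_year <- table(years)
print(games_per_year)
```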
Total_Games <- nrow(df) # number of games (rows) in the data set: 46
ggplot(df, aes(x = Year_Release)) +
  geom_bar(color = 'black', fill = 'lightblue') +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  ggtitle("Number of games released per year on N64") +
  # after_stat(count) replaces the ..count.. notation deprecated in ggplot2 3.4.0
  geom_text(aes(label = after_stat(count)), stat = "count", vjust = 1.5, colour = "black")
We can notice several things:
Number_Games_Released <- 5.75 # average number of games per publisher (46 games / 8 publishers)
ggplot(df, aes(x = Publisher)) +
  geom_bar(color = 'black', fill = 'lightblue') +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  ggtitle("Number of games released per publisher on N64") +
  geom_text(aes(label = after_stat(count)), stat = "count", vjust = -0.4, colour = "black") +
  # Horizontal reference line at the average, labelled near the left edge of the plot
  geom_hline(yintercept = Number_Games_Released) +
  annotate("text", x = 0.6, y = Number_Games_Released, label = Number_Games_Released, vjust = -0.5)
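The value 5.75 used for the reference line can be recomputed from the publisher counts printed earlier by `table()`. The sketch below simply retypes those counts by hand; `publisher_counts`, `total_games`, `avg_games` and `above_average` are illustrative names, not objects from the analysis.

```r
# Games per publisher, copied from the table() output earlier in this document
publisher_counts <- c(
  "Acclaim Entertainment" = 2,
  "Activision"            = 1,
  "Electronic Arts"       = 1,
  "LucasArts"             = 1,
  "Namco"                 = 1,
  "Nintendo"              = 33,
  "Rare"                  = 3,
  "THQ"                   = 4
)

total_games <- sum(publisher_counts)   # all games in the data set
avg_games   <- mean(publisher_counts)  # average number of games per publisher

# Publishers that released more games than the average
above_average <- names(publisher_counts)[publisher_counts > avg_games]

print(total_games)
print(avg_games)
print(above_average)
```

With these counts the total is 46 games, the average is 5.75, and Nintendo is the only publisher above it.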
We can notice several things:
show(df$Sales)
## [1] 11910000 9870000 8090000 7600000 5550000 5460000 5270000 4880000
## [9] 4000000 3650000 3630000 3360000 3100000 3000000 2940000 2850000
## [17] 2700000 2600000 2540000 2520000 2480000 2320000 2170000 2030000
## [25] 2000000 1910000 1880000 1830000 1770000 1720000 1610000 1600000
## [33] 1500000 1470000 1400000 1370000 1300000 1190000 1190000 1160000
## [41] 1140000 1120000 1100000 1094765 1080000 1040000
min(df$Sales) #1040000
## [1] 1040000
max(df$Sales) #11910000
## [1] 11910000
I struggled to create a single bar plot combining Nintendo's yearly sales with the sales of all publishers, so I decided to build the two plots separately, grouping the data for each one.
# Total sales per publisher and per year. Grouping by Game or Sales as well
# would leave one row per game, so there would be nothing to sum.
Total_yearly_Sales <- df %>%
  filter(Publisher %in% c("Nintendo", "Rare", "LucasArts", "THQ",
                          "Acclaim Entertainment", "Activision",
                          "Electronic Arts", "Namco")) %>%
  group_by(Publisher, Year_Release) %>%
  summarize(Sales = sum(Sales, na.rm = TRUE), .groups = "drop")
Yearly sales for all publishers from 1996 to 2000:
ggplot(Total_yearly_Sales, aes(x = Year_Release, y = Sales)) +
  geom_col(fill = 'steelblue') # geom_col() is shorthand for geom_bar(stat = 'identity')
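To see what `group_by()` followed by `summarize()` produces when only the publisher and the year are used as grouping columns, here is a minimal sketch on the six rows shown by `head(df)`. It assumes `dplyr` is installed; `sample_df` and `yearly` are illustrative names.

```r
library(dplyr)

# The six best-selling games from head(df), with the year already extracted
sample_df <- data.frame(
  Game = c("Super Mario 64", "Mario Kart 64", "GoldenEye 007",
           "The Legend of Zelda: Ocarina of Time", "Super Smash Bros.",
           "Pokémon Stadium"),
  Publisher    = rep("Nintendo", 6),
  Year_Release = c(1996, 1996, 1997, 1998, 1999, 1999),
  Sales        = c(11910000, 9870000, 8090000, 7600000, 5550000, 5460000)
)

# One row per publisher and year, with the sales of all games added together
yearly <- sample_df %>%
  group_by(Publisher, Year_Release) %>%
  summarize(Sales = sum(Sales), .groups = "drop")

print(yearly)
```

The six games collapse into four rows, one per year, and the two 1996 games are summed into a single total of 21,780,000.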
The same process, this time for Nintendo only:
Nintendo_yearly_Sales <- df %>%
  filter(Publisher == "Nintendo") %>%
  group_by(Year_Release) %>%
  summarize(Sales = sum(Sales, na.rm = TRUE), .groups = "drop")
# Yearly sales with Nintendo as the only publisher, from 1996 to 2000
ggplot(Nintendo_yearly_Sales, aes(x = Year_Release, y = Sales)) +
  geom_col(fill = 'lightblue')
We can identify several things if we compare both plots:
Analyzing the top 10 best-selling games
Top_10_Sales <- df %>%
  arrange(desc(Sales)) %>%
  head(10) # keep only the ten best-selling games
Top_10_Sales
## Game Developper Publisher Release_date
## 1 Super Mario 64 Nintendo EAD Nintendo 1996-06-23
## 2 Mario Kart 64 Nintendo EAD Nintendo 1996-12-14
## 3 GoldenEye 007 Rare Nintendo 1997-08-25
## 4 The Legend of Zelda: Ocarina of Time Nintendo EAD Nintendo 1998-11-21
## 5 Super Smash Bros. HAL Laboratory Nintendo 1999-01-21
## 6 Pokémon Stadium Nintendo EAD Nintendo 1999-04-30
## 7 Donkey Kong 64 Rare Nintendo 1999-11-22
## 8 Diddy Kong Racing Rare Rare 1997-11-14
## 9 Star Fox 64 Nintendo EAD Nintendo 1997-04-27
## 10 Banjo-Kazooie Rare Nintendo 1998-06-29
## Sales Year_Release
## 1 11910000 1996
## 2 9870000 1996
## 3 8090000 1997
## 4 7600000 1998
## 5 5550000 1999
## 6 5460000 1999
## 7 5270000 1999
## 8 4880000 1997
## 9 4000000 1997
## 10 3650000 1998
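One way to summarize this top 10 is to count how often each developer and publisher appears. The sketch below retypes the ten rows printed above by hand, so `top10`, `developer_counts` and `publisher_counts` are illustrative names only.

```r
# The ten best-selling games, copied from the output above
top10 <- data.frame(
  Game = c("Super Mario 64", "Mario Kart 64", "GoldenEye 007",
           "The Legend of Zelda: Ocarina of Time", "Super Smash Bros.",
           "Pokémon Stadium", "Donkey Kong 64", "Diddy Kong Racing",
           "Star Fox 64", "Banjo-Kazooie"),
  Developper = c("Nintendo EAD", "Nintendo EAD", "Rare", "Nintendo EAD",
                 "HAL Laboratory", "Nintendo EAD", "Rare", "Rare",
                 "Nintendo EAD", "Rare"),
  Publisher = c("Nintendo", "Nintendo", "Nintendo", "Nintendo", "Nintendo",
                "Nintendo", "Nintendo", "Rare", "Nintendo", "Nintendo")
)

# Number of top-10 games per developer, most frequent first
developer_counts <- sort(table(top10$Developper), decreasing = TRUE)
print(developer_counts)

# Number of top-10 games per publisher
publisher_counts <- table(top10$Publisher)
print(publisher_counts)
```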
In the top 10 best-selling games on Nintendo 64 we can see:
From 1996 to 2000:
ggplot(data = Total_yearly_Sales, aes(x = Year_Release, y = Sales, fill = Publisher)) +
  # Sum sales per publisher and year so that each dodged bar shows a total
  geom_bar(stat = "summary", fun = "sum", position = position_dodge()) +
  ggtitle("Total Yearly Sales per Publisher")
This last visualization helps us draw several conclusions: