I’ll demonstrate the power of Tidyverse on a dataset from Kaggle: https://www.kaggle.com/gregorut/videogamesales
Fields in the dataset are:
Rank - Ranking of overall sales
Name - The games name
Platform - Platform of the games release (i.e. PC,PS4, etc.)
Year - Year of the game's release
Genre - Genre of the game
Publisher - Publisher of the game
NA_Sales - Sales in North America (in millions)
EU_Sales - Sales in Europe (in millions)
JP_Sales - Sales in Japan (in millions)
Other_Sales - Sales in the rest of the world (in millions)
Global_Sales - Total worldwide sales.
Following Wickham & Grolemund data science framework, this is the fist step in any data science project: Import.
library(tidyverse)
## ── Attaching packages ───────────────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.2.1 ✓ purrr 0.3.3
## ✓ tibble 2.1.3 ✓ dplyr 0.8.4
## ✓ tidyr 1.0.2 ✓ stringr 1.4.0
## ✓ readr 1.3.1 ✓ forcats 0.4.0
## ── Conflicts ──────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
#importing and converting into a dataframe using piping (%>%), which is widely used in tidyverse.
sales_raw<-read.csv("https://raw.githubusercontent.com/bsvmelo/CUNY/master/vgsales.csv") %>% data.frame()
#quick display
head(sales_raw)
## Rank Name Platform Year Genre Publisher NA_Sales
## 1 1 Wii Sports Wii 2006 Sports Nintendo 41.49
## 2 2 Super Mario Bros. NES 1985 Platform Nintendo 29.08
## 3 3 Mario Kart Wii Wii 2008 Racing Nintendo 15.85
## 4 4 Wii Sports Resort Wii 2009 Sports Nintendo 15.75
## 5 5 Pokemon Red/Pokemon Blue GB 1996 Role-Playing Nintendo 11.27
## 6 6 Tetris GB 1989 Puzzle Nintendo 23.20
## EU_Sales JP_Sales Other_Sales Global_Sales
## 1 29.02 3.77 8.46 82.74
## 2 3.58 6.81 0.77 40.24
## 3 12.88 3.79 3.31 35.82
## 4 11.01 3.28 2.96 33.00
## 5 8.89 10.22 1.00 31.37
## 6 2.26 4.22 0.58 30.26
#summary stats
summary(sales_raw)
## Rank Name Platform
## Min. : 1 Need for Speed: Most Wanted: 12 DS :2163
## 1st Qu.: 4151 FIFA 14 : 9 PS2 :2161
## Median : 8300 LEGO Marvel Super Heroes : 9 PS3 :1329
## Mean : 8301 Madden NFL 07 : 9 Wii :1325
## 3rd Qu.:12450 Ratatouille : 9 X360 :1265
## Max. :16600 Angry Birds Star Wars : 8 PSP :1213
## (Other) :16542 (Other):7142
## Year Genre Publisher
## 2009 :1431 Action :3316 Electronic Arts : 1351
## 2008 :1428 Sports :2346 Activision : 975
## 2010 :1259 Misc :1739 Namco Bandai Games : 932
## 2007 :1202 Role-Playing:1488 Ubisoft : 921
## 2011 :1139 Shooter :1310 Konami Digital Entertainment: 832
## 2006 :1008 Adventure :1286 THQ : 715
## (Other):9131 (Other) :5113 (Other) :10872
## NA_Sales EU_Sales JP_Sales Other_Sales
## Min. : 0.0000 Min. : 0.0000 Min. : 0.00000 Min. : 0.00000
## 1st Qu.: 0.0000 1st Qu.: 0.0000 1st Qu.: 0.00000 1st Qu.: 0.00000
## Median : 0.0800 Median : 0.0200 Median : 0.00000 Median : 0.01000
## Mean : 0.2647 Mean : 0.1467 Mean : 0.07778 Mean : 0.04806
## 3rd Qu.: 0.2400 3rd Qu.: 0.1100 3rd Qu.: 0.04000 3rd Qu.: 0.04000
## Max. :41.4900 Max. :29.0200 Max. :10.22000 Max. :10.57000
##
## Global_Sales
## Min. : 0.0100
## 1st Qu.: 0.0600
## Median : 0.1700
## Mean : 0.5374
## 3rd Qu.: 0.4700
## Max. :82.7400
##
Some statistics: Number of rows: 16598
Number of columns: 11
History size: 11
Number of unique games: 11493
Number of unique platforms: 31
Number of unique genres: 12
Number of unique publishers: 579
Number of NAs in each column: 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
Number of N/A in each column: 0, 0, 0, 271, 0, 58, 0, 0, 0, 0, 0
Despite the facts there are no NAs, there are N/A which will have to be dealt with in a later stage.
Following Wickham & Grolemund data science framework, I’ll start with some basic visualization using ggplot.
top_20<-head(sales_raw,20)
# Simple Bar Plot
ggplot(top_20)+ geom_col(mapping=aes(x=reorder(Name,-Global_Sales),y=Global_Sales))+ theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))+labs(x="Games", y = "Global Sales in millions")
#Number of games sold per platform
ggplot(sales_raw) + geom_bar(mapping=aes(x=Platform))+ theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))+labs(y = "# units sold in mn")
ggplot(sales_raw) + geom_bar(mapping=aes(x=Genre))+ theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))+labs(y = "# of games")
ggplot(sales_raw) + geom_col(mapping=aes(Year,Global_Sales,fill=Platform))+theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))
ggplot(sales_raw) + geom_col(mapping=aes(Year,Global_Sales,fill=Platform))+facet_wrap(~Platform)+theme(axis.text.x=element_blank(),axis.ticks.x=element_blank(), legend.position = "none")
1: First thing to do is to take care of the N/A values in the dataset.
To recap, this is the number of N/A in each column: 0, 0, 0, 271, 0, 58, 0, 0, 0, 0, 0
There are N/A in the Year and Publisher columns.
I will delete all records in Year and Publisher columns that have an N/A.
# replacing N/A by NA
sales_df1<-sales_raw
sales_df1<-sales_df1[sales_df1$Year !="N/A",]
sales_df1<-sales_df1[sales_df1$Publisher !="N/A",]
# confirming that all records containing N/As are deleted
colSums(sales_df1=='N/A')
## Rank Name Platform Year Genre Publisher
## 0 0 0 0 0 0
## NA_Sales EU_Sales JP_Sales Other_Sales Global_Sales
## 0 0 0 0 0
2:Filtering I will analyse only Sony Platform games, which I assume are: PS, PS2, PS3, PS4, PSP, PSV.
# Date frame with only games for Sony platforms
sales_sony<-drop_na(sales_df1)
sales_sony<-filter(na.omit(sales_df1), Platform %in% c('PS', 'PS2', 'PS3', 'PS4', 'PSP', 'PSV'))
head(sales_sony)
## Rank Name Platform Year Genre
## 1 17 Grand Theft Auto V PS3 2013 Action
## 2 18 Grand Theft Auto: San Andreas PS2 2004 Action
## 3 25 Grand Theft Auto: Vice City PS2 2002 Action
## 4 29 Gran Turismo 3: A-Spec PS2 2001 Racing
## 5 34 Call of Duty: Black Ops 3 PS4 2015 Shooter
## 6 35 Call of Duty: Black Ops II PS3 2012 Shooter
## Publisher NA_Sales EU_Sales JP_Sales Other_Sales
## 1 Take-Two Interactive 7.01 9.27 0.97 4.14
## 2 Take-Two Interactive 9.43 0.40 0.41 10.57
## 3 Take-Two Interactive 8.41 5.49 0.47 1.78
## 4 Sony Computer Entertainment 6.85 5.09 1.87 1.16
## 5 Activision 5.77 5.81 0.35 2.31
## 6 Activision 4.99 5.88 0.65 2.52
## Global_Sales
## 1 21.40
## 2 20.81
## 3 16.15
## 4 14.98
## 5 14.24
## 6 14.03
3:Selecting columns I will use the select() function to narrow down the analysis to games sold in Japan
# Date frame with only sales in Japan
sales_sony_jp<-select(sales_sony, -c(NA_Sales,EU_Sales,Other_Sales))
head(sales_sony_jp)
## Rank Name Platform Year Genre
## 1 17 Grand Theft Auto V PS3 2013 Action
## 2 18 Grand Theft Auto: San Andreas PS2 2004 Action
## 3 25 Grand Theft Auto: Vice City PS2 2002 Action
## 4 29 Gran Turismo 3: A-Spec PS2 2001 Racing
## 5 34 Call of Duty: Black Ops 3 PS4 2015 Shooter
## 6 35 Call of Duty: Black Ops II PS3 2012 Shooter
## Publisher JP_Sales Global_Sales
## 1 Take-Two Interactive 0.97 21.40
## 2 Take-Two Interactive 0.41 20.81
## 3 Take-Two Interactive 0.47 16.15
## 4 Sony Computer Entertainment 1.87 14.98
## 5 Activision 0.35 14.24
## 6 Activision 0.65 14.03
4:Adding new variable with mutate() I will create a variable corresponding to the Global share of the Japanese sales
sales_sony_jp<-mutate(sales_sony_jp,Share=JP_Sales/Global_Sales)
5:Grouped Summaries with summarize() I will summarize the Japanese Share of total sales in aggregate and also by year and publisher.
sales_sony_jp_summ_1<-summarize(sales_sony_jp,sum(JP_Sales)/sum(Global_Sales))
sales_sony_jp_summ_2<-sales_sony_jp %>%
group_by(Year) %>%
summarize(Share1=sum(JP_Sales)/sum(Global_Sales))
Historical share of Japanese sales is: r sales_sony_jp_summ_1
Plotting share of Japanese sales over time:
ggplot(sales_sony_jp_summ_2)+ geom_col(mapping=aes(sales_sony_jp_summ_2$Year,sales_sony_jp_summ_2$Share1))+theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))+labs(y = "JP Share", x="Year")
It seems there’s an error with 2017 data. I will leave this to my colleagues to try to fix :-)
Aggregation by year and by Platform
(sales_sony_jp_summ_3<-sales_sony_jp %>%
group_by(Year, Platform) %>%
summarize(Share1=sum(JP_Sales)/sum(Global_Sales)))
## # A tibble: 57 x 3
## # Groups: Year [24]
## Year Platform Share1
## <fct> <fct> <dbl>
## 1 1994 PS 0.444
## 2 1995 PS 0.266
## 3 1996 PS 0.188
## 4 1997 PS 0.243
## 5 1998 PS 0.168
## 6 1999 PS 0.194
## 7 2000 PS 0.152
## 8 2000 PS2 0.227
## 9 2001 PS 0.109
## 10 2001 PS2 0.107
## # … with 47 more rows
Aggregation by year by Genre and by Platform, when Share is greater than 0.
(sales_sony_jp_summ_4<-sales_sony_jp %>%
group_by(Year, Platform, Genre) %>%
summarize(Share1=sum(JP_Sales)/sum(Global_Sales))) %>%
filter(Share1>0)
## # A tibble: 501 x 4
## # Groups: Year, Platform [56]
## Year Platform Genre Share1
## <fct> <fct> <fct> <dbl>
## 1 1994 PS Fighting 0.384
## 2 1994 PS Racing 0.930
## 3 1994 PS Role-Playing 0.944
## 4 1994 PS Shooter 0.327
## 5 1994 PS Simulation 0.171
## 6 1994 PS Sports 0.298
## 7 1994 PS Strategy 0.96
## 8 1995 PS Action 0.196
## 9 1995 PS Adventure 0.815
## 10 1995 PS Fighting 0.366
## # … with 491 more rows
Aggregation by Platform, when Share is greater than 0.
(sales_sony_jp_summ_5<-sales_sony_jp %>%
group_by(Platform) %>%
summarize(Share1=sum(JP_Sales)/sum(Global_Sales))) %>%
filter(Share1>0)
## # A tibble: 6 x 2
## Platform Share1
## <fct> <dbl>
## 1 PS 0.192
## 2 PS2 0.112
## 3 PS3 0.0834
## 4 PS4 0.0514
## 5 PSP 0.260
## 6 PSV 0.339
I have used a lot of different functions to transform, tidy, and present data. There’s still a lot analysis that can be done.