library(readxl)
library(rvest)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5 ✓ purrr 0.3.4
## ✓ tibble 3.1.4 ✓ dplyr 1.0.7
## ✓ tidyr 1.1.3 ✓ stringr 1.4.0
## ✓ readr 2.0.1 ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x readr::guess_encoding() masks rvest::guess_encoding()
## x dplyr::lag() masks stats::lag()
df = read_excel("videogames.xlsx")
-This data set contains a list of video games with sales greater than 100,000 copies. Each row describes one video game.
-The data set was collected from Kaggle: https://www.kaggle.com/gregorut/videogamesales
head(df)
## # A tibble: 6 × 11
## Rank Name Platform Year Genre Publisher NA_Sales EU_Sales JP_Sales
## <dbl> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 1 Wii Sports Wii 2006 Sports Nintendo 41.5 29.0 3.77
## 2 2 Super Mario… NES 1985 Platfo… Nintendo 29.1 3.58 6.81
## 3 3 Mario Kart … Wii 2008 Racing Nintendo 15.8 12.9 3.79
## 4 4 Wii Sports … Wii 2009 Sports Nintendo 15.8 11.0 3.28
## 5 5 Pokemon Red… GB 1996 Role-P… Nintendo 11.3 8.89 10.2
## 6 6 Tetris GB 1989 Puzzle Nintendo 23.2 2.26 4.22
## # … with 2 more variables: Other_Sales <dbl>, Global_Sales <dbl>
The data set includes information about:
-Rank - The overall rank of the game.
-Name - The name of the game.
-Platform - The platform on which the game was released.
-Year - The year in which the game was released.
-Genre - The genre of the game.
-Publisher - The publisher of the game.
-NA_Sales, EU_Sales, JP_Sales, Other_Sales, Global_Sales - Sales of the game in North America, Europe, Japan, other regions, and globally (in millions).
The mean of a set of observations is the arithmetic average of the values. The call below computes the mean of global sales (in millions) across all games in the data set, not separately by year.
mean(df$Global_Sales,na.rm = TRUE)
## [1] 0.5374407
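The call above returns a single overall mean. If a mean per release year is wanted, one possible sketch groups by `Year` first. The `toy` data frame below is hypothetical, with made-up values and the same column names as `df`, so that the example runs on its own.

```r
library(dplyr)

# Hypothetical toy data with the same column names as df; values are made up.
toy <- data.frame(
  Year         = c("2006", "2006", "2008", "2008", "2008"),
  Global_Sales = c(1.0, 3.0, 2.0, 4.0, 6.0)
)

# Mean global sales per release year, ignoring missing values.
mean_by_year <- toy %>%
  group_by(Year) %>%
  summarise(Mean_Global_Sales = mean(Global_Sales, na.rm = TRUE))

print(mean_by_year)
```

Replacing `toy` with `df` should give the per-year means for the full data set, assuming `Year` is read in as shown in the `head(df)` output.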
The standard deviation is a measure of the amount of variation or dispersion of a set of values. A low standard deviation indicates that the values tend to be close to the mean of the set, while a high standard deviation indicates that the values are spread out over a wider range. The call below computes the standard deviation of global sales.
sd(df$Global_Sales,na.rm = TRUE)
## [1] 1.555028
Maximum value
max(df$Global_Sales,na.rm = TRUE)
## [1] 82.74
Minimum value
min(df$Global_Sales,na.rm = TRUE)
## [1] 0.01
Median
median(df$Global_Sales,na.rm = TRUE)
## [1] 0.17
1st Quartile
quantile(df$Global_Sales,.25,na.rm = TRUE)
## 25%
## 0.06
3rd Quartile
quantile(df$Global_Sales,.75,na.rm = TRUE)
## 75%
## 0.47
A histogram is a diagram consisting of rectangles whose area is proportional to the frequency of a variable and whose width is equal to the class interval.
Histogram of global sales:
Global_Sales<-df$Global_Sales
hist(Global_Sales,main= "Histogram of Global Sales", xlab = "Global Sales in Millions")
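In a right-skewed (positively skewed) distribution, most of the data is concentrated on the left side of the graph and a long tail extends to the right, toward larger values. In the histogram above, most games fall in the lowest sales bins while a few games reach very high sales, so the distribution is right-skewed. This agrees with the summary statistics: the mean (0.54) is well above the median (0.17).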
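One quick numeric check of the skew uses the mean, median, and standard deviation already reported in this report. The sketch below computes Pearson's second skewness coefficient, which is a rough rule-of-thumb measure rather than a method used in the original analysis; the variable names are illustrative.

```r
# Summary statistics reported earlier in this report for Global_Sales.
mean_gs   <- 0.5374407
median_gs <- 0.17
sd_gs     <- 1.555028

# Pearson's second skewness coefficient: 3 * (mean - median) / sd.
# A positive value points to a longer right tail.
pearson_skew <- 3 * (mean_gs - median_gs) / sd_gs

print(pearson_skew)
```

A positive result is consistent with a right-skewed distribution.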
boxplot(df$Global_Sales)
ggplot(df,aes(sample = Global_Sales))+ geom_qq()+geom_qq_line()
An outlier is an observation that lies an abnormal distance from the other values in a sample. In this Q-Q plot, one point above 80 lies far from all the other values; it corresponds to the maximum of 82.74 found above.
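A common convention for flagging outliers, and the idea behind the whiskers of a boxplot, is Tukey's 1.5 × IQR rule. The sketch below applies it to the quartiles reported earlier (0.06 and 0.47). This rule is an assumption added for illustration, not a step in the original analysis.

```r
# Quartiles of Global_Sales as reported earlier in this report.
q1 <- 0.06
q3 <- 0.47

# Interquartile range and Tukey fences (1.5 * IQR beyond each quartile).
iqr         <- q3 - q1
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

print(c(lower = lower_fence, upper = upper_fence))
```

Under this rule, any game with global sales above the upper fence would be flagged, and the maximum of 82.74 lies far beyond it.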
Scatter plots of global sales against NA, EU, JP, and other sales, to show how each regional figure relates to the global total:
par(mfrow=c(1,4))
with(df, plot(NA_Sales, Global_Sales))
with(df, plot(EU_Sales, Global_Sales))
with(df, plot(JP_Sales, Global_Sales))
with(df, plot(Other_Sales, Global_Sales))
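The scatter plots show the relationships visually. To put a number on them, one option is the Pearson correlation from base R's `cor()`. The `toy` data frame below is hypothetical, with made-up values and the same column names as `df`.

```r
# Hypothetical toy data with the same sales columns as df; values are made up.
toy <- data.frame(
  NA_Sales     = c(4.0, 2.5, 1.0, 0.6, 0.2),
  EU_Sales     = c(2.9, 0.4, 1.2, 0.3, 0.1),
  JP_Sales     = c(0.5, 0.1, 0.3, 0.2, 0.4),
  Other_Sales  = c(0.8, 0.1, 0.2, 0.1, 0.05),
  Global_Sales = c(8.2, 3.1, 2.7, 1.2, 0.75)
)

# Pearson correlation matrix, using only rows with no missing values.
cor_matrix <- cor(toy, use = "complete.obs")

# Correlation of every sales column with Global_Sales.
cor_with_global <- cor_matrix[, "Global_Sales"]
print(round(cor_with_global, 3))
```

On the real data the same two calls could be applied to `df[, c("NA_Sales", "EU_Sales", "JP_Sales", "Other_Sales", "Global_Sales")]`.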
df %>%
  group_by(Platform) %>%
  summarise(Frequency = n()) %>%
  mutate(Proportion = Frequency / sum(Frequency))
## # A tibble: 31 × 3
## Platform Frequency Proportion
## <chr> <int> <dbl>
## 1 2600 133 0.00801
## 2 3DO 3 0.000181
## 3 3DS 509 0.0307
## 4 DC 52 0.00313
## 5 DS 2163 0.130
## 6 GB 98 0.00590
## 7 GBA 822 0.0495
## 8 GC 556 0.0335
## 9 GEN 27 0.00163
## 10 GG 1 0.0000602
## # … with 21 more rows
mytable <- xtabs(~Publisher+Genre, data=df)
head(mytable)
## Genre
## Publisher Action Adventure Fighting Misc Platform Puzzle
## 10TACLE Studios 0 1 0 0 0 1
## 1C Company 0 0 0 0 0 0
## 20th Century Fox Video Games 4 0 0 0 0 0
## 2D Boy 0 0 0 0 0 1
## 3DO 17 3 1 0 1 1
## 49Games 0 0 0 0 0 0
## Genre
## Publisher Racing Role-Playing Shooter Simulation Sports
## 10TACLE Studios 0 0 0 0 0
## 1C Company 1 1 0 0 0
## 20th Century Fox Video Games 0 0 1 0 0
## 2D Boy 0 0 0 0 0
## 3DO 0 1 5 0 6
## 49Games 0 0 0 0 1
## Genre
## Publisher Strategy
## 10TACLE Studios 1
## 1C Company 1
## 20th Century Fox Video Games 0
## 2D Boy 0
## 3DO 1
## 49Games 0
The table shows how many games of each genre each publisher has released; only the first six publishers are printed. Some publishers release only one type of game, whereas others release games in many genres. Among the six publishers shown, the largest count is the 17 Action games from 3DO. The table indicates some relationship between the two variables, but it does not show what difference producing a certain genre makes or which genre is more profitable.
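The count table cannot say which genre sells best. A rough way to approach that question is to sum sales by genre by putting a numeric column on the left-hand side of the `xtabs()` formula. The `toy` data frame is hypothetical with made-up values, and sales are only a proxy for profitability.

```r
# Hypothetical toy data with the same column names as df; values are made up.
toy <- data.frame(
  Genre        = c("Action", "Action", "Puzzle", "Sports", "Sports"),
  Global_Sales = c(1.0, 2.0, 0.5, 3.0, 1.5)
)

# With a variable on the left-hand side, xtabs() sums it within each group.
sales_by_genre <- xtabs(Global_Sales ~ Genre, data = toy)

# Genres ordered from highest to lowest total sales.
sorted_sales <- sort(sales_by_genre, decreasing = TRUE)
print(sorted_sales)
```

Using `df` in place of `toy` should give total global sales (in millions) per genre for the full data set.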
ggplot(data = df, aes(y = EU_Sales, color = Genre)) + geom_boxplot()
Scatter plot of global sales against North America sales:
plot(jitter(df$NA_Sales), jitter(df$Global_Sales), xlab="North America Sales", ylab="Global sales", main="North America sales vs. global sales", col="blue")
-Importance of clean and organized data:
-Cleaning and organizing data is a critical part of data mining, because decisions are only as good as the integrity of the data they rest on. Some ways to keep data clean and organized are: (a) organize the data by its attributes; (b) create uniform data standards at the point of data entry; (c) run periodic checks on the data cleaning process, as the situation requires; (d) apply the same methods consistently, since consistency is key to avoiding dirty data.
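As a small illustration of these points, the sketch below removes duplicate rows, turns a text placeholder into a proper missing value, and stores `Year` as an integer. The `head(df)` output shows that `Year` was read as character. The `toy` data frame, its values, and the assumption that missing years appear as the string "N/A" are hypothetical.

```r
library(dplyr)

# Hypothetical toy data; "N/A" as the missing-year marker is an assumption.
toy <- data.frame(
  Name         = c("Game A", "Game B", "Game B", "Game C"),
  Year         = c("2006", "N/A", "N/A", "2010"),
  Global_Sales = c(1.2, 0.4, 0.4, NA),
  stringsAsFactors = FALSE
)

clean <- toy %>%
  distinct() %>%                       # drop exact duplicate rows
  mutate(Year = na_if(Year, "N/A"),    # placeholder text becomes NA
         Year = as.integer(Year))      # store the year as a number

# Periodic check: count missing values in each column.
print(colSums(is.na(clean)))
```

Running the same missing-value check after every cleaning step is one simple way to apply the cleaning process consistently.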
-Summary:
-This report summarizes video game sales by game, platform, publisher, and genre across different regions. The results could be useful for improving both sales and customer satisfaction. Some limitations worth noting: the data counts only game sales, whereas much of the revenue from games comes from microtransactions. In addition, we did not have the data to show the rise of PC games, so the analysis focuses more on consoles.
-Future Work:
-In future work this data set could be explored in more depth, specifically regarding platforms, individual games, which platform led in sales in each year, and which publisher led in each year. I used only this data set for all of my analysis; however, further data could be retrieved from other sources, such as individual publishers like EA Sports.