library(readxl)
library(rvest)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5     ✓ purrr   0.3.4
## ✓ tibble  3.1.4     ✓ dplyr   1.0.7
## ✓ tidyr   1.1.3     ✓ stringr 1.4.0
## ✓ readr   2.0.1     ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter()         masks stats::filter()
## x readr::guess_encoding() masks rvest::guess_encoding()
## x dplyr::lag()            masks stats::lag()
df = read_excel("videogames.xlsx")

-This data set contains a list of video games with sales greater than 100,000 copies. Each row in the data set describes one video game.

-The data set was downloaded from Kaggle: https://www.kaggle.com/gregorut/videogamesales

head(df)
## # A tibble: 6 × 11
##    Rank Name         Platform Year  Genre   Publisher NA_Sales EU_Sales JP_Sales
##   <dbl> <chr>        <chr>    <chr> <chr>   <chr>        <dbl>    <dbl>    <dbl>
## 1     1 Wii Sports   Wii      2006  Sports  Nintendo      41.5    29.0      3.77
## 2     2 Super Mario… NES      1985  Platfo… Nintendo      29.1     3.58     6.81
## 3     3 Mario Kart … Wii      2008  Racing  Nintendo      15.8    12.9      3.79
## 4     4 Wii Sports … Wii      2009  Sports  Nintendo      15.8    11.0      3.28
## 5     5 Pokemon Red… GB       1996  Role-P… Nintendo      11.3     8.89    10.2 
## 6     6 Tetris       GB       1989  Puzzle  Nintendo      23.2     2.26     4.22
## # … with 2 more variables: Other_Sales <dbl>, Global_Sales <dbl>

The data set includes information about:

-Rank - The overall rank of the game.

-Name - The name of the game.

-Platform - The platform on which the game was released.

-Year - The year in which the game was released.

-Genre - The genre of the game.

-Publisher - The publisher of the game.

-NA_Sales, EU_Sales, JP_Sales, Other_Sales, Global_Sales - Sales of the game in North America, Europe, Japan, other regions, and globally (in millions).

The mean of a set of observations is the arithmetic average of the values. The code below finds the mean of Global sales (in millions) across all games in the data set, ignoring missing values.

mean(df$Global_Sales,na.rm = TRUE)
## [1] 0.5374407
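
The call above returns a single average over all games. If a separate mean for each release year is wanted, a minimal sketch could look like the following. The `toy` data frame and its values are made up for illustration and only stand in for `df`; the column names `Year` and `Global_Sales` are taken from the data set.

```r
# Toy stand-in for df; the values are invented for illustration only.
toy <- data.frame(
  Year = c("2006", "2006", "2008", "2008", "2009"),
  Global_Sales = c(1.0, 3.0, 2.0, 4.0, 5.0)
)

# Mean global sales (in millions) for each release year.
# aggregate() with a formula drops rows with missing values by default.
yearly_mean <- aggregate(Global_Sales ~ Year, data = toy, FUN = mean)

print(yearly_mean)
```

With the real data, replacing `toy` by `df` should give one row per value of `Year`.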

The standard deviation measures the amount of variation or dispersion in a set of values. A low standard deviation indicates that the values tend to be close to the mean of the set, while a high standard deviation indicates that the values are spread out over a wider range. The code below finds the standard deviation of Global sales.

sd(df$Global_Sales,na.rm = TRUE)
## [1] 1.555028

Maximum value

max(df$Global_Sales,na.rm = TRUE)
## [1] 82.74

Minimum value

min(df$Global_Sales,na.rm = TRUE)
## [1] 0.01

Median

median(df$Global_Sales,na.rm = TRUE)
## [1] 0.17

1st Quartile

quantile(df$Global_Sales,.25,na.rm = TRUE)
##  25% 
## 0.06

3rd Quartile

quantile (df$Global_Sales,.75,na.rm = TRUE)
##  75% 
## 0.47
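
The minimum, quartiles, median, mean, and maximum computed one at a time above can also be obtained in a single call. The sketch below uses a made-up vector `sales` in place of `df$Global_Sales`; the values are illustrative only.

```r
# Illustrative values only; in the report this would be df$Global_Sales.
sales <- c(0.1, 0.2, 0.4, 0.8, 1.6, NA)

# summary() reports min, quartiles, median, mean, max, and the NA count.
print(summary(sales))

# quantile() accepts several probabilities at once.
q <- quantile(sales, probs = c(0.25, 0.5, 0.75), na.rm = TRUE)
print(q)

# Interquartile range: the distance between the 3rd and 1st quartiles.
iqr <- IQR(sales, na.rm = TRUE)
print(iqr)
```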

A histogram is a diagram consisting of rectangles whose area is proportional to the frequency of a variable and whose width is equal to the class interval.

Histogram for Global Sales:-

Global_Sales<-df$Global_Sales
hist(Global_Sales,main= "Histogram of Global Sales", xlab = "Global Sales in Millions")

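In a right-skewed (also called "positively skewed") distribution, most of the data falls on the left side of the graph, near the peak, and a long tail extends to the right toward large values. In the histogram above, the vast majority of games fall in the lowest sales bins while a few best sellers stretch the tail far to the right, so the distribution is right-skewed. The mean (0.54) lying well above the median (0.17) is consistent with this.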
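
A quick numerical check of the direction of skew is to compare the mean with the median and to compute a moment-based skewness coefficient. The sketch below is only an illustration on a made-up vector `sales`; it is not part of the original analysis.

```r
# Made-up sales figures with one large value creating a long right tail.
sales <- c(0.1, 0.1, 0.2, 0.2, 0.3, 0.5, 1.0, 8.0)

m   <- mean(sales)
med <- median(sales)

# Moment-based skewness: third central moment divided by the
# second central moment raised to the power 3/2.
skew <- mean((sales - m)^3) / mean((sales - m)^2)^(3 / 2)

# For a right-skewed sample, the mean exceeds the median and skew is positive.
print(c(mean = m, median = med, skewness = skew))
```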

boxplot(df$Global_Sales)

ggplot(df,aes(sample = Global_Sales))+ geom_qq()+geom_qq_line()

An outlier is an observation that lies an abnormal distance from the other values in a sample. In the Q-Q plot above, one point above 80 (the maximum, 82.74) lies far from all the other values and stands out as an outlier.
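
One common convention for flagging outliers numerically is the 1.5 × IQR rule, which is also what `boxplot()` uses to draw individual points. The sketch below applies it to a made-up vector `sales`; with data as skewed as Global sales, this rule would likely flag many games, not only the single extreme point.

```r
# Made-up values; the last one is far from the rest.
sales <- c(0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 50)

q   <- quantile(sales, probs = c(0.25, 0.75), na.rm = TRUE)
iqr <- q[[2]] - q[[1]]

# Fences at 1.5 * IQR below the 1st and above the 3rd quartile.
lower_fence <- q[[1]] - 1.5 * iqr
upper_fence <- q[[2]] + 1.5 * iqr

# Values outside the fences are flagged as potential outliers.
outliers <- sales[!is.na(sales) & (sales < lower_fence | sales > upper_fence)]
print(outliers)
```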

Scatter plots of Global sales against NA, EU, JP, and Other sales, showing how each region's sales relate to Global sales:-

par(mfrow=c(1,4))
with(df, plot(NA_Sales, Global_Sales))
with(df, plot(EU_Sales, Global_Sales))
with(df, plot(JP_Sales, Global_Sales))
with(df, plot(Other_Sales, Global_Sales))
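
The plots show the relationships visually; the strength of each linear relationship can be put into numbers with `cor()`. The sketch below uses a made-up data frame `toy` with the same column names as `df`, and builds `Global_Sales` as the sum of the regional columns purely for illustration.

```r
# Toy stand-in for the sales columns of df; values are invented.
toy <- data.frame(
  NA_Sales    = c(1.0, 2.0, 3.0, 4.0, 5.0),
  EU_Sales    = c(0.5, 1.5, 1.0, 3.0, 2.5),
  JP_Sales    = c(2.0, 0.1, 1.0, 0.3, 0.2),
  Other_Sales = c(0.1, 0.2, 0.3, 0.4, 0.5)
)
toy$Global_Sales <- rowSums(toy)

# Pearson correlation matrix; rows with missing values are dropped.
cors <- cor(toy, use = "complete.obs")

# Correlation of each column with Global_Sales.
print(round(cors[, "Global_Sales"], 2))
```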

  1. Frequency and relative frequency table for Platform:-
df %>%
  group_by(Platform)%>%
  summarise(Frequency = n())%>%
  mutate(Proportion = Frequency /sum(Frequency))
## # A tibble: 31 × 3
##    Platform Frequency Proportion
##    <chr>        <int>      <dbl>
##  1 2600           133  0.00801  
##  2 3DO              3  0.000181 
##  3 3DS            509  0.0307   
##  4 DC              52  0.00313  
##  5 DS            2163  0.130    
##  6 GB              98  0.00590  
##  7 GBA            822  0.0495   
##  8 GC             556  0.0335   
##  9 GEN             27  0.00163  
## 10 GG               1  0.0000602
## # … with 21 more rows
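
The table above is ordered alphabetically by platform. To see the most common platforms first, the counts can be sorted. The sketch below does this in base R on a made-up vector `platform`; it is an alternative illustration, not the code used in this report.

```r
# Made-up platform labels standing in for df$Platform.
platform <- c("Wii", "DS", "DS", "GB", "DS", "Wii")

# Counts per platform, largest first, and the matching proportions.
freq <- sort(table(platform), decreasing = TRUE)
prop <- prop.table(freq)

freq_table <- data.frame(
  Platform   = names(freq),
  Frequency  = as.vector(freq),
  Proportion = as.vector(prop)
)
print(freq_table)
```
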
  1. Two-way table for two categorical variables (Publisher and Genre):-
mytable <- xtabs(~Publisher+Genre, data=df)
head(mytable)
##                               Genre
## Publisher                      Action Adventure Fighting Misc Platform Puzzle
##   10TACLE Studios                   0         1        0    0        0      1
##   1C Company                        0         0        0    0        0      0
##   20th Century Fox Video Games      4         0        0    0        0      0
##   2D Boy                            0         0        0    0        0      1
##   3DO                              17         3        1    0        1      1
##   49Games                           0         0        0    0        0      0
##                               Genre
## Publisher                      Racing Role-Playing Shooter Simulation Sports
##   10TACLE Studios                   0            0       0          0      0
##   1C Company                        1            1       0          0      0
##   20th Century Fox Video Games      0            0       1          0      0
##   2D Boy                            0            0       0          0      0
##   3DO                               0            1       5          0      6
##   49Games                           0            0       0          0      1
##                               Genre
## Publisher                      Strategy
##   10TACLE Studios                     1
##   1C Company                          1
##   20th Century Fox Video Games        0
##   2D Boy                              0
##   3DO                                 1
##   49Games                             0

The table shows how many games of each genre each publisher has released. Some publishers make only one type of game, whereas others make all kinds of games. Among the six publishers shown, the largest count is 17 action games from 3DO. The table shows how the two variables relate, but it does not show how choosing a genre affects sales or which genre is the most profitable to make.
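
To check how many different genres each publisher covers without reading the whole table, the two-way table can be summarized by row and by column. The sketch below uses a made-up data frame `toy` with hypothetical publishers "A", "B", and "C".

```r
# Toy stand-in for df; publishers and genres are invented.
toy <- data.frame(
  Publisher = c("A", "A", "A", "B", "B", "C"),
  Genre     = c("Action", "Puzzle", "Action", "Sports", "Sports", "Racing")
)

tab <- xtabs(~ Publisher + Genre, data = toy)

# Number of distinct genres each publisher has released at least one game in.
genres_per_publisher <- rowSums(tab > 0)
print(genres_per_publisher)

# Total number of games in each genre across all publishers.
games_per_genre <- colSums(tab)
print(games_per_genre)
```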

ggplot (data = df, aes(y=EU_Sales , color=Genre))+geom_boxplot()

  1. A scatter plot is a graph in which the values of two variables are plotted along two axes; the pattern of the resulting points reveals any correlation present.

Scatter Plot for Global Sales and North America Sales:-

plot(jitter(df$NA_Sales), jitter(df$Global_Sales), xlab="North America Sales", ylab="Global sales", main="North America sales vs. global sales", col=c("red", "blue"))

-Importance of Clean and Organized Data:

-Cleaning and organizing data is a critical part of data mining, because the integrity of the data determines whether we have high-quality data to base decisions on. Some ways to clean and organize data are:

a. Organize the data by its attributes.

b. Create uniform data standards at the point of data entry.

c. Implement periodic checks on the data cleaning process, as the situation requires.

d. Apply the same methods consistently, since consistency is key to avoiding dirty data.
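
As one concrete example of such a cleaning step: the `head(df)` output shows that `Year` was read as text (`<chr>`), which usually means that some entries are not plain numbers. The sketch below assumes, without having checked the file, that missing years are written as the string "N/A"; the vector `raw_year` is made up for illustration.

```r
# Made-up raw values; "N/A" as the missing-value marker is an assumption.
raw_year <- c("2006", "1985", "N/A", " 2008 ", NA)

# Trim stray whitespace, then turn placeholder strings into real NAs.
clean_year <- trimws(raw_year)
clean_year[clean_year %in% c("N/A", "")] <- NA

# Convert to integer so the column can be sorted and summarized numerically.
year <- as.integer(clean_year)
print(year)
```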

-Summary:

-This report summarizes video game sales by game, platform, publisher, genre, and region. The results could help inform decisions aimed at improving both sales and customer satisfaction. Some interesting findings: the data has limitations, since it counts only game sales, whereas a large share of game revenue comes from microtransactions. The data also did not let us show the rise of PC games, so the analysis focuses more on consoles.

-Future Work:

-Future work could explore this data set in more depth, specifically by platform, by individual game, and by year: which platform led in sales in each year, and which publisher was the leader. I used only this data set for all of my analysis; however, one could also retrieve data from other sources, such as EA Sports.