I decided to analyse data on video game sales. I took the dataset from Kaggle (https://www.kaggle.com/datasets/gregorut/videogamesales). Beforehand, I edited the data slightly in Excel and deleted the columns of data I am not interested in. The analysis covers video games in the Role-Playing and Sports genres from the publishers Nintendo and Atari.
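The same selection could also be done in R rather than in Excel, which would keep the step reproducible. The following is only a sketch on a small made-up data frame: the rows are hypothetical, and the column names `Genre` and `Publisher` are assumed to match the names used later in this report.

```r
# Hypothetical toy rows standing in for the full Kaggle file
games <- data.frame(
  Name      = c("Game A", "Game B", "Game C", "Game D"),
  Genre     = c("Sports", "Role-Playing", "Shooter", "Sports"),
  Publisher = c("Nintendo", "Atari", "Atari", "Sega"),
  stringsAsFactors = FALSE
)

# Keep only the two genres and the two publishers used in this analysis
selected <- subset(
  games,
  Genre %in% c("Role-Playing", "Sports") &
    Publisher %in% c("Nintendo", "Atari")
)
```

Only the rows that satisfy both conditions remain in `selected`.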
data(package = .packages(all.available = TRUE))
library(readxl)
mydata <- read_excel("~/Desktop/videogames_sales.xlsx")
colnames(mydata) <- c("Rank", "Name", "Platform", "Year", "Genre", "Publisher", "NAsales", "EUsales", "JPsales", "OtherSales", "GlobalSales")
head(mydata, 15)
## # A tibble: 15 × 11
## Rank Name Platform Year Genre Publisher NAsales EUsales JPsales
## <dbl> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 1 Activ… Wii 2008 Spor… Atari 0.79 0.44 0.19
## 2 2 NFL 2… PS2 2002 Spor… Atari 1.06 0.08 0
## 3 3 .hack… PS2 2002 Role… Atari 0.49 0.38 0.26
## 4 4 Drago… GBA 2003 Role… Atari 0.78 0.29 0
## 5 5 We Sk… Wii 2008 Spor… Atari 0.38 0.29 0.15
## 6 6 Unlim… PS2 2002 Role… Atari 0.1 0.08 0.56
## 7 7 Tales… X360 2008 Role… Atari 0.32 0.18 0.19
## 8 8 .hack… PS2 2002 Role… Atari 0.23 0.18 0.2
## 9 9 Backy… PS2 2003 Spor… Atari 0.29 0.22 0
## 10 10 RealS… 2600 1982 Spor… Atari 0.46 0.03 0
## 11 11 Backy… PS2 2004 Spor… Atari 0.24 0.19 0
## 12 12 .hack… PS2 2002 Role… Atari 0.14 0.11 0.17
## 13 13 Etern… PS3 2008 Role… Atari 0.19 0.13 0.07
## 14 14 Backy… GBA 2002 Spor… Atari 0.31 0.11 0
## 15 15 My Ho… DS 2007 Spor… Atari 0.33 0 0
## # ℹ 2 more variables: OtherSales <dbl>, GlobalSales <dbl>
Definitions of all variables:
Before analysing the data, I checked for outliers and removed the most extreme ones to simplify the analysis, converted the categorical variables to factors, and computed some descriptive statistics.
boxplot(mydata$NAsales)
boxplot(mydata$JPsales)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
# Keep only games with fewer than 7 million sales in North America and in Japan
mydata <- mydata %>%
  filter(NAsales < 7.00, JPsales < 7.00)
mydata$Genre <- factor(mydata$Genre,
levels = c("Sports", "Role-Playing"),
labels = c("Sports", "Role-Playing"))
mydata$Publisher <- factor(mydata$Publisher,
levels = c("Nintendo", "Atari"),
labels = c("Nintendo", "Atari"))
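To see how a fixed cutoff compares with the points a boxplot draws as outliers, here is a minimal sketch on made-up numbers. The vector `sales` is hypothetical and is not taken from the dataset; `boxplot.stats()` applies the usual 1.5 × IQR rule that `boxplot()` uses by default.

```r
# Hypothetical sales figures in millions (not the real data)
sales <- c(0.1, 0.2, 0.2, 0.3, 0.4, 0.5, 0.8, 1.1, 7.5, 9.2)

# Points lying more than 1.5 * IQR beyond the hinges,
# i.e. the ones a default boxplot would draw individually
flagged <- boxplot.stats(sales)$out

# Fixed cutoff in the style used in this analysis: keep values below 7
kept <- sales[sales < 7]
```

With strongly right-skewed data the boxplot rule usually flags many more points than a fixed cutoff does, which is one reason to remove only the most extreme values.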
library(psych)
# describe() gives overall statistics; describeBy() is only needed with a grouping variable
describe(mydata[ , c(-1, -2, -3, -4, -5, -6)])
## vars n mean sd median trimmed mad min max range
## NAsales 1 239 0.51 0.99 0.19 0.29 0.28 0.00 6.42 6.42
## EUsales 2 239 0.26 0.69 0.06 0.11 0.09 0.00 5.04 5.04
## JPsales 3 239 0.46 0.94 0.13 0.22 0.19 0.00 6.04 6.04
## OtherSales 4 239 0.06 0.15 0.02 0.03 0.03 0.00 1.37 1.37
## GlobalSales 5 239 1.29 2.62 0.45 0.71 0.53 0.01 18.36 18.35
## skew kurtosis se
## NAsales 3.98 17.47 0.06
## EUsales 4.77 24.19 0.04
## JPsales 3.53 13.98 0.06
## OtherSales 5.10 30.69 0.01
## GlobalSales 4.23 19.51 0.17
The descriptive statistics show that sales in all observed regions have a large positive skew (3.53 to 5.10): the distributions are skewed to the right and are not normally distributed. Sales are measured in millions of copies. Leaving aside global sales, which are just the sum of the regional figures, average sales per game were highest in North America (0.51 million) and lowest in the Other regions (0.06 million). The largest single-game figure was also recorded in North America (6.42 million), while the smallest maximum was in the Other regions (1.37 million). Half of the games sold up to 0.19 million (190,000) copies in North America, while the other half sold more than that.
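A right skew shows up as a mean that sits well above the median, as with NAsales (0.51 versus 0.19). The sketch below illustrates this on a made-up vector. The numbers are hypothetical, and the skewness formula is one common moment-based definition, not necessarily the exact variant that `psych` reports.

```r
# Hypothetical right-skewed sales figures in millions
x <- c(0.1, 0.2, 0.2, 0.3, 0.4, 0.5, 0.8, 1.1, 3.0, 6.0)

mean_x   <- mean(x)    # pulled upwards by the few large values
median_x <- median(x)  # unaffected by how large the top values are

# Moment-based skewness: third central moment divided by sd^3
skew_x <- mean((x - mean_x)^3) / sd(x)^3
```

A positive `skew_x` together with `mean_x > median_x` is the same pattern seen in every sales column of the table above.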
str(mydata)
## tibble [239 × 11] (S3: tbl_df/tbl/data.frame)
## $ Rank : num [1:239] 1 2 3 4 5 6 7 8 9 10 ...
## $ Name : chr [1:239] "Active Life: Outdoor Challenge" "NFL 2K3" ".hack//Infection Part 1" "Dragon Ball Z: The Legacy of Goku II" ...
## $ Platform : chr [1:239] "Wii" "PS2" "PS2" "GBA" ...
## $ Year : chr [1:239] "2008" "2002" "2002" "2003" ...
## $ Genre : Factor w/ 2 levels "Sports","Role-Playing": 1 1 2 2 1 2 2 2 1 1 ...
## $ Publisher : Factor w/ 2 levels "Nintendo","Atari": 2 2 2 2 2 2 2 2 2 2 ...
## $ NAsales : num [1:239] 0.79 1.06 0.49 0.78 0.38 0.1 0.32 0.23 0.29 0.46 ...
## $ EUsales : num [1:239] 0.44 0.08 0.38 0.29 0.29 0.08 0.18 0.18 0.22 0.03 ...
## $ JPsales : num [1:239] 0.19 0 0.26 0 0.15 0.56 0.19 0.2 0 0 ...
## $ OtherSales : num [1:239] 0.14 0.18 0.13 0.02 0.08 0.03 0.05 0.06 0.07 0.01 ...
## $ GlobalSales: num [1:239] 1.55 1.32 1.27 1.09 0.9 0.77 0.75 0.68 0.59 0.5 ...
First I checked whether there is a linear relationship between the two chosen variables, North America and Japan sales. From the scatter plot I can conclude that there is a weak linear relationship between them, but there are many outliers. I therefore use both the Pearson and the Spearman correlation coefficient as a robustness check.
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:psych':
##
## logit
## The following object is masked from 'package:dplyr':
##
## recode
scatterplotMatrix(mydata[ , c("NAsales", "JPsales")], smooth = FALSE)
scatterplot(mydata$NAsales, mydata$JPsales,
smooth = FALSE,
xlim = c(0.00, 6.42),
ylim = c(0.00, 6.04),
main = "Relationship between North America and Japan sales of video games (in millions)",
xlab = "North America sales",
ylab = "Japan sales")
library(Hmisc)
##
## Attaching package: 'Hmisc'
## The following object is masked from 'package:psych':
##
## describe
## The following objects are masked from 'package:dplyr':
##
## src, summarize
## The following objects are masked from 'package:base':
##
## format.pval, units
rcorr(as.matrix(mydata[ , c(-1, -2, -3, -4, -5, -6)]),
type = "pearson")
## NAsales EUsales JPsales OtherSales GlobalSales
## NAsales 1.00 0.95 0.81 0.88 0.97
## EUsales 0.95 1.00 0.81 0.89 0.96
## JPsales 0.81 0.81 1.00 0.76 0.92
## OtherSales 0.88 0.89 0.76 1.00 0.90
## GlobalSales 0.97 0.96 0.92 0.90 1.00
##
## n= 239
##
##
## P
## NAsales EUsales JPsales OtherSales GlobalSales
## NAsales 0 0 0 0
## EUsales 0 0 0 0
## JPsales 0 0 0 0
## OtherSales 0 0 0 0
## GlobalSales 0 0 0 0
cor(mydata$NAsales, mydata$JPsales,
method = "pearson",
use = "complete.obs")
## [1] 0.8124194
rcorr(as.matrix(mydata[ , c(-1, -2, -3, -4, -5, -6)]),
type = "spearman")
## NAsales EUsales JPsales OtherSales GlobalSales
## NAsales 1.00 0.74 0.46 0.78 0.82
## EUsales 0.74 1.00 0.51 0.78 0.75
## JPsales 0.46 0.51 1.00 0.56 0.83
## OtherSales 0.78 0.78 0.56 1.00 0.78
## GlobalSales 0.82 0.75 0.83 0.78 1.00
##
## n= 239
##
##
## P
## NAsales EUsales JPsales OtherSales GlobalSales
## NAsales 0 0 0 0
## EUsales 0 0 0 0
## JPsales 0 0 0 0
## OtherSales 0 0 0 0
## GlobalSales 0 0 0 0
Based on the Pearson correlation coefficient, the linear relationship between North America and Japan sales is strong and positive (0.812). Among all pairs, the strongest relationship is between North America and Global sales (0.97, very strong) and the weakest is between Japan and Other sales (0.76, still fairly strong). Based on the Spearman correlation coefficient, the monotonic relationship between North America and Japan sales is moderate and positive (0.46). Here the strongest relationship is between Japan and Global sales (0.83, very strong) and the weakest is between Japan and North America sales (0.46).
The two coefficients differ considerably (0.81 versus 0.46 for North America and Japan) because the relationship is not really linear and the data contain many outliers. In this case it is better to rely on the Spearman values, as they are rank-based and therefore less sensitive to extreme observations.
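The gap between the two coefficients can be reproduced on a small made-up example in which a single extreme pair dominates the Pearson coefficient. The vectors `x` and `y` are hypothetical and only serve to illustrate the effect.

```r
# Nine loosely related pairs plus one extreme pair (50, 60)
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 50)
y <- c(5, 3, 6, 2, 7, 1, 8, 4, 9, 60)

# Pearson uses the raw values, so the extreme pair drives the result
r_pearson <- cor(x, y, method = "pearson")

# Spearman uses ranks, so the extreme pair counts as just "the largest"
r_spearman <- cor(x, y, method = "spearman")
```

Here `r_pearson` is far higher than `r_spearman`, the same direction of difference as in the sales data.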
To check whether the correlation between North America and Japan sales is statistically significant, I test hypotheses about the population correlation coefficient rho, which are the same for both tests:
H0: rho = 0
H1: rho is not equal to 0
Both P matrices above show a p-value of 0 after rounding, so I reject H0 at p-value < 0.001.
I can conclude that there is a statistically significant correlation between North America and Japan sales.
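For the Pearson coefficient, the significance claim can be checked by hand from the two numbers reported above, r = 0.8124194 and n = 239. This is a sketch of the standard t-test for a correlation coefficient, not output from the original analysis.

```r
r <- 0.8124194  # Pearson correlation between NAsales and JPsales reported above
n <- 239        # number of games after removing outliers

# Test statistic for H0: population correlation equals 0
t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)

# Two-sided p-value from the t distribution with n - 2 degrees of freedom
p_value <- 2 * pt(abs(t_stat), df = n - 2, lower.tail = FALSE)
```

The resulting `p_value` is far below 0.001, consistent with rejecting H0.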
# cor.test() has no `use` argument; it drops incomplete pairs on its own
cor.test(mydata$NAsales, mydata$JPsales,
         method = "spearman",
         exact = FALSE)
##
## Spearman's rank correlation rho
##
## data: mydata$NAsales and mydata$JPsales
## S = 1226953, p-value = 5.804e-14
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## 0.4607462
Based on the sample data and the Spearman correlation coefficient (rho = 0.46), I can reject H0 at p-value < 0.001 and conclude that North America and Japan sales are correlated.
With this research question I wanted to check whether the genre of a video game is related to the company that publishes it, that is, whether the distribution of genres differs by publisher.
For the test to be valid, certain assumptions have to be met:
- Observations are independent: all video games in the selected sample are independent of each other.
- All expected frequencies are greater than 5: the expected frequencies shown below range from 37.61 to 85.61.
- If the contingency table is larger than 2x2, at most 20% of expected frequencies may be between 1 and 5: this contingency table is 2x2, so the condition does not apply and the statistical power of the test is not reduced.
chi_results <- chisq.test(mydata$Genre, mydata$Publisher,
correct = TRUE)
print(chi_results)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: mydata$Genre and mydata$Publisher
## X-squared = 23.769, df = 1, p-value = 1.086e-06
Based on the sample data, I can reject H0 (no association between genre and publisher) at p-value < 0.001 and claim that genre and publisher are associated.
Then I checked how the observed frequencies differ from the expected frequencies.
addmargins(chi_results$observed)
## mydata$Publisher
## mydata$Genre Nintendo Atari Sum
## Sports 51 56 107
## Role-Playing 104 28 132
## Sum 155 84 239
addmargins(chi_results$expected, 2)
## mydata$Publisher
## mydata$Genre Nintendo Atari Sum
## Sports 69.39331 37.60669 107
## Role-Playing 85.60669 46.39331 132
The observed frequencies differ from the expected ones. If publisher and genre of a video game were not associated, I would expect:
- 37.61 of Atari’s video games to be in the Sports genre
- 69.39 of Nintendo’s video games to be in the Sports genre
- 46.39 of Atari’s video games to be in the Role-Playing genre
- 85.61 of Nintendo’s video games to be in the Role-Playing genre
To check the direction of the relationship and whether the differences between observed and expected frequencies are statistically significant, I also checked the residuals. Note that chi_results$residuals returns Pearson residuals, (observed - expected) / sqrt(expected); the standardized residuals are stored in chi_results$stdres.
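The expected frequencies above follow from the rule row total × column total / n. A minimal sketch that rebuilds them, and the Yates-corrected test statistic, from the observed table printed earlier (the object names are chosen here for illustration):

```r
# Observed counts copied from the contingency table above
observed <- matrix(
  c(51, 104, 56, 28), nrow = 2,
  dimnames = list(Genre = c("Sports", "Role-Playing"),
                  Publisher = c("Nintendo", "Atari"))
)

n <- sum(observed)

# Expected count in each cell under independence
expected <- outer(rowSums(observed), colSums(observed)) / n

# Chi-squared statistic with Yates' continuity correction (2x2 table)
chi_sq <- sum((abs(observed - expected) - 0.5)^2 / expected)
```

Both `expected` and `chi_sq` should agree with the values reported by `chisq.test()` above.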
round(chi_results$residuals, 2)
## mydata$Publisher
## mydata$Genre Nintendo Atari
## Sports -2.21 3.00
## Role-Playing 1.99 -2.70
Comparing the residuals with the critical values ±1.96 (alpha = 5 %) and ±2.58 (alpha = 1 %):
- There are fewer Sports games published by Nintendo than expected (-2.21, significant at alpha = 5 %).
- There are more Sports games published by Atari than expected (3.00, significant at alpha = 1 %).
- There are more Role-Playing games published by Nintendo than expected (1.99, significant at alpha = 5 %).
- There are fewer Role-Playing games published by Atari than expected (-2.70, significant at alpha = 1 %).
Then I calculated relative frequencies with proportion tables to make the comparisons and the interpretation of the association between genre and publisher easier.
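The residuals printed above can be recomputed directly from the observed and expected counts. The sketch below also extracts `stdres`, which additionally adjusts for the row and column margins; the object names are illustrative.

```r
# Observed counts copied from the contingency table above
observed <- matrix(
  c(51, 104, 56, 28), nrow = 2,
  dimnames = list(Genre = c("Sports", "Role-Playing"),
                  Publisher = c("Nintendo", "Atari"))
)

test <- chisq.test(observed)

# Pearson residuals: (observed - expected) / sqrt(expected),
# the same quantity that is stored in test$residuals
pearson_res <- (observed - test$expected) / sqrt(test$expected)

# Standardized residuals, which also account for the margins
std_res <- test$stdres
```

In a 2x2 table the standardized residuals are larger in absolute value than the Pearson residuals, so conclusions drawn from the Pearson residuals are the more conservative ones.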
addmargins(round(prop.table(chi_results$observed), 3))
## mydata$Publisher
## mydata$Genre Nintendo Atari Sum
## Sports 0.213 0.234 0.447
## Role-Playing 0.435 0.117 0.552
## Sum 0.648 0.351 0.999
Out of the 239 observed video games, 43.5 % are Role-Playing games published by Nintendo and 23.4 % are Sports games published by Atari.
addmargins(round(prop.table(chi_results$observed, 1), 3), 2)
## mydata$Publisher
## mydata$Genre Nintendo Atari Sum
## Sports 0.477 0.523 1.000
## Role-Playing 0.788 0.212 1.000
Out of all observed video games classified in the Role-Playing genre, only 21.2 % were published by Atari. Out of all observed video games classified in the Sports genre, 47.7 % were published by Nintendo.
addmargins(round(prop.table(chi_results$observed, 2), 3), 1)
## mydata$Publisher
## mydata$Genre Nintendo Atari
## Sports 0.329 0.667
## Role-Playing 0.671 0.333
## Sum 1.000 1.000
Out of all observed video games published by Atari, 66.7 % are in the Sports genre. Out of all observed video games published by Nintendo, 67.1 % are in the Role-Playing genre.
Finally, I decided to use both Cramer’s V statistic and the odds ratio to determine the effect size in my 2x2 contingency table.
library(effectsize)
##
## Attaching package: 'effectsize'
## The following object is masked from 'package:psych':
##
## phi
effectsize::cramers_v(mydata$Genre, mydata$Publisher)
## Cramer's V (adj.) | 95% CI
## --------------------------------
## 0.32 | [0.21, 1.00]
##
## - One-sided CIs: upper bound fixed at [1.00].
interpret_cramers_v(0.32)
## [1] "large"
## (Rules: funder2019)
Based on the analysis above I can conclude that there is a statistically significant association between genre and publisher of video games (p-value < 0.001). Cramer’s V of 0.32 is classified as large under the funder2019 rules, meaning that a given publisher (Nintendo or Atari) is more or less likely than the other to publish a video game in a certain genre (Sports or Role-Playing): in this sample Nintendo leans towards Role-Playing games and Atari towards Sports games.
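The odds ratio mentioned earlier is not shown in the output above. As a sketch, it can be computed by hand from the observed counts, together with the unadjusted Cramer's V. The object names are illustrative, and the unadjusted V can differ slightly from the bias-adjusted value that `effectsize::cramers_v()` reports.

```r
# Observed counts copied from the contingency table above
observed <- matrix(
  c(51, 104, 56, 28), nrow = 2,
  dimnames = list(Genre = c("Sports", "Role-Playing"),
                  Publisher = c("Nintendo", "Atari"))
)

# Odds of a game being from Nintendo rather than Atari, within each genre
odds_sports <- observed["Sports", "Nintendo"] / observed["Sports", "Atari"]
odds_rpg    <- observed["Role-Playing", "Nintendo"] / observed["Role-Playing", "Atari"]

# Odds ratio: Sports relative to Role-Playing
odds_ratio <- odds_sports / odds_rpg

# Unadjusted Cramer's V from the uncorrected chi-squared statistic
chi_sq_raw <- unname(chisq.test(observed, correct = FALSE)$statistic)
cramers_v  <- sqrt(chi_sq_raw / (sum(observed) * (min(dim(observed)) - 1)))
```

An odds ratio of about 0.25 means that the odds of a Sports game being from Nintendo rather than Atari are roughly a quarter of the corresponding odds for a Role-Playing game.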