Homework 2

Sara Bračun Duhovnik

About the data:

I decided to analyse data on video game sales. I took the dataset for my analysis from Kaggle.com (https://www.kaggle.com/datasets/gregorut/videogamesales). I modified the data slightly in Excel beforehand and deleted the columns I am not interested in. I decided to analyse video games from the genres Role-Playing and Sports, published by Nintendo and Atari.

data(package = .packages(all.available = TRUE))
library(readxl)
mydata <- read_excel("~/Desktop/videogames_sales.xlsx")
colnames(mydata) <- c("Rank", "Name", "Platform", "Year", "Genre", "Publisher", "NAsales", "EUsales", "JPsales", "OtherSales", "GlobalSales")
head(mydata, 15)
## # A tibble: 15 × 11
##     Rank Name   Platform Year  Genre Publisher NAsales EUsales JPsales
##    <dbl> <chr>  <chr>    <chr> <chr> <chr>       <dbl>   <dbl>   <dbl>
##  1     1 Activ… Wii      2008  Spor… Atari        0.79    0.44    0.19
##  2     2 NFL 2… PS2      2002  Spor… Atari        1.06    0.08    0   
##  3     3 .hack… PS2      2002  Role… Atari        0.49    0.38    0.26
##  4     4 Drago… GBA      2003  Role… Atari        0.78    0.29    0   
##  5     5 We Sk… Wii      2008  Spor… Atari        0.38    0.29    0.15
##  6     6 Unlim… PS2      2002  Role… Atari        0.1     0.08    0.56
##  7     7 Tales… X360     2008  Role… Atari        0.32    0.18    0.19
##  8     8 .hack… PS2      2002  Role… Atari        0.23    0.18    0.2 
##  9     9 Backy… PS2      2003  Spor… Atari        0.29    0.22    0   
## 10    10 RealS… 2600     1982  Spor… Atari        0.46    0.03    0   
## 11    11 Backy… PS2      2004  Spor… Atari        0.24    0.19    0   
## 12    12 .hack… PS2      2002  Role… Atari        0.14    0.11    0.17
## 13    13 Etern… PS3      2008  Role… Atari        0.19    0.13    0.07
## 14    14 Backy… GBA      2002  Spor… Atari        0.31    0.11    0   
## 15    15 My Ho… DS       2007  Spor… Atari        0.33    0       0   
## # ℹ 2 more variables: OtherSales <dbl>, GlobalSales <dbl>

Description:

  • Unit of observation: one video game
  • Sample size: 245 observations

Definitions of all variables:

  • Name: name of a video game
  • Platform: platform of a video game release
  • Year: year of the game’s release
  • Genre: genre of the game
  • Publisher: publisher of the game
  • NAsales: sales in North America (in millions)
  • EUsales: sales in Europe (in millions)
  • JPsales: sales in Japan (in millions)
  • OtherSales: sales in the rest of the world (in millions)
  • GlobalSales: total worldwide sales (in millions)

Research question 1: Is there a correlation between North America sales and Japan sales?

- H0: There is no correlation between North America and Japan sales.
- H1: There is a correlation between North America and Japan sales.

Research question 2: Does the genre of the game vary depending on who publishes the game?

- H0: There is no association between video game genre and publisher.
- H1: There is an association between video game genre and publisher.

Data manipulation:

Before analyzing my data, I decided to check for outliers and remove some of them to simplify the analysis, convert the categorical variables to factors, and compute some descriptive statistics.

  • Unit of observation: one video game
  • Sample size after removing outliers: 239 observations
boxplot(mydata$NAsales)

boxplot(mydata$JPsales)

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
mydata <- mydata %>%
  filter(NAsales < 7.00, JPsales < 7.00)
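
As a rough illustration of how the boxplots flag outliers, the sketch below applies the same 1.5 x IQR rule with base R's boxplot.stats() to a small made-up vector (toy_sales is hypothetical and not part of my dataset). It also shows that the conditions !x >= 7 and x < 7 keep the same values, because R evaluates the comparison before the negation.

```r
# Hypothetical sales values (in millions); not taken from the dataset
toy_sales <- c(0.1, 0.2, 0.2, 0.3, 0.4, 0.5, 9.0)

# Values lying more than 1.5 times the hinge spread beyond the hinges,
# i.e. the points a default boxplot draws individually
flagged <- boxplot.stats(toy_sales)$out

# `!` binds more loosely than `>=`, so both conditions keep the same values
keep_negated <- !toy_sales >= 7.00
keep_direct  <- toy_sales < 7.00

# Values that survive the cut-off used in this report
kept <- toy_sales[keep_direct]
print(flagged)
print(kept)
```

The cut-off of 7.00 is my own choice for this dataset; the boxplot rule only suggests which points deserve a closer look.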

mydata$Genre <- factor(mydata$Genre,
        levels = c("Sports", "Role-Playing"),
        labels = c("Sports", "Role-Playing"))
mydata$Publisher <- factor(mydata$Publisher,
        levels = c("Nintendo", "Atari"),
        labels = c("Nintendo", "Atari"))
library(psych)
psych::describe(mydata[ , c("NAsales", "EUsales", "JPsales", "OtherSales", "GlobalSales")])
##             vars   n mean   sd median trimmed  mad  min   max range
## NAsales        1 239 0.51 0.99   0.19    0.29 0.28 0.00  6.42  6.42
## EUsales        2 239 0.26 0.69   0.06    0.11 0.09 0.00  5.04  5.04
## JPsales        3 239 0.46 0.94   0.13    0.22 0.19 0.00  6.04  6.04
## OtherSales     4 239 0.06 0.15   0.02    0.03 0.03 0.00  1.37  1.37
## GlobalSales    5 239 1.29 2.62   0.45    0.71 0.53 0.01 18.36 18.35
##             skew kurtosis   se
## NAsales     3.98    17.47 0.06
## EUsales     4.77    24.19 0.04
## JPsales     3.53    13.98 0.06
## OtherSales  5.10    30.69 0.01
## GlobalSales 4.23    19.51 0.17
Interpretation:

From the descriptive statistics I can see that sales in all observed regions have a large positive skew (between 3.53 and 5.10), meaning the distributions are skewed to the right and are not normally distributed. Excluding global sales, which are the sum of all regions, the highest average sales were recorded in North America (0.51 million) and the lowest in the rest of the world (0.06 million). Similarly, the highest maximum was recorded in North America (6.42 million) and the lowest in the rest of the world (1.37 million). Half of the games sold at most 0.19 million in North America, while the other half sold more than that.
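
To make the idea of a right skew more concrete, here is a minimal sketch on made-up numbers (toy_sales is hypothetical, not my dataset). It uses a simple moment-based skewness formula; psych may use a slightly different finite-sample variant, so the values are not meant to match the table above.

```r
# Hypothetical right-skewed sales values (in millions); not taken from the dataset
toy_sales <- c(0.1, 0.2, 0.2, 0.3, 0.4, 0.5, 9.0)

# In a right-skewed sample a few large values pull the mean above the median
mean_sales   <- mean(toy_sales)
median_sales <- median(toy_sales)

# Moment-based skewness: third central moment divided by the
# second central moment raised to the power 1.5
centered <- toy_sales - mean_sales
skewness <- mean(centered^3) / mean(centered^2)^1.5

print(c(mean = mean_sales, median = median_sales, skewness = skewness))
```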

str(mydata)
## tibble [239 × 11] (S3: tbl_df/tbl/data.frame)
##  $ Rank       : num [1:239] 1 2 3 4 5 6 7 8 9 10 ...
##  $ Name       : chr [1:239] "Active Life: Outdoor Challenge" "NFL 2K3" ".hack//Infection Part 1" "Dragon Ball Z: The Legacy of Goku II" ...
##  $ Platform   : chr [1:239] "Wii" "PS2" "PS2" "GBA" ...
##  $ Year       : chr [1:239] "2008" "2002" "2002" "2003" ...
##  $ Genre      : Factor w/ 2 levels "Sports","Role-Playing": 1 1 2 2 1 2 2 2 1 1 ...
##  $ Publisher  : Factor w/ 2 levels "Nintendo","Atari": 2 2 2 2 2 2 2 2 2 2 ...
##  $ NAsales    : num [1:239] 0.79 1.06 0.49 0.78 0.38 0.1 0.32 0.23 0.29 0.46 ...
##  $ EUsales    : num [1:239] 0.44 0.08 0.38 0.29 0.29 0.08 0.18 0.18 0.22 0.03 ...
##  $ JPsales    : num [1:239] 0.19 0 0.26 0 0.15 0.56 0.19 0.2 0 0 ...
##  $ OtherSales : num [1:239] 0.14 0.18 0.13 0.02 0.08 0.03 0.05 0.06 0.07 0.01 ...
##  $ GlobalSales: num [1:239] 1.55 1.32 1.27 1.09 0.9 0.77 0.75 0.68 0.59 0.5 ...

Research question 1:

First I decided to check whether there is a linear relationship between the two chosen variables. From the scatter plot I can conclude that there is a weak linear relationship between them, but there are many outliers. Therefore I decided to use both the Pearson and the Spearman correlation coefficient as a robustness check.

library(car)
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:psych':
## 
##     logit
## The following object is masked from 'package:dplyr':
## 
##     recode
scatterplotMatrix(mydata[ , c(-1, -2, -3, -4, -5, -6, -8, -10, -11)], smooth = FALSE)

scatterplot(mydata$NAsales, mydata$JPsales,
            smooth = FALSE,
            xlim = c(0.00, 6.42),
            ylim = c(0.00, 6.04),
            main = "Relationship between North America and Japan sales of video games (in millions)",
            xlab = "North America sales",
            ylab = "Japan sales")

library(Hmisc)
## 
## Attaching package: 'Hmisc'
## The following object is masked from 'package:psych':
## 
##     describe
## The following objects are masked from 'package:dplyr':
## 
##     src, summarize
## The following objects are masked from 'package:base':
## 
##     format.pval, units
rcorr(as.matrix(mydata[ , c(-1, -2, -3, -4, -5, -6)]),
      type = "pearson")
##             NAsales EUsales JPsales OtherSales GlobalSales
## NAsales        1.00    0.95    0.81       0.88        0.97
## EUsales        0.95    1.00    0.81       0.89        0.96
## JPsales        0.81    0.81    1.00       0.76        0.92
## OtherSales     0.88    0.89    0.76       1.00        0.90
## GlobalSales    0.97    0.96    0.92       0.90        1.00
## 
## n= 239 
## 
## 
## P
##             NAsales EUsales JPsales OtherSales GlobalSales
## NAsales              0       0       0          0         
## EUsales      0               0       0          0         
## JPsales      0       0               0          0         
## OtherSales   0       0       0                  0         
## GlobalSales  0       0       0       0
cor(mydata$NAsales, mydata$JPsales,
         method = "pearson",
         use = "complete.obs")
## [1] 0.8124194
rcorr(as.matrix(mydata[ , c(-1, -2, -3, -4, -5, -6)]),
      type = "spearman")
##             NAsales EUsales JPsales OtherSales GlobalSales
## NAsales        1.00    0.74    0.46       0.78        0.82
## EUsales        0.74    1.00    0.51       0.78        0.75
## JPsales        0.46    0.51    1.00       0.56        0.83
## OtherSales     0.78    0.78    0.56       1.00        0.78
## GlobalSales    0.82    0.75    0.83       0.78        1.00
## 
## n= 239 
## 
## 
## P
##             NAsales EUsales JPsales OtherSales GlobalSales
## NAsales              0       0       0          0         
## EUsales      0               0       0          0         
## JPsales      0       0               0          0         
## OtherSales   0       0       0                  0         
## GlobalSales  0       0       0       0
Interpretation:

Based on the Pearson correlation coefficient, the linear relationship between North America and Japan sales of video games is strong and positive (0.812). The strongest relationship is between North America and Global sales (0.97, very strong) and the weakest is between Japan and Other sales (0.76, semi-strong). Based on the Spearman correlation coefficient, the monotonic relationship between North America and Japan sales of video games is moderate and positive (0.46). The strongest relationship is between Japan and Global sales (0.83, very strong) and the weakest is between Japan and North America sales (0.46).

Based on the results above I can conclude that the two correlation coefficients differ considerably, because the relationship in my data is not really linear and there are many outliers. In this case it is better to rely on the Spearman values, as they are based on ranks and are therefore less sensitive to extreme observations.
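
The sensitivity of the Pearson coefficient to extreme observations can be sketched on made-up numbers. The vectors x and y below are hypothetical: nine points with almost no relationship plus one extreme point.

```r
# Hypothetical data: nine nearly unrelated points plus one extreme observation
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 50)
y <- c(5, 3, 8, 1, 9, 2, 7, 4, 6, 60)

# Pearson uses the raw values, Spearman uses their ranks
pearson_all  <- cor(x, y, method = "pearson")
spearman_all <- cor(x, y, method = "spearman")

# Dropping the extreme point shows how much of the Pearson value it was driving
pearson_trimmed <- cor(x[-10], y[-10], method = "pearson")

print(c(pearson_all = pearson_all,
        spearman_all = spearman_all,
        pearson_trimmed = pearson_trimmed))
```

In this toy example a single point is enough to push the Pearson coefficient close to 1, while the rank-based Spearman coefficient stays much lower.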

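To check whether the correlation between North America and Japan sales is statistically significant, I test the population correlation coefficient (Rho). The hypotheses and the result are the same for both the Pearson and the Spearman coefficient:

  • p-value < 0.001 (printed as 0 in the tables above because of rounding)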
  • H0: Rho = 0
  • H1: Rho ≠ 0

Reject H0 at p-value < 0.001

I can conclude that there is a statistically significant correlation between North America and Japan sales.
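
A p-value is never exactly 0; the correlation tables above appear to print it rounded. The small sketch below uses the p-value reported by cor.test() for the Spearman correlation in this report and shows how rounding turns it into 0, and how base R's format.pval() can report it as a threshold instead. The 0.001 threshold is my own choice.

```r
# p-value as printed by cor.test() for the Spearman correlation in this report
p_spearman <- 5.804e-14

# Rounding to a few decimals, as a printed table does, displays 0
p_rounded <- round(p_spearman, 4)

# format.pval() reports values below `eps` as a threshold instead of 0
p_label <- format.pval(p_spearman, eps = 0.001)

print(p_rounded)
print(p_label)
```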

cor.test(mydata$NAsales, mydata$JPsales,
         method = "spearman",
         exact = FALSE)
## 
##  Spearman's rank correlation rho
## 
## data:  mydata$NAsales and mydata$JPsales
## S = 1226953, p-value = 5.804e-14
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##       rho 
## 0.4607462
Conclusion:

Based on sample data and Spearman Correlation Coefficient, I can conclude that there is a correlation between North America and Japan sales. I can reject the H0 at p-value < 0.001.

Research question 2:

With this research question I wanted to check whether the selected genre of the video game is related to which company publishes the video game. I wanted to check if there are certain patterns in the distribution of genre based on the publisher.

  • Categorical variable 1: Genre (every video game published from my sample data is either Role-playing or Sports genre)
  • Categorical variable 2: Publisher (every video game published from my sample data is either published by Nintendo or Atari)

For the test to be valid, certain assumptions have to be met:

  • Observations are independent: all video games in the selected sample dataset are independent.
  • All expected frequencies are greater than 5: from the test performed below I can see that all expected frequencies are greater than 5.
  • If the contingency table is larger than 2x2, only 20% of expected frequencies can be between 1 and 5: this contingency table is 2x2, therefore the statistical power of the test will not be reduced.
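
The expected-frequency assumption can be checked by hand. The sketch below rebuilds the observed counts reported in the contingency table of this report (the object names are my own) and computes each expected count as row total times column total divided by the sample size.

```r
# Observed counts as reported in the contingency table of this report
observed <- matrix(c(51, 104, 56, 28), nrow = 2,
                   dimnames = list(Genre = c("Sports", "Role-Playing"),
                                   Publisher = c("Nintendo", "Atari")))

# Expected count under independence: row total * column total / sample size
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# The assumption holds when every expected count is greater than 5
all_above_five <- all(expected > 5)

print(round(expected, 2))
print(all_above_five)
```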

chi_results <- chisq.test(mydata$Genre, mydata$Publisher,
                          correct = TRUE)
print(chi_results)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  mydata$Genre and mydata$Publisher
## X-squared = 23.769, df = 1, p-value = 1.086e-06
Interpretation:

Based on the sample data, I can reject H0 at p-value < 0.001 and I can claim that there is an association between genre and publisher.

Then I checked whether there are differences between observed and expected frequencies.

addmargins(chi_results$observed)
##               mydata$Publisher
## mydata$Genre   Nintendo Atari Sum
##   Sports             51    56 107
##   Role-Playing      104    28 132
##   Sum               155    84 239
addmargins(chi_results$expected, 2)
##               mydata$Publisher
## mydata$Genre   Nintendo    Atari Sum
##   Sports       69.39331 37.60669 107
##   Role-Playing 85.60669 46.39331 132
Interpretation:

From the Chi-Square test we can see that the observed frequencies differ from the expected ones. If publisher and genre of a video game were not associated, I would expect:

  • 37.61 of Atari’s video games to be in the Sports genre
  • 69.39 of Nintendo’s video games to be in the Sports genre
  • 46.39 of Atari’s video games to be in the Role-Playing genre
  • 85.61 of Nintendo’s video games to be in the Role-Playing genre

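To check the direction of the relationship and whether the differences between observed and expected frequencies are statistically significant, I decided to also check the residuals. Note that chi_results$residuals returns Pearson residuals, (observed - expected) / sqrt(expected), not the standardized residuals stored in chi_results$stdres.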

round(chi_results$residuals, 2)
##               mydata$Publisher
## mydata$Genre   Nintendo Atari
##   Sports          -2.21  3.00
##   Role-Playing     1.99 -2.70
Interpretation:

There are fewer Sports games published by Nintendo than I would expect (residual -2.21, significant at alpha = 5%). There are more Sports games published by Atari than I would expect (residual 3.00, significant at alpha = 1%). There are more Role-Playing games published by Nintendo than I would expect (residual 1.99, significant at alpha = 5%). There are fewer Role-Playing games published by Atari than I would expect (residual -2.70, significant at alpha = 1%).
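
The comparison of residuals with critical values can be sketched as follows. The code rebuilds the observed counts from this report (object names are my own), recomputes the Pearson residuals, and compares them with the two-sided normal critical values for 5% and 1%. It also extracts the standardized residuals from $stdres, which additionally adjust for the row and column proportions and are therefore larger in absolute value.

```r
# Observed counts as reported in the contingency table of this report
observed <- matrix(c(51, 104, 56, 28), nrow = 2,
                   dimnames = list(Genre = c("Sports", "Role-Playing"),
                                   Publisher = c("Nintendo", "Atari")))

chi_sketch <- chisq.test(observed, correct = TRUE)

# Pearson residuals: (observed - expected) / sqrt(expected)
pearson_res <- round(chi_sketch$residuals, 2)

# Standardized residuals: also adjusted for row and column proportions
std_res <- chi_sketch$stdres

# Cells whose Pearson residual exceeds the two-sided 5% and 1% critical values
beyond_5pct <- abs(chi_sketch$residuals) > qnorm(0.975)
beyond_1pct <- abs(chi_sketch$residuals) > qnorm(0.995)

print(pearson_res)
print(round(std_res, 2))
print(beyond_5pct)
print(beyond_1pct)
```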

Then I decided to calculate relative frequencies with proportion tables, which make it easier to compare the groups and to interpret the association between genre and publisher.

addmargins(round(prop.table(chi_results$observed), 3))
##               mydata$Publisher
## mydata$Genre   Nintendo Atari   Sum
##   Sports          0.213 0.234 0.447
##   Role-Playing    0.435 0.117 0.552
##   Sum             0.648 0.351 0.999
Interpretation:

Out of the 239 observed video games, 43.5 % are classified in the Role-Playing genre and were published by Nintendo. Out of the 239 observed video games, 23.4 % are classified in the Sports genre and were published by Atari.

addmargins(round(prop.table(chi_results$observed, 1), 3), 2)
##               mydata$Publisher
## mydata$Genre   Nintendo Atari   Sum
##   Sports          0.477 0.523 1.000
##   Role-Playing    0.788 0.212 1.000
Interpretation:

Out of all observed video games classified in the Role-Playing genre, only 21.2 % were published by Atari. Out of all observed video games classified in the Sports genre, 47.7 % were published by Nintendo.

addmargins(round(prop.table(chi_results$observed, 2), 3), 1)
##               mydata$Publisher
## mydata$Genre   Nintendo Atari
##   Sports          0.329 0.667
##   Role-Playing    0.671 0.333
##   Sum             1.000 1.000
Interpretation:

Out of all observed video games that were published by Atari, 66.7 % were classified in the Sports genre. Out of all observed video games that were published by Nintendo, 67.1 % were classified in the Role-Playing genre.

At the end, I decided to use both Cramer’s V statistic and the odds ratio to determine the effect size in my 2x2 contingency table.

library(effectsize)
## 
## Attaching package: 'effectsize'
## The following object is masked from 'package:psych':
## 
##     phi
effectsize::cramers_v(mydata$Genre, mydata$Publisher)
## Cramer's V (adj.) |       95% CI
## --------------------------------
## 0.32              | [0.21, 1.00]
## 
## - One-sided CIs: upper bound fixed at [1.00].
interpret_cramers_v(0.32)
## [1] "large"
## (Rules: funder2019)
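
As a minimal sketch of the two effect sizes, the code below computes the odds ratio and an unadjusted Cramer's V directly from the observed counts reported earlier (object names are my own). The effectsize output above is bias-adjusted, so the hand-computed V may differ slightly from it.

```r
# Observed counts as reported in the contingency table of this report
observed <- matrix(c(51, 104, 56, 28), nrow = 2,
                   dimnames = list(Genre = c("Sports", "Role-Playing"),
                                   Publisher = c("Nintendo", "Atari")))

# Odds of a game being Sports rather than Role-Playing, per publisher
odds_nintendo <- observed["Sports", "Nintendo"] / observed["Role-Playing", "Nintendo"]
odds_atari    <- observed["Sports", "Atari"] / observed["Role-Playing", "Atari"]
odds_ratio    <- odds_nintendo / odds_atari

# Unadjusted Cramer's V from the chi-square statistic without continuity correction
chi_sq    <- unname(chisq.test(observed, correct = FALSE)$statistic)
cramers_v <- sqrt(chi_sq / (sum(observed) * (min(dim(observed)) - 1)))

print(round(odds_ratio, 3))
print(round(cramers_v, 3))
```

An odds ratio below 1 here means that the odds of a Nintendo game being a Sports game are lower than the same odds for an Atari game, roughly a quarter of them in this sample.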
Answer to research question 2:

Based on the analysis above I can conclude that there is a statistically significant association between genre and publisher of video games (Chi-Square test, p-value < 0.001). Based on Cramer’s V (0.32, interpreted as large), the association is strong, meaning that a certain publisher (Nintendo or Atari) is more or less likely to publish a video game in a certain genre (Sports or Role-Playing) compared to the other.