PSCI 9590A - Introduction to Quantitative Methods

Evelyne Brie

Fall 2023

Estimation and Inference

In this lab, we (1) estimate basic statistics on univariate data (mean, median, variance, standard deviation), (2) estimate basic statistics on bivariate data (covariance, correlation) and display trends using two-way tables and (3) demonstrate how to graph data in base R with density plots, histograms and scatterplots.

 

Relevant functions: mean(), median(), var(), sd(), density(), plot(), hist(), table(), prop.table(), quantile(), cov(), cor().

1. Understanding our Dataset

1.1 Loading the dataset

We begin by loading the Video_Games_Sales_as_at_22_Dec_2016.csv dataset from the OWL class website. This is an actual dataset from video games sales across the world. As usual, the first thing we want to do is check whether at the content of this new dataset. This include checking the class() of each variable.

# Don't forget to set your working directory using setwd() !

# Setting the folder where the script is as our working directory
setwd(dirname(rstudioapi::getSourceEditorContext()$path))

# Loading the 2016 Video Games Dataset
DataGames <- read.csv("Video_Games_Sales_as_at_22_Dec_2016.csv")

# Printing the dimensions of our dataset
dim(DataGames)
## [1] 16719    16

1.2 Exploring our dataset

We want to make sure that we know what is encompassed within this dataset. This means not only which columns are included in the DataGames data frame, but also the class of each vector.

# Looking at the names of all variables using colnames()
colnames(DataGames)
##  [1] "Name"            "Platform"        "Year_of_Release" "Genre"          
##  [5] "Publisher"       "NA_Sales"        "EU_Sales"        "JP_Sales"       
##  [9] "Other_Sales"     "Global_Sales"    "Critic_Score"    "Critic_Count"   
## [13] "User_Score"      "User_Count"      "Developer"       "Rating"
# Looking at the content of the 5 first rows using head()
head(DataGames,5)
##                       Name Platform Year_of_Release        Genre Publisher
## 1               Wii Sports      Wii            2006       Sports  Nintendo
## 2        Super Mario Bros.      NES            1985     Platform  Nintendo
## 3           Mario Kart Wii      Wii            2008       Racing  Nintendo
## 4        Wii Sports Resort      Wii            2009       Sports  Nintendo
## 5 Pokemon Red/Pokemon Blue       GB            1996 Role-Playing  Nintendo
##   NA_Sales EU_Sales JP_Sales Other_Sales Global_Sales Critic_Score Critic_Count
## 1    41.36    28.96     3.77        8.45        82.53           76           51
## 2    29.08     3.58     6.81        0.77        40.24           NA           NA
## 3    15.68    12.76     3.79        3.29        35.52           82           73
## 4    15.61    10.93     3.28        2.95        32.77           80           73
## 5    11.27     8.89    10.22        1.00        31.37           NA           NA
##   User_Score User_Count Developer Rating
## 1          8        322  Nintendo      E
## 2                    NA                 
## 3        8.3        709  Nintendo      E
## 4          8        192  Nintendo      E
## 5                    NA
# Viewing the class attribute of each variable using class() and sapply()
sapply(DataGames, FUN=class)  
##            Name        Platform Year_of_Release           Genre       Publisher 
##     "character"     "character"     "character"     "character"     "character" 
##        NA_Sales        EU_Sales        JP_Sales     Other_Sales    Global_Sales 
##       "numeric"       "numeric"       "numeric"       "numeric"       "numeric" 
##    Critic_Score    Critic_Count      User_Score      User_Count       Developer 
##       "integer"       "integer"     "character"       "integer"     "character" 
##          Rating 
##     "character"

2. Univariate Data

Let’s focus on one of these variables, namely the global sales variable (“Global_Sales”). This is the number of million copies sold for each game.

# Using summary() gives us an idea of the "Global_Sales" distribution
# Here, the data seems very positively skewed
summary(DataGames$Global_Sales) 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0100  0.0600  0.1700  0.5335  0.4700 82.5300
# Identifying which element of the "Global_Sales" vector has the greatest value
# idx takes the value of the row number for that observation
idx <- which.max(DataGames$Global_Sales)

# Printing out the video game name for this row
DataGames$Name[idx]
## [1] "Wii Sports"
# Note that you can also do this in only one line (this is more efficient!)
DataGames$Name[which.max(DataGames$Global_Sales)]
## [1] "Wii Sports"
# Sort the dataset by the value of the "Global_Sales" variable
DataGames[order(DataGames$Global_Sales,decreasing = T)[1:20],c("Name","Global_Sales")]
##                                            Name Global_Sales
## 1                                    Wii Sports        82.53
## 2                             Super Mario Bros.        40.24
## 3                                Mario Kart Wii        35.52
## 4                             Wii Sports Resort        32.77
## 5                      Pokemon Red/Pokemon Blue        31.37
## 6                                        Tetris        30.26
## 7                         New Super Mario Bros.        29.80
## 8                                      Wii Play        28.92
## 9                     New Super Mario Bros. Wii        28.32
## 10                                    Duck Hunt        28.31
## 11                                   Nintendogs        24.67
## 12                                Mario Kart DS        23.21
## 13                  Pokemon Gold/Pokemon Silver        23.10
## 14                                      Wii Fit        22.70
## 15                           Kinect Adventures!        21.81
## 16                                 Wii Fit Plus        21.79
## 17                           Grand Theft Auto V        21.04
## 18                Grand Theft Auto: San Andreas        20.81
## 19                            Super Mario World        20.61
## 20 Brain Age: Train Your Brain in Minutes a Day        20.15
# Pssst: in this case, the original dataset was already ordered by global sales!
# But you can use this code to sort this dataset or other datasets using any variable

Notice that the Wii Sports variable has much more copies sold that the average video game. Any idea why? (This is not a stats question!)

2.1 Basic Statistics

We calculate four basic statistics for this variable: mean, median, variance and standard deviation.

2.1.1 Mean

A mean is the sum of values divided by the number of values.

This is the formula for the population mean: \[\ \mu = \frac{1}{N} \sum X_i \]

And this is the formula for the sample mean: \[\ \overline{X} = \frac{1}{n} \sum X_i \]

# Mean
mean(DataGames$Global_Sales) 
## [1] 0.5335427
# Using the pipe operator from the dplyr package
DataGames$Global_Sales %>% mean()
## [1] 0.5335427
# Proof
sum(DataGames$Global_Sales)/nrow(DataGames)
## [1] 0.5335427

2.1.2 Median

A median is the value separating the higher half from the lower half of a data sample (i.e. the “middle” value).

# Median
median(DataGames$Global_Sales) 
## [1] 0.17
# Using the pipe operator from the dplyr package
DataGames$Global_Sales %>% median()
## [1] 0.17

2.1.3 Variance

The variance is the expectation of the squared standard deviation of a variable from its mean (i.e. how much a set of numbers are spread out from the mean).

This is the formula for the population variance: \[ \sigma^2 = \frac{\sum (x_i - \bar{x})^2 }{N}\]

This is the formula for the sample variance: \[ s^2 = \frac{\sum (x_i - \bar{x})^2 }{n - 1}\]

# Variance
var(DataGames$Global_Sales)
## [1] 2.396103
# Proof
sum((DataGames$Global_Sales-mean(DataGames$Global_Sales))^2)/(length(DataGames$Global_Sales)-1)
## [1] 2.396103
# Not that the var() function gives the sample variance rather than the population variance
# Here is an explanation of how to get the population variance in R: https://influentialpoints.com/notes/n3rvari.htm

2.1.4 Standard Deviation

The standard deviation is a measure that is used to quantify the amount of variation or dispersion of a set of data value. The standard deviation is the square root of the variance. What is the difference between both? The standard deviation is expressed in the same units as the mean is, whereas the variance is expressed in squared units.

This is the formula for the population standard deviation: \[ \sigma = \sqrt{\frac{\sum (x_i - \bar{x})^2 }{N}}\]

This is the formula for the sample standard deviation: \[ s = \sqrt{\frac{\sum (x_i - \bar{x})^2 }{n - 1}}\]

# Standard deviation
sd(DataGames$Global_Sales)
## [1] 1.547935
# Proof
sqrt(var(DataGames$Global_Sales))
## [1] 1.547935
 

Exercise 1

What is the mean, median and standard deviation of the Critic_Score variable? Also, which video game(s) has (have) the highest ranking for that variable? (Hint: you might need to remove missing values)

mean(DataGames$Critic_Score, na.rm=T)
## [1] 68.96768
median(DataGames$Critic_Score, na.rm=T)
## [1] 71
var(DataGames$Critic_Score, na.rm=T)
## [1] 194.2724
sd(DataGames$Critic_Score, na.rm=T)
## [1] 13.93816
DataGames$Name[which.max(DataGames$Critic_Score)]
## [1] "Grand Theft Auto IV"
DataGames$Name[which(DataGames$Critic_Score==max(DataGames$Critic_Score,na.rm = T))]
## [1] "Grand Theft Auto IV"      "Grand Theft Auto IV"     
## [3] "Tony Hawk's Pro Skater 2" "SoulCalibur"

2.2 Visualization

2.2.1 Density Plot

In scientific terms, a density estimation is the construction of an estimate, based on observed data, of an unobservable underlying probability density function. In other words, the area under the curve represents the probability of obtaining each possible value within the distribution (i.e. the y axis is is NOT a probability). The total area under the density curve equals 1.

# Create the density of the distribution
d <- density(DataGames$Global_Sales) 

# Extreme values make this graph hard to read!
plot(d)

# As stated earlier, this variable is very positively skewed 
# Bounding the x axis to view all sales between 0 and 2 millions
plot(d, # Name of your density distribution object
     xlim=c(0,2), # Bounding the x axis
     main="Density of Global Video Games Sales", # Main title
     xlab="Video Games Sales (million)", # Label of the x axis
     ylab="Density", # Label of the y axis
     col="red", # Color of the line
     lwd=2) # Width of the line

2.2.2 Histogram

A histogram is a simple representation of the data distribution. It is one of the most basic ways to represent univariate data.

# Selecting only global sales between 0 and 2 millions
hist(DataGames$Global_Sales, # Name of the vector we wish to graph
     breaks=1000, # How many bars in total? (note: we bound it afterwards)
     xlim=c(0,2), # Bounding the x axis
     main="Histogram of Global Video Games Sales", # Main title
     xlab="Video Games Sales (million)", # Label of the x axis
     ylab="Frequency", # Label of the y axis
     col="steelblue") # Color of the histogram

3. Bivariate Data

What if we want to look at the relationship between two given variables? For instance, let’s see whether user score and global sales are correlated within the dataset. Indeed, one could expect better rated games to be sold more—is this true?

3.1 Two-Way Tables

# What is the class of this variable?
class(DataGames$User_Score)
## [1] "character"
# Converting this variable to numeric (creating a new variable)
DataGames$User_Score_Num <- as.numeric(DataGames$User_Score)
## Warning: NAs introduced by coercion
# Creating a table object
tt <- table(DataGames$User_Score_Num, DataGames$Global_Sales)

# Printing the first 5 rows and the first 5 columns (number of observations)
tt[1:5,1:5]
##      
##       0.01 0.02 0.03 0.04 0.05
##   0      0    0    0    0    0
##   0.2    0    0    0    1    0
##   0.3    0    0    0    0    1
##   0.5    0    0    0    0    0
##   0.6    0    1    0    0    0
# or

# Printing the first 5 rows and the first 5 columns (percentage of total observations)
prop.table(tt)[1:5,1:5]
##      
##               0.01         0.02         0.03         0.04         0.05
##   0   0.0000000000 0.0000000000 0.0000000000 0.0000000000 0.0000000000
##   0.2 0.0000000000 0.0000000000 0.0000000000 0.0001317523 0.0000000000
##   0.3 0.0000000000 0.0000000000 0.0000000000 0.0000000000 0.0001317523
##   0.5 0.0000000000 0.0000000000 0.0000000000 0.0000000000 0.0000000000
##   0.6 0.0000000000 0.0001317523 0.0000000000 0.0000000000 0.0000000000

Clearly, there are too many different user ratings and global sales values for this table to be readable. Let’s separate both variables into quantiles (i.e. cut points that will divide a dataset into equal-sized groups).

# Dividing the vector into 4 equal-sized groups
qt1 <- quantile(DataGames$User_Score_Num,c(0,.25,.50,.75,1),na.rm = T)

# Creating a new variable divided according to these groups
DataGames$User_Score_NumQT <- cut(DataGames$User_Score_Num,breaks=qt1)

# Dividing the vector into 4 equal-sized groups
qt2 <- quantile(DataGames$Global_Sales,c(0,.25,.50,.75,1))

# Creating a new variable divided according to these groups
DataGames$Global_SalesQT <- cut(DataGames$Global_Sales,breaks=qt2)

# Creating a table object
tt <- table(DataGames$User_Score_NumQT,DataGames$Global_SalesQT)

# Printing the table (number of observations)
tt
##            
##             (0.01,0.06] (0.06,0.17] (0.17,0.47] (0.47,82.5]
##   (0,6.4]           314         520         575         510
##   (6.4,7.5]         285         422         597         666
##   (7.5,8.2]         248         359         453         752
##   (8.2,9.7]         195         319         465         770
# Printing the table (percentage of total observations)
prop.table(tt)
##            
##             (0.01,0.06] (0.06,0.17] (0.17,0.47] (0.47,82.5]
##   (0,6.4]    0.04214765  0.06979866  0.07718121  0.06845638
##   (6.4,7.5]  0.03825503  0.05664430  0.08013423  0.08939597
##   (7.5,8.2]  0.03328859  0.04818792  0.06080537  0.10093960
##   (8.2,9.7]  0.02617450  0.04281879  0.06241611  0.10335570
# Printing the table (percentage of total observations BY ROW)
prop.table(tt,1)
##            
##             (0.01,0.06] (0.06,0.17] (0.17,0.47] (0.47,82.5]
##   (0,6.4]     0.1636269   0.2709745   0.2996352   0.2657634
##   (6.4,7.5]   0.1446701   0.2142132   0.3030457   0.3380711
##   (7.5,8.2]   0.1368653   0.1981236   0.2500000   0.4150110
##   (8.2,9.7]   0.1114923   0.1823899   0.2658662   0.4402516
# Printing the table (percentage of total observations BY COLUMN)
prop.table(tt,2)
##            
##             (0.01,0.06] (0.06,0.17] (0.17,0.47] (0.47,82.5]
##   (0,6.4]     0.3013436   0.3209877   0.2751196   0.1890289
##   (6.4,7.5]   0.2735125   0.2604938   0.2856459   0.2468495
##   (7.5,8.2]   0.2380038   0.2216049   0.2167464   0.2787250
##   (8.2,9.7]   0.1871401   0.1969136   0.2224880   0.2853966

3.2 Basic Statistics

3.2.1 Covariance

Covariance is a measure indicating the extent to which two random variables change in tandem (either positively or negatively). It’s value can lie between -\(\infty\) and \(\infty\). Its unit is based on that of the variable. However, you can’t compare variances over data sets with different scales.

In plain words, the covariance represents the degree to which the deviation of one variable from its mean is related to the deviation of another variable from its mean. Because it is hard to interpret, you typically just look at whether the score is positive, negative, or null.

cov(DataGames$Global_Sales,DataGames$User_Score_Num, use="pairwise.complete.obs")
## [1] 0.2479796

There is a positive relationship between both variables. However, it can be hard to interpret and it is not a standardized scale, since the value depends on the units of observations. The standardized measure to compare across relationships is the correlation.

3.2.2 Correlation

Correlation is a statistical measure that indicates how strongly two variables are related, and what is the direction of this relationship (positive or negative). It’s a scaled version of covariance. It’s value always lies between -1 and +1. The correlation score is called the Pearson correlation coefficient (also noted as r).

Correlation Score Strength
0 Null
0.1-0.3 Weak
0.3-0.5 Moderate
0.5-1 Strong
cor(DataGames$Global_Sales,DataGames$User_Score_Num, use="pairwise.complete.obs")
## [1] 0.08813917

Here, there is a weak positive relationship between both variables.

3.3 Visualization

3.3.1 Scatterplot

A scatterplot is graph of plotted points that show the relationship between two numeric variables (or vectors).

#### A very basic version...  ####
plot(DataGames$User_Score, DataGames$Global_Sales)
## Warning in xy.coords(x, y, xlabel, ylabel, log): NAs introduced by coercion

### And here is a more esthetically pleasing version
plot(DataGames$User_Score, DataGames$Global_Sales,
     ylim=c(0,5), # Plotting only global sales between 0 and 5 millions
     main="Scatterplot of Global Video Games Sales by User Score",
     ylab="Video Games Sales (million)",
     xlab="User Score",
     col="steelblue", # Color of the dots
     pch=16) # Type of dot
## Warning in xy.coords(x, y, xlabel, ylabel, log): NAs introduced by coercion

# We get a warning message here to alert us that some observations are removed
# This is normal: our ylim has an upper bound of 5
 

Exercise 2

Calculate the covariance and the correlation between “Global_Sales” and “Critic_Score”. Are better rated games sold more on average? Is this a strong relationship? Also: is this relationship stronger or weaker than the relationship we discussed earlier between global sales and user score?