Basic Descriptive Statistics in R

In this recitation, we (1) perform basic statistics on univariate data (mean, median, variance, standard deviation), (2) perform basic statistics on bivariate data (two-way table, covariance, correlation) and (3) demonstrate how to graph data using density plots, histograms and scatterplots.

Relevant functions: mean(), median(), var(), sd(), density(), plot(), hist(), table(), prop.table(), quantile(), cov(), cor().

1. Loading the Dataset using `read.csv()`

We begin by loading the Video_Games_Sales_as_at_22_Dec_2016.csv dataset from the Canvas class website. This is an actual dataset of video game sales across the world. As usual, the first thing we want to do is look at the content of this new dataset. This includes checking the class() of each variable.

# Don't forget to set your working directory using setwd()
 
# Loading the 2016 Video Games Dataset
DataGames <- read.csv("/Users/evelynebrie/Dropbox/TA/PSCI_107_Fall2018/Recitation/Week5/Video_Games_Sales_as_at_22_Dec_2016.csv")

# Looking at the content of the variables using head()
head(DataGames,5)

##                       Name Platform Year_of_Release        Genre Publisher
## 1               Wii Sports      Wii            2006       Sports  Nintendo
## 2        Super Mario Bros.      NES            1985     Platform  Nintendo
## 3           Mario Kart Wii      Wii            2008       Racing  Nintendo
## 4        Wii Sports Resort      Wii            2009       Sports  Nintendo
## 5 Pokemon Red/Pokemon Blue       GB            1996 Role-Playing  Nintendo
##   NA_Sales EU_Sales JP_Sales Other_Sales Global_Sales Critic_Score
## 1    41.36    28.96     3.77        8.45        82.53           76
## 2    29.08     3.58     6.81        0.77        40.24           NA
## 3    15.68    12.76     3.79        3.29        35.52           82
## 4    15.61    10.93     3.28        2.95        32.77           80
## 5    11.27     8.89    10.22        1.00        31.37           NA
##   Critic_Count User_Score User_Count Developer Rating
## 1           51          8        322  Nintendo      E
## 2           NA                    NA                 
## 3           73        8.3        709  Nintendo      E
## 4           73          8        192  Nintendo      E
## 5           NA                    NA

# Viewing the class attribute of each variable using class() and sapply()
sapply(DataGames, FUN=class)

##            Name        Platform Year_of_Release           Genre 
##        "factor"        "factor"        "factor"        "factor" 
##       Publisher        NA_Sales        EU_Sales        JP_Sales 
##        "factor"       "numeric"       "numeric"       "numeric" 
##     Other_Sales    Global_Sales    Critic_Score    Critic_Count 
##       "numeric"       "numeric"       "integer"       "integer" 
##      User_Score      User_Count       Developer          Rating 
##        "factor"       "integer"        "factor"        "factor"
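Since the exercises hint at missing values, it can help to count them per column right after loading. The sketch below uses a small made-up data frame (`toy` is hypothetical, not the class dataset). Note that empty strings, like the blank `User_Score` entries in the `head()` output above, are not counted as `NA`.

```r
# Hypothetical toy data frame standing in for DataGames
toy <- data.frame(
  Critic_Score = c(76, NA, 82, 80, NA),
  User_Score   = c("8", "", "8.3", "8", ""),
  stringsAsFactors = TRUE
)

# Number of NA values in each column
na_counts <- colSums(is.na(toy))
na_counts

# Number of empty strings in the User_Score column (these are not NA)
empty_count <- sum(as.character(toy$User_Score) == "")
empty_count
```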

2. Univariate Data

Let’s focus on one of these variables, namely the global sales variable (“Global_Sales”). This is the number of copies sold for each game, in millions.

# Using summary() gives us an idea of the "Global_Sales" distribution
# Here, the data seems very positively skewed
summary(DataGames$Global_Sales)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0100  0.0600  0.1700  0.5335  0.4700 82.5300

# Identifying which element of the "Global_Sales" vector is the largest
idx <- which.max(DataGames$Global_Sales)

# Printing out the video game name for this row
DataGames$Name[idx]

## [1] Wii Sports
## 11563 Levels:   Beyblade Burst  Fire Emblem Fates ... Zyuden Sentai Kyoryuger: Game de Gaburincho!!

Notice that Wii Sports sold many more copies than the average video game. Any idea why?

2.1 Basic Statistics

We calculate four basic statistics for this variable: mean, median, variance and standard deviation.

2.1.1 Mean

A mean is the sum of values divided by the number of values.

# Mean
mean(DataGames$Global_Sales)

## [1] 0.5335427

# Using the pipe operator %>% (from magrittr, made available by loading dplyr)
library(dplyr)
DataGames$Global_Sales %>% mean()

## [1] 0.5335427
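The definition can also be checked by hand. The sketch below uses a short made-up vector (`x` is hypothetical, not taken from the dataset) so that it runs on its own.

```r
# Hypothetical sales figures (in millions) for five games
x <- c(0.01, 0.06, 0.17, 0.47, 2.00)

# Sum of the values divided by the number of values
manual_mean <- sum(x) / length(x)

# Both lines should print the same number
manual_mean
mean(x)
```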

2.1.2 Median

A median is the value separating the higher half from the lower half of a data sample (i.e. the “middle” value).

# Median
median(DataGames$Global_Sales)

## [1] 0.17

# Using the pipe operator %>% (requires the dplyr package to be loaded)
DataGames$Global_Sales %>% median()

## [1] 0.17
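For "Global_Sales" the mean (0.53) is much larger than the median (0.17), which is typical of positively skewed data. The sketch below illustrates why, using two made-up vectors (`x` and `x_outlier` are hypothetical): one extreme value moves the mean but not the median.

```r
# Hypothetical sales figures (in millions), sorted for readability
x <- c(0.01, 0.06, 0.17, 0.47, 0.90)

# Same values, but the largest one is replaced by an extreme blockbuster
x_outlier <- c(0.01, 0.06, 0.17, 0.47, 82.53)

# The median is the middle value in both cases
median(x)
median(x_outlier)

# The mean is pulled upward by the single extreme value
mean(x)
mean(x_outlier)
```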

2.1.3 Variance

The variance is the expectation of the squared deviation of a variable from its mean (i.e. how much a set of numbers is spread out around the mean).

# Variance
var(DataGames$Global_Sales)

## [1] 2.396103

# Check: computing the sample variance by hand (note the n - 1 denominator)
sum((DataGames$Global_Sales-mean(DataGames$Global_Sales))^2)/(length(DataGames$Global_Sales)-1)

## [1] 2.396103
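The by-hand computation above corresponds to the usual sample variance formula. The notation is introduced here for illustration only: \(x_i\) stands for the individual values, \(\bar{x}\) for their mean and \(n\) for the number of observations.

\[
s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2
\]

Dividing by \(n-1\) rather than \(n\) is what R's var() does, which is why the two results match.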

2.1.4 Standard Deviation

The standard deviation is a measure that is used to quantify the amount of variation or dispersion of a set of data values. The standard deviation is the square root of the variance. What is the difference between the two? The standard deviation is expressed in the same units as the mean, whereas the variance is expressed in squared units.

# Standard deviation
sd(DataGames$Global_Sales)

## [1] 1.547935

# Check: the standard deviation is the square root of the variance
sqrt(var(DataGames$Global_Sales))

## [1] 1.547935

2.2 Density Plot using `density()` and `plot()`

In scientific terms, a density estimation is the construction of an estimate, based on observed data, of an unobservable underlying probability density function. In other words, the curve shows where values are concentrated: the area under the curve over an interval estimates the probability of observing a value in that interval.

# Create the density of the distribution
d <- density(DataGames$Global_Sales) 

# Extreme values make this graph hard to read!
plot(d)

# As stated earlier, this variable is very positively skewed 
# Bounding the x axis to view all sales between 0 and 2 million
plot(d, # Name of your density distribution object
     xlim=c(0,2), # Bounding the x axis
     main="Density of Global Video Games Sales", # Main title
     xlab="Video Games Sales (million)", # Label of the x axis
     ylab="Density", # Label of the y axis
     col="red", # Color of the line
     lwd=2) # Width of the line
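To make the link between density and probability concrete, the sketch below checks that the area under an estimated density curve is close to 1. It uses simulated right-skewed data (`x` is hypothetical and only stands in for "Global_Sales"), and the area is approximated with a simple Riemann sum, so the result is approximate.

```r
# Hypothetical right-skewed data standing in for Global_Sales
set.seed(107)
x <- rexp(1000, rate = 2)

# Kernel density estimate, evaluated on an evenly spaced grid of x values
d <- density(x)

# Approximate the area under the curve with a Riemann sum
grid_step <- d$x[2] - d$x[1]
area <- sum(d$y) * grid_step
area

# Density values themselves are not probabilities and can exceed 1
max(d$y)
```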

2.3 Histogram using `hist()`

A histogram is a simple representation of the data distribution. It is one of the most basic ways to represent univariate data.

# Showing only global sales between 0 and 2 million
hist(DataGames$Global_Sales, # Name of the vector we wish to graph
     breaks=1000, # Suggested number of bins (R treats this as a hint; we bound the x axis afterwards)
     xlim=c(0,2), # Bounding the x axis
     main="Histogram of Global Video Games Sales", # Main title
     xlab="Video Games Sales (million)", # Label of the x axis
     ylab="Frequency", # Label of the y axis
     col="blue") # Color of the histogram
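When `breaks` is a single number, hist() treats it as a suggestion and picks "pretty" bin edges, so the number of bins actually drawn can differ from the number requested. The sketch below inspects this on simulated data (`x` is hypothetical and only stands in for "Global_Sales").

```r
# Hypothetical right-skewed data standing in for Global_Sales
set.seed(107)
x <- rexp(1000, rate = 2)

# plot = FALSE returns the histogram object without drawing it
h <- hist(x, breaks = 1000, plot = FALSE)

# Number of bins R actually used (may differ from the 1000 requested)
length(h$counts)

# Width of the first bin and the total count across all bins
diff(h$breaks)[1]
sum(h$counts)
```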

3. Bivariate Data

What if we want to look at the relationship between two given variables? For instance, let’s see whether user score and global sales are correlated within the dataset. One could expect better-rated games to sell more. Is this true?

Here, we are looking at scores given by regular users. For the exercises, you’ll be working with critic scores.

3.1 Two-Way Tables

# What is the class of this variable?
class(DataGames$User_Score)

## [1] "factor"

# Converting this variable to numeric (creating a new variable)
DataGames$User_Score_Num <- as.numeric(as.character(DataGames$User_Score))

## Warning: NAs introduced by coercion
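The detour through as.character() matters: calling as.numeric() directly on a factor returns the internal level codes rather than the values shown on screen. The sketch below illustrates this with a small made-up factor (`f` is hypothetical, and the "tbd" and empty entries are assumed examples of non-numeric values).

```r
# Hypothetical factor mimicking User_Score; "tbd" and "" are assumed non-numeric entries
f <- factor(c("8", "8.3", "tbd", ""))

# Converting the factor directly returns the internal level codes, not the scores
codes <- as.numeric(f)
codes

# Converting to character first recovers the scores; non-numeric entries become NA
# (suppressWarnings() hides the "NAs introduced by coercion" warning)
scores <- suppressWarnings(as.numeric(as.character(f)))
scores
```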

# Creating a table object
tt <- table(DataGames$User_Score_Num, DataGames$Global_Sales)

# Printing the first 5 rows and the first 5 columns (number of observations)
tt[1:5,1:5]

##      
##       0.01 0.02 0.03 0.04 0.05
##   0      0    0    0    0    0
##   0.2    0    0    0    1    0
##   0.3    0    0    0    0    1
##   0.5    0    0    0    0    0
##   0.6    0    1    0    0    0

# or

# Printing the first 5 rows and the first 5 columns (proportion of total observations)
prop.table(tt)[1:5,1:5]

##      
##               0.01         0.02         0.03         0.04         0.05
##   0   0.0000000000 0.0000000000 0.0000000000 0.0000000000 0.0000000000
##   0.2 0.0000000000 0.0000000000 0.0000000000 0.0001317523 0.0000000000
##   0.3 0.0000000000 0.0000000000 0.0000000000 0.0000000000 0.0001317523
##   0.5 0.0000000000 0.0000000000 0.0000000000 0.0000000000 0.0000000000
##   0.6 0.0000000000 0.0001317523 0.0000000000 0.0000000000 0.0000000000

Clearly, there are too many different user ratings and global sales values for this table to be readable. Let’s split each variable at its quartiles (i.e. cut points that divide a dataset into four groups of roughly equal size).

# Computing the quartile cut points (minimum, 25%, 50%, 75%, maximum)
qt1 <- quantile(DataGames$User_Score_Num, c(0,.25,.50,.75,1), na.rm = TRUE)

# Creating a new variable divided according to these groups
# Note: cut() intervals are right-closed, so observations equal to the minimum
# fall outside the first interval and become NA unless include.lowest = TRUE is set
DataGames$User_Score_NumQT <- cut(DataGames$User_Score_Num, breaks=qt1)

# Computing the quartile cut points (minimum, 25%, 50%, 75%, maximum)
qt2 <- quantile(DataGames$Global_Sales, c(0,.25,.50,.75,1))

# Creating a new variable divided according to these groups (same caveat as above)
DataGames$Global_SalesQT <- cut(DataGames$Global_Sales, breaks=qt2)

# Creating a table object
tt <- table(DataGames$User_Score_NumQT,DataGames$Global_SalesQT)

# Printing the table (number of observations)
tt

##            
##             (0.01,0.06] (0.06,0.17] (0.17,0.47] (0.47,82.5]
##   (0,6.4]           314         520         575         510
##   (6.4,7.5]         285         422         597         666
##   (7.5,8.2]         248         359         453         752
##   (8.2,9.7]         195         319         465         770

# Printing the table (proportion of total observations)
prop.table(tt)

##            
##             (0.01,0.06] (0.06,0.17] (0.17,0.47] (0.47,82.5]
##   (0,6.4]    0.04214765  0.06979866  0.07718121  0.06845638
##   (6.4,7.5]  0.03825503  0.05664430  0.08013423  0.08939597
##   (7.5,8.2]  0.03328859  0.04818792  0.06080537  0.10093960
##   (8.2,9.7]  0.02617450  0.04281879  0.06241611  0.10335570

# Printing the table (proportion of observations BY ROW; each row sums to 1)
prop.table(tt,1)

##            
##             (0.01,0.06] (0.06,0.17] (0.17,0.47] (0.47,82.5]
##   (0,6.4]     0.1636269   0.2709745   0.2996352   0.2657634
##   (6.4,7.5]   0.1446701   0.2142132   0.3030457   0.3380711
##   (7.5,8.2]   0.1368653   0.1981236   0.2500000   0.4150110
##   (8.2,9.7]   0.1114923   0.1823899   0.2658662   0.4402516

# Printing the table (proportion of observations BY COLUMN; each column sums to 1)
prop.table(tt,2)

##            
##             (0.01,0.06] (0.06,0.17] (0.17,0.47] (0.47,82.5]
##   (0,6.4]     0.3013436   0.3209877   0.2751196   0.1890289
##   (6.4,7.5]   0.2735125   0.2604938   0.2856459   0.2468495
##   (7.5,8.2]   0.2380038   0.2216049   0.2167464   0.2787250
##   (8.2,9.7]   0.1871401   0.1969136   0.2224880   0.2853966
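By default cut() builds right-closed intervals, which is why the first interval labels in the tables above start with an open parenthesis, as in `(0.01,0.06]`: observations exactly equal to the minimum fall outside every interval and are dropped from the table. The sketch below shows the effect on a small made-up vector (`x` is hypothetical) and how `include.lowest = TRUE` keeps the minimum.

```r
# Hypothetical vector of eight values
x <- c(1, 2, 3, 4, 5, 6, 7, 8)

# Quartile cut points (minimum, 25%, 50%, 75%, maximum)
q <- quantile(x, c(0, .25, .50, .75, 1))

# Default: the minimum value is not in any interval and becomes NA
groups_default <- cut(x, breaks = q)
groups_default

# include.lowest = TRUE closes the first interval on the left
groups_fixed <- cut(x, breaks = q, include.lowest = TRUE)
groups_fixed

# Number of observations lost in each case
sum(is.na(groups_default))
sum(is.na(groups_fixed))
```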

3.2 Covariance

Covariance is a measure indicating the extent to which two random variables change in tandem (either positively or negatively). Its value can lie between -\(\infty\) and \(\infty\). Its unit is the product of the units of the two variables. A large covariance can indicate a strong relationship between variables. However, because covariance depends on the units of measurement, you can’t compare covariances across data sets with different scales.

cov(DataGames$Global_Sales,DataGames$User_Score_Num, use="pairwise.complete.obs")

## [1] 0.2479796

There is a positive relationship between both variables.
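As with the variance, the covariance can be checked by hand: it is the sum of the products of each variable's deviations from its mean, divided by n - 1. The sketch below uses two short made-up vectors (`score` and `sales` are hypothetical, not taken from the dataset).

```r
# Hypothetical user scores and sales (in millions) for five games
score <- c(6.1, 7.0, 7.8, 8.3, 9.0)
sales <- c(0.05, 0.20, 0.15, 0.60, 1.10)

# Sum of products of deviations from each mean, divided by n - 1
n <- length(score)
manual_cov <- sum((score - mean(score)) * (sales - mean(sales))) / (n - 1)

# Both lines should print the same number
manual_cov
cov(score, sales)
```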

3.3 Correlation

Correlation is a statistical measure that indicates how strongly two variables are related. It’s a scaled version of covariance: the covariance divided by the product of the two standard deviations. Its value always lies between -1 and +1.

cor(DataGames$Global_Sales,DataGames$User_Score_Num, use="pairwise.complete.obs")

## [1] 0.08813917

There is a weak positive relationship between both variables.
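The sketch below illustrates the scaling on two short made-up vectors (`score` and `sales` are hypothetical): dividing the covariance by the product of the two standard deviations gives the correlation, and changing the unit of one variable rescales the covariance but leaves the correlation unchanged.

```r
# Hypothetical user scores and sales (in millions) for five games
score <- c(6.1, 7.0, 7.8, 8.3, 9.0)
sales <- c(0.05, 0.20, 0.15, 0.60, 1.10)

# Correlation as covariance divided by the product of the standard deviations
manual_cor <- cov(score, sales) / (sd(score) * sd(sales))
manual_cor
cor(score, sales)

# Expressing sales in thousands instead of millions rescales the covariance...
cov(score, sales * 1000)

# ...but leaves the correlation unchanged
cor(score, sales * 1000)
```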

3.4 Scatterplot

A scatterplot is a graph of plotted points that shows the relationship between two numeric variables (or vectors).

#### SCATTERPLOT (Global Sales ~ User Score) ####
plot(DataGames$Global_Sales, DataGames$User_Score_Num)

# Showing only global sales between 0 and 3 million
# (using the numeric version of the user score, not the original factor)
plot(DataGames$Global_Sales, DataGames$User_Score_Num,
     xlim=c(0,3),
     main="Scatterplot of Global Video Games Sales by User Score",
     xlab="Video Games Sales (million)",
     ylab="User Score",
     col="blue", # Color of the dots
     pch=16) # Type of dot

Exercises

Step 1

What are the mean, median and standard deviation of the “Critic_Score” variable? Which video game has the highest critic score? (Hint: you might need to remove missing values)

Step 2

Create a red histogram representing the distribution of the “Critic_Score” variable. Is the distribution skewed? Positively or negatively?

Step 3

Calculate the covariance and the correlation between “Global_Sales” and “Critic_Score”. Do better-rated games sell more on average? Is this a strong relationship? Also: is this relationship stronger or weaker than the relationship we discussed earlier between global sales and user score?