In this recitation, we (1) perform basic statistics on univariate data (mean, median, variance, standard deviation), (2) perform basic statistics on bivariate data (two-way table, covariance, correlation) and (3) demonstrate how to graph data using density plots, histograms and scatterplots.
Relevant functions: mean(), median(), var(), sd(), density(), plot(), hist(), table(), prop.table(), quantile(), cov(), cor().
read.csv()
We begin by loading the Video_Games_Sales_as_at_22_Dec_2016.csv dataset from the Canvas class website. This is an actual dataset from video games sales across the world. As usual, the first thing we want to do is check whether at the content of this new dataset. This include checking the class() of each variable.
# Don't forget to set your working directory using setwd()
# Loading the 2016 Video Games Dataset
DataGames <- read.csv("/Users/evelynebrie/Dropbox/TA/PSCI_107_Fall2018/Recitation/Week5/Video_Games_Sales_as_at_22_Dec_2016.csv")
# Looking at the content of the variables using head()
head(DataGames,5)
## Name Platform Year_of_Release Genre Publisher
## 1 Wii Sports Wii 2006 Sports Nintendo
## 2 Super Mario Bros. NES 1985 Platform Nintendo
## 3 Mario Kart Wii Wii 2008 Racing Nintendo
## 4 Wii Sports Resort Wii 2009 Sports Nintendo
## 5 Pokemon Red/Pokemon Blue GB 1996 Role-Playing Nintendo
## NA_Sales EU_Sales JP_Sales Other_Sales Global_Sales Critic_Score
## 1 41.36 28.96 3.77 8.45 82.53 76
## 2 29.08 3.58 6.81 0.77 40.24 NA
## 3 15.68 12.76 3.79 3.29 35.52 82
## 4 15.61 10.93 3.28 2.95 32.77 80
## 5 11.27 8.89 10.22 1.00 31.37 NA
## Critic_Count User_Score User_Count Developer Rating
## 1 51 8 322 Nintendo E
## 2 NA NA
## 3 73 8.3 709 Nintendo E
## 4 73 8 192 Nintendo E
## 5 NA NA
# Viewing the class attribute of each variable using class() and sapply()
sapply(DataGames, FUN=class)
## Name Platform Year_of_Release Genre
## "factor" "factor" "factor" "factor"
## Publisher NA_Sales EU_Sales JP_Sales
## "factor" "numeric" "numeric" "numeric"
## Other_Sales Global_Sales Critic_Score Critic_Count
## "numeric" "numeric" "integer" "integer"
## User_Score User_Count Developer Rating
## "factor" "integer" "factor" "factor"
Let’s focus on one of these variables, namely the global sales variable (“Global_Sales”). This is the number of million copies sold for each game.
# Using summary() gives us an idea of the "Global_Sales" distribution
# Here, the data seems very positively skewed
summary(DataGames$Global_Sales)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0100 0.0600 0.1700 0.5335 0.4700 82.5300
# Identifying which element of the "Global_Sales" vector is the largest
idx <- which.max(DataGames$Global_Sales)
# Printing out the video game name for this row
DataGames$Name[idx]
## [1] Wii Sports
## 11563 Levels: Beyblade Burst Fire Emblem Fates ... Zyuden Sentai Kyoryuger: Game de Gaburincho!!
Notice that the Wii Sports variable has much more copies sold that the average video game. Any idea why?
We calculate four basic statistics for this variable: mean, median, variance and standard deviation.
A mean is the sum of values divided by the number of values.
# Mean
mean(DataGames$Global_Sales)
## [1] 0.5335427
# Using the pipe operator from the dplyr package
DataGames$Global_Sales %>% mean()
## [1] 0.5335427
A median is the value separating the higher half from the lower half of a data sample (i.e. the “middle” value).
# Median
median(DataGames$Global_Sales)
## [1] 0.17
# Using the pipe operator from the dplyr package
DataGames$Global_Sales %>% median()
## [1] 0.17
The variance is the expectation of the squared standard deviation of a variable from its mean (i.e. how much a set of numbers are spread out from the mean).
# Variance
var(DataGames$Global_Sales)
## [1] 2.396103
# Proof
sum((DataGames$Global_Sales-mean(DataGames$Global_Sales))^2)/(length(DataGames$Global_Sales)-1)
## [1] 2.396103
The standard deviation is a measure that is used to quantify the amount of variation or dispersion of a set of data value. The standard deviation is the square root of the variance. What is the difference between both? The standard deviation is expressed in the same units as the mean is, whereas the variance is expressed in squared units.
# Standard deviation
sd(DataGames$Global_Sales)
## [1] 1.547935
# Proof
sqrt(var(DataGames$Global_Sales))
## [1] 1.547935
density() and plot()
In scientific terms, a density estimation is the construction of an estimate, based on observed data, of an unobservable underlying probability density function. In other words, it presents the probability of obtaining each possible value within the distribution.
# Create the density of the distribution
d <- density(DataGames$Global_Sales)
# Extreme values make this graph hard to read!
plot(d)
# As stated earlier, this variable is very positively skewed
# Bounding the x axis to view all sales between 0 and 2 millions
plot(d, # Name of your density distribution object
xlim=c(0,2), # Bounding the x axis
main="Density of Global Video Games Sales", # Main title
xlab="Video Games Sales (million)", # Label of the x axis
ylab="Density", # Label of the y axis
col="red", # Color of the line
lwd=2) # Width of the line
hist()
A histogram is a simple representation of the data distribution. It is one of the most basic ways to represent univariate data.
# Selecting only global sales between 0 and 2 millions
hist(DataGames$Global_Sales, # Name of the vector we wish to graph
breaks=1000, # How many bars in total? (note: we bound it afterwards)
xlim=c(0,2), # Bounding the x axis
main="Histogram of Global Video Games Sales", # Main title
xlab="Video Games Sales (million)", # Label of the x axis
ylab="Frequency", # Label of the y axis
col="blue") # Color of the histogram
What if we want to look at the relationship between two given variables? For instance, let’s see whether user score and global sales are correlated within the dataset. Indeed, one could expect better rated games to be sold more—is this true?
Here, we are looking at scores given by regular users. For the exercises, you’ll be working with critic scores.
# What is the class of this variable?
class(DataGames$User_Score)
## [1] "factor"
# Converting this variable to numeric (creating a new variable)
DataGames$User_Score_Num <- as.numeric(as.character(DataGames$User_Score))
## Warning: NAs introduced by coercion
# Creating a table object
tt <- table(DataGames$User_Score_Num, DataGames$Global_Sales)
# Printing the first 5 rows and the first 5 columns (number of observations)
tt[1:5,1:5]
##
## 0.01 0.02 0.03 0.04 0.05
## 0 0 0 0 0 0
## 0.2 0 0 0 1 0
## 0.3 0 0 0 0 1
## 0.5 0 0 0 0 0
## 0.6 0 1 0 0 0
# or
# Printing the first 5 rows and the first 5 columns (percentage of total observations)
prop.table(tt)[1:5,1:5]
##
## 0.01 0.02 0.03 0.04 0.05
## 0 0.0000000000 0.0000000000 0.0000000000 0.0000000000 0.0000000000
## 0.2 0.0000000000 0.0000000000 0.0000000000 0.0001317523 0.0000000000
## 0.3 0.0000000000 0.0000000000 0.0000000000 0.0000000000 0.0001317523
## 0.5 0.0000000000 0.0000000000 0.0000000000 0.0000000000 0.0000000000
## 0.6 0.0000000000 0.0001317523 0.0000000000 0.0000000000 0.0000000000
Clearly, there are too many different user ratings and global sales values for this table to be readable. Let’s separate both variables into quantiles (i.e. cut points that will divide a dataset into equal-sized groups).
# Dividing the vector into 4 equal-sized groups
qt1 <- quantile(DataGames$User_Score_Num,c(0,.25,.50,.75,1),na.rm = T)
# Creating a new variable divided according to these groups
DataGames$User_Score_NumQT <- cut(DataGames$User_Score_Num,breaks=qt1)
# Dividing the vector into 4 equal-sized groups
qt2 <- quantile(DataGames$Global_Sales,c(0,.25,.50,.75,1))
# Creating a new variable divided according to these groups
DataGames$Global_SalesQT <- cut(DataGames$Global_Sales,breaks=qt2)
# Creating a table object
tt <- table(DataGames$User_Score_NumQT,DataGames$Global_SalesQT)
# Printing the table (number of observations)
tt
##
## (0.01,0.06] (0.06,0.17] (0.17,0.47] (0.47,82.5]
## (0,6.4] 314 520 575 510
## (6.4,7.5] 285 422 597 666
## (7.5,8.2] 248 359 453 752
## (8.2,9.7] 195 319 465 770
# Printing the table (percentage of total observations)
prop.table(tt)
##
## (0.01,0.06] (0.06,0.17] (0.17,0.47] (0.47,82.5]
## (0,6.4] 0.04214765 0.06979866 0.07718121 0.06845638
## (6.4,7.5] 0.03825503 0.05664430 0.08013423 0.08939597
## (7.5,8.2] 0.03328859 0.04818792 0.06080537 0.10093960
## (8.2,9.7] 0.02617450 0.04281879 0.06241611 0.10335570
# Printing the table (percentage of total observations BY ROW)
prop.table(tt,1)
##
## (0.01,0.06] (0.06,0.17] (0.17,0.47] (0.47,82.5]
## (0,6.4] 0.1636269 0.2709745 0.2996352 0.2657634
## (6.4,7.5] 0.1446701 0.2142132 0.3030457 0.3380711
## (7.5,8.2] 0.1368653 0.1981236 0.2500000 0.4150110
## (8.2,9.7] 0.1114923 0.1823899 0.2658662 0.4402516
# Printing the table (percentage of total observations BY COLUMN)
prop.table(tt,2)
##
## (0.01,0.06] (0.06,0.17] (0.17,0.47] (0.47,82.5]
## (0,6.4] 0.3013436 0.3209877 0.2751196 0.1890289
## (6.4,7.5] 0.2735125 0.2604938 0.2856459 0.2468495
## (7.5,8.2] 0.2380038 0.2216049 0.2167464 0.2787250
## (8.2,9.7] 0.1871401 0.1969136 0.2224880 0.2853966
Covariance is a measure indicating the extent to which two random variables change in tandem (either positively or negatively). It’s value can lie between -\(\infty\) and \(\infty\). It’s unit is that of the variable. A large covariance can mean a strong relationship between variables. However, you can’t compare variances over data sets with different scales.
cov(DataGames$Global_Sales,DataGames$User_Score_Num, use="pairwise.complete.obs")
## [1] 0.2479796
There is a positive relationship between both variables.
Correlation is a statistical measure that indicates how strongly two variables are related. It’s a scaled version of covariance. It’s value always lies between -1 and +1.
cor(DataGames$Global_Sales,DataGames$User_Score_Num, use="pairwise.complete.obs")
## [1] 0.08813917
There is a weak positive relationship between both variables.
A scatterplot is graph of plotted points that show the relationship between two numeric variables (or vectors).
#### SCATTERPLOT (Global Sales ~ User Score) ####
plot(DataGames$Global_Sales, DataGames$User_Score_Num)
# Selecting only global sales between 0 and 3 millions
plot(DataGames$Global_Sales, DataGames$User_Score,
xlim=c(0,3),
main="Scatterplot of Global Video Games Sales by User Score",
xlab="Video Games Sales (million)",
ylab="User Score",
col="blue", # Color of the dots
pch=16) # Type of dot
What is the mean, median and standard deviation of the “Critic_Score” variable? Which video game has the highest ranking? (Hint: you might need to remove missing values)
Create a red histogram representing the distribution of the “Critic_Score” variable. Is the distribution skewed? Positively or negatively?
Calculate the covariance and the correlation between “Global_Sales” and “Critic_Score”. Are better rated games sold more on average? Is this a strong relationship? Also: is this relationship stronger or weaker than the relationship we discussed earlier between global sales and user score?