Exploring Many Variables in R

by Jekaterina Novikova


Exploring Pseudo-Facebook User Data

Having the dataset pseudo_facebook.tsv, I am going to analyze the users’ behaviour in Facebook, understand what they are doing there and how they behave.

Reading the Data

First, I read the data and check its summary. Next, I get the list of names for all the variables of the dataset.

pf <- read.csv('pseudo_facebook.tsv', sep = '\t')
# Alternatively:
# pf <- read.delim('pseudo_facebook.tsv')  

summary(pf)
##      userid             age            dob_day         dob_year   
##  Min.   :1000008   Min.   : 13.00   Min.   : 1.00   Min.   :1900  
##  1st Qu.:1298806   1st Qu.: 20.00   1st Qu.: 7.00   1st Qu.:1963  
##  Median :1596148   Median : 28.00   Median :14.00   Median :1985  
##  Mean   :1597045   Mean   : 37.28   Mean   :14.53   Mean   :1976  
##  3rd Qu.:1895744   3rd Qu.: 50.00   3rd Qu.:22.00   3rd Qu.:1993  
##  Max.   :2193542   Max.   :113.00   Max.   :31.00   Max.   :2000  
##                                                                   
##    dob_month         gender          tenure        friend_count   
##  Min.   : 1.000   female:40254   Min.   :   0.0   Min.   :   0.0  
##  1st Qu.: 3.000   male  :58574   1st Qu.: 226.0   1st Qu.:  31.0  
##  Median : 6.000   NA's  :  175   Median : 412.0   Median :  82.0  
##  Mean   : 6.283                  Mean   : 537.9   Mean   : 196.4  
##  3rd Qu.: 9.000                  3rd Qu.: 675.0   3rd Qu.: 206.0  
##  Max.   :12.000                  Max.   :3139.0   Max.   :4923.0  
##                                  NA's   :2                        
##  friendships_initiated     likes         likes_received    
##  Min.   :   0.0        Min.   :    0.0   Min.   :     0.0  
##  1st Qu.:  17.0        1st Qu.:    1.0   1st Qu.:     1.0  
##  Median :  46.0        Median :   11.0   Median :     8.0  
##  Mean   : 107.5        Mean   :  156.1   Mean   :   142.7  
##  3rd Qu.: 117.0        3rd Qu.:   81.0   3rd Qu.:    59.0  
##  Max.   :4144.0        Max.   :25111.0   Max.   :261197.0  
##                                                            
##   mobile_likes     mobile_likes_received   www_likes       
##  Min.   :    0.0   Min.   :     0.00     Min.   :    0.00  
##  1st Qu.:    0.0   1st Qu.:     0.00     1st Qu.:    0.00  
##  Median :    4.0   Median :     4.00     Median :    0.00  
##  Mean   :  106.1   Mean   :    84.12     Mean   :   49.96  
##  3rd Qu.:   46.0   3rd Qu.:    33.00     3rd Qu.:    7.00  
##  Max.   :25111.0   Max.   :138561.00     Max.   :14865.00  
##                                                            
##  www_likes_received 
##  Min.   :     0.00  
##  1st Qu.:     0.00  
##  Median :     2.00  
##  Mean   :    58.57  
##  3rd Qu.:    20.00  
##  Max.   :129953.00  
## 
names(pf)
##  [1] "userid"                "age"                  
##  [3] "dob_day"               "dob_year"             
##  [5] "dob_month"             "gender"               
##  [7] "tenure"                "friend_count"         
##  [9] "friendships_initiated" "likes"                
## [11] "likes_received"        "mobile_likes"         
## [13] "mobile_likes_received" "www_likes"            
## [15] "www_likes_received"

Third Qualitative Variable

Here, I will create a plot to examine if the age distribution has an impact on the number of friends across genders. For this, I will add a third variable of age to the line plot of friend count vs age.

library(ggplot2)

# ggplot(aes(x = gender, y = age),
#        data = subset(pf, !is.na(gender))) + 
#   geom_boxplot() +
#   stat_summary(fun.y = mean, geom = 'point', shape = 4)

ggplot(aes(x = age, y = friend_count), 
       data = subset(pf, !is.na(gender))) +
  geom_line(aes(color = gender), stat = "summary", fun.y = median)

The plot shows that almost everywhere the median count of friends is larger for women than it is for men. There are some exaptions and they include the noisy regions for older ages.


Third Qualitative Variable Using dplyr package

It is possible to create the same plot using a dplyr package.

library(dplyr)

# chain functions together %>%
pf.fc_by_age_gender <- pf %>%
  filter(!is.na(gender)) %>%
  group_by(age, gender) %>%
  summarise(mean_friend_count = mean(friend_count),
            median_friend_count = median(friend_count),
            n = n()) %>%
  ungroup() %>%
  arrange(age)

head(pf.fc_by_age_gender)
## Source: local data frame [6 x 5]
## 
##     age gender mean_friend_count median_friend_count     n
##   (int) (fctr)             (dbl)               (dbl) (int)
## 1    13 female          259.1606               148.0   193
## 2    13   male          102.1340                55.0   291
## 3    14 female          362.4286               224.0   847
## 4    14   male          164.1456                92.5  1078
## 5    15 female          538.6813               276.0  1139
## 6    15   male          200.6658               106.5  1478
ggplot(aes(x = age, y = median_friend_count), data = pf.fc_by_age_gender) +
  geom_line(aes(color = gender))


Reshaping Data

I will use the library reshape2 to transform the data to the wide format.

#install.packages('reshape2')
library(reshape2)

pf.fc_by_age_gender_wide <- dcast(pf.fc_by_age_gender,
                                  age ~ gender,
                                  value.var = 'median_friend_count')

head(pf.fc_by_age_gender_wide)
##   age female  male
## 1  13  148.0  55.0
## 2  14  224.0  92.5
## 3  15  276.0 106.5
## 4  16  258.5 136.0
## 5  17  245.5 125.0
## 6  18  243.0 122.0

Ratio Plot

Now, having the pf.fc_by_age_gender.wide dataframe, I will plot the ratio of the female to male median friend counts.

ggplot(aes(x = age, y = female/male), 
       data = pf.fc_by_age_gender_wide) +
  geom_line() +
  geom_hline(yintercept = 1, alpha = 0.3, color = 'red')

Now, we can easily see that females have on average more friend then males up to the age of about 70. Very young female users have almost 2.5 more friends on average than males.


Third Quantitative Variable

I am going to add another quantitative variable to the pf dataframe that would show in which year each user joined Facebook.

pf$year_joined <- floor(2014 - pf$tenure/365)

summary(pf$year_joined)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    2005    2012    2012    2012    2013    2014       2
table(pf$year_joined)
## 
##  2005  2006  2007  2008  2009  2010  2011  2012  2013  2014 
##     9    15   581  1507  4557  5448  9860 33366 43588    70

Cut a Variable

Now, I will transform the variable year_joined into a new categorical variable with just four backets.

# There will be four backets:
#        (2004, 2009]
#        (2009, 2011]
#        (2011, 2012]
#        (2012, 2014]


pf$year_joined_backet <- cut(pf$year_joined,
                             c(2004, 2009, 2011, 2012, 2014))

summary(pf$year_joined_backet)
## (2004,2009] (2009,2011] (2011,2012] (2012,2014]        NA's 
##        6669       15308       33366       43658           2

Plotting it All Together

Now, I will plot the distribution of friends count over the age for each of the four backets of the joining year.

ggplot(aes(x = age, y = friend_count), 
       data = subset(pf, !is.na(year_joined_backet))) +
  geom_line(aes(color = year_joined_backet), stat = "summary", fun.y = median)

The plot shows that the younger users that joined Facebook in its early years (between 2004 and 2009) have the biggest average number of friends. This relationship becomes messy for the older users.


Plot the Grand Mean

In the next plot, I will change the median number of friends to the mean value and will add the grand mean value for all the users as a dash line.

ggplot(aes(x = age, y = friend_count), 
       data = subset(pf, !is.na(year_joined_backet))) +
  geom_line(aes(color = year_joined_backet), stat = "summary", fun.y = mean) +
  geom_line(stat = 'summary', fun.y = mean, linetype = 2)

The grand mean line is a great reminder that the most of the data in our dataset is about the users of the recent cohort, who joined Facebook between the years 2012 and 2014.


Friending Rate

Next, I will calculate the number of friends per day since the user has been active on Facebook.

with(subset(pf, tenure >=1), summary(friend_count / tenure))
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##   0.0000   0.0775   0.2205   0.6096   0.5658 417.0000

Friendships Initiated

In addition, I will create a line graph of mean of friendships_initiated per day (of tenure) vs. tenure colored by year_joined.bucket

ggplot(aes(x = tenure, y = friendships_initiated / tenure), 
       data = subset(pf, tenure >= 1)) +
  geom_line(aes(color = year_joined_backet), stat = "summary", fun.y = mean)

A closer look at this plot shows that people with a longer tenure typically initiate less friendships.


Bias-Variance Tradeoff Revisited

I make several transformations to the y xis to visualize a bias / variance tradeoff. As the bin size increases, we see less noise in the graph.

library(gridExtra)

p1 <- ggplot(aes(x = tenure, y = friendships_initiated / tenure),
       data = subset(pf, tenure >= 1)) +
  geom_line(aes(color = year_joined_backet),
            stat = 'summary',
            fun.y = mean)

p2 <- ggplot(aes(x = 7 * round(tenure / 7), y = friendships_initiated / tenure),
       data = subset(pf, tenure > 0)) +
  geom_line(aes(color = year_joined_backet),
            stat = "summary",
            fun.y = mean)

p3 <- ggplot(aes(x = 30 * round(tenure / 30), y = friendships_initiated / tenure),
       data = subset(pf, tenure > 0)) +
  geom_line(aes(color = year_joined_backet),
            stat = "summary",
            fun.y = mean)

p4 <- ggplot(aes(x = 90 * round(tenure / 90), y = friendships_initiated / tenure),
       data = subset(pf, tenure > 0)) +
  geom_line(aes(color = year_joined_backet),
            stat = "summary",
            fun.y = mean)

grid.arrange(p1, p2, p3, p4, ncol = 1)

# ggplot(aes(x = tenure, y = friendships_initiated / tenure),
#        data = subset(pf, tenure >= 1)) +
#   geom_smooth(aes(color = year_joined_backet))

Introducing the Yogurt Data Set

The Yogurt dataset is another dataset I will use in this document. It has a different structure from the Facebook dataset, as each client here has multiple rows of data, each containing some data of that user’s purchases.


yo <- read.csv("yogurt.csv")
summary(yo)
##       obs               id               time         strawberry     
##  Min.   :   1.0   Min.   :2100081   Min.   : 9662   Min.   : 0.0000  
##  1st Qu.: 696.5   1st Qu.:2114348   1st Qu.: 9843   1st Qu.: 0.0000  
##  Median :1369.5   Median :2126532   Median :10045   Median : 0.0000  
##  Mean   :1367.8   Mean   :2128592   Mean   :10050   Mean   : 0.6492  
##  3rd Qu.:2044.2   3rd Qu.:2141549   3rd Qu.:10255   3rd Qu.: 1.0000  
##  Max.   :2743.0   Max.   :2170639   Max.   :10459   Max.   :11.0000  
##    blueberry        pina.colada          plain         mixed.berry    
##  Min.   : 0.0000   Min.   : 0.0000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.: 0.0000   1st Qu.: 0.0000   1st Qu.:0.0000   1st Qu.:0.0000  
##  Median : 0.0000   Median : 0.0000   Median :0.0000   Median :0.0000  
##  Mean   : 0.3571   Mean   : 0.3584   Mean   :0.2176   Mean   :0.3887  
##  3rd Qu.: 0.0000   3rd Qu.: 0.0000   3rd Qu.:0.0000   3rd Qu.:0.0000  
##  Max.   :12.0000   Max.   :10.0000   Max.   :6.0000   Max.   :8.0000  
##      price      
##  Min.   :20.00  
##  1st Qu.:50.00  
##  Median :65.04  
##  Mean   :59.25  
##  3rd Qu.:68.96  
##  Max.   :68.96
head(yo)
##   obs      id  time strawberry blueberry pina.colada plain mixed.berry
## 1   1 2100081  9678          0         0           0     0           1
## 2   2 2100081  9697          0         0           0     0           1
## 3   3 2100081  9825          0         0           0     0           1
## 4   4 2100081  9999          0         0           0     0           1
## 5   5 2100081 10015          1         0           1     0           1
## 6   6 2100081 10029          1         0           2     0           1
##   price
## 1 58.96
## 2 58.96
## 3 65.04
## 4 65.04
## 5 48.96
## 6 65.04

Histograms Revisited

Most of the values in this dataset are integers, but I will convert one of the variable to a factor.

yo$id <- factor(yo$id)
str(yo)
## 'data.frame':    2380 obs. of  9 variables:
##  $ obs        : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ id         : Factor w/ 332 levels "2100081","2100370",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ time       : int  9678 9697 9825 9999 10015 10029 10036 10042 10083 10091 ...
##  $ strawberry : int  0 0 0 0 1 1 0 0 0 0 ...
##  $ blueberry  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ pina.colada: int  0 0 0 0 1 2 0 0 0 0 ...
##  $ plain      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ mixed.berry: int  1 1 1 1 1 1 1 1 1 1 ...
##  $ price      : num  59 59 65 65 49 ...
qplot(x = price, data = yo, fill = I("#F79420"))


Number of Purchases

It is worth to make a closer look at the distribution of prices in our dataset, as the previous plot showed the prices are discrete and some prices are quite frequent while there are no purchases at all for some other prices.

Afterwards, I will calculate and plot the total number of yogurt purchases by each user (household) in one time.

length(unique((yo$price)))
## [1] 20
table(yo$price)
## 
##    20 24.96 33.04  33.2 33.28 33.36 33.52 39.04    44 45.04 48.96 49.52 
##     2    11    54     1     1    22     1   234    21    11    81     1 
##  49.6    50 55.04 58.96    62 63.04 65.04 68.96 
##     1   205     6   303    15     2   799   609
yo <- transform(yo, all.purchases = strawberry + blueberry + pina.colada + plain + mixed.berry)

summary(yo$all.purchases)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   1.000   2.000   1.971   2.000  21.000
qplot(x = all.purchases, data = yo, 
      binwidth = 1,
      fill = I("#099DD9"))

This histogram reveals that most households buy one or two yogurts in a time.


Prices over Time

Now I will plot price over time.

ggplot(aes(x = time, y = price), data = yo) + 
  geom_jitter(alpha = 1/4, shape = 21, fill = I("#F79420"))

The plot shows that price seems to be increasing a bit over time.


Sampling Observations

When you are familiarazing yourself with a new dataset that contains multile observations of the same units, it is often useful to work with the sample of that unit so that it is easier to display the raw data for that unit. In terms of the Yogurt dataset, it is useful to look into the sample of several households with more details to know what could be the within-sample variations.


Looking at Samples of Households

set.seed(4230)

sample.ids <- sample(levels(yo$id), 16)

ggplot(aes(x = time, y = price),
       data = subset(yo, id %in% sample.ids)) +
  facet_wrap(~ id) +
  geom_line() +
  geom_point(aes(size = all.purchases), pch = 1)

The plot shows that some households purchase more yogurts than others. For most of the households the price of the yogurt tends to stay the same or increase over time, although there are some exceptions, e.g. id 2123463 or id 2141341. The down-peaks (as for id 2122705 or id 2124750) most probably represent coupons that reduce the price.


Scatterplot Matrix

It is sometimes very useful to create a matrix of scatterplots to speed up the data analysis.

library(GGally)
theme_set(theme_minimal(20))

set.seed(1836)

# I omit categorical variables
pf_subset <- pf[, c(2:15)]

names(pf_subset)
##  [1] "age"                   "dob_day"              
##  [3] "dob_year"              "dob_month"            
##  [5] "gender"                "tenure"               
##  [7] "friend_count"          "friendships_initiated"
##  [9] "likes"                 "likes_received"       
## [11] "mobile_likes"          "mobile_likes_received"
## [13] "www_likes"             "www_likes_received"
ggpairs(pf_subset[sample.int(nrow(pf_subset),1000),])

The ggpair uses different plot types for different types of combinations of variables. However, ggpairs does not do things like transforming axis into logarithmic ones etc.


Heat Maps

Here is an example of how to create a heat map for the data of another dataset, containing information of different genes.

nci <- read.table("nci.tsv")
colnames(nci) <- c(1:64)
nci.long.samp <- melt(as.matrix(nci[1:200,]))
names(nci.long.samp) <- c("gene", "case", "value")
head(nci.long.samp)
##   gene case  value
## 1    1    1  0.300
## 2    2    1  1.180
## 3    3    1  0.550
## 4    4    1  1.140
## 5    5    1 -0.265
## 6    6    1 -0.070
ggplot(aes(y = gene, x = case, fill = value),
  data = nci.long.samp) +
  geom_tile() +
  scale_fill_gradientn(colours = colorRampPalette(c("blue", "red"))(100))