Exploring Two Variables in R

by Jekaterina Novikova


Exploring Pseudo-Facebook User Data

Using the pseudo_facebook.tsv dataset, I am going to analyze users’ behaviour on Facebook to understand what they do there and how they behave.

Reading the Data

First, I read the data and check its summary. Next, I get the list of names for all the variables of the dataset.

pf <- read.csv('pseudo_facebook.tsv', sep = '\t')
# Alternatively:
# pf <- read.delim('pseudo_facebook.tsv')  

summary(pf)
##      userid             age            dob_day         dob_year   
##  Min.   :1000008   Min.   : 13.00   Min.   : 1.00   Min.   :1900  
##  1st Qu.:1298806   1st Qu.: 20.00   1st Qu.: 7.00   1st Qu.:1963  
##  Median :1596148   Median : 28.00   Median :14.00   Median :1985  
##  Mean   :1597045   Mean   : 37.28   Mean   :14.53   Mean   :1976  
##  3rd Qu.:1895744   3rd Qu.: 50.00   3rd Qu.:22.00   3rd Qu.:1993  
##  Max.   :2193542   Max.   :113.00   Max.   :31.00   Max.   :2000  
##                                                                   
##    dob_month         gender          tenure        friend_count   
##  Min.   : 1.000   female:40254   Min.   :   0.0   Min.   :   0.0  
##  1st Qu.: 3.000   male  :58574   1st Qu.: 226.0   1st Qu.:  31.0  
##  Median : 6.000   NA's  :  175   Median : 412.0   Median :  82.0  
##  Mean   : 6.283                  Mean   : 537.9   Mean   : 196.4  
##  3rd Qu.: 9.000                  3rd Qu.: 675.0   3rd Qu.: 206.0  
##  Max.   :12.000                  Max.   :3139.0   Max.   :4923.0  
##                                  NA's   :2                        
##  friendships_initiated     likes         likes_received    
##  Min.   :   0.0        Min.   :    0.0   Min.   :     0.0  
##  1st Qu.:  17.0        1st Qu.:    1.0   1st Qu.:     1.0  
##  Median :  46.0        Median :   11.0   Median :     8.0  
##  Mean   : 107.5        Mean   :  156.1   Mean   :   142.7  
##  3rd Qu.: 117.0        3rd Qu.:   81.0   3rd Qu.:    59.0  
##  Max.   :4144.0        Max.   :25111.0   Max.   :261197.0  
##                                                            
##   mobile_likes     mobile_likes_received   www_likes       
##  Min.   :    0.0   Min.   :     0.00     Min.   :    0.00  
##  1st Qu.:    0.0   1st Qu.:     0.00     1st Qu.:    0.00  
##  Median :    4.0   Median :     4.00     Median :    0.00  
##  Mean   :  106.1   Mean   :    84.12     Mean   :   49.96  
##  3rd Qu.:   46.0   3rd Qu.:    33.00     3rd Qu.:    7.00  
##  Max.   :25111.0   Max.   :138561.00     Max.   :14865.00  
##                                                            
##  www_likes_received 
##  Min.   :     0.00  
##  1st Qu.:     0.00  
##  Median :     2.00  
##  Mean   :    58.57  
##  3rd Qu.:    20.00  
##  Max.   :129953.00  
## 
names(pf)
##  [1] "userid"                "age"                  
##  [3] "dob_day"               "dob_year"             
##  [5] "dob_month"             "gender"               
##  [7] "tenure"                "friend_count"         
##  [9] "friendships_initiated" "likes"                
## [11] "likes_received"        "mobile_likes"         
## [13] "mobile_likes_received" "www_likes"            
## [15] "www_likes_received"
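
The two ways of reading the file shown above should give the same result, because read.delim() is read.csv() with a tab as the default separator. The following is a minimal sketch that checks this on a tiny made-up file; the column values are invented for illustration and are not taken from pseudo_facebook.tsv.

```r
# Write a tiny tab-separated file to a temporary location (toy data only)
tmp <- tempfile(fileext = ".tsv")
writeLines(c("userid\tage\tgender",
             "1\t14\tmale",
             "2\t31\tfemale",
             "3\t52\tNA"), tmp)

# Two ways of reading the same file
a <- read.csv(tmp, sep = "\t")
b <- read.delim(tmp)

# Compare the two data frames and inspect the column types
same <- identical(a, b)
print(same)
str(a)

# Clean up the temporary file
unlink(tmp)
```

Note that the literal string NA in the file is read as a missing value, which is how the NA's rows in the summary above arise.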

Scatterplots

Scatterplots are used to look at two continuous variables at the same time on one plot. Here again, I will use both the simple qplot command and the full ggplot syntax from the ggplot2 library to produce scatterplots.


Scatterplots and Perceived Audience Size

library(ggplot2)

qplot(age, friend_count, data = pf)


Some things to notice right away:

It looks like younger users (under the age of thirty) have many more friends than users in the other age ranges.

The tall, dense vertical lines, such as those around the ages of 69 and 100, most likely show where users lied about their age.


ggplot Syntax

summary(pf$age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   13.00   20.00   28.00   37.28   50.00  113.00
ggplot(aes(x = age, y = friend_count), data = pf) + 
  geom_point() +
  xlim(13, 90)
## Warning: Removed 4906 rows containing missing values (geom_point).
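
The warning above appears because xlim() sets scale limits, and ggplot2 drops every row outside those limits before drawing. By contrast, coord_cartesian() only zooms the view and keeps all rows. The sketch below emulates the dropping step in base R on made-up ages, to show that it also changes any summary computed afterwards.

```r
# Toy ages, invented for illustration
age <- c(13, 25, 40, 90, 95, 103, 113)

# Emulate xlim(13, 90): values outside the limits are turned into NA
age_limited <- ifelse(age >= 13 & age <= 90, age, NA)

# Number of rows that would be reported as removed
n_removed <- sum(is.na(age_limited))

# A summary over all rows vs. over the rows that survive the limits
mean_all     <- mean(age)
mean_limited <- mean(age_limited, na.rm = TRUE)

print(c(n_removed = n_removed, mean_all = mean_all, mean_limited = mean_limited))
```

This matters whenever a summary layer is drawn on the same plot: with scale limits the summary sees only the remaining rows, while with coord_cartesian() it sees all of them.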


Overplotting

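Overplotting makes it difficult to tell how many points there are in each region. To reduce it, I use geom_jitter(alpha = 1/20): jittering adds a small amount of random noise to each point's position, and alpha = 1/20 draws each point at 5% opacity, so a region only looks dark where many points overlap.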

ggplot(aes(x = age, y = friend_count), data = pf) + 
  geom_jitter(alpha = 1/20) +
  xlim(13, 90)


Coord_trans()

Next, I will add a transformation to the y axis to make the data more readable.

# position = position_jitter(h = 0) jitters the points horizontally only.
# Vertical jitter could push users with 0 friends below zero,
# where the square root is undefined.

ggplot(aes(x = age, y = friend_count), data = pf) + 
  geom_point(alpha = 1/20, position = position_jitter(h = 0)) +
  xlim(13, 90) + 
  coord_trans(y = 'sqrt')
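
To illustrate why the vertical jitter is switched off, here is a small base R sketch. The jitter offsets are fixed, hypothetical values chosen for the example (real jitter is random).

```r
# Toy friend counts, including users with 0 friends
friend_count <- c(0, 0, 1, 4, 9)

# Hypothetical vertical jitter offsets (fixed here so the example is repeatable)
offsets  <- c(-0.3, 0.2, -0.1, 0.3, -0.2)
jittered <- friend_count + offsets

# The square root is defined for every original value ...
sqrt_original <- sqrt(friend_count)

# ... but a value jittered below zero yields NaN (R also emits a warning)
sqrt_jittered <- suppressWarnings(sqrt(jittered))

print(sqrt_original)
print(sqrt_jittered)
```

With h = 0 the y values are left untouched, so the square root transformation never sees a negative number.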

Alpha and Jitter

Now, I examine the relationship between friendships_initiated and age using the ggplot syntax.

ggplot(aes(x = age, y = friendships_initiated), data = pf) +
  geom_point(alpha = 1/10, position = position_jitter(h = 0)) +
  xlim(13, 90) +
  coord_trans(y = 'sqrt')


Conditional Means

It is sometimes useful to see how the average value of one variable changes with the value of another variable. I will use the dplyr package for this.

For example, I will look at how the average number of friends changes with the age of the users.

library(dplyr)
## 
## Attaching package: 'dplyr'
## 
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
age_groups <- group_by(pf, age)
pf.fc_by_age <- summarise(age_groups,
          friend_count_mean = mean(friend_count),
          friend_count_median = median(friend_count),
          n = n())
pf.fc_by_age <- arrange(pf.fc_by_age, age)
head(pf.fc_by_age)
## Source: local data frame [6 x 4]
## 
##     age friend_count_mean friend_count_median     n
##   (int)             (dbl)               (dbl) (int)
## 1    13          164.7500                74.0   484
## 2    14          251.3901               132.0  1925
## 3    15          347.6921               161.0  2618
## 4    16          351.9371               171.5  3086
## 5    17          350.3006               156.0  3283
## 6    18          331.1663               162.0  5196
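
As a cross-check of what group_by() and summarise() compute, the same kind of table can be built in base R with tapply(). This is only a sketch on a made-up data frame (the toy values are not from the dataset), not the method used in this analysis.

```r
# Toy data, invented for illustration
toy <- data.frame(age          = c(13, 13, 14, 14, 14, 15),
                  friend_count = c(10, 30,  5, 50, 95, 200))

# One row per age: mean, median and number of users
fc_by_age <- data.frame(
  age                 = sort(unique(toy$age)),
  friend_count_mean   = as.vector(tapply(toy$friend_count, toy$age, mean)),
  friend_count_median = as.vector(tapply(toy$friend_count, toy$age, median)),
  n                   = as.vector(tapply(toy$friend_count, toy$age, length))
)

print(fc_by_age)
```

Each row is a conditional summary: the mean and median of friend_count given a particular age, together with the group size n.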

Now that I have all the necessary summarised information, I will create the plot:

ggplot(aes(x = age, y = friend_count_mean), data = pf.fc_by_age) +
  geom_line()


Overlaying Summaries with Raw Data

Now, I will show both the original raw data and the summary information on the same plot.

# fun and fun.args need ggplot2 >= 3.3.0 (older versions used fun.y)
ggplot(aes(x = age, y = friend_count), data = pf) + 
  geom_point(alpha = 1/20, 
             position = position_jitter(h = 0),
             color = "orange") +
  coord_cartesian(xlim = c(13, 70), ylim = c(0, 1000)) +
  # black line: mean friend count at each age
  geom_line(stat = "summary", fun = mean) +
  # dashed blue lines: 10% and 90% quantiles; solid blue line: median
  geom_line(stat = "summary", fun = quantile, fun.args = list(probs = 0.1),
            linetype = 2, color = "blue") +
  geom_line(stat = "summary", fun = quantile, fun.args = list(probs = 0.5),
            color = "blue") +
  geom_line(stat = "summary", fun = quantile, fun.args = list(probs = 0.9),
            linetype = 2, color = "blue")

Some observations from the plot:

Having more than 1000 friends is rare, even for the younger users, as their 90% quantile is well below 1000.
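
For reference, the quantile lines on the plot are computed with quantile() separately at each age. The sketch below shows what the 10%, 50% and 90% quantiles look like for a single made-up vector of friend counts.

```r
# Toy friend counts for one hypothetical age group
friend_count <- 0:10

# 10%, 50% (median) and 90% quantiles, as used for the blue lines
q <- quantile(friend_count, probs = c(0.1, 0.5, 0.9))
print(q)

# Share of users at or below the 90% quantile
share_below <- mean(friend_count <= q[["90%"]])
print(share_below)
```

Reading the 90% line on the plot works the same way: at any given age, about nine users in ten have a friend count at or below that line.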


Correlation

I will calculate the correlation coefficient to measure the strength of the linear relationship between age and the number of friends users have.

cor.test(pf$age, pf$friend_count, method = "pearson")
## 
##  Pearson's product-moment correlation
## 
## data:  pf$age and pf$friend_count
## t = -8.6268, df = 99001, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.03363072 -0.02118189
## sample estimates:
##         cor 
## -0.02740737
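
The coefficient reported by cor.test() is Pearson's r, the covariance of the two variables divided by the product of their standard deviations. A minimal sketch on made-up vectors, comparing a hand computation with the built-in cor():

```r
# Toy vectors, invented for illustration
x <- c(1, 2, 3, 4, 5)
y <- c(2, 1, 4, 3, 5)

# Pearson's r from its definition
r_manual <- sum((x - mean(x)) * (y - mean(y))) /
  sqrt(sum((x - mean(x))^2) * sum((y - mean(y))^2))

# The same quantity from the built-in function
r_builtin <- cor(x, y, method = "pearson")

print(c(manual = r_manual, builtin = r_builtin))
```

In the output above, r is about -0.027: the p-value is tiny because the sample is very large, but the linear relationship itself is very weak.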

Correlation on Subsets

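As the most meaningful data in our dataset is for users aged 70 or younger, I will use only this subset for calculating the correlation. This time I use Spearman's rank correlation, which measures monotonic rather than strictly linear association.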

with( subset(pf, age <= 70), cor.test(age, friend_count,
                                      method = "spearman"))
## 
##  Spearman's rank correlation rho
## 
## data:  age and friend_count
## S = 1.5782e+14, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##        rho 
## -0.2552934
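
Spearman's rho is Pearson's r computed on the ranks of the data, so it captures any monotonic relationship, not only a linear one. A short sketch on made-up data where the relationship is perfectly monotonic but far from linear:

```r
# Toy data: y grows monotonically but non-linearly with x
x <- 1:10
y <- exp(x)

pearson          <- cor(x, y)                       # linear association
spearman         <- cor(x, y, method = "spearman")  # rank-based association
spearman_by_hand <- cor(rank(x), rank(y))           # Pearson on the ranks

print(c(pearson = pearson, spearman = spearman, by_hand = spearman_by_hand))
```

Here the rank correlation is exactly 1, while Pearson's r is noticeably lower because the points do not lie on a straight line.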

Create Scatterplots

Next, I will create a scatterplot of two highly correlated variables, likes_received and www_likes_received.

ggplot(aes(x = www_likes_received, y = likes_received), data = pf) + 
  geom_point(alpha = 1/5, position = position_jitter(h = 0, w = 0)) +
  xlim(0, quantile(pf$www_likes_received, 0.95)) + 
  ylim(0, quantile(pf$likes_received, 0.95)) +
  #coord_trans(x = "sqrt", y = "sqrt") +
  geom_smooth(method = "lm", color = "red")


Strong Correlations

To determine the correlation between these two variables, I use the cor.test() command.

cor.test(pf$www_likes_received, pf$likes_received)
## 
##  Pearson's product-moment correlation
## 
## data:  pf$www_likes_received and pf$likes_received
## t = 937.1, df = 99001, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.9473553 0.9486176
## sample estimates:
##       cor 
## 0.9479902

More Caution with Correlation

Correlation coefficients may be deceptive. To show this, I will use some data from the Mitchell dataset.

#install.packages('alr3')
library(alr3)
## Loading required package: car
data("Mitchell")

Now, I will create a scatterplot of Temperature vs Months data, provided by this dataset.

ggplot(aes(x = Month, y = Temp), data = Mitchell) +
  geom_point()

cor.test(Mitchell$Month, Mitchell$Temp)
## 
##  Pearson's product-moment correlation
## 
## data:  Mitchell$Month and Mitchell$Temp
## t = 0.81816, df = 202, p-value = 0.4142
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.08053637  0.19331562
## sample estimates:
##        cor 
## 0.05747063

Noisy Scatterplots

The scatterplot above looks quite noisy, and the correlation coefficient between the two variables is very low as well.

However, I know that months are discrete and repeat every 12 months, so I will analyse the data a bit further with this in mind.

Making Sense of Data

I will add breaks every 12 months to the x axis to mark each year as a repeated cycle. I will also change the graph from a scatterplot to a line plot.

To find the range of the variable Month, I check its summary.

summary(Mitchell$Month)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00   50.75  101.50  101.50  152.20  203.00
# Month is numeric (0 to 203), so a continuous scale is used,
# with one break every 12 months
ggplot(aes(x = Month, y = Temp), data = Mitchell) +
  geom_line() +
  scale_x_continuous(breaks = seq(0, 203, 12))


A New Perspective

With such a plot, we see that the temperature fluctuations are seasonal. So it is always important to put the data in context and not to rely on the correlation coefficient alone.
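
The same effect can be reproduced without the Mitchell data. The sketch below uses a synthetic sine wave as a stand-in for temperature (an assumption made purely for illustration) and shows that a strong seasonal pattern can coexist with a near-zero correlation.

```r
# Synthetic stand-in for the Mitchell data: 204 months of a pure yearly cycle
month <- 0:203
temp  <- sin(2 * pi * month / 12)

# The linear correlation between month and temperature is close to zero
r <- cor(month, temp)
print(r)

# Folding the months onto a 12-month cycle reveals the seasonal pattern
by_month <- tapply(temp, month %% 12, mean)
print(round(by_month, 2))
```

Folding with month %% 12 is one way to put the data in context: the per-month means swing across the full range of the series even though the correlation with the running month index is negligible.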


Understanding Noise: Age to Age Months

Let’s return to the original dataset of Facebook users and further analyse the relationship between users’ age and the number of their friends.

ggplot(aes(x = age, y = friend_count_mean), data = pf.fc_by_age) +
  geom_line()

Some of the variation here makes sense, e.g. the peak around the age of 69, but the rest is just noise.


Age with Months Means

I will plot the same relationship between age and the number of friends, but this time I will measure age in months rather than in years to see how the noise increases with this change.

pf$age_with_months <- pf$age + (12 - pf$dob_month)/12

age_with_months <- group_by(pf, age_with_months)
pf.fc_by_age_months <- summarise(age_with_months,
          friend_count_mean = mean(friend_count),
          friend_count_median = median(friend_count),
          n = n())
pf.fc_by_age_months <- arrange(pf.fc_by_age_months, age_with_months)
head(pf.fc_by_age_months)
## Source: local data frame [6 x 4]
## 
##   age_with_months friend_count_mean friend_count_median     n
##             (dbl)             (dbl)               (dbl) (int)
## 1        13.16667          46.33333                30.5     6
## 2        13.25000         115.07143                23.5    14
## 3        13.33333         136.20000                44.0    25
## 4        13.41667         164.24242                72.0    33
## 5        13.50000         131.17778                66.0    45
## 6        13.58333         156.81481                64.0    54
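
To make the age_with_months formula above concrete, here is a small worked example on made-up users. My reading, which is an assumption and not stated in the data, is that the formula treats ages as measured at the end of the year, so a user born in December gets exactly age and each earlier birth month adds one twelfth of a year.

```r
# Three hypothetical users, all aged 36, born in January, March and December
age       <- c(36, 36, 36)
dob_month <- c(1, 3, 12)

# Formula used in the analysis
age_with_months <- age + (12 - dob_month) / 12

# An algebraically equivalent way of writing it
age_with_months_alt <- age + (1 - dob_month / 12)

print(age_with_months)
```

The more finely age is divided, the fewer users fall into each group (compare the n column above), which is why the conditional means become noisier.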

Now, I plot the results and it is a much noisier plot.

ggplot(aes(x = age_with_months, y = friend_count_mean), 
       data = subset(pf.fc_by_age_months, age_with_months < 71)) +
  geom_line()


Noise in Conditional Means

I will put the two plots one above the other to make the difference easier to see.

library(gridExtra)

p1 <- ggplot(aes(x = age, y = friend_count_mean), data = subset(pf.fc_by_age, age < 71)) +
  geom_line()

p2 <- ggplot(aes(x = age_with_months, y = friend_count_mean), 
       data = subset(pf.fc_by_age_months, age_with_months < 71)) +
  geom_line()

grid.arrange(p2, p1, ncol = 1)


Smoothing Conditional Means

To reduce the noise further, I bin the age variable into 5-year groups by dividing it by 5, rounding and multiplying by 5. I also overlay a smoother (geom_smooth) on the first two plots.

With wider bins, the mean in each bin is estimated more precisely, but we will most probably miss some important features of the relationship.

p1 <- ggplot(aes(x = age, y = friend_count_mean), data = subset(pf.fc_by_age, age < 71)) +
  geom_line() +
  geom_smooth()

p2 <- ggplot(aes(x = age_with_months, y = friend_count_mean), 
       data = subset(pf.fc_by_age_months, age_with_months < 71)) +
  geom_line() +
  geom_smooth()

# fun needs ggplot2 >= 3.3.0 (older versions used fun.y)
p3 <- ggplot(aes(x = round(age/5)*5, y = friend_count_mean),
             data = subset(pf.fc_by_age, age < 71)) +
  geom_line(stat = "summary", fun = mean)

grid.arrange(p2, p1, p3, ncol = 1)
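
The binning step can be illustrated on a small made-up table of per-age means. One caveat worth checking, shown in the sketch below, is that averaging per-age means treats every age equally, whereas weighting by the group size n gives the mean over all users in the bin. The numbers are invented for illustration.

```r
# Toy per-age summary, invented for illustration
by_age <- data.frame(age               = c(13, 14, 15, 16, 17, 18),
                     friend_count_mean = c(100, 200, 300, 400, 500, 600),
                     n                 = c(10, 10, 10, 10, 60, 20))

# Assign each age to the nearest multiple of 5
by_age$age_bin <- round(by_age$age / 5) * 5

# Unweighted: the mean of the per-age means within each bin
unweighted <- tapply(by_age$friend_count_mean, by_age$age_bin, mean)

# Weighted by n: the mean over all users within each bin
weighted <- sapply(split(by_age, by_age$age_bin),
                   function(d) weighted.mean(d$friend_count_mean, d$n))

print(rbind(unweighted, weighted))
```

When the groups are of similar size the two versions are close; when one age dominates a bin, as age 17 does here, they can differ noticeably.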


Which Plot to Choose?

In exploratory data analysis, you do not need to choose just one plot. Sometimes, different plots reveal different details about the same data.


Exploratory Data Analysis with More than Two Variables

I describe exploratory data analysis with more than two variables in further documents. Please check my website for more details.