Exploring One Variable in R

by Jekaterina Novikova

Exploring Pseudo-Facebook User Data

Using the dataset pseudo_facebook.tsv, I am going to analyze users’ behaviour on Facebook: what they do there and how they use it.

Reading the Data

First, I read the data and check its summary. Next, I get the list of names for all the variables of the dataset.

pf <- read.csv('pseudo_facebook.tsv', sep = '\t')
summary(pf)

##      userid             age            dob_day         dob_year   
##  Min.   :1000008   Min.   : 13.00   Min.   : 1.00   Min.   :1900  
##  1st Qu.:1298806   1st Qu.: 20.00   1st Qu.: 7.00   1st Qu.:1963  
##  Median :1596148   Median : 28.00   Median :14.00   Median :1985  
##  Mean   :1597045   Mean   : 37.28   Mean   :14.53   Mean   :1976  
##  3rd Qu.:1895744   3rd Qu.: 50.00   3rd Qu.:22.00   3rd Qu.:1993  
##  Max.   :2193542   Max.   :113.00   Max.   :31.00   Max.   :2000  
##                                                                   
##    dob_month         gender          tenure        friend_count   
##  Min.   : 1.000   female:40254   Min.   :   0.0   Min.   :   0.0  
##  1st Qu.: 3.000   male  :58574   1st Qu.: 226.0   1st Qu.:  31.0  
##  Median : 6.000   NA's  :  175   Median : 412.0   Median :  82.0  
##  Mean   : 6.283                  Mean   : 537.9   Mean   : 196.4  
##  3rd Qu.: 9.000                  3rd Qu.: 675.0   3rd Qu.: 206.0  
##  Max.   :12.000                  Max.   :3139.0   Max.   :4923.0  
##                                  NA's   :2                        
##  friendships_initiated     likes         likes_received    
##  Min.   :   0.0        Min.   :    0.0   Min.   :     0.0  
##  1st Qu.:  17.0        1st Qu.:    1.0   1st Qu.:     1.0  
##  Median :  46.0        Median :   11.0   Median :     8.0  
##  Mean   : 107.5        Mean   :  156.1   Mean   :   142.7  
##  3rd Qu.: 117.0        3rd Qu.:   81.0   3rd Qu.:    59.0  
##  Max.   :4144.0        Max.   :25111.0   Max.   :261197.0  
##                                                            
##   mobile_likes     mobile_likes_received   www_likes       
##  Min.   :    0.0   Min.   :     0.00     Min.   :    0.00  
##  1st Qu.:    0.0   1st Qu.:     0.00     1st Qu.:    0.00  
##  Median :    4.0   Median :     4.00     Median :    0.00  
##  Mean   :  106.1   Mean   :    84.12     Mean   :   49.96  
##  3rd Qu.:   46.0   3rd Qu.:    33.00     3rd Qu.:    7.00  
##  Max.   :25111.0   Max.   :138561.00     Max.   :14865.00  
##                                                            
##  www_likes_received 
##  Min.   :     0.00  
##  1st Qu.:     0.00  
##  Median :     2.00  
##  Mean   :    58.57  
##  3rd Qu.:    20.00  
##  Max.   :129953.00  
##

names(pf)

##  [1] "userid"                "age"                  
##  [3] "dob_day"               "dob_year"             
##  [5] "dob_month"             "gender"               
##  [7] "tenure"                "friend_count"         
##  [9] "friendships_initiated" "likes"                
## [11] "likes_received"        "mobile_likes"         
## [13] "mobile_likes_received" "www_likes"            
## [15] "www_likes_received"
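
Before plotting, it can help to see how each column was read and where the missing values are. Below is a minimal sketch, assuming a tiny made-up table (toy) in place of pseudo_facebook.tsv. Since R 4.0, read.csv() and read.delim() no longer convert strings to factors by default, so stringsAsFactors = TRUE is needed to get a factor-style summary for gender like the one above.

```r
# Toy stand-in for pseudo_facebook.tsv; the values are made up for illustration.
tsv_text <- "userid\tage\tgender\ttenure\n1\t14\tmale\t266\n2\t20\tfemale\tNA\n3\t35\tNA\t82\n"

# stringsAsFactors = TRUE makes gender a factor, as in the summary above.
toy <- read.delim(text = tsv_text, stringsAsFactors = TRUE)

str(toy)                          # column types at a glance
na_counts <- colSums(is.na(toy))  # number of missing values per column
print(na_counts)
```

The same two calls, str(pf) and colSums(is.na(pf)), should work on the full dataset.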

Histogram of Users’ Birthdays

I will look at a histogram of users’ birthdays, using the ggplot2 package:

#install.packages('ggplot2')
library(ggplot2)

# dob_day is numeric, so it needs a continuous (not discrete) x scale
qplot(x = dob_day, data = pf, binwidth = 1) +
  scale_x_continuous(breaks = 1:31)

What are some things to notice about this histogram?

According to the histogram, the most common birthday by far is the first day of the month. This does not look plausible.

The number of people born on the 31st day of a month is the smallest. This, however, is expected, as not every month has 31 days.

Faceting

I am going to break the histogram into 12 histograms, one for each month of a year.

# Two options here: facet_wrap and facet_grid
# facet_wrap(~variable)
# facet_grid(vertical~horizontal)

qplot(x = dob_day, data = pf, binwidth = 1) +
  scale_x_continuous(breaks = 1:31) +
  facet_wrap(~dob_month, ncol = 3)

What are some things to notice about this histogram?

This new plot shows that only the first day of the first month (January) is an outlier. This suggests that ‘1st of January’ is the default value Facebook offers for the date of birth, and that some people provide false information by keeping the default. This outlier represents bad data in our dataset.
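
The same check can be done numerically by cross-tabulating month and day and locating the largest cell. This is a minimal sketch, assuming a small made-up data frame (toy) in place of pf:

```r
# Made-up birthdays; 1 January is deliberately over-represented.
toy <- data.frame(
  dob_month = c(1, 1, 1, 1, 2, 2, 3, 3, 12, 12),
  dob_day   = c(1, 1, 1, 15, 1, 20, 1, 9, 1, 31)
)

# Rows are months, columns are days of the month.
counts <- table(toy$dob_month, toy$dob_day)

# Position of the largest cell, translated back to month and day values.
peak <- which(counts == max(counts), arr.ind = TRUE)
peak_month <- as.integer(rownames(counts)[peak[1, 1]])
peak_day   <- as.integer(colnames(counts)[peak[1, 2]])

# Share of all users whose birthday is recorded as 1 January.
share_jan_first <- mean(toy$dob_month == 1 & toy$dob_day == 1)
print(c(month = peak_month, day = peak_day, share = share_jan_first))
```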

Friend Count and Limiting Axes

Now, I will look at the histogram of friend count.

qplot(x = friend_count, data = pf)

The data has a long right tail along the x axis, which squeezes most of the users into the first few bins. I will limit the x axis to see the bulk of the data in more detail.

qplot(x = friend_count, data = pf, xlim = c(0,1000))

# There is another way to limit the scale:
# qplot(x = friend_count, data = pf) +
#   scale_x_continuous(limits = c(0,1000)) + 
#   scale_y_continuous(limits = c(0,20000))
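
Limits set through xlim or a continuous scale remove the observations outside the range before the histogram is drawn, so it is worth knowing how many users that affects. A minimal sketch, assuming a short made-up vector in place of pf$friend_count:

```r
# Made-up friend counts, including a few in the long tail.
friend_count <- c(0, 5, 31, 82, 206, 999, 1000, 1500, 4923)

# TRUE for values that fall outside the plotting limits c(0, 1000).
outside <- friend_count < 0 | friend_count > 1000

n_dropped  <- sum(outside)    # how many observations the limits exclude
share_kept <- mean(!outside)  # fraction of observations still plotted
print(c(dropped = n_dropped, kept = share_kept))
```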

Adjusting Bin Width

To make the histogram more readable, I set the bin width to 10 friends and add an axis break every 50 friends.

qplot(x = friend_count, data = pf, binwidth = 10) +
  scale_x_continuous(limits = c(0,1000), breaks = seq(0,1000,50))
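
To see what a bin width does, the binning can be reproduced by hand with cut() and table(). This is only a sketch of the idea on made-up values with an assumed width of 25; ggplot2 places its bin boundaries slightly differently, so the counts per bar need not match exactly.

```r
# Made-up friend counts between 0 and 100.
friend_count <- c(0, 3, 9, 10, 24, 25, 49, 50, 51, 99)

binwidth <- 25
breaks <- seq(0, 100, by = binwidth)

# Left-closed bins [0,25), [25,50), [50,75), [75,100].
bins <- cut(friend_count, breaks = breaks, right = FALSE, include.lowest = TRUE)

bin_counts <- table(bins)  # the bar heights a histogram would draw
print(bin_counts)
```

A smaller binwidth gives more, narrower bars and shows more detail, at the cost of a noisier picture.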

Faceting Friend Count

Now, I facet the histogram of friend counts by gender. This will help answer the question of who has more friends on average, males or females.

qplot(x = friend_count, data = pf, binwidth = 10) +
  scale_x_continuous(limits = c(0, 1000),
                     breaks = seq(0, 1000, 50)) + 
  facet_wrap(~gender)

The plot includes a third panel for users whose gender is missing (NA). We do not need these missing values for our analysis.

Omitting NA Values

I omit the data with NA values of gender.

# na.omit() would drop every row that has an NA in any column, not
# only the rows where gender is NA. Instead, I use is.na().

qplot(x = friend_count, data = subset(pf, !is.na(gender)), binwidth = 10) +
  scale_x_continuous(limits = c(0, 1000),
                     breaks = seq(0, 1000, 50)) + 
  facet_wrap(~gender)
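
The difference between the two approaches is easy to see on a small example. A minimal sketch, assuming a made-up data frame (toy) with missing values in two different columns:

```r
# Made-up rows: row 2 has a missing tenure, row 3 has a missing gender.
toy <- data.frame(
  gender       = factor(c("male", "female", NA, "female", "male")),
  tenure       = c(266, NA, 82, 412, 675),
  friend_count = c(0, 31, 82, 206, 4923)
)

all_complete <- na.omit(toy)                 # drops rows 2 and 3
gender_known <- subset(toy, !is.na(gender))  # drops row 3 only

print(nrow(all_complete))
print(nrow(gender_known))
```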

It is difficult to say who has more friends, males or females, by just looking at the histogram of friend counts.

Statistics ‘by’ Gender

The output of the table command shows there are noticeably more males (58574) than females (40254) in our dataset.

table(pf$gender)

## 
## female   male 
##  40254  58574

by(pf$friend_count, pf$gender, summary)

## pf$gender: female
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0      37      96     242     244    4923 
## -------------------------------------------------------- 
## pf$gender: male
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0      27      74     165     182    4917

The command by gives us enough information to answer the following questions.

Who on average has more friends: men or women?

Response: females have more friends on average; their mean friend count is 242 vs 165 for males.

What’s the difference between the median friend count for women and men?

Response: the difference is 22 (a median of 96 for females vs 74 for males).
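
When only one statistic per group is needed, tapply() returns it as a named vector, which makes the difference easy to compute. A minimal sketch, assuming a made-up data frame (toy) in place of pf:

```r
# Made-up friend counts for three females and three males.
toy <- data.frame(
  gender       = c("female", "female", "female", "male", "male", "male"),
  friend_count = c(10, 100, 400, 5, 70, 300)
)

means   <- tapply(toy$friend_count, toy$gender, mean)
medians <- tapply(toy$friend_count, toy$gender, median)

# Difference between the median friend count of females and males.
median_gap <- medians[["female"]] - medians[["male"]]
print(means)
print(median_gap)
```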

Tenure

Next, I will analyze the tenure of Facebook users. In other words, I will look at how long (in days) people have been using Facebook, according to our dataset.

qplot(x = tenure, data = pf, binwidth = 30,
  color = I('black'), fill = I('#099DD9'))

# Equivalent ggplot syntax: 
# ggplot(aes(x = tenure), data = pf) + 
#    geom_histogram(binwidth = 30, color = 'black', fill = '#099DD9')

Creating a histogram of tenure by year.

In order to create a histogram of tenure by year, I change the x value to tenure/365 and modify the binwidth value to be equal to 0.25 (quarters).

qplot(x = tenure/365, data = pf, binwidth = 0.25,
  color = I('black'), fill = I('#F79420')) + 
  scale_x_continuous(breaks = seq(1,7,1), limits = c(0,7))

# Equivalent ggplot syntax: 
# ggplot(aes(x = tenure/365), data = pf) + 
#    geom_histogram(binwidth = .25, color = 'black', fill = '#F79420')
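
The conversion behind this plot is simple arithmetic: days divided by 365 give years, and a bin width of 0.25 groups users by quarter of a year. A minimal sketch on made-up tenure values (ggplot2 centres its bins differently, so this only illustrates the grouping):

```r
# Made-up tenure values in days.
tenure_days <- c(0, 91, 365, 400, 730, 3139)

tenure_years <- tenure_days / 365

# Start of the quarter-year interval each user falls into.
quarter_start <- floor(tenure_years / 0.25) * 0.25
print(quarter_start)

# Users beyond the 7-year axis limit would not be shown in the plot above.
n_beyond_limit <- sum(tenure_years > 7)
print(n_beyond_limit)
```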

Labeling Plots

I label the axes to make the plot easily understandable to viewers.

qplot(x = tenure/365, data = pf, binwidth = 0.25,
      xlab = "Number of years using Facebook",
      ylab = "Number of users in sample",
  color = I('black'), fill = I('#F79420')) + 
  scale_x_continuous(breaks = seq(1,7,1), limits = c(0,7))

# Equivalent ggplot syntax: 
# ggplot(aes(x = tenure / 365), data = pf) + 
#   geom_histogram(binwidth = 0.25, color = 'black', fill = '#F79420') + 
#   scale_x_continuous(breaks = seq(1, 7, 1), limits = c(0, 7)) + 
#   xlab('Number of years using Facebook') + 
#   ylab('Number of users in sample')

User Ages

Now, I create a histogram of users’ ages. But first, I check the summary for the age variable to find out the min and max values of the users’ age and use them to update the x axis.

summary(pf$age)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   13.00   20.00   28.00   37.28   50.00  113.00

qplot(x = age, data = pf, binwidth = 1,
      xlab = "Age of a user",
      ylab = "Number of users in sample",
  fill = I('#5760AB')) +
  scale_x_continuous(breaks = seq(13,113,5))

Some observations:

There are no users younger than 13, which corresponds to the Facebook policy. The largest number of users is around the age of 20, and there is also a peak in the histogram for ages above 100. That is most likely false data.
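
These observations can be checked with a couple of one-liners. A minimal sketch, assuming a short made-up vector in place of pf$age and treating ages above 100 as suspicious (the threshold is my assumption):

```r
# Made-up ages, with a few implausibly high values.
age <- c(13, 18, 20, 20, 21, 28, 50, 103, 108, 108, 113)

youngest <- min(age)

# Most frequent age; which.max() returns the first one in case of a tie.
modal_age <- as.integer(names(which.max(table(age))))

# Number of users reporting an age above 100.
n_over_100 <- sum(age > 100)
print(c(youngest = youngest, modal = modal_age, over_100 = n_over_100))
```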

Transforming Data

Sometimes we need to transform the data to make its distribution look more like a normal distribution. For example, the histogram of friend counts is strongly right-skewed, so it is a good candidate for a transformation.

summary(pf$friend_count)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0    31.0    82.0   196.4   206.0  4923.0

summary(log10(pf$friend_count + 1)) # we add 1 to avoid -Inf for those users who have 0 friends

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   1.505   1.919   1.868   2.316   3.692

summary(sqrt(pf$friend_count))

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   5.568   9.055  11.090  14.350  70.160

#install.packages('gridExtra')
library(gridExtra)

p1 <- qplot(x = friend_count, data = pf)
p2 <- qplot(x = log10(friend_count+1), data = pf)
p3 <- qplot(x = sqrt(friend_count), data = pf)
grid.arrange(p1, p2, p3, ncol = 1)

# Similar plots using scale layers (note: scale_x_log10() does not add 1,
# so users with 0 friends are dropped from p2):
# p1 <- ggplot(aes(x = friend_count), data = pf) + geom_histogram()
# p2 <- p1 + scale_x_log10()
# p3 <- p1 + scale_x_sqrt()
# grid.arrange(p1, p2, p3, ncol = 1)
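
The reason for adding 1 before taking the logarithm is easy to demonstrate. A minimal sketch on made-up friend counts:

```r
# Made-up friend counts, including a user with no friends.
friend_count <- c(0, 9, 99, 999)

plain_log   <- log10(friend_count)      # first element is -Inf
shifted_log <- log10(friend_count + 1)  # finite everywhere
root        <- sqrt(friend_count)       # defined at 0, no shift needed

print(plain_log)
print(shifted_log)
print(root)
```

Both transformations compress the long tail; the log compresses large values more strongly than the square root.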

Add a Scaling Layer

There are two ways to transform the variable: the first wraps the variable in a function such as log10() directly, and the second adds a scaling layer to the plot. The difference is in the x axis, which is labeled in the original units of friend_count when a scaling layer is used, instead of in units of log10(friend_count).

logScale <- qplot(x = log10(friend_count), data = pf)

countScale <- ggplot(aes(x = friend_count), data = pf) + 
  geom_histogram() + 
  scale_x_log10()

grid.arrange(logScale, countScale, ncol = 2)
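
The relationship between the two x axes is just a power of ten: a position labeled 2 on the log10(friend_count) axis is the same position as 100 on the scaled axis. A minimal sketch of that mapping (the tick positions are chosen for illustration):

```r
# Tick positions as they would appear on the log10(friend_count) axis.
log_ticks <- 0:3

# The same positions expressed in the original units of friend_count.
original_units <- 10^log_ticks
print(data.frame(log_ticks, original_units))
```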

Frequency Polygons

Frequency polygons let us show several groups of the data on the same plot. For example, we can look at the friend counts of both genders in a single frequency polygon plot.

qplot(x = friend_count, y = ..count../sum(..count..),
      data = subset(pf, !is.na(gender)), 
      xlab = "Friend Count",
      ylab = "Proportion of Users with that Friend Count",
      binwidth = 10, geom = "freqpoly", color = gender) +
  scale_x_continuous(limits = c(0, 1000),
                     breaks = seq(0, 1000, 50))

# Equivalent ggplot syntax: 
# ggplot(aes(x = friend_count, y = ..count../sum(..count..)), data = subset(pf, !is.na(gender))) +
#   geom_freqpoly(aes(color = gender), binwidth=10) + 
#   scale_x_continuous(limits = c(0, 1000), breaks = seq(0, 1000, 50)) + 
#   xlab('Friend Count') + 
#   ylab('Proportion of users with that friend count')
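
Plotting proportions instead of raw counts matters here because the two groups differ in size. One way to compute per-gender proportions by hand is to bin the counts and normalise each row of the resulting table. A minimal sketch on a made-up data frame (toy), with an assumed bin width of 10:

```r
# Made-up friend counts for four females and four males.
toy <- data.frame(
  gender       = c("female", "female", "female", "female",
                   "male", "male", "male", "male"),
  friend_count = c(3, 7, 15, 18, 4, 6, 8, 12)
)

# Two left-closed bins: [0,10) and [10,20).
bins <- cut(toy$friend_count, breaks = seq(0, 20, by = 10), right = FALSE)

counts <- table(toy$gender, bins)

# margin = 1 divides each row by its own total, so each gender sums to 1.
props <- prop.table(counts, margin = 1)
print(props)
```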

Likes on the Web

Next, I will use frequency polygons to determine which gender gives more likes on the web (www_likes).

qplot(x = www_likes, data = subset(pf, !is.na(gender)), 
      xlab = "Number of Likes",
      ylab = "Users with that Number of Likes",
      geom = "freqpoly", color = gender) +
  scale_x_log10()

by(pf$www_likes, pf$gender, sum)

## pf$gender: female
## [1] 3507665
## -------------------------------------------------------- 
## pf$gender: male
## [1] 1430175

by(pf$www_likes, pf$gender, summary)

## pf$gender: female
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##     0.00     0.00     0.00    87.14    25.00 14860.00 
## -------------------------------------------------------- 
## pf$gender: male
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##     0.00     0.00     0.00    24.42     2.00 12900.00
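
Since the median is 0 for both genders, the share of users with no likes at all is another useful number to compare. A minimal sketch, assuming a made-up data frame (toy) in place of pf:

```r
# Made-up web like counts; most users have none.
toy <- data.frame(
  gender    = c("female", "female", "female", "female",
                "male", "male", "male", "male"),
  www_likes = c(0, 0, 25, 300, 0, 0, 0, 40)
)

total_likes <- tapply(toy$www_likes, toy$gender, sum)

# The mean of a logical vector is the proportion of TRUE values.
share_zero <- tapply(toy$www_likes == 0, toy$gender, mean)
print(total_likes)
print(share_zero)
```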

Box Plots

I will create box plots to compare statistical data visually.

qplot(x = gender, y = friend_count, 
      data = subset(pf, !is.na(pf$gender)),
      geom = "boxplot")
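
The numbers behind a box plot can be inspected with boxplot.stats() from base R. This is a sketch on made-up values; ggplot2 computes its hinges with quantile(), so its results may differ slightly from these on real data.

```r
# Made-up friend counts with one extreme value.
friend_count <- c(0, 10, 20, 30, 40, 50, 60, 70, 500)

bp <- boxplot.stats(friend_count)

# Lower whisker, lower hinge, median, upper hinge, upper whisker.
print(bp$stats)

# Points beyond 1.5 times the box height from the hinges are drawn as outliers.
print(bp$out)
```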

I adjust the code to focus on users who have friend counts between 0 and 1000.

qplot(x = gender, y = friend_count, 
      data = subset(pf, !is.na(pf$gender)),
      geom = "boxplot") +
  coord_cartesian(ylim = c(0,1000))

# Other ways to limit the y axis (unlike coord_cartesian, these drop
# the points outside the limits before the box plot is computed):
# + scale_y_continuous(limits = c(0,1000))
# or 
# ylim = c(0,1000)
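
The choice between coord_cartesian() and scale limits matters for box plots: scale limits remove the points outside the range before the quartiles are computed, while coord_cartesian() only zooms in on the finished plot. A rough base R sketch of the effect, using made-up friend counts:

```r
# Made-up friend counts with two values above 1000.
friend_count <- c(0, 27, 74, 96, 182, 244, 900, 1500, 4923)

# Median of all users: what a zoomed box plot still shows.
median_all <- median(friend_count)

# Median after dropping users above 1000: what a limited scale would show.
median_trimmed <- median(friend_count[friend_count <= 1000])
print(c(all = median_all, trimmed = median_trimmed))
```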

The box plot shows a slightly higher median friend count for females than for males. To confirm this, I use the by command.

by(pf$friend_count, pf$gender, summary)

## pf$gender: female
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0      37      96     242     244    4923 
## -------------------------------------------------------- 
## pf$gender: male
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0      27      74     165     182    4917

I create an additional boxplot to answer the following question:

On average, who initiated more friendships in our sample: men or women?

qplot(x = gender, y = friendships_initiated, 
      data = subset(pf, !is.na(pf$gender)),
      geom = "boxplot") +
  coord_cartesian(ylim = c(0,150))

by(pf$friendships_initiated, pf$gender, summary)

## pf$gender: female
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0    19.0    49.0   113.9   124.8  3654.0 
## -------------------------------------------------------- 
## pf$gender: male
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0    15.0    44.0   103.1   111.0  4144.0

Response: women initiated more friendships than men, in terms of both the mean and the median.

Getting Logical

A numeric variable can be transformed into a logical (boolean) one. I will do this for the mobile_likes variable to find out whether each user has ever used a mobile check-in.

summary(pf$mobile_likes > 0)

##    Mode   FALSE    TRUE    NA's 
## logical   35056   63947       0

The summary shows that many users (35056) have never used a mobile check-in (mobile_likes = 0). So I create a new variable, mobile_check_in, in the dataset and set it to 1 or 0 depending on whether the user has checked in via mobile.

pf$mobile_check_in <- NA
pf$mobile_check_in <- ifelse(pf$mobile_likes>0, 1, 0)
pf$mobile_check_in <- factor(pf$mobile_check_in)
summary(pf$mobile_check_in)

##     0     1 
## 35056 63947

Now I can further analyze users’ behaviour in terms of mobile check-ins, e.g. calculate the proportion of users who have checked in using mobile.

sum(pf$mobile_check_in==1)/length(pf$mobile_check_in)

## [1] 0.6459097
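
The same proportion can be obtained directly from the logical comparison, because the mean of a logical vector is the share of TRUE values. A minimal sketch on made-up values in place of pf$mobile_likes:

```r
# Made-up mobile like counts for eight users.
mobile_likes <- c(0, 0, 4, 46, 0, 12, 1, 0)

used_mobile <- mobile_likes > 0

# Proportion of users with at least one mobile like.
share_direct <- mean(used_mobile)

# The factor-based calculation from above gives the same number.
mobile_check_in <- factor(ifelse(used_mobile, 1, 0))
share_factor <- sum(mobile_check_in == 1) / length(mobile_check_in)
print(c(direct = share_direct, via_factor = share_factor))
```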

Exploratory Data Analysis with more than One Variable

I describe exploratory data analysis with more than one variable in further documents. Please check my website for more details.