Explore One Variable

Pseudo-Facebook User Data

library(ggplot2)  # provides qplot() and ggplot()

pf <- read.csv('pseudo_facebook.tsv', sep = '\t')
names(pf)
##  [1] "userid"                "age"                  
##  [3] "dob_day"               "dob_year"             
##  [5] "dob_month"             "gender"               
##  [7] "tenure"                "friend_count"         
##  [9] "friendships_initiated" "likes"                
## [11] "likes_received"        "mobile_likes"         
## [13] "mobile_likes_received" "www_likes"            
## [15] "www_likes_received"

Histogram of Users’ Birthdays

qplot(x = dob_day, data = pf)+ 
  scale_x_continuous(breaks = 1:31)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Instead of using the qplot() function, you can also use the ggplot() function to create the histogram:

ggplot(aes(x = dob_day), data = pf) +
  geom_histogram(binwidth = 1) +
  scale_x_continuous(breaks = 1:31)


What are some things that you notice about this histogram?

Response:

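Far more users report a birthday on the 1st of the month than on any other day, which is unusual. We would expect birthdays to be spread roughly evenly across the days of the month, with somewhat fewer on the 31st because not every month has 31 days.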

Estimating Your Audience Size


Think about a time when you posted a specific message or shared a photo on Facebook. What was it?

How many of your friends do you think saw that post?

Think about what percent of your friends on Facebook see any posts or comments that you make in a month. What percent do you think that is?


Faceting

You may want to split your data by one or more variables and plot the subsets side by side.

That’s where faceting functions in R are useful.

#faceting >> facet_wrap

qplot(x = dob_day, data = pf)+ 
  scale_x_continuous(breaks = 1:31)+
  facet_wrap(~dob_month,ncol = 3)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

#Equivalent ggplot syntax

ggplot(data = pf, aes(x = dob_day)) +
  geom_histogram(binwidth = 1) +
  scale_x_continuous(breaks = 1:31) +
  facet_wrap(~dob_month, ncol = 3)


Be Skeptical - Outliers and Anomalies

A top-coded data set is one in which values above an upper bound are censored, typically by replacing them with the bound itself. This is often done to preserve the anonymity of the people participating in a survey.
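As a minimal sketch of what top-coding does, the example below caps a small vector of made-up incomes at an assumed upper bound. The values and the cap are hypothetical and are not part of the Facebook data set.

```r
# Hypothetical incomes; the cap of 150000 is an assumed upper bound.
income <- c(32000, 58000, 91000, 240000, 1200000)
cap <- 150000

# pmin() takes the element-wise minimum, so every value above the cap
# is replaced by the cap itself.
income_top_coded <- pmin(income, cap)
print(income_top_coded)

# Top-coding hides the extremes, so the maximum and the mean both shrink.
print(max(income_top_coded))
print(mean(income) > mean(income_top_coded))
```

A spike at the largest value of a histogram can therefore be a sign of top-coding rather than a real cluster of identical observations.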

An outlier can be:

  • bad data about a non-extreme case
  • bad data about an extreme case
  • good data about an extreme case

Friend Count

Create a histogram of friend_count using the qplot syntax. We’ll also accept the ggplot syntax if you are familiar with it but additional parameters for setting the bin width or color won’t be accepted. Keep it simple.

You should create the histogram in RStudio on your computer first. Then, copy and paste your code into the {r} chunk.

Remember to load the pseudo-facebook data set into a variable named pf. Otherwise, the grader will reject the answer.

What code would you enter to create a histogram of friend counts?

ggplot(data = pf, aes(x = friend_count)) +
  geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

How is this plot similar to Moira’s first plot?

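Like Moira’s first plot, this one is long-tailed: most users have relatively few friends, while a handful have close to 5,000 (the maximum in this data set is 4,923). Those extreme values stretch the x axis and squeeze the bulk of the data against the left edge.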


Limiting the Axes

qplot(x=friend_count, data=pf, xlim = c(0,1000))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 2951 rows containing non-finite values (stat_bin).

#Equivalent ggplot syntax, setting the limits with an x-axis scale layer:

ggplot(aes(x = friend_count), data = pf) +
  geom_histogram() +
  scale_x_continuous(limits = c(0, 1000))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 2951 rows containing non-finite values (stat_bin).

Exploring with Bin Width

With the bin width set to 1, dramatic patterns appear as tall vertical lines. When asked how many people they think saw a post, people typically answer 10, 20, 50, or 100; they guess round numbers and rarely the numbers in between. The plot gives a feel for how big people thought their audience was: the most common guess was 20, with 50 and 100 close behind. In most cases, though, the guess was only about a quarter of the actual audience size.
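The effect of the bin width can be checked without drawing anything. The sketch below uses base R's hist() with plot = FALSE on made-up guesses that are heaped on round numbers; the data are hypothetical and only illustrate the idea.

```r
# Hypothetical audience-size guesses, heaped on round numbers.
guesses <- c(rep(10, 6), rep(20, 12), rep(50, 9), rep(100, 8),
             7, 13, 25, 33, 42, 64, 75, 88)

# Wide bins (width 25): the heaping is smoothed away.
wide <- hist(guesses, breaks = seq(0, 100, by = 25), plot = FALSE)
print(wide$counts)

# Narrow bins (width 1): tall spikes at the round numbers stand out.
narrow <- hist(guesses, breaks = seq(0, 100, by = 1), plot = FALSE)

# Each bin is (a, b], so the right edge of a bin is its midpoint + 0.5.
spikes <- narrow$mids[narrow$counts >= 6] + 0.5
print(spikes)
```

With wide bins the counts only say that most guesses are small; with a bin width of 1 the individual round numbers become visible.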


Adjusting the Bin Width

qplot(x=friend_count, data=pf, binwidth = 25)+
  scale_x_continuous(limits = c(0,1000), breaks = seq(0,1000,50))
## Warning: Removed 2951 rows containing non-finite values (stat_bin).

Faceting Friend Count

# What code would you add to facet the histogram by gender?
# Add it to the code below.

qplot(x = friend_count, data = pf, binwidth = 50) +
  scale_x_continuous(limits = c(0, 1000),
                     breaks = seq(0, 1000, 50)) +
  facet_grid(. ~ gender)
## Warning: Removed 2951 rows containing non-finite values (stat_bin).

Similar ggplot syntax (here with the default 30 bins and facet_wrap instead of facet_grid)

ggplot(aes(x = friend_count), data = pf) + 
  geom_histogram() + 
  scale_x_continuous(limits = c(0, 1000), breaks = seq(0, 1000, 50)) + 
  facet_wrap(~gender)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 2951 rows containing non-finite values (stat_bin).


Omitting NA Values

qplot(x = friend_count, data = subset(pf, !is.na(gender)), binwidth = 50) +
  scale_x_continuous(limits = c(0, 1000),
                     breaks = seq(0, 1000, 50)) +
  facet_grid(. ~ gender)
## Warning: Removed 2949 rows containing non-finite values (stat_bin).


Statistics ‘by’ Gender

table(pf$gender)
## 
## female   male 
##  40254  58574
by(pf$friend_count, pf$gender, summary)
## pf$gender: female
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0      37      96     242     244    4923 
## -------------------------------------------------------- 
## pf$gender: male
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0      27      74     165     182    4917

After you run the above code, try to find the answers for the following questions.

Who on average has more friends: men or women?

What’s the difference between the median friend count for women and men?

Why would the median be a better measure than the mean?

Answers

Women have more friends on average (mean 242 versus 165, median 96 versus 74).

The difference between the median friend count for women and men is 96 - 74 = 22.

For both genders the mean is far larger than the median. A few users have very many friends compared with the majority, and these extreme values pull the mean upward, so the mean overstates the friend count of a typical user. Because the distribution is strongly right-skewed rather than normal, the median is the more representative measure.
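A small made-up example shows how a single extreme value moves the mean but hardly touches the median. The friend counts below are hypothetical and chosen only to make the arithmetic easy to follow.

```r
# Hypothetical friend counts: most users have few friends, one has thousands.
friends <- c(20, 35, 50, 60, 80, 95, 110, 150, 200, 4900)

print(mean(friends))    # pulled upward by the single extreme value
print(median(friends))  # middle of the sorted values, barely affected

# Dropping the extreme user changes the mean a lot and the median very little.
trimmed <- friends[friends < 1000]
print(mean(trimmed))
print(median(trimmed))
```

Here the mean drops from 570 to about 89 when one user is removed, while the median only moves from 87.5 to 80.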


Tenure

How would you create a histogram of tenure by year?

qplot (data=pf, x = tenure, binwidth = 50, color = I('black'), fill = I('#099009'))
## Warning: Removed 2 rows containing non-finite values (stat_bin).

#Comparable ggplot syntax (here with 30-day bins and a different fill color):

ggplot(aes(x = tenure), data = pf) + 
   geom_histogram(binwidth = 30, color = 'black', fill = '#099DD9')
## Warning: Removed 2 rows containing non-finite values (stat_bin).

#ggplot syntax with tenure in years. Tenure is recorded in days, so divide by 365 and adjust the binwidth to match (0.25 is a quarter of a year). Setting breaks makes the axis easier to read.

ggplot(aes(x = tenure/365), data = pf) +
  geom_histogram(binwidth = .25, color = 'black', fill = '#336600') +
  scale_x_continuous(breaks = seq(1, 7, 1), limits = c(0, 7))
## Warning: Removed 26 rows containing non-finite values (stat_bin).
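The conversion from days to years, and what a bin width of 0.25 means, can be worked through on a few made-up tenure values. The numbers below are hypothetical; only the divisor 365 and the bin width come from the plot above.

```r
# Hypothetical tenures in days (not taken from the data set).
tenure_days <- c(30, 200, 365, 730, 1000, 2555)

# Convert to years, using 365 days per year as in the plot above.
tenure_years <- tenure_days / 365
print(round(tenure_years, 2))

# A bin width of 0.25 years corresponds to roughly 91 days.
print(0.25 * 365)

# Assign each user to a quarter-year bin, much as the histogram does.
bins <- cut(tenure_years, breaks = seq(0, 7, by = 0.25), include.lowest = TRUE)
bin_counts <- table(bins)
print(bin_counts[bin_counts > 0])
```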


Labeling Plots

You can label the x and y axes using the xlab and ylab parameters.

qplot(data = pf, x = tenure/365,
      xlab = 'Number of Years Using Facebook',
      ylab = 'Number of Users',
      color = I('black'), fill = I('#F79420')) +
  scale_x_continuous(breaks = seq(1, 7, 1), limits = c(0, 7))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 26 rows containing non-finite values (stat_bin).


User Ages

qplot (data=pf, x = age, 
       xlab = 'Age',
       ylab = 'Number of Users',
       color = I('black'), fill = I('#F79420'))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Ages are whole numbers, so a binwidth of 1 (one bar per year of age) is more appropriate here.

#Equivalent ggplot syntax: 

ggplot(aes(x = age), data = pf) + 
  geom_histogram(binwidth = 1, color = 'black', fill = '#5760AB') + 
  scale_x_continuous(breaks = seq(0, 113, 5))

What do you notice?

There are no Facebook users under 13 years old. That makes sense, because you have to be at least 13 to open a Facebook account.

The tallest bars are in the late teens and twenties: most users are young adults, and age 18 alone accounts for more than 5,000 users.

Overall, the largest share of Facebook users is between 16 and 25.

It is also odd that some users claim to be more than 100 years old. These are probably fake or joke birth dates rather than real ages.


Transforming Data

“Over-dispersed” is always relative to some particular posited distribution. For example, data might be over-dispersed compared with a Poisson distribution that has the same mean, which means the variance is larger than a Poisson model would predict.
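One common way to check for over-dispersion relative to a Poisson model is to compare the variance with the mean, since the two are equal for a Poisson distribution. The sketch below uses simulated counts; the seed, the sample size, and the negative binomial size parameter are assumptions made for illustration.

```r
set.seed(42)  # arbitrary seed so the simulation is reproducible

# Poisson counts: the variance is expected to be close to the mean.
pois <- rpois(10000, lambda = 5)

# Negative binomial counts with the same mean (mu = 5) but extra spread.
# size = 0.5 is an assumed dispersion parameter chosen for illustration.
nb <- rnbinom(10000, mu = 5, size = 0.5)

# Dispersion index = variance / mean: about 1 for Poisson data,
# clearly above 1 for over-dispersed data.
dispersion <- function(x) var(x) / mean(x)
print(dispersion(pois))
print(dispersion(nb))
```

For friend_count the mean is 196.4 while the maximum is 4,923, so a variance far above the mean is to be expected.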

summary(pf$friend_count)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0    31.0    82.0   196.4   206.0  4923.0
summary(log10(pf$friend_count + 1))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   1.505   1.919   1.868   2.316   3.692
summary(sqrt(pf$friend_count))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   5.568   9.055  11.090  14.350  70.160

Create Multiple Plots in One Image

library(gridExtra)  # provides grid.arrange()

p1 <- qplot(data = pf, x = friend_count)
p2 <- qplot(data = pf, x = log10(friend_count + 1),
              color = I('black'), fill = I('#336600'))
p3 <- qplot(data = pf, x = sqrt(friend_count),
               color = I('black'), fill = I('#5760AB'))

grid.arrange(p1, p2, p3, ncol=1)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Notice that the log10-transformed histogram is much closer to a normal distribution than the histogram of the raw counts.
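The reason for adding 1 before taking the logarithm is easy to see on a few values. The counts below are hypothetical; the point is only how each transformation treats a user with zero friends.

```r
# Hypothetical friend counts, including a user with zero friends.
friend_count <- c(0, 9, 99, 999, 4999)

# log10(0) is -Inf, which a histogram cannot place on the axis.
print(log10(friend_count))

# Adding 1 first keeps zero-friend users: log10(0 + 1) is 0.
print(log10(friend_count + 1))

# The square root is a gentler transformation and is defined at zero.
print(sqrt(friend_count))
```

This is also why plotting log10(friend_count) without the + 1 produces warnings about non-finite values being removed.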

Add a Scaling Layer

logscale <- qplot(data = pf, x = log10(friend_count))

##Equivalent ggplot syntax: 

countscale <- ggplot(aes(x = friend_count), data = pf)+
      geom_histogram()+
      scale_x_log10()

grid.arrange(logscale, countscale, ncol=1)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 1962 rows containing non-finite values (stat_bin).
## Warning: Transformation introduced infinite values in continuous x-axis
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 1962 rows containing non-finite values (stat_bin).

#OR

qplot(data = pf, x = friend_count)+
      scale_x_log10()
## Warning: Transformation introduced infinite values in continuous x-axis
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 1962 rows containing non-finite values (stat_bin).

Plotting the actual friend_count with ggplot() and adding a scaling layer is the better practice: the x axis is then labeled in friend counts rather than in log10 units. You can add a scaling layer to any plot that you create.


Frequency Polygons

qplot(x = friend_count, data = subset(pf, !is.na(gender)), binwidth = 10) +
  scale_x_continuous(limits = c(0, 1000), breaks = seq(0, 1000, 50)) +
  facet_wrap(~gender)
## Warning: Removed 2949 rows containing non-finite values (stat_bin).

# Creating Frequency Polygons

qplot(x= friend_count, y = ..count../sum(..count..),
      data = subset(pf, !is.na(gender)), 
      xlab = 'Friend Count',
      ylab = 'Proportion of Users with the Friend Count',
      binwidth = 10, geom = 'freqpoly', color = gender)+
    scale_x_continuous(limits = c(0, 1000), breaks = seq(0, 1000, 50))
## Warning: Removed 2949 rows containing non-finite values (stat_bin).
## Warning: Removed 4 rows containing missing values (geom_path).
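Turning bin counts into proportions is what makes two groups of different size comparable. The base R sketch below does this on a tiny made-up data frame (pf_small is hypothetical). It normalizes within each gender, which is one possible choice and may differ from the normalization used by the plotting code above.

```r
# Hypothetical data: friend counts for two groups of unequal size.
pf_small <- data.frame(
  gender = c(rep("female", 4), rep("male", 8)),
  friend_count = c(5, 15, 15, 25, 5, 5, 5, 5, 15, 15, 25, 25)
)

breaks <- seq(0, 30, by = 10)

# Raw counts per bin: the larger group has more users in most bins.
counts <- table(pf_small$gender, cut(pf_small$friend_count, breaks))
print(counts)

# Proportions within each gender: each row now sums to 1,
# so the shapes of the two distributions can be compared directly.
props <- prop.table(counts, margin = 1)
print(props)
```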


Likes on the Web

qplot(x = www_likes, data = subset(pf, !is.na(gender)), 
      xlab = 'likes on the web',
      geom = 'freqpoly', color = gender)+
    scale_x_log10()
## Warning: Transformation introduced infinite values in continuous x-axis
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 60935 rows containing non-finite values (stat_bin).

by(pf$www_likes, pf$gender, sum)
## pf$gender: female
## [1] 3507665
## -------------------------------------------------------- 
## pf$gender: male
## [1] 1430175

What’s the www_likes count for males?

1,430,175

Which gender has more www_likes?

Women, with 3,507,665 www_likes in total.


Box Plots

qplot(x = gender, y = friend_count, data = subset(pf, !is.na(gender)), 
      geom = 'boxplot')+
     scale_y_continuous(limits = c(0, 1000))
## Warning: Removed 2949 rows containing non-finite values (stat_boxplot).

#Alternative

qplot(x = gender, y = friend_count, data = subset(pf, !is.na(gender)), 
      geom = 'boxplot', ylim = c(0,1000))
## Warning: Removed 2949 rows containing non-finite values (stat_boxplot).

#Setting ylim or adding a scale_y_continuous(limits = ...) layer removes the data points outside the limits before the box plot statistics are calculated. A better way to zoom in is the coord_cartesian layer, which keeps all the data.

The black line inside the box is the median (the 50th percentile of friend_count). The lower boundary of the box is the 1st quartile and the upper boundary is the 3rd quartile, so the box spans the middle 50% of the values.
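The quartiles behind a box plot, and the effect of discarding points before computing them, can be reproduced with quantile(). The friend counts below are hypothetical; filtering to values of at most 1000 stands in for what setting limits on the y scale does to the data.

```r
# Hypothetical friend counts with a few very large values.
friend_count <- c(10, 40, 75, 120, 200, 350, 600, 900, 1500, 3000, 4900)

# Quartiles of all users: lower edge of the box, middle line, upper edge.
q_all <- quantile(friend_count, probs = c(0.25, 0.5, 0.75))
print(q_all)

# Dropping users above 1000 first changes the quartiles,
# because fewer points enter the calculation.
kept <- friend_count[friend_count <= 1000]
q_kept <- quantile(kept, probs = c(0.25, 0.5, 0.75))
print(q_kept)
```

In this example the median falls from 350 to 160 once the three largest values are dropped, which is why zooming with coord_cartesian gives a more faithful box.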

Remember these from your statistics class?

Adjust the code to focus on users who have friend counts between 0 and 1000.

#Alternative with coord cartesian layer

qplot(x = gender, y = friend_count, data = subset(pf, !is.na(gender)), 
      geom = 'boxplot')+
     coord_cartesian(ylim = c(0,1000))


Box Plots, Quartiles, and Friendships

#Adjust the ylim for a clearer picture

qplot(x = gender, y = friend_count, data = subset(pf, !is.na(gender)), 
      geom = 'boxplot')+
     coord_cartesian(ylim = c(0,250))

by(pf$friend_count, pf$gender, summary)
## pf$gender: female
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0      37      96     242     244    4923 
## -------------------------------------------------------- 
## pf$gender: male
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0      27      74     165     182    4917

On average, who initiated more friendships in our sample: men or women?

Women

Write about some ways that you can verify your answer.

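Both the box plot and the numerical summary can be used to check this. The median of friendships_initiated is slightly higher for women (49) than for men (44), and so is the mean (113.9 versus 103.1).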

qplot(x = gender, y = friendships_initiated, 
      data = subset(pf, !is.na(gender)), 
      geom = 'boxplot')+
     coord_cartesian(ylim = c(0,250))

by(pf$friendships_initiated, pf$gender, summary)
## pf$gender: female
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0    19.0    49.0   113.9   124.8  3654.0 
## -------------------------------------------------------- 
## pf$gender: male
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0    15.0    44.0   103.1   111.0  4144.0

Getting Logical

summary(pf$mobile_likes)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0     0.0     4.0   106.1    46.0 25110.0
summary(pf$mobile_likes > 0)
##    Mode   FALSE    TRUE    NA's 
## logical   35056   63947       0
pf$mobile_check_in <- NA

pf$mobile_check_in <- ifelse(pf$mobile_likes > 0, 1,0)
pf$mobile_check_in <- factor(pf$mobile_check_in)
summary(pf$mobile_check_in)
##     0     1 
## 35056 63947

What percent of users check in using mobile?

summary(pf$mobile_check_in)
##     0     1 
## 35056 63947
sum(pf$mobile_check_in == 1)/ length(pf$mobile_check_in)
## [1] 0.6459097
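A shorter route to the same kind of proportion is to take the mean of a logical vector, because TRUE counts as 1 and FALSE as 0. The sketch below uses ten made-up like counts and compares both approaches; the values are hypothetical.

```r
# Hypothetical mobile like counts for ten users.
mobile_likes <- c(0, 3, 0, 12, 7, 0, 1, 0, 45, 2)

# A comparison returns TRUE/FALSE; the mean of the logical vector
# is the share of users with at least one mobile like.
used_mobile <- mobile_likes > 0
print(mean(used_mobile))

# The same share, computed from a 0/1 factor as in the text above.
check_in <- factor(ifelse(mobile_likes > 0, 1, 0))
print(sum(check_in == 1) / length(check_in))
```

Applied to the real data, the result above, 0.6459097, means that about 65% of users have checked in using mobile.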

Analyzing One Variable

Reflection:

What did you learn so far?