Exploring One Variable

Pseudo-Facebook User Data

Please install and load the following packages before continuing
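The package list itself is not reproduced above. Judging only from the functions used later in these notes (read_tsv, filter and group_by, ggplot, grid.arrange, theme_gdocs), a plausible set is sketched here; the exact list is an assumption, not the author's original.

```r
# Packages inferred from the functions used in this document (an assumption,
# since the original list is not shown).
pkgs <- c("readr", "dplyr", "ggplot2", "gridExtra", "ggthemes")

# Check which ones are available, without installing anything.
available <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- pkgs[!available]

if (length(missing_pkgs) > 0) {
  message("Not installed: ", paste(missing_pkgs, collapse = ", "))
  # install.packages(missing_pkgs)   # uncomment to install them
}

# Attach the packages that are available.
for (p in pkgs[available]) {
  library(p, character.only = TRUE)
}
```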

And download the pseudo_facebook dataset from here: pseudo_facebook

# read the data using read_tsv function from package (readr)
pf <- read_tsv("pseudo_facebook.tsv")
pf

## # A tibble: 99,003 x 15
##     userid   age dob_day dob_year dob_month gender tenure friend_count
##      <int> <int>   <int>    <int>     <int>  <chr>  <int>        <int>
## 1  2094382    14      19     1999        11   male    266            0
## 2  1192601    14       2     1999        11 female      6            0
## 3  2083884    14      16     1999        11   male     13            0
## 4  1203168    14      25     1999        12 female     93            0
## 5  1733186    14       4     1999        12   male     82            0
## 6  1524765    14       1     1999        12   male     15            0
## 7  1136133    13      14     2000         1   male     12            0
## 8  1680361    13       4     2000         1 female      0            0
## 9  1365174    13       1     2000         1   male     81            0
## 10 1712567    13       2     2000         2   male    171            0
## # ... with 98,993 more rows, and 7 more variables:
## #   friendships_initiated <int>, likes <int>, likes_received <int>,
## #   mobile_likes <int>, mobile_likes_received <int>, www_likes <int>,
## #   www_likes_received <int>

Barplot of Users’ Birthdays

Notes: A bar chart is used to plot the number of users born on each day of the month.

theme_set(theme_minimal(15))

qplot(as.factor(dob_day), data=pf, geom = "bar") +
        xlab("Day of Birth")

What are some things that you notice about this plot?

There is a huge bar at day 1. This seems unusual, since we would expect dob_day to be roughly uniformly distributed across the days of a month.

There is also a small bar at day 31. This one makes sense, since not all months have 31 days, which explains the low number of people in this particular bar.

Moira’s Investigation

Notes:

Estimating Your Audience Size

Notes:

Think about a time when you posted a specific message or shared a photo on Facebook. What was it?

Response:

How many of your friends do you think saw that post?

Response:

Think about what percent of your friends on Facebook see any posts or comments that you make in a month. What percent do you think that is?

Response:

Perceived Audience Size

Notes:

Faceting

Notes: Let’s break our barplot into 12 plots, one for each month of the year

# Now we will use facet_wrap() to plot dob_day conditioned
# on dob_month
ggplot(data=pf, aes(x=as.factor(dob_day), fill = dob_month)) +
        geom_bar() +
        facet_wrap(~ dob_month, ncol = 2)

Let’s take another look at our plot. What stands out to you here?

Now we can easily see that the large number of people born on day 1 occurs only in January, while in the other 11 months birthdays are spread roughly evenly across the days, as we originally expected.

So the number of people born on January 1st represents an outlier. But what actually is an outlier?

Be Skeptical - Outliers and Anomalies

People often talk about the importance of detecting and dealing with outliers in your data.

But there are many types of outliers and anomalies, and how your analysis proceeds should depend on what type you’re dealing with. Outliers can have many causes.

For example, an outlier might be accurate data about an extreme case: someone represented in your data set might really be tweeting 1,000 times a day. On the other hand, outliers or anomalies sometimes represent bad data or the limitations of your data. What would otherwise be a normal value of a variable might be replaced with an extreme value. In other cases, extreme values might be replaced with a more normal value. In a lot of census data or surveys, for instance, income information is top coded, so individuals with very large incomes have their incomes replaced with some other value.
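As a small illustration of top coding, here is a sketch on made-up incomes. The values and the cap of 150,000 are hypothetical and only show the mechanics.

```r
# Hypothetical incomes; the values and the cap are made up for illustration.
income <- c(32000, 48000, 75000, 120000, 2500000)
cap <- 150000

# Top coding: every value above the cap is replaced by the cap itself.
income_top_coded <- pmin(income, cap)

income_top_coded
mean(income)            # 555000: pulled up by the single extreme value
mean(income_top_coded)  # 85000: much lower once that value is capped
```

After top coding, the largest value in the data is the cap, so a pile-up of observations at exactly that value is a hint that the variable was capped rather than measured.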

Moira’s Outlier

Notes:

Which case do you think applies to Moira’s outlier?

Response:

Friend Count

Notes:

What code would you enter to create a histogram of friend counts?

ggplot(data = pf, aes(x = friend_count)) +
        geom_histogram()

How is this plot similar to Moira’s first plot?

It’s similar. We need to limit our axes to get a better sense of the data.

Limiting the Axes

Notes:

ggplot(data = pf, aes(x = friend_count)) +
        geom_histogram() +
        scale_x_continuous(limits = c(0, 1000))

Exploring with Bin Width

Adjusting the Bin Width

Let’s set binwidth to 25. Let’s also place an x-axis break every 50 units, so that two bins fit between each pair of breaks.

ggplot(data = pf, aes(x = friend_count)) +
        geom_histogram(binwidth = 25) +
        scale_x_continuous(limits = c(0, 1000),
                           breaks = seq(0, 1000, 50))
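To see how a bin width of 25 and axis breaks every 50 units relate, here is a base R sketch on made-up friend counts. It uses hist() rather than ggplot2, and ggplot2 may align its bin boundaries differently by default, so treat the exact counts as illustrative.

```r
# Made-up friend counts, only to show how the bins and axis breaks line up.
friend_count <- c(0, 3, 24, 25, 49, 50, 74, 120, 480, 999)

bin_edges   <- seq(0, 1000, by = 25)   # one bin every 25 units
axis_breaks <- seq(0, 1000, by = 50)   # one axis label every 50 units

# right = FALSE makes each bin include its left edge: [0, 25), [25, 50), ...
h <- hist(friend_count, breaks = bin_edges, right = FALSE, plot = FALSE)

length(h$counts)      # 40 bins between 0 and 1000
length(axis_breaks)   # 21 labelled positions on the x-axis
h$counts[1:4]         # 3 2 2 0: counts in the first four bins
```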

Read more about adjusting scales on this link: Scales in ggplot2.

Faceting Friend Count

We want to know which gender has more friends on average.

# What code would you add to facet the histogram by gender?
# Add it to the code below.
ggplot(data = pf, aes(x = friend_count)) +
        geom_histogram(binwidth = 25) +
        scale_x_continuous(limits = c(0, 1000),
                           breaks = seq(0, 1000, 50)) +
        facet_wrap(~ gender, nrow = 2)

Oops, we generated 3 panels, not 2. We need to remove the NA values of the variable gender.

Omitting NA Values

# we have 2 options to omit NA values from (gender)

# 1st option: omitting NA only from (gender)
ggplot(data = filter(pf, !is.na(gender)), aes(x = friend_count)) +
        geom_histogram(binwidth = 25) +
        scale_x_continuous(limits = c(0, 1000),
                           breaks = seq(0, 1000, 50)) +
        facet_wrap(~ gender, nrow = 2)

# 2nd option: omitting NA from the entire data set 
# (not recommended since you may omit observations that have gender value
# but have NA for different variables)
ggplot(data = na.omit(pf), aes(x = friend_count)) +
        geom_histogram(binwidth = 25) +
        scale_x_continuous(limits = c(0, 1000),
                           breaks = seq(0, 1000, 50)) +
        facet_wrap(~ gender, nrow = 2)
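The difference between the two options is easier to see on a tiny data frame. The sketch below uses made-up rows and base R subsetting in place of dplyr's filter, which gives the same result here.

```r
# Toy data frame (made up): one row lacks gender, another lacks tenure.
toy <- data.frame(
  gender = c("male", "female", NA, "female"),
  tenure = c(266, NA, 13, 93),
  friend_count = c(0, 10, 5, 200)
)

# Option 1: drop rows only where gender is missing.
only_gender <- toy[!is.na(toy$gender), ]

# Option 2: drop rows with a missing value in any column.
complete_rows <- na.omit(toy)

nrow(only_gender)    # 3: the female user with unknown tenure is kept
nrow(complete_rows)  # 2: she is dropped as well
```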

Statistics ‘by’ Gender

We need to look at the average friend_count by gender. Who, on average, has more friends: males or females?

# Let's calculate the mean and median for friend_count conditioned on gender
by_gender <- pf %>% 
        group_by(gender) %>% 
        summarise(Avg_friendCount = round(mean(friend_count), 0),
                  median_friendCount = median(friend_count))
by_gender

## # A tibble: 3 x 3
##   gender Avg_friendCount median_friendCount
##    <chr>           <dbl>              <dbl>
## 1 female             242                 96
## 2   male             165                 74
## 3   <NA>             184                 81

Who on average has more friends: men or women?

Response:

What’s the difference between the median friend count for women and men?

Response:

Why would the median be a better measure than the mean?

Response:

Tenure

Notes:

# Let's look at the histogram of tenure
ggplot(data = pf, aes(x = tenure)) +
        geom_histogram(color = 'black', fill = "light blue")

# Let's change the binwidth to 30 to have a better look at the distribution

ggplot(data = pf, aes(x = tenure)) +
        geom_histogram(binwidth = 30, color = 'black', fill = "light blue") +
        geom_density()

Note on colors:

The parameter color determines the color outline of objects in a plot.

The parameter fill determines the color of the area inside objects in a plot. When fill is dependent on the data, specify it in the aes portion of the main function ggplot.

When fill is a fixed value, e.g. “blue” or “red” … etc, specify it in the geom function itself.

Read more about colors in ggplot2 in this link: ggplot2 colors

How would you create a histogram of tenure by year?

ggplot(data = pf, aes(x = tenure/365)) +
        geom_histogram(color = 'black', fill = "light blue")

# let's set the binwidth equal to 1/4 (this would give us a bin each 3 months)
ggplot(data = pf, aes(x = tenure/365)) +
        geom_histogram(binwidth = 0.25, color = 'black', fill = "orange") +
        scale_x_continuous(breaks = seq(0, 10, by = 1))

# let's omit the last 3 years (7-10) from our plot since they contain a very
# low number of people
ggplot(data = pf, aes(x = tenure/365)) +
        geom_histogram(binwidth = 0.25, color = 'black', fill = "orange") +
        scale_x_continuous(limits = c(0, 7), breaks = seq(0, 7, by = 1))

# Now it looks much better
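The conversion from days to years, and what a bin width of 0.25 means, can be checked by hand. This sketch uses made-up tenure values, and the bin index it computes is only an illustration of the idea, not how ggplot2 labels its bins.

```r
# Made-up tenure values in days.
tenure_days  <- c(0, 45, 91, 200, 365, 400, 730, 2500)
tenure_years <- tenure_days / 365

# Quarter-year bin index: 0 = first three months, 1 = the next three, ...
quarter_bin <- floor(tenure_years / 0.25)

0.25 * 365           # 91.25: days covered by one quarter-year bin
quarter_bin          # 0 0 0 2 4 4 8 27
table(quarter_bin)   # how many users fall in each quarter-year bin
```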

Labeling Plots

ggplot(data = pf, aes(x = tenure/365)) +
        geom_histogram(binwidth = 0.25, color = 'black', fill = "orange") +
        scale_x_continuous(limits = c(0, 7), breaks = seq(0, 7, by = 1)) +
        xlab("Number of Years Using Facebook") +
        ylab("Number of users in sample") +
        ggtitle("Tenure Histogram by Year")

User Ages

ggplot(data = pf, aes(x = age)) + 
        geom_histogram(color = "black", fill = 'salmon')

# We need to adjust binwidth and x-axis limits and breaks
# Let's restrict the x-axis to ages 10 through 110,
# set binwidth to 2.5 and place an x-axis break every 5 years
ggplot(data = pf, aes(x = age)) + 
        geom_histogram(binwidth = 2.5, color = "black", fill = 'deepskyblue3') +
        scale_x_continuous(limits = c(10, 110), breaks = seq(10, 110, 5))

What do you notice?

Response:

The Spread of Memes

Notes:

Lada’s Money Bag Meme

Notes:

Transforming Data

Most of our variables, such as friend count, likes, comments, wall posts and others, are what I would call engagement variables, and they all have very long tails. Some users have 10 times, or even 100 times, the median value. Another way to say this is that some people have orders of magnitude more likes, clicks, or comments than other users. In statistics, we say that the data is overdispersed.

Often it helps to transform these values so we can see standard deviations, or orders of magnitude; in effect, we are shortening the tail.

Here was our histogram of friend count from before:

ggplot(data = pf, aes(x = friend_count)) + 
        geom_histogram()

Notice that we still have that long tail. We can transform this variable by taking the log, using the natural log, log base 2, or log base 10. We could also use other functions, such as the square root. Doing so helps us see patterns more clearly, without being distracted by the tails.

A lot of common statistical techniques, like linear regression, are based on the assumption that variables have normal distributions. By taking the log of this variable, we can transform our data into something that more closely resembles a normal distribution, which is useful if we are going to use linear regression or some other modelling technique.

Now, I know we’re not doing modelling here, but let’s just see what it looks like to transform the variable. First, I’m going to just do this in the summary command. So, here’s our regular summary of friend count.

summary(pf$friend_count)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0    31.0    82.0   196.4   206.0  4923.0

Looks like the median friend count is 82, and the mean is 196. I can take the log base 10 \((\log_{10})\) of this friend count and get a different table:

summary(log10(pf$friend_count))

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    -Inf   1.491   1.914    -Inf   2.314   3.692

Now this seems a little bit unusual, since I have negative infinity for the minimum and negative infinity for the mean. So what must be going on? Well, some of our users have a friend count of zero, and \(\log_{10}\) of zero is undefined. For those familiar with calculus, the limit is negative infinity, which is why that appears here. To understand this more intuitively, think of the power you would have to raise \(10\) to in order to get zero. No finite power works: the more negative the exponent, the closer the result gets to zero, so in the limit the power is \(-\infty\).

To avoid getting \(-\infty\), we’re going to add one to friend count, so that we don’t get an undefined answer or negative infinity.

summary(log10(pf$friend_count + 1))

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   1.505   1.919   1.868   2.316   3.692

There, that looks much better.
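The effect of adding one before taking the log can be reproduced on a handful of made-up counts. The values below were chosen so the results are round numbers.

```r
# Made-up friend counts spanning several orders of magnitude.
friend_count <- c(0, 9, 99, 999)

log10(friend_count)        # first element is -Inf because log10(0) is -Inf
log10(friend_count + 1)    # 0 1 2 3: the zero is now representable

mean(log10(friend_count))      # -Inf: a single -Inf drags the mean with it
mean(log10(friend_count + 1))  # 1.5
```

This is why both the minimum and the mean were negative infinity in the first summary, while the median and the quartiles were unaffected.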

Now, let’s use another function. Let’s use the square root on friend count. This would be another type of transformation.

summary(sqrt(pf$friend_count))

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   5.568   9.055  11.090  14.350  70.160

For me, log base 10 is an easier transformation to wrap my head around, since I’m just comparing friend counts on orders of magnitude of 10.

Now that you’ve seen transformations within summaries, let’s see if you can apply a similar transformation to the histogram. Check out the following links to learn how to use scales and how to create multiple graphs on one page.

Once you’ve read through those links, and you think you’re ready, try this next programming exercise. In it, you’re going to create three different histograms.

The first one will be our original friend count histogram, and then the second one will have the friend count transformed using log 10. And then the last histogram will have the friend count transformed using square root.

library(gridExtra)
library(ggthemes)
# friend_count histogram
hist1 <- ggplot(data = pf, aes(x = friend_count)) +
        geom_histogram(fill = "salmon", color = "black") +
        xlab("Friend Count") + 
        theme_gdocs() +
        scale_color_gdocs()

# friend count transformed using log10
hist2 <- ggplot(data = pf, aes(x = friend_count + 1)) +
        geom_histogram(fill = "orange", color = "black") +
        scale_x_log10() +
        xlab("Friend Count") +
        theme_gdocs() +
        scale_color_gdocs()

# friend count transformed using sqrt
hist3 <- ggplot(data = pf, aes(x = friend_count)) +
        geom_histogram(fill = "deepskyblue2", color = "black") +
        scale_x_sqrt() +
        xlab("Friend Count") +
        theme_gdocs() +
        scale_color_gdocs()

# plot the 3 plots using grid.arrange
grid.arrange(hist1, hist2, hist3, nrow = 3)

Add a Scaling Layer

Notes:

Frequency Polygons

There’s another type of plot that lets us compare distributions, the frequency polygon. Frequency polygons are similar to histograms, but they draw a curve connecting the counts in a histogram. So this allows us to see the shape and the peaks of our distribution in more detail.

Remember we were trying to answer the question of who has more friends on average, men or women. We drew this histogram before:

ggplot(data = filter(pf, !is.na(gender)), aes(x = friend_count)) +
        geom_histogram(binwidth = 25) +
        scale_x_continuous(limits = c(0, 1000),
                           breaks = seq(0, 1000, 50)) +
        facet_wrap(~ gender, nrow = 2)

We said we couldn’t tell based on this histogram, so we ran some numerical summaries instead. And instead of having these 2 histograms side by side, we can actually use a frequency polygon and overlay these histograms together. Here’s how we can create that frequency polygon. I’ll copy and paste the same code, except I need to make an addition.

# I will just copy the code from before and do some changes:
# remove the facet_wrap function, because I want to overlay the 2 polygons 
# on top of each other.
# Instead, I will pass the argument (color) to the (aes) function,
# and set (color) equal to (gender). By doing so, I'm telling ggplot
# to condition my plot on (gender).
ggplot(data = filter(pf, !is.na(gender)), 
       aes(x = friend_count, color = gender)) +
        geom_freqpoly() +
        scale_x_continuous(limits = c(0, 1000),
                           breaks = seq(0, 1000, 50))

Now we can compare 2 or more distributions at once. But again, this plot doesn’t really answer our question of who has more friends on average, men or women.

Let’s change the y-axis to show proportions instead of raw counts. To do so, we need to add another argument, y, to the aes function and set it equal to ..count../sum(..count..)

ggplot(aes(x = friend_count, y = ..count../sum(..count..)), 
       data = filter(pf, !is.na(gender))) +
  geom_freqpoly(aes(color = gender), binwidth=10) + 
  scale_x_continuous(limits = c(0, 1000), breaks = seq(0, 1000, 50)) + 
  xlab('Friend Count') + 
  ylab('Percentage of users with that friend count')

# Note that sum(..count..) will sum across color, 
# so the percentages displayed are percentages of total users. 
# To plot percentages within each group, you can try y = ..density.. instead.
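The distinction between a share of all users and a share within each gender can be worked out by hand. This base R sketch uses a made-up sample of ten users and two bins; note that ggplot2's density additionally divides by the bin width, so it equals the within-group share only when the bin width is 1.

```r
# Made-up sample: the gender of each user and the bin they fall in.
gender <- c("female", "female", "female", "male", "male",
            "male", "male", "male", "male", "male")
bin    <- c("0-99", "0-99", "100-199", "0-99", "0-99",
            "0-99", "0-99", "100-199", "100-199", "100-199")

counts <- table(gender, bin)

# count / sum(count): each cell as a share of ALL users.
share_of_total <- counts / sum(counts)

# Share within each gender: every row sums to 1.
share_within_gender <- prop.table(counts, margin = 1)

sum(share_of_total)            # 1 across both genders together
rowSums(share_within_gender)   # 1 for each gender separately
```

With unequal group sizes, as here, the larger group always looks taller in shares of the total, which is why within-group shares are usually the fairer comparison.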

Try to play around with this yourself to see where women overtake men on this side of the x-axis. Try using limits or breaks to explore more.

Likes on the Web

Use a frequency polygon to determine which gender creates more likes on the world wide web. Create a frequency polygon and explore it in many different ways. Remember that the first plot you make doesn’t need to be final.

ggplot(data = pf, aes(x = www_likes)) + 
        geom_freqpoly(aes(color = gender))

# note: this code is equivalent
ggplot(data = pf, aes(x = www_likes, color = gender)) + 
        geom_freqpoly()

Our plot is heavily skewed to the right. Let’s try a log10 transformation on the x-axis.

ggplot(data = filter(pf, !is.na(gender)), aes(x = www_likes, color = gender)) + 
        geom_freqpoly() +
        scale_x_log10()

Now, this looks much better. It looks like males have more likes in the beginning, and then females start to take over at some point. Still, we want some numerical information on who has more likes.

Let’s use dplyr to answer that question:

gender_webLikes <- pf %>%
        filter(!is.na(gender)) %>%
        group_by(gender) %>%
        summarise(t_webLikes = sum(www_likes),
                  avg_webLikes = mean(www_likes),
                  median_webLikes = median(www_likes))

gender_webLikes

## # A tibble: 2 x 4
##   gender t_webLikes avg_webLikes median_webLikes
##    <chr>      <int>        <dbl>           <dbl>
## 1 female    3507665     87.13830               0
## 2   male    1430175     24.41655               0
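A median of 0 next to a clearly positive mean is typical of a variable where most users have no activity and a few have a lot. The sketch below reproduces that pattern on made-up numbers; it is not drawn from the pf data.

```r
# Made-up web-like counts: most users have none, one user has many.
www_likes <- c(0, 0, 0, 0, 0, 0, 2, 5, 40, 953)

median(www_likes)      # 0: at least half of the values are zero
mean(www_likes)        # 100: dominated by the single heavy user
mean(www_likes == 0)   # 0.6: share of users with no web likes at all
```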

Box Plots

If you need a refresher on boxplots and how to read them, check this link how to read and use a Boxplot.

Also, have a look at this image to get a sense of how boxplot relates to normal distribution: Boxplot vs PDF

Recall that earlier we split friend count by gender in a pair of histograms using facet_wrap.

Now, we’re going to generate box plots of friend count by gender, so we can quickly see the differences between the distributions, especially the difference between the median of the two groups.

# Note that in boxplots the y axis will be the variable you're plotting
# (friend count), and the grouping variable (gender) will be on the x axis.
ggplot(data = filter(pf, !is.na(gender)), aes(y = friend_count, x = gender)) +
        geom_boxplot()

The boxes cover the middle 50% of values, or what’s called the interquartile range \(IQR\). The tiny dots above the boxes are outliers. We usually consider a point an outlier when it lies more than \(1.5\) times the IQR beyond the edge of the box, on either side.

We can also see that the y axis is capturing all the friend counts from zero all the way up to 5,000. So we’re not omitting any user data in this plot.

And finally, this horizontal line inside the boxes is the median for the two box plots.
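The outlier rule can be computed directly. The sketch below uses made-up values and the common convention of measuring 1.5 times the IQR from the first and third quartiles; it compares the hand calculation with base R's boxplot.stats, which uses hinges that can differ slightly from quantile().

```r
# Made-up values with one extreme observation.
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 100)

q1  <- unname(quantile(x, 0.25))
q3  <- unname(quantile(x, 0.75))
iqr <- q3 - q1

lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

# Points outside the fences are the ones drawn as individual dots.
flagged <- x[x < lower_fence | x > upper_fence]
flagged                # 100

# Base R's own boxplot rule flags the same point here.
boxplot.stats(x)$out   # 100
```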

Since there are so many outliers in these plots, let’s adjust our code to focus on just these two boxes. We’ll have you do this in the next programming exercise.

Adjust the code to focus on users who have friend counts between 0 and 1000.

ggplot(data = filter(pf, !is.na(gender)), aes(y = friend_count, x = gender)) +
        geom_boxplot(aes(color = gender)) +
        scale_y_continuous(limits = c(0, 1000), 
                           breaks = seq(0, 1000, by = 100))

Now it looks much better. But there’s another problem. To illustrate it let’s summarise again friend_count by gender:

summary(filter(pf, gender == "female") %>% select(friend_count) %>% unlist)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0      37      96     242     244    4923

From the summary we can see that the 3rd quartile of female friend_count is 244. Look at the left boxplot in the last graph. What do you notice? The upper boundary of the box, which corresponds to the 3rd quartile, is below 244. Why is that?

Well, when we limit the y-axis scale to (0, 1000), we actually remove the data points outside that range from the calculations. To avoid this, it’s better to use the coord_cartesian layer to set the limits.

ggplot(data = filter(pf, !is.na(gender)), aes(y = friend_count, x = gender)) +
        geom_boxplot(aes(color = gender)) +
        coord_cartesian(ylim = c(0, 1000)) +
        scale_y_continuous(breaks = seq(0, 1000, by = 100))

Notice how the top of the box has moved slightly upward. I will plot the previous two plots beside each other to see the difference.

plot1 <- ggplot(data = filter(pf, !is.na(gender)), aes(y = friend_count, x = gender)) +
        geom_boxplot(aes(color = gender)) +
        scale_y_continuous(limits = c(0, 1000), 
                           breaks = seq(0, 1000, by = 100)) + 
        ggtitle("Using (scale_y_continuous)")

plot2 <- ggplot(data = filter(pf, !is.na(gender)), aes(y = friend_count, x = gender)) +
        geom_boxplot(aes(color = gender)) +
        coord_cartesian(ylim = c(0, 1000)) +
        scale_y_continuous(breaks = seq(0, 1000, by = 100)) + 
        ggtitle("Using (coord_cartesian)")

grid.arrange(plot1, plot2, ncol = 2)
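Why the box moves can be shown without any plotting. This base R sketch on made-up friend counts mimics the two behaviours: dropping out-of-range values before computing the quartile, versus computing it on all the data and only zooming the view. It is an analogy for what the two ggplot2 layers do, not their actual implementation.

```r
# Made-up friend counts with a long right tail.
friend_count <- c(10, 40, 90, 150, 240, 400, 800, 1500, 3000, 4900)

# Roughly what limits = c(0, 1000) does: out-of-range values are dropped
# BEFORE the quartiles are computed.
kept <- friend_count[friend_count >= 0 & friend_count <= 1000]
q3_after_dropping <- unname(quantile(kept, 0.75))

# Roughly what coord_cartesian(ylim = c(0, 1000)) does: statistics use all
# the data, and only the visible window changes.
q3_all_data <- unname(quantile(friend_count, 0.75))

q3_after_dropping   # 320
q3_all_data         # 1325
```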

Box Plots, Quartiles, and Friendships

It looks like females on average have slightly more friends than males, since we can see that the median line is slightly higher. Let’s zoom in to get a better look.

ggplot(data = filter(pf, !is.na(gender)), aes(y = friend_count, x = gender)) +
        geom_boxplot(aes(color = gender)) +
        coord_cartesian(ylim = c(0, 250))

Now, let’s check our findings by looking at actual values and compare the values to what we see in our box plot.

# by() takes 3 arguments: the first is the variable you
# want to apply some function to,
# the second is the variable you want to condition on,
# and the third is the function you want to apply
by(pf$friend_count, pf$gender, summary)

## pf$gender: female
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0      37      96     242     244    4923 
## -------------------------------------------------------- 
## pf$gender: male
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0      27      74     165     182    4917

The table shows the minimum and maximum values for both genders, as well as the quartiles. The first quartile for women is 37, and that looks about right in our graph. The third quartile, or the 75% mark, is at 244. This means that 75% of female users have friend counts below 244. Another way to say this is that 25% of female users have more than 244 friends.

Similarly for the men, the first quartile (27) and the third quartile (182) match up with the box plot.
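
To see what the quartiles in that table mean, it can help to compute one by hand on a small vector. The sketch below uses made-up numbers (toy_friend_count is a hypothetical name, not a column of pf) and checks that three quarters of the values fall at or below the third quartile.

```r
# Hypothetical friend counts, for illustration only
toy_friend_count <- c(0, 5, 12, 30, 37, 60, 96, 150, 244, 400, 900, 4923)

# Third quartile, using R's default quantile method (type 7)
q3 <- quantile(toy_friend_count, probs = 0.75)

# Share of values at or below, and above, the third quartile
share_at_or_below_q3 <- mean(toy_friend_count <= q3)
share_above_q3 <- mean(toy_friend_count > q3)

share_at_or_below_q3  # 0.75
share_above_q3        # 0.25
```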

On average, who initiated more friendships in our sample: men or women?

pf

## # A tibble: 99,003 x 15
##     userid   age dob_day dob_year dob_month gender tenure friend_count
##      <int> <int>   <int>    <int>     <int>  <chr>  <int>        <int>
## 1  2094382    14      19     1999        11   male    266            0
## 2  1192601    14       2     1999        11 female      6            0
## 3  2083884    14      16     1999        11   male     13            0
## 4  1203168    14      25     1999        12 female     93            0
## 5  1733186    14       4     1999        12   male     82            0
## 6  1524765    14       1     1999        12   male     15            0
## 7  1136133    13      14     2000         1   male     12            0
## 8  1680361    13       4     2000         1 female      0            0
## 9  1365174    13       1     2000         1   male     81            0
## 10 1712567    13       2     2000         2   male    171            0
## # ... with 98,993 more rows, and 7 more variables:
## #   friendships_initiated <int>, likes <int>, likes_received <int>,
## #   mobile_likes <int>, mobile_likes_received <int>, www_likes <int>,
## #   www_likes_received <int>

# first let's use a boxplot
ggplot(data = filter(pf, !is.na(gender)), 
       aes(y = friendships_initiated, 
           x = gender,
           color = gender)) +
        geom_boxplot() + 
        coord_cartesian(ylim = c(0, 250))

# It looks like females initiated slightly more friendships
# on average.

# Now, let's check using numerical summary
pf %>% filter(!is.na(gender)) %>% group_by(gender) %>% 
        summarise(friendships_initiated_avg = mean(friendships_initiated))

## # A tibble: 2 x 2
##   gender friendships_initiated_avg
##    <chr>                     <dbl>
## 1 female                  113.8991
## 2   male                  103.0666

# another way
by(pf$friendships_initiated, pf$gender, summary)

## pf$gender: female
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0    19.0    49.0   113.9   124.8  3654.0 
## -------------------------------------------------------- 
## pf$gender: male
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0    15.0    44.0   103.1   111.0  4144.0
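
Both summaries above skip users whose gender is missing: the dplyr version filters them out explicitly, and by() drops NA groups on its own. The following base R sketch shows the same behaviour with tapply() on a tiny made-up data frame (toy and its values are hypothetical).

```r
# Hypothetical data: one row has a missing gender
toy <- data.frame(
  gender = c("female", "female", "male", "male", NA),
  friendships_initiated = c(10, 30, 5, 15, 1000)
)

# tapply() groups by gender; the row with a missing gender is left out
group_means <- tapply(toy$friendships_initiated, toy$gender, mean)
group_means
```

The result has one mean per observed gender (20 for female, 10 for male), and the row with the missing gender does not contribute to either.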

Getting Logical

Sometimes you encounter variables that have a lot of zero values. In this case, it’s often useful to convert that variable into a binary one that takes only the values true or false.

This is helpful because we may want to know whether a user has used a certain feature at all, instead of the number of times that the user has actually used that feature.

For example, it may not matter how many times a person checked in using a mobile device, but only whether the person has ever used mobile check-in.

Here’s a summary of the mobile likes in our dataset.

summary(pf$mobile_likes)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0     0.0     4.0   106.1    46.0 25110.0

# The first quartile is 0 and the median is only 4, so at least a quarter of users have no mobile likes.

# now let's run another summary
summary(pf$mobile_likes > 0)

##    Mode   FALSE    TRUE    NA's 
## logical   35056   63947       0

The table gives us the number of users who have checked in on mobile (TRUE) and the number who haven’t (FALSE), regardless of how many mobile likes they have.
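
The same kind of count can be obtained with table() or sum(), since a logical vector counts TRUE as 1. Here is a minimal sketch on made-up values (toy_likes is hypothetical):

```r
# Hypothetical mobile like counts, for illustration only
toy_likes <- c(0, 0, 4, 46, 0, 12, 250, 1)

# TRUE where the user has at least one mobile like
has_likes <- toy_likes > 0

# Counts of FALSE and TRUE
table(has_likes)

n_users_with_likes <- sum(has_likes)      # 5
n_users_without_likes <- sum(!has_likes)  # 3
```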

Let’s create a new variable that tracks mobile check-ins.

pf$mobile_check_in <- ifelse(pf$mobile_likes > 0, 1, 0)

What percent of users check in using mobile?

mean(pf$mobile_check_in, na.rm = TRUE)

## [1] 0.6459097
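
Because the new variable is coded as 0 or 1, its mean is the share of users with at least one mobile like. As a check, the same figure follows from the counts in the logical summary of mobile_likes > 0 (35,056 FALSE and 63,947 TRUE):

```r
# Counts taken from summary(pf$mobile_likes > 0)
n_false <- 35056
n_true  <- 63947

# Proportion of users with at least one mobile like
share_mobile <- n_true / (n_false + n_true)

round(share_mobile, 4)        # 0.6459
round(100 * share_mobile, 1)  # 64.6
```

So roughly 65% of users in the sample have checked in using mobile.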
