Please install and load the following packages before continuing
And download the pseudo_facebook dataset from here: pseudo_facebook
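The package list itself did not survive on this page. As a rough sketch, the packages below are inferred from the functions used later in this walkthrough (read_tsv, ggplot, filter, grid.arrange, theme_gdocs); treat the list as an assumption rather than the original one.

```r
# Packages inferred from the functions used in this walkthrough
# (an assumption, not the author's original list).
pkgs <- c("readr", "ggplot2", "dplyr", "gridExtra", "ggthemes")

# Report anything that is not installed yet instead of failing midway.
missing_pkgs <- pkgs[!vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)]

if (length(missing_pkgs) > 0) {
  message("Install first: install.packages(c(",
          paste0('"', missing_pkgs, '"', collapse = ", "), "))")
} else {
  # Attach every package so read_tsv(), ggplot(), filter(), etc. are available.
  invisible(lapply(pkgs, library, character.only = TRUE))
}
```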
# read the data using read_tsv function from package (readr)
pf <- read_tsv("pseudo_facebook.tsv")
pf
## # A tibble: 99,003 x 15
## userid age dob_day dob_year dob_month gender tenure friend_count
## <int> <int> <int> <int> <int> <chr> <int> <int>
## 1 2094382 14 19 1999 11 male 266 0
## 2 1192601 14 2 1999 11 female 6 0
## 3 2083884 14 16 1999 11 male 13 0
## 4 1203168 14 25 1999 12 female 93 0
## 5 1733186 14 4 1999 12 male 82 0
## 6 1524765 14 1 1999 12 male 15 0
## 7 1136133 13 14 2000 1 male 12 0
## 8 1680361 13 4 2000 1 female 0 0
## 9 1365174 13 1 2000 1 male 81 0
## 10 1712567 13 2 2000 2 male 171 0
## # ... with 98,993 more rows, and 7 more variables:
## # friendships_initiated <int>, likes <int>, likes_received <int>,
## # mobile_likes <int>, mobile_likes_received <int>, www_likes <int>,
## # www_likes_received <int>
Notes: Bar chart is used to plot
theme_set(theme_minimal(15))
qplot(as.factor(dob_day), data = pf, geom = "bar") +
  xlab("Day of Birth")
We can notice that there’s a huge bin at day 1. This seems unusual, since we would expect dob_day to be spread roughly evenly across the days of a month.
There’s also a small bin at day 31. This one makes sense: not all months have 31 days, which explains the low number of people in that bin.
Notes: Let’s break our barplot into 12 plots, one for each month of the year
# Now we will use the function (facet_wrap) to plot dob_day conditioned
# on dob_month
ggplot(data = pf, aes(x = as.factor(dob_day), fill = dob_month)) +
  geom_bar() +
  facet_wrap(~ dob_month, ncol = 2)
facet_wrap and facet_grid: These two functions are used for conditioning, that is, for plotting one or more variables conditioned on another variable.
facet_grid
The data can be split up by one or two variables that vary on the horizontal and/or vertical direction.
This is done by giving a formula to facet_grid(), of the form vertical ~ horizontal
facet_wrap
Instead of faceting with a variable in the horizontal or vertical direction, facets can be placed next to each other, wrapping with a certain number of columns or rows. The label for each plot will be at the top of the plot.
Tip: If you want to condition over one variable, like our case, it’s better to use facet_wrap. Otherwise, facet_grid would be better.
If you want to read more about this topic, check this link: Faceting in ggplot2
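As a small sketch of the difference between the two functions, the example below uses the built-in mtcars data set rather than pf, with cyl and gear standing in for the conditioning variables (an illustrative choice, not part of the original analysis).

```r
library(ggplot2)

# mtcars ships with R; cyl and gear stand in for two conditioning variables.
base_plot <- ggplot(mtcars, aes(x = mpg)) +
  geom_histogram(binwidth = 2)

# One conditioning variable: panels wrap into 2 columns, label on top.
p_wrap <- base_plot + facet_wrap(~ cyl, ncol = 2)

# Two conditioning variables: cyl varies vertically (rows),
# gear varies horizontally (columns), following vertical ~ horizontal.
p_grid <- base_plot + facet_grid(cyl ~ gear)

print(p_wrap)
print(p_grid)
```

With a single conditioning variable the wrapped layout uses the page better; with two, the grid keeps each variable on its own axis of the layout.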
Now we can easily see that the large number of people born on day 1 occurs only in January, while the other 11 months are spread roughly evenly, as we originally expected.
So, the number of people born on January 1st represents an outlier. But what actually is an outlier?
People often talk about the importance of detecting and dealing with outliers in your data.
But there are many types of outliers and anomalies and how your analysis proceeds should depend on what type you’re dealing with. Outliers can have many causes.
For example, an outlier might be accurate data about an extreme case. For example, someone represented in your data set might really be tweeting 1,000 times a day. On the other hand, sometimes outliers, or anomalies represent bad data, or the limitations of your data. For example, what otherwise would be a normal value of a variable might be replaced with an extreme value. Or, in other cases, extreme values might be replaced with a more normal value. For example, in a lot of census data or surveys, income information is top coded. So individuals with very large incomes have their incomes replaced with some other value.
ggplot(data = pf, aes(x = friend_count)) +
  geom_histogram()
It’s similar. We need to limit our axes to get a better sense of the data.
ggplot(data = pf, aes(x = friend_count)) +
geom_histogram() +
  scale_x_continuous(limits = c(0, 1000))
Let’s set binwidth to 25. Also, let’s place an x-axis break every 50 units, so that each pair of bins fits between two breaks.
ggplot(data = pf, aes(x = friend_count)) +
geom_histogram(binwidth = 25) +
scale_x_continuous(limits = c(0, 1000),
                     breaks = seq(0, 1000, 50))
Read more about adjusting scales at this link: Scales in ggplot2.
We want to know which gender has more friends on average.
# What code would you add to facet the histogram by gender?
# Add it to the code below.
ggplot(data = pf, aes(x = friend_count)) +
geom_histogram(binwidth = 25) +
scale_x_continuous(limits = c(0, 1000),
breaks = seq(0, 1000, 50)) +
  facet_wrap(~ gender, nrow = 2)
Ohhh, we generated 3 panels, not 2. We need to remove the NA values from the variable gender.
# we have 2 options to omit NA values from (gender)
# 1st option: omitting NA only from (gender)
ggplot(data = filter(pf, !is.na(gender)), aes(x = friend_count)) +
geom_histogram(binwidth = 25) +
scale_x_continuous(limits = c(0, 1000),
breaks = seq(0, 1000, 50)) +
  facet_wrap(~ gender, nrow = 2)
# 2nd option: omitting NA from the entire data set
# (not recommended since you may omit observations that have gender value
# but have NA for different variables)
ggplot(data = na.omit(pf), aes(x = friend_count)) +
geom_histogram(binwidth = 25) +
scale_x_continuous(limits = c(0, 1000),
breaks = seq(0, 1000, 50)) +
  facet_wrap(~ gender, nrow = 2)
We need to look at the average friend_count by gender. Who, on average, has more friends: males or females?
# Let's calculate the mean and median for friend_count conditioned on gender
by_gender <- pf %>%
group_by(gender) %>%
summarise(Avg_friendCount = round(mean(friend_count), 0),
median_friendCount = median(friend_count))
by_gender
## # A tibble: 3 x 3
## gender Avg_friendCount median_friendCount
## <chr> <dbl> <dbl>
## 1 female 242 96
## 2 male 165 74
## 3 <NA> 184 81
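The NA row appears because this summary was computed on the unfiltered data. As a minimal sketch of how the two NA-handling options shown earlier differ, the toy data frame below (hypothetical values, not taken from pseudo_facebook.tsv) has one user with unknown gender and one user with known gender but unknown tenure.

```r
# Toy data frame (hypothetical values, not taken from pseudo_facebook.tsv).
toy <- data.frame(
  gender       = c("male", "female", NA, "female"),
  friend_count = c(10, 20, 30, 40),
  tenure       = c(1, NA, 3, 4)
)

# Option 1: drop rows only where gender is missing.
only_gender_known <- toy[!is.na(toy$gender), ]

# Option 2: drop rows with a missing value in ANY column.
all_complete <- na.omit(toy)

nrow(only_gender_known)  # 3 rows: the female user with unknown tenure is kept
nrow(all_complete)       # 2 rows: that user is lost as well
```

This is why filtering on the one column you care about is usually the safer choice.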
# Let's look at the histogram of tenure
ggplot(data = pf, aes(x = tenure)) +
  geom_histogram(color = 'black', fill = "light blue")
# Let's change the binwidth to 30 to have a better look at the distribution
ggplot(data = pf, aes(x = tenure)) +
geom_histogram(binwidth = 30, color = 'black', fill = "light blue") +
  geom_density()
The parameter color determines the outline color of objects in a plot.
The parameter fill determines the color of the area inside objects in a plot. When fill is dependent on the data, specify it in the aes portion of the main function ggplot.
When fill is a fixed value, e.g. “blue” or “red” … etc, specify it in the geom function itself.
Read more about colors in ggplot2 in this link: ggplot2 colors
ggplot(data = pf, aes(x = tenure/365)) +
  geom_histogram(color = 'black', fill = "light blue")
# let's set the binwidth equal to 1/4 (this would give us a bin each 3 months)
ggplot(data = pf, aes(x = tenure/365)) +
geom_histogram(binwidth = 0.25, color = 'black', fill = "orange") +
  scale_x_continuous(breaks = seq(0, 10, by = 1))
# let's omit the last 3 years (7-10) from our plot since they contain a very
# low number of people
ggplot(data = pf, aes(x = tenure/365)) +
geom_histogram(binwidth = 0.25, color = 'black', fill = "orange") +
  scale_x_continuous(limits = c(0, 7), breaks = seq(0, 7, by = 1))
# Now it looks much better
ggplot(data = pf, aes(x = tenure/365)) +
geom_histogram(binwidth = 0.25, color = 'black', fill = "orange") +
scale_x_continuous(limits = c(0, 7), breaks = seq(0, 7, by = 1)) +
xlab("Number of Years Using Facebook") +
ylab("Number of users in sample") +
  ggtitle("Tenure Histogram by Year")
ggplot(data = pf, aes(x = age)) +
  geom_histogram(color = "black", fill = 'salmon')
# We need to adjust binwidth and x-axis limits and breaks
# Let's omit the last 20 and first 10 years of our histogram,
# set binwidth equal to 2.5 and x-axis breaks equal to 5
ggplot(data = pf, aes(x = age)) +
  geom_histogram(binwidth = 2.5, color = "black", fill = 'deepskyblue3') +
  scale_x_continuous(limits = c(10, 110), breaks = seq(10, 110, 5))
Most of our variables, such as friend count, likes, comments, wall posts and others, are what I would call engagement variables, and they all have very long tails. Some users have 10 times, or even 100 times, the median value. Another way to say this is that some people have orders of magnitude more likes, clicks, or comments than other users. In statistics, we say that the data is overdispersed.
Often it helps to transform these values so we can see standard deviations, or orders of magnitude; in effect, we are shortening the tail.
Here was our histogram of friend count from before:
ggplot(data = pf, aes(x = friend_count)) +
  geom_histogram()
Notice that we still have that long tail. We can transform this variable by taking the log, using the natural log, log base 2, or log base 10. We could also use other functions, such as the square root. Doing so helps us see patterns more clearly, without being distracted by the tails.
A lot of common statistical techniques, like linear regression, rely on normality assumptions (strictly speaking, linear regression assumes normally distributed errors rather than normally distributed variables). By taking the log of this variable, we can transform our data into something that more closely resembles a normal distribution, which would be useful if we were using linear regression or some other modelling technique.
Now, I know we’re not doing modelling here, but let’s just see what it looks like to transform the variable. First, I’m going to just do this in the summary command. So, here’s our regular summary of friend count.
summary(pf$friend_count)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 31.0 82.0 196.4 206.0 4923.0
Looks like the median friend count is 82, and the mean is 196. I can take the log base 10 \((\log_{10})\) of this friend count and get a different table:
summary(log10(pf$friend_count))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -Inf 1.491 1.914 -Inf 2.314 3.692
Now this seems a little bit unusual, since I have negative infinity for the minimum and negative infinity for the mean. So what must be going on? Well, some of our users have a friend count of zero, and \(\log_{10}\) of zero is undefined. For those familiar with calculus, the limit is negative infinity, which is why that appears here. To understand this more intuitively, think of the power you would have to raise \(10\) to in order to get zero. No finite power works: \(10^x\) only approaches zero as \(x\) becomes a very, very large negative number, that is, as \(x \to -\infty\).
To avoid getting \(-\infty\), we’re going to add one to friend count, so that we don’t get an undefined answer.
summary(log10(pf$friend_count + 1))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 1.505 1.919 1.868 2.316 3.692
There, that looks much better.
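The same effect can be reproduced on a handful of numbers. This is only a sketch with hypothetical friend counts, chosen so that adding one gives exact powers of ten.

```r
# Hypothetical friend counts chosen so that x + 1 is an exact power of ten.
friend_count <- c(0, 9, 99, 999)

log10(friend_count)            # first element is -Inf because log10(0) is -Inf
mean(log10(friend_count))      # -Inf: a single -Inf drags the mean with it

shifted <- log10(friend_count + 1)
shifted                        # 0 1 2 3
mean(shifted)                  # 1.5
```

A single zero is enough to turn both the minimum and the mean into negative infinity, which is exactly what the summary above showed.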
Now, let’s use another function. Let’s use the square root on friend count. This would be another type of transformation.
summary(sqrt(pf$friend_count))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 5.568 9.055 11.090 14.350 70.160
For me, log base 10 is an easier transformation to wrap my head around, since I’m just comparing friend counts on orders of magnitude of 10.
Now that you’ve seen transformations within summaries, let’s see if you can apply a similar transformation to the histogram. Check out the following links to learn how to use scales and how to create multiple graphs on one page.
Once you’ve read through those links, and you think you’re ready, try this next programming exercise. In it, you’re going to create three different histograms.
The first one will be our original friend count histogram, and then the second one will have the friend count transformed using log 10. And then the last histogram will have the friend count transformed using square root.
library(gridExtra)
library(ggthemes)
# friend_count histogram
hist1 <- ggplot(data = pf, aes(x = friend_count)) +
geom_histogram(fill = "salmon", color = "black") +
xlab("Friend Count") +
theme_gdocs() +
scale_color_gdocs()
# friend count transformed using log10
hist2 <- ggplot(data = pf, aes(x = friend_count +1)) +
geom_histogram(fill = "orange", color = "black") +
scale_x_log10() +
xlab("Friend Count") +
  theme_gdocs() +
  scale_color_gdocs()
# friend count transformed using sqrt
hist3 <- ggplot(data = pf, aes(x = friend_count)) +
geom_histogram(fill = "deepskyblue2", color = "black") +
scale_x_sqrt() +
xlab("Friend Count") +
  theme_gdocs() +
  scale_color_gdocs()
# plot the 3 plots using grid.arrange
grid.arrange(hist1, hist2, hist3, nrow = 3)
There’s another type of plot that lets us compare distributions, the frequency polygon. Frequency polygons are similar to histograms, but they draw a curve connecting the counts in a histogram. So this allows us to see the shape and the peaks of our distribution in more detail.
Remember, we were trying to answer the question: who on average has more friends, men or women? We drew this histogram before:
ggplot(data = filter(pf, !is.na(gender)), aes(x = friend_count)) +
geom_histogram(binwidth = 25) +
scale_x_continuous(limits = c(0, 1000),
breaks = seq(0, 1000, 50)) +
  facet_wrap(~ gender, nrow = 2)
We said we couldn’t tell based on this histogram, so we ran some numerical summaries instead. Instead of having these 2 histograms in separate panels, we can use a frequency polygon and overlay the two distributions. Here’s how we can create that frequency polygon. I’ll copy and paste the same code, except I need to make an addition.
# I will just copy the code from before and do some changes:
# remove the facet_wrap function, because I want to overlay the 2 polygons
# on top of each other.
# Instead, I will pass the argument (color) to the (aes) function,
# and set (color) equal to (gender). By doing so, I'm telling ggplot
# to condition my plot on (gender).
ggplot(data = filter(pf, !is.na(gender)),
aes(x = friend_count, color = gender)) +
geom_freqpoly() +
scale_x_continuous(limits = c(0, 1000),
                     breaks = seq(0, 1000, 50))
Now we can compare 2 or more distributions at once. But again, this plot doesn’t really answer our question: who has more friends on average, men or women?
Let’s change the y-axis to show proportions instead of raw counts. To do so, we need to add another argument, y, to the (aes) function and set it equal to ..count../sum(..count..)
ggplot(aes(x = friend_count, y = ..count../sum(..count..)),
data = filter(pf, !is.na(gender))) +
geom_freqpoly(aes(color = gender), binwidth=10) +
scale_x_continuous(limits = c(0, 1000), breaks = seq(0, 1000, 50)) +
xlab('Friend Count') +
  ylab('Percentage of users with that friend count')
# Note that sum(..count..) will sum across color,
# so the percentages displayed are percentages of total users.
# To plot percentages within each group, you can try y = ..density..
Try to play around with this yourself to see where women overtake men along the x-axis. Try using limits or breaks to explore more.
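The distinction between proportions of all users and proportions within each gender can be sketched with a small contingency table. The values below are hypothetical, and the "low"/"high" bins are stand-ins for histogram bins. Note that ..density.. is not literally a within-group proportion: it also divides by the bin width, so it has the same shape but a different scale.

```r
# Hypothetical users: 3 females and 1 male, each falling in a friend-count bin.
gender <- c("female", "female", "female", "male")
bin    <- c("low", "high", "low", "low")
counts <- table(gender, bin)

# Analogue of count / sum(count): every cell divided by the grand total.
prop_of_total <- counts / sum(counts)

# Within-group version: every cell divided by its own gender's total.
prop_within <- prop.table(counts, margin = 1)

prop_of_total["male", "low"]   # 0.25: one of four users overall
prop_within["male", "low"]     # 1: every male user is in this bin
```

When group sizes differ, proportions of the total make the smaller group look lower everywhere, which is why the within-group view is often the fairer comparison.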
Use a frequency polygon to determine which gender creates more likes on the world wide web. Create a frequency polygon and explore it in many different ways. Remember that the first plot that you make doesn’t need to be final.
ggplot(data = pf, aes(x = www_likes)) +
  geom_freqpoly(aes(color = gender))
# note: this is equivalent code
ggplot(data = pf, aes(x = www_likes, color = gender)) +
  geom_freqpoly()
Our plot is heavily skewed to the right. Let’s try a log10 transformation on the x-axis.
ggplot(data = filter(pf, !is.na(gender)), aes(x = www_likes, color = gender)) +
geom_freqpoly() +
  scale_x_log10()
Now, this looks much better. It looks like males have more likes in the beginning, and then females start to take over at some point. Still, we want some numerical information on who has more likes.
Let’s use dplyr to answer that question:
gender_webLikes <- pf %>%
filter(!is.na(gender)) %>%
group_by(gender) %>%
summarise(t_webLikes = sum(www_likes),
avg_webLikes = mean(www_likes),
median_webLikes = median(www_likes))
gender_webLikes
## # A tibble: 2 x 4
## gender t_webLikes avg_webLikes median_webLikes
## <chr> <int> <dbl> <dbl>
## 1 female 3507665 87.13830 0
## 2 male 1430175 24.41655 0
If you need a refresher on boxplots and how to read them, check this link: How to Read and Use a Boxplot.
Also, have a look at this image to get a sense of how a boxplot relates to a normal distribution:
Recall that earlier we split friend count by gender in a pair of histograms using facet_wrap.
Now, we’re going to generate box plots of friend count by gender, so we can quickly see the differences between the distributions, especially the difference between the median of the two groups.
# Note that in boxplots the y axis will be the variable you're plotting
# (friend count), and the grouping variable (gender) will be on the x axis.
ggplot(data = filter(pf, !is.na(gender)), aes(y = friend_count, x = gender)) +
  geom_boxplot()
The boxes cover the middle 50% of values, or what’s called the interquartile range \(IQR\). The tiny little dots above the boxes are outliers. We usually consider a point an outlier when it lies more than \(1.5\) times the IQR beyond either edge of the box, that is, below the first quartile or above the third quartile.
We can also see that the y axis is capturing all the friend counts from zero all the way up to 5,000. So we’re not omitting any user data in this plot.
And finally, this horizontal line inside the boxes is the median for the two box plots.
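The 1.5 times IQR rule can be worked through by hand. The sketch below uses a small hypothetical vector rather than the friend counts, and computes the quartiles with R's default quantile() method.

```r
# Hypothetical values with one obvious extreme point.
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 50)

q1  <- quantile(x, 0.25)
q3  <- quantile(x, 0.75)
iqr <- q3 - q1                      # same value as IQR(x)

# Points beyond these fences are the ones drawn as individual dots.
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

outliers <- x[x < lower_fence | x > upper_fence]
outliers                            # 50
```

Note that the fences are measured from the quartiles (the box edges), not from the median.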
Since there are so many outliers in these plots, let’s adjust our code to focus on just these two boxes. We’ll have you do this in the next programming exercise.
ggplot(data = filter(pf, !is.na(gender)), aes(y = friend_count, x = gender)) +
geom_boxplot(aes(color = gender)) +
scale_y_continuous(limits = c(0, 1000),
                     breaks = seq(0, 1000, by = 100))
Now it looks much better. But there’s another problem. To illustrate it, let’s summarise friend_count for female users again:
summary(filter(pf, gender == "female") %>% select(friend_count) %>% unlist)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 37 96 242 244 4923
From the summary we can see that the 3rd quartile of female friend_count is 244. Look at the left boxplot in the last graph. What do you notice? The upper boundary of the box, which corresponds to the 3rd quartile, is below 244! Why is that?
Well, when we limit the y-axis to (0, 1000) with scale_y_continuous, we actually remove the data points outside that range before the boxplot statistics are calculated. To avoid this, it’s better to use the coord_cartesian layer to set the limits.
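To see the effect in plain numbers, here is a minimal sketch with hypothetical friend counts. Dropping the values above 1000 before computing the third quartile mimics what scale limits do, while computing it on everything mimics zooming with coord_cartesian.

```r
# Hypothetical friend counts with a long upper tail.
friend_count <- c(10, 50, 100, 200, 400, 800, 1500, 3000)

# What zooming preserves: statistics computed on all the data.
q3_full <- quantile(friend_count, 0.75)

# What scale limits effectively do: values outside the limits are dropped first.
q3_trimmed <- quantile(friend_count[friend_count <= 1000], 0.75)

q3_full      # 975
q3_trimmed   # 350
```

The trimmed quartile is lower, which is the same reason the top of the box sat below 244 in the plot above.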
ggplot(data = filter(pf, !is.na(gender)), aes(y = friend_count, x = gender)) +
geom_boxplot(aes(color = gender)) +
coord_cartesian(ylim = c(0, 1000)) +
  scale_y_continuous(breaks = seq(0, 1000, by = 100))
Notice how the top of the box has moved slightly upward. I will plot the previous two plots beside each other to see the difference.
plot1 <- ggplot(data = filter(pf, !is.na(gender)), aes(y = friend_count, x = gender)) +
geom_boxplot(aes(color = gender)) +
scale_y_continuous(limits = c(0, 1000),
breaks = seq(0, 1000, by = 100)) +
ggtitle("Using (scale_y_continuous)")
plot2 <- ggplot(data = filter(pf, !is.na(gender)), aes(y = friend_count, x = gender)) +
geom_boxplot(aes(color = gender)) +
coord_cartesian(ylim = c(0, 1000)) +
scale_y_continuous(breaks = seq(0, 1000, by = 100)) +
ggtitle("Using (coord_cartesian)")
grid.arrange(plot1, plot2, ncol = 2)
It looks like females on average have slightly more friends than males, since the median line is slightly higher. Let’s zoom in to get a better look.
ggplot(data = filter(pf, !is.na(gender)), aes(y = friend_count, x = gender)) +
geom_boxplot(aes(color = gender)) +
  coord_cartesian(ylim = c(0, 250))
Now, let’s check our findings by looking at the actual values and comparing them to what we see in our box plot.
# by takes 3 arguments: the first is the variable you
# want to apply some function to,
# the second is the variable you want to condition on,
# and the third is the function you want to apply
by(pf$friend_count, pf$gender, summary)
## pf$gender: female
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 37 96 242 244 4923
## --------------------------------------------------------
## pf$gender: male
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 27 74 165 182 4917
The table shows the minimum and maximum values for both genders, as well as the quartiles. The first quartile for women is 37, and that looks about right in our graph. The third quartile, or the 75% mark, is at 244. This means that 75% of female users have friend counts below 244. Or, another way to say this is that 25% of female users have more than 244 friends.
Similarly for the men, we can see how the first quartiles and the third quartiles match up to the box plot.
# first let's use a boxplot
ggplot(data = filter(pf, !is.na(gender)),
aes(y = friendships_initiated,
x = gender,
color = gender)) +
geom_boxplot() +
  coord_cartesian(ylim = c(0, 250))
# looks like females have slightly more friendships_initiated
# on average.
# Now, let's check using numerical summary
pf %>% filter(!is.na(gender)) %>% group_by(gender) %>%
  summarise(friendships_initiated_avg = mean(friendships_initiated))
## # A tibble: 2 x 2
## gender friendships_initiated_avg
## <chr> <dbl>
## 1 female 113.8991
## 2 male 103.0666
# another way
by(pf$friendships_initiated, pf$gender, summary)
## pf$gender: female
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 19.0 49.0 113.9 124.8 3654.0
## --------------------------------------------------------
## pf$gender: male
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 15.0 44.0 103.1 111.0 4144.0
Sometimes you encounter variables that have a lot of zero values. In this case, it’s often useful to convert that variable into a binary one that is only true or false.
This is helpful because we may want to know whether a user has used a certain feature at all, instead of the number of times that the user has actually used that feature.
For example, it may not matter how many times a person checked in using a mobile device, but only whether the person has ever used mobile check-in.
Here’s a summary of the mobile likes in our dataset.
summary(pf$mobile_likes)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 0.0 4.0 106.1 46.0 25110.0
# The median is only four and the first quartile is zero, which means we have a lot of zeroes in our dataset.
# now let's run another summary
summary(pf$mobile_likes > 0)
## Mode FALSE TRUE NA's
## logical 35056 63947 0
The table gives us the number of users who have checked in and the number of users who haven’t, regardless of how many mobile likes they made.
Let’s create a new variable that tracks mobile check-ins.
pf$mobile_check_in <- ifelse(pf$mobile_likes > 0, 1, 0)
mean(pf$mobile_check_in, na.rm = TRUE)
## [1] 0.6459097
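The mean of a 0/1 variable is simply the share of ones. As a quick sketch, the first part below uses five hypothetical users, and the second part reuses the TRUE/FALSE counts from the logical summary above.

```r
# Hypothetical mobile_likes values for five users.
mobile_likes <- c(0, 3, 0, 12, 7)

# 1 if the user has any mobile likes, 0 otherwise.
used_mobile <- ifelse(mobile_likes > 0, 1, 0)
mean(used_mobile)                     # 0.6: three of five users

# The same logic applied to the TRUE/FALSE counts reported above.
share_true <- 63947 / (35056 + 63947)
share_true                            # about 0.6459
```

So roughly 65% of users in the sample have at least one mobile like, which matches the mean of mobile_check_in.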