library(ggplot2)    # qplot(), ggplot()
library(gridExtra)  # grid.arrange()
pf <- read.csv('pseudo_facebook.tsv', sep = '\t')
names(pf)
## [1] "userid" "age"
## [3] "dob_day" "dob_year"
## [5] "dob_month" "gender"
## [7] "tenure" "friend_count"
## [9] "friendships_initiated" "likes"
## [11] "likes_received" "mobile_likes"
## [13] "mobile_likes_received" "www_likes"
## [15] "www_likes_received"
qplot(x = dob_day, data = pf)+
scale_x_continuous(breaks = 1:31)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(aes(x = dob_day), data = pf) +
geom_histogram(binwidth = 1) +
scale_x_continuous(breaks = 1:31)
Response:
Far more users report a birthday on the 1st than on any other day, January 1st in particular, which is unusual. We would expect birthdays to be spread roughly evenly across the days of the month.
You may want to split up your data by one or more variables and plot the subsets of data together.
That’s where faceting functions in R are useful.
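Conceptually, faceting splits the data by the faceting variable and draws one panel per subset. Here is a minimal base-R sketch of that split. The data frame toy and its values are invented for illustration and are not taken from pseudo_facebook.tsv.

```r
# Toy data: invented values, not from pseudo_facebook.tsv
toy <- data.frame(
  dob_month = c(1, 1, 1, 2, 2, 3),
  dob_day   = c(1, 1, 15, 1, 20, 7)
)

# split() returns one vector of dob_day values per month,
# which is the subset each facet panel would plot
by_month <- split(toy$dob_day, toy$dob_month)

# Count how many rows fall in each panel
panel_sizes <- sapply(by_month, length)
print(panel_sizes)
```

facet_wrap(~dob_month) does this grouping for you and lays the panels out in a grid.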
#faceting >> facet_wrap
qplot(x = dob_day, data = pf)+
scale_x_continuous(breaks = 1:31)+
facet_wrap(~dob_month,ncol = 3)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
#Equivalent ggplot syntax
ggplot(data = pf, aes(x = dob_day)) +
geom_histogram(binwidth = 1) +
scale_x_continuous(breaks = 1:31) +
facet_wrap(~dob_month)
A top-coded data set is one in which values of a variable above an upper bound are censored. This is often done to preserve the anonymity of people participating in a survey.
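To make top-coding concrete, here is a small sketch. The income values and the cap of 150000 are hypothetical and chosen only for illustration.

```r
# Hypothetical incomes (invented values) and a hypothetical cap
income <- c(32000, 58000, 75000, 120000, 450000, 2000000)
cap <- 150000

# Top-coding: every value above the cap is recorded as the cap itself
income_topcoded <- pmin(income, cap)

# Number of observations that were censored
n_censored <- sum(income > cap)

print(income_topcoded)
print(n_censored)
```

In a histogram, top-coded data typically shows up as a spike at the cap, because all censored values pile up in that one bin.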
Outliers would be;
Create a histogram of friend_count using the qplot syntax. We’ll also accept the ggplot syntax if you are familiar with it but additional parameters for setting the bin width or color won’t be accepted. Keep it simple.
You should create the histogram in RStudio on your computer first. Then, copy and paste your code into the {r} chunk.
Remember to load the pseudo-facebook data set into a variable named pf. Otherwise, the grader will reject the answer.
ggplot(data = pf, aes(x = friend_count)) +
geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The plot is similar to Moira’s first plot: a long tail stretches the x axis, with at least one user having close to 5,000 friends.
qplot(x=friend_count, data=pf, xlim = c(0,1000))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 2951 rows containing non-finite values (stat_bin).
#Equivalent ggplot syntax by adding a layer to x axis:
ggplot(aes(x = friend_count), data = pf) +
geom_histogram() +
scale_x_continuous(limits = c(0, 1000))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 2951 rows containing non-finite values (stat_bin).
When you adjust the bin width (here it is set to one), you see a dramatic pattern of tall vertical lines. This is because when asked how many people they think saw their post, people typically answer 10, 20, 50, or 100. They guess these round numbers and rarely guess numbers in between. This really helps to get a feel for how big people thought their audience was. The most common guess was 20, with 50 and 100 close behind. In most cases, though, the guess was only about a quarter of the actual audience size.
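The effect of bin width can be sketched without plotting. The guesses below are invented to mimic the pattern described (they are not Moira's data), and hist(..., plot = FALSE) is used only to get the bin counts.

```r
# Invented audience-size guesses: mostly round numbers, a few in between
guesses <- c(rep(10, 5), rep(20, 9), rep(50, 7), rep(100, 6),
             13, 27, 42, 64, 88)

# Bin width of 1: one bin per integer, so each round number gets its own spike
fine <- hist(guesses, breaks = seq(0, 100, by = 1), plot = FALSE)

# Bin width of 25: the spikes are merged into four broad bars
coarse <- hist(guesses, breaks = seq(0, 100, by = 25), plot = FALSE)

# Upper edge of the tallest 1-wide bin (bins are right-closed by default)
most_common <- fine$breaks[which.max(fine$counts) + 1]
print(most_common)
print(coarse$counts)
```

With the wide bins you can no longer tell that people favor specific round numbers. That detail only shows up at a bin width of 1.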
qplot(x=friend_count, data=pf, binwidth = 25)+
scale_x_continuous(limits = c(0,1000), breaks = seq(0,1000,50))
## Warning: Removed 2951 rows containing non-finite values (stat_bin).
# What code would you add to facet the histogram by gender?
# Add it to the code below.
qplot(x = friend_count, data = pf, binwidth = 50) +
scale_x_continuous(limits = c(0, 1000),
breaks = seq(0, 1000, 50))+
facet_grid(.~gender)
## Warning: Removed 2951 rows containing non-finite values (stat_bin).
ggplot(aes(x = friend_count), data = pf) +
geom_histogram() +
scale_x_continuous(limits = c(0, 1000), breaks = seq(0, 1000, 50)) +
facet_wrap(~gender)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 2951 rows containing non-finite values (stat_bin).
qplot(x = friend_count, data = subset(pf, !is.na(gender)), binwidth = 50)+
scale_x_continuous(limits = c(0, 1000),
breaks = seq(0, 1000, 50))+
facet_grid(.~gender)
## Warning: Removed 2949 rows containing non-finite values (stat_bin).
table(pf$gender)
##
## female male
## 40254 58574
by(pf$friend_count, pf$gender, summary)
## pf$gender: female
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 37 96 242 244 4923
## --------------------------------------------------------
## pf$gender: male
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 27 74 165 182 4917
After you run the above code, try to find the answers for the following questions.
Women have more friends on average (mean 242 versus 165).
The median friend count for women exceeds that for men by 96 - 74 = 22.
For both groups the mean is far above the median. A small number of users have very many friends compared with the majority, and they pull the mean upward, so the mean overstates how many friends a typical user has. Because the distribution is strongly right-skewed rather than normal, the median is the more representative summary of friend count.
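The pull of a few extreme values on the mean is easy to reproduce. The friend counts in this sketch are invented, and the cutoff of 1000 is an arbitrary choice for the illustration.

```r
# Invented friend counts: most users have few friends, one has very many
friends <- c(5, 12, 20, 31, 45, 60, 82, 110, 150, 4900)

# The single large value pulls the mean far above the typical user
mean_friends <- mean(friends)
median_friends <- median(friends)

# Dropping the extreme value changes the mean far more than the median
mean_trimmed <- mean(friends[friends < 1000])
median_trimmed <- median(friends[friends < 1000])

print(c(mean = mean_friends, median = median_friends))
print(c(mean = mean_trimmed, median = median_trimmed))
```

The mean moves by hundreds when one user is removed, while the median shifts only slightly. That stability is why the median is preferred for long-tailed data.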
Notes:
qplot (data=pf, x = tenure, binwidth = 50, color = I('black'), fill = I('#099009'))
## Warning: Removed 2 rows containing non-finite values (stat_bin).
#ggplot syntax for the same histogram (note the different binwidth and fill color used here):
ggplot(aes(x = tenure), data = pf) +
geom_histogram(binwidth = 30, color = 'black', fill = '#099DD9')
## Warning: Removed 2 rows containing non-finite values (stat_bin).
#Equivalent ggplot syntax with tenure in years. Notice that you have to adjust the binwidth to match the new units. Setting breaks also makes the histogram clearer.
ggplot(aes(x = tenure/365), data = pf) +
geom_histogram(binwidth = .25, color = 'black', fill = '#336600')+
scale_x_continuous(breaks = seq(1,7,1), limits = c(0,7))
## Warning: Removed 26 rows containing non-finite values (stat_bin).
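When the x variable changes units, the bin width has to be rescaled with it. This sketch works through the arithmetic. The tenure values are invented, and 365 days per year ignores leap years, just as the plot code does.

```r
# Invented tenure values in days
tenure_days <- c(30, 365, 400, 730, 1095, 2555)

# Convert to years (ignoring leap years, as in the plot code)
tenure_years <- tenure_days / 365

# A bin width of 0.25 years corresponds to 91.25 days, about one quarter
binwidth_years <- 0.25
binwidth_days <- binwidth_years * 365

# Number of bins needed to cover 0 to 7 years at this width
n_bins <- (7 - 0) / binwidth_years

print(binwidth_days)
print(n_bins)
```

Keeping binwidth = 30 after dividing tenure by 365 would put every user into a single bin, because 30 would then mean 30 years.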
You can label the x and y axes as needed using the xlab and ylab parameters.
qplot (data=pf, x = tenure/365,
xlab = 'Number of Years Using Facebook',
ylab = 'Number of Users',
color = I('black'), fill = I('#F79420')) +
scale_x_continuous(breaks = seq(1,7,1), lim = c(0,7))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 26 rows containing non-finite values (stat_bin).
qplot (data=pf, x = age,
xlab = 'Age',
ylab = 'Number of Users',
color = I('black'), fill = I('#F79420'))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
In this case, a binwidth of 1 would be more appropriate, because age is recorded in whole years.
#Equivalent ggplot syntax:
ggplot(aes(x = age), data = pf) +
geom_histogram(binwidth = 1, color = 'black', fill = '#5760AB') +
scale_x_continuous(breaks = seq(0, 113, 5))
Notice that there are no Facebook users under 13 years old. Makes sense? Yes, because you have to be at least 13 to open a Facebook account.
And do you see the peaks in the mid-20s? They show that most users are in their mid-20s. There are more than 5,000 users aged 18.
Overall, the largest share of Facebook users falls between ages 16 and 25.
By the way, isn’t it weird that some users are more than 100 years old? Funny, huh?
“Over-dispersed” is always relative to some particular posited distribution. For example, data might be over-dispersed compared with a Poisson distribution with that mean.
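One way to check over-dispersion against a Poisson model is to compare the variance with the mean, since a Poisson distribution has the two equal. The sketch below uses simulated data only. The seed, sample size, and the negative binomial size parameter are arbitrary choices for illustration.

```r
set.seed(42)  # fixed seed so the simulated draws are reproducible

n <- 10000
mu <- 50

# Poisson draws: variance is expected to be close to the mean
pois <- rpois(n, lambda = mu)

# Negative binomial draws with the same mean but extra spread
# (size = 1 is an arbitrary choice for illustration)
nb <- rnbinom(n, size = 1, mu = mu)

# Dispersion index: variance divided by mean (about 1 for Poisson)
dispersion_pois <- var(pois) / mean(pois)
dispersion_nb <- var(nb) / mean(nb)

print(dispersion_pois)
print(dispersion_nb)
```

A ratio well above 1 suggests the data are over-dispersed relative to a Poisson distribution with that mean.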
summary(pf$friend_count)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 31.0 82.0 196.4 206.0 4923.0
summary(log10(pf$friend_count + 1))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 1.505 1.919 1.868 2.316 3.692
summary(sqrt(pf$friend_count))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 5.568 9.055 11.090 14.350 70.160
p1 <- qplot(data = pf, x = friend_count)
p2 <- qplot(data = pf, x = log10(friend_count + 1),
color = I('black'), fill = I('#336600'))
p3 <- qplot(data = pf, x = sqrt(friend_count),
color = I('black'), fill = I('#5760AB'))
grid.arrange(p1, p2, p3, ncol=1)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Now, notice that the log10-transformed histogram looks much closer to a normal distribution than the raw counts do.
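The + 1 inside log10(friend_count + 1) matters because users with zero friends would otherwise map to -Inf. Presumably that is also why the plots further down that take the log without the shift report removed non-finite rows. Here is a small sketch with invented counts:

```r
# Invented friend counts, including a user with zero friends
friend_count <- c(0, 9, 99, 999)

# log10(0) is -Inf, which a histogram cannot place on the axis
raw_log <- log10(friend_count)

# Adding 1 first keeps zero-friend users in the plot: log10(0 + 1) = 0
shifted_log <- log10(friend_count + 1)

# The square root is defined at zero, so no shift is needed
root <- sqrt(friend_count)

print(raw_log)
print(shifted_log)
```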
logscale <- qplot(data = pf, x = log10(friend_count))
#ggplot syntax that log-scales the x axis instead of transforming the variable:
countscale <- ggplot(aes(x = friend_count), data = pf)+
geom_histogram()+
scale_x_log10()
grid.arrange(logscale, countscale, ncol=1)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 1962 rows containing non-finite values (stat_bin).
## Warning: Transformation introduced infinite values in continuous x-axis
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 1962 rows containing non-finite values (stat_bin).
#OR
qplot(data = pf, x = friend_count)+
scale_x_log10()
## Warning: Transformation introduced infinite values in continuous x-axis
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 1962 rows containing non-finite values (stat_bin).
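Keeping the actual counts in the ggplot call and adding a scale layer such as scale_x_log10() is the better practice, because the axis is then labeled in the original units (friend counts) rather than in log units. You can add a scaling layer to any plot that you create.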
qplot(x= friend_count, data = subset(pf, !is.na(gender)), binwidth = 10)+
scale_x_continuous(lim = c(0,1000), breaks = seq(0,1000,50))+
facet_wrap(~gender)
## Warning: Removed 2949 rows containing non-finite values (stat_bin).
# Creating Frequency Polygons
qplot(x= friend_count, y = ..count../sum(..count..),
data = subset(pf, !is.na(gender)),
xlab = 'Friend Count',
ylab = 'Proportion of Users with the Friend Count',
binwidth = 10, geom = 'freqpoly', color = gender)+
scale_x_continuous(lim = c(0,1000), breaks = seq(0,1000,50))
## Warning: Removed 2949 rows containing non-finite values (stat_bin).
## Warning: Removed 4 rows containing missing values (geom_path).
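It helps to be clear about which total the proportions are divided by. To my understanding, sum(..count..) in the call above adds up counts across both genders, so the y axis shows shares of all users rather than shares within each gender. Treat that as an assumption to verify. The base-R sketch below shows the two kinds of proportion on invented, pre-binned data.

```r
# Invented data: friend counts already assigned to bins, with gender
toy <- data.frame(
  gender = c("female", "female", "female", "male", "male",
             "male", "male", "male"),
  bin    = c("0-10", "0-10", "10-20", "0-10", "10-20",
             "10-20", "10-20", "10-20")
)

counts <- table(toy$gender, toy$bin)

# Share of ALL users in each gender/bin cell (cells sum to 1 overall)
prop_overall <- prop.table(counts)

# Share WITHIN each gender (each row sums to 1)
prop_within <- prop.table(counts, margin = 1)

print(prop_overall)
print(prop_within)
```

Within-group proportions are usually the fairer comparison when the groups differ in size, as they do here (40254 women versus 58574 men).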
Notes:
qplot(x= www_likes, data = subset(pf, !is.na(gender)),
xlab = 'likes on the web',
geom = 'freqpoly', color = gender)+
scale_x_continuous()+
scale_x_log10()
## Scale for 'x' is already present. Adding another scale for 'x', which
## will replace the existing scale.
## Warning: Transformation introduced infinite values in continuous x-axis
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 60935 rows containing non-finite values (stat_bin).
by(pf$www_likes, pf$gender, sum)
## pf$gender: female
## [1] 3507665
## --------------------------------------------------------
## pf$gender: male
## [1] 1430175
What’s the www_likes count for males? 1,430,175.
Which gender has more www_likes? Females, with 3,507,665.
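The by() call applies a function to one variable within each level of another. The sketch below runs it on a tiny invented data frame and shows tapply() as an alternative that returns the same totals in a more compact form.

```r
# Invented likes per user, with one missing gender
toy <- data.frame(
  gender    = c("female", "female", "male", "male", NA),
  www_likes = c(10, 30, 5, 15, 100)
)

# by() applies sum() to www_likes within each gender group
likes_by <- by(toy$www_likes, toy$gender, sum)

# tapply() computes the same group totals as a compact named result
likes_tapply <- tapply(toy$www_likes, toy$gender, sum)

print(likes_by)
print(likes_tapply)
```

Note that the row with a missing gender is left out of both results, so the group totals do not add up to the overall total.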
Notes:
qplot(x = gender, y = friend_count, data = subset(pf, !is.na(gender)),
geom = 'boxplot')+
scale_y_continuous(lim = c(0,1000))
## Warning: Removed 2949 rows containing non-finite values (stat_boxplot).
#Alternative
qplot(x = gender, y = friend_count, data = subset(pf, !is.na(gender)),
geom = 'boxplot', ylim = c(0,1000))
## Warning: Removed 2949 rows containing non-finite values (stat_boxplot).
#If you use ylim or a scale_y_continuous layer, data points outside the limits are removed before the boxplot statistics are calculated. A better way to do this is to use a coord_cartesian layer, which only zooms the view.
The black line in the middle of the box is the median (the 50th percentile of friend_count). The lower boundary of the box is the 1st quartile and the upper boundary is the 3rd quartile, so the box spans the middle 50% of the data.
Remember these from your statistics class?
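The difference between removing points and zooming can be sketched in base R with boxplot.stats(), which returns the five numbers a boxplot is drawn from. The friend counts are invented, and ggplot2 may compute the quartiles slightly differently, so this illustrates the effect rather than reproducing ggplot2's internals.

```r
# Invented friend counts with a long upper tail
friends <- c(10, 40, 80, 120, 200, 400, 900, 1500, 3000, 4900)

# Filtering first (what ylim / scale_y_continuous limits amount to):
# values above 1000 are dropped BEFORE the quartiles are computed
stats_filtered <- boxplot.stats(friends[friends <= 1000])$stats

# Zooming only (the coord_cartesian idea): quartiles use all the data,
# and the limit would merely restrict what is visible
stats_full <- boxplot.stats(friends)$stats

# Element 3 of $stats is the median; elements 2 and 4 are the hinges
median_filtered <- stats_filtered[3]
median_full <- stats_full[3]

print(median_filtered)
print(median_full)
```

The filtered median is much lower than the true one, which is why the box drawn with ylim does not match the output of by(pf$friend_count, pf$gender, summary).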
#Alternative with coord cartesian layer
qplot(x = gender, y = friend_count, data = subset(pf, !is.na(gender)),
geom = 'boxplot')+
coord_cartesian(ylim = c(0,1000))
#Adjust the ylim for a clearer picture
qplot(x = gender, y = friend_count, data = subset(pf, !is.na(gender)),
geom = 'boxplot')+
coord_cartesian(ylim = c(0,250))
by(pf$friend_count, pf$gender, summary)
## pf$gender: female
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 37 96 242 244 4923
## --------------------------------------------------------
## pf$gender: male
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 27 74 165 182 4917
Women: the median of friendships_initiated is slightly higher for women (49) than for men (44).
qplot(x = gender, y = friendships_initiated,
data = subset(pf, !is.na(gender)),
geom = 'boxplot')+
coord_cartesian(ylim = c(0,250))
by(pf$friendships_initiated, pf$gender, summary)
## pf$gender: female
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 19.0 49.0 113.9 124.8 3654.0
## --------------------------------------------------------
## pf$gender: male
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 15.0 44.0 103.1 111.0 4144.0
summary(pf$mobile_likes)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 0.0 4.0 106.1 46.0 25110.0
summary(pf$mobile_likes > 0)
## Mode FALSE TRUE NA's
## logical 35056 63947 0
pf$mobile_check_in <- NA
pf$mobile_check_in <- ifelse(pf$mobile_likes > 0, 1,0)
pf$mobile_check_in <- factor(pf$mobile_check_in)
summary(pf$mobile_check_in)
## 0 1
## 35056 63947
What percent of users check in using mobile? About 65%, as computed below.
summary(pf$mobile_check_in)
## 0 1
## 35056 63947
sum(pf$mobile_check_in == 1)/ length(pf$mobile_check_in)
## [1] 0.6459097
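The same proportion can be computed more directly, because the mean of a logical vector is the share of TRUE values. Here is a sketch on invented values (the vector below is not taken from pf):

```r
# Invented mobile_likes values for eight users
mobile_likes <- c(0, 0, 3, 12, 0, 7, 1, 50)

# TRUE/FALSE flag for "has at least one mobile like"
checked_in <- mobile_likes > 0

# The mean of a logical vector is the proportion of TRUE values
share_mobile <- mean(checked_in)

# Same result as the count-over-length form
share_mobile_alt <- sum(checked_in) / length(checked_in)

print(share_mobile)
```

On the real data, mean(pf$mobile_likes > 0) should give the same figure without creating the extra factor column, assuming mobile_likes has no missing values (the summary above reports zero NA's).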
Reflection:
What did you learn so far?