Notes: In this lesson, we explore a pseudo-Facebook data set and practice data analysis on it. We first load the data into R and check its variable names.
setwd("~/Desktop/Udacity/Data_analysis_with_R/3_Explore_one_variable")
pf <- read.csv('pseudo_facebook.tsv', sep='\t')
names(pf)
## [1] "userid" "age"
## [3] "dob_day" "dob_year"
## [5] "dob_month" "gender"
## [7] "tenure" "friend_count"
## [9] "friendships_initiated" "likes"
## [11] "likes_received" "mobile_likes"
## [13] "mobile_likes_received" "www_likes"
## [15] "www_likes_received"
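As a minimal sketch of the loading step, the following round-trips a tiny made-up table through a tab-separated file. The three rows and the temporary file are assumptions for illustration only (the real pseudo_facebook.tsv is not reproduced here). read.delim is base R's tab-separated reader and should give the same result as read.csv with sep='\t'.

```r
# Toy stand-in for pseudo_facebook.tsv (hypothetical rows, not real data)
toy <- data.frame(userid = c(1, 2, 3),
                  age = c(14, 25, 40),
                  gender = c("male", "female", NA))

# Write it out as a tab-separated file, then read it back two ways
path <- tempfile(fileext = ".tsv")
write.table(toy, path, sep = "\t", row.names = FALSE, quote = FALSE)

pf_a <- read.csv(path, sep = "\t")
pf_b <- read.delim(path)   # read.delim defaults to sep = "\t"

names(pf_a)                # column names, as with names(pf)
str(pf_a)                  # column types and first values
identical(pf_a, pf_b)      # both readers produce the same data frame
```

str is often a useful companion to names because it also shows each column's type.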
Notes: We begin by creating a simple histogram of the day of the month on which each Facebook user was born.
For more information on reading histograms in R, refer to http://flowingdata.com/2014/02/27/how-to-read-histograms-and-use-them-in-r/.
#install.packages('ggplot2')
library(ggplot2)
ggplot(aes(x=dob_day), data=pf) +
geom_bar() +
scale_x_continuous(breaks=1:31)
#Alternative Method using qplot
#qplot(x = dob_day, data = pf, geom="bar") +
# scale_x_discrete(limits=as.character(1:31))
Warning: In earlier versions of ggplot2, geom_histogram and geom_bar were the same, one an alias of the other. Now geom_histogram is for continuous data (it bins the values) and geom_bar is for discrete data. For more on this and other updates to the ggplot2 package, see the explanation on GitHub: https://github.com/hadley/ggplot2/issues/1465.
Response: The most obvious feature of the histogram is the huge number of birthdays on the 1st. The 31st has the fewest birthdays, which is expected since not every month has 31 days.
Notes: While working at Facebook, Moira studied perceived audience size: how many people a user thinks saw a post compared with how many actually did. For example, if I posted something, how many people would I think viewed my post? How many people actually viewed it (for at least one second)?
Think about a time when you posted a specific message or shared a photo on Facebook. What was it? Response: Graduating from Berkeley.
How many of your friends do you think saw that post? Response: 400
Think about what percent of your friends on Facebook see any posts or comments that you make in a month. What percent do you think that is? Response: 60%
Notes: People tend to underestimate the size of the audience that actually sees their posts on Facebook.
Notes: We can break the histogram into 12 separate graphs (one for each month). facet_wrap takes a formula of the form ~variable, where the variable is the one we split the data over (here, dob_month). Since we set ncol = 3, the panels are displayed in three columns.
ggplot(aes(x=dob_day), data=pf) +
geom_bar() +
facet_wrap(~dob_month, ncol=3) +
scale_x_continuous(breaks=1:31)
facet_grid is a different way to facet the data. The two functions take formulas of different shapes: facet_wrap(~variable) splits on a single variable and wraps the panels, while facet_grid(vertical ~ horizontal) lays the panels out in a grid, with one variable splitting the rows and another splitting the columns.
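The two faceting styles can be sketched side by side. This example uses mtcars, a data set that ships with R, as a stand-in for pf (an assumption for illustration; the variables cyl and am have nothing to do with the Facebook data).

```r
library(ggplot2)

# mtcars stands in for pf here; any data frame with grouping columns works
base_plot <- ggplot(mtcars, aes(x = mpg)) +
  geom_histogram(binwidth = 5)

# facet_wrap: split on one variable, panels wrapped into 3 columns
p_wrap <- base_plot + facet_wrap(~cyl, ncol = 3)

# facet_grid: rows split by am, columns split by cyl
p_grid <- base_plot + facet_grid(am ~ cyl)

# print(p_wrap); print(p_grid)  # uncomment to draw the plots
```

facet_wrap tends to suit a single variable with many levels, while facet_grid suits crossing two variables.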
Response: The other months look roughly alike. The huge spike on the 1st occurs only in January (i.e. the first day of the year). One possibility is that January 1st is the default value on Facebook. Also, to protect their privacy, people might simply choose the first of the month rather than enter their real birthday.
Notes: Detecting and dealing with outliers is an important part of any analysis. An outlier can be accurate data about an extreme case; other times, it is simply incorrect data.
Here is an important note on the censoring of outliers in data that is released to the public: “In econometrics and statistics, a TOP-CODED data observation is one for which data points whose values are above an upper bound are censored.
Survey data are often topcoded before release to the public to preserve the anonymity of respondents. For example, if a survey answer reported a respondent with self-identified wealth of $79 billion, it would not be anonymous because people would know there is a good chance the respondent was Bill Gates.”
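Top-coding can be sketched in one line of base R with pmin, which caps each value at an upper bound. The wealth values and the cap of 1 million are made up for illustration and are not taken from any survey.

```r
# Hypothetical wealth values in dollars; the last one is the extreme case
wealth <- c(40000, 85000, 120000, 2500000, 79e9)

# Top-code: anything above the cap is reported as the cap itself
cap <- 1e6   # illustrative threshold, an assumption
wealth_topcoded <- pmin(wealth, cap)

wealth_topcoded
sum(wealth > cap)   # number of censored observations
```

Values at or below the cap pass through unchanged, so only the identifying extremes are hidden.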
Notes: Moira’s plot came out with one huge bar (i.e. most people guessed that a small number of people saw their post) and one tiny one (i.e. one person who guessed that 10 million people saw her post).
So she cut that huge outlier out and looked at people who guessed a few thousand or below. The bulk of the data was in the smaller ranges (i.e. skewed with a long right tail).
Response: The outlier was bad data about an extreme case: the person had at most roughly 1 million friends, so there is no way she could have had 10 million viewers for her post.
In this case, we would use geom_histogram (instead of geom_bar) since friend_count is a continuous variable.
ggplot(aes(x = friend_count), data = pf) +
geom_histogram()
#equivalent code
#qplot(x=friend_count, data=pf)
Response: This plot is similar to Moira’s plot in that it is positively skewed (i.e. it has a long right tail). Most of the values are clumped close to zero, meaning that most users have far fewer than 1000 friends.
Notes: To limit the x-axis and look only at the bulk of our data (users with friend counts below 1000), we can use the xlim argument inside qplot, or add a scale_x_continuous layer with limits to either qplot or ggplot.
ggplot(aes(x = friend_count), data = pf) +
geom_histogram() +
scale_x_continuous(limits=c(0,1000))
#Similar plots using qplot function:
#qplot(x=friend_count, data=pf, xlim=c(0,1000))
#qplot(x=friend_count, data=pf) +
# scale_x_continuous(limits=c(0,1000))
Notes: When you adjust the binwidth, you can begin to see patterns that were not obvious before. People tend to guess round numbers such as 10, 30, 50, or 100 for how many people viewed their post, and rarely guess numbers in between.
Notes: Here, we try a binwidth of 25 and place an x-axis break every 50 units. For more information, read http://docs.ggplot2.org/current/scale_continuous.html. Notice that many users have fewer than 25 friends. These are probably new users.
ggplot(aes(x = friend_count), data = pf) +
geom_histogram(binwidth = 25) +
scale_x_continuous(limits=c(0,1000), breaks=seq(0,1000,50))
#qplot(x=friend_count, data=pf, binwidth=25) +
# scale_x_continuous(limits=c(0,1000), breaks=seq(0,1000,50))
What code would you add to facet the histogram by gender?
ggplot(aes(x = friend_count), data = pf) +
geom_histogram(binwidth = 10) +
scale_x_continuous(limits=c(0,1000), breaks=seq(0,1000,50)) +
facet_wrap(~gender)
#qplot(x = friend_count, data = pf, binwidth = 10) +
# scale_x_continuous(limits = c(0, 1000),
# breaks = seq(0, 1000, 50)) +
# facet_wrap(~gender)
Notes: There are two ways to omit the NA values. The first subsets on !is.na(gender) and removes only the rows where gender is missing. Be careful with the second, na.omit, because it removes every observation that has an NA in any variable, not just gender.
ggplot(aes(x = friend_count), data = subset(pf, !is.na(gender))) +
geom_histogram(binwidth = 10) +
scale_x_continuous(limits=c(0,1000), breaks=seq(0,1000,50)) +
facet_wrap(~gender)
#qplot(x = friend_count, data = subset(pf, !is.na(gender)), binwidth = 10) +
# scale_x_continuous(limits = c(0, 1000),
# breaks = seq(0, 1000, 50)) +
# facet_wrap(~gender)
ggplot(aes(x = friend_count), data = na.omit(pf)) +
geom_histogram(binwidth = 10) +
scale_x_continuous(limits=c(0,1000), breaks=seq(0,1000,50)) +
facet_wrap(~gender)
#qplot(x = friend_count, data = na.omit(pf), binwidth = 10) +
# scale_x_continuous(limits = c(0, 1000),
# breaks = seq(0, 1000, 50)) +
# facet_wrap(~gender)
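The difference between the two approaches is easy to see on a small table. The data frame here is hypothetical: it has one missing gender and one missing tenure.

```r
# Toy data frame (hypothetical values): one NA in gender, one NA in tenure
toy <- data.frame(gender = c("male", "female", NA, "female"),
                  tenure = c(100, NA, 250, 30),
                  friend_count = c(10, 200, 35, 80))

by_gender <- subset(toy, !is.na(gender))  # drops only rows with missing gender
complete  <- na.omit(toy)                 # drops rows with an NA in any column

nrow(by_gender)
nrow(complete)
```

The user with a missing tenure survives the subset but not na.omit, even though tenure plays no part in a plot of friend_count by gender.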
Notes: From the plots above, it is hard to tell which gender has more friends on average. We can use the table command to count the men and women in the sample, and the by command to summarize friend_count separately for each gender.
table(pf$gender)
##
## female male
## 40254 58574
by(pf$friend_count, pf$gender, summary)
## pf$gender: female
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 37 96 242 244 4923
## --------------------------------------------------------
## pf$gender: male
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 27 74 165 182 4917
Who on average has more friends: men or women? Response: women
What’s the difference between the median friend count for women and men? Response: 22
Why would the median be a better measure than the mean? Response: Because our data is skewed. The very large values on the right tails would pull the mean to the right. Therefore, the median would be a better measurement. The median is a more ROBUST statistic than the mean. The median is more resistant to change. “Robust statistics are statistics with good performance for data drawn from a wide range of probability distributions, especially for distributions that are not normal.”
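A small sketch of why the median is called robust: adding a single extreme value barely moves it, while the mean shifts a lot. The friend counts are made up for illustration.

```r
# Hypothetical friend counts with a right-skewed shape
friends <- c(5, 20, 40, 75, 90, 150, 300)
mean(friends)
median(friends)

# Add one extreme user and recompute
friends_plus <- c(friends, 4900)
mean(friends_plus)
median(friends_plus)
```

With these numbers the median moves from 75 to 82.5, while the mean jumps from about 97 to 697.5.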
Notes: Tenure is the number of days someone has been using Facebook. Here we plot tenure with custom colors to make the plot more presentable (see http://docs.ggplot2.org/0.9.2.1/theme.html). Setting binwidth=30 gives a finer view of the distribution; 30 makes sense since we are measuring days and 30 days is about a month.
ggplot(aes(x=tenure), data=pf) +
geom_histogram(color=I('black'), fill=I('#099DD9'))
ggplot(aes(x=tenure), data=pf) +
geom_histogram(color=I('black'), fill=I('#099DD9'), binwidth=30)
#qplot(x=tenure, data=pf, binwidth=30,
# color=I('black'), fill=I('#099DD9'))
How would you create a histogram of tenure by year? Use binwidth=365 so each bin represents a year of using Facebook. It looks like the bulk of the users have spent less than two and a half years on Facebook.
ggplot(aes(x=tenure), data=pf) +
geom_histogram(color=I('black'), fill=I('#099DD9'), binwidth=365)
#Could also divide tenure by 365, rather than change binwidth to 365
#Both give equivalent graphs:
#ggplot(aes(x=tenure/365), data=pf) +
# geom_histogram(color=I('black'), fill=I('#F79420'), binwidth=1)
ggplot(aes(x=tenure/365), data=pf) +
geom_histogram(color=I('black'), fill=I('#F79420'), binwidth=.25) +
scale_x_continuous(breaks=seq(1,7,1), limits=c(0,7)) +
xlab("tenure (in years)")
Notes: We always want plots to speak for themselves, so make sure to label them.
ggplot(aes(x=tenure/365), data=pf) +
geom_histogram(color=I('black'), fill=I('dark green'), binwidth=.25) +
scale_x_continuous(breaks=seq(1,7,1), limits=c(0,7)) +
labs(x="Number of years using Facebook", y='Number of users in sample')
Notes: Note, users must be at least 13 years of age to set up a Facebook account, which is why there is no data below 13.
min(pf$age)
## [1] 13
max(pf$age)
## [1] 113
ggplot(aes(x=age), data=pf) +
geom_histogram(color=I('black'), fill=I('#FF9999'), binwidth=1) +
scale_x_continuous(breaks=seq(10,115,5), limits=c(10,115)) +
labs(x="Age in years", y="Number of users in sample")
Response: The distribution is roughly bell shaped with a long right tail. The number of users increases from age 13, appears to peak around the age of 20, and begins to decrease after the age of 21. There are also large spikes (anomalies) above the age of 100, which are most likely fake ages reported by users.
Notes: Lada is interested in how information flows through networks (i.e. social networks). Memes tend to replicate themselves, especially when they contain text that says “repost” or “copy and paste”.
To analyze the moneybag meme, Lada plotted its occurrences over time. She saw several spikes, particularly in the months considered “lucky” because they had 5 Fridays, Saturdays, and Sundays. On a linear scale, using raw counts, the meme appears to disappear between the spikes. It probably never disappeared and was just floating around Facebook in low numbers.
To check this, one can use a log scale, on which the pattern is much more evident: we can see counts of size 10 while also seeing counts of 100,000. Even though there is a rapid decay of interest, it actually looks like it might be parallel. This was done in ggplot using a simple line geom, grouping by the particular meme variant, and rescaling the y-axis to one of the log versions.
Notes: Variables like friend count, likes, comments, and wall posts can be called ENGAGEMENT VARIABLES, and they have very long tails. Some users have 10 times or even 100 times the median value. Another way to say this is that some people have an ORDER OF MAGNITUDE more likes, clicks, or comments than other users. In statistics, we say that the data is OVERDISPERSED. It often helps to transform these values so that we can see standard deviations, or orders of magnitude; in effect, we are shortening the tail.
Ex: The histogram of friend count had a very long tail. We can transform the data using a log (natural log, log base 2, or log base 10). We could also use other functions, such as the square root. Doing so helps us see patterns more clearly without being distracted by the tail. A lot of common statistical techniques, like linear regression, are based on the assumption that variables have normal distributions. So if we plan to use linear regression or some other modelling technique, taking the log of this variable can turn its distribution into a normal distribution, or something that more closely resembles one.
Trying a log10 transformation, we get something unusual: negative infinity for both the minimum and the mean. Some of our users have a friend count of zero, and log10 of 0 is undefined (as a limit it tends to negative infinity, which R reports as -Inf). To avoid this, we add 1 to friend count before taking the log.
We can also use the sqrt transformation. The instructor mentions that log10 is an easier transformation to wrap his head around, since he is just comparing friend counts in orders of magnitude of 10, basically a 10-fold scale like the pH scale.
ggplot(aes(x=friend_count), data=pf) +
geom_histogram()
summary(pf$friend_count)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 31.0 82.0 196.4 206.0 4923.0
summary(log10(pf$friend_count))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -Inf 1.491 1.914 -Inf 2.314 3.692
summary(log10(pf$friend_count + 1))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 1.505 1.919 1.868 2.316 3.692
summary(sqrt(pf$friend_count))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 5.568 9.055 11.090 14.350 70.160
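The -Inf in the summaries comes entirely from the users with zero friends. A minimal sketch on four made-up values shows the effect and the two workarounds:

```r
# Hypothetical friend counts, including a user with zero friends
fc <- c(0, 9, 99, 999)

log10(fc)             # first element is -Inf
mean(log10(fc))       # one -Inf is enough to make the mean -Inf

log10(fc + 1)         # shifting by 1 maps 0 to 0 and keeps everything finite
mean(log10(fc + 1))

sqrt(fc)              # sqrt needs no shift: sqrt(0) is 0
```

The median and quartiles in the summary stay finite because they depend only on the order of the values, not on their sum.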
#Creating 3 plots in one screen
#install.packages('gridExtra')
library(gridExtra)
p1 <- ggplot(aes(x=friend_count), data=pf) +
geom_histogram()
p2 <- ggplot(aes(x=log10(friend_count+1)), data=pf) +
geom_histogram()
p3 <- ggplot(aes(x=sqrt(friend_count)), data=pf) +
geom_histogram()
grid.arrange(p1,p2,p3, ncol=1)
#Alternative plotting method (different x-axis than the one above):
p1 <- ggplot(aes(x=friend_count), data=pf) + geom_histogram()
p2 <- p1 + scale_x_log10()
p3 <- p1 + scale_x_sqrt()
grid.arrange(p1,p2,p3, ncol=1)
#Note, there is a slight difference here based on the x-axis labeling.
Resources: http://lightonphiri.org/blog/ggplot2-multiple-plots-in-one-graph-using-gridextra http://docs.ggplot2.org/current/scale_continuous.html https://en.wikipedia.org/wiki/Linear_regression#Assumptions http://www.r-statistics.com/2013/05/log-transformations-for-skewed-and-wide-distributions-from-practical-data-science-with-r/
Note: The difference between using the ggplot layer scale_x_log10 on a density plot of friend_count and plotting a density plot of log10(friend_count) is primarily axis labeling. Using scale_x_log10 labels the x-axis in the original units (here, friend counts) rather than in logs.
Notes: In general, the instructor thinks it is easier to think about actual counts, so he prefers to use scale_x_log10 as a layer.
ggplot(aes(x=friend_count), data=pf) +
geom_histogram() +
scale_x_log10()
So far we have seen how to examine a variable's distribution using histograms and how to check our hunches with visualizations and numerical summaries. Another type of plot that lets us compare distributions is the FREQUENCY POLYGON. It is similar to a histogram, but it draws a curve connecting the counts of the histogram's bins, which lets us see the shape and the peaks of the distribution in more detail.
Recall that our goal was to figure out which gender had more friends on average, and we couldn't tell from the following histograms, so we ran numerical summaries instead.
ggplot(aes(x=friend_count), data=subset(pf, !is.na(gender))) +
geom_histogram(binwidth=10) +
scale_x_continuous(lim=c(0,1000), breaks=seq(0,1000,50)) +
facet_wrap(~gender)
#Frequency Polygon
ggplot(aes(x=friend_count, colour=gender), data=subset(pf, !is.na(gender))) +
geom_freqpoly(binwidth=10) +
scale_x_continuous(lim=c(0,1000), breaks=seq(0,1000,50))
#qplot(x=friend_count, data=subset(pf, !is.na(gender)),
# binwidth=10, geom='freqpoly', color=gender) +
# scale_x_continuous(lim=c(0,1000), breaks=seq(0,1000,50))
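To see what a frequency polygon draws, here is a rough base R sketch that bins a few made-up friend counts and extracts the points the line would pass through. The exact bin alignment used by geom_freqpoly may differ; this only illustrates the idea.

```r
# Hypothetical friend counts
fc <- c(3, 8, 12, 15, 18, 22, 27, 31, 45, 48)

# Bin the data without drawing, using bins of width 10
h <- hist(fc, breaks = seq(0, 50, by = 10), plot = FALSE)

h$counts   # bar heights of the histogram
h$mids     # bin midpoints

# A frequency polygon joins the (midpoint, count) pairs with line segments
# plot(h$mids, h$counts, type = "l")  # uncomment to draw
```

Because each group becomes a single line rather than a set of bars, several groups can be overlaid on the same axes without hiding each other.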
The frequency polygon is good for comparing two distributions. However, it doesn’t really answer our question of who has more friends on average. Let’s change the y-axis to show proportions instead of raw counts. This involves some unusual syntax: we pass y = ..count../sum(..count..) as an aesthetic (to aes in ggplot, or as the y argument in qplot). Note that sum(..count..) sums across colors, so the values displayed are proportions of all users. To plot proportions within each group, you can try y = ..density.. instead. (Recent ggplot2 releases deprecate the ..count.. notation in favor of after_stat(count); the older form is kept here to match the course.)
ggplot(aes(x=friend_count, y=..count../sum(..count..), colour=gender), data=subset(pf, !is.na(gender))) +
geom_freqpoly(binwidth=10) +
labs(x='Friend Count', y='Proportion of Users with that friend count') +
scale_x_continuous(lim=c(0,1000), breaks=seq(0,1000,50))
#qplot(x=friend_count, y=..count../sum(..count..),
# data=subset(pf, !is.na(gender)),
# xlab='Friend Count',
# ylab='Proportion of Users with that friend count',
# binwidth=10, geom='freqpoly', color=gender) +
# scale_x_continuous(lim=c(0,1000), breaks=seq(0,1000,50))
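The distinction between proportions of all users and proportions within each gender can be sketched with plain tables. The ten users are hypothetical, and prop.table is used here as a base R stand-in for what the plot computes per bin.

```r
# Hypothetical data: friend counts for 4 women and 6 men
toy <- data.frame(
  gender = c(rep("female", 4), rep("male", 6)),
  friend_count = c(5, 15, 15, 25, 5, 5, 5, 15, 15, 25)
)

# Bin into widths of 10 and count users per gender and bin
toy$bin <- cut(toy$friend_count, breaks = seq(0, 30, by = 10))
counts <- table(toy$gender, toy$bin)

# Like count / sum(count): each cell is a share of ALL users
prop_total <- counts / sum(counts)

# Share WITHIN each gender: every row sums to 1
prop_within <- prop.table(counts, margin = 1)

prop_total
prop_within
```

When the groups differ in size, as men and women do in this sample, the within-group version is the fairer comparison.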
At first glance it may appear that males have higher friend counts than females, but what the plot shows is that a higher percentage of males have low friend counts. It is probably in the tail region of the graph that females overtake males. Try using limits or breaks to explore further and find where this occurs.
ggplot(aes(x=friend_count, y=..count../sum(..count..), colour=gender), data=subset(pf, !is.na(gender))) +
geom_freqpoly(binwidth=10) +
labs(x='Friend Count', y='Proportion of Users with that friend count') +
scale_x_continuous(lim=c(250,1000), breaks=seq(250,1000,50))
#qplot(x=friend_count, y=..count../sum(..count..),
# data=subset(pf, !is.na(gender)),
# xlab='Friend Count',
# ylab='Proportion of Users with that friend count',
# binwidth=10, geom='freqpoly', color=gender) +
# scale_x_continuous(lim=c(250,1000), breaks=seq(250,1000,50))
Quiz: Determine which gender makes more likes on the world wide web.
ggplot(aes(x=www_likes, color=gender), data=subset(pf, !is.na(gender))) +
geom_freqpoly() +
scale_x_continuous()
#qplot(x=www_likes, data=subset(pf, !is.na(gender)),
# geom='freqpoly', color=gender) +
# scale_x_continuous()
#Add a log transformation because of the long tail
#(a plot has a single x scale, so scale_x_log10 replaces scale_x_continuous)
ggplot(aes(x=www_likes, color=gender), data=subset(pf, !is.na(gender))) +
geom_freqpoly() +
scale_x_log10()
#qplot(x=www_likes, data=subset(pf, !is.na(gender)),
# geom='freqpoly', color=gender) +
# scale_x_log10()
Notes: Our above frequency plot still does not let us answer our question: who really has more likes, men or women? Let’s try a numerical summary instead.
male: 1,430,175 females: 3,507,665
by(pf$www_likes, pf$gender, sum)
## pf$gender: female
## [1] 3507665
## --------------------------------------------------------
## pf$gender: male
## [1] 1430175
Notes: Let’s generate boxplots so we can quickly see the distribution between genders. Notice we use the continuous variable as y and the categorical variable as x.
ggplot(aes(x=gender, y=friend_count), data=subset(pf, !is.na(gender)) ) +
geom_boxplot()
#qplot(x=gender, y=friend_count,
# data=subset(pf, !is.na(gender)),
# geom='boxplot')
Our boxplots are hard to read because there are so many outliers. A point is drawn as an outlier when it lies more than one and a half times the IQR beyond the edge of the box (the first or third quartile). Since there are so many outliers in these plots, let's adjust our code so that it focuses on the two boxes.
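The 1.5 times IQR rule can be made concrete with a few lines of base R. The friend counts are made up, and the quartiles come from quantile with its default settings, which may differ slightly from the hinges a particular boxplot function uses on small samples.

```r
# Hypothetical friend counts with two extreme values
fc <- c(10, 20, 30, 40, 50, 60, 70, 80, 500, 900)

q <- quantile(fc, c(0.25, 0.75))
iqr <- IQR(fc)

# Fences sit 1.5 * IQR beyond the quartiles (the edges of the box)
lower_fence <- q[[1]] - 1.5 * iqr
upper_fence <- q[[2]] + 1.5 * iqr

fc[fc < lower_fence | fc > upper_fence]   # points drawn as individual dots
```

In a long-tailed variable such as friend_count, many legitimate values fall above the upper fence, which is why the plots show so many dots.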
#Two methods
ggplot(aes(x=gender, y=friend_count), data=subset(pf, !is.na(gender)) ) +
geom_boxplot() +
scale_y_continuous(lim=c(0,1000))
#qplot(x=gender, y=friend_count,
# data=subset(pf, !is.na(gender)),
# geom='boxplot') +
# scale_y_continuous(lim=c(0,1000))
#qplot(x=gender, y=friend_count,
# data=subset(pf, !is.na(gender)),
# geom='boxplot', ylim=c(0,1000))
Our new boxplots might not be fully accurate: when limits are set through scale_y_continuous or ylim, the data points outside the limits are removed before the statistics are calculated, so our boxes might get a bit shifted (i.e. the median and quartiles are calculated without the data excluded via the lim argument). A better way is to use the COORD_CARTESIAN LAYER to set the y limits instead.
#coord_cartesian method
ggplot(aes(x=gender, y=friend_count), data=subset(pf, !is.na(gender)) ) +
geom_boxplot() +
coord_cartesian(ylim=c(0,1000))
#qplot(x=gender, y=friend_count,
# data=subset(pf, !is.na(gender)),
# geom='boxplot') +
# coord_cartesian(ylim=c(0,1000))
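The effect of the two approaches on the statistics can be imitated in base R: setting scale limits acts like filtering before computing the median, while coord_cartesian leaves the data alone. The six values are hypothetical.

```r
# Hypothetical friend counts; two values lie above the limit of 1000
fc <- c(50, 100, 150, 200, 1200, 3000)

# Scale limits behave like filtering first:
# out-of-range values are dropped before the box statistics are computed
median_with_scale_limits <- median(fc[fc >= 0 & fc <= 1000])

# coord_cartesian only zooms the view: statistics use all the data
median_with_coord_zoom <- median(fc)

median_with_scale_limits
median_with_coord_zoom
```

Here the filtered median is 125 and the full-data median is 175, so the box drawn with scale limits would sit lower than the one the numeric summary describes.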
Notes: It looks like females have slightly more friends than males, because the median is slightly higher. Let's zoom in to take a closer look. Zooming in confirms that the female median is slightly higher than the male median.
ggplot(aes(x=gender, y=friend_count), data=subset(pf, !is.na(gender)) ) +
geom_boxplot() +
coord_cartesian(ylim=c(0,250))
#qplot(x=gender, y=friend_count,
# data=subset(pf, !is.na(gender)),
# geom='boxplot') +
# coord_cartesian(ylim=c(0,250))
by(pf$friend_count, pf$gender, summary)
## pf$gender: female
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 37 96 242 244 4923
## --------------------------------------------------------
## pf$gender: male
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 27 74 165 182 4917
Note: If we had not used coord_cartesian, then our numeric summary would not have exactly matched our plot.
Response: women
Response: We can look at various plots or numeric summaries.
by(pf$friendships_initiated, pf$gender, summary)
## pf$gender: female
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 19.0 49.0 113.9 124.8 3654.0
## --------------------------------------------------------
## pf$gender: male
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 15.0 44.0 103.1 111.0 4144.0
ggplot(aes(x=gender, y=friendships_initiated),
data=subset(pf, !is.na(gender)) ) +
geom_boxplot() +
coord_cartesian(ylim=c(0,150))
#qplot(x=gender, y=friendships_initiated,
# data=subset(pf, !is.na(gender)),
# geom='boxplot') +
# coord_cartesian(ylim=c(0,150))
Response: Zooming into the boxplots with coord_cartesian limits from 0 to 150, we can better see that the median number of friendships initiated by females (49) is slightly higher than for males (44). The 75th percentile for females (124.8) is also larger than the 75th percentile for males (111).
Notes: There are other ways to transform a variable besides using a log or sqrt. You often want to convert a variable that has a lot of 0 values to a new binary variable that is only true or false. This is helpful because we may want to know whether users have used a certain feature at all, rather than how many times they have used it. For example, it may not matter how many times a person checked in using a mobile device, but whether the person has ever used mobile check-in. Using the summary, we see that the first quartile is 0 and the median is only 4, meaning that we have a lot of zeroes in our dataset.
summary(pf$mobile_likes)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 0.0 4.0 106.1 46.0 25110.0
summary(pf$mobile_likes > 0)
## Mode FALSE TRUE NA's
## logical 35056 63947 0
pf$mobile_check_in <- NA
pf$mobile_check_in <- ifelse(pf$mobile_likes > 0, 1, 0)
pf$mobile_check_in <- factor(pf$mobile_check_in)
summary(pf$mobile_check_in)
## 0 1
## 35056 63947
#What percent of check in using mobile?
sum(pf$mobile_check_in == 1) / length(pf$mobile_check_in) #0.6459097
## [1] 0.6459097
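The same proportion can be obtained more directly, because R treats TRUE as 1 when averaging a logical vector. A minimal sketch on five made-up like counts:

```r
# Hypothetical like counts for five users
mobile_likes <- c(0, 0, 3, 12, 250)

used_mobile <- mobile_likes > 0   # TRUE if the user has any mobile likes
table(used_mobile)

# TRUE counts as 1, so the mean of a logical vector is the proportion of TRUEs
mean(used_mobile)
```

On the real data this would be mean(pf$mobile_likes > 0), which avoids building the intermediate factor when only the proportion is needed.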
Response: So ~65% of the Facebook users in this sample check in using mobile, which is well over half. It would make a lot of sense to continue developing the mobile experience, at least based on this sample. It is important to think not only about what kind of data you are looking at, but also about what transformations you can make to the variables themselves. Sometimes you want raw counts and other times a binary variable is preferred.
Reflection: A lot of this lesson was review for me since I have worked with R a lot in the past. However, there were some key things that I did learn from this lesson. I learned a lot about the ggplot function, which creates graphs that are far more aesthetically pleasing than the basic plots found with the default functions from R. I also learned different ways to deal with long tail distributions and the appropriate way to transform data to better examine trends. I also really enjoyed the tutorial about how one should play around with bin sizes and overall how to scale a graph to extract as much information as possible from them.