https://flowingdata.com/2014/02/27/how-to-read-histograms-and-use-them-in-r/
Notes: set your working dir.
Notes: Read the data in. Note that this is a tab-separated file, so pass sep = '\t'.
pf <- read.csv('pseudo_facebook.tsv', sep='\t')
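A minimal sketch of reading a tab-separated file, using a small made-up file written to a temporary path (the column names and values here are assumptions for illustration, not the real pseudo_facebook.tsv). It shows that read.csv() with sep = "\t" and read.delim() parse such a file the same way.

```r
# Write a tiny tab-separated file to a temporary location (made-up data).
path <- tempfile(fileext = ".tsv")
writeLines(c("userid\tage\tgender", "1\t14\tmale", "2\t21\tfemale"), path)

# Two ways to read it: read.csv() with an explicit separator,
# or read.delim(), whose default separator is already a tab.
a <- read.csv(path, sep = "\t")
b <- read.delim(path)

identical(a, b)  # compare the two data frames
dim(a)           # rows and columns that were parsed
```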
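Notes: You can add scale_x_continuous() to set the break points, or use the ggplot() syntax. Calling scale_x_continuous() with no arguments leaves the default breaks unchanged, which may be why it seemed not to work for me; pass breaks = 1:31 to get a tick for every day.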
#install.packages('ggplot2')
library(ggplot2)
qplot(x = dob_day, data = pf) +
scale_x_continuous(breaks = 1:31)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(aes(x = dob_day), data = pf) +
geom_histogram(binwidth = 1) +
scale_x_continuous(breaks = 1:31)
Response: The spike on day 1
Notes: The audience is higher than the perceived number.
Notes:
Response:0
Response:
Notes:
Notes: There are two types of faceting. In this case the histogram above is divided into 12 histograms, one per month, using facet_wrap().
facet_wrap(~ variable) # one variable
facet_grid(vertical ~ horizontal) # two or more variables
qplot(x = dob_day, data = pf) +
scale_x_continuous(breaks = 1:31) +
facet_wrap(~dob_month, ncol = 3)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(data = pf, aes(x = dob_day)) + geom_histogram(binwidth = 1) + scale_x_continuous(breaks = 1:31) + facet_wrap(~dob_month)
Response:
An extreme case may be real, like someone who tweets 1000 times a day, or it may represent bad data or a limitation of the data. Extreme values may also be replaced with normal values, as in census data where extreme salaries are brought down to normal values.
Knowing the type of outlier is important in order to know how to exclude it: bad data about a non-extreme case, bad data about an extreme case, or good data about an extreme case.
Notes: The plot was terrible because one person guessed a perceived audience in the millions. She had to adjust the axes first and foremost so that she could see the bulk of the data.
#### Which case do you think applies to Moira’s outlier?
Response:
Notes:
qplot(x = friend_count, data = pf)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Response: It is also a long-tailed histogram, with much of the data squished on the left.
Notes: Therefore we have to limit the axes to see the data clearly. Say we want to see only users with at most 1000 friends. Pass xlim as a vector of the lower and upper limits.
qplot(x = friend_count, data = pf, xlim = c(0,1000))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 2951 rows containing non-finite values (stat_bin).
#alternatively
qplot(x = friend_count, data = pf) +
scale_x_continuous(limits = c(0, 1000))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 2951 rows containing non-finite values (stat_bin).
Notes: Set the bin width and the axis breaks: qplot(x = friend_count, data = pf, binwidth = 25) + scale_x_continuous(limits = c(0, 1000), breaks = seq(0, 1000, 50))
Notes: Setting binwidth to 1 lets you see the individual perceptions that come up as spikes.
qplot(x = friend_count, data = pf, binwidth = 25) +
scale_x_continuous(limits = c(0, 1000), breaks = seq(0, 1000, 50)) +
facet_wrap(~gender)
## Warning: Removed 2951 rows containing non-finite values (stat_bin).
https://www.statmethods.net/input/missingdata.html
Notes: We add the condition !is.na(gender) to omit only the rows where gender is NA.
#qplot(x = friend_count, data = subset(pf, !is.na(gender)),binwidth = 10) +
# scale_x_continuous(limits = c(0, 1000), breaks = seq(0, 1000, 50)) +
# facet_wrap(~gender)
# Alternatively. Note: be careful with na.omit(), because it also removes rows with NA
# in variables that are not related to the variable you want filtered.
qplot(x = friend_count, data = na.omit(pf), binwidth = 10) +
scale_x_continuous(limits = c(0, 1000), breaks = seq(0, 1000, 50)) +
facet_wrap(~gender)
## Warning: Removed 2949 rows containing non-finite values (stat_bin).
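The difference between the two approaches above can be sketched on a small made-up data frame (the values are assumptions for illustration only). subset() with !is.na(gender) drops rows only where gender is missing, while na.omit() drops any row with an NA in any column.

```r
# Made-up data: one row is missing gender, another is missing tenure.
toy <- data.frame(
  friend_count = c(10, 250, 40, 900),
  gender       = c("male", NA, "female", "male"),
  tenure       = c(300, 120, NA, 45)
)

only_gender  <- subset(toy, !is.na(gender))  # drops rows with missing gender only
all_complete <- na.omit(toy)                 # drops rows with NA in any column

nrow(only_gender)   # 3
nrow(all_complete)  # 2
```

The row with a missing tenure survives the subset() call but not na.omit(), which is the risk the comment above warns about.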
Notes:
table(pf$gender)
##
## female male
## 40254 58574
by(pf$friend_count, pf$gender, summary)
## pf$gender: female
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 37 96 242 244 4923
## --------------------------------------------------------
## pf$gender: male
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 27 74 165 182 4917
Response: Women
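Response: 22 (the female median of 96 minus the male median of 74)
#### Why would the median be a better measure than the mean?
Response: The median is more robust: a few users with very large friend counts pull the mean up but barely move the median.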
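A small sketch of why the median is called robust, using made-up friend counts (not values from the data set, apart from reusing 4923 as an example of a very large count). Adding one extreme value shifts the mean a lot and the median only a little.

```r
friends      <- c(20, 35, 50, 74, 96, 120, 150)  # made-up friend counts
with_outlier <- c(friends, 4923)                 # add one very large value

mean(friends)         # about 77.9
median(friends)       # 74
mean(with_outlier)    # 683.5
median(with_outlier)  # 85
```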
Notes:
qplot(x = tenure, data = pf, binwidth = 30,
color = I('black'), fill = I('#099DD9'))
## Warning: Removed 2 rows containing non-finite values (stat_bin).
The parameter color determines the color outline of objects in a plot.
The parameter fill determines the color of the area inside objects in a plot.
You might notice how the color black and the hex code color #099DD9 (a shade of blue) are wrapped inside I(). The I() function stands for ‘as is’ and tells qplot to use them as colors.
qplot(x = tenure/365, data = pf, binwidth = 0.25,
color = I('black'), fill = I('#099DD9')) +
scale_x_continuous(breaks = seq(1, 7, 1), limits = c(0, 7))
## Warning: Removed 26 rows containing non-finite values (stat_bin).
# Try changing binwidth from 1, to 10, to 0.25
The bulk of the users had less than 2.5 years on Facebook. To improve the plot, change the x axis so that it increments by 1 year. We also limit the x axis to 7 years.
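As a rough base R sketch of the same idea, the snippet below converts made-up tenure values from days to years and bins them in quarter-year steps from 0 to 7. The numbers are assumptions for illustration; hist() with plot = FALSE is used only to count observations per bin.

```r
tenure_days  <- c(30, 200, 400, 800, 1500, 2400)  # made-up tenures in days
tenure_years <- tenure_days / 365                 # same conversion as tenure/365

# Quarter-year bins from 0 to 7 years, matching binwidth = 0.25 and limits c(0, 7).
bins <- hist(tenure_years, breaks = seq(0, 7, by = 0.25), plot = FALSE)

length(bins$counts)  # number of bins: 28
sum(bins$counts)     # every observation falls in some bin: 6
```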
http://docs.ggplot2.org/0.9.2.1/theme.html
Equivalent ggplot syntax for plots:
ggplot(aes(x = tenure), data = pf) + geom_histogram(binwidth = 30, color = 'black', fill = '#099DD9')
ggplot(aes(x = tenure/365), data = pf) + geom_histogram(binwidth = .25, color = 'black', fill = '#F79420')
Notes:
qplot(x = tenure/365, data = pf, binwidth = 0.25,
xlab = 'Number of years using Facebook',
ylab = 'Number of users in sample',
color = I('black'), fill = I('#099DD9')) +
scale_x_continuous(breaks = seq(1, 7, 1), lim = c(0, 7))
## Warning: Removed 26 rows containing non-finite values (stat_bin).
#or
ggplot(aes(x=tenure/365), data=pf) +
geom_histogram(color=I('black'), fill=I('dark green'), binwidth=.25) +
scale_x_continuous(breaks=seq(1,7,1), limits=c(0,7)) +
labs(x="Number of years using Facebook", y='Number of users in sample')
## Warning: Removed 26 rows containing non-finite values (stat_bin).
names(pf)
Notes: or ggplot(aes(x = age), data = pf) + geom_histogram(binwidth = 1, fill = '#5760AB') + scale_x_continuous(breaks = seq(0, 113, 5))
min(pf$age)
## [1] 13
max(pf$age)
## [1] 113
qplot(x = age, data = pf, binwidth = 1,
xlab = 'User age',
ylab = 'count',
color = I('black'), fill = I('#099DD9')) +
scale_x_continuous(breaks = seq(0, 113, 5))
# Users must be at least 13 years of age to set up a Facebook account,
# which is why there is no data below 13.
Response: The distribution is roughly bell-shaped with a long right tail. The number of users increases from age 13 and appears to peak around the age of 20, then begins to decrease after the age of 21. There are also large spikes (anomalies) after the age of 100. Those are most likely fake ages that users reported.
Notes:
Notes: Lada is interested in how information flows through networks (i.e. social networks). Memes tend to replicate themselves, especially when they have text that says “repost” or “copy and paste”.
To analyze the occurrence of the money bag meme, Lada plotted the occurrences of this meme. She saw various spikes, particularly in the months that were considered “lucky” because they had 5 Fridays, Saturdays, and Sundays. When looking at her plots on a linear scale, using linear counts, the meme appears to disappear in the areas where the spikes are not visible. The meme probably never disappeared; it might have just been floating around Facebook in low numbers.
To check this, one can use a log scale, where the pattern is much more evident. Using this, we can see counts of size 10 while also seeing counts of 100,000. Even though there is a rapid decay of interest, it actually looks like it might be parallel. This was done in ggplot using a simple line geom, grouping by the particular meme variant, and then rescaling the y axis to one of the log versions.
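A tiny sketch of why the log scale helps, with made-up counts chosen to span the range mentioned above. On a linear axis a count of 10 is 1/10,000 of the height of a count of 100,000; after log10 the same two values sit at 1 and 5.

```r
counts <- c(10, 100, 1000, 100000)  # made-up counts spanning several orders of magnitude

counts / max(counts)  # relative heights on a linear axis
log10(counts)         # positions on a log10 axis: 1 2 3 5
```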
qplot(x= friend_count, data = pf)
Notes: Most variables like friend count, likes, comments, and wall posts are called engagement variables, and they have very long tails. Some users have 10 times or even 100 times the median value; that is, they have orders of magnitude more likes, clicks, or comments than other users. In statistics, we say that the data is over-dispersed. Often it helps to transform these values so we can see standard deviations, or orders of magnitude, so we are in effect shortening the tail.
summary(pf$friend_count)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 31.0 82.0 196.4 206.0 4923.0
Ex: The histogram of the friend count had a very long tail. We can transform the data using a log: the natural log, log base 2, or log base 10. We could also use the square root. Doing so helps us see patterns more clearly without being distracted by the tails. A lot of common statistical techniques, like linear regression, are based on the assumption that variables have normal distributions. So, by taking the log of this variable, we can transform our data into a normal distribution, or something that more closely resembles one.
summary(pf$friend_count)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 31.0 82.0 196.4 206.0 4923.0
summary(log10(pf$friend_count))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -Inf 1.491 1.914 -Inf 2.314 3.692
Trying a log10 transformation, we get something unusual: negative infinity for both the minimum and the mean. Some of our users have a friend count of zero, and log10(0) is undefined; as a limit it is negative infinity, which R reports as -Inf. To avoid this, we add 1 to friend count before taking the log.
We can also use the sqrt transformation. The instructor mentions that log10 is an easier transformation to wrap his head around, since he is just comparing friend counts on orders of magnitude of 10. Basically, a tenfold scale, like the pH scale.
summary(log10(pf$friend_count +1))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 1.505 1.919 1.868 2.316 3.692
summary(sqrt(pf$friend_count))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 5.568 9.055 11.088 14.353 70.164
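The behaviour seen in the summaries above can be reproduced on a few made-up values (assumed for illustration, including one zero). A single zero is enough to make both the minimum and the mean of the log10 values -Inf, and adding 1 before the log avoids that.

```r
friend_count <- c(0, 9, 99, 999)  # made-up values, including a zero

log10(friend_count)        # first element is -Inf
mean(log10(friend_count))  # -Inf, because one term is -Inf
log10(friend_count + 1)    # 0 1 2 3
sqrt(friend_count)         # defined at zero, so no shift is needed
```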
Create Multiple Plots in One Image Output http://lightonphiri.org/blog/ggplot2-multiple-plots-in-one-graph-using-gridextra
Add Log or Sqrt Scales to an Axis http://docs.ggplot2.org/current/scale_continuous.html
Assumptions of Linear Regression http://en.wikipedia.org/wiki/Linear_regression#Assumptions
Normal Distribution http://en.wikipedia.org/wiki/Normal_distribution
You need to run the following lines of code before trying to create all three histograms on one plot.
install.packages('gridExtra')
library(gridExtra)
Log Transformations of Data http://www.r-statistics.com/2013/05/log-transformations-for-skewed-and-wide-distributions-from-practical-data-science-with-r/
library(gridExtra)
p1 <- ggplot(aes(x=friend_count), data=pf) +
geom_histogram()
p2 <- ggplot(aes(x=log10(friend_count+1)), data=pf) +
geom_histogram()
p3 <- ggplot(aes(x=sqrt(friend_count)), data=pf) +
geom_histogram()
grid.arrange(p1,p2,p3, ncol=1)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
p1 <- ggplot(aes(x=friend_count), data=pf) + geom_histogram()
p2 <- p1 + scale_x_log10()
p3 <- p1 + scale_x_sqrt()
grid.arrange(p1,p2,p3, ncol=1)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Transformation introduced infinite values in continuous x-axis
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 1962 rows containing non-finite values (stat_bin).
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Notes:
logScale <- qplot(x = log10(friend_count), data = pf)
countScale <- ggplot(aes(x = friend_count), data = pf) +
geom_histogram() +
scale_x_log10()
grid.arrange(logScale, countScale, ncol=2)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 1962 rows containing non-finite values (stat_bin).
## Warning: Transformation introduced infinite values in continuous x-axis
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 1962 rows containing non-finite values (stat_bin).
qplot(x = friend_count, data = subset(pf, !is.na(gender)), binwidth = 10) +
scale_x_continuous(limits = c(0, 1000), breaks = seq(0, 1000, 50)) +
facet_wrap(~gender)
## Warning: Removed 2949 rows containing non-finite values (stat_bin).
qplot(x = friend_count, data = subset(pf, !is.na(gender)),
binwidth = 10, geom = 'freqpoly', color = gender) +
scale_x_continuous(limits = c(0, 1000), breaks = seq(0, 1000, 50))
## Warning: Removed 2949 rows containing non-finite values (stat_bin).
## Warning: Removed 4 rows containing missing values (geom_path).
Note that the shape of the frequency polygon depends on how our bins are set up. The height of the lines is the same as the height of the bars in the individual histograms, but the lines are easier to compare since they are on the same axis.
Equivalent ggplot syntax:
ggplot(aes(x = friend_count, y = ..count../sum(..count..)), data = subset(pf, !is.na(gender))) + geom_freqpoly(aes(color = gender), binwidth=10) + scale_x_continuous(limits = c(0, 1000), breaks = seq(0, 1000, 50)) + xlab('Friend Count') + ylab('Proportion of users with that friend count')
The question was: who has more friends? To answer that, we need to change counts into proportions.
qplot(x = friend_count, y = ..count../sum(..count..),
data = subset(pf, !is.na(gender)),
xlab = 'Friend count',
ylab = 'Proportion of Users with that friend count',
binwidth = 10, geom = 'freqpoly', color = gender) +
scale_x_continuous(limits = c(0, 1000), breaks = seq(0, 1000, 50))
## Warning: Removed 2949 rows containing non-finite values (stat_bin).
## Warning: Removed 4 rows containing missing values (geom_path).
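The idea of turning counts into proportions can be sketched in base R with two made-up groups of different sizes. This sketch divides each group's bin counts by that group's own total, which is an assumption about the denominator and may differ from what sum(..count..) computes in the qplot call above.

```r
# Made-up friend counts for two groups of different sizes.
female <- c(5, 12, 18, 25, 40)
male   <- c(3, 8, 11, 14, 22, 27, 35, 48)
breaks <- seq(0, 50, by = 10)

# Count observations per bin without drawing anything.
f_counts <- hist(female, breaks = breaks, plot = FALSE)$counts
m_counts <- hist(male,   breaks = breaks, plot = FALSE)$counts

# Proportions make the groups comparable even though their sizes differ.
f_prop <- f_counts / sum(f_counts)
m_prop <- m_counts / sum(m_counts)

rbind(female = f_prop, male = m_prop)
```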
qplot(x = friend_count, y = ..count../sum(..count..),
data = subset(pf, !is.na(gender)),
xlab = 'Friend count',
ylab = 'Proportion of Users with that friend count',
binwidth = 10, geom = 'freqpoly', color = gender) +
scale_x_continuous(lim=c(250,1000), breaks=seq(250,1000,50))
## Warning: Removed 81819 rows containing non-finite values (stat_bin).
## Warning: Removed 4 rows containing missing values (geom_path).
## use LIMITS or BREAKS to explore more.
#scale_x_continuous(limits = c(0, 1000), breaks = seq(0, 1000, 50))
#scale_x_continuous(lim=c(250,1000), breaks=seq(250,1000,50))
Notes: The frequency polygons above and below still do not let us answer our question: who really has more likes, men or women? Let’s try a numerical summary instead.
qplot(x = www_likes, data = subset(pf, !is.na(gender)),
geom = 'freqpoly', color = gender) +
scale_x_log10()
## Warning: Transformation introduced infinite values in continuous x-axis
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 60935 rows containing non-finite values (stat_bin).
The first question is asking how many www_likes there are in the entire data set for males.
The second question is asking which gender has more www_likes in total, so we shall find out numerically.
by(pf$www_likes, pf$gender, sum)
## pf$gender: female
## [1] 3507665
## --------------------------------------------------------
## pf$gender: male
## [1] 1430175
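The grouped total above can be sketched on a small made-up data frame (values assumed for illustration). by() and tapply() both apply a function within each level of the grouping variable; rows whose group is NA do not contribute to any total.

```r
# Made-up likes, with one row whose gender is missing.
toy <- data.frame(
  gender    = c("female", "male", "female", "male", NA),
  www_likes = c(10, 3, 25, 7, 100)
)

by(toy$www_likes, toy$gender, sum)                # printed per group, as above
totals <- tapply(toy$www_likes, toy$gender, sum)  # same totals as a named vector
totals
```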
How to read and use a Boxplot http://flowingdata.com/2008/02/15/how-to-read-and-use-a-box-and-whisker-plot/
The interquartile range or IQR includes all of the values between the bottom and top of the boxes in the boxplot. http://en.wikipedia.org/wiki/Interquartile_range
Visualization of the IQR with a normal probability distribution function (pdf) with μ = 1 and σ² = 1. http://en.wikipedia.org/wiki/File:Boxplot_vs_PDF.svg
Intro to Descriptive Statistics Exercise: Match Box Plots https://classroom.udacity.com/courses/ud827/lessons/1471748603/concepts/834179180923
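A short sketch of the interquartile range on made-up values (assumed for illustration). The IQR is the third quartile minus the first quartile, which is the height of the box in a boxplot, and a single extreme value leaves it unchanged here.

```r
x <- c(10, 20, 30, 40, 50, 60, 70, 80, 900)  # made-up values with one extreme case

q <- quantile(x, c(0.25, 0.75))  # first and third quartiles: 30 and 70
q[[2]] - q[[1]]                  # 40
IQR(x)                           # same value from the built-in function
```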
Notes:
qplot(x = gender, y = friend_count,
data = subset(pf, !is.na(gender)),
geom = 'boxplot')
qplot(x= gender, y = friend_count,
data = subset(pf, !is.na(gender)),
geom = 'boxplot', ylim= c(0, 1000))
## Warning: Removed 2949 rows containing non-finite values (stat_boxplot).
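Equivalent ggplot syntax, two methods. The first sets limits on the scale, which drops observations outside 0 to 1000 before the boxplot statistics are computed:
ggplot(aes(x=gender, y=friend_count), data=subset(pf, !is.na(gender)) ) + geom_boxplot() + scale_y_continuous(limits=c(0,1000))
The second zooms the axis with coord_cartesian() and keeps every observation, so the boxes are computed from the full data:
ggplot(aes(x=gender, y=friend_count), data=subset(pf, !is.na(gender)) ) + geom_boxplot() + coord_cartesian(ylim=c(0,1000))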
Notes:The question is NOT asking who initiated more friendships overall.
How to Interpret a Boxplot http://flowingdata.com/2008/02/15/how-to-read-and-use-a-box-and-whisker-plot/
qplot(x= gender, y = friend_count,
data = subset(pf, !is.na(gender)),
geom = 'boxplot') +
coord_cartesian(ylim= c(0, 250))
by(pf$friend_count, pf$gender, summary)
## pf$gender: female
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 37 96 242 244 4923
## --------------------------------------------------------
## pf$gender: male
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 27 74 165 182 4917
# NOTE: coord_cartesian allows our box plots to match summary data.
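A base R analogy for the note above, using made-up friend counts. Setting limits on the scale (or ylim in qplot) drops values outside the range before the statistics are computed, as the "Removed ... rows" warning indicates, while coord_cartesian() only zooms. The sketch compares the median of the full data with the median after dropping values above 1000.

```r
friend_count <- c(10, 50, 90, 200, 600, 1500, 4000)  # made-up values

# All observations kept: analogous to zooming with coord_cartesian().
median(friend_count)                        # 200

# Observations above the limit dropped first: analogous to ylim / scale limits.
median(friend_count[friend_count <= 1000])  # 90
```

Only the first median agrees with a numerical summary of the full data, which is why the zoomed boxplot is the one to check against by().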
Response: #### Write about some ways that you can verify your answer. Response:
qplot(x= gender, y = friendships_initiated,
data = subset(pf, !is.na(gender)),
geom = 'boxplot') +
coord_cartesian(ylim = c(0, 150))
# Get actual Numbers to check with a numerical summary
by(pf$friendships_initiated, pf$gender, summary)
## pf$gender: female
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 19.0 49.0 113.9 124.8 3654.0
## --------------------------------------------------------
## pf$gender: male
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 15.0 44.0 103.1 111.0 4144.0
Response:
Another way to transform data is to convert a variable with a lot of zero values to binary (TRUE/FALSE).
Notes: There are other ways to transform a variable besides using a log or sqrt. You often want to convert variables that have a lot of 0 values to a new binary variable that has only true and false. This is helpful because we may want to know whether users have used a certain feature at all, instead of the number of times they have used it. For example, it may not matter how many times a person checked in using a mobile device, but whether the person has ever used mobile check-in. Using the summary, we see that the first quartile is 0 and the median is only 4, meaning that we have a lot of zeroes in our dataset.
summary(pf$mobile_likes)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 0.0 4.0 106.1 46.0 25111.0
summary(pf$mobile_likes > 0)
## Mode FALSE TRUE
## logical 35056 63947
# Better still to create a new variable that tracks mobile checkins.
pf$mobile_check_in <- NA
pf$mobile_check_in <- ifelse(pf$mobile_likes > 0, 1, 0) # 1 if user has ever used it, 0 if they never have.
pf$mobile_check_in <- factor(pf$mobile_check_in) # Convert it to a factor variable.
summary(pf$mobile_check_in)
## 0 1
## 35056 63947
# Ratio: What percent of users check in using mobile? Do this programmatically.
sum(pf$mobile_check_in ==1) / length(pf$mobile_check_in)
## [1] 0.6459097
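The same transformation can be sketched on a handful of made-up values (assumed for illustration). Comparing with zero gives a logical vector, and the mean of a logical vector is the proportion of TRUE values, so the ratio can be computed without counting by hand.

```r
mobile_likes <- c(0, 0, 4, 12, 0, 250, 1, 0)  # made-up counts with several zeros

used_mobile <- mobile_likes > 0               # TRUE if the user has any mobile likes
as_factor   <- factor(ifelse(used_mobile, 1, 0))  # same coding as the 0/1 factor above

summary(as_factor)  # counts of 0 and 1
mean(used_mobile)   # proportion of TRUE values: 0.5
```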
Response: So ~65% of Facebook users in this sample check in using mobile, which is over half of the users. It would therefore make a lot of sense to continue developing the mobile experience, at least based on this sample. It is important to think not only about what kind of data you are looking at, but also about what types of transformations you can make to the variables themselves. Sometimes you want raw counts and other times a binary is preferred.
Reflection: A lot of this lesson was review for me since I have worked with R a lot in the past. However, there were some key things that I did learn from this lesson. I learned a lot about the ggplot function, which creates graphs that are far more aesthetically pleasing than the basic plots produced by R's default functions. I also learned different ways to deal with long-tailed distributions and the appropriate way to transform data to better examine trends. I also really enjoyed the tutorial about how one should play around with bin sizes and, overall, how to scale a graph to extract as much information as possible from it.