Lesson 3

https://flowingdata.com/2014/02/27/how-to-read-histograms-and-use-them-in-r/ ***

What to Do First?

Notes: set your working dir.


Pseudo-Facebook User Data

read the data in, note ‘’ tab separated file Notes:

pf <- read.csv('pseudo_facebook.tsv', sep='\t')

Histogram of Users’ Birthdays

Notes: You can use scale_x_continuous() instead to get the break points, or use ggplot() syntax .*** did not work?? for me

#install.packages('ggplot2')
library(ggplot2)


qplot(x = dob_day, data = pf) +
scale_x_continuous()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot(aes(x = dob_day), data = pf) +
  geom_histogram(binwidth = 1) +
  scale_x_continuous(breaks = 1:31)


What are some things that you notice about this histogram?

Response: The spike on day 1


Moira’s Investigation

Notes: Audience higher than the percieved number


Estimating Your Audience Size

Notes:


Think about a time when you posted a specific message or shared a photo on Facebook. What was it?

Response:0

How many of your friends do you think saw that post?

Response:0

Think about what percent of your friends on Facebook see any posts or comments that you make in a month. What percent do you think that is?

Response:


Perceived Audience Size

Notes:


Faceting

Notes: there 2 types In this case the above histogram is divided into 12 histograms based on the months using facet-wrap (for one variable) facet_wrap(~ variable) facet_grid(vertical~horizontal) # 2 or more variables

for more than one variable, use facet-grid.

qplot(x = dob_day, data = pf) +
  scale_x_continuous(breaks = 1:31) + 
  facet_wrap(~dob_month, ncol = 3)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot(data = pf, aes(x = dob_day)) + geom_histogram(binwidth = 1) + scale_x_continuous(breaks = 1:31) + facet_wrap(~dob_month)

Let’s take another look at our plot. What stands out to you here?

Response:


Be Skeptical - Outliers and Anomalies - there types due to different causes eg

an extreme case like someone tweets 1000 times/day or they may represent bad data or the limitations of the data or extreme cases may be replaced with normal values like in census data where extreme salaries are brought down with normal values

Types/ category impt inorder to know how to exclude it bad data about a non-extreme case bad data about an extreme case *good data about an extreme case


Moira’s Outlier

Notes: terrible plot coz one person guessed(percieved a number in millions !!! So she had to adjust the axes first and foremost so that she could see the bulk of the data) #### Which case do you think applies to Moira’s outlier? Response:


Friend Count

Notes:

What code would you enter to create a histogram of friend counts?

qplot(x = friend_count, data = pf)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

How is this plot similar to Moira’s first plot?

Response: Its also long tail histogram with much of the data squished on the left


Limiting the Axes

Notes: Therefore we have to limit the axes to see the data clearly. Say we want to see data within the first 1000 users. Use xlim as a vector

qplot(x = friend_count, data = pf, xlim = c(0,1000))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 2951 rows containing non-finite values (stat_bin).

#alternatively

qplot(x = friend_count, data = pf) +
  scale_x_continuous(limits = c(0, 1000))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 2951 rows containing non-finite values (stat_bin).

Exploring with Bin Width

Notes: qplot(x = friend_count, data = pf, binwidth = 25) + scale_x_continuous(limits = c(0, 1000), breaks = seq(0, 1000,50)) ***

Adjusting the Bin Width

Notes: setting it at 1 allows you to see the individual perceptions that comeup as spikes

Faceting Friend Count - say we want to know which gender has the highest count

What code would you add to create a facet the histogram by gender?

Add it to the code below

qplot(x = friend_count, data = pf, binwidth = 25) +
  scale_x_continuous(limits = c(0, 1000), breaks = seq(0, 1000, 50)) +
  facet_wrap(~gender)
## Warning: Removed 2951 rows containing non-finite values (stat_bin).


Omitting NA Values

https://www.statmethods.net/input/missingdata.html Notes: we add a condition to omit the NA values only within gender(!is.na(gender))

#qplot(x = friend_count, data = subset(pf, !is.na(gender)),binwidth = 10) +
#  scale_x_continuous(limits = c(0, 1000), breaks = seq(0, 1000, 50)) +
#  facet_wrap(~gender)

#alternatively - note: be careful with na.omit() coz it may remove na thats #not relate to the variable you want removed

qplot(x = friend_count, data = na.omit(pf), binwidth = 10) +
  scale_x_continuous(limits = c(0, 1000), breaks = seq(0, 1000, 50)) +
  facet_wrap(~gender)
## Warning: Removed 2949 rows containing non-finite values (stat_bin).


Statistics ‘by’ Gender

Notes:

table(pf$gender)
## 
## female   male 
##  40254  58574
by(pf$friend_count, pf$gender, summary)
## pf$gender: female
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0      37      96     242     244    4923 
## -------------------------------------------------------- 
## pf$gender: male
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0      27      74     165     182    4917

Who on average has more friends: men or women?

Response: Women

What’s the difference between the median friend count for women and men?

Response: 22 #### Why would the median be a better measure than the mean? Response: more robust ***

Tenure

Notes:

qplot(x = tenure, data = pf, binwidth = 30,
      color = I('black'), fill = I('#099DD9'))
## Warning: Removed 2 rows containing non-finite values (stat_bin).

***The parameter color determines the color outline of objects in a plot.

The parameter fill determines the color of the area inside objects in a plot.

You might notice how the color black and the hex code color of #099DD9 (a shade of blue) are wrapped inside of I(). The I() functions stand for ‘as is’ and tells qplot to use them as colors.

How would you create a histogram of tenure by year?

qplot(x = tenure/365, data = pf, binwidth = 0.25,
      color = I('black'), fill = I('#099DD9')) +
  scale_x_continuous(breaks = seq(1, 7, 1), limits = c(0, 7))
## Warning: Removed 26 rows containing non-finite values (stat_bin).

#`change binwidth from 1, to 10, to 0.25

The bulk of the users had less than 2,5 years in fb. To improve the plot - change the x axis so that it increments by 1 year. We also limit users to only 7

http://docs.ggplot2.org/0.9.2.1/theme.html

Equivalent ggplot syntax for plots:

ggplot(aes(x = tenure), data = pf) + geom_histogram(binwidth = 30, color = ‘black’, fill = ‘#099DD9’)

ggplot(aes(x = tenure/365), data = pf) + geom_histogram(binwidth = .25, color = ‘black’, fill = ‘#F79420’)


Labeling Plots

Notes:

qplot(x = tenure/365, data = pf, binwidth = 0.25,
      xlab = 'Number of years using Facebook',
      ylab = 'Number of users in sample',
      color = I('black'), fill = I('#099DD9')) +
  scale_x_continuous(breaks = seq(1, 7, 1), lim = c(0, 7))
## Warning: Removed 26 rows containing non-finite values (stat_bin).

#or

ggplot(aes(x=tenure/365), data=pf) +
  geom_histogram(color=I('black'), fill=I('dark green'), binwidth=.25) +
  scale_x_continuous(breaks=seq(1,7,1), limits=c(0,7)) +
  labs(x="Number of years using Facebook", y='Number of users in sample')
## Warning: Removed 26 rows containing non-finite values (stat_bin).


names(pf)

User Ages

Notes: or ggplot(aes(x = age), data = pf) + geom_histogram(binwidth = 1, fill = ‘#5760AB’) + scale_x_continuous(breaks = seq(0, 113, 5))

min(pf$age)
## [1] 13
max(pf$age)
## [1] 113
qplot(x = age, data = pf, binwidth = 1,
      xlab = 'User age',
      ylab = 'count',
      color = I('black'), fill = I('#099DD9')) +
  scale_x_continuous(breaks = seq(0, 113, 5))

#users must be at least 13 years of age to set up a Facebook account, #which is why there is no data below 13.

What do you notice?

Response: Response: there is a bell shape curve with a long right tail. The number of users increases from age 13 and it appears to peak around the age of 20, then the number of users begins to decrease after the age of 21. There is also those large spikes (anomalies) after the age of 100. Those are most likely fake user ages that are reported ***

The Spread of Memes

Notes:


Lada’s Money Bag Meme

Notes: She is interested in how information flows through networks (i.e. social networks). Memes tend to replicate themselves, especially when they have text that say “repost” or “copy and paste”.

In order to analyze the occurrance of moneybag mean, Lada attempted to plot the occurrances of this meme. And she saw various spikes particularly in the months that were considered to be “lucky” because they had 5 fridays, saturdays, and sundays. When looking at her plots on a linear scale, using linear counts, it appears that the mean dissapears in the areas where the spikes are not visible. The meme probably never disspeared and it might have just been floating around facebook in low numbers.

To check this, one can use a log scale and the pattern is much more evident. Using this, we can see counts that are of size 10 while also seeing counts that are of 100,000. Eventhough there is a rapid decay of interest, it actually looks like it might be parallel. This was done in ggplot using a simple line geome, and grouping by the particular meme variant, and then rescaling the yaxis to one of the log versions.


qplot(x= friend_count, data = pf)

Transforming Data

Notes: Most variables like friend count, likes, comments, wall posts and others are variables called ENGAGEMENT VARIABLES with very long tails. Some have 10 times or even 100 times the median value. They are in oRDER OF MAGNITUDES, i.e have more likes, clicks, or comments, than any other users. In statistics, we say that the data is OVER DISPERSED. Often, it helps to transform these values so we can see standard deviations, or orders of magnitudes, so we are in effect, shortening the tail.

summary(pf$friend_count)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0    31.0    82.0   196.4   206.0  4923.0

Ex: The histogram of the friend count had very long tails. We can transform the data useing a log, log base 2, or base 10. We could also use the square root, and doing so helps us to see patterns more clearly without being distracted by the tails. Alot of common statistical techniques like linear regression, are based on the assumption that variables have normal distributions. So, by taking the log of this variable, we can transform our data to turn it into a normal distribution or something that more closely resembles a normal distribution.

summary(pf$friend_count)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0    31.0    82.0   196.4   206.0  4923.0
summary(log10(pf$friend_count))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    -Inf   1.491   1.914    -Inf   2.314   3.692

Trying a log10 transformation, we get something unusual since we get negative infinity for both the minimum and mean. Note, some of our users have a friend count of zero. So when we take the log10 of 0, that would be undefined. Using calculus, we would get that the limit would be -Inf. To avoid this, we are going to add 1 to friend count, so that we don’t get an undefined answer.

We can also use the sqrt transformation. The instructor mentions that log10 is an easier tranformation to wrap his head around, since he is just comparing friend counts on orders of magnitude of 10. Basically, a 10 fold scale, like the pH scale.

summary(log10(pf$friend_count +1))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   1.505   1.919   1.868   2.316   3.692
summary(sqrt(pf$friend_count))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   5.568   9.055  11.088  14.353  70.164

Create Multiple Plots in One Image Output http://lightonphiri.org/blog/ggplot2-multiple-plots-in-one-graph-using-gridextra

Add Log or Sqrt Scales to an Axis http://docs.ggplot2.org/current/scale_continuous.html

Assumptions of Linear Regression http://en.wikipedia.org/wiki/Linear_regression#Assumptions

Normal Distribution http://en.wikipedia.org/wiki/Normal_distribution

You need to run the following lines of code before trying to create all three histograms on one plot.

install.packages(‘gridExtra’) library(gridExtra)

Log Transformations of Data http://www.r-statistics.com/2013/05/log-transformations-for-skewed-and-wide-distributions-from-practical-data-science-with-r/

library(gridExtra)

p1 <- ggplot(aes(x=friend_count), data=pf) +
  geom_histogram()
p2 <- ggplot(aes(x=log10(friend_count+1)), data=pf) +
  geom_histogram()
p3 <- ggplot(aes(x=sqrt(friend_count)), data=pf) +
  geom_histogram()
grid.arrange(p1,p2,p3, ncol=1)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Alternative plotting method (different x-axis than the one above):

p1 <- ggplot(aes(x=friend_count), data=pf) + geom_histogram()
p2 <- p1 + scale_x_log10()
p3 <- p1 + scale_x_sqrt()
grid.arrange(p1,p2,p3, ncol=1)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Transformation introduced infinite values in continuous x-axis
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 1962 rows containing non-finite values (stat_bin).
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Note, there is a slight difference here based on the x-axis labeling.


Add a Scaling Layer

Notes:

logScale <- qplot(x = log10(friend_count), data = pf) 

countScale <- ggplot(aes(x = friend_count), data = pf) +
  geom_histogram() +
  scale_x_log10()

grid.arrange(logScale, countScale, ncol=2)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 1962 rows containing non-finite values (stat_bin).
## Warning: Transformation introduced infinite values in continuous x-axis
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 1962 rows containing non-finite values (stat_bin).


Frequency Polygons

  • good for comparing two or more distributions at once
qplot(x = friend_count, data = subset(pf, !is.na(gender)),binwidth = 10,) +
  scale_x_continuous(limits = c(0, 1000), breaks = seq(0, 1000, 50)) +
  facet_wrap(~gender)
## Warning: Removed 2949 rows containing non-finite values (stat_bin).

qplot(x = friend_count, data = subset(pf, !is.na(gender)),
      binwidth = 10, geom = 'freqpoly', color = gender) +
  scale_x_continuous(limits = c(0, 1000), breaks = seq(0, 1000, 50)) 
## Warning: Removed 2949 rows containing non-finite values (stat_bin).
## Warning: Removed 4 rows containing missing values (geom_path).

Note that the shape of the frequency polygon depends on how our bins are set up - the height of the lines are the same as the bars in individual histograms, but the lines are easier to make a comparison with since they are on the same axis.

Equivalent ggplot syntax:

ggplot(aes(x = friend_count, y = ..count../sum(..count..)), data = subset(pf, !is.na(gender))) + geom_freqpoly(aes(color = gender), binwidth=10) + scale_x_continuous(limits = c(0, 1000), breaks = seq(0, 1000, 50)) + xlab(‘Friend Count’) + ylab(‘Proportion of users with that friend count’)

the question was who has more friends? To answer that we need to change counts into proportions

qplot(x = friend_count, y = ..count../sum(..count..),
      data = subset(pf, !is.na(gender)),
      xlab = 'Friend count',
      ylab = 'Proportion of Users with that friend count',
      binwidth = 10, geom = 'freqpoly', color = gender) +
  scale_x_continuous(limits = c(0, 1000), breaks = seq(0, 1000, 50))
## Warning: Removed 2949 rows containing non-finite values (stat_bin).
## Warning: Removed 4 rows containing missing values (geom_path).

qplot(x = friend_count, y = ..count../sum(..count..),
      data = subset(pf, !is.na(gender)),
      xlab = 'Friend count',
      ylab = 'Proportion of Users with that friend count',
      binwidth = 10, geom = 'freqpoly', color = gender) +
  scale_x_continuous(lim=c(250,1000), breaks=seq(250,1000,50))
## Warning: Removed 81819 rows containing non-finite values (stat_bin).
## Warning: Removed 4 rows containing missing values (geom_path).

## use LIMITS or BREAKS to explore more.
  #scale_x_continuous(limits = c(0, 1000), breaks = seq(0, 1000, 50))
#scale_x_continuous(lim=c(250,1000), breaks=seq(250,1000,50))

Likes on the Web

Notes:Our above and below frequency plot still does not let us answer our question: who really has more likes, men or women? Let’s try a numerical summary instead.

qplot(x = www_likes, data = subset(pf, !is.na(gender)),
       geom = 'freqpoly', color = gender) +
  scale_x_continuous() + 
  scale_x_log10()
## Scale for 'x' is already present. Adding another scale for 'x', which
## will replace the existing scale.
## Warning: Transformation introduced infinite values in continuous x-axis
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 60935 rows containing non-finite values (stat_bin).

The first question is asking how many www_likes there are in the entire data set for males.

The second question is asking which gender has more www_likes in total. *** so we shall find out numerically

by(pf$www_likes, pf$gender, sum)
## pf$gender: female
## [1] 3507665
## -------------------------------------------------------- 
## pf$gender: male
## [1] 1430175

Box Plots

How to read and use a Boxplot http://flowingdata.com/2008/02/15/how-to-read-and-use-a-box-and-whisker-plot/

The interquartile range or IQR includes all of the values between the bottom and top of the boxes in the boxplot. http://en.wikipedia.org/wiki/Interquartile_range

Visualization of the IQR with a normal probability distribution function with μ=1=1μ=1 and σ2=1^2=1σ2=1 (pdf). http://en.wikipedia.org/wiki/File:Boxplot_vs_PDF.svg

Intro to Descriptive Statistics Exercise: Match Box Plots https://classroom.udacity.com/courses/ud827/lessons/1471748603/concepts/834179180923

Notes:

qplot(x = gender, y = friend_count,
      data = subset(pf, !is.na(gender)),
       geom = 'boxplot') 

Adjust the code to focus on users who have friend counts between 0 and 1000.

qplot(x= gender, y = friend_count, 
      data = subset(pf, !is.na(gender)), 
      geom = 'boxplot', ylim= c(0, 1000)) 
## Warning: Removed 2949 rows containing non-finite values (stat_boxplot).

or

#Two methods ggplot(aes(x=gender, y=friend_count), data=subset(pf, !is.na(gender)) ) + geom_boxplot() + scale_y_continuous(lim=c(0,1000))

or

Cord cartesian method

ggplot(aes(x=gender, y=friend_count), data=subset(pf, !is.na(gender)) ) + geom_boxplot() + coord_cartesian(ylim=c(0,1000)) ***

Box Plots, Quartiles, and Friendships

Notes:The question is NOT asking who initiated more friendships overall.

How to Interpret a Boxplot http://flowingdata.com/2008/02/15/how-to-read-and-use-a-box-and-whisker-plot/

The interquartile range or IQR includes all of the values between the bottom and top of the boxes in the boxplot. http://en.wikipedia.org/wiki/Interquartile_range

qplot(x= gender, y = friend_count, 
      data = subset(pf, !is.na(gender)), 
      geom = 'boxplot') +
  coord_cartesian(ylim= c(0, 250))

by(pf$friend_count, pf$gender, summary)
## pf$gender: female
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0      37      96     242     244    4923 
## -------------------------------------------------------- 
## pf$gender: male
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0      27      74     165     182    4917
# NOTE: coord_cartesian allows our box plots to match summary data.

On average, who initiated more friendships in our sample: men or women?

Response: #### Write about some ways that you can verify your answer. Response:

qplot(x= gender, y = friendships_initiated, 
      data = subset(pf, !is.na(gender)), 
      geom = 'boxplot') + 
      coord_cartesian(ylim = c(0, 150))

# Get actual Numbers to check with a numerical summary
by(pf$friendships_initiated, pf$gender, summary)
## pf$gender: female
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0    19.0    49.0   113.9   124.8  3654.0 
## -------------------------------------------------------- 
## pf$gender: male
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0    15.0    44.0   103.1   111.0  4144.0

Response:


Getting Logical

Another way of data transformation = converting data w/a lot of zero values to binary (T/F)

Notes:There are other ways that we can transform a variable beside using a log or sqrt. You often want to convert variables that have a lot of 0 values to a new binary variable that has only true and false. This is helpful because we may want to know if they have used a certain feature at all, instead of the number of times that the user has used that feature. For example, it may not matter how many times a person checked in using a mobile device, but whether the person has ever used mobile check in. Using the summary, we see that the median is 4, meaning that we have a lot of zeroes in our dataset.

summary(pf$mobile_likes)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0     0.0     4.0   106.1    46.0 25111.0
summary(pf$mobile_likes > 0)
##    Mode   FALSE    TRUE 
## logical   35056   63947
# Better still to create a new variable that tracks mobile checkins.
pf$mobile_check_in <- NA
pf$mobile_check_in <- ifelse(pf$mobile_likes > 0, 1, 0)   # 1 if user has ever used it, 0 if they never have.
pf$mobile_check_in <- factor(pf$mobile_check_in)          # Convert it to a factor variable.
summary(pf$mobile_check_in)
##     0     1 
## 35056 63947
#Ratio: What percent of users check in using mobile? Do this programatically.
#
sum(pf$mobile_check_in ==1) / length(pf$mobile_check_in)
## [1] 0.6459097

Response: Response: So ~65% of facebook users check in using mobile, which is over half of the users. So it would make a lot of sense to continue the development of the mobile experience, at least based on this sample of dataset. It is always important not to think about what kind of data you are looking at, but maybe what types of transformations you can make to the variables themselves. Sometimes you want raw counts and other times a binary is prefered.


Analyzing One Variable

Reflection: A lot of this lesson was review for me since I have worked with R a lot in the past. However, there were some key things that I did learn from this lesson. I learned a lot about the ggplot function, which creates graphs that are far more aesthetically pleasing than the basic plots found with the default functions from R. I also learned different ways to deal with long tail distributions and the appropriate way to transform data to better examine trends. I also really enjoyed the tutorial about how one should play around with bin sizes and overall how to scale a graph to extract as much information as possible from them.


Click KnitHTML to see all of your hard work and to have an html page of this lesson, your answers, and your notes!