Lesson 3

What to Do First?

Notes: Check the working directory and check the list of files

getwd()
setwd("~/R Datasources")
list.files()
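The three calls above can be tried safely in a scratch folder first. This is a minimal sketch, assuming a hypothetical temporary folder (`r_datasources_demo`) that stands in for the real data directory; the empty file only exists so that `list.files()` has something to show.

```r
# Hypothetical scratch folder standing in for "~/R Datasources"
data_dir <- file.path(tempdir(), "r_datasources_demo")
dir.create(data_dir, showWarnings = FALSE)
file.create(file.path(data_dir, "pseudo_facebook.tsv"))  # empty placeholder file

old_dir <- setwd(data_dir)   # setwd() returns the previous directory invisibly
files_here <- list.files()   # files in the new working directory
setwd(old_dir)               # restore the original working directory

print(files_here)
```

Saving the value returned by `setwd()` makes it easy to switch back once the files have been checked.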

Pseudo-Facebook User Data

Notes: 1. Load the Facebook data with the read.csv command and set the separator to tab, i.e. sep = '\t'. 2. Use the names command to view the variables.

pf <- read.csv('../R Datasources/pseudo_facebook.tsv',sep = '\t')
names(pf)

##  [1] "userid"                "age"                  
##  [3] "dob_day"               "dob_year"             
##  [5] "dob_month"             "gender"               
##  [7] "tenure"                "friend_count"         
##  [9] "friendships_initiated" "likes"                
## [11] "likes_received"        "mobile_likes"         
## [13] "mobile_likes_received" "www_likes"            
## [15] "www_likes_received"
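To see what the tab separator does without the real file, here is a small sketch. The three-row sample and its column subset are made up for illustration; only `read.csv`, `read.delim`, and `names` are taken from base R as used in the lesson.

```r
# Hypothetical three-row sample standing in for pseudo_facebook.tsv
sample_path <- tempfile(fileext = ".tsv")
writeLines(c("userid\tage\tgender",
             "1\t14\tmale",
             "2\t21\tfemale",
             "3\t35\tNA"),
           sample_path)

# sep = "\t" tells read.csv that fields are separated by tabs
pf_sample <- read.csv(sample_path, sep = "\t")

# read.delim uses a tab separator by default, so it reads the file the same way
pf_sample2 <- read.delim(sample_path)

names(pf_sample)
str(pf_sample)
```

The literal text NA in the file is read as a missing value, which is why gender later shows up with an NA group.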

Histogram of Users’ Birthdays

Notes:

###install.packages('ggplot2')
library(ggplot2)

## Warning: package 'ggplot2' was built under R version 3.2.1

qplot(x= dob_day, data = pf)

## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

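### use scale_x_continuous to label every day on the x axis
### (dob_day is numeric; scale_x_discrete only worked here in older ggplot2 releases)
qplot(x = dob_day, data = pf) +
  scale_x_continuous(breaks = 1:31)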

  ### + facet_wrap(~dob_month, ncol = 3)

What are some things that you notice about this histogram?

Response: 1. Most users appear to be born on the 1st day of the month. 2. Day 31 has the fewest users because not all months have 31 days.

Moira’s Investigation

Notes: Moira observed that most people’s perceived audience size is quite different from reality.

Estimating Your Audience Size

Notes:

Think about a time when you posted a specific message or shared a photo on Facebook. What was it?

Response: A message about my partner’s first time baking a cake.

How many of your friends do you think saw that post?

Response: 50

Think about what percent of your friends on Facebook see any posts or comments that you make in a month. What percent do you think that is?

Response: 40%

Perceived Audience Size

Notes: Most people estimate their audience size to be about a quarter of its actual size.

Faceting

Notes:
1. facet_wrap takes a formula: a ~ followed by a variable. It is usually used when faceting over one variable, and it creates the same type of plot for each level of the categorical variable.
2. facet_grid is used to facet over two or more variables: facet_grid(vertical ~ horizontal).
3. Learn more about faceting here: http://www.cookbook-r.com/Graphs/Facets_(ggplot2)/
4. Equivalent ggplot syntax: ggplot(data = pf, aes(x = dob_day)) + geom_histogram() + scale_x_continuous(breaks = 1:31) + facet_wrap(~dob_month)

qplot(x = dob_day, data = pf) +
  scale_x_continuous(breaks = 1:31) +
  facet_wrap(~dob_month, ncol = 3)

Let’s take another look at our plot. What stands out to you here?

Response: The spike is concentrated on the first day of the first month (January 1).

Be Skeptical - Outliers and Anomalies

Notes: 1. We need to detect and deal with outliers in the dataset. 2. Outliers can be accurate data, or they can be examples of bad data.

Moira’s Outlier

Notes: 1. She first adjusts the axes and cuts out the outlier.

Which case do you think applies to Moira’s outlier?

Response: Here the outlier was bad data: an extreme value.

Friend Count

Notes:

What code would you enter to create a histogram of friend counts?

qplot(x = friend_count, data = pf)

## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

##qplot(x=pf$friend_count,data=pf)

How is this plot similar to Moira’s first plot?

Response: Both plots are heavily skewed, with most of the data bunched at one end of the axis.

Limiting the Axes

Notes: 1. Use xlim to limit the x-axis. It takes a vector with a start and an end position. 2. You can also add a layer using scale_x_continuous. 3. Learn more about scales here: http://docs.ggplot2.org/current/scale_continuous.html

qplot( x = friend_count, data = pf, xlim = c(0,1000))

## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

##Use a layer instead of xlim
qplot( x = friend_count, data = pf) +
  scale_x_continuous(limits = c(0,1000))

## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

#        qplot(x = friend_count,data = subset(pf, !is.na(gender)), binwidth = 25) +
#   scale_x_continuous(limits = c(0,1000), breaks=seq(0,1000,50)) +
#   facet_wrap(~gender)

Exploring with Bin Width

Notes:

Adjusting the Bin Width

Notes: Adjust the bin width and use breaks to adjust the x axis scale markers

qplot( x = friend_count, data = pf, binwidth = 25) +
  scale_x_continuous(limits = c(0,1000), breaks = seq(0,1000,50))
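To make the roles of binwidth and breaks concrete, the sketch below counts values per 25-wide bin by hand in base R. The friend counts are hypothetical, and the bin boundaries chosen here ([0, 25), [25, 50), ...) are an assumption for illustration; ggplot2 may place its bin boundaries differently.

```r
# Hypothetical friend counts; not taken from the real dataset
friend_count <- c(0, 3, 12, 24, 25, 49, 50, 74, 99, 100, 480, 999)

bin_width <- 25
bin_edges <- seq(0, 1000, by = bin_width)   # 0, 25, 50, ..., 1000
tick_marks <- seq(0, 1000, by = 50)         # same seq() call as in breaks = ...

# right = FALSE gives bins [0, 25), [25, 50), ...; the last bin also includes 1000
bins <- cut(friend_count, breaks = bin_edges,
            right = FALSE, include.lowest = TRUE)
bin_counts <- table(bins)

length(bin_edges) - 1   # number of bins
head(bin_counts, 4)     # counts in the first four bins
```

The bin width controls how the data are grouped, while the tick marks only control where the axis is labeled; the two can be changed independently.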

Faceting Friend Count

Notes: facet_wrap by gender. This produces three histograms: female, male, and NA.

# What code would you add to facet the histogram by gender?
# Add it to the code below.
qplot(x = friend_count, data = pf, binwidth = 10) +
  scale_x_continuous(limits = c(0, 1000),
                     breaks = seq(0, 1000, 50)) +
  facet_wrap(~gender)

Omitting NA Values

Notes: You can use the subset function to filter a dataset and remove NA values. You can also use the na.omit function, but with caution: it removes every observation that has an NA in any variable, not only in gender.

# omits only the observations where gender is missing
qplot(x = friend_count,data = subset(pf, !is.na(gender)), binwidth = 25) +
  scale_x_continuous(limits = c(0,1000), breaks=seq(0,1000,50)) +
  facet_wrap(~gender)

# omits every observation that has a missing value in any variable
qplot(x = friend_count,data = na.omit(pf) , binwidth = 25) +
  scale_x_continuous(limits = c(0,1000), breaks=seq(0,1000,50)) +
  facet_wrap(~gender)
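The difference between the two approaches is easiest to see by counting rows. This is a small sketch on a made-up data frame (the `users` name and its values are hypothetical), with missing values placed in two different columns.

```r
# Hypothetical mini data frame with missing values in two different columns
users <- data.frame(
  gender       = c("female", "male", NA, "male", "female"),
  friend_count = c(96, 74, 10, 182, 244),
  tenure       = c(300, NA, 120, 45, 800)
)

by_gender_only <- subset(users, !is.na(gender))  # drops only rows with NA gender
complete_rows  <- na.omit(users)                 # drops rows with NA in any column

nrow(users)           # 5
nrow(by_gender_only)  # 4
nrow(complete_rows)   # 3
```

The user with a known gender but missing tenure survives `subset` and is lost with `na.omit`, which is the reason for the caution in the notes.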

Statistics ‘by’ Gender

Notes: Use the table command to count the users of each gender, then use by with summary to compare friend counts and answer who has more friends, men or women.

table(pf$gender)

## 
## female   male 
##  40254  58574

# friend_count is the variable to summarize; gender is the categorical grouping variable
by(pf$friend_count,pf$gender,summary)

## pf$gender: female
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0      37      96     242     244    4923 
## -------------------------------------------------------- 
## pf$gender: male
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0      27      74     165     182    4917

Who on average has more friends: men or women?

Response: women

What’s the difference between the median friend count for women and men?

Response: 22
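The difference can also be computed rather than read off the printed summary. The sketch below uses `tapply` on a made-up data frame; the six values are hypothetical and were chosen only so that the group medians match the 96 and 74 shown in the summary above.

```r
# Hypothetical values chosen so the medians match the printed summary
users <- data.frame(
  gender       = c("female", "female", "female", "male", "male", "male"),
  friend_count = c(37, 96, 244, 27, 74, 182)
)

# One median per gender, returned as a named vector
medians <- tapply(users$friend_count, users$gender, median)
median_gap <- unname(medians["female"] - medians["male"])

medians
median_gap
```

On the real data, the same two lines with `pf$friend_count` and `pf$gender` would be one way to reproduce the answer.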

Why would the median be a better measure than the mean?

Response: A few users with very high friend counts pull the mean toward one end. The median is resistant to such values because it marks the halfway point of all data points. As long as we trust half of our values, we can report a reliable location for the center of the distribution.
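A quick numerical check of this reasoning, using hypothetical friend counts: adding a single extreme value moves the mean a long way but barely moves the median.

```r
# Hypothetical friend counts for a handful of users
typical <- c(20, 35, 60, 80, 95, 110, 150)
with_outlier <- c(typical, 4923)   # add one user with a very large count

mean(typical)          # about 78.6
median(typical)        # 80

mean(with_outlier)     # about 684.1
median(with_outlier)   # 87.5
```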

Tenure

Notes: Use tenure to examine how many days a person has been using Facebook.

The parameter color determines the color outline of objects in a plot.

The parameter fill determines the color of the area inside objects in a plot.

You might notice how the color black and the hex code color of #099DD9 (a shade of blue) are wrapped inside of I(). The I() function stands for ‘as is’ and tells qplot to use them as colors. Learn more about what you can adjust in a plot by reading the ggplot theme documentation http://docs.ggplot2.org/0.9.2.1/theme.html

qplot(x = tenure, data = pf, binwidth = 30,
      color = I('black'), fill = I('#099DD9'))

How would you create a histogram of tenure by year?

## plot the tenure in years rather than in days

qplot(x = tenure / 365, data = pf, binwidth = .25,
      color = I('black'),fill = I('#099DD9'))

# change the x-axis to increment by one year
qplot(x = tenure / 365, data = pf, binwidth = .25,
      color = I('black'),fill = I('#099DD9')) +
  scale_x_continuous(breaks=seq(0,7,1), limits = c(0,7))

Labeling Plots

Notes: Plots need to speak for themselves, so change the default labels to ones that make sense.

qplot(x = tenure / 365, data = pf, binwidth = .25,
      xlab = 'Number of years using Facebook',
      ylab = 'Number of users in sample',
      color = I('black'),fill = I('#099DD9')) +
  scale_x_continuous(breaks=seq(0,7,1), limits = c(0,7))

User Ages

Notes:

qplot(x = age , data = subset(pf,!is.na(gender)),binwidth = 1,
      xlab = 'Facebook Users by Age',
      ylab = 'Number of users in sample',
      color = I('black'), fill = I('#099DD9')) +
  scale_x_continuous(breaks = seq(10,120,10), limits = c(10,120)) +
  facet_wrap(~gender)

by(pf$age,pf$gender,summary)

## pf$gender: female
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   13.00   21.00   31.00   39.46   54.00  113.00 
## -------------------------------------------------------- 
## pf$gender: male
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   13.00   20.00   27.00   35.67   45.00  113.00

What do you notice?

Response: The average age of men on Facebook is lower than that of women: the median is 27 versus 31 and the mean is 35.67 versus 39.46.

The Spread of Memes

Notes:

Lada’s Money Bag Meme

Notes:

Transforming Data

Notes: Sometimes data can be “over-dispersed”, especially long-tailed data.

qplot(x = friend_count, data = pf)

## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

summary(pf$friend_count)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0    31.0    82.0   196.4   206.0  4923.0

# transform using log10; the result contains -Inf because some users have zero friends
summary(log10(pf$friend_count))

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    -Inf       1       2    -Inf       2       4

# add 1 to friend count to avoid the -Inf
summary(log10(pf$friend_count +1))

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   1.505   1.919   1.868   2.316   3.692

#transform using square root
summary(sqrt(pf$friend_count))

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   5.568   9.055  11.090  14.350  70.160
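The effect of the zero values is easy to reproduce on a tiny vector. This is a minimal sketch with hypothetical friend counts; it also shows that the +1 shift can be undone afterwards.

```r
friend_count <- c(0, 9, 99, 999)   # hypothetical values, including a zero

log10(friend_count)        # first element is -Inf because log10(0) is -Inf
log10(friend_count + 1)    # 0 1 2 3
sqrt(friend_count)         # defined at zero, so no shift is needed

# The +1 shift is undone with 10^y - 1
shifted <- log10(friend_count + 1)
recovered <- 10^shifted - 1
recovered
```

A single -Inf is enough to make the mean -Inf as well, which matches the summary output above.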

Add a Scaling Layer

Notes: 1. To draw several plots in the same figure you need the gridExtra package. 2. With the scale_x_log10 layer the x-axis is labeled in actual friend counts, whereas wrapping the variable in log10 labels the axis in log units.

##install.packages('gridExtra')
library(gridExtra)

## Warning: package 'gridExtra' was built under R version 3.2.1

#create each plot and assign to a variable
p1 <- qplot(x = friend_count, data = pf)
p2 <- qplot(x = friend_count, data = pf) +
  scale_x_log10()
p3 <- qplot(x = friend_count, data = pf) +
  scale_x_sqrt()
#use grid.arrange to plot 
grid.arrange(p1,p2,p3, ncol = 1)

## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

## Transforming data alternate solution
## Use scales

#aes is the "aesthetic wrapper" and the geom tells ggplot what type of plot we need
p1 <- ggplot(aes(x = friend_count), data = pf) + geom_histogram()
p2 <- p1 + scale_x_log10()
p3 <- p1 + scale_x_sqrt()
#use grid.arrange to plot 
grid.arrange(p1,p2,p3, ncol = 1)

## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

#examine the difference between adding a scaling layer and using a wrapper
logScale <- qplot(x = log10(friend_count), data = pf)
countScale <- ggplot(aes(x = friend_count), data = pf) +
  geom_histogram() +
  scale_x_log10()

grid.arrange(logScale,countScale, ncol = 2)

## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
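The difference between the two plots is in the axis labels, not in the shape of the histogram. The base R sketch below illustrates the idea with hypothetical, strictly positive friend counts: the values are binned once in log10 units, and the same bin edges are then labeled in two ways.

```r
friend_count <- c(1, 5, 10, 40, 100, 250, 1000)   # hypothetical, all positive

log_values <- log10(friend_count)
edges_log <- seq(0, 3, by = 0.5)      # bin edges in log10 units
counts <- table(cut(log_values, breaks = edges_log, include.lowest = TRUE))

# Same bins, two ways to label them
labels_log   <- edges_log             # axis in log units (log10() wrapper)
labels_count <- 10^edges_log          # axis in friend counts (log10 scale)

data.frame(log_units = labels_log, friend_counts = round(labels_count, 1))
counts
```

Labels in friend counts (1, 10, 100, 1000) are usually easier to read than labels in log units (0, 1, 2, 3), which is the argument for the scaling layer.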

Frequency Polygons

Notes: Frequency polygons are used to compare two or more distributions. They are similar to histograms, but they draw a curve connecting the counts of a histogram.

#plot a histogram of friend counts, faceted by gender
qplot(x = friend_count , data = subset(pf,!is.na(gender)),binwidth = 10,
      xlab = 'Friend Count',
      ylab = 'Number of users in sample'
      ) +
  scale_x_continuous(breaks = seq(0,1000,50), limits = c(10,1000)) +
  facet_wrap(~gender)

#plot a frequency polygon of friend counts, colored by gender
qplot(x = friend_count , y = ..count../sum(..count..),
      data = subset(pf,!is.na(gender)),binwidth = 10,
      geom = 'freqpoly', color = gender,
      xlab = 'Friend Count',
      ylab = 'Proportion of Users with that friend count'
      ) +
  scale_x_continuous(breaks = seq(0,1000,50), limits = c(10,1000))

## Warning: Removed 2 rows containing missing values (geom_path).

## Warning: Removed 2 rows containing missing values (geom_path).
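Because the sample contains more men than women, raw counts are hard to compare between the two curves, which is why a proportion is plotted instead. The sketch below shows one way to compute proportions by hand in base R on hypothetical data. It normalizes within each gender, which is an assumption of this example and not necessarily the same normalization as the plot expression above.

```r
# Hypothetical sample with more men than women
users <- data.frame(
  gender       = c("female", "female", "male", "male", "male", "male"),
  friend_count = c(12, 48, 5, 15, 33, 41)
)

edges <- seq(0, 50, by = 10)
bins <- cut(users$friend_count, breaks = edges, right = FALSE)

counts <- table(users$gender, bins)               # raw counts per gender and bin
within_gender <- prop.table(counts, margin = 1)   # each row sums to 1

counts
round(within_gender, 2)
```

In the [10, 20) bin both genders have one user, but that is half of the women and only a quarter of the men.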

Likes on the Web

Notes:

#plot a frequency polygon of www_likes, colored by gender
#observation from the plot: men seem to have more likes than women,
#but the sums below show that women have more likes in total
qplot(x = www_likes,
      data = subset(pf,!is.na(gender)),
      geom = 'freqpoly', color = gender,
      xlab = 'WWW Likes',
      ylab = 'Number of users with that www like count'
      ) +
  scale_x_log10()

## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

#use a numerical summary to see who has more likes
by(pf$www_likes,pf$gender,sum)

## pf$gender: female
## [1] 3507665
## -------------------------------------------------------- 
## pf$gender: male
## [1] 1430175

Box Plots

Notes: How to read and use a Boxplot http://flowingdata.com/2008/02/15/how-to-read-and-use-a-box-and-whisker-plot/

The interquartile range or IQR includes all of the values between the bottom and top of the boxes in the boxplot. https://en.wikipedia.org/wiki/Interquartile_range

Intro to Descriptive Statistics Exercise: Match Box Plots https://www.udacity.com/course/viewer#!/c-ud827/l-1471748603/e-83417918/m-83664035

Outliers are usually considered to be values more than 1.5 times the IQR beyond the edges of the box, that is, below the first quartile or above the third quartile.
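The 1.5 times IQR rule can be worked through by hand. This is a minimal sketch with hypothetical friend counts, using `quantile` and `IQR` from base R; boxplot functions may compute the quartiles slightly differently, so the fences here are an approximation of what a plot would show.

```r
friend_count <- c(0, 27, 74, 96, 182, 244, 300, 4923)   # hypothetical values

q <- quantile(friend_count, c(0.25, 0.75))   # first and third quartiles
iqr <- IQR(friend_count)                     # same as q[2] - q[1]

lower_fence <- q[[1]] - 1.5 * iqr
upper_fence <- q[[2]] + 1.5 * iqr

# Values outside the fences are flagged as outliers
outliers <- friend_count[friend_count < lower_fence | friend_count > upper_fence]
outliers
```

Only the value 4923 falls outside the fences in this sample.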

#plot a histogram of friend counts, faceted by gender
qplot(x = friend_count , data = subset(pf,!is.na(gender)),binwidth = 10,
      xlab = 'Friend Count',
      ylab = 'Number of users in sample'
      ) +
  scale_x_continuous(breaks = seq(0,1000,50), limits = c(10,1000)) +
  facet_wrap(~gender)

#plot a boxplot of friend counts by gender
qplot(x = gender, y = friend_count, 
      data = subset(pf, !is.na(gender)), 
      geom = 'boxplot')

Adjust the code to focus on users who have friend counts between 0 and 1000.

#adjust the y scale to focus on 0 to 1000. ylim removes values/observations
  qplot(x = gender, y = friend_count, 
      data = subset(pf, !is.na(gender)), 
      geom = 'boxplot' , ylim = c(0, 1000))

## Warning: Removed 2949 rows containing non-finite values (stat_boxplot).

  #using scale_y_continuous with limits also removes values
  qplot(x = gender, y = friend_count, 
      data = subset(pf, !is.na(gender)), 
      geom = 'boxplot') +
  scale_y_continuous(limits = c(0,1000))

## Warning: Removed 2949 rows containing non-finite values (stat_boxplot).

  #use the coord_cartesian layer to zoom in without removing values
  qplot(x = gender, y = friend_count, 
      data = subset(pf, !is.na(gender)), 
      geom = 'boxplot') +
    coord_cartesian(ylim = c(0,1000))
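Removing values matters because the box is drawn from the quartiles of whatever data remain. The sketch below mimics the effect in base R with hypothetical friend counts: dropping the values above 1000 before summarizing changes the median, while summarizing all values and merely zooming would not.

```r
friend_count <- c(10, 50, 90, 200, 700, 1500, 3000)   # hypothetical values

# What limits on the scale do: drop values outside 0..1000 before summarizing
kept <- friend_count[friend_count >= 0 & friend_count <= 1000]

median(friend_count)   # 200, the median of all users
median(kept)           # 90, the median after dropping two users
```

This is why the boxes drawn with ylim or scale_y_continuous can differ from the ones drawn with coord_cartesian.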

Box Plots, Quartiles, and Friendships

Notes: 1. How to interpret a Box Plot http://flowingdata.com/2008/02/15/how-to-read-and-use-a-box-and-whisker-plot/ 2. The interquartile range or IQR includes all of the values between the bottom and top of the boxes in the boxplot.

 qplot(x = gender, y = friend_count, 
      data = subset(pf, !is.na(gender)), 
      geom = 'boxplot') +
    coord_cartesian(ylim = c(0,250))

  by(pf$friend_count,pf$gender,summary)

## pf$gender: female
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0      37      96     242     244    4923 
## -------------------------------------------------------- 
## pf$gender: male
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0      27      74     165     182    4917

On average, who initiated more friendships in our sample: men or women?

Response: women

Write about some ways that you can verify your answer.

Response: The median for women is greater than the median for men in the box plot, and this is backed up by a numerical summary.

qplot(x = gender, y = friendships_initiated, 
      data = subset(pf, !is.na(gender)), 
      geom = 'boxplot') +
    coord_cartesian(ylim = c(0,500))

qplot(x = gender, y = friendships_initiated, 
      data = subset(pf, !is.na(gender)), 
      geom = 'boxplot') +
    coord_cartesian(ylim = c(0,150))

by(pf$friendships_initiated,pf$gender,summary)

## pf$gender: female
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0    19.0    49.0   113.9   124.8  3654.0 
## -------------------------------------------------------- 
## pf$gender: male
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0    15.0    44.0   103.1   111.0  4144.0

Response:

Getting Logical

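Notes: What is a factor variable? It is R’s type for categorical data, where each value is one of a fixed set of levels. Here the factor is derived from an existing column, much like a computed column or measure.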

#check the distribution of mobile likes using a numerical summary
summary(pf$mobile_likes)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0     0.0     4.0   106.1    46.0 25110.0

#many of the values are zero. Count how many users have more than zero mobile likes
summary(pf$mobile_likes > 0)

##    Mode   FALSE    TRUE    NA's 
## logical   35056   63947       0

### Create a factor variable that records whether a user has checked in on mobile,
### rather than a count of the number of times a user has checked in on mobile

### create a variable in the dataframe pf and assign NA values to it
pf$mobile_check_in <- NA

### use a logical operator to assign 1 or 0 if a user has checked in
pf$mobile_check_in <- ifelse(pf$mobile_likes > 0,1,0)

###convert to a factor variable
pf$mobile_check_in <- factor(pf$mobile_check_in)
summary(pf$mobile_check_in)

##     0     1 
## 35056 63947

###What percent of users check in using mobile?
sum(pf$mobile_check_in == 1) / length(pf$mobile_check_in)

## [1] 0.6459097

Response: 65%
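The same proportion can be obtained in one step, because the mean of a logical vector is the share of TRUE values. This is a small sketch with hypothetical mobile like counts that compares the approach used above with that shortcut.

```r
mobile_likes <- c(0, 0, 4, 46, 0, 12, 1, 0)   # hypothetical values

# The lesson's approach: 1/0 indicator converted to a factor
has_mobile <- ifelse(mobile_likes > 0, 1, 0)
has_mobile <- factor(has_mobile)
share_factor <- sum(has_mobile == 1) / length(has_mobile)

# Shortcut: TRUE counts as 1 and FALSE as 0 when averaging
share_logical <- mean(mobile_likes > 0)

share_factor
share_logical
```

Both expressions give 0.5 for this sample, since four of the eight users have at least one mobile like.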

Analyzing One Variable

Reflection:

Click KnitHTML to see all of your hard work and to have an html page of this lesson, your answers, and your notes!
