Lesson 3


What to Do First?

Notes:

Load the Facebook data into R.


Pseudo-Facebook User Data

Notes: This is not an actual Facebook data set, but it closely resembles one. All of the analysis in this lesson uses it. It has about 99K rows and 15 attributes (columns).

getwd()
## [1] "E:/R/WorkSpace"
setwd("E:/R/WorkSpace")
list.files()
##  [1] "Demystifying.R"       "lesson3_student.html" "lesson3_student.rmd" 
##  [4] "MostUsedRCommand.R"   "pseudo_facebook.tsv"  "reddit.csv"          
##  [7] "rsconnect"            "samplermd.html"       "samplermd.rmd"       
## [10] "stateData.csv"
pf <- read.csv("pseudo_facebook.tsv",sep = '\t')
names(pf) # prints columns
##  [1] "userid"                "age"                  
##  [3] "dob_day"               "dob_year"             
##  [5] "dob_month"             "gender"               
##  [7] "tenure"                "friend_count"         
##  [9] "friendships_initiated" "likes"                
## [11] "likes_received"        "mobile_likes"         
## [13] "mobile_likes_received" "www_likes"            
## [15] "www_likes_received"

Histogram of Users’ Birthdays

Notes:

#install.packages('ggplot2')
library(ggplot2)

ggplot(aes(x = dob_day), data = pf) + 
  geom_histogram(binwidth = 1) + 
  scale_x_continuous(breaks = 1:31)

  #facet_wrap(~dob_month)
# Use facet_grid() for multiple variables and facet_wrap() for one variable

What are some things that you notice about this histogram?

Response: There is a high count on the 1st, maybe because January 1 is a default value. An outlier can be bad data (an error in the data) or real data with an extreme value. Try to reason about it by checking whether the outlier is even feasible.


Moira’s Investigation

Notes: An outlier can be actual data or bad data. Check whether the outlier value is even feasible, that is, within the range you expect your data to fall in. Try to reason about why the outlier exists; it may be due to a default value, among other causes.
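One rough way to act on this is to compare how often each value occurs with a typical count. The sketch below uses simulated birthdays rather than the lesson's data; the 10% default rate, the cut-off of 2, and the names dob_day_sim and suspicious_days are assumptions made only for illustration.

```r
set.seed(42)
n <- 10000

# Simulated day-of-month values: uniform over 1..28, plus an assumed
# 10% of users who kept a hypothetical default value of 1.
dob_day_sim <- sample(1:28, n, replace = TRUE)
dob_day_sim[runif(n) < 0.10] <- 1

counts <- table(dob_day_sim)
days   <- as.integer(names(counts))

# Ratio of each day's count to the median count across all days.
ratio <- as.vector(counts) / median(as.vector(counts))

# Days that occur more than twice as often as a typical day look suspicious.
suspicious_days <- days[ratio > 2]
print(suspicious_days)
```

A flagged value is only a candidate: whether it is a default, an entry error, or a real extreme still has to be reasoned out from how the data was collected.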


Estimating Your Audience Size

Notes: This exercise tries to estimate the audience size of a Facebook post.


Think about a time when you posted a specific message or shared a photo on Facebook. What was it?

Response: 5

How many of your friends do you think saw that post?

Response: 100

Think about what percent of your friends on Facebook see any posts or comments that you make in a month. What percent do you think that is?

Response: 30%


Perceived Audience Size

Notes: People usually underestimate the size of their audience on Facebook.


Faceting

Notes:

Use facet_wrap(~dob_month) or facet_grid() to split the chart based on a variable

Splitting the plot based on any variable

ggplot(aes(x = dob_day), data = pf) + 
  geom_histogram(binwidth = 1) + 
  scale_x_continuous(breaks = 1:31)+
  facet_wrap(~dob_month)

# Use facet_grid() for multiple variables and facet_wrap() for one variable

Let’s take another look at our plot. What stands out to you here?

Response: January 1 has the most birthdays.


Be Skeptical - Outliers and Anomalies

Notes:

It might be an outlier, since January 1 could be the default value for the Facebook birthday field.


Moira’s Outlier

Notes:

Which case do you think applies to Moira’s outlier?

Response:


Friend Count

Notes: Let's analyze the friend count data.

What code would you enter to create a histogram of friend counts?

library(ggplot2)

ggplot(aes(x = friend_count), data = pf) + 
  geom_histogram(binwidth = 5) +  # binwidth belongs to the geom; ggplot() silently ignores it
  scale_x_continuous(limits = c(0, 1000), breaks = seq(0, 1000, 50)) # adjust axis
## Warning: Removed 2951 rows containing non-finite values (stat_bin).

How is this plot similar to Moira’s first plot?

Response:


Limiting the Axes

Notes:

Limiting the axis is needed to see the histogram properly. Exclude the range where the outliers lie from the x axis by setting limits, as in the code below.

library(ggplot2)

ggplot(aes(x = friend_count), data = pf) + 
  geom_histogram()+
  scale_x_continuous(limits = c(0, 1000)) # adjust axis
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 2951 rows containing non-finite values (stat_bin).

Exploring with Bin Width

Notes:

The bin width can really change the look of your histogram and reveal new insights. Try different bin widths and see what you can learn from the slightly different shapes of the histogram.
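To see the effect without drawing anything, the sketch below bins the same values twice with base R's hist(). The data is simulated (a hypothetical stand-in for friend_count), and the helper bin_counts is an illustrative name, not part of the lesson.

```r
set.seed(1)

# Simulated right-skewed counts standing in for friend_count.
x <- round(rlnorm(5000, meanlog = 4.5, sdlog = 1))
x <- x[x <= 1000]  # same idea as limiting the x axis to 0..1000

# Count how many values fall into bins of a given width.
bin_counts <- function(values, binwidth) {
  breaks <- seq(0, 1000, by = binwidth)
  # include.lowest keeps zeros in the first bin; plot = FALSE returns counts only
  hist(values, breaks = breaks, include.lowest = TRUE, plot = FALSE)$counts
}

wide   <- bin_counts(x, 100)  # 10 coarse bins
narrow <- bin_counts(x, 10)   # 100 fine bins

print(length(wide))
print(length(narrow))
```

Both versions hold the same observations; the narrow bins simply show where within each coarse bin the values concentrate.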

Adjusting the Bin Width

Notes: Use the binwidth parameter of the qplot() function to change the bin width.

Faceting Friend Count

# What code would you add to facet the histogram by gender?
# Add it to the code below.
qplot(x = friend_count, data = pf, binwidth = 10) +
  scale_x_continuous(limits = c(0, 1000),
                     breaks = seq(0, 1000, 50))+
  facet_wrap(~gender)
## Warning: Removed 2951 rows containing non-finite values (stat_bin).


Omitting NA Values

Notes: Use subset() to exclude the rows where gender is NA.

qplot(x = friend_count, data = subset(pf,!is.na(gender)), binwidth = 10) +
  scale_x_continuous(limits = c(0, 1000),breaks = seq(0, 1000, 50))+
  facet_wrap(~gender)
## Warning: Removed 2949 rows containing non-finite values (stat_bin).
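The same subset() pattern can be tried on a tiny data frame to see exactly which rows are dropped. The data frame toy below is made up for illustration.

```r
# A small made-up data frame with two missing gender values.
toy <- data.frame(
  gender = c("female", "male", NA, "male", NA),
  friend_count = c(10, 20, 30, 40, 50)
)

# Keep only the rows where gender is known.
known <- subset(toy, !is.na(gender))

print(nrow(toy))
print(nrow(known))

# Note: na.omit(toy) would drop rows with an NA in ANY column,
# which can discard more rows than intended.
```

Filtering on the one column that matters keeps rows whose other columns happen to be missing.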


Statistics ‘by’ Gender

Notes:

table(pf$gender)
## 
## female   male 
##  40254  58574
by(pf$friend_count,pf$gender,summary)
## pf$gender: female
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0      37      96     242     244    4923 
## -------------------------------------------------------- 
## pf$gender: male
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0      27      74     165     182    4917

Who on average has more friends: men or women?

Response: Women

What’s the difference between the median friend count for women and men?

Response: 22

Why would the median be a better measure than the mean?

Response: The median is more robust than the mean. In a skewed distribution, a few extreme values pull the mean toward the tail while the median barely moves. In a symmetric distribution such as the normal, the mean and the median coincide.
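A small made-up example shows the difference. The vectors below are hypothetical friend counts, not values from the data set.

```r
# Five ordinary users, then the same users plus one extreme value.
friend_counts <- c(10, 20, 30, 40, 50)
with_outlier  <- c(friend_counts, 5000)

# Without the outlier, mean and median agree.
mean_before   <- mean(friend_counts)
median_before <- median(friend_counts)

# With the outlier, the mean jumps while the median barely moves.
mean_after   <- mean(with_outlier)
median_after <- median(with_outlier)

print(c(mean_before, median_before))
print(c(mean_after, median_after))
```

One extreme user moves the mean from 30 to about 858, while the median only moves from 30 to 35.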


Tenure

Notes: Tenure is how long a user has been using Facebook. It is recorded in days and can be converted to years.

ggplot(aes(x = tenure), data = pf) + 
   geom_histogram(binwidth = 30, color = 'black', fill = '#F79420')
## Warning: Removed 2 rows containing non-finite values (stat_bin).


How would you create a histogram of tenure by year?

Notice the use of /365 to convert days to years and of scale_x_continuous() to set the breaks and limits.

ggplot(aes(x = tenure/365), data = pf) + 
   geom_histogram(binwidth = 0.25, color = 'black', fill = '#F79420')+
   scale_x_continuous(breaks= seq(1,7,1),limits =c(0,7))
## Warning: Removed 26 rows containing non-finite values (stat_bin).


Labeling Plots

Notes:

ggplot(aes(x = tenure/365), data = pf) + 
   geom_histogram(binwidth = 0.25, color = 'black', fill = '#F79420')+
   scale_x_continuous(breaks= seq(1,7,1),limits =c(0,7))+
   xlab('Number of years using Facebook') + 
    ylab('Number of users in sample')
## Warning: Removed 26 rows containing non-finite values (stat_bin).


User Ages

Notes: Make a sensible choice for (1) the bin width, (2) the breaks, and (3) the axis limits.

names(pf)
##  [1] "userid"                "age"                  
##  [3] "dob_day"               "dob_year"             
##  [5] "dob_month"             "gender"               
##  [7] "tenure"                "friend_count"         
##  [9] "friendships_initiated" "likes"                
## [11] "likes_received"        "mobile_likes"         
## [13] "mobile_likes_received" "www_likes"            
## [15] "www_likes_received"
ggplot(aes(x = age), data = pf) + 
   geom_histogram(binwidth = 1, color = 'black', fill = '#F79420')+
   scale_x_continuous(breaks= seq(0,113,1),limits =c(0,150))+
   xlab('age') + 
    ylab('Number of users in sample')

What do you notice?

Response: Most users are between 15 and 25 years old, and older people use Facebook less. Is the minimum age 5? The data may contain errors or outliers, although this is not very evident. Set the bin width to 1 to get more information.


The Spread of Memes

Notes: A meme is “an idea, behavior, or style that spreads from person to person within a culture”. A meme acts as a unit for carrying cultural ideas, symbols, or practices that can be transmitted from one mind to another through writing, speech, gestures, rituals, or other imitable phenomena with a mimicked theme. Supporters of the concept regard memes as cultural analogues to genes in that they self-replicate, mutate, and respond to selective pressures.

Memes usually show spikes at certain time intervals, followed by quiet periods of low activity.


Lada’s Money Bag Meme

Notes: Facebook meme (money bag)

August will have 5 Fridays, 5 Saturdays and 5 Sundays.

This happens only once every 823 years.

The Chinese call it "Silver pockets full."

So: send this message to your friends and in four days the money will surprise you.

Based on Chinese Feng Shui. Whoever does not transmit the message … may find themselves clueless … This is not fun at all


Transforming Data

Notes:

Engagement variables might have a very long tail, i.e. the data is “over-dispersed”.

Transform such a variable by taking the log or the square root.

Make the x or y scale logarithmic so that wide swings in the data fit on the plot.
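A quick numeric check of what the log transform does to a long tail is sketched below. The engagement values are simulated from a lognormal distribution, and skew_gap is an ad hoc measure introduced here for illustration, not a standard statistic.

```r
set.seed(7)

# Simulated long-tailed engagement counts (hypothetical data).
engagement <- round(rlnorm(10000, meanlog = 3, sdlog = 1.5))

# log10(0) is -Inf, so zero counts need an offset before taking the log.
print(log10(0))
log_eng <- log10(engagement + 1)

# Ad hoc measure: how far the mean sits above the median, relative to the median.
skew_gap <- function(v) (mean(v) - median(v)) / median(v)

print(skew_gap(engagement))  # large: the tail drags the mean upward
print(skew_gap(log_eng))     # near zero: roughly symmetric after the transform
```

The +1 offset is one common convention for counts that include zero; sqrt() needs no offset because sqrt(0) is 0.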


Add a Scaling Layer

Notes:

Notice that the friend count data is skewed. A common model for such data is the lognormal distribution.

Compare the first and second plots: the second uses a log scale and looks much closer to a normal distribution.

Summaries of the original and transformed friend counts:

summary(pf$friend_count)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0    31.0    82.0   196.4   206.0  4923.0
summary(log10(pf$friend_count + 1)) # +1 avoids log10(0) = -Inf
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   1.505   1.919   1.868   2.316   3.692
summary(sqrt(pf$friend_count))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   5.568   9.055  11.090  14.350  70.160
#install.packages('gridExtra') 
library(gridExtra) 

#http://lightonphiri.org/blog/ggplot2-multiple-plots-in-one-graph-using-gridextra

# Create three histograms: friend_count, log10(friend_count + 1), and sqrt(friend_count)

p_fc = qplot(x = friend_count, data = pf, binwidth = 10) +
  scale_x_continuous(limits = c(0, 1000),breaks = seq(0, 1000, 50))
  
p_log_fc = qplot(x = log10(friend_count+1), data = pf, binwidth = 0.05) +
  scale_x_continuous(limits = c(0, 5),breaks = seq(0, 5, .5))

p_sqrt_fc = qplot(x = sqrt(friend_count), data = pf, binwidth = 1) +
  scale_x_continuous(limits = c(0, 100),breaks = seq(0, 100, 5))


grid.arrange(p_fc, p_log_fc, p_sqrt_fc, ncol=1)
## Warning: Removed 2951 rows containing non-finite values (stat_bin).


Add a Scaling Layer

Notes:

http://lightonphiri.org/blog/ggplot2-multiple-plots-in-one-graph-using-gridextra

Create three histograms: friend count, log10(friend count), and sqrt(friend count). If the package is missing, install it first with install.packages('gridExtra').

library(gridExtra) 
p_fc = ggplot(aes(x = friend_count), data = pf) + 
   geom_histogram(binwidth = 10, color = 'black', fill = '#F79420')+
   xlab('Friend count') + 
    ylab('Number of users in sample')+
    scale_x_continuous(limits = c(0, 1000), breaks = seq(0, 1000, 50)) # adjust axis

p_log_fc = p_fc + scale_x_log10()
## Scale for 'x' is already present. Adding another scale for 'x', which
## will replace the existing scale.
p_sqrt_fc = p_fc + scale_x_sqrt()
## Scale for 'x' is already present. Adding another scale for 'x', which
## will replace the existing scale.
grid.arrange(p_fc, p_log_fc, p_sqrt_fc, ncol=1)
## Warning: Removed 2951 rows containing non-finite values (stat_bin).
## Warning: Transformation introduced infinite values in continuous x-axis
## Warning: Removed 1962 rows containing non-finite values (stat_bin).


Frequency Polygons

ggplot(aes(x = friend_count, y = ..count../sum(..count..)), data = subset(pf, !is.na(gender))) + 
  geom_freqpoly(aes(color = gender), binwidth=10) + 
  scale_x_continuous(limits = c(0, 1000), breaks = seq(0, 1000, 50)) + 
  xlab('Friend Count') + 
  ylab('Proportion of users with that friend count')
## Warning: Removed 2949 rows containing non-finite values (stat_bin).
## Warning: Removed 4 rows containing missing values (geom_path).


Likes on the Web

Notes:

names(pf)
##  [1] "userid"                "age"                  
##  [3] "dob_day"               "dob_year"             
##  [5] "dob_month"             "gender"               
##  [7] "tenure"                "friend_count"         
##  [9] "friendships_initiated" "likes"                
## [11] "likes_received"        "mobile_likes"         
## [13] "mobile_likes_received" "www_likes"            
## [15] "www_likes_received"
library(ggplot2)
# Frequency polygon of www_likes by gender; y is each bin's count divided by the total count
ggplot(aes(x = www_likes, y = ..count../sum(..count..)), data = subset(pf, !is.na(gender))) + 
  geom_freqpoly(aes(color = gender), binwidth = 0.1) +  # binwidth is in log10 units here
  xlab('www_likes') + 
  ylab('Proportion of users')+
  scale_x_log10()  # use a single x scale; a second one would replace the first
## Warning: Transformation introduced infinite values in continuous x-axis
## Warning: Removed 60935 rows containing non-finite values (stat_bin).

summary(pf$www_likes)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##     0.00     0.00     0.00    49.96     7.00 14860.00
by(pf$www_likes,pf$gender,sum)
## pf$gender: female
## [1] 3507665
## -------------------------------------------------------- 
## pf$gender: male
## [1] 1430175

Box Plots

https://en.wikipedia.org/wiki/File:Boxplot_vs_PDF.svg

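A box plot represents the major statistics of a data distribution, namely the 25th percentile (first quartile), the 50th percentile (median), and the 75th percentile (third quartile).

Notes: Setting limits with scale_y_continuous() removes the data outside the limits before the statistics are computed, so the quartiles will be incorrect. The correct way to zoom in is coord_cartesian(ylim = c(0, 100)).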
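The effect can be checked numerically without drawing a plot. The sketch below is a base R analogue, not ggplot2 itself: dropping values above the limit mimics what axis limits on the scale do, while the untouched vector corresponds to zooming with coord_cartesian(). The friend counts are simulated and the cut-off of 100 is just an example.

```r
set.seed(3)

# Simulated friend counts with a long right tail (hypothetical data).
fc <- round(rlnorm(10000, meanlog = 4.5, sdlog = 1.2))

probs <- c(0.25, 0.5, 0.75)

# Quartiles of all the data: what a zoomed-in box plot should still show.
q_all <- quantile(fc, probs)

# Quartiles after discarding values above 100: what trimming the data produces.
q_trimmed <- quantile(fc[fc <= 100], probs)

print(q_all)
print(q_trimmed)
```

The trimmed quartiles are all pulled downward, which is why the box drawn from trimmed data misrepresents the distribution.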

ggplot(aes(factor(gender), friend_count), data = subset(pf, !is.na(gender))) + 
    geom_boxplot()+  # geom_boxplot() has no binwidth parameter
    coord_cartesian(ylim = c(0,500))

    #scale_y_continuous(limits = c(0, 500))

Adjust the code to focus on users who have friend counts between 0 and 1000.

ggplot(aes(factor(gender), friend_count), data = subset(pf, !is.na(gender))) + 
    geom_boxplot()+  # geom_boxplot() has no binwidth parameter
    coord_cartesian(ylim = c(0,1000))


Box Plots, Quartiles, and Friendships

Notes:

by(pf$friend_count,pf$gender,summary)
## pf$gender: female
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0      37      96     242     244    4923 
## -------------------------------------------------------- 
## pf$gender: male
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0      27      74     165     182    4917

On average, who initiated more friendships in our sample: men or women?

Response: Women

Write about some ways that you can verify your answer.

Response: The median and the mean are higher for women, and so is the third quartile. Use the by() command with summary to check.

ggplot(aes(factor(gender), friendships_initiated), data = subset(pf, !is.na(gender))) + 
    geom_boxplot()+
    coord_cartesian(ylim = c(0,500))

Response: Women initiate more friendship requests than men.
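The by() check can be sketched on a small made-up data frame before running it on the real column. The values in toy are invented for illustration and say nothing about the actual data.

```r
# Made-up friendships_initiated values for three women and three men.
toy <- data.frame(
  gender = c("female", "female", "female", "male", "male", "male"),
  friendships_initiated = c(40, 60, 110, 20, 50, 80)
)

# Full five-number summary plus mean, split by gender.
print(by(toy$friendships_initiated, toy$gender, summary))

# tapply() returns just the statistic of interest as a named vector.
medians <- tapply(toy$friendships_initiated, toy$gender, median)
print(medians)
```

On the lesson's data the equivalent call would presumably be by(pf$friendships_initiated, pf$gender, summary).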


Getting Logical

Notes: Use a logical condition to find out whether a user is using a certain feature.

summary(pf$mobile_likes)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0     0.0     4.0   106.1    46.0 25110.0
summary(pf$mobile_likes >0)
##    Mode   FALSE    TRUE    NA's 
## logical   35056   63947       0
pf$mobile_check_in <- NA  # create the column on pf, not a stray global variable
pf$mobile_check_in <- ifelse(pf$mobile_likes>0,1,0)
pf$mobile_check_in <- factor(pf$mobile_check_in)
summary(pf$mobile_check_in)
##     0     1 
## 35056 63947
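Because TRUE counts as 1 and FALSE as 0, the mean of a logical vector is the proportion of TRUE values. The sketch below shows this on made-up values and then applies the same arithmetic to the two counts printed above; mobile_likes_sim is hypothetical.

```r
# Made-up mobile like counts for eight users.
mobile_likes_sim <- c(0, 3, 0, 12, 7, 0, 1, 45)

# TRUE where the user has at least one mobile like.
used_mobile <- mobile_likes_sim > 0

# Mean of a logical vector = share of TRUE values.
share_sim <- mean(used_mobile)
print(share_sim)

# Same calculation from the counts in the summary output above.
counts <- c(no = 35056, yes = 63947)
pct_mobile <- round(100 * counts[["yes"]] / sum(counts), 1)
print(pct_mobile)
```

On the data frame itself, mean(pf$mobile_likes > 0) would give the same share directly.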

Response:


Analyzing One Variable

Reflection:

  • It is important to visualize the data separately for each parameter (column).

  • Always check for outliers, and try to find out whether each one is an error or a genuine extreme value.

  • Look for interesting patterns in the visualized data.

  • Adjust the axis range, bin width, and breaks of your plot so that the important findings stand out.

  • You might also have to transform your variable by taking the log or the square root of the data.

http://people.stern.nyu.edu/adamodar/New_Home_Page/StatFile/statdistns.htm

https://www.r-statistics.com/2013/05/log-transformations-for-skewed-and-wide-distributions-from-practical-data-science-with-r/

