Lesson 3

What to Do First?

Notes:

Load the facebook data in R

Pseudo-Facebook User Data

Notes: This is not an actual Facebook data set, but it is very close to one. All of the analysis in this lesson uses it. It has about 99K rows and 15 attributes (columns).

getwd()

## [1] "E:/R/WorkSpace"

setwd("E:/R/WorkSpace")
list.files()

##  [1] "Demystifying.R"       "lesson3_student.html" "lesson3_student.rmd" 
##  [4] "MostUsedRCommand.R"   "pseudo_facebook.tsv"  "reddit.csv"          
##  [7] "rsconnect"            "samplermd.html"       "samplermd.rmd"       
## [10] "stateData.csv"

pf <- read.csv("pseudo_facebook.tsv",sep = '\t')
names(pf) # prints columns

##  [1] "userid"                "age"                  
##  [3] "dob_day"               "dob_year"             
##  [5] "dob_month"             "gender"               
##  [7] "tenure"                "friend_count"         
##  [9] "friendships_initiated" "likes"                
## [11] "likes_received"        "mobile_likes"         
## [13] "mobile_likes_received" "www_likes"            
## [15] "www_likes_received"
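As a quick check that a tab-separated file loads as expected, the sketch below writes a tiny stand-in for pseudo_facebook.tsv to a temporary file and reads it back. The file contents and the name toy are invented for illustration; read.delim() is base R's tab-separated counterpart of read.csv().

```r
# Hypothetical three-row stand-in for pseudo_facebook.tsv, written to a temp file.
tsv_path <- tempfile(fileext = ".tsv")
writeLines(c("userid\tage\tgender\tfriend_count",
             "1\t14\tmale\t0",
             "2\t25\tfemale\t120",
             "3\t31\tNA\t45"), tsv_path)

# read.delim() uses sep = "\t" by default, so no sep argument is needed.
toy <- read.delim(tsv_path)

dim(toy)  # number of rows and columns
str(toy)  # column names and types
```

Checking dim() and str() right after loading is a cheap way to catch a wrong separator, which typically shows up as a single wide column.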

Histogram of Users’ Birthdays

Notes:

#install.packages('ggplot2')
library(ggplot2)

ggplot(aes(x = dob_day), data = pf) + 
  geom_histogram(binwidth = 1) + 
  scale_x_continuous(breaks = 1:31)

# facet_wrap(~dob_month) can be added to split the plot by month.
# Use facet_wrap() for one variable and facet_grid() for two or more.

What are some things that you notice about this histogram?

Response: There is a high count on day 1, maybe because January 1 is a default birthday. An outlier can be bad data (an error in the data) or real data with an extreme value. Reason about it by checking whether the outlier is even feasible.

Moira’s Investigation

Notes: An outlier can be real data or bad data. Check whether the outlier value is even feasible, i.e. within the range you expect your data to fall in. Try to reason about why the outlier exists; it may be due to a default value, among other causes.
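One way to reason about such a spike is to compare its observed share with the share expected under a simple assumption. The sketch below uses invented data in which 100 of 1000 users kept an assumed January 1 default; the vectors and the even-spread baseline of 1/365 are illustrative assumptions, not properties of the real data set.

```r
set.seed(42)  # reproducible toy data

# Hypothetical sample: 900 users with a random birthday (days 1 to 28 only,
# to keep every date valid), plus 100 users left at an assumed Jan 1 default.
dob_month <- c(sample(1:12, 900, replace = TRUE), rep(1, 100))
dob_day   <- c(sample(1:28, 900, replace = TRUE), rep(1, 100))

# Observed share of January 1 birthdays versus the share expected
# if birthdays were spread evenly over the year.
jan1_share     <- mean(dob_month == 1 & dob_day == 1)
expected_share <- 1 / 365

jan1_share / expected_share  # a ratio far above 1 points to a default value
```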

Estimating Your Audience Size

Notes: This exercise tries to estimate the audience size of a Facebook post.

Think about a time when you posted a specific message or shared a photo on Facebook. What was it?

Response: 5

How many of your friends do you think saw that post?

Response: 100

Think about what percent of your friends on Facebook see any posts or comments that you make in a month. What percent do you think that is?

Response: 30%

Perceived Audience Size

Notes: People usually underestimate the size of their audience on Facebook.

Faceting

Notes:

Use facet_wrap(~dob_month) or facet_grid() to split the chart based on a variable

Splitting the plot based on any variable

ggplot(aes(x = dob_day), data = pf) + 
  geom_histogram(binwidth = 1) + 
  scale_x_continuous(breaks = 1:31)+
  facet_wrap(~dob_month)

# Use facet_wrap() for one variable and facet_grid() for two or more.
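Faceting draws the same histogram once per group. The counts behind each panel can be sketched in base R as a cross-tabulation; the data frame toy below is invented for illustration.

```r
# Hypothetical birthdays for six users (month and day of birth).
toy <- data.frame(
  dob_month = c(1, 1, 1, 2, 2, 3),
  dob_day   = c(1, 1, 15, 1, 20, 7)
)

# One row per month, one column per day: each row holds the counts
# that a single facet of the histogram would draw.
counts_by_month <- table(month = toy$dob_month, day = toy$dob_day)
counts_by_month
```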

Let’s take another look at our plot. What stands out to you here?

Response: January 1 has the most birthdays.

Be Skeptical - Outliers and Anomalies

Notes:

The spike might be an anomaly: January 1 may be the default value of the birthday field on Facebook.

Moira’s Outlier

Notes:

Which case do you think applies to Moira’s outlier?

Response:

Friend Count

Notes: Let’s try to analyze the friend count data.

What code would you enter to create a histogram of friend counts?

library(ggplot2)

ggplot(aes(x = friend_count), data = pf) + 
  geom_histogram(binwidth = 5) +  # binwidth belongs to the geom; ggplot() ignores it
  scale_x_continuous(limits = c(0, 1000), breaks = seq(0, 1000, 50)) # adjust axis

## Warning: Removed 2951 rows containing non-finite values (stat_bin).

How is this plot similar to Moira’s first plot?

Response:

Limiting the Axes

Notes:

Limiting the axis is needed to see the shape of the histogram properly. Exclude the range where the outliers lie from the x axis by setting the limits argument, as below.

library(ggplot2)

ggplot(aes(x = friend_count), data = pf) + 
  geom_histogram()+
  scale_x_continuous(limits = c(0, 1000)) # adjust axis

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## Warning: Removed 2951 rows containing non-finite values (stat_bin).
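The "Removed ... rows" warning counts the values that fall outside the axis limits (missing values are counted too). A rough base R sketch of that count, on an invented vector:

```r
# Hypothetical friend counts, including two values above the limit and one NA.
friend_count <- c(0, 5, 80, 999, 1000, 1500, 4923, NA)

# Values kept by limits = c(0, 1000); an NA is treated as not kept.
in_range <- !is.na(friend_count) & friend_count >= 0 & friend_count <= 1000

sum(in_range)   # values that would be binned
sum(!in_range)  # values that would be reported as removed
```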

Exploring with Bin Width

Notes:

The bin width can really change the look of your histogram and reveal new insights. Try different bin widths and see what you can learn from the slightly different shapes of the histogram.
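The effect of the bin width can be seen without drawing anything. The sketch below bins the same invented vector twice with base R's hist(); plot = FALSE returns the counts only.

```r
# Hypothetical friend counts for ten users.
friend_count <- c(1, 2, 2, 3, 8, 9, 15, 16, 16, 17)

# Same data, two bin widths.
wide   <- hist(friend_count, breaks = seq(0, 20, by = 10), plot = FALSE)
narrow <- hist(friend_count, breaks = seq(0, 20, by = 5),  plot = FALSE)

wide$counts    # two bins: the sparse middle of the range is hidden
narrow$counts  # four bins: the sparse middle of the range becomes visible
```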

Adjusting the Bin Width

Notes: Use the binwidth parameter of the qplot() function to change the bin width.

Faceting Friend Count

# What code would you add to create a facet the histogram by gender?
# Add it to the code below.
qplot(x = friend_count, data = pf, binwidth = 10) +
  scale_x_continuous(limits = c(0, 1000),
                     breaks = seq(0, 1000, 50))+
  facet_wrap(~gender)

## Warning: Removed 2951 rows containing non-finite values (stat_bin).

Omitting NA Values

Notes: Use subset() to exclude the rows whose gender is NA.

qplot(x = friend_count, data = subset(pf,!is.na(gender)), binwidth = 10) +
  scale_x_continuous(limits = c(0, 1000),breaks = seq(0, 1000, 50))+
  facet_wrap(~gender)

## Warning: Removed 2949 rows containing non-finite values (stat_bin).
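The subset() call used above can be tried on a small invented data frame to confirm which rows it keeps:

```r
# Hypothetical users; one has a missing gender.
toy <- data.frame(
  gender       = c("male", "female", NA, "female"),
  friend_count = c(10, 250, 40, 90)
)

# Keep only the rows where gender is known.
known_gender <- subset(toy, !is.na(gender))

nrow(toy)                   # rows before filtering
nrow(known_gender)          # rows after filtering
table(known_gender$gender)  # counts per remaining gender
```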

Statistics ‘by’ Gender

Notes:

table(pf$gender)

## 
## female   male 
##  40254  58574

by(pf$friend_count,pf$gender,summary)

## pf$gender: female
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0      37      96     242     244    4923 
## -------------------------------------------------------- 
## pf$gender: male
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0      27      74     165     182    4917

Who on average has more friends: men or women?

Response: Women

What’s the difference between the median friend count for women and men?

Response: 22
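The same kind of per-group comparison can be computed directly. The sketch below uses tapply(), a base R relative of by(), on invented data; the values are illustrative and are not taken from pf.

```r
# Hypothetical friend counts for three women and three men.
toy <- data.frame(
  gender       = c("female", "female", "female", "male", "male", "male"),
  friend_count = c(40, 100, 600, 30, 70, 200)
)

# Apply a statistic to friend_count within each gender.
medians <- tapply(toy$friend_count, toy$gender, median)
means   <- tapply(toy$friend_count, toy$gender, mean)

medians
means
medians["female"] - medians["male"]  # difference between the group medians
```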

Why would the median be a better measure than the mean?

Response: The median is more robust than the mean. In a skewed distribution the mean is pulled toward the long tail by a few extreme values, while the median is not. In a symmetric distribution such as the normal distribution, the mean equals the median.
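A small invented example shows how differently the two statistics react to one extreme value:

```r
# Hypothetical friend counts, then the same values plus one extreme user.
typical      <- c(10, 20, 30, 40, 50)
with_outlier <- c(typical, 5000)

mean(typical)         # 30
median(typical)       # 30
mean(with_outlier)    # pulled far upward by the single extreme value
median(with_outlier)  # 35, barely moved
```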

Tenure

Notes: Tenure is how long a user has been on Facebook. It is recorded in days and can be converted to years.

ggplot(aes(x = tenure), data = pf) + 
   geom_histogram(binwidth = 30, color = 'black', fill = '#F79420')

## Warning: Removed 2 rows containing non-finite values (stat_bin).

How would you create a histogram of tenure by year?

Notice use of /365 and scale_x_continuous()

ggplot(aes(x = tenure/365), data = pf) + 
   geom_histogram(binwidth = 0.25, color = 'black', fill = '#F79420')+
   scale_x_continuous(breaks= seq(1,7,1),limits =c(0,7))

## Warning: Removed 26 rows containing non-finite values (stat_bin).
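The conversion from days to years is plain division. As a sketch on invented values, the counts per whole year of tenure can be tabulated like this:

```r
# Hypothetical tenures in days.
tenure_days  <- c(0, 200, 365, 400, 800, 1100, 2555)
tenure_years <- tenure_days / 365

# Number of users per completed year of tenure.
year_counts <- table(floor(tenure_years))
year_counts
```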

Labeling Plots

Notes:

ggplot(aes(x = tenure/365), data = pf) + 
   geom_histogram(binwidth = 0.25, color = 'black', fill = '#F79420')+
   scale_x_continuous(breaks= seq(1,7,1),limits =c(0,7))+
   xlab('Number of years using Facebook') + 
    ylab('Number of users in sample')

## Warning: Removed 26 rows containing non-finite values (stat_bin).

User Ages

Notes: Make a sensible choice for (1) the bin width, (2) the breaks, and (3) the axis limits.

names(pf)

##  [1] "userid"                "age"                  
##  [3] "dob_day"               "dob_year"             
##  [5] "dob_month"             "gender"               
##  [7] "tenure"                "friend_count"         
##  [9] "friendships_initiated" "likes"                
## [11] "likes_received"        "mobile_likes"         
## [13] "mobile_likes_received" "www_likes"            
## [15] "www_likes_received"

ggplot(aes(x = age), data = pf) + 
   geom_histogram(binwidth = 1, color = 'black', fill = '#F79420')+
   scale_x_continuous(breaks = seq(0, 113, 5), limits = c(0, 150)) +  # a tick every 5 years keeps the axis readable
   xlab('age') + 
    ylab('Number of users in sample')

What do you notice?

Response: Most users are between 15 and 25 years old, and older people use Facebook less. Is the minimum age 5? The data may contain errors or outliers, though this is not very evident. Set the bin width to 1 to get more information.

The Spread of Memes

Notes: A meme is “an idea, behavior, or style that spreads from person to person within a culture”. A meme acts as a unit for carrying cultural ideas, symbols, or practices that can be transmitted from one mind to another through writing, speech, gestures, rituals, or other imitable phenomena with a mimicked theme. Supporters of the concept regard memes as cultural analogues to genes in that they self-replicate, mutate, and respond to selective pressures.

Memes usually spike at certain time intervals and then go through quiet periods of low activity.

Lada’s Money Bag Meme

Notes: Facebook meme (money bag)

August will have 5 Fridays, 5 Saturdays and 5 Sundays.

This happens only once every 823 years.

The Chinese call it “Silver pockets full.”

So: send this message to your friends and in four days the money will surprise you.

Based on Chinese Feng Shui. Whoever does not transmit the message … may find themselves clueless … This is not fun at all

Transforming Data

Notes:

An engagement variable might have a very long tail, i.e. the data is “over-dispersed”.

Transform such a variable by taking the log or the square root.

Alternatively, put the x or y axis on a log scale so that the wide swings in the data fit on the plot.
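The effect of both transformations, and the reason for adding 1 before taking the log, can be seen on a few invented values:

```r
# Hypothetical long-tailed friend counts, including a zero.
friend_count <- c(0, 9, 99, 999, 4923)

log10(friend_count)      # the zero becomes -Inf
log10(friend_count + 1)  # adding 1 maps the zero to 0 instead
sqrt(friend_count)       # a milder compression that handles zero as is
```

The log compresses the long tail more strongly than the square root, which is why the two transformed histograms look different.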

Add a Scaling Layer

Notes:

Notice that the friend count data is skewed. The most common model for such data is the lognormal distribution.

Compare the first and second plots below: the second is on a log scale and looks much more like a normal distribution.

The summaries below show the original friend count data and its log10 and sqrt transforms:

summary(pf$friend_count)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0    31.0    82.0   196.4   206.0  4923.0

summary(log10(pf$friend_count+1)) # +1 avoids log10(0) = -Inf

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   1.505   1.919   1.868   2.316   3.692

summary(sqrt(pf$friend_count))

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   5.568   9.055  11.090  14.350  70.160

#install.packages('gridExtra') 
library(gridExtra) 

#http://lightonphiri.org/blog/ggplot2-multiple-plots-in-one-graph-using-gridextra

# Create three histograms: friend_count, log10(friend_count + 1), and sqrt(friend_count)

p_fc = qplot(x = friend_count, data = pf, binwidth = 10) +
  scale_x_continuous(limits = c(0, 1000),breaks = seq(0, 1000, 50))
  
p_log_fc = qplot(x = log10(friend_count+1), data = pf, binwidth = 0.05) +
  scale_x_continuous(limits = c(0, 5),breaks = seq(0, 5, .5))

p_sqrt_fc = qplot(x = sqrt(friend_count), data = pf, binwidth = 1) +
  scale_x_continuous(limits = c(0, 100),breaks = seq(0, 100, 5))


grid.arrange(p_fc, p_log_fc, p_sqrt_fc, ncol=1)

## Warning: Removed 2951 rows containing non-finite values (stat_bin).

Add a Scaling Layer

Notes:

http://lightonphiri.org/blog/ggplot2-multiple-plots-in-one-graph-using-gridextra

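Create three histograms (friend count, log10 of friend count, and sqrt of friend count), this time by adding a scaling layer to the same base plot. If gridExtra is not installed, run install.packages('gridExtra') first.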

library(gridExtra) 
p_fc = ggplot(aes(x = friend_count), data = pf) + 
   geom_histogram(binwidth = 10, color = 'black', fill = '#F79420')+
   xlab('Friend count') + 
    ylab('Number of users in sample')+
    scale_x_continuous(limits = c(0, 1000), breaks = seq(0, 1000, 50)) # adjust axis

p_log_fc = p_fc + scale_x_log10()

## Scale for 'x' is already present. Adding another scale for 'x', which
## will replace the existing scale.

p_sqrt_fc = p_fc + scale_x_sqrt()

## Scale for 'x' is already present. Adding another scale for 'x', which
## will replace the existing scale.

grid.arrange(p_fc, p_log_fc, p_sqrt_fc, ncol=1)

## Warning: Removed 2951 rows containing non-finite values (stat_bin).

## Warning: Transformation introduced infinite values in continuous x-axis

## Warning: Removed 1962 rows containing non-finite values (stat_bin).

Frequency Polygons

ggplot(aes(x = friend_count, y = ..count../sum(..count..)), data = subset(pf, !is.na(gender))) + 
  geom_freqpoly(aes(color = gender), binwidth=10) + 
  scale_x_continuous(limits = c(0, 1000), breaks = seq(0, 1000, 50)) + 
  xlab('Friend Count') + 
  ylab('Proportion of users with that friend count')

## Warning: Removed 2949 rows containing non-finite values (stat_bin).

## Warning: Removed 4 rows containing missing values (geom_path).
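The y aesthetic above divides each bin count by the total count, which gives proportions rather than raw counts. The same arithmetic in base R, on an invented vector:

```r
# Hypothetical friend counts for eight users.
friend_count <- c(3, 7, 12, 18, 25, 25, 31, 44)

# Bin the values, then divide each bin count by the total.
bins        <- hist(friend_count, breaks = seq(0, 50, by = 10), plot = FALSE)
proportions <- bins$counts / sum(bins$counts)

proportions
sum(proportions)  # proportions over all bins add up to 1
```

Proportions make groups of different sizes comparable, which matters here because the sample has more men than women.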

Likes on the Web

Notes:

names(pf)

##  [1] "userid"                "age"                  
##  [3] "dob_day"               "dob_year"             
##  [5] "dob_month"             "gender"               
##  [7] "tenure"                "friend_count"         
##  [9] "friendships_initiated" "likes"                
## [11] "likes_received"        "mobile_likes"         
## [13] "mobile_likes_received" "www_likes"            
## [15] "www_likes_received"

#y = ..count../sum(..count..)),
#
library(ggplot2)
ggplot(aes(x = www_likes, y = ..count../sum(..count..)), data = subset(pf, !is.na(gender))) + 
  geom_freqpoly(aes(color = gender), binwidth = 0.1) +  # on a log10 axis the bin width is in log10 units; 0.1 is an illustrative choice
  xlab('www_likes') + 
  ylab('Proportion of users') +
  scale_x_log10()  # a separate scale_x_continuous() would only be replaced by this scale

## Warning: Transformation introduced infinite values in continuous x-axis

## Warning: Removed 60935 rows containing non-finite values (stat_bin).

summary(pf$www_likes)

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##     0.00     0.00     0.00    49.96     7.00 14860.00

by(pf$www_likes,pf$gender,sum)

## pf$gender: female
## [1] 3507665
## -------------------------------------------------------- 
## pf$gender: male
## [1] 1430175

Box Plots

https://en.wikipedia.org/wiki/File:Boxplot_vs_PDF.svg

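A box plot represents the major statistics of a data distribution, namely:

the 25th percentile (first quartile), the 50th percentile (median), and the 75th percentile (third quartile)

Notes: Using scale_y_continuous(limits = ...) removes the data outside the limits before the statistics are computed, so the quartiles will be incorrect. The correct way to zoom in is coord_cartesian(ylim = c(0, 100)), which keeps all of the data.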

ggplot(aes(factor(gender), friend_count), data = subset(pf, !is.na(gender))) + 
    geom_boxplot() +  # geom_boxplot() has no binwidth parameter
    coord_cartesian(ylim = c(0, 500))  # zooms in without dropping data
    # scale_y_continuous(limits = c(0, 500)) would drop data first and change the quartiles
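To see why trimming changes the quartiles, the sketch below compares statistics on an invented vector before and after dropping the values above 500. It uses base R's median() and quantile() rather than ggplot2, so it only illustrates the arithmetic, not the plotting code.

```r
# Hypothetical friend counts with one extreme value.
friend_count <- c(10, 20, 30, 40, 50, 60, 70, 80, 90, 1000)

# Statistics on all of the data: what a zoomed view still reflects.
median_all <- median(friend_count)
q75_all    <- quantile(friend_count, 0.75)

# Statistics after dropping values above 500: what axis limits
# that remove data would reflect.
trimmed        <- friend_count[friend_count <= 500]
median_trimmed <- median(trimmed)
q75_trimmed    <- quantile(trimmed, 0.75)

c(all = median_all, trimmed = median_trimmed)
c(all = unname(q75_all), trimmed = unname(q75_trimmed))
```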

Adjust the code to focus on users who have friend counts between 0 and 1000.

ggplot(aes(factor(gender), friend_count), data = subset(pf, !is.na(gender))) + 
    geom_boxplot() +  # geom_boxplot() has no binwidth parameter
    coord_cartesian(ylim = c(0, 1000))

Box Plots, Quartiles, and Friendships

Notes:

by(pf$friend_count,pf$gender,summary)

## pf$gender: female
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0      37      96     242     244    4923 
## -------------------------------------------------------- 
## pf$gender: male
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0      27      74     165     182    4917

On average, who initiated more friendships in our sample: men or women?

Response: Women

Write about some ways that you can verify your answer.

Response: The median and the mean are higher for women, and so is the 75th percentile. Use the by() command with summary to check.

ggplot(aes(factor(gender), friendships_initiated), data = subset(pf, !is.na(gender))) + 
    geom_boxplot()+
    coord_cartesian(ylim = c(0,500))

Response: Women initiate more friendship requests than men.

Getting Logical

Notes: A logical variable is useful for finding out whether a user uses a certain feature or not.

summary(pf$mobile_likes)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0     0.0     4.0   106.1    46.0 25110.0

summary(pf$mobile_likes >0)

##    Mode   FALSE    TRUE    NA's 
## logical   35056   63947       0

pf$mobile_check_in <- NA  # create the column on pf rather than a stray global variable
pf$mobile_check_in <- ifelse(pf$mobile_likes>0,1,0)
pf$mobile_check_in <- factor(pf$mobile_check_in)
summary(pf$mobile_check_in)

##     0     1 
## 35056 63947
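Because TRUE counts as 1 and FALSE as 0, the mean of a logical vector is the share of TRUE values. The sketch below shows this on an invented vector and then applies the same arithmetic to the two counts printed above.

```r
# Hypothetical mobile like counts for five users.
mobile_likes <- c(0, 3, 0, 12, 7)

# TRUE for users with at least one mobile like.
uses_mobile <- mobile_likes > 0
toy_share   <- mean(uses_mobile)  # share of TRUE values
toy_share

# The same share from the counts in the summary output above.
share_from_counts <- 63947 / (35056 + 63947)
round(100 * share_from_counts, 1)  # percent of users with mobile likes
```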

Response:

Analyzing One Variable

Reflection:

It is important to visualize each variable (column) of the data separately.
Always check for outliers, and try to determine whether each one is an error or a genuine extreme value.
Look for interesting patterns in the visualized data.
Adjust the axis range, bin width, and breaks of your plot so that important findings stand out.
You might also have to transform a variable by taking the log or sqrt of the data.

http://people.stern.nyu.edu/adamodar/New_Home_Page/StatFile/statdistns.htm

https://www.r-statistics.com/2013/05/log-transformations-for-skewed-and-wide-distributions-from-practical-data-science-with-r/

Click KnitHTML to see all of your hard work and to have an html page of this lesson, your answers, and your notes!
