Notes:
Load the Facebook data in R
Notes: This is not the actual FB data set but is very close to it. All the analysis is done on this data, which has ~99K rows and 15 attributes (columns).
getwd()
## [1] "E:/R/WorkSpace"
setwd("E:/R/WorkSpace")
list.files()
## [1] "Demystifying.R" "lesson3_student.html" "lesson3_student.rmd"
## [4] "MostUsedRCommand.R" "pseudo_facebook.tsv" "reddit.csv"
## [7] "rsconnect" "samplermd.html" "samplermd.rmd"
## [10] "stateData.csv"
pf <- read.csv("pseudo_facebook.tsv",sep = '\t')
names(pf) # prints columns
## [1] "userid" "age"
## [3] "dob_day" "dob_year"
## [5] "dob_month" "gender"
## [7] "tenure" "friend_count"
## [9] "friendships_initiated" "likes"
## [11] "likes_received" "mobile_likes"
## [13] "mobile_likes_received" "www_likes"
## [15] "www_likes_received"
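A minimal sketch of the loading step, using a tiny made-up tab-separated file instead of the real pseudo_facebook.tsv (the three rows and the temporary file are assumptions for illustration only). read.delim() behaves like read.csv() with sep = '\t' as its default, and dim() gives a quick check of the row and column counts.

```r
# Hypothetical 3-row sample standing in for pseudo_facebook.tsv
tmp <- tempfile(fileext = ".tsv")
writeLines(c("userid\tage\tgender",
             "1\t14\tmale",
             "2\t25\tfemale",
             "3\t31\tmale"), tmp)

# read.delim() defaults to tab-separated input with a header row
sample_pf <- read.delim(tmp)

dim(sample_pf)    # number of rows, then number of columns
names(sample_pf)  # column names
str(sample_pf)    # column types at a glance
```

On the real file, dim(pf) should report roughly 99K rows and 15 columns if the load worked.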
Notes:
#install.packages('ggplot2')
library(ggplot2)
ggplot(aes(x = dob_day), data = pf) +
geom_histogram(binwidth = 1) +
scale_x_continuous(breaks = 1:31)
#facet_wrap(~dob_month)
# use facet_grid() for multiple variables and facet_wrap() for one variable
Response: The count for Jan 1 is very high, perhaps because Jan 1 is a default value. An outlier can be bad data (an error in the data) or real data with an extreme value. Try to reason about it by checking whether the outlier is even feasible.
Notes: An outlier can be actual data or bad data. Check whether the outlier value is even feasible (within the range you expect your data to be in), and try to reason about why it exists. It may be due to a default value, for example.
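One way to reason about such an outlier is to count how often each date occurs and see how large the Jan 1 share is. The sketch below uses made-up birthdays (the `toy` data frame and its 20 extra Jan 1 rows are assumptions, not the lesson's data).

```r
# Made-up birthdays: 50 ordinary rows plus a block of Jan 1 entries
set.seed(1)
toy <- data.frame(
  dob_month = c(sample(1:12, 50, replace = TRUE), rep(1, 20)),
  dob_day   = c(sample(2:28, 50, replace = TRUE), rep(1, 20))
)

# Count users per (month, day) and list the most common dates first
counts <- as.data.frame(table(month = toy$dob_month, day = toy$dob_day))
counts <- counts[order(-counts$Freq), ]
head(counts, 3)

# Share of all users born on Jan 1; an implausibly large share
# hints at a default value rather than real birthdays
jan1_share <- mean(toy$dob_month == 1 & toy$dob_day == 1)
jan1_share
```

If birthdays were spread evenly, any single date would hold about 1/365 of the users, so a share far above that is worth questioning.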
Notes: This exercise tries to estimate the audience size of an FB post.
Response: 100
Response: 30%
Notes: People usually underestimate the size of their audience on FB.
***
### Faceting
Notes:
Use facet_wrap(~dob_month) or facet_grid() to split the chart based on a variable
Splitting the plot based on any variable
ggplot(aes(x = dob_day), data = pf) +
geom_histogram(binwidth = 1) +
scale_x_continuous(breaks = 1:31)+
facet_wrap(~dob_month)
# use facet_grid() for multiple variables and facet_wrap() for one variable
Response: Jan 1 has the most birthdays.
Notes:
It might be an outlier, since Jan 1 can be the default value of the FB birthday field.
Notes:
#### Which case do you think applies to Moira's outlier?
Response:
Notes: Let's analyze the friend count data.
library(ggplot2)
ggplot(aes(x = friend_count), data = pf) +
geom_histogram(binwidth = 5) + # binwidth belongs to the geom; ggplot() ignores it
scale_x_continuous(limits = c(0, 1000), breaks = seq(0, 1000, 50)) # adjust axis
## Warning: Removed 2951 rows containing non-finite values (stat_bin).
Response:
Notes:
Limiting the axis is needed to see the histogram properly. Exclude the range where the outliers lie from the x axis by setting limits in scale_x_continuous(), as below.
library(ggplot2)
ggplot(aes(x = friend_count), data = pf) +
geom_histogram()+
scale_x_continuous(limits = c(0, 1000)) # adjust axis
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 2951 rows containing non-finite values (stat_bin).
Notes:
Binwidth can really change the look of your histogram and reveal new insights. Try different binwidths and see what you can learn from the slightly different histogram shapes.
***
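A small sketch of how the bin width controls the resolution of a histogram, using base R's hist() on made-up right-skewed data (the exponential sample and the helper `n_bins` are assumptions for illustration).

```r
set.seed(42)
x <- rexp(1000, rate = 1 / 100)  # made-up right-skewed data
x <- x[x <= 1000]                # keep values inside the plotted range

# Count the bins produced by a given bin width over the range 0..1000
n_bins <- function(width) {
  h <- hist(x, breaks = seq(0, 1000, by = width), plot = FALSE)
  length(h$counts)
}

sapply(c(100, 25, 5), n_bins)
```

Wide bins smooth over detail, while narrow bins expose spikes (and noise), which is why it pays to try several widths.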
Notes: Use binwidth parameter in qplot() function to change the bin width.
# What code would you add to create a facet the histogram by gender?
# Add it to the code below.
qplot(x = friend_count, data = pf, binwidth = 10) +
scale_x_continuous(limits = c(0, 1000),
breaks = seq(0, 1000, 50))+
facet_wrap(~gender)
## Warning: Removed 2951 rows containing non-finite values (stat_bin).
Notes: Use subset() to exclude the rows where gender is NA.
qplot(x = friend_count, data = subset(pf,!is.na(gender)), binwidth = 10) +
scale_x_continuous(limits = c(0, 1000),breaks = seq(0, 1000, 50))+
facet_wrap(~gender)
## Warning: Removed 2949 rows containing non-finite values (stat_bin).
Notes:
table(pf$gender)
##
## female male
## 40254 58574
by(pf$friend_count,pf$gender,summary)
## pf$gender: female
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 37 96 242 244 4923
## --------------------------------------------------------
## pf$gender: male
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 27 74 165 182 4917
Response: Women
Response: 22
Response: The median is more robust than the mean. In a skewed distribution the median differs from the mean and is less affected by extreme values. In a normal distribution the mean equals the median.
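A tiny illustration of that robustness, using made-up friend counts (the values below are assumptions, not taken from the data set): adding one extreme user moves the mean a lot but the median only slightly.

```r
friends      <- c(10, 20, 30, 40, 50)  # made-up friend counts
with_outlier <- c(friends, 5000)       # add one extreme user

mean(friends)         # symmetric data: mean and median agree
median(friends)

mean(with_outlier)    # pulled far upward by the single outlier
median(with_outlier)  # barely moves
```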
Notes: How many days/years a user has been using FB.
ggplot(aes(x = tenure), data = pf) +
geom_histogram(binwidth = 30, color = 'black', fill = '#F79420')
## Warning: Removed 2 rows containing non-finite values (stat_bin).
Note the use of tenure/365 to convert days to years, and of scale_x_continuous() to set the breaks and limits.
ggplot(aes(x = tenure/365), data = pf) +
geom_histogram(binwidth = 0.25, color = 'black', fill = '#F79420')+
scale_x_continuous(breaks= seq(1,7,1),limits =c(0,7))
## Warning: Removed 26 rows containing non-finite values (stat_bin).
Notes:
ggplot(aes(x = tenure/365), data = pf) +
geom_histogram(binwidth = 0.25, color = 'black', fill = '#F79420')+
scale_x_continuous(breaks= seq(1,7,1),limits =c(0,7))+
xlab('Number of years using Facebook') +
ylab('Number of users in sample')
## Warning: Removed 26 rows containing non-finite values (stat_bin).
Notes: Make sensible choices for (1) binwidth, (2) breaks, and (3) axis limits.
names(pf)
## [1] "userid" "age"
## [3] "dob_day" "dob_year"
## [5] "dob_month" "gender"
## [7] "tenure" "friend_count"
## [9] "friendships_initiated" "likes"
## [11] "likes_received" "mobile_likes"
## [13] "mobile_likes_received" "www_likes"
## [15] "www_likes_received"
ggplot(aes(x = age), data = pf) +
geom_histogram(binwidth = 1, color = 'black', fill = '#F79420')+
scale_x_continuous(breaks= seq(0,115,5),limits =c(0,150))+ # a break every 5 years keeps the axis labels readable
xlab('age') +
ylab('Number of users in sample')
Response: Most users are aged 15-25, and older people use FB less. Is the minimum age 5? The data may contain errors or outliers, though this is not very evident. Set the bin width to 1 to get more information.
Notes: A meme is “an idea, behavior, or style that spreads from person to person within a culture”. A meme acts as a unit for carrying cultural ideas, symbols, or practices that can be transmitted from one mind to another through writing, speech, gestures, rituals, or other imitable phenomena with a mimicked theme. Supporters of the concept regard memes as cultural analogues to genes in that they self-replicate, mutate, and respond to selective pressures.
Memes usually spike at certain time intervals, with quiet or low-activity periods in between.
Notes: Facebook meme (money bag)
August will have 5 Fridays, 5 Saturdays and 5 Sundays.
This happens only once every 823 years.
The Chinese call it "Silver pockets full."
So: send this message to your friends and in four days the money will surprise you.
Based on Chinese Feng Shui. Whoever does not transmit the message … may find themselves clueless … This is not fun at all
Notes:
An engagement variable might have a very long tail, i.e. the data is “over-dispersed”.
Transform such a variable by taking its log or square root.
Alternatively, put the x or y axis on a log scale so that wide swings in the data fit on the plot.
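A rough sketch of why the log transform helps, using simulated long-tailed data (the lognormal sample and its parameters are assumptions, not the lesson's data). A mean well above the median signals a long right tail; after log10 the two nearly coincide.

```r
set.seed(7)
engagement <- rlnorm(10000, meanlog = 4, sdlog = 1)  # made-up long-tailed data

# Mean far above median: the tail is pulling the mean upward
c(mean = mean(engagement), median = median(engagement))

# After log10 the distribution is roughly symmetric
log_eng <- log10(engagement)
c(mean = mean(log_eng), median = median(log_eng))
```

For count data that contains zeros, add 1 before taking the log so that log10(0) = -Inf does not appear.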
Notes:
Notice that the friend count data is skewed. A common model for such data is the lognormal distribution.
Compare the first and second plots below: the second is log-based and looks closer to a normal distribution.
Summaries of the original friend count data and its transforms:
summary(pf$friend_count)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 31.0 82.0 196.4 206.0 4923.0
summary(log10(pf$friend_count+1)) # +1 avoids log10(0) = -Inf
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 1.505 1.919 1.868 2.316 3.692
summary(sqrt(pf$friend_count))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 5.568 9.055 11.090 14.350 70.160
#install.packages('gridExtra')
library(gridExtra)
#http://lightonphiri.org/blog/ggplot2-multiple-plots-in-one-graph-using-gridextra
# create three histograms: friend_count, log10(friend_count + 1), sqrt(friend_count)
p_fc = qplot(x = friend_count, data = pf, binwidth = 10) +
scale_x_continuous(limits = c(0, 1000),breaks = seq(0, 1000, 50))
p_log_fc = qplot(x = log10(friend_count+1), data = pf, binwidth = 0.05) +
scale_x_continuous(limits = c(0, 5),breaks = seq(0, 5, .5))
p_sqrt_fc = qplot(x = sqrt(friend_count), data = pf, binwidth = 1) +
scale_x_continuous(limits = c(0, 100),breaks = seq(0, 100, 5))
grid.arrange(p_fc, p_log_fc, p_sqrt_fc, ncol=1)
## Warning: Removed 2951 rows containing non-finite values (stat_bin).
Notes:
http://lightonphiri.org/blog/ggplot2-multiple-plots-in-one-graph-using-gridextra
Create three histograms: friend count, log10(friend count), and sqrt(friend count). If the package is missing, install it first with install.packages('gridExtra').
library(gridExtra)
p_fc = ggplot(aes(x = friend_count), data = pf) +
geom_histogram(binwidth = 10, color = 'black', fill = '#F79420')+
xlab('Friend count') +
ylab('Number of users in sample')+
scale_x_continuous(limits = c(0, 1000), breaks = seq(0, 1000, 50)) # adjust axis
p_log_fc = p_fc + scale_x_log10()
## Scale for 'x' is already present. Adding another scale for 'x', which
## will replace the existing scale.
p_sqrt_fc = p_fc + scale_x_sqrt()
## Scale for 'x' is already present. Adding another scale for 'x', which
## will replace the existing scale.
grid.arrange(p_fc, p_log_fc, p_sqrt_fc, ncol=1)
## Warning: Removed 2951 rows containing non-finite values (stat_bin).
## Warning: Transformation introduced infinite values in continuous x-axis
## Warning: Removed 1962 rows containing non-finite values (stat_bin).
ggplot(aes(x = friend_count, y = ..count../sum(..count..)), data = subset(pf, !is.na(gender))) +
geom_freqpoly(aes(color = gender), binwidth=10) +
scale_x_continuous(limits = c(0, 1000), breaks = seq(0, 1000, 50)) +
xlab('Friend Count') +
ylab('Proportion of users with that friend count')
## Warning: Removed 2949 rows containing non-finite values (stat_bin).
## Warning: Removed 4 rows containing missing values (geom_path).
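The y expression above divides each bin's count by the total count, turning counts into proportions so that groups of different sizes can be compared. A base R sketch of the same idea on made-up data, assuming the sum is taken within each gender group (the `toy` data frame is hypothetical):

```r
# Made-up friend counts for two groups of different sizes
toy <- data.frame(
  gender = c(rep("female", 4), rep("male", 8)),
  friend_count = c(5, 15, 15, 25, 5, 5, 5, 15, 15, 25, 25, 25)
)

# Bin into widths of 10, then convert counts to within-group proportions
toy$bin <- cut(toy$friend_count, breaks = seq(0, 30, by = 10))
counts  <- table(toy$gender, toy$bin)
props   <- prop.table(counts, margin = 1)  # each row sums to 1

counts
props
```

Raw counts would make the larger group look dominant in every bin; proportions put both groups on the same footing.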
Notes:
names(pf)
## [1] "userid" "age"
## [3] "dob_day" "dob_year"
## [5] "dob_month" "gender"
## [7] "tenure" "friend_count"
## [9] "friendships_initiated" "likes"
## [11] "likes_received" "mobile_likes"
## [13] "mobile_likes_received" "www_likes"
## [15] "www_likes_received"
#y = ..count../sum(..count..)),
#
library(ggplot2)
ggplot(aes(x = www_likes, y = ..count../sum(..count..)), data = subset(pf, !is.na(gender))) +
geom_freqpoly(aes(color = gender), binwidth = 0.1) + # binwidth is in log10 units on a log scale
xlab('www_likes') +
ylab('Proportion of users') +
scale_x_log10() # the only x scale, so no linear scale gets replaced
## Warning: Transformation introduced infinite values in continuous x-axis
## Warning: Removed 60935 rows containing non-finite values (stat_bin).
summary(pf$www_likes)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 0.00 0.00 49.96 7.00 14860.00
by(pf$www_likes,pf$gender,sum)
## pf$gender: female
## [1] 3507665
## --------------------------------------------------------
## pf$gender: male
## [1] 1430175
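A box plot represents the major statistics of a data distribution, namely
the 25th percentile (first quartile), the 50th percentile (median), and the 75th percentile (third quartile).
Notes: Also note that limiting the axis with scale_y_continuous() trims the data before the statistics are computed, so the quartiles will be incorrect. The correct way is to zoom with coord_cartesian(ylim = c(0,100)), which leaves the data untouched.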
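A base R sketch of why trimming changes the quartiles, using made-up friend counts (the values and the 500 cutoff are assumptions). Zooming corresponds to computing the quartile on all the data; limiting the scale corresponds to filtering first.

```r
friend_count <- c(10, 50, 100, 200, 400, 800, 1600, 3200)  # made-up values

# Zooming (what coord_cartesian() does) leaves the data untouched
q_full <- quantile(friend_count, 0.75)
q_full

# Limiting the scale drops out-of-range rows first, like this filter
trimmed   <- friend_count[friend_count <= 500]
q_trimmed <- quantile(trimmed, 0.75)
q_trimmed
```

The third quartile of the trimmed data is much lower, so a box plot drawn from it would misrepresent the distribution.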
ggplot(aes(factor(gender), friend_count), data = subset(pf, !is.na(gender))) +
geom_boxplot() + # geom_boxplot() has no binwidth parameter
coord_cartesian(ylim = c(0,500))
#scale_y_continuous(limits = c(0, 500))
ggplot(aes(factor(gender), friend_count), data = subset(pf, !is.na(gender))) +
geom_boxplot() + # geom_boxplot() has no binwidth parameter
coord_cartesian(ylim = c(0,1000))
Notes:
by(pf$friend_count,pf$gender,summary)
## pf$gender: female
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 37 96 242 244 4923
## --------------------------------------------------------
## pf$gender: male
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 27 74 165 182 4917
Response: Women
#### Write about some ways that you can verify your answer.
Response: Both the median and the mean are higher for women, and so is the 75th percentile. Use the by() command with summary to check.
ggplot(aes(factor(gender), friendships_initiated), data = subset(pf, !is.na(gender))) +
geom_boxplot()+
coord_cartesian(ylim = c(0,500))
Response: Women initiate more friendship requests than men.
Notes: Usage: try to find out whether a user is using a certain feature or not.
summary(pf$mobile_likes)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 0.0 4.0 106.1 46.0 25110.0
summary(pf$mobile_likes >0)
## Mode FALSE TRUE NA's
## logical 35056 63947 0
pf$mobile_check_in <- NA # create the column on pf, not a stray global variable
pf$mobile_check_in <- ifelse(pf$mobile_likes>0,1,0)
pf$mobile_check_in <- factor(pf$mobile_check_in)
summary(pf$mobile_check_in)
## 0 1
## 35056 63947
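The two counts above can be turned into the share of users who have used the mobile feature. This sketch copies the counts from the summary() output; the variable names are only for illustration.

```r
# Counts copied from the summary() output above
no_mobile  <- 35056
has_mobile <- 63947

pct_mobile <- 100 * has_mobile / (no_mobile + has_mobile)
round(pct_mobile, 1)

# On the data frame itself, the mean of a logical vector gives the same share:
# 100 * mean(pf$mobile_likes > 0)
```

So roughly 65% of users in the sample have at least one mobile like.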
Response:
Reflection:
It is important to visualize the data for each parameter (column) separately.
Always check for outliers, and try to find out whether each one is an error or a genuine extreme value.
Look for interesting patterns in the visualized data.
In your plots, adjust the axis range, bin width, and breaks so that the important findings stand out.
You might also have to transform a variable by taking the log or sqrt of the data.
http://people.stern.nyu.edu/adamodar/New_Home_Page/StatFile/statdistns.htm