Notes: Check the working directory and check the list of files
getwd()
setwd("~/R Datasources")
list.files()
Notes: 1. Load the Facebook data with the read.csv command and set the separator to tab, i.e. sep = '\t'. 2. Use the names command to view the variables.
pf <- read.csv('../R Datasources/pseudo_facebook.tsv',sep = '\t')
names(pf)
## [1] "userid" "age"
## [3] "dob_day" "dob_year"
## [5] "dob_month" "gender"
## [7] "tenure" "friend_count"
## [9] "friendships_initiated" "likes"
## [11] "likes_received" "mobile_likes"
## [13] "mobile_likes_received" "www_likes"
## [15] "www_likes_received"
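A minimal, self-contained sketch of the tab-separated read above. It writes a tiny made-up file to a temporary location (the rows and the name toy_path are illustrative, not taken from pseudo_facebook.tsv) and reads it back with the same read.csv call shape. read.delim is the base R shortcut that uses tab as its default separator.

```r
# Toy stand-in for pseudo_facebook.tsv: the rows below are invented for illustration.
toy_path <- tempfile(fileext = ".tsv")
writeLines(c("userid\tage\tgender",
             "1\t14\tmale",
             "2\t31\tfemale",
             "3\t27\tNA"),
           toy_path)

# Same call shape as in the notes: read.csv with a tab separator.
toy <- read.csv(toy_path, sep = "\t")

# read.delim() defaults to sep = "\t", so it reads the same table.
toy2 <- read.delim(toy_path)

names(toy)
identical(toy, toy2)
```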
###install.packages('ggplot2')
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.2.1
qplot(x= dob_day, data = pf)
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
###dob_day is numeric, so use scale_x_continuous to adjust the x axis
qplot(x = dob_day, data = pf) +
scale_x_continuous(breaks = 1:31)
### + facet_wrap(~dob_month, ncol = 3)
Response: 1. Most users appear to be born on the 1st day of the month. 2. Day 31 has the fewest users because not all months have 31 days.
Notes: Moira observed that most people's perception of their audience size is quite different from reality.
Response:40%
Notes: Most people usually estimate their audience size to be a quarter of the actual size.
Notes:
1. The facet_wrap formula takes a ~ followed by a variable. It is usually used when faceting over one variable, and it creates the same type of plot for each level of the categorical variable.
2. facet_grid is used to facet over 2 or more variables: facet_grid(vertical ~ horizontal).
3. Learn more about faceting here: http://www.cookbook-r.com/Graphs/Facets_(ggplot2)/
4. Equivalent ggplot syntax: ggplot(data = pf, aes(x = dob_day)) + geom_histogram() + scale_x_continuous(breaks = 1:31) + facet_wrap(~dob_month)
qplot(x = dob_day, data = pf) +
scale_x_continuous(breaks = 1:31) +
facet_wrap(~dob_month, ncol = 3)
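To contrast the two faceting functions described in the notes, here is a small sketch on invented data (the columns day, month, and gender are illustrative stand-ins, not the pf columns). It assumes a reasonably recent ggplot2 and uses ggplot() rather than qplot().

```r
library(ggplot2)

# Invented toy data: 'day', 'month', and 'gender' are illustrative columns.
set.seed(1)
toy <- data.frame(
  day    = sample(1:31, 300, replace = TRUE),
  month  = sample(1:3, 300, replace = TRUE),
  gender = sample(c("female", "male"), 300, replace = TRUE)
)

# One variable: one panel per month, wrapped into rows of three.
p_wrap <- ggplot(toy, aes(x = day)) +
  geom_histogram(binwidth = 1) +
  facet_wrap(~month, ncol = 3)

# Two variables: rows are gender, columns are month (rows ~ columns).
p_grid <- ggplot(toy, aes(x = day)) +
  geom_histogram(binwidth = 1) +
  facet_grid(gender ~ month)

print(p_wrap)
print(p_grid)
```

facet_wrap lays out a single sequence of panels, while facet_grid builds a full row-by-column matrix, so it fits best when both variables have only a few levels.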
Response: Most Facebook users appear to be born on the first day of the first month.
Notes: 1. We need to detect and deal with outliers in the dataset. 2. Outliers could be accurate data or they could be examples of bad data.
Notes: 1. She first adjusts the axes and cuts out the outlier.
#### Which case do you think applies to Moira's outlier?
Response: 1. Here the outlier was bad data: an extreme value.
qplot(x = friend_count, data = pf)
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
##qplot(x=pf$friend_count,data=pf)
Response: The distribution is skewed toward the low end of the x axis, with a long tail of users who have very high friend counts.
Notes: 1. Use xlim to limit the x-axis. It takes a vector with a start and end position. 2. You can also add a layer using scale_x_continuous. 3. Learn more about scales here: http://docs.ggplot2.org/current/scale_continuous.html
qplot( x = friend_count, data = pf, xlim = c(0,1000))
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
##Use a layer instead of xlim
qplot( x = friend_count, data = pf) +
scale_x_continuous(limits = c(0,1000))
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
# qplot(x = friend_count,data = subset(pf, !is.na(gender)), binwidth = 25) +
# scale_x_continuous(limits = c(0,1000), breaks=seq(0,1000,50)) +
# facet_wrap(~gender)
Notes: Adjust the bin width and use breaks to adjust the x axis scale markers
qplot( x = friend_count, data = pf, binwidth = 25) +
scale_x_continuous(limits = c(0,1000), breaks = seq(0,1000,50))
### Faceting Friend Count
Notes: facet_wrap by gender. This produces 3 histograms: male, female, and NA.
# What code would you add to create a facet the histogram by gender?
# Add it to the code below.
qplot(x = friend_count, data = pf, binwidth = 10) +
scale_x_continuous(limits = c(0, 1000),
breaks = seq(0, 1000, 50)) +
facet_wrap(~gender)
Notes: You can use the subset function to filter a dataset and remove NA values. You can also use the na.omit function, but with caution, as it removes every observation that has an NA in any variable.
#omits all observations where gender has missing values
qplot(x = friend_count,data = subset(pf, !is.na(gender)), binwidth = 25) +
scale_x_continuous(limits = c(0,1000), breaks=seq(0,1000,50)) +
facet_wrap(~gender)
# omits all observations that have a missing value in any variable
qplot(x = friend_count,data = na.omit(pf) , binwidth = 25) +
scale_x_continuous(limits = c(0,1000), breaks=seq(0,1000,50)) +
facet_wrap(~gender)
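The difference between the two filters is easy to see on a toy data frame. This is a sketch with invented values (the column names mirror pf, but the rows are made up).

```r
# Invented toy data: gender is missing in one row, tenure in another.
toy <- data.frame(
  gender       = c("female", "male", NA, "male"),
  tenure       = c(100, NA, 250, 30),
  friend_count = c(10, 20, 30, 40)
)

# Drops only the row where gender is NA.
by_gender <- subset(toy, !is.na(gender))

# Drops every row that has an NA in any column.
complete <- na.omit(toy)

nrow(by_gender)  # 3 rows remain
nrow(complete)   # 2 rows remain
```

The row with a missing tenure still has a usable gender and friend_count, which is why subset on the one variable of interest is usually the safer choice for a single plot.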
Notes: Use the table command to count the users of each gender, then the by command with summary to compare friend counts and answer who has more friends, male or female.
table(pf$gender)
##
## female male
## 40254 58574
#friend count is the variable and gender is the categorical variable
by(pf$friend_count,pf$gender,summary)
## pf$gender: female
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 37 96 242 244 4923
## --------------------------------------------------------
## pf$gender: male
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 27 74 165 182 4917
Response: women
Response: 22 (median friend count of 96 for women versus 74 for men)
Response: There are a few users with high friend counts that pull the mean toward the high end. The median is resistant to these extreme values since it marks the halfway point of all data points. So long as we trust half of our values, we can report a reliable location for the center of the distribution.
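A quick numerical illustration of that point, using invented friend counts (not values from pf): adding one extreme user moves the mean a long way but barely moves the median.

```r
# Invented friend counts; the extra value plays the role of an extreme user.
friends <- c(20, 35, 60, 80, 110)
with_outlier <- c(friends, 4900)

mean(friends)         # 61
median(friends)       # 60
mean(with_outlier)    # 867.5
median(with_outlier)  # 70
```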
Notes: Use tenure to examine how many days a person has been using Facebook.
The parameter color determines the color outline of objects in a plot.
The parameter fill determines the color of the area inside objects in a plot.
You might notice how the color black and the hex code color of #099DD9 (a shade of blue) are wrapped inside of I(). The I() function stands for 'as is' and tells qplot to use them as colors. Learn more about what you can adjust in a plot by reading the ggplot theme documentation: http://docs.ggplot2.org/0.9.2.1/theme.html
qplot(x = tenure, data = pf, binwidth = 30,
color = I('black'), fill = I('#099DD9'))
## plot the tenure in years rather than in days
qplot(x = tenure / 365, data = pf, binwidth = .25,
color = I('black'),fill = I('#099DD9'))
# change the x-axis to increment by one year
qplot(x = tenure / 365, data = pf, binwidth = .25,
color = I('black'),fill = I('#099DD9')) +
scale_x_continuous(breaks=seq(0,7,1), limits = c(0,7))
Notes: Plots need to speak for themselves, so the labels need to be changed to make sense.
qplot(x = tenure / 365, data = pf, binwidth = .25,
xlab = 'Number of years using Facebook',
ylab = 'Number of users in sample',
color = I('black'),fill = I('#099DD9')) +
scale_x_continuous(breaks=seq(0,7,1), limits = c(0,7))
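For comparison, here is a sketch of the same labeled histogram written with ggplot() instead of qplot(). It runs on invented tenure values (the toy data frame is an assumption for illustration), and it sets the colors as constants outside aes(), where I() is not needed.

```r
library(ggplot2)

# Invented tenure values in days; the notes themselves use pf$tenure.
set.seed(42)
toy <- data.frame(tenure = round(runif(500, min = 0, max = 7 * 365)))

p <- ggplot(toy, aes(x = tenure / 365)) +
  # Constants set outside aes() are used as-is, so no I() wrapper is required.
  geom_histogram(binwidth = 0.25, color = "black", fill = "#099DD9") +
  scale_x_continuous(breaks = seq(0, 7, 1), limits = c(0, 7)) +
  labs(x = "Number of years using Facebook",
       y = "Number of users in sample")

print(p)
```

Because limits is set on the scale, ggplot2 may warn about bins that fall outside 0 to 7 being removed; that is the same behavior the qplot version has.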
***
qplot(x = age , data = subset(pf,!is.na(gender)),binwidth = 1,
xlab = 'Facebook Users by Age',
ylab = 'Number of users in sample',
color = I('black'), fill = I('#099DD9')) +
scale_x_continuous(breaks = seq(10,120,10), limits = c(10,120)) +
facet_wrap(~gender)
by(pf$age,pf$gender,summary)
## pf$gender: female
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 13.00 21.00 31.00 39.46 54.00 113.00
## --------------------------------------------------------
## pf$gender: male
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 13.00 20.00 27.00 35.67 45.00 113.00
Response: The average age of men on Facebook is lower than that of women (median 27 versus 31).
Notes: sometimes data can be “over-dispersed” especially with long tailed data
qplot(x = friend_count, data = pf)
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
summary(pf$friend_count)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 31.0 82.0 196.4 206.0 4923.0
#transform using log10. The result shows -Inf because some users have zero friends
summary(log10(pf$friend_count))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -Inf 1 2 -Inf 2 4
#add 1 to friend count to avoid the -Inf
summary(log10(pf$friend_count +1))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 1.505 1.919 1.868 2.316 3.692
#transform using square root
summary(sqrt(pf$friend_count))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 5.568 9.055 11.090 14.350 70.160
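The effect of each transformation is easier to follow on a handful of values. This sketch uses invented friend counts chosen so the results are round numbers.

```r
# Invented friend counts, including a user with zero friends.
friend_count <- c(0, 9, 99, 999)

log10(friend_count)      # first value is -Inf because log10(0) is -Inf
log10(friend_count + 1)  # 0 1 2 3
sqrt(friend_count)       # defined at zero without any shift
```

Adding 1 before taking the log keeps zero-friend users in the summary, at the cost of a small shift that matters only for very low counts.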
Notes: 1. To show several plots in one figure you need the gridExtra package. 2. The scale_x_log10 layer labels the axis in actual friend counts, whereas wrapping the variable in log10 labels it in log units.
##install.packages('gridExtra')
library(gridExtra)
## Warning: package 'gridExtra' was built under R version 3.2.1
#create each plot and assign to a variable
p1 <- qplot(x = friend_count, data = pf)
p2 <- qplot(x = friend_count, data = pf) +
scale_x_log10()
p3 <- qplot(x = friend_count, data = pf) +
scale_x_sqrt()
#use grid.arrange to plot
grid.arrange(p1,p2,p3, ncol = 1)
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## Transforming data alternate solution
## Use scales
#aes is the "aesthetic wrapper" and geom tells ggplot what type of plot we need
p1 <- ggplot(aes(x = friend_count), data = pf) + geom_histogram()
p2 <- p1 + scale_x_log10()
p3 <- p1 + scale_x_sqrt()
#use grid.arrange to plot
grid.arrange(p1,p2,p3, ncol = 1)
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
#examine the difference between adding a scaling layer and using a wrapper
logScale <- qplot(x = log10(friend_count), data = pf)
countScale <- ggplot(aes(x = friend_count), data = pf) +
geom_histogram() +
scale_x_log10()
grid.arrange(logScale,countScale, ncol = 2)
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
Notes: Frequency polygons are used to compare two or more distributions. They are similar to histograms, but they draw a curve connecting the counts in a histogram.
#plot a histogram of friend count by gender
qplot(x = friend_count , data = subset(pf,!is.na(gender)),binwidth = 10,
xlab = 'Friend Count',
ylab = 'Number of users in sample'
) +
scale_x_continuous(breaks = seq(0,1000,50), limits = c(10,1000)) +
facet_wrap(~gender)
#plot a frequency polygon of friend count by gender
qplot(x = friend_count , y = ..count../sum(..count..),
data = subset(pf,!is.na(gender)),binwidth = 10,
geom = 'freqpoly', color = gender,
xlab = 'Friend Count',
ylab = 'Proportion of Users with that friend count'
) +
scale_x_continuous(breaks = seq(0,1000,50), limits = c(10,1000))
## Warning: Removed 2 rows containing missing values (geom_path).
## Warning: Removed 2 rows containing missing values (geom_path).
#plot a frequency polygon of www likes by gender
#observation: Seems like men have more likes than women
qplot(x = www_likes,
data = subset(pf,!is.na(gender)),
geom = 'freqpoly', color = gender,
xlab = 'WWW Likes',
ylab = 'Number of users with that www like count'
) +
scale_x_log10()
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
#use a numerical summary to see who has more likes
by(pf$www_likes,pf$gender,sum)
## pf$gender: female
## [1] 3507665
## --------------------------------------------------------
## pf$gender: male
## [1] 1430175
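The same kind of grouped summary can be sketched with tapply, which returns a plain named vector. The data below is invented (only the column names follow pf), and the median is added as a cross-check because a total depends on how many users are in each group.

```r
# Invented like counts; the notes themselves use pf$www_likes and pf$gender.
toy <- data.frame(
  gender    = c("female", "female", "male", "male", NA),
  www_likes = c(10, 30, 5, 15, 100)
)

# Same idea as by(toy$www_likes, toy$gender, sum), returned as a named vector.
# Rows with a missing gender are left out of the grouping.
totals <- tapply(toy$www_likes, toy$gender, sum)
totals

# Totals depend on group size, so a per-user median is a useful cross-check.
medians <- tapply(toy$www_likes, toy$gender, median)
medians
```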
Notes: How to read and use a Boxplot http://flowingdata.com/2008/02/15/how-to-read-and-use-a-box-and-whisker-plot/
The interquartile range or IQR includes all of the values between the bottom and top of the boxes in the boxplot. https://en.wikipedia.org/wiki/Interquartile_range
Intro to Descriptive Statistics Exercise: Match Box Plots https://www.udacity.com/course/viewer#!/c-ud827/l-1471748603/e-83417918/m-83664035
Outliers are usually considered to be values more than 1.5 times the IQR beyond the edges of the box (the first and third quartiles).
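A minimal sketch of the common 1.5 times IQR rule, measured here from the first and third quartiles (the edges of the box), on invented friend counts. Boxplot functions compute the box edges as hinges, which can differ slightly from quantile() on small samples, so treat this as an approximation of what a plot would flag.

```r
# Invented friend counts with one extreme value.
x <- c(5, 20, 30, 45, 60, 80, 95, 110, 130, 900)

q <- quantile(x, c(0.25, 0.75))
iqr <- IQR(x)

# Fences: 1.5 * IQR beyond the first and third quartiles.
lower <- q[[1]] - 1.5 * iqr
upper <- q[[2]] + 1.5 * iqr

outliers <- x[x < lower | x > upper]
outliers
```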
#plot a histogram of friend count by gender
qplot(x = friend_count , data = subset(pf,!is.na(gender)),binwidth = 10,
xlab = 'Friend Count',
ylab = 'Number of users in sample'
) +
scale_x_continuous(breaks = seq(0,1000,50), limits = c(10,1000)) +
facet_wrap(~gender)
#plot a boxplot of friend count by gender
qplot(x = gender, y = friend_count,
data = subset(pf, !is.na(gender)),
geom = 'boxplot')
#adjust the y scale to focus on 0 to 1000. ylim removes values/observations
qplot(x = gender, y = friend_count,
data = subset(pf, !is.na(gender)),
geom = 'boxplot' , ylim = c(0, 1000))
## Warning: Removed 2949 rows containing non-finite values (stat_boxplot).
#scale_y_continuous with limits also removes values
qplot(x = gender, y = friend_count,
data = subset(pf, !is.na(gender)),
geom = 'boxplot') +
scale_y_continuous(limits = c(0,1000))
## Warning: Removed 2949 rows containing non-finite values (stat_boxplot).
#use the coord_cartesian layer to avoid removing values
qplot(x = gender, y = friend_count,
data = subset(pf, !is.na(gender)),
geom = 'boxplot') +
coord_cartesian(ylim = c(0,1000))
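Why the distinction matters for a boxplot can be shown without any plotting. In this sketch with invented friend counts, dropping values above the limit (which is effectively what ylim and scale limits do before the statistics are computed) changes the median, while zooming leaves the data and therefore the median untouched.

```r
# Invented friend counts; two users exceed the 1000 cut-off.
friend_count <- c(10, 50, 200, 800, 1500, 4000)

# Effect of ylim or scale_y_continuous(limits = ...): rows are dropped first.
kept <- friend_count[friend_count <= 1000]
median(kept)          # 125

# Effect of coord_cartesian(): all rows are kept, only the view is zoomed.
median(friend_count)  # 500
```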
Notes: 1. How to interpret a Box Plot http://flowingdata.com/2008/02/15/how-to-read-and-use-a-box-and-whisker-plot/ 2. The interquartile range or IQR includes all of the values between the bottom and top of the boxes in the boxplot.
qplot(x = gender, y = friend_count,
data = subset(pf, !is.na(gender)),
geom = 'boxplot') +
coord_cartesian(ylim = c(0,250))
by(pf$friend_count,pf$gender,summary)
## pf$gender: female
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 37 96 242 244 4923
## --------------------------------------------------------
## pf$gender: male
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 27 74 165 182 4917
Response: women
#### Write about some ways that you can verify your answer.
Response: The median for women is greater than the median for men, and this is backed up by the numerical summary.
qplot(x = gender, y = friendships_initiated,
data = subset(pf, !is.na(gender)),
geom = 'boxplot') +
coord_cartesian(ylim = c(0,500))
qplot(x = gender, y = friendships_initiated,
data = subset(pf, !is.na(gender)),
geom = 'boxplot') +
coord_cartesian(ylim = c(0,150))
by(pf$friendships_initiated,pf$gender,summary)
## pf$gender: female
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 19.0 49.0 113.9 124.8 3654.0
## --------------------------------------------------------
## pf$gender: male
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 15.0 44.0 103.1 111.0 4144.0
Response:
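Notes: What is a factor variable? It is R's type for a categorical variable with a fixed set of levels. Here it is derived from another column, so it plays the role of a computed column or measure.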
#check the distribution of mobile likes using a numerical summary
summary(pf$mobile_likes)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 0.0 4.0 106.1 46.0 25110.0
#many of the values are zero. Count how many users have more than zero mobile likes
summary(pf$mobile_likes > 0)
## Mode FALSE TRUE NA's
## logical 35056 63947 0
### Create a factor variable to measure whether a user has checked in on mobile, rather than a count of the
### number of times a user has checked in on mobile
### create a variable in the dataframe pf and assign NA values to it
pf$mobile_check_in <- NA
### use a logical operator to assign 1 or 0 if a user has checked in
pf$mobile_check_in <- ifelse(pf$mobile_likes > 0,1,0)
###convert to a factor variable
pf$mobile_check_in <- factor(pf$mobile_check_in)
summary(pf$mobile_check_in)
## 0 1
## 35056 63947
###What percent of users check in using mobile?
sum(pf$mobile_check_in == 1) / length(pf$mobile_check_in)
## [1] 0.6459097
Response: 65%
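The steps above can be condensed on a small example. This sketch uses invented mobile like counts and shows that the mean of a logical vector gives the same share as dividing the count of 1s by the length.

```r
# Invented mobile like counts.
mobile_likes <- c(0, 3, 0, 12, 7)

# 1/0 flag turned into a two-level factor, as in the notes.
mobile_check_in <- factor(ifelse(mobile_likes > 0, 1, 0))
summary(mobile_check_in)

# The mean of a logical vector is the share of TRUE values.
share <- mean(mobile_likes > 0)
share  # 0.6
```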