Lesson 3


What to Do First?

Notes: Download pseudo_facebook.tsv


Pseudo-Facebook User Data

Notes:

getwd()
## [1] "D:/Documents/5. Dev/1. udacity/1. data_analysis_w_R/EDA_Course_Materials/lesson3"
list.files()
## [1] "lesson3_student.rmd"   "lesson3_student_files" "problem_set3.Rmd"     
## [4] "pseudo_facebook.tsv"
pf <- read.csv('pseudo_facebook.tsv',sep = '\t')
names(pf)
##  [1] "userid"                "age"                  
##  [3] "dob_day"               "dob_year"             
##  [5] "dob_month"             "gender"               
##  [7] "tenure"                "friend_count"         
##  [9] "friendships_initiated" "likes"                
## [11] "likes_received"        "mobile_likes"         
## [13] "mobile_likes_received" "www_likes"            
## [15] "www_likes_received"

Histogram of Users’ Birthdays

Notes:

install.packages('ggplot2', repos ="http://cran.us.r-project.org")
## Installing package into 'C:/Users/hahnsang/Documents/R/win-library/3.2'
## (as 'lib' is unspecified)
## package 'ggplot2' successfully unpacked and MD5 sums checked
## 
## The downloaded binary packages are in
##  C:\Users\hahnsang\AppData\Local\Temp\1\RtmpQf6870\downloaded_packages
library(ggplot2)

names(pf)
##  [1] "userid"                "age"                  
##  [3] "dob_day"               "dob_year"             
##  [5] "dob_month"             "gender"               
##  [7] "tenure"                "friend_count"         
##  [9] "friendships_initiated" "likes"                
## [11] "likes_received"        "mobile_likes"         
## [13] "mobile_likes_received" "www_likes"            
## [15] "www_likes_received"
qplot(x = dob_day, data = pf) + 
  scale_x_continuous(breaks=1:31)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.


What are some things that you notice about this histogram?

Response: The 1st of the month has a very high peak, while the counts for the other days are roughly flat.


Moira’s Investigation

Notes:


Estimating Your Audience Size

Notes:


Think about a time when you posted a specific message or shared a photo on Facebook. What was it?

Response:

How many of your friends do you think saw that post?

Response:

Think about what percent of your friends on Facebook see any posts or comments that you make in a month. What percent do you think that is?

Response:


Perceived Audience Size

Notes:


Faceting

Notes: facet_wrap(formula), e.g. facet_wrap(~variable); facet_grid(formula), e.g. facet_grid(vertical ~ horizontal)

qplot(x = dob_day, data = pf) + 
  scale_x_continuous(breaks=1:31) +
  facet_wrap(~dob_month, ncol = 3)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Let’s take another look at our plot. What stands out to you here?

Response: January stands out: the spike on the 1st is concentrated in January.


Be Skeptical - Outliers and Anomalies

Notes: An outlier may be accurate data about a very extreme case.


Moira’s Outlier

Notes:

Which case do you think applies to Moira’s outlier?

Response: bad data about an extreme case


Friend Count

Notes:

What code would you enter to create a histogram of friend counts?

qplot(data =  pf, x = friend_count)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

How is this plot similar to Moira’s first plot?

Response: Quite similar; both plots show long-tailed data.


Limiting the Axes

Notes: For long-tailed data, limit the x-axis to focus on the bulk of the distribution.

qplot(data =  pf, x = friend_count, xlim=c(0, 1000))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 2951 rows containing non-finite values (stat_bin).

qplot(data =  pf, x = friend_count) +
  scale_x_continuous(limits= c(0, 1000))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 2951 rows containing non-finite values (stat_bin).

Exploring with Bin Width

Notes:


Adjusting the Bin Width

Notes:

Faceting Friend Count

# What code would you add to facet the histogram by gender?
# Add it to the code below.
qplot(x = friend_count, data = pf, binwidth = 25) +
  scale_x_continuous(limits = c(0, 1000),
                     breaks = seq(0, 1000, 50)) +
  facet_wrap(~gender)
## Warning: Removed 2951 rows containing non-finite values (stat_bin).


Omitting NA Values

Notes:

qplot(x = friend_count, data = subset(pf,!is.na(gender)), binwidth = 25) +
  scale_x_continuous(limits = c(0, 1000), breaks = seq(0, 1000, 50)) +
  facet_wrap(~gender)
## Warning: Removed 2949 rows containing non-finite values (stat_bin).

qplot(x = friend_count, data = na.omit(pf), binwidth = 25) +
  scale_x_continuous(limits = c(0, 1000), breaks = seq(0, 1000, 50)) +
  facet_wrap(~gender)
## Warning: Removed 2949 rows containing non-finite values (stat_bin).
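
The two calls above are not equivalent in general: subset(pf, !is.na(gender)) drops only the rows where gender is missing, while na.omit(pf) drops every row that has an NA in any column. A minimal sketch on a made-up data frame (the name toy and its values are hypothetical, not taken from pseudo_facebook.tsv):

```r
# Hypothetical toy data: one row is missing gender, another is missing tenure.
toy <- data.frame(
  gender = c("female", "male", NA, "male"),
  tenure = c(10, NA, 30, 40),
  friend_count = c(5, 7, 9, 11)
)

# Keeps every row with a known gender, even if other columns contain NA.
by_gender <- subset(toy, !is.na(gender))

# Drops any row that has an NA in any column.
complete_rows <- na.omit(toy)

nrow(by_gender)      # 3
nrow(complete_rows)  # 2
```

When only gender matters for the plot, the subset() form is usually the safer choice because it does not silently discard users with missing values elsewhere.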


Statistics ‘by’ Gender

Notes: The faceted histograms still don’t show clearly which gender has more friends on average, so use table() and by() to compare the groups numerically.

table(pf$gender)
## 
## female   male 
##  40254  58574
by(pf$friend_count, pf$gender, summary)
## pf$gender: female
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0      37      96     242     244    4923 
## -------------------------------------------------------- 
## pf$gender: male
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0      27      74     165     182    4917

Who on average has more friends: men or women?

Response: Women

What’s the difference between the median friend count for women and men?

Response: 22

Why would the median be a better measure than the mean?

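Response: The median is more robust: a few users with very large friend counts pull the mean up but barely move the median.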
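
A small illustration of that robustness, using made-up friend counts (the vector friend_counts is hypothetical, not drawn from the dataset):

```r
# Hypothetical friend counts; the last value is an extreme outlier.
friend_counts <- c(10, 20, 30, 40, 5000)

mean(friend_counts)    # 1020
median(friend_counts)  # 30

# Dropping the outlier changes the mean a lot but the median only slightly.
trimmed <- friend_counts[friend_counts < 1000]
mean(trimmed)    # 25
median(trimmed)  # 25
```

One extreme user moves the mean from 25 to 1020, while the median moves only from 25 to 30.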


Tenure

Notes:

qplot(x = tenure, data = pf, binwidth=30, color = I('black'), fill = I('#099DD9'))
## Warning: Removed 2 rows containing non-finite values (stat_bin).


How would you create a histogram of tenure by year?

qplot(x = tenure/365, data = pf, binwidth=.25, color = I('black'), fill = I('#F79420')) +
  scale_x_continuous(breaks = seq(1, 7, 1), limits = c(0, 7))
## Warning: Removed 26 rows containing non-finite values (stat_bin).


Labeling Plots

Notes: The x-axis and y-axis labels are generated automatically unless you specify them with xlab and ylab.

qplot(x = tenure/365, data = pf, 
      xlab = 'Number of years using Facebook',
      ylab = 'Number of users in sample',
      color = I('black'), fill = I('#F79420')) +
  scale_x_continuous(breaks = seq(1, 7, 1), limits = c(0, 7))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 26 rows containing non-finite values (stat_bin).


User Ages

Notes:

qplot(x = age, data = pf, binwidth = 1, 
      xlab = 'Ages', ylab = 'Number of users in sample',
      color = I('black'), fill = I('#5760AB')) +
  scale_x_continuous(breaks = seq(0, 113, 5))

What do you notice?

Response: Ages start at 13, and there is an unexpected peak around age 100.


The Spread of Memes

Notes:


Lada’s Money Bag Meme

Notes:


Transforming Data

Notes: Log Transformations of Data http://www.r-statistics.com/2013/05/log-transformations-for-skewed-and-wide-distributions-from-practical-data-science-with-r/

qplot(x = friend_count, data = pf)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

summary(pf$friend_count)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0    31.0    82.0   196.4   206.0  4923.0
summary(log10(pf$friend_count))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    -Inf       1       2    -Inf       2       4
summary(log10(pf$friend_count+1))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   1.505   1.919   1.868   2.316   3.692
summary(sqrt(pf$friend_count))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   5.568   9.055  11.090  14.350  70.160
install.packages('gridExtra', repos ="http://cran.us.r-project.org")
## Installing package into 'C:/Users/hahnsang/Documents/R/win-library/3.2'
## (as 'lib' is unspecified)
## package 'gridExtra' successfully unpacked and MD5 sums checked
## 
## The downloaded binary packages are in
##  C:\Users\hahnsang\AppData\Local\Temp\1\RtmpQf6870\downloaded_packages
library(gridExtra)

p1 <- qplot(x = friend_count, data = pf)
p2 <- qplot(x = log10(friend_count + 1), data = pf)
p3 <- qplot(x = sqrt(friend_count), data = pf)

grid.arrange(p1, p2, p3, ncol=1)  
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

p1 <- ggplot(aes(x = friend_count), data = pf) + geom_histogram()
p2 <- p1 + scale_x_log10()
p3 <- p1 + scale_x_sqrt()
grid.arrange(p1, p2, p3, ncol=1) 
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 1962 rows containing non-finite values (stat_bin).
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.


Add a Scaling Layer

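Notes: The two plots differ only in the x-axis labels: the first is labeled in log10 units, while the scaling layer keeps the actual friend counts on the axis.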

logScale <- qplot(x = log10(friend_count), data = pf)

countScale <- ggplot(aes(x = friend_count), data = pf) + geom_histogram() + scale_x_log10()

grid.arrange(logScale, countScale, ncol =2)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 1962 rows containing non-finite values (stat_bin).
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 1962 rows containing non-finite values (stat_bin).
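
The -Inf values in summary(log10(pf$friend_count)) and the "Removed 1962 rows" warnings most likely share one cause: log10(0) is -Inf, so users with zero friends cannot be placed on a log scale. A minimal sketch with made-up counts (friend_counts is a hypothetical vector) showing why adding 1 before taking the log avoids the problem:

```r
# Hypothetical friend counts, including one user with zero friends.
friend_counts <- c(0, 9, 99, 999)

# The zero becomes -Inf; the other values stay finite.
log10(friend_counts)

# Shifting by 1 maps 0 to log10(1) = 0, so every value is finite.
log10(friend_counts + 1)

# A single -Inf is enough to make the mean -Inf.
mean(log10(friend_counts))      # -Inf
mean(log10(friend_counts + 1))  # 1.5

# Number of values that a log10 scale cannot display.
sum(!is.finite(log10(friend_counts)))  # 1
```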


Frequency Polygons

Question: Which plot shows more clearly who has more friends on average, men or women?

qplot(x = friend_count, data = subset(pf,!is.na(gender)), binwidth = 10) +
  scale_x_continuous(limits = c(0, 1000), breaks = seq(0, 1000, 50)) +
  facet_wrap(~gender)
## Warning: Removed 2949 rows containing non-finite values (stat_bin).

qplot(x = friend_count, data = subset(pf,!is.na(gender)), 
      binwidth = 10, geom='freqpoly', color = gender) +
  scale_x_continuous(limits = c(0, 1000), breaks = seq(0, 1000, 50)) 
## Warning: Removed 2949 rows containing non-finite values (stat_bin).
## Warning: Removed 4 rows containing missing values (geom_path).

qplot(x = friend_count, y= ..count../sum(..count..), data = subset(pf,!is.na(gender)), 
      xlab = 'Friend Count', ylab = 'Proportion of Users with that friend count',
      binwidth = 10, geom='freqpoly', color = gender) +
  scale_x_continuous(limits = c(0, 1000), breaks = seq(0, 1000, 50)) 
## Warning: Removed 2949 rows containing non-finite values (stat_bin).

## Warning: Removed 4 rows containing missing values (geom_path).


Likes on the Web

Notes: Use a frequency polygon to determine which gender makes more likes on the World Wide Web. What’s the www_likes count for males? Which gender has more www_likes?

qplot(x = www_likes, data = subset(pf, !is.na(gender)), 
      geom ='freqpoly', color = gender) +
  scale_x_log10()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 60935 rows containing non-finite values (stat_bin).

ggplot(aes(x = www_likes), data = subset(pf, !is.na(gender))) + 
  geom_freqpoly(aes(color = gender)) + 
  scale_x_log10()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 60935 rows containing non-finite values (stat_bin).

by(pf$www_likes, pf$gender, sum)
## pf$gender: female
## [1] 3507665
## -------------------------------------------------------- 
## pf$gender: male
## [1] 1430175

Box Plots

Notes: Use box plots to check for outliers.
How to read a box plot: http://flowingdata.com/2008/02/15/how-to-read-and-use-a-box-and-whisker-plot/
Interquartile range (IQR): https://en.wikipedia.org/wiki/Interquartile_range
Visualization: https://en.wikipedia.org/wiki/File:Boxplot_vs_PDF.svg

qplot(x = gender, y= friend_count,
      data = subset(pf,!is.na(gender)),
      geom = 'boxplot')

Adjust the code to focus on users who have friend counts between 0 and 1000.

qplot(x = gender, y= friend_count,
      data = subset(pf,!is.na(gender)),
      geom = 'boxplot') + 
  scale_y_continuous(limits = c(0, 1000)) 
## Warning: Removed 2949 rows containing non-finite values (stat_boxplot).
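
The "Removed 2949 rows" warning matters here: scale_y_continuous(limits = ...) drops the out-of-range observations before the box plot statistics are computed, so the boxes describe the truncated data, whereas coord_cartesian(ylim = ...) only zooms the view. A base-R sketch of the effect, using made-up friend counts (friend_counts and kept are hypothetical names):

```r
# Hypothetical friend counts with two values above 1000.
friend_counts <- c(10, 50, 100, 200, 400, 800, 1500, 3000)

# Quartiles of all the data, which a zoomed view would still be based on.
full_quartiles <- quantile(friend_counts, c(0.25, 0.5, 0.75))

# Quartiles after dropping values outside [0, 1000], as an axis limit would.
kept <- friend_counts[friend_counts >= 0 & friend_counts <= 1000]
limited_quartiles <- quantile(kept, c(0.25, 0.5, 0.75))

median(friend_counts)  # 300
median(kept)           # 150
```

Dropping the two largest values halves the median in this toy case, which is why zooming is the better choice when the box should reflect every user.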


Box Plots, Quartiles, and Friendships

Notes:

qplot(x = gender, y= friend_count,
      data = subset(pf,!is.na(gender)),
      geom = 'boxplot') + 
  coord_cartesian(ylim = c(0, 250)) 

by(pf$friend_count, pf$gender, summary)
## pf$gender: female
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0      37      96     242     244    4923 
## -------------------------------------------------------- 
## pf$gender: male
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0      27      74     165     182    4917

On average, who initiated more friendships in our sample: men or women?

Response: Women

Write about some ways that you can verify your answer.

Response:

qplot(x = gender, y= friendships_initiated,
      data = subset(pf,!is.na(gender)),
      geom = 'boxplot') + 
  coord_cartesian(ylim = c(0, 500)) 

qplot(x = gender, y= friendships_initiated,
      data = subset(pf,!is.na(gender)),
      geom = 'boxplot') + 
  coord_cartesian(ylim = c(0, 150)) 

by(pf$friendships_initiated, pf$gender, summary)
## pf$gender: female
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0    19.0    49.0   113.9   124.8  3654.0 
## -------------------------------------------------------- 
## pf$gender: male
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0    15.0    44.0   103.1   111.0  4144.0

Response: Box plots help us understand the distribution of the data: we can see the middle 50% of values for each level of our categorical variable. The plots also give us a sense of the outliers. In that way they carry much more information than the summary table alone.


Getting Logical

Notes: When a variable has many zero values, convert it to a logical (binary) variable that records whether the value is greater than zero.

summary(pf$mobile_likes)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0     0.0     4.0   106.1    46.0 25110.0
summary(pf$mobile_likes > 0)
##    Mode   FALSE    TRUE    NA's 
## logical   35056   63947       0
pf$mobile_check_in <- NA
pf$mobile_check_in <- ifelse(pf$mobile_likes > 0, 1, 0)
pf$mobile_check_in <- factor(pf$mobile_check_in)
summary(pf$mobile_check_in)
##     0     1 
## 35056 63947
sum(pf$mobile_check_in == 1)/length(pf$mobile_check_in)
## [1] 0.6459097

What percent of users check in using mobile?

Response: 65%
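
The same proportion can be computed directly from a logical vector, because R treats TRUE as 1 and FALSE as 0. A minimal sketch with made-up like counts (mobile_likes here is a hypothetical vector, not the column from pf):

```r
# Hypothetical mobile like counts; many users have zero.
mobile_likes <- c(0, 0, 3, 12, 0, 7, 1, 0)

# TRUE/FALSE flag for "has at least one mobile like".
used_mobile <- mobile_likes > 0

# TRUE counts as 1 and FALSE as 0, so the mean is the share of TRUE values.
mean(used_mobile)  # 0.5

# Same result with the sum/length form used in the lesson.
sum(used_mobile) / length(used_mobile)  # 0.5
```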


Analyzing One Variable

Reflection: I learned how to visualize a single variable with histograms, frequency polygons with scaling layers, and box plots. Box plots are useful for identifying outliers. I also learned how to use logical operations to handle variables with many zeros.
Concept map: https://wiki.uiowa.edu/download/attachments/42009071/Concept_Map.gif?version=1&modificationDate=1287007903090&api=v2

