Lesson 4

RESOURCES:

ggplot tutorial: http://www.ceb-institute.org/bbs/wp-content/uploads/2011/09/handout_ggplot2.pdf
ggplot help: http://docs.ggplot2.org/current/
Transformations using the cartesian coordinate system http://docs.ggplot2.org/current/coord_trans.html
Jitter http://docs.ggplot2.org/current/position_jitter.html
Intro to dplyr: http://rstudio-pubs-static.s3.amazonaws.com/11068_8bc42d6df61341b2bed45e9a9a3bf9f4.html
Intro to dplyr by Hadley Wickham: http://www.r-bloggers.com/hadley-wickham-presents-dplyr-at-user-2014/
Hadley Wickham's dplyr tutorial, part 1: http://www.r-bloggers.com/hadley-wickhams-dplyr-tutorial-at-user-2014-part-1/
Hadley Wickham's dplyr tutorial, part 2: http://www.r-bloggers.com/hadley-wickhams-dplyr-tutorial-at-user-2014-part-2/
Data Visualisation Gurus:
John Tukey: https://en.wikipedia.org/wiki/John_Tukey
William Playfair: https://en.wikipedia.org/wiki/William_Playfair
William Playfair and the Psychology of Graphs: http://www.psych.utoronto.ca/users/spence/Spence%20(2006).pdf
Introduction to bivariate analysis: http://dept.stat.lsa.umich.edu/~kshedden/Courses/Stat401/Notes/401-bivariate-slides.pdf ***

Scatterplots and Perceived Audience Size

Notes: Most Facebook users guess a perceived audience size that is much smaller than their actual audience size, and most guesses are multiples of 50 or 100. ***

Scatterplots

Notes:

#setup
library(ggplot2)

## Warning: package 'ggplot2' was built under R version 3.2.1

getwd()

## [1] "C:/Users/amackay/Documents/R Scripts"

setwd("~/R Datasources")
list.files()

## [1] "pseudo_facebook.tsv" "reddit.csv"          "stateData.csv"

pf <- read.csv('pseudo_facebook.tsv', sep = '\t')
names(pf)

##  [1] "userid"                "age"                  
##  [3] "dob_day"               "dob_year"             
##  [5] "dob_month"             "gender"               
##  [7] "tenure"                "friend_count"         
##  [9] "friendships_initiated" "likes"                
## [11] "likes_received"        "mobile_likes"         
## [13] "mobile_likes_received" "www_likes"            
## [15] "www_likes_received"
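
A side note on the read step above: read.csv with sep = '\t' and read.delim should parse a tab-separated file the same way, since they differ only in the default separator. A minimal sketch using a small temporary file (the rows are made up for illustration and are not taken from pseudo_facebook.tsv):

```r
# Write a tiny tab-separated file to a temporary location (illustrative data only)
tsv_path <- tempfile(fileext = ".tsv")
writeLines(c("userid\tage\tfriend_count",
             "1\t14\t0",
             "2\t21\t130"), tsv_path)

# read.csv with an explicit tab separator ...
via_csv <- read.csv(tsv_path, sep = "\t")

# ... and read.delim, whose default separator is already a tab
via_delim <- read.delim(tsv_path)

identical(via_csv, via_delim)
names(via_csv)
```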

#plot a scatter of age and friend count. qplot will automatically select a scatterplot when given two continuous variables
qplot(x = age, y = friend_count, data = pf)

#alternative syntax without naming x and y
qplot(age, friend_count, data = pf)

What are some things that you notice right away?

Response: 1. Most users under 30 have high friend counts. 2. Some users over the age of 60 have unusually high friend counts. ***

ggplot Syntax

Notes: 1. The ggplot syntax can be used to create more complex plots. 2. The main difference from qplot is that you must specify which geom to plot.

#qplot syntax - qplot(x = age, y = friend_count, data = pf)
#ggplot uses the aesthetic wrapper (aes) for the x and y variables
ggplot(aes(x = age, y = friend_count), data = pf) + geom_point()

#clip the age range but first check the range
summary(pf$age)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   13.00   20.00   28.00   37.28   50.00  113.00

ggplot(aes(x = age, y = friend_count), data = pf) + geom_point() + xlim(13,90)

## Warning: Removed 4906 rows containing missing values (geom_point).

Overplotting

Notes: 1. An area of the plot with a high density of points is considered overplotted. 2. Overplotting makes it difficult to tell how many points are in each region. 3. Transparency can be set using the alpha parameter.

# alpha = 1/20 makes each point 5% opaque, so it takes roughly 20 overlapping points to look as dark as one solid point
ggplot(aes(x = age, y = friend_count), data = pf) + 
  geom_point(alpha = 1/20)  + 
  xlim(13,90)

## Warning: Removed 4906 rows containing missing values (geom_point).
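
How quickly transparent points build up can be sketched with a little arithmetic. The function below is a hypothetical helper, and it assumes the graphics device uses the standard "over" compositing rule, which may not hold exactly for every device:

```r
# Opacity of n points stacked at the same spot, each drawn with transparency alpha.
# Assumes the standard "over" compositing rule: every new layer covers a
# fraction alpha of whatever is still showing through.
stacked_opacity <- function(alpha, n) {
  1 - (1 - alpha)^n
}

n_points <- c(1, 5, 20, 60)
data.frame(n_points = n_points,
           opacity = round(stacked_opacity(1/20, n_points), 3))
```

Under that assumption 20 stacked points come out at roughly 64% opacity rather than fully solid, so "20 points equal one point" is best read as a rule of thumb.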

#switch to geom_jitter to add random noise to the point positions (age is stored in whole years) and get a more dispersed distribution
ggplot(aes(x = age, y = friend_count), data = pf) + 
  geom_jitter(alpha = 1/20)  + 
  xlim(13,90)

## Warning: Removed 5180 rows containing missing values (geom_point).

What do you notice in the plot?

Response: The jitter plot gives a truer picture of friend count by age, i.e. most young users have lower friend counts. ***

Coord_trans()

Notes: Some friend counts are zero, so adding vertical jitter can push a point below zero, and the square root of a negative number is not a real number (R returns NaN). Adding position = position_jitter(h = 0) sets the height (vertical) jitter to zero, so only age is jittered.

#transform the y axis using a sqrt function
ggplot(aes(x = age, y = friend_count), data = pf) + 
  geom_point(alpha = 1/20)  + 
  xlim(13,90) +
  coord_trans(y = 'sqrt')

## Warning: Removed 4906 rows containing missing values (geom_point).

#transform the y axis using a sqrt function and plot a jitter
ggplot(aes(x = age, y = friend_count), data = pf) + 
  geom_point(alpha = 1/20, position = position_jitter(h = 0))  + 
  xlim(13,90) +
  coord_trans(y = 'sqrt')

## Warning: Removed 5215 rows containing missing values (geom_point).
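
The reason for h = 0 can be checked in base R without any plotting. This is only a sketch: the friend counts and the fixed noise vector are made up, standing in for the random offsets that jittering would add:

```r
# Friend counts with some zeros (illustrative values)
friend_count <- c(0, 0, 4, 25, 100)

# Hypothetical vertical jitter: a small fixed offset per point
vertical_noise <- c(-0.3, 0.2, -0.1, 0.4, -0.2)
jittered <- friend_count + vertical_noise

# sqrt() of a negative value is NaN in R (with a warning), so that point cannot be drawn
sqrt_with_jitter <- suppressWarnings(sqrt(jittered))

# With the height jitter set to zero, y is untouched and the square root is always defined
sqrt_without_jitter <- sqrt(friend_count)

sqrt_with_jitter
sqrt_without_jitter
```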

What do you notice?

Response: Above the threshold of about 1000 friends, the number of users drops off. ***

Alpha and Jitter

Notes: Examine the relationship between Age and friendships_initiated

names(pf)

##  [1] "userid"                "age"                  
##  [3] "dob_day"               "dob_year"             
##  [5] "dob_month"             "gender"               
##  [7] "tenure"                "friend_count"         
##  [9] "friendships_initiated" "likes"                
## [11] "likes_received"        "mobile_likes"         
## [13] "mobile_likes_received" "www_likes"            
## [15] "www_likes_received"

#plot scatter
ggplot(aes(x = age, y = friendships_initiated), data = pf) +
  geom_point()

#limit x axis
ggplot(aes(x = age, y = friendships_initiated), data = pf) +
  geom_point() +
  xlim(13,90)

## Warning: Removed 4906 rows containing missing values (geom_point).

#plot a jitter
ggplot(aes(x = age, y = friendships_initiated), data = pf) +
  geom_jitter() +
  xlim(13,90)

## Warning: Removed 5212 rows containing missing values (geom_point).

#use the alpha parameter to reduce overplotting
ggplot(aes(x = age, y = friendships_initiated), data = pf) +
  geom_jitter(alpha = 1/20) +
  xlim(13,90)

## Warning: Removed 5186 rows containing missing values (geom_point).

#transform y axis using coord_trans
ggplot(aes(x = age, y = friendships_initiated), data = pf) +
  geom_jitter(alpha = 1/20, position = position_jitter(h = 0)) +
  xlim(13,90) +
  coord_trans(y = 'sqrt')

## Warning: Removed 5191 rows containing missing values (geom_point).

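#restrict the dataset to non-NA values. The warning persists because it comes from xlim() dropping points outside 13-90 (jitter also moves a few points across the limits), not from NA values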
ggplot(aes(x = age, y = friendships_initiated), data = subset(pf, !is.na(friendships_initiated))) +
  geom_jitter(alpha = 1/20, position = position_jitter(h = 0)) +
  xlim(13,90) +
  coord_trans(y = 'sqrt')

## Warning: Removed 5181 rows containing missing values (geom_point).

Overplotting and Domain Knowledge

Notes:

Conditional Means

Notes: 1. Summarise the dataset by mean and median using the group_by and summarise functions. 2. n = n() gives the count in each group; n() only works inside dplyr verbs such as summarise, mutate and filter.

#install.packages('dplyr')
library(dplyr)

## Warning: package 'dplyr' was built under R version 3.2.1

## 
## Attaching package: 'dplyr'
## 
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

#create grouping by age
age_groups <- group_by(pf, age)

#add the summaries
pf.fc_by_age <- summarise(age_groups,
          friend_count_mean = mean(friend_count),
          friend_count_median = median(friend_count),
          n = n())

#sort by age ascending
pf.fc_by_age <- arrange(pf.fc_by_age, age)

#use head() to check the first few rows. The row count is optional (default 6)
head(pf.fc_by_age, 20)

## Source: local data frame [20 x 4]
## 
##    age friend_count_mean friend_count_median    n
## 1   13          164.7500                74.0  484
## 2   14          251.3901               132.0 1925
## 3   15          347.6921               161.0 2618
## 4   16          351.9371               171.5 3086
## 5   17          350.3006               156.0 3283
## 6   18          331.1663               162.0 5196
## 7   19          333.6921               157.0 4391
## 8   20          283.4991               135.0 3769
## 9   21          235.9412               121.0 3671
## 10  22          211.3948               106.0 3032
## 11  23          202.8426                93.0 4404
## 12  24          185.7121                92.0 2827
## 13  25          131.0211                62.0 3641
## 14  26          144.0082                75.0 2815
## 15  27          134.1473                72.0 2240
## 16  28          125.8354                66.0 2364
## 17  29          120.8182                66.0 1936
## 18  30          115.2080                67.5 1716
## 19  31          118.4599                63.0 1694
## 20  32          114.2800                63.0 1443
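
To see what the grouped summary computes, the same conditional means can be cross-checked in base R on a toy data frame. This is a sketch with made-up values; tapply is used here as a stand-in and is not the method used in the lesson:

```r
# Toy data: five users across two ages (illustrative values)
toy <- data.frame(age = c(13, 13, 14, 14, 14),
                  friend_count = c(10, 30, 5, 15, 100))

# One row per age, with the mean, median and count of friend_count in that group
by_age <- data.frame(
  age = sort(unique(toy$age)),
  friend_count_mean = as.vector(tapply(toy$friend_count, toy$age, mean)),
  friend_count_median = as.vector(tapply(toy$friend_count, toy$age, median)),
  n = as.vector(tapply(toy$friend_count, toy$age, length))
)

by_age
```

The large value at age 14 pulls the mean well above the median, which is the same pattern visible in the table above.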

Conditional Means Alternative syntax

Notes: The %>% operator chains commands: it passes the result on its left as the first argument of the function on its right.

#chain commands to the pf dataset
pf.fc_age_chaining <- pf %>% 
  group_by(age) %>%
  summarise(friend_count_mean = mean(friend_count),
          friend_count_median = median(friend_count),
          n = n()) %>%
  arrange(age)

head(pf.fc_age_chaining, 20)

## Source: local data frame [20 x 4]
## 
##    age friend_count_mean friend_count_median    n
## 1   13          164.7500                74.0  484
## 2   14          251.3901               132.0 1925
## 3   15          347.6921               161.0 2618
## 4   16          351.9371               171.5 3086
## 5   17          350.3006               156.0 3283
## 6   18          331.1663               162.0 5196
## 7   19          333.6921               157.0 4391
## 8   20          283.4991               135.0 3769
## 9   21          235.9412               121.0 3671
## 10  22          211.3948               106.0 3032
## 11  23          202.8426                93.0 4404
## 12  24          185.7121                92.0 2827
## 13  25          131.0211                62.0 3641
## 14  26          144.0082                75.0 2815
## 15  27          134.1473                72.0 2240
## 16  28          125.8354                66.0 2364
## 17  29          120.8182                66.0 1936
## 18  30          115.2080                67.5 1716
## 19  31          118.4599                63.0 1694
## 20  32          114.2800                63.0 1443
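
A small sketch of why chaining is only a change of notation: on a toy data frame (made-up values), the nested form and the piped form should return the same result.

```r
library(dplyr)

toy <- data.frame(age = c(13, 13, 14, 14, 14),
                  friend_count = c(10, 30, 5, 15, 100))

# Nested calls: read from the inside out
nested <- arrange(summarise(group_by(toy, age),
                            friend_count_mean = mean(friend_count),
                            n = n()),
                  age)

# Piped calls: read top to bottom
piped <- toy %>%
  group_by(age) %>%
  summarise(friend_count_mean = mean(friend_count),
            n = n()) %>%
  arrange(age)

all.equal(as.data.frame(nested), as.data.frame(piped))
```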

Create your plot! NOTES: examine mean friend count over age

ggplot(aes(x= age, y = friend_count_mean), data = pf.fc_by_age) +
  geom_line()

Overlaying Summaries with Raw Data

Notes: ggplot allows overlaying raw data with summarised values, such as a line of mean friend count over a scatterplot of friend count by age.

#plot a scatter 
ggplot(aes(x = age, y = friend_count), data = pf) +
  geom_point()

#use alpha and horizontal jitter so the integer ages no longer stack in discrete columns; age is really continuous
ggplot(aes(x = age, y = friend_count), data = pf) +
  geom_point(alpha = 1/20, 
             position = position_jitter(h = 0))

#add some color and limit ages
myplot <- ggplot(aes(x = age, y = friend_count), data = pf) +
  geom_point(alpha = 1/20, 
             position = position_jitter(h = 0),
             color = 'orange') +
  xlim(13,90)
myplot

## Warning: Removed 5184 rows containing missing values (geom_point).

#transform y using coord_trans
myplot <- myplot + coord_trans(y = 'sqrt')
myplot

## Warning: Removed 5181 rows containing missing values (geom_point).

#add the mean summary to the plot
#fun.y is the function applied to the y values at each x (ggplot2 3.3.0 and later call this argument fun)
myplot <- myplot + geom_line(stat = 'summary', fun.y = mean)
myplot

## Warning: Removed 4906 rows containing missing values (stat_summary).

## Warning: Removed 5195 rows containing missing values (geom_point).

#add quantile summaries
#probs = .1 is the 10th percentile (newer ggplot2 versions pass this as fun.args = list(probs = .1))
myplot <- myplot + geom_line(stat = 'summary', fun.y = quantile, probs = .1,
                             linetype = 2, color = 'blue')
myplot <- myplot + geom_line(stat = 'summary', fun.y = quantile, probs = .5,
                             linetype = 2, color = 'red')
myplot <- myplot + geom_line(stat = 'summary', fun.y = quantile, probs = .9,
                             linetype = 1, color = 'blue')

myplot

## Warning: Removed 4906 rows containing missing values (stat_summary).

## Warning: Removed 4906 rows containing missing values (stat_summary).

## Warning: Removed 4906 rows containing missing values (stat_summary).

## Warning: Removed 4906 rows containing missing values (stat_summary).

## Warning: Removed 5180 rows containing missing values (geom_point).
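
What the quantile lines summarise can be sketched in base R: for each x value, take the requested percentile of the y values. The data below is made up, and tapply is only a stand-in for the per-age grouping that the summary stat performs:

```r
# Illustrative data: two ages, eleven users each (values are made up)
toy <- data.frame(age = rep(c(20, 30), each = 11),
                  friend_count = c(seq(0, 100, by = 10), seq(0, 50, by = 5)))

# For each age, the 10th, 50th and 90th percentiles of friend_count
q10 <- as.vector(tapply(toy$friend_count, toy$age, quantile, probs = 0.1))
q50 <- as.vector(tapply(toy$friend_count, toy$age, quantile, probs = 0.5))
q90 <- as.vector(tapply(toy$friend_count, toy$age, quantile, probs = 0.9))

data.frame(age = sort(unique(toy$age)), q10, q50, q90)
```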

#zoom in using coord_cartesian (this replaces the coord_trans added earlier)
myplot <- myplot + coord_cartesian(ylim = c(0,1000), xlim = c(13,70))
myplot

## Warning: Removed 4906 rows containing missing values (stat_summary).

## Warning: Removed 4906 rows containing missing values (stat_summary).

## Warning: Removed 4906 rows containing missing values (stat_summary).

## Warning: Removed 4906 rows containing missing values (stat_summary).

## Warning: Removed 5156 rows containing missing values (geom_point).
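
The "Removed ... rows" warnings come from scale limits (xlim, ylim), which drop out-of-range data before any summary is computed, whereas coord_cartesian only zooms the view. A base R imitation of the difference, with made-up values:

```r
# Illustrative friend counts for one age group, including one very large value
friend_count <- c(100, 200, 300, 3000)

# What a scale limit such as ylim(0, 1000) does: out-of-range values are
# dropped before any summary is computed
inside_limits <- friend_count[friend_count >= 0 & friend_count <= 1000]
mean_after_scale_limit <- mean(inside_limits)

# What coord_cartesian(ylim = c(0, 1000)) does: the summary uses all the data,
# and only the visible window is restricted afterwards
mean_with_coord_zoom <- mean(friend_count)

c(scale_limit = mean_after_scale_limit, coord_zoom = mean_with_coord_zoom)
```

This is why zooming with coord_cartesian is the safer choice once summary lines are on the plot.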

What are some of your observations of the plot?

Response:

Moira: Histogram Summary and Scatterplot

See the Instructor Notes of this video to download Moira’s paper on perceived audience size and to see the final plot.

Notes:

Correlation

Notes: 1. Udacity stats - https://www.udacity.com/course/viewer#!/c-ud201/l-1345848540/m-171582737 2. Correlation Coefficient http://www.r-tutor.com/elementary-statistics/numerical-measures/correlation-coefficient

cor.test(x = pf$age, y = pf$friend_count)

## 
##  Pearson's product-moment correlation
## 
## data:  pf$age and pf$friend_count
## t = -8.6268, df = 99001, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.03363072 -0.02118189
## sample estimates:
##         cor 
## -0.02740737

#alternative syntax using with()
with(pf, cor.test(age, friend_count, method = 'pearson'))

## 
##  Pearson's product-moment correlation
## 
## data:  age and friend_count
## t = -8.6268, df = 99001, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.03363072 -0.02118189
## sample estimates:
##         cor 
## -0.02740737

Look up the documentation for the cor.test function.

What’s the correlation between age and friend count? Round to three decimal places. Response: -0.027. A coefficient this close to zero indicates no meaningful linear relationship; the scatterplots above suggest the relationship is not monotonic either. ***

Correlation on Subsets

Notes: The result of -0.172 indicates that friend count tends to decrease as age increases, but the correlation is weak. Descriptive statistics such as the one below cannot establish causation; that requires experiments and inferential statistics.

with(subset(pf, age <= 70), cor.test(age, friend_count))

## 
##  Pearson's product-moment correlation
## 
## data:  age and friend_count
## t = -52.592, df = 91029, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.1780220 -0.1654129
## sample estimates:
##        cor 
## -0.1717245

Correlation Methods

Notes: Correlation (Pearson, Kendall, Spearman): http://www.statisticssolutions.com/correlation-pearson-kendall-spearman/
The point is that single-number coefficients are useful but can't replace the richness of a scatterplot. ***
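
A quick sketch of how the three methods can disagree. The data is synthetic: y rises strictly with x but not in a straight line, so the rank-based measures (Spearman, Kendall) see a perfect relationship while Pearson does not.

```r
x <- 1:10
y <- exp(x)   # strictly increasing, but far from a straight line

pearson  <- cor(x, y, method = "pearson")
spearman <- cor(x, y, method = "spearman")
kendall  <- cor(x, y, method = "kendall")

round(c(pearson = pearson, spearman = spearman, kendall = kendall), 3)
```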

Create Scatterplots

Notes: examining the relationship between likes received and WWW likes received

myplot <- ggplot(aes(x = www_likes_received, y = likes_received), data = pf) +
  geom_point()
myplot

myplot <- ggplot(aes(x = www_likes_received, y = likes_received), data = pf) +
  geom_point(alpha = 1/100) + 
  coord_trans(x = 'sqrt')
  
myplot

myplot <- myplot + coord_cartesian(xlim = c(0,50), ylim = c(0,2500))
myplot

Strong Correlations

Notes:

remove(myplot)

#plot the scatter
myplot <- ggplot(aes(x = www_likes_received, y = likes_received), data = pf) +
  geom_point()
myplot

#adjust the axes using the 95% quantile of each plotted variable
myplot <- myplot + xlim(0,quantile(pf$www_likes_received,0.95)) +
  ylim(0,quantile(pf$likes_received,0.95))
myplot

## Warning: Removed 11608 rows containing missing values (geom_point).

#add the line of best fit; its slope equals the correlation scaled by sd(y)/sd(x)
#lm == linear model
myplot <- myplot + geom_smooth(method = 'lm', color = 'red')
myplot

## Warning: Removed 11608 rows containing missing values (stat_smooth).

## Warning: Removed 11608 rows containing missing values (geom_point).

## Warning: Removed 33 rows containing missing values (geom_path).
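
The link between the fitted line and the correlation coefficient can be checked numerically. A sketch on made-up data (not the Facebook data): the least-squares slope is the correlation rescaled by the two standard deviations, and the two coincide once both variables are standardised.

```r
# Illustrative data (not the Facebook data)
x <- c(1, 2, 3, 4, 5, 6)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1, 12.2)

r <- cor(x, y)
slope <- unname(coef(lm(y ~ x))[2])

# Least-squares slope = correlation * sd(y) / sd(x)
all.equal(slope, r * sd(y) / sd(x))

# After standardising both variables the slope equals the correlation itself
zx <- as.vector(scale(x))
zy <- as.vector(scale(y))
slope_standardised <- unname(coef(lm(zy ~ zx))[2])
all.equal(slope_standardised, r)
```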

What’s the correlation between the two variables? Include the top 5% of values for the variable in the calculation and round to 3 decimal places.

cor.test(x = pf$www_likes_received, y = pf$likes_received)

## 
##  Pearson's product-moment correlation
## 
## data:  pf$www_likes_received and pf$likes_received
## t = 937.1, df = 99001, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.9473553 0.9486176
## sample estimates:
##       cor 
## 0.9479902

Response: 0.948 ***

Moira on Correlation

Notes: Linear Regression Assumptions https://en.wikipedia.org/wiki/Linear_regression#Assumptions ***

More Caution with Correlation

Notes:

#install.packages('alr3')
library(alr3)

## Warning: package 'alr3' was built under R version 3.2.1

## Loading required package: car

## Warning: package 'car' was built under R version 3.2.1

data("Mitchell")

Create your plot!

names(Mitchell)

## [1] "Month" "Temp"

ggplot(aes(x = Month, y = Temp), data = Mitchell) +
  geom_point()

cor.test(x = Mitchell$Month, y = Mitchell$Temp)

## 
##  Pearson's product-moment correlation
## 
## data:  Mitchell$Month and Mitchell$Temp
## t = 0.81816, df = 202, p-value = 0.4142
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.08053637  0.19331562
## sample estimates:
##        cor 
## 0.05747063

Noisy Scatterplots

Take a guess for the correlation coefficient for the scatter plot. Response: 0
What is the actual correlation of the two variables? (Round to the thousandths place)

cor.test(x = Mitchell$Month, y = Mitchell$Temp)

## 
##  Pearson's product-moment correlation
## 
## data:  Mitchell$Month and Mitchell$Temp
## t = 0.81816, df = 202, p-value = 0.4142
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.08053637  0.19331562
## sample estimates:
##        cor 
## 0.05747063

Making Sense of Data

Notes:

#break the x axis every 12 months to mark each year
#first check the range of Month to set the limits of the breaks

range(Mitchell$Month)

## [1]   0 203

ggplot(aes(x = Month, y = Temp), data = Mitchell) +
  geom_point() +
  scale_x_continuous(breaks = seq(0,203,12))

A New Perspective

Notes: Stretch the above plot so that it is wider than it is tall. This reveals a distinct cyclic pattern that goes unnoticed in a regular view! As a rule of thumb the visualisation should be twice as wide as it is tall.

What do you notice? Response: There is a pattern repeating for the temperatures across months
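
A near-zero correlation coefficient and a strong repeating pattern can coexist. The sketch below uses a synthetic 12-month cycle as a stand-in for a seasonal temperature record; it is not the Mitchell data, and the lag-12 comparison is added here only for illustration:

```r
# Synthetic monthly series: 17 years of a pure 12-month cycle
month <- 0:203
temp <- 10 * sin(2 * pi * month / 12)

# Pearson correlation between month and temp is close to zero ...
overall_cor <- cor(month, temp)

# ... yet each value is almost perfectly predicted by the value 12 months earlier
lagged_cor <- cor(temp[1:192], temp[13:204])

round(c(overall = overall_cor, lag_12 = lagged_cor), 3)
```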

Watch the solution video and check out the Instructor Notes! Notes:

Understanding Noise: Age to Age Months

Notes: Calculate age in years with the months as a decimal fraction.

pf$age_with_months <- pf$age + (12- pf$dob_month) /12
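
A quick check of the formula on a few hand-picked inputs. The function name is hypothetical; it only wraps the expression above, which appears to measure age as of the end of the year, so a December birthday adds nothing and a January birthday adds 11/12 of a year:

```r
# Same formula as above, wrapped in a function for a quick check
age_in_years_with_months <- function(age, dob_month) {
  age + (12 - dob_month) / 12
}

# A 13-year-old born in December, June and January (illustrative inputs)
example_ages <- age_in_years_with_months(age = 13, dob_month = c(12, 6, 1))
example_ages
```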

Age with Months Means

age_groups2 <- group_by(pf, age_with_months)
pf.fc_by_age_months <- summarise(age_groups2,
                                 friend_count_mean = mean(friend_count),
                                 friend_count_median = median(friend_count),
                                 n = n())

#sort by age ascending
pf.fc_by_age_months <- arrange(pf.fc_by_age_months, age_with_months)

#use head() to check the first few rows. The row count is optional (default 6)
head(pf.fc_by_age_months, 20)

## Source: local data frame [20 x 4]
## 
##    age_with_months friend_count_mean friend_count_median   n
## 1         13.16667          46.33333                30.5   6
## 2         13.25000         115.07143                23.5  14
## 3         13.33333         136.20000                44.0  25
## 4         13.41667         164.24242                72.0  33
## 5         13.50000         131.17778                66.0  45
## 6         13.58333         156.81481                64.0  54
## 7         13.66667         130.06522                75.5  46
## 8         13.75000         205.82609               122.0  69
## 9         13.83333         215.67742               111.0  62
## 10        13.91667         162.28462                71.0 130
## 11        14.00000         194.13115               105.0 122
## 12        14.08333         226.67568               106.0 111
## 13        14.16667         270.73611               146.0 144
## 14        14.25000         218.86131               132.0 137
## 15        14.33333         313.24000               148.5 150
## 16        14.41667         230.50000               123.0 160
## 17        14.50000         268.41892               150.5 148
## 18        14.58333         288.51309               153.0 191
## 19        14.66667         264.82927               192.0 164
## 20        14.75000         182.55621               103.0 169

Programming Assignment

#plot a line of mean friend count over age in months and limit age to under 71
ggplot(aes(x = age_with_months, y = friend_count_mean), data = subset(pf.fc_by_age_months, age_with_months < 71)) +
  geom_line()

Noise in Conditional Means

#plot from earlier
p1 <- ggplot(aes(x = age, y = friend_count_mean), data = subset(pf.fc_by_age, age < 71)) +
  geom_line()


p2 <- ggplot(aes(x = age_with_months, y = friend_count_mean), data = subset(pf.fc_by_age_months, age_with_months < 71)) +
  geom_line() 

#widen the bins by rounding age to the nearest 5 years
#plot the mean of the per-age mean friend counts in each bin
p3 <- ggplot(aes(x = round(age / 5)*5, y = friend_count_mean), data = subset(pf.fc_by_age, age < 71)) +
  geom_line(stat = 'summary', fun.y = mean)

#arrange 
library(gridExtra)

## Warning: package 'gridExtra' was built under R version 3.2.1

grid.arrange(p2, p1,p3, ncol = 1)
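
The binning expression used for p3 can be inspected on its own. A minimal sketch on a short range of ages, showing which ages end up sharing a bin:

```r
ages <- 13:27

# Round each age to the nearest multiple of 5
age_bin <- round(ages / 5) * 5

# How many distinct ages fall into each 5-year bin
table(age_bin)
```

Wider bins average over more users, so the line is smoother but hides detail; narrower bins (such as age in months) show more detail along with more noise.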

Smoothing Conditional Means

Notes:

Which Plot to Choose?

Notes:

Analyzing Two Variables

Questions:
1. In the scatter plot why is the age variable not intuitive?
2. How is a jitter plot better than a scatter plot for age vs friend count?
3. Why do we use coord_trans on the y axis to improve the analysis?
4. Noise in Conditional Means - What is the Bias-Variance Trade-off?
5. Smoothing Conditional Means - What is local regression (LOESS)?

Reflection:
1. Jitter is used to overcome overplotting.
2. Clarification on the jitter syntax with position (https://discussions.udacity.com/t/confuse-with-position-arguments-in-geom-point-geom-jitter/26803): When we have geom_jitter( … position = position_jitter(h = 0)), we are telling R to set the magnitude of the jitter on the height of points (y-axis, vertical axis) to be 0. The equivalent setting if we want to change or remove the jitter on the x-axis or horizontal axis is the "width" or "w" parameter. Don't forget that you can always check the documentation for more details with ?position_jitter or check the online documentation to learn about functions you are unsure of.

geom_jitter does the same thing as geom_point, but has a different default value for the position argument. For geom_point, the default value is position = "identity", while for geom_jitter the default value is position = "jitter". Setting position = "jitter" in geom_point makes it act the same as geom_jitter; geom_jitter is a convenience function, since jittering points is done commonly enough. Again, the documentation for both functions might be useful to look at.

Types of Transformations

Three ways of transforming in ggplot:

by transforming the data:
qplot(log10(carat), log10(price), data=diamonds)

by transforming the scales:
qplot(carat, price, data=diamonds, log="xy")
qplot(carat, price, data=diamonds) + scale_x_log10() + scale_y_log10()

by transforming the coordinate system:
qplot(carat, price, data=diamonds) + coord_trans(x = "log10", y = "log10")

The difference between transforming the scales and transforming the coordinate system is that scale transformation occurs BEFORE statistics, and coordinate transformation afterwards. Coordinate transformation also changes the shape of geoms.
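
The order of operations matters whenever a statistic is involved. A base R sketch with made-up prices, imitating the two orders rather than calling ggplot itself:

```r
# Illustrative prices for one group (made-up values)
price <- c(1, 10, 100)

# Scale transformation: values are transformed first, then the statistic is computed
mean_on_log_scale <- mean(log10(price))   # mean of 0, 1, 2
back_transformed  <- 10^mean_on_log_scale # this is the geometric mean

# Coordinate transformation: the statistic is computed on the raw values,
# and only its position on the axis is transformed afterwards
mean_on_raw_scale <- mean(price)

c(scale_then_stat = back_transformed, stat_then_coord = mean_on_raw_scale)
```

The same summary line would therefore sit at 10 under a log scale but at 37 under a log coordinate system.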

As a rule of thumb the visualisation should be twice as wide as it is tall!

Click KnitHTML to see all of your hard work and to have an html page of this lesson, your answers, and your notes!