Lesson 4

In this lesson we cover three topics: scatterplots, conditional means, and the correlation between two variables.

Scatterplots and Perceived Audience Size

Notes:

Scatterplots

Notes: A scatterplot needs two continuous variables. qplot() handles this case: when x and y are not named, it takes the first argument as x and the second as y.

library(ggplot2)
pf <- read.csv('pseudo_facebook.tsv',sep='\t')

qplot(x=age, y=friend_count, data=pf)

qplot(age, friend_count, data=pf)

What are some things that you notice right away?

Response: There is no clear linear relationship between the two variables. Users younger than 30 tend to have many more friends than older users, and most users over 30 have fewer than 1000 friends.

ggplot Syntax

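Notes: Use xlim() to restrict the x-axis to a sensible age range (13 to 90 here). Rows outside the limits are dropped, which is what the warning reports.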

ggplot(aes(x=age, y=friend_count), data=pf) + geom_point() + xlim(13,90)

## Warning: Removed 4906 rows containing missing values (geom_point).

summary(pf$age)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   13.00   20.00   28.00   37.28   50.00  113.00

Overplotting

Notes: geom_jitter with alpha. Alpha sets the transparency of each point, so dense regions of overlapping points show up darker than sparse ones.

ggplot(aes(x=age, y=friend_count), data=pf) + geom_point(alpha=1/20) + xlim(13,90)

## Warning: Removed 4906 rows containing missing values (geom_point).

What do you notice in the plot?

Response: The friend counts look like a long-tailed distribution.

Coord_trans()

Notes:

ggplot(aes(x=age, y=friend_count), data=pf) + geom_point(alpha=1/20) + xlim(13,90) + coord_trans(y='sqrt')

## Warning: Removed 4906 rows containing missing values (geom_point).

Look up the documentation for coord_trans() and add a layer to the plot that transforms friend_count using the square root function. Create your plot!

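Set the jitter height to 0 with position_jitter(h = 0) to prevent negative friend counts from occurring, since the square root of a negative value is undefined.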

ggplot(aes(x=age, y=friend_count), data=pf) + geom_point(alpha=1/20, position = position_jitter(h=0)) + xlim(13,90) + coord_trans(y='sqrt')

## Warning: Removed 5201 rows containing missing values (geom_point).
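
The reason for position_jitter(h = 0) can be checked with plain arithmetic. The sketch below uses made-up counts and hand-picked noise offsets (both are assumptions for illustration, not values from the dataset) to show how vertical jitter can push a count of 0 below zero, where the square root is undefined.

```r
# Toy friend counts and hand-picked vertical offsets (hypothetical values).
friend_count <- c(0, 0, 1, 2)
vertical_noise <- c(-0.4, 0.2, -0.1, 0.3)

# Jittering the height moves the first count of 0 down to -0.4.
jittered <- friend_count + vertical_noise

# sqrt() of a negative number is NaN, so that point cannot be placed on a
# square-root axis. suppressWarnings() only hides R's "NaNs produced" warning.
sqrt_jittered <- suppressWarnings(sqrt(jittered))
print(sqrt_jittered)

# With no vertical jitter (h = 0) every value stays non-negative.
sqrt_plain <- sqrt(friend_count)
print(sqrt_plain)
```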

What do you notice?

Alpha and Jitter

Notes:

ggplot(aes(x = age, y = friendships_initiated), data = pf) + geom_point(alpha = 1/10, position=position_jitter(h=0)) + xlim(13, 90) + coord_trans(y = 'sqrt')

## Warning: Removed 5202 rows containing missing values (geom_point).

Overplotting and Domain Knowledge

Notes:

Conditional Means

Notes:

#install.packages('dplyr')
library(dplyr)

## Warning: package 'dplyr' was built under R version 3.2.4

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

age_groups <- group_by(pf, age)
pf.fc_by_age <- summarise(age_groups, friend_count_mean = mean(friend_count), friend_count_median = median(friend_count), n = n())

pf.fc_by_age <- arrange(pf.fc_by_age, age)

head(pf.fc_by_age)

## Source: local data frame [6 x 4]
## 
##     age friend_count_mean friend_count_median     n
##   (int)             (dbl)               (dbl) (int)
## 1    13          164.7500                74.0   484
## 2    14          251.3901               132.0  1925
## 3    15          347.6921               161.0  2618
## 4    16          351.9371               171.5  3086
## 5    17          350.3006               156.0  3283
## 6    18          331.1663               162.0  5196

Conditional Means Alternative Code

#install.packages('dplyr')
library(dplyr)

pf.fc_by_age <- pf %>%
  group_by(age) %>%
  summarise(friend_count_mean = mean(friend_count), 
            friend_count_median = median(friend_count), 
            n = n()) %>%
  arrange(age)

head(pf.fc_by_age, 20)

## Source: local data frame [20 x 4]
## 
##      age friend_count_mean friend_count_median     n
##    (int)             (dbl)               (dbl) (int)
## 1     13          164.7500                74.0   484
## 2     14          251.3901               132.0  1925
## 3     15          347.6921               161.0  2618
## 4     16          351.9371               171.5  3086
## 5     17          350.3006               156.0  3283
## 6     18          331.1663               162.0  5196
## 7     19          333.6921               157.0  4391
## 8     20          283.4991               135.0  3769
## 9     21          235.9412               121.0  3671
## 10    22          211.3948               106.0  3032
## 11    23          202.8426                93.0  4404
## 12    24          185.7121                92.0  2827
## 13    25          131.0211                62.0  3641
## 14    26          144.0082                75.0  2815
## 15    27          134.1473                72.0  2240
## 16    28          125.8354                66.0  2364
## 17    29          120.8182                66.0  1936
## 18    30          115.2080                67.5  1716
## 19    31          118.4599                63.0  1694
## 20    32          114.2800                63.0  1443
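
The grouping step can also be reproduced in base R, which helps to see what group_by() and summarise() compute. This is a sketch on a small made-up data frame (toy and its values are assumptions), not the pf data.

```r
# Hypothetical data: two ages with a few friend counts each.
toy <- data.frame(
  age = c(13, 13, 14, 14, 14),
  friend_count = c(10, 30, 5, 15, 100)
)

# tapply() applies a function to friend_count within each age group.
means   <- tapply(toy$friend_count, toy$age, mean)
medians <- tapply(toy$friend_count, toy$age, median)
counts  <- tapply(toy$friend_count, toy$age, length)

# Assemble one row per age, with the same column names as pf.fc_by_age.
fc_by_age <- data.frame(
  age = as.numeric(names(means)),
  friend_count_mean = as.vector(means),
  friend_count_median = as.vector(medians),
  n = as.vector(counts)
)
print(fc_by_age)
```

In the toy data the mean for age 14 is pulled well above the median by a single large value, which is the same effect visible in the real table.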

Create your plot!

ggplot(aes(x = age, y = friend_count_mean), data = pf.fc_by_age) + geom_line()

Overlaying Summaries with Raw Data

Notes: original plot and summary plot

ggplot2 2.0.0 changes the syntax for passing arguments to the summary function when using stat = 'summary'. To set parameters on the function given by fun.y, use the fun.args argument, e.g.: geom_line(stat = 'summary', fun.y = quantile, fun.args = list(probs = .9), ... )

To zoom in, the code should use the coord_cartesian(xlim = c(13, 90)) layer rather than the xlim(13, 90) layer.

Look up documentation for coord_cartesian() and quantile() if you’re unfamiliar with them.
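
The difference matters for summary lines: a scale limit such as xlim() or ylim() discards out-of-range rows before any statistic is computed, while coord_cartesian() only changes the visible window. The sketch below imitates both behaviours in base R for a single age group; the counts are made up for illustration.

```r
# Hypothetical friendships_initiated values for one age group.
initiated <- c(10, 20, 3000)

# Scale limits (for example ylim(0, 1000)) treat out-of-range values as
# missing before the summary is computed, so only 10 and 20 remain.
kept <- initiated[initiated >= 0 & initiated <= 1000]
mean_with_scale_limits <- mean(kept)

# coord_cartesian(ylim = c(0, 1000)) leaves the data untouched, so the
# summary still uses all three values.
mean_with_zoom <- mean(initiated)

print(c(scale_limits = mean_with_scale_limits, zoom = mean_with_zoom))
```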

ggplot(aes(x = age, y = friendships_initiated), 
       data = pf) + 
  geom_point(alpha = 1/20, position=position_jitter(h=0), color = 'orange') + 
  coord_cartesian(xlim=c(13, 90), ylim = c(0,1000)) + 
  geom_line(stat = 'summary', fun.y = mean) +
  geom_line(stat = 'summary', fun.y = quantile, fun.args = list(probs = .1), linetype = 2, color = 'blue') +
  geom_line(stat = 'summary', fun.y = quantile, fun.args = list(probs = .5), color = 'blue') +
  geom_line(stat = 'summary', fun.y = quantile, fun.args = list(probs = .9), linetype = 2, color = 'blue')
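
Each summary line in this plot draws one conditional statistic per age. A sketch of the same calculation outside ggplot2, using tapply() on a made-up data frame (toy is an assumption), shows how an extra argument such as probs reaches quantile(), which is the role fun.args plays in the plot.

```r
# Hypothetical data: 11 users at age 20 and 11 users at age 30.
toy <- data.frame(
  age = rep(c(20, 30), each = 11),
  friendships_initiated = c(1:11, seq(10, 110, by = 10))
)

# Extra arguments after the function name are passed on to quantile().
q10 <- tapply(toy$friendships_initiated, toy$age, quantile, probs = 0.1)
q50 <- tapply(toy$friendships_initiated, toy$age, quantile, probs = 0.5)
q90 <- tapply(toy$friendships_initiated, toy$age, quantile, probs = 0.9)

# One row per age: the 10th percentile, median, and 90th percentile.
print(cbind(q10, q50, q90))
```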

What are some of your observations of the plot?

Response: Younger users have more friends than users of other ages. Age 68 shows an unusually high count compared with neighboring ages. Ages over 80 may not be genuine entries.

Moira: Histogram Summary and Scatterplot

See the Instructor Notes of this video to download Moira’s paper on perceived audience size and to see the final plot.

Notes:

Correlation

Notes:

?cor.test

## starting httpd help server ...

##  done

cor.test(pf$age,pf$friend_count, method = 'pearson')

## 
##  Pearson's product-moment correlation
## 
## data:  pf$age and pf$friend_count
## t = -8.6268, df = 99001, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.03363072 -0.02118189
## sample estimates:
##         cor 
## -0.02740737

with(pf, cor.test(age,friend_count, method = 'pearson'))

## 
##  Pearson's product-moment correlation
## 
## data:  age and friend_count
## t = -8.6268, df = 99001, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.03363072 -0.02118189
## sample estimates:
##         cor 
## -0.02740737
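
The output pairs a very small correlation with a very small p-value. A quick check with the standard formula t = r * sqrt(df / (1 - r^2)), using only the numbers printed in the output, should land close to the reported t statistic. It suggests that the large sample size, not the strength of the relationship, drives the significance.

```r
# Values copied from the cor.test() output.
r  <- -0.02740737   # sample correlation
df <- 99001         # degrees of freedom (n - 2)

# Test statistic for H0: true correlation is 0.
t_stat <- r * sqrt(df / (1 - r^2))
print(t_stat)

# Two-sided p-value from the t distribution.
p_value <- 2 * pt(-abs(t_stat), df)
print(p_value)
```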

-0.02740737

Look up the documentation for the cor.test function.

What’s the correlation between age and friend count? Round to three decimal places.

Response: -0.027

Pearson’s correlation coefficient is r = cov(X, Y) / (Sx * Sy), where Sx and Sy are the sample standard deviations of X and Y. r^2 is the proportion of the variation in Y explained by a linear relationship with X.

ref: http://www.r-tutor.com/elementary-statistics/numerical-measures/correlation-coefficient
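
The definition can be checked directly against R's built-in cor(). This is a minimal sketch on made-up numbers (x and y are assumptions, not the pf data).

```r
# Hypothetical ages and friend counts.
x <- c(13, 18, 25, 40, 60)
y <- c(160, 330, 130, 110, 90)

# Pearson's r from its definition: covariance over the product of the
# sample standard deviations.
r_by_hand <- cov(x, y) / (sd(x) * sd(y))

# The built-in function uses method = "pearson" by default.
r_builtin <- cor(x, y)

# Share of the variance in y explained by a linear fit on x.
r_squared <- r_builtin^2

print(c(by_hand = r_by_hand, builtin = r_builtin, r_squared = r_squared))
```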

Correlation on Subsets

Notes: The relationship between age and friend count is not linear over the full age range, so we restrict the calculation to a narrower range of ages (70 and under).

with(subset(pf, age <= 70), cor.test(age, friend_count))

## 
##  Pearson's product-moment correlation
## 
## data:  age and friend_count
## t = -52.592, df = 91029, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.1780220 -0.1654129
## sample estimates:
##        cor 
## -0.1717245

-0.1717245
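
Pearson's r only measures linear association, which is why restricting the range can change it so much. A sketch with made-up values (x and y are assumptions, not the pf data): a perfectly U-shaped relationship has a correlation of about zero overall, but a strong negative correlation on its left half.

```r
# Hypothetical U-shaped relationship: y depends on x exactly, but not linearly.
x <- -5:5
y <- x^2

# Over the full range the downward and upward halves cancel out.
r_full <- cor(x, y)

# On the left half alone, y falls steadily as x rises.
left <- x <= 0
r_left <- cor(x[left], y[left])

print(c(full = r_full, left_half = r_left))
```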

Correlation Methods

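Notes: There are other methods for computing a correlation coefficient, such as Spearman's rank correlation (method = 'spearman' in cor.test), which measures monotonic rather than strictly linear association.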
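
A small sketch of how the two methods can disagree, using made-up values (x and y are assumptions): when y always increases with x but not along a straight line, the rank-based coefficient is 1 while Pearson's r is noticeably lower.

```r
# Hypothetical data: y grows exponentially with x.
x <- 1:10
y <- exp(x)

# Pearson's r is pulled below 1 by the curvature.
r_pearson <- cor(x, y, method = "pearson")

# Spearman's coefficient only compares ranks, and the ranks match exactly.
r_spearman <- cor(x, y, method = "spearman")

print(c(pearson = r_pearson, spearman = r_spearman))
```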

Create Scatterplots

Notes:

names(pf)

ggplot(aes(y = likes_received, x = www_likes_received), data = pf) + geom_point()

Strong Correlations

Notes: To exclude outliers from the plot, use quantile() to set each axis limit at the 95th percentile of its variable.

ggplot(aes(y = likes_received, x = www_likes_received), data = pf) + geom_point() +
  xlim(0, quantile(pf$www_likes_received, 0.95)) +
  ylim(0, quantile(pf$likes_received, 0.95)) +
  geom_smooth(method = 'lm', color = 'red')

## Warning: Removed 6075 rows containing non-finite values (stat_smooth).

## Warning: Removed 6075 rows containing missing values (geom_point).
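
The warnings come from the rows that fall above the 95th-percentile limits. A sketch of the cutoff calculation on a made-up vector (values is an assumption, not a pf column):

```r
# Hypothetical variable: the integers 1 to 100.
values <- 1:100

# The 95th percentile is used as the upper axis limit.
upper_limit <- quantile(values, 0.95)
print(upper_limit)

# Points above the limit are the ones a scale limit would drop.
n_dropped <- sum(values > upper_limit)
print(n_dropped)
```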

What’s the correlation between the two variables? Include the top 5% of values for the variable in the calculation and round to 3 decimal places.

with(pf, cor.test(www_likes_received, likes_received))

## 
##  Pearson's product-moment correlation
## 
## data:  www_likes_received and likes_received
## t = 937.1, df = 99001, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.9473553 0.9486176
## sample estimates:
##       cor 
## 0.9479902

Response: 0.948

Moira on Correlation

Notes: strong correlation is not always a good thing.

More Caution with Correlation

Notes:

#install.packages('alr3')
library(alr3)

## Warning: package 'alr3' was built under R version 3.2.4

## Loading required package: car

## Warning: package 'car' was built under R version 3.2.4

data(Mitchell)
?Mitchell

Create your plot!

ggplot(aes(y = Temp, x = Month), data = Mitchell) + geom_point()

qplot(data=Mitchell, Month, Temp)

Noisy Scatterplots

Take a guess at the correlation coefficient for the scatterplot. Response: about 0 (between -0.2 and 0.2)
What is the actual correlation of the two variables? (Round to the thousandths place) Response: 0.057

with(Mitchell, cor.test(Month,Temp))

## 
##  Pearson's product-moment correlation
## 
## data:  Month and Temp
## t = 0.81816, df = 202, p-value = 0.4142
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.08053637  0.19331562
## sample estimates:
##        cor 
## 0.05747063

Making Sense of Data

Notes: Month is a running count (0 to 203) rather than a calendar month, so it should be put on a 12-month basis.

range(Mitchell$Month)

## [1]   0 203

ggplot(aes(y = Temp, x = Month), data = Mitchell) + geom_point() + scale_x_continuous(breaks = seq(0, 203, 12))

ggplot(aes(y = Temp, x = Month%%12), data = Mitchell) + geom_point()

A New Perspective

What do you notice? Response: There is a cyclical pattern in temperature across the months of the year.
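
The same effect can be reproduced with synthetic data. The sketch below builds an idealized seasonal series (month and temp are made-up stand-ins, not the Mitchell data) spanning 17 full years: the correlation with the running month count is close to zero, yet averaging by month %% 12 recovers the yearly cycle.

```r
# Hypothetical series: 204 months (17 full years) of a pure seasonal cycle
# that is coldest at month 0 of each year and warmest at month 6.
month <- 0:203
temp <- 10 - 10 * cos(2 * pi * month / 12)

# Against the running month count the linear correlation is close to zero.
r_running <- cor(month, temp)
print(r_running)

# Folding the months onto a 12-month cycle and averaging shows the pattern.
seasonal_mean <- tapply(temp, month %% 12, mean)
print(round(seasonal_mean, 2))
```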

Watch the solution video and check out the Instructor Notes!

Notes: There are other measures of association that can detect this. The dcor.ttest() function in the energy package implements a non-parametric test of the independence of two variables. While the Mitchell soil dataset is too coarse to identify a significant dependency between “Month” and “Temp”, we can see the difference between dcor.ttest and cor.test through other examples, like the following:

library(energy)
x <- seq(0, 4*pi, pi/20)
y <- cos(x)
qplot(x = x, y = y)
dcor.ttest(x, y)

Understanding Noise: Age to Age Months

Notes:

ggplot(aes(x = age, y = friend_count_mean), data = pf.fc_by_age) + geom_line()

head(pf.fc_by_age,10)

## Source: local data frame [10 x 4]
## 
##      age friend_count_mean friend_count_median     n
##    (int)             (dbl)               (dbl) (int)
## 1     13          164.7500                74.0   484
## 2     14          251.3901               132.0  1925
## 3     15          347.6921               161.0  2618
## 4     16          351.9371               171.5  3086
## 5     17          350.3006               156.0  3283
## 6     18          331.1663               162.0  5196
## 7     19          333.6921               157.0  4391
## 8     20          283.4991               135.0  3769
## 9     21          235.9412               121.0  3671
## 10    22          211.3948               106.0  3032

pf.fc_by_age[17:19,]

## Source: local data frame [3 x 4]
## 
##     age friend_count_mean friend_count_median     n
##   (int)             (dbl)               (dbl) (int)
## 1    29          120.8182                66.0  1936
## 2    30          115.2080                67.5  1716
## 3    31          118.4599                63.0  1694

Age with Months Means

pf$age_with_months <- pf$age + (12 - pf$dob_month) / 12

pf$age_with_months <- with(pf, age + (12 - dob_month) / 12)
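
A worked example of the formula on made-up users (the ages and birth months below are assumptions). The formula appears to treat the data as measured at the end of a year, so a user born in month m is (12 - m) months past their age in whole years.

```r
# Hypothetical users: age in years and month of birth (1 = January).
age <- c(13, 13, 36)
dob_month <- c(10, 3, 12)

# An October birthday adds 2/12 of a year, a March birthday adds 9/12,
# and a December birthday adds nothing.
age_with_months <- age + (12 - dob_month) / 12
print(age_with_months)
```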

library(dplyr)
age_months_groups <- group_by(pf, age_with_months)
pf.fc_by_age_months <- summarise(age_months_groups, 
          friend_count_mean = mean(friend_count),
          friend_count_median = median(friend_count),
          n = n())
pf.fc_by_age_months <- arrange(pf.fc_by_age_months, age_with_months)

head(pf.fc_by_age_months)

## Source: local data frame [6 x 4]
## 
##   age_with_months friend_count_mean friend_count_median     n
##             (dbl)             (dbl)               (dbl) (int)
## 1        13.16667          46.33333                30.5     6
## 2        13.25000         115.07143                23.5    14
## 3        13.33333         136.20000                44.0    25
## 4        13.41667         164.24242                72.0    33
## 5        13.50000         131.17778                66.0    45
## 6        13.58333         156.81481                64.0    54

Programming Assignment

Noise in Conditional Means

ggplot(aes(x = age_with_months, y = friend_count_mean), data = subset(pf.fc_by_age_months, age_with_months<71)) + geom_line()

Smoothing Conditional Means

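Notes: Choosing the bin width is a bias-variance tradeoff. Narrow bins (age in months) give noisy means because each bin holds few users, while wide bins (5-year groups) give smoother estimates but can hide real features of the data.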

library(gridExtra)

## 
## Attaching package: 'gridExtra'

## The following object is masked from 'package:dplyr':
## 
##     combine

p1 <- ggplot(aes(x = age, y = friend_count_mean), data = subset(pf.fc_by_age, age<71)) + geom_line() + geom_smooth()

p2 <- ggplot(aes(x = age_with_months, y = friend_count_mean), data = subset(pf.fc_by_age_months, age_with_months<71)) + geom_line() + geom_smooth()

p3 <- ggplot(aes(x = round(age/5)*5, y = friend_count), data = subset(pf, age<71)) + geom_line(stat='summary', fun.y = mean)

grid.arrange(p2, p1, p3, ncol=1)
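
The x aesthetic in p3, round(age/5)*5, snaps each age to the nearest multiple of 5, which is what creates the wide bins. A short sketch on a made-up range of ages (ages is an assumption):

```r
# Hypothetical integer ages.
ages <- 13:22

# Each age is mapped to the nearest multiple of 5.
age_bin <- round(ages / 5) * 5
print(rbind(ages, age_bin))

# Number of ages that fall into each bin.
print(table(age_bin))
```

Ages 13 to 17 share the bin labelled 15 and ages 18 to 22 share the bin labelled 20, so each mean in p3 pools five years of users.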

Which Plot to Choose?

Notes: You don’t have to choose. Explore as many plots that suit your data as you can, since each one can reveal something different.

Analyzing Two Variables

Reflection: I learned how to evaluate the correlation between two variables. To understand the data better, I learned several techniques: changing the scale of the axes, smoothing, and overlaying summaries on plots. I also learned to manipulate the data to extract what is meaningful.

Click KnitHTML to see all of your hard work and to have an html page of this lesson, your answers, and your notes!