This lesson covers three topics: scatter plots, conditional means, and the correlation between two variables.
Notes: a scatter plot needs two continuous variables, which suits qplot. When x and y are not named, qplot takes them from argument order: the first argument is x and the second is y.
library(ggplot2)
pf <- read.csv('pseudo_facebook.tsv',sep='\t')
qplot(x=age, y=friend_count, data=pf)
qplot(age, friend_count, data=pf)
Response: there is no clear linear relationship between the two variables. Users younger than 30 tend to have far more friends than older users, and most users over 30 have fewer than 1000 friends.
Notes: use xlim to restrict the x axis to plausible ages (13 to 90).
ggplot(aes(x=age, y=friend_count), data=pf) + geom_point() + xlim(13,90)
## Warning: Removed 4906 rows containing missing values (geom_point).
summary(pf$age)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 13.00 20.00 28.00 37.28 50.00 113.00
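The warning above appears because xlim() drops rows that fall outside the limits before plotting. A minimal sketch in base R, using a made-up vector `ages` (hypothetical values, not taken from pf), shows how such rows can be counted ahead of time:

```r
# Hypothetical ages; these values are made up and do not come from pf.
ages <- c(13, 25, 40, 90, 95, 103, 113)

# Rows outside [13, 90] are the ones xlim(13, 90) would drop.
outside <- ages < 13 | ages > 90
n_removed <- sum(outside)
print(n_removed)

# Keeping only the in-range rows up front is one way to avoid the warning.
ages_in_range <- ages[!outside]
print(ages_in_range)
```

Applied to the real data, the same comparison on pf$age should account for the rows reported in the warning.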
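Notes: use geom_jitter or geom_point with alpha to reduce overplotting. alpha sets the transparency of each point, so alpha = 1/20 means it takes 20 overlapping points to draw one fully opaque point.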
ggplot(aes(x=age, y=friend_count), data=pf) + geom_point(alpha=1/20) + xlim(13,90)
## Warning: Removed 4906 rows containing missing values (geom_point).
Response: friend counts look like a long-tailed distribution.
Notes:
ggplot(aes(x=age, y=friend_count), data=pf) + geom_point(alpha=1/20) + xlim(13,90) + coord_trans(y='sqrt')
## Warning: Removed 4906 rows containing missing values (geom_point).
Notes: set the jitter height to 0 (h = 0) so that jitter cannot push friend_count below zero, where the square root is undefined.
ggplot(aes(x=age, y=friend_count), data=pf) + geom_point(alpha=1/20, position = position_jitter(h=0)) + xlim(13,90) + coord_trans(y='sqrt')
## Warning: Removed 5201 rows containing missing values (geom_point).
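Vertical jitter is a problem on a square-root axis because a count of 0 can be nudged below zero. The sketch below uses hand-picked offsets (hypothetical stand-ins for the random noise jitter would add) to show what happens to such a point:

```r
# Hypothetical friend counts and hand-picked vertical offsets
# (stand-ins for the random noise that jitter would add).
friend_count <- c(0, 0, 5, 10)
offsets <- c(-0.3, 0.2, -0.1, 0.4)
jittered <- friend_count + offsets

# sqrt() of a negative value is NaN, so that point cannot be placed
# on a square-root axis.
transformed <- suppressWarnings(sqrt(jittered))
n_lost <- sum(is.nan(transformed))
print(n_lost)
```

With h = 0 the offsets are all zero in the vertical direction, so no count can become negative.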
Notes:
ggplot(aes(x = age, y = friendships_initiated), data = pf) + geom_point(alpha = 1/10, position=position_jitter(h=0)) + xlim(13, 90) + coord_trans(y = 'sqrt')
## Warning: Removed 5202 rows containing missing values (geom_point).
Notes:
#install.packages('dplyr')
library(dplyr)
## Warning: package 'dplyr' was built under R version 3.2.4
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
age_groups <- group_by(pf, age)
pf.fc_by_age <- summarise(age_groups, friend_count_mean = mean(friend_count), friend_count_median = median(friend_count), n = n())
pf.fc_by_age <- arrange(pf.fc_by_age, age)
head(pf.fc_by_age)
## Source: local data frame [6 x 4]
##
## age friend_count_mean friend_count_median n
## (int) (dbl) (dbl) (int)
## 1 13 164.7500 74.0 484
## 2 14 251.3901 132.0 1925
## 3 15 347.6921 161.0 2618
## 4 16 351.9371 171.5 3086
## 5 17 350.3006 156.0 3283
## 6 18 331.1663 162.0 5196
#install.packages('dplyr')
library(dplyr)
pf.fc_by_age <- pf %>%
group_by(age) %>%
summarise(friend_count_mean = mean(friend_count),
friend_count_median = median(friend_count),
n = n()) %>%
arrange(age)
head(pf.fc_by_age, 20)
## Source: local data frame [20 x 4]
##
## age friend_count_mean friend_count_median n
## (int) (dbl) (dbl) (int)
## 1 13 164.7500 74.0 484
## 2 14 251.3901 132.0 1925
## 3 15 347.6921 161.0 2618
## 4 16 351.9371 171.5 3086
## 5 17 350.3006 156.0 3283
## 6 18 331.1663 162.0 5196
## 7 19 333.6921 157.0 4391
## 8 20 283.4991 135.0 3769
## 9 21 235.9412 121.0 3671
## 10 22 211.3948 106.0 3032
## 11 23 202.8426 93.0 4404
## 12 24 185.7121 92.0 2827
## 13 25 131.0211 62.0 3641
## 14 26 144.0082 75.0 2815
## 15 27 134.1473 72.0 2240
## 16 28 125.8354 66.0 2364
## 17 29 120.8182 66.0 1936
## 18 30 115.2080 67.5 1716
## 19 31 118.4599 63.0 1694
## 20 32 114.2800 63.0 1443
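For comparison, the same conditional summaries can be computed without dplyr. This is a minimal base R sketch on a made-up data frame `toy` (hypothetical values; only the column names mirror pf), not a replacement for the dplyr approach above:

```r
# Hypothetical toy data; column names mirror pf but the values are made up.
toy <- data.frame(
  age = c(13, 13, 14, 14, 14),
  friend_count = c(10, 30, 5, 15, 40)
)

# Mean and median friend_count within each age, plus the group size.
fc_mean   <- tapply(toy$friend_count, toy$age, mean)
fc_median <- tapply(toy$friend_count, toy$age, median)
n         <- tapply(toy$friend_count, toy$age, length)

# Assemble one row per age, like the result of summarise().
toy_by_age <- data.frame(
  age = as.numeric(names(fc_mean)),
  friend_count_mean = as.vector(fc_mean),
  friend_count_median = as.vector(fc_median),
  n = as.vector(n)
)
print(toy_by_age)
```

tapply() already returns groups in sorted order, which plays the role of arrange(age) here.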
Create your plot!
ggplot(aes(x = age, y = friend_count_mean), data = pf.fc_by_age) + geom_line()
Notes: overlay the summary lines on the original scatter plot.
ggplot2 2.0.0 changed how parameters are passed to the summary function when using stat = 'summary'. Parameters for the function given in fun.y go in the fun.args argument, e.g.: geom_line(stat = 'summary', fun.y = quantile, fun.args = list(probs = .9), … ). In ggplot2 3.3.0 and later, fun.y is deprecated in favor of fun.
To zoom in, the code should use the coord_cartesian(xlim = c(13, 90)) layer rather than the xlim(13, 90) layer.
Look up documentation for coord_cartesian() and quantile() if you’re unfamiliar with them.
ggplot(aes(x = age, y = friendships_initiated),
data = pf) +
geom_point(alpha = 1/20, position=position_jitter(h=0), color = 'orange') +
coord_cartesian(xlim=c(13, 90), ylim = c(0,1000)) +
geom_line(stat = 'summary', fun.y = mean) +
geom_line(stat = 'summary', fun.y = quantile, fun.args = list(probs = .1), linetype = 2, color = 'blue') +
geom_line(stat = 'summary', fun.y = quantile, fun.args = list(probs = .5), color = 'blue') +
geom_line(stat = 'summary', fun.y = quantile, fun.args = list(probs = .9), linetype = 2, color = 'blue')
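The reason to prefer coord_cartesian here is that xlim and ylim drop out-of-range rows before the summaries are computed, while coord_cartesian only zooms the view. A small base R sketch with made-up numbers (the vector `x` is hypothetical) illustrates how filtering changes a summary and what quantile() returns for probs = .9:

```r
# Hypothetical values with one extreme observation.
x <- c(1, 2, 3, 4, 100)

# Filtering first (the effect of xlim/ylim) changes the summary...
mean_filtered <- mean(x[x <= 10])
# ...while zooming (the effect of coord_cartesian) leaves the data,
# and therefore the summary, untouched.
mean_full <- mean(x)
print(c(filtered = mean_filtered, full = mean_full))

# quantile() with probs = 0.9 returns the 90th percentile, which is
# what fun.args = list(probs = .9) asks the summary line to plot.
q90 <- unname(quantile(1:11, probs = 0.9))
print(q90)
```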
Response: younger users have more friends than older users. Age 68 shows an abnormally high count compared with nearby ages. Ages above 80 are probably not accurate entries.
See the Instructor Notes of this video to download Moira’s paper on perceived audience size and to see the final plot.
Notes:
?cor.test
## starting httpd help server ...
## done
cor.test(pf$age,pf$friend_count, method = 'pearson')
##
## Pearson's product-moment correlation
##
## data: pf$age and pf$friend_count
## t = -8.6268, df = 99001, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.03363072 -0.02118189
## sample estimates:
## cor
## -0.02740737
with(pf, cor.test(age,friend_count, method = 'pearson'))
##
## Pearson's product-moment correlation
##
## data: age and friend_count
## t = -8.6268, df = 99001, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.03363072 -0.02118189
## sample estimates:
## cor
## -0.02740737
-0.02740737
Look up the documentation for the cor.test function.
What’s the correlation between age and friend count? Round to three decimal places. Response: -0.027.
Pearson’s correlation coefficient is r = cov(X, Y) / (s_X s_Y), where s_X and s_Y are the sample standard deviations of X and Y. r^2 is the proportion of the variation in Y explained by a linear relationship with X.
ref: http://www.r-tutor.com/elementary-statistics/numerical-measures/correlation-coefficient
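As a check on the definition of Pearson's r, the sketch below computes cov(X, Y) / (s_X s_Y) by hand and compares it with cor(). The vectors `x` and `y` are made-up values chosen only for illustration:

```r
# Hypothetical paired observations.
x <- c(1, 2, 3, 4, 5)
y <- c(2, 1, 4, 3, 5)

# Pearson's r from its definition: covariance over the product of
# the two sample standard deviations.
r_manual <- cov(x, y) / (sd(x) * sd(y))

# The built-in function should agree.
r_builtin <- cor(x, y, method = "pearson")

# r^2: share of the variation in y explained by a linear fit on x.
r_squared <- r_manual^2

print(c(manual = r_manual, builtin = r_builtin, r_squared = r_squared))
```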
Notes: the relationship between age and friend count is not linear across the full age range, so compute the correlation over a narrower range (here, ages 70 and under).
with(subset(pf, pf$age <= 70), cor.test(age, friend_count))
##
## Pearson's product-moment correlation
##
## data: age and friend_count
## t = -52.592, df = 91029, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.1780220 -0.1654129
## sample estimates:
## cor
## -0.1717245
-0.1717245
Notes: there are other ways to compute a correlation coefficient, such as Spearman's rank correlation (method = 'spearman').
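To see how the choice of method matters, the sketch below compares Pearson and Spearman on a made-up relationship that is perfectly monotonic but far from linear (`x` and `y` are hypothetical, not from pf):

```r
# Hypothetical data: y increases with x, but exponentially.
x <- 1:10
y <- exp(x)

# Pearson measures linear association, so it falls well short of 1.
pearson <- cor(x, y, method = "pearson")

# Spearman works on ranks, so any strictly increasing relationship
# gives a coefficient of 1.
spearman <- cor(x, y, method = "spearman")

print(c(pearson = pearson, spearman = spearman))
```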
Notes:
names(pf)
ggplot(aes(y = likes_received, x = www_likes_received), data = pf) + geom_point()
Notes: exclude outliers by capping each axis at the 95th percentile, computed with quantile().
ggplot(aes(y = likes_received, x = www_likes_received), data = pf) + geom_point() +
xlim(0, quantile(pf$www_likes_received, 0.95)) +
ylim(0, quantile(pf$likes_received, 0.95)) +
geom_smooth(method = 'lm', color = 'red')
## Warning: Removed 6075 rows containing non-finite values (stat_smooth).
## Warning: Removed 6075 rows containing missing values (geom_point).
What’s the correlation between the two variables? Include the top 5% of values for each variable in the calculation and round to 3 decimal places.
with(pf, cor.test(www_likes_received, likes_received))
##
## Pearson's product-moment correlation
##
## data: www_likes_received and likes_received
## t = 937.1, df = 99001, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.9473553 0.9486176
## sample estimates:
## cor
## 0.9479902
Response: 0.948
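Notes: a strong correlation is not always informative. Here www_likes_received is one component of likes_received, so the two variables are highly correlated by construction.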
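A correlation can be high simply because one variable is built from the other. The sketch below uses made-up vectors `www` and `mobile` (hypothetical names and values) and shows that their sum correlates strongly with the larger component:

```r
# Hypothetical component counts; values are made up.
www    <- c(1, 5, 2, 8, 3, 9, 4, 7)
mobile <- c(0, 1, 0, 1, 1, 0, 1, 0)

# The total contains www as one of its parts.
total <- www + mobile

# Because total is mostly www, the correlation is close to 1 and says
# little about any real-world relationship.
r <- cor(www, total)
print(r)
```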
Notes:
#install.packages('alr3')
library(alr3)
## Warning: package 'alr3' was built under R version 3.2.4
## Loading required package: car
## Warning: package 'car' was built under R version 3.2.4
data(Mitchell)
?Mitchell
Create your plot!
ggplot(aes(y = Temp, x = Month), data = Mitchell) + geom_point()
qplot(data=Mitchell, Month, Temp)
Take a guess for the correlation coefficient for the scatterplot. Response: about 0 (between -0.2 and 0.2).
What is the actual correlation of the two variables? (Round to the thousandths place.) Response: 0.057
with(Mitchell, cor.test(Month,Temp))
##
## Pearson's product-moment correlation
##
## data: Month and Temp
## t = 0.81816, df = 202, p-value = 0.4142
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.08053637 0.19331562
## sample estimates:
## cor
## 0.05747063
Notes: Month runs from 0 to 203, so put the x-axis breaks at 12-month intervals to show the yearly cycle.
range(Mitchell$Month)
## [1] 0 203
ggplot(aes(y = Temp, x = Month), data = Mitchell) + geom_point() + scale_x_continuous(breaks = seq(0, 203, 12))
ggplot(aes(y = Temp, x = Month%%12), data = Mitchell) + geom_point()
What do you notice? Response: temperature follows a cyclical pattern that repeats every 12 months.
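A cyclical pattern like this can be strong and still produce a Pearson correlation near zero. The sketch below uses a synthetic series (a plain cosine wave, not the Mitchell data) to illustrate, and shows that grouping by month %% 12 recovers the cycle:

```r
# Synthetic stand-in for a seasonal series: 204 months, 12-month period.
# This is NOT the Mitchell data, only an idealised cycle.
month <- 0:203
temp <- -cos(2 * pi * month / 12)

# The linear correlation with the running month index is close to zero.
r <- cor(month, temp)
print(r)

# Averaging by position within the year makes the cycle obvious.
cycle_means <- tapply(temp, month %% 12, mean)
print(round(cycle_means, 2))
```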
Watch the solution video and check out the Instructor Notes! Notes: There are other measures of associations that can detect this. The dcor.ttest() function in the energy package implements a non-parametric test of the independence of two variables. While the Mitchell soil dataset is too coarse to identify a significant dependency between “Month” and “Temp”, we can see the difference between dcor.ttest and cor.test through other examples, like the following:
library(energy)
x <- seq(0, 4*pi, pi/20)
y <- cos(x)
qplot(x = x, y = y)
dcor.ttest(x, y)
Notes:
ggplot(aes(x = age, y = friend_count_mean), data = pf.fc_by_age) + geom_line()
head(pf.fc_by_age,10)
## Source: local data frame [10 x 4]
##
## age friend_count_mean friend_count_median n
## (int) (dbl) (dbl) (int)
## 1 13 164.7500 74.0 484
## 2 14 251.3901 132.0 1925
## 3 15 347.6921 161.0 2618
## 4 16 351.9371 171.5 3086
## 5 17 350.3006 156.0 3283
## 6 18 331.1663 162.0 5196
## 7 19 333.6921 157.0 4391
## 8 20 283.4991 135.0 3769
## 9 21 235.9412 121.0 3671
## 10 22 211.3948 106.0 3032
pf.fc_by_age[17:19,]
## Source: local data frame [3 x 4]
##
## age friend_count_mean friend_count_median n
## (int) (dbl) (dbl) (int)
## 1 29 120.8182 66.0 1936
## 2 30 115.2080 67.5 1716
## 3 31 118.4599 63.0 1694
pf$age_with_months <- pf$age + (12 - pf$dob_month) / 12
pf$age_with_months <- with(pf, age + (12 - dob_month) / 12)
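To make the arithmetic concrete, here is the same formula applied to three made-up users (the vectors below are hypothetical). It reads as "age plus the fraction of a year since the last birthday", on the assumption that ages are measured at the end of the year:

```r
# Hypothetical users: age in whole years and birth month (1 = January).
age <- c(20, 20, 35)
dob_month <- c(3, 12, 1)

# A March birthday adds 9/12 of a year, December adds 0,
# and January adds 11/12.
age_with_months <- age + (12 - dob_month) / 12
print(age_with_months)
```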
library(dplyr)
age_months_groups <- group_by(pf, age_with_months)
pf.fc_by_age_months <- summarise(age_months_groups,
friend_count_mean = mean(friend_count),
friend_count_median = median(friend_count),
n = n())
pf.fc_by_age_months <- arrange(pf.fc_by_age_months, age_with_months)
head(pf.fc_by_age_months)
## Source: local data frame [6 x 4]
##
## age_with_months friend_count_mean friend_count_median n
## (dbl) (dbl) (dbl) (int)
## 1 13.16667 46.33333 30.5 6
## 2 13.25000 115.07143 23.5 14
## 3 13.33333 136.20000 44.0 25
## 4 13.41667 164.24242 72.0 33
## 5 13.50000 131.17778 66.0 45
## 6 13.58333 156.81481 64.0 54
Programming Assignment
ggplot(aes(x = age_with_months, y = friend_count_mean), data = subset(pf.fc_by_age_months, age_with_months<71)) + geom_line()
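Notes: this is the bias-variance tradeoff. Narrow bins (age in months) give a noisier estimate of the conditional mean, while wide bins (5-year groups) give a smoother estimate that can hide real features.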
library(gridExtra)
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
##
## combine
p1 <- ggplot(aes(x = age, y = friend_count_mean), data = subset(pf.fc_by_age, age<71)) + geom_line() + geom_smooth()
p2 <- ggplot(aes(x = age_with_months, y = friend_count_mean), data = subset(pf.fc_by_age_months, age_with_months<71)) + geom_line() + geom_smooth()
p3 <- ggplot(aes(x = round(age/5)*5, y = friend_count), data = subset(pf, age<71)) + geom_line(stat='summary', fun.y = mean)
grid.arrange(p2, p1, p3, ncol=1)
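The third plot bins ages with round(age / 5) * 5. A small sketch with made-up values (the vectors `age` and `friend_count` are hypothetical) shows which bin each age lands in and the conditional mean per bin:

```r
# Hypothetical ages and friend counts.
age <- c(13, 17, 18, 22, 26)
friend_count <- c(100, 300, 200, 400, 50)

# Snap each age to the nearest multiple of 5.
age_bin <- round(age / 5) * 5
print(age_bin)

# Mean friend count within each 5-year bin: fewer, wider groups give
# a smoother but coarser summary than one group per year.
bin_means <- tapply(friend_count, age_bin, mean)
print(bin_means)
```

Note that R rounds exact halves to the even number, so ages that sit exactly between two bins are not always rounded up.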
Notes: there is no need to choose a single plot. Explore as many views of the data as are useful, since each can reveal something different.
Reflection: I learned how to evaluate the correlation between two variables. To understand the data better, I learned techniques for rescaling axes, smoothing, and overlaying plots. I also learned to manipulate the data to extract what is meaningful.