Notes:
library(ggplot2)
pf <- read.csv('pseudo_facebook.tsv', sep ='\t')
qplot(x=age, y=friend_count, data = pf)
ggplot(data=pf, aes(age,friend_count))+
geom_point()
Response: The highest friend counts belong mostly to users under 30, which suggests that younger users tend to have more friends. There are also spikes in friend count at age 69 and above age 90, which probably come from fake accounts or from teenagers who lied about their age. ***
Notes:
ggplot(data=pf,aes(age,friend_count))+geom_point()+xlim(13,90)
## Warning: Removed 4906 rows containing missing values (geom_point).
summary(pf$age)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 13.00 20.00 28.00 37.28 50.00 113.00
Notes:
ggplot(data=pf,aes(age,friend_count))+geom_point(alpha=(1/20))+xlim(13,90)
## Warning: Removed 4906 rows containing missing values (geom_point).
ggplot(data=pf,aes(age,friend_count))+geom_jitter(alpha=(1/20))+xlim(13,90)
## Warning: Removed 5189 rows containing missing values (geom_point).
Response: Friend counts for young users still reach the high values seen in the initial chart. However, this chart shows that the majority of young users have fewer than 1000 friends. Outliers or fake accounts are still evident at age 69. ***
Notes:
ggplot(data=pf,aes(age,friend_count))+geom_point(alpha=(1/20))+xlim(13,90)+coord_trans(y = "sqrt")
## Warning: Removed 4906 rows containing missing values (geom_point).
ggplot(data=pf,aes(age,friend_count))+geom_point(alpha=(1/20),position = position_jitter(h=0))+xlim(13,90)+coord_trans(y = "sqrt")
## Warning: Removed 5191 rows containing missing values (geom_point).
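A note on why the jitter above uses h = 0: vertical noise could push a user with 0 friends to a negative friend count, and a negative number has no real square root, so the point could not be placed on a sqrt-transformed axis. A minimal base-R sketch of the idea, with made-up counts and made-up noise values (both are assumptions for illustration, not values from pf):

```r
# Made-up friend counts, including users with zero friends.
friend_count <- c(0, 0, 1, 4, 9)

# Made-up vertical noise, standing in for what jittering the y values would add.
noise <- c(-0.3, 0.2, -0.1, 0.4, -0.2)
jittered <- friend_count + noise

# A jittered value below zero has no real square root, so sqrt() returns NaN
# (with a warning, suppressed here).
sqrt_jittered <- suppressWarnings(sqrt(jittered))
n_lost <- sum(is.nan(sqrt_jittered))

# Without vertical noise every count keeps a valid square root.
sqrt_plain <- sqrt(friend_count)

n_lost
sqrt_plain
```

Horizontal jitter is harmless here because age is not transformed, which is why only the height is set to 0.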
Notes:
ggplot(data = pf, aes(x = age, y = friendships_initiated))+
geom_point(alpha=1/10, position=position_jitter(h=0))+
xlim(13,90)+
coord_trans(y = "sqrt")
## Warning: Removed 5175 rows containing missing values (geom_point).
Notes:
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
age_groups <- group_by(pf, age)
pf.fc_by_age <- summarise(age_groups, friend_count_mean = mean(friend_count),friend_count_median = median(friend_count), n = n())
head(pf.fc_by_age, 20)
## # A tibble: 20 x 4
## age friend_count_mean friend_count_median n
## <int> <dbl> <dbl> <int>
## 1 13 164.7500 74.0 484
## 2 14 251.3901 132.0 1925
## 3 15 347.6921 161.0 2618
## 4 16 351.9371 171.5 3086
## 5 17 350.3006 156.0 3283
## 6 18 331.1663 162.0 5196
## 7 19 333.6921 157.0 4391
## 8 20 283.4991 135.0 3769
## 9 21 235.9412 121.0 3671
## 10 22 211.3948 106.0 3032
## 11 23 202.8426 93.0 4404
## 12 24 185.7121 92.0 2827
## 13 25 131.0211 62.0 3641
## 14 26 144.0082 75.0 2815
## 15 27 134.1473 72.0 2240
## 16 28 125.8354 66.0 2364
## 17 29 120.8182 66.0 1936
## 18 30 115.2080 67.5 1716
## 19 31 118.4599 63.0 1694
## 20 32 114.2800 63.0 1443
Create your plot!
ggplot(data = pf.fc_by_age)+
geom_line(mapping = aes(x = age, y = friend_count_mean))+
xlim(13,90)
## Warning: Removed 23 rows containing missing values (geom_path).
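The conditional means above can be cross-checked without dplyr. The sketch below uses base R's aggregate() on a tiny made-up data frame (toy and its values are assumptions for illustration; only the column names mirror pf):

```r
# Tiny made-up sample with the same column names as pf.
toy <- data.frame(
  age          = c(13, 13, 14, 14, 14),
  friend_count = c(10, 30, 0, 20, 100)
)

# One row per age: mean and median friend count, plus the group size.
fc_mean   <- aggregate(friend_count ~ age, data = toy, FUN = mean)
fc_median <- aggregate(friend_count ~ age, data = toy, FUN = median)
n_by_age  <- table(toy$age)

fc_mean
fc_median
n_by_age
```

For age 14 the mean (40) sits well above the median (20) because of the single user with 100 friends, the same kind of gap between friend_count_mean and friend_count_median that shows up in the table above.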
Notes:
ggplot(data=pf,aes(age,friend_count))+
geom_point(alpha=(1/20),position = position_jitter(h=0),color = "orange")+
xlim(13,90)+
coord_trans(y = "sqrt")+
geom_line(stat = 'summary', fun.y = mean)+
geom_line(stat = 'summary', fun.y = quantile, fun.args = list(probs = .1), linetype = 2, color ='blue')+
geom_line(stat = 'summary', fun.y = quantile, fun.args = list(probs = .9), linetype = 2, color ='blue')+
geom_line(stat = 'summary', fun.y = quantile, fun.args = list(probs = .5), color ='blue')
## Warning: Removed 4906 rows containing non-finite values (stat_summary).
## Warning: Removed 4906 rows containing non-finite values (stat_summary).
## Warning: Removed 4906 rows containing non-finite values (stat_summary).
## Warning: Removed 4906 rows containing non-finite values (stat_summary).
## Warning: Removed 5173 rows containing missing values (geom_point).
Response: Having more than 1000 friends is rare, even in the younger age groups. The 90th percentile stays well below the 1000 mark, which shows that 90% of users at each age have fewer than 1000 friends. ***
### Adding coord_cartesian
ggplot(data=pf,aes(age,friend_count))+
geom_point(alpha=(1/20),position = position_jitter(h=0),color = "orange")+
coord_cartesian(xlim=c(13,70), ylim=c(0,1000))+
geom_line(stat = 'summary', fun.y = mean)+
geom_line(stat = 'summary', fun.y = quantile, fun.args = list(probs = .1), linetype = 2, color ='blue')+
geom_line(stat = 'summary', fun.y = quantile, fun.args = list(probs = .9), linetype = 2, color ='blue')+
geom_line(stat = 'summary', fun.y = quantile, fun.args = list(probs = .5), color ='blue')
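The plot above zooms with coord_cartesian instead of setting scale limits. In ggplot2, limits set through xlim() or ylim() drop out-of-range rows before the summaries are computed, while coord_cartesian only changes the visible window. A minimal base-R sketch of what that difference would mean for a summary line, using made-up friend counts for a single age (an assumption for illustration):

```r
# Made-up friend counts for one age, with a single very popular user.
friend_count <- c(50, 100, 150, 200, 2500)

# Zooming (as coord_cartesian does): every value still feeds the summary.
mean_zoomed <- mean(friend_count)

# Limiting the scale (as ylim(0, 1000) would): values outside the limits
# are discarded first, so the summary is computed on what is left.
kept <- friend_count[friend_count >= 0 & friend_count <= 1000]
mean_limited <- mean(kept)

mean_zoomed   # 600
mean_limited  # 125
```

With a y limit of 1000, the mean line would be pulled down at every age that has users above 1000 friends, which is a reason to prefer coord_cartesian when the summary should describe all users.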
See the Instructor Notes of this video to download Moira’s paper on perceived audience size and to see the final plot.
Notes:
cor.test(pf$age, pf$friend_count, method = "pearson")
##
## Pearson's product-moment correlation
##
## data: pf$age and pf$friend_count
## t = -8.6268, df = 99001, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.03363072 -0.02118189
## sample estimates:
## cor
## -0.02740737
with(pf, cor.test(age, friend_count, method = "pearson"))
##
## Pearson's product-moment correlation
##
## data: age and friend_count
## t = -8.6268, df = 99001, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.03363072 -0.02118189
## sample estimates:
## cor
## -0.02740737
Look up the documentation for the cor.test function.
What’s the correlation between age and friend count? Round to three decimal places. Response: -0.027
Notes:
with(filter(pf,age <= 70), cor.test(age, friend_count))
##
## Pearson's product-moment correlation
##
## data: age and friend_count
## t = -52.592, df = 91029, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.1780220 -0.1654129
## sample estimates:
## cor
## -0.1717245
Notes:
with(filter(pf,age <= 70), cor.test(age, friend_count, method = "spearman"))
## Warning in cor.test.default(age, friend_count, method = "spearman"): Cannot
## compute exact p-value with ties
##
## Spearman's rank correlation rho
##
## data: age and friend_count
## S = 1.5782e+14, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## -0.2552934
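Spearman's rho is stronger than Pearson's r here, which is what one would expect when a relationship is monotonic but not linear: Pearson measures linear association, while Spearman works on ranks. A minimal sketch with made-up data (x and y below are assumptions for illustration, unrelated to pf):

```r
# A strictly increasing but strongly curved relationship.
x <- 1:20
y <- exp(x / 2)

# Pearson measures how well a straight line describes the relationship.
pearson <- cor(x, y, method = "pearson")

# Spearman only uses the ranks, so any strictly increasing relationship
# gives a perfect rank correlation.
spearman <- cor(x, y, method = "spearman")

pearson
spearman
```

Because the ranks of x and y agree exactly, spearman is 1 while pearson is clearly lower.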
Notes:
ggplot(data = pf)+
geom_point(mapping = aes(x = www_likes_received, y = likes_received), position = position_jitter(h = 0))
Notes:
ggplot(data = pf,aes(x = www_likes_received, y = likes_received))+
geom_jitter(alpha = (1/50), position = position_jitter(h = 0))+
coord_cartesian(xlim = c(0, quantile(pf$www_likes_received, 0.95)), ylim = c(0, quantile(pf$likes_received, 0.95)))+geom_smooth(method ='lm',color ='red')
What’s the correlation between the two variables? Include the top 5% of values for the variable in the calculation and round to 3 decimal places.
cor.test(pf$www_likes_received,pf$likes_received)
##
## Pearson's product-moment correlation
##
## data: pf$www_likes_received and pf$likes_received
## t = 937.1, df = 99001, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.9473553 0.9486176
## sample estimates:
## cor
## 0.9479902
Response: 0.948
Notes:
library(alr3)
## Loading required package: car
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
Create your plot!
ggplot(data = Mitchell, aes(Temp, Month))+
geom_point()
Take a guess for the correlation coefficient for the scatterplot. 0.04
What is the actual correlation of the two variables? (Round to the thousandths place) 0.057
with(Mitchell, cor.test(Temp, Month))
##
## Pearson's product-moment correlation
##
## data: Temp and Month
## t = 0.81816, df = 202, p-value = 0.4142
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.08053637 0.19331562
## sample estimates:
## cor
## 0.05747063
Notes:
# Month goes on the x axis so that breaks every 12 months mark the year boundaries
ggplot(data = Mitchell, aes(x = Month, y = Temp))+
geom_point()+
scale_x_continuous(breaks = seq(0,203,12))
ggplot(aes(x=(Month%%12),y=Temp), data=Mitchell)+
geom_point()
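Wrapping Month with %% 12 overlays every year on the same 12 positions, which can reveal a seasonal cycle that the overall correlation hides. A minimal sketch with a made-up, perfectly seasonal series (temp below is an assumption for illustration, not the Mitchell data; only the month range 0 to 203 mirrors it):

```r
# 204 consecutive months with a made-up yearly temperature cycle.
month <- 0:203
temp  <- 10 + 15 * sin(2 * pi * month / 12)

# The linear correlation across the whole series is close to zero ...
r_overall <- cor(month, temp)

# ... yet grouping by month of year shows a very regular pattern.
month_of_year <- month %% 12
mean_by_month <- tapply(temp, month_of_year, mean)

r_overall
round(mean_by_month, 1)
```

A correlation near zero therefore does not rule out a strong relationship; it only says the relationship is not linear.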
What do you notice? Response:
Watch the solution video and check out the Instructor Notes! Notes:
pf$age_with_months <- pf$age + (1 - pf$dob_month/12)
pf.fc_by_age_months <- pf %>% group_by(age_with_months) %>% summarise(friend_count_mean = mean(friend_count), friend_count_median = median(friend_count),n = n()) %>% arrange(age_with_months)
Programming Assignment
head(pf.fc_by_age_months, 10)
## # A tibble: 10 x 4
## age_with_months friend_count_mean friend_count_median n
## <dbl> <dbl> <dbl> <int>
## 1 13.16667 46.33333 30.5 6
## 2 13.25000 115.07143 23.5 14
## 3 13.33333 136.20000 44.0 25
## 4 13.41667 164.24242 72.0 33
## 5 13.50000 131.17778 66.0 45
## 6 13.58333 156.81481 64.0 54
## 7 13.66667 130.06522 75.5 46
## 8 13.75000 205.82609 122.0 69
## 9 13.83333 215.67742 111.0 62
## 10 13.91667 162.28462 71.0 130
ggplot(data = subset(pf.fc_by_age_months, age_with_months < 71), aes(x = age_with_months, y = friend_count_mean))+
geom_line()
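To see what the age_with_months formula does, it helps to work it through for a few users of the same age. The values below are made up for illustration; the formula is the one used above:

```r
# Three made-up users, all aged 36, born in March, September, and December.
age       <- c(36, 36, 36)
dob_month <- c(3, 9, 12)

# Same formula as above: the earlier in the year someone was born,
# the larger the fraction of a year added to their age.
age_with_months <- age + (1 - dob_month / 12)

age_with_months
# 36.75 36.25 36.00
```

The formula effectively measures age as of the end of the year, so a March birthday adds 9/12 of a year and a December birthday adds nothing.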
Notes:
p1 <- ggplot(data = subset(pf.fc_by_age, age < 71),aes(x = age, y = friend_count_mean))+
geom_line()+
geom_smooth()
p2 <- ggplot(data = subset(pf.fc_by_age_months, age_with_months < 71), aes(x = age_with_months, y = friend_count_mean))+
geom_line()+
geom_smooth()
p3 <- ggplot(data = subset(pf, age < 71),aes(x = round(age/5)*5, y = friend_count))+
geom_line(stat = "summary", fun.y = mean)
library(gridExtra)
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
##
## combine
grid.arrange(p1,p2,p3,ncol = 1)
## `geom_smooth()` using method = 'loess'
## `geom_smooth()` using method = 'loess'
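The third plot bins ages with round(age/5)*5, which trades detail for a smoother line: each mean is computed from more users, but differences within a 5-year bin disappear. A minimal sketch of the binning step on made-up values (the ages and friend counts are assumptions for illustration):

```r
# Made-up ages and friend counts.
age          <- c(13, 17, 18, 22, 23)
friend_count <- c(100, 300, 200, 400, 150)

# Snap each age to the nearest multiple of 5.
age_bin <- round(age / 5) * 5

# Mean friend count per bin, analogous to the summary line in the third plot.
mean_by_bin <- tapply(friend_count, age_bin, mean)

age_bin
# 15 15 20 20 25
mean_by_bin
```

Ages 13 and 17 share the bin at 15 even though their friend counts differ, which is the loss of detail that the monthly plot avoids at the cost of more noise.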
Notes:
Reflection:
Click KnitHTML to see all of your hard work and to have an html page of this lesson, your answers, and your notes!