Here I am simply loading the necessary libraries.
library(gridExtra)
library(ggplot2)
library(xkcd)
library(dplyr)
Next we set the working directory and load the .tsv (tab-separated values) file. I also call names() to list the variables in the loaded data frame.
The data frame we are using is Udacity’s “pseudo-Facebook” data. It was generated (so it’s not any actual Facebook user’s data), but it has the same schema as the true Facebook user base and similar values.
getwd()
setwd("/Users/Taylor/Downloads")
#list.files()
pf<- read.csv("pseudo_facebook.tsv",sep='\t')
names(pf)
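As a minimal sketch of what the read.csv call above is doing, the snippet below parses a small inline tab-separated string instead of a file. The column values are made up for illustration and are not taken from the real data set.

```r
# Hypothetical two-row sample in tab-separated form (not real data)
tsv_text <- "userid\tage\tfriend_count\n1\t14\t0\n2\t13\t25"

# sep = "\t" tells read.csv to split fields on tabs instead of commas
sample_df <- read.csv(text = tsv_text, sep = "\t")

names(sample_df)   # column names come from the header row
nrow(sample_df)    # one row per remaining line
```

Base R's read.delim() uses a tab separator by default, so it could be used in place of read.csv(..., sep = "\t").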
To investigate the relationship between two variables, it is often useful to plot them on a scatterplot. Here I plot each Facebook user's friend count (Y) against their age (X).
qplot(x=age, y=friend_count, data=pf) +
theme_xkcd() +
ylab("Friend Count") +
ggtitle("Friend Count vs. Age")
From here on we use the ggplot syntax rather than the qplot syntax. It is slightly more verbose, but also more powerful.
When there is overplotting (that is, points stacked on top of other points), it is difficult to see where the values are concentrated. We can use the alpha parameter to add transparency to each point. With alpha = 1/20, it takes roughly 20 overlapping points to reach full opacity.
summary(pf$age)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 13.00 20.00 28.00 37.28 50.00 113.00
ggplot(aes(x=age,y=friend_count),data=pf)+
geom_point(alpha=1/20) +
xlim(13,90) +
ylab("Friend Count") +
ggtitle("Friend Count vs. Age")+
theme_xkcd()
Note that since age is recorded in whole years (it is stored as an integer), we see very structured columns in our plot. Let’s change that by using geom_jitter, which adds a small amount of random noise to each point's position to give a more natural representation of age.
ggplot(aes(x=age,y=friend_count),data=pf)+
geom_jitter(alpha=1/20) + # use jitter to add noise to the age, since it's not truly discrete
xlim(13,90) +
ylab("Friend Count") +
ggtitle("Friend Count vs. Age")+
theme_xkcd()
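To make the idea of jitter concrete, here is a small sketch using base R's jitter() on a few made-up integer ages. It is only an illustration of the technique: geom_jitter applies its own noise (by default to both x and y), and the amount of 0.4 used here is an assumption chosen for the example.

```r
set.seed(1)                      # make the random noise reproducible
ages <- c(13, 13, 13, 14, 14)    # hypothetical integer ages

# jitter() with amount = 0.4 adds uniform noise in [-0.4, 0.4] to each value
jittered <- jitter(ages, amount = 0.4)

round(jittered, 2)               # values now spread around 13 and 14
```

Identical ages no longer sit in one column, yet each value stays close enough to its original that it is still clear which age it belongs to.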
Because of the large outliers, it is difficult to see the distribution of data. Let’s transform the Y-axis by taking its square root.
#lets change the y_axis layer
ggplot(aes(x=age,y=friend_count+1),data=pf)+
geom_jitter(alpha=1/20) + # use jitter to add noise to the age, since it's not truly discrete
xlim(13,90) +
coord_trans(y = "sqrt")+
ylab("Friend Count") +
ggtitle("Friend Count vs. Age") +
theme_xkcd()
ggplot(aes(x=age,y=friend_count),data=pf)+
geom_point(alpha=1/20, position= position_jitter(h =0)) + # jitter age only (h = 0), so friend_count never drops below zero before the sqrt transform
xlim(13,90) +
coord_trans(y = "sqrt")+
ylab("Friend Count") +
ggtitle("Friend Count vs. Age") +
theme_xkcd()
?coord_trans
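The two plots above handle the square-root transform differently: the first plots friend_count+1, the second sets the vertical jitter to zero. A small sketch with made-up counts shows what the transform does and why jittered values below zero are a problem, which is presumably the reason for both workarounds.

```r
counts <- c(0, 100, 400, 4900)   # hypothetical friend counts
roots <- sqrt(counts)            # 0 10 20 70: large values are pulled in

# A count of 0 that is jittered downward becomes negative,
# and a negative number has no real square root
neg_root <- suppressWarnings(sqrt(-0.5))
is.nan(neg_root)                 # TRUE
```

On the square-root scale the gap between 400 and 4900 shrinks from 4500 to 50, which is why the bulk of the data becomes easier to see.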
# Examine the relationship between
# friendships_initiated (y) and age (x)
# using the ggplot syntax.
# We recommend creating a basic scatter
# plot first to see what the distribution looks like.
# and then adjusting it by adding one layer at a time.
# What are your observations about your final plot?
# Remember to make adjustments to the breaks
# of the x-axis and to apply alpha and jitter.
# ENTER ALL OF YOUR CODE FOR YOUR PLOT BELOW THIS LINE.
# =======================================================
ggplot(aes(x=age, y=friendships_initiated), data=pf) +
geom_point(position=position_jitter(h=0),
alpha=1/10,
color="#fcae3a") +
xlim(13,90) +
coord_trans(y="sqrt") +
ylab("Friendships Initiated") +
ggtitle("Friendships Initiated vs. Age") +
theme_xkcd()
Conditional means let us group users and then average within each group to reveal trends. Here we investigate the average friend count for users of each age.
The dplyr package does the heavy lifting for these operations.
age_groups <-group_by(pf,age)
age_groups
## Source: local data frame [99,003 x 15]
## Groups: age [101]
##
## userid age dob_day dob_year dob_month gender tenure friend_count
## (int) (int) (int) (int) (int) (fctr) (int) (int)
## 1 2094382 14 19 1999 11 male 266 0
## 2 1192601 14 2 1999 11 female 6 0
## 3 2083884 14 16 1999 11 male 13 0
## 4 1203168 14 25 1999 12 female 93 0
## 5 1733186 14 4 1999 12 male 82 0
## 6 1524765 14 1 1999 12 male 15 0
## 7 1136133 13 14 2000 1 male 12 0
## 8 1680361 13 4 2000 1 female 0 0
## 9 1365174 13 1 2000 1 male 81 0
## 10 1712567 13 2 2000 2 male 171 0
## .. ... ... ... ... ... ... ... ...
## Variables not shown: friendships_initiated (int), likes (int),
## likes_received (int), mobile_likes (int), mobile_likes_received (int),
## www_likes (int), www_likes_received (int)
pf.fc_by_age <- summarise(age_groups,
friend_count_mean = mean(friend_count),
friend_count_median = median(friend_count),
n=n())
pf.fc_by_age
## Source: local data frame [101 x 4]
##
## age friend_count_mean friend_count_median n
## (int) (dbl) (dbl) (int)
## 1 13 164.7500 74.0 484
## 2 14 251.3901 132.0 1925
## 3 15 347.6921 161.0 2618
## 4 16 351.9371 171.5 3086
## 5 17 350.3006 156.0 3283
## 6 18 331.1663 162.0 5196
## 7 19 333.6921 157.0 4391
## 8 20 283.4991 135.0 3769
## 9 21 235.9412 121.0 3671
## 10 22 211.3948 106.0 3032
## .. ... ... ... ...
head(pf.fc_by_age)
## Source: local data frame [6 x 4]
##
## age friend_count_mean friend_count_median n
## (int) (dbl) (dbl) (int)
## 1 13 164.7500 74.0 484
## 2 14 251.3901 132.0 1925
## 3 15 347.6921 161.0 2618
## 4 16 351.9371 171.5 3086
## 5 17 350.3006 156.0 3283
## 6 18 331.1663 162.0 5196
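For readers who want to see the idea without dplyr, the following sketch computes a conditional mean in base R with tapply(). The toy data frame is hypothetical and only mirrors the two column names used above.

```r
# Hypothetical toy data: three ages, a few users each
toy <- data.frame(age = c(13, 13, 14, 14, 14, 15),
                  friend_count = c(10, 30, 0, 60, 120, 50))

# Mean friend count conditional on age: split by age, then average each group
mean_by_age <- tapply(toy$friend_count, toy$age, mean)
mean_by_age   # 13 -> 20, 14 -> 60, 15 -> 50
```

This is the same "group, then summarise" pattern as group_by() followed by summarise(), just returning a named vector instead of a data frame.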
This is another syntax for doing the same thing we did above. The %>% operator chains the calls, passing the result on its left as the first argument of the function on its right.
pf.fc_by_age <- pf %>%
group_by(age) %>%
summarise(friend_count_mean = mean(friend_count),
friend_count_median = median(friend_count),
n = n()) %>%
arrange(age)
pf.fc_by_age
## Source: local data frame [101 x 4]
##
## age friend_count_mean friend_count_median n
## (int) (dbl) (dbl) (int)
## 1 13 164.7500 74.0 484
## 2 14 251.3901 132.0 1925
## 3 15 347.6921 161.0 2618
## 4 16 351.9371 171.5 3086
## 5 17 350.3006 156.0 3283
## 6 18 331.1663 162.0 5196
## 7 19 333.6921 157.0 4391
## 8 20 283.4991 135.0 3769
## 9 21 235.9412 121.0 3671
## 10 22 211.3948 106.0 3032
## .. ... ... ... ...
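As a quick check that the nested and chained forms really are the same computation, this sketch runs both on a small hypothetical data frame and compares the results.

```r
library(dplyr)

# Hypothetical toy data with the same column names as pf
toy <- data.frame(age = c(13, 13, 14, 14),
                  friend_count = c(10, 30, 0, 60))

# Nested form: the grouped data is the first argument of summarise()
nested <- summarise(group_by(toy, age),
                    friend_count_mean = mean(friend_count))

# Chained form: %>% feeds each result into the next call
chained <- toy %>%
  group_by(age) %>%
  summarise(friend_count_mean = mean(friend_count))

identical(as.data.frame(nested), as.data.frame(chained))   # TRUE
```

The chained form reads top to bottom in the order the steps happen, which is the main reason to prefer it for longer pipelines.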
ggplot(pf.fc_by_age, aes(x=age, y=friend_count_mean)) +
geom_line(colour="#8224e3") +
xlim(13,90) +
ylab("Average Friend Count") +
xlab("Age of User") +
ggtitle("Average Facebook Friend Count By Age") +
theme_xkcd()
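It’s great to see the raw data, but a trend is easier to read when summary lines are overlaid on it. The next plots add the mean and the 10th and 90th percentiles of friend count at each age, and then the median.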
ggplot(aes(x=age, y=friend_count), data=pf) +
geom_point(position=position_jitter(h=0),
alpha=1/10,
color="#fcae3a") +
coord_trans(y="sqrt") +
# fun and fun.args need ggplot2 >= 3.3.0; older releases spell the first one fun.y
geom_line(stat='summary', fun=mean, color="#8224e3")+ # mean
geom_line(stat='summary', fun=quantile, fun.args=list(probs=.1),
linetype=2, color="#dd3333") + # 10th percentile
geom_line(stat='summary', fun=quantile, fun.args=list(probs=.9),
linetype=2, color="#359bed") + # 90th percentile
scale_colour_manual(values=c("#8224e3","#dd3333","#359bed")) +
# limits and breaks share one x scale, because a later x scale replaces an earlier xlim()
scale_x_continuous(limits=c(13,90), breaks=seq(from=15,to=90,by=5))+
ggtitle("Average Friend Count By Age")+
ylab("Number of Friends") +
xlab("Age of User")+
theme_xkcd()
ggplot(aes(x=age, y=friend_count), data=pf) +
geom_point(position=position_jitter(h=0),
alpha=1/10,
color="#fcae3a") +
# fun and fun.args need ggplot2 >= 3.3.0; older releases spell the first one fun.y
geom_line(stat='summary', fun=mean, color="#8224e3")+ # mean
geom_line(stat='summary', fun=quantile, fun.args=list(probs=.1),
linetype=2, color="#dd3333") + # 10th percentile
geom_line(stat='summary', fun=quantile, fun.args=list(probs=.9),
linetype=2, color="#359bed") + # 90th percentile
geom_line(stat='summary', fun=median, color="#81d742")+ # median
scale_colour_manual(values=c("#8224e3","#dd3333","#359bed","#81d742")) +
scale_x_continuous(breaks=seq(from=15,to=70,by=5))+
# coord_cartesian zooms without dropping data; a plot keeps only one coordinate system,
# so the sqrt coord_trans from the previous plot is left out here
coord_cartesian(xlim=c(13,70), ylim=c(0,1000))+
ggtitle("Average Friend Count By Age")+
ylab("Number of Friends") +
xlab("Age of User")+
theme_xkcd()
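The dashed lines in these plots come from calling quantile() on the friend counts at each age. As a minimal sketch of what that call returns, here it is on a hypothetical vector where the answer is easy to verify by eye.

```r
x <- 1:101   # hypothetical values, already sorted

# 10th percentile, median and 90th percentile (default quantile type)
q <- quantile(x, probs = c(0.1, 0.5, 0.9))
q            # 10%: 11, 50%: 51, 90%: 91
```

Roughly 10% of the values lie below the lower dashed line and 10% above the upper one, so the band between them covers the middle 80% of users at each age.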
?cor.test
cor.test(pf$age,pf$friend_count,method = "pearson", alternative = "greater",exact = FALSE)
##
## Pearson's product-moment correlation
##
## data: pf$age and pf$friend_count
## t = -8.6268, df = 99001, p-value = 1
## alternative hypothesis: true correlation is greater than 0
## 95 percent confidence interval:
## -0.03263034 1.00000000
## sample estimates:
## cor
## -0.02740737
You can also calculate the coefficient using the with() function, here restricted to users aged 70 or younger:
with(subset(pf,pf$age<=70),cor.test(age,friend_count,method="pearson"))
##
## Pearson's product-moment correlation
##
## data: age and friend_count
## t = -52.592, df = 91029, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.1780220 -0.1654129
## sample estimates:
## cor
## -0.1717245
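Two details in the outputs above are worth unpacking: what the Pearson coefficient actually is, and why the first test reports a p-value of 1. The sketch below uses a small hypothetical data set with a negative relationship. It computes r by hand, then shows that a one-sided test with alternative = "greater" yields a large p-value whenever the estimate is negative.

```r
x <- c(1, 2, 3, 4, 5, 6)
y <- c(9, 7, 8, 4, 5, 2)      # hypothetical values that fall as x rises

# Pearson's r is the covariance scaled by both standard deviations
r_manual  <- cov(x, y) / (sd(x) * sd(y))
r_builtin <- cor(x, y, method = "pearson")

# "greater" asks whether r > 0, so a negative estimate gives a p-value near 1
p_greater   <- cor.test(x, y, alternative = "greater")$p.value
p_two_sided <- cor.test(x, y)$p.value   # default: two-sided
```

When the direction of the relationship is not known in advance, the two-sided default used in the second call is the safer choice.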
As you might guess, these two variables are going to be highly correlated: it is a subset/superset relationship.
Any time the number of web likes a user receives increases, so does their total likes received, so we should expect a correlation coefficient close to 1. Note that a high correlation does not by itself fix the slope of the fitted line.
# Create a scatterplot of likes_received (y)
# vs. www_likes_received (x). Use any of the
# techniques that you've learned so far to
# modify the plot.
# ENTER ALL OF YOUR CODE TO CREATE THE PLOT BELOW THIS LINE.
# ===========================================================
ggplot(aes(x=www_likes_received, y=likes_received), data=pf) +
geom_point(
alpha=1/3,
color="#81d742") +
#coord_trans(x = "sqrt", y = "sqrt") +
xlim(0, quantile(pf$www_likes_received,.95)) +
ylim(0,quantile(pf$likes_received,.95)) +
geom_smooth(method="lm", color="#dd3333") +
#scale_y_continuous(breaks=seq(from=0,to=600,by=50))+
#coord_cartesian(xlim=c(13,70), ylim=c(0,1000))+
ggtitle("Correlation Between Web Likes and Total Likes")+
ylab("Total Likes Received") +
xlab("Web Likes Received")+
theme_xkcd()
with(pf,cor.test(www_likes_received,likes_received,method="pearson"))
##
## Pearson's product-moment correlation
##
## data: www_likes_received and likes_received
## t = 937.1, df = 99001, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.9473553 0.9486176
## sample estimates:
## cor
## 0.9479902
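To see why a part is always positively correlated with the whole that contains it, here is a small simulation. It assumes that total likes are simply web likes plus mobile likes and that the two parts are independent Poisson counts. Both are assumptions made for the sketch, not facts about the real data.

```r
set.seed(42)
n <- 1000
web    <- rpois(n, lambda = 20)   # hypothetical web likes received
mobile <- rpois(n, lambda = 20)   # hypothetical mobile likes received
total  <- web + mobile            # the total always contains the web part

r     <- cor(web, total)                  # clearly positive
slope <- coef(lm(total ~ web))[["web"]]   # close to 1 when the parts are independent
```

With independent parts of equal variance, theory gives a correlation of 1/sqrt(2), about 0.71. The much higher value of 0.948 in the real output suggests that web likes account for most of the variation in total likes.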
library('alr3')
data(Mitchell)
?Mitchell
names(Mitchell)
## [1] "Month" "Temp"
# Create a scatterplot of temperature (Temp)
# vs. months (Month).
# ENTER ALL OF YOUR CODE TO CREATE THE PLOT BELOW THIS LINE.
# ===========================================================
ggplot(aes(x=Month, y=Temp), data=Mitchell) +
geom_point() +
theme_xkcd() +
ggtitle("Mitchell Dataset: Soil Temperature by month")
ggplot(aes(x=Month%%12, y=Temp), data=Mitchell) +
geom_point() +
theme_xkcd() +
xlim(0,12) +
ggtitle("Mitchell Dataset: Soil Temperature by month")
#Actual answer
ggplot(aes(x=Month, y=Temp), data=Mitchell) +
geom_point() +
scale_x_continuous(breaks = seq(0,203,12)) + # Month is numeric; one break every 12 months
theme_xkcd() +
ggtitle("Mitchell Dataset: Soil Temperature by month")
cor.test(x=Mitchell$Month,y=Mitchell$Temp)
##
## Pearson's product-moment correlation
##
## data: Mitchell$Month and Mitchell$Temp
## t = 0.81816, df = 202, p-value = 0.4142
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.08053637 0.19331562
## sample estimates:
## cor
## 0.05747063
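The near-zero correlation above does not mean there is no pattern, only that there is no linear one. As a hedged illustration, the sketch below builds a perfectly seasonal but entirely hypothetical temperature signal over the same 204 months and shows that its linear correlation with the month index is still close to zero.

```r
month <- 0:203                       # same span as the 204 monthly rows
temp  <- sin(2 * pi * month / 12)    # hypothetical signal with a 12-month cycle

r_linear <- cor(month, temp)         # near 0 despite the obvious pattern
r_linear
```

This is why plotting Month%%12, as done above, is useful: folding the series onto a single year exposes a cycle that the correlation coefficient cannot see.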
ggplot(pf.fc_by_age, aes(x=age, y=friend_count_mean)) +
geom_line(colour="#8224e3") +
xlim(13,90) +
ylab("Average Friend Count") +
xlab("Age of User") +
ggtitle("Average Facebook Friend Count By Age") +
theme_xkcd()
# Create a new variable, 'age_with_months', in the 'pf' data frame.
# Be sure to save the variable in the data frame rather than creating
# a separate, stand-alone variable. You will need to use the variables
# 'age' and 'dob_month' to create the variable 'age_with_months'.
# Assume the reference date for calculating age is December 31, 2013.
# This programming assignment WILL BE automatically graded. For
# this exercise, you need only create the 'age_with_months' variable;
# no further processing of the data frame is necessary.
pf$age_with_months <-pf$age + (1 - pf$dob_month / 12)
pf$dob_month
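A quick worked example of the formula above, using a few hypothetical users. With December 31 as the reference date, someone born in December has only just had a birthday, while someone born in January is almost a full year past theirs.

```r
age       <- c(14, 13, 13)    # whole-year ages (hypothetical rows)
dob_month <- c(11, 1, 12)     # birth months: November, January, December

# Fraction of a year since the last birthday, measured at year end
age_with_months <- age + (1 - dob_month / 12)
age_with_months               # 14.08333 13.91667 13.00000
```

The fractional part advances in steps of 1/12, which is why the grouped output shows values such as 13.16667 and 13.25.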
byAge<-ggplot(subset(pf.fc_by_age,pf.fc_by_age$age<71), aes(x=age, y=friend_count_mean)) +
geom_line(colour="#8224e3") +
xlim(13,71) +
ylab("Average Friend Count") +
xlab("Age of User") +
scale_x_continuous(breaks=seq(from=13,to=71,by=3))+
ggtitle("Average Facebook Friend Count By Age") +
theme_xkcd()
byAge
age_groups_by_month <-group_by(pf,age_with_months)
age_groups_by_month
pf.fc_by_age_months <- summarise(age_groups_by_month,
friend_count_mean = mean(friend_count),
friend_count_median = median(friend_count),
n=n())
pf.fc_by_age_months
## Source: local data frame [1,194 x 4]
##
## age_with_months friend_count_mean friend_count_median n
## (dbl) (dbl) (dbl) (int)
## 1 13.16667 46.33333 30.5 6
## 2 13.25000 115.07143 23.5 14
## 3 13.33333 136.20000 44.0 25
## 4 13.41667 164.24242 72.0 33
## 5 13.50000 131.17778 66.0 45
## 6 13.58333 156.81481 64.0 54
## 7 13.66667 130.06522 75.5 46
## 8 13.75000 205.82609 122.0 69
## 9 13.83333 215.67742 111.0 62
## 10 13.91667 162.28462 71.0 130
## .. ... ... ... ...
byAgeMonths<-ggplot(subset(pf.fc_by_age_months,pf.fc_by_age_months$age_with_months<71), aes(x=age_with_months, y=friend_count_mean)) +
geom_line(colour="#8224e3") +
xlim(13,71) +
ylab("Average Friend Count") +
xlab("Age of User by Months") +
scale_x_continuous(breaks=seq(from=13,to=71,by=3))+
ggtitle("Average Facebook Friend Count By Age In Months") +
theme_xkcd()
byAgeMonths
byAge5years<-ggplot(subset(pf,age<71), aes(x=round(age/5)*5, y=friend_count)) +
geom_line(colour="#8224e3",stat="summary", fun=mean) + # fun replaces the deprecated fun.y (ggplot2 >= 3.3.0)
xlim(13,71) +
ylab("Average Friend Count") +
xlab("Age of User by 5 Years") +
scale_x_continuous(breaks=seq(from=15,to=71,by=5))+
ggtitle("Average Facebook Friend Count By Age") +
theme_xkcd()
byAge5years
grid.arrange(byAge,byAgeMonths,byAge5years,ncol=1)
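The three stacked plots differ only in bin width: one month, one year, and five years. The five-year version relies on the expression round(age/5)*5, and this sketch shows what it does to a few hypothetical ages.

```r
ages <- c(13, 17, 18, 22, 23, 27)   # hypothetical ages

# Snap each age to the nearest multiple of 5
binned <- round(ages / 5) * 5
binned                              # 15 15 20 20 25 25
```

Wider bins average over more users, so the line is smoother but hides detail. Narrower bins keep the detail but are noisier, because each mean is based on fewer users.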
byAge<-ggplot(subset(pf.fc_by_age,pf.fc_by_age$age<71), aes(x=age, y=friend_count_mean)) +
geom_line(colour="#8224e3") +
geom_smooth() +
xlim(13,71) +
ylab("Average Friend Count") +
xlab("Age of User") +
scale_x_continuous(breaks=seq(from=13,to=71,by=3))+
ggtitle("Average Facebook Friend Count By Age") +
theme_xkcd()
byAge
age_groups_by_month <-group_by(pf,age_with_months)
age_groups_by_month
## Source: local data frame [99,003 x 16]
## Groups: age_with_months [1194]
##
## userid age dob_day dob_year dob_month gender tenure friend_count
## (int) (int) (int) (int) (int) (fctr) (int) (int)
## 1 2094382 14 19 1999 11 male 266 0
## 2 1192601 14 2 1999 11 female 6 0
## 3 2083884 14 16 1999 11 male 13 0
## 4 1203168 14 25 1999 12 female 93 0
## 5 1733186 14 4 1999 12 male 82 0
## 6 1524765 14 1 1999 12 male 15 0
## 7 1136133 13 14 2000 1 male 12 0
## 8 1680361 13 4 2000 1 female 0 0
## 9 1365174 13 1 2000 1 male 81 0
## 10 1712567 13 2 2000 2 male 171 0
## .. ... ... ... ... ... ... ... ...
## Variables not shown: friendships_initiated (int), likes (int),
## likes_received (int), mobile_likes (int), mobile_likes_received (int),
## www_likes (int), www_likes_received (int), age_with_months (dbl)
pf.fc_by_age_months <- summarise(age_groups_by_month,
friend_count_mean = mean(friend_count),
friend_count_median = median(friend_count),
n=n())
pf.fc_by_age_months
## Source: local data frame [1,194 x 4]
##
## age_with_months friend_count_mean friend_count_median n
## (dbl) (dbl) (dbl) (int)
## 1 13.16667 46.33333 30.5 6
## 2 13.25000 115.07143 23.5 14
## 3 13.33333 136.20000 44.0 25
## 4 13.41667 164.24242 72.0 33
## 5 13.50000 131.17778 66.0 45
## 6 13.58333 156.81481 64.0 54
## 7 13.66667 130.06522 75.5 46
## 8 13.75000 205.82609 122.0 69
## 9 13.83333 215.67742 111.0 62
## 10 13.91667 162.28462 71.0 130
## .. ... ... ... ...
byAgeMonths<-ggplot(subset(pf.fc_by_age_months,pf.fc_by_age_months$age_with_months<71), aes(x=age_with_months, y=friend_count_mean)) +
geom_line(colour="#8224e3") +
geom_smooth() +
xlim(13,71) +
ylab("Average Friend Count") +
xlab("Age of User by Months") +
scale_x_continuous(breaks=seq(from=13,to=71,by=3))+
ggtitle("Average Facebook Friend Count By Age In Months") +
theme_xkcd()
byAgeMonths
byAge5years<-ggplot(subset(pf,age<71), aes(x=round(age/5)*5, y=friend_count)) +
geom_line(colour="#8224e3",stat="summary", fun=mean) + # fun replaces the deprecated fun.y (ggplot2 >= 3.3.0)
xlim(13,71) +
ylab("Average Friend Count") +
xlab("Age of User by 5 Years") +
scale_x_continuous(breaks=seq(from=15,to=71,by=5))+
ggtitle("Average Facebook Friend Count By Age") +
theme_xkcd()
byAge5years
grid.arrange(byAge,byAgeMonths,byAge5years,ncol=1)