Setup

General

Here I am simply loading the necessary libraries.

library(gridExtra)
library(ggplot2)
library(xkcd)
library(dplyr)

Load Data Setup

Next we set the working directory and load the .tsv (tab-separated values) file. I also call names() to list the variables in the loaded data frame.

The data frame we are using is Udacity’s “pseudo Facebook” data. This data was generated (so it’s not any actual Facebook user’s data), but it has the same schema and similar values to the true Facebook user base.

getwd()
setwd("/Users/Taylor/Downloads")
#list.files()
pf <- read.csv("pseudo_facebook.tsv", sep = "\t")
names(pf)
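
As a self-contained sketch of the loading step, the snippet below parses a small inline tab-separated sample instead of the real file. The three rows and their values are made up for illustration; only the column names come from the dataset. It uses read.delim(), base R's tab-separated counterpart to read.csv().

```r
# Toy stand-in for pseudo_facebook.tsv: the values are invented,
# only the column names are taken from the real dataset.
tsv_text <- "userid\tage\tfriend_count\n1\t14\t0\n2\t28\t120\n3\t50\t35\n"

# read.delim() assumes tab-separated input with a header row
toy <- read.delim(text = tsv_text)

names(toy)  # column names, as names(pf) does for the full data
str(toy)    # column types and first values
```

Checking str() right after loading is a cheap way to confirm that numeric columns such as age were not read in as text.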

Scatterplot

Friend Count vs. Age

To investigate the relationship between two variables, it is often useful to plot the values on a scatterplot. Here I plot each Facebook user’s friend count (y) against their age (x).

qplot(x=age, y=friend_count, data=pf) +
  theme_xkcd() + 
  ylab("Friend Count") +
  ggtitle("Friend Count vs. Age")

Next we use the ggplot() syntax rather than qplot(). It is slightly more verbose, but also more powerful.

When there is overplotting (that is, points stacked on top of other points), it is difficult to see where the values are concentrated. We can use the alpha parameter to add transparency to each point. With alpha = 1/20, it takes 20 overlapping points to reach full opacity.

summary(pf$age)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   13.00   20.00   28.00   37.28   50.00  113.00

ggplot(aes(x=age,y=friend_count),data=pf)+
  geom_point(alpha=1/20) +
  xlim(13,90) +
  ylab("Friend Count") +
  ggtitle("Friend Count vs. Age")+
  theme_xkcd()

Note that since age is recorded in whole years (it is stored as an integer), we see very structured columns in our plot. Let’s change that by using geom_jitter, which slightly alters the position of each point to give a more natural representation of age.

ggplot(aes(x=age,y=friend_count),data=pf)+
  geom_jitter(alpha=1/20) + # use jitter to add noise to age, since it's not truly discrete
  xlim(13,90) +
  ylab("Friend Count") +
  ggtitle("Friend Count vs. Age")+
  theme_xkcd()
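
To make the effect of jittering concrete, here is a minimal base R sketch on made-up ages. It uses jitter() with an explicit amount of 0.4; geom_jitter() picks its own default amount relative to the data resolution, so treat the 0.4 here as an assumption chosen for illustration.

```r
set.seed(42)                       # make the random noise reproducible
ages <- c(13, 13, 14, 14, 14, 15)  # made-up integer ages

# add uniform noise in [-0.4, 0.4] to every value
jittered <- jitter(ages, amount = 0.4)

print(round(jittered, 2))

# the noise is smaller than 0.5, so rounding recovers the original ages
print(all(round(jittered) == ages))
```

Because the noise never exceeds half a year, the jittered points still sit closest to their true age, which is why the columns blur without changing the overall shape.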

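Because of the large outliers, it is difficult to see the distribution of the data. Let’s transform the y-axis with a square root. The first plot below adds 1 to friend_count, which keeps vertically jittered points above zero where the square root is defined; the second jitters only horizontally, so no shift is needed.

# change the y-axis with a coordinate transformation
ggplot(aes(x=age,y=friend_count+1),data=pf)+
  geom_jitter(alpha=1/20) + # use jitter to add noise to age, since it's not truly discrete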
  xlim(13,90) +
  coord_trans(y = "sqrt")+
  ylab("Friend Count") +
  ggtitle("Friend Count vs. Age") +
  theme_xkcd()

ggplot(aes(x=age,y=friend_count),data=pf)+
  geom_point(alpha=1/20, position=position_jitter(height=0)) + # jitter age only; height=0 leaves friend_count untouched
  xlim(13,90) +
  coord_trans(y = "sqrt")+
  ylab("Friend Count") +
  ggtitle("Friend Count vs. Age") +
  theme_xkcd()

?coord_trans
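
The reason for jittering only horizontally under a square-root axis can be seen in a few lines of base R. The counts and the noise values below are made up; the noise vector is a fixed stand-in for vertical jitter so that the result is reproducible.

```r
friend_count <- c(0, 0, 1, 4, 9)        # made-up counts
noise <- c(-0.3, 0.2, -0.1, 0.3, -0.2)  # fixed stand-in for vertical jitter

# jittering y can push zero counts below zero, where sqrt is undefined
shifted <- friend_count + noise
bad <- suppressWarnings(sqrt(shifted))
print(bad)   # the first element is NaN

# leaving y untouched keeps every value valid
good <- sqrt(friend_count)
print(good)
```

Users with zero friends are common in this data, so any vertical noise would turn a noticeable share of points into missing values after the transformation.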

Friendships Initiated vs. Age

# Examine the relationship between
# friendships_initiated (y) and age (x)
# using the ggplot syntax.

# We recommend creating a basic scatter
# plot first to see what the distribution looks like.
# and then adjusting it by adding one layer at a time.

# What are your observations about your final plot?

# Remember to make adjustments to the breaks
# of the x-axis and to apply alpha and jitter.

# ENTER ALL OF YOUR CODE FOR YOUR PLOT BELOW THIS LINE.
# =======================================================

ggplot(aes(x=age, y=friendships_initiated), data=pf) +
  geom_point(position=position_jitter(height=0),
             alpha=1/10,
             color="#fcae3a") +
  xlim(13,90) +
  coord_trans(y="sqrt") + 
  ylab("Friendships Initiated") +
  ggtitle("Friendships Initiated vs. Age") +
  theme_xkcd()

Conditional Means

Conditional means let us group users and then average within each group to see trends. Here we investigate the average friend count for users of different ages.

The dplyr package provides the grouping and summarising functions used for these operations.

Grouping By Age

age_groups <-group_by(pf,age)
age_groups

## Source: local data frame [99,003 x 15]
## Groups: age [101]
## 
##     userid   age dob_day dob_year dob_month gender tenure friend_count
##      (int) (int)   (int)    (int)     (int) (fctr)  (int)        (int)
## 1  2094382    14      19     1999        11   male    266            0
## 2  1192601    14       2     1999        11 female      6            0
## 3  2083884    14      16     1999        11   male     13            0
## 4  1203168    14      25     1999        12 female     93            0
## 5  1733186    14       4     1999        12   male     82            0
## 6  1524765    14       1     1999        12   male     15            0
## 7  1136133    13      14     2000         1   male     12            0
## 8  1680361    13       4     2000         1 female      0            0
## 9  1365174    13       1     2000         1   male     81            0
## 10 1712567    13       2     2000         2   male    171            0
## ..     ...   ...     ...      ...       ...    ...    ...          ...
## Variables not shown: friendships_initiated (int), likes (int),
##   likes_received (int), mobile_likes (int), mobile_likes_received (int),
##   www_likes (int), www_likes_received (int)

pf.fc_by_age <- summarise(age_groups,
                          friend_count_mean = mean(friend_count),
                          friend_count_median = median(friend_count),
                          n = n())
pf.fc_by_age

## Source: local data frame [101 x 4]
## 
##      age friend_count_mean friend_count_median     n
##    (int)             (dbl)               (dbl) (int)
## 1     13          164.7500                74.0   484
## 2     14          251.3901               132.0  1925
## 3     15          347.6921               161.0  2618
## 4     16          351.9371               171.5  3086
## 5     17          350.3006               156.0  3283
## 6     18          331.1663               162.0  5196
## 7     19          333.6921               157.0  4391
## 8     20          283.4991               135.0  3769
## 9     21          235.9412               121.0  3671
## 10    22          211.3948               106.0  3032
## ..   ...               ...                 ...   ...

head(pf.fc_by_age)

## Source: local data frame [6 x 4]
## 
##     age friend_count_mean friend_count_median     n
##   (int)             (dbl)               (dbl) (int)
## 1    13          164.7500                74.0   484
## 2    14          251.3901               132.0  1925
## 3    15          347.6921               161.0  2618
## 4    16          351.9371               171.5  3086
## 5    17          350.3006               156.0  3283
## 6    18          331.1663               162.0  5196

This is another syntax for doing the same thing we did above. It uses the %>% operator to chain the calls, passing the result of each step as the first argument of the next.

pf.fc_by_age <- pf %>%
  group_by(age) %>%
  summarise(friend_count_mean = mean(friend_count),
            friend_count_median = median(friend_count),
            n = n()) %>%
  arrange(age)
pf.fc_by_age

## Source: local data frame [101 x 4]
## 
##      age friend_count_mean friend_count_median     n
##    (int)             (dbl)               (dbl) (int)
## 1     13          164.7500                74.0   484
## 2     14          251.3901               132.0  1925
## 3     15          347.6921               161.0  2618
## 4     16          351.9371               171.5  3086
## 5     17          350.3006               156.0  3283
## 6     18          331.1663               162.0  5196
## 7     19          333.6921               157.0  4391
## 8     20          283.4991               135.0  3769
## 9     21          235.9412               121.0  3671
## 10    22          211.3948               106.0  3032
## ..   ...               ...                 ...   ...

ggplot(pf.fc_by_age, aes(x=age, y=friend_count_mean)) +
  geom_line(colour="#8224e3") +
  xlim(13,90) +
  ylab("Average Friend Count") +
  xlab("Age of User") + 
  ggtitle("Average Facebook Friend Count By Age") + 
  theme_xkcd()
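
As a cross-check of what the grouped summary computes, here is a minimal base R sketch on a made-up five-row data frame. It uses tapply() rather than dplyr, purely so the example runs without extra packages; the column names mirror those used above.

```r
# Made-up data: two users aged 13, three users aged 14
toy <- data.frame(
  age          = c(13, 13, 14, 14, 14),
  friend_count = c(10, 30, 0, 60, 90)
)

# mean, median and group size of friend_count, conditional on age
fc_mean   <- tapply(toy$friend_count, toy$age, mean)
fc_median <- tapply(toy$friend_count, toy$age, median)
fc_n      <- tapply(toy$friend_count, toy$age, length)

by_age <- data.frame(
  age                 = as.numeric(names(fc_mean)),
  friend_count_mean   = as.vector(fc_mean),
  friend_count_median = as.vector(fc_median),
  n                   = as.vector(fc_n)
)
print(by_age)
```

Each row of by_age plays the role of one row of pf.fc_by_age: one age, with the statistics computed only over users of that age.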

Overlaying Summaries with Raw Data

It’s great to see the raw data, but it is even better to overlay summary lines on top of it: the mean, the 10th and 90th percentiles, and then the median.

ggplot(aes(x=age, y=friend_count), data=pf) +
  geom_point(position=position_jitter(height=0),
             alpha=1/10,
             color="#fcae3a") +
  coord_trans(y="sqrt") + 
  # fun replaces the deprecated fun.y; extra arguments are passed through fun.args
  geom_line(stat='summary', fun=mean, color="#8224e3")+
  geom_line(stat='summary', fun=quantile, fun.args=list(probs=.1),
            linetype=2, color="#dd3333") +
  geom_line(stat='summary', fun=quantile, fun.args=list(probs=.9),
            linetype=2, color="#359bed") +
  # limits are set on the scale itself, since a second x scale would replace xlim()
  scale_x_continuous(breaks=seq(from=15,to=70,by=5), limits=c(13,90))+
  ggtitle("Average Friend Count By Age")+
  ylab("Number of Friends") +
  xlab("Age of User")+
  theme_xkcd()

ggplot(aes(x=age, y=friend_count), data=pf) +
  geom_point(position=position_jitter(height=0),
             alpha=1/10,
             color="#fcae3a") +
  # fun replaces the deprecated fun.y; extra arguments are passed through fun.args
  geom_line(stat='summary', fun=mean, color="#8224e3")+
  geom_line(stat='summary', fun=quantile, fun.args=list(probs=.1),
            linetype=2, color="#dd3333") +
  geom_line(stat='summary', fun=quantile, fun.args=list(probs=.9),
            linetype=2, color="#359bed") +
  geom_line(stat='summary', fun=median, color="#81d742")+
  scale_x_continuous(breaks=seq(from=15,to=70,by=5))+
  # zoom in without dropping data; a plot keeps only one coord,
  # so coord_trans(y="sqrt") is left out of this version
  coord_cartesian(xlim=c(13,70), ylim=c(0,1000))+
  ggtitle("Average Friend Count By Age")+
  ylab("Number of Friends") +
  xlab("Age of User")+
  theme_xkcd()
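
The dashed lines in the plots above are per-age quantiles. A minimal base R sketch of the same computation, on made-up data with two age groups, looks like this; tapply() forwards the probs argument to quantile() for each group.

```r
# Made-up data: eleven users at each of two ages
toy <- data.frame(
  age          = rep(c(20, 30), each = 11),
  friend_count = c(0:10, seq(0, 100, by = 10))
)

# 10th and 90th percentile of friend_count within each age
p10 <- tapply(toy$friend_count, toy$age, quantile, probs = 0.1)
p90 <- tapply(toy$friend_count, toy$age, quantile, probs = 0.9)

print(p10)
print(p90)
```

Reading the band between the two lines as "where the middle 80% of users of that age fall" is often more informative than the mean alone when the counts are as skewed as they are here.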

Correlation Coefficient

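?cor.test
# alternative = "greater" tests for a positive correlation;
# the estimate below is negative, hence the p-value of 1
cor.test(pf$age, pf$friend_count, method = "pearson", alternative = "greater")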

## 
##  Pearson's product-moment correlation
## 
## data:  pf$age and pf$friend_count
## t = -8.6268, df = 99001, p-value = 1
## alternative hypothesis: true correlation is greater than 0
## 95 percent confidence interval:
##  -0.03263034  1.00000000
## sample estimates:
##         cor 
## -0.02740737

You can also calculate the coefficient using the with() function, here restricted to users aged 70 or younger:

with(subset(pf, age <= 70), cor.test(age, friend_count, method="pearson"))

## 
##  Pearson's product-moment correlation
## 
## data:  age and friend_count
## t = -52.592, df = 91029, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.1780220 -0.1654129
## sample estimates:
##        cor 
## -0.1717245
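
For reference, the Pearson coefficient reported by cor.test() can be reproduced by hand. The sketch below does so on six made-up (age, friend count) pairs, and also compares the two-sided and one-sided p-values to show how the choice of alternative changes the result when the estimate is negative.

```r
x <- c(13, 20, 28, 37, 50, 70)      # made-up ages
y <- c(160, 280, 150, 120, 90, 95)  # made-up friend counts

# Pearson's r: covariance scaled by both standard deviations
r_manual <- sum((x - mean(x)) * (y - mean(y))) /
  sqrt(sum((x - mean(x))^2) * sum((y - mean(y))^2))
r_builtin <- cor(x, y, method = "pearson")

# same data, three alternative hypotheses
p_two_sided <- cor.test(x, y)$p.value
p_greater   <- cor.test(x, y, alternative = "greater")$p.value
p_less      <- cor.test(x, y, alternative = "less")$p.value

print(c(manual = r_manual, builtin = r_builtin))
print(c(two_sided = p_two_sided, greater = p_greater, less = p_less))
```

With a negative estimate, the "greater" alternative can never be significant, so the two-sided test (or "less") is the one to report.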

Likes received vs. Web Likes

As you might guess, these two variables are highly correlated: web likes received are a subset of total likes received.

Whenever a user’s web likes received increase, so do their total likes received, so the correlation coefficient should be close to 1. A high correlation does not by itself determine the slope of the fitted line, though.

# Create a scatterplot of likes_received (y)
# vs. www_likes_received (x). Use any of the
# techniques that you've learned so far to
# modify the plot.

# ENTER ALL OF YOUR CODE TO CREATE THE PLOT BELOW THIS LINE.
# ===========================================================
ggplot(aes(x=www_likes_received, y=likes_received), data=pf) +
  geom_point(
             alpha=1/3,
             color="#81d742") + 
  xlim(0, quantile(pf$www_likes_received,.95)) +
  ylim(0,quantile(pf$likes_received,.95)) +
  geom_smooth(method="lm", color="#dd3333") +
  ggtitle("Correlation Between Web Likes and Total Likes")+
  ylab("Total Likes Received") +
  xlab("Web Likes Received")+
  theme_xkcd()

with(pf, cor.test(www_likes_received, likes_received, method="pearson"))

## 
##  Pearson's product-moment correlation
## 
## data:  www_likes_received and likes_received
## t = 937.1, df = 99001, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.9473553 0.9486176
## sample estimates:
##       cor 
## 0.9479902
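
To see why a part and its total correlate so strongly, the sketch below simulates hypothetical web and mobile likes and adds them up. The Poisson rates are arbitrary assumptions, not estimates from the data. It also rescales the total to show that the correlation coefficient stays the same while the fitted slope does not.

```r
set.seed(1)
n <- 1000
web_likes    <- rpois(n, lambda = 50)  # hypothetical web likes received
mobile_likes <- rpois(n, lambda = 5)   # hypothetical mobile likes received
total_likes  <- web_likes + mobile_likes

# correlation and fitted slope of total on its web component
r     <- cor(web_likes, total_likes)
slope <- coef(lm(total_likes ~ web_likes))[["web_likes"]]

# doubling the total leaves r unchanged but doubles the slope
doubled       <- 2 * total_likes
r_scaled      <- cor(web_likes, doubled)
slope_scaled  <- coef(lm(doubled ~ web_likes))[["web_likes"]]

print(c(r = r, r_scaled = r_scaled))
print(c(slope = slope, slope_scaled = slope_scaled))
```

This is why strongly correlated variables that are built from one another are usually not both kept in a model: one adds almost no information beyond the other.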

More Caution with Correlation

Mitchell Soil Temperature

library('alr3')
data(Mitchell)
?Mitchell
names(Mitchell)

## [1] "Month" "Temp"

# Create a scatterplot of temperature (Temp)
# vs. months (Month).

# ENTER ALL OF YOUR CODE TO CREATE THE PLOT BELOW THIS LINE.
# ===========================================================
ggplot(aes(x=Month, y=Temp), data=Mitchell) +
  geom_point() + 
  theme_xkcd() +
  ggtitle("Mitchell Dataset: Soil Temperature by month")

ggplot(aes(x=Month%%12, y=Temp), data=Mitchell) +
  geom_point() + 
  theme_xkcd() +
  xlim(0,12) +
  ggtitle("Mitchell Dataset: Soil Temperature by month")

#Actual answer
ggplot(aes(x=Month, y=Temp), data=Mitchell) +
  geom_point() + 
  scale_x_continuous(breaks = seq(0,203,12)) + # Month is numeric; one break per 12-month cycle
  theme_xkcd() +
  ggtitle("Mitchell Dataset: Soil Temperature by month")

cor.test(x=Mitchell$Month,y=Mitchell$Temp)

## 
##  Pearson's product-moment correlation
## 
## data:  Mitchell$Month and Mitchell$Temp
## t = 0.81816, df = 202, p-value = 0.4142
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.08053637  0.19331562
## sample estimates:
##        cor 
## 0.05747063
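
A correlation this close to zero does not mean there is no relationship, only no linear one. The sketch below builds a synthetic yearly temperature cycle over the same number of months (it is not the Mitchell data, and the mean and amplitude are arbitrary) and shows that the linear correlation with the month index is tiny even though the seasonal pattern is perfectly regular.

```r
month <- 0:203                                # 204 months, as in the series above
temp  <- 10 + 15 * sin(2 * pi * month / 12)   # synthetic yearly cycle, not real data

# linear correlation between month index and temperature
r_linear <- cor(month, temp)
print(r_linear)

# collapsing to month-of-year exposes the seasonal swing
seasonal <- tapply(temp, month %% 12, mean)
print(round(seasonal, 1))
```

Plotting against month %% 12, as done above, is the graphical version of the same collapse.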

ggplot(pf.fc_by_age, aes(x=age, y=friend_count_mean)) +
  geom_line(colour="#8224e3") +
  xlim(13,90) +
  ylab("Average Friend Count") +
  xlab("Age of User") + 
  ggtitle("Average Facebook Friend Count By Age") + 
  theme_xkcd()

# Create a new variable, 'age_with_months', in the 'pf' data frame.
# Be sure to save the variable in the data frame rather than creating
# a separate, stand-alone variable. You will need to use the variables
# 'age' and 'dob_month' to create the variable 'age_with_months'.

# Assume the reference date for calculating age is December 31, 2013.

# This programming assignment WILL BE automatically graded. For
# this exercise, you need only create the 'age_with_months' variable;
# no further processing of the data frame is necessary.
pf$age_with_months <- pf$age + (1 - pf$dob_month / 12)
head(pf$age_with_months)
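
A few worked values make the formula easier to check. The helper below is hypothetical (it just wraps the expression used above) and assumes the same reference date of December 31.

```r
# Hypothetical helper wrapping the formula used above:
# whole years plus the fraction of a year since the last birthday
age_in_years_with_months <- function(age, dob_month) {
  age + (1 - dob_month / 12)
}

nov <- age_in_years_with_months(14, 11)  # 14 years old, born in November
jan <- age_in_years_with_months(13, 1)   # 13 years old, born in January
dec <- age_in_years_with_months(13, 12)  # 13 years old, born in December

print(c(nov = nov, jan = jan, dec = dec))
```

Someone born in January has had almost a full year since their birthday by December 31, so they get the largest fractional part; someone born in December gets none.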

Age with Months Mean

Different Conditional Means

byAge <- ggplot(subset(pf.fc_by_age, age < 71), aes(x=age, y=friend_count_mean)) +
  geom_line(colour="#8224e3") +
  ylab("Average Friend Count") +
  xlab("Age of User") + 
  # limits are set on the scale itself, since a second x scale would replace xlim()
  scale_x_continuous(breaks=seq(from=13,to=71,by=3), limits=c(13,71))+
  ggtitle("Average Facebook Friend Count By Age") + 
  theme_xkcd()
byAge

age_groups_by_month <-group_by(pf,age_with_months)
age_groups_by_month

pf.fc_by_age_months <- summarise(age_groups_by_month,
                                 friend_count_mean = mean(friend_count),
                                 friend_count_median = median(friend_count),
                                 n = n())
pf.fc_by_age_months

## Source: local data frame [1,194 x 4]
## 
##    age_with_months friend_count_mean friend_count_median     n
##              (dbl)             (dbl)               (dbl) (int)
## 1         13.16667          46.33333                30.5     6
## 2         13.25000         115.07143                23.5    14
## 3         13.33333         136.20000                44.0    25
## 4         13.41667         164.24242                72.0    33
## 5         13.50000         131.17778                66.0    45
## 6         13.58333         156.81481                64.0    54
## 7         13.66667         130.06522                75.5    46
## 8         13.75000         205.82609               122.0    69
## 9         13.83333         215.67742               111.0    62
## 10        13.91667         162.28462                71.0   130
## ..             ...               ...                 ...   ...

byAgeMonths <- ggplot(subset(pf.fc_by_age_months, age_with_months < 71), aes(x=age_with_months, y=friend_count_mean)) +
  geom_line(colour="#8224e3") +
  ylab("Average Friend Count") +
  xlab("Age of User by Months") + 
  # limits are set on the scale itself, since a second x scale would replace xlim()
  scale_x_continuous(breaks=seq(from=13,to=71,by=3), limits=c(13,71))+
  ggtitle("Average Facebook Friend Count By Age in Months") + 
  theme_xkcd()
byAgeMonths

byAge5years <- ggplot(subset(pf, age < 71), aes(x=round(age/5)*5, y=friend_count)) +
  geom_line(colour="#8224e3", stat="summary", fun=mean) + # fun replaces the deprecated fun.y
  ylab("Average Friend Count") +
  xlab("Age of User by 5 Years") + 
  # limits are set on the scale itself, since a second x scale would replace xlim()
  scale_x_continuous(breaks=seq(from=15,to=71,by=5), limits=c(13,71))+
  ggtitle("Average Facebook Friend Count By Age") + 
  theme_xkcd()
byAge5years

grid.arrange(byAge,byAgeMonths,byAge5years,ncol=1)
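
The three panels differ only in bin width. The sketch below counts how many groups each binning produces over ages 13 to 70, using the same round(age/5)*5 expression as the five-year plot. Fewer, larger groups give smoother but coarser means; more, smaller groups give finer but noisier ones.

```r
ages <- 13:70

bin_1yr <- ages                  # one group per year of age
bin_5yr <- round(ages / 5) * 5   # nearest multiple of five, as in the plot above

n_bins_1yr <- length(unique(bin_1yr))
n_bins_5yr <- length(unique(bin_5yr))
print(c(one_year = n_bins_1yr, five_year = n_bins_5yr))

# how many distinct ages fall into each five-year bin
print(table(bin_5yr))
```

Monthly bins go the other way, splitting every year of age into 12 groups, which is why that panel is the noisiest of the three.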

Smoothing

byAge <- ggplot(subset(pf.fc_by_age, age < 71), aes(x=age, y=friend_count_mean)) +
  geom_line(colour="#8224e3") +
  geom_smooth() +
  ylab("Average Friend Count") +
  xlab("Age of User") + 
  # limits are set on the scale itself, since a second x scale would replace xlim()
  scale_x_continuous(breaks=seq(from=13,to=71,by=3), limits=c(13,71))+
  ggtitle("Average Facebook Friend Count By Age") + 
  theme_xkcd()
byAge

age_groups_by_month <-group_by(pf,age_with_months)
age_groups_by_month

## Source: local data frame [99,003 x 16]
## Groups: age_with_months [1194]
## 
##     userid   age dob_day dob_year dob_month gender tenure friend_count
##      (int) (int)   (int)    (int)     (int) (fctr)  (int)        (int)
## 1  2094382    14      19     1999        11   male    266            0
## 2  1192601    14       2     1999        11 female      6            0
## 3  2083884    14      16     1999        11   male     13            0
## 4  1203168    14      25     1999        12 female     93            0
## 5  1733186    14       4     1999        12   male     82            0
## 6  1524765    14       1     1999        12   male     15            0
## 7  1136133    13      14     2000         1   male     12            0
## 8  1680361    13       4     2000         1 female      0            0
## 9  1365174    13       1     2000         1   male     81            0
## 10 1712567    13       2     2000         2   male    171            0
## ..     ...   ...     ...      ...       ...    ...    ...          ...
## Variables not shown: friendships_initiated (int), likes (int),
##   likes_received (int), mobile_likes (int), mobile_likes_received (int),
##   www_likes (int), www_likes_received (int), age_with_months (dbl)

byAgeMonths <- ggplot(subset(pf.fc_by_age_months, age_with_months < 71), aes(x=age_with_months, y=friend_count_mean)) +
  geom_line(colour="#8224e3") +
  geom_smooth() +
  ylab("Average Friend Count") +
  xlab("Age of User by Months") + 
  # limits are set on the scale itself, since a second x scale would replace xlim()
  scale_x_continuous(breaks=seq(from=13,to=71,by=3), limits=c(13,71))+
  ggtitle("Average Facebook Friend Count By Age in Months") + 
  theme_xkcd()
byAgeMonths

byAge5years <- ggplot(subset(pf, age < 71), aes(x=round(age/5)*5, y=friend_count)) +
  geom_line(colour="#8224e3", stat="summary", fun=mean) + # fun replaces the deprecated fun.y
  ylab("Average Friend Count") +
  xlab("Age of User by 5 Years") + 
  # limits are set on the scale itself, since a second x scale would replace xlim()
  scale_x_continuous(breaks=seq(from=15,to=71,by=5), limits=c(13,71))+
  ggtitle("Average Facebook Friend Count By Age") + 
  theme_xkcd()
byAge5years

grid.arrange(byAge,byAgeMonths,byAge5years,ncol=1)
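
What the smoother adds can be sketched outside ggplot2. The example below assumes a loess-style local regression (geom_smooth() chooses its method automatically, so this is an assumption about what it fits here) and applies base R's loess() to a synthetic noisy series, not the real means. The roughness() helper is hypothetical: it sums squared second differences as a rough measure of how wiggly a curve is.

```r
set.seed(7)
age <- 13:70

# synthetic "mean friend count": downward trend plus noise (not the real data)
fc_mean <- 300 - 3 * (age - 13) + rnorm(length(age), sd = 30)

# local regression with the default span
fit      <- loess(fc_mean ~ age)
smoothed <- predict(fit)

# hypothetical wiggliness measure: sum of squared second differences
roughness <- function(v) sum(diff(v, differences = 2)^2)

print(c(raw = roughness(fc_mean), smoothed = roughness(smoothed)))
```

The smoothed curve trades away local detail for a clearer trend, so real features such as a sharp peak at one age can be flattened along with the noise.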

Exploratory Data Analysis: Plotting Two Variables of Facebook Data (Lesson 4)

Taylor White

December 31, 2015
