Notes:
# Import Library
library(ggplot2)
# Get Data
pf <- read.csv('pseudo_facebook.tsv',sep = '\t')
# Create scatterplot - qplot automatically renders a scatter plot with continuous data so we don't have
# to be explicit.
qplot(x = age, y = friend_count, data = pf)
# Another way: qplot assumes the x data comes first and the y data second.
qplot(age, friend_count, data = pf)
Response:
Observations:
Teenagers have a lot more friends
People lied about their age (69, 100+)
There are fake accounts
Notes:
# Create scatterplot - using QPlot
qplot(x = age, y = friend_count, data = pf)
# Create scatterplot - using GGPlot
ggplot(aes(x = age, y = friend_count), data = pf) + geom_point()
summary(pf$age)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 13.00 20.00 28.00 37.28 50.00 113.00
# Create scatterplot - using GGPlot & limiting the X axis to 13 through 90
ggplot(aes(x = age, y = friend_count), data = pf) + geom_point() + xlim(13, 90)
## Warning: Removed 4906 rows containing missing values (geom_point).
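The warning above is a row count: `xlim(13, 90)` drops every point whose age falls outside the limits before anything is drawn. A minimal sketch of that count, using a made-up `ages` vector (hypothetical values, not taken from `pf`):

```r
# Hypothetical ages standing in for pf$age
ages <- c(13, 25, 69, 90, 95, 103, 113)

# Points outside [13, 90] are the ones xlim(13, 90) would discard
outside <- ages < 13 | ages > 90
removed <- sum(outside)
removed
```

On the real data, `sum(pf$age < 13 | pf$age > 90)` should give the same number the warning reports.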
Notes:
# To better handle overplotting we use the alpha parameter (1/20): it takes 20 overlapping points to match one fully opaque dot.
# Create scatterplot - using GGPlot & limiting the X axis to 13 through 90
ggplot(aes(x = age, y = friend_count), data = pf) + geom_point(alpha = 1/20) + xlim(13, 90)
## Warning: Removed 4906 rows containing missing values (geom_point).
Response:
The bulk of data lie below the 1000 friend count threshold
Notes:
# Jitter
ggplot(aes(x = age, y = friend_count), data = pf) + geom_jitter(alpha = 1/20) + xlim(13, 90)
## Warning: Removed 5179 rows containing missing values (geom_point).
The friend counts for young users are not nearly as high as they appeared before. The bulk of young users really have friend counts below 1000.
Notice the peak at age 69.
Notes:
# URL: http://docs.ggplot2.org/current/coord_trans.html
# or
# ?coord_trans
# coord_trans using points
ggplot(aes(x = age, y = friend_count), data = pf) +
geom_point(alpha = 1/20) +
xlim(13, 90) +
coord_trans(y = "sqrt")
## Warning: Removed 4906 rows containing missing values (geom_point).
# coord_trans using jitter (jitter can add positive or negative noise to our points; h = 0 turns off the vertical noise)
ggplot(aes(x = age, y = friend_count), data = pf) +
geom_point(alpha = 1/20, position = position_jitter(h = 0)) +
xlim(13, 90) +
coord_trans(y = "sqrt")
## Warning: Removed 5183 rows containing missing values (geom_point).
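One reason to set `h = 0` here: vertical jitter could push a friend count of 0 below zero, and the square root of a negative number is not defined. A small sketch with hand-picked noise values (assumed for illustration only; real jitter is random):

```r
friend_count <- c(0, 0, 3, 10)
noise <- c(-0.3, 0.2, -0.1, 0.4)  # hand-picked stand-in for random jitter

# Adding vertical noise can make a zero count negative
jittered <- friend_count + noise

# sqrt() of a negative value returns NaN (R also emits a warning)
transformed <- suppressWarnings(sqrt(jittered))
transformed

# With no vertical noise every value stays valid
untouched <- sqrt(friend_count)
untouched
```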
Notes:
# Examine the relationship between
# friendships_initiated (y) and age (x)
# using the ggplot syntax.
# We recommend creating a basic scatter
# plot first to see what the distribution looks like.
# and then adjusting it by adding one layer at a time.
# What are your observations about your final plot?
# Remember to make adjustments to the breaks
# of the x-axis and to apply alpha and jitter.
# Create scatterplot - using GGPlot
ggplot(aes(x = age, y = friendships_initiated), data = pf) +
geom_point(alpha = 1/10) +
xlim(13, 90) +
coord_trans(y = "sqrt")
## Warning: Removed 4906 rows containing missing values (geom_point).
# Create scatterplot - using GGPlot & Jitter
ggplot(aes(x = age, y = friendships_initiated), data = pf) +
geom_jitter(alpha = 1/10, position = position_jitter(h = 0)) +
xlim(13, 90) +
coord_trans(y = "sqrt")
## Warning: Removed 5193 rows containing missing values (geom_point).
Notes:
Using percentages of bounded scopes is interesting.
The dplyr library allows us to split up a data frame and apply a function to some parts of the data.
Notes:
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
# dplyr Functions:
# filter()
# group_by()
# mutate()
# arrange()
age_groups <- group_by(pf, age)
# Create a table that gives, for each age, the mean and median friend count.
# n() works only inside dplyr verbs such as summarise(); it reports how many people are in each group.
pf.fc_age <- summarise(age_groups,
friend_count_mean = mean(friend_count),
friend_count_median = median(friend_count),
n = n()
)
# Rearranging the order so that the "Ages" go from low to high.
pf.fc_age <- arrange(pf.fc_age, age)
head(pf.fc_age)
## Source: local data frame [6 x 4]
##
## age friend_count_mean friend_count_median n
## (int) (dbl) (dbl) (int)
## 1 13 164.7500 74.0 484
## 2 14 251.3901 132.0 1925
## 3 15 347.6921 161.0 2618
## 4 16 351.9371 171.5 3086
## 5 17 350.3006 156.0 3283
## 6 18 331.1663 162.0 5196
# ******* ALTERNATIVE METHOD ***************
# Note: %>% allows us to chain functions onto our data set
pf.fc_by_age <- pf %>%
group_by(age) %>%
summarize( friend_count_mean = mean(friend_count),
friend_count_median = median(friend_count),
n = n()) %>%
arrange(age)
head(pf.fc_by_age)
## Source: local data frame [6 x 4]
##
## age friend_count_mean friend_count_median n
## (int) (dbl) (dbl) (int)
## 1 13 164.7500 74.0 484
## 2 14 251.3901 132.0 1925
## 3 15 347.6921 161.0 2618
## 4 16 351.9371 171.5 3086
## 5 17 350.3006 156.0 3283
## 6 18 331.1663 162.0 5196
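As a cross-check on what `group_by()` plus `summarise()` computes, the same per-age table can be sketched in base R with `tapply()`. The `toy` data frame below is made up for illustration; it is not a sample of `pf`.

```r
# Hypothetical stand-in for pf with just the two columns we need
toy <- data.frame(
  age          = c(13, 13, 14, 14, 14),
  friend_count = c(10, 30, 5, 15, 100)
)

# tapply() splits friend_count by age and applies a function to each group,
# returning the groups in ascending age order
by_age <- data.frame(
  age                 = sort(unique(toy$age)),
  friend_count_mean   = as.vector(tapply(toy$friend_count, toy$age, mean)),
  friend_count_median = as.vector(tapply(toy$friend_count, toy$age, median)),
  n                   = as.vector(tapply(toy$friend_count, toy$age, length))
)
by_age
```

The mean for age 14 (40) sits well above its median (15) because of the single large value, which is the same skew visible in the real table.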
Create your plot!
# Plot mean friend count vs. age using a line graph.
# Be sure you use the correct variable names
# and the correct data frame. You should be working
# with the new data frame created from the dplyr
# functions. The data frame is called 'pf.fc_by_age'.
# Use geom_line() rather than geom_point to create
# the plot. You can look up the documentation for
# geom_line() to see what it does.
pf.fc_by_age
## Source: local data frame [101 x 4]
##
## age friend_count_mean friend_count_median n
## (int) (dbl) (dbl) (int)
## 1 13 164.7500 74.0 484
## 2 14 251.3901 132.0 1925
## 3 15 347.6921 161.0 2618
## 4 16 351.9371 171.5 3086
## 5 17 350.3006 156.0 3283
## 6 18 331.1663 162.0 5196
## 7 19 333.6921 157.0 4391
## 8 20 283.4991 135.0 3769
## 9 21 235.9412 121.0 3671
## 10 22 211.3948 106.0 3032
## .. ... ... ... ...
# Create scatterplot - using GGPlot - OLDER EXAMPLE
ggplot(aes(x = age, y = friend_count_mean), data = pf.fc_by_age) +
geom_point(alpha = 1/10) +
xlim(13, 90) +
coord_trans(y = "sqrt")
## Warning: Removed 23 rows containing missing values (geom_point).
# Create a Line Plot
ggplot(aes(x = age, y = friend_count_mean), data = pf.fc_by_age) +
geom_point(alpha = 1/5) +
geom_line() +
xlim(13, 90) +
coord_trans(y = "sqrt")
## Warning: Removed 23 rows containing missing values (geom_point).
## Warning: Removed 23 rows containing missing values (geom_path).
Notes:
# Create scatterplot - using GGPlot - OLDER EXAMPLE - using orange color
ggplot(aes(x = age, y = friend_count), data = pf) +
xlim(13, 90) +
geom_point(alpha = .05,
position = position_jitter(h = 0),
color = 'orange') +
coord_trans(y = "sqrt") +
geom_line(stat = 'summary', fun.y = mean)
## Warning: Removed 4906 rows containing non-finite values (stat_summary).
## Warning: Removed 5175 rows containing missing values (geom_point).
# Adding in Quantiles
ggplot(aes(x = age, y = friend_count), data = pf) +
xlim(13, 90) +
geom_point(alpha = .05,
position = position_jitter(h = 0),
color = 'orange') +
coord_trans(y = "sqrt") +
geom_line(stat = 'summary', fun.y = mean) +
geom_line(stat = 'summary', fun.y = quantile, fun.args = list(probs = .1),
linetype =2, color = 'blue') +
geom_line(stat = 'summary', fun.y = quantile, fun.args = list(probs = .9),
linetype =2, color = 'blue')
## Warning: Removed 4906 rows containing non-finite values (stat_summary).
## Warning: Removed 4906 rows containing non-finite values (stat_summary).
## Warning: Removed 4906 rows containing non-finite values (stat_summary).
## Warning: Removed 5175 rows containing missing values (geom_point).
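`fun.args = list(probs = .1)` is handed on to `quantile()`, so each dashed line traces the 10th or 90th percentile of friend count at each age. A quick sketch of what those calls return for a single group, using a made-up vector in place of real friend counts:

```r
# Hypothetical friend counts for one age group
counts <- 1:11

# The same function the plot layers call, with the same probs argument
p10 <- quantile(counts, probs = 0.1)
p50 <- quantile(counts, probs = 0.5)
p90 <- quantile(counts, probs = 0.9)

c(p10, p50, p90)
```

Ten percent of the values lie at or below `p10` and ten percent at or above `p90`, so the band between the two dashed lines covers the middle 80% of users at each age.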
# INSPIRATION: coord_cartesian
#
# A better method - using coord_cartesian Layer - adjusting for zoom into 250 count.
# qplot(x= gender, y = friend_count,
# data = subset(pf, !is.na(gender)),
# geom = 'boxplot') +
# coord_cartesian(ylim = c(0, 250))
ggplot(aes(x = age, y = friend_count), data = pf) +
# xlim(13, 90) +
coord_cartesian(xlim = c(13, 70), ylim = c(0, 1000)) +
geom_point(alpha = .05,
position = position_jitter(h = 0),
color = 'orange') +
# coord_trans(y = "sqrt") +
geom_line(stat = 'summary', fun.y = mean) +
geom_line(stat = 'summary', fun.y = quantile, fun.args = list(probs = .1),
linetype =2, color = 'blue') +
geom_line(stat = 'summary', fun.y = quantile, fun.args = list(probs = .9),
linetype =2, color = 'blue')
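The reason `coord_cartesian()` is the better method: scale limits such as `xlim()` and `ylim()` discard out-of-range values before any statistic is computed, while `coord_cartesian()` only zooms the view. A rough base R sketch of the difference, with made-up friend counts and an assumed cutoff of 250:

```r
# Hypothetical friend counts, including two large values
friend_count <- c(10, 50, 100, 200, 300, 1000)

# What a summary sees after ylim(0, 250): out-of-range values are gone
median_after_drop <- median(friend_count[friend_count <= 250])

# What a summary sees under coord_cartesian(ylim = c(0, 250)): all values
median_all <- median(friend_count)

c(dropped = median_after_drop, zoomed = median_all)
```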
Response:
See the Instructor Notes of this video to download Moira’s paper on perceived audience size and to see the final plot.
Notes:
# ?cor.test
cor.test(pf$age, pf$friend_count, method = 'pearson')
##
## Pearson's product-moment correlation
##
## data: pf$age and pf$friend_count
## t = -8.6268, df = 99001, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.03363072 -0.02118189
## sample estimates:
## cor
## -0.02740737
# This indicates that there is no meaningful relationship between the two variables.
# A good rule of thumb: a correlation > 0.3 or < -0.3 is meaningful (but small),
# around 0.5 is moderate, and 0.7 or greater is large.
# Another way to compute the same coefficient is to use the following code:
with(pf, cor.test(age, friend_count, method = 'pearson'))
##
## Pearson's product-moment correlation
##
## data: age and friend_count
## t = -8.6268, df = 99001, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.03363072 -0.02118189
## sample estimates:
## cor
## -0.02740737
# The same test restricted to users aged 70 or younger:
with(subset(pf, age <= 70), cor.test(age, friend_count, method = 'pearson'))
##
## Pearson's product-moment correlation
##
## data: age and friend_count
## t = -52.592, df = 91029, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.1780220 -0.1654129
## sample estimates:
## cor
## -0.1717245
# Subset the data to ages 70 or younger
# subset(pf, age <= 70)
# Note: this articulates a negative relationship between age and friend count,
# as age increases, friend count decreases.
# It's important to say that one variable doesn't cause the other....
# To address causality, we'd need inferential statistics, not the descriptive statistics that
# we are using here.
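The rule of thumb from these notes (0.3, 0.5, 0.7) can be written down as a small helper. `describe_strength()` is a hypothetical function made up for this sketch, and the cutoffs are only the rough guideline quoted above, not a formal standard.

```r
# Hypothetical helper: label a correlation using the rule-of-thumb cutoffs
describe_strength <- function(r) {
  a <- abs(r)  # the sign gives direction, not strength
  if (a >= 0.7) {
    "large"
  } else if (a >= 0.5) {
    "moderate"
  } else if (a >= 0.3) {
    "small"
  } else {
    "not meaningful"
  }
}

# Coefficients reported in these notes
labels <- vapply(c(-0.02740737, -0.1717245, 0.9479902),
                 describe_strength, character(1))
labels
```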
Look up the documentation for the cor.test function.
What’s the correlation between age and friend count? Round to three decimal places. Response:
-0.172 for age <= 70 and -0.027 for the whole data set.
Notes:
# Pearson
with( subset(pf, age <= 70) , cor.test(age, friend_count, method = 'pearson'))
##
## Pearson's product-moment correlation
##
## data: age and friend_count
## t = -52.592, df = 91029, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.1780220 -0.1654129
## sample estimates:
## cor
## -0.1717245
# Spearman
with( subset(pf, age <= 70) , cor.test(age, friend_count, method = 'spearman', exact=FALSE))
##
## Spearman's rank correlation rho
##
## data: age and friend_count
## S = 1.5782e+14, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## -0.2552934
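Pearson measures how close the points are to a straight line, while Spearman works on ranks and so measures any monotonic relationship. A small sketch on made-up data where the relationship is perfectly monotonic but curved:

```r
# Made-up data: y always increases with x, but not along a straight line
x <- 1:10
y <- x^3

pearson  <- cor(x, y, method = "pearson")
spearman <- cor(x, y, method = "spearman")

c(pearson = pearson, spearman = spearman)
```

Spearman comes out at 1 because the ranks of `x` and `y` match exactly; Pearson falls short of 1 because of the curvature. That is one plausible reason the two coefficients differ on the age and friend count data.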
Notes:
names(pf)
## [1] "userid" "age"
## [3] "dob_day" "dob_year"
## [5] "dob_month" "gender"
## [7] "tenure" "friend_count"
## [9] "friendships_initiated" "likes"
## [11] "likes_received" "mobile_likes"
## [13] "mobile_likes_received" "www_likes"
## [15] "www_likes_received"
# Inspiration:
ggplot(aes(x = age, y = friend_count_mean), data = pf.fc_by_age) +
geom_point(alpha = 1/10) +
xlim(13, 90) +
coord_trans(y = "sqrt")
## Warning: Removed 23 rows containing missing values (geom_point).
# Actual:
ggplot(aes(x = www_likes_received, y = likes_received), data = pf) + geom_point()
# Quantile Function to remove outliers
ggplot(aes(x = www_likes_received, y = likes_received), data = pf) + geom_point() +
xlim(0, quantile(pf$www_likes_received, 0.95)) +
ylim(0, quantile(pf$likes_received, 0.95))
## Warning: Removed 6075 rows containing missing values (geom_point).
# Quantile Function to remove outliers - add line of best fit.
ggplot(aes(x = www_likes_received, y = likes_received), data = pf) + geom_point() +
xlim(0, quantile(pf$www_likes_received, 0.95)) +
ylim(0, quantile(pf$likes_received, 0.95)) +
geom_smooth(method = 'lm', color = 'red')
## Warning: Removed 6075 rows containing non-finite values (stat_smooth).
## Warning: Removed 6075 rows containing missing values (geom_point).
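`geom_smooth(method = 'lm')` draws the least-squares line that `lm()` fits. A minimal sketch on made-up points that lie exactly on y = 2x + 1, so the fitted coefficients are easy to check by eye:

```r
# Made-up data on an exact line: intercept 1, slope 2
toy <- data.frame(x = 1:5)
toy$y <- 2 * toy$x + 1

# Fit the same kind of linear model the smoother uses
fit <- lm(y ~ x, data = toy)
coefs <- coef(fit)
coefs
```

Because `xlim()` and `ylim()` remove the top 5% of values before the fit, the red line in the plot above describes only the remaining points.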
Notes:
# Quantifying relationship with a number
# Pearson
cor.test(pf$www_likes_received, pf$likes_received, method = 'pearson')
##
## Pearson's product-moment correlation
##
## data: pf$www_likes_received and pf$likes_received
## t = 937.1, df = 99001, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.9473553 0.9486176
## sample estimates:
## cor
## 0.9479902
# NOTE: These variables are so closely correlated because one is a superset of the other.
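A sketch of why a part and its whole correlate so strongly. The numbers are made up, and the assumption that total likes received is the sum of web and mobile likes received is mine, stated here only to illustrate the superset idea:

```r
# Hypothetical per-user counts
www_likes    <- c(1, 5, 10, 50, 100)
mobile_likes <- c(0, 2, 1, 5, 3)

# Assumed relationship: the total contains the web count
total_likes <- www_likes + mobile_likes

r <- cor(www_likes, total_likes)
r
```

When one variable is built from the other, a high coefficient says little about user behaviour; it mostly reflects how the variables were defined.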
What’s the correlation between the two variables? Include the top 5% of values for the variable in the calculation and round to 3 decimal places.
Response:
Notes:
# install.packages('alr3')
library(alr3)
## Loading required package: car
data("Mitchell")
# ?Mitchell
Create your plot!
# Create a scatterplot of temperature (Temp)
# vs. months (Month).
str(Mitchell)
## 'data.frame': 204 obs. of 2 variables:
## $ Month: int 0 1 2 3 4 5 6 7 8 9 ...
## $ Temp : num -5.18 -1.65 2.49 10.4 14.99 ...
# Inspiration - scatterplot using GGPlot
# ggplot(aes(x = age, y = friend_count), data = pf) + geom_point()
ggplot(aes(x = Month, y = Temp), data = Mitchell) + geom_point()
# Actual Correlation
cor.test(Mitchell$Temp, Mitchell$Month, method = 'pearson')
##
## Pearson's product-moment correlation
##
## data: Mitchell$Temp and Mitchell$Month
## t = 0.81816, df = 202, p-value = 0.4142
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.08053637 0.19331562
## sample estimates:
## cor
## 0.05747063
Estimated correlation: zero
Actual correlation (rounded to three decimal places): 0.057
# Put Month on the x-axis and place a break every 12 months to expose the yearly cycle
ggplot(aes(x = Month, y = Temp), data = Mitchell) + geom_point() +
scale_x_continuous(breaks = seq(0, 203, 12))
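A correlation near zero does not mean there is no pattern. The sketch below uses a synthetic sine wave as a stand-in for seasonal temperature (it is not the Mitchell data) to show a strong yearly cycle that a linear coefficient misses, and how folding the months with `%%` brings it out:

```r
# Synthetic series: 204 months (17 years) of a pure yearly cycle
Month <- 0:203
Temp  <- sin(2 * pi * Month / 12)

# Linear correlation between time and temperature is close to zero
r <- cor(Month, Temp)

# Fold every year onto the same 12 positions and average each position
month_of_year <- Month %% 12
seasonal_mean <- tapply(Temp, month_of_year, mean)

r
round(seasonal_mean, 2)
```

The averages by month of year swing between about -1 and 1 even though `r` is close to 0.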
Notes:
What do you notice?
Response:
Watch the solution video and check out the Instructor Notes!
Notes:
# Going back to this Line plot:
ggplot(aes(x = age, y = friend_count_mean), data = pf.fc_by_age) +
geom_line()
# Print out some of the DF to have a closer look at the rising and falling of the data
head(pf.fc_by_age)
## Source: local data frame [6 x 4]
##
## age friend_count_mean friend_count_median n
## (int) (dbl) (dbl) (int)
## 1 13 164.7500 74.0 484
## 2 14 251.3901 132.0 1925
## 3 15 347.6921 161.0 2618
## 4 16 351.9371 171.5 3086
## 5 17 350.3006 156.0 3283
## 6 18 331.1663 162.0 5196
# Slice Data More:
pf.fc_by_age[17:19, ]
## Source: local data frame [3 x 4]
##
## age friend_count_mean friend_count_median n
## (int) (dbl) (dbl) (int)
## 1 29 120.8182 66.0 1936
## 2 30 115.2080 67.5 1716
## 3 31 118.4599 63.0 1694
# Create the user's age including months (a fractional age in years) rather than whole years
pf$age_with_months <- pf$age + abs(12 - pf$dob_month)/12
# Two alternate solutions:
# (1) pf$age_with_months <- pf$age + (1 - pf$dob_month / 12)
# (2) pf$age_with_months <- with(pf, age + (1 - dob_month / 12))
head(pf[16])
## age_with_months
## 1 14.08333
## 2 14.08333
## 3 14.08333
## 4 14.00000
## 5 14.00000
## 6 14.00000
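The three formulas for `age_with_months` agree for any birth month from 1 to 12. A short check on a made-up data frame (`toy` is hypothetical, not a slice of `pf`):

```r
# Hypothetical users: age in whole years and birth month (1 to 12)
toy <- data.frame(age = c(13, 14, 14), dob_month = c(11, 12, 6))

# Formula used above
v1 <- toy$age + abs(12 - toy$dob_month) / 12

# The two alternate solutions
v2 <- toy$age + (1 - toy$dob_month / 12)
v3 <- with(toy, age + (1 - dob_month / 12))

data.frame(toy, v1, v2, v3)
```

A user born in December gets no fraction added and one born in November gets 1/12, which matches the 14.00000 and 14.08333 values printed above.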
# Solution: Chain commands together using "%>%"
library(dplyr)
pf.fc_by_age_months <-pf %>%
group_by(age_with_months) %>%
summarize( friend_count_mean = mean(friend_count),
friend_count_median = median(friend_count),
n = n()) %>%
arrange(age_with_months)
head(pf.fc_by_age_months)
## Source: local data frame [6 x 4]
##
## age_with_months friend_count_mean friend_count_median n
## (dbl) (dbl) (dbl) (int)
## 1 13.16667 46.33333 30.5 6
## 2 13.25000 115.07143 23.5 14
## 3 13.33333 136.20000 44.0 25
## 4 13.41667 164.24242 72.0 33
## 5 13.50000 131.17778 66.0 45
## 6 13.58333 156.81481 64.0 54
# Alternative Method: Use the DataFrame and then apply commands to it
age_with_months_groups <- group_by(pf, age_with_months)
pf.fc_by_age_months2 <- summarise(age_with_months_groups,
friend_count_mean = mean(friend_count),
friend_count_median = median(friend_count),
n = n())
pf.fc_by_age_months2 <- arrange(pf.fc_by_age_months2, age_with_months)
head(pf.fc_by_age_months2)
## Source: local data frame [6 x 4]
##
## age_with_months friend_count_mean friend_count_median n
## (dbl) (dbl) (dbl) (int)
## 1 13.16667 46.33333 30.5 6
## 2 13.25000 115.07143 23.5 14
## 3 13.33333 136.20000 44.0 25
## 4 13.41667 164.24242 72.0 33
## 5 13.50000 131.17778 66.0 45
## 6 13.58333 156.81481 64.0 54
Programming Assignment
# Experiment: Create a DataFrame
employee <- c('John Doe','Peter Gynn','Jolie Hope')
salary <- c(21000, 23400, 26800)
startdate <- as.Date(c('2010-11-1','2008-3-25','2007-3-14'))
employ.data <- data.frame(employee, salary, startdate)
# Experiment: Test for data type and if a dataframe
typeof(employ.data)
## [1] "list"
is.data.frame(pf.fc_by_age_months)
## [1] TRUE
# Original Line plot: NOTE: (12 x 71 = 852)
p1 <- ggplot(aes(x = age, y = friend_count_mean),
data = subset(pf.fc_by_age, age < 71)) +
geom_line() +
geom_smooth()
# New line plot of the noisier monthly means, subsetted to age_with_months < 71
p2 <- ggplot(aes(x = age_with_months, y = friend_count_mean),
data = subset(pf.fc_by_age_months, age_with_months < 71)) +
geom_line()+
geom_smooth()
# Add a 3rd plot that displays the means in 5-year age bins
p3 <- ggplot(aes(x = round(age/5) * 5, y = friend_count_mean),
data = subset(pf.fc_by_age, age < 71)) +
geom_line(stat = 'summary', fun.y = mean)
# Grid Plot
library(gridExtra)
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
##
## combine
grid.arrange(p2, p1, p3, ncol = 1)
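The third plot smooths the line by pooling ages into 5-year bins with `round(age / 5) * 5`. A quick sketch of how that expression maps a few made-up ages to bin centres:

```r
# Hypothetical ages
age <- c(13, 17, 18, 22, 23)

# Divide by 5, round to the nearest whole number, scale back up
age_bin <- round(age / 5) * 5

data.frame(age, age_bin)
```

Wider bins mean more users per point and a steadier mean, at the cost of hiding detail such as the spike at 69 seen in the finer plots.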
#Spot Checking
head(pf.fc_by_age)
## Source: local data frame [6 x 4]
##
## age friend_count_mean friend_count_median n
## (int) (dbl) (dbl) (int)
## 1 13 164.7500 74.0 484
## 2 14 251.3901 132.0 1925
## 3 15 347.6921 161.0 2618
## 4 16 351.9371 171.5 3086
## 5 17 350.3006 156.0 3283
## 6 18 331.1663 162.0 5196
head(pf.fc_by_age_months)
## Source: local data frame [6 x 4]
##
## age_with_months friend_count_mean friend_count_median n
## (dbl) (dbl) (dbl) (int)
## 1 13.16667 46.33333 30.5 6
## 2 13.25000 115.07143 23.5 14
## 3 13.33333 136.20000 44.0 25
## 4 13.41667 164.24242 72.0 33
## 5 13.50000 131.17778 66.0 45
## 6 13.58333 156.81481 64.0 54
Notes:
Reflection:
Click KnitHTML to see all of your hard work and to have an html page of this lesson, your answers, and your notes!
Notes: