Notes: Most Facebook users estimate their audience to be much smaller than it actually is, and most guesses are multiples of 50 or 100. ***
Notes:
#setup
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.2.1
getwd()
## [1] "C:/Users/amackay/Documents/R Scripts"
setwd("~/R Datasources")
list.files()
## [1] "pseudo_facebook.tsv" "reddit.csv" "stateData.csv"
pf <- read.csv('../R Datasources/pseudo_facebook.tsv',sep = '\t')
names(pf)
## [1] "userid" "age"
## [3] "dob_day" "dob_year"
## [5] "dob_month" "gender"
## [7] "tenure" "friend_count"
## [9] "friendships_initiated" "likes"
## [11] "likes_received" "mobile_likes"
## [13] "mobile_likes_received" "www_likes"
## [15] "www_likes_received"
#plot a scatter of age and friend count. qplot will automatically choose a scatter plot when given two continuous variables
qplot(x = age, y = friend_count, data = pf)
#alternative syntax without naming x and y
qplot(age, friend_count, data = pf)
Response: 1. Most users under 30 have high friend counts. 2. Some users over the age of 60 have unusually high friend counts. ***
Notes: 1. The ggplot syntax can be used to create more complex plots. 2. The main difference from qplot is that you must specify which geom to plot.
#qplot syntax - qplot(x = age, y = friend_count, data = pf)
#ggplot uses the aesthetic wrapper (aes) for the x and y variables
ggplot(aes(x = age, y = friend_count), data = pf) + geom_point()
#clip the age range but first check the range
summary(pf$age)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 13.00 20.00 28.00 37.28 50.00 113.00
ggplot(aes(x = age, y = friend_count), data = pf) + geom_point() + xlim(13,90)
## Warning: Removed 4906 rows containing missing values (geom_point).
Notes: 1. An area of the plot with a high density of points is considered overplotted. 2. Overplotting makes it difficult to tell how many points are in each region. 3. Transparency can be set using the alpha parameter of the geom.
# here alpha = 1/20 means that it takes 20 overlapping points to be the equivalent of one fully opaque point
ggplot(aes(x = age, y = friend_count), data = pf) +
geom_point(alpha = 1/20) +
xlim(13,90)
## Warning: Removed 4906 rows containing missing values (geom_point).
#change the plot to a jitter to add some noise to the age variable and get a more dispersed distribution
ggplot(aes(x = age, y = friend_count), data = pf) +
geom_jitter(alpha = 1/20) +
xlim(13,90)
## Warning: Removed 5180 rows containing missing values (geom_point).
Response: The jitter plot gives a truer picture of friend count by age, i.e. most young users have lower friend counts than the plain scatter suggests. ***
Notes: Some friend counts are zero. Adding jitter to them may produce a negative number, and the square root of a negative number is imaginary. We add position = position_jitter(h = 0) so that the points are jittered horizontally only (zero vertical jitter).
#transform the y axis using a sqrt function
ggplot(aes(x = age, y = friend_count), data = pf) +
geom_point(alpha = 1/20) +
xlim(13,90) +
coord_trans(y = 'sqrt')
## Warning: Removed 4906 rows containing missing values (geom_point).
#transform the y axis using a sqrt function and plot a jitter
ggplot(aes(x = age, y = friend_count), data = pf) +
geom_point(alpha = 1/20, position = position_jitter(h = 0)) +
xlim(13,90) +
coord_trans(y = 'sqrt')
## Warning: Removed 5215 rows containing missing values (geom_point).
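Notes: A minimal sketch of why the vertical jitter has to be switched off before a square-root transform. The friend counts and the noise values are hypothetical and fixed by hand so the example is reproducible; real jitter is random.

```r
# Toy friend counts that include zeros (hypothetical values, not taken from pf)
friend_count <- c(0, 0, 0, 0, 25, 400)

# Hand-picked stand-in for vertical jitter: small noise in both directions
noise <- c(-0.3, 0.2, -0.1, 0.4, 0.3, -0.2)
jittered <- friend_count + noise

# sqrt() of a negative number is NaN in R (with a warning), so those points cannot be drawn
roots <- suppressWarnings(sqrt(jittered))
print(jittered)
print(roots)

# With the vertical jitter set to zero (the h = 0 case) every square root is defined
roots_no_jitter <- sqrt(friend_count)
print(roots_no_jitter)
```

Only the zeros that received negative noise turn into NaN, which is why the number of dropped points changes from run to run when the jitter is random.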
With the square-root scale we can see that the points thin out above a friend count of about 1000. ***
Notes: Examine the relationship between Age and friendships_initiated
names(pf)
## [1] "userid" "age"
## [3] "dob_day" "dob_year"
## [5] "dob_month" "gender"
## [7] "tenure" "friend_count"
## [9] "friendships_initiated" "likes"
## [11] "likes_received" "mobile_likes"
## [13] "mobile_likes_received" "www_likes"
## [15] "www_likes_received"
#plot scatter
ggplot(aes(x = age, y = friendships_initiated), data = pf) +
geom_point()
#limit x axis
ggplot(aes(x = age, y = friendships_initiated), data = pf) +
geom_point() +
xlim(13,90)
## Warning: Removed 4906 rows containing missing values (geom_point).
#plot a jitter
ggplot(aes(x = age, y = friendships_initiated), data = pf) +
geom_jitter() +
xlim(13,90)
## Warning: Removed 5212 rows containing missing values (geom_point).
#use the alpha parameter to reduce overplotting
ggplot(aes(x = age, y = friendships_initiated), data = pf) +
geom_jitter(alpha = 1/20) +
xlim(13,90)
## Warning: Removed 5186 rows containing missing values (geom_point).
#transform y axis using coord_trans
ggplot(aes(x = age, y = friendships_initiated), data = pf) +
geom_jitter(alpha = 1/20, position = position_jitter(h = 0)) +
xlim(13,90) +
coord_trans(y = 'sqrt')
## Warning: Removed 5191 rows containing missing values (geom_point).
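#limit the dataset to remove NA values. The warning remains because it is not caused by NAs: xlim(13,90) drops ages above 90, and jitter can push points past the limits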
ggplot(aes(x = age, y = friendships_initiated), data = subset(pf, !is.na(friendships_initiated))) +
geom_jitter(alpha = 1/20, position = position_jitter(h = 0)) +
xlim(13,90) +
coord_trans(y = 'sqrt')
## Warning: Removed 5181 rows containing missing values (geom_point).
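Notes: A small sketch of where the "Removed ... rows" warning comes from. The ages are made up; the point is that rows outside the x limits are dropped even when no value is NA.

```r
# Hypothetical ages, some above the upper limit of 90 used in xlim(13,90)
age <- c(13, 25, 64, 90, 91, 103, 113)

# Rows inside the limits are kept; the rest are turned into missing values by the scale
inside <- age >= 13 & age <= 90
n_removed <- sum(!inside)

# There are no NA values in the data itself
n_missing <- sum(is.na(age))

print(n_removed)
print(n_missing)
```

Filtering with !is.na() therefore cannot silence the warning; subsetting on age would.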
Notes: 1. Summarise the dataset by mean and median using the group_by and summarise functions. 2. n = n() gives the count in each group; n() only works inside dplyr verbs such as summarise.
#install.packages('dplyr')
library(dplyr)
## Warning: package 'dplyr' was built under R version 3.2.1
##
## Attaching package: 'dplyr'
##
## The following objects are masked from 'package:stats':
##
## filter, lag
##
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
#create grouping by age
age_groups <- group_by(pf, age)
#add the summaries
pf.fc_by_age <- summarise(age_groups,
friend_count_mean = mean(friend_count),
friend_count_median = median(friend_count),
n = n())
#sort by age ascending
pf.fc_by_age <- arrange(pf.fc_by_age, age)
#use head to check the first few rows. The second argument (number of rows) is optional
head(pf.fc_by_age, 20)
## Source: local data frame [20 x 4]
##
## age friend_count_mean friend_count_median n
## 1 13 164.7500 74.0 484
## 2 14 251.3901 132.0 1925
## 3 15 347.6921 161.0 2618
## 4 16 351.9371 171.5 3086
## 5 17 350.3006 156.0 3283
## 6 18 331.1663 162.0 5196
## 7 19 333.6921 157.0 4391
## 8 20 283.4991 135.0 3769
## 9 21 235.9412 121.0 3671
## 10 22 211.3948 106.0 3032
## 11 23 202.8426 93.0 4404
## 12 24 185.7121 92.0 2827
## 13 25 131.0211 62.0 3641
## 14 26 144.0082 75.0 2815
## 15 27 134.1473 72.0 2240
## 16 28 125.8354 66.0 2364
## 17 29 120.8182 66.0 1936
## 18 30 115.2080 67.5 1716
## 19 31 118.4599 63.0 1694
## 20 32 114.2800 63.0 1443
Notes: The %>% allows you to chain commands
#chain commands to the pf dataset
pf.fc_age_chaining <- pf %>%
group_by(age) %>%
summarise(friend_count_mean = mean(friend_count),
friend_count_median = median(friend_count),
n = n()) %>%
arrange(age)
head(pf.fc_age_chaining, 20)
## Source: local data frame [20 x 4]
##
## age friend_count_mean friend_count_median n
## 1 13 164.7500 74.0 484
## 2 14 251.3901 132.0 1925
## 3 15 347.6921 161.0 2618
## 4 16 351.9371 171.5 3086
## 5 17 350.3006 156.0 3283
## 6 18 331.1663 162.0 5196
## 7 19 333.6921 157.0 4391
## 8 20 283.4991 135.0 3769
## 9 21 235.9412 121.0 3671
## 10 22 211.3948 106.0 3032
## 11 23 202.8426 93.0 4404
## 12 24 185.7121 92.0 2827
## 13 25 131.0211 62.0 3641
## 14 26 144.0082 75.0 2815
## 15 27 134.1473 72.0 2240
## 16 28 125.8354 66.0 2364
## 17 29 120.8182 66.0 1936
## 18 30 115.2080 67.5 1716
## 19 31 118.4599 63.0 1694
## 20 32 114.2800 63.0 1443
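Notes: A minimal sketch, on a made-up five-row data frame, showing that the chained form and the nested function calls produce the same summary table. The toy data frame and its values are assumptions for illustration only.

```r
suppressPackageStartupMessages(library(dplyr))

# Hypothetical miniature of pf with two ages
toy <- data.frame(age = c(13, 13, 14, 14, 14),
                  friend_count = c(10, 30, 5, 15, 100))

# Nested calls: read from the inside out
nested <- arrange(
  summarise(group_by(toy, age),
            friend_count_mean = mean(friend_count),
            friend_count_median = median(friend_count),
            n = n()),
  age)

# Chained calls: read from top to bottom
chained <- toy %>%
  group_by(age) %>%
  summarise(friend_count_mean = mean(friend_count),
            friend_count_median = median(friend_count),
            n = n()) %>%
  arrange(age)

print(as.data.frame(chained))
```

The mean (40) and median (15) for age 14 differ a lot because of the single large value, which is the same skew visible in the real table.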
Notes: Examine mean friend count over age.
ggplot(aes(x= age, y = friend_count_mean), data = pf.fc_by_age) +
geom_line()
Notes: ggplot allows overlaying raw data with summarised values, such as a line of mean friend count over a scatter of friend count by age.
#plot a scatter
ggplot(aes(x = age, y = friend_count), data = pf) +
geom_point()
#use alpha and horizontal jitter to soften the discrete age columns, since age is really continuous
ggplot(aes(x = age, y = friend_count), data = pf) +
geom_point(alpha = 1/20,
position = position_jitter(h = 0))
#add some color and limit ages
myplot <- ggplot(aes(x = age, y = friend_count), data = pf) +
geom_point(alpha = 1/20,
position = position_jitter(h = 0),
color = 'orange') +
xlim(13,90)
myplot
## Warning: Removed 5184 rows containing missing values (geom_point).
#transform y using coord_trans
myplot <- myplot + coord_trans(y = 'sqrt')
myplot
## Warning: Removed 5181 rows containing missing values (geom_point).
#add the mean summary to the plot
#fun.y is the function applied to the y values at each x (ggplot2 3.3.0 and later call this argument fun)
myplot <- myplot + geom_line(stat = 'summary', fun.y = mean)
myplot
## Warning: Removed 4906 rows containing missing values (stat_summary).
## Warning: Removed 5195 rows containing missing values (geom_point).
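#add quantile summaries
#probs = .1 is the 10th percentile, .5 the median, .9 the 90th percentile
#(newer ggplot2 releases expect this as fun.args = list(probs = .1))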
myplot <- myplot + geom_line(stat = 'summary', fun.y = quantile, probs = .1,
linetype = 2, color = 'blue')
myplot <- myplot + geom_line(stat = 'summary', fun.y = quantile, probs = .5,
linetype = 2, color = 'red')
myplot <- myplot + geom_line(stat = 'summary', fun.y = quantile, probs = .9,
linetype = 1, color = 'blue')
myplot
## Warning: Removed 4906 rows containing missing values (stat_summary).
## Warning: Removed 4906 rows containing missing values (stat_summary).
## Warning: Removed 4906 rows containing missing values (stat_summary).
## Warning: Removed 4906 rows containing missing values (stat_summary).
## Warning: Removed 5180 rows containing missing values (geom_point).
#zoom in using coord_cartesian. A plot has one coordinate system, so this replaces the coord_trans added earlier
myplot <- myplot + coord_cartesian(xlim = c(13,70), ylim = c(0,1000))
myplot
## Warning: Removed 4906 rows containing missing values (stat_summary).
## Warning: Removed 4906 rows containing missing values (stat_summary).
## Warning: Removed 4906 rows containing missing values (stat_summary).
## Warning: Removed 4906 rows containing missing values (stat_summary).
## Warning: Removed 5156 rows containing missing values (geom_point).
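Notes: A rough sketch of why the zoom uses coord_cartesian rather than ylim. Scale limits such as ylim drop out-of-range rows before the summaries are computed, while coord_cartesian only changes the visible window. The friend counts below are hypothetical.

```r
# Hypothetical friend counts for one age, including one very large value
friend_count <- c(50, 150, 400, 3000)

# coord_cartesian(ylim = c(0, 1000)): the mean line is still computed from all rows
mean_zoomed <- mean(friend_count)

# ylim(0, 1000): rows above 1000 are removed first, so the mean line shifts down
mean_clipped <- mean(friend_count[friend_count <= 1000])

print(mean_zoomed)
print(mean_clipped)
```

In this toy case the summary drops from 900 to 200 when the data are clipped, so the choice changes what the overlaid lines mean, not just how the plot looks.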
Response:
See the Instructor Notes of this video to download Moira’s paper on perceived audience size and to see the final plot.
Notes: 1. Udacity stats - https://www.udacity.com/course/viewer#!/c-ud201/l-1345848540/m-171582737 2. Correlation Coefficient http://www.r-tutor.com/elementary-statistics/numerical-measures/correlation-coefficient
cor.test(x = pf$age, y = pf$friend_count)
##
## Pearson's product-moment correlation
##
## data: pf$age and pf$friend_count
## t = -8.6268, df = 99001, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.03363072 -0.02118189
## sample estimates:
## cor
## -0.02740737
#alternative syntax using with()
with(pf, cor.test(age, friend_count, method = 'pearson'))
##
## Pearson's product-moment correlation
##
## data: age and friend_count
## t = -8.6268, df = 99001, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.03363072 -0.02118189
## sample estimates:
## cor
## -0.02740737
Look up the documentation for the cor.test function.
What’s the correlation between age and friend count? Round to three decimal places. Response: -0.027. A coefficient this close to zero indicates that there is no meaningful linear relationship between the two variables. ***
Notes: For users aged 70 or under, the result of -0.172 indicates that friend count tends to decrease as age increases, but the correlation is weak. Descriptive statistics such as the one below cannot establish causation; that requires inferential statistics and experiments.
with(subset(pf, age <= 70), cor.test(age, friend_count))
##
## Pearson's product-moment correlation
##
## data: age and friend_count
## t = -52.592, df = 91029, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.1780220 -0.1654129
## sample estimates:
## cor
## -0.1717245
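Notes: A minimal sketch of what the Pearson coefficient reported by cor.test is, computed by hand on made-up numbers and compared with the built-in functions. The x and y values are assumptions for illustration.

```r
# Hypothetical ages and friend counts
x <- c(13, 20, 28, 37, 50, 70)
y <- c(160, 280, 125, 110, 95, 90)

# Pearson's r: covariance of the deviations divided by the product of their spreads
dx <- x - mean(x)
dy <- y - mean(y)
r_manual <- sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))

# The same number from cor() and from the estimate stored in the cor.test result
r_builtin <- cor(x, y)
r_test <- unname(cor.test(x, y)$estimate)

print(c(r_manual, r_builtin, r_test))

# Rounding a coefficient to three decimal places, as the quiz asks
print(round(-0.02740737, 3))
```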
Notes: Correlation (Pearson, Kendall, Spearman) http://www.statisticssolutions.com/correlation-pearson-kendall-spearman/ The point is that single-number coefficients are useful but can't replace the richness of a scatter plot. ***
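Notes: A small sketch of how the three coefficients can disagree. The data are synthetic: y rises with x every time, but along a curve rather than a straight line.

```r
# Synthetic monotonic but non-linear relationship
x <- 1:10
y <- x^3

# Pearson measures linear association, so it falls short of 1 here
pearson <- cor(x, y, method = "pearson")

# Spearman and Kendall work on ranks, so a strictly increasing curve scores 1
spearman <- cor(x, y, method = "spearman")
kendall <- cor(x, y, method = "kendall")

print(c(pearson = pearson, spearman = spearman, kendall = kendall))
```

A scatter plot of x against y would show the curve immediately, which is the information none of the three numbers carries on its own.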
Notes: examining the relationship between likes received and WWW likes received
myplot <- ggplot(aes(x = www_likes_received, y = likes_received), data = pf) +
geom_point()
myplot
myplot <- ggplot(aes(x = www_likes_received, y = likes_received), data = pf) +
geom_point(alpha = 1/100) +
coord_trans(x = 'sqrt')
myplot
myplot <- myplot + coord_cartesian(xlim = c(0,50), ylim = c(0,2500))
myplot
Notes:
remove(myplot)
#plot the scatter
myplot <- ggplot(aes(x = www_likes_received, y = likes_received), data = pf) +
geom_point()
myplot
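#adjust the axes using the 95% quantile of each plotted variable
#(y is likes_received, so its limit comes from likes_received; the warnings below were recorded with pf$www_likes as the y limit, so their counts will differ)
myplot <- myplot + xlim(0, quantile(pf$www_likes_received, 0.95)) +
ylim(0, quantile(pf$likes_received, 0.95))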
myplot
## Warning: Removed 11608 rows containing missing values (geom_point).
#add the line of best fit through the points. Its slope has the same sign as the correlation but equals it only when both variables are standardized
#lm == linear model
myplot <- myplot + geom_smooth(method = 'lm', color = 'red')
myplot
## Warning: Removed 11608 rows containing missing values (stat_smooth).
## Warning: Removed 11608 rows containing missing values (geom_point).
## Warning: Removed 33 rows containing missing values (geom_path).
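Notes: A minimal sketch of how the slope of the fitted line relates to the correlation coefficient, using made-up numbers. The slope equals r scaled by the ratio of the standard deviations, and equals r itself only after both variables are standardized.

```r
# Hypothetical paired measurements
x <- c(1, 2, 3, 4, 5, 6)
y <- c(12, 19, 33, 38, 52, 61)

# Slope of the least-squares line y ~ x
fit <- lm(y ~ x)
slope <- unname(coef(fit)["x"])

# The same slope recovered from the correlation and the two standard deviations
r <- cor(x, y)
slope_from_r <- r * sd(y) / sd(x)

# After standardizing both variables the slope is the correlation
zx <- as.numeric(scale(x))
zy <- as.numeric(scale(y))
slope_standardized <- unname(coef(lm(zy ~ zx))["zx"])

print(c(slope = slope, slope_from_r = slope_from_r))
print(c(r = r, slope_standardized = slope_standardized))
```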
What’s the correlation between the two variables? Include the top 5% of values for the variable in the calculation and round to 3 decimal places.
cor.test(x = pf$www_likes_received, y = pf$likes_received)
##
## Pearson's product-moment correlation
##
## data: pf$www_likes_received and pf$likes_received
## t = 937.1, df = 99001, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.9473553 0.9486176
## sample estimates:
## cor
## 0.9479902
Response: 0.948 ***
Notes: Linear Regression Assumptions https://en.wikipedia.org/wiki/Linear_regression#Assumptions ***
Notes:
#install.packages('alr3')
library(alr3)
## Warning: package 'alr3' was built under R version 3.2.1
## Loading required package: car
## Warning: package 'car' was built under R version 3.2.1
data("Mitchell")
Create your plot!
names(Mitchell)
## [1] "Month" "Temp"
ggplot(aes(x = Month, y = Temp), data = Mitchell) +
geom_point()
cor.test(x = Mitchell$Month, y = Mitchell$Temp)
##
## Pearson's product-moment correlation
##
## data: Mitchell$Month and Mitchell$Temp
## t = 0.81816, df = 202, p-value = 0.4142
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.08053637 0.19331562
## sample estimates:
## cor
## 0.05747063
Notes:
#label the x axis with a break every 12 months to mark each year
#first check the range of Month to set the break limits
range(Mitchell$Month)
## [1] 0 203
#Month is numeric, so use a continuous scale
ggplot(aes(x = Month, y = Temp), data = Mitchell) +
geom_point() +
scale_x_continuous(breaks = seq(0,203,12))
Notes: Stretch the above plot so that it is wider than it is tall. This reveals a distinct cyclic pattern that goes unnoticed in the regular view. As a rule of thumb the visualisation should be twice as wide as it is tall.
What do you notice? Response: There is a pattern repeating for the temperatures across months
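Notes: A sketch of why a near-zero correlation can hide a strong pattern. The series below is synthetic (a pure 12-month sine wave over 204 months), not the Mitchell data, and the amplitude of 10 is an arbitrary assumption.

```r
# Synthetic temperature-like series with a yearly cycle
month <- 0:203
temp <- 10 * sin(2 * pi * month / 12)

# The linear correlation with month number is close to zero
r <- cor(month, temp)

# Folding the months onto a single year (month %% 12) exposes the cycle
monthly_mean <- tapply(temp, month %% 12, mean)

print(round(r, 3))
print(round(monthly_mean, 2))
```

The folded means swing across the full range of the series even though the correlation coefficient is tiny, which matches the repeating pattern seen when the plot is stretched.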
Notes: Convert age to age in years with the months as a decimal fraction.
pf$age_with_months <- pf$age + (12 - pf$dob_month) / 12
age_groups2 <- group_by(pf, age_with_months)
pf.fc_by_age_months <- summarise(age_groups2,
friend_count_mean = mean(friend_count),
friend_count_median = median(friend_count),
n = n())
#sort by age ascending
pf.fc_by_age_months <- arrange(pf.fc_by_age_months, age_with_months)
#use head to check the first few rows. The second argument (number of rows) is optional
head(pf.fc_by_age_months, 20)
## Source: local data frame [20 x 4]
##
## age_with_months friend_count_mean friend_count_median n
## 1 13.16667 46.33333 30.5 6
## 2 13.25000 115.07143 23.5 14
## 3 13.33333 136.20000 44.0 25
## 4 13.41667 164.24242 72.0 33
## 5 13.50000 131.17778 66.0 45
## 6 13.58333 156.81481 64.0 54
## 7 13.66667 130.06522 75.5 46
## 8 13.75000 205.82609 122.0 69
## 9 13.83333 215.67742 111.0 62
## 10 13.91667 162.28462 71.0 130
## 11 14.00000 194.13115 105.0 122
## 12 14.08333 226.67568 106.0 111
## 13 14.16667 270.73611 146.0 144
## 14 14.25000 218.86131 132.0 137
## 15 14.33333 313.24000 148.5 150
## 16 14.41667 230.50000 123.0 160
## 17 14.50000 268.41892 150.5 148
## 18 14.58333 288.51309 153.0 191
## 19 14.66667 264.82927 192.0 164
## 20 14.75000 182.55621 103.0 169
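Notes: A worked example of the age_with_months formula on three hypothetical users. It assumes, as the formula above does, that ages are measured at the end of the year, so someone born in December has had no extra months.

```r
# Hypothetical users: whole-year age and birth month
age <- c(13, 13, 14)
dob_month <- c(12, 3, 6)

# Months lived beyond the last birthday, expressed as a fraction of a year
age_with_months <- age + (12 - dob_month) / 12

print(age_with_months)
```

The March-born 13-year-old comes out at 13.75 and the June-born 14-year-old at 14.5, which matches the twelfths seen in the first column of the table.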
Programming Assignment
#plot a line of mean friend count over age in months and limit age to under 71
ggplot(aes(x = age_with_months, y = friend_count_mean), data = subset(pf.fc_by_age_months, age_with_months < 71)) +
geom_line()
#plot from earlier
p1 <- ggplot(aes(x = age, y = friend_count_mean), data = subset(pf.fc_by_age, age < 71)) +
geom_line()
p2 <- ggplot(aes(x = age_with_months, y = friend_count_mean), data = subset(pf.fc_by_age_months, age_with_months < 71)) +
geom_line()
#widen the bins by rounding age to the nearest 5 years
#plot the mean friend count per bin (this averages the per-age means, so every age counts equally regardless of n)
p3 <- ggplot(aes(x = round(age / 5)*5, y = friend_count_mean), data = subset(pf.fc_by_age, age < 71)) +
geom_line(stat = 'summary', fun.y = mean)
#arrange
library(gridExtra)
## Warning: package 'gridExtra' was built under R version 3.2.1
grid.arrange(p2, p1,p3, ncol = 1)
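Notes: A small sketch of what round(age / 5) * 5 does in the third plot. The ages and mean friend counts are illustrative values, not read from pf.fc_by_age.

```r
# Hypothetical per-age summary rows
age <- c(13, 14, 17, 18, 22, 23)
friend_count_mean <- c(165, 251, 350, 331, 211, 203)

# Round each age to the nearest multiple of 5 to form wider bins
age_bin <- round(age / 5) * 5

# Average the per-age means within each bin (unweighted, as in p3)
binned <- tapply(friend_count_mean, age_bin, mean)

print(age_bin)
print(binned)
```

Wider bins give a smoother line with less noise, at the cost of hiding detail such as a spike at a single age; narrower bins (age in months) do the opposite.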
Questions:
1. In the scatter plot, why is the age variable not intuitive?
2. How is a jitter plot better than a scatter plot for age vs friend count?
3. Why do we use coord_trans on the y axis to improve the analysis?
4. Noise in Conditional Means - What is the Bias-Variance Trade-off?
5. Smoothing Conditional Means - What is local regression (LOESS)?
Reflection:
1. Jitter is used to overcome overplotting.
2. Clarification on the jitter syntax with position (https://discussions.udacity.com/t/confuse-with-position-arguments-in-geom-point-geom-jitter/26803): When we have geom_jitter(... position = position_jitter(h = 0)), we are telling R to set the magnitude of the jitter on the height of points (y-axis, vertical axis) to be 0. The equivalent setting if we want to change or remove the jitter on the x-axis (horizontal axis) is the "width" or "w" parameter. Don't forget that you can always check the documentation for more details with ?position_jitter, or check the online documentation to learn about functions you are unsure of.
geom_jitter does the same thing as geom_point, but has a different default value for the position argument. For geom_point, the default value is position = "identity", while for geom_jitter the default value is position = "jitter". Setting position = "jitter" in geom_point makes it act the same as geom_jitter; geom_jitter is a convenience function, since jittering points is performed commonly enough. Again, the documentation for both functions might be useful to look at.
Types of Transformations
There are three ways of transforming in ggplot:
1. by transforming the data: qplot(log10(carat), log10(price), data = diamonds)
2. by transforming the scales: qplot(carat, price, data = diamonds, log = "xy") or qplot(carat, price, data = diamonds) + scale_x_log10() + scale_y_log10()
3. by transforming the coordinate system: qplot(carat, price, data = diamonds) + coord_trans(x = "log10", y = "log10")
The difference between transforming the scales and transforming the coordinate system is that scale transformation occurs BEFORE statistics, and coordinate transformation afterwards. Coordinate transformation also changes the shape of geoms.
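Notes: A rough numeric sketch of the "before vs after statistics" distinction, using a square-root transform and three made-up values. With a scale transformation the mean is taken over the transformed values; with a coordinate transformation the mean is taken over the raw values and only its drawn position is transformed.

```r
# Hypothetical raw values at a single x position
y <- c(0, 100, 400)

# Scale transformation: transform first, then summarise
mean_scale <- mean(sqrt(y))

# Coordinate transformation: summarise first, then transform the result for drawing
mean_coord <- sqrt(mean(y))

print(mean_scale)
print(mean_coord)
```

The two approaches place the summary line at different heights (10 versus roughly 12.9 on the square-root axis here), so the choice affects the overlaid statistics and not just the axis labels.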
As a rule of thumb the visualisation should be twice as wide as it is tall!