Lesson 4 - Explore 2 Variables

Scatterplots and Perceived Audience Size

Notes:

Scatterplots

Notes:

# Import Library
library(ggplot2)

# Get Data
pf <- read.csv('pseudo_facebook.tsv',sep = '\t')

# Create scatterplot - qplot() draws a scatterplot by default when both variables are
# continuous, so the geom does not need to be specified.
qplot(x = age, y = friend_count, data = pf)

# Another way: qplot() assumes the first argument is x and the second is y.
qplot(age, friend_count, data = pf)

What are some things that you notice right away?

Response:

Observations:

Teenagers have a lot more friends
People lied about their age (69, 100+)
There are fake accounts

ggplot Syntax

Notes:

# Create scatterplot - using QPlot
qplot(x = age, y = friend_count, data = pf)

# Create scatterplot - using GGPlot
ggplot(aes(x = age, y = friend_count), data = pf)  + geom_point()

summary(pf$age)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   13.00   20.00   28.00   37.28   50.00  113.00

# Create scatterplot - using GGPlot & limiting the X axis to 13 through 90
ggplot(aes(x = age, y = friend_count), data = pf)  + geom_point() + xlim(13, 90)

## Warning: Removed 4906 rows containing missing values (geom_point).
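
The warning appears because xlim() discards every row whose age falls outside 13 to 90 before the points are drawn. A minimal sketch of that bookkeeping, using a small made-up age vector rather than the real pseudo_facebook data:

```r
# Hypothetical ages; the real data set is not needed for this illustration.
age <- c(13, 25, 40, 69, 90, 95, 103, 113)

# Rows that xlim(13, 90) would drop: anything outside the closed range [13, 90].
outside <- age < 13 | age > 90
n_removed <- sum(outside)

n_removed  # three of the made-up ages (95, 103, 113) fall outside the limits
```

Running the same comparison on pf$age is one way to check where the count in the warning comes from.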

Overplotting

Notes:

# To better handle overplotting, use the alpha parameter. With alpha = 1/20, it takes
# 20 overlapping points to produce the equivalent of one fully dark dot.
# Create scatterplot - using GGPlot & limiting the X axis to 13 through 90
ggplot(aes(x = age, y = friend_count), data = pf)  + geom_point(alpha = 1/20) + xlim(13, 90)

## Warning: Removed 4906 rows containing missing values (geom_point).

What do you notice in the plot?

Response:

The bulk of data lie below the 1000 friend count threshold

Alpha and Jitter

Notes:

# Jitter
ggplot(aes(x = age, y = friend_count), data = pf)  + geom_jitter(alpha = 1/20) + xlim(13, 90)

## Warning: Removed 5179 rows containing missing values (geom_point).

The friend counts for young users are not nearly as high as they appeared before. The bulk of young users really have friend counts below 1000.

Notice the peak at age 69.
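
Age is recorded in whole years, which is why the un-jittered plot shows vertical columns of points. Jittering adds a small random offset to each value. A rough base-R sketch of the idea on made-up ages (the 0.4 half-width is an assumption chosen for illustration, not a value taken from this lesson):

```r
set.seed(1)  # make the random offsets reproducible

# Hypothetical integer ages, as they would be stored in the data.
age <- c(13L, 13L, 14L, 14L, 14L, 69L)

# Add uniform noise in [-0.4, 0.4] so stacked points spread out horizontally.
age_jittered <- age + runif(length(age), min = -0.4, max = 0.4)

# Each jittered value still rounds back to the original age.
all(round(age_jittered) == age)
```

Because the offsets stay within half a year, the jittered points remain closest to their true age.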

Coord_trans()

Notes:

Look up the documentation for coord_trans() and add a layer to the plot that transforms friend_count using the square root function. Create your plot!

# URL: http://docs.ggplot2.org/current/coord_trans.html
# or

# ?coord_trans 


# coord_trans() using points
ggplot(aes(x = age, y = friend_count), data = pf) +
  geom_point(alpha = 1/20) +
  xlim(13, 90) +  
  coord_trans(y = "sqrt")

## Warning: Removed 4906 rows containing missing values (geom_point).

# coord_trans() using jitter (adds positive or negative noise to the points; h = 0 keeps the noise horizontal)
ggplot(aes(x = age, y = friend_count), data = pf) +
  geom_point(alpha = 1/20, position = position_jitter(h = 0)) +
  xlim(13, 90) +  
  coord_trans(y = "sqrt")

## Warning: Removed 5183 rows containing missing values (geom_point).
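
Setting h = 0 in position_jitter() keeps the noise horizontal only. One plausible reason, sketched below with a made-up offset, is that vertical jitter could push a friend count of 0 below zero, and the square root of a negative number is undefined:

```r
# A friend count of 0 nudged downward by vertical jitter (hypothetical offset).
jittered_count <- 0 - 0.3

# sqrt() of a negative number yields NaN (R also emits a warning, silenced here).
result <- suppressWarnings(sqrt(jittered_count))
is.nan(result)

# With height = 0 the count is left untouched, so the transform stays defined.
sqrt(0)
```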

What do you notice?

Quiz Alpha and Jitter

Notes:

# Examine the relationship between
# friendships_initiated (y) and age (x)
# using the ggplot syntax.

# We recommend creating a basic scatter
# plot first to see what the distribution looks like.
# and then adjusting it by adding one layer at a time.

# What are your observations about your final plot?

# Remember to make adjustments to the breaks
# of the x-axis and to apply alpha and jitter.


# Create scatterplot - using GGPlot
ggplot(aes(x = age, y = friendships_initiated), data = pf)  + 
  geom_point(alpha = 1/10) +
  xlim(13, 90) +
  coord_trans(y = "sqrt")

## Warning: Removed 4906 rows containing missing values (geom_point).

# Create scatterplot - using GGPlot & Jitter
ggplot(aes(x = age, y = friendships_initiated), data = pf)  + 
  geom_jitter(alpha = 1/10, position = position_jitter(h = 0)) +
  xlim(13, 90) +
  coord_trans(y = "sqrt")

## Warning: Removed 5193 rows containing missing values (geom_point).

Overplotting and Domain Knowledge

Notes:

Expressing a variable as a percentage puts it on a bounded scale, which can make a plot easier to interpret.

Conditional Means (how the mean or median varies with another variable)

The dplyr library allows us to split up a data frame and apply a function to parts of the data.

Notes:

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

# dplyr Functions:
# filter()
# group_by()
# mutate()
# arrange()

age_groups <- group_by(pf, age)

# Create a table that gives, for each age, the mean and median friend count.
# n() can only be used inside summarise(); it reports how many users are in each group.
pf.fc_age <- summarise(age_groups,
          friend_count_mean = mean(friend_count),
          friend_count_median = median(friend_count),
          n = n()
          )

# Rearranging the order so that the "Ages" go from low to high.
pf.fc_age <- arrange(pf.fc_age, age)

head(pf.fc_age)

## Source: local data frame [6 x 4]
## 
##     age friend_count_mean friend_count_median     n
##   (int)             (dbl)               (dbl) (int)
## 1    13          164.7500                74.0   484
## 2    14          251.3901               132.0  1925
## 3    15          347.6921               161.0  2618
## 4    16          351.9371               171.5  3086
## 5    17          350.3006               156.0  3283
## 6    18          331.1663               162.0  5196
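
To see what group_by() and summarise() compute, here is a small sketch on a made-up data frame (the toy values are not from pseudo_facebook.tsv):

```r
library(dplyr)

# Toy data: two users aged 13 and three aged 14.
toy <- data.frame(
  age = c(13, 13, 14, 14, 14),
  friend_count = c(10, 30, 5, 20, 200)
)

# Same three steps as above: group, summarise, then sort by age.
toy_groups <- group_by(toy, age)
toy_by_age <- summarise(toy_groups,
                        friend_count_mean = mean(friend_count),
                        friend_count_median = median(friend_count),
                        n = n())
toy_by_age <- arrange(toy_by_age, age)

toy_by_age
```

For age 14 the mean (75) sits well above the median (20) because a single large count pulls it up, which is the same effect visible in the real table.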

# ******* ALTERNATIVE METHOD ***************
# Note: %>% lets us chain functions onto our data set.

pf.fc_by_age <- pf %>% 
  group_by(age) %>% 
  summarize( friend_count_mean = mean(friend_count),
             friend_count_median = median(friend_count),
             n = n()) %>% 
  arrange(age)

head(pf.fc_by_age)

## Source: local data frame [6 x 4]
## 
##     age friend_count_mean friend_count_median     n
##   (int)             (dbl)               (dbl) (int)
## 1    13          164.7500                74.0   484
## 2    14          251.3901               132.0  1925
## 3    15          347.6921               161.0  2618
## 4    16          351.9371               171.5  3086
## 5    17          350.3006               156.0  3283
## 6    18          331.1663               162.0  5196

Create your plot!

# Plot mean friend count vs. age using a line graph.
# Be sure you use the correct variable names
# and the correct data frame. You should be working
# with the new data frame created from the dplyr
# functions. The data frame is called 'pf.fc_by_age'.

# Use geom_line() rather than geom_point to create
# the plot. You can look up the documentation for
# geom_line() to see what it does.


pf.fc_by_age

## Source: local data frame [101 x 4]
## 
##      age friend_count_mean friend_count_median     n
##    (int)             (dbl)               (dbl) (int)
## 1     13          164.7500                74.0   484
## 2     14          251.3901               132.0  1925
## 3     15          347.6921               161.0  2618
## 4     16          351.9371               171.5  3086
## 5     17          350.3006               156.0  3283
## 6     18          331.1663               162.0  5196
## 7     19          333.6921               157.0  4391
## 8     20          283.4991               135.0  3769
## 9     21          235.9412               121.0  3671
## 10    22          211.3948               106.0  3032
## ..   ...               ...                 ...   ...

# Create scatterplot - using GGPlot - OLDER EXAMPLE
ggplot(aes(x = age, y = friend_count_mean), data = pf.fc_by_age)  + 
  geom_point(alpha = 1/10) +
  xlim(13, 90) +
  coord_trans(y = "sqrt")

## Warning: Removed 23 rows containing missing values (geom_point).

# Create a Line Plot
ggplot(aes(x = age, y = friend_count_mean), data = pf.fc_by_age)  + 
  geom_point(alpha = 1/5) +
  geom_line() +
  xlim(13, 90) +
  coord_trans(y = "sqrt")

## Warning: Removed 23 rows containing missing values (geom_point).

## Warning: Removed 23 rows containing missing values (geom_path).

Overlaying Summaries with Raw Data

Notes:

# Create scatterplot - using ggplot - older example, with orange points and the mean overlaid as a line
# (in ggplot2 3.3.0 and later, fun.y is deprecated in favor of fun)
ggplot(aes(x = age, y = friend_count), data = pf)  + 
  xlim(13, 90) +
  geom_point(alpha = .05,
             position = position_jitter(h = 0),
             color = 'orange') +
  coord_trans(y = "sqrt") +
  geom_line(stat = 'summary', fun.y = mean)

## Warning: Removed 4906 rows containing non-finite values (stat_summary).

## Warning: Removed 5175 rows containing missing values (geom_point).

# Adding in Quantiles

ggplot(aes(x = age, y = friend_count), data = pf)  + 
  xlim(13, 90) +
  geom_point(alpha = .05,
             position = position_jitter(h = 0),
             color = 'orange') +
  coord_trans(y = "sqrt") +
  geom_line(stat = 'summary', fun.y = mean) +
  geom_line(stat = 'summary', fun.y = quantile, fun.args = list(probs = .1),
            linetype = 2, color = 'blue') +
  geom_line(stat = 'summary', fun.y = quantile, fun.args = list(probs = .9),
            linetype = 2, color = 'blue')

## Warning: Removed 4906 rows containing non-finite values (stat_summary).

## Warning: Removed 4906 rows containing non-finite values (stat_summary).

## Warning: Removed 4906 rows containing non-finite values (stat_summary).

## Warning: Removed 5175 rows containing missing values (geom_point).
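
The dashed blue lines are the 10th and 90th percentiles of friend_count at each age. As a quick illustration of what quantile() returns, using a made-up vector:

```r
# Hypothetical friend counts for a single age group.
counts <- 1:11

# 10th and 90th percentiles (R's default quantile algorithm, type 7).
q10 <- quantile(counts, probs = 0.1)
q90 <- quantile(counts, probs = 0.9)

c(q10, q90)
```

Roughly 80% of the values lie between the two percentiles, so the band between the dashed lines shows where most users of a given age fall.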

# INSPIRATION: coord_cartesian
#
# A better method - using coord_cartesian Layer - adjusting for zoom into 250 count.
# qplot(x= gender, y = friend_count, 
#      data = subset(pf, !is.na(gender)), 
#      geom = 'boxplot') + 
#  coord_cartesian(ylim = c(0, 250))


ggplot(aes(x = age, y = friend_count), data = pf)  + 
  # xlim(13, 90) +
  coord_cartesian(xlim = c(13, 70), ylim = c(0, 1000)) +
  geom_point(alpha = .05,
             position = position_jitter(h = 0),
             color = 'orange') +
  # coord_trans(y = "sqrt") +
  geom_line(stat = 'summary', fun.y = mean) +
  geom_line(stat = 'summary', fun.y = quantile, fun.args = list(probs = .1),
            linetype = 2, color = 'blue') +
  geom_line(stat = 'summary', fun.y = quantile, fun.args = list(probs = .9),
            linetype = 2, color = 'blue')
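
coord_cartesian() zooms the view without discarding data, whereas xlim() and ylim() drop out-of-range rows before any summary is computed. The difference matters for the mean line; a sketch of the arithmetic on made-up counts:

```r
# Hypothetical friend counts at one age, including one very large value.
counts <- c(10, 20, 30, 5000)

# What a summary sees under ylim(0, 1000): the out-of-range row is dropped first.
mean_with_ylim <- mean(counts[counts >= 0 & counts <= 1000])

# What a summary sees under coord_cartesian(ylim = c(0, 1000)): all rows.
mean_with_coord <- mean(counts)

c(mean_with_ylim, mean_with_coord)
```

With these toy numbers the two means are 20 and 1265, so the choice of layer can change the summary lines considerably.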

What are some of your observations of the plot?

Response:

Moira: Histogram Summary and Scatterplot

See the Instructor Notes of this video to download Moira’s paper on perceived audience size and to see the final plot.

Notes:

Correlation

Notes:

# ?cor.test

cor.test(pf$age, pf$friend_count, method = 'pearson')

## 
##  Pearson's product-moment correlation
## 
## data:  pf$age and pf$friend_count
## t = -8.6268, df = 99001, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.03363072 -0.02118189
## sample estimates:
##         cor 
## -0.02740737

# This indicates that there is no meaningful linear relationship between the two variables.
# A good rule of thumb: a correlation greater than 0.3 or less than -0.3 is meaningful (but small),
# around 0.5 is moderate, and 0.7 or greater is large.
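
Pearson's r is the covariance of the two variables divided by the product of their standard deviations. A small check on made-up numbers that this matches cor():

```r
# Made-up paired observations.
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 5)

# Pearson's r from its definition.
r_manual <- cov(x, y) / (sd(x) * sd(y))

# The built-in computation.
r_builtin <- cor(x, y, method = "pearson")

all.equal(r_manual, r_builtin)
```

For these values r is about 0.77, which the rule of thumb above would call a large correlation.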

# Another way to compute the same coefficient is to use the following code:
with(pf, cor.test(age, friend_count, method = 'pearson'))

## 
##  Pearson's product-moment correlation
## 
## data:  age and friend_count
## t = -8.6268, df = 99001, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.03363072 -0.02118189
## sample estimates:
##         cor 
## -0.02740737

# Compute the coefficient again, this time restricted to users aged 70 or younger:
with(subset(pf, age <= 70), cor.test(age, friend_count, method = 'pearson'))

## 
##  Pearson's product-moment correlation
## 
## data:  age and friend_count
## t = -52.592, df = 91029, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.1780220 -0.1654129
## sample estimates:
##        cor 
## -0.1717245

# Subset the data to ages 70 and under
# subset(pf, age <= 70)


# Note: this indicates a negative relationship between age and friend count:
# as age increases, friend count tends to decrease.

# It's important to note that this does not show that one variable causes the other.
# To address causality, we'd need inferential statistics, not the descriptive statistics that
# we are using here.

Look up the documentation for the cor.test function.

What’s the correlation between age and friend count? Round to three decimal places. Response:

-0.172 for ages 70 and under, and -0.027 for the whole data set.

Correlation on Subsets

Notes:

# Pearson
with( subset(pf, age <= 70) , cor.test(age, friend_count, method = 'pearson'))

## 
##  Pearson's product-moment correlation
## 
## data:  age and friend_count
## t = -52.592, df = 91029, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.1780220 -0.1654129
## sample estimates:
##        cor 
## -0.1717245

# Spearman
with( subset(pf, age <= 70) , cor.test(age, friend_count, method = 'spearman', exact=FALSE))

## 
##  Spearman's rank correlation rho
## 
## data:  age and friend_count
## S = 1.5782e+14, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##        rho 
## -0.2552934
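
Spearman's rho is Pearson's r computed on the ranks of the data, so it measures monotonic rather than strictly linear association. A sketch with made-up values:

```r
# Made-up data: y rises with x, but not in a straight line.
x <- 1:10
y <- x^3

pearson <- cor(x, y, method = "pearson")
spearman <- cor(x, y, method = "spearman")

# Pearson's r applied to the ranks gives the same number as Spearman's rho.
rank_based <- cor(rank(x), rank(y))

c(pearson = pearson, spearman = spearman, rank_based = rank_based)
```

Here rho is exactly 1 because the ordering of y matches the ordering of x, while Pearson's r falls short of 1 because the relationship is curved.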

Correlation Methods

Notes:

Create Scatterplots

Notes:

names(pf)

##  [1] "userid"                "age"                  
##  [3] "dob_day"               "dob_year"             
##  [5] "dob_month"             "gender"               
##  [7] "tenure"                "friend_count"         
##  [9] "friendships_initiated" "likes"                
## [11] "likes_received"        "mobile_likes"         
## [13] "mobile_likes_received" "www_likes"            
## [15] "www_likes_received"

# Inspiration:
ggplot(aes(x = age, y = friend_count_mean), data = pf.fc_by_age)  + 
  geom_point(alpha = 1/10) +
  xlim(13, 90) +
  coord_trans(y = "sqrt")

## Warning: Removed 23 rows containing missing values (geom_point).

# Actual:

ggplot(aes(x = www_likes_received, y = likes_received), data = pf) + geom_point()

# Quantile Function to remove outliers
ggplot(aes(x = www_likes_received, y = likes_received), data = pf) + geom_point() +
  xlim(0, quantile(pf$www_likes_received, 0.95)) +
  ylim(0, quantile(pf$likes_received, 0.95))

## Warning: Removed 6075 rows containing missing values (geom_point).

# Quantile Function to remove outliers - add line of best fit.
ggplot(aes(x = www_likes_received, y = likes_received), data = pf) + geom_point() +
  xlim(0, quantile(pf$www_likes_received, 0.95)) +
  ylim(0, quantile(pf$likes_received, 0.95)) +
  geom_smooth(method = 'lm', color = 'red')

## Warning: Removed 6075 rows containing non-finite values (stat_smooth).

## Warning: Removed 6075 rows containing missing values (geom_point).

Strong (High) Correlations

Notes:

# Quantifying relationship with a number

# Pearson
cor.test(pf$www_likes_received, pf$likes_received, method = 'pearson')

## 
##  Pearson's product-moment correlation
## 
## data:  pf$www_likes_received and pf$likes_received
## t = 937.1, df = 99001, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.9473553 0.9486176
## sample estimates:
##       cor 
## 0.9479902

# NOTE: These variables are so closely correlated because one is a superset of the other.
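
One way to see why a part and its whole correlate so strongly is to simulate it. In this sketch the two components are made-up random draws, not the real like counts, and total_likes is defined as their sum:

```r
set.seed(42)
n <- 1000

# Hypothetical, independent components of a user's likes.
www_likes <- rexp(n, rate = 1 / 100)
mobile_likes <- rexp(n, rate = 1 / 30)

# The "superset" variable contains www_likes by construction.
total_likes <- www_likes + mobile_likes

cor_parts <- cor(www_likes, mobile_likes)      # two unrelated parts
cor_part_whole <- cor(www_likes, total_likes)  # a part against its whole

c(cor_parts = cor_parts, cor_part_whole = cor_part_whole)
```

Even though the two parts are unrelated, the part-to-whole correlation comes out high, so a strong coefficient here says little beyond how the variables were defined.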

What’s the correlation between the two variables? Include the top 5% of values for the variable in the calculation and round to 3 decimal places.

Response:

Moira on Correlation

Notes:

More Caution with Correlation - learning how correlation can be deceptive.

Notes:

# install.packages('alr3')
library(alr3)

## Loading required package: car

data("Mitchell")
# ?Mitchell

Create your plot!

# Create a scatterplot of temperature (Temp)
# vs. months (Month).

str(Mitchell)

## 'data.frame':    204 obs. of  2 variables:
##  $ Month: int  0 1 2 3 4 5 6 7 8 9 ...
##  $ Temp : num  -5.18 -1.65 2.49 10.4 14.99 ...

# Inspiration - scatterplot using GGPlot
# ggplot(aes(x = age, y = friend_count), data = pf)  + geom_point()

ggplot(aes(x = Month, y = Temp), data = Mitchell)  + geom_point()

# Actual Correlation
cor.test(Mitchell$Temp, Mitchell$Month, method = 'pearson')

## 
##  Pearson's product-moment correlation
## 
## data:  Mitchell$Temp and Mitchell$Month
## t = 0.81816, df = 202, p-value = 0.4142
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.08053637  0.19331562
## sample estimates:
##        cor 
## 0.05747063

Noisy Scatterplots

Take a guess for the correlation coefficient for the scatterplot.

Zero

What is the actual correlation of the two variables? (Round to the thousandths place)

0.057

# Mark the x-axis every 12 months (Month is continuous and runs from 0 to 203)
ggplot(aes(x = Month, y = Temp), data = Mitchell)  + geom_point() +
  scale_x_continuous(breaks = seq(0, 203, 12))
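
A near-zero correlation does not mean there is no pattern. As a sketch, a purely seasonal series (a made-up sine wave standing in for temperature, not the Mitchell data) has almost no linear correlation with the month index, yet is fully determined by the month of the year:

```r
# Made-up seasonal series: 204 monthly values repeating every 12 months.
month <- 0:203
temp <- 10 * sin(2 * pi * month / 12)

# Linear correlation with the running month index is close to zero.
r_linear <- cor(month, temp)

# Group by month of the year (0 to 11) and compare each value to its group mean.
month_of_year <- month %% 12
seasonal_mean <- tapply(temp, month_of_year, mean)
max_dev <- max(abs(temp - seasonal_mean[as.character(month_of_year)]))

c(r_linear = r_linear, max_dev = max_dev)
```

The deviation from the seasonal mean is effectively zero, so the structure is entirely cyclical, which is exactly what a single correlation coefficient cannot express.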

Making Sense of Data

Notes:

A New Perspective

What do you notice? Response:

Watch the solution video and check out the Instructor Notes! Notes:

Understanding Noise: Age to Age Months

Notes:

# Going back to this Line plot:
ggplot(aes(x = age, y = friend_count_mean), data = pf.fc_by_age)  + 
  geom_line()

# Print out some of the DF to have a closer look at the rising and falling of the data
head(pf.fc_by_age)

## Source: local data frame [6 x 4]
## 
##     age friend_count_mean friend_count_median     n
##   (int)             (dbl)               (dbl) (int)
## 1    13          164.7500                74.0   484
## 2    14          251.3901               132.0  1925
## 3    15          347.6921               161.0  2618
## 4    16          351.9371               171.5  3086
## 5    17          350.3006               156.0  3283
## 6    18          331.1663               162.0  5196

# Slice Data More:
pf.fc_by_age[17:19, ]

## Source: local data frame [3 x 4]
## 
##     age friend_count_mean friend_count_median     n
##   (int)             (dbl)               (dbl) (int)
## 1    29          120.8182                66.0  1936
## 2    30          115.2080                67.5  1716
## 3    31          118.4599                63.0  1694

# Compute each user's age in fractional years, using the birth month for sub-year precision
pf$age_with_months <- pf$age + abs(12 - pf$dob_month)/12


# Two alternate solutions:
# (1) pf$age_with_months <- pf$age + (1 - pf$dob_month / 12)
# (2) pf$age_with_months <- with(pf, age + (1 - dob_month / 12))


head(pf[16])

##   age_with_months
## 1        14.08333
## 2        14.08333
## 3        14.08333
## 4        14.00000
## 5        14.00000
## 6        14.00000
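
A worked check of the formula on a made-up user: someone aged 14 with a November birthday (dob_month = 11) gets 14 + (12 - 11)/12 = 14.08333, and a December birthday adds nothing. The sketch also confirms that the abs() form and the alternate form agree for every month:

```r
# Hypothetical user aged 14, evaluated for every possible birth month.
age <- 14
dob_month <- 1:12

v1 <- age + abs(12 - dob_month) / 12   # form used above
v2 <- age + (1 - dob_month / 12)       # alternate solution (1)

all.equal(v1, v2)

v1[11]  # November birthday
v1[12]  # December birthday
```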

Age with Months Means

# Solution: Chain commands together using "%>%"

library(dplyr)

pf.fc_by_age_months <- pf %>%
  group_by(age_with_months) %>%
  summarize( friend_count_mean = mean(friend_count),
             friend_count_median = median(friend_count),
             n = n()) %>%
  arrange(age_with_months)


head(pf.fc_by_age_months)

## Source: local data frame [6 x 4]
## 
##   age_with_months friend_count_mean friend_count_median     n
##             (dbl)             (dbl)               (dbl) (int)
## 1        13.16667          46.33333                30.5     6
## 2        13.25000         115.07143                23.5    14
## 3        13.33333         136.20000                44.0    25
## 4        13.41667         164.24242                72.0    33
## 5        13.50000         131.17778                66.0    45
## 6        13.58333         156.81481                64.0    54

# Alternative Method: Use the data frame and then apply commands to it

age_with_months_groups <- group_by(pf, age_with_months)
pf.fc_by_age_months2 <- summarise(age_with_months_groups,
                                  friend_count_mean = mean(friend_count),
                                  friend_count_median = median(friend_count),
                                  n = n()) 
pf.fc_by_age_months2 <- arrange(pf.fc_by_age_months2, age_with_months)


head(pf.fc_by_age_months2)

## Source: local data frame [6 x 4]
## 
##   age_with_months friend_count_mean friend_count_median     n
##             (dbl)             (dbl)               (dbl) (int)
## 1        13.16667          46.33333                30.5     6
## 2        13.25000         115.07143                23.5    14
## 3        13.33333         136.20000                44.0    25
## 4        13.41667         164.24242                72.0    33
## 5        13.50000         131.17778                66.0    45
## 6        13.58333         156.81481                64.0    54

Programming Assignment

# Experiment: Create a DataFrame
employee <- c('John Doe','Peter Gynn','Jolie Hope')
salary <- c(21000, 23400, 26800)
startdate <- as.Date(c('2010-11-1','2008-3-25','2007-3-14'))
employ.data <- data.frame(employee, salary, startdate)


# Experiment:  Test for data type and if a dataframe
typeof(employ.data)

## [1] "list"

is.data.frame(pf.fc_by_age_months)

## [1] TRUE

Noise in Conditional Means

# Original Line plot: NOTE:  (12 x 71 = 852)
p1 <- ggplot(aes(x = age, y = friend_count_mean), 
       data = subset(pf.fc_by_age, age < 71))  + 
  geom_line() +
  geom_smooth()


# New Line plot with noise subsetted by age < 71
p2 <- ggplot(aes(x = age_with_months, y = friend_count_mean), 
       data = subset(pf.fc_by_age_months, age_with_months < 71))  + 
  geom_line()+
  geom_smooth()


# Add a 3rd plot that displays the mean friend count with ages binned to the nearest 5 years
p3 <- ggplot(aes(x = round(age/5) * 5, y = friend_count_mean), 
       data = subset(pf.fc_by_age, age < 71))  + 
  geom_line(stat = 'summary', fun.y = mean)


# Grid Plot
library(gridExtra)

## 
## Attaching package: 'gridExtra'

## The following object is masked from 'package:dplyr':
## 
##     combine

grid.arrange(p2, p1, p3, ncol = 1)
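
The third plot bins ages to the nearest multiple of 5 with round(age / 5) * 5, trading detail for a smoother line. A quick look at what that expression does to a few made-up ages:

```r
# Hypothetical ages.
age <- c(13, 17, 18, 22, 23, 27)

# Divide by 5, round to the nearest whole number, then scale back up.
age_bin <- round(age / 5) * 5

age_bin
```

Ages 13 and 17 both land in the 15 bin, 18 and 22 in the 20 bin, and 23 and 27 in the 25 bin, so each plotted mean averages over roughly five years of users.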

#Spot Checking
head(pf.fc_by_age)

## Source: local data frame [6 x 4]
## 
##     age friend_count_mean friend_count_median     n
##   (int)             (dbl)               (dbl) (int)
## 1    13          164.7500                74.0   484
## 2    14          251.3901               132.0  1925
## 3    15          347.6921               161.0  2618
## 4    16          351.9371               171.5  3086
## 5    17          350.3006               156.0  3283
## 6    18          331.1663               162.0  5196

head(pf.fc_by_age_months)

## Source: local data frame [6 x 4]
## 
##   age_with_months friend_count_mean friend_count_median     n
##             (dbl)             (dbl)               (dbl) (int)
## 1        13.16667          46.33333                30.5     6
## 2        13.25000         115.07143                23.5    14
## 3        13.33333         136.20000                44.0    25
## 4        13.41667         164.24242                72.0    33
## 5        13.50000         131.17778                66.0    45
## 6        13.58333         156.81481                64.0    54

Smoothing Conditional Means

Notes:

Which Plot to Choose?

Notes:

Analyzing Two Variables

Reflection:

Click KnitHTML to see all of your hard work and to have an html page of this lesson, your answers, and your notes!

TEMPLATE

Notes:

Lesson 4 - Explore 2 Variables

Matthew R. Versaggi

May 6, 2016
