Multivariate Data

Notes: In the last lesson, we learned how to examine the relationship between two quantitative variables. In this lesson, we will learn how to look at 3 or more variables at a time.

Let’s turn back to Moira’s analysis and how she examined a third variable in her scatter plot of perceived audience size.


Moira Perceived Audience Size Colored by Age

Notes: Moira began to wonder if older Facebook users had a better sense of their perceived audience size than, say, teenagers. For her “Perceived Audience Size by Age” scatterplot, she therefore colored the points by age range. However, this path still led her to a dead end.
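Moira’s data isn’t included in this course, but as a minimal sketch, coloring the points of a scatterplot by a bucketed age variable looks like this in ggplot2 (the data frame audience and its column names are hypothetical):

suppressMessages(library(ggplot2))

# Hypothetical data frame 'audience' with columns actual_size, perceived_size,
# and age_group (a factor of age ranges); the names are ours, not Moira's.
ggplot(aes(x = actual_size, y = perceived_size, color = age_group),
       data = audience) +
  geom_point(alpha = 0.5)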


Third Qualitative Variable

Notes: It is important to note that when we are doing exploratory data analysis, we oftentimes do hit dead ends.

Let’s continue our analysis of friend_count by age by adding a third variable. Previously, we noted that female users, on average, had more friends than male users. We might wonder whether this is because females have a different age distribution, or whether, conditional on age, the differences are actually even larger.

setwd("~/Desktop/Udacity/Data_analysis_with_R/5_Explore_Many_Variables")
suppressMessages(library("ggplot2"))
pf <- read.csv('pseudo_facebook.tsv', sep='\t')

ggplot(aes(friend_count), data=subset(pf, !is.na(gender))) +
  geom_histogram(binwidth=30) +
  facet_wrap(~gender) 

Exploring Ages of Users by Gender

Here’s a box plot of age by gender. Let’s add the mean for each gender to the box plot using stat_summary. Since male users are a bit younger, we might suspect that a simple male-to-female comparison doesn’t capture the substantial differences in friend count. Let’s look at median friend count by age and gender instead.

ggplot(aes(x=gender, y=age), data=subset(pf, !is.na(gender))) + 
  geom_boxplot() +
  stat_summary(fun.y=mean, geom="point", shape=4)

ggplot(aes(x=age, y=friend_count), data=subset(pf, !is.na(gender))) +
  geom_line(aes(color=gender), stat="summary", fun.y=median)

Looking at these plots, we can see that nearly everywhere the median friend count is larger for women than for men. There are some exceptions, including the very noisy estimates for our very old users. Recall that we are not entirely sure the age information for these “old” users is accurate.

Quiz: Third Qualitative Variable

# Write code to create a new data frame,
# called 'pf.fc_by_age_gender', that contains
# information on each age AND gender group.

# The data frame should contain the following variables:

#    mean_friend_count,
#    median_friend_count,
#    n (the number of users in each age and gender grouping)

# Here is an example of the structure of your data frame. Your
# data values will be different. Note that if you are grouping by
# more than one variable, you will probably need to call the
# ungroup() function. 

#   age gender mean_friend_count median_friend_count    n
# 1  13 female          247.2953                 150  207
# 2  13   male          184.2342                  61  265
# 3  14 female          329.1938                 245  834
# 4  14   male          157.1204                  88 1201

suppressMessages(library(dplyr))
# ENTER YOUR CODE BELOW THIS LINE.
# ==============================================================

pf.fc_by_age_gender <- pf %>% 
  filter(!is.na(gender)) %>%
  group_by(age, gender) %>%
  summarise(mean_friend_count = mean(friend_count),
            median_friend_count = median(friend_count),
            n=n())
head(pf.fc_by_age_gender)
## Source: local data frame [6 x 5]
## Groups: age [3]
## 
##     age gender mean_friend_count median_friend_count     n
##   (int) (fctr)             (dbl)               (dbl) (int)
## 1    13 female          259.1606               148.0   193
## 2    13   male          102.1340                55.0   291
## 3    14 female          362.4286               224.0   847
## 4    14   male          164.1456                92.5  1078
## 5    15 female          538.6813               276.0  1139
## 6    15   male          200.6658               106.5  1478
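As the quiz comment suggests, when grouping by more than one variable you may want to remove the residual grouping left by summarise (note the “Groups: age” line in the output above). A minimal, optional addition:

# Optional: drop the leftover grouping and sort by age.
pf.fc_by_age_gender <- pf.fc_by_age_gender %>%
  ungroup() %>%
  arrange(age)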

Plotting Conditional Summaries

# Create a line graph showing the
# median friend count over the ages
# for each gender. Be sure to use
# the data frame you just created,
# pf.fc_by_age_gender.

# ENTER YOUR CODE BELOW THIS LINE
# =================================================
ggplot(aes(age, median_friend_count), data = pf.fc_by_age_gender) +
  geom_line(aes(color=gender)) 

#Can add the color argument inside ggplot or geom_line function.
#ggplot(aes(age, median_friend_count, color=gender), data = pf.fc_by_age_gender) +
#  geom_line()

Thinking in Ratios

Notes: The above plot can be useful if we want to inspect these values, or carry out further operations to help us understand how the difference between male and female users varies with age. For example, looking at this plot, it seems like the gender difference is largest for our young users. It would be helpful to put this in relative terms though. So let’s answer a different question.

How many times more friends does the average female user have than the average male user? Perhaps females have twice as many friends as males, or perhaps it’s ten times as many.


Wide and Long Format

Notes: To answer the above question, we need to rearrange our data a bit. Right now, our data is in long format: we have many rows, and the values of the gender variable we grouped over (male and female) are repeated for each age. So let’s change this long format into a wide format.

Our new data frame will have one row for each age, with separate male and female columns holding the median friend count.

head(pf.fc_by_age_gender)
## Source: local data frame [6 x 5]
## Groups: age [3]
## 
##     age gender mean_friend_count median_friend_count     n
##   (int) (fctr)             (dbl)               (dbl) (int)
## 1    13 female          259.1606               148.0   193
## 2    13   male          102.1340                55.0   291
## 3    14 female          362.4286               224.0   847
## 4    14   male          164.1456                92.5  1078
## 5    15 female          538.6813               276.0  1139
## 6    15   male          200.6658               106.5  1478
suppressMessages(library(tidyr))

#spread function: Generates multiple columns from two columns
pf.fc_by_age_gender.spread <- spread(data=subset(pf.fc_by_age_gender, 
                   select = c("age", "gender", "median_friend_count")),
                   key=gender, value=median_friend_count)
head(pf.fc_by_age_gender.spread)
## Source: local data frame [6 x 3]
## Groups: age [6]
## 
##     age female  male
##   (int)  (dbl) (dbl)
## 1    13  148.0  55.0
## 2    14  224.0  92.5
## 3    15  276.0 106.5
## 4    16  258.5 136.0
## 5    17  245.5 125.0
## 6    18  243.0 122.0

For more resources on how to change your data from Wide to Long Format (or vice versa), refer to:

“Data Wrangling with R” https://s3.amazonaws.com/udacity-hosted-downloads/ud651/DataWranglingWithR.pdf

“An Introduction to Reshape2 package” http://seananderson.ca/2013/10/19/reshape.html

“Converting Between Wide and Long Format” http://www.cookbook-r.com/Manipulating_data/Converting_data_between_wide_and_long_format/

“Melt Data Frame” http://www.r-bloggers.com/melt/


Reshaping Data

Notes: Notice that we used the tidyr package above to create our wide formatted data that we called pf.fc_by_age_gender.spread.

Alternatively, we can use the reshape2 package to create an identical data frame using the dcast function. The letter “d” is used since we want the result to be a data frame. If we wanted an array or a matrix, we could use acast.

It first takes the data frame we are working with. We then pass a formula: the variables to the left of the tilde are the variables we want to keep, joined by plus signs, and on the right side of the tilde we put gender, since we want male and female users to each have their own column of median friend count. Lastly, we set value.var to median_friend_count, since that variable holds the measurements that fill the new data frame.

Note: the melt function will allow you to convert the wide data back to the original long format (a sketch follows the dcast example below).

#install.packages('reshape2')
suppressMessages(library(reshape2))

pf.fc_by_age_gender.wide <- dcast(pf.fc_by_age_gender,
                                  age ~ gender,
                                  value.var="median_friend_count")
head(pf.fc_by_age_gender.wide)
##   age female  male
## 1  13  148.0  55.0
## 2  14  224.0  92.5
## 3  15  276.0 106.5
## 4  16  258.5 136.0
## 5  17  245.5 125.0
## 6  18  243.0 122.0
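To illustrate the note above, here is a minimal sketch of going back the other way with melt (the name pf.fc_by_age_gender.long is ours):

# Convert the wide data frame back to long format.
pf.fc_by_age_gender.long <- melt(pf.fc_by_age_gender.wide,
                                 id.vars = "age",
                                 variable.name = "gender",
                                 value.name = "median_friend_count")
head(pf.fc_by_age_gender.long)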

Ratio Plot

# Plot the ratio of the female to male median
# friend counts using the data frame
# pf.fc_by_age_gender.wide.

# Think about what geom you should use.
# Add a horizontal line to the plot with
# a y intercept of 1, which will be the
# base line. Look up the documentation
# for geom_hline to do that. Use the parameter
# linetype in geom_hline to make the
# line dashed.

# The linetype parameter can take the values 0-6:
# 0 = blank, 1 = solid, 2 = dashed
# 3 = dotted, 4 = dotdash, 5 = longdash
# 6 = twodash

# ENTER YOUR CODE BELOW THIS LINE
# =================================================

ggplot(aes(age, female/male), data=pf.fc_by_age_gender.wide) +
  geom_line() +
  geom_hline(yintercept=1, linetype=2, alpha=0.3)

From this plot, we can see that for very young users, the median female user has over two and a half times as many friends as the median male user. Clearly, it was helpful to condition on age when examining the relationship between gender and friend count: doing so assured us that this pattern is robust across users of many different ages, and it highlighted where the difference is most striking.

There are many processes that could produce this difference, including the biased distribution from which the pseudo-Facebook data was generated. One idea that shows the complexity of interpretation here: people from particular countries who joined Facebook more recently may be more likely to be male and to have lower friend counts.


Third Quantitative Variable

Notes: In the previous example, we looked at age and friend count across the categorical variable gender. Color and shape are the usual aesthetics for representing a categorical variable like this.

But what if we looked at age and friend count over, say, another numerical variable? For example, since users are likely to accumulate friends over their time on Facebook, tenure should be important for predicting friend count. Tenure, the number of days since registering with Facebook, is also associated with age: the first people to start using Facebook were college students in 2004 and 2005, and as a result this group is more likely to have amassed a larger number of friends.

One way to explore all four variables (friend count, age, gender, and tenure) is to use a two-dimensional display like a scatterplot, bin one of the quantitative variables, and compare those bins. In this case, we can group users by the year that they joined. So let’s create a new variable called year_joined in our data frame that contains the year each user first joined Facebook. To create this variable, we need to use the variable tenure and 2014 as the reference year.

# Create a variable called year_joined
# in the pf data frame using the variable
# tenure and 2014 as the reference year.

# The variable year joined should contain the year
# that a user joined facebook.

# ENTER YOUR CODE BELOW THIS LINE.
# ========================================================
pf$year_joined <- floor(2014 - (pf$tenure/365))

Cut a Variable

Notes: Now let’s explore the new variable. It looks like most of our users joined in 2012 or 2013, and since the values of this variable are discrete and the range is fairly narrow, let’s table it as well.

summary(pf$year_joined)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    2005    2012    2012    2012    2013    2014       2
table(pf$year_joined)
## 
##  2005  2006  2007  2008  2009  2010  2011  2012  2013  2014 
##     9    15   581  1507  4557  5448  9860 33366 43588    70

This gives us the distribution of the year joined. Notice that there isn’t much data here about early joiners. To increase the amount of data we have in each tenure category, we can group some of these years together. The cut function is quite useful for making discrete variables from continuous or numerical ones, sometimes in combination with the function quantile.
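As a quick illustration of that combination (our addition, not part of the quiz that follows), we could split tenure at its quartiles:

# Cut tenure at its quartiles to create four roughly equal-sized groups.
tenure.breaks <- quantile(pf$tenure, probs = seq(0, 1, 0.25), na.rm = TRUE)
table(cut(pf$tenure, breaks = tenure.breaks, include.lowest = TRUE))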

Converting Continuous Data into Categorical Data

The next task is to cut the variable year_joined to create four bins, or buckets, of users: (2004, 2009], (2009, 2011], (2011, 2012], and (2012, 2014]. For more info on cut, go to http://www.r-bloggers.com/r-function-of-the-day-cut-2/

# Create a new variable in the data frame
# called year_joined.bucket by using
# the cut function on the variable year_joined.

# You need to create the following buckets for the
# new variable, year_joined.bucket
#        (2004, 2009]
#        (2009, 2011]
#        (2011, 2012]
#        (2012, 2014]
# Note that a parenthesis means exclude the year and a
# bracket means include the year.

# ENTER YOUR CODE BELOW THIS LINE
# ========================================================================
pf$year_joined.bucket <- cut(pf$year_joined, breaks=c(2004,2009,2011,2012,2014))

Plotting it All Together

Notes: We have done two things up to this point. First, we created a variable called year_joined based on the tenure variable. Second, we converted year_joined to the variable year_joined.bucket, a categorical variable that bins our users into different groups.

Let’s table this new variable to see the distribution in each group; two users have values of NA. Then let’s use the new variable to create a line graph like we did for gender at the start of the lesson.

table(pf$year_joined.bucket, useNA="ifany")
## 
## (2004,2009] (2009,2011] (2011,2012] (2012,2014]        <NA> 
##        6669       15308       33366       43658           2
# Create a line graph of friend_count vs. age
# so that each year_joined.bucket is a line
# tracking the median user friend_count across
# age. This means you should have four different
# lines on your plot.

# You should subset the data to exclude the users
# whose year_joined.bucket is NA.

# ENTER YOUR CODE BELOW THIS LINE
# ===================================================
ggplot(aes(age, friend_count), 
       data = subset(pf, !is.na(year_joined.bucket))) +
  geom_line(aes(color=year_joined.bucket), stat="summary", fun.y=median) +
  labs(color="Year Joined")


Plot the Grand Mean

Notes: Looking at this plot, we can confirm our suspicion: those with longer tenure tend to have higher friend counts, with the exception of our older users, say about 80 and up. To put these cohort-specific medians into perspective, we can change them to cohort-specific means and plot the grand mean as well. The grand mean line is the one we saw in the previous lesson.

# Write code to do the following:

# (1) Add another geom_line to code below
# to plot the grand mean of the friend count vs age.
# (2) Exclude any users whose year_joined.bucket is NA.
# (3) Use a different line type for the grand mean.

# As a reminder, the parameter linetype can take the values 0-6:
# 0 = blank, 1 = solid, 2 = dashed
# 3 = dotted, 4 = dotdash, 5 = longdash
# 6 = twodash

# ENTER YOUR CODE BELOW THIS LINE
# ==================================================================
ggplot(aes(age, friend_count), 
       data = subset(pf, !is.na(year_joined.bucket))) +
  geom_line(aes(color=year_joined.bucket), stat="summary", fun.y=mean) +
  labs(color="Year Joined") +
  geom_line(stat="summary", fun.y=mean, linetype=5)

Plotting the grand mean (dashed black line) is a good reminder that much of the data in the sample is about members of recent cohorts, since the mean line is located more towards the bottom of the graph.


Friending Rate

Notes: Since the general pattern continues to hold after conditioning on each of the buckets of year_joined, we might increase our confidence that this observation isn’t just an artifact of the time users have had to accumulate friends.

Let’s look at this relationship in another way: we can express tenure and friend count as a rate instead. For example, we can see how many friends a user has for each day since they started using the service. Try to create a summary of this rate, subsetting the data so you only consider users with at least one day of tenure. Once you have that summary, answer the following two questions.

with(subset(pf, tenure > 0), summary(friend_count/tenure))
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##   0.0000   0.0775   0.2205   0.6096   0.5658 417.0000

What is the median friend rate? 0.2205
What is the maximum friend rate? 417

The maximum friend rate is definitely an outlier since our third quartile only goes up to 0.5658.
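One quick way to see just how extreme that maximum is (our addition, not part of the quiz) is to inspect the upper tail of the rate directly:

# Upper quantiles of the friending rate for users with at least one day of tenure.
with(subset(pf, tenure > 0),
     quantile(friend_count / tenure, probs = c(0.9, 0.99, 0.999, 1), na.rm = TRUE))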


Friendships Initiated

Notes: We have learned more about the relationship between age and friend count by making use of gender and tenure. Users who have been on the site longer typically have higher friend counts across ages.

What about friend requests? Do new users go on “friending sprees,” or do users with more tenure initiate more friendships? Let’s use plots to answer these questions. Create a line graph of friendships initiated per day vs. tenure, using the variables tenure, friendships_initiated, and year_joined.bucket. The color of each line should correspond to a bucket of year_joined.

# Create a line graph of mean of friendships_initiated per day (of tenure)
# vs. tenure colored by year_joined.bucket.

# You need to make use of the variables tenure,
# friendships_initiated, and year_joined.bucket.

# ENTER YOUR CODE BELOW THIS LINE
# ========================================================================
ggplot(aes(tenure, friendships_initiated/tenure), 
       data = subset(pf, tenure>0)) +
  geom_line(aes(color=year_joined.bucket), stat="summary", fun.y=mean) +
  labs(color="Year Joined") 

Taking a closer look at the plot, it appears that users with more tenure typically initiate fewer friendships.


Bias-Variance Tradeoff Revisited

Notes: Our above graph has a lot of noise since we are plotting the mean of y for every possible tenure x value. Recall from lesson 2 that we can adjust this noise by binning our x-axis differently.

We get slightly less noise in the second plot, which bins tenure into 7-day intervals. The fourth graph, which uses 90-day bins, has higher bias but much less variance. Notice that as the bin size increases, we see less noise in the plot: our estimates of the mean become smoother since more data points contribute to each binned value of tenure.

For more information on the Bias-Variance Tradeoff, go to http://scott.fortmann-roe.com/docs/BiasVariance.html.

p1 <- ggplot(aes(x = tenure, y = friendships_initiated / tenure),
       data = subset(pf, tenure >= 1)) +
  geom_line(aes(color = year_joined.bucket), stat = 'summary', fun.y = mean)

p2 <- ggplot(aes(x = 7 * round(tenure / 7), y = friendships_initiated / tenure),
       data = subset(pf, tenure > 0)) +
  geom_line(aes(color = year_joined.bucket), stat = "summary", fun.y = mean)

p3 <- ggplot(aes(x = 30 * round(tenure / 30), y = friendships_initiated / tenure),
       data = subset(pf, tenure > 0)) +
  geom_line(aes(color = year_joined.bucket), stat = "summary", fun.y = mean)

p4 <- ggplot(aes(x = 90 * round(tenure / 90), y = friendships_initiated / tenure),
       data = subset(pf, tenure > 0)) +
  geom_line(aes(color = year_joined.bucket), stat = "summary", fun.y = mean)

suppressMessages(library(gridExtra))
grid.arrange(p1, p2, p3, p4, ncol=2)

Using Smoothers

In lesson 2, we were introduced to smoothers as one tool for an analyst to use in these types of situations. Instead of using geom_line, try out geom_smooth to add a smoother to the plot above (the version with 7-day bins).

# Instead of geom_line(), use geom_smooth() to add a smoother to the plot.
# You can use the defaults for geom_smooth() but do color the line
# by year_joined.bucket

# ALTER THE CODE BELOW THIS LINE
# ==============================================================================

ggplot(aes(x = 7 * round(tenure / 7), y = friendships_initiated / tenure),
       data = subset(pf, tenure > 0)) +
  geom_smooth(aes(color = year_joined.bucket))

In this smoothed-out version of the graph, we still see that the rate of friendships initiated declines as tenure increases.


Sean’s NFL Fan Sentiment Study

Notes: Pay attention to the models and the trade-offs in the visualizations Sean made in his work at Facebook.

His work revolves around measuring fan sentiment toward NFL teams over the course of the season. He wanted to come up with a way to visualize this and tell a story about the experience of a fan. He computed the ratio of positive to negative words in 5-minute increments over the whole four months of the NFL season. Because we are dealing with ratios, some values end up extremely high.

Looking at the raw data, it is clear that some modeling or statistics will be required to tease out what’s actually happening. His next step was to compute a moving average over these 5-minute increments, which pulled more signal out of the data. There was still a lot of noise, so he aggregated the data into one-day intervals, where patterns began to emerge.

A key advantage of this data was that he knew what he was looking for ahead of time, since fans typically experience highs and lows throughout a season. He wanted a model that predicts sentiment as a function of time, and a natural spline came to mind. This actually told a nice story: with vertical lines colored at the dates of wins and losses, you get an idea of why the fitted curve slopes upward and downward. It has a nice look, but it lacks a feature we’d expect, namely that on a game day things change abruptly. Basically, a more flexible model is needed to display these discrete jumps.
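Sean’s data isn’t available to us, but as a rough sketch of the natural spline idea mentioned above, a fit in R might look like the following (the data frame nfl_sentiment and its columns are hypothetical):

suppressMessages(library(splines))

# Hypothetical data frame with a numeric 'time' index and a 'sentiment' score
# per interval; these names are ours, not Sean's.
spline.fit <- lm(sentiment ~ ns(time, df = 10), data = nfl_sentiment)

# Overlay the fitted spline on the raw sentiment series.
nfl_sentiment$fitted <- predict(spline.fit)
ggplot(aes(x = time, y = sentiment), data = nfl_sentiment) +
  geom_point(alpha = 0.2) +
  geom_line(aes(y = fitted), color = "red")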

Because NFL games occur once every week, he decided to look at moving averages in week intervals, which tells a better story.

For more information on Sean’s research, refer to: “The Emotional Highs and Lows of the NFL Season” https://www.facebook.com/notes/facebook-data-science/the-emotional-highs-and-lows-of-the-nfl-season/10152033221418859


Introducing the Yogurt Data Set

Notes: We will now look at the yogurt data set. Because of online purchases and credit cards, lots of retail purchase data is associated with individuals or households, giving a history of purchases over time. Analysts in industry often work with this kind of panel scanner data, and economists and other behavioral scientists use it to test and develop theories about consumer behavior.

The yogurt data set covers five flavors of Dannon yogurt in the 8-ounce size, with the price recorded at each purchase occasion. This data set has a different structure than the pseudo-Facebook data set: “pf” has one row per individual, giving that user’s characteristics and counts of behaviors over a single period of time, whereas the yogurt data has many rows per household, one per purchase. These different structures lend themselves to different types of questions.

For more information on the yogurt data set, refer to: Kim, Jaehwan, Greg M. Allenby, and Peter E. Rossi. “Modeling Consumer Demand for Variety.” Marketing Science 21.3 (2002): 229-250.


Histograms Revisited

yo <- read.csv("yogurt.csv", header=T)
str(yo)
## 'data.frame':    2380 obs. of  9 variables:
##  $ obs        : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ id         : int  2100081 2100081 2100081 2100081 2100081 2100081 2100081 2100081 2100081 2100081 ...
##  $ time       : int  9678 9697 9825 9999 10015 10029 10036 10042 10083 10091 ...
##  $ strawberry : int  0 0 0 0 1 1 0 0 0 0 ...
##  $ blueberry  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ pina.colada: int  0 0 0 0 1 2 0 0 0 0 ...
##  $ plain      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ mixed.berry: int  1 1 1 1 1 1 1 1 1 1 ...
##  $ price      : num  59 59 65 65 49 ...
#Change id to a factor
yo$id <- factor(yo$id)
str(yo)
## 'data.frame':    2380 obs. of  9 variables:
##  $ obs        : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ id         : Factor w/ 332 levels "2100081","2100370",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ time       : int  9678 9697 9825 9999 10015 10029 10036 10042 10083 10091 ...
##  $ strawberry : int  0 0 0 0 1 1 0 0 0 0 ...
##  $ blueberry  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ pina.colada: int  0 0 0 0 1 2 0 0 0 0 ...
##  $ plain      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ mixed.berry: int  1 1 1 1 1 1 1 1 1 1 ...
##  $ price      : num  59 59 65 65 49 ...
#Histogram of yogurt prices (different binwidths)
p1 <- ggplot(aes(price), data=yo) + geom_histogram(fill=I("#099DD9"))
p2 <- ggplot(aes(price), data=yo) + geom_histogram(binwidth=10, fill=I("#099DD9"))
grid.arrange(p1, p2, ncol=2)

One thing that stands out from the plot is the discreteness in our data. There are prices with many observations but no observations at adjacent prices. This makes sense if prices are set in a way that applies to many of the consumers. If we are interested in price sensitivity, then we should consider what sort of variation there is in these prices.

Note that using a wider binwidth would have masked this discreteness: the second graph, with binwidth 10, is a very biased model of this data set.


Number of Purchases

Notes: Another way to notice the discreteness is to look at a five-number summary of the data. A clue to the discreteness is that the 75th percentile is the same as the maximum.

We can also see the discreteness by counting how many distinct prices there are in the data set: there are 20 different price levels. Explore this further with a table.

summary(yo)
##       obs               id            time         strawberry     
##  Min.   :   1.0   2132290:  74   Min.   : 9662   Min.   : 0.0000  
##  1st Qu.: 696.5   2130583:  59   1st Qu.: 9843   1st Qu.: 0.0000  
##  Median :1369.5   2124073:  50   Median :10045   Median : 0.0000  
##  Mean   :1367.8   2149500:  50   Mean   :10050   Mean   : 0.6492  
##  3rd Qu.:2044.2   2101790:  47   3rd Qu.:10255   3rd Qu.: 1.0000  
##  Max.   :2743.0   2129528:  39   Max.   :10459   Max.   :11.0000  
##                   (Other):2061                                    
##    blueberry        pina.colada          plain         mixed.berry    
##  Min.   : 0.0000   Min.   : 0.0000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.: 0.0000   1st Qu.: 0.0000   1st Qu.:0.0000   1st Qu.:0.0000  
##  Median : 0.0000   Median : 0.0000   Median :0.0000   Median :0.0000  
##  Mean   : 0.3571   Mean   : 0.3584   Mean   :0.2176   Mean   :0.3887  
##  3rd Qu.: 0.0000   3rd Qu.: 0.0000   3rd Qu.:0.0000   3rd Qu.:0.0000  
##  Max.   :12.0000   Max.   :10.0000   Max.   :6.0000   Max.   :8.0000  
##                                                                       
##      price      
##  Min.   :20.00  
##  1st Qu.:50.00  
##  Median :65.04  
##  Mean   :59.25  
##  3rd Qu.:68.96  
##  Max.   :68.96  
## 
length(unique(yo$price))
## [1] 20
table(yo$price)
## 
##    20 24.96 33.04  33.2 33.28 33.36 33.52 39.04    44 45.04 48.96 49.52 
##     2    11    54     1     1    22     1   234    21    11    81     1 
##  49.6    50 55.04 58.96    62 63.04 65.04 68.96 
##     1   205     6   303    15     2   799   609

So on a given purchase occasion, how many 8-ounce yogurts does a household purchase? To answer this, we need to combine the counts of the different yogurt flavors into one variable. For example, in observation #5 the household bought three different types of yogurt.

# Create a new variable called all.purchases,
# which gives the total counts of yogurt for
# each observation or household.

# ENTER YOUR CODE BELOW THIS LINE
# ========================================================
#yo$all.purchases <- yo$strawberry + yo$blueberry + yo$pina.colada + yo$plain + yo$mixed.berry

#Alternative method
yo <- transform(yo, all.purchases = strawberry + blueberry + pina.colada + plain + mixed.berry)

Prices over Time

Notes: Let’s create a histogram with this new column (i.e., all.purchases) in our data frame. The histogram shows that most households buy one or two yogurts at a time.

ggplot(aes(all.purchases), data=yo) + 
  geom_histogram(binwidth=1, fill=I("#099DD9"))


To dive deeper into yogurt prices and household behavior, let’s investigate prices over time in more detail. This is where our Facebook data was deficient: in this data set, we can examine changes in price because we observe the same households over time.

# Create a scatterplot of price vs time.
# This will be an example of a time series plot.

# Resolve overplotting issues by using
# techniques you learned in Lesson 4.

# What are some things that you notice?

# ENTER YOUR CODE BELOW THIS LINE
# ================================================
ggplot(aes(time, price), data=yo) +
  geom_jitter(alpha=.25, shape=21, fill=I("#F79420")) +
  ggtitle("Time Series Plot: Price vs. Time") 

Looking at the time series plot, we can see that the modal (most common) prices seem to increase over time. We also see some lower price points scattered about the graph; these may be due to sales, or perhaps buyers using coupons that bring down the price of yogurt.
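To back up that impression with a number (our addition), we can check the median price within coarse bins of the time variable:

# Median price within four equal-width bins of time.
with(yo, tapply(price, cut(time, 4), median))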


Sampling Observations

Notes: When familiarizing yourself with a new data set that contains multiple observations of the same units, it’s often useful to work with a sample of those units so that it’s easy to display the raw data for that sample. In the case of the yogurt data set, we might want to look at a small sample of households in more detail so that we know what kind of within- and between-household variation we are working with.

This analysis of a sub-sample might come before trying to use within-household variation as part of a model. For example, this data set was originally used to model consumer preferences for variety. But before doing that, we’d want to look at how often we observe households buying yogurt, how often they buy multiple items, and what prices they’re buying yogurt at. One way to do this is to look at a sub-sample in more detail. Let’s pick 16 households at random and take a closer look.


Looking at Samples of Households

Here, we first sample 16 of the unique household id numbers. Because an id number is repeated each time the household makes a purchase, we sample from the levels of the id factor.

Next, we plot each of the purchase occasions for each of the households that we sampled. Notice that the size=all.purchases aesthetic adds detail by drawing a larger circle for purchase occasions with a larger number of items.

For more information on graphical parameters, refer to: http://www.statmethods.net/advgraphs/parameters.html.

set.seed(4230)
sample.ids <- sample(levels(yo$id), 16)
ggplot(aes(time, price), data=subset(yo, id %in% sample.ids)) +
  facet_wrap(~id) +
  geom_line() +
  geom_point(aes(size=all.purchases), pch=1)

From the plot, notice the frequency and the size of the purchases. It would be useful to pair coupon data with this graph to see whether coupons account for the dips in price, or whether they were just special price reductions by the store.

The next task is to create a similar plot and provide insights for another set of 16 households. Post any findings in the discussion forum:

set.seed(222)
sample.ids <- sample(levels(yo$id), 16)
ggplot(aes(time, price), data=subset(yo, id %in% sample.ids)) +
  facet_wrap(~id) +
  geom_line() +
  geom_point(aes(size=all.purchases), pch=1)

I used set.seed(222). The top-right plot (household #2114025) shows a household that purchased yogurt at very frequent intervals. The price of yogurt for this household was generally fairly high, with a couple of dips that could be attributed to coupon usage or special sales deals. Households #2120436, 2139774, 2141341, and 2158196 purchased yogurt less frequently. Notice how household #2120378 only bought yogurt at a low price, meaning this household probably only bought yogurt when it was on sale or when they had a coupon. However, we cannot confirm this without more data.


The Limits of Cross Sectional Data

Notes: The general idea here is that if we have observations over time, we can facet by the primary unit, case, or individual in the data set. For the yogurt data, it was the households we were faceting over. This faceted time series plot is something we can’t generate with the pseudo-Facebook dataset since we don’t have data on our sample of users over time.

The Facebook data isn’t great for examining the process of friending over time. The dataset is just a cross-section: one snapshot at a fixed point that tells us the characteristics of individuals. We don’t have data for individuals over, say, a year. But if we had data like the yogurt data, we would be able to track friendships initiated over time and compare that with tenure. This would give us better evidence to explain the drop in friendships initiated as tenure increases.


Many Variables

Notes: Much of the analysis we have done so far focused on some pre-chosen variable, relationship, or question of interest. We then used EDA to let those chosen variables speak and surprise us. Most recently, when analyzing the relationship between two variables, we looked to incorporate more variables into the analysis to improve it, for example by checking whether a particular relationship is consistent across values of those other variables. In choosing a third or fourth variable to plot, we relied on our domain knowledge. But often, we might want visualizations or summaries to help us identify such auxiliary variables.

In some analyses, we may plan to make use of a larger number of variables. Perhaps we are planning on predicting one variable with 10, 20, or 100 other variables. Or maybe we want to summarize a large set of variables into a smaller set of dimensions. Or perhaps we are looking for interesting relationships among a large set of variables? In such cases, we can help speed up our exploratory data analysis by producing many plots or comparisons at once. This is a way to let the data speak to us and look at variables that we didn’t originally have a particular interest in.


Scatterplot Matrix

Notes: A scatterplot matrix is a grid of scatterplots for every pair of variables in a data set. Scatterplots, however, do not work well for categorical variables; box plots are more appropriate in that case.

Here, we use the GGally package for the scatterplot matrix function ggpairs.

#install.packages("GGally")
suppressMessages(library("GGally"))
theme_set(theme_minimal(20))

set.seed(1836)
pf_subset <- pf[,c(2:15)]
names(pf_subset)
##  [1] "age"                   "dob_day"              
##  [3] "dob_year"              "dob_month"            
##  [5] "gender"                "tenure"               
##  [7] "friend_count"          "friendships_initiated"
##  [9] "likes"                 "likes_received"       
## [11] "mobile_likes"          "mobile_likes_received"
## [13] "www_likes"             "www_likes_received"
#Warning: Takes too long to run!!!!
#ggpairs(pf_subset[sample.int(nrow(pf_subset), 1000), ])

#You may also find that variable labels are on the outer edges of the 
#scatterplot matrix, rather than on the diagonal. If you want labels 
#in the diagonal, run the following code:
ggpairs(pf_subset[sample.int(nrow(pf_subset), 1000), c(1,5,7,8,11)], axisLabels = 'internal')

with(pf[sample.int(nrow(pf), 1000),], 
     cor.test(friendships_initiated, friend_count))
## 
##  Pearson's product-moment correlation
## 
## data:  friendships_initiated and friend_count
## t = 38.121, df = 998, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.7434663 0.7940618
## sample estimates:
##       cor 
## 0.7699715
with(pf[sample.int(nrow(pf), 1000),], 
     cor.test(age, mobile_likes))
## 
##  Pearson's product-moment correlation
## 
## data:  age and mobile_likes
## t = 0.80488, df = 998, p-value = 0.4211
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.03658115  0.08732489
## sample estimates:
##        cor 
## 0.02546969

Note: the psych package contains a scatterplot matrix function, pairs.panels, that is much faster!

#alternative Scatterplot matrix that is much faster!!
#install.packages('psych')
suppressMessages(library(psych))
set.seed(1836)
pairs.panels(sample_n(pf[c(2:15)],1000),pch=".")


Even More Variables

Notes: Examples of large data sets arise in many areas, but one that has attracted the attention of statisticians is genomic data. In these data sets, there are often thousands of genetic measurements for each of a small number of samples.

In some cases, some of these samples have a disease, so we’d like to identify genes associated with the disease. Our data set records gene expression in tumors; it contains the expression of 6,830 genes across 64 samples, measured relative to a baseline reference sample. Let’s change the column names of the data set to the numbers 1 to 64 to make the x-axis look nicer.


Heat Maps

Notes: The last plot that we’ll make for this course is called a heat map. Notice that to create the data frame required for the heat map (long-form data), we can use the gather function from the tidyr package. Refer to this pdf for more information on the package: https://dl.dropboxusercontent.com/u/5896466/wrangling-webinar.pdf.

Converting the Data into Long Form

Method #1: tidyr and gather function
nci <- read.table("nci.tsv")
colnames(nci) <- c(1:64)

library(tidyr)
nci$gene <- 1:nrow(nci)
nci <- nci[,c("gene", as.character(1:64))] #rearrange and place new column first

#Gather all the data for the first 200 genes.
nci.long.samp <- gather(nci[1:200,], "case", "value", 2:ncol(nci[1:200,]))
head(nci.long.samp)
##   gene case  value
## 1    1    1  0.300
## 2    2    1  1.180
## 3    3    1  0.550
## 4    4    1  1.140
## 5    5    1 -0.265
## 6    6    1 -0.070
Method #2: reshape2 and melt function

Similarly, we can create the desired data frame with the melt function from the reshape2 package. Both methods give us the same data frame.

nci <- read.table("nci.tsv")
colnames(nci) <- c(1:64)

library(reshape2)
nci.long.samp <- melt(as.matrix(nci[1:200,]))
names(nci.long.samp) <- c("gene", "case", "value")
head(nci.long.samp)
##   gene case  value
## 1    1    1  0.300
## 2    2    1  1.180
## 3    3    1  0.550
## 4    4    1  1.140
## 5    5    1 -0.265
## 6    6    1 -0.070

Looking at a Subset of the Large Data

We only look at the first 200 genes because the data set is very large. For each combination of gene and sample case, we want to display the difference in gene expression between the sample and the baseline: combinations where a gene is over-expressed in red, and combinations where it is under-expressed in blue. Even with only 200 genes, this makes for a very dense display!

#Heat Map
ggplot(aes(y = gene, x = case, fill = value),
  data = nci.long.samp) +
  geom_tile() +
  scale_fill_gradientn(colours = colorRampPalette(c("blue", "red"))(100))


Analyzing Three or More Variables

Reflection: What did you learn in this lesson?

In this lesson, I learned how to examine data with three or more variables. Say, for example, that we are looking at a scatterplot or line graph comparing two continuous variables; if we want to include a third variable that is categorical, we can add color to the plot.

This lesson also discussed the different packages that one can use to reshape data. We can use either the tidyr or reshape2 packages to change data from long format to wide format.

If the third variable we wish to incorporate is continuous, the cut function can be quite useful for making it discrete. By binning the continuous data, we can use color (or other plot aesthetics) to include this third variable in our plot.

When working with the yogurt data set, we learned that it can be useful to sample your data to get a better understanding of the within- and between-household variation.

One key thing to note is the limits of cross-sectional data. With our yogurt data set, we were able to facet our graphs by the primary unit because we had time series data. The pseudo-Facebook dataset contained only one snapshot of data at a fixed point in time for every primary observation unit.

If we are dealing with 4 or more variables, it might be best to rely on a scatterplot matrix to look at how various pairs of variables relate with one another. For this, we can use the GGally or the psych package.

Video Recap (Conclusion of Lesson): In this lesson, we started with simple extensions to the scatterplot and plots of conditional summaries that were covered in lesson 2, such as adding summaries for multiple groups. Then we covered other techniques for looking at a large number of variables at once, such as the Scatterplot Matrix or the Heat Map.

This lesson also covered how to reshape data: moving from raw data with one row per case to aggregated data with one row per combination of variables, and moving back and forth between long and wide formats.

In the next lesson, we will hear from Solomon. He is going to do an in-depth analysis of the diamond data set as an example of how an expert does it. Lesson 6 also covers a larger part of the data analysis process, highlighting the role of EDA. Solomon did a lot of reading about the diamond industry, wrote code to scrape and format a new data set, and used the results of his exploratory data analysis to guide, interpret, and evaluate a statistical model of diamond prices.