Lab 13: Reshaping data

Learning objectives

  • using the spread() function to transform a data from long to wide
  • using the gather() function to transform a data from wide to long
  • advanced reshaping (combining gather(), unite() and spread())

1. From long to wide: toy example

In this lab we will learn how to reshape data. This means rearranging rows into columns and vise versa. Suppose we have the following data on sales. Each sale is recorded in a separate row. For each sale we know the date, product sold and the dollar value of the sale.

library(tidyr)
long_df <- data.frame(date=c(1,1,2,3), product=c("a","b","a","b"), dollars=c(10,2,5,7))
long_df
##   date product dollars
## 1    1       a      10
## 2    1       b       2
## 3    2       a       5
## 4    3       b       7

Suppose that we would like the data to be arranged so that each row represents a date and columns describe what was sold on that date. In other words, we want to spread rows into columns. We can accomplish this using function spread() in tidyr package (a companion to dplyr). The function takes three key arguments. First, it takes the original data. Second, it take the key which is the column in the original data that we want spread. The values of this variable will appear as column headings in the new data frame. The final argument is value which is a column in the original data whose values will be in the cells inside the data frame.

wide_df <- spread(data=long_df, key=product, value=dollars )
wide_df
##   date  a  b
## 1    1 10  2
## 2    2  5 NA
## 3    3 NA  7

We transformed our long data frame into a shorter but wider one. It has the same information, only arranged differently.

2. From wide to long: toy example

Sometimes we need to go from wide data to long data. For example if GDP data is organized as follows.

wide_df <- data.frame(year=c(2014,2015), q1=c(10,16), q2=c(12,17), q3=c(13,18), q4=c(15,19))
wide_df
##   year q1 q2 q3 q4
## 1 2014 10 12 13 15
## 2 2015 16 17 18 19

If we wanted to plot the time series we would first need to rearrange the the data so that consecutive observations on GPD are in separate rows. We can accomplish this using the gather() function. It takes several arguments. First, the data that is to be rearranged. Second, value which is the name of the new variable that will contain values. Third, key which is a name of the newly created variable that will contain the names of gathered variables. And finally, the variables to be gathered.

long_df <- gather(data=wide_df, value = gdp, key=quarter, q1,q2,q3,q4)
long_df
##   year quarter gdp
## 1 2014      q1  10
## 2 2015      q1  16
## 3 2014      q2  12
## 4 2015      q2  17
## 5 2014      q3  13
## 6 2015      q3  18
## 7 2014      q4  15
## 8 2015      q4  19

IN-CLASS EXERCISE 1: Turn the GDP data back from long to wide.

IN-CLASS EXERCISE 2: The data frame below has information on stock prices for three companies. Put the stock prices for all companies in one column along with a day column and a variable that indicates which company the stock price belongs to.

wide <- data.frame(day=c(1,2), aapl=c(100,110),fb=c(40,30),goog=c(500,600))
wide
##   day aapl fb goog
## 1   1  100 40  500
## 2   2  110 30  600

3. From long to wide: real example

Suppose we wanted to find out if pancake mix tends to be purchased with syrups, and pasta with pasta sauce. We would like to know which products tend to be in the shopping basket together. Right now our data is organized by transaction. It may be easier for us to work with the data if it is organized by basket instead. Let’s load in the transaction data and merge in product category and brand information.

library(dplyr)
library(ggplot2)
#trans <- read.csv("https://www.dropbox.com/s/30gqnhk3tqsawfa/dh_transactions.csv?raw=1")
trans <- read.csv("C:/Users/dvorakt/Google Drive/business analytics/labs/lab5 summarizing by group/dh_transactions.csv")
trans <- filter(trans, dollar_sales>0)
#prods <- read.csv("https://www.dropbox.com/s/ja9am94rmwtte0n/dh_product_lookup.csv?raw=1")
prods <- read.csv("C:/Users/dvorakt/Google Drive/business analytics/labs/lab5 summarizing by group/dh_product_lookup.csv")
trans <- inner_join(trans, select(prods,upc, commodity, brand), by="upc")

As we know the transaction data contains about 900 different products. Since we are interested which product categories appear together rather than which specific products, let’s simplify the range of products by aggregating up to the product category. For example, if a customer bought two pasta sauces in one basket, we will add up the cost of the two sauces and treat it as one purchase.

temp <- trans %>% group_by(household, basket, commodity) %>% summarize(spent = sum(dollar_sales))
head(temp, n=4)
## Source: local data frame [4 x 4]
## Groups: household, basket [3]
## 
##   household  basket     commodity spent
##       <int>   <int>        <fctr> <dbl>
## 1         1  458810 pancake mixes  1.79
## 2         1  458810        syrups  1.50
## 3         1 1424208   pasta sauce  3.22
## 4         1 1512312 pancake mixes  1.99

We created a data frame which we called temp (for temporary). We also note that the aggregating to product category level did not reduce the number of observations dramatically. This is because buying two kinds of pasta sauces, or two kinds of syrups in one basket is relatively rare.

Now we are ready to reshape our data set. We start with temp data set. The variable we want to ‘spread’ is commodity, i.e. we want pasta, pasta sauce, pancake mix and syrups to be new variables. And, we want the total spent on these categories to be the values in these variables.

basket <- spread(data=temp, key=commodity, value =  spent, fill=0)
head(basket, n=4)
## Source: local data frame [4 x 6]
## Groups: household, basket [4]
## 
##   household  basket `pancake mixes` pasta `pasta sauce` syrups
##       <int>   <int>           <dbl> <dbl>         <dbl>  <dbl>
## 1         1  458810            1.79     0          0.00   1.50
## 2         1 1424208            0.00     0          3.22   0.00
## 3         1 1512312            1.99     0          0.00   0.00
## 4         1 1518809            0.00     0          0.00   1.19

We see that there are over three million baskets. The first basket in our data had $1.79 worth of pancake mixes, $1.50 worth of syrups and no pasta or pasta sauce. Notice that if a basket did not contain a product category, the value is equal to zero. We set that using the fill= option in the spread() function above. Otherwise, the cell would have NAs in them.

4. Working with wide data

Now that we have the data at the basket level, we can see whether different product categories appear together. Since at this point we are not interested in the value of sales, but rather whether or not a product category is in the basket, let’s create indicator variables for each product category.

basket$pancakey <- ifelse(basket$`pancake mixes` > 0,"yes","no")
basket$pastay <- ifelse(basket$`pasta` > 0,"yes","no")
basket$saucey <- ifelse(basket$`pasta sauce` > 0,"yes","no")
basket$syrupy <- ifelse(basket$syrups > 0,"yes","no")
head(basket, n=4)
## Source: local data frame [4 x 10]
## Groups: household, basket [4]
## 
##   household  basket `pancake mixes` pasta `pasta sauce` syrups pancakey
##       <int>   <int>           <dbl> <dbl>         <dbl>  <dbl>    <chr>
## 1         1  458810            1.79     0          0.00   1.50      yes
## 2         1 1424208            0.00     0          3.22   0.00       no
## 3         1 1512312            1.99     0          0.00   0.00      yes
## 4         1 1518809            0.00     0          0.00   1.19       no
## # ... with 3 more variables: pastay <chr>, saucey <chr>, syrupy <chr>

Let’s use the crosstab() function from the descr package to see the association of different product categories.

library(descr)
crosstab(basket$saucey, basket$pastay, prop.t = TRUE, plot = FALSE)
##    Cell Contents 
## |-------------------------|
## |                   Count | 
## |           Total Percent | 
## |-------------------------|
## 
## ============================================
##                  basket$pastay
## basket$saucey         no       yes     Total
## --------------------------------------------
## no               631203   1139818   1771021 
##                    19.0%     34.4%          
## --------------------------------------------
## yes              725120    818343   1543463 
##                    21.9%     24.7%          
## --------------------------------------------
## Total           1356323   1958161   3314484 
## ============================================

We see that pasta was in 1,958,161 baskets, and of those baskets 818,343 also had sauce in it. So, given that a basket has pasta, the chances that it also has sauce is about 42%.

IN-CLASS EXERCISE 3: Is Buddy the Elf out there? Given that a basket has pasta, what are the chances that it also has syrup?

EXTENDED EXERCISE: Is there a way to visualize associations of different products? Different brands?

5. From long to wide: another example

Sometimes reshaping data is necessary to create a good visualization. For example, in our transaction data we know that some customers used coupons. Suppose we want to compare average prices when coupon was used to average prices when coupon was not used. Let’s summarize the transactions data by upc code and whether or not a coupon was used. While collapsing/summarizing the data we will calculate the average price for each upc code and whether or not a coupon was used.

prods <- trans %>% group_by(upc, coupon) %>% summarize(av_price=mean(dollar_sales/units))

It is possible that not all products had a coupon. If a product had a coupon it would be in our data set twice, if it did not have a coupon it would be there only once. So we can use mutate() to calculate how many times each upc code is in the data and filter out those that appear only once.

prods <- prods %>% group_by(upc) %>% mutate(count=n()) %>% filter(count==2) %>% select(-count)
head(prods, n=4)
## Source: local data frame [4 x 3]
## Groups: upc [2]
## 
##          upc coupon  av_price
##        <dbl>  <int>     <dbl>
## 1 1160000004      0 2.7912861
## 2 1160000004      1 2.0548603
## 3 1330052501      0 0.6901815
## 4 1330052501      1 0.3275000

Right now we have a data set where each product appears twice, once with average price when a coupon was used (coupon=1) and once an average price when a coupon was not used (coupon=0). Suppose that we want to plot average prices with coupon against average prices with no coupon. To do this we would need to reshape the data so that each product appears in the data only once, but have two new columns - one that shows no coupon price and one with coupon price.

prods <-  spread(data=prods, key=coupon, value=av_price)
prods <- rename(prods, coupon_price = `1`, nocoupon_price = `0`)
head(prods, n=4)
## Source: local data frame [4 x 3]
## Groups: upc [4]
## 
##          upc nocoupon_price coupon_price
##        <dbl>          <dbl>        <dbl>
## 1 1160000004      2.7912861     2.054860
## 2 1330052501      0.6901815     0.327500
## 3 1480071124      2.2694960     1.551274
## 4 1510000001      0.6858615     0.235000

Now we are ready to plot the data.

ggplot(prods) + geom_point(aes(x=nocoupon_price, y=coupon_price)) + 
  geom_line(aes(y=coupon_price,x=coupon_price), color="red") #add a horizontal line

As expected, in vast majority of cases coupon prices are lower than no coupon prices. Interestingly, coupons appear to lower the price by a bit less than a dollar. This is true for products priced in the one to two dollar range as well as more expensive ones.

6. Advanced reshaping

Suppose you had the following data on population and GDP for two countries in two years. Reshape this data so that each country’s population and each country’s GDP are in separate columns, and each year is in a separate row. Give the new variables informative names.

data <- data.frame(year=c(1,1,2,2), country=c("a","b","a","b"), pop=c(1,2,3,4), gdp=c(5,6,7,8))
data
##   year country pop gdp
## 1    1       a   1   5
## 2    1       b   2   6
## 3    2       a   3   7
## 4    2       b   4   8

The problem we have is that there are two variables that have values=. In the end we want columns that indicate population in country a (say a_pop), population in country b (say b_pop), GDP in country a (say a_gdp), etc. We solve this in three stages. First, we gather the pop and gdp columns to make it long, we call the variable that has either pop or GDP stat. Second, we unite the country and stat variables into one called country_stat. Finally, we spread the country_stat variable to make the data wide again.

temp <- gather(data, key=stat, value=value, pop, gdp)
temp
##   year country stat value
## 1    1       a  pop     1
## 2    1       b  pop     2
## 3    2       a  pop     3
## 4    2       b  pop     4
## 5    1       a  gdp     5
## 6    1       b  gdp     6
## 7    2       a  gdp     7
## 8    2       b  gdp     8
temp <- unite(temp, country_stat, country, stat)
temp
##   year country_stat value
## 1    1        a_pop     1
## 2    1        b_pop     2
## 3    2        a_pop     3
## 4    2        b_pop     4
## 5    1        a_gdp     5
## 6    1        b_gdp     6
## 7    2        a_gdp     7
## 8    2        b_gdp     8
data_wide <- spread(temp, key=country_stat, value=value)
data_wide
##   year a_gdp a_pop b_gdp b_pop
## 1    1     5     1     6     2
## 2    2     7     3     8     4

Exercises:

  1. Suppose you have the following data on population for two countries for two years. Reshape this data so that each country’s population is in a separate column, and each year is on just one row. Give the new variables informative names.
data <- data.frame(year=c(1,1,2,2), country=c("a","b","a","b"), pop=c(1,2,3,4))
data
##   year country pop
## 1    1       a   1
## 2    1       b   2
## 3    2       a   3
## 4    2       b   4
  1. Run the code below. Would you say that the resulting data frame is wide or long?
#put tickers of a bunch of stocks in a vector
stocks <- c("FB","GOOG","AAPL", "UAL") #facebook, google, apple, united airlines
data <- data.frame() #create an empty data frame
#make a loop that grabs the tickers from the 'stocks' vector and gets the data from yahoo finance
for (i in 1:length(stocks)) {
  temp <- read.csv(paste("http://ichart.finance.yahoo.com/table.csv?s=",stocks[i], sep=""))
  temp$stock <- stocks[i] #create a key that is the stock ticker
  temp <- select(temp, Date, stock, Adj.Close) #keep only Date, stock and Adj.Close
  data <- bind_rows(data,temp) #append temp to data, this will gradually stack data for each stock on top of each other
}
data$Date <- as.Date(data$Date) #turn factor date into a proper date
data <- filter(data, Date>as.Date("2003-01-01")) #trim the time frame
  1. Use the above data frame to create a time plot of stock prices of all four companies in one graph. Use a log scale on the y-axis. (Hint: You can map the color aesthetic to stock.)

  2. Turn your data from long to wide with stock prices for each company as separate columns. Do you have any missing values? Why?

  3. Calculate the correlation coefficients among the four stock prices. (Hint 1: Use function cor() which takes as an argument a data frame of numerical variables. Hint 2: Use option use="pairwise.complete.obs to tell the function to ignore missing values in each pair of variables.) Which pair of stock prices has the lowest correlation coefficient? Which has the highest? Does that make sense to you?

  4. Use package corrplot to visualize the correlations. See the documentation here (Hint: You need to put the output of cor() into a matrix object, like this correlations <- cor().)