spread() function to transform a data from long to widegather() function to transform a data from wide to longgather(), unite() and spread())In this lab we will learn how to reshape data. This means rearranging rows into columns and vise versa. Suppose we have the following data on sales. Each sale is recorded in a separate row. For each sale we know the date, product sold and the dollar value of the sale.
library(tidyr)
long_df <- data.frame(date=c(1,1,2,3), product=c("a","b","a","b"), dollars=c(10,2,5,7))
long_df
## date product dollars
## 1 1 a 10
## 2 1 b 2
## 3 2 a 5
## 4 3 b 7
Suppose that we would like the data to be arranged so that each row represents a date and columns describe what was sold on that date. In other words, we want to spread rows into columns. We can accomplish this using function spread() in tidyr package (a companion to dplyr). The function takes three key arguments. First, it takes the original data. Second, it take the key which is the column in the original data that we want spread. The values of this variable will appear as column headings in the new data frame. The final argument is value which is a column in the original data whose values will be in the cells inside the data frame.
wide_df <- spread(data=long_df, key=product, value=dollars )
wide_df
## date a b
## 1 1 10 2
## 2 2 5 NA
## 3 3 NA 7
We transformed our long data frame into a shorter but wider one. It has the same information, only arranged differently.
Sometimes we need to go from wide data to long data. For example if GDP data is organized as follows.
wide_df <- data.frame(year=c(2014,2015), q1=c(10,16), q2=c(12,17), q3=c(13,18), q4=c(15,19))
wide_df
## year q1 q2 q3 q4
## 1 2014 10 12 13 15
## 2 2015 16 17 18 19
If we wanted to plot the time series we would first need to rearrange the the data so that consecutive observations on GPD are in separate rows. We can accomplish this using the gather() function. It takes several arguments. First, the data that is to be rearranged. Second, value which is the name of the new variable that will contain values. Third, key which is a name of the newly created variable that will contain the names of gathered variables. And finally, the variables to be gathered.
long_df <- gather(data=wide_df, value = gdp, key=quarter, q1,q2,q3,q4)
long_df
## year quarter gdp
## 1 2014 q1 10
## 2 2015 q1 16
## 3 2014 q2 12
## 4 2015 q2 17
## 5 2014 q3 13
## 6 2015 q3 18
## 7 2014 q4 15
## 8 2015 q4 19
IN-CLASS EXERCISE 1: Turn the GDP data back from long to wide.
IN-CLASS EXERCISE 2: The data frame below has information on stock prices for three companies. Put the stock prices for all companies in one column along with a day column and a variable that indicates which company the stock price belongs to.
wide <- data.frame(day=c(1,2), aapl=c(100,110),fb=c(40,30),goog=c(500,600))
wide
## day aapl fb goog
## 1 1 100 40 500
## 2 2 110 30 600
Suppose we wanted to find out if pancake mix tends to be purchased with syrups, and pasta with pasta sauce. We would like to know which products tend to be in the shopping basket together. Right now our data is organized by transaction. It may be easier for us to work with the data if it is organized by basket instead. Let’s load in the transaction data and merge in product category and brand information.
library(dplyr)
library(ggplot2)
#trans <- read.csv("https://www.dropbox.com/s/30gqnhk3tqsawfa/dh_transactions.csv?raw=1")
trans <- read.csv("C:/Users/dvorakt/Google Drive/business analytics/labs/lab5 summarizing by group/dh_transactions.csv")
trans <- filter(trans, dollar_sales>0)
#prods <- read.csv("https://www.dropbox.com/s/ja9am94rmwtte0n/dh_product_lookup.csv?raw=1")
prods <- read.csv("C:/Users/dvorakt/Google Drive/business analytics/labs/lab5 summarizing by group/dh_product_lookup.csv")
trans <- inner_join(trans, select(prods,upc, commodity, brand), by="upc")
As we know the transaction data contains about 900 different products. Since we are interested which product categories appear together rather than which specific products, let’s simplify the range of products by aggregating up to the product category. For example, if a customer bought two pasta sauces in one basket, we will add up the cost of the two sauces and treat it as one purchase.
temp <- trans %>% group_by(household, basket, commodity) %>% summarize(spent = sum(dollar_sales))
head(temp, n=4)
## Source: local data frame [4 x 4]
## Groups: household, basket [3]
##
## household basket commodity spent
## <int> <int> <fctr> <dbl>
## 1 1 458810 pancake mixes 1.79
## 2 1 458810 syrups 1.50
## 3 1 1424208 pasta sauce 3.22
## 4 1 1512312 pancake mixes 1.99
We created a data frame which we called temp (for temporary). We also note that the aggregating to product category level did not reduce the number of observations dramatically. This is because buying two kinds of pasta sauces, or two kinds of syrups in one basket is relatively rare.
Now we are ready to reshape our data set. We start with temp data set. The variable we want to ‘spread’ is commodity, i.e. we want pasta, pasta sauce, pancake mix and syrups to be new variables. And, we want the total spent on these categories to be the values in these variables.
basket <- spread(data=temp, key=commodity, value = spent, fill=0)
head(basket, n=4)
## Source: local data frame [4 x 6]
## Groups: household, basket [4]
##
## household basket `pancake mixes` pasta `pasta sauce` syrups
## <int> <int> <dbl> <dbl> <dbl> <dbl>
## 1 1 458810 1.79 0 0.00 1.50
## 2 1 1424208 0.00 0 3.22 0.00
## 3 1 1512312 1.99 0 0.00 0.00
## 4 1 1518809 0.00 0 0.00 1.19
We see that there are over three million baskets. The first basket in our data had $1.79 worth of pancake mixes, $1.50 worth of syrups and no pasta or pasta sauce. Notice that if a basket did not contain a product category, the value is equal to zero. We set that using the fill= option in the spread() function above. Otherwise, the cell would have NAs in them.
Now that we have the data at the basket level, we can see whether different product categories appear together. Since at this point we are not interested in the value of sales, but rather whether or not a product category is in the basket, let’s create indicator variables for each product category.
basket$pancakey <- ifelse(basket$`pancake mixes` > 0,"yes","no")
basket$pastay <- ifelse(basket$`pasta` > 0,"yes","no")
basket$saucey <- ifelse(basket$`pasta sauce` > 0,"yes","no")
basket$syrupy <- ifelse(basket$syrups > 0,"yes","no")
head(basket, n=4)
## Source: local data frame [4 x 10]
## Groups: household, basket [4]
##
## household basket `pancake mixes` pasta `pasta sauce` syrups pancakey
## <int> <int> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 1 458810 1.79 0 0.00 1.50 yes
## 2 1 1424208 0.00 0 3.22 0.00 no
## 3 1 1512312 1.99 0 0.00 0.00 yes
## 4 1 1518809 0.00 0 0.00 1.19 no
## # ... with 3 more variables: pastay <chr>, saucey <chr>, syrupy <chr>
Let’s use the crosstab() function from the descr package to see the association of different product categories.
library(descr)
crosstab(basket$saucey, basket$pastay, prop.t = TRUE, plot = FALSE)
## Cell Contents
## |-------------------------|
## | Count |
## | Total Percent |
## |-------------------------|
##
## ============================================
## basket$pastay
## basket$saucey no yes Total
## --------------------------------------------
## no 631203 1139818 1771021
## 19.0% 34.4%
## --------------------------------------------
## yes 725120 818343 1543463
## 21.9% 24.7%
## --------------------------------------------
## Total 1356323 1958161 3314484
## ============================================
We see that pasta was in 1,958,161 baskets, and of those baskets 818,343 also had sauce in it. So, given that a basket has pasta, the chances that it also has sauce is about 42%.
IN-CLASS EXERCISE 3: Is Buddy the Elf out there? Given that a basket has pasta, what are the chances that it also has syrup?
EXTENDED EXERCISE: Is there a way to visualize associations of different products? Different brands?
Sometimes reshaping data is necessary to create a good visualization. For example, in our transaction data we know that some customers used coupons. Suppose we want to compare average prices when coupon was used to average prices when coupon was not used. Let’s summarize the transactions data by upc code and whether or not a coupon was used. While collapsing/summarizing the data we will calculate the average price for each upc code and whether or not a coupon was used.
prods <- trans %>% group_by(upc, coupon) %>% summarize(av_price=mean(dollar_sales/units))
It is possible that not all products had a coupon. If a product had a coupon it would be in our data set twice, if it did not have a coupon it would be there only once. So we can use mutate() to calculate how many times each upc code is in the data and filter out those that appear only once.
prods <- prods %>% group_by(upc) %>% mutate(count=n()) %>% filter(count==2) %>% select(-count)
head(prods, n=4)
## Source: local data frame [4 x 3]
## Groups: upc [2]
##
## upc coupon av_price
## <dbl> <int> <dbl>
## 1 1160000004 0 2.7912861
## 2 1160000004 1 2.0548603
## 3 1330052501 0 0.6901815
## 4 1330052501 1 0.3275000
Right now we have a data set where each product appears twice, once with average price when a coupon was used (coupon=1) and once an average price when a coupon was not used (coupon=0). Suppose that we want to plot average prices with coupon against average prices with no coupon. To do this we would need to reshape the data so that each product appears in the data only once, but have two new columns - one that shows no coupon price and one with coupon price.
prods <- spread(data=prods, key=coupon, value=av_price)
prods <- rename(prods, coupon_price = `1`, nocoupon_price = `0`)
head(prods, n=4)
## Source: local data frame [4 x 3]
## Groups: upc [4]
##
## upc nocoupon_price coupon_price
## <dbl> <dbl> <dbl>
## 1 1160000004 2.7912861 2.054860
## 2 1330052501 0.6901815 0.327500
## 3 1480071124 2.2694960 1.551274
## 4 1510000001 0.6858615 0.235000
Now we are ready to plot the data.
ggplot(prods) + geom_point(aes(x=nocoupon_price, y=coupon_price)) +
geom_line(aes(y=coupon_price,x=coupon_price), color="red") #add a horizontal line
As expected, in vast majority of cases coupon prices are lower than no coupon prices. Interestingly, coupons appear to lower the price by a bit less than a dollar. This is true for products priced in the one to two dollar range as well as more expensive ones.
Suppose you had the following data on population and GDP for two countries in two years. Reshape this data so that each country’s population and each country’s GDP are in separate columns, and each year is in a separate row. Give the new variables informative names.
data <- data.frame(year=c(1,1,2,2), country=c("a","b","a","b"), pop=c(1,2,3,4), gdp=c(5,6,7,8))
data
## year country pop gdp
## 1 1 a 1 5
## 2 1 b 2 6
## 3 2 a 3 7
## 4 2 b 4 8
The problem we have is that there are two variables that have values=. In the end we want columns that indicate population in country a (say a_pop), population in country b (say b_pop), GDP in country a (say a_gdp), etc. We solve this in three stages. First, we gather the pop and gdp columns to make it long, we call the variable that has either pop or GDP stat. Second, we unite the country and stat variables into one called country_stat. Finally, we spread the country_stat variable to make the data wide again.
temp <- gather(data, key=stat, value=value, pop, gdp)
temp
## year country stat value
## 1 1 a pop 1
## 2 1 b pop 2
## 3 2 a pop 3
## 4 2 b pop 4
## 5 1 a gdp 5
## 6 1 b gdp 6
## 7 2 a gdp 7
## 8 2 b gdp 8
temp <- unite(temp, country_stat, country, stat)
temp
## year country_stat value
## 1 1 a_pop 1
## 2 1 b_pop 2
## 3 2 a_pop 3
## 4 2 b_pop 4
## 5 1 a_gdp 5
## 6 1 b_gdp 6
## 7 2 a_gdp 7
## 8 2 b_gdp 8
data_wide <- spread(temp, key=country_stat, value=value)
data_wide
## year a_gdp a_pop b_gdp b_pop
## 1 1 5 1 6 2
## 2 2 7 3 8 4
data <- data.frame(year=c(1,1,2,2), country=c("a","b","a","b"), pop=c(1,2,3,4))
data
## year country pop
## 1 1 a 1
## 2 1 b 2
## 3 2 a 3
## 4 2 b 4
#put tickers of a bunch of stocks in a vector
stocks <- c("FB","GOOG","AAPL", "UAL") #facebook, google, apple, united airlines
data <- data.frame() #create an empty data frame
#make a loop that grabs the tickers from the 'stocks' vector and gets the data from yahoo finance
for (i in 1:length(stocks)) {
temp <- read.csv(paste("http://ichart.finance.yahoo.com/table.csv?s=",stocks[i], sep=""))
temp$stock <- stocks[i] #create a key that is the stock ticker
temp <- select(temp, Date, stock, Adj.Close) #keep only Date, stock and Adj.Close
data <- bind_rows(data,temp) #append temp to data, this will gradually stack data for each stock on top of each other
}
data$Date <- as.Date(data$Date) #turn factor date into a proper date
data <- filter(data, Date>as.Date("2003-01-01")) #trim the time frame
Use the above data frame to create a time plot of stock prices of all four companies in one graph. Use a log scale on the y-axis. (Hint: You can map the color aesthetic to stock.)
Turn your data from long to wide with stock prices for each company as separate columns. Do you have any missing values? Why?
Calculate the correlation coefficients among the four stock prices. (Hint 1: Use function cor() which takes as an argument a data frame of numerical variables. Hint 2: Use option use="pairwise.complete.obs to tell the function to ignore missing values in each pair of variables.) Which pair of stock prices has the lowest correlation coefficient? Which has the highest? Does that make sense to you?
Use package corrplot to visualize the correlations. See the documentation here (Hint: You need to put the output of cor() into a matrix object, like this correlations <- cor().)