Stat 585X: Week 1

Ian Lyttle

library(knitr)
library(plyr)
library(reshape2)

options(width = 100, str = strOptions(vec.len = 2))

opts_chunk$set(comment = NA, tidy = FALSE, fig.align = "center", fig.width = 10, 
    fig.height = 6, dev = "png")

Summary

Here are my short answers to the questions:

What does it mean to melt data?

This is process by which we obtain molten data.

In the context of a data frame, molten data are data where each row contains a single observation of a single variable.

Why cast it into different forms?

It is sometimes the case where molten data are more amenable to transformation, modeling, and visualization.

It is also sometimes the case that cast data are easier for humans to read.

Where would this be useful?

This might be useful in summarizing a dataset for a presentation. It might also be useful in case you want to transform data using values from different variables.

These answers are expanded-upon using a case study.

Case study

Let's look at one of the example files from the lecture.

temp.all <- read.table("data/9641C_201112_raw.avg")
dim(temp.all)
[1] 132644     14

colnames(temp.all) <- 
  c("id", "jan", "feb", "mar", "apr", "may", "jun", 
    "jul", "aug", "sep", "oct", "nov", "dec", "ann")

head(temp.all)
         id   jan   feb mar   apr may jun jul aug sep oct   nov   dec   ann
1 1.108e+09   469   426 571   638 710 789 813 810 786 613 -9999 -9999 -9999
2 1.108e+09   476   513 548 -9999 766 787 802 815 744 645   578   456 -9999
3 1.108e+09   452   553 675   642 695 817 816 779 753 658 -9999 -9999 -9999
4 1.108e+09 -9999 -9999 561   669 748 786 819 837 806 738   588   518 -9999
5 1.108e+09 -9999   544 543   641 723 798 809 818 819 700   549   576 -9999
6 1.108e+09   538   648 618   713 764 807 812 786 784 695   663   477   692

Looking at the structure of the data (and doing a little googling on the file name), it looks like the year column is a combination of a station id and a year.

temp_new <- mutate(
  temp.all,
  station_id = floor(id/10000),
  year = id %% 10000,
  id = NULL
)

Melting

I can replace the NA values by first melting the data frame (but there are other ways to do this without having melted the data frame).

temp_melt <- 
  melt(temp_new, id.vars = c("station_id", "year"), variable.name = "month")

head(temp_melt)
  station_id year month value
1     110843 1895   jan   469
2     110843 1896   jan   476
3     110843 1897   jan   452
4     110843 1900   jan -9999
5     110843 1926   jan -9999
6     110843 1927   jan   538
str(temp_melt)
'data.frame':   1724372 obs. of  4 variables:
 $ station_id: num  110843 110843 ...
 $ year      : num  1895 1896 ...
 $ month     : Factor w/ 13 levels "jan","feb","mar",..: 1 1 1 1 1 ...
 $ value     : chr  "469" "476" ...
summary(temp_melt)
   station_id           year          month           value          
 Min.   : 110843   Min.   :1895   jan    :132644   Length:1724372    
 1st Qu.:1348943   1st Qu.:1928   feb    :132644   Class :character  
 Median :2553103   Median :1955   mar    :132644   Mode  :character  
 Mean   :2581190   Mean   :1955   apr    :132644                     
 3rd Qu.:3679313   3rd Qu.:1983   may    :132644                     
 Max.   :4899053   Max.   :2011   jun    :132644                     
                                  (Other):928508                     

In addition to treating the NA values, we should coerce value from character to double.

temp_melt$value <- as.numeric(temp_melt$value)
Warning: NAs introduced by coercion
temp_melt$value[temp_melt$value == -9999] <- NA

head(temp_melt)
  station_id year month value
1     110843 1895   jan   469
2     110843 1896   jan   476
3     110843 1897   jan   452
4     110843 1900   jan    NA
5     110843 1926   jan    NA
6     110843 1927   jan   538

Casting

Cast form (where it was originally) might be useful for tabular presentation.

temp_cast <- dcast(temp_melt, formula = station_id + year ~ month)
head(temp_cast)
  station_id year jan feb mar apr may jun jul aug sep oct nov dec ann
1     110843 1895 469 426 571 638 710 789 813 810 786 613  NA  NA  NA
2     110843 1896 476 513 548  NA 766 787 802 815 744 645 578 456  NA
3     110843 1897 452 553 675 642 695 817 816 779 753 658  NA  NA  NA
4     110843 1900  NA  NA 561 669 748 786 819 837 806 738 588 518  NA
5     110843 1926  NA 544 543 641 723 798 809 818 819 700 549 576  NA
6     110843 1927 538 648 618 713 764 807 812 786 784 695 663 477 692

Also, maybe we are interested in, for each station and year, the difference between the value between June and October. Here, again, cast form is useful.

temp_cast_diff <- summarize(
  temp_cast,
  station_id,
  year,
  oct_jun_diff = oct - jun
)

head(temp_cast_diff)
  station_id year oct_jun_diff
1     110843 1895         -176
2     110843 1896         -142
3     110843 1897         -159
4     110843 1900          -48
5     110843 1926          -98
6     110843 1927         -112