Ian Lyttle
library(knitr)
library(plyr)
library(reshape2)
options(width = 100, str = strOptions(vec.len = 2))
opts_chunk$set(comment = NA, tidy = FALSE, fig.align = "center", fig.width = 10,
fig.height = 6, dev = "png")
Here are my short answers to the questions:
This is process by which we obtain molten data.
In the context of a data frame, molten data are data where each row contains a single observation of a single variable.
It is sometimes the case where molten data are more amenable to transformation, modeling, and visualization.
It is also sometimes the case that cast data are easier for humans to read.
This might be useful in summarizing a dataset for a presentation. It might also be useful in case you want to transform data using values from different variables.
These answers are expanded-upon using a case study.
Let's look at one of the example files from the lecture.
temp.all <- read.table("data/9641C_201112_raw.avg")
dim(temp.all)
[1] 132644 14
colnames(temp.all) <-
c("id", "jan", "feb", "mar", "apr", "may", "jun",
"jul", "aug", "sep", "oct", "nov", "dec", "ann")
head(temp.all)
id jan feb mar apr may jun jul aug sep oct nov dec ann
1 1.108e+09 469 426 571 638 710 789 813 810 786 613 -9999 -9999 -9999
2 1.108e+09 476 513 548 -9999 766 787 802 815 744 645 578 456 -9999
3 1.108e+09 452 553 675 642 695 817 816 779 753 658 -9999 -9999 -9999
4 1.108e+09 -9999 -9999 561 669 748 786 819 837 806 738 588 518 -9999
5 1.108e+09 -9999 544 543 641 723 798 809 818 819 700 549 576 -9999
6 1.108e+09 538 648 618 713 764 807 812 786 784 695 663 477 692
Looking at the structure of the data (and doing a little googling on the file name), it looks like the year column is a combination of a station id and a year.
temp_new <- mutate(
temp.all,
station_id = floor(id/10000),
year = id %% 10000,
id = NULL
)
I can replace the NA values by first melting the data frame (but there are other ways to do this without having melted the data frame).
temp_melt <-
melt(temp_new, id.vars = c("station_id", "year"), variable.name = "month")
head(temp_melt)
station_id year month value
1 110843 1895 jan 469
2 110843 1896 jan 476
3 110843 1897 jan 452
4 110843 1900 jan -9999
5 110843 1926 jan -9999
6 110843 1927 jan 538
str(temp_melt)
'data.frame': 1724372 obs. of 4 variables:
$ station_id: num 110843 110843 ...
$ year : num 1895 1896 ...
$ month : Factor w/ 13 levels "jan","feb","mar",..: 1 1 1 1 1 ...
$ value : chr "469" "476" ...
summary(temp_melt)
station_id year month value
Min. : 110843 Min. :1895 jan :132644 Length:1724372
1st Qu.:1348943 1st Qu.:1928 feb :132644 Class :character
Median :2553103 Median :1955 mar :132644 Mode :character
Mean :2581190 Mean :1955 apr :132644
3rd Qu.:3679313 3rd Qu.:1983 may :132644
Max. :4899053 Max. :2011 jun :132644
(Other):928508
In addition to treating the NA values, we should coerce value from character to double.
temp_melt$value <- as.numeric(temp_melt$value)
Warning: NAs introduced by coercion
temp_melt$value[temp_melt$value == -9999] <- NA
head(temp_melt)
station_id year month value
1 110843 1895 jan 469
2 110843 1896 jan 476
3 110843 1897 jan 452
4 110843 1900 jan NA
5 110843 1926 jan NA
6 110843 1927 jan 538
Cast form (where it was originally) might be useful for tabular presentation.
temp_cast <- dcast(temp_melt, formula = station_id + year ~ month)
head(temp_cast)
station_id year jan feb mar apr may jun jul aug sep oct nov dec ann
1 110843 1895 469 426 571 638 710 789 813 810 786 613 NA NA NA
2 110843 1896 476 513 548 NA 766 787 802 815 744 645 578 456 NA
3 110843 1897 452 553 675 642 695 817 816 779 753 658 NA NA NA
4 110843 1900 NA NA 561 669 748 786 819 837 806 738 588 518 NA
5 110843 1926 NA 544 543 641 723 798 809 818 819 700 549 576 NA
6 110843 1927 538 648 618 713 764 807 812 786 784 695 663 477 692
Also, maybe we are interested in, for each station and year, the difference between the value between June and October. Here, again, cast form is useful.
temp_cast_diff <- summarize(
temp_cast,
station_id,
year,
oct_jun_diff = oct - jun
)
head(temp_cast_diff)
station_id year oct_jun_diff
1 110843 1895 -176
2 110843 1896 -142
3 110843 1897 -159
4 110843 1900 -48
5 110843 1926 -98
6 110843 1927 -112