III. Data Manipulation
Importing Data
Reading data from CSV, Excel, and other formats
CSV
A CSV file (Comma-Separated Values file) is a plain text file that stores tabular data, such as a spreadsheet or database, in a simple, structured format. Each line in the file represents a row in the table, and each value within a row is separated by a comma.
Key Features:
Plain Text: CSV files are easy to create, read, and manipulate using simple text editors or programming languages.
Comma-Separated: Values are separated by commas, though other delimiters like tabs (TSV files) or semicolons are also used in some variations.
Rows and Columns: Each row corresponds to a record, and each comma-separated value in the row corresponds to a field (or column) in the table.
No Formatting: CSV files do not contain complex formatting, formulas, or images. They are purely text-based, making them lightweight and portable.
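For instance, the first few lines of a simple CSV file (using columns from the Titanic data used below) might look like this:

PassengerId,Survived,Pclass
1,0,3
2,1,1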
Importing .csv files:

titanic <- read.csv("C:/Users/rache/Downloads/train.csv")

- This line reads a CSV file from a specific location given as an absolute path. This approach is useful when you have a specific file path, and it ensures that you're pointing to the exact location where the file is stored.
- On macOS/Linux, the ~ symbol represents the home directory of the current user, and a path starting with ~ is interpreted relative to that home directory (see the sketch below).
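For instance, on macOS/Linux the same file could be read with a path anchored at the home directory (the exact path here is illustrative):

# "~" expands to the current user's home directory
titanic <- read.csv("~/Downloads/train.csv")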
getwd() # 1>

[1] "C:/Users/rache/Downloads"

titanic <- read.csv(file = "datasets_import/titanic.csv") # 2>

- 1> The getwd() function in R stands for "Get Working Directory". It returns the current working directory of your R session, which is the default folder where R looks for files to read and where it saves files if you don't specify a full path.
- 2> This line reads a CSV file using a relative path. The path provided is relative to the current working directory of the R session.
Be sure to eyeball your original data and imported data to ensure it is imported correctly. You might note there are many “blank” values.
By default, the blank values in character vectors are treated as a separate category.
# Generate frequency table for Cabin, including missing values
cabin_freq <- table(titanic$Cabin,
                    useNA = "ifany")

# Sort the frequency table in descending order
sorted_cabin_freq <- sort(x = cabin_freq,
                          decreasing = TRUE)

# Print the largest 5 categories
sorted_cabin_freq[1:5]

    B96 B98 C23 C25 C27 G6 C22 C26
687       4           4  4       3
- We can explicitly ask the software to treat these "blank values" as NA (missing values) by specifying the na.strings argument.

titanic <- read.csv(file = "datasets_import/titanic.csv",
                    na.strings = "") # can specify multiple values in na.strings if needed, e.g. 99999999 in some software
Excel Files
An Excel file is a spreadsheet format created by Microsoft Excel that stores tabular data, including text, numbers, and formulas, in a structured format. Unlike CSV files, Excel files can contain multiple sheets, rich formatting, and more complex data structures.
Key Features:
Structured Data: Excel files can store data in multiple sheets within a single file, allowing for complex data organization.
Rich Formatting: Unlike CSV files, Excel files support cell formatting, including font styles, colors, and borders.
Formulas and Functions: Excel supports formulas and functions to perform calculations and data analysis directly within the spreadsheet.
Data Validation and Charts: Excel files can include data validation rules and embedded charts for visual data representation.
Importing .xlsx files:
There are lots of functions to import .xlsx files. To import Excel files in R, you can use the readxl package or the openxlsx package. Below is an example using readxl:
# install.packages("readxl")
library(readxl)
# Import an Excel file and measure the time taken to import the file
start_time <- Sys.time()

Vending_Machine_data <- read_excel(path = "datasets_import/Vending_Machine_data.xlsx",
                                   sheet = "Vending_Machine_data",
                                   range = "A1:N37")

New names:
• `` -> `...14`

end_time <- Sys.time()

end_time - start_time # Time difference of 0.0800302 secs

Time difference of 0.2282751 secs
- The openxlsx package provides functions for reading and writing Excel files; it is optimized for handling larger datasets and offers additional features like formatting and writing to Excel.
# install.packages("openxlsx")
library(openxlsx)
?read.xlsx
starting httpd help server ... done
<- Sys.time()
start_time
<- read.xlsx(xlsxFile = "datasets_import/Vending_Machine_data.xlsx",
Vending_Machine_data sheet = "Vending_Machine_data"
)<- Sys.time()
end_time
- start_time # Time difference of 0.01190901 secs end_time
Time difference of 0.02591205 secs
DTA Files
A .dta file is a data format used by Stata, a statistical software package. These files store tabular data, including variables and observations, in a structured format that Stata can read and write.
Key Features:
- Structured Data: .dta files contain tabular data with columns (variables) and rows (observations).
- Compatibility: Designed for use with Stata software but can be read by other statistical packages.
- Metadata: May include variable labels and value labels in addition to the data itself.
Importing .dta Files
To import .dta files into R, you can use several packages, each with its own features. The most popular package is haven, which is part of the tidyverse and is specifically designed for reading and writing data from statistical software packages, including Stata.
# install.packages("haven")
# Load the haven library
library(haven)
<- read_dta(file = "datasets_import/ProbSet4trade.dta") data_haven
Basic data inspection (head, tail, summary)
head
Purpose: Displays the first few rows of a data frame, matrix, or vector. Useful for quickly viewing the beginning of a dataset.

?head
head(x = titanic[, 1:3]) # first three columns only
PassengerId Survived Pclass
1 1 0 3
2 2 1 1
3 3 1 3
4 4 1 1
5 5 0 3
6 6 0 3
head(x = titanic[,1:3], n=9L) # if you want more observations
PassengerId Survived Pclass
1 1 0 3
2 2 1 1
3 3 1 3
4 4 1 1
5 5 0 3
6 6 0 3
7 7 0 1
8 8 0 3
9 9 1 3
tail
Purpose: Displays the last few rows of a data frame, matrix, or vector. Useful for quickly viewing the end of a dataset.
tail(x= titanic[,1:3])
PassengerId Survived Pclass
886 886 0 3
887 887 0 2
888 888 1 1
889 889 0 3
890 890 1 1
891 891 0 3
tail(x = titanic[,1:3], n=9L) # if you want more observations
PassengerId Survived Pclass
883 883 0 3
884 884 0 2
885 885 0 3
886 886 0 3
887 887 0 2
888 888 1 1
889 889 0 3
890 890 1 1
891 891 0 3
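summary
Purpose: Produces summary statistics for each column of a data frame: for numeric columns the minimum, quartiles, mean, maximum, and a count of NA's; character columns report only their length and class. A minimal call (output not shown here) would be:

summary(titanic[, 1:3]) # summary statistics for the first three columns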
glimpse
The glimpse()
function from the dplyr
package is a powerful tool for inspecting the structure and content of a data frame or tibble in R.
Purpose: Provides a compact and transposed view of the data frame or tibble, showing column names, types, and a preview of the values.
library(dplyr)
Attaching package: 'dplyr'
The following objects are masked from 'package:stats':
filter, lag
The following objects are masked from 'package:base':
intersect, setdiff, setequal, union
dplyr::glimpse(titanic) # explicitly specifying the package name from which the glimpse function comes
Rows: 891
Columns: 12
$ PassengerId <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,…
$ Survived <int> 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1…
$ Pclass <int> 3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3, 2, 3, 3…
$ Name <chr> "Braund, Mr. Owen Harris", "Cumings, Mrs. John Bradley (Fl…
$ Sex <chr> "male", "female", "female", "female", "male", "male", "mal…
$ Age <dbl> 22, 38, 26, 35, 35, NA, 54, 2, 27, 14, 4, 58, 20, 39, 14, …
$ SibSp <int> 1, 1, 0, 1, 0, 0, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 4, 0, 1, 0…
$ Parch <int> 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1, 0, 0, 0…
$ Ticket <chr> "A/5 21171", "PC 17599", "STON/O2. 3101282", "113803", "37…
$ Fare <dbl> 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 8.4583, 51.8625,…
$ Cabin <chr> NA, "C85", NA, "C123", NA, NA, "E46", NA, NA, NA, "G6", "C…
$ Embarked <chr> "S", "C", "S", "S", "S", "Q", "S", "S", "S", "C", "S", "S"…
Visualization of missing values
- Identification: Detect missing values using functions like
is.na()
in R.
table(is.na(titanic$Cabin))
FALSE TRUE
204 687
- Instead of applying functions to each variable individually, we can apply functions from user-written packages to the entire dataset.
visdat
The vis_dat()
function from the visdat
package is a useful tool for visualizing missing values and the structure of your data. It provides a comprehensive view of the completeness of your data, which can help identify patterns of missingness and understand the structure of your dataset.
# install.packages("visdat")
library(visdat)
vis_dat(titanic)
- Does not work on large data sets
# Use tryCatch to handle errors and print the result
tryCatch({
  # Attempt to visualize data with vis_dat()
  vis_dat(data_haven)
}, error = function(e) {
  # Print the error message
  print(paste("An error occurred:", e$message))
})
[1] "An error occurred: Data exceeds recommended size for visualisation"
naniar
The naniar
package offers a suite of functions for visualizing and handling missing data. The gg_miss_upset()
function, for example, can visualize missing data patterns in a dataset.
# install.packages("naniar")
library(naniar)
gg_miss_upset(titanic)
VIM
The VIM package provides functions like aggr()
and matrixplot()
for visualizing missing data patterns. These functions help in understanding the distribution and patterns of missing values.
# install.packages("VIM")
library(VIM)
Loading required package: colorspace
Loading required package: grid
VIM is ready to use.
Suggestions and bug-reports can be submitted at: https://github.com/statistikat/VIM/issues
Attaching package: 'VIM'
The following object is masked from 'package:datasets':
sleep
aggr(titanic,
col = c('navyblue', 'orange'),
numbers = TRUE,
sortVars = TRUE
)
Variables sorted by number of missings:
Variable Count
Cabin 0.771043771
Age 0.198653199
Embarked 0.002244669
PassengerId 0.000000000
Survived 0.000000000
Pclass 0.000000000
Name 0.000000000
Sex 0.000000000
SibSp 0.000000000
Parch 0.000000000
Ticket 0.000000000
Fare 0.000000000
Amelia
The Amelia package offers the missmap()
function, which creates a graphical representation of missing data in a dataset.
# install.packages("Amelia")
library(Amelia)
Loading required package: Rcpp
##
## Amelia II: Multiple Imputation
## (Version 1.8.3, built: 2024-11-07)
## Copyright (C) 2005-2025 James Honaker, Gary King and Matthew Blackwell
## Refer to http://gking.harvard.edu/amelia/ for more information
##
missmap(titanic,
main = "Missing values map",
col = c("yellow", "black")
)
- Works on large datasets.
Amelia::missmap(data_haven)
plot_missing()
The plot_missing()
function from the DataExplorer
package provides a bar chart showing the proportion of missing values per column.
# install.packages("DataExplorer")
library(DataExplorer)
plot_missing(titanic)
Data Cleaning and Transformation
Data cleaning and transformation are crucial steps in the data analysis process. They ensure that the data is accurate, consistent, and formatted appropriately for analysis.
Handling missing values
Imputation: Fill in missing values with statistical measures (mean, median) or use imputation techniques.
Median Imputation:
- Replaces missing values with the median, robust to outliers.
table(is.na(titanic$Age))

FALSE  TRUE 
  714   177 

titanic$Age_imputed <- median(titanic$Age) # median() of a vector containing NA is itself NA
table(is.na(titanic$Age_imputed))

TRUE 
 891 

titanic$Age_imputed_median <- median(titanic$Age, na.rm = TRUE) # The na.rm = TRUE argument ensures that missing values are ignored when calculating the median
table(is.na(titanic$Age_imputed_median))

FALSE 
  891 
Mean Imputation:
- Replaces missing values with the average of the available data. Is fine when data is symmetric.
titanic$Age_imputed_mean <- mean(titanic$Age, na.rm = TRUE) # The na.rm = TRUE argument ensures that missing values are ignored when calculating the mean
Mode Imputation:
- Useful for categorical data, replaces missing values with the most common value.
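A minimal sketch for the Embarked column (the Embarked_imputed column name is just illustrative):

# Find the most frequent non-missing value (the mode) of Embarked
embarked_mode <- names(sort(table(titanic$Embarked), decreasing = TRUE))[1]

# Replace missing values with the mode
titanic$Embarked_imputed <- ifelse(is.na(titanic$Embarked),
                                   embarked_mode,
                                   titanic$Embarked)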
Predictive Modeling:
- Uses regression or other models to predict and fill in missing values.
You can predict missing values using other columns. For example, use a linear regression model to predict the missing values in Age
based on other features.
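A minimal sketch of that idea with lm() (the choice of predictors and the Age_imputed_lm name are illustrative; the course example below uses the mice package instead):

# Fit a linear model on the rows where Age is observed (lm drops NA rows by default)
age_model <- lm(Age ~ Pclass + SibSp + Parch + Fare, data = titanic)

# Predict Age only for the rows where it is missing
missing_age <- is.na(titanic$Age)
titanic$Age_imputed_lm <- titanic$Age
titanic$Age_imputed_lm[missing_age] <- predict(age_model,
                                               newdata = titanic[missing_age, ])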
# install.packages("mice")
library(mice)
# Perform multiple imputation on the titanic dataset
<- mice(titanic,
imputed_data method = 'pmm', # 1>
m = 5, # 2>
maxit = 50, # 3>
seed = 500 # 4>
)
# Extract the first complete dataset from the imputed data
<- complete(data = imputed_data,
completed_data action = 1 # 5>
)
# Assign the imputed Age values to a new column in the original dataset
$Age_imputed_pm <- completed_data$Age titanic
- 1> method = 'pmm': Use Predictive Mean Matching for imputing missing values.
- 2> m = 5: Generate 5 imputed datasets to account for uncertainty in imputations.
- 3> maxit = 50: Run the algorithm for a maximum of 50 iterations to refine imputations.
- 4> seed = 500: Set a random seed for reproducibility of results.
- 5> action = 1: In this case, 1 means the first imputed dataset out of the m datasets that mice() has generated.
KNN Imputation:
- Uses similar rows to fill in missing data.
# Perform KNN imputation
titanic2 <- VIM::kNN(data = titanic,
                     variable = c("Age", "Fare"),
                     k = 5)
- Removal: In some cases, it may be appropriate to remove rows or columns with excessive missing values.
colnames(titanic)
[1] "PassengerId" "Survived" "Pclass"
[4] "Name" "Sex" "Age"
[7] "SibSp" "Parch" "Ticket"
[10] "Fare" "Cabin" "Embarked"
[13] "Age_imputed" "Age_imputed_median" "Age_imputed_mean"
titanic$Cabin <- NULL
colnames(titanic)
[1] "PassengerId" "Survived" "Pclass"
[4] "Name" "Sex" "Age"
[7] "SibSp" "Parch" "Ticket"
[10] "Fare" "Embarked" "Age_imputed"
[13] "Age_imputed_median" "Age_imputed_mean"
Filtering, selecting, and mutating data using dplyr
When working with data in R, dplyr is a powerful package that makes data manipulation straightforward and efficient. Here’s an overview of how to filter, select, and mutate data using dplyr:
1. Filtering Data
The filter()
function is used to subset rows based on conditions.
- The basic syntax is
filter(data, condition)
.
library(dplyr)
# Filter rows where 'Fare' is greater than 10
filtered_data <- titanic |>
  filter(Fare > 10)
- Can combine multiple conditions.
# Filter rows where 'Fare' is greater than 10 and gender is male
filtered_data <- titanic %>%
  filter(Fare > 10, Sex == "male")
nrow(filtered_data)
[1] 305
ncol(filtered_data)
[1] 14
dim(filtered_data) # rows by columns
[1] 305 14
2. Selecting Columns
The select()
function is used to choose specific columns from a dataset.
- The basic syntax is
select(data, column1, column2, ...)
.
# Select only 'Survived','Fare' and 'Sex' columns
selected_data <- titanic |>
  select(Survived, Fare, Sex)
# nrow(selected_data)
# ncol(selected_data)
dim(selected_data)
[1] 891 3
3. Mutating Data
The mutate()
function is used to add new columns or modify existing ones.
- The basic syntax is mutate(data, new_column = expression).

# Add a new column 'Fare_by_Age' that divides Fare by the median-imputed age
titanic <- titanic %>%
  mutate(Fare_by_Age = Fare / Age_imputed_median)
Suppose you were interested in Average_Fare_by_Pclass, a column that contains the average fare for each passenger class. You can compute that as well.

?dplyr::group_by

titanic <- titanic |>
  group_by(Pclass) |>
  mutate(Average_Fare_by_Pclass = mean(Fare, na.rm = TRUE)) |>
  ungroup() # Remove the grouping structure from the dataset, returning it to a regular data frame
# Display unique values in the 'Average_Fare_by_Pclass' column
unique(titanic$Average_Fare_by_Pclass) # 1>
[1] 13.67555 84.15469 20.66218
# Show unique combinations of Pclass and Average_Fare_by_Pclass
unique_pclass_fares <- titanic %>%
  distinct(Pclass, Average_Fare_by_Pclass) # 2>
# Print the result
print(unique_pclass_fares)
# A tibble: 3 × 2
Pclass Average_Fare_by_Pclass
<int> <dbl>
1 3 13.7
2 1 84.2
3 2 20.7
- 1> unique(): Extracts unique values from a vector or column.
- 2> distinct(): Extracts unique rows based on specified columns from a data frame.
Basic Data Wrangling
Merging and joining data frames (adding columns)
In data analysis, combining data sets is often necessary to create a comprehensive view of the data. Merging and joining data frames are essential operations for integrating data sets based on common variables or keys.
Purpose: Combining two data frames into one based on shared column(s) or row names. This operation is similar to SQL joins.
Main Types of Merges (aka joins):
Inner Join: Includes only rows with keys that are present in both data frames.
- Commonly used.
Left Join: Includes all rows from the first data frame and matched rows from the second.
Right Join: Includes all rows from the second data frame and matched rows from the first.
- Symmetric case of left join.
Full Join: Includes all rows from both data frames, with NA for missing matches.
- Rarely used in practice.
Function: merge()
- Let's look at df1 - it has a unique ID.
df1
ID Name Income
1 1 Alice 50000
2 2 Bob 60000
3 3 Charlie 55000
4 4 David 70000
5 5 Eva 62000
unique(df1$ID)
[1] 1 2 3 4 5
# duplicated(df1$ID)
table(duplicated(df1$ID))
FALSE
5
- Let's look at df2 - it has a unique ID as well.

df2 # 1>
ID Age State
1 2 30 California
2 3 25 Texas
3 4 NA New York
4 NA 40 Florida
5 6 35 Nevada
unique(df2$ID)
[1] 2 3 4 NA 6
# duplicated(df2$ID)
table(duplicated(df2$ID))
FALSE
5
- 1> df2 has one missing ID. Typically we clean data frames and would have thrown away rows with duplicate IDs; observations with unknown IDs would usually have been thrown out as well.
- Since the ID is unique in both data sets, we can merge the two data frames based on the column ID.
# Inner Join (default)
inner_join_df <- merge(x = df1, y = df2, by = "ID") # 1>
inner_join_df # 2,3,4
ID Name Income Age State
1 2 Bob 60000 30 California
2 3 Charlie 55000 25 Texas
3 4 David 70000 NA New York
# Left Join
left_join_df <- merge(x = df1, y = df2, by = "ID", all.x = TRUE) # 2>
left_join_df # 1,2,3,4,5
ID Name Income Age State
1 1 Alice 50000 NA <NA>
2 2 Bob 60000 30 California
3 3 Charlie 55000 25 Texas
4 4 David 70000 NA New York
5 5 Eva 62000 NA <NA>
# Right Join
right_join_df <- merge(x = df1, y = df2, by = "ID", all.y = TRUE) # 3>
right_join_df # 2,3,4,6,NA
ID Name Income Age State
1 2 Bob 60000 30 California
2 3 Charlie 55000 25 Texas
3 4 David 70000 NA New York
4 6 <NA> NA 35 Nevada
5 NA <NA> NA 40 Florida
# Full Join
full_join_df <- merge(x = df1, y = df2, by = "ID", all = TRUE) # 4>
full_join_df # 1,2,3,4,5,6,NA
ID Name Income Age State
1 1 Alice 50000 NA <NA>
2 2 Bob 60000 30 California
3 3 Charlie 55000 25 Texas
4 4 David 70000 NA New York
5 5 Eva 62000 NA <NA>
6 6 <NA> NA 35 Nevada
7 NA <NA> NA 40 Florida
- 1> Inner join keeps observations that are common to both datasets. ID: 2, 3, 4
- 2> Left join keeps all observations in df1 and only the observations in df2 that merge with df1. ID: 1, 2, 3, 4, 5
- 3> Right join keeps all observations in df2 and only the observations in df1 that merge with df2. ID: 2, 3, 4, 6, NA
- 4> Full join is often not useful. ID: 1, 2, 3, 4, 5, 6, NA
- Think about this as getting more controls (columns) for your data. A dplyr equivalent is sketched below.
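For reference, dplyr offers equivalent join verbs; a minimal sketch (assuming dplyr is loaded) that should reproduce the merge() results above:

library(dplyr)

inner_join(x = df1, y = df2, by = "ID") # like merge(df1, df2, by = "ID")
left_join(x = df1, y = df2, by = "ID")  # like all.x = TRUE
right_join(x = df1, y = df2, by = "ID") # like all.y = TRUE
full_join(x = df1, y = df2, by = "ID")  # like all = TRUE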
Appending data frames (adding rows)
Now, let's say our df1 stays the same but we collect data on more people (more rows, same columns). Our new data frame df3 is different from df2:
df1
ID Name Income
1 1 Alice 50000
2 2 Bob 60000
3 3 Charlie 55000
4 4 David 70000
5 5 Eva 62000
df3
ID Name Income
1 6 Frank 48000
2 7 Grace 55000
3 8 Hannah 62000
4 9 Ivy 70000
5 10 Jack 58000
To combine these data frames, use the rbind() function.
- rbind(df1, df3): Combines df1 and df3 by appending the rows of df3 to df1.
- This operation assumes that both data frames have the same column names and order (see the bind_rows() sketch at the end of this subsection).
# Appending df3 to df1
appended_df <- rbind(df1, df3)

# Display the result
appended_df
ID Name Income
1 1 Alice 50000
2 2 Bob 60000
3 3 Charlie 55000
4 4 David 70000
5 5 Eva 62000
6 6 Frank 48000
7 7 Grace 55000
8 8 Hannah 62000
9 9 Ivy 70000
10 10 Jack 58000
Result
The appended_df
data frame will have the combined rows from df1
and df3
, containing all the unique people from both data frames.
- Think about this as more people are given the same survey.
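If df3 had extra or missing columns, rbind() would fail. A common alternative is dplyr::bind_rows(), which matches columns by name and fills any unmatched columns with NA. A minimal sketch:

library(dplyr)

# bind_rows matches columns by name, so the column sets need not be identical
appended_df2 <- bind_rows(df1, df3)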
Reshaping data (wide to long format)
reshape2
is an R
package written by Hadley Wickham that makes it easy to transform data between wide and long formats.
What makes data wide or long?
Wide data has a column for each variable. For example, this is wide-format data:
## ozone wind temp
## 1 23.61538 11.622581 65.54839
## 2 29.44444 10.266667 79.10000
## 3 59.11538 8.941935 83.90323
## 4 59.96154 8.793548 83.96774
And this is long-format data:
## No id variables; using all as measure variables
## variable value
## 1 ozone 23.615385
## 2 ozone 29.444444
## 3 ozone 59.115385
## 4 ozone 59.961538
## 5 wind 11.622581
## 6 wind 10.266667
## 7 wind 8.941935
## 8 wind 8.793548
## 9 temp 65.548387
## 10 temp 79.100000
## 11 temp 83.903226
## 12 temp 83.967742
Long-format data has a column for possible variable types and a column for the values of those variables.
Long-format data isn’t necessarily only two columns. For example, we might have ozone measurements for each day of the year. In that case, we could have another column for day. In other words, there are different levels of “longness”.
The ultimate shape you want to get your data into will depend on what you are doing with it.
It turns out that you need wide-format data for some types of data analysis and long-format data for others. In reality, you need long-format data much more commonly than wide-format data. For example, ggplot2 requires long-format data (technically tidy data), plyr requires long-format data, and most modelling functions (such as lm(), glm(), and gam()) require long-format data.
- But people often find it easier to record their data in wide format.
The reshape2 package
reshape2 is based around two key functions: melt and cast:

- melt takes wide-format data and melts it into long-format data.
- cast takes long-format data and casts it into wide-format data.
Think of working with metal: if you melt metal, it drips and becomes long. If you cast it into a mould, it becomes wide.
Wide- to long-format data: the melt
function
For this example we’ll work with the airquality
dataset that is built into R
.
First we’ll change the column names to lower case to make them easier to work with. Then we’ll look at the data:
names(airquality) <- tolower(names(airquality))
head(airquality)
ozone solar.r wind temp month day
1 41 190 7.4 67 5 1
2 36 118 8.0 72 5 2
3 12 149 12.6 74 5 3
4 18 313 11.5 62 5 4
5 NA NA 14.3 56 5 5
6 28 NA 14.9 66 5 6
What happens if we run the function melt
with all the default argument values?
# install.packages("reshape2")
library(reshape2)
aql <- melt(airquality) # [a]ir [q]uality [l]ong format
No id variables; using all as measure variables
head(aql)
variable value
1 ozone 41
2 ozone 36
3 ozone 12
4 ozone 18
5 ozone NA
6 ozone 28
tail(aql)
variable value
913 day 25
914 day 26
915 day 27
916 day 28
917 day 29
918 day 30
By default, melt
has assumed that all columns with numeric values are variables with values. Often this is what you want. Maybe here we want to know the values of ozone
, solar.r
, wind
, and temp
for each month
and day
. We can do that with melt
by telling it that we want month
and day
to be “ID variables”. ID variables are the variables that identify individual rows of data.
aql <- melt(airquality, id.vars = c("month", "day"))
head(aql)
month day variable value
1 5 1 ozone 41
2 5 2 ozone 36
3 5 3 ozone 12
4 5 4 ozone 18
5 5 5 ozone NA
6 5 6 ozone 28
What if we wanted to control the column names in our long-format data? melt lets us set those too, all in one step:

aql <- melt(airquality, id.vars = c("month", "day"),
            variable.name = "climate_variable",
            value.name = "climate_value")
head(aql)
month day climate_variable climate_value
1 5 1 ozone 41
2 5 2 ozone 36
3 5 3 ozone 12
4 5 4 ozone 18
5 5 5 ozone NA
6 5 6 ozone 28
Long- to wide-format data: the cast functions
Whereas going from wide- to long-format data is pretty straightforward, going from long- to wide-format data can take a bit more thought. It usually involves some head scratching and some trial and error for all but the simplest cases. Let’s go through some examples.
In reshape2
there are multiple cast
functions. Since you will most commonly work with data.frame
objects, we’ll explore the dcast
function. (There is also acast
to return a vector, matrix, or array.)
Let’s take the long-format airquality
data and cast
it into some different wide formats. To start with, we’ll recover the same format we started with and compare the two.
dcast
uses a formula to describe the shape of the data. The arguments on the left refer to the ID variables and the arguments on the right refer to the measured variables. Coming up with the right formula can take some trial and error at first. So, if you’re stuck don’t feel bad about just experimenting with formulas. There are usually only so many ways you can write the formula.
Here, we need to tell dcast
that month
and day
are the ID variables (we want a column for each) and that variable
describes the measured variables. Since there is only one remaining column, dcast
will figure out that it contains the values themselves. We could explicitly declare this with value.var
. (And in some cases it will be necessary to do so.)
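For instance, declaring the value column explicitly would look like this (equivalent to the implicit call below):

aqw <- dcast(aql, month + day ~ variable, value.var = "value")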
aql <- melt(airquality, id.vars = c("month", "day"))
aqw <- dcast(aql, month + day ~ variable)
head(aqw)
month day ozone solar.r wind temp
1 5 1 41 190 7.4 67
2 5 2 36 118 8.0 72
3 5 3 12 149 12.6 74
4 5 4 18 313 11.5 62
5 5 5 NA NA 14.3 56
6 5 6 28 NA 14.9 66
head(airquality) # original data
ozone solar.r wind temp month day
1 41 190 7.4 67 5 1
2 36 118 8.0 72 5 2
3 12 149 12.6 74 5 3
4 18 313 11.5 62 5 4
5 NA NA 14.3 56 5 5
6 28 NA 14.9 66 5 6
So, besides re-arranging the columns, we’ve recovered our original data.
If it isn’t clear to you what just happened there, then have a look at this illustration:
Figure 1: An illustration of the dcast
function. The blue shading indicates ID variables that we want to represent individual rows. The red shading represents variable names that we want to swing into column names. The grey shading represents the data values that we want to fill in the cells with.
One confusing “mistake” you might make is casting a dataset in which there is more than one value per data cell. For example, this time we won’t include day
as an ID variable:
dcast(aql, month ~ variable)
Aggregation function missing: defaulting to length
month ozone solar.r wind temp
1 5 31 31 31 31
2 6 30 30 30 30
3 7 31 31 31 31
4 8 31 31 31 31
5 9 30 30 30 30
When you run this in R
, you’ll notice the warning message:
# Aggregation function missing: defaulting to length
And if you look at the output, the cells are filled with the number of data rows for each month-climate combination. The numbers we’re seeing are the number of days recorded in each month. When you cast
your data and there are multiple values per cell, you also need to tell dcast
how to aggregate the data. For example, maybe you want to take the mean
, or the median
, or the sum
. Let’s try the last example, but this time we’ll take the mean of the climate values. We’ll also pass the option na.rm = TRUE
through the ...
argument to remove NA
values. (The ...
let’s you pass on additional arguments to your fun.aggregate
function, here mean
.)
dcast(aql, month ~ variable, fun.aggregate = mean,
na.rm = TRUE)
month ozone solar.r wind temp
1 5 23.61538 181.2963 11.622581 65.54839
2 6 29.44444 190.1667 10.266667 79.10000
3 7 59.11538 216.4839 8.941935 83.90323
4 8 59.96154 171.8571 8.793548 83.96774
5 9 31.44828 167.4333 10.180000 76.90000
Unlike melt, there are some other fancy things you can do with dcast that are not covered here. It's worth reading the help file ?dcast. For example, you can compute summaries for rows and columns, subset the columns, and fill in missing cells in one call to dcast, as sketched below.
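As a sketch of two of those options (the fill value of 0 is arbitrary here): margins adds "(all)" summary rows/columns, and fill supplies a value for cells with no data.

dcast(aql, month ~ variable, fun.aggregate = mean, na.rm = TRUE,
      margins = "month", # adds an "(all)" row summarising across all months
      fill = 0)          # value to use for any empty cells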
Appendix
Exporting Data from R (not covered, but see the sketch at the end of this appendix)
Reshaping commands on the airquality data - Example's source
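Exporting is essentially the mirror image of importing; a minimal sketch (file names here are illustrative):

# Write a data frame to a CSV file; row.names = FALSE drops the row-number column
write.csv(titanic, file = "titanic_clean.csv", row.names = FALSE)

# Write a data frame to Stata's .dta format with haven
haven::write_dta(data = data_haven, path = "ProbSet4trade_copy.dta")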