III. Data Manipulation
Importing Data
Reading data from CSV, Excel, and other formats
CSV
A CSV file (Comma-Separated Values file) is a plain text file that stores tabular data, such as a spreadsheet or database, in a simple, structured format. Each line in the file represents a row in the table, and each value within a row is separated by a comma.
Key Features:
Plain Text: CSV files are easy to create, read, and manipulate using simple text editors or programming languages.
Comma-Separated: Values are separated by commas, though other delimiters like tabs (TSV files) or semicolons are also used in some variations.
Rows and Columns: Each row corresponds to a record, and each comma-separated value in the row corresponds to a field (or column) in the table.
No Formatting: CSV files do not contain complex formatting, formulas, or images. They are purely text-based, making them lightweight and portable.
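For instance, the first few lines of a simple CSV file (using columns from the Titanic data used below) might look like this:

PassengerId,Survived,Pclass
1,0,3
2,1,1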
Importing .csv files:

titanic <- read.csv("C:/Users/rache/Downloads/train.csv")

- This line reads a CSV file from a specific location given as an absolute path. This approach is useful when you have a specific file path, and it ensures that you're pointing to the exact location where the file is stored.
- On macOS/Linux, the ~ symbol represents the home directory of the current user, and a path starting with ~ is interpreted relative to that home directory (see the sketch below).
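For instance, on macOS/Linux the same file could be read with a path anchored at the home directory (the exact path here is illustrative):

# "~" expands to the current user's home directory
titanic <- read.csv("~/Downloads/train.csv")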
getwd() # 1>

[1] "C:/Users/rache/Downloads"

titanic <- read.csv(file = "datasets_import/titanic.csv") # 2>

- 1> The getwd() function in R stands for "Get Working Directory". It returns the current working directory of your R session, which is the default folder where R looks for files to read and where it saves files if you don't specify a full path.
- 2> This line reads a CSV file using a relative path. The path provided is relative to the current working directory of the R session.
Be sure to eyeball your original data and imported data to ensure it is imported correctly. You might note there are many “blank” values.
By default, the blank values in character vectors are treated as a separate category.
# Generate frequency table for Cabin, including missing values
cabin_freq <- table(titanic$Cabin,
                    useNA = "ifany")

# Sort the frequency table in descending order
sorted_cabin_freq <- sort(x = cabin_freq,
                          decreasing = TRUE)

# Print the largest 5 categories
sorted_cabin_freq[1:5]

    B96 B98 C23 C25 C27 G6 C22 C26
687       4           4  4       3
- We can explicitly ask the software to treat these "blank values" as NA (missing values) by specifying the na.strings argument.

titanic <- read.csv(file = "datasets_import/titanic.csv",
                    na.strings = "") # can specify multiple values in na.strings if needed, e.g. 99999999 in some software
Excel Files
An Excel file is a spreadsheet format created by Microsoft Excel that stores tabular data, including text, numbers, and formulas, in a structured format. Unlike CSV files, Excel files can contain multiple sheets, rich formatting, and more complex data structures.
Key Features:
Structured Data: Excel files can store data in multiple sheets within a single file, allowing for complex data organization.
Rich Formatting: Unlike CSV files, Excel files support cell formatting, including font styles, colors, and borders.
Formulas and Functions: Excel supports formulas and functions to perform calculations and data analysis directly within the spreadsheet.
Data Validation and Charts: Excel files can include data validation rules and embedded charts for visual data representation.
Importing .xlsx files:
There are lots of functions to import .xlsx files. To import Excel files in R, you can use the readxl package or the openxlsx package. Below is an example using readxl:
# install.packages("readxl")
library(readxl)
# Import an Excel file and measure the time taken to import the file
start_time <- Sys.time()

Vending_Machine_data <- read_excel(path = "datasets_import/Vending_Machine_data.xlsx",
                                   sheet = "Vending_Machine_data",
                                   range = "A1:N37")

New names:
• `` -> `...14`

end_time <- Sys.time()

end_time - start_time # Time difference of 0.0800302 secs

Time difference of 0.2282751 secs
- The openxlsx package provides functions for reading and writing Excel files; it is optimized for handling larger datasets and offers additional features like formatting and writing to Excel.
# install.packages("openxlsx")
library(openxlsx)
?read.xlsx
starting httpd help server ... done
<- Sys.time()
start_time
<- read.xlsx(xlsxFile = "datasets_import/Vending_Machine_data.xlsx",
Vending_Machine_data sheet = "Vending_Machine_data"
)<- Sys.time()
end_time
- start_time # Time difference of 0.01190901 secs end_time
Time difference of 0.02591205 secs
DTA Files
A .dta file is a data format used by Stata, a statistical software package. These files store tabular data, including variables and observations, in a structured format that Stata can read and write.
Key Features:
- Structured Data: .dta files contain tabular data with columns (variables) and rows (observations).
- Compatibility: Designed for use with Stata software but can be read by other statistical packages.
- Metadata: May include variable labels and value labels in addition to the data itself.
Importing .dta Files
To import .dta files into R, you can use several packages, each with its own features. The most popular package is haven, which is part of the tidyverse and is specifically designed for reading and writing data from statistical software packages, including Stata.
# install.packages("haven")
# Load the haven library
library(haven)
<- read_dta(file = "datasets_import/ProbSet4trade.dta") data_haven
Basic data inspection (head, tail, summary)
head
Purpose: Displays the first few rows of a data frame, matrix, or vector. Useful for quickly viewing the beginning of a dataset.

?head
head(x = titanic[, 1:3]) # first three columns only
PassengerId Survived Pclass
1 1 0 3
2 2 1 1
3 3 1 3
4 4 1 1
5 5 0 3
6 6 0 3
head(x = titanic[,1:3], n=9L) # if you want more observations
PassengerId Survived Pclass
1 1 0 3
2 2 1 1
3 3 1 3
4 4 1 1
5 5 0 3
6 6 0 3
7 7 0 1
8 8 0 3
9 9 1 3
tail
Purpose: Displays the last few rows of a data frame, matrix, or vector. Useful for quickly viewing the end of a dataset.
tail(x= titanic[,1:3])
PassengerId Survived Pclass
886 886 0 3
887 887 0 2
888 888 1 1
889 889 0 3
890 890 1 1
891 891 0 3
tail(x = titanic[,1:3], n=9L) # if you want more observations
PassengerId Survived Pclass
883 883 0 3
884 884 0 2
885 885 0 3
886 886 0 3
887 887 0 2
888 888 1 1
889 889 0 3
890 890 1 1
891 891 0 3
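summary
Purpose: Produces summary statistics for each column of a data frame: for numeric columns the minimum, quartiles, mean, maximum, and a count of NA's; character columns report only their length and class. A minimal call (output not shown here) would be:

summary(titanic[, 1:3]) # summary statistics for the first three columns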
glimpse
The glimpse()
function from the dplyr
package is a powerful tool for inspecting the structure and content of a data frame or tibble in R.
Purpose: Provides a compact and transposed view of the data frame or tibble, showing column names, types, and a preview of the values.
library(dplyr)
Attaching package: 'dplyr'
The following objects are masked from 'package:stats':
filter, lag
The following objects are masked from 'package:base':
intersect, setdiff, setequal, union
dplyr::glimpse(titanic) # explicitly specifying the package name from which the glimpse function comes
Rows: 891
Columns: 12
$ PassengerId <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,…
$ Survived <int> 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1…
$ Pclass <int> 3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3, 2, 3, 3…
$ Name <chr> "Braund, Mr. Owen Harris", "Cumings, Mrs. John Bradley (Fl…
$ Sex <chr> "male", "female", "female", "female", "male", "male", "mal…
$ Age <dbl> 22, 38, 26, 35, 35, NA, 54, 2, 27, 14, 4, 58, 20, 39, 14, …
$ SibSp <int> 1, 1, 0, 1, 0, 0, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 4, 0, 1, 0…
$ Parch <int> 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1, 0, 0, 0…
$ Ticket <chr> "A/5 21171", "PC 17599", "STON/O2. 3101282", "113803", "37…
$ Fare <dbl> 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 8.4583, 51.8625,…
$ Cabin <chr> NA, "C85", NA, "C123", NA, NA, "E46", NA, NA, NA, "G6", "C…
$ Embarked <chr> "S", "C", "S", "S", "S", "Q", "S", "S", "S", "C", "S", "S"…
Visualization of missing values
- Identification: Detect missing values using functions like
is.na()
in R.
table(is.na(titanic$Cabin))
FALSE TRUE
204 687
- Instead of applying functions to each variable individually, we can apply functions from user-written packages to the entire dataset.
visdat
The vis_dat()
function from the visdat
package is a useful tool for visualizing missing values and the structure of your data. It provides a comprehensive view of the completeness of your data, which can help identify patterns of missingness and understand the structure of your dataset.
# install.packages("visdat")
library(visdat)
vis_dat(titanic)
- Does not work on large data sets
# Use tryCatch to handle errors and print the result
tryCatch({
  # Attempt to visualize data with vis_dat()
  vis_dat(data_haven)
}, error = function(e) {
  # Print the error message
  print(paste("An error occurred:", e$message))
})
[1] "An error occurred: Data exceeds recommended size for visualisation"
naniar
The naniar
package offers a suite of functions for visualizing and handling missing data. The gg_miss_upset()
function, for example, can visualize missing data patterns in a dataset.
# install.packages("naniar")
library(naniar)
gg_miss_upset(titanic)
VIM
The VIM package provides functions like aggr()
and matrixplot()
for visualizing missing data patterns. These functions help in understanding the distribution and patterns of missing values.
# install.packages("VIM")
library(VIM)
Loading required package: colorspace
Loading required package: grid
VIM is ready to use.
Suggestions and bug-reports can be submitted at: https://github.com/statistikat/VIM/issues
Attaching package: 'VIM'
The following object is masked from 'package:datasets':
sleep
aggr(titanic,
col = c('navyblue', 'orange'),
numbers = TRUE,
sortVars = TRUE
)
Variables sorted by number of missings:
Variable Count
Cabin 0.771043771
Age 0.198653199
Embarked 0.002244669
PassengerId 0.000000000
Survived 0.000000000
Pclass 0.000000000
Name 0.000000000
Sex 0.000000000
SibSp 0.000000000
Parch 0.000000000
Ticket 0.000000000
Fare 0.000000000
Amelia
The Amelia package offers the missmap()
function, which creates a graphical representation of missing data in a dataset.
# install.packages("Amelia")
library(Amelia)
Loading required package: Rcpp
##
## Amelia II: Multiple Imputation
## (Version 1.8.3, built: 2024-11-07)
## Copyright (C) 2005-2025 James Honaker, Gary King and Matthew Blackwell
## Refer to http://gking.harvard.edu/amelia/ for more information
##
missmap(titanic,
main = "Missing values map",
col = c("yellow", "black")
)
- Works on large datasets.
Amelia::missmap(data_haven)
plot_missing()
The plot_missing()
function from the DataExplorer
package provides a bar chart showing the proportion of missing values per column.
# install.packages("DataExplorer")
library(DataExplorer)
plot_missing(titanic)
Data Cleaning and Transformation
Data cleaning and transformation are crucial steps in the data analysis process. They ensure that the data is accurate, consistent, and formatted appropriately for analysis.
Handling missing values
Imputation: Fill in missing values with statistical measures (mean, median) or use imputation techniques.
Median Imputation:
- Replaces missing values with the median, robust to outliers.
table(is.na(titanic$Age))

FALSE  TRUE 
  714   177 

titanic$Age_imputed <- median(titanic$Age) # median() of a vector containing NA is itself NA
table(is.na(titanic$Age_imputed))

TRUE 
 891 

titanic$Age_imputed_median <- median(titanic$Age, na.rm = TRUE) # The na.rm = TRUE argument ensures that missing values are ignored when calculating the median
table(is.na(titanic$Age_imputed_median))

FALSE 
  891 
Mean Imputation:
- Replaces missing values with the average of the available data. Is fine when data is symmetric.
titanic$Age_imputed_mean <- mean(titanic$Age, na.rm = TRUE) # The na.rm = TRUE argument ensures that missing values are ignored when calculating the mean
Mode Imputation:
- Useful for categorical data, replaces missing values with the most common value.
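A minimal sketch for the Embarked column (the Embarked_imputed column name is just illustrative):

# Find the most frequent non-missing value (the mode) of Embarked
embarked_mode <- names(sort(table(titanic$Embarked), decreasing = TRUE))[1]

# Replace missing values with the mode
titanic$Embarked_imputed <- ifelse(is.na(titanic$Embarked),
                                   embarked_mode,
                                   titanic$Embarked)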
Predictive Modeling:
- Uses regression or other models to predict and fill in missing values.
You can predict missing values using other columns. For example, use a linear regression model to predict the missing values in Age
based on other features.
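A minimal sketch of that idea with lm() (the choice of predictors and the Age_imputed_lm name are illustrative; the course example below uses the mice package instead):

# Fit a linear model on the rows where Age is observed (lm drops NA rows by default)
age_model <- lm(Age ~ Pclass + SibSp + Parch + Fare, data = titanic)

# Predict Age only for the rows where it is missing
missing_age <- is.na(titanic$Age)
titanic$Age_imputed_lm <- titanic$Age
titanic$Age_imputed_lm[missing_age] <- predict(age_model,
                                               newdata = titanic[missing_age, ])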
# install.packages("mice")
library(mice)
# Perform multiple imputation on the titanic dataset
<- mice(titanic,
imputed_data method = 'pmm', # 1>
m = 5, # 2>
maxit = 50, # 3>
seed = 500 # 4>
)
# Extract the first complete dataset from the imputed data
<- complete(data = imputed_data,
completed_data action = 1 # 5>
)
# Assign the imputed Age values to a new column in the original dataset
$Age_imputed_pm <- completed_data$Age titanic
- 1> method = 'pmm': Use Predictive Mean Matching for imputing missing values.
- 2> m = 5: Generate 5 imputed datasets to account for uncertainty in imputations.
- 3> maxit = 50: Run the algorithm for a maximum of 50 iterations to refine imputations.
- 4> seed = 500: Set a random seed for reproducibility of results.
- 5> action = 1: In this case, 1 means the first imputed dataset out of the m datasets that mice() has generated.
KNN Imputation:
- Uses similar rows to fill in missing data.
# Perform KNN imputation
titanic2 <- VIM::kNN(data = titanic,
                     variable = c("Age", "Fare"),
                     k = 5)
- Removal: In some cases, it may be appropriate to remove rows or columns with excessive missing values.
colnames(titanic)
[1] "PassengerId" "Survived" "Pclass"
[4] "Name" "Sex" "Age"
[7] "SibSp" "Parch" "Ticket"
[10] "Fare" "Cabin" "Embarked"
[13] "Age_imputed" "Age_imputed_median" "Age_imputed_mean"
titanic$Cabin <- NULL
colnames(titanic)
[1] "PassengerId" "Survived" "Pclass"
[4] "Name" "Sex" "Age"
[7] "SibSp" "Parch" "Ticket"
[10] "Fare" "Embarked" "Age_imputed"
[13] "Age_imputed_median" "Age_imputed_mean"
Filtering, selecting, and mutating data using dplyr
When working with data in R, dplyr is a powerful package that makes data manipulation straightforward and efficient. Here’s an overview of how to filter, select, and mutate data using dplyr:
1. Filtering Data
The filter()
function is used to subset rows based on conditions.
- The basic syntax is
filter(data, condition)
.
library(dplyr)
# Filter rows where 'Fare' is greater than 10
filtered_data <- titanic |>
  filter(Fare > 10)
- Can combine multiple conditions.
# Filter rows where 'Fare' is greater than 10 and gender is male
filtered_data <- titanic %>%
  filter(Fare > 10, Sex == "male")
nrow(filtered_data)
[1] 305
ncol(filtered_data)
[1] 14
dim(filtered_data) # rows by columns
[1] 305 14
2. Selecting Columns
The select()
function is used to choose specific columns from a dataset.
- The basic syntax is
select(data, column1, column2, ...)
.
# Select only 'Survived','Fare' and 'Sex' columns
selected_data <- titanic |>
  select(Survived, Fare, Sex)
# nrow(selected_data)
# ncol(selected_data)
dim(selected_data)
[1] 891 3
3. Mutating Data
The mutate()
function is used to add new columns or modify existing ones.
- The basic syntax is mutate(data, new_column = expression).

# Add a new column 'Fare_by_Age' that divides Fare by the median-imputed age
titanic <- titanic %>%
  mutate(Fare_by_Age = Fare / Age_imputed_median)
Suppose you were interested in Average_Fare_by_Pclass, a column that contains the average fare for each passenger class. You can compute that as well.

?dplyr::group_by

titanic <- titanic |>
  group_by(Pclass) |>
  mutate(Average_Fare_by_Pclass = mean(Fare, na.rm = TRUE)) |>
  ungroup() # Remove the grouping structure from the dataset, returning it to a regular data frame
# Display unique values in the 'Average_Fare_by_Pclass' column
unique(titanic$Average_Fare_by_Pclass) # 1>
[1] 13.67555 84.15469 20.66218
# Show unique combinations of Pclass and Average_Fare_by_Pclass
unique_pclass_fares <- titanic %>%
  distinct(Pclass, Average_Fare_by_Pclass) # 2>
# Print the result
print(unique_pclass_fares)
# A tibble: 3 × 2
Pclass Average_Fare_by_Pclass
<int> <dbl>
1 3 13.7
2 1 84.2
3 2 20.7
- 1> unique(): Extracts unique values from a vector or column.
- 2> distinct(): Extracts unique rows based on specified columns from a data frame.
Basic Data Wrangling
Merging and joining data frames (adding columns)
In data analysis, combining data sets is often necessary to create a comprehensive view of the data. Merging and joining data frames are essential operations for integrating data sets based on common variables or keys.
Purpose: Combining two data frames into one based on shared column(s) or row names. This operation is similar to SQL joins.
Main Types of Merges (aka joins):
Inner Join: Includes only rows with keys that are present in both data frames.
- Commonly used.
Left Join: Includes all rows from the first data frame and matched rows from the second.
Right Join: Includes all rows from the second data frame and matched rows from the first.
- Symmetric case of left join.
Full Join: Includes all rows from both data frames, with NA for missing matches.
- Rarely used in practice.
Function: merge()
- Let's look at df1 - it has a unique ID.
df1
ID Name Income
1 1 Alice 50000
2 2 Bob 60000
3 3 Charlie 55000
4 4 David 70000
5 5 Eva 62000
unique(df1$ID)
[1] 1 2 3 4 5
# duplicated(df1$ID)
table(duplicated(df1$ID))
FALSE
5
- Let's look at df2 - it has a unique ID as well.

df2 # 1>
ID Age State
1 2 30 California
2 3 25 Texas
3 4 NA New York
4 NA 40 Florida
5 6 35 Nevada
unique(df2$ID)
[1] 2 3 4 NA 6
# duplicated(df2$ID)
table(duplicated(df2$ID))
FALSE
5
- 1> df2 has one missing ID. Typically we clean data frames and would have thrown away rows with duplicate IDs; observations with unknown IDs would usually have been thrown out as well.
- Since the ID is unique in both data sets, we can merge the two data frames based on the column ID.
# Inner Join (default)
inner_join_df <- merge(x = df1, y = df2, by = "ID") # 1>
inner_join_df # 2,3,4
ID Name Income Age State
1 2 Bob 60000 30 California
2 3 Charlie 55000 25 Texas
3 4 David 70000 NA New York
# Left Join
left_join_df <- merge(x = df1, y = df2, by = "ID", all.x = TRUE) # 2>
left_join_df # 1,2,3,4,5
ID Name Income Age State
1 1 Alice 50000 NA <NA>
2 2 Bob 60000 30 California
3 3 Charlie 55000 25 Texas
4 4 David 70000 NA New York
5 5 Eva 62000 NA <NA>
# Right Join
right_join_df <- merge(x = df1, y = df2, by = "ID", all.y = TRUE) # 3>
right_join_df # 2,3,4,6,NA
ID Name Income Age State
1 2 Bob 60000 30 California
2 3 Charlie 55000 25 Texas
3 4 David 70000 NA New York
4 6 <NA> NA 35 Nevada
5 NA <NA> NA 40 Florida
# Full Join
full_join_df <- merge(x = df1, y = df2, by = "ID", all = TRUE) # 4>
full_join_df # 1,2,3,4,5,6,NA
ID Name Income Age State
1 1 Alice 50000 NA <NA>
2 2 Bob 60000 30 California
3 3 Charlie 55000 25 Texas
4 4 David 70000 NA New York
5 5 Eva 62000 NA <NA>
6 6 <NA> NA 35 Nevada
7 NA <NA> NA 40 Florida
- 1> Inner join keeps observations that are common to both datasets. ID: 2, 3, 4
- 2> Left join keeps all observations in df1 and only the observations in df2 that merge with df1. ID: 1, 2, 3, 4, 5
- 3> Right join keeps all observations in df2 and only the observations in df1 that merge with df2. ID: 2, 3, 4, 6, NA
- 4> Full join is often not useful. ID: 1, 2, 3, 4, 5, 6, NA
- Think about this as getting more controls (columns) for your data. A dplyr equivalent is sketched below.
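For reference, dplyr offers equivalent join verbs; a minimal sketch (assuming dplyr is loaded) that should reproduce the merge() results above:

library(dplyr)

inner_join(x = df1, y = df2, by = "ID") # like merge(df1, df2, by = "ID")
left_join(x = df1, y = df2, by = "ID")  # like all.x = TRUE
right_join(x = df1, y = df2, by = "ID") # like all.y = TRUE
full_join(x = df1, y = df2, by = "ID")  # like all = TRUE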
Appending data frames (adding rows)
Now, let's say our df1 stays the same but we collect data on more people (more rows, same columns). Our new data frame df3 is different from df2:
df1
ID Name Income
1 1 Alice 50000
2 2 Bob 60000
3 3 Charlie 55000
4 4 David 70000
5 5 Eva 62000
df3
ID Name Income
1 6 Frank 48000
2 7 Grace 55000
3 8 Hannah 62000
4 9 Ivy 70000
5 10 Jack 58000
To combine these data frames, use the rbind() function.
- rbind(df1, df3): Combines df1 and df3 by appending the rows of df3 to df1.
- This operation assumes that both data frames have the same column names and order (see the bind_rows() sketch at the end of this subsection).
# Appending df3 to df1
appended_df <- rbind(df1, df3)

# Display the result
appended_df
ID Name Income
1 1 Alice 50000
2 2 Bob 60000
3 3 Charlie 55000
4 4 David 70000
5 5 Eva 62000
6 6 Frank 48000
7 7 Grace 55000
8 8 Hannah 62000
9 9 Ivy 70000
10 10 Jack 58000
Result
The appended_df
data frame will have the combined rows from df1
and df3
, containing all the unique people from both data frames.
- Think about this as more people are given the same survey.
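If df3 had extra or missing columns, rbind() would fail. A common alternative is dplyr::bind_rows(), which matches columns by name and fills any unmatched columns with NA. A minimal sketch:

library(dplyr)

# bind_rows matches columns by name, so the column sets need not be identical
appended_df2 <- bind_rows(df1, df3)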
Reshaping data (wide to long format)
reshape2
is an R
package written by Hadley Wickham that makes it easy to transform data between wide and long formats.
What makes data wide or long?
Wide data has a column for each variable. For example, this is wide-format data:
## ozone wind temp
## 1 23.61538 11.622581 65.54839
## 2 29.44444 10.266667 79.10000
## 3 59.11538 8.941935 83.90323
## 4 59.96154 8.793548 83.96774
And this is long-format data:
## No id variables; using all as measure variables
## variable value
## 1 ozone 23.615385
## 2 ozone 29.444444
## 3 ozone 59.115385
## 4 ozone 59.961538
## 5 wind 11.622581
## 6 wind 10.266667
## 7 wind 8.941935
## 8 wind 8.793548
## 9 temp 65.548387
## 10 temp 79.100000
## 11 temp 83.903226
## 12 temp 83.967742
Long-format data has a column for possible variable types and a column for the values of those variables.
Long-format data isn’t necessarily only two columns. For example, we might have ozone measurements for each day of the year. In that case, we could have another column for day. In other words, there are different levels of “longness”.
The ultimate shape you want to get your data into will depend on what you are doing with it.
It turns out that you need wide-format data for some types of data analysis and long-format data for others. In reality, you need long-format data much more commonly than wide-format data. For example, ggplot2 requires long-format data (technically tidy data), plyr requires long-format data, and most modelling functions (such as lm(), glm(), and gam()) require long-format data.
- But people often find it easier to record their data in wide format.
The reshape2 package
reshape2 is based around two key functions: melt and cast:

- melt takes wide-format data and melts it into long-format data.
- cast takes long-format data and casts it into wide-format data.
Think of working with metal: if you melt metal, it drips and becomes long. If you cast it into a mould, it becomes wide.
Wide- to long-format data: the melt
function
For this example we’ll work with the airquality
dataset that is built into R
.
First we’ll change the column names to lower case to make them easier to work with. Then we’ll look at the data:
names(airquality) <- tolower(names(airquality))
head(airquality)
ozone solar.r wind temp month day
1 41 190 7.4 67 5 1
2 36 118 8.0 72 5 2
3 12 149 12.6 74 5 3
4 18 313 11.5 62 5 4
5 NA NA 14.3 56 5 5
6 28 NA 14.9 66 5 6
What happens if we run the function melt
with all the default argument values?
# install.packages("reshape2")
library(reshape2)
aql <- melt(airquality) # [a]ir [q]uality [l]ong format
No id variables; using all as measure variables
head(aql)
variable value
1 ozone 41
2 ozone 36
3 ozone 12
4 ozone 18
5 ozone NA
6 ozone 28
tail(aql)
variable value
913 day 25
914 day 26
915 day 27
916 day 28
917 day 29
918 day 30
By default, melt
has assumed that all columns with numeric values are variables with values. Often this is what you want. Maybe here we want to know the values of ozone
, solar.r
, wind
, and temp
for each month
and day
. We can do that with melt
by telling it that we want month
and day
to be “ID variables”. ID variables are the variables that identify individual rows of data.
aql <- melt(airquality, id.vars = c("month", "day"))
head(aql)
month day variable value
1 5 1 ozone 41
2 5 2 ozone 36
3 5 3 ozone 12
4 5 4 ozone 18
5 5 5 ozone NA
6 5 6 ozone 28
What if we wanted to control the column names in our long-format data? melt lets us set those too, all in one step:

aql <- melt(airquality, id.vars = c("month", "day"),
            variable.name = "climate_variable",
            value.name = "climate_value")
head(aql)
month day climate_variable climate_value
1 5 1 ozone 41
2 5 2 ozone 36
3 5 3 ozone 12
4 5 4 ozone 18
5 5 5 ozone NA
6 5 6 ozone 28
Long- to wide-format data: the cast functions
Whereas going from wide- to long-format data is pretty straightforward, going from long- to wide-format data can take a bit more thought. It usually involves some head scratching and some trial and error for all but the simplest cases. Let’s go through some examples.
In reshape2
there are multiple cast
functions. Since you will most commonly work with data.frame
objects, we’ll explore the dcast
function. (There is also acast
to return a vector, matrix, or array.)
Let’s take the long-format airquality
data and cast
it into some different wide formats. To start with, we’ll recover the same format we started with and compare the two.
dcast
uses a formula to describe the shape of the data. The arguments on the left refer to the ID variables and the arguments on the right refer to the measured variables. Coming up with the right formula can take some trial and error at first. So, if you’re stuck don’t feel bad about just experimenting with formulas. There are usually only so many ways you can write the formula.
Here, we need to tell dcast
that month
and day
are the ID variables (we want a column for each) and that variable
describes the measured variables. Since there is only one remaining column, dcast
will figure out that it contains the values themselves. We could explicitly declare this with value.var
. (And in some cases it will be necessary to do so.)
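For instance, declaring the value column explicitly would look like this (equivalent to the implicit call below):

aqw <- dcast(aql, month + day ~ variable, value.var = "value")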
aql <- melt(airquality, id.vars = c("month", "day"))
aqw <- dcast(aql, month + day ~ variable)
head(aqw)
month day ozone solar.r wind temp
1 5 1 41 190 7.4 67
2 5 2 36 118 8.0 72
3 5 3 12 149 12.6 74
4 5 4 18 313 11.5 62
5 5 5 NA NA 14.3 56
6 5 6 28 NA 14.9 66
head(airquality) # original data
ozone solar.r wind temp month day
1 41 190 7.4 67 5 1
2 36 118 8.0 72 5 2
3 12 149 12.6 74 5 3
4 18 313 11.5 62 5 4
5 NA NA 14.3 56 5 5
6 28 NA 14.9 66 5 6
So, besides re-arranging the columns, we’ve recovered our original data.
If it isn’t clear to you what just happened there, then have a look at this illustration:
Figure 1: An illustration of the dcast
function. The blue shading indicates ID variables that we want to represent individual rows. The red shading represents variable names that we want to swing into column names. The grey shading represents the data values that we want to fill in the cells with.
One confusing “mistake” you might make is casting a dataset in which there is more than one value per data cell. For example, this time we won’t include day
as an ID variable:
dcast(aql, month ~ variable)
Aggregation function missing: defaulting to length
month ozone solar.r wind temp
1 5 31 31 31 31
2 6 30 30 30 30
3 7 31 31 31 31
4 8 31 31 31 31
5 9 30 30 30 30
When you run this in R
, you’ll notice the warning message:
# Aggregation function missing: defaulting to length
And if you look at the output, the cells are filled with the number of data rows for each month-climate combination. The numbers we’re seeing are the number of days recorded in each month. When you cast
your data and there are multiple values per cell, you also need to tell dcast
how to aggregate the data. For example, maybe you want to take the mean
, or the median
, or the sum
. Let’s try the last example, but this time we’ll take the mean of the climate values. We’ll also pass the option na.rm = TRUE
through the ...
argument to remove NA
values. (The ...
let’s you pass on additional arguments to your fun.aggregate
function, here mean
.)
dcast(aql, month ~ variable, fun.aggregate = mean,
na.rm = TRUE)
month ozone solar.r wind temp
1 5 23.61538 181.2963 11.622581 65.54839
2 6 29.44444 190.1667 10.266667 79.10000
3 7 59.11538 216.4839 8.941935 83.90323
4 8 59.96154 171.8571 8.793548 83.96774
5 9 31.44828 167.4333 10.180000 76.90000
Unlike melt, there are some other fancy things you can do with dcast that are not covered here. It's worth reading the help file ?dcast. For example, you can compute summaries for rows and columns, subset the columns, and fill in missing cells in one call to dcast, as sketched below.
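As a sketch of two of those options (the fill value of 0 is arbitrary here): margins adds "(all)" summary rows/columns, and fill supplies a value for cells with no data.

dcast(aql, month ~ variable, fun.aggregate = mean, na.rm = TRUE,
      margins = "month", # adds an "(all)" row summarising across all months
      fill = 0)          # value to use for any empty cells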
Appendix
Exporting Data from R (not covered, but see the sketch at the end of this appendix)
Reshaping commands on the airquality data - Example's source
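Exporting is essentially the mirror image of importing; a minimal sketch (file names here are illustrative):

# Write a data frame to a CSV file; row.names = FALSE drops the row-number column
write.csv(titanic, file = "titanic_clean.csv", row.names = FALSE)

# Write a data frame to Stata's .dta format with haven
haven::write_dta(data = data_haven, path = "ProbSet4trade_copy.dta")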