Data_cleaning_Reshaping_from_Excel_to

Introduction

Data from an excel spreadsheet often needs to be reshaped and cleaned for efficient use in Tableau. Tableau needs data that has a database structure, where each unique record is in one row (long, narrow format), without duplicates or missing data. Data in a spreadsheet is ment to be human-friendly, and is by nature not always condusive in format for statistical analysis or visualisation.

The original data transformed by the code below is in one spreadsheet, contains multiple tables for different groups, each table having it’s own headers and being seperated by empty lines.

Copies of the data are made throughout so that modifications do not occur “in-place”.

Setting working directory and installing packages needed:

This section sets the environment for data cleaning and reshaping from excel to that needed for Tableau. The working directory is set as the same directory in which the data is saved. THe R packges needed for this wrangling are listed in the vector called packages and should be installed if the call to library(package) returns an error. This can be done by uncommenting the line for which the package is referenced.

setwd("/Users/tina/Projects/Infermadiks_DLA_data_vis")

#packages <- c("knitr", "readxl", "reshape2", "WriteXLS", "stringi")
#install.packages("knitr", repos='http://cran.us.r-project.org')
library("knitr")
#install.packages("readxl", repos='http://cran.us.r-project.org')
library("readxl")
#install.packages("reshape2", repos='http://cran.us.r-project.org')
library("reshape2")
#install.packages("WriteXLS", repos='http://cran.us.r-project.org')
library("WriteXLS")
#install.packages("stringi")
library("stringi")

Getting initial data

A character vector of column names is created for the data, the data is read into R from the Excel format, using the character vector as column names and skipping empty data rows.

#Character vector for column names*
names = c("name", "theme1", "theme2", "theme3", "theme4", "theme5", "noData")
#Use read_excel to read the data into a data frame in R:
raw_data <- read_excel("DLA.xlsx", col_names = names, skip = 2)
#removing last column of no data
data <- raw_data[,1:6]

Cleaning data

Removing sensitive information from data sets minimises the risk of such information being shared publically. While full names are not given in the data, first names are and these are replaced with “ID+random string of length3” using a regular expression:

random_string_names <- paste("ID", stri_rand_strings(n=nrow(data), length=3, pattern="[A-Za-z0-9]"), sep="")
renamed <- unlist(random_string_names)
data$name <- renamed
data_copy <- data
head(data_copy, 3)

##    name      theme1    theme2        theme3       theme4     theme5
## 1 IDnfK     Theme 1   Theme 2       Theme 3      Theme 4    Theme 5
## 2 IDbdS Restorative   Learner Connectedness      Relator   Includer
## 3 IDUYR Restorative Developer   Consistency Deliberative Discipline

Removing rows with no values:

There are rows of data that have no values in them (Na’s). These resulted from the empty rows in the original excel spread sheet and need to be removed.

rows_to_remove <- which(complete.cases(data_copy) == FALSE)
test_remove_Na_rows <- data_copy[-rows_to_remove,]
data_needed <- test_remove_Na_rows
head(data_needed, 3)

##    name      theme1    theme2        theme3       theme4     theme5
## 1 IDnfK     Theme 1   Theme 2       Theme 3      Theme 4    Theme 5
## 2 IDbdS Restorative   Learner Connectedness      Relator   Includer
## 3 IDUYR Restorative Developer   Consistency Deliberative Discipline

Spliting the data.

A list of data frames for each group is created for ease of further wrangling. The top row of each smaller data frame will contain the original headings from the excel data groups and needs to be removed. After that, the data is reshaped into long format, where the values under the different “theme” columns are melted together into one column and a corresponding theme number column is also created.

rows_to_split_on <- which(data_needed$theme1 == "Theme 1")
data_split <- split(data_needed, cumsum(1:nrow(data_needed) %in% rows_to_split_on))

##Removing first row of data from each small dataframe and making a new list of df's from that:
data_split_top_rows_removed <- data_split
for(i in (1:length(data_split))) {
    data_split_top_rows_removed[[i]] <- data_split[[i]][2:nrow(data_split[[i]]),]
    }

##Reshaping the data into long format
melt_data <- lapply(data_split_top_rows_removed, melt, id = "name")
#Checking the melt with the first and last three rows of the first and last data frame
head(melt_data[[1]], 3)

##    name variable       value
## 1 IDbdS   theme1 Restorative
## 2 IDUYR   theme1 Restorative
## 3 IDGc6   theme1      Belief

tail(melt_data[[1]], 3)

##     name variable          value
## 23 IDGc6   theme5        Relator
## 24 IDdg1   theme5 Responsibility
## 25 IDMp5   theme5       Achiever

Final formating

A column with group number is added to the data frame and the domains to which the strengths are mapped are added as well.

creating a vector of group names for each data frame:

group_names <- list()
group_names <- paste("group", 1:10, sep="")
str(group_names)

##  chr [1:10] "group1" "group2" "group3" "group4" "group5" ...

Adding these to each data frame in the list of data frames:

melt_data_with_group_numbers <- melt_data
for (i in 1:10){
     melt_data_with_group_numbers[[i]]$group <- group_names[i]
    }
df_for_groups <- melt_data_with_group_numbers
head(df_for_groups[[1]], 3)

##    name variable       value  group
## 1 IDbdS   theme1 Restorative group1
## 2 IDUYR   theme1 Restorative group1
## 3 IDGc6   theme1      Belief group1

tail(df_for_groups[[10]], 3)

##     name variable          value   group
## 8  IDmPr   theme4      Activator group10
## 9  IDTfu   theme5 Responsibility group10
## 10 IDmPr   theme5     Positivity group10

Adding the domains to the data that the strengths themes correspond with:

All this is in a loop to go through each data frame in the list:

for (i in 1:length(df_for_groups)){
    #converting the variable column entries to characters entries:
    df_for_groups[[i]]$variable <- as.character(df_for_groups[[i]]$variable)
    #renaming theme(x) in variable to appropriate domain using category in value column
    Execution <- c("Achiever", "Arranger", "Belief", "Consistency", "Deliberative", "Discipline", "Focus", "Responsibility", "Restorative")
    Strategic <- c("Analytical", "Context", "Futuristic", "Ideation", "Input", "Intellection", "Learner", "Strategic")
    Influencing <- c("Activator","Command", "Communication", "Competition", "Maximizer", "Self-Assurance", "Significance", "Woo")
    Relationship <- c("Adaptability", "Developer", "Connectedness", "Empathy", "Harmony", "Includer", "Individualization", "Positivity", "Relator")

    Execution_rows <- which(df_for_groups[[i]]$value %in% Execution)
    Strategic_rows <- which(df_for_groups[[i]]$value %in% Strategic)
    Influencing_rows <- which(df_for_groups[[i]]$value %in% Influencing)
    Relationship_rows <- which(df_for_groups[[i]]$value %in% Relationship)

    #replacing the "theme(x)" entries with donamins:
    df_for_groups[[i]]$variable[Execution_rows] <- "Execution"
    df_for_groups[[i]]$variable[Strategic_rows] <- "Strategic"
    df_for_groups[[i]]$variable[Influencing_rows] <- "Influencing"
    df_for_groups[[i]]$variable[Relationship_rows] <- "Relationship"
}
head(df_for_groups[[1]], 3)

##    name  variable       value  group
## 1 IDbdS Execution Restorative group1
## 2 IDUYR Execution Restorative group1
## 3 IDGc6 Execution      Belief group1

tail(df_for_groups[[1]], 3)

##     name     variable          value  group
## 23 IDGc6 Relationship        Relator group1
## 24 IDdg1    Execution Responsibility group1
## 25 IDMp5    Execution       Achiever group1

Final formatting and exporting:

The subsetted data frames are recombined into one and then exported back into .xls format. For ease of use in Tableau visualisations, the column headings and row names are not exported.

data_tableau_ready <- do.call("rbind", df_for_groups)
head(data_tableau_ready)

##      name  variable       value  group
## 1.1 IDbdS Execution Restorative group1
## 1.2 IDUYR Execution Restorative group1
## 1.3 IDGc6 Execution      Belief group1
## 1.4 IDdg1 Execution    Achiever group1
## 1.5 IDMp5 Strategic  Futuristic group1
## 1.6 IDbdS Strategic     Learner group1

WriteXLS(data_tableau_ready, ExcelFileName = "DLA_Tableau_ready.xls", row.names = FALSE, col.names = FALSE)

Data_cleaning_Reshaping_from_Excel_to_Tableau

Christiane Heiligers

July 11, 2016