Data from an excel spreadsheet often needs to be reshaped and cleaned for efficient use in Tableau. Tableau needs data that has a database structure, where each unique record is in one row (long, narrow format), without duplicates or missing data. Data in a spreadsheet is ment to be human-friendly, and is by nature not always condusive in format for statistical analysis or visualisation.
The original data transformed by the code below is in one spreadsheet, contains multiple tables for different groups, each table having it’s own headers and being seperated by empty lines.
Copies of the data are made throughout so that modifications do not occur “in-place”.
This section sets the environment for data cleaning and reshaping from excel to that needed for Tableau. The working directory is set as the same directory in which the data is saved. THe R packges needed for this wrangling are listed in the vector called packages and should be installed if the call to library(package) returns an error. This can be done by uncommenting the line for which the package is referenced.
setwd("/Users/tina/Projects/Infermadiks_DLA_data_vis")
#packages <- c("knitr", "readxl", "reshape2", "WriteXLS", "stringi")
#install.packages("knitr", repos='http://cran.us.r-project.org')
library("knitr")
#install.packages("readxl", repos='http://cran.us.r-project.org')
library("readxl")
#install.packages("reshape2", repos='http://cran.us.r-project.org')
library("reshape2")
#install.packages("WriteXLS", repos='http://cran.us.r-project.org')
library("WriteXLS")
#install.packages("stringi")
library("stringi")
A character vector of column names is created for the data, the data is read into R from the Excel format, using the character vector as column names and skipping empty data rows.
#Character vector for column names*
names = c("name", "theme1", "theme2", "theme3", "theme4", "theme5", "noData")
#Use read_excel to read the data into a data frame in R:
raw_data <- read_excel("DLA.xlsx", col_names = names, skip = 2)
#removing last column of no data
data <- raw_data[,1:6]
Removing sensitive information from data sets minimises the risk of such information being shared publically. While full names are not given in the data, first names are and these are replaced with “ID+random string of length3” using a regular expression:
random_string_names <- paste("ID", stri_rand_strings(n=nrow(data), length=3, pattern="[A-Za-z0-9]"), sep="")
renamed <- unlist(random_string_names)
data$name <- renamed
data_copy <- data
head(data_copy, 3)
## name theme1 theme2 theme3 theme4 theme5
## 1 IDnfK Theme 1 Theme 2 Theme 3 Theme 4 Theme 5
## 2 IDbdS Restorative Learner Connectedness Relator Includer
## 3 IDUYR Restorative Developer Consistency Deliberative Discipline
There are rows of data that have no values in them (Na’s). These resulted from the empty rows in the original excel spread sheet and need to be removed.
rows_to_remove <- which(complete.cases(data_copy) == FALSE)
test_remove_Na_rows <- data_copy[-rows_to_remove,]
data_needed <- test_remove_Na_rows
head(data_needed, 3)
## name theme1 theme2 theme3 theme4 theme5
## 1 IDnfK Theme 1 Theme 2 Theme 3 Theme 4 Theme 5
## 2 IDbdS Restorative Learner Connectedness Relator Includer
## 3 IDUYR Restorative Developer Consistency Deliberative Discipline
A list of data frames for each group is created for ease of further wrangling. The top row of each smaller data frame will contain the original headings from the excel data groups and needs to be removed. After that, the data is reshaped into long format, where the values under the different “theme” columns are melted together into one column and a corresponding theme number column is also created.
rows_to_split_on <- which(data_needed$theme1 == "Theme 1")
data_split <- split(data_needed, cumsum(1:nrow(data_needed) %in% rows_to_split_on))
##Removing first row of data from each small dataframe and making a new list of df's from that:
data_split_top_rows_removed <- data_split
for(i in (1:length(data_split))) {
data_split_top_rows_removed[[i]] <- data_split[[i]][2:nrow(data_split[[i]]),]
}
##Reshaping the data into long format
melt_data <- lapply(data_split_top_rows_removed, melt, id = "name")
#Checking the melt with the first and last three rows of the first and last data frame
head(melt_data[[1]], 3)
## name variable value
## 1 IDbdS theme1 Restorative
## 2 IDUYR theme1 Restorative
## 3 IDGc6 theme1 Belief
tail(melt_data[[1]], 3)
## name variable value
## 23 IDGc6 theme5 Relator
## 24 IDdg1 theme5 Responsibility
## 25 IDMp5 theme5 Achiever
A column with group number is added to the data frame and the domains to which the strengths are mapped are added as well.
group_names <- list()
group_names <- paste("group", 1:10, sep="")
str(group_names)
## chr [1:10] "group1" "group2" "group3" "group4" "group5" ...
melt_data_with_group_numbers <- melt_data
for (i in 1:10){
melt_data_with_group_numbers[[i]]$group <- group_names[i]
}
df_for_groups <- melt_data_with_group_numbers
head(df_for_groups[[1]], 3)
## name variable value group
## 1 IDbdS theme1 Restorative group1
## 2 IDUYR theme1 Restorative group1
## 3 IDGc6 theme1 Belief group1
tail(df_for_groups[[10]], 3)
## name variable value group
## 8 IDmPr theme4 Activator group10
## 9 IDTfu theme5 Responsibility group10
## 10 IDmPr theme5 Positivity group10
All this is in a loop to go through each data frame in the list:
for (i in 1:length(df_for_groups)){
#converting the variable column entries to characters entries:
df_for_groups[[i]]$variable <- as.character(df_for_groups[[i]]$variable)
#renaming theme(x) in variable to appropriate domain using category in value column
Execution <- c("Achiever", "Arranger", "Belief", "Consistency", "Deliberative", "Discipline", "Focus", "Responsibility", "Restorative")
Strategic <- c("Analytical", "Context", "Futuristic", "Ideation", "Input", "Intellection", "Learner", "Strategic")
Influencing <- c("Activator","Command", "Communication", "Competition", "Maximizer", "Self-Assurance", "Significance", "Woo")
Relationship <- c("Adaptability", "Developer", "Connectedness", "Empathy", "Harmony", "Includer", "Individualization", "Positivity", "Relator")
Execution_rows <- which(df_for_groups[[i]]$value %in% Execution)
Strategic_rows <- which(df_for_groups[[i]]$value %in% Strategic)
Influencing_rows <- which(df_for_groups[[i]]$value %in% Influencing)
Relationship_rows <- which(df_for_groups[[i]]$value %in% Relationship)
#replacing the "theme(x)" entries with donamins:
df_for_groups[[i]]$variable[Execution_rows] <- "Execution"
df_for_groups[[i]]$variable[Strategic_rows] <- "Strategic"
df_for_groups[[i]]$variable[Influencing_rows] <- "Influencing"
df_for_groups[[i]]$variable[Relationship_rows] <- "Relationship"
}
head(df_for_groups[[1]], 3)
## name variable value group
## 1 IDbdS Execution Restorative group1
## 2 IDUYR Execution Restorative group1
## 3 IDGc6 Execution Belief group1
tail(df_for_groups[[1]], 3)
## name variable value group
## 23 IDGc6 Relationship Relator group1
## 24 IDdg1 Execution Responsibility group1
## 25 IDMp5 Execution Achiever group1
The subsetted data frames are recombined into one and then exported back into .xls format. For ease of use in Tableau visualisations, the column headings and row names are not exported.
data_tableau_ready <- do.call("rbind", df_for_groups)
head(data_tableau_ready)
## name variable value group
## 1.1 IDbdS Execution Restorative group1
## 1.2 IDUYR Execution Restorative group1
## 1.3 IDGc6 Execution Belief group1
## 1.4 IDdg1 Execution Achiever group1
## 1.5 IDMp5 Strategic Futuristic group1
## 1.6 IDbdS Strategic Learner group1
WriteXLS(data_tableau_ready, ExcelFileName = "DLA_Tableau_ready.xls", row.names = FALSE, col.names = FALSE)