Melting a Data.frame with dplyr & tidyr

Elongating a Dataset using methods from dplyr & tidyr

Setting the environment

These are the R libraries we will need for this demo:

library( magrittr )
library( dplyr )
library( tidyr )
library( ggplot2 )

Creating the data:

This demo will use data that summarizes Gender Education Attainment in a table with a double-layered header. This assigment calls for the data to be formatted as a .csv file, so in this section we will write the data as input to a .csv file:

eduDATA_df <- rbind(c(NA,'Male', 'Male', 'Male', 'Male', 'Male', 'Female', 'Female', 'Female', 'Female', 'Female'),
                    c(NA, 'No HS', 'HS', 'Associate', "Bachelor's", 'Graduate', 'No HS', 'HS', 'Associate', "Bachelor's", 'Graduate' ),
                    c('T066_008', 'T066_009', 'T066_010', 'T066_011', 'T066_012', 'T066_013', 'T066_014', 'T066_015', 'T066_016', 'T066_017', 'T066_018'),
                    c('Northeast Region', 24801, 36714, 44765, 65177, 89449, 16523, 24251, 31455, 45593, 61203),
                    c('Midwest Region', 23530, 33606, 41180, 59243, 77325, 15471, 21713, 26944, 39534, 54246),
                    c('South Region', 22065, 31552, 40097, 59711, 81744, 14659, 21530, 27243, 41210, 53100),
                    c('West Region', 22554, 32289, 42127, 62069, 86476, 15250, 23230, 29402, 42527, 57915 ))

#write the table to a .csv file
#substitute your filepath below to save to your local machine
write.table( eduDATA_df,file='/home/bonzilla/Desktop/MSDS2020/DATA607_DataAcquMan/eduDATA_df.csv', sep = ",", col.names=F, row.names=F)

Read .csv file into R

The .csv file that was generated in the previous block of code was uploaded to the author’s github accout. We can read the .csv into an R data.frame via the URL link to the raw data:

#the url for the .csv file
URL <-'https://raw.githubusercontent.com/SmilodonCub/DATA607/master/eduDATA_df.csv'
#read to an R data.frame with read.csv().
eduDATA_df <- read.csv( URL ,stringsAsFactors = FALSE )
eduDATA_df

##                NA.     Male   Male.1    Male.2     Male.3   Male.4   Female
## 1             <NA>    No HS       HS Associate Bachelor's Graduate    No HS
## 2         T066_008 T066_009 T066_010  T066_011   T066_012 T066_013 T066_014
## 3 Northeast Region    24801    36714     44765      65177    89449    16523
## 4   Midwest Region    23530    33606     41180      59243    77325    15471
## 5     South Region    22065    31552     40097      59711    81744    14659
## 6      West Region    22554    32289     42127      62069    86476    15250
##   Female.1  Female.2   Female.3 Female.4
## 1       HS Associate Bachelor's Graduate
## 2 T066_015  T066_016   T066_017 T066_018
## 3    24251     31455      45593    61203
## 4    21713     26944      39534    54246
## 5    21530     27243      41210    53100
## 6    23230     29402      42527    57915

Tidy the data.frame with tidyr & dplyr

This is a very small data.frame, but it can be used to illustrate a big point: tidy data facilitate visualization & downstream analysis. Tidying, or wrangling the data is often necessary to perform before more glamorous techniques (e.g. modeling) can be applied to a dataset. We will only be performing some brief analysis and visualization of this data.frame, but the main point is to observe how easy these later steps occur if we take the time to properly format the data.frame.
In this demo, the ‘eduDATA_df’ data.frame will be changed from the present ‘wide’ version into a longer, ‘melted’ version; it will hold the same information, but will be rearranged to facilitate visualization and analysis.

Let’s start cleaning!

eduDATA_df <- eduDATA_df %>% 
    #there are uninformative text labels in row2: 
    #remove them with tidyverse methods
    filter( row_number() != 2L ) %>%
    #rename the columns with dplyr rename()
    rename( Region = NA., Male.NoHS = Male, Male.HS = Male.1, Male.Associate = Male.2, Male.Bachelors = Male.3, Male.Graduate = Male.4, Female.NoHS = Female, Female.HS = Female.1, Female.Associate = Female.2, Female.Bachelors = Female.3, Female.Graduate = Female.4) %>%
    #remove row 2
    filter( row_number() != 1L ) %>%
    #pivot_long() to give each observation it's own row in the data.frame
    pivot_longer( cols = Male.NoHS:Female.Graduate, names_to = 'Category', values_to = 'Total') %>%
    #mutate Total to numeric
    mutate( Total = as.numeric( Total ))

head( eduDATA_df )

## # A tibble: 6 x 3
##   Region           Category       Total
##   <chr>            <chr>          <dbl>
## 1 Northeast Region Male.NoHS      24801
## 2 Northeast Region Male.HS        36714
## 3 Northeast Region Male.Associate 44765
## 4 Northeast Region Male.Bachelors 65177
## 5 Northeast Region Male.Graduate  89449
## 6 Northeast Region Female.NoHS    16523

The data.frame has been transformed from a wide format with multiple observations in a single row, to a much longer version. Now each value from the data.frame has it’s own row. However, in the process of simplifying the double header, two features of the data were combined into one column, ‘Category’.
Our next step is to seperate the label information for Gender & Level of Education into seperate columns. We will do this using tidyr seperate() method:

eduDATA_df <- eduDATA_df %>% separate( Category,  c("Gender", "eduLevel"))
head( eduDATA_df )

## # A tibble: 6 x 4
##   Region           Gender eduLevel  Total
##   <chr>            <chr>  <chr>     <dbl>
## 1 Northeast Region Male   NoHS      24801
## 2 Northeast Region Male   HS        36714
## 3 Northeast Region Male   Associate 44765
## 4 Northeast Region Male   Bachelors 65177
## 5 Northeast Region Male   Graduate  89449
## 6 Northeast Region Female NoHS      16523

Excellent! Now the data.frame is organized in a format that wil facilitate use with functions such as ggplot2 plotting methods. We can now begin to tackle the analysis questions that were mentioned in the class discussion associated with this dataset.

Question 1: Which region appears to have the highest proportions of HS graduates as the highest educational attainment?

Visualize the data in a way that best addresses this question:

#This question does not involves the 'Gender' feature, so we will aggreggate the data by Region and eduLevel to simplify our visualization
edu_byRegion <- eduDATA_df %>% 
    #group_by() to aggregate Region & eduLevel data
    group_by( Region, eduLevel ) %>%
    #find the total of Male + Female incidences
    summarise( Total = sum( Total )) %>%
    #mutate() to add a column with proportions
    mutate( proportion = Total/sum( Total ))

# Grouped
ggplot(edu_byRegion, aes(fill=Region, y=proportion, x=eduLevel)) + 
    geom_bar(position="dodge", stat="identity") +
    ggtitle('Proportion Education Level') +
    xlab( 'Education Level' ) +
    ylab( 'Proportion')

From the figure above, we can see that the proportions per education level are comparable between each region. To address the Question: The Midwest Region has the highest proportion of people with ‘High School’ as their highest educational achievement.

Question 2:Which education level has the highest discrepancy between Males and Females?

(NOTE: This question was not mentioned in the DATA607 discussion board)

Visualize the data in a way that best addresses this question:

#This question does not involves the 'Gender' feature, so we will aggreggate the data by Region and eduLevel to simplify our visualization
edu_byGender <- eduDATA_df %>% 
    #group_by() to aggregate Region & eduLevel data
    group_by( Gender, eduLevel ) %>%
    #find the total of Male + Female incidences
    summarise( Total = sum( Total )) %>%
    #mutate() to add a column with proportions
    mutate( proportion = Total/sum( Total ))

# Grouped
ggplot(edu_byGender, aes(fill=Gender, y=proportion, x=eduLevel)) + 
    geom_bar(position="dodge", stat="identity") +
    ggtitle('Proportion Education Level per Gender') +
    xlab( 'Education Level' ) +
    ylab( 'Proportion')

The differences between Male and Female are too slim to appreciate in this figure. Next, calculate the discrepancy:

edu_genderDiscrep <- edu_byGender %>% 
    pivot_wider( names_from = Gender, values_from = proportion ) %>% 
    select( -Total ) %>% 
    group_by( eduLevel ) %>% 
    summarise(Female=sum(Female,na.rm=TRUE), Male = sum( Male , na.rm = TRUE)) %>%
    mutate( discrepancy_Percent = abs( Female-Male)*100)

edu_genderDiscrep

## # A tibble: 5 x 4
##   eduLevel  Female   Male discrepancy_Percent
##   <chr>      <dbl>  <dbl>               <dbl>
## 1 Associate 0.174  0.172               0.130 
## 2 Bachelors 0.255  0.252               0.257 
## 3 Graduate  0.342  0.343               0.149 
## 4 HS        0.137  0.137               0.0554
## 5 NoHS      0.0934 0.0952              0.182

Again, the discrepancies are very slim, but from the output of our calculations, we can see that the largest discrepancy between Male & Female proportions is found for the Bachelors education level. By looking at the bar plot above, we can see that Females with bachelors have a slightly higher proportion than Males with bachelors. However, larger discrepancies might be found between the genders if we are to look at the data by region.

In Closing

This demo used a small .csv file to demonstrate the importance tidying datasets. The original format of the data is very easy on human eyes: it can be scanned for the gist with ease. However, the wide format does not facilitate visualization. dplyr and tidyr provide methods to reshape data into a format that helps downstream analysis.