In this post, I will be discussing converting the dataset to wide and long forms. Dataset I will be using is from Connecticut Accidental Drug Related Deaths 2012-2017. Dataset has 4066 individual cases.
Let’s load dataset.
library(tidyverse)
library(knitr)
library(kableExtra)
library(reshape2)
CTDrug.data.df <- as.data.frame(read.csv("https://raw.githubusercontent.com/akulapa/Data608-FinalProject/master/CT_Drug.csv", header=T, stringsAsFactors = F, na.strings=c("","NA")))
CTDrug.df <- CTDrug.data.df %>% select(-Age,-AgeClass,-zip,-Longitude,-Latitude,-DeathLoc,-DeathState)
CTDrug.df$Id <- seq(1:nrow(CTDrug.df))
CTDrug.df %>% head(20) %>%
kable("html",caption = "Sample Data") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left", font_size = 12) %>%
scroll_box(width = "100%", height = "200px")
| Year | Sex | Race | Heroin | Cocaine | Fentanyl | Oxycodone | Oxymorphone | EtOH | Hydrocodone | Benzodiazepine | Amphet | Tramad | Morphine | Other | Opioid | AgeGroup | Id |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2013 | Female | White | N | N | N | Y | Y | N | N | Y | N | N | N | N | N | 45-54 | 1 |
| 2012 | Male | White | N | Y | N | Y | N | N | N | N | N | N | N | N | N | 25-34 | 2 |
| 2014 | Male | White | N | N | N | N | N | Y | N | Y | N | N | N | N | N | 35-44 | 3 |
| 2014 | Female | White | Y | N | Y | N | N | N | N | N | N | N | N | N | N | 15-24 | 4 |
| 2013 | Female | White | Y | N | N | N | N | N | N | N | N | N | N | N | N | 25-34 | 5 |
| 2013 | Female | White | N | N | N | N | N | Y | N | Y | N | N | N | N | N | 45-54 | 6 |
| 2013 | Male | White | Y | N | N | N | N | Y | N | Y | N | N | N | N | N | 55-64 | 7 |
| 2013 | Male | Hispanic | Y | Y | N | N | N | N | N | N | N | N | N | N | N | 45-54 | 8 |
| 2014 | Female | White | N | Y | N | N | N | N | N | N | N | N | N | N | N | 45-54 | 9 |
| 2013 | Male | White | Y | N | Y | N | N | N | N | N | N | N | N | N | N | 25-34 | 10 |
| 2012 | Male | White | Y | N | N | N | N | N | N | Y | N | N | N | N | N | 15-24 | 11 |
| 2014 | Male | White | Y | N | N | N | N | Y | N | N | N | N | N | N | N | 25-34 | 12 |
| 2015 | Male | White | Y | Y | N | N | N | N | N | N | N | N | N | N | Y | 25-34 | 13 |
| 2013 | Male | Hispanic | Y | Y | N | N | N | N | N | N | N | N | N | N | N | 35-44 | 14 |
| 2015 | Male | White | Y | Y | N | N | N | Y | N | N | N | N | N | N | Y | 45-54 | 15 |
| 2013 | Male | Other | Y | N | N | N | N | N | N | N | N | N | N | N | N | 55-64 | 16 |
| 2015 | Male | White | N | N | Y | N | N | N | N | N | N | N | N | N | Y | 15-24 | 17 |
| 2014 | Male | White | Y | N | Y | N | N | N | N | N | N | N | N | N | N | 45-54 | 18 |
| 2015 | Male | Hispanic | Y | N | Y | N | N | Y | N | N | N | N | N | N | Y | 35-44 | 19 |
| 2015 | Male | White | N | Y | N | N | N | N | N | N | N | N | N | N | N | 45-54 | 20 |
The dataset contains four different race values. Let’s convert row value of Race to four individual columns.
unique(CTDrug.df$Race)
## [1] "White" "Hispanic" "Other" "Black"
#Convert Race to Columns
CTDrug.wide.df <- CTDrug.df %>%
gather(Race,name,starts_with('Race')) %>%
mutate(Race = name) %>%
mutate(present = 1) %>%
select(-name) %>%
spread(Race,present,fill = 0)
CTDrug.wide.df %>% head(20) %>%
kable("html",caption = "Sample Data") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left", font_size = 12) %>%
scroll_box(width = "100%", height = "200px")
| Year | Sex | Heroin | Cocaine | Fentanyl | Oxycodone | Oxymorphone | EtOH | Hydrocodone | Benzodiazepine | Amphet | Tramad | Morphine | Opioid | AgeGroup | Id | Black | Hispanic | Other | White |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2013 | Female | N | N | N | Y | Y | N | N | Y | N | N | N | N | 45-54 | 1 | 0 | 0 | 0 | 1 |
| 2012 | Male | N | Y | N | Y | N | N | N | N | N | N | N | N | 25-34 | 2 | 0 | 0 | 0 | 1 |
| 2014 | Male | N | N | N | N | N | Y | N | Y | N | N | N | N | 35-44 | 3 | 0 | 0 | 0 | 1 |
| 2014 | Female | Y | N | Y | N | N | N | N | N | N | N | N | N | 15-24 | 4 | 0 | 0 | 0 | 1 |
| 2013 | Female | Y | N | N | N | N | N | N | N | N | N | N | N | 25-34 | 5 | 0 | 0 | 0 | 1 |
| 2013 | Female | N | N | N | N | N | Y | N | Y | N | N | N | N | 45-54 | 6 | 0 | 0 | 0 | 1 |
| 2013 | Male | Y | N | N | N | N | Y | N | Y | N | N | N | N | 55-64 | 7 | 0 | 0 | 0 | 1 |
| 2013 | Male | Y | Y | N | N | N | N | N | N | N | N | N | N | 45-54 | 8 | 0 | 1 | 0 | 0 |
| 2014 | Female | N | Y | N | N | N | N | N | N | N | N | N | N | 45-54 | 9 | 0 | 0 | 0 | 1 |
| 2013 | Male | Y | N | Y | N | N | N | N | N | N | N | N | N | 25-34 | 10 | 0 | 0 | 0 | 1 |
| 2012 | Male | Y | N | N | N | N | N | N | Y | N | N | N | N | 15-24 | 11 | 0 | 0 | 0 | 1 |
| 2014 | Male | Y | N | N | N | N | Y | N | N | N | N | N | N | 25-34 | 12 | 0 | 0 | 0 | 1 |
| 2015 | Male | Y | Y | N | N | N | N | N | N | N | N | N | Y | 25-34 | 13 | 0 | 0 | 0 | 1 |
| 2013 | Male | Y | Y | N | N | N | N | N | N | N | N | N | N | 35-44 | 14 | 0 | 1 | 0 | 0 |
| 2015 | Male | Y | Y | N | N | N | Y | N | N | N | N | N | Y | 45-54 | 15 | 0 | 0 | 0 | 1 |
| 2013 | Male | Y | N | N | N | N | N | N | N | N | N | N | N | 55-64 | 16 | 0 | 0 | 1 | 0 |
| 2015 | Male | N | N | Y | N | N | N | N | N | N | N | N | Y | 15-24 | 17 | 0 | 0 | 0 | 1 |
| 2014 | Male | Y | N | Y | N | N | N | N | N | N | N | N | N | 45-54 | 18 | 0 | 0 | 0 | 1 |
| 2015 | Male | Y | N | Y | N | N | Y | N | N | N | N | N | Y | 35-44 | 19 | 0 | 1 | 0 | 0 |
| 2015 | Male | N | Y | N | N | N | N | N | N | N | N | N | N | 45-54 | 20 | 0 | 0 | 0 | 1 |
The data format is good if we are studying an individual case. In case, you want to generate drug summary it would be better if we had data in long form. Converting drug data into a long form.
Drug.df <- CTDrug.df %>%
melt(id.vars = c('Id','Year','Sex','Race','AgeGroup')) %>%
rename(Drug=variable,Observed=value) %>%
arrange(Id)
Drug.df %>% head(20) %>%
kable("html",caption = "Sample Data") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left", font_size = 12) %>%
scroll_box(width = "100%", height = "200px")
| Id | Year | Sex | Race | AgeGroup | Drug | Observed |
|---|---|---|---|---|---|---|
| 1 | 2013 | Female | White | 45-54 | Heroin | N |
| 1 | 2013 | Female | White | 45-54 | Cocaine | N |
| 1 | 2013 | Female | White | 45-54 | Fentanyl | N |
| 1 | 2013 | Female | White | 45-54 | Oxycodone | Y |
| 1 | 2013 | Female | White | 45-54 | Oxymorphone | Y |
| 1 | 2013 | Female | White | 45-54 | EtOH | N |
| 1 | 2013 | Female | White | 45-54 | Hydrocodone | N |
| 1 | 2013 | Female | White | 45-54 | Benzodiazepine | Y |
| 1 | 2013 | Female | White | 45-54 | Amphet | N |
| 1 | 2013 | Female | White | 45-54 | Tramad | N |
| 1 | 2013 | Female | White | 45-54 | Morphine | N |
| 1 | 2013 | Female | White | 45-54 | Other | N |
| 1 | 2013 | Female | White | 45-54 | Opioid | N |
| 2 | 2012 | Male | White | 25-34 | Heroin | N |
| 2 | 2012 | Male | White | 25-34 | Cocaine | Y |
| 2 | 2012 | Male | White | 25-34 | Fentanyl | N |
| 2 | 2012 | Male | White | 25-34 | Oxycodone | Y |
| 2 | 2012 | Male | White | 25-34 | Oxymorphone | N |
| 2 | 2012 | Male | White | 25-34 | EtOH | N |
| 2 | 2012 | Male | White | 25-34 | Hydrocodone | N |
To conclude, data in long form makes it easy to generate summaries, whereas wide form is useful if we want to study individual observations.