The Open Data Initiative was established in 2013 by the Office of Management and Budget through policy memorandum M-13-13. In support of this effort, on April 14, 2016, VA published a collection of record-level and aggregate Veterans Crisis Line call datasets on the data.gov website. This data was made available through a Freedom of Information Act (FOIA) request (appeal 15-00242-F). This guide helps the data analysis community explore the data by showing the steps involved in accessing, preparing and using the Veterans Crisis Line Calls FY2014 Record-Level data file with R and R Markdown. It shows how to read the file from the data.gov website and prepare the data for analysis, and it includes three sample visualizations to help explore and understand the data.
The dataset is accompanied by a data dictionary; the link to it is listed in the metadata section of the file's data.gov page. The data dictionary is important because it provides a way to decode the values the variables take. From the dictionary we can also see which variables are machine generated and which are populated by the VA Crisis Line responder. It is important to note that the crisis line data is partially redacted: redacted values are coded as “b6”. Some variables are redacted entirely, some observations are redacted entirely, and other observations have only certain variable values redacted.
As mentioned earlier, this guide was created using R and R Markdown to ensure full reproducibility and to allow access to this dataset using free, open source tools. The raw R Markdown document, along with some supporting files, is available on GitHub. The raw data is accessed directly from the data.gov website, so it is not posted on GitHub. To execute the R Markdown document successfully, you will need to ensure that the appropriate R packages are installed.
GitHub Repository:
https://github.com/mihiriyer/crisis
Required R Packages:
Guide in html:
The Veterans Crisis Line Calls FY2014 Record-Level dataset is provided as a zip file, so once downloaded the file needs to be unzipped and read. The zip file can be downloaded, unzipped and read with the following code:
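A minimal sketch of the download, unzip and read step, using only base R. The helper name `read_zipped_csv` is hypothetical, the archive is assumed to contain a CSV file, and the download link is deliberately left as a placeholder to be copied from the data.gov page. Reading every column as character matches the all-chr str output shown later in the guide.

```r
# Hypothetical helper (not from the original guide): download a zip archive,
# extract it into a temporary directory, and read the first CSV found inside.
read_zipped_csv <- function(zip_url, ...) {
  zip_path <- tempfile(fileext = ".zip")
  out_dir  <- tempfile("unzipped_")

  # mode = "wb" keeps the binary archive intact on Windows
  download.file(zip_url, destfile = zip_path, mode = "wb")

  # unzip() creates out_dir if needed and returns the extracted file paths
  files <- unzip(zip_path, exdir = out_dir)

  # Assumption: the archive holds at least one CSV file
  csv_files <- files[grepl("\\.csv$", files, ignore.case = TRUE)]
  if (length(csv_files) == 0) stop("no CSV file found in the archive")

  # Read every column as character so codes such as "b6" are preserved
  read.csv(csv_files[1], colClasses = "character", ...)
}

# Usage (replace the placeholder with the link listed on the data.gov page):
# crisis <- read_zipped_csv("<zip download URL from data.gov>")
```

Keeping the archive and the extracted files in temporary locations avoids leaving copies of the raw data in the project directory.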
Once the data has been loaded, it’s useful to run the nrow, length and str functions to see the number of observations, the number of variables (columns) and the data types. The nrow function shows that there are 367,000 observations in this dataset, and the length function shows that there are 68 variables. The str output shows that all of the variables/fields have been read as chr (character) values (see Appendix A for details). The output also makes visible the variables that are populated with “b6”, which implies that the variable has been redacted. From this output we can see that only 17 variables are exposed.
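The inspection described above can be sketched on a toy data frame. The data here is made up for illustration, and the check assumes that a redacted variable holds “b6” in every row, which the str output only suggests from its first few values.

```r
# Toy stand-in for the crisis data (illustrative values only)
toy <- data.frame(
  ID          = c("b6", "b6", "b6"),
  ActionTaken = c("1", "6", "b6"),
  PatientAge  = c("b6", "b6", "b6"),
  stringsAsFactors = FALSE
)

nrow(toy)    # number of observations
length(toy)  # number of variables (columns)
str(toy)     # data type and first values of each variable

# Flag variables whose values are all "b6" (assumed to be fully redacted)
fully_redacted <- vapply(toy, function(col) all(col == "b6"), logical(1))

# Names of the variables that still expose data
exposed <- names(toy)[!fully_redacted]
exposed
```

On the real dataset the same check could be used to confirm the list of exposed variables before selecting them.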
Again from the str output we can see that there are some date, time, categorical and redacted variables. Now that we know which variables can be used (date, time and categorical), the next step is to select the non-redacted variables and assign each one an appropriate data type. Assigning the categorical variables is fairly straightforward: we can use the factor data type, setting the levels argument to the coded values and the labels argument to the text definitions of the codes from the data dictionary. For the one time variable, CALL_DURATION, which is formatted as H:M:S, we can use the lubridate package to convert to seconds and then to minutes. This in effect renders the variable as a numeric data type. The variables CallStartYYYY, CallEndYYYY, DateClosedYYYY and TwoWeekFollowUpDateYYYY are a special case: they are date-time variables, but since only the year is exposed we can set these to the factor data type as well. The last variable, TwoWeekFollowUpCount, contains integer values and “b6”. Since the dictionary doesn’t define the values this variable takes, it is left as a character data type. Below is the resulting dataset after removing the redacted variables and assigning data types; the detailed steps are presented in Appendix B:
str(crisis.nred, strict.width = "cut")
## 'data.frame': 367000 obs. of 17 variables:
## $ ActionTaken : Factor w/ 12 levels "Referral generated, se"..
## $ CALL_DURATION : num 79 51 13 56 18 5 11 35 23 11 ...
## $ CallEndYYYY : Factor w/ 3 levels "2013","2014",..: 1 1 1 1 ..
## $ CallOutcome : Factor w/ 11 levels "Substance Use Addictio"..
## $ CallSource : Factor w/ 24 levels "Hotline 800",..: 1 16 1 ..
## $ CallStartYYYY : Factor w/ 3 levels "2013","2014",..: 1 1 1 1 ..
## $ CheckedCapriInfo : Factor w/ 4 levels "Yes","Veteran Refused",....
## $ DateClosedYYYY : Factor w/ 4 levels "Empty","2013",..: 2 1 1 1..
## $ IsReferral : Factor w/ 3 levels "No","Yes","b6-redacted": ..
## $ ReferralType : Factor w/ 5 levels "Emergent","Empty",..: 4 2..
## $ RiskAssessmentId : Factor w/ 4 levels "High Risk","Moderate to "..
## $ SatisfactionWithCall : Factor w/ 4 levels "True","False",..: 1 1 1 1..
## $ TwoWeekFollowUp : Factor w/ 3 levels "No","Yes","b6-redacted": ..
## $ TwoWeekFollowUpClosed : Factor w/ 3 levels "No","Yes","b6-redacted": ..
## $ TwoWeekFollowUpCount : chr "1" "0" "0" "0" ...
## $ TwoWeekFollowUpDateYYYY: Factor w/ 4 levels "Empty","2013",..: 2 1 1 1..
## $ isClosed : Factor w/ 3 levels "No","Yes","b6-redacted": ..
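The two kinds of conversion described above can be sketched on toy values. To keep the sketch free of package dependencies, the H:M:S conversion is done with a small base R helper (`hms_to_minutes`, a hypothetical name) rather than with lubridate, which is what the guide itself uses in Appendix B.

```r
# Toy values in the same shape as the raw data (illustrative only)
toy <- data.frame(
  IsReferral    = c("1", "0", "b6"),
  CALL_DURATION = c("01:19:00", "00:51:00", "00:13:30"),
  stringsAsFactors = FALSE
)

# Categorical variable: coded values go in levels, text definitions in labels
toy$IsReferral <- factor(toy$IsReferral,
                         levels = c("0", "1", "b6"),
                         labels = c("No", "Yes", "b6-redacted"))

# Hypothetical base R helper: convert "H:M:S" strings to minutes.
# Anything that does not split into three parts (for example "b6") becomes NA.
hms_to_minutes <- function(x) {
  parts <- strsplit(x, ":", fixed = TRUE)
  vapply(parts, function(p) {
    if (length(p) != 3) return(NA_real_)
    p <- as.numeric(p)
    (p[1] * 3600 + p[2] * 60 + p[3]) / 60
  }, numeric(1))
}

toy$CALL_DURATION <- hms_to_minutes(toy$CALL_DURATION)
str(toy)
```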
Now that the relevant (data-containing) variables have been extracted and assigned to the appropriate data types, we can start exploring the data. We have one numeric variable (CALL_DURATION), one undefined character variable (TwoWeekFollowUpCount), and 15 categorical variables. A good first step for CALL_DURATION is to view its distribution using a histogram or box plot. A box plot of CALL_DURATION against any of the categorical variables shows how its distribution varies, for example, across the different types of ActionTaken or ReferralType. The categorical variables can be visualized with a bar plot to see the counts or totals for each category of the variable.
Running the summary command is a good first step towards exploring the CALL_DURATION (in minutes) variable, since it provides a six-number summary and the number of missing values. The max and min values can help inform the choice of axis limits.
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 13.00 27.00 37.91 50.00 1439.00 1887
From the above we can see that 75% of the calls lasted 50 minutes or less and that the remaining 25% lasted between 50 and 1,439 minutes (24 hours is 1,440 minutes). This long upper tail, together with a mean (37.91) above the median (27), implies that the distribution has a positive (right) skew, so it is helpful to narrow the range of the CALL_DURATION variable to keep the boxplots legible. Selecting only the calls that lasted less than 500 minutes excludes just 128 calls.
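A minimal sketch of this step on made-up durations: summarize the variable, count how many calls the 500-minute cutoff would drop, and draw a box plot by category from the trimmed data. The data frame `calls` and its values are illustrative assumptions, and base R graphics are used here only to keep the sketch dependency-free.

```r
# Illustrative durations in minutes, including one extreme value and one NA
calls <- data.frame(
  CALL_DURATION = c(5, 13, 27, 50, 120, 1439, NA),
  ActionTaken   = factor(c("A", "B", "A", "B", "A", "B", "A"))
)

# Six-number summary plus the count of missing values
summary(calls$CALL_DURATION)

# Number of calls the cutoff would exclude (NAs are not counted)
n_excluded <- sum(calls$CALL_DURATION >= 500, na.rm = TRUE)

# Keep only calls shorter than 500 minutes
short <- subset(calls, !is.na(CALL_DURATION) & CALL_DURATION < 500)

# Distribution of call duration for each category
boxplot(CALL_DURATION ~ ActionTaken, data = short,
        xlab = "Action taken", ylab = "Call duration (minutes)")
```

Counting the excluded rows before filtering makes it easy to report how much data the cutoff removes.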
The graphic below provides a simple view of the total number of calls for each type of Call Outcome. This variable appears to indicate the nature of the caller’s mental health crisis. There are 36 options for this variable, but only 11 are actually used. The unused options have been dropped to eliminate clutter in the graphic. The bar plot below was made using ggplot2 and uses Color Brewer for the color scheme.
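The approach described above can be sketched as follows. The outcome labels, the counts and the "Set3" palette are illustrative assumptions rather than values from the dataset or the original plot. The plotting part runs only if ggplot2 is installed.

```r
# Toy outcome variable with one defined but unused level (illustrative labels)
outcome <- factor(
  c("Suicidal Thoughts", "Suicidal Thoughts",
    "Substance Use/Addiction", "Suicidal Thoughts"),
  levels = c("Suicidal Thoughts", "Substance Use/Addiction", "Unused option")
)

# Record how many options are defined, then drop the ones never used
n_defined <- nlevels(outcome)
outcome   <- droplevels(outcome)

# Count calls per outcome
counts <- as.data.frame(table(CallOutcome = outcome), responseName = "Calls")

# Bar plot with a Color Brewer palette (skipped if ggplot2 is unavailable)
if (requireNamespace("ggplot2", quietly = TRUE)) {
  p <- ggplot2::ggplot(counts,
                       ggplot2::aes(x = CallOutcome, y = Calls,
                                    fill = CallOutcome)) +
    ggplot2::geom_col() +
    ggplot2::coord_flip() +
    ggplot2::scale_fill_brewer(palette = "Set3") +
    ggplot2::theme(legend.position = "none")
  print(p)
}
```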
From the above graph we can see that more than 250,000 calls were made by callers with suicidal thoughts. The next most common reason for calling the crisis line was substance use/addiction issues, which accounted for more than 50,000 calls.
In this next graph we show the total number of calls by the Action Taken by the call responder. From this bar plot we can see that the most common outcome, in nearly 150,000 cases, was that the caller responded to the intervention made by the crisis line responder. It is interesting to note that the next largest category is one where no action was possible; there were nearly 100,000 such calls.
str output
## 'data.frame': 367000 obs. of 68 variables:
## $ AccessToHurtMeans : chr "b6" "b6" "b6" "b6" ...
## $ ActionTaken : chr "1" "6" "9" "9" ...
## $ AmbivalenceForLiving : chr "b6" "b6" "b6" "b6" ...
## $ AngerInTenuousControl : chr "b6" "b6" "b6" "b6" ...
## $ CALL_DURATION : chr "01:19:00" "00:51:00" "00:13:00" "00:56:00" ...
## $ CallEndYYYY : chr "2013" "2013" "2013" "2013" ...
## $ CallOutcome : chr "4" "4" "1" "1" ...
## $ CallSource : chr "1" "16" "1" "16" ...
## $ CallStartYYYY : chr "2013" "2013" "2013" "2013" ...
## $ CallerRelationshipToVet : chr "b6" "b6" "b6" "b6" ...
## $ CheckedCapriInfo : chr "Yes" "Did not ask" "Did not ask" "Did not ask" ...
## $ ChronicPain : chr "b6" "b6" "b6" "b6" ...
## $ ComingDownFromDrugs : chr "b6" "b6" "b6" "b6" ...
## $ CoreValuesOrBeliefs : chr "b6" "b6" "b6" "b6" ...
## $ CurrentlyIntoxicated : chr "b6" "b6" "b6" "b6" ...
## $ CurrentlyUsingOrAbusingPrescriptionDrugs: chr "b6" "b6" "b6" "b6" ...
## $ DateClosedYYYY : chr "2013" ". " ". " ". " ...
## $ DaysSinceLastDose : chr "b6" "b6" "b6" "b6" ...
## $ DesireToHarmSelfOrOthers : chr "b6" "b6" "b6" "b6" ...
## $ DifficultOrDelusional : chr "b6" "b6" "b6" "b6" ...
## $ DrugsInvolved : chr "b6" "b6" "b6" "b6" ...
## $ EngagementWithHelper : chr "b6" "b6" "b6" "b6" ...
## $ EverAttemptedSuicide : chr "b6" "b6" "b6" "b6" ...
## $ ExposureToSuicide : chr "b6" "b6" "b6" "b6" ...
## $ FeelingAlone : chr "b6" "b6" "b6" "b6" ...
## $ FeelingTrapped : chr "b6" "b6" "b6" "b6" ...
## $ FuturePlanning : chr "b6" "b6" "b6" "b6" ...
## $ HasExpressedIntentToDie : chr "b6" "b6" "b6" "b6" ...
## $ HasInsomnia : chr "b6" "b6" "b6" "b6" ...
## $ HaveAccessToGun : chr "b6" "b6" "b6" "b6" ...
## $ HavePutHurtPlansIntoAction : chr "b6" "b6" "b6" "b6" ...
## $ HeightenedAnxiety : chr "b6" "b6" "b6" "b6" ...
## $ Helplessness : chr "b6" "b6" "b6" "b6" ...
## $ Hopelessness : chr "b6" "b6" "b6" "b6" ...
## $ ID : chr "b6" "b6" "b6" "b6" ...
## $ IsActiveDuty : chr "b6" "b6" "b6" "b6" ...
## $ IsReferral : chr "1" "0" "0" "0" ...
## $ IsVet : chr "b6" "b6" "b6" "b6" ...
## $ MilitaryBranch : chr "b6" "b6" "b6" "b6" ...
## $ NearestFacilitySiteCode : chr "b6" "b6" "b6" "b6" ...
## $ PatientAge : chr "b6" "b6" "b6" "b6" ...
## $ PatientGender : chr "b6" "b6" "b6" "b6" ...
## $ PerceivedBurdenOnOthers : chr "b6" "b6" "b6" "b6" ...
## $ PlanToHurtSelfOrOthers : chr "b6" "b6" "b6" "b6" ...
## $ PlanToHurtWhen : chr "b6" "b6" "b6" "b6" ...
## $ PsychologicalPain : chr "b6" "b6" "b6" "b6" ...
## $ ReferralType : chr "Routine" "" "" "" ...
## $ ResponderID : chr "b6" "b6" "b6" "b6" ...
## $ RiskAssessmentId : chr "3" "3" "3" "3" ...
## $ SatisfactionWithCall : chr "TRUE" "TRUE" "TRUE" "TRUE" ...
## $ SenseOfPurpose : chr "b6" "b6" "b6" "b6" ...
## $ SomeoneToCall : chr "b6" "b6" "b6" "b6" ...
## $ SomeoneWith : chr "b6" "b6" "b6" "b6" ...
## $ StationID : chr "b6" "b6" "b6" "b6" ...
## $ StoppedTakingPrescriptionMeds : chr "b6" "b6" "b6" "b6" ...
## $ ThinkingOfSuicide : chr "b6" "b6" "b6" "b6" ...
## $ ThoughtOfInLastTwoMonths : chr "b6" "b6" "b6" "b6" ...
## $ Tiredness : chr "b6" "b6" "b6" "b6" ...
## $ TriedSuicideBefore : chr "b6" "b6" "b6" "b6" ...
## $ TriedSuicideInLastYear : chr "b6" "b6" "b6" "b6" ...
## $ TwoWeekFollowUp : chr "1" "0" "0" "0" ...
## $ TwoWeekFollowUpBy : chr "b6" "b6" "b6" "b6" ...
## $ TwoWeekFollowUpClosed : chr "1" "0" "0" "0" ...
## $ TwoWeekFollowUpCount : chr "1" "0" "0" "0" ...
## $ TwoWeekFollowUpDateYYYY : chr "2013" ". " ". " ". " ...
## $ VeteranStatus : chr "b6" "b6" "b6" "b6" ...
## $ WhoWhatSupportedBy : chr "b6" "b6" "b6" "b6" ...
## $ isClosed : chr "1" "0" "0" "0" ...
# select the non-redacted variables (column positions follow the str output of the full dataset)
crisis.nred <- crisis[, c(2, 5, 6, 7, 8, 9, 11, 17, 37, 47, 49, 50, 61, 63, 64, 65, 68)]
#ActionTaken as factor
action<- read.csv("https://raw.githubusercontent.com/mihiriyer/crisis/master/60ActionTakenCodes.csv", stringsAsFactors=FALSE)
crisis.nred$ActionTaken <- factor(crisis.nred$ActionTaken, levels=c(1:11, "b6"), labels=c(action[,2], "b6-redacted"))
rm(action)
# CALL_DURATION: convert H:M:S strings to a numeric duration
# load the lubridate library to convert the CALL_DURATION variable into seconds and then minutes
library(lubridate)
crisis.nred$CALL_DURATION <- period_to_seconds(hms(crisis.nred$CALL_DURATION))
crisis.nred$CALL_DURATION <- crisis.nred$CALL_DURATION/60
# the remaining variables are set as factors because they are mostly categorical
#CallEndYYYY
crisis.nred$CallEndYYYY <- factor(crisis.nred$CallEndYYYY, levels=c("2013", "2014", "b6"), labels=c("2013", "2014", "b6-redacted"))
#CallOutcome
outcome <- read.csv(file="https://raw.githubusercontent.com/mihiriyer/crisis/master/59CallOutcomeCodes.csv", stringsAsFactors = FALSE)
crisis.nred$CallOutcome <- factor(crisis.nred$CallOutcome, levels=c(1:35, "b6"), labels=c(outcome[,2], "b6-redacted"))
# calculate the total number of levels, i.e. the choices available for the Call Outcome variable
callout.levels <- length(levels(crisis.nred$CallOutcome))
# drop empty (unused) levels
crisis.nred$CallOutcome <- droplevels(crisis.nred$CallOutcome)
rm(outcome)
#CallSource
callsource <- read.csv(file="https://raw.githubusercontent.com/mihiriyer/crisis/master/7CallSourceCodes.csv", stringsAsFactors = FALSE)
crisis.nred$CallSource <- factor(crisis.nred$CallSource, levels=c(1:23, "b6"), labels=c(callsource[,2], "b6-redacted"))
rm(callsource)
# CallStartYYYY
crisis.nred$CallStartYYYY <- factor(crisis.nred$CallStartYYYY, levels=c("2013", "2014", "b6"), labels=c("2013", "2014", "b6-redacted"))
#CheckedCapriInfo
crisis.nred$CheckedCapriInfo <- factor(crisis.nred$CheckedCapriInfo, levels=c("Yes", "Veteran Refused", "Did not ask", "b6"), labels=c("Yes", "Veteran Refused", "Did not ask", "b6-redacted"))
# DateClosedYYYY
crisis.nred$DateClosedYYYY <- factor(crisis.nred$DateClosedYYYY, levels=c(". ", "2013", "2014", "b6"), labels=c("Empty", "2013", "2014", "b6-redacted"))
# IsReferral
crisis.nred$IsReferral <- factor(crisis.nred$IsReferral, levels=c("0", "1", "b6"), labels=c("No", "Yes", "b6-redacted"))
# ReferralType
crisis.nred$ReferralType[crisis.nred$ReferralType == ""] <- "Empty"
crisis.nred$ReferralType <- factor(crisis.nred$ReferralType, levels=c("Emergent","Empty","Info Only","Routine","Urgent"))
# RiskAssessmentId
riskassess <- read.csv(file="https://raw.githubusercontent.com/mihiriyer/crisis/master/57RiskAssessmentCodes.csv", stringsAsFactors = FALSE)
crisis.nred$RiskAssessmentId <- factor(crisis.nred$RiskAssessmentId, levels=c(1:3, "b6"), labels=c(riskassess[,2], "b6"))
rm(riskassess)
# SatisfactionWithCall
crisis.nred$SatisfactionWithCall <- factor(crisis.nred$SatisfactionWithCall, levels=c("TRUE", "FALSE", "unsure", "b6"), labels=c("True", "False", "Unsure", "b6-redacted"))
# TwoWeekFollowUp
crisis.nred$TwoWeekFollowUp <- factor(crisis.nred$TwoWeekFollowUp, levels=c("0", "1", "b6"), labels=c("No", "Yes", "b6-redacted"))
# TwoWeekFollowUpClosed
crisis.nred$TwoWeekFollowUpClosed <- factor(crisis.nred$TwoWeekFollowUpClosed, levels=c("0", "1", "b6"), labels=c("No", "Yes", "b6-redacted"))
# TwoWeekFollowUpDateYYYY
crisis.nred$TwoWeekFollowUpDateYYYY <- factor(crisis.nred$TwoWeekFollowUpDateYYYY, levels=c(". ", "2013", "2014", "b6"), labels=c("Empty", "2013", "2014", "b6-redacted"))
# isClosed
crisis.nred$isClosed <- factor(crisis.nred$isClosed, levels=c("0", "1", "b6"), labels=c("No", "Yes", "b6-redacted"))
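Several of the steps above apply the same 0/1/“b6” recoding to different variables. As an optional variation, the repetition could be folded into one helper applied over a vector of column names. This is a sketch on toy data, and the helper name `recode_binary` is hypothetical.

```r
# Hypothetical helper: recode a 0/1/"b6" character vector as a labelled factor
recode_binary <- function(x) {
  factor(x, levels = c("0", "1", "b6"),
         labels = c("No", "Yes", "b6-redacted"))
}

# Toy stand-in for two of the binary variables (illustrative values only)
toy <- data.frame(
  IsReferral = c("1", "0", "b6"),
  isClosed   = c("0", "0", "1"),
  stringsAsFactors = FALSE
)

# Apply the same recoding to every listed column in one step
binary_vars <- c("IsReferral", "isClosed")
toy[binary_vars] <- lapply(toy[binary_vars], recode_binary)
str(toy)
```

With the real data, `binary_vars` would list IsReferral, TwoWeekFollowUp, TwoWeekFollowUpClosed and isClosed, which share the same coding in the steps above.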