Data - CDC Youth Risk Behavior Surveillance System (YRBSS)

The data I want to use for my final project is from the CDC’s Youth Risk Behavior Surveillance System (YRBSS) found here:

www.cdc.gov/healthyyouth/data/yrbs/index.htm

The CDC website only offers the data in ASCII and Microsoft Access formats, though. After a lot of trial and error I realized that importing the CDC downloads directly into R was going to be very problematic, so I searched further and found the YRBSS data already converted to csv format on the Kaggle website here:

https://www.kaggle.com/raylo168/dash-yrbss-hs-2017

For simplicity I will use the Kaggle dataset.

There are 6 files that total about 2 GB of data.

  1. Alcohol and Other Drug Use.csv
  2. Dietary Behaviors.csv
  3. Obesity Overweight and Weight Control.csv
  4. Physical Activity.csv
  5. Sexual Behaviors.csv
  6. Tobacco Use.csv

Each file contains data at the National, State, Territory, Local, and ‘Other’ regional levels for the odd-numbered years from 1991 through 2017. Each question is split into a higher-risk and a lower-risk category, with the aggregated percentage of respondents in each category in separate columns. Variables for race, gender, and geolocation are also included.

“Alcohol and Other Drug Use” Variable Names

##  [1] "YEAR"                                   
##  [2] "LocationAbbr"                           
##  [3] "LocationDesc"                           
##  [4] "DataSource"                             
##  [5] "Topic"                                  
##  [6] "Subtopic"                               
##  [7] "ShortQuestionText"                      
##  [8] "Greater_Risk_Question"                  
##  [9] "Description"                            
## [10] "Data_Value_Symbol"                      
## [11] "Data_Value_Type"                        
## [12] "Greater_Risk_Data_Value"                
## [13] "Greater_Risk_Data_Value_Footnote_Symbol"
## [14] "Greater_Risk_Data_Value_Footnote"       
## [15] "Greater_Risk_Low_Confidence_Limit"      
## [16] "Greater_Risk_High_Confidence_Limit"     
## [17] "Lesser_Risk_Question"                   
## [18] "Lesser_Risk_Data_Value"                 
## [19] "Lesser_Risk_Data_Value_Footnote_Symbol" 
## [20] "Lesser_Risk_Data_Value_Footnote"        
## [21] "Lesser_Risk_Low_Confidence_Limit"       
## [22] "Lesser_Risk_High_Confidence_Limit"      
## [23] "Sample_Size"                            
## [24] "Sex"                                    
## [25] "Race"                                   
## [26] "Grade"                                  
## [27] "GeoLocation"                            
## [28] "TopicId"                                
## [29] "SubTopicID"                             
## [30] "QuestionCode"                           
## [31] "LocationId"                             
## [32] "StratID1"                               
## [33] "StratID2"                               
## [34] "StratID3"                               
## [35] "StratificationType"
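
As a rough sketch of how these columns could be used, the code below pulls the overall (all sexes, races, and grades) trend for one question in a few locations. The toy data frame stands in for the real file and its values are invented; the "Total" labels for the overall rows and the `overall_trend()` helper are my assumptions, not something confirmed from the dataset.

```r
# Toy stand-in for a few rows of "Alcohol and Other Drug Use.csv".
# Only the column names come from the file; every value here is invented.
# With the real data this would be something like:
#   yrbss <- read.csv("Alcohol and Other Drug Use.csv", stringsAsFactors = FALSE)
yrbss <- data.frame(
  YEAR = c(2015, 2017, 2015, 2017, 2017),
  LocationDesc = c("United States", "United States", "Texas", "Texas", "Texas"),
  ShortQuestionText = "Current alcohol use",
  Greater_Risk_Data_Value = c(30, 28, 31, 27, 25),
  Sex = c("Total", "Total", "Total", "Total", "Female"),
  Race = "Total",
  Grade = "Total",
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep the overall rows for one question and a set of
# locations. Assumes overall rows are labelled "Total" in Sex, Race, and Grade.
overall_trend <- function(data, question, locations) {
  keep <- data$ShortQuestionText == question &
    data$LocationDesc %in% locations &
    data$Sex == "Total" & data$Race == "Total" & data$Grade == "Total"
  out <- data[keep, c("YEAR", "LocationDesc", "Greater_Risk_Data_Value")]
  out[order(out$LocationDesc, out$YEAR), ]
}

trend <- overall_trend(yrbss, "Current alcohol use", c("United States", "Texas"))
print(trend)
```

The result has one row per year and location, which is the shape a line graph of change over time needs.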

Sample of “Alcohol and Other Drug Use” data

Project Description

I intend to include data from all 6 csv files (if possible) and create an interactive graphic that lets the user choose the risk behaviors of interest, compare them across regional areas against the national numbers, and see how they change over time. I imagine line graphs showing how a behavior has changed from 1991 to 2017, possibly letting the user select multiple behaviors or multiple geographic regions to plot on the same graph for comparison. Each of those selection options could go in separate graphs or on separate tabs. I may have separate tabs for state data vs. local data, or may keep them together so that state and local data can be compared to each other as well. I’d also like to include filters for race and gender and possibly create a map using the geolocation data available.
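
A minimal sketch of what such an app might look like, assuming the shiny and plotly packages are installed. The data frame is a toy stand-in with invented values, and the input IDs, labels, and layout are placeholders of mine rather than a settled design.

```r
library(shiny)
library(plotly)

# Toy stand-in for the real data; only the column names come from the dataset
# and every value is invented.
yrbss <- data.frame(
  YEAR = rep(c(2013, 2015, 2017), times = 2),
  LocationDesc = rep(c("United States", "Texas"), each = 3),
  ShortQuestionText = "Current alcohol use",
  Greater_Risk_Data_Value = c(35, 33, 30, 36, 31, 28),
  stringsAsFactors = FALSE
)

ui <- fluidPage(
  titlePanel("YRBSS trends (sketch)"),
  sidebarLayout(
    sidebarPanel(
      # One behavior at a time, several locations at once
      selectInput("question", "Risk behavior",
                  choices = unique(yrbss$ShortQuestionText)),
      selectInput("locations", "Locations to compare",
                  choices = unique(yrbss$LocationDesc),
                  selected = "United States", multiple = TRUE)
    ),
    mainPanel(plotlyOutput("trend_plot"))
  )
)

server <- function(input, output, session) {
  # Rows matching the current selections
  selected_rows <- reactive({
    req(input$locations)
    yrbss[yrbss$ShortQuestionText == input$question &
            yrbss$LocationDesc %in% input$locations, ]
  })

  # One line per selected location, year on the x axis
  output$trend_plot <- renderPlotly({
    p <- plot_ly(selected_rows(), x = ~YEAR, y = ~Greater_Risk_Data_Value,
                 color = ~LocationDesc, type = "scatter",
                 mode = "lines+markers")
    plotly::layout(p,
                   xaxis = list(title = "Year"),
                   yaxis = list(title = "Percent at greater risk"))
  })
}

app <- shinyApp(ui, server)
# runApp(app)  # uncomment to launch the app locally
```

Filters for race and gender could be added as further `selectInput()` controls in the same way, and separate tabs could be built with `tabsetPanel()`.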

Relevance

The Youth Risk Behavior Surveillance System is the only study of its kind that “monitors six categories of health-related behaviors that contribute to the leading causes of death and disability among youth and adults” [1], and it has done so since 1991. Understanding (1) which of these behaviors youth are engaged in, (2) how often or to what extent they engage in them, and (3) how those behaviors are changing over time can lead to the development of better preventative programs and health education.

Technologies

I’d like to create a Shiny app that uses plotly for the graphics, but I may have to scale back my data if I go that route, since the free hosting accounts have a 1 GB limit. One possible workaround is to make a separate Shiny app for each csv file instead of including them all in one, though I’m not sure whether the limit is 1 GB per app or 1 GB per account.
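
One way to scale the data back might be to keep only the columns the app needs and store them in R’s compressed RDS format instead of csv. The sketch below uses base R only and runs on a small invented csv written to a temporary file; the column selection and the `slim_to_rds()` helper are my assumptions, and I have not checked how much this would shrink the real files.

```r
# Columns the app would plausibly need (an assumption; adjust as required)
keep_cols <- c("YEAR", "LocationDesc", "ShortQuestionText",
               "Greater_Risk_Data_Value", "Sex", "Race", "Grade")

# Write a small invented csv so the example is self-contained
csv_path <- tempfile(fileext = ".csv")
toy <- data.frame(
  YEAR = rep(c(2015, 2017), each = 500),
  LocationDesc = "United States",
  ShortQuestionText = "Current alcohol use",
  Greater_Risk_Question = "A long question text that repeats on every row of the file",
  Greater_Risk_Data_Value = 30,
  Sex = "Total", Race = "Total", Grade = "Total",
  stringsAsFactors = FALSE
)
write.csv(toy, csv_path, row.names = FALSE)

# Hypothetical helper: read a csv, keep the chosen columns, save as RDS
slim_to_rds <- function(csv_file, rds_file, columns) {
  full <- read.csv(csv_file, stringsAsFactors = FALSE)
  saveRDS(full[, columns], rds_file)  # saveRDS() compresses by default
  invisible(rds_file)
}

rds_path <- tempfile(fileext = ".rds")
slim_to_rds(csv_path, rds_path, keep_cols)

# Compare the size on disk (in bytes) before and after
sizes <- c(csv = file.size(csv_path), rds = file.size(rds_path))
print(sizes)

# The app would then load the slimmed file with readRDS()
slim <- readRDS(rds_path)
```

The same helper could be run once per csv file before deployment, so that only the smaller `.rds` files are uploaded with the app.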

Another option is a static Rmd document posted to RPubs, with a static graphic for each csv file and a link to a separate Shiny app for each file for interactive exploration. This work-around would bring all of the data together in one document.

I’m open to suggestions for the best approach to take with this dataset!

References


  1. “Youth Risk Behavior Surveillance System (YRBSS)” Centers for Disease Control and Prevention, August 22, 2018, www.cdc.gov/healthyyouth/data/yrbs/index.htm.