# load data
# I had to load part of the county data because it was too big for GitHub. This is 130K observations.
county <- read.csv("https://raw.githubusercontent.com/Godbero/CUNY-MSDA-IS606/master/cbp13copartial.csv")
library(rvest)
## Loading required package: xml2
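# Download the SAMHSA NSDUH substate change tables page
drugs <- read_html("http://www.samhsa.gov/data/sites/default/files/NSDUHsubstateChangeTabs2012/NSDUHsubstateChangeTabs2012.htm", encoding = "UTF-8")
# Collect every <table> node on the page, then parse the first one into a data frame
drugs.tables <- html_nodes(drugs, "table")
drugs.table1 <- html_table(drugs.tables[[1]], header = TRUE, fill = TRUE)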

Research question

You should phrase your research question in a way that matches up with the scope of inference your dataset allows for.

Is there a link between drug use and macroeconomic performance? I found an interesting article in The Economist suggesting that this has not been studied very well. After re-reading our textbook, I can refine my question.

Is drug use, on average, higher or lower in counties with low macroeconomic performance?

Cases

What are the cases, and how many are there?

I would like to get data at the US county level. As of 2013, there were 3,007 counties, 64 parishes, 19 organized boroughs, 11 census areas, 41 independent cities, and the District of Columbia, for a total of 3,143 counties and county-equivalents in the United States.
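A quick arithmetic check of the case count, using only the figures quoted above (the vector names are mine, chosen for readability):

```r
# Counts of counties and county-equivalents quoted in the text (as of 2013)
county_equivalents <- c(
  counties             = 3007,
  parishes             = 64,
  organized_boroughs   = 19,
  census_areas         = 11,
  independent_cities   = 41,
  district_of_columbia = 1
)

# Total number of potential cases at the county level
total_cases <- sum(county_equivalents)
total_cases
```

The components add up to the 3,143 total stated above.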

If I cannot get both drug and economic data at the county level, I will move up to the state level, which would give me around 51 cases.

Data collection

Describe the method of data collection.

The Top 10 economic indicators are:

  1. Real GDP (Gross Domestic Product)
  2. M2 (Money Supply)
  3. Consumer Price Index (CPI)
  4. Producer Price Index (PPI)
  5. Consumer Confidence Survey
  6. Current Employment Statistics (CES)
  7. Retail Trade Sales and Food Services Sales
  8. Housing Starts (Formally Known as “New Residential Construction”)
  9. Manufacturing and Trade Inventories and Sales
  10. S&P 500 Stock Index (the S&P 500)

The idea would be to find county-level equivalents for as many of these as possible and build my own measure of economic performance. For example, many consider GDP to be THE best indicator, but can I find or calculate GDP for each US county?
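One way such a combined measure could be built is an equal-weight average of standardized indicators. The sketch below is only an illustration of that idea: the data frame `indicators`, its column names, and every value in it are invented, and equal weighting is an assumption rather than a settled choice.

```r
# Hypothetical county-level indicators; all values are invented for illustration
indicators <- data.frame(
  county         = c("A", "B", "C", "D"),
  employment     = c(1200, 300, 4500, 800),
  annual_payroll = c(52000, 9000, 210000, 30000),
  establishments = c(90, 25, 310, 60),
  stringsAsFactors = FALSE
)

# Standardize each indicator to mean 0 and SD 1 so they share a common scale
z <- scale(indicators[, c("employment", "annual_payroll", "establishments")])

# Equal-weight composite: the average z-score across the indicators
indicators$econ_index <- rowMeans(z)

# Order counties from strongest to weakest on the composite
indicators[order(-indicators$econ_index), c("county", "econ_index")]
```

Raw totals mostly reflect county size, so a real version would probably need per-capita or per-employee versions of each indicator before standardizing.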

A lot of economic data, especially at the local level, is collected by the US Census Bureau. I will start there and see if I can find what I need.

I would need to find data on drug use at the county level as well. I am not certain whether I will find more in health resources (CDC, NIH, etc.) or crime resources (FBI, etc.). I will examine both.

Type of study

What type of study is this (observational/experiment)?

Observational

Data Source

If you collected the data, state self-collected. If not, provide a citation/link.

The county data from Chapter 1 in our textbook seemed like a good place to start. When I followed the link from our book, I ended up at QuickFacts on Census.gov and could not recreate the data set used in the book. The interface seemed to return only one county at a time. With over 3,000 counties, this did not seem the way to go.

After spending too much time with QuickFacts (the book could do it), I searched for county-level data on the Census site and found County Business Patterns (CBP). I have a huge file that should give me a way to examine economic activity at the county level.

Field Name   Data Type   Description
FIPSTATE     C           FIPS State Code
FIPSCTY      C           FIPS County Code
NAICS        C           Industry Code
EMPFLAG      C           Data Suppression Flag
EMP_NF       C           Total Mid-March Employees Noise Flag
EMP          N           Total Mid-March Employees with Noise
QP1_NF       C           Total First Quarter Payroll Noise Flag
QP1          N           Total First Quarter Payroll ($1,000) with Noise
AP_NF        C           Total Annual Payroll Noise Flag
AP           N           Total Annual Payroll ($1,000) with Noise
EST          N           Total Number of Establishments
N1_4         N           Number of Establishments: 1-4 Employee Size Class
N5_9         N           Number of Establishments: 5-9 Employee Size Class
N10_19       N           Number of Establishments: 10-19 Employee Size Class
N20_49       N           Number of Establishments: 20-49 Employee Size Class
N50_99       N           Number of Establishments: 50-99 Employee Size Class
N100_249     N           Number of Establishments: 100-249 Employee Size Class
N250_499     N           Number of Establishments: 250-499 Employee Size Class
N500_999     N           Number of Establishments: 500-999 Employee Size Class
N1000        N           Number of Establishments: 1,000 or More Employee Size Class
N1000_1      N           Number of Establishments: Employment Size Class: 1,000-1,499 Employees
N1000_2      N           Number of Establishments: Employment Size Class: 1,500-2,499 Employees
N1000_3      N           Number of Establishments: Employment Size Class: 2,500-4,999 Employees
N1000_4      N           Number of Establishments: Employment Size Class: 5,000 or More Employees
CENSTATE     C           Census State Code
CENCTY       C           Census County Code
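The fields above could be reduced to one row per county along the following lines. This is a sketch on invented toy rows, not the real file. It assumes, based on my reading of the CBP record layout, that the NAICS value `------` marks the all-industry total for a county. The names `cbp`, `county_totals`, `fips`, and `pay_per_emp` are my own.

```r
# Toy rows shaped like the CBP file; all values are invented for illustration
cbp <- data.frame(
  fipstate = c(1, 1, 1, 1),
  fipscty  = c(1, 1, 3, 3),
  naics    = c("------", "23----", "------", "44----"),
  emp      = c(500, 120, 900, 300),
  ap       = c(20000, 5000, 41000, 9000),
  est      = c(40, 8, 75, 20),
  stringsAsFactors = FALSE
)

# Keep only the all-industry total row for each county
# (assumes "------" is the total code; check the record layout before relying on it)
county_totals <- cbp[cbp$naics == "------", ]

# Build a 5-digit FIPS key: 2-digit state code followed by 3-digit county code
county_totals$fips <- sprintf("%02d%03d",
                              as.integer(county_totals$fipstate),
                              as.integer(county_totals$fipscty))

# Average annual pay per employee, in $1,000 (AP is reported in $1,000)
county_totals$pay_per_emp <- county_totals$ap / county_totals$emp

county_totals[, c("fips", "emp", "ap", "est", "pay_per_emp")]
```

A zero-padded FIPS key matters because the state and county codes load as numbers, so leading zeros are lost unless they are restored before merging with other sources.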

There seems to be data over several decades, so I should be able to track and establish economic trends at the county level.

On the issue of drug use, the data does not seem to be as plentiful. I found data on the Substance Abuse and Mental Health Services Administration (SAMHSA) site. It does not go back very far (2002 is the earliest I have found so far), and it uses substate regions rather than counties. Bringing the two data sets together in a form I can compare may be a challenge.
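If a mapping from counties to SAMHSA substate regions can be found or built, the county economic totals could be rolled up to the region level and joined to the drug estimates. The sketch below illustrates that step only. The `crosswalk` table, the region labels, and every number are hypothetical, and I have not yet confirmed that such a crosswalk is available.

```r
# Hypothetical crosswalk from county FIPS to substate region (invented values)
crosswalk <- data.frame(
  fips   = c("01001", "01003", "01005", "01007"),
  region = c("Region 1", "Region 1", "Region 2", "Region 2"),
  stringsAsFactors = FALSE
)

# Hypothetical county totals for employment and annual payroll ($1,000)
county_econ <- data.frame(
  fips = c("01001", "01003", "01005", "01007"),
  emp  = c(500, 900, 200, 400),
  ap   = c(20000, 41000, 6000, 14000),
  stringsAsFactors = FALSE
)

# Hypothetical substate drug-use estimates (percent)
drug_use <- data.frame(
  region   = c("Region 1", "Region 2"),
  estimate = c(8.5, 10.2),
  stringsAsFactors = FALSE
)

# Attach each county's region, then sum the economic totals within each region
county_region <- merge(county_econ, crosswalk, by = "fips")
region_econ <- aggregate(cbind(emp, ap) ~ region, data = county_region, FUN = sum)

# One row per region with both the economic and the drug-use measures
combined <- merge(region_econ, drug_use, by = "region")
combined
```

Aggregating upward to regions loses county detail, but it keeps the drug estimates as published instead of guessing how to split them across counties.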

Response

What is the response variable, and what type is it (numerical/categorical)?

Is drug use, on average, higher or lower in counties with low macroeconomic performance? If we suspect macroeconomic performance might affect drug use in a county, then macroeconomic performance is the explanatory variable and drug use is the response variable in the relationship.

Drug use is the response variable, and it is numerical.

Explanatory

What is the explanatory variable, and what type is it (numerical/categorical)?

Macroeconomic performance is the explanatory variable, and it is numerical.
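With two numerical variables, the relationship could be summarized with a correlation and a simple linear regression. The sketch below shows the mechanics only. The data frame `study` and its values are invented, so the negative slope it produces is an artifact of the toy numbers and not a finding.

```r
# Hypothetical data: one row per case, values invented for illustration only
study <- data.frame(
  econ_index = c(-1.5, -0.8, -0.2, 0.3, 0.9, 1.4),  # explanatory variable
  drug_use   = c(11.2, 10.1, 9.4, 8.8, 8.1, 7.0)    # response variable (percent)
)

# Strength and direction of the linear association
r <- cor(study$econ_index, study$drug_use)

# Simple linear regression of the response on the explanatory variable
fit <- lm(drug_use ~ econ_index, data = study)
coef(fit)
```

Because the study is observational, any association found this way would not by itself show that economic performance causes changes in drug use.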

Relevant summary statistics

Provide summary statistics relevant to your research question. For example, if you’re comparing means across groups provide means, SDs, sample sizes of each group. This step requires the use of R, hence a code chunk is provided below. Insert more code chunks as needed.

I will need to do some serious tidying of the data before I can say anything useful. Below I run summary() on about 1/8 of the raw county data. The untidy drug data gives similarly useless results. I may need to backtrack and work on state-level drug use data, since there are so many holes at the substate level. I can compare the states, figure out where they rank in drug use, and look for changes over the time frame of the data. Then I can go back to Census.gov and find state-level economic data for the same time period.

summary(county)
##     fipstate        fipscty           naics          empflag     
##  Min.   :1.00    Min.   :  1.00          :64999          :83405  
##  1st Qu.:1.00    1st Qu.: 21.00   ------ :  107   A      :30998  
##  Median :1.00    Median : 73.00   81---- :  107   B      :10572  
##  Mean   :1.54    Mean   : 81.53   23---- :  106   C      : 2809  
##  3rd Qu.:2.00    3rd Qu.:110.00   42---- :  106   E      : 1170  
##  Max.   :4.00    Max.   :999.00   44---- :  106   F      :  599  
##  NA's   :64999   NA's   :64999    (Other):64467   (Other):  445  
##  emp_nf         emp            qp1_nf         qp1           ap_nf    
##   :64999   Min.   :      0.0    :64999   Min.   :       0    :64999  
##  D:40899   1st Qu.:      0.0   D:40899   1st Qu.:       0   D:40899  
##  G:10708   Median :      0.0   G:12928   Median :       0   G:14150  
##  H: 7698   Mean   :    277.6   H: 9382   Mean   :    2863   H: 9950  
##  S: 5694   3rd Qu.:     23.0   S: 1790   3rd Qu.:     222            
##            Max.   :1491582.0             Max.   :16847973            
##            NA's   :64999                 NA's   :64999               
##        ap                est               n1_4         
##  Min.   :       0   Min.   :    1.0   Min.   :    0.00  
##  1st Qu.:       0   1st Qu.:    1.0   1st Qu.:    1.00  
##  Median :       0   Median :    2.0   Median :    1.00  
##  Mean   :   11903   Mean   :   19.9   Mean   :   10.44  
##  3rd Qu.:    1066   3rd Qu.:    7.0   3rd Qu.:    4.00  
##  Max.   :67740763   Max.   :86045.0   Max.   :47161.00  
##  NA's   :64999      NA's   :64999     NA's   :64999     
##       n5_9              n10_19             n20_49            n50_99      
##  Min.   :    0.00   Min.   :    0.00   Min.   :   0.00   Min.   :   0.0  
##  1st Qu.:    0.00   1st Qu.:    0.00   1st Qu.:   0.00   1st Qu.:   0.0  
##  Median :    0.00   Median :    0.00   Median :   0.00   Median :   0.0  
##  Mean   :    3.87   Mean   :    2.64   Mean   :   1.84   Mean   :   0.6  
##  3rd Qu.:    2.00   3rd Qu.:    1.00   3rd Qu.:   1.00   3rd Qu.:   0.0  
##  Max.   :14679.00   Max.   :10724.00   Max.   :8144.00   Max.   :2907.0  
##  NA's   :64999      NA's   :64999      NA's   :64999     NA's   :64999   
##     n100_249          n250_499        n500_999          n1000      
##  Min.   :   0.00   Min.   :  0.0   Min.   :  0.00   Min.   : 0.00  
##  1st Qu.:   0.00   1st Qu.:  0.0   1st Qu.:  0.00   1st Qu.: 0.00  
##  Median :   0.00   Median :  0.0   Median :  0.00   Median : 0.00  
##  Mean   :   0.35   Mean   :  0.1   Mean   :  0.04   Mean   : 0.02  
##  3rd Qu.:   0.00   3rd Qu.:  0.0   3rd Qu.:  0.00   3rd Qu.: 0.00  
##  Max.   :1724.00   Max.   :454.0   Max.   :166.00   Max.   :86.00  
##  NA's   :64999     NA's   :64999   NA's   :64999    NA's   :64999  
##     n1000_1         n1000_2         n1000_3         n1000_4     
##  Min.   : 0.00   Min.   : 0.00   Min.   : 0      Min.   :0      
##  1st Qu.: 0.00   1st Qu.: 0.00   1st Qu.: 0      1st Qu.:0      
##  Median : 0.00   Median : 0.00   Median : 0      Median :0      
##  Mean   : 0.01   Mean   : 0.01   Mean   : 0      Mean   :0      
##  3rd Qu.: 0.00   3rd Qu.: 0.00   3rd Qu.: 0      3rd Qu.:0      
##  Max.   :35.00   Max.   :28.00   Max.   :16      Max.   :7      
##  NA's   :64999   NA's   :64999   NA's   :64999   NA's   :64999  
##     censtate         cencty      
##  Min.   :63.00   Min.   :  1.00  
##  1st Qu.:63.00   1st Qu.: 21.00  
##  Median :63.00   Median : 73.00  
##  Mean   :71.41   Mean   : 81.53  
##  3rd Qu.:86.00   3rd Qu.:110.00  
##  Max.   :94.00   Max.   :999.00  
##  NA's   :64999   NA's   :64999
summary(drugs.table1)
##  State/Substate Region 2008-2010(Estimate)
##  Length:469            Length:469         
##  Class :character      Class :character   
##  Mode  :character      Mode  :character   
##                                           
##                                           
##                                           
##                                           
##  2008-2010(95% ConfidenceInterval) 2010-2012(Estimate)
##  Length:469                        Min.   : 3.920     
##  Class :character                  1st Qu.: 7.070     
##  Mode  :character                  Median : 8.485     
##                                    Mean   : 8.942     
##                                    3rd Qu.:10.502     
##                                    Max.   :19.410     
##                                    NA's   :1          
##  2010-2012(95% ConfidenceInterval)   P Value            NA         
##  Length:469                        Length:469         Mode:logical  
##  Class :character                  Class :character   NA's:469      
##  Mode  :character                  Mode  :character                 
##                                                                     
##                                                                     
##                                                                     
##                                                                     
##     NA             NA             NA         
##  Mode:logical   Mode:logical   Mode:logical  
##  NA's:469       NA's:469       NA's:469      
##                                              
##                                              
##                                              
##                                              
##
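The summaries above point to some concrete tidying steps. Every numeric county column reports 64,999 NA's, and naics has 64,999 blank values, which looks like fully blank rows in the CSV (that is my guess, not something I have verified). In the drug table, the 2008-2010 estimate was read as character, and four unnamed columns are entirely NA. The sketch below shows one way to handle these on invented toy data with hypothetical column names.

```r
# Toy data mimicking the problems seen in the summaries (all values invented)
county_raw <- data.frame(
  fipstate = c(1, NA, 1, NA),
  fipscty  = c(1, NA, 3, NA),
  naics    = c("------", "", "23----", ""),
  emp      = c(500, NA, 120, NA),
  stringsAsFactors = FALSE
)
drugs_raw <- data.frame(
  region        = c("Alabama", "Region 1"),
  est_2008_2010 = c("8.12", "7.95"),   # read in as text
  est_2010_2012 = c(8.30, 7.60),
  empty         = c(NA, NA),           # column with no data at all
  stringsAsFactors = FALSE
)

# Drop rows where every numeric field is missing (the apparent blank lines)
numeric_cols <- sapply(county_raw, is.numeric)
blank_row <- rowSums(!is.na(county_raw[, numeric_cols, drop = FALSE])) == 0
county_clean <- county_raw[!blank_row, ]

# Drop columns that are entirely NA
drugs_clean <- drugs_raw[, colSums(!is.na(drugs_raw)) > 0]

# Convert the text estimate to numeric, then compute the change between periods
drugs_clean$est_2008_2010 <- as.numeric(drugs_clean$est_2008_2010)
drugs_clean$change <- drugs_clean$est_2010_2012 - drugs_clean$est_2008_2010
```

On the real table, as.numeric() would turn any non-numeric text into NA with a warning, so those cells would need to be inspected before trusting the conversion.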