# load data
# I had to load part of the county data because it was too big for GitHub. This is 130K observations.
county <- read.csv("https://raw.githubusercontent.com/Godbero/CUNY-MSDA-IS606/master/cbp13copartial.csv")
library(rvest)
## Loading required package: xml2
drugs <- read_html("http://www.samhsa.gov/data/sites/default/files/NSDUHsubstateChangeTabs2012/NSDUHsubstateChangeTabs2012.htm", encoding = "UTF-8")
drugs.tables <- html_nodes(drugs, "table")
drugs.table1 <- html_table(drugs.tables[[1]], header = TRUE, fill = TRUE)
You should phrase your research question in a way that matches up with the scope of inference your dataset allows for.
Is there a link between drug use and macroeconomic performance? I found an interesting article in The Economist suggesting that this question has not been studied very well. After re-reading our textbook, I can refine my question.
Is drug use, on average, higher or lower in counties with low macroeconomic performance?
What are the cases, and how many are there?
I would like to get data at the US county level. As of 2013, there were 3,007 counties, 64 parishes, 19 organized boroughs, 11 census areas, 41 independent cities, and the District of Columbia, for a total of 3,143 counties and county-equivalents in the United States.
If I cannot get both drug and economic data at the county level, I will move up to the state level, which gives about 51 cases (the 50 states plus the District of Columbia).
Describe the method of data collection.
The Top 10 economic indicators are:
The idea would be to find county-level equivalents for as many of these indicators as possible and build my own measure of economic performance. For example, GDP is considered by many to be THE best indicator, but can I find or calculate GDP for each US county?
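One way such a combined measure could be built is sketched below: standardize each indicator and average the z-scores. This is only an illustration under stated assumptions. The data frame `indicators`, its columns `pay_per_emp` and `emp_per_est`, the equal weighting, and every value are hypothetical and not taken from any real data set.

```r
# Toy county-level indicators (all values invented for illustration).
indicators <- data.frame(
  fips        = c("01001", "01003", "01005", "01007"),
  pay_per_emp = c(32.5, 35.1, 28.4, 30.0),  # annual payroll per employee, $1,000
  emp_per_est = c(12.0, 14.5, 9.8, 10.7),   # employees per establishment
  stringsAsFactors = FALSE
)

# Standardize each indicator (mean 0, SD 1) so they share a common scale.
z <- scale(indicators[, c("pay_per_emp", "emp_per_est")])

# Equal-weight composite: the row mean of the z-scores (weights are an assumption).
indicators$econ_index <- rowMeans(z)

# Counties ranked from highest to lowest composite score.
indicators[order(-indicators$econ_index), c("fips", "econ_index")]
```

Equal weights are the simplest choice; a different weighting would be just as easy to plug in once the set of indicators is settled.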
A lot of economic data, especially at the local level, is collected by the US Census Bureau. I will start there and see if I can find what I need.
I will need to find data on drug use at the county level as well. I am not certain whether I will find more in health resources (CDC, NIH, etc.) or in crime resources (FBI, etc.). I will examine both.
What type of study is this (observational/experiment)?
Observational
If you collected the data, state self-collected. If not, provide a citation/link.
The county data from Chapter 1 of our textbook seemed like a good place to start. When I followed the link from the book, I ended up at QuickFacts on Census.gov and could not recreate the data set used in the book. The interface seemed to return only one county at a time, which is not practical with over 3,000 counties. link
After spending too much time with QuickFacts (the book's authors managed it), I searched the Census site for county-level data and found County Business Patterns (CBP). I now have a large file that should give me a way to examine economic activity at the county level.
| Field Name | Data Type | Description |
|---|---|---|
| FIPSTATE | C | FIPS State Code |
| FIPSCTY | C | FIPS County Code |
| NAICS | C | Industry Code |
| EMPFLAG | C | Data Suppression Flag |
| EMP_NF | C | Total Mid-March Employees Noise Flag |
| EMP | N | Total Mid-March Employees with Noise |
| QP1_NF | C | Total First Quarter Payroll Noise Flag |
| QP1 | N | Total First Quarter Payroll ($1,000) with Noise |
| AP_NF | C | Total Annual Payroll Noise Flag |
| AP | N | Total Annual Payroll ($1,000) with Noise |
| EST | N | Total Number of Establishments |
| N1_4 | N | Number of Establishments: 1-4 Employee Size Class |
| N5_9 | N | Number of Establishments: 5-9 Employee Size Class |
| N10_19 | N | Number of Establishments: 10-19 Employee Size Class |
| N20_49 | N | Number of Establishments: 20-49 Employee Size Class |
| N50_99 | N | Number of Establishments: 50-99 Employee Size Class |
| N100_249 | N | Number of Establishments: 100-249 Employee Size Class |
| N250_499 | N | Number of Establishments: 250-499 Employee Size Class |
| N500_999 | N | Number of Establishments: 500-999 Employee Size Class |
| N1000 | N | Number of Establishments: 1,000 or More Employee Size Class |
| N1000_1 | N | Number of Establishments: Employment Size Class: 1,000-1,499 Employees |
| N1000_2 | N | Number of Establishments: Employment Size Class: 1,500-2,499 Employees |
| N1000_3 | N | Number of Establishments: Employment Size Class: 2,500-4,999 Employees |
| N1000_4 | N | Number of Establishments: Employment Size Class: 5,000 or More Employees |
| CENSTATE | C | Census State Code |
| CENCTY | C | Census County Code |
There seems to be data over several decades, so I should be able to track and establish economic trends at the county level.
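As a rough sketch of how the fields in the record layout above could yield one economic figure per county, the code below keeps a single total row per county and computes annual payroll per employee. It assumes that the NAICS value `------` marks the all-industry total for a county, which should be checked against the CBP documentation. The data frame `cbp` is a toy stand-in with invented values, and `pay_per_emp` is a hypothetical name.

```r
# Toy rows mimicking the CBP layout (values invented for illustration).
cbp <- data.frame(
  fipstate = c(1, 1, 1, 1),
  fipscty  = c(1, 1, 3, 3),
  naics    = c("------", "23----", "------", "23----"),
  emp      = c(1000, 120, 5000, 400),
  ap       = c(32000, 4800, 175000, 16000),  # annual payroll in $1,000
  stringsAsFactors = FALSE
)

# Keep only the (assumed) all-industry total row for each county.
totals <- cbp[cbp$naics == "------", ]

# Build the 5-digit county FIPS code: 2-digit state + 3-digit county.
totals$fips <- sprintf("%02d%03d",
                       as.integer(totals$fipstate),
                       as.integer(totals$fipscty))

# Annual payroll per employee, in $1,000; left as NA when emp is 0.
totals$pay_per_emp <- ifelse(totals$emp > 0, totals$ap / totals$emp, NA_real_)

totals[, c("fips", "emp", "ap", "pay_per_emp")]
```

A zero-padded FIPS string is used as the key because other county-level sources tend to identify counties that way, which should make later merges simpler.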
Data on drug use do not seem to be as plentiful. I found data on the Substance Abuse and Mental Health Services Administration (SAMHSA) site. It does not go back very far (2002 is the earliest I have found so far), and it is reported by substate region rather than by county. Bringing the two sources together in a form I can compare may be a challenge.
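One possible way to line up the two sources is to roll the county data up to substate regions using a county-to-region crosswalk. The sketch below assumes such a crosswalk exists or can be built by hand. The tables `crosswalk` and `county_totals`, the region names, and all values are hypothetical.

```r
# Hypothetical crosswalk: which substate region each county belongs to.
crosswalk <- data.frame(
  fips   = c("01001", "01003", "01005", "01007"),
  region = c("Region 1", "Region 1", "Region 2", "Region 2"),
  stringsAsFactors = FALSE
)

# Toy county totals (invented values): employees and annual payroll ($1,000).
county_totals <- data.frame(
  fips = c("01001", "01003", "01005", "01007"),
  emp  = c(1000, 5000, 800, 1200),
  ap   = c(32000, 175000, 20000, 36000),
  stringsAsFactors = FALSE
)

# Attach the region to each county, then sum the counts within each region.
merged <- merge(county_totals, crosswalk, by = "fips")
region_totals <- aggregate(cbind(emp, ap) ~ region, data = merged, FUN = sum)

# Recompute the ratio from the summed totals rather than averaging county ratios.
region_totals$pay_per_emp <- region_totals$ap / region_totals$emp

region_totals
```

Summing first and dividing afterwards weights each county by its employment, so a small county does not count as much as a large one in the regional figure.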
What is the response variable, and what type is it (numerical/categorical)?
Is drug use, on average, higher or lower in counties with low macroeconomic performance? If we suspect macroeconomic performance might affect drug use in a county, then macroeconomic performance is the explanatory variable and drug use is the response variable in the relationship.
Drug use is the response variable, and it is numerical.
What is the explanatory variable, and what type is it (numerical/categorical)?
Macroeconomic performance is the explanatory variable, and it is numerical.
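With a numerical response and a numerical explanatory variable, the relationship could be examined with a correlation and a simple linear regression. The sketch below shows the mechanics only. The data frame `cases`, its columns `econ_index` and `drug_use`, and all values are invented and say nothing about the real relationship.

```r
# Toy data, one row per case (all values invented for illustration).
cases <- data.frame(
  econ_index = c(-1.2, -0.5, 0.0, 0.4, 1.3),  # explanatory: economic performance
  drug_use   = c(10.1, 9.4, 8.8, 8.9, 7.6)    # response: percent reporting use
)

# Strength and direction of the linear association.
r <- cor(cases$econ_index, cases$drug_use)

# Simple linear regression of the response on the explanatory variable.
fit <- lm(drug_use ~ econ_index, data = cases)

# Intercept and slope; the slope's sign gives the direction of the association.
coef(fit)
```

Because the study is observational, a slope estimated this way would describe an association in the data, not a causal effect of economic performance on drug use.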
Provide summary statistics relevant to your research question. For example, if you’re comparing means across groups provide means, SDs, sample sizes of each group. This step requires the use of R, hence a code chunk is provided below. Insert more code chunks as needed.
I will need to do some serious tidying of the data before I can say anything useful. Below I run summary() on about 1/8 of the raw county data. The untidy drug data give similarly uninformative results. I may need to backtrack and work with state-level drug use data, since there are so many holes at the substate level. I can compare the states, work out where they rank in drug use, and look for changes over the time frame of the data. Then I can go back to Census.gov and find state-level economic data for the same time period.
summary(county)
## fipstate fipscty naics empflag
## Min. :1.00 Min. : 1.00 :64999 :83405
## 1st Qu.:1.00 1st Qu.: 21.00 ------ : 107 A :30998
## Median :1.00 Median : 73.00 81---- : 107 B :10572
## Mean :1.54 Mean : 81.53 23---- : 106 C : 2809
## 3rd Qu.:2.00 3rd Qu.:110.00 42---- : 106 E : 1170
## Max. :4.00 Max. :999.00 44---- : 106 F : 599
## NA's :64999 NA's :64999 (Other):64467 (Other): 445
## emp_nf emp qp1_nf qp1 ap_nf
## :64999 Min. : 0.0 :64999 Min. : 0 :64999
## D:40899 1st Qu.: 0.0 D:40899 1st Qu.: 0 D:40899
## G:10708 Median : 0.0 G:12928 Median : 0 G:14150
## H: 7698 Mean : 277.6 H: 9382 Mean : 2863 H: 9950
## S: 5694 3rd Qu.: 23.0 S: 1790 3rd Qu.: 222
## Max. :1491582.0 Max. :16847973
## NA's :64999 NA's :64999
## ap est n1_4
## Min. : 0 Min. : 1.0 Min. : 0.00
## 1st Qu.: 0 1st Qu.: 1.0 1st Qu.: 1.00
## Median : 0 Median : 2.0 Median : 1.00
## Mean : 11903 Mean : 19.9 Mean : 10.44
## 3rd Qu.: 1066 3rd Qu.: 7.0 3rd Qu.: 4.00
## Max. :67740763 Max. :86045.0 Max. :47161.00
## NA's :64999 NA's :64999 NA's :64999
## n5_9 n10_19 n20_49 n50_99
## Min. : 0.00 Min. : 0.00 Min. : 0.00 Min. : 0.0
## 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.0
## Median : 0.00 Median : 0.00 Median : 0.00 Median : 0.0
## Mean : 3.87 Mean : 2.64 Mean : 1.84 Mean : 0.6
## 3rd Qu.: 2.00 3rd Qu.: 1.00 3rd Qu.: 1.00 3rd Qu.: 0.0
## Max. :14679.00 Max. :10724.00 Max. :8144.00 Max. :2907.0
## NA's :64999 NA's :64999 NA's :64999 NA's :64999
## n100_249 n250_499 n500_999 n1000
## Min. : 0.00 Min. : 0.0 Min. : 0.00 Min. : 0.00
## 1st Qu.: 0.00 1st Qu.: 0.0 1st Qu.: 0.00 1st Qu.: 0.00
## Median : 0.00 Median : 0.0 Median : 0.00 Median : 0.00
## Mean : 0.35 Mean : 0.1 Mean : 0.04 Mean : 0.02
## 3rd Qu.: 0.00 3rd Qu.: 0.0 3rd Qu.: 0.00 3rd Qu.: 0.00
## Max. :1724.00 Max. :454.0 Max. :166.00 Max. :86.00
## NA's :64999 NA's :64999 NA's :64999 NA's :64999
## n1000_1 n1000_2 n1000_3 n1000_4
## Min. : 0.00 Min. : 0.00 Min. : 0 Min. :0
## 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0 1st Qu.:0
## Median : 0.00 Median : 0.00 Median : 0 Median :0
## Mean : 0.01 Mean : 0.01 Mean : 0 Mean :0
## 3rd Qu.: 0.00 3rd Qu.: 0.00 3rd Qu.: 0 3rd Qu.:0
## Max. :35.00 Max. :28.00 Max. :16 Max. :7
## NA's :64999 NA's :64999 NA's :64999 NA's :64999
## censtate cencty
## Min. :63.00 Min. : 1.00
## 1st Qu.:63.00 1st Qu.: 21.00
## Median :63.00 Median : 73.00
## Mean :71.41 Mean : 81.53
## 3rd Qu.:86.00 3rd Qu.:110.00
## Max. :94.00 Max. :999.00
## NA's :64999 NA's :64999
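The summary above reports 64,999 NA's in every numeric column and the same number of blank naics values, which suggests that roughly half of the rows are entirely empty. A first tidying step could be to drop rows in which every cell is NA or an empty string. The sketch below uses a toy data frame `raw` with invented values rather than the real file.

```r
# Toy frame with the same symptom: rows that are blank or NA in every column.
raw <- data.frame(
  fipstate = c(1, NA, 1, NA),
  naics    = c("------", "", "23----", ""),
  emp      = c(1000, NA, 120, NA),
  stringsAsFactors = FALSE
)

# A cell counts as empty when it is NA or an empty string.
is_empty <- is.na(raw) | raw == ""

# Keep rows that have at least one non-empty cell.
clean <- raw[rowSums(is_empty) < ncol(raw), ]

nrow(clean)
```

Rows that are only partly empty are kept on purpose, since suppressed CBP cells may still carry usable values in other columns.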
summary(drugs.table1)
## State/Substate Region 2008-2010(Estimate)
## Length:469 Length:469
## Class :character Class :character
## Mode :character Mode :character
##
##
##
##
## 2008-2010(95% ConfidenceInterval) 2010-2012(Estimate)
## Length:469 Min. : 3.920
## Class :character 1st Qu.: 7.070
## Mode :character Median : 8.485
## Mean : 8.942
## 3rd Qu.:10.502
## Max. :19.410
## NA's :1
## 2010-2012(95% ConfidenceInterval) P Value NA
## Length:469 Length:469 Mode:logical
## Class :character Class :character NA's:469
## Mode :character Mode :character
##
##
##
##
## NA NA NA
## Mode:logical Mode:logical Mode:logical
## NA's:469 NA's:469 NA's:469
##
##
##
##
##
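The summary above shows two obvious problems with the scraped table: four columns that hold nothing but NA, and a 2008-2010 estimate column stored as character while the 2010-2012 estimate is numeric. The sketch below shows one way to address both. It uses a toy stand-in `tbl` with invented values and hypothetical column names, and it assumes the character column is text only because some cells carry non-numeric characters, which I have not verified.

```r
# Toy stand-in for drugs.table1 (values invented; the last column is all NA).
tbl <- data.frame(
  region   = c("Alabama", "Region 1", "Region 2"),
  est_0810 = c("6.50", "7.10%", "*"),
  est_1012 = c(6.8, 7.4, 5.9),
  extra    = c(NA, NA, NA),
  stringsAsFactors = FALSE
)

# Drop columns that contain nothing but NA.
tbl <- tbl[, colSums(!is.na(tbl)) > 0]

# Strip everything except digits and the decimal point, then convert;
# cells with no digits left become NA.
tbl$est_0810 <- suppressWarnings(as.numeric(gsub("[^0-9.]", "", tbl$est_0810)))

str(tbl)
```

Once both estimate columns are numeric, summary() should report quartiles for each of them instead of only a length and class.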