Data Used
My second data set came from the discussion thread ‘Cause of Death’ by Raghunathan Ramnath. The ‘tidyr’ and ‘dplyr’ packages were the main tools used to manipulate the data. The data set covers Multiple Cause of Death, 1999-2015 Results.
Figure 1. Data Set 2. Multiple Cause of Death, 1999-2015 Results.
Reading in Data
This data set was created in MySQL and read into R using SQL.
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
## Warning: package 'knitr' was built under R version 3.3.2
## Loading required package: DBI
ID | CensusRegion | Deaths | Population | CrudeRateper100000 |
---|---|---|---|---|
1 | Census Region 1: Northeast (CENS-R1) | 461712 | 54653362 | 844.8 |
2 | Census Region 2: Midwest (CENS-R2) | 564665 | 66293689 | 851.8 |
3 | Census Region 3: South (CENS-R3) | 924360 | 110688742 | 835.1 |
4 | Census Region 4: West (CENS-R4) | 472975 | 69595414 | 679.6 |
5 | Total | 2423712 | 301231207 | 840.6 |
The following begins the cleanup process.
# Cleaning Data
Project6DS2<-tbl_df(Project6DS2)
Project6DS2a <- Project6DS2 %>% gather("RegionMetrics","Values",3:5)
Project6DS2a <- Project6DS2a[2:4]
Project6DS2a <- Project6DS2a %>% group_by(CensusRegion, RegionMetrics) %>% spread(CensusRegion, Values)
colnames(Project6DS2a) <- str_to_title(colnames(Project6DS2a))
kable(head(Project6DS2a), caption="Table 2. Data Cleaned Up and Inverted for Analysis")
Regionmetrics | Census Region 1: Northeast (Cens-R1) | Census Region 2: Midwest (Cens-R2) | Census Region 3: South (Cens-R3) | Census Region 4: West (Cens-R4) | Total |
---|---|---|---|---|---|
CrudeRateper100000 | 844.8 | 851.8 | 835.1 | 679.6 | 840.6 |
Deaths | 461712.0 | 564665.0 | 924360.0 | 472975.0 | 2423712.0 |
Population | 54653362.0 | 66293689.0 | 110688742.0 | 69595414.0 | 301231207.0 |
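The same inversion can also be sketched with `pivot_longer()` and `pivot_wider()`, which newer versions of tidyr provide as replacements for `gather()` and `spread()`. This is only an illustration, not the code used for this report: the `deaths` data frame below is a hand-typed stand-in for the table read from MySQL, and the names `deaths`, `deaths_long`, and `deaths_wide` are my own.

```r
library(dplyr)
library(tidyr)

# Stand-in for the table read from MySQL, typed in from the values shown above
deaths <- data.frame(
  ID = 1:5,
  CensusRegion = c(
    "Census Region 1: Northeast (CENS-R1)",
    "Census Region 2: Midwest (CENS-R2)",
    "Census Region 3: South (CENS-R3)",
    "Census Region 4: West (CENS-R4)",
    "Total"
  ),
  Deaths = c(461712, 564665, 924360, 472975, 2423712),
  Population = c(54653362, 66293689, 110688742, 69595414, 301231207),
  CrudeRateper100000 = c(844.8, 851.8, 835.1, 679.6, 840.6),
  stringsAsFactors = FALSE
)

# Long form: one row per region and metric (the role gather() plays above)
deaths_long <- deaths %>%
  select(-ID) %>%
  pivot_longer(
    cols = c(Deaths, Population, CrudeRateper100000),
    names_to = "RegionMetrics",
    values_to = "Values"
  )

# Wide form: one row per metric, one column per region (the role of spread())
deaths_wide <- deaths_long %>%
  pivot_wider(names_from = CensusRegion, values_from = Values)

print(deaths_wide)
```

The rows may come out in a different order than with `spread()`, since `pivot_wider()` keeps the metrics in order of first appearance rather than sorting them.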
Analyze Data
The thread did not request any specific analysis; it only described the type of data. One interesting thing to look at is the average across regions.
# Preparing Data for Analysis
Project6DS1b <- Project6DS1a %>% mutate(Delta = `Shipping Fees Collected`-`Price Of Carrier`)
Project6DS1b <- Project6DS1b[!Project6DS1b$Delta == 0, ]
#Method 1
# mean(Delta, 2) would pass 2 as the trim argument; round the mean to 2 decimals instead
Project6DS1c <- Project6DS1b %>% group_by(Country) %>% summarise(FeeDelta = round(mean(Delta), 2))
kable(Project6DS1c, caption="Table 3. Average Delta Fees Collected by Country")
#prep data for analysis
Project6DS2b <- Project6DS2a %>% mutate(Average = (`Census Region 1: Northeast (Cens-R1)`+`Census Region 2: Midwest (Cens-R2)`+`Census Region 3: South (Cens-R3)`+`Census Region 4: West (Cens-R4)`)%/%4)
#Method 1
kable(Project6DS2b, caption="Table 3. Average per Metric")
Regionmetrics | Census Region 1: Northeast (Cens-R1) | Census Region 2: Midwest (Cens-R2) | Census Region 3: South (Cens-R3) | Census Region 4: West (Cens-R4) | Total | Average |
---|---|---|---|---|---|---|
CrudeRateper100000 | 844.8 | 851.8 | 835.1 | 679.6 | 840.6 | 802 |
Deaths | 461712.0 | 564665.0 | 924360.0 | 472975.0 | 2423712.0 | 605928 |
Population | 54653362.0 | 66293689.0 | 110688742.0 | 69595414.0 | 301231207.0 | 75307801 |
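The Average column is a whole number because the code uses `%/%`, which is integer division in R and rounds the quotient down. A small sketch of the difference, using the crude rates from the table (the variable names here are only for illustration):

```r
# Crude rates per 100,000 for the four regions, copied from the table above
crude_rate <- c(844.8, 851.8, 835.1, 679.6)

# %/% is integer division: the quotient is rounded down to a whole number
avg_floor <- sum(crude_rate) %/% 4

# / keeps the fractional part; mean(crude_rate) gives the same result
avg_exact <- sum(crude_rate) / 4

print(c(floor = avg_floor, exact = avg_exact))
```

For counts such as Deaths and Population the rounding hardly matters, but for a rate it drops the decimal places that the regional values carry.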
Some observations:
- For the per-100K metric, Region 4 brings down the average.
- For Deaths and Population, Region 3 raises the average.
For further analysis, I would consider digging deeper into Region 3.
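One caveat on averaging the rate row: a simple mean of the four regional rates treats every region equally, while an overall rate weights each region by its population. The sketch below recomputes both from the Deaths and Population values in the table. The variable names are my own, and this is a sanity check rather than part of the original analysis.

```r
# Values copied from the table above (four regions, excluding the Total row)
deaths     <- c(Northeast = 461712, Midwest = 564665, South = 924360, West = 472975)
population <- c(Northeast = 54653362, Midwest = 66293689, South = 110688742, West = 69595414)

# Crude rate per 100,000 for each region, recomputed from the raw counts
region_rate <- deaths / population * 100000

# Simple (unweighted) mean of the four regional rates
unweighted_rate <- mean(region_rate)

# Population-weighted rate: total deaths over total population
weighted_rate <- sum(deaths) / sum(population) * 100000

print(round(region_rate, 1))
print(round(c(unweighted = unweighted_rate, weighted = weighted_rate), 1))
```

The recomputed regional rates agree with the table, but the weighted rate comes out near 804.6 rather than the 840.6 listed in the Total row, so that value may be worth checking against the source data.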