Project 2 - Data Set 2

Cesar L. Espitia

March 12, 2017

Data Used

My second data set comes from the discussion thread ‘Cause of Death’ by Raghunathan Ramnath. The ‘tidyr’ and ‘dplyr’ packages were the main tools used to manipulate the data, which covers the Multiple Cause of Death, 1999-2015 results.

Figure 1. Data Set 2. Multiple Cause of Death, 1999-2015 Results.

Reading in Data

The data set was created in MySQL and read into R over a SQL connection.
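The read step itself is not shown, so the following is only a minimal sketch of how a table might be pulled from MySQL through DBI. The helper name `read_region_table`, the connection arguments, and the default table name are all placeholders I am assuming for illustration, and the RMySQL driver is a guess at what was used.

```r
library(DBI)

# Hypothetical helper: every argument below is a placeholder, not the
# connection actually used for this project.
read_region_table <- function(dbname, user, password,
                              host = "localhost",
                              table = "Project6DS2") {
  # Assumes the RMySQL package is installed to provide the MySQL driver
  con <- dbConnect(RMySQL::MySQL(), dbname = dbname, host = host,
                   user = user, password = password)
  # Close the connection even if the read fails
  on.exit(dbDisconnect(con), add = TRUE)
  # Return the whole table as a data frame
  dbReadTable(con, table)
}
```

Wrapping the read in a function keeps credentials out of the analysis code and guarantees the connection is closed.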

## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
## Warning: package 'knitr' was built under R version 3.3.2
## Loading required package: DBI
Table 1. Untidy Data

| ID | CensusRegion | Deaths | Population | CrudeRateper100000 |
|---:|:-------------|-------:|-----------:|-------------------:|
| 1 | Census Region 1: Northeast (CENS-R1) | 461712 | 54653362 | 844.8 |
| 2 | Census Region 2: Midwest (CENS-R2) | 564665 | 66293689 | 851.8 |
| 3 | Census Region 3: South (CENS-R3) | 924360 | 110688742 | 835.1 |
| 4 | Census Region 4: West (CENS-R4) | 472975 | 69595414 | 679.6 |
| 5 | Total | 2423712 | 301231207 | 840.6 |
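As a quick sanity check on Table 1, the crude rate can be recomputed as deaths divided by population, times 100,000. The sketch below types the table values in by hand and uses only base R. The name `regions`, the shortened region labels, and the `Reported`/`Recomputed` column names are my own and do not appear in the original script.

```r
# Table 1 values entered by hand; `regions` and the short labels are illustrative
regions <- data.frame(
  CensusRegion = c("Northeast", "Midwest", "South", "West", "Total"),
  Deaths       = c(461712, 564665, 924360, 472975, 2423712),
  Population   = c(54653362, 66293689, 110688742, 69595414, 301231207),
  Reported     = c(844.8, 851.8, 835.1, 679.6, 840.6)
)

# Crude rate per 100,000 = deaths / population * 100,000
regions$Recomputed <- round(regions$Deaths / regions$Population * 1e5, 1)

# Show any rows where the recomputed rate disagrees with the reported one
regions[abs(regions$Recomputed - regions$Reported) > 0.05, ]
```

The four regional rates match to one decimal place. The Total row recomputes to roughly 804.6 rather than the 840.6 shown, so that figure may be worth re-checking against the source.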

The following begins the cleanup process.

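# Cleaning Data
library(tidyr)    # gather(), spread()
library(stringr)  # str_to_title()

# Convert to a tibble (as_tibble() replaces the deprecated tbl_df())
Project6DS2 <- as_tibble(Project6DS2)
# Stack the three metric columns (3:5) into key/value pairs
Project6DS2a <- Project6DS2 %>% gather("RegionMetrics", "Values", 3:5)
# Keep CensusRegion, RegionMetrics and Values; drop the ID column
Project6DS2a <- Project6DS2a[2:4]

# Invert the table: one row per metric, one column per census region
Project6DS2a <- Project6DS2a %>% group_by(CensusRegion, RegionMetrics) %>% spread(CensusRegion, Values)
# Title-case the column names (e.g. "CENS-R1" becomes "Cens-R1")
colnames(Project6DS2a) <- str_to_title(colnames(Project6DS2a))

kable(head(Project6DS2a), caption="Table 2. Data Cleaned Up and Inverted for Analysis")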
Table 2. Data Cleaned Up and Inverted for Analysis

| Regionmetrics | Census Region 1: Northeast (Cens-R1) | Census Region 2: Midwest (Cens-R2) | Census Region 3: South (Cens-R3) | Census Region 4: West (Cens-R4) | Total |
|:--------------|-------:|-------:|-------:|-------:|-------:|
| CrudeRateper100000 | 844.8 | 851.8 | 835.1 | 679.6 | 840.6 |
| Deaths | 461712.0 | 564665.0 | 924360.0 | 472975.0 | 2423712.0 |
| Population | 54653362.0 | 66293689.0 | 110688742.0 | 69595414.0 | 301231207.0 |
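For reference, the same inversion can be written with the newer tidyr verbs `pivot_longer()` and `pivot_wider()`, which supersede `gather()` and `spread()`. This is a sketch, not the code used above: it assumes tidyr 1.0.0 or later, and the `untidy` and `inverted` names and the shortened region labels are mine.

```r
library(dplyr)
library(tidyr)

# Table 1 values entered by hand so the sketch runs on its own
untidy <- tibble(
  ID = 1:5,
  CensusRegion = c("Northeast", "Midwest", "South", "West", "Total"),
  Deaths = c(461712, 564665, 924360, 472975, 2423712),
  Population = c(54653362, 66293689, 110688742, 69595414, 301231207),
  CrudeRateper100000 = c(844.8, 851.8, 835.1, 679.6, 840.6)
)

inverted <- untidy %>%
  select(-ID) %>%                                   # drop the row ID
  pivot_longer(-CensusRegion,                       # stack the metric columns
               names_to = "RegionMetrics",
               values_to = "Values") %>%
  pivot_wider(names_from = CensusRegion,            # one column per region
              values_from = Values)

inverted
```

One visible difference: `pivot_wider()` keeps rows and columns in order of first appearance, while `spread()` sorts them alphabetically, so the metric rows come out in a different order than in Table 2.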

Analyze Data

The thread did not request any particular analysis; it only described the type of data found. One interesting thing to look at is the average across regions.

# Preparing Data for Analysis
# Average each metric across the four census regions (the Total column is excluded).
# %/% is integer division, so each average is rounded down to a whole number.
Project6DS2b <- Project6DS2a %>% mutate(Average = (`Census Region 1: Northeast (Cens-R1)`+`Census Region 2: Midwest (Cens-R2)`+`Census Region 3: South (Cens-R3)`+`Census Region 4: West (Cens-R4)`)%/%4)

# Method 1
kable(Project6DS2b, caption="Table 3. Average per Metric")
Table 3. Average per Metric

| Regionmetrics | Census Region 1: Northeast (Cens-R1) | Census Region 2: Midwest (Cens-R2) | Census Region 3: South (Cens-R3) | Census Region 4: West (Cens-R4) | Total | Average |
|:--------------|-------:|-------:|-------:|-------:|-------:|-------:|
| CrudeRateper100000 | 844.8 | 851.8 | 835.1 | 679.6 | 840.6 | 802 |
| Deaths | 461712.0 | 564665.0 | 924360.0 | 472975.0 | 2423712.0 | 605928 |
| Population | 54653362.0 | 66293689.0 | 110688742.0 | 69595414.0 | 301231207.0 | 75307801 |

Some observations:

- For the per-100K metric, Region 4 brings down the average.
- For deaths and population, Region 3 raises the average.
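The averages in Table 3 give every region equal weight. For a rate, that is not the same as the pooled rate across all regions, because the regions differ a lot in population. The sketch below uses base R only to compare the two; the vector names are mine, and the counts are copied from Table 1.

```r
# Counts per region, copied from Table 1 (vector names are illustrative)
deaths     <- c(Northeast = 461712, Midwest = 564665,
                South = 924360, West = 472975)
population <- c(Northeast = 54653362, Midwest = 66293689,
                South = 110688742, West = 69595414)

# Crude rate per 100,000 for each region
rate <- deaths / population * 1e5

# Simple mean: each region counts equally, regardless of size
simple_mean <- mean(rate)

# Population-weighted mean: identical to pooled deaths / pooled population
weighted_mean <- weighted.mean(rate, w = population)

# Share of the combined population living in each region
pop_share <- population / sum(population)

round(c(simple = simple_mean, weighted = weighted_mean), 1)
round(pop_share, 3)
```

The simple mean comes out near 802.8, which the integer division in Table 3 rounds down to 802. The weighted mean is about 804.6, slightly higher because the South, with more than a third of the population, has an above-average rate.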

For further analysis, I would consider digging deeper into Region 3.