Abstract: Data literacy is the ability to derive meaningful information from data [1]. Unfortunately, as human societies become more data-driven, data illiteracy may increase as well. To help counter the possibility of living in a data-illiterate world, this paper wrangles a large cancer dataset to present basic descriptive statistics about the Hispanic population.
Introduction: Within the data science community, one zettabyte is understood to be equivalent to roughly one trillion gigabytes [2]. More striking, it was estimated that the world would have generated 44 zettabytes of data (about 44 trillion gigabytes) by the year 2020 [3]. To make use of this data, every field of study, including every healthcare sector, will need to adopt data science skills to make sense of the influx of information and make better decisions.
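The unit conversion behind these figures can be checked with a few lines of R. This is a minimal sketch that assumes decimal (SI) prefixes, where a zettabyte is 10^21 bytes and a gigabyte is 10^9 bytes; the variable names are illustrative only.

```r
# Decimal (SI) prefixes: 1 zettabyte = 10^21 bytes, 1 gigabyte = 10^9 bytes.
bytes_per_zettabyte <- 1e21
bytes_per_gigabyte  <- 1e9

# Gigabytes in one zettabyte: 10^21 / 10^9 = 10^12 (one trillion).
gigabytes_per_zettabyte <- bytes_per_zettabyte / bytes_per_gigabyte

# Gigabytes in the estimated 44 zettabytes: 44 trillion.
total_gigabytes <- 44 * gigabytes_per_zettabyte

format(gigabytes_per_zettabyte, big.mark = ",", scientific = FALSE)
format(total_gigabytes, big.mark = ",", scientific = FALSE)
```

With binary prefixes (2^70 bytes per zettabyte, 2^30 bytes per gigabyte) the ratio is 2^40, slightly more than one trillion.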
In the healthcare field, when an individual is diagnosed with cancer, their life plans are abruptly altered [4]. The patient and doctor must work together to gather as much information as possible and find a solution that gives the patient the best chance of a long life. This paper does not have the power to erase cancer from the human gene pool. However, it filters, selects, renames, and mutates data (data wrangling) to provide a few basic descriptive statistics about cancer. More specifically, it focuses on Hispanics by extracting information for the states with the highest Hispanic populations: Texas, New Mexico, and California.
Method: The data used in this paper come from the Social Explorer website [5], a site dedicated to providing data, graphs, and knowledge to the public. The code used to wrangle the data follows this section. The data report the cancer death rate per 100,000 people for Hispanics under 18, between 18 and 44, between 45 and 64, and 65 and older. Subjects are also divided into white Hispanics and black Hispanics.
#install.packages("readr")
#install.packages("dplyr")
library(readr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
HW2Dataset <- read_csv("Dataset_Hw2_712.csv")
## Parsed with column specification:
## cols(
## .default = col_double(),
## Geo_NAME = col_character(),
## Geo_QNAME = col_character(),
## Geo_NATION = col_logical(),
## Geo_COUNTY = col_logical(),
## SE_T005_007 = col_logical(),
## SE_T006_007 = col_logical(),
## SE_T007_007 = col_logical(),
## SE_NV022_007 = col_logical(),
## SE_T027_021 = col_logical(),
## SE_T039_007 = col_logical(),
## SE_T040_007 = col_logical(),
## SE_NV133_007 = col_logical(),
## SE_T041_007 = col_logical(),
## SE_NV137_007 = col_logical()
## )
## See spec(...) for full column specifications.
View(HW2Dataset)
head(HW2Dataset)
## # A tibble: 6 x 1,637
## Geo_FIPS Geo_NAME Geo_QNAME Geo_NATION Geo_STATE Geo_COUNTY SE_T001_001
## <dbl> <chr> <chr> <lgl> <dbl> <lgl> <dbl>
## 1 1 Alabama Alabama NA 1 NA 214.
## 2 2 Alaska Alaska NA 2 NA 138.
## 3 4 Arizona Arizona NA 4 NA 171.
## 4 5 Arkansas Arkansas NA 5 NA 226
## 5 6 Califor… Californ… NA 6 NA 151.
## 6 8 Colorado Colorado NA 8 NA 140.
## # … with 1,630 more variables: SE_T001_002 <dbl>, SE_T001_003 <dbl>,
## # SE_T001_004 <dbl>, SE_NV002_001 <dbl>, SE_NV002_002 <dbl>,
## # SE_NV002_003 <dbl>, SE_NV002_004 <dbl>, SE_NV003_001 <dbl>,
## # SE_NV003_002 <dbl>, SE_NV003_003 <dbl>, SE_NV003_004 <dbl>,
## # SE_T002_001 <dbl>, SE_T002_002 <dbl>, SE_T002_003 <dbl>,
## # SE_T002_004 <dbl>, SE_T002_005 <dbl>, SE_NV005_001 <dbl>,
## # SE_NV005_002 <dbl>, SE_NV005_003 <dbl>, SE_NV005_004 <dbl>,
## # SE_NV005_005 <dbl>, SE_NV006_001 <dbl>, SE_NV006_002 <dbl>,
## # SE_NV006_003 <dbl>, SE_NV006_004 <dbl>, SE_NV006_005 <dbl>,
## # SE_T003_001 <dbl>, SE_T003_002 <dbl>, SE_T003_003 <dbl>,
## # SE_T003_004 <dbl>, SE_T003_005 <dbl>, SE_NV008_001 <dbl>,
## # SE_NV008_002 <dbl>, SE_NV008_003 <dbl>, SE_NV008_004 <dbl>,
## # SE_NV008_005 <dbl>, SE_NV009_001 <dbl>, SE_NV009_002 <dbl>,
## # SE_NV009_003 <dbl>, SE_NV009_004 <dbl>, SE_NV009_005 <dbl>,
## # SE_T004_001 <dbl>, SE_T004_002 <dbl>, SE_T004_003 <dbl>,
## # SE_T004_004 <dbl>, SE_T004_005 <dbl>, SE_NV011_001 <dbl>,
## # SE_NV011_002 <dbl>, SE_NV011_003 <dbl>, SE_NV011_004 <dbl>,
## # SE_NV011_005 <dbl>, SE_NV012_001 <dbl>, SE_NV012_002 <dbl>,
## # SE_NV012_003 <dbl>, SE_NV012_004 <dbl>, SE_NV012_005 <dbl>,
## # SE_T005_001 <dbl>, SE_T005_002 <dbl>, SE_T005_003 <dbl>,
## # SE_T005_004 <dbl>, SE_T005_005 <dbl>, SE_T005_006 <dbl>,
## # SE_T005_007 <lgl>, SE_T005_008 <dbl>, SE_T005_009 <dbl>,
## # SE_T005_010 <dbl>, SE_T005_011 <dbl>, SE_T005_012 <dbl>,
## # SE_T005_013 <dbl>, SE_T005_014 <dbl>, SE_T005_015 <dbl>,
## # SE_T005_016 <dbl>, SE_T005_017 <dbl>, SE_T005_018 <dbl>,
## # SE_T005_019 <dbl>, SE_T005_020 <dbl>, SE_T005_021 <dbl>,
## # SE_T005_022 <dbl>, SE_T005_023 <dbl>, SE_T005_024 <dbl>,
## # SE_T005_025 <dbl>, SE_T005_026 <dbl>, SE_T005_027 <dbl>,
## # SE_T005_028 <dbl>, SE_NV014_001 <dbl>, SE_NV014_002 <dbl>,
## # SE_NV014_003 <dbl>, SE_NV014_004 <dbl>, SE_NV014_005 <dbl>,
## # SE_NV014_006 <dbl>, SE_NV014_007 <dbl>, SE_NV014_008 <dbl>,
## # SE_NV014_009 <dbl>, SE_NV014_010 <dbl>, SE_NV014_011 <dbl>,
## # SE_NV014_012 <dbl>, SE_NV014_013 <dbl>, SE_NV014_014 <dbl>,
## # SE_NV014_015 <dbl>, SE_NV014_016 <dbl>, …
tail(HW2Dataset)
## # A tibble: 6 x 1,637
## Geo_FIPS Geo_NAME Geo_QNAME Geo_NATION Geo_STATE Geo_COUNTY SE_T001_001
## <dbl> <chr> <chr> <lgl> <dbl> <lgl> <dbl>
## 1 51 Virginia Virginia NA 51 NA 174.
## 2 53 Washing… Washingt… NA 53 NA 171.
## 3 54 West Vi… West Vir… NA 54 NA 254.
## 4 55 Wiscons… Wisconsin NA 55 NA 199.
## 5 56 Wyoming Wyoming NA 56 NA 162.
## 6 72 Puerto … Puerto R… NA 72 NA NA
## # … with 1,630 more variables: SE_T001_002 <dbl>, SE_T001_003 <dbl>,
## # SE_T001_004 <dbl>, SE_NV002_001 <dbl>, SE_NV002_002 <dbl>,
## # SE_NV002_003 <dbl>, SE_NV002_004 <dbl>, SE_NV003_001 <dbl>,
## # SE_NV003_002 <dbl>, SE_NV003_003 <dbl>, SE_NV003_004 <dbl>,
## # SE_T002_001 <dbl>, SE_T002_002 <dbl>, SE_T002_003 <dbl>,
## # SE_T002_004 <dbl>, SE_T002_005 <dbl>, SE_NV005_001 <dbl>,
## # SE_NV005_002 <dbl>, SE_NV005_003 <dbl>, SE_NV005_004 <dbl>,
## # SE_NV005_005 <dbl>, SE_NV006_001 <dbl>, SE_NV006_002 <dbl>,
## # SE_NV006_003 <dbl>, SE_NV006_004 <dbl>, SE_NV006_005 <dbl>,
## # SE_T003_001 <dbl>, SE_T003_002 <dbl>, SE_T003_003 <dbl>,
## # SE_T003_004 <dbl>, SE_T003_005 <dbl>, SE_NV008_001 <dbl>,
## # SE_NV008_002 <dbl>, SE_NV008_003 <dbl>, SE_NV008_004 <dbl>,
## # SE_NV008_005 <dbl>, SE_NV009_001 <dbl>, SE_NV009_002 <dbl>,
## # SE_NV009_003 <dbl>, SE_NV009_004 <dbl>, SE_NV009_005 <dbl>,
## # SE_T004_001 <dbl>, SE_T004_002 <dbl>, SE_T004_003 <dbl>,
## # SE_T004_004 <dbl>, SE_T004_005 <dbl>, SE_NV011_001 <dbl>,
## # SE_NV011_002 <dbl>, SE_NV011_003 <dbl>, SE_NV011_004 <dbl>,
## # SE_NV011_005 <dbl>, SE_NV012_001 <dbl>, SE_NV012_002 <dbl>,
## # SE_NV012_003 <dbl>, SE_NV012_004 <dbl>, SE_NV012_005 <dbl>,
## # SE_T005_001 <dbl>, SE_T005_002 <dbl>, SE_T005_003 <dbl>,
## # SE_T005_004 <dbl>, SE_T005_005 <dbl>, SE_T005_006 <dbl>,
## # SE_T005_007 <lgl>, SE_T005_008 <dbl>, SE_T005_009 <dbl>,
## # SE_T005_010 <dbl>, SE_T005_011 <dbl>, SE_T005_012 <dbl>,
## # SE_T005_013 <dbl>, SE_T005_014 <dbl>, SE_T005_015 <dbl>,
## # SE_T005_016 <dbl>, SE_T005_017 <dbl>, SE_T005_018 <dbl>,
## # SE_T005_019 <dbl>, SE_T005_020 <dbl>, SE_T005_021 <dbl>,
## # SE_T005_022 <dbl>, SE_T005_023 <dbl>, SE_T005_024 <dbl>,
## # SE_T005_025 <dbl>, SE_T005_026 <dbl>, SE_T005_027 <dbl>,
## # SE_T005_028 <dbl>, SE_NV014_001 <dbl>, SE_NV014_002 <dbl>,
## # SE_NV014_003 <dbl>, SE_NV014_004 <dbl>, SE_NV014_005 <dbl>,
## # SE_NV014_006 <dbl>, SE_NV014_007 <dbl>, SE_NV014_008 <dbl>,
## # SE_NV014_009 <dbl>, SE_NV014_010 <dbl>, SE_NV014_011 <dbl>,
## # SE_NV014_012 <dbl>, SE_NV014_013 <dbl>, SE_NV014_014 <dbl>,
## # SE_NV014_015 <dbl>, SE_NV014_016 <dbl>, …
TexasCancerData <- filter(HW2Dataset,Geo_NAME == "Texas")
NewMexCancerData<-filter(HW2Dataset,Geo_NAME == "New Mexico")
CaliMexCancerData<-filter(HW2Dataset,Geo_NAME == "California")
TexasCancerData2 <- select(TexasCancerData,SE_T005_002, SE_T005_004, SE_T005_009, SE_T005_011, SE_T005_016, SE_T005_018,SE_T005_023,SE_T005_025)
CaliMexCancerData2 <- select(CaliMexCancerData,SE_T005_002, SE_T005_004, SE_T005_009, SE_T005_011, SE_T005_016, SE_T005_018,SE_T005_023,SE_T005_025)
NewMexCancerData2 <- select(NewMexCancerData,SE_T005_002, SE_T005_004, SE_T005_009, SE_T005_011, SE_T005_016, SE_T005_018,SE_T005_023,SE_T005_025)
NewMexCancerData3 <- rename(NewMexCancerData2, HispanicWhiteUnder18 = SE_T005_002, HispanicBlackUnder18 = SE_T005_004, HispanicWhite18to44 = SE_T005_009, HispanicBlack18to44 = SE_T005_011, HispanicWhite45to64 = SE_T005_016 , HispanicBlack45to64 = SE_T005_018 , HispanicWhiteOver65 = SE_T005_023 , HispanicBlackOver65 = SE_T005_025)
TexasCancerData3 <- rename(TexasCancerData2, HispanicWhiteUnder18 = SE_T005_002, HispanicBlackUnder18 = SE_T005_004, HispanicWhite18to44 = SE_T005_009, HispanicBlack18to44 = SE_T005_011, HispanicWhite45to64 = SE_T005_016 , HispanicBlack45to64 = SE_T005_018 , HispanicWhiteOver65 = SE_T005_023 , HispanicBlackOver65 = SE_T005_025)
CaliCancerData3 <- rename(CaliMexCancerData2, HispanicWhiteUnder18 = SE_T005_002, HispanicBlackUnder18 = SE_T005_004, HispanicWhite18to44 = SE_T005_009, HispanicBlack18to44 = SE_T005_011, HispanicWhite45to64 = SE_T005_016 , HispanicBlack45to64 = SE_T005_018 , HispanicWhiteOver65 = SE_T005_023 , HispanicBlackOver65 = SE_T005_025)
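# Note: mutate() is called here without any new-column arguments, so it returns its input unchanged;
# the division by 100000 is what rescales SE_NV002_001. Geo_NAME == "New York" selects the New York row.
NYCCancerDeathPer100K <- mutate(select(filter(HW2Dataset,Geo_NAME == "New York"),SE_NV002_001)/100000)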
head(NYCCancerDeathPer100K)
## SE_NV002_001
## 1 0.35738
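For context, a death rate per 100,000 people is conventionally computed as deaths divided by population, multiplied by 100,000. Whether the division above yields such a rate depends on what SE_NV002_001 measures, which is not stated here. The sketch below shows the conventional arithmetic only; the death count and population are hypothetical numbers chosen for illustration, not values from the dataset.

```r
# Hypothetical figures, chosen only to illustrate the arithmetic.
deaths     <- 350      # deaths in one year (hypothetical)
population <- 200000   # people at risk (hypothetical)

# Rate per 100,000 = deaths / population * 100,000.
rate_per_100k <- deaths / population * 100000
rate_per_100k
```

With these made-up inputs the result is 175 deaths per 100,000 people.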
head(CaliCancerData3)
## # A tibble: 1 x 8
## HispanicWhiteUn… HispanicBlackUn… HispanicWhite18… HispanicBlack18…
## <dbl> <dbl> <dbl> <dbl>
## 1 2.7 2.1 14 16
## # … with 4 more variables: HispanicWhite45to64 <dbl>,
## # HispanicBlack45to64 <dbl>, HispanicWhiteOver65 <dbl>,
## # HispanicBlackOver65 <dbl>
head(NewMexCancerData3)
## # A tibble: 1 x 8
## HispanicWhiteUn… HispanicBlackUn… HispanicWhite18… HispanicBlack18…
## <dbl> <dbl> <dbl> <dbl>
## 1 2.4 NA 14.2 NA
## # … with 4 more variables: HispanicWhite45to64 <dbl>,
## # HispanicBlack45to64 <dbl>, HispanicWhiteOver65 <dbl>,
## # HispanicBlackOver65 <dbl>
head(TexasCancerData3)
## # A tibble: 1 x 8
## HispanicWhiteUn… HispanicBlackUn… HispanicWhite18… HispanicBlack18…
## <dbl> <dbl> <dbl> <dbl>
## 1 2.6 2.3 14.6 17.9
## # … with 4 more variables: HispanicWhite45to64 <dbl>,
## # HispanicBlack45to64 <dbl>, HispanicWhiteOver65 <dbl>,
## # HispanicBlackOver65 <dbl>
dim(CaliCancerData3)
## [1] 1 8
dim(NewMexCancerData3)
## [1] 1 8
dim(TexasCancerData3)
## [1] 1 8
View(CaliCancerData3)
View(NewMexCancerData3)
View(TexasCancerData3)
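The three single-row tables above are easier to compare when kept in one table. The following is an alternative sketch, not the method used in this paper: it filters all three states at once and renames columns inside select(). Because the CSV is not bundled here, it runs on a small stand-in data frame named toy (a hypothetical name) whose values are copied from the head() output above for the two youngest age groups; the New York row is a placeholder with missing values, included only so that filter() has something to drop.

```r
library(dplyr)

# Stand-in for HW2Dataset: same column names, values copied from the
# head() output above. The New York row is a placeholder (all NA).
toy <- data.frame(
  Geo_NAME    = c("California", "New Mexico", "Texas", "New York"),
  SE_T005_002 = c(2.7, 2.4, 2.6, NA),
  SE_T005_004 = c(2.1, NA, 2.3, NA),
  SE_T005_009 = c(14, 14.2, 14.6, NA),
  SE_T005_011 = c(16, NA, 17.9, NA),
  stringsAsFactors = FALSE
)

states <- c("California", "New Mexico", "Texas")

# Keep the three states, then select and rename columns in one step.
hispanic_rates <- toy %>%
  filter(Geo_NAME %in% states) %>%
  select(
    State                = Geo_NAME,
    HispanicWhiteUnder18 = SE_T005_002,
    HispanicBlackUnder18 = SE_T005_004,
    HispanicWhite18to44  = SE_T005_009,
    HispanicBlack18to44  = SE_T005_011
  )

print(hispanic_rates)
```

Keeping the state name as a column means one row per state, so differences such as the missing black Hispanic values for New Mexico are visible at a glance.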
Result/Conclusion: Based on the data, the three states with the highest Hispanic populations show similar cancer death rates. The data also show a trend: the cancer death rate rises with age. On average, regardless of race, roughly two to three of every 100,000 Hispanic children die from cancer. The rate increases drastically at age 65 and older: again regardless of race, more than 1,000 of every 100,000 Hispanic people over the age of 65 die from cancer.
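The averages for the two youngest age groups can be reproduced from the values printed in the head() output. This is a rough sketch under two assumptions: the values are used as printed (the console display may round them), and the mean is a simple unweighted average of the available state and race cells, not a population-weighted rate. The vector names are illustrative.

```r
# Death rates per 100,000 as printed in the head() output.
# Order: CA white, CA black, NM white, NM black, TX white, TX black.
under18   <- c(2.7, 2.1, 2.4, NA, 2.6, 2.3)
age18to44 <- c(14, 16, 14.2, NA, 14.6, 17.9)

# Unweighted means, skipping the missing New Mexico values.
mean_under18 <- mean(under18, na.rm = TRUE)
mean_18to44  <- mean(age18to44, na.rm = TRUE)

round(mean_under18, 2)  # 2.42
round(mean_18to44, 2)   # 15.34
```

Under these assumptions the under-18 average is about 2.4 per 100,000 and the 18 to 44 average is about 15.3 per 100,000, consistent with the rate rising with age.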
Citations:
[1] Rouse, Margaret. “What is data literacy.” WhatIs.com, whatis.techtarget.com/definition/data-literacy.
[2] Fogarty, Kevin. “How many gigabytes in a a zettabyte and why you need to know.” ITWorld, 26 June 2012, www.itworld.com/article/2722925/consumerization/how-many-gigabytes-in-a-a-zettabyte-and-why-you-need-to-know.html.
[3] Morrow, Jordan. “The rise of data literacy in the healthcare industry.” Qlik, Mar. 2018.
[4] Rettenmaier, Andrew. “Healthcare Financing and the Cost of Cancer Care.” Cancer Network, 23 Oct. 2012, www.cancernetwork.com/practice-policy/healthcare-financing-and-cost-cancer-care.