The Final Project for Team Cancer Researchers (formerly Team Aero Bombers) focused on analyzing cancer data downloaded from the Surveillance, Epidemiology, and End Results (SEER) Program of the National Cancer Institute (NCI) of the National Institutes of Health (NIH) website [SEE] and the Cancer Incidence in Five Continents (CI5) collaboration of the World Health Organization: International Agency for Research on Cancer website [CI5]. Team Cancer Researchers was composed of Jason Givens-Doyle, Romerl Elizes, and Soumya Ghosh.
Team responsibilities for the Project were:
Rom (Romerl) - initially downloaded the SEERS data, cleaned the SEERS data and stored it in CSV files, investigated MongoDB as an alternative database, performed detailed statistical analysis as required by the project, developed comprehensive documentation in the master RMD file, and merged/tested RMD file contributions from all teammates. Also served as taskmaster to keep the team's deliverables on schedule.
Soumya - developed the MySQL database for the project combining the SEERS (using the CSV files Romerl created) and CI5 research databases, created ETL bulk load scripts to populate the data, developed the data pull queries, performed exploratory analysis, and created visualizations for the CI5 data set for the UK.
Jason - developed the initial idea for researching cancer data. Identified the two data sources that will be used in the project, developed drill downs between two country samples, developed sun burst charts.
The initial project goals based on the submitted project proposal were:
Define a prediction model to determine the likelihood of certain cancers.
Work on subsetting the data, looking at subgroups within the identified risk groups and drilling down for more specific risks and indicators.
Using the characteristics of the data, statistically determine cancer survivability by cancer type by diagnosis year.
The final project objectives have been changed to reflect the realities of the project limitations:
Download two sources of relevant cancer data.
Statistically analyze the SEER data for validity prior to conducting a more detailed investigation. Specific focus is on the survivability in years vs. diagnosis year for 9 types of cancer in the United States.
Display a global view of the cancer data and perform exploratory analysis for sample region (UK) with CI5 data set using graphs and tables.
Drill down comparison between two countries (Bulgaria and Netherlands) with specific cancer indications.
The non-presentation requirements for the project have been satisfied according to the course instructors' rubric requirements.
Proposal describes your motivation for performing this analysis. - COMPLETED. Satisfied in Section 1.2
Your project has a recognizable “data science workflow,” such as the OSEMN workflow or Hadley Wickham’s Grammar of Data Science. - COMPLETED. Satisfied in Section 1.2 Detail.
Project includes data from at least two different types of data sources. - COMPLETED. Identified in Section 2.1 and 2.2. The SEERS and CI5 Cancer databases were used. Additionally, CSV files with data from other sources were used for this investigation.
Project includes at least one data transformation operation. - COMPLETED. Extensive data transformation operations were detailed in Section 2.1. The cleaning of SEER data from text file to csv to database are detailed there.
Project includes at least one statistical analysis and at least one graphics that describes or validates your data. - COMPLETED. Comprehensive summary statistical analyses were detailed in Section 3.3 when analyzing each of the cancer types of the SEERS database. A summary bar chart was displayed for each summary statistical analysis.
Project includes at least one graphic that supports your conclusion(s). - COMPLETED. Several charts were used to support our conclusions throughout the project.
Project includes at least one statistical analysis that supports your conclusion(s). - COMPLETED. Comprehensive detailed analyses were detailed in Section 3.4 using Linear Regression to support our Inference Analyses.
Project includes at least one feature that we did not cover in class! - COMPLETED. Project displayed a Pyramid Chart and several Sun Burst charts that were not discussed in this course.
Code and data. Have you delivered the submitted code and data where it is self-contained-preferably in rpubs.com and github? Am I able to fully reproduce your results with what you’ve delivered? You won’t receive full credit if your code references data on your local machine! - COMPLETED. All code was submitted for the project. The data is at least 10 MB. At this time it was not possible to include this data in the project submission as it required comprehensive cleaning and transformation in order for it to be used in a MySQL database. A One Drive link to the zip file of the SEER Data CSV files (155 MB) can be found here: https://spsmailcuny-my.sharepoint.com/:u:/g/personal/romerl_elizes36_spsmail_cuny_edu/EV6WbISZr-dCmGWuYzj7rKYBou-6sr3h0AJ1A48Fxb0sbQ?e=8hn68u. The link for compressed CI5 CSV files (127MB) data set is here: https://spsmailcuny-my.sharepoint.com/:u:/g/personal/soumya_ghosh58_spsmail_cuny_edu/ESWOE8LwQz9LgNlKDtlaoKAB1j2v-EDVF0O5oLFKhlpBJQ?e=FgokiJ
Code and data. Does all of the delivered code run without errors? - COMPLETED. All code ran without errors.
Code and data. Have you delivered your code and conclusions using a “reproducible research” tool such as RMarkdown? - COMPLETED.
Deadline management. Were your draft project proposal, project, and presentation delivered on time? - COMPLETED.
Some of the project caveats and difficulties are highlighted here.
Obtaining the SEER data was a time-consuming process. It required each of the team members to request access via NIH guidelines. While access to the data was pretty straightforward for some, it was difficult to obtain for one of our teammates.
Downloading the SEER data was the easy part. The data, while comprehensive, was stored in text files that had to be cleaned and parsed properly before it was in a format that R could read and clean further. Even then, the downloaded SEER data had 10,050,814 rows!
Due to the vastness of the SEER data, for the purposes of the investigation we focused only on cancer patients who were diagnosed prior to 1999. This covers diagnosis years 1973 through 1998 and accounts for 2,788,863 rows of data.
Our team membership is truly an online collaboration. Jason lives in Japan, Soumya lives in Connecticut, and Rom lives in New Jersey. Team discussions, development, and collaboration were purely online endeavors. The Thanksgiving holiday affected one of our team meetings and made the project meet-ups challenging.
This project certainly has some room for improvement and can be used for other future research opportunities.
The statistical analysis only focused on two data variables in the SEER data. Due to time constraints, it was not realistically feasible to explore the other variables for more detailed analysis. We extracted 22 columns worth of SEERS data, but the statistical analysis on the two variables was still exhaustive. Exploration of the other columns would be a worthy endeavor.
A more detailed analysis could investigate some of the individual cancer types and compare them between regions. However, given time constraints and team discussion, it was best to focus on the data from a global perspective. In a future project, other students could focus on a particular cancer type or types and compare regions within the United States.
This section focuses on downloading the data sets from the SEER and CI5 websites.
Loading necessary libraries
library(RODBC)
library(dplyr)
library(stringr)
library(ggplot2)
library(plotly)
library(kableExtra)
library(data.table)
library(knitr)
library(psych)
library(tidyr)
library(scales)
library(maps)
library(mapdata)
library(sunburstR)
library(RMySQL)
library(ggrepel)
library(rCharts)
The SEERS data was downloaded into a directory as a set of text files. Each text file represented a cancer type for a particular region and time period in the United States and contained over 100 fields. The text files extracted represented the following cancer types: Breast, Digestive, Male Genital, Female Genital, Respiratory, Colon/Rectal, Lymphoma/Leukemia, Urinary, and Other Cancers. The files were fixed-width: each field occupied a specified range of character positions on the line. R data cleaning procedures were used to extract the data values from each field position and name them based on the provided data dictionary. All of the text files representing a specific cancer type were merged together using the R data cleaning procedures and converted into one CSV file. As a result, there was one CSV file for each cancer type. A link to the initial R work can be found here: http://rpubs.com/RommyGraphs/442626
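As a rough illustration of the fixed-width extraction described above, the sketch below parses two made-up lines with `read.fwf`. The three-field layout, the widths, and the helper name `parse_fixed_width` are assumptions for illustration only; the real positions come from the SEER data dictionary, and the team's actual cleaning code is in the RPubs link.

```r
# Hypothetical three-field layout; the real SEER files have 100+ fields
# whose positions are defined in the SEER data dictionary.
raw_lines <- c(
  "54000177F085",
  "54000306F068"
)
field_widths <- c(8, 1, 3)                        # assumed widths, not SEER's
field_names  <- c("personID", "sex", "ageDiagnosis")

parse_fixed_width <- function(lines, widths, names) {
  con <- textConnection(lines)
  on.exit(close(con))
  # Read every field as character so codes such as "F" or "085" are not
  # silently converted (for example "F" to FALSE).
  read.fwf(con, widths = widths, col.names = names,
           colClasses = "character", stringsAsFactors = FALSE)
}

patients <- parse_fixed_width(raw_lines, field_widths, field_names)
patients$ageDiagnosis <- as.integer(patients$ageDiagnosis)

# One CSV per cancer type would then be written with something like:
# write.csv(patients, "breast.csv", row.names = FALSE)
```

Reading everything as character first and converting selected columns afterwards keeps leading zeros and single-letter codes intact until the data dictionary says how to interpret them.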
The CI5plus database contains updated annual incidence rates for 124 selected populations from 108 cancer registries published in CI5, for the longest period available (up to 2012), for all cancers and 28 major types (see the Cancer dictionary menu option here). The data dictionary associated with the CI5 Diagnostic Units was analyzed and combined with the cancer types defined in the SEERS database for comparative analyses. Cancer registry data capturing the country, region, etc., and microscopically verified cancer incidence data by age group for each continent were combined into separate dimension and fact tables. SQL transformations were applied to convert the incidence data into a tidy format for further analysis and consumption in R. SQL-based ETL and bulk load scripts were created to process and load the data sets into one consolidated MySQL database.
Before we decided to exclusively use the MySQL database for the Final Project, Rom investigated the possibility of using the Mongo database as a possible database for the Final Project. Some of the initial work can be found here:
Storing the CSV files into data frames that could easily be stored in MongoDB Collections - http://rpubs.com/RommyGraphs/444545
Defining an Initial Statistical Analysis from a Global Perspective on the SEER Data in MongoDB - http://rpubs.com/RommyGraphs/444519
Defining an Initial Statistical Analysis from a Detailed Perspective on Respiratory Cancer by Region on the SEER Data in MongoDB - http://rpubs.com/RommyGraphs/444533
In the end, due to time constraints and team familiarity with the MySQL database technology, we decided to use the MySQL database for the project.
Below are the steps followed to perform database migration -
We used the RODBC package and an ODBC data source called ‘MySQL_SEERS_Analysis’ in order to connect to the database and retrieve tables into their respective data frames.
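The code that follows refers to a connection object named `con`, whose creation is not shown. A minimal sketch of how it might be opened with RODBC is given here; the helper name `open_seer_connection` is hypothetical, and the sketch assumes the DSN 'MySQL_SEERS_Analysis' is already configured on the machine.

```r
# Hypothetical helper: open the ODBC data source used in this project.
open_seer_connection <- function(dsn = "MySQL_SEERS_Analysis") {
  if (!requireNamespace("RODBC", quietly = TRUE)) {
    stop("The RODBC package is required for this connection.")
  }
  con <- RODBC::odbcConnect(dsn)
  # odbcConnect() returns -1 (with a warning) when the connection fails;
  # a successful handle has class "RODBC".
  if (!inherits(con, "RODBC")) {
    stop("Could not open ODBC data source: ", dsn)
  }
  con
}

# Usage (requires the DSN to exist locally):
# con <- open_seer_connection()
# ... run sqlFetch() / sqlQuery() calls ...
# RODBC::odbcClose(con)
```

Failing early with a clear message avoids the less obvious errors that appear later when `sqlFetch` or `sqlQuery` is handed an invalid handle.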
The SEERS data was queried straight from the established SQL database. A data frame was created using the following limiting criteria: patients who were diagnosed prior to 1999 and whose survivalMonths value was below 9999. A sample table display was created to verify that the data could be shown. The data frame was further divided into data frames representing each cancer type. These data frames will be used in the Statistical Analysis portion of the project.
seerMasterDF <- as.data.frame(sqlFetch(con,"Cancer_Patients_Master"),stringsAsFactors = FALSE)
### Derive Year attributes
seerMasterDF <- seerMasterDF %>% mutate(survivalYears = survivalMonths/12, currentYear = survivalYears + yearDiagnosis) %>% subset(yearDiagnosis < 1999 & survivalMonths < 9999)
tmpseerMasterDF <- head(seerMasterDF)
tmpseerMasterDF %>% kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>% scroll_box(width="100%",height="300px")
Row | SEERDiagnosticUnit | personID | locality | maritalStatus | race | derivedHispanicOrigin | sex | ageDiagnosis | birthYear | sequenceNumber | monthDiagnosis | yearDiagnosis | primarySite | laterality | histology | behavior | histologicType | behaviorCode | grade | diagnosticConfirmation | reportingSourceType | survivalMonths | survivalYears | currentYear |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
3 | Breast | 54000177 | Rural Georgia | widowed | White | Not Latino | F | 85 | 1911 | 2 | 9 | 1996 | C506 | Left | 8500 | 3 | 8500 | Malignant | I | Positive histology | Hospital inpatient | 5 | 0.4166667 | 1996.417 |
4 | Breast | 54000306 | Rural Georgia | widowed | Black | Not Latino | F | 68 | 1924 | 2 | 12 | 1992 | C504 | Left | 8500 | 2 | 8500 | Noninvasive | undetermined | Positive histology | Hospital inpatient | 233 | 19.4166667 | 2011.417 |
6 | Breast | 54000815 | Rural Georgia | divorced | White | Not Latino | F | 62 | 1936 | 3 | 9 | 1998 | C505 | Left | 8500 | 3 | 8500 | Malignant | II | Positive histology | Hospital inpatient | 115 | 9.5833333 | 2007.583 |
9 | Breast | 54001286 | Rural Georgia | widowed | White | Not Latino | F | 87 | 1909 | 2 | 1 | 1997 | C503 | Left | 8504 | 3 | 8504 | Malignant | undetermined | Positive histology | Hospital inpatient | 80 | 6.6666667 | 2003.667 |
11 | Breast | 54002704 | Rural Georgia | married | White | Not Latino | F | 72 | 1920 | 2 | 7 | 1992 | C509 | Left | 8501 | 3 | 8501 | Malignant | III | Positive histology | Hospital inpatient | 261 | 21.7500000 | 2013.750 |
12 | Breast | 54002998 | Rural Georgia | widowed | White | Not Latino | F | 70 | 1927 | 2 | 12 | 1997 | C501 | Left | 8500 | 2 | 8500 | Noninvasive | II | Positive histology | Hospital inpatient | 166 | 13.8333333 | 2010.833 |
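The two derived columns can be checked by hand against the sample table. The short sketch below recomputes them for the first two displayed patients (row names 3 and 4), using only the `yearDiagnosis` and `survivalMonths` values shown above; the data frame name `example_rows` is made up for this illustration.

```r
# Values copied from the first two rows of the sample table.
example_rows <- data.frame(
  yearDiagnosis  = c(1996, 1992),
  survivalMonths = c(5, 233)
)

# Same arithmetic as the mutate() call: months to years, then add to the
# diagnosis year to get the last year covered by the survival time.
example_rows$survivalYears <- example_rows$survivalMonths / 12
example_rows$currentYear   <- example_rows$yearDiagnosis + example_rows$survivalYears

print(round(example_rows, 3))
```

Rounded to three decimals, the results agree with the `survivalYears` and `currentYear` columns displayed for those rows.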
### Create individual data frames based on cancer types
breastDF <- seerMasterDF %>% subset(SEERDiagnosticUnit == "Breast")
digothrDF <- seerMasterDF %>% subset(SEERDiagnosticUnit == "Digothr")
malegenDF <- seerMasterDF %>% subset(SEERDiagnosticUnit == "Malegen")
femgenDF <- seerMasterDF %>% subset(SEERDiagnosticUnit == "Femgen")
respirDF <- seerMasterDF %>% subset(SEERDiagnosticUnit == "Respir")
colrectDF <- seerMasterDF %>% subset(SEERDiagnosticUnit == "Colrect")
lymyleukDF <- seerMasterDF %>% subset(SEERDiagnosticUnit == "Lymyleuk")
urinaryDF <- seerMasterDF %>% subset(SEERDiagnosticUnit == "Urinary")
otherDF <- seerMasterDF %>% subset(SEERDiagnosticUnit == "Other")
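The nine `subset` calls above can also be expressed as a single `split`, which returns a named list with one data frame per cancer type. This is an alternative sketch on toy data, not the approach used in the project; `toy_master` and `by_type` are illustrative names.

```r
# Toy stand-in for seerMasterDF with just two columns.
toy_master <- data.frame(
  SEERDiagnosticUnit = c("Breast", "Respir", "Breast", "Urinary"),
  survivalYears      = c(0.42, 1.50, 19.42, 7.25),
  stringsAsFactors   = FALSE
)

# One list element per distinct SEERDiagnosticUnit value.
by_type <- split(toy_master, toy_master$SEERDiagnosticUnit)

breast_toy <- by_type[["Breast"]]   # plays the role of breastDF
print(names(by_type))
print(nrow(breast_toy))
```

A list keyed by cancer type makes it easy to apply the same summary or inference function to every type with `lapply`, at the cost of slightly less readable variable names than the separate data frames.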
The CI5 data was queried straight from the established SQL database. A single SQL query was leveraged using JOINs and GROUP BY and its results were displayed in a Kable to verify that the data could be queried.
Query <- "SELECT
reg.Continent,
reg.Country,
diag.DiagUnitLvl1Desc,
diag.DiagUnitLvl0Desc,
diag.SEERDiagnosticUnit,
diag.SEERDiagnosticUnitDesc,
can.Year,
can.AgeGroup,
can.Sex,
sum(can.TotalCases) as CancerCases
FROM
SEERS_Analysis.cancer_cases_details as can
INNER JOIN
SEERS_Analysis.cancer_registry_master as reg
ON can.CancerRegistryID = reg.CancerRegistryID
INNER JOIN
diagnostic_units_master as diag
ON can.CancerCode = diag.CancerCode
GROUP BY
reg.Continent,
reg.Country,
diag.DiagUnitLvl1Desc,
diag.DiagUnitLvl0Desc,
diag.SEERDiagnosticUnit,
diag.SEERDiagnosticUnitDesc,
can.Year,
can.AgeGroup,
can.Sex"
CI5MasterDF <- sqlQuery(con,Query)
### Derive the starting age of each age group
CI5MasterDF$BeginAge <- as.numeric(str_extract(CI5MasterDF$AgeGroup,"\\d{1,2} "))
tmpCI5MasterDF <- head(CI5MasterDF)
tmpCI5MasterDF %>% kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>% scroll_box(width="100%",height="300px")
Continent | Country | DiagUnitLvl1Desc | DiagUnitLvl0Desc | SEERDiagnosticUnit | SEERDiagnosticUnitDesc | Year | AgeGroup | Sex | CancerCases | BeginAge |
---|---|---|---|---|---|---|---|---|---|---|
Africa | Uganda | Acute | Lymphoid leukaemia | Lymyleuk | Lymphoma/Leukemia | 1993 | 5 - 9 | M | 1 | 5 |
Africa | Uganda | Acute | Lymphoid leukaemia | Lymyleuk | Lymphoma/Leukemia | 1994 | 15 - 19 | M | 1 | 15 |
Africa | Uganda | Acute | Lymphoid leukaemia | Lymyleuk | Lymphoma/Leukemia | 1995 | 0 - 4 | F | 1 | 0 |
Africa | Uganda | Acute | Lymphoid leukaemia | Lymyleuk | Lymphoma/Leukemia | 1995 | 10 - 14 | M | 1 | 10 |
Africa | Uganda | Acute | Lymphoid leukaemia | Lymyleuk | Lymphoma/Leukemia | 1995 | 15 - 19 | M | 1 | 15 |
Africa | Uganda | Acute | Lymphoid leukaemia | Lymyleuk | Lymphoma/Leukemia | 1996 | 10 - 14 | F | 1 | 10 |
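The `BeginAge` column is the lower bound of each `AgeGroup` label. The base-R sketch below shows one way to extract it; the helper name `parse_begin_age` and the labels "85+" and "Unknown" are assumptions for illustration (only labels of the form "0 - 4" appear in the sample above). Unlike the `"\\d{1,2} "` pattern in the project code, this version does not require a space after the digits.

```r
# Hypothetical helper: return the leading number of an age-group label,
# or NA when the label does not start with digits.
parse_begin_age <- function(age_group) {
  has_digits <- grepl("^ *[0-9]+", age_group)
  out <- rep(NA_real_, length(age_group))
  out[has_digits] <- as.numeric(
    sub("^ *([0-9]+).*$", "\\1", age_group[has_digits])
  )
  out
}

labels <- c("0 - 4", "10 - 14", "85+", "Unknown")
print(parse_begin_age(labels))
```

Returning `NA` for unparseable labels keeps the result the same length as the input, so it can be assigned directly to a data frame column.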
The CI5 population data was queried straight from the established SQL database. A single SQL query was leveraged using JOINs and GROUP BY and its results were displayed in a Kable to verify that the data could be queried.
Query_pop <- "SELECT
reg.Continent,
reg.Country,
pop.Year,
pop.AgeGroup,
pop.Sex,
sum(pop.TotalPopulation) as Population
FROM
SEERS_Analysis.population_details as pop
INNER JOIN
SEERS_Analysis.cancer_registry_master as reg
ON pop.CancerRegistryID = reg.CancerRegistryID
GROUP BY
reg.Continent,
reg.Country,
pop.Year,
pop.AgeGroup,
pop.Sex"
CI5PopMasterDF <- sqlQuery(con,Query_pop)
### Derive the starting age of each age group
CI5PopMasterDF$BeginAge <- as.numeric(str_extract(CI5PopMasterDF$AgeGroup,"\\d{1,2} "))
tmpCI5PopMasterDF <- head(CI5PopMasterDF)
tmpCI5PopMasterDF %>% kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>% scroll_box(width="100%",height="300px")
Continent | Country | Year | AgeGroup | Sex | Population | BeginAge |
---|---|---|---|---|---|---|
Africa | Uganda | 1993 | 0 - 4 | F | 98371 | 0 |
Africa | Uganda | 1993 | 0 - 4 | M | 96250 | 0 |
Africa | Uganda | 1993 | 10 - 14 | F | 75214 | 10 |
Africa | Uganda | 1993 | 10 - 14 | M | 59394 | 10 |
Africa | Uganda | 1993 | 15 - 19 | F | 83967 | 15 |
Africa | Uganda | 1993 | 15 - 19 | M | 60309 | 15 |
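Both CI5 queries follow the same pattern: join a fact table to the registry dimension table, then sum within groups. For readers more comfortable in R than SQL, the sketch below reproduces that pattern with `merge` and `aggregate` on toy data. The registry IDs and case counts are invented; only the column names are taken from the queries above.

```r
# Toy fact table: case counts per registry and year.
cases <- data.frame(
  CancerRegistryID = c(1, 1, 2, 2),
  Year             = c(1993, 1993, 1993, 1994),
  TotalCases       = c(3, 2, 5, 4)
)

# Toy dimension table: two registries that belong to the same country.
registries <- data.frame(
  CancerRegistryID = c(1, 2),
  Country          = c("Uganda", "Uganda"),
  stringsAsFactors = FALSE
)

# INNER JOIN on the registry key.
joined <- merge(cases, registries, by = "CancerRegistryID")

# GROUP BY Country, Year with SUM(TotalCases).
totals <- aggregate(TotalCases ~ Country + Year, data = joined, FUN = sum)
print(totals)
```

Grouping by country rather than by registry is what collapses several registries in one country into a single row per year.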
The research question being investigated from a statistical standpoint is whether we can determine cancer survivability in years by cancer type and diagnosis year.
The research investigation from a purely statistical standpoint will determine whether there is a correlation between survivability in years (years alive after diagnosis) and the diagnosis year. Why should we care? Innovations in cancer research and mitigation are expected to improve the years of survivability of cancer patients. The investigation hopes to establish this correlation, if any. Each cancer type will be investigated using traditional summary statistics and inference analysis.
The data comes from the SEER dataset and covers only patients who were diagnosed with cancer prior to 1999. Because cancer research has improved over the years, we could expect to see an improvement in survivability over time.
Each case represents a cancer patient diagnosed from 1973 through 1998. There are 2,788,863 observations in the given data sets.
The response variable is survival years and is numerical.
The explanatory variable is the diagnosis year and is numerical.
This is an observational study, as we are using data that has been extracted from the SEER dataset. The dataset represents rows of actual cancer patients.
It is important to define the functions needed to perform the statistical analysis. The functions defined here were developed and used by Rom during his previous semester’s work in the course, DATA 606 Statistics Probability for Data Analytics.
plot_ss is a helper function for inference testing that draws the linear regression line together with dots representing the individual data points for each diagnosis year investigated. The function also shows where the data is bunched up or sparse for each diagnosis year.
plot_ss <- function(x, y, maintitle, showSquares = FALSE, leastSquares = FALSE){
  # Scatter plot of survival years against diagnosis year
  plot(x, y, xlab = "Diagnosis Year", ylab = "Survival Years", main = maintitle)
  if(leastSquares){
    # Fit the least-squares regression line
    m1 <- lm(y ~ x)
    y.hat <- fitted(m1)
  } else{
    # Interactive mode: click two points on the plot to define the line
    pt1 <- locator(1)
    points(pt1$x, pt1$y, pch = 4)
    pt2 <- locator(1)
    points(pt2$x, pt2$y, pch = 4)
    pts <- data.frame("x" = c(pt1$x, pt2$x), "y" = c(pt1$y, pt2$y))
    m1 <- lm(y ~ x, data = pts)
    y.hat <- predict(m1, newdata = data.frame(x))
  }
  r <- y - y.hat   # residuals
  abline(m1)
  # Far side of each residual square, flipped when it would leave the plot
  oSide <- x - r
  LLim <- par()$usr[1]
  RLim <- par()$usr[2]
  oSide[oSide < LLim | oSide > RLim] <- c(x + r)[oSide < LLim | oSide > RLim] # move boxes to avoid margins
  n <- length(y.hat)
  for(i in seq_len(n)){
    # Dashed vertical line from each observation to the fitted line
    lines(rep(x[i], 2), c(y[i], y.hat[i]), lty = 2, col = "blue")
    if(showSquares){
      # Remaining three sides of the squared-residual box
      lines(rep(oSide[i], 2), c(y[i], y.hat[i]), lty = 3, col = "orange")
      lines(c(oSide[i], x[i]), rep(y.hat[i], 2), lty = 3, col = "orange")
      lines(c(oSide[i], x[i]), rep(y[i], 2), lty = 3, col = "orange")
    }
  }
}
The summaryTable function shows the mean survivability in years for each year of diagnosis. It prints the row count and a detailed description of the data, then draws a bar chart of the relationship.
summaryTable <- function(cancerType, maintitle = " test"){
  survivalYears <- cancerType$survivalYears
  yearDiagnosis <- cancerType$yearDiagnosis
  meanTable <- tapply(survivalYears, yearDiagnosis, mean)
  show(nrow(cancerType))
  show(describeBy(survivalYears, group = yearDiagnosis, mat = TRUE))
  barplot(meanTable, beside = TRUE, col = c("#ee7700", "#3333ff"),
          main = maintitle, xlab = "Diagnosis Year", ylab = "Survival Years")
}
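The heart of `summaryTable` is the `tapply` call, which produces the per-year means that feed the bar chart. The sketch below shows that step on a tiny made-up data frame; the values are illustrative and not SEER data.

```r
# Toy data: five patients across two diagnosis years.
toy <- data.frame(
  yearDiagnosis = c(1973, 1973, 1974, 1974, 1974),
  survivalYears = c(2, 4, 1, 3, 8)
)

# Mean survival years within each diagnosis year; the result is a named
# vector whose names are the diagnosis years.
mean_by_year <- tapply(toy$survivalYears, toy$yearDiagnosis, mean)
print(mean_by_year)

# barplot(mean_by_year, xlab = "Diagnosis Year", ylab = "Survival Years")
```

Because the names of the result are the diagnosis years, `barplot` labels each bar with its year without any extra work.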
The inferenceTests function runs the inference test for the two variables investigated and uses the lm function to determine whether the relationship between them is statistically significant. The lm summary reports the p-value for the slope, which determines whether the data give cause to reject the Null Hypothesis that there is no linear relationship between diagnosis year and survivability in years. If the p-value is less than 0.05, we reject the Null Hypothesis: there is evidence of a correlation between diagnosis year and survivability in years. If the p-value is 0.05 or greater, we fail to reject the Null Hypothesis: the data do not provide sufficient evidence of a correlation, which is not the same as evidence that no correlation exists.
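The inferenceTests code itself is not reproduced here, so the following is only a sketch of the decision rule it is described as applying: fit `lm`, read the slope's p-value from the summary, and compare it with 0.05. The helper name `slope_p_value` and both toy series are assumptions for illustration.

```r
# Hypothetical helper: p-value for the slope of a simple linear regression.
slope_p_value <- function(x, y) {
  fit <- lm(y ~ x)
  summary(fit)$coefficients["x", "Pr(>|t|)"]
}

year <- 1973:1998

# Toy series with a clear upward trend plus a small alternating wobble.
trending <- 0.1 * (year - 1973) + rep(c(-0.5, 0.5), times = 13)

# Toy series that only alternates, with essentially no trend.
flat <- rep(c(1, 2), times = 13)

p_trending <- slope_p_value(year, trending)
p_flat     <- slope_p_value(year, flat)

# Decision rule at the 0.05 level.
print(p_trending < 0.05)   # reject the Null Hypothesis
print(p_flat < 0.05)       # fail to reject the Null Hypothesis
```

With roughly 2.8 million observations, even a very small slope can produce a p-value below 0.05, so the size of the estimated slope is worth reporting alongside the test result.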
inferenceTest0 is a useful function and it must be executed before we do any inference analysis. This test determines if the data being investigated is valid for inference analysis.
In this section, we will focus on the initial statistical findings for diagnosis year vs. survivability in years for each cancer type. The cancer types are: Breast, Digestive, Male Genital, Female Genital, Respiratory, Colon/Rectal, Lymphoma/Leukemia, Urinary, and Other Cancers. There will be a short discussion about each cancer.
## [1] 437429
## item group1 vars n mean sd median trimmed mad
## X11 1 1973 1 7483 13.75490 12.847722 9.000000 12.07707 10.62530
## X12 2 1974 1 9926 14.05917 12.763345 9.666667 12.47855 11.24305
## X13 3 1975 1 10162 14.51522 12.835051 10.291667 13.08905 11.92258
## X14 4 1976 1 9935 14.28900 12.726503 9.750000 12.90285 11.36660
## X15 5 1977 1 9861 13.93842 12.403394 9.583333 12.57523 10.99595
## X16 6 1978 1 10044 13.74251 12.216045 9.333333 12.44367 10.87240
## X17 7 1979 1 10422 13.48143 11.922955 9.333333 12.22413 10.74885
## X18 8 1980 1 10663 13.57605 11.730983 9.583333 12.46070 10.99595
## X19 9 1981 1 11268 13.30669 11.442176 9.500000 12.24710 10.74885
## X110 10 1982 1 11444 13.62260 11.337058 10.166667 12.75869 11.49015
## X111 11 1983 1 12269 13.52652 11.155280 10.083333 12.76108 11.49015
## X112 12 1984 1 13167 13.81847 10.906295 10.750000 13.23652 12.10790
## X113 13 1985 1 14549 14.01009 10.793238 11.416667 13.59735 12.84920
## X114 14 1986 1 15342 14.27138 10.453837 12.333333 14.03408 13.59050
## X115 15 1987 1 16872 14.44615 10.120677 12.916667 14.36199 13.83760
## X116 16 1988 1 16854 14.60096 9.860029 13.666667 14.67856 14.45535
## X117 17 1989 1 16492 14.32958 9.535967 13.583333 14.46047 14.33180
## X118 18 1990 1 17597 14.16376 9.230081 13.833333 14.37613 14.45535
## X119 19 1991 1 18195 14.05645 8.884676 14.083333 14.35876 14.57890
## X120 20 1992 1 25672 13.89155 8.543038 14.500000 14.27432 12.84920
## X121 21 1993 1 25434 13.54769 8.203665 14.416667 13.97147 11.61370
## X122 22 1994 1 26258 13.37373 7.804135 14.833333 13.87288 9.76045
## X123 23 1995 1 27474 13.11121 7.374806 15.083333 13.66612 8.15430
## X124 24 1996 1 28246 12.85748 7.002059 15.583333 13.47107 6.05395
## X125 25 1997 1 30002 12.60863 6.523395 15.750000 13.27695 4.44780
## X126 26 1998 1 31798 12.20939 6.117024 15.833333 12.89831 2.96520
## min max range skew kurtosis se
## X11 0 42.91667 42.91667 0.89565957 -0.4202666 0.14852113
## X12 0 41.91667 41.91667 0.83378345 -0.5349414 0.12810833
## X13 0 40.91667 40.91667 0.75263027 -0.7283433 0.12732333
## X14 0 39.91667 39.91667 0.75236774 -0.7694448 0.12768066
## X15 0 38.91667 38.91667 0.76497370 -0.7451622 0.12490507
## X16 0 37.91667 37.91667 0.74437880 -0.7985499 0.12189258
## X17 0 36.91667 36.91667 0.74752092 -0.7853837 0.11679073
## X18 0 35.91667 35.91667 0.68726304 -0.9023959 0.11360428
## X19 0 34.91667 34.91667 0.68188420 -0.9160247 0.10779167
## X110 0 33.91667 33.91667 0.57864151 -1.0842911 0.10597696
## X111 0 32.91667 32.91667 0.54514974 -1.1505016 0.10071077
## X112 0 31.91667 31.91667 0.44906595 -1.2503189 0.09504599
## X113 0 30.91667 30.91667 0.35527815 -1.3657618 0.08948190
## X114 0 29.91667 29.91667 0.25317341 -1.4278506 0.08439851
## X115 0 28.91667 28.91667 0.16759356 -1.4754067 0.07791594
## X116 0 27.91667 27.91667 0.06413681 -1.5140338 0.07594981
## X117 0 26.91667 26.91667 0.02254484 -1.5325599 0.07425540
## X118 0 25.91667 25.91667 -0.03545514 -1.5440039 0.06958028
## X119 0 24.91667 24.91667 -0.11037804 -1.5457012 0.06586665
## X120 0 23.91667 23.91667 -0.18923222 -1.5382729 0.05331906
## X121 0 22.91667 22.91667 -0.23139467 -1.5325383 0.05143995
## X122 0 21.91667 21.91667 -0.32292716 -1.4815121 0.04816083
## X123 0 20.91667 20.91667 -0.39854144 -1.4218837 0.04449279
## X124 0 19.91667 19.91667 -0.49578128 -1.3484823 0.04166269
## X125 0 18.91667 18.91667 -0.60275166 -1.2141172 0.03766158
## X126 0 17.91667 17.91667 -0.68133959 -1.1155929 0.03430365
Observations
The study encompasses 437,429 breast cancer cases. The bar chart indicates a roughly uniform trend of survivability in years vs. diagnosis year. Based on visual observation, mean breast cancer survivability is roughly between 12 and 15 years. There is a peak in 1988, followed by a slow downward trend through 1998.
## [1] 206980
## item group1 vars n mean sd median trimmed mad
## X11 1 1973 1 4689 1.881051 5.297684 0.3333333 0.5714095 0.49420
## X12 2 1974 1 5480 2.002311 5.501564 0.3333333 0.6262165 0.49420
## X13 3 1975 1 5725 1.866638 5.143867 0.3333333 0.6049443 0.49420
## X14 4 1976 1 5873 2.048343 5.526647 0.3333333 0.6591119 0.49420
## X15 5 1977 1 5913 1.897937 5.037871 0.3333333 0.6411083 0.49420
## X16 6 1978 1 5908 2.096000 5.564810 0.3333333 0.6715666 0.49420
## X17 7 1979 1 6306 2.137026 5.473037 0.4166667 0.7152860 0.49420
## X18 8 1980 1 6356 2.101794 5.296972 0.4166667 0.7183117 0.61775
## X19 9 1981 1 6468 2.055440 5.255808 0.4166667 0.7077215 0.49420
## X110 10 1982 1 6525 2.179285 5.439357 0.4166667 0.7462810 0.61775
## X111 11 1983 1 6738 2.153458 5.377940 0.4166667 0.7568466 0.61775
## X112 12 1984 1 6825 2.144713 5.293854 0.4166667 0.7480468 0.61775
## X113 13 1985 1 6947 2.322609 5.576023 0.4166667 0.8139354 0.61775
## X114 14 1986 1 6990 2.368801 5.549618 0.4166667 0.8554036 0.61775
## X115 15 1987 1 7144 2.374230 5.393575 0.5000000 0.9092460 0.61775
## X116 16 1988 1 7292 2.517793 5.651390 0.5000000 0.9484773 0.61775
## X117 17 1989 1 7300 2.485183 5.431884 0.5000000 0.9769121 0.61775
## X118 18 1990 1 7448 2.423346 5.286239 0.5000000 0.9660235 0.61775
## X119 19 1991 1 7759 2.518892 5.350108 0.5000000 1.0216218 0.61775
## X120 20 1992 1 11165 2.523944 5.240308 0.5000000 1.0669801 0.74130
## X121 21 1993 1 11378 2.428063 5.085070 0.5000000 1.0031213 0.61775
## X122 22 1994 1 11469 2.499913 5.094961 0.5000000 1.0622389 0.61775
## X123 23 1995 1 11705 2.482059 4.963624 0.5000000 1.0834312 0.74130
## X124 24 1996 1 12200 2.493518 4.851093 0.5000000 1.1324624 0.74130
## X125 25 1997 1 12541 2.552621 4.855625 0.5000000 1.1850477 0.74130
## X126 26 1998 1 12836 2.536713 4.670902 0.5000000 1.2455453 0.74130
## min max range skew kurtosis se
## X11 0 42.83333 42.83333 4.730771 24.936216 0.07736525
## X12 0 41.91667 41.91667 4.569437 23.006433 0.07431832
## X13 0 40.83333 40.83333 4.743313 25.216311 0.06798326
## X14 0 39.91667 39.91667 4.468826 21.702227 0.07211602
## X15 0 38.91667 38.91667 4.580315 23.478372 0.06551535
## X16 0 37.91667 37.91667 4.285307 19.632547 0.07239859
## X17 0 36.91667 36.91667 4.136243 18.289460 0.06892097
## X18 0 35.91667 35.91667 4.059057 17.608932 0.06644094
## X19 0 34.91667 34.91667 4.213699 19.079700 0.06535134
## X110 0 33.91667 33.91667 3.953334 16.422105 0.06733755
## X111 0 32.91667 32.91667 4.011968 16.810384 0.06551647
## X112 0 31.91667 31.91667 3.894166 15.756904 0.06407973
## X113 0 30.91667 30.91667 3.607551 13.075184 0.06689997
## X114 0 29.91667 29.91667 3.490404 12.148901 0.06637805
## X115 0 28.91667 28.91667 3.417048 11.668875 0.06381253
## X116 0 27.91667 27.91667 3.209816 9.844757 0.06618080
## X117 0 26.91667 26.91667 3.132905 9.382690 0.06357539
## X118 0 25.91667 25.91667 3.171410 9.607031 0.06125295
## X119 0 24.91667 24.91667 2.975868 8.216639 0.06073789
## X120 0 23.91667 23.91667 2.906187 7.763203 0.04959381
## X121 0 22.91667 22.91667 2.893605 7.592461 0.04767207
## X122 0 21.91667 21.91667 2.744005 6.598931 0.04757493
## X123 0 20.91667 20.91667 2.645381 6.016964 0.04587891
## X124 0 19.91667 19.91667 2.552132 5.473130 0.04391974
## X125 0 18.91667 18.91667 2.421005 4.656433 0.04335898
## X126 0 17.91667 17.91667 2.322243 4.188239 0.04122739
Observations
The study encompasses 206,980 digestive cancer cases. The bar chart indicates a slightly upward trend of survivability in years vs. diagnosis year. Based on visual observation, it appears digestive cancer survivability in years is between 1.8 and 2.5 years. There is an upward trend between 1973 and 1998.
## [1] 360293
## item group1 vars n mean sd median trimmed
## X11 1 1973 1 4320 7.114718 8.687770 4.000000 5.310547
## X12 2 1974 1 5192 7.449762 8.875015 4.166667 5.627748
## X13 3 1975 1 6032 7.808742 8.967664 4.666667 6.033188
## X14 4 1976 1 6386 7.983884 8.921898 4.833333 6.250375
## X15 5 1977 1 6806 8.332966 9.091667 5.083333 6.603088
## X16 6 1978 1 6852 8.348232 9.023291 5.083333 6.646647
## X17 7 1979 1 7268 8.444357 9.004389 5.250000 6.745673
## X18 8 1980 1 7695 8.557115 9.077270 5.250000 6.851917
## X19 9 1981 1 8026 8.738662 9.025482 5.583333 7.103576
## X110 10 1982 1 8248 8.643146 8.815989 5.416667 7.074836
## X111 11 1983 1 8751 8.648793 8.649312 5.666667 7.150086
## X112 12 1984 1 8888 8.869449 8.704468 5.750000 7.416350
## X113 13 1985 1 9458 8.939698 8.523616 6.041667 7.583300
## X114 14 1986 1 9979 9.244372 8.555844 6.416667 7.983991
## X115 15 1987 1 11475 9.576863 8.486495 6.833333 8.451276
## X116 16 1988 1 11983 9.649344 8.242807 7.166667 8.646935
## X117 17 1989 1 13054 9.964749 8.226717 7.666667 9.108092
## X118 18 1990 1 15474 10.227521 8.060204 8.250000 9.546540
## X119 19 1991 1 19968 10.753915 7.868505 9.166667 10.306747
## X120 20 1992 1 31167 11.165196 7.717871 9.833333 10.919812
## X121 21 1993 1 28415 11.338891 7.557125 10.333333 11.250165
## X122 22 1994 1 25107 11.532375 7.436377 10.916667 11.606383
## X123 23 1995 1 23974 11.637232 7.143867 11.583333 11.856022
## X124 24 1996 1 24372 11.665795 6.868081 12.250000 12.008783
## X125 25 1997 1 25622 11.628626 6.472183 12.750000 12.076304
## X126 26 1998 1 25781 11.373883 6.090046 12.916667 11.874687
## mad min max range skew kurtosis se
## X11 4.694900 0 42.91667 42.91667 2.13787283 4.82524349 0.13218021
## X12 4.818450 0 41.91667 41.91667 2.03146149 4.17865126 0.12316909
## X13 5.312650 0 40.91667 40.91667 1.91144568 3.58706972 0.11546455
## X14 5.559750 0 39.91667 39.91667 1.81634516 3.14292419 0.11164590
## X15 5.683300 0 38.91667 38.91667 1.70557690 2.58145000 0.11020405
## X16 5.683300 0 37.91667 37.91667 1.63967995 2.26335123 0.10900748
## X17 5.868625 0 36.91667 36.91667 1.60973311 2.08741851 0.10562015
## X18 5.806850 0 35.91667 35.91667 1.54190588 1.71306597 0.10347869
## X19 6.177500 0 34.91667 34.91667 1.45193163 1.37034808 0.10074438
## X110 6.053950 0 33.91667 33.91667 1.41558125 1.26507026 0.09707262
## X111 6.177500 0 32.91667 32.91667 1.35424866 1.05882555 0.09245975
## X112 6.424600 0 31.91667 31.91667 1.24891965 0.67536308 0.09232944
## X113 6.671700 0 30.91667 30.91667 1.17068649 0.45047623 0.08764441
## X114 7.042350 0 29.91667 29.91667 1.05160082 0.07960905 0.08564841
## X115 7.289450 0 28.91667 28.91667 0.93936159 -0.22674871 0.07922313
## X116 7.536550 0 27.91667 27.91667 0.85247141 -0.39254226 0.07529954
## X117 8.030750 0 26.91667 26.91667 0.73167215 -0.65767874 0.07200369
## X118 8.524950 0 25.91667 25.91667 0.60874626 -0.86781355 0.06479548
## X119 9.019150 0 24.91667 24.91667 0.44538422 -1.08155467 0.05568330
## X120 9.389800 0 23.91667 23.91667 0.30068378 -1.23766085 0.04371696
## X121 9.760450 0 22.91667 22.91667 0.18137208 -1.33845895 0.04483145
## X122 10.501750 0 21.91667 21.91667 0.05080442 -1.44248734 0.04693145
## X123 10.872400 0 20.91667 20.91667 -0.09160849 -1.45430675 0.04613846
## X124 10.131100 0 19.91667 19.91667 -0.22518684 -1.44980914 0.04399363
## X125 8.277850 0 18.91667 18.91667 -0.35342744 -1.37784016 0.04043377
## X126 6.795250 0 17.91667 17.91667 -0.44689911 -1.31408026 0.03792894
Observations
The study encompasses 360,923 male genital cancer cases. The bar chart indicates a slightly upward trend of survivability in years vs. diagnosis year. Based on visual observation, it appears male genital cancer survivability in years is between 6.5 and 11.8 years. There is an upward trend between 1973 and 1998.
## [1] 203556
## item group1 vars n mean sd median trimmed mad
## X11 1 1973 1 5388 15.87147 14.347374 13.166667 14.62873 17.91475
## X12 2 1974 1 6589 16.59190 14.343188 14.416667 15.57202 19.15025
## X13 3 1975 1 7188 16.58519 14.106362 14.750000 15.67658 19.27380
## X14 4 1976 1 7057 16.37411 13.912329 14.333333 15.51266 18.77960
## X15 5 1977 1 6722 15.73044 13.724293 13.125000 14.82664 17.35878
## X16 6 1978 1 6551 15.02595 13.371177 11.833333 14.06595 15.81440
## X17 7 1979 1 6373 14.95557 13.362156 11.833333 14.10275 16.06150
## X18 8 1980 1 6448 14.12064 12.905340 10.333333 13.18519 14.08470
## X19 9 1981 1 6329 14.03257 12.683007 10.666667 13.19867 14.45535
## X110 10 1982 1 6410 13.56959 12.323044 10.416667 12.74322 14.08470
## X111 11 1983 1 6474 13.34460 12.136343 9.750000 12.58658 13.21985
## X112 12 1984 1 6606 13.34879 11.875184 10.083333 12.71158 13.59050
## X113 13 1985 1 6638 13.14895 11.673163 9.916667 12.58366 13.46695
## X114 14 1986 1 6452 13.15542 11.482754 10.416667 12.71185 14.08470
## X115 15 1987 1 6642 12.67831 11.067401 10.083333 12.24349 13.71405
## X116 16 1988 1 6822 13.02514 10.804941 11.333333 12.79263 15.19665
## X117 17 1989 1 7081 12.65961 10.458559 11.000000 12.46080 14.70245
## X118 18 1990 1 7376 12.56597 10.134811 11.083333 12.46781 14.82600
## X119 19 1991 1 7323 12.13707 9.779326 11.000000 12.05479 14.57890
## X120 20 1992 1 10704 12.10983 9.425707 11.416667 12.14294 15.07310
## X121 21 1993 1 10792 11.61045 9.069296 10.750000 11.64571 14.33180
## X122 22 1994 1 10739 11.42904 8.670590 11.166667 11.54191 14.45535
## X123 23 1995 1 11019 11.29448 8.266091 11.833333 11.49797 12.60210
## X124 24 1996 1 11130 10.90928 7.869461 11.666667 11.13900 11.49015
## X125 25 1997 1 11323 10.55532 7.439857 11.500000 10.81921 10.37820
## X126 26 1998 1 11380 10.25996 7.061796 12.083333 10.57399 8.27785
## min max range skew kurtosis se
## X11 0 42.91667 42.91667 0.46344438 -1.180030 0.19546033
## X12 0 41.91667 41.91667 0.37005273 -1.284193 0.17669975
## X13 0 40.91667 40.91667 0.32702106 -1.329257 0.16638378
## X14 0 39.91667 39.91667 0.34122238 -1.328088 0.16561123
## X15 0 38.91667 38.91667 0.38931987 -1.316772 0.16739432
## X16 0 37.91667 37.91667 0.44030337 -1.279974 0.16520221
## X17 0 36.91667 36.91667 0.40084634 -1.364036 0.16738039
## X18 0 35.91667 35.91667 0.47474377 -1.293991 0.16071520
## X19 0 34.91667 34.91667 0.43548712 -1.345532 0.15942436
## X110 0 33.91667 33.91667 0.43727406 -1.342985 0.15391785
## X111 0 32.91667 32.91667 0.44217714 -1.371299 0.15083478
## X112 0 31.91667 31.91667 0.38486943 -1.434564 0.14610701
## X113 0 30.91667 30.91667 0.35783924 -1.484888 0.14327484
## X114 0 29.91667 29.91667 0.29878300 -1.548839 0.14295485
## X115 0 28.91667 28.91667 0.29950003 -1.548683 0.13579889
## X116 0 27.91667 27.91667 0.19120710 -1.607732 0.13081771
## X117 0 26.91667 26.91667 0.18020028 -1.627628 0.12428667
## X118 0 25.91667 25.91667 0.12796229 -1.656883 0.11800630
## X119 0 24.91667 24.91667 0.12221990 -1.666706 0.11427847
## X120 0 23.91667 23.91667 0.03644682 -1.694893 0.09110473
## X121 0 22.91667 22.91667 0.04169118 -1.705741 0.08730168
## X122 0 21.91667 21.91667 -0.02132959 -1.711389 0.08366942
## X123 0 20.91667 20.91667 -0.10097208 -1.702723 0.07874611
## X124 0 19.91667 19.91667 -0.13252895 -1.701669 0.07459288
## X125 0 18.91667 18.91667 -0.16303928 -1.699480 0.06991717
## X126 0 17.91667 17.91667 -0.23834437 -1.678105 0.06619788
Observations
The study encompasses 203,556 female genital cancer cases. The bar chart indicates a slightly downward trend of survivability in years vs. diagnosis year. Based on visual observation, it appears female genital cancer survivability in years falls from about 14.7 to about 10.6 years. There is a downward trend between 1973 and 1998.
## [1] 402016
## item group1 vars n mean sd median trimmed mad
## X11 1 1973 1 7421 2.759511 6.185634 0.5000000 1.118564 0.61775
## X12 2 1974 1 8975 3.156657 6.773337 0.5833333 1.331210 0.74130
## X13 3 1975 1 9930 2.980799 6.436963 0.5833333 1.250525 0.74130
## X14 4 1976 1 10578 3.065994 6.359016 0.6666667 1.373582 0.86485
## X15 5 1977 1 10951 3.083934 6.459985 0.6666667 1.353784 0.86485
## X16 6 1978 1 11476 3.093064 6.329695 0.6666667 1.405140 0.86485
## X17 7 1979 1 11948 2.942794 5.993471 0.6666667 1.350270 0.86485
## X18 8 1980 1 12449 2.875840 5.877824 0.6666667 1.330087 0.86485
## X19 9 1981 1 13013 2.939817 5.938792 0.6666667 1.364406 0.86485
## X110 10 1982 1 13401 2.980375 5.980149 0.6666667 1.390822 0.86485
## X111 11 1983 1 13598 3.004633 5.954391 0.6666667 1.423506 0.86485
## X112 12 1984 1 14167 2.826016 5.674059 0.6666667 1.317152 0.86485
## X113 13 1985 1 14237 2.929661 5.760715 0.6666667 1.397426 0.86485
## X114 14 1986 1 14625 2.861054 5.601615 0.7500000 1.373202 0.98840
## X115 15 1987 1 15200 2.778586 5.412748 0.6666667 1.339899 0.86485
## X116 16 1988 1 15459 2.777373 5.416707 0.6666667 1.338029 0.86485
## X117 17 1989 1 15489 2.783895 5.367042 0.7500000 1.348638 0.98840
## X118 18 1990 1 15855 2.811116 5.296980 0.7500000 1.401564 0.98840
## X119 19 1991 1 16239 2.788216 5.169718 0.7500000 1.414832 0.98840
## X120 20 1992 1 22357 2.728679 5.064413 0.7500000 1.381213 0.98840
## X121 21 1993 1 22112 2.681123 4.958204 0.6666667 1.359365 0.86485
## X122 22 1994 1 22107 2.724476 4.955791 0.7500000 1.395413 0.98840
## X123 23 1995 1 22427 2.694026 4.801400 0.7500000 1.416374 0.98840
## X124 24 1996 1 22490 2.688699 4.737620 0.6666667 1.428760 0.86485
## X125 25 1997 1 22569 2.600802 4.521878 0.7500000 1.402060 0.98840
## X126 26 1998 1 22943 2.590594 4.423947 0.7500000 1.417493 0.98840
## min max range skew kurtosis se
## X11 0 42.91667 42.91667 3.648284 14.984012 0.07180472
## X12 0 41.91667 41.91667 3.282687 11.524087 0.07149662
## X13 0 40.91667 40.91667 3.374960 12.203756 0.06459611
## X14 0 39.91667 39.91667 3.248832 11.373464 0.06182842
## X15 0 38.91667 38.91667 3.245890 11.136045 0.06173119
## X16 0 37.91667 37.91667 3.195686 10.780425 0.05908640
## X17 0 36.91667 36.91667 3.206208 11.026504 0.05483158
## X18 0 35.91667 35.91667 3.294079 11.680610 0.05268043
## X19 0 34.91667 34.91667 3.190019 10.770003 0.05206062
## X110 0 33.91667 33.91667 3.126128 10.170239 0.05165869
## X111 0 32.91667 32.91667 3.060926 9.619290 0.05106224
## X112 0 31.91667 31.91667 3.156159 10.310437 0.04767107
## X113 0 30.91667 30.91667 3.018698 9.211257 0.04827999
## X114 0 29.91667 29.91667 3.030200 9.280250 0.04631965
## X115 0 28.91667 28.91667 3.011478 9.172017 0.04390318
## X116 0 27.91667 27.91667 3.004268 8.998679 0.04356569
## X117 0 26.91667 26.91667 2.919218 8.338630 0.04312442
## X118 0 25.91667 25.91667 2.814148 7.628188 0.04206735
## X119 0 24.91667 24.91667 2.761625 7.297869 0.04056834
## X120 0 23.91667 23.91667 2.741123 7.082723 0.03387056
## X121 0 22.91667 22.91667 2.699955 6.786354 0.03334343
## X122 0 21.91667 21.91667 2.573823 5.927358 0.03333098
## X123 0 20.91667 20.91667 2.505410 5.544809 0.03206138
## X124 0 19.91667 19.91667 2.422418 5.019460 0.03159115
## X125 0 18.91667 18.91667 2.396506 4.878989 0.03009974
## X126 0 17.91667 17.91667 2.322820 4.433139 0.02920686
Observations
The study encompasses 402,016 respiratory cancer cases. The bar chart indicates a uniform trend of survivability in years vs. diagnosis year. Based on visual observation, it appears respiratory cancer survivability in years is between 2.5 and 3 years. There are peaks in 1974, 1978, and 1983. It appears to slowly trend downward after 1985.
## [1] 354431
## item group1 vars n mean sd median trimmed mad
## X11 1 1973 1 7497 7.393980 9.964422 2.583333 5.316192 3.58295
## X12 2 1974 1 9022 7.555032 10.049499 2.666667 5.479646 3.83005
## X13 3 1975 1 9772 7.574848 9.889754 2.833333 5.556174 3.95360
## X14 4 1976 1 10252 7.869066 10.087779 2.916667 5.853298 4.07715
## X15 5 1977 1 10595 7.743747 9.925999 3.000000 5.740543 4.20070
## X16 6 1978 1 10815 7.839367 9.882616 3.166667 5.885579 4.44780
## X17 7 1979 1 11121 7.929308 9.815731 3.250000 6.015501 4.44780
## X18 8 1980 1 11631 7.750093 9.544340 3.250000 5.914222 4.57135
## X19 9 1981 1 11982 8.031339 9.566109 3.583333 6.253982 4.81845
## X110 10 1982 1 11899 7.871502 9.367822 3.583333 6.121503 4.94200
## X111 11 1983 1 12408 7.876330 9.241025 3.666667 6.173533 5.06555
## X112 12 1984 1 12825 7.972872 9.183120 3.750000 6.334681 5.06555
## X113 13 1985 1 13568 8.440276 9.258769 4.333333 6.937661 5.93040
## X114 14 1986 1 13461 8.481589 9.147331 4.583333 7.045323 6.17750
## X115 15 1987 1 13560 8.389454 8.962190 4.583333 7.014496 6.17750
## X116 16 1988 1 13340 8.302111 8.837809 4.500000 6.977909 6.17750
## X117 17 1989 1 13613 8.305162 8.694746 4.666667 7.086937 6.30105
## X118 18 1990 1 13518 8.312318 8.517924 4.750000 7.205837 6.42460
## X119 19 1991 1 13486 8.230708 8.331222 4.750000 7.222845 6.30105
## X120 20 1992 1 18397 8.112233 8.176035 4.666667 7.193888 6.30105
## X121 21 1993 1 18208 7.852144 7.879212 4.583333 6.992135 6.17750
## X122 22 1994 1 18109 7.782493 7.664211 4.583333 7.024547 6.17750
## X123 23 1995 1 18017 7.576844 7.437200 4.500000 6.889843 6.17750
## X124 24 1996 1 18354 7.635974 7.236419 4.750000 7.083498 6.54815
## X125 25 1997 1 19158 7.463253 6.961837 4.750000 6.989121 6.54815
## X126 26 1998 1 19823 7.474739 6.679914 5.083333 7.124004 6.91880
## min max range skew kurtosis se
## X11 0 42.91667 42.91667 1.7027479 2.24775556 0.11508225
## X12 0 41.91667 41.91667 1.6481656 1.96140876 0.10580179
## X13 0 40.91667 40.91667 1.6158121 1.84553073 0.10004463
## X14 0 39.91667 39.91667 1.5428508 1.51807708 0.09963026
## X15 0 38.91667 38.91667 1.5641638 1.59542150 0.09643257
## X16 0 37.91667 37.91667 1.5029676 1.33694950 0.09502955
## X17 0 36.91667 36.91667 1.4622561 1.18000860 0.09307879
## X18 0 35.91667 35.91667 1.4367919 1.09566145 0.08849880
## X19 0 34.91667 34.91667 1.3569861 0.80704449 0.08739179
## X110 0 33.91667 33.91667 1.3595868 0.80978113 0.08587829
## X111 0 32.91667 32.91667 1.3289695 0.71765395 0.08296010
## X112 0 31.91667 31.91667 1.2616486 0.46746542 0.08108893
## X113 0 30.91667 30.91667 1.1037857 0.01825419 0.07948685
## X114 0 29.91667 29.91667 1.0638244 -0.11351409 0.07884165
## X115 0 28.91667 28.91667 1.0354566 -0.19891748 0.07696341
## X116 0 27.91667 27.91667 1.0090007 -0.27971377 0.07651855
## X117 0 26.91667 26.91667 0.9513300 -0.44613295 0.07452123
## X118 0 25.91667 25.91667 0.8880731 -0.59688136 0.07326179
## X119 0 24.91667 24.91667 0.8430289 -0.71015720 0.07174095
## X120 0 23.91667 23.91667 0.8015797 -0.82643251 0.06027945
## X121 0 22.91667 22.91667 0.7827651 -0.85749918 0.05839177
## X122 0 21.91667 21.91667 0.7266084 -0.98150257 0.05695348
## X123 0 20.91667 20.91667 0.6871565 -1.06730729 0.05540746
## X124 0 19.91667 19.91667 0.5806019 -1.23462936 0.05341440
## X125 0 18.91667 18.91667 0.5351441 -1.30903133 0.05029777
## X126 0 17.91667 17.91667 0.4403046 -1.42113781 0.04744453
Observations
The study encompasses 354,431 colon/rectal cancer cases. The bar chart indicates a uniform trend of survivability in years vs. diagnosis year. Based on visual observation, it appears colon/rectal cancer survivability in years is between 7.3 and 8.4 years. There is a peak in 1986. It appears to slowly trend downward after 1986.
## [1] 220475
## item group1 vars n mean sd median trimmed mad
## X11 1 1973 1 4116 6.623907 10.684675 1.916667 3.910494 2.71810
## X12 2 1974 1 4885 7.057864 10.993167 2.083333 4.324678 2.96520
## X13 3 1975 1 5353 7.154773 10.782224 2.250000 4.561308 3.21230
## X14 4 1976 1 5560 7.072902 10.544514 2.333333 4.528796 3.33585
## X15 5 1977 1 5529 7.562730 11.087585 2.250000 4.976271 3.21230
## X16 6 1978 1 5717 7.221532 10.457210 2.416667 4.765027 3.45940
## X17 7 1979 1 5905 7.284519 10.585373 2.416667 4.814374 3.45940
## X18 8 1980 1 6102 7.616587 10.584935 2.750000 5.253243 3.83005
## X19 9 1981 1 6327 7.461027 10.308909 2.666667 5.171539 3.70650
## X110 10 1982 1 6585 7.419615 10.225348 2.583333 5.187370 3.70650
## X111 11 1983 1 6822 7.666887 10.182662 2.833333 5.567577 3.95360
## X112 12 1984 1 7177 7.628327 10.035417 2.833333 5.619261 4.07715
## X113 13 1985 1 7332 7.834936 10.031404 3.000000 5.988124 4.20070
## X114 14 1986 1 7394 7.492077 9.663383 2.833333 5.683373 4.07715
## X115 15 1987 1 7970 7.446299 9.462589 2.916667 5.747425 4.07715
## X116 16 1988 1 8063 7.597110 9.376017 3.083333 6.053545 4.32425
## X117 17 1989 1 8276 7.293993 9.061203 2.916667 5.799469 4.20070
## X118 18 1990 1 8618 7.292556 8.883519 2.916667 5.920630 4.20070
## X119 19 1991 1 8941 7.213082 8.637447 3.000000 5.941377 4.20070
## X120 20 1992 1 12630 7.124908 8.443424 2.916667 5.956956 4.20070
## X121 21 1993 1 12737 7.171561 8.253906 3.083333 6.137131 4.44780
## X122 22 1994 1 13208 7.057181 7.981960 3.083333 6.119559 4.44780
## X123 23 1995 1 13630 7.009660 7.780965 3.083333 6.178146 4.44780
## X124 24 1996 1 13630 7.081683 7.500327 3.500000 6.391171 5.06555
## X125 25 1997 1 13929 6.931863 7.199148 3.583333 6.327464 5.18910
## X126 26 1998 1 14039 6.986692 6.940419 3.916667 6.516640 5.68330
## min max range skew kurtosis se
## X11 0 42.91667 42.91667 2.1943749 3.99401315 0.16654195
## X12 0 41.91667 41.91667 2.0239487 3.12060195 0.15728617
## X13 0 40.91667 40.91667 1.9466688 2.81478176 0.14737022
## X14 0 39.91667 39.91667 1.9355781 2.78332977 0.14141295
## X15 0 38.91667 38.91667 1.7602853 1.91226452 0.14911237
## X16 0 37.91667 37.91667 1.8187353 2.24653193 0.13830305
## X17 0 36.91667 36.91667 1.7508971 1.86762490 0.13775146
## X18 0 35.91667 35.91667 1.6375162 1.45304155 0.13550399
## X19 0 34.91667 34.91667 1.6279429 1.41657273 0.12960262
## X110 0 33.91667 33.91667 1.5774272 1.20857896 0.12600861
## X111 0 32.91667 32.91667 1.4749969 0.85426024 0.12328364
## X112 0 31.91667 31.91667 1.4312737 0.69463702 0.11845787
## X113 0 30.91667 30.91667 1.3208605 0.32482526 0.11715221
## X114 0 29.91667 29.91667 1.3429592 0.40116153 0.11238011
## X115 0 28.91667 28.91667 1.2995681 0.25191835 0.10599388
## X116 0 27.91667 27.91667 1.2022533 -0.02882036 0.10441672
## X117 0 26.91667 26.91667 1.2083329 -0.02308582 0.09960373
## X118 0 25.91667 25.91667 1.1339055 -0.22833131 0.09569336
## X119 0 24.91667 24.91667 1.0883221 -0.34404066 0.09134659
## X120 0 23.91667 23.91667 1.0248991 -0.51278237 0.07513061
## X121 0 22.91667 22.91667 0.9349391 -0.71401939 0.07313511
## X122 0 21.91667 21.91667 0.8792967 -0.83234798 0.06945300
## X123 0 20.91667 20.91667 0.8094059 -0.98237027 0.06664775
## X124 0 19.91667 19.91667 0.7051911 -1.14787759 0.06424396
## X125 0 18.91667 18.91667 0.6515578 -1.23854252 0.06099878
## X126 0 17.91667 17.91667 0.5390658 -1.39439193 0.05857572
Observations
The study encompasses 220,475 lymphoma/leukemia cancer cases. The bar chart indicates a uniform trend of survivability in years vs. diagnosis year. Based on visual observation, it appears lymphoma/leukemia cancer survivability in years is between 6.6 and 7.8 years. There is a peak in 1985. It appears to slowly trend downward after 1985.
## [1] 179462
## item group1 vars n mean sd median trimmed mad
## X11 1 1973 1 3484 9.485649 11.192703 4.750000 7.380022 6.548150
## X12 2 1974 1 4158 9.448854 10.777903 5.166667 7.508614 6.918800
## X13 3 1975 1 4421 9.538566 10.656362 5.250000 7.670366 7.042350
## X14 4 1976 1 4738 9.700946 10.790205 5.500000 7.813049 7.289450
## X15 5 1977 1 4706 9.365827 10.567959 5.083333 7.467229 6.795250
## X16 6 1978 1 4965 9.617170 10.375288 5.833333 7.882100 7.783650
## X17 7 1979 1 5039 9.792717 10.536489 5.583333 8.041801 7.536550
## X18 8 1980 1 5305 9.984700 10.588539 5.750000 8.279093 7.660100
## X19 9 1981 1 5549 9.753935 10.278389 5.916667 8.094367 7.907200
## X110 10 1982 1 5478 9.487100 10.068216 5.458333 7.884884 7.351225
## X111 11 1983 1 5775 9.570361 9.959284 5.833333 8.042920 7.783650
## X112 12 1984 1 5973 9.854819 9.955124 6.166667 8.457087 8.277850
## X113 13 1985 1 6052 9.726550 9.740826 6.250000 8.384345 8.277850
## X114 14 1986 1 6394 9.642477 9.481491 6.375000 8.402398 8.463175
## X115 15 1987 1 6650 9.579699 9.251800 6.500000 8.421507 8.524950
## X116 16 1988 1 6622 9.534557 9.131803 6.416667 8.475006 8.401400
## X117 17 1989 1 6837 9.524584 9.016432 6.500000 8.580637 8.401400
## X118 18 1990 1 6957 9.389883 8.683441 6.666667 8.530971 8.524950
## X119 19 1991 1 7121 9.402998 8.555231 6.750000 8.667939 8.772050
## X120 20 1992 1 9961 9.158409 8.293178 6.500000 8.485349 8.524950
## X121 21 1993 1 10077 9.213647 8.128626 6.750000 8.671741 8.772050
## X122 22 1994 1 10233 9.068194 7.844130 6.916667 8.614012 9.019150
## X123 23 1995 1 10339 8.791630 7.520182 6.750000 8.391081 8.648500
## X124 24 1996 1 10523 8.618708 7.294392 6.750000 8.299956 8.772050
## X125 25 1997 1 10849 8.385573 6.999199 6.583333 8.127414 8.648500
## X126 26 1998 1 11256 8.193445 6.690566 6.666667 8.010419 8.648500
## min max range skew kurtosis se
## X11 0 42.91667 42.91667 1.4114744 1.16410253 0.18962514
## X12 0 41.91667 41.91667 1.3699190 1.09667025 0.16714447
## X13 0 40.91667 40.91667 1.3084747 0.89954744 0.16026870
## X14 0 39.91667 39.91667 1.2817741 0.72533289 0.15675888
## X15 0 38.91667 38.91667 1.3222252 0.81685907 0.15405122
## X16 0 37.91667 37.91667 1.2063191 0.52421577 0.14724499
## X17 0 36.91667 36.91667 1.1531571 0.28604245 0.14843070
## X18 0 35.91667 35.91667 1.0964154 0.08104538 0.14537621
## X19 0 34.91667 34.91667 1.1039400 0.12918014 0.13798050
## X110 0 33.91667 33.91667 1.0804174 0.03152847 0.13603214
## X111 0 32.91667 32.91667 1.0310579 -0.11422963 0.13105456
## X112 0 31.91667 31.91667 0.9245231 -0.38248957 0.12881025
## X113 0 30.91667 30.91667 0.9159662 -0.41562621 0.12521211
## X114 0 29.91667 29.91667 0.8601086 -0.52972891 0.11857423
## X115 0 28.91667 28.91667 0.8235819 -0.60221091 0.11345286
## X116 0 27.91667 27.91667 0.7800082 -0.71844991 0.11221784
## X117 0 26.91667 26.91667 0.7258358 -0.86020982 0.10904404
## X118 0 25.91667 25.91667 0.6870955 -0.90029932 0.10410722
## X119 0 24.91667 24.91667 0.6094574 -1.05653808 0.10138209
## X120 0 23.91667 23.91667 0.5843650 -1.11085974 0.08309398
## X121 0 22.91667 22.91667 0.5049924 -1.24500526 0.08097510
## X122 0 21.91667 21.91667 0.4458373 -1.30711486 0.07754312
## X123 0 20.91667 20.91667 0.4231493 -1.34240545 0.07395867
## X124 0 19.91667 19.91667 0.3625611 -1.42848674 0.07110813
## X125 0 18.91667 18.91667 0.3197088 -1.48785807 0.06719755
## X126 0 17.91667 17.91667 0.2604833 -1.53335437 0.06306245
Observations
The study encompasses 179,462 urinary cancer cases. The bar chart indicates a uniform trend of survivability in years vs. diagnosis year. Based on visual observation, it appears urinary cancer survivability in years is between 8.1 and 9.9 years. There is a peak in 1984. It appears to slowly trend downward after 1984.
## [1] 381735
## item group1 vars n mean sd median trimmed mad
## X11 1 1973 1 6521 10.003565 14.137128 2.000000 7.242972 2.96520
## X12 2 1974 1 7787 10.161412 13.965538 2.250000 7.544950 3.33585
## X13 3 1975 1 8752 10.676883 14.090801 2.666667 8.287120 3.95360
## X14 4 1976 1 8975 10.925506 14.033776 2.750000 8.719329 4.07715
## X15 5 1977 1 9478 10.971680 13.915363 3.000000 8.896185 4.44780
## X16 6 1978 1 9782 10.895548 13.683211 2.833333 8.924738 4.20070
## X17 7 1979 1 10061 10.391653 13.197229 2.833333 8.420528 4.20070
## X18 8 1980 1 10563 10.429471 12.939319 3.000000 8.589625 4.32425
## X19 9 1981 1 10895 10.559660 12.933212 3.000000 8.872749 4.44780
## X110 10 1982 1 11104 10.420299 12.574630 3.166667 8.822649 4.57135
## X111 11 1983 1 11359 10.385685 12.303146 3.416667 8.903051 4.94200
## X112 12 1984 1 11861 10.311595 12.007243 3.500000 8.933212 5.06555
## X113 13 1985 1 12685 10.668020 11.981074 4.000000 9.499729 5.80685
## X114 14 1986 1 13237 10.561948 11.674875 4.166667 9.489472 6.05395
## X115 15 1987 1 13802 10.271259 11.280629 4.083333 9.250181 5.93040
## X116 16 1988 1 13904 9.794364 10.857863 3.666667 8.780003 5.31265
## X117 17 1989 1 14506 9.831110 10.611264 4.000000 8.949344 5.80685
## X118 18 1990 1 15026 9.990955 10.393865 4.583333 9.272466 6.67170
## X119 19 1991 1 15453 9.851156 10.005996 4.833333 9.221481 7.04235
## X120 20 1992 1 22657 9.477395 9.635287 4.666667 8.878321 6.79525
## X121 21 1993 1 22436 9.300752 9.278751 5.000000 8.782526 7.28945
## X122 22 1994 1 23134 9.298007 8.934112 5.666667 8.902168 8.27785
## X123 23 1995 1 23738 9.568477 8.597834 7.250000 9.363715 10.50175
## X124 24 1996 1 24408 9.639083 8.212022 8.166667 9.572742 11.73725
## X125 25 1997 1 24587 9.364044 7.803965 8.333333 9.353630 11.98435
## X126 26 1998 1 25024 9.167283 7.387301 9.083333 9.231027 11.98435
## min max range skew kurtosis se
## X11 0 42.91667 42.91667 1.34203141 0.3098656 0.17506692
## X12 0 41.91667 41.91667 1.29196651 0.1723404 0.15826051
## X13 0 40.91667 40.91667 1.18103776 -0.1343432 0.15061979
## X14 0 39.91667 39.91667 1.08890701 -0.3730869 0.14813487
## X15 0 38.91667 38.91667 1.05422550 -0.4788967 0.14293421
## X16 0 37.91667 37.91667 1.01474329 -0.5732171 0.13834842
## X17 0 36.91667 36.91667 1.06335367 -0.4627017 0.13157161
## X18 0 35.91667 35.91667 1.00325767 -0.5851087 0.12589770
## X19 0 34.91667 34.91667 0.94090024 -0.7550320 0.12390613
## X110 0 33.91667 33.91667 0.91855939 -0.7933949 0.11933160
## X111 0 32.91667 32.91667 0.87311020 -0.8814200 0.11543729
## X112 0 31.91667 31.91667 0.83336959 -0.9572847 0.11025103
## X113 0 30.91667 30.91667 0.72491851 -1.1670127 0.10637768
## X114 0 29.91667 29.91667 0.68541426 -1.2285139 0.10147462
## X115 0 28.91667 28.91667 0.67930045 -1.2385665 0.09602013
## X116 0 27.91667 27.91667 0.70085652 -1.2105228 0.09208194
## X117 0 26.91667 26.91667 0.63123998 -1.3174686 0.08810353
## X118 0 25.91667 25.91667 0.53586505 -1.4470361 0.08479209
## X119 0 24.91667 24.91667 0.49450170 -1.4906459 0.08049223
## X120 0 23.91667 23.91667 0.48744353 -1.5049862 0.06401230
## X121 0 22.91667 22.91667 0.43640714 -1.5558716 0.06194650
## X122 0 21.91667 21.91667 0.35889930 -1.6244435 0.05873893
## X123 0 20.91667 20.91667 0.20857348 -1.7114106 0.05580421
## X124 0 19.91667 19.91667 0.10384727 -1.7506335 0.05256347
## X125 0 18.91667 18.91667 0.05453841 -1.7610722 0.04976942
## X126 0 17.91667 17.91667 -0.02131686 -1.7682495 0.04669898
Observations
The study encompasses 381,735 other cancer cases. The bar chart indicates a uniform trend of survivability in years vs. diagnosis year. Based on visual observation, it appears other cancer survivability in years is between 9.2 and 10.9 years. There is a peak in 1985. It appears to slowly trend downward after 1985.
Based on the initial summary analysis, there is definitely room for more investigation of the Cancer Data.
There is an upward trend in survivability in years for digestive and male genital cancer diagnoses.
There is a downward trend in survivability in years for female genital cancer diagnoses.
There is a slight downward trend in survivability in years for respiratory cancer diagnoses.
All other cancer diagnoses show a uniform trend over the 25-year period; however, there are slight downward trends from a peak year to 1998.
These summary statistics alone indicate that more investigation is needed into why certain trends occur. For example, as a scientist, I would be concerned about the downward trends, given that cancer research and mitigation techniques have supposedly improved. Survivability in years is going up for digestive and male genital cancer diagnoses, so why are they improving? Are the mitigation processes different and therefore more successful? Are there better medical personnel to handle these particular cases? For a further research project, these would be useful sub-projects to consider.
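The per-year tables above can be approximated with a few lines of base R. The column layout resembles what `describeBy` from the `psych` package prints, though the report does not show the call, so the following is only a minimal sketch. The data frame is synthetic; only the column names `yearDiagnosis` and `survivalYears` are taken from the `lm()` calls in this report.

```r
# Synthetic stand-in for one cancer type. The values are invented for
# illustration and are NOT SEER data.
set.seed(1)
cancerType <- data.frame(yearDiagnosis = rep(1973:1998, each = 50))
cancerType$survivalYears <- pmax(0, rnorm(nrow(cancerType), mean = 8, sd = 3))

# One summary row per diagnosis year: count, mean, median, standard error.
byYear <- do.call(rbind, lapply(
  split(cancerType, cancerType$yearDiagnosis),
  function(d) {
    data.frame(
      yearDiagnosis = d$yearDiagnosis[1],
      n             = nrow(d),
      mean          = mean(d$survivalYears),
      median        = median(d$survivalYears),
      se            = sd(d$survivalYears) / sqrt(nrow(d))
    )
  }
))
print(head(byYear))
```

Comparing the `mean` and `median` columns year by year is a quick way to see how strongly a skewed survival distribution pulls the mean, which matters when reading the bar charts.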
Despite what the Exploratory Data Analysis exposed on the Cancer types in terms of survivability in years vs. diagnosis year, inference tests will need to be conducted to verify that the data being investigated is actually sound.
Linear Regression will be the chosen algorithm for inference test calculations. The linear regression tests show the data trend for each cancer type and whether diagnosis year clearly affects survivability in years.
Prior to executing the inference tests, a check will be conducted to verify that the data meets the conditions for inference. After the linear regression test for each cancer type, a Hypothesis Test will be stated, indicating whether there is enough evidence that diagnosis year affects survivability in years. Additionally, the p-value of the yearDiagnosis coefficient, which represents the Diagnosis Year, will be examined to determine its validity for predicting the Survivability in Years. The p-values for the inference test and for the yearDiagnosis coefficient must both be less than 0.05 in order to be considered valid evidence.
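The 0.05 rule described above can be checked programmatically. The sketch below is an illustration on synthetic data, not the report's actual code: it fits the same formula used in this report, then pulls the slope p-value and the overall F-test p-value out of the model summary. The names `slopeP`, `modelP`, and `passes` are hypothetical helpers introduced here.

```r
# Synthetic data with a known upward trend. The values are invented.
set.seed(42)
yearDiagnosis <- rep(1973:1998, each = 40)
survivalYears <- 5 + 0.2 * (yearDiagnosis - 1973) +
  rnorm(length(yearDiagnosis), sd = 4)
cancerType <- data.frame(yearDiagnosis, survivalYears)

fit <- lm(survivalYears ~ yearDiagnosis, data = cancerType)
s   <- summary(fit)

# Slope estimate and its two-sided p-value from the coefficient table.
slope  <- coef(s)["yearDiagnosis", "Estimate"]
slopeP <- coef(s)["yearDiagnosis", "Pr(>|t|)"]

# Overall model p-value, recomputed from the stored F statistic.
f      <- s$fstatistic
modelP <- unname(pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE))

alpha  <- 0.05
passes <- (slopeP < alpha) && (modelP < alpha)
print(c(slope = slope, slopeP = slopeP, modelP = modelP))
```

With a single predictor the F statistic equals the square of the slope's t value, so the two p-values carry the same information and the two checks always agree.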
Linearity. The data does not show a clear residual trend. However, using the residuals in the Histogram graph, we see that the residuals are between -10 and 30. This is a good indication of linearity in the data by the sloping downward trend from -10 to 30. However, there is concern that the residuals do not follow a Bell Curve distribution. CONDITION ACCEPTED WITH RESERVATIONS.
Nearly normal residuals. The residuals shown in the Histogram have a downward trend in distribution. However, there is concern that the residuals do not follow a Bell Curve distribution. CONDITION ACCEPTED WITH RESERVATIONS.
Constant variability. The variability of points around the least squares line remains roughly constant as evidenced by the QQ-Plot. There is skew in both the top-right and bottom-left parts of the chart, but there is clearly a large black line of data converging on the center of the least squares line. CONDITION ACCEPTED.
Independent observations. The data was not independently collected. It was collected for identified cancer patients in the SEER database diagnosed from 1973 to 1998. However, 2,788,863 observations were collected, which increases the likelihood of data independence. The data used is greater than 10% of all SEER data collected. CONDITION ACCEPTED.
The data has satisfied all conditions for linear regression. However, we remain concerned that the residuals are not normally distributed.
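The condition checks above rely on diagnostic plots of the residuals. A minimal sketch of how such plots are commonly produced in base R follows, using invented right-skewed data in place of the SEER records. A residuals-versus-fitted scatter plot is included because it is a common way to judge linearity and constant variability, alongside the histogram and QQ-plot used for normality.

```r
# Synthetic, right-skewed survival times. The values are invented.
set.seed(7)
yearDiagnosis <- rep(1973:1998, each = 40)
survivalYears <- rexp(length(yearDiagnosis), rate = 1 / 8)
cancerType    <- data.frame(yearDiagnosis, survivalYears)

fit <- lm(survivalYears ~ yearDiagnosis, data = cancerType)
res <- resid(fit)

# Discard the plots when run non-interactively; remove this line (and the
# dev.off() call below) to view them on screen.
pdf(NULL)
hist(res, breaks = 30, main = "Residuals", xlab = "Residual")  # normality
qqnorm(res); qqline(res)                                       # normality
plot(fitted(fit), res, xlab = "Fitted", ylab = "Residual")     # linearity, spread
abline(h = 0, lty = 2)
invisible(dev.off())

# A numeric companion to the histogram: sample skewness of the residuals.
resSkew <- mean((res - mean(res))^3) / sd(res)^3
print(resSkew)
```

A clearly positive `resSkew` matches the kind of right-skewed histogram described above and is one way to put a number on the "accepted with reservations" judgment.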
##
## Call:
## lm(formula = survivalYears ~ yearDiagnosis, data = cancerType)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.4535 -8.8698 -0.4518 7.0473 28.4631
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 124.349144 3.959068 31.41 <2e-16 ***
## yearDiagnosis -0.055700 0.001991 -27.98 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.558 on 437427 degrees of freedom
## Multiple R-squared: 0.001786, Adjusted R-squared: 0.001784
## F-statistic: 782.7 on 1 and 437427 DF, p-value: < 2.2e-16
\({ H }_{ 0 }\): There is insufficient evidence that Breast Cancer Survival Years are improving as Diagnosis Year increases.
\({ H }_{ A }\): There is sufficient evidence that Breast Cancer Survival Years are improving as Diagnosis Year increases.
The inference test for Breast Cancer Survivability shows that the p-value for the linear regression is less than 2.2 * \(10^{-16}\), which is less than 0.05. After this initial linear regression test, there is enough evidence that diagnosis year is associated with survivability in years. The yearDiagnosis coefficient, at negative 0.056, has a p-value of less than 2 * \(10^{-16}\), which is less than 0.05. The data is valid but, because the slope is negative, we fail to reject the Null Hypothesis.
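The hypotheses above are directional ("improving"), while the p-value printed by `summary()` is two-sided. As a hedged illustration of why a significant negative slope still leads to failing to reject the Null Hypothesis, the sketch below recomputes both p-values from the t value and degrees of freedom reported in the Breast Cancer regression summary. The variable names are hypothetical.

```r
# Values copied from the Breast Cancer lm() summary in this report.
tValue <- -27.98    # t value of the yearDiagnosis coefficient
df     <- 437427    # residual degrees of freedom

# Two-sided p-value: is the slope different from zero in either direction?
pTwoSided <- 2 * pt(abs(tValue), df, lower.tail = FALSE)

# One-sided p-value for H_A "slope > 0", i.e. survival years improving.
pImproving <- pt(tValue, df, lower.tail = FALSE)

print(c(pTwoSided = pTwoSided, pImproving = pImproving))
```

The two-sided value is far below 0.05, but the one-sided value for an improving trend is essentially 1, so the evidence points to a decline rather than an improvement.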
##
## Call:
## lm(formula = survivalYears ~ yearDiagnosis, data = cancerType)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.609 -2.194 -1.840 -0.869 40.902
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -51.504551 3.036779 -16.96 <2e-16 ***
## yearDiagnosis 0.027084 0.001528 17.73 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.227 on 206978 degrees of freedom
## Multiple R-squared: 0.001516, Adjusted R-squared: 0.001511
## F-statistic: 314.2 on 1 and 206978 DF, p-value: < 2.2e-16
\({ H }_{ 0 }\): There is insufficient evidence that Digestive Cancer Survival Years are improving as Diagnosis Year increases.
\({ H }_{ A }\): There is sufficient evidence that Digestive Cancer Survival Years are improving as Diagnosis Year increases.
The inference test for Digestive Cancer Survivability shows that the p-value for the linear regression is less than 2.2 * \(10^{-16}\), which is less than 0.05. After this initial linear regression test, there is enough evidence that diagnosis year is associated with survivability in years. The yearDiagnosis coefficient, at positive 0.027, has a p-value of less than 2 * \(10^{-16}\), which is less than 0.05. The data is valid and, because the slope is positive, we can reject the Null Hypothesis.
##
## Call:
## lm(formula = survivalYears ~ yearDiagnosis, data = cancerType)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.950 -6.540 -1.586 5.741 35.750
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.704e+02 3.834e+00 -96.59 <2e-16 ***
## yearDiagnosis 1.913e-01 1.927e-03 99.28 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.847 on 360291 degrees of freedom
## Multiple R-squared: 0.02663, Adjusted R-squared: 0.02663
## F-statistic: 9858 on 1 and 360291 DF, p-value: < 2.2e-16
\({ H }_{ 0 }\): There is insufficient evidence that Male Genital Cancer Survival Years are improving as Diagnosis Year increases.
\({ H }_{ A }\): There is sufficient evidence that Male Genital Cancer Survival Years are improving as Diagnosis Year increases.
The inference test for Male Genital Cancer Survivability shows that the p-value for the linear regression is less than 2.2 * \(10^{-16}\), which is less than 0.05. After this initial linear regression test, there is enough evidence that diagnosis year is associated with survivability in years. The yearDiagnosis coefficient, at positive 0.191, has a p-value of less than 2 * \(10^{-16}\), which is less than 0.05. The data is valid and, because the slope is positive, we can reject the Null Hypothesis.
##
## Call:
## lm(formula = survivalYears ~ yearDiagnosis, data = cancerType)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.373 -10.371 -1.378 9.155 26.543
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 489.766146 6.294693 77.81 <2e-16 ***
## yearDiagnosis -0.239936 0.003168 -75.74 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11.01 on 203554 degrees of freedom
## Multiple R-squared: 0.02741, Adjusted R-squared: 0.02741
## F-statistic: 5737 on 1 and 203554 DF, p-value: < 2.2e-16
\({ H }_{ 0 }\): There is insufficient evidence that Female Genital Cancer Survival Years are improving as Diagnosis Year increases.
\({ H }_{ A }\): There is sufficient evidence that Female Genital Cancer Survival Years are improving as Diagnosis Year increases.
The inference test for Female Genital Cancer Survivability shows that the p-value for the linear regression is less than 2.2 * \(10^{-16}\), which is below 0.05, so the relationship between diagnosis year and survival years is statistically significant. The yearDiagnosis coefficient is negative at -0.24, with a p-value below 2 * \(10^{-16}\). Because the slope is negative, survival years are not improving, and we fail to reject the Null Hypothesis.
##
## Call:
## lm(formula = survivalYears ~ yearDiagnosis, data = cancerType)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.084 -2.597 -2.114 -0.662 39.833
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 39.30338 2.34612 16.75 <2e-16 ***
## yearDiagnosis -0.01836 0.00118 -15.55 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.449 on 402014 degrees of freedom
## Multiple R-squared: 0.0006014, Adjusted R-squared: 0.0005989
## F-statistic: 241.9 on 1 and 402014 DF, p-value: < 2.2e-16
\({ H }_{ 0 }\): There is insufficient evidence that Respiratory Cancer Survival Years are improving as Diagnosis Year increases.
\({ H }_{ A }\): There is sufficient evidence that Respiratory Cancer Survival Years are improving as Diagnosis Year increases.
The inference test for Respiratory Cancer Survivability shows that the p-value for the linear regression is less than 2.2 * \(10^{-16}\), which is below 0.05, so the relationship between diagnosis year and survival years is statistically significant. The yearDiagnosis coefficient is negative at -0.018, with a p-value below 2 * \(10^{-16}\). Because the slope is negative, survival years are not improving, and we fail to reject the Null Hypothesis.
##
## Call:
## lm(formula = survivalYears ~ yearDiagnosis, data = cancerType)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.972 -6.891 -3.820 5.368 34.945
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 16.343298 3.930227 4.158 3.21e-05 ***
## yearDiagnosis -0.004243 0.001978 -2.146 0.0319 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.69 on 354429 degrees of freedom
## Multiple R-squared: 1.299e-05, Adjusted R-squared: 1.017e-05
## F-statistic: 4.603 on 1 and 354429 DF, p-value: 0.03191
\({ H }_{ 0 }\): There is insufficient evidence that Colon/Rectal Cancer Survival Years are improving as Diagnosis Year increases.
\({ H }_{ A }\): There is sufficient evidence that Colon/Rectal Cancer Survival Years are improving as Diagnosis Year increases.
The inference test for Colon/Rectal Cancer Survivability shows that the p-value for the linear regression is 0.03, which is below 0.05, so the relationship between diagnosis year and survival years is statistically significant. The yearDiagnosis coefficient is negative at -0.004, with a p-value of 0.03. Because the slope is negative, survival years are not improving, and we fail to reject the Null Hypothesis.
##
## Call:
## lm(formula = survivalYears ~ yearDiagnosis, data = cancerType)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.464 -6.711 -4.343 4.074 35.452
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 35.921882 5.258535 6.831 8.44e-12 ***
## yearDiagnosis -0.014423 0.002645 -5.453 4.95e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.093 on 220473 degrees of freedom
## Multiple R-squared: 0.0001349, Adjusted R-squared: 0.0001303
## F-statistic: 29.74 on 1 and 220473 DF, p-value: 4.949e-08
\({ H }_{ 0 }\): There is insufficient evidence that Lymphoma/Leukemia Cancer Survival Years are improving as Diagnosis Year increases.
\({ H }_{ A }\): There is sufficient evidence that Lymphoma/Leukemia Cancer Survival Years are improving as Diagnosis Year increases.
The inference test for Lymphoma/Leukemia Cancer Survivability shows that the p-value for the linear regression is 4.9 * \(10^{-8}\), which is below 0.05, so the relationship between diagnosis year and survival years is statistically significant. The yearDiagnosis coefficient is negative at -0.014, with a p-value of 4.9 * \(10^{-8}\). Because the slope is negative, survival years are not improving, and we fail to reject the Null Hypothesis.
##
## Call:
## lm(formula = survivalYears ~ yearDiagnosis, data = cancerType)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.044 -7.799 -2.958 6.506 32.872
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 114.22697 5.74556 19.88 <2e-16 ***
## yearDiagnosis -0.05280 0.00289 -18.27 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.001 on 179460 degrees of freedom
## Multiple R-squared: 0.001857, Adjusted R-squared: 0.001851
## F-statistic: 333.8 on 1 and 179460 DF, p-value: < 2.2e-16
\({ H }_{ 0 }\): There is insufficient evidence that Urinary Cancer Survival Years are improving as Diagnosis Year increases.
\({ H }_{ A }\): There is sufficient evidence that Urinary Cancer Survival Years are improving as Diagnosis Year increases.
The inference test for Urinary Cancer Survivability shows that the p-value for the linear regression is less than 2.2 * \(10^{-16}\), which is below 0.05, so the relationship between diagnosis year and survival years is statistically significant. The yearDiagnosis coefficient is negative at -0.05, with a p-value below 2 * \(10^{-16}\). Because the slope is negative, survival years are not improving, and we fail to reject the Null Hypothesis.
##
## Call:
## lm(formula = survivalYears ~ yearDiagnosis, data = cancerType)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.939 -9.244 -5.207 9.138 31.977
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 140.585164 4.788060 29.36 <2e-16 ***
## yearDiagnosis -0.065710 0.002408 -27.29 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.8 on 381733 degrees of freedom
## Multiple R-squared: 0.001947, Adjusted R-squared: 0.001944
## F-statistic: 744.6 on 1 and 381733 DF, p-value: < 2.2e-16
\({ H }_{ 0 }\): There is insufficient evidence that Other Cancer Survival Years are improving as Diagnosis Year increases.
\({ H }_{ A }\): There is sufficient evidence that Other Cancer Survival Years are improving as Diagnosis Year increases.
The inference test for Other Cancer Survivability shows that the p-value for the linear regression is less than 2.2 * \(10^{-16}\), which is below 0.05, so the relationship between diagnosis year and survival years is statistically significant. The yearDiagnosis coefficient is negative at -0.07, with a p-value below 2 * \(10^{-16}\). Because the slope is negative, survival years are not improving, and we fail to reject the Null Hypothesis.
Recall the Initial Summary Analysis for the Cancer Data.
There is an upward trend in survivability in years for digestive and male genital cancer diagnoses.
There is a downward trend in survivability in years for female genital cancer diagnoses.
There is a slight downward trend in survivability in years for respiratory cancer diagnoses.
All other cancer diagnoses show a roughly uniform trend over the 25-year period; however, there are slight downward trends after a peak year around 1998.
Based on the inference analysis, the following can be said about the data:
Through inference testing, there is an upward trend in survivability in years for digestive and male genital cancer diagnoses.
Through inference testing, there is a slight downward trend in survivability in years for all other cancer diagnoses.
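The slopes and p-values quoted in the inference tests can be read directly from a fitted model object. The following is a minimal sketch on synthetic data, not the SEER data; the data frame `toy` and the helper `slope_test` are hypothetical names introduced only for illustration. Because the hypotheses ask whether survival is improving, the sketch also derives a one-sided p-value for a positive slope from the t statistic, alongside the two-sided value that `summary()` prints.

```r
# Hypothetical helper: fit survivalYears ~ yearDiagnosis and report the slope,
# the two-sided p-value from summary(), and a one-sided p-value for slope > 0.
slope_test <- function(df) {
  fit <- lm(survivalYears ~ yearDiagnosis, data = df)
  coefs <- summary(fit)$coefficients
  t_val <- coefs["yearDiagnosis", "t value"]
  list(
    slope       = coefs["yearDiagnosis", "Estimate"],
    p_two_sided = coefs["yearDiagnosis", "Pr(>|t|)"],
    # Upper-tail probability: small only when the slope is reliably positive.
    p_improving = pt(t_val, df = fit$df.residual, lower.tail = FALSE),
    r_squared   = summary(fit)$r.squared
  )
}

# Synthetic data (an assumption for illustration): survival falls slightly per year.
set.seed(607)
toy <- data.frame(yearDiagnosis = rep(1988:2012, each = 40))
toy$survivalYears <- 60 - 0.025 * toy$yearDiagnosis + rnorm(nrow(toy), sd = 0.5)

res <- slope_test(toy)
res$slope        # negative slope
res$p_two_sided  # small: the trend is statistically significant
res$p_improving  # close to 1: no evidence that survival is improving
```

In this toy case the two-sided p-value is small while the one-sided p-value is near 1, which is the same pattern as a significant negative slope leading to a failure to reject the null hypothesis of no improvement.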
We chose the period 1995-2012 for the analysis since the CI5 data set contains data for most of the reported countries within this time frame. The majority of the cancer incidence data captured in the CI5 data set appears to be in the North America region. Africa has very little reported data.
WorldData <- map_data('world')
WorldData %>% filter(region != "Antarctica") -> WorldData
WorldData <- fortify(WorldData)
CI5SummaryDF <- CI5MasterDF %>% filter (Year>=1995) %>% group_by(Country) %>% summarise(CancerCases = sum(CancerCases)/1000000)
ggplot() +
geom_map(data=WorldData, map=WorldData,
aes(x=long, y=lat, group=group, map_id=region),
fill="white", colour="#7f7f7f", size=0.5) +
geom_map(data=CI5SummaryDF, map=WorldData,
aes(fill=CancerCases, map_id=Country),
colour="#7f7f7f", size=0.5) +
coord_map("rectangular", lat0=0, xlim=c(-180,180), ylim=c(-60, 90)) +
scale_fill_continuous(low="thistle2", high="darkred", guide="colorbar") +
scale_y_continuous(breaks=c()) +
scale_x_continuous(breaks=c()) +
labs(fill="Cancer Cases (in Millions)", title="Global View of the Cancer Cases (1995-2012)", x="", y="") +
# theme_bw() replaces the whole theme, so it must come before the tweaks below
theme_bw() +
theme(plot.title = element_text(hjust = 0.5), panel.border = element_blank())
TimeSummary_DF <- CI5MasterDF %>% filter(Year >= 1995) %>% group_by(Year, Continent) %>% summarise(CancerCases=sum(CancerCases)/1000000)
p <- ggplot(data=TimeSummary_DF, aes(x=Year,y=CancerCases, color=Continent)) +
geom_line(aes(group = Continent))+
geom_point()+
theme(axis.text.x=element_text(angle = 60, vjust = 0.5)) +
scale_color_brewer(palette="Paired") +
ggtitle("Trend of Cancer Incidence By Continent") +
theme(plot.title = element_text(hjust = 0.5), legend.position = "bottom") +
labs(y = "Cancer Cases (in Millions)")
ggplotly(p)
The trend chart above shows a growing trend in cancer cases for the Americas, Asia, and Europe, whereas Oceania and Africa show a flat trend. However, this could be due to limited reported data from those regions.
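The per-continent totals behind the trend chart, and the caveat about limited reporting, can be illustrated with a small base R sketch. The data frame `toy` and its values are assumptions made up for this example (the project itself uses `CI5MasterDF` with dplyr); the sketch sums cases by year and continent in millions and counts how many distinct countries report per continent, which is one simple way to see how thin the coverage is.

```r
# Toy incidence records (assumed values; the real table is CI5MasterDF).
toy <- data.frame(
  Continent   = c("Europe", "Europe", "Europe", "Africa", "Africa"),
  Country     = c("UK", "UK", "Bulgaria", "Uganda", "Uganda"),
  Year        = c(1995, 1996, 1995, 1995, 1996),
  CancerCases = c(2500000, 2600000, 300000, 10000, 12000)
)

# Total cases per year and continent, expressed in millions.
totals <- aggregate(CancerCases ~ Year + Continent, data = toy, FUN = sum)
totals$CancerCases <- totals$CancerCases / 1000000

# Number of distinct reporting countries per continent.
coverage <- tapply(toy$Country, toy$Continent, function(x) length(unique(x)))

totals
coverage
```

A continent with few reporting countries will show a low, flat line regardless of its true incidence, so totals like these are best read next to the coverage counts.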
We picked the UK as a sample region to understand the age distribution of the population at risk, broken down by gender, through a population pyramid.
UKPopSummary_DF <- CI5PopMasterDF %>% filter(Country =="UK") %>% group_by(Sex,AgeGroup,BeginAge) %>%
summarise(Population=sum(Population))
UKPopSummary_DF$Population <- ifelse(UKPopSummary_DF$Sex == "M", -1*UKPopSummary_DF$Population, UKPopSummary_DF$Population)
p0 <- ggplot(UKPopSummary_DF, aes(x = reorder(AgeGroup,BeginAge), y = Population, fill = Sex)) +
geom_bar(data = subset(UKPopSummary_DF, Sex == "F"), stat = "identity") +
geom_bar(data = subset(UKPopSummary_DF, Sex == "M"), stat = "identity") +
scale_y_continuous(breaks = seq(-50000000, 50000000, 10000000),
labels = paste0(as.character(c(seq(50, 0, -10), seq(10, 50, 10))), "m")) +
coord_flip() +
scale_fill_brewer(palette = "Set1") +
ggtitle("Population Pyramid - UK") +
labs(y = "Population (in Millions)",x= "Age Group") +
theme_bw()
ggplotly(p0)
We also used the UK as a sample region to understand the age distribution of cancer patients, broken down by gender, through an incidence pyramid.
UKSummary_DF <- CI5MasterDF %>% filter(Country =="UK") %>% group_by(Sex,AgeGroup,BeginAge) %>%
summarise(CancerCases=sum(CancerCases))
UKSummary_DF$CancerCases <- ifelse(UKSummary_DF$Sex == "M", -1*UKSummary_DF$CancerCases, UKSummary_DF$CancerCases)
p1 <- ggplot(UKSummary_DF, aes(x = reorder(AgeGroup,BeginAge), y = CancerCases, fill = Sex)) +
geom_bar(data = subset(UKSummary_DF, Sex == "F"), stat = "identity") +
geom_bar(data = subset(UKSummary_DF, Sex == "M"), stat = "identity") +
scale_y_continuous(breaks = seq(-15000000, 15000000, 3000000),
labels = paste0(as.character(c(seq(15, 0, -3), seq(3, 15, 3))), "m")) +
coord_flip() +
scale_fill_brewer(palette = "Set1") +
ggtitle("Incidence Pyramid for Cancer Patients - UK") +
labs(y = "Cancer Cases (in Millions)",x= "Age Group") +
theme_bw()
ggplotly(p1)
For males, the 70-74 age group has the highest number of reported cancer cases, whereas for females the 75-79 age group has the highest number. Based on the chart above, we excluded age groups below 35 from further analysis.
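Both pyramids rely on the same data preparation: male counts are negated so their bars extend to the left of zero, the axis labels show magnitudes only, and age groups are ordered by their starting age. The sketch below isolates those steps in base R on a made-up data frame `toy` (its values are assumptions, not CI5 data), without drawing the plot.

```r
# Toy counts (assumed values, not CI5 data).
toy <- data.frame(
  Sex      = c("M", "M", "F", "F"),
  AgeGroup = c("35-39", "40-44", "35-39", "40-44"),
  BeginAge = c(35, 40, 35, 40),
  Cases    = c(120, 200, 150, 180)
)

# Mirror the male counts so their bars extend to the left of zero.
toy$Signed <- ifelse(toy$Sex == "M", -toy$Cases, toy$Cases)

# Axis breaks are symmetric around zero; labels show magnitudes only.
breaks <- seq(-200, 200, by = 100)
labels <- as.character(abs(breaks))

# Order age groups by their starting age rather than alphabetically.
age_levels <- unique(toy$AgeGroup[order(toy$BeginAge)])
toy$AgeGroup <- factor(toy$AgeGroup, levels = age_levels)

toy
labels
```

Taking `abs(breaks)` for the labels keeps the breaks and labels the same length by construction, which avoids having to build the two halves of the label vector by hand.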
Here we did a trend analysis of cancer patients in the UK between 1995 and 2012 for age groups 35 and older.
UKTimeSummary_DF <- CI5MasterDF %>% filter(Country =="UK", Year>=1995, BeginAge>=35) %>% group_by(AgeGroup,BeginAge,Sex,Year) %>% summarise(CancerCases=sum(CancerCases)/1000000)
ggplot(UKTimeSummary_DF, aes(x = Year, y = CancerCases, color=AgeGroup)) +
geom_line(aes(group = reorder(AgeGroup,BeginAge)))+
geom_point()+
theme(axis.text.x=element_text(angle = 60, vjust = 0.5)) +
scale_color_brewer(palette="Paired") +
ggtitle("Trend of Cancer Incidence By Age Group - UK") +
theme(plot.title = element_text(hjust = 0.5), legend.position = "bottom") +
facet_wrap(~Sex) +
labs(y = "Cancer Cases (in Millions)")
Based on the trend chart above, for both males and females there is a steady increase in cancer patients in the 65-69 age group in the UK.
Then we looked at the top 10 most frequently occurring cancer diagnosis types in the UK across all age groups.
UKDiagSummary_DF <- CI5MasterDF %>% filter(Country =="UK", Year>=1995, Sex == "F") %>% group_by(DiagUnitLvl0Desc) %>% summarise(CancerCases=sum(CancerCases)/1000000) %>% mutate(rank = rank(-CancerCases)) %>% filter(rank <= 10) %>% arrange(rank)
p2 <- ggplot(UKDiagSummary_DF, aes(x = reorder(DiagUnitLvl0Desc,-CancerCases), y = CancerCases)) +
geom_bar(stat = "identity", position = "dodge", fill = "orange") +
geom_text(aes(label=CancerCases), vjust=-0.8, color="black", position = position_dodge(0.9), size=3.5) +
theme(axis.text.x=element_text(angle = 60, vjust = 0.5)) +
scale_fill_brewer(palette="Paired") +
ggtitle("Top 10 Cancer Diagnosis Types - UK") +
theme(plot.title = element_text(hjust = 0.5), legend.position = "bottom") +
labs(x = "Cancer Diagnosis Type",y = "Cancer Cases (in Millions)")
ggplotly(p2)
UKDiagSummary_DF <- CI5MasterDF %>% filter(Country =="UK", Year>=1995, Sex == "M") %>% group_by(DiagUnitLvl0Desc) %>% summarise(CancerCases=sum(CancerCases)/1000000) %>% mutate(rank = rank(-CancerCases)) %>% filter(rank <= 10) %>% arrange(rank)
p3 <- ggplot(UKDiagSummary_DF, aes(x = reorder(DiagUnitLvl0Desc,-CancerCases), y = CancerCases)) +
geom_bar(stat = "identity", position = "dodge", fill = "blue") +
geom_text(aes(label=CancerCases), vjust=-0.8, color="black", position = position_dodge(0.9), size=3.5) +
theme(axis.text.x=element_text(angle = 60, vjust = 0.5)) +
scale_fill_brewer(palette="Paired") +
ggtitle("Top 10 Cancer Diagnosis Types - UK") +
theme(plot.title = element_text(hjust = 0.5), legend.position = "bottom") +
labs(x = "Cancer Diagnosis Type",y = "Cancer Cases (in Millions)")
ggplotly(p3)
For females, Lip, Breast, and Lung cancers have the most reported cases, whereas for males, Lip, Lung, and Prostate are the top 3 diagnosis types across all age groups. From both bar charts above, it is evident that Lip/Oral cancer is the most dominant type in the UK. Possible reasons for developing oral/mouth cancer include smoking, using other forms of tobacco, and drinking alcohol.
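The top-10 selection above filters on `rank(-CancerCases)`. As a caveat, `rank()` averages tied values by default, so a filter on rank can return fewer rows than expected when totals tie. The sketch below shows this on a made-up named vector `cases` (assumed values, not CI5 data) and contrasts it with sorting and taking the first N entries.

```r
# Toy totals per diagnosis type (assumed values for illustration).
cases <- c(Breast = 50, Lung = 40, Colon = 40, Skin = 10)

# rank() defaults to ties.method = "average": the two 40s each get rank 2.5.
avg_rank <- rank(-cases)

# ties.method = "min" gives tied values the same integer rank.
min_rank <- rank(-cases, ties.method = "min")

# Filtering on rank <= 2 keeps 1 row with average ranks and 3 with min ranks.
n_avg <- sum(avg_rank <= 2)
n_min <- sum(min_rank <= 2)

# Sorting and taking the head always returns exactly N entries.
top2 <- head(sort(cases, decreasing = TRUE), 2)
```

With real incidence totals exact ties are unlikely, so the original filter normally returns ten rows; the sort-and-head form simply makes that guarantee explicit.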
For this section, we compared cancer rates between two countries: Bulgaria and the Netherlands. We used the sunburstR package to look at the percentage of cancer diagnoses by type for both the United States and Bulgaria. The first sunburst for each is for general organs (Lung, Bone, etc.) or specific cancers without subtypes. The second sunburst breaks down the cancers with subtypes into those subtypes. Sunburst charting is a technology outside of the scope of the course and makes the rPubs experience more interactive.
EuropeJoined<-subset(joinedtab,joinedtab$Continent=="Europe")
Europe<-EuropeJoined[,c(1:24)]
Europe2011<-subset(Europe,Europe$Year==2011)
Bulgaria<-subset(Europe2011,Europe2011$CancerRegistryID==10000000)
Bulgaria2011<-subset(Bulgaria,Bulgaria$Year==2011)
CensusData<-read.csv("https://raw.githubusercontent.com/RommyGraphs/MSDAGroups/master/DATA607/CombinedData.csv")
BulgPopDemo<-CensusData$Bulgaria
Bulgaria2011ABSM<-Bulgaria2011[1,]
Bulgaria2011ABSF<-Bulgaria2011[171,]
Bulgaria2011ABSDemo<-as.data.frame(Bulgaria2011ABSM[,c(6:24)]+Bulgaria2011ABSF[,c(6:24)])
BulgariaGeneralRate<-Bulgaria2011ABSDemo/BulgPopDemo
Neth<-subset(Europe2011,Europe2011$CancerRegistryID==52800000)
Neth2011<-subset(Neth,Neth$Year==2011)
The following code computes the difference between the Netherlands’ and Bulgaria’s cancer rates:
NethFullPop<-CensusData$Netherlands
Neth2011ABSM<-Neth2011[1,]
Neth2011ABSM<-Neth2011ABSM[,-c(1:5)]
Neth2011ABSF<-Neth2011[171,]
Neth2011ABSF<-Neth2011ABSF[,-c(1:5)]
Neth2011AllButSkinTotal<-Neth2011ABSF+Neth2011ABSM
Neth2011AllButSkinTotal<-Neth2011AllButSkinTotal[,-19]
NethABST<-as.data.frame(t(Neth2011AllButSkinTotal))
NethRates<-as.data.frame(NethABST/NethFullPop)
BulgRates<-as.data.frame(t(BulgariaGeneralRate))[-19,]
NethMinusBulg<-as.data.frame(NethRates-BulgRates)
names(NethMinusBulg)<-c("Difference in Cancer Rates")
NethMinusBulgby100000<-NethMinusBulg
NethMinusBulgby100000$`Difference in Cancer Rates`<-NethMinusBulgby100000$`Difference in Cancer Rates`*100000
names(NethMinusBulgby100000)<-c("Difference In Cancer Rates per 100,000")
NethMinusBulgby100000
Age Group | Difference In Cancer Rates per 100,000 |
---|---|
Age_0_4 | 0.0834663 |
Age_5_9 | 1.6342868 |
Age_10_14 | 6.3761948 |
Age_15_19 | 8.5776608 |
Age_20_24 | 17.8418074 |
Age_25_29 | 21.8606396 |
Age_30_34 | 21.9525255 |
Age_35_39 | 20.0827046 |
Age_40_44 | 15.1611987 |
Age_45_49 | 62.3620733 |
Age_50_54 | 70.6538882 |
Age_55_59 | 153.8709518 |
Age_60_64 | 302.7881184 |
Age_65_69 | 681.5190610 |
Age_70_74 | 887.6540597 |
Age_75_79 | 1076.3077235 |
Age_80_84 | 1307.9435060 |
Age_85Plus | 1146.9155321 |
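The per-100,000 figures in the table come from dividing case counts by population for each age group, scaling by 100,000, and subtracting one country's rate from the other's. A minimal worked example of that arithmetic follows; the helper `rate_per_100k` and all of the numbers are hypothetical and chosen only to make the calculation easy to follow.

```r
# Hypothetical helper: crude incidence rate per 100,000 people.
rate_per_100k <- function(cases, population) {
  cases * 100000 / population
}

# Assumed toy numbers for one age group in two countries (not CI5 values).
rate_a <- rate_per_100k(cases = 1500, population = 1000000)  # 150 per 100,000
rate_b <- rate_per_100k(cases = 400,  population = 500000)   # 80 per 100,000

# Positive difference: country A has more cases per 100,000 in this age group.
rate_diff <- rate_a - rate_b  # 70
```

A positive entry in the table therefore means the Netherlands has the higher rate for that age group.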
Graphically:
BulgWithAges<-as.data.frame(cbind(seq(0,85,by=5),t(BulgariaGeneralRate)))
names(BulgWithAges)<-c("Age","CancerRate")
BulgWithAgesby100000<-BulgWithAges
BulgWithAgesby100000$CancerRate<-BulgWithAgesby100000$CancerRate*100000
ggplotly(ggplot(BulgWithAgesby100000,aes(x=Age,y=CancerRate))+geom_point()+ggtitle("Bulgarian Cancer Rate per 100,000 by Five Year Age Group, 2011"))
NethWithAges<-as.data.frame(cbind(seq(0,85,by=5),NethRates))
NethWithAgesper100000<-NethWithAges
names(NethWithAgesper100000)<-c("Age","CancerRate")
NethWithAgesper100000$CancerRate<-NethWithAgesper100000$CancerRate*100000
ggplotly(ggplot(NethWithAgesper100000,aes(x=Age,y=CancerRate))+geom_point()+ggtitle("Netherlands Cancer Rate per 100,000 by Five Year Age Group, 2011"))
NethMinusBulgby100000<-as.data.frame(cbind(seq(0,85,by=5),NethMinusBulgby100000))
names(NethMinusBulgby100000)<-names(NethWithAgesper100000)<-c("Age","Cancer_Rate_Difference")
ggplotly(ggplot(NethMinusBulgby100000,aes(x=Age,y=Cancer_Rate_Difference))+geom_point()+ggtitle("Netherlands/Bulgaria Cancer Rate Difference per 100,000 by Five Year Age Group, 2011"))
We uploaded cancer_detailed_modified2.csv to Jason’s GitHub. The dashes in the All.cancers.excluding.non.melanoma.skin column are necessary for the sunburst. Given RStudio’s autofill, we did not fix the column names except at the end.
CDict<-read.csv("https://raw.githubusercontent.com/jgivensdoyle/607/master/cancer_detailed_modified2.csv",stringsAsFactors = FALSE)
# lowerit removes cancer classes that have sub-classes, for sunbursting
lowerit <- function(x) {
  i <- 1
  while (i < nrow(x)) {
    # Drop row i when it has a value in column 3 but the next row does not
    if (is.na(x[i + 1, 3]) && !is.na(x[i, 3])) {
      x <- x[-i, ]
    } else {
      i <- i + 1
    }
  }
  return(x)
}
ByCancerType<-function(x){
xSpec<-subset(x,x$CancerCode!=1)
xTotalTypes<-aggregate(xSpec$TotalCases, by=list(Cancer=xSpec$CancerCode), FUN=sum)
xTotalTypes$Cancer<-CDict$All.cancers.excluding.non.melanoma.skin
xTotalTypes<-cbind.data.frame(xTotalTypes,CDict$X.C00.96.C44.)
xTotalTypes$`CDict$X.C00.96.C44.`<-as.character(xTotalTypes$`CDict$X.C00.96.C44.`)
xTotalTypes$`CDict$X.C00.96.C44.`<-gsub("^$",NA,xTotalTypes$`CDict$X.C00.96.C44.`)
return(xTotalTypes)
}
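The row-by-row filter in `lowerit` can also be written without a loop. The sketch below is an alternative formulation, not the code used for the project: `lowerit_vec` and the data frame `toy` are hypothetical names, and the toy layout assumes that the third column is `NA` for sub-type rows, as it is after the `gsub("^$", NA, ...)` step.

```r
# Vectorized sketch of the same filter: drop row i when column 3 has a value
# in row i but is NA in row i + 1. The last row is always kept.
lowerit_vec <- function(x) {
  n <- nrow(x)
  if (n < 2) return(x)
  has_value <- !is.na(x[[3]])
  drop <- c(has_value[-n] & !has_value[-1], FALSE)
  x[!drop, , drop = FALSE]
}

# Toy dictionary (assumed layout: third column is NA for sub-type rows).
toy <- data.frame(
  name  = c("Lip", "Colon", "Colon-Ascending", "Colon-Sigmoid", "Lung"),
  cases = c(5, 30, 10, 20, 40),
  code  = c("C00", "C18", NA, NA, "C34"),
  stringsAsFactors = FALSE
)

# "Colon" is dropped because it is followed by its own sub-type rows.
kept <- lowerit_vec(toy)$name
kept
```

Dropping the parent row keeps its cases from being counted twice, once at the top level and once through its sub-types.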
In sunbursting the US cancer data, the percentages vary slightly between looking at overall categories and specific sub-categories. The numbers, however, remain the same. We are unsure of why the sunburstR package does this.
specccs<-subset(cancercasessummary,cancercasessummary$CancerCode!=1)
#now creating all year data set
USCancer<- subset(joinedtab,joinedtab$Country=="USA")
USCancer<-USCancer[,c(1:24)]
USCancerSum<-aggregate(USCancer$TotalCases, by=list(CancerCode=USCancer$CancerCode), FUN=sum)
USCancerSum<-USCancerSum[-1,]
USCancerSum$CancerCode<-CDict$All.cancers.excluding.non.melanoma.skin
#USCancerSum
#sunburst(USCancerSum)
USCancerSum<-cbind.data.frame(USCancerSum,CDict$X.C00.96.C44.)
USCancerSum$`CDict$X.C00.96.C44.`<-as.character(USCancerSum$`CDict$X.C00.96.C44.`)
USCancerSum$`CDict$X.C00.96.C44.`<-gsub("^$",NA,USCancerSum[,3])
toplevel<-subset(USCancerSum,is.na(USCancerSum$`CDict$X.C00.96.C44.`)==FALSE)
toplevel2<-toplevel[,c(1:2)]
sunburst(toplevel2,count=TRUE)
[BUL] Bulgaria - Population Data. Retrieved from website: http://www.nsi.bg/bg/content/3078/население-по-области-общини-населени-места-и-възраст-към-01022011-г
[CI5] Cancer Incidence in Five Continents (CI5) - Retrieved from website: http://ci5.iarc.fr/Default.aspx
[NET] The Netherlands - Population Data. Retrieved from website: https://www.cbs.nl/en-gb/publication/2014/47/dutch-census-2011, page 24.
[SEE] NCI Surveillance, Epidemiology, and End Results Program (SEER) - Retrieved from website: https://seer.cancer.gov/