library(readr)
library(ggplot2)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
Cancer is the uncontrolled growth of cells. It shows itself in many ways, so you need to pay close attention to your body and notice when something seems off. The first step toward a good prognosis is catching it as early as possible. Many cancers are curable if caught early enough, but unfortunately some are nearly impossible to catch in time. Cancer is caused by mutations in genes. Mutations are important to the evolution of a species, but there are certain genes that you don’t want to mutate: the tumor suppressor genes. The name says it all: these genes help suppress tumors and prevent the spread of cancer. When these genes mutate, they can’t do their job. Every time cells multiply there is a risk of cancer, but these genes reduce that risk because the cell has to pass checkpoints before it can actually divide. This makes sure everything is running smoothly and is an amazing system.
Cancer research dates back much further than I would have thought. Hippocrates, the “Father of Medicine,” was the first to describe tumors, using the terms carcinos and carcinoma. Of course, he didn’t know exactly what cancer was, so he used terms that described the way tumors looked. These terms refer to a crab, likely applied to the disease because the finger-like spreading projections from a cancer called to mind the shape of a crab [1].
Carcinogens are substances and exposures that can cause cancer. They are classified into groups by how strong the evidence is that they cause cancer in humans. Group 1 agents are known to be carcinogenic to humans. This list includes tobacco (smokeless and smoked), outdoor air pollution, processed meats, and many more. Group 2A agents are probably carcinogenic to humans; this list includes many chemicals and red meat. Group 2B agents are possibly carcinogenic to humans; this list includes aloe vera whole-leaf extract and gasoline engine exhaust [3]. The data were collected by cancer registries funded by the CDC [2].
There is always potential bias in any sampling, but in this case we are dealing with the population of nationwide cancer patients. The main potential bias is cancer that goes unnoticed, where the person lives with it without knowing. The data we are using consists of 7 years of cancer statistics, from 2010 to 2016. Variables include sex, location (city/state), population size, the number of people diagnosed and the number of deaths, and the rates of cancer diagnosis and mortality per 100,000 people. The rate per 100,000 people is more important to look at than the raw number of patients because the larger a state’s population, the higher its number of cancer patients.
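To make the per-100,000 idea concrete, here is a minimal sketch of how a crude rate is computed from a count and a population. The two rows are made-up numbers, not values from the CDC files, and the column names are only for illustration. The published RATE_PER_100000 values may be age-adjusted, in which case this simple formula would not reproduce them exactly.

```r
# Hypothetical counts and populations (not taken from the CDC data).
cases <- data.frame(
  area       = c("State A", "State B"),
  count      = c(25000, 2500),
  population = c(5000000, 400000)
)

# Crude rate per 100,000 = (cases / population) * 100,000
cases$rate_per_100000 <- cases$count / cases$population * 100000

print(cases)
```

State A has ten times as many cases, but State B has the higher rate, which is why the rate is the fairer comparison.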
Mortality <- read_csv("Morality_Rates_By_State.csv")
## Warning: Missing column names filled in: 'X5' [5], 'X6' [6], 'X11' [11],
## 'X12' [12], 'X14' [14], 'X15' [15], 'X20' [20], 'X21' [21], 'X26' [26],
## 'X27' [27]
## Parsed with column specification:
## cols(
## .default = col_double(),
## FMAREA = col_character(),
## X5 = col_logical(),
## X6 = col_logical(),
## FDAREA = col_character(),
## X11 = col_logical(),
## X12 = col_logical(),
## X14 = col_logical(),
## X15 = col_logical(),
## MMAREA = col_character(),
## X20 = col_logical(),
## X21 = col_logical(),
## MDAREA = col_character(),
## X26 = col_logical(),
## X27 = col_logical()
## )
## See spec(...) for full column specifications.
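The warning above says several columns had no name and were filled in as X5, X6, and so on, and they were parsed as logical, which suggests they are blank spacer columns. A minimal sketch of one way to drop such columns follows. It uses a small made-up data frame rather than the real file, and it assumes the X columns really are entirely empty.

```r
# Small stand-in for a data frame read with blank spacer columns.
# Column names mimic the placeholders from the warning; values are made up.
demo <- data.frame(
  FMAREA  = c("Alabama", "Alaska"),
  FMCOUNT = c(100, 20),
  X5      = c(NA, NA),
  X6      = c(NA, NA)
)

# Keep only columns that contain at least one non-missing value.
keep <- colSums(!is.na(demo)) > 0
demo_clean <- demo[, keep, drop = FALSE]

names(demo_clean)
```

The same two lines should work on the Mortality data frame, as long as no real column happens to be entirely missing.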
Female_Diagnosed_By_City <- read_csv("Female_Diagnosed_By_City_2010-2016.csv")
## Parsed with column specification:
## cols(
## AREA = col_character(),
## YEAR = col_double(),
## COUNT = col_double(),
## POPULATION = col_double(),
## RATE_PER_100000 = col_double(),
## EVENT = col_character(),
## RACE = col_character(),
## SEX = col_character(),
## SITE = col_character(),
## LOWER_CI_RATE_PER_100000 = col_double(),
## UPPER_CI_RATE_PER_100000 = col_double()
## )
Male_Diagnosed_By_City <- read_csv("Male_Diagnosed_By_City_2010-2016.csv")
## Parsed with column specification:
## cols(
## AREA = col_character(),
## YEAR = col_double(),
## COUNT = col_double(),
## POPULATION = col_double(),
## RATE_PER_100000 = col_double(),
## EVENT = col_character(),
## RACE = col_character(),
## SEX = col_character(),
## SITE = col_character(),
## LOWER_CI_RATE_PER_100000 = col_double(),
## UPPER_CI_RATE_PER_100000 = col_double()
## )
Male_And_Female_Diagnosed_By_City <- read_csv("Male_And_Female_Diagnosed_By_City_2010-2016.csv")
## Parsed with column specification:
## cols(
## AREA = col_character(),
## YEAR = col_double(),
## COUNT = col_double(),
## POPULATION = col_double(),
## RATE_PER_100000 = col_double(),
## EVENT = col_character(),
## RACE = col_character(),
## SEX = col_character(),
## SITE = col_character(),
## LOWER_CI_RATE_PER_100000 = col_double(),
## UPPER_CI_RATE_PER_100000 = col_double()
## )
Now that we have the data loaded into R, we can start by comparing the nationwide difference in cancer diagnoses between males and females in the United States. Null Hypothesis: There is no difference in means between Male and Female Cancer rates per 100,000 people in the United States. Alternative Hypothesis: There is a difference in means between Male and Female Cancer rates per 100,000 people in the United States.
summary(Mortality$FDRATE_PER_100000)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 317.0 473.8 509.6 508.6 551.7 661.6 1
summary(Mortality$MDRATE_PER_100000)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 345.6 506.3 556.3 542.9 587.2 691.0 1
Let’s see how these numbers compare in a boxplot.
boxplot (Mortality$FDRATE_PER_100000, Mortality$MDRATE_PER_100000, horizontal=TRUE, xlab="Diagnosed Per 100,000", main = "Male(Blue) vs Female (Pink) Cancer Patients", col=c("Pink","Blue"))

boxplot (Mortality$FDCOUNT, Mortality$MDCOUNT, horizontal=TRUE, xlab="Total Diagnosed", main = "Male(Blue) vs Female (Pink) Cancer Patients", col=c("Pink","Blue"))

There definitely appears to be a difference in the means, but the way to check it formally is through a t test.
t.test(Mortality$MDRATE_PER_100000, Mortality$FDRATE_PER_100000, alternative = "two.sided")
##
## Welch Two Sample t-test
##
## data: Mortality$MDRATE_PER_100000 and Mortality$FDRATE_PER_100000
## t = 6.665, df = 683.58, p-value = 5.455e-11
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 24.24711 44.49918
## sample estimates:
## mean of x mean of y
## 542.9474 508.5743
Since the p-value is 5.455e-11, we reject the null hypothesis. We are 95% confident that the mean male diagnosis rate per 100,000 people is between 24.25 and 44.50 higher than the mean female diagnosis rate per 100,000 people. The very small p-value is very strong evidence of a difference between the two means.
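One way to avoid retyping numbers from the printed output is to save the test result and pull the values out of it. The sketch below uses simulated rates, not the registry data, and the names male_rate and female_rate are only for illustration. The object returned by t.test() is a list with p.value, conf.int, and estimate components.

```r
set.seed(1)
# Simulated rates per 100,000; not the registry data.
male_rate   <- rnorm(50, mean = 540, sd = 60)
female_rate <- rnorm(50, mean = 510, sd = 60)

# Welch two-sample t test (the default when var.equal is not set).
result <- t.test(male_rate, female_rate, alternative = "two.sided")

result$p.value    # p-value of the test
result$conf.int   # 95% confidence interval for the difference in means
result$estimate   # the two sample means

# Build the sentence from the object rather than copying numbers by hand.
cat(sprintf(
  "p = %.3g; 95%% CI for the difference: %.2f to %.2f\n",
  result$p.value, result$conf.int[1], result$conf.int[2]
))
```

Reporting from the saved object keeps the write-up in sync with the test that was actually run.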
Next let’s compare the mortality rates of Males vs Females. Males may develop cancer more often than Females, but do they develop deadlier cancers? For this we will look at the ratio of mortality to diagnosis. Null Hypothesis: There is no difference in means between Male and Female Cancer Mortality rates in the United States. Alternative Hypothesis: There is a difference in means between Male and Female Cancer Mortality rates in the United States.
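The code that builds the ratio columns is not shown in this report. As a rough sketch, and assuming FM_Rate is the female mortality rate divided by the female diagnosis rate (an assumption, not confirmed by the data file), the calculation would look like this on a couple of made-up rows:

```r
# Made-up rows that reuse column names from the Mortality data set.
demo <- data.frame(
  FDRATE_PER_100000 = c(500, 450),  # diagnoses per 100,000
  FMRATE_PER_100000 = c(175, 153)   # deaths per 100,000
)

# Assumed definition: deaths per 100,000 divided by diagnoses per 100,000.
demo$FM_Rate <- demo$FMRATE_PER_100000 / demo$FDRATE_PER_100000

demo$FM_Rate
```

A ratio near 0.35 would mean roughly 35 deaths for every 100 diagnoses in the same year and place.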
summary(Mortality$FM_Rate)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.2795 0.3333 0.3486 0.3494 0.3671 0.4212 1
summary(Mortality$MM_Rate)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.2809 0.3579 0.3768 0.3780 0.3999 0.4513 1
boxplot (Mortality$FMRATE_PER_100000, Mortality$MMRATE_PER_100000, horizontal=TRUE, xlab="Mortality Per 100,000", main = "Male(Blue) vs Female (Pink) Cancer Patients", col=c("Pink","Blue"))

boxplot (Mortality$FMCOUNT, Mortality$MMCOUNT, horizontal=TRUE, xlab="Total Mortality", main = "Male(Blue) vs Female (Pink) Cancer Patients", col=c("Pink","Blue"))

boxplot (Mortality$FM_Rate, Mortality$MM_Rate, horizontal=TRUE, xlab="Mortality:Diagnosed Ratio", main = "Male(Blue) vs Female (Pink) Cancer Patients", col=c("Pink","Blue"))

There are 3 different boxplots. First, we have the Mortality rates per 100,000 people. This doesn’t tell the whole story because we know that males are more likely to be diagnosed with cancer, so they will most likely have the higher mortality per 100,000. The second boxplot shows total deaths, which mostly reflects population size. The third boxplot is what we want to look at. This shows us the ratio of mortality to diagnosed cancer patients. It still appears that Males have a higher ratio than Females, but just to be safe we will run a t test.
t.test(Mortality$MM_Rate, Mortality$FM_Rate, alternative = "two.sided")
##
## Welch Two Sample t-test
##
## data: Mortality$MM_Rate and Mortality$FM_Rate
## t = 13.272, df = 675.7, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.02437277 0.03283610
## sample estimates:
## mean of x mean of y
## 0.3780053 0.3494009
Since the p-value is less than 2.2e-16, we reject the null hypothesis. We are 95% confident that the mean male mortality-to-diagnosis ratio is between 0.0244 and 0.0328 higher than the mean female ratio. The very small p-value is very strong evidence of a difference between the two means.
The last comparison is between cancer rates in cities and state-wide averages. Does living in a city affect the rate of cancer because of worse air pollution, or does living in a city actually decrease your likelihood of cancer? Let’s check it out. Null Hypothesis: There is no difference in means between Cancer Diagnosis rates in the City vs state wide in the United States. Alternative Hypothesis: There is a difference in means between Cancer Diagnosis rates in the City vs state wide in the United States.
summary(Female_Diagnosed_By_City$RATE_PER_100000)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 374.7 405.8 449.2 458.7 511.2 583.4
summary(Male_Diagnosed_By_City$RATE_PER_100000)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 367.6 410.3 455.3 474.3 528.9 651.3
summary(Male_And_Female_Diagnosed_By_City$RATE_PER_100000)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 375.0 404.0 442.6 466.3 517.1 601.4
Let’s take two separate looks, one at Females and one at Males, comparing the cities with the state averages.
boxplot(Female_Diagnosed_By_City$RATE_PER_100000, Mortality$FDRATE_PER_100000, horizontal=TRUE, xlab="Diagnosed per 100,000", main = "Female City (Green) vs State (Tan) Cancer Rates", col=c("Green","Tan"))

t.test(Mortality$FDRATE_PER_100000, Female_Diagnosed_By_City$RATE_PER_100000, alternative = "two.sided")
##
## Welch Two Sample t-test
##
## data: Mortality$FDRATE_PER_100000 and Female_Diagnosed_By_City$RATE_PER_100000
## t = 5.2941, df = 62.959, p-value = 1.615e-06
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 31.04201 68.68615
## sample estimates:
## mean of x mean of y
## 508.5743 458.7102
Since the p-value is 1.615e-06, we reject the null hypothesis. We are 95% confident that the mean state-wide female diagnosis rate is between 31.04 and 68.69 per 100,000 higher than the mean female city rate.
boxplot(Male_Diagnosed_By_City$RATE_PER_100000, Mortality$MDRATE_PER_100000, horizontal=TRUE, xlab="Diagnosed per 100,000", main = "Male Cancer Rates City (Orange) vs State (Blue)", col=c("Orange","Blue"))

t.test(Mortality$MDRATE_PER_100000, Male_Diagnosed_By_City$RATE_PER_100000, alternative = "two.sided")
##
## Welch Two Sample t-test
##
## data: Mortality$MDRATE_PER_100000 and Male_Diagnosed_By_City$RATE_PER_100000
## t = 5.6978, df = 59.813, p-value = 3.935e-07
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 44.52642 92.70721
## sample estimates:
## mean of x mean of y
## 542.9474 474.3306
Since the p-value is 3.935e-07, we reject the null hypothesis. We are 95% confident that the mean state-wide male diagnosis rate is between 44.53 and 92.71 per 100,000 higher than the mean male city rate.
All p-values are extremely low, meaning there is very strong evidence of a difference in means in every comparison.
Male vs Female Diagnosed with cancer: 5.455e-11
Male vs Female Mortality Rates: < 2.2e-16
City vs State Diagnosed (Female): 1.615e-06
City vs State Diagnosed (Male): 3.935e-07
Given the extremely low p-values, there is very strong evidence of a difference between the group means in all cases. Males are more likely to be diagnosed with cancer and have a higher ratio of deaths to diagnoses, which suggests they are more likely to develop deadlier cancers. In the city data, diagnosis rates are lower than the state-wide averages for both sexes, and Males still have the higher mean rate (474.3 vs 458.7 per 100,000).
This process was extremely tedious. First, finding the dataset I wanted was extremely hard. Then, sorting through all the data was a long process as well. I had to download it as a tab-delimited txt file, convert it into a CSV, and sort through over a million lines. I have a new appreciation for the time and energy data scientists put into running their analyses. There is always a lot of junk you have to search through, and you can’t always get the data you want, especially for free. Overall it was a fun project, and it helped me put all the code I learned in class to the test!