ANLY545 Project Concept Presentation: Cancer Data Analysis

Vaidyanathan Subramanian, Rahul Singh

2022-04-30

Cancer Disease in the United States

Cancer is “a disease in which some of the body’s cells grow uncontrollably and spread to other parts of the body” (National Cancer Institute 2021). There are many different types of cancer, and those different types are usually named for the part of the body where the cancer starts. There is a one in three chance that you will have to deal with cancer somewhere in your body in your lifetime (American Cancer Society 2021).

Cancer is the second leading cause of death in the United States, exceeded only by heart disease.
One of every four deaths in the United States is due to cancer.
In the United States in 2018, 1,708,921 new cancer cases were reported and 599,265 people died of cancer.
For every 100,000 people, 436 new cancer cases were reported and 149 people died of cancer.
2018 is the latest year for which incidence data are available.

How Does Cancer Develop?

Cancer is a genetic disease—that is, it is caused by changes to genes that control the way our cells function, especially how they grow and divide.

Genetic changes that cause cancer can happen because:

of errors that occur as cells divide.
of damage to DNA caused by harmful substances in the environment, such as the chemicals in tobacco smoke and ultraviolet rays from the sun.
they were inherited from our parents.

Rate of New Cancers in the United States

[Figure: Rate of New Cancers in the United States, 2018 (all types of cancer, all ages, all races and ethnicities, male and female)]

Top 10 Cancers by Rates of New Cancer Cases

Top 10 Cancers by Rates of Cancer Deaths

Purpose

Current and historical industrial activity virtually always releases toxic chemicals into the environment, some of which are known or suspected carcinogens (National Institutes of Health 2018). While plant operators can implement technical and operational methods to reduce these emissions, and governments can use legislation to encourage them to do so, no industrial process can be completely clean, and industrially produced toxins remain an unavoidable part of life in a modern society.

Potential links between industrial operations and cancer hot spots can be investigated using publicly available geospatial data. While the causes of individual tumors are complex and can rarely be established with certainty, analysis of aggregated data can provide important insights and help direct further (and sometimes scarce) investigative resources.

We’d like to analyze cancer registry data to see whether cancer incidence is linked to geographic location, age, behavioral risk factors, and industrial toxins in the environment.

Hypothesis

Patient groups defined by age and gender will be compared on health condition (chest pain type, resting blood pressure, fasting blood sugar, thalassemia) and on heart disease (exercise-induced angina, heart disease).
We expect the heart disease rate to differ significantly between patient groups.
H0: The heart disease rate does not differ significantly between patient groups.
H1: The heart disease rate differs significantly between patient groups.
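One way the hypotheses above could be checked is a Pearson χ² test on a 2x2 table of group by outcome. The following is a minimal sketch, not the project's actual analysis: the function name `chi_square_2x2` and the counts are hypothetical, and no continuity correction is applied.

```python
import math

def chi_square_2x2(table):
    """Pearson chi-square test of independence for a 2x2 table of counts.

    table[i][j] = number of patients in group i with outcome j.
    Returns (statistic, p_value); a 2x2 table has 1 degree of freedom.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i in range(2):
        for j in range(2):
            # Expected count if group and outcome were independent (H0).
            expected = row_totals[i] * col_totals[j] / n
            stat += (table[i][j] - expected) ** 2 / expected
    # For df = 1 the chi-square upper-tail probability is erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value

# Hypothetical counts, NOT taken from the project data.
# Rows: two patient groups; columns: (no heart disease, heart disease).
counts = [[60, 30], [80, 100]]
stat, p = chi_square_2x2(counts)
print(f"chi2 = {stat:.2f}, p = {p:.4f}")
```

A small p-value would be read as evidence against H0, that is, as evidence that the heart disease rate differs between the two groups.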

Data

To explore the relationship between heart disease and patient characteristics, we need the following information:

Cases of patients who experienced heart disease
Age
Gender
Hematic indices related to heart health
Symptoms and severity

Data Resource

Dataset Analysis

Source: Cleveland database, subset of 14 attributes
Data types: categorical and numerical
Attributes of interest:

Age (Real)
Gender (Binary)
Chest Pain Type (Nominal)
Rest Blood Pressure (Real)
Fasting Blood Sugar (Real)
Thalassemia (Nominal)
Exercise Induced Angina (Binary)
Heart Disease (Binary)

There are no missing values.
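Before any categorical analysis, records with these attributes have to be cross-tabulated into counts. The sketch below illustrates one way to do that; the field names, the binary codings, the age cutoff of 55, and the sample records are all assumptions made for illustration and do not come from the dataset.

```python
from collections import Counter
from dataclasses import dataclass

@dataclass(frozen=True)
class PatientRecord:
    # Field names are illustrative; the dataset's own column names may differ.
    age: float
    gender: int            # binary, coding assumed: 0 = female, 1 = male
    chest_pain_type: str   # nominal
    heart_disease: int     # binary, coding assumed: 0 = absent, 1 = present

def cross_tabulate(records, age_cutoff=55):
    """Count records in each (age group, gender, heart disease) cell.

    The cutoff used to split patients into two age groups is an arbitrary
    assumption, chosen only to turn a real-valued age into a category.
    """
    counts = Counter()
    for r in records:
        age_group = "older" if r.age >= age_cutoff else "younger"
        counts[(age_group, r.gender, r.heart_disease)] += 1
    return counts

# Made-up records for illustration only.
records = [
    PatientRecord(63, 1, "typical angina", 1),
    PatientRecord(41, 0, "atypical angina", 0),
    PatientRecord(58, 1, "asymptomatic", 1),
    PatientRecord(47, 1, "non-anginal pain", 0),
]
table = cross_tabulate(records)
for cell, count in sorted(table.items()):
    print(cell, count)
```

Binning age into groups discards information, so the choice of cutoff would need to be justified in the actual analysis.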

Analytic Scope/Methods

Descriptive and correlation analysis
Methods: χ² test of independence; loglinear analysis for multiple attributes
H0: There is no relationship between the variables
H1: There is a relationship between the variables
Variables:

Age and/or Gender vs. Heart Disease
Age and/or Gender vs. Hematic index

Graphics: bar charts, mosaic displays
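For more than two attributes, the simplest loglinear model is mutual independence, written [A][B][C]. As a hedged sketch of what the loglinear step could look like, the code below fits that model to a three-way table and returns the likelihood-ratio statistic G². The function name and the counts are hypothetical. Richer models with interaction terms need iterative fitting and are not covered here.

```python
import math
from itertools import product

def mutual_independence_g2(table):
    """Fit the loglinear model [A][B][C] to a three-way table of counts.

    table[i][j][k] is the observed count. Under mutual independence the
    fitted counts have the closed form a_i * b_j * c_k / n**2, where a, b, c
    are the one-way margins, so no iterative fitting is needed.
    Returns (G2, degrees_of_freedom).
    """
    I, J, K = len(table), len(table[0]), len(table[0][0])
    cells = list(product(range(I), range(J), range(K)))
    n = sum(table[i][j][k] for i, j, k in cells)
    a = [sum(table[i][j][k] for j in range(J) for k in range(K)) for i in range(I)]
    b = [sum(table[i][j][k] for i in range(I) for k in range(K)) for j in range(J)]
    c = [sum(table[i][j][k] for i in range(I) for j in range(J)) for k in range(K)]
    g2 = 0.0
    for i, j, k in cells:
        observed = table[i][j][k]
        fitted = a[i] * b[j] * c[k] / n ** 2
        if observed > 0:  # a zero cell contributes nothing to G2
            g2 += 2 * observed * math.log(observed / fitted)
    df = I * J * K - 1 - (I - 1) - (J - 1) - (K - 1)
    return g2, df

# Hypothetical counts, NOT project data: age group x gender x heart disease.
observed = [
    [[40, 15], [45, 35]],
    [[20, 15], [35, 65]],
]
g2, df = mutual_independence_g2(observed)
print(f"G2 = {g2:.2f} on {df} degrees of freedom")
```

A G² that is large relative to its degrees of freedom suggests that the variables are not mutually independent; a p-value would come from the χ² distribution with `df` degrees of freedom, which a statistics library can supply.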

Analytic Scope/Pitfalls

Selection bias may be present in the data. With a sample size of only 270, the data may not be representative of the population. Confounders that are not included in the dataset may also influence the results.
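To put the sample-size concern in concrete terms, the sketch below computes the margin of error of an estimated proportion using the normal-approximation (Wald) interval. It assumes independent observations and 95% confidence; the function name and the subgroup size of 90 are hypothetical.

```python
import math

def margin_of_error(p, n, z=1.96):
    """Half-width of the normal-approximation (Wald) confidence interval
    for a proportion p estimated from n independent observations.
    z = 1.96 corresponds to roughly 95% confidence."""
    return z * math.sqrt(p * (1 - p) / n)

n = 270  # sample size stated above
worst_case = margin_of_error(0.5, n)  # p = 0.5 maximizes p * (1 - p)
print(f"full sample: +/- {worst_case:.3f}")

# A subgroup is smaller, so its estimate is less precise.
# The subgroup size of 90 is a made-up example.
subgroup = margin_of_error(0.5, 90)
print(f"subgroup of 90: +/- {subgroup:.3f}")
```

With all 270 patients the worst-case margin is about six percentage points, and it widens to about ten for a subgroup of 90, so comparisons between small age and gender groups should be interpreted cautiously.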

References

https://www.cdc.gov/cancer/dcpc/data/index.htm
https://gis.cdc.gov/Cancer/USCS/#/AtAGlance/
https://www.cancer.gov/about-cancer/understanding/what-is-cancer