DATA607_Final

Project Objective

This project seeks to investigate the density of the correlation between Lung Cancer with Air Pollution, Tobacco Smoke, exposure to Radon and Radiation and Arsanic in drinking water.
There are lots of lung cancer patient around the world who suffer with the lung cancer but still hard to identify a good course for the reason. Most geographical locations don’t have enough resources to measure the coursing risk factors but USA.
There for I decided to use the data sets available in USA to find the correlation between the problem and risk factors.

Lung Cancer statistics in the…

##                                       Maximum Moderate Minimum
## Cities with Air Pollution             |   *   |   *    |   *  
## Cities with Tobacco Smoke             |   *   |   *    |   *  
## Cities with Radon and Radiation       |   *   |   *    |   *  
## Cities with high Agriculture Industry |   *   |   *    |   *

Risk Factor statistics in the…

##                         Maximum Moderate Minimum
## Cities with Lung Cancer |   *   |   *    |   *

Data Sources

USA Population Statistics List of U.S. states and territories by population
Scrape HTML Table using rvest to data a frame
USA State Abbreviations and FIPS USPS State Abbreviations and FIPS Codes
Scrape HTML Table using rvest to data a frame
Cities with Air Pollution Ranks America’s Health Rankings
Downloaded Report Data from this web site
Cancer Statistics United States Cancer Statistics (USCS)
Downloaded Report Data from this web site

Cities with Tobacco Smoke Behavioral Risk Factor Data: Tobacco Use (2011 to present)
Downloaded Report Data ->
Cities with Radon and Radiation EPA Map of Radon Zones including State Radon Information and Contacts
Cities with high Agriculture Industry for high Arsanic Levels United States Cancer Statistics

Following process enter into the wikipedia.org and scrape the HTM table. The process also design to cleanse and filter the data in to datatframe …

W3Schools.com

library(rvest)
page <- read_html("http://en.wikipedia.org/wiki/List_of_U.S._states_and_territories_by_population")

pop_table <- page %>%
   html_node("table") %>%
   html_table(fill = TRUE)

pop_df <- as_data_frame(pop_table)

pop_df <- pop_df %>% mutate(StateName = `State or territory`) %>% mutate(Population = `Population estimate, July 1, 2016`)
popdf<- pop_df %>% select(StateName,Population)
popdf

## # A tibble: 59 × 2
##         StateName Population
##             <chr>      <chr>
## 1      California 39,250,017
## 2           Texas 27,862,596
## 3         Florida 20,612,439
## 4        New York 19,745,289
## 5    Pennsylvania 12,802,503
## 6        Illinois 12,801,539
## 7            Ohio 11,614,373
## 8         Georgia 10,310,371
## 9  North Carolina 10,146,788
## 10       Michigan  9,928,301
## # ... with 49 more rows

Back to the Menu_{_{_{_{_{_{_~}}}}}}

===============================================================================================================

Following process extract State Abbreviations and FIPS Codes . They will need in future data frame mappings …

W3Schools.com

library(rvest)

page <- read_html("https://www.bls.gov/cew/cewedr10.htm")

stat_table <- page %>%
   html_node("table") %>%
   html_table(fill = TRUE)

stat_df <- as_data_frame(stat_table,keep.rownames = TRUE)

#stat_df <- stat_df %>% mutate(StateName = `State or territory`) %>% mutate(Population = `Population estimate, July 1, 2016`)
statdf<- stat_df %>% select(X1,X2,X3,X4,X5,X6)
statdf <- filter(statdf,X1!= 'State')
statdf <- filter(statdf,X3!= 'State')
df1 <-  statdf%>% mutate(State = X1) %>% mutate(StateCode = X2) %>% mutate(StateID = X3)
df2 <-  statdf%>% mutate(State = X3) %>% mutate(StateCode = X4) %>% mutate(StateID = X5)
statdf1 <-  select(df1,State,StateCode,StateID)
statdf2 <-  select(df2,State,StateCode,StateID)

statedf<- rbind(statdf1, statdf2)
head(statedf)

## # A tibble: 6 × 3
##        State StateCode StateID
##        <chr>     <chr>   <chr>
## 1    Alabama        AL      01
## 2     Alaska        AK      02
## 3    Arizona        AZ      04
## 4   Arkansas        AR      05
## 5 California        CA      06
## 6   Colorado        CO      08

Back to the Menu_{_{_{_{_{_{_~}}}}}}

===============================================================================================================

Air Pollution Data files are stored in github and below process is loading them in to mysql. Then create separate data frames with the filtered values for analysis

#Loading airports.csv
airpo15_df <- read.csv("https://raw.githubusercontent.com/mathsanu/CUNY_MSDA/master/DATA607/FinalProject/data/2015SeniorRanks.csv", header=TRUE)
airpo16_df <- read.csv("https://raw.githubusercontent.com/mathsanu/CUNY_MSDA/master/DATA607/FinalProject/data/2016SeniorRanks.csv", header=TRUE)

Connecting to MySQL database in the localhost.

library('RMySQL')
#mydb = dbConnect(MySQL(), user='root', password='root', host='localhost', dbname="ourairports")
mydb = dbConnect(MySQL(), user='root', password='root', host='localhost')

# DROP the database if exists using RMySQL in R
dbSendQuery(mydb, "DROP DATABASE if exists 607finalprj;")

## <MySQLResult:2,0,0>

# creating the database using RMySQL in R
dbSendQuery(mydb, "CREATE DATABASE 607finalprj;")

## <MySQLResult:2,0,1>

mydb = dbConnect(MySQL(), user='root', password='root', host='localhost', dbname="607finalprj")

Insert above record sets into MySQL

# 1 ~~~airpo15_df~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Create Table and Insert records from airpo15_df in MySql Database ("607finalprj")
dbSendQuery(mydb, "drop table if exists allcensus")

## <MySQLResult:122805272,1,0>

dbWriteTable(mydb, value = airpo15_df, name = "allcensus", overwrite=TRUE, row.names=FALSE,add_id = TRUE)

## [1] TRUE

# 1 ~~~airpo16_df~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Insert records from airpo16_df  into MySql table ("allcensus")
dbWriteTable(mydb, 'allcensus', airpo16_df, row.names=F, append=T)

## [1] TRUE

Creating an aggrigate Data fram from the data in mysql database

airp_df<-dbGetQuery(mydb,"SELECT Edition ,`Measure.Name`,    `allcensus`.`State.Name`,
    `allcensus`.`Rank`,
    `allcensus`.`Value`,
    `allcensus`.`Score`
FROM 607finalprj.allcensus
where `Measure.Name` = 'Air Pollution'
order by `Value`;")
head(airp_df,5)

##   Edition  Measure.Name   State.Name Rank Value Score
## 1    2015 Air Pollution      Wyoming    1   5.0 -2.00
## 2    2015 Air Pollution North Dakota    2   5.2 -2.00
## 3    2015 Air Pollution      Montana    3   5.7 -2.00
## 4    2015 Air Pollution       Alaska    4   6.0 -2.00
## 5    2015 Air Pollution      Vermont    5   6.2 -1.96

According to the Air Pollution Ranks highest polluted States are

##              Rank   Highest 
## California   |   50 |   12.5
## Idaho        |   49 |   11.7
## Pennsylvania |   48 |   11.4
## Indiana      |   47 |   11.3
## Illinois     |   46 |   11.1

Moderately polluted States are

##                Rank   Moderate
## Wisconsin      |   30 |   9.1 
## South Carolina |   29 |   9   
## Mississippi    |   27 |   8.9 
## Utah           |   26 |   8.6 
## Connecticut    |   25 |   8.8

Lowest polluted States are

##              Rank  Moderate
## Vermont      |   5 |   6.2 
## Alaska       |   4 |   6   
## Montana      |   3 |   5.7 
## North Dakota |   2 |   5.2 
## Wyoming      |   1 |   5

Back to the Menu_{_{_{_{_{_{_~}}}}}}

===============================================================================================================

Cancer statistics join with above data frame “State Population” to calculate the deaths as a percentage of the population. Otherwise direct total deaths will not be a accurate variable

lungcancer_df <- read.csv("https://raw.githubusercontent.com/mathsanu/CUNY_MSDA/master/DATA607/FinalProject/data/LungCancer.csv", header=TRUE)

lungcancer_df <- filter(lungcancer_df,Leading.Cancer.Sites == 'Lung and Bronchus')
head(lungcancer_df,5)

##   Notes Leading.Cancer.Sites Leading.Cancer.Sites.Code        State
## 1          Lung and Bronchus                     22030   California
## 2          Lung and Bronchus                     22030      Florida
## 3          Lung and Bronchus                     22030        Texas
## 4          Lung and Bronchus                     22030     New York
## 5          Lung and Bronchus                     22030 Pennsylvania
##   State.Code Deaths
## 1          6  12464
## 2         12  11915
## 3         48   9439
## 4         36   9074
## 5         42   7536

# Join DF , calculating deathratio and deathrank

lc_df <- left_join(lungcancer_df,popdf, by=c("State" = "StateName") )
lc_df1<- lc_df %>% mutate(deathratio = (Deaths/as.numeric(gsub(",", "",Population)) ) * 10000) %>% arrange(desc(deathratio))%>%mutate(deathrank=row_number())

head(lc_df1,5)

##   Notes Leading.Cancer.Sites Leading.Cancer.Sites.Code         State
## 1          Lung and Bronchus                     22030      Kentucky
## 2          Lung and Bronchus                     22030 West Virginia
## 3          Lung and Bronchus                     22030         Maine
## 4          Lung and Bronchus                     22030      Arkansas
## 5          Lung and Bronchus                     22030    Washington
##   State.Code Deaths Population deathratio deathrank
## 1         21   3559  4,436,974   8.021232         1
## 2         54   1437  1,831,102   7.847733         2
## 3         23   1025  1,331,479   7.698206         3
## 4          5   2168  2,988,248   7.255087         4
## 5         53   3020  4,288,000   7.042910         5

ls(lc_df1)

## [1] "deathrank"                 "deathratio"               
## [3] "Deaths"                    "Leading.Cancer.Sites"     
## [5] "Leading.Cancer.Sites.Code" "Notes"                    
## [7] "Population"                "State"                    
## [9] "State.Code"

According to the above data following 5 States have the highest Deaths

lc_df1 %>% filter(deathrank < 6)

##   Notes Leading.Cancer.Sites Leading.Cancer.Sites.Code         State
## 1          Lung and Bronchus                     22030      Kentucky
## 2          Lung and Bronchus                     22030 West Virginia
## 3          Lung and Bronchus                     22030         Maine
## 4          Lung and Bronchus                     22030      Arkansas
## 5          Lung and Bronchus                     22030    Washington
##   State.Code Deaths Population deathratio deathrank
## 1         21   3559  4,436,974   8.021232         1
## 2         54   1437  1,831,102   7.847733         2
## 3         23   1025  1,331,479   7.698206         3
## 4          5   2168  2,988,248   7.255087         4
## 5         53   3020  4,288,000   7.042910         5

According to the above data following 5 States have the lowest Deaths

lc_df1 %>% filter(deathrank > 45)

##   Notes Leading.Cancer.Sites Leading.Cancer.Sites.Code      State
## 1          Lung and Bronchus                     22030     Alaska
## 2          Lung and Bronchus                     22030 New Mexico
## 3          Lung and Bronchus                     22030      Texas
## 4          Lung and Bronchus                     22030 California
## 5          Lung and Bronchus                     22030   Colorado
## 6          Lung and Bronchus                     22030       Utah
##   State.Code Deaths Population deathratio deathrank
## 1          2    273    741,894   3.679771        46
## 2         35    736  2,081,015   3.536736        47
## 3         48   9439 27,862,596   3.387696        48
## 4          6  12464 39,250,017   3.175540        49
## 5          8   1561  5,540,545   2.817412        50
## 6         49    437  3,051,217   1.432215        51

library(ggplot2)
ggplot(lc_df1, aes(fill=deathratio, y=deathratio, x=State)) + ggtitle("Death Ratio in each State ") +   geom_bar( stat="identity") + theme(axis.text.x = element_text(angle = 90, hjust = 1)) + geom_point()

Joining the Lung Cancer Death datafram and Air Polition datafrm by State to fingd the correlation

airp_df

# Join DF , calculating deathratio and deathrank
# head(lc_df1)
# head(airp_df)

lcap_df <- left_join(lc_df1,airp_df, by=c("State" = "State.Name") )
# lc_df1<- lc_df %>% mutate(deathratio = (Deaths/as.numeric(gsub(",", "",Population)) ) * 10000) %>% arrange(desc(deathratio))%>%mutate(deathrank=row_number())

head(lcap_df,5)

##   Notes Leading.Cancer.Sites Leading.Cancer.Sites.Code         State
## 1          Lung and Bronchus                     22030      Kentucky
## 2          Lung and Bronchus                     22030 West Virginia
## 3          Lung and Bronchus                     22030         Maine
## 4          Lung and Bronchus                     22030      Arkansas
## 5          Lung and Bronchus                     22030    Washington
##   State.Code Deaths Population deathratio deathrank Edition  Measure.Name
## 1         21   3559  4,436,974   8.021232         1    2015 Air Pollution
## 2         54   1437  1,831,102   7.847733         2    2015 Air Pollution
## 3         23   1025  1,331,479   7.698206         3    2015 Air Pollution
## 4          5   2168  2,988,248   7.255087         4    2015 Air Pollution
## 5         53   3020  4,288,000   7.042910         5    2015 Air Pollution
##   Rank Value Score
## 1   44  10.1  0.36
## 2   33   9.4 -0.06
## 3   13   7.4 -1.25
## 4   37   9.7  0.12
## 5   17   8.0 -0.89

summary(lcap_df)

##                                                                                                                        Notes   
##                                                                                                                           :51  
##  ---                                                                                                                      : 0  
##  1. 'Suppressed' is displayed for data values when they must not be provided in order to protect personal privacy. Data is: 0  
##  1. Totals are not available for these results due to suppression constraints. More Information:                          : 0  
##  2. 'Not Applicable' is displayed for all data values when gender-specific cancers are combined with the wrong sex. More  : 0  
##  2. Rows with suppressed Deaths are hidden. Use Quick Options above to show suppressed rows.                              : 0  
##  (Other)                                                                                                                  : 0  
##                      Leading.Cancer.Sites Leading.Cancer.Sites.Code
##  Lung and Bronchus             :51        22030      :51           
##                                : 0                   : 0           
##  Brain and Other Nervous System: 0        20010-20100: 0           
##  Breast                        : 0        21010      : 0           
##  Cervix Uteri                  : 0        21020      : 0           
##  Colon and Rectum              : 0        21041-21052: 0           
##  (Other)                       : 0        (Other)    : 0           
##     State             State.Code        Deaths         Population       
##  Length:51          Min.   : 1.00   Min.   :  230.0   Length:51         
##  Class :character   1st Qu.:16.50   1st Qu.:  713.5   Class :character  
##  Mode  :character   Median :29.00   Median : 2391.0   Mode  :character  
##                     Mean   :28.96   Mean   : 3062.3                     
##                     3rd Qu.:41.50   3rd Qu.: 4003.0                     
##                     Max.   :56.00   Max.   :12464.0                     
##                                                                         
##    deathratio      deathrank       Edition     Measure.Name      
##  Min.   :1.432   Min.   : 1.0   Min.   :2015   Length:51         
##  1st Qu.:4.448   1st Qu.:13.5   1st Qu.:2015   Class :character  
##  Median :4.974   Median :26.0   Median :2015   Mode  :character  
##  Mean   :5.158   Mean   :26.0   Mean   :2015                     
##  3rd Qu.:5.977   3rd Qu.:38.5   3rd Qu.:2015                     
##  Max.   :8.021   Max.   :51.0   Max.   :2015                     
##                                                                  
##       Rank           Value            Score        
##  Min.   : 1.00   Min.   : 5.000   Min.   :-2.0000  
##  1st Qu.:13.25   1st Qu.: 7.500   1st Qu.:-1.2200  
##  Median :24.00   Median : 8.800   Median :-0.4200  
##  Mean   :25.10   Mean   : 8.669   Mean   :-0.4922  
##  3rd Qu.:37.00   3rd Qu.: 9.700   3rd Qu.: 0.1200  
##  Max.   :50.00   Max.   :12.500   Max.   : 1.7800  
##  NA's   :1                        NA's   :1

library(ggplot2)
lcap_df.lm1  <- lm(deathratio ~ Value, data = lcap_df)
plot(lcap_df$deathratio ~ lcap_df$Value, main = "Relationship between Lung Cancer Death Ratio  vs State Air Polution",xlab='Air Pollution',ylab='Death Ratio')
 abline(lcap_df.lm1 )
 abline(h=5.197,col = "red") 
 abline(v= 8.800 ,col = "blue" )

summary(lcap_df.lm1)

## 
## Call:
## lm(formula = deathratio ~ Value, data = lcap_df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.7489 -0.7846 -0.1231  0.7414  2.7189 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   4.2814     0.9823   4.358  6.7e-05 ***
## Value         0.1011     0.1112   0.909    0.368    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.339 on 49 degrees of freedom
## Multiple R-squared:  0.01657,    Adjusted R-squared:  -0.003496 
## F-statistic: 0.8258 on 1 and 49 DF,  p-value: 0.3679

We get a lot of useful information here without being too overwhelmed by pages of output.
The estimates for the model intercept is 3.0822 and the coefficient measuring the slope of the relationship with Air_Polution_Value is 0.2439 and information about standard errors of these estimates is also provided in the Coefficients table. We see that the test of significance of the model coefficients is also summarized in that table so we can see that there is evidence that the coefficient is significantly different to zero - as the Air Pollution increases so does Lung Cancer. It proves weak Positive Correlation between the number of Air Pollution Value and Lunge Cancer Death Ratio.

Back to the Menu_{_{_{_{_{_{_~}}}}}}

DATA607_Final_Project

Matheesha Thambeliyagodage

May 5, 2017

Project Objective

Data Sources

Following process enter into the wikipedia.org and scrape the HTM table. The process also design to cleanse and filter the data in to datatframe …

Following process extract State Abbreviations and FIPS Codes . They will need in future data frame mappings …

Air Pollution Data files are stored in github and below process is loading them in to mysql. Then create separate data frames with the filtered values for analysis

Connecting to MySQL database in the localhost.

Insert above record sets into MySQL

Cancer statistics join with above data frame “State Population” to calculate the deaths as a percentage of the population. Otherwise direct total deaths will not be a accurate variable

According to the above data following 5 States have the highest Deaths

According to the above data following 5 States have the lowest Deaths

Joining the Lung Cancer Death datafram and Air Polition datafrm by State to fingd the correlation