Lung Cancer statistics in the…
## Maximum Moderate Minimum
## Cities with Air Pollution | * | * | *
## Cities with Tobacco Smoke | * | * | *
## Cities with Radon and Radiation | * | * | *
## Cities with high Agriculture Industry | * | * | *
Risk Factor statistics in the…
## Maximum Moderate Minimum
## Cities with Lung Cancer | * | * | *
USA Population Statistics List of U.S. states and territories by population
Scrape HTML Table using rvest to data a frame
USA State Abbreviations and FIPS USPS State Abbreviations and FIPS Codes
Scrape HTML Table using rvest to data a frame
Cities with Air Pollution Ranks America’s Health Rankings
Downloaded Report Data from this web site
Cancer Statistics United States Cancer Statistics (USCS)
Downloaded Report Data from this web site
Cities with Tobacco Smoke Behavioral Risk Factor Data: Tobacco Use (2011 to present)
Downloaded Report Data ->
Cities with Radon and Radiation EPA Map of Radon Zones including State Radon Information and Contacts
Cities with high Agriculture Industry for high Arsanic Levels United States Cancer Statistics
library(rvest)
page <- read_html("http://en.wikipedia.org/wiki/List_of_U.S._states_and_territories_by_population")
pop_table <- page %>%
html_node("table") %>%
html_table(fill = TRUE)
pop_df <- as_data_frame(pop_table)
pop_df <- pop_df %>% mutate(StateName = `State or territory`) %>% mutate(Population = `Population estimate, July 1, 2016`)
popdf<- pop_df %>% select(StateName,Population)
popdf
## # A tibble: 59 × 2
## StateName Population
## <chr> <chr>
## 1 California 39,250,017
## 2 Texas 27,862,596
## 3 Florida 20,612,439
## 4 New York 19,745,289
## 5 Pennsylvania 12,802,503
## 6 Illinois 12,801,539
## 7 Ohio 11,614,373
## 8 Georgia 10,310,371
## 9 North Carolina 10,146,788
## 10 Michigan 9,928,301
## # ... with 49 more rows
===============================================================================================================
library(rvest)
page <- read_html("https://www.bls.gov/cew/cewedr10.htm")
stat_table <- page %>%
html_node("table") %>%
html_table(fill = TRUE)
stat_df <- as_data_frame(stat_table,keep.rownames = TRUE)
#stat_df <- stat_df %>% mutate(StateName = `State or territory`) %>% mutate(Population = `Population estimate, July 1, 2016`)
statdf<- stat_df %>% select(X1,X2,X3,X4,X5,X6)
statdf <- filter(statdf,X1!= 'State')
statdf <- filter(statdf,X3!= 'State')
df1 <- statdf%>% mutate(State = X1) %>% mutate(StateCode = X2) %>% mutate(StateID = X3)
df2 <- statdf%>% mutate(State = X3) %>% mutate(StateCode = X4) %>% mutate(StateID = X5)
statdf1 <- select(df1,State,StateCode,StateID)
statdf2 <- select(df2,State,StateCode,StateID)
statedf<- rbind(statdf1, statdf2)
head(statedf)
## # A tibble: 6 × 3
## State StateCode StateID
## <chr> <chr> <chr>
## 1 Alabama AL 01
## 2 Alaska AK 02
## 3 Arizona AZ 04
## 4 Arkansas AR 05
## 5 California CA 06
## 6 Colorado CO 08
===============================================================================================================
#Loading airports.csv
airpo15_df <- read.csv("https://raw.githubusercontent.com/mathsanu/CUNY_MSDA/master/DATA607/FinalProject/data/2015SeniorRanks.csv", header=TRUE)
airpo16_df <- read.csv("https://raw.githubusercontent.com/mathsanu/CUNY_MSDA/master/DATA607/FinalProject/data/2016SeniorRanks.csv", header=TRUE)
library('RMySQL')
#mydb = dbConnect(MySQL(), user='root', password='root', host='localhost', dbname="ourairports")
mydb = dbConnect(MySQL(), user='root', password='root', host='localhost')
# DROP the database if exists using RMySQL in R
dbSendQuery(mydb, "DROP DATABASE if exists 607finalprj;")
## <MySQLResult:2,0,0>
# creating the database using RMySQL in R
dbSendQuery(mydb, "CREATE DATABASE 607finalprj;")
## <MySQLResult:2,0,1>
mydb = dbConnect(MySQL(), user='root', password='root', host='localhost', dbname="607finalprj")
# 1 ~~~airpo15_df~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Create Table and Insert records from airpo15_df in MySql Database ("607finalprj")
dbSendQuery(mydb, "drop table if exists allcensus")
## <MySQLResult:122805272,1,0>
dbWriteTable(mydb, value = airpo15_df, name = "allcensus", overwrite=TRUE, row.names=FALSE,add_id = TRUE)
## [1] TRUE
# 1 ~~~airpo16_df~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Insert records from airpo16_df into MySql table ("allcensus")
dbWriteTable(mydb, 'allcensus', airpo16_df, row.names=F, append=T)
## [1] TRUE
Creating an aggrigate Data fram from the data in mysql database
airp_df<-dbGetQuery(mydb,"SELECT Edition ,`Measure.Name`, `allcensus`.`State.Name`,
`allcensus`.`Rank`,
`allcensus`.`Value`,
`allcensus`.`Score`
FROM 607finalprj.allcensus
where `Measure.Name` = 'Air Pollution'
order by `Value`;")
head(airp_df,5)
## Edition Measure.Name State.Name Rank Value Score
## 1 2015 Air Pollution Wyoming 1 5.0 -2.00
## 2 2015 Air Pollution North Dakota 2 5.2 -2.00
## 3 2015 Air Pollution Montana 3 5.7 -2.00
## 4 2015 Air Pollution Alaska 4 6.0 -2.00
## 5 2015 Air Pollution Vermont 5 6.2 -1.96
According to the Air Pollution Ranks highest polluted States are
## Rank Highest
## California | 50 | 12.5
## Idaho | 49 | 11.7
## Pennsylvania | 48 | 11.4
## Indiana | 47 | 11.3
## Illinois | 46 | 11.1
Moderately polluted States are
## Rank Moderate
## Wisconsin | 30 | 9.1
## South Carolina | 29 | 9
## Mississippi | 27 | 8.9
## Utah | 26 | 8.6
## Connecticut | 25 | 8.8
Lowest polluted States are
## Rank Moderate
## Vermont | 5 | 6.2
## Alaska | 4 | 6
## Montana | 3 | 5.7
## North Dakota | 2 | 5.2
## Wyoming | 1 | 5
===============================================================================================================
lungcancer_df <- read.csv("https://raw.githubusercontent.com/mathsanu/CUNY_MSDA/master/DATA607/FinalProject/data/LungCancer.csv", header=TRUE)
lungcancer_df <- filter(lungcancer_df,Leading.Cancer.Sites == 'Lung and Bronchus')
head(lungcancer_df,5)
## Notes Leading.Cancer.Sites Leading.Cancer.Sites.Code State
## 1 Lung and Bronchus 22030 California
## 2 Lung and Bronchus 22030 Florida
## 3 Lung and Bronchus 22030 Texas
## 4 Lung and Bronchus 22030 New York
## 5 Lung and Bronchus 22030 Pennsylvania
## State.Code Deaths
## 1 6 12464
## 2 12 11915
## 3 48 9439
## 4 36 9074
## 5 42 7536
# Join DF , calculating deathratio and deathrank
lc_df <- left_join(lungcancer_df,popdf, by=c("State" = "StateName") )
lc_df1<- lc_df %>% mutate(deathratio = (Deaths/as.numeric(gsub(",", "",Population)) ) * 10000) %>% arrange(desc(deathratio))%>%mutate(deathrank=row_number())
head(lc_df1,5)
## Notes Leading.Cancer.Sites Leading.Cancer.Sites.Code State
## 1 Lung and Bronchus 22030 Kentucky
## 2 Lung and Bronchus 22030 West Virginia
## 3 Lung and Bronchus 22030 Maine
## 4 Lung and Bronchus 22030 Arkansas
## 5 Lung and Bronchus 22030 Washington
## State.Code Deaths Population deathratio deathrank
## 1 21 3559 4,436,974 8.021232 1
## 2 54 1437 1,831,102 7.847733 2
## 3 23 1025 1,331,479 7.698206 3
## 4 5 2168 2,988,248 7.255087 4
## 5 53 3020 4,288,000 7.042910 5
ls(lc_df1)
## [1] "deathrank" "deathratio"
## [3] "Deaths" "Leading.Cancer.Sites"
## [5] "Leading.Cancer.Sites.Code" "Notes"
## [7] "Population" "State"
## [9] "State.Code"
lc_df1 %>% filter(deathrank < 6)
## Notes Leading.Cancer.Sites Leading.Cancer.Sites.Code State
## 1 Lung and Bronchus 22030 Kentucky
## 2 Lung and Bronchus 22030 West Virginia
## 3 Lung and Bronchus 22030 Maine
## 4 Lung and Bronchus 22030 Arkansas
## 5 Lung and Bronchus 22030 Washington
## State.Code Deaths Population deathratio deathrank
## 1 21 3559 4,436,974 8.021232 1
## 2 54 1437 1,831,102 7.847733 2
## 3 23 1025 1,331,479 7.698206 3
## 4 5 2168 2,988,248 7.255087 4
## 5 53 3020 4,288,000 7.042910 5
lc_df1 %>% filter(deathrank > 45)
## Notes Leading.Cancer.Sites Leading.Cancer.Sites.Code State
## 1 Lung and Bronchus 22030 Alaska
## 2 Lung and Bronchus 22030 New Mexico
## 3 Lung and Bronchus 22030 Texas
## 4 Lung and Bronchus 22030 California
## 5 Lung and Bronchus 22030 Colorado
## 6 Lung and Bronchus 22030 Utah
## State.Code Deaths Population deathratio deathrank
## 1 2 273 741,894 3.679771 46
## 2 35 736 2,081,015 3.536736 47
## 3 48 9439 27,862,596 3.387696 48
## 4 6 12464 39,250,017 3.175540 49
## 5 8 1561 5,540,545 2.817412 50
## 6 49 437 3,051,217 1.432215 51
library(ggplot2)
ggplot(lc_df1, aes(fill=deathratio, y=deathratio, x=State)) + ggtitle("Death Ratio in each State ") + geom_bar( stat="identity") + theme(axis.text.x = element_text(angle = 90, hjust = 1)) + geom_point()
airp_df
# Join DF , calculating deathratio and deathrank
# head(lc_df1)
# head(airp_df)
lcap_df <- left_join(lc_df1,airp_df, by=c("State" = "State.Name") )
# lc_df1<- lc_df %>% mutate(deathratio = (Deaths/as.numeric(gsub(",", "",Population)) ) * 10000) %>% arrange(desc(deathratio))%>%mutate(deathrank=row_number())
head(lcap_df,5)
## Notes Leading.Cancer.Sites Leading.Cancer.Sites.Code State
## 1 Lung and Bronchus 22030 Kentucky
## 2 Lung and Bronchus 22030 West Virginia
## 3 Lung and Bronchus 22030 Maine
## 4 Lung and Bronchus 22030 Arkansas
## 5 Lung and Bronchus 22030 Washington
## State.Code Deaths Population deathratio deathrank Edition Measure.Name
## 1 21 3559 4,436,974 8.021232 1 2015 Air Pollution
## 2 54 1437 1,831,102 7.847733 2 2015 Air Pollution
## 3 23 1025 1,331,479 7.698206 3 2015 Air Pollution
## 4 5 2168 2,988,248 7.255087 4 2015 Air Pollution
## 5 53 3020 4,288,000 7.042910 5 2015 Air Pollution
## Rank Value Score
## 1 44 10.1 0.36
## 2 33 9.4 -0.06
## 3 13 7.4 -1.25
## 4 37 9.7 0.12
## 5 17 8.0 -0.89
summary(lcap_df)
## Notes
## :51
## --- : 0
## 1. 'Suppressed' is displayed for data values when they must not be provided in order to protect personal privacy. Data is: 0
## 1. Totals are not available for these results due to suppression constraints. More Information: : 0
## 2. 'Not Applicable' is displayed for all data values when gender-specific cancers are combined with the wrong sex. More : 0
## 2. Rows with suppressed Deaths are hidden. Use Quick Options above to show suppressed rows. : 0
## (Other) : 0
## Leading.Cancer.Sites Leading.Cancer.Sites.Code
## Lung and Bronchus :51 22030 :51
## : 0 : 0
## Brain and Other Nervous System: 0 20010-20100: 0
## Breast : 0 21010 : 0
## Cervix Uteri : 0 21020 : 0
## Colon and Rectum : 0 21041-21052: 0
## (Other) : 0 (Other) : 0
## State State.Code Deaths Population
## Length:51 Min. : 1.00 Min. : 230.0 Length:51
## Class :character 1st Qu.:16.50 1st Qu.: 713.5 Class :character
## Mode :character Median :29.00 Median : 2391.0 Mode :character
## Mean :28.96 Mean : 3062.3
## 3rd Qu.:41.50 3rd Qu.: 4003.0
## Max. :56.00 Max. :12464.0
##
## deathratio deathrank Edition Measure.Name
## Min. :1.432 Min. : 1.0 Min. :2015 Length:51
## 1st Qu.:4.448 1st Qu.:13.5 1st Qu.:2015 Class :character
## Median :4.974 Median :26.0 Median :2015 Mode :character
## Mean :5.158 Mean :26.0 Mean :2015
## 3rd Qu.:5.977 3rd Qu.:38.5 3rd Qu.:2015
## Max. :8.021 Max. :51.0 Max. :2015
##
## Rank Value Score
## Min. : 1.00 Min. : 5.000 Min. :-2.0000
## 1st Qu.:13.25 1st Qu.: 7.500 1st Qu.:-1.2200
## Median :24.00 Median : 8.800 Median :-0.4200
## Mean :25.10 Mean : 8.669 Mean :-0.4922
## 3rd Qu.:37.00 3rd Qu.: 9.700 3rd Qu.: 0.1200
## Max. :50.00 Max. :12.500 Max. : 1.7800
## NA's :1 NA's :1
library(ggplot2)
lcap_df.lm1 <- lm(deathratio ~ Value, data = lcap_df)
plot(lcap_df$deathratio ~ lcap_df$Value, main = "Relationship between Lung Cancer Death Ratio vs State Air Polution",xlab='Air Pollution',ylab='Death Ratio')
abline(lcap_df.lm1 )
abline(h=5.197,col = "red")
abline(v= 8.800 ,col = "blue" )
summary(lcap_df.lm1)
##
## Call:
## lm(formula = deathratio ~ Value, data = lcap_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.7489 -0.7846 -0.1231 0.7414 2.7189
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.2814 0.9823 4.358 6.7e-05 ***
## Value 0.1011 0.1112 0.909 0.368
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.339 on 49 degrees of freedom
## Multiple R-squared: 0.01657, Adjusted R-squared: -0.003496
## F-statistic: 0.8258 on 1 and 49 DF, p-value: 0.3679
We get a lot of useful information here without being too overwhelmed by pages of output.
The estimates for the model intercept is 3.0822 and the coefficient measuring the slope of the relationship with Air_Polution_Value is 0.2439 and information about standard errors of these estimates is also provided in the Coefficients table. We see that the test of significance of the model coefficients is also summarized in that table so we can see that there is evidence that the coefficient is significantly different to zero - as the Air Pollution increases so does Lung Cancer. It proves weak Positive Correlation between the number of Air Pollution Value and Lunge Cancer Death Ratio.