Lung cancer is caused by many factors, one of which is air pollution. The six common air pollutants are carbon monoxide, lead, ground-level ozone, nitrogen dioxide, sulfur dioxide and particulate matter. Particulate matter (PM), also known as particle pollution, is a complex mixture of extremely small particles and liquid droplets suspended in the air. Once inhaled, these particles can affect the heart and lungs and cause serious health effects. Fine particulate matter is reported as PM2.5: particles with a diameter of 2.5 micrometers or less, small enough to invade even the smallest airways.
In this project we compare PM2.5 levels with the count of lung cancer incidents for 7 cities: Atlanta, Detroit, Los Angeles, San Francisco, San Jose, Seattle and Pittsburgh.
The objective of this project is to analyze the cause-and-effect relationship between the independent variable (PM2.5) and the dependent variable (lung cancer count). The linear least-squares regression method is used to build a model that predicts the dependent variable from the independent one. Our assumption is that PM2.5 has an effect on the lung cancer rate.
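As a rough illustration of what least-squares regression computes, the sketch below fits a line to the Atlanta figures printed later in this report (yearly mean PM2.5 and lung cancer counts for 1999 to 2013) and compares the closed-form slope and intercept with the result of `lm()`. The variable names (`pm25`, `count`, `manual`, `from_lm`) are chosen here for illustration only.

```r
# Atlanta values transcribed from the tables printed later in this report.
pm25 <- c(20.91, 18.68, 17.11, 15.05, 14.97, 14.88, 15.14, 15.47,
          15.36, 12.97, 10.95, 12.06, 11.86, 10.37, 9.39)
count <- c(1293, 1346, 1330, 1344, 1322, 1340, 1338, 1353,
           1438, 1440, 1485, 1383, 1478, 1573, 1511)

# Closed-form least-squares estimates for the line count = a + b * pm25.
b <- sum((pm25 - mean(pm25)) * (count - mean(count))) /
  sum((pm25 - mean(pm25))^2)
a <- mean(count) - b * mean(pm25)

# The same estimates from R's built-in linear model.
fit <- lm(count ~ pm25)
manual <- c(intercept = a, slope = b)
from_lm <- unname(coef(fit))

print(round(manual, 3))
print(round(from_lm, 3))
```

Both routes minimize the same sum of squared residuals, so the two printed vectors should agree up to floating-point error.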
The datasets for this project are downloaded from the following sites for years 1999 to 2013:
| dplyr | tidyr |
|---|---|
| mutate | gather |
| filter | |
| arrange | |
| select | |
library(tidyr)
library(dplyr)
library(ggplot2)
# read the lung cancer incidence counts (one row per city, one column per year)
cancer_incidence_df = read.csv(file="DATA\\INCIDENCE_DATA\\INCIDENCE_DATA_CITIES.csv",
  header=TRUE, sep=",",stringsAsFactors = FALSE)
colnames(cancer_incidence_df) = gsub("\\.Count","",colnames(cancer_incidence_df))
print(cancer_incidence_df)
## Area X1999 X2000 X2001 X2002 X2003 X2004 X2005 X2006 X2007 X2008 X2009 X2010 X2011 X2012 X2013
## 1 Atlanta 1293 1346 1330 1344 1322 1340 1338 1353 1438 1440 1485 1383 1478 1573 1511
## 2 Detroit 3200 3110 3268 3203 3375 3213 3335 3311 3379 3410 3365 3328 3256 3147 3183
## 3 Los Angeles 4131 3986 4086 3962 3938 4003 3960 3864 3826 3901 3998 3782 3669 3552 3508
## 4 San Francisco-Oakland 2298 2339 2338 2282 2341 2162 2239 2230 2296 2136 2289 2227 2190 2230 2172
## 5 San Jose-Monterey 1023 954 951 989 1019 972 1008 990 1024 901 996 929 973 1042 985
## 6 Seattle-Puget Sound 2734 2604 2624 2701 2754 2713 2755 2831 2862 2853 2951 2971 2895 2933 2851
## 7 Pittsburgh 1231 1204 1328 1275 1301 1327 1223 1191 915 1283 1235 1206 1230 1111 1150
# reshape from wide (one column per year) to long (one row per city and year)
cancer_incidence_df2 = gather(cancer_incidence_df, Year, Count, X1999:X2013)
# strip the "X" prefix that read.csv adds to numeric column names
cancer_incidence_df2$Year = gsub("X","", cancer_incidence_df2$Year)
# remove thousands separators before converting the counts to numbers
cancer_incidence_df2$Count = gsub(",", "", cancer_incidence_df2$Count)
cancer_incidence_df2$Count = as.numeric(cancer_incidence_df2$Count)
colnames(cancer_incidence_df2) = c("City", "Year", "Count")
# shorten the registry area names so they match the pollution data
cancer_incidence_df2$City = gsub("San Francisco-Oakland", "San Francisco", cancer_incidence_df2$City)
cancer_incidence_df2$City = gsub("San Jose-Monterey", "San Jose", cancer_incidence_df2$City)
cancer_incidence_df2$City = gsub("Seattle-Puget Sound", "Seattle", cancer_incidence_df2$City)
cancer_incidence_df2 = arrange(cancer_incidence_df2, City)
head(cancer_incidence_df2)
## City Year Count
## 1 Atlanta 1999 1293
## 2 Atlanta 2000 1346
## 3 Atlanta 2001 1330
## 4 Atlanta 2002 1344
## 5 Atlanta 2003 1322
## 6 Atlanta 2004 1340
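`gather()` still works, but tidyr's documentation marks it as superseded by `pivot_longer()`. As a hedged sketch, the same wide-to-long reshaping could be written as follows. The data frame `wide` is a cut-down stand-in for `cancer_incidence_df` (two cities, three years, values taken from the printout above), and the `names_transform` argument assumes tidyr 1.1.0 or later.

```r
library(tidyr)

# Cut-down stand-in for the wide incidence table (values from the printout above).
wide <- data.frame(
  Area  = c("Atlanta", "Detroit"),
  X1999 = c(1293, 3200),
  X2000 = c(1346, 3110),
  X2001 = c(1330, 3268)
)

long <- pivot_longer(
  wide,
  cols            = starts_with("X"),        # every year column
  names_to        = "Year",
  names_prefix    = "X",                      # drop the "X" that read.csv added
  names_transform = list(Year = as.integer),  # store the year as a number
  values_to       = "Count"
)
names(long)[names(long) == "Area"] <- "City"
print(long)
```

Unlike `gather()`, `pivot_longer()` emits all years for the first city before moving on to the next, so the rows already come out grouped by city.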
# function to extract year from a date string such as "01/03/1999"
getYear = function(date) {
  return (as.numeric(gsub(".*/","",date)))
}
# function to load air pollution data for each city
getPollutionData = function(directoryName) {
  # pattern is a regular expression, so match a literal ".csv" suffix
  filenames <- list.files(directoryName, pattern="\\.csv$", full.names=TRUE)
  # read every file and stack the rows into one data frame
  ldf = lapply(filenames, read.csv)
  total = do.call(rbind, ldf)
  colnames(total) = gsub("\\.","_",colnames(total))
  # keep only the date, PM2.5 concentration, AQI value and metro area name
  cityPollution_df = select(total, matches("Date|Daily_Mean_PM2_5_Concentration|DAILY_AQI_VALUE|CBSA_NAME"))
  colnames(cityPollution_df) = c("Date", "PM25", "DAILY_AQI", "City")
  cityPollution_df = arrange(cityPollution_df, Date)
  cityPollution_df = mutate(cityPollution_df, Year=getYear(Date))
  return(cityPollution_df)
}
atlanta_df = getPollutionData("DATA\\POLLUTION_DATA\\01_ATLANTA")
nrow(atlanta_df)
## [1] 49946
head(atlanta_df)
## Date PM25 DAILY_AQI City Year
## 1 01/01/1999 21.4 71 Atlanta-Sandy Springs-Roswell, GA 1999
## 2 01/01/1999 26.9 82 Atlanta-Sandy Springs-Roswell, GA 1999
## 3 01/01/1999 15.5 58 Atlanta-Sandy Springs-Roswell, GA 1999
## 4 01/02/1999 16.6 60 Atlanta-Sandy Springs-Roswell, GA 1999
## 5 01/02/1999 18.1 64 Atlanta-Sandy Springs-Roswell, GA 1999
## 6 01/03/1999 15.5 58 Atlanta-Sandy Springs-Roswell, GA 1999
detroit_df = getPollutionData("DATA\\POLLUTION_DATA\\02_DETROIT")
nrow(detroit_df)
## [1] 30596
head(detroit_df)
## Date PM25 DAILY_AQI City Year
## 1 01/03/1999 3.5 15 Detroit-Warren-Dearborn, MI 1999
## 2 01/03/1999 7.2 30 Detroit-Warren-Dearborn, MI 1999
## 3 01/03/1999 7.6 32 Detroit-Warren-Dearborn, MI 1999
## 4 01/06/1999 10.2 43 Detroit-Warren-Dearborn, MI 1999
## 5 01/06/1999 14.2 55 Detroit-Warren-Dearborn, MI 1999
## 6 01/06/1999 15.5 58 Detroit-Warren-Dearborn, MI 1999
los_angeles_df = getPollutionData("DATA\\POLLUTION_DATA\\03_LOS_ANGELES")
nrow(los_angeles_df)
## [1] 51592
head(los_angeles_df)
## Date PM25 DAILY_AQI City Year
## 1 01/03/1999 41.8 117 Los Angeles-Long Beach-Anaheim, CA 1999
## 2 01/03/1999 23.0 74 Los Angeles-Long Beach-Anaheim, CA 1999
## 3 01/03/1999 16.3 60 Los Angeles-Long Beach-Anaheim, CA 1999
## 4 01/03/1999 17.0 61 Los Angeles-Long Beach-Anaheim, CA 1999
## 5 01/04/1999 8.9 37 Los Angeles-Long Beach-Anaheim, CA 1999
## 6 01/04/1999 18.5 64 Los Angeles-Long Beach-Anaheim, CA 1999
san_francisco_df = getPollutionData("DATA\\POLLUTION_DATA\\04_SAN_FRANCISCO")
nrow(san_francisco_df)
## [1] 29369
head(san_francisco_df)
## Date PM25 DAILY_AQI City Year
## 1 01/02/1999 12.4 52 San Francisco-Oakland-Hayward, CA 1999
## 2 01/06/1999 12.5 52 San Francisco-Oakland-Hayward, CA 1999
## 3 01/09/1999 25.1 78 San Francisco-Oakland-Hayward, CA 1999
## 4 01/13/1999 4.7 20 San Francisco-Oakland-Hayward, CA 1999
## 5 01/16/1999 4.3 18 San Francisco-Oakland-Hayward, CA 1999
## 6 01/20/1999 8.6 36 San Francisco-Oakland-Hayward, CA 1999
san_jose_df = getPollutionData("DATA\\POLLUTION_DATA\\05_SAN_JOSE")
nrow(san_jose_df)
## [1] 15585
head(san_jose_df)
## Date PM25 DAILY_AQI City Year
## 1 01/02/1999 6.7 28 San Jose-Sunnyvale-Santa Clara, CA 1999
## 2 01/06/1999 5.4 23 San Jose-Sunnyvale-Santa Clara, CA 1999
## 3 01/06/1999 63.0 155 San Jose-Sunnyvale-Santa Clara, CA 1999
## 4 01/09/1999 3.4 14 San Jose-Sunnyvale-Santa Clara, CA 1999
## 5 01/12/1999 31.0 91 San Jose-Sunnyvale-Santa Clara, CA 1999
## 6 01/13/1999 11.0 46 San Jose-Sunnyvale-Santa Clara, CA 1999
seattle_df = getPollutionData("DATA\\POLLUTION_DATA\\06_SEATTLE")
nrow(seattle_df)
## [1] 107560
head(seattle_df)
## Date PM25 DAILY_AQI City Year
## 1 01/01/1999 14.8 57 Seattle-Tacoma-Bellevue, WA 1999
## 2 01/01/1999 15.6 58 Seattle-Tacoma-Bellevue, WA 1999
## 3 01/01/1999 11.9 50 Seattle-Tacoma-Bellevue, WA 1999
## 4 01/01/1999 17.9 63 Seattle-Tacoma-Bellevue, WA 1999
## 5 01/02/1999 18.8 65 Seattle-Tacoma-Bellevue, WA 1999
## 6 01/02/1999 20.4 68 Seattle-Tacoma-Bellevue, WA 1999
pittsburgh_df = getPollutionData("DATA\\POLLUTION_DATA\\07_PITTSBURGH")
nrow(pittsburgh_df)
## [1] 51934
head(pittsburgh_df)
## Date PM25 DAILY_AQI City Year
## 1 01/08/1999 14.5 56 Pittsburgh, PA 1999
## 2 01/09/1999 9.4 39 Pittsburgh, PA 1999
## 3 01/10/1999 11.6 48 Pittsburgh, PA 1999
## 4 01/15/1999 15.4 58 Pittsburgh, PA 1999
## 5 01/18/1999 5.0 21 Pittsburgh, PA 1999
## 6 01/23/1999 13.3 54 Pittsburgh, PA 1999
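The `Date` values above appear in `MM/DD/YYYY` form. If that column is held as plain text, sorting it orders rows by month before year, not chronologically. A minimal sketch of parsing such strings with base R before sorting or extracting the year follows; the sample values in `dates` are hypothetical.

```r
# Hypothetical sample of Date values in the MM/DD/YYYY form shown above.
dates <- c("01/03/2000", "12/31/1999", "01/01/1999")

# Parse into Date objects so that ordering is chronological.
parsed <- as.Date(dates, format = "%m/%d/%Y")

# Extract the calendar year as an integer.
years <- as.integer(format(parsed, "%Y"))

print(sort(dates))    # text sort: month first
print(sort(parsed))   # chronological sort
print(years)
```

Parsing also turns malformed dates into `NA`, which makes them easier to spot than with a pattern-based year extraction.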
# function to calculate the yearly average PM2.5 value for one city
getYearlyMean = function(CityPollutionData,mycity) {
  averagePollutionYears = NULL
  for (i in 1999:2013) {
    # keep only the readings for year i and average them
    myData = filter (CityPollutionData, Year==i)
    averagePollutionYears = rbind(averagePollutionYears,c(i,round(mean(myData$PM25),2)))
  }
  averagePollutionYears = as.data.frame(averagePollutionYears)
  averagePollutionYears$city = mycity
  colnames(averagePollutionYears) = c("Year","Mean","City")
  return(averagePollutionYears)
}
atlanta_Pollution_df = getYearlyMean(atlanta_df,"Atlanta")
detroit_Pollution_df = getYearlyMean(detroit_df,"Detroit")
los_angeles_Pollution_df = getYearlyMean(los_angeles_df,"Los Angeles")
san_francisco_Pollution_df = getYearlyMean(san_francisco_df,"San Francisco")
san_jose_Pollution_df = getYearlyMean(san_jose_df,"San Jose")
seattle_Pollution_df = getYearlyMean(seattle_df,"Seattle")
pittsburgh_Pollution_df = getYearlyMean(pittsburgh_df,"Pittsburgh")
atlanta_Pollution_df;
## Year Mean City
## 1 1999 20.91 Atlanta
## 2 2000 18.68 Atlanta
## 3 2001 17.11 Atlanta
## 4 2002 15.05 Atlanta
## 5 2003 14.97 Atlanta
## 6 2004 14.88 Atlanta
## 7 2005 15.14 Atlanta
## 8 2006 15.47 Atlanta
## 9 2007 15.36 Atlanta
## 10 2008 12.97 Atlanta
## 11 2009 10.95 Atlanta
## 12 2010 12.06 Atlanta
## 13 2011 11.86 Atlanta
## 14 2012 10.37 Atlanta
## 15 2013 9.39 Atlanta
detroit_Pollution_df;
## Year Mean City
## 1 1999 15.60 Detroit
## 2 2000 15.91 Detroit
## 3 2001 16.28 Detroit
## 4 2002 15.74 Detroit
## 5 2003 15.51 Detroit
## 6 2004 13.91 Detroit
## 7 2005 16.25 Detroit
## 8 2006 13.56 Detroit
## 9 2007 13.76 Detroit
## 10 2008 11.90 Detroit
## 11 2009 10.62 Detroit
## 12 2010 10.41 Detroit
## 13 2011 9.85 Detroit
## 14 2012 9.91 Detroit
## 15 2013 9.61 Detroit
los_angeles_Pollution_df;
## Year Mean City
## 1 1999 21.41 Los Angeles
## 2 2000 19.86 Los Angeles
## 3 2001 21.21 Los Angeles
## 4 2002 19.29 Los Angeles
## 5 2003 17.75 Los Angeles
## 6 2004 16.94 Los Angeles
## 7 2005 15.58 Los Angeles
## 8 2006 14.37 Los Angeles
## 9 2007 14.67 Los Angeles
## 10 2008 14.93 Los Angeles
## 11 2009 14.27 Los Angeles
## 12 2010 12.65 Los Angeles
## 13 2011 13.72 Los Angeles
## 14 2012 12.53 Los Angeles
## 15 2013 12.20 Los Angeles
san_francisco_Pollution_df;
## Year Mean City
## 1 1999 14.74 San Francisco
## 2 2000 11.86 San Francisco
## 3 2001 12.36 San Francisco
## 4 2002 13.75 San Francisco
## 5 2003 9.74 San Francisco
## 6 2004 10.99 San Francisco
## 7 2005 9.92 San Francisco
## 8 2006 9.71 San Francisco
## 9 2007 8.99 San Francisco
## 10 2008 10.46 San Francisco
## 11 2009 9.88 San Francisco
## 12 2010 8.52 San Francisco
## 13 2011 9.18 San Francisco
## 14 2012 7.68 San Francisco
## 15 2013 10.17 San Francisco
san_jose_Pollution_df;
## Year Mean City
## 1 1999 16.18 San Jose
## 2 2000 13.88 San Jose
## 3 2001 12.50 San Jose
## 4 2002 12.80 San Jose
## 5 2003 10.92 San Jose
## 6 2004 10.99 San Jose
## 7 2005 11.47 San Jose
## 8 2006 10.67 San Jose
## 9 2007 9.97 San Jose
## 10 2008 10.81 San Jose
## 11 2009 9.03 San Jose
## 12 2010 7.56 San Jose
## 13 2011 9.09 San Jose
## 14 2012 7.40 San Jose
## 15 2013 8.85 San Jose
seattle_Pollution_df;
## Year Mean City
## 1 1999 10.02 Seattle
## 2 2000 11.53 Seattle
## 3 2001 10.28 Seattle
## 4 2002 9.95 Seattle
## 5 2003 9.24 Seattle
## 6 2004 9.52 Seattle
## 7 2005 8.88 Seattle
## 8 2006 8.32 Seattle
## 9 2007 7.93 Seattle
## 10 2008 6.78 Seattle
## 11 2009 7.46 Seattle
## 12 2010 6.06 Seattle
## 13 2011 7.08 Seattle
## 14 2012 6.63 Seattle
## 15 2013 7.56 Seattle
pittsburgh_Pollution_df;
## Year Mean City
## 1 1999 16.16 Pittsburgh
## 2 2000 15.92 Pittsburgh
## 3 2001 17.39 Pittsburgh
## 4 2002 15.78 Pittsburgh
## 5 2003 15.78 Pittsburgh
## 6 2004 15.35 Pittsburgh
## 7 2005 16.70 Pittsburgh
## 8 2006 15.59 Pittsburgh
## 9 2007 16.76 Pittsburgh
## 10 2008 14.71 Pittsburgh
## 11 2009 13.02 Pittsburgh
## 12 2010 13.55 Pittsburgh
## 13 2011 11.84 Pittsburgh
## 14 2012 10.66 Pittsburgh
## 15 2013 10.29 Pittsburgh
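The per-city, per-year averaging above can also be done in one pass once the daily readings are stacked into a single table. The following is a sketch using base R's `aggregate()` on a small hypothetical data frame (`daily`, with made-up readings); it is not the code used to produce the tables above.

```r
# Hypothetical daily readings for two cities (made-up numbers).
daily <- data.frame(
  City = c("A", "A", "A", "B", "B"),
  Year = c(1999, 1999, 2000, 1999, 1999),
  PM25 = c(10, 20, 12, 8, 10)
)

# Mean PM2.5 for every City/Year combination, rounded to 2 decimals as above.
yearly <- aggregate(PM25 ~ City + Year, data = daily,
                    FUN = function(x) round(mean(x), 2))
names(yearly)[names(yearly) == "PM25"] <- "Mean"
print(yearly)
```

Grouping by both columns removes the need for a hard-coded year range and for one function call per city.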
# stack the yearly means of all seven cities into one data frame
pollutionMaster_df = rbind(atlanta_Pollution_df, detroit_Pollution_df,
  los_angeles_Pollution_df, san_francisco_Pollution_df,
  san_jose_Pollution_df, seattle_Pollution_df,
  pittsburgh_Pollution_df)
pollutionMaster_df = arrange(pollutionMaster_df, City)
head(pollutionMaster_df, 10)
## Year Mean City
## 1 1999 20.91 Atlanta
## 2 2000 18.68 Atlanta
## 3 2001 17.11 Atlanta
## 4 2002 15.05 Atlanta
## 5 2003 14.97 Atlanta
## 6 2004 14.88 Atlanta
## 7 2005 15.14 Atlanta
## 8 2006 15.47 Atlanta
## 9 2007 15.36 Atlanta
## 10 2008 12.97 Atlanta
tail(pollutionMaster_df, 10)
## Year Mean City
## 96 2004 9.52 Seattle
## 97 2005 8.88 Seattle
## 98 2006 8.32 Seattle
## 99 2007 7.93 Seattle
## 100 2008 6.78 Seattle
## 101 2009 7.46 Seattle
## 102 2010 6.06 Seattle
## 103 2011 7.08 Seattle
## 104 2012 6.63 Seattle
## 105 2013 7.56 Seattle
# join the pollution and incidence tables on City and Year
cancer_airpollution_merged_df = merge (x = pollutionMaster_df, y = cancer_incidence_df2, by = c("City","Year"))
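Because `merge()` performs an inner join by default, any City/Year pair that is spelled or typed differently in the two tables is silently dropped. A small sketch of a sanity check follows, using hypothetical two-row tables (`pollution`, `incidence`) in place of `pollutionMaster_df` and `cancer_incidence_df2`.

```r
# Hypothetical stand-ins for the pollution and incidence tables.
pollution <- data.frame(City = c("Atlanta", "Atlanta"),
                        Year = c(1999, 2000),
                        Mean = c(20.91, 18.68))
incidence <- data.frame(City = c("Atlanta", "Atlanta"),
                        Year = c("1999", "2000"),   # years stored as text
                        Count = c(1293, 1346))

# Make the key columns the same type before joining.
incidence$Year <- as.integer(incidence$Year)
pollution$Year <- as.integer(pollution$Year)

merged <- merge(x = pollution, y = incidence, by = c("City", "Year"))

# An inner join should keep every row if all keys match.
stopifnot(nrow(merged) == nrow(pollution))
print(merged)
```

For the full data set the equivalent check would be that the merged table has 7 cities times 15 years, that is 105 rows.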
From the lung cancer incidence graphs we notice the following:
- The highest number of cases is reported in Los Angeles.
- The lowest number of cases is reported in San Jose.
- Lung cancer cases in Seattle and Atlanta appear to be increasing.
- Lung cancer cases in San Francisco, San Jose, Pittsburgh and Detroit are roughly stable.
- Lung cancer cases in Los Angeles are decreasing.
#Bar Graph
ggplot(data=cancer_incidence_df2, aes(x=Year, y=Count, fill=City)) +
geom_bar(stat="identity", position="dodge",width=0.5) +
ylab("Count of Lung Cancer") + ggtitle("Count of Lung Cancer")
#Line Graph
ggplot(data=cancer_incidence_df2, aes(x=Year, y=Count, group=City, color=City)) +
geom_line() + geom_point() +
ylab("# of Lung Cancer cases") + ggtitle("Lung Cancer Statistics")
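# histogram of daily PM2.5 readings for one city, shaded by bin count
myHistogramPlots = function(city_df, cityName, lowColor, highColor) {
  # after_stat(count) replaces the deprecated ..count.. notation (ggplot2 >= 3.3.0)
  plot = ggplot(city_df, aes(x=PM25)) +
    geom_histogram (aes(fill = after_stat(count))) +
    ggtitle(cityName) + scale_fill_gradient("Count", low = lowColor, high = highColor)
  return(plot)
}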
atlanta_histogram = myHistogramPlots(atlanta_df, "Atlanta", "green", "red")
detroit_histogram = myHistogramPlots(detroit_df, "Detroit", "orange", "blue")
los_angeles_histogram = myHistogramPlots(los_angeles_df, "Los Angeles", "yellow", "purple")
san_francisco_histogram = myHistogramPlots(san_francisco_df, "San Francisco", "cyan", "navy")
san_jose_histogram = myHistogramPlots(san_jose_df, "San Jose", "yellow", "maroon")
seattle_histogram = myHistogramPlots(seattle_df, "Seattle", "blue", "orange")
pittsburgh_histogram = myHistogramPlots(pittsburgh_df, "Pittsburgh", "limegreen", "deeppink")
atlanta_histogram
detroit_histogram
los_angeles_histogram
san_francisco_histogram
san_jose_histogram
seattle_histogram
pittsburgh_histogram
# print the correlation, scatter plot and linear model summary for one city
findStatsFunction<-function(cityName) {
  byCity = filter(cancer_airpollution_merged_df, City == cityName)
  corr = cor(byCity$Mean,byCity$Count)
  print (paste0("cor=",round(corr,2)))
  # paste0 takes no sep argument, so the title is built from the two strings only
  myTitle = paste0(cityName, " - PM2.5 vs Lung Cancer")
  plot = ggplot(byCity, aes(Mean,Count)) + geom_point(colour="red") +
    xlab("PM2.5") + ylab("Lung Cancer") + labs(title = myTitle)
  print(plot)
  # least-squares fit of lung cancer count on yearly mean PM2.5
  m = lm (Count~Mean, data=byCity)
  s = summary(m)
  print(s)
  return (byCity)
}
atlanta_statistics = findStatsFunction("Atlanta")
## [1] "cor=-0.83"
##
## Call:
## lm(formula = Count ~ Mean, data = byCity)
##
## Residuals:
## Min 1Q Median 3Q Max
## -65.473 -40.778 3.853 32.070 87.389
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1713.496 59.745 28.680 3.87e-13 ***
## Mean -21.975 4.074 -5.394 0.000122 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 48 on 13 degrees of freedom
## Multiple R-squared: 0.6911, Adjusted R-squared: 0.6674
## F-statistic: 29.09 on 1 and 13 DF, p-value: 0.0001225
detroit_statistics = findStatsFunction("Detroit")
## [1] "cor=-0.04"
##
## Call:
## lm(formula = Count ~ Mean, data = byCity)
##
## Residuals:
## Min 1Q Median 3Q Max
## -158.629 -67.452 -0.131 78.043 135.978
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3290.026 133.484 24.647 2.68e-12 ***
## Mean -1.345 9.893 -0.136 0.894
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 96.75 on 13 degrees of freedom
## Multiple R-squared: 0.00142, Adjusted R-squared: -0.07539
## F-statistic: 0.01848 on 1 and 13 DF, p-value: 0.8939
los_angeles_statistics = findStatsFunction("Los Angeles")
## [1] "cor=0.82"
##
## Call:
## lm(formula = Count ~ Mean, data = byCity)
##
## Residuals:
## Min 1Q Median 3Q Max
## -185.052 -69.007 0.919 73.192 206.723
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3114.142 152.928 20.363 3.03e-11 ***
## Mean 47.452 9.339 5.081 0.000211 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 109.6 on 13 degrees of freedom
## Multiple R-squared: 0.6651, Adjusted R-squared: 0.6393
## F-statistic: 25.81 on 1 and 13 DF, p-value: 0.0002107
san_francisco_statistics = findStatsFunction("San Francisco")
## [1] "cor=0.34"
##
## Call:
## lm(formula = Count ~ Mean, data = byCity)
##
## Residuals:
## Min 1Q Median 3Q Max
## -114.433 -28.344 -3.408 54.275 99.142
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2125.854 96.787 21.964 1.16e-11 ***
## Mean 11.910 9.051 1.316 0.211
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 65.27 on 13 degrees of freedom
## Multiple R-squared: 0.1175, Adjusted R-squared: 0.04965
## F-statistic: 1.731 on 1 and 13 DF, p-value: 0.211
san_jose_statistics = findStatsFunction("San Jose")
## [1] "cor=0.02"
##
## Call:
## lm(formula = Count ~ Mean, data = byCity)
##
## Residuals:
## Min 1Q Median 3Q Max
## -82.734 -21.380 4.475 29.613 59.621
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 979.4397 49.7087 19.704 4.59e-11 ***
## Mean 0.3973 4.4996 0.088 0.931
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 39.85 on 13 degrees of freedom
## Multiple R-squared: 0.0005992, Adjusted R-squared: -0.07628
## F-statistic: 0.007795 on 1 and 13 DF, p-value: 0.931
seattle_statistics = findStatsFunction("Seattle")
## [1] "cor=-0.95"
##
## Call:
## lm(formula = Count ~ Mean, data = byCity)
##
## Residuals:
## Min 1Q Median 3Q Max
## -66.465 -16.182 1.919 14.759 78.394
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3386.679 52.246 64.82 < 2e-16 ***
## Mean -68.911 6.061 -11.37 3.98e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 35.98 on 13 degrees of freedom
## Multiple R-squared: 0.9086, Adjusted R-squared: 0.9016
## F-statistic: 129.3 on 1 and 13 DF, p-value: 3.982e-08
pittsburgh_statistics = findStatsFunction("Pittsburgh")
## [1] "cor=0.17"
##
## Call:
## lm(formula = Count ~ Mean, data = byCity)
##
## Residuals:
## Min 1Q Median 3Q Max
## -315.892 -24.860 4.874 60.142 107.308
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1097.767 184.702 5.943 4.88e-05 ***
## Mean 7.943 12.485 0.636 0.536
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 105 on 13 degrees of freedom
## Multiple R-squared: 0.03019, Adjusted R-squared: -0.04441
## F-statistic: 0.4047 on 1 and 13 DF, p-value: 0.5357
| City | Linear Regression Equation (y = lung cancer count, x = mean PM2.5) | Correlation Coefficient | R-squared | Description |
|---|---|---|---|---|
| Atlanta | y = 1713.496 + (-21.975) x | -0.83 | 0.69 | Strong negative correlation. The linear model fits the data |
| Detroit | y = 3290.026 + (-1.345) x | -0.04 | 0.00 | Very weak correlation. The linear model doesn’t fit the data; the relationship may be non-linear |
| Los Angeles | y = 3114.142 + (47.452) x | 0.82 | 0.67 | Strong positive correlation. The linear model fits the data |
| San Francisco | y = 2125.854 + (11.910) x | 0.34 | 0.12 | Weak correlation. The linear model doesn’t fit the data; the relationship may be non-linear |
| San Jose | y = 979.4397 + (0.3973) x | 0.02 | 0.00 | Weak correlation. The linear model doesn’t fit the data |
| Seattle | y = 3386.679 + (-68.911) x | -0.95 | 0.91 | Strong negative correlation. The linear model fits the data |
| Pittsburgh | y = 1097.767 + (7.943) x | 0.17 | 0.03 | Weak correlation. The linear model doesn’t fit the data. There is an outlier (915 cases in 2007) |
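The figures in the table can be pulled directly from the fitted model objects instead of being copied by hand. The sketch below assumes a merged data frame with `City`, `Mean` and `Count` columns and uses a hypothetical helper, `fit_summary()`; it is demonstrated on the Atlanta and Seattle values printed earlier in this report.

```r
# Atlanta and Seattle values transcribed from the printouts earlier in this report.
merged <- data.frame(
  City = rep(c("Atlanta", "Seattle"), each = 15),
  Mean = c(20.91, 18.68, 17.11, 15.05, 14.97, 14.88, 15.14, 15.47,
           15.36, 12.97, 10.95, 12.06, 11.86, 10.37, 9.39,
           10.02, 11.53, 10.28, 9.95, 9.24, 9.52, 8.88, 8.32,
           7.93, 6.78, 7.46, 6.06, 7.08, 6.63, 7.56),
  Count = c(1293, 1346, 1330, 1344, 1322, 1340, 1338, 1353,
            1438, 1440, 1485, 1383, 1478, 1573, 1511,
            2734, 2604, 2624, 2701, 2754, 2713, 2755, 2831,
            2862, 2853, 2951, 2971, 2895, 2933, 2851)
)

# Hypothetical helper: one summary row per city.
fit_summary <- function(df) {
  fit <- lm(Count ~ Mean, data = df)
  data.frame(
    City      = df$City[1],
    Intercept = round(unname(coef(fit)[1]), 3),
    Slope     = round(unname(coef(fit)[2]), 3),
    Cor       = round(cor(df$Mean, df$Count), 2),
    R2        = round(summary(fit)$r.squared, 2)
  )
}

# Fit each city separately and stack the rows into one table.
result <- do.call(rbind, lapply(split(merged, merged$City), fit_summary))
rownames(result) <- NULL
print(result)
```

For a simple regression with one predictor, R-squared is the square of the correlation coefficient, which gives a quick consistency check on each row.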
Based on the analysis, air pollution and lung cancer incidence data for some of the cities (Atlanta, Los Angeles and Seattle) fit the linear regression model, although the relationship is positive only in Los Angeles and negative in Atlanta and Seattle. The data for the other cities, such as San Francisco, San Jose and Detroit, show weak correlation and low R-squared values, which indicates that the linear model is not a good model for predicting lung cancer from ambient air pollution (PM2.5).
So, we conclude that there might be other factors, such as tobacco smoking and occupational pollution exposure, that contribute to lung cancer incidence. The excess lung cancer risk associated with ambient air pollution is small compared with that from tobacco smoking. It may also be possible to predict the incidence of lung cancer using models based on transformation methods.