Final_Project

Lung Cancer and Air Pollution

(1) Introduction:

Lung cancer has many causes, and air pollution is one of them. The six common air pollutants are carbon monoxide, lead, ground-level ozone, nitrogen dioxide, sulfur dioxide and particulate matter. Particulate matter (PM), also known as particle pollution, is a complex mixture of extremely small particles and liquid droplets suspended in the air. Once inhaled, these particles can affect the heart and lungs and cause serious health effects. Fine particulate matter is reported as PM2.5: particles with a diameter of 2.5 micrometers or less. They are small enough to reach even the smallest airways.

In this project we use PM2.5 as the air pollution measure and compare it with the count of lung cancer incidents for 7 cities:

Atlanta

Detroit

Los Angeles

San Francisco

San Jose

Seattle

Pittsburgh

The objective of this project is to analyze the cause-and-effect relationship between the independent variable (PM2.5) and the dependent variable (lung cancer count). The linear least squares regression method is used to build a model that predicts the value of the dependent variable. Our assumption is that PM2.5 has an effect on the lung cancer count.
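For reference, the regression described above can be written out as follows. This is the standard simple least-squares formulation, not something taken from the original analysis: the symbols are notation introduced here, with x_i the yearly mean PM2.5 and y_i the lung cancer count for year i in one city, and n = 15 years (1999 to 2013).

```latex
\begin{align}
y_i &= \beta_0 + \beta_1 x_i + \varepsilon_i, \qquad i = 1, \dots, n \\
\hat{\beta}_1 &= \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} \\
\hat{\beta}_0 &= \bar{y} - \hat{\beta}_1 \bar{x} \\
r &= \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \, \sum_{i=1}^{n} (y_i - \bar{y})^2}}, \qquad R^2 = r^2
\end{align}
```

The last line holds only for a model with a single predictor, which is the case for every fit in this report.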

(2) Data Sources:

The datasets for this project are downloaded from the following sites for years 1999 to 2013:

Lung Cancer Incidence (CDC)

Air Pollution (EPA) (Atlanta, Detroit, Los Angeles, San Francisco, San Jose, Seattle)

Air Pollution (Pennsylvania Gov Health and Statistics) (Pittsburgh)

(3) Libraries:

(3.1) The following dplyr/tidyr functions are used in this project

dplyr	tidyr
mutate	gather
filter
arrange
select

library(tidyr)
library(dplyr)
library(ggplot2)

(4) Load, Transform and Clean Data:

(4.1) Lung Cancer Incidence and Air Pollution Data for each city is loaded for each year from 1999 to 2013.

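# read the incidence table (one row per city, one column per year);
# paths are relative to the current working directory
cancer_incidence_df = read.csv(file="DATA\\INCIDENCE_DATA\\INCIDENCE_DATA_CITIES.csv", 
                               header=TRUE, sep=",", stringsAsFactors = FALSE)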
colnames(cancer_incidence_df) = gsub("\\.Count","",colnames(cancer_incidence_df))
print(cancer_incidence_df)

##                    Area X1999 X2000 X2001 X2002 X2003 X2004 X2005 X2006 X2007 X2008 X2009 X2010 X2011 X2012 X2013
## 1               Atlanta  1293  1346  1330  1344  1322  1340  1338  1353  1438  1440  1485  1383  1478  1573  1511
## 2               Detroit  3200  3110  3268  3203  3375  3213  3335  3311  3379  3410  3365  3328  3256  3147  3183
## 3           Los Angeles  4131  3986  4086  3962  3938  4003  3960  3864  3826  3901  3998  3782  3669  3552  3508
## 4 San Francisco-Oakland  2298  2339  2338  2282  2341  2162  2239  2230  2296  2136  2289  2227  2190  2230  2172
## 5     San Jose-Monterey  1023   954   951   989  1019   972  1008   990  1024   901   996   929   973  1042   985
## 6   Seattle-Puget Sound  2734  2604  2624  2701  2754  2713  2755  2831  2862  2853  2951  2971  2895  2933  2851
## 7            Pittsburgh  1231  1204  1328  1275  1301  1327  1223  1191   915  1283  1235  1206  1230  1111  1150

cancer_incidence_df2 = gather(cancer_incidence_df, Year, Count, X1999:X2013)
cancer_incidence_df2$Year = gsub("X","", cancer_incidence_df2$Year)
cancer_incidence_df2$Count = gsub(",", "", cancer_incidence_df2$Count)
cancer_incidence_df2$Count = as.numeric(cancer_incidence_df2$Count)
colnames(cancer_incidence_df2) = c("City", "Year", "Count")
cancer_incidence_df2$City = gsub("San Francisco-Oakland", "San Francisco", cancer_incidence_df2$City)  
cancer_incidence_df2$City = gsub("San Jose-Monterey", "San Jose", cancer_incidence_df2$City)  
cancer_incidence_df2$City = gsub("Seattle-Puget Sound", "Seattle", cancer_incidence_df2$City)
cancer_incidence_df2 = arrange(cancer_incidence_df2, City)
head(cancer_incidence_df2)

##      City Year Count
## 1 Atlanta 1999  1293
## 2 Atlanta 2000  1346
## 3 Atlanta 2001  1330
## 4 Atlanta 2002  1344
## 5 Atlanta 2003  1322
## 6 Atlanta 2004  1340
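The wide-to-long reshaping step above can also be sketched with pivot_longer(), which newer tidyr releases (1.0.0 and later) recommend over gather(). This is an alternative sketch, not the code used in this report; wide_df and long_df are hypothetical names, and the values are a small excerpt of the incidence table printed above.

```r
library(tidyr)

# Two cities and three years copied from the incidence table above
wide_df <- data.frame(
  Area  = c("Atlanta", "Detroit"),
  X1999 = c(1293, 3200),
  X2000 = c(1346, 3110),
  X2001 = c(1330, 3268),
  stringsAsFactors = FALSE
)

# Turn every column except Area into (Year, Count) pairs;
# names_prefix strips the leading "X" that read.csv adds to numeric column names
long_df <- pivot_longer(
  wide_df,
  cols = -Area,
  names_to = "Year",
  names_prefix = "X",
  values_to = "Count"
)

# Store the year as a number rather than a string
long_df$Year <- as.integer(long_df$Year)
print(long_df)
```

Doing the prefix removal and the type conversion at reshape time avoids the separate gsub() calls on the Year column.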

# function to extract year from date
getYear = function(date) {
  return (as.numeric(gsub(".*/","",date)))
}

# function to load air pollution data for each city
getPollutionData = function(directoryName) {
  # pattern is a regular expression, so match a literal ".csv" at the end of the name
  filenames <- list.files(directoryName, pattern="\\.csv$", full.names=TRUE)
  ldf = lapply(filenames, read.csv)

  # stack the per-file data frames into a single data frame
  total = do.call(rbind, ldf)
  colnames(total) = gsub("\\.","_",colnames(total))
  cityPollution_df = select(total,  matches("Date|Daily_Mean_PM2_5_Concentration|DAILY_AQI_VALUE|CBSA_NAME"))
  colnames(cityPollution_df) = c("Date", "PM25", "DAILY_AQI", "City")
  cityPollution_df = arrange(cityPollution_df, Date)
  cityPollution_df = mutate(cityPollution_df, Year=getYear(cityPollution_df$Date))
  return(cityPollution_df)
}

atlanta_df = getPollutionData("DATA\\POLLUTION_DATA\\01_ATLANTA")
nrow(atlanta_df)

## [1] 49946

head(atlanta_df)

##         Date PM25 DAILY_AQI                              City Year
## 1 01/01/1999 21.4        71 Atlanta-Sandy Springs-Roswell, GA 1999
## 2 01/01/1999 26.9        82 Atlanta-Sandy Springs-Roswell, GA 1999
## 3 01/01/1999 15.5        58 Atlanta-Sandy Springs-Roswell, GA 1999
## 4 01/02/1999 16.6        60 Atlanta-Sandy Springs-Roswell, GA 1999
## 5 01/02/1999 18.1        64 Atlanta-Sandy Springs-Roswell, GA 1999
## 6 01/03/1999 15.5        58 Atlanta-Sandy Springs-Roswell, GA 1999

detroit_df = getPollutionData("DATA\\POLLUTION_DATA\\02_DETROIT")
nrow(detroit_df)

## [1] 30596

head(detroit_df)

##         Date PM25 DAILY_AQI                        City Year
## 1 01/03/1999  3.5        15 Detroit-Warren-Dearborn, MI 1999
## 2 01/03/1999  7.2        30 Detroit-Warren-Dearborn, MI 1999
## 3 01/03/1999  7.6        32 Detroit-Warren-Dearborn, MI 1999
## 4 01/06/1999 10.2        43 Detroit-Warren-Dearborn, MI 1999
## 5 01/06/1999 14.2        55 Detroit-Warren-Dearborn, MI 1999
## 6 01/06/1999 15.5        58 Detroit-Warren-Dearborn, MI 1999

los_angeles_df = getPollutionData("DATA\\POLLUTION_DATA\\03_LOS_ANGELES")
nrow(los_angeles_df)

## [1] 51592

head(los_angeles_df)

##         Date PM25 DAILY_AQI                               City Year
## 1 01/03/1999 41.8       117 Los Angeles-Long Beach-Anaheim, CA 1999
## 2 01/03/1999 23.0        74 Los Angeles-Long Beach-Anaheim, CA 1999
## 3 01/03/1999 16.3        60 Los Angeles-Long Beach-Anaheim, CA 1999
## 4 01/03/1999 17.0        61 Los Angeles-Long Beach-Anaheim, CA 1999
## 5 01/04/1999  8.9        37 Los Angeles-Long Beach-Anaheim, CA 1999
## 6 01/04/1999 18.5        64 Los Angeles-Long Beach-Anaheim, CA 1999

san_francisco_df = getPollutionData("DATA\\POLLUTION_DATA\\04_SAN_FRANCISCO")
nrow(san_francisco_df)

## [1] 29369

head(san_francisco_df)

##         Date PM25 DAILY_AQI                              City Year
## 1 01/02/1999 12.4        52 San Francisco-Oakland-Hayward, CA 1999
## 2 01/06/1999 12.5        52 San Francisco-Oakland-Hayward, CA 1999
## 3 01/09/1999 25.1        78 San Francisco-Oakland-Hayward, CA 1999
## 4 01/13/1999  4.7        20 San Francisco-Oakland-Hayward, CA 1999
## 5 01/16/1999  4.3        18 San Francisco-Oakland-Hayward, CA 1999
## 6 01/20/1999  8.6        36 San Francisco-Oakland-Hayward, CA 1999

san_jose_df = getPollutionData("DATA\\POLLUTION_DATA\\05_SAN_JOSE")
nrow(san_jose_df)

## [1] 15585

head(san_jose_df)

##         Date PM25 DAILY_AQI                               City Year
## 1 01/02/1999  6.7        28 San Jose-Sunnyvale-Santa Clara, CA 1999
## 2 01/06/1999  5.4        23 San Jose-Sunnyvale-Santa Clara, CA 1999
## 3 01/06/1999 63.0       155 San Jose-Sunnyvale-Santa Clara, CA 1999
## 4 01/09/1999  3.4        14 San Jose-Sunnyvale-Santa Clara, CA 1999
## 5 01/12/1999 31.0        91 San Jose-Sunnyvale-Santa Clara, CA 1999
## 6 01/13/1999 11.0        46 San Jose-Sunnyvale-Santa Clara, CA 1999

seattle_df = getPollutionData("DATA\\POLLUTION_DATA\\06_SEATTLE")
nrow(seattle_df)

## [1] 107560

head(seattle_df)

##         Date PM25 DAILY_AQI                        City Year
## 1 01/01/1999 14.8        57 Seattle-Tacoma-Bellevue, WA 1999
## 2 01/01/1999 15.6        58 Seattle-Tacoma-Bellevue, WA 1999
## 3 01/01/1999 11.9        50 Seattle-Tacoma-Bellevue, WA 1999
## 4 01/01/1999 17.9        63 Seattle-Tacoma-Bellevue, WA 1999
## 5 01/02/1999 18.8        65 Seattle-Tacoma-Bellevue, WA 1999
## 6 01/02/1999 20.4        68 Seattle-Tacoma-Bellevue, WA 1999

pittsburgh_df = getPollutionData("DATA\\POLLUTION_DATA\\07_PITTSBURGH")
nrow(pittsburgh_df)

## [1] 51934

head(pittsburgh_df)

##         Date PM25 DAILY_AQI           City Year
## 1 01/08/1999 14.5        56 Pittsburgh, PA 1999
## 2 01/09/1999  9.4        39 Pittsburgh, PA 1999
## 3 01/10/1999 11.6        48 Pittsburgh, PA 1999
## 4 01/15/1999 15.4        58 Pittsburgh, PA 1999
## 5 01/18/1999  5.0        21 Pittsburgh, PA 1999
## 6 01/23/1999 13.3        54 Pittsburgh, PA 1999
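The year extraction in getYear relies on the year being the text after the last slash. A minimal alternative sketch, assuming the Date column always uses the MM/DD/YYYY layout seen in the output above, is to parse the strings into Date objects first. The variable names here are hypothetical.

```r
# Date strings in the MM/DD/YYYY layout shown in the output above
dates_chr <- c("01/03/1999", "12/31/2013", "01/01/2000")

# Parse to Date objects; the format string is an assumption based on that layout
dates <- as.Date(dates_chr, format = "%m/%d/%Y")

# Extract the year as a number
years <- as.numeric(format(dates, "%Y"))
print(years)

# Sorting parsed dates is chronological; sorting the raw strings is not,
# because strings are compared character by character starting with the month
print(sort(dates))
print(sort(dates_chr))
```

Parsing also turns malformed dates into NA, which makes them easy to find with is.na() before the yearly averages are computed.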

(5) Data Preparation:

(5.1) The air pollution data is recorded daily, while the lung cancer incidence data is available only per year. To put both on the same scale, we take the average PM2.5 value for each year from 1999 to 2013 for each city.

# function to calculate average value for air pollution
getYearlyMean = function(CityPollutionData,mycity) {
  averagePollutionYears = NULL
  
  for (i in 1999:2013) {
    myData = filter (CityPollutionData, Year==i)
    averagePollutionYears = rbind(averagePollutionYears,c(i,round(mean(myData$PM25),2)))
  }
  
  averagePollutionYears = as.data.frame(averagePollutionYears)
  averagePollutionYears$city = mycity
  colnames(averagePollutionYears) = c("Year","Mean","City")
  return(averagePollutionYears)
}

atlanta_Pollution_df = getYearlyMean(atlanta_df,"Atlanta")
detroit_Pollution_df = getYearlyMean(detroit_df,"Detroit")
los_angeles_Pollution_df = getYearlyMean(los_angeles_df,"Los Angeles")
san_francisco_Pollution_df = getYearlyMean(san_francisco_df,"San Francisco")
san_jose_Pollution_df = getYearlyMean(san_jose_df,"San Jose")
seattle_Pollution_df = getYearlyMean(seattle_df,"Seattle")
pittsburgh_Pollution_df = getYearlyMean(pittsburgh_df,"Pittsburgh")

atlanta_Pollution_df;

##    Year  Mean    City
## 1  1999 20.91 Atlanta
## 2  2000 18.68 Atlanta
## 3  2001 17.11 Atlanta
## 4  2002 15.05 Atlanta
## 5  2003 14.97 Atlanta
## 6  2004 14.88 Atlanta
## 7  2005 15.14 Atlanta
## 8  2006 15.47 Atlanta
## 9  2007 15.36 Atlanta
## 10 2008 12.97 Atlanta
## 11 2009 10.95 Atlanta
## 12 2010 12.06 Atlanta
## 13 2011 11.86 Atlanta
## 14 2012 10.37 Atlanta
## 15 2013  9.39 Atlanta

detroit_Pollution_df;

##    Year  Mean    City
## 1  1999 15.60 Detroit
## 2  2000 15.91 Detroit
## 3  2001 16.28 Detroit
## 4  2002 15.74 Detroit
## 5  2003 15.51 Detroit
## 6  2004 13.91 Detroit
## 7  2005 16.25 Detroit
## 8  2006 13.56 Detroit
## 9  2007 13.76 Detroit
## 10 2008 11.90 Detroit
## 11 2009 10.62 Detroit
## 12 2010 10.41 Detroit
## 13 2011  9.85 Detroit
## 14 2012  9.91 Detroit
## 15 2013  9.61 Detroit

los_angeles_Pollution_df;

##    Year  Mean        City
## 1  1999 21.41 Los Angeles
## 2  2000 19.86 Los Angeles
## 3  2001 21.21 Los Angeles
## 4  2002 19.29 Los Angeles
## 5  2003 17.75 Los Angeles
## 6  2004 16.94 Los Angeles
## 7  2005 15.58 Los Angeles
## 8  2006 14.37 Los Angeles
## 9  2007 14.67 Los Angeles
## 10 2008 14.93 Los Angeles
## 11 2009 14.27 Los Angeles
## 12 2010 12.65 Los Angeles
## 13 2011 13.72 Los Angeles
## 14 2012 12.53 Los Angeles
## 15 2013 12.20 Los Angeles

san_francisco_Pollution_df;

##    Year  Mean          City
## 1  1999 14.74 San Francisco
## 2  2000 11.86 San Francisco
## 3  2001 12.36 San Francisco
## 4  2002 13.75 San Francisco
## 5  2003  9.74 San Francisco
## 6  2004 10.99 San Francisco
## 7  2005  9.92 San Francisco
## 8  2006  9.71 San Francisco
## 9  2007  8.99 San Francisco
## 10 2008 10.46 San Francisco
## 11 2009  9.88 San Francisco
## 12 2010  8.52 San Francisco
## 13 2011  9.18 San Francisco
## 14 2012  7.68 San Francisco
## 15 2013 10.17 San Francisco

san_jose_Pollution_df;

##    Year  Mean     City
## 1  1999 16.18 San Jose
## 2  2000 13.88 San Jose
## 3  2001 12.50 San Jose
## 4  2002 12.80 San Jose
## 5  2003 10.92 San Jose
## 6  2004 10.99 San Jose
## 7  2005 11.47 San Jose
## 8  2006 10.67 San Jose
## 9  2007  9.97 San Jose
## 10 2008 10.81 San Jose
## 11 2009  9.03 San Jose
## 12 2010  7.56 San Jose
## 13 2011  9.09 San Jose
## 14 2012  7.40 San Jose
## 15 2013  8.85 San Jose

seattle_Pollution_df;

##    Year  Mean    City
## 1  1999 10.02 Seattle
## 2  2000 11.53 Seattle
## 3  2001 10.28 Seattle
## 4  2002  9.95 Seattle
## 5  2003  9.24 Seattle
## 6  2004  9.52 Seattle
## 7  2005  8.88 Seattle
## 8  2006  8.32 Seattle
## 9  2007  7.93 Seattle
## 10 2008  6.78 Seattle
## 11 2009  7.46 Seattle
## 12 2010  6.06 Seattle
## 13 2011  7.08 Seattle
## 14 2012  6.63 Seattle
## 15 2013  7.56 Seattle

pittsburgh_Pollution_df;

##    Year  Mean       City
## 1  1999 16.16 Pittsburgh
## 2  2000 15.92 Pittsburgh
## 3  2001 17.39 Pittsburgh
## 4  2002 15.78 Pittsburgh
## 5  2003 15.78 Pittsburgh
## 6  2004 15.35 Pittsburgh
## 7  2005 16.70 Pittsburgh
## 8  2006 15.59 Pittsburgh
## 9  2007 16.76 Pittsburgh
## 10 2008 14.71 Pittsburgh
## 11 2009 13.02 Pittsburgh
## 12 2010 13.55 Pittsburgh
## 13 2011 11.84 Pittsburgh
## 14 2012 10.66 Pittsburgh
## 15 2013 10.29 Pittsburgh
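The yearly averaging done by getYearlyMean can also be expressed with dplyr's group_by() and summarise(). The sketch below is an alternative, not the code used in this report. The 1999 readings are the first six Atlanta rows shown earlier, the 2000 readings are made up for illustration, and na.rm = TRUE is an added assumption that missing readings should be skipped.

```r
library(dplyr)

# Toy daily readings: 1999 values come from the Atlanta output above,
# 2000 values are hypothetical
daily_df <- data.frame(
  Year = c(rep(1999, 6), rep(2000, 2)),
  PM25 = c(21.4, 26.9, 15.5, 16.6, 18.1, 15.5, 10.0, 20.0)
)

# One row per year with the rounded mean PM2.5
yearly_df <- daily_df %>%
  group_by(Year) %>%
  summarise(Mean = round(mean(PM25, na.rm = TRUE), 2)) %>%
  as.data.frame()

print(yearly_df)
```

Adding City to group_by() would let a single call cover all seven cities, provided the daily data frames were stacked first.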

pollutionMaster_df = NULL
pollutionMaster_df = rbind(pollutionMaster_df, atlanta_Pollution_df)
pollutionMaster_df = rbind(pollutionMaster_df, detroit_Pollution_df)
pollutionMaster_df = rbind(pollutionMaster_df, los_angeles_Pollution_df)
pollutionMaster_df = rbind(pollutionMaster_df, san_francisco_Pollution_df)
pollutionMaster_df = rbind(pollutionMaster_df, san_jose_Pollution_df)
pollutionMaster_df = rbind(pollutionMaster_df, seattle_Pollution_df)
pollutionMaster_df = rbind(pollutionMaster_df, pittsburgh_Pollution_df)

pollutionMaster_df = arrange(pollutionMaster_df, City)
head(pollutionMaster_df, 10)

##    Year  Mean    City
## 1  1999 20.91 Atlanta
## 2  2000 18.68 Atlanta
## 3  2001 17.11 Atlanta
## 4  2002 15.05 Atlanta
## 5  2003 14.97 Atlanta
## 6  2004 14.88 Atlanta
## 7  2005 15.14 Atlanta
## 8  2006 15.47 Atlanta
## 9  2007 15.36 Atlanta
## 10 2008 12.97 Atlanta

tail(pollutionMaster_df, 10)

##     Year Mean    City
## 96  2004 9.52 Seattle
## 97  2005 8.88 Seattle
## 98  2006 8.32 Seattle
## 99  2007 7.93 Seattle
## 100 2008 6.78 Seattle
## 101 2009 7.46 Seattle
## 102 2010 6.06 Seattle
## 103 2011 7.08 Seattle
## 104 2012 6.63 Seattle
## 105 2013 7.56 Seattle

# join the yearly PM2.5 means with the yearly cancer counts on city and year
cancer_airpollution_merged_df = merge(x = pollutionMaster_df, y = cancer_incidence_df2, by = c("City","Year"))
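Year is numeric in the pollution table but a character string in the incidence table (it comes out of gsub()), so it is worth checking that the join keeps every city-year pair. The sketch below shows one way to do that on a small excerpt of the values printed above; the data frame names are hypothetical.

```r
# Small excerpts of the two tables, with values copied from the output above
pollution_df <- data.frame(
  Year = c(1999, 2000, 1999, 2000),
  Mean = c(20.91, 18.68, 15.60, 15.91),
  City = c("Atlanta", "Atlanta", "Detroit", "Detroit"),
  stringsAsFactors = FALSE
)
cancer_df <- data.frame(
  City  = c("Atlanta", "Atlanta", "Detroit", "Detroit"),
  Year  = c("1999", "2000", "1999", "2000"),  # character, as left by gsub()
  Count = c(1293, 1346, 3200, 3110),
  stringsAsFactors = FALSE
)

# Give both key columns the same type before joining
pollution_df$Year <- as.integer(pollution_df$Year)
cancer_df$Year    <- as.integer(cancer_df$Year)

merged_df <- merge(pollution_df, cancer_df, by = c("City", "Year"))

# Every city-year pair should survive the join exactly once, with no gaps
stopifnot(nrow(merged_df) == nrow(pollution_df))
stopifnot(!anyNA(merged_df))
print(merged_df)
```

For the full data set the same check would expect 7 cities x 15 years = 105 rows.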

(6) Data Visualization:

(6.1) Display the contents of cancer incidence data.

From the lung cancer incidence graphs we notice the following:

The highest number of cases is reported in Los Angeles.

The lowest number of cases is reported in San Jose.

Lung cancer cases in Seattle and Atlanta appear to be increasing.

Lung cancer cases in San Francisco, San Jose, Pittsburgh and Detroit are stable.

Lung cancer cases in Los Angeles are decreasing.

#Bar Graph
ggplot(data=cancer_incidence_df2, aes(x=Year, y=Count, fill=City)) + 
  geom_bar(stat="identity", position="dodge",width=0.5) + 
  ylab("Count of Lung Cancer") + ggtitle("Count of Lung Cancer")

#Line Graph
ggplot(data=cancer_incidence_df2, aes(x=Year, y=Count, group=City, color=City)) +  
  geom_line() + geom_point() + 
  ylab("# of Lung Cancer cases") +  ggtitle("Lung Cancer Statistics")

(6.2) From the distribution charts, we notice that some cities such as Detroit, Los Angeles and Pittsburgh have distributions concentrated at higher PM2.5 values.

myHistogramPlots = function(city_df, cityName, lowColor, highColor) {
  plot = ggplot(city_df, aes(x=PM25)) +
  geom_histogram (aes(fill = ..count..)) +
  ggtitle(cityName) + scale_fill_gradient("Count", low = lowColor, high = highColor)
  return(plot)
}

atlanta_histogram = myHistogramPlots(atlanta_df, "Atlanta", "green", "red")
detroit_histogram = myHistogramPlots(detroit_df, "Detroit", "orange", "blue")
los_angeles_histogram = myHistogramPlots(los_angeles_df, "Los Angeles", "yellow", "purple")
san_francisco_histogram = myHistogramPlots(san_francisco_df, "San Francisco", "cyan", "navy")
san_jose_histogram = myHistogramPlots(san_jose_df, "San Jose", "yellow", "maroon")
seattle_histogram = myHistogramPlots(seattle_df, "Seattle", "blue", "orange")
pittsburgh_histogram = myHistogramPlots(pittsburgh_df, "Pittsburgh", "limegreen", "deeppink")

atlanta_histogram

detroit_histogram

los_angeles_histogram

san_francisco_histogram

san_jose_histogram

seattle_histogram

pittsburgh_histogram

(7) Statistical Analysis:

(7.1) Perform correlation analysis between lung cancer incidence and air pollution index PM2.5.

# find the summary statistics
findStatsFunction<-function(cityName) {
  byCity = filter(cancer_airpollution_merged_df, City == cityName)
  corr = cor(byCity$Mean,byCity$Count)
  print (paste0("cor=",round(corr,2)))
  myTitle = paste0(cityName, " - PM2.5 vs Lung Cancer")
  plot = ggplot(byCity, aes(Mean,Count)) + geom_point(colour="red") + 
    xlab("PM2.5") + ylab("Lung Cancer") + labs(title = myTitle)
  print(plot)
  m = lm (Count~Mean,byCity)
  s = summary(m)
  print(s)
  return (byCity)
}

atlanta_statistics = findStatsFunction("Atlanta")

## [1] "cor=-0.83"

## 
## Call:
## lm(formula = Count ~ Mean, data = byCity)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -65.473 -40.778   3.853  32.070  87.389 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1713.496     59.745  28.680 3.87e-13 ***
## Mean         -21.975      4.074  -5.394 0.000122 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 48 on 13 degrees of freedom
## Multiple R-squared:  0.6911, Adjusted R-squared:  0.6674 
## F-statistic: 29.09 on 1 and 13 DF,  p-value: 0.0001225

detroit_statistics = findStatsFunction("Detroit")

## [1] "cor=-0.04"

## 
## Call:
## lm(formula = Count ~ Mean, data = byCity)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -158.629  -67.452   -0.131   78.043  135.978 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 3290.026    133.484  24.647 2.68e-12 ***
## Mean          -1.345      9.893  -0.136    0.894    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 96.75 on 13 degrees of freedom
## Multiple R-squared:  0.00142,    Adjusted R-squared:  -0.07539 
## F-statistic: 0.01848 on 1 and 13 DF,  p-value: 0.8939

los_angeles_statistics = findStatsFunction("Los Angeles")

## [1] "cor=0.82"

## 
## Call:
## lm(formula = Count ~ Mean, data = byCity)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -185.052  -69.007    0.919   73.192  206.723 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 3114.142    152.928  20.363 3.03e-11 ***
## Mean          47.452      9.339   5.081 0.000211 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 109.6 on 13 degrees of freedom
## Multiple R-squared:  0.6651, Adjusted R-squared:  0.6393 
## F-statistic: 25.81 on 1 and 13 DF,  p-value: 0.0002107

san_francisco_statistics = findStatsFunction("San Francisco")

## [1] "cor=0.34"

## 
## Call:
## lm(formula = Count ~ Mean, data = byCity)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -114.433  -28.344   -3.408   54.275   99.142 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2125.854     96.787  21.964 1.16e-11 ***
## Mean          11.910      9.051   1.316    0.211    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 65.27 on 13 degrees of freedom
## Multiple R-squared:  0.1175, Adjusted R-squared:  0.04965 
## F-statistic: 1.731 on 1 and 13 DF,  p-value: 0.211

san_jose_statistics = findStatsFunction("San Jose")

## [1] "cor=0.02"

## 
## Call:
## lm(formula = Count ~ Mean, data = byCity)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -82.734 -21.380   4.475  29.613  59.621 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 979.4397    49.7087  19.704 4.59e-11 ***
## Mean          0.3973     4.4996   0.088    0.931    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 39.85 on 13 degrees of freedom
## Multiple R-squared:  0.0005992,  Adjusted R-squared:  -0.07628 
## F-statistic: 0.007795 on 1 and 13 DF,  p-value: 0.931

seattle_statistics = findStatsFunction("Seattle")

## [1] "cor=-0.95"

## 
## Call:
## lm(formula = Count ~ Mean, data = byCity)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -66.465 -16.182   1.919  14.759  78.394 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 3386.679     52.246   64.82  < 2e-16 ***
## Mean         -68.911      6.061  -11.37 3.98e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 35.98 on 13 degrees of freedom
## Multiple R-squared:  0.9086, Adjusted R-squared:  0.9016 
## F-statistic: 129.3 on 1 and 13 DF,  p-value: 3.982e-08

pittsburgh_statistics = findStatsFunction("Pittsburgh")

## [1] "cor=0.17"

## 
## Call:
## lm(formula = Count ~ Mean, data = byCity)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -315.892  -24.860    4.874   60.142  107.308 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1097.767    184.702   5.943 4.88e-05 ***
## Mean           7.943     12.485   0.636    0.536    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 105 on 13 degrees of freedom
## Multiple R-squared:  0.03019,    Adjusted R-squared:  -0.04441 
## F-statistic: 0.4047 on 1 and 13 DF,  p-value: 0.5357
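As a cross-check on the lm() output above, the Atlanta fit can be reproduced by hand from the closed-form least-squares formulas. The sketch below uses the Atlanta yearly means and counts printed earlier in this report; the variable names are hypothetical.

```r
# Yearly mean PM2.5 and lung cancer counts for Atlanta, 1999-2013,
# copied from the tables printed earlier in this report
pm25  <- c(20.91, 18.68, 17.11, 15.05, 14.97, 14.88, 15.14, 15.47,
           15.36, 12.97, 10.95, 12.06, 11.86, 10.37, 9.39)
count <- c(1293, 1346, 1330, 1344, 1322, 1340, 1338, 1353,
           1438, 1440, 1485, 1383, 1478, 1573, 1511)

# Least-squares slope and intercept from the closed-form formulas
slope     <- sum((pm25 - mean(pm25)) * (count - mean(count))) /
             sum((pm25 - mean(pm25))^2)
intercept <- mean(count) - slope * mean(pm25)

# Pearson correlation; with a single predictor, R-squared is its square
r         <- cor(pm25, count)
r_squared <- r^2

# lm() should agree with the hand computation
fit <- lm(count ~ pm25)
print(c(intercept = intercept, slope = slope, r = r, r_squared = r_squared))
print(coef(fit))
```

The printed values should agree with the Atlanta summary above: an intercept of about 1713.5, a slope of about -21.98, a correlation of about -0.83 and an R-squared of about 0.69.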

(7.2) The following table summarizes the statistical analysis results, where x is the yearly mean PM2.5 and y is the lung cancer count.

City	Linear Regression Equation	Correlation Coefficient	R-Square	Description
Atlanta	y = 1713.50 + (-21.98) x	-0.83	0.69	Strong negative correlation. Model fits the data
Detroit	y = 3290.03 + (-1.35) x	-0.04	0.00	Very weak correlation. Model doesn’t fit the data. The relationship seems to be non-linear
Los Angeles	y = 3114.14 + (47.45) x	0.82	0.67	Strong positive correlation. Model fits the data
San Francisco	y = 2125.85 + (11.91) x	0.34	0.12	Weak correlation. Model doesn’t fit the data. The relationship seems to be non-linear
San Jose	y = 979.44 + (0.40) x	0.02	0.00	Weak correlation. Model doesn’t fit the data
Seattle	y = 3386.68 + (-68.91) x	-0.95	0.91	Strong negative correlation. Model fits the data
Pittsburgh	y = 1097.77 + (7.94) x	0.17	0.03	Weak correlation. Model doesn’t fit the data. It has an outlier

(8) Conclusion:

Based on the analysis, we notice that the air pollution and lung cancer incidence data for some of the cities (Atlanta, Los Angeles and Seattle) fit the linear regression model. The direction of the relationship is not consistent, however: the correlation is negative for Atlanta and Seattle and positive for Los Angeles. The data for other cities such as San Francisco, San Jose and Detroit have weak correlation and low R-square values, which indicates that the linear model is not a good model for predicting lung cancer based on ambient air pollution (PM2.5).

So, we conclude that there might be other factors, such as tobacco smoking and occupational pollution exposure, that contribute to lung cancer incidence. The excess lung cancer risk associated with ambient air pollution is small compared with that from tobacco smoking. It may also be possible to improve the prediction of lung cancer incidence by applying transformations to the data before fitting the model.