IS 607: PROJECT

ABSTRACT:

The project analysis is done on three (3) different datasets from different sources. The first dataset if from the University of California, Irvine website on Air Quality, the second dataset was extracted as an .xls file from World Bank website for the year 1961-1999, while the third dataset was base on incident rate of leukemia in US States and was extracted from cancer.gov website.
Analyses was done on all the three datasets to determine if the is linear relationship between them. I found out that the Air Quality has some of it variables that are way beyond the acceptable threnshold, I checked for interaction between those variables and benzene. I compared the result with the Leukemia incident rate in the 49 US States and the result is amazing.
We would randomly select observations to avoid bias.
We would do some of the analysis on three(3) different programs; R, MySQL and MongoDB.

AIM AND OBJECTIVE:

I am motivated by the serious impacts of the Air Quality throughout the whole world. Lots of adults, children are prone or exposed to the various categories of air throughout the globe. I thought it would be good to do an analysis to see the level of impact and to offer solutions.

Load Libraries:

Please, kindly install the package(s) first if you don’t already have it and load their respective libraries.

options(warn = -1)
suppressMessages(library(xlsx))
library(xlsxjars)
library(rJava)
library(devtools)
suppressMessages(library(RMySQL))
library(DBI)
suppressMessages(library(plotly))
suppressWarnings(library(rmongodb))
library(knitr)
library(mongolite)
suppressMessages(library(RCurl))
suppressMessages(library(plyr))
suppressMessages(library(ggplot2))

We will be using the Air Quality dataset from the University of California, Irvine website:

http://archive.ics.uci.edu/ml/datasets/Air+Quality#

** It is a .xlsx dataset that I downloaded into my local computer.**

Kindly download from here:

https://github.com/mascotinme/MSDA-IS607/blob/master/cckp_historical_data_0.xls

airqu <- read.xlsx2("C:/Users/mayowa/Documents/GitHub/MSDA-IS607/AirQualityUCI.xlsx", sheetName = "AirQualityUCI", header=TRUE, colClasses=NA)

dim(airqu)

## [1] 9471   15

kable(head(airqu))

Date	Time	CO.GT.	PT08.S1.CO.	NMHC.GT.	C6H6.GT.	PT08.S2.NMHC.	NOx.GT.	PT08.S3.NOx.	NO2.GT.	PT08.S4.NO2.	PT08.S5.O3.	T	RH	AH
2004-03-10	1899-12-30 18:00:00	2.6	1360.00	150	11.881723	1045.50	166	1056.25	113	1692.00	1267.50	13.600	48.875	0.7577538
2004-03-10	1899-12-30 19:00:00	2.0	1292.25	112	9.397165	954.75	103	1173.75	92	1558.75	972.25	13.300	47.700	0.7254874
2004-03-10	1899-12-30 20:00:00	2.2	1402.00	88	8.997817	939.25	131	1140.00	114	1554.50	1074.00	11.900	53.975	0.7502391
2004-03-10	1899-12-30 21:00:00	2.2	1375.50	80	9.228796	948.25	172	1092.00	122	1583.75	1203.25	11.000	60.000	0.7867125
2004-03-10	1899-12-30 22:00:00	1.6	1272.25	51	6.518224	835.50	131	1205.00	116	1490.00	1110.00	11.150	59.575	0.7887942
2004-03-10	1899-12-30 23:00:00	1.2	1197.00	38	4.741012	750.25	89	1336.50	96	1393.00	949.25	11.175	59.175	0.7847717

kable(tail(airqu))

	Date	Time	CO.GT.	PT08.S1.CO.	NMHC.GT.	C6H6.GT.	PT08.S2.NMHC.	NOx.GT.	PT08.S3.NOx.	NO2.GT.	PT08.S4.NO2.	PT08.S5.O3.	T	RH	AH
9466	NA	NA	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
9467	NA	NA	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
9468	NA	NA	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
9469	NA	NA	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
9470	NA	NA	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
9471	NA	NA	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN

names(airqu)

##  [1] "Date"          "Time"          "CO.GT."        "PT08.S1.CO."  
##  [5] "NMHC.GT."      "C6H6.GT."      "PT08.S2.NMHC." "NOx.GT."      
##  [9] "PT08.S3.NOx."  "NO2.GT."       "PT08.S4.NO2."  "PT08.S5.O3."  
## [13] "T"             "RH"            "AH"

str(airqu)

## 'data.frame':    9471 obs. of  15 variables:
##  $ Date         :Class 'POSIXct'  atomic [1:9471] 1.08e+09 1.08e+09 1.08e+09 1.08e+09 1.08e+09 ...
##   .. ..- attr(*, "tzone")= chr "GMT"
##  $ Time         :Class 'POSIXct'  atomic [1:9471] -2.21e+09 -2.21e+09 -2.21e+09 -2.21e+09 -2.21e+09 ...
##   .. ..- attr(*, "tzone")= chr "GMT"
##  $ CO.GT.       : num  2.6 2 2.2 2.2 1.6 1.2 1.2 1 0.9 0.6 ...
##  $ PT08.S1.CO.  : num  1360 1292 1402 1376 1272 ...
##  $ NMHC.GT.     : num  150 112 88 80 51 38 31 31 24 19 ...
##  $ C6H6.GT.     : num  11.88 9.4 9 9.23 6.52 ...
##  $ PT08.S2.NMHC.: num  1046 955 939 948 836 ...
##  $ NOx.GT.      : num  166 103 131 172 131 89 62 62 45 -200 ...
##  $ PT08.S3.NOx. : num  1056 1174 1140 1092 1205 ...
##  $ NO2.GT.      : num  113 92 114 122 116 96 77 76 60 -200 ...
##  $ PT08.S4.NO2. : num  1692 1559 1554 1584 1490 ...
##  $ PT08.S5.O3.  : num  1268 972 1074 1203 1110 ...
##  $ T            : num  13.6 13.3 11.9 11 11.2 ...
##  $ RH           : num  48.9 47.7 54 60 59.6 ...
##  $ AH           : num  0.758 0.725 0.75 0.787 0.789 ...

class(airqu)

## [1] "data.frame"

We notice that there are alot of NA’s in the dataset, we therefore thought it would be appropriate (for easy analysis) to replace the missing values or NA with Zero(o)

We also notice that the column names is ambiguous, we therefore change it to a reasonable column names.

airqu$NMHC.GT.[is.na(airqu$NMHC.GT.)] <- 0.0000
airqu$CO.GT.[is.na(airqu$CO.GT.)] <- 0.0000
airqu$PT08.S1.CO.[is.na(airqu$PT08.S1.CO.)] <- 0.0000
airqu$C6H6.GT.[is.na(airqu$C6H6.GT.)] <- 0.0000
airqu$PT08.S2.NMHC.[is.na(airqu$PT08.S2.NMHC.)] <- 0.0000
airqu$NOx.GT.[is.na(airqu$NOx.GT.)] <- 0.0000
airqu$PT08.S3.NOx.[is.na(airqu$PT08.S3.NOx.)] <- 0.0000
airqu$NO2.GT.[is.na(airqu$NO2.GT.)] <- 0.0000
airqu$PT08.S4.NO2.[is.na(airqu$PT08.S4.NO2.)] <- 0.0000
airqu$PT08.S5.O3.[is.na(airqu$PT08.S5.O3.)] <- 0.0000
airqu$T[is.na(airqu$T)] <- 0.0000
airqu$RH[is.na(airqu$RH)] <- 0.0000
airqu$AH[is.na(airqu$AH)] <- 0.0000

kable(head(airqu))

Date	Time	CO.GT.	PT08.S1.CO.	NMHC.GT.	C6H6.GT.	PT08.S2.NMHC.	NOx.GT.	PT08.S3.NOx.	NO2.GT.	PT08.S4.NO2.	PT08.S5.O3.	T	RH	AH
2004-03-10	1899-12-30 18:00:00	2.6	1360.00	150	11.881723	1045.50	166	1056.25	113	1692.00	1267.50	13.600	48.875	0.7577538
2004-03-10	1899-12-30 19:00:00	2.0	1292.25	112	9.397165	954.75	103	1173.75	92	1558.75	972.25	13.300	47.700	0.7254874
2004-03-10	1899-12-30 20:00:00	2.2	1402.00	88	8.997817	939.25	131	1140.00	114	1554.50	1074.00	11.900	53.975	0.7502391
2004-03-10	1899-12-30 21:00:00	2.2	1375.50	80	9.228796	948.25	172	1092.00	122	1583.75	1203.25	11.000	60.000	0.7867125
2004-03-10	1899-12-30 22:00:00	1.6	1272.25	51	6.518224	835.50	131	1205.00	116	1490.00	1110.00	11.150	59.575	0.7887942
2004-03-10	1899-12-30 23:00:00	1.2	1197.00	38	4.741012	750.25	89	1336.50	96	1393.00	949.25	11.175	59.175	0.7847717

colnames(airqu)[1] <- "date"
colnames(airqu)[2] <- "time"
colnames(airqu)[3] <- "co"
colnames(airqu)[4] <- "tin_oxide"
colnames(airqu)[5] <- "hydrocarbon"
colnames(airqu)[6] <- "benzene"
colnames(airqu)[7] <- "titania"
colnames(airqu)[8] <- "nitroge_oxide"
colnames(airqu)[9] <- "tungsten-oxide"
colnames(airqu)[10] <- "no2"
colnames(airqu)[11] <- "tungsten_oxide_o2"
colnames(airqu)[12] <- "indium_oxide"
colnames(airqu)[13] <- "temp_ac"
colnames(airqu)[14] <- "rh"
colnames(airqu)[15] <- "ah"


kable(head(airqu))

date	time	co	tin_oxide	hydrocarbon	benzene	titania	nitroge_oxide	tungsten-oxide	no2	tungsten_oxide_o2	indium_oxide	temp_ac	rh	ah
2004-03-10	1899-12-30 18:00:00	2.6	1360.00	150	11.881723	1045.50	166	1056.25	113	1692.00	1267.50	13.600	48.875	0.7577538
2004-03-10	1899-12-30 19:00:00	2.0	1292.25	112	9.397165	954.75	103	1173.75	92	1558.75	972.25	13.300	47.700	0.7254874
2004-03-10	1899-12-30 20:00:00	2.2	1402.00	88	8.997817	939.25	131	1140.00	114	1554.50	1074.00	11.900	53.975	0.7502391
2004-03-10	1899-12-30 21:00:00	2.2	1375.50	80	9.228796	948.25	172	1092.00	122	1583.75	1203.25	11.000	60.000	0.7867125
2004-03-10	1899-12-30 22:00:00	1.6	1272.25	51	6.518224	835.50	131	1205.00	116	1490.00	1110.00	11.150	59.575	0.7887942
2004-03-10	1899-12-30 23:00:00	1.2	1197.00	38	4.741012	750.25	89	1336.50	96	1393.00	949.25	11.175	59.175	0.7847717

We will now connect R with MySql database.

mysql = dbConnect(MySQL(), user='root', password='oracle', dbname='world', host='localhost') # This connects the R with MySQL. Note that the login entities may be different from your login details, kindly adjust as neccesary.

We would like to view the list of database(s) which are already in MySQL database.

class(mysql)

## [1] "MySQLConnection"
## attr(,"package")
## [1] "RMySQL"

dbListTables(mysql)

## [1] "airqu"           "airqualityuci"   "city"            "country"        
## [5] "countrylanguage" "flights"

Dropping and adding of table if they already exist to give access to the new table to be created.

dbSendQuery(mysql, 'drop table if exists airqu')

## <MySQLResult:0,0,1>

dbWriteTable(conn=mysql, name='airqu', value=airqu)

## [1] TRUE

dbListFields(mysql, 'airqualityuci')

##  [1] "row_names"         "Date"              "Time"             
##  [4] "co"                "tin_oxide"         "hydrocarbon"      
##  [7] "benzene"           "titania"           "nitroge_oxide"    
## [10] "tungsten.oxide"    "no2"               "tungsten_oxide_o2"
## [13] "indium_oxide"      "temp_ac"           "rh"               
## [16] "ah"

We can now read and access the newly created table in R from MySQL directly.

airqualityuci <- dbReadTable(conn=mysql, name='airqu')

write.table(airqualityuci, file = "airqualityuci.csv",row.names=FALSE, na="",col.names=TRUE, sep=",")

Attribute Information :

Date: Date (DD/MM/YYYY)
Time: Time (HH.MM.SS)
co: True hourly averaged concentration CO in mg/m^3
tin_oxide: PT08.S1 (tin oxide) hourly averaged sensor response (nominally CO targeted)
hydrocarbon: True hourly averaged overall Non Metanic HydroCarbons concentration in microg/m^3 (reference analyzer)
benzene: True hourly averaged Benzene concentration in microg/m^3 (reference analyzer)
titania: PT08.S2 (titania) hourly averaged sensor response (nominally NMHC targeted)
nitroge_oxide: True hourly averaged NOx concentration in ppb (reference analyzer)
tungsten_oxide: PT08.S3 (tungsten oxide) hourly averaged sensor response (nominally NOx targeted)
no2: True hourly averaged NO2 concentration in microg/m^3 (reference analyzer)
tungsten_oxide2: PT08.S4 (tungsten oxide) hourly averaged sensor response (nominally NO2 targeted)
indium_oxide: PT08.S5 (indium oxide) hourly averaged sensor response (nominally O3 targeted)
temp_ac: Temperature in Â°C
rh: Relative Humidity (%)
ah: AH Absolute Humidit

** we are going to randomly sample 900 (Less than 10%) observations from the total dataset.**

airqualityuci <- airqualityuci[sample(nrow(airqualityuci), 900), ]
dim(airqualityuci)

## [1] 900  15

str(airqualityuci)

## 'data.frame':    900 obs. of  15 variables:
##  $ date             : chr  "1094083200" "1101427200" "1100217600" "1096416000" ...
##  $ time             : chr  "-2209161600" "-2209129200" "-2209161600" "-2209158000" ...
##  $ co               : num  2 5.8 1.2 0.9 2.5 0.5 0.7 4 1.5 2.7 ...
##  $ tin_oxide        : num  1060 1521 932 872 1171 ...
##  $ hydrocarbon      : num  -200 -200 -200 -200 -200 -200 -200 -200 -200 -200 ...
##  $ benzene          : num  11.41 26.63 4.53 4.14 12.98 ...
##  $ titania          : num  1029 1470 740 718 1083 ...
##  $ nitroge_oxide    : num  242 960 167 106 390 ...
##  $ tungsten.oxide   : num  622 456 926 980 694 ...
##  $ no2              : num  126 149 92 71 96 50.7 78.4 127 103 126 ...
##  $ tungsten_oxide_o2: num  1663 1857 1180 1223 1758 ...
##  $ indium_oxide     : num  1210 1726 774 872 1276 ...
##  $ temp_ac          : num  24.3 10.6 12.9 16.3 22.9 ...
##  $ rh               : num  44.7 72.7 69.9 56.8 59.1 ...
##  $ ah               : num  1.343 0.927 1.037 1.047 1.633 ...

summary(airqualityuci)

##      date               time                 co            tin_oxide     
##  Length:900         Length:900         Min.   :-200.00   Min.   :-200.0  
##  Class :character   Class :character   1st Qu.:   0.50   1st Qu.: 909.6  
##  Mode  :character   Mode  :character   Median :   1.40   Median :1049.5  
##                                        Mean   : -36.97   Mean   :1030.8  
##                                        3rd Qu.:   2.40   3rd Qu.:1204.8  
##                                        Max.   :  10.20   Max.   :1982.2  
##   hydrocarbon        benzene            titania       nitroge_oxide   
##  Min.   :-200.0   Min.   :-200.000   Min.   :-200.0   Min.   :-200.0  
##  1st Qu.:-200.0   1st Qu.:   3.856   1st Qu.: 702.7   1st Qu.:  44.0  
##  Median :-200.0   Median :   7.701   Median : 886.8   Median : 132.1  
##  Mean   :-164.4   Mean   :   1.465   Mean   : 875.1   Mean   : 160.8  
##  3rd Qu.:-200.0   3rd Qu.:  13.237   3rd Qu.:1091.6   3rd Qu.: 265.1  
##  Max.   : 880.0   Max.   :  49.492   Max.   :1958.8   Max.   :1369.0  
##  tungsten.oxide        no2          tungsten_oxide_o2  indium_oxide   
##  Min.   :-200.0   Min.   :-200.00   Min.   :-200      Min.   :-200.0  
##  1st Qu.: 651.9   1st Qu.:  49.75   1st Qu.:1169      1st Qu.: 663.9  
##  Median : 806.9   Median :  92.00   Median :1436      Median : 933.6  
##  Mean   : 802.7   Mean   :  54.35   Mean   :1374      Mean   : 952.1  
##  3rd Qu.: 966.2   3rd Qu.: 129.15   3rd Qu.:1649      3rd Qu.:1219.4  
##  Max.   :2047.0   Max.   : 288.00   Max.   :2536      Max.   :2385.5  
##     temp_ac               rh                ah           
##  Min.   :-200.000   Min.   :-200.00   Min.   :-200.0000  
##  1st Qu.:  11.238   1st Qu.:  34.05   1st Qu.:   0.7186  
##  Median :  16.350   Median :  48.40   Median :   0.9729  
##  Mean   :   9.823   Mean   :  39.58   Mean   :  -6.7912  
##  3rd Qu.:  24.431   3rd Qu.:  63.20   3rd Qu.:   1.2935  
##  Max.   :  44.350   Max.   :  86.53   Max.   :   2.2310

What recommendations has the federal government made to protect human health?

….EPA estimates that 10 ppb benzene in drinking water that is consumed regularly or exposure to 0.4 ppb in air over a lifetime could cause a risk of one additional cancer case for every 100,000 exposed persons……

The underlisted output displays the date, time of benzene with ppb greater than the safely regulated ppb for benzene.

mysqlquery <- dbGetQuery(mysql, "SELECT Date, Time, benzene 
     FROM airqu 
     WHERE benzene IN (SELECT benzene 
                  FROM airqu 
                  WHERE benzene > 10)")

kable(head(mysqlquery))

Date	Time	benzene
1078876800	-2209096800	11.88172
1078963200	-2209111200	11.53901
1078963200	-2209100400	11.15158
1078963200	-2209096800	20.79922
1078963200	-2209093200	27.35981
1078963200	-2209089600	24.01776

No wonder! The EPA has classified benzene as a Group A, known human carcinogen!
https://www3.epa.gov/airtoxics/hlthef/benzene.html

MONGODB —A NO-SQL

use this query from your cmd shell:

mongoimport –db world –collection airqualityuci –type csv –file“c:.csv” –headerline

Adjust the file location accordingly

mongo_conn <- mongo.create() # createing a MongoDB connection

mongo.is.connected(mongo_conn)  # checking if the mongo is connected

## [1] TRUE

if(mongo.is.connected(mongo_conn) == TRUE) {
  mongo.get.databases(mongo_conn)
}

## [1] "database" "flights"  "hockey"   "world"

if(mongo.is.connected(mongo_conn) == TRUE) {
  db <- "world"
  mongo.get.database.collections(mongo_conn, db)
}

## character(0)

mongo_conn = mongo(collection = "airqualityuci",  db = "world")
mongo_conn$insert(airqualityuci) # Insert the airqualityuci into mongodb collection

## 
Complete! Processed total of 900 rows.

## [1] TRUE

plot(airqualityuci$temp_ac, airqualityuci$benzene, main = "Scatter Plot", xlab = "Temperature", ylab = "Benzene")
abline(lm(airqualityuci$temp_ac~airqualityuci$benzene), col = "blue");

cor(airqualityuci$temp_ac, airqualityuci$benzene)

## [1] 0.9713017

temp_co <- lm(temp_ac ~ benzene, data=airqualityuci)

summary.lm(temp_co)

## 
## Call:
## lm(formula = temp_ac ~ benzene, data = airqualityuci)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -45.602  -5.712  -0.020   6.493  27.833 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 8.330640   0.342555   24.32   <2e-16 ***
## benzene     1.018777   0.008325  122.37   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.27 on 898 degrees of freedom
## Multiple R-squared:  0.9434, Adjusted R-squared:  0.9434 
## F-statistic: 1.498e+04 on 1 and 898 DF,  p-value: < 2.2e-16

plot(x=temp_co$residuals, y=temp_co$benzene)
abline(h = 0, lty = 3, col = "blue")

hist(temp_co$residuals)

airqual <- subset(airqualityuci, select=c(co, tin_oxide, hydrocarbon, benzene, titania, nitroge_oxide, tungsten.oxide, no2, tungsten_oxide_o2, indium_oxide, temp_ac, rh, ah  ))

plot(airqual)

We are going to do a Multiple Linear Regression model. But, we will want to choose the best model for our prediction.

\({ Y }\quad =\quad { B }_{ 0 }\quad +\quad { B }_{ 1 }{ X }_{ 1 }\quad +\quad { B }_{ 2 }{ X }_{ 2 }\quad +\quad\) ………+\(\quad { B }_{ n }{ X }_{ n }\quad\) +\(\quad { e}\\\)

SETTING UP HYPOTHESIS: # 1

Null Hypothesis: The population parameters(means) are the same

\(H_o:\) \(\mu_1\) = \(\mu_2\) = \(\mu_3\)…..= \(\mu_n\)
Alternative Hythesis: Not \(H_o\)

\(H_a:\) Not \(H_o\) : \(\mu_1\) \(\neq\) \(\mu_2\) \(\neq\) \(\mu_3\)…..\(\neq\) \(\mu_n\)

Rejection:

Reject \(H_o\) (Null Hypothesis) if the calculated value (P-Value) is less that the tabulated value(Table value = 0.05 ), otherwise do not reject \(H_o\)

Underlisted are the best variables for the model, and analysis would be made on them.

modelselection <- lm(temp_ac ~ tin_oxide+hydrocarbon+benzene+titania+nitroge_oxide+tungsten.oxide+no2+tungsten_oxide_o2+indium_oxide+co+rh+ah, data=airqual)

stepwise <- step(modelselection, direction = "both") # Model Selection using both FORWARD AND BACKWARD selection.

## Start:  AIC=2001.43
## temp_ac ~ tin_oxide + hydrocarbon + benzene + titania + nitroge_oxide + 
##     tungsten.oxide + no2 + tungsten_oxide_o2 + indium_oxide + 
##     co + rh + ah
## 
##                     Df Sum of Sq     RSS    AIC
## - tin_oxide          1       9.1  8090.4 2000.4
## - co                 1      17.6  8098.9 2001.4
## <none>                            8081.3 2001.4
## - titania            1     202.6  8284.0 2021.7
## - no2                1     292.0  8373.4 2031.4
## - nitroge_oxide      1     371.1  8452.4 2039.8
## - indium_oxide       1     532.7  8614.0 2056.9
## - tungsten.oxide     1     732.2  8813.5 2077.5
## - hydrocarbon        1    1219.3  9300.7 2125.9
## - benzene            1    2427.0 10508.4 2235.8
## - ah                 1   13361.4 21442.8 2877.7
## - tungsten_oxide_o2  1   14312.4 22393.7 2916.7
## - rh                 1   15167.0 23248.3 2950.4
## 
## Step:  AIC=2000.44
## temp_ac ~ hydrocarbon + benzene + titania + nitroge_oxide + tungsten.oxide + 
##     no2 + tungsten_oxide_o2 + indium_oxide + co + rh + ah
## 
##                     Df Sum of Sq     RSS    AIC
## <none>                            8090.4 2000.4
## - co                 1      18.2  8108.7 2000.5
## + tin_oxide          1       9.1  8081.3 2001.4
## - titania            1     281.7  8372.1 2029.2
## - no2                1     289.1  8379.6 2030.0
## - nitroge_oxide      1     371.6  8462.0 2038.8
## - indium_oxide       1     571.1  8661.6 2059.8
## - tungsten.oxide     1     723.7  8814.1 2075.5
## - hydrocarbon        1    1231.7  9322.1 2126.0
## - benzene            1    2537.2 10627.7 2243.9
## - ah                 1   13690.5 21780.9 2889.8
## - tungsten_oxide_o2  1   14502.7 22593.1 2922.7
## - rh                 1   17193.7 25284.1 3024.0

summary(stepwise)

## 
## Call:
## lm(formula = temp_ac ~ hydrocarbon + benzene + titania + nitroge_oxide + 
##     tungsten.oxide + no2 + tungsten_oxide_o2 + indium_oxide + 
##     co + rh + ah, data = airqual)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10.0612  -1.9755   0.1464   1.8602  12.1700 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        2.8150246  1.1167344   2.521   0.0119 *  
## hydrocarbon       -0.0103517  0.0008903 -11.627  < 2e-16 ***
## benzene           -1.0401703  0.0623307 -16.688  < 2e-16 ***
## titania            0.0112008  0.0020145   5.560 3.57e-08 ***
## nitroge_oxide      0.0066334  0.0010387   6.386 2.74e-10 ***
## tungsten.oxide    -0.0049917  0.0005601  -8.912  < 2e-16 ***
## no2               -0.0105469  0.0018722  -5.633 2.37e-08 ***
## tungsten_oxide_o2  0.0247435  0.0006202  39.897  < 2e-16 ***
## indium_oxide      -0.0053875  0.0006805  -7.917 7.21e-15 ***
## co                 0.0026057  0.0018412   1.415   0.1574    
## rh                -0.3096226  0.0071273 -43.442  < 2e-16 ***
## ah                 2.3536728  0.0607178  38.764  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.018 on 888 degrees of freedom
## Multiple R-squared:  0.9952, Adjusted R-squared:  0.9951 
## F-statistic: 1.663e+04 on 11 and 888 DF,  p-value: < 2.2e-16

We will like to further test by using another statistical tool called Analysis of Variance (ANOVA)

ANALYSIS OF VARIANCE (ANOVA)

anova(stepwise, test= "F")

## Analysis of Variance Table
## 
## Response: temp_ac
##                    Df  Sum Sq Mean Sq    F value Pr(>F)    
## hydrocarbon         1    1728    1728 1.8968e+02 <2e-16 ***
## benzene             1 1579793 1579793 1.7340e+05 <2e-16 ***
## titania             1   17270   17270 1.8955e+03 <2e-16 ***
## nitroge_oxide       1   15927   15927 1.7481e+03 <2e-16 ***
## tungsten.oxide      1    1071    1071 1.1752e+02 <2e-16 ***
## no2                 1    9132    9132 1.0023e+03 <2e-16 ***
## tungsten_oxide_o2   1   12744   12744 1.3987e+03 <2e-16 ***
## indium_oxide        1    5798    5798 6.3636e+02 <2e-16 ***
## co                  1      15      15 1.6648e+00 0.1973    
## rh                  1    8991    8991 9.8684e+02 <2e-16 ***
## ah                  1   13690   13690 1.5027e+03 <2e-16 ***
## Residuals         888    8090       9                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

confint(stepwise)

##                          2.5 %       97.5 %
## (Intercept)        0.623277993  5.006771200
## hydrocarbon       -0.012099085 -0.008604343
## benzene           -1.162503038 -0.917837527
## titania            0.007247048  0.015154553
## nitroge_oxide      0.004594774  0.008672034
## tungsten.oxide    -0.006090906 -0.003892402
## no2               -0.014221378 -0.006872474
## tungsten_oxide_o2  0.023526332  0.025960701
## indium_oxide      -0.006722946 -0.004051988
## co                -0.001007918  0.006219285
## rh                -0.323611008 -0.295634221
## ah                 2.234505597  2.472840022

The Error Term or Residual

We are going to check the distribution of the residuals with normal dataset distribution to determine the stochastics difference between the expected value and the observed value.

residual_temp = data.frame(Fitted = fitted(stepwise),
Residuals = resid(stepwise), Treatment = airqual$temp_ac)

residual_benzene = data.frame(Fitted = fitted(stepwise),
Residuals = resid(stepwise), Treatment = airqual$benzene)



ggplot(residual_temp, aes(Fitted, Residuals, colour = Treatment)) + geom_point()

ggplot(residual_benzene, aes(Fitted, Residuals, colour = Treatment)) + geom_point()

Decision & Conclusion:

From the above multiple regression analysis and ANOVA, we found that the P-value is actually low compare to the Table value (0.05), we therefor reject \(H_o\) (Null hypothesis) and conclude that there is/are difference(s) between the variables.

SECOND DATASET WAS EXTRACTED FROM THE WORLD BANK WEBSITE AS A .XLS FILE

This dataset contains historical temperature extracted from the Climate Research Unit (Mitchell et al, 2003), aggregated to the country and basin levels. The dataset time is from 1961-1999.
Country_temperatureCRU: mean monthly and annual temperatures by country for the period 1961-1999. Values are in degrees Celsius.
http://databank.worldbank.org/data/download/catalog/cckp_historical_data_0.xls
we will like to compare the dataset with the above airquality analysis.

suppressMessages(library(dplyr))

dataset2 <- read.xlsx("C:/Users/mayowa/Documents/cckp_historical_data_0.xls", sheetIndex=4,
  startRow=NULL, endRow=179, colIndex=NULL,
  as.data.frame=TRUE, header=TRUE,
  keepFormulas=FALSE)

str(dataset2)

## 'data.frame':    178 obs. of  14 variables:
##  $ ISO_3DIGIT : Factor w/ 178 levels "AFG","AGO","ALB",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ Jan_Temp   : num  0.0731 22.5822 2.0231 18.4275 20.8035 ...
##  $ Feb_temp   : num  2.11 22.68 3.22 19.43 19.9 ...
##  $ Mar_temp   : num  7.6 22.78 6.04 22.61 17.51 ...
##  $ Apr_Temp   : num  13.37 22.35 9.92 26.58 14.05 ...
##  $ May_temp   : num  18.2 20.7 14.4 30.6 10.6 ...
##  $ Jun_Temp   : num  23.2 18.37 17.93 32.46 7.66 ...
##  $ July_Temp  : num  25.26 17.95 20.54 33.8 7.42 ...
##  $ Aug_Temp   : num  23.77 19.9 20.48 33.55 9.02 ...
##  $ Sept_temp  : num  19 22.2 17.2 31.7 11.5 ...
##  $ Oct_temp   : num  13 23.2 12.3 28.3 14.7 ...
##  $ Nov_Temp   : num  7 22.79 7.58 24.06 17.54 ...
##  $ Dec_temp   : num  2.43 22.61 3.65 20.28 19.83 ...
##  $ Annual_temp: num  12.9 21.5 11.3 26.8 14.2 ...

kable(summary(dataset2))

ISO_3DIGIT	Jan_Temp	Feb_temp	Mar_temp	Apr_Temp	May_temp	Jun_Temp	July_Temp	Aug_Temp	Sept_temp	Oct_temp	Nov_Temp	Dec_temp	Annual_temp
AFG : 1	Min. :-26.9967	Min. :-24.725	Min. :-18.717	Min. :-14.249	Min. :-6.319	Min. : 0.07677	Min. : 2.02	Min. : 2.583	Min. :-2.336	Min. :-8.36	Min. :-17.638	Min. :-24.074	Min. :-9.143
AGO : 1	1st Qu.: 0.1473	1st Qu.: 2.092	1st Qu.: 5.265	1st Qu.: 9.689	1st Qu.:14.050	1st Qu.:16.36776	1st Qu.:18.05	1st Qu.:18.455	1st Qu.:15.333	1st Qu.:10.85	1st Qu.: 5.953	1st Qu.: 2.041	1st Qu.:10.091
ALB : 1	Median : 19.2529	Median : 20.492	Median : 21.995	Median : 22.028	Median :21.907	Median :22.99741	Median :22.98	Median :23.314	Median :22.598	Median :22.49	Median : 21.282	Median : 20.043	Median :21.582
ARE : 1	Mean : 12.8916	Mean : 13.879	Mean : 15.935	Mean : 18.241	Mean :20.160	Mean :21.34465	Mean :22.01	Mean :21.941	Mean :20.757	Mean :18.70	Mean : 15.992	Mean : 13.729	Mean :17.965
ARG : 1	3rd Qu.: 24.2336	3rd Qu.: 24.821	3rd Qu.: 25.333	3rd Qu.: 25.633	3rd Qu.:25.898	3rd Qu.:26.01320	3rd Qu.:26.05	3rd Qu.:26.057	3rd Qu.:25.772	3rd Qu.:25.32	3rd Qu.: 24.545	3rd Qu.: 23.983	3rd Qu.:25.046
ARM : 1	Max. : 27.7845	Max. : 29.088	Max. : 30.629	Max. : 32.055	Max. :33.062	Max. :34.72038	Max. :36.25	Max. :35.878	Max. :32.778	Max. :29.55	Max. : 27.473	Max. : 27.302	Max. :28.300
(Other):172	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA

we are going to sample 178 observations from the 1st dataset to match with the second dataset to avoid bias.

SETTING UP HYPOTHESIS: # 2

Null Hypothesis: There is no difference between means temperature in dataset 1 and dataset 2

\(H_o:\) \(\mu_1\) = \(\mu_2\) = \(\mu_3\)…..= \(\mu_n\)
Alternative Hythesis:

\(H_a:\) Not \(H_o:\) \(\mu_1\) \(\neq\) \(\mu_2\) \(\neq\) \(\mu_3\)…..\(\neq\) \(\mu_n\)

Rejection:

Reject \(H_o\) (Null Hypothesis) if the calculated value (P-Value) is less that the tabulated value(Table value = 0.05 ), otherwise do not reject \(H_o\)

Decision:

We therefore Accept \(H_o\), since calculated calculated value (P-Value) is greater than the tabulated value(Table value = 0.05 )

sample1 <- airqual[sample(nrow(airqual), 178), ]
dim(sample1)

## [1] 178  13

temp_2 <- lm(sample1$temp_ac ~ dataset2$Annual_temp)

summary.lm(temp_2)

## 
## Call:
## lm(formula = sample1$temp_ac ~ dataset2$Annual_temp)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -213.093    0.661    5.772   13.928   28.177 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)  
## (Intercept)           17.4049     7.0242   2.478   0.0142 *
## dataset2$Annual_temp  -0.3337     0.3526  -0.946   0.3453  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 40.48 on 176 degrees of freedom
## Multiple R-squared:  0.005061,   Adjusted R-squared:  -0.0005919 
## F-statistic: 0.8953 on 1 and 176 DF,  p-value: 0.3453

cor.test(sample1$temp_ac, dataset2$Annual_temp, method = c("pearson", "kendall", "spearman"),
         exact = NULL, conf.level = 0.95, continuity = FALSE)

## 
##  Pearson's product-moment correlation
## 
## data:  sample1$temp_ac and dataset2$Annual_temp
## t = -0.9462, df = 176, p-value = 0.3453
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.21596693  0.07674566
## sample estimates:
##        cor 
## -0.0711421

anova(temp_2)

## Analysis of Variance Table
## 
## Response: sample1$temp_ac
##                       Df Sum Sq Mean Sq F value Pr(>F)
## dataset2$Annual_temp   1   1467  1466.9  0.8953 0.3453
## Residuals            176 288371  1638.5

DATASET THREE(3)

The dataset is a Leukemia incident rate collected from the Center for Diseases Control and Prevention Website for 49 US States…..

** Data Source:**

http://www.statecancerprofiles.cancer.gov/incidencerates/index.php?stateFIPS=00&cancer=090&race=00&sex=0&age=009&year=0&type=incd&sortVariableName=rate&sortOrder=default#results

SETTING UP HYPOTHESIS: # 3

Null Hypothesis: There is no relationship between benzene in dataset 1 and and the average incident in dataset 3

\(H_o:\) \(\mu_1\) = \(\mu_2\) = \(\mu_3\)…..= \(\mu_n\)
Alternative Hythesis:

\(H_a:\) Not \(H_o:\) \(\mu_1\) \(\neq\) \(\mu_2\) \(\neq\) \(\mu_3\)…..\(\neq\) \(\mu_n\)

Rejection:

Reject \(H_o\) (Null Hypothesis) if the calculated value (P-Value) is less than the tabulated value(Table value = 0.05 ), otherwise do not reject \(H_o\)

library(plyr)

incd <- read.csv("https://raw.githubusercontent.com/mascotinme/MSDA-IS607/master/incd.csv", sep = ",", header = TRUE, skip = 8)
 
str(incd)

## 'data.frame':    75 obs. of  10 variables:
##  $ State                                             : Factor w/ 71 levels "","# Data do not include cases diagnosed in other states for those states in which the data exchange agreement specifically prohib"| __truncated__,..: 64 70 41 37 29 33 50 52 44 34 ...
##  $ FIPS                                              : int  0 55000 27000 23000 16000 19000 36000 38000 30000 20000 ...
##  $ Age.Adjusted.Incidence.Rate......cases.per.100.000: Factor w/ 35 levels "","¶¶ ","10.2",..: 17 35 34 33 32 31 30 29 28 27 ...
##  $ Lower.95..Confidence.Interval                     : Factor w/ 35 levels "","¶¶","10","10.4",..: 22 34 33 32 30 31 32 26 27 29 ...
##  $ Upper.95..Confidence.Interval                     : Factor w/ 36 levels ""," ¶¶","11",..: 15 36 34 35 32 30 28 33 31 29 ...
##  $ Average.Annual.Count                              : Factor w/ 53 levels "","¶¶","1005",..: 30 4 52 20 19 35 27 6 14 31 ...
##  $ Recent.Trend                                      : Factor w/ 5 levels "","¶¶","falling",..: 5 5 5 5 5 5 5 5 5 5 ...
##  $ Recent.5.Year.Trend.....in.Incidence.Rates        : Factor w/ 38 levels "","-0.1","-0.3",..: 15 2 32 17 6 15 26 12 33 28 ...
##  $ Lower.95..Confidence.Interval.1                   : Factor w/ 43 levels "","-0.1","-0.3",..: 7 22 16 34 10 3 20 35 19 2 ...
##  $ Upper.95..Confidence.Interval.1                   : Factor w/ 43 levels "","-0.1","¶¶",..: 7 25 43 40 38 5 37 8 12 29 ...

# Rename the column names for easy accesibility

incd$Average.Annual.Count <- as.numeric(incd$Average.Annual.Count)

names(incd) <- c("state", "FIPS", "adj_age_100000", "lower_Adj_CI", "upper_Adj_CI", "avg_annual_count", "Recent.Trend", "5years_rate", "Lower_5Yrs_CI", "upper_5Yrs_CI")



# We are going to seperate the rows(Stable, Rising and Falling)


str(incd)

## 'data.frame':    75 obs. of  10 variables:
##  $ state           : Factor w/ 71 levels "","# Data do not include cases diagnosed in other states for those states in which the data exchange agreement specifically prohib"| __truncated__,..: 64 70 41 37 29 33 50 52 44 34 ...
##  $ FIPS            : int  0 55000 27000 23000 16000 19000 36000 38000 30000 20000 ...
##  $ adj_age_100000  : Factor w/ 35 levels "","¶¶ ","10.2",..: 17 35 34 33 32 31 30 29 28 27 ...
##  $ lower_Adj_CI    : Factor w/ 35 levels "","¶¶","10","10.4",..: 22 34 33 32 30 31 32 26 27 29 ...
##  $ upper_Adj_CI    : Factor w/ 36 levels ""," ¶¶","11",..: 15 36 34 35 32 30 28 33 31 29 ...
##  $ avg_annual_count: num  30 4 52 20 19 35 27 6 14 31 ...
##  $ Recent.Trend    : Factor w/ 5 levels "","¶¶","falling",..: 5 5 5 5 5 5 5 5 5 5 ...
##  $ 5years_rate     : Factor w/ 38 levels "","-0.1","-0.3",..: 15 2 32 17 6 15 26 12 33 28 ...
##  $ Lower_5Yrs_CI   : Factor w/ 43 levels "","-0.1","-0.3",..: 7 22 16 34 10 3 20 35 19 2 ...
##  $ upper_5Yrs_CI   : Factor w/ 43 levels "","-0.1","¶¶",..: 7 25 43 40 38 5 37 8 12 29 ...

kable(head(incd))

state	FIPS	adj_age_100000	lower_Adj_CI	upper_Adj_CI	avg_annual_count	Recent.Trend	5years_rate	Lower_5Yrs_CI	upper_5Yrs_CI
US (SEER+NPCR)(1,10)	0	13.2	13.1	13.2	30	stable	0	-1.3	1.2
Wisconsin(6,10)	55000	16.5	16.1	17	4	stable	-0.1	-3.6	3.6
Minnesota(6,10)	27000	16	15.5	16.4	52	stable	3.4	-2.7	9.8
Maine(6,10)	23000	15.7	14.8	16.6	20	stable	0.3	-8	9.2
Idaho(6,10)	16000	15.3	14.4	16.2	19	stable	-1.7	-10.4	7.7
Iowa(3,8)	19000	15.2	14.6	15.8	35	stable	0	-0.3	0.3

stable <- subset(incd, incd$Recent.Trend == "stable")
rising <- subset(incd, incd$Recent.Trend == "rising")
falling <- subset(incd, incd$Recent.Trend == "falling")

kable(head(stable))

state	FIPS	adj_age_100000	lower_Adj_CI	upper_Adj_CI	avg_annual_count	Recent.Trend	5years_rate	Lower_5Yrs_CI	upper_5Yrs_CI
US (SEER+NPCR)(1,10)	0	13.2	13.1	13.2	30	stable	0	-1.3	1.2
Wisconsin(6,10)	55000	16.5	16.1	17	4	stable	-0.1	-3.6	3.6
Minnesota(6,10)	27000	16	15.5	16.4	52	stable	3.4	-2.7	9.8
Maine(6,10)	23000	15.7	14.8	16.6	20	stable	0.3	-8	9.2
Idaho(6,10)	16000	15.3	14.4	16.2	19	stable	-1.7	-10.4	7.7
Iowa(3,8)	19000	15.2	14.6	15.8	35	stable	0	-0.3	0.3

kable(head(rising))

	state	FIPS	adj_age_100000	lower_Adj_CI	upper_Adj_CI	avg_annual_count	Recent.Trend	5years_rate	Lower_5Yrs_CI	upper_5Yrs_CI
12	Connecticut(3,8)	9000	14.7	14.2	15.2	38	rising	0.8	0.4	1.2
14	New Jersey(3,8)	34000	14.2	13.9	14.6	9	rising	0.4	0.1	0.6
24	Utah(3,8)	49000	13.3	12.6	14	26	rising	0.8	0.3	1.4
29	Tennessee(6,10)	47000	13.1	12.7	13.5	50	rising	2.6	0.1	5.2
34	Missouri(6,10)	29000	12.7	12.3	13.1	47	rising	2.4	0.3	4.6
43	Hawaii(3,8)	15000	12	11.2	12.8	16	rising	1.6	0.6	2.5

kable(head(falling))

	state	FIPS	adj_age_100000	lower_Adj_CI	upper_Adj_CI	avg_annual_count	Recent.Trend	5years_rate	Lower_5Yrs_CI	upper_5Yrs_CI
37	California(3,9)	6000	12.5	12.4	12.7	32	falling	-1.7	-3.4	-0.1

require(ggplot2)

ggplot(stable, col = "red", aes(x =state, y=round(avg_annual_count,2))) +
       geom_bar(stat="identity",position="dodge") + 
       xlab("States") + 
       ylab("Average Annual counts") + 
       ggtitle("Average Annual Stable Leukemia Counts per US States")

ggplot(rising, col = "blue", aes(x =state, y=round(avg_annual_count,2))) +
       geom_bar(stat="identity",position="dodge") + 
       xlab("States") + 
       ylab("Average Annual counts") + 
       ggtitle("Average Annual Rising Leukemia Counts per US States")

sample_all1 <- sample1[sample(nrow(sample1), 45), ]
sample_all2 <- dataset2[sample(nrow(dataset2), 45), ]
sample_all3 <- incd[sample(nrow(incd), 45), ]
allsamples <- data.frame(c(sample_all1, sample_all2, sample_all3))


all_dataset_subset <- allsamples[, c("avg_annual_count", "Annual_temp" , "co", "benzene")]
str(all_dataset_subset)

## 'data.frame':    45 obs. of  4 variables:
##  $ avg_annual_count: num  1 1 36 3 17 30 13 21 1 1 ...
##  $ Annual_temp     : num  25.06 25.8 26.82 5.03 -9.14 ...
##  $ co              : num  4.6 -200 0.5 1.1 1.5 1.1 -200 1 1.9 2.3 ...
##  $ benzene         : num  24.24 1.37 4.09 3.46 7.62 ...

linearmodel <- lm(all_dataset_subset$avg_annual_count ~(all_dataset_subset$co + all_dataset_subset$benzene))
linearmodel

## 
## Call:
## lm(formula = all_dataset_subset$avg_annual_count ~ (all_dataset_subset$co + 
##     all_dataset_subset$benzene))
## 
## Coefficients:
##                (Intercept)       all_dataset_subset$co  
##                   21.08803                     0.03006  
## all_dataset_subset$benzene  
##                    0.05977

anova(linearmodel)

## Analysis of Variance Table
## 
## Response: all_dataset_subset$avg_annual_count
##                            Df  Sum Sq Mean Sq F value Pr(>F)
## all_dataset_subset$co       1   193.6  193.63  0.6415 0.4277
## all_dataset_subset$benzene  1   306.3  306.35  1.0149 0.3195
## Residuals                  42 12677.8  301.85

Decision:

We therefore Reject \(H_o\), since calculated calculated value (P-Value) is greater than the tabulated value(Table value = 0.05 )

GENERAL CONCLUSION:

Based on the above analyzes done on the three datasets, we therefore conclude that there are strong relationships between the carbon-monoxide (co), Temperature (temp_ac) and Benzene. Though, the datasets might not be enough to be able to determine the actual causes, and links between the variables, we are strongly confident that a reduction in the exposure to their atmospheric condition (particularly Benzene) may aid in reducing leukemia in the United States and the World at large.

Note: The analysis is not exclusive as some other contributing factor(s) may also needs to be accessed.

REFERENCES:

IS 607: PROJECT

MUSA T. GANIYU

April 28, 2016