Air Temperature and Sea Surface Temperature (3)

Assignment 4

s3686502 Dan Enoka

15 October 2017

Data set obtained from https://www.kaggle.com/uciml/el-nino-dataset

RPubs link information http://rpubs.com/fast_Eddie/319318

library(readr)
library(stringr)    # provides str_c(), used for the printed messages below
elnino80_98Even <- read_csv("elnino80-98Even.csv")
#View(elnino80_98Even)

Introduction

The data set in this analysis comes from a series of approximately 70 moored buoys positioned throughout the equatorial Pacific, known as the Tropical Atmosphere Ocean (TAO) array. The readings include humidity, air temperature, sea surface temperature and subsurface temperatures down to a depth of 500 m. All readings were taken at the same time of day, between 1980 and 1998.

Problem Statement

My interest in this data set is to determine whether there has been any change in Air Temperature and / or Sea Surface Temperature.

I then intend to see if there is any correlation between these two sets of data.

To achieve this I intend to include Pearson's correlation coefficient and the F test for equality of variances, along with visual representations to assist with an understanding of the data.

PLEASE NOTE: The original data set contained in excess of 170,000 observations. To allow my computer to perform the computations within a reasonable time, I reduced the amount of data used. I selected “Even” years only and, as a further reduction, removed every row with an empty cell in the Air Temperature column or the Sea Surface Temperature column. This reduced the data set to a little over 75,000 entries, and all calculations were performed on this reduced and renamed data set.
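The reduction described above could be reproduced in R along the following lines. This is only a minimal sketch on a toy table: the column names, the two-digit `Year` column and the "." placeholder for a missing reading are assumptions made for illustration, not details confirmed by the original file.

```r
# Toy stand-in for the raw table; names and values are hypothetical.
raw <- data.frame(
  Year               = c(80, 81, 82, 82, 83, 84),
  `Air Temp`         = c("27.1", "26.5", ".", "28.0", "27.7", "26.9"),
  `Sea Surface Temp` = c("28.0", "27.2", "28.4", ".", "28.1", "27.5"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Convert text to numbers; non-numeric placeholders such as "." become NA.
raw$`Air Temp`         <- suppressWarnings(as.numeric(raw$`Air Temp`))
raw$`Sea Surface Temp` <- suppressWarnings(as.numeric(raw$`Sea Surface Temp`))

# Keep even years only, then drop every row missing either temperature.
even_years <- raw[raw$Year %% 2 == 0, ]
keep       <- complete.cases(even_years[, c("Air Temp", "Sea Surface Temp")])
reduced    <- even_years[keep, ]

nrow(reduced)   # number of rows left after both reductions
```

Doing the filtering in code rather than by hand keeps the two temperature columns the same length and makes the reduction repeatable.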

Visualisation Section

This section shows visual depictions of Air Temperature and Sea Surface Temperature.

I first looked at whether the summary statistics for Air and Sea temperature are similar. This step gives only a numerical understanding.

#-------------------------------------------------   below are even years for elnino 80-98 only

                                            # Min.   1st Qu.  Median  Mean    3rd Qu.  Max.  --Check Values--
summary(elnino80_98Even$`Air Temp`)         # 17.05   25.90   27.26   26.79   28.14   31.66    Air Temp

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   17.05   25.90   27.26   26.79   28.14   31.66

summary(elnino80_98Even$`Sea Surface Temp`) # 17.35   26.63   28.22   27.63   29.22   31.26    Sea Surface Temp

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   17.35   26.63   28.22   27.63   29.22   31.26

Air_mean    <-  as.numeric(mean(elnino80_98Even$`Air Temp`))          #needed for density plot below
Sea_mean    <-  as.numeric(mean(elnino80_98Even$`Sea Surface Temp`))   
air_sd      <-  as.numeric(sd(elnino80_98Even$`Air Temp`))
sea_sd      <-  as.numeric(sd(elnino80_98Even$`Sea Surface Temp`))

Boxplots

The summaries above didn’t give me the visual insight I had been hoping for, so at this point I tried box plots. Note that these boxplots may appear to show a symmetric, roughly normal distribution; this is because I have removed the outliers from the graph (outline = FALSE).

par(mfrow=c(1, 1))            #Boxplot of Air and Sea temp
boxplot(elnino80_98Even$`Air Temp`,elnino80_98Even$`Sea Surface Temp`, 
        col=c('violet', 'lightblue'),  main="Boxplot of Air Temperature vs Sea Surface Temperature" , 
        outline = FALSE, names =  c("Air Temperature", "Sea Surface Temperature") , medcol = "Blue", 
        staplelwd = 4,las=1 )
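Because `outline = FALSE` hides the outliers, it can help to count how many points are left off the plot. The sketch below uses `boxplot.stats()`, which applies the same 1.5 x IQR whisker rule as `boxplot()`. The data are simulated and left-skewed purely for illustration; they are not the buoy readings.

```r
set.seed(1)
# Simulated, left-skewed stand-in for a temperature column (not the buoy data).
temp <- 31 - rgamma(1000, shape = 2, scale = 1.5)

stats    <- boxplot.stats(temp)   # five-number summary plus flagged outliers
n_hidden <- length(stats$out)     # points that outline = FALSE would not draw

n_hidden                          # how many observations are hidden
n_hidden / length(temp)           # the same figure as a proportion
```

Reporting this count next to the boxplot makes it clear how much of the lower tail is missing from the picture.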

Histograms with Frequency applied

Both histograms show a negatively skewed tendency, with most readings concentrated at the higher temperatures, and interestingly the two look similar. It also appears as though the spread of both Air and Sea Temperature becomes smaller further along the observation index: in the index plots beside the histograms, the temperature falls below 25 less often as the index increases, while the highest point seems stable at about 30 degrees Celsius.

par(mfrow=c(2, 2))
hist(elnino80_98Even$`Air Temp`,xlab = "Air Temperature (Temp)",
     main = "Histogram of Air Temp Frequency", 
     col = "darkolivegreen1")
plot(elnino80_98Even$`Air Temp`,col="white",ylab = "Air Temp")
lines(elnino80_98Even$`Air Temp`,col="chartreuse2")

hist(elnino80_98Even$`Sea Surface Temp`,xlab = "Sea Surface Temperature (Temp)",
     main = "Sea Surface Temp Frequency",
     col = "lightblue")
plot(elnino80_98Even$`Sea Surface Temp`,col="white",ylab = "Sea Surface Temp")
lines(elnino80_98Even$`Sea Surface Temp`,col="darkmagenta")
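The negative skew described above can also be put into a number. The summaries earlier already hint at it, since the mean sits below the median for both variables. The sketch below defines a small moment-based skewness helper (a hypothetical function written for this example, not part of base R) and applies it to simulated data rather than the buoy readings.

```r
# Moment-based sample skewness: negative values indicate a longer left tail.
skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

set.seed(2)
left_skewed <- 31 - rgamma(5000, shape = 2, scale = 1.5)  # long tail toward low values
symmetric   <- rnorm(5000, mean = 27, sd = 1.5)           # no skew by construction

skewness(left_skewed)   # clearly negative
skewness(symmetric)     # close to zero
```

Applied to the `Air Temp` and `Sea Surface Temp` columns, the same helper would give one figure per variable to compare alongside the histograms.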

Histograms (Density) With Lines Overlay

After determining in the above sections that there appears to be a visual similarity, I then wondered what a visual depiction using a histogram density plot might look like. I have also added a density line overlay to this section, just out of curiosity.

par(mfrow=c(1, 2))           #Histogram with Lines overlay   
hist(elnino80_98Even$`Air Temp`,probability=TRUE, nclass=40,
     main="Histogram of Air Temperature\n   (Density scale)",col = "lightblue")
lines (density(elnino80_98Even$`Air Temp`),col="darkblue",lwd = 2)  
hist(elnino80_98Even$`Sea Surface Temp`,probability=TRUE, nclass=40,
     main="Histogram of Sea Temperature\n   (Density scale)",col = "aquamarine")
lines (density(elnino80_98Even$`Sea Surface Temp`),col="darkblue",lwd = 2)

Combined Lines Overlay

Now that I had a better idea that there may be some correlation between Air Temp and Sea Surface Temp, I decided to overlay the two density lines on one graph.

par(mfrow=c(1, 1))            #Combined Lines Overlay
plot (density(elnino80_98Even$`Sea Surface Temp`), 
      main = "Density Plot of Temperature Similarities", lwd = 2, col = "darkblue")
lines (density(elnino80_98Even$`Air Temp`),lwd = 2 ,col = "red")
colours = c("Blue = Sea Temperature","Red = Air Temperature")
legend(x = 18, y = 0.2,legend = colours,col=c("darkblue", "red"),pch = 16)   # legend colours match the two lines

Scatter plot with Regression line added

From all of the above, I decided to get a better visual depiction by adding a scatter plot of Air temp versus Sea surface temp and then adding a regression line.

                              #Scatter plot with Regression line setup
Airtemp2 <- elnino80_98Even$`Air Temp`^2               
Seatemp2 <- elnino80_98Even$`Sea Surface Temp`^2
AirSeaxy <- elnino80_98Even$`Air Temp`*elnino80_98Even$`Sea Surface Temp`

sum_x <- sum(elnino80_98Even$`Air Temp`)#sum_x         #The "Commented out" are check values
sum_y <- sum(elnino80_98Even$`Sea Surface Temp`)#sum_y
sum_x_sq <- sum(Airtemp2)#sum_x_sq
sum_y_sq <- sum(Seatemp2)#sum_y_sq
sum_xy <- sum(AirSeaxy)#sum_xy 

n <- length(elnino80_98Even$`Air Temp`)#n #Sample size

Lxx <- sum_x_sq-((sum_x^2)/n)#Lxx
Lyy <- sum_y_sq-((sum_y^2)/n)#Lyy
Lxy = sum_xy - (((sum_x)*(sum_y))/n)#Lxy

b = Lxy/Lxx                          # slope of Sea Surface Temp (y) on Air Temp (x)
a = mean(elnino80_98Even$`Sea Surface Temp`) - b*mean(elnino80_98Even$`Air Temp`)   # intercept: mean(y) - b*mean(x)

plot(elnino80_98Even$`Sea Surface Temp` ~ elnino80_98Even$`Air Temp`, 
     xlab = "Air Temperature", ylab = "Sea Temperature" , las=1 , col="grey56")
abline(a = a, b = b, col= "red")     #  <-    puts the red line in for         y = a + bx

Regression Line Equation for the above Scatter Plot

A <- round(a,3)                   #  Regression equation
B <- round(b,3)
str_c("The fitted Regression Equation as shown above in Red is   Y =  " , 
      A  ,"  + ", "  ", B , " (X)  + ",  intToUtf8(210), "" )  # intToUtf8(210) returns the Unicode character with code point 210 (Ò)

## [1] "The fitted Regression Equation as shown above in Red is   Y =  -2.595  +   1.063 (X)  + Ò"
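A hand-computed regression line can be cross-checked against R's built-in `lm()`. The sketch below uses the standard least-squares formulas (slope `Lxy / Lxx`, intercept `mean(y) - b * mean(x)`) on simulated data, with `x` standing in for Air Temp and `y` for Sea Surface Temp. The numbers are invented for illustration only.

```r
set.seed(3)
x <- rnorm(200, mean = 27, sd = 1.5)          # stand-in for Air Temp
y <- -1 + 1.05 * x + rnorm(200, sd = 0.5)     # stand-in for Sea Surface Temp

n   <- length(x)
Lxx <- sum(x^2) - sum(x)^2 / n                # corrected sum of squares of x
Lxy <- sum(x * y) - sum(x) * sum(y) / n       # corrected sum of cross-products
b   <- Lxy / Lxx                              # slope of y on x
a   <- mean(y) - b * mean(x)                  # intercept

fit <- lm(y ~ x)                              # built-in least-squares fit
all.equal(unname(coef(fit)), c(a, b))         # TRUE when the two approaches agree
```

If the hand calculation and `lm()` disagree on the real data, the intercept formula and the choice of which variable is `x` are the first things to check.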

Simulator to Observe various segments of Air and Sea temperature

In this section I wondered how best to show that there may be some similar behaviour between Air temp and Sea temp. I designed a simulator that randomly chooses a run of 5000 consecutive recorded observations and displays it graphically, as a way to gather more insight into this set of data. PLEASE NOTE: Each “Run” will select a new section from the data to be displayed.

par(mfrow = c(1,1 ))  # reset to 1 row and 1 column
r2 <- round(runif(1,1,65000),0)    #r2       #Simulator to observe varying segments of the population
r3 = r2  +  5000                   #r3
Temp <- as.numeric(elnino80_98Even$`Air Temp`[(r2+1):r3])   # (r2+1):r3 selects 5000 rows; r2+1:r3 would parse as r2 + (1:r3)
plot(Temp ,col="burlywood3", las=1,
     main = "Air Temp (Fawn) recorded at the same time as Sea Temp (Blue)")
lines(elnino80_98Even$`Sea Surface Temp`[(r2+1):r3], col="dodgerblue2")

date <-as.data.frame( c(elnino80_98Even$Date))
date1 =date[r2,]
date2 =date[r3,]

Simulator Observation listing

NOTE: This should be run after each Simulator “Run”, as it gives the exact location from which each sample is selected. Each random sample is 5000 observations long and is taken from the same rows of the Air temp and Sea temp columns in the elnino80-98Even data set.

#To obtain the specific column observation listing, this must be run immediately following the above simulator run
str_c("This is a random selection of 5000 observations beginning at observation  number " , r2 , "  on the date of " , 
      date1  , " and starting on the above graph at Index '0' (Please see above graph) and ending at observation number " ,
      r3 , "  on the date of "  , date2 , ". PLEASE NOTE these dates are ordered ' Year - Month - Day '. " )

## [1] "This is a random selection of 5000 observations beginning at observation  number 55436  on the date of 960408 and starting on the above graph at Index '0' (Please see above graph) and ending at observation number 60436  on the date of 960319. PLEASE NOTE these dates are ordered ' Year - Month - Day '. "
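A window of consecutive observations can be drawn so that a given run can be repeated later. The sketch below is one possible way to do it, not the simulator used above. It assumes 75,623 rows (the t-test degrees of freedom plus one) and uses `set.seed()` to make the draw reproducible. Note that `:` binds more tightly than `+` in R, so a range that starts at `start + 1` needs parentheses.

```r
set.seed(42)                     # fixes the random window so the run can be repeated
n_obs <- 75623                   # assumed row count of the reduced data set
width <- 5000                    # length of each window

# Choose a start so that the whole window fits inside the data.
start <- sample.int(n_obs - width + 1, 1)
idx   <- start:(start + width - 1)

length(idx)                      # always equal to width
range(idx)                       # first and last row of the window

# Operator precedence: 2 + 1:3 is 2 + (1:3), not (2 + 1):3.
2 + 1:3
(2 + 1):3
```

The vector `idx` can then be used to subset both temperature columns, so that the two plotted series always cover the same rows.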

Descriptive Statistics Section

This section performs a two-sided one-sample t-test for each of Air Temperature and Sea Surface Temperature, using the sample mean of the other variable as the hypothesised mean. It also performs a variance (F) test and then computes the correlation for the data set.

t.test(elnino80_98Even$`Air Temp`,             # Gives 95% Confidence Interval                 
       mu = Sea_mean, 
       alternative = "two.sided")

## 
##  One Sample t-test
## 
## data:  elnino80_98Even$`Air Temp`
## t = -122.68, df = 75622, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 27.63161
## 95 percent confidence interval:
##  26.77347 26.80045
## sample estimates:
## mean of x 
##  26.78696

t.test(elnino80_98Even$`Sea Surface Temp`,     # Gives 95% Confidence Interval 
       mu = Air_mean,
       alternative = "two.sided")

## 
##  One Sample t-test
## 
## data:  elnino80_98Even$`Sea Surface Temp`
## t = 109.17, df = 75622, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 26.78696
## 95 percent confidence interval:
##  27.61644 27.64677
## sample estimates:
## mean of x 
##  27.63161
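Since each row holds an air reading and a sea reading taken together, a paired t-test is one alternative way to compare the two means directly. This is offered only as a sketch of that alternative, not as the method used in this analysis, and it runs on simulated data with invented parameters.

```r
set.seed(4)
air <- rnorm(500, mean = 26.8, sd = 1.8)          # simulated air temperatures
sea <- air + rnorm(500, mean = 0.85, sd = 0.6)    # simulated paired sea temperatures

# Paired test: works on the row-by-row differences (air - sea).
paired <- t.test(air, sea, paired = TRUE, alternative = "two.sided")

paired$estimate    # mean of the differences
paired$conf.int    # 95% confidence interval for that mean difference
paired$p.value
```

Unlike a one-sample test against a fixed value, the paired version accounts for the sampling variability in both columns.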

Variance    <- var.test(elnino80_98Even$`Air Temp`,elnino80_98Even$`Sea Surface Temp`)   # F test; ratio of variances is about 0.79

Correlation <- cor(elnino80_98Even$`Air Temp`,elnino80_98Even$`Sea Surface Temp`)        # Correlation is close to 1 
                                                                                         # Pearsons correlation coefficient

Descriptive Statistics Section cont.

This section shows the outcomes of the Variance test and the Correlation test

Variance

## 
##  F test to compare two variances
## 
## data:  elnino80_98Even$`Air Temp` and elnino80_98Even$`Sea Surface Temp`
## F = 0.79185, num df = 75622, denom df = 75622, p-value < 2.2e-16
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
##  0.7806399 0.8032157
## sample estimates:
## ratio of variances 
##          0.7918474

str_c("A Correlation test between Air Temperature and Sea Surface Temperature has also been applied and shows it as being " , round(Correlation,3) , " " )

## [1] "A Correlation test between Air Temperature and Sea Surface Temperature has also been applied and shows it as being 0.946 "
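`cor()` returns only the coefficient. If a formal test with a confidence interval is wanted, `cor.test()` provides one. The sketch below shows the call on simulated data; the variables and their parameters are invented for illustration.

```r
set.seed(5)
air <- rnorm(300, mean = 26.8, sd = 1.8)      # simulated air temperatures
sea <- 1.06 * air + rnorm(300, sd = 0.7)      # simulated, strongly related sea temperatures

ct <- cor.test(air, sea, method = "pearson")  # Pearson correlation with a t-based test

ct$estimate    # the same value cor(air, sea) returns
ct$conf.int    # 95% confidence interval for the correlation
ct$p.value     # test of the null hypothesis that the correlation is zero
```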

Hypothesis Testing.

Null hypothesis: the mean Air Temperature (\(\mu_1\)) is equal to the mean Sea Surface Temperature (\(\mu_2\))

\[H_0: \mu_1 = \mu_2 \]

Alternative hypothesis: the mean Air Temperature is not equal to the mean Sea Surface Temperature

\[H_A: \mu_1 \ne \mu_2\]
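As a rough arithmetic check, the t statistic reported for Air Temperature can be recovered from the printed confidence interval. This is only a sketch: it assumes the large-sample critical value of 1.96 (reasonable with df = 75622) and uses the rounded figures from the output, so the last digit may differ.

\[ t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}}, \qquad \frac{s}{\sqrt{n}} \approx \frac{26.80045 - 26.77347}{2 \times 1.96} \approx 0.006883 \]

\[ t \approx \frac{26.78696 - 27.63161}{0.006883} \approx -122.7 \]

This is close to the reported value of t = -122.68.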

Discussion

As shown in the above One Sample t-test for Air Temperature, the p-value is < 2.2e-16, which is less than 0.05. Similarly, the One Sample t-test for Sea Surface Temperature gave a p-value < 2.2e-16, which is also less than 0.05. This suggests that there is a statistically significant difference between mean Air Temperature and mean Sea Surface Temperature. With this in mind, my decision is to “Reject the Null Hypothesis” in favour of the Alternative Hypothesis. Following this, the correlation of 0.946 shows that as Air Temperature rises, Sea Surface Temperature tends to rise as well, and vice versa.

Prior to my analysis, the data was checked for any irregularities. The columns I wished to investigate had a few missing values which, if interpreted incorrectly, could (although minimally) make further analyses give incorrect results. To minimize this possibility, and as mentioned at the beginning of this analysis, I removed any empty cells from the columns “Air Temp” and “Sea Surface Temp”, along with the entire row in each case. This gave an “n” count, or length, that matched in both columns. I’m not entirely sure why I did this, other than that I had trouble at first evaluating the unchanged original data set; I found that in the original data these empty cells contained non-numeric text rather than being truly empty. I consider there to be some flaws in this approach, although I believe the data set is large enough that any resulting flaws would hopefully be minimal. With these cells removed the data set was still enormous. To compensate, and as mentioned above, I chose only the even years, which effectively halved the data without, I consider, greatly compromising the results obtained.

I believe continued collection of such data is of paramount importance, because it is only through collecting and then analyzing such data that we can gain insight into what may or may not be happening in the world in which we live. There is, however, the possibility that at some point continued collection of this type of data in large quantities may become redundant. I believe this would not be beneficial, in that it would simply repeat what we already suspect. Rather, I believe collecting data through different methodologies and changing techniques would be a better approach and would yield greater insight.

To restate my findings, it appears as though there might be a relationship between Air Temperature and Sea Surface Temperature. I do, however, consider my regression equation to be possibly incorrect: on the scatter plot with the regression line added, the line seems a little steep. This could be the result of outliers influencing my equations, and given more time I would look further into this. My conclusion is that the lower values of Air and Sea Temperature aren’t being reached as often as they used to be, which suggests a rising average in both Air and Sea temperature. Further research should be done to investigate whether the average Air and Sea Temperature has changed significantly over time, and if so, why.

OF NOTE 1: The data appears to have been originally grouped into clusters for each buoy, meaning that when the simulator selects a run of 5000 observations it may cover just one buoy, rather than several as I had hoped. This could be adjusted by selecting, for instance, 100 observations for each of the 70 buoys at a randomly chosen time, grouping by location rather than by observation number. However, buoy locations are stated as latitude and longitude and are not fixed; the recorded positions drift with the sea currents.
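Sampling by location rather than by observation number might look like the sketch below. It is run on a toy table, and both the column names and the grouping rule are assumptions. The rule rounds latitude to the nearest degree and longitude to the nearest five degrees, chosen only because the data description says recorded positions stay within about one and five degrees of their nominal location.

```r
set.seed(6)
# Toy table: three hypothetical buoy sites with slightly drifting coordinates.
obs <- data.frame(
  Latitude  = c(rep(0.02, 50), rep(-5.01, 30), rep(8.97, 5)),
  Longitude = c(rep(-109.9, 50), rep(-140.2, 30), rep(165.1, 5)),
  Air       = rnorm(85, mean = 27, sd = 1.5)
)

# Assumed grouping rule: nearest degree of latitude, nearest 5 degrees of longitude.
site <- paste(round(obs$Latitude), round(obs$Longitude / 5) * 5)

per_site <- 10   # observations to draw from each site
rows <- unlist(lapply(split(seq_len(nrow(obs)), site), function(i) {
  if (length(i) <= per_site) i else sample(i, per_site)  # take all rows if a site is small
}))

stratified <- obs[rows, ]
table(site[rows])   # number of sampled rows per site
```

Each site contributes at most `per_site` rows, so a single buoy can no longer dominate the sample.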

OF NOTE 2: Several other plot variants were considered, for example a Q-Q plot. However, I found the size of the data at times overwhelming for my computer, and some of the plots simply looked odd. I considered these less effective for gaining insight into what the data might be trying to tell me, and as such chose not to use them.
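One way to keep a Q-Q plot manageable on a data set of this size is to draw it from a random subsample. The sketch below illustrates the idea on simulated data, and the subsample size of 2000 is an arbitrary choice.

```r
set.seed(7)
# Simulated stand-in for a large, left-skewed temperature column.
big <- 31 - rgamma(75000, shape = 2, scale = 1.5)

sub <- sample(big, 2000)             # random subsample, small enough to plot quickly
qq  <- qqnorm(sub, plot.it = FALSE)  # compute the Q-Q coordinates without drawing

length(qq$x)                         # one point per subsampled observation

# To draw the plot itself: qqnorm(sub); qqline(sub, col = "red")
```

A subsample of a few thousand points is usually enough to show whether the tails depart from a straight line.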

References

Information including the elnino data set was obtained from the site stated below:

https://www.kaggle.com/uciml/el-nino-dataset

OF NOTE: In the dataset used above for analysis, I removed every row with an EMPTY cell in the Air Temp or Sea Surface Temp column. Also, as the dataset was large, I used only the Even years from 1980 to 1998. I then renamed this file for use in my analysis.

The following has been taken directly from the kaggle website and describes the data set in detail:

Context

This data set contains oceanographic and surface meteorological readings taken from a series of buoys positioned throughout the equatorial Pacific. This data was collected with the Tropical Atmosphere Ocean (TAO) array, which consists of nearly 70 moored buoys spanning the equatorial Pacific, measuring oceanographic and surface meteorological variables critical for improved detection, understanding and prediction of seasonal-to-inter-annual climate variations originating in the tropics.

Content

The data consists of the following variables: date, latitude, longitude, zonal winds (west<0, east>0), meridional winds (south<0, north>0), relative humidity, air temperature, sea surface temperature and subsurface temperatures down to a depth of 500 meters. Data taken from the buoys from as early as 1980 for some locations. Other data that was taken in various locations are rainfall, solar radiation, current levels, and subsurface temperatures. The latitude and longitude in the data showed that the buoys moved around to different locations. The latitude values stayed within a degree from the approximate location. Yet the longitude values were sometimes as far as five degrees off of the approximate location. There are missing values in the data. Not all buoys are able to measure currents, rainfall, and solar radiation, so these values are missing dependent on the individual buoy. The amount of data available is also dependent on the buoy, as certain buoys were commissioned earlier than others.

All readings were taken at the same time of day.

Acknowledgement

This data set is part of the UCI Machine Learning Repository, and the original source can be found here. The original owner is the NOAA Pacific Marine Environmental Laboratory.

Dan Enoka s3686502