The output files produced in the steps below can be downloaded in either csv or rds format.
Download Cleaned Dataset (wide form as rds)
Download Cleaned Dataset (long form as rds)
Download Cleaned Dataset (long form as csv)
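For reference, here is a minimal sketch of how files like these could be written once the cleaned objects exist. The file names are illustrative, and depth_tidy_normalized / depth_spread_normalized are the long and wide data frames constructed later in this post.

## Illustrative only: producing the download files above.
## depth_tidy_normalized (long) and depth_spread_normalized (wide)
## are the cleaned data frames built further down.
saveRDS(depth_spread_normalized, "depth_wide.rds")
saveRDS(depth_tidy_normalized, "depth_long.rds")
write.csv(depth_tidy_normalized, "depth_long.csv", row.names = FALSE)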
The dataset “Annual mean 1x1 gridded data on Standard Levels” is taken from the Asia-Pacific Data-Research Center’s database. It was collected under the ARGO project, part of the US National Oceanic and Atmospheric Administration (NOAA). The version used here is 1.1f (May 2017). The archive contains 27 zipped files (one per ocean depth level) with an identical structure. Once unzipped, all of them can be loaded with the lapply() function. As working directories vary from user to user, I will load the files in a general way via the choose.files() function (Windows-only; on other platforms, list.files() with full.names = TRUE works just as well).
library(dplyr)  # bind_rows() and the %>% pipe
## The 'files' variable will contain the names and locations of the 27 datasets
files <- choose.files()
## You can read all the files in several ways; the most common are a for loop
## or one of the apply functions. For its simplicity, we will use lapply().
depth <- lapply(files, read.table) %>% bind_rows()
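For comparison, here is the equivalent for-loop version mentioned in the comment above; it produces the same merged data frame, just more verbosely.

## Equivalent for-loop version of the same read-and-merge step:
depth_list <- vector("list", length(files))
for (i in seq_along(files)) {
  depth_list[[i]] <- read.table(files[i])
}
depth <- bind_rows(depth_list)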
Unfortunately, the depth level is not specified in the dataset itself, but rather in the file name. For this reason I will extract the file names via basename() and create a vector containing the unique ocean depths. Furthermore, as I do not need all of the columns, some will be excluded, leaving only the longitude, latitude, salinity, and temperature. The files have no header, so one needs to consult the help file to see which variable each column corresponds to.
## Extract the longitude (1), latitude (2), salinity (6), and temperature (3) columns
depth_subset <- depth[, c(1, 2, 6, 3)]
colnames(depth_subset) <- c('longitude', 'latitude', 'salinity(psu)', 'temperature(C)')
## The depth level is specified in the file name, rather than within the file. For this reason
## let's extract the file names and create a vector containing the unique ocean depths:
(file_names <- basename(files))
## [1] "TS0000a.dat-000000" "TS0005a.dat-000000" "TS0010a.dat-000000"
## [4] "TS0020a.dat-000000" "TS0030a.dat-000000" "TS0050a.dat-000000"
## [7] "TS0075a.dat-000000" "TS0100a.dat-000000" "TS0125a.dat-000000"
## [10] "TS0150a.dat-000000" "TS0200a.dat-000000" "TS0250a.dat-000000"
## [13] "TS0300a.dat-000000" "TS0400a.dat-000000" "TS0500a.dat-000000"
## [16] "TS0600a.dat-000000" "TS0700a.dat-000000" "TS0800a.dat-000000"
## [19] "TS0900a.dat-000000" "TS1000a.dat-000000" "TS1100a.dat-000000"
## [22] "TS1200a.dat-000000" "TS1300a.dat-000000" "TS1400a.dat-000000"
## [25] "TS1500a.dat-000000" "TS1750a.dat-000000" "TS2000a.dat-000000"
## From inspecting the names, we see that the depth is indicated by the characters at positions 3 to 6.
(depths_vector <- as.numeric(substr(file_names, start = 3, stop = 6)))
## [1] 0 5 10 20 30 50 75 100 125 150 200 250 300 400
## [15] 500 600 700 800 900 1000 1100 1200 1300 1400 1500 1750 2000
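If relying on fixed character positions ever feels fragile, the depths can also be pulled out with a regular expression; a small sketch:

## Pattern-based alternative: extract the digits between "TS" and "a"
as.numeric(sub("^TS(\\d+)a.*$", "\\1", file_names))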
As mentioned earlier, there are 27 datasets with an identical structure, each having 64,800 observations, for a total of 1,749,600. You will notice some bizarre values of -999.00, which are used instead of NAs. We will filter them out later, but keep them for now: at this point we want to associate every row in the merged data frame with its depth level, and it is highly unlikely that the number of missing values is the same in each dataset. Prior to cleaning, however, each dataset has exactly 64,800 rows. Let’s create a vector which repeats each element of the depths vector 64,800 times and insert it as a column named depth(m) in the data frame.
rep_depths <- rep(depths_vector, each = 64800)  # same result as sapply() + as.vector(), but simpler
depth_subset$`depth(m)` <- rep_depths
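As a quick sanity check, each depth level should now appear exactly 64,800 times:

## This should return TRUE:
all(table(depth_subset$`depth(m)`) == 64800)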
library(knitr)       # kable()
library(kableExtra)  # kable_styling()
## The dimensions of the dataset prior to eliminating the NAs
dimensions_prior <- data.frame(prettyNum(nrow(depth_subset), big.mark = ","),
                               ncol(depth_subset))
names(dimensions_prior) <- c("Rows", "Columns")
dimensions_prior %>%
  kable(align = 'l') %>%
  kable_styling(position = 'left', font_size = 13,
                full_width = FALSE)
| Rows      | Columns |
|-----------|---------|
| 1,749,600 | 5       |
head(depth_subset)
## longitude latitude salinity(psu) temperature(C) depth(m)
## 1 0.5 -89.5 -999 -999 0
## 2 0.5 -88.5 -999 -999 0
## 3 0.5 -87.5 -999 -999 0
## 4 0.5 -86.5 -999 -999 0
## 5 0.5 -85.5 -999 -999 0
## 6 0.5 -84.5 -999 -999 0
Let’s remove all the -999.00 values and use the summary() function to spot any remaining suspicious values in the data frame. We will also add some nicer formatting via the kableExtra package.
depth_subset <- depth_subset %>%
filter(`temperature(C)` != -999.00)
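Filtering on temperature alone assumes salinity is missing in exactly the same cells. A quick hedge against that assumption is to check that no sentinel values survive:

## If this returns TRUE, salinity needs its own filter as well:
any(depth_subset$`salinity(psu)` == -999)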
## The dimensions of the dataset after eliminating the NAs:
dimensions <- data.frame(prettyNum(nrow(depth_subset), big.mark = ","),
                         ncol(depth_subset))
names(dimensions) <- c("Rows", "Columns")
dimensions %>%
  kable(align = 'l') %>%
  kable_styling(position = 'left', font_size = 13,
                full_width = FALSE)
| Rows    | Columns |
|---------|---------|
| 793,989 | 5       |
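In other words, the filtering removed 1,749,600 - 793,989 = 955,611 rows, i.e. the grid cells over land or without ocean data at a given depth.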
summary(depth_subset) %>%
  kable() %>%
  kable_styling(position = 'center', font_size = 13,
                full_width = TRUE)
| longitude     | latitude       | salinity(psu) | temperature(C)  | depth(m)       |
|---------------|----------------|---------------|-----------------|----------------|
| Min. : 0.5    | Min. :-62.50   | Min. :30.97   | Min. :-2.023    | Min. : 0.0     |
| 1st Qu.:123.5 | 1st Qu.:-41.50 | 1st Qu.:34.41 | 1st Qu.: 3.277  | 1st Qu.: 75.0  |
| Median :199.5 | Median :-16.50 | Median :34.62 | Median : 6.303  | Median : 400.0 |
| Mean :196.1   | Mean :-12.48   | Mean :34.75   | Mean : 9.730    | Mean : 609.8   |
| 3rd Qu.:271.5 | 3rd Qu.: 13.50 | 3rd Qu.:35.00 | 3rd Qu.:14.912  | 3rd Qu.:1100.0 |
| Max. :359.5   | Max. : 64.50   | Max. :37.55   | Max. :30.270    | Max. :2000.0   |
Now the data contains no bizarre values. However, working in R frequently means switching between wide and long data formats, so let’s create a second data frame in the wide format via the spread() function from tidyr.
library(tidyr)  # spread()
depths_chr_name <- paste0(as.character(depths_vector), "m")
## We will drop the 3rd column (salinity) in the spread format:
depth_spread <- spread(depth_subset[, -3], key = `depth(m)`, value = `temperature(C)`)
colnames(depth_spread)[3:ncol(depth_spread)] <- depths_chr_name
## As there are 27 temperature columns, I will show only some of them.
head(depth_spread[, c(1:2, seq(from = 3, to = 29, by = 4))]) %>%
  kable() %>%
  kable_styling(position = 'center',
                font_size = 13,
                full_width = TRUE)
| longitude | latitude | 0m     | 30m    | 125m   | 300m  | 700m  | 1100m | 1500m  |
|-----------|----------|--------|--------|--------|-------|-------|-------|--------|
| 0.5       | -62.5    | -0.530 | -0.902 | -0.835 | 0.535 | 0.310 | 0.129 | -0.018 |
| 0.5       | -61.5    | -0.391 | -0.544 | -0.846 | 0.506 | 0.286 | 0.108 | -0.037 |
| 0.5       | -60.5    | -0.996 | -1.098 | -0.353 | 0.498 | 0.286 | 0.108 | -0.037 |
| 0.5       | -59.5    | -1.052 | -1.086 | -0.435 | 0.474 | 0.285 | 0.104 | -0.043 |
| 0.5       | -58.5    | -0.104 | -0.124 | -1.015 | 0.507 | 0.347 | 0.169 | 0.011  |
| 0.5       | -57.5    | 0.104  | 0.080  | -0.907 | 0.617 | 0.415 | 0.247 | 0.087  |
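For completeness, gather() is the inverse of spread() and would take the wide frame back to a long one. A sketch (note that the depth labels come back as character strings such as "0m", so they would need re-parsing to numbers):

## Inverse operation: wide back to long with gather()
depth_long_again <- depth_spread %>%
  gather(key = "depth", value = "temperature(C)", -longitude, -latitude)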
If you look at the longitude and latitude columns, you will notice the dataset is a bit inconsistent in its formatting: longitude is given in the range 0 to 360 degrees, while latitude runs from -90 (South) to +90 (North). It would be nicer to have both coordinates in the same signed convention, so let’s subtract 360 from all longitude values above 180, putting longitude in the range -180 (West) to +180 (East).
depth_spread_normalized <- depth_spread %>%
mutate(longitude = ifelse(longitude > 180,
longitude - 360,
longitude))
## Doing the same for the first data frame, which is in the long (tidy) format
depth_tidy_normalized <- depth_subset %>%
mutate(longitude = ifelse(longitude > 180,
longitude - 360,
longitude))
Let’s inspect the last rows of the initial and the new data frame to verify that the modification has occurred:
tail(depth_spread[, 1:7])
## longitude latitude 0m 5m 10m 20m 30m
## 29402 359.5 -0.5 27.178 27.083 27.001 25.961 23.845
## 29403 359.5 0.5 27.675 27.645 27.645 27.171 25.378
## 29404 359.5 1.5 28.008 28.003 28.017 27.963 27.184
## 29405 359.5 2.5 28.263 28.254 28.233 28.206 27.653
## 29406 359.5 3.5 28.233 28.214 28.038 27.756 26.590
## 29407 359.5 4.5 28.172 28.135 27.647 26.966 25.214
tail(depth_spread_normalized[, 1:7])
## longitude latitude 0m 5m 10m 20m 30m
## 29402 -0.5 -0.5 27.178 27.083 27.001 25.961 23.845
## 29403 -0.5 0.5 27.675 27.645 27.645 27.171 25.378
## 29404 -0.5 1.5 28.008 28.003 28.017 27.963 27.184
## 29405 -0.5 2.5 28.263 28.254 28.233 28.206 27.653
## 29406 -0.5 3.5 28.233 28.214 28.038 27.756 26.590
## 29407 -0.5 4.5 28.172 28.135 27.647 26.966 25.214
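A one-line range check confirms the normalization across the whole column:

## The longitude range should now be -179.5 to 179.5:
range(depth_spread_normalized$longitude)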
Let’s say we want to find the coordinates of the three hottest and three coldest ocean surface cells (the 0m column). First, we will create two vectors containing the three highest and three lowest surface temperatures, and then filter the data frame to obtain their geospatial coordinates.
(maxTemp <- sort(depth_spread_normalized$`0m`, decreasing = TRUE)[1:3])
## [1] 30.270 30.009 30.001
(minTemp <- sort(depth_spread_normalized$`0m`, decreasing = FALSE)[1:3])
## [1] -1.986 -1.977 -1.967
## Filter the data frame by finding the 3 highest temperatures in the column `0m`
depth_spread_normalized %>% filter(`0m` %in% maxTemp) %>%
select(longitude, latitude, `0m`) %>%
arrange(desc(`0m`))
## longitude latitude 0m
## 1 -99.5 16.5 30.270
## 2 154.5 -0.5 30.009
## 3 -100.5 16.5 30.001
The hottest ocean surface in this dataset was recorded at longitude -99.5 (West), latitude 16.5 (North) in the Pacific Ocean, about 37 km south of the town of San Marcos, Guerrero, Mexico. Compute with Wolfram Alpha
depth_spread_normalized %>% filter(`0m` %in% minTemp) %>%
select(longitude, latitude, `0m`) %>%
arrange(`0m`)
## longitude latitude 0m
## 1 -32.5 -62.5 -1.986
## 2 -33.5 -62.5 -1.977
## 3 -41.5 -62.5 -1.967
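As an aside, with dplyr 1.0 or newer the same two lookups can be written more compactly with slice_max() and slice_min(); a sketch assuming that version is available:

## Compact alternative (requires dplyr >= 1.0):
depth_spread_normalized %>%
  slice_max(`0m`, n = 3) %>%
  select(longitude, latitude, `0m`)
depth_spread_normalized %>%
  slice_min(`0m`, n = 3) %>%
  select(longitude, latitude, `0m`)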
The coldest ocean surface in this dataset was recorded at longitude -32.5 (West), latitude -62.5 (South) in the Southern (aka Antarctic) Ocean, about 677 km north of Orcadas, Argentina’s permanent station in Antarctica. Compute with Wolfram Alpha
Let’s visualize the thermoclines at those two locations. For this purpose the long (tidy) format is usually more appropriate.
library(ggplot2)
library(ggthemes)
max <- depth_tidy_normalized %>%
  filter(longitude == -99.5 & latitude == 16.5) %>%
  select(`depth(m)`, `temperature(C)`)
min <- depth_tidy_normalized %>%
  filter(longitude == -32.5 & latitude == -62.5) %>%
  select(`depth(m)`, `temperature(C)`)
ggplot(max, aes(x = `depth(m)`, y = `temperature(C)`,
                col = -`depth(m)`)) +
  geom_point(size = 2.8) +
  geom_line(lwd = 1.25) +
  labs(col = 'Depth (meters)') +
  ggtitle(label = "Location with the highest surface temperature") +
  theme_classic(base_size = 12)
## Doing the same for the coldest place, which has a more interesting profile
ggplot(min, aes(x = `depth(m)`, y = `temperature(C)`,
                col = -`depth(m)`)) +
  geom_point(size = 2.8) +
  geom_line(lwd = 1.25) +
  labs(col = 'Depth (meters)') +
  ggtitle(label = "Location with the lowest surface temperature") +
  theme_classic(base_size = 12)
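If you prefer a single figure, the two profiles can also be stacked and faceted; a minimal sketch reusing the max and min data frames from above (the panel titles are my own labels):

## Optional: both thermoclines side by side in one faceted figure
profiles <- bind_rows(
  mutate(max, location = "Hottest surface (16.5N, 99.5W)"),
  mutate(min, location = "Coldest surface (62.5S, 32.5W)")
)
ggplot(profiles, aes(x = `depth(m)`, y = `temperature(C)`,
                     col = -`depth(m)`)) +
  geom_point(size = 2.8) +
  geom_line(lwd = 1.25) +
  facet_wrap(~ location, scales = "free_y") +
  labs(col = 'Depth (meters)') +
  theme_classic(base_size = 12)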
[1] NOAA, ARGO Project; http://apdrc.soest.hawaii.edu/projects/Argo/data/gridded/On_standard_levels/
[2] Wolfram Knowledge Database