The output files produced in the steps below can be downloaded in either csv or rds format.
Download Cleaned Dataset (wide form as rds)
Download Cleaned Dataset (long form as rds)
Download Cleaned Dataset (long form as csv)
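For reference, here is a minimal sketch of how files like these could be written once the cleaned objects exist. The file names are illustrative, and depth_tidy_normalized / depth_spread_normalized are the long and wide data frames constructed later in this post.

## Illustrative only: producing the download files above.
## depth_tidy_normalized (long) and depth_spread_normalized (wide)
## are the cleaned data frames built further down.
saveRDS(depth_spread_normalized, "depth_wide.rds")
saveRDS(depth_tidy_normalized, "depth_long.rds")
write.csv(depth_tidy_normalized, "depth_long.csv", row.names = FALSE)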
The dataset “Annual mean 1x1 gridded data on Standard Levels” is taken from the Asia-Pacific Data-Research Center’s database. It was collected under the ARGO project, part of the US National Oceanic and Atmospheric Administration (NOAA). The version used here is 1.1f (May 2017). The archive contains 27 zipped files (one per ocean depth level) with an identical structure. Once unzipped, all of them can be loaded with the lapply() function. As working directories vary from user to user, I will load the files in a general way via the choose.files() function (Windows-only; on other platforms, list.files() with full.names = TRUE works just as well).
library(dplyr)  # bind_rows() and the %>% pipe
## The 'files' variable will contain the names and locations of the 27 datasets
files <- choose.files()
## You can read all the files in several ways; the most common are a for loop
## or one of the apply functions. For its simplicity, we will use lapply().
depth <- lapply(files, read.table) %>% bind_rows()
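For comparison, here is the equivalent for-loop version mentioned in the comment above; it produces the same merged data frame, just more verbosely.

## Equivalent for-loop version of the same read-and-merge step:
depth_list <- vector("list", length(files))
for (i in seq_along(files)) {
  depth_list[[i]] <- read.table(files[i])
}
depth <- bind_rows(depth_list)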
Unfortunately, the depth level is not specified in the dataset itself, but rather in the file name. For this reason I will extract the file names via basename() and create a vector containing the unique ocean depths. Furthermore, as I do not need all of the columns, some will be excluded, leaving only the longitude, latitude, salinity, and temperature. The files have no header, so one needs to consult the help file to see which variable each column corresponds to.
## Extract the longitude (1), latitude (2), salinity (6), and temperature (3) columns
depth_subset <- depth[, c(1, 2, 6, 3)]
colnames(depth_subset) <- c('longitude', 'latitude', 'salinity(psu)', 'temperature(C)')
## The depth level is specified in the file name, rather than within the file. For this reason
## let's extract the file names and create a vector containing the unique ocean depths:
(file_names <- basename(files))
## [1] "TS0000a.dat-000000" "TS0005a.dat-000000" "TS0010a.dat-000000"
## [4] "TS0020a.dat-000000" "TS0030a.dat-000000" "TS0050a.dat-000000"
## [7] "TS0075a.dat-000000" "TS0100a.dat-000000" "TS0125a.dat-000000"
## [10] "TS0150a.dat-000000" "TS0200a.dat-000000" "TS0250a.dat-000000"
## [13] "TS0300a.dat-000000" "TS0400a.dat-000000" "TS0500a.dat-000000"
## [16] "TS0600a.dat-000000" "TS0700a.dat-000000" "TS0800a.dat-000000"
## [19] "TS0900a.dat-000000" "TS1000a.dat-000000" "TS1100a.dat-000000"
## [22] "TS1200a.dat-000000" "TS1300a.dat-000000" "TS1400a.dat-000000"
## [25] "TS1500a.dat-000000" "TS1750a.dat-000000" "TS2000a.dat-000000"
## From inspecting the names, we see that the depth is indicated by the characters at positions 3 to 6.
(depths_vector <- as.numeric(substr(file_names, start = 3, stop = 6)))
## [1] 0 5 10 20 30 50 75 100 125 150 200 250 300 400
## [15] 500 600 700 800 900 1000 1100 1200 1300 1400 1500 1750 2000
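If relying on fixed character positions ever feels fragile, the depths can also be pulled out with a regular expression; a small sketch:

## Pattern-based alternative: extract the digits between "TS" and "a"
as.numeric(sub("^TS(\\d+)a.*$", "\\1", file_names))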
As mentioned earlier, there are 27 datasets with an identical structure, each having 64,800 observations, for a total of 1,749,600. You will notice some bizarre values of -999.00, which are used instead of NAs. We will filter them out later, but keep them for now: at this point we want to associate every row in the merged data frame with its depth level, and it is highly unlikely that the number of missing values is the same in each dataset. Prior to cleaning, however, each dataset has exactly 64,800 rows. Let’s create a vector which repeats each element of the depths vector 64,800 times and insert it as a column named depth(m) in the data frame.
rep_depths <- rep(depths_vector, each = 64800)  # same result as sapply() + as.vector(), but simpler
depth_subset$`depth(m)` <- rep_depths
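As a quick sanity check, each depth level should now appear exactly 64,800 times:

## This should return TRUE:
all(table(depth_subset$`depth(m)`) == 64800)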
library(knitr)       # kable()
library(kableExtra)  # kable_styling()
## The dimensions of the dataset prior to eliminating the NAs
dimensions_prior <- data.frame(prettyNum(nrow(depth_subset), big.mark = ","),
                               ncol(depth_subset))
names(dimensions_prior) <- c("Rows", "Columns")
dimensions_prior %>%
  kable(align = 'l') %>%
  kable_styling(position = 'left', font_size = 13,
                full_width = FALSE)
| Rows      | Columns |
|-----------|---------|
| 1,749,600 | 5       |
head(depth_subset)
## longitude latitude salinity(psu) temperature(C) depth(m)
## 1 0.5 -89.5 -999 -999 0
## 2 0.5 -88.5 -999 -999 0
## 3 0.5 -87.5 -999 -999 0
## 4 0.5 -86.5 -999 -999 0
## 5 0.5 -85.5 -999 -999 0
## 6 0.5 -84.5 -999 -999 0
Let’s remove all the -999.00 values and use the summary() function to spot any remaining suspicious values in the data frame. We will also add some nicer formatting via the kableExtra package.
depth_subset <- depth_subset %>%
filter(`temperature(C)` != -999.00)
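Filtering on temperature alone assumes salinity is missing in exactly the same cells. A quick hedge against that assumption is to check that no sentinel values survive:

## If this returns TRUE, salinity needs its own filter as well:
any(depth_subset$`salinity(psu)` == -999)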
## The dimensions of the dataset after eliminating the NAs:
dimensions <- data.frame(prettyNum(nrow(depth_subset), big.mark = ","),
                         ncol(depth_subset))
names(dimensions) <- c("Rows", "Columns")
dimensions %>%
  kable(align = 'l') %>%
  kable_styling(position = 'left', font_size = 13,
                full_width = FALSE)
| Rows    | Columns |
|---------|---------|
| 793,989 | 5       |
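In other words, the filtering removed 1,749,600 - 793,989 = 955,611 rows, i.e. the grid cells over land or without ocean data at a given depth.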
summary(depth_subset) %>%
  kable() %>%
  kable_styling(position = 'center', font_size = 13,
                full_width = TRUE)
| longitude     | latitude       | salinity(psu) | temperature(C)  | depth(m)       |
|---------------|----------------|---------------|-----------------|----------------|
| Min. : 0.5    | Min. :-62.50   | Min. :30.97   | Min. :-2.023    | Min. : 0.0     |
| 1st Qu.:123.5 | 1st Qu.:-41.50 | 1st Qu.:34.41 | 1st Qu.: 3.277  | 1st Qu.: 75.0  |
| Median :199.5 | Median :-16.50 | Median :34.62 | Median : 6.303  | Median : 400.0 |
| Mean :196.1   | Mean :-12.48   | Mean :34.75   | Mean : 9.730    | Mean : 609.8   |
| 3rd Qu.:271.5 | 3rd Qu.: 13.50 | 3rd Qu.:35.00 | 3rd Qu.:14.912  | 3rd Qu.:1100.0 |
| Max. :359.5   | Max. : 64.50   | Max. :37.55   | Max. :30.270    | Max. :2000.0   |
Now the data contains no bizarre values. However, working in R frequently means switching between wide and long data formats, so let’s create a second data frame in the wide format via the spread() function from tidyr.
library(tidyr)  # spread()
depths_chr_name <- paste0(as.character(depths_vector), "m")
## We will drop the 3rd column (salinity) in the spread format:
depth_spread <- spread(depth_subset[, -3], key = `depth(m)`, value = `temperature(C)`)
colnames(depth_spread)[3:ncol(depth_spread)] <- depths_chr_name
## As there are 27 temperature columns, I will show only some of them.
head(depth_spread[, c(1:2, seq(from = 3, to = 29, by = 4))]) %>%
  kable() %>%
  kable_styling(position = 'center',
                font_size = 13,
                full_width = TRUE)
| longitude | latitude | 0m     | 30m    | 125m   | 300m  | 700m  | 1100m | 1500m  |
|-----------|----------|--------|--------|--------|-------|-------|-------|--------|
| 0.5       | -62.5    | -0.530 | -0.902 | -0.835 | 0.535 | 0.310 | 0.129 | -0.018 |
| 0.5       | -61.5    | -0.391 | -0.544 | -0.846 | 0.506 | 0.286 | 0.108 | -0.037 |
| 0.5       | -60.5    | -0.996 | -1.098 | -0.353 | 0.498 | 0.286 | 0.108 | -0.037 |
| 0.5       | -59.5    | -1.052 | -1.086 | -0.435 | 0.474 | 0.285 | 0.104 | -0.043 |
| 0.5       | -58.5    | -0.104 | -0.124 | -1.015 | 0.507 | 0.347 | 0.169 | 0.011  |
| 0.5       | -57.5    | 0.104  | 0.080  | -0.907 | 0.617 | 0.415 | 0.247 | 0.087  |
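For completeness, gather() is the inverse of spread() and would take the wide frame back to a long one. A sketch (note that the depth labels come back as character strings such as "0m", so they would need re-parsing to numbers):

## Inverse operation: wide back to long with gather()
depth_long_again <- depth_spread %>%
  gather(key = "depth", value = "temperature(C)", -longitude, -latitude)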
If you look at the longitude and latitude columns, you will notice the dataset is a bit inconsistent in its formatting: longitude is given in the range 0 to 360 degrees, while latitude runs from -90 (South) to +90 (North). It would be nicer to have both coordinates in the same signed convention, so let’s subtract 360 from all longitude values above 180, putting longitude in the range -180 (West) to +180 (East).
depth_spread_normalized <- depth_spread %>%
mutate(longitude = ifelse(longitude > 180,
longitude - 360,
longitude))
## Doing the same for the first data frame, which is in the long (tidy) format
depth_tidy_normalized <- depth_subset %>%
mutate(longitude = ifelse(longitude > 180,
longitude - 360,
longitude))
Let’s inspect the last rows of the initial and the new data frame to verify that the modification has occurred:
tail(depth_spread[, 1:7])
## longitude latitude 0m 5m 10m 20m 30m
## 29402 359.5 -0.5 27.178 27.083 27.001 25.961 23.845
## 29403 359.5 0.5 27.675 27.645 27.645 27.171 25.378
## 29404 359.5 1.5 28.008 28.003 28.017 27.963 27.184
## 29405 359.5 2.5 28.263 28.254 28.233 28.206 27.653
## 29406 359.5 3.5 28.233 28.214 28.038 27.756 26.590
## 29407 359.5 4.5 28.172 28.135 27.647 26.966 25.214
tail(depth_spread_normalized[, 1:7])
## longitude latitude 0m 5m 10m 20m 30m
## 29402 -0.5 -0.5 27.178 27.083 27.001 25.961 23.845
## 29403 -0.5 0.5 27.675 27.645 27.645 27.171 25.378
## 29404 -0.5 1.5 28.008 28.003 28.017 27.963 27.184
## 29405 -0.5 2.5 28.263 28.254 28.233 28.206 27.653
## 29406 -0.5 3.5 28.233 28.214 28.038 27.756 26.590
## 29407 -0.5 4.5 28.172 28.135 27.647 26.966 25.214
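A one-line range check confirms the normalization across the whole column:

## The longitude range should now be -179.5 to 179.5:
range(depth_spread_normalized$longitude)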
Let’s say we want to find the coordinates of the three hottest and three coldest ocean surface cells (the 0m column). First, we will create two vectors containing the three highest and three lowest surface temperatures, and then filter the data frame to obtain their geospatial coordinates.
(maxTemp <- sort(depth_spread_normalized$`0m`, decreasing = TRUE)[1:3])
## [1] 30.270 30.009 30.001
(minTemp <- sort(depth_spread_normalized$`0m`, decreasing = FALSE)[1:3])
## [1] -1.986 -1.977 -1.967
## Filter the data frame by finding the 3 highest temperatures in the column `0m`
depth_spread_normalized %>% filter(`0m` %in% maxTemp) %>%
select(longitude, latitude, `0m`) %>%
arrange(desc(`0m`))
## longitude latitude 0m
## 1 -99.5 16.5 30.270
## 2 154.5 -0.5 30.009
## 3 -100.5 16.5 30.001
The hottest ocean surface in this dataset was recorded at longitude -99.5 (West), latitude 16.5 (North) in the Pacific Ocean, about 37 km south of the town of San Marcos, Guerrero, Mexico. Compute with Wolfram Alpha
depth_spread_normalized %>% filter(`0m` %in% minTemp) %>%
select(longitude, latitude, `0m`) %>%
arrange(`0m`)
## longitude latitude 0m
## 1 -32.5 -62.5 -1.986
## 2 -33.5 -62.5 -1.977
## 3 -41.5 -62.5 -1.967
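As an aside, with dplyr 1.0 or newer the same two lookups can be written more compactly with slice_max() and slice_min(); a sketch assuming that version is available:

## Compact alternative (requires dplyr >= 1.0):
depth_spread_normalized %>%
  slice_max(`0m`, n = 3) %>%
  select(longitude, latitude, `0m`)
depth_spread_normalized %>%
  slice_min(`0m`, n = 3) %>%
  select(longitude, latitude, `0m`)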
The coldest ocean surface in this dataset was recorded at longitude -32.5 (West), latitude -62.5 (South) in the Southern (aka Antarctic) Ocean, about 677 km north of Orcadas, Argentina’s permanent station in Antarctica. Compute with Wolfram Alpha
Let’s visualize the thermoclines at those two locations. For this purpose the long (tidy) format is usually more appropriate.
library(ggplot2)
library(ggthemes)
max <- depth_tidy_normalized %>%
  filter(longitude == -99.5 & latitude == 16.5) %>%
  select(`depth(m)`, `temperature(C)`)
min <- depth_tidy_normalized %>%
  filter(longitude == -32.5 & latitude == -62.5) %>%
  select(`depth(m)`, `temperature(C)`)
ggplot(max, aes(x = `depth(m)`, y = `temperature(C)`,
                col = -`depth(m)`)) +
  geom_point(size = 2.8) +
  geom_line(lwd = 1.25) +
  labs(col = 'Depth (meters)') +
  ggtitle(label = "Location with the highest surface temperature") +
  theme_classic(base_size = 12)
## Doing the same for the coldest place, which has a more interesting profile
ggplot(min, aes(x = `depth(m)`, y = `temperature(C)`,
                col = -`depth(m)`)) +
  geom_point(size = 2.8) +
  geom_line(lwd = 1.25) +
  labs(col = 'Depth (meters)') +
  ggtitle(label = "Location with the lowest surface temperature") +
  theme_classic(base_size = 12)
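If you prefer a single figure, the two profiles can also be stacked and faceted; a minimal sketch reusing the max and min data frames from above (the panel titles are my own labels):

## Optional: both thermoclines side by side in one faceted figure
profiles <- bind_rows(
  mutate(max, location = "Hottest surface (16.5N, 99.5W)"),
  mutate(min, location = "Coldest surface (62.5S, 32.5W)")
)
ggplot(profiles, aes(x = `depth(m)`, y = `temperature(C)`,
                     col = -`depth(m)`)) +
  geom_point(size = 2.8) +
  geom_line(lwd = 1.25) +
  facet_wrap(~ location, scales = "free_y") +
  labs(col = 'Depth (meters)') +
  theme_classic(base_size = 12)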
[1] NOAA, ARGO Project; http://apdrc.soest.hawaii.edu/projects/Argo/data/gridded/On_standard_levels/
[2] Wolfram Knowledge Database