library(readxl)
Data <- read_excel("~/Desktop/EPSRC Project /ArcLakeGroupSummary.xlsx")
str(Data)
## Classes 'tbl_df', 'tbl' and 'data.frame': 732 obs. of 12 variables:
## $ GloboLakes_ID : num 2 3 4 5 6 7 8 9 10 11 ...
## $ LakeName : chr "SUPERIOR" "VICTORIA" "ARAL Sea" "HURON" ...
## $ Group : num 4 6 9 5 9 6 4 4 6 4 ...
## $ Latitude : num 47.7 -1.3 45.1 44.8 43.9 ...
## $ Longitude : num -88.2 33.2 60.1 -82.2 -87.1 ...
## $ Type : chr "Lake" "Lake" "Lake" "Lake" ...
## $ Elevation : num 184 1140 42 176 176 837 450 157 485 158 ...
## $ LakeSize : num 4094 2313 1535 2763 2710 ...
## $ OverallAvg : num 6.1 25.14 11.11 8.35 9.42 ...
## $ OverallMeanAmp: num 14.65 1.56 26.72 19.28 19.26 ...
## $ PC1 : num -41.3 100.1 -11.1 -29.7 -24.2 ...
## $ PC2 : num -19.132 -13.852 35.134 0.822 3.668 ...
Variable Name | Description |
---|---|
Group | The group number (there are 9 in total) |
Latitude | The Latitude of the lake |
Longitude | The Longitude of the lake |
Type | A factor with three levels - lake, lagoon & reservoir |
Elevation | The altitude of the lake (m) |
LakeSize | The size of the lake, in terms of the number of ~1 km pixels |
OverallAvg | The mean temperature of the lake (Celsius) |
OverallMeanAmp | The overall temperature amplitude of the lake (Celsius) |
PC1 & PC2 | Scores describing variability of the lakes (derived from a PC analysis of the temperature data) |
A lake is a large body of water (larger and deeper than a pond) within a body of land. A lagoon is a shallow body of water separated from a larger body of water by barrier islands or reefs. A reservoir is a large natural or artificial lake used as a source of water supply.
We begin by separating the data set into the 9 classification groups. This lets us obtain numerical summaries for each group and form some initial impressions of how the groups differ and how they might be separated.
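The objects `Group1` to `Group9` used in the summaries are not created in the code shown. A minimal sketch of one way to build them is given here, assuming they are simply the rows of `Data` for each value of `Group`. The `toy` data frame and its values are invented so that the snippet runs on its own; in practice `Data` would take its place.

```r
# Toy stand-in for `Data`; all values are invented for illustration only.
toy <- data.frame(
  GloboLakes_ID = 1:6,
  LakeName      = paste("Lake", LETTERS[1:6]),
  Group         = c(1, 2, 1, 3, 2, 1),
  Latitude      = c(40.1, 30.2, 38.5, 2.0, 31.7, 36.9),
  Longitude     = c(12.4, -90.3, 45.0, -60.2, 44.1, 116.5)
)

# split() returns a named list with one data frame per group value.
groups <- split(toy, toy$Group)

# Rename the pieces Group1, Group2, ... to match the object names used in the text.
names(groups) <- paste0("Group", names(groups))

# Copy the list elements into the current environment as separate objects.
invisible(list2env(groups, envir = environment()))

# Same call as in the report: drop ID, name and group label before summarising.
summary(Group1[, -c(1:3)])
```

Keeping the list `groups` as well as the separate objects makes it easy to loop over all groups, for example with `lapply(groups, function(g) summary(g[, -c(1:3)]))`.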
summary(Group1[,-c(1:3)])
## Latitude Longitude Type Elevation
## Min. :24.85 Min. :-122.77 Length:55 Min. : -22.0
## 1st Qu.:34.34 1st Qu.: 13.98 Class :character 1st Qu.: 3.0
## Median :38.07 Median : 45.49 Mode :character Median : 69.0
## Mean :37.45 Mean : 48.66 Mean : 360.7
## 3rd Qu.:40.98 3rd Qu.: 116.66 3rd Qu.: 470.0
## Max. :46.90 Max. : 140.37 Max. :2113.0
## LakeSize OverallAvg OverallMeanAmp PC1
## Min. : 4.00 Min. :13.06 Min. :11.28 Min. : 1.849
## 1st Qu.: 8.00 1st Qu.:14.70 1st Qu.:17.63 1st Qu.:14.049
## Median : 14.00 Median :16.35 Median :22.11 Median :23.793
## Mean : 29.85 Mean :16.29 Mean :21.76 Mean :23.664
## 3rd Qu.: 28.50 3rd Qu.:17.74 3rd Qu.:25.95 3rd Qu.:32.784
## Max. :283.00 Max. :19.44 Max. :28.72 Max. :43.745
## PC2
## Min. : 3.383
## 1st Qu.:16.835
## Median :29.007
## Mean :26.287
## 3rd Qu.:38.206
## Max. :42.859
summary(Group2[,-c(1:3)])
## Latitude Longitude Type Elevation
## Min. :20.21 Min. :-115.83 Length:43 Min. :-202.0
## 1st Qu.:29.29 1st Qu.: -91.82 Class :character 1st Qu.: 3.5
## Median :30.23 Median : -80.86 Mode :character Median : 22.0
## Mean :30.32 Mean : -22.88 Mean : 119.3
## 3rd Qu.:32.79 3rd Qu.: 44.28 3rd Qu.: 59.5
## Max. :36.11 Max. : 116.06 Max. :1801.0
## LakeSize OverallAvg OverallMeanAmp PC1
## Min. : 4.0 Min. :18.95 Min. : 6.987 Min. :46.92
## 1st Qu.: 7.0 1st Qu.:20.97 1st Qu.:14.831 1st Qu.:56.64
## Median : 14.0 Median :21.95 Median :18.532 Median :64.98
## Mean : 25.3 Mean :22.09 Mean :17.420 Mean :66.50
## 3rd Qu.: 23.0 3rd Qu.:23.13 3rd Qu.:19.532 3rd Qu.:73.85
## Max. :186.0 Max. :25.44 Max. :24.330 Max. :92.46
## PC2
## Min. :-0.8748
## 1st Qu.:22.1256
## Median :28.8977
## Mean :26.5181
## 3rd Qu.:31.7008
## Max. :38.3728
summary(Group3[,-c(1:3)])
## Latitude Longitude Type Elevation
## Min. :-18.130 Min. :-106.110 Length:79 Min. :-404.0
## 1st Qu.: -2.275 1st Qu.: -63.100 Class :character 1st Qu.: 25.0
## Median : 1.720 Median : -51.500 Mode :character Median : 89.0
## Mean : 4.273 Mean : -4.154 Mean : 233.6
## 3rd Qu.: 11.945 3rd Qu.: 36.080 3rd Qu.: 383.5
## Max. : 31.520 Max. : 137.920 Max. :1056.0
## LakeSize OverallAvg OverallMeanAmp PC1
## Min. : 3.00 Min. :25.98 Min. : 0.4907 Min. : 99.38
## 1st Qu.: 7.00 1st Qu.:27.83 1st Qu.: 1.6521 1st Qu.:117.44
## Median : 16.00 Median :28.62 Median : 3.2274 Median :124.88
## Mean : 33.75 Mean :28.96 Mean : 3.9325 Mean :124.82
## 3rd Qu.: 31.00 3rd Qu.:30.20 3rd Qu.: 5.2279 3rd Qu.:133.75
## Max. :281.00 Max. :31.76 Max. :11.8852 Max. :144.09
## PC2
## Min. :-23.8997
## 1st Qu.: -6.6334
## Median : -0.3507
## Mean : -0.5058
## 3rd Qu.: 6.8411
## Max. : 23.1318
summary(Group4[,-c(1:3)])
## Latitude Longitude Type Elevation
## Min. :28.55 Min. :-176.0100 Length:121 Min. : 0
## 1st Qu.:51.46 1st Qu.:-123.9800 Class :character 1st Qu.: 46
## Median :62.35 Median : 31.8600 Mode :character Median : 259
## Mean :56.93 Mean : -0.9774 Mean :1211
## 3rd Qu.:67.69 3rd Qu.: 97.2700 3rd Qu.:1127
## Max. :81.80 Max. : 174.4400 Max. :5182
## LakeSize OverallAvg OverallMeanAmp PC1
## Min. : 4.0 Min. :0.04835 Min. : 0.8017 Min. :-73.94
## 1st Qu.: 13.0 1st Qu.:2.44587 1st Qu.:12.2417 1st Qu.:-62.04
## Median : 22.0 Median :3.28293 Median :14.6697 Median :-56.74
## Mean : 133.8 Mean :3.27909 Mean :14.4160 Mean :-57.09
## 3rd Qu.: 43.0 3rd Qu.:4.12047 3rd Qu.:16.6814 3rd Qu.:-52.09
## Max. :4094.0 Max. :6.09557 Max. :21.1424 Max. :-41.32
## PC2
## Min. :-55.871
## 1st Qu.:-27.654
## Median :-20.535
## Mean :-21.576
## 3rd Qu.:-13.910
## Max. : -1.181
summary(Group5[,-c(1:3)])
## Latitude Longitude Type Elevation
## Min. :30.90 Min. :-142.77 Length:244 Min. : 5
## 1st Qu.:51.81 1st Qu.:-105.59 Class :character 1st Qu.: 151
## Median :54.34 Median : -93.51 Mode :character Median : 266
## Mean :53.86 Mean : -33.89 Mean : 490
## 3rd Qu.:56.52 3rd Qu.: 41.32 3rd Qu.: 528
## Max. :65.02 Max. : 140.53 Max. :4743
## LakeSize OverallAvg OverallMeanAmp PC1
## Min. : 4.00 Min. :4.152 Min. :13.72 Min. :-52.23
## 1st Qu.: 8.00 1st Qu.:5.542 1st Qu.:19.60 1st Qu.:-45.22
## Median : 14.00 Median :5.948 Median :20.81 Median :-42.73
## Mean : 58.69 Mean :6.054 Mean :20.71 Mean :-42.15
## 3rd Qu.: 33.00 3rd Qu.:6.579 3rd Qu.:22.02 3rd Qu.:-39.27
## Max. :2763.00 Max. :8.349 Max. :27.45 Max. :-29.66
## PC2
## Min. :-11.886
## 1st Qu.: 1.943
## Median : 6.979
## Mean : 6.899
## 3rd Qu.: 12.838
## Max. : 26.593
summary(Group6[,-c(1:3)])
## Latitude Longitude Type Elevation
## Min. :-31.370 Min. :-91.20 Length:42 Min. : 1.0
## 1st Qu.:-21.445 1st Qu.: 26.22 Class :character 1st Qu.: 397.5
## Median :-11.045 Median : 31.20 Mode :character Median : 809.0
## Mean :-10.465 Mean : 14.68 Mean : 846.8
## 3rd Qu.: -1.448 3rd Qu.: 36.23 3rd Qu.:1328.5
## Max. : 14.670 Max. :136.33 Max. :2074.0
## LakeSize OverallAvg OverallMeanAmp PC1
## Min. : 4.00 Min. :21.02 Min. : 0.9175 Min. : 77.50
## 1st Qu.: 7.25 1st Qu.:22.84 1st Qu.: 2.7803 1st Qu.: 86.82
## Median : 14.50 Median :24.39 Median : 6.0563 Median : 98.60
## Mean : 142.88 Mean :24.20 Mean : 6.1236 Mean : 96.29
## 3rd Qu.: 71.50 3rd Qu.:25.58 3rd Qu.: 8.0284 3rd Qu.:106.53
## Max. :2313.00 Max. :26.75 Max. :15.0455 Max. :115.66
## PC2
## Min. :-55.011
## 1st Qu.:-31.274
## Median :-25.132
## Mean :-25.300
## 3rd Qu.:-14.010
## Max. : -7.214
summary(Group7[,-c(1:3)])
## Latitude Longitude Type Elevation
## Min. :-54.55 Min. :-73.03 Length:19 Min. : 21.0
## 1st Qu.:-51.56 1st Qu.:-72.48 Class :character 1st Qu.: 174.5
## Median :-48.75 Median :-71.52 Mode :character Median : 264.0
## Mean :-48.38 Mean :-47.03 Mean : 391.3
## 3rd Qu.:-45.44 3rd Qu.:-69.14 3rd Qu.: 519.5
## Max. :-40.92 Max. :170.15 Max. :1155.0
## LakeSize OverallAvg OverallMeanAmp PC1
## Min. : 5.00 Min. : 4.824 Min. : 3.934 Min. :-29.853
## 1st Qu.: 10.50 1st Qu.: 6.627 1st Qu.: 5.881 1st Qu.:-19.783
## Median : 24.00 Median : 7.749 Median : 7.816 Median :-16.977
## Mean : 33.32 Mean : 7.930 Mean : 8.728 Mean :-12.992
## 3rd Qu.: 50.00 3rd Qu.: 8.773 3rd Qu.:11.488 3rd Qu.: -3.498
## Max. :109.00 Max. :11.165 Max. :17.881 Max. : 6.935
## PC2
## Min. :-83.49
## 1st Qu.:-70.90
## Median :-59.39
## Mean :-62.45
## 3rd Qu.:-55.31
## Max. :-48.70
summary(Group8[,-c(1:3)])
## Latitude Longitude Type Elevation
## Min. :-41.14 Min. :-76.15 Length:29 Min. : 8.0
## 1st Qu.:-38.81 1st Qu.:-69.30 Class :character 1st Qu.: 52.0
## Median :-36.04 Median :-62.61 Mode :character Median : 113.0
## Mean :-34.42 Mean :-21.13 Mean : 564.1
## 3rd Qu.:-33.16 3rd Qu.:-53.25 3rd Qu.: 292.0
## Max. :-11.02 Max. :175.90 Max. :3975.0
## LakeSize OverallAvg OverallMeanAmp PC1
## Min. : 7.00 Min. :12.30 Min. : 2.755 Min. :16.49
## 1st Qu.: 8.00 1st Qu.:13.93 1st Qu.: 9.290 1st Qu.:27.45
## Median : 17.00 Median :15.02 Median :12.215 Median :39.43
## Mean : 42.69 Mean :15.70 Mean :11.363 Mean :41.41
## 3rd Qu.: 40.00 3rd Qu.:17.71 3rd Qu.:13.426 3rd Qu.:56.13
## Max. :285.00 Max. :20.01 Max. :16.894 Max. :72.38
## PC2
## Min. :-76.65
## 1st Qu.:-58.18
## Median :-55.54
## Mean :-56.04
## 3rd Qu.:-53.41
## Max. :-38.63
summary(Group9[,-c(1:3)])
## Latitude Longitude Type Elevation
## Min. :37.47 Min. :-121.8900 Length:100 Min. : 1.0
## 1st Qu.:43.98 1st Qu.: -81.3675 Class :character 1st Qu.: 34.0
## Median :46.22 Median : 26.4050 Mode :character Median : 112.5
## Mean :47.21 Mean : -0.8504 Mean : 359.3
## 3rd Qu.:50.02 3rd Qu.: 47.0525 3rd Qu.: 407.0
## Max. :59.44 Max. : 140.8900 Max. :1996.0
## LakeSize OverallAvg OverallMeanAmp PC1
## Min. : 2.0 Min. : 7.623 Min. :13.95 Min. :-32.676
## 1st Qu.: 8.0 1st Qu.: 8.526 1st Qu.:22.36 1st Qu.:-27.459
## Median : 16.0 Median : 9.554 Median :23.77 Median :-18.843
## Mean : 101.3 Mean : 9.824 Mean :23.64 Mean :-18.285
## 3rd Qu.: 35.0 3rd Qu.:10.902 3rd Qu.:25.44 3rd Qu.:-10.377
## Max. :2710.0 Max. :12.775 Max. :29.81 Max. : 2.023
## PC2
## Min. :-2.739
## 1st Qu.:17.485
## Median :22.422
## Mean :22.553
## 3rd Qu.:29.537
## Max. :45.189
The group summaries show several features that may help to discriminate between groups.
Looking at the Elevation summaries, every group apart from group 6 has a median (a robust estimate) that differs greatly from its mean (a non-robust estimate). This suggests that the Elevation distributions of these groups are skewed, or possibly multimodal.
There appear to be large differences in OverallAvg between groups, as indicated by both the medians and the means - this may be a variable that offers good discrimination between groups.
The sizes of the groups vary greatly (min = 19, max = 244).
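The group sizes and the gap between mean and median can also be read off directly rather than from the printed summaries. The sketch below uses base R only. The `toy` data frame and its values are invented for illustration, and `Data` would replace it in practice.

```r
# Toy stand-in for `Data`; all values are invented for illustration only.
toy <- data.frame(
  Group     = c(1, 1, 1, 2, 2, 2, 2),
  Elevation = c(10, 80, 2000, 5, 20, 60, 1700)
)

# Number of lakes in each group.
group_sizes <- table(toy$Group)

# Mean and median Elevation within each group.
elev_mean   <- tapply(toy$Elevation, toy$Group, mean)
elev_median <- tapply(toy$Elevation, toy$Group, median)

# A mean far above the median points to right skew or a few very high lakes.
gap <- elev_mean - elev_median

group_sizes
round(cbind(mean = elev_mean, median = elev_median, gap = gap), 1)
```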
We now produce a pairs plot of the variables, treating the groups as factors, to look for graphical distinctions between groups. Element [1,1] defines the colour coding of the groups. The lake size in this instance is the log of the original lake size (the original observations for lake size were strongly right-skewed).
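library(ggplot2)
library(GGally)  # provides ggpairs()
# Treat the group label as a factor and work with the log of lake size.
Data$Group <- factor(Data$Group)
Data$LakeSize <- log(Data$LakeSize)
# Drop the ID, name and Type columns; colour every panel by group.
ggpairs(Data[, -c(1, 2, 6)], aes(colour = Group),
        upper = list(continuous = "points", combo = "box_no_facet", discrete = "facetbar", na = "na"),
        lower = list(continuous = "points", combo = "box_no_facet", discrete = "facetbar", na = "na"))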
Several points can be made from the pairs plot above.
As can be seen in the first column of boxplots of the continuous variables against group, several variables appear to have distributions that differ substantially between groups: Latitude, OverallAvg, OverallMeanAmp, PC1, PC2 and, to a lesser extent, Longitude. The unequal sizes of the groups are worth bearing in mind when making such comparisons.
Several scatterplots offer a fair amount of discrimination between groups in two dimensions, for example Latitude against Longitude, Latitude against OverallAvg, and PC2 against PC1.
There is a strong positive linear relationship between OverallAvg and PC1, suggesting that the first principal component largely represents this variable.
The sinusoidal pattern between Latitude and OverallAvg is induced by the way temperature varies with latitude across the globe.
The two-dimensional scatterplot of the first two principal components appears to offer the greatest separation between groups among all pairs of variables and should be examined further. The larger points represent the centroids of the groups.
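# Mean PC1 and PC2 score within each group (one row per group).
class.means <- t(sapply(by(Data[, 11:12], Data$Group, colMeans), c))
# Append the group number; as.data.frame() names this unnamed column V3.
class.means1 <- cbind(class.means, c(1:9))
class.means1 <- as.data.frame(class.means1)
# Lakes as small points, group centroids as large semi-transparent points.
ggplot(Data, aes(x = PC1, y = PC2)) + geom_point(aes(colour = factor(Group))) +
  geom_point(data = class.means1, aes(colour = factor(V3)), size = 7, alpha = 1/2) +
  labs(title = "PC1 VS. PC2 (The Larger Icons are Group Means)")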
The first two principal components are of some use when it comes to clearly separating groups. Groups 1, 7 & 8 are reasonably well clustered and separated. However, group 5 slightly overlaps with groups 4 & 9 at the perimeter of the cluster and has a very elliptical shape. A similar issue appears to be present with groups 2, 3 & 6. Furthermore, the cluster structure of group 6 is interesting. On the face of things, group 6 appears to have an outlying subcluster of 6 observations, which appears to be potentially more akin to group 8. Would a 3rd principal component be of any use, giving a 3-dimensional space?
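Two of the observations above can be checked numerically: how close the group centroids sit in the PC1-PC2 plane, and how strongly OverallAvg is related to PC1. The sketch below is one way of doing so with base R, using `aggregate()` as an alternative way of computing centroids. The `toy` data frame and its values are invented for illustration, and `Data` would replace it in practice.

```r
# Toy stand-in for `Data`; all values are invented for illustration only.
toy <- data.frame(
  Group      = factor(c(1, 1, 2, 2, 3, 3)),
  OverallAvg = c(16, 17, 22, 23, 28, 29),
  PC1        = c(20, 27, 60, 70, 118, 130),
  PC2        = c(25, 30, 24, 29, -3, 2)
)

# Group centroids in the PC1-PC2 plane, with the group label kept as a column.
centroids <- aggregate(cbind(PC1, PC2) ~ Group, data = toy, FUN = mean)

# Pairwise Euclidean distances between centroids;
# small values flag groups that sit close together.
centroid_dist <- dist(centroids[, c("PC1", "PC2")])

# Pearson correlation between mean temperature and the first PC score.
r <- cor(toy$OverallAvg, toy$PC1)

centroids
centroid_dist
r
```

Centroid distances ignore the spread and shape of each cluster, so they are only a rough guide to overlap for elongated groups such as group 5.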
Examining the kernel densities of Elevation for each group offers some interesting insight into how discrimination between groups may occur.
ggplot()+ geom_density(aes(x=Group1$Elevation),colour="red")+
geom_density(aes(x=Group2$Elevation),colour="blue")+
geom_density(aes(x=Group3$Elevation),colour="black")+
geom_density(aes(x=Group4$Elevation), colour="green")+
geom_density(aes(x=Group5$Elevation),colour="orange")+
geom_density(aes(x=Group6$Elevation),colour="yellow")+
geom_density(aes(x=Group7$Elevation),colour="pink")+
geom_density(aes(x=Group8$Elevation), colour="purple") +
geom_density(aes(x=Group9$Elevation), colour="brown")+
labs(x="Elevation", y="Density", title="Densities of Elevation for each Group")
Strikingly, the Elevation density for group 2 is strongly multimodal, with sharp peaks of density across a moderate range of elevation values. It can be seen in the maps below that bodies of water in group 2 occur in very dense clusters in America, China and the Middle East. This may explain the sharp peaks in density.
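The nine separate `geom_density()` layers can also be written as a single layer by mapping colour to the group label, which additionally produces a legend. The sketch below shows the idea on three groups, reusing the report's colours for groups 1 to 3. The `toy` data frame and its values are invented for illustration, and `Data` (with all nine colours listed) would replace it in practice.

```r
library(ggplot2)

# Toy stand-in for `Data`; all values are invented for illustration only.
toy <- data.frame(
  Group     = factor(rep(1:3, each = 4)),
  Elevation = c(5, 40, 300, 900, 2, 20, 25, 1500, 30, 90, 400, 1000)
)

# Colour per group, following the colours used in the report for groups 1 to 3.
group_colours <- c("1" = "red", "2" = "blue", "3" = "black")

# One density curve per group from a single layer.
p <- ggplot(toy, aes(x = Elevation, colour = Group)) +
  geom_density() +
  scale_colour_manual(values = group_colours) +
  labs(x = "Elevation", y = "Density",
       title = "Densities of Elevation for each Group")
p
```

Swapping `Elevation` for `OverallAvg` or `OverallMeanAmp` in `aes()` gives the corresponding plots for the other variables.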
Similarly, the densities of OverallAvg for each group offer an interesting insight into how discrimination between groups may occur.
ggplot()+ geom_density(aes(x=Group1$OverallAvg),colour="red")+
geom_density(aes(x=Group2$OverallAvg),colour="blue")+
geom_density(aes(x=Group3$OverallAvg),colour="black")+
geom_density(aes(x=Group4$OverallAvg), colour="green")+
geom_density(aes(x=Group5$OverallAvg),colour="orange")+
geom_density(aes(x=Group6$OverallAvg),colour="yellow")+
geom_density(aes(x=Group7$OverallAvg),colour="pink")+
geom_density(aes(x=Group8$OverallAvg), colour="purple") +
geom_density(aes(x=Group9$OverallAvg), colour="brown")+
labs(x="Overall Average", y="Density", title="Densities of The Mean Temperature of the Lake (Celsius) for each Group")
Taking into account the relatively small sizes of some groups, the densities appear to be unimodal and offer a great deal of separation between groups. For example, the OverallAvg values of group 4 appear to be clearly different from those of group 3. A Mann-Whitney test could be used to check such differences nonparametrically.
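As a sketch of the Mann-Whitney comparison mentioned above, base R provides `wilcox.test()`, which performs the Wilcoxon rank-sum (Mann-Whitney) test when given two samples. The `toy` data frame and its values are invented for illustration. In practice the two vectors would be the `OverallAvg` columns of the groups being compared.

```r
# Toy stand-in for two groups of `Data`; all values are invented for illustration only.
toy <- data.frame(
  Group      = c(rep(3, 6), rep(4, 6)),
  OverallAvg = c(26.1, 27.9, 28.4, 29.0, 30.3, 31.5,
                 0.4, 2.1, 3.0, 3.6, 4.2, 6.0)
)

g3 <- toy$OverallAvg[toy$Group == 3]
g4 <- toy$OverallAvg[toy$Group == 4]

# Two-sided Mann-Whitney (Wilcoxon rank-sum) test for a shift in location.
mw <- wilcox.test(g3, g4, alternative = "two.sided")
mw$p.value
```

The test compares the two distributions through their ranks rather than comparing means directly. With nine groups there are many possible pairs, so a multiplicity adjustment would be sensible, for instance through `pairwise.wilcox.test()` and its `p.adjust.method` argument.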
From previous results and knowledge, it would appear that the overall temperature amplitude may be of some importance in differentiating between groups.
ggplot()+ geom_density(aes(x=Group1$OverallMeanAmp),colour="red")+
geom_density(aes(x=Group2$OverallMeanAmp),colour="blue")+
geom_density(aes(x=Group3$OverallMeanAmp),colour="black")+
geom_density(aes(x=Group4$OverallMeanAmp), colour="green")+
geom_density(aes(x=Group5$OverallMeanAmp),colour="orange")+
geom_density(aes(x=Group6$OverallMeanAmp),colour="yellow")+
geom_density(aes(x=Group7$OverallMeanAmp),colour="pink")+
geom_density(aes(x=Group8$OverallMeanAmp), colour="purple")+
geom_density(aes(x=Group9$OverallMeanAmp), colour="brown")+
labs(x="Overall Mean Amplitude", y="Density", title="Densities of The Overall Amplitude Temperature of the Lake (Celsius) for each Group")
On the whole, OverallMeanAmp only appears to offer real discrimination for groups at the extreme ends of the spectrum - most of the groups span large portions of each other's range.
However, more interestingly, the density shape for most groups appears to be a minor modification of its OverallAvg counterpart. For example, note group 4 in green, group 5 in orange and group 7 in pink.
As the data we are analysing are geostatistical, it is sensible to examine them spatially using various mapping techniques. This will allow us to gain a deeper understanding of the problem at hand.
library(leaflet)  # provides leaflet(), addTiles(), addCircles() and the %>% pipe
leaflet() %>%
addTiles() %>%
addCircles(data=Group1, lat=~Latitude, lng= ~Longitude, radius=10000, color="red") %>%
addCircles(data=Group2, lat=~Latitude, lng= ~Longitude, radius=10000, color="blue") %>%
addCircles(data=Group3, lat=~Latitude, lng= ~Longitude, radius=10000, color="black") %>%
addCircles(data=Group4, lat=~Latitude, lng= ~Longitude, radius=10000, color="green") %>%
addCircles(data=Group5, lat=~Latitude, lng= ~Longitude, radius=10000, color="orange") %>%
addCircles(data=Group6, lat=~Latitude, lng= ~Longitude, radius=10000, color="yellow") %>%
addCircles(data=Group7, lat=~Latitude, lng= ~Longitude, radius=10000, color="pink") %>%
addCircles(data=Group8, lat=~Latitude, lng= ~Longitude, radius=10000, color="purple") %>%
addCircles(data=Group9, lat=~Latitude, lng= ~Longitude, radius=10000, color="brown")
The bodies of water appear to be clustered spatially, which is to be expected. In particular, the groups appear to be arranged in horizontal bands by latitude. This may encode information about the temperature of the lakes.
Spatial analysis techniques such as kriging may be of some use here, both in predicting the group of unlabelled bodies of water and in producing uncertainty maps.
Interestingly, China seems to be the only distinct land mass that does not follow the horizontal layering of groups - it has lakes from groups 4 and 5 in the south west, where the latitude banding would not lead us to expect them. What is going on here?
leaflet()%>% addProviderTiles("Esri.WorldImagery")%>%
setView(lng=88.7879, lat=30.1534, zoom=4)%>%
addCircles(data=Group4, lat=~Latitude, lng= ~Longitude, radius=20000, color="green")%>%
addCircles(data=Group5, lat=~Latitude, lng= ~Longitude, radius=20000, color="orange")%>%
addMarkers(lng=88.7879, lat=30.1534, popup="Tibet, China")
This part of south west China is actually Tibet, which has mountainous terrain - akin to that of Canada and the Scandinavian countries. This may suggest that temperature, elevation or a mixture of the two plays a significant role in deciding which label a body of water eventually gets.
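One way to probe the role of elevation is to compare the elevations of group 4 and 5 lakes inside and outside this region. The sketch below does this with base R. The bounding box is a rough, hypothetical outline of the plateau chosen for illustration, and the `toy` data frame and its values are invented. `Data` would replace `toy` in practice.

```r
# Toy stand-in for `Data`; all values are invented for illustration only.
toy <- data.frame(
  Group     = c(4, 4, 5, 4, 5, 5),
  Latitude  = c(30.7, 31.8, 29.0, 65.2, 54.1, 52.3),
  Longitude = c(90.6, 88.9, 90.8, -110.3, -98.4, 40.2),
  Elevation = c(4700, 4500, 4400, 210, 260, 150)
)

# Hypothetical bounding box around the Tibetan Plateau (an assumption, not from the report).
in_box <- with(toy, Latitude >= 27 & Latitude <= 37 &
                    Longitude >= 78 & Longitude <= 100)
toy$Region <- ifelse(in_box, "Plateau box", "Elsewhere")

# Restrict to the two groups of interest and compare median elevation by region.
cold <- subset(toy, Group %in% c(4, 5))
elev_by_region <- tapply(cold$Elevation, cold$Region, median)
elev_by_region
```

If the lakes inside the box turn out to be far higher than group 4 and 5 lakes elsewhere, that would be consistent with altitude standing in for latitude in this region.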
Interestingly, along the south eastern region of Africa, group 6 is clustered vertically, running north to south. This deviates from the horizontal pattern seen elsewhere. Why might this be the case?
leaflet()%>% setView(lng=32.16, lat=-6.07, zoom=4) %>%addTiles()%>%
addMarkers(lng=33.32, lat=-1.3, popup="Lake : Victoria. Size : 2313 1km pixels. Largest in the region.")%>%
addMarkers(lng=29.46, lat=-6.07, popup="Lake : Tanganyika. Size : 1148 1km pixels. 2nd largest in the region. ")%>%
addMarkers(lng=34.59, lat=-11.96, popup="Lake : Niassa. Size : 1048 1km pixels. 3rd largest in the region.")%>%
addMarkers(lng=32.16, lat=-7.86, popup="Lake : Rukwa. Size : 219 1km pixels. 4th largest in the region.")%>%
addCircles(data=Group6, lat=~Latitude, lng= ~Longitude, radius=10000, color="yellow")
The bodies of water in this part of Africa appear to be part of a highly connected network. The vertical spread of group 6 bodies of water across the east and south east of Africa could possibly be due to the influence that Lake Victoria, Lake Tanganyika & Lago Niassa have across the region, as these are extremely large in comparison to other bodies of water in the region.
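The dominance of a few large lakes within a group can be checked by ranking the lakes by size. The sketch below does this with base R. The `toy` data frame, its lake names and its values are invented for illustration. In practice `Data` would replace it, using lake sizes on the original scale rather than the log scale.

```r
# Toy stand-in for `Data`; names and values are invented for illustration only.
toy <- data.frame(
  LakeName = c("A", "B", "C", "D", "E"),
  Group    = c(6, 6, 3, 6, 6),
  LakeSize = c(12, 2300, 500, 40, 1100)
)

# Keep group 6 and sort from largest to smallest lake.
g6 <- toy[toy$Group == 6, ]
g6 <- g6[order(-g6$LakeSize), ]

# Share of the group's total pixel count contributed by each lake.
g6$ShareOfGroup <- g6$LakeSize / sum(g6$LakeSize)

head(g6, 3)
```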
The data set consists of three types of water body: lagoon, lake and reservoir. Plotting the type counts for each group may give us an insight into the relationship between water body type and group.
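library(plotly)  # provides ggplotly()
Data$Group <- factor(Data$Group)
# Side-by-side bars: count of each water body type within each group.
b <- ggplot(Data, aes(x = Group, fill = Type)) + geom_bar(position = "dodge")
ggplotly(b)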
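The counts behind the bar chart can also be tabulated, which makes it easier to compare the type mix across groups of very different sizes. The sketch below uses base R. The `toy` data frame and its values are invented for illustration, and `Data` would replace it in practice.

```r
# Toy stand-in for `Data`; all values are invented for illustration only.
toy <- data.frame(
  Group = factor(c(1, 1, 1, 2, 2, 3, 3, 3)),
  Type  = c("Lake", "Lake", "Reservoir", "Lagoon", "Lake",
            "Lake", "Reservoir", "Reservoir")
)

# Cross-tabulation of group against water body type.
counts <- table(toy$Group, toy$Type)

# Row proportions: the share of each type within each group.
props <- prop.table(counts, margin = 1)

counts
round(props, 2)
```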