GESC 258- Geographical Research Methods
Hypothesis testing for means
We are going to explore one of my favorite geospatial technologies for our foray into hypothesis testing, Global Positioning Systems (GPS), which is a specific form of the more general Global Navigation Satellite Systems (GNSS). With the advent of GPS chips in mobile phones, location-aware applications and services are now extremely widely used. The quality of GPS sensors varies considerably and an affect how we use the positioning information we get from them. One particular issue with GPS is that it does not work in doors. If you have even a rough understanding of how GPS works, you will know you have to connect to satellites that receive messages to your receiver, and connecting to multiple satellites allows your exact position on the surface of the earth to be located. If we are inside, we cannot receive signals broadcast from space.
Now this issue can get a little more complex when we are using GPS outdoors in areas where we do not have a clear view of the sky. In particular, GPS signals can reflect off of objects in the path between our receiver and the satellite - a phenoemna known as GPS multipath error as the error is due to the multiple paths the signal takes. More advanced GPS receivers use different techniques to minimize this issue. However, multipath error can still be an issue in urban environments with tall buildings nearby or in forested areas with canopy cover.
In this lab what we want to explore is the difference in reported positioning for two receivers. To do a proper error analysis of these units, we would really need to compare them to a third independent benchmark, such as a survey marker or a GPS position obtained from a highly accurate positioning system using differential corrections. Lacking that - we will simply compare the positions to each other, to ask; are the positions statistically different. Also we will first try the whole process using fake GPS data.
Examining with Fake Data
We will first create some fake geographic data to examine. One of the
great things about working with R
is that you can usually
find a package to do just about anything you want to do. In this case we
want an R
function to generate random geographic
coordinates. We will use runif
which generates random
uniform numbers and just set the min and max to -180 and +180 for
longitude and -90 and +90 for latitude. Here is an example of how we
select a random value between -180 and +180 for longitude and -90 and
+90 for latitude:
runif(1, min = -180, max = 180)
## [1] 44.59965
runif(1, min = -90, max = 90)
## [1] 17.86825
Now we will need a bounding box within which points should be generated. Luckily there is a website (http://bboxfinder.com/) that makes this process very easy from which you can copy the bounding box coordinates.
Given that we’re probably all desparate to travel at this point, I am going to go to one of my favorite places, Arugum Bay in Sri Lanka(https://www.google.com/search?q=arugam+bay&sxsrf=ALeKk00ld2CQx0-28Kx7Vl-UBgkd7CW3gA:1614621081426&source=lnms&tbm=isch&sa=X&ved=2ahUKEwiC54Sg1I_vAhWLt54KHdJFArEQ_AUoAnoECBEQBA&biw=1273&bih=757).
Arugam Bay, Sri Lanka.
Feel free to change this somewhere you want to go! For my study area,
we can see the bounding box is
81.808421,6.825655,81.851207,6.852244
and this is reported
in Longitude, Latitude pairs for the box, and when we look at the
rg_position
help it tells it that it requires this
information as numeric vector of the form west (long), south (lat), east
(long), north (lat). Lets try it (pay attention to the order of the
numbers in the following function)
<- runif(1, min = 81.808421, max = 81.851207)
lon <- runif(1, min = 6.825655, max = 6.852244)
lat c(lon,lat)
## [1] 81.811055 6.842744
Try plugging the result into google maps to see if it lands in the correct location (Google Maps takes latitude/longitude in the search box so you’ll have to switch the order).
Bounding box coordinates plotted on Google Maps
We can get more points by increasing the count argument as follows:
<- runif(5, min = 81.808421, max = 81.851207)
lon <- runif(5, min = 6.825655, max = 6.852244)
lat cbind(lon,lat)
## lon lat
## [1,] 81.83756 6.830512
## [2,] 81.81728 6.834733
## [3,] 81.84812 6.841627
## [4,] 81.81799 6.843277
## [5,] 81.82109 6.840435
now the runif
function returns a list of pairs of
longitude/latitude pairs. We can convert them to a data frame with a
little r
magic -
set.seed(123)
<- runif(29, min = 81.808421, max = 81.851207)
X1 <- runif(29, min = 6.825655, max = 6.852244)
Y1 <- data.frame(X1, Y1)
df1
<- runif(29, min = 81.808421, max = 81.851207)
X2 <- runif(29, min = 6.825655, max = 6.852244)
Y2 <- data.frame(X2, Y2) df2
so now we have two data frames each with 29 geographic cooridnates in our area of interest.
head(df1)
## X1 Y1
## 1 81.82073 6.829567
## 2 81.84215 6.851261
## 3 81.82592 6.849646
## 4 81.84620 6.844020
## 5 81.84866 6.846806
## 6 81.81037 6.826309
head(df1)
## X1 Y1
## 1 81.82073 6.829567
## 2 81.84215 6.851261
## 3 81.82592 6.849646
## 4 81.84620 6.844020
## 5 81.84866 6.846806
## 6 81.81037 6.826309
Now, back to GPS and hypothesis testing.
We will pretend df1
are coordinates from one GPS and
df2
are coordinates from another GPS. Since these were just
randomly created we won’t expect the coordinates to be close - but the
idea of comparing coordinates will be the same.
t-test
To compare the mean of 1 group to a specific value, or to compare the
means of 2 groups, you do a t-test.
The t-test
function in R is t.test()
. The t.test()
function can take several arguments, here I’ll emphasize a few of them.
To see them all, check the help menu for t.test
(?t.test
).
Argument | Description |
---|---|
x |
A vector of data whose mean you want to compare to the
null hypothesis mu |
mu |
The population mean under the null hypothesis. For
example, mu = 0 will test the null hypothesis that the true
population mean is 0 . |
alternative |
A string specifying the alternative hypothesis. Can be
"two.sided" indicating a two-tailed test, or
"greater" or "less" for a one-tailed
test. |
In a one-sample t-test, you compare the data from one group of data
to some hypothesized mean. For example, if someone said that average
error of GPS data is 5 meters, we could conduct a one-sample test
comparing the data from a sample to a hypothesized mean of 5. To conduct
a one-sample t-test in R using t.test(), enter a vector as the main
argument x
, and the null hypothesis as the argument
mu
.
One-Sample Hypothesis Testing
First we will test the hypothesis of whether the difference between two fake gps data is equal to zero.
\[H_0: \mu = 0\] \[H_a: \mu \ne 0\]
where \(\mu\) is the population
mean difference. To start we have to compute our variable analysis.
Given that we have a two-dimensional measure of location that we want to
summarize with a single vector, we will convert the geographic
coordinates to distance using pythagorous theorum (https://en.wikipedia.org/wiki/Pythagorean_theorem) which
in r
is just implementing
\[d = \sqrt{(x_1-x_2)^2 + (y_1-y_2)^2}\]
<- sqrt((df1$X1 - df2$X2)^2 + (df1$Y1 - df2$Y2)^2) d
so now we are testing \(H_0: \mu =
0\) on the sample mean difference in d
. The function
for doing a t-test in r
is t.test
.
t.test(d, mu=0)
##
## One Sample t-test
##
## data: d
## t = 11.116, df = 28, p-value = 8.866e-12
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 0.01574146 0.02285386
## sample estimates:
## mean of x
## 0.01929766
Use
?t.test
in the R console to learn more about this function. Pay attention toalternative
argument in the documentation of this function. This parameter helps you to perform a correct t.test
As you can see, the function printed lots of information: the sample
mean was 0.019
, the test statistic (11.116
),
and the p-value was 8.866e-12
(which is virtually 0).
Because 8.866e-12
is less than 0.05
, we would
reject the null hypothesis that the true mean is equal to 0.
Interpretting p-value: look at the following image. If our level of confidence is 95% so our significance value \(\alpha=0.05\). Now we look at the type of out t-test. It is a two-tail test (why?). So we use the following charts to make a conclusion.Since p-value is far less than 0.05 \(`8.866e-12`< 0.05\). As a result the left side image is valid and we reject the null hypothesis that the true mean is equal to 0.
Bounding box coordinates plotted on Google Maps
Now, what happens if I change the null hypothesis to a mean of 0.02
(a close value of my sample mean)? Because the sample mean was 0.019,
quite close to 0.02, the test statistic should decrease
,
and the p-value
should increase:
\[H_0: \mu = 0.02\]
\[H_A: \mu \ne 0.02\]
t.test(d, mu=0.02)
##
## One Sample t-test
##
## data: d
## t = -0.40456, df = 28, p-value = 0.6889
## alternative hypothesis: true mean is not equal to 0.02
## 95 percent confidence interval:
## 0.01574146 0.02285386
## sample estimates:
## mean of x
## 0.01929766
Just as we predicted! The test statistic decreased to just -0.4, and the p-value increased to 0.68. In other words, our sample mean of 0.019 is reasonably consistent with the hypothesis that the true population mean is 0.02.
Interpretting p-value: look at the above histograms again. If our level of confidence is 95% so our significance value \(\alpha=0.05\). Now we look at the type of out t-test. It is a two-tail test. Since p-value is far more than 0.05 \(0.6889 > 0.05\). As a result the right side image is valid and we fail to reject the null hypothesis that the true mean is equal to 0.02.
Hand Calculation method: As per lecture slides the way we calculate a t-statistic for our vector d is:
\[t = \frac{\bar{x} -
\mu}{\sigma_{\bar{x}}}\] where \(\sigma_{\bar{x}}\) is \(\frac{s}{\sqrt{n}}\), where \(\mu=0\), which we can hand-calculcate
inr
as:
<- (mean(d) - 0)/ (sd(d)/sqrt(29))
t t
## [1] 11.11565
which can see matches the output of the t.test
function
for the value of the t. We can evaluate probability as well:
<-(mean(d) - 0.02) / (sd(d)/sqrt(29))
t t
## [1] -0.404556
2*pt(q = t, df = 28) #since this is two-tail test we multiply it into 2
## [1] 0.6888778
type
?pt?
in the R console to see documentation about this function. Pay attention tolower.tail
parameter in this function. When lower.tail=FALSE you get the probability to the right of X
Can you see how our conclusion would change and what this would mean?
How would result from pt
change?
Two-Sample Hypothesis Testing
Lets suppose that the first 14 of our data points were collected on a day with overcast weather, and the remaining 14 points were collected on a clear day. We might want to examine whether there was a statistical difference between the two days. This would be comparing two groups; and we would need a two-sample hypothesis test to evaluate: \[H_0: \mu_1 = \mu_2\] \[H_A: \mu_1 \ne \mu_2\]
which is to say, there is not a difference (ie the two means are equal) or there is (ie the two means are not equal). First we need to separate out the two groups. We can do this by subsetting into two new vectors as follows:
<- d[1:14]
day1 <- d[15:29] day2
then we test using t.test
again:
t.test(day1, day2)
##
## Welch Two Sample t-test
##
## data: day1 and day2
## t = -1.2463, df = 25.906, p-value = 0.2238
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.011412695 0.002798026
## sample estimates:
## mean of x mean of y
## 0.01706973 0.02137706
It looks like our test-statistic is -1.24. If there was really no
difference between day1 and day2 data, we would expect a test statistic
close to 0. Because test-statistic is -1.24, this makes us think that
there really is a difference. However, in order to make our decision, we
need to get the p-value
from the test. This result shows
there is no statistical difference between the values in d
between day 1 and day 2 because thep-value
associated with
the null hypothesis is p=0.2238
which is far greater than
any of the nominal significance levels (0.01 , 0.05 and etc.) we use in
hypothesis testing. Note that the degree of freedom df
is
different here and no longern-1
as in the case with single
sample hypothesis testing.
GPS DATA
To Answer the today’s lab questions (question 2, 3 and 4) we will use
a real world GPS data which is collected earlier. As we explained
earlier, we will compare the GPS signals we received from two receivers.
We have already collected a set of gps points using a Garmin GPSmap 62s
and an Iphone App on an IPhone 8 Plus phone - the EpiCollect 5. Garmin
is considered recreational grade while the IPhone app is not a true GNSS
receiver. What we want to explore is the difference in reported
positioning for these two receivers. These data are already preprocessed
and the difference between two recived signals is already calculated (as
we discussed it in section One-Sample Hypothesis Testing
).
Lets have a look at them
<- read.csv("https://www.dropbox.com/s/f7xcibye2epgqgy/dfprocessed.csv?dl=1")
df <- df$dis[which(df$loctype=="Open")]
open_dis <- df$dis[which(df$loctype=="Forest")]
forest_dis mean(open_dis)
## [1] 24.51924
mean(forest_dis)
## [1] 30.33474
::kable(head(df), caption = 'GPS Data') knitr
X | PID | ele | acc | loctype | dis |
---|---|---|---|---|---|
1 | 1 | -4.412673 | 8 | Open | 29.40511 |
2 | 2 | 38.054688 | 8 | Open | 10.23022 |
3 | 3 | 36.078232 | 8 | Forest | 15.83321 |
4 | 4 | 70.174721 | 8 | Forest | 17.74558 |
5 | 5 | 80.339729 | 8 | Open | 16.70315 |
6 | 6 | 78.286949 | 8 | Forest | 32.75483 |
The above data set has the following columns: PID:
a
unique Id column, ele:
elevation of the point,
acc:
accuracy of the gps, loctype:
it can be
either Open
or Forest
, dis:
calculated distance between gps signal from Garmin GPSmap and Iphone
app.
Lab Assignment
Generate random points for a study area of interest to you (as in the example above). Conduct a hypotheses test that the mean difference is a) equal to zero then b) greater than
0.02
. For each use thet.test
function. In your answer include your hypothesis statements,p-value
, and a sentence interepting your results from each hypothesis test. Include a screenshot showing your bounding box coordinates plotted on Google Maps (out of 5)For the GPS data, is the mean difference significantly different from zero ( \(α=0.05\) )? Use the
t.test
function. In your answer include your hypothesis statements,p-value
, and a sentence interepting your results from each hypothesis test. (out of 5)For the GPS data, was the difference between the two receivers statistically different between locations under forest canopy vs. open sky? Was there more or less difference when under forest canopy? Conduct both one tailed and two tailed hypothesis tests. For this question you can use the
t.test
function. In your answer include your hypothesis statements,p-value
, and a sentence interepting your results from each hypothesis test. (out of 5)For the GPS data, test whether the distance between the two receivers were statistically lower when accuracy was low (<=12, vs high >12). For this question you can just use the
t.test
function. In your answer include your hypothesis statements, p-value, and a sentence interepting your results from each hypothesis test. (out of 5)
# use this codes to filter gps data based on the accuracy
<- df$dis[which(df$acc<=12)]
low_accuracy <- df$dis[which(df$acc>12)] high_accuracy
Hand in
Please submit your answers on MLS under Assignment 4 folder. Your
final report should be in pdf
format. Also Please make sure
to include clean codes and their results. The
formatting of your assignment has 5 marks.
Credits
This lab material is adopted from GESC 258- Labs originally developed by Dr. Colin Robertson.