(Please read carefully)
The exam will be submitted individually by uploading
the solved exam Rmd and html files to the
course moodle.
Please name your files as 52414-HomeExam_ID.Rmd and
52414-HomeExam_ID.html where ID is replaced by
your ID number (do not write your name in the file name
or in the exam itself).
The number of points for each sub-question is indicated next to it.
Once you click on the moodle link for the home exam, the
exam will start and you have three days (72 hours) to complete and
submit it.
The exam will be available from June 26th. The last submission time
is July 1st at 8:59.
You may use all course materials, the web and other written materials and R libraries.
You are NOT allowed to discuss any of the exam questions/materials with other students.
A Hebrew version of the exam is available here.
The purpose of the Hebrew version is to help in explaining the exam
questions. You should use and fill only this Rmd English
version and follow the instructions here. Contact us if you think there
is a conflict between the Hebrew version and the instructions here.
Analysis and Presentation of Results:
Write your answers and explanations in the text of the
Rmd file (not in the code).
The text of your answers should be next to the relevant code, plots
and tables and refer to them, and not at a separate place at the end.
You need to explain every step of your analysis. When in doubt, a more detailed explanation is better than omitting explanations.
Give informative titles, axis names and names for each curve/bar in your graphs.
In some graphs you may need to change the graph limits. If you do so,
please include the outlier points you have removed in a separate table.
Add informative comments explaining your code.
Whenever possible, use objective and
specific terms and quantities learned in class, and
avoid subjective and general
non-quantified statements. For example:
Good: “We see a \(2.5\)-fold increase in the curve from
Jan. 1st to March 1st”.
Bad: “The curve goes up at the beginning”.
Good: “The median is \(4.7\). We detected five outliers with
distance \(>3\) standard deviations
from the median”.
Bad: “The five points on the sides seem far from the
middle”.
Sometimes Tables are the best way to present your
results (e.g. when asked for a list of items). Exclude irrelevant
rows/columns. Display clearly items’ names in your
Tables.
Show numbers in plots/tables using standard digits and not scientific display.
That is: 9000000 and not 9e+06.
Round numbers to at most 3 digits after the dot - that is, 9.456 and not 9.45581451044
Some questions may require data wrangling and manipulation which you need to
decide on. The instructions may not specify precisely the exact plot you should use
(for example: show the distribution of ...). In such
cases, you should decide what and how to show the results.
When analyzing real data, use your best judgment if you encounter missing values, negative values, NaNs, errors in the data etc. (e.g. excluding them, zeroing negative values..) and mention what you have done in your analysis in such cases.
Required libraries are called in the Rmd file. Install
any library missing from your R environment. You are
allowed to add additional libraries if you want.
If you do so, add them at the start of the Rmd file, right below the existing libraries, and explain what libraries you’ve added, and what is each new library used for.
This is an .Rmd file. Copy its contents to create an Rmd file of your own, and edit that file to include your answers to the questions.
Good luck!
#
rm(list=ls()) #rstudio cleaning functions
library(reshape)
library(knitr) # to make sure the knit is happening
library(viridis) #
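# Additional packages used later in this exam. This list is reconstructed from the
# function calls below (an assumption - they may have been loaded in a setup chunk
# of the original Rmd):
library(uniformly)   # runif_on_sphere()
library(ggplot2)     # ggplot(), geom_map(), ...
library(dplyr)       # %>%, filter(), mutate(), ...
library(stringr)     # str_replace(), str_which(), str_trim()
library(maps)        # map.where(), map_data()
library(rvest)       # read_html(), html_table()
library(rworldmap)   # joinCountryData2Map(), mapCountryData()
library(ggpubr)      # ggarrange()
library(GGally)      # ggpairs()
library(lubridate)   # ymd_hms(), hour(), minute(), second()
library(e1071)       # skewness(), kurtosis() (the 'moments' package provides the same functions)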
# Helper functions for converting between coordinates.
# You may use and modify the functions. It is your responsibility that
# the input/output format used in them are correct.
# Convert to Cartesian coordinates: the function assumes that theta is in [0,pi] and phi is in [0,2pi]
spherical_to_cartesian <- function(point,r = 6371)
{
x <- r*cos(point$phi)*sin(point$theta)
y <- r*sin(point$phi)*sin(point$theta)
z <- r*cos(point$theta)
return(data.frame(x=x,y=y,z=z))
}
# Convert to Spherical coordinates: the function assumes that theta is in [0,pi] and phi is in [0,2pi]
cartesian_to_spherical <- function(point)
{
r <- sqrt(point$x**2+point$y**2+point$z**2)
phi <- atan(point$y/point$x) + pi * ( 1 - sign(point$y)*(1+sign(point$x))/2 ) # take care of all sign cases. When x=0 or y=0 exactly, the function may not work correctly, but these are events of measure zero for any continuous distribution over the sphere.
theta <- acos(point$z/r)
return(list(r=r,phi=phi,theta=theta))
}
# Geodesic distance between points. ### HERE I CHANGED IT A BIT
# Input format: p1 and p2 are lists having items: r, phi, theta in spherical coordinates,
# i.e. phi in [0,2pi] and theta in [0,pi]
geodesic_dist <- function(p1, p2, r)
{
return ( r * acos( cos(p1$theta)*cos(p2$theta) + sin(p1$theta)*sin(p2$theta) * cos(p1$phi-p2$phi) ) )
}
cartesian2geographic <- function(point,r=6371) #for apply
{
longitude = ( atan(point$y/point$x) + pi*(1-sign(point$y)*(1+sign(point$x))/2 ) ) * (180/pi) - 180
latitude = (acos(point$z/r))*180/pi - 90
return(list( r=r ,lon = longitude, lat = latitude))
}
geo2xyz <- function(point,r=6371)
{
x <- r*cos((point$long + 180)*pi/180)*sin((point$lat + 90) * pi/180 ) #r*cos(phi)*sin(theta)
y <- r*sin((point$long + 180) * pi/180)*sin((point$lat + 90) * pi/180 ) #r*sin(phi)*sin(theta)
z <- r*cos((point$lat + 90) * pi/180) #r*cos(theta)
return(data.frame(x=x,y=y,z=z))
}
geographic2spheric <- function(point)
{
phi = (point$longitude + 180) * pi/180
theta = (point$latitude + 90) * pi/180
return(list( phi = phi, theta= theta))
}
runif_on_spher2df<-function(iters,r=6371) {
df_rand_xyz_n <- as.data.frame(runif_on_sphere(n=iters, d=3, r=r))
colnames(df_rand_xyz_n) <- c("x", "y", "z")
return( df_rand_xyz_n)
}
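A quick round-trip sanity check of the conversion helpers (a sketch; the specific point below is arbitrary and chosen only for illustration):
p_cart <- data.frame(x = 1000, y = -2000, z = sqrt(6371^2 - 1000^2 - 2000^2)) # an arbitrary point on the sphere of radius 6371
p_sph <- cartesian_to_spherical(p_cart) # phi should fall in [0, 2*pi], theta in [0, pi]
spherical_to_cartesian(p_sph) # should recover (1000, -2000, ...) up to floating-point error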
In this question we simulate random locations on earth. For simplicity, we assume that earth is a ball with a radius of \(r = 6371 km\) (we ignore the fact that earth is not an exact ball), and we
will simulate random points on the surface of the ball, i.e. from a sphere with this radius.
Each location on earth can be represented using three Cartesian coordinates \((x,y,z)\),
where the center of earth is at \((0,0,0)\), and each point on the surface
satisfies \(x^2+y^2+z^2 = r^2\).
Alternatively, we also use spherical coordinates to represent points
on earth: \((r,\theta, \phi)\) where
\(\phi \in [0, 2\pi]\) and \(\theta \in [0, \pi]\). Here \(r\) is always the same, at \(6371\) (points inside
earth will have a lower \(r\) value).
\(\phi\) is the longitude,
with points on the
Prime
meridian (Greenwich line) having \(\phi=0\), and then \(\phi\) is increased as we move east. For
example, if we move east and complete half a circle around the world, we
reach points at the international
date line, and they will have \(\phi=\pi\), and as we approach the
Greenwich line again from west (i.e. when completing a full circle),
\(\phi\) will get closer and closer to
\(2\pi\).
\(\theta\) is the
latitude, with the north (south) pole having \(\theta=\pi\) (\(\theta=0\)), and points on the equator have
\(\theta=\frac{\pi}{2}\).
Note: A common way to represent (spherical) geographical coordinates is to set the longitude \(\phi\) between \([-180,180]\) (instead of \([0,2\pi]\)),
where positive (negative) values correspond to east (west) of the Greenwich line. The latitude \(\theta\) is set between \([-90,90]\) (instead of \([0,\pi]\)), where positive (negative) values correspond to north (south) of the equator. We will refer to this convention as geographical coordinates, and will use it in some parts of the exam, e.g. when reading real geographical data.
For two points on the sphere, we define their Euclidean distance as the standard distance in space, i.e. the length of the straight line connecting the points, that passes through the ball. This will be the length you need to travel if you could dig a straight hole between the two locations. The Euclidean distance has a simple formula in terms of the Cartesian coordinates. For two points \((x_1,y_1,z_1)\) and \((x_2,y_2,z_2)\), their distance is
\[d(p_1,p_2) \equiv \sqrt{(x_1-x_2)^2+(y_1-y_2)^2+(z_1-z_2)^2}\]
We also define the geodesic distance of two points,
as the length of the shortest path connecting them along the sphere.
This is the length you will have to travel if you fly in a ‘straight
line’ along the sphere between the two points, along a great circle, and
for the sphere it is also called great circle distance or
spherical distance.
The geodesic distance has a simple formula in terms of the spherical coordinates. For two points \((r,\theta_1,\phi_1)\) and \((r,\theta_2,\phi_2)\), their distance is
\[g(p_1,p_2) \equiv r \cos^{-1} \Big(\cos (\theta_1) \cos (\theta_2) + \sin (\theta_1) \sin (\theta_2) \cos(\phi_1 - \phi_2)\Big)\]
For example, if \(\phi_1=\phi_2\) then the distance simplifies to \(g(p_1,p_2) = r |\theta_1 - \theta_2|\).
The length of the red line in the figure below is the geodesic distance of \(p\) and \(q\). The length of a straight line through the ball (not drawn) would be the Euclidean distance of \(p\) and \(q\).
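For intuition, the two distances can be compared on a simple example using the helper functions above (a sketch; the two points, lying on the same meridian, are arbitrary):
p1 <- list(r = 6371, phi = 0, theta = pi/2) # a point on the equator
p2 <- list(r = 6371, phi = 0, theta = pi) # a pole on the same meridian
geodesic_dist(p1, p2, r = 6371) # r*|theta1 - theta2| = 6371*pi/2, about 10007 km
sqrt(sum((spherical_to_cartesian(p1) - spherical_to_cartesian(p2))^2)) # Euclidean chord, 6371*sqrt(2), about 9010 km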
What is the average geodesic distance between two
points sampled uniformly on the sphere? Simulate \(1000\) random pairs of points and estimate
it using the pairwise distances computed for the simulated points.
You may use the function runif_on_sphere from the
package uniformly, to simulate points uniformly on the
sphere.
Next, repeat the same simulation but this time estimate the average
Euclidean distance between two random points on the
sphere.
bonus: Derive an exact mathematical formula for the average geodesic distance, and for the average Euclidean distance as functions of \(r\). Compare the formulas you got to the estimated distances from the simulations.
I calculated both the geodesic and the Euclidean distances with for loops, using the helper functions above for converting coordinates.
(In this first part the geodesic distance is computed by writing out the formula explicitly; later in the exam I call the geodesic_dist helper function instead.)
globe.Radius = 6371
r = globe.Radius #shorter for convenience
B = 1000
# calculating geodesic distance
geo_distances = c() # will be filled with results
for (i in 1:B){
p_1 = as.data.frame(runif_on_sphere(n=1, d=3, r=r)) # samples uniformly Cartesian
names(p_1) = c('x', "y", "z")
sample_1 = as.data.frame(cartesian_to_spherical(p_1)) # to spherical
p_2 = as.data.frame(runif_on_sphere(n=1, d=3, r=r))
names(p_2) = c('x', "y", "z")
sample_2 = as.data.frame(cartesian_to_spherical(p_2))
geo_distances[i] = r * acos( cos(sample_1$theta)*cos(sample_2$theta) + sin(sample_1$theta)*sin(sample_2$theta)*cos( sample_1$phi - sample_2$phi) ) # the geodesic distance formula, written with the spherical coordinates
}
avg_geo_dist = mean(geo_distances)
# avg_geo_dist
# calculating euclidean distance
euclidean_distances = c()
for (i in 1:B){
p_1 = as.data.frame(runif_on_sphere(n=1, d=3, r=r))
names(p_1) = c('x', "y", "z")
p_2 = as.data.frame(runif_on_sphere(n=1, d=3, r=r))
names(p_2) = c('x', "y", "z")
# the Euclidean distance only needs the Cartesian coordinates, so no conversion to spherical is required here
euclidean_distances[i] = sqrt((p_1$x-p_2$x)^2 + (p_1$y-p_2$y)^2 + (p_1$z-p_2$z)^2)
}
avg_euco_dist = mean(euclidean_distances)
# avg_euco_dist
The mean geodesic distance is 9966.397 km and the mean Euclidean distance is 8500.828 km. As expected, the Euclidean distance is smaller, since the straight chord through the ball is always shorter than the corresponding path along the surface.
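Bonus: for reference, exact values can be derived. The central angle \(\Theta\) between two independent uniform points on the sphere has density \(\frac{\sin\theta}{2}\) on \([0,\pi]\), which gives
\[\mathbb{E}[g(p_1,p_2)] = r\,\mathbb{E}[\Theta] = \frac{r}{2}\int_0^{\pi}\theta \sin\theta \,d\theta = \frac{\pi r}{2} \approx 10007 \text{ km}, \qquad \mathbb{E}[d(p_1,p_2)] = \mathbb{E}\big[2r\sin\tfrac{\Theta}{2}\big] = r\int_0^{\pi}\sin\tfrac{\theta}{2} \sin\theta \,d\theta = \frac{4r}{3} \approx 8495 \text{ km},\]
both within about \(0.5\%\) of the simulation estimates above.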
Repeat the above simulations, but this time keep all results and plot the empirical distributions for the two types of distance.
Compute the skewness and kurtosis of the two empirical distance distributions. Which of them looks closer to the Normal distribution? Explain.
Ans: I kept all the simulated distances and plotted the two empirical distributions.
# wrangling
df <- data.frame( ind = 1:B, Geodesic = geo_distances , Euclidean = euclidean_distances)
#more wrangling so i could plot and compare
df.long = melt(df, id.vars = "ind")
# plotting
ggplot(df.long, aes(value, fill=variable)) +
geom_density(alpha = 0.3) +
labs(x="Distancess Between Paris of Random Points (km)", y= "Empirical Density"
,title = "Empirical Distribution of Geodesic and Euclidean Distances", fill = "Distance Type")
Now let's compute the skewness and kurtosis:
ske_geo = skewness(geo_distances, na.rm = TRUE)
kur_geo = kurtosis(geo_distances, na.rm = TRUE)
ske_euc = skewness(euclidean_distances, na.rm = TRUE)
kur_euc = kurtosis(euclidean_distances, na.rm = TRUE)
The geodesic distances have a skewness of 0.001 (essentially zero), indicating a nearly symmetric distribution, and an excess kurtosis of -0.789, indicating shorter (lighter) tails than the Normal distribution.
The Euclidean distances have a skewness of -0.609, indicating a left-skewed distribution (a longer left tail), and an excess kurtosis of -0.528, again indicating shorter tails than the Normal distribution.
I would say the geodesic distribution is closer to the Normal distribution because it is much more symmetric (skewness close to 0), while both distributions have comparable, mildly negative excess kurtosis.
Write a function that receives as input \(n\), the number of desired samples, and a
string representing a country’s name. The function should sample random
points from the specific country.
Use rejection sampling, where you can check if a point belongs to a
country using the map.where function from the
maps package.
Run the function to sample \(1000\)
pairs of points in the USA and estimate the average
geodesic distance for two random points in USA.
Hint: You may need to convert your points to
geographical coordinates.
ANS: I implemented a rejection sampling function: it samples batches of points uniformly on the sphere, converts them to geographic coordinates, and keeps only the points that map.where places inside the requested country.
rej.samp_n_ps_of_country <- function(nsmp, country_name){ #rejection sampling of of country coordinates
B <- nsmp
if (nsmp<10000 ) B <- 10000
lon <- lat <- rep(0,nsmp)
cnt <- 0
while (cnt < nsmp){
# nsmp=10000 #for testing
# country_name <- "USA" #for testing
rand_p = as.data.frame(cartesian2geographic(runif_on_spher2df(B)))[2:3]
pattt <- "(?<=:).*"
I <- which( str_replace(str_replace( map.where( "world",rand_p$lon,rand_p$lat) ,pattern=pattt,"") ,":" ,"") == as.character(country_name))
new_cnt <- min(length(I), nsmp-cnt)
if(length(I)>0){
lon[(cnt+1):(cnt+new_cnt)] <- rand_p$lon[I[1:new_cnt]]
lat[(cnt+1):(cnt+new_cnt)] <- rand_p$lat[I[1:new_cnt]]
}
cnt = cnt + new_cnt
}
return(data.frame(longitude = lon,latitude =lat))
}
Now I run the function to sample \(1000\) pairs of points in the USA
# #ii:
B = 1000 #1000 pairs
usa_rand1 = rej.samp_n_ps_of_country(B,"USA")
usa_rand2 = rej.samp_n_ps_of_country(B,"USA")
# ggplot(usa_rand1, aes(x=longitude,y=longitude)) + geom_point()
plot(usa_rand1,main="Scatter Plot of USA Points - First Sample")
plot(usa_rand2,main="Scatter Plot of USA Points - Second Sample")
# A check that the sampled coordinates indeed fall inside the USA
pattt <- "(?<=:).*"
in_usa1 <- str_replace(str_replace( map.where( "world", usa_rand1$longitude, usa_rand1$latitude) ,pattern=pattt,"") ,":" ,"") == "USA"
in_usa2 <- str_replace(str_replace( map.where( "world", usa_rand2$longitude, usa_rand2$latitude) ,pattern=pattt,"") ,":" ,"") == "USA"
paste("Are all sampled coordinates inside the USA: ", all(in_usa1) && all(in_usa2))
# calculating distances:
rand_usa.distances = numeric(length = B)
for (i in 1:B){
sph_p1 <- geographic2spheric(usa_rand1[i,])
sph_p2 <- geographic2spheric(usa_rand2[i,])
rand_usa.distances[i] <- geodesic_dist(sph_p1,sph_p2,r=6371)
}
usa_d_avg_hat <- mean(rand_usa.distances)
The average geodesic distance between two random points in the USA is 2353.386 km.
For which countries do you think that the rejection sampling method would be problematic? How would you improve it? explain in words (no need for analysis/code for this part).
Ans:
As learned in class, rejection sampling can be implemented for any condition \(A\), but becomes very inefficient when \(A\) is a rare event (\(P(A)\) is small). Hence, countries that occupy only a tiny fraction of the world map would take a very long time to sample with this method.
I would improve it by restricting the proposal distribution to each country's bounding box (or bounding polygon): sample points only inside that region and reject the few that fall outside the country itself, which raises the acceptance probability dramatically (see the sketch below).
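For illustration only (the question does not require code), here is a minimal sketch of the bounding-box idea. The acceptance probability of plain rejection sampling equals the country's area divided by the sphere's surface area, so for a country of about 20,000 square km only roughly one in 25,000 uniform points is accepted. The function name below is hypothetical; latitude is drawn so that the proposal remains uniform on the sphere within the box:
sample_in_bbox <- function(nsmp, country_name) {
bb <- maps::map("world", country_name, plot = FALSE)$range # c(lon_min, lon_max, lat_min, lat_max)
out <- data.frame(longitude = numeric(0), latitude = numeric(0))
while (nrow(out) < nsmp) {
lon <- runif(nsmp, bb[1], bb[2])
# uniform-on-sphere latitude within the box: sample sin(latitude) uniformly
lat <- asin(runif(nsmp, sin(bb[3] * pi/180), sin(bb[4] * pi/180))) * 180/pi
where <- sub(":.*", "", maps::map.where("world", lon, lat)) # strip sub-regions, e.g. "USA:Alaska" -> "USA"
keep <- !is.na(where) & where == country_name
out <- rbind(out, data.frame(longitude = lon, latitude = lat)[keep, ])
}
return(out[1:nsmp, ])
}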
A statistician proposes to you two different methods for sampling
points on the surface of the sphere of radius \(r\):
Sample from a uniform distribution \(\phi
\sim U[0,2\pi]\) and \(\theta \sim
U[0,\pi]\) independently, to get a point in spherical coordinates
\((r,\phi,\theta)\).
This point can be converted to Cartesian coordinates to get \((x,y,z)\).
Sample from a Normal distribution \(x,y,z \sim N(0,1)\) independently, and then normalize the resulting vector, i.e. set:
\[ (x, y, z) \leftarrow \frac{r}{\sqrt{x^2+y^2+z^2}} (x, y, z) \]
For each of the two methods, determine if the method really samples points uniformly on the sphere, or whether there are certain areas of the sphere that will be more/less dense than others. You may use mathematical arguments and/or simulation results with computed statistics/plots to support your conclusion.
Note that the uniform distribution on the sphere has the property that two regions of equal areas on the sphere always have the same probability (this can be defined mathematically in details, but the formal definitions are not needed for the questions).
Recall that we know that the function runif_on_sphere
does sample from the uniform distribution over the sphere, and you may
compare your two methods to the output of this function.
ANS:
I will sample points with each method and plot the empirical marginal distributions of the coordinates x, y and z:
First, let’s plot a regular uniform sample of the sphere with
runif_on_sphere:
B = 10000
# banchmark uniform sampling
original_rand_xyz <- as.data.frame(runif_on_sphere(n=B, d=3, r=r))
colnames(original_rand_xyz) = c('x','y','z')
# for (i in 1:B) rand_xyz[i,] <- cartesian_to_spherical(rand_xyz[i,])
x=ggplot(original_rand_xyz , aes(x=x)) + geom_density(col=3)
y=ggplot(original_rand_xyz , aes(x=y)) + geom_density(col=4)
z=ggplot(original_rand_xyz , aes(x=z)) + geom_density(col=5)
plot(density(original_rand_xyz$x), col=1, main="X,Y,Z Density Plot (uniform benchmark)")
lines(density(original_rand_xyz$y), col=2)
lines(density(original_rand_xyz$z), col=3)
ggarrange(x, y, z,
labels = c("X", "Y", "Z"),
ncol = 2, nrow = 2)
We can see the marginal distribution of each of x, y and z separately. For a uniform sample on the sphere, each Cartesian coordinate is uniformly distributed on \([-r, r]\) (Archimedes' hat-box theorem), and indeed the three densities look roughly flat over this range.
Now I will simulate Method 1, convert the results to Cartesian coordinates, and plot the values of x, y and z:
#Method 1:
original_spheral_samps <- data.frame(r=r, phi = runif(B, 0, 2*pi) , theta = runif(B, 0, pi) )
# plot(spheral_samps, main="Polar sampling")
for (i in 1:B) original_spheral_samps[i,] <- spherical_to_cartesian(original_spheral_samps[i,])
colnames(original_spheral_samps) <- c("x","y","z")
plot(density(original_spheral_samps$x), col=1, main="X,Y,Z Density Plot (Method 1)")
lines(density(original_spheral_samps$y), col=2)
lines(density(original_spheral_samps$z), col=3)
The results are not at all similar to the benchmark: the mass of each coordinate concentrates near \(\pm r\) instead of being flat. This is expected, since drawing \(\theta \sim U[0,\pi]\) ignores the \(\sin\theta\) area element, so Method 1 over-samples the regions near the poles and does not sample uniformly on the sphere.
And now I will simulate Method 2:
#Method 2:
r <- 6371
B = 1000
df <- data.frame(x= rnorm(B), y = rnorm(B),z = rnorm(B))
for(i in 1:B) df[i,] <- df[i,]*r /(sqrt((df[i,'x']**2 + df[i,'y']**2 + df[i,'z']**2)))
plot(density(df$x), col=1, main="X,Y,Z Density Plot (Method 2)")
lines(density(df$y), col=2)
lines(density(df$z), col=3)
The plot is similar to the benchmark: each coordinate looks approximately uniform on \([-r, r]\). Thus the second method does sample (approximately) uniformly on the sphere, unlike the first. This is expected, since a standard Normal vector \((x,y,z)\) has a rotation-invariant distribution, so after normalizing it to length \(r\) the resulting point is uniform on the sphere.
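A quantitative check of the same conclusion (a sketch reusing the three samples above): under uniformity, the polar cap \(\theta < \pi/6\) covers a fraction \(\frac{1-\cos(\pi/6)}{2} \approx 0.067\) of the sphere, while Method 1 puts a fraction of about \(1/6 \approx 0.167\) of its points there.
cap_fraction <- function(xyz, r = 6371) mean(acos(pmax(pmin(xyz$z / r, 1), -1)) < pi/6) # fraction of points in the cap theta < pi/6
(1 - cos(pi/6)) / 2 # true area fraction of the cap, about 0.067
cap_fraction(original_rand_xyz) # benchmark: should be close to 0.067
cap_fraction(original_spheral_samps) # Method 1: should be close to 1/6, about 0.167
cap_fraction(df) # Method 2: should be close to 0.067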
Read the list of world countries by area from Wikipedia dataset file here.
Use the rvest package to read the html file and extract
the data table representing countries by their area into a data-frame.
Remove rows with country name containing parenthesis ().
Convert the column representing total area to numerical values in square km. For countries where more than one value is available, take the first one. Show the top and bottom two rows of the resulting data-frame.
url <- "https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_area"
rawdf1 <- read_html(url) %>% html_table() %>% nth(2) %>% as.data.frame() # the relevant countries-by-area table is the second table on the page
# Mapping between country names as they appear on Wikipedia (names.wiki) and the corresponding names used by the 'maps' package (names.map):
names.wiki = c("United States", "Kingdom of Denmark", "Republic of the Congo", "Sahrawi Arab Democratic Republic", "United Kingdom", "Somaliland", "Eswatini", "The Bahamas", "The Gambia", "Abkhazia", "United States Minor Outlying Islands", "State of Palestine", "Transnistria", "South Ossetia", "Northern Cyprus", "Artsakh", "East Timor", "Trinidad and Tobago","DR Congo" ,"Congo", "Federated States of Micronesia", "Republic of Artsakh", "Czechia")
names.map = c("USA", "Denmark", "Republic of Congo", "Western Sahara", "UK", "Somalia", "Swaziland", "Bahamas", "Gambia", "Georgia", "USA", "Palestine", "Moldova", "Georgia", "Cyprus", "Armenia", "Timor-Leste", "Trinidad","Democratic Republic of the Congo", "Republic of Congo", "Micronesia", "Armenia" , "Czech Republic")
#List of countries in data
map_lib_regions <- map_data("world") %>% distinct(region)
dep.area <- rawdf1
colnames(dep.area)[2:3] <- c("Country", "Area_km2" )
parenthtis <- str_which(dep.area$Country, pattern = "\\(")
dep.area = dep.area %>% select(Country,Area_km2) %>% filter(!(row_number() %in% parenthtis))
dep.area$Country[4] <- "Canada"
#CONVERT to maps names:
for(i in 1:nrow(dep.area) ){ ## MAYBE MAKE IT RUN FASTER
if(dep.area$Country[i] %in% names.wiki) dep.area$Country[i] <- names.map[which(names.wiki == dep.area$Country[i])]
}
pat="(?<=\\s).*|(,)"
dep.area$Area_km2 <- dep.area$Area_km2 %>% str_replace_all(pat, "") %>% str_trim() %>% as.numeric() # keep only the first number (drop commas and anything after the first space); rows that cannot be parsed become NA (the coercion warning)
# Show the top and bottom two rows of the resulting data-frame.
dep.area %>% slice(1:2,n()-1,n())
Repeat the above analysis but this time with the list of world countries by population from here, where instead of the total area column, convert to numerical values the column representing total population.
url2="https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population"
rawdf <- read_html(url2) %>% html_table() %>% first() %>% as.data.frame() #CHANGE TO DEP.PROP
dep.pop <- rawdf
colnames(dep.pop)[2] = "Country"
#CLEAR:
parenthtis <- str_which(dep.pop$Country, pattern = "\\(")
dep.pop = dep.pop %>% select(Country,Population) %>% filter(!(row_number() %in% parenthtis))
#CONVERT to maps names: #LATER MAKE A HELPER FUNCTION
for(i in 1:nrow(dep.pop) ){ ## MAYBE MAKE IT RUN FASTER
if(dep.pop$Country[i] %in% names.wiki) {
dep.pop$Country[i] <- names.map[which(names.wiki == dep.pop$Country[i])]
}
}
#numbers to numeric
dep.pop$Population <- dep.pop$Population %>% str_replace_all(pat, "") %>% str_trim() %>% as.numeric()
# Show the top and bottom two rows of the resulting data-frame.
dep.pop %>% slice(2:3,n()-1,n())
Merge the two data-frames into a new one (denoted the
wikipedia data-frame) and add a new column called
Population_Density, that lists for each country the number
of people per square km.
Show a world-map (you can use the package rworldmap),
where each country is colored by its population density.
Display the three most dense and three least dense countries in the world in a table.
wikipedia <- as.data.frame(inner_join(x=dep.area,y=dep.pop,"Country")) %>%
group_by(Country) %>%
summarise(Area_km2= max(Area_km2), Population = max(Population)) %>%
mutate(Population = coalesce(Population, 0)) %>% mutate(Population_Density = (Population)/ (Area_km2))
wikipedia_map <- wikipedia %>% mutate(Population_Density.log = log(Population_Density)) %>%
mutate(Population_Density.log10 = log10(Population_Density))
#remove later
# delete_latedf <- wikipedia %>% select(Country,Area_km2,Population,Population_Density,Population_Density.log)
fr <- joinCountryData2Map(dF = wikipedia_map %>% dplyr::select(Country, Population_Density ,Population_Density.log,Population_Density.log10) %>% filter(Population_Density.log>0) , joinCode = "NAME", nameJoinColumn = "Country",verbose=F)
## 196 codes from your data successfully matched countries in the map
## 5 codes from your data failed to match with a country code in the map
## 47 codes from the map weren't represented in your data
mapCountryData(mapToPlot = fr, nameColumnToPlot = "Population_Density", catMethod = "fixedWidth", oceanCol = "steelblue1", missingCountryCol = "white", mapTitle = "Population Density", aspect = "variable") # Plot Worldmap
## Warning in plot.window(xlim = xlim, ylim = ylim, asp = aspect): NAs introduced
## by coercion
mapCountryData(mapToPlot = fr, nameColumnToPlot = "Population_Density.log", catMethod = "fixedWidth", oceanCol = "steelblue1", missingCountryCol = "white", mapTitle = "Population Density (loged)", aspect = "variable") # Plot Worldmap
## Warning in plot.window(xlim = xlim, ylim = ylim, asp = aspect): NAs introduced
## by coercion
mapCountryData(mapToPlot = fr, nameColumnToPlot = "Population_Density.log10", catMethod = "fixedWidth", oceanCol = "steelblue1", missingCountryCol = "white", mapTitle = "Population Density (loged by 10)", aspect = "variable") # Plot Worldmap
## Warning in plot.window(xlim = xlim, ylim = ylim, asp = aspect): NAs introduced
## by coercion
# Most dense countries
wikipedia %>% select(Country, Population_Density) %>%arrange(desc(Population_Density)) %>% head(3)
# Least dense countries
wikipedia %>% select(Country, Population_Density) %>% arrange(desc(Population_Density)) %>% tail(3)
Suppose that we choose two random individuals uniformly at random
from the world’s population, and suppose that each individual is located
uniformly at random in her country - what will be the average geodesic
distance between the two?
Simulate \(1000\) pairs of people: first, sample their nationalities, and then sample their locations conditional on their nationality.
To speed-up the simulation, you may ignore countries with area smaller than \(1000\) square km. Use the lists of names to match between the names from wikipedia
to the names returned by the map.where function.
Additional parsing might be needed for the output of this function in a
few cases.
Use the simulation to estimate the average geodesic distance. How does it compare to the average geodesic distance between two random points on earth?
# Keep only countries with an area of at least 1000 square km (as allowed by the question), and drop the "World" row
wikipedia_sim <- wikipedia %>% select (Country,Area_km2,Population,Population_Density) %>% filter(Area_km2>=1000 ) %>% filter(Country != "World")
#Setting proportions of nationalities
props <- wikipedia_sim$Population / sum(wikipedia_sim$Population)
names(props) <- wikipedia_sim$Country
# sample 2 random person nationalities:
B=1000
simulation_people1 <- data.frame(
ind=1:B,
pers.nat1 = sample(wikipedia_sim$Country, B, prob = props, TRUE) #,pers.long1 = NA, pers.lat1 = NA
) %>% arrange(pers.nat1)
simulation_people2 <- data.frame( ind=1:B, pers.nat2 = sample(wikipedia_sim$Country, B, prob = props, TRUE)) %>%
arrange(pers.nat2)
### sim 1
coords = data.frame( longitude= 0, latitude=0 )
for(country in (unique(simulation_people1$pers.nat1))){
# print(country)
# print(sum(simulation_people1$pers.nat1==country)) # PRINT FOR TEST
coords = rbind( coords, rej.samp_n_ps_of_country( sum(simulation_people1$pers.nat1==country), country ) )
}
coords = coords[2:(B+1),]
simulation_people1 = cbind (simulation_people1,coords ) %>% arrange(ind)
### sim 2:
coords = data.frame(longitude= 0, latitude=0)
for(country in (unique(simulation_people2$pers.nat2))){
# print(names(national.cnt2[i])) # print(country) # PRINT FOR TEST
coords = rbind(coords, rej.samp_n_ps_of_country( sum(simulation_people2$pers.nat2==country), country ) )
}
coords = coords[2:(B+1),]
simulation_people2 = cbind(simulation_people2 , coords ) %>% arrange(ind)
rand_people_distances <- numeric(length = B)
# for(i in 1:B) rand_people_distances[i] <- geodesic_dis(simulation_people1, simulation_people2)
# rand_people_distances <- numeric(length = B
for (i in 1:B){
sp_p1 <- geographic2spheric(simulation_people1[i,3:4])
sp_p2 <- geographic2spheric(simulation_people2[i,3:4])
rand_people_distances[i] <- geodesic_dist(sp_p1,sp_p2,r=6371)
}
avg_ran_pop_dis <- mean(rand_people_distances)
avg_ran_pop_dis
## [1] 6917.926
avg_geo_dist
## [1] 9966.397
ANS: The average geodesic distance between two randomly sampled people is about 6917.926 km, roughly \(3000\) km smaller than the average geodesic distance between two uniformly random points on the sphere (9966.397 km). This makes sense because the human population is clustered geographically (concentrated in relatively few, densely populated regions) rather than spread uniformly over the globe.
Suppose that we didn’t know the countries’ areas and want to estimate the area of a country by sampling.
Write a function that receives as input a vector of country names, a
number of samples \(n\), and outputs
estimates for the areas of all countries in square km, based on drawing
\(n\) random points on earth and
recording the relative frequency of points falling within each country.
First I calculate the world's surface area \(S\) with the known formula: \[S_r = 4\pi r^2 \quad \rightarrow \quad S_{r=6371} = 4\pi \cdot 6371^2 = 510064471.91 \text{ km}^2 \]
r=6371
# world's area:
S <- 4*pi*r^2
estimate_areas <- function(countries, n){ # creating the functions:
# countries <- c("USA", "Israel","India") #### TEST RM LATER
# n <- B ####RM Later
cntrs <- data.frame(Country = countries)
samps <- as.data.frame(cartesian2geographic( runif_on_spher2df(n) )[2:3]) #sample n in x,y,z & convert to geographic points
samp.cs <- prop.table(table(str_replace( str_replace( map.where("world",samps) ,pattern="(?<=:).*","") ,":" ,""),useNA = "ifany" )) #match coords with countries, get frequencies, keep NA's,
cntrs.apeared.names <- names(samp.cs[which(names(samp.cs) %in% countries)])
Estimated_Areas <- data.frame(Country = cntrs.apeared.names, Estimated_Area = as.numeric(samp.cs[which(names(samp.cs) %in% countries)] * S)) # Mult freqs by the world surface km^2 to get #######LATER BETTER FUNC EXTRACT VALUES FROM TABLE
results <- left_join(cntrs, Estimated_Areas, by = "Country" )
# results$Estimated_Area <- coalesce(results$Estimated_Area,0)
return( results)
}
Run the function with \(n=10000\)
samples and with the list of countries from the wikipedia
data-frame. Report separately the top 10 countries with the largest
estimated areas and the top 10 countries with the
largest true areas in tables. Do you see an agreement
between the two top-10 lists?
# Rebuild the wikipedia data-frame with a full join, so that countries present in only one of the two tables are kept
wikipedia <- as.data.frame(full_join(x=dep.area,y=dep.pop,"Country")) %>%
group_by(Country) %>%
summarise(Area_km2= max(Area_km2), Population = max(Population)) %>%
mutate(Population = coalesce(Population, 0)) %>% mutate(Population_Density = (Population)/ (Area_km2)) %>% filter(Country != "World")
wikipedia <- wikipedia %>% filter(Country != "World")
#Applying the function on wikipedia df, n=10000
estimated_arieas <- estimate_areas(countries = wikipedia$Country , n=10000)
# top 10 largest estimated
estimated_arieas %>% arrange(desc(Estimated_Area)) %>% head(10) #PRETTIER COL NAMES
# top 10 true largest areas
as.data.frame(wikipedia) %>% arrange(desc(Area_km2)) %>% select(Country, Area_km2) %>% head(10)
ANS: There is clear agreement between the two lists: most, if not all, of the ten countries with the largest estimated areas also appear among the ten countries with the largest true areas, although the two lists disagree somewhat about the exact order.
Around \(71\%\) of the sampled points returned NA (no country), which makes sense because roughly that fraction of the Earth's surface is covered by water.
Next, merge the estimated areas results with the data-frame and plot the true area (x-axis) vs. the estimated area (y-axis) for all countries.
# Rebuild the wikipedia data-frame before merging in the estimated areas
wikipedia <- as.data.frame(full_join(x=dep.area,y=dep.pop,"Country")) %>%
group_by(Country) %>%
summarise(Area_km2= max(Area_km2), Population = max(Population)) %>%
mutate(Population = coalesce(Population, 0)) %>% mutate(Population_Density = (Population)/ (Area_km2)) %>% filter(Country != "World")
wikipedia <- full_join(wikipedia, estimated_arieas, by = "Country") #%>% select(Country, Area_km2,Estimated_Area)
plot(wikipedia$Area_km2, wikipedia$Estimated_Area, log="xy", xlab="True Area (km^2)", ylab="Estimated Area (km^2)", main="True vs. Estimated Country Areas (log-log scale)")
ml <- lm(Estimated_Area ~ Area_km2, data = wikipedia) # regress the estimated area on the true area, matching the plot axes
abline(ml) # note: the fitted line is computed in the original scale, so it is only indicative on the log-log axes
# Extracting R-squared parameter from summary
R.squared <- summary(ml)$r.squared
# R.squared
What is the \(R^2\) between the two?
ANS: The \(R^2\) between the two is `r round(R.squared, 3)`, which means that this proportion of the variance in the estimated areas of the countries is explained by the variance in their true areas.
### a. (6pt) Read the earthquakes dataset from USGS. Select
all earthquakes in the world in 2022 of magnitude above 2.5, and save
them to a .csv file (The magnitude is in Richter’s
scale). Load the dataset into an R data-frame, and display the five
latest earthquakes and the five strongest earthquakes.
# https://raw.githubusercontent.com/AnonymousDataStudentWhoLovesR/Main/main/query.csv
usgs <- read.csv("query.csv")
# 5 latest earthquakes (sort by time to make sure the most recent events come first)
usgs %>% arrange(desc(time)) %>% select(time,mag,place,depth) %>% head(5)
# head(usgs)
# 5 Strongest earthquakes in 2022
usgs %>% select(time,mag,place) %>% arrange(desc(mag)) %>% head(5)
usgs %>%head(5)
Plot the earthquakes on top of on the world map. That is, each
earthquake should be represented by a point, with the x
axis representing longitude, the y axis
representing latitude, and the size of the point
representing the magnitude. This scatter plot should be on top of the
world map, drawn using the geom_map function from the
ggmap package. The data-frame representing the map can be
generated as the output of the command
map_data("world").
pattt = "(?<=:).*"
usgs <- usgs %>% select(time,latitude,longitude, mag, place, depth) #%>%
world_map <- map_data("world")
ggplot() +
geom_map(data =world_map, map = world_map, color = "white", fill="grey", alpha=0.7,
aes(map_id=region, group=group)) +
geom_point(data=usgs, aes(x=longitude, y=latitude ,size=mag,col=desc(mag)),alpha=0.5) + scale_size_continuous(range=c(.1,4)) + coord_quickmap() +
labs(y="Latitude", x="Longitude", title="Earthquakes in 2022 by Magnitude", col="Magnitude", size="Magnitude")
Repeat the plot from (b.), but this time make separate plots for
different earthquakes, where they are grouped together by magnitude.
That is, there should be a separate plot for each of the following
magnitude ranges: between \(2\) and
\(3\), between \(3\) and \(4\), …, between \(7\) and \(8\).
for (i in 1:6) {
nam <- paste("g", i, sep = "")
assign(nam,
# print(
ggplot() + geom_map( data = world_map, map = world_map, color = "white", fill="grey", alpha=0.7, aes(map_id=region, group=group)) + geom_point(data= usgs %>% filter( mag >= (i+1) & mag < (i+2) ),aes(x=longitude, y=latitude , size = mag,col=desc(mag)) ,alpha=0.5) + scale_size_continuous(range=c(.5,4)) + coord_quickmap() + labs( y="Latitude", x="Longitude", col ="Magnitude" , title= paste("Earthquakes in 2022 of Magnitude in Range ",i+1,"-", i+2), size = "Magnitude")
) #end of assign
}
#MAYBE BETTER COLOURS
ggarrange(g1,g2,g3,g4,g5,g6 ,ncol=2,nrow =3)
What can you conclude about the locations of strong/weak
earthquakes?
Parse the column representing the time to extract (i) the number of
days since the beginning of the year, and (ii) the time of day, in units
of hours since midnight from , as two new separate numeric columns.
That is, for example the date 2022-03-10T17:40:28.123Z
will be converted to \(31+28+10=69\)
days since the beginning of the year (the numbers of days in the first
months are January:31, February:28, March:31, April:30, May:31), and to
\(17+40/60+28/3600=17.6744\) hours
since midnight (we ignore fractions of seconds here).
Plot pairwise correlations of the magnitude (mag),
depth, day-of-year and
time-of-day. (The depth column represents the
depth below the ground in km of the center of the earthquake). For which
pairs of variables there is a significant correlation?
You may use the ggpairs command from the
GGally package.
usgs <- usgs %>% mutate( day_of_year = as.numeric(as.Date(usgs$time) - as.Date("2022-01-01")) + 1) %>% # +1 so that January 1st counts as day 1, matching the example in the question
mutate( time_of_day = hour(ymd_hms(usgs$time)) + (minute(ymd_hms(usgs$time))/60) + (round( second( ymd_hms(usgs$time)),0)/3600) )
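# Sanity check of the parsing rule on the example date from the question
# (a sketch; assumes lubridate, which provides ymd_hms()/yday(), is loaded):
t_example <- ymd_hms("2022-03-10T17:40:28Z")
yday(t_example) # 69 days, counting January 1st as day 1
hour(t_example) + minute(t_example)/60 + second(t_example)/3600 # about 17.674 hours since midnight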
features <- c("mag","depth","day_of_year", "time_of_day") # selecting the requested columns
fav <- "pinkish" #Just kidding, it's a way to get a non black default color with ggpairs
ggpairs(usgs, columns = features, mapping = ggplot2::aes(color = fav) )
cor.depth.mag <- cor(usgs$mag, usgs$depth)
According to the plot, there is a weak positive correlation of 0.159 between depth and magnitude; other than that, I see no significant relationship between any pair of variables. A formal test of the depth-magnitude correlation is sketched below.
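To go beyond eyeballing the plot, the one noticeable correlation can be tested formally (a sketch):
cor.test(usgs$mag, usgs$depth) # Pearson correlation test; a small p-value indicates a statistically significant (though weak) correlation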
We want to test if there are times in the day in which earthquakes
are more or less commons. Let \(t_i\)
be the time of each earthquake \(i\) in
the day (between \(00:00:00\) and \(23:59:59\)), in units of hours, calculated
in (d.).
Test the null hypothesis \(H_0: t_i \sim U[0, 24]\) against the complex alternative
\(H_1: t_i \nsim U[0, 24]\).
We test using the Pearson chi-square statistic:
\[ S = \sum_{i=1}^{24} \frac{(o_{i}-e_{i})^2}{e_{i}}\]
where \(o_{i}\) is the number of
earthquakes at hour \(i\) (e.g. between
\(00:00:00\) and \(00:59:59\) for \(i=1\), between \(01:00:00\) and \(01:59:59\) for \(i=2\) etc. You may ignore the minutes and
seconds information and extract just the hour for this sub-question),
and \(e_i\) is the expected value of
\(\frac{n}{24}\), with \(n\) being the total number of earthquakes.
Report the test statistic and the p-value, using the \(\chi^2(24-1)\) approximation for the null
distribution. Would you reject the null hypothesis at significance level
\(\alpha=0.01\)?
ANS: First, a simple histogram of the number of earthquakes per hour can give us a first intuition about the results:
# Deriving observed counts grouped by hour with hour(ymd_hms()):
Hours <- hour(ymd_hms(usgs$time))
observed = data.frame(Hours) # hour values run from 0 to 23
observed %>% ggplot(aes(Hours)) + geom_bar(aes(y=..count..)) +
labs(title="Empirical Distribution of Earthquake Observations by Hour",y="Frequency")
The bar plot already gives a visual hint that earthquake occurrences are not spread uniformly across the hours of the day. Now the classic \(\chi^2\) goodness-of-fit test is carried out as follows:
observed <- usgs %>% group_by(hour(ymd_hms(usgs$time))) %>% summarise(freq = n())
n <- nrow(usgs) # samples
e_i <- n/24
Statistic <- 0 # the test statistic S, accumulated by the loop below
for (i in 1:24) Statistic = Statistic + (((observed$freq[i]-e_i)^2)/e_i) #Applying formula
\[\widehat{Pval}(s) = P_{H_0} (S > s) = 1 - F_{\chi^2_{23}}(s)\] Thus:
#Chi squ with 23 d.f, rounded
p_val <- round(1 - pchisq(Statistic,df = 24-1),3)
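As an optional cross-check, the built-in goodness-of-fit test (which by default tests a vector of counts against equal cell probabilities) should reproduce the same statistic and p-value:
chisq.test(observed$freq) # X-squared should match Statistic above, with df = 23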
We have a p-value of \(0.002 < \alpha = 0.01\), with test statistic \(S = 47.535\); thus we reject the null hypothesis \(H_0: \ t_i \sim U[0, 24]\) at the \(1\%\) significance level.
Thank You For The Interesting Course.