Q0. Submission Instructions

(Please read carefully)

The exam will be submitted individually by uploading the solved exam Rmd and html files to the course moodle.

Please name your files as 52414-HomeExam_ID.Rmd and 52414-HomeExam_ID.html where ID is replaced by your ID number (do not write your name in the file name or in the exam itself).

The number of points for each sub-question is indicated next to it.

Once you click on the moodle link for the home exam, the exam will start and you have three days (72 hours) to complete and submit it.

The exam will be available from June 26th. The last submission time is July 1st at 8:59.

You may use all course materials, the web and other written materials and R libraries.

You are NOT allowed to discuss any of the exam questions/materials with other students.

A Hebrew version of the exam is available here.

The purpose of the Hebrew version is to help in explaining the exam questions. You should use and fill only this Rmd English version and follow the instructions here. Contact us if you think there is a conflict between the Hebrew version and the instructions here.

Analysis and Presentation of Results:

Write your answers and explanations in the text of the Rmd file (not in the code).

The text of your answers should be next to the relevant code, plots and tables and refer to them, and not at a separate place at the end.

You need to explain every step of your analysis. When in doubt, a more detailed explanation is better than omitting explanations.

Give informative titles, axis names and names for each curve/bar in your graphs.

In some graphs you may need to change the graph limits. If you do so, please include the outlier points you have removed in a separate table.

Add informative comments explaining your code.

Whenever possible, use objective and specific terms and quantities learned in class, and avoid subjective and general non-quantified statements. For example:

Good: “We see a \(2.5\)-fold increase in the curve from Jan. 1st to March 1st”.

Bad: “The curve goes up at the beginning”.

Good: “The median is \(4.7\). We detected five outliers with distance \(>3\) standard deviations from the median”.

Bad: “The five points on the sides seem far from the middle”.

Sometimes Tables are the best way to present your results (e.g. when asked for a list of items). Exclude irrelevant rows/columns. Display clearly items' names in your Tables.

Show numbers in plots/tables using standard digits and not scientific display.

That is: 9000000 and not 9e+06.

Round numbers to at most 3 digits after the dot - that is, 9.456 and not 9.45581451044.

Some questions may require data wrangling and manipulation which you need to decide on. The instructions may not specify precisely the exact plot you should use (for example: show the distribution of ...). In such cases, you should decide what and how to show the results.

When analyzing real data, use your best judgment if you encounter missing values, negative values, NaNs, errors in the data etc. (e.g. excluding them, zeroing negative values..) and mention what you have done in your analysis in such cases.

Required libraries are called in the Rmd file. Install any library missing from your R environment. You are allowed to add additional libraries if you want.

If you do so, add them at the start of the Rmd file, right below the existing libraries, and explain what libraries you’ve added, and what is each new library used for.

This is an .Rmd file. Copy it with your mouse to create an Rmd file, and edit the file to include your answers.

Good luck!

My Additional Libraries:

rm(list=ls())    # clear the R environment before starting the analysis
library(reshape) # provides melt(), used below to reshape data frames from wide to long format
library(knitr)   # knitting utilities (e.g. kable for printing tables)
library(viridis) # colour-blind-friendly colour scales for plots

Allowed libraries:

Helper functions Provided:

# Helper functions for converting between coordinates.

# You may use and modify the functions. It is your responsibility that 
# the input/output format used in them are correct. 
# Convert to Cartesian coordinates: the function assumes that theta is in [0,pi] and phi is in [0,2pi]
spherical_to_cartesian <- function(point,r = 6371)
{
  x <- r*cos(point$phi)*sin(point$theta)
  y <- r*sin(point$phi)*sin(point$theta)
  z <- r*cos(point$theta)
  return(data.frame(x=x,y=y,z=z))
}


# Convert to Spherical coordinates: the function assumes that theta is in [0,pi] and phi is in [0,2pi]
cartesian_to_spherical <- function(point) 
{
  r <- sqrt(point$x**2+point$y**2+point$z**2)
  phi <- atan(point$y/point$x) + pi * ( 1 - sign(point$y)*(1+sign(point$x))/2 )  # take care of all sign cases. When x=0 or y=0 exactly, the function may not work correctly, but these are events of measure zero for any continuous distribution over the sphere.
  theta <- acos(point$z/r)
  return(list(r=r,phi=phi,theta=theta))
}

# Geodesic distance between points (slightly modified from the version provided in the exam template).
# Input format: p1 and p2 are lists having items: r, phi, theta in spherical coordinates, 
# i.e. phi in [0,2pi] and theta in [0,pi]
geodesic_dist <- function(p1, p2, r)      
{
  return ( r * acos( cos(p1$theta)*cos(p2$theta) + sin(p1$theta)*sin(p2$theta) * cos(p1$phi-p2$phi) ) )
}
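A quick sanity check of the provided helpers (illustrative only; the specific values below are just an arbitrary test point):

# Illustrative sanity check: converting to Cartesian and back should recover the
# original spherical coordinates (up to floating-point error).
p_test <- list(r = 6371, phi = 1.2, theta = 2.0)          # arbitrary test point
cartesian_to_spherical(spherical_to_cartesian(p_test))    # expect r = 6371, phi = 1.2, theta = 2.0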

My Helpers:

# Convert Cartesian (x, y, z) to geographic coordinates (degrees): the exam's phi and theta
# are mapped linearly to longitude in [-180, 180] and latitude in [-90, 90].
# Output fields are named r, lon, lat.
cartesian2geographic <- function(point,r=6371) #for apply
{
  longitude = ( atan(point$y/point$x) + pi*(1-sign(point$y)*(1+sign(point$x))/2 ) ) * (180/pi) - 180
  latitude = (acos(point$z/r))*180/pi - 90
  return(list( r=r ,lon  = longitude, lat = latitude))
}

# Convert geographic coordinates (degrees) back to Cartesian (x, y, z).
# Note: this helper expects the input fields to be named 'long' and 'lat'.
geo2xyz <- function(point,r=6371) 
{
  x <- r*cos((point$long + 180)*pi/180)*sin((point$lat + 90) * pi/180 ) #r*cos(phi)*sin(theta)
  y <- r*sin((point$long + 180) * pi/180)*sin((point$lat + 90) * pi/180 ) #r*sin(phi)*sin(theta)
  z <- r*cos((point$lat + 90) * pi/180) #r*cos(theta)
  return(data.frame(x=x,y=y,z=z))
}


# Convert geographic coordinates (fields 'longitude' and 'latitude', in degrees)
# to the exam's spherical angles: phi in [0, 2*pi], theta in [0, pi].
geographic2spheric <- function(point)
{
  phi = (point$longitude + 180) * pi/180 
  theta = (point$latitude + 90) * pi/180 
  return(list( phi = phi, theta= theta))
}

# Wrapper around uniformly::runif_on_sphere: returns n uniform points on the sphere
# of radius r as a data frame with columns x, y, z.
runif_on_spher2df <- function(iters,r=6371) {
  df_rand_xyz_n <- as.data.frame(runif_on_sphere(n=iters, d=3, r=r))
  colnames(df_rand_xyz_n) <- c("x", "y", "z") 
  return( df_rand_xyz_n)
}



Q1. Simulate random points on earth (31 pt)

Background

[Figure: Earth]

In this question we simulate random locations on earth. For simplicity, we assume that earth is a ball with a radius of \(r = 6371 km\) (we ignore the fact that earth is not an exact ball), and we

will simulate random points on the surface of the ball, i.e. from a sphere with this radius.

Each location on earth can be represented using three Cartesian coordinates \((x,y,z)\),

where the center of earth is at \((0,0,0)\), and each point on the surface satisfies \(x^2+y^2+z^2 = r^2\).

Alternatively, we also use spherical coordinates to represent points on earth: \((r,\theta, \phi)\) where \(\phi \in [0, 2\pi]\) and \(\theta \in [0, \pi]\). Here \(r\) is always the same, at \(6371\) (points inside earth will have a lower \(r\) value). \(\phi\) is the longitude, with points on the

Prime meridian (Greenwich line) having \(\phi=0\), and then \(\phi\) is increased as we move east. For example, if we move east and complete half a circle around the world, we reach points at the international date line, and they will have \(\phi=\pi\), and as we approach the Greenwich line again from west (i.e. when completing a full circle), \(\phi\) will get closer and closer to \(2\pi\).

\(\theta\) is the latitude, with the north (south) pole having \(\theta=\pi\) (\(\theta=0\)), and points on the equator have \(\theta=\frac{\pi}{2}\).

Note: A common way to represent (spherical) geographical coordinates is to set the longitude \(\phi\) between \([-180,180]\) (instead of \([0,2\pi]\)),

where positive (negative) values correspond to east (west) of the Greenwich line. The latitude \(\theta\) is set between \([-90,90]\) (instead of \([0,\pi]\)), where positive (negative) values correspond to north (south) of the equator. We will refer to this convention as geographical coordinates, and will use it in some parts of the exam, e.g. when reading real geographical data.

For two points on the sphere, we define their Euclidean distance as the standard distance in space, i.e. the length of the straight line connecting the points, that passes through the ball. This will be the length you need to travel if you could dig a straight hole between the two locations. The Euclidean distance has a simple formula in terms of the Cartesian coordinates. For two points \((x_1,y_1,z_1)\) and \((x_2,y_2,z_2)\), their distance is

\[d(p_1,p_2) \equiv \sqrt{(x_1-x_2)^2+(y_1-y_2)^2+(z_1-z_2)^2}\]

We also define the geodesic distance of two points, as the length of the shortest path connecting them along the sphere. This is the length you will have to travel if you fly in a ‘straight line’ along the sphere between the two points, along a great circle, and for the sphere it is also called great circle distance or spherical distance.

The geodesic distance has a simple formula in terms of the spherical coordinates. For two points \((r,\theta_1,\phi_1)\) and \((r,\theta_2,\phi_2)\), their distance is

\[g(p_1,p_2) \equiv r \cos^{-1} \Big(\cos (\theta_1) \cos (\theta_2) + \sin (\theta_1) \sin (\theta_2) \cos(\phi_1 - \phi_2)\Big)\]

For example, if \(\phi_1=\phi_2\) then the distance simplifies to \(g(p_1,p_2) = r |\theta_1 - \theta_2|\).
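As a concrete illustration of this special case (an added numerical example): for the north pole (\(\theta_1=\pi\)) and a point on the equator (\(\theta_2=\frac{\pi}{2}\)) with the same \(\phi\), the geodesic distance is

\[g(p_1,p_2) = 6371 \cdot \left|\pi - \frac{\pi}{2}\right| = 6371 \cdot \frac{\pi}{2} \approx 10007.543 \ km,\]

i.e. a quarter of a great circle.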

The length of the red line in the figure below is the geodesic distance of \(p\) and \(q\). The length of a straight line through the ball (not drawn) will be the Euclidean distance of \(p\) and \(q\).

[Figure: Geodesic Distance]

1.a. (10pt)

What is the average geodesic distance between two points sampled uniformly on the sphere? Simulate \(1000\) random pairs of points and estimate it using the pairwise distances computed for the simulated points.

You may use the function runif_on_sphere from the package uniformly, to simulate points uniformly on the sphere.

Next, repeat the same simulation but this time estimate the average Euclidean distance between two random points on the sphere.

bonus: Derive an exact mathematical formula for the average geodesic distance, and for the average Euclidean distance as functions of \(r\). Compare the formulas you got to the estimated distances from the simulations.

I calculated both the geodesic and the Euclidean distances with for loops and the helper functions for converting coordinates.

(I ran the calculation with a for loop: at first using the formula directly, and later on by applying the geodesic_dist helper function.)

globe.Radius = 6371
r = globe.Radius #shorter for convenience
B = 1000

# calculating geodesic distance
geo_distances = c() # will be filled with results
for (i in 1:B){ 
  p_1 = as.data.frame(runif_on_sphere(n=1, d=3, r=r)) # samples uniformly Cartesian
  names(p_1) = c('x', "y", "z") 
  sample_1 = as.data.frame(cartesian_to_spherical(p_1)) # to spherical
  p_2 = as.data.frame(runif_on_sphere(n=1, d=3, r=r))
  names(p_2) = c('x', "y", "z")
  sample_2 = as.data.frame(cartesian_to_spherical(p_2))
  geo_distances[i] = r * acos( cos(sample_1[1,3])*cos(sample_2[1,3]) + sin(sample_1[1,3])*sin(sample_2[1,3])*cos( sample_1[1,2] - sample_2[1,2]) )
}

avg_geo_dist = mean(geo_distances)
# avg_geo_dist

# calculating euclidean distance (the chord length uses the Cartesian coordinates directly,
# so no conversion to spherical coordinates is needed here)
euclidean_distances = c()  
for (i in 1:B){
  p_1 = as.data.frame(runif_on_sphere(n=1, d=3, r=r))
  names(p_1) = c("x", "y", "z")
  p_2 = as.data.frame(runif_on_sphere(n=1, d=3, r=r))
  names(p_2) = c("x", "y", "z")
  euclidean_distances[i] = sqrt((p_1[1,1]-p_2[1,1])^2 + (p_1[1,2]-p_2[1,2])^2 + (p_1[1,3]-p_2[1,3])^2)
}
avg_euco_dist = mean(euclidean_distances)
# avg_euco_dist
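For reference, the same estimates can also be computed without an explicit loop (a vectorized sketch of the calculation above, using the dot-product form of the geodesic distance):

# Vectorized sketch: sample all pairs at once; the geodesic distance follows from the
# dot product, g = r * acos(<p1, p2> / r^2), and the Euclidean distance is the chord length.
P  <- runif_on_sphere(n = 2*B, d = 3, r = r)
P1 <- P[1:B, , drop = FALSE]
P2 <- P[(B+1):(2*B), , drop = FALSE]
eucl_v <- sqrt(rowSums((P1 - P2)^2))
geod_v <- r * acos(pmin(pmax(rowSums(P1 * P2) / r^2, -1), 1))   # clamp to [-1,1] against rounding errors
c(mean_geodesic = mean(geod_v), mean_euclidean = mean(eucl_v))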

The mean geodesic distance is 9966.397 km and the mean Euclidean distance is 8500.828 km. As expected, the Euclidean distance is smaller, since its route is more direct: the straight chord through the ball is always shorter than the corresponding arc along the surface.

1.b. (6pt)

Repeat the above simulations, but this time keep all results and plot the empirical distributions for the two types of distance.

Compute the skewness and kurtosis of the two empirical distance distributions. Which of them looks closer to the Normal distribution? explain

Ans: I recorded all the sampled distances and plotted the two empirical distributions.

# wrangling
df <- data.frame( ind = 1:B, Geodesic = geo_distances , Euclidean = euclidean_distances)
#more wrangling so i could plot and compare
df.long = melt(df, id.vars = "ind")
# plotting 
ggplot(df.long, aes(value, fill=variable)) + 
  geom_density(alpha = 0.3) + 
  labs(x="Distancess Between Paris of Random Points (km)", y= "Empirical Density"
     ,title = "Empirical Distribution of Geodesic and Euclidean Distances", fill = "Distance Type")

Now let's calculate the skewness and kurtosis:

ske_geo = skewness(geo_distances, na.rm = TRUE)
kur_geo = kurtosis(geo_distances, na.rm = TRUE)
ske_euc = skewness(euclidean_distances, na.rm = TRUE)
kur_euc = kurtosis(euclidean_distances, na.rm = TRUE)

The geodesic distances have a skewness of 0.001 (essentially zero), indicating a nearly symmetric distribution, and a kurtosis of -0.789, which indicates shorter (lighter) tails compared to the Normal distribution.

The Euclidean distances have a skewness of -0.609, indicating a left (negative) skew, and a kurtosis of -0.528, which also indicates shorter tails than the Normal distribution, though less markedly so than for the geodesic distances.

I would say the geodesic distance distribution is closer to the Normal distribution, mainly because it is far more symmetric (skewness \(\approx 0\) versus \(-0.609\)).
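A quick visual way to back this up (an optional sketch, not required by the question): normal Q-Q plots of the two standardized distance samples; the closer the points lie to the reference line, the closer the distribution is to a Normal.

# Optional check: normal Q-Q plots of the standardized distances.
par(mfrow = c(1, 2))
qqnorm(scale(geo_distances), main = "Geodesic distances");  qqline(scale(geo_distances))
qqnorm(scale(euclidean_distances), main = "Euclidean distances"); qqline(scale(euclidean_distances))
par(mfrow = c(1, 1))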

1.c. (7pt)

Write a function that receives as input \(n\), the number of desired samples, and a string representing a country’s name. The function should sample random points from the specific country.

Use rejection sampling, where you can check if a point belongs to a country using the map.where function from the maps package.

Run the function to sample \(1000\) pairs of points in the USA and estimate the average geodesic distance for two random points in USA.

Hint: You may need to convert your points to geographical coordinates.

ANS: I implemented a rejection-sampling function that draws large batches of uniform points on the sphere and keeps only the points that fall inside the requested country (according to map.where).

rej.samp_n_ps_of_country <- function(nsmp, country_name){ # rejection sampling of points inside a given country
  B <- nsmp
  if (nsmp<10000 ) B <- 10000   # use large batches so that enough candidate points survive the rejection step
  lon <- lat <- rep(0,nsmp)
  cnt <- 0
  while (cnt < nsmp){
    # draw B uniform points on the sphere and convert them to geographic coordinates
    rand_p = as.data.frame(cartesian2geographic(runif_on_spher2df(B)))[2:3]
    pattt <- "(?<=:).*"   # strip sub-region suffixes, e.g. "USA:Alaska" -> "USA"
    I <- which( str_replace(str_replace( map.where( "world",rand_p$lon,rand_p$lat) ,pattern=pattt,"") ,":" ,"") == as.character(country_name))
    new_cnt <- min(length(I), nsmp-cnt)
    if(length(I)>0){
      lon[(cnt+1):(cnt+new_cnt)] <- rand_p$lon[I[1:new_cnt]]
      lat[(cnt+1):(cnt+new_cnt)] <- rand_p$lat[I[1:new_cnt]]
    }
    cnt = cnt + new_cnt
  }
  return(data.frame(longitude = lon,latitude =lat))
}

Now I run the function to sample \(1000\) pairs of points in the USA

B = 1000 # 1000 pairs of points
usa_rand1 = rej.samp_n_ps_of_country(B,"USA")
usa_rand2 = rej.samp_n_ps_of_country(B,"USA")

plot(usa_rand1,main="Scatter Plot of USA Points - First Sample") 

plot(usa_rand2,main="Scatter Plot of USA Points - Second Sample") 

# A check: are all the sampled coordinates indeed inside the USA?
pattt <- "(?<=:).*"
in_usa1 <- str_replace(str_replace(map.where("world", usa_rand1$longitude, usa_rand1$latitude), pattern=pattt, ""), ":", "") == "USA"
in_usa2 <- str_replace(str_replace(map.where("world", usa_rand2$longitude, usa_rand2$latitude), pattern=pattt, ""), ":", "") == "USA"
paste("Are all sampled coordinates inside the USA: ", all(in_usa1) && all(in_usa2))
# calculating the geodesic distances between the paired points:

rand_usa.distances = numeric(length = B)
for (i in 1:B){
  sph_p1 <-  geographic2spheric(usa_rand1[i,])
  sph_p2 <-  geographic2spheric(usa_rand2[i,])
  rand_usa.distances[i] <-  geodesic_dist(sph_p1,sph_p2,r=6371)
}
usa_d_avg_hat <- mean(rand_usa.distances)

The average geodesic distance between two random points in the USA is 2353.386 km.

For which countries do you think that the rejection sampling method would be problematic? How would you improve it? explain in words (no need for analysis/code for this part).

Ans:

I learned in class that rejection sampling can be implemented for any condition \(A\), but it can become very inefficient when \(A\) is a rare event (\(P(A)\) is small). Hence, small countries, which occupy only a tiny fraction of the world map, would take a very long time to sample with this method, since almost all candidate points are rejected.

I would improve it by restricting the proposal distribution to a smaller region that contains the country, for example a bounding polygon (or bounding box) defined by the country's boundaries, and handling the sampling case by case: if the country lies in polygon \(y\), sample candidate points from that polygon only, so that the acceptance probability is much higher. A sketch of this idea is given below.
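For illustration only (the question does not require code): a minimal sketch of this idea, under the assumption that map("world", country, plot = FALSE)$range returns the bounding box as c(lon_min, lon_max, lat_min, lat_max). Sub-region names such as "USA:Alaska" would still need the same string cleanup used above, and countries crossing the date line would need special care.

# Sketch (assumptions noted above): propose points only inside the country's bounding box.
# To stay uniform on the sphere within a latitude band, sin(latitude) must be uniform.
rej.samp_bbox <- function(nsmp, country_name){
  bb <- maps::map("world", country_name, plot = FALSE)$range   # c(lon_min, lon_max, lat_min, lat_max)
  lon <- lat <- numeric(0)
  while (length(lon) < nsmp){
    cand_lon <- runif(nsmp, bb[1], bb[2])
    cand_lat <- asin(runif(nsmp, sin(bb[3]*pi/180), sin(bb[4]*pi/180))) * 180/pi
    keep <- which(map.where("world", cand_lon, cand_lat) == country_name)
    lon <- c(lon, cand_lon[keep]); lat <- c(lat, cand_lat[keep])
  }
  return(data.frame(longitude = lon[1:nsmp], latitude = lat[1:nsmp]))
}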

1.d. (8pt)

A statistician proposes to you two different methods for sampling points on the surface of the sphere of radius \(r\):

Sample from a uniform distribution \(\phi \sim U[0,2\pi]\) and \(\theta \sim U[0,\pi]\) independently, to get a point in spherical coordinates \((r,\phi,\theta)\).

This point can be converted to Cartesian coordinates to get \((x,y,z)\).

Sample from a Normal distribution \(x,y,z \sim N(0,1)\) independently, and then normalize the resulting vector, i.e. set:

\[ (x, y, z) \leftarrow \frac{r}{\sqrt{x^2+y^2+z^2}} (x, y, z) \]

For each of the two methods, determine if the method really samples points uniformly on the sphere, or whether there are certain areas of the sphere that will be more/less dense than others. You may use mathematical arguments and/or simulation results with computed statistics/plots to support your conclusion.

Note that the uniform distribution on the sphere has the property that two regions of equal areas on the sphere always have the same probability (this can be defined mathematically in details, but the formal definitions are not needed for the questions).

Recall that we know that the function runif_on_sphere does sample from the uniform distribution over the sphere, and you may compare your two methods to the output of this function.

ANS:

I will sample points with each method and plot the empirical distributions of the values x, y and z:

First, let’s plot a regular uniform sample of the sphere with runif_on_sphere:

B = 10000

# benchmark uniform sampling with runif_on_sphere
original_rand_xyz <- as.data.frame(runif_on_sphere(n=B, d=3, r=r))
colnames(original_rand_xyz) = c('x','y','z')
x=ggplot(original_rand_xyz , aes(x=x)) + geom_density(col=3) 
y=ggplot(original_rand_xyz , aes(x=y)) + geom_density(col=4)
z=ggplot(original_rand_xyz , aes(x=z)) + geom_density(col=5)


plot(density(original_rand_xyz$x), col=1, main="Benchmark (runif_on_sphere): X,Y,Z Density Plot")
lines(density(original_rand_xyz$y), col=2)
lines(density(original_rand_xyz$z), col=3)
legend("topright", legend=c("x","y","z"), col=1:3, lty=1)

ggarrange(x, y, z,
          labels = c("X", "Y", "Z"),
          ncol = 2, nrow = 2)

We can see that the distributions of x, y and z separately all look alike and are roughly flat over \([-r, r]\), which is the expected marginal behaviour of each coordinate of a point sampled uniformly on the sphere.

Now I will simulate Method 1, convert the results to spherical coordinates and plot the values of x,y and z:

#Method 1: phi ~ U[0, 2*pi] and theta ~ U[0, pi] independently, then convert to Cartesian
original_spheral_samps <- data.frame(r=r, phi = runif(B, 0, 2*pi) , theta = runif(B, 0, pi) )
for (i in 1:B) original_spheral_samps[i,] <- spherical_to_cartesian(original_spheral_samps[i,])
colnames(original_spheral_samps) <- c("x","y","z")
plot(density(original_spheral_samps$x), col=1, main="Method 1: X,Y,Z Density Plot")
lines(density(original_spheral_samps$y), col=2)
lines(density(original_spheral_samps$z), col=3)
legend("topright", legend=c("x","y","z"), col=1:3, lty=1)

The results are not at all similar to the benchmark: in particular, the density of z piles up near \(\pm r\) instead of being roughly flat, which means that Method 1 over-samples the polar regions. Hence Method 1 is not uniform on the sphere.

And now I will simulate the Method 2:

#Method 2: x, y, z ~ N(0,1) independently, then rescale the vector to length r
r <- 6371
B = 1000 
df <- data.frame(x= rnorm(B), y = rnorm(B),z = rnorm(B)) 
for(i in 1:B)  df[i,] <- df[i,]*r /(sqrt((df[i,'x']**2 + df[i,'y']**2 + df[i,'z']**2))) 
plot(density(df$x), col=1, main="Method 2: X,Y,Z Density Plot")
lines(density(df$y), col=2)
lines(density(df$z), col=3)
legend("topright", legend=c("x","y","z"), col=1:3, lty=1)

The plot is similar to the benchmark: all three marginal densities are approximately flat over \([-r, r]\), as expected for the uniform distribution on the sphere. Thus Method 2 does sample uniformly (normalizing an isotropic Gaussian vector yields a rotation-invariant, hence uniform, direction), while Method 1 does not. A quick quantitative check is added below.
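A more quantitative check (an optional sketch, using the equal-area property mentioned in the question): the spherical cap \(\{z > r/2\}\) covers exactly \(1/4\) of the sphere's area, so a truly uniform sampler should place about \(25\%\) of its points in it.

# Optional check: fraction of points in the cap {z > r/2}, whose exact area fraction is
# (r - r/2) / (2r) = 0.25. Method 1 should noticeably over-shoot this value.
cap_frac <- function(xyz) mean(xyz$z > r/2)
c(benchmark = cap_frac(original_rand_xyz),
  method1   = cap_frac(original_spheral_samps),
  method2   = cap_frac(df))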

Q2. Analysis of geographical data (36 pt)

2.a. (6pt)

Read the list of world countries by area from Wikipedia dataset file here.

Use the rvest package to read the html file and extract the data table representing countries by their area into a data-frame.

Remove rows with country name containing parenthesis ().

Convert the column representing total area to numerical values in square km. For countries where more than one value is available, take the first one. Show the top and bottom two rows of the resulting data-frame.

url <- "https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_area"
rawdf1<- read_html(url) %>% html_table() %>% nth(2) %>% as.data.frame() #RENAME DEP.AREA
# %>% filter(Country...dependency)
dep.area <- rawdf1 # REMOVE LATER
# wiki_map_names<- data.frame(
  names.wiki = c("United States", "Kingdom of Denmark", "Republic of the Congo", "Sahrawi Arab Democratic Republic", "United Kingdom", "Somaliland", "Eswatini", "The Bahamas", "The Gambia", "Abkhazia", "United States Minor Outlying Islands", "State of Palestine", "Transnistria", "South Ossetia",  "Northern Cyprus",  "Artsakh", "East Timor", "Trinidad and Tobago","DR Congo" ,"Congo", "Federated States of Micronesia", "Republic of Artsakh", "Czechia")
# ,
  names.map = c("USA", "Denmark", "Republic of Congo", "Western Sahara", "UK", "Somalia", "Swaziland", "Bahamas", "Gambia", "Georgia", "USA", "Palestine", "Moldova", "Georgia", "Cyprus",  "Armenia", "Timor-Leste", "Trinidad","Democratic Republic of the Congo", "Republic of Congo", "Micronesia", "Armenia" , "Czech Republic")
# )
  
# wiki_map_names
#List of countries in data
  map_lib_regions <- map_data("world") %>% distinct(region)
dep.area <- rawdf1
colnames(dep.area)[2:3] <- c("Country", "Area_km2" )
parenthtis <- str_which(dep.area$Country, pattern = "\\(")   # rows whose country name contains '('
dep.area = dep.area %>% select(Country,Area_km2) %>% filter(!(row_number() %in% parenthtis)) 

dep.area$Country[4] <- "Canada"   # manual fix for the Canada row, whose name in the raw table needs cleaning

# Convert Wikipedia country names to the names used by the maps package:
for(i in 1:nrow(dep.area) ){
    if(dep.area$Country[i] %in% names.wiki) dep.area$Country[i] <- names.map[which(names.wiki == dep.area$Country[i])] 
  }

# Keep only the first number in each cell: drop everything after the first space and remove commas.
# as.numeric() turns any leftover non-numeric cells into NA (the coercion warning is expected).
pat="(?<=\\s).*|(,)"
dep.area$Area_km2 <- dep.area$Area_km2  %>%   str_replace_all(pat, "")  %>% str_trim()  %>% as.numeric()
# Show the top and bottom two rows of the resulting data-frame:
dep.area %>% slice(1:2,n()-1,n()) 
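To make the number-cleaning step transparent, here is what the regex does on a typical raw cell (an illustrative, made-up cell value, not taken from the actual table):

# Illustration: keep only the first number in the cell - drop commas and everything after the first space.
"9,833,520 (3,796,742)" %>% str_replace_all("(?<=\\s).*|(,)", "") %>% str_trim() %>% as.numeric()   # -> 9833520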

2.b. (6pt)

Repeat the above analysis but this time with the list of world countries by population from here, where instead of the total area column, convert to numerical values the column representing total population.

url2="https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population"
rawdf <- read_html(url2) %>% html_table() %>% first() %>% as.data.frame()   # raw 'countries by population' table
dep.pop <- rawdf
colnames(dep.pop)[2] = "Country"

# Remove rows with parentheses in the country name:
parenthtis <- str_which(dep.pop$Country, pattern = "\\(") 
dep.pop = dep.pop %>% select(Country,Population) %>% filter(!(row_number() %in% parenthtis)) 

# Convert Wikipedia country names to the names used by the maps package:
for(i in 1:nrow(dep.pop) ){
    if(dep.pop$Country[i] %in% names.wiki) {
      dep.pop$Country[i] <- names.map[which(names.wiki == dep.pop$Country[i])] 
    }
  }

# Convert the population column to numeric (same regex as for the areas):
dep.pop$Population <- dep.pop$Population  %>%   str_replace_all(pat, "")  %>% str_trim()  %>% as.numeric() 


# Show the top and bottom two rows of the resulting data-frame
# (starting from row 2, since the first row of this table is the 'World' total):
dep.pop %>%  slice(2:3,n()-1,n()) 

2.c. (6pt)

Merge the two data-frames into a new one (denoted the wikipedia data-frame) and add a new column called Population_Density, that lists for each country the number of people per square km.

Show a world-map (you can use the package rworldmap), where each country is colored by its population density.

Display the three most dense and three least dense countries in the world in a table.

wikipedia <- as.data.frame(inner_join(x=dep.area,y=dep.pop,"Country")) %>% 
  group_by(Country) %>%
  summarise(Area_km2= max(Area_km2), Population = max(Population)) %>%
  mutate(Population = coalesce(Population, 0)) %>% mutate(Population_Density = (Population)/ (Area_km2)) 

wikipedia_map <- wikipedia  %>%   mutate(Population_Density.log = log(Population_Density)) %>%
  mutate(Population_Density.log10 = log10(Population_Density))

fr <- joinCountryData2Map(dF = wikipedia_map %>% dplyr::select(Country, Population_Density ,Population_Density.log,Population_Density.log10) %>% filter(Population_Density.log>0) , joinCode = "NAME", nameJoinColumn = "Country",verbose=F) 
## 196 codes from your data successfully matched countries in the map
## 5 codes from your data failed to match with a country code in the map
## 47 codes from the map weren't represented in your data
mapCountryData(mapToPlot = fr, nameColumnToPlot = "Population_Density", catMethod = "fixedWidth", oceanCol = "steelblue1", missingCountryCol = "white", mapTitle = "Population Density", aspect = "variable") # Plot Worldmap
## Warning in plot.window(xlim = xlim, ylim = ylim, asp = aspect): NAs introduced
## by coercion

mapCountryData(mapToPlot = fr, nameColumnToPlot = "Population_Density.log", catMethod = "fixedWidth", oceanCol = "steelblue1", missingCountryCol = "white", mapTitle = "Population Density (log scale)", aspect = "variable") # Plot Worldmap
## Warning in plot.window(xlim = xlim, ylim = ylim, asp = aspect): NAs introduced
## by coercion

mapCountryData(mapToPlot = fr, nameColumnToPlot = "Population_Density.log10", catMethod = "fixedWidth", oceanCol = "steelblue1", missingCountryCol = "white", mapTitle = "Population Density (log10 scale)", aspect = "variable") # Plot Worldmap
## Warning in plot.window(xlim = xlim, ylim = ylim, asp = aspect): NAs introduced
## by coercion

# Most dense countries
wikipedia %>%  select(Country, Population_Density) %>%arrange(desc(Population_Density)) %>% head(3) 
# Least dense countries
wikipedia %>%  select(Country, Population_Density) %>% arrange(desc(Population_Density)) %>% tail(3)

2.d. (8pt)

Suppose that we choose two random individuals uniformly at random from the world’s population, and suppose that each individual is located uniformly at random in her country - what will be the average geodesic distance between the two?

Simulate \(1000\) pairs of people: first, sample their nationalities, and then sample their locations conditional on their nationality.

To speed-up the simulation, you may ignore countries with area smaller than \(1000\) square km. Use the lists of names to match between the names from wikipedia and the names returned by the map.where function. Additional parsing might be needed for the output of this function in a few cases.

Use the simulation to estimate the average geodesic distance. How does it compare to the average geodesic distance between two random points on earth?
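Before the code, it helps to write down what the simulation estimates (a clarifying formula in terms of quantities from class; each person's country is drawn with probability proportional to its population, independently of the other person):

\[ E\left[g(P_1,P_2)\right] = \sum_{c_1}\sum_{c_2} \pi_{c_1}\,\pi_{c_2}\; E\left[\,g(P_1,P_2)\mid P_1\in c_1,\ P_2\in c_2\,\right], \qquad \pi_c = \frac{\text{Population}_c}{\sum_{c'}\text{Population}_{c'}}. \]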

# Filter out countries with area below 1000 square km (and the 'World' row):
wikipedia_sim <- wikipedia %>% select (Country,Area_km2,Population,Population_Density) %>% filter(Area_km2>=1000  ) %>% filter(Country != "World")

#Setting proportions of nationalities
props <- wikipedia_sim$Population / sum(wikipedia_sim$Population)
names(props) <- wikipedia_sim$Country

# sample 2 random person nationalities:
B=1000
simulation_people1 <- data.frame(
  ind=1:B,
  pers.nat1 = sample(wikipedia_sim$Country, B, prob = props, TRUE)   # nationality drawn with probability proportional to population
) %>% arrange(pers.nat1)

simulation_people2 <- data.frame( ind=1:B, pers.nat2 = sample(wikipedia_sim$Country, B, prob = props, TRUE)) %>%
  arrange(pers.nat2)

### Sample locations for the first group of people, country by country:
coords = data.frame( longitude= 0, latitude=0 )
for(country in (unique(simulation_people1$pers.nat1))){
  coords = rbind( coords, rej.samp_n_ps_of_country( sum(simulation_people1$pers.nat1==country), country ) )
}
coords = coords[2:(B+1),]   # drop the dummy first row
simulation_people1 = cbind (simulation_people1,coords ) %>% arrange(ind)

### Sample locations for the second group of people:
coords = data.frame(longitude= 0, latitude=0)
for(country in (unique(simulation_people2$pers.nat2))){
  coords = rbind(coords, rej.samp_n_ps_of_country( sum(simulation_people2$pers.nat2==country), country ) )
}

coords = coords[2:(B+1),]   # drop the dummy first row
simulation_people2 = cbind(simulation_people2 , coords ) %>% arrange(ind)

# Geodesic distance between each pair of sampled people:
rand_people_distances <- numeric(length = B)
for (i in 1:B){
  sp_p1 <-  geographic2spheric(simulation_people1[i,3:4])
  sp_p2 <-  geographic2spheric(simulation_people2[i,3:4])
  rand_people_distances[i] <-  geodesic_dist(sp_p1,sp_p2,r=6371)
}
avg_ran_pop_dis <- mean(rand_people_distances)
avg_ran_pop_dis
## [1] 6917.926
avg_geo_dist
## [1] 9966.397

ANS: The average geodesic distance between two randomly sampled people (6917.926 km) is smaller by roughly \(3000\) km than the average distance between two uniformly random points on the sphere (9966.397 km). This makes sense because the human population is clustered geographically, concentrated in a relatively small number of densely populated regions, rather than spread uniformly over the globe.

2.e. (10pt)

Suppose that we didn’t know the countries’ areas and want to estimate the area of a country by sampling.

Write a function that receives as input a vector of country names, a number of samples \(n\), and outputs estimates for the areas of all countries in square km, based on drawing \(n\) random points on earth and recording the relative frequency of points falling within each country.

First I calculate the world's surface area \(S\) with the known formula: \[S = 4\pi r^2 \quad \rightarrow \quad S_{r=6371} = 4\pi \cdot 6371^2 = 510{,}064{,}471.91 \ km^2\]

r=6371
# world's area:
S <- 4*pi*r^2

estimate_areas <- function(countries, n){
  cntrs <- data.frame(Country = countries)
  # sample n uniform points on the sphere and convert them to geographic coordinates:
  samps <- as.data.frame(cartesian2geographic( runif_on_spher2df(n) )[2:3])
  # match coordinates to countries, strip sub-region suffixes, and compute relative frequencies
  # (keeping NA's, i.e. points that fall on no country - mostly oceans):
  samp.cs <- prop.table(table(str_replace( str_replace( map.where("world",samps) ,pattern="(?<=:).*","") ,":" ,""),useNA = "ifany" ))
  cntrs.apeared.names <- names(samp.cs[which(names(samp.cs) %in% countries)])
  # multiply each country's relative frequency by the world's surface area S to estimate its area in km^2:
  Estimated_Areas <- data.frame(Country = cntrs.apeared.names, Estimated_Area = as.numeric(samp.cs[which(names(samp.cs) %in% countries)] * S))
  results <- left_join(cntrs, Estimated_Areas, by = "Country" ) 
  return( results)
}
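For clarity, the function implements the simple Monte-Carlo estimator below (the standard error is stated here as an added note, not computed in the code):

\[ \hat{A}_c \;=\; S\cdot\hat{p}_c \;=\; S\cdot\frac{\#\{\text{points falling in country } c\}}{n}, \qquad SE(\hat{A}_c)\;\approx\; S\sqrt{\frac{\hat{p}_c(1-\hat{p}_c)}{n}}. \]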

Run the function with \(n=10000\) samples and with the list of countries from the wikipedia data-frame. Report separately the top 10 countries with the largest estimated areas and the top 10 countries with the largest true areas in tables. Do you see an agreement between the two top-10 lists?

# Rebuild the wikipedia data-frame with a full join, so that countries missing from one of
# the tables are kept, and drop the 'World' row:
wikipedia <- as.data.frame(full_join(x=dep.area,y=dep.pop,"Country")) %>% 
  group_by(Country) %>%
  summarise(Area_km2= max(Area_km2), Population = max(Population)) %>%
  mutate(Population = coalesce(Population, 0)) %>% mutate(Population_Density = (Population)/ (Area_km2)) %>% filter(Country != "World")

#Applying the function on wikipedia df, n=10000
estimated_arieas <- estimate_areas(countries = wikipedia$Country , n=10000)
# top 10 countries with the largest estimated areas
estimated_arieas %>% arrange(desc(Estimated_Area)) %>% head(10)
# top 10 countries with the largest true areas
as.data.frame(wikipedia) %>%  arrange(desc(Area_km2)) %>% select(Country, Area_km2) %>% head(10) 

ANS: There is clear agreement between the two lists: most, if not all, of the top-ten countries by estimated area also appear in the true top ten, although the exact ordering differs somewhat. Around \(71\%\) of the sampled points matched no country (NA values), which makes sense because oceans cover roughly that fraction of the earth's surface.

Next, merge the estimated areas results with the data-frame and plot the true area (x-axis) vs. the estimated area (y-axis) for all countries.

# Rebuild the wikipedia data-frame (as above) and merge in the estimated areas:
wikipedia <- as.data.frame(full_join(x=dep.area,y=dep.pop,"Country")) %>% 
  group_by(Country) %>%
  summarise(Area_km2= max(Area_km2), Population = max(Population)) %>%
  mutate(Population = coalesce(Population, 0)) %>% mutate(Population_Density = (Population)/ (Area_km2)) %>% filter(Country != "World")

wikipedia <- full_join(wikipedia, estimated_arieas, by = "Country")
# True area (x) vs. estimated area (y), on log-log axes so that small countries remain visible:
plot(wikipedia$Area_km2 ,wikipedia$Estimated_Area, log="xy",
     xlab="True Area (km^2)", ylab="Estimated Area (km^2)", main="Estimated vs. True Country Areas")
ml <- lm(Estimated_Area ~ Area_km2, data=wikipedia)   # regression of estimated area on true area
abline(ml)   # note: the fitted line is in the original (un-logged) units, so it is only a rough guide on log axes

# Extracting R-squared parameter from summary 
R.squared <- summary(ml)$r.squared 
# R.squared

What is the \(R^2\) between the two?

ANS: The \(R^2\) between the two is `r round(R.squared, 3)`, meaning that this proportion of the variance in the estimated areas is explained by the variance in the true areas.


Q3. Analysis and visualization of earthquake geographical data (33 pt)


a. (6pt)

Read the earthquakes dataset from USGS. Select all earthquakes in the world in 2022 of magnitude above 2.5, and save them to a .csv file (the magnitude is on the Richter scale). Load the dataset into an R data-frame, and display the five latest earthquakes and the five strongest earthquakes.

# https://raw.githubusercontent.com/AnonymousDataStudentWhoLovesR/Main/main/query.csv
usgs <-  read.csv("query.csv")

# 5 latest earthquakes (sorted explicitly by time, most recent first)
usgs %>% select(time,mag,place,depth) %>% arrange(desc(time)) %>% head(5)
# 5 strongest earthquakes in 2022
usgs %>% select(time,mag,place) %>% arrange(desc(mag)) %>% head(5)

b. (6pt)

Plot the earthquakes on top of the world map. That is, each earthquake should be represented by a point, with the x axis representing longitude, the y axis representing latitude, and the size of the point representing the magnitude. This scatter plot should be on top of the world map, drawn using the geom_map function from the ggmap package. The data-frame representing the map can be generated as the output of the command map_data("world").

usgs <- usgs %>% select(time,latitude,longitude, mag, place, depth)   # keep only the relevant columns
world_map <- map_data("world")

ggplot() +
  geom_map(data =world_map, map = world_map,  color = "white", fill="grey", alpha=0.7,
           aes(map_id=region, group=group)) +
  geom_point(data=usgs, aes(x=longitude, y=latitude  ,size=mag,col=desc(mag)),alpha=0.5) + scale_size_continuous(range=c(.1,4)) + coord_quickmap() + 
  labs(y="Latitude", x="Longitude", title="Earthquakes in 2022 by Magnitude", col="Magnitude", size="Magnitude")

c. (6pt)

Repeat the plot from (b.), but this time make separate plots for different earthquakes, where they are grouped together by magnitude. That is, there should be a separate plot for each of the following magnitude ranges: between \(2\) and \(3\), between \(3\) and \(4\), ..., between \(7\) and \(8\).

for (i in 1:6) {
  nam <- paste("g", i, sep = "")
  assign(nam,
    ggplot() + 
      geom_map( data = world_map, map = world_map, color = "white", fill="grey", alpha=0.7, 
                aes(map_id=region, group=group)) + 
      geom_point(data= usgs %>% filter( mag >= (i+1) & mag < (i+2) ),
                 aes(x=longitude, y=latitude , size = mag,col=desc(mag)) ,alpha=0.5) + 
      scale_size_continuous(range=c(.5,4)) + coord_quickmap() + 
      labs( y="Latitude", x="Longitude", col ="Magnitude", size = "Magnitude",
            title= paste("Earthquakes in 2022 of Magnitude in Range ",i+1,"-", i+2)) 
  ) # end of assign
}
ggarrange(g1,g2,g3,g4,g5,g6 ,ncol=2,nrow =3)

What can you conclude about the locations of strong/weak earthquakes?

d. (6pt)

Parse the column representing the time to extract (i) the number of days since the beginning of the year, and (ii) the time of day, in units of hours since midnight, as two new separate numeric columns.

That is, for example the date 2022-03-10T17:40:28.123Z will be converted to \(31+28+10=69\) days since the beginning of the year (the numbers of days in the first months are January:31, February:28, March:31, April:30, May:31), and to \(17+40/60+28/3600=17.6744\) hours since midnight (we ignore fractions of seconds here).

Plot pairwise correlations of the magnitude (mag), depth, day-of-year and time-of-day. (The depth column represents the depth below the ground in km of the center of the earthquake). For which pairs of variables there is a significant correlation?

You may use the ggpairs command from the GGally package.

ts <- ymd_hms(usgs$time)   # parse the time stamps once (lubridate)
usgs <- usgs %>% 
  mutate( day_of_year = as.numeric(as.Date(ts) - as.Date("2022-01-01")) + 1) %>%   # +1 so that January 1st counts as day 1, matching the example in the question
  mutate( time_of_day = hour(ts) + (minute(ts)/60) + (round(second(ts),0)/3600) )  # hours since midnight, ignoring fractions of seconds

features <- c("mag","depth","day_of_year", "time_of_day") # the four requested columns
fav <- "pinkish" # just a constant string - a trick to get a non-black default colour in ggpairs
ggpairs(usgs, columns = features, mapping = ggplot2::aes(color = fav) )

cor.depth.mag <- cor(usgs$mag, usgs$depth)
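As a spot check on the parsing (using the example time stamp from the question; illustrative only):

# Spot check with the example from the question: 2022-03-10T17:40:28.123Z
ex <- ymd_hms("2022-03-10T17:40:28.123Z")
as.numeric(as.Date(ex) - as.Date("2022-01-01")) + 1       # expect 69 days
hour(ex) + minute(ex)/60 + round(second(ex),0)/3600       # expect ~17.674 hours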

According to the plot, we see a weak positive correlation of 0.159 between depth and magnitude; other than that, I see no significant relationship between any of the pairs.

e. (9pt)

We want to test if there are times in the day in which earthquakes are more or less commons. Let \(t_i\) be the time of each earthquake \(i\) in the day (between \(00:00:00\) and \(23:59:59\)), in units of hours, calculated in (d.).

Test the null hypothesis \(H_0: t_i \sim U[0, 24]\) against the complex alternative \(H_1: t_i \nsim U[0, 24]\).

We test using the Pearson chi-square statistic:

\[ S = \sum_{i=1}^{24} \frac{(o_{i}-e_{i})^2}{e_{i}}\]

where \(o_{i}\) is the number of earthquakes at hour \(i\) (e.g. between \(00:00:00\) and \(00:59:59\) for \(i=1\), between \(01:00:00\) and \(01:59:59\) for \(i=2\) etc. You may ignore the minutes and seconds information and extract just the hour for this sub-question), and \(e_i\) is the expected value of \(\frac{n}{24}\), with \(n\) being the total number of earthquakes. Report the test statistic and the p-value, using the \(\chi^2(24-1)\) approximation for the null distribution. Would you reject the null hypothesis at significance level \(\alpha=0.01\)?

ANS: We want to test whether the (continuous) time-of-day variable is uniform over the 24 hours. First, a simple histogram can give us some initial intuition about the results:

# Deriving observed counts grouped by hour with hour(ymd_hms()):
Hours <- hour(ymd_hms(usgs$time)) 
observed  = data.frame(Hours)   # hours take integer values 0,...,23
observed %>% ggplot(aes(Hours)) +  geom_bar(aes(y=..count..)) +
  labs(title="Empirical Distribution of Earthquake Observations by Hour",y="Frequency")

The histogram already suggests that the occurrences are not spread perfectly evenly across hours. A classic \(\chi^2\) goodness-of-fit test is now carried out as follows:

observed <- usgs %>% group_by(hour(ymd_hms(usgs$time))) %>% summarise(freq = n())   # observed counts o_i per hour (24 rows)
n <- nrow(usgs) # total number of earthquakes
e_i <- n/24     # expected count per hour under H0

Statistic <- 0  # Pearson chi-square statistic S, accumulated in a loop
for (i in 1:24) Statistic = Statistic + (((observed$freq[i]-e_i)^2)/e_i) # applying the formula

\[\widehat{Pval}(s) = P_{H_0} (S > s) = 1 - F_{\chi^2(23)}(s)\] Thus:

# chi-square with 23 degrees of freedom; p-value rounded to 3 digits
p_val <- round(1 - pchisq(Statistic,df = 24-1),3) 
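As a cross-check (an optional sketch): R's built-in goodness-of-fit test with equal cell probabilities should reproduce the same statistic and p-value, up to rounding:

# Optional cross-check: chisq.test with equal expected probabilities for the 24 hours.
chisq.test(observed$freq, p = rep(1/24, 24))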

We have p-value \(= 0.002 < \alpha = 0.01\), with test statistic \(S = 47.535\); thus we reject the null hypothesis \(H_0: \ t_i \sim U[0, 24]\) at the \(0.01\) significance level.

Thank You For The Interesting Course.