library(tidyverse)

## -- Attaching packages --------------------------------------------------------------------------------- tidyverse 1.2.1 --

## v ggplot2 3.1.0     v purrr   0.2.5
## v tibble  1.4.2     v dplyr   0.7.7
## v tidyr   0.8.2     v stringr 1.3.1
## v readr   1.1.1     v forcats 0.3.0

## -- Conflicts ------------------------------------------------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

library(threejs)

## Loading required package: igraph

## 
## Attaching package: 'igraph'

## The following objects are masked from 'package:dplyr':
## 
##     as_data_frame, groups, union

## The following objects are masked from 'package:purrr':
## 
##     compose, simplify

## The following object is masked from 'package:tidyr':
## 
##     crossing

## The following object is masked from 'package:tibble':
## 
##     as_data_frame

## The following objects are masked from 'package:stats':
## 
##     decompose, spectrum

## The following object is masked from 'package:base':
## 
##     union

library(jsonlite)

## 
## Attaching package: 'jsonlite'

## The following object is masked from 'package:purrr':
## 
##     flatten

library(AMR)
library(ggmap)

Part 1)

a.) This is a sequential palette, best used with ordered data. Values that range from low to high are conveyed by the intuitive progression from light (low) to dark (high) colors. A sequential palette communicates subtle variation across the range of the data, so viewers can notice and appreciate its nuances.

b.) This is a diverging palette, best used when you want to showcase middle range values as well as the values at either end of the data’s range. Unlike sequential palettes, diverging palettes mark the middle of the spectrum by fading the colors into a faint “critical break” and reserve darker colors for the two extreme ends.

c.) This is a qualitative palette. Unlike the preceding palettes, its colors do not represent a range from low to high values, so it is best used for categorical or nominal data. Because the colors imply no order, they can be used to create aesthetically pleasing graphs without suggesting a ranking that could confuse the viewer.
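To make the three palette types concrete, here is a minimal sketch in base R. It assumes R 3.6 or later, where `hcl.colors()` is available, and the palette names ("Blues", "Blue-Red", "Dark 2") are picked for illustration only; they are not necessarily the palettes shown in the question.

```r
# One example of each palette type, five colors each (illustrative choices)
sequential  <- hcl.colors(5, palette = "Blues")     # light-to-dark, for ordered data
diverging   <- hcl.colors(5, palette = "Blue-Red")  # two hues meeting at a pale midpoint
qualitative <- hcl.colors(5, palette = "Dark 2")    # distinct hues, no implied order

# Each call returns a character vector of hex color codes
print(sequential)
print(diverging)
print(qualitative)
```

Passing any of these vectors to `barplot(rep(1, 5), col = sequential)` gives a quick visual check of how the colors progress.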

Part 2)

First we load the data into R.

earthquake <- read.csv("https://earthquake.usgs.gov/earthquakes/feed/v1.0/summary/all_month.csv")

Then we can make a globe map with threejs.

globejs(lat = earthquake$latitude, 
        lon = earthquake$longitude)

earthquakes1 <- earthquake %>% select(latitude, longitude, depth, mag)

names(earthquakes1) <- c("lat", "long", "depth", "mag")

We can then use the leaflet package to make a bubble map.

library(leaflet)

# Create a color palette with handmade bins.
mybins <- seq(1, 10, by = 1)
mypalette <- colorBin(palette = "OrRd", domain = earthquakes1$mag, na.color = "transparent", bins = mybins)


# Final Map
leaflet(earthquakes1) %>%
  addTiles()  %>%
  addProviderTiles("Esri.WorldImagery") %>%
  addCircleMarkers(~long, ~lat,
    fillColor = ~mypalette(mag), fillOpacity = 0.7, color="white", radius=8, stroke=FALSE,
    labelOptions = labelOptions( style = list("font-weight" = "normal", padding = "3px 8px"), textsize = "13px", direction = "auto")
  ) %>%
  addLegend( pal=mypalette, values=~mag, opacity=0.9, title = "Magnitude", position = "bottomright" )

## Warning in mypalette(mag): Some values were outside the color scale and
## will be treated as NA
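The warning above most likely comes from magnitudes that fall below the lowest handmade bin edge of 1, so they get no color. The sketch below uses base R `cut()` on made-up magnitudes (the values in `mag` are hypothetical, not taken from the USGS feed) to show how fixed bins drop values and how bins derived from the observed range avoid that. It illustrates the idea only and is not a change to the map code above.

```r
# Toy magnitudes standing in for earthquakes1$mag (hypothetical values)
mag <- c(-0.4, 0.3, 0.95, 1.2, 2.7, 4.5, 6.1)

# Fixed bins from 1 to 10: anything below 1 falls outside every bin
fixed_bins <- seq(1, 10, by = 1)
n_na_fixed <- sum(is.na(cut(mag, breaks = fixed_bins, include.lowest = TRUE)))

# Bins derived from the observed range cover every value
data_bins <- seq(floor(min(mag, na.rm = TRUE)),
                 ceiling(max(mag, na.rm = TRUE)),
                 by = 1)
n_na_data <- sum(is.na(cut(mag, breaks = data_bins, include.lowest = TRUE)))

print(c(fixed = n_na_fixed, from_data = n_na_data))
```

A vector built like `data_bins` could be passed as the `bins` argument of `colorBin()` in place of `mybins`, assuming the real magnitudes contain at least one non-missing value.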

Part 3)

health <- read.csv("https://raw.githubusercontent.com/gene493/datadump/master/Restaurant_Scores_-_LIVES_Standard.csv")

Looking at risk levels according to the data

levels(health$risk_category)

## [1] ""              "High Risk"     "Low Risk"      "Moderate Risk"

The first thing I noticed is that there are 4 levels, one of which is named “”, which looks like missing information.

To make this less confusing, I decided to give it a name rather than leaving it blank.

levels(health$risk_category) <- c('missing', 'High Risk', "Low Risk", "Medium Risk")
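Assigning a whole vector to `levels()` depends on the levels being in exactly the order printed above. A slightly more defensive sketch, shown here on a toy factor (the vector `risk` is made up for illustration), renames each level by matching its current name. Note also that from R 4.0 onward `read.csv()` returns character columns unless `stringsAsFactors = TRUE` is set, so the column would need converting with `factor()` first.

```r
# Toy factor standing in for health$risk_category (hypothetical values)
risk <- factor(c("", "High Risk", "Low Risk", "Moderate Risk", ""))

# Rename levels by matching their current names rather than their position
levels(risk)[levels(risk) == ""] <- "missing"
levels(risk)[levels(risk) == "Moderate Risk"] <- "Medium Risk"

print(levels(risk))
```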

Next I wanted to observe the frequency of each risk category

freqtable <- health %>% freq(risk_category)
freqtable
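If the `freq()` helper is not available, a similar frequency table can be sketched with base R alone. The factor below is toy data, not the inspection data, and the layout will differ from what `freq()` prints.

```r
# Toy risk categories (hypothetical values)
risk <- factor(c("Low Risk", "Low Risk", "High Risk",
                 "missing", "Low Risk", "Medium Risk"))

# Counts per category and their share of all rows, in percent
counts <- table(risk)
shares <- round(100 * prop.table(counts), 1)

print(cbind(n = counts, percent = shares))
```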

Next, I wanted to observe the missing values to see if they were actually scored but simply missing a classification after the fact.

health %>% select(business_name, risk_category, inspection_score) %>% 
  filter(risk_category == "missing", !is.na(inspection_score))

I noticed that most of the rows with non-missing scores score very high. I then counted how many of them actually qualify as Low Risk.

health %>% select(business_name, risk_category, inspection_score) %>% 
  filter(risk_category == "missing", !is.na(inspection_score), inspection_score > 89) %>% count()

Surprisingly, almost all of the rows with “missing” risk categories actually fall into “Low Risk”.
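The "almost all" can be made precise by turning the count into a proportion. Here is a minimal sketch on a made-up data frame (`scores` and its values are hypothetical), using the same cutoff of scores above 89:

```r
# Toy data standing in for the health data (hypothetical values)
scores <- data.frame(
  risk_category    = c("missing", "missing", "missing", "Low Risk", "missing"),
  inspection_score = c(100, 96, 72, 90, NA)
)

# Keep rows with a missing category but a recorded score
scored_missing <- subset(scores,
                         risk_category == "missing" & !is.na(inspection_score))

# The mean of a logical vector is the share of TRUE values
share_high <- mean(scored_missing$inspection_score > 89)
print(share_high)
```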

Graphing the data

health %>% ggplot(aes(x=risk_category, fill=risk_category)) +
  labs(x="Risk Level") +
  geom_bar()

It is important to note that most of the missing data actually falls into the “Low Risk” category.

Graphing the data without the “missing” category

I filter out all rows with “missing” and simply make the same graph.

graph1 <- health %>% select(business_name, risk_category, inspection_score) %>% 
  filter(risk_category != "missing")

graph1 %>% ggplot(aes(x=risk_category, fill=risk_category)) +
  labs(x="Risk Level") +
  geom_bar()

This second plot is definitely less accurate, as it changes what the overall data actually shows.

If we leave out the missing data, which is mostly “Low Risk”, the average score of the data as a whole definitely drops.
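The effect on the average can be checked directly by comparing the mean score with and without the “missing” rows. The sketch below uses a made-up data frame (`toy` and its scores are hypothetical) just to show the comparison:

```r
# Toy data: "missing" rows carry high scores (hypothetical values)
toy <- data.frame(
  risk_category    = c("missing", "missing", "Low Risk", "Medium Risk", "High Risk"),
  inspection_score = c(100, 100, 92, 88, 70)
)

# Mean over all rows versus mean after dropping the "missing" category
mean_all     <- mean(toy$inspection_score, na.rm = TRUE)
mean_without <- mean(toy$inspection_score[toy$risk_category != "missing"],
                     na.rm = TRUE)

print(c(all = mean_all, without_missing = mean_without))
```

On the real data, the same two lines applied to `health` would show how large the drop actually is.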

Box plot

health %>% ggplot(aes(x = risk_category, y = inspection_score, fill = risk_category)) +
  geom_boxplot()

## Warning: Removed 13725 rows containing non-finite values (stat_boxplot).

Looking at a boxplot of the data, we find that most of the values are actually greater than 80. Within the high risk category, observations below 75 seem to fall only within the first quartile.

More surprisingly, we find observations marked “Low Risk” and “Medium Risk” whose scores fall below the ranges for those categories, with 90 and above being low risk and 86-90 being medium risk.

Not surprisingly, as we discovered earlier, almost all of the observations with a missing category scored 100, leaving only a few lower scores as outliers.

Using ggmap to show restaurant locations

rpubs code. The restaurant locations are all in San Francisco, so this should fit just fine.

First we need to create a bounding box that covers the area we want to view.

m <- get_stamenmap(bbox = c(left = -122.5164, bottom = 37.7066, right = -122.3554, top = 37.8103), maptype = "watercolor", zoom = 13)

## Map from URL : http://tile.stamen.com/watercolor/13/1308/3165.jpg

## Map from URL : http://tile.stamen.com/watercolor/13/1309/3165.jpg

## Map from URL : http://tile.stamen.com/watercolor/13/1310/3165.jpg

## Map from URL : http://tile.stamen.com/watercolor/13/1311/3165.jpg

## Map from URL : http://tile.stamen.com/watercolor/13/1308/3166.jpg

## Map from URL : http://tile.stamen.com/watercolor/13/1309/3166.jpg

## Map from URL : http://tile.stamen.com/watercolor/13/1310/3166.jpg

## Map from URL : http://tile.stamen.com/watercolor/13/1311/3166.jpg

## Map from URL : http://tile.stamen.com/watercolor/13/1308/3167.jpg

## Map from URL : http://tile.stamen.com/watercolor/13/1309/3167.jpg

## Map from URL : http://tile.stamen.com/watercolor/13/1310/3167.jpg

## Map from URL : http://tile.stamen.com/watercolor/13/1311/3167.jpg

## Map from URL : http://tile.stamen.com/watercolor/13/1308/3168.jpg

## Map from URL : http://tile.stamen.com/watercolor/13/1309/3168.jpg

## Map from URL : http://tile.stamen.com/watercolor/13/1310/3168.jpg

## Map from URL : http://tile.stamen.com/watercolor/13/1311/3168.jpg

Then I want to remove all NA values for latitude and longitude from our dataset.

health_map <- health %>% filter(!is.na(business_latitude), !is.na(business_longitude))
health_map

With some edits we can then get this map.

ggmap(m, base_layer = ggplot(aes(x = business_longitude, y = business_latitude), data = health_map)) + geom_point(aes(colour = risk_category))

## Warning: Removed 36 rows containing missing values (geom_point).
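Since the NA coordinates were already filtered out, the rows removed in this warning are probably points whose coordinates lie outside the map's bounding box. The sketch below counts such points on a few made-up coordinates (the data frame `pts` is hypothetical), reusing the bounding box values from the `get_stamenmap()` call:

```r
# Bounding box used for the map above
bbox <- c(left = -122.5164, bottom = 37.7066, right = -122.3554, top = 37.8103)

# Toy coordinates: one inside the box, three outside (hypothetical values)
pts <- data.frame(
  business_longitude = c(-122.42, -122.60, -122.40, 0),
  business_latitude  = c(37.77, 37.75, 37.90, 0)
)

# TRUE for points that fall inside the box on both axes
inside <- with(pts,
  business_longitude >= bbox[["left"]]   & business_longitude <= bbox[["right"]] &
  business_latitude  >= bbox[["bottom"]] & business_latitude  <= bbox[["top"]]
)

n_outside <- sum(!inside)
print(n_outside)
```

Running the same test on `health_map` would show whether the number of outside points matches the count in the warning.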

This was kind of a mess. There are too many points to see much of anything, so let’s dial it back a little and reduce the data.

I want to pull a random sample of 300 and then try again.

random <- health_map[sample(nrow(health_map), 300), ]

ggmap(m, base_layer = ggplot(aes(x = business_longitude, y = business_latitude), data = random)) + geom_point(aes(colour = risk_category))
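Because `sample()` draws different rows on every run, the sampled map changes each time the notebook is knit. A minimal sketch of making the draw reproducible with `set.seed()` follows; the seed value 42 and the toy data frame `toy_map` are arbitrary choices for illustration.

```r
# Toy data frame standing in for health_map (hypothetical)
toy_map <- data.frame(id = 1:1000)

# Fixing the seed before sampling makes the draw repeatable
set.seed(42)
first_draw <- toy_map[sample(nrow(toy_map), 300), , drop = FALSE]

set.seed(42)
second_draw <- toy_map[sample(nrow(toy_map), 300), , drop = FALSE]

# Both draws select exactly the same rows
print(identical(first_draw, second_draw))
```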

We can also just subset the data and only look at the high risk areas.

health_map_high <- subset(health_map, risk_category=="High Risk")

ggmap(m, base_layer = ggplot(aes(x = business_longitude, y = business_latitude), data = health_map_high))  + geom_point(color="red")

## Warning: Removed 3 rows containing missing values (geom_point).
