library(tidyverse)
## -- Attaching packages --------------------------------------------------------------------------------- tidyverse 1.2.1 --
## v ggplot2 3.1.0 v purrr 0.2.5
## v tibble 1.4.2 v dplyr 0.7.7
## v tidyr 0.8.2 v stringr 1.3.1
## v readr 1.1.1 v forcats 0.3.0
## -- Conflicts ------------------------------------------------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(threejs)
## Loading required package: igraph
##
## Attaching package: 'igraph'
## The following objects are masked from 'package:dplyr':
##
## as_data_frame, groups, union
## The following objects are masked from 'package:purrr':
##
## compose, simplify
## The following object is masked from 'package:tidyr':
##
## crossing
## The following object is masked from 'package:tibble':
##
## as_data_frame
## The following objects are masked from 'package:stats':
##
## decompose, spectrum
## The following object is masked from 'package:base':
##
## union
library(jsonlite)
##
## Attaching package: 'jsonlite'
## The following object is masked from 'package:purrr':
##
## flatten
library(AMR)
library(ggmap)
a.) This palette is a sequential palette that is best used with ordered data. Data that ranges in values from low to high values is emphasized by the intuitive use of light (low) colors to dark (high) colors. By using a sequential palette, one is able to communicate the subtle variation in the range of the values in the data. This allows viewers to notice and appreciate the nuances in the data.
b.) This palette is a diverging palette that is best used when you have middle range values you want to showcase, as well as values at either end of the data’s range. Unlike sequential palettes, diverging palettes emphasize middle range values by fading the colors into a faint “critical break” in the middle of the spectrum and save darker colors for the extreme end of their spectrums.
c.) This is a qualitative palette, and unlike the preceding palettes, the colors of this palette do not represent some kind of range from low to high values. Instead, the qualitative palette is best used for categorical or nominal data. Because the colors do not represent a change from low to high, the colors can be used to create aesthetically pleasing graphs without the implication that non-intuitive uses of color can confuse the viewer of one’s graph.
earthquake <- read.csv("https://earthquake.usgs.gov/earthquakes/feed/v1.0/summary/all_month.csv")
globejs(lat = earthquake$latitude,
lon = earthquake$longitude)
earthquakes1 <- earthquake %>% select(latitude, longitude, depth, mag)
names(earthquakes1) <- c("lat", "long", "depth", "mag")
library(leaflet)
# Create a color palette with handmade bins.
mybins=seq(1, 10, by=1)
mypalette = colorBin(palette="OrRd", domain=earthquakes1$mag, na.color="transparent", bins=mybins)
# Final Map
leaflet(earthquakes1) %>%
addTiles() %>%
addProviderTiles("Esri.WorldImagery") %>%
addCircleMarkers(~long, ~lat,
fillColor = ~mypalette(mag), fillOpacity = 0.7, color="white", radius=8, stroke=FALSE,
labelOptions = labelOptions( style = list("font-weight" = "normal", padding = "3px 8px"), textsize = "13px", direction = "auto")
) %>%
addLegend( pal=mypalette, values=~mag, opacity=0.9, title = "Magnitude", position = "bottomright" )
## Warning in mypalette(mag): Some values were outside the color scale and
## will be treated as NA
health <- read.csv("https://raw.githubusercontent.com/gene493/datadump/master/Restaurant_Scores_-_LIVES_Standard.csv
")
levels(health$risk_category)
## [1] "" "High Risk" "Low Risk" "Moderate Risk"
First thing I noticed is that there is there are 4 factors, with one named “”, which looks like just missing information.
To make this less confusing, I decided to give it a name rather than leaving it blank.
levels(health$risk_category) <- c('missing', 'High Risk', "Low Risk", "Medium Risk")
freqtable <- health %>% freq(risk_category)
freqtable
Next, I wanted to observe the missing values to see if they were actually scored but simply missing a classification after the fact.
health %>% select(business_name, risk_category, inspection_score) %>%
filter(risk_category == "missing", !is.na(inspection_score))
I noticed that most of the values with non missing scores score very high, I then summarised the data to see how many actually qualified as Low Risk
health %>% select(business_name, risk_category, inspection_score) %>%
filter(risk_category == "missing", !is.na(inspection_score), inspection_score > 89) %>% count()
Surprisingly, almost all of the rows with “missing” risk categories actually fall into Low risk.
health %>% ggplot(aes(x=risk_category, fill=risk_category)) +
labs(x="Risk Level") +
geom_bar()
It is important to note that most of the missing data actually falls into the “Low Risk” category
I filter out all rows with “missing” and simply make the same graph
graph1 <- health %>% select(business_name, risk_category, inspection_score) %>%
filter(!risk_category == "missing")
graph1 %>% ggplot(aes(x=risk_category, fill=risk_category)) +
labs(x="Risk Level") +
geom_bar()
This second plot definitely less accurate as it seems to change what the overall data actually claims.
If we do not take into account the missing data which is mostly “Low Risk”, it definitely drops the average score of the entirety of the data.
health %>% ggplot(aes(x=risk_category,y=inspection_score, fill = risk_category)) +
geom_boxplot()
## Warning: Removed 13725 rows containing non-finite values (stat_boxplot).
Looking at a boxplot of the data, we find that most of the values are actually greater than 80, and of the high risk category, most observations below 75 seem to fall within only the first quartile.
More surprising, we find there are observations marked “Low Risk” and “Medium Risk” falling under their appropriate categories, >=90 being low risk, and medium risk 86-90.
Not surprisingly, as we discovered earlier, almost all of the data we found to be missing scored 100, leaving only a few observations below as outliers.
rpubs code, The restaraunt locations are all in San Francisco so this should fit just fine.
First we need to create a box that maps the range of distance we want to view.
m <- get_stamenmap(bbox = c(left = -122.5164, bottom = 37.7066, right = -122.3554, top = 37.8103), maptype = c("watercolor"), zoom = 13)
## Map from URL : http://tile.stamen.com/watercolor/13/1308/3165.jpg
## Map from URL : http://tile.stamen.com/watercolor/13/1309/3165.jpg
## Map from URL : http://tile.stamen.com/watercolor/13/1310/3165.jpg
## Map from URL : http://tile.stamen.com/watercolor/13/1311/3165.jpg
## Map from URL : http://tile.stamen.com/watercolor/13/1308/3166.jpg
## Map from URL : http://tile.stamen.com/watercolor/13/1309/3166.jpg
## Map from URL : http://tile.stamen.com/watercolor/13/1310/3166.jpg
## Map from URL : http://tile.stamen.com/watercolor/13/1311/3166.jpg
## Map from URL : http://tile.stamen.com/watercolor/13/1308/3167.jpg
## Map from URL : http://tile.stamen.com/watercolor/13/1309/3167.jpg
## Map from URL : http://tile.stamen.com/watercolor/13/1310/3167.jpg
## Map from URL : http://tile.stamen.com/watercolor/13/1311/3167.jpg
## Map from URL : http://tile.stamen.com/watercolor/13/1308/3168.jpg
## Map from URL : http://tile.stamen.com/watercolor/13/1309/3168.jpg
## Map from URL : http://tile.stamen.com/watercolor/13/1310/3168.jpg
## Map from URL : http://tile.stamen.com/watercolor/13/1311/3168.jpg
Then I want to remove all NA values for latitude and longitude from our dataset
health_map <- health %>% filter(!is.na(business_latitude), !is.na(business_longitude))
health_map
ggmap(m, base_layer = ggplot(aes(x = business_longitude, y = business_latitude), data = health_map)) + geom_point(aes(colour = health_map$risk_category))
## Warning: Removed 36 rows containing missing values (geom_point).
I want to pull a random sample of 300 and then try again.
random <- health_map[sample(nrow(health_map), 300), ]
ggmap(m, base_layer = ggplot(aes(x = business_longitude, y = business_latitude), data = random)) + geom_point(aes(colour = random$risk_category))
health_map_high <- subset(health_map, risk_category=="High Risk")
ggmap(m, base_layer = ggplot(aes(x = business_longitude, y = business_latitude), data = health_map_high)) + geom_point(color="red")
## Warning: Removed 3 rows containing missing values (geom_point).