HW 8 - Lectures 8-11

Due Date: April 7, 2021, 11:00pm EST

Question 1 [70 pt]

  1. Load the dataset listings.csv which includes information about AirBNB listings in Boston (retrieved from http://insideairbnb.com/get-the-data.html on March 31st, 2021)
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.3     v purrr   0.3.4
## v tibble  3.0.5     v dplyr   1.0.3
## v tidyr   1.1.2     v stringr 1.4.0
## v readr   1.4.0     v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(ggplot2)  
library(plotly)
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
listings <- read_csv("listings.csv")
## 
## -- Column specification --------------------------------------------------------
## cols(
##   id = col_double(),
##   name = col_character(),
##   host_id = col_double(),
##   host_name = col_character(),
##   neighbourhood_group = col_logical(),
##   neighbourhood = col_character(),
##   latitude = col_double(),
##   longitude = col_double(),
##   room_type = col_character(),
##   price = col_double(),
##   minimum_nights = col_double(),
##   number_of_reviews = col_double(),
##   last_review = col_date(format = ""),
##   reviews_per_month = col_double(),
##   calculated_host_listings_count = col_double(),
##   availability_365 = col_double()
## )
  1. Load your mapbox token to the system
Sys.setenv("MAPBOX_TOKEN" = "pk.eyJ1IjoidGltb3RoeW5ndXllIiwiYSI6ImNrbW8wZmR0OTF6cnoycHQ0ZXFydTBwY3oifQ.LumCjldRn9X9jX70p1BA-w")
  1. Use some EDA tool/s to analyze the price variable (may or may not be graphical).

Based on your observations, consider removing outliers.

Please analyze carefully the outliers - are they random? do you think they represent errors?

Just only looking at the listing prices, some of these outliers may be errors especially for the outliers at 10k and 4k. It could be an error in the price or may not be considering other features.

plot_ly(listings, y = ~price, type="box")
  1. Create a plot that demonstrates the effect of neighborhood on price. Please transform the price variable with log base 10.

    listings$log_price = log10(listings$price)
    log_plot <- plot_ly(listings, y = ~log_price, color = ~neighbourhood, type="box")
    log_plot
    ## Warning in RColorBrewer::brewer.pal(N, "Set2"): n too large, allowed maximum for palette Set2 is 8
    ## Returning the palette you asked for with that many colors
    
    ## Warning in RColorBrewer::brewer.pal(N, "Set2"): n too large, allowed maximum for palette Set2 is 8
    ## Returning the palette you asked for with that many colors
  2. Next, we move to organize the price data on a mapbox layer. Choose how visualize the price data. Consider which type of base layer best assists you in showing what you want to show. Add interesting information to the tooltip. Make sure that everything that is visible is also desirable.

    In the following example, I choose the visualize the price data by grouping each data point by each neighborhood. Then, to highlight the price variable, I add it to the tooltip and it’s adjusted by the size / 2. So higher prices will show up bigger on the plot. In the tooltip as well, I plan to add information regarding the percentile of the price compared to other places in Boston.

    library(dplyr)
    listings = mutate(listings, percentile_rank = ntile(desc(listings$price),100))
    fig <- listings
    fig <- fig %>%
      plot_mapbox() %>%
      add_markers(x = ~longitude,
                  y = ~latitude,
                  color = ~as.factor(neighbourhood),
                  text = ~paste0("The price is $", price, "<br>",
                                 "in ", neighbourhood, "<br>")) %>%
      layout(
        mapbox = list(
          style = 'open-street-map',
          zoom = 9.5,
          center = list(lon = -71.1, lat = 42.3))) 
    
    fig
    ## Warning in RColorBrewer::brewer.pal(N, "Set2"): n too large, allowed maximum for palette Set2 is 8
    ## Returning the palette you asked for with that many colors
    
    ## Warning in RColorBrewer::brewer.pal(N, "Set2"): n too large, allowed maximum for palette Set2 is 8
    ## Returning the palette you asked for with that many colors
  3. GPX is a popular format for storing GPS data. The file mbta.gpx (retrieved from http://erikdemaine.org/maps/mbta/ on March 31st, 2021) includes waypoints of all of the MBTA stations as well as routes of rapid transit lines: subway / light train (red, orange, blue and green lines), bus rapid transit (silver line) and commuter rail.

The read_GPX() function from the tmaptools package reads GPX files into sf objects in R.

library(tmaptools)
mbta <- read_GPX("mbta.gpx")

Add all subway / light rail stations and tracks to the plot (red, orange, blue and green lines).

Hint: you may want to use the grepl() function.

#str(mbta)
class(mbta)
## [1] "list"
library(sf)
## Linking to GEOS 3.9.0, GDAL 3.2.1, PROJ 7.2.1
#mbta$waypoints$type
a <- fig %>%
  add_data(dplyr::filter(mbta$tracks, grepl("Red Line", mbta$tracks$name))) %>%
  add_sf(color = I("red"), name="Red Line")

a <- a %>%
  add_data(dplyr::filter(mbta$tracks, grepl("Green Line", mbta$tracks$name))) %>%
  add_sf(color = I("green"), name="Green Line")

a <- a %>%
  add_data(dplyr::filter(mbta$tracks, grepl("Blue Line", mbta$tracks$name))) %>%
  add_sf(color = I("blue"), name="Blue Line")

a <- a %>%
  add_data(dplyr::filter(mbta$tracks, grepl("Orange Line", mbta$tracks$name))) %>%
  add_sf(color = I("orange"), name="Orange Line")

a <- a %>%
  add_data(dplyr::filter(mbta$tracks, grepl("Silver Line", mbta$tracks$name))) %>%
  add_sf(color = I("gray"), name="Silver Line")

a <- a %>%
  add_data(dplyr::filter(mbta$tracks, grepl("Commuter Rail", 
  mbta$tracks$name))) %>%
  add_sf(color = I("purple"), name="Commuter Rail")

a <- a %>%
  add_sf(data = mbta$waypoints,
         color = I("black"), 
         text =~paste0(name, " : ", type), 
         name = "Stations",
        hoverinfo = "text")

a
## Warning in RColorBrewer::brewer.pal(N, "Set2"): n too large, allowed maximum for palette Set2 is 8
## Returning the palette you asked for with that many colors

## Warning in RColorBrewer::brewer.pal(N, "Set2"): n too large, allowed maximum for palette Set2 is 8
## Returning the palette you asked for with that many colors


Extra credit [10 pts] have the track for each line appearing with the appropriate color (red, orange, green, blue) and with a legend entry that allows hiding/showing it (instead of a legend entry for all lines together). You can consider writing a function that will save you much typing.

  1. KML is another popular format for storing GPS data. The Boston_Neighborhoods.kml file (downloaded from https://data.boston.gov/dataset/boston-neighborhoods on March 31st, 2021) contains the boundaries of all of Boston’s neighborhoods. You can read KML data into an sf object using the function st_read() from the sf package.
boston_neighborhoods <- sf::st_read("Boston_Neighborhoods.kml")
## Reading layer `Boston_Neighborhoods' from data source `C:\Users\quynh\OneDrive\Desktop\Spring-2021\Stats697V\hwk\8\Boston_Neighborhoods.kml' using driver `KML'
## Simple feature collection with 26 features and 2 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -71.19125 ymin: 42.22792 xmax: -70.92278 ymax: 42.39699
## Geodetic CRS:  WGS 84
class(boston_neighborhoods)
## [1] "sf"         "data.frame"

Add the neighborhood boundaries to your map (with a legend entry that allows showing/hiding those).

new <- a %>%
  add_sf(
    data = boston_neighborhoods,
    split = ~Name, 
    stroke = I("black"), span = I(1),
    text = ~paste(Name), 
    hoverinfo = "text"
  ) 

new
## Warning in RColorBrewer::brewer.pal(N, "Set2"): n too large, allowed maximum for palette Set2 is 8
## Returning the palette you asked for with that many colors

## Warning in RColorBrewer::brewer.pal(N, "Set2"): n too large, allowed maximum for palette Set2 is 8
## Returning the palette you asked for with that many colors
  1. Organize the last plot with the plot from 4 in a meaningful way. You might need to tweak either so that all labels, titles etc. appear as you want them to.

    fig2 <- subplot(log_plot, margin = 0) %>%
            subplot(new, nrows = 2, heights = c(0.3, 0.7), margin = 0.01) %>%
            layout(legend = list(y = 1)) 
    ## Warning in RColorBrewer::brewer.pal(N, "Set2"): n too large, allowed maximum for palette Set2 is 8
    ## Returning the palette you asked for with that many colors
    
    ## Warning in RColorBrewer::brewer.pal(N, "Set2"): n too large, allowed maximum for palette Set2 is 8
    ## Returning the palette you asked for with that many colors
    
    ## Warning in RColorBrewer::brewer.pal(N, "Set2"): n too large, allowed maximum for palette Set2 is 8
    ## Returning the palette you asked for with that many colors
    
    ## Warning in RColorBrewer::brewer.pal(N, "Set2"): n too large, allowed maximum for palette Set2 is 8
    ## Returning the palette you asked for with that many colors
    ## Warning: Can only have one: config
    fig2

Extra credit you can gain up to 20 extra credit points for this question if you add 1. meaningful data analyses; 2. download additional interesting data and combine it in your plot; 3. add useful interactivity (not with shiny, by using plotly traces/legends) to the above plots.

In terms of additional interactivity, I just added some useful knowledge to the tooltip and put which lines the stations meet. For example, Park Street will arrive by the red and green line.

Question 2 [30 pt]

(from Unwin’s “Graphical Data Analysis with R”, Chapter 6, Exercise 2)

Pottery

The package HSAUR2 includes a dataset on the chemical composition of Romano-British pottery, pottery, with 45 cases and 10 variables.

  1. Draw a parallel coordinate plot of the nine composition variables. What features can you see?

    From the parallel coordinate plot, it looks like there’s a higher level of Ca0 for kiln 1, but other than that, the values of other variables are pretty steady hovering around 0 for other chemical compositions other than Na2O (which is around 1).There seems like there’s a negative correlation between kiln 1 and kiln 2/3 for the most part as they have a lot of intersecting lines across multiple chemical compositions. Kiln 5 typically has a value around 0 and -1 other than for Al2O3 which is hovering around 1.

  2. Do so once with plot_ly* and once with ggparcoord + ggplotly. Which method do you prefer? Why?

    In this case the ggparcoord + ggplotly is better since it captures the points and values more easily.

    library(plotly)
    library(GGally)
    ## Registered S3 method overwritten by 'GGally':
    ##   method from   
    ##   +.gg   ggplot2
    library(HSAUR2)
    ## Warning: package 'HSAUR2' was built under R version 4.0.5
    ## Loading required package: tools
    data("pottery", package = "HSAUR2")
    pottery
    (gg3 <-
      ggparcoord(
        pottery,
        columns = 1:9,
        groupColumn = 10,
        order = "anyClass",
        showPoints = TRUE,
        title = "Parallel Coordinate Plot for the Pottery Data",
        alphaLines = 0.3  ) +
        theme_bw() +
        theme(legend.position = "top"))

    ggplotly(gg3)
    pottery %>% 
      plot_ly(
        type = 'parcoords',
        line = 
          list(
            color = ~as.numeric(factor(kiln)),
            colorscale =
              list(
                c(0, "orange"),
                c(0.5, "green"),
                c(1, "blue"),
                c(1.5, "yellow"),
                c(2, "purple")
                )
            ),
        dimensions = 
          list(
            list(label = 'Al2O3', values = ~Al2O3),
            list(label = 'Fe2O3', values = ~Fe2O3),
            list(label = 'MgO', values = ~MgO),
            list(label = 'CaO', values = ~CaO),
            list(label = 'Na2O', values = ~Na2O),
            list(label = 'K2O', values = ~K2O),
            list(label = 'TiO2', values = ~TiO2),
            list(label = 'MnO', values = ~MnO),
            list(label = 'BaO',values = ~BaO)
          )
      )
  • You may gain 5 extra credit points if, when you create the plotly version, you implement it by writing a function that generates the dimensions list for you without having to type the names and labels for all the nine variables.
  1. Make a new variable with the cases with low values on MgO. How are these cases different from the rest on the other variables?

    They typically have a kiln value of 4 or 5.

    low_mgo <- pottery[pottery$MgO < 1, ] 
    low_mgo
  2. Colour your original parallel coordinate plot using the site information, kiln. Which kilns can be easily distinguished from the others using which variables?

    Kilns that can be distinguished 1 vs 2 and 3. We can see this through CaO vs MgO vs Al2O3.

    (gg3 <-
      ggparcoord(
        pottery,
        columns = 1:9,
        groupColumn = 10,
        order = "anyClass",
        showPoints = TRUE,
        title = "Parallel Coordinate Plot for the Pottery Data",
        alphaLines = 0.3  ) +
        theme_bw() +
        theme(legend.position = "top"))

    ggplotly(gg3)
    pottery %>% 
      plot_ly(
        type = 'parcoords',
        line = 
          list(
            color = ~as.numeric(factor(kiln)),
            colorscale =
              list(
                c(0, "orange"),
                c(0.5, "green"),
                c(1, "blue"),
                c(1.5, "yellow"),
                c(2, "purple")
                )
            ),
        dimensions = 
          list(
            list(label = 'Al2O3', values = ~Al2O3),
            list(label = 'Fe2O3', values = ~Fe2O3),
            list(label = 'MgO', values = ~MgO),
            list(label = 'CaO', values = ~CaO),
            list(label = 'Na2O', values = ~Na2O),
            list(label = 'K2O', values = ~K2O),
            list(label = 'TiO2', values = ~TiO2),
            list(label = 'MnO', values = ~MnO),
            list(label = 'BaO',values = ~BaO)
          )
      )

Here too, Do so once with plot_ly and once with ggparcoord + ggplotly. Which method do you prefer? Why?

Here I prefer using plot_ly since I’m able to see the intersections more easily.

Extra credit 5 pt Consider different scaling methods and variable ordering techniques. Which best serves you?