EDAV PSet 2

Read Graphical Data Analysis with R, Ch. 4, 5

Grading is based both on your graphs and verbal explanations. Follow all best practices as discussed in class.

1. Jungle Garden

Data: garden_planting from gardenR package

For each of the following choose the appropriate data from the package and draw a bar chart, following best practices, that can directly answer the following questions. For each, explain your choice of bar order.

How many varieties of each type of vegetable were planted?

data(garden_planting,package = "gardenR")
garden_planting %>% 
  select(vegetable) %>% 
  mutate(vegetable = fct_infreq(vegetable) %>% fct_rev()) %>%
  ggplot(aes(vegetable)) + 
  geom_bar(fill = "lightblue") + coord_flip() + 
  ylab("number of varieties")+
  ggtitle("Varieties of each type of vegetable")

The bar chart takes types of vegetable as y-axis and numbers of varieties as x-axis, and then orders vegetable by their numbers of varieties from the largest on the top to the smallest on the bottom. Therefore, we can quickly find that for example, tomatoes have most variesties.

Were there any days of the week that Dr. Lendway never planted? Which days of the week was she most likely to be found planting? (Measure by the number of rows in garden_planting.)

week <- garden_planting %>% select(date) %>% 
  mutate(weekday = weekdays(date))
week %>%
  select(weekday) %>%
  mutate(weekday = fct_infreq(weekday) %>% fct_rev()) %>%
  ggplot(aes(weekday)) + 
  geom_bar(fill = "lightblue")

First, we converts dates to weekdays. Then we get the frequency of each weekday and draw the bar chart whose order of x-axis is from the least frequent weekday to the most frequent one. From this plot, we can know that Dr. Lendway never planted on Tuesdays and she most likely planted on Wednesdays.There is also a null value here since data of one observation is missing.

How many seeds were planted in each plot?

seeds <- garden_planting %>%  group_by(plot) %>% 
  summarise(total_seeds = sum(number_seeds_planted)) 

seeds %>% ggplot(aes(x = fct_reorder(plot,total_seeds), y = total_seeds)) +
  geom_col(fill = "lightblue")+coord_flip() +
  xlab("Plot")+ylab("Total number of seeds") +
  ggtitle("Total Number of Seeds Planted in Each Plot")

Through group by function, we get the total number of seeds for each plot. Then, we draw the bar and order plots from the one with the largest total number of seeds on the top to the one with the smallest on the bottom. Therefore, we can quickly find that for example, H is the plot where Dr. Lendway planted most seeds. Here, "F" is an exception since its value is null.

2. Vaccination Rates

To get the data for this problem, we’ll scrape data from this page: https://www.beckershospitalreview.com/public-health/states-ranked-by-percentage-of-covid-19-vaccines-administered.html. Important: you should only execute parts (a) and (b) once. After running these lines, comment them out, and leave them commented out in the submitted version.

Check that it’s ok to scrape the site. What is the result?

#install.packages("rvest")
#install.packages("robotstxt")

library(rvest)
library(robotstxt)
paths_allowed("https://www.beckershospitalreview.com/public-health/states-ranked-by-percentage-of-covid-19-vaccines-administered.html")

## [1] TRUE

The result is TRUE.

Read the page at the URL listed above and save it.

page <- read_html("https://www.beckershospitalreview.com/public-health/states-ranked-by-percentage-of-covid-19-vaccines-administered.html")

Read in the page saved in part b) and create a data frame with the following columns: State, Distributed, Administered, Percentage. Display the first six rows of the data frame.

State <- page %>% html_nodes("ol strong") %>% html_text()
others <- page %>% html_nodes("ol li") %>% html_text()
split <-str_split(others,":", simplify = TRUE)
Distributed <- c()
Administered <- c()
Percentage <- c()
for (i in 1:50){
  tmpd <- parse_number(split[i,2])
  tmpa <- parse_number(split[i,3])
  tmpp <- parse_number(split[i,4])
  Distributed <- c(Distributed,tmpd)
  Administered <- c(Administered,tmpa)
  Percentage <- c(Percentage,tmpp)
}

doses <- data_frame(State, Distributed, Administered, Percentage)
head(doses)

## # A tibble: 6 x 4
##   State          Distributed Administered Percentage
##   <chr>                <dbl>        <dbl>      <dbl>
## 1 West Virginia       328600       276753       84.2
## 2 North Dakota        131625       106900       81.2
## 3 New Mexico          364775       294566       80.8
## 4 South Dakota        153450       115676       75.4
## 5 South Carolina      683600       497433       72.8
## 6 Connecticut         666175       481826       72.3

Create a Cleveland dot plot showing the percentage of distributed vaccines that have been administered.

theme_dotplot <- theme_bw(14) +
    theme(axis.text.y = element_text(size = rel(.65)),
        axis.ticks.y = element_blank(),
        axis.title.x = element_text(size = rel(.75)),
        panel.grid.major.x = element_blank(),
        panel.grid.major.y = element_line(size = 0.5),
        panel.grid.minor.x = element_blank())

ggplot(doses,aes(Percentage,fct_reorder(State,Percentage))) +
  geom_point(color = "red") + ylab("State") +
  ggtitle("Percentage of Each State") + theme_dotplot

Create a multiple Cleveland dot plot showing doses distributed in one color and doses administered in another for each state. Experiment with different sorting orders. Choose the one that you think works best and explain why.

tidydose <- doses %>% 
  select(State, Distributed,Administered) %>% 
  pivot_longer(cols = -State, names_to ="dose", values_to = "value")

ggplot(tidydose,
       aes(value, fct_reorder2(State, dose, value, .desc = FALSE), 
           color = dose)) + 
  geom_point() +xlab("Doses") +ylab("State") +
  ggtitle("Doses of Each State Sorted by Doses Distributed") + theme_dotplot

This one sorts states by their doses distributed from the largest on the top to the smallest on the bottom. We can clearly see which state has the most doses distributed. Also, although values of doses administered are not in a perfectly increasing order from bottom to top in the plot, its general trend is increasing. Most importantly, what we care is the percentage of distributed vaccines that have been administered. From the plot, if the two points for one state is closer, the percentage should be larger. The plot gives us some ideas and help us compare percentages.

3. Taxis

Data: NYC yellow cab rides in June 2020, available here:

https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page

It’s a large file so you may work with a reasonably-sized random sample of the data (for example 10000 rows). If you remove outliers / impossibilities explain your logic.

Draw four plots of tip_amount vs. fare_amount with the following variations:

Scatterplot with alpha blending

library(ggplot2)

df2 <- read.csv(file = "df.csv",  header = TRUE)

ggplot(df2, aes(fare_amount, tip_amount)) + 
  geom_point(alpha = 0.23, size = 2)  + 
  ggtitle("scatter plot with alpha blending") + 
  theme(plot.title = element_text(hjust = 0.5))

Scatter Plot of "NYC yellow cab rides in June 2020" data (random sampled, length = 10000) 
is displayed above. Note that it is displayed without any limitation in axis to observe outlier and dense region

ggplot(df2, aes(fare_amount, tip_amount)) + 
  geom_point(alpha = 0.23, size = 1.5)  + 
  scale_x_continuous(limits = c(0,80)) + 
  scale_y_continuous(limits = c(0,20)) + 
  ggtitle("scatterplot with alpha blending(modified)") + 
  theme(plot.title = element_text(hjust = 0.5))

By observation from first plot, setting limit to 80 USD in fare_amount and 20 USD in tip_amount
gives best display to examine the major trend in scatter plot

Scatterplot with alpha blending + density estimate contour lines

ggplot(df2, aes(fare_amount, tip_amount)) + 
  geom_point(alpha = 0.15, size = 1.2) + 
  geom_density2d(color = "red", size = 0.7, bins = 25) +
  scale_x_continuous(limits = c(0,40)) +
  scale_y_continuous(limits = c(0,5.5)) + 
  ggtitle("Scatterplot with alpha blending + density estimate contour lines") +
  theme(plot.title = element_text(hjust = 0.5))

Hexagonal heatmap of bin counts

#install.packages("hexbin")

ggplot(df2, aes(fare_amount, tip_amount)) + 
  geom_hex(bins = 30)+ 
  scale_fill_continuous(type = "viridis") + 
  theme_bw() + 
  ggtitle("Hexagonal heatmap of bin counts") + 
  theme(plot.title = element_text(hjust = 0.5))

ggplot(df2, aes(fare_amount, tip_amount)) + 
  geom_hex(bins = 40) + 
  scale_x_continuous(limits = c(0,40)) +
  scale_y_continuous(limits = c(0,10)) + 
  scale_fill_continuous(type = "viridis") + 
  theme_bw() + 
  ggtitle("Hexagonal heatmap of bin counts(modified)") + 
  theme(plot.title = element_text(hjust = 0.5))

By observation from first plot, setting limit to 40 USD in fare_amount and 10 USD in tip_amount
gives best display to examine the major trend in hexagonal heatmap

Square heatmap of bin counts

ggplot(df2, aes(fare_amount, tip_amount)) + geom_bin2d(bins = 30)+ scale_fill_continuous(type = "viridis") + theme_bw() + ggtitle("square heatmap of bin counts") + theme(plot.title = element_text(hjust = 0.5))

ggplot(df2, aes(fare_amount, tip_amount)) + geom_bin2d(bins = 40) + scale_x_continuous(limits = c(0,40)) +
  scale_y_continuous(limits = c(0,10)) + scale_fill_continuous(type = "viridis") + theme_bw()  + ggtitle("square heatmap of bin counts(modified)") + theme(plot.title = element_text(hjust = 0.5))

By observation from first plot, setting limit to 40 USD in fare_amount and 10 USD in tip_amount
gives best display to examine the major trend in square heatmap

For all, adjust parameters to the levels that provide the best views of the data.

Describe noteworthy features of the data, using the “Movie ratings” example on page 82 (last page of Section 5.3) as a guide.

. Most of the passengers go within fare < 100 USD. 

. Both fare >100 (USD) or tip > 25 (USD) can be considered as outliers. 

. There are people(or person) who drove with yellow cab with fare amount near 500 (USD).

. There are people who gave driver tip more than their fare.

. Regardless of fare amount, there are lots of passengers(one of most dense line) who give driver zero tips Scatter plots and density contour (two clusters) confirm this trend.

. It can be also found that regardless of fare amount, there are also lot of passengers who gave driver tip with amount of 2.5 (USD) roughly. But this trend is less solid compare to above one(passengers who gave 0 tips regardless of fare amount). 

. Around fare amount of 30 USD, it can be found that there are linear relationships between fare amount and tip amount; It roughly follows (tip/fare) = 0.25(1/4) and 0.3(3/10). Also, it can be found that around to 20 USD, there exists linear relationship of (tip/fare) = 0.4. These linear trends gradually decrease and stop as fare amount goes up. 

. It can be found from heatmap there are relatively high concentration of passengers who drove fare amount < 20 USD. Above mentioned linear trend also can be confirmed by the heat map (linear trend in yellow color – high concentration of count).

. Scatterplots with alpha blending give much information about association and relationship between variables while density estimate contour lines identity clusters. Heatmaps provide confirmations to observations attained from other plots.

EDAV PSet 2

Yiyuan Xu & Kyungjin Kim

1. Jungle Garden

2. Vaccination Rates

3. Taxis