Read Graphical Data Analysis with R, Ch. 4, 5
Grading is based both on your graphs and verbal explanations. Follow all best practices as discussed in class.
Data: garden_planting from gardenR package
For each of the following choose the appropriate data from the package and draw a bar chart, following best practices, that can directly answer the following questions. For each, explain your choice of bar order.
data(garden_planting,package = "gardenR")
garden_planting %>%
select(vegetable) %>%
mutate(vegetable = fct_infreq(vegetable) %>% fct_rev()) %>%
ggplot(aes(vegetable)) +
geom_bar(fill = "lightblue") + coord_flip() +
ylab("number of varieties")+
ggtitle("Varieties of each type of vegetable")
The bar chart takes types of vegetable as y-axis and numbers of varieties as x-axis, and then orders vegetable by their numbers of varieties from the largest on the top to the smallest on the bottom. Therefore, we can quickly find that for example, tomatoes have most variesties.
garden_planting.)week <- garden_planting %>% select(date) %>%
mutate(weekday = weekdays(date))
week %>%
select(weekday) %>%
mutate(weekday = fct_infreq(weekday) %>% fct_rev()) %>%
ggplot(aes(weekday)) +
geom_bar(fill = "lightblue")
First, we converts dates to weekdays. Then we get the frequency of each weekday and draw the bar chart whose order of x-axis is from the least frequent weekday to the most frequent one. From this plot, we can know that Dr. Lendway never planted on Tuesdays and she most likely planted on Wednesdays.There is also a null value here since data of one observation is missing.
seeds <- garden_planting %>% group_by(plot) %>%
summarise(total_seeds = sum(number_seeds_planted))
seeds %>% ggplot(aes(x = fct_reorder(plot,total_seeds), y = total_seeds)) +
geom_col(fill = "lightblue")+coord_flip() +
xlab("Plot")+ylab("Total number of seeds") +
ggtitle("Total Number of Seeds Planted in Each Plot")
Through group by function, we get the total number of seeds for each plot. Then, we draw the bar and order plots from the one with the largest total number of seeds on the top to the one with the smallest on the bottom. Therefore, we can quickly find that for example, H is the plot where Dr. Lendway planted most seeds. Here, "F" is an exception since its value is null.
To get the data for this problem, we’ll scrape data from this page: https://www.beckershospitalreview.com/public-health/states-ranked-by-percentage-of-covid-19-vaccines-administered.html. Important: you should only execute parts (a) and (b) once. After running these lines, comment them out, and leave them commented out in the submitted version.
#install.packages("rvest")
#install.packages("robotstxt")
library(rvest)
library(robotstxt)
paths_allowed("https://www.beckershospitalreview.com/public-health/states-ranked-by-percentage-of-covid-19-vaccines-administered.html")
## [1] TRUE
The result is TRUE.
page <- read_html("https://www.beckershospitalreview.com/public-health/states-ranked-by-percentage-of-covid-19-vaccines-administered.html")
State, Distributed, Administered, Percentage. Display the first six rows of the data frame.State <- page %>% html_nodes("ol strong") %>% html_text()
others <- page %>% html_nodes("ol li") %>% html_text()
split <-str_split(others,":", simplify = TRUE)
Distributed <- c()
Administered <- c()
Percentage <- c()
for (i in 1:50){
tmpd <- parse_number(split[i,2])
tmpa <- parse_number(split[i,3])
tmpp <- parse_number(split[i,4])
Distributed <- c(Distributed,tmpd)
Administered <- c(Administered,tmpa)
Percentage <- c(Percentage,tmpp)
}
doses <- data_frame(State, Distributed, Administered, Percentage)
head(doses)
## # A tibble: 6 x 4
## State Distributed Administered Percentage
## <chr> <dbl> <dbl> <dbl>
## 1 West Virginia 328600 276753 84.2
## 2 North Dakota 131625 106900 81.2
## 3 New Mexico 364775 294566 80.8
## 4 South Dakota 153450 115676 75.4
## 5 South Carolina 683600 497433 72.8
## 6 Connecticut 666175 481826 72.3
theme_dotplot <- theme_bw(14) +
theme(axis.text.y = element_text(size = rel(.65)),
axis.ticks.y = element_blank(),
axis.title.x = element_text(size = rel(.75)),
panel.grid.major.x = element_blank(),
panel.grid.major.y = element_line(size = 0.5),
panel.grid.minor.x = element_blank())
ggplot(doses,aes(Percentage,fct_reorder(State,Percentage))) +
geom_point(color = "red") + ylab("State") +
ggtitle("Percentage of Each State") + theme_dotplot
tidydose <- doses %>%
select(State, Distributed,Administered) %>%
pivot_longer(cols = -State, names_to ="dose", values_to = "value")
ggplot(tidydose,
aes(value, fct_reorder2(State, dose, value, .desc = FALSE),
color = dose)) +
geom_point() +xlab("Doses") +ylab("State") +
ggtitle("Doses of Each State Sorted by Doses Distributed") + theme_dotplot
This one sorts states by their doses distributed from the largest on the top to the smallest on the bottom. We can clearly see which state has the most doses distributed. Also, although values of doses administered are not in a perfectly increasing order from bottom to top in the plot, its general trend is increasing. Most importantly, what we care is the percentage of distributed vaccines that have been administered. From the plot, if the two points for one state is closer, the percentage should be larger. The plot gives us some ideas and help us compare percentages.
Data: NYC yellow cab rides in June 2020, available here:
https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page
It’s a large file so you may work with a reasonably-sized random sample of the data (for example 10000 rows). If you remove outliers / impossibilities explain your logic.
Draw four plots of tip_amount vs. fare_amount with the following variations:
library(ggplot2)
df2 <- read.csv(file = "df.csv", header = TRUE)
ggplot(df2, aes(fare_amount, tip_amount)) +
geom_point(alpha = 0.23, size = 2) +
ggtitle("scatter plot with alpha blending") +
theme(plot.title = element_text(hjust = 0.5))
Scatter Plot of "NYC yellow cab rides in June 2020" data (random sampled, length = 10000)
is displayed above. Note that it is displayed without any limitation in axis to observe outlier and dense region
ggplot(df2, aes(fare_amount, tip_amount)) +
geom_point(alpha = 0.23, size = 1.5) +
scale_x_continuous(limits = c(0,80)) +
scale_y_continuous(limits = c(0,20)) +
ggtitle("scatterplot with alpha blending(modified)") +
theme(plot.title = element_text(hjust = 0.5))
By observation from first plot, setting limit to 80 USD in fare_amount and 20 USD in tip_amount
gives best display to examine the major trend in scatter plot
ggplot(df2, aes(fare_amount, tip_amount)) +
geom_point(alpha = 0.15, size = 1.2) +
geom_density2d(color = "red", size = 0.7, bins = 25) +
scale_x_continuous(limits = c(0,40)) +
scale_y_continuous(limits = c(0,5.5)) +
ggtitle("Scatterplot with alpha blending + density estimate contour lines") +
theme(plot.title = element_text(hjust = 0.5))
#install.packages("hexbin")
ggplot(df2, aes(fare_amount, tip_amount)) +
geom_hex(bins = 30)+
scale_fill_continuous(type = "viridis") +
theme_bw() +
ggtitle("Hexagonal heatmap of bin counts") +
theme(plot.title = element_text(hjust = 0.5))
ggplot(df2, aes(fare_amount, tip_amount)) +
geom_hex(bins = 40) +
scale_x_continuous(limits = c(0,40)) +
scale_y_continuous(limits = c(0,10)) +
scale_fill_continuous(type = "viridis") +
theme_bw() +
ggtitle("Hexagonal heatmap of bin counts(modified)") +
theme(plot.title = element_text(hjust = 0.5))
By observation from first plot, setting limit to 40 USD in fare_amount and 10 USD in tip_amount
gives best display to examine the major trend in hexagonal heatmap
ggplot(df2, aes(fare_amount, tip_amount)) + geom_bin2d(bins = 30)+ scale_fill_continuous(type = "viridis") + theme_bw() + ggtitle("square heatmap of bin counts") + theme(plot.title = element_text(hjust = 0.5))
ggplot(df2, aes(fare_amount, tip_amount)) + geom_bin2d(bins = 40) + scale_x_continuous(limits = c(0,40)) +
scale_y_continuous(limits = c(0,10)) + scale_fill_continuous(type = "viridis") + theme_bw() + ggtitle("square heatmap of bin counts(modified)") + theme(plot.title = element_text(hjust = 0.5))
By observation from first plot, setting limit to 40 USD in fare_amount and 10 USD in tip_amount
gives best display to examine the major trend in square heatmap
For all, adjust parameters to the levels that provide the best views of the data.
. Most of the passengers go within fare < 100 USD.
. Both fare >100 (USD) or tip > 25 (USD) can be considered as outliers.
. There are people(or person) who drove with yellow cab with fare amount near 500 (USD).
. There are people who gave driver tip more than their fare.
. Regardless of fare amount, there are lots of passengers(one of most dense line) who give driver zero tips Scatter plots and density contour (two clusters) confirm this trend.
. It can be also found that regardless of fare amount, there are also lot of passengers who gave driver tip with amount of 2.5 (USD) roughly. But this trend is less solid compare to above one(passengers who gave 0 tips regardless of fare amount).
. Around fare amount of 30 USD, it can be found that there are linear relationships between fare amount and tip amount; It roughly follows (tip/fare) = 0.25(1/4) and 0.3(3/10). Also, it can be found that around to 20 USD, there exists linear relationship of (tip/fare) = 0.4. These linear trends gradually decrease and stop as fare amount goes up.
. It can be found from heatmap there are relatively high concentration of passengers who drove fare amount < 20 USD. Above mentioned linear trend also can be confirmed by the heat map (linear trend in yellow color – high concentration of count).
. Scatterplots with alpha blending give much information about association and relationship between variables while density estimate contour lines identity clusters. Heatmaps provide confirmations to observations attained from other plots.