Part 1

  • Sie Siong Wong’s code and vignette.

Objective

  • Dplyr annd GGPlot is the TidyVerse packages that I choose to create a programming sample to demonstrate how to use its capabilities from reshape the data to plot analysis result.The dataset is Ramen Rating and got this from Kaggle.

What is dplyr?

  • It is the next iteration of plyr package.
  • It is faster and has a more consistent of API.
  • It provides a set of tools that help you solve the most common data manipulation challenges.

What is GGPlot?

  • GGPlot was based on the Grammar of Graphics.
  • It puts an organized framework around the various parts of a graph.
  • The components of ggplot grammar are data, aesthetic mapping, geom, stat, scales, theme, and so on.
  • It can be used to create scatter plots, boxplots, histograms, lines best approximating data, etc.
  • It is robust and has served statisticians well.

Load the Packages

library(dplyr)
library(ggplot2)

Load the Dataset

rr <- read.csv("https://raw.githubusercontent.com/SieSiongWong/DATA-607/master/Project%20TidyVerse/RamenRatings.csv", header=TRUE, stringsAsFactors = FALSE)

head(rr)
##   Review..          Brand
## 1     2580      New Touch
## 2     2579       Just Way
## 3     2578         Nissin
## 4     2577        Wei Lih
## 5     2576 Ching's Secret
## 6     2575  Samyang Foods
##                                                       Variety Style
## 1                                   T's Restaurant Tantanmen    Cup
## 2 Noodles Spicy Hot Sesame Spicy Hot Sesame Guan-miao Noodles  Pack
## 3                               Cup Noodles Chicken Vegetable   Cup
## 4                               GGE Ramen Snack Tomato Flavor  Pack
## 5                                             Singapore Curry  Pack
## 6                                      Kimchi song Song Ramen  Pack
##       Country Stars
## 1       Japan  3.75
## 2      Taiwan     1
## 3         USA  2.25
## 4      Taiwan  2.75
## 5       India  3.75
## 6 South Korea  4.75

Clean and Tidy the Data

# Convert Stars column from factor to numeric.
rr$Stars <-as.numeric(rr$Stars)
## Warning: NAs introduced by coercion
# You can use the 'select' function to pick variables based on their names.
rr.df <- select (rr, Brand, Country, Stars)

# or you can do this way using the combination of 'pipe' and 'select' functions. The pipe allows you to pipe the output of one function to the input of another function.
rr.df <- rr %>% select (Brand, Country, Stars)

# You can use the 'rename' function to rename the column name.
rr.df <- rr.df %>% rename ("Ratings"="Stars")

# Using filter to select only numeric data on Stars column.
rr.df <- rr.df %>% filter(!is.na(Ratings))

# To move specific column to the beginning of a dataframe, we can use the 'select' and 'everything' functions.
rr.df <- rr.df %>% select(Country, everything())

# To sort the Country column in descending order, we can use the 'arrange' function.
rr.df <- rr.df %>% arrange(Country)

Reshape and Analyze the Data

# Another very useful function is the 'group_by'. YOu can use this function to group the same elements together and then use the 'summarise' function to collapse each group into a single-row summary. For example, we can group country and brand and then do the average for each brand. 
rr.df.avg1 <- rr.df %>% group_by(Country, Brand) %>% summarise(Average=mean(Ratings))

# Very often you'll want to create new columns based on the values in existing columns, for example like previous example to get average ratings for each brand. In this case, you can use the 'mutate' function. The difference between summarise and mutate functions is that using 'mutate' will keep all existing columns in the output while 'summarise' will not show the column variable which is used to perform the calculation.  
rr.df.avg2 <- rr.df %>% group_by(Country, Brand) %>% mutate(Average=mean(Ratings))

# You also can use the 'transmute' to only show the grouped and new column variables just like the summarise function.
rr.df.avg3 <- rr.df %>% group_by(Country, Brand) %>% transmute(Average=mean(Ratings))

Plot the Analysis Result

# In the ggplot() parenthesis, the data source is 'rr.df.avg1', and within the aesthetic parenthesis you'll need to define the x and y variables. In this case, the x=Country, y=Average. If you want to include legend, you can put fill=Country in the aes().
ggplot(rr.df.avg1, aes(x=Country, y=Average)) + geom_boxplot()

# You notice that the boxplots are not in the order. To sort the boxplots in the order, we can use the 'reorder' function and use median as reference parameter to sort.
ggplot(rr.df.avg1, aes(x=reorder(factor(Country),Average, fun=median),y=Average)) + geom_boxplot() 

# Also, we can further refine the plot to include title and y-axis label. To do this, the argument for title is 'title' and the argument for y-axis label is 'ylab'. 
ggplot(rr.df.avg1, aes(x=reorder(factor(Country), Average, fun=median),y=Average)) + geom_boxplot() + labs(title="Average Ramen Ratings for Each Country") + ylab("Average Ratings")

# In this example, I will use the 'fill' to include legend so that color will be automatically assigned to each box plot and then remove the legend display manually. We can use 'theme()' to customize the plot's look. To remove legend display, you can use 'legend.position', and to remove x-axis label, you can use 'axis.title.x'. Also, we can adjust the plot title height position using' plot.title'.Since there are so many labels for the x-axis in this example, we can turn the x-axis text into 90 degree angle and adjust their margin using 'axis.text.x'. 
ggplot(rr.df.avg1, aes(x=reorder(factor(Country), Average, fun=median),y=Average,fill=factor(Country))) + geom_boxplot() + labs(title="Average Ramen Ratings for Each Country") + ylab("Average Ratings") + theme(legend.position = "none", axis.title.x = element_blank(), axis.text.x=element_text(angle=90, margin=margin(t = 10, r = 0, b = 0, l = 0)), plot.title = element_text(hjust=0.5))

Part 2

  • Extending Sufian Suwarman’s code and vignette.

Load data

# Read data.
polls <- read.csv("https://raw.githubusercontent.com/ssufian/Data_607/master/approval_topline.csv", header=TRUE,stringsAsFactors = FALSE)

polls2 <- read.csv('https://raw.githubusercontent.com/ssufian/Data_607/master/polling_data.csv', header=TRUE, stringsAsFactors = FALSE)

Extending Sufian’s Code

Trump’s Approval Change Over Time

  • It will be interesting to see how Trump’s approval change over the time by using the dplyr packages functions such as group_by and summarize and then plot the trend.
# Approval rates.
average.approve <- polls2 %>% 
  select(enddate, adjusted_approve) %>%
  group_by(enddate) %>% 
  summarise(average_approve = mean(adjusted_approve))
  
# Disapproval rates.
average.disapprove <- polls2 %>% 
  select(enddate, adjusted_disapprove) %>%
  group_by(enddate) %>% 
  summarise(average_disapprove = mean(adjusted_disapprove))

# Merge two dataframe together.
average.final <- merge(average.approve,average.disapprove)
average.final$enddate <- anytime::anydate(average.final$enddate)

# We can plot the approval and disapproval ratings together.
ggplot(average.final, aes(x=enddate)) +
  ggtitle("Approval & Disapproval") + ylab("%") + 
  geom_smooth(aes(y=average_approve, group=1, colour="Approval")) + geom_smooth(aes(y=average_disapprove, group=2, colour="Disapproval")) + 
  theme(plot.title = element_text(hjust=0.5), axis.title.x=element_blank(), axis.text.x=element_text(angle=180,hjust=1),legend.position=c(0.6,0.9),legend.title=element_blank())
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

  • From the chart, we can see that president Trump’s approval rate has been declined sharply since he took office in January 2017. But his approval rate started to increase again in Fall 2017 and become quite steady around 43% from March 2018 forward but starts to trend down recently. It is quite interesting to see the disapproval rate is exactly the opposite side of approval rate. Not sure if the data

Compare Estimate Approval Rate

# Estimate approval rate.
average.approve.estimate <- polls %>% 
  select(modeldate, approve_estimate) %>%
  group_by(modeldate) %>% 
  summarise(average_approve_estimate = mean(approve_estimate))

average.approve.estimate$modeldate <- anytime::anydate(average.approve.estimate$modeldate)

# Plot to compare estimate and survey approval rate.
ggplot(NULL) +
  ggtitle("Estimate vs Survey Approval Rate") + 
  ylab("%") + 
  geom_smooth(data=average.final, aes(x=enddate, y=average_approve, group=1, colour="Actual Approval")) +
  geom_smooth(data=average.approve.estimate, aes(x=modeldate, y=average_approve_estimate, group=2, colour="Estiamte Approval")) + 
  theme(plot.title = element_text(hjust=0.5), axis.title.x=element_blank(), axis.text.x=element_text(angle=180,hjust=1),legend.position=c(0.6,0.9),legend.title=element_blank())
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

  • From the chart, we can see the estimate approval rate is fairly close to the actual approval rate.

Subgroup Approval Rate Variation

  • It will also be interesting to see the variation of approval rate from each subgroup..
# Approval rate.
subgroup.approve <- polls2 %>% 
  select(subgroup, adjusted_approve) %>% group_by(subgroup)

# Boxplot for each subgroup.
ggplot(subgroup.approve, aes(x=reorder(factor(subgroup), adjusted_approve, fun=median),y=adjusted_approve,fill=factor(subgroup))) + geom_boxplot() + labs(title="Subgroup Approval Rate Variation") + ylab("%") + theme(legend.position = "none", axis.title.x = element_blank(), axis.text.x=element_text(angle=45)) + theme(plot.title = element_text(hjust=0.5)) + theme(axis.text.x = element_text(margin = margin(t = 25, r = 20, b = 0, l = 0)))

  • From the boxplot, we can see the variation about the same across the 3 groups.