TidyVerse - a programming sample “vignette”

Rpubs link: http://rpubs.com/ssufian/548044

Github: https://github.com/ssufian/FALL2019TIDYVERSE/blob/master/Data-607-Tidyverse-Project-orig.Rmd

R Problem Statement:

In this assignment, we will practice collaborating around a code project with Github as a class.

Using several TidyVerse packages, and the “How Popular is Trump?” dataset from

fivethirtyeight.com, I’m going to show a programming sample “vignette” that demonstrates how to use the

capabilities of ggplot2, dplyr, readr packages with this particular dataset.

# Load
library("tidyverse")
library(ggplot2)
library(dplyr)

Using readr to read data from a csv file (Github site)

Part I: Sufian’s code and vignette

polls <- read_csv("https://raw.githubusercontent.com/ssufian/Data_607/master/approval_topline.csv")

## Parsed with column specification:
## cols(
##   president = col_character(),
##   subgroup = col_character(),
##   modeldate = col_character(),
##   approve_estimate = col_double(),
##   approve_hi = col_double(),
##   approve_lo = col_double(),
##   disapprove_estimate = col_double(),
##   disapprove_hi = col_double(),
##   disapprove_lo = col_double(),
##   timestamp = col_character()
## )

polls1 <- read_csv('https://raw.githubusercontent.com/ssufian/Data_607/master/polling_data.csv')

## Parsed with column specification:
## cols(
##   .default = col_character(),
##   samplesize = col_double(),
##   weight = col_double(),
##   influence = col_double(),
##   approve = col_double(),
##   disapprove = col_double(),
##   adjusted_approve = col_double(),
##   adjusted_disapprove = col_double(),
##   tracking = col_logical(),
##   poll_id = col_double(),
##   question_id = col_double()
## )

## See spec(...) for full column specifications.

Using ggplot2 to visualizae data; with pipe operation %>% from dplyr

polls1 <- as_tibble(polls1)

Bar Charts of Trump’s approval/disapproval ratings by pollster types shown below:

polls1 %>% ggplot(aes(x=pollster, y=adjusted_approve, fill=pollster))+
  geom_bar(stat = "identity", position = "dodge") + 
  guides(fill = FALSE) +
  ggtitle("Trump's approval ratings by Pollster types")+ 
  theme(axis.text.x = element_text(angle = 60, hjust = 1))

polls1 %>% ggplot(aes(x=pollster, y=adjusted_disapprove, fill=pollster))+
  geom_bar(stat = "identity", position = "dodge") + 
  guides(fill = FALSE) +
  ggtitle("Trump's disapproval ratings by Pollster types")+ 
  theme(axis.text.x = element_text(angle = 60, hjust = 1))

Trumps Approval/Dis-approval ratings by voter types:

polls1 %>% ggplot(aes(x=subgroup, y=adjusted_approve, fill=subgroup))+
  geom_bar(stat = "identity", position = "dodge") + 
  guides(fill = FALSE) +
  ggtitle("Trump's approval ratings by voter types")+ 
  theme(axis.text.x = element_text(angle = 60, hjust = 1))

polls1 %>% ggplot(aes(x=subgroup, y=adjusted_disapprove, fill=subgroup))+
  geom_bar(stat = "identity", position = "dodge") + 
  guides(fill = FALSE) +
  ggtitle("Trump's disapproval ratings by voter types")+ 
  theme(axis.text.x = element_text(angle = 60, hjust = 1))

Observation:

2 Things to note in that Trump’s approval ratings amongst the 64 pollster is just averaging in the

mid 45% range while his disapproval ratings is in the mid 60% range

Interesting to note that the voter type that really matters in the 3 sub groupings are adult

registered Voters; this group have an average higher adj. dis-approval ratings for Trump relatively

speaking.

In parrellel, voters that have an adj. approval ratings for Trump has an overall lower ratings

relative to the dis-approvals among their peers.

Tidyverse Capability 1.

Description:

Using filter we can select rows of the data frame matching conditions.
filter capability tutorial

pollfilter <-  filter(polls1, samplesize >800 & samplesize < 1992)

head(pollfilter)

## # A tibble: 6 x 22
##   president subgroup modeldate startdate enddate pollster grade samplesize
##   <chr>     <chr>    <chr>     <chr>     <chr>   <chr>    <chr>      <dbl>
## 1 Donald T~ All pol~ 11/5/2019 1/20/2017 1/22/2~ Gallup   B           1500
## 2 Donald T~ All pol~ 11/5/2019 1/21/2017 1/23/2~ Gallup   B           1500
## 3 Donald T~ All pol~ 11/5/2019 1/20/2017 1/24/2~ Ipsos    B-          1632
## 4 Donald T~ All pol~ 11/5/2019 1/22/2017 1/24/2~ Gallup   B           1500
## 5 Donald T~ All pol~ 11/5/2019 1/22/2017 1/24/2~ Rasmuss~ C+          1500
## 6 Donald T~ All pol~ 11/5/2019 1/20/2017 1/25/2~ Quinnip~ B+          1190
## # ... with 14 more variables: population <chr>, weight <dbl>,
## #   influence <dbl>, approve <dbl>, disapprove <dbl>,
## #   adjusted_approve <dbl>, adjusted_disapprove <dbl>,
## #   multiversions <chr>, tracking <lgl>, url <chr>, poll_id <dbl>,
## #   question_id <dbl>, createddate <chr>, timestamp <chr>

Tidyverse Capability 2.

Description:

Using select we can keep the selected variables
select capability tutorial

df<- select(polls1, c("subgroup","pollster","approve", "disapprove"))
head(df)

## # A tibble: 6 x 4
##   subgroup  pollster                                 approve disapprove
##   <chr>     <chr>                                      <dbl>      <dbl>
## 1 All polls Morning Consult                             46         37  
## 2 All polls Gallup                                      45         45  
## 3 All polls Gallup                                      45         46  
## 4 All polls Ipsos                                       42.1       45.2
## 5 All polls Gallup                                      46         45  
## 6 All polls Rasmussen Reports/Pulse Opinion Research    57         43

Part II: Lin li’s code and vignette that Sufian will extend, see below

weather <- read.csv("https://raw.githubusercontent.com/fivethirtyeight/data/master/us-weather-history/KCLT.csv")
head(weather,n=2)

##       date actual_mean_temp actual_min_temp actual_max_temp
## 1 2014-7-1               81              70              91
## 2 2014-7-2               85              74              95
##   average_min_temp average_max_temp record_min_temp record_max_temp
## 1               67               89              56             104
## 2               68               89              56             101
##   record_min_temp_year record_max_temp_year actual_precipitation
## 1                 1919                 2012                    0
## 2                 2008                 1931                    0
##   average_precipitation record_precipitation
## 1                   0.1                 5.91
## 2                   0.1                 1.53

weather2 <- weather %>% separate(date, c("year", "month", "day"), sep = "-")
head(weather2,n=2)

##   year month day actual_mean_temp actual_min_temp actual_max_temp
## 1 2014     7   1               81              70              91
## 2 2014     7   2               85              74              95
##   average_min_temp average_max_temp record_min_temp record_max_temp
## 1               67               89              56             104
## 2               68               89              56             101
##   record_min_temp_year record_max_temp_year actual_precipitation
## 1                 1919                 2012                    0
## 2                 2008                 1931                    0
##   average_precipitation record_precipitation
## 1                   0.1                 5.91
## 2                   0.1                 1.53

weather3 <- select(weather2, year, actual_mean_temp, record_min_temp, record_max_temp, record_precipitation)

head(weather3,n=2)

##   year actual_mean_temp record_min_temp record_max_temp
## 1 2014               81              56             104
## 2 2014               85              56             101
##   record_precipitation
## 1                 5.91
## 2                 1.53

Extending the mutate function as an example, see below

Tidyverse Capability 3 on Lin Li’s vignette.

Description:

Using mutate to create another column based on existing column
For ex: creating a z-score of another column

weather4 <- weather3 %>% 
  mutate( precipitation_z_score = (record_precipitation - mean(record_precipitation)/sd(record_precipitation)))

head(weather4)

##   year actual_mean_temp record_min_temp record_max_temp
## 1 2014               81              56             104
## 2 2014               85              56             101
## 3 2014               82              56              99
## 4 2014               75              55              99
## 5 2014               72              57             100
## 6 2014               74              57              99
##   record_precipitation precipitation_z_score
## 1                 5.91             3.3981049
## 2                 1.53            -0.9818951
## 3                 2.50            -0.0118951
## 4                 2.63             0.1181049
## 5                 1.65            -0.8618951
## 6                 1.95            -0.5618951

Tidyverse Capability 4

Description:

Using top_n function to select and order top n entries

weather5 <- weather4 %>% top_n(actual_mean_temp, n= 5)

(weather5)

##   year actual_mean_temp record_min_temp record_max_temp
## 1 2015               86              53             100
## 2 2015               87              52             100
## 3 2015               87              53             100
## 4 2015               88              55             102
## 5 2015               86              53             102
##   record_precipitation precipitation_z_score
## 1                 2.76             0.2481049
## 2                 1.86            -0.6518951
## 3                 1.82            -0.6918951
## 4                 1.25            -1.2618951
## 5                 2.76             0.2481049