Introduction

For This tidyverse assignment, I choose to use the following libraries:

  • dplyr
  • ggplot2
  • magrittr
  • forcats

I did not do extensive data filtering for this assignment, but rather choose to demostrate the capabilities of the tidyverse libraries.

Data Pull

For my dataset, I choose to use kaggle with a gross avg rent across the US. The dataset is fairly long and has several fields. Not all are useful, but in the TidyVerse section, I use some of the ddplyr libaries to filter out what I need.

rent.data <- read.csv("https://raw.githubusercontent.com/dapolloxp/spring2019tidyverse/master/kaggle_gross_rent.csv")

TidyVerse Libraries

In this section, I’m using some of the tidyverse libaries, such as piping in magrittr and dplyr which make it extremely easy to filter and select fields. I also chose fields using select in ddplyr and group_by to aggregrate information.

I choose to count the number of zip codes with the piping built-into dplyr

sub.rent.data <- rent.data %>% select(State_Name, State_ab, County, City, Place, Type, Primary, Zip_Code, Lat, Lon, Mean, Median, Stdev, Samples)

sub.rent.data$Zip_Code <- as.character(sub.rent.data$Zip_Code) 

head(sub.rent.data)
##   State_Name State_ab          County         City            Place Type
## 1    Alabama       AL Chambers County       Wadley           Abanda  CDP
## 2    Alabama       AL  Winston County      Addison          Addison Town
## 3    Alabama       AL Marshall County  Albertville Albertville city City
## 4    Alabama       AL  Pickens County   Aliceville  Aliceville city City
## 5    Alabama       AL   Etowah County Walnut Grove          Altoona Town
## 6    Alabama       AL  Calhoun County     Anniston    Anniston city City
##   Primary Zip_Code      Lat       Lon Mean Median Stdev Samples
## 1   place    36276 33.09163 -85.52703  972    968    51      12
## 2   place    35540 34.20268 -87.17800  519    460   275      64
## 3   place    35950 34.26313 -86.21066  625    585   234    2560
## 4   place    35442 33.12369 -88.15936  546    438   354     574
## 5   place    35990 34.03920 -86.30570  350    303   185     114
## 6   place    36207 33.67344 -85.81092  600    599   274    3901
#datatable(sub.rent.data)

x <- function(zips)
{
  
  for (i in 1:length(zips)) 
  {
  paste0("Processing", zips[i])
  if(nchar(zips[i]) == 4)
    {
      zips[i] <- paste0("0", zips[i], sep = "")
    }
  
  else if(nchar(zips[i]) == 3)
    {
      zips[i] <- paste0("00", zips[i], sep ="")
    }
  }
  return(zips)
}
#sapply(sub.rent.data$Zip_Code, x)

#

sub.rent.data$Zip_Code <- as.factor(sub.rent.data$Zip_Code) 

sub.rent.data$Zip_Code %>% fct_count(sort = TRUE)
## # A tibble: 17,006 x 2
##    f         n
##    <fct> <int>
##  1 78584    22
##  2 94606    20
##  3 36605    18
##  4 731      18
##  5 35215    17
##  6 35630    17
##  7 87532    17
##  8 11203    16
##  9 35020    16
## 10 36117    16
## # ... with 16,996 more rows
avg.rent.byzip <- sub.rent.data %>% group_by(Zip_Code) %>% summarise(AvgRent=mean(Mean))

avg.rent.bystate <- sub.rent.data %>% group_by(State_Name) %>% summarise(AvgRent=mean(Mean))

Graphing

In this section, I am using ggplot to create a basic scatterplot as it is one of the most useful libraries.

ggplot(avg.rent.byzip, aes(x=Zip_Code, y=AvgRent)) + geom_point()

ggplot(avg.rent.bystate, aes(x=avg.rent.bystate$State_Name,y=AvgRent)) + geom_point() + theme(axis.text.x = element_text(angle = 90)) + xlab("State Name") + ylab("Average Rent")


Addition by Debabrata Kabiraj

Original

Show the original data format

head(sub.rent.data,10)
##    State_Name State_ab           County         City            Place Type
## 1     Alabama       AL  Chambers County       Wadley           Abanda  CDP
## 2     Alabama       AL   Winston County      Addison          Addison Town
## 3     Alabama       AL  Marshall County  Albertville Albertville city City
## 4     Alabama       AL   Pickens County   Aliceville  Aliceville city City
## 5     Alabama       AL    Etowah County Walnut Grove          Altoona Town
## 6     Alabama       AL   Calhoun County     Anniston    Anniston city City
## 7     Alabama       AL Limestone County      Ardmore          Ardmore Town
## 8     Alabama       AL      Dale County       Ariton           Ariton Town
## 9     Alabama       AL      Clay County      Ashland          Ashland Town
## 10    Alabama       AL  Escambia County       Atmore      Atmore city City
##    Primary Zip_Code      Lat       Lon Mean Median Stdev Samples
## 1    place    36276 33.09163 -85.52703  972    968    51      12
## 2    place    35540 34.20268 -87.17800  519    460   275      64
## 3    place    35950 34.26313 -86.21066  625    585   234    2560
## 4    place    35442 33.12369 -88.15936  546    438   354     574
## 5    place    35990 34.03920 -86.30570  350    303   185     114
## 6    place    36207 33.67344 -85.81092  600    599   274    3901
## 7    place    35739 34.98784 -86.82902  581    557   283     215
## 8    place    36311 31.59777 -85.71306  581    539   357      45
## 9    place    36251 33.26989 -85.83371  433    411   237     476
## 10   place    36504 31.12794 -87.45764  556    554   227    1109
sample(head(sub.rent.data,20),10)
##         Lat               Place           City Samples Median Type
## 1  33.09163              Abanda         Wadley      12    968  CDP
## 2  34.20268             Addison        Addison      64    460 Town
## 3  34.26313    Albertville city    Albertville    2560    585 City
## 4  33.12369     Aliceville city     Aliceville     574    438 City
## 5  34.03920             Altoona   Walnut Grove     114    303 Town
## 6  33.67344       Anniston city       Anniston    3901    599 City
## 7  34.98784             Ardmore        Ardmore     215    557 Town
## 8  31.59777              Ariton         Ariton      45    539 Town
## 9  33.26989             Ashland        Ashland     476    411 Town
## 10 31.12794         Atmore city         Atmore    1109    554 City
## 11 32.43256        Autaugaville   Autaugaville      71    628 Town
## 12 31.30208              Babbie            Opp      40    705 Town
## 13 34.02305            Ballplay    Hokes Bluff      24    445  CDP
## 14 30.40759 Bayou La Batre city Bayou La Batre     290    631 City
## 15 33.93541           Beaverton           Guin      15    418 Town
## 16 32.46366             Bellamy           York      46    328  CDP
## 17 33.66670               Berry          Berry     166    244 Town
## 18 33.52744     Birmingham city     Birmingham   44449    722 City
## 19 34.07573        Blountsville   Blountsville     296    498 Town
## 20 34.19902           Boaz city           Boaz      44    845 City
##              County Stdev Zip_Code Primary
## 1   Chambers County    51    36276   place
## 2    Winston County   275    35540   place
## 3   Marshall County   234    35950   place
## 4    Pickens County   354    35442   place
## 5     Etowah County   185    35990   place
## 6    Calhoun County   274    36207   place
## 7  Limestone County   283    35739   place
## 8       Dale County   357    36311   place
## 9       Clay County   237    36251   place
## 10  Escambia County   227    36504   place
## 11   Autauga County   235    36003   place
## 12 Covington County   155    36467   place
## 13    Etowah County   126    35903   place
## 14    Mobile County   296    36509   place
## 15     Lamar County   154    35563   place
## 16    Sumter County   263    36925   place
## 17   Fayette County   223    35546   place
## 18 Jefferson County   352    35203   place
## 19    Blount County   272    35031   place
## 20    Etowah County    80    35957   place
#DT::datatable(sub.rent.data, options = list(pagelength=5))

ggplot(sub.rent.data, aes(x=sub.rent.data$State_Name, y=sub.rent.data$Median, color = sub.rent.data$Type)) +
    geom_point(alpha = 0.2) +
    theme(axis.text.x = element_text(angle = 45, hjust = 1))

Nest

Using the nest function from tidyr package to group the rents by state, county and city to form a tibble, a list, column

sub.rent.data.nested <- sub.rent.data %>% 
                          tidyr::nest(-c(State_Name, State_ab, County, City)) 
head(sub.rent.data.nested,10)
FALSE # A tibble: 10 x 5
FALSE    State_Name State_ab County           City         data             
FALSE    <fct>      <fct>    <fct>            <fct>        <list>           
FALSE  1 Alabama    AL       Chambers County  Wadley       <tibble [1 x 10]>
FALSE  2 Alabama    AL       Winston County   Addison      <tibble [1 x 10]>
FALSE  3 Alabama    AL       Marshall County  Albertville  <tibble [1 x 10]>
FALSE  4 Alabama    AL       Pickens County   Aliceville   <tibble [2 x 10]>
FALSE  5 Alabama    AL       Etowah County    Walnut Grove <tibble [2 x 10]>
FALSE  6 Alabama    AL       Calhoun County   Anniston     <tibble [2 x 10]>
FALSE  7 Alabama    AL       Limestone County Ardmore      <tibble [1 x 10]>
FALSE  8 Alabama    AL       Dale County      Ariton       <tibble [1 x 10]>
FALSE  9 Alabama    AL       Clay County      Ashland      <tibble [2 x 10]>
FALSE 10 Alabama    AL       Escambia County  Atmore       <tibble [1 x 10]>
DT::datatable(sub.rent.data.nested, options = list(pagelength=5))

Unnest

Using the unnest function from tidyr to ungroup the tibble, a list column, rents back to original form state, county and city

sub.rent.data.unnested <- sub.rent.data.nested %>%
                            tidyr::unnest(data)
head(sub.rent.data.unnested,10)
## # A tibble: 10 x 14
##    State_Name State_ab County City  Place Type  Primary Zip_Code   Lat
##    <fct>      <fct>    <fct>  <fct> <fct> <fct> <fct>   <fct>    <dbl>
##  1 Alabama    AL       Chamb~ Wadl~ Aban~ CDP   place   36276     33.1
##  2 Alabama    AL       Winst~ Addi~ Addi~ Town  place   35540     34.2
##  3 Alabama    AL       Marsh~ Albe~ Albe~ City  place   35950     34.3
##  4 Alabama    AL       Picke~ Alic~ Alic~ City  place   35442     33.1
##  5 Alabama    AL       Picke~ Alic~ Memp~ Town  place   35442     33.1
##  6 Alabama    AL       Etowa~ Waln~ Alto~ Town  place   35990     34.0
##  7 Alabama    AL       Etowa~ Waln~ Waln~ Town  place   35990     34.1
##  8 Alabama    AL       Calho~ Anni~ Anni~ City  place   36207     33.7
##  9 Alabama    AL       Calho~ Anni~ Saks  CDP   place   36206     33.7
## 10 Alabama    AL       Limes~ Ardm~ Ardm~ Town  place   35739     35.0
## # ... with 5 more variables: Lon <dbl>, Mean <int>, Median <int>,
## #   Stdev <int>, Samples <int>
DT::datatable(sub.rent.data.unnested, options = list(pagelength=5))
## Warning in instance$preRenderHook(instance): It seems your data is too
## big for client-side DataTables. You may consider server-side processing:
## https://rstudio.github.io/DT/server.html

Benchmarking Parallel Processing

cluster <- create_cluster(cores = detectCores())

slow_func <- function(df) {
  slow <- df %>%
            group_by(State_Name, County, City) %>%
            filter(dplyr::n_distinct(Place) > 1)
}

fast_func <- function(df) {
  fast <- df %>%
            partition(State_Name, County, City, cluster = cluster) %>%
            group_by(State_Name, County, City) %>%
            filter(dplyr::n_distinct(Place) > 1) %>%
            collect()
}

microbenchmark(slow_func(sub.rent.data), fast_func(sub.rent.data))
FALSE Unit: seconds
FALSE                      expr      min       lq     mean   median       uq
FALSE  slow_func(sub.rent.data) 2.510346 2.643581 2.741582 2.697463 2.782313
FALSE  fast_func(sub.rent.data) 1.097413 1.125423 1.166321 1.140598 1.165179
FALSE       max neval cld
FALSE  3.880706   100   b
FALSE  1.948583   100  a

As can be seen above, implementing parallel processing on the high volume data we could achieve 50% better performance from underpowered client-side processing. We will see significant performance improvement f we choose server-side processing with high CPU cores and memory at our disposable.