For This tidyverse assignment, I choose to use the following libraries:
I did not do extensive data filtering for this assignment, but rather choose to demostrate the capabilities of the tidyverse libraries.
For my dataset, I choose to use kaggle with a gross avg rent across the US. The dataset is fairly long and has several fields. Not all are useful, but in the TidyVerse section, I use some of the ddplyr libaries to filter out what I need.
rent.data <- read.csv("https://raw.githubusercontent.com/dapolloxp/spring2019tidyverse/master/kaggle_gross_rent.csv")In this section, I’m using some of the tidyverse libaries, such as piping in magrittr and dplyr which make it extremely easy to filter and select fields. I also chose fields using select in ddplyr and group_by to aggregrate information.
I choose to count the number of zip codes with the piping built-into dplyr
sub.rent.data <- rent.data %>% select(State_Name, State_ab, County, City, Place, Type, Primary, Zip_Code, Lat, Lon, Mean, Median, Stdev, Samples)
sub.rent.data$Zip_Code <- as.character(sub.rent.data$Zip_Code)
head(sub.rent.data)## State_Name State_ab County City Place Type
## 1 Alabama AL Chambers County Wadley Abanda CDP
## 2 Alabama AL Winston County Addison Addison Town
## 3 Alabama AL Marshall County Albertville Albertville city City
## 4 Alabama AL Pickens County Aliceville Aliceville city City
## 5 Alabama AL Etowah County Walnut Grove Altoona Town
## 6 Alabama AL Calhoun County Anniston Anniston city City
## Primary Zip_Code Lat Lon Mean Median Stdev Samples
## 1 place 36276 33.09163 -85.52703 972 968 51 12
## 2 place 35540 34.20268 -87.17800 519 460 275 64
## 3 place 35950 34.26313 -86.21066 625 585 234 2560
## 4 place 35442 33.12369 -88.15936 546 438 354 574
## 5 place 35990 34.03920 -86.30570 350 303 185 114
## 6 place 36207 33.67344 -85.81092 600 599 274 3901
#datatable(sub.rent.data)
x <- function(zips)
{
for (i in 1:length(zips))
{
paste0("Processing", zips[i])
if(nchar(zips[i]) == 4)
{
zips[i] <- paste0("0", zips[i], sep = "")
}
else if(nchar(zips[i]) == 3)
{
zips[i] <- paste0("00", zips[i], sep ="")
}
}
return(zips)
}
#sapply(sub.rent.data$Zip_Code, x)
#
sub.rent.data$Zip_Code <- as.factor(sub.rent.data$Zip_Code)
sub.rent.data$Zip_Code %>% fct_count(sort = TRUE)## # A tibble: 17,006 x 2
## f n
## <fct> <int>
## 1 78584 22
## 2 94606 20
## 3 36605 18
## 4 731 18
## 5 35215 17
## 6 35630 17
## 7 87532 17
## 8 11203 16
## 9 35020 16
## 10 36117 16
## # ... with 16,996 more rows
avg.rent.byzip <- sub.rent.data %>% group_by(Zip_Code) %>% summarise(AvgRent=mean(Mean))
avg.rent.bystate <- sub.rent.data %>% group_by(State_Name) %>% summarise(AvgRent=mean(Mean))In this section, I am using ggplot to create a basic scatterplot as it is one of the most useful libraries.
ggplot(avg.rent.byzip, aes(x=Zip_Code, y=AvgRent)) + geom_point()ggplot(avg.rent.bystate, aes(x=avg.rent.bystate$State_Name,y=AvgRent)) + geom_point() + theme(axis.text.x = element_text(angle = 90)) + xlab("State Name") + ylab("Average Rent")Show the original data format
head(sub.rent.data,10)## State_Name State_ab County City Place Type
## 1 Alabama AL Chambers County Wadley Abanda CDP
## 2 Alabama AL Winston County Addison Addison Town
## 3 Alabama AL Marshall County Albertville Albertville city City
## 4 Alabama AL Pickens County Aliceville Aliceville city City
## 5 Alabama AL Etowah County Walnut Grove Altoona Town
## 6 Alabama AL Calhoun County Anniston Anniston city City
## 7 Alabama AL Limestone County Ardmore Ardmore Town
## 8 Alabama AL Dale County Ariton Ariton Town
## 9 Alabama AL Clay County Ashland Ashland Town
## 10 Alabama AL Escambia County Atmore Atmore city City
## Primary Zip_Code Lat Lon Mean Median Stdev Samples
## 1 place 36276 33.09163 -85.52703 972 968 51 12
## 2 place 35540 34.20268 -87.17800 519 460 275 64
## 3 place 35950 34.26313 -86.21066 625 585 234 2560
## 4 place 35442 33.12369 -88.15936 546 438 354 574
## 5 place 35990 34.03920 -86.30570 350 303 185 114
## 6 place 36207 33.67344 -85.81092 600 599 274 3901
## 7 place 35739 34.98784 -86.82902 581 557 283 215
## 8 place 36311 31.59777 -85.71306 581 539 357 45
## 9 place 36251 33.26989 -85.83371 433 411 237 476
## 10 place 36504 31.12794 -87.45764 556 554 227 1109
sample(head(sub.rent.data,20),10)## Lat Place City Samples Median Type
## 1 33.09163 Abanda Wadley 12 968 CDP
## 2 34.20268 Addison Addison 64 460 Town
## 3 34.26313 Albertville city Albertville 2560 585 City
## 4 33.12369 Aliceville city Aliceville 574 438 City
## 5 34.03920 Altoona Walnut Grove 114 303 Town
## 6 33.67344 Anniston city Anniston 3901 599 City
## 7 34.98784 Ardmore Ardmore 215 557 Town
## 8 31.59777 Ariton Ariton 45 539 Town
## 9 33.26989 Ashland Ashland 476 411 Town
## 10 31.12794 Atmore city Atmore 1109 554 City
## 11 32.43256 Autaugaville Autaugaville 71 628 Town
## 12 31.30208 Babbie Opp 40 705 Town
## 13 34.02305 Ballplay Hokes Bluff 24 445 CDP
## 14 30.40759 Bayou La Batre city Bayou La Batre 290 631 City
## 15 33.93541 Beaverton Guin 15 418 Town
## 16 32.46366 Bellamy York 46 328 CDP
## 17 33.66670 Berry Berry 166 244 Town
## 18 33.52744 Birmingham city Birmingham 44449 722 City
## 19 34.07573 Blountsville Blountsville 296 498 Town
## 20 34.19902 Boaz city Boaz 44 845 City
## County Stdev Zip_Code Primary
## 1 Chambers County 51 36276 place
## 2 Winston County 275 35540 place
## 3 Marshall County 234 35950 place
## 4 Pickens County 354 35442 place
## 5 Etowah County 185 35990 place
## 6 Calhoun County 274 36207 place
## 7 Limestone County 283 35739 place
## 8 Dale County 357 36311 place
## 9 Clay County 237 36251 place
## 10 Escambia County 227 36504 place
## 11 Autauga County 235 36003 place
## 12 Covington County 155 36467 place
## 13 Etowah County 126 35903 place
## 14 Mobile County 296 36509 place
## 15 Lamar County 154 35563 place
## 16 Sumter County 263 36925 place
## 17 Fayette County 223 35546 place
## 18 Jefferson County 352 35203 place
## 19 Blount County 272 35031 place
## 20 Etowah County 80 35957 place
#DT::datatable(sub.rent.data, options = list(pagelength=5))
ggplot(sub.rent.data, aes(x=sub.rent.data$State_Name, y=sub.rent.data$Median, color = sub.rent.data$Type)) +
geom_point(alpha = 0.2) +
theme(axis.text.x = element_text(angle = 45, hjust = 1))Using the nest function from tidyr package to group the rents by state, county and city to form a tibble, a list, column
sub.rent.data.nested <- sub.rent.data %>%
tidyr::nest(-c(State_Name, State_ab, County, City))
head(sub.rent.data.nested,10)FALSE # A tibble: 10 x 5
FALSE State_Name State_ab County City data
FALSE <fct> <fct> <fct> <fct> <list>
FALSE 1 Alabama AL Chambers County Wadley <tibble [1 x 10]>
FALSE 2 Alabama AL Winston County Addison <tibble [1 x 10]>
FALSE 3 Alabama AL Marshall County Albertville <tibble [1 x 10]>
FALSE 4 Alabama AL Pickens County Aliceville <tibble [2 x 10]>
FALSE 5 Alabama AL Etowah County Walnut Grove <tibble [2 x 10]>
FALSE 6 Alabama AL Calhoun County Anniston <tibble [2 x 10]>
FALSE 7 Alabama AL Limestone County Ardmore <tibble [1 x 10]>
FALSE 8 Alabama AL Dale County Ariton <tibble [1 x 10]>
FALSE 9 Alabama AL Clay County Ashland <tibble [2 x 10]>
FALSE 10 Alabama AL Escambia County Atmore <tibble [1 x 10]>
DT::datatable(sub.rent.data.nested, options = list(pagelength=5))Using the unnest function from tidyr to ungroup the tibble, a list column, rents back to original form state, county and city
sub.rent.data.unnested <- sub.rent.data.nested %>%
tidyr::unnest(data)
head(sub.rent.data.unnested,10)## # A tibble: 10 x 14
## State_Name State_ab County City Place Type Primary Zip_Code Lat
## <fct> <fct> <fct> <fct> <fct> <fct> <fct> <fct> <dbl>
## 1 Alabama AL Chamb~ Wadl~ Aban~ CDP place 36276 33.1
## 2 Alabama AL Winst~ Addi~ Addi~ Town place 35540 34.2
## 3 Alabama AL Marsh~ Albe~ Albe~ City place 35950 34.3
## 4 Alabama AL Picke~ Alic~ Alic~ City place 35442 33.1
## 5 Alabama AL Picke~ Alic~ Memp~ Town place 35442 33.1
## 6 Alabama AL Etowa~ Waln~ Alto~ Town place 35990 34.0
## 7 Alabama AL Etowa~ Waln~ Waln~ Town place 35990 34.1
## 8 Alabama AL Calho~ Anni~ Anni~ City place 36207 33.7
## 9 Alabama AL Calho~ Anni~ Saks CDP place 36206 33.7
## 10 Alabama AL Limes~ Ardm~ Ardm~ Town place 35739 35.0
## # ... with 5 more variables: Lon <dbl>, Mean <int>, Median <int>,
## # Stdev <int>, Samples <int>
DT::datatable(sub.rent.data.unnested, options = list(pagelength=5))## Warning in instance$preRenderHook(instance): It seems your data is too
## big for client-side DataTables. You may consider server-side processing:
## https://rstudio.github.io/DT/server.html
cluster <- create_cluster(cores = detectCores())
slow_func <- function(df) {
slow <- df %>%
group_by(State_Name, County, City) %>%
filter(dplyr::n_distinct(Place) > 1)
}
fast_func <- function(df) {
fast <- df %>%
partition(State_Name, County, City, cluster = cluster) %>%
group_by(State_Name, County, City) %>%
filter(dplyr::n_distinct(Place) > 1) %>%
collect()
}
microbenchmark(slow_func(sub.rent.data), fast_func(sub.rent.data))FALSE Unit: seconds
FALSE expr min lq mean median uq
FALSE slow_func(sub.rent.data) 2.510346 2.643581 2.741582 2.697463 2.782313
FALSE fast_func(sub.rent.data) 1.097413 1.125423 1.166321 1.140598 1.165179
FALSE max neval cld
FALSE 3.880706 100 b
FALSE 1.948583 100 a
As can be seen above, implementing parallel processing on the high volume data we could achieve 50% better performance from underpowered client-side processing. We will see significant performance improvement f we choose server-side processing with high CPU cores and memory at our disposable.