For This tidyverse assignment, I choose to use the following libraries:
I did not do extensive data filtering for this assignment, but rather choose to demostrate the capabilities of the tidyverse libraries.
For my dataset, I choose to use kaggle with a gross avg rent across the US. The dataset is fairly long and has several fields. Not all are useful, but in the TidyVerse section, I use some of the ddplyr libaries to filter out what I need.
rent.data <- read.csv("https://raw.githubusercontent.com/dapolloxp/spring2019tidyverse/master/kaggle_gross_rent.csv")In this section, I’m using some of the tidyverse libaries, such as piping in magrittr and dplyr which make it extremely easy to filter and select fields. I also chose fields using select in ddplyr and group_by to aggregrate information.
I choose to count the number of zip codes with the piping built-into dplyr
sub.rent.data <- rent.data %>% select(State_Name, State_ab, County, City, Place, Type, Primary, Zip_Code, Lat, Lon, Mean, Median, Stdev, Samples)
sub.rent.data$Zip_Code <- as.character(sub.rent.data$Zip_Code)
head(sub.rent.data)## State_Name State_ab County City Place Type
## 1 Alabama AL Chambers County Wadley Abanda CDP
## 2 Alabama AL Winston County Addison Addison Town
## 3 Alabama AL Marshall County Albertville Albertville city City
## 4 Alabama AL Pickens County Aliceville Aliceville city City
## 5 Alabama AL Etowah County Walnut Grove Altoona Town
## 6 Alabama AL Calhoun County Anniston Anniston city City
## Primary Zip_Code Lat Lon Mean Median Stdev Samples
## 1 place 36276 33.09163 -85.52703 972 968 51 12
## 2 place 35540 34.20268 -87.17800 519 460 275 64
## 3 place 35950 34.26313 -86.21066 625 585 234 2560
## 4 place 35442 33.12369 -88.15936 546 438 354 574
## 5 place 35990 34.03920 -86.30570 350 303 185 114
## 6 place 36207 33.67344 -85.81092 600 599 274 3901
#datatable(sub.rent.data)
x <- function(zips)
{
for (i in 1:length(zips))
{
paste0("Processing", zips[i])
if(nchar(zips[i]) == 4)
{
zips[i] <- paste0("0", zips[i], sep = "")
}
else if(nchar(zips[i]) == 3)
{
zips[i] <- paste0("00", zips[i], sep ="")
}
}
return(zips)
}
#sapply(sub.rent.data$Zip_Code, x)
#
sub.rent.data$Zip_Code <- as.factor(sub.rent.data$Zip_Code)
sub.rent.data$Zip_Code %>% fct_count(sort = TRUE)## # A tibble: 17,006 x 2
## f n
## <fct> <int>
## 1 78584 22
## 2 94606 20
## 3 36605 18
## 4 731 18
## 5 35215 17
## 6 35630 17
## 7 87532 17
## 8 11203 16
## 9 35020 16
## 10 36117 16
## # ... with 16,996 more rows
avg.rent.byzip <- sub.rent.data %>% group_by(Zip_Code) %>% summarise(AvgRent=mean(Mean))
avg.rent.bystate <- sub.rent.data %>% group_by(State_Name) %>% summarise(AvgRent=mean(Mean))In this section, I am using ggplot to create a basic scatterplot as it is one of the most useful libraries.
ggplot(avg.rent.byzip, aes(x=Zip_Code, y=AvgRent)) + geom_point()ggplot(avg.rent.bystate, aes(x=avg.rent.bystate$State_Name,y=AvgRent)) + geom_point() + theme(axis.text.x = element_text(angle = 90)) + xlab("State Name") + ylab("Average Rent")Show the original data format
head(sub.rent.data,10)## State_Name State_ab County City Place Type
## 1 Alabama AL Chambers County Wadley Abanda CDP
## 2 Alabama AL Winston County Addison Addison Town
## 3 Alabama AL Marshall County Albertville Albertville city City
## 4 Alabama AL Pickens County Aliceville Aliceville city City
## 5 Alabama AL Etowah County Walnut Grove Altoona Town
## 6 Alabama AL Calhoun County Anniston Anniston city City
## 7 Alabama AL Limestone County Ardmore Ardmore Town
## 8 Alabama AL Dale County Ariton Ariton Town
## 9 Alabama AL Clay County Ashland Ashland Town
## 10 Alabama AL Escambia County Atmore Atmore city City
## Primary Zip_Code Lat Lon Mean Median Stdev Samples
## 1 place 36276 33.09163 -85.52703 972 968 51 12
## 2 place 35540 34.20268 -87.17800 519 460 275 64
## 3 place 35950 34.26313 -86.21066 625 585 234 2560
## 4 place 35442 33.12369 -88.15936 546 438 354 574
## 5 place 35990 34.03920 -86.30570 350 303 185 114
## 6 place 36207 33.67344 -85.81092 600 599 274 3901
## 7 place 35739 34.98784 -86.82902 581 557 283 215
## 8 place 36311 31.59777 -85.71306 581 539 357 45
## 9 place 36251 33.26989 -85.83371 433 411 237 476
## 10 place 36504 31.12794 -87.45764 556 554 227 1109
sample(head(sub.rent.data,20),10)## Median Primary State_Name Zip_Code Lat Samples Place
## 1 968 place Alabama 36276 33.09163 12 Abanda
## 2 460 place Alabama 35540 34.20268 64 Addison
## 3 585 place Alabama 35950 34.26313 2560 Albertville city
## 4 438 place Alabama 35442 33.12369 574 Aliceville city
## 5 303 place Alabama 35990 34.03920 114 Altoona
## 6 599 place Alabama 36207 33.67344 3901 Anniston city
## 7 557 place Alabama 35739 34.98784 215 Ardmore
## 8 539 place Alabama 36311 31.59777 45 Ariton
## 9 411 place Alabama 36251 33.26989 476 Ashland
## 10 554 place Alabama 36504 31.12794 1109 Atmore city
## 11 628 place Alabama 36003 32.43256 71 Autaugaville
## 12 705 place Alabama 36467 31.30208 40 Babbie
## 13 445 place Alabama 35903 34.02305 24 Ballplay
## 14 631 place Alabama 36509 30.40759 290 Bayou La Batre city
## 15 418 place Alabama 35563 33.93541 15 Beaverton
## 16 328 place Alabama 36925 32.46366 46 Bellamy
## 17 244 place Alabama 35546 33.66670 166 Berry
## 18 722 place Alabama 35203 33.52744 44449 Birmingham city
## 19 498 place Alabama 35031 34.07573 296 Blountsville
## 20 845 place Alabama 35957 34.19902 44 Boaz city
## State_ab Stdev City
## 1 AL 51 Wadley
## 2 AL 275 Addison
## 3 AL 234 Albertville
## 4 AL 354 Aliceville
## 5 AL 185 Walnut Grove
## 6 AL 274 Anniston
## 7 AL 283 Ardmore
## 8 AL 357 Ariton
## 9 AL 237 Ashland
## 10 AL 227 Atmore
## 11 AL 235 Autaugaville
## 12 AL 155 Opp
## 13 AL 126 Hokes Bluff
## 14 AL 296 Bayou La Batre
## 15 AL 154 Guin
## 16 AL 263 York
## 17 AL 223 Berry
## 18 AL 352 Birmingham
## 19 AL 272 Blountsville
## 20 AL 80 Boaz
#DT::datatable(sub.rent.data, options = list(pagelength=5))
ggplot(sub.rent.data, aes(x=sub.rent.data$State_Name, y=sub.rent.data$Median, color = sub.rent.data$Type)) +
geom_point(alpha = 0.2) +
theme(axis.text.x = element_text(angle = 45, hjust = 1))Using the nest function from tidyr package to group the rents by state, county and city to form a tibble, a list, column
sub.rent.data.nested <- sub.rent.data %>%
tidyr::nest(-c(State_Name, State_ab, County, City))
head(sub.rent.data.nested,10)FALSE # A tibble: 10 x 5
FALSE State_Name State_ab County City data
FALSE <fct> <fct> <fct> <fct> <list>
FALSE 1 Alabama AL Chambers County Wadley <tibble [1 x 10]>
FALSE 2 Alabama AL Winston County Addison <tibble [1 x 10]>
FALSE 3 Alabama AL Marshall County Albertville <tibble [1 x 10]>
FALSE 4 Alabama AL Pickens County Aliceville <tibble [2 x 10]>
FALSE 5 Alabama AL Etowah County Walnut Grove <tibble [2 x 10]>
FALSE 6 Alabama AL Calhoun County Anniston <tibble [2 x 10]>
FALSE 7 Alabama AL Limestone County Ardmore <tibble [1 x 10]>
FALSE 8 Alabama AL Dale County Ariton <tibble [1 x 10]>
FALSE 9 Alabama AL Clay County Ashland <tibble [2 x 10]>
FALSE 10 Alabama AL Escambia County Atmore <tibble [1 x 10]>
DT::datatable(sub.rent.data.nested, options = list(pagelength=5))Using the unnest function from tidyr to ungroup the tibble, a list column, rents back to original form state, county and city
sub.rent.data.unnested <- sub.rent.data.nested %>%
tidyr::unnest(data)
head(sub.rent.data.unnested,10)## # A tibble: 10 x 14
## State_Name State_ab County City Place Type Primary Zip_Code Lat
## <fct> <fct> <fct> <fct> <fct> <fct> <fct> <fct> <dbl>
## 1 Alabama AL Chamb~ Wadl~ Aban~ CDP place 36276 33.1
## 2 Alabama AL Winst~ Addi~ Addi~ Town place 35540 34.2
## 3 Alabama AL Marsh~ Albe~ Albe~ City place 35950 34.3
## 4 Alabama AL Picke~ Alic~ Alic~ City place 35442 33.1
## 5 Alabama AL Picke~ Alic~ Memp~ Town place 35442 33.1
## 6 Alabama AL Etowa~ Waln~ Alto~ Town place 35990 34.0
## 7 Alabama AL Etowa~ Waln~ Waln~ Town place 35990 34.1
## 8 Alabama AL Calho~ Anni~ Anni~ City place 36207 33.7
## 9 Alabama AL Calho~ Anni~ Saks CDP place 36206 33.7
## 10 Alabama AL Limes~ Ardm~ Ardm~ Town place 35739 35.0
## # ... with 5 more variables: Lon <dbl>, Mean <int>, Median <int>,
## # Stdev <int>, Samples <int>
DT::datatable(sub.rent.data.unnested, options = list(pagelength=5))## Warning in instance$preRenderHook(instance): It seems your data is too
## big for client-side DataTables. You may consider server-side processing:
## https://rstudio.github.io/DT/server.html