In this assignment, you’ll practice collaborating around a code project with GitHub. You could consider our collective work as building out a book of examples on how to use TidyVerse functions.
GitHub repository: https://github.com/acatlin/FALL2022TIDYVERSE
Your task here is to Create an Example. Using one or more TidyVerse packages, and any dataset from fivethirtyeight.com or Kaggle, create a programming sample “vignette” that demonstrates how to use one or more of the capabilities of the selected TidyVerse package with your selected dataset. (25 points)
You should clone the provided repository. Once you have code to submit, you should make a pull request on the shared repository. You should also update the README.md file with your example.
After you’ve created your vignette, please submit your GitHub handle name in the submission link provided below. This will let your instructor know that your work is ready to be peer-graded. You should complete your submission on the schedule stated in the course syllabus.
I selected the House Rent Prediction dataset from Kaggle for a deep-dive analysis using tidyverse packages.
In this assignment, we look at the readr, dplyr and tidyr packages in the tidyverse. We look in depth at dplyr, which has many useful functions for data manipulation and data cleaning.
library(tidyverse)
data_path <- "https://raw.githubusercontent.com/Naik-Khyati/create_tv/main/data/House_Rent_Dataset.csv"
# read data
house_rent_data <- read_csv(data_path)
## Rows: 4746 Columns: 12
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): Floor, Area Type, Area Locality, City, Furnishing Status, Tenant P...
## dbl (4): BHK, Rent, Size, Bathroom
## date (1): Posted On
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
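The message above names two ways to quiet the column-specification output. As a minimal sketch, assume a small inline CSV with made-up values (not the real dataset), passed through `I()` so that readr treats the string as literal data and the example runs without a network connection. The names `toy_csv`, `toy_rent` and `toy_rent_quiet` are hypothetical.

```r
library(readr)

# Toy CSV text with invented values; I() marks the string as literal data, not a path
toy_csv <- I("Posted On,BHK,Rent,City\n2022-05-18,2,10000,Kolkata\n2022-05-13,3,20000,Mumbai\n")

# Option 1: declare the column types up front so readr does not have to guess them
toy_rent <- read_csv(
  toy_csv,
  col_types = cols(
    `Posted On` = col_date(format = "%Y-%m-%d"),
    BHK  = col_integer(),
    Rent = col_double(),
    City = col_character()
  )
)

# Option 2: keep type guessing but silence the message
toy_rent_quiet <- read_csv(toy_csv, show_col_types = FALSE)
```

Declaring the types explicitly also documents what the analysis expects, so an unexpected change in the source file is more likely to surface as a parsing problem rather than as a silently different column type.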
The dplyr package for R is a grammar of data manipulation: it provides a consistent set of verbs that help solve the most common data manipulation challenges.
The dplyr package makes data manipulation quicker and easier in three ways:
* By constraining the available options, it lets the focus stay on the data manipulation problem itself.
* It provides simple “verbs”, functions that correspond to the most common data manipulation tasks, so thoughts can be translated into code faster.
* It uses efficient backends, so less time is spent waiting for the computer.
In the following examples, we look at some of the most important and frequently used verbs (glimpse, rename, select, filter, arrange, group_by, summarize, mutate) from the dplyr package.
glimpse(house_rent_data)
## Rows: 4,746
## Columns: 12
## $ `Posted On` <date> 2022-05-18, 2022-05-13, 2022-05-16, 2022-07-04, 2…
## $ BHK <dbl> 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 3, 1, 1, 1, 3, 3, 2,…
## $ Rent <dbl> 10000, 20000, 17000, 10000, 7500, 7000, 10000, 500…
## $ Size <dbl> 1100, 800, 1000, 800, 850, 600, 700, 250, 800, 100…
## $ Floor <chr> "Ground out of 2", "1 out of 3", "1 out of 3", "1 …
## $ `Area Type` <chr> "Super Area", "Super Area", "Super Area", "Super A…
## $ `Area Locality` <chr> "Bandel", "Phool Bagan, Kankurgachi", "Salt Lake C…
## $ City <chr> "Kolkata", "Kolkata", "Kolkata", "Kolkata", "Kolka…
## $ `Furnishing Status` <chr> "Unfurnished", "Semi-Furnished", "Semi-Furnished",…
## $ `Tenant Preferred` <chr> "Bachelors/Family", "Bachelors/Family", "Bachelors…
## $ Bathroom <dbl> 2, 1, 1, 1, 1, 2, 2, 1, 2, 2, 2, 1, 1, 1, 2, 2, 1,…
## $ `Point of Contact` <chr> "Contact Owner", "Contact Owner", "Contact Owner",…
# Rename columns that contain spaces to snake_case, then flag listings with rent above 15,000
house_rent_data <- house_rent_data %>%
  rename(
    dt_posted = `Posted On`,
    area_typ = `Area Type`,
    area_loc = `Area Locality`,
    furnishing_status = `Furnishing Status`,
    tenant_preferred = `Tenant Preferred`,
    point_of_contact = `Point of Contact`
  ) %>%
  mutate(flag_rent = ifelse(Rent > 15000, 1, 0))
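The `mutate()` step above derives a binary flag from `Rent`. As a sketch on a toy tibble (values invented for illustration), `if_else()` is dplyr's type-strict counterpart to base `ifelse()`, and `case_when()` extends the same idea to more than two bands. The `rent_band` column and the 50000 cut-off are assumptions made for this example and are not part of the original analysis.

```r
library(dplyr)

# Toy data with invented values; only the columns needed for the example
toy <- tibble(
  `Area Type` = c("Super Area", "Carpet Area", "Super Area"),
  Rent        = c(10000, 20000, 60000)
)

toy_flagged <- toy %>%
  # rename() takes new_name = old_name; backticks quote names containing spaces
  rename(area_typ = `Area Type`) %>%
  mutate(
    # Same rule as the flag above: 1 when rent exceeds 15,000, otherwise 0
    flag_rent = if_else(Rent > 15000, 1, 0),
    # Hypothetical three-way banding; conditions are checked top to bottom
    rent_band = case_when(
      Rent > 50000 ~ "high",
      Rent > 15000 ~ "medium",
      TRUE         ~ "low"
    )
  )
```

Because `case_when()` stops at the first matching condition, the order of the rules matters: the 60000 listing is labelled "high" even though it also satisfies `Rent > 15000`.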
house_rent_data %>%
  select(City, Rent, BHK) %>%
  filter(City == 'Mumbai') %>%
  group_by(BHK) %>%
  summarize(mean_rent = mean(Rent)) %>%
  arrange(desc(mean_rent))
## # A tibble: 5 × 2
## BHK mean_rent
## <dbl> <dbl>
## 1 5 442727.
## 2 4 279110.
## 3 3 122009.
## 4 2 57768.
## 5 1 29219.
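A group mean is easier to interpret when it is reported next to the number of rows behind it. The sketch below applies the same filter, group and summarize pattern to a toy tibble with invented values, adding `n()` and `median()` to the summary. The column names `n_listings` and `median_rent` are hypothetical, and `.groups = "drop"` is used so the result is returned ungrouped.

```r
library(dplyr)

# Toy listings with invented rents
toy <- tibble(
  City = c("Mumbai", "Mumbai", "Mumbai", "Delhi"),
  BHK  = c(1, 1, 2, 2),
  Rent = c(20000, 30000, 60000, 18000)
)

mumbai_summary <- toy %>%
  filter(City == "Mumbai") %>%      # keep one city
  group_by(BHK) %>%                 # one group per bedroom count
  summarize(
    n_listings  = n(),              # rows contributing to each group
    mean_rent   = mean(Rent),
    median_rent = median(Rent),
    .groups = "drop"                # return an ungrouped tibble
  ) %>%
  arrange(desc(mean_rent))          # most expensive group first
```

Comparing `mean_rent` with `median_rent` is one quick way to see whether a few very expensive listings are pulling a group's average upward.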
# Convert data from long to wide
house_rent_data %>%
  select(City, BHK, Rent) %>%
  group_by(City, BHK) %>%
  summarize(mean_rent = mean(Rent)) %>%
  arrange(desc(mean_rent)) %>%
  spread(BHK, mean_rent)
## `summarise()` has grouped output by 'City'. You can override using the
## `.groups` argument.
## # A tibble: 6 × 7
## # Groups: City [6]
## City `1` `2` `3` `4` `5` `6`
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Bangalore 9368. 16122. 61989. 113043. NA NA
## 2 Chennai 8456. 15702. 35742. 96350 75000 170000
## 3 Delhi 11332. 18878. 44142. 117456. 190000 NA
## 4 Hyderabad 9754. 13878. 29338. 95731. 131667. 45000
## 5 Kolkata 6897. 10688. 19667. 26909. 23750 20000
## 6 Mumbai 29219. 57768. 122009. 279110. 442727. NA
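In current tidyr, `spread()` is superseded by `pivot_wider()`, which performs the same long-to-wide reshaping with more explicit arguments. The sketch below uses a toy long-format tibble with invented values; the `bhk_` prefix is an assumption added so the new columns have syntactic names, and `names_sort = TRUE` orders the new columns by key value instead of by first appearance.

```r
library(dplyr)
library(tidyr)

# Toy long-format summary: one row per City/BHK combination (invented values)
toy_long <- tibble(
  City      = c("Kolkata", "Kolkata", "Mumbai"),
  BHK       = c(1, 2, 1),
  mean_rent = c(7000, 11000, 29000)
)

toy_wide <- toy_long %>%
  pivot_wider(
    names_from   = BHK,        # values of BHK become column names
    values_from  = mean_rent,  # cells are filled from mean_rent
    names_prefix = "bhk_",     # hypothetical prefix: bhk_1, bhk_2, ...
    names_sort   = TRUE        # sort new columns by BHK value
  )
```

City and BHK combinations that do not occur in the long data become `NA` in the wide result, which is the same reason some cells in the table above are `NA`.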
This vignette used the Kaggle House Rent dataset. We looked at several functions from the dplyr package (glimpse, rename, select, filter, arrange, group_by, summarize and mutate) and used the tidyr package to convert data from long to wide.