In this assignment, you’ll practice collaborating around a code project with GitHub. You could consider our collective work as building out a book of examples on how to use TidyVerse functions.

GitHub repository: https://github.com/acatlin/FALL2022TIDYVERSE

Your task here is to Create an Example. Using one or more TidyVerse packages, and any dataset from fivethirtyeight.com or Kaggle, create a programming sample “vignette” that demonstrates how to use one or more of the capabilities of the selected TidyVerse package with your selected dataset. (25 points)

You should clone the provided repository. Once you have code to submit, you should make a pull request on the shared repository. You should also update the README.md file with your example.

After you’ve created your vignette, please submit your GitHub handle name in the submission link provided below. This will let your instructor know that your work is ready to be peer-graded. You should complete your submission on the schedule stated in the course syllabus.

Solution

I selected the House Rent Prediction dataset from Kaggle for a deep-dive analysis using tidyverse packages.

In this assignment, we look at the readr, dplyr and tidyr packages in the tidyverse. We focus mainly on dplyr, which has many useful functions for data manipulation and data cleaning.

Load libraries

library(tidyverse)

readr package in tidyverse

data_path <- "https://raw.githubusercontent.com/Naik-Khyati/create_tv/main/data/House_Rent_Dataset.csv"

# read data
house_rent_data <- read_csv(data_path)
## Rows: 4746 Columns: 12
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): Floor, Area Type, Area Locality, City, Furnishing Status, Tenant P...
## dbl  (4): BHK, Rent, Size, Bathroom
## date (1): Posted On
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
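The message above suggests specifying the column types. A minimal sketch of what that could look like follows, using a small inline CSV rather than the real file. The rows in `csv_text` are made up for illustration, and the choice of `col_integer()` for BHK is an assumption, not something the original analysis does.

```r
library(readr)

# Toy CSV text (hypothetical rows, not taken from the dataset)
csv_text <- "Posted On,BHK,Rent,City\n2022-05-18,2,10000,Kolkata\n2022-05-13,3,20000,Mumbai\n"

toy <- read_csv(
  I(csv_text),  # I() marks the string as literal data rather than a file path
  col_types = cols(
    `Posted On` = col_date(format = "%Y-%m-%d"),
    BHK         = col_integer(),   # assumption: bedroom counts are whole numbers
    Rent        = col_double(),
    City        = col_character()
  )
)

# Inspect the column specification that was applied
spec(toy)
```

Because `col_types` is supplied, readr has nothing to guess and does not print the column specification message. The same `cols()` call could be passed when reading `data_path`, extended to cover the remaining columns.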

dplyr package in tidyverse

The dplyr package provides a grammar of data manipulation: a consistent set of verbs that help solve the most common data manipulation problems.

dplyr makes data manipulation quicker and easier in three ways:
* By limiting the choices available, it lets you focus on the data manipulation problem itself.
* It provides simple “verbs”, functions that correspond to the most common data manipulation tasks, so that thoughts can be translated into code faster.
* It uses efficient backends, so less time is spent waiting for the computer.

In the following examples, we look at some of the most important and frequently used verbs (glimpse, rename, select, filter, arrange, group_by, summarize, mutate) from the dplyr package.

glimpse(house_rent_data)
## Rows: 4,746
## Columns: 12
## $ `Posted On`         <date> 2022-05-18, 2022-05-13, 2022-05-16, 2022-07-04, 2…
## $ BHK                 <dbl> 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 3, 1, 1, 1, 3, 3, 2,…
## $ Rent                <dbl> 10000, 20000, 17000, 10000, 7500, 7000, 10000, 500…
## $ Size                <dbl> 1100, 800, 1000, 800, 850, 600, 700, 250, 800, 100…
## $ Floor               <chr> "Ground out of 2", "1 out of 3", "1 out of 3", "1 …
## $ `Area Type`         <chr> "Super Area", "Super Area", "Super Area", "Super A…
## $ `Area Locality`     <chr> "Bandel", "Phool Bagan, Kankurgachi", "Salt Lake C…
## $ City                <chr> "Kolkata", "Kolkata", "Kolkata", "Kolkata", "Kolka…
## $ `Furnishing Status` <chr> "Unfurnished", "Semi-Furnished", "Semi-Furnished",…
## $ `Tenant Preferred`  <chr> "Bachelors/Family", "Bachelors/Family", "Bachelors…
## $ Bathroom            <dbl> 2, 1, 1, 1, 1, 2, 2, 1, 2, 2, 2, 1, 1, 1, 2, 2, 1,…
## $ `Point of Contact`  <chr> "Contact Owner", "Contact Owner", "Contact Owner",…

# Rename columns that contain spaces, then flag listings with rent above 15000
house_rent_data <- house_rent_data %>%
  rename(
    dt_posted = `Posted On`,
    area_typ = `Area Type`,
    area_loc = `Area Locality`,
    furnishing_status = `Furnishing Status`,
    tenant_preferred = `Tenant Preferred`,
    point_of_contact = `Point of Contact`
  ) %>%
  mutate(flag_rent = ifelse(Rent > 15000, 1, 0))
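The rent flag is built with base R's `ifelse()`. As a hedged alternative, dplyr's own `if_else()` and `case_when()` could be used for the same kind of derived column. The sketch below runs on a made-up tibble (`toy`), and the `rent_band` column and its labels are hypothetical names introduced only for illustration.

```r
library(dplyr)

# Toy data (hypothetical values), including a missing rent
toy <- tibble(Rent = c(7000, 15000, 20000, NA))

toy_flagged <- toy %>%
  mutate(
    # if_else() checks that both outcomes have the same type
    flag_rent = if_else(Rent > 15000, 1, 0),
    # case_when() extends the idea to more than two outcomes
    rent_band = case_when(
      Rent > 15000  ~ "above 15000",
      Rent <= 15000 ~ "15000 or below",
      TRUE          ~ NA_character_   # missing rent stays missing
    )
  )

toy_flagged
```

A rent of exactly 15000 is not flagged, since the comparison is strictly greater than, which matches the condition used on the real data.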

# Mean rent by BHK for Mumbai listings, highest first
house_rent_data %>%
  select(City, Rent, BHK) %>%
  filter(City == "Mumbai") %>%
  group_by(BHK) %>%
  summarize(mean_rent = mean(Rent)) %>%
  arrange(desc(mean_rent))
## # A tibble: 5 × 2
##     BHK mean_rent
##   <dbl>     <dbl>
## 1     5   442727.
## 2     4   279110.
## 3     3   122009.
## 4     2    57768.
## 5     1    29219.
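A group mean says little without knowing how many listings are behind it. The sketch below shows one way `summarize()` could report a count and a median alongside the mean. It uses a small made-up tibble, and the column names `n_listings` and `median_rent` are illustrative, not part of the original analysis.

```r
library(dplyr)

# Toy data (hypothetical listings)
toy <- tibble(
  City = c("Mumbai", "Mumbai", "Mumbai", "Delhi"),
  BHK  = c(1, 1, 2, 1),
  Rent = c(20000, 30000, 60000, 12000)
)

toy_summary <- toy %>%
  filter(City == "Mumbai") %>%
  group_by(BHK) %>%
  summarize(
    n_listings  = n(),           # number of rows in each group
    mean_rent   = mean(Rent),
    median_rent = median(Rent)   # less sensitive to extreme rents than the mean
  ) %>%
  arrange(desc(mean_rent))

toy_summary
```

Applied to `house_rent_data`, the same pattern would show whether the high mean for the larger BHK values rests on only a handful of listings.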

tidyr package in tidyverse

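# Convert data from long to wide: one row per city, one column per BHK value.
# Note: spread() is superseded in tidyr; pivot_wider() is the current equivalent.
house_rent_data %>%
  select(City, BHK, Rent) %>%
  group_by(City, BHK) %>%
  summarize(mean_rent = mean(Rent)) %>%
  arrange(desc(mean_rent)) %>%
  spread(BHK, mean_rent)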
## `summarise()` has grouped output by 'City'. You can override using the
## `.groups` argument.
## # A tibble: 6 × 7
## # Groups:   City [6]
##   City         `1`    `2`     `3`     `4`     `5`    `6`
##   <chr>      <dbl>  <dbl>   <dbl>   <dbl>   <dbl>  <dbl>
## 1 Bangalore  9368. 16122.  61989. 113043.     NA      NA
## 2 Chennai    8456. 15702.  35742.  96350   75000  170000
## 3 Delhi     11332. 18878.  44142. 117456. 190000      NA
## 4 Hyderabad  9754. 13878.  29338.  95731. 131667.  45000
## 5 Kolkata    6897. 10688.  19667.  26909.  23750   20000
## 6 Mumbai    29219. 57768. 122009. 279110. 442727.     NA
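The reshaping above uses `spread()`, which tidyr has superseded with `pivot_wider()`. A minimal sketch of the newer form follows, on a made-up tibble. The `bhk_` prefix is an illustrative choice, and `.groups = "drop"` is added here to avoid the grouping message shown above.

```r
library(dplyr)
library(tidyr)

# Toy data (hypothetical listings)
toy <- tibble(
  City = c("Kolkata", "Kolkata", "Kolkata", "Mumbai"),
  BHK  = c(1, 1, 2, 1),
  Rent = c(6000, 8000, 11000, 29000)
)

toy_wide <- toy %>%
  group_by(City, BHK) %>%
  summarize(mean_rent = mean(Rent), .groups = "drop") %>%  # return an ungrouped tibble
  pivot_wider(
    names_from   = BHK,        # one new column per BHK value
    values_from  = mean_rent,  # cells hold the mean rent
    names_prefix = "bhk_"      # avoids column names that are bare numbers
  )

toy_wide
```

City and BHK combinations with no listings come out as `NA`, just as in the `spread()` output.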

Conclusion

We used house rent data for this assignment. We read the data with readr and looked at several functions from the dplyr package: glimpse, rename, select, filter, arrange, group_by, summarize and mutate. We also used the tidyr package to convert the data from long to wide.