
R Packages

library(tidyverse) # load all the libraries needed for this assignment
#library(openintro)
#library(psych)
#head(fastfood)
library(readxl)
#library(data.table)
#library(DT)
library(knitr)

library(readr)
#library(plyr)
library(dplyr)
library(stringr)
#library(XML)
#library(RCurl)
#library(jsonlite)
#library(httr)

#library(maps)
#library(dice)
# #library(VennDiagram)
# #library(help = "dice")
#library(DBI)
#library(dbplyr)

# library(rstudioapi)
# library(RJDBC)
# library(odbc)
# library(RSQLite)
# #library(rvest)

#library(readtext)
#library(ggpubr)
#library(fitdistrplus)
#library(ggplot2)
#library(moments)
#library(qualityTools)
#library(normalp)
#library(utils)
#library(MASS)
#library(qqplotr)
#library(DATA606)

#library(knitLatex)
#library(knitr)
#library(markdown)
#library(rmarkdown)
#render("DATA606_Project_Proposal.Rmd", "pdf_document")

GitHub link: https://github.com/asmozo24/DATA606_Project_Proposal

Web link: https://rpubs.com/amekueko/682247

Data source: https://www.kaggle.com/omarhanyy/500-greatest-songs-of-all-time

Description

This assignment is about getting familiar with two or more tidyverse packages, so I am going to write a vignette using readr, dplyr, and stringr, which are part of the core tidyverse packages used for data analysis.

readr

According to tidyverse.org, readr provides a fast and friendly way to read rectangular data (like csv, tsv, and fwf). It is designed to flexibly parse many types of data found in the wild, while still cleanly failing when data unexpectedly changes. readr is installed with the tidyverse package (install.packages("tidyverse")), or you can install it on its own with install.packages("readr"). The data import cheat sheet (https://github.com/rstudio/cheatsheets/blob/master/data-import.pdf) is a helpful reference. For this assignment, I will practice reading and saving a csv file.

# setting the working directory
setwd("~/R/DATA607_Tidyverse")



# load the csv file, which has all the variables
# (read.csv() is the base R reader; readr's counterpart is read_csv())

Top_500Songs <- read.csv("https://raw.githubusercontent.com/asmozo24/DATA607_Tidyverse-CREATE-Assignmen/main/Top%20500%20Songs.csv", stringsAsFactors=FALSE)
str(Top_500Songs)
## 'data.frame':    500 obs. of  8 variables:
##  $ title      : chr  "Shop Around" "Buddy Holly" "Miss You" "The Rising" ...
##  $ description: chr  "Robinson thought Barrett Strong should record \"Shop Around,\" but Gordy persuaded Smokey that he was the right"| __truncated__ "In the early 1990s, Cuomo had an awkward girlfriend who was routinely picked on. His efforts to stick up for he"| __truncated__ "The Stones were in Toronto, rehearsing for their classic gigs at the El Mocambo Club, when Jagger, jamming with"| __truncated__ "Springsteen wrote the track about 9/11, taking the viewpoint of a firefighter entering one of the Twin Towers ("| __truncated__ ...
##  $ appears.on : chr  "The Ultimate Collection (Motown)" "Weezer (Geffen)" "Some Girls (Virgin)" "The Rising (Columbia)" ...
##  $ artist     : chr  "Smokey Robinson and the Miracles" "Weezer" "The Rolling Stones" "Bruce Springsteen" ...
##  $ writers    : chr  "Berry Gordy, Robinson" "Rivers Cuomo" "Mick Jagger, Keith Richards" "Springsteen" ...
##  $ producer   : chr  "Gordy" "Ric Ocasek" "The Glimmer Twins" "Brendan O'Brien" ...
##  $ released   : chr  "Dec. '60, Tamla" "Aug. '94, DGC" "May '78, Rolling Stones" "July '02, Columbia" ...
##  $ streak     : chr  "16 weeks; No. 2" "21 weeks; No. 18" "20 weeks; No. 1" "11 weeks; No. 52" ...
view(Top_500Songs)

# file too big; removing the description column (column 2), which I don't need
Top_500Songs <- Top_500Songs[, -2]
# saving the new csv file (write.csv() is base R; readr's counterpart is write_csv())
write.csv(Top_500Songs, 'Top_500Songs.csv')

glimpse(Top_500Songs)
## Rows: 500
## Columns: 7
## $ title      <chr> "Shop Around", "Buddy Holly", "Miss You", "The Rising", ...
## $ appears.on <chr> "The Ultimate Collection (Motown)", "Weezer (Geffen)", "...
## $ artist     <chr> "Smokey Robinson and the Miracles", "Weezer", "The Rolli...
## $ writers    <chr> "Berry Gordy, Robinson", "Rivers Cuomo", "Mick Jagger, K...
## $ producer   <chr> "Gordy", "Ric Ocasek", "The Glimmer Twins", "Brendan O'B...
## $ released   <chr> "Dec. '60, Tamla", "Aug. '94, DGC", "May '78, Rolling St...
## $ streak     <chr> "16 weeks; No. 2", "21 weeks; No. 18", "20 weeks; No. 1"...
#view(Top_500Songs)
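
The code above reads and writes with base R's read.csv() and write.csv(). As a minimal sketch of the readr counterparts, assuming a hypothetical three-row stand-in data frame (values copied from the output above) and a temporary file rather than the real download, read_csv() and write_csv() could be used like this:

```r
library(readr)

# Hypothetical three-row stand-in for the songs file
songs <- data.frame(
  title  = c("Shop Around", "Buddy Holly", "Miss You"),
  artist = c("Smokey Robinson and the Miracles", "Weezer", "The Rolling Stones"),
  streak = c("16 weeks; No. 2", "21 weeks; No. 18", "20 weeks; No. 1"),
  stringsAsFactors = FALSE
)

# write_csv() writes no row-name column
csv_path <- tempfile(fileext = ".csv")
write_csv(songs, csv_path)

# read_csv() returns a tibble; col_types declares every column as character
songs_back <- read_csv(csv_path, col_types = cols(.default = col_character()))

print(songs_back)
```

Unlike read.csv(), read_csv() leaves column names as they are in the file, so a header containing a space would be kept as is rather than converted to a dotted name such as appears.on.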

dplyr

According to tidyverse.org, dplyr provides a grammar of data manipulation, providing a consistent set of verbs that solve the most common data manipulation challenges. dplyr is installed with the tidyverse, or on its own in the same way as readr. dplyr offers these verbs for data manipulation:

mutate(): adds new variables computed from the existing variables.

select(): picks the variables (columns) of interest.

filter(): picks the rows whose values meet a condition, where select() picks columns.

summarise(): reduces multiple values down to a single summary.

arrange(): changes the ordering of the rows in your data frame.

Based on previous assignments, I can say that dplyr syntax eliminates the need for the '$' operator that base R uses to refer to columns. In addition, multiple verbs can be chained together with %>%.
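
As a small illustration of that difference, the sketch below answers the same question twice, once with base R and '$' and once with dplyr verbs chained by %>%. The songs data frame and its numeric weeks column are a hypothetical four-row stand-in, not the real dataset.

```r
library(dplyr)

# Hypothetical stand-in data, not the real songs file
songs <- data.frame(
  title  = c("Shop Around", "Buddy Holly", "Miss You", "The Rising"),
  artist = c("Smokey Robinson and the Miracles", "Weezer",
             "The Rolling Stones", "Bruce Springsteen"),
  weeks  = c(16, 21, 20, 11),
  stringsAsFactors = FALSE
)

# Base R: the data frame name is repeated with $ for every column reference
base_result <- songs[songs$weeks >= 16, c("title", "weeks")]
base_result <- base_result[order(-base_result$weeks), ]

# dplyr: columns are referred to by bare name and the verbs are chained
dplyr_result <- songs %>%
  filter(weeks >= 16) %>%
  select(title, weeks) %>%
  arrange(desc(weeks))

dplyr_result

# summarise() collapses all rows into a single summary row
songs %>%
  summarise(n_songs = n(), mean_weeks = mean(weeks))
```

Both versions keep the same three songs in the same order; the piped version reads top to bottom as a sequence of steps.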

# let's check if there are missing values in a specific column
# returns 6 rows with empty values...tempted to delete them, but will not do it now...no need
Top_500Songs %>% 
  filter( is.na(streak) | streak == "") 
# another way
filter(Top_500Songs, is.na(streak) | streak == "")
filter(Top_500Songs, !grepl("weeks", streak))
# Being among the 500 greatest songs of all time, I will assume each song stayed on the Billboard charts for a few months...let's check that
Top_500Songs %>% 
  select(streak) %>% 
  filter(grepl("weeks", streak))
# what if I want to find the songs that stayed on top for the longest period...this is a string comparison, which is a bit tedious
# I think a manual search and a new variable called ranking would do it

# songs that peaked at No. 1 (the dot is escaped and \\b stops "No. 18" from matching)
Top_500Songs %>% 
  select(streak) %>% 
  filter(grepl("No\\. 1\\b", streak))
# or sort by streak, but that is not really helpful given the nature of the data (streak is text)
songRank <- Top_500Songs %>% 
  arrange(desc(streak))

#view(songRank)  ...if streak were numeric ...this would be perfect

# min_rank() on a text column ranks alphabetically, not by number of weeks
Top_500Songs %>% 
  mutate(rank = min_rank(desc(streak))) %>% 
  arrange(desc(rank))
# let's check if R. Kelly is on the list
Top_500Songs %>% 
  filter(artist == "R. Kelly")
# let's say I only want to see R. Kelly's record (song title, release date and streak)
Top_500Songs %>% 
  select(title, artist, released, streak) %>% 
  filter(artist == "R. Kelly")
# How about adding a new variable that shows the YouTube view count of R. Kelly's song?
# (view count as noted from https://www.youtube.com/watch?v=y6y_4_b6RS8; NA for all other artists)
Top_500Songs %>% 
  mutate(youtubeView = ifelse(artist == "R. Kelly", "232,560,092", NA_character_))
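
Sorting on streak orders the text alphabetically rather than by number of weeks. One possible way around this, sketched here on a hypothetical four-row stand-in and assuming every non-empty streak follows the "N weeks; No. M" pattern seen in the output earlier, is to pull the two numbers out into new columns. The names weeks and peak are introduced only for this illustration.

```r
library(dplyr)
library(stringr)

# Hypothetical stand-in rows; streak values copied from the output earlier
songs <- data.frame(
  title  = c("Shop Around", "Buddy Holly", "Miss You", "The Rising"),
  streak = c("16 weeks; No. 2", "21 weeks; No. 18",
             "20 weeks; No. 1", "11 weeks; No. 52"),
  stringsAsFactors = FALSE
)

ranked <- songs %>%
  mutate(
    # digits that are followed by " weeks"
    weeks = as.integer(str_extract(streak, "\\d+(?= weeks)")),
    # digits that are preceded by "No. "
    peak  = as.integer(str_extract(streak, "(?<=No\\. )\\d+"))
  ) %>%
  arrange(desc(weeks))

ranked

# songs that peaked at exactly No. 1
filter(ranked, peak == 1)
```

With numeric columns, arrange() and filter() compare numbers instead of text, and an empty streak simply becomes NA in both new columns.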

stringr

According to tidyverse.org, stringr provides a cohesive set of functions designed to make working with strings as easy as possible. It is built on top of stringi, which uses the ICU C library to provide fast, correct implementations of common string manipulations. stringr is installed with the tidyverse package or on its own. A helpful cheat sheet is available at https://github.com/rstudio/cheatsheets/blob/master/strings.pdf

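# let's say I want to check on my favorite artist and I don't remember their full name
Top_500Songs %>% 
  #select(artist) %>% 
  filter(grepl("50 Cent", artist))
# unlist() flattens every column of the matching rows into one character vector,
# so the artist vector below also holds titles, albums, writers, and so on
artist <- unlist(Top_500Songs %>% 
  #select(artist) %>% 
  filter(grepl("50 Cent", artist)))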

# another way to detect matching pattern
str_detect(artist, "Rich")
##  [1] FALSE FALSE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [13] FALSE FALSE
# find a matching pattern and display/extract 
str_subset(artist, "Dr.")
## [1] "50 Cent, Dr. Dre, Mike Elizondo" "50 Cent, Dr. Dre, Mike Elizondo"
## [3] "Dr. Dre, Elizondo"               "Dr. Dre, Elizondo"
# find the number of characters in each string
str_length(artist)
##  [1] 10 10 51 51  7  7 31 31 17 17 36 36 15 15
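# count the pattern matches in each string
# (the "." is a regex wildcard, so "Dr." matches both "Dr." and "Dre", hence the 2s)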
str_count(artist, "Dr.")
##  [1] 0 0 0 0 0 0 2 2 2 2 0 0 0 0
# convert string to upper case 
str_to_upper(artist)
##  [1] "IN DA CLUB"                                         
##  [2] "IN DA CLUB"                                         
##  [3] "GET RICH OR DIE TRYIN' (INTERSCOPE/AFTERMATH/SHADY)"
##  [4] "GET RICH OR DIE TRYIN' (INTERSCOPE/AFTERMATH/SHADY)"
##  [5] "50 CENT"                                            
##  [6] "50 CENT"                                            
##  [7] "50 CENT, DR. DRE, MIKE ELIZONDO"                    
##  [8] "50 CENT, DR. DRE, MIKE ELIZONDO"                    
##  [9] "DR. DRE, ELIZONDO"                                  
## [10] "DR. DRE, ELIZONDO"                                  
## [11] "DEC. '02, INTERSCOPE/AFTERMATH/SHADY"               
## [12] "DEC. '02, INTERSCOPE/AFTERMATH/SHADY"               
## [13] "30 WEEKS; NO. 1"                                    
## [14] "30 WEEKS; NO. 1"
# convert string to lower case 
str_to_lower(artist)
##  [1] "in da club"                                         
##  [2] "in da club"                                         
##  [3] "get rich or die tryin' (interscope/aftermath/shady)"
##  [4] "get rich or die tryin' (interscope/aftermath/shady)"
##  [5] "50 cent"                                            
##  [6] "50 cent"                                            
##  [7] "50 cent, dr. dre, mike elizondo"                    
##  [8] "50 cent, dr. dre, mike elizondo"                    
##  [9] "dr. dre, elizondo"                                  
## [10] "dr. dre, elizondo"                                  
## [11] "dec. '02, interscope/aftermath/shady"               
## [12] "dec. '02, interscope/aftermath/shady"               
## [13] "30 weeks; no. 1"                                    
## [14] "30 weeks; no. 1"
# string view
#str_view(artist, "Cent")
# extract the first match in each string (NA where there is no match)
str_match(artist, 'Cent')
##       [,1]  
##  [1,] NA    
##  [2,] NA    
##  [3,] NA    
##  [4,] NA    
##  [5,] "Cent"
##  [6,] "Cent"
##  [7,] "Cent"
##  [8,] "Cent"
##  [9,] NA    
## [10,] NA    
## [11,] NA    
## [12,] NA    
## [13,] NA    
## [14,] NA
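
Two details in the calls above are worth a sketch. First, unlist() on a filtered data frame flattens every column, which is why titles and release dates appear in the vector; dplyr's pull() extracts a single column instead. Second, the dot in "Dr." is a regex wildcard, so it also matches "Dre"; wrapping the pattern in fixed() matches it literally. The data frame below is a hypothetical two-row stand-in with values copied from the output above, and writers_50 is a name made up for the illustration.

```r
library(dplyr)
library(stringr)

# Hypothetical stand-in rows, not the real songs file
songs <- data.frame(
  title   = c("In Da Club", "Buddy Holly"),
  artist  = c("50 Cent", "Weezer"),
  writers = c("50 Cent, Dr. Dre, Mike Elizondo", "Rivers Cuomo"),
  stringsAsFactors = FALSE
)

# pull() returns one column as a plain character vector
writers_50 <- songs %>%
  filter(str_detect(artist, "50 Cent")) %>%
  pull(writers)

writers_50

# regex pattern: "." matches any character, so "Dr." and "Dre" both count
str_count(writers_50, "Dr.")

# literal pattern: only the text "Dr." counts
str_count(writers_50, fixed("Dr."))
```

Escaping the dot as "Dr\\." is an alternative to fixed() when the rest of the pattern still needs regex features.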

Conclusion

I think readr and dplyr are indispensable for data analysis in R. stringr is more focused on searching within, or learning about, a particular text variable.