Mastering String Manipulation in R: A Practical Problem-Based Approach (Problem 1)

Working on the Problem

Loading Required libraries

```
library(stringr)
library(magrittr)
```

Problem:

You are given company names in the variable cn. Do the following:

```
cn <- c("RIL Ltd.", "TCS Ltd.", "Dabur Ltd.", "SBI Pvt Ltd.", "ICICI Bank Ltd.",
        "Ambuja Cement Ltd.", "TCS Group Ltd.", "Ambani Group Ltd.", "10 Mega Pvt Ltd.") # cn represents company names
```

- Remove Ltd. from each name.
- Get company names that have all capital letters.
- Get company names that include Pvt or Private.
- Get names of all group companies.
- Get all company names that start with a digit.
- Get company names that have at least 6 letters after removing Ltd.

Solution:

Remove Ltd.
```
str_view_all(cn, "Ltd\\.$") # highlights "Ltd." at the end of each name (for viewing only)
```
```
cn_no_ltd <- str_replace(cn, "Ltd\\.$", "") %>% str_trim
cn_no_ltd
```
```
## [1] "RIL"           "TCS"           "Dabur"         "SBI Pvt"      
## [5] "ICICI Bank"    "Ambuja Cement" "TCS Group"     "Ambani Group" 
## [9] "10 Mega Pvt"
```
Logic: Break down the regex pattern "Ltd\\.$":
- Ltd: This part matches the literal characters “Ltd” in the string.
- \\.: This matches a literal dot. In a regular expression an unescaped dot matches any character, so it is escaped as \. to match a dot only. Inside an R string the backslash itself must be doubled, which is why the code reads "\\.".
- $: The dollar sign anchors the match to the end of the string, so “Ltd.” is matched only when it is the last thing in the name.
Combining it all, the pattern matches “Ltd.” only when it appears at the end of the input string.
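To see why both the escaped dot and the end anchor matter, here is a minimal sketch on a few made-up strings. The vector x is hypothetical and not part of the original data.

```r
library(stringr)

# Hypothetical names, chosen only to exercise the pattern
x <- c("RIL Ltd.", "Ltd. Holdings", "RIL Ltdx")

# Escaped dot plus end anchor: only a trailing "Ltd." matches
escaped <- str_detect(x, "Ltd\\.$")
escaped
# TRUE FALSE FALSE

# Unescaped dot matches any character, so the trailing "Ltdx" matches too
unescaped <- str_detect(x, "Ltd.$")
unescaped
# TRUE FALSE TRUE
```

The second name is rejected in both cases because its “Ltd.” is not at the end of the string.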

str_trim(): This stringr function removes leading and trailing whitespace. Here it drops the space left at the end of each name once “Ltd.” has been removed (for example, "RIL " becomes "RIL").
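As an alternative sketch, the whitespace can be folded into the pattern so that no separate trimming step is needed. This assumes the only stray whitespace is the space just before “Ltd.”; the name alt is illustrative.

```r
library(stringr)

cn <- c("RIL Ltd.", "TCS Ltd.", "Dabur Ltd.", "SBI Pvt Ltd.", "ICICI Bank Ltd.",
        "Ambuja Cement Ltd.", "TCS Group Ltd.", "Ambani Group Ltd.", "10 Mega Pvt Ltd.")

# Two-step approach: remove "Ltd." and then trim the leftover space
cn_no_ltd <- str_trim(str_replace(cn, "Ltd\\.$", ""))

# One-step variant: \\s* also consumes any whitespace before "Ltd."
alt <- str_remove(cn, "\\s*Ltd\\.$")

identical(alt, cn_no_ltd)
# TRUE
```

The two-step version is more defensive, since str_trim() would also remove leading whitespace if any were present.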

Get company names that have all capital letters
```
cn_no_ltd[str_count(cn_no_ltd, "\\S") == str_count(cn_no_ltd,"[A-Z]" )]
```
```
## [1] "RIL" "TCS"
```
Logic: str_count() counts how many times a pattern matches in each string. The first call counts every non-whitespace character (\\S), and the second counts only capital letters ([A-Z]). When the two counts are equal, every non-space character is a capital letter, so the name is all capitals.
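The same check can also be sketched as a single anchored pattern that requires the whole string to consist of capital letters and whitespace. The names by_count and by_regex are illustrative, and cn_no_ltd is retyped here so the block runs on its own.

```r
library(stringr)

cn_no_ltd <- c("RIL", "TCS", "Dabur", "SBI Pvt", "ICICI Bank",
               "Ambuja Cement", "TCS Group", "Ambani Group", "10 Mega Pvt")

# Counting approach: non-space characters vs. capital letters
by_count <- cn_no_ltd[str_count(cn_no_ltd, "\\S") == str_count(cn_no_ltd, "[A-Z]")]

# Single-pattern variant: from start (^) to end ($), only A-Z and whitespace
by_regex <- str_subset(cn_no_ltd, "^[A-Z\\s]+$")

by_regex
# keeps "RIL" and "TCS"
```

Both versions agree on this data; the counting version makes the reasoning explicit, while the anchored pattern is shorter.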

Get company names that include Pvt or Private
```
str_view_all(cn, "(Pvt)|(Private)") #to view
```
```
str_subset(cn, "(Pvt)|(Private)")
```
```
## [1] "SBI Pvt Ltd."     "10 Mega Pvt Ltd."
```
Logic: Break down the regex pattern (Pvt)|(Private):
- (Pvt): The parentheses create a capturing group. Here it matches the exact term “Pvt”.
- |: The pipe symbol acts as an OR operator in regular expressions. It separates the two alternatives, “(Pvt)” and “(Private)”.
- (Private): This is a second capturing group, which matches the exact term “Private”.
Applied to a string, the pattern matches if either “Pvt” or “Private” appears in it. The parentheses are optional here, since Pvt|Private matches the same strings. The match can be a whole word or part of a larger word, as long as one of the alternatives appears.
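Because the alternation can match inside a longer word, a word-boundary version may be safer on real data. The sketch below uses made-up names (the vector x is hypothetical) to show the difference.

```r
library(stringr)

# Hypothetical names; only the first two are private companies
x <- c("SBI Pvt Ltd.", "Acme Private Ltd.", "Privateer Ships Ltd.")

# Plain alternation also matches "Private" inside "Privateer"
loose <- str_subset(x, "Pvt|Private")
length(loose)
# 3

# \\b marks a word boundary, so only whole words match
strict <- str_subset(x, "\\b(Pvt|Private)\\b")
strict
# keeps "SBI Pvt Ltd." and "Acme Private Ltd."
```

Here the parentheses are needed: they make both word boundaries apply to each alternative rather than to only one side of the pipe.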

Get names of all group companies.
```
str_subset(cn, "Group")
```
```
## [1] "TCS Group Ltd."    "Ambani Group Ltd."
```
Logic: The pattern is the literal text “Group”, so str_subset() keeps every name that contains it.
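When the pattern is plain text, stringr's fixed() modifier can make that intent explicit. The sketch below also shows that str_subset() is equivalent to indexing with str_detect(); the names groups and groups_idx are illustrative.

```r
library(stringr)

cn <- c("RIL Ltd.", "TCS Ltd.", "Dabur Ltd.", "SBI Pvt Ltd.", "ICICI Bank Ltd.",
        "Ambuja Cement Ltd.", "TCS Group Ltd.", "Ambani Group Ltd.", "10 Mega Pvt Ltd.")

# fixed() treats the pattern as plain text rather than as a regex
groups <- str_subset(cn, fixed("Group"))

# Same result via a logical index
groups_idx <- cn[str_detect(cn, "Group")]

identical(groups, groups_idx)
# TRUE
```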

Get all company names that start with a digit.
```
str_subset(cn, "^\\d")
```
```
## [1] "10 Mega Pvt Ltd."
```
The regular expression ^\d matches a digit at the beginning of a string. Here’s a breakdown of each component:
- ^ anchors the match to the start of the string.
- \d matches a single digit such as 0 to 9. The backslash gives the letter d this special meaning; without it, d would match a literal “d”.
Inside an R string the backslash itself must be escaped, so the pattern is written "^\\d". It matches any string that starts with a digit.
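The effect of the ^ anchor is easiest to see on a name that contains digits somewhere other than the start. The sketch below uses a hypothetical vector x, and also shows str_starts(), which anchors at the start without writing ^.

```r
library(stringr)

# Hypothetical names: one starts with a digit, one has digits later on
x <- c("10 Mega Pvt Ltd.", "Studio 54 Ltd.", "TCS Ltd.")

# Without the anchor, a digit anywhere in the name is enough
anywhere <- str_subset(x, "\\d")
length(anywhere)
# 2

# With the anchor, the digit must be the first character
at_start <- str_subset(x, "^\\d")
at_start
# keeps "10 Mega Pvt Ltd."

# str_starts() checks the start of the string directly
also_start <- x[str_starts(x, "\\d")]
```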

Get company names that have at least 6 letters after removing Ltd.
```
str_subset(cn_no_ltd, "[A-Za-z0-9\\s]{6,}")
```
```
## [1] "SBI Pvt"       "ICICI Bank"    "Ambuja Cement" "TCS Group"    
## [5] "Ambani Group"  "10 Mega Pvt"
```
The regular expression [A-Za-z0-9\s]{6,} matches a run of at least six consecutive letters, digits, or whitespace characters. In R code it is written "[A-Za-z0-9\\s]{6,}".

Let’s break down the regex pattern:
- [A-Za-z0-9\\s]: This character class lists the characters that can be matched: uppercase letters (A-Z), lowercase letters (a-z), digits (0-9), and whitespace. The backslash gives s its special meaning: \s stands for any whitespace character, whereas a plain s would match the letter “s”.
- {6,}: This quantifier requires the preceding character class to match at least six times in a row.
Because the pattern is not anchored, str_subset() keeps every name that contains such a run anywhere in it. Spaces and digits count toward the six, so the test is on characters rather than strictly on letters; for these names both readings give the same result. The same style of pattern can be used to check that a text input contains a minimum-length run of allowed characters.
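As a rough cross-check, the same selection can be sketched with str_length() (total characters) and with str_count() restricted to letters. The variable names are illustrative, and cn_no_ltd is retyped so the block runs on its own.

```r
library(stringr)

cn_no_ltd <- c("RIL", "TCS", "Dabur", "SBI Pvt", "ICICI Bank",
               "Ambuja Cement", "TCS Group", "Ambani Group", "10 Mega Pvt")

# Pattern from the solution: a run of six or more allowed characters
by_regex <- str_subset(cn_no_ltd, "[A-Za-z0-9\\s]{6,}")

# Total length, counting spaces and digits
by_length <- cn_no_ltd[str_length(cn_no_ltd) >= 6]

# Letters only, ignoring spaces and digits
by_letters <- cn_no_ltd[str_count(cn_no_ltd, "[A-Za-z]") >= 6]

identical(by_regex, by_length) && identical(by_regex, by_letters)
# TRUE
```

The three approaches agree on this data, but they would diverge for a name such as "A B C D" (seven characters, four letters), so the choice depends on whether spaces and digits should count.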

Neeraj Jain

2023-05-26
