library(stringr)
library(magrittr)
You are given the company names in the variable cn.
cn <- c("RIL Ltd.", "TCS Ltd.", "Dabur Ltd.", "SBI Pvt Ltd.", "ICICI Bank Ltd.",
        "Ambuja Cement Ltd.", "TCS Group Ltd.", "Ambani Group Ltd.", "10 Mega Pvt Ltd.") # cn holds the company names
Do the following:
Remove Ltd. from each name.
Get company names that have all capital letters.
Get company names that include Pvt or Private.
Get the names of all group companies.
Get all company names that start with a digit.
Get company names that have at least 6 letters after removing Ltd.
Remove Ltd.
str_view_all(cn, "Ltd\\.$") # view the match: "Ltd." at the end of each name
cn_no_ltd <- str_replace(cn, "Ltd\\.$", "") %>% str_trim()
cn_no_ltd
## [1] "RIL" "TCS" "Dabur" "SBI Pvt"
## [5] "ICICI Bank" "Ambuja Cement" "TCS Group" "Ambani Group"
## [9] "10 Mega Pvt"
Logic: The regex pattern breaks down as follows:
Ltd: matches the literal characters “Ltd”.
\\.: matches a literal dot. In a regular expression an unescaped dot matches any character, so it is escaped with a backslash; the backslash is doubled because the pattern is written inside an R string.
$: anchors the match to the end of the string, so “Ltd.” is matched only when it is the last thing in the name.
Taken together, the pattern “Ltd\\.$” matches “Ltd.” only when it appears at the end of the input string.
str_trim(): This stringr function removes leading and trailing whitespace. Here it drops the space left at the end of each name once “Ltd.” has been replaced by an empty string.
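To see why the escaped dot and the $ anchor matter, the following sketch compares the pattern with a naive unescaped one. The vector demo and the names in it are hypothetical and are not part of the exercise data.

```r
library(stringr)

# Hypothetical names, chosen only to expose the difference between the patterns.
demo <- c("Ltdx Foods Ltd.", "RIL Ltd.")

# Unescaped and unanchored: "." matches any character, so the first hit
# in "Ltdx Foods Ltd." is "Ltdx", not the trailing "Ltd.".
naive <- str_replace(demo, "Ltd.", "")

# Escaped and anchored: only a literal "Ltd." at the end is removed,
# and str_trim() drops the space left behind.
anchored <- str_trim(str_replace(demo, "Ltd\\.$", ""))

naive     # " Foods Ltd." "RIL "
anchored  # "Ltdx Foods"  "RIL"
```

The naive pattern damages the first name and leaves a trailing space on the second, which is what the escape, the anchor, and str_trim() each guard against.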
Get company names that have all capital letters
cn_no_ltd[str_count(cn_no_ltd, "\\S") == str_count(cn_no_ltd, "[A-Z]")]
## [1] "RIL" "TCS"
Logic: str_count() counts how many times a pattern matches in each string. The first call counts every non-whitespace character (\\S) and the second counts only capital letters ([A-Z]). When the two counts are equal, every non-space character in the name is a capital letter.
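An alternative, offered here as a sketch rather than as the approach used above, is a single anchored pattern that allows only capital letters and spaces. The vector is repeated so that the snippet runs on its own.

```r
library(stringr)

cn_no_ltd <- c("RIL", "TCS", "Dabur", "SBI Pvt", "ICICI Bank",
               "Ambuja Cement", "TCS Group", "Ambani Group", "10 Mega Pvt")

# ^ and $ force the whole name to consist of capital letters or spaces.
all_caps <- str_subset(cn_no_ltd, "^[A-Z ]+$")
all_caps  # "RIL" "TCS"
```

Both approaches ignore spaces, so a multi-word name written fully in capitals would be kept by either one.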
Get company names that include Pvt or Private
str_view_all(cn, "(Pvt)|(Private)") # view the matches
str_subset(cn, "(Pvt)|(Private)")
## [1] "SBI Pvt Ltd." "10 Mega Pvt Ltd."
Logic: (Pvt): The parentheses ( ) create a capturing group, which here captures the exact term “Pvt”. The groups are optional in this pattern; “Pvt|Private” matches the same names.
|: The pipe symbol acts as an OR operator in regular expressions. It separates the two alternatives, “(Pvt)” and “(Private)”.
(Private): A second capturing group, which captures the exact term “Private”.
Applied to a string, the pattern matches if either “Pvt” or “Private” appears anywhere in it. The match does not have to be a whole word: it can also be part of a longer word, as long as one of the two alternatives occurs.
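Because the alternation can match inside a longer word, a stricter variant may be useful. The sketch below adds word boundaries (\\b); the vector demo and its names are hypothetical and only illustrate the difference.

```r
library(stringr)

# Hypothetical names; "Privateer" contains "Private" inside a longer word.
demo <- c("SBI Pvt Ltd.", "Zed Private Ltd.", "Privateer Foods Ltd.")

# Plain alternation matches anywhere, including inside "Privateer".
loose <- str_subset(demo, "Pvt|Private")

# \\b marks a word boundary, so only the standalone words match.
strict <- str_subset(demo, "\\b(Pvt|Private)\\b")

loose   # all three names
strict  # "SBI Pvt Ltd." "Zed Private Ltd."
```

Here the parentheses do real work: they make both boundaries apply to the whole alternation instead of to one alternative each.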
Get name of all group companies.
str_subset(cn, "Group")
## [1] "TCS Group Ltd." "Ambani Group Ltd."
Logic: The pattern is the literal text “Group”, so str_subset() keeps every name that contains it.
Get all company names that start with a digit.
str_subset(cn, "^\\d")
## [1] "10 Mega Pvt Ltd."
The regular expression ^\\d matches a digit at the beginning of a string. It breaks down as follows:
^ represents the start of the string.
\\d represents a digit. The backslash turns the ordinary letter d into the special digit class, which matches any digit from 0 to 9; it is doubled because the pattern is written inside an R string.
Therefore str_subset() with the pattern ^\\d keeps every name that starts with a digit.
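The effect of the ^ anchor can be checked with a small sketch. The vector demo is hypothetical; "Route 66 Ltd." is included only because it has digits that are not at the start.

```r
library(stringr)

# Hypothetical names: digits at the start, digits in the middle, no digits.
demo <- c("10 Mega Pvt Ltd.", "Route 66 Ltd.", "TCS Ltd.")

# Without the anchor, a digit anywhere in the name is enough.
any_digit <- str_subset(demo, "\\d")

# With the anchor, the digit must be the first character.
leading_digit <- str_subset(demo, "^\\d")

# str_extract() returns the leading number itself, or NA if there is none.
leading_number <- str_extract(demo, "^\\d+")

any_digit       # "10 Mega Pvt Ltd." "Route 66 Ltd."
leading_digit   # "10 Mega Pvt Ltd."
leading_number  # "10" NA NA
```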
Get company names that have at least 6 letters after removing Ltd.
str_subset(cn_no_ltd, "[A-Za-z0-9\\s]{6,}")
## [1] "SBI Pvt" "ICICI Bank" "Ambuja Cement" "TCS Group"
## [5] "Ambani Group" "10 Mega Pvt"
The regular expression [A-Za-z0-9\\s]{6,} matches a run of letters, digits and whitespace that is at least six characters long. It breaks down as follows:
[A-Za-z0-9\\s]: This character class lists the characters that may be matched: uppercase letters (A-Z), lowercase letters (a-z), digits (0-9) and whitespace. The backslash gives the letter s its special meaning of “any whitespace character”; it is doubled because the pattern is written inside an R string.
{6,}: This quantifier requires the preceding character class to match at least six times in a row.
Note that the pattern counts digits and spaces as well as letters, so it tests for a run of at least six characters rather than strictly six letters. For the names in cn_no_ltd both readings select the same six names.
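If the task is read strictly as "at least 6 letters", one option is to count letters only and compare the count with 6. This is a sketch of that stricter reading, not the approach used above; the vector is repeated so that the snippet runs on its own.

```r
library(stringr)

cn_no_ltd <- c("RIL", "TCS", "Dabur", "SBI Pvt", "ICICI Bank",
               "Ambuja Cement", "TCS Group", "Ambani Group", "10 Mega Pvt")

# Count letters only; digits and spaces are ignored.
n_letters <- str_count(cn_no_ltd, "[A-Za-z]")
n_letters  # 3 3 5 6 9 12 8 11 7

# Keep the names with six or more letters.
at_least_6 <- cn_no_ltd[n_letters >= 6]
at_least_6
```

On this data the result is the same six names, but the two approaches would differ for a name such as "12 34 56", which has six characters and no letters.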