A regular expression (regex or regexp for short) is a special text string for describing a search pattern. You can think of regular expressions as wildcards on steroids.
You are probably familiar with wildcard notations such as *.txt to find all text files in a file manager. The regex equivalent is «.*\.txt».
There are many software applications and programming languages that support regular expressions. You can save a lot of time and effort by using them: a single regular expression can often accomplish what would otherwise take dozens or hundreds of lines of code.
You can use regular expressions in search and replace operations to quickly make changes across a large number of files. A simple example is «gr[ae]y», which finds both spellings of the word grey in one operation instead of two. There are many text editors and search and replace tools with decent regex support.
Doing extensive text manipulation in R would be painful; the R language was developed for analyzing data sets, not for munging text files. However, R does have some facilities for working with text using regular expressions. This comes in handy, for example when selecting rows of a data set according to regular expression pattern matches in some columns.
grepl("x", foo)
grepl returns TRUE if the pattern “x” is found in the string foo, and FALSE otherwise.
Example:
Define a string x as “regular expression” and determine whether the substring “reg” is contained in x:
x=("regular expression")
grepl("reg",x)
[1] TRUE
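The row-selection use case mentioned earlier can be sketched with grepl() as a logical index. The data frame clients and its columns are invented for this illustration and are not part of the original example.

```r
# Hypothetical data set (names and notes are made up for this sketch)
clients <- data.frame(
  name = c("Anna", "Ben", "Carla"),
  note = c("regular payer", "defaulted twice", "irregular income"),
  stringsAsFactors = FALSE
)

# grepl() returns one TRUE/FALSE per row, which can index the rows directly
has_regular <- grepl("regular", clients$note)
regular_rows <- clients[has_regular, ]
print(regular_rows$name)
```

Note that “irregular income” is selected as well, because “regular” occurs inside “irregular”; grepl looks for the pattern anywhere in the string.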
gsub("red", "black", foo): here gsub replaces every occurrence of the substring “red” with “black”. So if the string is:
foo=c("turn to red")
It will give the result “turn to black”.
Let x be the string “The client’s social security number is: 222-33-9999” and replace the substring “222-33-9999” with “bbb-bb-bbbb”:
x=("The client's social security number is: 222-33-9999")
gsub("222-33-9999", "bbb-bb-bbbb", x)
[1] "The client's social security number is: bbb-bb-bbbb"
gsub is also useful for removing parts of a string before or after a character or substring.
For example: Display all the characters before the social security number, but do not display the social security number (replace with no characters):
x=("The client's social security number is: 222-33-9999")
gsub("2.*", "", x)
[1] "The client's social security number is: "
A quantifier specifies how many instances of the previous element (which can be a character, a group, or a character class) must be present in the input string for a match to occur. The “*” is a quantifier that matches the previous element zero or more times. The dot matches (almost) any character.
In some card games, the Joker is a wildcard and can represent any card in the deck. With regular expressions, you are often matching pieces of text whose exact contents you don’t know, other than the fact that they share a common pattern or structure (e.g. phone numbers or zip codes).
Similarly, there is the concept of a wildcard, which is represented by the . (dot) metacharacter and can match any single character (letter, digit, whitespace, everything). This overrides the matching of the literal period character, so to match a period specifically you need to escape the dot with a backslash: \. (written “\\.” inside an R string).
The «.*» pattern means any character, any number of repetitions. In our case the regex matches the first “2” and every character after it, and gsub replaces that whole match with nothing.
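The point about escaping the dot can be checked directly in R. This is a small sketch with made-up file names; it compares an unescaped dot, an escaped dot, and the fixed = TRUE argument of grepl.

```r
# Made-up file names for illustration
files <- c("notes.txt", "notes_txt", "data.csv")

# Unescaped dot: matches any single character before "txt"
dot_as_wildcard <- grepl(".txt", files)

# Escaped dot: inside an R string the backslash itself must be doubled
dot_as_literal <- grepl("\\.txt", files)

# fixed = TRUE treats the whole pattern as plain text, so no escaping is needed
dot_fixed <- grepl(".txt", files, fixed = TRUE)

print(rbind(dot_as_wildcard, dot_as_literal, dot_fixed))
```

Only the unescaped version accepts “notes_txt”, because there the dot is free to match the underscore.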
We can display only the social security number as follows:
x=("The client's social security number is: 222-33-9999")
gsub(".* ", "", x)
[1] "222-33-9999"
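The two patterns above rely on where the number happens to sit in the sentence. A more general sketch, assuming the number always has the 3-2-4 digit layout of the example, describes the number itself; the variable names ssn_pattern, ssn and masked are introduced here for illustration.

```r
x <- "The client's social security number is: 222-33-9999"

# Three digits, two digits, four digits, separated by hyphens
ssn_pattern <- "[0-9]{3}-[0-9]{2}-[0-9]{4}"

# Extract the number wherever it appears in the string
ssn <- regmatches(x, regexpr(ssn_pattern, x))

# Mask it without knowing its value in advance
masked <- sub(ssn_pattern, "bbb-bb-bbbb", x)

print(ssn)
print(masked)
```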
Since regular expressions deal with text rather than with numbers, matching a number in a given range takes a little extra care.
Since regular expressions work with text, a regular expression engine treats 0 as a single character, and 255 as three characters.
The regex [0-9] matches single-digit numbers 0 to 9. [1-9][0-9] matches double-digit numbers 10 to 99.
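Extending the same idea, a range such as 0 to 255 can be written as one alternative per group of numbers that share a digit layout. The following is a sketch; the test values are made up, and the anchors ^ and $ are added so that the whole string has to be the number.

```r
# 0-9 | 10-99 | 100-199 | 200-249 | 250-255, anchored to the whole string
byte_pattern <- "^([0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])$"

candidates <- c("0", "9", "10", "99", "199", "255", "256", "007")
in_range <- grepl(byte_pattern, candidates)

print(setNames(in_range, candidates))
```

“256” fails because no alternative covers it, and “007” fails because the engine compares characters, not numeric values.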
Suppose we have the following string y = “The client defaulted 2 times on the loan.” Suppose that we need to know if there is in fact a digit in this string. The digit could be any number from 0 to 9. We can determine if there is a digit as follows:
y = c("The client defaulted 2 times on the loan.")
grepl("[0-9]", y)
[1] TRUE
Characters can be listed individually, e.g. [amk] will match ‘a’, ‘m’, or ‘k’.
The most basic regular expression consists of a single literal character, such as a. It matches the first occurrence of that character in the string. If the string is Jack is a boy, it matches the a after the J. The fact that this a is in the middle of the word does not matter to the regex engine. If it matters to you, you will need to tell that to the regex engine by using word boundaries. We will get to that later.
This regex can match the second a too. It only does so when you tell the regex engine to start searching through the string after the first match. In a text editor, you can do so by using its “Find Next” or “Search Forward” function. In a programming language, there is usually a separate function that you can call to continue searching through the string after the previous match.
Similarly, the regex cat matches cat in About cats and dogs. This regular expression consists of a series of three literal characters. This is like saying to the regex engine: find a c, immediately followed by an a, immediately followed by a t.
Note that regex engines are case sensitive by default. cat does not match Cat, unless you tell the regex engine to ignore differences in case.
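In R, the behaviour described above can be observed with regexpr() (first match only), gregexpr() (all matches, comparable to repeated “Find Next”) and the ignore.case argument. A minimal sketch using the strings from the text:

```r
s <- "Jack is a boy"

# regexpr() reports only the first (leftmost) match
first_a <- regexpr("a", s)

# gregexpr() keeps searching after each match and returns every position
all_a <- gregexpr("a", s)[[1]]

# Matching is case sensitive unless ignore.case = TRUE is set
strict <- grepl("cat", "Cat")
relaxed <- grepl("cat", "Cat", ignore.case = TRUE)

print(as.integer(first_a))
print(as.integer(all_a))
print(c(strict = strict, relaxed = relaxed))
```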
The following characters have special meanings in a regular expression:
the backslash \: the backslash gives special meaning to the character following it. For example, the combination “\n” stands for the newline, one of the control characters. The combination “\w” stands for a “word” character.
the caret ^: the caret is the anchor for the start of the string, or the negation symbol when it is the first character inside square brackets.
"^a" matches "a" at the start of the string.
s=c("The caret is the anchor for that start of the string", "The dollar sign is the anchor for the end of the string", "Both of the symbols are special characters")
grepl("^The", s)
[1] TRUE TRUE FALSE
“^The” matches any string that starts with “The”. It does not pick up the “the” in the third index, since it does not appear at the start of the sentence.
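“of the string$” matches any string that ends with “of the string”. In the example below, “end of a string$” matches only the first element, because in the second element the phrase is followed by more text.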
y=c("the end of a string", "These symbols indicate the start and the end of a string, respectively: ")
grepl("end of a string$", y)
[1] TRUE FALSE
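“^0-9”: outside square brackets the caret is an anchor, so this pattern matches the literal text “0-9” at the start of the string. It is not a negated digit class, which would be written “[^0-9]”.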
y=c("0-9he end of a string", "These symbols indicate the start and the end of a string, respectively: ")
grepl("^0-9", y)
[1] TRUE FALSE
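The two roles of the caret are easy to confuse, so here is a short sketch that puts them side by side. The sample strings are made up for illustration.

```r
samples <- c("12345", "123a5", "a2345")

# ^ outside brackets anchors the match: does the string start with a digit?
starts_with_digit <- grepl("^[0-9]", samples)

# ^ as the first character inside brackets negates the class:
# does the string contain any character that is not a digit?
has_non_digit <- grepl("[^0-9]", samples)

# Both anchors together: is the string made of digits only?
digits_only <- grepl("^[0-9]+$", samples)

print(data.frame(samples, starts_with_digit, has_non_digit, digits_only))
```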
the dollar sign $,
the period or dot .,
In regular expressions, the dot or period is one of the most commonly used metacharacters. The dot matches a single character, without caring what that character is. In most regex engines the only exceptions are line break characters.
the vertical bar or pipe symbol |,
the question mark ?,
the asterisk or star *,
the plus sign +,
the opening parenthesis (,
the closing parenthesis ),
the opening square bracket [,
and the opening curly brace {.
These special characters are often called “metacharacters”.
If you want to use any of these characters as a literal in a regex, you need to escape them with a backslash. If you want to match «1+1=2», the correct regex is «1\+1=2». Otherwise, the plus sign has a special meaning.
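In R the escaping backslash has to be doubled inside the string literal. The sketch below shows what the unescaped pattern really matches; the input “111=2” is made up to make the effect visible.

```r
# Unescaped: "1+" means one or more 1s, so the pattern needs at least "11=2"
unescaped_on_sum <- grepl("1+1=2", "1+1=2")
unescaped_on_ones <- grepl("1+1=2", "111=2")

# Escaped plus sign: matches the literal text 1+1=2
escaped_on_sum <- grepl("1\\+1=2", "1+1=2")

print(c(unescaped_on_sum, unescaped_on_ones, escaped_on_sum))
```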
R’s base paste function is used to combine (or paste) a set of strings. In machine learning, it is frequently used for creating or restructuring variable names. For example, say you want to combine two strings (Var1 and Var2) into a new string Var3. For neatness, we’ll separate the values with a - (hyphen).
The code for this will be as follows:
var3 <- paste("Var1","Var2",sep = "-")
var3
[1] "Var1-Var2"
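For the variable-naming use mentioned above, it helps that paste() is vectorised. The following sketch builds several names at once; the names var_names, compact_names and joined are chosen here for illustration.

```r
# paste() is vectorised: one result per element of 1:3
var_names <- paste("Var", 1:3, sep = "_")

# paste0() behaves like paste() with sep = ""
compact_names <- paste0("Var", 1:3)

# collapse joins the elements of one vector into a single string
joined <- paste(c("Var1", "Var2", "Var3"), collapse = "-")

print(var_names)
print(compact_names)
print(joined)
```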
The toString function allows you to convert any non-character value to a string.
toString(1:25)
[1] "1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25"
Extract or replace matched substrings from match data.
tolower() converts a string to lower case. Alternatively, you can use the str_to_lower() function from the stringr package.
tolower("String")
[1] "string"
toupper() converts a string to upper case. Alternatively, you can use the str_to_upper() function from the stringr package.
toupper("String")
[1] "STRING"
substr(x, start, stop) is used to extract parts of a string. The start and stop positions need to be specified. Alternatively, you can use the str_sub() function from the stringr package.
substr("5 X 6 X 7", 4,6)
[1] " 6 "
Abbreviate strings:
abbreviate(c("monday","tuesday","wednesday"),minlength = 3)
monday tuesday wednesday
"mnd" "tsd" "wdn"
Split strings:
strsplit(x = c("ID-101","ID-102","ID-103","ID-104"),split = "-")
[[1]]
[1] "ID" "101"
[[2]]
[1] "ID" "102"
[[3]]
[1] "ID" "103"
[[4]]
[1] "ID" "104"
Find and replace first match:
x="Machine learning is a method of data analysis that automates analytical model building. It is a branch of artificial intelligence based on the idea that systems can learn from data, identify patterns and make decisions with minimal human intervention."
sub(pattern = "Machine learning", replacement = "_______", x, ignore.case = TRUE)
[1] "_______ is a method of data analysis that automates analytical model building. It is a branch of artificial intelligence based on the idea that systems can learn from data, identify patterns and make decisions with minimal human intervention."
Anchors
Substitute any digit with an underscore:
gsub(pattern = "\\d", "_", "I'm working in RStudio v.0.99.484")
[1] "I'm working in RStudio v._.__.___"
Substitute any non-digit with an underscore:
gsub(pattern = "\\D", "_", "I'm working in RStudio v.0.99.484")
To match one of several characters in a specified set we can enclose the characters of concern with square brackets [ ]. In addition, to match any characters not in a specified character set we can include the caret ^ at the beginning of the set within the brackets. The following displays the general syntax for common character classes but these can be altered easily as shown in the examples that follow:
Character classes
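As a small illustration of the bracket syntax just described, the sketch below applies a listed set, a POSIX class and a negated class to one made-up string.

```r
s <- "R2-D2 and c3po"

# Listed characters: replace each lowercase vowel
no_vowels <- gsub("[aeiou]", "_", s)

# POSIX class inside brackets: replace each run of digits
digits_hidden <- gsub("[[:digit:]]+", "#", s)

# Negated class: drop everything that is not a letter
letters_only <- gsub("[^[:alpha:]]", "", s)

print(c(no_vowels, digits_hidden, letters_only))
```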
Summary of Regular expression commands:
To find exactly where the pattern exists in a string use regexpr():
The output of regexpr() can be interpreted as follows. The first element provides the starting position of the match in each element of the input; the value -1 means there is no match. The second element (attribute “match.length”) provides the length of each match. The third element (attribute “useBytes”) is TRUE when matching was done byte-by-byte rather than character-by-character.
regexpr("apple", c("I love apple and apple pie", "Apple ipad", "apple ipod"))
[1] 8 -1 1
attr(,"match.length")
[1] 5 -1 5
attr(,"useBytes")
[1] TRUE
Use regmatches to get the actual substrings matched by the regular expression or replace matched substrings.
x <- c("I love apple and apple pie", "Apple ipad", "apple ipod")
m <- regexpr("apple", x)
regmatches(x, m)
[1] "apple" "apple"
x <- c("I love apple and apple pie", "Apple ipad", "apple ipod")
m <- regexpr("apple", x)
regmatches(x, m) <- "orange"
x
[1] "I love orange and apple pie"
[2] "Apple ipad"
[3] "orange ipod"
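regexpr() only sees the first match in each element, which is why the second “apple” in the first sentence is left untouched above. A sketch of the same example with gregexpr(), which finds every match, and with ignore.case = TRUE so that “Apple” is found as well:

```r
x <- c("I love apple and apple pie", "Apple ipad", "apple ipod")

# gregexpr() returns one set of match positions per input element
m_all <- gregexpr("apple", x, ignore.case = TRUE)

# regmatches() then returns a list with all matched substrings per element
all_matches <- regmatches(x, m_all)
match_counts <- sapply(all_matches, length)

print(all_matches)
print(match_counts)
```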
Function strsplit splits its input according to a specified regular expression.
strsplit(x, split, fixed = FALSE, perl = FALSE, useBytes = FALSE)
x <- c("I love apple and apple pie", "Apple ipad", "apple ipod")
strsplit(x, split = "\\W")
[[1]]
[1] "I" "love" "apple" "and" "apple" "pie"
[[2]]
[1] "Apple" "ipad"
[[3]]
[1] "apple" "ipod"
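The example above works cleanly because the words are separated by single spaces. When several non-word characters follow each other, splitting on «\W» produces empty strings; adding the + quantifier treats each run as one separator. A sketch with a made-up input:

```r
messy <- "one, two;  three"

# Each single non-word character is a separator, so empty pieces appear
single <- strsplit(messy, split = "\\W")[[1]]

# One or more non-word characters form a single separator
runs <- strsplit(messy, split = "\\W+")[[1]]

print(single)
print(runs)
```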
This is a very important point to understand: a regex-directed engine will always return the leftmost match, even if a “better” match could be found later. When applying a regex to a string, the engine will start at the first character of the string. It will try all possible permutations of the regular expression at the first character. Only if all possibilities have been tried and found to fail, will the engine continue with the second character in the text. Again, it will try all possible permutations of the regex, in exactly the same order. The result is that the regex-directed engine will return the leftmost match.
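The leftmost-match rule, and the word boundaries promised earlier, can be illustrated as follows. The sentence is made up; perl = TRUE is passed here so that «\b» is interpreted as a word boundary by the Perl-compatible engine.

```r
s <- "He captured a catfish for his cat"

# The engine tries "cat" at "captured" (fails at "p"), then succeeds
# at the first place it can: inside "catfish"
first_cat <- regexpr("cat", s)

# \b on both sides requires "cat" to stand alone as a word
whole_word <- regexpr("\\bcat\\b", s, perl = TRUE)

print(as.integer(first_cat))
print(as.integer(whole_word))
```

The first call stops at “catfish” even though the standalone word “cat” might be considered the better match; the word boundaries are what tell the engine that this matters.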
With a “character class”, also called “character set”, you can tell the regex engine to match only one out of several characters. Simply place the characters you want to match between square brackets. If you want to match an a or an e, use «[ae]». You could use this in «gr[ae]y» to match either “gray” or “grey”. Very useful if you do not know whether the document you are searching through is written in American or British English.
A character class matches only a single character. «gr[ae]y» will not match “graay”, “graey” or any such thing. The order of the characters inside a character class does not matter. The results are identical. You can use a hyphen inside a character class to specify a range of characters. «[0-9]» matches a single digit between 0 and 9. You can use more than one range. «[0-9a-fA-F]» matches a single hexadecimal digit, case insensitively. You can combine ranges and single characters. «[0-9a-fxA-FX]» matches a hexadecimal digit or the letter X. Again, the order of the characters and the ranges does not matter.
Find a word, even if it is misspelled, such as «sep[ae]r[ae]te» or «li[cs]en[cs]e».
Find an identifier in a programming language with «[A-Za-z_][A-Za-z_0-9]*».
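Both patterns can be tried with grepl(). In this sketch the test strings are made up, and the identifier pattern is wrapped in ^ and $ (an addition to the pattern in the text) so that the whole string must be an identifier rather than merely contain one.

```r
# Identifier: a letter or underscore, then letters, digits or underscores
identifier <- "^[A-Za-z_][A-Za-z_0-9]*$"
names_to_check <- c("x1", "_tmp", "1x", "my-var")
is_identifier <- grepl(identifier, names_to_check)

# Tolerate a/e confusion in two positions of "separate"
spellings <- c("separate", "seperate", "seprate")
is_separate <- grepl("sep[ae]r[ae]te", spellings)

print(setNames(is_identifier, names_to_check))
print(setNames(is_separate, spellings))
```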
Most of the data on the web is not readily available in a structured form. It is embedded in HTML pages and cannot simply be downloaded.
Web scraping is a technique for converting data that is present in an unstructured format (HTML tags) on the web into a structured format that can easily be accessed and used.
Almost all the main languages provide ways of performing web scraping. We use R to scrape the data for the most popular feature films of 2016 from the IMDb website.
We’ll get a number of features for each of the 100 popular feature films released in 2016. We’ll also look at the most common problems one might face while scraping data from the internet because of inconsistencies in the website code, and at how to solve them.
There are several ways of scraping data from the web. Some of the popular ways are:
Human Copy-Paste: This is a slow, manual way of scraping data from the web. A person reads the pages and copies the data to local storage.
Text pattern matching: Another simple yet powerful approach to extracting information from the web is to use the regular expression matching facilities of programming languages.
API Interface: Many websites like Facebook, Twitter, LinkedIn, etc. provide public and/or private APIs which can be called using standard code for retrieving the data in the prescribed format.
DOM Parsing: By using the web browsers, programs can retrieve the dynamic content generated by client-side scripts. It is also possible to parse web pages into a DOM tree, based on which programs can retrieve parts of these pages.
We’ll use the DOM parsing approach and rely on the CSS selectors of the webpage for finding the relevant fields which contain the desired information.
We’ll be using an open source tool named “Selector Gadget”, which is more than sufficient for performing web scraping.
Using this, you can click on any part of a website and get the relevant tags needed to access that part.
Now, let’s get started with scraping the IMDb website for the 100 most popular feature films released in 2016.
Install the package “rvest”.
Specifying the URL of the website to be scraped:
library('rvest')
url <- 'http://www.imdb.com/search/title?count=100&release_date=2016,2016&title_type=feature'
Reading the HTML code from the website:
webpage <- read_html(url)
webpage
{xml_document}
<html xmlns:og="http://ogp.me/ns#" xmlns:fb="http://www.facebook.com/2008/fbml">
[1] <head>\n<meta http-equiv="Content-Type" content="tex ...
[2] <body id="styleguide-v2" class="fixed">\n ...
Now, we’ll be scraping the following data from this website.
Rank: The rank of the film from 1 to 100 on the list of 100 most popular feature films released in 2016.
Title: The title of the feature film.
Rank:
We will start by scraping the Rank field. For that, we’ll use the selector gadget to get the specific CSS selectors that enclose the rankings. You can click on the extension in your browser and select the rankings field with the cursor.
An HTML element is an individual component of an HTML document or web page, once this has been parsed into the Document Object Model. HTML is composed of a tree of HTML nodes, such as text nodes. Each node can have HTML attributes specified. Nodes can also have content, including other nodes and text. Many HTML nodes represent semantics, or meaning. For example, the “title” node represents the title of the document.
rank_data_html <- html_nodes(webpage,'.text-primary')
rank_data_html
{xml_nodeset (100)}
[1] <span class="lister-item-index unbold text-primary" ...
[2] <span class="lister-item-index unbold text-primary" ...
[3] <span class="lister-item-index unbold text-primary" ...
[4] <span class="lister-item-index unbold text-primary" ...
[5] <span class="lister-item-index unbold text-primary" ...
[6] <span class="lister-item-index unbold text-primary" ...
[7] <span class="lister-item-index unbold text-primary" ...
[8] <span class="lister-item-index unbold text-primary" ...
[9] <span class="lister-item-index unbold text-primary" ...
[10] <span class="lister-item-index unbold text-primary" ...
[11] <span class="lister-item-index unbold text-primary" ...
[12] <span class="lister-item-index unbold text-primary" ...
[13] <span class="lister-item-index unbold text-primary" ...
[14] <span class="lister-item-index unbold text-primary" ...
[15] <span class="lister-item-index unbold text-primary" ...
[16] <span class="lister-item-index unbold text-primary" ...
[17] <span class="lister-item-index unbold text-primary" ...
[18] <span class="lister-item-index unbold text-primary" ...
[19] <span class="lister-item-index unbold text-primary" ...
[20] <span class="lister-item-index unbold text-primary" ...
...
Convert the ranking data to text:
rank_data <- html_text(rank_data_html)
rank_data
[1] "1." "2." "3." "4." "5." "6." "7."
[8] "8." "9." "10." "11." "12." "13." "14."
[15] "15." "16." "17." "18." "19." "20." "21."
[22] "22." "23." "24." "25." "26." "27." "28."
[29] "29." "30." "31." "32." "33." "34." "35."
[36] "36." "37." "38." "39." "40." "41." "42."
[43] "43." "44." "45." "46." "47." "48." "49."
[50] "50." "51." "52." "53." "54." "55." "56."
[57] "57." "58." "59." "60." "61." "62." "63."
[64] "64." "65." "66." "67." "68." "69." "70."
[71] "71." "72." "73." "74." "75." "76." "77."
[78] "78." "79." "80." "81." "82." "83." "84."
[85] "85." "86." "87." "88." "89." "90." "91."
[92] "92." "93." "94." "95." "96." "97." "98."
[99] "99." "100."
Converting rankings to numerical:
rank_data<-as.numeric(rank_data)
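The regular expression functions from the first part can make this kind of clean-up explicit before converting to numbers. The sketch below uses short hand-typed vectors as stand-ins for scraped text (the live page is not fetched), and the names rank_text and score_text are introduced only for this illustration.

```r
# Stand-ins for scraped text; the values are typed in by hand
rank_text <- c("1.", "2.", "100.")
score_text <- c("34        ", "65        ")

# Remove the trailing period, then convert to integer
rank_num <- as.integer(sub("\\.$", "", rank_text))

# Strip leading and trailing whitespace, then convert to numeric
score_num <- as.numeric(gsub("^[[:space:]]+|[[:space:]]+$", "", score_text))

print(rank_num)
print(score_num)
```

Stating the clean-up as a pattern documents which characters are expected around the number, instead of leaving that to the conversion function.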
Now you can clear the selector section and select all the titles:
title_data_html <- html_nodes(webpage,'.lister-item-header a')
title_data_html
{xml_nodeset (100)}
[1] <a href="/title/tt3704700/?ref_=adv_li_tt">Brain on ...
[2] <a href="/title/tt1431045/?ref_=adv_li_tt">Deadpool ...
[3] <a href="/title/tt4972582/?ref_=adv_li_tt">Split</a>
[4] <a href="/title/tt1386697/?ref_=adv_li_tt">Suicide ...
[5] <a href="/title/tt3521164/?ref_=adv_li_tt">Moana</a>
[6] <a href="/title/tt1211837/?ref_=adv_li_tt">Doctor S ...
[7] <a href="/title/tt3470600/?ref_=adv_li_tt">Sing</a>
[8] <a href="/title/tt3498820/?ref_=adv_li_tt">Captain ...
[9] <a href="/title/tt2660888/?ref_=adv_li_tt">Star Tre ...
[10] <a href="/title/tt3110958/?ref_=adv_li_tt">Now You ...
[11] <a href="/title/tt1289401/?ref_=adv_li_tt">Ghostbus ...
[12] <a href="/title/tt3385516/?ref_=adv_li_tt">X-Men: A ...
[13] <a href="/title/tt2975590/?ref_=adv_li_tt">Batman v ...
[14] <a href="/title/tt2674426/?ref_=adv_li_tt">Me Befor ...
[15] <a href="/title/tt1679335/?ref_=adv_li_tt">Trolls</a>
[16] <a href="/title/tt3748528/?ref_=adv_li_tt">Rogue On ...
[17] <a href="/title/tt3183660/?ref_=adv_li_tt">Fantasti ...
[18] <a href="/title/tt2543164/?ref_=adv_li_tt">Arrival</a>
[19] <a href="/title/tt4094724/?ref_=adv_li_tt">The Purg ...
[20] <a href="/title/tt1700841/?ref_=adv_li_tt">Sausage ...
...
Converting the title data to text:
title_data <- html_text(title_data_html)
title_data
[1] "Brain on Fire"
[2] "Deadpool"
[3] "Split"
[4] "Suicide Squad"
[5] "Moana"
[6] "Doctor Strange"
[7] "Sing"
[8] "Captain America: Civil War"
[9] "Star Trek: Beyond"
[10] "Now You See Me 2"
[11] "Ghostbusters"
[12] "X-Men: Apocalypse"
[13] "Batman v Superman: Dawn of Justice"
[14] "Me Before You"
[15] "Trolls"
[16] "Rogue One"
[17] "Fantastic Beasts and Where to Find Them"
[18] "Arrival"
[19] "The Purge: Election Year"
[20] "Sausage Party"
[21] "The Magnificent Seven"
[22] "Hacksaw Ridge"
[23] "Gods of Egypt"
[24] "The Boy"
[25] "Warcraft"
[26] "Don't Breathe"
[27] "Zootopia"
[28] "La La Land"
[29] "Passengers"
[30] "A Cure for Wellness"
[31] "The Legend of Tarzan"
[32] "The Lost City of Z"
[33] "The Conjuring 2"
[34] "The Secret Life of Pets"
[35] "Jack Reacher: Never Go Back"
[36] "Kimi no na wa."
[37] "13 Hours"
[38] "Everybody Wants Some!!"
[39] "Central Intelligence"
[40] "Nocturnal Animals"
[41] "The Girl on the Train"
[42] "Finding Dory"
[43] "Moonlight"
[44] "Teenage Mutant Ninja Turtles: Out of the Shadows"
[45] "The Shallows"
[46] "Dirty Grandpa"
[47] "The Bad Batch"
[48] "Hell or High Water"
[49] "The Nice Guys"
[50] "Hidden Figures"
[51] "Miss Peregrine's Home for Peculiar Children"
[52] "Dark Crimes"
[53] "The Jungle Book"
[54] "Jason Bourne"
[55] "Why Him?"
[56] "Contratiempo"
[57] "The Neon Demon"
[58] "The 5th Wave"
[59] "The Accountant"
[60] "Nowhereland"
[61] "Lion"
[62] "How to Be Single"
[63] "The Limehouse Golem"
[64] "The Edge of Seventeen"
[65] "10 Cloverfield Lane"
[66] "Independence Day: Resurgence"
[67] "Below Her Mouth"
[68] "Inferno"
[69] "Captain Fantastic"
[70] "Deepwater Horizon"
[71] "Allegiant"
[72] "Assassin's Creed"
[73] "Hush"
[74] "Brimstone"
[75] "Nerve"
[76] "Bad Moms"
[77] "Manchester by the Sea"
[78] "Allied"
[79] "War Dogs"
[80] "Patriots Day"
[81] "Grave"
[82] "Mike and Dave Need Wedding Dates"
[83] "The Choice"
[84] "Masterminds"
[85] "The Great Wall"
[86] "The Autopsy of Jane Doe"
[87] "The Founder"
[88] "Kung Fu Panda 3"
[89] "The Huntsman: Winter's War"
[90] "Ah-ga-ssi"
[91] "Hail, Caesar!"
[92] "Underworld: Blood Wars"
[93] "Gold"
[94] "The Boss"
[95] "Leap!"
[96] "Bridget Jones's Baby"
[97] "Before I Wake"
[98] "Silence"
[99] "Morgan"
[100] "Lights Out"
Metascore:
metascore_data_html<-html_nodes(webpage,'.metascore')
metascore_data_html
{xml_nodeset (98)}
[1] <span class="metascore unfavorable">34 </span>
[2] <span class="metascore favorable">65 </span>
[3] <span class="metascore favorable">62 </span>
[4] <span class="metascore mixed">40 </span>
[5] <span class="metascore favorable">81 </span>
[6] <span class="metascore favorable">72 </span>
[7] <span class="metascore mixed">59 </span>
[8] <span class="metascore favorable">75 </span>
[9] <span class="metascore favorable">68 </span>
[10] <span class="metascore mixed">46 </span>
[11] <span class="metascore mixed">60 </span>
[12] <span class="metascore mixed">52 </span>
[13] <span class="metascore mixed">44 </span>
[14] <span class="metascore mixed">51 </span>
[15] <span class="metascore mixed">56 </span>
[16] <span class="metascore favorable">65 </span>
[17] <span class="metascore favorable">66 </span>
[18] <span class="metascore favorable">81 </span>
[19] <span class="metascore mixed">55 </span>
[20] <span class="metascore favorable">66 </span>
...
Convert to text:
metascore_data <- html_text(metascore_data_html)
metascore_data
[1] "34 " "65 " "62 " "40 "
[5] "81 " "72 " "59 " "75 "
[9] "68 " "46 " "60 " "52 "
[13] "44 " "51 " "56 " "65 "
[17] "66 " "81 " "55 " "66 "
[21] "54 " "71 " "25 " "42 "
[25] "32 " "71 " "78 " "93 "
[29] "41 " "47 " "44 " "78 "
[33] "65 " "61 " "47 " "79 "
[37] "48 " "83 " "52 " "67 "
[41] "48 " "77 " "99 " "40 "
[45] "59 " "21 " "62 " "88 "
[49] "70 " "74 " "57 " "24 "
[53] "77 " "58 " "39 " "51 "
[57] "33 " "51 " "69 " "51 "
[61] "63 " "77 " "76 " "32 "
[65] "42 " "42 " "72 " "68 "
[69] "33 " "36 " "67 " "45 "
[73] "58 " "60 " "96 " "60 "
[77] "57 " "69 " "81 " "51 "
[81] "26 " "47 " "42 " "65 "
[85] "66 " "66 " "35 " "84 "
[89] "72 " "23 " "49 " "40 "
[93] "48 " "59 " "68 " "79 "
[97] "48 " "58 "
metascore_data<-as.numeric(metascore_data)
metascore_data
[1] 34 65 62 40 81 72 59 75 68 46 60 52 44 51 56 65 66 81
[19] 55 66 54 71 25 42 32 71 78 93 41 47 44 78 65 61 47 79
[37] 48 83 52 67 48 77 99 40 59 21 62 88 70 74 57 24 77 58
[55] 39 51 33 51 69 51 63 77 76 32 42 42 72 68 33 36 67 45
[73] 58 60 96 60 57 69 81 51 26 47 42 65 66 66 35 84 72 23
[91] 49 40 48 59 68 79 48 58
write.csv(title_data, "Movies.csv")
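Add a column. cbind() needs the data itself rather than the file name, and both columns must have the same length. Only 98 of the 100 films have a Metascore, so the two missing scores first have to be inserted as NA at the positions of the films that lack one (those positions are not determined here):
movies <- read.csv("Movies.csv")
# This line fails until metascore_data has been padded to 100 entries with NA
tmp <- cbind(movies, metascore = metascore_data)
tmp
write.csv(tmp, "Moviesmetascore.csv")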
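One of the problems announced at the start shows up here: 100 titles were scraped but only 98 Metascores, so the two vectors cannot be combined row by row. A minimal sketch of one way to handle this, assuming the positions of the films without a score are already known, is to pad the shorter vector with NA. The data and the positions in missing_at are made up for illustration and are not taken from the IMDb page.

```r
# Made-up stand-ins: five titles but only three scores
titles <- c("Film A", "Film B", "Film C", "Film D", "Film E")
scores <- c(34, 65, 62)

# Hypothetical positions of the titles that have no score
missing_at <- c(2, 4)

# Start with NA everywhere, then fill in the positions that do have a score
padded <- rep(NA_real_, length(titles))
padded[-missing_at] <- scores

movies <- data.frame(title = titles, metascore = padded, stringsAsFactors = FALSE)
print(movies)
```

After padding, both columns have one entry per film and can be written to a single CSV file.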
voorbeeld=c(1,2,3)
hist(voorbeeld)