Basic R practices to handle string data

  1. R is a statistical programming language
# Simple addition and subtraction
10+5
## [1] 15
10-5
## [1] 5
# Creating vectors: A sequence of numbers/integers, characters, Booleans
c(1,3,5) # Join elements into a vector 
## [1] 1 3 5
1:5 # An integer sequence
## [1] 1 2 3 4 5
seq(1, 5, by=2) # A sequence of integers from 1 to 5, increasing by 2
## [1] 1 3 5
rep(1:5, times=3) # Repeat an integer sequence 1:5 three times
##  [1] 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5
rep(1:5, each=3) # Repeat each element of an integer sequence 1:5 three times
##  [1] 1 1 1 2 2 2 3 3 3 4 4 4 5 5 5
  1. R is an object-oriented environment in which values, such as numbers, are assigned to objects.
# R uses the less-than symbol followed by a hyphen (<-) as the assignment operator
a <- 10 # assign 10 to the object 'a'
a - 5 # subtract 5 from a
## [1] 5
  1. Character values can also be assigned to R objects.
class <- "text mining"
print(class)
## [1] "text mining"
# R is an object-oriented programming language; everything can be assigned to an object
fruits <- c("Apple","Grape","Pear","Apple","Mango","Orange","Mango","Strawberry","Grape","Apple")
sort(fruits) # Return the object, fruits, sorted in alphabetical order for characters
##  [1] "Apple"      "Apple"      "Apple"      "Grape"      "Grape"     
##  [6] "Mango"      "Mango"      "Orange"     "Pear"       "Strawberry"
table(fruits) # See counts of values (elements)
## fruits
##      Apple      Grape      Mango     Orange       Pear Strawberry 
##          3          2          2          1          1          1
unique(fruits) # See unique values (elements)
## [1] "Apple"      "Grape"      "Pear"       "Mango"      "Orange"    
## [6] "Strawberry"
# String data can be a character vector
hello <- c("Hello!", "World", "is", "good!")
print(hello)
## [1] "Hello!" "World"  "is"     "good!"

3-1. Selecting Vector Elements

By Position

fruits[4] # The fourth element of the vector object, fruits
## [1] "Apple"
fruits[-4] # All but the fourth one
## [1] "Apple"      "Grape"      "Pear"       "Mango"      "Orange"    
## [6] "Mango"      "Strawberry" "Grape"      "Apple"
fruits[2:4] # Elements from the second to the fourth ones
## [1] "Grape" "Pear"  "Apple"
fruits[-(2:4)] # All elements except the second to the fourth ones
## [1] "Apple"      "Mango"      "Orange"     "Mango"      "Strawberry"
## [6] "Grape"      "Apple"
fruits[c(1,5)] # The first and the fifth elements
## [1] "Apple" "Mango"

By Value

fruits[fruits == "Apple"] # Elements which are equal to "Apple"
## [1] "Apple" "Apple" "Apple"
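fruits["Apple"] # A character index looks up element names, not values; fruits has no names, so the result is NA
## [1] NA
# Note: == tests for equality
# = sets an argument in a function call; e.g., seq(1, 5, by=2)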
fruits[fruits != "Apple"] # All but elements equal to "Apple"
## [1] "Grape"      "Pear"       "Mango"      "Orange"     "Mango"     
## [6] "Strawberry" "Grape"
# != denotes not equal
fruits[fruits %in% c("Apple","Mango","Kiwi")] # Elements in the set "Apple", "Mango", "Kiwi"
## [1] "Apple" "Apple" "Mango" "Mango" "Apple"
# %in% identifies elements included in the following vector

Note that parentheses ( ) follow a function name to call that function, whereas square brackets [ ] select vector elements.
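The NA returned by fruits["Apple"] above comes from name-based indexing: a character string inside [ ] is matched against element names, not element values. The following is a minimal sketch of the difference; the named vector `prices` and the short vector `fruits_demo` are hypothetical objects introduced only for illustration.

```r
# Hypothetical named vector, used only for illustration
prices <- c(Apple = 1.2, Grape = 3.5, Pear = 0.9)
prices["Grape"]                  # name-based lookup works because names exist

# A vector without names, like fruits in the text
fruits_demo <- c("Apple", "Grape", "Apple")
names(fruits_demo)               # NULL: there are no names to match against
fruits_demo["Apple"]             # NA, because no element is *named* "Apple"

# To select by value, build a logical index or ask for the positions
fruits_demo[fruits_demo == "Apple"]
which(fruits_demo == "Apple")    # positions of the matching values
```

In short, use a logical comparison (or which( )) to select by value, and a character index only when the vector has names.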

  1. You can deal with string objects as follows.
paste(hello, collapse = ",")
## [1] "Hello!,World,is,good!"
hello_paste <- paste(hello, collapse = " ")
strsplit(hello_paste, split = " ")
## [[1]]
## [1] "Hello!" "World"  "is"     "good!"
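The paste( ) calls above use only the collapse argument. As a small sketch of how it differs from the sep argument, the example below pastes two hypothetical character vectors, `first` and `second`, which are not part of the original material.

```r
# Hypothetical vectors, used only for illustration
first  <- c("text", "data")
second <- c("mining", "science")

# sep is placed between the arguments, element by element
pairs_space <- paste(first, second)             # default sep = " "
pairs_under <- paste(first, second, sep = "_")

# collapse joins the elements of the result into one single string
one_string <- paste(first, second, sep = "_", collapse = "; ")

pairs_space
pairs_under
one_string
```

So sep controls how the arguments are glued together, while collapse controls whether the resulting vector is reduced to a single string.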

R functions in the Base Package for text analysis

  1. nchar(): counting the number of characters and the number of bytes
nchar("data")
## [1] 4
nchar("bigdata")
## [1] 7
# Blank and punctuation marks are also counted 
nchar("big data")
## [1] 8
nchar("big, data")
## [1] 9
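The calls above all use the default, which counts characters. To illustrate the difference between characters and bytes mentioned in the heading, here is a minimal sketch using the type argument of nchar( ); the word `word_utf8` is a hypothetical example, written with a Unicode escape so that it is stored as UTF-8.

```r
# Hypothetical example word "café"; \u00e9 is the letter e with an acute accent
word_utf8 <- enc2utf8("caf\u00e9")

n_chars <- nchar(word_utf8, type = "chars")  # number of characters
n_bytes <- nchar(word_utf8, type = "bytes")  # number of bytes in UTF-8

n_chars
n_bytes
```

The accented letter is one character but occupies two bytes in UTF-8, which is why the two counts differ. The same applies, with larger differences, to Korean text.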
  1. strsplit(): Sentences are composed of words, which in turn consist of characters. In text mining, therefore, we segment preprocessed sentences into words as tokens. Later, we may also need to combine them again.
sent1 <- "Text mining begins with preprocessing and tokenization"
# We split sent1 into seven words that are separated from each other by white space (blank)
sent1_split <- strsplit(sent1, split=" ") # Parameter (split=" ") specifies a character (here, blank) that separates words as tokens
sent1_split # returns a list of vectors of words segmented from sent1
## [[1]]
## [1] "Text"          "mining"        "begins"        "with"         
## [5] "preprocessing" "and"           "tokenization"
sent1_split[[1]] # returns the first vector of word segments in the list object, sent1_split
## [1] "Text"          "mining"        "begins"        "with"         
## [5] "preprocessing" "and"           "tokenization"
sent1_split[[1]][5] # returns the fifth element in the first vector of word segments
## [1] "preprocessing"
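strsplit( ) returns a list because it can split several strings at once, producing one list element per input string. The sketch below illustrates this with a hypothetical two-element vector `docs`. It also shows that the split argument is interpreted as a regular expression by default, so a period has to be escaped or matched literally with fixed = TRUE.

```r
# Hypothetical input: two strings split in a single call
docs <- c("Text mining is fun", "R handles strings")
tokens <- strsplit(docs, split = " ")

length(tokens)    # one list element per input string
lengths(tokens)   # number of words in each string

# split is a regular expression by default, and "." matches any character
unescaped <- strsplit("a.b", split = ".")[[1]]               # only empty strings
escaped   <- strsplit("a.b", split = "\\.")[[1]]             # escaped period
literal   <- strsplit("a.b", split = ".", fixed = TRUE)[[1]] # literal matching

unescaped
escaped
literal
```

When splitting on punctuation, either escape the character or set fixed = TRUE.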
  1. paste(): We can combine word segments back into a sentence in which the words are separated by blanks
sent2 <- paste(sent1_split[[1]], collapse=" ") # Parameter (collapse) specifies how elements in sent1_split are combined. Here, we want to combine word elements into a sentence that separates them by blank (" ").
paste(sent1_split[[1]], collapse="/") 
## [1] "Text/mining/begins/with/preprocessing/and/tokenization"
sent2
## [1] "Text mining begins with preprocessing and tokenization"
  1. Getting Help: We can access the help files about functions and packages.
?c # Get help of a particular function c( )
?strsplit
help.search("split") # Search and return the help files for functions that include a word or phrase
help(package = "stringr") # Find help for a package 
  1. tolower( ) and toupper( ): Translate uppercase letters into lowercase letters and vice versa
sentence <- "Text mining is easy to learn"
toupper(sentence)
## [1] "TEXT MINING IS EASY TO LEARN"
tolower(toupper(sentence))
## [1] "text mining is easy to learn"

Note1: R is case-sensitive; it distinguishes lowercase from uppercase letters

word1 <- "TEXT"
word1
## [1] "TEXT"
word2 <- "Text"
word2
## [1] "Text"
word1 == word2
## [1] FALSE

Note2: Parentheses are for functions but brackets are for selecting a certain element of a vector.

tolower(word1)
## [1] "text"
toupper(word2) 
## [1] "TEXT"
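Because R is case-sensitive, the same word written with different capitalization is counted as different values. A minimal sketch of why case is usually normalized before counting words, using a hypothetical vector `words_mixed`:

```r
# Hypothetical word vector with inconsistent capitalization
words_mixed <- c("Text", "text", "TEXT", "Mining", "mining")

raw_counts   <- table(words_mixed)            # five distinct values
lower_counts <- table(tolower(words_mixed))   # two distinct values

raw_counts
lower_counts
```

After tolower( ), the three spellings of "text" fall into a single category, which is usually what a word-frequency analysis needs.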

list( )

What is a vector object? An ordered collection of elements of the same type.

Ex) a vector of numbers; a vector of characters; a logical vector

What if we want one vector object of each of these three types?

Remember what the function c( ) does.

# assign a sequence of numbers to a numeric vector object, vector_numeric
vector_numeric <- c(1:6) # a sequence of numbers from 1 to 6
vector_numeric
## [1] 1 2 3 4 5 6
# assign a sequence of characters to a character vector object, vector_character
vector_character <- c("v","e","c","t","o","r") # characters should be distinguished with quotation marks.
vector_character
## [1] "v" "e" "c" "t" "o" "r"
# assign a sequence of Boolean values, TRUE or FALSE, to a logical vector object, vector_logical
vector_logical <- c(TRUE, FALSE, FALSE, TRUE, T, F) # TRUE/FALSE should be uppercase letters
vector_logical
## [1]  TRUE FALSE FALSE  TRUE  TRUE FALSE
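Each of the three vectors above holds a single type. As a small sketch of what happens when types are mixed inside c( ), the hypothetical objects below show that R silently converts all elements to one common type.

```r
# Hypothetical examples of mixing types inside c( )
mixed_chr <- c(1, "a", TRUE)   # everything becomes character
mixed_num <- c(1, TRUE, FALSE) # logical values become numbers

class(mixed_chr)
mixed_chr
class(mixed_num)
mixed_num
```

A vector therefore cannot keep numbers, characters, and logical values side by side in their original types.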

What is a list object?

A vector-like object whose elements may be heterogeneous. That is, a list is a collection of elements that can be of different types.

The elements of a list can include numeric vectors, character vectors, and logical vectors at once.

Let's say we want to generate a list object that contains all three of the vector objects above.

We can use a function list( ) here.

List1 <- list(numbers=vector_numeric, characters=vector_character, Booleans=vector_logical) # Generate a list object, List1, to contain all the elements that are named "numbers", "characters", and "Booleans"
length(List1) # The function, length, calculates how many elements are in any R object (vectors, lists, & factors).
## [1] 3
class(List1) # R is an object-oriented style of programming. The function, class, allows us to know what type an object belongs to. It can be numeric, character, logical, list, and so on...
## [1] "list"
names(List1) # To get or set the names of an object; Here, elements' names are "numbers", "characters", and "Booleans"
## [1] "numbers"    "characters" "Booleans"
List1[1:2] # returns a new list object that contains the first and the second elements. What is length(List1[1:2])?
## $numbers
## [1] 1 2 3 4 5 6
## 
## $characters
## [1] "v" "e" "c" "t" "o" "r"
List1[2] # returns a new list object that contains the second element only. What is length(List1[2])?
## $characters
## [1] "v" "e" "c" "t" "o" "r"
List1[[2]] # returns a vector object that contains six elements in the second element of List1, the characters "v", "e", "c", "t", "o", and "r". What is length(List1[[2]])? 
## [1] "v" "e" "c" "t" "o" "r"
List1['characters'] # returns a new list with the element named 'characters' only
## $characters
## [1] "v" "e" "c" "t" "o" "r"
List1[['characters']] # returns a vector with the elements of the list element named 'characters'
## [1] "v" "e" "c" "t" "o" "r"
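Named list elements can also be extracted with the $ operator, which is not used in the examples above. A minimal sketch, using a hypothetical list `demo_list` so that the block runs on its own:

```r
# Hypothetical named list, used only for illustration
demo_list <- list(numbers = 1:6, characters = c("v", "e", "c"))

by_dollar  <- demo_list$characters        # extract by name with $
by_bracket <- demo_list[["characters"]]   # same result with [[ ]]
identical(by_dollar, by_bracket)

str(demo_list)   # compact overview of the list's structure
```

Both forms return the element itself (here a character vector), unlike single brackets, which return a list.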

Why is list( ) important in text mining?

Text data are analyzed at multiple levels. The list( ) function is useful for dealing with hierarchical data.

Ex) Document > Paragraphs > Sentences > Words

Let's say we have two sentences about BTS.

Ex) BTS <- “BTS is known as the Bangtan Boys. And BTS is a seven-member South Korean boy band.”

The character object BTS can be represented as a list of two sentences, where each sentence is a vector containing a sequence of words.

# Example
sentences <- "BTS is known as the Bangtan Boys. And BTS is a seven-member South Korean boy band." 
# Conceptually, each sentence would become one list element holding a vector of words:
#sentences[[1]] <- c("BTS", "is", "known", "as", "the", "Bangtan Boys")
#sentences[[2]] <- c("And", "BTS", "is", "a", "seven-member", "South Korean", "boy band")
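The hierarchy Sentences > Words can be produced with two strsplit( ) calls. The following is only a rough sketch: the object names `bts_text`, `bts_sentences`, and `bts_words` are hypothetical, and the simple splitting rules are an assumption, so the result differs from the hand-built vectors in the text.

```r
# Hypothetical object names; the text is the example from above
bts_text <- "BTS is known as the Bangtan Boys. And BTS is a seven-member South Korean boy band."

# Step 1: split into sentences at a period followed by a blank
bts_sentences <- strsplit(bts_text, split = "\\. ")[[1]]

# Step 2: split each sentence into words; the result is a list of vectors
bts_words <- strsplit(bts_sentences, split = " ")

length(bts_words)    # number of sentences
lengths(bts_words)   # number of words in each sentence
bts_words[[2]][1]    # first word of the second sentence
```

Note that this naive approach splits "Bangtan Boys" into two tokens and leaves the final period attached to "band.", so real text usually needs further preprocessing.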

Let’s practice on list( )

# A list is formed with following objects
obj1 <- c("BTS","is","known","as","the","Bangtan Boys")
obj2 <- c("And", "BTS","is","a","seven-member","South Korea","boy band")
obj3 <- list(obj1, obj2)
class(obj1) # vector or list?
## [1] "character"
class(obj2) # vector or list?
## [1] "character"
class(obj3) # vector or list?
## [1] "list"
# A list can contain another list; that is, lists can be nested
MyList <- list(obj1, obj2, obj3)
MyList #vector or list?
## [[1]]
## [1] "BTS"          "is"           "known"        "as"          
## [5] "the"          "Bangtan Boys"
## 
## [[2]]
## [1] "And"          "BTS"          "is"           "a"           
## [5] "seven-member" "South Korea"  "boy band"    
## 
## [[3]]
## [[3]][[1]]
## [1] "BTS"          "is"           "known"        "as"          
## [5] "the"          "Bangtan Boys"
## 
## [[3]][[2]]
## [1] "And"          "BTS"          "is"           "a"           
## [5] "seven-member" "South Korea"  "boy band"

Indexing: selecting certain elements

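How do we select certain elements in a list? Use double square brackets [[ ]] to extract a single element of a list and single square brackets [ ] to select elements of a vector. Applied to a list, [ ] returns a new (sub-)list, whereas [[ ]] returns the element itself, so that its sub-elements become accessible.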

MyList[[1]] # returns a vector with the elements of the first element of the list MyList
## [1] "BTS"          "is"           "known"        "as"          
## [5] "the"          "Bangtan Boys"
class(MyList[[1]]) # MyList's first element is a character vector with six words
## [1] "character"
MyList[[2]] 
## [1] "And"          "BTS"          "is"           "a"           
## [5] "seven-member" "South Korea"  "boy band"

What is MyList[[3]]?

obj3
## [[1]]
## [1] "BTS"          "is"           "known"        "as"          
## [5] "the"          "Bangtan Boys"
## 
## [[2]]
## [1] "And"          "BTS"          "is"           "a"           
## [5] "seven-member" "South Korea"  "boy band"
obj3[[1]]
## [1] "BTS"          "is"           "known"        "as"          
## [5] "the"          "Bangtan Boys"
obj3[[2]]
## [1] "And"          "BTS"          "is"           "a"           
## [5] "seven-member" "South Korea"  "boy band"
# [1] to select the first element of a vector; [[1]] to select the elements of a list's first element
MyList[[3]][[1]] # selects the third element of the list MyList and its first element's sub-elements as a vector object
## [1] "BTS"          "is"           "known"        "as"          
## [5] "the"          "Bangtan Boys"
MyList[[3]][1] # selects the third element of the list MyList and its first element as a list object
## [[1]]
## [1] "BTS"          "is"           "known"        "as"          
## [5] "the"          "Bangtan Boys"

Once you are familiar with list( ), you will be able to figure out what MyList[[3]][[1]][2] is. How would you select the word “South Korea” in the third element of the list object MyList? Start by examining what the following two expressions return.

MyList[[3]][[2]][5]
## [1] "seven-member"
MyList[[3]][2][5]
## [[1]]
## NULL
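The two expressions above return "seven-member" and a list holding NULL, not "South Korea". One possible way to locate the word is sketched below; the objects `s1`, `s2`, and `nested` are hypothetical stand-ins that rebuild the same structure as MyList so the block runs on its own.

```r
# Hypothetical stand-ins with the same structure as obj1, obj2, and MyList
s1 <- c("BTS", "is", "known", "as", "the", "Bangtan Boys")
s2 <- c("And", "BTS", "is", "a", "seven-member", "South Korea", "boy band")
nested <- list(s1, s2, list(s1, s2))

# [[3]] extracts the inner list, [[2]] extracts its second vector
second_sentence <- nested[[3]][[2]]
pos <- which(second_sentence == "South Korea")  # position of the word
pos
nested[[3]][[2]][pos]

# With single brackets the result is still a list of length 1,
# so asking for its fifth element finds nothing
length(nested[[3]][2])
```

The key point is that [[ ]] must be used at every list level before [ ] can pick a word from the character vector.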

Questions

  1. Return a new list with only the first element of the third element of MyList
  2. Return a vector object with the six elements from the above list
  3. Return the first element, “BTS”, from the above vector object

How to turn a list into a vector

#unlist( ) is a function to turn a list object into a vector object
#be cautious about using the unlist function
List1[2] # a new list object with one element, the character vector object vector_character
## $characters
## [1] "v" "e" "c" "t" "o" "r"
class(List1[2])
## [1] "list"
vector_character # a vector object with six elements
## [1] "v" "e" "c" "t" "o" "r"
class(vector_character)
## [1] "character"
unlist(List1[2]) # turn a list object into a vector object
## characters1 characters2 characters3 characters4 characters5 characters6 
##         "v"         "e"         "c"         "t"         "o"         "r"
unlist(List1[2]) == vector_character # is unlist(List1[2]) equal to vector_character?
## characters1 characters2 characters3 characters4 characters5 characters6 
##        TRUE        TRUE        TRUE        TRUE        TRUE        TRUE
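The output above shows one reason for caution with unlist( ): it builds new element names such as characters1. A second reason is that a list with mixed types is converted to one common type. A minimal sketch with the hypothetical lists `named_list` and `mixed_list`:

```r
# Hypothetical lists, used only for illustration
named_list <- list(characters = c("v", "e", "c"))
mixed_list <- list(1:2, "a")

with_names    <- unlist(named_list)                     # names are generated
without_names <- unlist(named_list, use.names = FALSE)  # plain vector

names(with_names)
without_names

flattened <- unlist(mixed_list)  # numbers are converted to character
class(flattened)
flattened
```

Setting use.names = FALSE gives a plain vector, and checking class( ) after unlist( ) helps catch unintended type conversion.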

When do we use unlist( ) in text mining?

A paragraph stored as a list of sentences can be flattened into a single vector object of words.

# Organize a list of character vectors into a vector object
list_sentences <- list(MyList[[1]],MyList[[2]]) # generate a list of the two vectors
list_sentences
## [[1]]
## [1] "BTS"          "is"           "known"        "as"          
## [5] "the"          "Bangtan Boys"
## 
## [[2]]
## [1] "And"          "BTS"          "is"           "a"           
## [5] "seven-member" "South Korea"  "boy band"
list_sentences[[1]] # returns the first element as a vector under list_sentences as a list
## [1] "BTS"          "is"           "known"        "as"          
## [5] "the"          "Bangtan Boys"
list_sentences[[2]] # returns the second element as a vector under list_sentences as a list
## [1] "And"          "BTS"          "is"           "a"           
## [5] "seven-member" "South Korea"  "boy band"
unlist(list_sentences) # flatten the list into a vector object
##  [1] "BTS"          "is"           "known"        "as"          
##  [5] "the"          "Bangtan Boys" "And"          "BTS"         
##  [9] "is"           "a"            "seven-member" "South Korea" 
## [13] "boy band"

unlist( ) allows us to combine a list object’s elements into a vector object

Question

How would you turn the character vector object, unlist(list_sentences), into a single sentence in which the words are separated by blanks?

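Using Projects in R

RStudio projects make it easy to manage your workspace and documents.

Creating a Project

An RStudio project is associated with an R working directory. You can create an RStudio project:

  • In a new directory
  • Or in an existing directory that already contains R code and data

When a new project is created, RStudio:

  1. Creates a project file (with an .Rproj extension) in the project directory. This file contains various project options and can also be used as a shortcut for opening the project directly.
  2. Creates a hidden directory in which source documents and window state are stored.
  3. Loads the project and displays its name in the toolbar.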

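Opening a Project

  1. Use the Open Project command to browse for and select an existing project file (e.g., Capstone.Rproj).
  2. Or select a project from the list of most recently opened projects.

Quitting a Project

When you quit a project or open another one, the following actions are taken:

  • .RData and/or .Rhistory are written to the project directory.
  • The list of open source documents (files with the .R extension) is saved, so that they can be restored the next time the project is opened.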

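Checking Global Options

Let's check the global options to make sure that our work is saved correctly.

  • Restore .RData into workspace at startup - At startup, load the .RData file (if any) found in the initial working directory into the R workspace (global environment).
  • Save workspace to .RData on exit - On exit, ask whether to save .RData, always save it, or never save it.
  • Always save history (even when not saving .RData) - Make sure that the .Rhistory file is always saved with the commands from your session, even if you choose not to save the .RData file when exiting.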

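R Packages

When performing text mining analysis in R, installing the relevant packages makes importing text, preprocessing, analysis, and visualization of results more convenient and efficient.

Package? Library?

R is open-source software, so it is important to make good use of the functionality that is specialized for text mining.

A package is a collection of R functions grouped together for a particular purpose.

A library is the location where downloaded packages are stored. Once an installed package is made available to R, we can use the functions it contains. So how do we download and install packages in R?

For example, suppose the functions we need for analyzing text are contained in the packages “stringr”, “pdftools”, and “wordcloud”. To download these packages, we run the following code.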

install.packages("pdftools")

install.packages("stringr")

install.packages("wordcloud")

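Note that the function install.packages is followed by parentheses, and the package name goes inside them. Caution: the package name is a character string, so it must be enclosed in quotation marks.

When the code above is run, R downloads the package from the CRAN repository and installs it. Installing a package is not enough, however: to run its functions, the package must also be loaded into the R session. This is done with the function library().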

library(pdftools)
library(stringr)
## 
## Attaching package: 'stringr'
## The following object is masked _by_ '.GlobalEnv':
## 
##     sentences
library(wordcloud)
## Loading required package: RColorBrewer

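After these steps, the packages needed for today's text analysis in R are installed and loaded: 1) pdftools provides the functions needed to read PDF files into R; 2) stringr provides the functions needed to preprocess the text extracted from the imported files for analysis; 3) wordcloud provides the functions needed to visualize the results of the analysis after preprocessing.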
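Since install.packages( ) only needs to be run once per machine, it can be useful to check which packages are already installed before downloading anything. The following is a minimal sketch of one possible check, not part of the original material; the object names `needed`, `installed`, and `missing_pkgs` are hypothetical, and the install line is commented out so that nothing is downloaded by accident.

```r
# Packages used in this tutorial
needed <- c("pdftools", "stringr", "wordcloud")

# requireNamespace() returns TRUE if the package is installed, FALSE otherwise
installed <- vapply(needed, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- needed[!installed]

if (length(missing_pkgs) > 0) {
  message("Not installed yet: ", paste(missing_pkgs, collapse = ", "))
  # install.packages(missing_pkgs)  # uncomment to download from CRAN
} else {
  message("All required packages are installed.")
}
```

library( ) still has to be called in every new R session, even when the packages are already installed.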

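What Is a Directory?

To read a file into R, we need to know where the downloaded digital document is stored. The information we need here is the working directory. This folder is the location of the files used in the R session, and it can be found with the following function.

getwd() # Find the current working directory (where inputs are read from and outputs are saved)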
## [1] "/Users/shinhaenglee/Dropbox/2018_Class_Teaching/Capstone"
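Before reading a file, it can help to confirm that it really is in the working directory. The sketch below shows one possible check using base R functions; the file name "my_document.pdf" is a hypothetical placeholder, so replace it with the name of your own file.

```r
# Hypothetical file name, used only for illustration
file_name <- "my_document.pdf"

# Build the full path from the working directory and the file name
full_path <- file.path(getwd(), file_name)

if (file.exists(full_path)) {
  message("Found: ", full_path)
} else {
  message("Not found in the working directory: ", file_name)
}

# List the PDF files that are currently in the working directory
pdf_files <- list.files(pattern = "\\.pdf$")
pdf_files
```

If the file is reported as not found, move it into the folder shown by getwd( ) or change the working directory with setwd( ).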

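Text Preprocessing

Extracting and Processing Strings from a PDF Document

The digital document we will work with today is a PDF file of the movie script of Pride and Prejudice. Once this file is downloaded, we can analyze the text it contains. The downloaded document is “pnp_script.pdf”; after moving it into our working directory, we are ready to run the following code.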

pnp_script <- pdf_text("pnp_script.pdf")
class(pnp_script)
## [1] "character"
pnp_script[1:2]
## [1] "                                     Pride and Prejudice\n                              Screenplay by Deborah Moggach\n                               Shooting script 28th June 2004\n1 EXT. LONGBOURN HOUSE - DAY.\nFADE UP ON: A YOUNG WOMAN, as she walks through a field of tall, meadow grass. She is\nreading a novel entitled 'First Impressions'.\nThis is LIZZIE BENNET, 20, good humoured, attractive, and nobody's fool. She approaches\nLongbourn, a fairly run down 17th Century house with a small moat around it. Lizzie jumps\nup onto a wall and crosses the moat by walking a wooden plank duck board, a reckless trick\nlearnt in early childhood. She walks passed the back of the house where, through an open\nwindow to the library, we see her mother and father, MR and MRS BENNET.\nMRS BENNET: My dear Mr Bennet, have you heard that Netherfield Park is let at last?\nWe follow Lizzie into the house, but still overhear her parents' conversation.\nMRS BENNET: (cont'd) Do you not want to know who has taken it? MR BENNET: As you\nwish to tell me, I doubt I have any choice in the matter.\n2 INT. LONGBOURN - CONTINUOUS.\nAs Lizzie walks through the hallway, we hear the sound of piano scales plodding through\nthe afternoon. She walks down the entrance hall past the room where MARY (18) the\nbluestocking of the family, is practising, and finds KITTY (16) and LYDIA (15) are listening\nat the door to the library. Lizzie pokes Lydia.\nLIZZIE: Liddy! Kitty - what have I told you about listening at - LYDIA: Never mind that,\nthere's a Mr Bingley arrived from the North\nKITTY: - with more than one chaise\nLYDIA: - and five thousand a year!\nLIZZIE: Really?\nLYDIA: And he's single!\nJANE, the eldest and very beautiful if rather naive sister, materializes at Lizzie's elbow.\nJANE: Who's single?\nLIZZIE: A Mr Bingley, apparently.\nKITTY: Shhhh!\n"
## [2] "She clamps her ear to the door.\nLIZZIE: Oh, really Kitty.\nLydia leans in, whilst Jane and Lizzie strain to hear without appearing t_.\n3 INT. LIBRARY - LONGBOURN - CONTINUOUS.\nMr Bennet is trying to ignore Mrs Bennet.\nMRS BENNET: What a fine thing for our girls!\nMR BENNET: How can it affect them?\nMRS BENNET: My dear Mr Bennet, how can you be so tiresome! You know that he must\nmarry one of them.\nMR BENNET: Oh, so that is his design in settling here?\nMr Bennet takes a plant he's been looking at from his table and walks out of the library into\nthe corridor, where the girls are gathered, Mrs Bennet following.\nMR BENNET (cont'd) Good heavens. People.\n4 INT. CORRIDOR - LONGBOURN - THE SAME.\nHe walks through the girls to the drawing room pursued by Mrs Bennet.\nMRS BENNET: - So you must go and visit him at once.\n5 INT. DRAWING ROOM - LONGBOURN - THE SAME.\nMr Bennet walks to a table and places the plant in the light. Mary is still practising the\npiano. The girls flock behind him.\nLYDIA: Are you listening? You never listen.\nKITTY: You must, Papa!\nMRS BENNET: At once!\nMR BENNET: There is no need, for I already have.\nThe piano stops. A frozen silence. They all stare.\nMRS BENNET: You have?\nJANE: When?\nMRS BENNET: How can you tease me, Mr Bennet? Have you no compassion for my poor\nnerves?\nMR BENNET: You mistake me, my dear. I have a high respect for them; they have been my\nconstant companions these twenty years.\n"
pnp_script_str <- paste(pnp_script, collapse=" ")
class(pnp_script_str)
## [1] "character"
substr(pnp_script_str, start=1, stop=1000)
## [1] "                                     Pride and Prejudice\n                              Screenplay by Deborah Moggach\n                               Shooting script 28th June 2004\n1 EXT. LONGBOURN HOUSE - DAY.\nFADE UP ON: A YOUNG WOMAN, as she walks through a field of tall, meadow grass. She is\nreading a novel entitled 'First Impressions'.\nThis is LIZZIE BENNET, 20, good humoured, attractive, and nobody's fool. She approaches\nLongbourn, a fairly run down 17th Century house with a small moat around it. Lizzie jumps\nup onto a wall and crosses the moat by walking a wooden plank duck board, a reckless trick\nlearnt in early childhood. She walks passed the back of the house where, through an open\nwindow to the library, we see her mother and father, MR and MRS BENNET.\nMRS BENNET: My dear Mr Bennet, have you heard that Netherfield Park is let at last?\nWe follow Lizzie into the house, but still overhear her parents' conversation.\nMRS BENNET: (cont'd) Do you not want to know who has taken it? MR "
# strsplit() returns a list, not a vector.
pnp_script_line <- strsplit(pnp_script_str, split="\n") # Split the text into lines, which are separated by "\n"
class(pnp_script_line)
## [1] "list"
length(pnp_script_line)
## [1] 1
pnp_script_line[[1]][1:10]
##  [1] "                                     Pride and Prejudice"                                  
##  [2] "                              Screenplay by Deborah Moggach"                               
##  [3] "                               Shooting script 28th June 2004"                             
##  [4] "1 EXT. LONGBOURN HOUSE - DAY."                                                             
##  [5] "FADE UP ON: A YOUNG WOMAN, as she walks through a field of tall, meadow grass. She is"     
##  [6] "reading a novel entitled 'First Impressions'."                                             
##  [7] "This is LIZZIE BENNET, 20, good humoured, attractive, and nobody's fool. She approaches"   
##  [8] "Longbourn, a fairly run down 17th Century house with a small moat around it. Lizzie jumps" 
##  [9] "up onto a wall and crosses the moat by walking a wooden plank duck board, a reckless trick"
## [10] "learnt in early childhood. She walks passed the back of the house where, through an open"
unlist(pnp_script_line)[1:10]
##  [1] "                                     Pride and Prejudice"                                  
##  [2] "                              Screenplay by Deborah Moggach"                               
##  [3] "                               Shooting script 28th June 2004"                             
##  [4] "1 EXT. LONGBOURN HOUSE - DAY."                                                             
##  [5] "FADE UP ON: A YOUNG WOMAN, as she walks through a field of tall, meadow grass. She is"     
##  [6] "reading a novel entitled 'First Impressions'."                                             
##  [7] "This is LIZZIE BENNET, 20, good humoured, attractive, and nobody's fool. She approaches"   
##  [8] "Longbourn, a fairly run down 17th Century house with a small moat around it. Lizzie jumps" 
##  [9] "up onto a wall and crosses the moat by walking a wooden plank duck board, a reckless trick"
## [10] "learnt in early childhood. She walks passed the back of the house where, through an open"
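pnp_script_sent <- strsplit(unlist(pnp_script_line), split="\\. ") # Split the lines into sentences at a literal period followed by a blank; "\\." escapes the period so that it is not treated as a regex metacharacter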
class(pnp_script_sent)
## [1] "list"
length(pnp_script_sent)
## [1] 1539
pnp_script_sent[[1]]
## [1] "                                     Pride and Prejudice"
pnp_script_sent[[2]]
## [1] "                              Screenplay by Deborah Moggach"
pnp_script_word <- strsplit(unlist(pnp_script_sent), split=" ") # Split the sentence-level text into words, which are separated by blanks (" ")
class(pnp_script_word)
## [1] "list"
length(pnp_script_word)
## [1] 2235
pnp_script_word[[1]]
##  [1] ""          ""          ""          ""          ""         
##  [6] ""          ""          ""          ""          ""         
## [11] ""          ""          ""          ""          ""         
## [16] ""          ""          ""          ""          ""         
## [21] ""          ""          ""          ""          ""         
## [26] ""          ""          ""          ""          ""         
## [31] ""          ""          ""          ""          ""         
## [36] ""          ""          "Pride"     "and"       "Prejudice"
pnp_script_word[[2]]
##  [1] ""           ""           ""           ""           ""          
##  [6] ""           ""           ""           ""           ""          
## [11] ""           ""           ""           ""           ""          
## [16] ""           ""           ""           ""           ""          
## [21] ""           ""           ""           ""           ""          
## [26] ""           ""           ""           ""           ""          
## [31] "Screenplay" "by"         "Deborah"    "Moggach"
pnp_script_word <- unlist(pnp_script_word) # Convert the list result into a vector
class(pnp_script_word)
## [1] "character"
length(pnp_script_word)
## [1] 16067
pnp_script_word[1]
## [1] ""
pnp_script_word[38]
## [1] "Pride"
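The word vector above still begins with empty strings (""), because the indentation in the script produces one empty token per extra blank. One possible cleanup step is sketched below on a small hypothetical sample, `script_lines`, so that it runs without the PDF file; the object names are assumptions and not part of the original analysis.

```r
# Hypothetical sample lines that imitate the indentation in the script
script_lines <- c("          Pride and Prejudice", "LIZZIE: Really?  Yes.")

# Splitting on a single blank creates "" for every extra blank
raw_words <- unlist(strsplit(script_lines, split = " "))
sum(raw_words == "")   # number of empty tokens

# Option 1: drop the empty strings after splitting
clean_words <- raw_words[raw_words != ""]

# Option 2: trim each line and split on one or more blanks
alt_words <- unlist(strsplit(trimws(script_lines), split = " +"))

clean_words
identical(clean_words, alt_words)
```

The same filter could be applied to the full vector, for example pnp_script_word[pnp_script_word != ""], before counting or visualizing words.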