R Packages

When you use R for text mining, installing the relevant packages makes it easier and more efficient to import text, preprocess and analyze it, and visualize the results.

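Package? Library?

R is open-source software, so it is important to make good use of the functions that are specialized for text mining.

A package is a collection of R functions grouped by purpose.

A library is the location (directory) where downloaded packages are stored. Once a stored package is loaded into R, we can use the functions it contains. So how do we download and install a package in R?

Suppose, for example, that the functions we need to analyze text are included in a package called “stringr”. To download and install the package, run the following code.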

install.packages("stringr")

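Note that the function name install.packages is followed by parentheses, and the package name goes inside them. Caution: the package name is a character string, so it must be enclosed in quotation marks.

When you run this code, R downloads the package from CRAN (the Comprehensive R Archive Network) and installs it. However, downloading and installing the package is not enough: to use its functions, the package must also be “loaded” into the R session. This is done with the library() function.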

library(stringr)

After these steps, the package setup for today's text analysis in R is complete: stringr provides functions needed to preprocess text extracted from imported files before analysis.
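As a small sketch of the package/library distinction described above: `.libPaths()` lists the library directories where installed packages are stored, and `requireNamespace()` reports whether a package is already installed without attaching it. The variable names (`lib_dirs`, `pkg`, `is_installed`) are illustrative and not part of the original notes.

```r
# Library: the directories in which installed packages are stored
lib_dirs <- .libPaths()
print(lib_dirs)

# Check whether stringr is already installed before downloading it again
pkg <- "stringr"
is_installed <- requireNamespace(pkg, quietly = TRUE)

if (is_installed) {
  library(stringr)  # load the package into the current session
} else {
  message("Not installed yet; run install.packages(\"", pkg, "\") first.")
}
```

Installation is needed once per library, whereas library() has to be called in every new R session.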

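What is a directory?

To read a file into R, you need to know where the digital documents downloaded for text analysis are stored. The information needed here is the working directory. This folder is the location of the files that the R session works with, and you can find it with the following function.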

getwd() # Find the current working directory (where input and output files are read and saved)

## [1] "D:/Dropbox/2022_Class/ITM/R"
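The following sketch shows how the working directory affects file paths. It assumes a scratch folder from `tempdir()` and a hypothetical file name, `sample.txt`, so that nothing in your own folders is touched.

```r
old_dir <- getwd()     # remember the current working directory
tmp_dir <- tempdir()   # a scratch folder used only for this illustration
setwd(tmp_dir)         # change the working directory

# A relative file name is resolved from the working directory
writeLines("text mining", "sample.txt")
file.exists("sample.txt")
txt <- readLines("sample.txt")
txt

file.remove("sample.txt")  # clean up the illustrative file
setwd(old_dir)             # restore the original working directory
```

In a real analysis you would pass the folder that holds your documents to setwd(), or build full paths with file.path() instead.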

R Basics

What is R?

- R is a free software environment for statistical computing and graphics. 
- R is a high-level programming language: a set of predefined instructions that a computer can understand and act on.

Why R?

- As a user-friendly environment, RStudio has many added values.
- Open source inside
- Plugin ready: CRAN
- Data visualization friendly

How to use R

Basic R practices to handle string data

R is a statistical programming language.

# Simple addition and subtraction
10+5

## [1] 15

10-5

## [1] 5

# Creating vectors: A sequence of numbers/integers, characters, Booleans
a <- c(1,3,5) # Join elements into a vector 
a

## [1] 1 3 5

b <- 1:5 # An integer sequence from 1 to 5
b = 1:5 # '=' also assigns, but '<-' is the conventional operator
seq(1, 5, by=2) # A sequence of numbers from 1 to 5, increasing by 2

## [1] 1 3 5

rep(1:5, times=3) # Repeat an integer sequence 1:5 three times

##  [1] 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5

rep(1:5, each=3) # Repeat each element of an integer sequence 1:5 three times

##  [1] 1 1 1 2 2 2 3 3 3 4 4 4 5 5 5
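The vector types mentioned above (numbers, integers, characters, Booleans) can be checked with class(). This is a minimal sketch; the object names are illustrative.

```r
nums  <- c(1.5, 3, 5)            # numeric (double) vector
ints  <- 1:5                     # integer vector
chars <- c("text", "mining")     # character vector
bools <- c(TRUE, FALSE, TRUE)    # logical (Boolean) vector

class(nums)   # "numeric"
class(ints)   # "integer"
class(chars)  # "character"
class(bools)  # "logical"

length(ints)  # 5: number of elements in the vector
sum(bools)    # 2: TRUE counts as 1 and FALSE as 0
```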

R is an object-oriented environment: values such as numbers are assigned to objects.

# R uses the less-than symbol followed by a hyphen (<-) as an assignment operator
a <- 10 # assign 10 to the object 'a'
a <- 20 # reassign: 'a' now holds 20
a - 5 # subtract 5 from 'a'

## [1] 15

Character values can also be assigned to R objects.

class <- c("text mining", "is simple")
class

## [1] "text mining" "is simple"

print(class)

## [1] "text mining" "is simple"

class = "text mining" # '=' also works as an assignment operator; the object is overwritten

# R is an object-oriented programming language; everything can be assigned to an object
fruits <- c("Apple","Grape","Pear","Apple","Mango","Orange","Mango","Strawberry","Grape","Apple")
sort(fruits) # Return the object, fruits, sorted in alphabetical order for characters

##  [1] "Apple"      "Apple"      "Apple"      "Grape"      "Grape"     
##  [6] "Mango"      "Mango"      "Orange"     "Pear"       "Strawberry"

table(fruits) # See counts of values (elements)

## fruits
##      Apple      Grape      Mango     Orange       Pear Strawberry 
##          3          2          2          1          1          1

unique(fruits) # See unique values (elements)

## [1] "Apple"      "Grape"      "Pear"       "Mango"      "Orange"    
## [6] "Strawberry"
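Counting and ranking values is the core of a word-frequency analysis, so here is a short sketch that combines table() and sort() on the same fruits vector. The names `freq` and `freq_sorted` are illustrative.

```r
fruits <- c("Apple","Grape","Pear","Apple","Mango","Orange","Mango","Strawberry","Grape","Apple")

freq <- table(fruits)                         # counts per unique value
freq_sorted <- sort(freq, decreasing = TRUE)  # most frequent value first
freq_sorted

names(freq_sorted)[1]    # "Apple": the most frequent value
freq_sorted[["Apple"]]   # 3: how often it occurs
length(unique(fruits))   # 6: number of distinct values
```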

# String data can be a character vector
hello <- c("Hello!", "World", "is", "good!")
print(hello)

## [1] "Hello!" "World"  "is"     "good!"

3-1. Selecting Vector Elements

By Position

fruits[4] # The fourth element of the vector object, fruits

## [1] "Apple"

fruits[-4] # All but the fourth one

## [1] "Apple"      "Grape"      "Pear"       "Mango"      "Orange"    
## [6] "Mango"      "Strawberry" "Grape"      "Apple"

fruits[2:4] # Elements from the second to the fourth ones

## [1] "Grape" "Pear"  "Apple"

fruits[-(2:4)] # All elements except the second to the fourth ones

## [1] "Apple"      "Mango"      "Orange"     "Mango"      "Strawberry"
## [6] "Grape"      "Apple"

fruits[c(1,5)] # The first and the fifth elements

## [1] "Apple" "Mango"
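Besides selecting by position, elements can be selected by a logical condition inside the square brackets. The sketch below reuses the fruits vector; the conditions are chosen only for illustration.

```r
fruits <- c("Apple","Grape","Pear","Apple","Mango","Orange","Mango","Strawberry","Grape","Apple")

which(fruits == "Apple")                # 1 4 10: positions that match
fruits[fruits == "Apple"]               # "Apple" "Apple" "Apple"
fruits[fruits %in% c("Mango", "Pear")]  # "Pear" "Mango" "Mango"
fruits[nchar(fruits) > 5]               # "Orange" "Strawberry"
```

The comparison returns a logical vector of the same length as fruits, and only the elements where it is TRUE are kept.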

Note that parentheses ( ) follow a function name, whereas square brackets [ ] select vector elements.

You can deal with string objects as follows.

paste(hello, collapse = ",")

## [1] "Hello!,World,is,good!"

hello_paste <- paste(hello, collapse = " ")
strsplit(hello_paste, split = " ")

## [[1]]
## [1] "Hello!" "World"  "is"     "good!"

R functions in the Base Package for text analysis

nchar(): counts the number of characters in a string (or, with type = "bytes", the number of bytes)

nchar("data")

## [1] 4

# Blanks (spaces) and punctuation marks are also counted
nchar("text mining") #11? 10?

## [1] 11

nchar("text, mining") #12? 11.5?

## [1] 12
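nchar() is vectorized, and characters and bytes can differ for non-ASCII text. The sketch below uses Unicode escapes for a three-syllable Korean word as an assumed example; the byte count depends on how the string is encoded, so it is noted only for UTF-8.

```r
words <- c("text", "mining", "R")
nchar(words)   # 4 6 1: one count per element

ko <- "\ud14d\uc2a4\ud2b8"             # Korean word for "text" (three syllables)
n_chars <- nchar(ko, type = "chars")   # 3 characters
n_bytes <- nchar(ko, type = "bytes")   # 9 bytes when the string is stored as UTF-8
n_chars
n_bytes
```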

strsplit(): Sentences are composed of words, and words are composed of characters. In text mining, therefore, we segment preprocessed sentences into words, which serve as tokens. Later, we might also need to combine the tokens again.

sent1 <- "Text mining begins with preprocessing and tokenization."
# We split sent1 into seven words that are separated from each other by white space (blank)
sent1_split <- strsplit(sent1, split=" ") # The split argument (split=" ") specifies the character (here, a blank) that separates words as tokens

class(sent1)

## [1] "character"

class(sent1_split)

## [1] "list"

class(unlist(sent1_split))

## [1] "character"

sent1_split # returns a list of vectors of words segmented from sent1

## [[1]]
## [1] "Text"          "mining"        "begins"        "with"         
## [5] "preprocessing" "and"           "tokenization."

sent1_split[[1]] # returns the first vector of word segments in the list object, sent1_split

## [1] "Text"          "mining"        "begins"        "with"         
## [5] "preprocessing" "and"           "tokenization."

sent1_split[[1]][5] # returns the fifth element in the first vector of word segments

## [1] "preprocessing"
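strsplit() returns a list because it can split several strings at once, producing one vector of tokens per input string. A minimal sketch with two made-up sentences:

```r
sents <- c("Text mining is fun.", "R makes it simple.")
tokens <- strsplit(sents, split = " ")

length(tokens)    # 2: one list element per sentence
lengths(tokens)   # 4 4: number of tokens in each sentence
tokens[[2]][1]    # "R": first token of the second sentence

all_tokens <- unlist(tokens)  # flatten the list into one character vector
length(all_tokens)            # 8
```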

paste(): We can combine the word segments back into a sentence in which words are separated by blanks.

sent2 <- paste(sent1_split[[1]], collapse=" ") # Parameter (collapse) specifies how elements in sent1_split are combined. Here, we want to combine word elements into a sentence that separates them by blank (" ").
paste(sent1_split[[1]], collapse="/")

## [1] "Text/mining/begins/with/preprocessing/and/tokenization."

sent2

## [1] "Text mining begins with preprocessing and tokenization."
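paste() has two different separator arguments, which are easy to confuse: sep joins the arguments element by element, while collapse merges the elements of the result into a single string. A short sketch with illustrative inputs:

```r
paste("text", "mining", sep = "-")             # "text-mining"
paste0("week", 1:3)                            # "week1" "week2" "week3" (sep = "")
paste(c("a", "b", "c"), collapse = "")         # "abc"
paste("doc", 1:2, sep = "_", collapse = ", ")  # "doc_1, doc_2"
```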

Getting Help: We can access the help files about functions and packages.

?c # Get help on a particular function, c()

## starting httpd help server ... done

?strsplit
help.search("split") # Search and return the help files for functions that include a word or phrase
help(package = "stringr") # Find help for a package

Note 1: R is case-sensitive; it distinguishes lowercase from uppercase letters.

word1 <- "TEXT"
word1

## [1] "TEXT"

word2 <- "Text"
word2

## [1] "Text"

word1 == word2

## [1] FALSE

Note 2: Parentheses follow function names, whereas square brackets select elements of a vector.

tolower(word1)

## [1] "text"

toupper(word2)

## [1] "TEXT"
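Because R is case-sensitive, the same word written in different cases is counted as different values. A common preprocessing step, sketched here with made-up tokens, is to convert everything to lowercase before counting.

```r
words <- c("Text", "text", "TEXT", "Mining")

length(unique(words))         # 4: each spelling counts as a different value
words_lower <- tolower(words)
length(unique(words_lower))   # 2: "text" and "mining"
table(words_lower)

tolower("TEXT") == tolower("Text")  # TRUE once case is normalized
```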

Foundational notions of the R language

Vectors


  - Numeric
  - Logical / Boolean
  - Character

“object-oriented programming language”

apple <- c(2020123456, 20197890123, 2020456789)
sales <- c("apple", "banana", "kiwi", "melon")
sales

## [1] "apple"  "banana" "kiwi"   "melon"

sales <- c(sales, apple)
sales

## [1] "apple"       "banana"      "kiwi"        "melon"       "2020123456" 
## [6] "20197890123" "2020456789"
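The output above shows the numbers turned into strings: a vector can hold only one type, so c() coerces mixed inputs to the most general type. A minimal sketch of that rule, with illustrative values:

```r
mixed <- c(1, "a", TRUE)
class(mixed)   # "character": everything becomes a string
mixed          # "1" "a" "TRUE"

class(c(TRUE, 2))   # "numeric": TRUE is converted to 1

ids <- as.numeric(c("10", "20"))  # convert strings back to numbers explicitly
ids + 1                           # 11 21
```

When different types must be kept side by side, a list or a data frame is the appropriate container.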

Lists

Containers of objects

my_list <- list(apple, sales)

class(my_list)

## [1] "list"

  - Creating lists
  - Subsetting lists
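Creating and subsetting a list can be sketched as follows. The element names (`ids`, `sales`, `n_items`) and values are hypothetical and chosen only for illustration.

```r
# Creating a list: elements may differ in type and length
my_list <- list(
  ids   = c(101, 102, 103),
  sales = c("apple", "banana", "kiwi", "melon")
)
names(my_list)            # "ids" "sales"

# Subsetting a list
my_list[["sales"]]        # the character vector itself
my_list$sales[2]          # "banana"
class(my_list["ids"])     # "list": single brackets return a smaller list
class(my_list[["ids"]])   # "numeric": double brackets return the element

# Adding a new element
my_list$n_items <- length(my_list$sales)
length(my_list)           # 3
```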

Data frames
- All components are vectors, no matter whether logical, numerical, or character
- All vectors must be of the same length
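The two properties above can be seen in a small data frame built from three vectors of equal length. The column names and values are made up for this sketch.

```r
sales_df <- data.frame(
  fruit    = c("apple", "banana", "kiwi", "melon"),  # character column
  price    = c(1200, 800, 1500, 5000),               # numeric column
  in_stock = c(TRUE, TRUE, FALSE, TRUE),             # logical column
  stringsAsFactors = FALSE                           # keep strings as characters
)

str(sales_df)             # one line per column with its type
dim(sales_df)             # 4 3: rows and columns
sapply(sales_df, class)   # the class of each column
```

Supplying vectors whose lengths cannot be recycled to a common length makes data.frame() stop with an error.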


Packages

library(tidyverse)

  - Select and show a column of a data frame
  - Add a new column
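The two operations listed above can be sketched in base R, so the example runs without installing anything; in the tidyverse, dplyr's select() and mutate() serve the same purpose. The data frame `word_df` and its columns are hypothetical.

```r
word_df <- data.frame(
  word = c("text", "mining", "r"),
  n    = c(3, 2, 1),
  stringsAsFactors = FALSE
)

# Select and show a column of a data frame
word_df$word
word_df[["n"]]

# Add new columns computed from existing ones
word_df$n_chars <- nchar(word_df$word)        # length of each word
word_df$share   <- word_df$n / sum(word_df$n) # relative frequency
word_df
```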

Functions

Function

Introduction to Text Mining: Week 2

이신행

3/14/2022

Using projects in R

Creating a project

Opening a project

Closing a project

Checking Global Options