What is social-scientific big data research?

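Traditional quantitative research methods
- Surveys
- Experiments
- Content analysis
New opportunities in the digital age
- Spread of digital technologies and devices
- Recording and accumulation of information
- Advances in, and easier access to, technologies for collecting, processing, and analyzing information
Big data methodology
- Research that uses digital media / the internet as a tool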
  
  “digital traces that can be compiled into comprehensive pictures of both individual and group behavior, with the potential to transform our understanding of lives, organizations, and societies” (Lazer et al. 2009: 721)
- Research that takes digital media / the internet as its object and analyzes their effects
  
  “Researchers revisit the classical problems of social science like inequality, the allocation of scarce resources, identity, learning, culture, power, and so on - but do so in the context of digital media and the internet” (Sandvig & Hargittai 2015: 6)
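New methodology and theory
- Theory dependence of methodology (on theories established through traditional methods)
- Methodology's potential to extend theory vs. methodological efficiency
- Non-independence of theory and method
- Potential of data-driven research to extend theory
Significance of internet research
- Unobtrusive measurement
- New social phenomena: identity, community, inequality, politics, organizations, and culture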

Using projects in R

RStudio projects make it easy to manage your workspace and documents.

Creating a project

An RStudio project is associated with an R working directory. You can create an RStudio project:

In a new directory
Or in an existing directory that already contains R code and data.

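When you create a new project, RStudio:

Creates a project file (with an .Rproj extension) in the project directory. This file stores various project options and can also be used as a shortcut for opening the project directly.
Creates a hidden directory where auto-saved source documents and window state are stored.
Loads the project and shows its name in the toolbar.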

Opening a project

Use the Open Project command to browse for and select an existing project file (e.g. Capstone.Rproj).
Or select a project from the list of recently opened projects.

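Closing a project

When you close a project or open another one, the following actions are taken:

.RData and/or .Rhistory are written to the project directory.
The list of open source documents (e.g. .R files) is saved, so that it can be restored the next time you open the project.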

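Checking Global Options

Let's check the global options to make sure your work is saved correctly.

Restore .RData into workspace at startup - At startup, loads the .RData file (if any) found in the initial working directory into the R workspace (global environment).
Save workspace to .RData on exit - Controls whether RStudio asks about saving .RData on exit, always saves it, or never saves it.
Always save history (even when not saving .RData) - Makes sure the .Rhistory file is always saved with the commands from your session, even if you choose not to save the .RData file on exit.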

Understanding R Markdown documents

Content
R code
Results

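R packages

When you do text mining in R, installing the relevant packages lets you import, preprocess, and analyze text, and visualize the results, more conveniently and efficiently.

Package? Library?

R is open-source software, so it is important to make good use of the functions that are specialized for text mining.

A package is a collection of R functions grouped by purpose.

A library is the location where downloaded packages are stored. Once a stored package is loaded into R, the functions it contains become available. So how do we download and install a package in R?

Suppose, for example, that the functions we need for analyzing text are in a package called "stringr". To download the package, run the following code.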

install.packages("stringr")

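Note that the function name install.packages is followed by parentheses, and the package name goes inside them. Caution: the package name is a character string, so it must be wrapped in quotation marks.

When you run this code, R downloads and installs the package from a server called CRAN. Installing the package is not enough, though: before you can call its functions, you must "load" the package into the R session. This is done with the library() function.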

library(stringr)

After these steps, the package setup for today's text analysis in R is complete: stringr provides the functions needed to preprocess text extracted from imported files before analysis.
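As a minimal sketch of the package/library distinction described above: .libPaths() lists the library directories on your machine, and requireNamespace() checks whether a package is installed there without attaching it. This is an illustrative addition to the notes; the message text is made up, and nothing is installed by this code.

```r
# Library = the directories where installed packages are stored
lib_dirs <- .libPaths()
print(lib_dirs)

# requireNamespace() returns TRUE if the package is installed (it does not attach it)
has_stringr <- requireNamespace("stringr", quietly = TRUE)

if (has_stringr) {
  library(stringr)  # attach the package so its functions can be called directly
} else {
  message("stringr is not installed yet; run install.packages(\"stringr\") first")
}
```

Checking first avoids re-downloading a package every time a script is run.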

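What is a directory?

To read a file into R, you need to know where the digital document you downloaded for text analysis is stored. The piece of information you need is the working directory. This folder is the location of the files you work with in the R session, and you can find it with the following function.

getwd() # Find the current working directory (where input and output files are read and saved)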

## [1] "D:/Dropbox/2022_Class/DMBD/DM_BD_R"
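The following sketch shows why the working directory matters when reading files: a relative file name is resolved against it. The file name example_doc.txt and its contents are hypothetical, and a temporary directory is used so that nothing in your own folders is changed.

```r
# Create a small text file in a temporary directory (hypothetical file name)
tmp_dir <- tempdir()
txt_path <- file.path(tmp_dir, "example_doc.txt")
writeLines(c("Text mining is easy to learn.",
             "R resolves relative paths against the working directory."),
           txt_path)

# setwd() changes the working directory and returns the previous one
old_wd <- setwd(tmp_dir)

# A bare file name is now looked up inside tmp_dir
doc <- readLines("example_doc.txt")

# Restore the original working directory
setwd(old_wd)

doc
```

In an RStudio project the working directory is set to the project directory automatically, so calling setwd() by hand is usually unnecessary.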

Basic R usage for working with string data

R is a statistical programming language.

# Simple addition and subtraction
10+5

## [1] 15

10-5

## [1] 5

# Creating vectors: A sequence of numbers/integers, characters, Booleans
c(1,3,5) # Join elements into a vector

## [1] 1 3 5

1:5 # An integer sequence

## [1] 1 2 3 4 5

seq(1, 5, by=2) # A sequence of integers from 1 to 5, increasing by 2

## [1] 1 3 5

rep(1:5, times=3) # Repeat an integer sequence 1:5 three times

##  [1] 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5

rep(1:5, each=3) # Repeat each element of an integer sequence 1:5 three times

##  [1] 1 1 1 2 2 2 3 3 3 4 4 4 5 5 5

R is an object-oriented language.

# R uses the less than symbol followed by the hyphen(<-) as an assignment operator
a <- 10 # assign 10 to the object 'a'
a*2

## [1] 20

a <- 20
a - 5 # subtract 5 from a

## [1] 15

a <- "R uses the less than symbol followed by the hyphen(<-) as an assignment operator"
a

## [1] "R uses the less than symbol followed by the hyphen(<-) as an assignment operator"

table(unlist(strsplit(a, ' ')))

## 
##         an         as assignment         by   followed hyphen(<-)       less 
##          1          1          1          1          1          1          1 
##   operator          R     symbol       than        the       uses 
##          1          1          1          1          2          1

R objects can also hold character strings.

class <- "text mining"
print(class)

## [1] "text mining"

class = "text mining" # '=' also assigns, but '<-' is the conventional operator in R

# R is an object-oriented programming language; everything can be assigned to an object
fruits <- c("Apple","Grape","Pear","Apple","Mango","Orange","Mango","Strawberry","Grape","Apple")
sort(fruits) # Return the object, fruits, sorted in alphabetical order for characters

##  [1] "Apple"      "Apple"      "Apple"      "Grape"      "Grape"     
##  [6] "Mango"      "Mango"      "Orange"     "Pear"       "Strawberry"

table(fruits) # See counts of values (elements)

## fruits
##      Apple      Grape      Mango     Orange       Pear Strawberry 
##          3          2          2          1          1          1

unique(fruits) # See unique values (elements)

## [1] "Apple"      "Grape"      "Pear"       "Mango"      "Orange"    
## [6] "Strawberry"

# String data can be a character vector
hello <- c("Hello!", "World", "is", "good!")
print(hello)

## [1] "Hello!" "World"  "is"     "good!"

What is a vector?

A set (bundle) of elements that all share the same type (numeric, character, etc.)

e.g. numeric vectors; character vectors; logical vectors

# assign a sequence of numbers to a numeric vector object, vector_numeric
vector_numeric <- c(1:6) # a sequence of numbers from 1 to 6
vector_numeric

## [1] 1 2 3 4 5 6

# assign a sequence of characters to a character vector object, vector_character
vector_character <- c("v","e","c","t","o","r") # characters should be distinguished with quotation marks.
vector_character

## [1] "v" "e" "c" "t" "o" "r"

# assign a sequence of Boolean values, TRUE or FALSE, to a logical vector object, vector_logical
vector_logical <- c(TRUE, FALSE, FALSE, TRUE, T, F) # TRUE/FALSE should be uppercase letters
vector_logical

## [1]  TRUE FALSE FALSE  TRUE  TRUE FALSE
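Because every element of a vector must share one type, R silently converts mixed input to a common type. The short sketch below (an addition to the notes, using only base R) shows this coercion; the object names are made up for illustration.

```r
# Mixing numbers, characters, and logicals: everything becomes character
mixed <- c(1, "a", TRUE)
mixed
class(mixed)        # "character"

# Mixing numbers and logicals: TRUE/FALSE become 1/0
num_logical <- c(1, TRUE, FALSE)
num_logical
class(num_logical)  # "numeric"
```

This matters for text analysis: a single stray string in a column of numbers turns the whole vector into text.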

4-1. Selecting vector elements

Selecting by position

fruits[4] # The fourth element of the vector object, fruits

## [1] "Apple"

fruits[-4] # All but the fourth one

## [1] "Apple"      "Grape"      "Pear"       "Mango"      "Orange"    
## [6] "Mango"      "Strawberry" "Grape"      "Apple"

fruits[2:4] # Elements from the second to the fourth ones

## [1] "Grape" "Pear"  "Apple"

fruits[-(2:4)] # All elements except the second to the fourth ones

## [1] "Apple"      "Mango"      "Orange"     "Mango"      "Strawberry"
## [6] "Grape"      "Apple"

fruits[c(1,5)] # The first and the fifth elements

## [1] "Apple" "Mango"

Note that parentheses ( ) follow a function name, while square brackets [ ] select vector elements.
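Besides positions, square brackets also accept a logical condition. The sketch below is an addition to the positional examples above and uses only base R; the fruits vector is redefined so the block runs on its own.

```r
fruits <- c("Apple","Grape","Pear","Apple","Mango","Orange","Mango","Strawberry","Grape","Apple")

# A comparison returns a logical vector with one TRUE/FALSE per element
fruits == "Apple"

# Using that logical vector inside [ ] keeps only the TRUE positions
fruits[fruits == "Apple"]   # "Apple" "Apple" "Apple"

# which() converts the logical vector into positions
which(fruits == "Mango")    # 5 7

# Conditions can use any function that returns one value per element
fruits[nchar(fruits) > 5]   # "Orange" "Strawberry"
```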

A list is an object that holds several separate vectors inside a single object. e.g. split a paragraph into sentences, then split each sentence into a vector of words.


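Why do lists matter?

Text is analyzed at several levels, such as the word, sentence, paragraph, and document levels. List objects are useful for handling this kind of hierarchical data structure.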
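To make the hierarchy concrete, here is a minimal sketch of the paragraph, sentence, word example. The paragraph text is made up, and splitting sentences on ". " is a simplifying assumption that would not handle abbreviations or other punctuation.

```r
paragraph <- "Text mining is fun. Lists keep sentences apart."

# Level 1: split the paragraph into sentences (naive split on ". ")
sentences <- unlist(strsplit(paragraph, split = ". ", fixed = TRUE))
sentences

# Level 2: split every sentence into words; the result is a list
# with one character vector per sentence
words_by_sentence <- strsplit(sentences, split = " ")
words_by_sentence

length(words_by_sentence)    # number of sentences
lengths(words_by_sentence)   # number of words in each sentence
```

A single flat vector could not keep track of which word belongs to which sentence; the list preserves that grouping.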

fruits

##  [1] "Apple"      "Grape"      "Pear"       "Apple"      "Mango"     
##  [6] "Orange"     "Mango"      "Strawberry" "Grape"      "Apple"

hello

## [1] "Hello!" "World"  "is"     "good!"

list_obj <- list(fruits, hello)
list_obj

## [[1]]
##  [1] "Apple"      "Grape"      "Pear"       "Apple"      "Mango"     
##  [6] "Orange"     "Mango"      "Strawberry" "Grape"      "Apple"     
## 
## [[2]]
## [1] "Hello!" "World"  "is"     "good!"

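5-1. List indexing: selecting elements

How do we select elements of a list?

Double square brackets [[ ]] select a first-level element of the list, that is, one of its vectors.
A following pair of single square brackets [ ] selects a second-level element, that is, an element of that vector.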
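A short sketch of these two indexing steps, using a shortened fruits vector and the hello vector so the block runs on its own:

```r
fruits <- c("Apple", "Grape", "Pear")
hello <- c("Hello!", "World", "is", "good!")
list_obj <- list(fruits, hello)

# Double brackets return the vector stored in the second slot
second_vec <- list_obj[[2]]
second_vec            # "Hello!" "World" "is" "good!"

# Adding single brackets selects an element of that vector
list_obj[[2]][1]      # "Hello!"

# Single brackets on the list itself return a (shorter) list, not a vector
sub_list <- list_obj[2]
class(sub_list)       # "list"
```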

Data frames

A data frame is formed by combining multiple vectors. The vectors may be of any type (numeric, character, logical, etc.), but they must all have the same length.
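As a minimal sketch, the data frame below combines a character, an integer, and a logical vector of equal length. The column names and values are hypothetical.

```r
# Three vectors of different types but the same length (3)
df <- data.frame(
  fruit    = c("Apple", "Grape", "Pear"),
  count    = c(3L, 2L, 1L),
  in_stock = c(TRUE, FALSE, TRUE),
  stringsAsFactors = FALSE   # keep text as character, not factor
)
str(df)

# Each column is still a vector and can be selected with $
df$fruit

# Rows can be filtered with a logical condition on a column
df[df$count > 1, ]

# Vectors of length 3 and 2 cannot be combined: data.frame() raises an error
bad <- tryCatch(
  data.frame(a = 1:3, b = 1:2),
  error = function(e) conditionMessage(e)
)
bad
```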


Base-package functions for text analysis

nchar(): counting characters

nchar("data")

## [1] 4

# Blank and punctuation marks are also counted 
nchar("text mining") #11? 10?

## [1] 11

nchar("text, mining") #12? 11.5?

## [1] 12
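nchar() is vectorized, so it can count characters for every element of a vector in one call. A small sketch with made-up strings:

```r
words <- c("data", "text mining", "text, mining", "")

# One count per element; the empty string has zero characters
char_counts <- nchar(words)
char_counts                 # 4 11 12 0

# The counts can be used to filter the vector
words[nchar(words) > 10]    # "text mining" "text, mining"
```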

strsplit(): Text analysis starts with counting word frequencies, so we first need to break a string into word units. This function is used to split strings.

sent1 <- "Text mining begins with preprocessing and tokenization."
# We split sent1 into seven pieces of words that are separated from each other by white space (blank)
sent1_split <- strsplit(sent1, split=" ") # Parameter (split=" ") specifies a character (here, blank) that separates words as tokens

class(sent1)

## [1] "character"

class(sent1_split)

## [1] "list"

class(unlist(sent1_split))

## [1] "character"

sent1_split # returns a list of vectors of words segmented from sent1

## [[1]]
## [1] "Text"          "mining"        "begins"        "with"         
## [5] "preprocessing" "and"           "tokenization."

sent1_split[[1]] # returns the first vector of word segments in the list object, sent1_split

## [1] "Text"          "mining"        "begins"        "with"         
## [5] "preprocessing" "and"           "tokenization."

sent1_split[[1]][5] # returns the fifth element in the first vector of word segments

## [1] "preprocessing"
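Since splitting is the first step toward counting word frequencies, the sketch below carries the idea through to a frequency table. The sentence is a made-up variant of sent1, and lowercasing plus stripping punctuation with gsub() is one simple preprocessing choice among many.

```r
sent <- "Text mining begins with text preprocessing and tokenization."

# Lowercase first so "Text" and "text" are counted together, then split on blanks
words <- unlist(strsplit(tolower(sent), split = " "))

# Remove punctuation such as the trailing period on "tokenization."
words <- gsub("[[:punct:]]", "", words)

# Count each word and sort from most to least frequent
freq <- sort(table(words), decreasing = TRUE)
freq

freq["text"]   # "text" appears twice
```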

paste(): for joining split words or characters back into a sentence or string

sent2 <- paste(sent1_split[[1]], collapse=" ") # Parameter (collapse) specifies how elements in sent1_split are combined. Here, we want to combine word elements into a sentence that separates them by blank (" ").
paste(sent1_split[[1]], collapse="/")

## [1] "Text/mining/begins/with/preprocessing/and/tokenization."

sent2

## [1] "Text mining begins with preprocessing and tokenization."
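paste() has two separator arguments that are easy to confuse: sep goes between the arguments, while collapse joins the elements of a vector into one string. A short sketch with made-up inputs:

```r
# sep: placed between separate arguments (default is a blank)
paste("text", "mining")              # "text mining"
paste("text", "mining", sep = "_")   # "text_mining"

# paste0() is paste() with sep = ""; it is vectorized over its inputs
doc_ids <- paste0("doc", 1:3)
doc_ids                              # "doc1" "doc2" "doc3"

# collapse: joins all elements of one vector into a single string
joined <- paste(c("a", "b", "c"), collapse = "")
joined                               # "abc"
```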

Getting help: when you need an explanation of a function or package

?c # Get help of a particular function c( )

## starting httpd help server ... done

?strsplit
help.search("split") # Search and return the help files for functions that include a word or phrase
help(package = "stringr") # Find help for a package

tolower() and toupper(): handling upper and lower case

sentence <- "Text mining is easy to learn"
toupper(sentence)

## [1] "TEXT MINING IS EASY TO LEARN"

tolower(toupper(sentence))

## [1] "text mining is easy to learn"

Note 1: R treats uppercase and lowercase letters as different

word1 <- "TEXT"
word1

## [1] "TEXT"

word2 <- "Text"
word2

## [1] "Text"

word1 == word2

## [1] FALSE

Note 2: A function is used together with parentheses that contain the object it is applied to

tolower(word1)

## [1] "text"

toupper(word2)

## [1] "TEXT"
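Case sensitivity affects word counts, which is why text is usually converted to one case before counting. The sketch below uses a made-up token vector to show the difference.

```r
tokens <- c("Text", "TEXT", "text", "Mining", "mining")

# Without normalization every spelling is a separate word
raw_counts <- table(tokens)
length(raw_counts)          # 5 distinct "words"

# After lowercasing, variants of the same word are counted together
norm_counts <- table(tolower(tokens))
norm_counts                 # mining: 2, text: 3

# Case-insensitive comparison of two strings
tolower("TEXT") == tolower("Text")   # TRUE
```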

Digital Media and Big Data: Week 2 Lecture

이신행

3/14/2022