Morphological Analysis

1.1

Importing Data (1)

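  • Using the sample news data
library(KoNLP)                       # extractNoun() comes from the KoNLP package
txt <- readLines("sample_news.txt")  # one element per line of the file
noun <- lapply(txt, extractNoun)     # list with one noun vector per line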
noun
## [[1]]
##  [1] "정부"               "내수"               "회복"              
##  [4] "하반기"             "경제운용"           "방향"              
##  [7] "핵심"               "키워드"             "마땅"              
## [10] "한"                 "해법이"             "고심"              
## [13] "소비"               "투자"               "등"                
## [16] "내수"               "회복"               "속도"              
## [19] "기대"               "치"                 "충족"              
## [22] "데"                 "세월"               "호"                
## [25] "참사"               "경제"               "심리"              
## [28] "가운데"             "내수"               "부진"              
## [31] "물줄기"             "만"                 "수"                
## [34] "것"                 "현실"               "전문가"            
## [37] "들"                 "추가경정예산(추경)" "편성"              
## [40] "유일"               "효과"               "수"                
## [43] "대책"               "시기"               "것"                
## [46] "중론"              
## 
## [[2]]
##  [1] "9"        "관계"     "부처"     "정부"     "이달"     "말"      
##  [7] "발표"     "한"       "하반기"   "경제운용" "방향"     "소비"    
## [13] "투자"     "방안"     "방침"     "그간"     "정부"     "세월"    
## [19] "호"       "참사"     "이후"     "소비"     "심리"     "악화"    
## [25] "됨"       "재정"     "당초"     "계획"     "집행"     "공무원"  
## [31] "복지"     "포인트"   "사용"     "독려"     "등"       "대책"    
## [37] "발표"     "한"       "바"       "근본"     "적"       "대책"    
## [43] "거리"     "정부"     "표현"     "한"       "대"       "‘원"    
## [49] "포인트’" "수준"    
## 
## [[3]]
##  [1] "문제"     "카드"     "마땅"     "치"       "점"       "서비스업"
##  [7] "규제"     "완화"     "자영업자" "대책"     "등"       "정부"    
## [13] "마련"     "수"       "내수"     "진작"     "책"       "대책"    
## [19] "급격"     "한"       "내수"     "회복"     "지적"    
## 
## [[4]]
##  [1] "추가경정예산(추경)을" "편성"                 "것"                  
##  [4] "주장"                 "일각"                 "실행"                
##  [7] "정부"                 "재정상황"             "뿐"                  
## [10] "추경"                 "편성"                 "만"                  
## [13] "상황"                 "지"                   "확신"                
## [16] "하기"                
## 
## [[5]]
##  [1] "신민"         "영"           "LG"           "경제연구원"  
##  [5] "경제"         "연구"         "부문장"       "“내수"      
##  [9] "효과"         "수"           "방법"         "사실상"      
## [13] "추경"         "추경"         "편성"         "시기"        
## [17] "아니다”라며" "“세월호"     "분위기"       "브라"        
## [21] "질"           "월드컵"       "등"           "영향"        
## [25] "소비"         "심리"         "가능성"       "만큼"        
## [29] "추경"         "금리인하"     "빠르다”고"   "지적"        
## 
## [[6]]
##  [1] "이"         "세월"       "호"         "여파"       "일상"      
##  [6] "적"         "경제"       "활동"       "전개"       "노력"      
## [11] "우선"       "적"         "필요"       "진단"       "강명헌"    
## [16] "단국대"     "경제학과"   "교수"       "“일단"     "세월"      
## [21] "호"         "분위기"     "정책"       "효과"       "수"        
## [26] "것”이라며" "“추경"     "등"         "상황"       "추후"      
## [31] "검토"       "수"         "것”이라고" "말"        
## 
## [[7]]
##  [1] "현오석"     "부총리"     "겸"         "기획"       "재정"      
##  [6] "부"         "장관"       "기업"       "들"         "세월"      
## [11] "호"         "여파"       "중단"       "마케팅"     "활동"      
## [16] "속개"       "투자"       "고용"       "계획"       "집행"      
## [21] "해"         "줄"         "것"         "당부"       "등"        
## [26] "정부"       "‘일상으로" "복귀’를"  
## 
## [[8]]
##  [1] "근본"     "적"       "기업"     "가계"     "사이"     "소득"    
##  [7] "차"       "줄"       "노력"     "서민"     "들"       "가처분"  
## [13] "소득"     "수"       "방안"     "모색"     "지적"     "가계"    
## [19] "소득"     "주거"     "사교육비" "지출"     "노후"     "불안"    
## [25] "한"       "한국"     "사회"     "구조"     "개선"     "노력"    
## [31] "필요"     "것"      
## 
## [[9]]
##  [1] "조원희"         "국민"           "대"             "경제학과"      
##  [5] "교수"           "“정부가"       "규제"           "완화"          
##  [9] "적극"           "나"             "규제"           "투자"          
## [13] "활성화"         "가설"           "않다”며"       "“박근혜정부가"
## [17] "출범"           "당시"           "강조"           "한"            
## [21] "경제"           "민주화"         "복지"           "등"            
## [25] "분배"           "신경"           "한다”고"       "주장"

Converting the list to a vector

unlist(noun)
##   [1] "정부"                 "내수"                 "회복"                
##   [4] "하반기"               "경제운용"             "방향"                
##   [7] "핵심"                 "키워드"               "마땅"                
##  [10] "한"                   "해법이"               "고심"                
##  [13] "소비"                 "투자"                 "등"                  
##  [16] "내수"                 "회복"                 "속도"                
##  [19] "기대"                 "치"                   "충족"                
##  [22] "데"                   "세월"                 "호"                  
##  [25] "참사"                 "경제"                 "심리"                
##  [28] "가운데"               "내수"                 "부진"                
##  [31] "물줄기"               "만"                   "수"                  
##  [34] "것"                   "현실"                 "전문가"              
##  [37] "들"                   "추가경정예산(추경)"   "편성"                
##  [40] "유일"                 "효과"                 "수"                  
##  [43] "대책"                 "시기"                 "것"                  
##  [46] "중론"                 "9"                    "관계"                
##  [49] "부처"                 "정부"                 "이달"                
##  [52] "말"                   "발표"                 "한"                  
##  [55] "하반기"               "경제운용"             "방향"                
##  [58] "소비"                 "투자"                 "방안"                
##  [61] "방침"                 "그간"                 "정부"                
##  [64] "세월"                 "호"                   "참사"                
##  [67] "이후"                 "소비"                 "심리"                
##  [70] "악화"                 "됨"                   "재정"                
##  [73] "당초"                 "계획"                 "집행"                
##  [76] "공무원"               "복지"                 "포인트"              
##  [79] "사용"                 "독려"                 "등"                  
##  [82] "대책"                 "발표"                 "한"                  
##  [85] "바"                   "근본"                 "적"                  
##  [88] "대책"                 "거리"                 "정부"                
##  [91] "표현"                 "한"                   "대"                  
##  [94] "‘원"                 "포인트’"             "수준"                
##  [97] "문제"                 "카드"                 "마땅"                
## [100] "치"                   "점"                   "서비스업"            
## [103] "규제"                 "완화"                 "자영업자"            
## [106] "대책"                 "등"                   "정부"                
## [109] "마련"                 "수"                   "내수"                
## [112] "진작"                 "책"                   "대책"                
## [115] "급격"                 "한"                   "내수"                
## [118] "회복"                 "지적"                 "추가경정예산(추경)을"
## [121] "편성"                 "것"                   "주장"                
## [124] "일각"                 "실행"                 "정부"                
## [127] "재정상황"             "뿐"                   "추경"                
## [130] "편성"                 "만"                   "상황"                
## [133] "지"                   "확신"                 "하기"                
## [136] "신민"                 "영"                   "LG"                  
## [139] "경제연구원"           "경제"                 "연구"                
## [142] "부문장"               "“내수"               "효과"                
## [145] "수"                   "방법"                 "사실상"              
## [148] "추경"                 "추경"                 "편성"                
## [151] "시기"                 "아니다”라며"         "“세월호"            
## [154] "분위기"               "브라"                 "질"                  
## [157] "월드컵"               "등"                   "영향"                
## [160] "소비"                 "심리"                 "가능성"              
## [163] "만큼"                 "추경"                 "금리인하"            
## [166] "빠르다”고"           "지적"                 "이"                  
## [169] "세월"                 "호"                   "여파"                
## [172] "일상"                 "적"                   "경제"                
## [175] "활동"                 "전개"                 "노력"                
## [178] "우선"                 "적"                   "필요"                
## [181] "진단"                 "강명헌"               "단국대"              
## [184] "경제학과"             "교수"                 "“일단"              
## [187] "세월"                 "호"                   "분위기"              
## [190] "정책"                 "효과"                 "수"                  
## [193] "것”이라며"           "“추경"               "등"                  
## [196] "상황"                 "추후"                 "검토"                
## [199] "수"                   "것”이라고"           "말"                  
## [202] "현오석"               "부총리"               "겸"                  
## [205] "기획"                 "재정"                 "부"                  
## [208] "장관"                 "기업"                 "들"                  
## [211] "세월"                 "호"                   "여파"                
## [214] "중단"                 "마케팅"               "활동"                
## [217] "속개"                 "투자"                 "고용"                
## [220] "계획"                 "집행"                 "해"                  
## [223] "줄"                   "것"                   "당부"                
## [226] "등"                   "정부"                 "‘일상으로"          
## [229] "복귀’를"             "근본"                 "적"                  
## [232] "기업"                 "가계"                 "사이"                
## [235] "소득"                 "차"                   "줄"                  
## [238] "노력"                 "서민"                 "들"                  
## [241] "가처분"               "소득"                 "수"                  
## [244] "방안"                 "모색"                 "지적"                
## [247] "가계"                 "소득"                 "주거"                
## [250] "사교육비"             "지출"                 "노후"                
## [253] "불안"                 "한"                   "한국"                
## [256] "사회"                 "구조"                 "개선"                
## [259] "노력"                 "필요"                 "것"                  
## [262] "조원희"               "국민"                 "대"                  
## [265] "경제학과"             "교수"                 "“정부가"            
## [268] "규제"                 "완화"                 "적극"                
## [271] "나"                   "규제"                 "투자"                
## [274] "활성화"               "가설"                 "않다”며"            
## [277] "“박근혜정부가"       "출범"                 "당시"                
## [280] "강조"                 "한"                   "경제"                
## [283] "민주화"               "복지"                 "등"                  
## [286] "분배"                 "신경"                 "한다”고"            
## [289] "주장"
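
To make the difference between the two shapes concrete, here is a minimal sketch that uses a small hand-made list in place of the real noun object. The toy values are hypothetical and are not taken from sample_news.txt. lapply() returns a list with one element per input line, while unlist() concatenates those elements into a single character vector.

```r
# Hypothetical stand-in for the result of lapply(txt, extractNoun):
# one character vector of nouns per input line.
toy <- list(c("정부", "내수", "회복"),
            c("정부", "소비"),
            c("추경"))

length(toy)    # number of lines (list elements)
lengths(toy)   # number of nouns found in each line

flat <- unlist(toy)   # one flat character vector
is.list(flat)         # FALSE: no longer a list
length(flat)          # equals sum(lengths(toy))
```

A flat vector is what table() needs later on, because table() counts the values of a vector rather than the elements of a list.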

Importing Data (2)

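txt <- read.csv("sample_voc.csv", stringsAsFactors = FALSE)  # keep text columns as character
noun <- lapply(txt$CONTENTS, extractNoun)                    # nouns for each row of CONTENTS
getMorph("우리의 소원은 통일입니다")  # getMorph() is not defined in this section; assumed to be loaded beforehand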
## [1] "우리" "소원" "통일"
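
The CSV route can be tried without the real file. The sketch below builds a hypothetical two-row CSV in memory; only the CONTENTS column name comes from the text above. It also uses a whitespace splitter as a stand-in for extractNoun(), so it shows the shape of the workflow, not real morphological analysis. Note that before R 4.0.0 read.csv() converted strings to factors by default, which is why stringsAsFactors = FALSE is passed explicitly.

```r
# Hypothetical CSV content standing in for sample_voc.csv.
csv_text <- "ID,CONTENTS
1,delivery was late
2,late refund and late reply"

voc <- read.csv(text = csv_text, stringsAsFactors = FALSE)
is.character(voc$CONTENTS)   # TRUE: plain text, not a factor

# Stand-in tokenizer (NOT a morphological analyzer): split on whitespace.
split_words <- function(s) strsplit(s, "[[:space:]]+")[[1]]

# Same shape as lapply(txt$CONTENTS, extractNoun): one vector per row.
tokens <- lapply(voc$CONTENTS, split_words)
lengths(tokens)
```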

1.2 Frequency Analysis of Unstructured Text

1.2.1 Extracting High-Frequency Words

table()

x <- c("a", "a", "c", "a")
table(x)
## x
## a c 
## 3 1

sort()

y <- c(5, 8, 3, 1, 2)
sort(y)
## [1] 1 2 3 5 8
sort(y, decreasing = TRUE)
## [1] 8 5 3 2 1
  • The result of table() can also be sorted by frequency.
sort(table(x)); sort(table(x), decreasing = TRUE)
## x
## c a 
## 1 3
## x
## a c 
## 3 1
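
A sorted table is still a named object, so counts can be looked up by value as well as by position. The following sketch extends the toy vector above with a third value (an assumption made only for illustration).

```r
x <- c("a", "a", "c", "a", "b", "b")

freq <- table(x)                        # counts per distinct value
top  <- sort(freq, decreasing = TRUE)   # most frequent first

names(top)[1]    # the most frequent value
top[["a"]]       # count looked up by name
as.vector(top)   # bare counts, names dropped
```

names() and as.vector() are the two accessors used later to split a frequency table into its words and its counts.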

1.2.2 Building a Frequency Table

txt <- readLines("sample_news.txt")
noun <- lapply(txt, getMorph, "noun")
nounVec <- unlist(noun)
nounFreq <- table(nounVec)

1.2.3 Displaying Words with Their Frequencies

head(sort(nounFreq, decreasing = TRUE), 20)
## nounVec
##   정부   추경   경제   내수 세월호   대책   소비   규제   노력   당장 
##      9      7      6      6      6      5      4      3      3      3 
##   상황   소득   심리   투자   회복   효과   가계   경정 경제학   계획 
##      3      3      3      3      3      3      2      2      2      2

1.2.4 Displaying Only the High-Frequency Words

names(head(sort(nounFreq, decreasing = TRUE), 20))
##  [1] "정부"   "추경"   "경제"   "내수"   "세월호" "대책"   "소비"  
##  [8] "규제"   "노력"   "당장"   "상황"   "소득"   "심리"   "투자"  
## [15] "회복"   "효과"   "가계"   "경정"   "경제학" "계획"
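
The whole chain from noun lists to top-ranked words can be checked on a small example. The sketch below uses hypothetical noun lists in place of the real extraction result, so the counts are illustrative only.

```r
# Hypothetical noun lists standing in for the real `noun` object.
noun <- list(c("정부", "내수", "회복", "정부"),
             c("추경", "정부", "내수"),
             c("추경", "소비"))

nounVec  <- unlist(noun)     # flatten to one character vector
nounFreq <- table(nounVec)   # count each distinct noun

# Keep the three most frequent entries.
top3 <- head(sort(nounFreq, decreasing = TRUE), 3)

top3          # words together with their counts
names(top3)   # words only
```

How tied counts are ordered relative to each other depends on the order of the table's names, so it is safer not to rely on the position of tied words.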

1.2.5 Drawing a Bar Chart

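top20 <- head(sort(nounFreq, decreasing = TRUE), 20)  # 20 most frequent nouns
freq <- as.vector(top20)  # counts
word <- names(top20)      # matching words

total <- sum(nounFreq)    # total noun tokens; named `total` so that sum() is not masked
percent <- round((freq / total) * 100, digits = 2)
mainTxt <- "High-Frequency Words"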

bp <- barplot(percent, main = mainTxt, las = 2, ylim = c(0,5), ylab = "%", names.arg = word, col = "black")

text(x = bp, y = percent + 0.3, labels = paste(freq), col = "black", cex = 0.8)
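
The labels line up with the bars because barplot() returns the x-coordinate of each bar's midpoint. The sketch below shows this with hypothetical counts and an assumed total, and it derives the y-axis limit from the data instead of hard-coding it. The null PDF device is used only so that the example runs without opening a window or writing a file.

```r
# Hypothetical counts standing in for the top of nounFreq.
freq  <- c(9, 7, 6, 3)
word  <- c("w1", "w2", "w3", "w4")
total <- 40   # assumed total number of tokens

percent <- round(freq / total * 100, digits = 2)

pdf(NULL)   # null device: nothing is written to disk

# Leave 20% headroom above the tallest bar for the count labels.
ymax <- max(percent) * 1.2
bp <- barplot(percent, names.arg = word, las = 2,
              ylim = c(0, ymax), ylab = "%")

# bp holds one midpoint per bar; pos = 3 places each label above its bar.
text(x = bp, y = percent, labels = freq, pos = 3, cex = 0.8)

invisible(dev.off())
```

Deriving ylim from max(percent) keeps the labels inside the plot area when the data changes, whereas a fixed upper limit has to be adjusted by hand.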