Data Handling Class : Week2

Author

:> JungHwan Yun
:> Master Student in Data-Science
:> Seoul National University of Science & Technology(SeoulTech)
:> E-mail : junghwan.yun@seoultech.ac.kr

INTRO

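This is week two of the Data Handling class. Having gone through one round last week, you should now have a better feel for how to approach the work. As before, the study will be more effective if you focus on modularization and documentation this week as well.
This week we continue the data processing that we started last week as preprocessing for network analysis. By the end of this week's session, the data processing for network analysis should be essentially complete.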

- Reference

A simple introduction to networks : On the concepts of Node and Edge
An example of keyword network analysis : A keyword network map of North Korean defectors
Noun extraction using NLTK : Extract Noun From Text using NLTK in python

TASK

DATA : Web of Science records of papers on “Artificial Intelligence” from 1990 to 2016
ToDo : Extract the nouns from each paper's Abstract, then build the Edge-list of a keyword co-occurrence network
- You may use a Document-Term Matrix, or reuse the Edge-list function you wrote previously.
- Work with only the top 100 documents first; once that works, apply it to the whole dataset.
Warning : The dataset may contain missing values. Heh heh.
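
The two practical points above (missing values, and starting with a 100-document subset) could be handled along the following lines. This is only a sketch in Python: it assumes a delimited text file whose header row has the columns IDX, TI and AB, and the helper name `load_abstracts`, the tab delimiter, the list of NA markers and the inline sample are all illustrative assumptions rather than part of the assignment.

```python
import csv
import io


def load_abstracts(handle, limit=100, delimiter="\t"):
    """Return up to `limit` (IDX, AB) pairs, skipping rows with a missing abstract.

    Assumption: the file has a header row with the columns IDX, TI and AB.
    """
    records = []
    for row in csv.DictReader(handle, delimiter=delimiter):
        if len(records) >= limit:
            break
        abstract = (row.get("AB") or "").strip()
        # Treat empty cells and common NA markers as missing values.
        if abstract == "" or abstract.upper() in {"NA", "NAN", "NULL"}:
            continue
        records.append((row["IDX"], abstract))
    return records


# Tiny made-up sample standing in for the real Web of Science export.
sample = io.StringIO(
    "IDX\tTI\tAB\n"
    "1\tFirst title\tNeural networks learn features.\n"
    "2\tSecond title\t\n"
    "3\tThird title\tNA\n"
    "4\tFourth title\tExpert systems encode rules.\n"
)
records = load_abstracts(sample, limit=100)
print(records)
```

Here the limit counts documents that actually have an abstract; if you would rather take the first 100 rows and then drop the missing ones, move the limit check accordingly.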

DUE DATE

Round 1 :

Algorithm Implementation

function : extract.noun()
1. Receive the data as input.
2. Extract the abstracts.
3. Extract the nouns.
4. Return the list of noun keywords.
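
The four steps above might look roughly like this in Python, following the NLTK reference (the skeleton in this document is R-style, so the name `extract_noun` is a Python-style stand-in). The `tokenize` and `tagger` parameters, `simple_tokenize` and `toy_tagger` are hypothetical helpers added so the sketch runs without NLTK; when they are omitted, the function falls back to `nltk.word_tokenize` and `nltk.pos_tag`, which assumes NLTK and its tokenizer/tagger data are installed.

```python
import re


def extract_noun(abstracts, tokenize=None, tagger=None):
    """Return one list of lower-cased noun keywords per abstract.

    `tagger` maps a list of tokens to (token, tag) pairs with Penn Treebank
    tags. If either helper is omitted, NLTK is used (assumption: nltk and
    its tokenizer/tagger data are installed).
    """
    if tokenize is None or tagger is None:
        import nltk  # imported lazily so custom helpers need no NLTK
        tokenize = tokenize or nltk.word_tokenize
        tagger = tagger or nltk.pos_tag
    noun_lists = []
    for abstract in abstracts:
        tagged = tagger(tokenize(abstract))
        # Penn Treebank noun tags all start with "NN" (NN, NNS, NNP, NNPS).
        nouns = [word.lower() for word, tag in tagged if tag.startswith("NN")]
        noun_lists.append(nouns)
    return noun_lists


def simple_tokenize(text):
    """Hypothetical stand-in tokenizer: runs of letters and hyphens."""
    return re.findall(r"[A-Za-z][A-Za-z-]*", text)


def toy_tagger(tokens):
    """Hypothetical stand-in tagger: tags words from a small noun set as NN."""
    known_nouns = {"networks", "systems", "data"}
    return [(t, "NN" if t.lower() in known_nouns else "XX") for t in tokens]


noun_list = extract_noun(
    ["Neural networks process data.", "Expert systems use rules."],
    tokenize=simple_tokenize,
    tagger=toy_tagger,
)
print(noun_list)
```

Note that this sketch returns single nouns only; multi-word keywords such as "artificial neural network" would need an additional noun-phrase step, which is not attempted here.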

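function2 : edge.maker()
1. Receive the data as input.
2. Take each abstract's list of noun keywords and generate pairs of two (perform a Combination operation).
3. Stack the results of the Combination operation vertically, in 'Source' and 'Target' form.
4. Aggregate identical Source-Target pairs to compute the Weight.
5. Return the result.
Or
Create a Document-Term Matrix -> create a Term-Term Matrix -> create the Edge-list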
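
Both routes described above can be sketched in a few lines of Python. The function names `edge_maker` and `edge_maker_dtm` and the sample keyword lists are made up for illustration, and the sketch assumes that a pair is counted at most once per document (a binary Document-Term Matrix) and that edges are undirected.

```python
from collections import Counter
from itertools import combinations


def edge_maker(noun_lists):
    """Route 1: pair up the keywords of each document and count the pairs."""
    weights = Counter()
    for nouns in noun_lists:
        # Deduplicate and sort so that (a, b) and (b, a) are the same edge.
        for source, target in combinations(sorted(set(nouns)), 2):
            weights[(source, target)] += 1
    return weights


def edge_maker_dtm(noun_lists):
    """Route 2: Document-Term Matrix -> Term-Term Matrix -> Edge-list."""
    doc_sets = [set(nouns) for nouns in noun_lists]
    terms = sorted(set().union(*doc_sets)) if doc_sets else []
    # Binary Document-Term Matrix: one row per document, one column per term.
    dtm = [[int(term in doc) for term in terms] for doc in doc_sets]
    weights = Counter()
    for i, j in combinations(range(len(terms)), 2):
        # Entry (i, j) of the Term-Term Matrix is the dot product of columns i and j.
        co_occurrence = sum(row[i] * row[j] for row in dtm)
        if co_occurrence > 0:
            weights[(terms[i], terms[j])] = co_occurrence
    return weights


# Made-up keyword lists for three documents.
noun_lists = [
    ["network", "data", "model"],
    ["network", "model"],
    ["data", "model", "model"],
]
edgelist = edge_maker(noun_lists)
for (source, target), weight in sorted(edgelist.items()):
    print(source, target, weight)
```

The two functions return the same weights on this sample. Route 1 only touches pairs that actually occur, while the pure-Python Route 2 loops over every pair of terms, so for the full dataset a sparse matrix library would be the more realistic choice for the matrix route.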

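Key Points

1. Please write your code so that it is as modular as possible.
2. When running large functions, please write the code so that the start time and end time can be checked.
3. If you use existing packages and functions, be sure to state the versions you used.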
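
Points 2 and 3 could be covered with a small amount of helper code. The following is a sketch using only the Python standard library; the decorator name `timed`, the helper `package_version` and the placeholder function `count_words` are assumptions made for illustration.

```python
import functools
import platform
import time
from importlib import metadata


def timed(func):
    """Print start time, end time and elapsed seconds around each call."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        print(f"[{func.__name__}] start: {time.strftime('%Y-%m-%d %H:%M:%S')}")
        started = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            elapsed = time.perf_counter() - started
            print(f"[{func.__name__}] end:   {time.strftime('%Y-%m-%d %H:%M:%S')} ({elapsed:.3f} s)")
    return wrapper


def package_version(name):
    """Return the installed version of a package, or a note if it is absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return "not installed"


@timed
def count_words(texts):
    """Placeholder for a long-running step such as noun extraction."""
    return sum(len(text.split()) for text in texts)


total = count_words(["artificial intelligence", "expert system design"])
print("Python", platform.python_version())
print("nltk", package_version("nltk"))
```

Wrapping each large function with the decorator keeps the timing code out of the function bodies, which also helps with the modularization point.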

Code Example

# Function declaration
extract.noun = function(data){
  
  # Perform the function's task
  
}


# Function declaration
edge.maker = function(noun.list){
  
  # Perform the function's task
  
}


# Code execution
keyword = read.csv()  # pass the path to the dataset file here
head(keyword)
noun.list = extract.noun(keyword)
edgelist = edge.maker(noun.list)
head(edgelist)

Input Data Format

Input : input data in the form [Index][Title][Abstract]

head(keyword)

IDX	TI	AB
1	A computational model for the endogenous arousal of thoughts through Z*-numbers	Natural language provides a rich combinatorial mechanism for encoding meanings - a finite set of words can express an unbounded number of thoughts. Framed in 2015 to extend the purpose of Zadeh’s Z-numbers a Z*-number is a perceptual symbol of the meaning of a natural language expression and consequently mentalese or internal speech. This article through decomposition of the Z(x)-macro-parameters into its atomic constituents presents a model for the endogenous arousal of thoughts during empathetic bespoke comprehension of the real-world. Based on Minsky’s Society of Mind the framework is founded on the assimilation of multimodal experiences a sense of ’unified self and its derivatives (choice interest curiosityetc.) objective and subjective components of knowledge commonsense and attention dynamics over a real-world scenario. The model attempts emulation of slow and fast thinking instinctive reactions learning deliberation reflection and self-conscious decisions. The design has been validated against human responses and aims to contribute to the development of autonomous artificial systems for man-machine symbiosis. (C) 2017 Elsevier Inc. All rights reserved.
2	Mapping vulnerability of multiple aquifers using multiple models and fuzzy logic to objectively derive model structures	Driven by contamination risks mapping Vulnerability Indices (VI) of multiple aquifers (both unconfined and confined) is investigated by integrating the basic DRASTIC framework with multiple models overarched by Artificial Neural Networks (ANN). The DRASTIC framework is a proactive tool to assess VI values using the data from the hydrosphere lithosphere and anthroposphere. However a research case arises for the application of multiple models on the ground of poor determination coefficients between the VI values and non-point anthropogenic contaminants. The paper formulates SCFL models which are derived from the multiple model philosophy of Supervised Committee (SC) machines and Fuzzy Logic (FL) and hence SCFL as their integration. The Fuzzy Logic based (FL) models include: Sugeno Fuzzy Logic (SFL) Mamdani Fuzzy Logic (MFL) Larsen Fuzzy Logic (LFL) models. The basic DRASTIC framework uses prescribed rating and weighting values based on expert judgment but the four FL-based models (SFL MFL LFL and SCFL) derive their values as per internal strategy within these models. The paper reports that FL and multiple models improve considerably on the correlation between the modeled vulnerability indices and observed nitrate-N values and as such it provides evidence that the SCFL multiple models can be an alternative to the basic framework even for multiple aquifers. The study area with multiple aquifers is in Varzeqan plain East Azerbaijan northwest Iran. (C) 2017 Elsevier B.V. All rights reserved.
3	A review of affective computing: From unimodal analysis to multimodal fusion	Affective computing is an emerging interdisciplinary research field bringing together researchers and practitioners from various fields ranging from artificial intelligence natural language processing to cognitive and social sciences. With the proliferation of videos posted online (e.g. on YouTube Facebook Twitter) for product reviews movie reviews political views and more affective computing research has increasingly evolved from conventional unimodal analysis to more complex forms of multimodal analysis. This is the primary motivation behind our first of its kind comprehensive literature review of the diverse field of affective computing. Furthermore existing literature surveys lack a detailed discussion of state of the art in multimodal affect analysis frameworks which this review aims to address. Multimodality is defined by the presence of more than one modality or channel e.g. visual audio text gestures and eye gage. In this paper we focus mainly on the use of audio visual and text information for multimodal affect analysis since around 90% of the relevant literature appears to cover these three modalities. Following an overview of different techniques for unimodal affect analysis we outline existing methods for fusing information from different modalities. As part of this review we carry out an extensive study of different categories of state-of-the-art fusion techniques followed by a critical analysis of potential performance improvements with multimodal analysis compared to unimodal analysis. A comprehensive overview of these two complementary fields aims to form the building blocks for readers to better understand this challenging and exciting research field. (C) 2017 Elsevier B.V. All rights reserved.
4	Stock market one-day ahead movement prediction using disparate data sources	There are several commercial financial expert systems that can be used for trading on the stock exchange. However their predictions are somewhat limited since they primarily rely on time-series analysis of the market. With the rise of the Internet new forms of collective intelligence (e.g. Google and Wikipedia) have emerged representing a new generation of “crowd-sourced” knowledge bases. They collate information on publicly traded companies while capturing web traffic statistics that reflect the public’s collective interest. Google and Wikipedia have become important “knowledge bases” for investors. In this research we hypothesize that combining disparate online data sources with traditional time-series and technical indicators for a stock can provide a more effective and intelligent daily trading expert system. Three machine learning models decision trees neural networks and support vector machines serve as the basis for our “inference engine”. To evaluate the performance of our expert system we present a case study based on the AAPL (Apple NASDAQ) stock. Our expert system had an 85% accuracy in predicting the next-day AAPL stock movement which outperforms the reported rates in the literature. Our results suggest that: (a) the knowledge base of financial expert systems can benefit from data captured from nontraditional “experts” like Google and Wikipedia; (b) diversifying the knowledge base by combining data from disparate sources can help improve the performance of financial expert systems; and (c) the use of simple machine learning models for inference and rule generation is appropriate with our rich knowledge database. Finally an intelligent decision making tool is provided to assist investors in making trading decisions on any stock commodity or index. (c) 2017 Elsevier Ltd. All rights reserved.
5	A sub-space artificial neural network for mold cooling in injection molding	The applications of artificial intelligence (AI) have considerably expanded over recent years. A new class of industrial systems is beginning to evolve that incorporates using high volume data and advanced analytics to better optimize product quality while reducing energy consumption. Artificial neural networks (ANN) when combined with advanced modeling and control begins to form an AI platform that can be further enhanced for factories of the future. This paper provides a demonstration of such initial work that can be further developed for future systems in a generic way. When considering polymer processing such as plastic injection molding the mold cavity temperature (MCT) profile directly relates to part quality and part reject rates. Therefore it is desirable to optimize the mold cooling process using real time control of MCT as it directly affect part quality. However MCT is affected by a number of interacting nonlinear dynamic parameters that are often neglected due to the challenge of quantifying such parameters. Advanced model based control algorithms are often used for providing improved control of complex systems. However they depend on good model formulations that are analytically insufficient. An online intelligent system identification approach for the mold cooling process is developed and tested. An ANN is designed to adjust online sub-space parameters that govern a mold cooling model. Results demonstrate that this online ANN approach can be used to accurately predict the dynamic behavior of mold cavity surface temperature. This is key to many industrial systems where their states are not directly observable and uncertainties are unknown. The methodology can be readily adapted for different operating conditions as in this case of polymer processing and has good potential for its integration with advanced model based control schemes and cloud computing approaches for the next generation of machines. (C) 2017 Elsevier Ltd. All rights reserved.
6	Automatic Density Peaks Clustering Using DNA Genetic Algorithm Optimized Data Field and Gaussian Process	Clustering by fast search and finding of Density Peaks ( called as DPC) introduced by Alex Rodriguez and Alessandro Laio attracted much attention in the field of pattern recognition and artificial intelligence. However DPC still has a lot of defects that are not resolved. Firstly the local density rho(i) of point i is affected by the cutoff distance dc which can influence the clustering result especially for small real-world cases. Secondly the number of clusters is still found intuitively by using the decision diagram to select the cluster centers. In order to overcome these defects this paper proposes an automatic density peaks clustering approach using DNA genetic algorithm optimized data field and Gaussian process (referred to as ADPC-DNAGA). ADPC-DNAGA can extract the optimal value of threshold with the potential entropy of data field and automatically determine the cluster centers by Gaussian method. For any data set to be clustered the threshold can be calculated from the data set objectively rather than the empirical estimation. The proposed clustering algorithm is benchmarked on publicly available synthetic and real-world datasets which are commonly used for testing the performance of clustering algorithms. The clustering results are compared not only with that of DPC but also with that of several well-known clustering algorithms such as Affinity Propagation DBSCAN and Spectral Cluster. The experimental results demonstrate that our proposed clustering algorithm can find the optimal cutoff distance d(c) to automatically identify clusters regardless of their shape and dimension of the embedded space and can often outperform the comparisons.

Output Data Format

Output : output in the form [Source][Target][Weight]

head(edgelist)

Source	Target	Type	Weight	Source_Label	Target_Label
1350	1218	Undirected	518	artificial neural network	artificial intelligence
7587	1218	Undirected	305	expert system	artificial intelligence
9049	1350	Undirected	214	genetic algorithm	artificial neural network
9049	1218	Undirected	185	genetic algorithm	artificial intelligence
8595	1218	Undirected	177	fuzzy	artificial intelligence
12951	1218	Undirected	171	machine learning	artificial intelligence
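
As a rough illustration of how a table with these six columns could be produced, the Python sketch below turns a mapping of keyword pairs to weights into rows with numeric node IDs and labels. The function name `to_edge_rows`, the ID scheme (alphabetical order starting at 1) and the sample weights are all assumptions; the IDs in the sample output above were evidently assigned some other way.

```python
def to_edge_rows(weights):
    """Turn {(label_a, label_b): weight} into rows with the six output columns.

    Assumption: node IDs are assigned in alphabetical order of the labels,
    starting at 1.
    """
    labels = sorted({label for pair in weights for label in pair})
    node_id = {label: index for index, label in enumerate(labels, start=1)}
    rows = [
        {
            "Source": node_id[source],
            "Target": node_id[target],
            "Type": "Undirected",
            "Weight": weight,
            "Source_Label": source,
            "Target_Label": target,
        }
        for (source, target), weight in weights.items()
    ]
    # Heaviest edges first, as in the sample output.
    rows.sort(key=lambda row: (-row["Weight"], row["Source_Label"], row["Target_Label"]))
    return rows


# Made-up weights for illustration only.
weights = {
    ("expert system", "machine learning"): 1,
    ("artificial intelligence", "expert system"): 3,
    ("artificial intelligence", "machine learning"): 2,
}
rows = to_edge_rows(weights)
print("\t".join(rows[0].keys()))
for row in rows:
    print("\t".join(str(value) for value in row.values()))
```

Keeping the numeric IDs and the labels side by side makes the table easy to load into a graph tool while remaining readable by eye.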

TIP

For those of you using Jupyter, here is a way to make coding a little more convenient, so take a look :)
Visual Studio Code user review: User review
Visual Studio Code link: MS Visual Studio Code
Visual Studio Code and Python: Python in visual code
Visual Studio Code and Jupyter: Jupyter in visual code
