Data Preprocessing 1

1.1 Feature Engineering

1.1.1 Scaling - scale()

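When the variables being compared have different ranges, standardization (z-score normalization) puts them on a comparable scale.
scale() performs this standardization.
standardized value = (observation - mean) / standard deviation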

pampas=c(283,288,205,204,287,300,310)
milk=c(33,31,31,32,33,34,29)
tissue=c(2500,2450,2490,2750,2800,2350,2450)

plot(NULL,NULL,xlim=c(1,7),ylim=c(10,3000),main = "pampas, milk tissue sales")
lines(pampas,type="b",col="blue")
lines(milk,type="b",col="red")
lines(tissue,type="b",col="black")

sp=scale(pampas)
sm=scale(milk)
st=scale(tissue)
scaled=cbind(sp,sm,st) # named scaled so the scale() function is not masked
scaled

##            [,1]        [,2]       [,3]
## [1,]  0.3344691  0.68182919 -0.2475199
## [2,]  0.4470309 -0.51137189 -0.5462509
## [3,] -1.4214938 -0.51137189 -0.3072661
## [4,] -1.4440061  0.08522865  1.2461348
## [5,]  0.4245185  0.68182919  1.5448658
## [6,]  0.7171790  1.27842973 -1.1437128
## [7,]  0.9423024 -1.70457297 -0.5462509
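
As a quick check of the formula above, the following sketch standardizes pampas by hand and compares the result with scale(). The names manual and auto are illustrative only.

```r
pampas <- c(283, 288, 205, 204, 287, 300, 310)

# Standardize by hand: subtract the mean, divide by the sample standard deviation
manual <- (pampas - mean(pampas)) / sd(pampas)

# scale() returns a one-column matrix, so flatten it before comparing
auto <- as.vector(scale(pampas))

all.equal(manual, auto)
```

After standardization every variable has mean 0 and standard deviation 1, which is why the three sales series become comparable on one axis.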

sc=scale(iris[1:4]) 
df=as.data.frame(sc)
cb=cbind(df,iris$Species)
head(cb)

##   Sepal.Length Sepal.Width Petal.Length Petal.Width iris$Species
## 1   -0.8976739  1.01560199    -1.335752   -1.311052       setosa
## 2   -1.1392005 -0.13153881    -1.335752   -1.311052       setosa
## 3   -1.3807271  0.32731751    -1.392399   -1.311052       setosa
## 4   -1.5014904  0.09788935    -1.279104   -1.311052       setosa
## 5   -1.0184372  1.24503015    -1.335752   -1.311052       setosa
## 6   -0.5353840  1.93331463    -1.165809   -1.048667       setosa

1.1.2 Binning

When observations are continuous and spread over a wide range, grouping them into bins makes them easier to interpret.

Load the data

data=read.csv("data/agemoney.CSV")
head(data)

##   age money
## 1  10  3000
## 2  11  3000
## 3  12  3500
## 4  13  3500
## 5  14  6000
## 6  15  5000

binning

age10=mean(data[data$age>=10&data$age<20,2]) # mean allowance, ages 10-19
age20=mean(data[data$age>=20&data$age<30,2]) # mean allowance, ages 20-29
age30=mean(data[data$age>=30&data$age<40,2]) # mean allowance, ages 30-39
age40=mean(data[data$age>=40&data$age<50,2]) # mean allowance, ages 40-49
age50=mean(data[data$age>=50&data$age<60,2]) # mean allowance, ages 50-59

data.mean=data.frame(age=c(10,20,30,40,50),money=c(age10,age20,age30,age40,age50))
head(data.mean)

##   age  money
## 1  10   5500
## 2  20  70000
## 3  30 211000
## 4  40 338000
## 5  50 393000
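
The five nearly identical lines above can also be written with cut(), which assigns each observation to an interval, and tapply(), which summarizes per group. This is only a sketch: the values below are made up to stand in for data/agemoney.CSV, and the names ages, money, age.group and group.mean are illustrative.

```r
# Hypothetical values, not the contents of data/agemoney.CSV
ages  <- c(10, 15, 19, 20, 25, 31, 39, 45, 52, 59)
money <- c(3000, 5000, 8000, 50000, 90000, 200000, 220000, 340000, 390000, 400000)

# right = FALSE makes every bin closed on the left: [10,20), [20,30), ...
age.group <- cut(ages, breaks = seq(10, 60, by = 10), right = FALSE)

# Mean of money within each age bin
group.mean <- tapply(money, age.group, mean)
group.mean
```

Changing the bin width then only requires changing the breaks argument instead of editing one line per group.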

1.1.3 Creating Feature

When the given variables alone do not produce meaningful results, derive additional attributes from them.

Convert dates to the Date type

date.txt=c("2016-11-01","2016-11-1","2016-11-03","2016-11-05")
date.txt

## [1] "2016-11-01" "2016-11-1"  "2016-11-03" "2016-11-05"

date.as.date=as.Date(date.txt)
date.as.date

## [1] "2016-11-01" "2016-11-01" "2016-11-03" "2016-11-05"

class(date.as.date)

## [1] "Date"

Specifying a date format - format(date, format)

%d : day of the month as a number (01-31)
%a, %A : abbreviated/full weekday name
%m : month as a number (01-12)
%b, %B : abbreviated/full month name
%y, %Y : 2-digit/4-digit year

date.as.date.week.full=format(date.as.date,format="%Y-%m-%A")
date.as.date.week.full

## [1] "2016-11-Tuesday"  "2016-11-Tuesday"  "2016-11-Thursday"
## [4] "2016-11-Saturday"

date.as.date.week.only=format(date.as.date,format="%A")
date.as.date.week.only

## [1] "Tuesday"  "Tuesday"  "Thursday" "Saturday"
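
The same format codes can turn one Date column into several numeric features. The sketch below is an illustration under assumptions: the data frame features and its column names are invented here, and %u (ISO weekday number, Monday = 1) is used instead of %A because weekday names depend on the session locale.

```r
d <- as.Date(c("2016-11-01", "2016-11-03", "2016-11-05"))

features <- data.frame(
  date  = d,
  year  = as.integer(format(d, "%Y")),
  month = as.integer(format(d, "%m")),
  day   = as.integer(format(d, "%d")),
  wday  = as.integer(format(d, "%u"))  # 1 = Monday ... 7 = Sunday
)

# A derived flag: Saturday (6) or Sunday (7)
features$weekend <- features$wday >= 6
features
```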

1.1.4 Creating dummy variables - model.matrix(~x,data)[,-1]

Create dummy variables

lvl=factor(c("A","B","A","A","C"))
df=data.frame(lvl)
df

##   lvl
## 1   A
## 2   B
## 3   A
## 4   A
## 5   C

dv=model.matrix(~lvl,data=df)[,-1] # [,-1] drops the intercept column
dv

##   lvlB lvlC
## 1    0    0
## 2    1    0
## 3    0    0
## 4    0    0
## 5    0    1
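
Dropping the intercept column leaves level A as the implicit baseline (all zeros). If one indicator column per level is wanted instead, removing the intercept from the formula is one way to get it. A minimal sketch, with one.hot as an illustrative name:

```r
df <- data.frame(lvl = factor(c("A", "B", "A", "A", "C")))

# "- 1" removes the intercept, so every level gets its own 0/1 column
one.hot <- model.matrix(~ lvl - 1, data = df)
colnames(one.hot)
rowSums(one.hot)  # each row has exactly one 1
```

For regression models the baseline form is usually preferred, because the full set of indicators is collinear with an intercept.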

1.1.5 Example 1 - format()

Print only the day of the week

data.txt=c("2016-11-01","2017-11-01","2018-11-03","2019-11-05")

#1 
data.as.date=as.Date(data.txt)
data.as.week.only=format(data.as.date,format="%A")
data.as.week.only

## [1] "Tuesday"   "Wednesday" "Saturday"  "Tuesday"

Display the year and month in the form "16-11"

#2 
data.as.week=format(data.as.date,format="%y-%m")
data.as.week

## [1] "16-11" "17-11" "18-11" "19-11"

1.1.6 Example 2 - binning

Mean survival rate of Titanic passengers by 10-year age group - method 1

titanic=read.csv("data/Titanic/train.csv")
tage=titanic$Age
tclass=titanic$Pclass

# Method 1
tage00=subset(titanic,titanic$Age<10)
tage00=mean(tage00$Survived)*100
tage10=subset(titanic,titanic$Age>=10&titanic$Age<20)
tage10=mean(tage10$Survived)*100
tage20=subset(titanic,titanic$Age>=20&titanic$Age<30)
tage20=mean(tage20$Survived)*100
tage30=subset(titanic,titanic$Age>=30&titanic$Age<40)
tage30=mean(tage30$Survived)*100
tage40=subset(titanic,titanic$Age>=40&titanic$Age<50)
tage40=mean(tage40$Survived)*100
tage50=subset(titanic,titanic$Age>=50&titanic$Age<60)
tage50=mean(tage50$Survived)*100
tage60=subset(titanic,titanic$Age>=60&titanic$Age<70)
tage60=mean(tage60$Survived)*100
tage70=subset(titanic,titanic$Age>=70&titanic$Age<80)
tage70=mean(tage70$Survived)*100
tage80=subset(titanic,titanic$Age>=80&titanic$Age<90)
tage80=mean(tage80$Survived)*100

rbind(tage00,tage10,tage20,tage30,tage40,tage50,tage60,tage70,tage80)

##             [,1]
## tage00  61.29032
## tage10  40.19608
## tage20  35.00000
## tage30  43.71257
## tage40  38.20225
## tage50  41.66667
## tage60  31.57895
## tage70   0.00000
## tage80 100.00000

Mean survival rate of Titanic passengers by 10-year age group - method 2

# Method 2: column 2 is Survived; na.rm=TRUE drops the NA rows produced by missing Age values
age00=mean(titanic[titanic$Age>=00&titanic$Age<10,2],na.rm=TRUE) # ages 0-9
age10=mean(titanic[titanic$Age>=10&titanic$Age<20,2],na.rm=TRUE) # ages 10-19
age20=mean(titanic[titanic$Age>=20&titanic$Age<30,2],na.rm=TRUE) # ages 20-29
age30=mean(titanic[titanic$Age>=30&titanic$Age<40,2],na.rm=TRUE) # ages 30-39
age40=mean(titanic[titanic$Age>=40&titanic$Age<50,2],na.rm=TRUE) # ages 40-49
age50=mean(titanic[titanic$Age>=50&titanic$Age<60,2],na.rm=TRUE) # ages 50-59
age60=mean(titanic[titanic$Age>=60&titanic$Age<70,2],na.rm=TRUE) # ages 60-69
age70=mean(titanic[titanic$Age>=70&titanic$Age<80,2],na.rm=TRUE) # ages 70-79
age80=mean(titanic[titanic$Age>=80&titanic$Age<90,2],na.rm=TRUE) # ages 80-89

rbind(age00,age10,age20,age30,age40,age50,age60,age70,age80)

##            [,1]
## age00 0.6129032
## age10 0.4019608
## age20 0.3500000
## age30 0.4371257
## age40 0.3820225
## age50 0.4166667
## age60 0.3157895
## age70 0.0000000
## age80 1.0000000
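
Both methods repeat one line per decade. An alternative, shown here only as a sketch on a small made-up data frame (not the Titanic file), is to compute the decade with floor() and let tapply() do the grouping. tapply() drops observations whose group is NA, so rows with a missing Age are skipped.

```r
# Hypothetical passengers; the column names mirror the Titanic data
toy <- data.frame(
  Age      = c(4, 8, 15, 22, 27, NA, 35, 71),
  Survived = c(1, 0, 1, 0, 0, 1, 1, 0)
)

# 4 -> 0, 15 -> 10, 27 -> 20, ...; NA stays NA
decade <- floor(toy$Age / 10) * 10

# Survival rate in percent per decade
rate <- tapply(toy$Survived, decade, mean) * 100
rate
```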

Mean survival rate of Titanic passengers by passenger class (Pclass) - method 1

# Method 1
tpclass1=subset(titanic,titanic$Pclass==1)
tpclass1=mean(tpclass1$Survived)*100
tpclass2=subset(titanic,titanic$Pclass==2)
tpclass2=mean(tpclass2$Survived)*100
tpclass3=subset(titanic,titanic$Pclass==3)
tpclass3=mean(tpclass3$Survived)*100
rbind(tpclass1,tpclass2,tpclass3)

##              [,1]
## tpclass1 62.96296
## tpclass2 47.28261
## tpclass3 24.23625

1.1.7 Example 3

Dummy-encode Sex in the Titanic data

titanic$Sex=as.factor(titanic$Sex)
one.hot.s=model.matrix(~Sex,data=titanic)[,-1]
head(one.hot.s)

## 1 2 3 4 5 6 
## 1 0 0 0 1 1

Dummy-encode Embarked

titanic$Embarked=as.factor(titanic$Embarked)
one.hot.e=model.matrix(~Embarked,data=titanic)[,-1]
head(one.hot.e)

##   EmbarkedC EmbarkedQ EmbarkedS
## 1         0         0         1
## 2         1         0         0
## 3         0         0         1
## 4         0         0         1
## 5         0         0         1
## 6         0         1         0

1.2 Working with data frames

1.2.1 Creating a data frame - data.frame()

x=c(1:5)
y=c(10,20,30,40,50)
z=c("M","M","M","F","F")
d1=data.frame(x,y,z)
d1

##   x  y z
## 1 1 10 M
## 2 2 20 M
## 3 3 30 M
## 4 4 40 F
## 5 5 50 F

1.2.2 Caveat - stringsAsFactors

stringsAsFactors: in R versions before 4.0.0 the default is TRUE, so character strings are stored as factors (since R 4.0.0 the default is FALSE).

id=c(1,2,3,4,5)
factory=c("평택지부1","평택지부2","안산지부","인천지부","원주지부")
sales=c(70,90,80,85,87)
d2=data.frame(id,factory,sales)
str(d2)

## 'data.frame':    5 obs. of  3 variables:
##  $ id     : num  1 2 3 4 5
##  $ factory: Factor w/ 5 levels "안산지부",..: 4 5 1 3 2
##  $ sales  : num  70 90 80 85 87

To keep strings as character data, set stringsAsFactors=FALSE.

d2=data.frame(id,factory,sales,stringsAsFactors = FALSE)
str(d2)

## 'data.frame':    5 obs. of  3 variables:
##  $ id     : num  1 2 3 4 5
##  $ factory: chr  "평택지부1" "평택지부2" "안산지부" "인천지부" ...
##  $ sales  : num  70 90 80 85 87
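
Because the default differs between R versions, passing the argument explicitly keeps the result the same everywhere. A minimal sketch with illustrative names as.chr and as.fct:

```r
factory <- c("A", "B", "A")

# Explicit argument: the outcome no longer depends on the R version
as.chr <- data.frame(factory, stringsAsFactors = FALSE)
as.fct <- data.frame(factory, stringsAsFactors = TRUE)

class(as.chr$factory)
class(as.fct$factory)
levels(as.fct$factory)
```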

1.2.3 Converting the data type of a specific column

id=c(1,2,3,4,5)
factory=c("평택1","평택2","안산","인천","원주")
type=c("공장","공장","오피스","오피스","오피스")
sales=c(70,90,80,85,87)
d3=data.frame(id,factory,type,sales,stringsAsFactors = FALSE)
str(d3)

## 'data.frame':    5 obs. of  4 variables:
##  $ id     : num  1 2 3 4 5
##  $ factory: chr  "평택1" "평택2" "안산" "인천" ...
##  $ type   : chr  "공장" "공장" "오피스" "오피스" ...
##  $ sales  : num  70 90 80 85 87

d3$type=as.factor(d3$type)
str(d3)

## 'data.frame':    5 obs. of  4 variables:
##  $ id     : num  1 2 3 4 5
##  $ factory: chr  "평택1" "평택2" "안산" "인천" ...
##  $ type   : Factor w/ 2 levels "공장",..: 1 1 2 2 2
##  $ sales  : num  70 90 80 85 87

1.2.4 Selecting a specific column (as a vector)

id=1:7
name=c("kim","park","jo","kim2","lee","yang","choi")
gender=c("F","M","F","F","F","M","M")
sales=c(1000,2000,1500,2200,1700,2000,2200)
d4=data.frame(id,name,gender,sales,stringsAsFactors = FALSE)

d.col=d4$sales
d.col

## [1] 1000 2000 1500 2200 1700 2000 2200

d.col=d4[,4]
d.col

## [1] 1000 2000 1500 2200 1700 2000 2200

d.col=d4[,"sales"]
d.col

## [1] 1000 2000 1500 2200 1700 2000 2200
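
All three forms above return a plain vector. In base R, a single column can be kept as a data frame either with drop=FALSE or with single-bracket list-style indexing. A small sketch on a made-up data frame d; the variable names are illustrative:

```r
d <- data.frame(id = 1:3, sales = c(1000, 2000, 1500))

col.vec <- d[, "sales"]                # numeric vector
col.df1 <- d[, "sales", drop = FALSE]  # one-column data frame
col.df2 <- d["sales"]                  # also a one-column data frame

str(col.vec)
str(col.df1)
```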

1.2.5 Selecting specific columns (as a data frame) - dplyr's select()

library(dplyr) # provides %>% and select()
ex=d4%>%select(gender)
str(ex)

## 'data.frame':    7 obs. of  1 variable:
##  $ gender: chr  "F" "M" "F" "F" ...

ex=d4%>%select(gender,sales)

1.2.6 Selecting multiple columns

d.cols=d4[,c(3,4)]
d.cols

##   gender sales
## 1      F  1000
## 2      M  2000
## 3      F  1500
## 4      F  2200
## 5      F  1700
## 6      M  2000
## 7      M  2200

d.cols=d4[,3:4]
d.cols

##   gender sales
## 1      F  1000
## 2      M  2000
## 3      F  1500
## 4      F  2200
## 5      F  1700
## 6      M  2000
## 7      M  2200

d.cols=d4[,c("gender","sales")]
d.cols

##   gender sales
## 1      F  1000
## 2      M  2000
## 3      F  1500
## 4      F  2200
## 5      F  1700
## 6      M  2000
## 7      M  2200

1.2.7 Selecting rows and elements

Select a specific row

d.row=d4[2,] 
d.row

##   id name gender sales
## 2  2 park      M  2000

Select a specific element

d.element=d4[2,3]
d.element

## [1] "M"

1.2.8 Selecting rows that match a condition 1

d.man=d4[d4$gender=="M",]
d.man

##   id name gender sales
## 2  2 park      M  2000
## 6  6 yang      M  2000
## 7  7 choi      M  2200

d4[d4$sales>2000,]

##   id name gender sales
## 4  4 kim2      F  2200
## 7  7 choi      M  2200

d4[(d4$gender=="F")&(d4$sales>2000),]

##   id name gender sales
## 4  4 kim2      F  2200

d4[(d4$gender=="F")|(d4$sales>2000),]

##   id name gender sales
## 1  1  kim      F  1000
## 3  3   jo      F  1500
## 4  4 kim2      F  2200
## 5  5  lee      F  1700
## 7  7 choi      M  2200

1.2.9 Selecting rows that match a condition 2 - using which()

d.man=d4[which(d4$gender=="M"),]
d.man

##   id name gender sales
## 2  2 park      M  2000
## 6  6 yang      M  2000
## 7  7 choi      M  2200

d4[which((d4$gender=="F")&(d4$sales>2000)),] # the whole condition goes inside which()

##   id name gender sales
## 4  4 kim2      F  2200
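
One practical difference between plain logical indexing and which() appears when the condition contains NA: a logical NA produces a row of NAs, while which() keeps only the positions that are TRUE. A sketch on a made-up data frame d:

```r
d <- data.frame(name = c("kim", "park", "jo"), sales = c(1000, NA, 2500))

by.logical <- d[d$sales > 2000, ]       # the NA comparison yields an all-NA row
by.which   <- d[which(d$sales > 2000), ]  # which() drops the NA position

nrow(by.logical)
nrow(by.which)
```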

1.2.10 Selecting rows that match a condition 3 - subset()

d.man=subset(d4,gender=="M")
dsubset=subset(d4,gender=="F",select=c(name,gender))
dsubset

##   name gender
## 1  kim      F
## 3   jo      F
## 4 kim2      F
## 5  lee      F

subsetexam=subset(d4,gender=="F"&sales>2000)
subsetexam

##   id name gender sales
## 4  4 kim2      F  2200

subsetexam2=subset(d4,gender=="F"|sales>2000,select=c(name,gender))
subsetexam2

##   name gender
## 1  kim      F
## 3   jo      F
## 4 kim2      F
## 5  lee      F
## 7 choi      M

1.2.11 Selecting rows that match a condition 4 - dplyr's filter()

d.man=d4%>%filter(gender=="M")

dexam=d4%>%filter(gender=="F"|sales>2000)
dexam

##   id name gender sales
## 1  1  kim      F  1000
## 2  3   jo      F  1500
## 3  4 kim2      F  2200
## 4  5  lee      F  1700
## 5  7 choi      M  2200

1.2.12 Exercise - subset, which, filter

Using the d4 data:
Extract the rows where sales >= 2000
Extract the rows where gender is female and sales is at least 1500
Use subset() to extract only the name column from the result above

exam1=subset(d4,sales>=2000)
exam1

##   id name gender sales
## 2  2 park      M  2000
## 4  4 kim2      F  2200
## 6  6 yang      M  2000
## 7  7 choi      M  2200

exam2=subset(d4,sales>=1500&gender=="F")
exam2

##   id name gender sales
## 3  3   jo      F  1500
## 4  4 kim2      F  2200
## 5  5  lee      F  1700

exam3=subset(d4,sales>=1500&gender=="F",select = name)
exam3

##   name
## 3   jo
## 4 kim2
## 5  lee

1.2.13 Adding columns to a data frame afterwards

Original data

x1=c(1,2,3)
x2=c(4,5,6)
x3=c(7,8,9)
data=data.frame(x1,x2,x3)
x4=c(10,11,12)
d2=cbind(data,x4)
d2

##   x1 x2 x3 x4
## 1  1  4  7 10
## 2  2  5  8 11
## 3  3  6  9 12

x5=c("james","mary","tony")
d3=cbind(d2,x5,stringsAsFactors=FALSE)
d3

##   x1 x2 x3 x4    x5
## 1  1  4  7 10 james
## 2  2  5  8 11  mary
## 3  3  6  9 12  tony

Attaching a column with $

x4=c(10,11,12)
data$x4=x4
data

##   x1 x2 x3 x4
## 1  1  4  7 10
## 2  2  5  8 11
## 3  3  6  9 12

x5=c("James","Mary","Tony") # assigned this way, the column stays character even without stringsAsFactors
data$x5=x5
data

##   x1 x2 x3 x4    x5
## 1  1  4  7 10 James
## 2  2  5  8 11  Mary
## 3  3  6  9 12  Tony

Creating derived variables

data$sum=data$x1+data$x2+data$x3
data

##   x1 x2 x3 x4    x5 sum
## 1  1  4  7 10 James  12
## 2  2  5  8 11  Mary  15
## 3  3  6  9 12  Tony  18

data$pass=ifelse(data$sum>15,"pass","fail")
data

##   x1 x2 x3 x4    x5 sum pass
## 1  1  4  7 10 James  12 fail
## 2  2  5  8 11  Mary  15 fail
## 3  3  6  9 12  Tony  18 pass

1.2.14 Joining two data frames on a common key - dplyr's left_join()

a=data.frame(id=c(1:5),mid=c(30,40,50,60,70))
b=data.frame(id=c(5:1),mid=c(70,90,100,90,80))

library(dplyr) # provides left_join()

left_join(a,b,by="id")

##   id mid.x mid.y
## 1  1    30    80
## 2  2    40    90
## 3  3    50   100
## 4  4    60    90
## 5  5    70    70
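
If dplyr is not available, base R's merge() can produce a comparable left join. This is a sketch of an alternative, not the method used above; all.x=TRUE keeps every row of the first data frame, and the default suffixes .x and .y are applied to the clashing mid columns.

```r
a <- data.frame(id = 1:5, mid = c(30, 40, 50, 60, 70))
b <- data.frame(id = 5:1, mid = c(70, 90, 100, 90, 80))

# Left join on id; rows of a without a match in b would get NA
joined <- merge(a, b, by = "id", all.x = TRUE)
joined
```

Unlike left_join(), merge() sorts the result by the key column, so row order can differ when the left table is not already sorted.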

1.2.15 Renaming data frame columns - rename(newname=oldname)

d=rename(d4,newname=name)
d

##   id newname gender sales
## 1  1     kim      F  1000
## 2  2    park      M  2000
## 3  3      jo      F  1500
## 4  4    kim2      F  2200
## 5  5     lee      F  1700
## 6  6    yang      M  2000
## 7  7    choi      M  2200

1.2.16 Exercise 2 - conditions

Split the Titanic data into males and females and compute the survival rate of each

titanic.male=subset(titanic,titanic$Sex=="male")
mean(titanic.male$Survived,na.rm=TRUE)

## [1] 0.1889081

titanic.female=subset(titanic,titanic$Sex=="female")
mean(titanic.female$Survived,na.rm=TRUE)

## [1] 0.7420382

Survival rates for males under 10, in their teens, and in their twenties

titanic1=subset(titanic,titanic$Sex=="male"&titanic$Age<=10) # aged 10 or under (Age<10 would be strictly under 10)
mean(titanic1$Survived,na.rm=TRUE)

## [1] 0.5757576

titanic1=subset(titanic,titanic$Sex=="male"&titanic$Age>=10&titanic$Age<20) # teens
mean(titanic1$Survived,na.rm=TRUE)

## [1] 0.122807

titanic1=subset(titanic,titanic$Sex=="male"&titanic$Age>=20&titanic$Age<30) # twenties
mean(titanic1$Survived,na.rm=TRUE)

## [1] 0.1689189

Survival rates for females under 10, in their teens, and in their twenties

titanic2=subset(titanic,titanic$Sex=="female"&titanic$Age<=10) # aged 10 or under (Age<10 would be strictly under 10)
mean(titanic2$Survived,na.rm=TRUE)

## [1] 0.6129032

titanic2=subset(titanic,titanic$Sex=="female"&titanic$Age>=10&titanic$Age<20) # teens
mean(titanic2$Survived,na.rm=TRUE)

## [1] 0.7555556

titanic2=subset(titanic,titanic$Sex=="female"&titanic$Age>=20&titanic$Age<30) # twenties
mean(titanic2$Survived,na.rm=TRUE)

## [1] 0.7222222

1.3 Splitting data

1.3.1 sample()

sample(1:10,5)

## [1] 1 5 2 7 4

sample(1:10,10) # random shuffle

##  [1]  3  2  6  7  1  4 10  8  9  5

sample(1:10,10,replace = TRUE) # sampling with replacement

##  [1] 7 3 5 5 9 8 3 7 1 8
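
sample() returns different values on every run, so the outputs shown here will not match a rerun. Fixing the random seed with set.seed() makes a draw repeatable. A minimal sketch; the seed value 2018 is arbitrary:

```r
set.seed(2018)          # fix the random number generator state
first <- sample(1:10, 5)

set.seed(2018)          # same seed, same state
second <- sample(1:10, 5)

identical(first, second)
```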

1.3.2 split()

iris.s=split(iris,iris$Species) # split by species
split(iris,1:10) # split into 10 groups (1:10 is recycled over the rows)
head(iris.s$setosa)
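
split() returns a list of data frames, so a per-group summary is one sapply() away. The sketch below uses the built-in iris data; the names means and groups are illustrative.

```r
iris.s <- split(iris, iris$Species)

# Number of rows and mean sepal length in each species
sapply(iris.s, nrow)
means <- sapply(iris.s, function(g) mean(g$Sepal.Length))
means

# Ten roughly equal groups: the grouping vector is written out explicitly
groups <- split(iris, rep(1:10, length.out = nrow(iris)))
length(groups)
```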

1.3.3 ifelse()

ifelse(condition, value if TRUE, value if FALSE)

a=3
ifelse(a==3,"yes 3","no 3")

## [1] "yes 3"

autoparts=read.csv("data/autoparts.csv",header = TRUE)
autoparts1=autoparts[autoparts$prod_no=="90784-76001",c(2:11)]
autoparts2=autoparts1[autoparts1$c_thickness<1000,]

autoparts2$y_faulty=ifelse((autoparts2$c_thickness<20)|(autoparts2$c_thickness>32),1,0)
head(autoparts2$y_faulty,10)

##  [1] 0 0 0 0 0 0 0 0 0 1

autoparts2$g_class=as.factor(ifelse(autoparts2$c_thickness<20,1,ifelse(autoparts2$c_thickness<32,2,3)))
table(autoparts2$g_class)

## 
##     1     2     3 
##  2141 18914   712

1.3.4 Exercise 1 - ifelse(), split()

Using the Titanic data

Passengers under 15 with Parch greater than 0 can be regarded as having boarded with a parent. How many such passengers are there?

# Method 1
a=ifelse(titanic$Age>=15,0,ifelse(titanic$Parch>0,1,0))
sum(a,na.rm=TRUE)

## [1] 70

# Method 2
NROW(subset(titanic,titanic$Age<15&titanic$Parch>0))

## [1] 70

Among passengers under 15, find the survival rate of the group with Parch equal to 0 and of the group with Parch of 1 or more.

titanic$class=ifelse(titanic$Age>=15,0,ifelse(titanic$Parch>=1,1,2))

sp=split(titanic,titanic$class)

mean(sp$`0`$Survived)*100 # group aged 15 or over

## [1] 38.52201

mean(sp$`1`$Survived)*100 # group under 15 with Parch of 1 or more

## [1] 57.14286

mean(sp$`2`$Survived)*100 # group under 15 with Parch of 0

## [1] 62.5

1.3.5 Data preparation example - train / test

t_index=sample(1:nrow(autoparts2),size=nrow(autoparts2)*0.7)
train=autoparts2[t_index,]
test=autoparts2[-t_index,]
nrow(train)

## [1] 15236

nrow(test)

## [1] 6531
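
The same 70/30 split can be tried without autoparts.csv. The sketch below uses the built-in iris data and a fixed seed so the split is repeatable; the seed value and the use of round() for the sample size are choices made for this illustration.

```r
set.seed(2018)
n <- nrow(iris)

# Row indices of the training set: 70% of the rows, drawn without replacement
t_index <- sample(seq_len(n), size = round(n * 0.7))

train <- iris[t_index, ]
test  <- iris[-t_index, ]   # negative indices select everything else

nrow(train)
nrow(test)
```

Because test is built from the complement of t_index, no row can appear in both sets.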

1.4 Handling missing values

1.4.1 Checking for missing values - is.na()

x=c(1,NA,2,3)
is.na(x)

## [1] FALSE  TRUE FALSE FALSE

When there is a lot of data, use sum(is.na(data)) to count the missing values.

sum(is.na(autoparts))

## [1] 0

1.4.2 Checking for missing values 2 - complete.cases(data)

Returns FALSE wherever there is an NA

complete.cases(x)

## [1]  TRUE FALSE  TRUE  TRUE

For data frames and matrices, it returns TRUE for a row only when none of the values in that row is NA

library(reshape2)
head(french_fries)

##    time treatment subject rep potato buttery grassy rancid painty
## 61    1         1       3   1    2.9     0.0    0.0    0.0    5.5
## 25    1         1       3   2   14.0     0.0    0.0    1.1    0.0
## 62    1         1      10   1   11.0     6.4    0.0    0.0    0.0
## 26    1         1      10   2    9.9     5.9    2.9    2.2    0.0
## 63    1         1      15   1    1.2     0.1    0.0    1.1    5.1
## 27    1         1      15   2    8.8     3.0    3.6    1.5    2.3

french_fries[!complete.cases(french_fries),] # show the rows that contain NA

##     time treatment subject rep potato buttery grassy rancid painty
## 315    5         3      15   1     NA      NA     NA     NA     NA
## 455    7         2      79   1    7.3      NA    0.0    0.7      0
## 515    8         1      79   1   10.5      NA    0.0    0.5      0
## 520    8         2      16   1    4.5      NA    1.4    6.7      0
## 563    8         2      79   2    5.7       0    1.4    2.3     NA

1.4.3 Handling missing values 1 - which(!complete.cases())

Use which() when only the row indices are needed

which(!complete.cases(french_fries))

## [1] 341 477 525 535 550

Removal

x=1:5
y=c(2,4,NA,8,10)
df=data.frame(x,y)
df

##   x  y
## 1 1  2
## 2 2  4
## 3 3 NA
## 4 4  8
## 5 5 10

idx=which(!complete.cases(df))
df[-idx,]

##   x  y
## 1 1  2
## 2 2  4
## 4 4  8
## 5 5 10

1.4.4 Handling missing values 2 - na.rm

x=c(1,2,NA,3)
mean(x,na.rm=TRUE)

## [1] 2

1.4.5 Handling missing values 3 - na.omit() / na.fail() / na.pass()

na.omit() returns the data with the rows containing missing values removed

x=1:5 ; y=c(2,4,NA,8,10)
df=data.frame(x,y)
na.omit(df)

##   x  y
## 1 1  2
## 2 2  4
## 4 4  8
## 5 5 10

na.fail() fails with an error if there are any missing values

na.fail(df)
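
No output is shown for na.fail(df) because the call stops with an error when the data contain NA. One way to observe this without halting a script is to wrap the call in tryCatch(). A small sketch; result and ok are illustrative names:

```r
df <- data.frame(x = 1:5, y = c(2, 4, NA, 8, 10))

# The error handler returns the error message instead of stopping
result <- tryCatch(na.fail(df), error = function(e) conditionMessage(e))
result

# With complete data, na.fail() simply returns its input
ok <- na.fail(na.omit(df))
nrow(ok)
```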

na.pass() returns the data unchanged, whether or not there are missing values

na.pass(df)

##   x  y
## 1 1  2
## 2 2  4
## 3 3 NA
## 4 4  8
## 5 5 10

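1.4.6 Handling missing values 4 - na.action

Some functions accept na.action as an argument and handle missing values that way. na.omit drops the incomplete rows; na.exclude also drops them from the fit but pads the residuals with NA.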

x=1:5 ; y=c(2,4,NA,8,10)
df=data.frame(x,y)
resid(lm(y~x,data=df,na.action=na.omit))

##             1             2             4             5 
## -7.423409e-16  9.696420e-16  6.043761e-17 -2.877387e-16

resid(lm(y~x,data=df,na.action=na.exclude))

##             1             2             3             4             5 
## -7.423409e-16  9.696420e-16            NA  6.043761e-17 -2.877387e-16

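1.4.7 Handling missing values 5 - x[is.na(x)]=0 / replacing missing values with 0

Use this only when every missing value can reasonably be treated as 0.
One such case is data in which only non-zero results were recorded.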

x=c(1,2,3,NA,5,NA,7)
x[is.na(x)]=0
x

## [1] 1 2 3 0 5 0 7

For a data frame, apply it to a column as follows

french_fries$buttery[is.na(french_fries$buttery)]=0
french_fries[!complete.cases(french_fries),]

##     time treatment subject rep potato buttery grassy rancid painty
## 315    5         3      15   1     NA       0     NA     NA     NA
## 563    8         2      79   2    5.7       0    1.4    2.3     NA

1.4.7 Handling missing values 6 - replacing with the mean

x=c(1,2,3,NA,5,NA,7)
x[is.na(x)]=mean(x,na.rm=TRUE)
x

## [1] 1.0 2.0 3.0 3.6 5.0 3.6 7.0

For a data frame, apply it to a column as follows

french_fries$grassy[is.na(french_fries$grassy)]=mean(french_fries$grassy,na.rm=TRUE)
french_fries[!complete.cases(french_fries),]

##     time treatment subject rep potato buttery    grassy rancid painty
## 315    5         3      15   1     NA       0 0.6641727     NA     NA
## 563    8         2      79   2    5.7       0 1.4000000    2.3     NA
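
When several numeric columns need the same treatment, the column-by-column assignment can be wrapped in lapply(). This is a sketch on a made-up data frame, with impute.mean as a hypothetical helper name; non-numeric columns are left untouched.

```r
df <- data.frame(
  a = c(1, NA, 3),
  b = c(NA, 4, 6),
  c = c("x", "y", NA)
)

# Replace NA in one numeric vector by that vector's mean
impute.mean <- function(col) {
  col[is.na(col)] <- mean(col, na.rm = TRUE)
  col
}

num <- sapply(df, is.numeric)            # which columns are numeric
df[num] <- lapply(df[num], impute.mean)  # impute only those columns
df
```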

1.4.8 Missing value exercise

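How many rows in the Titanic data contain missing values?

sum(!complete.cases(titanic)) # sum(is.na(titanic)) would count NA cells (354 here), not rows

## [1] 177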

Check each variable for missing values and count them per variable

sum(is.na(titanic$PassengerId))

## [1] 0

sum(is.na(titanic$Survived))

## [1] 0

sum(is.na(titanic$Pclass))

## [1] 0

sum(is.na(titanic$Name))

## [1] 0

sum(is.na(titanic$Sex))

## [1] 0

sum(is.na(titanic$Age))

## [1] 177

sum(is.na(titanic$SibSp))

## [1] 0

sum(is.na(titanic$Parch))

## [1] 0

sum(is.na(titanic$Ticket))

## [1] 0

sum(is.na(titanic$Fare))

## [1] 0

sum(is.na(titanic$class))

## [1] 177

## for loop

titanic=read.csv("data/Titanic/train.csv")
for(i in 1:length(names(titanic)))
{
  name=names(titanic)
  print(paste("number of NA in",name[i],"is",sum(is.na(titanic[i]))))
}

## [1] "number of NA in PassengerId is 0"
## [1] "number of NA in Survived is 0"
## [1] "number of NA in Pclass is 0"
## [1] "number of NA in Name is 0"
## [1] "number of NA in Sex is 0"
## [1] "number of NA in Age is 177"
## [1] "number of NA in SibSp is 0"
## [1] "number of NA in Parch is 0"
## [1] "number of NA in Ticket is 0"
## [1] "number of NA in Fare is 0"
## [1] "number of NA in Cabin is 0"
## [1] "number of NA in Embarked is 0"
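
The per-column counts can also be obtained in one call with colSums(is.na(...)). The sketch below uses the built-in airquality data rather than the Titanic file so that it runs without any external data.

```r
# is.na() gives a logical matrix; colSums() counts the TRUE values per column
na.count <- colSums(is.na(airquality))
na.count

# Number of rows with at least one NA
rows.with.na <- sum(!complete.cases(airquality))
rows.with.na
```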

If the empty string "" in Cabin and Embarked is treated as a missing value, how many missing values does each variable have?

sum(titanic$Cabin=="")

## [1] 687

sum(titanic$Embarked=="")

## [1] 2
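
Empty strings can also be turned into NA while reading the file, via the na.strings argument of read.csv(). The sketch below reads a small inline CSV (made up for this example) instead of the Titanic file.

```r
# Hypothetical four-row CSV with empty fields
csv.text <- "Cabin,Embarked\nC85,C\n,S\nC123,\n,Q\n"

raw   <- read.csv(text = csv.text)                           # "" stays an empty string
clean <- read.csv(text = csv.text, na.strings = c("", "NA")) # "" becomes NA

sum(raw$Cabin == "")
colSums(is.na(clean))
```

After this conversion, is.na(), complete.cases() and na.omit() treat the empty fields like any other missing value.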

Replace the missing Age values with the median() of Age

titanic$Age[is.na(titanic$Age)]=median(titanic$Age,na.rm=T)
head(titanic)

##   PassengerId Survived Pclass
## 1           1        0      3
## 2           2        1      1
## 3           3        1      3
## 4           4        1      1
## 5           5        0      3
## 6           6        0      3
##                                                  Name    Sex Age SibSp
## 1                             Braund, Mr. Owen Harris   male  22     1
## 2 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female  38     1
## 3                              Heikkinen, Miss. Laina female  26     0
## 4        Futrelle, Mrs. Jacques Heath (Lily May Peel) female  35     1
## 5                            Allen, Mr. William Henry   male  35     0
## 6                                    Moran, Mr. James   male  28     0
##   Parch           Ticket    Fare Cabin Embarked
## 1     0        A/5 21171  7.2500              S
## 2     0         PC 17599 71.2833   C85        C
## 3     0 STON/O2. 3101282  7.9250              S
## 4     0           113803 53.1000  C123        S
## 5     0           373450  8.0500              S
## 6     0           330877  8.4583              Q

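1.5 Outlier detection

An outlier is a value that is far smaller or far larger than the rest of the data.
Outliers are not always meaningless, so they must be handled with care.
They can be handled by removal, replacement, or separation.

1.5.1 Outlier detection 1 - boxplot(), fivenum()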

autoparts=read.csv("data/autoparts.csv",header = TRUE)
autoparts3=autoparts[autoparts$prod_no=="45231-3B610",-1]
head(autoparts3)

##       fix_time a_speed b_speed separation s_separation rate_terms  mpa
## 18780     82.0   0.736   1.776      180.7        719.1         92 74.7
## 18781     82.0   0.724   1.773      178.6        719.9         92 74.8
## 18782     82.0   0.732   1.752      178.9        719.7         92 74.7
## 18783     82.0   0.715   1.682      179.5        719.7         91 74.8
## 18784     81.6   0.728   1.690      180.7        719.4         93 74.8
## 18785     82.0   0.743   1.771      178.7        720.8         91 74.6
##       load_time highpressure_time c_thickness
## 18780      19.7                70        26.1
## 18781      19.7                72        27.4
## 18782      19.7                68        27.3
## 18783      19.7                63        26.7
## 18784      19.7                64        25.8
## 18785      19.7                67        26.4

A boxplot shows the minimum, median, and maximum together with the quartiles.

boxplot(autoparts3$c_thickness)

Low outlier: a value smaller than Q1 - 1.5*IQR
High outlier: a value larger than Q3 + 1.5*IQR
IQR (interquartile range) = Q3 - Q1
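
The 1.5*IQR rule can be packaged as a small function. This is a sketch with a hypothetical helper iqr.outliers() and made-up data; it uses quantile() for the quartiles, which can differ slightly from the hinges that fivenum() and boxplot() use.

```r
# Return the positions of values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
iqr.outliers <- function(x) {
  q <- unname(quantile(x, c(0.25, 0.75), na.rm = TRUE))
  fence <- 1.5 * (q[2] - q[1])
  which(x < q[1] - fence | x > q[2] + fence)
}

x <- c(10, 11, 11, 12, 12, 13, 50, -20)
idx <- iqr.outliers(x)
x[idx]
```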

data=autoparts3$c_thickness
outlier1=which(data<fivenum(data)[2]-1.5*IQR(data)) # fivenum(data)[2] is the first quartile (lower hinge); IQR(data) is the interquartile range
outlier2=which(data>fivenum(data)[4]+1.5*IQR(data)) # fivenum(data)[4] is the third quartile (upper hinge)
head(autoparts3[outlier1,]) # rows with low outliers

##       fix_time a_speed b_speed separation s_separation rate_terms  mpa
## 19092     82.2   0.739   1.544      184.5        723.3         92 72.5
## 19095     82.1   0.756   1.527      184.2        724.6         92 72.6
## 19195     82.1   0.749   1.568      185.1        724.2         91 72.6
## 19197     82.2   0.734   1.574      184.3        723.7         91 72.6
## 19203     82.2   0.740   1.540      184.6        723.3         91 72.5
## 19347     95.7   0.647   1.582      233.5        679.1         91 72.0
##       load_time highpressure_time c_thickness
## 19092      19.7                70        18.1
## 19095      19.7                78        17.1
## 19195      19.7                68        16.6
## 19197      19.7                71        17.9
## 19203      19.7                68        18.0
## 19347      21.7               101        13.3

head(autoparts3[outlier2,]) # rows with high outliers

##       fix_time a_speed b_speed separation s_separation rate_terms  mpa
## 19084     82.2   0.742   1.470      165.6        726.8         90 26.4
## 19085     82.1   0.498   1.552      184.5        722.9         91 29.8
## 19086     82.0   0.500   1.552      184.5        722.9         91 29.6
## 19087     82.0   0.498   1.552      184.5        722.9         90 29.8
## 19240     82.3   0.747   1.599      165.5        723.1         91 72.7
## 19343     82.1   0.498   1.582      233.5        679.1         90 29.8
##       load_time highpressure_time c_thickness
## 19084      19.6                76        33.5
## 19085      19.6                76        33.5
## 19086      19.7                76        33.5
## 19087      19.6                76        34.1
## 19240      19.7                71        37.3
## 19343      19.7               101        32.5

1.5.2 Exercise 1 - show that boxplot() and the quartile method give the same result

bp=boxplot(autoparts3$c_thickness) # stored as bp so the boxplot() function is not masked

bp$out # outlier values reported by boxplot()

##  [1] 33.5 33.5 33.5 34.1 18.1 17.1 16.6 17.9 18.0 37.3 32.5 13.3 32.3  9.3
## [15] 32.3 33.3 33.3 32.5 33.4 33.6 34.5 36.2 33.6 32.3 34.6 36.5  7.9 32.3
## [29] 32.6 10.3 16.5 17.0  5.9 17.8 36.3 36.9 36.6 37.1 36.6 39.5 38.8 39.0
## [43] 40.6 49.5 48.6 48.9 49.1 48.8 47.5 56.6

a=autoparts3[outlier1,] # rows with low outliers
b=autoparts3[outlier2,] # rows with high outliers
low=a$c_thickness # low outlier values
high=b$c_thickness # high outlier values

e=c(low,high)

sort(e)==sort(bp$out)

##  [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [15] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [29] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [43] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE

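1.5.3 Outlier detection 2 - residuals with rstudent(), outlierTest()

After fitting a regression, observations with a large residual (the difference between the predicted and the actual value) are treated as outliers.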

m=lm(c_thickness~.,data=autoparts3)
head(rstudent(m))

##       18780       18781       18782       18783       18784       18785 
## -0.01545566 -0.29668984  0.13522741 -0.14901718  0.32864459  0.15521404

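plot(rstudent(m),main="studentized residuals")

outlierTest() from the car package searches for outliers automatically.
The smaller the Bonferroni p-value, the more significant the residual.

library(car) # provides outlierTest()
x=outlierTest(m)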
head(x)

## $rstudent
##     21157     20262     19670     21151     20655     21153     21154 
## 14.215108 12.182360 11.297318 10.834554 10.355221 10.226805 10.164636 
##     21155     21152     20193 
## 10.046863 10.018596 -9.002669 
## 
## $p
##        21157        20262        19670        21151        20655 
## 4.557912e-44 3.712424e-33 7.311124e-29 9.897241e-27 1.317014e-24 
##        21153        21154        21155        21152        20193 
## 4.720989e-24 8.714120e-24 2.757341e-23 3.628952e-23 4.414454e-19 
## 
## $bonf.p
##        21157        20262        19670        21151        20655 
## 1.083872e-40 8.828145e-30 1.738585e-25 2.353564e-23 3.131859e-21 
##        21153        21154        21155        21152        20193 
## 1.122651e-20 2.072218e-20 6.556958e-20 8.629648e-20 1.049757e-15 
## 
## $signif
## [1] TRUE
## 
## $cutoff
## [1] 0.05

x$rstudent

##     21157     20262     19670     21151     20655     21153     21154 
## 14.215108 12.182360 11.297318 10.834554 10.355221 10.226805 10.164636 
##     21155     21152     20193 
## 10.046863 10.018596 -9.002669

names(x$rstudent)

##  [1] "21157" "20262" "19670" "21151" "20655" "21153" "21154" "21155"
##  [9] "21152" "20193"

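1.5.4 Outlier detection 3 - cooks.distance()

Cook's distance finds the points that strongly influence the shape of the regression line.
It grows with both leverage (how extreme the explanatory variables are) and the residual.
In the plot, leverage should be small and residuals close to 0, so points clustered at the middle left are good.
Outlier = Cook's distance > 4 / number of observations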
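
To make the link between leverage, residual and Cook's distance concrete, the sketch below recomputes the distance by hand as D = r^2 / p * h / (1 - h), where r is the standardized residual, h the leverage and p the number of coefficients, and compares it with cooks.distance(). It uses the built-in mtcars data and an arbitrary model rather than autoparts.csv.

```r
m <- lm(mpg ~ wt + hp, data = mtcars)

h <- hatvalues(m)      # leverage of each observation
r <- rstandard(m)      # standardized residuals
p <- length(coef(m))   # number of estimated coefficients

manual <- r^2 / p * h / (1 - h)
all.equal(manual, cooks.distance(m))

# Apply the 4/n rule of thumb
influential <- names(manual)[manual > 4 / nrow(mtcars)]
influential
```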

m=lm(c_thickness~.,data=autoparts3)
par(mfrow=c(2,2))
plot(m) # check the bottom-right panel (Residuals vs Leverage)

cooks=cooks.distance(m) # compute Cook's distance
plot(cooks,pch=".",cex=1.5,main="Plot of Cook's Distance")
text(x=1:length(cooks),y=cooks,labels=ifelse(cooks>4/nrow(autoparts3),names(cooks),""),col="black")

influential=names(cooks)[(cooks)>4/nrow(autoparts3)]
head(autoparts3[rownames(autoparts3)%in%influential,]) # %in% tests whether each row name is among the influential ones

##       fix_time a_speed b_speed separation s_separation rate_terms  mpa
## 18867     82.0   0.505   1.789      181.3        718.8         91 30.5
## 19052    109.9   0.491   1.525      172.4        726.0         91 29.7
## 19084     82.2   0.742   1.470      165.6        726.8         90 26.4
## 19085     82.1   0.498   1.552      184.5        722.9         91 29.8
## 19086     82.0   0.500   1.552      184.5        722.9         91 29.6
## 19087     82.0   0.498   1.552      184.5        722.9         90 29.8
##       load_time highpressure_time c_thickness
## 18867       0.0                68        29.8
## 19052      21.7                68        31.5
## 19084      19.6                76        33.5
## 19085      19.6                76        33.5
## 19086      19.7                76        33.5
## 19087      19.6                76        34.1

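1.5.5 Outlier detection 4 - outlier()

outlier() from the outliers package returns the value that differs most from the sample mean.

library(outliers) # provides outlier()
outlier(autoparts3$c_thickness)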

## [1] 56.6

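1.5.6 Outlier detection 5 - lofactor() from the DMwR package

An algorithm that identifies local outliers based on density.
An observation whose density is much lower than that of its neighbours (a large LOF score) is judged an outlier.
Whether a point counts as an outlier therefore depends on how densely its neighbours are packed.

library(DMwR) # provides lofactor()
score=lofactor(autoparts3,k=5) # k is the number of neighbours used to compute the LOF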
plot(score)

top5=order(score,decreasing=TRUE)[1:5]
top5

## [1] 1216 2363 2378  568  461

1.5.7 Exercise - finding outliers in the Titanic data

Draw boxplots of the Titanic data and find the variables that contain outliers.

Plot

# reuses the titanic data frame loaded earlier (Age already imputed with the median)
boxplot(titanic)

Finding Fare outliers

#2 
box=boxplot(titanic$Fare)

length(box$out)

## [1] 116

Extract the outlier rows

#3 
data=titanic$Fare
outlier1=which(data<fivenum(data)[2]-1.5*IQR(data)) 
outlier2=which(data>fivenum(data)[4]+1.5*IQR(data))
NROW(outlier1)

## [1] 0

NROW(outlier2)

## [1] 116

newtitanic=titanic[-outlier2,]
head(newtitanic)

##   PassengerId Survived Pclass                                         Name
## 1           1        0      3                      Braund, Mr. Owen Harris
## 3           3        1      3                       Heikkinen, Miss. Laina
## 4           4        1      1 Futrelle, Mrs. Jacques Heath (Lily May Peel)
## 5           5        0      3                     Allen, Mr. William Henry
## 6           6        0      3                             Moran, Mr. James
## 7           7        0      1                      McCarthy, Mr. Timothy J
##      Sex Age SibSp Parch           Ticket    Fare Cabin Embarked
## 1   male  22     1     0        A/5 21171  7.2500              S
## 3 female  26     0     0 STON/O2. 3101282  7.9250              S
## 4 female  35     1     0           113803 53.1000  C123        S
## 5   male  35     0     0           373450  8.0500              S
## 6   male  28     0     0           330877  8.4583              Q
## 7   male  54     0     0            17463 51.8625   E46        S

Replace with the mean

#4 
titanic2=titanic

fare.mean=mean(titanic2$Fare)
titanic2$Fare[outlier2]=fare.mean
head(titanic2[outlier2,"Fare"])

## [1] 32.20421 32.20421 32.20421 32.20421 32.20421 32.20421
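
Replacing outliers with the mean is one option. Another, shown here purely as an alternative sketch and not as the method used above, is to cap values at the 1.5*IQR fences (sometimes called winsorizing), which keeps the ordering of the data. The helper cap.outliers() and the data are made up for this example.

```r
# Clamp every value into [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
cap.outliers <- function(x) {
  q <- unname(quantile(x, c(0.25, 0.75), na.rm = TRUE))
  fence <- 1.5 * (q[2] - q[1])
  pmin(pmax(x, q[1] - fence), q[2] + fence)
}

x <- c(1, 2, 3, 4, 100)
cap.outliers(x)
```

Here the quartiles are 2 and 4, so the upper fence is 7 and the value 100 is pulled down to it.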

백광렬, Department of Statistics, Konkuk University - 2018 Big Data Young Talent Program

2018-07-11 (Day 8)