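Dreaming of Becoming a Data Miner :: Chapter 4 Ensemble Models - Classification Ensemble Models

Chapter 4 Ensemble Models - Classification Ensemble Models - Bagging

KNOU/2 Data Mining 2016. 11. 3. 13:38

Regression ensemble models - random forest (link)

Classification ensemble models - bagging, boosting, random forest

This time, let's build a classification ensemble model using bagging.

1) Data input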

> setwd('c:/Rwork')

> german=read.table('germandata.txt',header=TRUE,stringsAsFactors=TRUE) # R >= 4.0 no longer converts strings to factors by default

> german$numcredits = factor(german$numcredits)

> german$residence = factor(german$residence)

> german$residpeople = factor(german$residpeople)

> class(german$numcredits);class(german$residpeople);class(german$residence)

[1] "factor"

2) Running bagging

> library(rpart)

> library(adabag)

> my.control <- rpart.control(xval = 0, cp = 0, minsplit = 5, maxdepth = 10)

> bag.german <- bagging(y~.,data = german, mfinal = 50, control = my.control)

> summary(bag.german)

           Length Class   Mode
formula         3 formula call
trees          50 -none-  list
votes        2000 -none-  numeric
prob         2000 -none-  numeric
class        1000 -none-  character
samples     50000 -none-  numeric
importance     20 -none-  numeric
terms           3 terms   call
call            5 -none-  call

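Function notes

bagging(y~., data = german, mfinal = 50, control = my.control) : mfinal is the number of classifiers B that bagging generates. Each of the 50 classifiers is an rpart classification tree grown with the options set in rpart.control(xval = 0, cp = 0, minsplit = 5, maxdepth = 10): a node must contain at least 5 observations before a split is attempted (minsplit), the maximum tree depth is 10 (maxdepth), no cross-validation is run (xval = 0), and no split is rejected by the complexity penalty (cp = 0).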
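To make the roles of mfinal and control more concrete, here is a minimal hand-written sketch of the bagging idea: draw B bootstrap samples, grow one rpart tree on each, and take a majority vote. This is not adabag's actual implementation. The kyphosis data (shipped with rpart), B = 25, and the variable names are my own assumptions, chosen only so the example runs without germandata.txt.

```r
library(rpart)

set.seed(1)
data(kyphosis, package = "rpart")   # small two-class example data (assumption)

B    <- 25                          # number of classifiers, plays the role of mfinal
n    <- nrow(kyphosis)
ctrl <- rpart.control(xval = 0, cp = 0, minsplit = 5, maxdepth = 10)

# Grow one classification tree per bootstrap sample
trees <- lapply(seq_len(B), function(b) {
  idx <- sample(n, n, replace = TRUE)            # bootstrap: n rows drawn with replacement
  rpart(Kyphosis ~ ., data = kyphosis[idx, ], method = "class", control = ctrl)
})

# n x B matrix of predicted labels, one column per tree
votes <- sapply(trees, function(tr) {
  as.character(predict(tr, newdata = kyphosis, type = "class"))
})

# Majority vote across the B trees (ties go to the first label in table order)
pred <- apply(votes, 1, function(v) names(which.max(table(v))))

conf <- table(Predicted = pred, Observed = kyphosis$Kyphosis)
print(conf)
```

The printed table has the same layout as the confusion component used later in this post, although the numbers are for kyphosis, not for the German credit data.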

3) Let's find out which variable is the most important.

> names(bag.german)

[1] "formula" "trees" "votes" "prob" "class" "samples" "importance" "terms"

[9] "call"

> bag.german$importance

        age       check      credit     debtors    duration  employment     foreign     history     housing
  7.1242723  16.4701060  13.2520285   2.0382461  10.3813110   6.4053643   0.1100654   6.7854018   0.9387335
installment         job  numcredits      others    personal    property     purpose   residence residpeople
  1.7740456   1.8354714   0.9098175   2.6632713   3.0909168   3.8218936  11.4089057   3.5700495   0.3570470
    savings   telephone
  6.5166935   0.5463590

Interpretation: check is the most important input variable, and credit is the second most important.

> order(bag.german$importance)

[1] 7 18 20 12 9 10 11 4 13 14 17 15 6 19 8 1 5 16 3 2

> order(bag.german$importance,decreasing=T)

[1] 2 3 16 5 1 8 19 6 15 17 14 13 4 11 10 9 12 20 18 7

> bag.german$importance[2]

check

16.47011

> bag.german$importance[3]

credit

13.25203

> which.max(bag.german$importance)

check

2

> which.min(bag.german$importance)

foreign

7

> bag.german$importance[which.max(bag.german$importance)]

check

16.47011

> bag.german$importance[which.min(bag.german$importance)]

foreign

0.1100654

> importanceplot(bag.german)
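Instead of reading order() indices and looking each one up by position, the named vector can be sorted directly. The sketch below retypes the importance values printed above into a plain vector (the names imp and ranked are mine), so it runs without the fitted model.

```r
# Importance values copied from the bag.german$importance output above
imp <- c(age = 7.1242723, check = 16.4701060, credit = 13.2520285,
         debtors = 2.0382461, duration = 10.3813110, employment = 6.4053643,
         foreign = 0.1100654, history = 6.7854018, housing = 0.9387335,
         installment = 1.7740456, job = 1.8354714, numcredits = 0.9098175,
         others = 2.6632713, personal = 3.0909168, property = 3.8218936,
         purpose = 11.4089057, residence = 3.5700495, residpeople = 0.3570470,
         savings = 6.5166935, telephone = 0.5463590)

# Sort by importance, largest first; names travel with the values
ranked <- sort(imp, decreasing = TRUE)

print(head(ranked, 5))   # the five most important variables
print(tail(ranked, 3))   # the three least important variables
print(sum(imp))          # the printed values add up to roughly 100
```

With the fitted model at hand, sort(bag.german$importance, decreasing = TRUE) gives the same ranking in one call.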

4) Let's obtain the predicted class of the target variable and evaluate how accurate the predictions are.

> pred.bag.german <- predict.bagging(bag.german,newdata=german)

> names(pred.bag.german)

[1] "formula" "votes" "prob" "class" "confusion" "error"

> head(pred.bag.german$prob,10) # proportion of votes for each class

[,1] [,2]

[1,] 0.04 0.96

[2,] 0.92 0.08

[3,] 0.02 0.98

[4,] 0.30 0.70

[5,] 0.72 0.28

[6,] 0.16 0.84

[7,] 0.00 1.00

[8,] 0.16 0.84

[9,] 0.00 1.00

[10,] 0.86 0.14

> head(pred.bag.german$class,10) # final predicted class for each observation

[1] "good" "bad" "good" "good" "bad" "good" "good" "good" "good" "bad"
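The link between the two outputs above is simply "pick the class with the larger vote share". Here is a small sketch using the ten rows printed above. The column labels bad and good are my assumption: the printed matrix has no column names, but this ordering reproduces the classes shown.

```r
# Vote proportions copied from head(pred.bag.german$prob, 10) above
# Column order (bad, good) is an assumption consistent with the printed classes
prob <- matrix(c(0.04, 0.96,
                 0.92, 0.08,
                 0.02, 0.98,
                 0.30, 0.70,
                 0.72, 0.28,
                 0.16, 0.84,
                 0.00, 1.00,
                 0.16, 0.84,
                 0.00, 1.00,
                 0.86, 0.14),
               ncol = 2, byrow = TRUE,
               dimnames = list(NULL, c("bad", "good")))

# For each row, take the label of the column with the largest proportion
pred <- colnames(prob)[max.col(prob, ties.method = "first")]
print(pred)
```

The result matches the ten labels printed by head(pred.bag.german$class, 10).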

> pred.bag.german$confusion # confusion matrix: how closely the predicted classes match the observed classes

               Observed Class
Predicted Class bad good
           bad  268    5
           good  32  695

> addmargins(pred.bag.german$confusion)

               Observed Class
Predicted Class  bad good  Sum
           bad   268    5  273
           good   32  695  727
           Sum   300  700 1000

> sum(pred.bag.german$confusion)

[1] 1000

> diag(pred.bag.german$confusion)

bad good

268 695

> 1-sum(diag(pred.bag.german$confusion)) /sum(pred.bag.german$confusion)

[1] 0.037

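Interpretation: The misclassification rate is 3.7%, far lower than the 19.6% obtained earlier with the single classification tree. Note, however, that this rate is computed on the same data that was used to fit the model, so it is an optimistic estimate; step 6 below evaluates the model on held-out data.

(Go to the classification tree CART model post)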
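The overall rate hides how the errors are split between the two classes. The sketch below retypes the confusion matrix printed above (the object names are mine) and computes both the overall rate and the rate within each observed class.

```r
# Confusion matrix copied from pred.bag.german$confusion above
# Rows = predicted class, columns = observed class (filled column by column)
conf <- matrix(c(268, 32,    # observed bad:  268 predicted bad, 32 predicted good
                 5, 695),    # observed good: 5 predicted bad, 695 predicted good
               nrow = 2,
               dimnames = list(Predicted = c("bad", "good"),
                               Observed  = c("bad", "good")))

# Overall misclassification rate: share of off-diagonal cases
overall <- 1 - sum(diag(conf)) / sum(conf)

# Misclassification rate within each observed class
per_class <- 1 - diag(conf) / colSums(conf)

print(overall)
print(round(per_class, 4))
```

On these numbers, about 10.7% of the observed bad customers are misclassified against about 0.7% of the good ones, so the errors are concentrated in the bad class.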

5) We can also check how many classifiers are appropriate for a classification ensemble.

> evol.german <- errorevol(bag.german,newdata=german)

> plot.errorevol(evol.german)

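Interpretation: The x-axis is the number of classifiers and the y-axis is the misclassification rate. In the plot, the error is relatively stable once the number of classifiers exceeds 40, so for the German credit data an ensemble of at least 40 classifiers is a reasonable choice.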
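The idea behind an error-evolution curve can be sketched by hand: keep a running vote count and recompute the ensemble error after each additional tree. This is my own illustration with rpart, not the code inside errorevol. The kyphosis data, B = 30, and all variable names are assumptions made so the example is self-contained.

```r
library(rpart)

set.seed(1)
data(kyphosis, package = "rpart")   # small two-class example data (assumption)

n <- nrow(kyphosis)
B <- 30                             # ensemble size to examine

# n x B logical matrix: TRUE where tree b predicts "present"
is_present <- sapply(seq_len(B), function(b) {
  idx <- sample(n, n, replace = TRUE)
  fit <- rpart(Kyphosis ~ ., data = kyphosis[idx, ], method = "class",
               control = rpart.control(xval = 0, cp = 0, minsplit = 5))
  predict(fit, newdata = kyphosis, type = "class") == "present"
})

# Running count of "present" votes after 1, 2, ..., B trees (n x B)
cum_votes <- t(apply(is_present, 1, cumsum))

# Misclassification rate of the ensemble made of the first b trees
err <- sapply(seq_len(B), function(b) {
  # strict majority needed for "present"; an exact tie falls to "absent"
  pred <- ifelse(cum_votes[, b] > b / 2, "present", "absent")
  mean(pred != kyphosis$Kyphosis)
})

print(round(err, 3))
```

Calling plot(seq_len(B), err, type = "l") on the result draws a curve of the same kind as the one discussed above, with the number of classifiers on the x-axis and the error on the y-axis.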

6) Splitting into training and test data and evaluating bagging

> set.seed(1234)

> i = sample(1:nrow(german),round(nrow(german)*0.7)) # 70% training data, 30% test data

> german.train = german[i,]

> german.test = german[-i,]

> bag.train.german <- bagging(y~., data = german.train, mfinal = 50, control = my.control)

> bag.train.german$importance

        age       check      credit     debtors    duration  employment     foreign     history
  5.9281898  14.8985632  12.2570321   2.1746067  13.6936647   7.2300214   0.0000000   6.3576993
    housing installment         job  numcredits      others    personal    property     purpose
  1.0506904   2.3547243   1.4585356   0.4620406   1.9891509   3.0157372   3.8454948  12.3522852
  residence residpeople     savings   telephone
  3.7829112   0.4905634   5.8943300   0.7637592

> order(bag.train.german$importance,decreasing=T)

[1] 2 5 16 3 6 8 1 19 15 17 14 10 4 13 11 9 20 18 12 7

> bag.train.german$importance[2]

check

14.89856

> bag.train.german$importance[5]

duration

13.69366

> which.max(bag.train.german$importance)

check

2

> which.min(bag.train.german$importance)

foreign

7

> importanceplot(bag.train.german)

> pred.bag.train.german <- predict.bagging(bag.train.german,newdata=german.test)

> pred.bag.train.german$confusion

               Observed Class
Predicted Class bad good
           bad   44   24
           good  55  177

> addmargins(pred.bag.train.german$confusion)

               Observed Class
Predicted Class bad good Sum
           bad   44   24  68
           good  55  177 232
           Sum   99  201 300

> 1-sum(diag(pred.bag.train.german$confusion)) / sum(pred.bag.train.german$confusion)

[1] 0.2633333

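Interpretation: The misclassification rate on the test data is 26.33%, which is identical to the 26.33% test misclassification rate of the earlier classification tree. Why is it the same? I expected at least a slight improvement. More studying needed.

(Go to the earlier CART classification tree model post)

Source: 데이터마이닝 (Data Mining), co-authored by 장영재, 김현중, 조형준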
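One possible angle on the tie, offered as my own assumption and not something the textbook states: with only 300 test cases, the error estimate itself is fairly noisy. A rough binomial standard error (normal approximation) for the test error above can be sketched as follows.

```r
# Test-set results from the confusion matrix above: 24 + 55 errors out of 300
n_test <- 300
err    <- (24 + 55) / n_test

# Binomial standard error of the estimated misclassification rate
se <- sqrt(err * (1 - err) / n_test)

# Approximate 95% interval (normal approximation, a rough guide only)
ci <- err + c(-1, 1) * 1.96 * se

print(round(err, 4))
print(round(se, 4))
print(round(ci, 4))
```

The standard error comes out at roughly 2.5 percentage points, so on a single 70/30 split two models can easily land on the same test error even if one of them is somewhat better on average. Repeating the split with several seeds would give a steadier comparison.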


Posted by 마르띤
