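Chapter 4 Ensemble Models - Classification Ensemble Models - Boosting

KNOU/2 Data Mining 2016. 11. 7. 13:49

Regression ensembles - random forest (link)

Classification ensembles - bagging, boosting, random forest

Running the boosting method

1) Data input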

> setwd('c:/Rwork')

> german=read.table('germandata.txt',header=T)

> german$numcredits = factor(german$numcredits)

> german$residence = factor(german$residence)

> german$residpeople = factor(german$residpeople)

2) Running the boosting method

> library(rpart)

> library(adabag)

> my.control<-rpart.control(xval=0,cp=0,maxdepth=1)

> boo.german<-boosting(y~.,data=german,boos=T,mfinal=100,control=my.control)

> names(boo.german)

[1] "formula" "trees" "weights" "votes" "prob" "class" "importance"

[8] "terms" "call"

> summary(boo.german)

           Length Class   Mode
formula       3   formula call
trees       100   -none-  list
weights     100   -none-  numeric
votes      2000   -none-  numeric
prob       2000   -none-  numeric
class      1000   -none-  character
importance   20   -none-  numeric
terms         3   terms   call
call          6   -none-  call

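Function notes: with boos=T, each classifier is built on a sample drawn according to the observation weights; with boos=F, each classifier is built on all observations with the weights applied directly. mfinal is the number of classifiers B that boosting generates. The 100 classifiers are rpart classification trees, and maxdepth=1 limits each tree to a single split, so the ensemble consists of 100 stumps.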
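The reweighting idea behind boosting can be sketched in a few lines of base R. This is only a minimal illustration on made-up toy values, assuming an AdaBoost.M1-style update with a Breiman-style coefficient (half the log odds of the weighted error). The names w, miss, err, alpha and w_new are hypothetical, not adabag objects, and adabag's internals may differ in detail.

```r
# One boosting reweighting step on toy values (not the German credit data).
w    <- rep(1 / 5, 5)                        # start with equal observation weights
miss <- c(FALSE, TRUE, FALSE, FALSE, FALSE)  # TRUE where the stump misclassified

err   <- sum(w[miss])                 # weighted error of this stump
alpha <- 0.5 * log((1 - err) / err)   # weight given to this stump in the final vote

w_new <- w * exp(alpha * miss)        # increase the weight of misclassified observations
w_new <- w_new / sum(w_new)           # renormalise so the weights sum to 1

print(round(w_new, 4))
```

After this step the misclassified observation carries twice the weight of each of the others, so the next stump is pushed to classify it correctly.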

2) Running the boosting method - viewing variable importance

> names(boo.german)

[1] "formula" "trees" "weights" "votes" "prob" "class" "importance"

[8] "terms" "call"

> head(boo.german$trees) #the B=100 classification trees can be printed

[[1]]
n= 1000

node), split, n, loss, yval, (yprob)
      * denotes terminal node

1) root 1000 294 good (0.2940000 0.7060000)
  2) check=A11 275 119 bad (0.5672727 0.4327273) *
  3) check=A12,A13,A14 725 138 good (0.1903448 0.8096552) *

… (middle omitted) …

> boo.german$importance

         age        check       credit      debtors     duration   employment      foreign
 5.757338101 40.631489260  5.445313887  0.008302238 15.889422076  2.918286176  0.133462457
     history      housing  installment          job   numcredits       others     personal
12.835057694  0.312497120  1.276787836  0.000000000  0.000000000  1.941629429  1.282855587
    property      purpose    residence  residpeople      savings    telephone
 1.402513542  6.311856846  0.537511661  0.000000000  3.315676089  0.000000000

> order(boo.german$importance,decreasing=T) #the important variables differ slightly from the bagging result

[1] 2 5 8 16 1 3 19 6 13 15 14 10 17 9 7 4 11 12 18 20

> boo.german$importance[2]

check

40.63149

> which.max(boo.german$importance)

check
    2

> boo.german$importance[which.max(boo.german$importance)]

check

40.63149

> boo.german$importance[5]

duration

15.88942

> boo.german$importance[8]

history

12.83506

> importanceplot(boo.german)
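Looking up positions with order() and then indexing one by one works, but sorting the named vector gives the ranking directly. The sketch below uses a hand-typed subset of the importance values printed above, and the names imp and top are hypothetical. With the fitted model, sort(boo.german$importance, decreasing = TRUE) would do the same on all 20 variables.

```r
# Subset of the importance values shown above, typed in by hand.
imp <- c(age = 5.757338101, check = 40.631489260, credit = 5.445313887,
         duration = 15.889422076, history = 12.835057694, purpose = 6.311856846)

top <- sort(imp, decreasing = TRUE)   # sort() keeps the names attached to the values

print(names(top)[1:3])                # the three most important variables
print(round(top[1:3], 2))             # and their importance scores
```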

3) Obtain the predicted classes of the target variable and evaluate their accuracy.

> pred.boo.german<-predict.boosting(boo.german,newdata=german)

> names(pred.boo.german)

[1] "formula" "votes" "prob" "class" "confusion" "error"

> head(pred.boo.german$prob,10)

            [,1]      [,2]
 [1,] 0.28278825 0.7172117
 [2,] 0.56170615 0.4382938
 [3,] 0.17385176 0.8261482
 [4,] 0.50498723 0.4950128
 [5,] 0.52209125 0.4779088
 [6,] 0.35035420 0.6496458
 [7,] 0.18447524 0.8155248
 [8,] 0.44024939 0.5597506
 [9,] 0.08690876 0.9130912
[10,] 0.48660566 0.5133943

> pred.boo.german$confusion

               Observed Class
Predicted Class bad good
           bad  135   61
           good 165  639

> addmargins(pred.boo.german$confusion)

               Observed Class
Predicted Class  bad good  Sum
           bad   135   61  196
           good  165  639  804
           Sum   300  700 1000

> 1-sum(diag(pred.boo.german$confusion)) / sum(pred.boo.german$confusion)

[1] 0.226

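Function notes: head(pred.boo.german$prob,10) prints the first 10 rows of prob, which holds the share of the boosting votes received by each class.

Interpretation: the misclassification rate was 19.6% for the earlier CART classification tree and 3.7% for bagging, while boosting gives 22.6%, somewhat worse than the single CART tree. Note that these rates are computed on the same data used to fit the models (newdata=german).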

(Go to the CART classification tree post) / (Go to the classification ensemble - bagging post)
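The overall misclassification rate hides how differently the two classes are treated. As a small sketch, the confusion matrix printed above can be retyped by hand and broken down per class in base R. The names conf, error_rate, bad_recall and good_recall are hypothetical, chosen for this illustration.

```r
# Confusion matrix from the output above: rows = predicted, columns = observed.
conf <- matrix(c(135, 165, 61, 639), nrow = 2,
               dimnames = list("Predicted Class" = c("bad", "good"),
                               "Observed Class"  = c("bad", "good")))

error_rate  <- 1 - sum(diag(conf)) / sum(conf)            # overall misclassification rate
bad_recall  <- conf["bad", "bad"]   / sum(conf[, "bad"])  # share of observed bad predicted bad
good_recall <- conf["good", "good"] / sum(conf[, "good"]) # share of observed good predicted good

print(c(error = error_rate, bad = bad_recall, good = good_recall))
```

Here fewer than half of the observed bad cases are predicted as bad, which matters if missing a bad credit is the costlier mistake.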

4) We can check how many classifiers are appropriate for the classification ensemble.

> evol.german <- errorevol(boo.german,newdata=german)

> plot.errorevol(evol.german)

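Interpretation: the x-axis is the number of classifiers and the y-axis is the misclassification rate. According to the plot, the results are fairly stable once there are more than 30 classifiers, although the misclassification rate also seems to drop slightly beyond 90. So for the German credit data, a boosting ensemble size of at least 90 is a reasonable choice.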
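Reading the ensemble size off the plot can be complemented by a simple numeric rule. The sketch below uses a made-up toy error curve, not the actual values from this analysis, and picks the smallest ensemble whose error is within a tolerance of the minimum. The names err, tol and smallest_ok are hypothetical. With the fitted objects, the vector evol.german$error would take the place of err.

```r
# Toy error curve standing in for evol.german$error (made-up numbers).
err <- c(0.30, 0.27, 0.25, 0.24, 0.235, 0.236, 0.234, 0.235, 0.233, 0.233)
tol <- 0.005                                   # how close to the minimum counts as good enough

best_size   <- which.min(err)                  # first ensemble size reaching the lowest error
smallest_ok <- which(err <= min(err) + tol)[1] # smallest size within tol of the minimum

print(c(best_size = best_size, smallest_ok = smallest_ok))
```

The tolerance is a judgment call: a larger value favours smaller, cheaper ensembles, while a value of 0 reduces the rule to which.min.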

5) Randomly assign 70% of the data to a training set and the remaining 30% to a test set

#Prepare the data

> set.seed(1234)

> i=sample(1:nrow(german),round(nrow(german)*0.7)) #70% training data, 30% test data

> german.train = german[i,]

> german.test = german[-i,]

#Build the boosting classification ensemble

> my.control<-rpart.control(xval=0,cp=0,maxdepth=1)

> boo.train.german<-boosting(y~.,data=german.train,boos=T,mfinal=100,control=my.control)

#Analyze variable importance

> names(boo.train.german)

[1] "formula" "trees" "weights" "votes" "prob" "class" "importance"

[8] "terms" "call"

> boo.train.german$importance

        age       check      credit     debtors    duration  employment     foreign     history
  3.9363991  32.5300152   4.3733165   0.0000000  16.0942441   6.2892310   0.0000000  13.8030374
    housing installment         job  numcredits      others    personal    property     purpose
  0.3568900   2.0703206   0.0000000   0.6247335   1.1360575   0.7039699   0.0000000  12.6654775
  residence residpeople     savings   telephone
  0.0000000   0.0000000   5.4163077   0.0000000

> order(boo.train.german$importance,decreasing = T)

[1] 2 5 8 16 6 19 3 1 10 13 14 12 9 4 7 11 15 17 18 20

> boo.train.german$importance[2]

check

32.53002

> boo.train.german$importance[5]

duration

16.09424

> boo.train.german$importance[8]

history

13.80304

> importanceplot(boo.train.german)

#Predict

> pred.boo.train.german<-predict.boosting(boo.train.german,newdata=german.test)

> pred.boo.train.german$confusion

               Observed Class
Predicted Class bad good
           bad   42   19
           good  57  182

> addmargins(pred.boo.train.german$confusion)

               Observed Class
Predicted Class bad good Sum
           bad   42   19  61
           good  57  182 239
           Sum   99  201 300

#Compute the misclassification rate

> 1-sum(diag(pred.boo.train.german$confusion)) / sum(pred.boo.train.german$confusion)

[1] 0.2533333

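Interpretation: on the test data the misclassification rate was 26.33% for the earlier CART classification tree and likewise 26.3% for bagging, whereas boosting gives 25.3%, a slight improvement.

(Go to the CART classification tree post) / (Go to the classification ensemble - bagging post)

Source: 데이터마이닝 (Data Mining), by 장영재, 김현중, 조형준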

Posted by 마르띤