Posts tagged 'table'

Chapter 4 - Cross-tabulation analysis: customers with which attributes are leaving?

Python, R Analysis and Programming/Data Analysis Through Business Use Cases: R 2017. 2. 10. 19:24

1. Reading the CSV files

> setwd("C:/파일/temp/r_temp")

> dau<-read.csv('section4-dau.csv',header=T,stringsAsFactors = F)

> head(dau)

log_date app_name user_id

1 2013-08-01 game-01 33754

2 2013-08-01 game-01 28598

3 2013-08-01 game-01 30306

4 2013-08-01 game-01 117

5 2013-08-01 game-01 6605

6 2013-08-01 game-01 346

> user.info<-read.csv('section4-user_info.csv',header=T,stringsAsFactors = F)

> head(user.info)

install_date app_name user_id gender generation device_type

1 2013-04-15 game-01 1 M 40 iOS

2 2013-04-15 game-01 2 M 10 Android

3 2013-04-15 game-01 3 F 40 iOS

4 2013-04-15 game-01 4 M 10 Android

5 2013-04-15 game-01 5 M 40 iOS

6 2013-04-15 game-01 6 M 40 iOS

2. Joining user.info to the DAU data

> dau.user.info<-merge(dau,user.info,by=c('user_id','app_name'))

> head(dau.user.info)

user_id app_name log_date install_date gender generation device_type

1 1 game-01 2013-09-06 2013-04-15 M 40 iOS

2 1 game-01 2013-09-05 2013-04-15 M 40 iOS

3 1 game-01 2013-09-28 2013-04-15 M 40 iOS

4 1 game-01 2013-09-12 2013-04-15 M 40 iOS

5 1 game-01 2013-09-11 2013-04-15 M 40 iOS

6 1 game-01 2013-09-08 2013-04-15 M 40 iOS
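merge() with no extra arguments performs an inner join, so any DAU row whose (user_id, app_name) key is missing from user.info is silently dropped. The sketch below uses made-up toy tables (dau.toy and info.toy are hypothetical names, not the book's data) to show the difference between the default and all.x=TRUE.

```r
# Toy stand-ins for dau and user.info (values are made up for illustration)
dau.toy <- data.frame(log_date = c("2013-08-01", "2013-08-02", "2013-08-01"),
                      app_name = "game-01",
                      user_id  = c(1, 1, 2),
                      stringsAsFactors = FALSE)
info.toy <- data.frame(install_date = c("2013-04-15", "2013-04-16"),
                       app_name = "game-01",
                       user_id  = c(1, 3),
                       gender   = c("M", "F"),
                       stringsAsFactors = FALSE)

# Inner join (the default): only keys present in both tables survive
inner <- merge(dau.toy, info.toy, by = c("user_id", "app_name"))

# Left join: keep every DAU row; attributes of unmatched users become NA
left <- merge(dau.toy, info.toy, by = c("user_id", "app_name"), all.x = TRUE)

nrow(inner)
nrow(left)
```

Comparing nrow(dau) with nrow(dau.user.info) after the real merge is a quick way to check whether any log rows were lost.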

3. Adding a month column

> dau.user.info$month<-substr(dau.user.info$log_date,1,7)

> head(dau.user.info)

user_id app_name log_date install_date gender generation device_type month

1 1 game-01 2013-09-06 2013-04-15 M 40 iOS 2013-09

2 1 game-01 2013-09-05 2013-04-15 M 40 iOS 2013-09

3 1 game-01 2013-09-28 2013-04-15 M 40 iOS 2013-09

4 1 game-01 2013-09-12 2013-04-15 M 40 iOS 2013-09

5 1 game-01 2013-09-11 2013-04-15 M 40 iOS 2013-09

6 1 game-01 2013-09-08 2013-04-15 M 40 iOS 2013-09

> colnames(dau.user.info)[8]<-'log_month'

> head(dau.user.info)

user_id app_name log_date install_date gender generation device_type log_month

1 1 game-01 2013-09-06 2013-04-15 M 40 iOS 2013-09

2 1 game-01 2013-09-05 2013-04-15 M 40 iOS 2013-09

3 1 game-01 2013-09-28 2013-04-15 M 40 iOS 2013-09

4 1 game-01 2013-09-12 2013-04-15 M 40 iOS 2013-09

5 1 game-01 2013-09-11 2013-04-15 M 40 iOS 2013-09

6 1 game-01 2013-09-08 2013-04-15 M 40 iOS 2013-09

4. Aggregating by gender

> table(dau.user.info[,c('log_month','gender')])

gender

log_month F M

2013-08 47343 46842

2013-09 38027 38148

> table(dau.user.info[,c('gender','log_month')])

log_month

gender 2013-08 2013-09

F 47343 38027

M 46842 38148

-> The overall counts fell, but the male/female composition did not change, so gender is not strongly associated with the drop in game usage.
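The "composition did not change" claim can be checked directly by turning the counts into row-wise shares. A minimal sketch, with the counts copied from the table above into a matrix (tab.gender is a name introduced here for illustration):

```r
# Counts copied from the log_month x gender table above
tab.gender <- matrix(c(47343, 46842,
                       38027, 38148),
                     nrow = 2, byrow = TRUE,
                     dimnames = list(log_month = c("2013-08", "2013-09"),
                                     gender = c("F", "M")))

# Row-wise proportions: within each month the shares sum to 1
share <- prop.table(tab.gender, margin = 1)
round(share, 3)
```

Both months come out close to a 50/50 split, which is what the interpretation relies on. On the real data the same call works on the table object itself: prop.table(table(dau.user.info[,c('log_month','gender')]), 1).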

5. Aggregating by age group

> table(dau.user.info[,c('log_month','generation')])

generation

log_month 10 20 30 40 50

2013-08 18785 33671 28072 8828 4829

2013-09 15391 27229 22226 7494 3835

> table(dau.user.info[,c('generation','log_month')])

log_month

generation 2013-08 2013-09

10 18785 15391

20 33671 27229

30 28072 22226

40 8828 7494

50 4829 3835

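-> The overall counts fell, but the composition across age groups did not change, so age group is not strongly associated with the drop in game usage.

6. Segment analysis (aggregating by gender and age group combined)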

> library(reshape2)

> dcast(dau.user.info,log_month~gender+generation,value.var='user_id',length)

log_month F_10 F_20 F_30 F_40 F_50 M_10 M_20 M_30 M_40 M_50

1 2013-08 9091 17181 14217 4597 2257 9694 16490 13855 4231 2572

2 2013-09 7316 13616 11458 3856 1781 8075 13613 10768 3638 2054

-> No segment that drove the overall decline in usage was found.

7. Segment analysis (aggregating by device type)
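The same gender-by-generation cross-tabulation can be produced without reshape2 by combining the two columns into one segment factor. This is a base-R sketch on made-up rows (toy, seg and seg.tab are hypothetical names), not a replacement for the dcast call above.

```r
# Made-up rows with the same three columns used by dcast above
toy <- data.frame(log_month  = c("2013-08", "2013-08", "2013-09", "2013-09", "2013-09"),
                  gender     = c("F", "M", "F", "F", "M"),
                  generation = c(10, 20, 10, 10, 20),
                  stringsAsFactors = FALSE)

# Build labels such as "F_10"; drop = TRUE removes combinations that never occur
seg <- interaction(toy$gender, toy$generation, sep = "_", drop = TRUE)

# One row per month, one column per segment, cells are row counts
seg.tab <- table(log_month = toy$log_month, segment = seg)
seg.tab
```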

> head(dau.user.info)

user_id app_name log_date install_date gender generation device_type log_month

1 1 game-01 2013-09-06 2013-04-15 M 40 iOS 2013-09

2 1 game-01 2013-09-05 2013-04-15 M 40 iOS 2013-09

3 1 game-01 2013-09-28 2013-04-15 M 40 iOS 2013-09

4 1 game-01 2013-09-12 2013-04-15 M 40 iOS 2013-09

5 1 game-01 2013-09-11 2013-04-15 M 40 iOS 2013-09

6 1 game-01 2013-09-08 2013-04-15 M 40 iOS 2013-09

> table(dau.user.info[,c('log_month','device_type')])

device_type

log_month Android iOS

2013-08 46974 47211

2013-09 29647 46528

-> iOS barely changed (47,211 to 46,528), whereas Android dropped sharply (46,974 to 29,647).
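To put a number on "barely changed" versus "dropped sharply", the month-over-month change can be computed per device. A small sketch using the counts from the table above (tab.device and change are names introduced here):

```r
# Counts copied from the log_month x device_type table above
tab.device <- matrix(c(46974, 47211,
                       29647, 46528),
                     nrow = 2, byrow = TRUE,
                     dimnames = list(log_month = c("2013-08", "2013-09"),
                                     device_type = c("Android", "iOS")))

# Relative change from August to September for each device
change <- (tab.device["2013-09", ] - tab.device["2013-08", ]) / tab.device["2013-08", ]
round(100 * change, 1)   # percent change per device
```

Android falls by roughly 37% while iOS falls by roughly 1%.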

8. Visualizing the segment analysis results

# Compute the number of users per date and device type

> library(plyr)

> dau.user.info.device.summary<-ddply(dau.user.info,

.(log_date,device_type),

summarize,

dau=length(user_id))

> head(dau.user.info.device.summary)

log_date device_type dau

1 2013-08-01 Android 1784

2 2013-08-01 iOS 1805

3 2013-08-02 Android 1386

4 2013-08-02 iOS 1451

5 2013-08-03 Android 1295

6 2013-08-03 iOS 1351

# Convert log_date to the Date type

> str(dau.user.info.device.summary)

'data.frame': 122 obs. of 3 variables:

$ log_date : chr "2013-08-01" "2013-08-01" "2013-08-02" "2013-08-02" ...

$ device_type: chr "Android" "iOS" "Android" "iOS" ...

$ dau : int 1784 1805 1386 1451 1295 1351 1283 1314 2002 2038 ...

> dau.user.info.device.summary$log_date<-as.Date(dau.user.info.device.summary$log_date)

> str(dau.user.info.device.summary)

'data.frame': 122 obs. of 3 variables:

$ log_date : Date, format: "2013-08-01" "2013-08-01" "2013-08-02" ...

$ device_type: chr "Android" "iOS" "Android" "iOS" ...

$ dau : int 1784 1805 1386 1451 1295 1351 1283 1314 2002 2038 ...
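For readers without plyr, the same per-date, per-device count can be sketched with base R's aggregate(). The rows below are made up, and log.toy and summary.toy are hypothetical names; the result has the same three columns as dau.user.info.device.summary.

```r
# Made-up log rows (one row per user per day)
log.toy <- data.frame(log_date    = c("2013-08-01", "2013-08-01", "2013-08-01", "2013-08-02"),
                      device_type = c("iOS", "iOS", "Android", "iOS"),
                      user_id     = c(1, 2, 3, 1),
                      stringsAsFactors = FALSE)

# Count rows per (log_date, device_type) group
summary.toy <- aggregate(user_id ~ log_date + device_type, data = log.toy, FUN = length)
names(summary.toy)[3] <- "dau"

# Convert the character dates so that a plot treats the x axis as time
summary.toy$log_date <- as.Date(summary.toy$log_date)
str(summary.toy)
```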

9. Drawing the time-series trend graph

> library(ggplot2)

> library(scales)

> limits<-c(0,max(dau.user.info.device.summary$dau))

> ggplot(dau.user.info.device.summary,aes(x=log_date,y=dau,col=device_type,lty=device_type,shape=device_type))+ geom_line(lwd=1)+geom_point(size=4)+scale_y_continuous(label=comma,limits = limits)

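-> Android users decline rapidly from September onward. We can therefore conclude that the Android side of the system should be checked.

# Understanding how a ggplot graph is built up, layer by layer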

> library(ggplot2)

> library(scales)

> head(dau.user.info.device.summary)

log_date device_type dau

1 2013-08-01 Android 1784

2 2013-08-01 iOS 1805

3 2013-08-02 Android 1386

4 2013-08-02 iOS 1451

5 2013-08-03 Android 1295

6 2013-08-03 iOS 1351

> ggplot(dau.user.info.device.summary)

> ggplot(dau.user.info.device.summary,aes(x=log_date,y=dau,col=device_type,lty=device_type,shape=device_type))

> ggplot(dau.user.info.device.summary,aes(x=log_date,y=dau,col=device_type,lty=device_type,shape=device_type))+geom_line(lwd=1)

> ggplot(dau.user.info.device.summary,aes(x=log_date,y=dau,col=device_type,lty=device_type,shape=device_type))+geom_line(lwd=1)+geom_point(size=4)

> ggplot(dau.user.info.device.summary,aes(x=log_date,y=dau,col=device_type,lty=device_type,shape=device_type))+geom_line(lwd=1)+geom_point(size=4)+scale_y_continuous(label=comma)

> limits<-c(0,max(dau.user.info.device.summary$dau))

> max(dau.user.info.device.summary$dau)

[1] 2133

> ggplot(dau.user.info.device.summary,aes(x=log_date,y=dau,col=device_type,lty=device_type,shape=device_type))+geom_line(lwd=1)+geom_point(size=4)+scale_y_continuous(label=comma,limits = limits)

Source: Data Analysis Through Business Use Cases: R (사카마키 류지, 사토 요헤이)


Posted by 마르띤


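Chapter 4 Ensemble Models - Classification Ensembles - Random Forest

KNOU/2 Data Mining 2016. 11. 9. 10:03

Regression ensemble - random forest (link)

Classification ensembles: bagging, boosting, random forest

1) Loading the data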

> setwd('c:/Rwork')

> german=read.table('germandata.txt',header=T)

> german$numcredits = factor(german$numcredits)

> german$residence = factor(german$residence)

> german$residpeople = factor(german$residpeople)

2) Running the random forest method

> library(randomForest)

> rf.german<-randomForest(y~.,data=german,ntree=100,mtry=5,importance=T,na.action = na.omit)

> summary(rf.german)

Length Class Mode

call 7 -none- call

type 1 -none- character

predicted 1000 factor numeric

err.rate 300 -none- numeric

confusion 6 -none- numeric

votes 2000 matrix numeric

oob.times 1000 -none- numeric

classes 2 -none- character

importance 80 -none- numeric

importanceSD 60 -none- numeric

localImportance 0 -none- NULL

proximity 0 -none- NULL

ntree 1 -none- numeric

mtry 1 -none- numeric

forest 14 -none- list

y 1000 factor numeric

test 0 -none- NULL

inbag 0 -none- NULL

terms 3 terms call

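Function notes:

randomForest(y~.,data=german,ntree=100,mtry=5,importance=T,na.action = na.omit): ntree is the number of classifiers B that the random forest builds; here 100 classifiers are grown as classification trees. mtry is the number of input variables randomly selected at each internal node; five input variables are drawn at each node and the split point is searched among those five only. importance controls whether variable importance is computed, and na.action specifies how missing values are handled. Here variable importance is computed, and na.omit drops every observation that contains a missing value.

2) Running the random forest method: viewing variable importance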

> names(rf.german)

[1] "call" "type" "predicted" "err.rate" "confusion"

[6] "votes" "oob.times" "classes" "importance" "importanceSD"

[11] "localImportance" "proximity" "ntree" "mtry" "forest"

[16] "y" "test" "inbag" "terms"

> head(rf.german$predicted,10) # print the predicted class (out-of-bag prediction) of each observation

1 2 3 4 5 6 7 8 9 10

good bad good bad bad good good good good bad

Levels: bad good

> importance(rf.german)

bad good MeanDecreaseAccuracy MeanDecreaseGini

check 13.14268724 9.63208341 15.1106884 44.838547

duration 3.33563217 8.10482760 8.6281030 41.273512

history 3.87863720 4.50449203 5.8685013 25.731461

purpose 2.67025503 3.09871408 4.1078354 35.651165

credit 2.44312087 4.16670498 4.8800356 52.641745

savings 7.48182326 2.84190645 6.1879918 21.705909

employment 1.91049595 2.70568977 3.2416484 23.910509

installment 0.02100147 3.49542966 3.0522029 15.714499

personal 2.10802207 1.83819513 3.0602966 15.365475

debtors -0.17277289 4.39384503 3.8963219 6.718969

residence 0.74571096 0.90284661 1.1155901 17.932937

property 2.16016716 3.85454658 4.7010080 18.803008

age 2.69637542 4.35170316 5.4753585 41.451103

others 0.31569112 3.60499256 3.3679530 11.399935

housing 2.70314243 2.06074416 3.0737900 9.596539

numcredits -0.24996827 0.95259106 0.6502566 7.944861

job 1.53070574 1.18486660 2.1420488 12.054190

residpeople 0.88657814 -0.43166449 0.1370845 4.234831

telephone 1.46824003 -0.24200291 0.6110284 6.427048

foreign 1.26297478 -0.05431566 0.5733125 1.519832

> importance(rf.german,type=1)

MeanDecreaseAccuracy

check 15.1106884

duration 8.6281030

history 5.8685013

purpose 4.1078354

credit 4.8800356

savings 6.1879918

employment 3.2416484

installment 3.0522029

personal 3.0602966

debtors 3.8963219

residence 1.1155901

property 4.7010080

age 5.4753585

others 3.3679530

housing 3.0737900

numcredits 0.6502566

job 2.1420488

residpeople 0.1370845

telephone 0.6110284

foreign 0.5733125

Interpretation: by MeanDecreaseAccuracy, check is the most important variable and duration is the second most important.

> order(importance(rf.german)[,'MeanDecreaseAccuracy'],decreasing = T)

[1] 1 2 6 3 13 5 12 4 10 14 7 15 9 8 17 11 16 19 20 18

> which.max(importance(rf.german)[,'MeanDecreaseAccuracy'])

check

1

> importance(rf.german)[which.max(importance(rf.german)[,'MeanDecreaseAccuracy']),]

bad good MeanDecreaseAccuracy MeanDecreaseGini

13.142687 9.632083 15.110688 44.838547

> rf.german$importance

bad good MeanDecreaseAccuracy MeanDecreaseGini

check 5.728619e-02 2.488337e-02 3.461467e-02 44.838547

duration 1.262784e-02 2.059686e-02 1.820397e-02 41.273512

history 1.165482e-02 8.234252e-03 9.100115e-03 25.731461

purpose 1.001589e-02 5.877403e-03 7.125436e-03 35.651165

credit 8.957835e-03 9.469900e-03 9.342178e-03 52.641745

savings 1.873739e-02 4.572813e-03 8.795598e-03 21.705909

employment 5.377379e-03 4.391391e-03 4.647218e-03 23.910509

installment 5.083834e-05 4.492258e-03 3.109720e-03 15.714499

personal 4.903039e-03 2.481005e-03 3.188303e-03 15.365475

debtors -1.380717e-04 3.198633e-03 2.202015e-03 6.718969

residence 1.833243e-03 1.342976e-03 1.400876e-03 17.932937

property 6.357328e-03 5.871030e-03 6.069060e-03 18.803008

age 1.026633e-02 8.969565e-03 9.239623e-03 41.451103

others 6.321219e-04 4.940365e-03 3.562374e-03 11.399935

housing 5.086080e-03 2.740150e-03 3.358222e-03 9.596539

numcredits -3.793953e-04 7.653351e-04 4.256327e-04 7.944861

job 2.803065e-03 1.422938e-03 1.912685e-03 12.054190

residpeople 9.171770e-04 -2.801992e-04 7.286163e-05 4.234831

telephone 2.173474e-03 -2.108124e-04 4.743466e-04 6.427048

foreign 5.935867e-04 -1.538455e-05 1.547421e-04 1.519832

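Addendum:

rf.german$importance # holds the raw importance values, whereas importance(rf.german) by default (scale=TRUE) divides the permutation-based columns by their standard errors stored in importanceSD. MeanDecreaseGini is identical in both outputs.

3) Computing class predictions and evaluating accuracy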

> pred.rf.german<-predict(rf.german,newdata=german)

> head(pred.rf.german,10)

1 2 3 4 5 6 7 8 9 10

good bad good good bad good good good good bad

Levels: bad good

> tab=table(german$y,pred.rf.german,dnn=c("Actual","Predicted"))

> tab

Predicted

Actual bad good

bad 300 0

good 0 700

> addmargins(tab)

Predicted

Actual bad good Sum

bad 300 0 300

good 0 700 700

Sum 300 700 1000

> 1-sum(diag(tab)/sum(tab))

[1] 0

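Interpretation: the misclassification rate was 19.6% for the CART classification tree, 3.7% for bagging, and 22.6% for boosting, whereas

the random forest reaches 0%, better than any of the earlier models. Keep in mind that all of these rates are computed on the same data used to fit the models, so they are optimistic; step 5) evaluates the model on held-out data.

(Go to: CART classification tree) / (Go to: classification ensemble, bagging) / (Go to: classification ensemble, boosting)

4) How many classifiers are appropriate?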

> plot(rf.german,'l')

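Interpretation: the x axis is the number of classifiers and the y axis is the OOB misclassification rate. OOB stands for out of bag and refers to the observations that were not included in a bootstrap sample, so the OOB data play a role similar to test data. The lowest line is the misclassification rate for the good class, the highest line is the rate for the bad class, and the middle line is the overall rate. According to the plot, the error looks stable once there are about 80 or more classifiers.

5) Let's evaluate the random forest method by splitting the data into training and test sets.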
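Why a bootstrap sample leaves observations out of bag can be seen with a few lines of base R. This is a sketch of one bootstrap draw only, not of randomForest itself; n, in.bag, oob and oob.frac are names introduced here.

```r
set.seed(1)
n <- 1000                                        # same size as the german data

# One bootstrap sample: n row indices drawn with replacement
in.bag <- sample(n, size = n, replace = TRUE)

# Rows never drawn are "out of bag" for the tree built on this sample
oob <- setdiff(seq_len(n), in.bag)
oob.frac <- length(oob) / n
oob.frac

# Probability that a given row is missed in all n draws; approaches exp(-1)
(1 - 1 / n)^n
```

Roughly a third of the rows are out of bag for each tree, which is why the OOB error can stand in for a test-set error without a separate split.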

> set.seed(1234)

> i=sample(1:nrow(german),round(nrow(german)*0.7)) # 70% training data, 30% test data

> german.train = german[i,]

> german.test = german[-i,]

> rf.train.german<-randomForest(y~.,data=german.train,ntree=100,mtry=5,importance=T,na.action = na.omit)

> pred.rf.train.german<-predict(rf.train.german,newdata=german.test)

> tab.train = table(german.test$y,pred.rf.train.german,dnn=c('Actual','Predicted'))

> tab.train

Predicted

Actual bad good

bad 39 60

good 15 186

> addmargins(tab.train)

Predicted

Actual bad good Sum

bad 39 60 99

good 15 186 201

Sum 54 246 300

> 1-sum(diag(tab.train)) / sum(tab.train) # misclassification rate of 25%, a slight improvement

[1] 0.25

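Interpretation: on the test data the misclassification rate was 26.3% for the CART classification tree, 26.3% for bagging as well, and 25.3% for boosting. The random forest reaches 25%, a slight improvement.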
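The overall 25% hides a large difference between the two classes. A short sketch that recomputes the rates from the test confusion matrix above (tab.test, overall.error and class.error are names introduced here):

```r
# Test-set confusion matrix copied from the output above
tab.test <- matrix(c(39,  60,
                     15, 186),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(Actual = c("bad", "good"),
                                   Predicted = c("bad", "good")))

# Overall misclassification rate: off-diagonal share of all cases
overall.error <- 1 - sum(diag(tab.test)) / sum(tab.test)

# Per-class error: share of each actual class that was predicted wrongly
class.error <- 1 - diag(tab.test) / rowSums(tab.test)

overall.error
round(class.error, 3)
```

About 61% of the actual bad cases (60 of 99) are predicted as good, against about 7% of the good cases, so the model is much weaker on the class that usually matters most in credit scoring.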


Source: Data Mining (장영재, 김현중, 조형준)


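Chapter 4 Analysis of Categorical Data - 4.1 Categorical Data and Contingency Tables

KNOU/2 Health Information Data Analysis 2016. 11. 4. 09:49

Chapter 4 Analysis of Categorical Data

4.1 Categorical data and contingency tables

Practice: building a contingency table with table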

> medi = read.table('c:/Rwork/medication.txt',header=T)

> head(medi,3)

癤퓆o medication surv

1 1 treat y

2 2 treat n

3 3 treat y

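The first column name is garbled. This pattern typically appears when the file starts with a UTF-8 byte-order mark that is read as part of the header.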

> colnames(medi)<-c('no','medication','surv')

> head(medi,3)

no medication surv

1 1 treat y

2 2 treat n

3 3 treat y
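Renaming the columns works, but if the garbling does come from a UTF-8 byte-order mark (an assumption; the raw file is not shown here), the file can also be read cleanly with fileEncoding = "UTF-8-BOM". The sketch writes a small made-up file with a BOM to a temporary path and reads it back; path and medi.toy are hypothetical names.

```r
# Write a tiny space-separated file that starts with a UTF-8 BOM (EF BB BF)
path <- tempfile(fileext = ".txt")
con <- file(path, "wb")
writeBin(as.raw(c(0xEF, 0xBB, 0xBF)), con)
writeBin(charToRaw("no medication surv\n1 treat y\n2 control n\n"), con)
close(con)

# "UTF-8-BOM" tells R to strip the byte-order mark before parsing the header
medi.toy <- read.table(path, header = TRUE, fileEncoding = "UTF-8-BOM")
names(medi.toy)
```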

Building the contingency table; treat = treatment group, control = control group

> attach(medi)

> tab <- table(medication, surv)

> colnames(tab) = c('die','survival')

> rownames(tab) = c('trt','ctr')

> tab

surv

medication die survival

trt 1 4

ctr 3 2

> addmargins(tab)

surv

medication die survival Sum

trt 1 4 5

ctr 3 2 5

Sum 4 6 10

> tab/sum(tab)

surv

medication die survival

trt 0.1 0.4

ctr 0.3 0.2

> addmargins(tab/sum(tab))

surv

medication die survival Sum

trt 0.1 0.4 0.5

ctr 0.3 0.2 0.5

Sum 0.4 0.6 1.0
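One caution about renaming by position: table() orders character values alphabetically, so if the codes in the file are "control" and "treat" (an assumption; only "treat" is visible in the head() output), the first row is the control group. A safer pattern is to fix the order and labels in the factor itself. The sketch below uses made-up rows, and it assumes 'n' means died and 'y' means survived, as the original column labels suggest.

```r
# Made-up rows in the same coding as medication.txt
medi.toy <- data.frame(medication = c("treat", "treat", "control", "control", "treat"),
                       surv       = c("y", "n", "n", "y", "y"),
                       stringsAsFactors = FALSE)

# Default ordering is alphabetical: "control" comes before "treat"
default.rows <- rownames(table(medi.toy$medication, medi.toy$surv))
default.rows

# Tie each label to its code explicitly instead of relying on position
medi.toy$medication <- factor(medi.toy$medication,
                              levels = c("treat", "control"), labels = c("trt", "ctr"))
medi.toy$surv <- factor(medi.toy$surv,
                        levels = c("n", "y"), labels = c("die", "survival"))

tab.toy <- table(medi.toy$medication, medi.toy$surv, dnn = c("medication", "surv"))
tab.toy

# Row-wise proportions give the survival rate within each group
prop.table(tab.toy, margin = 1)
```

Dividing by sum(tab) as above gives shares of the whole sample; margin = 1 answers the more usual question of how survival differs between the two groups.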

Source: Health Information Data Analysis (이태림)


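Chapter 3 Tree Models - Classification Trees

KNOU/2 Data Mining 2016. 10. 18. 10:03

A decision tree whose target variable is categorical (group membership) -> classification tree

A decision tree whose target variable is continuous -> regression tree

Example) Build a decision tree with the CART method on the German credit data, whose target variable is categorical (good, bad)

1) Loading the data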

> setwd('c:/Rwork')

> german<-read.table('germandata.txt',header=T)

> str(german)

'data.frame': 1000 obs. of 21 variables:

$ check : Factor w/ 4 levels "A11","A12","A13",..: 1 2 4 1 1 4 4 2 4 2 ...

$ duration : int 6 48 12 42 24 36 24 36 12 30 ...

$ history : Factor w/ 5 levels "A30","A31","A32",..: 5 3 5 3 4 3 3 3 3 5 ...

$ purpose : Factor w/ 10 levels "A40","A41","A410",..: 5 5 8 4 1 8 4 2 5 1 ...

$ credit : int 1169 5951 2096 7882 4870 9055 2835 6948 3059 5234 ...

$ savings : Factor w/ 5 levels "A61","A62","A63",..: 5 1 1 1 1 5 3 1 4 1 ...

$ employment : Factor w/ 5 levels "A71","A72","A73",..: 5 3 4 4 3 3 5 3 4 1 ...

$ installment: int 4 2 2 2 3 2 3 2 2 4 ...

$ personal : Factor w/ 4 levels "A91","A92","A93",..: 3 2 3 3 3 3 3 3 1 4 ...

$ debtors : Factor w/ 3 levels "A101","A102",..: 1 1 1 3 1 1 1 1 1 1 ...

$ residence : int 4 2 3 4 4 4 4 2 4 2 ...

$ property : Factor w/ 4 levels "A121","A122",..: 1 1 1 2 4 4 2 3 1 3 ...

$ age : int 67 22 49 45 53 35 53 35 61 28 ...

$ others : Factor w/ 3 levels "A141","A142",..: 3 3 3 3 3 3 3 3 3 3 ...

$ housing : Factor w/ 3 levels "A151","A152",..: 2 2 2 3 3 3 2 1 2 2 ...

$ numcredits : int 2 1 1 1 2 1 1 1 1 2 ...

$ job : Factor w/ 4 levels "A171","A172",..: 3 3 2 3 3 2 3 4 2 4 ...

$ residpeople: int 1 1 2 2 2 2 1 1 1 1 ...

$ telephone : Factor w/ 2 levels "A191","A192": 2 1 1 1 1 2 1 2 1 1 ...

$ foreign : Factor w/ 2 levels "A201","A202": 1 1 1 1 1 1 1 1 1 1 ...

$ y : Factor w/ 2 levels "bad","good": 2 1 2 2 1 2 2 2 2 1 ...

> german$numcredits<-factor(german$numcredits)

> german$residence<-factor(german$residence)

> german$residpeople<-factor(german$residpeople)

> class(german$numcredits);class(german$residence);class(german$residpeople)

[1] "factor"

2) Applying the CART method

> library(rpart)

> my.control<-rpart.control(xval=10,cp=0,minsplit=5)

> fit.german<-rpart(y~.,data=german,method='class',control=my.control)

> fit.german # the initial tree: the full-size tree before any pruning

n= 1000

node), split, n, loss, yval, (yprob)

* denotes terminal node

1) root 1000 300 good (0.300000000 0.700000000)

2) check=A11,A12 543 240 good (0.441988950 0.558011050)

.

. (middle part omitted because the tree is too large)

.

253) credit>=1273 10 0 good (0.000000000 1.000000000) *

127) check=A14 122 1 good (0.008196721 0.991803279) *

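Function notes

1. rpart.control:

- xval=10: the number of folds for cross-validation. The default is 10.

- cp=0: the complexity parameter. A split that does not improve the fit by at least a factor of cp is not attempted, and tree growth stops there. With cp=0 the tree is grown until the training misclassification is as small as possible. The default is 0.01.

- minsplit=5: the minimum number of observations a node must contain for a split to be attempted. Nodes with fewer observations are not split. The default is 20.

2. rpart

- method='class': specifies the type of tree. anova is a regression tree, poisson a Poisson regression tree, class a classification tree, and exp a survival tree. If method is omitted, rpart chooses based on the type of the target variable (class when y is a factor).

- na.action=na.rpart: observations whose target variable is missing are deleted. Observations with missing input variables are kept.

Interpretation

The minimum number of observations needed to split an internal node was set to 5, and cp was set to 0 so that splitting continues until the training misclassification of the tree is minimal. In addition, 10-fold cross-validation was run so that the best cp value can be found. The middle part of the output is omitted because the tree is very large; to make the model easier to analyze, let's prune it.

3) Pruning to shrink the large tree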

> printcp(fit.german)

Classification tree:

rpart(formula = y ~ ., data = german, method = "class", control = my.control)

Variables actually used in tree construction:

[1] age check credit debtors duration employment history

[8] housing installment job numcredits others personal property

[15] purpose residence savings

Root node error: 300/1000 = 0.3

n= 1000

CP nsplit rel error xerror xstd

1 0.0516667 0 1.00000 1.00000 0.048305

2 0.0466667 3 0.84000 0.94667 0.047533

3 0.0183333 4 0.79333 0.86333 0.046178

4 0.0166667 6 0.75667 0.87000 0.046294

5 0.0155556 8 0.72333 0.88667 0.046577

6 0.0116667 11 0.67667 0.88000 0.046464

7 0.0100000 13 0.65333 0.85667 0.046062

8 0.0083333 16 0.62333 0.87000 0.046294

9 0.0066667 18 0.60667 0.87333 0.046351

10 0.0060000 38 0.44333 0.92000 0.047120

11 0.0050000 43 0.41333 0.91000 0.046960

12 0.0044444 55 0.35333 0.92000 0.047120

13 0.0033333 59 0.33333 0.92000 0.047120

14 0.0029167 83 0.25000 0.97000 0.047879

15 0.0022222 93 0.22000 0.97667 0.047976

16 0.0016667 96 0.21333 0.97667 0.047976

17 0.0000000 104 0.20000 1.01333 0.048486

Interpretation

The 10-fold cross-validated error (xerror) is smallest at 0.85667, and the corresponding cp value is 0.01. This is the tree with 13 splits (nsplit=13).

The minimum cross-validated error (xerror) can also be located as follows.

> names(fit.german)

[1] "frame" "where" "call" "terms"

[5] "cptable" "method" "parms" "control"

[9] "functions" "numresp" "splits" "csplit"

[13] "variable.importance" "y" "ordered"

> fit.german$cptable[,'xerror']

1 2 3 4 5 6 7 8 9

1.0000000 0.9466667 0.8633333 0.8700000 0.8866667 0.8800000 0.8566667 0.8700000 0.8733333

10 11 12 13 14 15 16 17

0.9200000 0.9100000 0.9200000 0.9200000 0.9700000 0.9766667 0.9766667 1.0133333

> which.min(fit.german$cptable[,'xerror'])

7

> fit.german$cptable[7,]

CP nsplit rel error xerror xstd

0.01000000 13.00000000 0.65333333 0.85666667 0.04606167

> fit.german$cptable[7,'CP']

[1] 0.01

> fit.german$cptable[which.min(fit.german$cptable[,'xerror']),'CP']

[1] 0.01

> min.cp<-fit.german$cptable[which.min(fit.german$cptable[,'xerror']),'CP']

> min.cp

[1] 0.01

> fit.prune.german<-prune(fit.german,cp=min.cp)

4) We have found the cp value with the smallest cross-validated error (=0.01), so let's prune using this value.

> fit.prune.german<-prune(fit.german,cp=0.01)

> fit.prune.german

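Interpretation

Reading the first result against the header node), split, n, loss, yval, (yprob):

node number, split rule, number of observations (n), number misclassified (loss), predicted class (yval), and class proportions (yprob, in the order bad, good)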

16) duration>=47.5 36 5 bad (0.8611111 0.1388889) *

When duration is 47.5 or more, all 36 observations (n) are classified as bad (yval), and 5 of them (loss) are actually good. The proportion of bad is therefore 31/36 = 0.8611111, and the 5 losses correspond to 5/36 = 0.1388889. In the plot below this node is labeled bad 31/5.

376) property=A123,A124 20 3 bad (0.8500000 0.1500000) *

377) property=A121,A122 45 14 good (0.3111111 0.6888889) *

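When property is A123 (car) or A124 (unknown / no property), all 20 observations are classified as bad, with a loss of 3, that is, 3 observations that are actually good (3/20 = 0.15). In the plot below this node is labeled bad 17/3.

When property is A121 (real estate) or A122 (building society savings agreement), all 45 observations are classified as good, with a loss of 14, that is, 14 observations that are actually bad (14/45 = 0.3111111). In the plot below this node is labeled good 14/31.

<< 2017-06-18 (Sun) >> Added to the interpretation

When duration >= 22.5, there are 237 customers in total; 56.5% of them have bad credit and 43.5% (103 customers) have good credit. The duration >= 22.5 group is therefore classified as bad.

The pruned model can be drawn with the functions below.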

> plot(fit.prune.german,uniform = T,compress=T,margin=0.1)

> text(fit.prune.german,use.n=T,col='blue',cex=0.7)

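The split 'purpose=acdeghj' at the bottom of the left branch refers to the levels of the purpose variable by their alphabetical position: the 1st (=a), 3rd (=c), 4th (=d), 5th (=e), 7th (=g), 8th (=h), and 10th (=j) levels. From fit.prune.german these are A40, A410, A42, A43, A45, A46, and A49.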

34) purpose=A40,A410,A42,A43,A45,A46,A49 137 52 bad (0.6204380 0.3795620) *
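The letter codes in the plot can be decoded mechanically. The sketch below assumes the ten purpose levels are ordered as listed (the first three match the str() output above; the remaining seven follow the usual German credit coding and are an assumption). On the real data, levels(german$purpose) gives the list directly.

```r
# Assumed level order of german$purpose (use levels(german$purpose) on the real data)
purpose.levels <- c("A40", "A41", "A410", "A42", "A43", "A44", "A45", "A46", "A48", "A49")

# Split the label printed by text.rpart into single letters
shown <- strsplit("acdeghj", "")[[1]]

# 'a' is the 1st level, 'b' the 2nd, and so on
idx <- match(shown, letters)
decoded <- purpose.levels[idx]
decoded
```

The result agrees with the split printed for node 34.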

<< 2017-06-18 (Sun) >> Added to the interpretation

At the far right, where duration >= 11.5 does not hold, the bad / good counts are 9 / . 4, and the node is classified as good.

5) Let's shrink the tree further.

> printcp(fit.german)

Classification tree:

rpart(formula = y ~ ., data = german, method = "class", control = my.control)

Variables actually used in tree construction:

[1] age check credit debtors duration employment history

[8] housing installment job numcredits others personal property

[15] purpose residence savings

Root node error: 300/1000 = 0.3

n= 1000

CP nsplit rel error xerror xstd

1 0.0516667 0 1.00000 1.00000 0.048305

2 0.0466667 3 0.84000 0.94667 0.047533

3 0.0183333 4 0.79333 0.86333 0.046178

4 0.0166667 6 0.75667 0.87000 0.046294

5 0.0155556 8 0.72333 0.88667 0.046577

6 0.0116667 11 0.67667 0.88000 0.046464

7 0.0100000 13 0.65333 0.85667 0.046062

8 0.0083333 16 0.62333 0.87000 0.046294

9 0.0066667 18 0.60667 0.87333 0.046351

10 0.0060000 38 0.44333 0.92000 0.047120

11 0.0050000 43 0.41333 0.91000 0.046960

12 0.0044444 55 0.35333 0.92000 0.047120

13 0.0033333 59 0.33333 0.92000 0.047120

14 0.0029167 83 0.25000 0.97000 0.047879

15 0.0022222 93 0.22000 0.97667 0.047976

16 0.0016667 96 0.21333 0.97667 0.047976

17 0.0000000 104 0.20000 1.01333 0.048486

The tree at row 5, with 8 splits (nsplit=8), has a cross-validated error of 0.88667. This is not the minimum, but it is not far from the minimum of 0.85667 at row 7 (13 splits), and it lies within one standard error of that minimum (0.88667 < 0.85667 + 0.046062). In such cases the smaller tree at row 5 with 8 splits is sometimes chosen.

Prune again using 0.016, the rounded value of cp 0.0155556 at row 5 (nsplit=8).

> fit.prune.german<-prune(fit.german,cp=0.016)

> fit.prune.german

> plot(fit.prune.german,uniform=T,compress=T,margin=0.1)

> text(fit.prune.german,use.n=T,col='blue',cex=0.7)
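The one-standard-error comparison described above can be applied to every row of the cp table at once. The sketch copies the first nine rows of the printcp output into a data frame (cp.tab and the other names are introduced here) and lists all trees whose xerror falls within one xstd of the minimum.

```r
# First nine rows of the printcp(fit.german) output above
cp.tab <- data.frame(
  CP     = c(0.0516667, 0.0466667, 0.0183333, 0.0166667, 0.0155556,
             0.0116667, 0.0100000, 0.0083333, 0.0066667),
  nsplit = c(0, 3, 4, 6, 8, 11, 13, 16, 18),
  xerror = c(1.00000, 0.94667, 0.86333, 0.87000, 0.88667,
             0.88000, 0.85667, 0.87000, 0.87333),
  xstd   = c(0.048305, 0.047533, 0.046178, 0.046294, 0.046577,
             0.046464, 0.046062, 0.046294, 0.046351))

best <- which.min(cp.tab$xerror)                        # row with minimum xerror
threshold <- cp.tab$xerror[best] + cp.tab$xstd[best]    # minimum + 1 standard error

candidates <- which(cp.tab$xerror <= threshold)         # rows inside the band
simplest <- min(candidates)                             # fewest splits among them

cp.tab[candidates, ]
cp.tab[simplest, ]
```

Row 5 is inside the band, as stated. Row 3 (nsplit = 4) is inside it as well and is smaller still, so a strict reading of the rule would pick that tree; which one to keep is a judgment call. On a fitted model the same logic can run on fit.german$cptable.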

6) Let's compute class predictions for the target variable and evaluate their accuracy.

> fit.prune.german<-prune(fit.german,cp=0.01)

> pred.german=predict(fit.prune.german,newdata=german,type='class')

> tab=table(german$y,pred.german,dnn=c('Actual','Predicted'))

> tab

Predicted

Actual bad good

bad 180 120

good 76 624

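Function notes

predict(fit.prune.german,newdata=german,type='class'): type='class' returns the predicted class for a classification tree. For a regression tree, use type='vector'.

Interpretation

624 observations are actually good and predicted as good; 180 are actually bad and predicted as bad.

The misclassification rate is therefore {1000 - (624+180)} / 1000 = 19.6%.

In R code this is 1-sum(diag(tab)) / sum(tab).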
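Since the same formula is reused for every model in these posts, it can be wrapped in a small helper. misclass.rate is a hypothetical name introduced here, and the matrix simply re-enters the confusion table printed above.

```r
# Hypothetical helper: share of cases off the diagonal of a square confusion table
misclass.rate <- function(tab) {
  stopifnot(nrow(tab) == ncol(tab))
  1 - sum(diag(tab)) / sum(tab)
}

# Confusion table of the pruned CART model on the full data (from the output above)
tab.cart <- matrix(c(180, 120,
                      76, 624),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(Actual = c("bad", "good"),
                                   Predicted = c("bad", "good")))

rate <- misclass.rate(tab.cart)
rate
```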

7) Finally, let's split the German credit data into training and test sets and evaluate the classification tree.

> set.seed(1234)

> i=sample(1:nrow(german),round(nrow(german)*0.7)) # 70% training data, 30% test data

> german.train=german[i,]

> german.test=german[-i,]

> fit.german<-rpart(y~.,data=german.train,method='class',control=my.control)

> printcp(fit.german)

Classification tree:

rpart(formula = y ~ ., data = german.train, method = "class",

control = my.control)

Variables actually used in tree construction:

[1] age check credit debtors duration employment history

[8] housing installment job numcredits others personal property

[15] purpose residence savings telephone

Root node error: 201/700 = 0.28714

n= 700

CP nsplit rel error xerror xstd

1 0.05721393 0 1.00000 1.00000 0.059553

2 0.03482587 2 0.88557 1.00498 0.059641

3 0.02985075 5 0.78109 1.00000 0.059553

4 0.01990050 6 0.75124 0.95025 0.058631

5 0.01741294 8 0.71144 0.96020 0.058822

6 0.01492537 10 0.67662 1.00000 0.059553

7 0.01243781 14 0.61692 1.00000 0.059553

8 0.00995025 17 0.57711 1.00995 0.059728

9 0.00746269 35 0.39303 1.03980 0.060238

10 0.00621891 46 0.30846 1.06965 0.060722

11 0.00497512 50 0.28358 1.04975 0.060402

12 0.00331675 58 0.24378 1.09950 0.061181

13 0.00248756 61 0.23383 1.11940 0.061474

14 0.00124378 69 0.21393 1.14925 0.061894

15 0.00099502 73 0.20896 1.14925 0.061894

16 0.00000000 78 0.20398 1.14925 0.061894

> fit.prune.german<-prune(fit.german,cp=0.02)

> fit.prune.german

> p.german.test=predict(fit.prune.german,newdata=german.test,type='class')

> tab=table(german.test$y,p.german.test,dnn=c('Actual','Predicted'))

> tab

Predicted

Actual bad good

bad 34 65

good 14 187

> 1-sum(diag(tab))/sum(tab) # misclassification rate

[1] 0.2633333

Source: Data Mining (장영재, 김현중, 조형준, KNOU Press)


데이터마이너를 꿈꾸며 (Dreaming of Becoming a Data Miner)
