'AIC' 태그의 글 목록

'AIC'에 해당되는 글 2건

2017.01.26 제6.4장 모수적 방법
2016.09.14 제2장 회귀모형 - 로지스틱 회귀모형 연습

제6.4장 모수적 방법

KNOU/2 보건 정보 데이터 분석 2017. 1. 26. 09:38

6.4 모수적 방법을 이용한 생존함수의 추정과 비교

공학(시멘트의 양, 유리의 버티는 힘), 경영(고객 수), 교통(소방차 수) 모두 모수적 방법을 이용.

분포를 이루기 때문에 많은 분야에서 사용된다.

1. 데이터 입력

> setwd('c:/Rwork')

> lung=read.table('lung.txt',header=T)

> head(lung)

癤퓍ime status x1 x2 x3 trt type

1 411 1 70 64 5 1 1

2 126 1 60 63 9 1 1

3 118 1 70 65 11 1 1

4 82 1 40 69 10 1 1

5 8 1 40 63 58 1 1

6 25 0 70 48 9 1 1

> colnames(lung)<-c('time','status','x1','x2','x3','trt','type')

> head(lung)

time status x1 x2 x3 trt type

1 411 1 70 64 5 1 1

2 126 1 60 63 9 1 1

3 118 1 70 65 11 1 1

4 82 1 40 69 10 1 1

5 8 1 40 63 58 1 1

6 25 0 70 48 9 1 1

> attach(lung)

2. 모수적 모형에 근거한 생존시간 분석

> library(survival)

> weibull = survreg(Surv(time,status)~x1+x2+x3+factor(trt)+factor(type),data=lung,dist='weibull') #공변량factor(trt), 처리효과factor(type), 와이블 분포weibull, gaussian(정규분포), logistic(로지스틱)eh rksmd.

> summary(weibull)

Call:

survreg(formula = Surv(time, status) ~ x1 + x2 + x3 + factor(trt) +

factor(type), data = lung, dist = "weibull")

Value Std. Error z p

(Intercept) 1.06044 1.35959 0.780 4.35e-01

x1 0.05420 0.00954 5.680 1.35e-08

x2 0.01168 0.01918 0.609 5.42e-01

x3 0.00379 0.01051 0.361 7.18e-01

factor(trt)2 0.28871 0.36899 0.782 4.34e-01

factor(type)2 -0.49964 0.45323 -1.102 2.70e-01

factor(type)3 -1.25968 0.49732 -2.533 1.13e-02

factor(type)4 -0.40243 0.38726 -1.039 2.99e-01

Log(scale) -0.13615 0.13146 -1.036 3.00e-01

Scale= 0.873

Weibull distribution

Loglik(model)= -203.4 Loglik(intercept only)= -219.7

Chisq= 32.59 on 7 degrees of freedom, p= 3.2e-05

Number of Newton-Raphson Iterations: 6

n= 40

> weibull

Call:

survreg(formula = Surv(time, status) ~ x1 + x2 + x3 + factor(trt) +

factor(type), data = lung, dist = "weibull")

Coefficients:

(Intercept) x1 x2 x3 factor(trt)2

1.060436846 0.054195931 0.011681287 0.003792838 0.288708242

factor(type)2 factor(type)3 factor(type)4

-0.499640589 -1.259681146 -0.402431957

Scale= 0.8727099

Loglik(model)= -203.4 Loglik(intercept only)= -219.7

Chisq= 32.59 on 7 degrees of freedom, p= 3.2e-05

n= 40

logT = 1.060 + 0.054x1 + 0.012x2 + 0.004x3 + 0.289x4 - 0.500x5 -1.260x6 - 0.402x7 +0.873e

factor(trt)2 0.28871 0.36899 0.782 4.34e-01

-> x4는 처리그룹을 나타내는 가변수(dummy variable)로서 처리가 standard일 때 1, test 일 때 2의 값을 갖는다.

p-value가 0.434로 유의하지 않으므로 두 처리(standard, test)는 생존시간의 차이를 보이지 않는다.

type은 squamous,small,adeno,large, x5,6,7은 종양의 유형을 가르키는 가변수로서

x5 = 1, if 'small', = 0 o.w.

x6 = 1, if 'adeno', = 0 o.w.

x7 = 1, if 'large', = 0 o.w.

따라서 종양이 squamous인 경우에는 x5=x6=x7 = 0 된다.

factor(type)3 -1.25968 0.49732 -2.533 1.13e-02

-> adeno의 p-value만 0.0113으로 0.05보다 적으므로 유의, squamous와 차이를 보인다고 할 수 있다. adeno의 회귀계수가 -1.25968로 이 유형을 가진 사람들은 상대적으로 생존시간이 짧다는 것을 알 수 있다. type2 small, type4 large의 경우 p-value가 각각 0.270, 0,299로 유의하지 않다, 즉 squamous와 차이를 보이지 않는다.

Loglik(model)= -203.4 Loglik(intercept only)= -219.7

Chisq= 32.59 on 7 degrees of freedom, p= 3.2e-05

2(logL-logL0) = 32.59 > 14.067 모든 공변량의 회귀계수가 0이라는 귀무가설을 아주 강하게 기각한다.

통상적인 선형모형에서 모형에 대한 F-검정을 하는 것과 같이 여기서도 모든 공변량의 회귀계수가 0이라는 가설에 대하 우도비 검정(likelihood-ratio ttest)를 할 수 있다. 우리가 고려한 모형에서의 로그-우도(log-likelihood)를 logL, 공변량이 전혀 없는 귀무모형에서의 로그-우도를 logL0라고 했을 때 2(logL-logL0)이 귀무가설하에서 근사적으로 x2-분포를 따르게 되므로 이를 이용하여 검정할 수 있다.

이때 자유도는 귀무모형에서 제외되는 공변량의 수와 같게 되며, 위에서는 7이 된다.

LogL: 고려한 모형에서의 로그-우도 Log Likelihood for WEIBULL -203.4는 이 모형에 대한 로그-우도이다.

LogL0: 공변량이 전혀 없는 귀무모형에 대한 로그-우도를 구해보면 Log Likelihood for WEIULL = =219.7을

따라서 Chisq= 32.59는 아래와 같은 계산을 통해 얻을 수 있으며, 이 값이 x2-분포의 임계치인 x2 0.95(7) = 14.067보다 크므로

5%유의수준하에서 모든 공변량의 회귀계수가 0이라는 귀무사설을 기각하게 된다.

> 2*(-203.4-(-219.7))

[1] 32.6

3. 모수적 모형의 적합도 검토

고려한 모형이 타당한가를 검토하는 방법 중 하나는 로그-우도 비교. R을 이용하여 각 모형에 대하 로그-우도를 출력한 다음 이들을 비교하여 절대값이 가장 큰 모형을 택할 수 있다. 또는 AIC(Akaike information criterion) 값을 비교하여 이 값이 더 작은 것을 선택할 수 있다.

> library(flexsurv) #flexsurv library: Flexible parametric survival models

Warning message:

package ‘flexsurv’ was built under R version 3.2.5

> gengamma=flexsurvreg(formula=Surv(time,status)~x1+x2+x3+factor(trt)+factor(type),data=lung,dist='gengamma')

> gengamma

Call:

flexsurvreg(formula = Surv(time, status) ~ x1 + x2 + x3 + factor(trt) +

factor(type), data = lung, dist = "gengamma")

Estimates:

data mean est L95% U95% se exp(est)

mu NA 1.16226 -1.69652 4.02104 1.45859 NA

sigma NA 0.82367 0.47903 1.41628 0.22778 NA

Q NA 1.19119 -0.35449 2.73687 0.78863 NA

x1 56.50000 0.05426 0.03598 0.07254 0.00933 1.05576

x2 56.57500 0.01210 -0.02543 0.04963 0.01915 1.01217

x3 15.65000 0.00494 -0.01741 0.02729 0.01141 1.00495

factor(trt)2 0.50000 0.27185 -0.47815 1.02186 0.38266 1.31239

factor(type)2 0.27500 -0.57430 -1.63425 0.48566 0.54080 0.56310

factor(type)3 0.12500 -1.36101 -2.57769 -0.14434 0.62076 0.25640

factor(type)4 0.25000 -0.48503 -1.45929 0.48923 0.49708 0.61568

L95% U95%

mu NA NA

sigma NA NA

Q NA NA

x1 1.03664 1.07524

x2 0.97489 1.05088

x3 0.98274 1.02767

factor(trt)2 0.61993 2.77835

factor(type)2 0.19510 1.62524

factor(type)3 0.07595 0.86560

factor(type)4 0.23240 1.63107

N = 40, Events: 37, Censored: 3

Total time at risk: 5784

Log-likelihood = -203.4059, df = 10

AIC = 426.8117

> weibull=flexsurvreg(formula=Surv(time,status)~x1+x2+x3+factor(trt)+factor(type),data=lung,dist='weibull')

> weibull

Call:

flexsurvreg(formula = Surv(time, status) ~ x1 + x2 + x3 + factor(trt) +

factor(type), data = lung, dist = "weibull")

Estimates:

data mean est L95% U95% se exp(est)

shape NA 1.14586 0.88555 1.48269 0.15066 NA

scale NA 2.88763 0.20141 41.39933 3.92317 NA

x1 56.50000 0.05420 0.03551 0.07288 0.00953 1.05569

x2 56.57500 0.01168 -0.02587 0.04924 0.01916 1.01175

x3 15.65000 0.00379 -0.01679 0.02438 0.01050 1.00380

factor(trt)2 0.50000 0.28871 -0.43443 1.01185 0.36896 1.33470

factor(type)2 0.27500 -0.49964 -1.38783 0.38855 0.45317 0.60675

factor(type)3 0.12500 -1.25968 -2.23430 -0.28506 0.49726 0.28374

factor(type)4 0.25000 -0.40243 -1.16137 0.35651 0.38722 0.66869

L95% U95%

shape NA NA

scale NA NA

x1 1.03615 1.07560

x2 0.97446 1.05047

x3 0.98335 1.02468

factor(trt)2 0.64763 2.75068

factor(type)2 0.24962 1.47484

factor(type)3 0.10707 0.75197

factor(type)4 0.31306 1.42833

N = 40, Events: 37, Censored: 3

Total time at risk: 5784

Log-likelihood = -203.4363, df = 9

AIC = 424.8727

일반화감마분포 gengamma분포의 경우 로그-우도 값이 -203.4059, AIC는 426.8117

와이블분포 wibull 분포의 경우 로그-우도 값이 -203.4363, AIC는 424.8727로 큰 차이는 없다. 그래프로 그려보면 아래와 같이 큰 차이가 없음을 알 수 있다.

데이터 시각화

> plot(weibull,xlab='time',ylab='survival function',ci=F)

> lines(gengamma,col='blue',lty=2,ci=F)

> legend('topright',c('weibull','generalized gamma'),lty=1:2,col=c('red','blue'))

출처: 보건정보데이터 분석(이태림, 이재원, 김주한, 장대흥 공저)

'KNOU > 2 보건 정보 데이터 분석' 카테고리의 다른 글

제6.5장 준모수적 방법 (0)	2017.01.26
제6.3장 비모수적 방법을 이용한 생존함수의 비교 (0)	2017.01.26
제6.2장 비모수적 방법 - 2. 누적한계추정법 (0)	2017.01.26
제6.2장 비모수적 방법 - 1. 생명표 방법 (0)	2017.01.26
제5.1장 공분산 분석 - 2 공변량이 둘 이상인 경우 (2)	2017.01.13

Posted by 마르띤

제2장 회귀모형 - 로지스틱 회귀모형 연습

KNOU/2 데이터마이닝 2016. 9. 14. 10:24

목표변수가 연속형인 경우 -> 선형 회귀모델 , ex) 광고비 투입 대비 매출액

목표변수가 두 개의 범주를 가진 이항형인 경우 -> 로지스틱 회귀모형, ex) 좋다1, 나쁘다0

독일신용평가 데이터 셋

독일신용평가 데이터(German Credit Data)는 머신러닝 저장소에 탑재되어 있는 데이터로 분류의 예제로 많이 활용된다.

변수명	속성	변수 설명
check	범주형	자유예금형태 Status of existing checking account
duration	수치형	기간 Duration in month
history	범주형	과거신용정보 Credit history
purpose	범주형	목적 Purpose
credit	수치형	신용대출금액 Credit amount
savings	범주형	저축예금/채권 Savings account/bonds
employment	범주형	현직장 재직기간 Present employment since
installment	수치형	가처분소득 대비 적금비율 Installment rate in percentage of disposable income
personal	범주형	결혼상황 및 성별 Personal status and sex
debtors	범주형	여타 채무/채권 Other debtors / guarantors
residence	수치형	현 거주기간 Present residence since
property	범주형	재산 Property
age	수치형	나이 Age in years
others	범주형	여타적금 Other installment plans
housing	범주형	주거형태 Housing
numcredits	수치형	해당 은행 신용계좌 수 Number of existing credits at this bank
job	범주형	직업 Job
residpeople	수치형	부양가족수 Number of people being liable to provide maintenance for
telephone	범주형	전화소유 Telephone
foreign	범주형	외국인 노동자 여부 foreign worker
y	범주형	신용등급 양호 또는 불량 credit:Good or Bad

1. 데이터 불러오기

> setwd('c:/Rwork')

> german<-read.table('germandata.txt')

> head(german,2) #열 값들의 변수명이 없음.

> names<-c("check","duration","history","purpose","credit","savings","employment","installment", "personal", "debtors", "residence", "property", "age", "others", "housing", "numcredits", "job", "residpeople", "telephone", "foreign" ,"y")

> colnames(german)<-names

> head(german,2)

> german$y<-factor(german$y,levels=c(1,2),labels=c('good','bad'))

> head(german,2)

> summary(german)

# residence,numcredits,residpeople는 실제 범주형이지만 수치형으로 인식. 범주형으로 변환 필요

> class(german$residence) #integer 수치형

[1] "integer"

> class(german$check) #factor 범주형

[1] "factor"

> german$residence = factor(german$residence)

> german$numcredits = factor(german$numcredits)

> german$residpeople = factor(german$residpeople)

> class(german$residence) #factor로 변환

[1] "factor"

> class(german$numcredits) #factor로 변환

[1] "factor"

> class(german$residpeople) #factor로 변환

[1] "factor"

> table(german$residence)

1 2 3 4

130 308 149 413

> german$y<-ifelse(german$y=='good',1,0) #반응 값 good은 1로, bad는 2로 변환

2. 로지스틱 회귀 분석 시작

> fit.all = glm(y~.,family = binomial,data=german) #로지스틱 회귀 분석

또는 아래와 같은 방법도 가능하다.

> gmn<-names(german)
> f<-as.formula(paste('y~',paste(gmn[!gmn%in%y],collapse='+')))
> fit.all.1<-glm(f,family = binomial, data=german)

> fit.step = step(fit.all, direction='both') #단계적 선택방법

Start: AIC=993.44

y ~ check + duration + history + purpose + credit + savings +

employment + installment + personal + debtors + residence +

property + age + others + housing + numcredits + job + residpeople +

telephone + foreign

Df Deviance AIC

- job 3 888.00 988.00

- numcredits 3 890.25 990.25

- property 3 890.70 990.70

- residpeople 1 888.52 992.52

- age 1 889.37 993.37

- telephone 1 889.40 993.40

<none> 887.44 993.44

- employment 4 895.48 993.48

- housing 2 891.63 993.63

- residence 3 894.74 994.74

- debtors 2 894.80 996.80

- others 2 895.71 997.71

- personal 3 897.80 997.80

- foreign 1 894.16 998.16

- credit 1 895.07 999.07

- duration 1 896.25 1000.25

- installment 1 900.81 1004.81

- savings 4 908.55 1006.55

- history 4 911.01 1009.01

- purpose 9 922.07 1010.07

- check 3 957.33 1057.33

Step: AIC=988

y ~ check + duration + history + purpose + credit + savings +

employment + installment + personal + debtors + residence +

property + age + others + housing + numcredits + residpeople +

telephone + foreign

Df Deviance AIC

- numcredits 3 890.85 984.85

- property 3 891.21 985.21

- residpeople 1 889.08 987.08

- employment 4 895.67 987.67

<none> 888.00 988.00

- housing 2 892.01 988.01

- age 1 890.05 988.05

- telephone 1 890.34 988.34

- residence 3 895.32 989.32

- debtors 2 895.25 991.25

- personal 3 898.31 992.31

- others 2 896.49 992.49

- foreign 1 894.77 992.77

+ job 3 887.44 993.44

- credit 1 895.72 993.72

- duration 1 897.14 995.14

- installment 1 901.56 999.56

- savings 4 909.71 1001.71

- history 4 911.44 1003.44

- purpose 9 922.89 1004.89

- check 3 957.60 1051.60

Step: AIC=984.85

y ~ check + duration + history + purpose + credit + savings +

employment + installment + personal + debtors + residence +

property + age + others + housing + residpeople + telephone +

foreign

Df Deviance AIC

- property 3 894.03 982.03

- employment 4 898.02 984.02

- residpeople 1 892.07 984.07

- age 1 892.85 984.85

<none> 890.85 984.85

- housing 2 895.09 985.09

- telephone 1 893.29 985.29

- residence 3 898.52 986.52

+ numcredits 3 888.00 988.00

- debtors 2 898.27 988.27

- personal 3 901.17 989.17

- others 2 899.85 989.85

- foreign 1 898.00 990.00

+ job 3 890.25 990.25

- credit 1 898.64 990.64

- duration 1 899.76 991.76

- installment 1 904.66 996.66

- history 4 911.95 997.95

- savings 4 912.53 998.53

- purpose 9 926.15 1002.15

- check 3 959.38 1047.38

Step: AIC=982.03

y ~ check + duration + history + purpose + credit + savings +

employment + installment + personal + debtors + residence +

age + others + housing + residpeople + telephone + foreign

Df Deviance AIC

- residpeople 1 895.11 981.11

- employment 4 901.94 981.94

- telephone 1 895.95 981.95

<none> 894.03 982.03

- age 1 896.10 982.10

- housing 2 898.15 982.15

- residence 3 901.53 983.53

+ property 3 890.85 984.85

+ numcredits 3 891.21 985.21

- personal 3 903.97 985.97

- debtors 2 902.35 986.35

- foreign 1 901.07 987.07

+ job 3 893.45 987.45

- others 2 903.55 987.55

- credit 1 902.94 988.94

- duration 1 903.85 989.85

- installment 1 908.62 994.62

- savings 4 915.22 995.22

- history 4 915.59 995.59

- purpose 9 930.66 1000.66

- check 3 964.51 1046.51

Step: AIC=981.11

y ~ check + duration + history + purpose + credit + savings +

employment + installment + personal + debtors + residence +

age + others + housing + telephone + foreign

Df Deviance AIC

- employment 4 903.04 981.04

- age 1 897.04 981.04

<none> 895.11 981.11

- telephone 1 897.12 981.12

- housing 2 899.31 981.31

+ residpeople 1 894.03 982.03

- residence 3 902.80 982.80

- personal 3 904.04 984.04

+ property 3 892.07 984.07

+ numcredits 3 892.19 984.19

- debtors 2 903.15 985.15

- foreign 1 902.06 986.06

+ job 3 894.59 986.59

- others 2 904.70 986.70

- credit 1 903.73 987.73

- duration 1 904.80 988.80

- installment 1 909.03 993.03

- savings 4 916.06 994.06

- history 4 916.94 994.94

- purpose 9 932.01 1000.01

- check 3 965.87 1045.87

Step: AIC=981.04

y ~ check + duration + history + purpose + credit + savings +

installment + personal + debtors + residence + age + others +

housing + telephone + foreign

Df Deviance AIC

- age 1 904.91 980.91

<none> 903.04 981.04

+ employment 4 895.11 981.11

- telephone 1 905.28 981.28

- housing 2 907.58 981.58

+ residpeople 1 901.94 981.94

- residence 3 910.50 982.50

+ property 3 899.28 983.28

+ numcredits 3 900.64 984.64

- foreign 1 909.67 985.67

- debtors 2 912.24 986.24

+ job 3 902.89 986.89

- personal 3 915.04 987.04

- others 2 913.21 987.21

- duration 1 911.34 987.34

- credit 1 911.50 987.50

- installment 1 917.92 993.92

- savings 4 925.25 995.25

- history 4 925.74 995.74

- purpose 9 939.70 999.70

- check 3 975.57 1047.57

Step: AIC=980.91

y ~ check + duration + history + purpose + credit + savings +

installment + personal + debtors + residence + others + housing +

telephone + foreign

Df Deviance AIC

<none> 904.91 980.91

+ age 1 903.04 981.04

+ employment 4 897.04 981.04

- telephone 1 907.69 981.69

+ residpeople 1 903.95 981.95

- housing 2 910.11 982.11

- residence 3 912.96 982.96

+ property 3 901.18 983.18

+ numcredits 3 902.60 984.60

- foreign 1 911.56 985.56

- debtors 2 914.35 986.35

- others 2 914.61 986.61

+ job 3 904.63 986.63

- credit 1 913.18 987.18

- personal 3 917.50 987.50

- duration 1 914.06 988.06

- installment 1 919.35 993.35

- savings 4 927.70 995.70

- history 4 928.79 996.79

- purpose 9 940.82 998.82

- check 3 978.40 1048.40

> fit.step$anova #제거된 변수 보기

Step Df Deviance Resid. Df Resid. Dev AIC

1 NA NA 947 887.4372 993.4372

2 - job 3 0.5588674 950 887.9960 987.9960

3 - numcredits 3 2.8582392 953 890.8543 984.8543

4 - property 3 3.1777611 956 894.0320 982.0320

5 - residpeople 1 1.0747973 957 895.1068 981.1068

6 - employment 4 7.9298736 961 903.0367 981.0367

7 - age 1 1.8704615 962 904.9072 980.9072

> summary(fit.step) #최종모델

Call:

glm(formula = y ~ check + duration + history + purpose + credit +

savings + installment + personal + debtors + residence +

others + housing + telephone + foreign, family = binomial,

data = german)

Deviance Residuals:

Min 1Q Median 3Q Max

-2.7904 -0.7290 0.3885 0.6911 2.1780

Coefficients:

Estimate Std. Error z value Pr(>|z|)

(Intercept) -9.736e-01 7.032e-01 -1.385 0.166204

checkA12 3.863e-01 2.136e-01 1.809 0.070468 .

checkA13 1.055e+00 3.636e-01 2.902 0.003714 **

checkA14 1.782e+00 2.308e-01 7.721 1.15e-14 ***

duration -2.726e-02 9.034e-03 -3.018 0.002546 **

historyA31 1.290e-01 5.297e-01 0.244 0.807596

historyA32 8.608e-01 4.104e-01 2.097 0.035956 *

historyA33 9.975e-01 4.675e-01 2.133 0.032889 *

historyA34 1.564e+00 4.329e-01 3.612 0.000303 ***

purposeA41 1.591e+00 3.684e-01 4.320 1.56e-05 ***

purposeA410 1.397e+00 7.732e-01 1.806 0.070849 .

purposeA42 6.766e-01 2.529e-01 2.675 0.007467 **

purposeA43 8.867e-01 2.443e-01 3.629 0.000284 ***

purposeA44 5.231e-01 7.546e-01 0.693 0.488206

purposeA45 1.335e-01 5.388e-01 0.248 0.804301

purposeA46 -2.006e-01 3.883e-01 -0.517 0.605426

purposeA48 2.060e+00 1.202e+00 1.714 0.086523 .

purposeA49 7.396e-01 3.318e-01 2.229 0.025820 *

credit -1.230e-04 4.314e-05 -2.852 0.004351 **

savingsA62 3.126e-01 2.805e-01 1.115 0.264984

savingsA63 4.303e-01 3.887e-01 1.107 0.268291

savingsA64 1.396e+00 5.184e-01 2.692 0.007106 **

savingsA65 1.004e+00 2.606e-01 3.852 0.000117 ***

installment -3.218e-01 8.621e-02 -3.733 0.000189 ***

personalA92 2.159e-01 3.754e-01 0.575 0.565268

personalA93 8.302e-01 3.672e-01 2.261 0.023766 *

personalA94 3.551e-01 4.434e-01 0.801 0.423122

debtorsA102 -4.978e-01 4.005e-01 -1.243 0.213967

debtorsA103 1.074e+00 4.205e-01 2.555 0.010628 *

residence2 -7.181e-01 2.796e-01 -2.568 0.010223 *

residence3 -3.929e-01 3.246e-01 -1.210 0.226104

residence4 -2.893e-01 2.806e-01 -1.031 0.302546

othersA142 5.959e-02 4.061e-01 0.147 0.883344

othersA143 6.787e-01 2.355e-01 2.882 0.003955 **

housingA152 5.098e-01 2.271e-01 2.245 0.024799 *

housingA153 2.464e-01 3.288e-01 0.749 0.453710

telephoneA192 3.051e-01 1.838e-01 1.660 0.096958 .

foreignA202 1.439e+00 6.253e-01 2.301 0.021383 *

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 1221.73 on 999 degrees of freedom

Residual deviance: 904.91 on 962 degrees of freedom

AIC: 980.91

Number of Fisher Scoring iterations: 5

-> 해석: fit.step = step(fit.all, direction='both'를 통해 AIC가 가장 작은 모형을 찾는다.

check는 4개의 범주(checkA11 계좌 없음 / A12 잔액 없음 / A13 잔액 200 이하 / A14 잔액 200 이상)를 가지므로 3개의 가변 수 생성. 추정된 회귀계수는 모두 양수이므로, A12~A14 즉 계좌가 있는 경우 계좌 없음(A11)대비 신용이 좋을 확률(Y=1)이 더 높다. 대출기간인 duration은 마이너스의 값을 지니므로 대출 기간이 오래 될 수록 신용도는 낮아진다. 모델의 AIC는 980.91로, AIC가 클 경우 그 모형은 적합하지 않기 때문에, 여러 후보 모형 중에서 AIC가 가장 작은 모형을 선택한다.

단계적선택법의 AIC는 980.91

[참고] 후진소거법의 AIC는 980.91

> fit.step.back = step(fit.all,direction='backward')

Step: AIC=980.91

y ~ check + duration + history + purpose + credit + savings +

installment + personal + debtors + residence + others + housing +

telephone + foreign

Df Deviance AIC

<none> 904.91 980.91

- telephone 1 907.69 981.69

- housing 2 910.11 982.11

- residence 3 912.96 982.96

- foreign 1 911.56 985.56

- debtors 2 914.35 986.35

- others 2 914.61 986.61

- credit 1 913.18 987.18

- personal 3 917.50 987.50

- duration 1 914.06 988.06

- installment 1 919.35 993.35

- savings 4 927.70 995.70

- history 4 928.79 996.79

- purpose 9 940.82 998.82

- check 3 978.40 1048.40

> fit.step.back$anova #제거된 변수 보기

Step Df Deviance Resid. Df Resid. Dev AIC

1 NA NA 947 887.4372 993.4372

2 - job 3 0.5588674 950 887.9960 987.9960

3 - numcredits 3 2.8582392 953 890.8543 984.8543

4 - property 3 3.1777611 956 894.0320 982.0320

5 - residpeople 1 1.0747973 957 895.1068 981.1068

6 - employment 4 7.9298736 961 903.0367 981.0367

7 - age 1 1.8704615 962 904.9072 980.9072

[참고] 전진선택법 AIC : 993.44

> fit.step.forward = step(fit.all, direction = 'forward')

Start: AIC=993.44

y ~ check + duration + history + purpose + credit + savings +

employment + installment + personal + debtors + residence +

property + age + others + housing + numcredits + job + residpeople +

telephone + foreign

> fit.step.forward$anova #제거된 변수 보기

Step Df Deviance Resid. Df Resid. Dev AIC

1 NA NA 947 887.4372 993.4372

3. 예측함수 및 정오분류표 작성

> p = predict(fit.step, newdata=german,type='response')

> threshold = 0.5 #cutoff기준 0.5로 정함

> yhat = ifelse(p>threshold,1,0)

> head(yhat)

1 2 3 4 5 6

1 0 1 1 0 1

> class.tab = table(german$y,yhat,dnn=c("Actual","Predicted"))#실값과 예측값 배열

> class.tab

Predicted

Actual 0 1

0 158 142

1 82 618

-> 해석: 1로 예측할 확률이 임계치(threshold) 0.5보다 클 경우에는 1로, 0.5이하일 경우에는 0으로 예측. 실제로는 0인데 0으로 예측한 경우가 158개, 1인데 1로 분류한 경우가 618개이다.반면에 0인데 1로 오분류한 경우가 142개, 1인데 0으로 오분류한 경우가 82개이다.

4. 예측력 측도

> sum(german$y==yhat)/length(german$y) #Prediction Accuracy 예측정확도

[1] 0.776

> sum(german$y!=yhat)/length(german$y) #Misclassification Rate 오분류율

[1] 0.224

> class.tab[1,1]/apply(class.tab,1,sum)[1] #Specificity 특이도

0.5266667

> class.tab[2,2]/apply(class.tab,1,sum)[2] #Sensitivity 민감도

0.8828571

-> 해석: 민감도는 실제 양성(Y=1)일 때 양성으로 예측할 확률, 특이도는 실제 음성(Y=0)일 때 음성으로 예측할 확률이다. 예측정확도(prediction accuracy)는 실제 양서일 때 양성으로, 음성일 때 음성으로 제대로 예측할 확률로 민감도와 특이도의 가중평균이다. 오분류율(misclassification rate)는 양성일 때 음성으로, 음성일 때 양성으로 잘못 예측할 확률이다.

5. ROC 곡선 및 AUC 생성

> library(ROCR)

> pred<-prediction(p,german$y)

> perf<-performance(pred,'tpr','fpr') #민감도와 1-특이도 계산 과정

> plot(perf,lty=1,col=2,xlim=c(0,1),ylim=c(0,1),xlab='1-Specificity',ylab='Sensitivity',main='ROC Curve')

> lines(x=c(0,1),y=c(0,1),col='grey')

> performance(pred,'auc')@y.values #면적 계산

[[1]]

[1] 0.8312286

-> 민감도와 특이도는 임계치에 다라 달라지고 임계치는 상황에 따라 다르게 결정할 수 이다. 여러 가능한 임계치에 대해 ‘1-특이도(Specificity)’를 가로축에, 민감도를 세로축에 놓고 그린 그래프를 ROC(Receiver operating characteristic) 곡선이라 한다. 민감도와 특이도가 높을수록 예측력이 좋다고 할 수 있기 때문에 ROC 곡선이 좌상단에 가까울수록 ROC 곡선 아래 면적인 AUC(area under the ROC curve)가 커지고, 예측력이 좋다고 할 수 있다.이 독일신용평가 데이터에 적합한 로지스틱 회귀모형에 대한 예측력의 측도인 AUC는 최대치 1보다 다소 작은 0.831로 상당히 높음을 알 수 있다.

'KNOU > 2 데이터마이닝' 카테고리의 다른 글

제3장 나무모형 - 회귀나무모형 (0)	2016.10.26
제3장 나무모형 - 분류나무모형 (4)	2016.10.18
제2장 회귀모형 - 선형회귀 연습 (0)	2016.09.14
제2장 회귀모형 - 선형회귀, 로지스틱회귀 (0)	2016.09.14
1장 데이터과학과 데이터마이닝 (0)	2016.08.26

Posted by 마르띤

이전 1 다음

데이터마이너를 꿈꾸며

'AIC'에 해당되는 글 2건

제6.4장 모수적 방법

'KNOU > 2 보건 정보 데이터 분석' 카테고리의 다른 글

제2장 회귀모형 - 로지스틱 회귀모형 연습

'KNOU > 2 데이터마이닝' 카테고리의 다른 글

링크

카테고리

최근에 올라온 글

최근에 받은 트랙백

글 보관함

티스토리툴바