
Continuous target variables

When the target variable is continuous, the prediction mean squared error (PMSE: Prediction Mean Squared Error) is the measure most commonly used for a model's predictive accuracy:

 

$\text{PMSE} = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$

 

The smaller the gap between the observed values ($y_i$) and the predicted values ($\hat{y}_i$), the better the prediction, so a smaller PMSE means higher predictive accuracy. Likewise, if a scatter plot of observed (horizontal axis) against predicted (vertical axis) values clusters tightly around the 45° diagonal, the model can be said to predict well.
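As a one-line sketch in R (the helper name pmse is ours, not from the textbook):

> pmse = function(y, yhat) mean((y - yhat)^2)  # PMSE: average squared gap between observed and predicted

Below, the same quantity is simply written out as mean((observed - predicted)^2) for each model.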

 

The Boston housing data: Housing Values in Suburbs of Boston

(Source: http://127.0.0.1:31866/library/MASS/html/Boston.html — the help page for MASS::Boston served by the local R help server)

 

| Variable | Type | Description |
|----------|------|-------------|
| crim | numeric | per capita crime rate by town |
| zn | numeric | proportion of residential land zoned for lots over 25,000 sq.ft. |
| indus | numeric | proportion of non-retail business acres per town |
| chas | categorical (integer) | Charles River dummy variable (1 if tract bounds river; 0 otherwise) |
| nox | numeric | nitrogen oxides concentration (parts per 10 million) |
| rm | numeric | average number of rooms per dwelling |
| age | numeric | proportion of owner-occupied units built prior to 1940 |
| dis | numeric | weighted mean of distances to five Boston employment centres |
| rad | categorical (integer) | index of accessibility to radial highways |
| tax | numeric | full-value property-tax rate per $10,000 |
| ptratio | numeric | pupil-teacher ratio by town |
| black | numeric | 1000(Bk − 0.63)² where Bk is the proportion of blacks by town |
| lstat | numeric | lower status of the population (percent) |
| medv (target) | numeric | median value of owner-occupied homes in $1000s |

<Data preparation>

> library(MASS)

> str(Boston)

'data.frame':   506 obs. of  15 variables:

 $ crim    : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...

 $ zn      : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...

 $ indus   : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...

 $ chas    : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...

 $ nox     : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...

 $ rm      : num  6.58 6.42 7.18 7 7.15 ...

 $ age     : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...

 $ dis     : num  4.09 4.97 4.97 6.06 6.06 ...

 $ rad     : Factor w/ 9 levels "1","2","3","4",..: 1 2 2 3 3 3 5 5 5 5 ...

 $ tax     : num  296 242 242 222 222 222 311 311 311 311 ...

 $ ptratio : num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...

 $ black   : num  397 397 393 395 397 ...

 $ lstat   : num  4.98 9.14 4.03 2.94 5.33 ...

 $ medv    : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...

 $ medv.hat: num  23.7 23.7 32.4 32.4 32.4 ...

> dim(Boston)

[1] 506  15

> Boston<-Boston[,-15]

 

> str(Boston)

'data.frame':   506 obs. of  14 variables:

 $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...

 $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...

 $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...

 $ chas   : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...

 $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...

 $ rm     : num  6.58 6.42 7.18 7 7.15 ...

 $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...

 $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...

 $ rad    : Factor w/ 9 levels "1","2","3","4",..: 1 2 2 3 3 3 5 5 5 5 ...

 $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...

 $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...

 $ black  : num  397 397 393 395 397 ...

 $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...

 $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
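Note: a freshly loaded MASS::Boston has 14 columns, and chas and rad are integer vectors. The medv.hat column and the factor versions of chas and rad in the str() output above carry over from an earlier session; the factors were presumably created along these lines:

> Boston$chas = factor(Boston$chas)
> Boston$rad = factor(Boston$rad)

This matters below: because rad is a factor, lm() expands it into the dummy variables rad2, ..., rad24 seen in the coefficient table.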

 



1. Linear regression model

> i=sample(1:nrow(Boston),round(nrow(Boston)*0.7))

> Boston.train = Boston[i,] #70% for training data

> Boston.test = Boston[-i,] #30% for test data

> fit.reg = lm(medv~.,data = Boston.train) # fit the linear regression model

> summary(fit.reg) # model summary

 

Call:

lm(formula = medv ~ ., data = Boston.train)

 

Residuals:

    Min      1Q  Median      3Q     Max

-9.9375 -2.7260 -0.4588  1.7937 24.3220

 

Coefficients:

              Estimate Std. Error t value Pr(>|t|)   

(Intercept)  41.027525   6.644569   6.175 1.93e-09 ***

crim         -0.098730   0.052682  -1.874 0.061794 . 

zn            0.059161   0.016208   3.650 0.000304 ***

indus         0.073377   0.075763   0.969 0.333492   

chas1         2.314962   1.023899   2.261 0.024409 * 

nox         -18.307923   4.638939  -3.947 9.67e-05 ***

rm            3.014437   0.498806   6.043 4.04e-09 ***

age          -0.002066   0.016414  -0.126 0.899922   

dis          -1.564686   0.235506  -6.644 1.25e-10 ***

rad2          1.485691   1.707949   0.870 0.384999   

rad3          5.371463   1.513283   3.550 0.000441 ***

rad4          2.763349   1.365438   2.024 0.043791 * 

rad5          3.407697   1.380288   2.469 0.014057 * 

rad6          1.840557   1.687536   1.091 0.276204   

rad7          3.790169   1.832307   2.069 0.039362 * 

rad8          4.691472   1.677890   2.796 0.005473 **

rad24         8.224882   2.114575   3.890 0.000121 ***

tax          -0.009025   0.004555  -1.981 0.048382 * 

ptratio      -1.021842   0.175757  -5.814 1.43e-08 ***

black         0.008734   0.003633   2.404 0.016749 * 

lstat        -0.596699   0.059397 -10.046  < 2e-16 ***

---

Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

 

Residual standard error: 4.658 on 333 degrees of freedom

Multiple R-squared:  0.7468,    Adjusted R-squared:  0.7316

F-statistic: 49.11 on 20 and 333 DF,  p-value: < 2.2e-16

 

> fit.step.reg = step(fit.reg,direction = 'both', trace = FALSE) # stepwise selection; with trace=TRUE the AIC at each step is printed

> summary(fit.step.reg) # summary of the selected model

 

Call:

lm(formula = medv ~ crim + zn + chas + nox + rm + dis + rad +

    tax + ptratio + black + lstat, data = Boston.train)

 

Residuals:

    Min      1Q  Median      3Q     Max

-9.8479 -2.6319 -0.4838  1.7225 24.3723

 

Coefficients:

              Estimate Std. Error t value Pr(>|t|)   

(Intercept)  40.432105   6.597006   6.129 2.49e-09 ***

crim         -0.102050   0.052455  -1.945 0.052555 . 

zn            0.058012   0.015903   3.648 0.000307 ***

chas1         2.428119   1.014994   2.392 0.017296 * 

nox         -17.025847   4.237976  -4.017 7.27e-05 ***

rm            2.969435   0.489146   6.071 3.45e-09 ***

dis          -1.604642   0.221607  -7.241 3.08e-12 ***

rad2          1.763045   1.681098   1.049 0.295051   

rad3          5.319042   1.507369   3.529 0.000476 ***

rad4          2.871626   1.358651   2.114 0.035290 * 

rad5          3.377174   1.377514   2.452 0.014731 * 

rad6          1.669858   1.672013   0.999 0.318656   

rad7          3.831541   1.823028   2.102 0.036322 * 

rad8          4.551858   1.667065   2.730 0.006659 **

rad24         7.906985   2.076704   3.807 0.000167 ***

tax          -0.007384   0.004213  -1.753 0.080567 . 

ptratio      -0.996843   0.173266  -5.753 1.98e-08 ***

black         0.008481   0.003612   2.348 0.019458 *  

lstat        -0.593839   0.055467 -10.706  < 2e-16 ***

---

Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

 

Residual standard error: 4.65 on 335 degrees of freedom

Multiple R-squared:  0.7461,    Adjusted R-squared:  0.7324

F-statistic: 54.69 on 18 and 335 DF,  p-value: < 2.2e-16

 

> fit.step.reg$anova # trace of the stepwise search: age and indus were removed

     Step Df   Deviance Resid. Df Resid. Dev      AIC

1         NA         NA       333   7223.934 1109.614

2   - age  1  0.3436196       334   7224.278 1107.631

3 - indus  1 20.2559598       335   7244.534 1106.622

 

> yhat.reg = predict(fit.step.reg,newdata=Boston.test,type='response') # predict the target for the test data; type='response' gives predictions on the response scale

> mean((Boston.test$medv-yhat.reg)^2) # prediction mean squared error (PMSE)

[1] 23.68968
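Note that, unlike sections 2-4 below, no set.seed() call precedes sample() here, so this 70/30 split is not reproducible and differs from the seed-1234 split used by the other models; the PMSE of 23.68968 will vary from run to run.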

 

 

 

 

 

2. Regression tree model

> library(rpart)

> set.seed(1234)

> i=sample(1:nrow(Boston),round(nrow(Boston)*0.7))

> Boston.train = Boston[i,]

> Boston.test = Boston[-i,]

 

> my.control = rpart.control(xval=10, cp=0, minsplit=nrow(Boston.train)*0.05) # 10-fold cross-validation; cp=0 grows the full tree; a node needs at least 5% of the training rows to be split

> fit.tree = rpart(medv~.,data=Boston.train, method='anova',control=my.control) # method='anova' fits a regression tree; method='class' a classification tree

 

> which.min(fit.tree$cp[,4])

17

17

> fit.tree$cp[17]

[1] 0.00193237

> fit.tree$cp[17,]

         CP      nsplit   rel error      xerror        xstd

 0.00193237 17.00000000  0.12723520  0.23200911  0.04362543

> fit.tree$cp[17,1]

[1] 0.00193237

> ii = which.min(fit.tree$cp[,4])

> fit.prune.tree = prune(fit.tree,cp=fit.tree$cp[ii,1])

 

> plot(fit.prune.tree,uniform=T,margin=0.1)

> text(fit.prune.tree,col='blue',cex=0.7)





> yhat.tree = predict(fit.prune.tree,newdata=Boston.test,type='vector')

> mean((Boston.test$medv - yhat.tree)^2)

[1] 24.17478
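Choosing the cp value that minimizes the cross-validated error (xerror), as above, is one convention; a slightly more conservative alternative is the 1-SE rule, which takes the smallest tree whose xerror is within one standard error of the minimum. A sketch (fit.tree$cp above works because R's $ partially matches the cptable component):

> cp.tab = fit.tree$cptable
> i.min = which.min(cp.tab[,'xerror'])
> thresh = cp.tab[i.min,'xerror'] + cp.tab[i.min,'xstd'] # minimum xerror plus one SE
> cp.1se = cp.tab[which(cp.tab[,'xerror'] <= thresh)[1], 'CP'] # first row = smallest tree under the threshold
> fit.prune.1se = prune(fit.tree, cp=cp.1se)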

 

 

 

 

 

3. Neural network model

> library(neuralnet)

 

> Boston2 = Boston

> for(i in 1:ncol(Boston2))if(!is.numeric(Boston2[,i]))Boston2[,i]=as.numeric(Boston[,i]) # convert the factor columns (chas, rad) to numeric; neuralnet needs an all-numeric data set

 

> max1 = apply(Boston2,2,max)

> min1 = apply(Boston2,2,min)

 

> sdat = scale(Boston2,center=min1,scale=max1-min1) # min-max normalization to [0,1]

> sdat=as.data.frame(sdat)

> View(sdat)
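Here scale() with center=min1 and scale=max1-min1 implements min-max normalization, x' = (x − min)/(max − min), mapping every column into [0, 1]; inputs on a common bounded scale generally help neuralnet's training converge.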

 

> set.seed(1234)

> i=sample(1:nrow(Boston2),round(nrow(Boston2)*0.7))

> Boston2.train = sdat[i,]

> Boston2.test = sdat[-i,]

 

> vname = names(Boston2.train)

> form = as.formula(paste('medv~',paste(vname[!vname %in% "medv"],collapse='+')))

> form

medv ~ crim + zn + indus + chas + nox + rm + age + dis + rad +

    tax + ptratio + black + lstat
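The formula is assembled with paste() because neuralnet (at least in older versions) does not expand the medv ~ . shorthand the way lm() does.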

> fit.nn = neuralnet(form,data=Boston2.train,hidden=c(3,2),linear.output=T) # two hidden layers with 3 and 2 units; linear.output=TRUE for regression (no activation on the output node)

> plot(fit.nn) 

 

> str(Boston.test)

'data.frame':   152 obs. of  14 variables:

 $ crim   : num  0.0273 0.0299 0.1446 0.6274 1.0539 ...

 $ zn     : num  0 0 12.5 0 0 0 0 0 0 0 ...

 $ indus  : num  7.07 2.18 7.87 8.14 8.14 8.14 8.14 8.14 8.14 8.14 ...

 $ chas   : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...

 $ nox    : num  0.469 0.458 0.524 0.538 0.538 0.538 0.538 0.538 0.538 0.538 ...

 $ rm     : num  7.18 6.43 6.17 5.83 5.93 ...

 $ age    : num  61.1 58.7 96.1 56.5 29.3 98.1 89.2 91.7 85.7 90.3 ...

 $ dis    : num  4.97 6.06 5.95 4.5 4.5 ...

 $ rad    : Factor w/ 9 levels "1","2","3","4",..: 2 3 5 4 4 4 4 4 4 4 ...

 $ tax    : num  242 222 311 307 307 307 307 307 307 307 ...

 $ ptratio: num  17.8 18.7 15.2 21 21 21 21 21 21 21 ...

 $ black  : num  393 394 397 396 387 ...

 $ lstat  : num  4.03 5.21 19.15 8.47 6.58 ...

 $ medv   : num  34.7 28.7 27.1 19.9 23.1 13.6 19.6 15.2 13.9 16.6 ...

> pred=compute(fit.nn,Boston2.test[,1:13]) # column 14 is the target medv; compute() applies the fitted network to new cases

> summary(pred)

           Length Class  Mode  

neurons      3    -none- list  

net.result 152    -none- numeric

> yhat.nn = pred$net.result*(max(Boston2$medv)-min(Boston2$medv))+min(Boston2$medv) # back-transformation: rescale the predictions from [0,1] back to the original medv units

> head(cbind(Boston2[i,14],yhat.nn),50) # intended as observed vs. predicted, but see the note after the warning below

    [,1]        [,2]

2   31.6 21.17115946

3   23.8 30.30760801

10  28.2 20.06686203

13  21.6 21.29808480

14  16.1 20.28145030

18  23.8 18.30584489

24  36.2 16.10587573

26  21.2 16.16710531

27  17.1 17.71479369

28  20.9 16.71171622

33  31.2 12.87970542

35  20.7 14.84095463

36  17.8 22.05195863

42  14.1 33.16705465

43  15.6 26.86733582

46  17.2 21.39641163

47  14.0 20.62187769

49  19.2 13.55437741

51  22.0 21.63602446

57  18.7 23.04343250

59  19.4 23.33368068

72  15.6 21.68452217

75  20.0 22.95006783

76  18.2 21.37727858

81  19.5 27.88596126

90  11.5 31.97882170

94  29.6 24.56223312

98   8.4 41.94592034

102  8.5 23.20282812

103 19.6 27.89450422

104 28.7 19.47439349

105 21.4 19.90631048

111 11.8 20.49063003

120 23.3 20.15009581

123 26.6 19.19281571

128 21.7 16.99190292

135 20.6 14.54799220

148 20.3 13.09572781

150 21.4 15.35938684

151 13.3 18.92860416

161 50.0 31.83885454

165 24.8 20.14597079

167 29.8 44.60929619

169 23.2 20.70938882

170 15.3 20.60174521

171 31.7 19.12045877

172 22.1 19.63836241

173 27.5 20.36004622

176 22.8 32.60724125

177 26.6 24.24612266

Warning message:

In cbind(Boston2[i, 14], yhat.nn) :

  number of rows of result is not a multiple of vector length (arg 1)
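This warning flags a real mismatch: Boston2[i, 14] contains the 354 training-set medv values, while yhat.nn holds the 152 test-set predictions, so the observed and predicted columns above do not correspond row by row (cbind() cut the longer vector down to the matrix's 152 rows, as the warning indicates). A corrected comparison pairs the test rows with the predictions:

> head(cbind(Boston2[-i,14], yhat.nn), 10) # test-set observed medv vs. back-transformed predictions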

 

> Boston2.test$medv=Boston2.test$medv*(max(Boston2$medv)-min(Boston2$medv))+min(Boston2$medv)#back transformation

> mean((Boston2.test$medv-yhat.nn)^2)

[1] 19.34365205

 

 

 


4. Random forest

> library(randomForest)

> library(MASS)

> str(Boston)

'data.frame':   506 obs. of  14 variables:

 $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...

 $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...

 $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...

 $ chas   : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...

 $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...

 $ rm     : num  6.58 6.42 7.18 7 7.15 ...

 $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...

 $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...

 $ rad    : Factor w/ 9 levels "1","2","3","4",..: 1 2 2 3 3 3 5 5 5 5 ...

 $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...

 $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...

 $ black  : num  397 397 393 395 397 ...

 $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...

 $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...

 

> set.seed(1234)

> i=sample(1:nrow(Boston),round(nrow(Boston)*0.7))

> Boston.train = Boston[i,]

> Boston.test = Boston[-i,]

 

> fit.rf = randomForest(medv~., data=Boston.train, ntree=100, mtry=5, importance = T, na.action=na.omit) # 100 trees; mtry=5 candidate variables per split; importance=T computes permutation importance

> fit.rf

 

Call:

 randomForest(formula = medv ~ ., data = Boston.train, ntree = 100,      mtry = 5, importance = T, na.action = na.omit)

               Type of random forest: regression

                     Number of trees: 100

No. of variables tried at each split: 5

 

          Mean of squared residuals: 11.60448807

                    % Var explained: 85.6

> summary(fit.rf)

                Length Class  Mode    

call              7    -none- call    

type              1    -none- character

predicted       354    -none- numeric 

mse             100    -none- numeric 

rsq             100    -none- numeric 

oob.times       354    -none- numeric 

importance       26    -none- numeric 

importanceSD     13    -none- numeric 

localImportance   0    -none- NULL    

proximity         0    -none- NULL    

ntree             1    -none- numeric 

mtry              1    -none- numeric 

forest           11    -none- list    

coefs             0    -none- NULL    

y               354    -none- numeric 

test              0    -none- NULL    

inbag             0    -none- NULL    

terms             3    terms  call    

> head(fit.rf$predicted,30)

         58         315         308         314         433         321

31.38358586 22.70753968 28.77652778 21.11594595 20.61043860 23.92682540

          5         117         332         256         345         270

32.56341667 20.64287500 20.64851351 20.71283333 30.29218391 21.12653153

        140         456         144         412         141         131

15.02797980 15.33454167 14.36800926 14.37462963 15.01537037 19.89743590

         92         114         154         147          77          20

23.40651042 19.04477778 15.95588235 16.05761261 21.11859649 19.79196970

        106         390         253         439         398          22

19.35319820 10.55704563 31.93869369  9.22796875 12.18531532 18.03844203

> importance(fit.rf,type=1) # type=1: permutation importance (%IncMSE)

              %IncMSE

crim     6.3818661727

zn       1.5083317623

indus    4.9357684617

chas     0.6538995939

nox      6.8146688870

rm      15.9443905699

age      3.6616801321

dis      7.0604314940

rad      6.1703029034

tax      3.9762473150

ptratio  7.1788230837

black    3.4813158641

lstat   13.5182724091
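The same ranking can be shown graphically with randomForest's built-in helper; a minimal sketch:

> varImpPlot(fit.rf, type=1, main='Variable importance (%IncMSE)') # rm and lstat dominate, matching the table above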

> yhat.rf=predict(fit.rf,newdata=Boston.test,type='response')

> mean((Boston.test$medv-yhat.rf)^2)

[1] 12.41384092

 

 


Model comparison: prediction mean squared error (PMSE)

| Model | PMSE |
|-------|------|
| Linear regression | 23.68968 |
| Regression tree | 24.17478 |
| Neural network | 19.34365205 |
| Random forest | 12.41384092 |
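Each PMSE above comes from a single 70/30 split (and the regression split was not seeded), so the figures are indicative rather than definitive; repeating the split several times or cross-validating would give a more stable comparison. Note too that yhat.reg was computed for the unseeded test set of section 1, while Boston.test was later redefined with seed 1234, so the Regression panel below pairs those predictions with a different set of observations.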

 


Scatter plots of observed vs. predicted values

> par(mfrow=c(2,2))

 

> plot(Boston.test$medv,yhat.reg,xlab='Observed',ylab='Predicted',main='Regression',xlim=c(0,55),ylim=c(0,55))

> abline(0,1,lty=2)

 

> plot(Boston.test$medv,yhat.tree,xlab='Observed',ylab='Predicted',main='Decision tree',xlim=c(0,55),ylim=c(0,55))

> abline(0,1,lty=2)

 

> plot(Boston.test$medv,yhat.nn,xlab='Observed',ylab='Predicted',main='Neural Network',xlim=c(0,55),ylim=c(0,55))

> abline(0,1,lty=2)

 

> plot(Boston.test$medv,yhat.rf,xlab='Observed',ylab='Predicted',main='Random Forest',xlim=c(0,55),ylim=c(0,55))

> abline(0,1,lty=2) 

 

-> The random forest panel is the most tightly clustered around the 45° diagonal, so among the four models it shows the best predictive performance on this test set.



Source: Data Mining (데이터마이닝), by 장영재, 김현중, and 조형준.

 
