Continuous target variables
When the target variable is continuous, the prediction mean squared error (PMSE: Prediction Mean Squared Error) is the usual measure of a model's predictive performance:
$\mathrm{PMSE} = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$
The smaller the differences between the observations $y_i$ and the predictions $\hat{y}_i$, the better the model predicts, so a smaller PMSE means higher predictive accuracy. Likewise, if a scatter plot of observed (horizontal axis) against predicted (vertical axis) values clusters tightly around the 45-degree diagonal, the model can be said to predict well.
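As a concrete illustration, both the measure and the diagnostic plot take only a few lines of R (a minimal sketch, assuming numeric vectors y of observations and yhat of predictions):

# PMSE: mean squared difference between observed and predicted values
pmse <- mean((y - yhat)^2)
# observed vs. predicted scatter plot with the 45-degree reference line
plot(y, yhat, xlab='Observed', ylab='Predicted')
abline(0, 1, lty=2)  # points hugging this diagonal indicate good predictions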
Boston housing data: Housing Values in Suburbs of Boston
(Source: MASS package documentation, help(Boston))
Variable | Type | Description
---|---|---
crim | numeric | per capita crime rate by town
zn | numeric | proportion of residential land zoned for lots over 25,000 sq.ft.
indus | numeric | proportion of non-retail business acres per town
chas | categorical (integer) | Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
nox | numeric | nitrogen oxides concentration (parts per 10 million)
rm | numeric | average number of rooms per dwelling
age | numeric | proportion of owner-occupied units built prior to 1940
dis | numeric | weighted mean of distances to five Boston employment centres
rad | categorical (integer) | index of accessibility to radial highways
tax | numeric | full-value property-tax rate per $10,000
ptratio | numeric | pupil-teacher ratio by town
black | numeric | 1000(Bk - 0.63)^2, where Bk is the proportion of blacks by town
lstat | numeric | lower status of the population (percent)
medv | numeric | median value of owner-occupied homes in $1000s
<Data preparation>
> library(MASS)
> str(Boston)
'data.frame': 506 obs. of 15 variables:
$ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
$ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
$ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
$ chas : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
$ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
$ rm : num 6.58 6.42 7.18 7 7.15 ...
$ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
$ dis : num 4.09 4.97 4.97 6.06 6.06 ...
$ rad : Factor w/ 9 levels "1","2","3","4",..: 1 2 2 3 3 3 5 5 5 5 ...
$ tax : num 296 242 242 222 222 222 311 311 311 311 ...
$ ptratio : num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
$ black : num 397 397 393 395 397 ...
$ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
$ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
$ medv.hat: num 23.7 23.7 32.4 32.4 32.4 ...
> dim(Boston)
[1] 506 15
> Boston<-Boston[,-15]
> str(Boston)
'data.frame': 506 obs. of 14 variables:
$ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
$ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
$ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
$ chas : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
$ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
$ rm : num 6.58 6.42 7.18 7 7.15 ...
$ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
$ dis : num 4.09 4.97 4.97 6.06 6.06 ...
$ rad : Factor w/ 9 levels "1","2","3","4",..: 1 2 2 3 3 3 5 5 5 5 ...
$ tax : num 296 242 242 222 222 222 311 311 311 311 ...
$ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
$ black : num 397 397 393 395 397 ...
$ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
$ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
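Note that in the stock MASS::Boston data frame all 14 variables are numeric and there is no medv.hat column, so the str() output above reflects a session in which the data had already been modified. If you are reproducing this from a fresh session, the factor conversions were presumably done along these lines (a sketch, not shown in the original transcript):

Boston$chas <- factor(Boston$chas)  # Charles River dummy as a categorical variable
Boston$rad  <- factor(Boston$rad)   # highway accessibility index as a categorical variable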
1. Linear regression model
> i=sample(1:nrow(Boston),round(nrow(Boston)*0.7))
> Boston.train = Boston[i,] #70% for training data
> Boston.test = Boston[-i,] #30% for test data
> fit.reg = lm(medv~.,data = Boston.train) # fit the linear regression model
> summary(fit.reg) # model summary
Call:
lm(formula = medv ~ ., data = Boston.train)
Residuals:
Min 1Q Median 3Q Max
-9.9375 -2.7260 -0.4588 1.7937 24.3220
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 41.027525 6.644569 6.175 1.93e-09 ***
crim -0.098730 0.052682 -1.874 0.061794 .
zn 0.059161 0.016208 3.650 0.000304 ***
indus 0.073377 0.075763 0.969 0.333492
chas1 2.314962 1.023899 2.261 0.024409 *
nox -18.307923 4.638939 -3.947 9.67e-05 ***
rm 3.014437 0.498806 6.043 4.04e-09 ***
age -0.002066 0.016414 -0.126 0.899922
dis -1.564686 0.235506 -6.644 1.25e-10 ***
rad2 1.485691 1.707949 0.870 0.384999
rad3 5.371463 1.513283 3.550 0.000441 ***
rad4 2.763349 1.365438 2.024 0.043791 *
rad5 3.407697 1.380288 2.469 0.014057 *
rad6 1.840557 1.687536 1.091 0.276204
rad7 3.790169 1.832307 2.069 0.039362 *
rad8 4.691472 1.677890 2.796 0.005473 **
rad24 8.224882 2.114575 3.890 0.000121 ***
tax -0.009025 0.004555 -1.981 0.048382 *
ptratio -1.021842 0.175757 -5.814 1.43e-08 ***
black 0.008734 0.003633 2.404 0.016749 *
lstat -0.596699 0.059397 -10.046 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 4.658 on 333 degrees of freedom
Multiple R-squared: 0.7468, Adjusted R-squared: 0.7316
F-statistic: 49.11 on 20 and 333 DF, p-value: < 2.2e-16
> fit.step.reg = step(fit.reg,direction = 'both', trace = FALSE) # stepwise selection; with trace=T the AIC at each step is printed
> summary(fit.step.reg) # summary of the selected model
Call:
lm(formula = medv ~ crim + zn + chas + nox + rm + dis + rad +
tax + ptratio + black + lstat, data = Boston.train)
Residuals:
Min 1Q Median 3Q Max
-9.8479 -2.6319 -0.4838 1.7225 24.3723
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 40.432105 6.597006 6.129 2.49e-09 ***
crim -0.102050 0.052455 -1.945 0.052555 .
zn 0.058012 0.015903 3.648 0.000307 ***
chas1 2.428119 1.014994 2.392 0.017296 *
nox -17.025847 4.237976 -4.017 7.27e-05 ***
rm 2.969435 0.489146 6.071 3.45e-09 ***
dis -1.604642 0.221607 -7.241 3.08e-12 ***
rad2 1.763045 1.681098 1.049 0.295051
rad3 5.319042 1.507369 3.529 0.000476 ***
rad4 2.871626 1.358651 2.114 0.035290 *
rad5 3.377174 1.377514 2.452 0.014731 *
rad6 1.669858 1.672013 0.999 0.318656
rad7 3.831541 1.823028 2.102 0.036322 *
rad8 4.551858 1.667065 2.730 0.006659 **
rad24 7.906985 2.076704 3.807 0.000167 ***
tax -0.007384 0.004213 -1.753 0.080567 .
ptratio -0.996843 0.173266 -5.753 1.98e-08 ***
black 0.008481 0.003612 2.348 0.019458 *
lstat -0.593839 0.055467 -10.706 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 4.65 on 335 degrees of freedom
Multiple R-squared: 0.7461, Adjusted R-squared: 0.7324
F-statistic: 54.69 on 18 and 335 DF, p-value: < 2.2e-16
> fit.step.reg$anova # variables removed by the stepwise search: age and indus
Step Df Deviance Resid. Df Resid. Dev AIC
1 NA NA 333 7223.934 1109.614
2 - age 1 0.3436196 334 7224.278 1107.631
3 - indus 1 20.2559598 335 7244.534 1106.622
> yhat.reg = predict(fit.step.reg,newdata=Boston.test,type='response') # predict the target on the test data; type='response' gives predictions on the response scale
> mean((Boston.test$medv-yhat.reg)^2) # test-set PMSE of the model
[1] 23.68968
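Unlike the later sections, this split is drawn without set.seed(), so rerunning the code gives a different train/test split and a different PMSE. To make the result reproducible, fix the seed before sampling, as the later sections do:

set.seed(1234)  # fix the RNG so the split, and hence the PMSE, is reproducible
i <- sample(1:nrow(Boston), round(nrow(Boston)*0.7))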
2. Regression tree model
> library(rpart)
> set.seed(1234)
> i=sample(1:nrow(Boston),round(nrow(Boston)*0.7))
> Boston.train = Boston[i,]
> Boston.test = Boston[-i,]
> my.control = rpart.control(xval=10, cp=0, minsplit=nrow(Boston.train)*0.05)
> fit.tree = rpart(medv~.,data=Boston.train, method='anova',control=my.control) # method='anova' fits a regression tree; method='class' fits a classification tree
> which.min(fit.tree$cp[,4])
17
17
> fit.tree$cp[17]
[1] 0.00193237
> fit.tree$cp[17,]
CP nsplit rel error xerror xstd
0.00193237 17.00000000 0.12723520 0.23200911 0.04362543
> fit.tree$cp[17,1]
[1] 0.00193237
> ii = which.min(fit.tree$cp[,4])
> fit.prune.tree = prune(fit.tree,cp=fit.tree$cp[ii,1])
> plot(fit.prune.tree,uniform=T,margin=0.1)
> text(fit.prune.tree,col='blue',cex=0.7)
> yhat.tree = predict(fit.prune.tree,newdata=Boston.test,type='vector')
> mean((Boston.test$medv - yhat.tree)^2)
[1] 24.17478
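The which.min() lookup above selects the subtree whose cross-validated error (xerror, column 4 of the cp table) is smallest. rpart can also display this table graphically, which is often easier to read than the raw matrix:

plotcp(fit.tree)   # xerror versus tree size, with one-standard-error bars
printcp(fit.tree)  # the same cp table accessed above via fit.tree$cp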
3. Neural network model
> library(neuralnet)
> Boston2 = Boston
> for(i in 1:ncol(Boston2))if(!is.numeric(Boston2[,i]))Boston2[,i]=as.numeric(Boston[,i])
> max1 = apply(Boston2,2,max)
> min1 = apply(Boston2,2,min)
> sdat = scale(Boston2,center=min1,scale=max1-min1)
> sdat=as.data.frame(sdat)
> View(sdat)
> set.seed(1234)
> i=sample(1:nrow(Boston2),round(nrow(Boston2)*0.7))
> Boston2.train = sdat[i,]
> Boston2.test = sdat[-i,]
> vname = names(Boston2.train)
> form = as.formula(paste('medv~',paste(vname[!vname %in% "medv"],collapse='+')))
> form
medv ~ crim + zn + indus + chas + nox + rm + age + dis + rad +
tax + ptratio + black + lstat
> fit.nn = neuralnet(form,data=Boston2.train,hidden=c(3,2),linear.output=T) # linear.output=T for regression (identity activation on the output node)
> plot(fit.nn)
> str(Boston.test)
'data.frame': 152 obs. of 14 variables:
$ crim : num 0.0273 0.0299 0.1446 0.6274 1.0539 ...
$ zn : num 0 0 12.5 0 0 0 0 0 0 0 ...
$ indus : num 7.07 2.18 7.87 8.14 8.14 8.14 8.14 8.14 8.14 8.14 ...
$ chas : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
$ nox : num 0.469 0.458 0.524 0.538 0.538 0.538 0.538 0.538 0.538 0.538 ...
$ rm : num 7.18 6.43 6.17 5.83 5.93 ...
$ age : num 61.1 58.7 96.1 56.5 29.3 98.1 89.2 91.7 85.7 90.3 ...
$ dis : num 4.97 6.06 5.95 4.5 4.5 ...
$ rad : Factor w/ 9 levels "1","2","3","4",..: 2 3 5 4 4 4 4 4 4 4 ...
$ tax : num 242 222 311 307 307 307 307 307 307 307 ...
$ ptratio: num 17.8 18.7 15.2 21 21 21 21 21 21 21 ...
$ black : num 393 394 397 396 387 ...
$ lstat : num 4.03 5.21 19.15 8.47 6.58 ...
$ medv : num 34.7 28.7 27.1 19.9 23.1 13.6 19.6 15.2 13.9 16.6 ...
> pred=compute(fit.nn,Boston2.test[,1:13]) # the 14th column is the target medv; compute() runs the fitted network on new cases
> summary(pred)
Length Class Mode
neurons 3 -none- list
net.result 152 -none- numeric
> yhat.nn = pred$net.result*(max(Boston2$medv)-min(Boston2$medv))+min(Boston2$medv) # back-transformation: rescale the [0,1] predictions to the original medv scale
> head(cbind(Boston2[i,14],yhat.nn),50)
[,1] [,2]
2 31.6 21.17115946
3 23.8 30.30760801
10 28.2 20.06686203
13 21.6 21.29808480
14 16.1 20.28145030
18 23.8 18.30584489
24 36.2 16.10587573
26 21.2 16.16710531
27 17.1 17.71479369
28 20.9 16.71171622
33 31.2 12.87970542
35 20.7 14.84095463
36 17.8 22.05195863
42 14.1 33.16705465
43 15.6 26.86733582
46 17.2 21.39641163
47 14.0 20.62187769
49 19.2 13.55437741
51 22.0 21.63602446
57 18.7 23.04343250
59 19.4 23.33368068
72 15.6 21.68452217
75 20.0 22.95006783
76 18.2 21.37727858
81 19.5 27.88596126
90 11.5 31.97882170
94 29.6 24.56223312
98 8.4 41.94592034
102 8.5 23.20282812
103 19.6 27.89450422
104 28.7 19.47439349
105 21.4 19.90631048
111 11.8 20.49063003
120 23.3 20.15009581
123 26.6 19.19281571
128 21.7 16.99190292
135 20.6 14.54799220
148 20.3 13.09572781
150 21.4 15.35938684
151 13.3 18.92860416
161 50.0 31.83885454
165 24.8 20.14597079
167 29.8 44.60929619
169 23.2 20.70938882
170 15.3 20.60174521
171 31.7 19.12045877
172 22.1 19.63836241
173 27.5 20.36004622
176 22.8 32.60724125
177 26.6 24.24612266
Warning message:
In cbind(Boston2[i, 14], yhat.nn) :
number of rows of result is not a multiple of vector length (arg 1)
(The warning occurs because Boston2[i, 14] selects the 354 training-set targets while yhat.nn holds the 152 test-set predictions; the matching observed values would be Boston2[-i, 14].)
> Boston2.test$medv=Boston2.test$medv*(max(Boston2$medv)-min(Boston2$medv))+min(Boston2$medv)#back transformation
> mean((Boston2.test$medv-yhat.nn)^2)
[1] 19.34365205
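The scaling and back-transformation steps can be wrapped in small helper functions to avoid repeating the arithmetic (a sketch with hypothetical helper names, assuming the min1 and max1 vectors computed above):

# min-max scaling to [0,1] and its inverse (hypothetical helpers)
minmax     <- function(x, lo, hi) (x - lo)/(hi - lo)
inv_minmax <- function(z, lo, hi) z*(hi - lo) + lo
# e.g. the back-transformation above is equivalent to:
# yhat.nn <- inv_minmax(pred$net.result, min1['medv'], max1['medv'])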
4. Random forest
> library(randomForest)
> library(MASS)
> str(Boston)
'data.frame': 506 obs. of 14 variables:
$ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
$ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
$ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
$ chas : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
$ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
$ rm : num 6.58 6.42 7.18 7 7.15 ...
$ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
$ dis : num 4.09 4.97 4.97 6.06 6.06 ...
$ rad : Factor w/ 9 levels "1","2","3","4",..: 1 2 2 3 3 3 5 5 5 5 ...
$ tax : num 296 242 242 222 222 222 311 311 311 311 ...
$ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
$ black : num 397 397 393 395 397 ...
$ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
$ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
> set.seed(1234)
> i=sample(1:nrow(Boston),round(nrow(Boston)*0.7))
> Boston.train = Boston[i,]
> Boston.test = Boston[-i,]
> fit.rf = randomForest(medv~., data=Boston.train, ntree=100, mtry=5, importance = T, na.action=na.omit)
> fit.rf
Call:
randomForest(formula = medv ~ ., data = Boston.train, ntree = 100, mtry = 5, importance = T, na.action = na.omit)
Type of random forest: regression
Number of trees: 100
No. of variables tried at each split: 5
Mean of squared residuals: 11.60448807
% Var explained: 85.6
> summary(fit.rf)
Length Class Mode
call 7 -none- call
type 1 -none- character
predicted 354 -none- numeric
mse 100 -none- numeric
rsq 100 -none- numeric
oob.times 354 -none- numeric
importance 26 -none- numeric
importanceSD 13 -none- numeric
localImportance 0 -none- NULL
proximity 0 -none- NULL
ntree 1 -none- numeric
mtry 1 -none- numeric
forest 11 -none- list
coefs 0 -none- NULL
y 354 -none- numeric
test 0 -none- NULL
inbag 0 -none- NULL
terms 3 terms call
> head(fit.rf$predicted,30)
58 315 308 314 433 321
31.38358586 22.70753968 28.77652778 21.11594595 20.61043860 23.92682540
5 117 332 256 345 270
32.56341667 20.64287500 20.64851351 20.71283333 30.29218391 21.12653153
140 456 144 412 141 131
15.02797980 15.33454167 14.36800926 14.37462963 15.01537037 19.89743590
92 114 154 147 77 20
23.40651042 19.04477778 15.95588235 16.05761261 21.11859649 19.79196970
106 390 253 439 398 22
19.35319820 10.55704563 31.93869369 9.22796875 12.18531532 18.03844203
> importance(fit.rf,type=1)
%IncMSE
crim 6.3818661727
zn 1.5083317623
indus 4.9357684617
chas 0.6538995939
nox 6.8146688870
rm 15.9443905699
age 3.6616801321
dis 7.0604314940
rad 6.1703029034
tax 3.9762473150
ptratio 7.1788230837
black 3.4813158641
lstat 13.5182724091
> yhat.rf=predict(fit.rf,newdata=Boston.test,type='response')
> mean((Boston.test$medv-yhat.rf)^2)
[1] 12.41384092
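importance(fit.rf, type=1) above reports the permutation importance of each predictor (%IncMSE); randomForest also provides a graphical version of the same information:

varImpPlot(fit.rf, type=1)  # dot chart of %IncMSE for each predictor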
Test-set prediction mean squared error (PMSE) by model:

Model | PMSE
---|---
Linear regression | 23.68968
Regression tree | 24.17478
Neural network | 19.34365205
Random forest | 12.41384092
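The table can also be built programmatically (a sketch, assuming the yhat.* vectors and test sets from the sections above are still in the workspace; note that in this session the linear regression used a different random split than the other three models, so a strictly fair comparison would fix one split before fitting all four):

pmse <- c(regression    = mean((Boston.test$medv  - yhat.reg)^2),
          tree          = mean((Boston.test$medv  - yhat.tree)^2),
          neural.net    = mean((Boston2.test$medv - yhat.nn)^2),
          random.forest = mean((Boston.test$medv  - yhat.rf)^2))
sort(pmse)  # smaller is better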
Scatter plots of observed versus predicted values
> par(mfrow=c(2,2))
> plot(Boston.test$medv,yhat.reg,xlab='Observed',ylab='Predicted',main='Regression',xlim=c(0,55),ylim=c(0,55))
> abline(0,1,lty=2)
> plot(Boston.test$medv,yhat.tree,xlab='Observed',ylab='Predicted',main='Decision tree',xlim=c(0,55),ylim=c(0,55))
> abline(0,1,lty=2)
> plot(Boston.test$medv,yhat.nn,xlab='Observed',ylab='Predicted',main='Neural Network',xlim=c(0,55),ylim=c(0,55))
> abline(0,1,lty=2)
> plot(Boston.test$medv,yhat.rf,xlab='Observed',ylab='Predicted',main='Random Forest',xlim=c(0,55),ylim=c(0,55))
> abline(0,1,lty=2)
-> The random forest panel clusters most tightly around the 45-degree diagonal, so it can be said to give the best predictions among the four models.
Source: Data Mining (데이터마이닝), co-authored by 장영재, 김현중, and 조형준.