
Continuous target variables

When the target variable is continuous, the prediction mean squared error (PMSE: Prediction Mean Squared Error) is the measure most commonly used for a model's predictive accuracy:

 

$\text{PMSE} = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$

 

The smaller the gap between the observed values ($y_i$) and the predicted values ($\hat{y}_i$), the better the prediction, so a smaller PMSE means higher predictive accuracy. Likewise, if a scatter plot of observed (horizontal axis) against predicted (vertical axis) values clusters tightly around the 45° diagonal, the model can be said to predict well.
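As a one-line sketch in R (the helper name pmse is ours, not from the textbook):

> pmse = function(y, yhat) mean((y - yhat)^2)  # PMSE: average squared gap between observed and predicted

Below, the same quantity is simply written out as mean((observed - predicted)^2) for each model.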

 

The Boston housing data: Housing Values in Suburbs of Boston

(Source: http://127.0.0.1:31866/library/MASS/html/Boston.html — the help page for MASS::Boston served by the local R help server)

 

| Variable | Type | Description |
|----------|------|-------------|
| crim | numeric | per capita crime rate by town |
| zn | numeric | proportion of residential land zoned for lots over 25,000 sq.ft. |
| indus | numeric | proportion of non-retail business acres per town |
| chas | categorical (integer) | Charles River dummy variable (1 if tract bounds river; 0 otherwise) |
| nox | numeric | nitrogen oxides concentration (parts per 10 million) |
| rm | numeric | average number of rooms per dwelling |
| age | numeric | proportion of owner-occupied units built prior to 1940 |
| dis | numeric | weighted mean of distances to five Boston employment centres |
| rad | categorical (integer) | index of accessibility to radial highways |
| tax | numeric | full-value property-tax rate per $10,000 |
| ptratio | numeric | pupil-teacher ratio by town |
| black | numeric | 1000(Bk − 0.63)² where Bk is the proportion of blacks by town |
| lstat | numeric | lower status of the population (percent) |
| medv (target) | numeric | median value of owner-occupied homes in $1000s |

<Data preparation>

> library(MASS)

> str(Boston)

'data.frame':   506 obs. of  15 variables:

 $ crim    : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...

 $ zn      : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...

 $ indus   : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...

 $ chas    : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...

 $ nox     : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...

 $ rm      : num  6.58 6.42 7.18 7 7.15 ...

 $ age     : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...

 $ dis     : num  4.09 4.97 4.97 6.06 6.06 ...

 $ rad     : Factor w/ 9 levels "1","2","3","4",..: 1 2 2 3 3 3 5 5 5 5 ...

 $ tax     : num  296 242 242 222 222 222 311 311 311 311 ...

 $ ptratio : num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...

 $ black   : num  397 397 393 395 397 ...

 $ lstat   : num  4.98 9.14 4.03 2.94 5.33 ...

 $ medv    : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...

 $ medv.hat: num  23.7 23.7 32.4 32.4 32.4 ...

> dim(Boston)

[1] 506  15

> Boston<-Boston[,-15]

 

> str(Boston)

'data.frame':   506 obs. of  14 variables:

 $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...

 $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...

 $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...

 $ chas   : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...

 $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...

 $ rm     : num  6.58 6.42 7.18 7 7.15 ...

 $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...

 $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...

 $ rad    : Factor w/ 9 levels "1","2","3","4",..: 1 2 2 3 3 3 5 5 5 5 ...

 $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...

 $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...

 $ black  : num  397 397 393 395 397 ...

 $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...

 $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
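Note: a freshly loaded MASS::Boston has 14 columns, and chas and rad are integer vectors. The medv.hat column and the factor versions of chas and rad in the str() output above carry over from an earlier session; the factors were presumably created along these lines:

> Boston$chas = factor(Boston$chas)
> Boston$rad = factor(Boston$rad)

This matters below: because rad is a factor, lm() expands it into the dummy variables rad2, ..., rad24 seen in the coefficient table.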

 



1. Linear regression model

> i=sample(1:nrow(Boston),round(nrow(Boston)*0.7))

> Boston.train = Boston[i,] #70% for training data

> Boston.test = Boston[-i,] #30% for test data

> fit.reg = lm(medv~.,data = Boston.train) # fit the linear regression model

> summary(fit.reg) # model summary

 

Call:

lm(formula = medv ~ ., data = Boston.train)

 

Residuals:

    Min      1Q  Median      3Q     Max

-9.9375 -2.7260 -0.4588  1.7937 24.3220

 

Coefficients:

              Estimate Std. Error t value Pr(>|t|)   

(Intercept)  41.027525   6.644569   6.175 1.93e-09 ***

crim         -0.098730   0.052682  -1.874 0.061794 . 

zn            0.059161   0.016208   3.650 0.000304 ***

indus         0.073377   0.075763   0.969 0.333492   

chas1         2.314962   1.023899   2.261 0.024409 * 

nox         -18.307923   4.638939  -3.947 9.67e-05 ***

rm            3.014437   0.498806   6.043 4.04e-09 ***

age          -0.002066   0.016414  -0.126 0.899922   

dis          -1.564686   0.235506  -6.644 1.25e-10 ***

rad2          1.485691   1.707949   0.870 0.384999   

rad3          5.371463   1.513283   3.550 0.000441 ***

rad4          2.763349   1.365438   2.024 0.043791 * 

rad5          3.407697   1.380288   2.469 0.014057 * 

rad6          1.840557   1.687536   1.091 0.276204   

rad7          3.790169   1.832307   2.069 0.039362 * 

rad8          4.691472   1.677890   2.796 0.005473 **

rad24         8.224882   2.114575   3.890 0.000121 ***

tax          -0.009025   0.004555  -1.981 0.048382 * 

ptratio      -1.021842   0.175757  -5.814 1.43e-08 ***

black         0.008734   0.003633   2.404 0.016749 * 

lstat        -0.596699   0.059397 -10.046  < 2e-16 ***

---

Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

 

Residual standard error: 4.658 on 333 degrees of freedom

Multiple R-squared:  0.7468,    Adjusted R-squared:  0.7316

F-statistic: 49.11 on 20 and 333 DF,  p-value: < 2.2e-16

 

> fit.step.reg = step(fit.reg,direction = 'both', trace = FALSE) # stepwise selection; with trace=TRUE the AIC at each step is printed

> summary(fit.step.reg) # summary of the selected model

 

Call:

lm(formula = medv ~ crim + zn + chas + nox + rm + dis + rad +

    tax + ptratio + black + lstat, data = Boston.train)

 

Residuals:

    Min      1Q  Median      3Q     Max

-9.8479 -2.6319 -0.4838  1.7225 24.3723

 

Coefficients:

              Estimate Std. Error t value Pr(>|t|)   

(Intercept)  40.432105   6.597006   6.129 2.49e-09 ***

crim         -0.102050   0.052455  -1.945 0.052555 . 

zn            0.058012   0.015903   3.648 0.000307 ***

chas1         2.428119   1.014994   2.392 0.017296 * 

nox         -17.025847   4.237976  -4.017 7.27e-05 ***

rm            2.969435   0.489146   6.071 3.45e-09 ***

dis          -1.604642   0.221607  -7.241 3.08e-12 ***

rad2          1.763045   1.681098   1.049 0.295051   

rad3          5.319042   1.507369   3.529 0.000476 ***

rad4          2.871626   1.358651   2.114 0.035290 * 

rad5          3.377174   1.377514   2.452 0.014731 * 

rad6          1.669858   1.672013   0.999 0.318656   

rad7          3.831541   1.823028   2.102 0.036322 * 

rad8          4.551858   1.667065   2.730 0.006659 **

rad24         7.906985   2.076704   3.807 0.000167 ***

tax          -0.007384   0.004213  -1.753 0.080567 . 

ptratio      -0.996843   0.173266  -5.753 1.98e-08 ***

black         0.008481   0.003612   2.348 0.019458 *  

lstat        -0.593839   0.055467 -10.706  < 2e-16 ***

---

Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

 

Residual standard error: 4.65 on 335 degrees of freedom

Multiple R-squared:  0.7461,    Adjusted R-squared:  0.7324

F-statistic: 54.69 on 18 and 335 DF,  p-value: < 2.2e-16

 

> fit.step.reg$anova # trace of the stepwise search: age and indus were removed

     Step Df   Deviance Resid. Df Resid. Dev      AIC

1         NA         NA       333   7223.934 1109.614

2   - age  1  0.3436196       334   7224.278 1107.631

3 - indus  1 20.2559598       335   7244.534 1106.622

 

> yhat.reg = predict(fit.step.reg,newdata=Boston.test,type='response') # predict the target for the test data; type='response' gives predictions on the response scale

> mean((Boston.test$medv-yhat.reg)^2) # prediction mean squared error (PMSE)

[1] 23.68968
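Note that, unlike sections 2-4 below, no set.seed() call precedes sample() here, so this 70/30 split is not reproducible and differs from the seed-1234 split used by the other models; the PMSE of 23.68968 will vary from run to run.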

 

 

 

 

 

2. Regression tree model

> library(rpart)

> set.seed(1234)

> i=sample(1:nrow(Boston),round(nrow(Boston)*0.7))

> Boston.train = Boston[i,]

> Boston.test = Boston[-i,]

 

> my.control = rpart.control(xval=10, cp=0, minsplit=nrow(Boston.train)*0.05) # 10-fold cross-validation; cp=0 grows the full tree; a node needs at least 5% of the training rows to be split

> fit.tree = rpart(medv~.,data=Boston.train, method='anova',control=my.control) # method='anova' fits a regression tree; method='class' a classification tree

 

> which.min(fit.tree$cp[,4])

17

17

> fit.tree$cp[17]

[1] 0.00193237

> fit.tree$cp[17,]

         CP      nsplit   rel error      xerror        xstd

 0.00193237 17.00000000  0.12723520  0.23200911  0.04362543

> fit.tree$cp[17,1]

[1] 0.00193237

> ii = which.min(fit.tree$cp[,4])

> fit.prune.tree = prune(fit.tree,cp=fit.tree$cp[ii,1])

 

> plot(fit.prune.tree,uniform=T,margin=0.1)

> text(fit.prune.tree,col='blue',cex=0.7)





> yhat.tree = predict(fit.prune.tree,newdata=Boston.test,type='vector')

> mean((Boston.test$medv - yhat.tree)^2)

[1] 24.17478
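Choosing the cp value that minimizes the cross-validated error (xerror), as above, is one convention; a slightly more conservative alternative is the 1-SE rule, which takes the smallest tree whose xerror is within one standard error of the minimum. A sketch (fit.tree$cp above works because R's $ partially matches the cptable component):

> cp.tab = fit.tree$cptable
> i.min = which.min(cp.tab[,'xerror'])
> thresh = cp.tab[i.min,'xerror'] + cp.tab[i.min,'xstd'] # minimum xerror plus one SE
> cp.1se = cp.tab[which(cp.tab[,'xerror'] <= thresh)[1], 'CP'] # first row = smallest tree under the threshold
> fit.prune.1se = prune(fit.tree, cp=cp.1se)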

 

 

 

 

 

3. Neural network model

> library(neuralnet)

 

> Boston2 = Boston

> for(i in 1:ncol(Boston2))if(!is.numeric(Boston2[,i]))Boston2[,i]=as.numeric(Boston[,i]) # convert the factor columns (chas, rad) to numeric; neuralnet needs an all-numeric data set

 

> max1 = apply(Boston2,2,max)

> min1 = apply(Boston2,2,min)

 

> sdat = scale(Boston2,center=min1,scale=max1-min1) # min-max normalization to [0,1]

> sdat=as.data.frame(sdat)

> View(sdat)
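Here scale() with center=min1 and scale=max1-min1 implements min-max normalization, x' = (x − min)/(max − min), mapping every column into [0, 1]; inputs on a common bounded scale generally help neuralnet's training converge.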

 

> set.seed(1234)

> i=sample(1:nrow(Boston2),round(nrow(Boston2)*0.7))

> Boston2.train = sdat[i,]

> Boston2.test = sdat[-i,]

 

> vname = names(Boston2.train)

> form = as.formula(paste('medv~',paste(vname[!vname %in% "medv"],collapse='+')))

> form

medv ~ crim + zn + indus + chas + nox + rm + age + dis + rad +

    tax + ptratio + black + lstat
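The formula is assembled with paste() because neuralnet (at least in older versions) does not expand the medv ~ . shorthand the way lm() does.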

> fit.nn = neuralnet(form,data=Boston2.train,hidden=c(3,2),linear.output=T) # two hidden layers with 3 and 2 units; linear.output=TRUE for regression (no activation on the output node)

> plot(fit.nn) 

 

> str(Boston.test)

'data.frame':   152 obs. of  14 variables:

 $ crim   : num  0.0273 0.0299 0.1446 0.6274 1.0539 ...

 $ zn     : num  0 0 12.5 0 0 0 0 0 0 0 ...

 $ indus  : num  7.07 2.18 7.87 8.14 8.14 8.14 8.14 8.14 8.14 8.14 ...

 $ chas   : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...

 $ nox    : num  0.469 0.458 0.524 0.538 0.538 0.538 0.538 0.538 0.538 0.538 ...

 $ rm     : num  7.18 6.43 6.17 5.83 5.93 ...

 $ age    : num  61.1 58.7 96.1 56.5 29.3 98.1 89.2 91.7 85.7 90.3 ...

 $ dis    : num  4.97 6.06 5.95 4.5 4.5 ...

 $ rad    : Factor w/ 9 levels "1","2","3","4",..: 2 3 5 4 4 4 4 4 4 4 ...

 $ tax    : num  242 222 311 307 307 307 307 307 307 307 ...

 $ ptratio: num  17.8 18.7 15.2 21 21 21 21 21 21 21 ...

 $ black  : num  393 394 397 396 387 ...

 $ lstat  : num  4.03 5.21 19.15 8.47 6.58 ...

 $ medv   : num  34.7 28.7 27.1 19.9 23.1 13.6 19.6 15.2 13.9 16.6 ...

> pred=compute(fit.nn,Boston2.test[,1:13]) # column 14 is the target medv; compute() applies the fitted network to new cases

> summary(pred)

           Length Class  Mode  

neurons      3    -none- list  

net.result 152    -none- numeric

> yhat.nn = pred$net.result*(max(Boston2$medv)-min(Boston2$medv))+min(Boston2$medv) # back-transformation: rescale the predictions from [0,1] back to the original medv units

> head(cbind(Boston2[i,14],yhat.nn),50) # intended as observed vs. predicted, but see the note after the warning below

    [,1]        [,2]

2   31.6 21.17115946

3   23.8 30.30760801

10  28.2 20.06686203

13  21.6 21.29808480

14  16.1 20.28145030

18  23.8 18.30584489

24  36.2 16.10587573

26  21.2 16.16710531

27  17.1 17.71479369

28  20.9 16.71171622

33  31.2 12.87970542

35  20.7 14.84095463

36  17.8 22.05195863

42  14.1 33.16705465

43  15.6 26.86733582

46  17.2 21.39641163

47  14.0 20.62187769

49  19.2 13.55437741

51  22.0 21.63602446

57  18.7 23.04343250

59  19.4 23.33368068

72  15.6 21.68452217

75  20.0 22.95006783

76  18.2 21.37727858

81  19.5 27.88596126

90  11.5 31.97882170

94  29.6 24.56223312

98   8.4 41.94592034

102  8.5 23.20282812

103 19.6 27.89450422

104 28.7 19.47439349

105 21.4 19.90631048

111 11.8 20.49063003

120 23.3 20.15009581

123 26.6 19.19281571

128 21.7 16.99190292

135 20.6 14.54799220

148 20.3 13.09572781

150 21.4 15.35938684

151 13.3 18.92860416

161 50.0 31.83885454

165 24.8 20.14597079

167 29.8 44.60929619

169 23.2 20.70938882

170 15.3 20.60174521

171 31.7 19.12045877

172 22.1 19.63836241

173 27.5 20.36004622

176 22.8 32.60724125

177 26.6 24.24612266

Warning message:

In cbind(Boston2[i, 14], yhat.nn) :

  number of rows of result is not a multiple of vector length (arg 1)
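This warning flags a real mismatch: Boston2[i, 14] contains the 354 training-set medv values, while yhat.nn holds the 152 test-set predictions, so the observed and predicted columns above do not correspond row by row (cbind() cut the longer vector down to the matrix's 152 rows, as the warning indicates). A corrected comparison pairs the test rows with the predictions:

> head(cbind(Boston2[-i,14], yhat.nn), 10) # test-set observed medv vs. back-transformed predictions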

 

> Boston2.test$medv=Boston2.test$medv*(max(Boston2$medv)-min(Boston2$medv))+min(Boston2$medv)#back transformation

> mean((Boston2.test$medv-yhat.nn)^2)

[1] 19.34365205

 

 

 


4. Random forest

> library(randomForest)

> library(MASS)

> str(Boston)

'data.frame':   506 obs. of  14 variables:

 $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...

 $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...

 $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...

 $ chas   : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...

 $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...

 $ rm     : num  6.58 6.42 7.18 7 7.15 ...

 $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...

 $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...

 $ rad    : Factor w/ 9 levels "1","2","3","4",..: 1 2 2 3 3 3 5 5 5 5 ...

 $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...

 $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...

 $ black  : num  397 397 393 395 397 ...

 $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...

 $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...

 

> set.seed(1234)

> i=sample(1:nrow(Boston),round(nrow(Boston)*0.7))

> Boston.train = Boston[i,]

> Boston.test = Boston[-i,]

 

> fit.rf = randomForest(medv~., data=Boston.train, ntree=100, mtry=5, importance = T, na.action=na.omit) # 100 trees; mtry=5 candidate variables per split; importance=T computes permutation importance

> fit.rf

 

Call:

 randomForest(formula = medv ~ ., data = Boston.train, ntree = 100,      mtry = 5, importance = T, na.action = na.omit)

               Type of random forest: regression

                     Number of trees: 100

No. of variables tried at each split: 5

 

          Mean of squared residuals: 11.60448807

                    % Var explained: 85.6

> summary(fit.rf)

                Length Class  Mode    

call              7    -none- call    

type              1    -none- character

predicted       354    -none- numeric 

mse             100    -none- numeric 

rsq             100    -none- numeric 

oob.times       354    -none- numeric 

importance       26    -none- numeric 

importanceSD     13    -none- numeric 

localImportance   0    -none- NULL    

proximity         0    -none- NULL    

ntree             1    -none- numeric 

mtry              1    -none- numeric 

forest           11    -none- list    

coefs             0    -none- NULL    

y               354    -none- numeric 

test              0    -none- NULL    

inbag             0    -none- NULL    

terms             3    terms  call    

> head(fit.rf$predicted,30)

         58         315         308         314         433         321

31.38358586 22.70753968 28.77652778 21.11594595 20.61043860 23.92682540

          5         117         332         256         345         270

32.56341667 20.64287500 20.64851351 20.71283333 30.29218391 21.12653153

        140         456         144         412         141         131

15.02797980 15.33454167 14.36800926 14.37462963 15.01537037 19.89743590

         92         114         154         147          77          20

23.40651042 19.04477778 15.95588235 16.05761261 21.11859649 19.79196970

        106         390         253         439         398          22

19.35319820 10.55704563 31.93869369  9.22796875 12.18531532 18.03844203

> importance(fit.rf,type=1) # type=1: permutation importance (%IncMSE)

              %IncMSE

crim     6.3818661727

zn       1.5083317623

indus    4.9357684617

chas     0.6538995939

nox      6.8146688870

rm      15.9443905699

age      3.6616801321

dis      7.0604314940

rad      6.1703029034

tax      3.9762473150

ptratio  7.1788230837

black    3.4813158641

lstat   13.5182724091
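The same ranking can be shown graphically with randomForest's built-in helper; a minimal sketch:

> varImpPlot(fit.rf, type=1, main='Variable importance (%IncMSE)') # rm and lstat dominate, matching the table above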

> yhat.rf=predict(fit.rf,newdata=Boston.test,type='response')

> mean((Boston.test$medv-yhat.rf)^2)

[1] 12.41384092

 

 


Model comparison: prediction mean squared error (PMSE)

| Model | PMSE |
|-------|------|
| Linear regression | 23.68968 |
| Regression tree | 24.17478 |
| Neural network | 19.34365205 |
| Random forest | 12.41384092 |
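Each PMSE above comes from a single 70/30 split (and the regression split was not seeded), so the figures are indicative rather than definitive; repeating the split several times or cross-validating would give a more stable comparison. Note too that yhat.reg was computed for the unseeded test set of section 1, while Boston.test was later redefined with seed 1234, so the Regression panel below pairs those predictions with a different set of observations.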

 


Scatter plots of observed vs. predicted values

> par(mfrow=c(2,2))

 

> plot(Boston.test$medv,yhat.reg,xlab='Observed',ylab='Predicted',main='Regression',xlim=c(0,55),ylim=c(0,55))

> abline(0,1,lty=2)

 

> plot(Boston.test$medv,yhat.tree,xlab='Observed',ylab='Predicted',main='Decision tree',xlim=c(0,55),ylim=c(0,55))

> abline(0,1,lty=2)

 

> plot(Boston.test$medv,yhat.nn,xlab='Observed',ylab='Predicted',main='Neural Network',xlim=c(0,55),ylim=c(0,55))

> abline(0,1,lty=2)

 

> plot(Boston.test$medv,yhat.rf,xlab='Observed',ylab='Predicted',main='Random Forest',xlim=c(0,55),ylim=c(0,55))

> abline(0,1,lty=2) 

 

-> The random forest panel is the most tightly clustered around the 45° diagonal, so among the four models it shows the best predictive performance on this test set.



Source: Data Mining (데이터마이닝), by 장영재, 김현중, and 조형준.

 
