'library(MASS)' 태그의 글 목록

'library(MASS)'에 해당되는 글 1건

2016.09.14 제2장 회귀모형 - 선형회귀 연습

제2장 회귀모형 - 선형회귀 연습

KNOU/2 데이터마이닝 2016. 9. 14. 10:10

목표변수가 연속형인 경우 -> 선형 회귀모델, ex) 광고비 투입 대비 매출액

목표변수가 두 개의 범주를 가진 이항형인 경우 -> 로지스틱 회귀모형, ex) 좋다1, 나쁘다0

보스턴 하우징 데이터 Housing Values in Suburbs of Boston

(출처: http://127.0.0.1:31866/library/MASS/html/Boston.html)

변수명	속성	변수 설명
crim	수치형(numeric)	per capita crime rate by town 타운별 1인당 범죄율
zn	수치형(numeric)	proportion of residential land zoned for lots over 25,000 sq.ft. 25,000평방피트를 초과하는 거주지역 비율
indus	수치형(numeric)	proportion of non-retail business acres per town. 비소매 사업지역의 토지 비율
chas	범주형(integer)	Charles River dummy variable (= 1 if tract bounds river; 0 otherwise). 찰스강 더비 변수 (강의 경계에 위치 = 1, 아니면 = 0)
nox	수치형(numeric)	nitrogen oxides concentration (parts per 10 million). 10ppm당 농축 일산화질소
rm	수치형(numeric)	average number of rooms per dwelling. 주택 1가구등 방의 평균 개수
age	수치형(numeric)	proportion of owner-occupied units built prior to 1940. 1940년 이전에 건축된 소유자 주택 비율
dis	수치형(numeric)	weighted mean of distances to five Boston employment centres. 5개의 보스턴 고용센터까지의 접근성 지수
rad	범주형(integer)	index of accessibility to radial highways. 방사형 도로까지의 접근성 지수
tax	수치형(numeric)	full-value property-tax rate per \$10,000. 10,000달러당 재산세율
ptratio	수치형(numeric)	pupil-teacher ratio by town. 타운별 학생/교사 비율
black	수치형(numeric)	1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town. 타운별 흑인의 비율
lstat	수치형(numeric)	lower status of the population (percent). 모집단의 하위계층의 비율
medv (목표변수)	수치형(numeric)	median value of owner-occupied homes in \$1000s. 본인 소유의 주택가격(중앙값)

1. 데이터 불러오기

> library(MASS)

> range(Boston$medv)

[1] 5 50

> stem(Boston$medv)

The decimal point is at the |

4 | 006

6 | 30022245

8 | 1334455788567

10 | 2224455899035778899

12 | 013567778011112333444455668888899

14 | 0111233445556689990001222344666667

16 | 01112234556677880111222344455567888889

18 | 01222334445555667778899990011112233333444444555566666778889999

20 | 0000011111223333444455566666677888990001122222444445566777777788999

22 | 00000001222223344555666667788889999000011111112222333344566777788889

24 | 001112333444455566777888800000000123

26 | 24456667011555599

28 | 01244567770011466889

30 | 111357801255667

32 | 0024579011223448

34 | 679991244

36 | 01224502369

38 | 78

40 | 37

42 | 38158

44 | 084

46 | 07

48 | 358

50 | 0000000000000000

> i=which(Boston$medv==50)#본인 소유의 주택가격(중앙값)

> Boston[i,]

crim zn indus chas nox rm age dis rad tax ptratio black lstat medv

162 1.46336 0 19.58 0 0.6050 7.489 90.8 1.9709 5 403 14.7 374.43 1.73 50

163 1.83377 0 19.58 1 0.6050 7.802 98.2 2.0407 5 403 14.7 389.61 1.92 50

164 1.51902 0 19.58 1 0.6050 8.375 93.9 2.1620 5 403 14.7 388.45 3.32 50

167 2.01019 0 19.58 0 0.6050 7.929 96.2 2.0459 5 403 14.7 369.30 3.70 50

187 0.05602 0 2.46 0 0.4880 7.831 53.6 3.1992 3 193 17.8 392.63 4.45 50

196 0.01381 80 0.46 0 0.4220 7.875 32.0 5.6484 4 255 14.4 394.23 2.97 50

205 0.02009 95 2.68 0 0.4161 8.034 31.9 5.1180 4 224 14.7 390.55 2.88 50

226 0.52693 0 6.20 0 0.5040 8.725 83.0 2.8944 8 307 17.4 382.00 4.63 50

258 0.61154 20 3.97 0 0.6470 8.704 86.9 1.8010 5 264 13.0 389.70 5.12 50

268 0.57834 20 3.97 0 0.5750 8.297 67.0 2.4216 5 264 13.0 384.54 7.44 50

284 0.01501 90 1.21 1 0.4010 7.923 24.8 5.8850 1 198 13.6 395.52 3.16 50

369 4.89822 0 18.10 0 0.6310 4.970 100.0 1.3325 24 666 20.2 375.52 3.26 50

370 5.66998 0 18.10 1 0.6310 6.683 96.8 1.3567 24 666 20.2 375.33 3.73 50

371 6.53876 0 18.10 1 0.6310 7.016 97.5 1.2024 24 666 20.2 392.05 2.96 50

372 9.23230 0 18.10 0 0.6310 6.216 100.0 1.1691 24 666 20.2 366.15 9.53 50

373 8.26725 0 18.10 1 0.6680 5.875 89.6 1.1296 24 666 20.2 347.88 8.88 50

> boston=Boston[-i,] #최대값 50인 관측치 16개를 찾아 제거

> boston$chas = factor(boston$chas) #범주형으로 변경

> boston$rad = factor(boston$rad) #범주형으로 변경

> table(boston$rad)

1 2 3 4 5 6 7 8 24

19 24 37 108 109 26 17 23 127

> boston$chas <- as.factor(boston$chas)

> boston$rad <- as.factor(boston$rad)

> class(boston$rad);class(boston$chas)
[1] "factor"
[1] "factor"

[참고] 아래와 같은 방법으로 이용하면 모든 변수를 수치로 변경할 수 있다.

> for(i in 1:ncol(boston))if(!is.numeric(boston[,i])) boston[,i]=as.numeric(boston[,i])
> str(boston)
'data.frame':   490 obs. of 14 variables:
$ crim   : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
$ zn     : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
$ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
$ chas   : num 1 1 1 1 1 1 1 1 1 1 ...
$ nox    : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
$ rm     : num 6.58 6.42 7.18 7 7.15 ...
$ age    : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
$ dis    : num 4.09 4.97 4.97 6.06 6.06 ...
$ rad    : num 1 2 2 3 3 3 5 5 5 5 ...
$ tax    : num 296 242 242 222 222 222 311 311 311 311 ...
$ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
$ black : num 397 397 393 395 397 ...
$ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
$ medv   : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...

2. 선형 회귀 모형 만들기

#선형회귀모형 만들기

> fit1 = lm(medv~.,data=boston) #목표변수 = medv, 선형회귀모형 함수, ~.는 목표 변수를 제외한 모든 변수를 입력변수로 사용

> summary(fit1)

Call:

lm(formula = medv ~ ., data = boston)

Residuals:

Min 1Q Median 3Q Max

-9.5220 -2.2592 -0.4275 1.6778 15.2894

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 30.120918 4.338656 6.942 1.29e-11 ***

crim -0.105648 0.025640 -4.120 4.47e-05 ***

zn 0.044104 0.011352 3.885 0.000117 ***

indus -0.046743 0.051044 -0.916 0.360274

chas1 0.158802 0.736742 0.216 0.829435

nox -11.576589 3.084187 -3.754 0.000196 ***

rm 3.543733 0.356605 9.937 < 2e-16 ***

age -0.026082 0.010531 -2.477 0.013613 *

dis -1.282095 0.160452 -7.991 1.05e-14 ***

rad2 2.548109 1.175012 2.169 0.030616 *

rad3 4.605849 1.064492 4.327 1.85e-05 ***

rad4 2.663393 0.950747 2.801 0.005299 **

rad5 3.077800 0.962725 3.197 0.001483 **

rad6 1.314892 1.157689 1.136 0.256624

rad7 4.864208 1.241760 3.917 0.000103 ***

rad8 5.772296 1.194221 4.834 1.82e-06 ***

rad24 6.195415 1.417826 4.370 1.53e-05 ***

tax -0.009396 0.003070 -3.061 0.002333 **

ptratio -0.828498 0.114436 -7.240 1.85e-12 ***

black 0.007875 0.002084 3.779 0.000178 ***

lstat -0.354606 0.041901 -8.463 3.36e-16 ***

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.671 on 469 degrees of freedom

Multiple R-squared: 0.7911, Adjusted R-squared: 0.7821

F-statistic: 88.78 on 20 and 469 DF, p-value: < 2.2e-16

또는 아래와 같은 방법도 가능하다

> names(boston)
[1] "crim" "zn" "indus" "chas" "nox" "rm" "age"
[8] "dis" "rad" "tax" "ptratio" "black" "lstat" "medv"
> bn <- names(boston)

> f <- as.formula(paste('medv~',paste(bn[!bn %in% 'medv'],collapse='+')))
> f
medv ~ crim + zn + indus + chas + nox + rm + age + dis + rad +
tax + ptratio + black + lstat
> fit2 <- lm(f,data=boston)
> summary(fit2)

Call:
lm(formula = f, data = boston)

Residuals:
Min 1Q Median 3Q Max
-9.5220 -2.2592 -0.4275 1.6778 15.2894

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) 30.120918   4.338656   6.942 1.29e-11 ***
crim         -0.105648   0.025640 -4.120 4.47e-05 ***
zn            0.044104   0.011352   3.885 0.000117 ***
indus        -0.046743   0.051044 -0.916 0.360274
chas2         0.158802   0.736742   0.216 0.829435
nox         -11.576589   3.084187 -3.754 0.000196 ***
rm            3.543733   0.356605   9.937 < 2e-16 ***
age          -0.026082   0.010531 -2.477 0.013613 *
dis          -1.282095   0.160452 -7.991 1.05e-14 ***
rad2          2.548109   1.175012   2.169 0.030616 *
rad3          4.605849   1.064492   4.327 1.85e-05 ***
rad4          2.663393   0.950747   2.801 0.005299 **
rad5          3.077800   0.962725   3.197 0.001483 **
rad6          1.314892   1.157689   1.136 0.256624
rad7          4.864208   1.241760   3.917 0.000103 ***
rad8          5.772296   1.194221   4.834 1.82e-06 ***
rad9          6.195415   1.417826   4.370 1.53e-05 ***
tax          -0.009396   0.003070 -3.061 0.002333 **
ptratio      -0.828498   0.114436 -7.240 1.85e-12 ***
black         0.007875   0.002084   3.779 0.000178 ***
lstat        -0.354606   0.041901 -8.463 3.36e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.671 on 469 degrees of freedom
Multiple R-squared: 0.7911, Adjusted R-squared: 0.7821
F-statistic: 88.78 on 20 and 469 DF, p-value: < 2.2e-16

#가장 적절한 모형 선택 위한 변수 선택

> fit.step = step(fit1,direction='both') #both는 단계적 선택법 적용

Start: AIC=1295.03

medv ~ crim + zn + indus + chas + nox + rm + age + dis + rad +

tax + ptratio + black + lstat

Df Sum of Sq RSS AIC

- chas 1 0.63 6321.5 1293.1

- indus 1 11.30 6332.2 1293.9

<none> 6320.9 1295.0

- age 1 82.67 6403.5 1299.4

- tax 1 126.28 6447.1 1302.7

- nox 1 189.88 6510.7 1307.5

- black 1 192.42 6513.3 1307.7

- zn 1 203.44 6524.3 1308.5

- crim 1 228.82 6549.7 1310.5

- rad 8 721.85 7042.7 1332.0

- ptratio 1 706.41 7027.3 1344.9

- dis 1 860.51 7181.4 1355.6

- lstat 1 965.26 7286.1 1362.7

- rm 1 1330.92 7651.8 1386.7

Step: AIC=1293.08

medv ~ crim + zn + indus + nox + rm + age + dis + rad + tax +

ptratio + black + lstat

Df Sum of Sq RSS AIC

- indus 1 11.00 6332.5 1291.9

<none> 6321.5 1293.1

+ chas 1 0.63 6320.9 1295.0

- age 1 82.48 6404.0 1297.4

- tax 1 130.45 6451.9 1301.1

- nox 1 189.27 6510.8 1305.5

- black 1 193.59 6515.1 1305.9

- zn 1 203.76 6525.2 1306.6

- crim 1 230.58 6552.1 1308.6

- rad 8 738.26 7059.8 1331.2

- ptratio 1 719.40 7040.9 1343.9

- dis 1 861.64 7183.1 1353.7

- lstat 1 965.11 7286.6 1360.7

- rm 1 1333.37 7654.9 1384.9

Step: AIC=1291.93

medv ~ crim + zn + nox + rm + age + dis + rad + tax + ptratio +

black + lstat

Df Sum of Sq RSS AIC

<none> 6332.5 1291.9

+ indus 1 11.00 6321.5 1293.1

+ chas 1 0.32 6332.2 1293.9

- age 1 81.09 6413.6 1296.2

- tax 1 192.78 6525.3 1304.6

- black 1 196.55 6529.0 1304.9

- zn 1 220.63 6553.1 1306.7

- crim 1 225.50 6558.0 1307.1

- nox 1 239.09 6571.6 1308.1

- rad 8 791.09 7123.6 1333.6

- ptratio 1 732.81 7065.3 1343.6

- dis 1 857.27 7189.8 1352.1

- lstat 1 987.73 7320.2 1361.0

- rm 1 1380.21 7712.7 1386.5

> summary(fit.step) #최종모형, rad는 범주형 변수를 가변수로 변환한 것.#AIC가 가장 작은 변수가 단계적 선택법에 의해 변수들이 정의 됨

Call:

lm(formula = medv ~ crim + zn + nox + rm + age + dis + rad +

tax + ptratio + black + lstat, data = boston)

Residuals:

Min 1Q Median 3Q Max

-9.5200 -2.2850 -0.4688 1.7535 15.3972

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 30.252522 4.329907 6.987 9.64e-12 ***

crim -0.104568 0.025533 -4.095 4.96e-05 ***

zn 0.045510 0.011235 4.051 5.97e-05 ***

nox -12.366882 2.932651 -4.217 2.97e-05 ***

rm 3.583130 0.353644 10.132 < 2e-16 ***

age -0.025822 0.010514 -2.456 0.014412 *

dis -1.253903 0.157029 -7.985 1.08e-14 ***

rad2 2.387130 1.160735 2.057 0.040278 *

rad3 4.644091 1.062157 4.372 1.51e-05 ***

rad4 2.608777 0.944668 2.762 0.005977 **

rad5 3.116933 0.960550 3.245 0.001258 **

rad6 1.422890 1.150280 1.237 0.216705

rad7 4.868388 1.240114 3.926 9.94e-05 ***

rad8 5.872144 1.180865 4.973 9.26e-07 ***

rad24 6.420553 1.393304 4.608 5.24e-06 ***

tax -0.010571 0.002792 -3.787 0.000172 ***

ptratio -0.837356 0.113420 -7.383 7.08e-13 ***

black 0.007949 0.002079 3.823 0.000149 ***

lstat -0.357576 0.041718 -8.571 < 2e-16 ***

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.667 on 471 degrees of freedom

Multiple R-squared: 0.7907, Adjusted R-squared: 0.7827

F-statistic: 98.83 on 18 and 471 DF, p-value: < 2.2e-16

-> 결과 해석: 최초 만든 회귀함수 fit1 = lm(medv~.,data=boston)에서, 가장 적절한 모형 선택 위한 변수 선택을 위해 step 함수를 사용한다. fit.step = step(fit1,direction='both') 이 함수에서 both는 단계적 선택법 적용을 의미한다. 결과적으로 lm(formula = medv ~ crim + zn + nox + rm + age + dis + rad + tax + ptratio + black + lstat, data = boston)라는 모형이 만들어졌고, 최초 만든 모형 대비 indus, chas1 변수가 사라졌음을 알 수 있다. 또한 최종 모형에서 범주 rad2,3,4 등은 범주형 범수 중 특정 변수를 의미한다.

입력변수 crim의 회귀계수 추정치는 음수이므로 crim이 증가함에 따라 목표변수medv는 감소한다. nox 변수의 회귀곗수는 -12인데, nox 변수가 올라갈 때 마다 medv 값은 내려간다. nox 변수는 10ppm당 농축 일산화질소를 뜻한다.

rad변수는 9개 범주로 구성되어 있기 때문에 8개의 가변수가 생성되었다. 각 입력 변수의 t값의 절대값으 커서 대응하는 p-값은 0.05보다 작아서 유의하다고 할 수 있다. 단, rad6는 유의하지 않지만 다른 가변수가 유의하므로 제거되지 않고 여전히 모형에 포함된다.

R²은 79.07%로 적합한 선형 회귀모형으로 데이터를 설명할 수 있는 부분이 약 80%로 높고, F-검정의 p-value도 2.2e-16로 아주 작은 것도 모형이 적합하다는 것을 지지하다.

[참고] 단계적선택법(stepwise selection)의 AIC는 1291.93이다. 후진소거법과 전친선택법은??

후진소거법(backward elimination)의 AIC는 1291.93

> fit.step.back = step(fit1,direction='backward')

Start: AIC=1295.03

medv ~ crim + zn + indus + chas + nox + rm + age + dis + rad +

tax + ptratio + black + lstat

Df Sum of Sq RSS AIC

- chas 1 0.63 6321.5 1293.1

- indus 1 11.30 6332.2 1293.9

<none> 6320.9 1295.0

- age 1 82.67 6403.5 1299.4

- tax 1 126.28 6447.1 1302.7

- nox 1 189.88 6510.7 1307.5

- black 1 192.42 6513.3 1307.7

- zn 1 203.44 6524.3 1308.5

- crim 1 228.82 6549.7 1310.5

- rad 8 721.85 7042.7 1332.0

- ptratio 1 706.41 7027.3 1344.9

- dis 1 860.51 7181.4 1355.6

- lstat 1 965.26 7286.1 1362.7

- rm 1 1330.92 7651.8 1386.7

Step: AIC=1293.08

medv ~ crim + zn + indus + nox + rm + age + dis + rad + tax +

ptratio + black + lstat

Df Sum of Sq RSS AIC

- indus 1 11.00 6332.5 1291.9

<none> 6321.5 1293.1

- age 1 82.48 6404.0 1297.4

- tax 1 130.45 6451.9 1301.1

- nox 1 189.27 6510.8 1305.5

- black 1 193.59 6515.1 1305.9

- zn 1 203.76 6525.2 1306.6

- crim 1 230.58 6552.1 1308.6

- rad 8 738.26 7059.8 1331.2

- ptratio 1 719.40 7040.9 1343.9

- dis 1 861.64 7183.1 1353.7

- lstat 1 965.11 7286.6 1360.7

- rm 1 1333.37 7654.9 1384.9

Step: AIC=1291.93

medv ~ crim + zn + nox + rm + age + dis + rad + tax + ptratio +

black + lstat

Df Sum of Sq RSS AIC

<none> 6332.5 1291.9

- age 1 81.09 6413.6 1296.2

- tax 1 192.78 6525.3 1304.6

- black 1 196.55 6529.0 1304.9

- zn 1 220.63 6553.1 1306.7

- crim 1 225.50 6558.0 1307.1

- nox 1 239.09 6571.6 1308.1

- rad 8 791.09 7123.6 1333.6

- ptratio 1 732.81 7065.3 1343.6

- dis 1 857.27 7189.8 1352.1

- lstat 1 987.73 7320.2 1361.0

- rm 1 1380.21 7712.7 1386.5

> summary(fit.step.back )

Call:

lm(formula = medv ~ crim + zn + nox + rm + age + dis + rad +

tax + ptratio + black + lstat, data = boston)

Residuals:

Min 1Q Median 3Q Max

-9.5200 -2.2850 -0.4688 1.7535 15.3972

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 30.252522 4.329907 6.987 9.64e-12 ***

crim -0.104568 0.025533 -4.095 4.96e-05 ***

zn 0.045510 0.011235 4.051 5.97e-05 ***

nox -12.366882 2.932651 -4.217 2.97e-05 ***

rm 3.583130 0.353644 10.132 < 2e-16 ***

age -0.025822 0.010514 -2.456 0.014412 *

dis -1.253903 0.157029 -7.985 1.08e-14 ***

rad2 2.387130 1.160735 2.057 0.040278 *

rad3 4.644091 1.062157 4.372 1.51e-05 ***

rad4 2.608777 0.944668 2.762 0.005977 **

rad5 3.116933 0.960550 3.245 0.001258 **

rad6 1.422890 1.150280 1.237 0.216705

rad7 4.868388 1.240114 3.926 9.94e-05 ***

rad8 5.872144 1.180865 4.973 9.26e-07 ***

rad9 6.420553 1.393304 4.608 5.24e-06 ***

tax -0.010571 0.002792 -3.787 0.000172 ***

ptratio -0.837356 0.113420 -7.383 7.08e-13 ***

black 0.007949 0.002079 3.823 0.000149 ***

lstat -0.357576 0.041718 -8.571 < 2e-16 ***

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.667 on 471 degrees of freedom

Multiple R-squared: 0.7907, Adjusted R-squared: 0.7827

F-statistic: 98.83 on 18 and 471 DF, p-value: < 2.2e-16

[참고] 전진선택법(forward selection)의 AIC는 1295.03

> fit.step.forward = step(fit1,direction='forward')

Start: AIC=1295.03

medv ~ crim + zn + indus + chas + nox + rm + age + dis + rad +

tax + ptratio + black + lstat

> summary(fit.step.forward)

Call:

lm(formula = medv ~ crim + zn + indus + chas + nox + rm + age +

dis + rad + tax + ptratio + black + lstat, data = boston)

Residuals:

Min 1Q Median 3Q Max

-9.5220 -2.2592 -0.4275 1.6778 15.2894

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 30.120918 4.338656 6.942 1.29e-11 ***

crim -0.105648 0.025640 -4.120 4.47e-05 ***

zn 0.044104 0.011352 3.885 0.000117 ***

indus -0.046743 0.051044 -0.916 0.360274

chas2 0.158802 0.736742 0.216 0.829435

nox -11.576589 3.084187 -3.754 0.000196 ***

rm 3.543733 0.356605 9.937 < 2e-16 ***

age -0.026082 0.010531 -2.477 0.013613 *

dis -1.282095 0.160452 -7.991 1.05e-14 ***

rad2 2.548109 1.175012 2.169 0.030616 *

rad3 4.605849 1.064492 4.327 1.85e-05 ***

rad4 2.663393 0.950747 2.801 0.005299 **

rad5 3.077800 0.962725 3.197 0.001483 **

rad6 1.314892 1.157689 1.136 0.256624

rad7 4.864208 1.241760 3.917 0.000103 ***

rad8 5.772296 1.194221 4.834 1.82e-06 ***

rad9 6.195415 1.417826 4.370 1.53e-05 ***

tax -0.009396 0.003070 -3.061 0.002333 **

ptratio -0.828498 0.114436 -7.240 1.85e-12 ***

black 0.007875 0.002084 3.779 0.000178 ***

lstat -0.354606 0.041901 -8.463 3.36e-16 ***

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.671 on 469 degrees of freedom

Multiple R-squared: 0.7911, Adjusted R-squared: 0.7821

F-statistic: 88.78 on 20 and 469 DF, p-value: < 2.2e-16

3. 어떤 변수들이 제거 되었을까?

> fit.all = lm(medv~.,data=boston)

> fit.step = step(fit.all, direction = "both")

Start: AIC=1295.03

medv ~ crim + zn + indus + chas + nox + rm + age + dis + rad +

tax + ptratio + black + lstat

Df Sum of Sq RSS AIC

- chas 1 0.63 6321.5 1293.1

- indus 1 11.30 6332.2 1293.9

<none> 6320.9 1295.0

- age 1 82.67 6403.5 1299.4

- tax 1 126.28 6447.1 1302.7

- nox 1 189.88 6510.7 1307.5

- black 1 192.42 6513.3 1307.7

- zn 1 203.44 6524.3 1308.5

- crim 1 228.82 6549.7 1310.5

- rad 8 721.85 7042.7 1332.0

- ptratio 1 706.41 7027.3 1344.9

- dis 1 860.51 7181.4 1355.6

- lstat 1 965.26 7286.1 1362.7

- rm 1 1330.92 7651.8 1386.7

Step: AIC=1293.08

medv ~ crim + zn + indus + nox + rm + age + dis + rad + tax +

ptratio + black + lstat

Df Sum of Sq RSS AIC

- indus 1 11.00 6332.5 1291.9

<none> 6321.5 1293.1

+ chas 1 0.63 6320.9 1295.0

- age 1 82.48 6404.0 1297.4

- tax 1 130.45 6451.9 1301.1

- nox 1 189.27 6510.8 1305.5

- black 1 193.59 6515.1 1305.9

- zn 1 203.76 6525.2 1306.6

- crim 1 230.58 6552.1 1308.6

- rad 8 738.26 7059.8 1331.2

- ptratio 1 719.40 7040.9 1343.9

- dis 1 861.64 7183.1 1353.7

- lstat 1 965.11 7286.6 1360.7

- rm 1 1333.37 7654.9 1384.9

Step: AIC=1291.93

medv ~ crim + zn + nox + rm + age + dis + rad + tax + ptratio +

black + lstat

Df Sum of Sq RSS AIC

<none> 6332.5 1291.9

+ indus 1 11.00 6321.5 1293.1

+ chas 1 0.32 6332.2 1293.9

- age 1 81.09 6413.6 1296.2

- tax 1 192.78 6525.3 1304.6

- black 1 196.55 6529.0 1304.9

- zn 1 220.63 6553.1 1306.7

- crim 1 225.50 6558.0 1307.1

- nox 1 239.09 6571.6 1308.1

- rad 8 791.09 7123.6 1333.6

- ptratio 1 732.81 7065.3 1343.6

- dis 1 857.27 7189.8 1352.1

- lstat 1 987.73 7320.2 1361.0

- rm 1 1380.21 7712.7 1386.5

> names(fit.step)
[1] "coefficients" "residuals"     "effects"       "rank"
[5] "fitted.values" "assign"        "qr"            "df.residual"
[9] "contrasts"     "xlevels"       "call"          "terms"
[13] "model"         "anova"

> fit.step$anova #최종모형에서 제거된 변수를 알 수 있다.

Step Df Deviance Resid. Df Resid. Dev AIC

1 NA NA 469 6320.865 1295.031

2 - chas 1 0.6261633 470 6321.491 1293.079

3 - indus 1 10.9964825 471 6332.487 1291.931

-> 해석: fit.step$anova라는 함수를 통해 최종 모형에서 제거된 변수를 알 수 있다. 여러 후보 모형 중에서 AIC가 가장 작은 모형을 선택하게 되는데, 여기서는 chas와 indus가 제거되었음을 일목요연하게 알 수 있다.

4. 목표 예측값을 알아보자

> yhat=predict(fit.step,newdata=boston,type='response') #목표값 예측 시, type='response'

> head(yhat) #예측된 값 산출

1 2 3 4 5 6

26.59831 24.00195 28.99396 29.60018 29.07676 26.41636

> plot(fit.step$fitted,boston$medv, xlim=c(0,50),ylim=c(0,50),xlab="Fitted",ylab="Observed")#실제값과 가까운지 평가

> abline(a=0,b=1) # or abline(0,1)

> mean((boston$medv-yhat)^2) #MSE

[1] 12.92344

-> 함수 predict 는 다양한 모형 적합결과로부터 예측값을 계산할 때 사용하고, 이중 type

옵션은 예측 형태를 입력하는 것으로, 목표값을 예측할 때 ‘response’를 사용한다.

-> 목표변수가 연속형인 경우에 모형의 예측력 측도로서 MSE(mean squared error)를 주로 사용한다. 관측치(yi)와 예측치 Ŷi의 차이가 적을수록 그 모형의 예측력은 높다고 할 수 있다. 이를 시각적으로 확인하기 위해서는 이들을 가로축 및 세로축에 놓고 그린 산점도가 45도 대각선을 중심으로 모여 있으면 예측력이 좋다고 할 수 있다.

출처: 데이터마이닝(장영재, 김현중, 조형준 공저)

'KNOU > 2 데이터마이닝' 카테고리의 다른 글

제3장 나무모형 - 회귀나무모형 (0)	2016.10.26
제3장 나무모형 - 분류나무모형 (4)	2016.10.18
제2장 회귀모형 - 로지스틱 회귀모형 연습 (0)	2016.09.14
제2장 회귀모형 - 선형회귀, 로지스틱회귀 (0)	2016.09.14
1장 데이터과학과 데이터마이닝 (0)	2016.08.26

Posted by 마르띤

이전 1 다음

데이터마이너를 꿈꾸며

'library(MASS)'에 해당되는 글 1건

제2장 회귀모형 - 선형회귀 연습

'KNOU > 2 데이터마이닝' 카테고리의 다른 글

링크

카테고리

최근에 올라온 글

최근에 받은 트랙백

글 보관함

티스토리툴바