'KNOU' 카테고리의 글 목록

1장 데이터 처리

KNOU/4 고급R 2017. 8. 30. 13:36

1. 자료 가져오기

> setwd(choose.dir()) #대화형으로 폴더를 지정할 수 있음

> getwd()

[1] "C:/Users/amore/Documents/FOR ME/Data Scientist/KNOU/2017년 2학기/고급R활용"

> setwd("C:/Users/amore/Documents/FOR ME/Data Scientist/KNOU/2017년 2학기/고급R활용")

> getwd()

[1] "C:/Users/amore/Documents/FOR ME/Data Scientist/KNOU/2017년 2학기/고급R활용"

> savehistory(file='mywork.Rhistory')

> history()

> history(max.show=10)

2. 다양한 형태의 자료 가져오기

> insurance.data2 = read.table('insurance2.txt', header=T)

> insurance.data2[c(15:16),]

id sex job religion edu amount salary

15 15 m -9 3 4 6 110

16 16 f 1 3 -9 7 88

#-9를 NA로 설정

> insurance.data2 = read.table('insurance2.txt', header=T, na.strings = '-9')

> insurance.data2[c(15:16),]

id sex job religion edu amount salary

15 15 m NA 3 4 6 110

16 16 f 1 3 NA 7 88

> csv.data = read.csv('csv.txt',header=T) #구분자가 콤마

> tab.data = read.table('tab.txt', header=T,sep='\t') #구분자가 탭

> write.table(tab.data, file='C:/Rwork/tab.txt') #Rwork 폴더에 저장

#고정형식 텍스트 파일 읽기

> fwf.data = read.fwf('insurance3.txt',

+ widths=c(2,2,3,3,3,6,6),

+ col.names=c('id','sex','job','religion','edu','amount','salary'))

> fwf.data[fwf.data$job==-9, 'job']=NA

> fwf.data[fwf.data$edu==-9, 'edu']=NA

> fwf.data[fwf.data$salary==-9, 'salary']=NA

> head(fwf.data)

id sex job religion edu amount salary

1 1 m 1 1 3 7.0 110

2 2 m 2 1 4 12.0 135

3 3 f 2 3 5 8.5 127

4 4 f 3 3 5 5.0 150

5 5 m 1 3 3 4.5 113

6 6 m 2 1 2 3.5 95

> fwf2.data = read.fwf('insurance3.txt',

+ widths=c(2,-2,-3,3,3,6,6),

+ col.names=c('id','religion','edu','amount','salary'))

> head(fwf2.data)

id religion edu amount salary

1 1 1 3 7.0 110

2 2 1 4 12.0 135

3 3 3 5 8.5 127

4 4 3 5 5.0 150

5 5 3 3 4.5 113

6 6 1 2 3.5 95

3. 통계 패키지 자료 가져오기

> library(foreign) #spss 파일

> ex1 = read.spss('ex1-1.sav',

+ to.data.frame=T, use.value.labels = T)

Warning message:

In read.spss("ex1-1.sav", to.data.frame = T, use.value.labels = T) :

ex1-1.sav: Unrecognized record type 7, subtype 18 encountered in system file

> ex1

shock response count

1 yes yes 25

2 yes no 19

3 no yes 31

4 no no 141

> dim(ex1)

[1] 4 3

> mouse.data = ex1[rep(1:nrow(ex1),ex1$count),] #가중치

> head(mouse.data)

shock response count

1 yes yes 25

1.1 yes yes 25

1.2 yes yes 25

1.3 yes yes 25

1.4 yes yes 25

1.5 yes yes 25

> dim(mouse.data)

[1] 216 3

> mouse.table=table(mouse.data$shock,mouse.data$response)

> mouse.table

yes no

yes 25 19

no 31 141

> summary(mouse.table)

Number of cases in table: 216

Number of factors: 2

Test for independence of all factors:

Chisq = 27.458, df = 1, p-value = 1.605e-07

- 카이제곱 = 27.458 , p - value = 1.6 X 10마이너스 7승. 자극과 반응간의 유의한 관계가 있음.

4. RData 저장 및 가져오기

> save(ex1, file='c:/Rwork/ex1.Rdata')

> rm(ex1)

> load('C:/Rwork/ex1.Rdata')

> ex1

shock response count

1 yes yes 25

2 yes no 19

3 no yes 31

4 no no 141

> load(file=file.choose()) #대화

참고문헌:

[1] 고급 R 활용 (심송용, 이윤동, 이은경, 김성수 공저, KNOU PRESS)

Posted by 마르띤

,

3장 인자분석 - 주성분인자법(principal factor method) 사례

KNOU/3 다변량분석 2017. 3. 27. 19:59

### 5주차, 제3장 인자분석 ###

아래 자료는 무료 검진프로그램인 PHI를 개발하여 프로그램 유효서을 모니터링 한 자료이다. 11개 검진 항목, 128명의 자료가 이으며 R을 이용하여 주성분인자법(principal factor method)를 이용한 인자분석을 해 보자.

1. 자료 가져오기 및 요약 통계

> setwd('C:/Rwork')

> med.data = read.table('medFactor.txt',header=T)

> head(med.data)

lung muscle liver skeleton kidneys heart step stamina stretch blow urine

1 20 16 52 10 24 23 19 20 23 29 67

2 24 16 52 7 27 16 16 15 31 33 59

3 19 21 57 18 22 23 16 19 42 40 61

4 24 21 62 12 31 25 17 17 36 36 77

5 29 18 62 14 26 27 15 20 33 29 88

6 18 19 51 15 29 23 19 20 50 37 54

> boxplot(med.data)

2. 초기 인자분석 실행하기

> library(psych)

> library(GPArotation) #인자회전

> med.factor = principal(med.data,rotate='none') #rotate에 ‘none’, ‘varimax’,’quanrtmax’, ‘oblimin’ 등 방법을 사용

> names(med.factor)

[1] "values" "rotation" "n.obs" "communality" "loadings" "fit"

[7] "fit.off" "fn" "Call" "uniquenesses" "complexity" "chi"

[13] "EPVAL" "R2" "objective" "residual" "rms" "factors"

[19] "dof" "null.dof" "null.model" "criteria" "STATISTIC" "PVAL"

[25] "weights" "r.scores" "Structure" "scores"

> med.factor$values #고유근

[1] 3.3791814 1.4827707 1.2506302 0.9804771 0.7688022 0.7330511 0.6403994 0.6221934 0.5283718 0.3519301

[11] 0.2621928

> plot(med.factor$values,type='b')

>> med.factor$values #고유근을 통해 세 번째 인자까지 고유근이 1 이상인 것을 알 수 있다. 스크리 그림에서는 4번째 인자다음부터 그래프의 기울기가 완만해지는 것을 볼 수 있다. 따라서 유효한 인자의 수는 3-4개로 생각할 수 있다.

3. 인자 회전

(1) 직교회전(orthogonal rotation)의 Varimax

> med.varimax = principal(med.data,nfactors=3,rotate='varimax')

> med.varimax

>> 공통성은 많은 문항(변수) 중에서 서로연관이 있는 일부 문항들을 선택하기 위하여 인자분석을 실시하는 경우 문항선택의 기준으로 이용된다. 예를 들면, 문항 100개를 이용하여 인자분석을 한 결과 공통성이 0.3 이하인 것이 40개라면 이 40개 문항은 다른 문항들과 공통점이 별로 없는 것으로 판단할 수 있다.

>> 고유분산은 u2 = 1 – h2(공통성) 공식으로 바로 구할 수 있다.lung의 경우 0.53 = 1 – 0.47

>> ss loading은 각 인자에 의해 설명되는 분산의 양을 나타낸다.

RC1 2.39 = 0.66² + 0.11² + 0.78² + … + 0.25² + (-0.07)²

>> Proportion Var는 각 인자가 설명하는 총 분산의 비율을 의미한다. RC1은 총 분산의 22%, RC2가 19%, RC3는 14%, 아래 Cumulative var는 누적값. 세 인자에 의해 설명되어지는 변동은 총 변동의 56%.

>> 회전된 요인에 대한 변수들의 요인 적재값을 보면 첫 번째 인자는 lung, liver, kidneys, heart가 높은 값을 가지며, step은 상대적으로 높은 적재값을 가진다. (적재값이 뭘 의미하는지 잘 모르겠다 ㅠ)

두 번째 인자는 stamina, stretch, blow, urine의 값이 높다. 마지막으로 세번째 인자는 muscle과 skeleton에서 높은 값을 가진다. 따라서 첫 번째 인자는 생물의학(biomedical), 두번째 인자는 인체기능(performance), 세번째 인자는 근육골계통력(Muscular – skeletal strength)으로 특정 지을 수 있다.

>> 인자분석은 주성분분석과 달리 인자들의 결합으로 원변수들을 표현한다.

Lung = 0.66 x factor1 + 0.12 x facotr2 + 0.16 x factor3 + error

인자점수 추정방법은 회귀분석, Barlett 등이 있는데, 아래는 회귀분석을 이용한 인자점수이다.

> head(med.varimax$scores)

RC1 RC2 RC3

[1,] -0.11970907 -0.2574867 -0.74473196

[2,] 0.05696634 -0.4784611 -1.53199247

[3,] -0.59153602 0.8957933 1.52598533

[4,] 1.20919164 0.3179760 -0.42022213

[5,] 0.82291043 0.1513448 -0.02988744

[6,] -0.08606120 1.1451913 0.76290104

인자분석의 시각화 몇가지 사례

> plot(med.varimax)

> plot(med.varimax$loadings[,c(1:2)])

> text(med.varimax$loadings[,c(1:2)],colnames(med.data))

> biplot(med.varimax$scores,med.varimax$loadings)

> fa.diagram(med.varimax)

> omega(med.data)

3. 인자 회전

(2) 사곽회전(Oblique rotation) - Oblimin

> med.oblimin = principal(med.data,nfactors=3,rotate='oblimin',scores=T,method='regression')

> med.oblimin

>> Oblimin을 이용한 인자분석에 대한 결과는 Varimax와 마찬가지로 세 인자에 대해 묶여지는 변수는 같으며 다만 인자 적재값이 차이가 나는 것을 볼 수 있다. Oblimin 방법을 이용한 결과에 대한 각 변수의 인자 모형은 아래와 같다.

Lung = 0.66 x factor1 + 0.01 x factor2 + 0.1 x facotr3

>> 회귀분석을 이용한 인자점수 함수식은 아래와 같다

Factor1 = 0.66 x lung + 0.02 x muscle + …+ (-0.09) x urine

인자점수는 아래와 같다.

> head(med.oblimin$scores)

TC1 TC2 TC3

[1,] -0.2325231 -0.2639103 -0.77351112

[2,] -0.1722964 -0.4641659 -1.54539480

[3,] -0.2852637 0.8308965 1.50251465

[4,] 1.1976234 0.4291411 -0.22225725

[5,] 0.8296036 0.2260879 0.09586509

[6,] 0.1766403 1.1289891 0.83993963

4. 행렬도

> biplot(med.varimax)

참고문헌:

[1] 다변량분석(김성수, 김현중, 정성석, 이용구 공저)

[2] R 프로그램에 기반한 다변량분석 및 데이터마이닝(이재길)

'KNOU > 3 다변량분석' 카테고리의 다른 글

2장 주성분분석 - heptathlon data / beer data (0)	2017.03.27

Posted by 마르띤

,

2장 주성분분석 - heptathlon data / beer data

KNOU/3 다변량분석 2017. 3. 27. 19:13

[예제1] - heptathlon data

1. 자료 준비

> library(HSAUR2)

> library(tools)

> data("heptathlon")

> head(heptathlon)

> heptathlon은 R의 HSAUR2 패키지에 들어있는 자료로서, 1988년 서울 올림픽 여성 7종 경기에 대한 결과이다. Hurdles(110m 허들), highjump(높이뛰기), shot(포환던지기), run200m(200미터 달리기), longjump(멀리뛰기), javelin(창던지기), run800m(800미터 달리기),score(점수)로 구성되어 있다.

2. 자료 변형

> heptathlon$hurdles <- max(heptathlon$hurdles) - heptathlon$hurdles

> heptathlon$run200m <- max(heptathlon$run200m) - heptathlon$run200m

> heptathlon$run800m <- max(heptathlon$run800m) - heptathlon$run800m

> 작은 값일수록 좋은 점수이기 때문에 최대값에서 빼줌

3. 주성분분석

> library(stats)

> hep.data <- heptathlon[,-8] #주성분점수에서 예외

> heptathlon.pca <- princomp(hep.data,cor=T,scores=T) #cor = T, 상관계수 행령, F는 공분산 행렬, scores=T 주성분점수

> heptathlon.pca #변수가 7개이니, 주성분도 7개

Call:

princomp(x = hep.data, cor = T, scores = T)

Standard deviations:

Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7

2.1119364 1.0928497 0.7218131 0.6761411 0.4952441 0.2701029 0.2213617

7 variables and 25 observations.

> summary(heptathlon.pca)

> 주성분의 표준편차, 분산비율을 알수 있다. 첫 번째 주성분이 63.72%, 두 번째 주 성분이 17.06%의 분산 비율을 나타내고 있으며, 2개의 주성분이 총 변이의 80.8%의 정보를 가지고 있다.

> eig.val = heptathlon.pca$sdev^2 #주성분의 표준편차를 제곱

> eig.val #주성분 분산

Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7

4.46027516 1.19432056 0.52101413 0.45716683 0.24526674 0.07295558 0.04900101

> 제1주성분의 분산이 4.46, 제2주성분이 1.19, 제3주성분이 0.52로 유효한 주성분의 수는 2개임을 알 수 있다.

> par(mfrow=c(1,2))

> screeplot(heptathlon.pca)

> screeplot(heptathlon.pca,type='line',pch=19,main='Scree Plot of heptathlon.pca')

> 스크리 그림은 주성분의 고유값 크기 순으로 그린 것으로, 고유값이 1보다 큰 주성분값이 2개임을 알 수 있다.

> heptathlon.pca$loadings[,1:2]

Comp.1 Comp.2

hurdles -0.4528710 0.15792058

highjump -0.3771992 0.24807386

shot -0.3630725 -0.28940743

run200m -0.4078950 -0.26038545

longjump -0.4562318 0.05587394

javelin -0.0754090 -0.84169212

run800m -0.3749594 0.22448984

> 상관계수행렬을 이용한 경우 제1, 제2주성분은 다음과 같다.

PC1 = -0.453 x hurldes – 0.377 x highjump - ․․․ - 0.374 x rum800m

PC2 = 0.157 x hurldes + 0.248 x highjump + ․․․ + 0.224 x rum800m

여기서 제1주성분은 jvelin(창던지기)를 제외한 모든 변수의 절대값이 큰 값을 가지는 것으로 볼 때, 전반적인 체력 정도를 나타내는 성분이라고 할 수 있고, 제2성분은 jvelin의 계수가 다른 변수에 비해 상대적으로 절대값이 큰 것을 볼 때, 창던지기와 밀접한 관련이 있는 성분으로 파악할 수 있다.

4. 주성분 점수 및 행렬도 (biplot)

> head(heptathlon.pca$scores[,1:2])

Comp.1 Comp.2

Joyner-Kersee (USA) -4.206435 -1.26802363

John (GDR) -2.941619 -0.53452561

Behmer (GDR) -2.704271 -0.69275901

Sablovskaite (URS) -1.371052 -0.70655862

Choubenkova (URS) -1.387050 -1.78931718

Schulz (GDR) -1.065372 0.08104469

> biplot(heptathlon.pca,cex=0.7,col=c('red','blue'),main='Biplot')

> 첫번째 주성분을 x축, 두 번째 주성분을 y축에 위치. Highjump, run800m, hurldes, longjump가 서로

가까운 곳에 위치하고 벡터의 방향(화살표) 또한 비슷하다. Ru200m, shot이 가깝게 위치하고 있고, javelin은 다른 변수들과 다른 방향에 위치하고 있다.

[예제2] - beer data

99명의 소비자가 맥주 구매에 중요하게 생각하는 요인에 대해 점수를 준 엑셀 자료이다. 독립변수는 7개, 케이스 수는 99개인 자료이다. 점수는 0~100으로 되어 있다.

상관관계가 아닌 공분산을 활용해보자.

1. 자료 읽기

> beer<-read.csv('beer.csv',header=T)

> head(beer)

cost size alcohol reputat color aroma taste

1 10 15 20 85 40 30 50

2 100 70 50 30 75 60 80

3 65 30 35 80 80 60 90

4 0 0 20 30 80 90 100

5 10 25 10 100 50 40 60

6 25 35 30 40 45 30 65

2. 상관계수행렬 및 산점도 행렬 보기

> round(cor(beer),2)

cost size alcohol reputat color aroma taste

cost 1.00 0.88 0.88 -0.17 0.32 -0.03 0.05

size 0.88 1.00 0.82 -0.06 0.01 -0.29 -0.31

alcohol 0.88 0.82 1.00 -0.36 0.40 0.10 0.06

reputat -0.17 -0.06 -0.36 1.00 -0.52 -0.52 -0.63

color 0.32 0.01 0.40 -0.52 1.00 0.82 0.80

aroma -0.03 -0.29 0.10 -0.52 0.82 1.00 0.87

taste 0.05 -0.31 0.06 -0.63 0.80 0.87 1.00

> 맥주의 가격(cost),크기(size), 알코올(alcohol)함량의 상관계수가 높아 서로 양의 상관관계가 있을 것으로 보인다. 또한 색(color), 향(aroma), taste(맛)이 서로 상관계수가 높게 나타났다. 반면 평판(reputation)은 비교적 다른 변수들과 상관계수가 높지 않으며, 다른 변수들과 모두 음의 상관 계수를 나타냈다.

> plot(beer[1:3])

3. 주성분분석 실행하기

> library(stats)

> beer.pca = princomp(beer,cor=F, scores=T) #cor = F 공분산

> beer.pca

Call:

princomp(x = beer, cor = F, scores = T)

Standard deviations:

Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7

39.117489 34.783073 17.974805 8.456192 6.479111 4.381410 2.579257

7 variables and 99 observations.

> summary(beer.pca)

> screeplot(beer.pca,type='l',pch=19)

> beer.pca$loadings[,1:3] #주성분계수

Comp.1 Comp.2 Comp.3

cost 0.7344197 0.31868378 -0.1902272

size 0.3942553 0.34017182 0.1274959

alcohol 0.2831873 0.06888179 0.0370581

reputat -0.3356901 0.49057105 -0.7861697

color 0.2657171 -0.34379889 -0.3862210

aroma 0.1398211 -0.48510439 -0.3693624

taste 0.1488354 -0.42871339 -0.2062207

> screeplot을 보면 4번째 주성분부터 완만해짐로 유효한 주성분을 3개로 판단하여, 3개의 주성분 계수를 보자. 제1주성분을 보면 cost, size, alcohol의 점수가 높다. 비용, 크기, 알코올이 서로 연관이 있고, color, aroma, taste의 주성분점수가 상대적으로 낮게 측정되어 있다. 색, 향, 맛이 서로 연관되어 있다.

PC1 = 0.734 x cost + 0.397 x size + … + 0.148 x taste

4. 주성분 분석 점수 및 행렬도

> head(beer.pca$scores[,1:3])

Comp.1 Comp.2 Comp.3

[1,] -41.435562 40.03360 4.340612

[2,] 91.264626 23.06214 7.798284

[3,] 31.574345 15.79053 -34.501270

[4,] -9.770913 -59.53112 -0.351844

[5,] -39.816498 37.52890 -16.165598

[6,] -1.035114 22.08072 34.760924

> biplot(beer.pca,col=c('red','blue'),main = 'Biplot of Beer.pca')

참고 문헌

[1] R을 이용한 다변량 분석, 김성수 김현중 정성석 이용구 공저

'KNOU > 3 다변량분석' 카테고리의 다른 글

3장 인자분석 - 주성분인자법(principal factor method) 사례 (0)	2017.03.27

Posted by 마르띤

,

제3장 예측데이터의 기초분석

KNOU/3 예측방법론 2017. 3. 22. 20:05

### 3,4주차, 제3장 예측데이터의 기초 분석 ###

3.1 시계열의 요약

> gdp <- read.csv("gdpq.csv", header = TRUE)

> head(gdp)

gdp gdpsa

1 12807.5 14934.6

2 15732.5 15246.5

3 14621.6 15472.1

4 18689.5 16198.0

5 14512.5 16664.4

6 17709.4 17180.3

> gdp_o <- ts(gdp[,1]/1000, start=1970, frequency=4)

> gdp_o

Qtr1 Qtr2 Qtr3 Qtr4

1970 12.8075 15.7325 14.6216 18.6895

1971 14.5125 17.7094 16.5814 19.5058

1972 15.2014 18.4940 17.5132 21.5480

1973 17.1063 20.9626 20.6994 24.7473

1974 19.7445 23.2128 22.2597 26.1339

1975 20.0790 24.6205 23.9055 29.4487

1976 22.5622 28.2495 27.2866 33.1492

1977 24.5630 30.6101 30.7895 38.4318

1978 28.1520 34.5775 33.4854 40.9892

1979 32.1360 38.2981 36.1002 42.1807

1980 32.0372 36.8905 36.8476 40.1279

1981 33.0530 38.8709 39.6106 45.1715

1982 36.2445 41.7473 43.0336 48.6735

1983 40.6951 46.9821 49.2677 53.4268

1984 46.3244 52.1396 53.7155 56.9611

1985 49.3720 55.4260 57.4681 62.4993

1986 54.3321 61.6800 65.9805 70.2836

1987 62.3064 70.6516 72.8342 77.4281

1988 72.3188 76.2441 80.4584 87.2238

1989 75.6243 81.9830 85.7357 94.2551

1990 82.9977 90.4118 94.2292 101.3471

1991 91.5677 100.2760 102.4124 110.5688

1992 98.9458 107.5298 107.0952 114.5932

1993 103.2117 113.8639 115.0267 123.1619

1994 112.7728 123.0717 124.0430 135.3119

1995 123.1052 134.6932 136.4655 145.1600

1996 132.1325 144.3954 146.0147 155.6439

1997 139.4080 154.6306 155.6320 161.8583

1998 134.4830 143.3159 144.5975 154.1904

1999 143.0859 159.4288 161.7758 174.1675

2000 161.1069 174.1233 176.7639 182.6340

2001 167.9311 181.3261 181.9274 191.0444

2002 179.0416 194.0302 194.2672 206.5295

2003 185.3180 197.5801 198.0618 214.5984

2004 194.9231 209.3292 207.5780 220.4750

2005 200.2166 216.3471 216.9831 231.6941

2006 212.4693 227.4156 227.9013 242.2628

2007 221.9311 239.4580 239.1251 256.0003

2008 234.1789 249.8902 246.9623 247.4675

2009 224.3595 244.7084 249.5196 263.0376

2010 243.8225 263.1901 260.7457 275.9080

2011 254.2480 272.3620 270.2561 285.2294

2012 261.4804 278.8836 274.4844 289.3663

2013 265.2912 285.3123 283.5950 300.6547

> gdp_sa <- ts(gdp[,2]/1000, start=1970, frequency=4)

> gdp_sa

Qtr1 Qtr2 Qtr3 Qtr4

1970 14.9346 15.2465 15.4721 16.1980

1971 16.6644 17.1803 17.3029 17.1614

1972 17.5141 17.9938 18.2651 18.9837

1973 19.7739 20.4964 21.4395 21.8057

1974 22.8617 22.6609 22.9105 22.9179

1975 23.5435 24.1886 24.7458 25.5757

1976 26.7061 27.7776 28.2670 28.4967

1977 28.8486 30.2254 31.9596 33.3608

1978 32.9248 33.9149 34.2955 36.0687

1979 37.5335 37.7262 36.7787 36.6767

1980 36.6760 36.2477 36.8647 36.1147

1981 37.2882 38.7201 39.7069 40.9907

1982 40.9807 41.7302 42.6719 44.3161

1983 45.5381 46.9978 48.6324 49.2035

1984 51.1110 52.0170 52.8702 53.1425

1985 54.4390 55.6003 56.4163 58.3099

1986 59.6636 62.0513 64.7921 65.7692

1987 67.5395 70.8755 71.8664 72.9389

1988 77.9135 76.8991 79.7696 81.6630

1989 81.4808 82.6901 85.2974 88.1298

1990 89.1127 90.7379 93.9483 95.1870

1991 98.3701 100.0831 102.2049 104.1669

1992 105.8722 106.9636 106.9913 108.3369

1993 110.1757 112.9698 115.0582 117.0605

1994 120.1321 122.1115 124.0878 128.8680

1995 131.3819 133.6788 136.2208 138.1425

1996 141.2176 143.2933 145.5552 148.1203

1997 149.0865 153.3882 154.8591 154.1950

1998 143.4587 141.9956 143.8028 147.3296

1999 151.7891 157.9815 161.6553 167.0321

2000 170.3475 172.5127 176.8317 174.9363

2001 177.4443 179.7859 182.2794 182.7193

2002 189.0423 192.4812 195.1160 197.2289

2003 195.8857 196.0955 199.1527 204.4243

2004 206.1241 207.7387 208.5458 209.8967

2005 211.6290 214.7745 218.1001 220.7373

2006 224.3165 225.6878 229.0608 230.9839

2007 234.2216 237.6344 240.2943 244.3642

2008 246.6202 247.5462 247.8380 236.4944

2009 236.6559 242.6501 250.8293 251.4898

2010 256.9193 260.5657 262.1134 264.0678

2011 267.5257 269.7017 271.9174 272.9508

2012 275.1821 276.0057 276.1202 276.9067

2013 279.2230 282.3449 285.3485 287.9793

> gdp_gr <- ts((gdp_sa-lag(gdp_sa,-1))/lag(gdp_sa,-1)*100, start=c(1970,2), frequency=4)

> gdp_sa-lag(gdp_sa,-1)

Qtr1 Qtr2 Qtr3 Qtr4

1970 0.3119 0.2256 0.7259

1971 0.4664 0.5159 0.1226 -0.1415

1972 0.3527 0.4797 0.2713 0.7186

1973 0.7902 0.7225 0.9431 0.3662

1974 1.0560 -0.2008 0.2496 0.0074

1975 0.6256 0.6451 0.5572 0.8299

1976 1.1304 1.0715 0.4894 0.2297

1977 0.3519 1.3768 1.7342 1.4012

1978 -0.4360 0.9901 0.3806 1.7732

1979 1.4648 0.1927 -0.9475 -0.1020

1980 -0.0007 -0.4283 0.6170 -0.7500

1981 1.1735 1.4319 0.9868 1.2838

1982 -0.0100 0.7495 0.9417 1.6442

1983 1.2220 1.4597 1.6346 0.5711

1984 1.9075 0.9060 0.8532 0.2723

1985 1.2965 1.1613 0.8160 1.8936

1986 1.3537 2.3877 2.7408 0.9771

1987 1.7703 3.3360 0.9909 1.0725

1988 4.9746 -1.0144 2.8705 1.8934

1989 -0.1822 1.2093 2.6073 2.8324

1990 0.9829 1.6252 3.2104 1.2387

1991 3.1831 1.7130 2.1218 1.9620

1992 1.7053 1.0914 0.0277 1.3456

1993 1.8388 2.7941 2.0884 2.0023

1994 3.0716 1.9794 1.9763 4.7802

1995 2.5139 2.2969 2.5420 1.9217

1996 3.0751 2.0757 2.2619 2.5651

1997 0.9662 4.3017 1.4709 -0.6641

1998 -10.7363 -1.4631 1.8072 3.5268

1999 4.4595 6.1924 3.6738 5.3768

2000 3.3154 2.1652 4.3190 -1.8954

2001 2.5080 2.3416 2.4935 0.4399

2002 6.3230 3.4389 2.6348 2.1129

2003 -1.3432 0.2098 3.0572 5.2716

2004 1.6998 1.6146 0.8071 1.3509

2005 1.7323 3.1455 3.3256 2.6372

2006 3.5792 1.3713 3.3730 1.9231

2007 3.2377 3.4128 2.6599 4.0699

2008 2.2560 0.9260 0.2918 -11.3436

2009 0.1615 5.9942 8.1792 0.6605

2010 5.4295 3.6464 1.5477 1.9544

2011 3.4579 2.1760 2.2157 1.0334

2012 2.2313 0.8236 0.1145 0.7865

2013 2.3163 3.1219 3.0036 2.6308

> gdp_gr

Qtr1 Qtr2 Qtr3 Qtr4

1970 2.088438927 1.479683862 4.691670814

1971 2.879367823 3.095821032 0.713608028 -0.817781990

1972 2.055193632 2.738936057 1.507741555 3.934279035

1973 4.162518371 3.653806280 4.601295837 1.708062222

1974 4.842770468 -0.878324884 1.101456694 0.032299601

1975 2.729743999 2.740034404 2.303564489 3.353700426

1976 4.419820376 4.012191971 1.761851276 0.812608342

1977 1.234879828 4.772501959 5.737558477 4.384285160

1978 -1.306923095 3.007155700 1.122220617 5.170357627

1979 4.061138882 0.513408022 -2.511517195 -0.277334435

1980 -0.001908569 -1.167793653 1.702176966 -2.034466576

1981 3.249369370 3.840088822 2.548547137 3.233191209

1982 -0.024395778 1.828909706 2.256639077 3.853121141

1983 2.757462863 3.205447746 3.478035142 1.174320001

1984 3.876756735 1.772612549 1.640233001 0.515034935

1985 2.439666933 2.133213321 1.467617980 3.356476763

1986 2.321561176 4.001937530 4.416990458 1.508054223

1987 2.691685470 4.939331798 1.398085375 1.492352476

1988 6.820228986 -1.301956657 3.732813518 2.373585927

1989 -0.223112058 1.484153322 3.153098134 3.320617041

1990 1.115286770 1.823758005 3.538102601 1.318491128

1991 3.344049082 1.741382798 2.120038248 1.919673127

1992 1.637084333 1.030865515 0.025896660 1.257672353

1993 1.697297966 2.536040161 1.848635653 1.740249717

1994 2.623942320 1.647686172 1.618438886 3.852272343

1995 1.950755812 1.748262127 1.901573024 1.410724353

1996 2.226034711 1.469859281 1.578510649 1.762286748

1997 0.652307618 2.885371915 0.958939475 -0.428841444

1998 -6.962806836 -1.019875407 1.272715493 2.452525264

1999 3.026886654 4.079607824 2.325462159 3.326089525

2000 1.984887935 1.271048885 2.503583794 -1.071866639

2001 1.433664711 1.319625370 1.386927451 0.241332811

2002 3.460499247 1.819116674 1.368860959 1.082894278

2003 -0.681036096 0.107103275 1.559036286 2.647014075

2004 0.831505843 0.783314518 0.388516921 0.647771377

2005 0.825310736 1.486327488 1.548414733 1.209169551

2006 1.621474939 0.611323732 1.494542461 0.839558755

2007 1.401699426 1.457081670 1.119324475 1.693714749

2008 0.923212156 0.375476137 0.117876986 -4.577022087

2009 0.068289143 2.532875791 3.370779571 0.263326493

2010 2.158934478 1.419278349 0.593976874 0.745631471

2011 1.309474309 0.813379799 0.821537276 0.380041880

2012 0.817473332 0.299292723 0.041484650 0.284839718

2013 0.836491136 1.118066921 1.063805296 0.921960340

> mean(gdp_gr);median(gdp_gr);var(gdp_gr);sd(gdp_gr)

[1] 1.720042

[1] 1.618439

[1] 2.977614

[1] 1.725576

> (1+mean(gdp_gr)/100)^4-1 #연평균 7.05% 성장

[1] 0.07059725

> plot(gdp_gr, ylab="경제성장률(전 분기대비)", xlab="연도", col="steelblue", main="경제성장률 추이")

> abline(h=0, lty=2, col="gray")

3.2 시계열의 분포

> hist(gdp_gr, breaks = 12, col = "lightblue", border = "black", freq=FALSE, main="경제성장률의 확률분포", xlab="", xlim=c(-10, 10))

> lines(density(gdp_gr)) #꼬리부분에 큰 값을 가지고 있다. IMF

> shapiro.test(gdp_gr) #p-value = 4.285e-06, 경제성장률은 정규분포를 따른다는 귀무가설을 기각,

Shapiro-Wilk normality test

data: gdp_gr

W = 0.94723, p-value = 4.285e-06

3.3 자기 상관

시계열은 시간의 전개에 따른 시간영역 정보와 시간에 따라 반복되는 주파수 영역 정보를 가지고 있다. 시간 영역 정보를 파악하려면 시계열의 과거, 현재, 미래를 연결하는 구조를 이해해야 한다. 시간에 따른 시계열의 연결 구조를 이해하는 데에는 표본 자기 상관계수, 표본 부분자기 상관계수 등이 이용된다.

> set.seed(1)

> nn=length(gdp_o)

> wn=ts(rnorm(nn), start=1970, frequency=4)

> wn

Qtr1 Qtr2 Qtr3 Qtr4

1970 -0.626453811 0.183643324 -0.835628612 1.595280802

1971 0.329507772 -0.820468384 0.487429052 0.738324705

1972 0.575781352 -0.305388387 1.511781168 0.389843236

1973 -0.621240581 -2.214699887 1.124930918 -0.044933609

1974 -0.016190263 0.943836211 0.821221195 0.593901321

1975 0.918977372 0.782136301 0.074564983 -1.989351696

1976 0.619825748 -0.056128740 -0.155795507 -1.470752384

1977 -0.478150055 0.417941560 1.358679552 -0.102787727

1978 0.387671612 -0.053805041 -1.377059557 -0.414994563

1979 -0.394289954 -0.059313397 1.100025372 0.763175748

1980 -0.164523596 -0.253361680 0.696963375 0.556663199

1981 -0.688755695 -0.707495157 0.364581962 0.768532925

1982 -0.112346212 0.881107726 0.398105880 -0.612026393

1983 0.341119691 -1.129363096 1.433023702 1.980399899

1984 -0.367221476 -1.044134626 0.569719627 -0.135054604

1985 2.401617761 -0.039240003 0.689739362 0.028002159

1986 -0.743273209 0.188792300 -1.804958629 1.465554862

1987 0.153253338 2.172611670 0.475509529 -0.709946431

1988 0.610726353 -0.934097632 -1.253633400 0.291446236

1989 -0.443291873 0.001105352 0.074341324 -0.589520946

1990 -0.568668733 -0.135178615 1.178086997 -1.523566800

1991 0.593946188 0.332950371 1.063099837 -0.304183924

1992 0.370018810 0.267098791 -0.542520031 1.207867806

1993 1.160402616 0.700213650 1.586833455 0.558486426

1994 -1.276592208 -0.573265414 -1.224612615 -0.473400636

1995 -0.620366677 0.042115873 -0.910921649 0.158028772

1996 -0.654584644 1.767287269 0.716707476 0.910174229

1997 0.384185358 1.682176081 -0.635736454 -0.461644730

1998 1.432282239 -0.650696353 -0.207380744 -0.392807929

1999 -0.319992869 -0.279113303 0.494188331 -0.177330482

2000 -0.505957462 1.343038825 -0.214579409 -0.179556530

2001 -0.100190741 0.712666307 -0.073564404 -0.037634171

2002 -0.681660479 -0.324270272 0.060160440 -0.588894486

2003 0.531496193 -1.518394082 0.306557861 -1.536449824

2004 -0.300976127 -0.528279904 -0.652094781 -0.056896778

2005 -1.914359426 1.176583312 -1.664972436 -0.463530401

2006 -1.115920105 -0.750819001 2.087166546 0.017395620

2007 -1.286300530 -1.640605534 0.450187101 -0.018559833

2008 -0.318068375 -0.929362147 -1.487460310 -1.075192297

2009 1.000028804 -0.621266695 -1.384426847 1.869290622

2010 0.425100377 -0.238647101 1.058483049 0.886422651

2011 -0.619243048 2.206102465 -0.255027030 -1.424494650

2012 -0.144399602 0.207538339 2.307978399 0.105802368

2013 0.456998805 -0.077152935 -0.334000842 -0.034726028

> par(mfrow=c(2,1))

> plot(wn, main="백색잡음계열", xlab="", ylab="", col="steelblue")

> abline(h=0, lty=2, col="gray")

> acf(wn, main="ACF 상관도표", col="steelblue")

>> 표준정규분포를 따르는 백색잡음계열의 상관도표는 위와 같다. 이를 보면 시차 0을 제외하고는 모두 점선 안의 작은 값을 갖는다. 시차의 1,2,3은 1,2,3년을 의미하며, 4개의 시차가 모여 1년이 된다. 백색잡음계열은 예측할 수 없는 계열이다. 시차 0을 제외한 모든 자기상관계수값이 점선 내에 있다. 따라서 백색잡음의 각 시차의 자기상관계수값이 5% 유의수준에서 0과 다르다고 할 수 없다.

> nn=length(gdp_sa)

> sin = ts(sin(1:nn/nn*12*pi), start=1970, frequency=4)

> par(mfrow=c(2,1))

> plot(sin, main="sin 계열", xlab="", ylab="", col="steelblue")

> abline(h=0, lty=2, col="gray")

> acf(sin, main="ACF 상관도표", col="steelblue")

>> sin 커브를 가지는 시계열의 상관도표를 구해보면 위와 같다. Sin 시계열의 자기상관계수는 sin커브와 같이 일정한 주기를 가지고 움직이는 것으로 나타났다.

> plot(gdp_o, main="GDP 원계열", xlab="", ylab="", col="steelblue")

> lines(gdp_sa,col='red')

> acf(gdp_o, main="ACF 상관도표", col="steelblue")

>> GDP원계열의 상관도표를 그려보면 위와 같다. 자기상관계수값이 서서히 감소하는 것으로 나타났다. 동 시계열은 강한 추세가 있는 불안정시계열임을 나타낸다. 강한 추세로 인해 GDP의 계절변동을 상관도표로 파악하기는 어렵다.

> plot(diff(log(gdp_o)), main="GDP로그차분", xlab="", ylab="", col="steelblue")

> acf(diff(log(gdp_o)), main="ACF 상관도표", col="steelblue")

>> GDP 원계열을 로그차분한 계열의 상관도표는 위와 같다. 이를 보면 자기상관계수값이 주기적으로 움직이는 것으로 나타났다. 이는 동 시계열의 로그차분을 통해 추세를 제거했지만, 계절변동은 그대로 남아 있음을 의미한다.

> plot(diff(log(gdp_o),4), main="GDP로그4차 차분", xlab="", ylab="", col="steelblue")

> acf(diff(log(gdp_o),4), main="ACF 상관도표 ", col="steelblue")

>>GDP 원계열을 로그 4차 차분한 계열의 상관도표를 그려보면 위와 같다. 자기상관계수값이 3차 이후 급격히 하락하는 것으로 나타났다. 이는 동 시계열의 로그 4차 차분을 통해 추세 및 계절 변동이 제거되었음을 의미한다.

륭(Ljung) – 박스(Box) 검정을 통해, 시계열에 m차까지의 자기상관관계가 존재하지 않는다는 귀무가설을 검정하는 것이다.

- 귀무가설: H0 : p(1) = p(2) = … = p(m) = 0 잔차는 랜덤

> Box.test(wn,lag=8, type="Ljung")

Box-Ljung test

data: wn

X-squared = 3.16, df = 8, p-value = 0.9239

> Box.test(sin,lag=8, type="Ljung")

Box-Ljung test

data: sin

X-squared = 579.9, df = 8, p-value < 2.2e-16

> Box.test(gdp_o,lag=8, type="Ljung")

Box-Ljung test

data: gdp_o

X-squared = 1249.8, df = 8, p-value < 2.2e-16

> Box.test(diff(log(gdp_o)),lag=8, type="Ljung")

Box-Ljung test

data: diff(log(gdp_o))

X-squared = 756.05, df = 8, p-value < 2.2e-16

> Box.test(diff(log(gdp_o),4),lag=8, type="Ljung")

Box-Ljung test

data: diff(log(gdp_o), 4)

X-squared = 205.63, df = 8, p-value < 2.2e-16

계열명	X² 검정통계량 값	유의확률
백색잡음계열	3.16	0.9239
sin계열	579.9	< 2.2e-16
GDP원계열	1249.8	< 2.2e-16
GDP 로그차분계열	756.05	< 2.2e-16
GDP 로그 4차 차분계열	205.63	< 2.2e-16

>> 백색잡음계열의 경우 p-value가 0.05보다 크므로 시계열에 8차까지의 자기상관관계가 존재하지 않는다는 귀무가설을 기각하지 못하지만, 나머지 계열은 모두 작은 값이어서 8차까지의 자기상관관계가 존재하지 않는다는 귀무가설을 기각하낟.

3.4 부분 자기 상관

표본 부분 자기 상관계수(sample partial autocorrelation coefficietns). 앞뒤 시차 영향을 제외한 순수한 영역

> gdp <- read.csv("gdpq.csv", header = TRUE)

> gdp_o <- ts(gdp[,1]/1000, start=1970, frequency=4)

> gdp_sa <- ts(gdp[,2]/1000, start=1970, frequency=4)

> gdp_gr <- ts((gdp_sa-lag(gdp_sa,-1))/lag(gdp_sa,-1)*100, start=c(1970,2), frequency=4)

> par(mfrow=c(2,1))

> plot(wn, main="", xlab="백색잡음계열", ylab="", col="steelblue")

> abline(h=0, lty=2, col="gray")

> pacf(wn, main="Partial ACF 부분상관도표", col="steelblue") #main="", xlab="", ylab="", col="steelblue")

> abline(h=0, lty=2, col="gray")

>> 모두 점선 안(1.96 / root N)의 작은 값을 가지고 있다.

> plot(sin, main="sin 계열", xlab="", ylab="", col="steelblue")

> abline(h=0, lty=2, col="gray")

> pacf(sin, main="Partial ACF 부분상관도표", col="steelblue")

>> sin 계열의 부분자기상관계수는 앞의 2기에서 유의하게 큰 값을 가지고 있다.

> plot(gdp_o, main="GDP 원계열", xlab="", ylab="", col="steelblue")

> lines(gdp_sa,col='red')

> pacf(gdp_o, main="Partial ACF 부분상관도표", col="steelblue")

>> 부분자기상관계수값은 1차에서 1에 가까운 큰 값을 가지고 있다. 1차의 큰 값은 추세가 있음을, 주기적으로 크게 나타난 부분자기상관계수는 계절변동이 있음을 의미한다.

> plot(diff(log(gdp_o)), main="GDP 로그차분", xlab="", ylab="", col="steelblue")

> pacf(diff(log(gdp_o)), main="Partial ACF 부분상관도표 ", col="steelblue")

>> 부분자기상관계수값이 1차의 값은 작고 주기적으로 움직이는 것으로 나타났다. 이는 동 시계열의 로그차분을 통해 추세를 제거했지만, 계절 변동은 그대로 남아있음을 의미한다.

> plot(diff(log(gdp_o),4), main="GDP 로그 4차 차분", xlab="", ylab="", col="steelblue")

> abline(h=0,lty=2,col='grey')

> pacf(diff(log(gdp_o),4), main="Partial ACF 부분상관도표 ", col="steelblue")

>> GDP 원계열을 로그 4차차분한 계열의 부분상관도표를 그려보면 그림과 같다. 이를 살펴보면 부분자기상관계수값은 3차 이후 작은값을 가지는 것으로 나타났다. 이는 동시계열의 로그 4차 차분을 통해 추세 및 계절변동이 제거되었음을 의미한다.

> sin12 = ts(sin(1:nn/nn*12*pi), start=1970, frequency=4)

> sin36 = ts(sin(1:nn/nn*36*pi), start=1970, frequency=4)

> plot(cbind(sin12, sin36), main="sin12와 sin36 추이", xlab="", ylab="", col="steelblue")

> spectrum(cbind(sin12, sin36), spans=c(3,3), main="sin12와 sin36 함수 스펙트럼", col="steelblue")

>> 위의 그래프를 보면 첫번째 sin함수는 주기가 상대적으로 긴 함수, 즉 저주파 함수이고, 두번째 sin 함수는 상대적으로 주기가 짧은 함수, 즉 고주파 함수이다. 이를 평활화된 주기도(스펙트럼)을 구해보면 아래 그래프와 같은데 실선이 sin12, 점선이 sin36이다. 해당주파수(주기의 역수)의 값이 크게 나타난다. 저주파 sin12자료의 스펙트럼은 낮은 주파수에서 최고점이 나타나나, 고주파 sin36 자료의 스펙트럼은 높은 주파수에서 최고점이 나타난다.

> plot(sin12+sin36, main="sin12+sin36", xlab="", ylab="", col="steelblue")

> spectrum(sin12+sin36, spans=c(3,3), main="sin 12 + sin36 합성 스펙트럼", col="steelblue")

>> 두 sin 계열을 합성한 결과 스펙트럼은 다음과 같다. 두 계열 합성계열의 스펙트럼은 두 sin 계열에 해당되는 주파수에서 큰 값을 가지고 있어서 동 계열에 두 주파수에서의 변동요인이 있음을 알 수 있다.

> par(mfrow=c(2,1))

> plot(wn, main="백색잡음계열", xlab="", ylab="", col="steelblue")

> abline(h=0, lty=2, col="gray")

> wn=ts(rnorm(nn), start=1970, frequency=4)

> spectrum(wn, spans=c(3,3), main="스펙트럼", col=1:2)

>> 표준정규분포로부터 154개의 난수를 생성하여 보면 아무 특징이 없는 백색잡음계열이 생성된다. 백색잡음계열의 평활화한 스펙트럼은 그림과 같이 아무 특징이 없는데 y축 값을 보면 매우 작은값을 가지고 수평적으로 움직이고 있다. 백색잡음계열의 평활화한 스펙트럼은 이론적으로는 수평선이다. 다양한 peak가 나타난다는 것은 잔차가 random, 주파수에 특정 주기가 없다는 뜻이다.

> par(mfrow=c(2,1))

> plot(gdp_o, main="GDP 원계열", xlab="", ylab="")

> lines(gdp_sa, col=2)

> spectrum(cbind(gdp_o, gdp_sa), spans=c(3,3), main="스펙트럼(원계열+계정조정계열)", col=1:2)

>> GDP 원계열과 계절조정계열의 시계열도표와 스펙트럼은 다음과 같다. 원계열의 스펙트럼(실선)을 보면 저주파수와 계절주파수에서 큰 값을 가지는 것으로 나타났다. 여기서 저주파수는 추세변동요인의 존재를, 계절주파수의 큰 값은 계절변동요인의 존재를 의미한다. 반면 GDP 계절조정계열의 스펙트럼(점선)을 구해보면 큰 값이 사라짐을 알 수 있다. 즉 계절변동요인이 제거되었음을 말한다.

> dlgdp1 = diff(log(gdp_o))

> dlgdp2 = diff(log(gdp_o),4)

> dlgdp = cbind(dlgdp1,dlgdp2)

> plot(dlgdp1, main="로그1차차분과 4차차분계", xlab="", ylab="", col="steelblue")

> lines(dlgdp2, col=2)

> spectrum( na.omit(cbind(dlgdp1,dlgdp2)), spans=c(3,3), main="스펙트럼(로그1차차분과 4차차분계열)", col=c("steelblue", "red"))

>> 로그1차 차분한 GDP 원계열(보라색)의 스펙트럼을 구해보면 아래 그림과 같다(보라색). 이를 보면 원계열에서의 스텍트럼에서 계절주파수의 큰 값은 남고 저주파수의 추세변동요인과 관련된 큰 값은 사라짐을 알 수 있다. 이는 차분이 시계열에서 추세변동을 제거하는 역할을 했음을 의미한다. 한편 로그 4차 차분 계열의 스펙트럼(점선)을 구해보면 추세변동요인과 계졀변동 요인에 해당되는 주파스의 밀도함수값이 작게 나타난반면, 순환변동에 해당하는 값이 상대적으로 크게 나타나 로그 4차 차분계열에서 순환변동을 파악할 수 있을 것으로 보인다.

출처: 예측방법론(이긍희, 이한식, 장영재 공저)

'KNOU > 3 예측방법론' 카테고리의 다른 글

제2장 예측데이터 - 시계열 (1)	2017.03.17

Posted by 마르띤

,

제2장 예측데이터 - 시계열

KNOU/3 예측방법론 2017. 3. 17. 19:04

### 2주차, 제2장 예측데이터 - 시계열 ###

> setwd("c:/Rwork/")

> gdp<-read.csv('gdpq.csv',header=T)

> head(gdp)

gdp gdpsa #계절 조정

1 12807.5 14934.6

2 15732.5 15246.5

3 14621.6 15472.1

4 18689.5 16198.0

5 14512.5 16664.4

6 17709.4 17180.3

> gdp <- ts(read.csv("gdpq.csv", header = TRUE), start=1970, frequency=4)

> gdp

gdp gdpsa

1970 Q1 12807.5 14934.6

1970 Q2 15732.5 15246.5

1970 Q3 14621.6 15472.1

1970 Q4 18689.5 16198.0

(…중간 생략…)

2013 Q1 265291.2 279223.0

2013 Q2 285312.3 282344.9

2013 Q3 283595.0 285348.5

2013 Q4 300654.7 287979.3

> plot(gdp[,1]/1000, ylab="GDP(조원)", xlab="연도", col="steelblue")

> lines(gdp[,2]/1000, col="red")

> ww=c(.5,1,1,1,.5) #5분기 이동평균

> gdpm5 = filter(gdp, sides=2, ww/sum(ww))

> dlgdp1 = diff(log(gdp)) #1차 차분

> dlgdp4 = diff(log(gdp),4) #4차차분

> plot(gdp, main="GDP", ylab=" ", xlab="YEAR") #gdp 원계열

> plot(gdpm5, main="GDPM5", ylab=" ", xlab="YEAR") #gdp m5 이동평균

> plot(dlgdp1, main="diff(log(GDP))", ylab=" ", xlab= "YEAR") #1차 차분

> plot(dlgdp4, main="diff(log(GDP),4)", ylab=" ", xlab= "YEAR") #4차 차분

출처: 예측방법론(이긍희, 이한식, 장영재 공저, Knou press)

'KNOU > 3 예측방법론' 카테고리의 다른 글

제3장 예측데이터의 기초분석 (0)	2017.03.22

Posted by 마르띤

,

제1장 다변량 데이터 시각화

KNOU/3 R을 이용한 다변량분석 2017. 2. 25. 16:47

회귀모형의 적합결과 및 분산분석

> setwd("~/Documents/Rwork/KNOU/2017.2학기/R을 이용한 다변량 분석")

> hald.data = read.table('hald.txt',header=T)

> hald.data

y x1 x2 x3 x4

1 78.5 7 26 6 60

2 74.3 1 29 15 52

3 104.3 11 56 8 20

4 87.6 11 31 8 47

5 95.9 7 52 6 33

6 109.2 11 55 9 22

7 102.7 3 71 17 6

8 72.5 1 31 22 44

9 93.1 2 54 18 22

10 115.9 21 47 4 26

11 83.8 1 40 23 34

12 113.3 11 66 9 12

13 109.4 10 68 8 12

> hald.ls = lm(y~.,data=hald.data)

> hald.ls

Call:

lm(formula = y ~ ., data = hald.data)

Coefficients:

(Intercept) x1 x2 x3 x4

62.4054 1.5511 0.5102 0.1019 -0.1441

> summary(hald.ls)

Call:

lm(formula = y ~ ., data = hald.data)

Residuals:

Min 1Q Median 3Q Max

-3.1750 -1.6709 0.2508 1.3783 3.9254

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 62.4054 70.0710 0.891 0.3991

x1 1.5511 0.7448 2.083 0.0708 .

x2 0.5102 0.7238 0.705 0.5009

x3 0.1019 0.7547 0.135 0.8959

x4 -0.1441 0.7091 -0.203 0.8441

---

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.446 on 8 degrees of freedom

Multiple R-squared: 0.9824, Adjusted R-squared: 0.9736

F-statistic: 111.5 on 4 and 8 DF, p-value: 4.756e-07

> anova(hald.ls)

Analysis of Variance Table

Response: y

Df Sum Sq Mean Sq F value Pr(>F)

x1 1 1450.08 1450.08 242.3679 2.888e-07 ***

x2 1 1207.78 1207.78 201.8705 5.863e-07 ***

x3 1 9.79 9.79 1.6370 0.2366

x4 1 0.25 0.25 0.0413 0.8441

Residuals 8 47.86 5.98

---

데이터 생성

> a = c(1:10)

> a

[1] 1 2 3 4 5 6 7 8 9 10

> a = c(20:10)

> a

[1] 20 19 18 17 16 15 14 13 12 11 10

> a %/% 3

[1] 6 6 6 5 5 5 4 4 4 3 3

> a %% 3

[1] 2 1 0 2 1 0 2 1 0 2 1

> a = c(1:3) == c(3:1)

> a

[1] FALSE TRUE FALSE

SEQ 함수

> x = -5:5

> x

[1] -5 -4 -3 -2 -1 0 1 2 3 4 5

> x[x<0]

[1] -5 -4 -3 -2 -1

> mean(x)

[1] 0

> x[x<0]=0

> mean(x)

[1] 1.363636

> consec=(1:10)*10

> consec

[1] 10 20 30 40 50 60 70 80 90 100

> consec[2]

[1] 20

> consec[c(3,6,8,10)]

[1] 30 60 80 100

> consec[c(-2,-4)]

[1] 10 30 50 60 70 80 90 100

> x=c(10:1)

> consec<60

[1] TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE

> x[consec<60] #첫번째, 두번째, 세번째 값

[1] 10 9 8 7 6

> seq(-pi,pi,1) #파이값을 1씩 증가

[1] -3.1415927 -2.1415927 -1.1415927 -0.1415927 0.8584073 1.8584073 2.8584073

> seq(-pi,pi,length=10) #파이값을 10개로 나눔

[1] -3.1415927 -2.4434610 -1.7453293 -1.0471976 -0.3490659 0.3490659 1.0471976 1.7453293

[9] 2.4434610 3.1415927

난수생성

> rnorm(10,-5,2.5) #평균이 -5, 표준편차가 2.5인 정규분포를 따르는 난수 10개

[1] -5.66696754 -2.73338407 -3.51628322 -5.24746144 -7.56506891 -4.47006154 -6.19141094 -9.51129455

[9] -0.08720922 -8.05256413

> rnorm(20) #평균이 0, 표준편차가 1인 정규분포를 따르는 난수 20개

[1] -0.43086982 -0.40359489 1.93710961 -0.84071208 -0.10802247 -0.01138187 0.92168668 -2.05161187

[9] -0.12887195 -0.27202527 1.73687933 -1.09407623 0.46089730 1.74135668 0.69786046 -0.38488478

[17] 1.05465527 -1.22405759 0.18905127 1.26340571

행렬연산

> x=c(1:12)

> x=matrix(x,ncol=4,byrow=T)

> x

[,1] [,2] [,3] [,4]

[1,] 1 2 3 4

[2,] 5 6 7 8

[3,] 9 10 11 12

> y=matrix(x,ncol=4,byrow=F)

> y

[,1] [,2] [,3] [,4]

[1,] 1 2 3 4

[2,] 5 6 7 8

[3,] 9 10 11 12

> y=c(1:12)

> y=matrix(y,ncol=4,byrow=F)

> y

[,1] [,2] [,3] [,4]

[1,] 1 4 7 10

[2,] 2 5 8 11

[3,] 3 6 9 12

> x

[,1] [,2] [,3] [,4]

[1,] 1 2 3 4

[2,] 5 6 7 8

[3,] 9 10 11 12

> x[,c(1:3)]

[,1] [,2] [,3]

[1,] 1 2 3

[2,] 5 6 7

[3,] 9 10 11

> matrix(1,nrow=4,byrow=T)

[,1]

[1,] 1

[2,] 1

[3,] 1

[4,] 1

> matrix(1,nrow=4,ncol=4)

[,1] [,2] [,3] [,4]

[1,] 1 1 1 1

[2,] 1 1 1 1

[3,] 1 1 1 1

[4,] 1 1 1 1

> x[c(1:3),]

[,1] [,2] [,3] [,4]

[1,] 1 2 3 4

[2,] 5 6 7 8

[3,] 9 10 11 12

> x[c(1:3),-2]

[,1] [,2] [,3]

[1,] 1 3 4

[2,] 5 7 8

[3,] 9 11 12

> a=c(1:3,5:7)

> a

[1] 1 2 3 5 6 7

> x=matrix(a,ncol=3,byrow=T)

> x

[,1] [,2] [,3]

[1,] 1 2 3

[2,] 5 6 7

> y=matrix(c(1:4),ncol=2,byrow=F)

> y

[,1] [,2]

[1,] 1 3

[2,] 2 4

> t(x)%*%y

[,1] [,2]

[1,] 11 23

[2,] 14 30

[3,] 17 37

apply함수

> apply(x,1,mean) #1은 행

[1] 2 6

> apply(x,2,mean) #2는 열

[1] 3 4 5

Function 문

> squareF=function(x){x*x}

> View(squreF)

> squareF(5)

[1] 25

기술통계량 구하기

> survey.data=read.table('survey.txt',header=T)

> survey.data

seq sex marriage age job edu salary

1 1 1 1 21 1 4 60

2 2 1 1 22 5 5 100

3 3 1 1 33 1 4 200

4 4 2 2 33 7 4 200

5 5 1 2 28 1 4 70

6 6 1 1 21 5 5 80

7 7 2 2 39 7 4 190

8 8 1 1 32 1 4 100

9 9 1 2 44 3 1 120

10 10 1 2 55 4 4 110

11 11 2 2 46 7 5 150

12 12 1 1 20 1 4 50

13 13 1 2 31 6 4 210

14 14 1 1 27 1 4 60

15 15 2 1 21 5 5 80

> str(survey.data)

'data.frame': 15 obs. of 7 variables:

$ seq : int 1 2 3 4 5 6 7 8 9 10 ...

$ sex : int 1 1 1 2 1 1 2 1 1 1 ...

$ marriage: int 1 1 1 2 2 1 2 1 2 2 ...

$ age : int 21 22 33 33 28 21 39 32 44 55 ...

$ job : int 1 5 1 7 1 5 7 1 3 4 ...

$ edu : int 4 5 4 4 4 5 4 4 1 4 ...

$ salary : int 60 100 200 200 70 80 190 100 120 110 ...

> survey.data$sex = as.factor(survey.data$sex)

> survey.data$marriage = as.factor(survey.data$marriage)

> survey.data$job = as.factor(survey.data$job)

> survey.data$edu = as.factor(survey.data$edu)

> str(survey.data)

'data.frame': 15 obs. of 7 variables:

$ seq : int 1 2 3 4 5 6 7 8 9 10 ...

$ sex : Factor w/ 2 levels "1","2": 1 1 1 2 1 1 2 1 1 1 ...

$ marriage: Factor w/ 2 levels "1","2": 1 1 1 2 2 1 2 1 2 2 ...

$ age : int 21 22 33 33 28 21 39 32 44 55 ...

$ job : Factor w/ 6 levels "1","3","4","5",..: 1 4 1 6 1 4 6 1 2 3 ...

$ edu : Factor w/ 3 levels "1","4","5": 2 3 2 2 2 3 2 2 1 2 ...

$ salary : int 60 100 200 200 70 80 190 100 120 110 ...

> attach(survey.data)

> mean(age)

[1] 31.53333

> sd(age)

[1] 10.57535

> summary(age)

Min. 1st Qu. Median Mean 3rd Qu. Max.

20.00 21.50 31.00 31.53 36.00 55.00

> tapply(age,sex,mean)

1 2

30.36364 34.75000

> tapply(age,sex,sd)

1 2

10.82841 10.59481

> tapply(age,marriage,mean)

1 2

24.62500 39.42857

> tapply(age,marriage,sd)

1 2

5.316752 9.571784

> table(sex,marriage)

marriage

sex 1 2

1 7 4

2 1 3

> sex.marriage=list(sex,marriage)

> table(sex.marriage)

sex.marriage.2

sex.marriage.1 1 2

1 7 4

2 1 3

> sex.edu=list(survey.data$sex,survey.data$edu)

> sex.edu.tb=table(sex.edu)

> sex.edu.tb

sex.edu.2

sex.edu.1 1 2 3 4 5

1 1 1 1 13 11

2 0 0 2 6 5

> rownames(sex.edu.tb)=c('남성','여성')

> colnames(sex.edu.tb)=c('무학','초졸','중졸','고졸','대졸')

> barplot(sex.edu.tb)

> par(mfrow=c(1,2))

> pie(sex.edu.tb[1,])

> title('Education of Male')

> pie(sex.edu.tb[2,])

> title('Education of Female')

#산점도 사례

> math <-c(68,64,48,46,78,60,90,50,66,70)

> physics<-c(70,68,46,48,84,64,92,52,68,72)

> cor(math,physics) #두 변수의 상관계수 값

[1] 0.9921232

> plot(math,physics,main='수학과 물리 산점도')

##설문조사 사례##

문항 1. 귀하의 성별은?

1) 남자 2) 여자 문항

2. 결혼하셨습니까?

1) 미혼 2) 기혼 3) 이혼 문항

3. 귀하의 나이는? (단위: 세)

4. 귀하의 직업은?

1) 회사원 2) 공무원 3) 노무자 4) 정치가 5) 학생 6) 기업가 7) 주부 8) 기타 문항

5. 귀하의 학력은?

1) 중졸이하 2)고졸 3) 대졸 4) 대학원 5) 그 이상

6. 가족의 월수입은? (단위: 만원)

> survey.data = read.table('survey.csv',sep=',',header=T)

> head(survey.data)

seq sex marriage age job edu salary

1 1 1 1 21 1 4 60

2 2 1 1 22 5 5 100

3 3 1 1 33 1 4 200

4 4 2 2 33 7 4 120

5 5 1 2 28 1 4 70

6 6 1 1 21 5 5 80

> attach(survey.data)

> mean(age)

[1] 34.275

> sd(age)

[1] 11.60236

> summary(age)

Min. 1st Qu. Median Mean 3rd Qu. Max.

20.00 24.75 32.00 34.28 42.50 59.00

#그룹화

> tapply(survey.data$age,survey.data$sex,mean) #성별 나의 평균

1 2

33.96296 34.92308

> tapply(survey.data$age,survey.data$sex,sd)

1 2

11.96945 11.24323

> tapply(survey.data$age,survey.data$marriage,mean) #결혼 별 나이 평균

1 2 3

24.66667 39.13043 50.50000

> tapply(survey.data$age,survey.data$marriage,sd)

1 2 3

4.151879 10.467718 12.020815

> sex.marriage=list(survey.data$sex,survey.data$marriage) #그룹

> table(sex.marriage) #분할표

sex.marriage.2

sex.marriage.1 1 2 3

1 10 15 2

2 5 8 0

> tapply(survey.data$age,sex.marriage,mean)

1 2 3

1 24.8 37.86667 50.5

2 24.4 41.50000 NA

> tapply(survey.data$age,sex.marriage,sd)

1 2 3

1 4.709329 11.230486 12.02082

2 3.209361 9.071147 NA

> table(survey.data$sex) #빈도표

1 2

27 13

> table(edu)

edu

1 2 3 4 5

1 1 3 19 16

> SexEdu=table(survey.data$sex,survey.data$edu)

> SexEdu

1 2 3 4 5

1 1 1 1 13 11

2 0 0 2 6 5

어느 집단에서 표본을 10명 추출하여 다음과 같은 4개 문항에 대하여 설문조사를 실시하였다.

문항 1. 귀하의 성별은? 1) 남자 2) 여자 문항

2. 귀하의 나이는? (단위 : 세) 문항

3. 교육정도는? 1) 중졸이하 2) 고졸 3)대졸 및 그 이상 문항

4. 월수입(단위: 만원)

> ex8=read.table('ex8.txt',header=T)

> colnames(ex8)<-c('sex','age','edu','salary')

> sex.tb=table(ex8$sex)

> sex.tb

1 2

6 4

> edu.tb=table(ex8$edu)

> edu.tbe

1 2 3

2 4 4

> rownames(edu.tb)=c('중졸이하','고졸이하','대졸이상')

> edu.tb

중졸이하 고졸이하 대졸이상

2 4 4

> barplot(edu.tb)

> plot(ex8$age,ex8$salary)

> title('Scatter plot of (Age,Salary)')

> plot(ex8$age,ex8$salary,type="n")

> points(ex8$age[sex=='1'],ex8$salary[sex=='1'],pch='M',col='blue')

> points(ex8$age[sex=='2'],ex8$salary[sex=='2'],pch='F',col='red')

> title("나이, 월수입 산점도:남녀별 구분")

위의 그래프가 내가 그린 것이고, 아래는 교재. 왜 안되지?.. 좀더 공부해보자.

> edt.tb = table(survey.data$edu)

> edt.tb

1 2 3 4 5

1 1 3 19 16

> rownames(edt.tb) = c('무학','초졸','중졸','고졸','대졸')

> barplot(edu.tb)

> title('교육정도 막대그림')

> pie(edu.tb)

> title('교육정도 원그림')

Sex를 그룹변수로 하여 교육(edu)의 막대그림을 그려보도록 하자.

> sex_edu = list(survey.data$sex, survey.data$edu)

> sex_edu.tb = table(sex_edu)

> sex_edu.tb

sex_edu.2

sex_edu.1 1 2 3 4 5

1 1 1 1 13 11

2 0 0 2 6 5

> colnames(sex_edu.tb) = c('무학','초졸','중졸','고졸','대졸')

> rownames(sex_edu.tb) = c('남자','여자')

> barplot(sex_edu.tb)

> title('성별 교육 정도 막대그림')

> boxplot(salary~sex,data=survey.data,xlab='성별',ylab='소득')

R에 내장된 데이터 co2를 이용, 선으로 그려보자(lines(smooth(co2)). col에서 4는 blue를 의미.

> plot(co2)

> lines(smooth(co2),col=4)

> x<-seq(0,20,0.1)

> y<-exp(-x/10)*cos(2*x)

> plot(x,y,type='line') #or type=’l’

> library(MVA)

> library(HSAUR2)

> data(USairpollution)

> head(USairpollution)

SO2 temp manu popul wind precip predays

Albany 46 47.6 44 116 8.8 33.36 135

Albuquerque 11 56.8 46 244 8.9 7.77 58

Atlanta 24 61.5 368 497 9.1 48.34 115

Baltimore 47 55.0 625 905 9.6 41.31 111

Buffalo 11 47.1 391 463 12.4 36.11 166

Charleston 31 55.2 35 71 6.5 40.75 148

> plot(USairpollution$manu,USairpollution$popul)

> x=USairpollution[,c(3,4)]

> bvbox(x,xlab='manu',ylab='popul')

> title('bivariate boxplot') # bivariate 이변량

> identify(x)

> rownames(x)[c(7,9,14,30)]

[1] "Chicago" "Cleveland" "Detroit" "Philadelphia"

> plot(wind~temp,data=USairpollution,pch=9)

> with(USairpollution,symbols(temp,wind,circles=SO2,inches=0.5,add=T))

> title('Bubble Plot')

등고선 그림(contour)과 이미지 그림(image plot)

기상자료, 지리 위치 등을 나타내는 자료는 좌표상에 z변수를 등고선이나, 이미지 등으로 표시하게 되면 매우 효율적임

- 등고선 자료:

자료를 나타내기 위하 두 좌표의 격자(grad)위에 등고선으로 면을 그려줌

- 이미지 그림:

색깔을 이용하여 z변수의 이미지를 표현한 그래프

> x<--6:16

> x

[1] -6 -5 -4 -3 -2 -1 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16

> par(mfrow=c(2,2))

> contour(outer(x,x),method='edge',vfont=c('sans serif','plain'))

> z<-outer(x,sqrt(abs(x)),FUN='/')

> image(x,x,z)

> contour(x,x,z,col='pink',add=TRUE,method='edge',vfont=c('sans serif','plain'))

> contour(x,x,z,ylim=c(1,6),method='simple',laboex=1)

Warning messages:

1: In plot.window(xlim, ylim, ...) : "laboex" is not a graphical parameter

2: In title(...) : "laboex" is not a graphical parameter

3: In axis(side = side, at = at, labels = labels, ...) :

"laboex" is not a graphical parameter

4: In axis(side = side, at = at, labels = labels, ...) :

"laboex" is not a graphical parameter

5: In box(...) : "laboex" is not a graphical parameter

> contour(x,x,z,ylim=c(-6,6),nlev=20,lty=2,method='simple')

겨냥도 그림(Perspective Plot)

등간격의 격자상에 높이 값을 갖는 행렬 자료에 대한 3차원 표현 방법 높이는 선으로 연결

> par(mfrow=c(1,2))

> x<-seq(-10,10,length=30)

> y<-x

> f<-function(x,y){r<-sqrt(x^2+y^2);10*sin(r)/r}

> z<-outer(x,y,f)

> z[is.na(z)]<-l

Error: object 'l' not found

> z[is.na(z)]<-1

> persp(x,y,z,theta=30,phi=30,expand=0.5,col='lightblue')

> persp(x,y,z,theta=30,phi=30,expand=0.5,col='lightblue',ltheta=120,shade=0.75,thicktype='detailed',xlab='X',ylab='Y',zlab='Sinc(r)')

Warning message:

In persp.default(x, y, z, theta = 30, phi = 30, expand = 0.5, col = "lightblue", :

"thicktype" is not a graphical parameter

> edu<-read.csv('education.csv')

> edu

grade y1970 y1990

1 초등졸 23.8 21.1

2 중졸 24.7 20.2

3 고졸 36.0 43.6

4 대졸 11.0 15.1

> name=edu[,1]

> grade70=edu[,2]

> grade90=edu[,3]

> par(mfrow=c(1,2))

> pie(grade70,label=name,clockwise=T,col=c(2:5),main='1970년도 여성학력')

> mtext(outer=F,'(a)',side=1,line=-6)

> pie(grade90,label=name,clockwise=T,col=c(2:5),main='1990년도 여성학력')

> mtext(outer=F,'(b)',side=1,line=-6)

연습문제

1. 13개 시중은행에 대한 편리성, 신속성, 친절, 능률, 쾌적, 자동화등의 점수. 출처: 유종렬 외 3인, S-Plus를 이용한 통계계산, 박영사, 1997

> bank<-read.csv('bank.csv',header=T)

> head(bank)

X convenience accuracy kindness efficiency pleasant automatic

1 kookmin 70.5 59.4 63.7 54.3 66.9 62.6

2 enterpr 64.8 70.3 68.6 55.2 68.0 64.1

3 boram 67.1 79.6 78.5 62.4 79.8 62.4

4 commerce 61.1 65.0 65.6 54.4 64.5 63.9

5 seoul 63.4 66.5 67.9 65.0 59.7 62.0

6 shinhan 72.3 69.1 74.2 60.0 70.1 68.2

> bank<-bank[,-1]

> name<-c('kookmin','enterpr','boram','commerce','seoul','shinhan','city','exchange','first','chohung','hana','hanil','house')

> head(bank)

convenience accuracy kindness efficiency pleasant automatic

1 70.5 59.4 63.7 54.3 66.9 62.6

2 64.8 70.3 68.6 55.2 68.0 64.1

3 67.1 79.6 78.5 62.4 79.8 62.4

4 61.1 65.0 65.6 54.4 64.5 63.9

5 63.4 66.5 67.9 65.0 59.7 62.0

6 72.3 69.1 74.2 60.0 70.1 68.2

(1) 각 변수들의 히스토그램을 그리고 설명

> library(dplyr)

> nrow(bank)

[1] 13

> bank$convenience %>% summary

Min. 1st Qu. Median Mean 3rd Qu. Max.

61.10 63.50 64.80 65.98 68.40 72.30

> bank$convenience %>% mean

[1] 65.97692

> bank$convenience %>% sd

[1] 3.279521

> bank$convenience %>% hist

> con.h<-bank %>% filter(convenience > mean(bank$convenience))

> con.l<-bank %>% filter(convenience < mean(bank$convenience))

> con.h

convenience accuracy kindness efficiency pleasant automatic

1 70.5 59.4 63.7 54.3 66.9 62.6

2 67.1 79.6 78.5 62.4 79.8 62.4

3 72.3 69.1 74.2 60.0 70.1 68.2

4 68.4 67.5 67.3 51.3 71.3 65.8

5 66.1 66.5 67.3 50.7 63.4 63.3

6 69.0 74.3 80.5 63.6 75.7 55.9

> con.h$convenience %>% summary

Min. 1st Qu. Median Mean 3rd Qu. Max.

66.10 67.42 68.70 68.90 70.12 72.30

> con.l$convenience %>% summary

Min. 1st Qu. Median Mean 3rd Qu. Max.

61.10 63.30 63.50 63.47 64.15 64.80

> con.h %>% summarise('평균 친절도' = mean(kindness))

평균 친절도

1 71.91667

> con.l %>% summarise('평균 친절도'=mean(kindness))

평균 친절도

1 67.7

2) 산점도 행렬

> plot(bank)

3) 별그림과 얼굴그림

> rownames(bank) = name

> head(bank)

convenience accuracy kindness efficiency pleasant automatic

kookmin 70.5 59.4 63.7 54.3 66.9 62.6

enterpr 64.8 70.3 68.6 55.2 68.0 64.1

boram 67.1 79.6 78.5 62.4 79.8 62.4

commerce 61.1 65.0 65.6 54.4 64.5 63.9

seoul 63.4 66.5 67.9 65.0 59.7 62.0

shinhan 72.3 69.1 74.2 60.0 70.1 68.2

> stars(bank)

> library(aplpack)

Loading required package: tcltk

> faces(bank,face.type=1)

effect of variables:

modified item Var

"height of face " "convenience"

"width of face " "accuracy"

"structure of face" "kindness"

"height of mouth " "efficiency"

"width of mouth " "pleasant"

"smiling " "automatic"

"height of eyes " "convenience"

"width of eyes " "accuracy"

"height of hair " "kindness"

"width of hair " "efficiency"

"style of hair " "pleasant"

"height of nose " "automatic"

"width of nose " "convenience"

"width of ear " "accuracy"

"height of ear " "kindness"

출처: R을 이용한 데이터 마이닝(김성수, 김현중, 정성석, 이용구 공저)

Posted by 마르띤

,

제6장 모형비교평가 - 연속형 목표변수

KNOU/2 데이터마이닝 2017. 2. 24. 19:18

연속형 목표 변수.

목표변수가 연속형인 경우 모형의 예측력 측도로는 주로 예측평균제곱오차(PMSE: Prediction Mean Squared Error)를 사용한다.

PMSE = sigma(yi-yhat i)²/n

관측치(yi)와 예측치(yhat i) 사이에 차이가 적을수록 예측을 잘한 것으로므로 예측평균제곱오차가 작을수록 그 모형의 예측력이 높다고 할 수 있다. 또한 관측치와 예측치를 가로축 및 세로축으로 놓고 그린 산점도(scatter plot)가 45도 대각선을 중심으로 모여 있으면 예측력이 좋다고 할 수 있다.

보스턴 하우징 데이터 Housing Values in Suburbs of Boston

(출처: http://127.0.0.1:31866/library/MASS/html/Boston.html)

변수명	속성	변수 설명
crim	수치형 (numeric)	per capita crime rate by town 타운별 1인당 범죄율
zn	수치형 (numeric)	proportion of residential land zoned for lots over 25,000 sq.ft 25,000평방피트를 초과하는 거주지역 비율
indus	수치형 (numeric)	proportion of non-retail business acres per town. 비소매 사업지역의 토지 비율
chas	범주형 (integer)	Charles River dummy variable (= 1 if tract bounds river; 0 otherwise) 찰스강 더비 변수 (강의 경계에 위치 = 1, 아니면 = 0)
nox	수치형 (numeric)	nitrogen oxides concentration (parts per 10 million) 10ppm당 농축 일산화질소
rm	수치형 (numeric)	average number of rooms per dwelling. 주택 1가구등 방의 평균 개수
age	수치형 (numeric)	proportion of owner-occupied units built prior to 1940. 1940년 이전에 건축된 소유자 주택 비율
dis	수치형 (numeric)	weighted mean of distances to five Boston employment centres. 5개의 보스턴 고용센터까지의 접근성 지수
rad	범주형 (integer)	index of accessibility to radial highways. 방사형 도로까지의 접근성 지수
tax	수치형 (numeric)	full-value property-tax rate per \$10,000. 10,000달러당 재산세율
ptratio	수치형 (numeric)	pupil-teacher ratio by town. 타운별 학생/교사 비율
black	수치형 (numeric)	1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town. 타운별 흑인의 비율
lstat	수치형 (numeric)	lower status of the population (percent). 모집단의 하위계층의 비율
medv (목표변수)	수치형 (numeric)	median value of owner-occupied homes in \$1000s. 본인 소유의 주택가격(중앙값)

<데이터 준비>

> library(MASS)

> str(Boston)

'data.frame': 506 obs. of 15 variables:

$ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...

$ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...

$ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...

$ chas : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...

$ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...

$ rm : num 6.58 6.42 7.18 7 7.15 ...

$ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...

$ dis : num 4.09 4.97 4.97 6.06 6.06 ...

$ rad : Factor w/ 9 levels "1","2","3","4",..: 1 2 2 3 3 3 5 5 5 5 ...

$ tax : num 296 242 242 222 222 222 311 311 311 311 ...

$ ptratio : num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...

$ black : num 397 397 393 395 397 ...

$ lstat : num 4.98 9.14 4.03 2.94 5.33 ...

$ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...

$ medv.hat: num 23.7 23.7 32.4 32.4 32.4 ...

> dim(Boston)

[1] 506 15

> Boston<-Boston[,-15]

> str(Boston)

'data.frame': 506 obs. of 14 variables:

$ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...

$ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...

$ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...

$ chas : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...

$ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...

$ rm : num 6.58 6.42 7.18 7 7.15 ...

$ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...

$ dis : num 4.09 4.97 4.97 6.06 6.06 ...

$ rad : Factor w/ 9 levels "1","2","3","4",..: 1 2 2 3 3 3 5 5 5 5 ...

$ tax : num 296 242 242 222 222 222 311 311 311 311 ...

$ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...

$ black : num 397 397 393 395 397 ...

$ lstat : num 4.98 9.14 4.03 2.94 5.33 ...

$ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...

1. 선형회귀모형

> i=sample(1:nrow(Boston),round(nrow(Boston)*0.7))

> Boston.train = Boston[i,] #70% for training data

> Boston.test = Boston[-i,] #30% for test data

> fit.reg = lm(medv~.,data = Boston.train) #선형회귀모형

> summary(fit.reg) #선형회귀모형 요약

Call:

lm(formula = medv ~ ., data = Boston.train)

Residuals:

Min 1Q Median 3Q Max

-9.9375 -2.7260 -0.4588 1.7937 24.3220

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 41.027525 6.644569 6.175 1.93e-09 ***

crim -0.098730 0.052682 -1.874 0.061794 .

zn 0.059161 0.016208 3.650 0.000304 ***

indus 0.073377 0.075763 0.969 0.333492

chas1 2.314962 1.023899 2.261 0.024409 *

nox -18.307923 4.638939 -3.947 9.67e-05 ***

rm 3.014437 0.498806 6.043 4.04e-09 ***

age -0.002066 0.016414 -0.126 0.899922

dis -1.564686 0.235506 -6.644 1.25e-10 ***

rad2 1.485691 1.707949 0.870 0.384999

rad3 5.371463 1.513283 3.550 0.000441 ***

rad4 2.763349 1.365438 2.024 0.043791 *

rad5 3.407697 1.380288 2.469 0.014057 *

rad6 1.840557 1.687536 1.091 0.276204

rad7 3.790169 1.832307 2.069 0.039362 *

rad8 4.691472 1.677890 2.796 0.005473 **

rad24 8.224882 2.114575 3.890 0.000121 ***

tax -0.009025 0.004555 -1.981 0.048382 *

ptratio -1.021842 0.175757 -5.814 1.43e-08 ***

black 0.008734 0.003633 2.404 0.016749 *

lstat -0.596699 0.059397 -10.046 < 2e-16 ***

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 4.658 on 333 degrees of freedom

Multiple R-squared: 0.7468, Adjusted R-squared: 0.7316

F-statistic: 49.11 on 20 and 333 DF, p-value: < 2.2e-16

> fit.step.reg = step(fit.reg,direction = 'both', trace = FALSE) #단계적 선택법, trace=T일 경우 AIC를 볼 수 있다.

> summary(fit.step.reg) #조정된 모델 요약

Call:

lm(formula = medv ~ crim + zn + chas + nox + rm + dis + rad +

tax + ptratio + black + lstat, data = Boston.train)

Residuals:

Min 1Q Median 3Q Max

-9.8479 -2.6319 -0.4838 1.7225 24.3723

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 40.432105 6.597006 6.129 2.49e-09 ***

crim -0.102050 0.052455 -1.945 0.052555 .

zn 0.058012 0.015903 3.648 0.000307 ***

chas1 2.428119 1.014994 2.392 0.017296 *

nox -17.025847 4.237976 -4.017 7.27e-05 ***

rm 2.969435 0.489146 6.071 3.45e-09 ***

dis -1.604642 0.221607 -7.241 3.08e-12 ***

rad2 1.763045 1.681098 1.049 0.295051

rad3 5.319042 1.507369 3.529 0.000476 ***

rad4 2.871626 1.358651 2.114 0.035290 *

rad5 3.377174 1.377514 2.452 0.014731 *

rad6 1.669858 1.672013 0.999 0.318656

rad7 3.831541 1.823028 2.102 0.036322 *

rad8 4.551858 1.667065 2.730 0.006659 **

rad24 7.906985 2.076704 3.807 0.000167 ***

tax -0.007384 0.004213 -1.753 0.080567 .

ptratio -0.996843 0.173266 -5.753 1.98e-08 ***

black 0.008481 0.003612 2.348 0.019458 *

lstat -0.593839 0.055467 -10.706 < 2e-16 ***

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 4.65 on 335 degrees of freedom

Multiple R-squared: 0.7461, Adjusted R-squared: 0.7324

F-statistic: 54.69 on 18 and 335 DF, p-value: < 2.2e-16

> fit.step.reg$anova #제거된 변수 보기. age와 indus가 제거

Step Df Deviance Resid. Df Resid. Dev AIC

1 NA NA 333 7223.934 1109.614

2 - age 1 0.3436196 334 7224.278 1107.631

3 - indus 1 20.2559598 335 7244.534 1106.622

> yhat.reg = predict(fit.step.reg,newdata=Boston.test,type='response') #데이터 예측, 목표값 예측 시 type=’response’

> mean((Boston.test$medv-yhat.reg)^2) #예측 모형의 평균 제곱 오차 PMSE

[1] 23.68968

2. 회귀 나무모형

> library(rpart)

> set.seed(1234)

> i=sample(1:nrow(Boston),round(nrow(Boston)*0.7))

> Boston.train = Boston[i,]

> Boston.test = Boston[-i,]

> my.control = rpart.control(xval=10, cp=0, minsplit=nrow(Boston.train)*0.05)

> fit.tree = rpart(medv~.,data=Boston.train, method='anova',control=my.control)#method=anova 선택 시 회귀나무, class 선택 시 분류나무

> which.min(fit.tree$cp[,4])

17

> fit.tree$cp[17]

[1] 0.00193237

> fit.tree$cp[17,]

CP nsplit rel error xerror xstd

0.00193237 17.00000000 0.12723520 0.23200911 0.04362543

> fit.tree$cp[17,1]

[1] 0.00193237

> ii = which.min(fit.tree$cp[,4])

> fit.prune.tree = prune(fit.tree,cp=fit.tree$cp[ii,1])

> plot(fit.prune.tree,uniform=T,margin=0.1)

> text(fit.prune.tree,col='blue',cex=0.7)

> yhat.tree = predict(fit.prune.tree,newdata=Boston.test,type='vector')

> mean((Boston.test$medv - yhat.tree)^2)

[1] 24.17478

3. 신경망모형

> library(neuralnet)

> Boston2 = Boston

> for(i in 1:ncol(Boston2))if(!is.numeric(Boston2[,i]))Boston2[,i]=as.numeric(Boston[,i])

> max1 = apply(Boston2,2,max)

> min1 = apply(Boston2,2,min)

> sdat = scale(Boston2,center=min1,scale=max1-min1)

> sdat=as.data.frame(sdat)

> View(sdat)

> set.seed(1234)

> i=sample(1:nrow(Boston2),round(nrow(Boston2)*0.7))

> Boston2.train = sdat[i,]

> Boston2.test = sdat[-i,]

> vname = names(Boston2.train)

> form = as.formula(paste('medv~',paste(vname[!vname %in% "medv"],collapse='+')))

> form

medv ~ crim + zn + indus + chas + nox + rm + age + dis + rad +

tax + ptratio + black + lstat

> fit.nn = neuralnet(form,data=Boston2.train,hidden=c(3,2),linear.output=T)#linear.output=T 회귀

> plot(fit.nn)

> str(Boston.test)

'data.frame': 152 obs. of 14 variables:

$ crim : num 0.0273 0.0299 0.1446 0.6274 1.0539 ...

$ zn : num 0 0 12.5 0 0 0 0 0 0 0 ...

$ indus : num 7.07 2.18 7.87 8.14 8.14 8.14 8.14 8.14 8.14 8.14 ...

$ chas : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...

$ nox : num 0.469 0.458 0.524 0.538 0.538 0.538 0.538 0.538 0.538 0.538 ...

$ rm : num 7.18 6.43 6.17 5.83 5.93 ...

$ age : num 61.1 58.7 96.1 56.5 29.3 98.1 89.2 91.7 85.7 90.3 ...

$ dis : num 4.97 6.06 5.95 4.5 4.5 ...

$ rad : Factor w/ 9 levels "1","2","3","4",..: 2 3 5 4 4 4 4 4 4 4 ...

$ tax : num 242 222 311 307 307 307 307 307 307 307 ...

$ ptratio: num 17.8 18.7 15.2 21 21 21 21 21 21 21 ...

$ black : num 393 394 397 396 387 ...

$ lstat : num 4.03 5.21 19.15 8.47 6.58 ...

$ medv : num 34.7 28.7 27.1 19.9 23.1 13.6 19.6 15.2 13.9 16.6 ...

> pred=compute(fit.nn,Boston2.test[,1:13])#14번째 변수가 medv 목표변수.compute는 작성된 신경망모형을 이용하여 새로운 예에 적용하여 결과를 도출.

> summary(pred)

Length Class Mode

neurons 3 -none- list

net.result 152 -none- numeric

> yhat.nn = pred$net.result*(max(Boston2$medv)-min(Boston2$medv))+min(Boston2$medv)#back transformation, 목표변수를 조정전 변수값으로 환원하는 과정. 원래 가지고 있는 값으로 환원. 역변환 과정

> head(cbind(Boston2[i,14],yhat.nn),50)

[,1] [,2]

2 31.6 21.17115946

3 23.8 30.30760801

10 28.2 20.06686203

13 21.6 21.29808480

14 16.1 20.28145030

18 23.8 18.30584489

24 36.2 16.10587573

26 21.2 16.16710531

27 17.1 17.71479369

28 20.9 16.71171622

33 31.2 12.87970542

35 20.7 14.84095463

36 17.8 22.05195863

42 14.1 33.16705465

43 15.6 26.86733582

46 17.2 21.39641163

47 14.0 20.62187769

49 19.2 13.55437741

51 22.0 21.63602446

57 18.7 23.04343250

59 19.4 23.33368068

72 15.6 21.68452217

75 20.0 22.95006783

76 18.2 21.37727858

81 19.5 27.88596126

90 11.5 31.97882170

94 29.6 24.56223312

98 8.4 41.94592034

102 8.5 23.20282812

103 19.6 27.89450422

104 28.7 19.47439349

105 21.4 19.90631048

111 11.8 20.49063003

120 23.3 20.15009581

123 26.6 19.19281571

128 21.7 16.99190292

135 20.6 14.54799220

148 20.3 13.09572781

150 21.4 15.35938684

151 13.3 18.92860416

161 50.0 31.83885454

165 24.8 20.14597079

167 29.8 44.60929619

169 23.2 20.70938882

170 15.3 20.60174521

171 31.7 19.12045877

172 22.1 19.63836241

173 27.5 20.36004622

176 22.8 32.60724125

177 26.6 24.24612266

Warning message:

In cbind(Boston2[i, 14], yhat.nn) :

number of rows of result is not a multiple of vector length (arg 1)

> Boston2.test$medv=Boston2.test$medv*(max(Boston2$medv)-min(Boston2$medv))+min(Boston2$medv)#back transformation

> mean((Boston2.test$medv-yhat.nn)^2)

[1] 19.34365205

4. 랜덤 포레스트

> library(randomForest)

> library(MASS)

> str(Boston)

'data.frame': 506 obs. of 14 variables:

$ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...

$ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...

$ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...

$ chas : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...

$ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...

$ rm : num 6.58 6.42 7.18 7 7.15 ...

$ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...

$ dis : num 4.09 4.97 4.97 6.06 6.06 ...

$ rad : Factor w/ 9 levels "1","2","3","4",..: 1 2 2 3 3 3 5 5 5 5 ...

$ tax : num 296 242 242 222 222 222 311 311 311 311 ...

$ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...

$ black : num 397 397 393 395 397 ...

$ lstat : num 4.98 9.14 4.03 2.94 5.33 ...

$ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...

> set.seed(1234)

> i=sample(1:nrow(Boston),round(nrow(Boston)*0.7))

> Boston.train = Boston[i,]

> Boston.test = Boston[-i,]

> fit.rf = randomForest(medv~., data=Boston.train, ntree=100, mtry=5, importance = T, na.action=na.omit)

> fit.rf

Call:

randomForest(formula = medv ~ ., data = Boston.train, ntree = 100, mtry = 5, importance = T, na.action = na.omit)

Type of random forest: regression

Number of trees: 100

No. of variables tried at each split: 5

Mean of squared residuals: 11.60448807

% Var explained: 85.6

> summary(fit.rf)

Length Class Mode

call 7 -none- call

type 1 -none- character

predicted 354 -none- numeric

mse 100 -none- numeric

rsq 100 -none- numeric

oob.times 354 -none- numeric

importance 26 -none- numeric

importanceSD 13 -none- numeric

localImportance 0 -none- NULL

proximity 0 -none- NULL

ntree 1 -none- numeric

mtry 1 -none- numeric

forest 11 -none- list

coefs 0 -none- NULL

y 354 -none- numeric

test 0 -none- NULL

inbag 0 -none- NULL

terms 3 terms call

> head(fit.rf$predicted,30)

58 315 308 314 433 321

31.38358586 22.70753968 28.77652778 21.11594595 20.61043860 23.92682540

5 117 332 256 345 270

32.56341667 20.64287500 20.64851351 20.71283333 30.29218391 21.12653153

140 456 144 412 141 131

15.02797980 15.33454167 14.36800926 14.37462963 15.01537037 19.89743590

92 114 154 147 77 20

23.40651042 19.04477778 15.95588235 16.05761261 21.11859649 19.79196970

106 390 253 439 398 22

19.35319820 10.55704563 31.93869369 9.22796875 12.18531532 18.03844203

> importance(fit.rf,type=1)

%IncMSE

crim 6.3818661727

zn 1.5083317623

indus 4.9357684617

chas 0.6538995939

nox 6.8146688870

rm 15.9443905699

age 3.6616801321

dis 7.0604314940

rad 6.1703029034

tax 3.9762473150

ptratio 7.1788230837

black 3.4813158641

lstat 13.5182724091

> yhat.rf=predict(fit.rf,newdata=Boston.test,type='response')

> mean((Boston.test$medv-yhat.rf)^2)

[1] 12.41384092

각 모형 별 에측평균제곱오차

모델	예측평균제곱오차
선형회귀모형	23.68968
회귀나무모형	24.17478
신경망모형	19.34365205
랜덤포레스트	12.41384092

관측치와 예측치의 산점도 생성

> par(mfrow=c(2,2))

>plot(Boston.test$medv,yhat.reg,xlab='Observed',ylab='Predicted',main='Regression',xlim=c(0,55),ylim=c(0,55))

> abline(0,1,lty=2)

> plot(Boston.test$medv,yhat.tree,xlab='Observed',ylab='Predicted',main='Decision tree',xlim=c(0,55),ylim=c(0,55))

> abline(0,1,lty=2)

> plot(Boston.test$medv,yhat.nn,xlab='Observed',ylab='Predicted',main='Neural Network',xlim=c(0,55),ylim=c(0,55))

> abline(0,1,lty=2)

> plot(Boston.test$medv,yhat.rf,xlab='Observed',ylab='Predicted',main='Random Forest',xlim=c(0,55),ylim=c(0,55))

> abline(0,1,lty=2)

-> 랜덤포레스트의 그래프가 가장 조밀하게 형성되어, 가장 좋은 예측 결과를 가진다고 할 수 있다.

출처: 데이터마이닝(장영재, 김현중, 조형준 공저)

'KNOU > 2 데이터마이닝' 카테고리의 다른 글

제8장 연관성분석 (0)	2016.12.26
제5장 신경망모형 - 분류 (5)	2016.12.20
제5장 신경망모형 - 회귀 (0)	2016.11.14
제4장 앙상블 모형 - 분류앙상블모형 - 랜덤 포레스트 (0)	2016.11.09
제4장 앙상블 모형 - 분류앙상블모형 - 부스팅 (0)	2016.11.07

Posted by 마르띤

,

제6.5장 준모수적 방법

KNOU/2 보건 정보 데이터 분석 2017. 1. 26. 09:38

6.5 준모수적 방법

Cox의 비례 모형

모수적 모형은 가정이 타당할 때는 상당히 효율적이지만 어떤 모형이 적당한가에 대한 지식이 없다면 함부로 사용하기가 곤란하다. 이에 반해 Cox(1972)의 비례 위험 모형(proportional hazards model)은 준모수적(semi-parametric)방법으로서 생존 시간의 분포에 대한 가정을 필요로 하지 않는다. 또한 시간에 따라 바뀌는 공변량(time-dependent variable)의 경우에도 분석할 수 있다는 장점이 있어, 생존자료의 분석에 매우 자주 사용된다.

[예] Prenctice(1973)에 소개된 것으로 40명의 폐암 환자의 생존시간을 조사한 것이다. 40명의 환자 중 21명은 기존 치료방법인 처리1에, 나머지 19명은 새로운 치료방법은 처리2에 할당되었으며, 생존시간에 영향을 미칠 것으로 생각되는 공변량은 다음과 같다.

X1: 진단시의 환자상태(Perfermance Status : 0~100 점)

X2: 환자의 나이(단위 : 년)

X3: 진단 후 연구 참여시까지의 시간(단위 : 월)

Trt: 치료 방법 (1 – 기존 치료, 2 – 새로운 치료)

TYPE: 종양의 유형 (squamous, small, adeno, large)

1. 데이터 입력

> setwd('c:/Rwork/')

> lung<-read.table('lung.txt',header=T)

> colnames(lung)<-c('time','status','x1','x2','x3','trt','type')

> head(lung)

time status x1 x2 x3 trt type

1 411 1 70 64 5 1 1

2 126 1 60 63 9 1 1

3 118 1 70 65 11 1 1

4 82 1 40 69 10 1 1

5 8 1 40 63 58 1 1

6 25 0 70 48 9 1 1

2. Cox비례위험모형에 근거한 생존시간 분석

> library(survival)

> coxfit1 = coxph(Surv(time,status)~x1+x2+x3+factor(trt)+factor(type),data=lung)

> summary(coxfit1)

Call:

coxph(formula = Surv(time, status) ~ x1 + x2 + x3 + factor(trt) +

factor(type), data = lung)

n= 40, number of events= 37

coef exp(coef) se(coef) z Pr(>|z|)

x1 -0.060281 0.941500 0.013777 -4.375 1.21e-05 ***

x2 -0.015086 0.985027 0.022340 -0.675 0.4995

x3 0.001201 1.001201 0.011886 0.101 0.9195

factor(trt)2 -0.448171 0.638795 0.431302 -1.039 0.2988

factor(type)2 0.279682 1.322709 0.547259 0.511 0.6093

factor(type)3 1.418190 4.129638 0.625283 2.268 0.0233 *

factor(type)4 0.361145 1.434971 0.479210 0.754 0.4511

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

exp(coef) exp(-coef) lower .95 upper .95

x1 0.9415 1.0621 0.9164 0.9673

x2 0.9850 1.0152 0.9428 1.0291

x3 1.0012 0.9988 0.9781 1.0248

factor(trt)2 0.6388 1.5654 0.2743 1.4876

factor(type)2 1.3227 0.7560 0.4525 3.8663

factor(type)3 4.1296 0.2422 1.2125 14.0655

factor(type)4 1.4350 0.6969 0.5610 3.6707

Concordance= 0.764 (se = 0.058 )

Rsquare= 0.524 (max possible= 0.994 )

Likelihood ratio test= 29.66 on 7 df, p=0.0001097

Wald test = 26.29 on 7 df, p=0.0004479

Score (logrank) test = 30.67 on 7 df, p=7.138e-05

- coxph: cox의 비례위험 모형을 ㅈ거합함.

-> 결과 해석:

- 진단시의 환자상태(x1)에 대한 회귀계수가 -0.060281로 유의하다(p-value < 0.001). 위험률(exp(coef))를 보면 진단시의 환자상태에 대한 점수가 1점 더 높은 환자는 다른 조건이 동일한 환자와 비교할 때 0.9415(exp(-0.060281))배 위험하다는 것을 알 수 있다.

exp(-coef) = 1.0621은 exp(coef)의 역수인 1/0.9415이다.

- 나이(x2), 진단 후 연구 참여시까지의 시간(x3), 처리(trt)는 모두 p-value가 0.05보다 커서 유의한 영향을 미치지 않는다.

- TYPE: 종양의 유형 (squamous, small, adeno, large)을 보면 factor(type)3인 adeno의 경우 squamous와 차이를 보인다.(p-value: 0.0233)

회귀계수가 0이라는 가설에 대한 검정통계량. 차례로 우도비 검정(-2 LOG L), Wald 검정통계량, Score 검정이 있다. 세 통계랑 모두 가설을 기각하게 되므로 (p-value < 0.05), 적어도 하나의 공변량은 의미가 있다는 것(= 적어도 모든 변수가 0이라는 귀무가설을 기각)을 알 수 있다.

Likelihood ratio test= 29.66 on 7 df, p=0.0001097

Wald test = 26.29 on 7 df, p=0.0004479

Score (logrank) test = 30.67 on 7 df, p=7.138e-05

3. 생존함수추정치

> fit4 = survfit(coxfit1)

> summary(fit4)

Call: survfit(formula = coxfit1)

time n.risk n.event survival std.err lower 95% CI upper 95% CI

1 40 1 9.90e-01 1.10e-02 9.68e-01 1.000

2 39 1 9.79e-01 1.67e-02 9.47e-01 1.000

8 38 2 9.54e-01 2.71e-02 9.03e-01 1.000

10 36 1 9.37e-01 3.32e-02 8.74e-01 1.000

11 35 1 9.20e-01 3.87e-02 8.47e-01 0.999

12 34 2 8.83e-01 4.84e-02 7.93e-01 0.984

15 32 1 8.64e-01 5.33e-02 7.65e-01 0.975

16 31 1 8.44e-01 5.79e-02 7.38e-01 0.965

18 30 1 8.22e-01 6.27e-02 7.07e-01 0.954

19 29 1 7.96e-01 6.77e-02 6.74e-01 0.941

20 28 1 7.67e-01 7.27e-02 6.37e-01 0.924

21 27 1 7.34e-01 7.73e-02 5.97e-01 0.903

43 25 1 6.97e-01 8.22e-02 5.54e-01 0.879

44 24 1 6.61e-01 8.62e-02 5.12e-01 0.854

51 23 1 6.26e-01 8.94e-02 4.73e-01 0.828

54 22 1 5.84e-01 9.23e-02 4.28e-01 0.796

56 21 1 5.44e-01 9.39e-02 3.87e-01 0.763

82 20 1 5.06e-01 9.46e-02 3.50e-01 0.730

84 19 1 4.64e-01 9.54e-02 3.11e-01 0.694

90 18 1 4.25e-01 9.51e-02 2.74e-01 0.659

100 17 1 3.81e-01 9.46e-02 2.34e-01 0.620

118 15 1 3.35e-01 9.36e-02 1.94e-01 0.580

126 14 1 2.93e-01 9.12e-02 1.59e-01 0.540

153 13 1 2.53e-01 8.79e-02 1.28e-01 0.500

164 12 1 2.14e-01 8.37e-02 9.97e-02 0.461

177 11 1 1.80e-01 7.81e-02 7.66e-02 0.421

200 10 1 1.40e-01 7.08e-02 5.20e-02 0.377

201 9 1 1.07e-01 6.19e-02 3.41e-02 0.333

231 8 1 8.00e-02 5.25e-02 2.21e-02 0.289

250 6 1 5.18e-02 4.19e-02 1.06e-02 0.253

287 5 1 2.88e-02 2.98e-02 3.81e-03 0.218

340 4 1 9.18e-03 1.50e-02 3.74e-04 0.225

411 3 1 2.19e-03 5.10e-03 2.30e-05 0.210

991 2 1 1.28e-04 5.20e-04 4.54e-08 0.362

999 1 1 3.36e-10 5.39e-09 7.43e-24 1.000

> names(fit4)

[1] "n" "time" "n.risk" "n.event" "n.censor" "surv"

[7] "type" "cumhaz" "std.err" "upper" "lower" "conf.type"

[13] "conf.int" "call"

> fit4$surv

[1] 9.895862e-01 9.787704e-01 9.543895e-01 9.372414e-01 9.195316e-01

[6] 8.834541e-01 8.637107e-01 8.439345e-01 8.215162e-01 7.962148e-01

[11] 7.670584e-01 7.344065e-01 7.344065e-01 6.974181e-01 6.609451e-01

[16] 6.256915e-01 5.835925e-01 5.436422e-01 5.055202e-01 4.643798e-01

[21] 4.249339e-01 3.809762e-01 3.809762e-01 3.352450e-01 2.932721e-01

[26] 2.533265e-01 2.142532e-01 1.795911e-01 1.401508e-01 1.065182e-01

[31] 8.001270e-02 5.176743e-02 2.881620e-02 9.176925e-03 2.194374e-03

[36] 1.281990e-04 3.359491e-10

- fit에서 4번째 열 fit$surv 은 생존함수의 추정치를 나타내는데, 여기서 -log()를 취해주면 위험함수가 됨.

4. 생존함수그래프

> plot(survfit(coxfit1),xlab='time',ylab='Survival function',xlim=c(0,998.9))

> legend(500,1.0,c('누적한계추정치','95%신뢰구간'),lty=c(1,2))

5. 누적함수그래프

누적위험함수의 추정치 그래프를 통해 위험함수의 형태를 짐작할 수 있다. 예를 들어 단조 증가하는 직선형태는 위험함수가 시간에 대해 일정하다는 것을 의미하며, 위쪽으로 휘는 모양이면 시간이 지남에 따라 일정하다는 것을 의미하며, 위쪽으로 휘는 모양이면 시간이 지남에 따라 위험함수가 증가하고, 아래 방향으로 휘면 감소한다는 것을 의미한다.

> H.hat = -log(fit4$surv)

> H.hat = c(H.hat,tail(H.hat,1))

> plot(c(fit4$time,1100),H.hat,xlab='time',ylab='comulative hazard function',type='s')

- tail: 벡터, 매트릭스 데이터에서 마지막 n개의 행들을 선택함. 예를 들어 tail(H.hat,1)라고 입력하면 H.hat 벡터의 마지막 1개의 성분을 선택. 해당 처리를 하는 이유는 위험함수 곡선은 시간에 대한 상승곡선이므로, 마지막 함수값을 추가함을써 곡선의 모양을 자연스럽게 하기 위한 처리임.

6. 비례성 검토를 위한 로그-로그 그림

비례성의 가정이 타당한 것인지 검토하는 방법에 대해 알아보자. 폐암자료에서 두 처리그룹별로 시간에 따른 log(-log S^ (t)) 의 그래프를 그렸을 때 평행하게 되는 가를 볼 수 있다.

> coxfit2 = coxph(Surv(time,status)~x1+x2+x3+strata(trt)+factor(type),data=lung)

> plot(survfit(coxfit2),fun='cloglog',lty=1:2,col=c('red','blue'))#fun='cloglog', 두 그래프가 대체적으로 평행하므로 비례성이 타당하다고 할 수 있다.

> legend('topleft',c('처리1','처리2'),lty=1:2,col=c('red','blue'))

- strata(trt): 처리를 층으로 입력해주면, 처리를 층으로 입력하여, 처리1과 처리2로 나누어 Cox 비례 위험 모형을 추정함.

- fun='cloglog'를 입력하면 로그-로그 그림을 그릴 수 있다.

- 비례성검토를 위한 로그-로그 그림(log-log plot)은 그래프가 평행하므로 비례성의 가정이 타당하다 할 수 있다.

7. 로그 – 랭크 테스트 log – rank test

1) 종양 유형 별 생존함수의 차이

> survdiff(Surv(time,status)~factor(type),data=lung)

Call:

survdiff(formula = Surv(time, status) ~ factor(type), data = lung)

N Observed Expected (O-E)^2/E (O-E)^2/V

factor(type)=1 14 12 16.72 1.330 2.825

factor(type)=2 11 10 6.93 1.356 1.771

factor(type)=3 5 5 2.18 3.651 4.099

factor(type)=4 10 10 11.17 0.123 0.187

Chisq= 7.4 on 3 degrees of freedom, p= 0.0614

X²(df=3)=7.4이고, p-value가 0.0614로 0.05보다는 약간 크지만 상당히 의미 있음을 알 수 있다.

2) 치료법 생존함수의 차이

> survdiff(Surv(time,status)~x1,data=lung)

Call:

survdiff(formula = Surv(time, status) ~ x1, data = lung)

N Observed Expected (O-E)^2/E (O-E)^2/V

x1=20 2 2 0.128 27.3119 28.6351

x1=30 4 4 1.565 3.7893 4.2568

x1=40 7 7 1.932 13.2948 15.1966

x1=50 4 3 3.298 0.0270 0.0305

x1=60 7 7 6.642 0.0193 0.0247

x1=70 9 7 12.420 2.3650 3.7914

x1=80 6 6 6.982 0.1382 0.1761

x1=90 1 1 4.033 2.2811 3.7772

Chisq= 58.8 on 7 degrees of freedom, p= 2.65e-10

X²(df=7)=58.8이고, p-value가 0.05보다 매우 작기 때문에 의미 있음을 알 수 있다.

8. anova를 이용하여 모형 비교

> coxfit1 = coxph(Surv(time,status)~x1+x2+x3+factor(trt)+factor(type),data=lung)

> coxfit2 = coxph(Surv(time,status)~x1+factor(type),data=lung)

> anova(coxfit2,coxfit1)

Analysis of Deviance Table

Cox model: response is Surv(time, status)

Model 1: ~ x1 + factor(type)

Model 2: ~ x1 + x2 + x3 + factor(trt) + factor(type)

loglik Chisq Df P(>|Chi|)

1 -88.143

2 -87.516 1.254 3 0.7401

anova()는 Cox Regression의 모형을 비교할 때 LRT(Likelihood ratio test)를 사용한다. 환자의 나이(x2), 진단 후 연구 참여시까지의 시간( x3)의 p-value는 0.7401으로 유의하지 않다.

출처: 보건정보데이터 분석 (이태림, 이재원, 김주한, 장대흥 공저), R을 이용한 누구나 하는 통계분석

'KNOU > 2 보건 정보 데이터 분석' 카테고리의 다른 글

제6.4장 모수적 방법 (0)	2017.01.26
제6.3장 비모수적 방법을 이용한 생존함수의 비교 (0)	2017.01.26
제6.2장 비모수적 방법 - 2. 누적한계추정법 (0)	2017.01.26
제6.2장 비모수적 방법 - 1. 생명표 방법 (0)	2017.01.26
제5.1장 공분산 분석 - 2 공변량이 둘 이상인 경우 (2)	2017.01.13

Posted by 마르띤

,

제6.4장 모수적 방법

KNOU/2 보건 정보 데이터 분석 2017. 1. 26. 09:38

6.4 모수적 방법을 이용한 생존함수의 추정과 비교

공학(시멘트의 양, 유리의 버티는 힘), 경영(고객 수), 교통(소방차 수) 모두 모수적 방법을 이용.

분포를 이루기 때문에 많은 분야에서 사용된다.

1. 데이터 입력

> setwd('c:/Rwork')

> lung=read.table('lung.txt',header=T)

> head(lung)

癤퓍ime status x1 x2 x3 trt type

1 411 1 70 64 5 1 1

2 126 1 60 63 9 1 1

3 118 1 70 65 11 1 1

4 82 1 40 69 10 1 1

5 8 1 40 63 58 1 1

6 25 0 70 48 9 1 1

> colnames(lung)<-c('time','status','x1','x2','x3','trt','type')

> head(lung)

time status x1 x2 x3 trt type

1 411 1 70 64 5 1 1

2 126 1 60 63 9 1 1

3 118 1 70 65 11 1 1

4 82 1 40 69 10 1 1

5 8 1 40 63 58 1 1

6 25 0 70 48 9 1 1

> attach(lung)

2. 모수적 모형에 근거한 생존시간 분석

> library(survival)

> weibull = survreg(Surv(time,status)~x1+x2+x3+factor(trt)+factor(type),data=lung,dist='weibull') #공변량factor(trt), 처리효과factor(type), 와이블 분포weibull, gaussian(정규분포), logistic(로지스틱)eh rksmd.

> summary(weibull)

Call:

survreg(formula = Surv(time, status) ~ x1 + x2 + x3 + factor(trt) +

factor(type), data = lung, dist = "weibull")

Value Std. Error z p

(Intercept) 1.06044 1.35959 0.780 4.35e-01

x1 0.05420 0.00954 5.680 1.35e-08

x2 0.01168 0.01918 0.609 5.42e-01

x3 0.00379 0.01051 0.361 7.18e-01

factor(trt)2 0.28871 0.36899 0.782 4.34e-01

factor(type)2 -0.49964 0.45323 -1.102 2.70e-01

factor(type)3 -1.25968 0.49732 -2.533 1.13e-02

factor(type)4 -0.40243 0.38726 -1.039 2.99e-01

Log(scale) -0.13615 0.13146 -1.036 3.00e-01

Scale= 0.873

Weibull distribution

Loglik(model)= -203.4 Loglik(intercept only)= -219.7

Chisq= 32.59 on 7 degrees of freedom, p= 3.2e-05

Number of Newton-Raphson Iterations: 6

n= 40

> weibull

Call:

survreg(formula = Surv(time, status) ~ x1 + x2 + x3 + factor(trt) +

factor(type), data = lung, dist = "weibull")

Coefficients:

(Intercept) x1 x2 x3 factor(trt)2

1.060436846 0.054195931 0.011681287 0.003792838 0.288708242

factor(type)2 factor(type)3 factor(type)4

-0.499640589 -1.259681146 -0.402431957

Scale= 0.8727099

Loglik(model)= -203.4 Loglik(intercept only)= -219.7

Chisq= 32.59 on 7 degrees of freedom, p= 3.2e-05

n= 40

logT = 1.060 + 0.054x1 + 0.012x2 + 0.004x3 + 0.289x4 - 0.500x5 -1.260x6 - 0.402x7 +0.873e

factor(trt)2 0.28871 0.36899 0.782 4.34e-01

-> x4는 처리그룹을 나타내는 가변수(dummy variable)로서 처리가 standard일 때 1, test 일 때 2의 값을 갖는다.

p-value가 0.434로 유의하지 않으므로 두 처리(standard, test)는 생존시간의 차이를 보이지 않는다.

type은 squamous,small,adeno,large, x5,6,7은 종양의 유형을 가르키는 가변수로서

x5 = 1, if 'small', = 0 o.w.

x6 = 1, if 'adeno', = 0 o.w.

x7 = 1, if 'large', = 0 o.w.

따라서 종양이 squamous인 경우에는 x5=x6=x7 = 0 된다.

factor(type)3 -1.25968 0.49732 -2.533 1.13e-02

-> adeno의 p-value만 0.0113으로 0.05보다 적으므로 유의, squamous와 차이를 보인다고 할 수 있다. adeno의 회귀계수가 -1.25968로 이 유형을 가진 사람들은 상대적으로 생존시간이 짧다는 것을 알 수 있다. type2 small, type4 large의 경우 p-value가 각각 0.270, 0,299로 유의하지 않다, 즉 squamous와 차이를 보이지 않는다.

Loglik(model)= -203.4 Loglik(intercept only)= -219.7

Chisq= 32.59 on 7 degrees of freedom, p= 3.2e-05

2(logL-logL0) = 32.59 > 14.067 모든 공변량의 회귀계수가 0이라는 귀무가설을 아주 강하게 기각한다.

통상적인 선형모형에서 모형에 대한 F-검정을 하는 것과 같이 여기서도 모든 공변량의 회귀계수가 0이라는 가설에 대하 우도비 검정(likelihood-ratio ttest)를 할 수 있다. 우리가 고려한 모형에서의 로그-우도(log-likelihood)를 logL, 공변량이 전혀 없는 귀무모형에서의 로그-우도를 logL0라고 했을 때 2(logL-logL0)이 귀무가설하에서 근사적으로 x2-분포를 따르게 되므로 이를 이용하여 검정할 수 있다.

이때 자유도는 귀무모형에서 제외되는 공변량의 수와 같게 되며, 위에서는 7이 된다.

LogL: 고려한 모형에서의 로그-우도 Log Likelihood for WEIBULL -203.4는 이 모형에 대한 로그-우도이다.

LogL0: 공변량이 전혀 없는 귀무모형에 대한 로그-우도를 구해보면 Log Likelihood for WEIULL = =219.7을

따라서 Chisq= 32.59는 아래와 같은 계산을 통해 얻을 수 있으며, 이 값이 x2-분포의 임계치인 x2 0.95(7) = 14.067보다 크므로

5%유의수준하에서 모든 공변량의 회귀계수가 0이라는 귀무사설을 기각하게 된다.

> 2*(-203.4-(-219.7))

[1] 32.6

3. 모수적 모형의 적합도 검토

고려한 모형이 타당한가를 검토하는 방법 중 하나는 로그-우도 비교. R을 이용하여 각 모형에 대하 로그-우도를 출력한 다음 이들을 비교하여 절대값이 가장 큰 모형을 택할 수 있다. 또는 AIC(Akaike information criterion) 값을 비교하여 이 값이 더 작은 것을 선택할 수 있다.

> library(flexsurv) #flexsurv library: Flexible parametric survival models

Warning message:

package ‘flexsurv’ was built under R version 3.2.5

> gengamma=flexsurvreg(formula=Surv(time,status)~x1+x2+x3+factor(trt)+factor(type),data=lung,dist='gengamma')

> gengamma

Call:

flexsurvreg(formula = Surv(time, status) ~ x1 + x2 + x3 + factor(trt) +

factor(type), data = lung, dist = "gengamma")

Estimates:

data mean est L95% U95% se exp(est)

mu NA 1.16226 -1.69652 4.02104 1.45859 NA

sigma NA 0.82367 0.47903 1.41628 0.22778 NA

Q NA 1.19119 -0.35449 2.73687 0.78863 NA

x1 56.50000 0.05426 0.03598 0.07254 0.00933 1.05576

x2 56.57500 0.01210 -0.02543 0.04963 0.01915 1.01217

x3 15.65000 0.00494 -0.01741 0.02729 0.01141 1.00495

factor(trt)2 0.50000 0.27185 -0.47815 1.02186 0.38266 1.31239

factor(type)2 0.27500 -0.57430 -1.63425 0.48566 0.54080 0.56310

factor(type)3 0.12500 -1.36101 -2.57769 -0.14434 0.62076 0.25640

factor(type)4 0.25000 -0.48503 -1.45929 0.48923 0.49708 0.61568

L95% U95%

mu NA NA

sigma NA NA

Q NA NA

x1 1.03664 1.07524

x2 0.97489 1.05088

x3 0.98274 1.02767

factor(trt)2 0.61993 2.77835

factor(type)2 0.19510 1.62524

factor(type)3 0.07595 0.86560

factor(type)4 0.23240 1.63107

N = 40, Events: 37, Censored: 3

Total time at risk: 5784

Log-likelihood = -203.4059, df = 10

AIC = 426.8117

> weibull=flexsurvreg(formula=Surv(time,status)~x1+x2+x3+factor(trt)+factor(type),data=lung,dist='weibull')

> weibull

Call:

flexsurvreg(formula = Surv(time, status) ~ x1 + x2 + x3 + factor(trt) +

factor(type), data = lung, dist = "weibull")

Estimates:

data mean est L95% U95% se exp(est)

shape NA 1.14586 0.88555 1.48269 0.15066 NA

scale NA 2.88763 0.20141 41.39933 3.92317 NA

x1 56.50000 0.05420 0.03551 0.07288 0.00953 1.05569

x2 56.57500 0.01168 -0.02587 0.04924 0.01916 1.01175

x3 15.65000 0.00379 -0.01679 0.02438 0.01050 1.00380

factor(trt)2 0.50000 0.28871 -0.43443 1.01185 0.36896 1.33470

factor(type)2 0.27500 -0.49964 -1.38783 0.38855 0.45317 0.60675

factor(type)3 0.12500 -1.25968 -2.23430 -0.28506 0.49726 0.28374

factor(type)4 0.25000 -0.40243 -1.16137 0.35651 0.38722 0.66869

L95% U95%

shape NA NA

scale NA NA

x1 1.03615 1.07560

x2 0.97446 1.05047

x3 0.98335 1.02468

factor(trt)2 0.64763 2.75068

factor(type)2 0.24962 1.47484

factor(type)3 0.10707 0.75197

factor(type)4 0.31306 1.42833

N = 40, Events: 37, Censored: 3

Total time at risk: 5784

Log-likelihood = -203.4363, df = 9

AIC = 424.8727

일반화감마분포 gengamma분포의 경우 로그-우도 값이 -203.4059, AIC는 426.8117

와이블분포 wibull 분포의 경우 로그-우도 값이 -203.4363, AIC는 424.8727로 큰 차이는 없다. 그래프로 그려보면 아래와 같이 큰 차이가 없음을 알 수 있다.

데이터 시각화

> plot(weibull,xlab='time',ylab='survival function',ci=F)

> lines(gengamma,col='blue',lty=2,ci=F)

> legend('topright',c('weibull','generalized gamma'),lty=1:2,col=c('red','blue'))

출처: 보건정보데이터 분석(이태림, 이재원, 김주한, 장대흥 공저)

'KNOU > 2 보건 정보 데이터 분석' 카테고리의 다른 글

제6.5장 준모수적 방법 (0)	2017.01.26
제6.3장 비모수적 방법을 이용한 생존함수의 비교 (0)	2017.01.26
제6.2장 비모수적 방법 - 2. 누적한계추정법 (0)	2017.01.26
제6.2장 비모수적 방법 - 1. 생명표 방법 (0)	2017.01.26
제5.1장 공분산 분석 - 2 공변량이 둘 이상인 경우 (2)	2017.01.13

Posted by 마르띤

,

제6.3장 비모수적 방법을 이용한 생존함수의 비교

KNOU/2 보건 정보 데이터 분석 2017. 1. 26. 09:37

6.3 비모수적 방법을 이용한 생존함수의 비교

[예] 흑색종(melanoma) 환자들에 대한 BCG와 CP(coryne-bacterium parvum)의 생존지속 효과를 비교하기 위한 연구에서 30명의 흑색종 환자 중 11명은 BCG 처리를 받고 나머지 19명은 CP처리를 받았다고 한다. 중도절단이 포함되어 있는 경우에 이 두 그룹의 생존분포를 비교하기 위한 방법을 알아보자.

BCG 처리 그룹	33.7+	3.9	10.5	5.4	19.5	23.8+	7.9
BCG 처리 그룹	16.9+	16.6+	33.7+	17.1+
CP 처리 그룹	8.0	26.9+	21.4+	18.1+	16.0+	6.9	11.0+
	24.8+	23.0+	8.3	10.8+	12.2+	12.5+	24.4
	7.7	14.8+	8.2+	8.2+	7.8+

1. 데이터 입력

> library(survival)

> setwd('c:/Rwork')

> melanoma = read.table('melanoma.txt',header=T)

> head(melanoma,3)

癤퓍ime status x

1 33.7 0 BCG

2 3.9 1 BCG

3 10.5 1 BCG

> colnames(melanoma)<-c('time','status','x')

> head(melanoma)

time status x

1 33.7 0 BCG

2 3.9 1 BCG

3 10.5 1 BCG

4 5.4 1 BCG

5 19.5 1 BCG

6 23.8 0 BCG

> attach(melanoma)

2. 누적한계추정치(Kaplan-Meier 추정치)

> fit2 = survfit(Surv(time,status)~x,data=melanoma)

> summary(fit2)

Call: survfit(formula = Surv(time, status) ~ x, data = melanoma)

x=BCG

time n.risk n.event survival std.err lower 95% CI upper 95% CI

3.9 11 1 0.909 0.0867 0.754 1.000

5.4 10 1 0.818 0.1163 0.619 1.000

7.9 9 1 0.727 0.1343 0.506 1.000

10.5 8 1 0.636 0.1450 0.407 0.995

19.5 4 1 0.477 0.1755 0.232 0.981

x=CP

time n.risk n.event survival std.err lower 95% CI upper 95% CI

6.9 19 1 0.947 0.0512 0.852 1.000

7.7 18 1 0.895 0.0704 0.767 1.000

8.0 16 1 0.839 0.0854 0.687 1.000

8.3 13 1 0.774 0.1003 0.601 0.998

24.4 3 1 0.516 0.2211 0.223 1.000

> fit2

Call: survfit(formula = Surv(time, status) ~ x, data = melanoma)

n events median 0.95LCL 0.95UCL

x=BCG 11 5 19.5 10.5 NA

x=CP 19 5 NA 24.4 NA

3. 사망시점의 사분위수 추정치와 그의 신뢰구간

> quantile(fit2,probs=c(0.25,0.5,0.75),conf.int=T)

$quantile

25 50 75

x=BCG 7.9 19.5 NA

x=CP 24.4 NA NA

$lower

25 50 75

x=BCG 5.4 10.5 19.5

x=CP 8.0 24.4 24.4

$upper

25 50 75

x=BCG NA NA NA

x=CP NA NA NA

4. 데이터 시각화

> plot(fit2,xlab='time',ylab='survival function',lty=c(1,2),col=c(1,2))

> legend(5,0.2,c('cp 처리 그룹','BCG 처리 그룹'),lty=c(2,1),col=c(2,1))

> abline(h=0.5)

> abline(v=c(10.5,24.4))

5. 로그 순위 검정법(log-rank test)과 Gehan-Wilcoxon 검정법 비교

1) 로그 순위 검정법(log-rank test)

> survdiff(Surv(time,status)~x,data=melanoma)

Call:

survdiff(formula = Surv(time, status) ~ x, data = melanoma)

N Observed Expected (O-E)^2/E (O-E)^2/V

x=BCG 11 5 3.68 0.469 0.747

x=CP 19 5 6.32 0.274 0.747

Chisq= 0.7 on 1 degrees of freedom, p= 0.387

2) Gehan-Wilcoxon 검정법

> survdiff(Surv(time,status)~x,rho=1,data=melanoma)

Call:

survdiff(formula = Surv(time, status) ~ x, data = melanoma, rho = 1)

N Observed Expected (O-E)^2/E (O-E)^2/V

x=BCG 11 4.31 3.07 0.500 0.929

x=CP 19 4.10 5.34 0.288 0.929

Chisq= 0.9 on 1 degrees of freedom, p= 0.335

두 검정법 모두 p value가 0.05보다 크기 떄문에 유의하지 않다. 즉 BCG, CP 두 그룹 간의 생존함수는 유의한 차이를 보이지 않는다.

이상 그래프에서 보듯이 두 그룹의 누적한계추정치의 그래프도 교차되지 않고 나란한 형태를 보이므로 로그-순위 검정법이 타당한 것이었음을 알 수 있다.

출처: 보건정보데이터 분석 (이태림, 이재원, 김주한, 장대흥 공저)

'KNOU > 2 보건 정보 데이터 분석' 카테고리의 다른 글

제6.5장 준모수적 방법 (0)	2017.01.26
제6.4장 모수적 방법 (0)	2017.01.26
제6.2장 비모수적 방법 - 2. 누적한계추정법 (0)	2017.01.26
제6.2장 비모수적 방법 - 1. 생명표 방법 (0)	2017.01.26
제5.1장 공분산 분석 - 2 공변량이 둘 이상인 경우 (2)	2017.01.13

Posted by 마르띤

,

데이터마이너를 꿈꾸며

'KNOU'에 해당되는 글 52건

1장 데이터 처리

3장 인자분석 - 주성분인자법(principal factor method) 사례

'KNOU > 3 다변량분석' 카테고리의 다른 글

2장 주성분분석 - heptathlon data / beer data

'KNOU > 3 다변량분석' 카테고리의 다른 글

제3장 예측데이터의 기초분석

'KNOU > 3 예측방법론' 카테고리의 다른 글

제2장 예측데이터 - 시계열

'KNOU > 3 예측방법론' 카테고리의 다른 글

제1장 다변량 데이터 시각화

제6장 모형비교평가 - 연속형 목표변수

'KNOU > 2 데이터마이닝' 카테고리의 다른 글

제6.5장 준모수적 방법

'KNOU > 2 보건 정보 데이터 분석' 카테고리의 다른 글

제6.4장 모수적 방법

'KNOU > 2 보건 정보 데이터 분석' 카테고리의 다른 글

제6.3장 비모수적 방법을 이용한 생존함수의 비교

'KNOU > 2 보건 정보 데이터 분석' 카테고리의 다른 글

링크

카테고리

최근에 올라온 글

최근에 받은 트랙백

글 보관함

티스토리툴바