Posts tagged 'r'

[R vs Python] Which Is Better? I Asked ChatGPT

Python, R Analysis and Programming 2023. 8. 3. 22:52

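Just as the title says: R vs Python, which is better?

Before I learned Python, I studied R first. I believe it was in 2016. At the time I had transferred into the statistics program at Korea National Open University (KNOU), where the textbooks worked through much of their data in R, and when I searched online there were plenty of R lectures as well.

It was not that Python did not exist, but since my first textbooks used R, I naturally started with the language I was exposed to. Even then I remember heated debates over R versus Python. As I recall, one expert advised learning R if you only wanted to do statistics and Python if you were thinking beyond statistics toward development, and others suggested looking at which language had more job postings. That is still easy to look up today, and unsurprisingly Python has more job postings. And then there is Kaggle. Kaggle is a playground for data people, and an overwhelming share of the submissions there used Python, which made me, knowing only R, think that I had better learn Python too.

I suddenly wondered what ChatGPT would say, so I asked it which of the two programming languages is better.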

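Both are very popular programming languages in data analysis and machine learning, and each has its own strengths and weaknesses. Choosing which one to learn depends mainly on your intended use, your preferences, and which language you find easier to learn. Below is a description of each language's characteristics and strengths:
R:

Optimized for statistics and data analysis: R is a language designed for statistical analysis and data visualization, and it is very popular among data analysts and statisticians.
Rich packages and libraries: R offers a vast number of packages through CRAN (Comprehensive R Archive Network), covering the many functions that data analysis requires.
Data visualization: R provides rich, attractive data visualization through libraries such as ggplot2.
Statistical testing and modeling: R supports a wide range of statistical techniques, such as hypothesis testing and regression analysis.

Python:

General-purpose programming language: Python is used not only for data analysis but also in many other areas, including web development, natural language processing, artificial intelligence, and automation.
Machine learning libraries: Python has a rich set of machine learning and deep learning libraries such as Scikit-learn, TensorFlow, and PyTorch, which makes it useful for machine learning projects.
Data processing libraries: Pandas and NumPy make it easy to process and analyze data.
Community support: Python has a large community and a wide range of online resources, which helps with learning and problem solving.

Therefore, if you want to focus on data analysis and statistics, R may be the better choice. If, on the other hand, you also want to work in areas beyond data analysis, Python may be more useful. It is also possible to learn both languages, and choosing between them according to the situation is a good strategy.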

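One of R's strengths is that it is optimized for statistics and data analysis. I remember running a lot of statistical hypothesis tests. I also studied visualization a great deal with a textbook written by a Japanese author. With Python, as expected, much of the answer is about things beyond data analysis. It is not that machine learning cannot be done in R, but Python has offered many capabilities outside analysis that R does not. And the biggest reason I am studying Python now is that it has so many users. Naturally, a Google search returns more relevant results, which makes it an easier environment to learn in.

While I was at it, I asked once more about the advantages of Python over R.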

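Python is a general-purpose programming language used in many fields, and compared with R it has the following advantages, which make some things easier to solve with Python alone:

Web development: Web applications can be built with web frameworks such as Django and Flask.
Natural language processing (NLP): Text data can be analyzed and processed with libraries such as the Natural Language Toolkit (NLTK).
Computer vision: Libraries such as OpenCV support image and video processing, object detection, face recognition, and more.
Automation and scripting: Python is a powerful scripting language that can automate anything from simple tasks to complex processes.
Artificial intelligence and machine learning: Deep learning and machine learning models can be built with libraries such as TensorFlow, PyTorch, and Scikit-learn.
Big data processing: PySpark can be used to process and analyze large-scale data.
Game development: Pygame can be used to make 2D games.
Network programming: Network applications can be built with socket programming.
Database integration: Python can connect to a variety of databases to read and write data.

Of course, some of these areas are also supported in R through packages. But because Python is used so widely across different fields, it offers a broader range of projects and applications. As a result, Python alone can replace R in many cases. That said, R can be more convenient for certain analysis techniques or visualizations, so it is important to choose according to the situation.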


Posted by 마르띤


Chapter 1: Data Processing

KNOU/4 Advanced R 2017. 8. 30. 13:36

1. Importing data

> setwd(choose.dir()) # choose the folder interactively (choose.dir() is Windows-only)

> getwd()

[1] "C:/Users/amore/Documents/FOR ME/Data Scientist/KNOU/2017년 2학기/고급R활용"

> setwd("C:/Users/amore/Documents/FOR ME/Data Scientist/KNOU/2017년 2학기/고급R활용")

> getwd()

[1] "C:/Users/amore/Documents/FOR ME/Data Scientist/KNOU/2017년 2학기/고급R활용"

> savehistory(file='mywork.Rhistory')

> history()

> history(max.show=10)

2. Importing data in various formats

> insurance.data2 = read.table('insurance2.txt', header=T)

> insurance.data2[c(15:16),]

id sex job religion edu amount salary

15 15 m -9 3 4 6 110

16 16 f 1 3 -9 7 88

# Treat -9 as NA

> insurance.data2 = read.table('insurance2.txt', header=T, na.strings = '-9')

> insurance.data2[c(15:16),]

id sex job religion edu amount salary

15 15 m NA 3 4 6 110

16 16 f 1 3 NA 7 88

> csv.data = read.csv('csv.txt',header=T) # comma-separated

> tab.data = read.table('tab.txt', header=T,sep='\t') # tab-separated

> write.table(tab.data, file='C:/Rwork/tab.txt') # save to the Rwork folder

# Read a fixed-width text file

> fwf.data = read.fwf('insurance3.txt',

+ widths=c(2,2,3,3,3,6,6),

+ col.names=c('id','sex','job','religion','edu','amount','salary'))

> fwf.data[fwf.data$job==-9, 'job']=NA

> fwf.data[fwf.data$edu==-9, 'edu']=NA

> fwf.data[fwf.data$salary==-9, 'salary']=NA

> head(fwf.data)

id sex job religion edu amount salary

1 1 m 1 1 3 7.0 110

2 2 m 2 1 4 12.0 135

3 3 f 2 3 5 8.5 127

4 4 f 3 3 5 5.0 150

5 5 m 1 3 3 4.5 113

6 6 m 2 1 2 3.5 95

> fwf2.data = read.fwf('insurance3.txt',

+ widths=c(2,-2,-3,3,3,6,6),

+ col.names=c('id','religion','edu','amount','salary'))

> head(fwf2.data)

id religion edu amount salary

1 1 1 3 7.0 110

2 2 1 4 12.0 135

3 3 3 5 8.5 127

4 4 3 5 5.0 150

5 5 3 3 4.5 113

6 6 1 2 3.5 95
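The negative-width trick is easy to try without the course files. The following is only a sketch: the two records and their layout (id 2 characters, sex 2, job 3, religion 3) are made up for illustration and are not the contents of insurance3.txt.

```r
# Hypothetical fixed-width records: id(2) sex(2) job(3) religion(3)
lines <- c(" 1 m  1  1",
           " 2 f -9  3")
tmp <- tempfile(fileext = ".txt")
writeLines(lines, tmp)

# Read every field; strip.white = TRUE trims the padding around text fields
all.cols <- read.fwf(tmp, widths = c(2, 2, 3, 3),
                     col.names = c("id", "sex", "job", "religion"),
                     strip.white = TRUE)

# Negative widths skip fields: here sex (2 chars) and job (3 chars) are dropped
some.cols <- read.fwf(tmp, widths = c(2, -2, -3, 3),
                      col.names = c("id", "religion"))

# Recode the -9 missing-value code to NA, as done for fwf.data above
all.cols$job[all.cols$job == -9] <- NA

print(all.cols)
print(some.cols)
```

Skipped fields need no entry in col.names, which is why the second call lists only two names.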

3. Importing data from statistical packages

> library(foreign) # for reading SPSS files

> ex1 = read.spss('ex1-1.sav',

+ to.data.frame=T, use.value.labels = T)

Warning message:

In read.spss("ex1-1.sav", to.data.frame = T, use.value.labels = T) :

ex1-1.sav: Unrecognized record type 7, subtype 18 encountered in system file

> ex1

shock response count

1 yes yes 25

2 yes no 19

3 no yes 31

4 no no 141

> dim(ex1)

[1] 4 3

> mouse.data = ex1[rep(1:nrow(ex1),ex1$count),] # repeat each row `count` times (frequency weights)

> head(mouse.data)

shock response count

1 yes yes 25

1.1 yes yes 25

1.2 yes yes 25

1.3 yes yes 25

1.4 yes yes 25

1.5 yes yes 25

> dim(mouse.data)

[1] 216 3

> mouse.table=table(mouse.data$shock,mouse.data$response)

> mouse.table

yes no

yes 25 19

no 31 141

> summary(mouse.table)

Number of cases in table: 216

Number of factors: 2

Test for independence of all factors:

Chisq = 27.458, df = 1, p-value = 1.605e-07

- Chi-square = 27.458, p-value = 1.6 × 10^-7. There is a significant association between shock (stimulus) and response.
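Expanding the rows with rep() is one way to apply the counts. As a sketch of an alternative, xtabs() can build the same 2 x 2 table directly from the count column. The data frame below just retypes the four cells of ex1 so the snippet runs without the .sav file.

```r
# Cell counts copied from the ex1 table above
ex1 <- data.frame(shock    = c("yes", "yes", "no", "no"),
                  response = c("yes", "no", "yes", "no"),
                  count    = c(25, 19, 31, 141))

# xtabs() sums `count` within each shock x response cell
mouse.table <- xtabs(count ~ shock + response, data = ex1)
print(mouse.table)

# Pearson chi-square without continuity correction,
# the same statistic that summary() reports for a table
result <- chisq.test(mouse.table, correct = FALSE)
print(result$statistic)
print(result$p.value)
```

chisq.test() applies Yates' continuity correction to 2 x 2 tables by default, so correct = FALSE is needed to reproduce the 27.458 shown by summary().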

4. Saving and loading RData

> save(ex1, file='c:/Rwork/ex1.Rdata')

> rm(ex1)

> load('C:/Rwork/ex1.Rdata')

> ex1

shock response count

1 yes yes 25

2 yes no 19

3 no yes 31

4 no no 141

> load(file=file.choose()) # choose the file interactively

References:

[1] 고급 R 활용 (Advanced R Applications; co-authored by 심송용, 이윤동, 이은경, 김성수; KNOU PRESS)


Lesson 3: Explore One Variable

Python, R Analysis and Programming/Data Analysis with R 2017. 7. 12. 10:23

# Load the data

getwd()
[1] "C:/Users/amore/Documents/FOR ME/Data Scientist/Udacity/Data Analysis with R/eda-course-materials/lesson3"
list.files()
[1] "lesson3.Rmd"         "lesson3_student.rmd" "pseudo_facebook.tsv"
pf <- read.csv('pseudo_facebook.tsv', sep='\t')
names(pf)
 [1] "userid"                "age"                   "dob_day"              
 [4] "dob_year"              "dob_month"             "gender"               
 [7] "tenure"                "friend_count"          "friendships_initiated"
[10] "likes"                 "likes_received"        "mobile_likes"         
[13] "mobile_likes_received" "www_likes"             "www_likes_received"

#Histogram of User's Birthdays

library(ggplot2)
qplot(x=dob_day, data = pf)

Here are some things that I noticed. On the first day of the month I see this huge bin of almost 8,000 people. This seems really unusual, since I would expect roughly the same number of birthdays on every day of the month.

library(ggplot2)
qplot(x=dob_day, data = pf) + 
  scale_x_continuous(breaks=1:31)

#Faceting

library(ggplot2)
qplot(x=dob_day, data = pf) + 
  scale_x_continuous(breaks=1:31)+
  facet_wrap(~dob_month, ncol=4)

Now, you may have noticed some peaks in May or perhaps in October, but I think what's really interesting is this huge spike on January first. There are almost 4,000 users in this bin. Now, this could be because of the default settings that Facebook uses, or perhaps users are choosing the first choice in the drop-down menus. Another idea is that some users may want to protect their privacy and so they just go with January first by default. Whatever the case may be, I think it's important that we make our considerations in the context of our data. We want to look out for these types of anomalies.

#Friend Count

ggplot(aes(x = friend_count),data = pf) +
   geom_histogram()

#Limiting the Axes

qplot(x = friend_count, data = pf, xlim = c(0,1000))


## alternative solution
qplot(x = friend_count, data = pf) + 
  scale_x_continuous(limits = c(0,1000))

#adjust the bin width

# binwidth = 25 is an illustrative value; the original snippet did not set one
qplot(x = friend_count, data = pf, binwidth = 25) + 
  scale_x_continuous(limits = c(0,1000), breaks = seq(0,1000,50))

##splits up the data by gender

qplot(x = friend_count, data = pf) + 
  scale_x_continuous(limits = c(0,1000), breaks = seq(0,1000,50)) + 
  facet_grid(gender~.)




#Omitting NA Observations
qplot(x = friend_count, data = subset(pf, !is.na(gender)), 
      binwidth = 10) + 
  scale_x_continuous(limits = c(0,1000), breaks = seq(0,1000,50)) + 
  facet_grid(gender~.)

#equivalent ggplot syntax:

ggplot(aes(x = friend_count), data = pf) +
  geom_histogram() +
  scale_x_continuous(limits = c(0, 1000), breaks = seq(0, 1000, 50)) +
  facet_wrap(~gender)




ggplot(aes(x = friend_count), data = subset(pf, !is.na(gender))) +
  geom_histogram() +
  scale_x_continuous(limits = c(0, 1000), breaks = seq(0, 1000, 50)) +
  facet_wrap(~gender)

##statistics by gender

> table(pf$gender)

female   male 
 40254  58574 
> by(pf$friend_count, pf$gender, summary)
pf$gender: female
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
      0      37      96     242     244    4923 
--------------------------------------------------------------------- 
pf$gender: male
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
      0      27      74     165     182    4917

### Tenure

qplot(x=tenure, data = pf, binwidth = 30,
      color = I('black'), fill = I('#099DD9'))



#Equivalent ggplot syntax:
ggplot(aes(x = tenure), data = pf) +
  geom_histogram(binwidth = 30, color = 'black', fill = '#099DD9')

## How would you create a histogram of tenure by year?

#create a histogram of tenure measured in years rather than in days

qplot(x = tenure / 365, data = pf, binwidth = 0.25, 
      color = I('black'), fill = I('#099DD9')) +
  scale_x_continuous(breaks = seq(1,7,1), limits = c(0,7))
#It looks like the bulk of our users had less than two and a half years on Facebook.

#Equivalent ggplot syntax:
ggplot(aes(x = tenure/365), data = pf) +
  geom_histogram(binwidth = .25, color = 'black', fill = '#F79420')

## Labeling Plots

qplot(x = tenure / 365, data = pf, binwidth = 0.25, 
      xlab = 'Number of years using Facebook',
      ylab = 'Number of users in sample',
      color = I('black'), fill = I('#099DD9')) +
  scale_x_continuous(breaks = seq(1,7,1), limits = c(0,7))


#Equivalent ggplot syntax:

ggplot(aes(x = tenure / 365), data = pf) +
  geom_histogram(color = 'black', fill = '#F79420') +
  scale_x_continuous(breaks = seq(1, 7, 1), limits = c(0, 7)) +
  xlab('Number of years using Facebook') +
  ylab('Number of users in sample')

## User Ages

> range(pf$age)
[1]  13 113
qplot(x = age, data = pf,binwidth = 5,
      xlab = 'User ages',
      ylab = 'Number of users in sample',
      color = I('black'), fill = I('blue')) +
  scale_x_continuous(breaks = seq(0,113,5),limits= c(13,113))


#Equivalent ggplot syntax:

ggplot(aes(x = age), data = pf) +
  geom_histogram(binwidth = 1, fill = '#5760AB') +
  scale_x_continuous(breaks = seq(0, 113, 5))

## Transforming Data

> summary(pf$friend_count)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    0.0    31.0    82.0   196.4   206.0  4923.0 
> summary(log10(pf$friend_count))
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   -Inf   1.491   1.914    -Inf   2.314   3.692 
> summary(log10(pf$friend_count + 1))
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  0.000   1.505   1.919   1.868   2.316   3.692 
> summary(sqrt(pf$friend_count))
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  0.000   5.568   9.055  11.090  14.350  70.160 
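The -Inf in the log10 summary comes from users with zero friends. Here is a small sketch with made-up friend counts (not the pseudo_facebook data) that shows why the mean breaks and how adding 1 avoids it.

```r
# Hypothetical friend counts, including one user with zero friends
friend_count <- c(0, 3, 10, 99, 999)

plain <- log10(friend_count)
print(plain)          # first element is -Inf because log10(0) = -Inf
print(mean(plain))    # -Inf: a single infinite value drags the mean with it

shifted <- log10(friend_count + 1)  # shift by 1 so that 0 maps to 0
print(shifted)
print(mean(shifted))                # finite

print(sqrt(friend_count))           # the square root is already defined at 0
```

The median and quartiles survive the plain log10 transform because they depend only on the order of the values, which is why only Min. and Mean show -Inf in the summary above.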
#Learn how to use scales and how to create multiple plots on one page. You'll also need to install and load the package gridExtra.
#Create one column with the following three histograms: friend count, friend count transformed using log10, and friend count transformed using square root.
> library(gridExtra)
> p1 <- qplot(x = friend_count, data = pf)
> p2 <- qplot(x = log10(friend_count+1), data = pf)
> p3 <- qplot(x = sqrt(friend_count), data = pf)
> grid.arrange(p1,p2,p3, ncol = 1)
#Equivalent ggplot syntax: we get the same outputs as we had before
pp1 <- ggplot(aes(x=friend_count), data = pf) + geom_histogram()
pp2 <- pp1 + scale_x_log10()
pp3 <- pp1 + scale_x_sqrt()

grid.arrange(pp1,pp2,pp3, ncol = 1)

## Add a Scaling Layer

> logScale <- qplot(x = log10(friend_count), data = pf)
> countScale <- ggplot(aes(x = friend_count), data = pf) +
  geom_histogram() +
  scale_x_log10()
> grid.arrange(logScale, countScale, ncol=2)

-> Looking at the two plots, we can see that the difference is really in the x axis labeling. 
   Using scale_x_log10 will label the axis in actual friend_count, whereas using the 
   log10 wrapper will label the x axis in log units. In general it is easier to 
   think about actual counts, so that's why people prefer using scale_x_log10 as 
   a layer.

qplot(x=friend_count, data = pf) +
    scale_x_log10()

## Frequency Polygons : who has more friends on average men or women?

#This allows us to see the shape and the peaks of our distribution in more detail.
qplot(x = friend_count, data = subset(pf, !is.na(gender)),
      binwidth = 10) +
  scale_x_continuous(lim = c(0,1000), breaks = seq(0,1000,50)) +
  facet_wrap(~gender)


qplot(x = friend_count, data = subset(pf, !is.na(gender)),
      binwidth = 10, geom='freqpoly', color = gender) +
  scale_x_continuous(lim = c(0,1000), breaks = seq(0,1000,50)) 



#But again, this plot doesn't really answer our question of who has more friends on 
#average, men or women. Let's change the y-axis to show proportions instead of raw 
#counts. 


qplot(x = friend_count, y = ..count.. / sum(..count..),
      data = subset(pf, !is.na(gender)),
      xlab = 'Friend Count',
      ylab = 'Proportion of users with that friend count',
      binwidth = 10, geom='freqpoly', color = gender) +
  scale_x_continuous(lim = c(0,1000), breaks = seq(0,1000,50))

## Likes on the Web

qplot(x=www_likes, data = subset(pf, !is.na(gender)),
      geom = 'freqpoly', color = gender) +
  scale_x_log10()



# What's the www_likes count for males? The first question is asking how many www_likes 
# there are in the entire data set for males.

> by(pf$www_likes, pf$gender, sum)

pf$gender: female
[1] 3507665
--------------------------------------------------------------------- 
pf$gender: male

## Box Plots

qplot(x = gender, y = friend_count,
      data = subset(pf, !is.na(gender)),
      geom='boxplot')

## Adjust the code to focus on users who have friend counts between 0 and 1000.

qplot(x = gender, y = friend_count,
      data = subset(pf, !is.na(gender)),
      geom='boxplot',ylim = c(0,1000))




#same way
qplot(x = gender, y = friend_count,
      data = subset(pf, !is.na(gender)),
      geom = 'boxplot') +
scale_y_continuous(limits=c(0,1000))




-> Notice how the top of the box is just below 250, so it might be around 230. But this value might not be accurate for all of our data: when we use the ylim parameter or the scale_y_continuous layer, we actually remove data points from the calculations. 

# So a better way to do this is to use the coord_cartesian layer to set the y limits instead.
qplot(x = gender, y = friend_count,
      data = subset(pf, !is.na(gender)),
      geom = 'boxplot') +
  coord_cartesian(ylim= c(0,1000))



-> Here we set the y limits from 0 to 1000. Notice how the top of the box has moved slightly closer to 250 for females.
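The difference between the two approaches can be checked without drawing anything. This is a base R sketch on made-up friend counts: scale limits (ylim or scale_y_continuous(limits = ...)) drop out-of-range values before the box plot statistics are computed, while coord_cartesian() only zooms the view.

```r
# Hypothetical right-skewed friend counts
friend_count <- c(5, 20, 40, 80, 150, 240, 400, 900, 1500, 4000)

# Roughly what scale limits do: values above 1000 are discarded first,
# so the quartiles describe only the remaining data
q3.filtered <- quantile(friend_count[friend_count <= 1000], 0.75)

# Roughly what coord_cartesian(ylim = c(0, 1000)) does: the quartiles
# use all the data and only the visible window changes
q3.full <- quantile(friend_count, 0.75)

print(q3.filtered)
print(q3.full)
```

With these numbers the third quartile of the filtered data is lower than that of the full data, which is the same direction of shift seen in the top of the box above.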

## Box Plots, Quartiles, and Friendships

qplot(x = gender, y = friend_count,
      data = subset(pf, !is.na(gender)),
      geom = 'boxplot') +
  coord_cartesian(ylim = c(0,250))


> by(pf$friend_count, pf$gender, summary)
pf$gender: female
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
      0      37      96     242     244    4923 
---------------------------------------------------------------------------- 
pf$gender: male
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
      0      27      74     165     182    4917 
-> The third quartile, the 75% mark, is at 244, and that's all the way up 
here (pointing at the plot). This means that 75% of female users have a friend count below 244.
Put another way, 25% of female users have more than 244 friends.

## On average, who initiated more friendships in our sample: men or women?

> names(pf)
 [1] "userid"                "age"                   "dob_day"              
 [4] "dob_year"              "dob_month"             "gender"               
 [7] "tenure"                "friend_count"          "friendships_initiated"
[10] "likes"                 "likes_received"        "mobile_likes"         
[13] "mobile_likes_received" "www_likes"             "www_likes_received"   
qplot(x = gender, y = friendships_initiated,
      data = subset(pf, !is.na(gender)), geom = 'boxplot') +
  coord_cartesian(ylim = c(0,150))

> by(pf$friendships_initiated, pf$gender, summary)
pf$gender: female
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    0.0    19.0    49.0   113.9   124.8  3654.0 
---------------------------------------------------------------------------- 
pf$gender: male
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    0.0    15.0    44.0   103.1   111.0  4144.0 

-> On average, who initiated more friendships in our sample: men or women?
Women.

## Getting Logical : What percent of check in using mobile?

> head(pf$mobile_likes)
[1] 0 0 0 0 0 0
> summary(pf$mobile_likes)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    0.0     0.0     4.0   106.1    46.0 25110.0 
> summary(pf$mobile_likes > 0)
   Mode   FALSE    TRUE    NA's 
logical   35056   63947       0 
> pf$mobile_check_in <- NA
> pf$mobile_check_in <- ifelse(pf$mobile_likes>0, 1, 0)
> pf$mobile_check_in <- factor(pf$mobile_check_in)
> summary(pf$mobile_check_in)
    0     1 
35056 63947 
> sum(pf$mobile_check_in == 1) / length(pf$mobile_check_in) 
[1] 0.6459097
#What percent of users check in using mobile? About 65%
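The same proportion can be had in one step, since the mean of a logical vector is the share of TRUE values. The vector below is a stand-in rebuilt from the counts in the summary above (35056 FALSE, 63947 TRUE), not the real pf$mobile_likes column.

```r
# Stand-in for pf$mobile_likes: 35056 users with no mobile likes
# and 63947 users with at least one
mobile_likes <- rep(c(0, 1), times = c(35056, 63947))

# TRUE counts as 1 and FALSE as 0, so the mean is a proportion
share.mobile <- mean(mobile_likes > 0)
print(share.mobile)
```

This skips the ifelse() and factor() steps, which are only needed if the 0/1 flag itself is wanted as a column.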



Lesson2: R Basic

Python, R Analysis and Programming/Data Analysis with R 2017. 7. 11. 10:14

# Select only the rows you want

> data(mtcars)

> mean(mtcars$mpg)

[1] 20.09062

> subset(mtcars, mtcars$mpg >= 30 | mtcars$hp < 60)

mpg cyl disp hp drat wt qsec vs am gear carb

Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1

Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2

Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1

Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2

> mtcars[mtcars$mpg >= 30 | mtcars$hp < 60, ] # the trailing comma is required to return all columns

mpg cyl disp hp drat wt qsec vs am gear carb

Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1

Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2

Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1

Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
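As a quick check that the two forms really select the same rows, here is a small sketch using the built-in mtcars data. The variable names are my own.

```r
data(mtcars)

# Logical row index: TRUE for cars with mpg >= 30 or hp < 60
keep <- mtcars$mpg >= 30 | mtcars$hp < 60

by.subset  <- subset(mtcars, mpg >= 30 | hp < 60)  # columns referenced by name
by.bracket <- mtcars[keep, ]                       # rows before the comma, all columns after

# The part after the comma selects columns
mpg.hp <- mtcars[keep, c("mpg", "hp")]

print(isTRUE(all.equal(by.subset, by.bracket)))
print(mpg.hp)
```

Without the comma, mtcars[keep] is read as a column selection, which is why the comma matters in the bracket form.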

# Look at the reddit data and plot a categorical variable

> reddit <- read.csv('reddit.csv')

> library(ggplot2)

> qplot(data = reddit, x = age.range)

# Setting the levels of ordered factors (solution)

> reddit$age.range = ordered(reddit$age.range,levels=c("Under 18","18-24", "25-34","35-44","45-54","55-64","65 or Above"))

> qplot(data = reddit, x = age.range)

#alternative solution

> reddit$age.range = factor(reddit$age.range, levels=c("Under 18","18-24", "25-34","35-44","45-54","55-64","65 or Above"),ordered=T)

> qplot(data = reddit, x = age.range)

# practice

> nlevels(reddit$income.range)

[1] 8

> levels(reddit$income.range)

[1] "$100,000 - $149,999" "$150,000 or more" "$20,000 - $29,999" "$30,000 - $39,999"

[5] "$40,000 - $49,999" "$50,000 - $69,999" "$70,000 - $99,999" "Under $20,000"

> qplot(data = reddit, x = income.range)

# The levels can also be set like this

> reddit$income.range = ordered(reddit$income.range, levels=c("Under $20,000" , "$20,000 - $29,999" , "$30,000 - $39,999", "$40,000 - $49,999","$50,000 - $69,999" , "$70,000 - $99,999", "$100,000 - $149,999" , "$150,000 or more"))

> qplot(data = reddit, x = income.range)

# Another example

> tShirts <- factor(c('medium', 'small', 'large', 'medium', 'large', 'large'), levels = c('medium','small','large'))

> tShirts

[1] medium small large medium large large

Levels: medium small large

> qplot(x = tShirts)

> tShirts <- ordered(tShirts, levels = c('small', 'medium', 'large'))

> tShirts

[1] medium small large medium large large

Levels: small < medium < large

> qplot(x = tShirts)
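Beyond the bar order in the plot, an ordered factor also allows comparisons between levels. A small sketch with the same t-shirt sizes:

```r
sizes <- c('medium', 'small', 'large', 'medium', 'large', 'large')
tShirts <- ordered(sizes, levels = c('small', 'medium', 'large'))

print(table(tShirts))       # counts follow the level order, not alphabetical order

smaller <- tShirts < 'large'  # comparisons are defined for ordered factors
print(smaller)
print(sum(smaller))         # how many shirts are smaller than large

print(max(tShirts))         # the largest level present
```

With a plain factor() the same comparison returns NA with a warning, so ordered() is the one to use when the levels have a natural ranking.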



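Chapter 6 - Multiple Regression: Which Combination of Ads Attracts the Most Customers?

Python, R Analysis and Programming/Data Analysis Learned Through Business Use Cases: R 2017. 2. 15. 19:30

Problem: The number of new users brought in by mass-media advertising is inconsistent, because the split between TV and magazine advertising changes from month to month. We therefore want to understand the relationship between TV and magazine ad spend and the number of new users.

Solution: Run a multiple regression analysis on the TV ad spend, magazine ad spend, and new-user data.

R

1. Reading the data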

> ad.data <- read.csv('ad_result.csv',header=T,stringsAsFactors = F)

> ad.data

month tvcm magazine install

1 2013-01 6358 5955 53948

2 2013-02 8176 6069 57300

3 2013-03 6853 5862 52057

4 2013-04 5271 5247 44044

5 2013-05 6473 6365 54063

6 2013-06 7682 6555 58097

7 2013-07 5666 5546 47407

8 2013-08 6659 6066 53333

9 2013-09 6066 5646 49918

10 2013-10 10090 6545 59963

2. Scatter plot of TV ad spend vs. number of new users

> library(ggplot2)
> library(scales)

> ggplot(ad.data,aes(x=tvcm,y=install))

> ggplot(ad.data,aes(x=tvcm,y=install))+geom_point()

> ggplot(ad.data,aes(x=tvcm,y=install))+geom_point()+xlab('TV ad spend')+ylab('New users')

> ggplot(ad.data,aes(x=tvcm,y=install))+geom_point()+xlab('TV ad spend')+ylab('New users')+scale_x_continuous(labels=comma)+scale_y_continuous(labels=comma)

3. Scatter plot of magazine ad spend vs. number of new users

> ggplot(ad.data,aes(x=magazine,y=install))+geom_point()+xlab('Magazine ad spend')+ylab('New users')+scale_x_continuous(labels=comma)+scale_y_continuous(labels=comma)

4. Running the regression

> fit <-lm(install~.,data=ad.data[,c('install','tvcm','magazine')])

> fit

Call:

lm(formula = install ~ ., data = ad.data[, c("install", "tvcm",

"magazine")])

Coefficients:

(Intercept) tvcm magazine

188.174 1.361 7.250

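-> From the output above we can build the following model.

New users = 1.361 × TV ad spend + 7.25 × magazine ad spend + 188.174

According to this model, about 188 new users arrive per month when no advertising is run (although summary(fit) below shows that this intercept is not statistically significant). Each additional 10,000 won spent on TV ads brings in about 1.3609 new users, and each additional 10,000 won spent on magazine ads brings in about 7.2498 new users.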
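To see the model equation in action, here is a sketch that retypes the ten rows printed in step 1, refits the model, and predicts new users for one budget. The budget of 7000 for TV and 6000 for magazines is a made-up example, not a figure from the book.

```r
# The ten monthly rows shown in step 1, typed in by hand
ad.data <- data.frame(
  tvcm     = c(6358, 8176, 6853, 5271, 6473, 7682, 5666, 6659, 6066, 10090),
  magazine = c(5955, 6069, 5862, 5247, 6365, 6555, 5546, 6066, 5646, 6545),
  install  = c(53948, 57300, 52057, 44044, 54063, 58097, 47407, 53333, 49918, 59963)
)

fit <- lm(install ~ tvcm + magazine, data = ad.data)
b <- coef(fit)
print(b)

# Hypothetical budget (same units as the data)
new.budget <- data.frame(tvcm = 7000, magazine = 6000)

# Plugging into the equation by hand and using predict() give the same answer
by.hand    <- b["(Intercept)"] + b["tvcm"] * 7000 + b["magazine"] * 6000
by.predict <- predict(fit, newdata = new.budget)
print(by.hand)
print(by.predict)
```

Since there are only ten months of data, a prediction like this is best read as a rough guide, especially for budgets outside the range that was actually observed.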

5. Interpreting the regression results

> summary(fit)

Call:

lm(formula = install ~ ., data = ad.data[, c("install", "tvcm", "magazine")])

Residuals:

Min 1Q Median 3Q Max

-1406.87 -984.49 -12.11 432.82 1985.84

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 188.1743 7719.1308 0.024 0.98123

tvcm 1.3609 0.5174 2.630 0.03390 *

magazine 7.2498 1.6926 4.283 0.00364 **

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1387 on 7 degrees of freedom

Multiple R-squared: 0.9379, Adjusted R-squared: 0.9202

F-statistic: 52.86 on 2 and 7 DF, p-value: 5.967e-05

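Residuals: the distribution of the residuals (the differences between the observed and predicted values), summarized by quartiles, which lets us check whether the data are skewed. The absolute value of 1Q is larger than that of 3Q, so the distribution looks slightly skewed.

Coefficients: a summary of the intercept and the slopes.

Adjusted R-squared: 0.9202, meaning the model explains about 92.02% of the variation in the number of new users, after adjusting for the number of predictors.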
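The adjusted value can be recovered by hand from the Multiple R-squared in the output, using the standard formula 1 - (1 - R²)(n - 1)/(n - p - 1). The numbers below are read off the summary above: n = 10 months and p = 2 predictors.

```r
r.squared <- 0.9379  # Multiple R-squared from summary(fit)
n <- 10              # number of observations (months)
p <- 2               # number of predictors: tvcm and magazine

# The penalty (n - 1) / (n - p - 1) grows as predictors are added
adj.r.squared <- 1 - (1 - r.squared) * (n - 1) / (n - p - 1)
print(round(adj.r.squared, 4))
```

The residual degrees of freedom in the output (7) are exactly n - p - 1, which is a handy way to check n and p.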

The same regression model can also be written as follows.

> fit2<-lm(install~tvcm+magazine,data=ad.data)

> summary(fit2)

Call:

lm(formula = install ~ tvcm + magazine, data = ad.data)

Residuals:

Min 1Q Median 3Q Max

-1406.87 -984.49 -12.11 432.82 1985.84

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 188.1743 7719.1308 0.024 0.98123

tvcm 1.3609 0.5174 2.630 0.03390 *

magazine 7.2498 1.6926 4.283 0.00364 **

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1387 on 7 degrees of freedom

Multiple R-squared: 0.9379, Adjusted R-squared: 0.9202

F-statistic: 52.86 on 2 and 7 DF, p-value: 5.967e-05

Source: 비지니스 활용 사례로 배우는 데이터 분석: R (Data Analysis Learned Through Business Use Cases: R), by 사카마키 류지 and 사토 요헤이


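Chapter 5 - A/B Testing: Which Banner Ad Gets the Better Response?

Python, R Analysis and Programming/Data Analysis Learned Through Business Use Cases: R 2017. 2. 13. 19:51

Problem: We found that the purchase rate of our product was low compared with other apps, and traced this to the low click-through rate of the banner ad.

Solution: Because there was no data to analyze, an A/B test was set up to emit analysis logs and collect data. After the A/B test, compute the click-through rates of banner ads A and B and run a χ² (chi-square) test.

R analysis

1. Reading the data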

> setwd("C:/파일/temp/r_temp")

> ab.test.imp <-read.csv('section5-ab_test_imp.csv',header=T,stringsAsFactors = F)

> head(ab.test.imp,2)

log_date app_name test_name test_case user_id transaction_id

1 2013-10-01 game-01 sales_test B 36703 25622

2 2013-10-01 game-01 sales_test A 44339 25623

> ab.test.goal <- read.csv('section5-ab_test_goal.csv',header=T,stringsAsFactors = F)

> head(ab.test.goal,2)

log_date app_name test_name test_case user_id transaction_id

1 2013-10-01 game-01 sales_test B 15021 25638

2 2013-10-01 game-01 sales_test B 351 25704

2. Joining ab.test.goal onto ab.test.imp

> ab.test.imp <- merge(ab.test.imp, ab.test.goal, by='transaction_id', all.x=T, suffixes=c('','.g'))

> head(ab.test.imp,2)

-> all.x - logical; if TRUE, then extra rows will be added to the output, one for each row in x that has no matching row in y. These rows will have NAs in those columns that are usually filled with values from y. The default is FALSE, so that only rows with data from both x and y are included in the output.

3. Creating a flag for whether the banner was clicked

> ab.test.imp$is.goal <-ifelse(is.na(ab.test.imp$user_id.g),0,1)

-> 0 if user_id.g is missing (NA), 1 otherwise
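Steps 2 and 3 together are a left join followed by an NA check. Here is a sketch on two tiny made-up logs (the ids are invented, not taken from the section5 CSV files) to show which rows end up flagged.

```r
# Hypothetical impression log: every time a banner was shown
imp <- data.frame(transaction_id = c(1, 2, 3, 4),
                  test_case = c("A", "B", "A", "B"),
                  user_id = c(101, 102, 103, 104))

# Hypothetical goal log: only the impressions that led to a click
goal <- data.frame(transaction_id = c(2, 4),
                   user_id = c(102, 104))

# all.x = TRUE keeps every impression; impressions with no match in
# `goal` get NA in the columns that come from `goal`
joined <- merge(imp, goal, by = "transaction_id", all.x = TRUE,
                suffixes = c("", ".g"))

# A missing user_id.g means there was no click
joined$is.goal <- ifelse(is.na(joined$user_id.g), 0, 1)
print(joined)
```

The suffixes argument is what produces the user_id.g name: columns from the first table keep their names and clashing columns from the second get ".g" appended.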

4. Aggregating the click-through rate

> library(plyr)

> ddply(ab.test.imp,

.(test_case),

summarize,

cvr=sum(is.goal)/length(user_id))

test_case cvr

1 A 0.08025559

2 B 0.11546015

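-> cvr = number of users who clicked / number of users shown the banner ad

Banner ad B has a higher overall click-through rate than A. Whether this difference is statistically significant still has to be checked with a test.

5. Chi-square test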

> chisq.test(ab.test.imp$test_case,ab.test.imp$is.goal)

Pearson's Chi-squared test with Yates' continuity correction

data: ab.test.imp$test_case and ab.test.imp$is.goal

X-squared = 308.38, df = 1, p-value < 2.2e-16

-> The p-value is very small (below 0.05), so the difference between A and B is unlikely to be due to chance
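chisq.test() also accepts a ready-made 2 x 2 table of counts, which is handy when only the totals are available. The counts below are invented for illustration and are not the article's data.

```r
# Hypothetical counts: rows are test cases, columns are no click (0) / click (1)
counts <- matrix(c(920,  80,
                   880, 120),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(test_case = c("A", "B"),
                                 is.goal   = c("0", "1")))

# Click-through rate per test case: share of clicks within each row
cvr <- prop.table(counts, margin = 1)[, "1"]
print(cvr)

# Yates' continuity correction is applied to 2 x 2 tables by default
test <- chisq.test(counts)
print(test$p.value)
```

A small p-value here says the click rate depends on the test case; it does not by itself say how large the difference is, so it is worth reporting the two rates alongside it.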

6. Computing click-through rates by date and test case

> ab.test.imp.summary<- ddply(ab.test.imp,

.(log_date,test_case),

summarize,

imp=length(user_id),

cv=sum(is.goal),

cvr=sum(is.goal)/length(user_id))

> head(ab.test.imp.summary)

log_date test_case imp cv cvr

1 2013-10-01 A 1358 98 0.07216495

2 2013-10-01 B 1391 176 0.12652768

3 2013-10-02 A 1370 88 0.06423358

4 2013-10-02 B 1333 212 0.15903976

5 2013-10-03 A 1213 170 0.14014839

6 2013-10-03 B 1233 185 0.15004055

-> imp = number of times the banner ad was shown

cv = number of users who clicked

cvr = click-through rate

7. Computing the click-through rate by test case

> ab.test.imp.summary<- ddply(ab.test.imp.summary,

.(test_case),

transform,

cvr.avg=sum(cv)/sum(imp))

> head(ab.test.imp.summary)

log_date test_case imp cv cvr cvr.avg

1 2013-10-01 A 1358 98 0.07216495 0.08025559

2 2013-10-02 A 1370 88 0.06423358 0.08025559

3 2013-10-03 A 1213 170 0.14014839 0.08025559

4 2013-10-04 A 1521 89 0.05851414 0.08025559

5 2013-10-05 A 1587 56 0.03528670 0.08025559

6 2013-10-06 A 1219 120 0.09844135 0.08025559

-> transform: adds a column to the existing summary table

cvr.avg: the overall click-through rate, added per test_case
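For readers without plyr, the same per-group column can be sketched in base R with ave(). The data frame retypes the first six rows of the daily summary from step 6, so the averages cover only three days and will not match the 0.0803 shown above for the full month.

```r
# First six rows of the daily summary from step 6, typed in by hand
daily <- data.frame(
  log_date  = rep(c("2013-10-01", "2013-10-02", "2013-10-03"), each = 2),
  test_case = rep(c("A", "B"), times = 3),
  imp = c(1358, 1391, 1370, 1333, 1213, 1233),
  cv  = c(98, 176, 88, 212, 170, 185)
)

# ave() returns a vector aligned with the rows: each row gets its group's total
total.cv  <- ave(daily$cv,  daily$test_case, FUN = sum)
total.imp <- ave(daily$imp, daily$test_case, FUN = sum)

# Overall click-through rate per test case, repeated on every row of that case
daily$cvr.avg <- total.cv / total.imp
print(daily)
```

Dividing the summed clicks by the summed impressions weights each day by its traffic, which is not the same as averaging the daily cvr values.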

8. Time series plot of click-through rate by test case

> str(ab.test.imp.summary)

'data.frame': 62 obs. of 6 variables:

$ log_date : chr "2013-10-01" "2013-10-02" "2013-10-03" "2013-10-04" ...

$ test_case: chr "A" "A" "A" "A" ...

$ imp : int 1358 1370 1213 1521 1587 1219 1595 1401 1648 1364 ...

$ cv : num 98 88 170 89 56 120 194 99 199 114 ...

$ cvr : num 0.0722 0.0642 0.1401 0.0585 0.0353 ...

$ cvr.avg : num 0.0803 0.0803 0.0803 0.0803 0.0803 ...

> ab.test.imp.summary$log_date <- as.Date(ab.test.imp.summary$log_date)

> library(ggplot2)

> library(scales)

> limits<-c(0,max(ab.test.imp.summary$cvr))

> ggplot(ab.test.imp.summary,aes(x=log_date,y=cvr,

col=test_case,lty=test_case,shape=test_case))+

geom_line(lwd=1)+geom_point(size=4)+geom_line(aes(y=cvr.avg,col=test_case))+

scale_y_continuous(label=percent,limits=limits)

-> The plot shows that banner ad B's click-through rate is generally higher than A's over the test period

## Understanding the ggplot chart step by step

> library(ggplot2)

> library(scales)

> ggplot(ab.test.imp.summary,aes(x=log_date,y=cvr,col=test_case,lty=test_case,shape=test_case))

> ggplot(ab.test.imp.summary,aes(x=log_date,y=cvr,col=test_case,lty=test_case,shape=test_case))+geom_line(lwd=1)

> ggplot(ab.test.imp.summary,aes(x=log_date,y=cvr,col=test_case,lty=test_case,shape=test_case))+geom_line(lwd=1)+geom_point(size=4)

> ggplot(ab.test.imp.summary,aes(x=log_date,y=cvr,col=test_case,lty=test_case,shape=test_case))+geom_line(lwd=1)+geom_point(size=4)+geom_line(aes(y=cvr.avg,col=test_case))

> limits<-c(0,max(ab.test.imp.summary$cvr))

> max(ab.test.imp.summary$cvr)

[1] 0.1712159

> ggplot(ab.test.imp.summary,aes(x=log_date,y=cvr,col=test_case,lty=test_case,shape=test_case))+geom_line(lwd=1)+geom_point(size=4)+geom_line(aes(y=cvr.avg,col=test_case))+ scale_y_continuous(label=percent,limits=limits)

Source: 비지니스 활용 사례로 배우는 데이터 분석: R (Data Analysis Learned Through Business Use Cases: R), by 사카마키 류지 and 사토 요헤이


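Chapter 5.1 Analysis of Covariance - 2: Two or More Covariates

KNOU/2 Health Information Data Analysis 2017. 1. 13. 10:06

[Example] These data come from a study of how a drug is excreted from the body. The researcher wants to know whether the amount of drug excreted differs by the form of the drug. However, the amount excreted is thought to depend not only on the drug form but also on the time at which the excreted drug was measured and on each subject's antimetabolic score. In this case, the measurement time and the antimetabolic score can be treated as two covariates, and analysis of covariance can be used to assess the effect of drug form while controlling for them.

Obs.	Drug form (trt)	Antimetabolic score (x1)	Elapsed time (x2)	Amount of drug (y)
1	1	37	61	11.3208
2	2	37	37	12.9151
3	3	45	53	18.8947
…	…	…	…	…

With two or more covariates, the model for a single covariate extends directly, and each parameter can be estimated and tested in the same way.

<Two assumptions for analysis of covariance>

1) Within each treatment, the effect of the covariate x on the response Y must be the same. That is, the regression coefficients must be equal across all drug forms, with no interaction.

2) The effect of the covariate x is not 0. If it were 0, a plain analysis of variance would be enough.

2. Tests in analysis of covariance

1) H0: β1 = 0 (the antimetabolic score has no effect)

2) H0: β2 = 0 (the elapsed time has no effect)

After controlling for these two covariate effects, we can test whether the population mean of the amount of drug excreted differs by drug form (type). The null hypothesis for this test is as follows.

3) H0 : α1 = α2 = ... = αI

1. Loading the library and reading the data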

> library(lsmeans)

> setwd('C:/Rwork')

> drug = read.csv('약물배출량자료.csv')

> head(drug)

관측번호 약형태 항신진대사점수 소요시간 약물량

1 1 1 37 61 11.3208

2 2 2 37 37 12.9151

3 3 3 45 53 18.8947

4 4 4 41 41 14.6739

5 5 5 57 41 8.6493

6 6 6 49 33 9.5238

> colname<-c('obs','type','x1','x2','y')

> colnames(drug)<-colname

> head(drug)

obs type x1 x2 y

1 1 1 37 61 11.3208

2 2 2 37 37 12.9151

3 3 3 45 53 18.8947

4 4 4 41 41 14.6739

5 5 5 57 41 8.6493

6 6 6 49 33 9.5238

> attach(drug)

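2. Testing the equality of regression coefficients (checking for interaction)

An important assumption of analysis of covariance is that the treatment does not interact with the covariates, so we check for interaction first. If an interaction is present, analysis of variance is carried out instead of analysis of covariance.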

> model1 = aov(y~factor(type) + factor(type)*x1 + factor(type)*x2 , data = drug)

> summary(model1)

Df Sum Sq Mean Sq F value Pr(>F)

factor(type) 5 250.3 50.1 2.099 0.113

x1 1 696.6 696.6 29.206 3.9e-05 ***

x2 1 54.4 54.4 2.282 0.148

factor(type):x1 5 160.8 32.2 1.349 0.289

factor(type):x2 5 42.3 8.5 0.355 0.872

Residuals 18 429.3 23.9

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

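-> Because there are two covariates, each one is crossed with factor(type) using * on the right-hand side.

Null hypothesis H0 : there is no interaction.

Alternative hypothesis H1 : there is an interaction.

p-value: factor(type):x1 (antimetabolic score) - 0.289

factor(type):x2 (time required) - 0.872

Decision: at the 5% significance level, the null hypothesis cannot be rejected.

Conclusion: there is no evidence of interaction between the two covariates (antimetabolic score, time required) and drug form, so ANCOVA can be carried out. ( = Within each treatment, the effect of covariate x on the response Y can be treated as the same.)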
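The same check can also be framed as a comparison of two nested models: one with a common slope and one with a separate slope per treatment. Below is a minimal sketch on simulated data, since the drug data file is not available here; the data frame sim and all of its values are made up for illustration.

```r
# Simulated data (hypothetical): 3 treatment groups, one covariate, common slope
set.seed(1)
n_per_group <- 20
sim <- data.frame(
  trt = factor(rep(1:3, each = n_per_group)),
  x   = runif(3 * n_per_group, min = 30, max = 60)
)
group_shift <- c(0, 2, 5)[as.integer(sim$trt)]
sim$y <- 40 + group_shift - 0.5 * sim$x + rnorm(nrow(sim), sd = 2)

# Reduced model: parallel slopes; full model: a separate slope per group
fit_common   <- lm(y ~ trt + x, data = sim)
fit_separate <- lm(y ~ trt * x, data = sim)

# F test of H0: all trt:x interaction coefficients are zero
slope_test <- anova(fit_common, fit_separate)
print(slope_test)
```

A large p-value in the second row is read the same way as the factor(type):x1 and factor(type):x2 rows above: no evidence against a common slope.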

3. ANCOVA with two covariates

> model2 = aov(y~factor(type) + x1 + x2, data=drug)

> summary(model2)

Df Sum Sq Mean Sq F value Pr(>F)

factor(type) 5 250.3 50.1 2.216 0.0808 .

x1 1 696.6 696.6 30.837 6.13e-06 ***

x2 1 54.4 54.4 2.409 0.1319

Residuals 28 632.5 22.6

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

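-> Because type enters the model first, its sequential (Type I) test is not adjusted for the two covariates. The F value and p-value are 2.216 and 0.0808, which is not significant at the 5% level.

Null hypothesis H0 : α1 = α2 = ... = αI

Alternative hypothesis H1 : at least one αi differs from the others.

p-value: 0.0808

Decision: at the 5% significance level, the null hypothesis cannot be rejected.

Conclusion: without adjusting for the covariates, there is no significant difference in the amount of drug excreted by drug form.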

3-1. Testing the treatment effect while controlling for the covariates

> model3 = lm(y~factor(type) + x1 + x2, data=drug)

> summary(model3)

Call:

lm(formula = y ~ factor(type) + x1 + x2, data = drug)

Residuals:

Min 1Q Median 3Q Max

-7.222 -2.634 -0.379 1.475 9.646

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 36.6370 8.9591 4.089 0.000331 ***

factor(type)2 0.8965 2.7823 0.322 0.749677

factor(type)3 7.9097 2.7684 2.857 0.007973 **

factor(type)4 3.0722 2.8247 1.088 0.286025

factor(type)5 9.5434 2.8534 3.345 0.002355 **

factor(type)6 5.8389 2.8391 2.057 0.049149 *

x1 -0.7606 0.1375 -5.531 6.51e-06 ***

x2 0.1647 0.1061 1.552 0.131868

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 4.753 on 28 degrees of freedom

Multiple R-squared: 0.6129, Adjusted R-squared: 0.5161

F-statistic: 6.332 on 7 and 28 DF, p-value: 0.0001641

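Null hypothesis H0 : β1 = β2 = 0 (the covariates, antimetabolic score and time required, have no effect)

H0 : α1 = α2 = ... = αI

Alternative hypothesis: not H0

p-value : 0.0001641

Decision: the null hypothesis is strongly rejected; the result is highly significant at the 5% level.

Conclusion: the model we set up fits the data significantly better than a model with no predictors. We cannot say that drug form (type) and the covariates have no effect on the population mean of the amount of drug excreted.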

4. Testing the treatment effect while controlling for the covariates - model sums of squares

1) Type I sum of squares: SS(type)

> model2 = aov(y~factor(type) + x1 + x2, data=drug)

> summary(model2)

Df Sum Sq Mean Sq F value Pr(>F)

factor(type) 5 250.3 50.1 2.216 0.0808 .

x1 1 696.6 696.6 30.837 6.13e-06 ***

x2 1 54.4 54.4 2.409 0.1319

Residuals 28 632.5 22.6

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

OR

> summary(aov(y~factor(type)+x1+x2,data=drug))

Df Sum Sq Mean Sq F value Pr(>F)

factor(type) 5 250.3 50.1 2.216 0.0808 .

x1 1 696.6 696.6 30.837 6.13e-06 ***

x2 1 54.4 54.4 2.409 0.1319

Residuals 28 632.5 22.6

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

-> Interpretation: the p-value for the Type I sum of squares contributed by drug form (type) is 0.0808, which is not significant at the 5% level.

2) Type III sum of squares: SS(type | x1, x2)

> summary(aov(y~x1+x2+factor(type),data=drug))

Df Sum Sq Mean Sq F value Pr(>F)

x1 1 516.5 516.5 22.867 5.03e-05 ***

x2 1 62.8 62.8 2.779 0.1067

factor(type) 5 422.0 84.4 3.736 0.0102 *

Residuals 28 632.5 22.6

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

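-> Interpretation: type given x1 and x2. With type entered last, its sequential sum of squares is the contribution of drug form after covariates x1 and x2 are already in the model, which matches the Type III sum of squares for type in this model. Its p-value is 0.0102, significant at the 5% level.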
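The order dependence of sequential sums of squares is easy to see on a small simulated example. This sketch uses anova(lm(...)), which reports the same sequential sums of squares as summary(aov(...)), and base R's drop1() for the marginal test; the data frame sim is made up, with the covariate deliberately correlated with the group.

```r
# Simulated data (hypothetical): covariate x is correlated with group membership
set.seed(2)
sim <- data.frame(grp = factor(rep(1:3, each = 20)))
sim$x <- 10 + 3 * as.integer(sim$grp) + rnorm(nrow(sim))
sim$y <- 5 + 2 * sim$x + c(0, 1, 4)[as.integer(sim$grp)] + rnorm(nrow(sim))

# Sequential SS for grp when it is entered first: SS(grp)
ss_grp_first <- anova(lm(y ~ grp + x, data = sim))["grp", "Sum Sq"]

# Sequential SS for grp when it is entered last: SS(grp | x)
ss_grp_last <- anova(lm(y ~ x + grp, data = sim))["grp", "Sum Sq"]

# Marginal SS for grp from drop1(), which does not depend on term order
ss_grp_marginal <- drop1(lm(y ~ grp + x, data = sim), test = "F")["grp", "Sum of Sq"]

c(first = ss_grp_first, last = ss_grp_last, marginal = ss_grp_marginal)
```

The "last" and "marginal" values agree, while "first" is different. That is the same pattern as the 0.0808 versus 0.0102 p-values above.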

5. Estimates for each level

> model3 = lm(y~factor(type) + x1 + x2, data=drug)

> summary(model3)

Call:

lm(formula = y ~ factor(type) + x1 + x2, data = drug)

Residuals:

Min 1Q Median 3Q Max

-7.222 -2.634 -0.379 1.475 9.646

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 36.6370 8.9591 4.089 0.000331 ***

factor(type)2 0.8965 2.7823 0.322 0.749677

factor(type)3 7.9097 2.7684 2.857 0.007973 **

factor(type)4 3.0722 2.8247 1.088 0.286025

factor(type)5 9.5434 2.8534 3.345 0.002355 **

factor(type)6 5.8389 2.8391 2.057 0.049149 *

x1 -0.7606 0.1375 -5.531 6.51e-06 ***

x2 0.1647 0.1061 1.552 0.131868

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 4.753 on 28 degrees of freedom

Multiple R-squared: 0.6129, Adjusted R-squared: 0.5161

F-statistic: 6.332 on 7 and 28 DF, p-value: 0.0001641

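-> In the output, the estimates for drug forms 3, 5, and 6 are the differences from the first drug form that are significant at the 5% level (form 6 only marginally, p = 0.049). The antimetabolic score (x1) has a p-value below 0.0001 and is highly significant at the 5% level, whereas the time required for excretion (x2) has a p-value of 0.131868 and is not significant at the 5% level. Based on this output, the ANCOVA model equation for each drug form can be written as follows.

$\hat{y}_{1j} = \hat{\beta}_0 + \hat{\alpha}_1 + \hat{\beta}_1 x_{11j} + \hat{\beta}_2 x_{21j} = 36.637 + 0 - 0.761\,x_{11j} + 0.165\,x_{21j}$

$\hat{y}_{2j} = \hat{\beta}_0 + \hat{\alpha}_2 + \hat{\beta}_1 x_{12j} + \hat{\beta}_2 x_{22j} = 36.637 + 0.897 - 0.761\,x_{12j} + 0.165\,x_{22j}$

$\hat{y}_{3j} = \hat{\beta}_0 + \hat{\alpha}_3 + \hat{\beta}_1 x_{13j} + \hat{\beta}_2 x_{23j} = 36.637 + 7.910 - 0.761\,x_{13j} + 0.165\,x_{23j}$

$\hat{y}_{4j} = \hat{\beta}_0 + \hat{\alpha}_4 + \hat{\beta}_1 x_{14j} + \hat{\beta}_2 x_{24j} = 36.637 + 3.072 - 0.761\,x_{14j} + 0.165\,x_{24j}$

$\hat{y}_{5j} = \hat{\beta}_0 + \hat{\alpha}_5 + \hat{\beta}_1 x_{15j} + \hat{\beta}_2 x_{25j} = 36.637 + 9.544 - 0.761\,x_{15j} + 0.165\,x_{25j}$

$\hat{y}_{6j} = \hat{\beta}_0 + \hat{\alpha}_6 + \hat{\beta}_1 x_{16j} + \hat{\beta}_2 x_{26j} = 36.637 + 5.839 - 0.761\,x_{16j} + 0.165\,x_{26j}$

In these fitted equations, the regression coefficients for the antimetabolic score (x1) and the time required (x2) are the same for every drug form. Controlling for the time required, a one-unit increase in the antimetabolic score decreases the amount excreted by 0.761, a statistically significant effect. Controlling for the antimetabolic score, a one-unit increase in the time required increases the amount excreted by 0.165, but this effect is not significant.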

6. Computing LSMEANS (adjusted means)

> lsmeans(model3,~type)

type lsmean SE df lower.CL upper.CL

1 5.674385 1.991087 28 1.595828 9.752942

2 6.570912 1.946518 28 2.583650 10.558174

3 13.584122 1.969292 28 9.550210 17.618034

4 8.746633 1.955440 28 4.741095 12.752171

5 15.217831 1.987611 28 11.146394 19.289268

6 11.513284 1.980761 28 7.455879 15.570690

Confidence level used: 0.95

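-> The LSMEAN (adjusted mean) for each drug form is printed above. An adjusted mean is the mean of the response Y (amount of drug excreted) after controlling for the antimetabolic score x1 and the time required x2. It can be computed by plugging the overall mean of each covariate (x1: 51.22, x2: 48.56) into the regression equation for each drug form, as shown below.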

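$\bar{y}_{\mathrm{adj},1} = 36.637 + 0 - 0.761 \times 51.22 + 0.165 \times 48.56 \approx 5.67$

$\bar{y}_{\mathrm{adj},2} = 36.637 + 0.897 - 0.761 \times 51.22 + 0.165 \times 48.56 \approx 6.57$

$\bar{y}_{\mathrm{adj},3} = 36.637 + 7.910 - 0.761 \times 51.22 + 0.165 \times 48.56 \approx 13.58$

$\bar{y}_{\mathrm{adj},4} = 36.637 + 3.072 - 0.761 \times 51.22 + 0.165 \times 48.56 \approx 8.74$

$\bar{y}_{\mathrm{adj},5} = 36.637 + 9.544 - 0.761 \times 51.22 + 0.165 \times 48.56 \approx 15.21$

$\bar{y}_{\mathrm{adj},6} = 36.637 + 5.839 - 0.761 \times 51.22 + 0.165 \times 48.56 \approx 11.51$

These values agree with the lsmeans output above up to rounding of the coefficients.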

Looking at the adjusted means, those for the first and second drug forms are relatively small, while those for the third, fifth, and sixth drug forms are large.
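The hand calculation can be repeated in R as a quick check. This sketch simply copies the coefficients from the summary(model3) output and the covariate means (51.22 and 48.56) reported by summary(drug) in this post; the variable names are my own.

```r
# Coefficients copied from the summary(model3) output in this post
intercept   <- 36.6370
type_effect <- c("1" = 0, "2" = 0.8965, "3" = 7.9097,
                 "4" = 3.0722, "5" = 9.5434, "6" = 5.8389)
b_x1 <- -0.7606   # antimetabolic score
b_x2 <-  0.1647   # time required

# Overall covariate means, as reported by summary(drug)
mean_x1 <- 51.22
mean_x2 <- 48.56

# Adjusted mean per drug form: each form evaluated at the same covariate values
adjusted_mean <- intercept + type_effect + b_x1 * mean_x1 + b_x2 * mean_x2
round(adjusted_mean, 2)
```

Because every drug form is evaluated at the same x1 and x2, the differences between adjusted means are just the differences between the type coefficients.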

6. Multiple comparisons between treatments

> model3.lsm = lsmeans(model3,pairwise ~ type, glhargs = list())

> print(model3.lsm,omit=1)

$lsmeans

type lsmean SE df lower.CL upper.CL

1 5.674385 1.991087 28 1.595828 9.752942

2 6.570912 1.946518 28 2.583650 10.558174

3 13.584122 1.969292 28 9.550210 17.618034

4 8.746633 1.955440 28 4.741095 12.752171

5 15.217831 1.987611 28 11.146394 19.289268

6 11.513284 1.980761 28 7.455879 15.570690

Confidence level used: 0.95

$contrasts

contrast estimate SE df t.ratio p.value

1 - 2 -0.8965269 2.782316 28 -0.322 0.9995

1 - 3 -7.9097368 2.768397 28 -2.857 0.0771

1 - 4 -3.0722481 2.824671 28 -1.088 0.8822

1 - 5 -9.5434459 2.853395 28 -3.345 0.0257

1 - 6 -5.8388992 2.839070 28 -2.057 0.3378

2 - 3 -7.0132099 2.783087 28 -2.520 0.1525

2 - 4 -2.1757212 2.753630 28 -0.790 0.9669

2 - 5 -8.6469191 2.802553 28 -3.085 0.0468

2 - 6 -4.9423723 2.758561 28 -1.792 0.4869

3 - 4 4.8374887 2.801787 28 1.727 0.5266

3 - 5 -1.6337091 2.782316 28 -0.587 0.9911

3 - 6 2.0708376 2.840329 28 0.729 0.9766

4 - 5 -6.4711978 2.783344 28 -2.325 0.2178

4 - 6 -2.7666511 2.753890 28 -1.005 0.9125

5 - 6 3.7045468 2.831753 28 1.308 0.7781

P value adjustment: tukey method for comparing a family of 6 estimates

The results above show that the adjusted mean for drug form 5 differs significantly from those of forms 1 and 2.

> names(model3.lsm)

[1] "lsmeans" "contrasts"

> plot(model3.lsm[[1]])

> plot(model3.lsm[[2]])

-> Viewing the Tukey (HSD) results as confidence intervals, the 1-5 and 2-5 comparisons are significant.

Dunnett vs Tukey

Before the multiple comparisons, convert the integer-coded type variable to a factor

> str(drug)

'data.frame': 36 obs. of 5 variables:

$ obs : int 1 2 3 4 5 6 7 8 9 10 ...

$ type: int 1 2 3 4 5 6 1 2 3 4 ...

$ x1 : int 37 37 45 41 57 49 49 53 53 53 ...

$ x2 : int 61 37 53 41 41 33 49 53 45 53 ...

$ y : num 11.32 12.92 18.89 14.67 8.65 ...

> drug$type<-as.factor(drug$type)

> summary(drug)

obs type x1 x2 y

Min. : 1.00 1:6 Min. :37.00 Min. :33.00 Min. : 0.0017

1st Qu.: 9.75 2:6 1st Qu.:49.00 1st Qu.:45.00 1st Qu.: 5.9561

Median :18.50 3:6 Median :53.00 Median :49.00 Median : 8.4527

Mean :18.50 4:6 Mean :51.22 Mean :48.56 Mean :10.2179

3rd Qu.:27.25 5:6 3rd Qu.:53.00 3rd Qu.:53.00 3rd Qu.:13.9175

Max. :36.00 6:6 Max. :61.00 Max. :65.00 Max. :28.1828

> model4<-lm(y~type+x1+x2,data=drug)

Running the multiple comparisons

> library(multcomp)

> tukey<-glht(model4,linfct=mcp(type='Tukey'))

> summary(tukey)

Simultaneous Tests for General Linear Hypotheses

Multiple Comparisons of Means: Tukey Contrasts

Fit: lm(formula = y ~ type + x1 + x2, data = drug)

Linear Hypotheses:

Estimate Std. Error t value Pr(>|t|)

2 - 1 == 0 0.8965 2.7823 0.322 0.9995

3 - 1 == 0 7.9097 2.7684 2.857 0.0771 .

4 - 1 == 0 3.0722 2.8247 1.088 0.8821

5 - 1 == 0 9.5434 2.8534 3.345 0.0257 *

6 - 1 == 0 5.8389 2.8391 2.057 0.3374

3 - 2 == 0 7.0132 2.7831 2.520 0.1524

4 - 2 == 0 2.1757 2.7536 0.790 0.9669

5 - 2 == 0 8.6469 2.8026 3.085 0.0469 *

6 - 2 == 0 4.9424 2.7586 1.792 0.4867

4 - 3 == 0 -4.8375 2.8018 -1.727 0.5265

5 - 3 == 0 1.6337 2.7823 0.587 0.9911

6 - 3 == 0 -2.0708 2.8403 -0.729 0.9766

5 - 4 == 0 6.4712 2.7833 2.325 0.2178

6 - 4 == 0 2.7667 2.7539 1.005 0.9124

6 - 5 == 0 -3.7045 2.8318 -1.308 0.7780

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Adjusted p values reported -- single-step method)

> plot(tukey)

-> Viewing the Tukey (HSD) results as confidence intervals, the 1-5 and 2-5 comparisons are significant.

> dunnett <- glht(model4,linfct=mcp(type='Dunnett'))

> summary(dunnett)

Simultaneous Tests for General Linear Hypotheses

Multiple Comparisons of Means: Dunnett Contrasts

Fit: lm(formula = y ~ type + x1 + x2, data = drug)

Linear Hypotheses:

Estimate Std. Error t value Pr(>|t|)

2 - 1 == 0 0.8965 2.7823 0.322 0.9974

3 - 1 == 0 7.9097 2.7684 2.857 0.0324 *

4 - 1 == 0 3.0722 2.8247 1.088 0.7130

5 - 1 == 0 9.5434 2.8534 3.345 0.0103 *

6 - 1 == 0 5.8389 2.8391 2.057 0.1733

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Adjusted p values reported -- single-step method)

> plot(dunnett)
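The two glht calls differ in which contrasts they test: Tukey compares every pair of drug forms, while Dunnett compares each form only with the first one. The sketch below just enumerates the two families in base R to make the sizes visible; the label format mimics the glht output, and the object names are my own.

```r
levels_type <- as.character(1:6)

# Tukey: all pairwise differences, written as "later - earlier"
tukey_pairs <- combn(levels_type, 2, FUN = function(p) paste(p[2], "-", p[1]))

# Dunnett: every level against the reference level (here, drug form 1)
dunnett_pairs <- paste(levels_type[-1], "-", levels_type[1])

length(tukey_pairs)
length(dunnett_pairs)
```

With fewer comparisons to adjust for, Dunnett's adjusted p-values are smaller for the same contrast. In the output above, 3 - 1 has p = 0.0771 under Tukey but p = 0.0324 under Dunnett.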

7. Checking model adequacy

> par(mfrow=c(2,2))

> plot(model3)

-> The plots show no serious problems with the equal-variance and normality assumptions.

Source: 보건정보데이터 분석 (co-authored by 이태림, 이재원, 김주한, 장대흥); R을 이용한 누구나 하는 통계분석 (by 안재형)


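Chapter 5 Neural Network Models - Classification

KNOU/2 Data Mining 2016. 12. 20. 10:12

A neural network model using the German credit data. The target variable y is categorical (good / bad); all variables are converted to numeric form before the neural network model is fitted.

1) Reading the data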

> set.seed(1000)

> library(neuralnet)

> library(dummy)

> setwd('c:/Rwork')

> german = read.table('germandata.txt',header = T)

Further study: what is dummy coding?

2) Changing the data and variable types

> dvar=c(4,9,10,15,17) #nominal variables: purpose (A43, A40, ...), personal, debtors, housing, job

> german2 = dummy(x=german[,dvar]) #convert the nominal variables to dummy variables

> head(german2,1)

purpose_A40 purpose_A41 purpose_A410 purpose_A42 purpose_A43 purpose_A44 purpose_A45

1 0 0 0 0 1 0 0

purpose_A46 purpose_A48 purpose_A49 personal_A91 personal_A92 personal_A93 personal_A94

1 0 0 0 0 0 1 0

debtors_A101 debtors_A102 debtors_A103 housing_A151 housing_A152 housing_A153 job_A171

1 1 0 0 0 1 0 0

job_A172 job_A173 job_A174

1 0 1 0

> german2 = german2[,-c(10,14,17,20,24)] #drop the last dummy column of each nominal variable (its reference level)

> head(german,1)

check duration history purpose credit savings employment installment personal debtors

1 A11 6 A34 A43 1169 A65 A75 4 A93 A101

Residence property age others housing numcredits job residpeople telephone foreign y

1 4 A121 67 A143 A152 2 A173 1 A192 A201 good

> german2 = cbind(german[,-dvar],german2) #combine the variables

> str(german2)
'data.frame':   1000 obs. of 40 variables:
$ check       : Factor w/ 4 levels "A11","A12","A13",..: 1 2 4 1 1 4 4 2 4 2 ...
$ duration    : int 6 48 12 42 24 36 24 36 12 30 ...
$ history     : Factor w/ 5 levels "A30","A31","A32",..: 5 3 5 3 4 3 3 3 3 5 ...
$ credit      : int 1169 5951 2096 7882 4870 9055 2835 6948 3059 5234 ...
$ savings     : Factor w/ 5 levels "A61","A62","A63",..: 5 1 1 1 1 5 3 1 4 1 ...
$ employment : Factor w/ 5 levels "A71","A72","A73",..: 5 3 4 4 3 3 5 3 4 1 ...
$ installment : int 4 2 2 2 3 2 3 2 2 4 ...
$ residence   : int 4 2 3 4 4 4 4 2 4 2 ...
$ property    : Factor w/ 4 levels "A121","A122",..: 1 1 1 2 4 4 2 3 1 3 ...
$ age         : int 67 22 49 45 53 35 53 35 61 28 ...
$ others      : Factor w/ 3 levels "A141","A142",..: 3 3 3 3 3 3 3 3 3 3 ...
$ numcredits : int 2 1 1 1 2 1 1 1 1 2 ...
$ residpeople : int 1 1 2 2 2 2 1 1 1 1 ...
$ telephone   : Factor w/ 2 levels "A191","A192": 2 1 1 1 1 2 1 2 1 1 ...
$ foreign     : Factor w/ 2 levels "A201","A202": 1 1 1 1 1 1 1 1 1 1 ...
$ y           : Factor w/ 2 levels "bad","good": 2 1 2 2 1 2 2 2 2 1 ...
$ purpose_A40 : Factor w/ 2 levels "0","1": 1 1 1 1 2 1 1 1 1 2 ...
$ purpose_A41 : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 2 1 1 ...
$ purpose_A410: Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
$ purpose_A42 : Factor w/ 2 levels "0","1": 1 1 1 2 1 1 2 1 1 1 ...
$ purpose_A43 : Factor w/ 2 levels "0","1": 2 2 1 1 1 1 1 1 2 1 ...
$ purpose_A44 : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
$ purpose_A45 : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
$ purpose_A46 : Factor w/ 2 levels "0","1": 1 1 2 1 1 2 1 1 1 1 ...
$ purpose_A48 : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
$ purpose_A49 : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
$ personal_A91: Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 2 1 ...
$ personal_A92: Factor w/ 2 levels "0","1": 1 2 1 1 1 1 1 1 1 1 ...
$ personal_A93: Factor w/ 2 levels "0","1": 2 1 2 2 2 2 2 2 1 1 ...
$ personal_A94: Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 2 ...
$ debtors_A101: Factor w/ 2 levels "0","1": 2 2 2 1 2 2 2 2 2 2 ...
$ debtors_A102: Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
$ debtors_A103: Factor w/ 2 levels "0","1": 1 1 1 2 1 1 1 1 1 1 ...
$ housing_A151: Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 2 1 1 ...
$ housing_A152: Factor w/ 2 levels "0","1": 2 2 2 1 1 1 2 1 2 2 ...
$ housing_A153: Factor w/ 2 levels "0","1": 1 1 1 2 2 2 1 1 1 1 ...
$ job_A171    : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
$ job_A172    : Factor w/ 2 levels "0","1": 1 1 2 1 1 2 1 1 2 1 ...
$ job_A173    : Factor w/ 2 levels "0","1": 2 2 1 2 2 1 2 1 1 1 ...
$ job_A174    : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 2 1 2 ...

> nrow(german2);ncol(german2)
[1] 1000
[1] 40

> for(i in 1:ncol(german2)) if(!is.numeric(german2[,i])) german2[,i] = as.numeric(german2[,i]) #convert the remaining ordered categorical variables to numeric

> german2$y = ifelse(german$y == 'good',1,0) #recode the target variable

> head(german$y)

[1] good bad good good bad good

Levels: bad good

> head(german2$y)

[1] 1 0 1 1 0 1

## Important: a neural network takes categorical data only after it has been converted to numeric form. ##
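As a note on the "what is dummy coding?" question: the two conversions used in this post can be reproduced in base R. The sketch below uses model.matrix() instead of the dummy package, and the small housing and check vectors are made-up examples that only borrow the level names from the German credit data.

```r
# Nominal variable: one 0/1 indicator column per level
housing <- factor(c("A152", "A151", "A153", "A152"))
one_hot <- model.matrix(~ housing - 1)

# Drop one column as the reference level (the post drops the last one)
ref_coded <- one_hot[, -ncol(one_hot), drop = FALSE]

# Ordered categorical variable: use the integer level codes directly
check_code <- as.numeric(factor(c("A11", "A13", "A12")))

one_hot
ref_coded
check_code
```

Dropping one indicator loses no information, because a row of all zeros in ref_coded identifies the reference level.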

3) Random 75% sample

> i = sample(1:nrow(german2),round(0.75*nrow(german2))) #random 75% sample

> length(i)
[1] 750

4) Scaling the variables

> max2 = apply(german2, 2, max)

> min2 = apply(german2, 2, min)

> gdat = scale(german2, center = min2, scale = max2 - min2) # rescale every variable to [0, 1]; dummy variables end up as 0/1

> gdat = as.data.frame(gdat) #convert to a data frame; data preparation is done

> str(gdat)

'data.frame': 1000 obs. of 35 variables:

$ check : num 0 0.333 1 0 0 ...

$ duration : num 0.0294 0.6471 0.1176 0.5588 0.2941 ...

$ history : num 1 0.5 1 0.5 0.75 0.5 0.5 0.5 0.5 1 ...

$ credit : num 0.0506 0.3137 0.1016 0.4199 0.2542 ...

$ savings : num 1 0 0 0 0 1 0.5 0 0.75 0 ...

$ employment : num 1 0.5 0.75 0.75 0.5 0.5 1 0.5 0.75 0 ...

$ installment : num 1 0.333 0.333 0.333 0.667 ...

$ residence : num 1 0.333 0.667 1 1 ...

$ property : num 0 0 0 0.333 1 ...

$ age : num 0.8571 0.0536 0.5357 0.4643 0.6071 ...

$ others : num 1 1 1 1 1 1 1 1 1 1 ...

$ numcredits : num 0.333 0 0 0 0.333 ...

$ residpeople : num 0 0 1 1 1 1 0 0 0 0 ...

$ telephone : num 1 0 0 0 0 1 0 1 0 0 ...

$ foreign : num 0 0 0 0 0 0 0 0 0 0 ...

$ y : num 1 0 1 1 0 1 1 1 1 0 ...

$ purpose_A40 : num 0 0 0 0 1 0 0 0 0 1 ...

$ purpose_A41 : num 0 0 0 0 0 0 0 1 0 0 ...

$ purpose_A410: num 0 0 0 0 0 0 0 0 0 0 ...

$ purpose_A42 : num 0 0 0 1 0 0 1 0 0 0 ...

$ purpose_A43 : num 1 1 0 0 0 0 0 0 1 0 ...

$ purpose_A44 : num 0 0 0 0 0 0 0 0 0 0 ...

$ purpose_A45 : num 0 0 0 0 0 0 0 0 0 0 ...

$ purpose_A46 : num 0 0 1 0 0 1 0 0 0 0 ...

$ purpose_A48 : num 0 0 0 0 0 0 0 0 0 0 ...

$ personal_A91: num 0 0 0 0 0 0 0 0 1 0 ...

$ personal_A92: num 0 1 0 0 0 0 0 0 0 0 ...

$ personal_A93: num 1 0 1 1 1 1 1 1 0 0 ...

$ debtors_A101: num 1 1 1 0 1 1 1 1 1 1 ...

$ debtors_A102: num 0 0 0 0 0 0 0 0 0 0 ...

$ housing_A151: num 0 0 0 0 0 0 0 1 0 0 ...

$ housing_A152: num 1 1 1 0 0 0 1 0 1 1 ...

$ job_A171 : num 0 0 0 0 0 0 0 0 0 0 ...

$ job_A172 : num 0 0 1 0 0 1 0 0 1 0 ...

$ job_A173 : num 1 1 0 1 1 0 1 0 0 0 ...

5) Building the neural network model and plotting it

> train = gdat[i,] #split into training and test samples

> test = gdat[-i,]

> gn = names(german2)

> gn

[1] "check" "duration" "history" "credit" "savings" "employment"

[7] "installment" "residence" "property" "age" "others" "numcredits"

[13] "residpeople" "telephone" "foreign" "y" "purpose_A40" "purpose_A41"

[19] "purpose_A410" "purpose_A42" "purpose_A43" "purpose_A44" "purpose_A45" "purpose_A46"

[25] "purpose_A48" "personal_A91" "personal_A92" "personal_A93" "debtors_A101" "debtors_A102"

[31] "housing_A151" "housing_A152" "job_A171" "job_A172" "job_A173"

> f = as.formula(paste('y~',paste(gn[!gn %in% 'y'],collapse = '+')))

> f

y ~ check + duration + history + credit + savings + employment +

installment + residence + property + age + others + numcredits +

residpeople + telephone + foreign + purpose_A40 + purpose_A41 +

purpose_A410 + purpose_A42 + purpose_A43 + purpose_A44 +

purpose_A45 + purpose_A46 + purpose_A48 + personal_A91 +

personal_A92 + personal_A93 + debtors_A101 + debtors_A102 +

housing_A151 + housing_A152 + job_A171 + job_A172 + job_A173

> nn1 = neuralnet(f,data=train,hidden=c(3,2),linear.output=F) #two hidden layers: 3 nodes in the first, 2 in the second; linear.output = F for classification

> summary(nn1)

Length Class Mode

call 5 -none- call

response 750 -none- numeric

covariate 25500 -none- numeric

model.list 2 -none- list

err.fct 1 -none- function

act.fct 1 -none- function

linear.output 1 -none- logical

data 35 data.frame list

net.result 1 -none- list

weights 1 -none- list

startweights 1 -none- list

generalized.weights 1 -none- list

result.matrix 119 -none- numeric

> plot(nn1)

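This produces the plot of a neural network with two hidden layers. I would like to study how to interpret this plot further, but I do not understand it well yet.

6) Model predictions: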

> dim(german2)[1]
[1] 1000

> dim(german2)[2]
[1] 35

> colnames(test)[16]
[1] "y"

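> pred.nn0 = compute(nn1,train[,c(1:15,17:dim(german2)[2])]) #predict on the training data to compare with the actual values; column 16 is y

Function notes:

> pred.nn0 = compute(nn1,train[,c(1:15,17:dim(german2)[2])]) The 16th variable is the target y. compute applies the fitted neural network to new cases and returns the results. nn1 is the fitted neural network object, and train[,c(1:15,17:dim(german2)[2])] is the data to which the prediction is applied.

7) Compare the actual and predicted values for the training sample.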

> head(cbind(german2[1,16],round(pred.nn0$net.result,10)))

[,1] [,2]

931 1 0.7786470581

546 1 0.0000005387

56 1 0.8208161458

883 1 0.9999999722

11 1 0.0000004232

354 1 0.0000046419

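#The left column is meant to be the actual value and the right column the value predicted by the trained model. Because this is a classification problem, predictions lie between 0 and 1, and 0.5 can be used as the cutoff. A three-way cutoff is also possible: discard below 0.3, hold up to 0.7, and use only values above 0.7. Note that german2[1,16] selects only the y value of the first row (1), which cbind repeats down the whole column, so the left column above is always 1 rather than the actual values. german2[,16] is the 16th column, i.e. y, so the intended index was most likely german2[i,16], the y values of the training rows. Row 4 is predicted close to 1 and row 5 close to 0; whether those predictions are right can only be judged against german2[i,16].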
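The effect of indexing with 1 instead of i is easy to reproduce. In this sketch y, i, and pred are small made-up vectors standing in for german2[,16], the training indices, and the network output.

```r
y    <- c(1, 0, 1, 1, 0, 1, 0, 1)      # hypothetical actual classes
i    <- c(7, 2, 5, 8, 3)               # hypothetical training row indices
pred <- c(0.9, 0.2, 0.7, 0.4, 0.1)     # hypothetical predictions for rows i

# y[1] is a single value, so cbind repeats it on every row
recycled <- cbind(y[1], pred)

# y[i] has one actual value per training row, aligned with pred
aligned <- cbind(y[i], pred)

recycled
aligned
```

Only the second matrix pairs each prediction with the actual class of the same row.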

8) Evaluating prediction accuracy

> pred.nn1 = compute(nn1,test[,c(1:15,17:dim(german2)[2])]) #now predict on the test data

> pred.nn2 = ifelse(pred.nn1$net.result>0.5,1,0) #classify as 1 or 0 using 0.5 as the boundary

> head(cbind(german2[-i,16],pred.nn2)) #actual and predicted values for the test sample

[,1] [,2]

3 1 1

12 1 0

13 1 1

14 1 0

15 0 1

26 1 1

> sum(german2[-i,16]!=pred.nn2) / length(german2[-i,16])

[1] 0.404

> sum(german2[-i,16]!=pred.nn2)

[1] 101

> length(german2[-i,16])

[1] 250

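#Misclassification rate on the test sample. Column 16 is the target variable. sum(german2[-i,16]!=pred.nn2) counts the test rows whose actual value differs from pred.nn2, and dividing by the total length(german2[-i,16]) gives the rate. german2[-i,16] takes column 16 for the rows not in i.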
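The same calculation can be written with mean() and laid out as a confusion table. This sketch uses only the six test rows printed by head() above, so the numbers illustrate the mechanics and are not the full 250-row result.

```r
# The six test rows shown in the head() output above
actual    <- c(1, 1, 1, 1, 0, 1)
predicted <- c(1, 0, 1, 0, 1, 1)

# Rows: actual class, columns: predicted class
confusion <- table(actual = actual, predicted = predicted)

# Misclassification rate: share of rows where the two disagree
error_rate <- mean(actual != predicted)

confusion
error_rate
```

On the full test sample, the same ratio is 101 / 250 = 0.404, as computed above.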

> library(nnet)

> nnet1 =nnet(f,data=train,size = 3, linout = F)

# weights: 109

initial value 181.570914

iter 10 value 113.679457

iter 20 value 96.943318

iter 30 value 82.803121

iter 40 value 73.239858

iter 50 value 70.807278

iter 60 value 69.865795

iter 70 value 69.476434

iter 80 value 69.158350

iter 90 value 69.039026

iter 100 value 68.929899

final value 68.929899

stopped after 100 iterations

> pred.nnet1 = predict(nnet1,test[,c(1:15,17:dim(german2)[2])])

> pred.nnet2 = ifelse(pred.nnet1>0.5,1,0)

> head(cbind(german2[-i,16],pred.nnet2)) #actual and predicted values for the test sample

[,1] [,2]

3 1 1

12 1 0

13 1 1

14 1 1

15 0 0

26 1 1

> sum(german2[-i,16]!=pred.nnet2) / length(german2[-i,16]) #misclassification rate of the test-sample predictions

[1] 0.408

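The textbook gives no separate explanation for this part, so further study is needed.

Reflections: I became interested in neural network models because of AlphaGo (DeepMind). After studying the theory and following along in R, I understand about 50%. My open questions are below.

1. Some nominal variables were not converted to dummies.

-> Professor's answer: In this neural network model, categorical variables that were already entered as numbers were used as numeric variables as they were and only standardized. These were variables judged to be ordered categorical. The remaining variables needed conversion, so the dummy function was used for them.

2. I need to be able to interpret the neural network plot above in more detail.

-> Professor's answer: A neural network model is quite hard to interpret from its plot, because it is a combination of very complex functions. You can, however, judge how strong a connection is from the (absolute) value on each arrow. This seems to be an inherent limitation of neural network models.

For reference, the neuralnet package used in the textbook provides a plot. Another package used in earlier lectures could not produce a plot directly.

3. The misclassification rate from this model was 40.4%, which seems too high.

-> Professor's answer: The misclassification rate is relative, but this does look high. The textbook uses 0.5 as the classification threshold, but in practice it is better to adjust this value to lower the misclassification rate. For example, if the original data contain many more 1s than 0s, a threshold larger than 0.5 is preferable.

4. How should the number of hidden layers be chosen?

5. What I most want to know is how to apply this model in practice. With the German credit data I built a neural network model, checked the misclassification rate, and compared actual and predicted values, but what does all of this mean? For example: which variables matter, which cases have good credit, and, given another customer data set, which customers, variables, and values should be extracted to identify the good customers we want. The actual business application is what I am most curious about.

-> Professor's answer: It is true that neural network models are hard to interpret and use well in practice. However, because their predictive power is relatively high, they are very useful for predicting 1 or 0, or the value of the target variable itself, from all the variables in new data. It is best to think of them as suited to prediction or classification itself rather than to interpreting or selecting variables.

6. Overall, after decision trees and ensemble models, this was a good occasion to become deeply interested in data mining.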
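The professor's answer to question 3, adjusting the threshold instead of fixing it at 0.5, can be sketched as a simple sweep over candidate cutoffs. The labels and scores below are simulated and hypothetical, not the German credit predictions. In practice the cutoff would be chosen on data held out from the final test set; that caveat is my own assumption, not something stated in the textbook.

```r
set.seed(4)
# Hypothetical labels (about 70% ones) and hypothetical scores in (0, 1)
actual <- rbinom(200, 1, 0.7)
score  <- plogis(rnorm(200, mean = ifelse(actual == 1, 1, -1)))

# Misclassification rate for each candidate cutoff
cutoffs <- seq(0.1, 0.9, by = 0.1)
error_by_cutoff <- sapply(cutoffs, function(threshold) {
  mean(ifelse(score > threshold, 1, 0) != actual)
})

# Cutoff with the lowest misclassification rate on this sample
best_cutoff <- cutoffs[which.min(error_by_cutoff)]

data.frame(cutoff = cutoffs, error = error_by_cutoff)
best_cutoff
```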

Source: 데이터마이닝 (co-authored by 장영재, 김현중, 조형준)


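Chapter 5 Neural Network Models - Regression

KNOU/2 Data Mining 2016. 11. 14. 14:11

A neural network model using the Boston housing data. The target variable is medv, the median house price (a continuous variable), and the remaining variables are the inputs.

1) Reading the data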

> set.seed(100)

> library(MASS)

> library(neuralnet)

> bdat = Boston

> str(bdat)

'data.frame': 506 obs. of 15 variables:

$ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...

$ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...

$ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...

$ chas : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...

$ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...

$ rm : num 6.58 6.42 7.18 7 7.15 ...

$ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...

$ dis : num 4.09 4.97 4.97 6.06 6.06 ...

$ rad : Factor w/ 9 levels "1","2","3","4",..: 1 2 2 3 3 3 5 5 5 5 ...

$ tax : num 296 242 242 222 222 222 311 311 311 311 ...

$ ptratio : num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...

$ black : num 397 397 393 395 397 ...

$ lstat : num 4.98 9.14 4.03 2.94 5.33 ...

$ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...

$ medv.hat: num 23.7 23.7 32.4 32.4 32.4 ...

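Note: chas and rad must be converted to numeric variables, and the medv.hat variable that was used in the ensemble models must be removed.

A neural network takes categorical data only after it has been converted to numeric form. Ordered categorical data are generally converted to numbers and then standardized.

2) Changing the data and variable types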

> bdat<-bdat[,-15]

> bdat$chas = as.numeric(bdat$chas)

> bdat$rad = as.numeric(bdat$rad)

> class(bdat$chas);class(bdat$rad)

[1] "numeric"

3) Random 50% sample

> i = sample(1:nrow(bdat), round(0.5*nrow(bdat)))

> max1 = apply(bdat, 2, max)

> min1 = apply(bdat, 2, min)

Interpretation: 50% of the rows are drawn at random to form the training data (train), and the rest become the test data (test).

4) Scaling the variables

> sdat = scale(bdat,center=min1,scale = max1 - min1)

> sdat = as.data.frame(sdat)

> head(sdat,3)

crim zn indus chas nox rm age dis rad

1 0.0000000000000 0.18 0.06781524927 0 0.3148148148 0.5775052692 0.6416065911 0.2692031391 0.000

2 0.0002359225392 0.00 0.24230205279 0 0.1728395062 0.5479977007 0.7826982492 0.3489619802 0.125

3 0.0002356977440 0.00 0.24230205279 0 0.1728395062 0.6943858977 0.5993820803 0.3489619802 0.125

tax ptratio black lstat medv

1 0.2080152672 0.2872340426 1.0000000000 0.08967991170 0.4222222222

2 0.1049618321 0.5531914894 1.0000000000 0.20447019868 0.3688888889

3 0.1049618321 0.5531914894 0.9897372535 0.06346578366 0.6600000000

Function notes:

> sdat = scale(bdat,center=min1,scale = max1 - min1) This step rescales the numeric variables used in the neural network to values between 0 and 1. The same [0-1] transformation can be computed as (x - min(x)) / (max(x) - min(x)).
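To confirm that the scale() call and the formula give the same result, here is a small sketch on a made-up data frame x (not the Boston data).

```r
# Hypothetical two-column data frame
x <- data.frame(a = c(2, 4, 10), b = c(100, 150, 200))

col_min <- apply(x, 2, min)
col_max <- apply(x, 2, max)

# Same call pattern as in the post: subtract the minimum, divide by the range
scaled <- as.data.frame(scale(x, center = col_min, scale = col_max - col_min))

# The formula applied column by column
manual <- as.data.frame(lapply(x, function(v) (v - min(v)) / (max(v) - min(v))))

scaled
manual
```

Every column now runs from 0 (its minimum) to 1 (its maximum), so variables measured in very different units enter the network on the same scale.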

5) Building the neural network model and plotting it

> train = sdat[i,] #training sample

> test = sdat[-i,] #test sample

> n = names(train) ;n

[1] "crim" "zn" "indus" "chas" "nox" "rm" "age" "dis" "rad" "tax"

[11] "ptratio" "black" "lstat" "medv"

> form = as.formula(paste('medv~',paste(n[!n %in% 'medv'],collapse = '+')))

> form

medv ~ crim + zn + indus + chas + nox + rm + age + dis + rad + tax + ptratio + black + lstat

> nn1 = neuralnet(form,data=train,hidden = c(5,3),linear.output = T)

> summary(nn1) #summarizes the structure of the fitted neural network object

Length Class Mode

call 5 -none- call

response 253 -none- numeric

covariate 3289 -none- numeric

model.list 2 -none- list

err.fct 1 -none- function

act.fct 1 -none- function

linear.output 1 -none- logical

data 14 data.frame list

net.result 1 -none- list

weights 1 -none- list

startweights 1 -none- list

generalized.weights 1 -none- list

result.matrix 95 -none- numeric

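Function notes

> form = as.formula(paste('medv~',paste(n[!n %in% 'medv'],collapse = '+'))) #The dependent variable is medv and the rest are independent variables, the same idea as medv ~ . (tilde dot). It puts every variable name in n except the dependent variable into the formula, joined by +. The variable names could also be typed in directly.

Entering form shows the formula medv ~ crim + zn + indus + chas + nox + rm + age + dis + rad + tax + ptratio + black + lstat.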
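Base R also has reformulate(), which builds the same kind of formula without pasting strings by hand. A short sketch with a shortened, hypothetical name vector n:

```r
n <- c("crim", "rm", "lstat", "medv")   # shortened list of column names

# The paste-based construction used in the post
form_paste <- as.formula(paste("medv ~", paste(n[!n %in% "medv"], collapse = " + ")))

# The same formula from reformulate(): predictors first, then the response
form_reform <- reformulate(setdiff(n, "medv"), response = "medv")

form_paste
form_reform
```

Both describe the same model; which one to use is a matter of taste.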

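> nn1 = neuralnet(form,data=train,hidden = c(5,3),linear.output = T) #two hidden layers, 5 nodes in the first and 3 in the second; linear.output = T because this is a regression problem

Question: how should the number of hidden nodes be chosen? As a rule of thumb, the number of nodes in a hidden layer should not exceed twice the number of input nodes. c(5,3) was used above, but how can we know whether it is optimal? More study is needed.

<Note> Deep learning

Deep learning is a machine learning technique that grew out of neural network models. The name "deep" comes from stacking many hidden layers into a complex, deep structure. Its structure and mechanism let it capture complex relationships between input and output variables by adjusting weights.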

> plot(nn1)

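The nodes marked ① correspond to the constant terms, and the weights are printed on each arrow. I need to study the interpretation further.

The result is a neural network with 5 nodes in the first hidden layer and 3 nodes in the second.

6) Model predictions: the plot above used only the 50% training data, so now we use the remaining 50% test data.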

> pred.nn0 = compute(nn1,test[,1:13])

> summary(pred.nn0)
               Length Class Mode
    neurons      3    -none- list
    net.result 253    -none- numeric

> names(pred.nn0)

[1] "neurons" "net.result"

> pred0 = pred.nn0$net.result*(max(bdat$medv)-min(bdat$medv))+min(bdat$medv)

> head(pred0)
                    [,1]
    1 26.55282080
    2 25.15912380
    3 34.48852794
    4 31.56317828
    5 32.47594868
    6 25.32302689

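Function notes:

> pred.nn0 = compute(nn1,test[,1:13])

The 14th variable is the target medv. compute applies the fitted neural network to new cases and returns the results. nn1 is the fitted neural network object, and test[,1:13] is the data to which the prediction is applied.

> pred0 = pred.nn0$net.result*(max(bdat$medv)-min(bdat$medv))+min(bdat$medv)

This converts the target variable back to its original scale, i.e. the values it had before rescaling. It is the inverse transformation.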
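The inverse transformation can be wrapped in a small helper to make the round trip explicit. unscale() is a hypothetical helper of my own, not a function from neuralnet, and the medv values are made up.

```r
# Map values on the [0, 1] scale back to the scale of `original`
unscale <- function(scaled, original) {
  scaled * (max(original) - min(original)) + min(original)
}

medv <- c(5, 24, 50)   # hypothetical house prices on the original scale

# Forward transformation, then back again
scaled_medv <- (medv - min(medv)) / (max(medv) - min(medv))
restored    <- unscale(scaled_medv, medv)

restored
```

The minimum and maximum must come from the same data that were used for the forward scaling (bdat$medv in this post); otherwise the predictions land on the wrong scale.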

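7) Compare the actual and predicted (fitted) values for the training sample.

In head(cbind(bdat[i,14],pred0),50) below, cbind(bdat[i,14],pred0) builds a matrix of actual values and neural network predictions. Note that bdat[i,14] holds the actual values of the training rows, while pred0 was computed from test[,1:13]; for the two columns to refer to the same rows, pred0 would need to be computed with compute(nn1,train[,1:13]).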

> head(cbind(bdat[i,14],pred0),50)

[,1] [,2]

1 15.6 28.12482177

2 19.2 21.36405177

3 29.1 34.24035993

4 18.4 35.88436410

6 24.0 25.77273199

9 22.2 14.86779840

12 11.9 20.82595913

14 26.4 20.03825755

18 24.4 17.19077183

19 23.9 18.69934412

21 20.3 14.85391889

23 9.6 15.77921679

24 13.3 14.95074253

25 33.3 16.19467498

26 15.0 15.53094329

27 19.3 16.50978025

28 27.5 15.65043905

30 22.6 19.71434020

33 29.4 13.39051466

34 19.5 15.45310394

36 33.8 22.75202201

41 31.2 37.96410157

42 21.2 36.32787047

43 19.9 27.29495234

44 42.3 26.74797523

46 24.8 20.55343697

47 50.0 19.83522653

48 20.8 16.66204238

50 48.8 16.90901569

51 23.0 19.54979162

54 41.7 22.43724314

55 17.1 16.69040814

57 25.0 22.91956778

58 15.2 29.22726384

62 8.1 18.82030494

63 8.8 24.13500281

66 19.7 25.15181243

67 28.6 20.16650282

68 20.2 20.27218139

69 18.7 17.48036500

70 17.0 19.87776814

71 12.1 25.21432307

73 25.0 23.16314867

75 12.3 23.82067236

77 23.9 20.42068536

79 37.6 20.09872166

80 22.7 20.89416546

81 5.0 28.30078783

84 28.4 22.10685814

87 14.0 21.30514814

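Interpretation: some rows, such as row 26, have similar values, while others, such as row 81, differ considerably.

8) Evaluating prediction accuracy: predict with the test data, which was not used to fit the model.

> pred.nn1 = compute(nn1,test[,1:13]) #exclude the target variable and apply nn1 to the test rows as new data

> pred1 = pred.nn1$net.result*(max(bdat$medv)-min(bdat$medv))+min(bdat$medv) #convert the target back to its original scale (inverse transformation)

> head(cbind(bdat[-i,14],pred1),50) #left: actual values of the test rows (column 14 is the target); right: pred1, the predicted values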

[,1] [,2]

1 24.0 28.12482177

2 21.6 21.36405177

3 34.7 34.24035993

4 33.4 35.88436410

6 28.7 25.77273199

9 16.5 14.86779840

12 18.9 20.82595913

14 20.4 20.03825755

18 17.5 17.19077183

19 20.2 18.69934412

21 13.6 14.85391889

23 15.2 15.77921679

24 14.5 14.95074253

25 15.6 16.19467498

26 13.9 15.53094329

27 16.6 16.50978025

28 14.8 15.65043905

30 21.0 19.71434020

33 13.2 13.39051466

34 13.1 15.45310394

36 18.9 22.75202201

41 34.9 37.96410157

42 26.6 36.32787047

43 25.3 27.29495234

44 24.7 26.74797523

46 19.3 20.55343697

47 20.0 19.83522653

48 16.6 16.66204238

50 19.4 16.90901569

51 19.7 19.54979162

54 23.4 22.43724314

55 18.9 16.69040814

57 24.7 22.91956778

58 31.6 29.22726384

62 16.0 18.82030494

63 22.2 24.13500281

66 23.5 25.15181243

67 19.4 20.16650282

68 22.0 20.27218139

69 17.4 17.48036500

70 20.9 19.87776814

71 24.2 25.21432307

73 22.8 23.16314867

75 24.1 23.82067236

77 20.0 20.42068536

79 21.2 20.09872166

80 20.3 20.89416546

81 28.0 28.30078783

84 22.9 22.10685814

87 22.5 21.30514814

Interpretation: overall, the actual and predicted values are similar.

> head(cbind(bdat[-i,14],pred1)[,1],10)
   1    2    3    4    5    6    7    8    9   10
24.0 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9
> head(cbind(bdat[-i,14],pred1),10)
   [,1]        [,2]
1 24.0 26.55282080
2 21.6 25.15912380
3 34.7 34.48852794
4 33.4 31.56317828
5 36.2 32.47594868
6 28.7 25.32302689
7 22.9 19.92609580
8 27.1 20.01203029
9 16.5 19.68915018
10 18.9 19.33552651
> obs <-cbind(bdat[-i,14],pred1)[,1]
> pdt <-cbind(bdat[-i,14],pred1)[,2]

> out = cbind(obs,pdt)
> head(out)
   obs         pdt
1 24.0 26.55282080
2 21.6 25.15912380
3 34.7 34.48852794
4 33.4 31.56317828
5 36.2 32.47594868
6 28.7 25.32302689
> which.min(abs(out$obs - out$pdt))
Error in out$obs : $ operator is invalid for atomic vectors
> out = as.data.frame(out)
> which.min(abs(out$obs - out$pdt))
[1] 43
> out[43,]
    obs         pdt
43 25.3 25.31048382
> which.max(abs(out$obs - out$pdt))
[1] 197
> out[197,]
    obs         pdt
372 50 18.13253008

9) PMSE: how well were the values predicted?

> PMSE = sum((bdat[-i,14] - pred1)^2) / nrow(test)

> PMSE

[1] 18.70948041

Function notes:

> PMSE = sum((bdat[-i,14] - pred1)^2) / nrow(test) #prediction mean squared error

Interpretation: the prediction mean squared error is 18.7.
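The PMSE line can be turned into a small reusable function. The sketch below applies it to the first six actual and predicted values printed by head(out) above, so the result covers only those six rows and will not match 18.7; pmse() is a helper name of my own.

```r
# Prediction mean squared error: average squared gap between actual and predicted
pmse <- function(actual, predicted) {
  mean((actual - predicted)^2)
}

# First six test rows from the head(out) output in this post
actual    <- c(24.0, 21.6, 34.7, 33.4, 36.2, 28.7)
predicted <- c(26.55282080, 25.15912380, 34.48852794,
               31.56317828, 32.47594868, 25.32302689)

pmse(actual, predicted)

# Square root of the full-sample PMSE, on the same scale as medv
sqrt(18.70948041)
```

The square root of the PMSE is on the same scale as medv, which makes the size of a typical prediction error easier to read.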

Source: 데이터마이닝 (co-authored by 장영재, 김현중, 조형준)


Chapter 4 Ensemble Models - Regression Ensembles - Random Forest

KNOU/2 Data Mining 2016. 11. 2. 10:05

Boston housing data - a regression ensemble model using the random forest method

1) Reading the data

> library(MASS) #provides the Boston data set

> Boston$chas=factor(Boston$chas)

> Boston$rad=factor(Boston$rad)

> str(Boston)

'data.frame': 506 obs. of 14 variables:

$ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...

$ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...

$ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...

$ chas : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...

$ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...

$ rm : num 6.58 6.42 7.18 7 7.15 ...

$ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...

$ dis : num 4.09 4.97 4.97 6.06 6.06 ...

$ rad : Factor w/ 9 levels "1","2","3","4",..: 1 2 2 3 3 3 5 5 5 5 ...

$ tax : num 296 242 242 222 222 222 311 311 311 311 ...

$ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...

$ black : num 397 397 393 395 397 ...

$ lstat : num 4.98 9.14 4.03 2.94 5.33 ...

$ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...

The factor() function converts chas and rad to categorical variables. medv is used as the target variable and all other variables as input variables.
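
To see what the conversion does, here is a minimal sketch on a made-up data frame (the rows are invented; only the column names mirror the Boston variables). Without factor(), a coded variable such as rad would be treated as an ordinary number:

```r
# Toy data frame with made-up rows; column names mirror the Boston variables
df <- data.frame(
  chas = c(0, 1, 0, 0),
  rad  = c(1, 24, 5, 24),
  medv = c(24.0, 21.6, 34.7, 33.4)
)

class_before <- class(df$rad)   # numeric before conversion

df$chas <- factor(df$chas)
df$rad  <- factor(df$rad)

rad_levels  <- levels(df$rad)   # distinct codes become unordered categories
col_classes <- sapply(df, class) # chas and rad are factors, medv stays numeric
```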

2) Running the random forest method

> library(randomForest)

> rf.boston<-randomForest(medv~.,data=Boston,ntree=100,mtry=5,importance=T,na.action=na.omit)

> rf.boston

Call:

randomForest(formula = medv ~ ., data = Boston, ntree = 100, mtry = 5, importance = T, na.action = na.omit)

Type of random forest: regression

Number of trees: 100

No. of variables tried at each split: 5

Mean of squared residuals: 9.743395

% Var explained: 88.46

> summary(rf.boston)

Length Class Mode

call 7 -none- call

type 1 -none- character

predicted 506 -none- numeric

mse 100 -none- numeric

rsq 100 -none- numeric

oob.times 506 -none- numeric

importance 26 -none- numeric

importanceSD 13 -none- numeric

localImportance 0 -none- NULL

proximity 0 -none- NULL

ntree 1 -none- numeric

mtry 1 -none- numeric

forest 11 -none- list

coefs 0 -none- NULL

y 506 -none- numeric

test 0 -none- NULL

inbag 0 -none- NULL

terms 3 terms call

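Function notes

ntree=100: the number of trees to grow. The default is 500.

mtry=5: the number of variables randomly sampled as candidates at each split. The default is sqrt(p) for classification trees and p/3 for regression trees, where p is the number of input variables.

importance=T: compute variable importance. The default is F.

na.action=na.omit: how missing values are handled. na.omit drops every row that contains a missing value before fitting. All 506 rows are used here, as the length of predicted in the summary shows.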
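
As a quick worked example of the mtry defaults for this data set (a sketch; the variable names are illustrative), Boston has 13 input variables once medv is set aside as the target:

```r
p <- 13  # number of input variables in Boston (14 columns minus medv)

# Defaults documented for randomForest(): rounded down to a whole number
mtry_classification <- floor(sqrt(p))
mtry_regression     <- max(floor(p / 3), 1)

c(classification = mtry_classification, regression = mtry_regression)
```

So mtry=5 in the call above is slightly larger than the regression default for these data.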

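The names() function lists the components stored in rf.boston, and $predicted returns the predicted values for the training data. These are out-of-bag predictions: each row is predicted only by the trees that did not see it.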

> names(rf.boston)

[1] "call" "type" "predicted" "mse" "rsq"

[6] "oob.times" "importance" "importanceSD" "localImportance" "proximity"

[11] "ntree" "mtry" "forest" "coefs" "y"

[16] "test" "inbag" "terms"

> head(rf.boston$predicted,30)

1 2 3 4 5 6 7 8 9 10

28.25382 22.55963 34.14192 35.45333 34.06798 26.81151 21.01950 16.78839 17.80599 19.23591

11 12 13 14 15 16 17 18 19 20

21.02440 21.23466 21.80889 20.05162 19.30557 20.21721 21.61349 18.46000 18.14724 19.96174

21 22 23 24 25 26 27 28 29 30

14.10136 18.55984 16.05801 15.04825 16.70996 15.70548 17.84748 14.82048 18.88633 20.64939
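
In the randomForest package, `$predicted` holds out-of-bag (OOB) predictions, so it can be used to recompute the "Mean of squared residuals" that the fitted object prints. The sketch below assumes Boston is loaded from MASS and uses an arbitrary seed that is not from the original post, so its numbers will not match the output above:

```r
library(MASS)          # assumed source of the Boston data
library(randomForest)

set.seed(1)            # arbitrary seed, only to make this sketch repeatable
rf <- randomForest(medv ~ ., data = Boston, ntree = 100, mtry = 5)

# MSE recomputed by hand from the out-of-bag predictions
oob_mse <- mean((Boston$medv - rf$predicted)^2, na.rm = TRUE)

# OOB MSE after the last tree, as stored in the fitted object
stored_mse <- rf$mse[rf$ntree]

c(by_hand = oob_mse, stored = stored_mse)
```

Because every row is scored by trees that never saw it, this figure behaves more like a test error than an error computed with predict() on the training data.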

The importance() function reports the computed importance of each input variable.

> importance(rf.boston,type=1)

%IncMSE

crim 8.325232

zn 2.061869

indus 5.130483

chas 1.030915

nox 9.211906

rm 17.090802

age 5.229782

dis 8.322716

rad 5.342500

tax 4.604745

ptratio 7.102056

black 5.292651

lstat 14.652271

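The output shows that rm and lstat have the highest importance.

Function notes

type=1 ranks variables by the mean decrease in accuracy, which for a regression forest is reported as the percentage increase in MSE (%IncMSE) when the variable is permuted. type=2 uses the mean decrease in node impurity.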
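
The ranking is easier to read once the values are sorted. The sketch below copies the %IncMSE values from the importance() output above into a named vector (the names `inc_mse` and `ranked` are illustrative) and orders them:

```r
# %IncMSE values copied from the importance() output above
inc_mse <- c(
  crim = 8.325232, zn = 2.061869, indus = 5.130483, chas = 1.030915,
  nox = 9.211906, rm = 17.090802, age = 5.229782, dis = 8.322716,
  rad = 5.342500, tax = 4.604745, ptratio = 7.102056, black = 5.292651,
  lstat = 14.652271
)

ranked <- sort(inc_mse, decreasing = TRUE)  # most important variable first

top3   <- head(names(ranked), 3)
bottom <- tail(names(ranked), 1)
```

With a fitted model in hand, the same ordering could be applied to the first column of the matrix returned by importance().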

To evaluate the fit, obtain the fitted values of the target variable and compute the mean squared error (MSE).

> names(Boston)

[1] "crim" "zn" "indus" "chas" "nox" "rm" "age" "dis" "rad"

[10] "tax" "ptratio" "black" "lstat" "medv"

> Boston$medv.hat = predict(rf.boston,newdata=Boston)

> mean((Boston$medv-Boston$medv.hat)^2) #mean square error(mse)

[1] 1.915207

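Compared with the earlier regression tree model, whose mean squared error mean((Boston$medv-Boston$medv.hat)^2) was 10.8643, the random forest's mean squared error is 1.915207, a large improvement in fit. Note that this error is computed on the same data used to fit the model, so it is optimistic compared with the out-of-bag estimate of 9.743395 reported when rf.boston was printed. Because a random forest relies on bootstrap sampling, its results vary with the random draws. Each run therefore gives a slightly different result, so the forest is sometimes fitted several times and the predictions averaged.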
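
The idea of fitting the forest several times and averaging the predictions can be sketched as follows. The seeds and the number of repetitions are arbitrary choices for illustration, and Boston is assumed to come from MASS:

```r
library(MASS)          # assumed source of the Boston data
library(randomForest)

seeds <- c(11, 22, 33)   # arbitrary seeds chosen for this sketch

# One column of fitted values per run
fits <- sapply(seeds, function(s) {
  set.seed(s)
  rf <- randomForest(medv ~ ., data = Boston, ntree = 100, mtry = 5)
  predict(rf, newdata = Boston)
})

# Average the runs row by row to smooth out run-to-run variation
avg_fit <- rowMeans(fits)

# Training MSE of each run and of the averaged fit
run_mse <- apply(fits, 2, function(f) mean((Boston$medv - f)^2))
avg_mse <- mean((Boston$medv - avg_fit)^2)
```

The MSE of the averaged prediction can never exceed the mean of the per-run MSEs, which is the usual argument for averaging.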

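(Link to the earlier regression tree analysis of the Boston Housing data)

Now compare the fitted values of the random forest regression ensemble with the observed values. The two agree closely.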

> plot(Boston$medv,Boston$medv.hat,xlab='Observed Values',ylab='Fitted Values')

> abline(0,1)

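The agreement is also clearly better than that between the fitted and observed values of the earlier regression tree model.

Next, fit the random forest on 70% of the data, apply it to the remaining 30% held out as test data, and compute the prediction error.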

> set.seed(1234)

> nrow(Boston)

[1] 506

> i=sample(1:nrow(Boston),round(nrow(Boston)*0.7))

> Boston.train = Boston[i,] # 70% for training data

> Boston.test = Boston[-i,] # 30% for test data

> rf.train.boston<-randomForest(medv~.,data=Boston.train,ntree=100,importance=T,na.action=na.omit)

#obtain the predicted values

> medv.hat.test<-predict(rf.train.boston,newdata=Boston.test)

> mean((Boston.test$medv-medv.hat.test)^2) #predicted mean square error

[1] 4.114596

The mean squared error on the test data is 4.11. Compared with the earlier regression tree model's test-data mean squared error of 13.95258, this is a substantial improvement.
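
One caveat about the session above: Boston still carries the medv.hat column added by the earlier predict() step, so medv ~ . in the 70/30 fit includes those fitted values as a predictor. A minimal sketch of the same split without that column follows. It works on a fresh copy named `boston` (an illustrative name), assumes the data come from MASS, and will give a different error than the 4.11 reported above:

```r
library(MASS)          # assumed source of the Boston data
library(randomForest)

boston <- Boston                      # fresh copy without a medv.hat column
boston$chas <- factor(boston$chas)
boston$rad  <- factor(boston$rad)

set.seed(1234)
i <- sample(1:nrow(boston), round(nrow(boston) * 0.7))
train <- boston[i, ]                  # 70% for training
test  <- boston[-i, ]                 # 30% for testing

# medv ~ . now uses only the 13 original input variables
rf_train <- randomForest(medv ~ ., data = train, ntree = 100, importance = TRUE)

pred_test <- predict(rf_train, newdata = test)
test_mse  <- mean((test$medv - pred_test)^2)   # prediction mean squared error
```

Keeping derived columns out of the modelling data frame, or storing fitted values in a separate vector, avoids this kind of leakage.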

(Link to the earlier regression tree analysis of the Boston Housing data)

Source: 데이터마이닝 (Data Mining), co-authored by 장영재, 김현중, and 조형준


Posted by 마르띤

