Posts tagged 'ddply'

2 posts matching 'ddply'

Chapter 5 - A/B Testing: Which Banner Ad Gets the Better Response?

Python, R Analysis and Programming / Data Analysis Learned Through Business Use Cases: R 2017. 2. 13. 19:51

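Problem: Our product's purchase rate was found to be lower than that of other apps, and the click-through rate of its banner ad turned out to be low.

Approach: No data existed for the analysis, so we collected it by running an A/B test that writes analysis logs. After the test, we computed the click-through rates of banner ads A and B and ran a χ² (chi-squared) test.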

R analysis

1. Load the data

> setwd("C:/파일/temp/r_temp")

> ab.test.imp <-read.csv('section5-ab_test_imp.csv',header=T,stringsAsFactors = F)

> head(ab.test.imp,2)

log_date app_name test_name test_case user_id transaction_id

1 2013-10-01 game-01 sales_test B 36703 25622

2 2013-10-01 game-01 sales_test A 44339 25623

> ab.test.goal <- read.csv('section5-ab_test_goal.csv',header=T,stringsAsFactors = F)

> head(ab.test.goal,2)

log_date app_name test_name test_case user_id transaction_id

1 2013-10-01 game-01 sales_test B 15021 25638

2 2013-10-01 game-01 sales_test B 351 25704

2. Merge ab.test.goal into ab.test.imp

> ab.test.imp <- merge(ab.test.imp, ab.test.goal, by='transaction_id', all.x=T, suffixes=c('','.g'))

> head(ab.test.imp,2)

-> all.x - logical; if TRUE, then extra rows will be added to the output, one for each row in x that has no matching row in y. These rows will have NAs in those columns that are usually filled with values from y. The default is FALSE, so that only rows with data from both x and y are included in the output.

3. Create a flag indicating whether the user clicked

> ab.test.imp$is.goal <-ifelse(is.na(ab.test.imp$user_id.g),0,1)

-> Set 0 when user_id.g is missing (NA), and 1 otherwise
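The left join and the click flag described above can be tried on a tiny data set. This is a minimal sketch in base R; the data frames `imp` and `goal` and all of their values are hypothetical, not the book's CSV files.

```r
# Toy impression and goal logs (hypothetical values, not the book's data)
imp  <- data.frame(transaction_id = c(1, 2, 3, 4),
                   test_case      = c("A", "B", "A", "B"),
                   user_id        = c(10, 11, 12, 13))
goal <- data.frame(transaction_id = c(2, 4),
                   user_id        = c(11, 13))

# all.x = TRUE keeps every impression row; rows without a matching
# goal row get NA in the columns that come from `goal`
merged <- merge(imp, goal, by = "transaction_id", all.x = TRUE,
                suffixes = c("", ".g"))

# 1 if a matching goal row exists (user_id.g is not NA), 0 otherwise
merged$is.goal <- ifelse(is.na(merged$user_id.g), 0, 1)
print(merged)
```

With `all.x = FALSE` (the default), only transactions 2 and 4 would survive the merge, and every remaining row would count as a click.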

4. Aggregate the click-through rate

> library(plyr)

> ddply(ab.test.imp,

.(test_case),

summarize,

cvr=sum(is.goal)/length(user_id))

test_case cvr

1 A 0.08025559

2 B 0.11546015

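-> cvr = number of users who clicked / number of users shown the banner ad

Banner ad B's click-through rate (about 11.5%) is higher than A's (about 8.0%). Whether this difference is statistically significant is checked with the chi-squared test in step 5.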
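Because is.goal is a 0/1 flag, the ratio sum(is.goal)/length(user_id) is simply the group mean of the flag. The sketch below shows the same aggregation with base R's `aggregate()` instead of plyr; the data frame `df` and its values are hypothetical.

```r
# Hypothetical impressions: 4 for banner A (1 click), 4 for banner B (2 clicks)
df <- data.frame(test_case = c("A", "A", "A", "A", "B", "B", "B", "B"),
                 is.goal   = c(0, 1, 0, 0, 1, 1, 0, 0))

# The mean of a 0/1 flag equals clicks / impressions within each group
cvr <- aggregate(is.goal ~ test_case, data = df, FUN = mean)
names(cvr)[2] <- "cvr"
print(cvr)
```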

5. Chi-squared test

> chisq.test(ab.test.imp$test_case,ab.test.imp$is.goal)

Pearson's Chi-squared test with Yates' continuity correction

data: ab.test.imp$test_case and ab.test.imp$is.goal

X-squared = 308.38, df = 1, p-value < 2.2e-16

-> The p-value is very small (well below 0.05), so the difference between A and B is unlikely to be due to chance
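The same test can be run on a 2x2 table of counts rather than on the raw row-level vectors. The following is a small sketch with made-up counts (1,000 impressions per banner); only `chisq.test()` itself comes from the post.

```r
# Hypothetical counts: rows = banner, columns = not clicked (0) / clicked (1)
counts <- matrix(c(920,  80,
                   880, 120),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(test_case = c("A", "B"),
                                 is.goal   = c("0", "1")))

# For 2x2 tables chisq.test() applies Yates' continuity correction by default
res <- chisq.test(counts)
print(res)

# TRUE means the "no difference" hypothesis is rejected at the 5% level
significant <- res$p.value < 0.05
print(significant)
```

Passing the two columns of the data frame, as the post does, builds this contingency table internally, so both forms test the same hypothesis.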

6. Compute the click-through rate by date and test case

> ab.test.imp.summary<- ddply(ab.test.imp,

.(log_date,test_case),

summarize,

imp=length(user_id),

cv=sum(is.goal),

cvr=sum(is.goal)/length(user_id))

> head(ab.test.imp.summary)

log_date test_case imp cv cvr

1 2013-10-01 A 1358 98 0.07216495

2 2013-10-01 B 1391 176 0.12652768

3 2013-10-02 A 1370 88 0.06423358

4 2013-10-02 B 1333 212 0.15903976

5 2013-10-03 A 1213 170 0.14014839

6 2013-10-03 B 1233 185 0.15004055

-> imp = number of times the banner ad was displayed

cv = number of users who clicked

cvr = click-through rate

7. Compute the click-through rate by test case

> ab.test.imp.summary<- ddply(ab.test.imp.summary,

.(test_case),

transform,

cvr.avg=sum(cv)/sum(imp))

> head(ab.test.imp.summary)

log_date test_case imp cv cvr cvr.avg

1 2013-10-01 A 1358 98 0.07216495 0.08025559

2 2013-10-02 A 1370 88 0.06423358 0.08025559

3 2013-10-03 A 1213 170 0.14014839 0.08025559

4 2013-10-04 A 1521 89 0.05851414 0.08025559

5 2013-10-05 A 1587 56 0.03528670 0.08025559

6 2013-10-06 A 1219 120 0.09844135 0.08025559

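-> transform: adds a new column to the existing summary table while keeping all of its rows

cvr.avg: the overall click-through rate of each test_case, repeated on every row of that test case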
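The per-group value that transform attaches to every row can also be produced with base R's `ave()`. This is a sketch under assumed data: the data frame `daily` and its numbers are hypothetical.

```r
# Hypothetical daily summary: two days per test case
daily <- data.frame(test_case = c("A", "A", "B", "B"),
                    imp       = c(100, 300, 200, 200),
                    cv        = c(10, 10, 30, 50))
daily$cvr <- daily$cv / daily$imp

# ave() returns a vector as long as its input, with each group's
# result repeated on every row of that group
daily$cvr.avg <- ave(daily$cv,  daily$test_case, FUN = sum) /
                 ave(daily$imp, daily$test_case, FUN = sum)
print(daily)
```

Note that this pooled rate (total clicks over total impressions) is not the same as the mean of the daily rates when daily impression counts differ: for A the pooled rate is 0.05, while the mean of the two daily rates is about 0.067.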

8. Time-series plot of the click-through rate by test case

> str(ab.test.imp.summary)

'data.frame': 62 obs. of 6 variables:

$ log_date : chr "2013-10-01" "2013-10-02" "2013-10-03" "2013-10-04" ...

$ test_case: chr "A" "A" "A" "A" ...

$ imp : int 1358 1370 1213 1521 1587 1219 1595 1401 1648 1364 ...

$ cv : num 98 88 170 89 56 120 194 99 199 114 ...

$ cvr : num 0.0722 0.0642 0.1401 0.0585 0.0353 ...

$ cvr.avg : num 0.0803 0.0803 0.0803 0.0803 0.0803 ...

> ab.test.imp.summary$log_date <- as.Date(ab.test.imp.summary$log_date)

> library(ggplot2)

> library(scales)

> limits<-c(0,max(ab.test.imp.summary$cvr))

> ggplot(ab.test.imp.summary,aes(x=log_date,y=cvr,

col=test_case,lty=test_case,shape=test_case))+

geom_line(lwd=1)+geom_point(size=4)+geom_line(aes(y=cvr.avg,col=test_case))+

scale_y_continuous(label=percent,limits=limits)

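-> The plot shows that banner ad B's click-through rate is higher than A's overall across the test period, which is consistent with the chi-squared test result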

## Understanding the ggplot graph

> library(ggplot2)

> library(scales)

> ggplot(ab.test.imp.summary,aes(x=log_date,y=cvr,col=test_case,lty=test_case,shape=test_case))

> ggplot(ab.test.imp.summary,aes(x=log_date,y=cvr,col=test_case,lty=test_case,shape=test_case))+geom_line(lwd=1)

> ggplot(ab.test.imp.summary,aes(x=log_date,y=cvr,col=test_case,lty=test_case,shape=test_case))+geom_line(lwd=1)+geom_point(size=4)

> ggplot(ab.test.imp.summary,aes(x=log_date,y=cvr,col=test_case,lty=test_case,shape=test_case))+geom_line(lwd=1)+geom_point(size=4)+geom_line(aes(y=cvr.avg,col=test_case))

> limits<-c(0,max(ab.test.imp.summary$cvr))

> max(ab.test.imp.summary$cvr)

[1] 0.1712159

> ggplot(ab.test.imp.summary,aes(x=log_date,y=cvr,col=test_case,lty=test_case,shape=test_case))+geom_line(lwd=1)+geom_point(size=4)+geom_line(aes(y=cvr.avg,col=test_case))+ scale_y_continuous(label=percent,limits=limits)

Source: 비지니스 활용 사례로 배우는 데이터 분석: R (Data Analysis Learned Through Business Use Cases: R), by Ryuji Sakamaki and Yohei Sato


Posted by 마르띤

Chapter 4 - Cross-Tabulation Analysis: Which Customer Segments Are Leaving?

Python, R Analysis and Programming / Data Analysis Learned Through Business Use Cases: R 2017. 2. 10. 19:24

1. Load the CSV files

> setwd("C:/파일/temp/r_temp")

> dau<-read.csv('section4-dau.csv',header=T,stringsAsFactors = F)

> head(dau)

log_date app_name user_id

1 2013-08-01 game-01 33754

2 2013-08-01 game-01 28598

3 2013-08-01 game-01 30306

4 2013-08-01 game-01 117

5 2013-08-01 game-01 6605

6 2013-08-01 game-01 346

> user.info<-read.csv('section4-user_info.csv',header=T,stringsAsFactors = F)

> head(user.info)

install_date app_name user_id gender generation device_type

1 2013-04-15 game-01 1 M 40 iOS

2 2013-04-15 game-01 2 M 10 Android

3 2013-04-15 game-01 3 F 40 iOS

4 2013-04-15 game-01 4 M 10 Android

5 2013-04-15 game-01 5 M 40 iOS

6 2013-04-15 game-01 6 M 40 iOS

2. Merge user.info into the DAU data

> dau.user.info<-merge(dau,user.info,by=c('user_id','app_name'))

> head(dau.user.info)

user_id app_name log_date install_date gender generation device_type

1 1 game-01 2013-09-06 2013-04-15 M 40 iOS

2 1 game-01 2013-09-05 2013-04-15 M 40 iOS

3 1 game-01 2013-09-28 2013-04-15 M 40 iOS

4 1 game-01 2013-09-12 2013-04-15 M 40 iOS

5 1 game-01 2013-09-11 2013-04-15 M 40 iOS

6 1 game-01 2013-09-08 2013-04-15 M 40 iOS

3. Add a month column

> dau.user.info$month<-substr(dau.user.info$log_date,1,7)

> head(dau.user.info)

user_id app_name log_date install_date gender generation device_type month

1 1 game-01 2013-09-06 2013-04-15 M 40 iOS 2013-09

2 1 game-01 2013-09-05 2013-04-15 M 40 iOS 2013-09

3 1 game-01 2013-09-28 2013-04-15 M 40 iOS 2013-09

4 1 game-01 2013-09-12 2013-04-15 M 40 iOS 2013-09

5 1 game-01 2013-09-11 2013-04-15 M 40 iOS 2013-09

6 1 game-01 2013-09-08 2013-04-15 M 40 iOS 2013-09

> colnames(dau.user.info)[8]<-'log_month'

> head(dau.user.info)

user_id app_name log_date install_date gender generation device_type log_month

1 1 game-01 2013-09-06 2013-04-15 M 40 iOS 2013-09

2 1 game-01 2013-09-05 2013-04-15 M 40 iOS 2013-09

3 1 game-01 2013-09-28 2013-04-15 M 40 iOS 2013-09

4 1 game-01 2013-09-12 2013-04-15 M 40 iOS 2013-09

5 1 game-01 2013-09-11 2013-04-15 M 40 iOS 2013-09

6 1 game-01 2013-09-08 2013-04-15 M 40 iOS 2013-09
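The month column can also be created under its final name in one step, which avoids renaming by column position. A small sketch with hypothetical dates; `format()` on a Date is shown only as an alternative to `substr()`.

```r
# Hypothetical log dates stored as character strings
log_date <- c("2013-08-31", "2013-09-01", "2013-09-28")

# Take the first 7 characters ("YYYY-MM"), as the post does
by_substr <- substr(log_date, 1, 7)

# Alternative: parse to Date first, then format as year-month
by_format <- format(as.Date(log_date), "%Y-%m")

print(by_substr)
print(identical(by_substr, by_format))
```

Assigning directly, for example `dau.user.info$log_month <- substr(dau.user.info$log_date, 1, 7)`, would make the later `colnames(...)[8]` rename unnecessary.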

4. Aggregate by gender

> table(dau.user.info[,c('log_month','gender')])

gender

log_month F M

2013-08 47343 46842

2013-09 38027 38148

> table(dau.user.info[,c('gender','log_month')])

log_month

gender 2013-08 2013-09

F 47343 38027

M 46842 38148

-> The overall counts fell, but the male/female composition barely changed, so gender does not appear to be strongly associated with the drop in usage.
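The claim about the composition can be checked by converting the counts into row-wise shares. The sketch below retypes the counts from the table above into a matrix; `prop.table()` with `margin = 1` divides each row by its total.

```r
# Counts copied from the log_month x gender table above
gender_tab <- matrix(c(47343, 46842,
                       38027, 38148),
                     nrow = 2, byrow = TRUE,
                     dimnames = list(log_month = c("2013-08", "2013-09"),
                                     gender    = c("F", "M")))

# Share of each gender within each month (every row sums to 1)
shares <- prop.table(gender_tab, margin = 1)
print(round(shares, 3))
```

Both months come out close to a 50/50 split, which is what the conclusion above relies on. On the real data frame the same call can wrap the `table(...)` result directly.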

5. Aggregate by age group

> table(dau.user.info[,c('log_month','generation')])

generation

log_month 10 20 30 40 50

2013-08 18785 33671 28072 8828 4829

2013-09 15391 27229 22226 7494 3835

> table(dau.user.info[,c('generation','log_month')])

log_month

generation 2013-08 2013-09

10 18785 15391

20 33671 27229

30 28072 22226

40 8828 7494

50 4829 3835

-> The overall counts fell, but the age-group composition barely changed, so age does not appear to be strongly associated with the drop in usage.

6. Segment analysis (gender and age group combined)

> library(reshape2)

> dcast(dau.user.info,log_month~gender+generation,value.var='user_id',length)

log_month F_10 F_20 F_30 F_40 F_50 M_10 M_20 M_30 M_40 M_50

1 2013-08 9091 17181 14217 4597 2257 9694 16490 13855 4231 2572

2 2013-09 7316 13616 11458 3856 1781 8075 13613 10768 3638 2054

-> No segment responsible for the overall drop in usage could be identified
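A comparable gender-by-age-group count table can be built without reshape2 by pasting the two attributes into one segment label and calling `table()`. This is only a sketch on a hypothetical five-row data frame `users`; it is not the post's `dcast()` call.

```r
# Hypothetical user-day rows (values are illustrative only)
users <- data.frame(
  log_month  = c("2013-08", "2013-08", "2013-08", "2013-09", "2013-09"),
  gender     = c("F", "F", "M", "F", "M"),
  generation = c(10, 10, 20, 10, 20))

# Combine gender and age group into one label such as "F_10"
segment <- paste(users$gender, users$generation, sep = "_")

# Count rows per month and segment
seg_tab <- table(log_month = users$log_month, segment = segment)
print(seg_tab)
```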

7. Segment analysis (by device type)

> head(dau.user.info)

user_id app_name log_date install_date gender generation device_type log_month

1 1 game-01 2013-09-06 2013-04-15 M 40 iOS 2013-09

2 1 game-01 2013-09-05 2013-04-15 M 40 iOS 2013-09

3 1 game-01 2013-09-28 2013-04-15 M 40 iOS 2013-09

4 1 game-01 2013-09-12 2013-04-15 M 40 iOS 2013-09

5 1 game-01 2013-09-11 2013-04-15 M 40 iOS 2013-09

6 1 game-01 2013-09-08 2013-04-15 M 40 iOS 2013-09

> table(dau.user.info[,c('log_month','device_type')])

device_type

log_month Android iOS

2013-08 46974 47211

2013-09 29647 46528

-> iOS barely changed (47,211 -> 46,528), whereas Android dropped sharply (46,974 -> 29,647).
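The size of the change per device type can be quantified as a month-over-month relative change. The sketch below retypes the counts from the table above; the variable names are illustrative.

```r
# Counts copied from the log_month x device_type table above
device_tab <- matrix(c(46974, 47211,
                       29647, 46528),
                     nrow = 2, byrow = TRUE,
                     dimnames = list(log_month   = c("2013-08", "2013-09"),
                                     device_type = c("Android", "iOS")))

# Relative change from August to September for each device type
change <- (device_tab["2013-09", ] - device_tab["2013-08", ]) /
          device_tab["2013-08", ]
print(round(100 * change, 1))   # in percent
```

Android falls by roughly 37%, while iOS changes by only about 1%.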

8. Visualize the segment analysis results

# Compute the number of users by date and device type

> library(plyr)

> dau.user.info.device.summary<-ddply(dau.user.info,

.(log_date,device_type),

summarize,

dau=length(user_id))

> head(dau.user.info.device.summary)

log_date device_type dau

1 2013-08-01 Android 1784

2 2013-08-01 iOS 1805

3 2013-08-02 Android 1386

4 2013-08-02 iOS 1451

5 2013-08-03 Android 1295

6 2013-08-03 iOS 1351

# Convert log_date to the Date type

> str(dau.user.info.device.summary)

'data.frame': 122 obs. of 3 variables:

$ log_date : chr "2013-08-01" "2013-08-01" "2013-08-02" "2013-08-02" ...

$ device_type: chr "Android" "iOS" "Android" "iOS" ...

$ dau : int 1784 1805 1386 1451 1295 1351 1283 1314 2002 2038 ...

> dau.user.info.device.summary$log_date<-as.Date(dau.user.info.device.summary$log_date)

> str(dau.user.info.device.summary)

'data.frame': 122 obs. of 3 variables:

$ log_date : Date, format: "2013-08-01" "2013-08-01" "2013-08-02" ...

$ device_type: chr "Android" "iOS" "Android" "iOS" ...

$ dau : int 1784 1805 1386 1451 1295 1351 1283 1314 2002 2038 ...

9. Draw the time-series trend plot

> library(ggplot2)

> library(scales)

> limits<-c(0,max(dau.user.info.device.summary$dau))

> ggplot(dau.user.info.device.summary,aes(x=log_date,y=dau,col=device_type,lty=device_type,shape=device_type))+ geom_line(lwd=1)+geom_point(size=4)+scale_y_continuous(label=comma,limits = limits)

-> Android users decline sharply starting in September, so we can conclude that the Android side of the system needs to be inspected.

# Understanding the structure of a ggplot graph

> library(ggplot2)

> library(scales)

> head(dau.user.info.device.summary)

log_date device_type dau

1 2013-08-01 Android 1784

2 2013-08-01 iOS 1805

3 2013-08-02 Android 1386

4 2013-08-02 iOS 1451

5 2013-08-03 Android 1295

6 2013-08-03 iOS 1351

> ggplot(dau.user.info.device.summary)

> ggplot(dau.user.info.device.summary,aes(x=log_date,y=dau,col=device_type,lty=device_type,shape=device_type))

> ggplot(dau.user.info.device.summary,aes(x=log_date,y=dau,col=device_type,lty=device_type,shape=device_type))+geom_line(lwd=1)

> ggplot(dau.user.info.device.summary,aes(x=log_date,y=dau,col=device_type,lty=device_type,shape=device_type))+geom_line(lwd=1)+geom_point(size=4)

> ggplot(dau.user.info.device.summary,aes(x=log_date,y=dau,col=device_type,lty=device_type,shape=device_type))+geom_line(lwd=1)+geom_point(size=4)+scale_y_continuous(label=comma)

> limits<-c(0,max(dau.user.info.device.summary$dau))

> max(dau.user.info.device.summary$dau)

[1] 2133

> ggplot(dau.user.info.device.summary,aes(x=log_date,y=dau,col=device_type,lty=device_type,shape=device_type))+geom_line(lwd=1)+geom_point(size=4)+scale_y_continuous(label=comma,limits = limits)

Source: 비지니스 활용 사례로 배우는 데이터 분석: R (Data Analysis Learned Through Business Use Cases: R), by Ryuji Sakamaki and Yohei Sato


Posted by 마르띤


데이터마이너를 꿈꾸며
