Posts in the 'Python, R Analysis and Programming' category (Page 3)

Lesson2: R Basic

Python, R Analysis and Programming/Data Analysis with R 2017. 7. 11. 10:14

# Viewing only the rows you want

> data(mtcars)

> mean(mtcars$mpg)

[1] 20.09062

> subset(mtcars, mtcars$mpg >= 30 | mtcars$hp < 60)

mpg cyl disp hp drat wt qsec vs am gear carb

Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1

Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2

Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1

Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2

> mtcars[mtcars$mpg >= 30 | mtcars$hp < 60, ] # the trailing comma is required to return all columns

mpg cyl disp hp drat wt qsec vs am gear carb

Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1

Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2

Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1

Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
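The two calls above return the same rows. A minimal sketch of the comparison follows; the `toy` data frame is a made-up example (not from the course) that shows the one place the two approaches differ, namely how they treat NA in the condition.

```r
data(mtcars)

# Same logical condition used both ways
cond <- mtcars$mpg >= 30 | mtcars$hp < 60
by_subset  <- subset(mtcars, cond)
by_bracket <- mtcars[cond, ]
print(identical(by_subset, by_bracket))

# Hypothetical toy data with a missing value
toy <- data.frame(x = c(1, NA, 3))
n_subset  <- nrow(subset(toy, x > 0))               # subset() drops rows where the condition is NA
n_bracket <- nrow(toy[toy$x > 0, , drop = FALSE])   # bracket indexing keeps them as NA rows
print(c(subset = n_subset, bracket = n_bracket))
```

With real data that contains missing values, this difference is worth checking before counting rows.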

# Looking at the reddit data and plotting a categorical variable

> reddit <- read.csv('reddit.csv')

> library(ggplot2)

> qplot(data = reddit, x = age.range)

# Setting the order: setting levels of ordered factors (solution)

> reddit$age.range = ordered(reddit$age.range,levels=c("Under 18","18-24", "25-34","35-44","45-54","55-64","65 or Above"))

> qplot(data = reddit, x = age.range)

#alternative solution

> reddit$age.range = factor(reddit$age.range, levels=c("Under 18","18-24", "25-34","35-44","45-54","55-64","65 or Above"),ordered=T)

> qplot(data = reddit, x = age.range)

# practice

> nlevels(reddit$income.range)

[1] 8

> levels(reddit$income.range)

[1] "$100,000 - $149,999" "$150,000 or more" "$20,000 - $29,999" "$30,000 - $39,999"

[5] "$40,000 - $49,999" "$50,000 - $69,999" "$70,000 - $99,999" "Under $20,000"

> qplot(data = reddit, x = income.range)

# The following approach also works

> reddit$income.range = ordered(reddit$income.range, levels=c("Under $20,000" , "$20,000 - $29,999" , "$30,000 - $39,999", "$40,000 - $49,999","$50,000 - $69,999" , "$70,000 - $99,999", "$100,000 - $149,999" , "$150,000 or more"))

> qplot(data = reddit, x = income.range)

# Another example

> tShirts <- factor(c('medium', 'small', 'large', 'medium', 'large', 'large'), levels = c('medium','small','large'))

> tShirts

[1] medium small large medium large large

Levels: medium small large

> qplot(x = tShirts)

> tShirts <- ordered(tShirts, levels = c('small', 'medium', 'large'))

> tShirts

[1] medium small large medium large large

Levels: small < medium < large

> qplot(x = tShirts)
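The level order is what controls both the bar order in a plot and what comparisons mean. The sketch below uses only base R, with the same t-shirt sizes as above, to show the default (alphabetical) levels next to explicitly ordered ones.

```r
sizes <- c('medium', 'small', 'large', 'medium', 'large', 'large')

# Default factor: levels are sorted alphabetically
unordered <- factor(sizes)
print(levels(unordered))

# Ordered factor: levels follow the order we give
ordered_sizes <- factor(sizes, levels = c('small', 'medium', 'large'), ordered = TRUE)
print(table(ordered_sizes))

# Comparisons are only meaningful on ordered factors
at_least_medium <- ordered_sizes >= 'medium'
print(sum(at_least_medium))
```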

References

https://cn.udacity.com/course/data-analysis-with-r--ud651

https://cn.udacity.com/course/data-wrangling-with-mongodb--ud032

http://vita.had.co.nz/papers/tidy-data.pdf

http://courses.had.co.nz.s3-website-us-east-1.amazonaws.com/12-rice-bdsi/slides/07-tidy-data.pdf

http://www.computerworld.com/article/2497143/business-intelligence/business-intelligence-beginner-s-guide-to-r-introduction.html

http://www.statmethods.net/index.html

https://www.r-bloggers.com/

http://www.cookbook-r.com/

http://blog.revolutionanalytics.com/2013/08/foodborne-chicago.html

http://blog.yhat.com/posts/roc-curves.html

https://github.com/corynissen/foodborne_classifier


Posted by 마르띤


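Chapter 6 - Multiple Regression: Which Combination of Ads Attracts the Most Customers?

Python, R Analysis and Programming/비지니스 활용 사례로 배우는 데이터 분석 : R 2017. 2. 15. 19:30

Problem: The number of new users acquired through mass-media advertising varies from month to month, because the split between TV and magazine advertising is not constant. We therefore examine how TV and magazine ad spend relate to the number of new users.

Approach: Run a multiple regression on the TV ad spend, magazine ad spend, and new-user data.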

R

1. Loading the data

> ad.data <- read.csv('ad_result.csv',header=T,stringsAsFactors = F)

> ad.data

month tvcm magazine install

1 2013-01 6358 5955 53948

2 2013-02 8176 6069 57300

3 2013-03 6853 5862 52057

4 2013-04 5271 5247 44044

5 2013-05 6473 6365 54063

6 2013-06 7682 6555 58097

7 2013-07 5666 5546 47407

8 2013-08 6659 6066 53333

9 2013-09 6066 5646 49918

10 2013-10 10090 6545 59963

2. Scatter plot of TV ad spend versus new users

> library(ggplot2)
> library(scales)

> ggplot(ad.data,aes(x=tvcm,y=install))

> ggplot(ad.data,aes(x=tvcm,y=install))+geom_point()

> ggplot(ad.data,aes(x=tvcm,y=install))+geom_point()+xlab('TV 광고비')+ylab('신규유저수')

> ggplot(ad.data,aes(x=tvcm,y=install))+geom_point()+xlab('TV 광고비')+ylab('신규유저수')+scale_x_continuous(label=comma)+scale_y_continuous(label=comma)

3. Scatter plot of magazine ad spend versus new users

> ggplot(ad.data,aes(x=magazine,y=install))+geom_point()+xlab('잡지 광고비')+ylab('신규유저수')+scale_x_continuous(label=comma)+scale_y_continuous(label=comma)

4. Running the regression

> fit <-lm(install~.,data=ad.data[,c('install','tvcm','magazine')])

> fit

Call:

lm(formula = install ~ ., data = ad.data[, c("install", "tvcm",

"magazine")])

Coefficients:

(Intercept) tvcm magazine

188.174 1.361 7.250

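-> From the output above, we can build the following model:

New users = 1.361 × TV ad spend + 7.25 × magazine ad spend + 188.174

Under this model, about 188 new users arrive per month even with no advertising (although summary(fit) below shows that this intercept is not statistically significant). Each additional 10,000 won spent on TV ads brings in roughly 1.3609 new users, and each additional 10,000 won spent on magazine ads brings in roughly 7.2498.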
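To make the model equation concrete, here is a small sketch that refits the model on the ten rows printed in step 1 and applies it to a budget. The budget values in `plan` are made up for illustration; they are not from the book.

```r
# Data copied from the ad_result.csv listing in step 1
ad.data <- data.frame(
  tvcm     = c(6358, 8176, 6853, 5271, 6473, 7682, 5666, 6659, 6066, 10090),
  magazine = c(5955, 6069, 5862, 5247, 6365, 6555, 5546, 6066, 5646, 6545),
  install  = c(53948, 57300, 52057, 44044, 54063, 58097, 47407, 53333, 49918, 59963)
)
fit <- lm(install ~ tvcm + magazine, data = ad.data)

# Hypothetical budget plan (illustrative values only)
plan <- data.frame(tvcm = 7000, magazine = 6000)

# predict() and the hand-written equation should agree
by_predict <- unname(predict(fit, newdata = plan))
by_hand    <- sum(coef(fit) * c(1, plan$tvcm, plan$magazine))
print(c(by_predict = by_predict, by_hand = by_hand))
```

A prediction like this is only reasonable for budgets close to the observed range, since the model was fitted on ten months of data.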

5. Interpreting the regression results

> summary(fit)

Call:

lm(formula = install ~ ., data = ad.data[, c("install", "tvcm", "magazine")])

Residuals:

Min 1Q Median 3Q Max

-1406.87 -984.49 -12.11 432.82 1985.84

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 188.1743 7719.1308 0.024 0.98123

tvcm 1.3609 0.5174 2.630 0.03390 *

magazine 7.2498 1.6926 4.283 0.00364 **

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1387 on 7 degrees of freedom

Multiple R-squared: 0.9379, Adjusted R-squared: 0.9202

F-statistic: 52.86 on 2 and 7 DF, p-value: 5.967e-05

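Residuals: The distribution of the residuals (the differences between observed and predicted values), summarized by quartiles, which lets us check whether they are skewed. The absolute value of 1Q is larger than that of 3Q, so the residuals appear slightly skewed.

Coefficients: A summary of the intercept and the slopes.

Adjusted R-squared: 0.9202, meaning the model explains about 92% of the variance in the number of new users after adjusting for the number of predictors.

Alternatively, the same regression model can be specified as follows.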

> fit2<-lm(install~tvcm+magazine,data=ad.data)

> summary(fit2)

Call:

lm(formula = install ~ tvcm + magazine, data = ad.data)

Residuals:

Min 1Q Median 3Q Max

-1406.87 -984.49 -12.11 432.82 1985.84

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 188.1743 7719.1308 0.024 0.98123

tvcm 1.3609 0.5174 2.630 0.03390 *

magazine 7.2498 1.6926 4.283 0.00364 **

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1387 on 7 degrees of freedom

Multiple R-squared: 0.9379, Adjusted R-squared: 0.9202

F-statistic: 52.86 on 2 and 7 DF, p-value: 5.967e-05

Source: 비지니스 활용 사례로 배우는 데이터 분석: R (written by 사카마키 류지 and 사토 요헤이)


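Chapter 5 - A/B Testing: Which Banner Ad Gets the Better Response?

Python, R Analysis and Programming/비지니스 활용 사례로 배우는 데이터 분석 : R 2017. 2. 13. 19:51

Problem: Our product's purchase rate was found to be lower than that of other apps, and the cause was traced to a low click-through rate on the banner ad.

Approach: Because no data existed for the analysis, an A/B test was set up to emit analysis logs and collect data. After running the test, we compute the click-through rates of banner ads A and B and run a χ² (chi-squared) test.

Analysis in R

1. Loading the data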

> setwd("C:/파일/temp/r_temp")

> ab.test.imp <-read.csv('section5-ab_test_imp.csv',header=T,stringsAsFactors = F)

> head(ab.test.imp,2)

log_date app_name test_name test_case user_id transaction_id

1 2013-10-01 game-01 sales_test B 36703 25622

2 2013-10-01 game-01 sales_test A 44339 25623

> ab.test.goal <- read.csv('section5-ab_test_goal.csv',header=T,stringsAsFactors = F)

> head(ab.test.goal,2)

log_date app_name test_name test_case user_id transaction_id

1 2013-10-01 game-01 sales_test B 15021 25638

2 2013-10-01 game-01 sales_test B 351 25704

2. Merging ab.test.goal into ab.test.imp

> ab.test.imp <- merge(ab.test.imp, ab.test.goal, by='transaction_id', all.x=T, suffixes=c('','.g'))

> head(ab.test.imp,2)

-> all.x - logical; if TRUE, then extra rows will be added to the output, one for each row in x that has no matching row in y. These rows will have NAs in those columns that are usually filled with values from y. The default is FALSE, so that only rows with data from both x and y are included in the output.

3. Creating a flag for whether the banner was clicked

> ab.test.imp$is.goal <-ifelse(is.na(ab.test.imp$user_id.g),0,1)

-> If user_id.g is missing (NA), the flag is 0; otherwise it is 1.
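Steps 2 and 3 together are a left join followed by an NA check. The sketch below reproduces the idea on tiny made-up tables (`imp` and `goal` are hypothetical stand-ins for the impression and click logs) so the effect of all.x and suffixes is easy to see.

```r
# Hypothetical impression log: every time a banner was shown
imp <- data.frame(
  transaction_id = c(1, 2, 3, 4),
  test_case      = c('A', 'B', 'A', 'B'),
  user_id        = c(10, 11, 12, 13)
)
# Hypothetical click log: only the impressions that were clicked
goal <- data.frame(
  transaction_id = c(2, 4),
  user_id        = c(11, 13)
)

# Left join: keep every impression, fill unmatched click columns with NA
merged <- merge(imp, goal, by = 'transaction_id', all.x = TRUE, suffixes = c('', '.g'))

# 1 if a matching click exists, 0 otherwise
merged$is.goal <- ifelse(is.na(merged$user_id.g), 0, 1)
print(merged)
```

The empty first suffix keeps the impression-side column name as user_id, so only the click-side column gets the .g suffix.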

4. Aggregating the click-through rate

> library(plyr)

> ddply(ab.test.imp,

.(test_case),

summarize,

cvr=sum(is.goal)/length(user_id))

test_case cvr

1 A 0.08025559

2 B 0.11546015

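-> cvr = number of users who clicked / number of users shown the banner ad

Banner B's click-through rate (about 11.5%) is higher than banner A's (about 8.0%). Whether this difference is statistically significant is checked with the chi-squared test in the next step.

5. Chi-squared test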

> chisq.test(ab.test.imp$test_case,ab.test.imp$is.goal)

Pearson's Chi-squared test with Yates' continuity correction

data: ab.test.imp$test_case and ab.test.imp$is.goal

X-squared = 308.38, df = 1, p-value < 2.2e-16

-> The p-value is very small (well below 0.05), so the difference between A and B is unlikely to be due to chance.
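chisq.test also accepts a 2x2 table of counts directly, which can be handy when only aggregated numbers are available. The counts below are made up to roughly mirror the click-through rates above; they are not the real log totals.

```r
# Hypothetical counts: rows are banners, columns are not-clicked / clicked
clicks <- matrix(
  c(9200,  800,
    8850, 1150),
  nrow = 2, byrow = TRUE,
  dimnames = list(test_case = c('A', 'B'), is.goal = c('0', '1'))
)

# Click-through rate per banner
cvr <- clicks[, '1'] / rowSums(clicks)
print(cvr)

# Pearson chi-squared test (Yates' correction is applied by default for 2x2 tables)
res <- chisq.test(clicks)
print(res$p.value)
```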

6. Computing the click-through rate by date and test case

> ab.test.imp.summary<- ddply(ab.test.imp,

.(log_date,test_case),

summarize,

imp=length(user_id),

cv=sum(is.goal),

cvr=sum(is.goal)/length(user_id))

> head(ab.test.imp.summary)

log_date test_case imp cv cvr

1 2013-10-01 A 1358 98 0.07216495

2 2013-10-01 B 1391 176 0.12652768

3 2013-10-02 A 1370 88 0.06423358

4 2013-10-02 B 1333 212 0.15903976

5 2013-10-03 A 1213 170 0.14014839

6 2013-10-03 B 1233 185 0.15004055

-> imp = number of times the banner ad was shown

cv = number of users who clicked

cvr = click-through rate
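The same imp / cv / cvr summary can be produced with base R if plyr is not available. This is a sketch on a made-up six-row log (`logs` is hypothetical), using aggregate in place of ddply.

```r
# Hypothetical flagged impression log
logs <- data.frame(
  log_date  = c('2013-10-01', '2013-10-01', '2013-10-01', '2013-10-01', '2013-10-02', '2013-10-02'),
  test_case = c('A', 'A', 'B', 'B', 'A', 'B'),
  is.goal   = c(0, 1, 1, 1, 0, 1)
)

# Impressions and clicks per (date, test case) group
imp <- aggregate(is.goal ~ log_date + test_case, data = logs, FUN = length)
cv  <- aggregate(is.goal ~ log_date + test_case, data = logs, FUN = sum)

summary_tbl <- data.frame(
  imp[, c('log_date', 'test_case')],
  imp = imp$is.goal,
  cv  = cv$is.goal
)
summary_tbl$cvr <- summary_tbl$cv / summary_tbl$imp
print(summary_tbl)

# Overall click-through rate per test case
overall <- tapply(logs$is.goal, logs$test_case, mean)
print(overall)
```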

7. Computing the overall click-through rate per test case

> ab.test.imp.summary<- ddply(ab.test.imp.summary,

.(test_case),

transform,

cvr.avg=sum(cv/sum(imp)))

> head(ab.test.imp.summary)

log_date test_case imp cv cvr cvr.avg

1 2013-10-01 A 1358 98 0.07216495 0.08025559

2 2013-10-02 A 1370 88 0.06423358 0.08025559

3 2013-10-03 A 1213 170 0.14014839 0.08025559

4 2013-10-04 A 1521 89 0.05851414 0.08025559

5 2013-10-05 A 1587 56 0.03528670 0.08025559

6 2013-10-06 A 1219 120 0.09844135 0.08025559

-> transform: adds a column to the existing summary table

cvr.avg: the overall click-through rate for each test_case

8. Time-series plot of click-through rate by test case

> str(ab.test.imp.summary)

'data.frame': 62 obs. of 6 variables:

$ log_date : chr "2013-10-01" "2013-10-02" "2013-10-03" "2013-10-04" ...

$ test_case: chr "A" "A" "A" "A" ...

$ imp : int 1358 1370 1213 1521 1587 1219 1595 1401 1648 1364 ...

$ cv : num 98 88 170 89 56 120 194 99 199 114 ...

$ cvr : num 0.0722 0.0642 0.1401 0.0585 0.0353 ...

$ cvr.avg : num 0.0803 0.0803 0.0803 0.0803 0.0803 ...

> ab.test.imp.summary$log_date <- as.Date(ab.test.imp.summary$log_date)

> library(ggplot2)

> library(scales)

> limits<-c(0,max(ab.test.imp.summary$cvr))

> ggplot(ab.test.imp.summary,aes(x=log_date,y=cvr,

col=test_case,lty=test_case,shape=test_case))+

geom_line(lwd=1)+geom_point(size=4)+geom_line(aes(y=cvr.avg,col=test_case))+

scale_y_continuous(label=percent,limits=limits)

-> The plot shows that banner B's click-through rate is generally higher than banner A's.

## Understanding how the ggplot graph is built up

> library(ggplot2)

> library(scales)

> ggplot(ab.test.imp.summary,aes(x=log_date,y=cvr,col=test_case,lty=test_case,shape=test_case))

> ggplot(ab.test.imp.summary,aes(x=log_date,y=cvr,col=test_case,lty=test_case,shape=test_case))+geom_line(lwd=1)

> ggplot(ab.test.imp.summary,aes(x=log_date,y=cvr,col=test_case,lty=test_case,shape=test_case))+geom_line(lwd=1)+geom_point(size=4)

> ggplot(ab.test.imp.summary,aes(x=log_date,y=cvr,col=test_case,lty=test_case,shape=test_case))+geom_line(lwd=1)+geom_point(size=4)+geom_line(aes(y=cvr.avg,col=test_case))

> limits<-c(0,max(ab.test.imp.summary$cvr))

> max(ab.test.imp.summary$cvr)

[1] 0.1712159

> ggplot(ab.test.imp.summary,aes(x=log_date,y=cvr,col=test_case,lty=test_case,shape=test_case))+geom_line(lwd=1)+geom_point(size=4)+geom_line(aes(y=cvr.avg,col=test_case))+ scale_y_continuous(label=percent,limits=limits)

Source: 비지니스 활용 사례로 배우는 데이터 분석: R (written by 사카마키 류지 and 사토 요헤이)


Chapter 4 - Cross-Tabulation: Which Customer Segments Are Leaving?

Python, R Analysis and Programming/비지니스 활용 사례로 배우는 데이터 분석 : R 2017. 2. 10. 19:24

1. Loading the csv files

> setwd("C:/파일/temp/r_temp")

> dau<-read.csv('section4-dau.csv',header=T,stringsAsFactors = F)

> head(dau)

log_date app_name user_id

1 2013-08-01 game-01 33754

2 2013-08-01 game-01 28598

3 2013-08-01 game-01 30306

4 2013-08-01 game-01 117

5 2013-08-01 game-01 6605

6 2013-08-01 game-01 346

> user.info<-read.csv('section4-user_info.csv',header=T,stringsAsFactors = F)

> head(user.info)

install_date app_name user_id gender generation device_type

1 2013-04-15 game-01 1 M 40 iOS

2 2013-04-15 game-01 2 M 10 Android

3 2013-04-15 game-01 3 F 40 iOS

4 2013-04-15 game-01 4 M 10 Android

5 2013-04-15 game-01 5 M 40 iOS

6 2013-04-15 game-01 6 M 40 iOS

2. Merging user.info into dau

> dau.user.info<-merge(dau,user.info,by=c('user_id','app_name'))

> head(dau.user.info)

user_id app_name log_date install_date gender generation device_type

1 1 game-01 2013-09-06 2013-04-15 M 40 iOS

2 1 game-01 2013-09-05 2013-04-15 M 40 iOS

3 1 game-01 2013-09-28 2013-04-15 M 40 iOS

4 1 game-01 2013-09-12 2013-04-15 M 40 iOS

5 1 game-01 2013-09-11 2013-04-15 M 40 iOS

6 1 game-01 2013-09-08 2013-04-15 M 40 iOS

3. Adding a month column

> dau.user.info$month<-substr(dau.user.info$log_date,1,7)

> head(dau.user.info)

user_id app_name log_date install_date gender generation device_type month

1 1 game-01 2013-09-06 2013-04-15 M 40 iOS 2013-09

2 1 game-01 2013-09-05 2013-04-15 M 40 iOS 2013-09

3 1 game-01 2013-09-28 2013-04-15 M 40 iOS 2013-09

4 1 game-01 2013-09-12 2013-04-15 M 40 iOS 2013-09

5 1 game-01 2013-09-11 2013-04-15 M 40 iOS 2013-09

6 1 game-01 2013-09-08 2013-04-15 M 40 iOS 2013-09

> colnames(dau.user.info)[8]<-'log_month'

> head(dau.user.info)

user_id app_name log_date install_date gender generation device_type log_month

1 1 game-01 2013-09-06 2013-04-15 M 40 iOS 2013-09

2 1 game-01 2013-09-05 2013-04-15 M 40 iOS 2013-09

3 1 game-01 2013-09-28 2013-04-15 M 40 iOS 2013-09

4 1 game-01 2013-09-12 2013-04-15 M 40 iOS 2013-09

5 1 game-01 2013-09-11 2013-04-15 M 40 iOS 2013-09

6 1 game-01 2013-09-08 2013-04-15 M 40 iOS 2013-09

4. Aggregating by gender

> table(dau.user.info[,c('log_month','gender')])

gender

log_month F M

2013-08 47343 46842

2013-09 38027 38148

> table(dau.user.info[,c('gender','log_month')])

log_month

gender 2013-08 2013-09

F 47343 38027

M 46842 38148

-> Overall numbers fell, but the male/female split barely changed, so gender does not appear to be strongly related to the drop in usage.
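The "split barely changed" claim is easier to check with row proportions than with raw counts. The sketch below re-enters the counts from the table above by hand and applies prop.table.

```r
# Counts copied from the gender table above
gender_by_month <- matrix(
  c(47343, 46842,
    38027, 38148),
  nrow = 2, byrow = TRUE,
  dimnames = list(log_month = c('2013-08', '2013-09'), gender = c('F', 'M'))
)

# Share of each gender within each month (rows sum to 1)
shares <- prop.table(gender_by_month, margin = 1)
print(round(shares, 3))
```

With the real data, the same result comes from wrapping the table call in step 4 with prop.table(..., margin = 1).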

5. Aggregating by age group

> table(dau.user.info[,c('log_month','generation')])

generation

log_month 10 20 30 40 50

2013-08 18785 33671 28072 8828 4829

2013-09 15391 27229 22226 7494 3835

> table(dau.user.info[,c('generation','log_month')])

log_month

generation 2013-08 2013-09

10 18785 15391

20 33671 27229

30 28072 22226

40 8828 7494

50 4829 3835

-> Overall numbers fell, but the age-group mix barely changed, so age does not appear to be strongly related to the drop in usage.

6. Segment analysis (cross-tabulating gender and age group)

> library(reshape2)

> dcast(dau.user.info,log_month~gender+generation,value.var='user_id',length)

log_month F_10 F_20 F_30 F_40 F_50 M_10 M_20 M_30 M_40 M_50

1 2013-08 9091 17181 14217 4597 2257 9694 16490 13855 4231 2572

2 2013-09 7316 13616 11458 3856 1781 8075 13613 10768 3638 2054

-> No segment that drove the overall decline in usage could be identified.

7. Segment analysis (aggregating by device type)

> head(dau.user.info)

user_id app_name log_date install_date gender generation device_type log_month

1 1 game-01 2013-09-06 2013-04-15 M 40 iOS 2013-09

2 1 game-01 2013-09-05 2013-04-15 M 40 iOS 2013-09

3 1 game-01 2013-09-28 2013-04-15 M 40 iOS 2013-09

4 1 game-01 2013-09-12 2013-04-15 M 40 iOS 2013-09

5 1 game-01 2013-09-11 2013-04-15 M 40 iOS 2013-09

6 1 game-01 2013-09-08 2013-04-15 M 40 iOS 2013-09

> table(dau.user.info[,c('log_month','device_type')])

device_type

log_month Android iOS

2013-08 46974 47211

2013-09 29647 46528

-> iOS usage barely changed, whereas Android usage dropped sharply.
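To put a number on "barely changed" versus "dropped sharply", the month-over-month change can be computed from the same table. The counts are re-entered by hand from the output above.

```r
# Counts copied from the device_type table above
device_by_month <- matrix(
  c(46974, 47211,
    29647, 46528),
  nrow = 2, byrow = TRUE,
  dimnames = list(log_month = c('2013-08', '2013-09'), device_type = c('Android', 'iOS'))
)

# Relative change from August to September, per device type
change <- (device_by_month['2013-09', ] - device_by_month['2013-08', ]) /
  device_by_month['2013-08', ]
print(round(100 * change, 1))  # percent change
```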

8. Visualizing the segment analysis results

# Computing the number of users by date and device type

> library(plyr)

> dau.user.info.device.summary<-ddply(dau.user.info,

.(log_date,device_type),

summarize,

dau=length(user_id))

> head(dau.user.info.device.summary)

log_date device_type dau

1 2013-08-01 Android 1784

2 2013-08-01 iOS 1805

3 2013-08-02 Android 1386

4 2013-08-02 iOS 1451

5 2013-08-03 Android 1295

6 2013-08-03 iOS 1351

# Converting log_date to the Date type

> str(dau.user.info.device.summary)

'data.frame': 122 obs. of 3 variables:

$ log_date : chr "2013-08-01" "2013-08-01" "2013-08-02" "2013-08-02" ...

$ device_type: chr "Android" "iOS" "Android" "iOS" ...

$ dau : int 1784 1805 1386 1451 1295 1351 1283 1314 2002 2038 ...

> dau.user.info.device.summary$log_date<-as.Date(dau.user.info.device.summary$log_date)

> str(dau.user.info.device.summary)

'data.frame': 122 obs. of 3 variables:

$ log_date : Date, format: "2013-08-01" "2013-08-01" "2013-08-02" ...

$ device_type: chr "Android" "iOS" "Android" "iOS" ...

$ dau : int 1784 1805 1386 1451 1295 1351 1283 1314 2002 2038 ...

9. Drawing the time-series trend plot

> library(ggplot2)

> library(scales)

> limits<-c(0,max(dau.user.info.device.summary$dau))

> ggplot(dau.user.info.device.summary,aes(x=log_date,y=dau,col=device_type,lty=device_type,shape=device_type))+ geom_line(lwd=1)+geom_point(size=4)+scale_y_continuous(label=comma,limits = limits)

-> Android users start falling off rapidly in September, so we can conclude that the Android side of the system needs to be checked.

# Understanding the structure of the ggplot graph

> library(ggplot2)

> library(scales)

> head(dau.user.info.device.summary)

log_date device_type dau

1 2013-08-01 Android 1784

2 2013-08-01 iOS 1805

3 2013-08-02 Android 1386

4 2013-08-02 iOS 1451

5 2013-08-03 Android 1295

6 2013-08-03 iOS 1351

> ggplot(dau.user.info.device.summary)

> ggplot(dau.user.info.device.summary,aes(x=log_date,y=dau,col=device_type,lty=device_type,shape=device_type))

> ggplot(dau.user.info.device.summary,aes(x=log_date,y=dau,col=device_type,lty=device_type,shape=device_type))+geom_line(lwd=1)

> ggplot(dau.user.info.device.summary,aes(x=log_date,y=dau,col=device_type,lty=device_type,shape=device_type))+geom_line(lwd=1)+geom_point(size=4)

> ggplot(dau.user.info.device.summary,aes(x=log_date,y=dau,col=device_type,lty=device_type,shape=device_type))+geom_line(lwd=1)+geom_point(size=4)+scale_y_continuous(label=comma)

> limits<-c(0,max(dau.user.info.device.summary$dau))

> max(dau.user.info.device.summary$dau)

[1] 2133

> ggplot(dau.user.info.device.summary,aes(x=log_date,y=dau,col=device_type,lty=device_type,shape=device_type))+geom_line(lwd=1)+geom_point(size=4)+scale_y_continuous(label=comma,limits = limits)

Source: 비지니스 활용 사례로 배우는 데이터 분석: R (written by 사카마키 류지 and 사토 요헤이)


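[Comparing Two Population Means with Independent Samples] Comparing Quality Between Products A and B

Python, R Analysis and Programming 2016. 9. 2. 19:26

Problem: There are two products, A and B, and product B is said to receive more claims than product A. Perform a two-sample comparison of means for independent samples to test whether the quality of the two products differs significantly.

1. Let's load the data.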

> setwd('c:/Rwork')

> claim<-read.csv('claim.csv',header = T)

> boxplot(ppm~제품, data=claim)

## Examining A and B separately

> library(dplyr)

> claim.A <- claim %>% filter(제품 == "A")

> claim.B <- claim %>% filter(제품 == "B")

> nrow(claim.A) ; nrow(claim.B)

[1] 105

[1] 40

> claim.A$제품 %>% table # qualitative, nominal variable

A B

105 0

> claim.A$ppm %>% summary # quantitative, discrete variable

Min. 1st Qu. Median Mean 3rd Qu. Max.

3.0 83.0 211.0 259.7 385.0 994.0

> claim.A$ppm %>% sd

[1] 217.4249

> claim.B$ppm %>% summary # quantitative, discrete variable

Min. 1st Qu. Median Mean 3rd Qu. Max.

29.00 71.25 145.50 272.60 345.20 1104.00

> claim.B$ppm %>% sd

[1] 279.8323

> par(mfrow=c(1,2))

> claim.A$ppm %>% hist

> claim.B$ppm %>% hist

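-> Interpretation: For A, the mean ppm is 259.7 and the standard deviation is 217.4; for B, the mean is 272.6 and the standard deviation is 279.8. From the boxplot and histograms alone, we cannot say that B's claim ppm is clearly higher.

2. Let's run the statistical analysis in R.

A two-sample t-test assesses the difference between two groups by testing whether their means are equal. Unlike the paired t-test, which tests whether the mean of the differences between paired observations is 0, it tests whether the difference between the means of two independent groups is 0. A two-sample t-test proceeds in the following order.

① Test whether the two groups have equal variances.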

> var.test(ppm~제품, data=claim)

F test to compare two variances

data: ppm by 제품

F = 0.6037, num df = 104, denom df = 39, p-value = 0.04555

alternative hypothesis: true ratio of variances is not equal to 1

95 percent confidence interval:

0.3454412 0.9901513

sample estimates:

ratio of variances

0.6037024

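-> Interpretation:

Null hypothesis H0: the two populations have equal variances.

Alternative hypothesis H1: the two populations have different variances.

F = 0.6037, num df = 104, denom df = 39, p-value = 0.04555

The ratio of the two variances is about 0.6 and the p-value is below 0.05, so we reject the null hypothesis of equal variances; that is, the two population variances differ. When the variances differ, Welch's t-test is used.

② Because the variances differ, run Welch's t-test.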

> t.test(ppm~제품,data=claim)

Welch Two Sample t-test

data: ppm by 제품

t = -0.26303, df = 57.854, p-value = 0.7935

alternative hypothesis: true difference in means is not equal to 0

95 percent confidence interval:

-111.13697 85.32268

sample estimates:

mean in group A mean in group B

259.7429 272.6500

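-> Interpretation:

Null hypothesis H0: μ1-μ2 = 0

Alternative hypothesis H1: μ1-μ2 ≠ 0

Test statistic T(df=57.854) = -0.26303

p-value 0.7935 > 0.05

We fail to reject the null hypothesis that the means are equal. In other words, at the 5% significance level there is no evidence of a quality difference between product A and product B.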
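The two-step procedure (variance test first, then the matching t-test) can be sketched as follows. Since claim.csv is not included here, the data are simulated with made-up parameters loosely based on the summaries above, and the column name `product` is an assumption standing in for 제품.

```r
set.seed(1)

# Simulated stand-in for claim.csv (group sizes as in the post, values hypothetical)
claim <- data.frame(
  product = rep(c('A', 'B'), times = c(105, 40)),
  ppm     = c(rnorm(105, mean = 260, sd = 217), rnorm(40, mean = 273, sd = 280))
)

# Step 1: F test for equality of variances
v <- var.test(ppm ~ product, data = claim)
equal_var <- v$p.value >= 0.05

# Step 2: pooled t-test if variances look equal, Welch's t-test otherwise
res <- t.test(ppm ~ product, data = claim, var.equal = equal_var)
print(res$method)
print(res$p.value)
```

Note that t.test defaults to var.equal = FALSE, which is why the call in the post produced Welch's test without any extra argument.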

I am still taking my first steps with R and statistics, and I hope to get comfortable with both soon.


stars Function - Pie-Segment Plots and Nightingale Charts

Python, R Analysis and Programming 2016. 5. 24. 11:55

Visualizing multivariate data with a stars plot

> total<-read.table('성적.csv',header=T,sep=',')

> total

이름 국어 영어 수학 국사 화학 물리

1 박지영 90 85 55 88 91 79

2 김태함 70 65 80 75 76 89

3 김효섭 92 95 76 65 89 91

4 임경희 76 89 88 95 100 91

5 권혁진 97 87 83 91 86 91

6 하혜진 80 86 97 85 69 77

7 이준원 80 30 40 50 70 90

8 윤정웅 70 82 54 56 58 60

9 주시현 90 95 100 85 89 92

> row.names(total)<-total$이름 # use each student's name as the row name so every plot is labeled by student

> total

이름 국어 영어 수학 국사 화학 물리

박지영 박지영 90 85 55 88 91 79

김태함 김태함 70 65 80 75 76 89

김효섭 김효섭 92 95 76 65 89 91

임경희 임경희 76 89 88 95 100 91

권혁진 권혁진 97 87 83 91 86 91

하혜진 하혜진 80 86 97 85 69 77

이준원 이준원 80 30 40 50 70 90

윤정웅 윤정웅 70 82 54 56 58 60

주시현 주시현 90 95 100 85 89 92

> total<-total[,2:7]

> total

국어 영어 수학 국사 화학 물리

박지영 90 85 55 88 91 79

김태함 70 65 80 75 76 89

김효섭 92 95 76 65 89 91

임경희 76 89 88 95 100 91

권혁진 97 87 83 91 86 91

하혜진 80 86 97 85 69 77

이준원 80 30 40 50 70 90

윤정웅 70 82 54 56 58 60

주시현 90 95 100 85 89 92

Drawing the plots

> stars(total,flip.labels = FALSE, draw.segments = FALSE, frame.plot=TRUE, full=TRUE, main='성적 분석-Star Chart')

> stars(total,flip.labels = FALSE, draw.segments = TRUE, frame.plot=TRUE, full=TRUE, main='성적 분석-Nightingale Chart')

> stars(total,flip.labels = FALSE, draw.segments = TRUE, frame.plot=TRUE, full=FALSE, main='성적 분석-Nightingale Chart')

> stars(total,flip.labels = FALSE, draw.segments = TRUE, frame.plot=TRUE, full=FALSE, main='성적 분석-Nightingale Chart',scale=T,key.loc=c(10,2))
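With scale = TRUE (the default), stars rescales every column independently so that its minimum becomes 0 and its maximum becomes 1 before drawing. The sketch below does that rescaling by hand on three students' scores; the English column names and the labels s1 to s3 are placeholders I chose for the example.

```r
# First three students' scores in three subjects (placeholder names)
scores <- data.frame(
  korean  = c(90, 70, 92),
  english = c(85, 65, 95),
  math    = c(55, 80, 76),
  row.names = c('s1', 's2', 's3')
)

# Column-wise min-max rescaling, the same idea stars() applies when scale = TRUE
rescale01 <- function(x) (x - min(x)) / (max(x) - min(x))
scaled <- as.data.frame(lapply(scores, rescale01), row.names = row.names(scores))
print(round(scaled, 2))

# The segment lengths drawn here correspond to the rescaled values
stars(scores, scale = TRUE, draw.segments = TRUE, main = 'Scores (rescaled per column)')
```

One consequence is that the student with the lowest score in a subject gets a zero-length segment for it, even if the raw score is not low in absolute terms.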


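Regression Analysis in R

Python, R Analysis and Programming 2016. 2. 28. 22:53

1. Goal: Use regression analysis of how new-customer numbers change with ad spend to decide how to allocate the advertising budget

2. Source: 비즈니스 활용 사례로 배우는 데이터 분석:R , 한빛미디어

3. Code

> library(ggplot2) # ggplot()
> library(scales) # comma labels on the axes
> ad.data<-read.csv("./ad_result.csv",header = T, stringsAsFactors = F)
> ad.data # tvcm: TV ad spend, magazine: magazine ad spend, install: number of new customers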
     month  tvcm magazine install
1  2013-01  6358     5955   53948
2  2013-02  8176     6069   57300
3  2013-03  6853     5862   52057
4  2013-04  5271     5247   44044
5  2013-05  6473     6365   54063
6  2013-06  7682     6555   58097
7  2013-07  5666     5546   47407
8  2013-08  6659     6066   53333
9  2013-09  6066     5646   49918
10 2013-10 10090     6545   59963
# Scatter plot of TV ad spend vs. new customers
> ggplot(ad.data,aes(x=tvcm,y=install))+geom_point()+xlab('TV 광고비')+ylab('신규 유저수')+
+ scale_x_continuous(label=comma)+scale_y_continuous(label=comma)

# Scatter plot of magazine ad spend vs. new customers
> ggplot(ad.data,aes(x=magazine,y=install))+geom_point()+xlab('잡지 광고비')+ylab('신규 유저수')+
+ scale_x_continuous(label=comma)+scale_y_continuous(label=comma)

To see how TV and magazine ad spend affect new-customer acquisition, let's run a regression analysis.

# Regression analysis
> fit<-lm(install~.,data=ad.data[,c("install","tvcm","magazine")])
> fit

Call:
lm(formula = install ~ ., data = ad.data[, c("install", "tvcm", 
    "magazine")])

Coefficients:# model equation
(Intercept)         tvcm     magazine  
    188.174        1.361        7.250  
# New customers = 188.174 + 1.361 x TV ad spend + 7.25 x magazine ad spend

# Regression summary
> summary(fit)

Call:
lm(formula = install ~ ., data = ad.data[, c("install", "tvcm", 
    "magazine")])

Residuals:# |1Q| is larger than |3Q|, so the residuals are skewed
     Min       1Q   Median       3Q      Max 
-1406.87  -984.49   -12.11   432.82  1985.84 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)   
(Intercept)  188.1743  7719.1308   0.024  0.98123   
tvcm           1.3609     0.5174   2.630  0.03390 * 
magazine       7.2498     1.6926   4.283  0.00364 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1387 on 7 degrees of freedom
# R-squared and adjusted R-squared: the closer to 1, the better the fit
Multiple R-squared:  0.9379,	Adjusted R-squared:  0.9202 
F-statistic: 52.86 on 2 and 7 DF,  p-value: 5.967e-05

<Interpretation>

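1) The absolute value of 1Q is larger than that of 3Q, so the residuals are somewhat skewed. However, the adjusted R-squared of 0.92 is close to 1 and the model's p-value of 5.967e-05 is below 0.05, so the regression model can be considered valid.

2) Coefficient of determination: also called R-squared, a measure of how well the fitted equation explains the data. The larger the value, the better the fit. As a rough rule of thumb, 0.4 or above is treated here as acceptable, and values closer to 1 indicate a more reliable fit.

3) p-value: the P stands for probability. It is the probability of observing a relationship at least as strong as the one in the data if there were in fact no relationship. If the p-value is 0.05 or higher, the predictor is not considered useful for prediction.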
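The quantities discussed in the interpretation can be recomputed by hand to see where they come from. The sketch below refits the model on the ten rows printed above and derives R-squared and adjusted R-squared from the residuals; the variable names `ss_res`, `ss_tot`, `n`, and `p` are my own.

```r
# Data copied from the ad.data listing above
ad.data <- data.frame(
  tvcm     = c(6358, 8176, 6853, 5271, 6473, 7682, 5666, 6659, 6066, 10090),
  magazine = c(5955, 6069, 5862, 5247, 6365, 6555, 5546, 6066, 5646, 6545),
  install  = c(53948, 57300, 52057, 44044, 54063, 58097, 47407, 53333, 49918, 59963)
)
fit <- lm(install ~ tvcm + magazine, data = ad.data)
s   <- summary(fit)

n <- nrow(ad.data)  # number of observations
p <- 2              # number of predictors

# R-squared: share of the total variation explained by the model
ss_res <- sum(residuals(fit)^2)
ss_tot <- sum((ad.data$install - mean(ad.data$install))^2)
r2     <- 1 - ss_res / ss_tot

# Adjusted R-squared penalizes the number of predictors
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Per-coefficient p-values as reported by summary()
p_values <- s$coefficients[, 'Pr(>|t|)']
print(c(r2 = r2, adj_r2 = adj_r2))
print(p_values)
```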

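4. Things to improve

1) Study the ggplot function further

2) Study the statistics behind regression analysis:

- estimates, standard errors, t-values, p-values, statistical significance, R-squared, adjusted R-squared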


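[Unfinished] Practice Crawling China's sohu

Python, R Analysis and Programming 2016. 2. 20. 18:23

1. Purpose of the analysis

sohu is one of China's major portal sites. I want to collect the articles in its "policy" section.

The goal is to see how the policies announced by major Chinese government bodies have changed recently.

2. Target URLs:

# The main page is http://business.sohu.com/jrjg/index.shtml

# The second page is http://business.sohu.com/jrjg/index_322.shtml

# The last (100th) page is http://business.sohu.com/jrjg/index_224.shtml

3. Code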

library(httr)

library(rvest)
library(stringr)

for(i in 1:100){
  shtml='.shtml'
  if(i==1){
    url=sprintf('http://business.sohu.com/jrjg/index%s',shtml)
  }else{
    page<-324-i
    url=paste0('http://business.sohu.com/jrjg/index_',page,shtml)
  }
} # the loop closes here, so url holds only the last page (i = 100)

h<-read_html(url)
article.titles=html_nodes(h,'div.f14list ul li a') # problem 1
article.links=html_attr(article.titles,'href') # problem 1

titles=c()
contents=c()
dates=c()

for(link in article.links){
  h.link=read_html(link)

  # extract the title
  title=repair_encoding(html_text(html_nodes(h.link,'div.left h1')),from='UTF-8') # problem 2
  titles=c(titles,title)

  # extract the body text
  content=repair_encoding(html_text(html_nodes(h.link,'div#contentText')),from='UTF-8')
  contents=c(contents,content)

  # extract the date
  date=repair_encoding(html_text(html_nodes(h.link,'div.sourceTime')),from='UTF-8')
  date=str_extract(date,'[0-9]+.[0-9]+.[0-9]+.')
  dates=c(dates,date)
}

result=data.frame(titles,contents,dates)
View(result) # problem 3

setwd('d:/temp/r_temp')
write.table(result,'sohu.csv',sep=',')
sohu<-read.csv('sohu.csv') # problem 3
sohu[1,]

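4. Problems after coding

1) The URLs being crawled should point to 2016 articles, but after article.titles=html_nodes(h,'div.f14list ul li a') only articles from August 4 and 5, 2011 are scraped. A likely cause is that the for loop closes before read_html(url) is called, so only the last URL built (i = 100, index_224.shtml, the oldest page) is ever fetched.

2) Even though UTF-8 is specified, the character encoding problem is not resolved. The page source shows the encoding is GBK (Simplified Chinese), but the text is still garbled when GBK is used.

3) View(result) shows the row numbers 1, 2, 3, ... in the first column. These numbers also appear in Excel, so the column headers are shifted one position to the left.

4) After crawling I need to run a wordcloud analysis, but the tmcn package cannot be installed.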

> install.packages("tmcn")

Warning in install.packages :

package ‘tmcn’ is not available (for R version 3.2.3)
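Problems 1 and 3 can be explored without touching the network. The sketch below is one possible approach, not a tested fix for the real site: it builds all 100 page URLs up front (so each one could then be fetched inside a loop), and it writes a placeholder result with row.names = FALSE so no extra index column is written. The contents of `result` are dummy values.

```r
# Problem 1: build every page URL first instead of overwriting a single variable
base  <- 'http://business.sohu.com/jrjg/index'
pages <- 1:100
urls  <- ifelse(pages == 1,
                paste0(base, '.shtml'),
                paste0(base, '_', 324 - pages, '.shtml'))
print(head(urls, 3))

# Problem 3: dummy scraped result (placeholder values)
result <- data.frame(
  titles   = c('t1', 't2'),
  contents = c('c1', 'c2'),
  dates    = c('2016-02-19', '2016-02-20'),
  stringsAsFactors = FALSE
)

# row.names = FALSE stops the 1, 2, 3, ... index from being written as an unnamed column
out <- tempfile(fileext = '.csv')
write.csv(result, out, row.names = FALSE)
sohu <- read.csv(out, stringsAsFactors = FALSE)
print(names(sohu))
```

Each element of urls would then be passed to read_html inside the loop, with the extraction code moved inside as well.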

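I plan to study further and upload an updated version once the problems above are fixed.

I believe it will work if I keep studying without giving up until it does, so let's keep at it.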


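Practice with the beer data

Python, R Analysis and Programming 2015. 11. 10. 14:54

The beer practice data set contains columns such as customer ID, product category, brand, purchase date, total amount, number of items purchased, price, taste, age, and gender.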

> setwd("D:/temp/r_temp")

> beer<-read.csv('beer.csv',stringsAsFactors = F)

> str(beer)

'data.frame':    26281 obs. of  10 variables:

  $ CustomerID     : int  867616 867616 7464358 1518180 5132322 1284613 4359610 6510791 5286631 6935836 ...

$ ProductCategory: chr  "Beer" "Beer" "Beer" "Beer" ...

$ BeerBrand      : chr  "Cass" "Cass" "Cass" "Hite" ...

$ PurchaseDate   : int  20070113 20070113 20070103 20071010 20070501 20070310 20070401 20071223 20070721 20071104 ...

$ DollarAmount   : int  26800 26800 13400 7300 7300 7300 5840 17520 7300 7300 ...

$ NumOfItems     : int  2 2 1 1 1 1 1 3 1 1 ...

$ Price          : int  13400 13400 13400 7300 7300 7300 7300 7300 7300 7300 ...

$ Taste          : chr  "Bitter" "Bitter" "Bitter" "Cool" ...

$ Age            : int  35 35 29 32 57 54 40 40 34 44 ...

$ Gender         : chr  "F" "F" "F" "F" ...

> names(beer)

[1] "CustomerID"      "ProductCategory" "BeerBrand"       "PurchaseDate"

[5] "DollarAmount"    "NumOfItems"      "Price"           "Taste"

[9] "Age"             "Gender"

Plotting the number of items purchased against the purchase date:

> summary(beer$NumOfItems)

Min. 1st Qu.  Median    Mean 3rd Qu.    Max.

1.00    1.00    1.00    1.19    1.00  180.00

> range(beer$NumOfItems)

[1]   1 180

> plot(beer$PurchaseDate,beer$NumOfItems)

Extreme outliers make the plot hard to read, so we store the amount paid per item in pa instead:

> pa<-beer$DollarAmount/beer$NumOfItems

> plot(beer$PurchaseDate,pa)

-> Still hard to read. I need to think about a better graph or another way to present this.
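One thing that may help is that PurchaseDate is stored as an integer such as 20070113, so the x axis is not a real time axis. A minimal sketch, on four made-up rows shaped like beer.csv, converts it to a Date, sums by month, and uses a log scale so a single outlier does not flatten everything else.

```r
# Hypothetical rows in the same shape as beer.csv
beer <- data.frame(
  PurchaseDate = c(20070113L, 20070103L, 20071010L, 20070501L),
  NumOfItems   = c(2L, 1L, 1L, 180L)
)

# Integer yyyymmdd -> Date
beer$Date <- as.Date(as.character(beer$PurchaseDate), format = '%Y%m%d')

# Items purchased per month
monthly <- tapply(beer$NumOfItems, format(beer$Date, '%Y-%m'), sum)
print(monthly)

# A log-scaled y axis keeps the outlier visible without hiding the rest
plot(beer$Date, beer$NumOfItems, log = 'y', xlab = 'Purchase date', ylab = 'Items (log scale)')
```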

> hist(beer$PurchaseDate)

We can also compute the mean, variance, and standard deviation:

> mean(beer$DollarAmount)

[1] 13828.43

> var(beer$DollarAmount)

[1] 524584703

> sd(beer$DollarAmount)

[1] 22903.81

beer$Taste is a categorical variable with the values Bitter/Cool/Soft, so we can apply the table function and then draw a barplot.

> table(beer$Taste)

Bitter   Cool   Soft

3234  11540  11507

> barplot(table(beer$Taste))

Gender can be tabulated in the same way.

> table(beer$Gender)

          F     M

 8275 12952  5054

> table(beer$Gender)/18006

                  F         M

0.4595690 0.7193158 0.2806842

> barplot(table(beer$Gender))

Many values are empty strings, so we can use exclude:

> table(beer$Gender,exclude="")/sum(table(beer$Gender,exclude=""))

F         M

0.7193158 0.2806842

> m<-table(beer$Gender,exclude="")

> barplot(m)

We can also cross-tabulate Gender and Taste and draw a mosaicplot:

> gender_taste=table(beer$Gender,beer$Taste)

> mosaicplot(gender_taste)

-> I would like to exclude the empty values here, but I don't know how yet. More study needed.
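One way this could be done is to drop the rows with an empty Gender before building the table. The sketch below uses a made-up six-row data frame in place of beer.csv.

```r
# Hypothetical rows; '' marks a missing gender, as in beer.csv
beer <- data.frame(
  Gender = c('F', '', 'M', 'F', '', 'M'),
  Taste  = c('Cool', 'Soft', 'Bitter', 'Soft', 'Cool', 'Cool'),
  stringsAsFactors = FALSE
)

# Keep only rows where Gender is filled in
known <- beer[beer$Gender != '', ]

gender_taste <- table(known$Gender, known$Taste)
print(gender_taste)

mosaicplot(gender_taste, main = 'Gender vs Taste (blank Gender removed)')
```

If Gender were read as a factor instead of a character column, wrapping the filtered column in droplevels() would also be needed to remove the unused empty level.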

The subset function selects the rows that satisfy a condition. & means and, | means or.

> m_and_age=subset(beer,beer$Gender=='M'&beer$Age>30)

> head(m_and_age)

CustomerID ProductCategory BeerBrand PurchaseDate DollarAmount NumOfItems Price

7     4359610            Beer      Hite     20070401         5840          1  7300

14    4317050            Beer      Hite     20070921        21900          3  7300

15    4874758            Beer      Hite     20070206         7300          1  7300

18    1333660            Beer      Hite     20071030         7300          1  7300

23    5478004            Beer      Hite     20070325         7300          1  7300

29     944833            Beer      Hite     20070808         7300          1  7300

Taste Age Gender

7   Cool  40      M

14  Cool  66      M

15  Cool  59      M

18  Cool  35      M

23  Cool  50      M

29  Cool  43      M

> dim(m_and_age)

[1] 3923   10

> m_or_age=subset(beer,beer$Gender=='M'|beer$Age>30)

> head(m_or_age)

CustomerID ProductCategory BeerBrand PurchaseDate DollarAmount NumOfItems Price

1     867616            Beer      Cass     20070113        26800          2 13400

2     867616            Beer      Cass     20070113        26800          2 13400

4    1518180            Beer      Hite     20071010         7300          1  7300

5    5132322            Beer      Hite     20070501         7300          1  7300

6    1284613            Beer      Hite     20070310         7300          1  7300

7    4359610            Beer      Hite     20070401         5840          1  7300

Taste Age Gender

1 Bitter  35      F

2 Bitter  35      F

4   Cool  32      F

5   Cool  57      F

6   Cool  54      F

7   Cool  40      M

> dim(m_or_age)

[1] 14485    10

For the numeric variable Age we can draw a histogram,

> hist(beer$Age)

and the number of bins can be set to about 50.

> hist(beer$Age,breaks=50)


Posted by 마르띤


subset, mosaicplot, hist, var, sd

Python, R Analysis and Programming 2015. 10. 31. 23:43

source: https://class.coursera.org/statistics-004 (coursera: Data Analysis and Statistical Inference by Duke Univ. by Dr. Mine Çetinkaya-Rundel)

> source("http://www.openintro.org/stat/data/cdc.R")

Running names(cdc) returns the variable names genhlth, exerany, hlthplan, smoke100, height, weight, wtdesire, age, and gender. Each of these variables corresponds to a question that was asked in the survey. For example, for genhlth, respondents were asked to evaluate their general health, responding either excellent, very good, good, fair or poor. The exerany variable indicates whether the respondent exercised in the past month (1) or did not (0). Likewise, hlthplan indicates whether the respondent had some form of health coverage (1) or did not (0). The smoke100 variable indicates whether the respondent has smoked at least 100 cigarettes in their lifetime. The other variables record the respondent’s height in inches, weight in pounds, desired weight (wtdesire) in pounds, age in years, and gender.

> summary(cdc)

      genhlth        exerany          hlthplan         smoke100          height

 excellent:4657   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :48.00

 very good:6972   1st Qu.:0.0000   1st Qu.:1.0000   1st Qu.:0.0000   1st Qu.:64.00

 good     :5675   Median :1.0000   Median :1.0000   Median :0.0000   Median :67.00

 fair     :2019   Mean   :0.7457   Mean   :0.8738   Mean   :0.4721   Mean   :67.18

 poor     : 677   3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:70.00

                  Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :93.00

     weight         wtdesire          age        gender

 Min.   : 68.0   Min.   : 68.0   Min.   :18.00   m: 9569

 1st Qu.:140.0   1st Qu.:130.0   1st Qu.:31.00   f:10431

 Median :165.0   Median :150.0   Median :43.00

 Mean   :169.7   Mean   :155.1   Mean   :45.07

 3rd Qu.:190.0   3rd Qu.:175.0   3rd Qu.:57.00

 Max.   :500.0   Max.   :680.0   Max.   :99.00

> head(cdc$genhlth) #categorical, ordinal

[1] good good good good very good very good

Levels: excellent very good good fair poor

> levels(cdc$genhlth)

[1] "excellent" "very good" "good" "fair" "poor"

> nlevels(cdc$genhlth)

[1] 5

> levels(cdc$genhlth)[2]

[1] "very good"

R also has built-in functions to compute summary statistics one by one. For instance, to calculate the mean, variance, standard deviation, and median of weight, type

> mean(cdc$weight)

[1] 169.683

> var(cdc$weight)

[1] 1606.484

> sd(cdc$weight)

[1] 40.08097

> median(cdc$weight)

[1] 165

> summary(cdc$weight) #numerical, continuous

Min. 1st Qu. Median Mean 3rd Qu. Max.

68.0 140.0 165.0 169.7 190.0 500.0
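The relationship between var and sd can be verified directly: sd is the square root of var, and var divides by n - 1 rather than n. In the output above, 40.08097 squared is about 1606.484. The sketch below uses ten illustrative weights instead of the full cdc data; the names w, n and manual_var are only for this example.

```r
# Ten illustrative weights in pounds.
w <- c(175, 125, 105, 132, 150, 114, 194, 170, 150, 180)
n <- length(w)

# Sample variance by hand: squared deviations from the mean, divided by n - 1.
manual_var <- sum((w - mean(w))^2) / (n - 1)

all.equal(manual_var, var(w))    # var() uses the same n - 1 denominator
all.equal(sqrt(var(w)), sd(w))   # sd() is the square root of var()
```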

The smoke100 variable indicates whether the respondent has smoked at least 100 cigarettes in their lifetime (1) or has not (0).

> summary(cdc$smoke100)

Min. 1st Qu. Median Mean 3rd Qu. Max.

0.000 0.000 0.000 0.472 1.000 1.000

-> What about categorical data? We would instead consider the sample frequency or relative frequency distribution. The function table does this for you by counting the number of times each kind of response was given. For example, to see the number of people who have smoked 100 cigarettes in their lifetime, type

> table(cdc$smoke100)

0 1

10559 9441

or instead look at the relative frequency distribution by typing

> table(cdc$smoke100)/20000

0 1

0.52795 0.47205
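Dividing by a hard-coded 20000 works only for this data set. A minimal sketch of two alternatives follows, using a made-up 0/1 vector called smoke in place of cdc$smoke100. It also shows why the Mean of 0.472 reported by summary(cdc$smoke100) matches the relative frequency of the 1s.

```r
# Made-up 0/1 answers standing in for cdc$smoke100.
smoke <- c(0, 1, 1, 0, 0, 1, 0, 0, 1, 0)

counts <- table(smoke)

rel_freq_manual  <- counts / length(smoke)   # divide by the actual number of observations
rel_freq_builtin <- prop.table(counts)       # built-in equivalent

rel_freq_builtin

# For a 0/1 variable, the mean equals the relative frequency of the 1s.
all.equal(mean(smoke), rel_freq_builtin[["1"]])
```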

Next, we make a bar plot of the entries in the table by putting the table inside the barplot command.

> barplot(table(cdc$smoke100))

OR

> smoke = table(cdc$smoke100)

> barplot(smoke)

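The size of the data frame is shown next to the object name in the workspace, and it can also be printed by typing dim(cdc). The table command can also tabulate two variables at once, here gender against smoke100. In the resulting table the rows refer to gender and the columns to smoke100; recall that 1 indicates that a respondent has smoked at least 100 cigarettes. To create a mosaic plot of this table, we would enter the following commands.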

> gender_smokers = table(cdc$gender,cdc$smoke100)

> mosaicplot(gender_smokers)

--> The mosaic plot shows that males are more likely than females to have smoked at least 100 cigarettes. The smoke100 variable is categorical (not ordinal). The explanatory variable goes on the x-axis and the response variable on the y-axis. The variable on the x-axis is the data that accounts for the Y I want to understand; because it explains Y, it is called the explanatory or predictor variable, and is also known as the independent variable. The value on the y-axis responds to X, so it is called the response or dependent variable.
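The comparison read off the mosaic plot can also be made numerically with row proportions. The sketch below uses a small made-up data frame called toy, so the proportions it prints say nothing about the real cdc data; only the column names gender and smoke100 come from the post.

```r
# Made-up data standing in for cdc$gender and cdc$smoke100.
toy <- data.frame(
  gender   = c("m", "m", "m", "m", "f", "f", "f", "f", "f", "f"),
  smoke100 = c(1, 1, 1, 0, 1, 1, 0, 0, 0, 0)
)

gender_smokers <- table(toy$gender, toy$smoke100)

# margin = 1 divides each row by its row total: the share of each gender
# that answered 0 or 1.
row_props <- prop.table(gender_smokers, margin = 1)
row_props

mosaicplot(gender_smokers, xlab = "gender", ylab = "smoke100")
```

Each row of row_props sums to 1, so the two genders can be compared even when the groups have different sizes.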

> cdc[567,6] #the element in the 567th row and 6th column (weight)

[1] 160

> cdc[1:10,6] #the weights of the first 10 respondents

[1] 175 125 105 132 150 114 194 170 150 180

> cdc[1:10,] #all of the data for the first 10 respondents

genhlth exerany hlthplan smoke100 height weight wtdesire age gender

1 good 0 1 0 70 175 175 77 m

2 good 0 1 1 64 125 115 33 f

3 good 1 1 1 60 105 105 49 f

4 good 1 1 0 66 132 124 42 f

5 very good 0 1 0 61 150 130 55 f

6 very good 1 1 0 64 114 114 55 f

7 very good 1 1 0 71 194 185 31 m

8 very good 0 1 0 67 170 160 45 m

9 good 0 1 1 65 150 130 27 f

10 good 1 1 0 70 180 170 44 m
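Column numbers such as 6 are easy to get wrong, so the same cells can also be selected by column name. A minimal sketch on a made-up data frame called toy, which has only three of the cdc columns:

```r
# Made-up rows in the same shape as a few cdc columns.
toy <- data.frame(
  height = c(70, 64, 60, 66),
  weight = c(175, 125, 105, 132),
  age    = c(77, 33, 49, 42)
)

toy[2, 2]            # row 2, column 2
toy[2, "weight"]     # the same cell, selected by column name
toy$weight[2]        # the same cell again, via the $ operator
toy[1:3, "weight"]   # weights of the first three rows
toy[1:3, ]           # all columns for the first three rows
```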

#this command will create a new data set called mdata that contains only the men from the cdc data set.

> mdata = subset(cdc, cdc$gender == "m")

> head(mdata)

genhlth exerany hlthplan smoke100 height weight wtdesire age gender

1 good 0 1 0 70 175 175 77 m

7 very good 1 1 0 71 194 185 31 m

8 very good 0 1 0 67 170 160 45 m

10 good 1 1 0 70 180 170 44 m

11 excellent 1 1 1 69 186 175 46 m

12 fair 1 1 1 69 168 148 62 m

You can use several of these conditions together with & and |. The & is read “and”, so this command gives you the data for men over the age of 30.

> m_and_over30 = subset(cdc, cdc$gender == "m" & cdc$age > 30)

> head(m_and_over30)

genhlth exerany hlthplan smoke100 height weight wtdesire age gender

1 good 0 1 0 70 175 175 77 m

7 very good 1 1 0 71 194 185 31 m

8 very good 0 1 0 67 170 160 45 m

10 good 1 1 0 70 180 170 44 m

11 excellent 1 1 1 69 186 175 46 m

12 fair 1 1 1 69 168 148 62 m

The | character is read “or”, so this command takes people who are men or over the age of 30 (or both).

> m_or_over30=subset(cdc,cdc$gender=="m"|cdc$age>30)

> head(m_or_over30)

genhlth exerany hlthplan smoke100 height weight wtdesire age gender

1 good 0 1 0 70 175 175 77 m

2 good 0 1 1 64 125 115 33 f

3 good 1 1 1 60 105 105 49 f

4 good 1 1 0 66 132 124 42 f

5 very good 0 1 0 61 150 130 55 f

6 very good 1 1 0 64 114 114 55 f

Create a new object called under23_and_smoke that contains all observations of respondents under the age of 23 who have smoked at least 100 cigarettes in their lifetime.

> under23_and_smoke = subset(cdc, cdc$age < 23 & cdc$smoke100 == 1)

> str(under23_and_smoke)

'data.frame': 620 obs. of 9 variables:

$ genhlth : Factor w/ 5 levels "excellent","very good",..: 1 2 1 3 2 2 4 4 1 4 ...

$ exerany : num 1 1 1 1 1 1 0 1 1 1 ...

$ hlthplan: num 0 0 1 1 1 0 1 1 0 1 ...

$ smoke100: num 1 1 1 1 1 1 1 1 1 1 ...

$ height : num 66 70 74 64 62 64 71 72 63 71 ...

$ weight : int 185 160 175 190 92 125 185 185 105 185 ...

$ wtdesire: int 220 140 200 140 92 115 185 170 100 150 ...

$ age : int 21 18 22 20 21 22 20 19 19 18 ...

$ gender : Factor w/ 2 levels "m","f": 1 2 1 2 2 2 1 1 1 1 ...

> summary(under23_and_smoke)

      genhlth       exerany          hlthplan         smoke100     height

 excellent:110   Min.   :0.0000   Min.   :0.0000   Min.   :1   Min.   :59.00

 very good:244   1st Qu.:1.0000   1st Qu.:0.0000   1st Qu.:1   1st Qu.:65.00

 good     :204   Median :1.0000   Median :1.0000   Median :1   Median :68.00

 fair     : 53   Mean   :0.8145   Mean   :0.6952   Mean   :1   Mean   :67.92

 poor     :  9   3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:1   3rd Qu.:71.00

                 Max.   :1.0000   Max.   :1.0000   Max.   :1   Max.   :79.00

     weight         wtdesire          age        gender

 Min.   : 85.0   Min.   : 80.0   Min.   :18.00   m:305

 1st Qu.:130.0   1st Qu.:125.0   1st Qu.:19.00   f:315

 Median :155.0   Median :150.0   Median :20.00

 Mean   :158.9   Mean   :152.2   Mean   :20.22

 3rd Qu.:180.0   3rd Qu.:175.0   3rd Qu.:21.00

 Max.   :350.0   Max.   :315.0   Max.   :22.00

> dim(under23_and_smoke)

[1] 620 9

-> There are 620 observations in the subset under23_and_smoke.
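One detail worth knowing when counting rows this way: subset() and bracket indexing treat missing values differently. The sketch below uses a made-up data frame called toy with one missing age; it is an illustration only and makes no claim about missing values in the cdc data.

```r
# Made-up data with one missing age.
toy <- data.frame(
  age      = c(21, 18, NA, 40, 22),
  smoke100 = c(1, 0, 1, 1, 1)
)

cond <- toy$age < 23 & toy$smoke100 == 1   # TRUE, FALSE, NA, FALSE, TRUE

by_subset  <- subset(toy, age < 23 & smoke100 == 1)  # an NA condition drops the row
by_bracket <- toy[cond, ]                            # an NA condition yields an all-NA row
by_which   <- toy[which(cond), ]                     # which() skips NA, matching subset()

nrow(by_subset)    # 2
nrow(by_bracket)   # 3
nrow(by_which)     # 2
```

Inside subset() the column names can be written bare (age, smoke100) because the condition is evaluated within the data frame.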

The purpose of a boxplot is to provide a thumbnail sketch of a variable for the purpose of comparing across several categories. So we can, for example, compare the heights of men and women with the command below. The notation here is new: the ~ character can be read “versus” or “as a function of”. So we’re asking R to give us box plots of heights where the groups are defined by gender.

> boxplot(cdc$height ~ cdc$gender)
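The formula notation can also be combined with a data argument, which avoids repeating the data frame name. A minimal sketch on made-up heights follows; toy and bp are names chosen for this example.

```r
# Made-up heights (inches) standing in for cdc$height and cdc$gender.
toy <- data.frame(
  height = c(70, 71, 67, 69, 64, 60, 66, 61),
  gender = c("m", "m", "m", "m", "f", "f", "f", "f")
)

# height ~ gender: one box of height for each value of gender.
# The data argument lets the formula use bare column names.
bp <- boxplot(height ~ gender, data = toy, ylab = "height (inches)")

bp$names                                  # group labels, in plotting order
tapply(toy$height, toy$gender, median)    # the medians drawn inside each box
```

boxplot() returns the statistics it drew, so the third row of bp$stats holds the same medians that tapply() computes.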

Next let’s consider a new variable that doesn’t show up directly in this data set: Body Mass Index (BMI). BMI is a weight-to-height ratio; with weight in pounds and height in inches, it is calculated as weight divided by height squared, multiplied by 703.

> bmi = (cdc$weight / cdc$height^2) * 703

> boxplot(bmi~cdc$genhlth)
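The BMI calculation can be wrapped in a small function to make the units explicit. This is a sketch; bmi_imperial and conversion are hypothetical names introduced here, not part of the lab. The factor 703 is the rounded conversion from lb/in^2 to the metric kg/m^2 definition of BMI.

```r
# BMI from imperial units: 703 * weight (lb) / height (in)^2.
bmi_imperial <- function(weight_lb, height_in) {
  703 * weight_lb / height_in^2
}

# Where 703 comes from: 1 lb = 0.45359237 kg and 1 in = 0.0254 m.
conversion <- 0.45359237 / 0.0254^2   # about 703.07

# Example: 175 lb and 70 in, the first respondent shown in cdc[1:10, ].
bmi_imperial(175, 70)                 # about 25.1

# The function is vectorised, so whole columns can be passed at once.
bmi_imperial(c(175, 125), c(70, 64))
```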

--> Among people with excellent health, there are some with unusually low BMIs compared to the rest of the group. The IQR increases slightly as general health status declines (from excellent to poor). The median BMI is roughly 25 for all general health categories, and there is a slight increase in median BMI as general health status declines (from excellent to poor).

To make a histogram of bmi:

> hist(bmi)

You can control the number of bins by adding the breaks argument to the command. After the default histogram of bmi above, the next line makes one with 50 breaks.

> hist(bmi,breaks=50)
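When breaks is a single number, hist() treats it as a suggestion and rounds the break points to convenient values, so the plot may not have exactly 50 bins. The sketch below inspects the bin count without drawing, using simulated values in place of bmi (the mean and sd are arbitrary choices for illustration).

```r
set.seed(1)
x <- rnorm(1000, mean = 26, sd = 5)   # simulated values standing in for bmi

h_default <- hist(x, plot = FALSE)               # default rule picks the bin count
h_50      <- hist(x, breaks = 50, plot = FALSE)  # 50 is only a suggestion

length(h_default$counts)   # number of bins used by default
length(h_50$counts)        # near 50, but not necessarily exactly 50

# For exact control, pass the break points themselves: 51 edges give 50 bins.
edges   <- seq(floor(min(x)), ceiling(max(x)), length.out = 51)
h_exact <- hist(x, breaks = edges, plot = FALSE)
length(h_exact$counts)     # exactly 50
```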

Using the same tools (the plot function) make a scatterplot of weight versus desired weight.

> plot(cdc$weight,cdc$wtdesire,xlim=c(0,700))

-> The plot shows a moderately strong positive linear association between weight and desired weight.
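The strength of a linear association can be summarised with the correlation coefficient. The sketch below uses only the ten weight and wtdesire values printed by cdc[1:10, ] earlier in this post, so the resulting number describes those ten rows and not the full data set.

```r
# weight and wtdesire for the first ten respondents, as printed by cdc[1:10, ].
weight   <- c(175, 125, 105, 132, 150, 114, 194, 170, 150, 180)
wtdesire <- c(175, 115, 105, 124, 130, 114, 185, 160, 130, 170)

plot(weight, wtdesire, xlab = "weight (lb)", ylab = "desired weight (lb)")
abline(a = 0, b = 1, lty = 2)   # reference line where desired weight equals current weight

# Pearson correlation: the sign gives the direction, the magnitude the strength.
r <- cor(weight, wtdesire)
r
```

Points below the dashed line are respondents who want to weigh less than they currently do.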


Posted by 마르띤
