
1. Blood type analysis

> setwd('d:/temp/r_temp')
> bloodtype<-read.csv('혈액형.csv')
> head(bloodtype)
  혈액형
1      A
2      B
3      B
4      A
5      A
6      O
> sort(table(bloodtype$혈액형),decreasing = T)

 A  O  B AB 
34 28 27 11 
> sort.bloodtype<-sort(table(bloodtype),decreasing = T)
> par(mfrow=c(1,2))
> barplot(sort.bloodtype,main='혈액형 별 막대그래프')


> slices=c('red','blue','yellow','green')
> barplot(sort.bloodtype,col=slices,main='혈액형 별 막대그래프')



2. Analysis of National Assembly seats by party

> pie.vote<-c(0.5067, 0.0167, 0.0100, 0.0433, 0.4233)
> names(pie.vote)<-c('새누리152명','선진5명','무3명','진보13명','민주127명')
> par(mfrow=c(1,2))
> pie(pie.vote,col=c('red3','blue','green3','magenta','yellow'),radius=1,main='19대 국회의원 선거')


> slices=c('#FFFF00','#FF0000','#0000FF','#00FFFF','#00FF00') # RGB hex codes
> barplot(pie.vote,col=slices,main='19대 국회의원 선거')



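R's color system

R colors can be given as Red, Green and Blue components, each written as a two-digit hexadecimal value (#RRGGBB).

Each hexadecimal digit runs from 0 to F, where A to F stand for the decimal values 10 to 15, so a two-digit component ranges from 00 (0) to FF (255).

#FF0000 -> the first two digits (Red) are at maximum, while Green and Blue are zero, so the color is red.

#00FFFF -> Red is zero, but Green and Blue are both at maximum, so the color is cyan (sky blue).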
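The hexadecimal reading of a color code can be checked directly in R. The sketch below is only an illustration; the variable names are made up for this example, and it relies on col2rgb(), rgb() and strtoi() from base R and grDevices.

```r
# Decompose hex color strings into their decimal R, G, B components.
hex_colors <- c("#FF0000", "#00FFFF")
components <- col2rgb(hex_colors)   # 3 x 2 matrix with rows red, green, blue
print(components)

# Convert one two-digit hexadecimal component to decimal by hand.
ff_decimal <- strtoi("FF", base = 16L)

# Go the other way: build a hex string from decimal components.
cyan <- rgb(0, 255, 255, maxColorValue = 255)
```

Printing components shows one column per color, which makes it easy to see which channels are "full" and which are "empty".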


Posted by 마르띤

1. Blood type analysis

> setwd('d:/temp/r_temp')
> bloodtype<-read.csv('혈액형.csv')$혈액형
> bloodtype
  [1] A  B  B  A  A  O  A  AB O  O  O  A  A  B  AB A  O  B  A  B  B  A  B  A  B  AB B  A  O 
 [30] AB O  B  A  B  A  O  B  A  A  A  A  O  A  O  O  B  B  O  AB A  B  AB B  O  O  O  AB O 
 [59] O  B  A  A  O  A  B  O  A  O  B  O  A  B  O  AB B  B  A  O  B  A  B  B  O  AB B  A  AB
 [88] A  B  A  A  O  O  A  A  O  AB A  A  O 
Levels: A AB B O
> pie.bloodtype=sort(table(bloodtype),decreasing = T)
> pie.bloodtype
bloodtype
 A  O  B AB 
34 28 27 11 
> par(mfrow=c(1,2))
> pie(pie.bloodtype,radius=1,main='원 그래프')


> slices=c('red','blue','yellow','green')
> pie(pie.bloodtype,col=slices,radius=1,main='원 그래프')



2. Analysis of National Assembly seats by party

> require(grDevices) # graphics devices and support for base and grid graphics; provides R's color functions
> assembly<-c(152,5,3,13,127)
> names(assembly)=c('새누리152명','선진5명','무3명','진보13명','민주127명')
> pie.vote<-assembly/300
> pie.vote
새누리152명     선진5명       무3명    진보13명   민주127명 
 0.50666667  0.01666667  0.01000000  0.04333333  0.42333333 
> par(mfrow=c(1,2))
> pie(pie.vote)


> pie(pie.vote,col=c('red3','blue','green3','magenta','yellow'),radius=1,main='19대 국회의원 선거')

> par(new=T)
> pie(assembly,radius=0.8,col='white',labels=NA,border=NA)
> text(0,0,'총 300석')
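The shares passed to pie() can also be derived from the seat counts instead of being typed in. A minimal sketch, assuming the seat counts given in this post; the label format is only a suggestion.

```r
# Seat counts as given in the post.
assembly <- c(152, 5, 3, 13, 127)
names(assembly) <- c("새누리", "선진", "무", "진보", "민주")

total_seats <- sum(assembly)          # total number of seats
seat_share  <- prop.table(assembly)   # same as assembly / sum(assembly)

# Labels that combine party name and percentage, e.g. for pie(labels = ...).
share_labels <- sprintf("%s %.1f%%", names(assembly), 100 * seat_share)
```

Computing the total with sum() avoids hard-coding 300, so the shares stay correct if a count is changed.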




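1. Purpose: run a regression of new-customer counts on advertising spend to decide how to split the advertising budget.

2. Source: 비즈니스 활용 사례로 배우는 데이터 분석: R, Hanbit Media (한빛미디어)

3. Code

> library(ggplot2) # ggplot(), geom_point()
> library(scales) # comma, used for the axis labels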

> ad.data<-read.csv("./ad_result.csv",header = T, stringsAsFactors = F)
> ad.data # tvcm: TV ad spend, magazine: magazine ad spend, install: number of new customers
     month  tvcm magazine install
1  2013-01  6358     5955   53948
2  2013-02  8176     6069   57300
3  2013-03  6853     5862   52057
4  2013-04  5271     5247   44044
5  2013-05  6473     6365   54063
6  2013-06  7682     6555   58097
7  2013-07  5666     5546   47407
8  2013-08  6659     6066   53333
9  2013-09  6066     5646   49918
10 2013-10 10090     6545   59963

# Scatter plot of TV ad spend against new customers
> ggplot(ad.data,aes(x=tvcm,y=install))+geom_point()+xlab('TV 광고비')+ylab('신규 유저수')+
+ scale_x_continuous(labels=comma)+scale_y_continuous(labels=comma)


# Scatter plot of magazine ad spend against new customers
> ggplot(ad.data,aes(x=magazine,y=install))+geom_point()+xlab('잡지 광고비')+ylab('신규 유저수')+
+ scale_x_continuous(labels=comma)+scale_y_continuous(labels=comma)


To see how TV and magazine advertising spend affect new-customer acquisition, let's run a regression.

# Regression analysis

> fit<-lm(install~.,data=ad.data[,c("install","tvcm","magazine")])
> fit

Call:
lm(formula = install ~ ., data = ad.data[, c("install", "tvcm", "magazine")])

Coefficients: # model equation
(Intercept)         tvcm     magazine
    188.174        1.361        7.250

# new customers = 188.174 + 1.361 x TV ad spend + 7.250 x magazine ad spend

# Regression summary
> summary(fit)

Call:
lm(formula = install ~ ., data = ad.data[, c("install", "tvcm", "magazine")])

Residuals: # |1Q| is larger than |3Q|, so the residuals are skewed
     Min       1Q   Median       3Q      Max
-1406.87  -984.49   -12.11   432.82  1985.84

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  188.1743  7719.1308   0.024  0.98123
tvcm           1.3609     0.5174   2.630  0.03390 *
magazine       7.2498     1.6926   4.283  0.00364 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1387 on 7 degrees of freedom
# R-squared and adjusted R-squared; the closer to 1, the better the fit
Multiple R-squared:  0.9379,  Adjusted R-squared:  0.9202
F-statistic: 52.86 on 2 and 7 DF,  p-value: 5.967e-05

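<Interpretation>

1) The absolute value of 1Q is larger than that of 3Q, so the residuals are somewhat skewed. Even so, the adjusted R-squared of 0.92 is close to 1 and the p-value of 5.967e-05 is below 0.05, so the regression model can be judged valid.

2) Coefficient of determination: also called R-squared, it measures how accurately the fitted equation predicts the data. Larger values mean a more accurate fit. As a rough rule of thumb, a value of 0.4 or more is treated here as acceptable, and the closer it is to 1, the more accurate and reliable the model.

3) p-value: the P stands for probability. It is the probability of seeing a relationship at least this strong if there were in fact no relationship. When the p-value is 0.05 or more, the predictor is conventionally judged not to help prediction.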
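The fitted equation can be used to estimate new customers for a planned budget. The following is a minimal sketch, assuming the ten rows of ad.data shown earlier in this post; the budget figures (7000 and 6000) are hypothetical and not from the source.

```r
# Data transcribed from the ad.data table in this post.
ad.data <- data.frame(
  month    = sprintf("2013-%02d", 1:10),
  tvcm     = c(6358, 8176, 6853, 5271, 6473, 7682, 5666, 6659, 6066, 10090),
  magazine = c(5955, 6069, 5862, 5247, 6365, 6555, 5546, 6066, 5646, 6545),
  install  = c(53948, 57300, 52057, 44044, 54063, 58097, 47407, 53333, 49918, 59963),
  stringsAsFactors = FALSE
)

fit <- lm(install ~ tvcm + magazine, data = ad.data)
b   <- coef(fit)   # named vector: (Intercept), tvcm, magazine

# Hypothetical budget (not from the post): 7000 on TV and 6000 on magazines.
plan <- data.frame(tvcm = 7000, magazine = 6000)
predicted <- predict(fit, newdata = plan)

# The same prediction written out with the model equation.
by_hand <- b["(Intercept)"] + b["tvcm"] * 7000 + b["magazine"] * 6000
```

predict() and the hand-written equation give the same number, which is a quick way to confirm how the coefficients are meant to be read.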


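4. Things to improve

1) Study the ggplot function further

2) Study the statistics behind regression analysis:

 - estimates, standard errors, t-values, p-values, statistical significance, R-squared, adjusted R-squared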


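1. Purpose of the analysis

Sohu is one of China's major portal sites. I want to collect the articles in its "policy" section.

From these I want to see how the policies announced by major Chinese government bodies have changed recently.


2. Target URLs:

# The main page is http://business.sohu.com/jrjg/index.shtml

# The second page is http://business.sohu.com/jrjg/index_322.shtml

# The 100th and last page is http://business.sohu.com/jrjg/index_224.shtml


3. Code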

library(httr)

library(rvest)

library(stringr)


for(i in 1:100){

shtml='.shtml'

  if(i==1){

url=sprintf('http://business.sohu.com/jrjg/index%s',shtml)

  }else{

page<-324-i

url=paste0('http://business.sohu.com/jrjg/index_',page,shtml)

  }

}


h<-read_html(url)

article.titles=html_nodes(h,'div.f14list ul li a') # problem 1

article.links=html_attr(article.titles,'href') # problem 1


titles=c()

contents=c()

dates=c()


for(link in article.links){

h.link=read_html(link)


# extract the title

title=repair_encoding(html_text(html_nodes(h.link,'div.left h1')),from='UTF-8') # problem 2

titles=c(titles,title)


# extract the body

content=repair_encoding(html_text(html_nodes(h.link,'div#contentText')),from='UTF-8')

contents=c(contents,content)


# extract the date

date=repair_encoding(html_text(html_nodes(h.link,'div.sourceTime')),from='UTF-8')

date=str_extract(date,'[0-9]+.[0-9]+.[0-9]+.') 

dates=c(dates,date)

}


result=data.frame(titles,contents,dates)

View(result) # problem 3


setwd('d:/temp/r_temp')

write.table(result,'sohu.csv',sep=',')

sohu<-read.csv('sohu.csv') # problem 3

sohu[1,] 
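In the code above, the for loop over i closes before read_html(url) is called, so only the URL built on the last iteration (index_224.shtml, the oldest page) is ever fetched, which would explain why only old articles come back. A minimal sketch of one way to restructure it follows. The helper names sohu_page_url and collect_links are made up for this example, and passing encoding = "GBK" to read_html is an assumption based on the page source, not a confirmed fix.

```r
# Build the listing-page URL for page i (1 = newest), following the post's rule.
sohu_page_url <- function(i) {
  base <- "http://business.sohu.com/jrjg/index"
  if (i == 1) paste0(base, ".shtml") else paste0(base, "_", 324 - i, ".shtml")
}

# Collect article links from every listing page. Not run here: it needs the
# rvest package and network access, and the CSS selector is the post's own.
collect_links <- function(pages = 1:100) {
  links <- character(0)
  for (i in pages) {
    # encoding = "GBK" is an assumption based on the page source.
    h <- rvest::read_html(sohu_page_url(i), encoding = "GBK")
    nodes <- rvest::html_nodes(h, "div.f14list ul li a")
    links <- c(links, rvest::html_attr(nodes, "href"))
  }
  links
}

page_urls <- vapply(1:100, sohu_page_url, character(1))
```

The key change is that fetching and link extraction happen inside the loop, so every page contributes its links rather than only the last one.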


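4. Problems after running the code

1) The listing pages clearly show articles from 2016, but after article.titles=html_nodes(h,'div.f14list ul li a') only articles from August 4 and 5, 2011 are scraped.

2) Even with UTF-8 specified, the character-encoding problem is not solved. The page source declares GBK (Simplified Chinese), but the text still comes out garbled when I try GBK.

3) View(result) shows a first column of numbers 1, 2, 3, and so on. The numbers also appear in Excel, where the column headers end up shifted one position to the left.

4) After crawling I need to run a wordcloud analysis, but the tmcn package will not install.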

> install.packages("tmcn")

Warning in install.packages :

  package ‘tmcn’ is not available (for R version 3.2.3)
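For the shifted column headers, one likely cause is that write.table() writes the row names as an extra first column while the header line only lists the data columns. A minimal sketch of a possible workaround, using a small made-up data frame in place of the scraped result:

```r
# A small stand-in for the scraped result (made-up values).
result <- data.frame(titles = c("t1", "t2"),
                     contents = c("c1", "c2"),
                     dates = c("2016-01-01", "2016-01-02"),
                     stringsAsFactors = FALSE)

csv_path <- tempfile(fileext = ".csv")

# row.names = FALSE keeps the 1, 2, 3, ... row labels out of the file,
# so the header line has exactly one name per column.
write.csv(result, csv_path, row.names = FALSE)

header_line <- readLines(csv_path, n = 1)
round_trip  <- read.csv(csv_path, stringsAsFactors = FALSE)
```

With the row names left out, the header and the data rows have the same number of fields, so Excel and read.csv() line the columns up the same way.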


I plan to study more and upload this again once the problems above are fixed.

I believe that if I keep studying without giving up it will eventually work, so let's keep at it.


source: https://class.coursera.org/statistics-004 (coursera: Data Analysis and Statistical Inference by Duke Univ. by Dr. Mine Çetinkaya-Rundel)

 

source("http://www.openintro.org/stat/data/cdc.R")

Typing names(cdc) returns the names genhlth, exerany, hlthplan, smoke100, height, weight, wtdesire, age, and gender. Each one of these variables corresponds to a question that was asked in the survey. For example, for genhlth, respondents were asked to evaluate their general health, responding either excellent, very good, good, fair or poor. The exerany variable indicates whether the respondent exercised in the past month (1) or did not (0). Likewise, hlthplan indicates whether the respondent had some form of health coverage (1) or did not (0). The smoke100 variable indicates whether the respondent had smoked at least 100 cigarettes in their lifetime. The other variables record the respondent’s height in inches, weight in pounds as well as their desired weight, wtdesire, age in years, and gender.

> summary(cdc)

genhlth        exerany          hlthplan         smoke100          height    

excellent:4657   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :48.00 

very good:6972   1st Qu.:0.0000   1st Qu.:1.0000   1st Qu.:0.0000   1st Qu.:64.00 

good     :5675   Median :1.0000   Median :1.0000   Median :0.0000   Median :67.00 

fair     :2019   Mean   :0.7457   Mean   :0.8738   Mean   :0.4721   Mean   :67.18 

poor     : 677   3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:70.00 

Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :93.00 

weight         wtdesire          age        gender  

Min.   : 68.0   Min.   : 68.0   Min.   :18.00   m: 9569 

1st Qu.:140.0   1st Qu.:130.0   1st Qu.:31.00   f:10431 

Median :165.0   Median :150.0   Median :43.00           

Mean   :169.7   Mean   :155.1   Mean   :45.07           

3rd Qu.:190.0   3rd Qu.:175.0   3rd Qu.:57.00           

Max.   :500.0   Max.   :680.0   Max.   :99.00

 

> head(cdc$genhlth) #categorical, ordinal

[1] good      good      good      good      very good very good

Levels: excellent very good good fair poor

> levels(cdc$genhlth)

[1] "excellent" "very good" "good"      "fair"      "poor"    

> nlevels(cdc$genhlth)

[1] 5

> levels(cdc$genhlth)[2]

[1] "very good"

  

R also has built-in functions to compute summary statistics one by one. For instance, to calculate the mean, median, and variance of weight, type

> mean(cdc$weight)

[1] 169.683

> var(cdc$weight)

[1] 1606.484

> sd(cdc$weight)

[1] 40.08097

> median(cdc$weight)

[1] 165

> summary(cdc$weight) #numerical, continuous

Min. 1st Qu.  Median    Mean 3rd Qu.    Max.

68.0   140.0   165.0   169.7   190.0   500.0

 

The smoke100 variable indicates whether the respondent had smoked at least 100 cigarettes in her lifetime, smoked (1) or did not (0).

> summary(cdc$smoke100)

Min. 1st Qu.  Median    Mean 3rd Qu.    Max.

0.000   0.000   0.000   0.472   1.000   1.000

 ->what about categorical data? We would instead consider the sample frequency or relative frequency distribution. The function table does this for you by counting the number of times each kind of response was given. For example, to see the number of people who have smoked 100 cigarettes in their lifetime, type

> table(cdc$smoke100)

0     1

10559  9441

 

or instead look at the relative frequency distribution by typing

> table(cdc$smoke100)/20000

0       1

0.52795 0.47205

 

Next, we make a bar plot of the entries in the table by putting the table inside the barplot command.

> barplot(table(cdc$smoke100))

OR

> smoke = table(cdc$smoke100)

> barplot(smoke) 

 

We can also tabulate two variables at once. In the table below the rows refer to gender and the columns to smoke100; recall that 1 indicates that a respondent has smoked at least 100 cigarettes. To create a mosaic plot of this table, we would enter the following commands.

 

> gender_smokers = table(cdc$gender,cdc$smoke100)

> mosaicplot(gender_smokers)

--> The mosaic plot shows that males are more likely to smoke than females. The type of smoke100 is categorical (not ordinal). The explanatory variable goes on the x-axis and the response variable on the y-axis. The x-axis variable is the data that makes up, and so explains, the Y we want to understand, which is why it is called the explanatory or predictor variable; it is also called the independent variable. The y-axis value responds to X, so it is called the response or dependent variable.

 

> cdc[567,6] # the element in row 567, column 6 (weight) of the data set

[1] 160

> cdc[1:10,6] #To see the weights for the first 10 respondents,

[1] 175 125 105 132 150 114 194 170 150 180

> cdc[1:10,] #Finally, if we want all of the data for the first 10 respondents, type

genhlth exerany hlthplan smoke100 height weight wtdesire age gender

1       good       0        1        0     70    175      175  77      m

2       good       0        1        1     64    125      115  33      f

3       good       1        1        1     60    105      105  49      f

4       good       1        1        0     66    132      124  42      f

5  very good       0        1        0     61    150      130  55      f

6  very good       1        1        0     64    114      114  55      f

7  very good       1        1        0     71    194      185  31      m

8  very good       0        1        0     67    170      160  45      m

9       good       0        1        1     65    150      130  27      f

10      good       1        1        0     70    180      170  44      m

 

#this command will create a new data set called mdata that contains only the men from the cdc data set.

> mdata = subset(cdc, cdc$gender == "m")

> head(mdata)

genhlth exerany hlthplan smoke100 height weight wtdesire age gender

1       good       0        1        0     70    175      175  77      m

7  very good       1        1        0     71    194      185  31      m

8  very good       0        1        0     67    170      160  45      m

10      good       1        1        0     70    180      170  44      m

11 excellent       1        1        1     69    186      175  46      m

12      fair       1        1        1     69    168      148  62      m

 

You can use several of these conditions together with & and |. The & is read “and”, so this command gives you the data for men over the age of 30.

> m_and_over30 = subset(cdc, cdc$gender == "m" & cdc$age > 30)

> head(m_and_over30)

genhlth exerany hlthplan smoke100 height weight wtdesire age gender

1       good       0        1        0     70    175      175  77      m

7  very good       1        1        0     71    194      185  31      m

8  very good       0        1        0     67    170      160  45      m

10      good       1        1        0     70    180      170  44      m

11 excellent       1        1        1     69    186      175  46      m

12      fair       1        1        1     69    168      148  62      m

 

The | character is read “or” so that this command will take people who are men or over the age of 30

> m_or_over30=subset(cdc,cdc$gender=="m"|cdc$age>30)

> head(m_or_over30)

genhlth exerany hlthplan smoke100 height weight wtdesire age gender

1      good       0        1        0     70    175      175  77      m

2      good       0        1        1     64    125      115  33      f

3      good       1        1        1     60    105      105  49      f

4      good       1        1        0     66    132      124  42      f

5 very good       0        1        0     61    150      130  55      f

6 very good
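The difference between & and | is easier to see on a data set small enough to check by eye. A minimal sketch, using a made-up four-row data frame named people in place of cdc:

```r
# Tiny made-up data frame standing in for cdc.
people <- data.frame(gender = c("m", "m", "f", "f"),
                     age    = c(25, 40, 35, 22),
                     stringsAsFactors = FALSE)

is_male   <- people$gender == "m"
is_over30 <- people$age > 30

men_and_over30 <- people[is_male & is_over30, ]   # both conditions hold
men_or_over30  <- people[is_male | is_over30, ]   # at least one holds

# subset() evaluates the condition inside the data frame, so the
# people$ prefix can be dropped.
same_as_and <- subset(people, gender == "m" & age > 30)
```

Only the 40-year-old man satisfies both conditions, while three of the four rows satisfy at least one.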

 

Create a new object called under23_and_smoke that contains all observations of respondents under the age of 23 who have smoked at least 100 cigarettes in their lifetime.

> under23_and_smoke = subset(cdc,cdc$age<23 & smoke100=="1")

> str(under23_and_smoke)

'data.frame':  620 obs. of  9 variables:

  $ genhlth : Factor w/ 5 levels "excellent","very good",..: 1 2 1 3 2 2 4 4 1 4 ...

$ exerany : num  1 1 1 1 1 1 0 1 1 1 ...

$ hlthplan: num  0 0 1 1 1 0 1 1 0 1 ...

$ smoke100: num  1 1 1 1 1 1 1 1 1 1 ...

$ height  : num  66 70 74 64 62 64 71 72 63 71 ...

$ weight  : int  185 160 175 190 92 125 185 185 105 185 ...

$ wtdesire: int  220 140 200 140 92 115 185 170 100 150 ...

$ age     : int  21 18 22 20 21 22 20 19 19 18 ...

$ gender  : Factor w/ 2 levels "m","f": 1 2 1 2 2 2 1 1 1 1 ...

> summary(under23_and_smoke)

genhlth       exerany          hlthplan         smoke100     height    

excellent:110   Min.   :0.0000   Min.   :0.0000   Min.   :1   Min.   :59.00 

very good:244   1st Qu.:1.0000   1st Qu.:0.0000   1st Qu.:1   1st Qu.:65.00 

good     :204   Median :1.0000   Median :1.0000   Median :1   Median :68.00 

fair     : 53   Mean   :0.8145   Mean   :0.6952   Mean   :1   Mean   :67.92 

poor     :  9   3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:1   3rd Qu.:71.00 

Max.   :1.0000   Max.   :1.0000   Max.   :1   Max.   :79.00 

weight         wtdesire          age        gender

Min.   : 85.0   Min.   : 80.0   Min.   :18.00   m:305 

1st Qu.:130.0   1st Qu.:125.0   1st Qu.:19.00   f:315 

Median :155.0   Median :150.0   Median :20.00         

Mean   :158.9   Mean   :152.2   Mean   :20.22         

3rd Qu.:180.0   3rd Qu.:175.0   3rd Qu.:21.00         

Max.   :350.0   Max.   :315.0   Max.   :22.00         

> dim(under23_and_smoke)

[1] 620   9

 -> 620 observations are in the subset under23_and_smoke

 

The purpose of a boxplot is to provide a thumbnail sketch of a variable for the purpose of comparing across several categories. So we can, for example, compare the heights of men and women with the command below. The notation here is new. The ~ character can be read “versus” or “as a function of”. So we’re asking R to give us box plots of heights where the groups are defined by gender.

> boxplot(cdc$height ~ cdc$gender)

 

Next let’s consider a new variable that doesn’t show up directly in this data set: Body Mass Index (BMI). BMI is a weight to height ratio and, with weight in pounds and height in inches, can be calculated as follows.

> bmi = (cdc$weight / cdc$height^2) * 703

> boxplot(bmi~cdc$genhlth)

 --> Among people with excellent health, there are some with unusually low BMIs compared to the rest of the group. The IQR increases slightly as general health status declines (from excellent to poor). The median BMI is roughly 25 for all general health categories, and there is a slight increase in median BMI as general health status declines (from excellent to poor).
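The factor 703 in the BMI formula comes from converting pounds and inches to kilograms and metres. A minimal sketch, with hypothetical helper names bmi_imperial and bmi_metric, that compares the two forms for the first respondent listed earlier (175 lb, 70 in):

```r
# BMI from weight in pounds and height in inches; 703 approximately
# converts lb/in^2 to kg/m^2.
bmi_imperial <- function(weight_lb, height_in) {
  (weight_lb / height_in^2) * 703
}

# The same quantity from metric units, for comparison.
bmi_metric <- function(weight_kg, height_m) {
  weight_kg / height_m^2
}

# First respondent shown in the post: 175 lb, 70 in.
bmi_first        <- bmi_imperial(175, 70)
bmi_first_metric <- bmi_metric(175 * 0.45359237, 70 * 0.0254)
```

The two results agree to about two decimal places; the small gap is because 703 is a rounded conversion factor.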

 

 

To make a histogram:

> hist(bmi)

 

You can control the number of bins by adding an argument to the command. In the next two lines, we first make a default histogram of bmi and then one with 50 breaks.

> hist(bmi,breaks=50)

 

Using the same tools (the plot function) make a scatterplot of weight versus desired weight.

 > plot(cdc$weight,cdc$wtdesire,xlim=c(0,700))

   

 ->moderately strong positive linear association


source: https://class.coursera.org/statistics-004 (coursera: Data Analysis and Statistical Inference by Duke Univ. by Dr. Mine Çetinkaya-Rundel)

 

The present data set refers to the number of male and female births in the United States. The data set contains the data for all years from 1940 to 2002. We can take a look at the data by typing its name into the console.

> source("http://www.openintro.org/stat/data/present.R")
> head(present)
  year    boys   girls
1 1940 1211684 1148715
2 1941 1289734 1223693
3 1942 1444365 1364631
4 1943 1508959 1427901
5 1944 1435301 1359499
6 1945 1404587 


You can see the dimensions of this data frame by typing:

> dim(present)
[1] 63  3
> names(present)
[1] "year"  "boys"  "girls"
> str(present)
'data.frame':   63 obs. of  3 variables:
 $ year : num  1940 1941 1942 1943 1944 ...
 $ boys : num  1211684 1289734 1444365 1508959 1435301 ...
 $ girls: num  1148715 1223693 1364631 1427901 1359499 ...

 

Let’s start to examine the data a little more closely. We can access the data in a single column of a data frame separately using a command like

> present$boys
 [1] 1211684 1289734 1444365 1508959 1435301 1404587 1691220 1899876 1813852 1826352 1823555 1923020 1971262 2001798
[15] 2059068 2073719 2133588 2179960 2152546 2173638 2179708 2186274 2132466 2101632 2060162 1927054 1845862 1803388
[29] 1796326 1846572 1915378 1822910 1669927 1608326 1622114 1613135 1624436 1705916 1709394 1791267 1852616 1860272
[43] 1885676 1865553 1879490 1927983 1924868 1951153 2002424 2069490 2129495 2101518 2082097 2048861 2022589 1996355
[57] 1990480 1985596 2016205 2026854 2076969 2057922 2057979

 

We can create a simple plot of the number of girls born per year with the command

> plot(x = present$year, y = present$girls)

-> There is initially an increase in the number of girls born, which peaks around 1960. After 1960 there is a decrease in the number of girls born, but the number begins to increase again in the early 1970s. Overall the trend is an increase in the number of girls born in the US since the 1940s.


If we wanted to connect the data points with lines, we could add a third argument, the letter l for line.

> plot(x = present$year, y = present$girls,type="l")

 

If we wanted to add a title, we could use the main argument.

> plot(x = present$year, y = present$girls,main="Girls born per year")

 


To see the total number of births in each year, we add the vectors of births for boys and girls; R computes all the sums simultaneously.

> present$boys+present$girls
 [1] 2360399 2513427 2808996 2936860 2794800 2735456 3288672 3699940 3535068
[10] 3559529 3554149 3750850 3846986 3902120 4017362 4047295 4163090 4254784
[19] 4203812 4244796 4257850 4268326 4167362 4098020 4027490 3760358 3606274
[28] 3520959 3501564 3600206 3731386 3555970 3258411 3136965 3159958 3144198
[37] 3167788 3326632 3333279 3494398 3612258 3629238 3680537 3638933 3669141
[46] 3760561 3756547 3809394 3909510 4040958 4158212 4110907 4065014 4000240
[55] 3952767 3899589 3891494 3880894 3941553 3959417 4058814 4025933 4021726

 

We can make a plot of the total number of births per year with the command

> plot(present$year,present$boys+present$girls,type="l")

  

-> The total number of births in the U.S. peaked in 1961.
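The peak year can also be read off programmatically instead of from the plot. A minimal sketch, using five consecutive totals copied from the output above (1959 to 1963) rather than the full data set:

```r
# Five consecutive years copied from the output above (1959 to 1963).
year  <- 1959:1963
total <- c(4244796, 4257850, 4268326, 4167362, 4098020)

peak_index <- which.max(total)   # position of the largest total
peak_year  <- year[peak_index]
```

On the full data frame the same idea would be present$year[which.max(present$boys + present$girls)].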


The proportion of newborns that are boys

> present$boys/(present$boys+present$girls)
 [1] 0.5133386 0.5131376 0.5141926 0.5138001 0.5135613 0.5134745 0.5142562
 [8] 0.5134883 0.5131024 0.5130881 0.5130778 0.5126891 0.5124173 0.5130027
[15] 0.5125423 0.5123716 0.5125011 0.5123550 0.5120462 0.5120713 0.5119269
[22] 0.5122088 0.5117064 0.5128408 0.5115250 0.5124656 0.5118474 0.5121866
[29] 0.5130068 0.5129073 0.5133154 0.5126337 0.5124973 0.5127013 0.5133340
[36] 0.5130513 0.5127982 0.5128057 0.5128266 0.5126110 0.5128692 0.5125792
[43] 0.5123372 0.5126648 0.5122425 0.5126849 0.5124035 0.5121951 0.5121931
[50] 0.5121286 0.5121179 0.5112054 0.5121992 0.5121845 0.5116894 0.5119398
[57] 0.5114951 0.5116337 0.5115255 0.5119072 0.5117182 0.5111665 0.5117154

We can plot this proportion against the year with the command

> plot(present$year, present$boys/(present$boys+present$girls))

-> Based on the plot, the proportion of boys born in the US has decreased over time, and in every year more boys were born than girls.

 

 We can ask if boys outnumber girls in each year with the expression

> present$boys > present$girls
 [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[15] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[29] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[43] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[57] TRUE TRUE TRUE TRUE TRUE TRUE TRUE

-> Every year there are more boys born than girls.

 

> plot(present$year,present$boys-present$girls)

 

If the exact values on the y-axis are hard to read, we can check the range of the differences and set the axis limits ourselves:

> dd<-present$boys-present$girls
> min(dd,na.rm=T)
[1] 62969
> max(dd,na.rm=T)
[1] 105244
> plot(present$year,dd,ylim=c(60000,120000))

  

 

If we wanted to label the y-axis and add a title,

> plot(present$year,dd,ylab="differences number of boys and girls", 
ylim=c(60000,120000),main="Absolute differences")


 

-> The largest absolute difference between the numbers of boys and girls born occurred in 1963. The boy-to-girl ratio initially decreases, then increases between 1960 and 1970, and then decreases again.


To mark the means of the x and y variables on the plot:

> plot(present$boys,present$girls,xlim=c(1100000,2000000),
ylim=c(1100000,2200000))
> abline(v=mean(present$boys),lty=2) 
> abline(h=mean(present$girls),lty=2)





source: R을 이용한 데이터 처리 & 분석 실무, by 서민구

http://book.naver.com/bookdb/book_detail.nhn?bid=8317471


6. Data frame: the most important data type in R

> (d<-data.frame(x=c(1:5),y=c(2:10,2)))

   x  y

1  1  2

2  2  3

3  3  4

4  4  5

5  5  6

6  1  7

7  2  8

8  3  9

9  4 10

10 5  2

> (d<-data.frame(x=rep(1:2,times=3),y=seq(2,12,2)))

  x  y

1 1  2

2 2  4

3 1  6

4 2  8

5 1 10

6 2 12

>

> (d<-data.frame(x=rep(1:2,times=3),

+                y=seq(2,12,2),

+                z=c('M','F','M','F','M','F')))

  x  y z

1 1  2 M

2 2  4 F

3 1  6 M

4 2  8 F

5 1 10 M

6 2 12 F

>

> str(d)

'data.frame':    6 obs. of  3 variables:

 $ x: int  1 2 1 2 1 2

 $ y: num  2 4 6 8 10 12

 $ z: Factor w/ 2 levels "F","M": 2 1 2 1 2 1

> d$x

[1] 1 2 1 2 1 2

> d$x<-rep(1:2,each=3)

> d$x

[1] 1 1 1 2 2 2

> d$w<-c("A","B","C","D","E","F")

> d

  x  y z w

1 1  2 M A

2 1  4 F B

3 1  6 M C

4 2  8 F D

5 2 10 M E

6 2 12 F F

> str(d)

'data.frame':    6 obs. of  4 variables:

 $ x: int  1 1 1 2 2 2

 $ y: num  2 4 6 8 10 12

 $ z: Factor w/ 2 levels "F","M": 2 1 2 1 2 1

 $ w: chr  "A" "B" "C" "D" ...

>

> (x<-data.frame(1:3))

  X1.3

1    1

2    2

3    3

> colnames(x)<-'val'

> x

  val

1   1

2   2

3   3

> rownames(x)<-c('a','b','c')

> x

  val

a   1

b   2

c   3

>

> d<-data.frame(x=c(1:5),y=c(2,4,6,8,10))

> d

  x  y

1 1  2

2 2  4

3 3  6

4 4  8

5 5 10

> rownames(d)<-c('a','b','c','d','e')

> d

  x  y

a 1  2

b 2  4

c 3  6

d 4  8

e 5 10

>

> d$x

[1] 1 2 3 4 5

> d$y

[1]  2  4  6  8 10

> d[1,]

  x y

a 1 2

> d[,1]

[1] 1 2 3 4 5

> d$a

NULL

> d["a"]

Error in `[.data.frame`(d, "a") : undefined columns selected

> d["a",]

  x y

a 1 2

> d

  x  y

a 1  2

b 2  4

c 3  6

d 4  8

e 5 10

> d[c(1,3),]

  x y

a 1 2

c 3 6

> d[c(1,3),1]

[1] 1 3

> d[c(1,3),2]

[1] 2 6

>

> d[-1,]

  x  y

b 2  4

c 3  6

d 4  8

e 5 10

> d[-1,-2]

[1] 2 3 4 5

> d[c("x","y")]

  x  y

a 1  2

b 2  4

c 3  6

d 4  8

e 5 10

> d[,c("x","y")]

  x  y

a 1  2

b 2  4

c 3  6

d 4  8

e 5 10

> d[c("x","y"),]

      x  y

NA   NA NA

NA.1 NA NA

> d[,"x"]

[1] 1 2 3 4 5

> d[,c("x")]

[1] 1 2 3 4 5

> d[,c("x"),drop=F]

  x

a 1

b 2

c 3

d 4

e 5

> (d<-data.frame(a=1:3,b=4:6,c=7:9))

  a b c

1 1 4 7

2 2 5 8

3 3 6 9

> d[,names(d)%in%c("b","c")]

  b c

1 4 7

2 5 8

3 6 9

> d[,!names(d)%in%c("b","c")]

[1] 1 2 3

> d[,!names(d)%in%c("b","c"),drop=F]

  a

1 1

2 2

3 3
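When a selection leaves a single column, R simplifies the result to a vector unless drop = FALSE is given as an argument of the [ ] call itself. A minimal sketch of the difference, reusing the three-column data frame from above; the names keep, as_vector and as_frame are made up for this example:

```r
d <- data.frame(a = 1:3, b = 4:6, c = 7:9)
keep <- !names(d) %in% c("b", "c")   # TRUE only for column "a"

as_vector <- d[, keep]                # one column left: simplified to a vector
as_frame  <- d[, keep, drop = FALSE]  # drop = FALSE keeps the data frame
```

Note that drop belongs inside the square brackets; placed inside c() it just becomes another element being matched against the column names.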

> View(d)


7. Checking types

> class(c(1,2))

[1] "numeric"

> class(matrix(c(1,2)))

[1] "matrix"

> class(data.frame(x=c(1,2),y=c(3,4)))

[1] "data.frame"

> str(c(1,2))

 num [1:2] 1 2

> str(matrix(c(1,2)))

 num [1:2, 1] 1 2

> str(list(c(1,2)))

List of 1

 $ : num [1:2] 1 2

> str(data.frame(x=c(1,2)))

'data.frame':    2 obs. of  1 variable:

 $ x: num  1 2

> is.factor(factor(c("m","f")))

[1] TRUE

> is.numeric(c(1:5))

[1] TRUE

> is.character(c("a","b"))

[1] TRUE

> is.data.frame(data.frame(x=1:6))

[1] TRUE

 

8. Converting types

> y<-c("a","b","c")

> as.factor(y)

[1] a b c

Levels: a b c

> y

[1] "a" "b" "c"

> as.character(as.factor(y))

[1] "a" "b" "c"

> x<-data.frame(1:9,ncol=3)

> x

  X1.9 ncol

1    1    3

2    2    3

3    3    3

4    4    3

5    5    3

6    6    3

7    7    3

8    8    3

9    9    3

> is.data.frame(x)

[1] TRUE

> is.matrix(x)

[1] FALSE

> x<-matrix(1:9,ncol=3)

> x

     [,1] [,2] [,3]

[1,]    1    4    7

[2,]    2    5    8

[3,]    3    6    9

> is.matrix(x)

[1] TRUE

> as.data.frame(x)

  V1 V2 V3

1  1  4  7

2  2  5  8

3  3  6  9

> is.data.frame(x)

[1] FALSE

> (x<-data.frame(matrix(c(1:4),ncol=2)))

  X1 X2

1  1  3

2  2  4

> list(x=c(1,2),y=c(3,4))

$x

[1] 1 2

 

$y

[1] 3 4

> data.frame(list(x=c(1,2),y=c(3,4)))

  x y

1 1 3

2 2 4

> factor(c("m","f"),levels=c("m","f"))

[1] m f

Levels: m f

 


source: R을 이용한 데이터 처리 & 분석 실무, by 서민구

http://book.naver.com/bookdb/book_detail.nhn?bid=8317471


1. Scalar: a single value, e.g. 1, 2, 3

1) NA: the constant "Not Available". It marks a missing value, i.e. a case where the value is absent.

> four<-NA

> is.na(four)

[1] TRUE

 

2) NULL: a different concept from NA; it means undefined, i.e. no value has been set.

> x<-NULL

> is.null(x)

[1] TRUE

> is.null(1)

[1] FALSE

> is.null(NA)

[1] FALSE

> is.na(NULL)

logical(0)

Warning message:

In is.na(NULL) : is.na() applied to non-(list or vector) of type 'NULL'

 

3) Logical values: TRUE/FALSE, with T and F as shorthand

> T&F

[1] FALSE

> T|T

[1] TRUE

> T|F

[1] TRUE

> T<-TRUE

> T

 [1] TRUE

> T<-FALSE

> T

[1] FALSE

> TRUE<-FALSE

Error in TRUE <- FALSE : invalid (do_set) left-hand side to assignment

 

A part I do not fully understand yet and need to study more.

 

> c(TRUE,TRUE)&c(TRUE,FALSE)

[1]  TRUE FALSE

> c(TRUE,TRUE)&&c(TRUE,FALSE)

[1] TRUE
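The two operators differ in that & works element by element, while && is meant for single values. In the R version used for this post, && silently looked only at the first element of each side, which is why the result above is TRUE; from R 4.3.0 on, giving && a vector longer than one is an error. A minimal sketch of the distinction, with made-up variable names:

```r
a <- c(TRUE, TRUE)
b <- c(TRUE, FALSE)

elementwise <- a & b            # compares position by position
all_true    <- all(a & b)       # collapse to one value explicitly
any_true    <- any(a & b)

# && and || expect single values and stop as soon as the answer is known,
# so the right-hand side below is never evaluated.
short_circuit <- FALSE && stop("never reached")
```

In practice & is used for filtering vectors and data frames, and && for conditions inside if().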

 

 

4) Factor: a data type for representing categorical data.

 - Nominal: categories with no inherent order

 - Ordinal: categories with an inherent order


> sex<-factor("m",c("m","f")) # sex is set to "m"; the factor levels are restricted to "m" and "f"

> sex

[1] m

Levels: m f

> nlevels(sex)

[1] 2

> levels(sex)

[1] "m" "f"

> levels(sex)[1]

[1] "m"

> levels(sex)[2]

[1] "f"

> levels(sex)<-c("male","female")

> sex

[1] male

Levels: male female

> factor(c("m","m","f"),c("m","f"))

[1] m m f

Levels: m f

> factor(c("m","m","f"))

[1] m m f

Levels: f m

> ordered("a",c("a","b","c"))

[1] a

Levels: a < b < c
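What an ordered factor adds over a plain factor is that order comparisons become meaningful. A minimal sketch with made-up levels low, mid and high:

```r
# Made-up sizes; the level order is given explicitly.
size <- factor(c("low", "high", "mid"),
               levels = c("low", "mid", "high"),
               ordered = TRUE)

below_high <- size < "high"     # order comparisons work on ordered factors
codes      <- as.integer(size)  # underlying integer codes follow the level order
largest    <- as.character(max(size))
```

The comparison follows the order of the levels, not alphabetical order, so "mid" counts as smaller than "high".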

 

2. Vector: an array-like structure that stores data of a single scalar type.

e.g. an array that holds only numbers, or only strings, is a vector.
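Because a vector holds a single type, mixing types inside c() makes R coerce every element to a common type. A short sketch with made-up values:

```r
numbers <- c(1, 3, 4)
mixed   <- c(1, "a", TRUE)   # everything is coerced to character

numbers_class <- class(numbers)
mixed_class   <- class(mixed)
```

Here the number and the logical value both end up as strings, which is easy to miss when reading data in.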

> x<-c(1,3,4)

> names(x)<-c("a","b","c")

> x

a b c

1 3 4

> x[1]

a

1

> x[-1]

b c

3 4

> x[c(1,3)]

a c

1 4

> x["a"]

a

1

> x[c("b","c")]

b c

3 4

> names(x)[2]

[1] "b"

> length(x)

[1] 3

> nrow(x)

NULL

> NROW(x)

[1] 3

> identical(c(1,2,3),c(1,2,3)) #객체가 동일한지 판단

[1] TRUE

> identical(c(1,2,3),c(1,2,1))

[1] FALSE

> "a"%in%c("a","b","c")

[1] TRUE

> "d"%in%c("a","b","c")

[1] FALSE

> x<-c(1:5)

> x

[1] 1 2 3 4 5

> x+1

[1] 2 3 4 5 6

> 10-x

[1] 9 8 7 6 5

> c(1,2,3)==c(1,2,3)

[1] TRUE TRUE TRUE

> union(c("a","b"),"d")

[1] "a" "b" "d"

> intersect(c("a","b","c"),c("a","d"))

[1] "a"

> setequal(c("a","b","c"),c("a","d"))

[1] FALSE

> setequal(c("a","b"),c("a","b","b"))

[1] TRUE

> setequal(c("a","b"),c("a","b","c"))

[1] FALSE

 

> seq(3,7)

[1] 3 4 5 6 7

> seq(7,3)

[1] 7 6 5 4 3

> seq(3,7,2)

[1] 3 5 7

> seq(3,7,3)

[1] 3 6

> x<-seq(2,10,2)

> x

[1]  2  4  6  8 10

> 1:NROW(x)

[1] 1 2 3 4 5

> seq_along(x)

[1] 1 2 3 4 5

> rep(1:2,times=5)

 [1] 1 2 1 2 1 2 1 2 1 2

> rep(1:2,each=5)

 [1] 1 1 1 1 1 2 2 2 2 2

> rep(1:2,each=5,times=2)

 [1] 1 1 1 1 1 2 2 2 2 2 1 1 1 1 1 2 2 2 2 2

> (x<-list(name="foo",height=70))

$name

[1] "foo"

 

$height

[1] 70

 

3. list: unlike a vector, it can hold values of different types.

> (x<-list(names="foo",height=c(1,3,5)))

$names

[1] "foo"

 

$height

[1] 1 3 5

 

> list(a=list(val=c(1,2,3)),b=list(val=c(1,2,3,4)))

$a

$a$val

[1] 1 2 3

 

 

$b

$b$val

[1] 1 2 3 4

 

 

> x$name

[1] "foo"

> x

$names

[1] "foo"

 

$height

[1] 1 3 5

 

> x$names

[1] "foo"

> x$height

[1] 1 3 5

> x[1]

$names

[1] "foo"

 

> x[[1]]

[1] "foo"
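Two details in the transcript above are worth spelling out: single brackets return a list while double brackets return the element, and x$name works even though the element is called names because $ allows partial matching. A minimal sketch using the same list:

```r
x <- list(names = "foo", height = c(1, 3, 5))

sub_list <- x[1]         # single brackets: a list of length 1
element  <- x[[1]]       # double brackets: the element itself

partial  <- x$name           # $ matches the unique prefix "name" to "names"
exact    <- x[["name"]]      # [[ ]] matches exactly by default, so this is NULL
```

Using [[ ]] with the full name avoids surprises from partial matching when several element names share a prefix.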

 

4. matrix

> matrix(c(1:9),nrow=3,byrow=T)

     [,1] [,2] [,3]

[1,]    1    2    3

[2,]    4    5    6

[3,]    7    8    9

> matrix(c(1:9),nrow=3)

     [,1] [,2] [,3]

[1,]    1    4    7

[2,]    2    5    8

[3,]    3    6    9

> matrix(c(1:9),nrow=3,

+        dimnames=list(c("r1","r2","r3"),c("c1","c2","c3")))

   c1 c2 c3

r1  1  4  7

r2  2  5  8

r3  3  6  9

> (x<-matrix(c(1:9),nrow=3))

     [,1] [,2] [,3]

[1,]    1    4    7

[2,]    2    5    8

[3,]    3    6    9

> dimnames(x)<-list(c("r4","r5","r6"),c("c4","c5","c6"))

> x

   c4 c5 c6

r4  1  4  7

r5  2  5  8

r6  3  6  9

>

> x<-matrix(1:9,ncol=3)

> x

     [,1] [,2] [,3]

[1,]    1    4    7

[2,]    2    5    8

[3,]    3    6    9

> rownames(x)<-c("r1","r2","r3")

> x

   [,1] [,2] [,3]

r1    1    4    7

r2    2    5    8

r3    3    6    9

> colnames(x)<-c("c1","c2","c3")

> x

   c1 c2 c3

r1  1  4  7

r2  2  5  8

r3  3  6  9

> x[1,1]

[1] 1

> x[,2]

r1 r2 r3

 4  5  6

> x[1:2,]

   c1 c2 c3

r1  1  4  7

r2  2  5  8

> x[,1:2]

   c1 c2

r1  1  4

r2  2  5

r3  3  6

> x[c(1,2),c(2,1)]

   c2 c1

r1  4  1

r2  5  2

> x[c(1,1),c(1,3)]

   c1 c3

r1  1  7

r1  1  7

> x[c(1,3),c(1,3)]

   c1 c3

r1  1  7

r3  3  9

> x[c(2,1),c(1,2)]

   c1 c2

r2  2  5

r1  1  4

> x<-matrix(1:9,nrow=3)

> x*2

     [,1] [,2] [,3]

[1,]    2    8   14

[2,]    4   10   16

[3,]    6   12   18

> x+x

     [,1] [,2] [,3]

[1,]    2    8   14

[2,]    4   10   16

[3,]    6   12   18

> x%*%x

     [,1] [,2] [,3]

[1,]   30   66  102

[2,]   36   81  126

[3,]   42   96  150

> t(x)

     [,1] [,2] [,3]

[1,]    1    2    3

[2,]    4    5    6

[3,]    7    8    9

> (x<-matrix(c(1:4),ncol=2))

     [,1] [,2]

[1,]    1    3

[2,]    2    4

> solve(x)

     [,1] [,2]

[1,]   -2  1.5

[2,]    1 -0.5
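
The operators used above are easy to confuse, so here is a minimal sketch that contrasts the element-wise product with the matrix product and checks the result of `solve()`. The name `m` is only illustrative.

```r
m <- matrix(1:4, ncol = 2)

m * m          # element-wise product
m %*% m        # matrix product

inv <- solve(m)                  # inverse of m
inv %*% m                        # numerically the 2 x 2 identity
all.equal(inv %*% m, diag(2))    # compares with a tolerance, not exactly
```

`all.equal()` is used instead of `==` because the inverse is computed in floating point.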

> (x<-matrix(c(1:6),nrow=2))

     [,1] [,2] [,3]

[1,]    1    3    5

[2,]    2    4    6

> dim(x)

[1] 2 3

> dim(x)<-c(3,2)

> x

     [,1] [,2]

[1,]    1    4

[2,]    2    5

[3,]    3    6

  

5. Array: where a matrix is two-dimensional, an array can hold data in any number of dimensions.

> array(1:12,dim=c(3,4))

     [,1] [,2] [,3] [,4]

[1,]    1    4    7   10

[2,]    2    5    8   11

[3,]    3    6    9   12

> (x<-array(1:12,dim=c(2,2,3)))

, , 1

 

     [,1] [,2]

[1,]    1    3

[2,]    2    4

 

, , 2

 

     [,1] [,2]

[1,]    5    7

[2,]    6    8

 

, , 3

 

     [,1] [,2]

[1,]    9   11

[2,]   10   12

 

> x[1,1,3]

[1] 9
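
To make the three-index lookup concrete, the sketch below pulls out a single element and a whole slice, and shows how the indices map to a position in the underlying vector. The name `arr` is an illustrative stand-in for `x`.

```r
arr <- array(1:12, dim = c(2, 2, 3))

arr[1, 1, 3]       # row 1, column 1 of the third slice
arr[, , 2]         # the whole second slice, returned as a 2 x 2 matrix
dim(arr[, , 2])

# Position in the underlying vector for index (i, j, k):
# i + (j - 1) * nrow + (k - 1) * nrow * ncol
i <- 1; j <- 1; k <- 3
(1:12)[i + (j - 1) * 2 + (k - 1) * 2 * 2]
```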

Posted by 마르띤

source: https://github.com/derekfranks/practice_assignment

To begin, download this file and unzip it into your R working directory.
http://s3.amazonaws.com/practice_assignment/diet_data.zip

 

> setwd("D:/temp/r_temp")

> list.files("diet_data")

[1] "Andy.csv"  "David.csv" "John.csv" 

[4] "Mike.csv"  "Steve.csv"



> andy<-read.csv("diet_data/Andy.csv")

> head(andy)

  Patient.Name Age Weight Day

1         Andy  30    140   1

2         Andy  30    140   2

3         Andy  30    140   3

4         Andy  30    139   4

5         Andy  30    138   5

6         Andy  30    138   6

 

> length(andy$Day)

[1] 30

> nrow(andy)

[1] 30

> ncol(andy)

[1] 4

> length(andy)

[1] 4

> dim(andy)

[1] 30  4

> str(andy)

'data.frame':  30 obs. of  4 variables:

 $ Patient.Name: Factor w/ 1 level "Andy": 1 1 1 1 1 1 1 1 1 1 ...

 $ Age         : int  30 30 30 30 30 30 30 30 30 30 ...

 $ Weight      : int  140 140 140 139 138 138 138 138 138 138 ...

 $ Day         : int  1 2 3 4 5 6 7 8 9 10 ...

> summary(andy)

 Patient.Name      Age         Weight           Day       

 Andy:30      Min.   :30   Min.   :135.0   Min.   : 1.00  

              1st Qu.:30   1st Qu.:137.0   1st Qu.: 8.25  

              Median :30   Median :137.5   Median :15.50  

              Mean   :30   Mean   :137.3   Mean   :15.50  

              3rd Qu.:30   3rd Qu.:138.0   3rd Qu.:22.75  

              Max.   :30   Max.   :140.0   Max.   :30.00  

> names(andy)

[1] "Patient.Name" "Age"          "Weight"       "Day" 

 

> andy[1,"Weight"]

[1] 140

> andy[1,"weight"]

NULL

> andy[30,"Weight"]

[1] 135

> andy[which(andy$day)==30,"Weight"]

Error in which(andy$day) : argument to 'which' is not logical

> andy[which(andy$Day)==30,"Weight"]

Error in which(andy$Day) : argument to 'which' is not logical

> andy[which(andy$Day==30),"Weight"]

[1] 135

> andy[which(andy[,Day]==30),"Weight"]

Error in `[.data.frame`(andy, , Day) : object 'Day' not found

> andy[which(andy[,"Day"]==30),"Weight"]

[1] 135

> subset(andy$Weight, andy$Day==30)

[1] 135
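
The errors above come from where the comparison sits: `which()` needs a logical vector, so `== 30` has to go inside the parentheses. A minimal sketch with a hypothetical data frame `df` (not the real Andy.csv) shows three equivalent ways to select the value.

```r
# Hypothetical stand-in for the Andy data
df <- data.frame(Day = 1:5, Weight = c(140, 140, 139, 138, 137))

df$Day == 3                        # logical vector, one entry per row
which(df$Day == 3)                 # positions where the test is TRUE

df[which(df$Day == 3), "Weight"]   # index by position
df[df$Day == 3, "Weight"]          # index by the logical vector directly
subset(df, Day == 3)$Weight        # subset() on the data frame
```

One practical difference: `which()` silently drops comparisons that are `NA`, whereas indexing with a logical vector that contains `NA` returns `NA` entries.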

 

> andy_start<-subset[andy$Weight,andy$Day==1]

Error in subset[andy$Weight, andy$Day == 1] : 

  object of type 'closure' is not subsettable

> andy_start<-subset(andy$Weight,andy$Day==1)

> andy_start

[1] 140

> andy_start<-andy[1,"Weight"]

> andy_end<-andy[30,"Weight"]

> andy_loss<-andy_start - andy_end

> andy_loss

[1] 5

 

> files<-list.files("diet_data")

> files

[1] "Andy.csv"  "David.csv" "John.csv"  "Mike.csv"  "Steve.csv"

> files[1]

[1] "Andy.csv"

> files[-1]

[1] "David.csv" "John.csv"  "Mike.csv"  "Steve.csv"

> files[2]

[1] "David.csv"

> files[3:5]

[1] "John.csv"  "Mike.csv"  "Steve.csv"

> files[c(1,3)]

[1] "Andy.csv" "John.csv"


> head(read.csv(files[3]))

Error in file(file, "rt") : cannot open the connection

In addition: Warning message:

In file(file, "rt") :

  cannot open file 'John.csv': No such file or directory

> files_full <- list.files("diet_data",full.names=TRUE)

> files_full

[1] "diet_data/Andy.csv"  "diet_data/David.csv" "diet_data/John.csv"

[4] "diet_data/Mike.csv"  "diet_data/Steve.csv"

> head(read.csv(files_full[3]))

  Patient.Name Age Weight Day

1         John  22    175   1

2         John  22    175   2

3         John  22    175   3

4         John  22    175   4

5         John  22    175   5

6         John  22    175   6
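
The failed `read.csv(files[3])` call happened because `list.files()` returns bare file names unless `full.names=TRUE` is set. The sketch below reproduces this in a temporary directory; the directory name and the data are invented for the example.

```r
# Build a throwaway directory with one CSV (hypothetical data)
dir_path <- file.path(tempdir(), "diet_demo")
dir.create(dir_path, showWarnings = FALSE)
write.csv(data.frame(Day = 1:3, Weight = c(175, 175, 174)),
          file.path(dir_path, "John.csv"), row.names = FALSE)

bare <- list.files(dir_path)                     # file name only
full <- list.files(dir_path, full.names = TRUE)  # directory plus file name

file.exists(full)   # TRUE: the path points into the directory

# file.path() rebuilds a usable path from a bare name
john <- read.csv(file.path(dir_path, bare[1]))
head(john)
```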


> andy_david<-rbind(andy,david)

Error in rbind(andy, david) : object 'david' not found

> david<-read.csv("diet_data/David.csv")

> head(david)

  Patient.Name Age Weight Day

1        David  35    210   1

2        David  35    209   2

3        David  35    209   3

4        David  35    209   4

5        David  35    209   5

6        David  35    209   6

> andy_david<-rbind(andy,david)

> head(andy_david)

  Patient.Name Age Weight Day

1         Andy  30    140   1

2         Andy  30    140   2

3         Andy  30    140   3

4         Andy  30    139   4

5         Andy  30    138   5

6         Andy  30    138   6

> tail(andy_david)

   Patient.Name Age Weight Day

55        David  35    203  25

56        David  35    203  26

57        David  35    202  27

58        David  35    202  28

59        David  35    202  29

60        David  35    201  30


> andy_david<-rbind(andy,read.csv(files_full[2]))

> head(andy_david)

  Patient.Name Age Weight Day

1         Andy  30    140   1

2         Andy  30    140   2

3         Andy  30    140   3

4         Andy  30    139   4

5         Andy  30    138   5

6         Andy  30    138   6

> tail(andy_david)

   Patient.Name Age Weight Day

55        David  35    203  25

56        David  35    203  26

57        David  35    202  27

58        David  35    202  28

59        David  35    202  29

60        David  35    201  30

 

> day_25<-subset(andy_david[25,])

> day_25

   Patient.Name Age Weight Day

25         Andy  30    135  25

> day_26<-subset(andy_david[which(andy_david$Day==26),])

> day_26

   Patient.Name Age Weight Day

26         Andy  30    135  26

56        David  35    203  26

> day25<-andy_david[which(andy_david$Day == 25),]

> day25

   Patient.Name Age Weight Day

25         Andy  30    135  25

55        David  35    203  25

 

> for(i in 1:5){print [i]}

Error in print[i] : object of type 'closure' is not subsettable

> for(i in 1:5){print (i}

Error: unexpected '}' in "for(i in 1:5){print (i}"

> for(i in 1:5){print (i)}

[1] 1

[1] 2

[1] 3

[1] 4

[1] 5

> for(i in 1:5){

+        dat <- rbind(dat,read.csv(files_full[i]))

+ }

Error in rbind(dat, read.csv(files_full[i])) : object 'dat' not found

> dat<-data.frame()

> for(i in 1:5){

+        dat <- rbind(dat,read.csv(files_full[i]))

+ }

> str(dat)

'data.frame':    150 obs. of  4 variables:

 $ Patient.Name: Factor w/ 5 levels "Andy","David",..: 1 1 1 1 1 1 1 1 1 1 ...

 $ Age         : int  30 30 30 30 30 30 30 30 30 30 ...

 $ Weight      : int  140 140 140 139 138 138 138 138 138 138 ...

 $ Day         : int  1 2 3 4 5 6 7 8 9 10 ...

 

> for(i in 1:5 ){

+        dat2<-data.frame()

+        dat2<-rbind(dat2,read.csv(files_full[i])) 

+ }

> str(dat2)

'data.frame':    30 obs. of  4 variables:

 $ Patient.Name: Factor w/ 1 level "Steve": 1 1 1 1 1 1 1 1 1 1 ...

 $ Age         : int  55 55 55 55 55 55 55 55 55 55 ...

 $ Weight      : int  225 225 225 224 224 224 223 223 223 223 ...

 $ Day         : int  1 2 3 4 5 6 7 8 9 10 ...

> list.files("diet_data")

[1] "Andy.csv"  "David.csv" "John.csv"  "Mike.csv"  "Steve.csv"

> head(dat2)

  Patient.Name Age Weight Day

1        Steve  55    225   1

2        Steve  55    225   2

3        Steve  55    225   3

4        Steve  55    224   4

5        Steve  55    224   5

6        Steve  55    224   6

 

Because dat2<-data.frame() sits inside the loop, dat2 is reset to an empty data frame on every pass, so we only end up with the data from the last file in the list (Steve.csv).
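
The same effect can be reproduced without any files. The sketch below uses two small made-up data frames and also shows a loop-free alternative with `do.call(rbind, ...)`, which is a common idiom rather than something taken from the original tutorial.

```r
# Hypothetical per-person data frames standing in for the CSV files
pieces <- list(
  data.frame(name = "A", weight = c(140, 139)),
  data.frame(name = "B", weight = c(210, 209))
)

# Initialise once, outside the loop, then append on each pass
acc <- data.frame()
for (p in pieces) {
  acc <- rbind(acc, p)
}
nrow(acc)          # all rows are kept

# Resetting inside the loop discards the earlier passes
for (p in pieces) {
  last_only <- data.frame()
  last_only <- rbind(last_only, p)
}
nrow(last_only)    # only the final piece remains

# Loop-free alternative: bind the whole list in one call
all_rows <- do.call(rbind, pieces)
nrow(all_rows)
```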

 

> median(dat$Weight)

[1] NA

> nrow(dat)

[1] 150

> ncol(dat)

[1] 4

> dim(dat)

[1] 150   4

> median(dat$Weight,na.rm=TRUE)

[1] 190
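
The first `median()` call returned NA because at least one weight in the combined data is missing, and a missing value propagates through the calculation unless `na.rm=TRUE` is given. A minimal sketch with made-up numbers:

```r
w <- c(140, NA, 138, 135)   # hypothetical weights with one missing value

median(w)                   # NA: the missing value propagates
median(w, na.rm = TRUE)     # missing value dropped before computing
sum(is.na(w))               # how many values are missing
```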

> dat_30<-dat[which(dat$Day==30),]

> dat_30

    Patient.Name Age Weight Day

30          Andy  30    135  30

60         David  35    201  30

90          John  22    177  30

120         Mike  40    192  30

150        Steve  55    214  30

> median(dat_30$Weight)

[1] 192

 

> weightmedian<-function(directory,day){

+   files_list<-list.files(directory,full.names=TRUE) #Create a list of files

+   dat<-data.frame() #creates an empty data frame

+   for(i in seq_along(files_list)) { #loops through every file found, rbinding them together

+     dat<-rbind(dat,read.csv(files_list[i]))   

+   }

+   dat_subset<-dat[which(dat[,"Day"]==day),] #subsets the rows that match the 'day' arguments

+   median(dat_subset[,"Weight"],na.rm=TRUE) #identifies the median weight

+ }

> weightmedian(directory="diet_data",day=20)

[1] 197.5

> weightmedian("diet_data",4)

[1] 188
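
For comparison, here is a loop-free variant of the same function, offered as a sketch rather than as the tutorial's own solution. The function name `weight_median`, the directory `weight_demo`, and the data values are all invented so that the example runs without the diet_data files.

```r
# Sketch: same idea as weightmedian(), reading every CSV in the directory
weight_median <- function(directory, day) {
  files_list <- list.files(directory, pattern = "\\.csv$", full.names = TRUE)
  dat <- do.call(rbind, lapply(files_list, read.csv))  # read and stack all files
  median(dat[dat$Day == day, "Weight"], na.rm = TRUE)  # median for the chosen day
}

# Throwaway data (hypothetical values, not the real diet_data files)
demo_dir <- file.path(tempdir(), "weight_demo")
dir.create(demo_dir, showWarnings = FALSE)
write.csv(data.frame(Day = 1:2, Weight = c(140, 139)),
          file.path(demo_dir, "a.csv"), row.names = FALSE)
write.csv(data.frame(Day = 1:2, Weight = c(210, NA)),
          file.path(demo_dir, "b.csv"), row.names = FALSE)

weight_median(demo_dir, 1)   # median of 140 and 210
weight_median(demo_dir, 2)   # the NA is ignored, leaving 139
```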

 

 

Posted by 마르띤

R text-mining practice. Based on the book R까기; the text analyzed is the 2014 presidential New Year's address (source: Naver search).

> setwd("D:/temp/r_temp")


> library(KoNLP)

> library(wordcloud)


> useSejongDic()

> mergeUserDic(data.frame("청마","ncn")) # add this word to the dictionary loaded above

1 words were added to dic_user.txt.

> mergeUserDic(data.frame("수자원공사","ncn"))

1 words were added to dic_user.txt.

> mergeUserDic(data.frame("4대강","ncn"))

1 words were added to dic_user.txt.

> mergeUserDic(data.frame("원전비리","ncn"))

1 words were added to dic_user.txt.

> mergeUserDic(data.frame("시험성적서","ncn"))

1 words were added to dic_user.txt.

> mergeUserDic(data.frame("코레일","ncn"))

1 words were added to dic_user.txt.

> mergeUserDic(data.frame("벤처창업","ncn"))

1 words were added to dic_user.txt.

> mergeUserDic(data.frame("창업경제타운","ncn"))

1 words were added to dic_user.txt.

> mergeUserDic(data.frame("창조경제추진단","ncn"))

1 words were added to dic_user.txt. 

> mergeUserDic(data.frame("규제총량제","ncn"))

1 words were added to dic_user.txt.

> mergeUserDic(data.frame("세계평화공원","ncn"))

1 words were added to dic_user.txt.

> mergeUserDic(data.frame("유라시아","ncn"))

1 words were added to dic_user.txt.

> mergeUserDic(data.frame("원스톱","ncn"))

1 words were added to dic_user.txt.

> mergeUserDic(data.frame("자유학기제","ncn"))

1 words were added to dic_user.txt.


> txt<-readLines("president_lastyear.txt") # read the text to be analyzed into txt

> place<-sapply(txt,extractNoun,USE.NAMES=F) # extract only the nouns from each line of txt and store them in place

> head(unlist(place),30) # print the first 30 extracted nouns as a check

> nouns<-unlist(place) # flatten the list into a single vector for filtering

> place<-Filter(function(x){nchar(x)>=2},nouns) # keep only words of two or more characters


> place<-gsub(" ","",place) # use gsub to strip unwanted strings (spaces here, filler fragments below)

> place<-gsub("해서","",place)

> place<-gsub("하기","",place)

> place<-gsub("들이","",place)


> write(unlist(place),"president_lastyear_2.txt") # save to a file so it can be read back as a table

> rev<-read.table("president_lastyear_2.txt") # read the cleaned file back in as a table

> nrow(rev) # check how many rows rev contains

[1] 819

> wordcount<-table(rev) # tabulate rev into word frequencies and assign the result to wordcount

> head(sort(wordcount,decreasing=T),30) # top 30 words, most frequent first
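
The filter-then-count steps can be tried without KoNLP or the speech file. The sketch below uses a made-up English token vector in place of the extracted nouns; only base R functions are involved.

```r
# Hypothetical tokens standing in for the extracted nouns
tokens <- c("economy", "a", "peace", "economy", "jobs", "peace", "economy")

# Keep tokens of two or more characters
tokens <- Filter(function(x) nchar(x) >= 2, tokens)

# Count occurrences and sort from most to least frequent
counts <- sort(table(tokens), decreasing = TRUE)
head(counts, 2)
names(counts)[1]    # the most frequent token
```

`table()` returns a named count vector, which is why `names(wordcount)` and `wordcount` can be passed together to a plotting function.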

> library(RColorBrewer) # load the color palettes

> pal<-brewer.pal(9,"Set1") # choose the colors

> wordcloud(names(wordcount),freq=wordcount,scale=c(5,1),rot.per=0.25,min.freq=3,random.order=F,random.color=T,colors=pal) # draw only words that appear at least 3 times

> savePlot("2014president_newyearmessage.png",type="png") # save the plot as an image
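
`savePlot()` only works with certain on-screen devices (for example the Windows graphics device). A more portable way to get an image file, sketched here with an invented file name and a plain bar plot instead of the word cloud, is to open a `png()` device, draw, and close it. This assumes the R build has PNG support.

```r
out_file <- file.path(tempdir(), "demo_plot.png")   # hypothetical output path

png(out_file, width = 800, height = 600)   # open a PNG file device
barplot(c(A = 34, O = 28, B = 27, AB = 11), main = "demo")
dev.off()                                  # closing the device writes the file

file.exists(out_file)
```

The same pattern applies to any plotting call: everything drawn between `png()` and `dev.off()` goes into the file.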



Posted by 마르띤