Posts in the 'Python, R Analysis and Programming' category (Page 4)

dim, str, plot

Python, R Analysis and Programming 2015. 10. 29. 14:58

source: https://class.coursera.org/statistics-004 (coursera: Data Analysis and Statistical Inference by Duke Univ. by Dr. Mine Çetinkaya-Rundel)

The present data set refers to the number of male and female births in the United States. The data set contains the data for all years from 1940 to 2002. We can take a look at the data by typing its name into the console.

> source("http://www.openintro.org/stat/data/present.R")

> head(present)

  year    boys   girls

1 1940 1211684 1148715

2 1941 1289734 1223693

3 1942 1444365 1364631

4 1943 1508959 1427901

5 1944 1435301 1359499

6 1945 1404587 1330869

You can see the dimensions of this data frame by typing:

> dim(present)

[1] 63  3

> names(present)

[1] "year"  "boys"  "girls"

> str(present)

'data.frame':   63 obs. of  3 variables:

 $ year : num  1940 1941 1942 1943 1944 ...

 $ boys : num  1211684 1289734 1444365 1508959 1435301 ...

 $ girls: num  1148715 1223693 1364631 1427901 1359499 ...

Let’s start to examine the data a little more closely. We can access the data in a single column of a data frame separately using a command like

> present$boys

 [1] 1211684 1289734 1444365 1508959 1435301 1404587 1691220 1899876 1813852 1826352 1823555 1923020 1971262 2001798

[15] 2059068 2073719 2133588 2179960 2152546 2173638 2179708 2186274 2132466 2101632 2060162 1927054 1845862 1803388

[29] 1796326 1846572 1915378 1822910 1669927 1608326 1622114 1613135 1624436 1705916 1709394 1791267 1852616 1860272

[43] 1885676 1865553 1879490 1927983 1924868 1951153 2002424 2069490 2129495 2101518 2082097 2048861 2022589 1996355

[57] 1990480 1985596 2016205 2026854 2076969 2057922 2057979

We can create a simple plot of the number of girls born per year with the command

> plot(x = present$year, y = present$girls)

-> There is initially an increase in the number of girls born, which peaks around 1960. After 1960 there is a decrease in the number of girls born, but the number begins to increase again in the early 1970s. Overall the trend is an increase in the number of girls born in the US since the 1940s.

If we wanted to connect the data points with lines, we could add a third argument, the letter l for line.

> plot(x = present$year, y = present$girls,type="l")

To add a title, use the main argument:

> plot(x = present$year, y = present$girls,main="Girls born per year")

To see the total number of births in each year, we add the vectors of births for boys and girls. R computes all the sums at once, element by element.

> present$boys+present$girls

 [1] 2360399 2513427 2808996 2936860 2794800 2735456 3288672 3699940 3535068

[10] 3559529 3554149 3750850 3846986 3902120 4017362 4047295 4163090 4254784

[19] 4203812 4244796 4257850 4268326 4167362 4098020 4027490 3760358 3606274

[28] 3520959 3501564 3600206 3731386 3555970 3258411 3136965 3159958 3144198

[37] 3167788 3326632 3333279 3494398 3612258 3629238 3680537 3638933 3669141

[46] 3760561 3756547 3809394 3909510 4040958 4158212 4110907 4065014 4000240

[55] 3952767 3899589 3891494 3880894 3941553 3959417 4058814 4025933 4021726

We can make a plot of the total number of births per year with the command

> plot(present$year,present$boys+present$girls,type="l")

-> The total number of births in the U.S. was highest in 1961.
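As a minimal sketch of the element-wise addition, the snippet below rebuilds only the first five rows printed by head(present), so it runs without downloading the file. The name births is made up for this illustration, and which.max() is used to look up the year with the largest total.

```r
# First five rows of `present`, copied from the head() output above.
# `births` is a hypothetical name used only in this sketch.
births <- data.frame(
  year  = 1940:1944,
  boys  = c(1211684, 1289734, 1444365, 1508959, 1435301),
  girls = c(1148715, 1223693, 1364631, 1427901, 1359499)
)

# Adding two columns adds them element by element: one sum per year.
total <- births$boys + births$girls

# which.max() gives the position of the largest total; use it to index year.
peak_year <- births$year[which.max(total)]
print(total)
print(peak_year)
```

Applied to the full present data frame, the same two lines should point to the year whose total is largest in the printed sums.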

The proportion of newborns that are boys can be computed in the same way:

> present$boys/(present$boys+present$girls)

 [1] 0.5133386 0.5131376 0.5141926 0.5138001 0.5135613 0.5134745 0.5142562

 [8] 0.5134883 0.5131024 0.5130881 0.5130778 0.5126891 0.5124173 0.5130027

[15] 0.5125423 0.5123716 0.5125011 0.5123550 0.5120462 0.5120713 0.5119269

[22] 0.5122088 0.5117064 0.5128408 0.5115250 0.5124656 0.5118474 0.5121866

[29] 0.5130068 0.5129073 0.5133154 0.5126337 0.5124973 0.5127013 0.5133340

[36] 0.5130513 0.5127982 0.5128057 0.5128266 0.5126110 0.5128692 0.5125792

[43] 0.5123372 0.5126648 0.5122425 0.5126849 0.5124035 0.5121951 0.5121931

[50] 0.5121286 0.5121179 0.5112054 0.5121992 0.5121845 0.5116894 0.5119398

[57] 0.5114951 0.5116337 0.5115255 0.5119072 0.5117182 0.5111665 0.5117154

-> The proportion of boys born in the US has decreased over time, yet in every year more boys than girls were born. A plot of the proportion against the year shows the trend:

> plot(present$year, present$boys/(present$boys+present$girls))

We can ask if boys outnumber girls in each year with the expression

> present$boys > present$girls

 [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE

[15] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE

[29] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE

[43] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE

[57] TRUE TRUE TRUE TRUE TRUE TRUE TRUE

-> Every year there are more boys born than girls.
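The long list of TRUE values can be condensed into a single answer with all(). This is a small sketch on the first five rows shown by head(present); the vectors boys and girls are copies made for illustration.

```r
# Values copied from the first five rows of the head() output above.
boys  <- c(1211684, 1289734, 1444365, 1508959, 1435301)
girls <- c(1148715, 1223693, 1364631, 1427901, 1359499)

more_boys  <- boys > girls     # one TRUE/FALSE per year
every_year <- all(more_boys)   # TRUE only if every comparison is TRUE
n_years    <- sum(more_boys)   # TRUE counts as 1, so this counts the years

# Proportion of boys per year.
prop_boys <- boys / (boys + girls)
print(every_year)
print(n_years)
print(round(prop_boys, 7))
```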

> plot(present$year,present$boys-present$girls)

If the exact values on the y-axis are hard to read, check the minimum and maximum of the differences and set the axis limits explicitly:

> dd<-present$boys-present$girls

> min(dd,na.rm=T)

[1] 62969

> max(dd,na.rm=T)

[1] 105244

> plot(present$year,dd,ylim=c(60000,120000))
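range() returns the minimum and maximum in one call, which is handy when choosing ylim. This is a sketch on the first five rows shown by head(present); padding the limits by 10% is an arbitrary choice for illustration, not something the course prescribes.

```r
# Values copied from the first five rows of the head() output above.
boys  <- c(1211684, 1289734, 1444365, 1508959, 1435301)
girls <- c(1148715, 1223693, 1364631, 1427901, 1359499)
dd <- boys - girls

lims <- range(dd, na.rm = TRUE)   # c(min, max) in one call
pad  <- 0.1 * diff(lims)          # 10% of the spread, added on each side
ylim_padded <- c(lims[1] - pad, lims[2] + pad)

print(lims)
print(ylim_padded)
# The result could then be passed on, e.g. plot(year, dd, ylim = ylim_padded).
```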

To label the y-axis and add a title:

> plot(present$year,dd,ylab="Difference between boys and girls",

+ ylim=c(60000,120000),main="Absolute differences")

-> The largest absolute difference between the numbers of boys and girls born occurred in 1963. The boy-to-girl ratio initially decreases, then increases between 1960 and 1970, and decreases again afterwards.

To mark the means of the x-axis and y-axis variables:

> plot(present$boys,present$girls,xlim=c(1100000,2000000),

+ ylim=c(1100000,2200000))

> abline(v=mean(present$boys),lty=2)

> abline(h=mean(present$girls),lty=2)


Posted by 마르띤


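Frequency Distribution Tables and Histograms

Python, R Analysis and Programming 2015. 8. 12. 18:52

Reference book: 퇴근시간이 빨라지는 비지니스 통계 입문 (an introductory business statistics book)

1. What is a frequency distribution table?

A table that divides the whole data set into fixed intervals (classes) and shows how many data points fall into each interval.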

Trial	Juice volume (ml)	Trial	Juice volume (ml)
1	300	26	304
2	290	27	277
3	315	28	289
4	279	29	311
5	320	30	301
6	311	31	294
7	295	32	285
8	305	33	307
9	300	34	281
10	275	35	297
11	319	36	309
12	315	37	311
13	300	38	304
14	297	39	300
15	303	40	296
16	307	41	290
17	299	42	305
18	287	43	301
19	300	44	311
20	299	45	301
21	274	46	307
22	309	47	296
23	303	48	299
24	288	49	304
25	315	50	302
		Mean	299.74

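Let's build a frequency distribution table.

1) First, find the maximum, minimum, range, number of classes, and class width.

	raw data	rounded for easy calculation
Maximum	320	320
Minimum	274	270
Range	46	50
Number of classes	6	6
Class width	7.666667	10

2) Then fill in the frequency distribution table.

Lower	Upper	Class (juice volume)	Class mark	Frequency	% (relative frequency)
270	279	270~279	274.5	4	8.0%
280	289	280~289	284.5	5	10.0%
290	299	290~299	294.5	11	22.0%
300	309	300~309	304.5	21	42.0%
310	319	310~319	314.5	8	16.0%
320	329	320~329	324.5	1	2.0%
			Total	50	100.0%

Note: in Excel, FREQUENCY is an array function. Enter =FREQUENCY(juice volumes, upper bounds) and confirm with Ctrl+Shift+Enter. Select only the juice volumes as the data range, so that the frequencies add up to the 50 trials.

The histogram then follows directly: just draw a chart from the table above.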
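The counting step can also be reproduced in R. The following is a minimal sketch, assuming the 50 juice volumes are typed in from the data table and the classes are 270~279 up to 320~329; cut() assigns each value to a class and table() counts them. The names juice, breaks, classes, freq and rel are made up for this example.

```r
# Juice volumes (ml) for trials 1-50, copied from the data table above.
juice <- c(300, 290, 315, 279, 320, 311, 295, 305, 300, 275,
           319, 315, 300, 297, 303, 307, 299, 287, 300, 299,
           274, 309, 303, 288, 315, 304, 277, 289, 311, 301,
           294, 285, 307, 281, 297, 309, 311, 304, 300, 296,
           290, 305, 301, 311, 301, 307, 296, 299, 304, 302)

# Class boundaries 270, 280, ..., 330. right = FALSE makes each class
# include its lower bound and exclude its upper bound, e.g. [270,280).
breaks  <- seq(270, 330, by = 10)
classes <- cut(juice, breaks = breaks, right = FALSE)

freq <- table(classes)                     # frequency per class
rel  <- round(100 * prop.table(freq), 1)   # relative frequency in %

print(cbind(frequency = freq, percent = rel))
print(mean(juice))
```

Calling hist(juice, breaks = breaks, right = FALSE) with the same boundaries would draw the corresponding histogram.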


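[Data Processing & Analysis in Practice] Data Types - Data Frames, Type Checking, Conversion

Python, R Analysis and Programming 2015. 2. 12. 14:44

source: R을 이용한 데이터 처리 & 분석 실무 (a Korean book on data processing and analysis with R), by 서민구

http://book.naver.com/bookdb/book_detail.nhn?bid=8317471

6. Data frame: the most important data type in R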

> (d<-data.frame(x=c(1:5),y=c(2:10,2)))

x y

1 1 2

2 2 3

3 3 4

4 4 5

5 5 6

6 1 7

7 2 8

8 3 9

9 4 10

10 5 2

> (d<-data.frame(x=rep(1:2,times=3),y=seq(2,12,2)))

x y

1 1 2

2 2 4

3 1 6

4 2 8

5 1 10

6 2 12

>

> (d<-data.frame(x=rep(1:2,times=3),

+ y=seq(2,12,2),

+ z=c('M','F','M','F','M','F')))

x y z

1 1 2 M

2 2 4 F

3 1 6 M

4 2 8 F

5 1 10 M

6 2 12 F

>

> str(d)

'data.frame': 6 obs. of 3 variables:

$ x: int 1 2 1 2 1 2

$ y: num 2 4 6 8 10 12

$ z: Factor w/ 2 levels "F","M": 2 1 2 1 2 1

> d$x

[1] 1 2 1 2 1 2

> d$x<-rep(1:2,each=3)

> d$x

[1] 1 1 1 2 2 2

> d$w<-c("A","B","C","D","E","F")

> d

x y z w

1 1 2 M A

2 1 4 F B

3 1 6 M C

4 2 8 F D

5 2 10 M E

6 2 12 F F

> str(d)

'data.frame': 6 obs. of 4 variables:

$ x: int 1 1 1 2 2 2

$ y: num 2 4 6 8 10 12

$ z: Factor w/ 2 levels "F","M": 2 1 2 1 2 1

$ w: chr "A" "B" "C" "D" ...
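The str() output shows z as a Factor because data.frame() converted strings to factors by default in the R version used for this transcript. Since R 4.0.0 the default is stringsAsFactors = FALSE, so the same call gives a chr column there. The sketch below states the choice explicitly; d_factor and d_char are names made up for this example.

```r
# Setting stringsAsFactors explicitly gives the same result on every R version.
d_factor <- data.frame(z = c("M", "F", "M"), stringsAsFactors = TRUE)
d_char   <- data.frame(z = c("M", "F", "M"), stringsAsFactors = FALSE)

str(d_factor)   # z is a Factor with levels "F" and "M"
str(d_char)     # z is a character vector

# A column added later with $<- is stored as given (character here),
# which is why w shows up as chr in the transcript.
d_factor$w <- c("A", "B", "C")
print(sapply(d_factor, class))
```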

>

> (x<-data.frame(1:3))

X1.3

1 1

2 2

3 3

> colnames(x)<-'val'

> x

val

1 1

2 2

3 3

> rownames(x)<-c('a','b','c')

> x

val

a 1

b 2

c 3

>

> d<-data.frame(x=c(1:5),y=c(2,4,6,8,10))

> d

x y

1 1 2

2 2 4

3 3 6

4 4 8

5 5 10

> rownames(d)<-c('a','b','c','d','e')

> d

x y

a 1 2

b 2 4

c 3 6

d 4 8

e 5 10

>

> d$x

[1] 1 2 3 4 5

> d$y

[1] 2 4 6 8 10

> d[1,]

x y

a 1 2

> d[,1]

[1] 1 2 3 4 5

> d$a

NULL

> d["a"]

Error in `[.data.frame`(d, "a") : undefined columns selected

> d["a",]

x y

a 1 2

> d

x y

a 1 2

b 2 4

c 3 6

d 4 8

e 5 10

> d[c(1,3),]

x y

a 1 2

c 3 6

> d[c(1,3),1]

[1] 1 3

> d[c(1,3),2]

[1] 2 6

>

> d[-1,]

x y

b 2 4

c 3 6

d 4 8

e 5 10

> d[-1,-2]

[1] 2 3 4 5

> d[c("x","y")]

x y

a 1 2

b 2 4

c 3 6

d 4 8

e 5 10

> d[,c("x","y")]

x y

a 1 2

b 2 4

c 3 6

d 4 8

e 5 10

> d[c("x","y"),]

x y

NA NA NA

NA.1 NA NA

> d[,"x"]

[1] 1 2 3 4 5

> d[,c("x")]

[1] 1 2 3 4 5

> d[,c("x"),drop=F]

x

a 1

b 2

c 3

d 4

e 5

> (d<-data.frame(a=1:3,b=4:6,c=7:9))

a b c

1 1 4 7

2 2 5 8

3 3 6 9

> d[,names(d)%in%c("b","c")]

b c

1 4 7

2 5 8

3 6 9

> d[,!names(d)%in%c("b","c")]

[1] 1 2 3

> d[,!names(d)%in%c("b","c"),drop=F]

a

1 1

2 2

3 3
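drop is an argument of the `[` operator, so it has to come after the column index rather than inside c(). A short sketch of the difference, where keep, as_vector and as_frame are illustrative names:

```r
d <- data.frame(a = 1:3, b = 4:6, c = 7:9)
keep <- !names(d) %in% c("b", "c")    # TRUE only for column "a"

as_vector <- d[, keep]                # one column left: simplified to a vector
as_frame  <- d[, keep, drop = FALSE]  # drop = FALSE keeps the data frame

print(as_vector)
print(as_frame)
```

Writing FALSE in full rather than F is a little safer, because F and T are ordinary variables that can be reassigned.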

> View(d)

7. Type checking

> class(c(1,2))

[1] "numeric"

> class(matrix(c(1,2)))

[1] "matrix"

> class(data.frame(x=c(1,2),y=c(3,4)))

[1] "data.frame"

> str(c(1,2))

num [1:2] 1 2

> str(matrix(c(1,2)))

num [1:2, 1] 1 2

> str(list(c(1,2)))

List of 1

$ : num [1:2] 1 2

> str(data.frame(x=c(1,2)))

'data.frame': 2 obs. of 1 variable:

$ x: num 1 2

> is.factor(factor(c("m","f")))

[1] TRUE

> is.numeric(c(1:5))

[1] TRUE

> is.character(c("a","b"))

[1] TRUE

> is.data.frame(data.frame(x=1:6))

[1] TRUE

8. Type conversion

> y<-c("a","b","c")

> as.factor(y)

[1] a b c

Levels: a b c

> y

[1] "a" "b" "c"

> as.character(as.factor(y))

[1] "a" "b" "c"

> x<-data.frame(1:9,ncol=3)

> x

X1.9 ncol

1 1 3

2 2 3

3 3 3

4 4 3

5 5 3

6 6 3

7 7 3

8 8 3

9 9 3

> is.data.frame(x)

[1] TRUE

> is.matrix(x)

[1] FALSE

> x<-matrix(1:9,ncol=3)

> x

[,1] [,2] [,3]

[1,] 1 4 7

[2,] 2 5 8

[3,] 3 6 9

> is.matrix(x)

[1] TRUE

> as.data.frame(x)

V1 V2 V3

1 1 4 7

2 2 5 8

3 3 6 9

> is.data.frame(x)

[1] FALSE
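is.data.frame(x) is still FALSE here because as.data.frame(x) returns a new object and leaves x untouched; the result has to be assigned to keep it. A minimal sketch, with x_df as a name chosen for this example:

```r
x <- matrix(1:9, ncol = 3)

x_df <- as.data.frame(x)     # store the converted copy under a new name

print(is.data.frame(x))      # x itself is still a matrix
print(is.data.frame(x_df))   # the stored copy is a data frame
print(names(x_df))           # default column names V1, V2, V3
```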

> (x<-data.frame(matrix(c(1:4),ncol=2)))

X1 X2

1 1 3

2 2 4

> list(x=c(1,2),y=c(3,4))

$x

[1] 1 2

$y

[1] 3 4

> data.frame(list(x=c(1,2),y=c(3,4)))

x y

1 1 3

2 2 4

> factor(c("m","f"),levels=c("m","f"))

[1] m f

Levels: m f


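[Data Processing & Analysis in Practice] Data Types - Scalars, Vectors, Lists, Matrices, Arrays

Python, R Analysis and Programming 2015. 2. 12. 14:29

source: R을 이용한 데이터 처리 & 분석 실무 (a Korean book on data processing and analysis with R), by 서민구

http://book.naver.com/bookdb/book_detail.nhn?bid=8317471

1. Scalar: a single value, such as 1, 2, or 3.

1) NA: a constant meaning "Not Available". It marks a missing value, i.e. a value that was left out.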

> four<-NA

> is.na(four)

[1] TRUE

2) NULL: a different concept from NA. It means undefined, i.e. a value that has not been set.

> x<-NULL

> is.null(x)

[1] TRUE

> is.null(1)

[1] FALSE

> is.null(NA)

[1] FALSE

> is.na(NULL)

logical(0)

Warning message:

In is.na(NULL) : is.na() applied to non-(list or vector) of type 'NULL'

3) Logical values: using T/F

> T&F

[1] FALSE

> T|T

[1] TRUE

> T|F

[1] TRUE

> T<-TRUE

> T

 [1] TRUE

> T<-FALSE

> T

[1] FALSE

> TRUE<-FALSE

Error in TRUE <- FALSE : invalid (do_set) left-hand side to assignment

This part I do not fully understand yet; it needs more study.

> c(TRUE,TRUE)&c(TRUE,FALSE)

[1] TRUE FALSE

> c(TRUE,TRUE)&&c(TRUE,FALSE)

[1] TRUE
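The difference between the two operators: & compares its arguments position by position, while && is meant for single values, as in the condition of an if statement. In the R version used for this transcript, && silently looked only at the first element of each vector, which is why the result is TRUE. From R 4.3.0 onward, calling && with vectors longer than one is an error. A sketch that avoids the problem with all() and any():

```r
a <- c(TRUE, TRUE)
b <- c(TRUE, FALSE)

elementwise <- a & b        # compares position by position
all_true    <- all(a & b)   # single answer: is every pair TRUE?
any_true    <- any(a & b)   # single answer: is at least one pair TRUE?

# && is for single values, so pick the elements explicitly.
single <- a[1] && b[1]

print(elementwise)
print(c(all_true, any_true, single))
```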

4) Factor: a data type for representing categorical data, e.g. large/medium/small.

- Nominal: left/right

- Ordinal: large/medium/small

> sex<-factor("m",c("m","f")) # sex is set to "m"; the levels of this factor are limited to "m" and "f"

> sex

[1] m

Levels: m f

> nlevels(sex)

[1] 2

> levels(sex)

[1] "m" "f"

> levels(sex)[1]

[1] "m"

> levels(sex)[2]

[1] "f"

> levels(sex)<-c("male","female")

> sex

[1] male

Levels: male female

> factor(c("m","m","f"),c("m","f"))

[1] m m f

Levels: m f

> factor(c("m","m","f"))

[1] m m f

Levels: f m

> ordered("a",c("a","b","c"))

[1] a

Levels: a < b < c

2. Vector: similar to an array. It stores data of a single scalar data type.

e.g. an array that holds only numbers, or only strings, is a vector.

> x<-c(1,3,4)

> names(x)<-c("a","b","c")

> x

a b c

1 3 4

> x[1]

a

1

> x[-1]

b c

3 4

> x[c(1,3)]

a c

1 4

> x["a"]

a

1

> x[c("b","c")]

b c

3 4

> names(x)[2]

[1] "b"

> length(x)

[1] 3

> nrow(x)

NULL

> NROW(x)

[1] 3

> identical(c(1,2,3),c(1,2,3)) # tests whether the two objects are identical

[1] TRUE

> identical(c(1,2,3),c(1,2,1))

[1] FALSE

> "a"%in%c("a","b","c")

[1] TRUE

> "d"%in%c("a","b","c")

[1] FALSE

> x<-c(1:5)

> x

[1] 1 2 3 4 5

> x+1

[1] 2 3 4 5 6

> 10-x

[1] 9 8 7 6 5

> c(1,2,3)==c(1,2,3)

[1] TRUE TRUE TRUE

> union(c("a","b"),"d")

[1] "a" "b" "d"

> intersect(c("a","b","c"),c("a","d"))

[1] "a"

> setequal(c("a","b","c"),c("a","d"))

[1] FALSE

> setequal(c("a","b"),c("a","b","b"))

[1] TRUE

> setequal(c("a","b"),c("a","b","c"))

[1] FALSE

> seq(3,7)

[1] 3 4 5 6 7

> seq(7,3)

[1] 7 6 5 4 3

> seq(3,7,2)

[1] 3 5 7

> seq(3,7,3)

[1] 3 6

> x<-seq(2,10,2)

> x

[1] 2 4 6 8 10

> 1:NROW(x)

[1] 1 2 3 4 5

> seq_along(x)

[1] 1 2 3 4 5

> rep(1:2,times=5)

[1] 1 2 1 2 1 2 1 2 1 2

> rep(1:2,each=5)

[1] 1 1 1 1 1 2 2 2 2 2

> rep(1:2,each=5,times=2)

[1] 1 1 1 1 1 2 2 2 2 2 1 1 1 1 1 2 2 2 2 2

3. List: unlike a vector, it can hold values of different types.

> (x<-list(name="foo",height=70))

$name

[1] "foo"

$height

[1] 70

> (x<-list(names="foo",height=c(1,3,5)))

$names

[1] "foo"

$height

[1] 1 3 5

> list(a=list(val=c(1,2,3)),b=list(val=c(1,2,3,4)))

$a

$a$val

[1] 1 2 3

$b

$b$val

[1] 1 2 3 4

> x$name

[1] "foo"

> x

$names

[1] "foo"

$height

[1] 1 3 5

> x$names

[1] "foo"

> x$height

[1] 1 3 5

> x[1]

$names

[1] "foo"

> x[[1]]

[1] "foo"
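Two details in the transcript are easy to miss: x[1] returns a list while x[[1]] returns the element itself, and x$name found the element called names because $ matches names partially. A small sketch, with variable names chosen only for illustration:

```r
x <- list(names = "foo", height = c(1, 3, 5))

sub_list <- x[1]         # single brackets: still a list, of length 1
element  <- x[[1]]       # double brackets: the element itself

partial <- x$name        # $ matches partially, so this finds "names"
exact   <- x[["name"]]   # [[ ]] matches exactly by default: no such element

print(class(sub_list))
print(class(element))
print(partial)
print(is.null(exact))
```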

4. Matrix

> matrix(c(1:9),nrow=3,byrow=TRUE)

[,1] [,2] [,3]

[1,] 1 2 3

[2,] 4 5 6

[3,] 7 8 9

> matrix(c(1:9),nrow=3)

[,1] [,2] [,3]

[1,] 1 4 7

[2,] 2 5 8

[3,] 3 6 9
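By default matrix() fills column by column; byrow = TRUE fills row by row instead. Note that T is an ordinary variable: if it was reassigned to FALSE earlier in the session (as in the logical values example above), byrow=T silently falls back to column-wise filling, so spelling out TRUE is safer. A sketch comparing the two fill orders:

```r
by_col <- matrix(1:9, nrow = 3)                 # default: fill column by column
by_row <- matrix(1:9, nrow = 3, byrow = TRUE)   # fill row by row

print(by_col)
print(by_row)
print(all(by_row == t(by_col)))   # here the row-wise fill equals the transpose
```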

> matrix(c(1:9),nrow=3,

+ dimnames=list(c("r1","r2","r3"),c("c1","c2","c3")))

c1 c2 c3

r1 1 4 7

r2 2 5 8

r3 3 6 9

> (x<-matrix(c(1:9),nrow=3))

[,1] [,2] [,3]

[1,] 1 4 7

[2,] 2 5 8

[3,] 3 6 9

> dimnames(x)<-list(c("r4","r5","r6"),c("c4","c5","c6"))

> x

c4 c5 c6

r4 1 4 7

r5 2 5 8

r6 3 6 9

>

> x<-matrix(1:9,ncol=3)

> x

[,1] [,2] [,3]

[1,] 1 4 7

[2,] 2 5 8

[3,] 3 6 9

> rownames(x)<-c("r1","r2","r3")

> x

[,1] [,2] [,3]

r1 1 4 7

r2 2 5 8

r3 3 6 9

> colnames(x)<-c("c1","c2","c3")

> x

c1 c2 c3

r1 1 4 7

r2 2 5 8

r3 3 6 9

> x[1,1]

[1] 1

> x[,2]

r1 r2 r3

4 5 6

> x[1:2,]

c1 c2 c3

r1 1 4 7

r2 2 5 8

> x[,1:2]

c1 c2

r1 1 4

r2 2 5

r3 3 6

> x[c(1,2),c(2,1)]

c2 c1

r1 4 1

r2 5 2

> x[c(1,1),c(1,3)]

c1 c3

r1 1 7

r1 1 7

> x[c(1,3),c(1,3)]

c1 c3

r1 1 7

r3 3 9

> x[c(2,1),c(1,2)]

c1 c2

r2 2 5

r1 1 4

> x<-matrix(1:9,nrow=3)

> x*2

[,1] [,2] [,3]

[1,] 2 8 14

[2,] 4 10 16

[3,] 6 12 18

> x+x

[,1] [,2] [,3]

[1,] 2 8 14

[2,] 4 10 16

[3,] 6 12 18

> x%*%x

[,1] [,2] [,3]

[1,] 30 66 102

[2,] 36 81 126

[3,] 42 96 150

> t(x)

[,1] [,2] [,3]

[1,] 1 2 3

[2,] 4 5 6

[3,] 7 8 9

> (x<-matrix(c(1:4),ncol=2))

[,1] [,2]

[1,] 1 3

[2,] 2 4

> solve(x)

[,1] [,2]

[1,] -2 1.5

[2,] 1 -0.5

> (x<-matrix(c(1:6),nrow=2))

[,1] [,2] [,3]

[1,] 1 3 5

[2,] 2 4 6

> dim(x)

[1] 2 3

> dim(x)<-c(3,2)

> x

[,1] [,2]

[1,] 1 4

[2,] 2 5

[3,] 3 6

5. Array: where a matrix is two-dimensional, an array holds multi-dimensional data.

> array(1:12,dim=c(3,4))

[,1] [,2] [,3] [,4]

[1,] 1 4 7 10

[2,] 2 5 8 11

[3,] 3 6 9 12

> (x<-array(1:12,dim=c(2,2,3)))

, , 1

[,1] [,2]

[1,] 1 3

[2,] 2 4

, , 2

[,1] [,2]

[1,] 5 7

[2,] 6 8

, , 3

[,1] [,2]

[1,] 9 11

[2,] 10 12

> x[1,1,3]

[1] 9


R Practice - diet_data

Python, R Analysis and Programming 2015. 1. 19. 12:38

source: https://github.com/derekfranks/practice_assignment

To begin, download this file and unzip it into your R working directory.
http://s3.amazonaws.com/practice_assignment/diet_data.zip

> setwd("D:/temp/r_temp")

> list.files("diet_data")

[1] "Andy.csv" "David.csv" "John.csv"

[4] "Mike.csv" "Steve.csv"

> andy<-read.csv("diet_data/Andy.csv")

> head(andy)

Patient.Name Age Weight Day

1 Andy 30 140 1

2 Andy 30 140 2

3 Andy 30 140 3

4 Andy 30 139 4

5 Andy 30 138 5

6 Andy 30 138 6

> length(andy$Day)

[1] 30

> nrow(andy)

[1] 30

> ncol(andy)

[1] 4

> length(andy)

[1] 4

> dim(andy)

[1] 30 4

> str(andy)

'data.frame': 30 obs. of 4 variables:

$ Patient.Name: Factor w/ 1 level "Andy": 1 1 1 1 1 1 1 1 1 1 ...

$ Age : int 30 30 30 30 30 30 30 30 30 30 ...

$ Weight : int 140 140 140 139 138 138 138 138 138 138 ...

$ Day : int 1 2 3 4 5 6 7 8 9 10 ...

> summary(andy)

Patient.Name Age Weight Day

Andy:30 Min. :30 Min. :135.0 Min. : 1.00

1st Qu.:30 1st Qu.:137.0 1st Qu.: 8.25

Median :30 Median :137.5 Median :15.50

Mean :30 Mean :137.3 Mean :15.50

3rd Qu.:30 3rd Qu.:138.0 3rd Qu.:22.75

Max. :30 Max. :140.0 Max. :30.00

> names(andy)

[1] "Patient.Name" "Age" "Weight" "Day"

> andy[1,"Weight"]

[1] 140

> andy[1,"weight"]

NULL

> andy[30,"Weight"]

[1] 135

> andy[which(andy$day)==30,"Weight"]

Error in which(andy$day) : argument to 'which' is not logical

> andy[which(andy$Day)==30,"Weight"]

Error in which(andy$Day) : argument to 'which' is not logical

> andy[which(andy$Day==30),"Weight"]

[1] 135

> andy[which(andy[,Day]==30),"Weight"]

Error in `[.data.frame`(andy, , Day) : object 'Day' not found

> andy[which(andy[,"Day"]==30),"Weight"]

[1] 135

> subset(andy$Weight, andy$Day==30)

[1] 135

> andy_start<-subset[andy$Weight,andy$Day==1]

Error in subset[andy$Weight, andy$Day == 1] :

object of type 'closure' is not subsettable

> andy_start<-subset(andy$Weight,andy$Day==1)

> andy_start

[1] 140

> andy_start<-andy[1,"Weight"]

> andy_end<-andy[30,"Weight"]

> andy_loss<-andy_start - andy_end

> andy_loss

[1] 5
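The failed attempts above come down to where the comparison sits: which() needs a logical vector such as andy$Day == 30, and a logical vector can also be used as an index directly. The sketch below uses a hypothetical three-day excerpt, not the real Andy.csv.

```r
# Hypothetical three-day excerpt in the shape of Andy.csv.
diet <- data.frame(Day = 1:3, Weight = c(140, 140, 139))

w1 <- diet[which(diet$Day == 3), "Weight"]   # which() turns TRUE/FALSE into positions
w2 <- diet[diet$Day == 3, "Weight"]          # a logical vector can index directly
w3 <- diet$Weight[diet$Day == 3]             # same idea on the column itself

loss <- diet$Weight[diet$Day == 1] - diet$Weight[diet$Day == 3]
print(c(w1, w2, w3))
print(loss)
```

One difference worth knowing: which() drops NA comparisons, whereas plain logical indexing returns NA entries for them.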

> files<-list.files("diet_data")

> files

[1] "Andy.csv" "David.csv" "John.csv" "Mike.csv" "Steve.csv"

> files[1]

[1] "Andy.csv"

> files[-1]

[1] "David.csv" "John.csv" "Mike.csv" "Steve.csv"

> files[2]

[1] "David.csv"

> files[3:5]

[1] "John.csv" "Mike.csv" "Steve.csv"

> files[c(1,3)]

[1] "Andy.csv" "John.csv"

> head(read.csv(files[3]))

Error in file(file, "rt") : cannot open the connection

In addition: Warning message:

In file(file, "rt") :

cannot open file 'John.csv': No such file or directory

> files_full <- list.files("diet_data",full.names=TRUE)

> files_full

[1] "diet_data/Andy.csv" "diet_data/David.csv" "diet_data/John.csv"

[4] "diet_data/Mike.csv" "diet_data/Steve.csv"

> head(read.csv(files_full[3]))

Patient.Name Age Weight Day

1 John 22 175 1

2 John 22 175 2

3 John 22 175 3

4 John 22 175 4

5 John 22 175 5

6 John 22 175 6

> andy_david<-rbind(andy,david)

Error in rbind(andy, david) : object 'david' not found

> david<-read.csv("diet_data/David.csv")

> head(david)

Patient.Name Age Weight Day

1 David 35 210 1

2 David 35 209 2

3 David 35 209 3

4 David 35 209 4

5 David 35 209 5

6 David 35 209 6

> andy_david<-rbind(andy,david)

> head(andy_david)

Patient.Name Age Weight Day

1 Andy 30 140 1

2 Andy 30 140 2

3 Andy 30 140 3

4 Andy 30 139 4

5 Andy 30 138 5

6 Andy 30 138 6

> tail(andy_david)

Patient.Name Age Weight Day

55 David 35 203 25

56 David 35 203 26

57 David 35 202 27

58 David 35 202 28

59 David 35 202 29

60 David 35 201 30

> andy_david<-rbind(andy,read.csv(files_full[2]))

> head(andy_david)

Patient.Name Age Weight Day

1 Andy 30 140 1

2 Andy 30 140 2

3 Andy 30 140 3

4 Andy 30 139 4

5 Andy 30 138 5

6 Andy 30 138 6

> tail(andy_david)

Patient.Name Age Weight Day

55 David 35 203 25

56 David 35 203 26

57 David 35 202 27

58 David 35 202 28

59 David 35 202 29

60 David 35 201 30

> day_25<-subset(andy_david[25,])

> day_25

Patient.Name Age Weight Day

25 Andy 30 135 25

> day_26<-subset(andy_david[which(andy_david$Day==26),])

> day_26

Patient.Name Age Weight Day

26 Andy 30 135 26

56 David 35 203 26

> day25<-andy_david[which(andy_david$Day == 25),]

> day25

Patient.Name Age Weight Day

25 Andy 30 135 25

55 David 35 203 25

> for(i in 1:5){print [i]}

Error in print[i] : object of type 'closure' is not subsettable

> for(i in 1:5){print (i}

Error: unexpected '}' in "for(i in 1:5){print (i}"

> for(i in 1:5){print (i)}

[1] 1

[1] 2

[1] 3

[1] 4

[1] 5

> for(i in 1:5){

+ dat <- rbind(dat,read.csv(files_full[i]))

+ }

Error in rbind(dat, read.csv(files_full[i])) : object 'dat' not found

> dat<-data.frame()

> for(i in 1:5){

+ dat <- rbind(dat,read.csv(files_full[i]))

+ }

> str(dat)

'data.frame': 150 obs. of 4 variables:

$ Patient.Name: Factor w/ 5 levels "Andy","David",..: 1 1 1 1 1 1 1 1 1 1 ...

$ Age : int 30 30 30 30 30 30 30 30 30 30 ...

$ Weight : int 140 140 140 139 138 138 138 138 138 138 ...

$ Day : int 1 2 3 4 5 6 7 8 9 10 ...

> for(i in 1:5 ){

+ dat2<-data.frame()

+ dat2<-rbind(dat2,read.csv(files_full[i]))

+ }

> str(dat2)

'data.frame': 30 obs. of 4 variables:

$ Patient.Name: Factor w/ 1 level "Steve": 1 1 1 1 1 1 1 1 1 1 ...

$ Age : int 55 55 55 55 55 55 55 55 55 55 ...

$ Weight : int 225 225 225 224 224 224 223 223 223 223 ...

$ Day : int 1 2 3 4 5 6 7 8 9 10 ...

> list.files("diet_data")

[1] "Andy.csv" "David.csv" "John.csv" "Mike.csv" "Steve.csv"

> head(dat2)

Patient.Name Age Weight Day

1 Steve 55 225 1

2 Steve 55 225 2

3 Steve 55 225 3

4 Steve 55 224 4

5 Steve 55 224 5

6 Steve 55 224 6

Because we put dat2<- data.frame() inside of the loop, dat2 is being rewritten with each pass of the loop. So we only end up with the data from the last file in our list.

> median(dat$Weight)

[1] NA

> nrow(dat)

[1] 150

> ncol(dat)

[1] 4

> dim(dat)

[1] 150 4

> median(dat$Weight,na.rm=TRUE)

[1] 190
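median() returns NA as soon as the vector contains a missing value, which is what happened with dat$Weight above; na.rm = TRUE removes the NAs before computing. A tiny sketch with made-up weights:

```r
w <- c(140, NA, 139, 138)   # made-up weights with one missing value

print(median(w))                 # NA: the missing value hides the result
print(sum(is.na(w)))             # count the missing values first
print(median(w, na.rm = TRUE))   # median of the remaining values
```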

> dat_30<-dat[which(dat$Day==30),]

> dat_30

Patient.Name Age Weight Day

30 Andy 30 135 30

60 David 35 201 30

90 John 22 177 30

120 Mike 40 192 30

150 Steve 55 214 30

> median(dat_30$Weight)

[1] 192

> weightmedian<-function(directory,day){

+ files_list<-list.files(directory,full.names=TRUE) #Create a list of files

+ dat<-data.frame() #creates an empty data frame

+ for(i in seq_along(files_list)) { #loops through the files, rbinding them together

+ dat<-rbind(dat,read.csv(files_list[i]))

+ }

+ dat_subset<-dat[which(dat[,"Day"]==day),] #subsets the rows that match the 'day' arguments

+ median(dat_subset[,"Weight"],na.rm=TRUE) #identifies the median weight

+ }

> weightmedian(directory="diet_data",day=20)

[1] 197.5

> weightmedian("diet_data",4)

[1] 188
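As an alternative to growing dat inside a for loop, the files can be read with lapply() and bound together in one step with do.call(rbind, ...). This is only a sketch, not the assignment's reference solution: weightmedian2 is a hypothetical name, and the two tiny CSV files are invented so the example runs without diet_data.

```r
# Build a throwaway directory with two tiny CSV files so the sketch runs
# anywhere; the names and numbers are invented, not the real diet_data.
demo_dir <- file.path(tempdir(), "diet_demo")
dir.create(demo_dir, showWarnings = FALSE)
write.csv(data.frame(Patient.Name = "A", Day = 1:2, Weight = c(100, 98)),
          file.path(demo_dir, "A.csv"), row.names = FALSE)
write.csv(data.frame(Patient.Name = "B", Day = 1:2, Weight = c(200, NA)),
          file.path(demo_dir, "B.csv"), row.names = FALSE)

weightmedian2 <- function(directory, day) {
  files_list <- list.files(directory, pattern = "\\.csv$", full.names = TRUE)
  # Read every file, then bind all data frames together in one step.
  dat <- do.call(rbind, lapply(files_list, read.csv))
  # Median weight on the requested day, ignoring missing values.
  median(dat$Weight[dat$Day == day], na.rm = TRUE)
}

print(weightmedian2(demo_dir, 1))   # median of 100 and 200
print(weightmedian2(demo_dir, 2))   # only 98 remains once the NA is removed
```

Because the file list comes from list.files(), the function works for any number of CSV files in the directory.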


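[Text Mining Practice] Analysis of the 2014 Presidential New Year's Address

Python, R Analysis and Programming 2015. 1. 14. 12:49

R text mining practice. The book used is R까기; the text analyzed is the 2014 presidential New Year's address (source: Naver search).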

> setwd("D:/temp/r_temp")

> library(KoNLP)

> library(wordcloud)

> useSejongDic()

> mergeUserDic(data.frame("청마","ncn")) # adds this word to the dictionary loaded above