데이터마이너를 꿈꾸며

beer data practice

Python, R Analysis and Programming 2015. 11. 10. 14:54

The beer practice data set contains fields such as customer ID, product category, brand, purchase date, total amount, number of items, price, taste, age, and gender.

> setwd("D:/temp/r_temp")

> beer<-read.csv('beer.csv',stringsAsFactors = F)

> str(beer)

'data.frame':    26281 obs. of  10 variables:

  $ CustomerID     : int  867616 867616 7464358 1518180 5132322 1284613 4359610 6510791 5286631 6935836 ...

$ ProductCategory: chr  "Beer" "Beer" "Beer" "Beer" ...

$ BeerBrand      : chr  "Cass" "Cass" "Cass" "Hite" ...

$ PurchaseDate   : int  20070113 20070113 20070103 20071010 20070501 20070310 20070401 20071223 20070721 20071104 ...

$ DollarAmount   : int  26800 26800 13400 7300 7300 7300 5840 17520 7300 7300 ...

$ NumOfItems     : int  2 2 1 1 1 1 1 3 1 1 ...

$ Price          : int  13400 13400 13400 7300 7300 7300 7300 7300 7300 7300 ...

$ Taste          : chr  "Bitter" "Bitter" "Bitter" "Cool" ...

$ Age            : int  35 35 29 32 57 54 40 40 34 44 ...

$ Gender         : chr  "F" "F" "F" "F" ...

> names(beer)

[1] "CustomerID"      "ProductCategory" "BeerBrand"       "PurchaseDate"

[5] "DollarAmount"    "NumOfItems"      "Price"           "Taste"

[9] "Age"             "Gender"

To plot the number of items purchased against the purchase date, first check the range of NumOfItems:

> summary(beer$NumOfItems)

Min. 1st Qu.  Median    Mean 3rd Qu.    Max.

1.00    1.00    1.00    1.19    1.00  180.00

> range(beer$NumOfItems)

[1]   1 180

> plot(beer$PurchaseDate,beer$NumOfItems)

Extreme outliers make the plot hard to read, so store the amount per item in pa and plot that instead:

> pa<-beer$DollarAmount/beer$NumOfItems

> plot(beer$PurchaseDate,pa)

-> Still hard to read. A better chart, or a different way of presenting the data, is needed.

> hist(beer$PurchaseDate)
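One likely reason the date plots are hard to read is that PurchaseDate is stored as an integer such as 20070113, so R spaces the points as if they were plain numbers (20070131 and 20070201 end up 70 units apart). A minimal sketch of converting such values to Date, assuming the YYYYMMDD layout seen in the str output; the vector purchase_date below is a made-up stand-in for beer$PurchaseDate:

```r
# Toy stand-in for beer$PurchaseDate (integers in YYYYMMDD form)
purchase_date <- c(20070113L, 20070103L, 20071010L, 20071223L)

# as.Date needs a character string plus a format that describes it
purchase_day <- as.Date(as.character(purchase_date), format = "%Y%m%d")

purchase_day                          # real calendar dates
diff(range(purchase_day))             # span of the data in days
table(format(purchase_day, "%Y-%m"))  # number of purchases per month
```

With a Date column, plot(purchase_day, ...) uses a calendar axis, and hist(purchase_day, breaks = "months") groups purchases by month.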

Compute the mean, variance, and standard deviation as well:

> mean(beer$DollarAmount)

[1] 13828.43

> var(beer$DollarAmount)

[1] 524584703

> sd(beer$DollarAmount)

[1] 22903.81
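The standard deviation above is larger than the mean, which fits the outliers seen earlier. A small sketch with made-up amounts (not the beer data) shows how one extreme value pulls the mean and sd while the median and IQR barely move, and that sd is simply the square root of var:

```r
# Hypothetical purchase amounts with one extreme value
amount <- c(7300, 7300, 13400, 26800, 1314000)

mean(amount)     # pulled up by the extreme value
median(amount)   # stays at a typical purchase
sd(amount)       # inflated by the extreme value
IQR(amount)      # spread of the middle half only

# sd is defined as the square root of the variance
all.equal(sd(amount), sqrt(var(amount)))
```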

beer$Taste holds categorical data with the values Bitter, Cool, and Soft, so we can apply the table function and then draw a barplot.

> table(beer$Taste)

Bitter   Cool   Soft

3234  11540  11507

> barplot(table(beer$Taste))

Gender can be tabulated the same way.

> table(beer$Gender)

          F     M

 8275 12952  5054

> table(beer$Gender)/18006

                  F         M

0.4595690 0.7193158 0.2806842

> barplot(table(beer$Gender))

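Many Gender values are blank (empty strings rather than NA). Dividing by 18006, the number of non-blank rows (12952 + 5054), while the blank category is still in the table gives proportions that do not add up to 1, so drop the blanks with exclude: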

> table(beer$Gender,exclude="")/sum(table(beer$Gender,exclude=""))

F         M

0.7193158 0.2806842

> m<-table(beer$Gender,exclude="")

> barplot(m)

Cross-tabulate Gender and Taste with table and draw a mosaicplot:

> gender_taste=table(beer$Gender,beer$Taste)

> mosaicplot(gender_taste)

-> I would like to leave out the blank values here, but I do not know how yet. More study is needed.
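One possible way to leave the blank Gender values out of the mosaic plot is to filter the rows before building the table. This is only a sketch: beer_toy is a made-up data frame that mimics the Gender and Taste columns, with blanks stored as empty strings as in the table output above.

```r
# Toy stand-in for the beer data; "" marks a missing gender
beer_toy <- data.frame(
  Gender = c("F", "", "M", "F", "", "M", "F"),
  Taste  = c("Bitter", "Cool", "Cool", "Soft", "Soft", "Bitter", "Cool"),
  stringsAsFactors = FALSE
)

# Keep only the rows where Gender is not blank
known <- subset(beer_toy, Gender != "")

# The two-way table now has rows F and M only
gender_taste <- table(known$Gender, known$Taste)
gender_taste

mosaicplot(gender_taste, main = "Taste by gender (blank gender removed)")
```

Because the columns are character vectors (read with stringsAsFactors = F), the filtered table has no leftover empty level. With factor columns, droplevels() would be needed after subsetting.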

The subset function keeps only the rows that satisfy a condition. & means "and", | means "or".

> m_and_age=subset(beer,beer$Gender=='M'&beer$Age>30)

> head(m_and_age)

CustomerID ProductCategory BeerBrand PurchaseDate DollarAmount NumOfItems Price

7     4359610            Beer      Hite     20070401         5840          1  7300

14    4317050            Beer      Hite     20070921        21900          3  7300

15    4874758            Beer      Hite     20070206         7300          1  7300

18    1333660            Beer      Hite     20071030         7300          1  7300

23    5478004            Beer      Hite     20070325         7300          1  7300

29     944833            Beer      Hite     20070808         7300          1  7300

Taste Age Gender

7   Cool  40      M

14  Cool  66      M

15  Cool  59      M

18  Cool  35      M

23  Cool  50      M

29  Cool  43      M

> dim(m_and_age)

[1] 3923   10

> m_or_age=subset(beer,beer$Gender=='M'|beer$Age>30)

> head(m_or_age)

CustomerID ProductCategory BeerBrand PurchaseDate DollarAmount NumOfItems Price

1     867616            Beer      Cass     20070113        26800          2 13400

2     867616            Beer      Cass     20070113        26800          2 13400

4    1518180            Beer      Hite     20071010         7300          1  7300

5    5132322            Beer      Hite     20070501         7300          1  7300

6    1284613            Beer      Hite     20070310         7300          1  7300

7    4359610            Beer      Hite     20070401         5840          1  7300

Taste Age Gender

1 Bitter  35      F

2 Bitter  35      F

4   Cool  32      F

5   Cool  57      F

6   Cool  54      F

7   Cool  40      M

> dim(m_or_age)

[1] 14485    10
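The sizes of the "and" and "or" subsets are linked: rows matching A or B = rows matching A + rows matching B - rows matching both. A small sketch on a made-up data frame (toy is hypothetical, not the beer data) checks this, and also shows that inside subset the columns can be named directly without the beer$ prefix:

```r
# Hypothetical customers
toy <- data.frame(
  Gender = c("M", "M", "F", "F", "M", ""),
  Age    = c(25, 40, 35, 22, 31, 50),
  stringsAsFactors = FALSE
)

n_m   <- sum(toy$Gender == "M")                        # men
n_old <- sum(toy$Age > 30)                             # over 30
n_and <- nrow(subset(toy, Gender == "M" & Age > 30))   # both conditions
n_or  <- nrow(subset(toy, Gender == "M" | Age > 30))   # at least one

c(n_m = n_m, n_old = n_old, n_and = n_and, n_or = n_or)

# Inclusion-exclusion: the two sides agree
n_or == n_m + n_old - n_and
```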

Make a histogram of the numeric variable Age:

> hist(beer$Age)

The breaks argument can ask for about 50 bins (R treats a single number as a suggestion, not an exact count):

> hist(beer$Age,breaks=50)
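A short sketch of how breaks behaves, using a made-up age vector instead of beer$Age. A single number is only a hint that R rounds to "pretty" cut points, while a vector of cut points is used exactly; plot = FALSE returns the histogram object without drawing it.

```r
# Hypothetical ages between 18 and 80
age <- c(18:80, 30:50, 35:45)

# A single number is a suggestion for the number of bins
h10 <- hist(age, breaks = 10, plot = FALSE)
h50 <- hist(age, breaks = 50, plot = FALSE)
length(h10$counts)   # number of bins actually used
length(h50$counts)

# A vector of cut points is taken literally: 5-year bins from 15 to 80
exact <- hist(age, breaks = seq(15, 80, by = 5), plot = FALSE)
exact$breaks
exact$counts
```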


Posted by 마르띤


subset, mosaicplot, hist, var, sd

Python, R Analysis and Programming 2015. 10. 31. 23:43

source: https://class.coursera.org/statistics-004 (coursera: Data Analysis and Statistical Inference by Duke Univ. by Dr. Mine Çetinkaya-Rundel)

> source("http://www.openintro.org/stat/data/cdc.R")

> names(cdc)

This returns the names genhlth, exerany, hlthplan, smoke100, height, weight, wtdesire, age, and gender. Each one of these variables corresponds to a question that was asked in the survey. For example, for genhlth, respondents were asked to evaluate their general health, responding either excellent, very good, good, fair or poor. The exerany variable indicates whether the respondent exercised in the past month (1) or did not (0). Likewise, hlthplan indicates whether the respondent had some form of health coverage (1) or did not (0). The smoke100 variable indicates whether the respondent had smoked at least 100 cigarettes in their lifetime. The other variables record the respondent’s height in inches, weight in pounds, desired weight (wtdesire), age in years, and gender.

> summary(cdc)

      genhlth        exerany          hlthplan         smoke100          height
 excellent:4657   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :48.00
 very good:6972   1st Qu.:0.0000   1st Qu.:1.0000   1st Qu.:0.0000   1st Qu.:64.00
 good     :5675   Median :1.0000   Median :1.0000   Median :0.0000   Median :67.00
 fair     :2019   Mean   :0.7457   Mean   :0.8738   Mean   :0.4721   Mean   :67.18
 poor     : 677   3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:70.00
                  Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :93.00
     weight         wtdesire          age        gender
 Min.   : 68.0   Min.   : 68.0   Min.   :18.00   m: 9569
 1st Qu.:140.0   1st Qu.:130.0   1st Qu.:31.00   f:10431
 Median :165.0   Median :150.0   Median :43.00
 Mean   :169.7   Mean   :155.1   Mean   :45.07
 3rd Qu.:190.0   3rd Qu.:175.0   3rd Qu.:57.00
 Max.   :500.0   Max.   :680.0   Max.   :99.00

> head(cdc$genhlth) #categorical, ordinal

[1] good good good good very good very good

Levels: excellent very good good fair poor

> levels(cdc$genhlth)

[1] "excellent" "very good" "good" "fair" "poor"

> nlevels(cdc$genhlth)

[1] 5

> levels(cdc$genhlth)[2]

[1] "very good"

R also has built-in functions to compute summary statistics one by one. For instance, to calculate the mean, median, and variance of weight, type

> mean(cdc$weight)

[1] 169.683

> var(cdc$weight)

[1] 1606.484

> sd(cdc$weight)

[1] 40.08097

> median(cdc$weight)

[1] 165

> summary(cdc$weight) #numerical, continuous

Min. 1st Qu. Median Mean 3rd Qu. Max.

68.0 140.0 165.0 169.7 190.0 500.0

The smoke100 variable indicates whether the respondent has smoked at least 100 cigarettes in their lifetime (1) or has not (0).

> summary(cdc$smoke100)

Min. 1st Qu. Median Mean 3rd Qu. Max.

0.000 0.000 0.000 0.472 1.000 1.000

-> What about categorical data? A mean or median makes little sense there, so we would instead consider the sample frequency or relative frequency distribution. The function table does this for you by counting the number of times each kind of response was given. For example, to see the number of people who have smoked at least 100 cigarettes in their lifetime, type

> table(cdc$smoke100)

0 1

10559 9441

or instead look at the relative frequency distribution by typing

> table(cdc$smoke100)/20000

0 1

0.52795 0.47205
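Dividing by a hard-coded 20000 only works while the data has exactly that many rows. A minimal sketch of the same calculation with prop.table, which divides by the table's own total; the smoke100 vector here is a made-up stand-in for cdc$smoke100:

```r
# Toy stand-in for cdc$smoke100 (1 = smoked at least 100 cigarettes)
smoke100 <- c(0, 1, 1, 0, 0, 1, 0, 0)

counts <- table(smoke100)
counts

# Manual version: divide by the number of observations
counts / length(smoke100)

# prop.table divides by the table total, so no row count is hard-coded
prop.table(counts)
```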

Next, we make a bar plot of the entries in the table by putting the table inside the barplot command.

> barplot(table(cdc$smoke100))

OR

> smoke = table(cdc$smoke100)

> barplot(smoke)

The table command can also tabulate two variables side by side, here gender and smoke100. The rows refer to gender, and in the columns 1 indicates that a respondent has smoked at least 100 cigarettes. Passing this table to mosaicplot draws a mosaic plot:

> gender_smokers = table(cdc$gender,cdc$smoke100)

> mosaicplot(gender_smokers)

--> The mosaic plot shows that males are more likely to smoke than females. smoke100 is a categorical variable (not ordinal). The explanatory variable goes on the x-axis and the response variable on the y-axis. The x-axis variable is the data that makes up, and so explains, the Y we want to understand, which is why it is called the explanatory or predictor variable; it is also called the independent variable. The y-axis values respond to X, so Y is called the response or dependent variable.
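What a mosaic plot shows visually can also be read off as row proportions. The sketch below uses made-up gender and smoke100 vectors, not the cdc data, so the numbers are illustrative only; prop.table with margin = 1 makes each row sum to 1, giving the share of smokers within each gender.

```r
# Hypothetical respondents (not the cdc data)
gender   <- c("m", "m", "m", "m", "f", "f", "f", "f", "f", "f")
smoke100 <- c(1, 1, 1, 0, 1, 1, 0, 0, 0, 0)

gender_smokers <- table(gender, smoke100)
gender_smokers

# margin = 1: proportions within each row (each gender)
row_share <- prop.table(gender_smokers, margin = 1)
row_share

# Share of smokers among men and among women in the toy data
row_share["m", "1"]
row_share["f", "1"]
```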

> cdc[567,6] # the element in row 567, column 6 (weight) of the data set

[1] 160

> cdc[1:10,6] #To see the weights for the first 10 respondents,

[1] 175 125 105 132 150 114 194 170 150 180

> cdc[1:10,] #Finally, if we want all of the data for the first 10 respondents, type

genhlth exerany hlthplan smoke100 height weight wtdesire age gender

1 good 0 1 0 70 175 175 77 m

2 good 0 1 1 64 125 115 33 f

3 good 1 1 1 60 105 105 49 f

4 good 1 1 0 66 132 124 42 f

5 very good 0 1 0 61 150 130 55 f

6 very good 1 1 0 64 114 114 55 f

7 very good 1 1 0 71 194 185 31 m

8 very good 0 1 0 67 170 160 45 m

9 good 0 1 1 65 150 130 27 f

10 good 1 1 0 70 180 170 44 m
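Column positions such as 6 are easy to get wrong, and columns can also be selected by name. A small sketch on a three-column toy data frame (cdc_toy is hypothetical; its values are copied from the first three cdc rows shown above):

```r
# Three columns from the first three cdc rows shown above
cdc_toy <- data.frame(
  height = c(70, 64, 60),
  weight = c(175, 125, 105),
  age    = c(77, 33, 49)
)

cdc_toy[2, 2]            # by position: row 2, column 2
cdc_toy[2, "weight"]     # same cell, column chosen by name
cdc_toy$weight[2]        # same cell via the $ operator

# Several named columns for the first two rows
cdc_toy[1:2, c("weight", "age")]
```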

# This command creates a new data set called mdata that contains only the men from the cdc data set.

> mdata = subset(cdc, cdc$gender == "m")

> head(mdata)

genhlth exerany hlthplan smoke100 height weight wtdesire age gender

1 good 0 1 0 70 175 175 77 m

7 very good 1 1 0 71 194 185 31 m

8 very good 0 1 0 67 170 160 45 m

10 good 1 1 0 70 180 170 44 m

11 excellent 1 1 1 69 186 175 46 m

12 fair 1 1 1 69 168 148 62 m

You can use several of these conditions together with & and |. The & is read “and”, so this command will give you the data for men over the age of 30.

> m_and_over30 = subset(cdc, cdc$gender == "m" & cdc$age > 30)

> head(m_and_over30)

genhlth exerany hlthplan smoke100 height weight wtdesire age gender

1 good 0 1 0 70 175 175 77 m

7 very good 1 1 0 71 194 185 31 m

8 very good 0 1 0 67 170 160 45 m

10 good 1 1 0 70 180 170 44 m

11 excellent 1 1 1 69 186 175 46 m

12 fair 1 1 1 69 168 148 62 m

The | character is read “or”, so this command will take people who are men or over the age of 30 (or both).

> m_or_over30=subset(cdc,cdc$gender=="m"|cdc$age>30)

> head(m_or_over30)

genhlth exerany hlthplan smoke100 height weight wtdesire age gender

1 good 0 1 0 70 175 175 77 m

2 good 0 1 1 64 125 115 33 f

3 good 1 1 1 60 105 105 49 f

4 good 1 1 0 66 132 124 42 f

5 very good 0 1 0 61 150 130 55 f

6 very good 1 1 0 64 114 114 55 f

Create a new object called under23_and_smoke that contains all observations of respondents under the age of 23 who have smoked at least 100 cigarettes in their lifetime.

> under23_and_smoke = subset(cdc, age < 23 & smoke100 == 1)

> str(under23_and_smoke)

'data.frame': 620 obs. of 9 variables:

$ genhlth : Factor w/ 5 levels "excellent","very good",..: 1 2 1 3 2 2 4 4 1 4 ...

$ exerany : num 1 1 1 1 1 1 0 1 1 1 ...

$ hlthplan: num 0 0 1 1 1 0 1 1 0 1 ...

$ smoke100: num 1 1 1 1 1 1 1 1 1 1 ...

$ height : num 66 70 74 64 62 64 71 72 63 71 ...

$ weight : int 185 160 175 190 92 125 185 185 105 185 ...

$ wtdesire: int 220 140 200 140 92 115 185 170 100 150 ...

$ age : int 21 18 22 20 21 22 20 19 19 18 ...

$ gender : Factor w/ 2 levels "m","f": 1 2 1 2 2 2 1 1 1 1 ...

> summary(under23_and_smoke)

      genhlth       exerany          hlthplan         smoke100     height
 excellent:110   Min.   :0.0000   Min.   :0.0000   Min.   :1   Min.   :59.00
 very good:244   1st Qu.:1.0000   1st Qu.:0.0000   1st Qu.:1   1st Qu.:65.00
 good     :204   Median :1.0000   Median :1.0000   Median :1   Median :68.00
 fair     : 53   Mean   :0.8145   Mean   :0.6952   Mean   :1   Mean   :67.92
 poor     :  9   3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:1   3rd Qu.:71.00
                 Max.   :1.0000   Max.   :1.0000   Max.   :1   Max.   :79.00
     weight         wtdesire          age        gender
 Min.   : 85.0   Min.   : 80.0   Min.   :18.00   m:305
 1st Qu.:130.0   1st Qu.:125.0   1st Qu.:19.00   f:315
 Median :155.0   Median :150.0   Median :20.00
 Mean   :158.9   Mean   :152.2   Mean   :20.22
 3rd Qu.:180.0   3rd Qu.:175.0   3rd Qu.:21.00
 Max.   :350.0   Max.   :315.0   Max.   :22.00

> dim(under23_and_smoke)

[1] 620 9

-> There are 620 observations in the subset under23_and_smoke.

The purpose of a boxplot is to provide a thumbnail sketch of a variable for the purpose of comparing across several categories. So we can, for example, compare the heights of men and women with the command below. The notation here is new. The ~ character can be read “versus” or “as a function of”. So we’re asking R to give us box plots of heights where the groups are defined by gender.

> boxplot(cdc$height ~ cdc$gender)

Next let’s consider a new variable that doesn’t show up directly in this data set: Body Mass Index (BMI). BMI is a weight-to-height ratio; with weight in pounds and height in inches it can be calculated as:

> bmi = (cdc$weight / cdc$height^2) * 703

> boxplot(bmi~cdc$genhlth)

--> Among people with excellent health, there are some with unusually low BMIs compared to the rest of the group. The IQR increases slightly as general health status declines (from excellent to poor). The median BMI is roughly 25 for all general health categories, and there is a slight increase in median BMI as general health status declines (from excellent to poor).
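To make the BMI formula concrete, here is a worked sketch using the weights (pounds) and heights (inches) of the first three cdc respondents shown earlier. The factor 703 converts from pounds per square inch to the usual kg per square metre scale.

```r
# First three cdc respondents shown earlier
weight <- c(175, 125, 105)   # pounds
height <- c(70, 64, 60)      # inches

# Same formula as above, applied element by element
bmi <- (weight / height^2) * 703
round(bmi, 1)

# First respondent by hand: 175 / 70^2 = 175 / 4900, times 703
175 / 4900 * 703
```

The first respondent comes out at about 25.1, close to the median of roughly 25 read off the boxplot.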

To make a histogram of bmi:

> hist(bmi)

You can control the number of bins by adding an argument to the command. The hist(bmi) call above uses the default bins; the next line asks for about 50 breaks.

> hist(bmi,breaks=50)

Using the same tools (the plot function) make a scatterplot of weight versus desired weight.

> plot(cdc$weight,cdc$wtdesire,xlim=c(0,700))

->moderately strong positive linear association
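The strength of a linear association can also be put into a number with cor. The sketch below uses only the ten respondents printed earlier, so its result is illustrative and is not the correlation for the full cdc data.

```r
# weight and wtdesire for the first 10 cdc respondents shown earlier
weight   <- c(175, 125, 105, 132, 150, 114, 194, 170, 150, 180)
wtdesire <- c(175, 115, 105, 124, 130, 114, 185, 160, 130, 170)

# Pearson correlation: near +1 means a strong positive linear association
cor(weight, wtdesire)

# Average number of pounds these ten respondents would like to lose
mean(weight - wtdesire)
```

On a scatterplot, abline(0, 1) draws the line where desired weight equals actual weight, so points below it are respondents who want to weigh less.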


dim, str, plot

Python, R Analysis and Programming 2015. 10. 29. 14:58

source: https://class.coursera.org/statistics-004 (coursera: Data Analysis and Statistical Inference by Duke Univ. by Dr. Mine Çetinkaya-Rundel)

The present data set refers to the number of male and female births in the United States. The data set contains the data for all years from 1940 to 2002. We can take a look at the data by typing its name into the console.

> source("http://www.openintro.org/stat/data/present.R")

> head(present)

  year    boys   girls

1 1940 1211684 1148715

2 1941 1289734 1223693

3 1942 1444365 1364631

4 1943 1508959 1427901

5 1944 1435301 1359499

6 1945 1404587 1330869

You can see the dimensions of this data frame by typing:

> dim(present)

[1] 63  3

> names(present)

[1] "year"  "boys"  "girls"

> str(present)

'data.frame':   63 obs. of  3 variables:

 $ year : num  1940 1941 1942 1943 1944 ...

 $ boys : num  1211684 1289734 1444365 1508959 1435301 ...

 $ girls: num  1148715 1223693 1364631 1427901 1359499 ...

Let’s start to examine the data a little more closely. We can access the data in a single column of a data frame separately using a command like

> present$boys

 [1] 1211684 1289734 1444365 1508959 1435301 1404587 1691220 1899876 1813852 1826352 1823555 1923020 1971262 2001798

[15] 2059068 2073719 2133588 2179960 2152546 2173638 2179708 2186274 2132466 2101632 2060162 1927054 1845862 1803388

[29] 1796326 1846572 1915378 1822910 1669927 1608326 1622114 1613135 1624436 1705916 1709394 1791267 1852616 1860272

[43] 1885676 1865553 1879490 1927983 1924868 1951153 2002424 2069490 2129495 2101518 2082097 2048861 2022589 1996355

[57] 1990480 1985596 2016205 2026854 2076969 2057922 2057979

We can create a simple plot of the number of girls born per year with the command

> plot(x = present$year, y = present$girls)

-> There is initially an increase in the number of girls born, which peaks around 1960. After 1960 there is a decrease in the number of girls born, but the number begins to increase again in the early 1970s. Overall the trend is an increase in the number of girls born in the US since the 1940s.

If we wanted to connect the data points with lines, we could add a third argument, the letter l for line.

> plot(x = present$year, y = present$girls,type="l")

If we wanted to add a title, we could use the main argument:

> plot(x = present$year, y = present$girls,main="Girls born per year")

To get the total number of births in each year, add the vector of births for boys to the vector for girls; R computes all the sums at once. The first element is the total for 1940.

> present$boys+present$girls

 [1] 2360399 2513427 2808996 2936860 2794800 2735456 3288672 3699940 3535068

[10] 3559529 3554149 3750850 3846986 3902120 4017362 4047295 4163090 4254784

[19] 4203812 4244796 4257850 4268326 4167362 4098020 4027490 3760358 3606274

[28] 3520959 3501564 3600206 3731386 3555970 3258411 3136965 3159958 3144198

[37] 3167788 3326632 3333279 3494398 3612258 3629238 3680537 3638933 3669141

[46] 3760561 3756547 3809394 3909510 4040958 4158212 4110907 4065014 4000240

[55] 3952767 3899589 3891494 3880894 3941553 3959417 4058814 4025933 4021726
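A minimal sketch of this element-wise arithmetic, using the 1940 to 1942 values from head(present) above so it runs without downloading the data:

```r
# First three years from head(present)
year  <- c(1940, 1941, 1942)
boys  <- c(1211684, 1289734, 1444365)
girls <- c(1148715, 1223693, 1364631)

# "+" works position by position: boys[1] + girls[1], boys[2] + girls[2], ...
total <- boys + girls
total

# Name the elements by year to make the result easier to read
names(total) <- year
total
```

The three sums match the first three entries of the full output above.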

We can make a plot of the total number of births per year with the command

> plot(present$year,present$boys+present$girls,type="l")

-> The total number of births in the U.S. peaked in 1961.
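Reading the peak year off a plot is approximate; which.max returns the position of the largest value, which can then be used to look up the year. A sketch using the totals for 1959 to 1963 copied from the output above:

```r
# Totals for 1959-1963, copied from the boys + girls output above
year  <- 1959:1963
total <- c(4244796, 4257850, 4268326, 4167362, 4098020)

# Position of the largest total, then the matching year
peak <- which.max(total)
year[peak]
total[peak]
```

On the full data the same idea would be present$year[which.max(present$boys + present$girls)].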

The proportion of newborns that are boys in each year:

> present$boys/(present$boys+present$girls)

 [1] 0.5133386 0.5131376 0.5141926 0.5138001 0.5135613 0.5134745 0.5142562

 [8] 0.5134883 0.5131024 0.5130881 0.5130778 0.5126891 0.5124173 0.5130027

[15] 0.5125423 0.5123716 0.5125011 0.5123550 0.5120462 0.5120713 0.5119269

[22] 0.5122088 0.5117064 0.5128408 0.5115250 0.5124656 0.5118474 0.5121866

[29] 0.5130068 0.5129073 0.5133154 0.5126337 0.5124973 0.5127013 0.5133340

[36] 0.5130513 0.5127982 0.5128057 0.5128266 0.5126110 0.5128692 0.5125792

[43] 0.5123372 0.5126648 0.5122425 0.5126849 0.5124035 0.5121951 0.5121931

[50] 0.5121286 0.5121179 0.5112054 0.5121992 0.5121845 0.5116894 0.5119398

[57] 0.5114951 0.5116337 0.5115255 0.5119072 0.5117182 0.5111665 0.5117154

To see how this proportion changes over time, plot it against year:

> plot(present$year, present$boys/(present$boys+present$girls))

-> Based on the plot, the proportion of boys born in the US has decreased over time, and every year more boys than girls are born.

We can ask if boys outnumber girls in each year with the expression

> present$boys > present$girls

 [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE

[15] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE

[29] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE

[43] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE

[57] TRUE TRUE TRUE TRUE TRUE TRUE TRUE

-> Every year there are more boys born than girls.
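Scanning 63 TRUE values by eye is error-prone. A small sketch of summarising a logical vector with all, sum, and which, again using the 1940 to 1942 values from head(present):

```r
# First three years from head(present)
year  <- c(1940, 1941, 1942)
boys  <- c(1211684, 1289734, 1444365)
girls <- c(1148715, 1223693, 1364631)

more_boys <- boys > girls

all(more_boys)        # TRUE only if every year has more boys
sum(more_boys)        # TRUE counts as 1, so this is the number of such years
year[!more_boys]      # years where girls outnumber boys (none here)
```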

> plot(present$year,present$boys-present$girls)

If the values on the y-axis are hard to read, compute the minimum and maximum difference and set ylim explicitly:

> dd<-present$boys-present$girls

> min(dd,na.rm=T)

[1] 62969

> max(dd,na.rm=T)

[1] 105244

> plot(present$year,dd,ylim=c(60000,120000))

If we wanted to name the y-axis and the title,

> plot(present$year,dd,ylab="Difference between number of boys and girls",
+ ylim=c(60000,120000),main="Absolute differences")

-> The largest absolute difference between the numbers of boys and girls born occurs in 1963. The boy-to-girl ratio initially decreases, then increases between 1960 and 1970, and then decreases again.

To mark the means of the x and y variables on the plot:

> plot(present$boys,present$girls,xlim=c(1100000,2200000),
+ ylim=c(1100000,2200000))

> abline(v=mean(present$boys),lty=2)

> abline(h=mean(present$girls),lty=2)
