
  1. 2015.10.31 subset, mosaicplot, hist, var, sd

source: https://class.coursera.org/statistics-004 (coursera: Data Analysis and Statistical Inference by Duke Univ. by Dr. Mine Çetinkaya-Rundel)

 

> source("http://www.openintro.org/stat/data/cdc.R")

Calling names(cdc) after loading the data returns the names genhlth, exerany, hlthplan, smoke100, height, weight, wtdesire, age, and gender. Each one of these variables corresponds to a question that was asked in the survey. For example, for genhlth, respondents were asked to evaluate their general health, responding either excellent, very good, good, fair or poor. The exerany variable indicates whether the respondent exercised in the past month (1) or did not (0). Likewise, hlthplan indicates whether the respondent had some form of health coverage (1) or did not (0). The smoke100 variable indicates whether the respondent had smoked at least 100 cigarettes in their lifetime (1) or had not (0). The other variables record the respondent’s height in inches, weight in pounds, desired weight in pounds (wtdesire), age in years, and gender.

> summary(cdc)

      genhlth        exerany          hlthplan         smoke100          height
 excellent:4657   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :48.00
 very good:6972   1st Qu.:0.0000   1st Qu.:1.0000   1st Qu.:0.0000   1st Qu.:64.00
 good     :5675   Median :1.0000   Median :1.0000   Median :0.0000   Median :67.00
 fair     :2019   Mean   :0.7457   Mean   :0.8738   Mean   :0.4721   Mean   :67.18
 poor     : 677   3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:70.00
                  Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :93.00
     weight         wtdesire          age         gender
 Min.   : 68.0   Min.   : 68.0   Min.   :18.00   m: 9569
 1st Qu.:140.0   1st Qu.:130.0   1st Qu.:31.00   f:10431
 Median :165.0   Median :150.0   Median :43.00
 Mean   :169.7   Mean   :155.1   Mean   :45.07
 3rd Qu.:190.0   3rd Qu.:175.0   3rd Qu.:57.00
 Max.   :500.0   Max.   :680.0   Max.   :99.00

 

> head(cdc$genhlth) #categorical, ordinal

[1] good      good      good      good      very good very good

Levels: excellent very good good fair poor

> levels(cdc$genhlth)

[1] "excellent" "very good" "good"      "fair"      "poor"    

> nlevels(cdc$genhlth)

[1] 5

> levels(cdc$genhlth)[2]

[1] "very good"
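To see how a factor like genhlth keeps its levels in a fixed order, here is a minimal sketch on made-up responses. The vector `h` is hypothetical and not taken from the cdc data; only the five level names come from the output above.

```r
# Hypothetical responses; the level order is declared explicitly, as in genhlth.
h <- factor(c("good", "poor", "excellent", "good"),
            levels = c("excellent", "very good", "good", "fair", "poor"))

levels(h)        # all five levels, in the declared order
nlevels(h)       # counts declared levels, not just the ones that occur
table(h)         # counts per level, including zeros for unused levels
as.integer(h)    # integer codes: the position of each value in levels(h)
```

Because the order of the levels is stored with the factor, tables and plots of genhlth run from excellent to poor rather than alphabetically.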

  

R also has built-in functions to compute summary statistics one by one. For instance, to calculate the mean, variance, standard deviation, and median of weight, type

> mean(cdc$weight)

[1] 169.683

> var(cdc$weight)

[1] 1606.484

> sd(cdc$weight)

[1] 40.08097

> median(cdc$weight)

[1] 165

> summary(cdc$weight) #numerical, continuous

Min. 1st Qu.  Median    Mean 3rd Qu.    Max.

68.0   140.0   165.0   169.7   190.0   500.0
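In the output above, the standard deviation is the square root of the variance (40.08097^2 is approximately 1606.484). The sketch below checks the same relationship by hand on a small made-up vector; `w` is a hypothetical example, not cdc data.

```r
# Hypothetical weights in pounds; not taken from the cdc data set.
w <- c(150, 160, 170, 180, 190)

m <- mean(w)                            # arithmetic mean
v <- sum((w - m)^2) / (length(w) - 1)   # sample variance by hand (n - 1 denominator)

all.equal(v, var(w))         # the hand computation matches var()
all.equal(sqrt(v), sd(w))    # sd() is the square root of var()
```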

 

The smoke100 variable indicates whether the respondent has smoked at least 100 cigarettes in their lifetime (1) or has not (0).

> summary(cdc$smoke100)

Min. 1st Qu.  Median    Mean 3rd Qu.    Max.

0.000   0.000   0.000   0.472   1.000   1.000

 -> What about categorical data? A five-number summary is not very informative for a 0/1 variable, so we would instead consider the sample frequency or relative frequency distribution. The function table does this for you by counting the number of times each kind of response was given. For example, to see the number of people who have smoked at least 100 cigarettes in their lifetime, type

> table(cdc$smoke100)

0     1

10559  9441

 

or instead look at the relative frequency distribution by typing

> table(cdc$smoke100)/20000

0       1

0.52795 0.47205
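Dividing by 20000 works because that is the number of respondents in cdc. As a hedged alternative that does not hard-code the sample size, the sketch below uses length() and prop.table() on a made-up vector; `smoke100` here is a hypothetical stand-in, not the real column.

```r
# Hypothetical 0/1 responses standing in for cdc$smoke100.
smoke100 <- c(0, 1, 1, 0, 0, 1, 0, 0)

counts <- table(smoke100)              # frequency of each response
rel_a  <- counts / length(smoke100)    # divide by the actual sample size
rel_b  <- prop.table(counts)           # same result via the built-in helper

print(rel_b)
```

Applied to the real data, prop.table(table(cdc$smoke100)) should give the same proportions as the division by 20000 shown above.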

 

Next, we make a bar plot of the entries in the table by putting the table inside the barplot command.

> barplot(table(cdc$smoke100))

or, equivalently, save the table first:

> smoke = table(cdc$smoke100)

> barplot(smoke) 

 

The table command can also cross-tabulate two variables, as in the first command below. In that table the rows refer to gender and the columns to smoke100; recall that 1 indicates that a respondent has smoked at least 100 cigarettes. To create a mosaic plot of the table, pass it to mosaicplot. (The size of a data frame is shown next to the object name in the workspace, and dim() reports it as well.)

 

> gender_smokers = table(cdc$gender,cdc$smoke100)

> mosaicplot(gender_smokers)

--> The mosaic plot shows that males are more likely than females to have smoked at least 100 cigarettes. smoke100 is a categorical variable (not ordinal). The explanatory variable goes on the x-axis and the response variable on the y-axis. The x-axis variable is the data that accounts for the Y we want to understand; because it explains Y, it is called the explanatory or predictor variable, and also the independent variable. The y-axis value responds to X, so it is called the response or dependent variable.
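A mosaic plot compares proportions within each group, and the same comparison can be read numerically from row proportions. The following is a minimal sketch on made-up data: `gender` and `smoke100` below are hypothetical vectors, not the cdc columns.

```r
# Hypothetical toy data standing in for cdc$gender and cdc$smoke100.
gender   <- factor(c("m", "m", "m", "m", "f", "f", "f", "f"), levels = c("m", "f"))
smoke100 <- c(1, 1, 1, 0, 1, 0, 0, 0)

gender_smokers <- table(gender, smoke100)              # rows: gender, columns: smoke100
row_props <- prop.table(gender_smokers, margin = 1)    # proportions within each row

print(row_props)
```

With margin = 1 each row sums to 1, so the "1" column gives the share of each gender that has smoked at least 100 cigarettes.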

 

> cdc[567,6] # the element in the 567th row and 6th column (weight)

[1] 160

> cdc[1:10,6] # the weights of the first 10 respondents

[1] 175 125 105 132 150 114 194 170 150 180

> cdc[1:10,] # all of the data for the first 10 respondents

genhlth exerany hlthplan smoke100 height weight wtdesire age gender

1       good       0        1        0     70    175      175  77      m

2       good       0        1        1     64    125      115  33      f

3       good       1        1        1     60    105      105  49      f

4       good       1        1        0     66    132      124  42      f

5  very good       0        1        0     61    150      130  55      f

6  very good       1        1        0     64    114      114  55      f

7  very good       1        1        0     71    194      185  31      m

8  very good       0        1        0     67    170      160  45      m

9       good       0        1        1     65    150      130  27      f

10      good       1        1        0     70    180      170  44      m
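Positional indices such as 6 are easy to get wrong, so it can help to index by column name instead. This sketch uses a small hypothetical data frame `toy` whose values are copied from the height, weight and age of the first three rows printed above.

```r
# Small stand-in data frame; values copied from the first three rows printed above.
toy <- data.frame(
  height = c(70, 64, 60),
  weight = c(175, 125, 105),
  age    = c(77, 33, 49)
)

toy[2, 2]           # row 2, column 2, by position
toy[2, "weight"]    # the same element, by column name
toy$weight[2]       # the same element, via the $ accessor
toy[1:2, ]          # all columns for the first two rows
```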

 

#this command will create a new data set called mdata that contains only the men from the cdc data set.

> mdata = subset(cdc, cdc$gender == "m")

> head(mdata)

genhlth exerany hlthplan smoke100 height weight wtdesire age gender

1       good       0        1        0     70    175      175  77      m

7  very good       1        1        0     71    194      185  31      m

8  very good       0        1        0     67    170      160  45      m

10      good       1        1        0     70    180      170  44      m

11 excellent       1        1        1     69    186      175  46      m

12      fair       1        1        1     69    168      148  62      m

 

You can use several of these conditions together with & and |. The & is read “and”, so this command will give you the data for men over the age of 30.

> m_and_over30 = subset(cdc, cdc$gender == "m" & cdc$age > 30)

> head(m_and_over30)

genhlth exerany hlthplan smoke100 height weight wtdesire age gender

1       good       0        1        0     70    175      175  77      m

7  very good       1        1        0     71    194      185  31      m

8  very good       0        1        0     67    170      160  45      m

10      good       1        1        0     70    180      170  44      m

11 excellent       1        1        1     69    186      175  46      m

12      fair       1        1        1     69    168      148  62      m

 

The | character is read “or”, so this command will take people who are men or over the age of 30 (or both).

> m_or_over30=subset(cdc,cdc$gender=="m"|cdc$age>30)

> head(m_or_over30)

genhlth exerany hlthplan smoke100 height weight wtdesire age gender

1      good       0        1        0     70    175      175  77      m

2      good       0        1        1     64    125      115  33      f

3      good       1        1        1     60    105      105  49      f

4      good       1        1        0     66    132      124  42      f

5 very good       0        1        0     61    150      130  55      f

6 very good       1        1        0     64    114      114  55      f
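The difference between & and | is easiest to see on a handful of rows. The sketch below is a hedged illustration on a hypothetical data frame `toy`; the values are made up.

```r
# Hypothetical respondents; not taken from the cdc data set.
toy <- data.frame(
  gender = c("m", "f", "m", "f"),
  age    = c(45, 52, 22, 19)
)

toy$gender == "m"    # one TRUE/FALSE per row
toy$age > 30         # one TRUE/FALSE per row

both   <- subset(toy, gender == "m" & age > 30)   # both conditions must hold
either <- subset(toy, gender == "m" | age > 30)   # at least one condition must hold

nrow(both)
nrow(either)
```

Inside subset() the column names can be written without the `toy$` prefix, because the condition is evaluated within the data frame.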

 

To create a new object called under23_and_smoke that contains all observations of respondents under the age of 23 who have smoked at least 100 cigarettes in their lifetime:

> under23_and_smoke = subset(cdc, age < 23 & smoke100 == 1)

> str(under23_and_smoke)

'data.frame':  620 obs. of  9 variables:

  $ genhlth : Factor w/ 5 levels "excellent","very good",..: 1 2 1 3 2 2 4 4 1 4 ...

$ exerany : num  1 1 1 1 1 1 0 1 1 1 ...

$ hlthplan: num  0 0 1 1 1 0 1 1 0 1 ...

$ smoke100: num  1 1 1 1 1 1 1 1 1 1 ...

$ height  : num  66 70 74 64 62 64 71 72 63 71 ...

$ weight  : int  185 160 175 190 92 125 185 185 105 185 ...

$ wtdesire: int  220 140 200 140 92 115 185 170 100 150 ...

$ age     : int  21 18 22 20 21 22 20 19 19 18 ...

$ gender  : Factor w/ 2 levels "m","f": 1 2 1 2 2 2 1 1 1 1 ...

> summary(under23_and_smoke)

      genhlth       exerany          hlthplan         smoke100     height
 excellent:110   Min.   :0.0000   Min.   :0.0000   Min.   :1   Min.   :59.00
 very good:244   1st Qu.:1.0000   1st Qu.:0.0000   1st Qu.:1   1st Qu.:65.00
 good     :204   Median :1.0000   Median :1.0000   Median :1   Median :68.00
 fair     : 53   Mean   :0.8145   Mean   :0.6952   Mean   :1   Mean   :67.92
 poor     :  9   3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:1   3rd Qu.:71.00
                 Max.   :1.0000   Max.   :1.0000   Max.   :1   Max.   :79.00
     weight         wtdesire          age        gender
 Min.   : 85.0   Min.   : 80.0   Min.   :18.00   m:305
 1st Qu.:130.0   1st Qu.:125.0   1st Qu.:19.00   f:315
 Median :155.0   Median :150.0   Median :20.00
 Mean   :158.9   Mean   :152.2   Mean   :20.22
 3rd Qu.:180.0   3rd Qu.:175.0   3rd Qu.:21.00
 Max.   :350.0   Max.   :315.0   Max.   :22.00

> dim(under23_and_smoke)

[1] 620   9

 -> 620 observations are in the subset under23_and_smoke
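If only the count is needed, it is not necessary to build the subset first: summing a logical vector counts its TRUE values. A minimal sketch on a hypothetical data frame `toy` (made-up values):

```r
# Hypothetical respondents; not taken from the cdc data set.
toy <- data.frame(
  age      = c(19, 22, 23, 40, 21),
  smoke100 = c(1, 0, 1, 1, 1)
)

cond <- toy$age < 23 & toy$smoke100 == 1   # TRUE where both conditions hold

sum(cond)             # TRUE counts as 1, FALSE as 0
nrow(toy[cond, ])     # same count, via the filtered rows
```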

 

The purpose of a boxplot is to provide a thumbnail sketch of a variable for the purpose of comparing across several categories. So we can, for example, compare the heights of men and women with the command below. The notation here is new: the ~ character can be read “versus” or “as a function of”, so we’re asking R to give us box plots of heights where the groups are defined by gender.

> boxplot(cdc$height ~ cdc$gender)
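The formula can also be written with bare column names plus a data argument, and boxplot() can return the numbers behind the boxes instead of drawing them. This is a sketch on a hypothetical data frame `toy` with made-up heights.

```r
# Hypothetical heights in inches by gender; not taken from the cdc data set.
toy <- data.frame(
  height = c(70, 72, 68, 74, 64, 62, 66, 65),
  gender = factor(rep(c("m", "f"), each = 4), levels = c("m", "f"))
)

# plot = FALSE returns the statistics without opening a graphics device.
bp <- boxplot(height ~ gender, data = toy, plot = FALSE)

bp$names         # group labels, in factor-level order
bp$stats[3, ]    # row 3 of the stats matrix holds each group's median
```

Dropping plot = FALSE draws the usual side-by-side boxes; for the real data the equivalent call would be boxplot(height ~ gender, data = cdc).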

 

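Next let’s consider a new variable that doesn’t show up directly in this data set: Body Mass Index (BMI). BMI is a weight-to-height ratio; with weight in pounds and height in inches it can be calculated as follows, where 703 is the approximate factor that converts from imperial units (lb/in^2) to the metric definition (kg/m^2).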

> bmi = (cdc$weight / cdc$height^2) * 703

> boxplot(bmi~cdc$genhlth)

 --> Among people with excellent health, there are some with unusually low BMIs compared to the rest of the group. The IQR increases slightly as general health status declines (from excellent to poor). The median BMI is roughly 25 for all general health categories, and there is a slight increase in median BMI as general health status declines (from excellent to poor).
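The medians read off the box plot can also be computed directly. The sketch below wraps the BMI formula in a helper and takes the median per health category; `bmi_imperial` and the data frame `toy` are hypothetical names with made-up values, not part of the original lab.

```r
# Imperial-unit BMI: weight in pounds, height in inches, 703 as the unit conversion factor.
bmi_imperial <- function(weight_lb, height_in) {
  (weight_lb / height_in^2) * 703
}

# Hypothetical respondents; not taken from the cdc data set.
toy <- data.frame(
  weight  = c(150, 180, 200, 240),
  height  = c(66, 70, 68, 69),
  genhlth = factor(c("excellent", "excellent", "poor", "poor"),
                   levels = c("excellent", "poor"))
)
toy$bmi <- bmi_imperial(toy$weight, toy$height)

# Median BMI within each health category.
med <- tapply(toy$bmi, toy$genhlth, median)
print(med)
```

On the real data, tapply(bmi, cdc$genhlth, median) would give the five medians that the box plot displays as the central lines.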

 

 

To make a histogram:

> hist(bmi)

 

You can control the number of bins by adding an argument to the command. The call above makes a default histogram of bmi; the next line makes one with 50 breaks.

> hist(bmi,breaks=50)
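When breaks is a single number, R treats it as a suggestion and rounds the bin edges to convenient values, so the result may not have exactly 50 bins. The sketch below inspects this on simulated data; `x` is a hypothetical BMI-like sample, not the real bmi vector.

```r
set.seed(1)                            # reproducible made-up data
x <- rnorm(1000, mean = 26, sd = 5)    # hypothetical BMI-like values

h_default <- hist(x, plot = FALSE)                # default bin choice
h_fine    <- hist(x, breaks = 50, plot = FALSE)   # request about 50 bins

length(h_default$counts)    # number of bins actually used by default
length(h_fine$counts)       # close to, but not necessarily exactly, 50
sum(h_fine$counts)          # every observation falls in exactly one bin
```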

 

Using the same tools (the plot function) make a scatterplot of weight versus desired weight.

 > plot(cdc$weight,cdc$wtdesire,xlim=c(0,700))

   

 -> Moderately strong positive linear association between weight and desired weight
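The strength of a linear association seen in a scatterplot can be summarized with the correlation coefficient. A minimal sketch on made-up values follows; the vectors `weight` and `wtdesire` are hypothetical and not taken from cdc.

```r
# Hypothetical current and desired weights in pounds.
weight   <- c(120, 150, 165, 180, 210, 250)
wtdesire <- c(118, 140, 155, 165, 185, 200)

r <- cor(weight, wtdesire)   # Pearson correlation, between -1 and 1
print(r)
```

For the real data the call would be cor(cdc$weight, cdc$wtdesire); values near 1 indicate a strong positive linear relationship, though extreme points such as the 680-pound desired weight can pull the estimate.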

Posted by 마르띤