Loading the data
source("http://www.openintro.org/stat/data/cdc.R")
names(cdc)
## [1] "genhlth" "exerany" "hlthplan" "smoke100" "height" "weight" "wtdesire"
## [8] "age" "gender"
Exercise 1: How many cases are there in this data set? How many
variables? For each variable, identify its data type (e.g. categorical,
discrete).
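There are 20000 observations (cases) of 9 variables.
genhlth and gender are categorical; genhlth is ordinal, with levels
running from poor to excellent. exerany, hlthplan and smoke100 are also
categorical: they are yes/no indicators coded as 0 and 1. height, weight
and wtdesire are numerical and continuous, although they are recorded as
whole numbers. age is numerical and discrete, since it is recorded in
whole years.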
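The storage class R reports for a column does not settle its data type on its own, because a column stored as numbers may still encode categories. The sketch below checks this on a small stand-in data frame; `cdc_head` is a hypothetical name, and its values are copied from the first six rows printed by head(cdc) in this report.

```r
# Stand-in for the full data set: the six rows shown by head(cdc)
cdc_head <- data.frame(
  genhlth  = factor(c("good", "good", "good", "good", "very good", "very good")),
  exerany  = c(0, 0, 1, 1, 0, 1),
  hlthplan = c(1, 1, 1, 1, 1, 1),
  smoke100 = c(0, 1, 1, 0, 0, 0),
  height   = c(70, 64, 60, 66, 61, 64),
  weight   = c(175, 125, 105, 132, 150, 114),
  wtdesire = c(175, 115, 105, 124, 130, 114),
  age      = c(77, 33, 49, 42, 55, 55),
  gender   = factor(c("m", "f", "f", "f", "f", "f"))
)

dim(cdc_head)            # number of cases and number of variables
sapply(cdc_head, class)  # storage class of each column

# Numeric columns that only ever hold 0 or 1 are indicator variables
is_binary <- sapply(cdc_head, function(x) is.numeric(x) && all(x %in% c(0, 1)))
names(cdc_head)[is_binary]
```

On the full data frame the same calls would be dim(cdc) and sapply(cdc, class).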
head(cdc)
## genhlth exerany hlthplan smoke100 height weight wtdesire age gender
## 1 good 0 1 0 70 175 175 77 m
## 2 good 0 1 1 64 125 115 33 f
## 3 good 1 1 1 60 105 105 49 f
## 4 good 1 1 0 66 132 124 42 f
## 5 very good 0 1 0 61 150 130 55 f
## 6 very good 1 1 0 64 114 114 55 f
tail(cdc)
## genhlth exerany hlthplan smoke100 height weight wtdesire age gender
## 19995 good 0 1 1 69 224 224 73 m
## 19996 good 1 1 0 66 215 140 23 f
## 19997 excellent 0 1 0 73 200 185 35 m
## 19998 poor 0 1 0 65 216 150 57 f
## 19999 good 1 1 0 67 165 165 81 f
## 20000 good 1 1 1 69 170 165 83 m
Summaries and tables
summary(cdc$weight)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 68.0 140.0 165.0 169.7 190.0 500.0
190 - 140
## [1] 50
mean(cdc$weight)
## [1] 169.683
var(cdc$weight)
## [1] 1606.484
median(cdc$weight)
## [1] 165
table(cdc$smoke100)
##
## 0 1
## 10559 9441
table(cdc$smoke100)/20000
##
## 0 1
## 0.52795 0.47205
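Dividing by a hard-coded 20000 only works while the sample size stays the same. As a minimal sketch, the relative frequencies can instead be computed from the table itself; `smoke_counts` is a hypothetical name, and its counts are copied from the table(cdc$smoke100) output above.

```r
# Counts copied from the table(cdc$smoke100) output
smoke_counts <- as.table(c("0" = 10559, "1" = 9441))

# Two equivalent ways to turn counts into proportions
by_prop_table <- prop.table(smoke_counts)
by_division   <- smoke_counts / sum(smoke_counts)

by_prop_table
```

With the real data, prop.table(table(cdc$smoke100)) gives the same proportions without referring to the number of rows.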
barplot(table(cdc$smoke100))

smoke <- table(cdc$smoke100)
barplot(smoke)

Exercise 2: Create a numerical summary for height and age, and
compute the interquartile range for each. Compute the relative frequency
distribution for gender and exerany. How many males are in the sample?
What proportion of the sample reports being in excellent health?
Numerical summary and interquartile range
summary(cdc$height)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 48.00 64.00 67.00 67.18 70.00 93.00
summary(cdc$age)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 18.00 31.00 43.00 45.07 57.00 99.00
IQR(cdc$height)
## [1] 6
IQR(cdc$age)
## [1] 26
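The interquartile range is the third quartile minus the first, so the IQR() results agree with the summaries above: 70 - 64 = 6 for height and 57 - 31 = 26 for age. The sketch below shows the same relationship on a short vector; `ages` is a hypothetical name holding the ages of the first ten respondents listed in this report.

```r
# Ages of the first ten respondents, copied from the cdc[1:10,] listing
ages <- c(77, 33, 49, 42, 55, 55, 31, 45, 27, 44)

# First and third quartiles, using R's default quantile method
q <- quantile(ages, c(0.25, 0.75))

iqr_by_hand  <- unname(q[2] - q[1])  # Q3 minus Q1
iqr_built_in <- IQR(ages)

c(iqr_by_hand, iqr_built_in)

# The same subtraction applied to the quartiles printed by summary()
height_iqr <- 70 - 64
age_iqr    <- 57 - 31
```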
Relative frequency distribution for gender and exerany
gender_freq <- prop.table(table(cdc$gender))
gender_freq
##
## m f
## 0.47845 0.52155
prop.table(table(cdc$exerany))
##
## 0 1
## 0.2543 0.7457
Number of males in the sample and proportion of the sample
reporting excellent health
sum(cdc$gender == "m")
## [1] 9569
mean(cdc$genhlth == "excellent")
## [1] 0.23285
table(cdc$gender,cdc$smoke100)
##
## 0 1
## m 4547 5022
## f 6012 4419
mosaicplot(table(cdc$gender,cdc$smoke100))

Exercise 3: What does the mosaic plot reveal about smoking habits
and gender?
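A larger share of males than females report having smoked at least 100
cigarettes in their lifetime: 5022 of 9569 males (about 52%) versus
4419 of 10431 females (about 42%).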
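Because the sample contains more females than males, comparing raw counts can mislead, and proportions within each gender are the fairer comparison. A minimal sketch, rebuilding the two-way table from the counts printed above (`smoke_by_gender` and `row_props` are hypothetical names):

```r
# Counts copied from table(cdc$gender, cdc$smoke100); matrix() fills by column
smoke_by_gender <- matrix(
  c(4547, 6012, 5022, 4419),
  nrow = 2,
  dimnames = list(gender = c("m", "f"), smoke100 = c("0", "1"))
)

# margin = 1 divides each row by its row total, so every row sums to 1
row_props <- prop.table(smoke_by_gender, margin = 1)
round(row_props, 3)
```

On the real data the equivalent call would be prop.table(table(cdc$gender, cdc$smoke100), margin = 1).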
How R thinks about data
dim(cdc)
## [1] 20000 9
cdc[567,6]
## [1] 160
names(cdc)
## [1] "genhlth" "exerany" "hlthplan" "smoke100" "height" "weight" "wtdesire"
## [8] "age" "gender"
cdc[1:10,6]
1:10
## [1] 1 2 3 4 5 6 7 8 9 10
cdc[1:10,]
## genhlth exerany hlthplan smoke100 height weight wtdesire age gender
## 1 good 0 1 0 70 175 175 77 m
## 2 good 0 1 1 64 125 115 33 f
## 3 good 1 1 1 60 105 105 49 f
## 4 good 1 1 0 66 132 124 42 f
## 5 very good 0 1 0 61 150 130 55 f
## 6 very good 1 1 0 64 114 114 55 f
## 7 very good 1 1 0 71 194 185 31 m
## 8 very good 0 1 0 67 170 160 45 m
## 9 good 0 1 1 65 150 130 27 f
## 10 good 1 1 0 70 180 170 44 m
cdc[,6]
cdc$weight
cdc$weight[567]
## [1] 160
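The indexing forms above all reach the same cell by different routes. The sketch below compares them on a stand-in data frame; `cdc10` is a hypothetical name, and its values are copied from the cdc[1:10,] listing in this report.

```r
# Stand-in for the full data set: the ten rows shown by cdc[1:10,]
cdc10 <- data.frame(
  genhlth  = c("good", "good", "good", "good", "very good", "very good",
               "very good", "very good", "good", "good"),
  exerany  = c(0, 0, 1, 1, 0, 1, 1, 0, 0, 1),
  hlthplan = rep(1, 10),
  smoke100 = c(0, 1, 1, 0, 0, 0, 0, 0, 1, 0),
  height   = c(70, 64, 60, 66, 61, 64, 71, 67, 65, 70),
  weight   = c(175, 125, 105, 132, 150, 114, 194, 170, 150, 180),
  wtdesire = c(175, 115, 105, 124, 130, 114, 185, 160, 130, 170),
  age      = c(77, 33, 49, 42, 55, 55, 31, 45, 27, 44),
  gender   = c("m", "f", "f", "f", "f", "f", "m", "m", "f", "m")
)

# Three ways to reach the weight of the seventh respondent
by_position <- cdc10[7, 6]         # row 7, column 6
by_name     <- cdc10[7, "weight"]  # row 7, column named "weight"
by_vector   <- cdc10$weight[7]     # pull out the column, then element 7

c(by_position, by_name, by_vector)
```

Indexing by name is usually the safer choice, since it keeps working if the columns are reordered.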
A little more on subsetting
cdc$gender == "m"
cdc$age > 30
mdata <- subset(cdc, gender == "m")
head(mdata)
## genhlth exerany hlthplan smoke100 height weight wtdesire age gender
## 1 good 0 1 0 70 175 175 77 m
## 7 very good 1 1 0 71 194 185 31 m
## 8 very good 0 1 0 67 170 160 45 m
## 10 good 1 1 0 70 180 170 44 m
## 11 excellent 1 1 1 69 186 175 46 m
## 12 fair 1 1 1 69 168 148 62 m
m_and_over30 <- subset(cdc, gender == "m" & age > 30)
m_or_over30 <- subset(cdc, gender == "m" | age > 30)
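The difference between & (both conditions must hold) and | (at least one must hold) is easiest to see by counting the rows each keeps. A minimal sketch on a stand-in data frame; `people` is a hypothetical name, and the ages and genders are those of the first ten respondents listed in this report.

```r
# Age and gender of the first ten respondents
people <- data.frame(
  age    = c(77, 33, 49, 42, 55, 55, 31, 45, 27, 44),
  gender = c("m", "f", "f", "f", "f", "f", "m", "m", "f", "m")
)

# Logical vectors, one TRUE/FALSE per row
is_male <- people$gender == "m"
over_30 <- people$age > 30

# Bracket indexing with a logical vector keeps the rows marked TRUE
both   <- people[is_male & over_30, ]  # male AND over 30
either <- people[is_male | over_30, ]  # male OR over 30 (or both)

nrow(both)
nrow(either)

# subset() selects the same rows with less typing
nrow(subset(people, gender == "m" & age > 30))
```

The "or" subset can never have fewer rows than the "and" subset, which makes a quick sanity check on the full data as well.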
Exercise 4: Create a new object called under23_and_smoke that
contains all observations of respondents under the age of 23 that have
smoked 100 cigarettes in their lifetime. Write the command you used to
create the new object as the answer to this exercise.
under23_and_smoke <- subset(cdc, smoke100 == 1 & age < 23)