Introduction to data lab

Loading the data

source("http://www.openintro.org/stat/data/cdc.R")
names(cdc)
## [1] "genhlth"  "exerany"  "hlthplan" "smoke100" "height"   "weight"   "wtdesire"
## [8] "age"      "gender"

Exercise 1: How many cases are there in this data set? How many variables? For each variable, identify its data type (e.g. categorical, discrete).

There are 20000 observations (cases) of 9 distinct variables.

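genhlth and gender are categorical: genhlth is ordinal (its levels run from excellent to poor), while gender is nominal. exerany, hlthplan, and smoke100 are stored as 0/1 numbers, but they are binary categorical indicators (no/yes), not numerical measurements. height, weight, wtdesire, and age are numerical. Height, weight, and desired weight are continuous quantities that happen to be recorded as whole numbers, and age is recorded in whole years, so it is discrete as recorded.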
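
How R stores a column and what statistical type it represents are separate questions. The sketch below checks storage classes on a small made-up data frame (`toy` is hypothetical and only mimics three cdc columns; the real cdc columns may be stored differently) and shows one way to make a 0/1 indicator explicitly categorical.

```r
# Toy data frame mimicking three cdc columns (values are made up for illustration)
toy <- data.frame(
  genhlth  = factor(c("good", "very good", "poor"),
                    levels = c("excellent", "very good", "good", "fair", "poor")),
  smoke100 = c(0, 1, 1),
  height   = c(70, 64, 60)
)

# Storage class of each column as R sees it
col_classes <- sapply(toy, class)
print(col_classes)

# A 0/1 indicator is stored as numeric but is categorical in meaning;
# converting it to a factor makes that explicit
toy$smoke100 <- factor(toy$smoke100, levels = c(0, 1), labels = c("no", "yes"))
print(levels(toy$smoke100))
```

On the real data, `str(cdc)` gives the same kind of overview in one call.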

head(cdc)
##     genhlth exerany hlthplan smoke100 height weight wtdesire age gender
## 1      good       0        1        0     70    175      175  77      m
## 2      good       0        1        1     64    125      115  33      f
## 3      good       1        1        1     60    105      105  49      f
## 4      good       1        1        0     66    132      124  42      f
## 5 very good       0        1        0     61    150      130  55      f
## 6 very good       1        1        0     64    114      114  55      f
tail(cdc)
##         genhlth exerany hlthplan smoke100 height weight wtdesire age gender
## 19995      good       0        1        1     69    224      224  73      m
## 19996      good       1        1        0     66    215      140  23      f
## 19997 excellent       0        1        0     73    200      185  35      m
## 19998      poor       0        1        0     65    216      150  57      f
## 19999      good       1        1        0     67    165      165  81      f
## 20000      good       1        1        1     69    170      165  83      m

Summaries and tables

summary(cdc$weight)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    68.0   140.0   165.0   169.7   190.0   500.0
190 - 140
## [1] 50
mean(cdc$weight) 
## [1] 169.683
var(cdc$weight)
## [1] 1606.484
median(cdc$weight)
## [1] 165
table(cdc$smoke100)
## 
##     0     1 
## 10559  9441
table(cdc$smoke100)/20000
## 
##       0       1 
## 0.52795 0.47205
barplot(table(cdc$smoke100))

smoke <- table(cdc$smoke100)
barplot(smoke)

Exercise 2: Create a numerical summary for height and age, and compute the interquartile range for each. Compute the relative frequency distribution for gender and exerany. How many males are in the sample? What proportion of the sample reports being in excellent health?

Numerical summary and interquartile range

summary(cdc$height)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   48.00   64.00   67.00   67.18   70.00   93.00
summary(cdc$age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   18.00   31.00   43.00   45.07   57.00   99.00
IQR(cdc$height)
## [1] 6
IQR(cdc$age)
## [1] 26
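
The interquartile range is the third quartile minus the first quartile, which is why it can also be read off the `summary()` output. A minimal sketch on made-up heights (the vector `x` is hypothetical, not taken from cdc) confirms that `IQR()` matches the difference of the two quantiles:

```r
# Made-up heights in inches, for illustration only
x <- c(60, 62, 64, 66, 67, 68, 70, 72, 75)

# First and third quartiles using R's default quantile method
q <- quantile(x, probs = c(0.25, 0.75))

# IQR computed by hand as Q3 - Q1
iqr_by_hand <- unname(q[2] - q[1])

# Compare with the built-in function
print(iqr_by_hand)
print(IQR(x))
```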

Relative frequency distribution for gender and exerany

gender_freq <- prop.table(table(cdc$gender))
gender_freq
## 
##       m       f 
## 0.47845 0.52155
prop.table(table(cdc$exerany))
## 
##      0      1 
## 0.2543 0.7457
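
`prop.table()` on a one-way table does the same thing as dividing the counts by the sample size, without hard-coding the number of rows. A small sketch with a made-up indicator vector (`smoke100` here is a hypothetical five-element vector, not the cdc column):

```r
# Made-up 0/1 indicator, for illustration only
smoke100 <- c(0, 1, 1, 0, 1)

# Counts per category
tab <- table(smoke100)

# Relative frequencies two ways: manual division and prop.table()
rel_manual <- tab / length(smoke100)
rel_prop   <- prop.table(tab)

print(rel_manual)
print(rel_prop)
```

Using `prop.table()` (or dividing by `nrow(cdc)`) keeps the code correct if the number of rows changes.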

Number of males in the sample and proportion of the sample reporting excellent health

sum(cdc$gender == "m")
## [1] 9569
mean(cdc$genhlth == "excellent")
## [1] 0.23285
table(cdc$gender,cdc$smoke100)
##    
##        0    1
##   m 4547 5022
##   f 6012 4419
mosaicplot(table(cdc$gender,cdc$smoke100))

Exercise 3: What does the mosaic plot reveal about smoking habits and gender?

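The mosaic plot shows that a larger share of males than females report having smoked at least 100 cigarettes in their lifetime: 5022 of 9569 males (about 52%) versus 4419 of 10431 females (about 42%).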
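
Because the two gender groups differ in size, comparing within-gender proportions is safer than comparing raw counts. The sketch below rebuilds the gender by smoke100 table from the counts printed above (the object names `counts` and `row_props` are introduced here for illustration) and computes row proportions with `prop.table(margin = 1)`:

```r
# Counts copied from the table(cdc$gender, cdc$smoke100) output above
counts <- matrix(c(4547, 5022,
                   6012, 4419),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(gender = c("m", "f"), smoke100 = c("0", "1")))

# Proportions within each gender (each row sums to 1)
row_props <- prop.table(counts, margin = 1)
print(round(row_props, 3))
```

On the full data set the same result should come from `prop.table(table(cdc$gender, cdc$smoke100), margin = 1)`.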

How R thinks about data

dim(cdc)
## [1] 20000     9
cdc[567,6]
## [1] 160
names(cdc)
## [1] "genhlth"  "exerany"  "hlthplan" "smoke100" "height"   "weight"   "wtdesire"
## [8] "age"      "gender"
cdc[1:10,6]
1:10
##  [1]  1  2  3  4  5  6  7  8  9 10
cdc[1:10,]
##      genhlth exerany hlthplan smoke100 height weight wtdesire age gender
## 1       good       0        1        0     70    175      175  77      m
## 2       good       0        1        1     64    125      115  33      f
## 3       good       1        1        1     60    105      105  49      f
## 4       good       1        1        0     66    132      124  42      f
## 5  very good       0        1        0     61    150      130  55      f
## 6  very good       1        1        0     64    114      114  55      f
## 7  very good       1        1        0     71    194      185  31      m
## 8  very good       0        1        0     67    170      160  45      m
## 9       good       0        1        1     65    150      130  27      f
## 10      good       1        1        0     70    180      170  44      m
cdc[,6]
cdc$weight
cdc$weight[567]
## [1] 160
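
The row-and-column form and the `$` form reach the same cell. A short sketch on a made-up data frame (`toy` is hypothetical) shows three equivalent ways to pull one value:

```r
# Made-up data frame, for illustration only
toy <- data.frame(height = c(70, 64, 60), weight = c(175, 125, 105))

# Row 2, column 2 by position
by_position <- toy[2, 2]

# Column by name with $, then element 2 of the resulting vector
by_dollar <- toy$weight[2]

# Column by name with [[ ]], then element 2
by_brackets <- toy[["weight"]][2]

print(c(by_position, by_dollar, by_brackets))
```

Referring to columns by name is usually more robust than by position, since it keeps working if the column order changes.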

A little more on subsetting

cdc$gender == "m"
cdc$age > 30
mdata <- subset(cdc, gender == "m")
head(mdata)
##      genhlth exerany hlthplan smoke100 height weight wtdesire age gender
## 1       good       0        1        0     70    175      175  77      m
## 7  very good       1        1        0     71    194      185  31      m
## 8  very good       0        1        0     67    170      160  45      m
## 10      good       1        1        0     70    180      170  44      m
## 11 excellent       1        1        1     69    186      175  46      m
## 12      fair       1        1        1     69    168      148  62      m
m_and_over30 <- subset(cdc, gender == "m" & age > 30)
m_or_over30 <- subset(cdc, gender == "m" | age > 30)

Exercise 4: Create a new object called under23_and_smoke that contains all observations of respondents under the age of 23 that have smoked 100 cigarettes in their lifetime. Write the command you used to create the new object as the answer to this exercise.

under23_and_smoke <- subset(cdc, smoke100 == 1 & age < 23)
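
`subset()` is a convenience wrapper around logical indexing: the condition produces a TRUE/FALSE vector, and only the TRUE rows are kept. The sketch below, on a made-up data frame (`toy` is hypothetical), builds the same selection both ways:

```r
# Made-up respondents, for illustration only
toy <- data.frame(age      = c(19, 22, 23, 40, 21),
                  smoke100 = c(1, 0, 1, 1, 1))

# Selection with subset(): column names are looked up inside the data frame
by_subset <- subset(toy, smoke100 == 1 & age < 23)

# Same selection with explicit logical row indexing
keep     <- toy$smoke100 == 1 & toy$age < 23
by_index <- toy[keep, ]

print(by_subset)
print(nrow(by_index))
```

Note that `age < 23` excludes respondents who are exactly 23, matching "under the age of 23" in the exercise.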

Quantitative data

boxplot(cdc$height)

summary(cdc$height)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   48.00   64.00   67.00   67.18   70.00   93.00
boxplot(cdc$height ~ cdc$gender)

bmi <- (cdc$weight / cdc$height^2) * 703
boxplot(bmi ~ cdc$genhlth)
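
The BMI line above uses the imperial-units formula, weight in pounds divided by height in inches squared, times 703. As a worked check, the sketch below applies it to the first respondent shown by `head(cdc)` (175 lb, 70 in); the variable names are introduced here for illustration.

```r
# Values for respondent 1, copied from the head(cdc) output
weight_lb <- 175
height_in <- 70

# ^ binds tighter than /, so this is weight / (height^2), then times 703
bmi_first <- (weight_lb / height_in^2) * 703

print(round(bmi_first, 2))
```

Because R's arithmetic is vectorized, the same expression applied to `cdc$weight` and `cdc$height` yields one BMI per respondent.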

Exercise 5: What does this box plot show? Pick another categorical variable from the data set and see how it relates to BMI. List the variable you chose, why you might think it would have a relationship to BMI, and indicate what the figure seems to suggest.

The box plot above groups BMI by the genhlth variable and draws a separate box for each health rating category; the horizontal line inside each box marks the median BMI for that group. The outlier in the fair group indicates that one respondent there has a much higher BMI than the rest of the group.

Below I compare BMI across gender. Males appear to have a higher median BMI, while the most extreme high-BMI outlier belongs to the female group.

boxplot(bmi ~ cdc$gender)

hist(cdc$age)

hist(bmi)

hist(bmi, breaks = 50)
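
The `breaks` argument controls how finely the histogram bins the data; a single number is treated as a suggestion, so the actual bin count can differ from it. The sketch below uses simulated values (`x` is made-up data, not the real BMI vector) and `plot = FALSE` to compare bin counts without drawing anything:

```r
# Simulated stand-in for a BMI-like variable (assumption: roughly bell-shaped)
set.seed(1)
x <- rnorm(1000, mean = 26, sd = 5)

# Default binning versus a finer request of about 50 bins
h_default <- hist(x, plot = FALSE)
h_fine    <- hist(x, breaks = 50, plot = FALSE)

# Number of bins actually used in each case
print(length(h_default$counts))
print(length(h_fine$counts))
```

More bins reveal finer detail in the shape of the distribution, at the cost of noisier bar heights.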