Lab1_DATA606

source("http://www.openintro.org/stat/data/cdc.R")
names(cdc)

## [1] "genhlth"  "exerany"  "hlthplan" "smoke100" "height"   "weight"  
## [7] "wtdesire" "age"      "gender"

head(cdc, n=10)

##      genhlth exerany hlthplan smoke100 height weight wtdesire age gender
## 1       good       0        1        0     70    175      175  77      m
## 2       good       0        1        1     64    125      115  33      f
## 3       good       1        1        1     60    105      105  49      f
## 4       good       1        1        0     66    132      124  42      f
## 5  very good       0        1        0     61    150      130  55      f
## 6  very good       1        1        0     64    114      114  55      f
## 7  very good       1        1        0     71    194      185  31      m
## 8  very good       0        1        0     67    170      160  45      m
## 9       good       0        1        1     65    150      130  27      f
## 10      good       1        1        0     70    180      170  44      m

tail(cdc, n=10)

##         genhlth exerany hlthplan smoke100 height weight wtdesire age gender
## 19991 excellent       1        1        0     71    195      190  43      m
## 19992 very good       1        1        1     72    210      175  52      m
## 19993 very good       1        1        0     71    180      180  36      m
## 19994 very good       0        1        1     63    165      120  31      f
## 19995      good       0        1        1     69    224      224  73      m
## 19996      good       1        1        0     66    215      140  23      f
## 19997 excellent       0        1        0     73    200      185  35      m
## 19998      poor       0        1        0     65    216      150  57      f
## 19999      good       1        1        0     67    165      165  81      f
## 20000      good       1        1        1     69    170      165  83      m

How many cases are there in this data set? How many variables? For each variable, identify its data type (e.g. categorical, discrete).

#Cases and Variables in the Dataset
dim(cdc)

## [1] 20000     9

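There are 20,000 cases and 9 variables.

genhlth: Categorical (ordinal)
exerany: Categorical (coded 0/1)
hlthplan: Categorical (coded 0/1)
smoke100: Categorical (coded 0/1)
height: Numerical, Continuous (recorded as whole numbers)
weight: Numerical, Continuous (recorded as whole numbers)
wtdesire: Numerical, Continuous (recorded as whole numbers)
age: Numerical, Continuous (recorded in whole years)
gender: Categorical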
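How R stores a column is not the same as its statistical type, which is why the 0/1 indicators are easy to mislabel. The sketch below rebuilds the first three rows from the head(cdc) output (a few columns only) and checks the storage class of each. The name sample_rows is made up for illustration, and storing exerany as a plain number is an assumption based on how summary() reports it.

```r
# First three rows of selected columns, copied from the head(cdc) output.
sample_rows <- data.frame(
  genhlth = factor(c("good", "good", "good"),
                   levels = c("excellent", "very good", "good", "fair", "poor")),
  exerany = c(0, 0, 1),
  height  = c(70, 64, 60),
  gender  = factor(c("m", "f", "f"), levels = c("m", "f"))
)

# class() reports how R stores each column, not its statistical type.
storage <- sapply(sample_rows, class)
print(storage)

# Recoding the 0/1 indicator as a labelled factor makes its
# categorical nature explicit.
sample_rows$exerany <- factor(sample_rows$exerany,
                              levels = c(0, 1), labels = c("no", "yes"))
print(table(sample_rows$exerany))
```

With the full data loaded, str(cdc) would give the same kind of overview for all nine columns at once.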

summary(cdc$weight)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    68.0   140.0   165.0   169.7   190.0   500.0

mean(cdc$weight)

## [1] 169.683

var(cdc$weight)

## [1] 1606.484

median(cdc$weight)

## [1] 165

table(cdc$smoke100)

## 
##     0     1 
## 10559  9441

table(cdc$smoke100)/20000

## 
##       0       1 
## 0.52795 0.47205

barplot(table(cdc$smoke100))
smoke <- table(cdc$smoke100)
barplot(smoke)

2. Create a numerical summary for height and age, and compute the interquartile range for each. Compute the relative frequency distribution for gender and exerany. How many males are in the sample? What proportion of the sample reports being in excellent health?

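Height interquartile range: 70 - 64 = 6
Age interquartile range: 57 - 31 = 26
Relative frequency of gender: m = 9569/20000 = 0.478, f = 10431/20000 = 0.522
Relative frequency of exerany: 0 = 0.254, 1 = 0.746 (the mean of a 0/1 variable is the proportion of 1s)
Males in the sample: 9569
Proportion in excellent health: 4657/20000 = 0.233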

summary(cdc$height)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   48.00   64.00   67.00   67.18   70.00   93.00

mean(cdc$height)

## [1] 67.1829

var(cdc$height)

## [1] 17.0235

median(cdc$height)

## [1] 67

summary(cdc$age)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   18.00   31.00   43.00   45.07   57.00   99.00

mean(cdc$age)

## [1] 45.06825

var(cdc$age)

## [1] 295.5886

median(cdc$age)

## [1] 43

summary(cdc$gender)

##     m     f 
##  9569 10431

summary(cdc$exerany)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  1.0000  0.7457  1.0000  1.0000

summary(cdc$genhlth)

## excellent very good      good      fair      poor 
##      4657      6972      5675      2019       677
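The relative frequencies and interquartile ranges asked for in this exercise can be reproduced from the printed output alone. The sketch below copies the counts and quartiles from the summary() results above; the variable names are chosen here for illustration.

```r
# Counts copied from the summary() output for gender and genhlth.
gender_counts  <- c(m = 9569, f = 10431)
genhlth_counts <- c(excellent = 4657, `very good` = 6972, good = 5675,
                    fair = 2019, poor = 677)

# Relative frequency = count / total number of cases.
gender_prop  <- gender_counts / sum(gender_counts)
genhlth_prop <- genhlth_counts / sum(genhlth_counts)
print(round(gender_prop, 4))
print(round(genhlth_prop, 4))

# exerany is coded 0/1, so its mean (0.7457 in the summary) is the
# proportion of 1s, and the remainder is the proportion of 0s.
exerany_prop <- c(`0` = 1 - 0.7457, `1` = 0.7457)
print(exerany_prop)

# IQR = third quartile - first quartile, taken from the summary() output.
height_iqr <- 70 - 64
age_iqr    <- 57 - 31
print(c(height = height_iqr, age = age_iqr))
```

With the data frame loaded, table(cdc$gender) / nrow(cdc) and IQR(cdc$height) should give the same quantities directly, without hard-coding the totals.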

table(cdc$gender,cdc$smoke100)

##    
##        0    1
##   m 4547 5022
##   f 6012 4419

mosaicplot(table(cdc$gender,cdc$smoke100))

3. What does the mosaic plot reveal about smoking habits and gender?

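A larger share of males than females report having smoked at least 100 cigarettes: 5022 of 9569 males (about 52%) compared with 4419 of 10431 females (about 42%).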
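Because the sample contains more females than males, comparing proportions within each gender is safer than comparing raw counts. A minimal sketch, rebuilding the two-way table from the counts printed above (the object names are illustrative):

```r
# Counts copied from the table(cdc$gender, cdc$smoke100) output.
smoke_by_gender <- matrix(c(4547, 5022,
                            6012, 4419),
                          nrow = 2, byrow = TRUE,
                          dimnames = list(gender = c("m", "f"),
                                          smoke100 = c("0", "1")))

# margin = 1 divides each row by its row total, giving the
# proportion of non-smokers and smokers within each gender.
row_props <- prop.table(smoke_by_gender, margin = 1)
print(round(row_props, 3))
```

The same call should work on the table object itself, for example prop.table(table(cdc$gender, cdc$smoke100), margin = 1).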

cdc[567,6]

## [1] 160

What does this box plot show? Pick another categorical variable from the data set and see how it relates to BMI. List the variable you chose, why you might think it would have a relationship to BMI, and indicate what the figure seems to suggest.

bmi <- (cdc$weight / cdc$height^2) * 703
boxplot(bmi ~ cdc$genhlth)

The box plot shows that respondents who report better general health tend to have lower BMIs.

bmi <- (cdc$weight / cdc$height^2) * 703
boxplot(bmi ~ cdc$smoke100)

Variable chosen: smoke100. The box plot suggests that people who don't smoke typically fall in a lower BMI range.
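A numeric summary by group can back up what the box plots show. The sketch below uses only the first ten rows from the head(cdc, n=10) output, so its medians illustrate the method and say nothing reliable about the full sample. The names sample_rows and median_bmi are made up for this example.

```r
# First ten rows, copied from the head(cdc, n=10) output.
sample_rows <- data.frame(
  height   = c(70, 64, 60, 66, 61, 64, 71, 67, 65, 70),
  weight   = c(175, 125, 105, 132, 150, 114, 194, 170, 150, 180),
  smoke100 = c(0, 1, 1, 0, 0, 0, 0, 0, 1, 0)
)

# Same BMI formula as in the lab: weight / height^2 * 703.
sample_rows$bmi <- sample_rows$weight / sample_rows$height^2 * 703

# Median BMI within each smoke100 group.
median_bmi <- tapply(sample_rows$bmi, sample_rows$smoke100, median)
print(round(median_bmi, 1))
```

On the full data, tapply(bmi, cdc$smoke100, median) would give the group medians that the box plot draws as its center lines.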

Make a scatterplot of weight versus desired weight. Describe the relationship between these two variables.

plot(cdc$weight ~ cdc$wtdesire)

The relationship is positive and roughly linear: in general, people want to weigh close to what they currently weigh. Some people want to be significantly heavier or lighter, but most do not.
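A reference line where weight equals desired weight makes the scatterplot easier to read: points above it are people who weigh more than they would like. The sketch below uses the first ten rows from the head(cdc, n=10) output as a stand-in for the full data, so the counts describe only those ten people.

```r
# First ten rows, copied from the head(cdc, n=10) output.
weight   <- c(175, 125, 105, 132, 150, 114, 194, 170, 150, 180)
wtdesire <- c(175, 115, 105, 124, 130, 114, 185, 160, 130, 170)

# Same axes as in the lab: weight on the y-axis, desired weight on the x-axis.
plot(weight ~ wtdesire, xlab = "Desired weight", ylab = "Current weight")

# Dashed line with intercept 0 and slope 1, where weight == wtdesire.
abline(a = 0, b = 1, lty = 2)

# Count the points above, on, and below the reference line.
n_above <- sum(weight > wtdesire)   # weigh more than desired
n_on    <- sum(weight == wtdesire)  # already at desired weight
n_below <- sum(weight < wtdesire)   # weigh less than desired
print(c(above = n_above, on = n_on, below = n_below))
```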

Let’s consider a new variable: the difference between desired weight (wtdesire) and current weight (weight). Create this new variable by subtracting the two columns in the data frame and assigning them to a new object called wdiff.

# Desired weight minus current weight, as defined in the question
wdiff <- cdc$wtdesire - cdc$weight
plot(wdiff)

What type of data is wdiff? If an observation wdiff is 0, what does this mean about the person’s weight and desired weight. What if wdiff is positive or negative?

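wdiff is numerical (recorded in whole numbers, so discrete). If wdiff is 0, the person's current weight equals their desired weight. With wdiff defined as desired weight minus current weight, a positive value means the person wants to gain weight, and a negative value means the person wants to lose weight.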
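The sign interpretation can be checked on a few rows. This sketch follows the definition in the question (desired weight minus current weight) and uses the first ten rows from the head(cdc, n=10) output. The goal labels are made up here for illustration.

```r
# First ten rows, copied from the head(cdc, n=10) output.
weight   <- c(175, 125, 105, 132, 150, 114, 194, 170, 150, 180)
wtdesire <- c(175, 115, 105, 124, 130, 114, 185, 160, 130, 170)

# Desired minus current weight: negative means the person wants to lose.
wdiff <- wtdesire - weight

# Label each observation by the sign of wdiff.
goal <- ifelse(wdiff > 0, "gain", ifelse(wdiff < 0, "lose", "stay"))
print(data.frame(weight, wtdesire, wdiff, goal))
```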

Describe the distribution of wdiff in terms of its center, shape, and spread, including any plots you use. What does this tell us about how people feel about their current weight?

Most people seem comfortable with their current weight and do not want to stray far from it.
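Center and spread can be reported with a handful of summary statistics. The helper below, describe_distribution, is a hypothetical name introduced for this sketch. It is applied to wdiff (desired minus current weight) for the first ten rows of the head(cdc, n=10) output, so the numbers describe those ten people only, not the full sample.

```r
# Hypothetical helper: center (median, mean) and spread (IQR, sd) of a vector.
describe_distribution <- function(x) {
  c(median = median(x), mean = mean(x), iqr = IQR(x), sd = sd(x))
}

# wdiff for the first ten rows of the data (wtdesire - weight).
wdiff <- c(0, -10, 0, -8, -20, 0, -9, -10, -20, -10)

stats <- describe_distribution(wdiff)
print(round(stats, 2))
```

On the full data, hist(wdiff) or boxplot(wdiff) would show the shape, and comparing the mean with the median gives a quick indication of skew.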


Michele Bradley

9/3/2017