Getting started

source("more/cdc.R")
  1. How many cases are there in this data set? How many variables? For each variable, identify its data type (e.g. categorical, discrete).

Each row represents a case and each column represents a variable, so there are 20,000 cases and 9 variables.
dim(cdc)
## [1] 20000     9
names(cdc)
## [1] "genhlth"  "exerany"  "hlthplan" "smoke100" "height"   "weight"  
## [7] "wtdesire" "age"      "gender"

- genhlth: Categorical-Ordinal
- exerany: Categorical-Nominal (binary, coded 0/1)
- hlthplan: Categorical-Nominal (binary, coded 0/1)
- smoke100: Categorical-Nominal (binary, coded 0/1)
- height: Numerical-Continuous
- weight: Numerical-Continuous
- wtdesire: Numerical-Continuous
- age: Numerical-Continuous
- gender: Categorical-Nominal
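
One way to sanity-check a classification like this is to look at how R stores each column. The sketch below uses a small made-up data frame called toy (the values are invented and are not from cdc) and flags numeric columns that only take the values 0 and 1, since those are categorical in meaning even though they are stored as numbers.

```r
# Toy data frame with made-up values (NOT the real cdc data)
toy <- data.frame(
  genhlth = factor(c("good", "excellent", "fair"),
                   levels = c("excellent", "very good", "good", "fair", "poor")),
  exerany = c(1, 0, 1),
  height  = c(70, 64, 67),
  gender  = factor(c("m", "f", "f"))
)

# Storage class of each column
col_classes <- sapply(toy, class)
print(col_classes)

# Numeric columns that only contain 0 and 1 are binary indicators
is_binary <- sapply(toy, function(x) is.numeric(x) && all(x %in% c(0, 1)))
print(is_binary)
```

The storage class alone does not settle the data type: exerany is numeric in storage but describes a yes/no category.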

  1. Create a numerical summary for height and age, and compute the interquartile range for each. Compute the relative frequency distribution for gender and exerany. How many males are in the sample? What proportion of the sample reports being in excellent health?
summary(cdc$height)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   48.00   64.00   67.00   67.18   70.00   93.00
IQR(cdc$height)
## [1] 6
summary(cdc$age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   18.00   31.00   43.00   45.07   57.00   99.00
IQR(cdc$age)
## [1] 26
gend<- cdc$gender
gend.freq <- table(gend)
gend.relfreq <- gend.freq/nrow(cdc)
gend.relfreq
## gend
##       m       f 
## 0.47845 0.52155
exer<- cdc$exerany
exer.freq <- table(exer)
exer.relfreq <- exer.freq/nrow(cdc)
exer.relfreq
## exer
##      0      1 
## 0.2543 0.7457
table(cdc$gender)[1]
##    m 
## 9569
cat((table(cdc$genhlth)[1]/NROW(cdc))*100,"%")
## 23.285 %
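
The same counts and proportions can be computed without relying on the position of a level in the table. This is a minimal sketch on made-up vectors (the names gender and genhlth mirror the cdc columns, but the values are invented); prop.table() and mean() of a logical vector are base R.

```r
# Made-up stand-ins for cdc$gender and cdc$genhlth
gender  <- factor(c("m", "f", "f", "m", "f"), levels = c("m", "f"))
genhlth <- c("excellent", "good", "excellent", "fair", "good")

# Relative frequencies: equivalent to table(gender) / length(gender)
gender_relfreq <- prop.table(table(gender))
print(gender_relfreq)

# Count of one level, selected by value rather than by position
n_males <- sum(gender == "m")

# The mean of a logical vector is the proportion of TRUE values
pct_excellent <- mean(genhlth == "excellent") * 100
cat(pct_excellent, "%\n")
```

Selecting by value ("m", "excellent") keeps the result correct even if the level order changes.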
  1. What does the mosaic plot reveal about smoking habits and gender?

A larger share of males than females have smoked at least 100 cigarettes, and a larger share of women than men have not.
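
A mosaic plot is a picture of a two-way table, so the comparison can also be read from row proportions. The sketch below uses made-up vectors (not the cdc data) and assumes the plot was built from a table of gender by smoke100.

```r
# Made-up stand-ins for cdc$gender and cdc$smoke100
gender   <- factor(c("m", "m", "m", "f", "f", "f", "f", "m"), levels = c("m", "f"))
smoke100 <- c(1, 1, 0, 0, 0, 1, 0, 1)

# Two-way table of counts; mosaicplot(counts) would draw this table
counts <- table(gender, smoke100)

# Proportions within each gender (each row sums to 1)
row_props <- prop.table(counts, margin = 1)
print(row_props)
```

Comparing the "1" column across rows gives the share of each gender that has smoked at least 100 cigarettes.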

  2. Create a new object called under23_and_smoke that contains all observations of respondents under the age of 23 that have smoked 100 cigarettes in their lifetime. Write the command you used to create the new object as the answer to this exercise.

under23_and_smoke <- subset(cdc, smoke100 == 1 & age < 23)

head(under23_and_smoke)
##       genhlth exerany hlthplan smoke100 height weight wtdesire age gender
## 13  excellent       1        0        1     66    185      220  21      m
## 37  very good       1        0        1     70    160      140  18      f
## 96  excellent       1        1        1     74    175      200  22      m
## 180      good       1        1        1     64    190      140  20      f
## 182 very good       1        1        1     62     92       92  21      f
## 240 very good       1        0        1     64    125      115  22      f
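
For comparison, the same filter can be written with logical indexing. This is a small sketch on a made-up data frame toy (invented values, not cdc); both forms should select the same rows.

```r
# Toy data frame with made-up values (NOT the real cdc data)
toy <- data.frame(age = c(21, 35, 22, 19), smoke100 = c(1, 1, 0, 1))

# subset() evaluates the condition inside the data frame
by_subset <- subset(toy, smoke100 == 1 & age < 23)

# Logical indexing spells out the data frame name for each column
by_index <- toy[toy$smoke100 == 1 & toy$age < 23, ]

print(by_subset)
```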
  1. What does this box plot show? Pick another categorical variable from the data set and see how it relates to BMI. List the variable you chose, why you might think it would have a relationship to BMI, and indicate what the figure seems to suggest.

The box plot shows the distribution of BMI (a weight-to-height ratio) for each level of general health.

bmi <- (cdc$weight / cdc$height^2) * 703
boxplot(bmi ~ cdc$gender)

The boxplot seems to indicate that BMI is lower in females.
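
To back up a reading of side-by-side box plots, the group medians can be computed directly. The sketch below applies the same BMI formula to a made-up data frame toy (invented values, weight in pounds and height in inches as the 703 factor implies).

```r
# Toy data frame with made-up values (NOT the real cdc data)
toy <- data.frame(
  weight = c(175, 125, 200, 140),
  height = c(70, 64, 72, 66),
  gender = c("m", "f", "m", "f")
)

# Same formula as above: weight (lb) / height (in)^2 * 703
toy$bmi <- toy$weight / toy$height^2 * 703

# Median BMI within each gender
bmi_medians <- tapply(toy$bmi, toy$gender, median)
print(bmi_medians)
```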


On Your Own

1. Make a scatterplot of weight versus desired weight. Describe the relationship between these two variables.

plot(cdc$weight, cdc$wtdesire, xlab = "Weight", ylab = "Desired Weight")

The scatterplot shows a positive, roughly linear relationship: respondents who weigh more tend to report a higher desired weight.
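
The direction of a relationship seen in a scatterplot can be checked with a correlation coefficient. This is a sketch on made-up vectors (the names mirror the cdc columns, but the values are invented).

```r
# Made-up stand-ins for cdc$weight and cdc$wtdesire
weight   <- c(120, 150, 180, 210, 250)
wtdesire <- c(120, 145, 165, 180, 200)

# Pearson correlation: positive means the two rise together
r <- cor(weight, wtdesire)
print(r)

# Share of respondents whose desired weight is below their current weight
share_below <- mean(wtdesire < weight)
print(share_below)
```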

2. Let’s consider a new variable: the difference between desired weight (wtdesire) and current weight (weight). Create this new variable by subtracting the two columns in the data frame and assigning them to a new object called wdiff.

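# Sign convention used here: current weight minus desired weight,
# so a positive value means the respondent wants to weigh less.
wdiff <- cdc$weight - cdc$wtdesire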

3. What type of data is wdiff? If an observation wdiff is 0, what does this mean about the person’s weight and desired weight? What if wdiff is positive or negative?

str(wdiff)
##  int [1:20000] 0 10 0 8 20 0 9 10 20 10 ...
wdiff is an integer. If an observation is 0, the person's weight and desired weight are the same. If wdiff is positive, the desired weight is lower than the current weight. If wdiff is negative, the desired weight is higher than the current weight.
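
The three cases can be illustrated with a tiny made-up example (invented values, not cdc), using the convention of current weight minus desired weight. The label strings are just for illustration.

```r
# Made-up weights and desired weights for three hypothetical respondents
weight   <- c(180, 150, 130)
wtdesire <- c(160, 150, 140)

# Current weight minus desired weight
wdiff <- weight - wtdesire

# Translate the sign of wdiff into words
meaning <- ifelse(wdiff > 0, "wants to lose",
           ifelse(wdiff < 0, "wants to gain", "at desired weight"))
print(data.frame(weight, wtdesire, wdiff, meaning))
```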

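4. Describe the distribution of wdiff in terms of its center, shape, and spread, including any plots you use. What does this tell us about how people feel about their current weight?

The median of wdiff is 10 and the middle half of the values lies between 0 and 21 (an IQR of 21), so the box is comparatively short and most people are fairly close to their desired weight. The mean (14.59) is above the median, which suggests right skew, and the extremes (-500 and 300) show outliers in both directions. With wdiff defined as weight minus desired weight, the typical respondent would like to weigh about 10 pounds less.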

summary(wdiff)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -500.00    0.00   10.00   14.59   21.00  300.00
boxplot(wdiff, horizontal = TRUE)
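
Center, spread, and a rough hint about shape can be pulled out as separate numbers. The sketch below uses a made-up wdiff vector (invented values, not the cdc result) and treats a mean above the median as a sign of right skew, which is a rule of thumb rather than a formal test.

```r
# Made-up differences (current weight minus desired weight)
wdiff <- c(-10, 0, 0, 5, 10, 10, 20, 30, 100)

center <- median(wdiff)   # robust measure of center
spread <- IQR(wdiff)      # width of the middle 50% of values

# A mean above the median hints at a longer right tail
right_skew_hint <- mean(wdiff) > center

cat("median:", center, " IQR:", spread, " right-skew hint:", right_skew_hint, "\n")
```

A histogram, for example hist(wdiff), would show the shape more directly than the box plot.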