source("http://www.openintro.org/stat/data/cdc.R")

Ex1: How many cases are there in this data set? How many variables? For each variable, identify its data type.

nrow(cdc)

## [1] 20000

How many variables?

length(cdc)

## [1] 9

For each variable, identify its data type

names(cdc)

## [1] "genhlth"  "exerany"  "hlthplan" "smoke100" "height"   "weight"  
## [7] "wtdesire" "age"      "gender"
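
names() lists the columns but not their types. A minimal sketch of one way to get the type of each column, shown on a small made-up data frame (toy and its values are hypothetical and are not taken from cdc):

```r
# Hypothetical three-row stand-in; column types chosen only for illustration
toy <- data.frame(
  genhlth = factor(c("excellent", "good", "fair")),  # categorical
  exerany = c(1, 0, 1),                              # 0/1 indicator stored as numeric
  weight  = c(175L, 125L, 105L)                      # whole pounds stored as integer
)

# One class name per column
var_types <- sapply(toy, class)
print(var_types)

# With the lab data loaded, the same idea is sapply(cdc, class) or str(cdc)
```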

Summary and tables

summary(cdc$weight)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    68.0   140.0   165.0   169.7   190.0   500.0

mean(cdc$weight)

## [1] 169.683

median(cdc$weight)

## [1] 165

While it makes sense to describe a quantitative variable like weight in terms of these statistics, what about categorical data? For those we instead consider the sample frequency or relative frequency distribution. The function table does this by counting the number of times each kind of response was given. For example, to see the number of people who have smoked 100 cigarettes in their lifetime, type:

table(cdc$smoke100)

## 
##     0     1 
## 10559  9441

table(cdc$smoke100)/nrow(cdc)

## 
##       0       1 
## 0.52795 0.47205
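
The division by the sample size can also be left to prop.table(), which divides each count by the table total. A minimal sketch on made-up responses (the smoke vector is hypothetical and only stands in for cdc$smoke100):

```r
# Hypothetical 0/1 responses standing in for cdc$smoke100
smoke <- c(0, 1, 1, 0, 0, 1, 0, 0)

counts   <- table(smoke)        # frequency of each response
rel_freq <- prop.table(counts)  # each count divided by the total count
print(rel_freq)
```

The same call works for any categorical column, for example prop.table(table(cdc$gender)) once the lab data is loaded.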

barplot(table(cdc$smoke100))

Ex2: Create a numerical summary for height and age, and compute the interquartile range for each. Compute the relative frequency distribution for gender and exerany. How many males are in the sample? What proportion of the sample reports being in excellent health?

summary(cdc$height)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   48.00   64.00   67.00   67.18   70.00   93.00

summary(cdc$age)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   18.00   31.00   43.00   45.07   57.00   99.00

Compute the interquartile range for each

IQR(cdc$height)

## [1] 6

IQR(cdc$age)

## [1] 26
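
The interquartile range is the third quartile minus the first, so it can also be read off the summary() output above (for age, 57 - 31 = 26). A minimal sketch that checks this on made-up values (the vector x is hypothetical):

```r
# Hypothetical ages
x <- c(18, 25, 31, 43, 57, 62, 99)

# First and third quartiles, then their difference
q <- quantile(x, c(0.25, 0.75))
iqr_by_hand <- unname(q[2] - q[1])

print(iqr_by_hand)
print(IQR(x))  # built-in function, uses the same default quantile method
```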

How many males are in the sample?

table(cdc$gender)["m"]

##    m 
## 9569

What proportion of the sample reports being in excellent health?

cat((table(cdc$genhlth)["excellent"]/nrow(cdc))*100,"%")

## 23.285 %

Ex3: What does the mosaic plot reveal about smoking habits and gender?

Ans: The mosaic plot reveals that a randomly selected male is more likely to have smoked at least 100 cigarettes than a randomly selected female.
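
A comparison like this can be checked numerically with the row proportions of the two-way table behind the mosaic plot. A minimal sketch on made-up data (the toy data frame and its values are hypothetical; with the lab data the table would be built from cdc$gender and cdc$smoke100):

```r
# Hypothetical respondents: gender and whether they smoked 100 cigarettes
toy <- data.frame(
  gender   = factor(c("m", "m", "m", "m", "f", "f", "f", "f")),
  smoke100 = c(1, 1, 1, 0, 1, 0, 0, 0)
)

# Counts by gender (rows) and smoking status (columns)
two_way <- table(toy$gender, toy$smoke100)

# Proportions within each gender; every row sums to 1
row_props <- prop.table(two_way, margin = 1)
print(row_props)

# mosaicplot(two_way) would draw the same table as tiles
```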

Ex4: Create a new object called under23_and_smoke that contains all observations of respondents under the age of 23 that have smoked 100 cigarettes in their lifetime. Write the command you used to create the new object as the answer to this exercise.
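
One way to build such an object is subset() with both conditions joined by &. A minimal sketch on a made-up data frame (toy and its values are hypothetical, and it assumes smoke100 is coded 1 for respondents who have smoked 100 cigarettes):

```r
# Hypothetical five-row stand-in with the two columns the filter needs
toy <- data.frame(
  age      = c(19, 22, 23, 40, 21),
  smoke100 = c(1, 0, 1, 1, 1)
)

# Keep rows where both conditions hold: under 23 AND smoked 100 cigarettes
under23_and_smoke <- subset(toy, age < 23 & smoke100 == 1)
print(under23_and_smoke)

# With the lab data loaded:
# under23_and_smoke <- subset(cdc, age < 23 & smoke100 == 1)
```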

boxplot(cdc$height)

summary(cdc$height)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   48.00   64.00   67.00   67.18   70.00   93.00

boxplot(cdc$height ~ cdc$gender)

bmi <- (cdc$weight / cdc$height^2) * 703
boxplot(bmi ~ cdc$genhlth)
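
The bmi line above divides weight by height squared and multiplies by 703, the usual conversion factor when weight is in pounds and height in inches (that these are the units of the cdc columns is an assumption here). A minimal worked example for a single made-up respondent; the function name bmi_imperial is hypothetical:

```r
# BMI from weight in pounds and height in inches, same formula as above
bmi_imperial <- function(weight_lb, height_in) {
  (weight_lb / height_in^2) * 703
}

# Hypothetical respondent: 150 lb, 65 in
example_bmi <- bmi_imperial(150, 65)
print(round(example_bmi, 2))
```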

Ex5: What does this box plot show? Pick another categorical variable from the data set and see how it relates to BMI. List the variable you chose, why you might think it would have a relationship to BMI, and indicate what the figure seems to suggest.

hist(cdc$age)

hist(bmi)

hist(bmi, breaks = 50)

On my own

1. Make a scatterplot of weight versus desired weight. Describe the relationship between these two variables.

plot(cdc$weight,cdc$wtdesire, xlab="Weight", ylab= "Desired Weight")

Answer: the relationship is positive; as weight increases, desired weight tends to increase as well.

2. Let’s consider a new variable: the difference between desired weight (wtdesire) and current weight (weight). Create this new variable by subtracting the two columns in the data frame and assigning them to a new object called wdiff.

wdiff <- cdc$weight - cdc$wtdesire  # current weight minus desired weight

3. What type of data is wdiff? If an observation of wdiff is 0, what does this mean about the person’s weight and desired weight? What if wdiff is positive or negative?

typeof(wdiff)

## [1] "integer"
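
Because wdiff is computed above as current weight minus desired weight, a value of 0 means the person is at their desired weight, a positive value means they weigh more than they would like, and a negative value means they weigh less. A minimal sketch on made-up values (the vectors and labels below are hypothetical):

```r
# Hypothetical current and desired weights in pounds
weight   <- c(180L, 150L, 120L)
wtdesire <- c(165L, 150L, 130L)

# Same order of subtraction as wdiff above
wdiff_toy <- weight - wtdesire

# Translate the sign of the difference into words
feeling <- ifelse(wdiff_toy > 0, "wants to lose",
           ifelse(wdiff_toy < 0, "wants to gain", "at desired weight"))
print(data.frame(wdiff_toy, feeling))
```

Subtracting two integer columns gives an integer result, which is consistent with the typeof() output above.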

4. Describe the distribution of wdiff in terms of its center, shape, and spread, including any plots you use. What does this tell us about how people feel about their current weight?

mean(wdiff)

## [1] 14.5891

sd(wdiff)

## [1] 24.04586

plot(density(wdiff))

5. Using numerical summaries and a side-by-side box plot, determine if men tend to view their weight differently than women.

boxplot(cdc$weight-cdc$wtdesire ~ cdc$gender)

Answer: judging from the box plot alone, men and women do not seem to differ much in how far their weight is from their desired weight.
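
The exercise also asks for numerical summaries, which can be computed per group with tapply(). A minimal sketch on made-up data (the toy data frame and its values are hypothetical and say nothing about the real result):

```r
# Hypothetical stand-in: weight minus desired weight, by gender
toy <- data.frame(
  wdiff  = c(10, 0, 25, -5, 15, 5, 30, 0),
  gender = factor(c("m", "m", "m", "m", "f", "f", "f", "f"))
)

# Median and IQR of wdiff within each gender
group_median <- tapply(toy$wdiff, toy$gender, median)
group_iqr    <- tapply(toy$wdiff, toy$gender, IQR)
print(rbind(median = group_median, IQR = group_iqr))

# With the lab data loaded: tapply(wdiff, cdc$gender, summary)
```

Medians and IQRs are used here because a few extreme values can stretch a box plot and make the boxes hard to compare by eye.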

6. Now it’s time to get creative. Find the mean and standard deviation of weight and determine what proportion of the weights are within one standard deviation of the mean.

mean(cdc$weight)

## [1] 169.683

sd(cdc$weight)

## [1] 40.08097

plot(density(cdc$weight))
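
With the values above, one standard deviation around the mean runs from about 129.6 to 209.8 pounds (169.683 - 40.08097 and 169.683 + 40.08097). The proportion of weights in that interval is the mean of a logical vector. A minimal sketch on made-up weights (the vector w is hypothetical and only stands in for cdc$weight):

```r
# Hypothetical weights in pounds standing in for cdc$weight
w <- c(120, 140, 150, 165, 170, 180, 190, 210, 260, 300)

m <- mean(w)
s <- sd(w)

# TRUE where a weight lies in [m - s, m + s]
within_one_sd <- abs(w - m) <= s

# The mean of a logical vector is the proportion of TRUE values
prop_within <- mean(within_one_sd)
print(prop_within)

# With the lab data loaded:
# mean(abs(cdc$weight - mean(cdc$weight)) <= sd(cdc$weight))
```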