source("http://www.openintro.org/stat/data/cdc.R")
Ex1: How many cases are there in this data set? How many variables? For each variable, identify its data type.
nrow(cdc)
## [1] 20000
How many variables?
length(cdc)
## [1] 9
For each variable, identify its data type
names(cdc)
## [1] "genhlth" "exerany" "hlthplan" "smoke100" "height" "weight"
## [7] "wtdesire" "age" "gender"
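names() lists only the column names, so it does not answer the data-type part of the exercise by itself. A minimal sketch of one way to inspect how each column is stored, assuming cdc comes from the source() call at the top (the guard line reloads it only if it is missing; var_classes is a name introduced here for illustration):

```r
# Load the data only if it is not already in the workspace (same URL as above)
if (!exists("cdc")) source("http://www.openintro.org/stat/data/cdc.R")

# Compact overview: one line per variable with its class and first few values
str(cdc)

# Class of each column, collected into one object
var_classes <- sapply(cdc, class)
var_classes
```

The storage class alone does not settle whether a variable is categorical or numerical: a column coded 0/1, such as smoke100, is stored as a number but is read as a category.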
Summaries and tables
summary(cdc$weight)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 68.0 140.0 165.0 169.7 190.0 500.0
mean(cdc$weight)
## [1] 169.683
median(cdc$weight)
## [1] 165
While it makes sense to describe a quantitative variable like weight in terms of these statistics, what about categorical data? We would instead consider the sample frequency or relative frequency distribution. The function table does this for you by counting the number of times each kind of response was given. For example, to see the number of people who have smoked 100 cigarettes in their lifetime, type
table(cdc$smoke100)
##
## 0 1
## 10559 9441
table(cdc$smoke100)/20000
##
## 0 1
## 0.52795 0.47205
barplot(table(cdc$smoke100))

Ex2: Create a numerical summary for height and age, and compute the interquartile range for each. Compute the relative frequency distribution for gender and exerany. How many males are in the sample? What proportion of the sample reports being in excellent health?
summary(cdc$height)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 48.00 64.00 67.00 67.18 70.00 93.00
summary(cdc$age)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 18.00 31.00 43.00 45.07 57.00 99.00
Compute the interquartile range for each
IQR(cdc$height)
## [1] 6
IQR(cdc$age)
## [1] 26
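The exercise also asks for the relative frequency distributions of gender and exerany. A minimal sketch using prop.table(), assuming cdc is loaded as above (gender_rel and exerany_rel are names introduced here for illustration):

```r
# Load the data only if it is not already in the workspace (same URL as above)
if (!exists("cdc")) source("http://www.openintro.org/stat/data/cdc.R")

# prop.table() divides each count by the sum of the counts,
# so the sample size does not need to be typed in by hand
gender_rel  <- prop.table(table(cdc$gender))
exerany_rel <- prop.table(table(cdc$exerany))

gender_rel
exerany_rel
```

Each result sums to 1, which is a quick check that the proportions were computed over the whole sample.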
How many males are in the sample?
table(cdc$gender)[1]
## m
## 9569
What proportion of the sample reports being in excellent health?
cat((table(cdc$genhlth)[1]/NROW(cdc))*100,"%")
## 23.285 %
Ex3: What does the mosaic plot reveal about smoking habits and gender?
Ans: The mosaic plot shows that a randomly selected male is more likely than a randomly selected female to have smoked at least 100 cigarettes.
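The command that draws the mosaic plot is not shown in these notes. A minimal sketch of how it could be produced, together with row proportions that put numbers on the comparison, assuming cdc is loaded as above (gender_smoke and smoke_by_gender are names introduced here for illustration):

```r
# Load the data only if it is not already in the workspace (same URL as above)
if (!exists("cdc")) source("http://www.openintro.org/stat/data/cdc.R")

# Two-way table of counts: rows are gender, columns are smoke100 (0 or 1)
gender_smoke <- table(cdc$gender, cdc$smoke100)

# Mosaic plot: tile areas are proportional to the cell counts
mosaicplot(gender_smoke, main = "Gender by smoke100")

# Row proportions: within each gender, the share in each smoke100 category
smoke_by_gender <- prop.table(gender_smoke, margin = 1)
smoke_by_gender
```

Comparing the smoke100 = 1 column across the two rows gives the proportions that the plot shows as tile heights.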
Ex4: Create a new object called under23_and_smoke that contains all observations of respondents under the age of 23 that have smoked 100 cigarettes in their lifetime. Write the command you used to create the new object as the answer to this exercise.
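The command for this exercise does not appear in these notes. One possible answer, sketched with base R's subset() and assuming cdc is loaded as above:

```r
# Load the data only if it is not already in the workspace (same URL as above)
if (!exists("cdc")) source("http://www.openintro.org/stat/data/cdc.R")

# Keep rows where age is under 23 AND the respondent has smoked 100 cigarettes
under23_and_smoke <- subset(cdc, age < 23 & smoke100 == 1)

# Number of respondents who meet both conditions
nrow(under23_and_smoke)
```

The single & combines the two conditions row by row. Using | instead would keep respondents who meet either condition, which is not what the exercise asks for.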
boxplot(cdc$height)

summary(cdc$height)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 48.00 64.00 67.00 67.18 70.00 93.00
boxplot(cdc$height ~ cdc$gender)

bmi <- (cdc$weight / cdc$height^2) * 703
boxplot(bmi ~ cdc$genhlth)

On my own
1. Make a scatterplot of weight versus desired weight. Describe the relationship between these two variables.
plot(cdc$weight,cdc$wtdesire, xlab="Weight", ylab= "Desired Weight")

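Answer: Weight and desired weight are positively associated: as weight increases, desired weight tends to increase as well.
2. Let’s consider a new variable: the difference between desired weight (wtdesire) and current weight (weight). Create this new variable by subtracting the two columns in the data frame and assigning the result to a new object called wdiff.
# Computed here as current weight minus desired weight, so positive values mean the respondent weighs more than desired
wdiff <- cdc$weight - cdc$wtdesire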
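3. What type of data is wdiff? If an observation of wdiff is 0, what does this mean about the person’s weight and desired weight? What if wdiff is positive or negative?
typeof(wdiff)
## [1] "integer"
Answer: wdiff is a numerical variable stored as integers. With wdiff computed as weight - wtdesire, a value of 0 means the person is at their desired weight, a positive value means they weigh more than they would like, and a negative value means they weigh less than they would like.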
4. Describe the distribution of wdiff in terms of its center, shape, and spread, including any plots you use. What does this tell us about how people feel about their current weight?
mean(wdiff)
## [1] 14.5891
sd(wdiff)
## [1] 24.04586
plot(density(wdiff))
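The mean and standard deviation can be pulled around by a few extreme values, so it may help to look at the median, quartiles, and a histogram as well. A minimal sketch, assuming cdc is loaded as above and using the same weight - wtdesire sign convention (wdiff_iqr is a name introduced here for illustration):

```r
# Load the data only if it is not already in the workspace (same URL as above)
if (!exists("cdc")) source("http://www.openintro.org/stat/data/cdc.R")

# Current weight minus desired weight, as defined earlier
wdiff <- cdc$weight - cdc$wtdesire

# Center: median and mean side by side; quartiles give a first view of spread
summary(wdiff)

# Spread that is less sensitive to outliers than sd()
wdiff_iqr <- IQR(wdiff)
wdiff_iqr

# Shape: a histogram with many bins makes skew and long tails easier to see
hist(wdiff, breaks = 50, main = "Current minus desired weight", xlab = "weight - wtdesire")
```

If the mean sits noticeably above the median, that suggests a right tail, meaning some respondents want to lose a large amount of weight.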

5. Using numerical summaries and a side-by-side box plot, determine if men tend to view their weight differently than women.
boxplot(cdc$weight-cdc$wtdesire ~ cdc$gender)

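Answer: In the side-by-side box plot, the difference between current and desired weight appears to be centered near zero for both men and women, so the plot alone does not show a clear difference in how the two groups view their weight.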
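The exercise also asks for numerical summaries, which can separate the two groups more precisely than reading the box plot by eye. A minimal sketch using tapply(), assuming cdc is loaded as above (wdiff_by_gender is a name introduced here for illustration):

```r
# Load the data only if it is not already in the workspace (same URL as above)
if (!exists("cdc")) source("http://www.openintro.org/stat/data/cdc.R")

# Current weight minus desired weight, as defined earlier
wdiff <- cdc$weight - cdc$wtdesire

# Five-number summary and mean of wdiff, computed separately for each gender
wdiff_by_gender <- tapply(wdiff, cdc$gender, summary)
wdiff_by_gender

# Spread within each group
tapply(wdiff, cdc$gender, IQR)
```

Comparing the medians and quartiles across the two groups shows whether one gender tends to want a larger change in weight than the other.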
6. Now it’s time to get creative. Find the mean and standard deviation of weight and determine what proportion of the weights are within one standard deviation of the mean.
mean(cdc$weight)
## [1] 169.683
sd(cdc$weight)
## [1] 40.08097
plot(density(cdc$weight))
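The proportion itself is not computed above. A minimal sketch of one way to get it, assuming cdc is loaded as above (w_mean, w_sd, within_one_sd, and prop_within are names introduced here for illustration):

```r
# Load the data only if it is not already in the workspace (same URL as above)
if (!exists("cdc")) source("http://www.openintro.org/stat/data/cdc.R")

w_mean <- mean(cdc$weight)
w_sd   <- sd(cdc$weight)

# TRUE where the weight lies in [mean - sd, mean + sd]
within_one_sd <- abs(cdc$weight - w_mean) <= w_sd

# The mean of a logical vector is the proportion of TRUE values
prop_within <- mean(within_one_sd)
prop_within
```

For a roughly bell-shaped distribution this proportion would be close to 0.68, so a noticeably different value is a hint that the weights are skewed or heavy-tailed.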
