Exercise 1 : How many cases are there in this data set? How many variables? For each variable, identify its data type (e.g. categorical, discrete).
source("http://www.openintro.org/stat/data/cdc.R")
nrow(cdc)
## [1] 20000
How many variables?
length(cdc)
## [1] 9
For each variable, identify its data type (e.g. categorical, discrete)
names(cdc)
## [1] "genhlth" "exerany" "hlthplan" "smoke100" "height" "weight"
## [7] "wtdesire" "age" "gender"
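“genhlth”, “exerany”, “hlthplan”, “smoke100” and “gender” are categorical variables (genhlth is ordinal; exerany, hlthplan and smoke100 are coded as 0/1). “height”, “weight”, “wtdesire” and “age” are numerical; all four are recorded as whole numbers, so they are discrete as stored.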
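One way to check how R stores each column is sketched below. The data frame `toy` is made up for illustration and only mimics the structure of a few cdc columns; it is not the real data.

```r
# Hypothetical toy data, not the real cdc data
toy <- data.frame(
  genhlth = factor(c("good", "excellent", "fair")),
  exerany = c(1, 0, 1),
  height  = c(70L, 64L, 67L)
)

# Storage class of every column, returned as a named character vector
col_classes <- sapply(toy, class)
print(col_classes)

# Compact overview of the whole data frame
str(toy)
```

The storage class is only a hint: a 0/1 column such as exerany is stored as a number even though it is categorical in meaning.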
Exercise 2 : Create a numerical summary for height and age, and compute the interquartile range for each. Compute the relative frequency distribution for gender and exerany. How many males are in the sample? What proportion of the sample reports being in excellent health?
The interquartile range of an observation variable is the difference of its upper and lower quartiles. It is a measure of how far apart the middle portion of data spreads in value.
Interquartile Range = Upper Quartile - Lower Quartile
summary(cdc$height)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 48.00 64.00 67.00 67.18 70.00 93.00
summary(cdc$age)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 18.00 31.00 43.00 45.07 57.00 99.00
Compute the interquartile range for each
IQR(cdc$height)
## [1] 6
IQR(cdc$age)
## [1] 26
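As a small check on the definition, the interquartile range can also be computed by hand from the two quartiles. The vector `x` below is hypothetical and chosen only for illustration.

```r
# Hypothetical values, not taken from the cdc data
x <- c(18, 31, 43, 57, 99)

# Lower and upper quartiles (R's default quantile type)
q <- quantile(x, c(0.25, 0.75))

# Upper quartile minus lower quartile
iqr_manual <- unname(q[2] - q[1])
print(iqr_manual)

# The built-in function gives the same value
print(IQR(x))
```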
Compute the relative frequency distribution for gender and exerany
Relative Frequency Distribution of Qualitative Data:
The relative frequency distribution of a data variable is a summary of the frequency proportion in a collection of non-overlapping categories.
The relationship of frequency and relative frequency is:
Relative Frequency = Frequency / Sample Size
gend<- cdc$gender
gend.freq <- table(gend)
gend.relfreq <- gend.freq/nrow(cdc)
#Relative Frequency Distribution of gender
gend.relfreq
## gend
## m f
## 0.47845 0.52155
v_exerany<-cdc$exerany
v_exerany.freq <- table(v_exerany)
v_exerany.relfreq<-v_exerany.freq/nrow(cdc)
#Relative Frequency Distribution of exerany
v_exerany.relfreq
## v_exerany
## 0 1
## 0.2543 0.7457
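The same relative frequencies can be obtained with `prop.table()`, which divides each count by the total. A minimal sketch on a made-up vector (the `gender` values below are hypothetical):

```r
# Hypothetical values, not the cdc data
gender <- c("m", "f", "f", "m", "f")

# Counts per category
freq <- table(gender)

# Relative frequencies: each count divided by the sample size
relfreq <- prop.table(freq)
print(relfreq)

# Equivalent to dividing by the number of observations
print(freq / length(gender))
```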
How many males are in the sample?
table(cdc$gender)[1]
## m
## 9569
What proportion of the sample reports being in excellent health?
cat((table(cdc$genhlth)[1]/NROW(cdc))*100,"%")
## 23.285 %
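Indexing a table by position, as in `[1]`, relies on the order of the factor levels. A sketch of an alternative that selects by value instead is shown below; the two vectors are hypothetical stand-ins for the cdc columns.

```r
# Hypothetical stand-ins for cdc$gender and cdc$genhlth
gender  <- factor(c("m", "f", "f", "m"), levels = c("m", "f"))
genhlth <- factor(c("excellent", "good", "excellent", "fair"),
                  levels = c("excellent", "very good", "good", "fair", "poor"))

# Count of males: sum of a logical vector counts the TRUE values
n_male <- sum(gender == "m")
print(n_male)

# Proportion in excellent health: mean of a logical vector is the share of TRUE
prop_excellent <- mean(genhlth == "excellent")
print(prop_excellent)
```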
Exercise 3 : What does the mosaic plot reveal about smoking habits and gender?
table(cdc$gender,cdc$smoke100)
##
## 0 1
## m 4547 5022
## f 6012 4419
The mosaic plot reveals that men are more likely than women to have smoked at least 100 cigarettes: 5022 of 9569 men (about 52%) versus 4419 of 10431 women (about 42%).
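The comparison can be made explicit by converting the counts to row proportions. The sketch below re-enters the counts from the table above as a matrix (the object names `counts` and `row_props` are introduced here for illustration) and then draws the mosaic plot from it.

```r
# Counts copied from the gender by smoke100 table above
counts <- matrix(c(4547, 5022,
                   6012, 4419),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(gender = c("m", "f"),
                                 smoke100 = c("0", "1")))

# Proportion of non-smokers and smokers within each gender (rows sum to 1)
row_props <- prop.table(counts, margin = 1)
print(round(row_props, 3))

# Mosaic plot of the same counts
mosaicplot(counts, main = "Gender by smoke100")
```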
Exercise 4 : Create a new object called under23_and_smoke that contains all observations of respondents under the age of 23 that have smoked 100 cigarettes in their lifetime. Write the command you used to create the new object as the answer to this exercise.
under23_and_smoke <- subset(cdc, smoke100 == 1 & age < 23)
head(under23_and_smoke)
## genhlth exerany hlthplan smoke100 height weight wtdesire age gender
## 13 excellent 1 0 1 66 185 220 21 m
## 37 very good 1 0 1 70 160 140 18 f
## 96 excellent 1 1 1 74 175 200 22 m
## 180 good 1 1 1 64 190 140 20 f
## 182 very good 1 1 1 62 92 92 21 f
## 240 very good 1 0 1 64 125 115 22 f
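For comparison, the same filter can be written with logical indexing instead of `subset()`. This is a sketch on a made-up data frame `toy`, which is hypothetical and much smaller than cdc.

```r
# Hypothetical toy data, not the real cdc data
toy <- data.frame(age      = c(21, 45, 18, 22, 30),
                  smoke100 = c(1, 1, 0, 1, 0))

# Filter with subset(): column names are looked up inside the data frame
by_subset <- subset(toy, smoke100 == 1 & age < 23)

# Filter with logical indexing: columns must be referenced explicitly
by_index <- toy[toy$smoke100 == 1 & toy$age < 23, ]

print(by_subset)
print(all.equal(by_subset, by_index))
```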
Exercise 5 : What does this box plot show? Pick another categorical variable from the data set and see how it relates to BMI. List the variable you chose, why you might think it would have a relationship to BMI, and indicate what the figure seems to suggest.
Ans: A box plot gives a compact summary of a variable's distribution, which makes it easy to compare that variable across several categories.
bmi <- (cdc$weight / cdc$height^2) * 703
boxplot(bmi ~ cdc$gender)
The boxplot of BMI versus gender suggests that BMI tends to be lower for females than for males.
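A numerical summary by group can back up what the box plot shows. The sketch below applies the same BMI formula to a made-up data frame `toy` (hypothetical values, weight in pounds and height in inches) and computes the median BMI per gender.

```r
# Hypothetical toy data, not the real cdc data
toy <- data.frame(weight = c(150, 200, 130, 180),
                  height = c(65, 70, 62, 72),
                  gender = c("f", "m", "f", "m"))

# BMI from pounds and inches, same formula as above
toy$bmi <- toy$weight / toy$height^2 * 703

# Median BMI within each gender
med_bmi <- tapply(toy$bmi, toy$gender, median)
print(med_bmi)
```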
1. Make a scatterplot of weight versus desired weight. Describe the relationship between these two variables.
plot(cdc$weight,cdc$wtdesire, xlab="Weight", ylab= "Desired Weight")
As current weight increases, desired weight also tends to increase, so the two variables are positively associated.
2. Let’s consider a new variable: the difference between desired weight (wtdesire) and current weight (weight). Create this new variable by subtracting the two columns in the data frame and assigning them to a new object called wdiff.
wdiff <- cdc$weight - cdc$wtdesire  # current weight minus desired weight
3. What type of data is wdiff? If an observation wdiff is 0, what does this mean about the person’s weight and desired weight? What if wdiff is positive or negative?
typeof(wdiff)
## [1] "integer"
wdiff is numerical and, because both weights are recorded as whole numbers, discrete. Here wdiff is computed as weight minus wtdesire. If wdiff is zero, the person's current weight equals their desired weight. If wdiff is positive, the person weighs more than they would like; if it is negative, the person weighs less than they would like. Taking the absolute value, abs(wdiff), gives the size of the gap regardless of direction, and that is always greater than or equal to zero.
4. Describe the distribution of wdiff in terms of its center, shape, and spread, including any plots you use. What does this tell us about how people feel about their current weight?
mean(wdiff)
## [1] 14.5891
sd(wdiff)
## [1] 24.04586
plot(density(wdiff))
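Many respondents are close to their desired weight, but the positive mean of wdiff (about 14.6 pounds, with a standard deviation of about 24 pounds) indicates that, on average, people would like to weigh somewhat less than they currently do.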
5. Using numerical summaries and a side-by-side box plot, determine if men tend to view their weight differently than women.
boxplot(cdc$weight-cdc$wtdesire ~ cdc$gender)
Not markedly. In the side-by-side box plot, the difference between current and desired weight appears to be centered close to zero for both men and women.
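The question also asks for numerical summaries by gender. One way to produce them is sketched below on a made-up data frame `toy` (hypothetical values), using the same sign convention of current weight minus desired weight.

```r
# Hypothetical toy data, not the real cdc data
toy <- data.frame(weight   = c(180, 150, 200, 130, 160, 140),
                  wtdesire = c(170, 150, 180, 115, 165, 120),
                  gender   = c("m", "m", "m", "f", "f", "f"))

# Current weight minus desired weight
toy$wdiff <- toy$weight - toy$wtdesire

# Five-number summary plus mean, separately for each gender
by_gender <- tapply(toy$wdiff, toy$gender, summary)
print(by_gender)

# Median difference per gender
med <- tapply(toy$wdiff, toy$gender, median)
print(med)
```

With the real data, replacing `toy` by `cdc` would give the per-gender summaries to set beside the box plot.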
6. Now it’s time to get creative. Find the mean and standard deviation of weight and determine what proportion of the weights are within one standard deviation of the mean
mean(cdc$weight)
## [1] 169.683
sd(cdc$weight)
## [1] 40.08097
plot(density(cdc$weight))
The density curve is roughly bell-shaped, so a normal approximation suggests that about 68% of the weights lie within one standard deviation of the mean, that is, between roughly 129.6 and 209.8 pounds.
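The proportion can also be computed directly rather than approximated. The sketch below uses simulated weights (hypothetical, generated with `rnorm()`, not the cdc data) to show the calculation.

```r
# Simulated weights (hypothetical; not the cdc data)
set.seed(1)
w <- rnorm(1000, mean = 170, sd = 40)

m <- mean(w)
s <- sd(w)

# Share of observations within one standard deviation of the mean
prop_within <- mean(w >= m - s & w <= m + s)
print(prop_within)
```

Applied to the real data, the same expression with `cdc$weight` in place of `w` would give the exact proportion for the sample.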