source("more/cdc.R")
How many cases are there in this data set? How many variables? For each variable, identify its data type (e.g. categorical, discrete).
There are 20,000 cases in this data set.
There are 9 variables.
Looking at the first rows of the data set below, we can try to identify the data type of each variable.
head(cdc)
## genhlth exerany hlthplan smoke100 height weight wtdesire age gender
## 1 good 0 1 0 70 175 175 77 m
## 2 good 0 1 1 64 125 115 33 f
## 3 good 1 1 1 60 105 105 49 f
## 4 good 1 1 0 66 132 124 42 f
## 5 very good 0 1 0 61 150 130 55 f
## 6 very good 1 1 0 64 114 114 55 f
genhlth is categorical (ordinal): it places each respondent in one of five ordered categories (excellent, very good, good, fair, or poor). No arithmetic such as addition, subtraction, or division can meaningfully be performed on this variable.
exerany is categorical: it places each respondent in one of two groups, exercised (1) or did not exercise (0).
hlthplan is categorical: it records whether the respondent has health coverage (1) or not (0).
smoke100 is categorical: it records whether the respondent has smoked at least 100 cigarettes (1) or not (0).
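height is numerical. It is recorded in whole inches, so it takes discrete values in this data set, although height itself is a continuous quantity.
weight is numerical and, like height, is recorded as a whole number (pounds), so it appears discrete here.
wtdesire is numerical and is recorded on the same whole-number scale as weight.
age is numerical and discrete, recorded in whole years.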
gender is a categorical data type as it puts the person in a male or female category.
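The case and variable counts, and the type of each column, can be checked in code rather than read off by eye. The sketch below is only an illustration: `cdc_head` is a hypothetical stand-in built from the six rows printed by head(cdc) above, and storing genhlth and gender as factors is an assumption about how the real data frame is coded. On the full data, the same calls would be applied to `cdc`.

```r
# Stand-in for cdc: the six rows printed by head(cdc) above.
# Assumption: genhlth and gender are stored as factors in the real data.
cdc_head <- data.frame(
  genhlth  = factor(c("good", "good", "good", "good", "very good", "very good"),
                    levels = c("excellent", "very good", "good", "fair", "poor")),
  exerany  = c(0, 0, 1, 1, 0, 1),
  hlthplan = c(1, 1, 1, 1, 1, 1),
  smoke100 = c(0, 1, 1, 0, 0, 0),
  height   = c(70, 64, 60, 66, 61, 64),
  weight   = c(175, 125, 105, 132, 150, 114),
  wtdesire = c(175, 115, 105, 124, 130, 114),
  age      = c(77, 33, 49, 42, 55, 55),
  gender   = factor(c("m", "f", "f", "f", "f", "f"), levels = c("m", "f"))
)

# Number of rows (cases) and columns (variables)
size <- dim(cdc_head)
size

# Storage class of each column, a first hint at its data type
col_classes <- sapply(cdc_head, class)
col_classes
```

Note that a 0/1 column such as exerany is stored as a number even though it is categorical, so the storage class alone does not settle the data type.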
Create a numerical summary for height and age, and compute the interquartile range for each. Compute the relative frequency distribution for gender and exerany. How many males are in the sample? What proportion of the sample reports being in excellent health?
summary(cdc$height)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 48.00 64.00 67.00 67.18 70.00 93.00
summary(cdc$age)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 18.00 31.00 43.00 45.07 57.00 99.00
IQR(cdc$height)
## [1] 6
IQR(cdc$age)
## [1] 26
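As a quick cross-check, the interquartile range is the third quartile minus the first quartile, so the IQR() results should agree with the quartiles printed by summary(). A minimal sketch using only the values shown above (the variable names are illustrative):

```r
# Quartiles copied from the summary() output above
height_q <- c(q1 = 64, q3 = 70)
age_q    <- c(q1 = 31, q3 = 57)

# IQR = Q3 - Q1
height_iqr <- unname(height_q["q3"] - height_q["q1"])
age_iqr    <- unname(age_q["q3"] - age_q["q1"])

height_iqr  # 6, matching IQR(cdc$height)
age_iqr     # 26, matching IQR(cdc$age)
```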
relative_freq=table(cdc$gender,cdc$exerany)
relative_freq
##
## 0 1
## m 2149 7420
## f 2937 7494
males=relative_freq[1,"0"]+relative_freq[1,"1"]
Total number of males: 9569
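The two-way table above holds counts, not relative frequencies. One way to get the relative frequency distribution of each variable is to divide its marginal totals by the grand total. The sketch below rebuilds the table from the printed counts so that it runs on its own; `counts`, `gender_rel`, and `exerany_rel` are illustrative names. On the full data, table(cdc$gender) / nrow(cdc) would be the direct route.

```r
# Counts copied from the gender-by-exerany table above
counts <- as.table(matrix(
  c(2149, 2937, 7420, 7494), nrow = 2,
  dimnames = list(gender = c("m", "f"), exerany = c("0", "1"))
))

# Marginal totals divided by the grand total give relative frequencies
gender_rel  <- margin.table(counts, 1) / sum(counts)
exerany_rel <- margin.table(counts, 2) / sum(counts)

gender_rel   # m 0.47845, f 0.52155
exerany_rel  # 0: 0.2543, 1: 0.7457
```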
relative_health=table(cdc$gender,cdc$genhlth)
excellentHealthMales=relative_health[1,"excellent"]
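Number of males in excellent health: 2298
Proportion of males in excellent health: 0.2401505 (2298 of the 9569 males). This is the proportion among males only; the proportion for the whole sample would divide the count of all respondents in excellent health by the 20,000 cases.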
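The arithmetic behind the male proportion can be reproduced from the counts shown above. For the sample-wide proportion, one possible approach is to take the mean of a logical comparison. `prop_of_level` is a hypothetical helper and `toy_health` is made-up toy data used only to show that the helper runs; neither comes from the lab.

```r
# Proportion of males in excellent health, from the counts above
excellent_males <- 2298
males <- 2149 + 7420
male_prop <- excellent_males / males
round(male_prop, 7)  # 0.2401505

# Hypothetical helper: share of values in x equal to `level`
prop_of_level <- function(x, level) mean(x == level)

# Toy data, not taken from cdc
toy_health <- c("excellent", "good", "very good", "excellent")
toy_prop <- prop_of_level(toy_health, "excellent")
toy_prop  # 0.5
```

On the full data, the call would be prop_of_level(cdc$genhlth, "excellent").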
What does the mosaic plot reveal about smoking habits and gender?
Answer: The mosaic plot reveals that a larger proportion of males than females have smoked at least 100 cigarettes.
mosaicplot(table(cdc$gender,cdc$smoke100))
Create a new object called under23_and_smoke that contains all observations of respondents under the age of 23 that have smoked 100 cigarettes in their lifetime. Write the command you used to create the new object as the answer to this exercise.
under23_and_smoke <- subset(cdc, smoke100 == 1 & age < 23)
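The subset() call combines two conditions with `&`, so a row is kept only when both are true. The sketch below shows the same filter written with subset() and with logical indexing. `toy` is a hypothetical four-row stand-in using rows 1 and 2 from head(cdc) and rows 13 and 37 from the under-23 output; `row_id` is an added illustrative column.

```r
# Four rows taken from the outputs shown in this report
toy <- data.frame(
  row_id   = c(1, 2, 13, 37),
  smoke100 = c(0, 1, 1, 1),
  age      = c(77, 33, 21, 18)
)

# Keep rows where BOTH conditions hold
via_subset <- subset(toy, smoke100 == 1 & age < 23)

# Equivalent filter with logical indexing
via_index <- toy[toy$smoke100 == 1 & toy$age < 23, ]

via_subset$row_id  # 13 37
via_index$row_id   # 13 37
```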
head(under23_and_smoke)
## genhlth exerany hlthplan smoke100 height weight wtdesire age gender
## 13 excellent 1 0 1 66 185 220 21 m
## 37 very good 1 0 1 70 160 140 18 f
## 96 excellent 1 1 1 74 175 200 22 m
## 180 good 1 1 1 64 190 140 20 f
## 182 very good 1 1 1 62 92 92 21 f
## 240 very good 1 0 1 64 125 115 22 f
What does this box plot show? Pick another categorical variable from the data set and see how it relates to BMI. List the variable you chose, why you might think it would have a relationship to BMI, and indicate what the figure seems to suggest.
The box plot shows the distribution of BMI within each general health category (genhlth) of the cdc data set, so the categories can be compared side by side.
bmi <- (cdc$weight / cdc$height^2) * 703
boxplot(bmi ~ cdc$genhlth)
I picked the smoking variable (smoke100) to see how it relates to BMI. It could have a relationship to BMI because smoking may affect a person’s weight. The figure below suggests that respondents who have smoked at least 100 cigarettes generally have a lower BMI than those who have not.
boxplot(bmi ~ cdc$smoke100)
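To make the BMI formula concrete, here is a small worked example on the six rows printed by head(cdc). The factor 703 converts pounds and inches to the metric BMI scale. `weight_head`, `height_head`, and `bmi_head` are illustrative names.

```r
# Weight (lb) and height (in) for the six rows shown by head(cdc)
weight_head <- c(175, 125, 105, 132, 150, 114)
height_head <- c(70, 64, 60, 66, 61, 64)

# Same formula as above, applied element-wise
bmi_head <- (weight_head / height_head^2) * 703

# First respondent: 175 / 70^2 * 703, roughly 25.1
round(bmi_head[1], 1)
```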
Make a scatterplot of weight versus desired weight. Describe the relationship between these two variables.
The scatterplot shows that most people want to weigh less than they currently do, and almost everyone wants to be below 200 pounds.
plot(cdc$weight,cdc$wtdesire)
Let’s consider a new variable: the difference between desired weight (wtdesire) and current weight (weight). Create this new variable by subtracting the two columns in the data frame and assigning them to a new object called wdiff.
wdiff=cdc$wtdesire-cdc$weight
head(wdiff)
## [1] 0 -10 0 -8 -20 0
What type of data is wdiff? If an observation wdiff is 0, what does this mean about the person’s weight and desired weight. What if wdiff is positive or negative?
wdiff is numerical (discrete) data. If an observation is 0, the person’s desired weight and current weight are the same, meaning they do not have to gain or lose weight to be at their desired weight.
If wdiff is positive then it means that the person wants to gain weight.
If wdiff is negative it means that the person wants to lose weight.
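The sign rule above can be turned into a label for each respondent. A minimal sketch on the six values printed by head(wdiff); the labels "lose", "same", and "gain" are illustrative choices.

```r
# The six values printed by head(wdiff)
wdiff_head <- c(0, -10, 0, -8, -20, 0)

# Negative: wants to lose; zero: content; positive: wants to gain
goal <- ifelse(wdiff_head < 0, "lose",
               ifelse(wdiff_head > 0, "gain", "same"))

# Fix the level order so empty categories still appear in the table
goal_counts <- table(factor(goal, levels = c("lose", "same", "gain")))
goal_counts  # lose 3, same 3, gain 0
```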
Describe the distribution of wdiff in terms of its center, shape, and spread, including any plots you use. What does this tell us about how people feel about their current weight?
summary(wdiff)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -300.00 -21.00 -10.00 -14.59 0.00 500.00
boxplot(wdiff)
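Center and spread can be read off the summary() output. The sketch below uses only the printed values, with illustrative names, to compute the IQR, the range, and the gap between mean and median.

```r
# Values copied from summary(wdiff) above
wdiff_summary <- c(min = -300, q1 = -21, median = -10,
                   mean = -14.59, q3 = 0, max = 500)

# Spread: interquartile range and full range
iqr_wdiff   <- wdiff_summary[["q3"]] - wdiff_summary[["q1"]]    # 21
range_wdiff <- wdiff_summary[["max"]] - wdiff_summary[["min"]]  # 800

# A mean below the median points to a longer tail on the negative side
mean_minus_median <- wdiff_summary[["mean"]] - wdiff_summary[["median"]]
mean_minus_median  # about -4.59
```

Read this way, the center is a median of -10, the middle half of respondents fall between -21 and 0, and the mean sits below the median, which is consistent with a left-skewed distribution with a few extreme values in both directions. Because the third quartile is 0, at least three quarters of respondents want to weigh the same or less than they do now.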
Using numerical summaries and a side-by-side box plot, determine if men tend to view their weight differently than women.
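The summaries and box plots of the weight difference below suggest that men and women view their weight differently. Women typically want to lose more weight: their median difference is -10, versus -5 for men, and their first quartile is -27, versus -20 for men. The third quartile is 0 for both groups, so at least three quarters of each group want to lose weight or stay the same. The largest desired gain is much higher among men (500 versus 83).
Men: weight difference summary and box plot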
men=subset(cdc,gender=="m")
menweightdiff=men$wtdesire-men$weight
summary(menweightdiff)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -300.00 -20.00 -5.00 -10.71 0.00 500.00
boxplot(menweightdiff)
women=subset(cdc,gender=="f")
womenweightdiff=women$wtdesire-women$weight
summary(womenweightdiff)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -300.00 -27.00 -10.00 -18.15 0.00 83.00
boxplot(womenweightdiff)
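The two separate box plots can also be drawn side by side on a shared axis by using the formula interface of boxplot(), which makes the groups easier to compare. This is only a sketch: `toy` is a hypothetical stand-in built from the six rows printed by head(cdc), so the picture is illustrative, not the real result.

```r
# Stand-in built from the six rows shown by head(cdc)
toy <- data.frame(
  weight   = c(175, 125, 105, 132, 150, 114),
  wtdesire = c(175, 115, 105, 124, 130, 114),
  gender   = factor(c("m", "f", "f", "f", "f", "f"), levels = c("m", "f"))
)
toy$wdiff <- toy$wtdesire - toy$weight

# Numerical summary by group: median weight difference per gender
med_by_gender <- tapply(toy$wdiff, toy$gender, median)
med_by_gender  # m 0, f -8

# Side-by-side box plot: one box per gender on a shared axis
boxplot(wdiff ~ gender, data = toy,
        ylab = "desired weight - current weight")
```

On the full data, the equivalent calls would be tapply(wdiff, cdc$gender, summary) and boxplot(wdiff ~ cdc$gender).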
Now it’s time to get creative. Find the mean and standard deviation of weight and determine what proportion of the weights are within one standard deviation of the mean.
weightmean=mean(cdc$weight)
weightstandarddeviation=sd(cdc$weight)
weightmean
## [1] 169.683
weightstandarddeviation
## [1] 40.08097
lowerrange=weightmean-weightstandarddeviation
upperrange=weightmean+weightstandarddeviation
within_one_sd=subset(cdc,weight>lowerrange & weight<upperrange)
lowerrange
## [1] 129.602
upperrange
## [1] 209.7639
proportion_weight_within_one_sd=nrow(within_one_sd)/nrow(cdc)
The proportion of weights within one standard deviation of the mean is 0.7076.
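The same proportion can be computed without creating a subset, because the mean of a logical vector is the share of TRUE values. A minimal sketch on the six weights printed by head(cdc); the names are illustrative and the toy result will differ from the full-sample value of 0.7076.

```r
# Weights for the six rows shown by head(cdc)
weights <- c(175, 125, 105, 132, 150, 114)

m <- mean(weights)  # 133.5
s <- sd(weights)

# TRUE where a weight lies strictly within one sd of the mean
within_one <- weights > m - s & weights < m + s

# Share of TRUE values, here 4 of 6
prop_within <- mean(within_one)
prop_within
```

On the full data, this would be mean(cdc$weight > lowerrange & cdc$weight < upperrange).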