source("cdc.R")
names(cdc)
## [1] "genhlth" "exerany" "hlthplan" "smoke100" "height" "weight"
## [7] "wtdesire" "age" "gender"
dim(cdc)
## [1] 20000 9
There are 20,000 cases and nine variables.
Variable | Type | Subtype |
---|---|---|
index | numerical | continuous |
genhlth | categorical | ordinal |
exerany | categorical | nominal |
hlthplan | categorical | nominal |
smoke100 | categorical | nominal |
height | numerical | continuous |
weight | numerical | continuous |
wtdesire | numerical | continuous |
age | numerical | continuous |
gender | categorical | nominal |
head(cdc)
## genhlth exerany hlthplan smoke100 height weight wtdesire age gender
## 1 good 0 1 0 70 175 175 77 m
## 2 good 0 1 1 64 125 115 33 f
## 3 good 1 1 1 60 105 105 49 f
## 4 good 1 1 0 66 132 124 42 f
## 5 very good 0 1 0 61 150 130 55 f
## 6 very good 1 1 0 64 114 114 55 f
tail(cdc)
## genhlth exerany hlthplan smoke100 height weight wtdesire age
## 19995 good 0 1 1 69 224 224 73
## 19996 good 1 1 0 66 215 140 23
## 19997 excellent 0 1 0 73 200 185 35
## 19998 poor 0 1 0 65 216 150 57
## 19999 good 1 1 0 67 165 165 81
## 20000 good 1 1 1 69 170 165 83
## gender
## 19995 m
## 19996 f
## 19997 m
## 19998 f
## 19999 f
## 20000 m
summary(cdc$weight)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 68.0 140.0 165.0 169.7 190.0 500.0
190 - 140
## [1] 50
mean(cdc$weight)
## [1] 169.683
var(cdc$weight)
## [1] 1606.484
median(cdc$weight)
## [1] 165
table(cdc$smoke100)
##
## 0 1
## 10559 9441
table(cdc$smoke100)/20000
##
## 0 1
## 0.52795 0.47205
barplot(table(cdc$smoke100))
smoke <- table(cdc$smoke100)
barplot(smoke)
### Exercise_2: Create a numerical summary for height and age, and compute the interquartile range for each. Compute the relative frequency distribution for gender and exerany. How many males are in the sample? What proportion of the sample reports being in excellent health?
# getting summary for height
summary(cdc$height)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 48.00 64.00 67.00 67.18 70.00 93.00
# getting the interquartile range
70 - 64
## [1] 6
# summary for age
summary(cdc$age)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 18.00 31.00 43.00 45.07 57.00 99.00
# interquartile range for age
57 - 31
## [1] 26
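The interquartile ranges above are subtracted by hand from the rounded quartiles that `summary()` prints. As a minimal sketch, base R's `IQR()` computes the same quantity directly from the data. The `heights` vector below is a made-up toy sample, not taken from `cdc`.

```r
# Toy heights in inches (illustrative values, not from the cdc data)
heights <- c(60, 64, 66, 67, 70, 71, 73)

# Quartiles with R's default quantile method (type 7)
q <- quantile(heights, probs = c(0.25, 0.75))

# IQR by hand: third quartile minus first quartile
iqr_by_hand <- unname(q[2] - q[1])

# IQR with the built-in helper
iqr_builtin <- IQR(heights)

print(iqr_by_hand)
print(iqr_builtin)
```

On the real data the equivalent calls would be `IQR(cdc$height)` and `IQR(cdc$age)`.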
The relative frequency distribution for gender:
table(cdc$gender)/20000
##
## m f
## 0.47845 0.52155
How many males are in the sample?
table(cdc$gender)
##
## m f
## 9569 10431
There are 9,569 males in the sample.
The relative frequency distribution for exerany:
table(cdc$exerany)/20000
##
## 0 1
## 0.2543 0.7457
What proportion of the sample reports being in excellent health?
table(cdc$genhlth)['excellent']/20000
## excellent
## 0.23285
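The relative frequencies above divide each table by the hard-coded sample size 20000. A small sketch of an alternative is `prop.table()`, which divides the counts by their own total, so the code keeps working if the number of rows changes. The `gender` vector here is a toy example, not the real column.

```r
# Toy gender column (illustrative values, not from the cdc data)
gender <- c("m", "f", "f", "m", "f")

counts <- table(gender)          # absolute frequencies
rel_freq <- prop.table(counts)   # counts divided by their total

print(rel_freq)
```

Applied to the lab data this would read `prop.table(table(cdc$gender))`, and similarly for `exerany` and `genhlth`.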
table(cdc$gender,cdc$smoke100)
##
## 0 1
## m 4547 5022
## f 6012 4419
mosaicplot(table(cdc$gender,cdc$smoke100))
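A larger share of males than females report having smoked at least 100 cigarettes: 5,022 of 9,569 males (about 52%) versus 4,419 of 10,431 females (about 42%).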
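Because the sample contains more females than males, the raw counts are easier to compare as row proportions. The sketch below rebuilds the two-way table from the counts printed above and uses `prop.table()` with `margin = 1`. The name `smoke_by_gender` is only illustrative.

```r
# Two-way counts copied from the table(cdc$gender, cdc$smoke100) output
smoke_by_gender <- as.table(matrix(
  c(4547, 6012,   # smoke100 == 0: males, females
    5022, 4419),  # smoke100 == 1: males, females
  nrow = 2,
  dimnames = list(gender = c("m", "f"), smoke100 = c("0", "1"))
))

# margin = 1 divides each row by its row total
row_props <- prop.table(smoke_by_gender, margin = 1)
print(round(row_props, 3))
```

With these counts, about 52.5% of males and 42.4% of females fall in the `smoke100 = 1` column.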
dim(cdc)
## [1] 20000 9
cdc[567, 6]
## [1] 160
cdc[1:10, 6]
## [1] 175 125 105 132 150 114 194 170 150 180
cdc[1:10, ]
## genhlth exerany hlthplan smoke100 height weight wtdesire age gender
## 1 good 0 1 0 70 175 175 77 m
## 2 good 0 1 1 64 125 115 33 f
## 3 good 1 1 1 60 105 105 49 f
## 4 good 1 1 0 66 132 124 42 f
## 5 very good 0 1 0 61 150 130 55 f
## 6 very good 1 1 0 64 114 114 55 f
## 7 very good 1 1 0 71 194 185 31 m
## 8 very good 0 1 0 67 170 160 45 m
## 9 good 0 1 1 65 150 130 27 f
## 10 good 1 1 0 70 180 170 44 m
mdata <- subset(cdc, cdc$gender == "m")
head(mdata)
## genhlth exerany hlthplan smoke100 height weight wtdesire age gender
## 1 good 0 1 0 70 175 175 77 m
## 7 very good 1 1 0 71 194 185 31 m
## 8 very good 0 1 0 67 170 160 45 m
## 10 good 1 1 0 70 180 170 44 m
## 11 excellent 1 1 1 69 186 175 46 m
## 12 fair 1 1 1 69 168 148 62 m
under23_and_smoke <- subset(cdc, smoke100 == 1 & age < 23)
head(under23_and_smoke)
## genhlth exerany hlthplan smoke100 height weight wtdesire age gender
## 13 excellent 1 0 1 66 185 220 21 m
## 37 very good 1 0 1 70 160 140 18 f
## 96 excellent 1 1 1 74 175 200 22 m
## 180 good 1 1 1 64 190 140 20 f
## 182 very good 1 1 1 62 92 92 21 f
## 240 very good 1 0 1 64 125 115 22 f
summary(cdc$height)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 48.00 64.00 67.00 67.18 70.00 93.00
boxplot(cdc$height ~ cdc$gender)
bmi <- (cdc$weight / cdc$height^2) * 703
boxplot(bmi ~ cdc$genhlth)
The boxplot shows the distribution of the calculated BMI within each category of the genhlth variable. As illustrated, BMI tends to increase across the health categories.
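The BMI line above applies the imperial formula, weight in pounds divided by height in inches squared, times 703. As a quick worked example, the sketch below applies it to the first three respondents shown in the `head(cdc)` output.

```r
# First three rows of height (inches) and weight (pounds) from head(cdc)
height <- c(70, 64, 60)
weight <- c(175, 125, 105)

# The factor 703 converts lb/in^2 to the usual kg/m^2 scale
bmi <- weight / height^2 * 703
print(round(bmi, 1))
```

The first respondent (70 in, 175 lb) comes out at a BMI of about 25.1.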
boxplot(bmi ~ cdc$gender)
The boxplot shows broadly similar BMI distributions for both genders. However, most males appear to have a BMI below 30.
smoothScatter(cdc$wtdesire ~ cdc$weight)
abline(lm(cdc$wtdesire~cdc$weight), col="red")
The relationship is positive: respondents who weigh more tend to report a higher desired weight.
wdiff <- (cdc$wtdesire - cdc$weight)
typeof(wdiff)
## [1] "integer"
If wdiff is 0, the person is at their ideal weight (current weight equals desired weight). If wdiff is positive, the person wants to gain weight to reach their ideal. If wdiff is negative, the person wants to lose weight.
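To make the sign convention concrete, here is a minimal sketch that labels each difference by its sign. It uses only the first ten respondents printed in the `cdc[1:10, ]` output, so the counts are illustrative and not representative of the full sample. The name `direction` and the labels are made up for this example.

```r
# weight and wtdesire for the first ten respondents, copied from cdc[1:10, ]
weight   <- c(175, 125, 105, 132, 150, 114, 194, 170, 150, 180)
wtdesire <- c(175, 115, 105, 124, 130, 114, 185, 160, 130, 170)

wdiff <- wtdesire - weight

# Label each difference by its sign
direction <- factor(sign(wdiff),
                    levels = c(-1, 0, 1),
                    labels = c("wants to lose", "content", "wants to gain"))
print(table(direction))
```

Among these ten rows, seven want to lose weight, three are content, and none want to gain.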
differ <- density(wdiff)
plot(differ)
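The density plot peaks at 0, so the single most common response is being happy with one's current weight. Most of the mass, however, lies at or below 0: the 75th percentile of wdiff is 0 and the median is -10, so at least half of the respondents want to lose 10 lb or more.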
mean(wdiff)
## [1] -14.5891
median(wdiff)
## [1] -10
quantile(wdiff)
## 0% 25% 50% 75% 100%
## -300 -21 -10 0 500
boxplot(wdiff)
From the boxplot, we can see outliers, including people who think they should be 250 lb heavier.
m_desire <- subset(cdc, cdc$gender == 'm')$wtdesire
f_desire <- subset(cdc, cdc$gender == 'f')$wtdesire
m_weight <- subset(cdc, cdc$gender == 'm')$weight
f_weight <- subset(cdc, cdc$gender == 'f')$weight
boxplot(m_desire - m_weight, f_desire - f_weight)
summary(m_desire - m_weight)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -300.00 -20.00 -5.00 -10.71 0.00 500.00
summary(f_desire - f_weight)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -300.00 -27.00 -10.00 -18.15 0.00 83.00
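The summaries and boxplot show that both males and females typically want to lose weight, and that females tend to want to lose more: the median difference is -10 lb for females versus -5 lb for males, and the mean is -18.15 lb versus -10.71 lb.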
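The four `subset()` calls above can be avoided by storing the difference once and grouping it by gender. The following is a sketch using `tapply()` on the first ten respondents from the `cdc[1:10, ]` output. The data frame name `first_ten` is hypothetical, and ten rows are far too few to reproduce the full-sample summaries.

```r
# First ten respondents, copied from the cdc[1:10, ] output
first_ten <- data.frame(
  weight   = c(175, 125, 105, 132, 150, 114, 194, 170, 150, 180),
  wtdesire = c(175, 115, 105, 124, 130, 114, 185, 160, 130, 170),
  gender   = c("m", "f", "f", "f", "f", "f", "m", "m", "f", "m")
)

# Desired minus actual weight, stored as a column
first_ten$wdiff <- first_ten$wtdesire - first_ten$weight

# Apply a summary function to wdiff within each gender group
median_by_gender <- tapply(first_ten$wdiff, first_ten$gender, median)
mean_by_gender   <- tapply(first_ten$wdiff, first_ten$gender, mean)

print(median_by_gender)
print(mean_by_gender)
```

On the full data the same pattern would be `tapply(cdc$wtdesire - cdc$weight, cdc$gender, summary)`.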
mean(cdc$weight)
## [1] 169.683
sd(cdc$weight)
## [1] 40.08097
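# Rows above the lower bound (mean - 1 SD); despite its name, this is not "below the mean"
below_mean <- subset(cdc, cdc$weight > mean(cdc$weight) - sd(cdc$weight))
# Rows under the upper bound (mean + 1 SD)
above_mean <- subset(cdc, cdc$weight < mean(cdc$weight) + sd(cdc$weight))
# Keep rows that satisfy both bounds. Note: the strict `<` against max(above_mean$weight)
# drops rows whose weight equals that maximum, so the proportion is slightly undercounted.
within_sd <- subset(below_mean, below_mean$weight < max(above_mean$weight))
nrow(within_sd)/nrow(cdc)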
## [1] 0.7071
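The proportion within one standard deviation of the mean can also be computed in one step, since the mean of a logical vector is the share of `TRUE` values. This is a sketch on the ten weights from the `cdc[1:10, ]` output, not a rerun of the full analysis. Because it compares against `sd()` directly and does not go through `max()` of a subset, its result on the full data may differ slightly from the figure above.

```r
# Weights of the first ten respondents, copied from cdc[1:10, 6]
weight <- c(175, 125, 105, 132, 150, 114, 194, 170, 150, 180)

# TRUE where the weight is less than one SD away from the mean
within_one_sd <- abs(weight - mean(weight)) < sd(weight)

# Mean of a logical vector = proportion of TRUE values
prop_within <- mean(within_one_sd)
print(prop_within)
```

For the full sample the expression would be `mean(abs(cdc$weight - mean(cdc$weight)) < sd(cdc$weight))`.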