source("C:/Users/Georgia/Documents/Lab1/more/cdc.R")
head(cdc)
## genhlth exerany hlthplan smoke100 height weight wtdesire age gender
## 1 good 0 1 0 70 175 175 77 m
## 2 good 0 1 1 64 125 115 33 f
## 3 good 1 1 1 60 105 105 49 f
## 4 good 1 1 0 66 132 124 42 f
## 5 very good 0 1 0 61 150 130 55 f
## 6 very good 1 1 0 64 114 114 55 f
How many cases are there in this data set? How many variables? For each variable, identify its data type (e.g. categorical, discrete).
dim(cdc)
## [1] 20000 9
names(cdc)
## [1] "genhlth" "exerany" "hlthplan" "smoke100" "height" "weight"
## [7] "wtdesire" "age" "gender"
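There are 20,000 cases and 9 variables. Four are numerical (height, weight, wtdesire and age); because they are recorded as whole numbers, they are treated as discrete here. The remaining five are categorical: genhlth is ordinal, while exerany, hlthplan, smoke100 and gender are binary.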
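The class R uses to store a column does not always match its statistical type: a 0/1 indicator is stored as a number even though it is categorical. A minimal sketch of how one might check this, using a made-up three-row data frame (`toy` is hypothetical, not the real cdc data):

```r
# Hypothetical stand-in for a few cdc columns
toy <- data.frame(
  genhlth = factor(c("good", "very good", "excellent"),
                   levels = c("excellent", "very good", "good", "fair", "poor")),
  exerany = c(0, 1, 1),
  height  = c(70, 64, 60)
)

# Storage class of each column
col_classes <- sapply(toy, class)
print(col_classes)

# Number of distinct values per column; a 0/1 indicator has at most two
n_distinct <- sapply(toy, function(col) length(unique(col)))
print(n_distinct)
```

A numeric column with only two distinct values is a hint that it should be read as categorical.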
Create a numerical summary for height and age, and compute the interquartile range for each. Compute the relative frequency distribution for gender and exerany. How many males are in the sample? What proportion of the sample reports being in excellent health?
# Numerical summaries for Height and Age
summary(cdc$height)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 48.00 64.00 67.00 67.18 70.00 93.00
summary(cdc$age)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 18.00 31.00 43.00 45.07 57.00 99.00
# Table of numerical summaries
summ_table = matrix(c(mean(cdc$height),
var(cdc$height),
median(cdc$height),
IQR(cdc$height), # interquartile range: 3rd quartile minus 1st quartile
mean(cdc$age),
var(cdc$age),
median(cdc$age),
IQR(cdc$age)
), ncol = 2)
colnames(summ_table) = c("Height", "Age")
rownames(summ_table) = c("Mean", "Var", "Median", "IQ range")
as.table(summ_table)
## Height Age
## Mean 67.18290 45.06825
## Var 17.02350 295.58857
## Median 67.00000 43.00000
## IQ range 6.00000 26.00000
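The interquartile range is the third quartile minus the first. As a small sketch on a made-up vector (`x` is hypothetical), base R's `IQR()` can be checked against the quartiles returned by `quantile()`:

```r
# Hypothetical heights in inches
x <- c(60, 62, 64, 66, 67, 68, 70, 72, 75)

# First and third quartiles
q <- quantile(x, c(0.25, 0.75))

# IQR by hand and with the built-in function
iqr_manual  <- unname(q[2] - q[1])
iqr_builtin <- IQR(x)

print(c(manual = iqr_manual, builtin = iqr_builtin))
```

Both routes use the same default quantile definition, so they agree.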
# Relative frequencies for Gender and Exerany
prop.table(table(cdc$gender))
##
## m f
## 0.47845 0.52155
prop.table(table(cdc$exerany))
##
## 0 1
## 0.2543 0.7457
# Number of males in the sample
table(cdc$gender)
##
## m f
## 9569 10431
# Proportion of sample claiming excellent health
prop.table(table(cdc$genhlth))
##
## excellent very good good fair poor
## 0.23285 0.34860 0.28375 0.10095 0.03385
There are 9,569 males in the sample, and a proportion of 0.23285 (about 23.3%) reports being in excellent health.
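Relative frequencies must sum to 1, which gives a quick sanity check on any proportion table. A minimal sketch with a made-up vector of responses (`health` is hypothetical):

```r
# Hypothetical health ratings
lvls <- c("excellent", "very good", "good", "fair", "poor")
health <- factor(c("excellent", "good", "good", "very good",
                   "excellent", "poor", "good", "very good"),
                 levels = lvls)

counts <- table(health)        # absolute frequencies
props  <- prop.table(counts)   # relative frequencies

print(props)

# Sanity check: the proportions should add up to 1
total <- sum(props)
print(total)
```

`prop.table()` divides by the table's own total, so there is no sample size to type in by hand.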
What does the mosaic plot reveal about smoking habits and gender?
mosaicplot(table(cdc$gender,cdc$smoke100))
The mosaic plot shows that a larger proportion of males than females report having smoked at least 100 cigarettes in their lifetime.
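What a mosaic plot shows visually can also be read off a table of row proportions. A small sketch on made-up data (the `gender` and `smoke100` vectors here are hypothetical, not taken from cdc):

```r
# Hypothetical respondents
gender   <- c("m", "m", "m", "m", "f", "f", "f", "f", "f", "f")
smoke100 <- c(1, 1, 1, 0, 1, 1, 0, 0, 0, 0)

tab <- table(gender, smoke100)

# margin = 1 divides each row by its row total,
# giving the share of smokers within each gender
row_props <- prop.table(tab, margin = 1)
print(row_props)
```

Comparing the "1" column across rows answers the question as a proportion within each gender rather than as a raw count.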
Create a new object (under23_and_smoke) that contains all observations of respondents under the age of 23 that have smoked 100 cigarettes in their lifetime.
under23_and_smoke = subset(cdc, age < 23 & smoke100 == 1)
head(under23_and_smoke)
## genhlth exerany hlthplan smoke100 height weight wtdesire age gender
## 13 excellent 1 0 1 66 185 220 21 m
## 37 very good 1 0 1 70 160 140 18 f
## 96 excellent 1 1 1 74 175 200 22 m
## 180 good 1 1 1 64 190 140 20 f
## 182 very good 1 1 1 62 92 92 21 f
## 240 very good 1 0 1 64 125 115 22 f
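The same filter can be written with logical indexing instead of `subset()`. A minimal sketch on a made-up data frame (`toy` is hypothetical) showing that the two approaches select the same rows:

```r
# Hypothetical respondents
toy <- data.frame(age      = c(21, 33, 18, 22, 49),
                  smoke100 = c(1, 1, 0, 1, 1))

# Both conditions must hold, hence the single &
by_subset <- subset(toy, age < 23 & smoke100 == 1)
by_index  <- toy[toy$age < 23 & toy$smoke100 == 1, ]

print(by_subset)
print(by_index)
```

Note that `subset()` silently drops rows where the condition is NA, while logical indexing keeps them as NA rows; the toy data has no missing values, so the results match.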
What does this box plot show? Pick another categorical variable from the data set and see how it relates to BMI. List the variable you chose, why you might think it would have a relationship to BMI, and indicate what the figure seems to suggest.
bmi = (cdc$weight / cdc$height^2) * 703
boxplot(bmi ~ cdc$hlthplan)
The BMI by genhlth box plots show that the poorer one's general health is, the higher one's BMI tends to be. The categorical variable I chose to compare to BMI was hlthplan, because it would be interesting to see whether not having health coverage motivates people to maintain a lower BMI to avoid illness. Interestingly, BMI does not seem to differ according to whether an individual has some form of health coverage. The plot for subjects with coverage has more outliers, but as far as medians and interquartile ranges are concerned, the two boxplots look very similar.
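The factor 703 in the BMI formula converts pounds and inches to the metric definition of BMI. A short sketch of the calculation, using the first four height and weight values printed by head(cdc); the grouping labels in `group` are made up for illustration:

```r
# BMI from weight in pounds and height in inches
bmi_from_imperial <- function(weight_lb, height_in) {
  (weight_lb / height_in^2) * 703
}

weight <- c(175, 125, 105, 132)
height <- c(70, 64, 60, 66)
group  <- c("a", "a", "b", "b")   # hypothetical grouping variable

bmi_toy <- bmi_from_imperial(weight, height)
print(round(bmi_toy, 2))

# Median BMI per group: the centre line of each box in a grouped boxplot
med_by_group <- tapply(bmi_toy, group, median)
print(med_by_group)
```

Comparing group medians numerically is a useful complement to reading them off the boxplot.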
plot(cdc$weight ~ cdc$wtdesire)
There appears to be a positive, roughly linear relationship between weight and desired weight.
wdiff = cdc$wtdesire - cdc$weight
wdiff is a discrete variable. A value of 0 means the subject is already at their desired weight. A negative value means the subject wants to lose weight to reach their goal, while a positive value means the subject wants to gain weight.
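The sign convention can be checked on the six respondents printed by head(cdc). In this sketch the label names in `goal` are an illustrative choice, not part of the lab:

```r
# weight and wtdesire for the first six rows of cdc
weight   <- c(175, 125, 105, 132, 150, 114)
wtdesire <- c(175, 115, 105, 124, 130, 114)

wdiff_toy <- wtdesire - weight

# Label each respondent by the sign of the difference
goal <- ifelse(wdiff_toy < 0, "lose",
        ifelse(wdiff_toy > 0, "gain", "at goal"))

print(data.frame(weight, wtdesire, wdiff = wdiff_toy, goal))
print(table(goal))
```

Among these six respondents, three are at their goal and three want to lose weight.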
summary(wdiff)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -300.00 -21.00 -10.00 -14.59 0.00 500.00
hist(wdiff, breaks = 100, xlim = c(-200, 200))
boxplot(wdiff)
boxplot(wdiff, outline = FALSE)
The summary statistics show that, on average, subjects want to lose approximately 15 pounds. The histogram is left skewed, with most of its mass at or below zero, which shows that subjects more often want to lose weight than gain it. In the boxplot that includes outliers (first boxplot), the box is compressed by the extreme values, so only a slight tendency toward wanting to lose weight is visible. With outliers hidden (second boxplot), it is clear that the typical subject wants to lose rather than gain weight: the median is below zero and the interquartile range runs from -21 to 0.
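The points that `boxplot()` draws as outliers are those lying more than 1.5 times the box length beyond the hinges. A small sketch on made-up differences (`x` is hypothetical) using `boxplot.stats()`, which returns the same quantities without drawing anything:

```r
# Hypothetical weight differences with two extreme values
x <- c(-300, -40, -21, -15, -10, -10, -5, 0, 0, 10, 500)

bs <- boxplot.stats(x)

# Lower whisker, lower hinge, median, upper hinge, upper whisker
print(bs$stats)

# Values flagged as outliers
outliers <- sort(bs$out)
print(outliers)

# The median is unaffected by hiding the outliers
med <- median(x)
print(med)
```

Hiding outliers changes only the scale of the plot, not the median or the hinges, which is why the box becomes easier to read.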
wdiff_gen = data.frame(cdc$gender,wdiff)
# summaries
diff_fem = subset(wdiff_gen, cdc$gender == "f")
diff_male = subset(wdiff_gen, cdc$gender == "m")
summary(diff_fem)
## cdc.gender wdiff
## m: 0 Min. :-300.00
## f:10431 1st Qu.: -27.00
## Median : -10.00
## Mean : -18.15
## 3rd Qu.: 0.00
## Max. : 83.00
summary(diff_male)
## cdc.gender wdiff
## m:9569 Min. :-300.00
## f: 0 1st Qu.: -20.00
## Median : -5.00
## Mean : -10.71
## 3rd Qu.: 0.00
## Max. : 500.00
# boxplot
boxplot(wdiff_gen$wdiff ~ cdc$gender, outline = FALSE)
It appears that women want to lose more weight than men do: the mean wdiff is -18.15 for women versus -10.71 for men, and the median is -10 versus -5. This could mean that women feel more strongly about their weight goals, or set goals that are harder to achieve, than men.
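Group-wise summaries like these can also be computed in one step with `tapply()` rather than by building a subset per group. A minimal sketch on made-up values (`toy` is hypothetical):

```r
# Hypothetical weight differences by gender
toy <- data.frame(gender = c("f", "f", "f", "m", "m", "m"),
                  wdiff  = c(-30, -10, 0, -20, -5, 10))

# One summary value per gender
med_by_gender  <- tapply(toy$wdiff, toy$gender, median)
mean_by_gender <- tapply(toy$wdiff, toy$gender, mean)

print(med_by_gender)
print(mean_by_gender)
```

This keeps the comparison in a single object, which makes the two groups easier to read side by side.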
# mean and stand. dev.
mean_weight = mean(cdc$weight)
mean_weight
## [1] 169.683
sd_weight = sd(cdc$weight)
sd_weight
## [1] 40.08097
# one standard deviation from mean
prop_stdev = subset(cdc, weight < (mean_weight + sd_weight) & weight > (mean_weight - sd_weight))
nrow(prop_stdev)/nrow(cdc)
## [1] 0.7076
Approximately 70.76% of subjects' weights were within one standard deviation (40.08) of the mean (169.683).
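The same proportion can be obtained without creating a subset, because the mean of a logical vector is the share of TRUE values. A sketch using the six weights printed by head(cdc), so the result differs from the full-sample figure:

```r
# Weights of the first six respondents in cdc
w <- c(175, 125, 105, 132, 150, 114)

m <- mean(w)
s <- sd(w)

# TRUE where the weight is strictly within one SD of the mean
within_one_sd <- abs(w - m) < s

# Mean of a logical vector = proportion of TRUE values
prop_within <- mean(within_one_sd)
print(prop_within)
```

For reference, a normal distribution would put about 68% of observations within one standard deviation of the mean.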