#1: How many cases are there in this data set? How many variables? For each variable, identify its data type (e.g. categorical, discrete).
nrow(cdc)
## [1] 20000
ncol(cdc)
## [1] 9
lapply(cdc[,], FUN=unique)
## $genhlth
## [1] good very good excellent fair poor
## Levels: excellent very good good fair poor
##
## $exerany
## [1] 0 1
##
## $hlthplan
## [1] 1 0
##
## $smoke100
## [1] 0 1
##
## $height
## [1] 70 64 60 66 61 71 67 65 69 73 75 68 74 63 59 62 72 77 76 78 57 55 58
## [24] 56 53 79 80 51 54 81 83 50 48 52 49 82 93 84
##
## $weight
## [1] 175 125 105 132 150 114 194 170 180 186 168 185 156 200 160 165 190
## [18] 115 166 182 220 117 124 143 118 210 145 130 112 141 179 135 140 172
## [35] 120 144 250 164 155 134 217 215 136 142 225 148 193 265 138 137 230
## [52] 100 110 211 163 207 280 224 205 107 187 128 204 97 195 152 151 235
## [69] 92 129 183 212 171 222 116 149 126 122 178 255 198 123 162 113 270
## [86] 206 330 108 295 90 300 153 240 147 127 283 173 290 184 275 208 103
## [103] 133 169 101 146 102 196 238 119 174 218 202 201 157 161 253 285 192
## [120] 242 176 328 111 154 121 248 245 167 380 177 340 260 226 400 199 95
## [137] 197 158 234 203 188 159 139 189 191 181 85 80 209 310 228 98 350
## [154] 109 274 364 236 267 243 362 216 500 213 106 131 227 104 223 308 263
## [171] 262 320 233 286 252 239 96 344 360 324 84 214 246 99 278 232 258
## [188] 294 495 249 256 371 83 279 315 219 231 296 257 348 271 385 254 276
## [205] 79 93 309 298 305 88 247 268 325 82 237 78 94 241 390 273 327
## [222] 282 318 272 221 405 86 297 70 319 292 229 287 244 68 370 313
##
## $wtdesire
## [1] 175 115 105 124 130 114 185 160 170 148 220 150 190 120 140 158 165
## [18] 182 110 180 125 145 118 128 168 172 200 164 134 207 195 135 138 137
## [35] 210 100 136 132 155 99 260 141 107 204 142 152 151 92 215 117 178
## [52] 225 198 186 123 113 250 90 108 163 240 85 127 153 154 147 144 103
## [69] 146 205 119 230 161 226 149 174 179 143 162 139 242 102 101 133 116
## [86] 177 202 194 173 235 176 270 126 122 157 159 192 267 183 191 121 280
## [103] 197 212 218 156 112 129 274 325 199 203 167 188 193 228 181 227 95
## [120] 106 184 196 109 104 249 209 169 189 300 265 97 206 171 248 78 98
## [137] 208 290 245 223 219 234 166 187 214 222 275 315 285 238 201 111 91
## [154] 298 80 93 211 131 82 601 252 255 77 96 273 350 216 320 224 244
## [171] 680 217 229 94 213 237 68 88
##
## $age
## [1] 77 33 49 42 55 31 45 27 44 46 62 21 69 23 79 47 76 43 48 54 30 32 63
## [24] 74 41 36 67 57 18 65 29 52 35 59 64 58 34 50 20 56 80 37 24 71 51 38
## [47] 40 22 28 25 87 61 75 26 53 39 66 60 78 84 82 70 72 68 85 19 86 81 73
## [70] 83 88 99 93 91 90 89 94 92 97 95 96
##
## $gender
## [1] m f
## Levels: m f
There are 20,000 cases (rows) in the data set
There are 9 varibles (columns) in the data set
Categorial Data: genhlth, hlthplan, smoke100, gender
Discrete (Quantitative) Data: height, weight, wtdesire, age
#2: Create a numerical summary for height and age, and compute the interquartile range for each. Compute the relative frequency distribution for gender and exerany. How many males are in the sample? What proportion of the sample reports being in excellent health?
summary(cdc$height)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 48.00 64.00 67.00 67.18 70.00 93.00
IQR(cdc$height)
## [1] 6
summary(cdc$age)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 18.00 31.00 43.00 45.07 57.00 99.00
IQR(cdc$age)
## [1] 26
table(cdc$gender)
##
## m f
## 9569 10431
table(cdc$exerany)
##
## 0 1
## 5086 14914
nrow(cdc[cdc$genhlth == 'excellent',])/nrow(cdc)
## [1] 0.23285
The interquartile range for height is 6.
The interquartile range for age is 26.
There are 9569 males in the sample
Appoximately 23.285% of the respondidents reported being in excellent health
#3: What does the mosaic plot reveal about smoking habits and gender?
mosaicplot(table(cdc$gender, cdc$smoke100), shade=TRUE, main='Relationship between gender and smoking habits', xlab = 'Gender', ylab='Smoking Habit')
Women make up slightly more than half of the sample and are less likely to have smoked 100 cigarettes in their lifetime then men.
#4: Create a new object called under23_and_smoke that contains all observations of respondents under the age of 23 that have smoked 100 cigarettes in their lifetime. Write the command you used to create the new object as the answer to this exercise.
under23_and_smoke <- subset(cdc, age < 23 & smoke100 == 1)
#5: What does this box plot show? Pick another categorical variable from the data set and see how it relates to BMI. List the variable you chose, why you might think it would have a relationship to BMI, and indicate what the figure seems to suggest.
bmi <- (cdc$weight / cdc$height^2)*703
level_data <- factor(x=cdc$exerany, levels=0:1, labels = c('No Exercise Plan', 'Exercise Plan'))
ggplot(cdc, mapping=aes(x=level_data, y=bmi)) + geom_boxplot() + xlab('Group') + ylab('BMI') + ggtitle('Comparison of BMI between Different Exercise Groups')
The boxplot in the notes shows a comparison between BMI and respondents description of their health. It could be helpful in determining if there is a relationship between a respondents weight and whether or not they consider themselves healthy.
I plotted BMI against whether the respondant is on an exercise plan. I would anticipate that the BMI of people who exercise would be lower than those who do not but would also have more upper outliers. This may seem paradoxical but it would make sense that people who exercise would broadly be comprised of people concerned about their health and weight along with those who were put on exercise plans by doctors in an attempt to bring their weight down.
The figure mostly supports this interpretation. In addition, the plot shows fewer exercising respondents with severly low BMI as well. This supports the idea that people who exercise are perhaps more conscious of their weight (both high and low)
#1: Make a scatterplot of weight versus desired weight. Describe the relationship between these two variables.
ggplot(cdc, aes(x=weight, y=wtdesire)) + geom_point() + geom_abline(slope=1,intercept=0, color='red') + geom_point(data=cdc[cdc$weight>230 & cdc$wtdesire>cdc$weight,], color='green') + ggtitle('Current Weight vs. Desired Weight') + xlab('Weight') + ylab('Desired Weight')
There is a strong positive correlation between the two variables. That is, people appear to base the weight they desire off of their current weight. This makes sense as it is common to hear people make statements such as, “I’d like to drop 20 pounds.”
The red line indicates equal weight and wtdesire and thus all people below the line would like to lose weight, while all people above would like to gain weight. It matches expectations that the number of people that would like to gain weight would depreciate as weight grows. Very few people above 230 pounds wish to gain weight (marked in green).
#2: Let’s consider a new variable: the difference between desired weight (wtdesire) and current weight (weight). Create this new variable by subtracting the two columns in the data frame and assigning them to a new object called wdiff.
cdc$wtdiff <- cdc$wtdesire - cdc$weight
#3: What type of data is wdiff? If an observation wdiff is 0, what does this mean about the person’s weight and desired weight. What if wdiff is positive or negative?
typeof(cdc$wtdiff[0])
## [1] "integer"
wdiff is an integer and is quantitative data.
If wdiff is 0 that means the person is already at their desired weight. If the number is negative, this represents the amount of weight the person wishes to lose and if it is positive, that represents the weight the person wishes to gain.
#4: Describe the distribution of wdiff in terms of its center, shape, and spread, including any plots you use. What does this tell us about how people feel about their current weight?
ggplot(cdc, aes(x=wtdiff)) + geom_histogram(aes(y=..count../sum(..count..)),bins = 200) + stat_function(fun=dnorm, color='red', args=list(mean=mean(cdc$wtdiff), sd=sd(cdc$wtdiff)), show.legend=FALSE) + ggtitle('Distribution of Desired Weight Change') + xlab('Weight Difference') + ylab('Relative Frequency')
summary(cdc$wtdiff)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -300.00 -21.00 -10.00 -14.59 0.00 500.00
Upon initial observation the data is centered and heavily weighted just below 0. This would indicate that most people want to “lose a couple of pounds”. The mean and median are closely located at -10 and -14.59 again backing up that sentinment of “wanting to lose a few pounds”.
The data appears slight skewed towards losing weight. This is more easily noticable with the normal curve added to the historgram.
However the data is hard to process due to a number of extreme outliers. A boxplot supports the idea of a number of extreme outliers
boxplot(cdc$wtdiff)
I created a data set with these outliers removed
meanDiff <- mean(cdc$wtdiff)
iqrDiff <- IQR(cdc$wtdiff)
cdc$bmi <- bmi
#Components adapted from Stackoverflow
convert_bmi <- function(bmi){
sapply(bmi, function(bmi){
cuts <- c(-Inf, 18.5, 24.9, 29.9, Inf)
labs <- c('Under', 'Normal', 'Over', 'Obese')
return(labs[findInterval(bmi, cuts)])
})
}
cdc <- mutate(cdc, cat=convert_bmi(bmi))
cdcNoOutliers <- cdc[cdc$wtdiff > meanDiff-1.5*iqrDiff & cdc$wtdiff < meanDiff+1.5*iqrDiff,]
boxplot(cdcNoOutliers$wtdiff)
ggplot(cdcNoOutliers, aes(x=wtdiff)) + geom_histogram(aes(y=..count../sum(..count..), fill=cat),bins = 15) + ggtitle('Distribution of Desired Weight Change') + xlab('Weight Difference') + ylab('Relative Frequency') + labs(fill='Legend')
The histogram with the outliers removes shows a much clearer picture. Most people indicate wanting to lose a few pounds with the data being more heavily skewed towards losing weight. I grouped each person by their BMI category and added this information to the histogram. This information shows clear trends based on BMI category. Underweight people seem to acknolwedge they should gain weight, obese people clearly wish to lose the most weight and normal weight people are the most likely to be happy with their current weight.
#5: Using numerical summaries and a side-by-side box plot, determine if men tend to view their weight differently than women.
summary(cdcNoOutliers[cdcNoOutliers$gender == 'm',]$wtdiff)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -46.00 -15.00 -5.00 -8.62 0.00 16.00
summary(cdcNoOutliers[cdcNoOutliers$gender == 'f',]$wtdiff)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -46.0 -20.0 -10.0 -12.3 0.0 16.0
ggplot(cdcNoOutliers, mapping=aes(x=gender, y=wtdiff)) + geom_boxplot() + xlab('Gender') + ylab('Weight Diff') + ggtitle('Comparison of Ideal Weight Change Between Men and Women')
Comparing the mean and median from the summary data along with side-by-side box plots supports the notion that women tend to want to lose more weight than men. Interestingly both genders had an equal proportion (\(\frac{1}{4}\)) who wanted to gain weight and the distribution of desired gain is also equal between 0 and 10 pounds.
This might indicate that men are in general more happy with their weight or it may simply mean that men are less likely to report that they are unhappy with their weight or that women may feel under more pressure to lose weight.
#6: Now it’s time to get creative. Find the mean and standard deviation of weight and determine what proportion of the weights are within one standard deviation of the mean.
#Code modified from r-graph-gallery.com number 162
meanWeight <- mean(cdc$weight)
sdWeight <- sd(cdc$weight)
plot(density(cdc$weight), main='Density Distribution of Weights')
lowerBound <- meanWeight-sdWeight
upperBound <- meanWeight+sdWeight
abline(v=lowerBound, col='red')
abline(v=upperBound, col='red')
densityData <- density(cdc$weight, from=lowerBound, to=upperBound)
polygon(c(lowerBound,densityData$x,upperBound), c(0,densityData$y,0), col='red')
abline(v=meanWeight)
nrow(cdc[cdc$weight >= lowerBound & cdc$weight <= upperBound,])/nrow(cdc)
## [1] 0.7076
The red area shows the portion of the sample that is within 1 standard deviation of the mean weight.
Approximately 70.76% of the sample falls within 1 standard deviation of the mean.