Data 606 Lab 1

source("more/cdc.R")

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(ggplot2)

How many cases are there in this data set? How many variables? For each variable, identify its data type (e.g. categorical, discrete).

There are 20000 cases and 9 variables. genhlth is categorical (ordinal, since its levels run from excellent to poor). exerany is categorical (nominal). hlthplan is categorical (nominal). smoke100 is categorical (nominal). height is numerical (continuous). weight is numerical (continuous). wtdesire is numerical (continuous). age is numerical (continuous). gender is categorical (nominal).
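These counts can be read straight off the data frame rather than counted by hand. A minimal sketch, using a small made-up data frame `toy` in place of `cdc` (the values are illustrative only):

```r
# Toy stand-in for cdc; the values are made up for illustration.
toy <- data.frame(
  genhlth = factor(c("good", "excellent", "fair"),
                   levels = c("excellent", "very good", "good", "fair", "poor")),
  exerany = c(0, 1, 1),
  height  = c(70, 64, 67)
)

n_cases <- nrow(toy)           # number of cases (rows)
n_vars  <- ncol(toy)           # number of variables (columns)
types   <- sapply(toy, class)  # storage class of each column

n_cases
n_vars
types
```

With the lab data loaded, `dim(cdc)` and `str(cdc)` give the same information in one call each.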

Create a numerical summary for height and age, and compute the interquartile range for each. Compute the relative frequency distribution for gender and exerany. How many males are in the sample? What proportion of the sample reports being in excellent health?

summary(cdc$height)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   48.00   64.00   67.00   67.18   70.00   93.00

summary(cdc$age)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   18.00   31.00   43.00   45.07   57.00   99.00
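The exercise also asks for the interquartile range, which is the third quartile minus the first. A short sketch using the quartiles printed above; with the data loaded, `IQR(cdc$height)` and `IQR(cdc$age)` should agree, since `IQR()` and `summary()` use the same default quantile method.

```r
# Quartiles copied from the summary() output above.
height_q1 <- 64
height_q3 <- 70
age_q1    <- 31
age_q3    <- 57

# IQR = third quartile minus first quartile.
height_iqr <- height_q3 - height_q1  # 6 inches
age_iqr    <- age_q3 - age_q1        # 26 years

height_iqr
age_iqr
```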

table(cdc$exerany)/20000

## 
##      0      1 
## 0.2543 0.7457

table(cdc$gender)/20000

## 
##       m       f 
## 0.47845 0.52155

summary(cdc$gender)

##     m     f 
##  9569 10431

##There are 9569 males in the sample. 
table(cdc$genhlth)/20000

## 
## excellent very good      good      fair      poor 
##   0.23285   0.34860   0.28375   0.10095   0.03385

##23% of the sample reports excellent health
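Relative frequencies can also be computed with `prop.table()`, which divides each count by the table's own total, so no sample size has to be typed in. A minimal sketch; the vector `health` is made up for illustration and is not the CDC data.

```r
# Made-up responses for illustration; not the CDC data.
health <- factor(c("good", "excellent", "good", "fair"),
                 levels = c("excellent", "very good", "good", "fair", "poor"))

# Counts divided by their own total.
rel_freq <- prop.table(table(health))
rel_freq
```

On the lab data the equivalent call would be `prop.table(table(cdc$genhlth))`.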

What does the mosaic plot reveal about smoking habits and gender?

A larger proportion of men than women report having smoked at least 100 cigarettes in their lifetime.

Create a new object called under23_and_smoke that contains all observations of respondents under the age of 23 that have smoked 100 cigarettes in their lifetime. Write the command you used to create the new object as the answer to this exercise.

under23<- subset(cdc, age < 23)
under23_and_smoke <- subset(under23, smoke100==1)
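The two steps can also be combined into one call by joining the conditions with `&`. A minimal sketch on a made-up data frame `toy` (illustrative values only), assuming `smoke100` is coded 0/1 as in the lab data:

```r
# Made-up rows for illustration; not the CDC data.
toy <- data.frame(age      = c(19, 22, 23, 40, 21),
                  smoke100 = c(1,  0,  1,  1,  1))

# Keep only rows where both conditions hold.
toy_under23_and_smoke <- subset(toy, age < 23 & smoke100 == 1)
toy_under23_and_smoke
```

On the lab data the equivalent would be `subset(cdc, age < 23 & smoke100 == 1)`.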

What does this box plot show? Pick another categorical variable from the data set and see how it relates to BMI. List the variable you chose, why you might think it would have a relationship to BMI, and indicate what the figure seems to suggest.

Those who reported excellent health had lower BMIs than those who did not, and BMI increases as self-reported health gets worse.

bmi <- (cdc$weight / cdc$height^2) * 703
boxplot(bmi ~ cdc$smoke100)

Many people talk about using smoking to reduce weight. However, it appears that smokers have about the same BMI as non-smokers.
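To put numbers behind the box plot, BMI can be summarised per group. The sketch below applies the same formula as above (weight in pounds over height in inches squared, times 703) to a few made-up rows and takes the median per `smoke100` group with `tapply()`.

```r
# Made-up rows for illustration; not the CDC data.
toy <- data.frame(weight   = c(150, 200, 130, 180),  # pounds
                  height   = c(65,  70,  62,  72),   # inches
                  smoke100 = c(0,   1,   0,   1))

# Same BMI formula as in the lab.
toy$bmi <- (toy$weight / toy$height^2) * 703

# Median BMI within each smoke100 group.
median_bmi <- tapply(toy$bmi, toy$smoke100, median)
median_bmi
```

On the lab data, `tapply(bmi, cdc$smoke100, median)` would give the two medians to compare.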

On Your Own

Make a scatterplot of weight versus desired weight. Describe the relationship between these two variables.

ggplot(cdc, aes(x = weight, y = wtdesire)) + geom_point() + scale_y_continuous(limits = c(0, 700)) + scale_x_continuous(limits = c(0, 700))

Most points fall below the y = x line, where desired weight is less than current weight, meaning that most people want to lose weight.

Let’s consider a new variable: the difference between desired weight (wtdesire) and current weight (weight). Create this new variable by subtracting the two columns in the data frame and assigning them to a new object called wdiff.

wdiff<-cdc$wtdesire-cdc$weight
cdc<- cbind(cdc,wdiff)

What type of data is wdiff? If an observation wdiff is 0, what does this mean about the person’s weight and desired weight? What if wdiff is positive or negative?

wdiff is a continuous numeric variable. If it is 0, the person is at the weight they want to be. If it is negative, they want to lose weight, and if it is positive, they want to gain weight.

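The three cases can be tabulated directly. A minimal sketch on made-up values, using `sign()` to map each difference to -1, 0, or 1; the labels "lose", "stay", and "gain" are illustrative names, not part of the lab.

```r
# Made-up weights in pounds for illustration; not the CDC data.
weight   <- c(180, 150, 200, 130)
wtdesire <- c(170, 150, 180, 140)

# Negative: wants to lose; zero: content; positive: wants to gain.
wdiff <- wtdesire - weight

# Map the sign of each difference to a label and count each case.
goal <- factor(sign(wdiff), levels = c(-1, 0, 1),
               labels = c("lose", "stay", "gain"))
table(goal)
```
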
Describe the distribution of wdiff in terms of its center, shape, and spread, including any plots you use. What does this tell us about how people feel about their current weight?

summary(cdc$wdiff)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -300.00  -21.00  -10.00  -14.59    0.00  500.00

ggplot(cdc,aes(x=wdiff))+geom_histogram(binwidth = 10)+scale_x_continuous(limits=c(-150,50), breaks=seq(-150,50, by =10))

## Warning: Removed 66 rows containing non-finite values (stat_bin).

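Most people want to lose weight. The median difference is -10 pounds and the third quartile is 0, so at least three quarters of respondents have a desired weight at or below their current weight. The mean (-14.59) is pulled below the median by outliers, which I left off the histogram. Within the plotted range the histogram is left-skewed, with most values close to 0, showing that most people do not desire a radical weight change.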

Using numerical summaries and a side-by-side box plot, determine if men tend to view their weight differently than women.

cdcm<-subset(cdc,gender=="m")
cdcf<-subset(cdc,gender=="f")
summary(cdcf$wdiff)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -300.00  -27.00  -10.00  -18.15    0.00   83.00

summary(cdcm$wdiff)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -300.00  -20.00   -5.00  -10.71    0.00  500.00

ggplot(cdcm,aes(y=wdiff,x=1))+geom_boxplot()+stat_boxplot(geom = "errorbar")

ggplot(cdcf,aes(y=wdiff,x=1))+geom_boxplot()+stat_boxplot(geom = "errorbar")

#I'd like to look at them scaled the same way for a better comparison
ggplot(cdcm,aes(y=wdiff,x=1))+geom_boxplot()+stat_boxplot(geom = "errorbar")+scale_y_continuous(limits=c(-300,500))

ggplot(cdcf,aes(y=wdiff,x=1))+geom_boxplot()+stat_boxplot(geom = "errorbar")+scale_y_continuous(limits=c(-300,500))

#without the outliers
ggplot(cdcm,aes(y=wdiff,x=1))+geom_boxplot()+stat_boxplot(geom = "errorbar")+scale_y_continuous(limits=c(-40,20))

## Warning: Removed 919 rows containing non-finite values (stat_boxplot).

## Warning: Removed 919 rows containing non-finite values (stat_boxplot).

ggplot(cdcf,aes(y=wdiff,x=1))+geom_boxplot()+stat_boxplot(geom = "errorbar")+scale_y_continuous(limits=c(-40,20))

## Warning: Removed 1284 rows containing non-finite values (stat_boxplot).

## Warning: Removed 1284 rows containing non-finite values (stat_boxplot).

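Men have many more outliers who want to gain weight than women do: the maximum difference is 500 pounds for men versus 83 for women. Men also tend to want to lose less weight than women, with a median difference of -5 pounds versus -10 and a mean of -10.71 versus -18.15.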
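The exercise asks for a side-by-side box plot, which puts both groups on a shared axis in one figure. A sketch with base R's formula interface on a made-up data frame; `tapply()` gives the matching per-group medians.

```r
# Made-up values for illustration; not the CDC data.
toy <- data.frame(
  gender = factor(c("m", "m", "m", "f", "f", "f"), levels = c("m", "f")),
  wdiff  = c(-5, 0, 10, -20, -10, 0)
)

# One box per gender, drawn on a common y axis.
boxplot(wdiff ~ gender, data = toy, ylab = "wdiff (pounds)")

# Median wdiff within each gender.
med_by_gender <- tapply(toy$wdiff, toy$gender, median)
med_by_gender
```

On the lab data, `boxplot(wdiff ~ gender, data = cdc)` or `ggplot(cdc, aes(x = gender, y = wdiff)) + geom_boxplot()` would draw the same comparison without splitting the data frame first.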

Now it’s time to get creative. Find the mean and standard deviation of weight and determine what proportion of the weights are within one standard deviation of the mean.

summary(cdc$weight)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    68.0   140.0   165.0   169.7   190.0   500.0

cdcmean<-subset(cdc, weight<190 & weight > 68)
count(cdcmean)/20000

##        n
## 1 0.7161

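71.6% of weights fall strictly between 68 and 190 pounds. These bounds are the minimum and the third quartile from summary(), not the mean plus or minus one standard deviation, so this figure is not yet the proportion the exercise asks for; that requires sd(cdc$weight), which summary() does not report. The weight distribution is also not normal but right-skewed (the mean of 169.7 exceeds the median of 165, and the maximum is 500), so the proportion need not match the normal 68%.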
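The share of values within one standard deviation of the mean can be computed directly from `mean()` and `sd()`. A minimal sketch; `prop_within_sd` is a hypothetical helper name, and the vector `toy` is made up for illustration.

```r
# Share of values within k standard deviations of the mean.
# prop_within_sd is a hypothetical helper, not part of the lab materials.
prop_within_sd <- function(x, k = 1) {
  m <- mean(x, na.rm = TRUE)
  s <- sd(x, na.rm = TRUE)
  # The mean of a logical vector is the proportion of TRUE values.
  mean(abs(x - m) <= k * s, na.rm = TRUE)
}

# Made-up weights in pounds for illustration; not the CDC data.
toy <- c(120, 140, 150, 165, 170, 180, 190, 250)
prop_within_sd(toy)
```

With the lab data loaded, `prop_within_sd(cdc$weight)` would give the proportion the exercise asks for.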

Jason Givens-Doyle

September 17, 2018
