Data 606 - Lab 0

Questions to answer:

How many cases are there in this data set? How many variables? For each variable, identify its data type (e.g. categorical, discrete).
- There are 20000 cases in the data set.
- There are 9 variables.
- The variable types are as follows:

##   variable                                type
## 1  genhlth               categorical (ordinal)
## 2  exerany                categorical (binary)
## 3 hlthplan                categorical (binary)
## 4 smoke100                categorical (binary)
## 5   height discrete (handled as a continuous?)
## 6   weight discrete (handled as a continuous?)
## 7 wtdesire discrete (handled as a continuous?)
## 8      age discrete (handled as a continuous?)
## 9   gender                categorical (binary)
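
The case and variable counts above can be read directly off the data frame. The following is a minimal sketch, assuming a small stand-in data frame called `toy` (hypothetical, with invented values) so that it runs without the real `cdc` data:

```r
# Hypothetical stand-in for cdc; the values are invented for illustration.
toy <- data.frame(
  genhlth = factor(c("good", "excellent", "fair"),
                   levels = c("excellent", "very good", "good", "fair", "poor")),
  exerany = c(1, 0, 1),
  height  = c(70, 64, 67),
  gender  = factor(c("m", "f", "f"), levels = c("m", "f"))
)

n_cases <- nrow(toy)               # rows = cases
n_vars  <- ncol(toy)               # columns = variables
var_classes <- sapply(toy, class)  # storage class of each column

n_cases
n_vars
var_classes
```

Note that the storage class R reports (for example `numeric` for `exerany`) is not the same thing as the statistical type, so the classification in the table still has to be made by reading the codebook. On the real data, `dim(cdc)` should agree with the 20000 cases and 9 variables reported above.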

Create a numerical summary for height and age, and compute the interquartile range for each. Compute the relative frequency distribution for gender and exerany. How many males are in the sample? What proportion of the sample reports being in excellent health?

summary(cdc$height)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   48.00   64.00   67.00   67.18   70.00   93.00

summary(cdc$weight)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    68.0   140.0   165.0   169.7   190.0   500.0

heightIQR <- 70 - 64
weightIQR <- 190 - 140

The IQRs for height and weight are 6 and 50, respectively.
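
Rather than subtracting quartiles by hand, base R's `IQR()` returns the same quantity directly, and it applies equally to `age`, which the question asks about. A sketch on invented values (the two vectors below are hypothetical, not taken from `cdc`):

```r
# Invented values for illustration only.
height <- c(60, 64, 66, 67, 68, 70, 74)
age    <- c(18, 25, 31, 43, 52, 57, 80)

# IQR() uses the same default quantile definition as summary().
height_iqr <- IQR(height)
age_iqr    <- IQR(age)

# Equivalent computation from the quartiles themselves.
height_iqr_manual <- unname(diff(quantile(height, c(0.25, 0.75))))

height_iqr
age_iqr
```

On the real data this would be `IQR(cdc$height)` and `IQR(cdc$age)`.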

The relative frequency distributions for gender and any exercise are as follows:

table(cdc$gender) / nrow(cdc)

## 
##       m       f 
## 0.47845 0.52155

table(cdc$exerany) / nrow(cdc)

## 
##      0      1 
## 0.2543 0.7457
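
The same approach covers the two remaining parts of the question: `table()` on its own gives raw counts (such as the number of males), and `prop.table()` converts any table to proportions without hard-coding the sample size. A sketch on a hypothetical stand-in data frame `toy` with invented values:

```r
# Hypothetical stand-in for cdc; the values are invented for illustration.
toy <- data.frame(
  gender  = factor(c("m", "f", "f", "m", "f"), levels = c("m", "f")),
  genhlth = factor(c("good", "excellent", "excellent", "fair", "good"),
                   levels = c("excellent", "very good", "good", "fair", "poor"))
)

gender_counts <- table(toy$gender)                 # raw counts per level
gender_props  <- prop.table(gender_counts)         # counts / total
health_props  <- prop.table(table(toy$genhlth))

n_males     <- gender_counts[["m"]]        # number of males
p_excellent <- health_props[["excellent"]] # share reporting excellent health

n_males
p_excellent
```

On the real data, the male proportion reported above implies 0.47845 × 20000 = 9569 males, and `prop.table(table(cdc$genhlth))` would give the proportion in excellent health.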

What does the mosaic plot reveal about smoking habits and gender?

mosaicplot(table(cdc$gender,cdc$smoke100))

The mosaic plot suggests that a higher proportion of males than females have smoked at least 100 cigarettes.

Create a new object called under23_and_smoke that contains all observations of respondents under the age of 23 that have smoked 100 cigarettes in their lifetime. Write the command you used to create the new object as the answer to this exercise.

under23_and_smoke <- cdc[cdc$age < 23 & cdc$smoke100 == 1,]
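
Logical indexing like this keeps the rows where both conditions hold. One caveat: if either column contained `NA`, bracket indexing would return an all-`NA` row for it, whereas `subset()` drops such rows. A sketch on invented rows (the `toy` data frame is hypothetical, and the `NA` is included only to show the difference):

```r
# Hypothetical stand-in for cdc with one missing age.
toy <- data.frame(
  age      = c(19, 22, 45, NA, 21),
  smoke100 = c(1, 0, 1, 1, 1)
)

# Bracket indexing: an NA in the condition yields a row of NAs.
by_bracket <- toy[toy$age < 23 & toy$smoke100 == 1, ]

# subset(): rows where the condition is NA are dropped.
by_subset <- subset(toy, age < 23 & smoke100 == 1)

nrow(by_bracket)
nrow(by_subset)
```

If the columns have no missing values, the two forms return the same rows.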

What does this box plot show? Pick another categorical variable from the data set and see how it relates to BMI. List the variable you chose, why you might think it would have a relationship to BMI, and indicate what the figure seems to suggest.

bmi <- (cdc$weight / cdc$height^2) * 703
boxplot(bmi ~ cdc$genhlth)

There’s a trend toward lower BMI in the groups reporting better “general health”.

boxplot(bmi ~ cdc$exerany)

Initially, I expected that “any exercise” would be associated with a lower BMI, but the relationship between the two variables is not that clear-cut. Though the exercisers appear to have a slightly lower median BMI, their range is much wider. This makes sense considering that people may be prone to exercise more if their BMI is not in a “normal” range.
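
A visual impression from a box plot can be checked against the group medians and IQRs. A sketch using the same BMI formula on a hypothetical stand-in data frame `toy` (invented values):

```r
# Hypothetical stand-in for cdc; the values are invented for illustration.
toy <- data.frame(
  weight  = c(150, 200, 180, 130, 220, 165),
  height  = c(68, 70, 66, 64, 72, 69),
  exerany = c(1, 0, 1, 1, 0, 0)
)

# Same BMI formula as above (pounds and inches).
toy$bmi <- (toy$weight / toy$height^2) * 703

# One value per exercise group.
bmi_median <- tapply(toy$bmi, toy$exerany, median)
bmi_iqr    <- tapply(toy$bmi, toy$exerany, IQR)

bmi_median
bmi_iqr
```

On the real data, replacing `toy` with `cdc` would show whether the difference in medians is large relative to the spread within each group.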

On Your Own

Make a scatterplot of weight versus desired weight. Describe the relationship between these two variables.

plot(cdc$weight, cdc$wtdesire)

As expected, most participants had a desired weight at or below their actual weight.

Let’s consider a new variable: the difference between desired weight (wtdesire) and current weight (weight). Create this new variable by subtracting the two columns in the data frame and assigning them to a new object called wdiff.

cdc$wdiff <- cdc$wtdesire - cdc$weight

What type of data is wdiff? If an observation wdiff is 0, what does this mean about the person’s weight and desired weight? What if wdiff is positive or negative?

wdiff is a discrete numerical variable. If wdiff is 0, desired weight matches the person’s actual weight. If it’s positive (wtdesire > weight), the person would like to gain weight; if negative, the person would like to lose weight.

Describe the distribution of wdiff in terms of its center, shape, and spread, including any plots you use. What does this tell us about how people feel about their current weight?

summary(cdc$wdiff)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -300.00  -21.00  -10.00  -14.59    0.00  500.00

boxplot(cdc$wdiff)

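Most people would like to lose weight: the median difference is -10 lbs and the mean is -14.59 lbs, with the middle half of responses falling between -21 and 0 (an IQR of 21). The mean sitting below the median suggests a skew toward weight loss. There are also some people who would like to gain quite a bit of weight, and the extent of this (up to 500 lbs) is surprising to me. Two outliers in particular may be questionable data points.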
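
A box plot hides much of the shape of a distribution, so a histogram restricted to a sensible range can help. A sketch on invented values, where the cutoffs of -100 and 100 and the bin width of 20 are arbitrary illustrative choices rather than anything from the lab:

```r
# Invented differences for illustration, including two extreme values.
wdiff <- c(-300, -40, -25, -20, -10, -10, -5, 0, 0, 0, 10, 20, 500)

# Keep only the central range so extreme values do not flatten the plot.
central <- wdiff[wdiff >= -100 & wdiff <= 100]

# Compute the histogram without drawing it; plot(h) would display it.
h <- hist(central, breaks = seq(-100, 100, by = 20), plot = FALSE)

# Report how many observations were left out of the picture.
n_excluded <- length(wdiff) - length(central)

h$counts
n_excluded
```

Reporting `n_excluded` alongside the plot keeps the trimming visible to the reader.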

Using numerical summaries and a side-by-side box plot, determine if men tend to view their weight differently than women.

boxplot(cdc$wdiff ~ cdc$gender)

It looks like there are more men who would like to gain weight (perhaps those looking to bulk up muscle mass through exercise?). Both outliers for weight gain were men.
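
The question also asks for numerical summaries, which `tapply()` can produce per group. A sketch on a hypothetical stand-in data frame `toy` with invented values:

```r
# Hypothetical stand-in for cdc; the values are invented for illustration.
toy <- data.frame(
  wdiff  = c(-20, -10, 0, 15, -30, -25, -5, 0),
  gender = factor(c("m", "m", "m", "m", "f", "f", "f", "f"),
                  levels = c("m", "f"))
)

# Five-number summary plus mean, one entry per gender.
by_gender <- tapply(toy$wdiff, toy$gender, summary)

# A single statistic per gender is easier to compare at a glance.
median_by_gender <- tapply(toy$wdiff, toy$gender, median)

by_gender
median_by_gender
```

On the real data this would be `tapply(cdc$wdiff, cdc$gender, summary)`.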

Now it’s time to get creative. Find the mean and standard deviation of weight and determine what proportion of the weights are within one standard deviation of the mean.

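meanWt <- mean(cdc$weight, na.rm = TRUE)
sdWt <- sd(cdc$weight, na.rm = TRUE)

# 1 if the weight lies strictly within one SD of the mean, 0 otherwise
cdc$winOneSD <- as.numeric(cdc$weight > meanWt - sdWt &
                           cdc$weight < meanWt + sdWt)

# Proportion of the 20000 cases in each group
table(cdc$winOneSD)/20000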

## 
##      0      1 
## 0.2924 0.7076

About 71% of participants were within one standard deviation of the mean weight, which is close to the roughly 68% expected for a normal distribution.
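
The same proportion can be obtained more compactly, because the mean of a logical vector is the share of `TRUE` values. A sketch on invented weights (the vector is hypothetical), with the normal-distribution reference value computed by `pnorm()` for comparison:

```r
# Invented weights for illustration only.
weight <- c(120, 135, 150, 160, 165, 170, 180, 195, 210, 260)

m <- mean(weight)
s <- sd(weight)

# TRUE where the weight is strictly within one SD of the mean.
within_one_sd <- abs(weight - m) < s

# Mean of a logical vector = proportion of TRUE values.
prop_within <- mean(within_one_sd)

# Share of a normal distribution within one SD of its mean.
normal_ref <- pnorm(1) - pnorm(-1)

prop_within
normal_ref
```

Comparing `prop_within` with `normal_ref` is only a rough check; a right-skewed variable such as weight need not match the normal figure closely.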

Ryan Weber

February 4, 2018