DATA606 - Lab1

Load the CDC dataset

source("more/cdc.R")

How many cases are there in this data set? How many variables? For each variable, identify its data type (e.g. categorical, discrete).

dim(cdc)

## [1] 20000     9

Each row is a case, and each column is a variable, so there are 20000 cases and 9 variables.

We first get the variable names:

names(cdc)

## [1] "genhlth"  "exerany"  "hlthplan" "smoke100" "height"   "weight"  
## [7] "wtdesire" "age"      "gender"

genhlth = Categorical (ordinal). It’s not numeric, and there are different levels.
exerany = Categorical (nominal). It’s a binary question, and there aren’t levels, i.e. 1 isn’t better/higher than 0.
hlthplan = Categorical (nominal). ibid.
smoke100 = Categorical (nominal). ibid.
height = Numerical (discrete). Since it records it inches, and assuming there can’t be fractions, it gives an exact number.
weight = Numerical (discrete). ibid.
wtdesire = Numerical (discrete). ibid.
age = Numerical (discrete). ibid.
gender = Categorical (nominal). Same reason as the other nominals.

Create a numerical summary for height and age, and compute the interquartile range for each. Compute the relative frequency distribution for gender and exerany. How many males are in the sample? What proportion of the sample reports being in excellent health?

summary(cdc$height)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   48.00   64.00   67.00   67.18   70.00   93.00

The interquartile range for height is

70 - 64

## [1] 6

summary(cdc$age)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   18.00   31.00   43.00   45.07   57.00   99.00

The interquartile range for age is

57 - 31

## [1] 26

The relative frequency for gender is

table(cdc$gender)/20000

## 
##       m       f 
## 0.47845 0.52155

The relative frequency for exerany is

table(cdc$exerany)/20000

## 
##      0      1 
## 0.2543 0.7457

The number of males in the sample is

table(cdc$gender)

## 
##     m     f 
##  9569 10431

table(cdc$genhlth)

## 
## excellent very good      good      fair      poor 
##      4657      6972      5675      2019       677

Since excellent is the first column, we can run the following command to get the proportion

table(cdc$genhlth)[1]/20000

## excellent 
##   0.23285

~23.3% of the sample reported being in excellent overall health.

What does the mosaic plot reveal about smoking habits and gender?

mosaicplot(table(cdc$gender, cdc$smoke100))

We can see that more males than females reported having smoked 100 cigarettes in their lifetime.

Create a new object called under23_and_smoke that contains all observations of respondents under the age of 23 that have smoked 100 cigarettes in their lifetime. Write the command you used to create the new object as the answer to this exercise.

under23_and_smoke <- subset(cdc, age < 23 & smoke100 == "1")
nrow(under23_and_smoke)

## [1] 620

There are 620 people that match these criteria.

What does this box plot show? Pick another categorical variable from the data set and see how it relates to BMI. List the variable you chose, why you might think it would have a relationship to BMI, and indicate what the figure seems to suggest.

bmi <- (cdc$weight / cdc$height^2) * 703
boxplot(bmi ~ cdc$genhlth)

We can see that bmi is inversely correlated with reported health condition.

I would imagine that bmi would also be negatively correlated with exerany. To check, we can enter

boxplot(bmi ~ cdc$exerany)

Aside from a few outliers, we can see that this is indeed the case.

On Your Own

Make a scatterplot of weight versus desired weight. Describe the relationship between these two variables.

We can plot this using

plot(cdc$weight, cdc$wtdesire)

There is a positive correlation between weight and desired weight. That is, the desired weight of those with a higher weight is higher than those with a lower weight.

Let’s consider a new variable: the difference between desired weight (wtdesire) and current weight (weight). Create this new variable by subtracting the two columns in the data frame and assigning them to a new object called wdiff.

wdiff <- (cdc$wtdesire - cdc$weight)

What type of data is wdiff? If an observation wdiff is 0, what does this mean about the person’s weight and desired weight. What if wdiff is positive or negative?

Assuming that we only record integers (round to the nearest lb/kg), then wdiff is a Numerical (discrete) variable. If wdiff is 0, then the person is at their desired weight. If wdiff is x < 0, then the person wants to lose x lbs/kgs. If wdiff is x > 0, then the person wants to gain x lbs/kgs.

Describe the distribution of wdiff in terms of its center, shape, and spread, including any plots you use. What does this tell us about how people feel about their current weight?

First, let’s get a summary

summary(wdiff)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -300.00  -21.00  -10.00  -14.59    0.00  500.00

And a few graphs

boxplot(wdiff)

hist(wdiff)

hist(wdiff, breaks = 100)

The vast majority are close to their ideal weight, with a few extreme cases. The average amount of weight people want to lose is ~14.5 lbs/kgs.

Using numerical summaries and a side-by-side box plot, determine if men tend to view their weight differently than women.

We’ll start by isolating each gender, finding their respective wdiff, listing a summary of each, and, finally, plotting wdiff respective to each gender.

# I looked up how to better specify what I want from the subset function (?subset), and was able to perform the following.
mwdiff <- subset(cdc, gender == "m", select = c(wtdesire, weight))
summary(mwdiff)

##     wtdesire         weight     
##  Min.   : 77.0   Min.   : 78.0  
##  1st Qu.:160.0   1st Qu.:165.0  
##  Median :175.0   Median :185.0  
##  Mean   :178.6   Mean   :189.3  
##  3rd Qu.:190.0   3rd Qu.:210.0  
##  Max.   :680.0   Max.   :500.0

fwdiff <- subset(cdc, gender == "f", select = c(wtdesire, weight))
summary(fwdiff)

##     wtdesire         weight     
##  Min.   : 68.0   Min.   : 68.0  
##  1st Qu.:120.0   1st Qu.:128.0  
##  Median :130.0   Median :145.0  
##  Mean   :133.5   Mean   :151.7  
##  3rd Qu.:145.0   3rd Qu.:170.0  
##  Max.   :350.0   Max.   :495.0

boxplot(wdiff ~ cdc$gender, ylim = c(-75,75))

From the tables and graphs, it’s evident that men are closer to their ideal weight, while women prefer to lose more weight.

Note: I limited the y-axis so that the plot will be zoomed in, and to remove the influence of the few outliers on each side.

Now it’s time to get creative. Find the mean and standard deviation of weight and determine what proportion of the weights are within one standard deviation of the mean.

# Get an overall summary, including the mean
summary(cdc$weight)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    68.0   140.0   165.0   169.7   190.0   500.0

# Function to get the sd
sd(cdc$weight)

## [1] 40.08097

cweight = cdc$weight
# I found the inner function to determine data within 1 sd on Stack Overflow. I repurposed it for my own needs, and then used the NROW function to determine the number of pieces of data fall within that 1 sd.
withinOne <- NROW(cweight[abs(cweight - mean(cweight)) <= sd(cweight)])
# To find the proportion, simply divide by the sample size, i.e. 20000
withinOne / 20000

## [1] 0.7076

DATA606 - Lab1

Joshua Sturm

August 29, 2017

Load the CDC dataset

On Your Own