Lab 1 Andrew Goldberg

How many cases are there in this data set? How many variables? For each variable, identify its data type (e.g. categorical, discrete).

Cases: 20000

Variables: 9

genhlth: categorical, ordinal
exerany: categorical, nominal
hlthplan: categorical, nominal
smoke100: categorical, nominal
height: numerical, discrete
weight: numerical, discrete
wtdesire: numerical, discrete
age: numerical, discrete
gender: categorical, nominal

Create a numerical summary for height and age, and compute the interquartile range for each. Compute the relative frequency distribution for gender and exerany. How many males are in the sample? What proportion of the sample reports being in excellent health?

source("C:/Users/Andrew/Documents/R/win-library/3.1/IS606/labs/Lab1/more/cdc.R")
summary(cdc$height)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   48.00   64.00   67.00   67.18   70.00   93.00

70-64

## [1] 6

summary(cdc$age)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   18.00   31.00   43.00   45.07   57.00   99.00

57-31

## [1] 26
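
The two subtractions above read the quartiles off the summary() output by hand. As a hedged alternative, R's built-in IQR() returns the same difference directly. The sketch below uses a small made-up vector (the values are illustrative, not taken from cdc):

```r
# Hypothetical heights in inches, for illustration only (not cdc data)
heights <- c(60, 62, 64, 66, 68, 70, 72)

# quantile() returns the first and third quartiles
q <- quantile(heights, c(0.25, 0.75))

# IQR() is the third quartile minus the first quartile
iqr_builtin <- IQR(heights)
iqr_manual <- unname(q[2] - q[1])
```

On the lab data the equivalent calls would be IQR(cdc$height) and IQR(cdc$age).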

table(cdc$gender)/20000

## 
##       m       f 
## 0.47845 0.52155

table(cdc$exerany)/20000

## 
##      0      1 
## 0.2543 0.7457

table(cdc$gender)

## 
##     m     f 
##  9569 10431

#males = 9569

table(cdc$genhlth)/20000

## 
## excellent very good      good      fair      poor 
##   0.23285   0.34860   0.28375   0.10095   0.03385

#excellent = .23285
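
The relative frequencies above divide each table by a hard-coded 20000. A minimal sketch of an alternative, assuming the same goal, is prop.table(), which divides the counts by their own total so the sample size never has to be typed. The factor below is made up for illustration and is not cdc data:

```r
# Hypothetical responses, for illustration only (not cdc data)
gender <- factor(c("m", "f", "f", "m", "f"), levels = c("m", "f"))

counts <- table(gender)          # absolute frequencies
rel_freq <- prop.table(counts)   # each count divided by the total count
```

On the lab data this would be prop.table(table(cdc$gender)), and likewise for exerany and genhlth.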

What does the mosaic plot reveal about smoking habits and gender?

More men than women report having smoked at least 100 cigarettes in their lifetime.

Create a new object called under23_and_smoke that contains all observations of respondents under the age of 23 that have smoked 100 cigarettes in their lifetime. Write the command you used to create the new object as the answer to this exercise.

under23_and_smoke <- subset(cdc, age < 23 & smoke100 == 1) 
summary(under23_and_smoke)

##       genhlth       exerany          hlthplan         smoke100
##  excellent:110   Min.   :0.0000   Min.   :0.0000   Min.   :1  
##  very good:244   1st Qu.:1.0000   1st Qu.:0.0000   1st Qu.:1  
##  good     :204   Median :1.0000   Median :1.0000   Median :1  
##  fair     : 53   Mean   :0.8145   Mean   :0.6952   Mean   :1  
##  poor     :  9   3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:1  
##                  Max.   :1.0000   Max.   :1.0000   Max.   :1  
##      height          weight         wtdesire          age        gender 
##  Min.   :59.00   Min.   : 85.0   Min.   : 80.0   Min.   :18.00   m:305  
##  1st Qu.:65.00   1st Qu.:130.0   1st Qu.:125.0   1st Qu.:19.00   f:315  
##  Median :68.00   Median :155.0   Median :150.0   Median :20.00          
##  Mean   :67.92   Mean   :158.9   Mean   :152.2   Mean   :20.22          
##  3rd Qu.:71.00   3rd Qu.:180.0   3rd Qu.:175.0   3rd Qu.:21.00          
##  Max.   :79.00   Max.   :350.0   Max.   :315.0   Max.   :22.00
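
The subset() call above can also be written with logical row indexing, which makes the two conditions explicit. This is only a sketch on a made-up data frame (the name df and its values are hypothetical, not cdc data):

```r
# Hypothetical respondents, for illustration only (not cdc data)
df <- data.frame(age = c(19, 22, 23, 40, 21),
                 smoke100 = c(1, 0, 1, 1, 1))

# subset() evaluates the condition inside the data frame
by_subset <- subset(df, age < 23 & smoke100 == 1)

# Logical indexing: keep rows where both conditions are TRUE
by_index <- df[df$age < 23 & df$smoke100 == 1, ]
```

Both approaches keep the same two rows here (ages 19 and 21). The strict inequality matters: a respondent aged exactly 23 is not "under 23" and is excluded.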

What does this box plot show? Pick another categorical variable from the data set and see how it relates to BMI. List the variable you chose, why you might think it would have a relationship to BMI, and indicate what the figure seems to suggest.

It shows box plots of BMI for each self-reported general health category. The figure suggests that people with higher BMIs are more likely to report worse general health.

I chose exerany (whether the respondent exercised in the past month). Exercise is likely associated with better health and lower BMI, since people who exercise burn more calories and tend to weigh less. As the box plot shows, those who exercised in the past month have a slightly lower median BMI and a narrower IQR, although there are still many outliers.

bmi <- (cdc$weight / cdc$height^2) * 703
boxplot(bmi ~ cdc$exerany)
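
To make the BMI formula and the group comparison concrete, here is a small worked sketch. The weights, heights, and exercise flags are made up for illustration (not cdc data). The formula is the same one used above: weight in pounds divided by height in inches squared, times 703.

```r
# Hypothetical values, for illustration only (not cdc data)
weight  <- c(150, 200, 130, 180)   # pounds
height  <- c(65, 70, 64, 68)       # inches
exerany <- c(1, 0, 1, 0)           # 1 = exercised in past month

# Same formula as above; the first person works out to about 24.96
bmi <- (weight / height^2) * 703

# Median BMI within each exercise group, as a numeric check on the box plot
median_bmi <- tapply(bmi, exerany, median)
```

On the lab data, tapply(bmi, cdc$exerany, median) would give the two medians that the box plot displays as the center lines.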

On Your Own

Make a scatterplot of weight versus desired weight. Describe the relationship between these two variables.

plot(cdc$weight ~ cdc$wtdesire)

The relationship is positive. With current weight on the vertical axis, the trend appears to have a slope above 1, which suggests that people generally want to lose some weight.

Let’s consider a new variable: the difference between desired weight (wtdesire) and current weight (weight). Create this new variable by subtracting the two columns in the data frame and assigning them to a new object called wdiff.

wdiff <- cdc$wtdesire - cdc$weight

What type of data is wdiff? If an observation wdiff is 0, what does this mean about the person’s weight and desired weight. What if wdiff is positive or negative?

wdiff is numerical and discrete.

If an observation is 0, the respondent's desired weight equals their current weight, so they are satisfied with their current weight.

If wdiff is negative, then they want to lose weight; if it is positive, they want to gain weight.
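
The sign rule above can be sketched in code. The three respondents below are made up for illustration (not cdc data), and the label vector goal is a hypothetical helper, not part of the lab:

```r
# Hypothetical respondents, for illustration only (not cdc data)
wtdesire <- c(150, 180, 120)   # desired weight in pounds
weight   <- c(170, 180, 110)   # current weight in pounds

# Desired minus current, as defined in the lab
wdiff <- wtdesire - weight

# Negative = wants to lose, zero = satisfied, positive = wants to gain
goal <- ifelse(wdiff < 0, "lose", ifelse(wdiff > 0, "gain", "keep"))
```

Here wdiff is -20, 0, and 10, so the labels are "lose", "keep", and "gain".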

Describe the distribution of wdiff in terms of its center, shape, and spread, including any plots you use. What does this tell us about how people feel about their current weight?

boxplot(wdiff)

hist(wdiff, breaks = 40)

plot(wdiff)

summary(wdiff)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -300.00  -21.00  -10.00  -14.59    0.00  500.00

The median of wdiff is -10 and the mean is -14.59, so it is typical for people to want to lose around 10 to 15 pounds.

The wdiff histogram is unimodal with a slight left skew, so there are some people who want to lose a lot of weight and few people who want to gain weight.

The middle 50% of values lie between -21 and 0 pounds (IQR = 21), although there are many outliers, mostly people who want to lose weight.
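
One numerical check on the skew: in a left-skewed distribution a few large negative values pull the mean below the median, which matches the summary above (mean -14.59, median -10). A minimal sketch on made-up values (not cdc data):

```r
# Hypothetical weight differences, for illustration only (not cdc data)
wdiff_toy <- c(-100, -20, -10, -5, 0)

center_mean   <- mean(wdiff_toy)     # pulled down by the -100 value
center_median <- median(wdiff_toy)   # unaffected by how extreme -100 is

# TRUE when the mean sits below the median, consistent with a left skew
mean_below_median <- center_mean < center_median
```

This comparison is only a rough indicator; the histogram remains the primary evidence for shape.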

Using numerical summaries and a side-by-side box plot, determine if men tend to view their weight differently than women.

genwdiff <- data.frame(wdiff, cdc$gender)
summary(subset(genwdiff, cdc.gender == "m"))

##      wdiff         cdc.gender
##  Min.   :-300.00   m:9569    
##  1st Qu.: -20.00   f:   0    
##  Median :  -5.00             
##  Mean   : -10.71             
##  3rd Qu.:   0.00             
##  Max.   : 500.00

summary(subset(genwdiff, cdc.gender == "f"))

##      wdiff         cdc.gender
##  Min.   :-300.00   m:    0   
##  1st Qu.: -27.00   f:10431   
##  Median : -10.00             
##  Mean   : -18.15             
##  3rd Qu.:   0.00             
##  Max.   :  83.00

boxplot(genwdiff$wdiff ~ genwdiff$cdc.gender)

Women (median = -10) generally appear to want to lose a few more pounds than men (median = -5), and the middle half of women's responses is more spread out (IQR = 27) than men's (IQR = 20). Interestingly, more men than women appear to want to gain weight.
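
The medians and IQRs quoted above were read from two separate summary() calls. As a hedged alternative, tapply() computes a statistic for each group in one step. The sketch below uses made-up values (not cdc data):

```r
# Hypothetical weight differences and genders, for illustration only
wdiff_toy  <- c(-20, -5, 0, -30, -10, 0)
gender_toy <- factor(c("m", "m", "m", "f", "f", "f"), levels = c("m", "f"))

# One median and one IQR per gender
med_by_gender <- tapply(wdiff_toy, gender_toy, median)
iqr_by_gender <- tapply(wdiff_toy, gender_toy, IQR)
```

On the lab data this would be tapply(wdiff, cdc$gender, median) and tapply(wdiff, cdc$gender, IQR).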

Now it’s time to get creative. Find the mean and standard deviation of weight and determine what proportion of the weights are within one standard deviation of the mean.

avgwt <- mean(cdc$weight)
sdwt <- sd(cdc$weight)
instdev <- subset(cdc, weight < (avgwt + sdwt) & weight > (avgwt - sdwt))
dim(instdev)[1]/dim(cdc)[1]

## [1] 0.7076

mean of weight = 169.7

standard deviation = 40.08

proportion within one standard deviation of the mean = .7076
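
A more compact way to get the same kind of proportion is to take the mean of a logical vector, since TRUE counts as 1 and FALSE as 0. A minimal sketch on made-up weights (not cdc data), using the same strict inequalities as the subset() call above:

```r
# Hypothetical weights in pounds, for illustration only (not cdc data)
weight_toy <- c(120, 150, 160, 170, 180, 250)

avg_toy <- mean(weight_toy)
sd_toy  <- sd(weight_toy)

# TRUE where the weight is strictly within one standard deviation of the mean
within_one_sd <- abs(weight_toy - avg_toy) < sd_toy

# The mean of a logical vector is the proportion of TRUE values
prop_within <- mean(within_one_sd)
```

In this toy vector the two extreme weights (120 and 250) fall outside one standard deviation, so the proportion is 4 out of 6. On the lab data the equivalent is mean(abs(cdc$weight - avgwt) < sdwt).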