Lab 2 - Introduction to data

Name: James Tian

Section: 3

Date: 9/10/13

Exercises

Load data:

source("http://www.openintro.org/stat/data/cdc.R")

Exercise 1:

There are 20,000 cases, each with 9 variables. Five variables are categorical: genhlth (ordinal), exerany, hlthplan, smoke100, and gender. Four are continuous numerical variables: height, weight, wtdesire, and age.
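
One way to check this classification is to look at how R stores each column. The sketch below uses a small made-up data frame (`toy`, which is hypothetical and not the real `cdc` data) with the same kinds of columns; all values are invented for illustration.

```r
# Hypothetical miniature data frame; values are invented, not taken from cdc
toy <- data.frame(
  genhlth  = factor(c("good", "excellent", "fair"),
                    levels = c("excellent", "very good", "good", "fair", "poor")),
  smoke100 = c(0, 1, 1),   # 0/1 indicator: numeric storage, categorical meaning
  height   = c(70, 64, 61),
  age      = c(77, 33, 49)
)

# Storage class of each column
col_classes <- sapply(toy, class)
print(col_classes)

# Number of distinct values per column; an indicator has only two
distinct_values <- sapply(toy, function(col) length(unique(col)))
print(distinct_values)
```

The storage class alone does not settle the question: an indicator coded 0/1 is stored as a number but is categorical in meaning, so the codebook still has to be read.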

Exercise 2:

# enter code for Ex2 below
summary(cdc$height)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    48.0    64.0    67.0    67.2    70.0    93.0
summary(cdc$height)[5] - summary(cdc$height)[2]
## 3rd Qu. 
##       6

summary(cdc$age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    18.0    31.0    43.0    45.1    57.0    99.0
summary(cdc$age)[5] - summary(cdc$age)[2]
## 3rd Qu. 
##      26
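
The interquartile range can also be computed with `IQR()`, which avoids indexing into the `summary()` output by position. A minimal sketch on a made-up vector (`heights` is hypothetical, not the `cdc` column):

```r
# Hypothetical heights in inches, invented for illustration
heights <- c(60, 62, 64, 66, 67, 68, 70, 72, 75)

# Third quartile minus first quartile, computed by hand
q <- quantile(heights, c(0.25, 0.75))
iqr_by_hand <- unname(q[2] - q[1])

# The same quantity in one call
iqr_builtin <- IQR(heights)

print(iqr_by_hand)
print(iqr_builtin)
```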

Exercise 3:

There are 9,569 males in the sample, and 23.285% of respondents report being in excellent health.

# enter code for Ex3 below
table(cdc$gender)/20000
## 
##      m      f 
## 0.4784 0.5215
table(cdc$genhlth)/20000
## 
## excellent very good      good      fair      poor 
##   0.23285   0.34860   0.28375   0.10095   0.03385
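
Dividing by 20000 works because the sample size is known. As a sketch of an alternative that does not hard-code the sample size, `table()` gives the raw counts and `prop.table()` divides them by their own total. The `gender` vector here is made up for illustration:

```r
# Hypothetical responses, invented for illustration
gender <- factor(c("m", "f", "f", "m", "f"), levels = c("m", "f"))

counts <- table(gender)       # raw counts per level
props  <- prop.table(counts)  # counts divided by their total

print(counts)
print(props)
```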

Exercise 4:

# code for Ex4 already given in lab
mosaicplot(table(cdc$gender, cdc$smoke100))

[Figure: mosaic plot of gender vs. smoke100]

The plot suggests that gender and smoke100 are not independent. The proportion of respondents who have smoked at least 100 cigarettes differs between males and females.
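
If the two variables were independent, the share of respondents with smoke100 equal to 1 would be about the same for each gender. Row proportions put numbers on what a mosaic plot shows. The sketch below uses invented vectors (`gender` and `smoke100` here are hypothetical, not the `cdc` columns):

```r
# Hypothetical responses, invented for illustration
gender   <- c("m", "m", "m", "m", "f", "f", "f", "f", "f", "f")
smoke100 <- c(1, 1, 1, 0, 1, 1, 0, 0, 0, 0)

# Two-way table of counts: rows are gender, columns are smoke100
tab <- table(gender, smoke100)

# Proportions within each row (each gender sums to 1)
row_props <- prop.table(tab, margin = 1)
print(row_props)

# Share with smoke100 == 1 in each group
share_m <- unname(row_props["m", "1"])
share_f <- unname(row_props["f", "1"])
```

Clearly different values of `share_m` and `share_f` point toward an association; similar values are consistent with independence.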

Exercise 5:

# enter code for Ex5 below
# use the full column name smoke100; cdc$smoke only works through partial matching
under23_and_smoke <- subset(cdc, age < 23 & smoke100 == 1)
nrow(under23_and_smoke)
## [1] 620

There are 620 respondents who are under 23 and have smoked at least 100 cigarettes in their lifetime.
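
The count can be cross-checked without building a subset, since summing a logical vector counts its TRUE values. A minimal sketch on a made-up data frame (`toy` is hypothetical):

```r
# Hypothetical data frame, invented for illustration
toy <- data.frame(
  age      = c(19, 22, 23, 40, 21),
  smoke100 = c(1, 0, 1, 1, 1)
)

# Approach 1: subset the rows, then count them
young_smokers <- subset(toy, age < 23 & smoke100 == 1)
n_subset <- nrow(young_smokers)

# Approach 2: count TRUE values of the same condition directly
n_sum <- sum(toy$age < 23 & toy$smoke100 == 1)

print(n_subset)
print(n_sum)
```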

Exercise 6:

# code for bmi vs. genhlth already given in lab
bmi = (cdc$weight/cdc$height^2) * 703
boxplot(bmi ~ cdc$genhlth, main = "BMI vs. general health")

[Figure: boxplots of BMI by general health]

The boxplots show a trend: BMI tends to be higher among respondents who report poorer general health.
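
The factor 703 converts pounds per square inch into the usual BMI units of kilograms per square meter. A small worked example, using a hypothetical helper `bmi_imperial()` and made-up inputs:

```r
# Hypothetical helper: BMI from weight in pounds and height in inches
bmi_imperial <- function(weight_lb, height_in) {
  (weight_lb / height_in^2) * 703
}

# Where 703 comes from: kilograms per pound divided by (meters per inch)^2
conversion <- 0.45359237 / 0.0254^2
print(conversion)

# Made-up person: 150 pounds, 65 inches tall
bmi_value <- bmi_imperial(weight_lb = 150, height_in = 65)
print(round(bmi_value, 1))
```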

# enter code for Ex6 below (boxplot for bmi vs. your chosen variable)
boxplot(bmi ~ cdc$gender, main = "BMI vs. gender")

[Figure: boxplots of BMI by gender]

tapply(bmi, cdc$gender, summary)
## $m
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    12.4    23.7    26.3    26.9    29.2    64.2 
## 
## $f
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    12.7    21.8    24.6    25.7    28.3    73.1

Men and women are built differently physiologically, so they may have different BMIs. The median and both quartiles of BMI are lower for women, which shows a difference between the two genders. However, the difference is small and is not clearly visible in the boxplot.

Exercise 7:

# enter code for Ex7 below
cdcnew <- subset(cdc, cdc$wtdesire < 500)
plot(cdcnew$weight, cdcnew$wtdesire)

[Figure: scatterplot of wtdesire against weight]

These variables are strongly positively associated, in the sense that people's desired weights do not differ greatly from their current weights. In general, people's desired weights are lower than their current weights.

Exercise 8:

No text needed for this question, just code.

# enter code for Ex8 below
wdiff = cdcnew$weight - cdcnew$wtdesire

Exercise 9:

wdiff is a numeric vector of integers, with one value per respondent. If an observation in wdiff is 0, the person does not want to change his or her weight. If the observation is positive, the person wishes to lose weight. Likewise, if the observation is negative, the person wishes to gain weight.
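
This sign rule can be sketched in code with `sign()`, which returns -1, 0, or 1. The vectors below are made up, and `wdiff_toy` and `goal` are hypothetical names used only for this illustration:

```r
# Hypothetical current and desired weights in pounds, invented for illustration
weight   <- c(180, 150, 120, 200)
wtdesire <- c(160, 150, 130, 200)

wdiff_toy <- weight - wtdesire

# sign() gives -1, 0, or 1; adding 2 turns that into index 1, 2, or 3
labels <- c("gain", "maintain", "lose")
goal <- labels[sign(wdiff_toy) + 2]

print(data.frame(wdiff_toy, goal))
```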

Exercise 10:

# enter code for Ex10 below - numerical summary
summary(wdiff)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  -110.0     0.0    10.0    14.6    21.0   300.0
# enter code for Ex10 below - plot(s)
hist(wdiff)

[Figure: histogram of wdiff]

The distribution of weight differences is right skewed: the mean (14.6) exceeds the median (10.0), and there are relatively few extreme cases where people want to lose large amounts of weight. Most values are zero or positive (the first quartile is 0 and the median is 10), so people generally want to lose weight rather than gain it. The middle half of the differences lies between 0 and 21 pounds, so most respondents are fairly close to their desired weight.
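
A quick numerical check for right skew is that the long upper tail pulls the mean above the median. A minimal sketch on a made-up vector (`wdiff_toy` is hypothetical, not the real `wdiff`):

```r
# Hypothetical weight differences in pounds, invented for illustration
wdiff_toy <- c(-10, 0, 0, 5, 10, 10, 15, 20, 30, 120)

toy_mean   <- mean(wdiff_toy)
toy_median <- median(wdiff_toy)

# Share of values above zero (respondents wanting to lose weight)
share_positive <- mean(wdiff_toy > 0)

print(c(mean = toy_mean, median = toy_median, share_positive = share_positive))
```

Here the single large value of 120 raises the mean well above the median, the same pattern seen in the summary of wdiff.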

Exercise 11:

# enter code for Ex11 below - numerical summary
tapply(wdiff, cdcnew$gender, summary)
## $m
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  -110.0     0.0     5.0    10.8    20.0   300.0 
## 
## $f
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   -83.0     0.0    10.0    18.2    27.0   300.0
# enter code for Ex11 below - side-by-side box plot
boxplot(wdiff ~ cdcnew$gender)

[Figure: side-by-side boxplots of wdiff by gender]

The median, mean, and third quartile of wdiff are all larger for females than for males, indicating that females in general want to lose more weight.

Exercise 12:

Note: these statistics are computed on cdcnew, the data with the wtdesire outliers (500 or more) removed.

# enter code for Ex12 below
summary(cdcnew$weight)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##      68     140     165     170     190     500
sd(cdcnew$weight)
## [1] 40.07
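# bounds are mean -/+ one sd: 169.7 - 40.07 = 129.63 and 169.7 + 40.07 = 209.77
onesd <- subset(cdcnew$weight, cdcnew$weight < 209.77 & cdcnew$weight > 129.63)
length(onesd)  # onesd is a vector, so use length(); dim() returns NULL
## [1] 14151
14151/20000
## [1] 0.7076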

The mean weight is 169.7 pounds and the standard deviation is 40.07 pounds. There are 14,151 people within one standard deviation of the mean, or about 70.755% of the 20,000 respondents.
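
The same calculation can be sketched without typing the bounds by hand, by comparing each value's distance from the mean with the standard deviation. The vector `weights` below is made up for illustration:

```r
# Hypothetical weights in pounds, invented for illustration
weights <- c(120, 140, 150, 160, 165, 170, 180, 190, 210, 260)

m <- mean(weights)
s <- sd(weights)

# TRUE where a weight lies strictly within one standard deviation of the mean
within_one_sd <- abs(weights - m) < s

count_within <- sum(within_one_sd)   # number of TRUE values
prop_within  <- mean(within_one_sd)  # proportion of TRUE values

print(count_within)
print(prop_within)
```

Taking the mean of the logical vector divides by the length of the data actually used, so the proportion stays correct if rows have been filtered out beforehand.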