Introduction to data

source("more/cdc.R")

names(cdc)

## [1] "genhlth"  "exerany"  "hlthplan" "smoke100" "height"   "weight"  
## [7] "wtdesire" "age"      "gender"

How many cases are there in this data set? How many variables? For each variable, identify its data type (e.g. categorical, discrete).

Number of cases

nrow(cdc)

## [1] 20000

Number of Variables

ncol(cdc)

## [1] 9

Variable	Data Type
genhlth	categorical ordinal
exerany	categorical
hlthplan	categorical
smoke100	categorical
height	numerical continuous
weight	numerical continuous
wtdesire	numerical discrete. One could argue it is continuous, but hardly anyone would say they want to weigh 180.25 lbs.
age	numerical discrete, or continuous if months are taken into account
gender	categorical
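
The storage type R reports for each column is a useful starting point for the classification above. The following is a minimal sketch on a hypothetical toy data frame (the values are made up; only the column names mirror cdc), not the lab's own method:

```r
# Toy data frame with hypothetical values; column names mirror a few cdc columns.
toy <- data.frame(
  genhlth = factor(c("good", "excellent", "fair"),
                   levels = c("excellent", "very good", "good", "fair", "poor")),
  exerany = c(0, 1, 1),
  height  = c(70, 64, 67)
)

# class() of each column: factors are categorical, numeric columns are stored as numbers.
col_types <- sapply(toy, class)
print(col_types)

# str() shows the storage type and the first few values of every column at once.
str(toy)
```

Note that the storage type does not settle the statistical type: exerany is stored as a number here, but its 0/1 values are category codes, so it is still categorical.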

Create a numerical summary for height and age, and compute the interquartile range for each. Compute the relative frequency distribution for gender and exerany. How many males are in the sample? What proportion of the sample reports being in excellent health?

Summary and IQR : Height

summary(cdc$height)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   48.00   64.00   67.00   67.18   70.00   93.00

IQR(cdc$height)

## [1] 6

Summary and IQR : Age

summary(cdc$age)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   18.00   31.00   43.00   45.07   57.00   99.00

IQR(cdc$age)

## [1] 26

Relative Frequency Distribution for Gender

table(cdc$gender)/nrow(cdc)

## 
##       m       f 
## 0.47845 0.52155

Relative Frequency Distribution for exerany

table(cdc$exerany)/nrow(cdc)

## 
##      0      1 
## 0.2543 0.7457
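
An equivalent way to get relative frequencies is `prop.table()`, which divides each count by the table total. A minimal sketch on a hypothetical toy vector standing in for cdc$gender:

```r
# Hypothetical toy vector; not the cdc data.
gender <- factor(c("m", "f", "f", "m", "f"), levels = c("m", "f"))

counts   <- table(gender)         # absolute frequencies
rel_freq <- prop.table(counts)    # each count divided by the sum of all counts

print(rel_freq)
```

Unlike dividing by nrow(cdc), this divides by the number of non-missing values, because table() drops NA by default. The two approaches agree when the column has no missing values.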

Number of Males

nrow(subset(cdc, gender == "m", select = c('gender')))

## [1] 9569

summary(cdc$gender)

##     m     f 
##  9569 10431

Excellent Health

nrow(subset(cdc, genhlth == "excellent", select = c('genhlth')))/(nrow(cdc))

## [1] 0.23285

table(cdc$genhlth)/nrow(cdc)

## 
## excellent very good      good      fair      poor 
##   0.23285   0.34860   0.28375   0.10095   0.03385
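
Counts and proportions like the ones above can also be read off a logical vector directly, since R treats TRUE as 1 and FALSE as 0. A small sketch with hypothetical toy vectors standing in for cdc$gender and cdc$genhlth:

```r
# Hypothetical toy vectors; not the cdc data.
gender  <- c("m", "f", "f", "m", "f")
genhlth <- c("good", "excellent", "fair", "excellent", "poor")

n_males     <- sum(gender == "m")            # number of TRUE values, i.e. a count
p_excellent <- mean(genhlth == "excellent")  # share of TRUE values, i.e. a proportion

print(n_males)
print(p_excellent)
```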

What does the mosaic plot reveal about smoking habits and gender?

A slightly larger proportion of males than females report having smoked at least 100 cigarettes in their lifetime.

Create a new object called under23_and_smoke that contains all observations of respondents under the age of 23 that have smoked 100 cigarettes in their lifetime. Write the command you used to create the new object as the answer to this exercise.

under23_and_smoke <- subset(cdc, smoke100 == 1 & age < 23)
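
The same filter can be written with bracket indexing instead of subset(). The sketch below uses a hypothetical four-row data frame to show that both forms keep the same rows:

```r
# Hypothetical toy data; column names mirror cdc.
toy <- data.frame(
  smoke100 = c(1, 0, 1, 1),
  age      = c(21, 19, 45, 22)
)

# subset() evaluates the condition inside the data frame.
by_subset <- subset(toy, smoke100 == 1 & age < 23)

# Bracket indexing needs the toy$ prefix and a trailing comma to keep all columns.
by_index <- toy[toy$smoke100 == 1 & toy$age < 23, ]

print(by_subset)
print(by_index)
```

One difference to keep in mind: if the condition evaluates to NA for some row, subset() drops that row, while bracket indexing returns a row of NAs for it.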

What does this box plot show? Pick another categorical variable from the data set and see how it relates to BMI. List the variable you chose, why you might think it would have a relationship to BMI, and indicate what the figure seems to suggest.

bmi <- (cdc$weight / cdc$height^2) * 703
boxplot(bmi ~ cdc$genhlth)

The box plot shows that BMI tends to increase as general health (genhlth) gets worse. It also shows that the interquartile range (the middle 50% of values) gets wider as general health gets worse.

bmi <- (cdc$weight / cdc$height^2) * 703
boxplot(bmi ~ cdc$exerany)

The box plot shows that people who exercise have a lower BMI and a narrower BMI range. Outliers in the group that exercises may include people like Arnold Schwarzenegger, since BMI does not take muscle mass into account.
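
The numbers behind a box plot can be computed directly. In the BMI formula, the factor 703 converts pounds per square inch into kilograms per square meter. The sketch below uses hypothetical respondents and computes the per-group median (the line inside each box) and IQR (the height of each box):

```r
# Hypothetical respondents; weight in pounds, height in inches.
toy <- data.frame(
  weight  = c(150, 200, 180, 130),
  height  = c(65, 70, 72, 64),
  exerany = c(1, 0, 1, 0)
)

# Same BMI formula as used above.
toy$bmi <- toy$weight / toy$height^2 * 703

# Median and IQR of BMI within each exercise group.
median_by_group <- tapply(toy$bmi, toy$exerany, median)
iqr_by_group    <- tapply(toy$bmi, toy$exerany, IQR)

print(median_by_group)
print(iqr_by_group)
```

Running the same tapply() calls on bmi and cdc$exerany would give the values to quote alongside the plot.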

On Your Own

Make a scatterplot of weight versus desired weight. Describe the relationship between these two variables.
```
plot(cdc$wtdesire ~ cdc$weight)
```
```
wdiff <- cdc$wtdesire - cdc$weight
plot(wdiff ~ cdc$weight)
```
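Weight and desired weight are positively correlated: the more a person weighs, the higher the desired weight tends to be. The gap between the two grows with weight, though; the plot of weight difference vs actual weight shows that heavier people tend to want to lose more weight.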
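
The direction of each relationship can be checked numerically with cor(). The values below are hypothetical and constructed only to illustrate the calls; they say nothing about the cdc data itself:

```r
# Hypothetical weights and desired weights in pounds.
weight   <- c(120, 150, 180, 220, 260)
wtdesire <- c(120, 145, 170, 190, 210)
wdiff    <- wtdesire - weight

# Sign of the correlation gives the direction of each relationship.
r_desire <- cor(weight, wtdesire)  # positive here: desired weight rises with weight
r_diff   <- cor(weight, wdiff)     # negative here: the desired change falls with weight

print(r_desire)
print(r_diff)
```

Replacing the toy vectors with cdc$weight and cdc$wtdesire would give the correlations for the actual sample.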
Let’s consider a new variable: the difference between desired weight (wtdesire) and current weight (weight). Create this new variable by subtracting the two columns in the data frame and assigning them to a new object called wdiff.
```
wdiff <- cdc$wtdesire - cdc$weight
head(wdiff)
```
```
## [1]   0 -10   0  -8 -20   0
```
```
length(wdiff)
```
```
## [1] 20000
```
```
str(wdiff)
```
```
##  int [1:20000] 0 -10 0 -8 -20 0 -9 -10 -20 -10 ...
```
```
typeof(wdiff)
```
```
## [1] "integer"
```
What type of data is wdiff? If an observation wdiff is 0, what does this mean about the person’s weight and desired weight? What if wdiff is positive or negative?
- Integer (numerical, discrete).
- wdiff = 0: the person is satisfied with their current weight.
- wdiff positive: the person would like to gain weight.
- wdiff negative: the person would like to lose weight.
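The three cases above can be turned into a labelled variable with sign(). This is a sketch on hypothetical values; the labels "lose", "keep" and "gain" are made up for illustration:

```r
# Hypothetical weight differences (desired minus current), in pounds.
wdiff <- c(0, -10, 0, -8, -20, 5)

# sign() returns -1, 0 or 1; factor() attaches a readable label to each.
goal <- factor(sign(wdiff),
               levels = c(-1, 0, 1),
               labels = c("lose", "keep", "gain"))

print(table(goal))
```

Applied to the full wdiff vector, table(goal) would count how many respondents fall into each case.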
Describe the distribution of wdiff in terms of its center, shape, and spread, including any plots you use. What does this tell us about how people feel about their current weight?
```
wdiff <- cdc$wtdesire - cdc$weight
min(wdiff)
```
```
## [1] -300
```
```
max(wdiff)
```
```
## [1] 500
```
```
hist(wdiff, breaks = 100, freq = FALSE)
```
```
summary(wdiff)
```
```
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -300.00  -21.00  -10.00  -14.59    0.00  500.00
```
The histogram is skewed to the left and centered just below zero: the median is -10 and the mean is -14.59, with the mean probably pulled lower by a few outliers. The middle 50% of values lie between -21 and 0 (an IQR of 21), although the full range runs from -300 to 500. This tells us that most people would like to lose weight and a small percentage would like to gain weight.
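
Center, spread and a rough check of skew can each be computed with one call. The sketch below uses ten hypothetical values rather than the cdc data; a mean below the median is a common, though not conclusive, sign of left skew:

```r
# Hypothetical weight differences, in pounds.
wdiff <- c(-60, -25, -20, -10, -10, -5, 0, 0, 0, 10)

center <- median(wdiff)  # robust measure of center
spread <- IQR(wdiff)     # width of the middle 50% of values
avg    <- mean(wdiff)    # sensitive to extreme values such as -60

# TRUE when the mean is pulled below the median.
mean_below_median <- avg < center

print(c(median = center, IQR = spread, mean = avg))
print(mean_below_median)
```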

Using numerical summaries and a side-by-side box plot, determine if men tend to view their weight differently than women.

men <- subset(cdc, gender == 'm')
women <- subset(cdc, gender == 'f')
men_wdiff <- men$wtdesire - men$weight
women_wdiff <- women$wtdesire - women$weight

summary(men_wdiff)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -300.00  -20.00   -5.00  -10.71    0.00  500.00

summary(women_wdiff)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -300.00  -27.00  -10.00  -18.15    0.00   83.00

wdiff <- cdc$wtdesire - cdc$weight
boxplot(wdiff ~ cdc$gender)

boxplot(cdc$weight ~ cdc$gender)

Women tend to want to lose more weight than men: the median wdiff is -10 for women versus -5 for men, and the mean is -18.15 versus -10.71.

Now it’s time to get creative. Find the mean and standard deviation of weight and determine what proportion of the weights are within one standard deviation of the mean.

sd(cdc$weight)

## [1] 40.08097

mean(cdc$weight)

## [1] 169.683

one_std_of_mean_after <- mean(cdc$weight) + sd(cdc$weight)
one_std_of_mean_before <- mean(cdc$weight) - sd(cdc$weight)
one_std_of_mean_after

## [1] 209.7639

one_std_of_mean_before

## [1] 129.602

with_in_one_sd <- subset(cdc, weight >= one_std_of_mean_before & weight <= one_std_of_mean_after)
dim(with_in_one_sd)

## [1] 14152     9

# Proportion of weights within one standard deviation of the mean
nrow(with_in_one_sd)/nrow(cdc)

## [1] 0.7076
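
The same proportion can be computed without building an intermediate data frame, by taking the mean of a logical vector. A minimal sketch on ten hypothetical weights:

```r
# Hypothetical weights in pounds; not the cdc data.
weight <- c(120, 140, 150, 160, 165, 170, 180, 190, 210, 260)

m <- mean(weight)
s <- sd(weight)

# TRUE where a weight lies within one standard deviation of the mean.
within_one_sd <- abs(weight - m) <= s

# Share of TRUE values is the proportion within one standard deviation.
prop_within <- mean(within_one_sd)
print(prop_within)
```

For comparison, a normal distribution has about 68% of its values within one standard deviation of the mean, so the 0.7076 found above is close to what a roughly bell-shaped distribution would give.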

This is a product of OpenIntro that is released under a Creative Commons Attribution-ShareAlike 3.0 Unported. This lab was adapted for OpenIntro by Andrew Bray and Mine Çetinkaya-Rundel from a lab written by Mark Hansen of UCLA Statistics.