DATA 606 Lab 1 Intro to data

1. How many cases are there in this data set? How many variables? For each variable, identify its data type (e.g. categorical, discrete).

str(cdc)

## 'data.frame':    20000 obs. of  9 variables:
##  $ genhlth : Factor w/ 5 levels "excellent","very good",..: 3 3 3 3 2 2 2 2 3 3 ...
##  $ exerany : num  0 0 1 1 0 1 1 0 0 1 ...
##  $ hlthplan: num  1 1 1 1 1 1 1 1 1 1 ...
##  $ smoke100: num  0 1 1 0 0 0 0 0 1 0 ...
##  $ height  : num  70 64 60 66 61 64 71 67 65 70 ...
##  $ weight  : int  175 125 105 132 150 114 194 170 150 180 ...
##  $ wtdesire: int  175 115 105 124 130 114 185 160 130 170 ...
##  $ age     : int  77 33 49 42 55 55 31 45 27 44 ...
##  $ gender  : Factor w/ 2 levels "m","f": 1 2 2 2 2 2 1 1 2 1 ...

There are 20,000 observations (cases) and 9 variables.
Categorical variables
- genhlth, exerany, hlthplan, smoke100, gender (exerany, hlthplan, and smoke100 are stored as numeric 0/1 indicators)
Discrete variables
- height, weight, wtdesire, age

2. Create a numerical summary for height and age, and compute the interquartile range for each. Compute the relative frequency distribution for gender and exerany. How many males are in the sample? What proportion of the sample reports being in excellent health?

# Height numerical summary
summary(cdc$height)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   48.00   64.00   67.00   67.18   70.00   93.00

# Height Interquartile range
IQR(cdc$height)

## [1] 6

# Age numerical summary
summary(cdc$age)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   18.00   31.00   43.00   45.07   57.00   99.00

# Age Interquartile range
IQR(cdc$age)

## [1] 26
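The IQR values above follow directly from the quartiles in each summary: 70 - 64 = 6 for height and 57 - 31 = 26 for age. A minimal sketch of the same calculation, assuming a small made-up vector (`heights` is hypothetical and not taken from the cdc data):

```r
# Hypothetical toy heights in inches (not taken from the cdc data)
heights <- c(60, 62, 64, 66, 67, 68, 70, 71, 74)

# First and third quartiles, using R's default quantile method (type 7)
q <- quantile(heights, probs = c(0.25, 0.75), names = FALSE)

# The IQR is the distance between the third and first quartiles
iqr_manual <- q[2] - q[1]

# IQR() performs the same calculation in one call
iqr_builtin <- IQR(heights)

print(c(manual = iqr_manual, builtin = iqr_builtin))
```

Both values should agree, since `IQR()` is defined as the difference between the 75th and 25th percentiles.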

# Relative frequency distribution for gender
table(cdc$gender)/nrow(cdc)

## 
##       m       f 
## 0.47845 0.52155

# Relative frequency distribution for exerany
table(cdc$exerany)/nrow(cdc)

## 
##      0      1 
## 0.2543 0.7457

# Number of males in the sample: 9569
table(cdc$gender)

## 
##     m     f 
##  9569 10431

# Proportion of the sample reporting excellent health: 0.23285
table(cdc$genhlth)/nrow(cdc)

## 
## excellent very good      good      fair      poor 
##   0.23285   0.34860   0.28375   0.10095   0.03385
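The relative frequencies above divide each count by the number of rows. A minimal sketch showing that `prop.table()` gives the same proportions, assuming a small made-up factor (`gender` here is hypothetical toy data, not the cdc column):

```r
# Hypothetical toy responses (not taken from the cdc data)
gender <- factor(c("m", "f", "f", "m", "f"), levels = c("m", "f"))

# Absolute frequencies
counts <- table(gender)

# Relative frequencies, computed by hand and with prop.table()
rel_manual <- counts / length(gender)
rel_builtin <- prop.table(counts)

print(rel_manual)
print(rel_builtin)
```

`prop.table()` is convenient when the table has already been built, because it does not need the original row count.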

3. What does the mosaic plot reveal about smoking habits and gender?

# The mosaic plot shows that a larger proportion of males than females have smoked at least 100 cigarettes.
mosaicplot(table(cdc$gender, cdc$smoke100))

4. Create a new object called under23_and_smoke that contains all observations of respondents under the age of 23 that have smoked 100 cigarettes in their lifetime. Write the command you used to create the new object as the answer to this exercise.

under23_and_smoke <- subset(cdc, age < 23 & smoke100 == 1)
head(under23_and_smoke, 10)

##       genhlth exerany hlthplan smoke100 height weight wtdesire age gender
## 13  excellent       1        0        1     66    185      220  21      m
## 37  very good       1        0        1     70    160      140  18      f
## 96  excellent       1        1        1     74    175      200  22      m
## 180      good       1        1        1     64    190      140  20      f
## 182 very good       1        1        1     62     92       92  21      f
## 240 very good       1        0        1     64    125      115  22      f
## 262      fair       0        1        1     71    185      185  20      m
## 296      fair       1        1        1     72    185      170  19      m
## 297 excellent       1        0        1     63    105      100  19      m
## 300      fair       1        1        1     71    185      150  18      m
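The same filter can also be written with logical indexing. A minimal sketch, assuming a small made-up data frame (`toy` and its values are hypothetical; only the column names mirror the cdc data):

```r
# Hypothetical toy data frame (column names mirror cdc, values are made up)
toy <- data.frame(
  age      = c(18, 22, 23, 40, 21),
  smoke100 = c(1, 0, 1, 1, 1)
)

# Filter with subset(), as in the answer above
by_subset <- subset(toy, age < 23 & smoke100 == 1)

# Equivalent filter with logical row indexing
by_index <- toy[toy$age < 23 & toy$smoke100 == 1, ]

print(by_subset)
print(by_index)
```

One difference worth knowing: `subset()` drops rows where the condition is `NA`, while logical indexing returns them as rows of `NA`, so the two forms only match exactly when the filter columns have no missing values.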

5. What does this box plot show? Pick another categorical variable from the data set and see how it relates to BMI. List the variable you chose, why you might think it would have a relationship to BMI, and indicate what the figure seems to suggest.

# The box plot shows that respondents who report excellent health have the lowest median BMI. In general, the lower the BMI, the better the reported health.
library(ggplot2)
cdc$bmi <- cdc$weight / cdc$height^2 * 703
ggplot(cdc, aes(genhlth, bmi)) + geom_boxplot()

# bmi ~ gender
# Males show a higher median BMI, but a smaller range and a smaller IQR than females.
ggplot(cdc, aes(gender, bmi)) + geom_boxplot()

# bmi ~ exerany
# People who exercised (1) show a lower median BMI and a smaller range and IQR.
ggplot(cdc, aes(as.factor(exerany), bmi)) + geom_boxplot() + labs(x = "exercised in the past month")
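The BMI column used in these plots comes from weight in pounds divided by height in inches squared, multiplied by 703. A minimal worked example, assuming a hypothetical helper named `bmi_imperial` (not part of the lab) and the height and weight shown in the first row of the `str(cdc)` output:

```r
# Hypothetical helper: BMI from weight in pounds and height in inches
bmi_imperial <- function(weight_lb, height_in) {
  weight_lb / height_in^2 * 703
}

# First respondent in the str() output: 175 lb, 70 in
bmi_first <- bmi_imperial(175, 70)
print(round(bmi_first, 2))  # 175 / 4900 * 703, about 25.11

# The function is vectorised, so it also works on whole columns
print(bmi_imperial(c(175, 125), c(70, 64)))
```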

On Your Own

1. Make a scatterplot of weight versus desired weight. Describe the relationship between these two variables.

# Weight and desired weight are positively associated: as weight increases, so does desired weight.

ggplot(cdc, aes(weight, wtdesire)) +
    labs( x = "Weight", y = "Desired Weight") + geom_point()

2. Let’s consider a new variable: the difference between desired weight (wtdesire) and current weight (weight). Create this new variable by subtracting the two columns in the data frame and assigning them to a new object called wdiff.

cdc$wdiff <- cdc$wtdesire - cdc$weight

3. What type of data is wdiff? If an observation wdiff is 0, what does this mean about the person’s weight and desired weight. What if wdiff is positive or negative?

# wdiff is a numerical, discrete variable, because both weights are recorded as integers.
# If wdiff < 0, the person wants to weigh less.
# If wdiff = 0, the person is content with their weight.
# If wdiff > 0, the person wants to weigh more.

typeof(cdc$wdiff)

## [1] "integer"
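The sign rule above can be turned into a label for each respondent. A minimal sketch, assuming made-up weights and hypothetical label text (`feeling` and its categories are illustrative names, not part of the lab):

```r
# Hypothetical toy weights in pounds (integers, as in the cdc data)
weight   <- c(175L, 125L, 105L, 150L)
wtdesire <- c(175L, 115L, 105L, 160L)

# Desired minus current weight; integer minus integer stays integer
wdiff <- wtdesire - weight
print(typeof(wdiff))

# Label each difference by its sign
feeling <- ifelse(wdiff < 0, "wants to weigh less",
           ifelse(wdiff > 0, "wants to weigh more", "content"))

print(data.frame(weight, wtdesire, wdiff, feeling))
```

Tabulating such labels with `table(feeling)` would give a quick count of each group.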

4. Describe the distribution of wdiff in terms of its center, shape, and spread, including any plots you use. What does this tell us about how people feel about their current weight?

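# Center: the median is -10 and the mean is -14.59, so at least half of respondents want to weigh less than they do now.
# Shape: the mean lies below the median, which suggests a left skew.
# Spread: the middle 50% of values fall between -21 and 0 (IQR = 21), while the extremes run from -300 to 500.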
summary(cdc$wdiff)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -300.00  -21.00  -10.00  -14.59    0.00  500.00

ggplot(data = cdc, aes(wdiff)) + geom_histogram(binwidth = 30)

5. Using numerical summaries and a side-by-side box plot, determine if men tend to view their weight differently than women.

# Outliers are hidden in the box plot below. The numerical summaries and the plot show that the median wdiff is lower for women (-10) than for men (-5), so women tend to be farther from their desired weight than men are. Women also show a larger spread: their IQR is 27, compared with 20 for men.

summary(cdc$wdiff[cdc$gender == "m"])

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -300.00  -20.00   -5.00  -10.71    0.00  500.00

summary(cdc$wdiff[cdc$gender == "f"])

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -300.00  -27.00  -10.00  -18.15    0.00   83.00

boxplot(wdiff ~ gender, data = cdc, outline = FALSE, ylab = "wdiff")

6. Now it’s time to get creative. Find the mean and standard deviation of weight and determine what proportion of the weights are within one standard deviation of the mean.

# mean 169.7
avg <- mean(cdc$weight)
avg

## [1] 169.683

# SD: 40.08
# Named sdev so the variable does not mask the sd() function
sdev <- sd(cdc$weight)
sdev

## [1] 40.08097

# Proportion of weights within one standard deviation of the mean: 70.76%
wwos <- subset(cdc, weight < avg + sdev & weight > avg - sdev)
nrow(wwos)/nrow(cdc)

## [1] 0.7076
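The same proportion can be computed without building a subset, because the mean of a logical vector is the share of `TRUE` values. A minimal sketch, assuming a small made-up vector (`weights` and `sdev` are hypothetical names, and the values are not from the cdc data):

```r
# Hypothetical toy weights in pounds (not taken from the cdc data)
weights <- c(120, 150, 160, 170, 180, 190, 250)

avg  <- mean(weights)
sdev <- sd(weights)

# TRUE where a weight lies strictly within one SD of the mean
within_one_sd <- abs(weights - avg) < sdev

# The mean of a logical vector is the proportion of TRUE values
prop_within <- mean(within_one_sd)
print(prop_within)
```

For reference, a normal distribution has about 68% of its values within one standard deviation of the mean, so the 70.76% found for the cdc weights is close to that benchmark.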

Rose Koh

2/11/2018
