Introduction to Data

Load Data

source("more/cdc.R")

Exercise 1

How many cases are there in this data set?

There are 20,000 cases in the cdc data frame.

How many variables?

The data frame has 9 variables.

For each variable, identify its data type

“genhlth”: categorical ordinal
“exerany”: categorical nominal
“hlthplan”: categorical nominal
“smoke100”: categorical nominal
“height”: numerical continuous
“weight”: numerical continuous
“wtdesire”: numerical continuous
“age”: numerical discrete
“gender”: categorical nominal

dim(cdc)
## [1] 20000     9
names(cdc)
## [1] "genhlth"  "exerany"  "hlthplan" "smoke100" "height"   "weight"  
## [7] "wtdesire" "age"      "gender"
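
The types listed above can be compared with how R stores each column. The following is a minimal sketch on a small made-up data frame (`toy` is hypothetical and its values are illustrative, not taken from cdc):

```r
# Hypothetical stand-in for cdc; values are illustrative only
toy <- data.frame(
  genhlth  = factor(c("good", "excellent", "poor"),
                    levels = c("excellent", "very good", "good", "fair", "poor")),
  smoke100 = c(1, 0, 1),
  height   = c(66, 70, 74),
  gender   = factor(c("m", "f", "m"), levels = c("m", "f"))
)

str(toy)                           # compact overview of every column
col_classes <- sapply(toy, class)  # named character vector of storage classes
col_classes
```

The storage class is only a hint: a 0/1 column such as smoke100 is stored as a number even though it is a categorical variable.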

Exercise 2

Create a numerical summary for height and age, and compute the interquartile range for each.

Height IQR: 6

Age IQR: 26

summary(cdc$height)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   48.00   64.00   67.00   67.18   70.00   93.00
summary(cdc$height)[5]-summary(cdc$height)[2]
## 3rd Qu. 
##       6
summary(cdc$age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   18.00   31.00   43.00   45.07   57.00   99.00
summary(cdc$age)[5]-summary(cdc$age)[2]
## 3rd Qu. 
##      26
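
The interquartile range can also be computed directly instead of indexing into summary(), which rounds its output for display. A small sketch, assuming a made-up vector `heights` (not the cdc data):

```r
# Hypothetical sample of heights in inches (not the cdc data)
heights <- c(60, 62, 64, 66, 67, 68, 70, 72, 75)

q <- quantile(heights, c(0.25, 0.75))  # first and third quartiles
iqr_manual  <- unname(q[2] - q[1])     # Q3 - Q1
iqr_builtin <- IQR(heights)            # same quantity in one call
```

On the real data the equivalent calls would be IQR(cdc$height) and IQR(cdc$age).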

Compute the relative frequency distribution for gender and exerany.

table(cdc$gender)/20000
## 
##       m       f 
## 0.47845 0.52155
table(cdc$exerany)/20000
## 
##      0      1 
## 0.2543 0.7457
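
Dividing by a hard-coded 20000 works here, but the same relative frequencies can be obtained without typing the sample size. A sketch on a made-up `gender` vector (hypothetical values, not the cdc data):

```r
# Hypothetical gender column (not the cdc data)
gender <- factor(c("m", "f", "f", "m", "f"), levels = c("m", "f"))

counts <- table(gender)
rel_freq     <- prop.table(counts)       # divides each count by the total
rel_freq_alt <- counts / length(gender)  # same result, n taken from the data
```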

How many males are in the sample?

9569

table(cdc$gender)
## 
##     m     f 
##  9569 10431

What proportion of the sample reports being in excellent health?

4657 of the 20,000 respondents report excellent health, a proportion of about 0.233 (23.3%).

table(cdc$genhlth)
## 
## excellent very good      good      fair      poor 
##      4657      6972      5675      2019       677
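
To turn the counts into a proportion, divide the excellent count by the total. The sketch below re-enters the counts printed above by hand; the object names are illustrative:

```r
# Counts copied from the table(cdc$genhlth) output above
genhlth_counts <- c(excellent = 4657, "very good" = 6972, good = 5675,
                    fair = 2019, poor = 677)

n_total <- sum(genhlth_counts)                             # 20000
prop_excellent <- genhlth_counts[["excellent"]] / n_total  # 0.23285
prop_excellent
```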

Exercise 3

What does the mosaic plot reveal about smoking habits and gender?

A larger share of males than of females report having smoked at least 100 cigarettes.

mosaicplot(table(cdc$gender,cdc$smoke100))
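
The mosaic plot can be backed up with row proportions, which give the share of smokers within each gender. A sketch on made-up vectors (hypothetical values, not the cdc data):

```r
# Hypothetical data (not the cdc data)
gender   <- factor(c("m", "m", "m", "m", "f", "f", "f", "f"), levels = c("m", "f"))
smoke100 <- c(1, 1, 1, 0, 1, 0, 0, 0)

tab <- table(gender, smoke100)
row_props <- prop.table(tab, margin = 1)  # each row (gender) sums to 1
row_props
```

Comparing the "1" column across rows shows which gender has the larger share of respondents who smoked at least 100 cigarettes.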

Exercise 4

Create a new object called under23_and_smoke that contains all observations of respondents under the age of 23 that have smoked 100 cigarettes in their lifetime. Write the command you used to create the new object as the answer to this exercise

under23_and_smoke<-subset(cdc,age < 23 & smoke100==1)
head(under23_and_smoke)
##       genhlth exerany hlthplan smoke100 height weight wtdesire age gender
## 13  excellent       1        0        1     66    185      220  21      m
## 37  very good       1        0        1     70    160      140  18      f
## 96  excellent       1        1        1     74    175      200  22      m
## 180      good       1        1        1     64    190      140  20      f
## 182 very good       1        1        1     62     92       92  21      f
## 240 very good       1        0        1     64    125      115  22      f
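
The same selection can be written with logical indexing, which makes the row filter explicit. A sketch on a made-up data frame (`toy` is hypothetical, not the cdc data):

```r
# Hypothetical stand-in for cdc (not the real data)
toy <- data.frame(age = c(21, 30, 22, 19), smoke100 = c(1, 1, 0, 1))

keep <- toy$age < 23 & toy$smoke100 == 1  # TRUE for rows meeting both conditions
by_index  <- toy[keep, ]
by_subset <- subset(toy, age < 23 & smoke100 == 1)
```

Both forms return the same rows when the columns have no missing values; subset() drops rows where the condition is NA, while plain indexing keeps them as NA rows.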

Exercise 5

What does this box plot show?

The box plot compares the BMI of all individuals across the categories of general health. The group reporting excellent health has a lower and tighter (less spread out) BMI distribution than the group reporting poor health. Moving from excellent to poor, both the medians and the IQRs increase.

bmi <- (cdc$weight / cdc$height^2) * 703
boxplot(bmi ~ cdc$genhlth)
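
The medians and IQRs read off the box plot can also be computed per group. A sketch on made-up values (the vectors below are hypothetical, not the cdc data), using the same BMI formula:

```r
# Hypothetical respondents (not the cdc data): weight in pounds, height in inches
weight  <- c(185, 160, 175, 190)
height  <- c(66, 70, 74, 64)
genhlth <- factor(c("excellent", "excellent", "poor", "poor"),
                  levels = c("excellent", "poor"))

bmi <- (weight / height^2) * 703               # same formula as above
group_medians <- tapply(bmi, genhlth, median)  # one median per health category
group_iqrs    <- tapply(bmi, genhlth, IQR)     # one IQR per health category
```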

Pick another categorical variable from the data set and see how it relates to BMI. List the variable you chose, why you might think it would have a relationship to BMI, and indicate what the figure seems to suggest.

Graphs for all categorical variables were examined, and gender showed the most interesting result. Males have a higher median BMI but a smaller IQR than females. Females also show a lower minimum and a higher maximum. So although women as a whole have a lower BMI than men, their values fluctuate more: women’s BMI varies more than men’s, even though their median is lower.

boxplot(bmi ~ cdc$gender)

On Your Own

1

Make a scatterplot of weight versus desired weight. Describe the relationship between these two variables.

The plot shows a linear relationship, with most people reporting a desired weight lower than their actual weight.

plot(cdc$weight,cdc$wtdesire)

2

Let’s consider a new variable: the difference between desired weight (wtdesire) and current weight (weight). Create this new variable by subtracting the two columns in the data frame and assigning them to a new object called wdiff.

wdiff<-cdc$wtdesire-cdc$weight

3

What type of data is wdiff?

Numerical continuous

If an observation wdiff is 0, what does this mean about the person’s weight and desired weight?

The person’s current weight is exactly the weight he or she desires.

What if wdiff is positive or negative?

If it is positive, the person weighs less than the desired weight. If it is negative, he or she weighs more than desired.
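
The three cases can be checked on rows 13, 37 and 182 of the head(under23_and_smoke) output shown earlier. In this sketch the `meaning` labels are illustrative wording, not part of the data set:

```r
# Three respondents taken from the head(under23_and_smoke) output above
weight   <- c(185, 160, 92)
wtdesire <- c(220, 140, 92)

wdiff <- wtdesire - weight  # 35, -20, 0
meaning <- ifelse(wdiff > 0, "wants to gain",
           ifelse(wdiff < 0, "wants to lose", "at desired weight"))
data.frame(weight, wtdesire, wdiff, meaning)
```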

4

Describe the distribution of wdiff in terms of its center, shape, and spread, including any plots you use. What does this tell us about how people feel about their current weight?

Plotting a histogram and looking at the summary, we can tell that most people would like to weigh less than they actually do. The median is negative (-10), and the bulk of the histogram lies on the negative side. The mean (-14.59) is below the median, and the IQR is 21 (from -21 to 0), with extremes of -300 and 500. Because the third quartile is zero, at least 75% of respondents weigh as much as or more than they would like, so few people weigh less than they desire. In fact, counting the observations where wdiff is positive shows that only 1620 people weigh less than they would like, against 18380 who are at or above their desired weight. Only 8.1% of people weigh less than they would like to weigh.

hist(wdiff,breaks = 50)

summary(wdiff)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -300.00  -21.00  -10.00  -14.59    0.00  500.00
sum(wdiff>0)
## [1] 1620
sum(wdiff<=0)
## [1] 18380
sum(wdiff>0)/(sum(wdiff>0)+sum(wdiff<=0))
## [1] 0.081
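
Note that wdiff<=0 groups people who want to lose weight together with people who are already at their desired weight. A sketch that separates the three groups, on a made-up vector (hypothetical values, not the cdc data):

```r
# Hypothetical differences (not the cdc data)
wdiff <- c(-20, -10, 0, 0, 5, -30, -5, 0, 10, -15)

n_lose <- sum(wdiff < 0)      # want to weigh less
n_same <- sum(wdiff == 0)     # already at desired weight
n_gain <- sum(wdiff > 0)      # want to weigh more
prop_gain <- mean(wdiff > 0)  # mean of a logical vector is the proportion TRUE
```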

5

Using numerical summaries and a side-by-side box plot, determine if men tend to view their weight differently than women

There is a small difference in how men and women view their weight. Looking at the box plots and their statistics, the median of wdiff is more negative for women than for men: -10 for women against -5 for men. This means that women in general see their target weight as further below their actual weight than men do; men see their target weight as closer to their actual weight. Women’s median is the same as for the entire sample, while men’s is closer to zero. The IQR is larger for women than for men, 27 vs 20, so women have a larger spread in the difference between desired and current weight. Men have less of a spread, meaning more men see their target weight as close to their current weight.

b<-boxplot(wdiff ~ cdc$gender)

b$stats
##      [,1] [,2]
## [1,]  -50  -67
## [2,]  -20  -27
## [3,]   -5  -10
## [4,]    0    0
## [5,]   30   38
## attr(,"class")
##         m 
## "integer"
summary(wdiff)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -300.00  -21.00  -10.00  -14.59    0.00  500.00
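
The medians and IQRs quoted in the answer can be read straight from the b$stats matrix. The sketch below re-enters that matrix by hand and assumes column 1 is m and column 2 is f, the level order shown by table(cdc$gender):

```r
# Five-number summaries copied from the b$stats output above
# (assumed column order: m, then f, following the factor levels)
stats <- matrix(c(-50, -20,  -5, 0, 30,
                  -67, -27, -10, 0, 38),
                nrow = 5, dimnames = list(NULL, c("m", "f")))

medians <- stats[3, ]               # row 3 holds the medians
iqrs    <- stats[4, ] - stats[2, ]  # upper hinge minus lower hinge
```

boxplot() reports hinges rather than quantile()-based quartiles, so these values can differ slightly from what summary() gives for the same group.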

6

Now it’s time to get creative. Find the mean and standard deviation of weight and determine what proportion of the weights are within one standard deviation of the mean.

70.76% of weights fall within 1 standard deviation of the mean

mean(cdc$weight)
## [1] 169.683
sd(cdc$weight)
## [1] 40.08097
sum(cdc$weight>(mean(cdc$weight)-sd(cdc$weight)) & cdc$weight<(mean(cdc$weight)+sd(cdc$weight)))/20000
## [1] 0.7076
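
The same proportion can be written more compactly by storing the mean and standard deviation once and taking the mean of a logical vector. A sketch on a made-up vector `w` (hypothetical values, not the cdc data):

```r
# Hypothetical weights in pounds (not the cdc data)
w <- c(120, 150, 160, 170, 180, 190, 250)

m <- mean(w)
s <- sd(w)
within_1sd  <- abs(w - m) < s    # TRUE where the weight is within one SD of the mean
prop_within <- mean(within_1sd)  # proportion of TRUE values
```

This avoids hard-coding the sample size and recomputing mean() and sd() inside the comparison.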