Lab 1

source("more/cdc.R")

Exercise 1

How many cases are there in this data set? How many variables? For each variable, identify its data type (e.g. categorical, discrete).

Answer

There are 20,000 cases in this data set.

There are 9 variables.

Looking at the first rows of the data set below, we can try to identify the data type of each variable.

head(cdc)

##     genhlth exerany hlthplan smoke100 height weight wtdesire age gender
## 1      good       0        1        0     70    175      175  77      m
## 2      good       0        1        1     64    125      115  33      f
## 3      good       1        1        1     60    105      105  49      f
## 4      good       1        1        0     66    132      124  42      f
## 5 very good       0        1        0     61    150      130  55      f
## 6 very good       1        1        0     64    114      114  55      f

Data Types

genhlth is categorical: it places each respondent in one of five categories (excellent, very good, good, fair, or poor). Arithmetic operations such as addition, subtraction, or division are not meaningful on this variable.

exerany is also categorical: it places each respondent in one of two groups, those who exercised and those who did not.

hlthplan is also categorical: it records whether or not the respondent has health coverage.

smoke100 is categorical: the respondent has either smoked 100 cigarettes or has not.

height is a numerical discrete data type.

weight is also a numerical discrete data type.

wtdesire is a numerical discrete data type.

age is a numerical discrete data type.

gender is a categorical data type as it puts the person in a male or female category.
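The counts and storage types discussed above can be checked directly in R. The following is a minimal sketch, assuming a small stand-in data frame (the name cdc_head is hypothetical) rebuilt from the six rows printed by head(cdc); on the full data set the same calls would be applied to cdc.

```r
# Stand-in for cdc: the six rows shown by head(cdc) above.
cdc_head <- data.frame(
  genhlth  = factor(c("good", "good", "good", "good", "very good", "very good")),
  exerany  = c(0, 0, 1, 1, 0, 1),
  hlthplan = c(1, 1, 1, 1, 1, 1),
  smoke100 = c(0, 1, 1, 0, 0, 0),
  height   = c(70, 64, 60, 66, 61, 64),
  weight   = c(175, 125, 105, 132, 150, 114),
  wtdesire = c(175, 115, 105, 124, 130, 114),
  age      = c(77, 33, 49, 42, 55, 55),
  gender   = factor(c("m", "f", "f", "f", "f", "f"))
)

dims <- dim(cdc_head)                   # c(number of cases, number of variables)
col_classes <- sapply(cdc_head, class)  # storage class of each column

print(dims)
print(col_classes)
```

The 0/1 columns are stored as numbers even though they are categorical in meaning, so the storage class alone does not settle the data type.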

Exercise 2

Create a numerical summary for height and age, and compute the interquartile range for each. Compute the relative frequency distribution for gender and exerany. How many males are in the sample? What proportion of the sample reports being in excellent health?

Numerical Summary Height

summary(cdc$height)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   48.00   64.00   67.00   67.18   70.00   93.00

Numerical Summary Age

summary(cdc$age)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   18.00   31.00   43.00   45.07   57.00   99.00

Interquartile range for Height

IQR(cdc$height)

## [1] 6

Interquartile range for Age

IQR(cdc$age)

## [1] 26
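The interquartile range is the third quartile minus the first quartile, so the values returned by IQR() can be cross-checked against the summary() output. The sketch below does this with the printed quartiles and then repeats the check on a hypothetical vector x, which is not part of the lab data.

```r
# Quartiles copied from the summary() output above.
height_iqr <- 70 - 64   # 3rd Qu. minus 1st Qu. for height
age_iqr    <- 57 - 31   # 3rd Qu. minus 1st Qu. for age

# The same identity on a hypothetical vector (illustration only).
x <- c(18, 25, 31, 43, 57, 60, 99)
q <- quantile(x, c(0.25, 0.75))
iqr_by_hand <- unname(q[2] - q[1])

print(c(height = height_iqr, age = age_iqr))
print(isTRUE(all.equal(iqr_by_hand, IQR(x))))
```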

Relative Frequency Distribution of gender and exerany

relative_freq=table(cdc$gender,cdc$exerany)
relative_freq

##    
##        0    1
##   m 2149 7420
##   f 2937 7494

males=relative_freq[1,"0"]+relative_freq[1,"1"]

Total number of males: 9569
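The table above holds counts; a relative frequency distribution divides each count by the total. A minimal sketch, assuming the counts are re-entered by hand in a matrix (the name counts is hypothetical):

```r
# Counts copied from the gender-by-exerany table above.
counts <- matrix(c(2149, 2937, 7420, 7494), nrow = 2,
                 dimnames = list(gender = c("m", "f"), exerany = c("0", "1")))

gender_rel  <- rowSums(counts) / sum(counts)  # relative frequency of each gender
exerany_rel <- colSums(counts) / sum(counts)  # relative frequency of each exerany value

print(gender_rel)
print(exerany_rel)
```

On the full data set, prop.table(table(cdc$gender)) and prop.table(table(cdc$exerany)) should give the same proportions without retyping the counts.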

relative_health=table(cdc$gender,cdc$genhlth)
excellentHealthMales=relative_health[1,"excellent"]

The total number of males in excellent health is 2298.

The proportion of males in excellent health, out of all males in the sample, is 0.2401505.
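The exercise asks about the whole sample rather than one gender. One way to get that figure is to take the mean of a logical comparison. The sketch below assumes a hypothetical vector of ratings standing in for cdc$genhlth, so its result says nothing about the real data.

```r
# Hypothetical ratings standing in for cdc$genhlth (illustration only).
genhlth <- factor(c("excellent", "good", "very good", "excellent", "fair",
                    "good", "poor", "very good", "excellent", "good"))

# The mean of a logical vector is the proportion of TRUE values.
prop_excellent <- mean(genhlth == "excellent")
print(prop_excellent)
```

On the full data set the equivalent call would be mean(cdc$genhlth == "excellent").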

Exercise 3

What does the mosaic plot reveal about smoking habits and gender?

Answer: The mosaic plot reveals that more males than females have smoked 100 cigarettes.

mosaicplot(table(cdc$gender,cdc$smoke100))

Exercise 4

Create a new object called under23_and_smoke that contains all observations of respondents under the age of 23 that have smoked 100 cigarettes in their lifetime. Write the command you used to create the new object as the answer to this exercise.

# smoke100 is coded 0/1, so compare against the number 1 rather than the string "1"
under23_and_smoke <- subset(cdc, smoke100 == 1 & age < 23)
head(under23_and_smoke)

##       genhlth exerany hlthplan smoke100 height weight wtdesire age gender
## 13  excellent       1        0        1     66    185      220  21      m
## 37  very good       1        0        1     70    160      140  18      f
## 96  excellent       1        1        1     74    175      200  22      m
## 180      good       1        1        1     64    190      140  20      f
## 182 very good       1        1        1     62     92       92  21      f
## 240 very good       1        0        1     64    125      115  22      f
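The same selection can also be written with logical indexing instead of subset(). A sketch under the assumption of a small stand-in data frame (toy and id are hypothetical names) whose values are copied from rows printed in this lab:

```r
# smoke100 and age taken from rows printed in this lab:
# ids 1-3 from head(cdc), ids 13, 37, 96 from the subset output above.
toy <- data.frame(
  id       = c(1, 2, 3, 13, 37, 96),
  smoke100 = c(0, 1, 1, 1, 1, 1),
  age      = c(77, 33, 49, 21, 18, 22)
)

by_subset <- subset(toy, smoke100 == 1 & age < 23)      # as in the lab
by_index  <- toy[toy$smoke100 == 1 & toy$age < 23, ]    # logical row indexing

print(by_index)
print(identical(by_subset$id, by_index$id))
```

subset() is shorter to type; logical indexing makes the row filter explicit and is the form usually recommended inside functions.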

Exercise 5

What does this box plot show? Pick another categorical variable from the data set and see how it relates to BMI. List the variable you chose, why you might think it would have a relationship to BMI, and indicate what the figure seems to suggest.

Answer:

The box plot shows how BMI is distributed within each general health category for the respondents in the cdc data set.

bmi <- (cdc$weight / cdc$height^2) * 703
boxplot(bmi ~ cdc$genhlth)

I chose the smoking variable (smoke100) to see how it relates to BMI. It could have a relationship to BMI because smoking may affect a person’s weight. The figure below suggests that BMI is generally lower for people who have smoked than for non-smokers.

boxplot(bmi ~ cdc$smoke100)
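To make the BMI formula used in this exercise concrete, here is a small worked sketch. It assumes height is in inches and weight is in pounds, which is what the factor 703 implies, and it uses the six rows printed by head(cdc) rather than the full data set.

```r
# Height (inches) and weight (pounds) from the six rows shown by head(cdc).
height <- c(70, 64, 60, 66, 61, 64)
weight <- c(175, 125, 105, 132, 150, 114)

# Same formula as in the lab; 703 converts lb/in^2 to kg/m^2.
bmi <- (weight / height^2) * 703
print(round(bmi, 2))
```

For the first row this is 175 / 4900 * 703, or about 25.1.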

On Your Own

Question 1

Make a scatterplot of weight versus desired weight. Describe the relationship between these two variables.

The relationship shows that most people want to lower their weight, and almost everyone wants to weigh less than 200 pounds.

plot(cdc$weight,cdc$wtdesire)

Question 2

Let’s consider a new variable: the difference between desired weight (wtdesire) and current weight (weight). Create this new variable by subtracting the two columns in the data frame and assigning them to a new object called wdiff.

wdiff=cdc$wtdesire-cdc$weight
head(wdiff)

## [1]   0 -10   0  -8 -20   0

Question 3

What type of data is wdiff? If an observation wdiff is 0, what does this mean about the person’s weight and desired weight. What if wdiff is positive or negative?

wdiff is numerical data. If an observation is 0, the person’s desired weight and current weight are the same, meaning they do not have to gain or lose weight to be at their desired weight.

If wdiff is positive then it means that the person wants to gain weight.

If wdiff is negative it means that the person wants to lose weight.
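The sign rule above can be turned into a label for each respondent. A minimal sketch, using the six values printed by head(wdiff); the labels "lose", "gain", and "keep" are hypothetical names chosen for illustration.

```r
# First six wdiff values, as printed by head(wdiff) above.
wdiff <- c(0, -10, 0, -8, -20, 0)

# Negative: wants to lose weight; positive: wants to gain; zero: content.
goal <- ifelse(wdiff < 0, "lose", ifelse(wdiff > 0, "gain", "keep"))
print(table(goal))
```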

Question 4

Describe the distribution of wdiff in terms of its center, shape, and spread, including any plots you use. What does this tell us about how people feel about their current weight?

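Answer: From the summary and box plot below, it seems that people feel they are overweight. The center is negative (median -10, mean -14.59), and the middle half of the values lies between -21 and 0, so a typical respondent wants to lose weight.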

summary(wdiff)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -300.00  -21.00  -10.00  -14.59    0.00  500.00

boxplot(wdiff)

Question 5

Using numerical summaries and a side-by-side box plot, determine if men tend to view their weight differently than women.

Answer

The summaries and box plots below show that more men than women want to gain weight. They also show that most women feel that they are overweight and need to lose weight, whereas for men the typical desired loss is smaller (median -5, compared with -10 for women).

Men Weight Difference and Box Plot

men=subset(cdc,gender=="m")
menweightdiff=men$wtdesire-men$weight

summary(menweightdiff)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -300.00  -20.00   -5.00  -10.71    0.00  500.00

boxplot(menweightdiff)

Women Weight Difference Summary and Box Plot

women=subset(cdc,gender=="f")
womenweightdiff=women$wtdesire-women$weight

summary(womenweightdiff)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -300.00  -27.00  -10.00  -18.15    0.00   83.00

boxplot(womenweightdiff)
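The exercise asks for a side-by-side box plot, which the formula interface of boxplot() can draw in a single call. A sketch, assuming a small stand-in data frame (the name toy is hypothetical) built from the six rows printed by head(cdc), so the numbers are not the full-sample results:

```r
# gender, weight and wtdesire from the six rows shown by head(cdc).
toy <- data.frame(
  gender   = factor(c("m", "f", "f", "f", "f", "f"), levels = c("m", "f")),
  weight   = c(175, 125, 105, 132, 150, 114),
  wtdesire = c(175, 115, 105, 124, 130, 114)
)
toy$wdiff <- toy$wtdesire - toy$weight

# Numerical summary per gender.
medians <- tapply(toy$wdiff, toy$gender, median)
print(medians)

# One box per gender on a shared axis.
boxplot(wdiff ~ gender, data = toy, ylab = "wtdesire - weight")
```

On the full data set the equivalent would be boxplot(wdiff ~ cdc$gender), with wdiff computed as in Question 2. A shared axis makes the two distributions directly comparable.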

Question 6

Now it’s time to get creative. Find the mean and standard deviation of weight and determine what proportion of the weights are within one standard deviation of the mean.

weightmean=mean(cdc$weight)
weightstandarddeviation=sd(cdc$weight)

weightmean

## [1] 169.683

weightstandarddeviation

## [1] 40.08097

lowerrange=weightmean-weightstandarddeviation
upperrange=weightmean+weightstandarddeviation

within_one_sd=subset(cdc,weight>lowerrange & weight<upperrange)

lowerrange

## [1] 129.602

upperrange

## [1] 209.7639

proportion_weight_within_one_sd=nrow(within_one_sd)/nrow(cdc)

The proportion of weights within one standard deviation of the mean is 0.7076.
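The same proportion can be computed without building an intermediate data frame, by taking the mean of a logical vector. A sketch, assuming ten weights copied from rows printed earlier in this lab as a stand-in for cdc$weight, so the result will differ from the full-sample value:

```r
# Ten weights taken from rows printed earlier in this lab (stand-in for cdc$weight).
weight <- c(105, 114, 125, 132, 150, 175, 190, 160, 185, 92)

# TRUE where a weight is strictly within one standard deviation of the mean.
within <- abs(weight - mean(weight)) < sd(weight)
prop_within <- mean(within)

# Cross-check against the subset-based approach used in the lab.
lower <- mean(weight) - sd(weight)
upper <- mean(weight) + sd(weight)
prop_subset <- sum(weight > lower & weight < upper) / length(weight)

print(prop_within)
print(prop_within == prop_subset)
```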

lab 01

Umais Siddiqui

September 3, 2017
