Lab 1 - Data 606 with all questions answered

library(knitr)
source("http://www.openintro.org/stat/data/cdc.R")
names(cdc)

## [1] "genhlth"  "exerany"  "hlthplan" "smoke100" "height"   "weight"  
## [7] "wtdesire" "age"      "gender"

head(cdc)

##     genhlth exerany hlthplan smoke100 height weight wtdesire age gender
## 1      good       0        1        0     70    175      175  77      m
## 2      good       0        1        1     64    125      115  33      f
## 3      good       1        1        1     60    105      105  49      f
## 4      good       1        1        0     66    132      124  42      f
## 5 very good       0        1        0     61    150      130  55      f
## 6 very good       1        1        0     64    114      114  55      f

Question 1. How many cases are there in this data set? How many variables? For each variable, identify its data type (e.g. categorical, discrete).

cases_count	variables_count	genhlth	exerany	hlthplan	smoke100	height	weight	wtdesire	age	gender
20000	9	categorical ordinal	categorical (binary)	categorical (binary)	categorical (binary)	continuous	continuous	discrete	discrete	categorical
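The case and variable counts can be read directly from the data frame. The following is a minimal sketch that uses a hypothetical stand-in called `toy`, built from three columns of the first three rows printed by `head(cdc)`, so that it runs without downloading the data; with the real data, `toy` would be replaced by `cdc`.

```r
# Stand-in for cdc: first three rows of head(cdc), three columns only
toy <- data.frame(
  genhlth = factor(c("good", "good", "good")),
  exerany = c(0, 0, 1),
  height  = c(70, 64, 60)
)

n_cases <- nrow(toy)               # number of cases (rows)
n_vars  <- ncol(toy)               # number of variables (columns)
var_classes <- sapply(toy, class)  # storage class of each column

print(n_cases)
print(n_vars)
print(var_classes)
```

Note that the storage class only says how R holds a column. A 0/1 column such as `exerany` is stored as a number even though it encodes a yes/no category.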

Question 2. Create a numerical summary for `height` and `age`, and compute the interquartile range for each. Compute the relative frequency distribution for `gender` and `exerany`. How many males are in the sample? What proportion of the sample reports being in excellent health?

summary(cdc$height)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   48.00   64.00   67.00   67.18   70.00   93.00

summary(cdc$age)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   18.00   31.00   43.00   45.07   57.00   99.00

# cases_count is the number of rows in the data set
cases_count <- nrow(cdc)
iqr_height <- IQR(cdc$height)
iqr_age <- IQR(cdc$age)
relative_frequency_gender <- table(cdc$gender) / cases_count
relative_frequency_exerany <- table(cdc$exerany) / cases_count
# index by level name rather than by position
male_count <- table(cdc$gender)["m"]
excellent_health_reported <- table(cdc$genhlth)["excellent"] / cases_count

iqr_height	iqr_age	relative_frequency_gender	relative_frequency_exerany	male_count	excellent_health_reported
6	26	0.47845 male / 0.52155 female	0.2543 no exercise (0) / 0.7457 exercised (1)	9569	23.285 %
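As a small illustration of the two tools behind these numbers, the sketch below applies `IQR()` and `prop.table()` to the `height` and `gender` values of the six rows printed by `head(cdc)`. The six-row vectors are only a stand-in for the full columns, so the results differ from the table above.

```r
# First six rows of head(cdc), two columns only
height <- c(70, 64, 60, 66, 61, 64)
gender <- c("m", "f", "f", "f", "f", "f")

# IQR() is the third quartile minus the first quartile
iqr_builtin <- IQR(height)
iqr_manual  <- unname(quantile(height, 0.75) - quantile(height, 0.25))

# prop.table() turns counts into relative frequencies that sum to 1
gender_freq <- prop.table(table(gender))

print(iqr_builtin)
print(iqr_manual)
print(gender_freq)
```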

Question 3. What does the mosaic plot reveal about smoking habits and gender?

table(cdc$gender,cdc$smoke100)

##    
##        0    1
##   m 4547 5022
##   f 6012 4419

mosaicplot(table(cdc$gender,cdc$smoke100))

The mosaic plot shows that the (m, 1) tile is larger than the (f, 1) tile: a larger share of men than of women report having smoked at least 100 cigarettes in their lifetime. The within-gender percentages confirm this:

m_smoked_under_100	f_smoked_under_100	m_smoked_100_or_more	f_smoked_100_or_more
48 %	58 %	52 %	42 %

The results show that men are more likely than women to have smoked at least 100 cigarettes.
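The within-gender percentages can be reproduced from the counts printed by `table(cdc$gender, cdc$smoke100)`. This sketch retypes those four counts into a matrix (the name `smoke_by_gender` is made up here) and uses `prop.table()` with `margin = 1` to divide each row by its row total.

```r
# Counts copied from the table(cdc$gender, cdc$smoke100) output
smoke_by_gender <- matrix(
  c(4547, 5022,
    6012, 4419),
  nrow = 2, byrow = TRUE,
  dimnames = list(gender = c("m", "f"), smoke100 = c("0", "1"))
)

# margin = 1 divides each row by its row total, giving the
# share of each smoke100 value within each gender
row_props <- prop.table(smoke_by_gender, margin = 1)

print(round(row_props, 2))
```

With the full data set loaded, the same call works on the table itself: `prop.table(table(cdc$gender, cdc$smoke100), margin = 1)`.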

Question 4. Create a new object called `under23_and_smoke` that contains all observations of respondents under the age of 23 that have smoked 100 cigarettes in their lifetime. Write the command you used to create the new object as the answer to this exercise.

under23_and_smoke <- subset(cdc, smoke100 == 1 & age < 23)
head(under23_and_smoke)

##       genhlth exerany hlthplan smoke100 height weight wtdesire age gender
## 13  excellent       1        0        1     66    185      220  21      m
## 37  very good       1        0        1     70    160      140  18      f
## 96  excellent       1        1        1     74    175      200  22      m
## 180      good       1        1        1     64    190      140  20      f
## 182 very good       1        1        1     62     92       92  21      f
## 240 very good       1        0        1     64    125      115  22      f

Question 5. What does this box plot show? Pick another categorical variable from the data set and see how it relates to BMI. List the variable you chose, why you might think it would have a relationship to BMI, and indicate what the figure seems to suggest.

bmi <- (cdc$weight / cdc$height^2) * 703
boxplot(bmi ~ cdc$genhlth)

This box plot shows how BMI is distributed within each general health category. To see how BMI relates to whether participants exercised in the past month, let’s draw the same plot using cdc$exerany:

boxplot(bmi ~ cdc$exerany)

The box for 0 on the left suggests that participants who did not exercise in the past month tended to have higher BMIs than those who did.
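A numerical check can accompany the box plot by comparing the median BMI of the two groups. The sketch below does this with `tapply()` on a stand-in data frame `toy` built from the six rows printed by `head(cdc)`; six rows are far too few to support any conclusion, so this only illustrates the call. With the full data, the equivalent would be `tapply(bmi, cdc$exerany, median)`.

```r
# Stand-in for cdc: the six rows of head(cdc), three columns only
toy <- data.frame(
  exerany = c(0, 0, 1, 1, 0, 1),
  height  = c(70, 64, 60, 66, 61, 64),
  weight  = c(175, 125, 105, 132, 150, 114)
)

# Same BMI formula as in the lab (pounds and inches)
toy$bmi <- (toy$weight / toy$height^2) * 703

# Median BMI within each exerany group
bmi_medians <- tapply(toy$bmi, toy$exerany, median)

print(round(bmi_medians, 2))
```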

On Your Own

Make a scatterplot of weight versus desired weight. Describe the relationship between these two variables.

Weight and desired weight are positively correlated: when one increases, the other tends to increase as well. The point cloud is skewed to the right, with a longer tail toward higher weights.
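One way to draw the scatterplot is sketched below. It uses the `weight` and `wtdesire` values of the six rows printed by `head(cdc)` as a stand-in; with the full data, the call would be `plot(cdc$weight, cdc$wtdesire)`. The dashed reference line where desired weight equals current weight is an addition for readability, not part of the lab instructions.

```r
# Stand-in values: the six rows of head(cdc)
weight   <- c(175, 125, 105, 132, 150, 114)
wtdesire <- c(175, 115, 105, 124, 130, 114)

# Scatterplot of desired weight against current weight
plot(weight, wtdesire, xlab = "weight", ylab = "desired weight")

# Reference line wtdesire = weight; points below it want to weigh less
abline(a = 0, b = 1, lty = 2)

# Pearson correlation as a numerical summary of the relationship
r <- cor(weight, wtdesire)
print(r)
```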

Let’s consider a new variable: the difference between desired weight (wtdesire) and current weight (weight). Create this new variable by subtracting the two columns in the data frame and assigning them to a new object called wdiff.

wdiff<-cdc$wtdesire-cdc$weight

What type of data is wdiff? If an observation wdiff is 0, what does this mean about the person’s weight and desired weight. What if wdiff is positive or negative?

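1- wdiff is a discrete (integer) variable
2- wdiff = 0: the person is already at their desired weight
3- wdiff > 0: the desired weight is above the current weight, so the person wants to gain weight
4- wdiff < 0: the desired weight is below the current weight, so the person wants to lose weight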
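Because wdiff is desired weight minus current weight, its sign gives the direction of the change a person wants. The sketch below labels three respondents taken from the outputs shown earlier (rows 1, 2 and 13); the `goal` column and its labels are made up for illustration.

```r
# Rows 1, 2 and 13 of cdc as printed earlier in this lab
weight   <- c(175, 125, 185)
wtdesire <- c(175, 115, 220)

wdiff <- wtdesire - weight

# Positive: desired > current; negative: desired < current
goal <- ifelse(wdiff > 0, "wants to gain",
        ifelse(wdiff < 0, "wants to lose", "at desired weight"))

print(data.frame(weight, wtdesire, wdiff, goal))
```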

Describe the distribution of wdiff in terms of its center, shape, and spread, including any plots you use. What does this tell us about how people feel about their current weight?

summary(wdiff)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -300.00  -21.00  -10.00  -14.59    0.00  500.00

hist(wdiff, breaks=70)

The summary and the histogram show that wdiff peaks at and just below 0, and that the middle half of the values lies between -21 and 0. Both the median (-10.00) and the mean (-14.59) are negative, which means that most respondents weigh more than their desired weight and would like to lose weight.

Using numerical summaries and a side-by-side box plot, determine if men tend to view their weight differently than women.

men_weight	m_desired_weight	women_weight	f_desired_weight	m_diff	f_diff
1811629	1709182	1582030	1392695	102447	189335

m_diff = men_weight - m_desired_weight = 102447, f_diff = women_weight - f_desired_weight = 189335

As the table shows, m_diff < f_diff: in total, men want to change their weight less than women do.
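Since the sample contains more women than men, the totals are easier to compare per person. The sketch below divides each total difference by its group size, using the male count reported in Question 2 (9569) and the remaining 20000 - 9569 = 10431 respondents as the female count; the variable names are made up for illustration.

```r
# Group sizes: 9569 men (Question 2), the other 10431 of 20000 are women
n_m <- 9569
n_f <- 20000 - n_m

# Total current weight minus total desired weight, from the table
m_diff <- 1811629 - 1709182
f_diff <- 1582030 - 1392695

# Average desired weight loss per person in each group
avg_m <- m_diff / n_m
avg_f <- f_diff / n_f

print(round(c(men = avg_m, women = avg_f), 2))
```

On these numbers the average gap is about 10.7 for men and about 18.2 for women, so the conclusion holds per person as well as in total.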

Now it’s time to get creative. Find the mean and standard deviation of weight and determine what proportion of the weights are within one standard deviation of the mean.

mean_weight	weight_sd	sd_proportions_one
169.683	40.08097	0.7076

70.76% of the weights are within one standard deviation of the mean.
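The lab output does not show the command behind this proportion, so the following is only one plausible way to compute it. It uses the six `weight` values printed by `head(cdc)` as a stand-in; with the full data, `w` would be `cdc$weight`.

```r
# Stand-in values: weight column of the six rows of head(cdc)
w <- c(175, 125, 105, 132, 150, 114)

w_mean <- mean(w)
w_sd   <- sd(w)

# TRUE where a weight is within one standard deviation of the mean;
# the mean of a logical vector is the proportion of TRUE values
within_one_sd <- mean(abs(w - w_mean) <= w_sd)

print(w_mean)
print(w_sd)
print(within_one_sd)
```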

Ohannes (Hovig) Ohannessian

2/9/2018
