library(DATA606)

## 
## Welcome to CUNY DATA606 Statistics and Probability for Data Analytics 
## This package is designed to support this course. The text book used 
## is OpenIntro Statistics, 3rd Edition. You can read this by typing 
## vignette('os3') or visit www.OpenIntro.org. 
##  
## The getLabs() function will return a list of the labs available. 
##  
## The demo(package='DATA606') will list the demos that are available.

## 
## Attaching package: 'DATA606'

## The following object is masked from 'package:utils':
## 
##     demo

setwd("~/R/Lab1")
source("more/cdc.R")

Exercise 1. How many cases are there in this data set? How many variables? For each variable, identify its data type (e.g. categorical, discrete).

The Data section of the Global Environment tab shows that this data set has 20000 cases and 9 variables. The same can be confirmed with:

dim(cdc)

## [1] 20000     9

We can also use the following command to view the last entries of the data set:

tail(cdc)

##         genhlth exerany hlthplan smoke100 height weight wtdesire age
## 19995      good       0        1        1     69    224      224  73
## 19996      good       1        1        0     66    215      140  23
## 19997 excellent       0        1        0     73    200      185  35
## 19998      poor       0        1        0     65    216      150  57
## 19999      good       1        1        0     67    165      165  81
## 20000      good       1        1        1     69    170      165  83
##       gender
## 19995      m
## 19996      f
## 19997      m
## 19998      f
## 19999      f
## 20000      m

Variables:
genhlth: categorical/ordinal
exerany: categorical/binary
hlthplan: categorical/binary
smoke100: categorical/binary
height: numerical/discrete
weight: numerical/continuous
wtdesire: numerical/continuous
age: numerical/discrete
gender: categorical/nominal
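The types can also be checked in R instead of being read off the environment pane. This is a minimal sketch, assuming a small made-up data frame called toy (hypothetical, standing in for a few cdc columns). It lists the storage class of each column and the number of distinct values, which helps spot binary variables stored as numbers.

```r
# Hypothetical stand-in for a few cdc columns; the values are made up.
toy <- data.frame(
  genhlth = factor(c("good", "excellent", "poor"),
                   levels = c("excellent", "very good", "good", "fair", "poor")),
  exerany = c(0, 1, 1),
  height  = c(69L, 66L, 73L),
  gender  = factor(c("m", "f", "m"))
)

# Storage class of each column
col_classes <- sapply(toy, class)
print(col_classes)

# Number of distinct values per column; a 0/1 column is numeric in storage
# but binary categorical in meaning
n_distinct <- sapply(toy, function(col) length(unique(col)))
print(n_distinct)
```

The storage class alone does not settle the statistical type, so the classification above still needs judgment about what each variable means.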

Exercise 2: Create a numerical summary for height and age, and compute the interquartile range for each. Compute the relative frequency distribution for gender and exerany. How many males are in the sample? What proportion of the sample reports being in excellent health?

Numerical summary for height

summary(cdc$height)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   48.00   64.00   67.00   67.18   70.00   93.00

# Interquartile range: 3rd Qu. - 1st Qu.
70 - 64

## [1] 6

Numerical summary for age

summary(cdc$age)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   18.00   31.00   43.00   45.07   57.00   99.00

# Interquartile range: 3rd Qu. - 1st Qu.
57 - 31

## [1] 26
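The interquartile range does not have to be computed by hand from the summary output. This sketch uses base R's quantile() and IQR() on a small made-up vector v (hypothetical values, not the cdc data) to show that the two approaches agree.

```r
# Made-up values, used only to show the calls
v <- c(1, 2, 3, 4, 5)

# Manual route: third quartile minus first quartile (default quantile type)
q <- quantile(v, c(0.25, 0.75))
iqr_manual <- unname(q[2] - q[1])

# Built-in route
iqr_builtin <- IQR(v)

print(c(manual = iqr_manual, builtin = iqr_builtin))
```

On the lab data the equivalent calls would be IQR(cdc$height) and IQR(cdc$age).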

Relative frequency distribution by gender

table(cdc$gender)/20000

## 
##       m       f 
## 0.47845 0.52155

There are 9569 males in this sample.

table(cdc$gender)

## 
##     m     f 
##  9569 10431

Out of 20000 cases, 4657 report being in excellent health, a proportion of 4657/20000 ≈ 0.233.

table(cdc$genhlth)

## 
## excellent very good      good      fair      poor 
##      4657      6972      5675      2019       677

table(cdc$genhlth == 'excellent')

## 
## FALSE  TRUE 
## 15343  4657
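Counts can be turned into proportions without typing the sample size by hand. This is a small sketch on a made-up factor called health (hypothetical values, not the cdc data), using prop.table() for the full relative frequency table and the mean of a logical vector for a single level.

```r
# Made-up responses, used only to show the calls
health <- factor(c("excellent", "good", "good", "fair", "excellent"))

# Relative frequency of every level
rel_freq <- prop.table(table(health))
print(rel_freq)

# Proportion for one level: the mean of TRUE/FALSE is the share of TRUE
prop_excellent <- mean(health == "excellent")
print(prop_excellent)
```

For the sample above, the same idea gives 4657 / 20000 = 0.23285 for excellent health.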

Exercise 3: What does the mosaic plot reveal about smoking habits and gender?

A larger proportion of males than females have smoked at least 100 cigarettes in their lifetime; a larger proportion of females have smoked fewer than 100.

habitsbygender <- table(cdc$gender,cdc$smoke100)
mosaicplot(habitsbygender)

Exercise 4: Create a new object called under23_and_smoke that contains all observations of respondents under the age of 23 that have smoked 100 cigarettes in their lifetime. Write the command you used to create the new object as the answer to this exercise.

# Object name as requested in the exercise; smoke100 is coded 0/1
under23_and_smoke <- subset(cdc, age < 23 & smoke100 == 1)
head(under23_and_smoke)

##       genhlth exerany hlthplan smoke100 height weight wtdesire age gender
## 13  excellent       1        0        1     66    185      220  21      m
## 37  very good       1        0        1     70    160      140  18      f
## 96  excellent       1        1        1     74    175      200  22      m
## 180      good       1        1        1     64    190      140  20      f
## 182 very good       1        1        1     62     92       92  21      f
## 240 very good       1        0        1     64    125      115  22      f

Exercise 5: What does this box plot show?

bmi <- (cdc$weight / cdc$height^2) * 703
boxplot(bmi ~ cdc$genhlth)

The box plot compares the distribution of BMI across the levels of general health for all respondents.

There is a considerable number of outliers at every level, with BMI values beyond the upper whisker, roughly between 30 and 40. This makes me think that in the first three levels it should not be possible to have healthy respondents with a BMI over 30. These cases might be errors, so the data should be reviewed. The median BMI differs only slightly between levels; the values are very close.
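The BMI calculation used for this plot (weight in pounds divided by height in inches squared, times 703) can be wrapped in a small function to check individual values. This is a sketch; the function name bmi_imperial and the example inputs are my own and not part of the lab.

```r
# Hypothetical helper: BMI from weight in pounds and height in inches,
# using the same formula as the bmi object in this exercise
bmi_imperial <- function(weight_lb, height_in) {
  (weight_lb / height_in^2) * 703
}

# Made-up respondent: 150 lb, 65 in
example_bmi <- bmi_imperial(150, 65)
print(round(example_bmi, 1))
```

Checking a few suspicious rows this way is one quick way to see whether an extreme BMI comes from an implausible height or weight entry.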

Exercise 5: Pick another categorical variable from the data set and see how it relates to BMI. List the variable you chose, why you might think it would have a relationship to BMI, and indicate what the figure seems to suggest.

BMI and exerany: The box plot suggests that respondents who exercised in the past month tend to have lower BMI values than respondents who did not exercise in the same period. Both groups show many outliers beyond the whiskers.

boxplot(bmi ~ cdc$exerany)

Make a scatterplot of weight versus desired weight. Describe the relationship between these two variables.

plot(cdc$weight, cdc$wtdesire , xlab = 'Weight', ylab = 'Weight Desire')

The scatterplot shows respondents’ actual weight against their desired weight. The relationship between the two does not look linear; desired weight increases slightly as actual weight increases.

Let’s consider a new variable: the difference between desired weight (wtdesire) and current weight (weight). Create this new variable by subtracting the two columns in the data frame and assigning them to a new object called wdiff.

# Sign convention used here: current weight minus desired weight
wdiff <- cdc$weight - cdc$wtdesire
head(wdiff)

## [1]  0 10  0  8 20  0

wdiff[901]

## [1] -10

What type of data is wdiff? If an observation wdiff is 0, what does this mean about the person’s weight and desired weight. What if wdiff is positive or negative?

class(wdiff)

## [1] "integer"

If wdiff is 0, the respondent is at their desired weight. A positive integer is the number of pounds the respondent would have to lose to reach their desired weight. A negative integer means the actual weight is less than the desired weight, so the respondent considers themselves underweight.

Describe the distribution of wdiff in terms of its center, shape, and spread, including any plots you use. What does this tell us about how people feel about their current weight?

Over 10000 respondents are not comfortable with their actual weight.

hist(wdiff)

5. Using numerical summaries and a side-by-side box plot, determine if men tend to view their weight differently than women.

male_wdiff <- subset(cdc, cdc$gender == 'm')
summary(male_wdiff$weight)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    78.0   165.0   185.0   189.3   210.0   500.0

dim(male_wdiff)

## [1] 9569    9

fem_wdiff <- subset(cdc, cdc$gender == 'f')
summary(fem_wdiff$weight)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    68.0   128.0   145.0   151.7   170.0   495.0

dim(fem_wdiff)

## [1] 10431     9

# The plotted quantity is current weight minus desired weight, split by gender
boxplot(cdc$weight - cdc$wtdesire ~ cdc$gender, ylab = 'Weight - Desired Weight', outline = FALSE, main = "Weight Difference by Gender")

I set outline = FALSE to hide the outliers in the plot and make the boxes easier to read.

The median weight difference for men is close to 0, which suggests that men tend to be near their desired weight. The median for women is slightly further from zero: they show a clear difference between their current weight and their desired weight, with the current weight being greater.
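The numerical side of this comparison can be done on the weight difference itself, grouped by gender. This is a minimal sketch on a made-up data frame called toy (hypothetical values, not the cdc data), using tapply() to get the median difference for each group.

```r
# Made-up respondents, used only to show the calls
toy <- data.frame(
  weight   = c(180, 200, 150, 140, 165),
  wtdesire = c(180, 190, 130, 135, 150),
  gender   = factor(c("m", "m", "f", "f", "f"))
)

# Current weight minus desired weight, same sign convention as in this lab
toy$wdiff <- toy$weight - toy$wtdesire

# Median difference within each gender
median_by_gender <- tapply(toy$wdiff, toy$gender, median)
print(median_by_gender)
```

Replacing median with summary in the tapply() call would give the full five-number summary per group.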

Now it’s time to get creative. Find the mean and standard deviation of weight and determine what proportion of the weights are within one standard deviation of the mean.

# Mean weight
x <- mean(cdc$weight, na.rm = TRUE)
x

## [1] 169.683

# Standard deviation of weight
y <- sd(cdc$weight, na.rm = TRUE)
y

## [1] 40.08097

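# Keep respondents whose weight is within one standard deviation of the mean
z <- subset(cdc, weight > (x - y) & weight < (x + y))
# Use row counts only; dim() would also divide the column counts
prop_weight <- nrow(z) / nrow(cdc) * 100
prop_weight

## [1] 70.76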

About 70.76% of respondents have weights within one standard deviation of the mean.
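The same proportion can be computed without building a subset, by taking the mean of a logical vector. This is a sketch on a made-up vector w (hypothetical weights, not the cdc data); the variable names are my own.

```r
# Made-up weights, used only to show the calculation
w <- c(120, 150, 160, 170, 180, 190, 250)

m <- mean(w)
s <- sd(w)

# TRUE where the value is strictly within one standard deviation of the mean
within_one_sd <- abs(w - m) < s

# The mean of TRUE/FALSE is the proportion of TRUE
prop_within <- mean(within_one_sd)
print(prop_within)
```

Multiplying prop_within by 100 gives a percentage comparable to the figure reported above.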

Lab1

Durley Torres-Marin