library(DATA606)
##
## Welcome to CUNY DATA606 Statistics and Probability for Data Analytics
## This package is designed to support this course. The text book used
## is OpenIntro Statistics, 3rd Edition. You can read this by typing
## vignette('os3') or visit www.OpenIntro.org.
##
## The getLabs() function will return a list of the labs available.
##
## The demo(package='DATA606') will list the demos that are available.
##
## Attaching package: 'DATA606'
## The following object is masked from 'package:utils':
##
## demo
setwd("~/R/Lab1")
source("more/cdc.R")
This data set has 20000 cases and 9 variables.
dim(cdc)
## [1] 20000 9
tail(cdc)
## genhlth exerany hlthplan smoke100 height weight wtdesire age
## 19995 good 0 1 1 69 224 224 73
## 19996 good 1 1 0 66 215 140 23
## 19997 excellent 0 1 0 73 200 185 35
## 19998 poor 0 1 0 65 216 150 57
## 19999 good 1 1 0 67 165 165 81
## 20000 good 1 1 1 69 170 165 83
## gender
## 19995 m
## 19996 f
## 19997 m
## 19998 f
## 19999 f
## 20000 m
Variables:
genhlth: categorical, ordinal
exerany: categorical, binary
hlthplan: categorical, binary
smoke100: categorical, binary
height: numerical, discrete
weight: numerical, continuous
wtdesire: numerical, continuous
age: numerical, discrete
gender: categorical, nominal
summary(cdc$height)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 48.00 64.00 67.00 67.18 70.00 93.00
# Interquartile range (Q3 - Q1)
70 - 64
## [1] 6
summary(cdc$age)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 18.00 31.00 43.00 45.07 57.00 99.00
# Interquartile range (Q3 - Q1)
57 - 31
## [1] 26
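The same interquartile ranges can be computed directly rather than by subtracting the quartiles by hand. A minimal sketch, using a small made-up vector (the `heights` values are illustrative, not drawn from `cdc`):

```r
# Illustrative heights in inches; hypothetical values, not from cdc
heights <- c(60, 64, 66, 67, 68, 70, 72)

# First and third quartiles, using R's default quantile method
q <- quantile(heights, probs = c(0.25, 0.75))

# IQR by hand (Q3 - Q1) and with the built-in helper
iqr_manual  <- unname(q[2] - q[1])
iqr_builtin <- IQR(heights)

iqr_manual
iqr_builtin
```

On the lab data, `IQR(cdc$height)` and `IQR(cdc$age)` should agree with the subtractions above.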
table(cdc$gender)/20000
##
## m f
## 0.47845 0.52155
table(cdc$gender)
##
## m f
## 9569 10431
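Dividing by a hard-coded 20000 works here, but the proportions can also be derived from the table itself, which avoids retyping the sample size. A small sketch with a made-up factor (the `gender` values below are illustrative only):

```r
# Hypothetical five-person sample; not the cdc data
gender <- factor(c("m", "f", "f", "m", "f"), levels = c("m", "f"))

counts <- table(gender)

# Relative frequencies computed from the table itself
props <- prop.table(counts)

# Equivalent manual version: counts divided by the number of observations
props_manual <- counts / length(gender)

props
```

With the lab data this would read `prop.table(table(cdc$gender))`.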
table(cdc$genhlth)
##
## excellent very good good fair poor
## 4657 6972 5675 2019 677
table(cdc$genhlth == 'excellent')
##
## FALSE TRUE
## 15343 4657
A larger share of men than women have smoked at least 100 cigarettes in their lifetime; a greater number of women have smoked fewer than 100.
habitsbygender <- table(cdc$gender,cdc$smoke100)
mosaicplot(habitsbygender)
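The mosaic plot can be backed by row proportions, which give the share of smokers within each gender. A sketch on assumed toy data (the `toy` data frame is hypothetical; only the column names mirror `cdc`):

```r
# Hypothetical data: 4 men and 4 women, smoke100 coded 1 = yes, 0 = no
toy <- data.frame(
  gender   = factor(c("m", "m", "m", "m", "f", "f", "f", "f"),
                    levels = c("m", "f")),
  smoke100 = c(1, 1, 1, 0, 1, 0, 0, 1)
)

# Two-way table of counts: rows are gender, columns are smoke100
tab <- table(toy$gender, toy$smoke100)

# margin = 1 divides each row by its total, so every row sums to 1
row_props <- prop.table(tab, margin = 1)
row_props
```

Reading across a row then answers "what fraction of this gender has smoked 100 cigarettes", which is the comparison the plot is used for.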
resp_under23_smoke <- subset(cdc, age < 23 & smoke100 == 1)
head(resp_under23_smoke)
## genhlth exerany hlthplan smoke100 height weight wtdesire age gender
## 13 excellent 1 0 1 66 185 220 21 m
## 37 very good 1 0 1 70 160 140 18 f
## 96 excellent 1 1 1 74 175 200 22 m
## 180 good 1 1 1 64 190 140 20 f
## 182 very good 1 1 1 62 92 92 21 f
## 240 very good 1 0 1 64 125 115 22 f
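`subset()` evaluates its condition inside the data frame, so columns can be named without the `cdc$` prefix. The same filter can also be written with logical indexing; a sketch on a hypothetical three-column data frame:

```r
# Hypothetical respondents; values are made up for illustration
toy <- data.frame(
  age      = c(18, 21, 22, 30, 45),
  smoke100 = c(1, 0, 1, 1, 0),
  gender   = c("f", "m", "m", "f", "m")
)

# Filter with subset(): column names are looked up inside toy
by_subset <- subset(toy, age < 23 & smoke100 == 1)

# Same filter with a logical row index
by_index <- toy[toy$age < 23 & toy$smoke100 == 1, ]

# Compare the two results
all.equal(by_subset, by_index)
```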
bmi <- (cdc$weight / cdc$height^2) * 703
boxplot(bmi ~ cdc$genhlth)
The box plot gives a clear comparison of two variables: general health and BMI for all respondents.
Every health level has a considerable number of outliers beyond the upper whisker, which ends somewhere between 30 and 40. This makes me think that, in the first three levels, there should not be any healthy respondents with a BMI over 30. These cases might be errors, so the data needs to be reviewed. The medians differ only slightly between levels; the BMI values are very close.
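The visual comparison of medians can be checked numerically by summarizing BMI within each health level. A sketch assuming a hypothetical toy data frame (all values are made up; the BMI formula is the one used above):

```r
# Hypothetical respondents: weight in pounds, height in inches
toy <- data.frame(
  weight  = c(150, 180, 200, 140, 220, 160),
  height  = c(65, 70, 72, 64, 68, 66),
  genhlth = factor(c("excellent", "good", "poor", "excellent", "poor", "good"),
                   levels = c("excellent", "good", "poor"))
)

# BMI from pounds and inches: weight / height^2 * 703
toy$bmi <- (toy$weight / toy$height^2) * 703

# Median BMI within each general-health level
median_bmi <- tapply(toy$bmi, toy$genhlth, median)
round(median_bmi, 1)
```

On the lab data the equivalent call would be `tapply(bmi, cdc$genhlth, median)`.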
BMI and exerany: the box plot shows that respondents who exercised in the past month tend to have lower BMI values than respondents who did not exercise in the same period. Both groups have outliers beyond the whiskers.
boxplot(bmi ~ cdc$exerany)
plot(cdc$weight, cdc$wtdesire , xlab = 'Weight', ylab = 'Weight Desire')
The scatterplot shows respondents' actual weight against their desired weight. The relationship between actual and desired weight is non-linear; desired weight increases slightly as actual weight increases.
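One way to quantify what the scatterplot suggests is a correlation coefficient together with a fitted slope. A sketch on made-up values (both vectors are hypothetical); a slope below 1 would mean desired weight rises more slowly than actual weight:

```r
# Hypothetical actual and desired weights in pounds
weight   <- c(120, 150, 180, 210, 240)
wtdesire <- c(120, 140, 165, 185, 200)

# Strength of the linear association
r <- cor(weight, wtdesire)

# Straight-line fit: desired weight as a function of actual weight
fit   <- lm(wtdesire ~ weight)
slope <- unname(coef(fit)["weight"])

r
slope
```

A straight-line fit is only a rough summary here; if the relationship is curved, the residuals of `fit` would show a pattern.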
wdiff <- cdc$weight - cdc$wtdesire
head(wdiff)
## [1] 0 10 0 8 20 0
wdiff[901]
## [1] -10
class(wdiff)
## [1] "integer"
If the observation is 0, the respondent is at their desired weight. A positive integer is the number of pounds the respondent would have to lose to reach their desired weight. A negative integer means the actual weight is less than the desired weight, so the respondent would like to gain weight.
Over 10000 respondents are not comfortable with their actual weight.
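The three cases can be counted by taking the sign of each difference. A sketch using only the seven values printed above (the first six entries of `wdiff` plus entry 901), so the counts describe this small sample and not the full data set:

```r
# The six head(wdiff) values followed by wdiff[901], as printed above
wdiff_sample <- c(0, 10, 0, 8, 20, 0, -10)

# sign() maps each difference to -1, 0 or 1; label the three cases
sign_counts <- table(factor(sign(wdiff_sample),
                            levels = c(-1, 0, 1),
                            labels = c("wants to gain",
                                       "at desired weight",
                                       "wants to lose")))

# Respondents whose actual and desired weights differ
n_not_at_desired <- sum(wdiff_sample != 0)

sign_counts
n_not_at_desired
```

Applied to the full vector, `sum(wdiff != 0)` would give the count behind the statement about respondents who are not at their desired weight.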
hist(wdiff)
5. Using numerical summaries and a side-by-side box plot, determine if men tend to view their weight differently than women.
male_wdiff <- subset(cdc, cdc$gender == 'm')
summary(male_wdiff$weight)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 78.0 165.0 185.0 189.3 210.0 500.0
dim(male_wdiff)
## [1] 9569 9
fem_wdiff <- subset(cdc, cdc$gender == 'f')
summary(fem_wdiff$weight)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 68.0 128.0 145.0 151.7 170.0 495.0
dim(fem_wdiff)
## [1] 10431 9
boxplot(cdc$weight - cdc$wtdesire ~ cdc$gender, ylab = 'Weight - Desired Weight', outline = FALSE, main = "Weight Difference by Gender")
I set outline = FALSE to hide the outliers and make the boxes easier to read.
The median weight difference for men is close to 0, so I infer that men tend to stay near their desired weight. The median for women is slightly above zero: their current weight is typically greater than their desired weight.
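The numerical summaries above describe `weight` itself; to compare how each group views its weight, the difference can be summarized by gender directly. A sketch on a hypothetical toy data frame (made-up values, column names as in `cdc`):

```r
# Hypothetical respondents: actual and desired weight in pounds
toy <- data.frame(
  weight   = c(180, 200, 170, 150, 140, 165),
  wtdesire = c(180, 190, 170, 135, 130, 150),
  gender   = factor(c("m", "m", "m", "f", "f", "f"), levels = c("m", "f"))
)

# Positive values mean the respondent wants to lose weight
toy$wdiff <- toy$weight - toy$wtdesire

# Median difference and full five-number summary for each gender
med_by_gender     <- tapply(toy$wdiff, toy$gender, median)
summary_by_gender <- tapply(toy$wdiff, toy$gender, summary)

med_by_gender
summary_by_gender
```

On the lab data, `tapply(wdiff, cdc$gender, summary)` would give the numbers that correspond to the side-by-side box plot.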
x <- cdc$weight
x <- mean(x, na.rm = TRUE)
x
## [1] 169.683
y <- cdc$weight
y <- sd(y, na.rm = TRUE)
y
## [1] 40.08097
z <- subset(cdc, weight > (x - y) & weight < (x + y))
# Use row counts; dim() returns rows and columns, which yields two values
prop_weight <- nrow(z) / nrow(cdc) * 100
prop_weight
## [1] 70.76
About 70.76% of respondents have a weight within one standard deviation of the mean.
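The same proportion can be computed in one step by averaging a logical vector, and compared with the roughly 68% expected under a normal distribution. A sketch on a made-up vector (`w` is hypothetical, not the `cdc` weights):

```r
# Hypothetical weights in pounds
w <- c(120, 135, 150, 160, 165, 170, 180, 195, 210, 260)

m <- mean(w)
s <- sd(w)

# mean() of a logical vector is the fraction of TRUE values
pct_within_1sd <- mean(w > m - s & w < m + s) * 100

# Share of a normal distribution within one standard deviation of its mean
pct_normal <- (pnorm(1) - pnorm(-1)) * 100

pct_within_1sd
round(pct_normal, 1)
```

On the lab data the one-liner would be `mean(cdc$weight > x - y & cdc$weight < x + y) * 100`, with `x` and `y` as defined above.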