source("more/cdc.R")
dim(cdc)
## [1] 20000     9
Each row is a case, and each column is a variable, so there are 20000 cases and 9 variables.
We first get the variable names:
names(cdc)
## [1] "genhlth"  "exerany"  "hlthplan" "smoke100" "height"   "weight"
## [7] "wtdesire" "age"      "gender"
height and age, and compute the interquartile range for each. Compute the relative frequency distribution for gender and exerany. How many males are in the sample? What proportion of the sample reports being in excellent health?
summary(cdc$height)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   48.00   64.00   67.00   67.18   70.00   93.00
The interquartile range for height is
70 - 64
## [1] 6
summary(cdc$age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   18.00   31.00   43.00   45.07   57.00   99.00
The interquartile range for age is
57 - 31
## [1] 26
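The subtractions above read the quartiles off the printed summary() output by hand. As a cross-check, base R's quantile() and IQR() compute the same quantity directly. The sketch below uses a small made-up vector (`heights`, hypothetical values) instead of the cdc data so that it runs on its own; with the data loaded, the corresponding call would be IQR(cdc$height).

```r
# Hypothetical heights in inches; a stand-in for cdc$height
heights <- c(60, 62, 64, 66, 67, 68, 70, 72, 75)

# First and third quartiles, as reported by summary()
q <- quantile(heights, probs = c(0.25, 0.75))

# Interquartile range computed by hand and with the built-in helper
iqr_manual  <- unname(q[2] - q[1])
iqr_builtin <- IQR(heights)

print(iqr_manual)
print(iqr_builtin)
```

Because summary() prints rounded values, the built-in call is the safer choice when the quartiles are not whole numbers.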
The relative frequency for gender is
table(cdc$gender)/20000
##
##       m       f
## 0.47845 0.52155
The relative frequency for exerany is
table(cdc$exerany)/20000
##
##      0      1
## 0.2543 0.7457
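Dividing by 20000 works because the sample size is known from dim(cdc). A variant that avoids typing the sample size is prop.table(), which divides a table by its own total. The following is a minimal sketch on a made-up factor (`gender` here is hypothetical data, not the cdc column):

```r
# Hypothetical responses; a stand-in for cdc$gender
gender <- factor(c("m", "f", "f", "m", "f"), levels = c("m", "f"))

counts <- table(gender)

# Dividing by the sample size, as done above with 20000
rel_manual <- counts / length(gender)

# prop.table() divides by the table total, so no sample size is typed in
rel_freq <- prop.table(counts)

print(rel_manual)
print(rel_freq)
```

Both versions give the same relative frequencies; the second keeps working if rows are later filtered out of the data.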
The number of males in the sample is
table(cdc$gender)
##
##     m     f
##  9569 10431
table(cdc$genhlth)
##
## excellent very good      good      fair      poor
##      4657      6972      5675      2019       677
Since excellent is the first column, we can run the following command to get the proportion
table(cdc$genhlth)[1]/20000
## excellent
##   0.23285
~23.3% of the sample reported being in excellent overall health.
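Picking the first column relies on "excellent" being the first level. A sketch of two alternatives that do not depend on column order is shown below, using a made-up factor (`genhlth` here is hypothetical data): indexing the table by level name, and taking the mean of a logical vector.

```r
# Hypothetical responses; a stand-in for cdc$genhlth
genhlth <- factor(
  c("excellent", "good", "very good", "excellent",
    "poor", "fair", "good", "excellent"),
  levels = c("excellent", "very good", "good", "fair", "poor")
)

# Index the table by level name instead of by column position
prop_by_name <- prop.table(table(genhlth))["excellent"]

# Equivalent: the mean of a logical vector is the proportion of TRUEs
prop_by_mean <- mean(genhlth == "excellent")

print(prop_by_name)
print(prop_by_mean)
```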
mosaicplot(table(cdc$gender, cdc$smoke100))
We can see that more males than females reported having smoked 100 cigarettes in their lifetime.
under23_and_smoke that contains all observations of respondents under the age of 23 that have smoked 100 cigarettes in their lifetime. Write the command you used to create the new object as the answer to this exercise.
under23_and_smoke <- subset(cdc, age < 23 & smoke100 == "1")
nrow(under23_and_smoke)
## [1] 620
There are 620 people that match these criteria.
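For comparison, the same kind of filter can be written with bracket indexing, and the matching rows can be counted without creating a new object. This is a sketch on a made-up data frame (`toy` and its values are hypothetical); it assumes smoke100 is coded 0/1, as the exerany output above suggests for the yes/no columns.

```r
# Hypothetical mini data frame with the two columns used in the filter
toy <- data.frame(
  age      = c(19, 22, 23, 40, 21, 18),
  smoke100 = c(1, 0, 1, 1, 1, 0)
)

# Filter written with subset(): column names are looked up inside toy
young_smokers <- subset(toy, age < 23 & smoke100 == 1)

# Equivalent bracket indexing; the trailing comma keeps all columns
young_smokers_idx <- toy[toy$age < 23 & toy$smoke100 == 1, ]

# Counting matches directly: TRUE counts as 1 in sum()
n_match <- sum(toy$age < 23 & toy$smoke100 == 1)

print(young_smokers)
print(n_match)
```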
bmi <- (cdc$weight / cdc$height^2) * 703
boxplot(bmi ~ cdc$genhlth)
We can see that bmi is inversely correlated with reported health condition: respondents who report worse health tend to have a higher bmi.
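To make the bmi formula concrete, here is a small worked example wrapped in a helper function. The name `bmi_imperial` and the sample values are hypothetical; the sketch assumes weight is in pounds and height in inches, which is what the factor 703 implies (it converts lb/in^2 to kg/m^2).

```r
# BMI from weight in pounds and height in inches, same formula as above
bmi_imperial <- function(weight_lb, height_in) {
  (weight_lb / height_in^2) * 703
}

# Hypothetical respondent: 150 lb, 65 in
bmi_example <- bmi_imperial(150, 65)
print(round(bmi_example, 2))

# The arithmetic is vectorised, so whole columns can be passed at once
bmi_vec <- bmi_imperial(c(150, 200), c(65, 70))
print(round(bmi_vec, 2))
```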
I would imagine that bmi would also be negatively correlated with exerany. To check, we can enter
boxplot(bmi ~ cdc$exerany)
Aside from a few outliers, we can see that this is indeed the case.
We can plot this using
plot(cdc$weight, cdc$wtdesire)
There is a positive correlation between weight and desired weight. That is, people who weigh more tend to report a higher desired weight than people who weigh less.
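The visual impression can be backed by a number. The sketch below computes the Pearson correlation with cor() and adds a reference line where desired weight equals current weight. The two vectors are made-up stand-ins for the cdc columns, so the printed value says nothing about the real data.

```r
# Hypothetical current and desired weights; stand-ins for the cdc columns
weight   <- c(120, 150, 180, 210, 250)
wtdesire <- c(120, 140, 170, 190, 210)

# Pearson correlation: values near 1 indicate a strong positive linear relationship
r <- cor(weight, wtdesire)
print(r)

# Scatterplot with the dashed line wtdesire = weight for reference;
# points below the line are people who want to weigh less than they do
plot(weight, wtdesire)
abline(a = 0, b = 1, lty = 2)
```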
wtdesire) and current weight (weight). Create this new variable by subtracting the two columns in the data frame and assigning them to a new object called wdiff.
wdiff <- (cdc$wtdesire - cdc$weight)
wdiff? If an observation wdiff is 0, what does this mean about the person’s weight and desired weight? What if wdiff is positive or negative?
Assuming that we only record integers (round to the nearest lb/kg), wdiff is a numerical (discrete) variable. If wdiff is 0, the person is at their desired weight. If wdiff is x < 0, the person wants to lose |x| lbs/kgs. If wdiff is x > 0, the person wants to gain x lbs/kgs.
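The sign convention can be spelled out in code. This is a small sketch on made-up values (the vectors and the `goal` labels are hypothetical) that maps each wdiff to a description and uses abs() for the size of the desired change.

```r
# Hypothetical weights; stand-ins for cdc$weight and cdc$wtdesire
weight   <- c(180, 150, 130)
wtdesire <- c(165, 150, 140)

# Desired minus current, same order as in the definition of wdiff
wdiff <- wtdesire - weight

# Translate the sign of wdiff into words
goal <- ifelse(wdiff < 0, "wants to lose",
        ifelse(wdiff > 0, "wants to gain", "at desired weight"))

# abs() gives the size of the desired change regardless of direction
print(data.frame(wdiff, goal, amount = abs(wdiff)))
```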
wdiff in terms of its center, shape, and spread, including any plots you use. What does this tell us about how people feel about their current weight?
First, let’s get a summary
summary(wdiff)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
## -300.00  -21.00  -10.00  -14.59    0.00  500.00
And a few graphs
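boxplot(wdiff)
hist(wdiff)
hist(wdiff, breaks = 100)
The vast majority are close to their ideal weight, with a few extreme cases. The median is -10 and the third quartile is 0, so at least half of the respondents want to lose 10 lbs/kgs or more. The average amount of weight people want to lose is ~14.6 lbs/kgs (the mean of wdiff is -14.59).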
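Since a few extreme values pull the mean around, it can help to report a robust center and spread, along with the share of respondents on each side of zero. The sketch below does this for a made-up vector (hypothetical values, not the real wdiff).

```r
# Hypothetical differences (desired minus current); a stand-in for wdiff
wdiff <- c(-40, -25, -20, -10, -10, -5, 0, 0, 5, 15)

# Robust center and spread
center <- median(wdiff)
spread <- IQR(wdiff)

# Shares wanting to lose weight, stay the same, or gain weight
shares <- c(
  lose = mean(wdiff < 0),
  same = mean(wdiff == 0),
  gain = mean(wdiff > 0)
)

print(center)
print(spread)
print(shares)
```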
We’ll start by isolating each gender, finding their respective wdiff, listing a summary of each, and, finally, plotting wdiff respective to each gender.
# I looked up how to better specify what I want from the subset function (?subset), and was able to perform the following.
mwdiff <- subset(cdc, gender == "m", select = c(wtdesire, weight))
summary(mwdiff)
## wtdesire weight
## Min. : 77.0 Min. : 78.0
## 1st Qu.:160.0 1st Qu.:165.0
## Median :175.0 Median :185.0
## Mean :178.6 Mean :189.3
## 3rd Qu.:190.0 3rd Qu.:210.0
## Max. :680.0 Max. :500.0
fwdiff <- subset(cdc, gender == "f", select = c(wtdesire, weight))
summary(fwdiff)
## wtdesire weight
## Min. : 68.0 Min. : 68.0
## 1st Qu.:120.0 1st Qu.:128.0
## Median :130.0 Median :145.0
## Mean :133.5 Mean :151.7
## 3rd Qu.:145.0 3rd Qu.:170.0
## Max. :350.0 Max. :495.0
boxplot(wdiff ~ cdc$gender, ylim = c(-75,75))
From the tables and graphs, it’s evident that men are closer to their ideal weight, while women prefer to lose more weight.
Note: I limited the y-axis to zoom in on the boxes; the few extreme outliers on each side fall outside the plotted range.
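The mwdiff and fwdiff objects above hold the wtdesire and weight columns rather than the difference itself. One way to summarize the difference per gender directly is tapply(), sketched here on a made-up data frame (`toy` and its values are hypothetical).

```r
# Hypothetical data; a stand-in for the gender, weight and wtdesire columns of cdc
toy <- data.frame(
  gender   = factor(c("m", "m", "m", "f", "f", "f"), levels = c("m", "f")),
  weight   = c(190, 180, 200, 150, 140, 160),
  wtdesire = c(185, 175, 190, 130, 130, 140)
)

# Desired minus current weight for every row
toy$wdiff <- toy$wtdesire - toy$weight

# Median difference within each gender
med_by_gender <- tapply(toy$wdiff, toy$gender, median)
print(med_by_gender)
```

With the real data loaded, tapply(wdiff, cdc$gender, summary) would give the full five-number summary per group.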
weight and determine what proportion of the weights are within one standard deviation of the mean.
# Get an overall summary, including the mean
summary(cdc$weight)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##    68.0   140.0   165.0   169.7   190.0   500.0
# Function to get the sd
sd(cdc$weight)
## [1] 40.08097
cweight <- cdc$weight
# I found the inner expression for selecting data within 1 sd on Stack Overflow. I repurposed it for my own needs, and then used the NROW function to count the observations that fall within 1 sd of the mean.
withinOne <- NROW(cweight[abs(cweight - mean(cweight)) <= sd(cweight)])
# To find the proportion, simply divide by the sample size, i.e. 20000
withinOne / 20000
## [1] 0.7076
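The count-then-divide steps can be collapsed into one, because the mean of a logical vector is the proportion of TRUE values. Here is a minimal sketch on a made-up vector (`w` is hypothetical, not cdc$weight), with the NROW-based version alongside for comparison.

```r
# Hypothetical weights; a stand-in for cdc$weight
w <- c(120, 140, 150, 160, 165, 170, 180, 190, 210, 300)

# TRUE where the value lies within one standard deviation of the mean
within_one <- abs(w - mean(w)) <= sd(w)

# The mean of a logical vector is the proportion of TRUEs
prop_within <- mean(within_one)

# Same result via counting and dividing by the sample size
prop_counted <- NROW(w[within_one]) / length(w)

print(prop_within)
print(prop_counted)
```

For reference, a normal distribution would put about 68% of values within one standard deviation of the mean, so the 0.7076 found above is slightly higher than that benchmark.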