Exercise One

See the headings, the first few and the last few data rows

source("http://www.openintro.org/stat/data/cdc.R")
names(cdc)
## [1] "genhlth"  "exerany"  "hlthplan" "smoke100" "height"   "weight"  
## [7] "wtdesire" "age"      "gender"
head(cdc)
##     genhlth exerany hlthplan smoke100 height weight wtdesire age gender
## 1      good       0        1        0     70    175      175  77      m
## 2      good       0        1        1     64    125      115  33      f
## 3      good       1        1        1     60    105      105  49      f
## 4      good       1        1        0     66    132      124  42      f
## 5 very good       0        1        0     61    150      130  55      f
## 6 very good       1        1        0     64    114      114  55      f
tail(cdc)
##         genhlth exerany hlthplan smoke100 height weight wtdesire age
## 19995      good       0        1        1     69    224      224  73
## 19996      good       1        1        0     66    215      140  23
## 19997 excellent       0        1        0     73    200      185  35
## 19998      poor       0        1        0     65    216      150  57
## 19999      good       1        1        0     67    165      165  81
## 20000      good       1        1        1     69    170      165  83
##       gender
## 19995      m
## 19996      f
## 19997      m
## 19998      f
## 19999      f
## 20000      m

Learn more about the weight of the sample

summary(cdc$weight)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    68.0   140.0   165.0   169.7   190.0   500.0
190 - 140
## [1] 50
mean(cdc$weight) 
## [1] 169.683
var(cdc$weight)
## [1] 1606.484
median(cdc$weight)
## [1] 165
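
The 190 - 140 step above is the interquartile range worked out by hand. As a small sketch on made-up weights (the vector w is an assumption for illustration, not cdc data), base R's IQR() gives the same number in one call, and the standard deviation is just the square root of the variance:

```r
# Toy weights in pounds (made up for illustration, not from cdc)
w <- c(120, 140, 150, 165, 170, 190, 210, 250)

q <- quantile(w, c(0.25, 0.75))       # first and third quartiles
iqr_by_hand <- unname(q[2] - q[1])    # same idea as 190 - 140
iqr_builtin <- IQR(w)                 # one-call version

# The standard deviation is the square root of the variance
sd_from_var <- sqrt(var(w))

c(iqr_by_hand = iqr_by_hand, iqr_builtin = iqr_builtin)
c(sd_from_var = sd_from_var, sd_builtin = sd(w))
```

Applied to the output above, sqrt(1606.484) comes to about 40.08 pounds.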

Find out more about the data

See the total number of people who have smoked at least 100 cigarettes in their lifetimes, then find what percentage of the sample they make up. A little over half (52.8%) have not smoked 100 cigarettes.

table(cdc$smoke100)
## 
##     0     1 
## 10559  9441
table(cdc$smoke100)/20000
## 
##       0       1 
## 0.52795 0.47205
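
Dividing by 20000 works here because the sample size is known. A slightly more general sketch, using the counts copied from the output above, divides by the total instead; the object names (smoke_counts, smoke_share, toy) are only illustrative:

```r
# Counts copied from the table(cdc$smoke100) output above
smoke_counts <- c("0" = 10559, "1" = 9441)

# Dividing by the total avoids hard-coding 20000
smoke_share <- smoke_counts / sum(smoke_counts)
round(100 * smoke_share, 1)

# With a real table object, prop.table() does the same division
toy <- table(c(0, 0, 1, 1, 1))
prop.table(toy)
```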

Visualize the number of smokers v. nonsmokers in a bar graph

Assign the name “smoke” to the smoker table to make it easier to graph

smoke <- table(cdc$smoke100)

Numerical summary for height

summary(cdc$height)
mean(cdc$height)
var(cdc$height)
median(cdc$height)

Numerical summary for age

summary(cdc$age)
mean(cdc$age)
var(cdc$age)
median(cdc$age)
range(cdc$age)
table(cdc$age, cdc$exerany)/20000

Gender and General Health percentages

summary (cdc$gender)/20000
##       m       f 
## 0.47845 0.52155
summary (cdc$genhlth)/20000
## excellent very good      good      fair      poor 
##   0.23285   0.34860   0.28375   0.10095   0.03385

The age range is 18 to 99. The total number of males is 9,569, which is 47.85% of the sample. The share of people who report any exercise (I think that’s what “exerany” means) is 74.57% of the sample. The total number of people who are in excellent health is 4,657, which is 23.29% of the sample.

Exercise Two

Make a table that compares two different variables

table(cdc$gender,cdc$smoke100)
##    
##        0    1
##   m 4547 5022
##   f 6012 4419
mfsmoke <- table(cdc$gender,cdc$smoke100)

This table compares male and female smokers and nonsmokers. Men were more likely than women to have smoked 100 cigarettes (about 52% of men versus 42% of women), and the split between smokers and nonsmokers was closer to even among men.
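
To compare the two genders directly it helps to turn each row into proportions. A minimal sketch, rebuilding the table from the counts printed above (the dimnames labels are my own choice) and dividing each row by its total:

```r
# Counts copied from the table(cdc$gender, cdc$smoke100) output above
mfsmoke <- matrix(c(4547, 6012, 5022, 4419), nrow = 2,
                  dimnames = list(gender = c("m", "f"),
                                  smoke100 = c("0", "1")))

# margin = 1 divides each row by its own total
row_share <- prop.table(mfsmoke, margin = 1)
round(row_share, 3)
```

With the real table object, prop.table(table(cdc$gender, cdc$smoke100), margin = 1) should give the same proportions.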

Exercise Three

Learn more about the size of the data frame

dim(cdc)
## [1] 20000     9

There are 20,000 respondents and 9 variables in the cdc dataframe

Finding a specific data point in the cdc dataframe

-cdc[567,6]
-cdc[1:10, 6]
-cdc[1:10,]
-cdc[,6]

Making R find a vector using the $ does the same thing

-cdc$weight
-cdc$weight[567]
-cdc$weight[1:10]
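
A small sketch of the same indexing on a toy data frame (the values are the height and weight of the first three rows of head(cdc); the name toy is just for illustration), checking that the bracket and $ forms return the same thing:

```r
# Toy data frame (values copied from the first three rows of head(cdc))
toy <- data.frame(height = c(70, 64, 60),
                  weight = c(175, 125, 105))

toy[2, 2]        # row 2, column 2: a single value
toy[1:2, 2]      # rows 1 to 2 of column 2: a vector
toy[1:2, ]       # rows 1 to 2, all columns: a data frame
toy[, 2]         # every row of column 2: a vector

# The $ form reaches the same column by name
identical(toy[, 2], toy$weight)
identical(toy[1:2, 2], toy$weight[1:2])
```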

Make a subset

mdata <- subset(cdc, cdc$gender == "m")
head(mdata)
##      genhlth exerany hlthplan smoke100 height weight wtdesire age gender
## 1       good       0        1        0     70    175      175  77      m
## 7  very good       1        1        0     71    194      185  31      m
## 8  very good       0        1        0     67    170      160  45      m
## 10      good       1        1        0     70    180      170  44      m
## 11 excellent       1        1        1     69    186      175  46      m
## 12      fair       1        1        1     69    168      148  62      m

Make an even narrower subset

m_and_over_30 <- subset(cdc, gender == "m" & age >30)
head(m_and_over_30)
##      genhlth exerany hlthplan smoke100 height weight wtdesire age gender
## 1       good       0        1        0     70    175      175  77      m
## 7  very good       1        1        0     71    194      185  31      m
## 8  very good       0        1        0     67    170      160  45      m
## 10      good       1        1        0     70    180      170  44      m
## 11 excellent       1        1        1     69    186      175  46      m
## 12      fair       1        1        1     69    168      148  62      m
m_or_over_30 <- subset(cdc, gender == "m" | age >30)
head(m_or_over_30)
##     genhlth exerany hlthplan smoke100 height weight wtdesire age gender
## 1      good       0        1        0     70    175      175  77      m
## 2      good       0        1        1     64    125      115  33      f
## 3      good       1        1        1     60    105      105  49      f
## 4      good       1        1        0     66    132      124  42      f
## 5 very good       0        1        0     61    150      130  55      f
## 6 very good       1        1        0     64    114      114  55      f
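
The first six rows do not make the difference between & and | very obvious, so here is a tiny sketch on made-up data (four rows, one for each gender and age combination) where the row counts can be checked by eye:

```r
# Toy data (made up for illustration)
toy <- data.frame(gender = c("m", "m", "f", "f"),
                  age    = c(25, 40, 25, 40))

both   <- subset(toy, gender == "m" & age > 30)  # men who are also over 30
either <- subset(toy, gender == "m" | age > 30)  # all men, plus anyone over 30

nrow(both)    # 1
nrow(either)  # 3
```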

Exercise Four

Create a subset of smokers under 23

smokers <- subset(cdc, cdc$smoke100 == "1")
head(smokers)
##      genhlth exerany hlthplan smoke100 height weight wtdesire age gender
## 2       good       0        1        1     64    125      115  33      f
## 3       good       1        1        1     60    105      105  49      f
## 9       good       0        1        1     65    150      130  27      f
## 11 excellent       1        1        1     69    186      175  46      m
## 12      fair       1        1        1     69    168      148  62      m
## 13 excellent       1        0        1     66    185      220  21      m
young <- subset(cdc, cdc$age <23)
under23_and_smoke <- subset(cdc, age < 23 & smoke100 == "1")
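
One way to count the rows in a subset like under23_and_smoke and express the count as a share of the sample, sketched on made-up data (toy and the other names here are illustrative):

```r
# Toy data (made up for illustration)
toy <- data.frame(age      = c(19, 22, 30, 45, 21, 60),
                  smoke100 = c(1, 0, 1, 1, 1, 0))

young_smokers <- subset(toy, age < 23 & smoke100 == 1)

n_young_smokers <- nrow(young_smokers)   # how many rows matched
share <- n_young_smokers / nrow(toy)     # as a fraction of the sample

# The mean of a logical vector gives the same fraction directly
share_direct <- mean(toy$age < 23 & toy$smoke100 == 1)
```

On the real data the equivalent calls would be nrow(under23_and_smoke) and nrow(under23_and_smoke) / nrow(cdc).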

There are a total of 620 people under the age of 23 who have smoked 100 cigarettes or more. That is 3.1% of the total sample.

Summarize variables using a box plot

summary(cdc$height)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   48.00   64.00   67.00   67.18   70.00   93.00

-The shortest person was 48 inches tall, the tallest was 93 inches, and the median was 67 inches
-A breakdown by gender might be useful to compare more accurately. The ~ in this case means “height as a function of gender”

Not surprisingly, women tended to be shorter than men

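This line of code ran fine and created the boxplot I wanted, but when I tried to knit it into RMarkdown it gave me errors and wouldn’t create my document. The problem is the + joining the two calls: base R graphics functions are called one after another, not added together, so R fails when it tries to add what boxplot() and title() return. As separate calls it works:

boxplot(cdc$height ~ cdc$gender)
title("Height (inches)")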
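
A small sketch, on made-up heights, of why joining base graphics calls with + stops the knit, plus two ways around it. The data frame toy and the variable msg are illustrative; pdf(NULL) is only there so the sketch draws to a null device instead of writing a plot file:

```r
# Toy data (made up for illustration)
toy <- data.frame(height = c(70, 71, 67, 64, 60, 66),
                  gender = factor(c("m", "m", "m", "f", "f", "f")))

pdf(NULL)  # draw to a null device so no plot file is written

# boxplot() returns a list and title() returns NULL; "+" cannot add them
msg <- tryCatch({
  boxplot(height ~ gender, data = toy) + title("Height (inches)")
  "no error"
}, error = function(e) conditionMessage(e))
msg

# Option 1: separate calls, one after the other
boxplot(height ~ gender, data = toy)
title("Height (inches)")

# Option 2: pass the title through the main argument
boxplot(height ~ gender, data = toy, main = "Height (inches)")

invisible(dev.off())
```

Both pieces get drawn before the addition fails, which is probably why the plot looked right in the console. When knitting, an error stops the document unless the chunk allows errors.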

BMI

bmi <- (cdc$weight / cdc$height^2) * 703
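
As a quick check of the formula, here it is applied to the height and weight of the first three rows shown in head(cdc). The factor 703 converts pounds per square inch into kilograms per square meter; bmi_toy is just an illustrative name:

```r
# Height (inches) and weight (pounds) from the first three rows of head(cdc)
height <- c(70, 64, 60)
weight <- c(175, 125, 105)

# Same formula as above; 703 converts lb/in^2 to kg/m^2
bmi_toy <- (weight / height^2) * 703
round(bmi_toy, 1)
```

The first person (70 inches, 175 pounds) comes out at a BMI of about 25.1.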

This code also would not knit. The error was “non-numeric argument to binary operator”, which comes from the + between boxplot() and title(): R tries to add the two return values together. Written as two calls it knits:

boxplot(bmi ~ cdc$genhlth)
title("Body Mass Index as a Function of General Health")


Exercise Five

-This shows that the higher someone's BMI, the less likely they are to be in good health
-I chose to look at BMI as a function of age. What seems to happen is that BMIs are highest in middle age, then drop again slightly for the elderly

This gave the same error when written as boxplot(...) + title(...); as separate calls it runs:

boxplot(bmi ~ cdc$age)
title("BMI as a function of age")

Histograms

Histograms can also show something similar, especially with one variable

hist(cdc$age)
hist(bmi)

Having a high number of breaks shows the shape of the distribution in more detail
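
A sketch of what the breaks argument does, on simulated values standing in for bmi (the numbers are random, not cdc data). When breaks is a single number, R treats it as a suggestion and picks nearby round cut points, so the exact bar count can differ from what was asked for:

```r
set.seed(1)                            # reproducible made-up data
x <- rnorm(1000, mean = 26, sd = 5)    # stand-in for bmi

# plot = FALSE returns the bins without drawing them
coarse <- hist(x, breaks = 5,  plot = FALSE)
fine   <- hist(x, breaks = 50, plot = FALSE)

length(coarse$counts)  # number of bars with few breaks
length(fine$counts)    # number of bars with many breaks
```

On the real data the call would look like hist(bmi, breaks = 50).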

Other Exercises

1

A scatterplot of weight vs. desired weight shows that over about 220 pounds, most people wanted to weigh less
-For example, people who were around 150 pounds stayed fairly close to a line with slope 1, but heavier people had more points below it, at a lower desired weight

plot(cdc$weight, cdc$wtdesire)
abline(0, 1)  # reference line y = x: intercept 0, slope 1

2

Make a vector for the difference between what people weigh and what they want to weigh

wdiff = cdc$weight - cdc$wtdesire

3

wdiff is a new variable that shows how far someone's current weight is from the weight they want.
-If wdiff is 0 it means they are happy with their current weight
-If wdiff is negative it means they want to weigh more than they currently do
-If wdiff is positive it means they hope to lose weight
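
To make the sign rule concrete, here is a sketch using the weight and desired weight of the six people shown in head(cdc). The labels "gain", "stay" and "lose" and the names wdiff_toy and goal are my own, chosen only for illustration:

```r
# Weight and desired weight from the six rows of head(cdc)
weight   <- c(175, 125, 105, 132, 150, 114)
wtdesire <- c(175, 115, 105, 124, 130, 114)

wdiff_toy <- weight - wtdesire   # 0 10 0 8 20 0

# sign() gives -1, 0 or 1; the labels below are hypothetical names
goal <- factor(sign(wdiff_toy), levels = c(-1, 0, 1),
               labels = c("gain", "stay", "lose"))
table(goal)
```

Of these six people, three are happy with their weight and three want to lose some.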

4

-The first quartile of wdiff is 0 and the median is 10, so at least half of the sample wants to lose 10 pounds or more, and more people want to lose weight than gain it.
-The median difference between current and desired weight is ten pounds

summary(wdiff)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -500.00    0.00   10.00   14.59   21.00  300.00

The standard deviation is 24.05 pounds

5

When comparing men and women, you can see that women tended to want to lose weight more than gain it. Fewer people had a difference value of 0.

6

mean(cdc$weight)
## [1] 169.683
sd(cdc$weight)
## [1] 40.08097

I am struggling to figure out how to call up the summaries of a data frame restricted to a range of values. I know I’m supposed to start with cdc[ and somehow include the range from 129.602 to 209.764, because that would include all the data points within one standard deviation of the mean.
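
One possible approach, sketched on made-up weights rather than the real data: put a logical condition inside the brackets instead of a range written with a colon. The colon operator builds a sequence of numbers to use as row positions, so it would not pick rows by their weight. The names toy, lower, upper and within_1sd are illustrative:

```r
# Toy weights (made up for illustration, not from cdc)
toy <- data.frame(weight = c(110, 135, 150, 165, 170, 185, 200, 260))

lower <- mean(toy$weight) - sd(toy$weight)
upper <- mean(toy$weight) + sd(toy$weight)

# A logical condition inside [ ] keeps the rows where it is TRUE;
# drop = FALSE keeps the result a data frame even with one column
within_1sd <- toy[toy$weight >= lower & toy$weight <= upper, , drop = FALSE]

summary(within_1sd$weight)
nrow(within_1sd) / nrow(toy)   # share of rows within one SD of the mean
```

On the real data the same idea would be cdc[cdc$weight >= 129.602 & cdc$weight <= 209.764, ], or equivalently subset(cdc, weight >= 129.602 & weight <= 209.764), and summary() can then be run on the result.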