See the headings, the first few and the last few data rows
source("http://www.openintro.org/stat/data/cdc.R")
names(cdc)
## [1] "genhlth" "exerany" "hlthplan" "smoke100" "height" "weight"
## [7] "wtdesire" "age" "gender"
head(cdc)
## genhlth exerany hlthplan smoke100 height weight wtdesire age gender
## 1 good 0 1 0 70 175 175 77 m
## 2 good 0 1 1 64 125 115 33 f
## 3 good 1 1 1 60 105 105 49 f
## 4 good 1 1 0 66 132 124 42 f
## 5 very good 0 1 0 61 150 130 55 f
## 6 very good 1 1 0 64 114 114 55 f
tail(cdc)
## genhlth exerany hlthplan smoke100 height weight wtdesire age
## 19995 good 0 1 1 69 224 224 73
## 19996 good 1 1 0 66 215 140 23
## 19997 excellent 0 1 0 73 200 185 35
## 19998 poor 0 1 0 65 216 150 57
## 19999 good 1 1 0 67 165 165 81
## 20000 good 1 1 1 69 170 165 83
## gender
## 19995 m
## 19996 f
## 19997 m
## 19998 f
## 19999 f
## 20000 m
Learn more about the weight of the sample
summary(cdc$weight)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 68.0 140.0 165.0 169.7 190.0 500.0
190 - 140
## [1] 50
mean(cdc$weight)
## [1] 169.683
var(cdc$weight)
## [1] 1606.484
median(cdc$weight)
## [1] 165
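The spread figures above can also be computed directly instead of subtracting quartiles by hand. This is a minimal sketch on a made-up vector `w` (a hypothetical stand-in for `cdc$weight`), using base R's `IQR()` and `sd()`.

```r
# Hypothetical weights in pounds, standing in for cdc$weight
w <- c(120, 140, 150, 165, 170, 190, 210, 250)

q <- quantile(w, c(0.25, 0.75))  # first and third quartiles
q
IQR(w)        # third quartile minus first quartile
sd(w)         # standard deviation
sqrt(var(w))  # same value, computed from the variance
```

On the real data, `IQR(cdc$weight)` should match the `190 - 140` calculation.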
Find out more about the data
See the total number of people who have smoked at least 100 cigarettes in their lifetimes, then find the share of the sample they represent. A little over half (52.8%) have not smoked that many
table(cdc$smoke100)
##
## 0 1
## 10559 9441
table(cdc$smoke100)/20000
##
## 0 1
## 0.52795 0.47205
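Dividing by a hard-coded 20000 works only while the data frame has exactly that many rows. As a sketch of an alternative, the block below uses a made-up 0/1 vector (a hypothetical stand-in for `cdc$smoke100`) and lets R work out the total.

```r
# Hypothetical 0/1 indicator standing in for cdc$smoke100
smoke100 <- c(0, 1, 1, 0, 0, 1, 0, 0, 1, 0)

counts <- table(smoke100)
counts / length(smoke100)    # divide by the actual number of observations
props <- prop.table(counts)  # same proportions without computing the total
props
mean(smoke100 == 1)          # share of 1s directly
```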
Visualize the number of smokers vs. nonsmokers in a bar graph
Assign the table to “smoke” to make it easier to graph
smoke <- table(cdc$smoke100)
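The table stored in `smoke` can be passed straight to base R's `barplot()`. The sketch below rebuilds a small table from made-up values so it runs on its own; the bar labels and titles are illustrative choices, not taken from the original lab.

```r
# Hypothetical 0/1 indicator standing in for cdc$smoke100
smoke100 <- c(0, 1, 1, 0, 0, 1, 0, 0, 1, 0)
smoke <- table(smoke100)

# barplot() draws one bar per table entry and returns the bar midpoints
mids <- barplot(smoke,
                names.arg = c("No", "Yes"),
                main = "Smoked at least 100 cigarettes",
                ylab = "Respondents")
```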
numerical summary for height
summary(cdc$height)
mean(cdc$height)
var(cdc$height)
median(cdc$height)
numerical summary for age
summary(cdc$age)
mean(cdc$age)
var(cdc$age)
median(cdc$age)
range(cdc$age)
table(cdc$age, cdc$exerany)/20000
Gender and General Health percentages
summary (cdc$gender)/20000
## m f
## 0.47845 0.52155
summary (cdc$genhlth)/20000
## excellent very good good fair poor
## 0.23285 0.34860 0.28375 0.10095 0.03385
The age range is 18 to 99.
The total number of males is 9,569, which is 47.85% of the sample.
The share of people who reported any exercise (I think that’s what “exerany” means) is 74.57% of the sample.
The total number of people who are in excellent health is 4,657, which is 23.28% of the sample.
make a table that compares two different variables
table(cdc$gender,cdc$smoke100)
##
## 0 1
## m 4547 5022
## f 6012 4419
mfsmoke <- table(cdc$gender,cdc$smoke100)
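This table compares male and female smokers and nonsmokers. Men were more likely than women to have smoked 100 cigarettes (5,022 of 9,569 men, about 52%, versus 4,419 of 10,431 women, about 42%), and the split between smokers and nonsmokers was closer to even among men.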
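Row proportions make the gender comparison easier to read than raw counts. The sketch below rebuilds the two-way table from the counts printed above (so it runs without downloading the data) and uses `prop.table()` with `margin = 1`, which divides each row by its own total.

```r
# Counts copied from the table(cdc$gender, cdc$smoke100) output above
mfsmoke <- as.table(matrix(c(4547, 6012, 5022, 4419), nrow = 2,
                           dimnames = list(gender = c("m", "f"),
                                           smoke100 = c("0", "1"))))

# margin = 1 divides each row by its row total, so each row sums to 1
by_gender <- prop.table(mfsmoke, margin = 1)
round(by_gender, 3)
```

With the real data frame loaded, `prop.table(table(cdc$gender, cdc$smoke100), margin = 1)` should give the same result.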
Learn more about the size of the data frame
dim(cdc)
## [1] 20000 9
There are 20,000 respondents and 9 variables in the cdc data frame.
Finding a specific data point in the cdc data frame
-cdc[567,6]
-cdc[1:10, 6]
-cdc[1:10,]
-cdc[,6]
Making R find a vector using the $ does the same thing
-cdc$weight
-cdc$weight[567]
-cdc$weight[1:10]
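To make the two indexing styles concrete, here is a small sketch on a made-up data frame `toy` (hypothetical values, with `weight` as the second column rather than the sixth). The bracket form takes `[row, column]`, and leaving one side empty means "all of them".

```r
# Hypothetical miniature data frame; weight is column 2 here
toy <- data.frame(height = c(70, 64, 60, 66),
                  weight = c(175, 125, 105, 132),
                  gender = c("m", "f", "f", "f"))

toy[2, 2]          # row 2, column 2: a single value
toy[1:3, 2]        # rows 1 to 3 of column 2: a vector
toy[1:3, ]         # rows 1 to 3, all columns: a data frame
toy[, 2]           # every row of column 2: a vector
toy$weight[2]      # same value as toy[2, 2]
toy$weight[1:3]    # same vector as toy[1:3, 2]
```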
Make a subset
mdata <- subset(cdc, cdc$gender == "m")
head(mdata)
## genhlth exerany hlthplan smoke100 height weight wtdesire age gender
## 1 good 0 1 0 70 175 175 77 m
## 7 very good 1 1 0 71 194 185 31 m
## 8 very good 0 1 0 67 170 160 45 m
## 10 good 1 1 0 70 180 170 44 m
## 11 excellent 1 1 1 69 186 175 46 m
## 12 fair 1 1 1 69 168 148 62 m
Make an even narrower subset
m_and_over_30 <- subset(cdc, gender == "m" & age >30)
head(m_and_over_30)
## genhlth exerany hlthplan smoke100 height weight wtdesire age gender
## 1 good 0 1 0 70 175 175 77 m
## 7 very good 1 1 0 71 194 185 31 m
## 8 very good 0 1 0 67 170 160 45 m
## 10 good 1 1 0 70 180 170 44 m
## 11 excellent 1 1 1 69 186 175 46 m
## 12 fair 1 1 1 69 168 148 62 m
m_or_over_30 <- subset(cdc, gender == "m" | age >30)
head(m_or_over_30)
## genhlth exerany hlthplan smoke100 height weight wtdesire age gender
## 1 good 0 1 0 70 175 175 77 m
## 2 good 0 1 1 64 125 115 33 f
## 3 good 1 1 1 60 105 105 49 f
## 4 good 1 1 0 66 132 124 42 f
## 5 very good 0 1 0 61 150 130 55 f
## 6 very good 1 1 0 64 114 114 55 f
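The first six rows do not show how much the two conditions differ, so counting rows is a useful check. The sketch below uses a made-up data frame `toy` (hypothetical values): `&` keeps rows where both conditions hold, while `|` keeps rows where at least one holds.

```r
# Hypothetical miniature data frame
toy <- data.frame(gender = c("m", "m", "f", "f", "m"),
                  age    = c(25, 40, 35, 22, 31))

both   <- subset(toy, gender == "m" & age > 30)  # male AND over 30
either <- subset(toy, gender == "m" | age > 30)  # male OR over 30

nrow(both)    # 2
nrow(either)  # 4
```

On the real data, `nrow(m_and_over_30)` and `nrow(m_or_over_30)` give the corresponding counts.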
Create a subset with vectors for smokers under 23
smokers <- subset(cdc, cdc$smoke100 == "1")
head(smokers)
## genhlth exerany hlthplan smoke100 height weight wtdesire age gender
## 2 good 0 1 1 64 125 115 33 f
## 3 good 1 1 1 60 105 105 49 f
## 9 good 0 1 1 65 150 130 27 f
## 11 excellent 1 1 1 69 186 175 46 m
## 12 fair 1 1 1 69 168 148 62 m
## 13 excellent 1 0 1 66 185 220 21 m
young <- subset(cdc, cdc$age <23)
under23_and_smoke <- subset(cdc, age < 23 & smoke100 == "1")
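The size of a subset, and its share of the sample, can be read off with `nrow()`. A minimal sketch on a made-up data frame `toy` (hypothetical values):

```r
# Hypothetical miniature data frame
toy <- data.frame(age      = c(19, 22, 45, 21, 60, 18),
                  smoke100 = c(1, 0, 1, 1, 0, 0))

young_smokers <- subset(toy, age < 23 & smoke100 == 1)

n_young_smokers <- nrow(young_smokers)       # number of matching rows
pct <- 100 * n_young_smokers / nrow(toy)     # as a percentage of all rows
n_young_smokers
pct
```

Applied to the real data, the same two lines would use `under23_and_smoke` and `cdc`.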
There are a total of 620 people under the age of 23 who have smoked 100 cigarettes or more. That is 3.1% of the total sample.
summarize variables using a box plot
summary(cdc$height)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 48.00 64.00 67.00 67.18 70.00 93.00
-The shortest person was 48 inches tall, the tallest was 93 inches, and the median was 67 inches
-A breakdown by gender might be useful to compare more accurately. The ~ in this case means “height as a function of gender”
Not surprisingly, women tended to be shorter than men
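This line of code ran fine and created the boxplot I wanted, but when I tried to knit it into RMarkdown it gave me errors and wouldn’t create my document
boxplot(cdc$height ~ cdc$gender) + title("Height (inches)")
The likely cause is the +. boxplot() and title() are separate base graphics calls, not layers to be added together: boxplot() returns a list of statistics, title() returns NULL, and adding the two raises “non-numeric argument to binary operator” after the plot has already been drawn. Knitting stops at the first error by default, so putting title() on its own line should avoid the problem.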
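One way to title a base graphics boxplot without `+` is sketched below on a made-up data frame `toy` (hypothetical values). Either pass the title through the `main` argument, or call `title()` as a separate statement after the plot.

```r
# Hypothetical miniature data frame
toy <- data.frame(height = c(70, 71, 67, 69, 64, 60, 66, 61),
                  gender = factor(c("m", "m", "m", "m", "f", "f", "f", "f")))

# Option 1: set the title in the same call
b <- boxplot(height ~ gender, data = toy, main = "Height (inches)")

# Option 2: draw the plot, then add the title as a separate statement
boxplot(height ~ gender, data = toy)
title("Height (inches)")
```

The object returned by `boxplot()` is a list of summary statistics, which is why arithmetic on it fails.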
BMI
bmi <- (cdc$weight / cdc$height^2) * 703
This code also would not knit. The error was “non-numeric argument to binary operator”, which is what the + between boxplot() and title() produces
boxplot(bmi ~ cdc$genhlth) + title("Body Mass Index as a Function of General Health")
#Exercise 5
-This shows that the higher someone's BMI, the less likely they are to be in good health
-I chose to look at BMI as a function of age. What seems to happen is that BMIs are highest in middle age, then drop again slightly for the elderly
*same error*
boxplot(bmi ~ cdc$age) + title("BMI as a function of age")
Histograms
Histograms can also show something similar, especially with one variable
hist(cdc$age)
hist(bmi)
Using a higher number of breaks shows the shape of the distribution in finer detail
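The effect of the `breaks` argument can be seen in a short sketch on simulated values (the vector `bmi_toy` is made up and only loosely resembles BMI data). Note that `hist()` treats a single number for `breaks` as a suggestion and picks rounded bin edges near it.

```r
set.seed(1)
# Hypothetical BMI-like values; not the real cdc data
bmi_toy <- rnorm(500, mean = 26, sd = 4)

h_default <- hist(bmi_toy)               # default number of bins
h_fine    <- hist(bmi_toy, breaks = 50)  # request roughly 50 bins

length(h_default$counts)  # number of bins actually used by default
length(h_fine$counts)     # number of bins used with breaks = 50
```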
A scatterplot of weight vs. desired weight shows that over about 220 pounds, most people desired to weigh less
-For example, people who were 150 pounds stayed fairly close to the line with slope 1, but those who were heavier had more points at a lower desired weight
plot(cdc$weight, cdc$wtdesire)
abline(0, 1) # intercept 0, slope 1: the line where desired weight equals current weight
Make a vector for the difference between what people weigh and what they want to weigh
wdiff = cdc$weight - cdc$wtdesire
wdiff is a new variable that shows how far someone’s current weight is from the weight they want. If their value is 0 it means they are happy with their current weight
-If wdiff is negative it means they want to weigh more than they currently do. If it is positive it means they hope to lose weight
-A difference of 0 is common, but more people want to lose weight than gain it: the median is 10, so at least half of respondents want to weigh at least ten pounds less
-The median difference between current and desired weight is ten pounds
summary(wdiff)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -500.00 0.00 10.00 14.59 21.00 300.00
The standard deviation is 24.05 pounds
When comparing men and women, you can see that women tended to want to lose weight more than gain it. Fewer people had a difference value of 0.
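A comparison like this can be backed with per-gender summaries. The sketch below uses a made-up data frame `toy` (hypothetical values) and base R's `tapply()`, which applies a function to one vector within each level of a grouping factor.

```r
# Hypothetical miniature data frame
toy <- data.frame(gender   = factor(c("m", "m", "m", "f", "f", "f")),
                  weight   = c(175, 194, 170, 125, 150, 132),
                  wtdesire = c(175, 185, 160, 115, 130, 124))
toy$wdiff <- toy$weight - toy$wtdesire

tapply(toy$wdiff, toy$gender, summary)                 # five-number summary per gender
share_zero <- tapply(toy$wdiff == 0, toy$gender, mean) # share with no difference, per gender
share_zero

boxplot(wdiff ~ gender, data = toy)                    # side-by-side comparison
```

On the real data the equivalent calls would be `tapply(wdiff, cdc$gender, summary)` and `boxplot(wdiff ~ cdc$gender)`.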
mean(cdc$weight)
## [1] 169.683
sd(cdc$weight)
## [1] 40.08097
I am struggling to figure out how to call up the summaries of a data frame restricted to a range of values. I know I’m supposed to start with cdc[ and somehow include the range from 129.602 to 209.764, because that would include all the data points within one standard deviation of the mean.
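One possible approach is a logical condition rather than the `:` operator, since `129.602:209.764` builds a sequence of numbers instead of testing which rows fall in the range. The sketch below runs on a made-up data frame `toy` (hypothetical values) and computes the bounds from `mean()` and `sd()` rather than typing them in.

```r
# Hypothetical miniature data frame
toy <- data.frame(weight = c(120, 140, 150, 165, 170, 190, 210, 300),
                  age    = c(23, 35, 41, 52, 29, 60, 47, 38))

m <- mean(toy$weight)
s <- sd(toy$weight)

# Keep rows whose weight is within one standard deviation of the mean
within_1sd <- subset(toy, weight > m - s & weight < m + s)
summary(within_1sd)

# The same filter written with brackets: condition in the row position
bracket_version <- toy[toy$weight > m - s & toy$weight < m + s, ]

nrow(within_1sd) / nrow(toy)  # share of rows inside the range
```

For the real data, the same pattern would be `subset(cdc, weight > 129.602 & weight < 209.764)`, or the bounds could be computed from `mean(cdc$weight)` and `sd(cdc$weight)` as in the sketch.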