Statistics and Data Analysis 2019-20
The Behavioral Risk Factor Surveillance System (BRFSS) is an annual telephone survey of 350,000 people in the United States. As its name implies, the BRFSS is designed to identify risk factors in the adult population and report emerging health trends. For example, respondents are asked about their diet and weekly physical activity, their HIV/AIDS status, possible tobacco use, and even their level of health care coverage. The BRFSS Web site (http://www.cdc.gov/brfss) contains a complete description of the survey, including the research questions that motivate the study and many interesting results derived from the data.
We will focus on a random sample of 20,000 people from the BRFSS survey conducted in 2000. While there are over 200 variables in this data set, we will work with a small subset.
names(cdc)
## [1] "genhlth" "exerany" "hlthplan" "smoke100" "height" "weight"
## [7] "wtdesire" "age" "gender"
How many cases are there in this data set? 20,000
How many variables? 9
For each variable, identify its data type (e.g. categorical, discrete).
Categorical: genhlth (ordinal), exerany, hlthplan, smoke100, gender
Quantitative: weight (lb), wtdesire (lb), height (in), age (years)
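The classification above can be compared with how R stores each column. The sketch below uses a small made-up data frame named toy (a hypothetical stand-in, not the real cdc data) to show that a 0/1 indicator such as smoke100 may be stored as a number even though it is categorical in the statistical sense.

```r
# Hypothetical three-row stand-in for the cdc data frame (values are made up).
toy <- data.frame(
  genhlth  = factor(c("good", "excellent", "fair"),
                    levels = c("excellent", "very good", "good", "fair", "poor")),
  smoke100 = c(0, 1, 1),
  height   = c(70, 64, 67),
  gender   = factor(c("m", "f", "f"))
)

# Storage class of each column
col_classes <- sapply(toy, class)
print(col_classes)

# Number of cases (rows) and variables (columns)
toy_dim <- dim(toy)
print(toy_dim)
```

Running the same two calls on cdc should report 20,000 rows and 9 columns, assuming the data frame is loaded as in this lab.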
summary(cdc$weight)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 68.0 140.0 165.0 169.7 190.0 500.0
table(cdc$smoke100)
##
## 0 1
## 10559 9441
table(cdc$smoke100)/20000
##
## 0 1
## 0.52795 0.47205
barplot(table(cdc$smoke100))
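Dividing by 20000 works here because the sample size is known. As a minimal sketch, prop.table() gives the same relative frequencies without hard-coding the total. The counts are copied from the table output above so the example runs on its own.

```r
# Counts copied from the table(cdc$smoke100) output above
smoke_counts <- c("0" = 10559, "1" = 9441)

# Relative frequencies: each count divided by the total
smoke_props <- prop.table(smoke_counts)
print(smoke_props)
```

With the full data loaded, prop.table(table(cdc$smoke100)) would be the equivalent call.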
Create a numerical summary for height and age, and compute the interquartile range for each.
Age: IQR = 57 - 31 = 26 years. Height: IQR = 70 - 64 = 6 inches.
Compute the relative frequency distribution for gender and exerany. How many males are in the sample? 9,569
What proportion of the sample reports being in excellent health?
summary(cdc$age)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 18.00 31.00 43.00 45.07 57.00 99.00
summary(cdc$height)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 48.00 64.00 67.00 67.18 70.00 93.00
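The interquartile range is the third quartile minus the first quartile, so it can be read directly off the two summaries above. The sketch below repeats that arithmetic using the quartiles printed in the output. The variable names are only illustrative.

```r
# First and third quartiles copied from the summary() output above
age_quartiles    <- c(q1 = 31, q3 = 57)
height_quartiles <- c(q1 = 64, q3 = 70)

# IQR = Q3 - Q1
age_iqr    <- unname(age_quartiles["q3"] - age_quartiles["q1"])
height_iqr <- unname(height_quartiles["q3"] - height_quartiles["q1"])

print(c(age = age_iqr, height = height_iqr))
```

With the data loaded, IQR(cdc$age) and IQR(cdc$height) compute the same quantity directly from the columns.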
table(cdc$gender)/20000
##
## m f
## 0.47845 0.52155
table(cdc$exerany)/20000
##
## 0 1
## 0.2543 0.7457
table(cdc$genhlth)
##
## excellent very good good fair poor
## 4657 6972 5675 2019 677
How many males were in the sample? 9569 or 47.8%
What proportion reported being in excellent health? 23.3% (4,657 of 20,000) report being in excellent health, even though 74.6% report that they exercise.
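Both answers follow from the tables above. As a quick check, the sketch below recomputes them from the printed counts and proportions rather than from the raw data.

```r
# Counts copied from the table(cdc$genhlth) output above
genhlth_counts <- c(excellent = 4657, "very good" = 6972, good = 5675,
                    fair = 2019, poor = 677)

# Proportion of respondents in each general-health category
genhlth_props <- prop.table(genhlth_counts)
print(round(genhlth_props, 3))

# Number of males: relative frequency times the sample size
n_males <- round(0.47845 * 20000)
print(n_males)
```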
table(cdc$gender, cdc$smoke100)
##
## 0 1
## m 4547 5022
## f 6012 4419
mosaicplot(table(cdc$gender, cdc$smoke100))
What does the mosaic plot reveal about smoking habits and gender? Smoking habits and gender appear to be associated: females are less likely than males to report having smoked at least 100 cigarettes (about 42% of females versus about 52% of males, from the table above).
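The association is easier to judge from row proportions than from raw counts, because the two groups differ in size. The sketch below rebuilds the two-way table from the counts printed above and computes the share of each gender in each smoking category.

```r
# Two-way counts copied from the table(cdc$gender, cdc$smoke100) output above
smoke_by_gender <- matrix(
  c(4547, 5022,
    6012, 4419),
  nrow = 2, byrow = TRUE,
  dimnames = list(gender = c("m", "f"), smoke100 = c("0", "1"))
)

# margin = 1 divides each row by its own total, giving proportions within gender
row_props <- prop.table(smoke_by_gender, margin = 1)
print(round(row_props, 3))
```

The "1" column gives the proportion of each gender reporting at least 100 cigarettes smoked, which is the comparison the mosaic plot displays as areas.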
Create a new object called under23andsmoke that contains all observations of respondents under the age of 23 that have smoked 100 cigarettes in their lifetime. Write the command you used to create the new object as the answer to this exercise.
under23andsmoke <- subset(cdc, age < 23 & smoke100 == 1)
There are 620 respondents under the age of 23 who have smoked at least 100 cigarettes.
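To see how the two conditions combine, here is a minimal sketch on a made-up five-row data frame named toy (hypothetical values, not the cdc data). It shows subset() next to the equivalent logical-index form.

```r
# Hypothetical toy data: only the two columns used in the filter
toy <- data.frame(
  age      = c(19, 22, 23, 45, 21),
  smoke100 = c(1, 0, 1, 1, 1)
)

# Both conditions must hold: younger than 23 AND has smoked 100 cigarettes
young_smokers <- subset(toy, age < 23 & smoke100 == 1)

# Equivalent selection using a logical row index
same_rows <- toy[toy$age < 23 & toy$smoke100 == 1, ]

print(young_smokers)
print(nrow(young_smokers))
```

Note that the respondent aged exactly 23 is excluded because the comparison is strict.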
bmi <- (cdc$weight/cdc$height^2) * 703
boxplot(bmi ~ cdc$genhlth, horizontal = TRUE)
Analysis: As general health declines from excellent to poor, the median BMI increases and the spread of BMI widens. All five distributions are right-skewed with many high outliers.
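The BMI calculation and the group comparison can be sketched in a few lines. The helper bmi_from() and the toy values below are hypothetical and only illustrate the formula used above (weight in pounds, height in inches, scaled by 703) and how a median per group could be computed alongside the box plot.

```r
# Hypothetical helper wrapping the same formula as the bmi object above
bmi_from <- function(weight_lb, height_in) {
  (weight_lb / height_in^2) * 703
}

# One worked value: 175 lb at 70 in
example_bmi <- bmi_from(175, 70)
print(example_bmi)

# Made-up BMI values and health ratings, only to show the grouping step
toy_bmi    <- c(22, 24, 26, 30, 35, 28)
toy_health <- factor(c("excellent", "excellent", "good", "good", "poor", "poor"),
                     levels = c("excellent", "good", "poor"))

# Median BMI within each health category
median_by_health <- tapply(toy_bmi, toy_health, median)
print(median_by_health)
```

On the real data, tapply(bmi, cdc$genhlth, median) would give the medians that the box plot draws as center lines.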
boxplot(bmi ~ cdc$gender, horizontal = TRUE)
Analysis: Women's BMI tends to be lower than men's in this study. Both distributions are right-skewed with numerous outliers. The middle 50% of women have a BMI between 22 and 28, whereas the middle 50% of men have a BMI between 24 and 29.
On Your Own Make a scatterplot of weight versus desired weight. Describe the relationship between these two variables.
plot(cdc$wtdesire ~ cdc$weight, main = "Weight vs. Desired Weight", xlab = "Weight (lbs)", ylab = "Desired Weight (lbs)", xlim = c(75, 500), ylim = c(75, 700))
Analysis: Respondents who weigh between 100 and 200 pounds tend to be close to their desired weight, whereas those who weigh between 300 and 400 pounds tend to be heavier than their desired weight.
Let’s consider a new variable: the difference between desired weight (wtdesire) and current weight (weight). Create this new variable by subtracting the two columns in the data frame and assigning the result to a new object called wtdiff.
wtdiff <- (cdc$wtdesire - cdc$weight)
Analyze:
What type of data is wtdiff? Quantitative.
If an observation of wtdiff is 0, what does this mean about the person’s weight and desired weight? A value of 0 means the person’s current weight equals their desired weight.
What if wtdiff is positive or negative? A positive value means the person wants to gain weight, and a negative value means the person wants to lose weight.
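The sign interpretation can be checked on three made-up respondents. The values and the goal labels below are hypothetical and serve only to illustrate the subtraction wtdesire - weight.

```r
# Hypothetical current and desired weights (lb) for three respondents
weight   <- c(150, 200, 120)
wtdesire <- c(150, 180, 130)

# Same subtraction as above: desired minus current
wtdiff <- wtdesire - weight

# Translate the sign of the difference into words
goal <- ifelse(wtdiff == 0, "at desired weight",
               ifelse(wtdiff < 0, "wants to lose", "wants to gain"))

print(data.frame(weight, wtdesire, wtdiff, goal))
```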
Make a histogram of wtdiff. Describe the distribution of wtdiff in terms of its center, shape, and spread, including any plots you use. What does this tell us about how people feel about their current weight?
hist(wtdiff, main= "Weight Difference", xlab="wtdesire - weight")
Analysis: Most of the weight differences fall below zero, which shows that most respondents want to lose weight.
Subset wtdiff into male and female groups. Using numerical summaries and a side-by-side box plot, determine if men tend to view their weight differently than women. EXPLAIN
boxplot(wtdiff ~ cdc$gender, xlab = "Weight Difference", ylab = "Gender", horizontal = TRUE)
Analysis: About 25% of both males and females are content with their weight or would like to gain weight, and about 75% of each group want to lose weight. There are two outliers for men. Women tend to want to lose more weight than men.
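The exercise also asks for numerical summaries by group. A minimal sketch of how those could be computed is shown below on a made-up six-row data frame named toy. The values are hypothetical and do not reproduce the cdc results.

```r
# Hypothetical weight differences (desired minus current, lb) by gender
toy <- data.frame(
  gender = factor(c("m", "m", "m", "f", "f", "f"), levels = c("m", "f")),
  wtdiff = c(-10, 0, -20, -30, -15, 5)
)

# Median weight difference within each gender
median_by_gender <- tapply(toy$wtdiff, toy$gender, median)
print(median_by_gender)

# Share of each gender with a negative difference (wants to lose weight)
share_losing <- tapply(toy$wtdiff < 0, toy$gender, mean)
print(round(share_losing, 3))
```

On the real data, tapply(wtdiff, cdc$gender, summary) would report the quartiles for each group, which is what the box plot statements about 25% and 75% rest on.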