Lab1 - Introduction to data

Some define Statistics as the field that focuses on turning information into knowledge. The first step in that process is to summarize and describe the raw information - the data. In this lab, you will gain insight into public health by generating simple graphical and numerical summaries of a data set collected by the Centers for Disease Control and Prevention (CDC). As this is a large data set, along the way you’ll also learn the indispensable skills of data processing and subsetting.

Getting started

The Behavioral Risk Factor Surveillance System (BRFSS) is an annual telephone survey of 350,000 people in the United States. As its name implies, the BRFSS is designed to identify risk factors in the adult population and report emerging health trends. For example, respondents are asked about their diet and weekly physical activity, their HIV/AIDS status, possible tobacco use, and even their level of healthcare coverage. The BRFSS Web site (http://www.cdc.gov/brfss) contains a complete description of the survey, including the research questions that motivate the study and many interesting results derived from the data.

We will focus on a random sample of 20,000 people from the BRFSS survey conducted in 2000. While there are over 200 variables in this data set, we will work with a small subset.

We begin by loading the data set of 20,000 observations into the R workspace. After launching RStudio, enter the following command.

source("more/cdc.R")

The data set cdc that shows up in your workspace is a data matrix, with each row representing a case and each column representing a variable. R calls this data format a data frame, which is a term that will be used throughout the labs.

To view the names of the variables, type the command

names(cdc)

## [1] "genhlth"  "exerany"  "hlthplan" "smoke100" "height"   "weight"  
## [7] "wtdesire" "age"      "gender"

This returns the names genhlth, exerany, hlthplan, smoke100, height, weight, wtdesire, age, and gender. Each one of these variables corresponds to a question that was asked in the survey. For example, for genhlth, respondents were asked to evaluate their general health, responding either excellent, very good, good, fair or poor. The exerany variable indicates whether the respondent exercised in the past month (1) or did not (0). Likewise, hlthplan indicates whether the respondent had some form of health coverage (1) or did not (0). The smoke100 variable indicates whether the respondent had smoked at least 100 cigarettes in her lifetime. The other variables record the respondent’s height in inches, weight in pounds as well as their desired weight, wtdesire, age in years, and gender.

How many cases are there in this data set? How many variables? For each variable, identify its data type (e.g. categorical, discrete).

Response to Exercise 1:

cdc_table_dimension <- dim(cdc)
numcases <- nrow(cdc)
numvariables <- ncol(cdc)
str(cdc)

## 'data.frame':    20000 obs. of  9 variables:
##  $ genhlth : Factor w/ 5 levels "excellent","very good",..: 3 3 3 3 2 2 2 2 3 3 ...
##  $ exerany : num  0 0 1 1 0 1 1 0 0 1 ...
##  $ hlthplan: num  1 1 1 1 1 1 1 1 1 1 ...
##  $ smoke100: num  0 1 1 0 0 0 0 0 1 0 ...
##  $ height  : num  70 64 60 66 61 64 71 67 65 70 ...
##  $ weight  : int  175 125 105 132 150 114 194 170 150 180 ...
##  $ wtdesire: int  175 115 105 124 130 114 185 160 130 170 ...
##  $ age     : int  77 33 49 42 55 55 31 45 27 44 ...
##  $ gender  : Factor w/ 2 levels "m","f": 1 2 2 2 2 2 1 1 2 1 ...

There are 20000 cases in the CDC dataset, each of which contains 9 variables.

The variables “height”, “weight”, “wtdesire”, and “age” are numeric and should be considered continuous. However, of these four, only the “height” variable is actually stored in “numeric” format (enabling values of partial inches), while the other three ( weight, wtdesire, and age) are actually stored as integers (precluding partial pounds or years.) It could be argued that such variables are being handled as discrete rather than continuous in the implementation.

The variables “exerany” , “hlthplan” , and “smoke100” are logical variables, each stored as 0 or 1, and are thus categorical/nominal variables.

Similarly, the variable “gender” is a categorical/nominal variable, stored as a factor with levels “m” and “f”.

The variable “genhlth” is a categorical variable, stored as a factor with the following levels:

excellent, very good, good, fair, poor

This is fundamentally an ordinal variable, though it has not been explicitly stored in this dataframe as “ordered.”

(end of responses for Exercise 1)

We can have a look at the first few entries (rows) of our data with the command

head(cdc)

##     genhlth exerany hlthplan smoke100 height weight wtdesire age gender
## 1      good       0        1        0     70    175      175  77      m
## 2      good       0        1        1     64    125      115  33      f
## 3      good       1        1        1     60    105      105  49      f
## 4      good       1        1        0     66    132      124  42      f
## 5 very good       0        1        0     61    150      130  55      f
## 6 very good       1        1        0     64    114      114  55      f

and similarly we can look at the last few by typing

tail(cdc)

##         genhlth exerany hlthplan smoke100 height weight wtdesire age
## 19995      good       0        1        1     69    224      224  73
## 19996      good       1        1        0     66    215      140  23
## 19997 excellent       0        1        0     73    200      185  35
## 19998      poor       0        1        0     65    216      150  57
## 19999      good       1        1        0     67    165      165  81
## 20000      good       1        1        1     69    170      165  83
##       gender
## 19995      m
## 19996      f
## 19997      m
## 19998      f
## 19999      f
## 20000      m

You could also look at all of the data frame at once by typing its name into the console, but that might be unwise here. We know cdc has 20,000 rows, so viewing the entire data set would mean flooding your screen. It’s better to take small peeks at the data with head, tail or the subsetting techniques that you’ll learn in a moment.

Summaries and tables

The BRFSS questionnaire is a massive trove of information. A good first step in any analysis is to distill all of that information into a few summary statistics and graphics. As a simple example, the function summary returns a numerical summary: minimum, first quartile, median, mean, second quartile, and maximum. For weight this is

summary(cdc$weight)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    68.0   140.0   165.0   169.7   190.0   500.0

R also functions like a very fancy calculator. If you wanted to compute the interquartile range for the respondents’ weight, you would look at the output from the summary command above and then enter

190 - 140

## [1] 50

R also has built-in functions to compute summary statistics one by one. For instance, to calculate the mean, median, and variance of weight, type

mean(cdc$weight)

## [1] 169.683

var(cdc$weight)

## [1] 1606.484

median(cdc$weight)

## [1] 165

While it makes sense to describe a quantitative variable like weight in terms of these statistics, what about categorical data? We would instead consider the sample frequency or relative frequency distribution. The function table does this for you by counting the number of times each kind of response was given. For example, to see the number of people who have smoked 100 cigarettes in their lifetime, type

table(cdc$smoke100)

## 
##     0     1 
## 10559  9441

or instead look at the relative frequency distribution by typing

table(cdc$smoke100)/20000

## 
##       0       1 
## 0.52795 0.47205

Notice how R automatically divides all entries in the table by 20,000 in the command above. This is similar to something we observed in the Introduction to R; when we multiplied or divided a vector with a number, R applied that action across entries in the vectors. As we see above, this also works for tables. Next, we make a bar plot of the entries in the table by putting the table inside the barplot command.

barplot(table(cdc$smoke100))

Notice what we’ve done here! We’ve computed the table of cdc$smoke100 and then immediately applied the graphical function, barplot. This is an important idea: R commands can be nested. You could also break this into two steps by typing the following:

smoke <- table(cdc$smoke100)

barplot(smoke)

Here, we’ve made a new object, a table, called smoke (the contents of which we can see by typing smoke into the console) and then used it in as the input for barplot. The special symbol <- performs an assignment, taking the output of one line of code and saving it into an object in your workspace. This is another important idea that we’ll return to later.

Create a numerical summary for height and age, and compute the interquartile range for each. Compute the relative frequency distribution for gender and exerany. How many males are in the sample? What proportion of the sample reports being in excellent health?

Response to Exercise 2:

Summary and interquartile range for height :

sum_height <- summary(cdc$height)
sum_height

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   48.00   64.00   67.00   67.18   70.00   93.00

IQR_height <- sum_height["3rd Qu."] - sum_height["1st Qu."]
names(IQR_height) <- 'IQR'
# calculated by extraction from summary
IQR_height

## IQR 
##   6

# computed by IQR function
IQR(cdc$height)

## [1] 6

The IQR for height is 6 .

Summary and interquartile range for age :

sum_age <- summary(cdc$age)
sum_age

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   18.00   31.00   43.00   45.07   57.00   99.00

IQR_age <- sum_age["3rd Qu."] - sum_age["1st Qu."]
names(IQR_age) <- 'IQR'
# calculated by extraction from summary
IQR_age

## IQR 
##  26

# computed by IQR function
IQR(cdc$age)

## [1] 26

The IQR for age is 26 .

Relative frequency distribution for `gender` :

table(cdc$gender)/20000

## 
##       m       f 
## 0.47845 0.52155

Relative frequency distribution for `exerany` :

table(cdc$exerany)/20000

## 
##      0      1 
## 0.2543 0.7457

How many males are in the sample?

gendertable = table(cdc$gender)
numberofmales <- gendertable["m"]

The number of males in the sample is 9569 .

What proportion of the sample reports being in excellent health?

genhlth_table = table(cdc$genhlth)/20000
excellent_health = genhlth_table["excellent"]
excellent_health

## excellent 
##   0.23285

The proportion of the sample reporting excellent health is 0.23285 .

(end of responses for Exercise 2)

The table command can be used to tabulate any number of variables that you provide. For example, to examine which participants have smoked across each gender, we could use the following.

table(cdc$gender,cdc$smoke100)

##    
##        0    1
##   m 4547 5022
##   f 6012 4419

Here, we see column labels of 0 and 1. Recall that 1 indicates a respondent has smoked at least 100 cigarettes. The rows refer to gender. To create a mosaic plot of this table, we would enter the following command.

mosaicplot(table(cdc$gender,cdc$smoke100))

We could have accomplished this in two steps by saving the table in one line and applying mosaicplot in the next (see the table/barplot example above).

What does the mosaic plot reveal about smoking habits and gender?

Response to Exercise 3:

The mosaic plot indicates that the majority of males have smoked more than 100 cigarettes, while the majority of females have not.
However, these differences are narrow, as the proportion of males who have smoked more than 100 cigarettes is 0.5248197 while the proportion of females who have not smoked 100 cigarettes is 0.5763589 .

(end of response for Exercise 3)

Interlude: How R thinks about data

We mentioned that R stores data in data frames, which you might think of as a type of spreadsheet. Each row is a different observation (a different respondent) and each column is a different variable (the first is genhlth, the second exerany and so on). We can see the size of the data frame next to the object name in the workspace or we can type

dim(cdc)

## [1] 20000     9

which will return the number of rows and columns. Now, if we want to access a subset of the full data frame, we can use row-and-column notation. For example, to see the sixth variable of the 567^th respondent, use the format

cdc[567,6]

## [1] 160

which means we want the element of our data set that is in the 567^th row (meaning the 567^th person or observation) and the 6^th column (in this case, weight). We know that weight is the 6^th variable because it is the 6^th entry in the list of variable names

names(cdc)

## [1] "genhlth"  "exerany"  "hlthplan" "smoke100" "height"   "weight"  
## [7] "wtdesire" "age"      "gender"

To see the weights for the first 10 respondents we can type

cdc[1:10,6]

##  [1] 175 125 105 132 150 114 194 170 150 180

In this expression, we have asked just for rows in the range 1 through 10. R uses the : to create a range of values, so 1:10 expands to 1, 2, 3, 4, 5, 6, 7, 8, 9, 10. You can see this by entering

1:10

##  [1]  1  2  3  4  5  6  7  8  9 10

Finally, if we want all of the data for the first 10 respondents, type

cdc[1:10,]

##      genhlth exerany hlthplan smoke100 height weight wtdesire age gender
## 1       good       0        1        0     70    175      175  77      m
## 2       good       0        1        1     64    125      115  33      f
## 3       good       1        1        1     60    105      105  49      f
## 4       good       1        1        0     66    132      124  42      f
## 5  very good       0        1        0     61    150      130  55      f
## 6  very good       1        1        0     64    114      114  55      f
## 7  very good       1        1        0     71    194      185  31      m
## 8  very good       0        1        0     67    170      160  45      m
## 9       good       0        1        1     65    150      130  27      f
## 10      good       1        1        0     70    180      170  44      m

By leaving out an index or a range (we didn’t type anything between the comma and the square bracket), we get all the columns. When starting out in R, this is a bit counterintuitive. As a rule, we omit the column number to see all columns in a data frame. Similarly, if we leave out an index or range for the rows, we would access all the observations, not just the 567^th, or rows 1 through 10. Try the following to see the weights for all 20,000 respondents fly by on your screen

#lengthy display suppressed for submission
#cdc[,6]

Recall that column 6 represents respondents’ weight, so the command above reported all of the weights in the data set. An alternative method to access the weight data is by referring to the name. Previously, we typed names(cdc) to see all the variables contained in the cdc data set. We can use any of the variable names to select items in our data set.

#lengthy display suppressed for submission
#cdc$weight

The dollar-sign tells R to look in data frame cdc for the column called weight. Since that’s a single vector, we can subset it with just a single index inside square brackets. We see the weight for the 567^th respondent by typing

cdc$weight[567]

## [1] 160

Similarly, for just the first 10 respondents

cdc$weight[1:10]

##  [1] 175 125 105 132 150 114 194 170 150 180

The command above returns the same result as the cdc[1:10,6] command. Both row-and-column notation and dollar-sign notation are widely used, which one you choose to use depends on your personal preference.

A little more on subsetting

It’s often useful to extract all individuals (cases) in a data set that have specific characteristics. We accomplish this through conditioning commands. First, consider expressions like

#lengthy display suppressed for submission
#cdc$gender == "m"

#lengthy display suppressed for submission
#cdc$age > 30

These commands produce a series of TRUE and FALSE values. There is one value for each respondent, where TRUE indicates that the person was male (via the first command) or older than 30 (second command).

Suppose we want to extract just the data for the men in the sample, or just for those over 30. We can use the R function subset to do that for us. For example, the command

mdata <- subset(cdc, cdc$gender == "m")

will create a new data set called mdata that contains only the men from the cdc data set. In addition to finding it in your workspace alongside its dimensions, you can take a peek at the first several rows as usual

head(mdata)

##      genhlth exerany hlthplan smoke100 height weight wtdesire age gender
## 1       good       0        1        0     70    175      175  77      m
## 7  very good       1        1        0     71    194      185  31      m
## 8  very good       0        1        0     67    170      160  45      m
## 10      good       1        1        0     70    180      170  44      m
## 11 excellent       1        1        1     69    186      175  46      m
## 12      fair       1        1        1     69    168      148  62      m

This new data set contains all the same variables but just under half the rows. It is also possible to tell R to keep only specific variables, which is a topic we’ll discuss in a future lab. For now, the important thing is that we can carve up the data based on values of one or more variables.

As an aside, you can use several of these conditions together with & and |. The & is read “and” so that

m_and_over30 <- subset(cdc, gender == "m" & age > 30)

will give you the data for men over the age of 30. The | character is read “or” so that

m_or_over30 <- subset(cdc, gender == "m" | age > 30)

will take people who are men or over the age of 30 (why that’s an interesting group is hard to say, but right now the mechanics of this are the important thing). In principle, you may use as many “and” and “or” clauses as you like when forming a subset.

Create a new object called under23_and_smoke that contains all observations of respondents under the age of 23 that have smoked 100 cigarettes in their lifetime. Write the command you used to create the new object as the answer to this exercise.

Responses for Exercise 4

under23_and_smoke <- subset(cdc, age < 23 & smoke100)    # we don't need "== 1" because that's redundant

The number of individuals who are under age 23 and who are smokers is 620 .

(end of responses for Exercise 4)

Quantitative data

With our subsetting tools in hand, we’ll now return to the task of the day: making basic summaries of the BRFSS questionnaire. We’ve already looked at categorical data such as smoke and gender so now let’s turn our attention to quantitative data. Two common ways to visualize quantitative data are with box plots and histograms. We can construct a box plot for a single variable with the following command.

boxplot(cdc$height)

You can compare the locations of the components of the box by examining the summary statistics.

summary(cdc$height)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   48.00   64.00   67.00   67.18   70.00   93.00

Confirm that the median and upper and lower quartiles reported in the numerical summary match those in the graph. The purpose of a boxplot is to provide a thumbnail sketch of a variable for the purpose of comparing across several categories. So we can, for example, compare the heights of men and women with

boxplot(cdc$height ~ cdc$gender)

The notation here is new. The ~ character can be read versus or as a function of. So we’re asking R to give us a box plots of heights where the groups are defined by gender.

Next let’s consider a new variable that doesn’t show up directly in this data set: Body Mass Index (BMI) (http://en.wikipedia.org/wiki/Body_mass_index). BMI is a weight to height ratio and can be calculated as:

\[ BMI = \frac{weight~(lb)}{height~(in)^2} * 703 \]

703 is the approximate conversion factor to change units from metric (meters and kilograms) to imperial (inches and pounds).

The following two lines first make a new object called bmi and then creates box plots of these values, defining groups by the variable cdc$genhlth.

bmi <- (cdc$weight / cdc$height^2) * 703
boxplot(bmi ~ cdc$genhlth)

Notice that the first line above is just some arithmetic, but it’s applied to all 20,000 numbers in the cdc data set. That is, for each of the 20,000 participants, we take their weight, divide by their height-squared and then multiply by 703. The result is 20,000 BMI values, one for each respondent. This is one reason why we like R: it lets us perform computations like this using very simple expressions.

What does this box plot show? Pick another categorical variable from the data set and see how it relates to BMI. List the variable you chose, why you might think it would have a relationship to BMI, and indicate what the figure seems to suggest.

Responses for Exercise 5

The box plot of Body Mass Index (BMI) broken out by self-assessed general level of health (above) shows a relationship between lower BMI and better general level of health.

Make a boxplot of BMI vs. the binary variable indicating whether or not the subject engages in exercise :

boxplot(bmi~cdc$exerany)

I chose the variable “exerany” under the hypothesis that individuals who do not engage in exercise

(left-side boxplot, label “0”) would have greater BMI than individuals who do engage in exercise

(right-hand plot, label “1”)

Below, the summary data which is reflected in the above boxplots:

print("Summary of BMI for individuals who indicate they do not exercise:")

## [1] "Summary of BMI for individuals who indicate they do not exercise:"

summary_BMI_no_exercise <- summary(bmi[cdc$exerany==0])
summary_BMI_no_exercise

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   12.75   23.03   26.50   27.27   30.27   62.64

median_BMI_no_exercise <- round(summary_BMI_no_exercise["Median"],2)

print("Summary of BMI for individuals who indicate they do exercise:")

## [1] "Summary of BMI for individuals who indicate they do exercise:"

summary_BMI_with_exercise <- summary(bmi[cdc$exerany==1])
summary_BMI_with_exercise

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   12.40   22.66   25.11   25.98   28.34   73.09

median_BMI_with_exercise <- round(summary_BMI_with_exercise["Median"],2)

diff_BMI_exercise <- median_BMI_no_exercise - median_BMI_with_exercise
diff_BMI_exercise

## Median 
##   1.39

This clearly shows that the median BMI for individuals who do exercise is 25.11 while the median BMI for individuals who do not exercise is 26.5 , for a difference of 1.39 .

(end of responses for Exercise 5)

Finally, let’s make some histograms. We can look at the histogram for the age of our respondents with the command

hist(cdc$age)

Histograms are generally a very good way to see the shape of a single distribution, but that shape can change depending on how the data is split between the different bins. You can control the number of bins by adding an argument to the command. In the next two lines, we first make a default histogram of bmi and then one with 50 breaks.

hist(bmi)

hist(bmi, breaks = 50)

Note that you can flip between plots that you’ve created by clicking the forward and backward arrows in the lower right region of RStudio, just above the plots. How do these two histograms compare?

At this point, we’ve done a good first pass at analyzing the information in the BRFSS questionnaire. We’ve found an interesting association between smoking and gender, and we can say something about the relationship between people’s assessment of their general health and their own BMI. We’ve also picked up essential computing tools – summary statistics, subsetting, and plots – that will serve us well throughout this course.

On Your Own

1) Make a scatterplot of weight versus desired weight. Describe the relationship between these two variables.

weightmodel = lm(cdc$wtdesire ~ cdc$weight)
intercept = weightmodel$coefficients[1]
intercept

## (Intercept) 
##    46.66401

plot(x = cdc$weight, y = cdc$wtdesire, main = "Plot of Weight vs. Desired Weight")
slope = weightmodel$coefficients[2]
slope

## cdc$weight 
##  0.6390143

intersection = intercept / (1 - slope)
intersection

## (Intercept) 
##    129.2683

intersection = round(intersection,3)

abline(weightmodel,col="blue")
abline(a=0,b=1,col="green")

The green line has intercept 0 and slope 1: it represents cases where desired weight = current weight.

Individuals plotted above the green line desire to gain weight, while individuals below the green line desire to lose weight.

There are more points below the green line than above, indicating that more individuals wish to lose weight rather than gain.

The blue line fits a linear regression through the data. It has slope 0.6390143 and intersects the green line at ( x , y ) = (129.268,129.268). The slope indicates that for each 1 lb. increase in actual weight, the fitted desired weight increases by only 0.6390143 lbs.

2) Let’s consider a new variable: the difference between desired weight (`wtdesire`) and current weight (`weight`). Create this new variable by subtracting the two columns in the data frame and assigning them to a new object called `wdiff`.

wdiff = cdc$wtdesire - cdc$weight
summary(wdiff)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -300.00  -21.00  -10.00  -14.59    0.00  500.00

cdc2=cbind(cdc,wdiff)
summary(cdc2)

##       genhlth        exerany          hlthplan         smoke100     
##  excellent:4657   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
##  very good:6972   1st Qu.:0.0000   1st Qu.:1.0000   1st Qu.:0.0000  
##  good     :5675   Median :1.0000   Median :1.0000   Median :0.0000  
##  fair     :2019   Mean   :0.7457   Mean   :0.8738   Mean   :0.4721  
##  poor     : 677   3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:1.0000  
##                   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000  
##      height          weight         wtdesire          age        gender   
##  Min.   :48.00   Min.   : 68.0   Min.   : 68.0   Min.   :18.00   m: 9569  
##  1st Qu.:64.00   1st Qu.:140.0   1st Qu.:130.0   1st Qu.:31.00   f:10431  
##  Median :67.00   Median :165.0   Median :150.0   Median :43.00            
##  Mean   :67.18   Mean   :169.7   Mean   :155.1   Mean   :45.07            
##  3rd Qu.:70.00   3rd Qu.:190.0   3rd Qu.:175.0   3rd Qu.:57.00            
##  Max.   :93.00   Max.   :500.0   Max.   :680.0   Max.   :99.00            
##      wdiff        
##  Min.   :-300.00  
##  1st Qu.: -21.00  
##  Median : -10.00  
##  Mean   : -14.59  
##  3rd Qu.:   0.00  
##  Max.   : 500.00

3) What type of data is `wdiff`? If an observation `wdiff` is 0, what does this mean about the person’s weight and desired weight. What if `wdiff` is positive or negative?

`wdiff` is numerical data. Like weight and wtdesire , it is fundamentally continuous , but because each of weight and wtdesire are implemented as int in the cdc dataframe, their difference is automatically stored as int as well. This causes such data to exhibit discrete properties unless it is coerced to numeric.

If an observation `wdiff` is 0, the person’s desired weight is the same as his/her current weight. The number of observations for which this is the case is 5616 .

If an observation `wdiff` is positive, the person wishes to gain weight. The number of observations for which this is the case is 1620 .

If an observation `wdiff` is negative, the person wishes to lose weight. The number of observations for which this is the case is 12764 .

4) Describe the distribution of `wdiff` in terms of its center, shape, and spread, including any plots you use. What does this tell us about how people feel about their current weight?

library(psych)
describe(wdiff)

##    vars     n   mean    sd median trimmed   mad  min max range  skew
## X1    1 20000 -14.59 24.05    -10  -11.41 14.83 -300 500   800 -1.45
##    kurtosis   se
## X1    21.61 0.17

summary(wdiff)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -300.00  -21.00  -10.00  -14.59    0.00  500.00

The center of the distribution of `wdiff` is a mean of -14.59 and a median of -10.

This indicates that the distribution is left-skewed, which is consistent with the skew of -1.45.

The standard deviation is 24.05 and the IQR is 21.

hist(wdiff,breaks = 100)

This histogram shows that most of the observations are negative (reflecting a desire to lose weight).

5) Using numerical summaries and a side-by-side box plot, determine if men tend to view their weight differently than women.

describe(cdc2$wdiff)

##    vars     n   mean    sd median trimmed   mad  min max range  skew
## X1    1 20000 -14.59 24.05    -10  -11.41 14.83 -300 500   800 -1.45
##    kurtosis   se
## X1    21.61 0.17

describeBy(cdc2$wdiff,group = cdc2$gender)

## 
##  Descriptive statistics by group 
## group: m
##    vars    n   mean    sd median trimmed  mad  min max range skew kurtosis
## X1    1 9569 -10.71 23.49     -5   -8.41 7.41 -300 500   800 -0.6    37.44
##      se
## X1 0.24
## -------------------------------------------------------- 
## group: f
##    vars     n   mean sd median trimmed   mad  min max range  skew kurtosis
## X1    1 10431 -18.15 24    -10   -14.2 14.83 -300  83   383 -2.26     9.22
##      se
## X1 0.23

boxplot(cdc2$wdiff~cdc2$gender)

Women seek to lose an average of 18.15 lbs, while men seek to lose an average of 10.71 lbs.

There are a few outliers, particularly on the men’s side, which indicate that the participant wishes to gain a very large amount of weight, e.g, 500 lbs. It seems quite implausible that someone currently weighing 180 lbs would desire to increase his weight to 680 lbs, which suggests that such data points are erroneous.

cdc[abs(wdiff)>200,]

##         genhlth exerany hlthplan smoke100 height weight wtdesire age
## 1995       poor       1        1        0     74    500      200  45
## 2944       poor       1        1        1     72    400      190  50
## 4445       fair       1        1        1     69    495      195  32
## 4612       poor       0        1        1     67    371      125  35
## 9007       fair       1        0        0     65    300       80  45
## 10034 very good       1        1        1     73    290      601  56
## 15720      poor       1        1        0     68    405      170  32
## 16874      good       0        1        0     69    180      680  24
##       gender
## 1995       m
## 2944       m
## 4445       f
## 4612       f
## 9007       f
## 10034      m
## 15720      m
## 16874      m

The presence of such extreme outliers makes it difficult to visualize the central portion of the boxplot.

By dropping such outliers, we can get a better look at the more important port of the boxplot.

cdc3 = cdc2[abs(wdiff)<200,]
dim(cdc3)

## [1] 19986    10

describe(cdc3$wdiff)

##    vars     n  mean    sd median trimmed   mad  min max range  skew
## X1    1 19986 -14.5 23.07    -10  -11.39 14.83 -190 110   300 -1.84
##    kurtosis   se
## X1     6.63 0.16

describeBy(cdc3$wdiff,group = cdc3$gender)

## 
##  Descriptive statistics by group 
## group: m
##    vars    n   mean    sd median trimmed  mad  min max range  skew
## X1    1 9560 -10.64 21.93     -5   -8.39 7.41 -175 110   285 -1.69
##    kurtosis   se
## X1     7.37 0.22
## -------------------------------------------------------- 
## group: f
##    vars     n   mean    sd median trimmed   mad  min max range  skew
## X1    1 10426 -18.05 23.52    -10  -14.19 14.83 -190  83   273 -2.01
##    kurtosis   se
## X1     6.25 0.23

boxplot(cdc3$wdiff~cdc3$gender)

From the above boxplot, where extreme outliers have been dropped, it is easier to see that:

Women seek to lose more weight than men

Few women (n=449) seek to gain weight, while a larger number of men (n=1171, or n=1169 with 2 extreme outliers dropped) seek to gain weight.

# All individuals seeking to gain weight
describe(cdc2$wdiff[cdc2$wdiff>0])

##    vars    n  mean    sd median trimmed  mad min max range  skew kurtosis
## X1    1 1620 15.58 18.82     10   13.04 7.41   1 500   499 13.61   307.99
##      se
## X1 0.47

describeBy(cdc2$wdiff[cdc2$wdiff>0],cdc2$gender[cdc2$wdiff>0])

## 
##  Descriptive statistics by group 
## group: m
##    vars    n mean    sd median trimmed   mad min max range  skew kurtosis
## X1    1 1171 17.2 21.15     13    14.4 10.38   1 500   499 12.98   262.29
##      se
## X1 0.62
## -------------------------------------------------------- 
## group: f
##    vars   n  mean   sd median trimmed  mad min max range skew kurtosis
## X1    1 449 11.34 9.29     10    9.94 7.41   1  83    82 3.12     16.4
##      se
## X1 0.44

#As above, with outliers removed
describe(cdc3$wdiff[cdc3$wdiff>0])

##    vars    n mean    sd median trimmed  mad min max range skew kurtosis
## X1    1 1618 15.1 12.46     10   13.03 7.41   1 110   109 2.49      9.3
##      se
## X1 0.31

describeBy(cdc3$wdiff[cdc3$wdiff>0],cdc3$gender[cdc3$wdiff>0])

## 
##  Descriptive statistics by group 
## group: m
##    vars    n  mean   sd median trimmed   mad min max range skew kurtosis
## X1    1 1169 16.54 13.2     13   14.37 10.38   1 110   109 2.33     7.97
##      se
## X1 0.39
## -------------------------------------------------------- 
## group: f
##    vars   n  mean   sd median trimmed  mad min max range skew kurtosis
## X1    1 449 11.34 9.29     10    9.94 7.41   1  83    82 3.12     16.4
##      se
## X1 0.44

6) Now it’s time to get creative. Find the mean and standard deviation of `weight` and determine what proportion of the weights are within one standard deviation of the mean.

meanweight = mean(cdc$weight)
meanweight

## [1] 169.683

sdweight=sd(cdc$weight)
sdweight

## [1] 40.08097

lowweight=meanweight-sdweight
lowweight

## [1] 129.602

highweight=meanweight+sdweight
highweight

## [1] 209.7639

onesdset=subset(x=cdc,subset = cdc$weight>=lowweight & cdc$weight <= highweight)
count=length(onesdset$weight)
count

## [1] 14152

proportion=count/20000
proportion

## [1] 0.7076

The mean weight is 169.68295. The standard deviation is 40.08097 .

The +/-1 standard deviation range is ( 129.60198 , 209.76392 ).

The number of observations with weight in this range is 14152 .

This figure, as a proportion of all observations, is 0.7076 .

This is a product of OpenIntro that is released under a Creative Commons Attribution-ShareAlike 3.0 Unported. This lab was adapted for OpenIntro by Andrew Bray and Mine Çetinkaya-Rundel from a lab written by Mark Hansen of UCLA Statistics.