Introduction to Data

source("http://www.openintro.org/stat/data/cdc.R")

To view the names of the variables, type the command

names(cdc)

## [1] "genhlth"  "exerany"  "hlthplan" "smoke100" "height"   "weight"  
## [7] "wtdesire" "age"      "gender"

How many cases are there in this data set? How many variables? For each variable, identify its data type (e.g. categorical, discrete).

dim(cdc)

## [1] 20000     9

names(cdc)

## [1] "genhlth"  "exerany"  "hlthplan" "smoke100" "height"   "weight"  
## [7] "wtdesire" "age"      "gender"

There are 20000 cases in this data set and 9 variables.

We can have a look at the first few entries (rows) of our data with the command

head(cdc)

##     genhlth exerany hlthplan smoke100 height weight wtdesire age gender
## 1      good       0        1        0     70    175      175  77      m
## 2      good       0        1        1     64    125      115  33      f
## 3      good       1        1        1     60    105      105  49      f
## 4      good       1        1        0     66    132      124  42      f
## 5 very good       0        1        0     61    150      130  55      f
## 6 very good       1        1        0     64    114      114  55      f

tail(cdc)

##         genhlth exerany hlthplan smoke100 height weight wtdesire age
## 19995      good       0        1        1     69    224      224  73
## 19996      good       1        1        0     66    215      140  23
## 19997 excellent       0        1        0     73    200      185  35
## 19998      poor       0        1        0     65    216      150  57
## 19999      good       1        1        0     67    165      165  81
## 20000      good       1        1        1     69    170      165  83
##       gender
## 19995      m
## 19996      f
## 19997      m
## 19998      f
## 19999      f
## 20000      m

genhlth -> ordinal categorical // exerany -> nominal categorical (binary) // hlthplan -> nominal categorical (binary) // smoke100 -> nominal categorical (binary) // height -> discrete numerical // weight -> discrete numerical // wtdesire -> discrete numerical // age -> discrete numerical // gender -> nominal categorical
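One way to sanity-check these classifications is to ask R how each column is stored. The sketch below rebuilds a few columns of the six rows printed by head(cdc) as a small stand-in data frame. The name cdc_head is hypothetical, and storing genhlth as an ordered factor is an assumption about how one might encode it, not necessarily how cdc.R stores it.

```r
# Stand-in for part of head(cdc); cdc_head is a hypothetical name.
cdc_head <- data.frame(
  genhlth = factor(
    c("good", "good", "good", "good", "very good", "very good"),
    levels = c("excellent", "very good", "good", "fair", "poor"),
    ordered = TRUE  # assumption: encode the health ratings as ordered
  ),
  exerany = c(0, 0, 1, 1, 0, 1),
  height  = c(70, 64, 60, 66, 61, 64),
  weight  = c(175, 125, 105, 132, 150, 114),
  gender  = c("m", "f", "f", "f", "f", "f"),
  stringsAsFactors = FALSE
)

# First class label of every column (how R stores it).
col_classes <- sapply(cdc_head, function(col) class(col)[1])
print(col_classes)

# A column holding only 0 and 1 is a yes/no indicator,
# i.e. categorical even though R stores it as a number.
is_binary <- sapply(cdc_head, function(col) all(col %in% c(0, 1)))
print(is_binary)
```

The storage class is only a hint: exerany is stored as a number, but it encodes a yes/no answer, which is why it is treated as categorical above.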

summary(cdc$weight)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    68.0   140.0   165.0   169.7   190.0   500.0

R also functions like a very fancy calculator. If you wanted to compute the interquartile range for the respondents’ weight, you would look at the output from the summary command above and then enter

190 - 140

## [1] 50

R also has built-in functions to compute summary statistics one by one. For instance, to calculate the mean, median, and variance of weight, type

mean(cdc$weight)

## [1] 169.683

var(cdc$weight)

## [1] 1606.484

median(cdc$weight)

## [1] 165
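As a small illustration of how the one-by-one functions relate to each other, the sketch below uses the ten weights printed later in this lab by cdc[1:10,6] as a toy vector (the name w is hypothetical). It checks that IQR() matches the manual "third quartile minus first quartile" calculation and that the standard deviation is the square root of the variance.

```r
# Toy vector: the weights of the first ten respondents (from cdc[1:10,6]).
w <- c(175, 125, 105, 132, 150, 114, 194, 170, 150, 180)

# Quartiles, as reported by summary(); quantile() uses the same default method.
q <- quantile(w, probs = c(0.25, 0.75))
manual_iqr <- unname(q[2] - q[1])

# Built-in equivalents.
builtin_iqr <- IQR(w)
builtin_sd  <- sd(w)

print(c(manual = manual_iqr, builtin = builtin_iqr))
print(c(sd = builtin_sd, sqrt_var = sqrt(var(w))))
```

With only ten values the numbers differ from the full-sample results above; the point is only that the manual and built-in routes agree.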

table(cdc$smoke100)

## 
##     0     1 
## 10559  9441

To look at the relative frequency distribution instead, divide the counts by the sample size:

table(cdc$smoke100)/20000

## 
##       0       1 
## 0.52795 0.47205
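Dividing by 20000 hard-codes the sample size. A minimal sketch of an alternative, using the smoke100 values of the first ten respondents as a toy vector (the name smoke_toy is hypothetical), lets R work out the denominator with prop.table() or length():

```r
# Toy vector: smoke100 for the first ten respondents shown by cdc[1:10,].
smoke_toy <- c(0, 1, 1, 0, 0, 0, 0, 0, 1, 0)

counts <- table(smoke_toy)

# Two equivalent ways to get relative frequencies without typing the sample size.
rel_prop_table <- prop.table(counts)
rel_by_length  <- counts / length(smoke_toy)

print(rel_prop_table)
```

The same pattern would apply to the full data, for example prop.table(table(cdc$smoke100)), and it keeps working if the number of rows changes.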

barplot(table(cdc$smoke100))

smoke <- table(cdc$smoke100)

barplot(smoke)

Create a numerical summary for height and age, and compute the interquartile range for each. Compute the relative frequency distribution for gender and exerany. How many males are in the sample? What proportion of the sample reports being in excellent health?

#Numerical Summary

summary(cdc$height)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   48.00   64.00   67.00   67.18   70.00   93.00

summary(cdc$age)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   18.00   31.00   43.00   45.07   57.00   99.00

#The interquartile range

IQR(cdc$height)

## [1] 6

IQR(cdc$age)

## [1] 26

#Relative frequency distribution by gender

table(cdc$gender)/20000

## 
##       m       f 
## 0.47845 0.52155

table(cdc$gender)

## 
##     m     f 
##  9569 10431

There are 9569 males in the sample.

table(cdc$genhlth)

## 
## excellent very good      good      fair      poor 
##      4657      6972      5675      2019       677

4657 of the 20000 respondents (about 23.3%) report being in excellent health.

The table command can be used to tabulate any number of variables that you provide. For example, to examine which participants have smoked across each gender, we could use the following.

table(cdc$gender,cdc$smoke100)

##    
##        0    1
##   m 4547 5022
##   f 6012 4419

Here, we see column labels of 0 and 1. Recall that 1 indicates a respondent has smoked at least 100 cigarettes. The rows refer to gender. To create a mosaic plot of this table, we would enter the following command.

mosaicplot(table(cdc$gender,cdc$smoke100))

We could have accomplished this in two steps by saving the table in one line and applying mosaicplot in the next (see the table/barplot example above).

What does the mosaic plot reveal about smoking habits and gender?

Answer: A larger share of men than women have smoked at least 100 cigarettes in their lifetime (5022 of 9569 men versus 4419 of 10431 women). Most women in the sample have smoked fewer than 100 cigarettes.
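The comparison can be made concrete with row proportions. The sketch below re-enters the counts from the table(cdc$gender,cdc$smoke100) output above as a matrix (the name smoke_by_gender is hypothetical) and uses prop.table() with margin = 1 so that each gender's row sums to 1.

```r
# Counts copied from the two-way table above; rows are gender, columns are smoke100.
smoke_by_gender <- matrix(
  c(4547, 6012,   # column "0": fewer than 100 cigarettes (m, f)
    5022, 4419),  # column "1": at least 100 cigarettes (m, f)
  nrow = 2,
  dimnames = list(gender = c("m", "f"), smoke100 = c("0", "1"))
)

# Proportion within each gender (each row sums to 1).
row_props <- prop.table(smoke_by_gender, margin = 1)
print(round(row_props, 3))
```

Comparing proportions rather than raw counts matters here because the sample contains more women than men.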

Interlude: How R thinks about data

We mentioned that R stores data in data frames, which you might think of as a type of spreadsheet. Each row is a different observation (a different respondent) and each column is a different variable (the first is genhlth, the second exerany and so on). We can see the size of the data frame next to the object name in the workspace or we can type

dim(cdc)

## [1] 20000     9

which will return the number of rows and columns. Now, if we want to access a subset of the full data frame, we can use row-and-column notation. For example, to see the sixth variable of the 567th respondent, use the format

cdc[567,6]

## [1] 160

which means we want the element of our data set that is in the 567th row (meaning the 567th person or observation) and the 6th column (in this case, weight). We know that weight is the 6th variable because it is the 6th entry in the list of variable names

names(cdc)

## [1] "genhlth"  "exerany"  "hlthplan" "smoke100" "height"   "weight"  
## [7] "wtdesire" "age"      "gender"

To see the weights for the first 10 respondents we can type

cdc[1:10,6]

##  [1] 175 125 105 132 150 114 194 170 150 180

In this expression, we have asked just for rows in the range 1 through 10. R uses the : to create a range of values, so 1:10 expands to 1, 2, 3, 4, 5, 6, 7, 8, 9, 10. You can see this by entering

1:10

##  [1]  1  2  3  4  5  6  7  8  9 10

Finally, if we want all of the data for the first 10 respondents, type

cdc[1:10,]

##      genhlth exerany hlthplan smoke100 height weight wtdesire age gender
## 1       good       0        1        0     70    175      175  77      m
## 2       good       0        1        1     64    125      115  33      f
## 3       good       1        1        1     60    105      105  49      f
## 4       good       1        1        0     66    132      124  42      f
## 5  very good       0        1        0     61    150      130  55      f
## 6  very good       1        1        0     64    114      114  55      f
## 7  very good       1        1        0     71    194      185  31      m
## 8  very good       0        1        0     67    170      160  45      m
## 9       good       0        1        1     65    150      130  27      f
## 10      good       1        1        0     70    180      170  44      m

cdc$weight[567]

## [1] 160

Similarly, for just the first 10 respondents

cdc$weight[1:10]

##  [1] 175 125 105 132 150 114 194 170 150 180

The command above returns the same result as the cdc[1:10,6] command. Both row-and-column notation and dollar-sign notation are widely used; which one you choose depends on your personal preference.
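To illustrate that the notations are interchangeable, here is a minimal sketch on a two-column stand-in data frame built from the first ten respondents (the name toy is hypothetical; note that weight is column 2 here, not column 6 as in cdc).

```r
# Stand-in data frame: height and weight of the first ten respondents.
toy <- data.frame(
  height = c(70, 64, 60, 66, 61, 64, 71, 67, 65, 70),
  weight = c(175, 125, 105, 132, 150, 114, 194, 170, 150, 180)
)

by_position <- toy[1:10, 2]         # row-and-column notation, column by number
by_name     <- toy[1:10, "weight"]  # row-and-column notation, column by name
by_dollar   <- toy$weight[1:10]     # dollar-sign notation

print(identical(by_position, by_dollar))
```

Referring to a column by name, as in the second and third forms, keeps working if the column order changes, which is one reason some people prefer it.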

A little more on subsetting

It’s often useful to extract all individuals (cases) in a data set that have specific characteristics. We accomplish this through conditioning commands. For example, the command

mdata <- subset(cdc, cdc$gender == "m")

will create a new data set called mdata that contains only the men from the cdc data set. In addition to finding it in your workspace alongside its dimensions, you can take a peek at the first several rows as usual

head(mdata)

##      genhlth exerany hlthplan smoke100 height weight wtdesire age gender
## 1       good       0        1        0     70    175      175  77      m
## 7  very good       1        1        0     71    194      185  31      m
## 8  very good       0        1        0     67    170      160  45      m
## 10      good       1        1        0     70    180      170  44      m
## 11 excellent       1        1        1     69    186      175  46      m
## 12      fair       1        1        1     69    168      148  62      m

This new data set contains all the same variables but just under half the rows. It is also possible to tell R to keep only specific variables, which is a topic we’ll discuss in a future lab. For now, the important thing is that we can carve up the data based on values of one or more variables.

As an aside, you can use several of these conditions together with & and |. The & is read “and” so that

m_and_over30 <- subset(cdc, gender == "m" & age > 30)

will give you the data for men over the age of 30. The | character is read “or” so that

m_or_over30 <- subset(cdc, gender == "m" | age > 30)

will take people who are men or over the age of 30 (why that’s an interesting group is hard to say, but right now the mechanics of this are the important thing). In principle, you may use as many “and” and “or” clauses as you like when forming a subset.
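The difference between & and | is easiest to see on a handful of rows. The sketch below uses the gender and age of the first ten respondents as a stand-in data frame (the name toy is hypothetical) and counts how many rows each condition keeps.

```r
# Stand-in data frame: gender and age of the first ten respondents.
toy <- data.frame(
  gender = c("m", "f", "f", "f", "f", "f", "m", "m", "f", "m"),
  age    = c(77, 33, 49, 42, 55, 55, 31, 45, 27, 44),
  stringsAsFactors = FALSE
)

# Each condition is a vector of TRUE/FALSE values, one per row.
is_m   <- toy$gender == "m"
over30 <- toy$age > 30

# "and" keeps rows where both are TRUE; "or" keeps rows where at least one is TRUE.
m_and_over30 <- subset(toy, gender == "m" & age > 30)
m_or_over30  <- subset(toy, gender == "m" | age > 30)

print(c(and = nrow(m_and_over30), or = nrow(m_or_over30)))
```

The "or" subset is never smaller than the "and" subset, since any row that satisfies both conditions also satisfies at least one of them.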

Create a new object called under23_and_smoke that contains all observations of respondents under the age of 23 that have smoked 100 cigarettes in their lifetime. Write the command you used to create the new object as the answer to this exercise.

under23_and_smoke <- subset(cdc, age < 23 & smoke100 == 1)
head(under23_and_smoke)

##       genhlth exerany hlthplan smoke100 height weight wtdesire age gender
## 13  excellent       1        0        1     66    185      220  21      m
## 37  very good       1        0        1     70    160      140  18      f
## 96  excellent       1        1        1     74    175      200  22      m
## 180      good       1        1        1     64    190      140  20      f
## 182 very good       1        1        1     62     92       92  21      f
## 240 very good       1        0        1     64    125      115  22      f

boxplot(cdc$height)

You can compare the locations of the components of the box by examining the summary statistics.

summary(cdc$height)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   48.00   64.00   67.00   67.18   70.00   93.00

boxplot(cdc$height ~ cdc$gender)

The notation here is new. The ~ character can be read “versus” or “as a function of”. So we’re asking R to give us box plots of heights where the groups are defined by gender.

Next let’s consider a new variable that doesn’t show up directly in this data set: Body Mass Index (BMI) (http://en.wikipedia.org/wiki/Body_mass_index). BMI is a weight to height ratio and can be calculated as:

\[ \text{BMI} = \frac{\text{weight (lb)}}{\text{height (in)}^2} \times 703 \]

703 is the approximate conversion factor to change units from metric (meters and kilograms) to imperial (inches and pounds).
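As a worked check of the formula and of the factor 703, the sketch below computes BMI for the first respondent in head(cdc) (70 in, 175 lb) in two ways: with the imperial formula above, and by converting to kilograms and meters first. The conversion constants (1 lb = 0.45359237 kg, 1 in = 0.0254 m) are standard definitions; the variable names are hypothetical.

```r
# First respondent from head(cdc).
weight_lb <- 175
height_in <- 70

# Imperial formula from the text.
bmi_imperial <- (weight_lb / height_in^2) * 703

# Metric route: convert units, then use kg / m^2.
kg_per_lb <- 0.45359237
m_per_in  <- 0.0254
bmi_metric <- (weight_lb * kg_per_lb) / (height_in * m_per_in)^2

# The factor that 703 approximates.
exact_factor <- kg_per_lb / m_per_in^2

print(c(imperial = bmi_imperial, metric = bmi_metric, factor = exact_factor))
```

The two BMI values agree to about two decimal places, which is why 703 is described as an approximate conversion factor.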

The following two lines first make a new object called bmi and then create box plots of these values, defining groups by the variable cdc$genhlth.

bmi <- (cdc$weight / cdc$height^2) * 703
boxplot(bmi ~ cdc$genhlth)

What does this box plot show? Answer: The box plot shows the distribution of BMI values within each level of general health. The medians differ only slightly across the levels.

Pick another categorical variable from the data set and see how it relates to BMI. List the variable you chose, why you might think it would have a relationship to BMI, and indicate what the figure seems to suggest.

```r
bmi <- (cdc$weight / cdc$height^2) * 703
boxplot(bmi ~ cdc$hlthplan)
```

<img src="Introduction_to_data_files/figure-html/boxplot-bmi-age-1.png" width="672" />

BMI and hlthplan: The box plot shows that BMI values for respondents without a health plan are lower than for respondents with a health plan. Both groups contain outliers beyond the whiskers.

Finally, let’s make some histograms. We can look at the histogram for the age of our respondents with the command

hist(cdc$age)

hist(bmi)

hist(bmi, breaks = 50)

On Your Own

Make a scatterplot of weight versus desired weight. Describe the relationship between these two variables.
```
 plot(cdc$weight, cdc$wtdesire , xlab = 'Weight', ylab = 'Weight Desired')
```
There is a non-linear association between subjects' weight and their desired weight. Most of the points fall below the identity line, which means most people desire a weight that is lower than their recorded weight.
Let’s consider a new variable: the difference between desired weight (wtdesire) and current weight (weight). Create this new variable by subtracting the two columns in the data frame and assigning them to a new object called wdiff.
```
  wdiff <- cdc$weight - cdc$wtdesire
  head(wdiff)
```
```
## [1]  0 10  0  8 20  0
```
What type of data is wdiff? Answer: wdiff is a discrete numerical variable.

If an observation wdiff is 0, what does this mean about the person’s weight and desired weight? Answer: It means that the person is happy with their current weight.

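What if wdiff is positive or negative? Answer: Because wdiff was computed here as weight minus wtdesire, a positive wdiff means that the person weighs more than they would like and wishes to lose weight; a negative wdiff means that the person wishes to gain weight.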
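The sign convention can be checked on the first six respondents from head(cdc). The sketch below recomputes wdiff as weight minus wtdesire, as in the command above, and labels each value; the helper describe_goal and the three extra test values are hypothetical and only there to show all three cases.

```r
# First six respondents from head(cdc).
weight   <- c(175, 125, 105, 132, 150, 114)
wtdesire <- c(175, 115, 105, 124, 130, 114)

# Same definition as used above: current weight minus desired weight.
wdiff <- weight - wtdesire

# Hypothetical helper: translate the sign of wdiff into words.
describe_goal <- function(d) {
  ifelse(d > 0, "lose", ifelse(d < 0, "gain", "keep"))
}

print(data.frame(weight, wtdesire, wdiff, goal = describe_goal(wdiff)))

# Made-up values covering negative, zero, and positive differences.
print(describe_goal(c(-5, 0, 5)))
```

Had wdiff been defined the other way round (wtdesire minus weight), every sign and every label would flip, so it is worth stating the definition next to any interpretation.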
Describe the distribution of wdiff in terms of its center, shape, and spread, including any plots you use. What does this tell us about how people feel about their current weight?
```
hist(wdiff)
```
```
summary(wdiff)
```
```
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -500.00    0.00   10.00   14.59   21.00  300.00
```
The wdiff data is centered at a median of 10 lb, meaning that at least half of the respondents wish to lose weight. The IQR is 21 lb (from 0 lb to 21 lb), though there is a low outlier at -500 lb and a high outlier at 300 lb. The mean (14.59 lb) exceeds the median, which suggests that the distribution is skewed to the right.

Using numerical summaries and a side-by-side box plot, determine if men tend to view their weight differently than women.

male_wdiff <- subset(cdc, cdc$gender == 'm')
summary(male_wdiff$weight)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    78.0   165.0   185.0   189.3   210.0   500.0

dim(male_wdiff)

## [1] 9569    9

female_wdiff <- subset(cdc, cdc$gender == 'f')
summary(female_wdiff$weight)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    68.0   128.0   145.0   151.7   170.0   495.0

dim(female_wdiff)

## [1] 10431     9

boxplot(cdc$weight - cdc$wtdesire ~ cdc$gender, ylab = 'Weight - desired weight (lb)', main = "Weight Difference by Gender")

With wdiff computed as weight minus wtdesire, women tend to have larger wdiff values than men, which means women tend to wish to lose more weight than men.
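The summaries above describe weight within each gender rather than wdiff. A minimal sketch of summarizing wdiff itself by group is shown below, using the first ten respondents as a stand-in (the name toy is hypothetical). Ten rows are far too few to draw any conclusion about men and women; the sketch only illustrates the mechanics with tapply().

```r
# Stand-in data frame: the first ten respondents shown by cdc[1:10,].
toy <- data.frame(
  weight   = c(175, 125, 105, 132, 150, 114, 194, 170, 150, 180),
  wtdesire = c(175, 115, 105, 124, 130, 114, 185, 160, 130, 170),
  gender   = c("m", "f", "f", "f", "f", "f", "m", "m", "f", "m"),
  stringsAsFactors = FALSE
)

# Same definition as above: current weight minus desired weight.
toy$wdiff <- toy$weight - toy$wtdesire

# Apply a summary function to wdiff separately within each gender.
median_by_gender <- tapply(toy$wdiff, toy$gender, median)
print(median_by_gender)

# Full five-number summaries per group.
print(tapply(toy$wdiff, toy$gender, summary))
```

On the full data the same calls would take cdc$weight - cdc$wtdesire and cdc$gender as their first two arguments.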

Now it’s time to get creative. Find the mean and standard deviation of weight and determine what proportion of the weights are within one standard deviation of the mean.

x <- mean(cdc$weight)
summary(x)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   169.7   169.7   169.7   169.7   169.7   169.7

y <- sd(cdc$weight)
summary(y)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   40.08   40.08   40.08   40.08   40.08   40.08

z = subset(cdc, cdc$weight > (x-y) & cdc$weight < (x+y))
prop_weight = dim(z)/dim(cdc)*100
summary(prop_weight)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   70.76   78.07   85.38   85.38   92.69  100.00

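About 70.76 percent of the sample falls within one standard deviation of the mean weight. This is the Min. value in the output above: dim() returns both the number of rows and the number of columns, so prop_weight holds two ratios, and only the first (the ratio of row counts) is the proportion of interest. The second is 100 because the subset keeps all 9 columns.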
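A more direct way to get a single proportion is to count rows only. The sketch below uses the ten weights from cdc[1:10,6] as a toy sample (the names w and toy are hypothetical), so the resulting proportion is not the full-sample figure; it also reproduces the two-value result that dim() gives.

```r
# Toy sample: weights of the first ten respondents.
w <- c(175, 125, 105, 132, 150, 114, 194, 170, 150, 180)

# TRUE/FALSE per value: is it within one standard deviation of the mean?
within_one_sd <- abs(w - mean(w)) < sd(w)

# The mean of a logical vector is the proportion of TRUE values.
prop_within <- mean(within_one_sd)

# Same idea with a data frame and subset(), counting rows with nrow().
toy <- data.frame(id = 1:10, weight = w)
z <- subset(toy, abs(weight - mean(weight)) < sd(weight))
prop_by_rows <- nrow(z) / nrow(toy)

# dim() returns rows AND columns, so this yields two ratios, not one.
two_ratios <- dim(z) / dim(toy)

print(c(prop_within = prop_within, prop_by_rows = prop_by_rows))
print(two_ratios)
```

Using mean() on the logical vector avoids building a subset at all, while nrow() keeps the subset-based approach and still returns one number.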

Erinda Budo

9/6/2019
