Data for Demonstrations

I’ll use a sample of records from the Behavioral Risk Factors Surveillance System (BRFSS) conducted by the Centers for Disease Control (CDC).

First, make the data and a few packages available.

load("~/cdc.Rdata")

Determine the structure of the object CDC.

str(cdc)

## 'data.frame':    20000 obs. of  9 variables:
##  $ genhlth : Factor w/ 5 levels "excellent","very good",..: 3 3 3 3 2 2 2 2 3 3 ...
##  $ exerany : num  0 0 1 1 0 1 1 0 0 1 ...
##  $ hlthplan: num  1 1 1 1 1 1 1 1 1 1 ...
##  $ smoke100: num  0 1 1 0 0 0 0 0 1 0 ...
##  $ height  : num  70 64 60 66 61 64 71 67 65 70 ...
##  $ weight  : int  175 125 105 132 150 114 194 170 150 180 ...
##  $ wtdesire: int  175 115 105 124 130 114 185 160 130 170 ...
##  $ age     : int  77 33 49 42 55 55 31 45 27 44 ...
##  $ gender  : Factor w/ 2 levels "m","f": 1 2 2 2 2 2 1 1 2 1 ...

There are two categorical variables identified as factors, three categorical variables coded as numerical (binary variables) and 4 true quantitative variables.

Three Problems for Quantitative Variables

There are three basic questions you need to answer to explore a single quantitative variable.

Location - basically, where is the data located?
Variation (scatter, spread, etc.) - how spread out is the data?
Shape - Beyond location and variation, what should be said about the data?

Location

There are two standard measures of location:

The mean - Sensitive to outliers.
The median - Not senstive to outliers.

Consider age in the dataframe cdc.

mean(cdc$age)

## [1] 45.06825

median(cdc$age)

## [1] 43

Variation

Standard Deviation: Square root of the squares of the deviations from the mean.
IQR: 75th percentile - 25th percentile
Range: Max - Min

sd(cdc$age)

## [1] 17.19269

IQR(cdc$age)

## [1] 26

range(cdc$age)

## [1] 18 99

max(cdc$age) - min(cdc$age)

## [1] 81

Five-number summary

The summary function gives several standard descriptive statistics.

summary(cdc$age)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   18.00   31.00   43.00   45.07   57.00   99.00

Shape

Histogram
Boxplot
Density

hist(cdc$age)

boxplot(cdc$age)

plot(density(cdc$age))

Detecting Outliers

The following code finds the outlier status of every value of age using two different types of criterion,

Pct75 = quantile(cdc$age,.75)
Pct25 = quantile(cdc$age,.25)
IQR150 = IQR(cdc$age) * 1.5
High_IQR = cdc$age > Pct75 + IQR150
Low_IQR  = cdc$age < Pct25 - IQR150
zscore = (cdc$age - mean(cdc$age,na.rm=T))/sd(cdc$age,na.rm=T)
High_z = zscore > 3
Low_z  = zscore < - 3
table(High_z)

## High_z
## FALSE  TRUE 
## 19997     3

table(Low_z)

## Low_z
## FALSE 
## 20000

table(High_IQR)

## High_IQR
## FALSE  TRUE 
## 19997     3

table(Low_IQR)

## Low_IQR
## FALSE 
## 20000

Put it All together.

Having run all of these descriptive statistics and graphics, try to put it all together in words that would allow someone else to understand what you’re seeing.

Ages run from 18 to 99, but 50% of the observed ages fall within an interval of length 26 between 31 and 57. The standard deviation of the ages is 17. There are 3 persons with high values of age that meet both the z-score and IQR standard for outlier designation.

The center of the age distribution is at 43 according to the median or at 45 according to the mean. The distribution is asymmetric with a long tail to the right.

Categorical Variables

Finite number of possible values. All we can do is tabulate, either absolute or relative.

tabs = table(cdc$genhlth)
tabs

## 
## excellent very good      good      fair      poor 
##      4657      6972      5675      2019       677

trel = table(cdc$genhlth)/length(cdc$genhlth)
trel

## 
## excellent very good      good      fair      poor 
##   0.23285   0.34860   0.28375   0.10095   0.03385

You may use the barplot function to display these tables. This function can’t be directly applied to raw data.

barplot(tabs)

barplot(trel)

General Comments

Most functions will fail if any data is missing. To have the statsitic computed without the missing data, add the argument “na.rm=TRUE”.

Exanple:

x = c(1,2,3,NA)
mean(x)

## [1] NA

mean(x,na.rm=TRUE)

## [1] 2

A title can be placed above any of the graphics with the argument main.

Example

hist(cdc$age,main="Histogram of Ages in the cdc Dataset")

plot(density(cdc$age),main="Density of Age in the cdc Dataset")

Create an Outlier Detection Fuction.

We can convert the code above into a function.

DetectOut = function(x){
  Pct75 = quantile(x,.75)
  Pct25 = quantile(x,.25)
  IQR150 = IQR(x) * 1.5
  High_IQR = x > Pct75 + IQR150
  Low_IQR  = x < Pct25 - IQR150
  zscore = (x - mean(x,na.rm=T))/sd(x,na.rm=T)
  High_z = zscore > 3
  Low_z  = zscore < - 3
  DetDF = data.frame(High_IQR,High_z,Low_IQR,Low_z)
  return(DetDF)
}

We can use the function to check cdc$height for outliers.

DetDF = DetectOut(cdc$height)
summary(DetDF)

##   High_IQR         High_z         Low_IQR          Low_z        
##  Mode :logical   Mode :logical   Mode :logical   Mode :logical  
##  FALSE:19982     FALSE:19982     FALSE:19982     FALSE:19982    
##  TRUE :18        TRUE :18        TRUE :18        TRUE :18       
##  NA's :0         NA's :0         NA's :0         NA's :0

str(DetDF)

## 'data.frame':    20000 obs. of  4 variables:
##  $ High_IQR: logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ High_z  : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ Low_IQR : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ Low_z   : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...

We can use the logical vector High_IQR to extract these observations into a separate dataframe for analysis.

stretchers = cdc[DetDF$High_IQR,]
str(stretchers)

## 'data.frame':    18 obs. of  9 variables:
##  $ genhlth : Factor w/ 5 levels "excellent","very good",..: 5 2 1 2 2 1 3 2 1 2 ...
##  $ exerany : num  1 1 1 1 1 0 1 1 0 1 ...
##  $ hlthplan: num  1 1 1 0 1 0 0 1 1 1 ...
##  $ smoke100: num  1 0 0 1 0 0 1 0 0 0 ...
##  $ height  : num  80 81 83 80 80 80 80 82 81 82 ...
##  $ weight  : int  400 190 265 175 200 163 190 225 285 192 ...
##  $ wtdesire: int  225 250 265 175 185 163 190 240 260 180 ...
##  $ age     : int  48 21 25 32 21 51 45 20 35 57 ...
##  $ gender  : Factor w/ 2 levels "m","f": 1 1 1 1 1 1 1 1 1 1 ...

summary(stretchers)

##       genhlth     exerany          hlthplan         smoke100     
##  excellent:7   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
##  very good:7   1st Qu.:1.0000   1st Qu.:1.0000   1st Qu.:0.0000  
##  good     :3   Median :1.0000   Median :1.0000   Median :0.0000  
##  fair     :0   Mean   :0.8333   Mean   :0.7778   Mean   :0.3333  
##  poor     :1   3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:1.0000  
##                Max.   :1.0000   Max.   :1.0000   Max.   :1.0000  
##      height          weight         wtdesire          age        gender
##  Min.   :80.00   Min.   :150.0   Min.   :100.0   Min.   :19.00   m:18  
##  1st Qu.:80.00   1st Qu.:182.8   1st Qu.:176.2   1st Qu.:25.75   f: 0  
##  Median :80.00   Median :194.5   Median :187.5   Median :34.00         
##  Mean   :81.50   Mean   :220.4   Mean   :204.4   Mean   :37.83         
##  3rd Qu.:81.75   3rd Qu.:251.2   3rd Qu.:247.5   3rd Qu.:47.75         
##  Max.   :93.00   Max.   :400.0   Max.   :300.0   Max.   :77.00

People with heights of 80 inches or more are designated as outliers by this criterion.

County Data

First load the countyComplete dataset provided by OpenIntro.

countyComplete <- read.delim("~/Downloads/Data from openintro.org/countyComplete.txt")

Do a structure examination using str().

str(countyComplete)

## 'data.frame':    3143 obs. of  53 variables:
##  $ state                                    : Factor w/ 51 levels "Alabama","Alaska",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ name                                     : Factor w/ 1877 levels "Abbeville County",..: 83 90 101 151 166 227 237 250 298 320 ...
##  $ FIPS                                     : int  1001 1003 1005 1007 1009 1011 1013 1015 1017 1019 ...
##  $ pop2010                                  : int  54571 182265 27457 22915 57322 10914 20947 118572 34215 25989 ...
##  $ pop2000                                  : int  43671 140415 29038 20826 51024 11714 21399 112249 36583 23988 ...
##  $ age_under_5                              : num  6.6 6.1 6.2 6 6.3 6.8 6.5 6.1 5.7 5.3 ...
##  $ age_under_18                             : num  26.8 23 21.9 22.7 24.6 22.3 24.1 22.9 22.5 21.4 ...
##  $ age_over_65                              : num  12 16.8 14.2 12.7 14.7 13.5 16.7 14.3 16.7 17.9 ...
##  $ female                                   : num  51.3 51.1 46.9 46.3 50.5 45.8 53 51.8 52.2 50.4 ...
##  $ white                                    : num  78.5 85.7 48 75.8 92.6 23 54.4 74.9 58.8 92.7 ...
##  $ black                                    : num  17.7 9.4 46.9 22 1.3 70.2 43.4 20.6 38.7 4.6 ...
##  $ native                                   : num  0.4 0.7 0.4 0.3 0.5 0.2 0.3 0.5 0.2 0.5 ...
##  $ asian                                    : num  0.9 0.7 0.4 0.1 0.2 0.2 0.8 0.7 0.5 0.2 ...
##  $ pac_isl                                  : num  NA NA NA NA NA NA 0 0.1 0 0 ...
##  $ two_plus_races                           : num  1.6 1.5 0.9 0.9 1.2 0.8 0.8 1.7 1.1 1.5 ...
##  $ hispanic                                 : num  2.4 4.4 5.1 1.8 8.1 7.1 0.9 3.3 1.6 1.2 ...
##  $ white_not_hispanic                       : num  77.2 83.5 46.8 75 88.9 21.9 54.1 73.6 58.1 92.1 ...
##  $ no_move_in_one_plus_year                 : num  86.3 83 83 90.5 87.2 88.5 92.8 82.9 86.2 88.1 ...
##  $ foreign_born                             : num  2 3.6 2.8 0.7 4.7 1.1 1.1 2.5 0.9 0.5 ...
##  $ foreign_spoken_at_home                   : num  3.7 5.5 4.7 1.5 7.2 3.8 1.6 4.5 1.6 1.4 ...
##  $ hs_grad                                  : num  85.3 87.6 71.9 74.5 74.7 74.7 74.8 78.5 71.8 73.4 ...
##  $ bachelors                                : num  21.7 26.8 13.5 10 12.5 12 11 16.1 10.8 10.5 ...
##  $ veterans                                 : int  5817 20396 2327 1883 4072 943 1675 11757 2893 2172 ...
##  $ mean_work_travel                         : num  25.1 25.8 23.8 28.3 33.2 28.1 25.1 22.1 23.6 26.2 ...
##  $ housing_units                            : int  22135 104061 11829 8981 23887 4493 9964 53289 17004 16267 ...
##  $ home_ownership                           : num  77.5 76.7 68 82.9 82 76.9 69 70.7 71.4 77.5 ...
##  $ housing_multi_unit                       : num  7.2 22.6 11.1 6.6 3.7 9.9 13.7 14.3 8.7 4.3 ...
##  $ median_val_owner_occupied                : num  133900 177200 88200 81200 113700 ...
##  $ households                               : int  19718 69476 9795 7441 20605 3732 8019 46421 13681 11352 ...
##  $ persons_per_household                    : num  2.7 2.5 2.52 3.02 2.73 2.85 2.58 2.46 2.51 2.22 ...
##  $ per_capita_income                        : int  24568 26469 15875 19918 21070 20289 16916 20574 16626 21322 ...
##  $ median_household_income                  : int  53255 50147 33219 41770 45549 31602 30659 38407 31467 40690 ...
##  $ poverty                                  : num  10.6 12.2 25 12.6 13.4 25.3 25 19.5 20.3 17.6 ...
##  $ private_nonfarm_establishments           : int  877 4812 522 318 749 120 446 2444 568 350 ...
##  $ private_nonfarm_employment               : int  10628 52233 7990 2927 6968 1919 5400 38324 6241 3600 ...
##  $ percent_change_private_nonfarm_employment: num  16.6 17.4 -27 -14 -11.4 -18.5 2.1 -5.6 -45.8 5.4 ...
##  $ nonemployment_establishments             : int  2971 14175 1527 1192 3501 390 1180 6329 2074 1627 ...
##  $ firms                                    : int  4067 19035 1667 1385 4458 417 1769 8713 1981 2180 ...
##  $ black_owned_firms                        : num  15.2 2.7 NA 14.9 NA NA NA 7.2 NA NA ...
##  $ native_owned_firms                       : num  NA 0.4 NA NA NA NA NA NA NA NA ...
##  $ asian_owned_firms                        : num  1.3 1 NA NA NA NA 3.3 1.6 NA NA ...
##  $ pac_isl_owned_firms                      : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ hispanic_owned_firms                     : num  0.7 1.3 NA NA NA NA NA 0.5 NA NA ...
##  $ women_owned_firms                        : num  31.7 27.3 27 NA 23.2 38.8 NA 24.7 29.3 14.5 ...
##  $ manufacturer_shipments_2007              : int  NA 1410273 NA 0 341544 NA 399132 2679991 667283 307439 ...
##  $ mercent_whole_sales_2007                 : int  NA NA NA NA NA NA 56712 NA NA 62293 ...
##  $ sales                                    : int  598175 2966489 188337 124707 319700 43810 229277 1542981 264650 186321 ...
##  $ sales_per_capita                         : int  12003 17166 6334 5804 5622 3995 11326 13678 7620 7613 ...
##  $ accommodation_food_service               : int  88157 436955 NA 10757 20941 3670 28427 186533 23237 13948 ...
##  $ building_permits                         : int  191 696 10 8 18 1 3 107 10 6 ...
##  $ fed_spending                             : int  331142 1119082 240308 163201 294114 108846 195055 1830659 294718 184642 ...
##  $ area                                     : num  594 1590 885 623 645 ...
##  $ density                                  : num  91.8 114.6 31 36.8 88.9 ...

The Veterans Data

There is a variable veterans in countyComplete, which gives a count of veterans living in every county. Exploring this variable leads to an interesting issue.

summary(countyComplete$veterans)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0     958    2180    7207    5944  368100

hist(countyComplete$veterans)

boxplot(countyComplete$veterans,horizontal=TRUE)

plot(density(countyComplete$veterans))

The right-skewness is extreme and obscures the detail for most of the data.

Coping with the Extreme Skewness

There are two standard ways to cope with this other than eliminating the large values.

Do a log transformation and look at the log to the base 10 of the data. Recall that a log value of 2 corresponds to a raw value of 100; a log value of 3 correpsonds to a raw value of 1000, etc.
Look at the percentage of veterans rather than the count.

Let’s construct a smaller dataframe to make this easier to view.

VetDF = countyComplete[c("state","name","veterans","pop2010")]

We can add the new variables to this smaller dataframe.

VetDF$PctVet = VetDF$veterans/VetDF$pop2010
VetDF$LogVeterans = log10(VetDF$veterans + 1)

Now we can view this dataframe by double-clicking on it in the upper-right pane or using the View() command.

Note what we can do in the viewing tab.

Examine the Percent Veterans Variabls

summary(VetDF$PctVet)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.07243 0.08545 0.08714 0.09966 0.27710

hist(VetDF$PctVet)

boxplot(VetDF$PctVet)

plot(density(VetDF$PctVet))

Examine the Log Veterans Variable

summary(VetDF$LogVeterans)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   2.982   3.339   3.384   3.774   5.566

hist(VetDF$LogVeterans)

boxplot(VetDF$LogVeterans)

plot(density(VetDF$LogVeterans))

Note that the right-skewness has been eliminated and the graph is easier to interpret. We can interpret the mean and median log values by raising 10 to these powers. Look at the mean and median of the raw data values for comparison.

10^mean(VetDF$LogVeterans)

## [1] 2418.815

mean(VetDF$veterans)

## [1] 7207.285

10^median(VetDF$LogVeterans)

## [1] 2181

median(VetDF$veterans)

## [1] 2180

Examine the Percent Veterans Variable

summary(VetDF$PctVet)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.07243 0.08545 0.08714 0.09966 0.27710

hist(VetDF$PctVet)

boxplot(VetDF$PctVet)

plot(density(VetDF$PctVet))

This variable is well-behaved with a center aroung 8.5%. Values range from 0 to nearly 28%.

Look at the outliers in the viewing tab.

Exploratory Data Analysis: 1 Variable