Harold Nelson
10/24/2016
I’ll use a sample of records from the Behavioral Risk Factors Surveillance System (BRFSS) conducted by the Centers for Disease Control (CDC).
First, make the data and a few packages available.
load("~/cdc.Rdata")
Determine the structure of the object CDC.
str(cdc)
## 'data.frame': 20000 obs. of 9 variables:
## $ genhlth : Factor w/ 5 levels "excellent","very good",..: 3 3 3 3 2 2 2 2 3 3 ...
## $ exerany : num 0 0 1 1 0 1 1 0 0 1 ...
## $ hlthplan: num 1 1 1 1 1 1 1 1 1 1 ...
## $ smoke100: num 0 1 1 0 0 0 0 0 1 0 ...
## $ height : num 70 64 60 66 61 64 71 67 65 70 ...
## $ weight : int 175 125 105 132 150 114 194 170 150 180 ...
## $ wtdesire: int 175 115 105 124 130 114 185 160 130 170 ...
## $ age : int 77 33 49 42 55 55 31 45 27 44 ...
## $ gender : Factor w/ 2 levels "m","f": 1 2 2 2 2 2 1 1 2 1 ...
There are two categorical variables identified as factors, three categorical variables coded as numerical (binary variables) and 4 true quantitative variables.
There are three basic questions you need to answer to explore a single quantitative variable.
There are two standard measures of location:
Consider age in the dataframe cdc.
mean(cdc$age)
## [1] 45.06825
median(cdc$age)
## [1] 43
sd(cdc$age)
## [1] 17.19269
IQR(cdc$age)
## [1] 26
range(cdc$age)
## [1] 18 99
max(cdc$age) - min(cdc$age)
## [1] 81
The summary function gives several standard descriptive statistics.
summary(cdc$age)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 18.00 31.00 43.00 45.07 57.00 99.00
hist(cdc$age)
boxplot(cdc$age)
plot(density(cdc$age))
The following code finds the outlier status of every value of age using two different types of criterion,
Pct75 = quantile(cdc$age,.75)
Pct25 = quantile(cdc$age,.25)
IQR150 = IQR(cdc$age) * 1.5
High_IQR = cdc$age > Pct75 + IQR150
Low_IQR = cdc$age < Pct25 - IQR150
zscore = (cdc$age - mean(cdc$age,na.rm=T))/sd(cdc$age,na.rm=T)
High_z = zscore > 3
Low_z = zscore < - 3
table(High_z)
## High_z
## FALSE TRUE
## 19997 3
table(Low_z)
## Low_z
## FALSE
## 20000
table(High_IQR)
## High_IQR
## FALSE TRUE
## 19997 3
table(Low_IQR)
## Low_IQR
## FALSE
## 20000
Having run all of these descriptive statistics and graphics, try to put it all together in words that would allow someone else to understand what you’re seeing.
Ages run from 18 to 99, but 50% of the observed ages fall within an interval of length 26 between 31 and 57. The standard deviation of the ages is 17. There are 3 persons with high values of age that meet both the z-score and IQR standard for outlier designation.
The center of the age distribution is at 43 according to the median or at 45 according to the mean. The distribution is asymmetric with a long tail to the right.
Finite number of possible values. All we can do is tabulate, either absolute or relative.
tabs = table(cdc$genhlth)
tabs
##
## excellent very good good fair poor
## 4657 6972 5675 2019 677
trel = table(cdc$genhlth)/length(cdc$genhlth)
trel
##
## excellent very good good fair poor
## 0.23285 0.34860 0.28375 0.10095 0.03385
You may use the barplot function to display these tables. This function can’t be directly applied to raw data.
barplot(tabs)
barplot(trel)
Most functions will fail if any data is missing. To have the statsitic computed without the missing data, add the argument “na.rm=TRUE”.
Exanple:
x = c(1,2,3,NA)
mean(x)
## [1] NA
mean(x,na.rm=TRUE)
## [1] 2
A title can be placed above any of the graphics with the argument main.
Example
hist(cdc$age,main="Histogram of Ages in the cdc Dataset")
plot(density(cdc$age),main="Density of Age in the cdc Dataset")
We can convert the code above into a function.
DetectOut = function(x){
Pct75 = quantile(x,.75)
Pct25 = quantile(x,.25)
IQR150 = IQR(x) * 1.5
High_IQR = x > Pct75 + IQR150
Low_IQR = x < Pct25 - IQR150
zscore = (x - mean(x,na.rm=T))/sd(x,na.rm=T)
High_z = zscore > 3
Low_z = zscore < - 3
DetDF = data.frame(High_IQR,High_z,Low_IQR,Low_z)
return(DetDF)
}
We can use the function to check cdc$height for outliers.
DetDF = DetectOut(cdc$height)
summary(DetDF)
## High_IQR High_z Low_IQR Low_z
## Mode :logical Mode :logical Mode :logical Mode :logical
## FALSE:19982 FALSE:19982 FALSE:19982 FALSE:19982
## TRUE :18 TRUE :18 TRUE :18 TRUE :18
## NA's :0 NA's :0 NA's :0 NA's :0
str(DetDF)
## 'data.frame': 20000 obs. of 4 variables:
## $ High_IQR: logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ High_z : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ Low_IQR : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ Low_z : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
We can use the logical vector High_IQR to extract these observations into a separate dataframe for analysis.
stretchers = cdc[DetDF$High_IQR,]
str(stretchers)
## 'data.frame': 18 obs. of 9 variables:
## $ genhlth : Factor w/ 5 levels "excellent","very good",..: 5 2 1 2 2 1 3 2 1 2 ...
## $ exerany : num 1 1 1 1 1 0 1 1 0 1 ...
## $ hlthplan: num 1 1 1 0 1 0 0 1 1 1 ...
## $ smoke100: num 1 0 0 1 0 0 1 0 0 0 ...
## $ height : num 80 81 83 80 80 80 80 82 81 82 ...
## $ weight : int 400 190 265 175 200 163 190 225 285 192 ...
## $ wtdesire: int 225 250 265 175 185 163 190 240 260 180 ...
## $ age : int 48 21 25 32 21 51 45 20 35 57 ...
## $ gender : Factor w/ 2 levels "m","f": 1 1 1 1 1 1 1 1 1 1 ...
summary(stretchers)
## genhlth exerany hlthplan smoke100
## excellent:7 Min. :0.0000 Min. :0.0000 Min. :0.0000
## very good:7 1st Qu.:1.0000 1st Qu.:1.0000 1st Qu.:0.0000
## good :3 Median :1.0000 Median :1.0000 Median :0.0000
## fair :0 Mean :0.8333 Mean :0.7778 Mean :0.3333
## poor :1 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:1.0000
## Max. :1.0000 Max. :1.0000 Max. :1.0000
## height weight wtdesire age gender
## Min. :80.00 Min. :150.0 Min. :100.0 Min. :19.00 m:18
## 1st Qu.:80.00 1st Qu.:182.8 1st Qu.:176.2 1st Qu.:25.75 f: 0
## Median :80.00 Median :194.5 Median :187.5 Median :34.00
## Mean :81.50 Mean :220.4 Mean :204.4 Mean :37.83
## 3rd Qu.:81.75 3rd Qu.:251.2 3rd Qu.:247.5 3rd Qu.:47.75
## Max. :93.00 Max. :400.0 Max. :300.0 Max. :77.00
People with heights of 80 inches or more are designated as outliers by this criterion.
First load the countyComplete dataset provided by OpenIntro.
countyComplete <- read.delim("~/Downloads/Data from openintro.org/countyComplete.txt")
Do a structure examination using str().
str(countyComplete)
## 'data.frame': 3143 obs. of 53 variables:
## $ state : Factor w/ 51 levels "Alabama","Alaska",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ name : Factor w/ 1877 levels "Abbeville County",..: 83 90 101 151 166 227 237 250 298 320 ...
## $ FIPS : int 1001 1003 1005 1007 1009 1011 1013 1015 1017 1019 ...
## $ pop2010 : int 54571 182265 27457 22915 57322 10914 20947 118572 34215 25989 ...
## $ pop2000 : int 43671 140415 29038 20826 51024 11714 21399 112249 36583 23988 ...
## $ age_under_5 : num 6.6 6.1 6.2 6 6.3 6.8 6.5 6.1 5.7 5.3 ...
## $ age_under_18 : num 26.8 23 21.9 22.7 24.6 22.3 24.1 22.9 22.5 21.4 ...
## $ age_over_65 : num 12 16.8 14.2 12.7 14.7 13.5 16.7 14.3 16.7 17.9 ...
## $ female : num 51.3 51.1 46.9 46.3 50.5 45.8 53 51.8 52.2 50.4 ...
## $ white : num 78.5 85.7 48 75.8 92.6 23 54.4 74.9 58.8 92.7 ...
## $ black : num 17.7 9.4 46.9 22 1.3 70.2 43.4 20.6 38.7 4.6 ...
## $ native : num 0.4 0.7 0.4 0.3 0.5 0.2 0.3 0.5 0.2 0.5 ...
## $ asian : num 0.9 0.7 0.4 0.1 0.2 0.2 0.8 0.7 0.5 0.2 ...
## $ pac_isl : num NA NA NA NA NA NA 0 0.1 0 0 ...
## $ two_plus_races : num 1.6 1.5 0.9 0.9 1.2 0.8 0.8 1.7 1.1 1.5 ...
## $ hispanic : num 2.4 4.4 5.1 1.8 8.1 7.1 0.9 3.3 1.6 1.2 ...
## $ white_not_hispanic : num 77.2 83.5 46.8 75 88.9 21.9 54.1 73.6 58.1 92.1 ...
## $ no_move_in_one_plus_year : num 86.3 83 83 90.5 87.2 88.5 92.8 82.9 86.2 88.1 ...
## $ foreign_born : num 2 3.6 2.8 0.7 4.7 1.1 1.1 2.5 0.9 0.5 ...
## $ foreign_spoken_at_home : num 3.7 5.5 4.7 1.5 7.2 3.8 1.6 4.5 1.6 1.4 ...
## $ hs_grad : num 85.3 87.6 71.9 74.5 74.7 74.7 74.8 78.5 71.8 73.4 ...
## $ bachelors : num 21.7 26.8 13.5 10 12.5 12 11 16.1 10.8 10.5 ...
## $ veterans : int 5817 20396 2327 1883 4072 943 1675 11757 2893 2172 ...
## $ mean_work_travel : num 25.1 25.8 23.8 28.3 33.2 28.1 25.1 22.1 23.6 26.2 ...
## $ housing_units : int 22135 104061 11829 8981 23887 4493 9964 53289 17004 16267 ...
## $ home_ownership : num 77.5 76.7 68 82.9 82 76.9 69 70.7 71.4 77.5 ...
## $ housing_multi_unit : num 7.2 22.6 11.1 6.6 3.7 9.9 13.7 14.3 8.7 4.3 ...
## $ median_val_owner_occupied : num 133900 177200 88200 81200 113700 ...
## $ households : int 19718 69476 9795 7441 20605 3732 8019 46421 13681 11352 ...
## $ persons_per_household : num 2.7 2.5 2.52 3.02 2.73 2.85 2.58 2.46 2.51 2.22 ...
## $ per_capita_income : int 24568 26469 15875 19918 21070 20289 16916 20574 16626 21322 ...
## $ median_household_income : int 53255 50147 33219 41770 45549 31602 30659 38407 31467 40690 ...
## $ poverty : num 10.6 12.2 25 12.6 13.4 25.3 25 19.5 20.3 17.6 ...
## $ private_nonfarm_establishments : int 877 4812 522 318 749 120 446 2444 568 350 ...
## $ private_nonfarm_employment : int 10628 52233 7990 2927 6968 1919 5400 38324 6241 3600 ...
## $ percent_change_private_nonfarm_employment: num 16.6 17.4 -27 -14 -11.4 -18.5 2.1 -5.6 -45.8 5.4 ...
## $ nonemployment_establishments : int 2971 14175 1527 1192 3501 390 1180 6329 2074 1627 ...
## $ firms : int 4067 19035 1667 1385 4458 417 1769 8713 1981 2180 ...
## $ black_owned_firms : num 15.2 2.7 NA 14.9 NA NA NA 7.2 NA NA ...
## $ native_owned_firms : num NA 0.4 NA NA NA NA NA NA NA NA ...
## $ asian_owned_firms : num 1.3 1 NA NA NA NA 3.3 1.6 NA NA ...
## $ pac_isl_owned_firms : num NA NA NA NA NA NA NA NA NA NA ...
## $ hispanic_owned_firms : num 0.7 1.3 NA NA NA NA NA 0.5 NA NA ...
## $ women_owned_firms : num 31.7 27.3 27 NA 23.2 38.8 NA 24.7 29.3 14.5 ...
## $ manufacturer_shipments_2007 : int NA 1410273 NA 0 341544 NA 399132 2679991 667283 307439 ...
## $ mercent_whole_sales_2007 : int NA NA NA NA NA NA 56712 NA NA 62293 ...
## $ sales : int 598175 2966489 188337 124707 319700 43810 229277 1542981 264650 186321 ...
## $ sales_per_capita : int 12003 17166 6334 5804 5622 3995 11326 13678 7620 7613 ...
## $ accommodation_food_service : int 88157 436955 NA 10757 20941 3670 28427 186533 23237 13948 ...
## $ building_permits : int 191 696 10 8 18 1 3 107 10 6 ...
## $ fed_spending : int 331142 1119082 240308 163201 294114 108846 195055 1830659 294718 184642 ...
## $ area : num 594 1590 885 623 645 ...
## $ density : num 91.8 114.6 31 36.8 88.9 ...
There is a variable veterans in countyComplete, which gives a count of veterans living in every county. Exploring this variable leads to an interesting issue.
summary(countyComplete$veterans)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 958 2180 7207 5944 368100
hist(countyComplete$veterans)
boxplot(countyComplete$veterans,horizontal=TRUE)
plot(density(countyComplete$veterans))
The right-skewness is extreme and obscures the detail for most of the data.
There are two standard ways to cope with this other than eliminating the large values.
Do a log transformation and look at the log to the base 10 of the data. Recall that a log value of 2 corresponds to a raw value of 100; a log value of 3 correpsonds to a raw value of 1000, etc.
Look at the percentage of veterans rather than the count.
Let’s construct a smaller dataframe to make this easier to view.
VetDF = countyComplete[c("state","name","veterans","pop2010")]
We can add the new variables to this smaller dataframe.
VetDF$PctVet = VetDF$veterans/VetDF$pop2010
VetDF$LogVeterans = log10(VetDF$veterans + 1)
Now we can view this dataframe by double-clicking on it in the upper-right pane or using the View() command.
Note what we can do in the viewing tab.
summary(VetDF$PctVet)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.07243 0.08545 0.08714 0.09966 0.27710
hist(VetDF$PctVet)
boxplot(VetDF$PctVet)
plot(density(VetDF$PctVet))
summary(VetDF$LogVeterans)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 2.982 3.339 3.384 3.774 5.566
hist(VetDF$LogVeterans)
boxplot(VetDF$LogVeterans)
plot(density(VetDF$LogVeterans))
Note that the right-skewness has been eliminated and the graph is easier to interpret. We can interpret the mean and median log values by raising 10 to these powers. Look at the mean and median of the raw data values for comparison.
10^mean(VetDF$LogVeterans)
## [1] 2418.815
mean(VetDF$veterans)
## [1] 7207.285
10^median(VetDF$LogVeterans)
## [1] 2181
median(VetDF$veterans)
## [1] 2180
summary(VetDF$PctVet)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.07243 0.08545 0.08714 0.09966 0.27710
hist(VetDF$PctVet)
boxplot(VetDF$PctVet)
plot(density(VetDF$PctVet))
This variable is well-behaved with a center aroung 8.5%. Values range from 0 to nearly 28%.
Look at the outliers in the viewing tab.