Challenge Question

The challenge question for this week involves creating a combined histogram and horizontal boxplot of county-level population estimates in Kansas for the year 1890. To help with this, the file “KansasCensus.csv” is provided; it includes decennial census data for each county in Kansas from 1890 through the most recent census in 2020.

Submission Requirements

Create your combined histogram and horizontal boxplot in R and export the graph as a JPG file. After your graph is complete, create a new web page for this assignment and insert your graphic into the page. Include the R code you used to generate this result below your graphic. Upload your new HTML and graphics files to the class web server and modify your homepage to include a link to this new assignment page.

Solution

As usual, the first step in creating the graph is to import your data file into R:

# Import the data file into R
census <- read.csv(file="KansasCensus.csv", header=TRUE, sep=",")

Next, review the structure of the resulting data frame and pay special attention to the names assigned to each variable that can be graphed.

# Use the str() function to summarize the structure of the variable "census"
str(census)
## 'data.frame':    106 obs. of  26 variables:
##  $ FIPSSTCO : int  20001 20003 20005 20007 20009 20011 20013 20015 20017 20019 ...
##  $ GISJOIN  : chr  "G2000010" "G2000030" "G2000050" "G2000070" ...
##  $ STATE    : chr  "Kansas" "Kansas" "Kansas" "Kansas" ...
##  $ COUNTY   : chr  "Allen" "Anderson" "Atchison" "Barber" ...
##  $ CO_ABBR  : chr  "AL" "AN" "AT" "BA" ...
##  $ Y1890    : int  13509 14203 26758 7973 13172 28575 20319 24055 8233 12297 ...
##  $ Y1900    : int  19507 13938 28606 6594 13784 24712 22369 23363 8246 11804 ...
##  $ Y1910    : int  27640 13829 28107 9916 17876 24007 21314 23059 7527 11429 ...
##  $ Y1920    : int  23509 12986 23411 9739 18422 23198 20949 43842 7144 11598 ...
##  $ Y1930    : int  21391 13355 23945 10178 19776 22386 20553 35904 6952 10352 ...
##  $ Y1940    : int  19874 11658 22222 9073 25010 20944 17395 32013 6345 9233 ...
##  $ Y1950    : int  18187 10267 21496 8521 29909 19153 14651 31001 4831 7376 ...
##  $ Y1960    : int  16369 9035 20898 8713 32368 16090 13229 38395 3921 5956 ...
##  $ Y1970    : int  15043 8501 19165 7016 30663 15215 11685 38658 3408 4642 ...
##  $ Y1980    : int  15654 8749 18397 6548 31343 15969 11955 44782 3309 5016 ...
##  $ Y1990    : int  14638 7803 16932 5874 29382 14966 11128 50580 3021 4407 ...
##  $ Y2000    : int  14385 8110 16774 5307 28205 15379 10724 59482 3030 4359 ...
##  $ Y2010    : int  13371 8102 16924 4861 27674 15173 9984 65880 2790 3669 ...
##  $ Y2020    : int  12537 7850 16305 4198 25419 14333 9473 67401 2564 3386 ...
##  $ YR_MAX   : int  1910 1890 1900 1930 1960 1890 1900 2020 1900 1890 ...
##  $ AMT_MAX  : int  27640 14203 28606 10178 32368 28575 22369 67401 8246 12297 ...
##  $ YR_MIN   : int  2020 1990 2020 2020 1900 2020 2020 1910 2020 2020 ...
##  $ AMT_MIN  : int  12537 7803 16305 4198 13172 14333 9473 23059 2564 3386 ...
##  $ POP_LOST : int  -15103 -6400 -12301 -5980 -19196 -14242 -12896 0 -5682 -8911 ...
##  $ LABEL_MAX: chr  "AL--1910" "AN--1890" "AT--1900" "BA--1930" ...
##  $ SPACER   : chr  "--" "--" "--" "--" ...

The data frame referred to by the variable name “census” contains 106 observations of 26 variables, including population estimates for each county dating back to the 1890 census, the earliest one in this file. The variable name for each census year begins with the letter “Y” followed by the year.
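Because the year variables follow a consistent naming pattern, they can be picked out programmatically. The sketch below is only an illustration: it uses a small stand-in data frame (the hypothetical name `demo`) built from the first three counties in the str() output rather than the full file.

```r
# Stand-in data frame (hypothetical name "demo") using the first three
# counties from the str() output above; the real file has 106 rows
demo <- data.frame(
  COUNTY = c("Allen", "Anderson", "Atchison"),
  Y1890  = c(13509, 14203, 26758),
  Y1900  = c(19507, 13938, 28606),
  YR_MAX = c(1910, 1890, 1900),
  stringsAsFactors = FALSE
)

# Keep only names made of "Y" followed by exactly four digits
year_cols <- grep("^Y[0-9]{4}$", names(demo), value = TRUE)
print(year_cols)

# Recover the census years as numbers by dropping the leading "Y"
years <- as.integer(sub("^Y", "", year_cols))
print(years)
```

Note that YR_MAX also starts with “Y” but is not matched, because the pattern requires four digits right after the letter.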

Interestingly, this data frame includes 106 counties rather than the modern-day total of 105. The 1890 data include the former Garfield County, which was merged with Finney County in March 1893. Visit the website Garfield County, Kansas - Lost in County Seat War - to learn how and why this happened.

Now that we know how to call the correct variable from the data frame, plot the histogram.

# Create the histogram and note the upper values for the y-axis
hist(census$Y1890)

Without customizing the y-axis extent, there’s no room to include a horizontal boxplot along the top of the histogram. Here, I extend both axes and customize the main title and axis labels.

# Extend both axes and set the main title and x-axis label
hist(census$Y1890, main="Kansas County Population 1890", xlab="No. of People", xlim=c(0, 60000), ylim=c(0, 35))
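One way to choose an upper limit for the y-axis is to ask hist() for its bin counts without drawing anything. This is a minimal sketch, not the method used above: the vector holds only the first ten 1890 values shown in the str() output, so the counts will differ from the full graph, and the 25% headroom is an arbitrary choice.

```r
# First ten 1890 county values from the str() output above (a subset only)
pop1890 <- c(13509, 14203, 26758, 7973, 13172, 28575, 20319, 24055, 8233, 12297)

# Compute the histogram bins without plotting them
h <- hist(pop1890, plot = FALSE)

# The tallest bar sets the minimum extent the y-axis must cover
tallest <- max(h$counts)
print(tallest)

# Add headroom above the tallest bar for the boxplot (25% is arbitrary)
y_upper <- ceiling(tallest * 1.25)
print(y_upper)
```

The value of y_upper could then be supplied as the second element of ylim.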

Next, call the boxplot function and supply the correct variable name to create a box-and-whisker plot.

boxplot(census$Y1890)

Without modification, the boxplot function creates a standard vertical box-and-whisker plot. However, this challenge calls for adding a horizontal boxplot to an existing histogram.

To accomplish this, we need to take advantage of several optional arguments available to the boxplot function. You can read more about these using R’s built-in help system.

help(boxplot)
## starting httpd help server ... done

First, I’ll recreate the improved histogram in the graphics window and then insert the boxplot at a location relative to the y-axis. In the code block below, note the use of the optional (but here required) arguments “horizontal”, “at”, “add”, and “axes”.

hist(census$Y1890, main="Kansas County Population 1890", xlab="No. of People", xlim=c(0, 60000), ylim=c(0, 35))
boxplot(census$Y1890, horizontal=TRUE, at=32.5, add=TRUE, axes=FALSE)

Not a bad result, but notice that the box portion of the box-and-whisker plot, which represents the interquartile range, appears a bit too thin. Let’s use one additional optional argument for the boxplot function - boxwex - to make it a bit thicker.

hist(census$Y1890, main="Kansas County Population 1890", xlab="No. of People", xlim=c(0, 60000), ylim=c(0, 35))
boxplot(census$Y1890, horizontal=TRUE, at=32.5, add=TRUE, axes=FALSE, boxwex=3)
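The assignment also asks for the graph as a JPG file. One way to do that is to open a jpeg() graphics device, draw the plot, and close the device. The sketch below is hedged in several ways: it assumes your R build supports the jpeg device, it writes to a temporary folder under a hypothetical file name, and it plots only the first ten 1890 values from the str() output, so the axis limits and boxplot position are scaled down to suit that subset.

```r
# First ten 1890 county values from the str() output above (a subset only)
pop1890 <- c(13509, 14203, 26758, 7973, 13172, 28575, 20319, 24055, 8233, 12297)

# Hypothetical output file name, placed in a temporary folder
out_file <- file.path(tempdir(), "kansas_1890.jpg")

# Only attempt the export if this R build has JPEG support
if (capabilities("jpeg")) {
  # Open the JPEG device; width, height, and quality are arbitrary choices
  jpeg(filename = out_file, width = 800, height = 600, quality = 90)

  # Draw the histogram, then overlay the horizontal boxplot near the top
  hist(pop1890, main = "Kansas County Population 1890 (subset)",
       xlab = "No. of People", xlim = c(0, 60000), ylim = c(0, 10))
  boxplot(pop1890, horizontal = TRUE, at = 9, add = TRUE,
          axes = FALSE, boxwex = 1)

  # Close the device so the file is written to disk
  invisible(dev.off())
}

print(file.exists(out_file))
```

For the real assignment you would replace the subset with census$Y1890, restore the original ylim, at, and boxwex values, and point the file name at your own working folder.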

Interpretation

What does this combined graph tell us? Remember that histograms help us evaluate the distribution of values in a dataset which, here, is county-level population in 1890. Box-and-whisker plots, by contrast, excel at communicating central tendency and dispersion, and they help highlight the presence of outliers.

To assist with our interpretation, we can also make good use of functions that provide summary statistics for the variable(s) of interest. Let’s first use the built-in summary function.

summary(census$Y1890)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     724    4537   12256   13463   19284   54407

The summary function yields exact values for some important statistics - many featured in the boxplot - including the minimum and maximum values, the 1st and 3rd quartiles, and the median and mean values. We now know, for example, that the median value (the dark black line in the boxplot) is 12,256. We can also easily calculate the interquartile range (represented by the box in the boxplot) as the 3rd quartile minus the 1st quartile, or 19,284 - 4,537 = 14,747.
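The same arithmetic can be extended to the outlier fences that the boxplot whiskers rely on. This sketch takes the quartiles straight from the summary() output and assumes the usual 1.5 x IQR convention, which is the default “range” for boxplot(). Keep in mind that boxplot() works from hinges rather than these quartiles, so its fences may differ slightly.

```r
# Quartiles as reported by summary(census$Y1890) above
q1 <- 4537
q3 <- 19284

# Interquartile range: the width of the box
iqr <- q3 - q1
print(iqr)

# Fences under the 1.5 x IQR convention (an assumption, see the text above)
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr
print(lower_fence)
print(upper_fence)
```

Because the lower fence is negative and population counts cannot be, only the upper fence can flag outliers here.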

I also find another function helpful during exploratory analyses such as this: “describe”, which is included in the psych package.

library(psych)
describe(census$Y1890)
##    vars   n     mean       sd  median  trimmed     mad min   max range skew
## X1    1 106 13463.17 10746.93 12256.5 12213.97 10778.5 724 54407 53683 1.19
##    kurtosis      se
## X1     1.82 1043.83

The describe function includes an expanded set of summary statistics, including skew and kurtosis which are descriptors of the shape of the data distribution illustrated by the histogram.
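To make skew and kurtosis less of a black box, here is a minimal sketch of one common moment-based definition of each. The two helper functions are hypothetical names of my own, not part of psych, and because several variants of these statistics exist, the results may not match describe() exactly. The data are again just the first ten 1890 values from the str() output.

```r
# Hypothetical helper: third central moment scaled by the cubed
# standard deviation (one common sample definition of skew)
sample_skew <- function(x) {
  d <- x - mean(x)
  mean(d^3) / sd(x)^3
}

# Hypothetical helper: fourth central moment scaled by sd^4, minus 3
# so that a normal distribution scores near zero (excess kurtosis)
sample_excess_kurtosis <- function(x) {
  d <- x - mean(x)
  mean(d^4) / sd(x)^4 - 3
}

# First ten 1890 county values from the str() output above (a subset only)
pop1890 <- c(13509, 14203, 26758, 7973, 13172, 28575, 20319, 24055, 8233, 12297)
print(sample_skew(pop1890))
print(sample_excess_kurtosis(pop1890))

# A perfectly symmetric vector has no skew
symmetric <- c(1, 2, 3, 4, 5)
print(sample_skew(symmetric))

# A vector with one large value in the right tail has positive skew
right_tailed <- c(1, 1, 1, 10)
print(sample_skew(right_tailed))
```

A positive skew means the long tail points toward larger values, and a positive excess kurtosis means heavier tails than a normal distribution.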

Combined, our histogram/boxplot graph and summary statistics convey a lot of information. The county-level population of Kansas in 1890 was low as compared to today, and the mean (13,463) is larger than the median (12,256). The distribution of population values is positively (or right) skewed (skew = 1.19) and is leptokurtic, or heavy-tailed (kurtosis = 1.82). All of these conditions suggest a non-normal distribution.

The boxplot helps us identify good measures of central tendency and dispersion for the 1890 population values that were confirmed with the summary statistics. It also shows us that there were three upper outliers.

If you’re curious about which three Kansas counties these were, we can find out using a combination of built-in R functions.

head(census[order(-census$Y1890), ])
##     FIPSSTCO  GISJOIN  STATE      COUNTY CO_ABBR Y1890 Y1900  Y1910  Y1920
## 106    20209 G2002090 Kansas   Wyandotte      WY 54407 73227 100068 122218
## 90     20177 G2001770 Kansas     Shawnee      SN 49172 53727  61874  69159
## 88     20173 G2001730 Kansas    Sedgwick      SG 43626 44037  73095  92234
## 53     20103 G2001030 Kansas Leavenworth      LV 38485 40940  41207  38402
## 18     20035 G2000350 Kansas      Cowley      CL 34478 30156  31790  35155
## 19     20037 G2000370 Kansas    Crawford      CR 30286 38809  51178  61800
##      Y1930  Y1940  Y1950  Y1960  Y1970  Y1980  Y1990  Y2000  Y2010  Y2020
## 106 141211 145071 165318 185495 186845 172335 161993 157882 157505 168873
## 90   85200  91247 105418 141286 155322 154916 160976 169871 177934 178608
## 88  136330 143311 222290 343231 350694 366531 403662 452869 498365 524246
## 53   42673  41112  42361  48524  53340  54809  64371  68691  76227  81870
## 18   40903  38139  36905  37861  35012  36824  36915  36291  36311  34498
## 19   49329  44191  40231  37032  37850  37916  35568  38242  39134  38930
##     YR_MAX AMT_MAX YR_MIN AMT_MIN POP_LOST LABEL_MAX SPACER
## 106   1970  186845   1890   54407  -132438  WY--1970     --
## 90    2020  178608   1890   49172        0  SN--2020     --
## 88    2020  524246   1890   43626        0  SG--2020     --
## 53    2020   81870   1920   38402        0  LV--2020     --
## 18    1930   40903   1900   30156   -10747  CL--1930     --
## 19    1920   61800   1890   30286   -31514  CR--1920     --

What were the three Kansas counties with the largest populations…and the outliers? Wyandotte, Shawnee, and Sedgwick.
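As a cross-check, we can confirm that exactly these three counties sit above the upper outlier fence. This sketch copies the six largest 1890 values from the head() output and assumes the 1.5 x IQR convention with the quartiles reported by summary(). Since boxplot() uses hinges rather than those quartiles, treat the fence as approximate.

```r
# Six largest 1890 counts, copied from the head() output above
top6 <- data.frame(
  COUNTY = c("Wyandotte", "Shawnee", "Sedgwick",
             "Leavenworth", "Cowley", "Crawford"),
  Y1890  = c(54407, 49172, 43626, 38485, 34478, 30286),
  stringsAsFactors = FALSE
)

# Upper fence from the summary() quartiles: Q3 + 1.5 * (Q3 - Q1)
upper_fence <- 19284 + 1.5 * (19284 - 4537)

# Keep only the counties whose 1890 count exceeds the fence
outliers <- top6[top6$Y1890 > upper_fence, ]
print(outliers$COUNTY)
```

Leavenworth, the fourth-largest county at 38,485, falls below the fence of roughly 41,400, which is why the boxplot draws only three outlier points.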