The challenge question for this week involves creating a combined histogram and horizontal boxplot for county-level population estimates in Kansas for the year 1890. To help with this, the file “KansasCensus.csv” is provided, which includes decennial census data for each county in Kansas from 1890 through the most recent census in 2020.
Create your combined histogram and horizontal boxplot in R and export the graph as a JPG file. After your graph is complete, create a new web page for this assignment and insert your graphic into the page. Include the R code you used to generate this result below your graphic. Upload your new HTML and graphics files to the class web server and modify your homepage to include a link to this new assignment page.
As usual, the first step in creating the graph is to import your data file into R:
# Import the data file into R
census <- read.csv(file="KansasCensus.csv", header=TRUE, sep=",")
Next, review the structure of the resulting data frame and pay special attention to the names assigned to each variable that can be graphed.
# Use the str() function to summarize the structure of the variable "census"
str(census)
## 'data.frame': 106 obs. of 26 variables:
## $ FIPSSTCO : int 20001 20003 20005 20007 20009 20011 20013 20015 20017 20019 ...
## $ GISJOIN : chr "G2000010" "G2000030" "G2000050" "G2000070" ...
## $ STATE : chr "Kansas" "Kansas" "Kansas" "Kansas" ...
## $ COUNTY : chr "Allen" "Anderson" "Atchison" "Barber" ...
## $ CO_ABBR : chr "AL" "AN" "AT" "BA" ...
## $ Y1890 : int 13509 14203 26758 7973 13172 28575 20319 24055 8233 12297 ...
## $ Y1900 : int 19507 13938 28606 6594 13784 24712 22369 23363 8246 11804 ...
## $ Y1910 : int 27640 13829 28107 9916 17876 24007 21314 23059 7527 11429 ...
## $ Y1920 : int 23509 12986 23411 9739 18422 23198 20949 43842 7144 11598 ...
## $ Y1930 : int 21391 13355 23945 10178 19776 22386 20553 35904 6952 10352 ...
## $ Y1940 : int 19874 11658 22222 9073 25010 20944 17395 32013 6345 9233 ...
## $ Y1950 : int 18187 10267 21496 8521 29909 19153 14651 31001 4831 7376 ...
## $ Y1960 : int 16369 9035 20898 8713 32368 16090 13229 38395 3921 5956 ...
## $ Y1970 : int 15043 8501 19165 7016 30663 15215 11685 38658 3408 4642 ...
## $ Y1980 : int 15654 8749 18397 6548 31343 15969 11955 44782 3309 5016 ...
## $ Y1990 : int 14638 7803 16932 5874 29382 14966 11128 50580 3021 4407 ...
## $ Y2000 : int 14385 8110 16774 5307 28205 15379 10724 59482 3030 4359 ...
## $ Y2010 : int 13371 8102 16924 4861 27674 15173 9984 65880 2790 3669 ...
## $ Y2020 : int 12537 7850 16305 4198 25419 14333 9473 67401 2564 3386 ...
## $ YR_MAX : int 1910 1890 1900 1930 1960 1890 1900 2020 1900 1890 ...
## $ AMT_MAX : int 27640 14203 28606 10178 32368 28575 22369 67401 8246 12297 ...
## $ YR_MIN : int 2020 1990 2020 2020 1900 2020 2020 1910 2020 2020 ...
## $ AMT_MIN : int 12537 7803 16305 4198 13172 14333 9473 23059 2564 3386 ...
## $ POP_LOST : int -15103 -6400 -12301 -5980 -19196 -14242 -12896 0 -5682 -8911 ...
## $ LABEL_MAX: chr "AL--1910" "AN--1890" "AT--1900" "BA--1930" ...
## $ SPACER : chr "--" "--" "--" "--" ...
The data frame referred to by the variable name “census” contains 106 observations of 26 variables, including population estimates for each county dating back to the 1890 census, the earliest in this file. The variable names for each census year begin with the letter “Y” followed by the year.
Interestingly, this data frame includes 106 rather than the modern-day total of 105 counties in the state. In 1890, population data exists for the former Garfield County, which was merged with Finney County in March 1893. Visit the website Garfield County, Kansas - Lost in County Seat War - to learn how and why this happened.
Now that we know how to call the correct variable from the data frame, plot the histogram.
# Create the histogram and note the upper values for the y-axis
hist(census$Y1890)
Without customizing the y-axis extent, there’s no room to include a horizontal boxplot along the top of the histogram. Here, I extend both axes and customize the main title and axis labels.
hist(census$Y1890, main="Kansas County Population 1890", xlab="No. of People", xlim=c(0, 60000), ylim=c(0, 35))
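One way to pick the upper ylim value without guessing is to compute the histogram without drawing it and read off the tallest bar. The sketch below assumes only the first ten Y1890 values printed by str() above, so the counts are illustrative and not the result for all 106 counties. The names x, h, and top are my own.

```r
# Illustrative subset: the first ten Y1890 values shown in the str(census) output
x <- c(13509, 14203, 26758, 7973, 13172, 28575, 20319, 24055, 8233, 12297)

# plot = FALSE returns the bin breaks and counts without drawing anything
h <- hist(x, plot = FALSE)

# Height of the tallest bar; ylim must extend above this to leave room for a boxplot
top <- max(h$counts)
top
```

With the full data, replace x with census$Y1890 and set the upper ylim a few units above the value of top.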
Next, call the boxplot function and supply the correct variable name to create a box-and-whisker plot.
boxplot(census$Y1890)
Without modification, the boxplot function creates a standard vertical box-and-whisker plot. However, this challenge calls for adding a horizontal boxplot to an existing histogram.
To accomplish this, we need to take advantage of several optional arguments available to the boxplot function. You can read more about these using R’s built-in help system.
help(boxplot)
## starting httpd help server ... done
First, I’ll recreate the improved histogram in the graphics window and then insert the boxplot at a location relative to the y-axis. In the code block below, note the use of the optional (but here required) arguments “horizontal”, “at”, “add”, and “axes”.
hist(census$Y1890, main="Kansas County Population 1890", xlab="No. of People", xlim=c(0, 60000), ylim=c(0, 35))
boxplot(census$Y1890, horizontal=TRUE, at=32.5, add=TRUE, axes=FALSE)
Not a bad result, but notice that the box portion of the box-and-whisker plot representing the interquartile range appears a bit too thin. Let’s use one additional optional argument for the boxplot function, boxwex, to make it a bit thicker.
hist(census$Y1890, main="Kansas County Population 1890", xlab="No. of People", xlim=c(0, 60000), ylim=c(0, 35))
boxplot(census$Y1890, horizontal=TRUE, at=32.5, add=TRUE, axes=FALSE, boxwex=3)
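The assignment also asks for the graph to be exported as a JPG file. A minimal sketch of one way to do that uses the base jpeg() graphics device, assuming your R installation was built with JPEG support. The file name, image dimensions, and the ten-value subset x are my own placeholders; with the real data you would plot census$Y1890 instead.

```r
# Illustrative subset: the first ten Y1890 values shown in the str(census) output
x <- c(13509, 14203, 26758, 7973, 13172, 28575, 20319, 24055, 8233, 12297)

# Hypothetical output location; change this to wherever you keep your web files
out_file <- file.path(tempdir(), "kansas_1890.jpg")

# Open the JPG device, draw both layers, then close the device to write the file
jpeg(filename = out_file, width = 800, height = 600, quality = 90)
hist(x, main = "Kansas County Population 1890", xlab = "No. of People",
     xlim = c(0, 60000), ylim = c(0, 35))
boxplot(x, horizontal = TRUE, at = 32.5, add = TRUE, axes = FALSE, boxwex = 3)
invisible(dev.off())

# Confirm that the file was written
file.exists(out_file)
```

Nothing appears in the graphics window while the jpeg() device is open; the plot goes straight to the file, which is only complete once dev.off() has been called.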
What does this combined graph tell us? Remember that histograms help us evaluate the distribution of values in a dataset which, here, is county-level population in 1890. Box-and-whisker plots, however, excel at communicating central tendency and dispersion, and they help highlight the presence of outliers.
To assist with our interpretation, we can also make good use of functions that provide summary statistics for the variable(s) of interest. Let’s first use the built-in summary function.
summary(census$Y1890)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 724 4537 12256 13463 19284 54407
The summary function yields exact values for some important statistics - many featured in the boxplot - including the minimum and maximum values, 1st and 3rd quartiles, and the median and mean values. We now know, for example, that the median value (the dark black line in the boxplot) is 12,256. We can also easily calculate the interquartile range (represented by the box in the boxplot) as the 3rd quartile minus the 1st quartile, or 14,747.
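The interquartile range arithmetic can be checked directly in R. The sketch below uses the quartiles printed by summary() above; because summary() may round what it displays, IQR(census$Y1890) on the full data could differ by a fraction. The second half demonstrates the built-in IQR() function on an illustrative subset (the first ten Y1890 values from the str() output), not on the full dataset.

```r
# Quartiles as printed by summary(census$Y1890)
q1 <- 4537
q3 <- 19284

# Interquartile range: 3rd quartile minus 1st quartile
iqr <- q3 - q1
iqr   # 14747

# The built-in IQR() function does the same subtraction on raw values.
# Illustrative subset only: the first ten Y1890 values from str(census)
x <- c(13509, 14203, 26758, 7973, 13172, 28575, 20319, 24055, 8233, 12297)
IQR(x)
unname(diff(quantile(x, c(0.25, 0.75))))   # same value as IQR(x)
```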
I also find another function helpful during exploratory analyses such as this. That function is “describe”, which is included in the psych package.
library(psych)
describe(census$Y1890)
## vars n mean sd median trimmed mad min max range skew
## X1 1 106 13463.17 10746.93 12256.5 12213.97 10778.5 724 54407 53683 1.19
## kurtosis se
## X1 1.82 1043.83
The describe function includes an expanded set of summary statistics, including skew and kurtosis which are descriptors of the shape of the data distribution illustrated by the histogram.
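To make skew and kurtosis less of a black box, here is a rough base R sketch of how they can be computed by hand. It assumes the "type 3" definitions that, as far as I understand the psych documentation, describe() uses by default: the third and fourth central moments divided by powers of the sample standard deviation, with 3 subtracted from the kurtosis so that a normal distribution scores near zero. The input is only the first ten Y1890 values from the str() output, so the results will not match the 1.19 and 1.82 reported for all 106 counties.

```r
# Illustrative subset: the first ten Y1890 values shown in the str(census) output
x <- c(13509, 14203, 26758, 7973, 13172, 28575, 20319, 24055, 8233, 12297)

# Central moments (averages of powered deviations from the mean)
m3 <- mean((x - mean(x))^3)
m4 <- mean((x - mean(x))^4)

# Sample standard deviation (n - 1 denominator)
s <- sd(x)

# Assumed "type 3" definitions; positive skew means a longer right tail,
# positive (excess) kurtosis means heavier tails than a normal distribution
skew <- m3 / s^3
kurt <- m4 / s^4 - 3

c(skew = skew, kurtosis = kurt)
```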
Combined, our histogram/boxplot graph and summary statistics convey a lot of information. The county-level population of Kansas in 1890 was low compared to today, and the mean (13,463) was larger than the median (12,256). The distribution of population values is positively (or right) skewed (skew of 1.19) and leptokurtic (or heavy-tailed, with a kurtosis of 1.82). All of these conditions suggest a non-normal distribution.
The boxplot helps us identify good measures of central tendency and dispersion for the 1890 population values that were confirmed with the summary statistics. It also shows us that there were three upper outliers.
If you’re curious about which three Kansas counties these were, we can find out using a combination of built-in R functions.
head(census[order(-census$Y1890), ])
## FIPSSTCO GISJOIN STATE COUNTY CO_ABBR Y1890 Y1900 Y1910 Y1920
## 106 20209 G2002090 Kansas Wyandotte WY 54407 73227 100068 122218
## 90 20177 G2001770 Kansas Shawnee SN 49172 53727 61874 69159
## 88 20173 G2001730 Kansas Sedgwick SG 43626 44037 73095 92234
## 53 20103 G2001030 Kansas Leavenworth LV 38485 40940 41207 38402
## 18 20035 G2000350 Kansas Cowley CL 34478 30156 31790 35155
## 19 20037 G2000370 Kansas Crawford CR 30286 38809 51178 61800
## Y1930 Y1940 Y1950 Y1960 Y1970 Y1980 Y1990 Y2000 Y2010 Y2020
## 106 141211 145071 165318 185495 186845 172335 161993 157882 157505 168873
## 90 85200 91247 105418 141286 155322 154916 160976 169871 177934 178608
## 88 136330 143311 222290 343231 350694 366531 403662 452869 498365 524246
## 53 42673 41112 42361 48524 53340 54809 64371 68691 76227 81870
## 18 40903 38139 36905 37861 35012 36824 36915 36291 36311 34498
## 19 49329 44191 40231 37032 37850 37916 35568 38242 39134 38930
## YR_MAX AMT_MAX YR_MIN AMT_MIN POP_LOST LABEL_MAX SPACER
## 106 1970 186845 1890 54407 -132438 WY--1970 --
## 90 2020 178608 1890 49172 0 SN--2020 --
## 88 2020 524246 1890 43626 0 SG--2020 --
## 53 2020 81870 1920 38402 0 LV--2020 --
## 18 1930 40903 1900 30156 -10747 CL--1930 --
## 19 1920 61800 1890 30286 -31514 CR--1920 --
What were the three Kansas counties with the largest populations…and the outliers? Wyandotte, Shawnee, and Sedgwick.
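To see why exactly these three counties are flagged, we can apply the usual boxplot rule by hand: a value counts as an upper outlier when it lies more than 1.5 times the interquartile range above the 3rd quartile. The sketch below uses the quartiles printed by summary() and the six largest Y1890 values from the output above. One assumption to note: boxplot() works from "hinges", which can differ slightly from the summary() quartiles, but a small shift should not change the result here because the cutoff falls well inside the gap between Leavenworth and Sedgwick. The names top6, upper_fence, and outliers are my own.

```r
# Six largest 1890 county populations, copied from the head() output above
top6 <- data.frame(
  COUNTY = c("Wyandotte", "Shawnee", "Sedgwick", "Leavenworth", "Cowley", "Crawford"),
  Y1890  = c(54407, 49172, 43626, 38485, 34478, 30286),
  stringsAsFactors = FALSE
)

# Quartiles as printed by summary(census$Y1890)
q1 <- 4537
q3 <- 19284

# Upper cutoff: 3rd quartile plus 1.5 times the interquartile range
upper_fence <- q3 + 1.5 * (q3 - q1)
upper_fence   # 41404.5

# Keep only the counties above the cutoff
outliers <- top6[top6$Y1890 > upper_fence, ]
outliers$COUNTY   # "Wyandotte" "Shawnee" "Sedgwick"
```

For a more compact listing from the full data frame, head(census[order(-census$Y1890), c("COUNTY", "Y1890")], n = 3) would return just the county name and 1890 population for the top three rows.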