Background information
The data for this homework assignment comes from the 1888 Swiss data set, which records six statistics for each of 47 French-speaking provinces of Switzerland. The six indices are:
1. Fertility: a common standardized fertility measure, scaled to 0-100.
2. Percentage of males involved in agriculture as an occupation.
3. Percentage of draftees who received the highest mark on the Army examination.
4. Percentage of draftees with education beyond primary school.
5. Percentage of the population who are Catholic (as opposed to Protestant).
6. Infant mortality: live births who live less than one year, expressed as a percentage.
The reference data comes from: https://raw.githubusercontent.com/vincentarelbundock/Rdatasets/master/doc/datasets/swiss.html
As in the previous assignment, I reused my script for loading a table from a URL.
a. Install and use necessary packages
#install.packages("ggplot2")
require(ggplot2)
## Loading required package: ggplot2
b. Read from text file
In this part of the source code, I read swiss.csv. Note that I read from the URL twice to create two tables. The first, tableInput, holds all 47 rows and is used for most calculations. The second, smallTableInput, holds only the first 6 rows (nrows = 6) and is used for a simple bar chart. For smallTableInput I also saved the province names in provinceNames and then dropped the columns Province, Fertility, Catholic, and Infant Mortality, which keeps the bar chart small.
I also renamed the columns of both tables to make them more user-friendly.
# Location of the CSV export of the swiss data set
theURL <- "https://raw.githubusercontent.com/vincentarelbundock/Rdatasets/master/csv/datasets/swiss.csv"
# Full table (all 47 provinces) and a small table (first 6 provinces only)
tableInput <- read.table(file = theURL, header = TRUE, sep = ",")
smallTableInput <- read.table(file = theURL, header = TRUE, sep = ",", nrows = 6)
# Friendlier column names; the first column holds the province names
newNames <- c("Province","Fertility", "Agriculture", "Examination", "Education", "Catholic", "Infant Mortality")
colnames(tableInput) <- newNames
colnames(smallTableInput) <- newNames
# Save the province names for the legend, then keep only the three columns used in the bar chart
provinceNames <- smallTableInput$Province
smallTableInput <- smallTableInput[, c("Agriculture", "Examination", "Education")]
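If the URL is unavailable, the same tables can probably be built offline. R ships a `swiss` data frame in its built-in datasets package, and the Rdatasets CSV appears to be an export of it. The sketch below is an alternative under that assumption, not the method used above; it rebuilds both tables with the same column names.

```r
# Assumption: the built-in `swiss` data frame (package datasets) holds the
# same 47 rows as swiss.csv, with the province names stored as row names.
data(swiss)

# Move the row names into a Province column to mirror the CSV layout.
tableInput <- data.frame(Province = rownames(swiss), swiss,
                         row.names = NULL, check.names = FALSE)
colnames(tableInput) <- c("Province", "Fertility", "Agriculture", "Examination",
                          "Education", "Catholic", "Infant Mortality")

# First six provinces, keeping only the three columns used in the bar chart.
smallTableInput <- head(tableInput, 6)[, c("Agriculture", "Examination", "Education")]
provinceNames <- head(tableInput$Province, 6)
```

Taking the first six rows with head() also avoids reading the source twice.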
For the next five tasks, I wanted to show the basic plotting functionality of R, based on my experimentation with this week’s and last week’s labs. The plots in this part use base R graphics functions (hist, boxplot, and plot) rather than ggplot2. The histogram, box plot, and scatter plot are the charts specified by the Assignment 4 requirements.
a. Show some values from the created tables and vectors
This confirms that the required data loaded correctly.
head(tableInput)
## Province Fertility Agriculture Examination Education Catholic
## 1 Courtelary 80.2 17.0 15 12 9.96
## 2 Delemont 83.1 45.1 6 9 84.84
## 3 Franches-Mnt 92.5 39.7 5 5 93.40
## 4 Moutier 85.8 36.5 12 7 33.77
## 5 Neuveville 76.9 43.5 17 15 5.16
## 6 Porrentruy 76.1 35.3 9 7 90.57
## Infant Mortality
## 1 22.2
## 2 22.2
## 3 20.2
## 4 20.3
## 5 20.6
## 6 26.6
smallTableInput
## Agriculture Examination Education
## 1 17.0 15 12
## 2 45.1 6 9
## 3 39.7 5 5
## 4 36.5 12 7
## 5 43.5 17 15
## 6 35.3 9 7
provinceNames
## [1] Courtelary Delemont Franches-Mnt Moutier Neuveville
## [6] Porrentruy
## 6 Levels: Courtelary Delemont Franches-Mnt Moutier ... Porrentruy
b. Generate summary-level descriptive statistics: show the mean, median, first and third quartiles (25th and 75th percentiles), min, and max for each applicable variable in tableInput.
summary(tableInput)
## Province Fertility Agriculture Examination
## Aigle : 1 Min. :35.00 Min. : 1.20 Min. : 3.00
## Aubonne : 1 1st Qu.:64.70 1st Qu.:35.90 1st Qu.:12.00
## Avenches: 1 Median :70.40 Median :54.10 Median :16.00
## Boudry : 1 Mean :70.14 Mean :50.66 Mean :16.49
## Broye : 1 3rd Qu.:78.45 3rd Qu.:67.65 3rd Qu.:22.00
## Conthey : 1 Max. :92.50 Max. :89.70 Max. :37.00
## (Other) :41
## Education Catholic Infant Mortality
## Min. : 1.00 Min. : 2.150 Min. :10.80
## 1st Qu.: 6.00 1st Qu.: 5.195 1st Qu.:18.15
## Median : 8.00 Median : 15.140 Median :20.00
## Mean :10.98 Mean : 41.144 Mean :19.94
## 3rd Qu.:12.00 3rd Qu.: 93.125 3rd Qu.:21.70
## Max. :53.00 Max. :100.000 Max. :26.60
##
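The same statistics can be computed explicitly, which shows what summary() reports for each numeric column. This is a minimal sketch, assuming the built-in `swiss` data frame stands in for the numeric columns of tableInput; `describe` is a hypothetical helper name, not part of the assignment script.

```r
# Assumption: the built-in `swiss` data frame matches the numeric columns
# of tableInput (it has no Province column; provinces are row names).
data(swiss)

# Hypothetical helper: six descriptive statistics for one numeric vector.
describe <- function(x) {
  c(Min    = min(x),
    Q1     = unname(quantile(x, 0.25)),
    Median = median(x),
    Mean   = mean(x),
    Q3     = unname(quantile(x, 0.75)),
    Max    = max(x))
}

# One column of statistics per variable (a 6 x 6 numeric matrix).
stats <- sapply(swiss, describe)
print(round(stats, 2))
```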
c. Generate a histogram on Fertility
hist(tableInput$Fertility)
d. Generate a boxplot on Agriculture
boxplot(tableInput$Agriculture)
e. Generate a scatterplot on Examination vs. Education
plot(tableInput$Examination ~ tableInput$Education)
For this part of the assignment, I chose to be a little more creative by using some charting techniques I studied over the last few weeks.
a. Generate a scatterplot on Examination vs. Education but with more enhanced information and graphics.
The resulting chart plots two indices against each other: the percentage of draftees who received the highest mark on the Army examination and the percentage of draftees with education beyond primary school. The graph suggests a positive, roughly linear relationship between the two. Provinces with a higher percentage of draftees educated beyond primary school tend to have a higher percentage of draftees with top marks on the Army examination.
ref: McCown, Frank. Producing Simple Graphs with R. URL: http://harding.edu/fmccown/r/
plot(tableInput$Education, tableInput$Examination, xlab="Education", ylab="Examination", main="Swiss Provinces - Percent of High Scores of Army Examinees vs. Percent of Those Who Obtained Higher Education", col.main="purple", font.main=2, col="red", pch=15, type="p")
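The visual impression of a linear relationship can be checked numerically. The sketch below is one possible check, not part of the original assignment: it computes the Pearson correlation and overlays a least-squares line, assuming the built-in `swiss` data frame matches tableInput.

```r
# Assumption: the built-in `swiss` data frame matches tableInput.
data(swiss)

# Pearson correlation between the two percentages (ranges from -1 to 1).
r <- cor(swiss$Education, swiss$Examination)

# Least-squares line: Examination as a linear function of Education.
fit <- lm(Examination ~ Education, data = swiss)

plot(swiss$Education, swiss$Examination,
     xlab = "Education", ylab = "Examination", col = "red", pch = 15)
abline(fit, col = "blue")  # overlay the fitted line on the scatterplot

print(r)
print(coef(fit))
```

A correlation well above zero and a positive slope would be consistent with the reading of the chart given above.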
b. Generate a Line chart on Fertility vs. Education
The resulting chart shows two indices for each province: Fertility and the percentage of draftees with education beyond primary school. There are two lines, blue for Fertility and red for Education. The fonts and placement of the 47 province labels could be better. Still, notice the relationship between Fertility and Education: in most provinces where Fertility is high, the percentage with education beyond primary school is low. There is one conspicuous case toward the right of the graph where Education spikes and Fertility drops steeply. Because it is a single province, I treat it as an outlier rather than as evidence of a trend. If the chart were sorted from the province with the highest Fertility to the lowest, the trend might be easier to see.
For overall chart development, I used the following reference: McCown, Frank. Producing Simple Graphs with R. URL: http://harding.edu/fmccown/r/
For X axis label and rotation, I used the following reference: UCLA, Institute for Digital Research and Education. How can I change the angle of the value labels on my axes? URL: http://www.ats.ucla.edu/stat/r/faq/angled_labels.htm
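# Series to plot and a shared y-axis range that starts at 0
Fertility <- tableInput$Fertility
Education <- tableInput$Education
g_range <- range(0, Fertility, Education)
# Draw Fertility without default axes or annotations so they can be customized
plot(Fertility, type="o", col="blue", ylim=g_range, axes=FALSE, ann=FALSE)
# X axis: one tick per province, with the names drawn at a 45-degree angle
labellist <- tableInput$Province
axis(1, at=seq_along(labellist), labels = FALSE)
text(seq_along(labellist), par("usr")[3] - 0.2, labels = labellist, srt = 45, pos = 1, xpd = TRUE, cex=0.6)
# Y axis, which axes=FALSE would otherwise leave out
axis(2, las=1)
# Add the Education series as a dashed red line
lines(Education, type="o", pch=22, lty=2, col="red")
title(main="Fertility and Education by Province", col.main="purple", font.main=4)
title(xlab="Province", col.lab=rgb(0,0.5,0))
title(ylab="Fertility and Education", col.lab=rgb(0,0.5,0))
# Legend with its top-left corner at x = 1, y = 40
legend(1, 40, c("Fertility","Education"), cex=0.8, col=c("blue","red"), pch=21:22, lty=1:2)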
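The idea of ordering the provinces by Fertility can be sketched as follows. This is an illustrative variant rather than the chart produced above, assuming the built-in `swiss` data frame matches tableInput; it uses rank positions on the x axis instead of province labels, and adds the correlation as a single-number summary.

```r
# Assumption: the built-in `swiss` data frame matches tableInput.
data(swiss)

# Order provinces from highest to lowest Fertility.
ord <- order(swiss$Fertility, decreasing = TRUE)
sorted <- swiss[ord, ]

# Same two series as before, now in sorted order.
g_range <- range(0, sorted$Fertility, sorted$Education)
plot(sorted$Fertility, type = "o", col = "blue", ylim = g_range,
     xlab = "Province rank by Fertility", ylab = "Fertility and Education")
lines(sorted$Education, type = "o", pch = 22, lty = 2, col = "red")
legend("topright", c("Fertility", "Education"), col = c("blue", "red"),
       pch = c(1, 22), lty = 1:2, cex = 0.8)

# Pearson correlation between the two series across all 47 provinces.
r_fe <- cor(swiss$Fertility, swiss$Education)
print(r_fe)
```

A negative correlation would support the observation that high Fertility tends to go with low Education.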
c. Generate a Bar Chart
This chart compares three indices (Agriculture, Examination, and Education) across provinces. It uses only a small subset of the data, the first 6 provinces, to demonstrate a simple but clear grouped bar chart.
The bar chart supports a few observations. For example, the percentage of males in agriculture is at least 30% in 5 of the 6 provinces. Agriculture shows no obvious relationship with the other two indices in these provinces. Education and Examination, by contrast, appear to move together across provinces. With only 6 provinces displayed, these observations are tentative.
For overall chart development, I used the following reference: McCown, Frank. Producing Simple Graphs with R. URL: http://harding.edu/fmccown/r/
barplot(as.matrix(smallTableInput), main="1888 Swiss Provinces Statistics", xlab = "Indices", ylab = "Percentages", cex.lab = 1.0, beside=TRUE, col=rainbow(6))
legend("topright", legend = provinceNames, fill = rainbow(6))
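To compare the three indices within each province rather than within each index, the matrix can be transposed before plotting. A minimal sketch of that alternative grouping, assuming the first six rows of the built-in `swiss` data frame match smallTableInput; the colors are arbitrary choices.

```r
# Assumption: the first six rows of the built-in `swiss` data frame
# match smallTableInput, with province names stored as row names.
data(swiss)
small <- as.matrix(head(swiss, 6)[, c("Agriculture", "Examination", "Education")])

# Transpose so that each province becomes one group of three bars.
byProvince <- t(small)

barplot(byProvince, beside = TRUE,
        col = c("darkgreen", "orange", "steelblue"),
        main = "1888 Swiss Provinces Statistics",
        xlab = "Province", ylab = "Percentages", cex.names = 0.7,
        legend.text = rownames(byProvince))
```

With this layout the province names label the x axis, and the legend identifies the three indices.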