Romerl Elizes

MSDA Workshop

Week 4 R Programming Assignment

Background information

The data for this homework assignment comes from the 1888 Swiss data set, which records six statistical indices for 47 provinces of Switzerland. Those six indices are:

1. Fertility, a standardized fertility measure.
2. Percentage of males involved in agriculture as an occupation.
3. Percentage of draftees who received the highest mark on the Army examination.
4. Percentage of draftees with education beyond primary school.
5. Percentage of the population who are Catholic (as opposed to Protestant).
6. Percentage of live births who live less than one year (infant mortality).

The reference data comes from: https://raw.githubusercontent.com/vincentarelbundock/Rdatasets/master/doc/datasets/swiss.html

1. Preliminary Tasks

I reused the script from the previous assignment to load a table from a specific URL.

a. Install and use necessary packages

#install.packages("ggplot2")
require(ggplot2)
## Loading required package: ggplot2

b. Read from text file

In this part of the source code, I read swiss.csv. Note that I make two calls to the URL to create two tables. The first table, tableInput, holds all 47 provinces and is used for most calculations. The second table, smallTableInput, holds only the first 6 rows (nrows = 6) and is used for a simple bar chart. For smallTableInput, I saved the Province column in provinceNames and then dropped the columns Province, Fertility, Catholic, and Infant Mortality. This is done for expediency.

I also renamed the column names for both tables to make them more user friendly.

theURL <- "https://raw.githubusercontent.com/vincentarelbundock/Rdatasets/master/csv/datasets/swiss.csv"
tableInput <- read.table(file = theURL, header = TRUE, sep = ",")

smallTableInput <- read.table(file = theURL, header = TRUE, sep = ",", nrows = 6)

colnames(tableInput) <- c("Province","Fertility", "Agriculture", "Examination", "Education", "Catholic", "Infant Mortality")

colnames(smallTableInput) <- c("Province","Fertility", "Agriculture", "Examination", "Education", "Catholic", "Infant Mortality")
provinceNames <- smallTableInput$Province
# Keep only the columns that are not in the drop list.
smallTableInput <- smallTableInput[, !(names(smallTableInput) %in% c("Province","Fertility","Catholic","Infant Mortality"))]
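The two tables could also be derived from a single load. The following is a minimal sketch, assuming the built-in datasets::swiss data frame holds the same values as swiss.csv (it stores province names as row names and names the last column Infant.Mortality). The names fullTable, smallTable, and smallProvinces are illustrative and are not part of the original script.

```r
# Assumption: the built-in swiss data frame matches the CSV used above.
data(swiss)

# Build the full table once, moving the row names into a Province column.
fullTable <- data.frame(Province = rownames(swiss), swiss, row.names = NULL)

# Derive the small table from the full one: first 6 rows, 3 columns.
smallTable <- fullTable[1:6, c("Agriculture", "Examination", "Education")]
smallProvinces <- fullTable$Province[1:6]

smallTable
smallProvinces
```

Subsetting the full table avoids the second download and keeps both tables in sync if the column names change.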

2. Demonstrating basic plotting functionality

For the next five tasks, I show basic plotting functionality based on my experimentation with this week’s and last week’s labs. The histogram, box plot, and scatter plot required by Assignment 4 are drawn with the base R functions hist(), boxplot(), and plot(); ggplot2 is loaded above but is not called by these examples.

a. Show some values from the created tables and vectors

This confirms that the required data loaded correctly.

head(tableInput)
##       Province Fertility Agriculture Examination Education Catholic
## 1   Courtelary      80.2        17.0          15        12     9.96
## 2     Delemont      83.1        45.1           6         9    84.84
## 3 Franches-Mnt      92.5        39.7           5         5    93.40
## 4      Moutier      85.8        36.5          12         7    33.77
## 5   Neuveville      76.9        43.5          17        15     5.16
## 6   Porrentruy      76.1        35.3           9         7    90.57
##   Infant Mortality
## 1             22.2
## 2             22.2
## 3             20.2
## 4             20.3
## 5             20.6
## 6             26.6
smallTableInput
##   Agriculture Examination Education
## 1        17.0          15        12
## 2        45.1           6         9
## 3        39.7           5         5
## 4        36.5          12         7
## 5        43.5          17        15
## 6        35.3           9         7
provinceNames
## [1] Courtelary   Delemont     Franches-Mnt Moutier      Neuveville  
## [6] Porrentruy  
## 6 Levels: Courtelary Delemont Franches-Mnt Moutier ... Porrentruy

b. Generate summary-level descriptive statistics: show the mean, median, 25th and 75th percentiles (first and third quartiles), min, and max for each of the applicable variables in tableInput.

summary(tableInput)
##      Province    Fertility      Agriculture     Examination   
##  Aigle   : 1   Min.   :35.00   Min.   : 1.20   Min.   : 3.00  
##  Aubonne : 1   1st Qu.:64.70   1st Qu.:35.90   1st Qu.:12.00  
##  Avenches: 1   Median :70.40   Median :54.10   Median :16.00  
##  Boudry  : 1   Mean   :70.14   Mean   :50.66   Mean   :16.49  
##  Broye   : 1   3rd Qu.:78.45   3rd Qu.:67.65   3rd Qu.:22.00  
##  Conthey : 1   Max.   :92.50   Max.   :89.70   Max.   :37.00  
##  (Other) :41                                                  
##    Education        Catholic       Infant Mortality
##  Min.   : 1.00   Min.   :  2.150   Min.   :10.80   
##  1st Qu.: 6.00   1st Qu.:  5.195   1st Qu.:18.15   
##  Median : 8.00   Median : 15.140   Median :20.00   
##  Mean   :10.98   Mean   : 41.144   Mean   :19.94   
##  3rd Qu.:12.00   3rd Qu.: 93.125   3rd Qu.:21.70   
##  Max.   :53.00   Max.   :100.000   Max.   :26.60   
## 
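If only the six requested statistics are needed in a compact numeric table, they can be computed column by column. This is a minimal sketch, assuming the built-in datasets::swiss data frame matches the CSV; the helper name describe is hypothetical.

```r
# Assumption: the built-in swiss data frame matches the CSV used above.
data(swiss)

# Hypothetical helper: the six statistics requested by the assignment.
describe <- function(x) {
  c(min    = min(x),
    q25    = unname(quantile(x, 0.25)),
    median = median(x),
    mean   = mean(x),
    q75    = unname(quantile(x, 0.75)),
    max    = max(x))
}

# One column of statistics per variable (a 6 x 6 numeric matrix).
stats <- sapply(swiss, describe)
round(stats, 2)
```

Unlike summary(), this returns a numeric matrix, so individual values can be reused in later calculations, for example stats["median", "Fertility"].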

c. Generate a histogram on Fertility

hist(tableInput$Fertility)

d. Generate a boxplot on Agriculture

boxplot(tableInput$Agriculture)
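The numbers behind a box plot can be inspected without drawing it. This is a minimal sketch using boxplot.stats() from base R, assuming the built-in datasets::swiss data frame matches the CSV; the variable name bp is illustrative.

```r
# Assumption: the built-in swiss data frame matches the CSV used above.
data(swiss)

bp <- boxplot.stats(swiss$Agriculture)

# Lower whisker, lower hinge, median, upper hinge, upper whisker.
bp$stats

# Observations that fall beyond the whiskers (drawn as individual points).
bp$out
```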

e. Generate a scatterplot on Examination vs. Education

plot(tableInput$Examination ~ tableInput$Education)
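Since ggplot2 is loaded in the preliminary tasks, the same three charts could also be drawn with it. This is a minimal sketch, assuming ggplot2 is installed and that the built-in datasets::swiss data frame matches the CSV. The object names and the choice of bins = 10 are illustrative assumptions, not taken from the assignment.

```r
library(ggplot2)

# Assumption: the built-in swiss data frame matches the CSV used above.
data(swiss)

# Histogram of Fertility (the bin count is an arbitrary choice).
p_hist <- ggplot(swiss, aes(x = Fertility)) +
  geom_histogram(bins = 10)

# Box plot of Agriculture; the empty x value produces a single box.
p_box <- ggplot(swiss, aes(x = "", y = Agriculture)) +
  geom_boxplot() +
  labs(x = NULL)

# Scatter plot of Examination against Education.
p_scatter <- ggplot(swiss, aes(x = Education, y = Examination)) +
  geom_point()

print(p_hist)
print(p_box)
print(p_scatter)
```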

3. Demonstrating some fancier charts based on given data.

For this part of the assignment, I chose to be a little more creative by using some charting techniques I studied in the last few weeks.

a. Generate a scatterplot on Examination vs. Education but with more enhanced information and graphics.

The resultant chart compares two indices: the percentage of draftees who received the highest mark on the Army examination and the percentage of draftees with education beyond primary school. The graph suggests a positive, roughly linear relationship between the two: provinces where more draftees received the highest mark tend to be provinces where more draftees were educated beyond primary school.

ref: McCown, Frank. Producing Simple Graphs with R. URL: http://harding.edu/fmccown/r/

plot(tableInput$Education, tableInput$Examination, xlab="Education", ylab="Examination",  main="Swiss Provinces - Percent of High Scores of Army Examinees vs. Percent of Those Who Obtained Higher Education", col.main="purple", font.main=2, col="red", pch=15, type="p")
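The visual impression of a linear relationship can be checked numerically. This is a minimal sketch, assuming the built-in datasets::swiss data frame matches the CSV; it computes the Pearson correlation and overlays a least-squares line. The names r and fit are illustrative.

```r
# Assumption: the built-in swiss data frame matches the CSV used above.
data(swiss)

# Pearson correlation between the two percentages.
r <- cor(swiss$Education, swiss$Examination)
r

# Simple linear regression of Examination on Education.
fit <- lm(Examination ~ Education, data = swiss)
coef(fit)

# Redraw the scatter plot and overlay the fitted line.
plot(Examination ~ Education, data = swiss, pch = 15, col = "red")
abline(fit, col = "blue", lty = 2)
```

A correlation close to 1 together with a positive slope would support the reading of the chart; a value near 0 would not.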

b. Generate a Line chart on Fertility vs. Education

The resultant chart plots two indices for each province: Fertility and the percentage of draftees with education beyond primary school. There are two line series, blue for Fertility and red for Education. The fonts and placement of the 47 province labels could be better. Still, notice the relationship between Fertility and Education: regardless of province, where Fertility is high, the percentage of education beyond primary school tends to be low. There is one glaring instance toward the right of the graph where Education spikes and Fertility drops steeply. Because this is a single instance, I treat it as an outlier. If the chart were ordered from the provinces with the highest Fertility to those with the lowest, a more definitive trend might be visible.

For overall chart development, I used the following reference: McCown, Frank. Producing Simple Graphs with R. URL: http://harding.edu/fmccown/r/

For X axis label and rotation, I used the following reference: UCLA, Institute for Digital Research and Education. How can I change the angle of the value labels on my axes? URL: http://www.ats.ucla.edu/stat/r/faq/angled_labels.htm

Fertility <- tableInput$Fertility
Education <- tableInput$Education

g_range <- range(0, Fertility, Education)

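plot(Fertility, type="o", col="blue", ylim=g_range, axes=FALSE, ann=FALSE)
labellist <- tableInput$Province
# X axis: unlabeled ticks, with rotated province names drawn below them.
axis(1, at=seq_along(labellist), labels = FALSE)
text(seq_along(labellist), par("usr")[3] - 0.2, labels = labellist, srt = 45, pos = 1, xpd = TRUE, cex=0.6)
# Y axis and frame, so the plotted values can be read off the chart.
axis(2, las=1)
box()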

lines(Education, type="o", pch=22, lty=2, col="red")

title(main="Fertility and Education by Province", col.main="purple", font.main=4)

title(xlab="Province", col.lab=rgb(0,0.5,0))
title(ylab="Fertility and Education", col.lab=rgb(0,0.5,0))

# Place the top-left corner of the legend at x = 1, y = 40.
legend(x = 1, y = 40, legend = c("Fertility","Education"), cex=0.8, col=c("blue","red"), pch=21:22, lty=1:2)
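The reordering suggested in the discussion of this chart, from highest to lowest fertility, can be sketched as follows. This is a minimal sketch, assuming the built-in datasets::swiss data frame matches the CSV (province names are its row names). The names ord, sorted, and y_range, the chart title, and the legend position are illustrative choices.

```r
# Assumption: the built-in swiss data frame matches the CSV used above.
data(swiss)

# Order the provinces from highest to lowest Fertility.
ord <- order(swiss$Fertility, decreasing = TRUE)
sorted <- swiss[ord, ]

y_range <- range(0, sorted$Fertility, sorted$Education)

plot(sorted$Fertility, type = "o", col = "blue", ylim = y_range,
     axes = FALSE, ann = FALSE)
axis(1, at = seq_len(nrow(sorted)), labels = FALSE)
axis(2, las = 1)
box()

# Rotated province labels under the x axis.
text(seq_len(nrow(sorted)), par("usr")[3] - 0.2, labels = rownames(sorted),
     srt = 45, pos = 1, xpd = TRUE, cex = 0.6)

lines(sorted$Education, type = "o", pch = 22, lty = 2, col = "red")

title(main = "Fertility and Education, Provinces Ordered by Fertility")
legend("topright", c("Fertility", "Education"), col = c("blue", "red"),
       pch = c(1, 22), lty = 1:2, cex = 0.8)
```

With Fertility decreasing from left to right, any tendency for Education to rise along the same axis should be easier to see than in the original province order.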

c. Generate a Bar Chart

In this chart, I compare three indices (Agriculture, Examination, and Education) across provinces. It uses only a small subset of the data, which is enough to demonstrate a simple but robust bar chart.

The bar chart supports a few observations. The percentage of males in agriculture is at least 30% for 5 of the 6 provinces. Within these six provinces, Agriculture shows no obvious relationship with the other two indices, while Education and Examination appear to track each other. Because only 6 of the 47 provinces are displayed, these observations are tentative.

For overall chart development, I used the following reference: McCown, Frank. Producing Simple Graphs with R. URL: http://harding.edu/fmccown/r/

barplot(as.matrix(smallTableInput), main="1888 Swiss Provinces Statistics", xlab = "Indices", ylab = "Percentages", cex.lab = 1.0, beside=TRUE, col=rainbow(6))

legend("topright", legend = provinceNames, fill = rainbow(6))
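Because the bar chart shows only six provinces, it may be worth checking the same three indices across all 47. This is a minimal sketch, assuming the built-in datasets::swiss data frame matches the CSV; it computes the pairwise Pearson correlations. The names vars and cors are illustrative.

```r
# Assumption: the built-in swiss data frame matches the CSV used above.
data(swiss)

# The three indices shown in the bar chart.
vars <- c("Agriculture", "Examination", "Education")

# Pairwise Pearson correlations across all 47 provinces.
cors <- cor(swiss[, vars])
round(cors, 2)
```

Values near 0 indicate no linear relationship between two indices, while values near 1 or -1 indicate a strong positive or negative one.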