I first set up the environment and take a look at the data.
# Load libraries and data
library("ggplot2")
library("bitops")
library("RCurl")
library("cowplot")
##
## Attaching package: 'cowplot'
## The following object is masked from 'package:ggplot2':
##
## ggsave
url = "https://raw.githubusercontent.com/chrisestevez/MSDA-Bridge/master/USArrest1973.csv"
Rdata = getURL(url)
MyData = read.csv(text = Rdata,header = TRUE,sep=",", na.strings = "..")
str(MyData)
## 'data.frame': 50 obs. of 5 variables:
## $ State : Factor w/ 50 levels "Alabama","Alaska",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ Murder : num 13.2 10 8.1 8.8 9 7.9 3.3 5.9 15.4 17.4 ...
## $ Assault : int 236 263 294 190 276 204 110 238 335 211 ...
## $ UrbanPop: int 58 48 80 50 91 78 77 72 80 60 ...
## $ Rape : num 21.2 44.5 31 19.5 40.6 38.7 11.1 15.8 31.9 25.8 ...
summary(MyData)
## State Murder Assault UrbanPop
## Alabama : 1 Min. : 0.800 Min. : 45.0 Min. :32.00
## Alaska : 1 1st Qu.: 4.075 1st Qu.:109.0 1st Qu.:54.50
## Arizona : 1 Median : 7.250 Median :159.0 Median :66.00
## Arkansas : 1 Mean : 7.788 Mean :170.8 Mean :65.54
## California: 1 3rd Qu.:11.250 3rd Qu.:249.0 3rd Qu.:77.75
## Colorado : 1 Max. :17.400 Max. :337.0 Max. :91.00
## (Other) :44
## Rape
## Min. : 7.30
## 1st Qu.:15.07
## Median :20.10
## Mean :21.23
## 3rd Qu.:26.18
## Max. :46.00
##
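The `read.csv(text = ...)` call above parses CSV content that is already held in a string and turns every `..` entry into `NA`. A minimal sketch of that behavior follows, using a made-up three-row sample (the values and the name `csv_text` are illustrative assumptions, not part of the real data set):

```r
# Made-up sample for illustration only; NOT the real arrest data.
csv_text <- "State,Murder,Rape
A,1.5,10.2
B,..,20.1
C,3.0,.."

# Same arguments as the call above: parse from a string, treat ".." as missing.
sample_data <- read.csv(text = csv_text, header = TRUE, sep = ",", na.strings = "..")

# Both rate columns stay numeric because ".." became NA instead of text.
str(sample_data)

# Count the missing values per column.
colSums(is.na(sample_data))
```

Without `na.strings = ".."`, a column containing `..` would be read as text rather than as numbers, which is probably why the argument is set here.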
In the step below, I create a data frame from the original source, select the columns of interest, and rename them. The variables State, Murder, and Rape (the latter two expressed per 100k) are analyzed further.
FinalUSAarrests = data.frame(MyData)
FinalUSAarrests = subset(FinalUSAarrests,select = c(State,Murder,Rape))
colnames(FinalUSAarrests) = c("States", "MurderP100k", "Rapep100k")
head(FinalUSAarrests, n=10)
## States MurderP100k Rapep100k
## 1 Alabama 13.2 21.2
## 2 Alaska 10.0 44.5
## 3 Arizona 8.1 31.0
## 4 Arkansas 8.8 19.5
## 5 California 9.0 40.6
## 6 Colorado 7.9 38.7
## 7 Connecticut 3.3 11.1
## 8 Delaware 5.9 15.8
## 9 Florida 15.4 31.9
## 10 Georgia 17.4 25.8
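The select-and-rename step can also be written as a single `data.frame()` call. The sketch below assumes that R's built-in `USArrests` data set holds the same values as the CSV file; there the state names are stored as row names instead of a `State` column. The name `FinalSketch` is hypothetical.

```r
# Built-in data set, assumed (not verified here) to match the downloaded CSV.
arrests <- USArrests

# Pick the columns of interest and name them in one step.
FinalSketch <- data.frame(
  States      = rownames(arrests),  # state names live in the row names
  MurderP100k = arrests$Murder,
  Rapep100k   = arrests$Rape
)

head(FinalSketch, n = 3)
```

Building the frame this way avoids the separate `subset()` and `colnames()` calls, at the cost of listing each column explicitly.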
h1 = ggplot(FinalUSAarrests, aes(x = MurderP100k)) + geom_histogram(binwidth = 3, color = "white", fill = "blue") + ggtitle("USA Arrest 1973 Murders")
h2 = ggplot(FinalUSAarrests, aes(x = Rapep100k)) + geom_histogram(binwidth = 3, color = "white", fill = "red") + ggtitle("USA Arrest 1973 Rapes")
HistPlot = plot_grid(h1, h2, align = 'h')
HistPlot
# Keep only the first five rows for the box plot example.
top = head(FinalUSAarrests, n=5)
BoxplotEx = ggplot(top, aes(y=MurderP100k, x=States)) + geom_boxplot()
BoxplotEx
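With one row per state, each box in the plot above collapses to a single line. One possible way around this is to derive a grouping factor so that every box summarizes several states. The sketch below assumes the built-in `USArrests` data set matches the CSV, and it introduces a hypothetical `UrbanGroup` factor that splits the states at the median of `UrbanPop`; neither is part of the original analysis.

```r
library(ggplot2)

# Built-in copy of the 1973 arrest data; assumed to match the CSV used above.
arrests <- data.frame(
  States      = rownames(USArrests),
  MurderP100k = USArrests$Murder,
  UrbanPop    = USArrests$UrbanPop
)

# Hypothetical grouping factor: split states at the median urban population share.
arrests$UrbanGroup <- factor(
  ifelse(arrests$UrbanPop > median(arrests$UrbanPop), "More urban", "Less urban")
)

# Each box now summarizes many states instead of a single value.
BoxplotGrouped <- ggplot(arrests, aes(x = UrbanGroup, y = MurderP100k)) +
  geom_boxplot()
BoxplotGrouped
```

Any other factor with several states per level (a region, for example) would serve the same purpose; the median split is only a convenient choice because `UrbanPop` is already in the data.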
To draw a smoothed line through the data, I followed the guide below.
http://r4stats.com/examples/graphics-ggplot2/
scatterP =ggplot(FinalUSAarrests, aes(MurderP100k, Rapep100k))+ geom_point()+ geom_smooth()
scatterP
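The smoothed line shows the trend visually. As a rough numeric complement, the relationship can be summarized with a correlation coefficient and a straight-line fit. This is a sketch only: it assumes the built-in `USArrests` data set matches the CSV, and the names `r` and `fit` are illustrative.

```r
# Built-in copy of the 1973 arrest data; assumed to match the CSV used above.
arrests <- data.frame(
  MurderP100k = USArrests$Murder,
  Rapep100k   = USArrests$Rape
)

# Pearson correlation between the two rates.
r <- cor(arrests$MurderP100k, arrests$Rapep100k)

# Straight-line fit, a simpler alternative to the default smoother.
fit <- lm(Rapep100k ~ MurderP100k, data = arrests)

print(r)
print(coef(fit))
```

The same straight line can be drawn on the scatter plot by passing `method = "lm"` to `geom_smooth()`.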
The selected data was not ideal for a box plot: each state has only one observation, and there is no additional factor to group by. This led me to use several different techniques to plot the data properly throughout the assignment.