This week’s assignment focuses on exploratory data analysis. The dataset I chose is from Udacity: https://github.com/tollek/udacity-data-science/blob/master/p4/l2/stateData.csv. It is a very small dataset (50 rows, one per state), and it does not state the units of its variables. However, since I am only analyzing the relationships between some of the variables, I can still draw reasonable findings from it.

First of all, I need to install the ggplot2 package. R's base graphics are already very capable of producing quality plots, but ggplot2 makes it easier to build more complex graphs and to modify the color, shape, and other features of a graph.

install.packages("ggplot2", repos = "http://cran.us.r-project.org")
## Installing package into 'C:/Users/blin261/Documents/R/win-library/3.3'
## (as 'lib' is unspecified)
## package 'ggplot2' successfully unpacked and MD5 sums checked
## 
## The downloaded binary packages are in
##  C:\Users\blin261\AppData\Local\Temp\Rtmpo1a5qH\downloaded_packages
require(ggplot2)
## Loading required package: ggplot2
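
Calling `install.packages()` on every run downloads the package again, so a guarded install can be convenient. The following is a minimal sketch rather than part of the original assignment; `ensure_package` is a hypothetical helper name, and the repository URL is the one used above.

```r
# Hypothetical helper: install a package only when it is not already available.
ensure_package <- function(pkg, repos = "http://cran.us.r-project.org") {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg, repos = repos)
  }
  # Report (invisibly) whether the package can now be loaded.
  invisible(requireNamespace(pkg, quietly = TRUE))
}

# Usage (commented out so the sketch does not trigger a download):
# ensure_package("ggplot2")
# library(ggplot2)
```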

I downloaded the data to my local machine. The following code loads the data and shows which variables are categorical and which are continuous.

raw_data<-read.csv("stateData.csv",sep=",")
head(raw_data)
##            X state.abb state.area state.region population income
## 1    Alabama        AL      51609            2       3615   3624
## 2     Alaska        AK     589757            4        365   6315
## 3    Arizona        AZ     113909            4       2212   4530
## 4   Arkansas        AR      53104            2       2110   3378
## 5 California        CA     158693            4      21198   5114
## 6   Colorado        CO     104247            4       2541   4884
##   illiteracy life.exp murder highSchoolGrad frost   area
## 1        2.1    69.05   15.1           41.3    20  50708
## 2        1.5    69.31   11.3           66.7   152 566432
## 3        1.8    70.55    7.8           58.1    15 113417
## 4        1.9    70.66   10.1           39.9    65  51945
## 5        1.1    71.71   10.3           62.6    20 156361
## 6        0.7    72.06    6.8           63.9   166 103766
str(raw_data)
## 'data.frame':    50 obs. of  12 variables:
##  $ X             : Factor w/ 50 levels "Alabama","Alaska",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ state.abb     : Factor w/ 50 levels "AK","AL","AR",..: 2 1 4 3 5 6 7 8 9 10 ...
##  $ state.area    : int  51609 589757 113909 53104 158693 104247 5009 2057 58560 58876 ...
##  $ state.region  : int  2 4 4 2 4 4 1 2 2 2 ...
##  $ population    : int  3615 365 2212 2110 21198 2541 3100 579 8277 4931 ...
##  $ income        : int  3624 6315 4530 3378 5114 4884 5348 4809 4815 4091 ...
##  $ illiteracy    : num  2.1 1.5 1.8 1.9 1.1 0.7 1.1 0.9 1.3 2 ...
##  $ life.exp      : num  69 69.3 70.5 70.7 71.7 ...
##  $ murder        : num  15.1 11.3 7.8 10.1 10.3 6.8 3.1 6.2 10.7 13.9 ...
##  $ highSchoolGrad: num  41.3 66.7 58.1 39.9 62.6 63.9 56 54.6 52.6 40.6 ...
##  $ frost         : int  20 152 15 65 20 166 139 103 11 60 ...
##  $ area          : int  50708 566432 113417 51945 156361 103766 4862 1982 54090 58073 ...
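
The same split between numeric and non-numeric columns can be computed rather than read off the `str()` output. This is a small sketch on a few rows copied from the `head()` output above; `sample_rows` is an illustrative name, and with the CSV loaded `raw_data` would take its place.

```r
# A few rows copied from the head() output above, for illustration only.
sample_rows <- data.frame(
  X = c("Alabama", "Alaska", "Arizona"),
  state.region = c(2L, 4L, 4L),
  income = c(3624L, 6315L, 4530L),
  illiteracy = c(2.1, 1.5, 1.8),
  stringsAsFactors = TRUE
)

# TRUE for numeric (integer or double) columns, FALSE otherwise.
is_num <- vapply(sample_rows, is.numeric, logical(1))
numeric_cols <- names(sample_rows)[is_num]
other_cols <- names(sample_rows)[!is_num]

print(numeric_cols)
print(other_cols)
```

Note that `state.region` is reported as numeric here even though it is really a categorical code, so a check like this still needs a manual look at what each column means.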

I converted the state region column from integer codes to human-readable strings.

raw_data$state.region[raw_data$state.region == "1"] <- "Northeast"
raw_data$state.region[raw_data$state.region == "2"] <- "South"
raw_data$state.region[raw_data$state.region == "3"] <- "Midwest"
raw_data$state.region[raw_data$state.region == "4"] <- "West"
head(raw_data)
##            X state.abb state.area state.region population income
## 1    Alabama        AL      51609        South       3615   3624
## 2     Alaska        AK     589757         West        365   6315
## 3    Arizona        AZ     113909         West       2212   4530
## 4   Arkansas        AR      53104        South       2110   3378
## 5 California        CA     158693         West      21198   5114
## 6   Colorado        CO     104247         West       2541   4884
##   illiteracy life.exp murder highSchoolGrad frost   area
## 1        2.1    69.05   15.1           41.3    20  50708
## 2        1.5    69.31   11.3           66.7   152 566432
## 3        1.8    70.55    7.8           58.1    15 113417
## 4        1.9    70.66   10.1           39.9    65  51945
## 5        1.1    71.71   10.3           62.6    20 156361
## 6        0.7    72.06    6.8           63.9   166 103766
table(raw_data$state.region)
## 
##   Midwest Northeast     South      West 
##        12         9        16        13
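
An alternative to four separate assignments is to recode the integer column as a factor in one step. The sketch below uses only the first ten region codes shown in the `str()` output; the code-to-label mapping follows the assignments above.

```r
# Region codes for the first ten states, copied from the str() output above.
region_codes <- c(2L, 4L, 4L, 2L, 4L, 4L, 1L, 2L, 2L, 2L)

# Map codes 1-4 to labels in a single call.
region <- factor(region_codes,
                 levels = 1:4,
                 labels = c("Northeast", "South", "Midwest", "West"))

print(table(region))
```

On the full data this would be applied to the original integer column, for example `raw_data$state.region <- factor(raw_data$state.region, levels = 1:4, labels = c("Northeast", "South", "Midwest", "West"))`. A factor also keeps the region order fixed in `table()` and in plot facets.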

The following code generates a box plot. According to the graph, the interquartile range of life expectancy across US states runs from about 70 to 72.

ggplot(data=raw_data) + geom_boxplot(aes(y = life.exp, x = 1), fill='blue')

#Base graphic code for boxplot
#boxplot(raw_data$life.exp)
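
The quartiles read off the box plot can also be checked numerically. This sketch assumes the CSV mirrors R's built-in `state.x77` matrix (the values in the `head()` output appear to match); with the CSV loaded, `raw_data$life.exp` would take the place of `life_exp`.

```r
# Assumption: state.x77 (from R's datasets package) holds the same values as stateData.csv.
life_exp <- state.x77[, "Life Exp"]

# The first and third quartiles bound the box in a box plot.
q <- quantile(life_exp, probs = c(0.25, 0.75))
print(q)

# Interquartile range: the distance between those two quartiles.
print(IQR(life_exp))
```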

The histogram shows that income across the 50 states mostly falls between about 3,500 and 5,400 (presumably dollars, since the file does not state units), with one state standing out as the highest at around 6,300 (Alaska, at 6315 in the output above).

ggplot(data=raw_data) + geom_histogram(aes(x = income), fill='blue')
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

#Base graphic code for histogram
#hist(raw_data$income, main = "Average Household Income in US by States", xlab = "Income Range")
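
The `stat_bin()` message above suggests choosing a bin width explicitly. The following sketch shows one way to do that; the value `binwidth = 250` is an illustrative choice of mine, and the data again assumes that R's built-in `state.x77` matches the CSV.

```r
library(ggplot2)

# Assumption: state.x77 holds the same income values as stateData.csv.
income_df <- data.frame(income = state.x77[, "Income"])

# binwidth = 250 is an illustrative choice, not a value from the original analysis.
p <- ggplot(data = income_df) +
  geom_histogram(aes(x = income), binwidth = 250, fill = "blue")

print(p)
```

Trying a few bin widths is usually worthwhile with only 50 observations, since the apparent shape of the distribution can change noticeably.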

The scatterplot shows that in the South, the trend of higher illiteracy rate going together with lower income is very prominent, while the opposite trend appears in the Northeast. The graph also suggests that states with higher average income tend to have higher life expectancy, as many of these points are located toward the top left of the panels.

ggplot(data=raw_data, aes(x = illiteracy, y = income, color = life.exp)) + geom_point() + facet_wrap(~state.region)

#Base graphic code for scatterplot
#plot(income ~ illiteracy, data = raw_data, main = "Relationship Between Illiteracy Rate and Income in US by States", xlab = "Illiteracy", ylab = "Income")
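
The per-region trends described above can be summarized with a correlation coefficient for each region. This is a rough sketch, assuming R's built-in `state.x77` and `state.region` hold the same values as the CSV; `cor_by_region` is an illustrative name.

```r
# Assumption: R's built-in state datasets hold the same values as stateData.csv.
states <- data.frame(
  region = state.region,
  illiteracy = state.x77[, "Illiteracy"],
  income = state.x77[, "Income"]
)

# The built-in factor calls the third region "North Central"; rename it to match the text.
levels(states$region)[levels(states$region) == "North Central"] <- "Midwest"

# Pearson correlation between illiteracy and income within each region.
cor_by_region <- sapply(split(states, states$region),
                        function(d) cor(d$illiteracy, d$income))

print(round(cor_by_region, 2))
```

Each region contains only 9 to 16 states, so these coefficients are best read as a rough summary of the panels rather than as firm estimates.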

The second scatterplot colors each state by its high school graduation rate. States with a higher high school graduation rate (low illiteracy) tend to have higher average income. The murder rate of each state also appears to have an inverse relationship with the life expectancy of its population: as the murder rate increases, life expectancy decreases.

ggplot(data=raw_data, aes(x = murder, y = life.exp, color = highSchoolGrad)) + geom_point()

#Base graphic code for scatterplot
#plot(life.exp ~ murder, data = raw_data, main = "Relationship Between Murder Rate and Life Expectancy in US by States", xlab = "Murder Rate", ylab = "Life Expectancy")
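
The inverse relationship between murder rate and life expectancy can be quantified with a correlation coefficient and a simple linear fit. This is a sketch under the assumption that R's built-in `state.x77` matches the CSV; it describes an association only and says nothing about cause.

```r
# Assumption: state.x77 holds the same values as stateData.csv.
murder <- state.x77[, "Murder"]
life_exp <- state.x77[, "Life Exp"]

# Pearson correlation: a negative value indicates an inverse relationship.
r <- cor(murder, life_exp)
print(r)

# Simple linear regression of life expectancy on murder rate.
fit <- lm(life_exp ~ murder)
print(coef(fit))
```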