This week’s assignment focuses on exploratory data analysis. The dataset I chose comes from Udacity course material hosted on GitHub: https://github.com/tollek/udacity-data-science/blob/master/p4/l2/stateData.csv. It is a very small dataset (50 rows, one per state), and it does not document the units of its variables. However, since I am only analyzing the relationships between some of the variables, I can still draw reasonable findings from it.
First, I need to install the ggplot2 package. R's base graphics are already powerful enough to produce quality graphics, but ggplot2 makes it possible to build more complex graphs layer by layer. It also makes it easier to modify the color, shape, and other features of a graph.
install.packages("ggplot2", repos = "http://cran.us.r-project.org")
## Installing package into 'C:/Users/blin261/Documents/R/win-library/3.3'
## (as 'lib' is unspecified)
## package 'ggplot2' successfully unpacked and MD5 sums checked
##
## The downloaded binary packages are in
## C:\Users\blin261\AppData\Local\Temp\Rtmpo1a5qH\downloaded_packages
require(ggplot2)
## Loading required package: ggplot2
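Running install.packages() on every knit downloads the package again each time. A minimal sketch of a guard follows; `ensure_package` is a hypothetical helper name, and the CRAN mirror URL is an assumption that can be swapped for any mirror.

```r
# Hypothetical helper: install a package only when it is missing, then attach it.
ensure_package <- function(pkg, repos = "https://cloud.r-project.org") {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg, repos = repos)
  }
  # library() raises an error if the package is still unavailable
  library(pkg, character.only = TRUE)
  invisible(TRUE)
}

# "stats" ships with R, so this call downloads nothing
ensure_package("stats")
```

In this report the call would be `ensure_package("ggplot2")`.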
I downloaded the data to my local machine. The following code loads the data and shows which variables are categorical and which are continuous.
raw_data<-read.csv("stateData.csv",sep=",")
head(raw_data)
## X state.abb state.area state.region population income
## 1 Alabama AL 51609 2 3615 3624
## 2 Alaska AK 589757 4 365 6315
## 3 Arizona AZ 113909 4 2212 4530
## 4 Arkansas AR 53104 2 2110 3378
## 5 California CA 158693 4 21198 5114
## 6 Colorado CO 104247 4 2541 4884
## illiteracy life.exp murder highSchoolGrad frost area
## 1 2.1 69.05 15.1 41.3 20 50708
## 2 1.5 69.31 11.3 66.7 152 566432
## 3 1.8 70.55 7.8 58.1 15 113417
## 4 1.9 70.66 10.1 39.9 65 51945
## 5 1.1 71.71 10.3 62.6 20 156361
## 6 0.7 72.06 6.8 63.9 166 103766
str(raw_data)
## 'data.frame': 50 obs. of 12 variables:
## $ X : Factor w/ 50 levels "Alabama","Alaska",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ state.abb : Factor w/ 50 levels "AK","AL","AR",..: 2 1 4 3 5 6 7 8 9 10 ...
## $ state.area : int 51609 589757 113909 53104 158693 104247 5009 2057 58560 58876 ...
## $ state.region : int 2 4 4 2 4 4 1 2 2 2 ...
## $ population : int 3615 365 2212 2110 21198 2541 3100 579 8277 4931 ...
## $ income : int 3624 6315 4530 3378 5114 4884 5348 4809 4815 4091 ...
## $ illiteracy : num 2.1 1.5 1.8 1.9 1.1 0.7 1.1 0.9 1.3 2 ...
## $ life.exp : num 69 69.3 70.5 70.7 71.7 ...
## $ murder : num 15.1 11.3 7.8 10.1 10.3 6.8 3.1 6.2 10.7 13.9 ...
## $ highSchoolGrad: num 41.3 66.7 58.1 39.9 62.6 63.9 56 54.6 52.6 40.6 ...
## $ frost : int 20 152 15 65 20 166 139 103 11 60 ...
## $ area : int 50708 566432 113417 51945 156361 103766 4862 1982 54090 58073 ...
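The same split can be made programmatically by testing each column with is.numeric(). The sketch below retypes three rows from the printed output so that it runs without the CSV; `sample_rows` is a hypothetical name. Note that state.region is stored as an integer but is really a categorical code, so the storage type alone does not settle the question.

```r
# Three rows retyped from the head() output above (a subset of the columns)
sample_rows <- data.frame(
  X            = c("Alabama", "Alaska", "Arizona"),
  state.region = c(2L, 4L, 4L),
  income       = c(3624L, 6315L, 4530L),
  illiteracy   = c(2.1, 1.5, 1.8),
  stringsAsFactors = TRUE
)

# TRUE for integer and double columns, FALSE for factors
is_num <- vapply(sample_rows, is.numeric, logical(1))

numeric_cols     <- names(sample_rows)[is_num]
non_numeric_cols <- names(sample_rows)[!is_num]

print(numeric_cols)
print(non_numeric_cols)
```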
I replaced the integer codes in the state.region column with human-readable region names, which turns the column into a character vector.
raw_data$state.region[raw_data$state.region == "1"] <- "Northeast"
raw_data$state.region[raw_data$state.region == "2"] <- "South"
raw_data$state.region[raw_data$state.region == "3"] <- "Midwest"
raw_data$state.region[raw_data$state.region == "4"] <- "West"
head(raw_data)
## X state.abb state.area state.region population income
## 1 Alabama AL 51609 South 3615 3624
## 2 Alaska AK 589757 West 365 6315
## 3 Arizona AZ 113909 West 2212 4530
## 4 Arkansas AR 53104 South 2110 3378
## 5 California CA 158693 West 21198 5114
## 6 Colorado CO 104247 West 2541 4884
## illiteracy life.exp murder highSchoolGrad frost area
## 1 2.1 69.05 15.1 41.3 20 50708
## 2 1.5 69.31 11.3 66.7 152 566432
## 3 1.8 70.55 7.8 58.1 15 113417
## 4 1.9 70.66 10.1 39.9 65 51945
## 5 1.1 71.71 10.3 62.6 20 156361
## 6 0.7 72.06 6.8 63.9 166 103766
table(raw_data$state.region)
##
## Midwest Northeast South West
## 12 9 16 13
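An alternative to four separate assignments is a single factor() call that maps codes to labels. This is only a sketch, using the six region codes printed by the first head() call; `region_code` and `region` are hypothetical names.

```r
# Region codes of the first six states, as printed before the recoding
region_code <- c(2L, 4L, 4L, 2L, 4L, 4L)

# Map each integer code to its label in one step
region <- factor(
  region_code,
  levels = 1:4,
  labels = c("Northeast", "South", "Midwest", "West")
)

print(region)
print(table(region))
```

Unlike the character column, a factor keeps the levels in code order, so table() lists Northeast first and also reports levels with a count of zero.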
The following code generates a box plot. According to the graph, the interquartile range of life expectancy across US states runs from about 70 to 72.
ggplot(data=raw_data) + geom_boxplot(aes(y = life.exp, x = 1), fill='blue')
#Base graphic code for boxplot
#boxplot(raw_data$life.exp)
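The quartiles read off the box plot can be checked numerically with quantile(). The sketch below assumes that stateData.csv carries the same values as R's bundled state.x77 matrix, which the first rows printed above match; with the CSV loaded, `raw_data$life.exp` would take the place of `life_exp`.

```r
# Assumption: state.x77 (from R's datasets package) stands in for the CSV
life_exp <- state.x77[, "Life Exp"]

# First and third quartiles, i.e. the edges of the box
q <- quantile(life_exp, probs = c(0.25, 0.75))
print(q)

# Width of the box
print(IQR(life_exp))
```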
The histogram shows that average income across the 50 states mostly falls between roughly 3,500 and 5,400 (the file does not state units; the values appear to be dollars), with one state standing out at the top at around 6,300 (Alaska, at 6315 in the output above).
ggplot(data=raw_data) + geom_histogram(aes(x = income), fill='blue')
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
#Base graphic code for histogram
#hist(raw_data$income, main = "Average Household Income in US by States", xlab = "Income Range")
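The stat_bin() message appears because no bin width was given. A minimal sketch of setting one explicitly follows; the width of 250 is an arbitrary assumption chosen for illustration, and state.x77 again stands in for the CSV on the assumption that both hold the same income values.

```r
library(ggplot2)

# Assumption: state.x77 stands in for stateData.csv
income_df <- data.frame(income = state.x77[, "Income"])

# An explicit binwidth replaces the default of 30 bins
p <- ggplot(income_df, aes(x = income)) +
  geom_histogram(binwidth = 250, fill = "blue", color = "white")

print(p)
```

Trying a few widths is worthwhile, since the apparent shape of a 50-point distribution can change noticeably with the binning.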
The scatterplot shows that in the South, the trend of higher illiteracy going with lower income is very prominent, whereas the Northeast shows the opposite trend. The graph also suggests that states with higher average income tend to have higher life expectancy, since many of the points with high life expectancy sit at the top left of the graph.
ggplot(data=raw_data, aes(x = illiteracy, y = income, color = life.exp)) + geom_point() + facet_wrap(~state.region)
#Base graphic code for scatterplot
#plot(income ~ illiteracy, data = raw_data, main = "Relationship Between Illiteracy Rate and Income in US by States", xlab = "Illiteracy", ylab = "Income")
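The per-region trends seen in the facets can be quantified with a correlation coefficient for each region. This sketch assumes R's bundled state.x77 and state.region hold the same values as the CSV; note that the bundled factor labels the Midwest as "North Central". The names `states` and `cor_by_region` are hypothetical.

```r
# Assumption: bundled datasets stand in for stateData.csv
states <- data.frame(
  region     = state.region,
  illiteracy = state.x77[, "Illiteracy"],
  income     = state.x77[, "Income"]
)

# Pearson correlation between illiteracy and income within each region
cor_by_region <- sapply(
  split(states, states$region),
  function(d) cor(d$illiteracy, d$income)
)

print(round(cor_by_region, 2))
```

A negative value means income falls as illiteracy rises within that region; with only 9 to 16 states per region, these coefficients should be read as rough indications.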
The second scatterplot plots life expectancy against murder rate and colors each state by its high school graduation rate. States with a higher graduation rate (and lower illiteracy) tend to have higher average income. The murder rate may also have an inverse relationship with life expectancy: as the murder rate increases, life expectancy decreases.
ggplot(data=raw_data, aes(x = murder, y = life.exp, color = highSchoolGrad)) + geom_point()
#Base graphic code for scatterplot
#plot(life.exp ~ murder, data = raw_data, main = "Relationship Between Murder Rate and Life Expectancy in US by States", xlab = "Murder Rate", ylab = "Life Expectancy")
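The inverse relationship between murder rate and life expectancy can be checked with a correlation and a simple linear fit. As a sketch, this again assumes state.x77 matches the CSV; `states`, `r`, and `fit` are hypothetical names.

```r
# Assumption: state.x77 stands in for stateData.csv
states <- data.frame(
  murder   = state.x77[, "Murder"],
  life.exp = state.x77[, "Life Exp"]
)

# Pearson correlation: a negative value supports an inverse relationship
r <- cor(states$murder, states$life.exp)
print(r)

# Slope of the fitted line: change in life expectancy per unit of murder rate
fit <- lm(life.exp ~ murder, data = states)
print(coef(fit))
```

A correlation describes association only; it does not show that the murder rate itself drives the difference in life expectancy.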