This document covers exploratory data analysis in R using graphs. All plots are drawn with the ggplot2 package, so it must be installed.
The data set comes from the list at http://vincentarelbundock.github.io/Rdatasets/ (data set: education.csv).
Data format: 6 variables and 50 observations, with one header row.
State: State for which the data is relevant
Region: 1=Northeastern, 2=Northcentral, 3=Southern, 4=Western
X1: Number of residents per thousand residing in urban areas in 1970
X2: Per capita personal income in 1973
X3: Number of residents per thousand under 18 years of age in 1974
Y: Per capita expenditure on public education in a state, projected for 1975
We first import the data and store it in a data frame named education.
# using package ggplot2
require(ggplot2)
## Loading required package: ggplot2
#Data Set = Education from vincentarelbundock/Rdatasets
education_url <- "https://raw.github.com/vincentarelbundock/Rdatasets/master/csv/robustbase/education.csv"
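# stringsAsFactors = TRUE keeps State as a factor on R >= 4.0, as in the str() output
education <- read.csv(education_url, header = TRUE, row.names = 1, stringsAsFactors = TRUE)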
str(education)
## 'data.frame': 50 obs. of 6 variables:
## $ State : Factor w/ 50 levels "AK","AL","AR",..: 21 30 46 19 39 7 34 31 38 35 ...
## $ Region: int 1 1 1 1 1 1 1 1 1 2 ...
## $ X1 : int 508 564 322 846 871 774 856 889 715 753 ...
## $ X2 : int 3944 4578 4011 5233 4780 5889 5663 5759 4894 5012 ...
## $ X3 : int 325 323 328 305 303 307 301 310 300 324 ...
## $ Y : int 235 231 270 261 300 317 387 285 300 221 ...
We now rename the columns and map the region codes to their names.
#Rename column header to more meaningful name (see description of data above)
names(education)[3:6] <- c("Residents_urban_areas", "Per_capita_income", "Residents_under18", "Per_capita_school_expenditure")
#Remap Region codes to their names
education$Region <- factor(education$Region, levels = 1:4,
                           labels = c("Northeastern", "Northcentral", "Southern", "Western"))
We will reorder the states by the “Per_capita_school_expenditure” column and plot each variable with x=State.
# order the levels of State by Per_capita_school_expenditure (the rows themselves are not sorted)
education$State <- reorder(education$State, education$Per_capita_school_expenditure)
g1<-ggplot(education, aes(x=State, y=Residents_urban_areas, group=1)) + geom_line() + geom_point()
g2<-ggplot(education, aes(x=State, y=Per_capita_income, group=1)) + geom_line() + geom_point()
g3<-ggplot(education, aes(x=State, y=Residents_under18, group=1)) + geom_line() + geom_point()
g4<-ggplot(education, aes(x=State, y=Per_capita_school_expenditure, group=1)) + geom_line() + geom_point()
g1
g2
g3
g4
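To see what reorder() does to the x axis, here is a minimal sketch on a made-up three-row data frame (the toy names and values are illustrative assumptions, not part of the education data). It suggests that only the factor levels change order, which is what a discrete axis is laid out from; the rows stay where they were.

```r
# Toy data: hypothetical values, only to illustrate reorder()
toy <- data.frame(State = c("A", "B", "C"),
                  spend = c(300, 100, 200))

# Order the levels of State by the matching spend value (ascending)
toy$State <- reorder(toy$State, toy$spend)

levels(toy$State)         # level order now follows spend: "B" "C" "A"
as.character(toy$State)   # row order is unchanged: "A" "B" "C"
```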
Since the states are now sorted by “Per_capita_school_expenditure”, any variable that has an impact on it should follow either the same progression or the opposite one. We would assume that a high number of residents under 18 (which skews the adult/children population ratio towards children) correlates with higher spending on schools. However, graph 3 (g3) does not show this.
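A correlation coefficient can back up what the eye reads from these graphs. The sketch below is only illustrative: it uses the ten rows visible in the str() output above rather than the full 50-state table, and the name edu10 is a hypothetical stand-in, so the numbers it produces should not be read as results for the whole data set.

```r
# First ten rows only, copied from the str() output (not the full data set)
edu10 <- data.frame(
  Per_capita_income             = c(3944, 4578, 4011, 5233, 4780, 5889, 5663, 5759, 4894, 5012),
  Residents_under18             = c(325, 323, 328, 305, 303, 307, 301, 310, 300, 324),
  Per_capita_school_expenditure = c(235, 231, 270, 261, 300, 317, 387, 285, 300, 221)
)

# Pearson correlation of each candidate variable with school expenditure
r_income  <- cor(edu10$Per_capita_income, edu10$Per_capita_school_expenditure)
r_under18 <- cor(edu10$Residents_under18, edu10$Per_capita_school_expenditure)
print(c(income = r_income, under18 = r_under18))
```

On the full data frame, the same two cor() calls applied to education would give the values relevant to the discussion.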
We will now plot “Per Capita School Expenditure” as a function of “Per Capita Income” and “Residents under 18” respectively.
g5<-ggplot(education, aes(x=Per_capita_income, y=Per_capita_school_expenditure)) + geom_point(shape=3, colour = "red")
g5
g6<-ggplot(education, aes(x=Residents_under18, y=Per_capita_school_expenditure)) + geom_point(shape=3, colour = "blue")
g6
We will replot g5 (x=Per_capita_income, y=Per_capita_school_expenditure) with a linear regression line; the shaded area is the default 95% confidence interval.
g7<-ggplot(education, aes(x=Per_capita_income, y=Per_capita_school_expenditure)) + geom_point(shape=3, colour = "red") + stat_smooth(method=lm)
g7
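The line drawn by stat_smooth(method=lm) is an ordinary least-squares fit, so the same kind of model can be fitted directly with lm() to read off its coefficients. This is a sketch assuming the default formula y ~ x; it relies on the ten rows from the str() output (hypothetical name edu10, illustrative income value 5000), so the coefficients are not those of the full data set.

```r
# First ten rows only, copied from the str() output (not the full data set)
edu10 <- data.frame(
  Per_capita_income             = c(3944, 4578, 4011, 5233, 4780, 5889, 5663, 5759, 4894, 5012),
  Per_capita_school_expenditure = c(235, 231, 270, 261, 300, 317, 387, 285, 300, 221)
)

# Ordinary least-squares fit of expenditure on income
fit <- lm(Per_capita_school_expenditure ~ Per_capita_income, data = edu10)
print(coef(fit))   # intercept and slope of the fitted line

# 95% confidence interval for the fitted mean at one illustrative income value
ci <- predict(fit, newdata = data.frame(Per_capita_income = 5000),
              interval = "confidence", level = 0.95)
print(ci)          # columns: fit, lwr, upr
```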
Finally, we will plot histograms of “Per Capita School Expenditure” facetted by region, and boxplots of the same variable grouped by region.
g8 <-ggplot(education, aes(x=Per_capita_school_expenditure)) + geom_histogram(binwidth = 15, fill = "pink", colour = "black") + facet_grid(Region ~ .)
g9<-ggplot(education, aes(x=Region, y=Per_capita_school_expenditure)) + geom_boxplot()
g8
g9
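A boxplot summarises each group by its median and quartiles, and similar numbers can be computed directly. A minimal sketch, assuming only the ten rows shown in the str() output (nine Northeastern states and one Northcentral state; edu10 is a hypothetical name), so Southern and Western are absent here:

```r
# First ten rows only, copied from the str() output (not the full data set)
edu10 <- data.frame(
  Region = factor(c(rep("Northeastern", 9), "Northcentral")),
  Per_capita_school_expenditure = c(235, 231, 270, 261, 300, 317, 387, 285, 300, 221)
)

# Median expenditure per region
med <- tapply(edu10$Per_capita_school_expenditure, edu10$Region, median)
print(med)   # Northcentral 221, Northeastern 285

# Tukey's five-number summary per region (min, lower hinge, median, upper hinge, max)
print(tapply(edu10$Per_capita_school_expenditure, edu10$Region, fivenum))
```

Run on the full education data frame, the same tapply() calls would cover all four regions.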