R Homework Week 4 Assignment
Exploratory Data Analysis in R.
Choose an interesting dataset and use R graphics to describe the data. You may use base R graphics, or a graphics package of your choice. You should include at least one example of each of the following: histogram, boxplot, and scatterplot.
Do the graphics provide insight into any relationships in the data?
Install necessary packages:
install.packages("devtools", dependencies = TRUE, repos = "http://lib.stat.cmu.edu/R/CRAN/")
## Warning: dependencies 'BiocInstaller', 'lintr (>= 0.2.1)' are not available
## package 'devtools' successfully unpacked and MD5 sums checked
##
## The downloaded binary packages are in
## C:\Users\Karen\AppData\Local\Temp\RtmpUz8aza\downloaded_packages
Load RCurl so the CSV file can be retrieved
library(RCurl)
## Loading required package: bitops
Retrieve the data file from the GitHub repository
caschooldata <- getURL("https://raw.githubusercontent.com/karenweigandt/cuny-summer-bridge/master/Caschoolincolumns.csv")
Read data file
ca_school_data_csv <- read.csv(text = caschooldata, header = TRUE, stringsAsFactors = FALSE)
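As a minimal, self-contained sketch of what `read.csv(text = ...)` does, the string below stands in for the text that `getURL()` returns. It is an assumption for illustration only: it holds three rows and four columns copied from the `head()` output of this data set, and the names `csv_text` and `sample_df` are hypothetical.

```r
# Stand-in for the character string returned by getURL() (assumption:
# only three rows and four columns, copied from the head() output).
csv_text <- "county,district,enrltot,teachers
Alameda,Sunol Glen Unified,195,10.90
Butte,Manzanita Elementary,240,11.15
Butte,Thermalito Union Elementary,1550,82.90"

# The text argument makes read.csv parse the string as if it were a file.
sample_df <- read.csv(text = csv_text, header = TRUE, stringsAsFactors = FALSE)

# Inspect column types: county and district stay character, not factor.
str(sample_df)
```

Passing the downloaded text through `text =` avoids writing a temporary file, which is presumably why the two-step retrieve-then-read approach is used here.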
Look at first six rows of data
head(ca_school_data_csv)
##   X distcod  county                        district grspan enrltot
## 1 1   75119 Alameda              Sunol Glen Unified  KK-08     195
## 2 2   61499   Butte            Manzanita Elementary  KK-08     240
## 3 3   61549   Butte     Thermalito Union Elementary  KK-08    1550
## 4 4   61457   Butte Golden Feather Union Elementary  KK-08     243
## 5 5   61523   Butte        Palermo Union Elementary  KK-08    1335
## 6 6   62042  Fresno         Burrel Union Elementary  KK-08     137
##   teachers calwpct mealpct computer testscr   compstu  expnstu      str
## 1    10.90  0.5102  2.0408       67  690.80 0.3435898 6384.911 17.88991
## 2    11.15 15.4167 47.9167      101  661.20 0.4208333 5099.381 21.52466
## 3    82.90 55.0323 76.3226      169  643.60 0.1090323 5501.955 18.69723
## 4    14.00 36.4754 77.0492       85  647.70 0.3497942 7101.831 17.35714
## 5    71.50 33.1086 78.4270      171  640.85 0.1280899 5235.988 18.67133
## 6     6.40 12.3188 86.9565       25  605.55 0.1824818 5580.147 21.40625
##      avginc     elpct readscr mathscr
## 1 22.690001  0.000000   691.6   690.0
## 2  9.824000  4.583333   660.5   661.9
## 3  8.978000 30.000002   636.3   650.9
## 4  8.978000  0.000000   651.9   643.5
## 5  9.080333 13.857677   641.8   639.9
## 6 10.415000 12.408759   605.7   605.4
Generate a histogram of number of teachers
hist(ca_school_data_csv$teachers, col = "lightblue", breaks = 100, main = "Number of Teachers in California Schools")
Generate a boxplot of enrollment totals for four selected California counties
# Keep only the rows whose county is one of the four selected counties
selected_counties <- c("Kings", "Orange", "Fresno", "Yuba")
four_counties_of_ca_school <- ca_school_data_csv[ca_school_data_csv$county %in% selected_counties, ]
boxplot(enrltot~county, data = four_counties_of_ca_school, main = "Enrollment Totals in Four California Counties")
Generate a scatterplot of enrollment totals vs. number of teachers
plot(ca_school_data_csv$enrltot, ca_school_data_csv$teachers, main = "Enrollment Total vs. Number of Teachers")
From the histogram, we see that the data are right-skewed: the vast majority of school districts (each row of the data is one district) have relatively few teachers, with a long tail toward larger counts.
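One rough way to back up a visual impression of right skew is to compare the mean with the median, since a few large values pull the mean upward. The sketch below is only an illustration under stated assumptions: it uses the six `teachers` values from the `head()` output above rather than the full column, and a mean above the median is a common sign of right skew, not a proof of it.

```r
# teachers values for the six districts shown in the head() output above;
# on the full data set this would be ca_school_data_csv$teachers.
teachers <- c(10.90, 11.15, 82.90, 14.00, 71.50, 6.40)

mean_teachers   <- mean(teachers)
median_teachers <- median(teachers)

# In a right-skewed sample the few large values pull the mean above the median.
mean_teachers
median_teachers
mean_teachers > median_teachers
```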
The boxplot shows only four California counties, because a boxplot of all counties would be difficult to read. From the boxplot we can see that Orange County has the highest median enrollment of the four counties (the center line of each box is the median, not the mean), with one very large district. The vast majority of districts in Fresno County are similar in enrollment size. Kings and Yuba counties are similar in terms of median enrollment, and Yuba County has no outliers in terms of enrollment size.
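The quantities a boxplot draws can also be computed directly, which helps when reading values off the plot is imprecise. The following is a hedged sketch, not the analysis itself: `sample_df` is a hypothetical stand-in built from the six rows of the `head()` output above, so it covers Alameda, Butte, and Fresno rather than the four counties plotted.

```r
# Six districts from the head() output above (assumption: a stand-in for
# four_counties_of_ca_school, which covers different counties).
sample_df <- data.frame(
  county  = c("Alameda", "Butte", "Butte", "Butte", "Butte", "Fresno"),
  enrltot = c(195, 240, 1550, 243, 1335, 137)
)

# Median enrollment per county: the value drawn as the thick line in each box.
county_medians <- tapply(sample_df$enrltot, sample_df$county, median)
county_medians

# Values that boxplot() would draw as individual outlier points, i.e. those
# lying more than 1.5 times the box length beyond the box.
butte_enrollment <- sample_df$enrltot[sample_df$county == "Butte"]
butte_outliers <- boxplot.stats(butte_enrollment)$out
butte_outliers
```

On the real subset, the same two calls with `four_counties_of_ca_school` in place of `sample_df` would give the per-county medians and outliers discussed above.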
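The scatterplot shows an essentially linear relationship between enrollment totals and the number of teachers. This makes sense if most districts operate near a similar student-teacher ratio (the str column), possibly close to a legal limit, because the number of teachers then scales in proportion to enrollment.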
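The strength of a linear relationship can be summarized numerically with a correlation coefficient and a fitted line. This is a minimal sketch under stated assumptions: `sample_df` is a hypothetical data frame holding only the six rows from the `head()` output above, so the numbers it produces describe those six districts, not the full data set.

```r
# Enrollment and teacher counts for the six districts in the head() output.
sample_df <- data.frame(
  enrltot  = c(195, 240, 1550, 243, 1335, 137),
  teachers = c(10.90, 11.15, 82.90, 14.00, 71.50, 6.40)
)

# Pearson correlation: values near 1 indicate a strong positive linear relationship.
r <- cor(sample_df$enrltot, sample_df$teachers)

# Least-squares line; the slope estimates teachers per additional student.
fit <- lm(teachers ~ enrltot, data = sample_df)
slope <- unname(coef(fit)["enrltot"])

# Students per teacher for each district; for these rows it reproduces
# the str column shown in the head() output.
ratio <- sample_df$enrltot / sample_df$teachers

r
slope
round(ratio, 5)
```

Running `cor()` and `lm()` on `ca_school_data_csv` itself would show whether the near-linear pattern holds across all districts.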