R Homework Week 4 Assignment

Exploratory Data Analysis in R.

Choose an interesting dataset and use R graphics to describe the data. You may use base R graphics, or a graphics package of your choice. You should include at least one example of each of the following: histogram, boxplot, scatterplot.

Do the graphics provide insight into any relationships in the data?

Install necessary packages:

install.packages("devtools", dependencies = TRUE, repos = "http://lib.stat.cmu.edu/R/CRAN/")
## Warning: dependencies 'BiocInstaller', 'lintr (>= 0.2.1)' are not available
## package 'devtools' successfully unpacked and MD5 sums checked
## 
## The downloaded binary packages are in
##  C:\Users\Karen\AppData\Local\Temp\RtmpUz8aza\downloaded_packages

Load RCurl so the CSV file can be retrieved

library(RCurl)
## Loading required package: bitops

Retrieve the data file from the GitHub repository

caschooldata <- getURL("https://raw.githubusercontent.com/karenweigandt/cuny-summer-bridge/master/Caschoolincolumns.csv")

Read data file

ca_school_data_csv <- read.csv(text = caschooldata, header = TRUE, stringsAsFactors = FALSE)
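The `text =` argument makes `read.csv` parse a character string as if it were the contents of a file. A minimal sketch of that step, assuming a small hand-made string in place of the downloaded one: `sample_csv_text` and `sample_df` are hypothetical names, and the two rows are copied from the `head()` output shown below, with only four of the columns kept.

```r
# Hypothetical stand-in for the character string returned by getURL():
# a header line plus two data rows, separated by newlines.
sample_csv_text <- paste(
  "county,district,enrltot,teachers",
  "Alameda,Sunol Glen Unified,195,10.90",
  "Butte,Manzanita Elementary,240,11.15",
  sep = "\n"
)

# text = tells read.csv to parse the string itself instead of opening a file
sample_df <- read.csv(text = sample_csv_text, header = TRUE, stringsAsFactors = FALSE)

# county and district stay character; enrltot is read as integer, teachers as numeric
str(sample_df)
```

Because `stringsAsFactors = FALSE`, the county names remain plain character strings, which keeps later comparisons such as `county == "Fresno"` straightforward.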

Look at first six rows of data

head(ca_school_data_csv)
##   X distcod  county                        district grspan enrltot
## 1 1   75119 Alameda              Sunol Glen Unified  KK-08     195
## 2 2   61499   Butte            Manzanita Elementary  KK-08     240
## 3 3   61549   Butte     Thermalito Union Elementary  KK-08    1550
## 4 4   61457   Butte Golden Feather Union Elementary  KK-08     243
## 5 5   61523   Butte        Palermo Union Elementary  KK-08    1335
## 6 6   62042  Fresno         Burrel Union Elementary  KK-08     137
##   teachers calwpct mealpct computer testscr   compstu  expnstu      str
## 1    10.90  0.5102  2.0408       67  690.80 0.3435898 6384.911 17.88991
## 2    11.15 15.4167 47.9167      101  661.20 0.4208333 5099.381 21.52466
## 3    82.90 55.0323 76.3226      169  643.60 0.1090323 5501.955 18.69723
## 4    14.00 36.4754 77.0492       85  647.70 0.3497942 7101.831 17.35714
## 5    71.50 33.1086 78.4270      171  640.85 0.1280899 5235.988 18.67133
## 6     6.40 12.3188 86.9565       25  605.55 0.1824818 5580.147 21.40625
##      avginc     elpct readscr mathscr
## 1 22.690001  0.000000   691.6   690.0
## 2  9.824000  4.583333   660.5   661.9
## 3  8.978000 30.000002   636.3   650.9
## 4  8.978000  0.000000   651.9   643.5
## 5  9.080333 13.857677   641.8   639.9
## 6 10.415000 12.408759   605.7   605.4

Generate a histogram of number of teachers

hist(ca_school_data_csv$teachers, col = "lightblue", breaks = 100, main = "Number of Teachers in California Schools")

Generate a boxplot of enrollment totals for four selected California counties

four_counties_of_ca_school <- ca_school_data_csv[ca_school_data_csv$county %in% c("Kings", "Orange", "Fresno", "Yuba"), ]
boxplot(enrltot~county, data = four_counties_of_ca_school, main = "Enrollment Totals in Four California Counties")

Generate a scatterplot of enrollment totals vs. number of teachers

plot(ca_school_data_csv$enrltot, ca_school_data_csv$teachers, xlab = "Enrollment total", ylab = "Number of teachers", main = "Enrollment Total vs. Number of Teachers")

Conclusions

The histogram shows strongly right-skewed data: the vast majority of schools have small numbers of teachers, and only a few have very many.
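One simple numeric check for right skew is to compare the mean with the median, since a few large values pull the mean upward. The sketch below uses only the six `teachers` values visible in the `head()` output, so it illustrates the check and does not summarize the full dataset. The name `teachers_sample` is an assumption made for this example.

```r
# First six values of the teachers column, copied from the head() output above.
# On the full data, ca_school_data_csv$teachers would be used instead.
teachers_sample <- c(10.90, 11.15, 82.90, 14.00, 71.50, 6.40)

mean_teachers <- mean(teachers_sample)
median_teachers <- median(teachers_sample)

# In a right-skewed distribution, a few large values pull the mean above the median
mean_teachers > median_teachers
```

Running the same comparison on the full `teachers` column would show whether the skew seen in the histogram holds numerically.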

The boxplot is limited to four California counties for readability, since a boxplot of all counties would be hard to read. From the boxplot we can see that Orange County schools have the highest median enrollment of the four counties, with one very large school. The vast majority of schools in Fresno are similar in enrollment size. Kings and Yuba counties are similar in terms of median enrollment, and Yuba County has no outliers in terms of enrollment size.
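The centre line of a boxplot marks the median, not the mean, so a per-county table of both makes the comparison explicit. The sketch below is an illustration on the six rows from the `head()` output, which cover Alameda, Butte, and Fresno and not the four plotted counties. The names `sample_rows`, `median_by_county`, and `mean_by_county` are assumptions for this example.

```r
# County and enrollment for the six rows shown in the head() output above.
# On the full data, four_counties_of_ca_school would be used instead.
sample_rows <- data.frame(
  county  = c("Alameda", "Butte", "Butte", "Butte", "Butte", "Fresno"),
  enrltot = c(195, 240, 1550, 243, 1335, 137),
  stringsAsFactors = FALSE
)

# tapply applies a function to enrltot within each county group
median_by_county <- tapply(sample_rows$enrltot, sample_rows$county, median)
mean_by_county <- tapply(sample_rows$enrltot, sample_rows$county, mean)

# Side-by-side table: one row per county, one column per statistic
cbind(median = median_by_county, mean = mean_by_county)
```

When a county contains a few very large enrollments, its mean sits above its median, which is why the two statistics can rank counties differently.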

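The scatterplot shows an essentially linear relationship between enrollment totals and the number of teachers. This makes sense if student-teacher ratios vary little from school to school: in the six rows shown above, the str column ranges only from about 17 to 22, so the number of teachers rises roughly in proportion to enrollment. One possible reason is that many schools operate close to a maximum allowed student-teacher ratio, although the plot alone cannot confirm this.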
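The strength of the linear relationship can be checked with a correlation coefficient and a simple linear fit. The sketch below again uses only the six rows from the `head()` output, so the numbers are illustrative and not results for the full dataset. The names `sample_rows`, `enrl_teacher_cor`, `fit`, and `students_per_teacher` are assumptions for this example.

```r
# Enrollment and teacher counts for the six rows shown in the head() output above.
# On the full data, ca_school_data_csv would be used instead.
sample_rows <- data.frame(
  enrltot  = c(195, 240, 1550, 243, 1335, 137),
  teachers = c(10.90, 11.15, 82.90, 14.00, 71.50, 6.40)
)

# Pearson correlation: values near 1 indicate a strong positive linear relationship
enrl_teacher_cor <- cor(sample_rows$enrltot, sample_rows$teachers)

# Least-squares line; the enrltot coefficient is the estimated number of
# additional teachers per additional enrolled student
fit <- lm(teachers ~ enrltot, data = sample_rows)
coef(fit)

# Students per teacher, which reproduces the str column for these rows
students_per_teacher <- sample_rows$enrltot / sample_rows$teachers
range(students_per_teacher)
```

A narrow range of `students_per_teacher` is what produces the nearly straight line in the scatterplot: if every row had exactly the same ratio, the points would fall on a single line through the origin.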