Problem : Exploratory Data Analysis in R. Choose an interesting dataset and use R graphics to describe the data. You may use base R graphics, or a graphics package of your choice. You should include at least one example of each of the following: 1. histogram 2. boxplot 3. scatterplot
Solution :
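Install the required packages first. Only RCurl is used by the code below; if it is not already installed, run install.packages("RCurl") as well.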
install.packages("devtools", dependencies = TRUE, repos = "http://lib.stat.cmu.edu/R/CRAN/")
## Warning: dependency 'BiocInstaller' is not available
## package 'devtools' successfully unpacked and MD5 sums checked
##
## The downloaded binary packages are in
## C:\Users\chirag.vithalani\AppData\Local\Temp\RtmpGM3W9O\downloaded_packages
library(RCurl)
## Loading required package: bitops
I found “The California Test Score Data Set” interesting, so I download that data and print its first rows with head().
california_school_CSV <- getURL("https://raw.githubusercontent.com/vincentarelbundock/Rdatasets/master/csv/Ecdat/Caschool.csv")
california_school_data <- read.csv(text = california_school_CSV, header = TRUE, stringsAsFactors = FALSE)
head(california_school_data)
## X distcod county district grspan enrltot
## 1 1 75119 Alameda Sunol Glen Unified KK-08 195
## 2 2 61499 Butte Manzanita Elementary KK-08 240
## 3 3 61549 Butte Thermalito Union Elementary KK-08 1550
## 4 4 61457 Butte Golden Feather Union Elementary KK-08 243
## 5 5 61523 Butte Palermo Union Elementary KK-08 1335
## 6 6 62042 Fresno Burrel Union Elementary KK-08 137
## teachers calwpct mealpct computer testscr compstu expnstu str
## 1 10.90 0.5102 2.0408 67 690.80 0.3435898 6384.911 17.88991
## 2 11.15 15.4167 47.9167 101 661.20 0.4208333 5099.381 21.52466
## 3 82.90 55.0323 76.3226 169 643.60 0.1090323 5501.955 18.69723
## 4 14.00 36.4754 77.0492 85 647.70 0.3497942 7101.831 17.35714
## 5 71.50 33.1086 78.4270 171 640.85 0.1280899 5235.988 18.67133
## 6 6.40 12.3188 86.9565 25 605.55 0.1824818 5580.147 21.40625
## avginc elpct readscr mathscr
## 1 22.690001 0.000000 691.6 690.0
## 2 9.824000 4.583333 660.5 661.9
## 3 8.978000 30.000002 636.3 650.9
## 4 8.978000 0.000000 651.9 643.5
## 5 9.080333 13.857677 641.8 639.9
## 6 10.415000 12.408759 605.7 605.4
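A quick sanity check on the loaded columns can help before plotting. The sketch below assumes that str is the student-teacher ratio, i.e. enrltot divided by teachers; that reading is inferred from the printed values, not stated in the text above. To stay runnable without a network connection it rebuilds only the six rows printed by head() from inline CSV text. The names sample_csv and sample_rows are hypothetical.

```r
# Six rows copied from the head() output above (three columns only).
sample_csv <- "enrltot,teachers,str
195,10.90,17.88991
240,11.15,21.52466
1550,82.90,18.69723
243,14.00,17.35714
1335,71.50,18.67133
137,6.40,21.40625"

# Same parsing approach as in the report: read.csv() on a text string.
sample_rows <- read.csv(text = sample_csv, header = TRUE)

# Assumption: str = enrltot / teachers (student-teacher ratio).
ratio <- sample_rows$enrltot / sample_rows$teachers

# Compare with a small tolerance, since the printed values are rounded.
ratio_matches <- isTRUE(all.equal(ratio, sample_rows$str, tolerance = 1e-5))
print(ratio_matches)
```

On the full data the same comparison could be run with california_school_data in place of sample_rows.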
Draw the graphics: a histogram, a boxplot, and a scatterplot
hist(california_school_data$teachers, col = "lightblue", breaks = 100, main = "Number of Teachers in California Schools")
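When teacher counts span several orders of magnitude, a histogram on a log scale can be easier to read than one with many narrow bins. The following is a minimal sketch of that alternative, not part of the original analysis. It uses only the six teacher counts shown in the head() output so that it runs on its own; demo_teachers is a hypothetical name.

```r
# Six teacher counts copied from the head() output above.
demo_teachers <- c(10.90, 11.15, 82.90, 14.00, 71.50, 6.40)

# Compute the bins first (no drawing), so the counts can be inspected.
h <- hist(log10(demo_teachers), plot = FALSE)

# Draw the precomputed histogram object.
plot(h, col = "lightblue",
     xlab = "log10(number of teachers)",
     main = "Number of Teachers (log10 scale)")

# Every observation should fall into exactly one bin.
total_binned <- sum(h$counts)
```

With the full data set, log10(california_school_data$teachers) would take the place of log10(demo_teachers).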
four_counties_of_california_school <- california_school_data[
  california_school_data$county %in% c("Los Angeles", "San Diego", "Orange", "Riverside"),
]
boxplot(enrltot ~ county, data = four_counties_of_california_school, main = "Total Enrollment in the Four Largest California Counties (by Population)")
plot(california_school_data$enrltot, california_school_data$teachers, main = "Total Enrollment vs. Number of Teachers")
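The scatterplot can be backed up with a correlation and a fitted line. This is a minimal sketch under the assumption that a simple linear fit is a reasonable summary; the original analysis does not fit a model. It uses only the six rows shown in the head() output, and demo and fit are hypothetical names.

```r
# Enrollment and teacher counts copied from the head() output above.
demo <- data.frame(
  enrltot  = c(195, 240, 1550, 243, 1335, 137),
  teachers = c(10.90, 11.15, 82.90, 14.00, 71.50, 6.40)
)

# Least-squares line: teachers as a linear function of enrollment.
fit <- lm(teachers ~ enrltot, data = demo)

# Scatterplot with the fitted line drawn on top.
plot(demo$enrltot, demo$teachers,
     xlab = "Total enrollment", ylab = "Number of teachers",
     main = "Total Enrollment vs. Number of Teachers")
abline(fit, col = "red")

# Pearson correlation and fitted slope for the six demo rows.
r <- cor(demo$enrltot, demo$teachers)
slope <- unname(coef(fit)["enrltot"])
```

Six rows are far too few to draw conclusions from; the same calls on california_school_data would describe the full data set.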
Summary :
Most schools in the data set have a relatively small number of teachers; only a few have very large teaching staffs.
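A numeric summary is one way to check a statement like this: when a few large values pull the mean well above the median, most observations sit at the low end. The sketch below illustrates the check on the six teacher counts from the head() output only, so it says nothing definitive about the full data.

```r
# Six teacher counts copied from the head() output above.
teachers <- c(10.90, 11.15, 82.90, 14.00, 71.50, 6.40)

# Five-number summary plus the mean.
print(summary(teachers))

# Selected quantiles, including the upper tail.
print(quantile(teachers, probs = c(0.25, 0.5, 0.75, 0.9)))

# A mean above the median suggests a right-skewed distribution.
mean_above_median <- mean(teachers) > median(teachers)
```

On the full data, california_school_data$teachers would replace the demo vector.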