Problem : Exploratory Data Analysis in R. Choose an interesting dataset and use R graphics to describe the data. You may use base R graphics, or a graphics package of your choice. You should include at least one example of each of the following: 1. histogram 2. boxplot 3. scatterplot

Solution :

First install the required packages, then load RCurl, which is used below to download the data

install.packages("devtools", dependencies = TRUE, repos = "http://lib.stat.cmu.edu/R/CRAN/")
## Warning: dependency 'BiocInstaller' is not available
## package 'devtools' successfully unpacked and MD5 sums checked
## 
## The downloaded binary packages are in
##  C:\Users\chirag.vithalani\AppData\Local\Temp\RtmpGM3W9O\downloaded_packages
library(RCurl)
## Loading required package: bitops

I found “The California Test Score Data Set” interesting, so I download it and print the first few rows with head()

california_school_CSV <- getURL("https://raw.githubusercontent.com/vincentarelbundock/Rdatasets/master/csv/Ecdat/Caschool.csv")
california_school_data <- read.csv(text = california_school_CSV, header = TRUE, stringsAsFactors = FALSE)
head(california_school_data)
##   X distcod  county                        district grspan enrltot
## 1 1   75119 Alameda              Sunol Glen Unified  KK-08     195
## 2 2   61499   Butte            Manzanita Elementary  KK-08     240
## 3 3   61549   Butte     Thermalito Union Elementary  KK-08    1550
## 4 4   61457   Butte Golden Feather Union Elementary  KK-08     243
## 5 5   61523   Butte        Palermo Union Elementary  KK-08    1335
## 6 6   62042  Fresno         Burrel Union Elementary  KK-08     137
##   teachers calwpct mealpct computer testscr   compstu  expnstu      str
## 1    10.90  0.5102  2.0408       67  690.80 0.3435898 6384.911 17.88991
## 2    11.15 15.4167 47.9167      101  661.20 0.4208333 5099.381 21.52466
## 3    82.90 55.0323 76.3226      169  643.60 0.1090323 5501.955 18.69723
## 4    14.00 36.4754 77.0492       85  647.70 0.3497942 7101.831 17.35714
## 5    71.50 33.1086 78.4270      171  640.85 0.1280899 5235.988 18.67133
## 6     6.40 12.3188 86.9565       25  605.55 0.1824818 5580.147 21.40625
##      avginc     elpct readscr mathscr
## 1 22.690001  0.000000   691.6   690.0
## 2  9.824000  4.583333   660.5   661.9
## 3  8.978000 30.000002   636.3   650.9
## 4  8.978000  0.000000   651.9   643.5
## 5  9.080333 13.857677   641.8   639.9
## 6 10.415000 12.408759   605.7   605.4
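Before plotting, it helps to understand how the columns relate to each other. The sketch below retypes the six rows printed above into a small data frame (sample_rows, str_check and testscr_check are names introduced here for illustration) and checks two guesses about the data: that str is enrollment divided by the number of teachers, and that testscr is the average of the reading and math scores. These meanings are inferred from the printed values, not taken from the dataset documentation.

```r
# Six rows copied by hand from the head() output above
sample_rows <- data.frame(
  enrltot  = c(195, 240, 1550, 243, 1335, 137),
  teachers = c(10.90, 11.15, 82.90, 14.00, 71.50, 6.40),
  str      = c(17.88991, 21.52466, 18.69723, 17.35714, 18.67133, 21.40625),
  readscr  = c(691.6, 660.5, 636.3, 651.9, 641.8, 605.7),
  mathscr  = c(690.0, 661.9, 650.9, 643.5, 639.9, 605.4),
  testscr  = c(690.80, 661.20, 643.60, 647.70, 640.85, 605.55)
)

# Assumption 1: str is the student-teacher ratio
str_check <- sample_rows$enrltot / sample_rows$teachers

# Assumption 2: testscr is the mean of the reading and math scores
testscr_check <- (sample_rows$readscr + sample_rows$mathscr) / 2

# Compare with the printed columns, allowing for rounding in the output
str_matches     <- all(abs(str_check - sample_rows$str) < 1e-4)
testscr_matches <- all(abs(testscr_check - sample_rows$testscr) < 1e-6)
print(c(str = str_matches, testscr = testscr_matches))
```

Both checks hold for these six rows. The same comparison can be run on the full data by replacing sample_rows with california_school_data.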

Plot the graphics: a histogram, a boxplot, and a scatterplot

hist(california_school_data$teachers, col = "lightblue", breaks = 100, main = "Number of Teachers in California Schools")

# Keep only the four counties of interest
four_counties_of_california_school <- california_school_data[california_school_data$county %in% c("Los Angeles", "San Diego", "Orange", "Riverside"), ]
boxplot(enrltot ~ county, data = four_counties_of_california_school, main = "Total Enrollment in the Four Largest California Counties (by Population)")

plot(california_school_data$enrltot, california_school_data$teachers, main = "Total Enrollment vs. Number of Teachers")
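One way to make the relationship in a scatterplot like this easier to read is to overlay a fitted straight line. The following is a minimal sketch of that idea, not part of the original analysis: it uses only the six rows shown in the head() output (retyped as sample_rows, a name introduced here) so that it runs without a download, and it assumes a simple linear fit is a reasonable summary.

```r
# Enrollment and teacher counts copied from the six rows printed by head()
sample_rows <- data.frame(
  enrltot  = c(195, 240, 1550, 243, 1335, 137),
  teachers = c(10.90, 11.15, 82.90, 14.00, 71.50, 6.40)
)

# Least-squares fit of teachers on total enrollment
fit <- lm(teachers ~ enrltot, data = sample_rows)

# Scatterplot with labelled axes and the fitted line drawn on top
plot(sample_rows$enrltot, sample_rows$teachers,
     xlab = "Total enrollment", ylab = "Number of teachers",
     main = "Total Enrollment vs. Number of Teachers (six sample rows)")
abline(fit, col = "red")

# Estimated change in teachers per additional enrolled student
slope <- coef(fit)[["enrltot"]]
print(slope)
```

To apply this to the full data, replace sample_rows with california_school_data in both the lm() and plot() calls.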

Summary :

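Most districts (the dataset has one row per school district) have a relatively small number of teachers, so the histogram is concentrated at the low end with a long right tail.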