R Bridge Course Week 2 Assignment
One of the challenges in working with data is wrangling. In this assignment we will use R to perform this task.
Here is a list of data sets: http://vincentarelbundock.github.io/Rdatasets/ (click on the csv index for a list)
Please select one, download it and perform the following tasks:
1. Use the summary function to gain an overview of the data set. Then display the mean and median for at least two attributes.
doc<-read.csv("doctor.csv", header = TRUE)
> summary(doc)
      X             doctor          children          access
 Min.   :  1    Min.   : 0.00    Min.   :1.000    Min.   :0.0000
 1st Qu.:122    1st Qu.: 0.00    1st Qu.:1.000    1st Qu.:0.2500
 Median :243    Median : 1.00    Median :2.000    Median :0.3500
 Mean   :243    Mean   : 1.61    Mean   :2.264    Mean   :0.3812
 3rd Qu.:364    3rd Qu.: 2.00    3rd Qu.:3.000    3rd Qu.:0.5000
 Max.   :485    Max.   :48.00    Max.   :9.000    Max.   :0.9200
     health
 Min.   :-1.524000
 1st Qu.:-1.066000
 Median :-0.421000
 Mean   :-0.000041
 3rd Qu.: 0.657000
 Max.   : 7.217000
> mean(doc$health)
[1] -4.123711e-05
> median(doc$health)
[1] -0.421
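The task asks for the mean and median of at least two attributes. A minimal sketch of one way to get both at once is shown here. It assumes a toy data frame, doc_head, built from the six rows that head(doc) prints in step 6, so the numbers it produces describe only those six rows and not the full data set.

```r
# Toy data: the first six rows shown by head(doc) in step 6 (an assumption
# made so the example runs without doctor.csv).
doc_head <- data.frame(
  X        = 1:6,
  doctor   = c(0, 1, 0, 0, 11, 3),
  children = c(1, 3, 4, 2, 1, 1),
  access   = c(0.50, 0.17, 0.42, 0.33, 0.67, 0.25),
  health   = c(0.495, 0.520, -1.227, -1.524, 0.173, -0.905)
)

# Mean and median of two attributes in one call; sapply() returns a matrix
# with one column per attribute and one row per statistic.
attrs <- c("doctor", "health")
stats <- sapply(doc_head[attrs], function(x) c(mean = mean(x), median = median(x)))
print(stats)
```

On the real data the same call would be applied to doc instead of doc_head.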
2. Create a new data frame with a subset of the columns and rows. Make sure to rename it.
doc.sub <- subset(doc, doctor > 5 & children < 2)
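The call above filters rows only and keeps all five columns. If a subset of columns is also wanted, subset() takes a select argument. A minimal sketch follows, again assuming the six-row toy data frame doc_head taken from the head(doc) output in step 6. The name doc_small and the choice of three columns are illustrative only.

```r
# Toy data: the first six rows shown by head(doc) in step 6 (an assumption).
doc_head <- data.frame(
  X        = 1:6,
  doctor   = c(0, 1, 0, 0, 11, 3),
  children = c(1, 3, 4, 2, 1, 1),
  access   = c(0.50, 0.17, 0.42, 0.33, 0.67, 0.25),
  health   = c(0.495, 0.520, -1.227, -1.524, 0.173, -0.905)
)

# Keep rows with more than 5 doctor visits and fewer than 2 children,
# and keep only three of the five columns.
doc_small <- subset(doc_head,
                    doctor > 5 & children < 2,
                    select = c(doctor, children, health))
print(doc_small)
```

Note that the renaming in step 3 refers to columns by position, so it would need adjusting if columns were dropped this way.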
3. Create new column names for the new data frame.
names(doc.sub)[1]<-"Record"
names(doc.sub)[2]<-"DocVisits"
names(doc.sub)[3]<-"NumOfChildren"
names(doc.sub)[4]<-"AccessRate"
names(doc.sub)[5]<-"HealthRate"
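The five assignments above can also be written more compactly. The sketch below shows two common alternatives. It assumes a hypothetical one-row data frame, doc_small, standing in for doc.sub.

```r
# Hypothetical one-row stand-in for doc.sub, using row 5 from head(doc).
doc_small <- data.frame(X = 5, doctor = 11, children = 1,
                        access = 0.67, health = 0.173)

# Option 1: replace every column name in one assignment (order must match).
renamed_all <- doc_small
names(renamed_all) <- c("Record", "DocVisits", "NumOfChildren",
                        "AccessRate", "HealthRate")

# Option 2: rename by old name, which does not depend on column position.
renamed_one <- doc_small
names(renamed_one)[names(renamed_one) == "doctor"] <- "DocVisits"

print(names(renamed_all))
print(names(renamed_one))
```

Renaming by old name is a little longer but keeps working if the column order changes.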
4. Use the summary function to create an overview of your new data frame. Then print the mean and median for the same two attributes. Please compare.
summary(doc.sub)
     Record        DocVisits    NumOfChildren     AccessRate       HealthRate
 Min.   :  5.0   Min.   : 6   Min.   :1        Min.   :0.0000   Min.   :0.000
 1st Qu.:109.0   1st Qu.: 7   1st Qu.:1        1st Qu.:0.3300   1st Qu.:0.000
 Median :248.0   Median : 9   Median :1        Median :0.4200   Median :2.000
 Mean   :233.2   Mean   :12   Mean   :1        Mean   :0.4376   Mean   :1.882
 3rd Qu.:320.0   3rd Qu.:11   3rd Qu.:1        3rd Qu.:0.6700   3rd Qu.:2.000
 Max.   :483.0   Max.   :48   Max.   :1        Max.   :0.6700   Max.   :4.000
> mean(doc.sub$HealthRate)
[1] 1.882353
> median(doc.sub$HealthRate)
[1] 2
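Based on the data, households with fewer than 2 children and more than 5 doctor visits score higher on the health measure than the average household (mean 1.88 vs. about 0, median 2 vs. -0.421). Note that the HealthRate values summarized here appear to already reflect the recoding from step 5 (they take only the values 0, 2 and 4), so the comparison with the raw health column is only approximate.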
5. For at least 3 values in a column please rename so that every value in that column is renamed. For example, suppose I have 20 values of the letter “e” in one column. Rename those values so that all 20 would show as “excellent”.
doc.sub$HealthRate[doc.sub$HealthRate < 0]<-0
doc.sub$HealthRate[doc.sub$HealthRate > 0 & doc.sub$HealthRate < 2]<-2
doc.sub$HealthRate[doc.sub$HealthRate > 2]<-4
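The three assignments above depend on the order in which they run, because each one sees the values written by the one before it. A minimal sketch of a single-step alternative with cut() is shown here. It bins the same ranges (up to 0, above 0 up to 2, above 2). The vector health holds the six values from head(doc) plus one made-up value, 2.5, added only to exercise the top bin. The labels low, medium and high are hypothetical.

```r
# Six health values from head(doc) plus one made-up value (2.5) so that
# every bin is represented.
health <- c(0.495, 0.520, -1.227, -1.524, 0.173, -0.905, 2.5)

# Bin all values in one pass. Intervals are (-Inf, 0], (0, 2], (2, Inf).
# The labels are hypothetical names chosen for illustration.
bins <- cut(health,
            breaks = c(-Inf, 0, 2, Inf),
            labels = c("low", "medium", "high"))
print(table(bins))

# If numeric codes like those in step 5 are preferred, map bins to 0, 2, 4.
codes <- c(0, 2, 4)[as.integer(bins)]
print(codes)
```

Because cut() works from the original values in one call, the result does not change if the bins are listed in a different order.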
6. Display enough rows to see examples of all of steps 1-5 above.
head(doc)
  X doctor children access health
1 1      0        1   0.50  0.495
2 2      1        3   0.17  0.520
3 3      0        4   0.42 -1.227
4 4      0        2   0.33 -1.524
5 5     11        1   0.67  0.173
6 6      3        1   0.25 -0.905
head(doc.sub)
    Record DocVisits NumOfChildren AccessRate HealthRate
5        5        11             1       0.67          2
8        8         6             1       0.67          2
13      13        15             1       0.67          0
36      36         7             1       0.58          2
109    109         6             1       0.67          4
150    150         9             1       0.08          0
7. BONUS – place the original .csv in a GitHub file and have R read from the link. This will be a very useful skill as you progress in your data science education and career.
require(RCurl)
url.data <- read.csv(text = getURL("https://raw.githubusercontent.com/apag101/MSDSBridge/master/temparature.csv"), header = TRUE)
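As an alternative to RCurl, read.csv() can be given a URL string directly in place of a file path. The sketch below demonstrates this with a file:// URL pointing at a temporary file so that it runs offline. It assumes a Unix-like system where tempfile() returns an absolute path starting with "/". In recent versions of R an https:// raw GitHub link can usually be passed the same way, though that is not tested here. The names tmp, csv_url and from_url are illustrative.

```r
# Write a tiny CSV to a temporary file so the example runs without a network.
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(doctor = c(0, 11), children = c(1, 1)),
          tmp, row.names = FALSE)

# Build a file:// URL (assumes an absolute Unix-style path) and read from it.
# read.csv() treats the string as a URL rather than a local file name.
csv_url <- paste0("file://", tmp)
from_url <- read.csv(csv_url, header = TRUE)
print(from_url)
```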
Please submit your .rmd file and the .csv file as well as a link to your RPubs.