Week 2 Assignment on RPubs
Rmd on Github
require(plyr)
## Loading required package: plyr
# Load data
# US Births in 1969 - 1988
z_file <- gzcon(url("https://raw.githubusercontent.com/logicalschema/r_winter_projects/master/Birthdays.csv.gz"))
raw_csv <- textConnection(readLines(z_file))
close(z_file)
birthday_data <- read.table(raw_csv, header = TRUE, sep = ",")
head(birthday_data, 5)
1. Use the summary function to gain an overview of the data set. Then display the mean and median for at least two attributes.
summary(birthday_data)
##        X               state            year            month
##  Min.   :     1   NY     :  7333   Min.   :1969   Min.   : 1.000
##  1st Qu.: 93217   PA     :  7330   1st Qu.:1973   1st Qu.: 4.000
##  Median :186433   TX     :  7330   Median :1978   Median : 7.000
##  Mean   :186433   CA     :  7325   Mean   :1978   Mean   : 6.522
##  3rd Qu.:279648   MI     :  7323   3rd Qu.:1983   3rd Qu.:10.000
##  Max.   :372864   NJ     :  7321   Max.   :1988   Max.   :12.000
##                   (Other):328902
##       day               date            wday          births
##  Min.   : 1.00   1970-03-01:    70   Fri  :53275   Min.   :   1
##  1st Qu.: 8.00   1970-10-01:    67   Mon  :53242   1st Qu.:  52
##  Median :16.00   1969-03-01:    66   Sat  :53280   Median : 129
##  Mean   :15.74   1969-07-01:    66   Sun  :53240   Mean   : 189
##  3rd Qu.:23.00   1969-10-01:    66   Thurs:53291   3rd Qu.: 223
##  Max.   :31.00   1969-12-01:    65   Tues :53238   Max.   :1779
##                  (Other)   :372464   Wed  :53298
mean(birthday_data$year)
## [1] 1978.495
median(birthday_data$year)
## [1] 1978
mean(birthday_data$births)
## [1] 189.0409
median(birthday_data$births)
## [1] 129
plot(birthday_data[,c("births", "year")])
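The mean and median calls above can also be run for several columns in one pass. The following is a minimal sketch on a made-up data frame (`toy` and its values are assumptions for illustration, not rows from the birthdays file).

```r
# Toy data standing in for birthday_data (values are made up for illustration)
toy <- data.frame(
  state  = c("NY", "NY", "AK", "AK", "TX"),
  year   = c(1969, 1970, 1969, 1970, 1969),
  births = c(700, 720, 20, 25, 500)
)

# Keep only the numeric columns, then compute both statistics for each one
numeric_cols <- toy[sapply(toy, is.numeric)]
stats <- sapply(numeric_cols, function(x) c(mean = mean(x), median = median(x)))

# Result is a matrix: one row per statistic, one column per attribute
print(stats)
```

The same pattern should work on `birthday_data` by replacing `toy`, assuming the numeric columns contain no missing values.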
2. Create a new data frame with a subset of the columns and rows. Make sure to rename it.
subset_birthdays <- subset(birthday_data, state == "NY" & year >= 1981 & year <= 1988, select = c(state, year, month, births))
tail(subset_birthdays, 5)
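For comparison, the same kind of row and column selection can be written with bracket indexing instead of `subset()`. This is a sketch on a small made-up data frame (`toy`, `keep_rows`, and `toy_subset` are hypothetical names). One difference to keep in mind: `subset()` drops rows where the condition is `NA`, while bracket indexing with a logical `NA` returns a row of `NA` values.

```r
# Toy data with the same column names as the birthdays file (values are made up)
toy <- data.frame(
  state  = c("NY", "NY", "NY", "TX", "AK"),
  year   = c(1980, 1981, 1988, 1985, 1987),
  month  = c(1, 2, 3, 4, 5),
  births = c(690, 700, 720, 500, 20),
  wday   = c("Mon", "Tues", "Wed", "Thurs", "Fri")
)

# Logical vector marking the rows to keep: New York, 1981 through 1988
keep_rows <- toy$state == "NY" & toy$year >= 1981 & toy$year <= 1988

# Row filter first, column selection second
toy_subset <- toy[keep_rows, c("state", "year", "month", "births")]
print(toy_subset)
```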
3. Create new column names for the new data frame.
subset_birthdays <- rename(subset_birthdays, c("state" = "STATE", "year" = "Y", "month" = "M", "births" = "BIRTHS"))
head(subset_birthdays, 5)
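If plyr is not available, the same renaming can be done in base R with a named lookup vector. This is a sketch on a one-row made-up data frame; `new_names` and `hits` are hypothetical helper names.

```r
# One made-up row with the original column names
toy <- data.frame(state = "NY", year = 1981, month = 1, births = 700)

# Named lookup: old name -> new name (same mapping as the plyr call above)
new_names <- c(state = "STATE", year = "Y", month = "M", births = "BIRTHS")

# Replace only the names found in the lookup; any other column keeps its name
hits <- names(toy) %in% names(new_names)
names(toy)[hits] <- unname(new_names[names(toy)[hits]])
print(names(toy))
```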
4. Use the summary function to create an overview of your new data frame. Then print the mean and median for the same two attributes. Please compare.
summary(subset_birthdays)
##      STATE            Y                M             BIRTHS
##  NY     :2922   Min.   :1981   Min.   : 1.000   Min.   :509.0
##  AK     :   0   1st Qu.:1983   1st Qu.: 4.000   1st Qu.:650.0
##  AL     :   0   Median :1984   Median : 7.000   Median :711.0
##  AR     :   0   Mean   :1985   Mean   : 6.523   Mean   :709.1
##  AZ     :   0   3rd Qu.:1987   3rd Qu.:10.000   3rd Qu.:768.0
##  CA     :   0   Max.   :1988   Max.   :12.000   Max.   :933.0
##  (Other):   0
mean(subset_birthdays$Y)
## [1] 1984.501
median(subset_birthdays$Y)
## [1] 1984.5
mean(subset_birthdays$BIRTHS)
## [1] 709.1499
median(subset_birthdays$BIRTHS)
## [1] 711
plot(subset_birthdays[,c("BIRTHS", "Y")])
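The births attribute differs sharply between the two data sets. In the original data the mean is 189 and the median is 129; the mean sits well above the median, which points to a right-skewed distribution across all states. In the subset (New York, 1981 to 1988) the mean is 709 and the median is 711, so the two are nearly equal and several times larger than in the original. The year attribute shifts as the filter would suggest: mean 1978.495 and median 1978 in the original, versus mean 1984.501 and median 1984.5 in the subset.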
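A comparison like this can be laid out side by side in a small table. The sketch below uses made-up vectors (`all_births`, `ny_births`) and a hypothetical helper `center()`; the numbers are for illustration only and are not taken from the birthdays file.

```r
# Made-up values: many small counts plus a few large ones (right-skewed)
all_births <- c(10, 20, 30, 40, 50, 60, 700, 720)
# Made-up values for a single large state only
ny_births <- c(690, 700, 710, 720)

# Hypothetical helper returning both measures of center
center <- function(x) c(mean = mean(x), median = median(x))

# One row per data set, one column per statistic
comparison <- rbind(original = center(all_births), subset = center(ny_births))
print(comparison)
```

In the toy data the mean of the mixed vector is pulled far above its median by the two large values, while the single-state vector has matching mean and median.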
5. For at least 3 values in a column please rename so that every value in that column is renamed. For example, suppose I have 20 values of the letter “e” in one column. Rename those values so that all 20 would show as “excellent”.
The following code replaces NY with New York and prints out the first hundred rows.
subset_birthdays$STATE <- gsub("NY", "New York", subset_birthdays$STATE)
head(subset_birthdays, 100)
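When more than one value needs a new label, a named lookup vector avoids one `gsub` call per value. This is a sketch on a made-up character vector (`state_codes` and `full_names` are hypothetical names). If the column is a factor, it would need `as.character()` first.

```r
# Made-up column of state abbreviations
state_codes <- c("NY", "TX", "CA", "NY", "AK")

# Lookup table: abbreviation -> full name
full_names <- c(NY = "New York", TX = "Texas", CA = "California")

# Translate codes found in the lookup; leave any other code unchanged
in_lookup <- state_codes %in% names(full_names)
recoded <- state_codes
recoded[in_lookup] <- unname(full_names[state_codes[in_lookup]])
print(recoded)
```

Matching whole values this way also avoids the partial matches that an unanchored `gsub` pattern could produce in longer strings.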
6. Display enough rows to see examples of all of steps 1-5 above.
7. BONUS – place the original .csv in a github file and have R read from the link. This will be a very useful skill as you progress in your data science education and career.
The following code block implements the task. I compressed the CSV file with gzip to make the download faster.
z_file <- gzcon(url("https://raw.githubusercontent.com/logicalschema/r_winter_projects/master/Birthdays.csv.gz"))
raw_csv <- textConnection(readLines(z_file))
close(z_file)
birthday_data <- read.table(raw_csv, header = TRUE, sep = ",")
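An alternative, sketched here without network access, is to work from a local gzip file: `read.csv()` opens a path with `file()`, which detects gzip compression when reading. The data frame `toy` and the path `gz_path` are made up for illustration, and `remote_url` in the trailing comment is a placeholder.

```r
# Write a small gzip-compressed CSV to a temporary file (made-up rows)
toy <- data.frame(
  state  = c("NY", "TX"),
  year   = c(1969L, 1970L),
  births = c(700L, 500L)
)
gz_path <- tempfile(fileext = ".csv.gz")
gz_con <- gzfile(gz_path, "w")
write.csv(toy, gz_con, row.names = FALSE)
close(gz_con)

# Read the compressed file back; no explicit decompression step is needed
round_trip <- read.csv(gz_path)
print(round_trip)

# For a remote file, one option (not run here) is to download it in binary
# mode first and then read the local copy:
# download.file(remote_url, gz_path, mode = "wb")  # remote_url is a placeholder
```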