Week 2 Assignment on RPubs
Rmd on GitHub

require(plyr)
## Loading required package: plyr
# Load data
# US Births in 1969 - 1988 

z_file <- gzcon(url("https://raw.githubusercontent.com/logicalschema/r_winter_projects/master/Birthdays.csv.gz"))
raw_csv <- textConnection(readLines(z_file))
close(z_file)

birthday_data <- read.table(raw_csv, header = TRUE, sep = ",")
head(birthday_data, 5)


1. Use the summary function to gain an overview of the data set. Then display the mean and median for at least two attributes.

summary(birthday_data)
##        X              state             year          month       
##  Min.   :     1   NY     :  7333   Min.   :1969   Min.   : 1.000  
##  1st Qu.: 93217   PA     :  7330   1st Qu.:1973   1st Qu.: 4.000  
##  Median :186433   TX     :  7330   Median :1978   Median : 7.000  
##  Mean   :186433   CA     :  7325   Mean   :1978   Mean   : 6.522  
##  3rd Qu.:279648   MI     :  7323   3rd Qu.:1983   3rd Qu.:10.000  
##  Max.   :372864   NJ     :  7321   Max.   :1988   Max.   :12.000  
##                   (Other):328902                                  
##       day                date           wday           births    
##  Min.   : 1.00   1970-03-01:    70   Fri  :53275   Min.   :   1  
##  1st Qu.: 8.00   1970-10-01:    67   Mon  :53242   1st Qu.:  52  
##  Median :16.00   1969-03-01:    66   Sat  :53280   Median : 129  
##  Mean   :15.74   1969-07-01:    66   Sun  :53240   Mean   : 189  
##  3rd Qu.:23.00   1969-10-01:    66   Thurs:53291   3rd Qu.: 223  
##  Max.   :31.00   1969-12-01:    65   Tues :53238   Max.   :1779  
##                  (Other)   :372464   Wed  :53298
The mean and median for the year attribute:
mean(birthday_data$year)
## [1] 1978.495
median(birthday_data$year)
## [1] 1978
The mean and median for the births attribute:
mean(birthday_data$births)
## [1] 189.0409
median(birthday_data$births)
## [1] 129
plot(birthday_data[,c("births", "year")])
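
The per-column calls above can also be collapsed into one step. The following is a minimal sketch, assuming a small made-up data frame (`toy` is hypothetical, not the birthday data), that uses `sapply` to compute the mean and median of several numeric columns at once:

```r
# Hypothetical toy data standing in for birthday_data
toy <- data.frame(
  year   = c(1969, 1970, 1988),
  births = c(52, 129, 386)
)

# Apply the same pair of statistics to each selected column;
# the result is a matrix with one row per statistic and one column per attribute
stats <- sapply(toy[, c("year", "births")],
                function(x) c(mean = mean(x), median = median(x)))
stats
```

The same call should work on `birthday_data[, c("year", "births")]`, since both columns are numeric.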


2. Create a new data frame with a subset of the columns and rows. Make sure to rename it.

subset_birthdays <- subset(birthday_data, state == "NY" & year >= 1981 & year <= 1988, select = c(state, year, month, births))

tail(subset_birthdays, 5)
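
For comparison, the same row and column selection can be written with bracket indexing instead of `subset()`. This is a sketch on made-up data (`toy` and `keep` are hypothetical names) rather than the approach used in the assignment:

```r
# Hypothetical toy data with one extra column (day) that gets dropped
toy <- data.frame(
  state  = c("NY", "NY", "CA", "NY"),
  year   = c(1980, 1981, 1985, 1988),
  month  = c(1, 2, 3, 4),
  day    = 1:4,
  births = c(600, 650, 900, 711)
)

# Logical vector marking the rows to keep: New York, 1981 through 1988
keep <- toy$state == "NY" & toy$year >= 1981 & toy$year <= 1988

# Rows are chosen by the logical vector, columns by name
toy_subset <- toy[keep, c("state", "year", "month", "births")]
toy_subset
```

Bracket indexing is more verbose, but it avoids the non-standard evaluation that `subset()` relies on, which can matter when the code moves inside a function.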


3. Create new column names for the new data frame.

subset_birthdays <- rename(subset_birthdays, c("state" = "STATE", "year" = "Y", "month" = "M", "births" = "BIRTHS"))

head(subset_birthdays, 5)
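
The `rename` call above comes from plyr. If that package is not available, a base R sketch along the following lines gives the same column names; `toy` and `new_names` are hypothetical names used only for illustration:

```r
# Hypothetical one-row data frame with the original column names
toy <- data.frame(state = "NY", year = 1981, month = 1, births = 650)

# Old name -> new name, mirroring the mapping passed to rename()
new_names <- c(state = "STATE", year = "Y", month = "M", births = "BIRTHS")

# Look up each existing column name in the mapping and assign the result
names(toy) <- unname(new_names[names(toy)])
names(toy)
```

This version assumes every column appears in the mapping; a column missing from `new_names` would end up with an `NA` name.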


4. Use the summary function to create an overview of your new data frame. Then print the mean and median for the same two attributes. Please compare.

summary(subset_birthdays)
##      STATE            Y              M              BIRTHS     
##  NY     :2922   Min.   :1981   Min.   : 1.000   Min.   :509.0  
##  AK     :   0   1st Qu.:1983   1st Qu.: 4.000   1st Qu.:650.0  
##  AL     :   0   Median :1984   Median : 7.000   Median :711.0  
##  AR     :   0   Mean   :1985   Mean   : 6.523   Mean   :709.1  
##  AZ     :   0   3rd Qu.:1987   3rd Qu.:10.000   3rd Qu.:768.0  
##  CA     :   0   Max.   :1988   Max.   :12.000   Max.   :933.0  
##  (Other):   0
mean(subset_birthdays$Y)
## [1] 1984.501
median(subset_birthdays$Y)
## [1] 1984.5
mean(subset_birthdays$BIRTHS)
## [1] 709.1499
median(subset_birthdays$BIRTHS)
## [1] 711
plot(subset_birthdays[,c("BIRTHS", "Y")])

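The births attribute differs sharply between the two data sets. The full data set has a mean of 189.04 and a median of 129. The New York subset for 1981 through 1988 has a mean of 709.15 and a median of 711, so its mean and median nearly coincide, whereas the full data set's mean sits well above its median. The year attribute shifts as expected with the narrower date range: its mean moves from 1978.495 to 1984.501 and its median from 1978 to 1984.5.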
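
The summary of the subset lists AK, AL, and other states with a count of 0, which suggests that state was read as a factor and that `subset()` kept its unused levels. A minimal sketch of removing them with `droplevels`, using a made-up data frame (`toy` and `ny` are hypothetical):

```r
# Hypothetical toy data with state stored as a factor
toy <- data.frame(
  state  = factor(c("NY", "AK", "NY", "CA")),
  births = c(650, 20, 711, 900)
)

ny <- subset(toy, state == "NY")
levels_before <- levels(ny$state)  # unused levels AK and CA are still present

# droplevels() removes factor levels that no longer occur in the data
ny <- droplevels(ny)
levels_after <- levels(ny$state)   # only NY remains
```

Applied to `subset_birthdays` before the rename step, this would leave `summary()` showing only the NY row for the state column.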


5. For at least 3 values in a column please rename so that every value in that column is renamed. For example, suppose I have 20 values of the letter “e” in one column. Rename those values so that all 20 would show as “excellent”.

The following code replaces every NY value in the STATE column with New York and prints the first 100 rows.

subset_birthdays$STATE <- gsub("NY", "New York", subset_birthdays$STATE)

head(subset_birthdays, 100)
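
Because the subset contains only New York, a single `gsub` call is enough here. When a column holds several distinct codes, a named lookup vector is one way to rename them all in one pass. The sketch below uses made-up values (`toy_states` and `lookup` are hypothetical):

```r
# Hypothetical column of state codes, stored as a factor
toy_states <- factor(c("NY", "CA", "TX", "NY"))

# Code -> full name
lookup <- c(NY = "New York", CA = "California", TX = "Texas")

# Convert to character first: indexing by a factor would use its integer codes
renamed <- unname(lookup[as.character(toy_states)])
renamed
```

Unlike `gsub`, which matches the pattern anywhere inside a string, the lookup only replaces exact matches, and any code missing from `lookup` becomes `NA`.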


6. Display enough rows to see examples of all of steps 1-5 above.


7. BONUS – place the original .csv in a github file and have R read from the link. This will be a very useful skill as you progress in your data science education and career.

The following code block implements the task. I compressed the CSV file with gzip to make the download quicker.

z_file <- gzcon(url("https://raw.githubusercontent.com/logicalschema/r_winter_projects/master/Birthdays.csv.gz"))
raw_csv <- textConnection(readLines(z_file))
close(z_file)

birthday_data <- read.table(raw_csv, header = TRUE, sep = ",")
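
As an alternative to decompressing through `gzcon` and `textConnection`, `read.csv` can read a gzip-compressed file directly from a local path. The sketch below demonstrates this on a small temporary file so that it runs offline; the file contents and the names `tmp` and `demo_data` are made up for illustration. For the remote file, one option would be to fetch it first with `download.file(..., mode = "wb")` and then read the saved copy.

```r
# Hypothetical stand-in for a downloaded file: a tiny gzip-compressed CSV
tmp <- tempfile(fileext = ".csv.gz")
con <- gzfile(tmp, "w")
writeLines(c("state,year,births", "NY,1981,650", "NY,1982,711"), con)
close(con)

# read.csv() opens the path with file(), which detects gzip compression
demo_data <- read.csv(tmp)
unlink(tmp)  # remove the temporary file

demo_data
```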