Week 2 - R Assignment

1. Use the summary function to gain an overview of the data set. Then display the mean and median for at least two attributes.

require(RCurl)

## Loading required package: RCurl

## Loading required package: bitops

url1 <- getURL("https://raw.githubusercontent.com/emrahakin85/csvTest/master/Titanic.csv")
Titanic <- read.csv(text = url1)

DataStats <- c(Ave_Age = mean(Titanic$Age, na.rm = TRUE),
               Median_Age = median(Titanic$Age, na.rm = TRUE),
               Survival_Rate = mean(Titanic$Survived))
DataStats

##       Ave_Age    Median_Age Survival_Rate 
##    30.3979894    28.0000000     0.3427266

aggregate(Survived~PClass, Titanic, mean)

##   PClass  Survived
## 1      * 0.0000000
## 2    1st 0.5993789
## 3    2nd 0.4265233
## 4    3rd 0.1940928
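
The task also asks for an overview via summary(), and that call does not appear in the transcript above. A minimal sketch of the whole step follows, run on a small made-up data frame (`toy` and its values are hypothetical, not the Titanic data). It also drops the `*` class before aggregating, on the assumption that `*` marks an unrecorded class rather than a real one.

```r
# Toy stand-in for the Titanic data frame; all values are invented for illustration
toy <- data.frame(
  PClass   = c("1st", "1st", "2nd", "3rd", "3rd", "*"),
  Age      = c(29, NA, 40, 22, NA, NA),
  Survived = c(1, 0, 1, 0, 0, 0)
)

# Overview of every column; numeric columns get quartiles, mean and an NA count
summary(toy)

# Mean and median for two attributes; na.rm = TRUE skips the missing ages
age_mean      <- mean(toy$Age, na.rm = TRUE)
age_median    <- median(toy$Age, na.rm = TRUE)
survival_rate <- mean(toy$Survived)

# Assumption: "*" is a placeholder class, so exclude it before grouping
toy_clean <- subset(toy, PClass != "*")
by_class  <- aggregate(Survived ~ PClass, toy_clean, mean)
by_class
```

Whether the `*` row should be dropped or kept depends on what it means in the source file, so it is worth checking the raw record before filtering it out.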

2. Create a new data frame with a subset of the columns and rows. Make sure to rename it.

subTitanic <- subset(Titanic, PClass == "1st", select = c(Age, Survived))
head(subTitanic, 8)

##     Age Survived
## 1 29.00        1
## 2  2.00        0
## 3 30.00        0
## 4 25.00        0
## 5  0.92        1
## 6 47.00        1
## 7 63.00        1
## 8 39.00        0

3. Create new column names for the new data frame.

colnames(subTitanic) <- c("pAge", "Survived_or_not")
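
As a hedged alternative for steps 2 and 3, the sketch below builds the same kind of subset with bracket indexing and renames columns by matching on the old name instead of by position. The data frame `toy` is hypothetical and only mirrors the column names used above.

```r
# Toy stand-in with the same column names as the Titanic data (values invented)
toy <- data.frame(
  PClass   = c("1st", "2nd", "1st", "3rd"),
  Age      = c(29, 40, NA, 22),
  Survived = c(1, 1, 0, 0)
)

# Two ways to keep first-class rows and only the Age and Survived columns
sub_a <- subset(toy, PClass == "1st", select = c(Age, Survived))
sub_b <- toy[toy$PClass == "1st", c("Age", "Survived")]

# Rename by matching the old name, so the result does not depend on column order
names(sub_a)[names(sub_a) == "Age"]      <- "pAge"
names(sub_a)[names(sub_a) == "Survived"] <- "Survived_or_not"
names(sub_a)
```

One difference to keep in mind: subset() silently drops rows where the condition is NA, while bracket indexing with a logical vector returns an all-NA row for them. The toy data has no missing PClass values, so both forms select the same rows here.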

4. Use the summary function to create an overview of your new data frame. Then print the mean and median for the same two attributes. Please compare.

summary(subTitanic)

##       pAge       Survived_or_not 
##  Min.   : 0.92   Min.   :0.0000  
##  1st Qu.:28.00   1st Qu.:0.0000  
##  Median :39.50   Median :1.0000  
##  Mean   :39.67   Mean   :0.5994  
##  3rd Qu.:50.00   3rd Qu.:1.0000  
##  Max.   :71.00   Max.   :1.0000  
##  NA's   :96

subDataStats <- c(Average_Age = mean(subTitanic$pAge, na.rm = TRUE),
                  Median_Age = median(subTitanic$pAge, na.rm = TRUE),
                  Survival_Rate = mean(subTitanic$Survived_or_not))

rbind(subDataStats, DataStats)

##              Average_Age Median_Age Survival_Rate
## subDataStats    39.66779       39.5     0.5993789
## DataStats       30.39799       28.0     0.3427266

The sub-dataset includes only first-class passengers. Both the survival rate (0.60 vs. 0.34) and the average age (39.7 vs. 30.4 years; median 39.5 vs. 28) are noticeably higher in the sub-dataset than in the full data set.
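
To put a number on the comparison, the gap between the two rows can be computed directly. The sketch below re-enters the values printed in the table above (rounded as shown there) rather than recomputing them from the data, so it runs on its own.

```r
# Values copied from the printed rbind() output above
subDataStats <- c(Average_Age = 39.66779, Median_Age = 39.5, Survival_Rate = 0.5993789)
DataStats    <- c(Ave_Age = 30.39799, Median_Age = 28.0, Survival_Rate = 0.3427266)

# rbind() takes its column names from the first argument
stats <- rbind(subDataStats, DataStats)

# First class minus all passengers, column by column
diffs <- stats["subDataStats", ] - stats["DataStats", ]
round(diffs, 2)
```

Note that first-class passengers are also part of the full data set, so these gaps understate the difference between first class and everyone else.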

5. For at least 3 values in a column please rename so that every value in that column is renamed. For example, suppose I have 20 values of the letter “e” in one column. Rename those values so that all 20 would show as “excellent”.

# Column 3 is PClass; relabel every "1st" as "First_Class"
Titanic$PClass <- gsub("1st", "First_Class", Titanic$PClass)

head(subset(Titanic, select = -2), 8)

##   X      PClass   Age    Sex Survived SexCode
## 1 1 First_Class 29.00 female        1       1
## 2 2 First_Class  2.00 female        0       1
## 3 3 First_Class 30.00   male        0       0
## 4 4 First_Class 25.00 female        0       1
## 5 5 First_Class  0.92   male        1       0
## 6 6 First_Class 47.00   male        1       0
## 7 7 First_Class 63.00 female        1       1
## 8 8 First_Class 39.00   male        0       0

# name column excluded
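
The step above relabels only `1st`. If several values in the column need new names at once, one possible approach is a named lookup vector. The sketch below uses a made-up `pclass` vector, and the labels `Second_Class` and `Third_Class` are assumptions chosen to match the `First_Class` label used above.

```r
# Toy class column (values invented); "*" stands for an unrecognized entry
pclass <- c("1st", "3rd", "2nd", "1st", "*", "3rd")

# Lookup table: names are the old values, elements are the new labels
labels <- c("1st" = "First_Class",
            "2nd" = "Second_Class",
            "3rd" = "Third_Class")

# Replace values found in the table; leave anything else unchanged
renamed <- unname(ifelse(pclass %in% names(labels), labels[pclass], pclass))
renamed
```

Unlike gsub(), which replaces a pattern wherever it occurs inside a string, the lookup matches whole values only, so an entry that merely contains "1st" would be left alone.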