1. Use the summary function to gain an overview of the data set. Then display the mean and median for at least two attributes.
require(RCurl)
## Loading required package: RCurl
## Loading required package: bitops
url1 <- getURL("https://raw.githubusercontent.com/emrahakin85/csvTest/master/Titanic.csv")
Titanic <- read.csv(text = url1)
DataStats <- c(Ave_Age       = mean(Titanic$Age, na.rm = TRUE),
               Median_Age    = median(Titanic$Age, na.rm = TRUE),
               Survival_Rate = mean(Titanic$Survived))
DataStats
##       Ave_Age    Median_Age Survival_Rate
##    30.3979894    28.0000000     0.3427266
aggregate(Survived~PClass, Titanic, mean)
##   PClass  Survived
## 1      * 0.0000000
## 2    1st 0.5993789
## 3    2nd 0.4265233
## 4    3rd 0.1940928
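The task asks for `summary()`, which the code above does not call, and the `*` row in the aggregate output looks like a placeholder for an unrecorded class (an assumption; the source does not say). A minimal sketch of both steps, using a small made-up data frame `toy` in place of the real data:

```r
# Hypothetical stand-in for the Titanic data; values are made up for illustration.
toy <- data.frame(
  PClass   = c("1st", "1st", "2nd", "3rd", "3rd", "*"),
  Age      = c(29, 2, 30, NA, 40, NA),
  Survived = c(1, 0, 1, 0, 0, 0)
)

# Overview of every column (quartiles for numeric columns, NA counts where present).
summary(toy)

# Drop the placeholder class, then compute the survival rate per class.
known <- subset(toy, PClass != "*")
rates <- aggregate(Survived ~ PClass, data = known, FUN = mean)
rates
```

Whether the `*` row should be dropped or kept as its own group depends on what it represents in the real file, so it is worth inspecting that row before filtering.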
2. Create a new data frame with a subset of the columns and rows. Make sure to rename it.
subTitanic <- subset(Titanic, PClass == "1st", select = c(Age, Survived))
head(subTitanic, 8)
##     Age Survived
## 1 29.00        1
## 2  2.00        0
## 3 30.00        0
## 4 25.00        0
## 5  0.92        1
## 6 47.00        1
## 7 63.00        1
## 8 39.00        0
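The same row and column selection can also be written with bracket indexing. The sketch below uses a made-up data frame `toy` and shows one difference worth knowing: `subset()` drops rows where the condition is `NA`, while a bare logical index does not, so `which()` is used to match its behaviour.

```r
# Hypothetical stand-in for the Titanic data; values are made up for illustration.
toy <- data.frame(
  PClass   = c("1st", "2nd", "1st", NA, "3rd"),
  Age      = c(29, 30, NA, 25, 40),
  Survived = c(1, 0, 1, 0, 0)
)

# subset() silently drops rows where the condition evaluates to NA.
via_subset <- subset(toy, PClass == "1st", select = c(Age, Survived))

# With bracket indexing, which() keeps only the rows where the condition is TRUE;
# a bare logical index would add an all-NA row for the NA condition.
via_index <- toy[which(toy$PClass == "1st"), c("Age", "Survived")]

all.equal(via_subset, via_index)
```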
3. Create new column names for the new data frame.
colnames(subTitanic) <- c("pAge", "Survived_or_not")
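Assigning a full vector to `colnames()` relies on column order. As an alternative sketch, columns can be renamed by their current names, so the result does not change if the columns are reordered; `toy` and `new_names` are illustrative names, not part of the original code.

```r
# Hypothetical two-column frame mirroring subTitanic before renaming.
toy <- data.frame(Age = c(29, 2, 30), Survived = c(1, 0, 0))

# Map old names to new names; only the listed columns are touched.
new_names <- c(Age = "pAge", Survived = "Survived_or_not")
hits <- names(toy) %in% names(new_names)
names(toy)[hits] <- unname(new_names[names(toy)[hits]])

names(toy)
```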
4. Use the summary function to create an overview of your new data frame. Then print the mean and median for the same two attributes. Please compare.
summary(subTitanic)
##       pAge       Survived_or_not
##  Min.   : 0.92   Min.   :0.0000
##  1st Qu.:28.00   1st Qu.:0.0000
##  Median :39.50   Median :1.0000
##  Mean   :39.67   Mean   :0.5994
##  3rd Qu.:50.00   3rd Qu.:1.0000
##  Max.   :71.00   Max.   :1.0000
##  NA's   :96
subDataStats <- c(Average_Age   = mean(subTitanic$pAge, na.rm = TRUE),
                  Median_Age    = median(subTitanic$pAge, na.rm = TRUE),
                  Survival_Rate = mean(subTitanic$Survived_or_not))
rbind(subDataStats, DataStats)
##              Average_Age Median_Age Survival_Rate
## subDataStats    39.66779       39.5     0.5993789
## DataStats       30.39799       28.0     0.3427266
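The subset contains first-class passengers only. Both the survival rate (0.60 vs. 0.34) and the average age (39.7 vs. 30.4 years) are noticeably higher in the subset than in the full data set. Note that 96 first-class ages are missing and are excluded from the age statistics.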
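`rbind()` takes its column labels from the first vector, so the two statistics vectors are compared by position, and the differing label `Ave_Age` never shows up in the output. One way to keep the labels consistent is to compute both rows with the same function. The sketch below uses a hypothetical helper `age_survival_stats()` and made-up data; the extra `Missing_Ages` entry is an addition for illustration.

```r
# Hypothetical helper: the same statistics, with the same labels, for any data frame.
age_survival_stats <- function(df, age_col, survived_col) {
  c(Average_Age   = mean(df[[age_col]], na.rm = TRUE),
    Median_Age    = median(df[[age_col]], na.rm = TRUE),
    Survival_Rate = mean(df[[survived_col]]),
    Missing_Ages  = sum(is.na(df[[age_col]])))
}

# Made-up stand-in for the Titanic data.
toy <- data.frame(
  PClass   = c("1st", "1st", "2nd", "3rd"),
  Age      = c(40, NA, 30, 20),
  Survived = c(1, 1, 0, 0)
)
first <- subset(toy, PClass == "1st")

# Both rows come from the same function, so the columns line up by construction.
comparison <- rbind(
  first_class = age_survival_stats(first, "Age", "Survived"),
  all         = age_survival_stats(toy, "Age", "Survived")
)
comparison
```

Reporting the number of missing ages next to the means makes it easier to judge how much of each group the age figures actually describe.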
5. For at least 3 values in a column, rename them so that every occurrence of each value in that column is renamed. For example, suppose I have 20 values of the letter “e” in one column. Rename those values so that all 20 would show as “excellent”.
# Replace every "1st" in PClass (column 3) with "First_Class".
Titanic$PClass <- gsub("1st", "First_Class", Titanic$PClass, fixed = TRUE)
head(subset(Titanic, select = -2), 8)
##   X      PClass   Age    Sex Survived SexCode
## 1 1 First_Class 29.00 female        1       1
## 2 2 First_Class  2.00 female        0       1
## 3 3 First_Class 30.00   male        0       0
## 4 4 First_Class 25.00 female        0       1
## 5 5 First_Class  0.92   male        1       0
## 6 6 First_Class 47.00   male        1       0
## 7 7 First_Class 63.00 female        1       1
## 8 8 First_Class 39.00   male        0       0
# name column excluded
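The code above renames one value. To rename several values in a single pass, a named lookup vector is one option. In this sketch, `pclass` is a made-up vector, and the labels `Second_Class` and `Third_Class` are illustrative choices that do not appear in the original.

```r
# Made-up class labels, including the "*" placeholder seen in the aggregate output.
pclass <- c("1st", "3rd", "2nd", "*", "1st")

# Old value -> new value; only "First_Class" comes from the original code.
lookup <- c("1st" = "First_Class", "2nd" = "Second_Class", "3rd" = "Third_Class")

# Recode the values that have an entry in the lookup; leave everything else as is.
matched <- pclass %in% names(lookup)
recoded <- pclass
recoded[matched] <- unname(lookup[pclass[matched]])
recoded
```

Unlike `gsub()`, this matches whole values rather than substrings, so a value that merely contains `1st` would not be altered.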