R Markdown

# Preliminary. Read from text file
tableInput <- read.table(file = "Titanic.csv", header = TRUE, sep = ",")

#1a. The summary function reports per-column statistics. Note that Age has 557 missing values (NA's) out of 1313 rows.
summary(tableInput)
##        X                                  Name      PClass   
##  Min.   :   1   Carlsson, Mr Frans Olof     :   2   *  :  1  
##  1st Qu.: 329   Connolly, Miss Kate         :   2   1st:322  
##  Median : 657   Kelly, Mr James             :   2   2nd:279  
##  Mean   : 657   Abbing, Mr Anthony          :   1   3rd:711  
##  3rd Qu.: 985   Abbott, Master Eugene Joseph:   1            
##  Max.   :1313   Abbott, Mr Rossmore Edward  :   1            
##                 (Other)                     :1304            
##       Age            Sex         Survived         SexCode      
##  Min.   : 0.17   female:462   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:21.00   male  :851   1st Qu.:0.0000   1st Qu.:0.0000  
##  Median :28.00                Median :0.0000   Median :0.0000  
##  Mean   :30.40                Mean   :0.3427   Mean   :0.3519  
##  3rd Qu.:39.00                3rd Qu.:1.0000   3rd Qu.:1.0000  
##  Max.   :71.00                Max.   :1.0000   Max.   :1.0000  
##  NA's   :557
#1b. Find the mean and median of the column Age
# I found a good reference on how to retrieve the mean of a column at this link: https://stackoverflow.com/questions/37908949/how-to-find-the-mean-of-a-column-in-r
# This was especially helpful because the Age column contains NA values. Setting na.rm=TRUE removes the NAs before the calculation; without it, mean() and median() return NA.
mean(tableInput$Age, na.rm=TRUE)
## [1] 30.39799
median(tableInput$Age, na.rm=TRUE)
## [1] 28
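# A minimal sketch of what na.rm does, using a made-up vector (these values are
# illustrative assumptions, not rows from Titanic.csv).

```r
# Toy vector with one missing value (hypothetical data)
ages <- c(29, 2, NA, 30)

mean(ages)                  # NA: one missing value makes the whole result NA
mean(ages, na.rm = TRUE)    # mean of the three remaining values 29, 2, 30
median(ages, na.rm = TRUE)  # 29

# Count the missing values before deciding to drop them
sum(is.na(ages))            # 1
```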
#1c. Find the mean and median of the column Survived
mean(tableInput$Survived, na.rm=TRUE)
## [1] 0.3427266
median(tableInput$Survived, na.rm=TRUE)
## [1] 0
#2. New data frame that keeps only the columns Name, Age, Sex, and Survived. For this exercise, I limited the rows to the first 20 by re-reading the file with nrows=20.
#ref: https://dzone.com/articles/learn-r-how-create-data-frames
tableInput2 <- read.table(file = "Titanic.csv", header = TRUE, sep = ",", nrows=20)
dfTableInput <- as.data.frame(tableInput2)
dfSubSetTableInput <- dfTableInput[,c(2,4,5,6)]
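# The positional index c(2,4,5,6) breaks silently if the column order in the file
# ever changes. As a sketch, the same selection can be written with column names.
# The data frame below is a made-up stand-in with the same column layout as
# Titanic.csv; "passengers", "byPosition" and "byName" are hypothetical names.

```r
# Toy data frame mirroring the Titanic.csv columns (values are invented)
passengers <- data.frame(
  X = 1:3,
  Name = c("A", "B", "C"),
  PClass = c("1st", "2nd", "3rd"),
  Age = c(29, NA, 40),
  Sex = c("female", "male", "male"),
  Survived = c(1, 0, 0),
  SexCode = c(1, 0, 0)
)

# Select by position, as in the step above
byPosition <- passengers[, c(2, 4, 5, 6)]

# Select the same columns by name
byName <- passengers[, c("Name", "Age", "Sex", "Survived")]

identical(byPosition, byName)  # TRUE
```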

#3. New column names for dfSubSetTableInput will be FullName, Age, Gender, Alive
names(dfSubSetTableInput) <- c("FullName", "Age", "Gender", "Alive")
dfSubSetTableInput
##                                            FullName   Age Gender Alive
## 1                      Allen, Miss Elisabeth Walton 29.00 female     1
## 2                       Allison, Miss Helen Loraine  2.00 female     0
## 3               Allison, Mr Hudson Joshua Creighton 30.00   male     0
## 4     Allison, Mrs Hudson JC (Bessie Waldo Daniels) 25.00 female     0
## 5                     Allison, Master Hudson Trevor  0.92   male     1
## 6                                Anderson, Mr Harry 47.00   male     1
## 7                  Andrews, Miss Kornelia Theodosia 63.00 female     1
## 8                            Andrews, Mr Thomas, jr 39.00   male     0
## 9      Appleton, Mrs Edward Dale (Charlotte Lamson) 58.00 female     1
## 10                           Artagaveytia, Mr Ramon 71.00   male     0
## 11                        Astor, Colonel John Jacob 47.00   male     0
## 12 Astor, Mrs John Jacob (Madeleine Talmadge Force) 19.00 female     1
## 13                     Aubert, Mrs Leontine Pauline    NA female     1
## 14                         Barkworth, Mr Algernon H    NA   male     1
## 15                               Baumann, Mr John D    NA   male     0
## 16   Baxter, Mrs James (Helene DeLaudeniere Chaput) 50.00 female     1
## 17                          Baxter, Mr Quigg Edmond 24.00   male     0
## 18                              Beattie, Mr Thomson 36.00   male     0
## 19                     Beckwith, Mr Richard Leonard 37.00   male     1
## 20  Beckwith, Mrs Richard Leonard (Sallie Monypeny) 47.00 female     1
#4a. Running the summary function on dfSubSetTableInput shows how the 20-row subset differs from the full data set
# Some notable comparisons
# Age - in the original data set, the median age was 28. With the smaller data set, it has become 37.
# Gender - in the original data set, there were 851 men to 462 women, a ratio of roughly 1.8:1. In the smaller data set, it has become 11:9.
# Alive/Survived - in the original data set, the mean survival rate of passengers was 34.3%. In the smaller data set, it increased to 55%.
# Keep in mind that the smaller data set is just the first 20 rows of the file, all of them 1st class passengers, so it is not a random sample.
summary(dfSubSetTableInput)
##                                           FullName       Age       
##  Allen, Miss Elisabeth Walton                 : 1   Min.   : 0.92  
##  Allison, Master Hudson Trevor                : 1   1st Qu.:25.00  
##  Allison, Miss Helen Loraine                  : 1   Median :37.00  
##  Allison, Mr Hudson Joshua Creighton          : 1   Mean   :36.76  
##  Allison, Mrs Hudson JC (Bessie Waldo Daniels): 1   3rd Qu.:47.00  
##  Anderson, Mr Harry                           : 1   Max.   :71.00  
##  (Other)                                      :14   NA's   :3      
##     Gender       Alive     
##  female: 9   Min.   :0.00  
##  male  :11   1st Qu.:0.00  
##              Median :1.00  
##              Mean   :0.55  
##              3rd Qu.:1.00  
##              Max.   :1.00  
## 
#4b. Find the mean and median of the column Age 
mean(dfSubSetTableInput$Age, na.rm=TRUE)
## [1] 36.76
median(dfSubSetTableInput$Age, na.rm=TRUE)
## [1] 37
#4c. Find the mean and median of the column Alive
mean(dfSubSetTableInput$Alive, na.rm=TRUE)
## [1] 0.55
median(dfSubSetTableInput$Alive, na.rm=TRUE)
## [1] 1
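# The comparisons in 4a can be double-checked with a little arithmetic. This is
# only a sketch: the counts and means are copied by hand from the two summary
# outputs above, and the variable names are hypothetical.

```r
# Counts copied from the Sex / Gender rows of the summary outputs
fullMale <- 851
fullFemale <- 462
subsetMale <- 11
subsetFemale <- 9

# Men per woman in each data set
round(fullMale / fullFemale, 2)      # 1.84
round(subsetMale / subsetFemale, 2)  # 1.22

# Survival rate as a percentage, from the reported means
fullRate <- round(0.3427266 * 100, 1)  # 34.3
subsetRate <- round(0.55 * 100, 1)     # 55
```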
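#5. Replace every 'male' value in the Gender column with 'Dude'
# Note: renaming a level this way relies on Gender being a factor, which was the read.table default before R 4.0.0.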
dfSubSetTableInputA <- within(dfSubSetTableInput, levels(Gender)[levels(Gender) == "male"] <- "Dude")
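# A small sketch of the same replacement on stand-alone vectors, covering both
# the factor case and the plain character case. The vectors and their names are
# made up for illustration.

```r
# Made-up values standing in for the Gender column
genderFactor <- factor(c("male", "female", "male"))
genderChar <- c("male", "female", "male")

# Factor: rename the level itself; every element with that level follows
levels(genderFactor)[levels(genderFactor) == "male"] <- "Dude"

# Character vector: overwrite the matching elements directly
genderChar[genderChar == "male"] <- "Dude"

as.character(genderFactor)  # "Dude" "female" "Dude"
genderChar                  # "Dude" "female" "Dude"
```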

#6. Show samples of output table of each problem where applicable.
#ref: https://www.statmethods.net/input/contents.html
#6. - 1. Head output for tableInput for first 10 rows
head(tableInput, n=10)
##     X                                          Name PClass   Age    Sex
## 1   1                  Allen, Miss Elisabeth Walton    1st 29.00 female
## 2   2                   Allison, Miss Helen Loraine    1st  2.00 female
## 3   3           Allison, Mr Hudson Joshua Creighton    1st 30.00   male
## 4   4 Allison, Mrs Hudson JC (Bessie Waldo Daniels)    1st 25.00 female
## 5   5                 Allison, Master Hudson Trevor    1st  0.92   male
## 6   6                            Anderson, Mr Harry    1st 47.00   male
## 7   7              Andrews, Miss Kornelia Theodosia    1st 63.00 female
## 8   8                        Andrews, Mr Thomas, jr    1st 39.00   male
## 9   9  Appleton, Mrs Edward Dale (Charlotte Lamson)    1st 58.00 female
## 10 10                        Artagaveytia, Mr Ramon    1st 71.00   male
##    Survived SexCode
## 1         1       1
## 2         0       1
## 3         0       0
## 4         0       1
## 5         1       0
## 6         1       0
## 7         1       1
## 8         0       0
## 9         1       1
## 10        0       0
#6. - 2. Tail output for tableInput2 for last 10 rows
tail(tableInput2, n=10)
##     X                                             Name PClass Age    Sex
## 11 11                        Astor, Colonel John Jacob    1st  47   male
## 12 12 Astor, Mrs John Jacob (Madeleine Talmadge Force)    1st  19 female
## 13 13                     Aubert, Mrs Leontine Pauline    1st  NA female
## 14 14                         Barkworth, Mr Algernon H    1st  NA   male
## 15 15                               Baumann, Mr John D    1st  NA   male
## 16 16   Baxter, Mrs James (Helene DeLaudeniere Chaput)    1st  50 female
## 17 17                          Baxter, Mr Quigg Edmond    1st  24   male
## 18 18                              Beattie, Mr Thomson    1st  36   male
## 19 19                     Beckwith, Mr Richard Leonard    1st  37   male
## 20 20  Beckwith, Mrs Richard Leonard (Sallie Monypeny)    1st  47 female
##    Survived SexCode
## 11        0       0
## 12        1       1
## 13        1       1
## 14        1       0
## 15        0       0
## 16        1       1
## 17        0       0
## 18        0       0
## 19        1       0
## 20        1       1
#6. - 3. and 4. Head output for dfSubSetTableInput for first 10 rows
head(dfSubSetTableInput, n=10)
##                                         FullName   Age Gender Alive
## 1                   Allen, Miss Elisabeth Walton 29.00 female     1
## 2                    Allison, Miss Helen Loraine  2.00 female     0
## 3            Allison, Mr Hudson Joshua Creighton 30.00   male     0
## 4  Allison, Mrs Hudson JC (Bessie Waldo Daniels) 25.00 female     0
## 5                  Allison, Master Hudson Trevor  0.92   male     1
## 6                             Anderson, Mr Harry 47.00   male     1
## 7               Andrews, Miss Kornelia Theodosia 63.00 female     1
## 8                         Andrews, Mr Thomas, jr 39.00   male     0
## 9   Appleton, Mrs Edward Dale (Charlotte Lamson) 58.00 female     1
## 10                        Artagaveytia, Mr Ramon 71.00   male     0
#6. - 5. Tail output for dfSubSetTableInputA for last 10 rows
tail(dfSubSetTableInputA, n=10)
##                                            FullName Age Gender Alive
## 11                        Astor, Colonel John Jacob  47   Dude     0
## 12 Astor, Mrs John Jacob (Madeleine Talmadge Force)  19 female     1
## 13                     Aubert, Mrs Leontine Pauline  NA female     1
## 14                         Barkworth, Mr Algernon H  NA   Dude     1
## 15                               Baumann, Mr John D  NA   Dude     0
## 16   Baxter, Mrs James (Helene DeLaudeniere Chaput)  50 female     1
## 17                          Baxter, Mr Quigg Edmond  24   Dude     0
## 18                              Beattie, Mr Thomson  36   Dude     0
## 19                     Beckwith, Mr Richard Leonard  37   Dude     1
## 20  Beckwith, Mrs Richard Leonard (Sallie Monypeny)  47 female     1
#7. BONUS - Place the .csv file in my GitHub repository and access it from there. Using the head function on rawgithuboutput to prove the file was read by the program.
#ref: https://stackoverflow.com/questions/14441729/read-a-csv-from-github-into-r
urlfile <- 'https://raw.githubusercontent.com/RommyGraphs/MSDA/master/2018Workshop/Titanic.csv'
rawgithuboutput <- read.csv(url(urlfile))
head(rawgithuboutput, n=10)
##     X                                          Name PClass   Age    Sex
## 1   1                  Allen, Miss Elisabeth Walton    1st 29.00 female
## 2   2                   Allison, Miss Helen Loraine    1st  2.00 female
## 3   3           Allison, Mr Hudson Joshua Creighton    1st 30.00   male
## 4   4 Allison, Mrs Hudson JC (Bessie Waldo Daniels)    1st 25.00 female
## 5   5                 Allison, Master Hudson Trevor    1st  0.92   male
## 6   6                            Anderson, Mr Harry    1st 47.00   male
## 7   7              Andrews, Miss Kornelia Theodosia    1st 63.00 female
## 8   8                        Andrews, Mr Thomas, jr    1st 39.00   male
## 9   9  Appleton, Mrs Edward Dale (Charlotte Lamson)    1st 58.00 female
## 10 10                        Artagaveytia, Mr Ramon    1st 71.00   male
##    Survived SexCode
## 1         1       1
## 2         0       1
## 3         0       0
## 4         0       1
## 5         1       0
## 6         1       0
## 7         1       1
## 8         0       0
## 9         1       1
## 10        0       0