Question 1(a)

Titanic Dataset - Survival of passengers on the Titanic

Description

This data set provides information on the fate of passengers on the fatal maiden voyage of the ocean liner ‘Titanic’, summarized according to economic status (class), sex, age and survival.

Details

The sinking of the Titanic is a famous event, and new books are still being published about it. Many well-known facts—from the proportions of first-class passengers to the ‘women and children first’ policy, and the fact that that policy was not entirely successful in saving the women and children in the third class—are reflected in the survival rates for various classes of passenger.

These data were originally collected by the British Board of Trade in their investigation of the sinking. Note that there is not complete agreement among primary sources as to the exact numbers on board, rescued, or lost.

Titanic Ship

First six rows of the Titanic data set

##   Class    Sex   Age Survived Freq
## 1   1st   Male Child       No    0
## 2   2nd   Male Child       No    0
## 3   3rd   Male Child       No   35
## 4  Crew   Male Child       No    0
## 5   1st Female Child       No    0
## 6   2nd Female Child       No    0

The bar chart below shows the number of people who survived and who did not, by age:

This is the code used to find the total number of people who survived and did not survive, by age:

titanicData <- data.frame(Titanic)

survive <- split(titanicData, titanicData$Survived)
surviveNo <- survive$No
totalNoSurvive <- tapply(surviveNo$Freq, surviveNo$Age, sum)

surviveYes <- survive$Yes
totalYesSurvive <- tapply(surviveYes$Freq, surviveYes$Age, sum)

Based on the bar chart and the code above, the majority of the people aboard the Titanic did not survive. Of the 2201 people in the data set, 711 survived (57 children and 654 adults), while 1490 did not (52 children and 1438 adults). This means approximately 52.29% of children (57 of 109) and approximately 31.26% of adults (654 of 2092) survived.
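The percentages quoted above can be checked with a short base R sketch. It uses xtabs() and prop.table() rather than the split() and tapply() approach shown earlier; the names ageTable and ageRates are introduced here only for illustration.

```r
# Cross-tabulate the frequency counts: rows = Age, columns = Survived
titanicData <- data.frame(Titanic)
ageTable <- xtabs(Freq ~ Age + Survived, data = titanicData)

# Row-wise proportions give the survival rate within each age group
ageRates <- round(100 * prop.table(ageTable, margin = 1), 2)

print(ageTable)
print(ageRates)
```

With margin = 1 each row sums to 100, so the "Yes" column holds the survival rate for that age group.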

The bar chart below shows the number of people who survived and who did not, by sex:

This is the code used to find the total number of people who survived and did not survive, by sex:

titanicData <- data.frame(Titanic)

survive <- split(titanicData, titanicData$Survived)
surviveNo <- survive$No
totalNoSurvive <- tapply(surviveNo$Freq, surviveNo$Sex, sum)

surviveYes <- survive$Yes
totalYesSurvive <- tapply(surviveYes$Freq, surviveYes$Sex, sum)

Based on the data set and the code above, 711 people survived (367 male and 344 female), while 1490 did not (1364 male and 126 female). Approximately 73.19% of females (344 of 470) survived, compared with only about 21.20% of males (367 of 1731).
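A grouped bar chart of these counts could be drawn along the following lines. This is a sketch using base graphics and is not necessarily the code that produced the chart in this report; sexCounts and the axis labels are illustrative choices.

```r
titanicData <- data.frame(Titanic)

# Rows = Survived, columns = Sex, so each sex gets one group of bars
sexCounts <- xtabs(Freq ~ Survived + Sex, data = titanicData)

barplot(sexCounts,
        beside = TRUE,                          # bars side by side, not stacked
        legend.text = TRUE,                     # legend from the row names (No / Yes)
        args.legend = list(title = "Survived"),
        xlab = "Sex",
        ylab = "Number of people",
        main = "Titanic survival by sex")
```

Swapping Sex for Age in the formula gives the corresponding chart by age.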

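Summary of the Titanic data set. The counts (N) for each factor level refer to rows of the 32-row frequency table, not to people on board; the number of people is held in Freq.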

## ================================================================================
## 
##    Class
## 
## --------------------------------------------------------------------------------
## 
##    Storage mode: integer
##    Factor with 4 levels
## 
##    Levels and labels     N Valid
##                                 
##    1 '1st'               8  25.0
##    2 '2nd'               8  25.0
##    3 '3rd'               8  25.0
##    4 'Crew'              8  25.0
## 
## ================================================================================
## 
##    Sex
## 
## --------------------------------------------------------------------------------
## 
##    Storage mode: integer
##    Factor with 2 levels
## 
##    Levels and labels     N Valid
##                                 
##    1 'Male'             16  50.0
##    2 'Female'           16  50.0
## 
## ================================================================================
## 
##    Age
## 
## --------------------------------------------------------------------------------
## 
##    Storage mode: integer
##    Factor with 2 levels
## 
##    Levels and labels     N Valid
##                                 
##    1 'Child'            16  50.0
##    2 'Adult'            16  50.0
## 
## ================================================================================
## 
##    Survived
## 
## --------------------------------------------------------------------------------
## 
##    Storage mode: integer
##    Factor with 2 levels
## 
##    Levels and labels     N Valid
##                                 
##    1 'No'               16  50.0
##    2 'Yes'              16  50.0
## 
## ================================================================================
## 
##    Freq
## 
## --------------------------------------------------------------------------------
## 
##    Storage mode: double
## 
##         Min:   0.000
##         Max: 670.000
##        Mean:  68.781
##    Std.Dev.: 133.854
##    Skewness:   3.224
##    Kurtosis:  10.780

Data Frame Summary

titanicData

Dimensions: 32 x 5
Duplicates: 0

No  Variable            Stats / Values           Freqs (% of Valid)  Valid         Missing
1   Class [factor]      1. 1st                   8 (25.0%)           32 (100.0%)   0 (0.0%)
                        2. 2nd                   8 (25.0%)
                        3. 3rd                   8 (25.0%)
                        4. Crew                  8 (25.0%)
2   Sex [factor]        1. Male                  16 (50.0%)          32 (100.0%)   0 (0.0%)
                        2. Female                16 (50.0%)
3   Age [factor]        1. Child                 16 (50.0%)          32 (100.0%)   0 (0.0%)
                        2. Adult                 16 (50.0%)
4   Survived [factor]   1. No                    16 (50.0%)          32 (100.0%)   0 (0.0%)
                        2. Yes                   16 (50.0%)
5   Freq [numeric]      Mean (sd) : 68.8 (136)   22 distinct values  32 (100.0%)   0 (0.0%)
                        min ≤ med ≤ max:
                        0 ≤ 13.5 ≤ 670
                        IQR (CV) : 76.2 (2)

Generated by summarytools 1.0.0 (R version 4.1.1)
2021-12-31


Question 1(b)


NFT sales dataset from https://www.kaggle.com/hemil26/nft-collections-dataset

An NFT (non-fungible token) is used to buy and sell ownership of a unique digital item through the blockchain. Many NFT transactions take place, and this data set records NFT sales with the columns Collections, Sales, Buyers, Txns (transactions) and Owners.

First, read.csv() from base R loads the file nft_sales2.csv into a variable called nftSales. The dplyr library is loaded later, for the functions demonstrated below.

nftSales <- read.csv("D:/newCode/university/data sc/lab/nft_sales2.csv",stringsAsFactors=FALSE)

i) filter()

filter() subsets the main data set (nftSales), keeping only the rows that satisfy a given condition:

library(dplyr)

filter(nftSales, is.na(Owners))  # keep rows where Owners is missing
##          Collections     Sales Buyers  Txns Owners
## 1     Parallel Alpha 163724921  11103 67736     NA
## 2    Gutter Cat Gang  35876258   1729  3343     NA
## 3      Frontier Game  23972975   3257  7409     NA
## 4        Gutter Rats  12682958   1738  3157     NA
## 5           Illuvium   4849821   1255  3021     NA
## 6 Fluffy Polar Bears   3794206   3066  5104     NA

As the result above shows, filter() is applied to the Owners column of nftSales. The condition is.na() is true for rows where Owners is NA (has no data), so filter() returns only the six rows of nftSales with a missing Owners value.
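The same idea can be tried without the CSV file. The sketch below assumes dplyr is installed and uses a small data frame, nftSample, built from four rows copied from the outputs in this report; nftSample and the result names are introduced only for illustration.

```r
library(dplyr)

# Four rows copied from the outputs shown in this report
nftSample <- data.frame(
  Collections = c("Axie Infinity", "CryptoPunks", "Parallel Alpha", "Gutter Cat Gang"),
  Sales       = c(3328148500, 1664246968, 163724921, 35876258),
  Buyers      = c(1079811, 4723, 11103, 1729),
  Txns        = c(9755511, 18961, 67736, 3343),
  Owners      = c(2656431, 3289, NA, NA)
)

missingOwners <- filter(nftSample, is.na(Owners))    # rows with no owner count
knownOwners   <- filter(nftSample, !is.na(Owners))   # the complement

# Conditions separated by commas are combined with AND
manyBuyers <- filter(nftSample, !is.na(Owners), Buyers > 5000)

print(missingOwners)
print(manyBuyers)
```

Inside filter() the column can be written as Owners rather than nftSample$Owners, because dplyr evaluates the condition within the data frame.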

ii) arrange()

arrange() sorts the rows of the data set by a column, in either ascending or descending order:

library(dplyr)

head(arrange(nftSales, desc(Buyers)))  # descending order
##     Collections      Sales  Buyers     Txns  Owners
## 1 Axie Infinity 3328148500 1079811  9755511 2656431
## 2  Alien Worlds   33282729  405975  4630191 2562646
## 3  NBA Top Shot  781965423  374818 11790699  603928
## 4 CryptoKitties   45790208  111129   786656  109858
## 5        Sorare  129615752   42675   713122   60277
## 6       Zed Run  120191155   40469   160217   40190
head(arrange(nftSales, Buyers))  # ascending order
##                  Collections    Sales Buyers Txns Owners
## 1                  Chain Saw  5241292     31   47     28
## 2                   Deafbeef 19249730     91  135    109
## 3         Wrapped Cryptocats  2774010    111  201    145
## 4 Non Fungible Fungi Genesis  4480098    127  167     71
## 5                 Autoglyphs 41866276    183  349    157
## 6       Mutant Garden Seeder  9798416    250  468    272

As the results above show, arrange() sorts the rows by the Buyers column: in descending order when the column is wrapped in desc(), and in ascending order otherwise.

iii) mutate()

mutate() adds a new variable (a new column) while preserving the existing ones:

library(dplyr)

head(mutate(nftSales, doubleSales = Sales * 2))
##             Collections      Sales  Buyers     Txns  Owners doubleSales
## 1         Axie Infinity 3328148500 1079811  9755511 2656431  6656297000
## 2           CryptoPunks 1664246968    4723    18961    3289  3328493936
## 3            Art Blocks 1075223906   20934   117602   25094  2150447812
## 4  Bored Ape Yacht Club  783882186    8284    22584    5862  1567764372
## 5          NBA Top Shot  781965423  374818 11790699  603928  1563930846
## 6 Mutant Ape Yacht Club  422429206   10350    17343   10254   844858412

As the result above shows, mutate() takes the Sales column, multiplies it by two and adds the result as a new column named doubleSales, while keeping the original columns unchanged.
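A new column can also combine several existing columns. The sketch below assumes dplyr is installed and uses three rows copied from the output above; nftSample, salesPerTxn and salesPerTxnRounded are hypothetical names chosen for illustration.

```r
library(dplyr)

# Three rows copied from the mutate() output above
nftSample <- data.frame(
  Collections = c("Axie Infinity", "CryptoPunks", "Art Blocks"),
  Sales       = c(3328148500, 1664246968, 1075223906),
  Txns        = c(9755511, 18961, 117602)
)

withAverage <- mutate(
  nftSample,
  salesPerTxn        = Sales / Txns,        # average sales value per transaction
  salesPerTxnRounded = round(salesPerTxn)   # later columns can use earlier new ones
)

print(withAverage)
```

The original three columns are still present in withAverage; mutate() only appends the two new ones.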

iv) select()

select() picks specific columns from the data set. Columns can be specified by name or with helper functions such as starts_with() (name prefix) and matches() (regular expression):

library(dplyr)

head(select(nftSales, starts_with("Owner")))  # columns whose names start with "Owner"
##    Owners
## 1 2656431
## 2    3289
## 3   25094
## 4    5862
## 5  603928
## 6   10254
head(select(nftSales, matches("[Ow]n")))  # columns whose names match the regular expression
##             Collections  Owners
## 1         Axie Infinity 2656431
## 2           CryptoPunks    3289
## 3            Art Blocks   25094
## 4  Bored Ape Yacht Club    5862
## 5          NBA Top Shot  603928
## 6 Mutant Ape Yacht Club   10254

As the results above show, select() returns only the columns that were specified. starts_with("Owner") returns only Owners, the one column whose name begins with "Owner". matches("[Ow]n") returns every column whose name matches the pattern, here Collections and Owners: matches() ignores case by default, so the pattern finds "on" in Collections and "wn" in Owners.
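The effect of case sensitivity on the pattern "[Ow]n" can be seen directly on the column names. The sketch below uses base R's grepl() as a stand-in for the matching that matches() performs; columnNames, insensitive and sensitive are names introduced here for illustration.

```r
# Column names of the NFT data set, as shown in the outputs above
columnNames <- c("Collections", "Sales", "Buyers", "Txns", "Owners")

# Case-insensitive match, mirroring the default of matches()
insensitive <- columnNames[grepl("[Ow]n", columnNames, ignore.case = TRUE)]

# Case-sensitive match: the lowercase "on" in "Collections" no longer counts
sensitive <- columnNames[grepl("[Ow]n", columnNames, ignore.case = FALSE)]

print(insensitive)
print(sensitive)
```

In dplyr, passing ignore.case = FALSE to matches() should give the case-sensitive behaviour.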

v) summarise()

summarise() creates a new data frame with one column for each summary statistic specified:

library(dplyr)

summarise(nftSales, mean = mean(Sales))
##       mean
## 1 59224828
summarise(nftSales, qs = quantile(Sales, c(0.25, 0.75)), n = n())
##         qs   n
## 1  4645715 250
## 2 33968942 250

As the results above show, summarise() creates a new data frame with one column per summary statistic. The first call returns a single column named mean. The second returns two columns: qs, holding the 25th and 75th percentiles of Sales (one per row), and n, the number of rows in the current group (250). In dplyr 1.1.0 and later, returning more than one row per group from summarise() is deprecated in favour of reframe().
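Several statistics can be requested in a single call, each becoming its own column. The sketch below assumes dplyr is installed and uses four rows copied from the outputs in this report; nftSample and the column names of the result are illustrative.

```r
library(dplyr)

# Four rows copied from the outputs shown in this report
nftSample <- data.frame(
  Collections = c("Axie Infinity", "CryptoPunks", "Parallel Alpha", "Gutter Cat Gang"),
  Sales       = c(3328148500, 1664246968, 163724921, 35876258),
  Owners      = c(2656431, 3289, NA, NA)
)

salesSummary <- summarise(
  nftSample,
  meanSales   = mean(Sales),
  medianSales = median(Sales),
  meanOwners  = mean(Owners, na.rm = TRUE),  # skip the missing owner counts
  n           = n()                          # number of rows summarised
)

print(salesSummary)
```

Without na.rm = TRUE, mean(Owners) would return NA because two of the four values are missing.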