Install and load the necessary packages to reproduce the report here:
options(warn= -1)
library(readr) # Useful for importing data
library(foreign) # Useful for importing SPSS, SAS, STATA etc. data files
library(rvest) # Useful for scraping HTML data
library(knitr) # Useful for creating nice tables
library(dplyr) #Useful for data manipulation
library(tidyr) #Useful for tidying up variables
library(stringr) #Useful for tidying strings up
The dataset I have chosen to work with is the famous Titanic dataset, which contains information about the passengers who were on the Titanic. It contains categorical, numerical and character variables. The data set has some NA values and some variables that do not adhere to tidy principles; these will need to be cleaned up. The dataset was obtained in .csv form from kaggle.com: https://www.kaggle.com/vinicius150987/titanic3
df <- read_csv("titanic3_CSV.csv")
-- Column specification ----------------------------------------------------------------------------------------
cols(
pclass = col_double(),
survived = col_double(),
name = col_character(),
sex = col_character(),
age = col_double(),
sibsp = col_double(),
parch = col_double(),
ticket = col_character(),
fare = col_double(),
cabin = col_character(),
embarked = col_character(),
boat = col_character(),
body = col_double(),
home.dest = col_character()
)
head(df)
The Titanic dataset contains the names of all the passengers on board, as well as their ticket class, whether they survived, their gender, their age, the number of siblings/spouses aboard, the number of parents/children aboard, their ticket number, their fare, their cabin, where they embarked, the lifeboat they were on (if they got one), their body number (if they did not survive and their body was recovered) and their home destination.
The variables are:
* pclass: The passenger class (1 = 1st, 2 = 2nd, 3 = 3rd)
* survived: Whether the passenger survived (0 = No, 1 = Yes)
* name: The passenger's name
* sex: The gender of the passenger
* age: The age of the passenger
* sibsp: The number of siblings/spouses aboard
* parch: The number of parents/children aboard
* ticket: The passenger's ticket number
* fare: The passenger's fare
* cabin: The passenger's cabin number
* embarked: The port where the passenger embarked (C = Cherbourg, Q = Queenstown, S = Southampton)
* boat: The lifeboat number (if they survived)
* body: The body number (if they did not survive and the body was recovered)
* home.dest: The passenger's home destination
head(df) #Take a look at the head of the data frame
any(is.na(df)) #Checking for missing data
[1] TRUE
sum(is.na(df)) #Count the number of NA values
[1] 3869
#Remove rows with NA's in any column except the 'boat' and 'body' columns
df <- df %>% drop_na(-c(boat, body))
#Convert the NA's in the 'boat' and 'body' columns to 0's
df$boat[is.na(df$boat)] <- 0
df$body[is.na(df$body)] <- 0
dim(df) #Check the dimensions of the data frame
[1] 239 14
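As a rough sketch of the NA handling above, the base R snippet below counts missing values per column and keeps only the rows that are complete in a chosen set of columns. The data frame `toy` and its values are made up for illustration; they are not taken from the Titanic file.

```r
# Toy data (hypothetical values, not from the Titanic file)
toy <- data.frame(
  age   = c(29, NA, 30),
  cabin = c("B5", NA, NA),
  boat  = c(NA, "2", NA),
  stringsAsFactors = FALSE
)

# Count the NA values in each column rather than in the whole table
na_per_col <- colSums(is.na(toy))
print(na_per_col)

# Keep rows that are complete in 'age' and 'cabin' only; 'boat' may stay NA
keep <- complete.cases(toy[, c("age", "cabin")])
cleaned <- toy[keep, ]
print(nrow(cleaned))
```

Looking at the per-column counts first shows which column is responsible for most of the rows that get dropped.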
#Check the type of each column
typeof(df$pclass) #Double
[1] "double"
typeof(df$survived) #Double
[1] "double"
typeof(df$name) #Character
[1] "character"
typeof(df$sex) #Character
[1] "character"
typeof(df$age) #Double
[1] "double"
typeof(df$sibsp) #Double
[1] "double"
typeof(df$parch) #Double
[1] "double"
typeof(df$ticket) #Character
[1] "character"
typeof(df$fare) #Double
[1] "double"
typeof(df$cabin) #Character
[1] "character"
typeof(df$embarked) #Character
[1] "character"
typeof(df$boat) #Character
[1] "character"
typeof(df$body) #Double
[1] "double"
typeof(df$home.dest) #Character
[1] "character"
#Check the structure of each column
str(df$pclass) #Numeric
num [1:239] 1 1 1 1 1 1 1 1 1 1 ...
str(df$survived) #Numeric
num [1:239] 1 1 0 0 0 1 1 0 1 0 ...
str(df$name) #Character
chr [1:239] "Allen, Miss. Elisabeth Walton" "Allison, Master. Hudson Trevor" ...
str(df$sex) #Character
chr [1:239] "female" "male" "female" "male" "female" "male" "female" "male" "female" "male" "female" ...
str(df$age) #Numeric
num [1:239] 29 0.917 2 30 25 ...
str(df$sibsp) #Numeric
num [1:239] 0 1 1 1 1 0 1 0 2 1 ...
str(df$parch) #Numeric
num [1:239] 0 2 2 2 2 0 0 0 0 0 ...
str(df$ticket) #Character
chr [1:239] "24160" "113781" "113781" "113781" "113781" "19952" "13502" "112050" "11769" "PC 17757" ...
str(df$fare) #Numeric
num [1:239] 211 152 152 152 152 ...
str(df$cabin) #Character
chr [1:239] "B5" "C22 C26" "C22 C26" "C22 C26" "C22 C26" "E12" "D7" "A36" "C101" "C62 C64" "C62 C64" "B35" ...
str(df$embarked) #Character
chr [1:239] "S" "S" "S" "S" "S" "S" "S" "S" "S" "C" "C" "C" "S" "C" "C" "C" "S" "S" "C" "C" "C" "S" "S" ...
str(df$boat) #Character
chr [1:239] "2" "11" "0" "0" "0" "3" "10" "0" "D" "0" "4" "9" "B" "0" "6" "A" "5" "5" "5" "7" "7" "D" "0" ...
str(df$body) #Numeric
num [1:239] 0 0 0 135 0 0 0 0 0 124 ...
str(df$home.dest) #Character
chr [1:239] "St Louis, MO" "Montreal, PQ / Chesterville, ON" "Montreal, PQ / Chesterville, ON" ...
#Check the class of the columns
class(df$pclass)
[1] "numeric"
class(df$survived)
[1] "numeric"
class(df$name)
[1] "character"
class(df$sex)
[1] "character"
class(df$age)
[1] "numeric"
class(df$sibsp)
[1] "numeric"
class(df$parch)
[1] "numeric"
class(df$ticket)
[1] "character"
class(df$fare)
[1] "numeric"
class(df$cabin)
[1] "character"
class(df$embarked)
[1] "character"
class(df$boat)
[1] "character"
class(df$body)
[1] "numeric"
class(df$home.dest)
[1] "character"
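The same checks can also be run for every column in a single call. The sketch below uses a small made-up data frame (`toy` is a hypothetical stand-in for `df`) and base R only.

```r
# Toy stand-in for df (hypothetical values)
toy <- data.frame(pclass = c(1, 2), name = c("a", "b"), stringsAsFactors = FALSE)

# Apply class() and typeof() to each column; the result is a named character vector
col_classes <- sapply(toy, class)
col_types <- sapply(toy, typeof)

print(col_classes)
print(col_types)
```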
#Convert categorical columns into factors & assess the levels of each, order ones that need to be ordered.
df$pclass <- as.factor(df$pclass) #Convert the numeric column to a factor
levels(df$pclass) #The levels are ordered how we want them
[1] "1" "2" "3"
df$survived <- as.factor(df$survived)
levels(df$survived) #Check the levels of the column
[1] "0" "1"
df$sex <- as.factor(df$sex)
levels(df$sex) #Check the levels of the column
[1] "female" "male"
df$sibsp <- as.factor(df$sibsp)
levels(df$sibsp) #Check the levels of the column
[1] "0" "1" "2" "3"
df$parch <- as.factor(df$parch)
levels(df$parch) #Check the levels of the column
[1] "0" "1" "2" "3" "4"
df$embarked <- as.factor(df$embarked)
levels(df$embarked) #Check the levels of the column
[1] "C" "Q" "S"
df$boat <- as.factor(df$boat)
levels(df$boat) #Check the levels of the column
[1] "0" "1" "10" "11" "12" "13" "14" "16" "2" "3" "4" "5" "5 7" "5 9" "6" "7" "8" "9"
[19] "A" "B" "C" "D"
#Change the names of the columns
colnames(df) <- c("PassClass","Survived","Name","Sex","Age","SibsSpousesPres","KidsParentsPres","Ticket","Fare",
"Cabin","EmbarkedFrom","Lifeboat","BodyNum", "HomeDest")
colnames(df) #Check the names of the columns
[1] "PassClass" "Survived" "Name" "Sex" "Age" "SibsSpousesPres"
[7] "KidsParentsPres" "Ticket" "Fare" "Cabin" "EmbarkedFrom" "Lifeboat"
[13] "BodyNum" "HomeDest"
#Clean up the data set in the cabin and lifeboat columns
df$Cabin <- sub(' .*', "",df$Cabin)
df$Lifeboat <- sub(' .*', "",df$Lifeboat)
The dataset in its raw form did not adhere to the tidy principles.
The three principles for a tidy data set defined by Wickham & Grolemund, 2016 (1) are:
1. Each variable must have its own column
2. Each observation must have its own row
3. Each value must have its own cell
We already removed or converted all NA values in the previous step. In this step we made sure that each value had its own cell by using the sub() function to remove everything from the first space onwards in the Cabin and Lifeboat strings, so that each cell keeps only its first value. We could have created new columns to hold the extra values instead of removing them, but they weren't relevant for our analysis, so I removed them.
We also could have split the Name column into surname, title, first name, middle name and maiden name columns. We do not need that data here, so the Name column is kept as a single string.
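To make the sub() step concrete, here is a minimal sketch on a few made-up cabin strings (the vector `cabin` is hypothetical). It also shows one way the extra values could be counted before they are discarded.

```r
# Hypothetical cabin strings, some holding more than one cabin
cabin <- c("B5", "C22 C26", "C62 C64", "E12")

# Remove the first space and everything after it, keeping the first cabin only
first_cabin <- sub(" .*", "", cabin)
print(first_cabin)

# Count how many cabins each string originally held
n_cabins <- lengths(strsplit(cabin, " ", fixed = TRUE))
print(n_cabins)
```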
# Here we determine the summary statistics of our numeric variable columns (Age and Fare) when grouped by passenger class
SummaryStats = df %>% group_by(PassClass) %>% #Group by Passenger class
summarize(Avg_Age = mean(Age), #Find the mean age for each passenger class
Median_Age = median(Age), #Find the median age for each passenger class
Minimum_Age = min(Age), #Find the minimum age for each passenger class
Max_Age = max(Age), #Find the maximum age for each passenger class
Standard_Dev_Age = sd(Age), #Find the standard deviation of age for each passenger class
Avg_FarePrice = mean(Fare), # Find the mean fare price for each passenger class
Median_Fare = median(Fare), # Find the median fare price for each passenger class
Minimum_Fare = min(Fare), # Find the minimum fare price for each passenger class
Maximum_Fare = max(Fare), # Find the maximum fare price for each passenger class
Standard_Dev_Fare = sd(Fare), .groups="keep") # Find the standard deviation of fare price for each passenger class
SummaryStats
First we created a variable called SummaryStats from the initial data frame (df), grouped by one categorical variable (passenger class) using the group_by function. We then piped that to the summarize function and created columns for the mean, median, minimum, maximum and standard deviation of the numeric variables age and fare.
The standard deviation for passenger class 3 is NA because only one third-class passenger remains after dropping the rows with NA values, and the standard deviation of a single value is undefined.
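The missing standard deviation can be reproduced with a small sketch. The data frame `toy` below is made up; its class "3" group has a single row, mirroring the situation described above.

```r
# sd() of a single value is NA because the sample variance divides by n - 1
single_sd <- sd(42)
print(single_sd)

# Hypothetical data: three rows in class "1", one row in class "3"
toy <- data.frame(class = c("1", "1", "1", "3"), age = c(29, 2, 30, 25))

# Group sizes and group standard deviations
group_n <- tapply(toy$age, toy$class, length)
group_sd <- tapply(toy$age, toy$class, sd)
print(group_n)
print(group_sd)
```

Reporting the group size next to each statistic makes it easier to spot groups that are too small to summarise.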
#Create a list that contains a numeric value for each response to the categorical variable 'Sex'.
listy <- list(as.numeric(df$Sex))
listy
[[1]]
[1] 1 2 1 2 1 2 1 2 1 2 1 1 2 2 1 2 2 1 2 2 1 2 2 2 1 1 2 1 2 2 1 1 1 2 2 2 1 2 2 1 2 1 2 1 2 1 2 1 1 2 2 1 2
[54] 1 2 1 1 2 1 1 2 1 2 2 1 2 1 2 2 1 2 1 1 1 2 2 1 1 1 1 2 2 1 1 1 2 2 1 2 1 2 1 2 1 2 2 2 2 1 2 1 2 1 2 2 1
[107] 2 1 2 1 1 2 1 2 2 1 1 1 2 1 1 2 2 2 2 2 1 1 1 1 2 1 1 2 2 1 1 2 2 2 2 2 1 1 2 2 2 1 1 2 1 1 2 2 2 2 2 1 2
[160] 1 2 2 1 1 2 1 2 1 2 1 1 2 1 1 1 2 2 1 2 2 2 2 1 2 1 2 2 1 2 2 2 2 1 1 2 1 2 1 1 2 1 2 2 2 1 2 2 2 2 1 2 2
[213] 1 1 2 2 1 1 1 2 1 1 1 2 1 1 1 1 2 1 2 2 2 2 2 1 1 1 2
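Here we created a list called listy which contains a numeric value for each response to the categorical variable 'Sex'. Because Sex is a factor, as.numeric() returns the integer level codes: 1 for "female" and 2 for "male".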
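A short sketch of what as.numeric() does to a factor, using made-up vectors. The second half shows a case worth keeping in mind: when the levels are themselves numbers stored as text, the codes are not the original numbers.

```r
# Level codes follow the (alphabetical) level order: female = 1, male = 2
sex <- factor(c("female", "male", "female"))
sex_codes <- as.numeric(sex)
print(sex_codes)

# Hypothetical factor whose levels look like numbers
f <- factor(c("10", "2", "10"))
codes <- as.numeric(f)                   # level codes, not the original numbers
values <- as.numeric(as.character(f))    # the original numbers
print(codes)
print(values)
```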
listydf = as.data.frame(listy) #Convert the list to a data frame
# Create a new column in listydf called ID which contains a sequence of numbers from 1 to the number of rows in listydf
listydf$ID = seq.int(nrow(listydf))
str(listydf) #Assess the structure of listydf
'data.frame': 239 obs. of 2 variables:
$ c.1..2..1..2..1..2..1..2..1..2..1..1..2..2..1..2..2..1..2..2..: num 1 2 1 2 1 2 1 2 1 2 ...
$ ID : int 1 2 3 4 5 6 7 8 9 10 ...
#Create a new column in df called ID containing a sequence of numbers from 1 to the number of rows in df
df$ID = seq.int(nrow(df))
str(df) #Assess the structure of df
tibble [239 x 15] (S3: tbl_df/tbl/data.frame)
$ PassClass : Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
$ Survived : Factor w/ 2 levels "0","1": 2 2 1 1 1 2 2 1 2 1 ...
$ Name : chr [1:239] "Allen, Miss. Elisabeth Walton" "Allison, Master. Hudson Trevor" "Allison, Miss. Helen Loraine" "Allison, Mr. Hudson Joshua Creighton" ...
$ Sex : Factor w/ 2 levels "female","male": 1 2 1 2 1 2 1 2 1 2 ...
$ Age : num [1:239] 29 0.917 2 30 25 ...
$ SibsSpousesPres: Factor w/ 4 levels "0","1","2","3": 1 2 2 2 2 1 2 1 3 2 ...
$ KidsParentsPres: Factor w/ 5 levels "0","1","2","3",..: 1 3 3 3 3 1 1 1 1 1 ...
$ Ticket : chr [1:239] "24160" "113781" "113781" "113781" ...
$ Fare : num [1:239] 211 152 152 152 152 ...
$ Cabin : chr [1:239] "B5" "C22" "C22" "C22" ...
$ EmbarkedFrom : Factor w/ 3 levels "C","Q","S": 3 3 3 3 3 3 3 3 3 1 ...
$ Lifeboat : chr [1:239] "2" "11" "0" "0" ...
$ BodyNum : num [1:239] 0 0 0 135 0 0 0 0 0 124 ...
$ HomeDest : chr [1:239] "St Louis, MO" "Montreal, PQ / Chesterville, ON" "Montreal, PQ / Chesterville, ON" "Montreal, PQ / Chesterville, ON" ...
$ ID : int [1:239] 1 2 3 4 5 6 7 8 9 10 ...
- attr(*, "spec")=
.. cols(
.. pclass = col_double(),
.. survived = col_double(),
.. name = col_character(),
.. sex = col_character(),
.. age = col_double(),
.. sibsp = col_double(),
.. parch = col_double(),
.. ticket = col_character(),
.. fare = col_double(),
.. cabin = col_character(),
.. embarked = col_character(),
.. boat = col_character(),
.. body = col_double(),
.. home.dest = col_character()
.. )
#Join df and listydf using inner_join, joining them by their ID columns
Joineddf <- inner_join(df, listydf, by="ID", copy=TRUE)
Joineddf
Here we converted our list (listy) to a data frame using the as.data.frame function and named it listydf. We then added a column called ID to this data frame, holding the row number of each row, using the seq.int function with the nrow function wrapped inside it, and checked the structure of listydf to confirm it is a data frame. We added an ID column to our starting data frame (df) in the same way and checked its structure with the str() function. Finally, we merged the two data frames with the inner_join function, matching rows on the "ID" columns, and assigned the result to the variable "Joineddf". The copy=TRUE argument only matters when the two tables come from different data sources (for example a local data frame and a database table); for two local data frames it has no effect.
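The list-to-data-frame-to-join sequence can be sketched in base R as follows. This is an illustration under assumptions, not the code used in the report: `merge()` stands in for `inner_join()`, the data are made up, and the list element is given a name (`SexInt`) so that the resulting column gets a readable name instead of one generated from the values.

```r
# Hypothetical factor and a *named* list holding its level codes
sex <- factor(c("female", "male", "female"))
coded <- list(SexInt = as.numeric(sex))

# The list name becomes the column name
coded_df <- as.data.frame(coded)
coded_df$ID <- seq_len(nrow(coded_df))   # IDs 1, 2, 3

# A second table that only shares IDs 2 and 3 with the first
people <- data.frame(ID = 2:4, Name = c("b", "c", "d"), stringsAsFactors = FALSE)

# merge() keeps only rows whose ID appears in both tables (an inner join)
joined <- merge(people, coded_df, by = "ID")
print(joined)
```

When both tables have the same rows in the same order, as in the report, the join returns every row; the example uses partially overlapping IDs to show that unmatched rows are dropped.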
#Create a subset of the joined dataframe (Joineddf) using the first 10 observations & containing all variables.
Subsetdf <- Joineddf[1:10,1:length(Joineddf)]
Subsetdf <- head(Joineddf,10) # Here is another method to obtain the same result
Subsetdf <- subset(Joineddf[1:10,1:length(Joineddf)]) # And here is a third method to obtain the same result
Subsetdf #Have a look at what the Subsetdf looks like
Subsetdf1 <- as.matrix(Subsetdf) #Convert Subsetdf to a matrix
str(Subsetdf1) #Assess the structure of Subsetdf1
chr [1:10, 1:16] "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "0" "0" "0" "1" "1" "0" "1" "0" ...
- attr(*, "dimnames")=List of 2
..$ : NULL
..$ : chr [1:16] "PassClass" "Survived" "Name" "Sex" ...
There are multiple ways to subset our data frame, including the subset function, the head function, or indexing the rows and columns that we want from the data frame (Joineddf). Above I have illustrated all three and assigned the subset to the variable 'Subsetdf'. I then converted the data frame to a matrix using the as.matrix() function and checked its structure using the str() function. The matrix's type is character because all elements of a matrix must be of the same type (2), so R coerces every element to the most flexible type present, which in this case is character. All columns of a matrix must also be the same length, but this was not an issue for this example.
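The coercion can be seen on a tiny made-up data frame (`toy` is hypothetical): including one character column turns the whole matrix into character, while a numeric-only selection stays numeric.

```r
# Hypothetical data frame with two numeric columns and one character column
toy <- data.frame(
  age  = c(29, 30.5),
  fare = c(211.3, 151.6),
  name = c("a", "b"),
  stringsAsFactors = FALSE
)

# Mixed columns: every element is coerced to character
mixed <- as.matrix(toy)
print(typeof(mixed))

# Numeric columns only: the matrix keeps a numeric type
numeric_only <- as.matrix(toy[, c("age", "fare")])
print(typeof(numeric_only))
```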
colnames(Joineddf) #Assess the column names
[1] "PassClass"
[2] "Survived"
[3] "Name"
[4] "Sex"
[5] "Age"
[6] "SibsSpousesPres"
[7] "KidsParentsPres"
[8] "Ticket"
[9] "Fare"
[10] "Cabin"
[11] "EmbarkedFrom"
[12] "Lifeboat"
[13] "BodyNum"
[14] "HomeDest"
[15] "ID"
[16] "c.1..2..1..2..1..2..1..2..1..2..1..1..2..2..1..2..2..1..2..2.."
#Change/assign column names to each column in the data frame
colnames(Joineddf) <- c("PassClass", "Survived", "Name", "Sex", "Age", "SibsSpousesPres", "KidsParentsPres", "TicketNum", "Fare", "Cabin", "EmbarkedFrom", "Lifeboat", "BodyNum", "HomeDest", "ID", "SexInt")
#Subset the data frame including only the first and the last variable in the data set.
dfsubset <- Joineddf %>% select(PassClass, SexInt)
dfsubset <- subset(Joineddf, select=c("PassClass", "SexInt")) #Here is another method to achieve the same thing
dfsubset <- Joineddf[,c(1,16)] #And here is a third method to obtain the same result
dfsubset #Lets take a look at dfsubset
save(dfsubset, file="output/subsetofdf.RData") #Save the output as an R object file (.RData)
Above I demonstrated three ways to subset our data frame (Joineddf) to obtain its first and last columns. Before doing that, we checked the column names using the colnames() function and then used the same function to assign new names to the columns, which lets us refer to the columns we want directly by name. The three methods shown are the select function, the subset function, and indexing by column number. I assigned this subset to the variable "dfsubset" and saved it as a .RData file named "subsetofdf" in the output folder inside our working directory.
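A base R sketch of selecting the first and last columns and of a save/load round trip. The data frame `toy` is made up, and the file is written to a temporary directory here so the example does not depend on an existing output folder; note that save() fails if the target folder does not exist, so it may need to be created first with dir.create().

```r
# Hypothetical data frame with three columns
toy <- data.frame(PassClass = c("1", "2"), Age = c(29, 30), SexInt = c(1, 2),
                  stringsAsFactors = FALSE)

# First and last column by name, and by position using ncol()
by_name <- toy[, c("PassClass", "SexInt")]
by_pos <- toy[, c(1, ncol(toy))]
print(identical(by_name, by_pos))

# Save to a temporary .RData file and load it back into a fresh environment
path <- file.path(tempdir(), "subsetofdf.RData")
save(by_name, file = path)
restored <- new.env()
load(path, envir = restored)
print(names(restored$by_name))
```

Using ncol() instead of a hard-coded column number keeps the selection correct if columns are added later.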
# Create a new data frame called Q11DF, create a column in it called ID which contains a sequence of numbers from 1 to 239
Q11DF <- data.frame(ID=1:239)
#Determine where the breaks in our new ordinal variable column should be
min(round(df$Age, digits=0)) #Rounded minimum age is 1
[1] 1
max(df$Age) #Maximum age is 80
[1] 80
quantile(df$Age, probs = .25) #Quartile 1 = 26
25%
26
quantile(df$Age, probs= .75) #Quartile 3 = 49
75%
49
#Create a new ordinal variable column
#The level order is set inside factor(); assigning to levels() afterwards would only rename the alphabetically sorted levels and mislabel the rows
Q11DF$AgeBracket = factor(case_when(df$Age < 26 ~ "Young",
df$Age >= 26 & df$Age <= 49 ~ "MiddleAged",
df$Age > 49 ~ "Old"),
levels = c("Young", "MiddleAged", "Old")) #Order the levels of the factor column (AgeBracket)
levels(Q11DF$AgeBracket) #Now we can see that the order of the levels is correct
[1] "Young" "MiddleAged" "Old"
str(Q11DF$AgeBracket) #We can see it's a factor with 3 levels: "Young", "MiddleAged", "Old"
Factor w/ 3 levels "Young","MiddleAged",..: 2 1 1 2 1 2 3 2 3 2 ...
Q11DF$IntAge <- round(df$Age, digits=0) #Create a new column of integer values
str(Q11DF$IntAge) #We can see it's a numeric
num [1:239] 29 1 2 30 25 48 63 39 53 47 ...
FareVec <- df$Fare #Create a numeric vector with the data from the Fare column in our initial data frame (df)
str(FareVec) #We can see it's numeric
num [1:239] 211 152 152 152 152 ...
Q11DF <- as.data.frame(cbind(Q11DF,FareVec)) # Use cbind() to add this column to the new data frame
typeof(Q11DF) #Assess the type of Q11DF
[1] "list"
colnames(Q11DF) #Check the variable names
[1] "ID" "AgeBracket" "IntAge" "FareVec"
rownames(Q11DF) #Check the row names
[1] "1" "2" "3" "4" "5" "6" "7" "8" "9" "10" "11" "12" "13" "14" "15" "16" "17"
[18] "18" "19" "20" "21" "22" "23" "24" "25" "26" "27" "28" "29" "30" "31" "32" "33" "34"
[35] "35" "36" "37" "38" "39" "40" "41" "42" "43" "44" "45" "46" "47" "48" "49" "50" "51"
[52] "52" "53" "54" "55" "56" "57" "58" "59" "60" "61" "62" "63" "64" "65" "66" "67" "68"
[69] "69" "70" "71" "72" "73" "74" "75" "76" "77" "78" "79" "80" "81" "82" "83" "84" "85"
[86] "86" "87" "88" "89" "90" "91" "92" "93" "94" "95" "96" "97" "98" "99" "100" "101" "102"
[103] "103" "104" "105" "106" "107" "108" "109" "110" "111" "112" "113" "114" "115" "116" "117" "118" "119"
[120] "120" "121" "122" "123" "124" "125" "126" "127" "128" "129" "130" "131" "132" "133" "134" "135" "136"
[137] "137" "138" "139" "140" "141" "142" "143" "144" "145" "146" "147" "148" "149" "150" "151" "152" "153"
[154] "154" "155" "156" "157" "158" "159" "160" "161" "162" "163" "164" "165" "166" "167" "168" "169" "170"
[171] "171" "172" "173" "174" "175" "176" "177" "178" "179" "180" "181" "182" "183" "184" "185" "186" "187"
[188] "188" "189" "190" "191" "192" "193" "194" "195" "196" "197" "198" "199" "200" "201" "202" "203" "204"
[205] "205" "206" "207" "208" "209" "210" "211" "212" "213" "214" "215" "216" "217" "218" "219" "220" "221"
[222] "222" "223" "224" "225" "226" "227" "228" "229" "230" "231" "232" "233" "234" "235" "236" "237" "238"
[239] "239"
#Check the first 6 rows for all variables
head(Q11DF)
#Check the dimensions of the new data frame
dim(Q11DF)
[1] 239 4
Here we wanted to create a new data frame which contained one integer variable and one ordinal variable. First I created a data frame called "Q11DF" with a single column called "ID", containing the numbers from 1 to 239, which is the number of rows in our initial data frame. This column's values act like a row number.
From there I determined the rounded minimum value in the age column of our starting data frame (df). I rounded the ages to 0 decimal places to give whole numbers, using the round() function with the argument "digits=0" wrapped inside the min() function. I then determined the first and third quartiles of the age column with the quantile() function and its maximum value with the max() function.
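Both quartiles can be requested in one quantile() call by passing a vector of probabilities. The ages below are made up so that the result is easy to check by hand; they are not the Titanic ages.

```r
# Hypothetical ages
age <- c(1, 20, 26, 30, 49, 60, 80)

# First and third quartile in a single call (default interpolation method)
q <- quantile(age, probs = c(0.25, 0.75))
print(q)

# Minimum and maximum together
age_range <- range(age)
print(age_range)
```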
I then created a new categorical column in our new data frame (Q11DF) called "AgeBracket". This column's values were based on the values in df$Age: passengers under 26 were categorized as "Young", passengers aged 26 to 49 inclusive were categorized as "MiddleAged" and passengers over 49 were categorized as "Old". I created this column using the case_when() function on the "Age" column from "df" and made it a factor.
I set the order of the levels of this factor column to "Young", "MiddleAged" and "Old", checked the order with the levels() function, and confirmed that the column was a factor with the str() function. After that I created another new column in Q11DF called "IntAge", containing the values from the Age column of our starting data frame (df) rounded to whole numbers using the round() function with the argument "digits=0". The str() function shows this column is numeric, because round() returns whole-number doubles rather than R's integer type.
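One detail worth checking when ordering factor levels: assigning to levels() renames the existing levels in their current (alphabetical) order, whereas the levels argument of factor() reorders them without changing which label each row carries. The sketch below uses made-up ages and nested ifelse() calls in place of case_when() so that it runs in base R.

```r
# Hypothetical ages and the intended labels
age <- c(29, 0.917, 2, 30, 63)
label <- ifelse(age < 26, "Young", ifelse(age <= 49, "MiddleAged", "Old"))

# as.factor() sorts levels alphabetically: "MiddleAged", "Old", "Young"
renamed <- as.factor(label)
levels(renamed) <- c("Young", "MiddleAged", "Old")   # renames, so rows get wrong labels

# factor(levels = ...) sets the order and keeps each row's own label
reordered <- factor(label, levels = c("Young", "MiddleAged", "Old"))

print(data.frame(age, label, renamed, reordered))
```

Comparing the new column against the source ages for a few rows is a quick way to confirm the labels still match.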
I then created a numeric vector called "FareVec" which contained the values from the Fare column of our initial data frame (df). I confirmed its structure was numeric using the str() function.
I then used the cbind() function to bind the values in "FareVec" as a column of Q11DF, wrapping the call in as.data.frame() and re-assigning the result to the "Q11DF" variable we created earlier. I then checked the type of Q11DF with the typeof() function, which returns "list" because a data frame is stored internally as a list of equal-length columns; class(Q11DF) is what reports "data.frame". I then checked the column names of Q11DF using the colnames() function and its row names using the rownames() function. I then looked at the first 6 rows of "Q11DF" using the head() function and confirmed its dimensions using the dim() function.
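The difference between typeof() and class(), and between rounded doubles and true integers, can be checked on a small made-up data frame (`toy` is hypothetical):

```r
# Hypothetical data frame
toy <- data.frame(ID = 1:3, Age = c(29, 0.917, 2))

# A data frame is stored as a list, so typeof() reports "list"
print(typeof(toy))
print(class(toy))
print(is.data.frame(toy))

# round() returns doubles; as.integer() converts them to the integer type
rounded <- round(toy$Age, digits = 0)
as_int <- as.integer(rounded)
print(is.integer(rounded))
print(is.integer(as_int))
```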
#Create a new data frame named Q12DF, containing one column named "ID" which contains a sequence of values from 1 to 239.
Q12DF <- data.frame(ID=1:239)
#Merge the two data frames using inner_join and assign it to a variable named Q12df
Q12df <- inner_join(Q11DF,Q12DF, by="ID", copy=TRUE)
Q12df
Here I created a new data frame called "Q12DF" using the data.frame() function. It contains one column named ID, holding a sequence of numbers from 1 to 239. A column with the same name and the same values, in the same order, exists in our previous data frame (Q11DF). I then merged these two data frames using the inner_join() function, matching rows on the "ID" columns, and assigned the result to a new data frame named "Q12df" (note the lower case df). The copy=TRUE argument is only needed when the two tables come from different data sources, so it has no effect here. You can see the output of the data frame Q12df above.