Let’s create a data frame using data.frame():
> df <- data.frame(person=c("Ripon","Mina","Meghla","Ridoy","Saad"),
+ age=c(22,20,17,24,19),
+ gender=factor(c("M","F","F","M","M")))
> df
person age gender
1 Ripon 22 M
2 Mina 20 F
3 Meghla 17 F
4 Ridoy 24 M
5 Saad 19 M
To get the structure of data frame use str():
> str(df)
'data.frame': 5 obs. of 3 variables:
$ person: chr "Ripon" "Mina" "Meghla" "Ridoy" ...
$ age : num 22 20 17 24 19
$ gender: Factor w/ 2 levels "F","M": 2 1 1 2 2
To know the summary statistics of the data frame use summary():
> summary(df)
person age gender
Length:5 Min. :17.0 F:2
Class :character 1st Qu.:19.0 M:3
Mode :character Median :20.0
Mean :20.4
3rd Qu.:22.0
Max. :24.0
> dim(df)
[1] 5 3
> nrow(df)
[1] 5
> ncol(df)
[1] 3
To change the factor values of a colum-
Let’s change F to Female and M to Male. Here we can see gender has two levels:
> df$gender
[1] M F F M M
Levels: F M
> levels(df$gender)
[1] "F" "M"
Now simply replace the levels as desired:
> levels(df$gender) <- c("Female","Male")
Let’s see the result:
> df
person age gender
1 Ripon 22 Male
2 Mina 20 Female
3 Meghla 17 Female
4 Ridoy 24 Male
5 Saad 19 Male
> df$age
[1] 22 20 17 24 19
> df[,'age'] # notice comma ( df[row,col] )
[1] 22 20 17 24 19
> df[,2]
[1] 22 20 17 24 19
> df[2] # no comma for selecting 2nd column
age
1 22
2 20
3 17
4 24
5 19
> df['age']
age
1 22
2 20
3 17
4 24
5 19
> df[c('person','age')] # notice: no comma
person age
1 Ripon 22
2 Mina 20
3 Meghla 17
4 Ridoy 24
5 Saad 19
> df[c(1,2)]
person age
1 Ripon 22
2 Mina 20
3 Meghla 17
4 Ridoy 24
5 Saad 19
> df[2,] # Selecting single row
person age gender
2 Mina 20 Female
> df[c(3:5),] # notice: comma
person age gender
3 Meghla 17 Female
4 Ridoy 24 Male
5 Saad 19 Male
Selecting observations with age greater than 18:
> df[df$age > 18,] # notice comma
person age gender
1 Ripon 22 Male
2 Mina 20 Female
4 Ridoy 24 Male
5 Saad 19 Male
Selecting only Males:
> df[df$gender == 'M',]
[1] person age gender
<0 rows> (or 0-length row.names)
Selecting Males and age greater than 18:
> df[df$gender == 'M' & df$age > 18,]
[1] person age gender
<0 rows> (or 0-length row.names)
Selecting Males and age greater than 18 and removing the gender column:
> df[df$gender == 'M' & df$age > 18,-3]
[1] person age
<0 rows> (or 0-length row.names)
Selecting Males and age greater than 18 and removing the gender column using subset():
> subset(x= df, gender=='M', select=-gender)
[1] person age
<0 rows> (or 0-length row.names)
> subset(x= df, age>18 & gender=='M', select=-gender)
[1] person age
<0 rows> (or 0-length row.names)
Another clever way to do this is using column names by names():
> names(df)
[1] "person" "age" "gender"
> names(df) %in% 'gender' #checks which col. name matches 'gender'
[1] FALSE FALSE TRUE
> !names(df) %in% 'gender'
[1] TRUE TRUE FALSE
> df[df$age > 18, !(names(df) %in% 'gender')] #selects only the TRUE columns
person age
1 Ripon 22
2 Mina 20
4 Ridoy 24
5 Saad 19
cbind() can be used to bind two different datasets by column.
Let’s make two data frames d1 and d2 with same number of rows:
> SL<-1:6
> name<-c('Abdul','Rafiq','Selim','Ataur','Moin','Lara')
> d1<-data.frame(SL,name)
> d1
SL name
1 1 Abdul
2 2 Rafiq
3 3 Selim
4 4 Ataur
5 5 Moin
6 6 Lara
> marks<-c(67,78,80,70,30,80)
> posi<-c(4,2,1,3,5,1)
> d2<-data.frame(marks,posi)
> d2
marks posi
1 67 4
2 78 2
3 80 1
4 70 3
5 30 5
6 80 1
Now simply bind these two data frames using cbind():
> cb<-cbind(d1,d2)
> cb
SL name marks posi
1 1 Abdul 67 4
2 2 Rafiq 78 2
3 3 Selim 80 1
4 4 Ataur 70 3
5 5 Moin 30 5
6 6 Lara 80 1
Let’s see the class of cb:
> class(cb)
[1] "data.frame"
Again let’s say we have the following vectors
> person <- c("Ripon","Mina","Chris","Shad")
> age <- c(22,20,24,19)
> gender <- c("Male","Female","Male","Male")
In this case we can use cbind (column bind) to bind these vectors into a metrix and then converting them using as.data.frame() into a data frame:
> dfm<-cbind(person,age,gender)
> dfm
person age gender
[1,] "Ripon" "22" "Male"
[2,] "Mina" "20" "Female"
[3,] "Chris" "24" "Male"
[4,] "Shad" "19" "Male"
> class(dfm)
[1] "matrix" "array"
So the result comes out as a 2D array. Using as.data.frame():
> dfm<-as.data.frame(dfm)
> dfm
person age gender
1 Ripon 22 Male
2 Mina 20 Female
3 Chris 24 Male
4 Shad 19 Male
> class(dfm)
[1] "data.frame"
rbind can be used in binding different data frames with same number of columns.
Let’s make two data frame:
> SL<-1:3
> name<-c('Abdul','Rafiq','Selim')
> marks<-c(67,78,80)
> posi<-c(4,2,1)
> df_from_source_1<-data.frame(SL,name,marks,posi)
> df_from_source_1
SL name marks posi
1 1 Abdul 67 4
2 2 Rafiq 78 2
3 3 Selim 80 1
> SL<-4:6
> name<-c('Ataur','Moin','Lara')
> marks<-c(70,30,80)
> posi<-c(3,5,1)
> df_from_source_2<-data.frame(SL,name,marks,posi)
> df_from_source_2
SL name marks posi
1 4 Ataur 70 3
2 5 Moin 30 5
3 6 Lara 80 1
“Notice that the columns name must match and the number of columns must be equal”
Now bind these data frames by row using rbind():
> rdf<- rbind(df_from_source_1,df_from_source_2)
> rdf
SL name marks posi
1 1 Abdul 67 4
2 2 Rafiq 78 2
3 3 Selim 80 1
4 4 Ataur 70 3
5 5 Moin 30 5
6 6 Lara 80 1
Using the function attach() this can be done. For example take the data frame rdf that we created above -
> rdf
SL name marks posi
1 1 Abdul 67 4
2 2 Rafiq 78 2
3 3 Selim 80 1
4 4 Ataur 70 3
5 5 Moin 30 5
6 6 Lara 80 1
If we want to select a column as vector we would need to do it using dataframename$columnname -
> rdf$name
[1] "Abdul" "Rafiq" "Selim" "Ataur" "Moin" "Lara"
If we run only name we will get an Error saying “Error: object ‘name’ not found”.
By attaching the data frame we can do the work easily-
> attach(rdf)
> name
[1] "Ataur" "Moin" "Lara"
> marks
[1] 70 30 80
Using search we can see everything that is in the current search path of the global environment -
> search()
[1] ".GlobalEnv" "rdf" "package:stats"
[4] "package:graphics" "package:grDevices" "package:utils"
[7] "package:datasets" "package:methods" "Autoloads"
[10] "package:base"
To detach anything from the search path, for example say the rdf data frame -
> detach("rdf", unload=T)
Now rdf is not in the search path -
> search()
[1] ".GlobalEnv" "package:stats" "package:graphics"
[4] "package:grDevices" "package:utils" "package:datasets"
[7] "package:methods" "Autoloads" "package:base"
And we cannot call any column by the column name only. We have to use $ this method again.
Let’s create two data frames:
> # data frame 1
> df1 = data.frame(CustomerId = c(1:6),
+ Product = c("Oven","Television","Mobile","WashingMachine","Lightings","Ipad"))
> df1
CustomerId Product
1 1 Oven
2 2 Television
3 3 Mobile
4 4 WashingMachine
5 5 Lightings
6 6 Ipad
> # data frame 2
> df2 = data.frame(CustomerId = c(1, 4, 6, 8),
+ State = c("California","Santiago","Texas","Indiana"))
> df2
CustomerId State
1 1 California
2 4 Santiago
3 6 Texas
4 8 Indiana
Joins all the observations that match in both data frames by CustomerId
> merge(x=df1, y=df2, by="CustomerId")
CustomerId Product State
1 1 Oven California
2 4 WashingMachine Santiago
3 6 Ipad Texas
Joins all the observations by CustomerId
> merge(x=df1, y=df2, by="CustomerId", all=TRUE)
CustomerId Product State
1 1 Oven California
2 2 Television <NA>
3 3 Mobile <NA>
4 4 WashingMachine Santiago
5 5 Lightings <NA>
6 6 Ipad Texas
7 8 <NA> Indiana
Joins all the observations of df1 to df2 by CustomerId
> merge(x=df1, y=df2, by="CustomerId", all.x=TRUE)
CustomerId Product State
1 1 Oven California
2 2 Television <NA>
3 3 Mobile <NA>
4 4 WashingMachine Santiago
5 5 Lightings <NA>
6 6 Ipad Texas
Joins df1 to all the observations of df2 by CustomerId
> merge(x=df1,y=df2,by="CustomerId",all.y=TRUE)
CustomerId Product State
1 1 Oven California
2 4 WashingMachine Santiago
3 6 Ipad Texas
4 8 <NA> Indiana
A Cross Join (also sometimes known as a Cartesian Join) results in every row of one table being joined to every row of another table
> merge(x = df1, y = df2, by=NULL)
CustomerId.x Product CustomerId.y State
1 1 Oven 1 California
2 2 Television 1 California
3 3 Mobile 1 California
4 4 WashingMachine 1 California
5 5 Lightings 1 California
6 6 Ipad 1 California
7 1 Oven 4 Santiago
8 2 Television 4 Santiago
9 3 Mobile 4 Santiago
10 4 WashingMachine 4 Santiago
11 5 Lightings 4 Santiago
12 6 Ipad 4 Santiago
13 1 Oven 6 Texas
14 2 Television 6 Texas
15 3 Mobile 6 Texas
16 4 WashingMachine 6 Texas
17 5 Lightings 6 Texas
18 6 Ipad 6 Texas
19 1 Oven 8 Indiana
20 2 Television 8 Indiana
21 3 Mobile 8 Indiana
22 4 WashingMachine 8 Indiana
23 5 Lightings 8 Indiana
24 6 Ipad 8 Indiana
Know joining and merging in data frames using dplyr package from here
> df3 <- data.frame(SL = 1:5,
+ CustomerId = 1:5,
+ Product = c("Oven","Television","Mobile","WashingMachine","Lightings"))
> df4 <- data.frame(SL = 392:396,
+ CustomerId = 1:5,
+ Location = c("California","Santiago","Texas","Indiana","New York"))
>
> merge(x = df3 , y = df4, by = "CustomerId")
CustomerId SL.x Product SL.y Location
1 1 1 Oven 392 California
2 2 2 Television 393 Santiago
3 3 3 Mobile 394 Texas
4 4 4 WashingMachine 395 Indiana
5 5 5 Lightings 396 New York
> merge(x = df3 , y = df4, by = "CustomerId", suffixes = c("df3","df4"))
CustomerId SLdf3 Product SLdf4 Location
1 1 1 Oven 392 California
2 2 2 Television 393 Santiago
3 3 3 Mobile 394 Texas
4 4 4 WashingMachine 395 Indiana
5 5 5 Lightings 396 New York
Calculate mean column wise -
> set.seed(0)
> df1 <- data.frame(a = 1:10,
+ b = round(rnorm(10, 5, 1),1))
> df1
a b
1 1 6.3
2 2 4.7
3 3 6.3
4 4 6.3
5 5 5.4
6 6 3.5
7 7 4.1
8 8 4.7
9 9 5.0
10 10 7.4
> colMeans(df1)
a b
5.50 5.37
Add the means to the data frame -
> temp <- df1
> temp["MeanGen",] <- colMeans(df1) # adding a new column with column wise means
> temp
a b
1 1.0 6.30
2 2.0 4.70
3 3.0 6.30
4 4.0 6.30
5 5.0 5.40
6 6.0 3.50
7 7.0 4.10
8 8.0 4.70
9 9.0 5.00
10 10.0 7.40
MeanGen 5.5 5.37
Calculate mean row wise -
> rowMeans(df1)
[1] 3.65 3.35 4.65 5.15 5.20 4.75 5.55 6.35 7.00 8.70
Add the means to the data frame -
> temp <- df1
> temp["MeanGen"] <- rowMeans(df1) # adding a new row with row wise means
> temp
a b MeanGen
1 1 6.3 3.65
2 2 4.7 3.35
3 3 6.3 4.65
4 4 6.3 5.15
5 5 5.4 5.20
6 6 3.5 4.75
7 7 4.1 5.55
8 8 4.7 6.35
9 9 5.0 7.00
10 10 7.4 8.70
Creating a new column can be done using the mutate function from dplyr package -
> library(dplyr)
> df1 %>%
+ mutate(a,b,MeanDpl = rowMeans(df1))
a b MeanDpl
1 1 6.3 3.65
2 2 4.7 3.35
3 3 6.3 4.65
4 4 6.3 5.15
5 5 5.4 5.20
6 6 3.5 4.75
7 7 4.1 5.55
8 8 4.7 6.35
9 9 5.0 7.00
10 10 7.4 8.70
> apply(df1, 2, FUN=mean) # applies function 'mean' to 2nd dimension (columns)
a b
5.50 5.37
> apply(df1, 1, FUN=mean) # applies function to 1st dimension (rows)
[1] 3.65 3.35 4.65 5.15 5.20 4.75 5.55 6.35 7.00 8.70
> sapply(df1, FUN=mean) # also takes mean of columns, treating data frame like list of vectors
a b
5.50 5.37
> lapply(df1, FUN=mean) # returns a list with column wise means
$a
[1] 5.5
$b
[1] 5.37
> df2 <- data.frame(
+ a = rep(10,5),
+ b = round(rnorm(5,10,3)),
+ c = c(2,4,NA,8,NA)
+ )
> df2
a b c
1 10 12 2
2 10 8 4
3 10 7 NA
4 10 9 8
5 10 9 NA
The above data frame has two NA/ uncomplete rows. To get the data frame with complete rows run the following command -
> df2[complete.cases(df2),]
a b c
1 10 12 2
2 10 8 4
4 10 9 8
> df5 <- data.frame(SL = 1:10,
+ vals = c("a",NA,NA,"b",NA,NA,NA,"c",NA,NA))
> df5
SL vals
1 1 a
2 2 <NA>
3 3 <NA>
4 4 b
5 5 <NA>
6 6 <NA>
7 7 <NA>
8 8 c
9 9 <NA>
10 10 <NA>
> df5n <- tidyr::fill(data = df5, vals, .direction = "down")
> cbind(df5, vals_filled = df5n$vals)
SL vals vals_filled
1 1 a a
2 2 <NA> a
3 3 <NA> a
4 4 b b
5 5 <NA> b
6 6 <NA> b
7 7 <NA> b
8 8 c c
9 9 <NA> c
10 10 <NA> c
> df5[is.na(df5$vals),"vals"] <- 0
> df5
SL vals
1 1 a
2 2 0
3 3 0
4 4 b
5 5 0
6 6 0
7 7 0
8 8 c
9 9 0
10 10 0
Let’s see the data -
> Data <- read.table("multiple reponse.txt", header = T, fill = T)
> Data
ID Response
1 1 a,b
2 2
3 3 b,c
4 4 a,d
5 5 a,c
6 6 a,b
7 7 a,b
8 8 b,d
9 9
10 10 a
11 11 a,b
12 12 b,c
Creating dummy variables from this column -
> Data$Responses <- as.character(Data$Response)
> # splitting the responses
> resp.split <- strsplit(Data$Responses, split = ",")
> lev <- unique(unlist(resp.split)) # taking unique values for column
> # creating dummy
> resp.dummy <- t(sapply(resp.split,
+ FUN = function(x) table(factor(x,
+ levels=lev))))
> # assigning it to a data frame
> Data2 <- with(Data, cbind(Responses, data.frame(resp.dummy)))
> Data2
Responses a b c d
1 a,b 1 1 0 0
2 0 0 0 0
3 b,c 0 1 1 0
4 a,d 1 0 0 1
5 a,c 1 0 1 0
6 a,b 1 1 0 0
7 a,b 1 1 0 0
8 b,d 0 1 0 1
9 0 0 0 0
10 a 1 0 0 0
11 a,b 1 1 0 0
12 b,c 0 1 1 0
This solution of dummy variable creation from multiple responses is taken from here.