In the exercises below we cover the basics of data frames.
5.2.2.1 Consider two vectors: x=seq(1,43,along.with=Id) and y=seq(-20,0,along.with=Id)
Create a data.frame df:
df
Id Letter x y
1 1 a 1.000000 -20.000000
2 1 b 4.818182 -18.181818
3 1 c 8.636364 -16.363636
4 2 a 12.454545 -14.545455
5 2 b 16.272727 -12.727273
6 2 c 20.090909 -10.909091
7 3 a 23.909091 -9.090909
8 3 b 27.727273 -7.272727
9 3 c 31.545455 -5.454545
10 4 a 35.363636 -3.636364
11 4 b 39.181818 -1.818182
12 4 c 43.000000 0.000000
Id <- rep(1:4, each = 3)
x <- seq(1, 43, along.with = Id)
y <- seq(-20, 0, along.with = Id)
Letter <- rep(letters[1:3], 4)
df <- data.frame(Id, Letter, x, y)
df
## Id Letter x y
## 1 1 a 1.000000 -20.000000
## 2 1 b 4.818182 -18.181818
## 3 1 c 8.636364 -16.363636
## 4 2 a 12.454545 -14.545455
## 5 2 b 16.272727 -12.727273
## 6 2 c 20.090909 -10.909091
## 7 3 a 23.909091 -9.090909
## 8 3 b 27.727273 -7.272727
## 9 3 c 31.545455 -5.454545
## 10 4 a 35.363636 -3.636364
## 11 4 b 39.181818 -1.818182
## 12 4 c 43.000000 0.000000
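The along.with argument makes seq return as many evenly spaced values as its argument has elements. A minimal sketch, assuming the same Id vector as in the solution, shows that this matches passing length.out = length(Id). The names x_along, x_length and same are chosen for this illustration only.

```r
# Same Id vector as in the solution above.
Id <- rep(1:4, each = 3)

# along.with takes the output length from another object ...
x_along <- seq(1, 43, along.with = Id)

# ... which is equivalent to giving that length explicitly.
x_length <- seq(1, 43, length.out = length(Id))

# Compare the two results (all.equal tolerates floating-point noise).
same <- isTRUE(all.equal(x_along, x_length))
same
```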
5.2.2.2 From the previous data frame df, create this data frame:
Id x.a y.a x.b y.b x.c y.c
1 1 1.00000 -20.000000 4.818182 -18.181818 8.636364 -16.363636
4 2 12.45455 -14.545455 16.272727 -12.727273 20.090909 -10.909091
7 3 23.90909 -9.090909 27.727273 -7.272727 31.545455 -5.454545
10 4 35.36364 -3.636364 39.181818 -1.818182 43.000000 0.000000
Hint: Check the reshape function.
reshape(df,timevar='Letter',idvar='Id',direction='wide') #reshaping the data from long to wide.
## Id x.a y.a x.b y.b x.c y.c
## 1 1 1.00000 -20.000000 4.818182 -18.181818 8.636364 -16.363636
## 4 2 12.45455 -14.545455 16.272727 -12.727273 20.090909 -10.909091
## 7 3 23.90909 -9.090909 27.727273 -7.272727 31.545455 -5.454545
## 10 4 35.36364 -3.636364 39.181818 -1.818182 43.000000 0.000000
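The conversion can also be reversed. The sketch below rebuilds df, reshapes it to wide format as in the solution, and then calls reshape again with direction = "long". It relies on the "reshapeWide" attribute that reshape stores on its result, so it only applies to a wide data frame that reshape itself produced. The names wide and long are chosen for this illustration.

```r
# Rebuild the long data frame from exercise 5.2.2.1.
Id <- rep(1:4, each = 3)
df <- data.frame(Id = Id,
                 Letter = rep(letters[1:3], 4),
                 x = seq(1, 43, along.with = Id),
                 y = seq(-20, 0, along.with = Id))

# Long to wide, as in the solution.
wide <- reshape(df, timevar = "Letter", idvar = "Id", direction = "wide")

# Wide back to long: reshape reuses the "reshapeWide" attribute stored on
# `wide`, so the varying columns do not have to be listed again.
long <- reshape(wide, direction = "long")

dim(wide)
dim(long)
```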
5.2.2.3 Create two data frames df1 and df2:
df1
Id Age
1 1 14
2 2 12
3 3 15
4 4 10
df2
Id Sex Code
1 1 F a
2 2 M b
3 3 M c
4 4 F d
Id <- c(1:4)
Age <- c(14,12,15,10)
df1 <- data.frame(Id,Age)
Sex <- c("F","M","M","F")
Code <- letters[1:4]
df2 <- data.frame(Id,Sex,Code)
From df1 and df2 create M:
M
Id Age Sex Code
1 1 14 F a
2 2 12 M b
3 3 15 M c
4 4 10 F d
M <- merge(df1,df2, by = "Id")
M
## Id Age Sex Code
## 1 1 14 F a
## 2 2 12 M b
## 3 3 15 M c
## 4 4 10 F d
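In this exercise every Id appears in both data frames, so the default behaviour of merge is enough. As a hedged illustration of what happens otherwise, the sketch below uses df2_partial, a hypothetical variant of df2 without Id 4, and compares the default merge with all.x = TRUE.

```r
df1 <- data.frame(Id = 1:4, Age = c(14, 12, 15, 10))

# Hypothetical variant of df2 in which Id 4 is missing.
df2_partial <- data.frame(Id = 1:3,
                          Sex = c("F", "M", "M"),
                          Code = letters[1:3])

# Default: only the Ids present in both data frames are kept.
inner <- merge(df1, df2_partial, by = "Id")

# all.x = TRUE keeps every row of df1 and fills the missing values with NA.
left <- merge(df1, df2_partial, by = "Id", all.x = TRUE)

inner
left
```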
5.2.2.4 Create a data frame df3:
df3
id2 score
1 4 100
2 3 98
3 2 94
4 1 99
From M and df3 create N:
Id Age Sex Code score
1 1 14 F a 99
2 2 12 M b 94
3 3 15 M c 98
4 4 10 F d 100
id2 <- 4:1
score <- c(100,98,94,99)
df3 <- data.frame(id2,score)
N=merge(M,df3,by.x='Id',by.y='id2')
N
## Id Age Sex Code score
## 1 1 14 F a 99
## 2 2 12 M b 94
## 3 3 15 M c 98
## 4 4 10 F d 100
5.2.2.5 Consider the previous data frame N:
1) Remove the variables Sex and Code
N[,c("Sex")]=NULL
N[,c("Code")]=NULL
2) From N, create a data frame:
values ind
1 1 Id
2 2 Id
3 3 Id
4 4 Id
5 14 Age
6 12 Age
7 15 Age
8 10 Age
9 99 score
10 94 score
11 98 score
12 100 score
Hint: The stack function concatenates the columns of a data frame into a single values column and records the source column of each value in ind.
stack(N)
## values ind
## 1 1 Id
## 2 2 Id
## 3 3 Id
## 4 4 Id
## 5 14 Age
## 6 12 Age
## 7 15 Age
## 8 10 Age
## 9 99 score
## 10 94 score
## 11 98 score
## 12 100 score
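The inverse operation is unstack. The sketch below rebuilds N after the removal of Sex and Code, stacks it, and unstacks the result again. The names long and wide are chosen for this illustration. Note that stack puts all values into one numeric vector, so the integer Id column comes back as double.

```r
# N as it looks after removing Sex and Code.
N <- data.frame(Id = 1:4,
                Age = c(14, 12, 15, 10),
                score = c(99, 94, 98, 100))

# One row per value, with the source column recorded in `ind`.
long <- stack(N)

# unstack splits `values` by `ind` and returns one column per group.
wide <- unstack(long)
wide
```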
5.2.2.6 For this exercise, we’ll use the (built-in) dataset trees.
a) Make sure the object is a data frame; if it is not, convert it to one.
str(trees)
## 'data.frame': 31 obs. of 3 variables:
## $ Girth : num 8.3 8.6 8.8 10.5 10.7 10.8 11 11 11.1 11.2 ...
## $ Height: num 70 65 63 72 81 83 66 75 80 75 ...
## $ Volume: num 10.3 10.3 10.2 16.4 18.8 19.7 15.6 18.2 22.6 19.9 ...
A <- trees
# str() shows that trees is already a data.frame, so no conversion is needed.
b) Create a new data frame A:
A
Girth Height Volume
mean_tree 13.24839 76 30.17097
min_tree 8.30000 63 10.20000
max_tree 20.60000 87 77.00000
sum_tree 410.70000 2356 935.30000
Hint: Instead of using separate statements or loops, we can use the apply function to calculate the mean, min, max and sum of each column or row.
mean_tree <- apply(trees, 2, mean)
max_tree <- apply(trees, 2, max)
min_tree <- apply(trees, 2, min)
sum_tree <- apply(trees, 2, sum)
A <- data.frame(mean_tree, min_tree, max_tree, sum_tree) # One column per statistic; the expected table is the transpose of A.
A <- as.data.frame(t(A)) # t() returns a matrix, so convert the result back to a data frame.
A
## Girth Height Volume
## mean_tree 13.24839 76 30.17097
## min_tree 8.30000 63 10.20000
## max_tree 20.60000 87 77.00000
## sum_tree 410.70000 2356 935.30000
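The four apply calls can also be folded into one pass. This is an alternative sketch, not the solution given above: summarise_column and A2 are names invented for this illustration, and sapply is used in place of apply because it iterates directly over the columns of a data frame.

```r
# Return all four summaries for one numeric vector.
# The element names become the row names of the result.
summarise_column <- function(col) {
  c(mean_tree = mean(col),
    min_tree = min(col),
    max_tree = max(col),
    sum_tree = sum(col))
}

# sapply calls the helper once per column of trees and binds the
# results into a 4 x 3 matrix; as.data.frame turns it into a data frame.
A2 <- as.data.frame(sapply(trees, summarise_column))
A2
```

Because the statistics already end up in rows, no transpose is needed.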
5.2.2.7 Consider the data frame A:
1) Order the entire data frame by the first column.
A[order(A[,1]),]
## Girth Height Volume
## min_tree 8.30000 63 10.20000
## mean_tree 13.24839 76 30.17097
## max_tree 20.60000 87 77.00000
## sum_tree 410.70000 2356 935.30000
2) Rename the rows as follows: mean, min, max, tree
row.names(A)
## [1] "mean_tree" "min_tree" "max_tree" "sum_tree"
row.names(A) <- c("mean","min","max","tree")
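order also accepts several sort keys, and negating a numeric key reverses its direction. A small sketch on the built-in trees data, where sorted is a name chosen for this illustration:

```r
# Sort trees by Height, tallest first; ties are broken by Girth, ascending.
sorted <- trees[order(-trees$Height, trees$Girth), ]

# Show the three tallest trees.
head(sorted, 3)
```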
5.2.2.8 Create an empty data frame with column types:
df
[1] Ints Logicals Doubles Characters
<0 rows> (or 0-length row.names)
df <- data.frame(Ints=integer(), Logicals=logical(),Doubles=double(),Characters=character())
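To check that the empty data frame really carries the intended column types, one can inspect the class of each column. The sketch below does that and then appends a single row with rbind. The names df_empty, col_types, new_row and df_one are chosen for this illustration, and stringsAsFactors = FALSE is added as an assumption so that the character column is not converted to a factor on R versions before 4.0.

```r
# Empty data frame with one typed, zero-length vector per column.
df_empty <- data.frame(Ints = integer(),
                       Logicals = logical(),
                       Doubles = double(),
                       Characters = character(),
                       stringsAsFactors = FALSE)

# Class of each column.
col_types <- sapply(df_empty, class)
col_types

# Append one row whose values match the column types.
new_row <- data.frame(Ints = 1L, Logicals = TRUE, Doubles = 2.5,
                      Characters = "a", stringsAsFactors = FALSE)
df_one <- rbind(df_empty, new_row)
nrow(df_one)
```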
5.2.2.9 Create a data frame XY where X=c(1,2,3,1,4,5,2) and Y=c(0,3,2,0,5,9,3)
XY
X Y
1 1 0
2 2 3
3 3 2
4 1 0
5 4 5
6 5 9
7 2 3
XY <- data.frame(X=c(1,2,3,1,4,5,2),Y=c(0,3,2,0,5,9,3))
XY
## X Y
## 1 1 0
## 2 2 3
## 3 3 2
## 4 1 0
## 5 4 5
## 6 5 9
## 7 2 3
1) Look at the duplicated rows using a built-in R function.
duplicated(XY) # TRUE means a duplicated row.
## [1] FALSE FALSE FALSE TRUE FALSE FALSE TRUE
2) Keep only the unique rows of XY using a built-in R function.
unique(XY) # The duplicated 4th and 7th rows are dropped.
## X Y
## 1 1 0
## 2 2 3
## 3 3 2
## 5 4 5
## 6 5 9
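The two functions are closely related: indexing with !duplicated(XY) keeps the same rows as unique(XY). The sketch below shows this, together with the fromLast argument, which marks the earlier copies instead of the later ones. The names dup_first, dup_last and XY_unique are chosen for this illustration.

```r
XY <- data.frame(X = c(1, 2, 3, 1, 4, 5, 2),
                 Y = c(0, 3, 2, 0, 5, 9, 3))

# TRUE for every row that repeats an earlier row (rows 4 and 7).
dup_first <- duplicated(XY)

# With fromLast = TRUE the scan runs from the bottom, so the earlier
# copies (rows 1 and 2) are the ones marked as duplicates.
dup_last <- duplicated(XY, fromLast = TRUE)

# Dropping the flagged rows keeps the same rows as unique(XY).
XY_unique <- XY[!dup_first, ]
XY_unique
```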
5.2.2.10 For this exercise, we’ll use the (built-in) dataset Titanic.
a) Make sure the object is a data frame; if it is not, convert it to one.
str(Titanic)
## table [1:4, 1:2, 1:2, 1:2] 0 0 35 0 0 0 17 0 118 154 ...
## - attr(*, "dimnames")=List of 4
## ..$ Class : chr [1:4] "1st" "2nd" "3rd" "Crew"
## ..$ Sex : chr [1:2] "Male" "Female"
## ..$ Age : chr [1:2] "Child" "Adult"
## ..$ Survived: chr [1:2] "No" "Yes"
Tit <- data.frame(Titanic)
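b) Define a data frame df containing the rows with Class equal to 1st and Survived equal to No, keeping only the variables Sex, Age and Freq.
df <- subset(Tit, subset = Class == '1st' & Survived == 'No', select = c(Sex, Age, Freq))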
df
## Sex Age Freq
## 1 Male Child 0
## 5 Female Child 0
## 9 Male Adult 118
## 13 Female Adult 4
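Once the table is a data frame, the Freq column can be summed by group. As a sketch that goes slightly beyond the exercise, the code below uses aggregate with a formula to total the counts per value of Survived. The name survival is chosen for this illustration.

```r
# Convert the 4-dimensional table into a data frame with one row per cell.
Tit <- as.data.frame(Titanic)

# Sum the cell counts within each level of Survived.
survival <- aggregate(Freq ~ Survived, data = Tit, FUN = sum)
survival
```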