5.2.2 Data Frames Vol. 2

In the exercises below we cover the basics of data frames.

5.2.2.1 Consider Id <- rep(1:4, each = 3) and two vectors, x=seq(1,43,along.with=Id) and y=seq(-20,0,along.with=Id).
Create a data.frame df:

df
   Id Letter         x          y
1   1      a  1.000000 -20.000000
2   1      b  4.818182 -18.181818
3   1      c  8.636364 -16.363636
4   2      a 12.454545 -14.545455
5   2      b 16.272727 -12.727273
6   2      c 20.090909 -10.909091
7   3      a 23.909091  -9.090909
8   3      b 27.727273  -7.272727
9   3      c 31.545455  -5.454545
10  4      a 35.363636  -3.636364
11  4      b 39.181818  -1.818182
12  4      c 43.000000   0.000000

Id <- rep(1:4, each = 3)
x <- seq(1, 43, along.with = Id)
y <- seq(-20, 0, along.with = Id)
Letter <- rep(letters[1:3], 4)

df <- data.frame(Id, Letter, x, y)
df

##    Id Letter         x          y
## 1   1      a  1.000000 -20.000000
## 2   1      b  4.818182 -18.181818
## 3   1      c  8.636364 -16.363636
## 4   2      a 12.454545 -14.545455
## 5   2      b 16.272727 -12.727273
## 6   2      c 20.090909 -10.909091
## 7   3      a 23.909091  -9.090909
## 8   3      b 27.727273  -7.272727
## 9   3      c 31.545455  -5.454545
## 10  4      a 35.363636  -3.636364
## 11  4      b 39.181818  -1.818182
## 12  4      c 43.000000   0.000000
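
The along.with argument makes seq() return as many values as Id has elements, evenly spaced between the two endpoints. A minimal check, assuming the same Id as above (the name step is used only for this illustration):

```r
Id <- rep(1:4, each = 3)               # 12 elements
x  <- seq(1, 43, along.with = Id)

# along.with = Id behaves like length.out = length(Id)
all.equal(x, seq(1, 43, length.out = length(Id)))

# The spacing between consecutive values is (to - from) / (n - 1)
step <- (43 - 1) / (length(Id) - 1)
all.equal(diff(x), rep(step, length(Id) - 1))
```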

5.2.2.2 From the previous data frame df, create this data frame:

   Id      x.a        y.a       x.b        y.b       x.c        y.c
1   1  1.00000 -20.000000  4.818182 -18.181818  8.636364 -16.363636
4   2 12.45455 -14.545455 16.272727 -12.727273 20.090909 -10.909091
7   3 23.90909  -9.090909 27.727273  -7.272727 31.545455  -5.454545
10  4 35.36364  -3.636364 39.181818  -1.818182 43.000000   0.000000

Hint: Check the reshape function.

reshape(df,timevar='Letter',idvar='Id',direction='wide') #reshaping the data from long to wide.

##    Id      x.a        y.a       x.b        y.b       x.c        y.c
## 1   1  1.00000 -20.000000  4.818182 -18.181818  8.636364 -16.363636
## 4   2 12.45455 -14.545455 16.272727 -12.727273 20.090909 -10.909091
## 7   3 23.90909  -9.090909 27.727273  -7.272727 31.545455  -5.454545
## 10  4 35.36364  -3.636364 39.181818  -1.818182 43.000000   0.000000
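
In the call above, idvar names the column that identifies each output row and timevar names the column whose values become the suffixes .a, .b and .c. As a rough sketch of the opposite direction, assuming the same df as in 5.2.2.1 (the names wide and long are illustrative only), the wide result can be turned back into long form:

```r
Id <- rep(1:4, each = 3)
df <- data.frame(Id,
                 Letter = rep(letters[1:3], 4),
                 x = seq(1, 43, along.with = Id),
                 y = seq(-20, 0, along.with = Id))

wide <- reshape(df, timevar = "Letter", idvar = "Id", direction = "wide")

# reshape() records how 'wide' was built, so the reverse call
# needs no further arguments besides the direction
long <- reshape(wide, direction = "long")

dim(wide)
dim(long)
```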

5.2.2.3 Create two data frames, df1 and df2:

df1
  Id Age
1  1  14
2  2  12
3  3  15
4  4  10

df2
  Id Sex Code
1  1   F    a
2  2   M    b
3  3   M    c
4  4   F    d

Id <- c(1:4)
Age <- c(14,12,15,10)
df1 <- data.frame(Id,Age)

Sex <- c("F","M","M","F")
Code <- letters[1:4]
df2 <- data.frame(Id,Sex,Code)

From df1 and df2 create M:

M
  Id Age Sex Code
1  1  14   F    a
2  2  12   M    b
3  3  15   M    c
4  4  10   F    d

M <- merge(df1,df2, by = "Id")

M

##   Id Age Sex Code
## 1  1  14   F    a
## 2  2  12   M    b
## 3  3  15   M    c
## 4  4  10   F    d
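
Here every Id appears in both data frames, so no rows are lost. The sketch below shows what merge does when a key is missing on one side. It assumes a hypothetical df2_partial that has no row for Id 4; that data frame is not part of the exercise.

```r
df1 <- data.frame(Id = 1:4, Age = c(14, 12, 15, 10))

# Hypothetical variant of df2 without a row for Id 4
df2_partial <- data.frame(Id = 1:3,
                          Sex = c("F", "M", "M"),
                          Code = letters[1:3])

# Default: keep only Ids present in both data frames
inner <- merge(df1, df2_partial, by = "Id")

# all.x = TRUE: keep every row of df1, filling unmatched columns with NA
left <- merge(df1, df2_partial, by = "Id", all.x = TRUE)

nrow(inner)
nrow(left)
left[left$Id == 4, ]
```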

5.2.2.4 Create a data frame df3:

df3
  id2 score
1   4   100
2   3    98
3   2    94
4   1    99

From M and df3 create N:

  Id Age Sex Code score
1  1  14   F    a    99
2  2  12   M    b    94
3  3  15   M    c    98
4  4  10   F    d   100

id2 <- 4:1
score <- c(100,98,94,99)
df3 <- data.frame(id2,score)

N=merge(M,df3,by.x='Id',by.y='id2')
N

##   Id Age Sex Code score
## 1  1  14   F    a    99
## 2  2  12   M    b    94
## 3  3  15   M    c    98
## 4  4  10   F    d   100
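
The by.x and by.y arguments are needed because the key column has a different name in each data frame. As an alternative sketch, not the exercise's intended solution, the same lookup can be done with match(). The names pos and N2 are illustrative, and Sex and Code are left out of M for brevity.

```r
M   <- data.frame(Id = 1:4, Age = c(14, 12, 15, 10))
df3 <- data.frame(id2 = 4:1, score = c(100, 98, 94, 99))

# For each value of M$Id, match() returns its position in df3$id2
pos <- match(M$Id, df3$id2)

# Use those positions to pull the scores in the order of M
N2 <- M
N2$score <- df3$score[pos]
N2
```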

5.2.2.5 Consider the previous data frame N:
1) Remove the variables Sex and Code

N[,c("Sex")]=NULL
N[,c("Code")]=NULL
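
Both columns can also be dropped in a single step. The sketch below shows two common variants on a copy of N rebuilt from the table above; the names N_a and N_b are illustrative only.

```r
N <- data.frame(Id = 1:4, Age = c(14, 12, 15, 10),
                Sex = c("F", "M", "M", "F"), Code = letters[1:4],
                score = c(99, 94, 98, 100))

# Variant 1: assign NULL to both columns at once
N_a <- N
N_a[c("Sex", "Code")] <- NULL

# Variant 2: keep every column whose name is not in the list
N_b <- N[, !(names(N) %in% c("Sex", "Code"))]

names(N_a)
names(N_b)
```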

2) From N, create a data frame:

   values   ind
1       1    Id
2       2    Id
3       3    Id
4       4    Id
5      14   Age
6      12   Age
7      15   Age
8      10   Age
9      99 score
10     94 score
11     98 score
12    100 score

The stack function concatenates the columns of a data frame into a single values column and records the source column in ind:

stack(N)

##    values   ind
## 1       1    Id
## 2       2    Id
## 3       3    Id
## 4       4    Id
## 5      14   Age
## 6      12   Age
## 7      15   Age
## 8      10   Age
## 9      99 score
## 10     94 score
## 11     98 score
## 12    100 score
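
The inverse operation is unstack. A short sketch, assuming N has already been reduced to Id, Age and score (the names long and back are illustrative):

```r
N <- data.frame(Id = 1:4, Age = c(14, 12, 15, 10), score = c(99, 94, 98, 100))

long <- stack(N)       # two columns: values and ind
back <- unstack(long)  # one column per level of ind

dim(long)
names(back)
```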

5.2.2.6 For this exercise, we’ll use the (built-in) dataset trees.
a) Make sure the object is a data frame; if it is not, convert it to one.

str(trees)

## 'data.frame':    31 obs. of  3 variables:
##  $ Girth : num  8.3 8.6 8.8 10.5 10.7 10.8 11 11 11.1 11.2 ...
##  $ Height: num  70 65 63 72 81 83 66 75 80 75 ...
##  $ Volume: num  10.3 10.3 10.2 16.4 18.8 19.7 15.6 18.2 22.6 19.9 ...

A <- trees

# str() shows that trees is already a data.frame

b) Create a new data frame A:

A
              Girth Height    Volume
mean_tree  13.24839     76  30.17097
min_tree    8.30000     63  10.20000
max_tree   20.60000     87  77.00000
sum_tree  410.70000   2356 935.30000

Hint: Instead of separate statements or loops, use the apply function to calculate the mean, min, max and sum of each column (or row).

mean_tree=apply(trees,2,mean)
max_tree=apply(trees,2,max)
min_tree=apply(trees,2,min)
sum_tree=apply(trees,2,sum)

A <- data.frame(mean_tree, min_tree, max_tree, sum_tree) # The expected table is the transpose of A.

A <- as.data.frame(t(A)) # t() returns a matrix, so convert the result back to a data frame.

A

##               Girth Height    Volume
## mean_tree  13.24839     76  30.17097
## min_tree    8.30000     63  10.20000
## max_tree   20.60000     87  77.00000
## sum_tree  410.70000   2356 935.30000
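
As an alternative sketch, all four statistics can be computed in one pass with sapply, which iterates over the columns of a data frame. The helper summary_stats and the name A2 are hypothetical and used only for this illustration.

```r
# Hypothetical helper: returns the four named statistics for one column
summary_stats <- function(col) {
  c(mean_tree = mean(col), min_tree = min(col),
    max_tree = max(col), sum_tree = sum(col))
}

# sapply() calls the helper on each column of trees and binds the
# results side by side, so the statistics end up as rows
A2 <- as.data.frame(sapply(trees, summary_stats))

class(A2)
A2["sum_tree", "Height"]
```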

5.2.2.7 Consider the data frame A:
1) Order the entire data frame by the first column.

A[order(A[,1]),]

##               Girth Height    Volume
## min_tree    8.30000     63  10.20000
## mean_tree  13.24839     76  30.17097
## max_tree   20.60000     87  77.00000
## sum_tree  410.70000   2356 935.30000
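
Note that order() returns row positions, and indexing with them yields a sorted copy; A itself stays unchanged unless the result is assigned. A minimal sketch on a cut-down copy of A (only Girth and Height, with values taken from the table above; A_sorted and A_desc are illustrative names):

```r
A <- data.frame(Girth  = c(13.24839, 8.3, 20.6, 410.7),
                Height = c(76, 63, 87, 2356),
                row.names = c("mean_tree", "min_tree", "max_tree", "sum_tree"))

# Ascending order by the first column
A_sorted <- A[order(A[, 1]), ]

# Descending order by the first column
A_desc <- A[order(A[, 1], decreasing = TRUE), ]

rownames(A_sorted)
rownames(A_desc)
rownames(A)   # the original row order is untouched
```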

2) Rename the row names as follows: mean, min, max, tree

row.names(A)

## [1] "mean_tree" "min_tree"  "max_tree"  "sum_tree"

row.names(A) <- c("mean","min","max","tree")

5.2.2.8 Create an empty data frame with column types:

df
Ints Logicals Doubles Characters
<0 rows> (or 0-length row.names)

df <- data.frame(Ints=integer(), Logicals=logical(),Doubles=double(),Characters=character())
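
A quick way to confirm the result is to inspect the row count and the class of each column. The sketch below also appends one row of matching types. The name filled is illustrative, and stringsAsFactors = FALSE is passed explicitly on the assumption that the code may run on R versions before 4.0, where character columns would otherwise become factors.

```r
df <- data.frame(Ints = integer(), Logicals = logical(),
                 Doubles = double(), Characters = character(),
                 stringsAsFactors = FALSE)

nrow(df)            # no rows yet
sapply(df, class)   # one class per column

# Append a single row whose values match the declared column types
filled <- rbind(df, data.frame(Ints = 1L, Logicals = TRUE,
                               Doubles = 2.5, Characters = "a",
                               stringsAsFactors = FALSE))
nrow(filled)
```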

5.2.2.9 Create a data frame XY where X=c(1,2,3,1,4,5,2) and Y=c(0,3,2,0,5,9,3)

XY
  X Y
1 1 0
2 2 3
3 3 2
4 1 0
5 4 5
6 5 9
7 2 3

XY <- data.frame(X=c(1,2,3,1,4,5,2),Y=c(0,3,2,0,5,9,3))

XY

##   X Y
## 1 1 0
## 2 2 3
## 3 3 2
## 4 1 0
## 5 4 5
## 6 5 9
## 7 2 3

1) Look for duplicated rows using a built-in R function.

duplicated(XY) # TRUE means a duplicated row.

## [1] FALSE FALSE FALSE  TRUE FALSE FALSE  TRUE

2) Keep only the unique rows of XY using a built-in R function.

unique(XY) #4th and 7th rows will not be displayed.

##   X Y
## 1 1 0
## 2 2 3
## 3 3 2
## 5 4 5
## 6 5 9
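
The two functions are closely related: for a data frame, dropping the rows flagged by duplicated() gives the same rows as unique(). A brief sketch on the same XY, which also shows the fromLast argument:

```r
XY <- data.frame(X = c(1, 2, 3, 1, 4, 5, 2), Y = c(0, 3, 2, 0, 5, 9, 3))

# Number of rows that repeat an earlier row
sum(duplicated(XY))

# Dropping the flagged rows keeps the first copy of each row
XY[!duplicated(XY), ]

# fromLast = TRUE scans from the bottom, so the earlier copies are flagged
which(duplicated(XY, fromLast = TRUE))
```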

5.2.2.10 For this exercise, we’ll use the (built-in) dataset Titanic.
a) Make sure the object is a data frame; if it is not, convert it to one.

str(Titanic)

##  table [1:4, 1:2, 1:2, 1:2] 0 0 35 0 0 0 17 0 118 154 ...
##  - attr(*, "dimnames")=List of 4
##   ..$ Class   : chr [1:4] "1st" "2nd" "3rd" "Crew"
##   ..$ Sex     : chr [1:2] "Male" "Female"
##   ..$ Age     : chr [1:2] "Child" "Adult"
##   ..$ Survived: chr [1:2] "No" "Yes"

Tit <- data.frame(Titanic)

b) Define a data frame containing the rows where Class is 1st and Survived is No, keeping only the variables Sex, Age and Freq.

df <- subset(Tit, subset = Class=='1st' & Survived=='No',select=c(Sex,Age,Freq))

df

##       Sex   Age Freq
## 1    Male Child    0
## 5  Female Child    0
## 9    Male Adult  118
## 13 Female Adult    4
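
The same selection can be written with plain bracket indexing, which some prefer inside functions because the column names are given as strings. A sketch of that form (keep and df_b are illustrative names):

```r
Tit <- as.data.frame(Titanic)

# Logical vector marking the rows to keep
keep <- Tit$Class == "1st" & Tit$Survived == "No"

# Rows selected by 'keep', columns selected by name
df_b <- Tit[keep, c("Sex", "Age", "Freq")]

df_b
sum(df_b$Freq)
```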

source: http://www.r-exercises.com/2016/11/28/data-frame-exercises-vol-2/