This tutorial covers a few simple R functions for sampling and repetition:

1. the sample() function
2. for loops
3. the apply() function
When working with R, there are various reasons why we want R to do some task repeatedly, like running through every row in a data set and performing some calculation. Or maybe we want to run a simulation that requires repeated draws of samples from some population, or run some sort of iterated experiment in order to estimate a statistic, for example the sample mean.
First, I am going to define a vector named ‘states’ and store three U.S. state names in it, so the vector has three character elements.
states<-c("Florida", "Georgia", "Alabama")
states
[1] "Florida" "Georgia" "Alabama"
Once I have created the vector, I want to draw random samples from it. The sample() function takes the data to draw from (x), the number of draws to make (size), and whether each value is put back after it is drawn (replace).
sample(x=states, size=12, replace=TRUE)
[1] "Alabama" "Alabama" "Georgia" "Georgia" "Florida" "Georgia" "Florida"
[8] "Florida" "Georgia" "Florida" "Alabama" "Georgia"
The result is a new vector of 12 elements drawn from the vector I created earlier. Because replace=TRUE, every element has an equal chance of being drawn each time, regardless of the previous draws.
I can also choose not to replace the values once they are drawn. In that case, the sample size cannot exceed 3, because there are only 3 states in the vector. I tried all three possible sizes for demonstration purposes.
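Because sample() draws at random, the output above will differ every time the code is run. A minimal sketch of how the draws can be made reproducible with set.seed(); the seed value 123 and the names first_draw and second_draw are arbitrary choices for illustration, not part of the original example:

```r
states <- c("Florida", "Georgia", "Alabama")

set.seed(123)  # arbitrary seed chosen for illustration
first_draw <- sample(x = states, size = 12, replace = TRUE)

set.seed(123)  # resetting the same seed repeats the same draws
second_draw <- sample(x = states, size = 12, replace = TRUE)

identical(first_draw, second_draw)  # TRUE
```

Without the second set.seed() call, the two vectors would almost certainly differ.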
sample(x=states, size=3, replace=FALSE)
[1] "Georgia" "Florida" "Alabama"
sample(x=states, size=2, replace=FALSE)
[1] "Florida" "Alabama"
sample(x=states, size=1, replace=FALSE)
[1] "Alabama"
I got three different outputs. The first call selected 3 states, the second 2 states, and the last 1 state.
Lesson learned: if we put each drawn value back (replace=TRUE), the sample can be as large as we want; if we don't, the sample size is limited to the number of elements in the vector. The same holds for numeric vectors:
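To see the limit in action, here is a small sketch of what happens when the requested size exceeds the number of elements. The tryCatch() wrapper and the names too_many and ok are additions for illustration; they only keep the script running after the error:

```r
states <- c("Florida", "Georgia", "Alabama")

# Asking for 4 draws from 3 elements without replacement is an error.
# tryCatch() captures the error message instead of stopping the script.
too_many <- tryCatch(
  sample(x = states, size = 4, replace = FALSE),
  error = function(e) conditionMessage(e)
)
print(too_many)

# With replacement, the same request succeeds.
ok <- sample(x = states, size = 4, replace = TRUE)
length(ok)  # 4
```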
a<-(1:5)
a
[1] 1 2 3 4 5
sample(x=a, size=100, replace = TRUE)
[1] 3 1 2 3 2 5 2 2 1 2 1 2 3 1 2 3 4 1 2 5 2 5 2 5 3 4 1 2 1 2 2 3 4 3 5 2 3
[38] 5 1 1 2 2 4 4 3 1 5 5 1 4 5 5 5 3 2 2 4 4 4 3 5 5 2 4 2 4 3 4 5 1 2 5 1 3
[75] 5 5 1 3 1 3 1 5 2 3 3 3 2 4 1 4 5 1 4 4 4 5 5 2 4 2
sample(x=a, size=5, replace=FALSE)
[1] 1 2 5 3 4
sample(x=a, size=4, replace=FALSE)
[1] 4 1 2 3
sample(x=a, size=3, replace=FALSE)
[1] 5 2 3
sample(x=a, size=2, replace=FALSE)
[1] 4 1
sample(x=a, size=1, replace=FALSE)
[1] 4
a<-(50:80)
a
[1] 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74
[26] 75 76 77 78 79 80
sample(x=a, size=200, replace = TRUE)
[1] 58 79 70 69 66 79 58 56 80 74 76 74 57 57 61 80 78 80 64 50 68 51 76 67 60
[26] 56 63 69 74 74 58 80 60 61 74 78 65 52 79 74 59 54 63 52 62 53 63 61 71 74
[51] 56 52 56 50 67 57 55 73 65 80 60 62 72 51 64 71 58 56 58 76 51 74 56 50 65
[76] 54 71 61 64 64 64 59 50 69 60 79 73 75 66 64 60 69 56 65 71 67 67 72 75 51
[101] 66 55 69 51 78 65 77 58 64 68 80 74 69 53 60 72 72 68 58 80 71 73 51 75 61
[126] 61 59 64 57 54 68 63 74 66 60 54 77 67 55 61 76 52 55 75 60 62 70 75 60 68
[151] 53 71 52 61 52 68 66 54 77 69 77 75 74 76 56 75 77 52 76 67 57 73 59 71 76
[176] 58 50 53 62 75 65 73 59 68 57 60 67 50 74 59 57 70 68 63 74 69 56 64 75 70
sample(x=a, size=30, replace=FALSE)
[1] 62 69 68 70 60 58 71 80 76 73 50 65 75 74 78 59 51 55 79 54 63 77 64 57 52
[26] 66 72 67 61 56
sample(x=a, size=21, replace=FALSE)
[1] 79 56 62 52 54 73 51 63 53 70 74 71 69 68 67 72 60 50 80 66 65
sample(x=a, size=11, replace=FALSE)
[1] 62 53 69 76 51 50 72 66 80 65 73
sample(x=a, size=6, replace=FALSE)
[1] 54 74 52 64 78 60
sample(x=a, size=1, replace=FALSE)
[1] 64
Imagine that 1 refers to heads and 0 to tails.
coinflip <- (0:1)
One_flip <- sample(x = coinflip, size = 1)
One_flip
[1] 1
plot(One_flip)
five_flips <- sample(x = coinflip, size = 5, replace = T)
print(five_flips)
[1] 0 0 1 1 1
barplot(five_flips)
ten_flips <- sample(x = coinflip, size = 10, replace = T)
print(ten_flips)
[1] 1 0 1 0 0 0 1 0 0 0
barplot(ten_flips)
hundred_flips <- sample(x = coinflip, size = 100, replace = T)
print(hundred_flips)
[1] 0 0 0 1 0 1 1 0 0 0 0 1 1 0 1 0 1 1 0 1 0 1 0 0 1 1 0 1 1 1 0 1 0 1 1 0 0
[38] 0 0 1 0 1 0 1 0 1 0 1 0 1 1 1 1 1 1 0 1 1 1 0 0 1 0 1 1 0 1 0 1 0 1 0 1 0
[75] 1 1 1 0 0 0 1 1 1 0 1 1 1 1 1 1 1 1 0 0 0 1 0 0 1 1
hist(hundred_flips)
thousand_flips <- sample(x = coinflip, size = 1000, replace = T)
summary(thousand_flips)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.000 0.000 1.000 0.508 1.000 1.000
hist(thousand_flips)
OHT_flips <- sample(x = coinflip, size = 100000, replace = T)
summary(OHT_flips)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.0000 0.0000 0.0000 0.4956 1.0000 1.0000
hist(OHT_flips)
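Since heads is coded as 1 and tails as 0, the mean of the flips is simply the share of heads, which is why the Mean column in the summaries above sits close to 0.5. A rough sketch that compares several sample sizes side by side; the seed, the sizes vector, and the name prop_heads are assumptions made for this illustration:

```r
set.seed(42)     # arbitrary seed, only for reproducibility
coinflip <- 0:1  # 1 = heads, 0 = tails

sizes <- c(10, 100, 1000, 100000)

# For each sample size, flip the coin that many times and take the mean,
# which equals the share of heads because heads is coded as 1.
prop_heads <- sapply(sizes, function(n) {
  mean(sample(x = coinflip, size = n, replace = TRUE))
})
names(prop_heads) <- sizes
prop_heads
```

The larger samples should usually land closer to 0.5 than the smaller ones.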
Next, I am going to roll a six-sided die.
dice_roll <- (1:6)
hundred_rolls <- sample(x = dice_roll, size = 100, replace = TRUE)
str(hundred_rolls)
int [1:100] 2 5 6 4 6 2 3 6 1 1 ...
hist(hundred_rolls)
thousand_rolls <- sample(x = dice_roll, size = 1000, replace = TRUE)
str(thousand_rolls)
int [1:1000] 5 6 4 1 5 1 4 2 4 2 ...
hist(thousand_rolls)
Tthousand_rolls <- sample(x = dice_roll, size = 10000, replace = TRUE)
str(Tthousand_rolls)
int [1:10000] 4 1 2 4 1 1 4 2 1 4 ...
hist(Tthousand_rolls)
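hist() treats the rolls as continuous data and chooses its own bins. For a discrete outcome like a die, counting each face with table() and plotting the counts can be easier to read. A sketch under that assumption; the seed and the names rolls, counts, and shares are illustrative:

```r
set.seed(2024)  # arbitrary seed, only for reproducibility
dice_roll <- 1:6
rolls <- sample(x = dice_roll, size = 10000, replace = TRUE)

counts <- table(rolls)        # how many times each face appeared
shares <- prop.table(counts)  # the same counts as proportions
round(shares, 3)              # each share should be near 1/6

# One bar per face
barplot(counts, xlab = "Face", ylab = "Frequency")
```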
powerball <- (1:74)
sample(x = powerball, size = 6, replace = F)# Six Draws
[1] 59 23 53 45 35 48
c <- sample(x = powerball, size = 10000, replace = TRUE) #Ten Thousand Draws
hist(c)
So the sample() function lets us draw from a finite set of elements, by default with an equal probability of drawing each element. We can also sample from other kinds of probability distributions, for example the normal distribution.
Let’s take an example: I am going to generate 100 random numbers from the standard normal distribution with rnorm() and look at the first six.
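Equal probabilities are only the default. sample() also accepts a prob argument with one weight per element of x. A brief sketch of a biased coin; the weights 0.3 and 0.7, the seed, and the name biased_flips are made up for illustration:

```r
set.seed(7)      # arbitrary seed, only for reproducibility
coinflip <- 0:1  # 1 = heads, 0 = tails

# prob gives one weight per element of x, in the same order:
# here tails (0) has probability 0.3 and heads (1) has probability 0.7.
biased_flips <- sample(x = coinflip, size = 10000, replace = TRUE,
                       prob = c(0.3, 0.7))
mean(biased_flips)  # share of heads, expected to be near 0.7
```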
norm <- rnorm(n=100)
head(norm)
[1] 1.1880658 0.7176062 -0.8077974 -0.2422761 1.3265662 1.5531266
I can draw a much larger sample of random normal numbers and create a density plot to see how it looks.
plot(density(rnorm(n=10000000)))
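By default, rnorm() draws from the standard normal distribution (mean 0, standard deviation 1). Its mean and sd arguments change that. A short sketch with made-up values (mean 100, sd 15); the name scores is hypothetical:

```r
set.seed(1)  # arbitrary seed, only for reproducibility

# 10,000 draws from a normal distribution with mean 100 and sd 15
# (illustrative values, not taken from the examples above)
scores <- rnorm(n = 10000, mean = 100, sd = 15)

mean(scores)  # should be close to 100
sd(scores)    # should be close to 15
plot(density(scores))
```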
The plot shows the bell shape of a Gaussian distribution. Now that we have the sample() command under our belt, let's learn the for loop.

## For Loop

The basic idea of a for loop is that we tell R to loop through some operations a set number of times, in order to perform a task that would otherwise be tedious to repeat over and over. Let's start with a simple coin-tossing experiment. First, we create a vector with two elements, "Head" and "Tail", and call it cflip.
cflip <- c("Head", "Tail") # Creates a two-sided coin
toss <- c() # Creates an empty vector
# Now, let's create a for loop
for(i in 1:100){
toss[i] <- sample(x = cflip, size = 1) # size = 1: each pass through the loop is a single toss
}
print(toss) # Prints all 100 tosses
[1] "Head" "Tail" "Head" "Head" "Tail" "Tail" "Head" "Head" "Head" "Tail"
[11] "Head" "Tail" "Tail" "Head" "Tail" "Head" "Tail" "Head" "Tail" "Tail"
[21] "Tail" "Head" "Head" "Tail" "Head" "Head" "Tail" "Head" "Tail" "Tail"
[31] "Tail" "Tail" "Tail" "Tail" "Head" "Head" "Head" "Tail" "Tail" "Tail"
[41] "Head" "Tail" "Tail" "Tail" "Tail" "Head" "Tail" "Tail" "Head" "Tail"
[51] "Tail" "Head" "Tail" "Tail" "Head" "Head" "Tail" "Head" "Tail" "Head"
[61] "Tail" "Tail" "Head" "Tail" "Tail" "Tail" "Tail" "Tail" "Head" "Head"
[71] "Tail" "Head" "Tail" "Head" "Tail" "Head" "Tail" "Head" "Tail" "Tail"
[81] "Tail" "Head" "Tail" "Head" "Tail" "Tail" "Head" "Head" "Head" "Tail"
[91] "Tail" "Tail" "Tail" "Tail" "Head" "Head" "Head" "Tail" "Head" "Head"
table(toss) # Gives the summary of total heads and tails
toss
Head Tail
43 57
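The loop above grows toss by one element on every pass. As a sketch of two common variations, the version below preallocates the vector before looping, and then runs the same experiment without a loop by asking sample() for 100 draws at once. The seed and the name toss_vectorized are illustrative assumptions:

```r
set.seed(10)  # arbitrary seed, only for reproducibility
cflip <- c("Head", "Tail")

# Preallocate the result vector so R does not have to grow it on each pass
toss <- character(100)
for (i in seq_along(toss)) {
  toss[i] <- sample(x = cflip, size = 1)
}

# The same experiment without a loop: 100 draws with replacement
toss_vectorized <- sample(x = cflip, size = 100, replace = TRUE)

table(toss)
table(toss_vectorized)
```

For a simple case like this the single sample() call is shorter; the loop becomes useful once each iteration has to do more than one thing.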
Here, i is the loop index: it runs from 1 to 100, so the loop body is executed 100 times, and toss[i] stores the result of the i-th toss.
Let's look at another example:
powerball <- c(1:74) # The numbers 1 through 74
draw <- c() # Creates an empty vector
# Now, let's create a for loop
for(i in 1:600){ # 600 iterations
draw[i] <- sample(x = powerball, size = 6)[1] # A ticket has 6 numbers, but draw[i] can hold only one value, so keep the first number of each draw
}
summary(draw) # Summarizes the 600 stored numbers
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.00 19.00 38.00 38.39 58.00 74.00
table(draw) # Gives the summary of total draws and frequency
draw
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
8 10 9 6 13 7 6 3 9 12 7 7 6 5 6 10 11 5 12 1 6 5 11 6 10 10
27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52
9 11 8 4 8 12 8 8 8 7 11 6 6 9 10 6 7 6 10 5 7 10 7 3 12 11
53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74
4 5 14 6 9 12 10 10 8 5 7 6 7 11 8 9 12 16 9 10 5 7
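A single vector element can hold only one value, so the loop above ends up with one number per iteration (600 in total, as the table shows). One possible way to keep all six numbers of every draw is to store each draw as a row of a matrix. This is a sketch, not the original approach; the names draws and n_draws and the seed are assumptions:

```r
set.seed(99)  # arbitrary seed, only for reproducibility
powerball <- 1:74
n_draws <- 600

# One row per draw, one column per ball
draws <- matrix(NA_integer_, nrow = n_draws, ncol = 6)
for (i in 1:n_draws) {
  draws[i, ] <- sample(x = powerball, size = 6)  # six distinct numbers
}

head(draws, 3)
table(draws)  # frequency of each number across all 3,600 balls
```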
Please note that the following is an entirely fictional data set. We will create a matrix called ‘results’ with three variables: marital status, state of residence, and income level. Each participant is either ‘Married’ or ‘Single’, has one of four income levels (1 through 4), and lives in one of three states: Florida, Alabama, or Georgia.
I have to call sample() three times to complete one row (once for marital status, once for state, and once for income). Thus, within the for loop I use results[i, 1], results[i, 2], and results[i, 3], meaning the i-th row and the first, second, or third column.
state <- c("Florida", "Georgia", "Alabama")
m_status <- c("Married", "Single")
income <- 1:4
results <- matrix (nrow = 100, ncol = 3, data = NA)#Creates a matrix with 100 rows and 3 columns and it doesn't have any values, yet
colnames(results) <- c("m_status", "state", "income")
head(results)
m_status state income
[1,] NA NA NA
[2,] NA NA NA
[3,] NA NA NA
[4,] NA NA NA
[5,] NA NA NA
[6,] NA NA NA
for(i in 1:100){
results[i, 1] <- sample(m_status, size = 1)
results[i, 2] <- sample(state, size = 1)
results[i, 3] <- sample(income, size = 1)
}
head(results)
m_status state income
[1,] "Married" "Georgia" "2"
[2,] "Married" "Alabama" "3"
[3,] "Single" "Florida" "4"
[4,] "Single" "Florida" "3"
[5,] "Married" "Alabama" "4"
[6,] "Married" "Georgia" "3"
str(results)
chr [1:100, 1:3] "Married" "Married" "Single" "Single" "Married" "Married" ...
- attr(*, "dimnames")=List of 2
..$ : NULL
..$ : chr [1:3] "m_status" "state" "income"
After the loop, the matrix is full: every row has a randomly selected value for each of the three variables. Note that a matrix can hold only one data type, so the income values were converted to character strings, as the str() output shows. To fill a matrix this way, we assign the values one column at a time, repeating the sample() call once for every column in each row.

## Apply Function

Why use it? The apply function is useful for summarizing large amounts of data or for applying a function over a data set.
Let’s look at an example using the ‘results’ matrix that we’ve just created. I want to use the apply function together with the table command to summarize the information in each column.
A call to apply starts with the function name and an opening parenthesis. X = results is the data set I am interested in. MARGIN says whether to work by row or by column: with MARGIN = 1 the function is applied to each row, with MARGIN = 2 to each column. Finally, FUN = table means that the table() function is applied to each column, which counts how often each value occurs.
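A matrix stores a single data type, which is why str(results) reports chr for every column, including income. If the numeric type matters, a data frame is one alternative, since each column keeps its own type. A sketch under that assumption; the name results_df and the seed are hypothetical, and the columns are filled with one sample() call each instead of a loop:

```r
set.seed(5)  # arbitrary seed, only for reproducibility
state <- c("Florida", "Georgia", "Alabama")
m_status <- c("Married", "Single")
income <- 1:4
n <- 100

# Each column is drawn in one call; replace = TRUE allows repeated values
results_df <- data.frame(
  m_status = sample(m_status, size = n, replace = TRUE),
  state    = sample(state,    size = n, replace = TRUE),
  income   = sample(income,   size = n, replace = TRUE)
)
str(results_df)  # income stays numeric here
```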
apply(X = results, MARGIN = 2, FUN = table)
$m_status
Married Single
49 51
$state
Alabama Florida Georgia
29 37 34
$income
1 2 3 4
20 29 24 27
We see that 49 participants are married and 51 are single, how many participants live in each state (29 in Alabama, 37 in Florida, and 34 in Georgia), and how the income levels are distributed.
In the output, $m_status is the marital status column, $state the state of residence, and $income the income level.
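To make the role of MARGIN concrete, here is a small sketch on a numeric matrix, using sum as the function. The matrix m and its values are made up for illustration:

```r
# A small numeric matrix (hypothetical values): 2 rows, 3 columns,
# filled column by column
m <- matrix(1:6, nrow = 2, ncol = 3)
m

row_sums <- apply(X = m, MARGIN = 1, FUN = sum)  # one result per row
col_sums <- apply(X = m, MARGIN = 2, FUN = sum)  # one result per column

row_sums  # 9 12
col_sums  # 3 7 11
```

MARGIN = 1 returns as many values as there are rows, and MARGIN = 2 as many as there are columns.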
THANKS