Introduction:

R is a software environment for data analysis, statistical computing, and graphics. It was initially developed by two statisticians, Robert Gentleman and Ross Ihaka, at the University of Auckland, New Zealand, in the early 1990s. R is one of the most powerful programming languages for data analysis. One reason is that its scripts are simple enough that even a non-programmer can run them after several weeks of practice. It may feel intimidating at first, but as you go on executing commands you may well become addicted to it.

R is an environment within which many classical and modern statistical techniques have been implemented. A few of these are built into the base R environment, but many are supplied as packages. R consists of thousands of functions, and thousands of packages are contributed to CRAN (the Comprehensive R Archive Network) by hundreds of contributors. R can be downloaded from http://r-project.org; follow the installation instructions for your system. RStudio, the most commonly used IDE for R programming, can be downloaded from http://rstudio.com.

History of R

R originated from the statistical programming language “S”, which was developed by Richard A. Becker, John Chambers, and Allan R. Wilks at Bell Laboratories in the 1970s. The first version of R was developed by Robert Gentleman and Ross Ihaka at the University of Auckland in the early 1990s. They wanted better statistical software to use in their Macintosh laboratory and decided to create their own. They also released it as an open-source alternative to “S” and encouraged others to download it and help develop the software further.

Why are we using R?

R is free and open source, and it is available for Mac, Windows, and Linux. It is one of the most widely used programming languages, with more than 20 million users around the world. This means that new features are being developed all the time and that there are plenty of community resources. Additionally, because an analysis is written as a script, R makes it easy to rerun previous work and to make adjustments. R also produces good graphics and visualizations. There are many choices of data analysis software available today. In addition to R, some popular examples are SAS, Stata, SPSS, MS Excel, Excel add-ins, MATLAB, Minitab, and pandas. Now let us look at some of the basic functions of R, for those who are willing to learn it.

Basics of R

R as a basic calculator

R can be used as a basic calculator. The operators +, -, *, and / perform addition, subtraction, multiplication, and division respectively. Here are some illustrations:

4+5  # to add 4 and 5
## [1] 9
4-2  #to subtract 2 from 4
## [1] 2
5*6  # multiplication between 5 and 6
## [1] 30
15/5 # to divide 15 by 5
## [1] 3

Likewise, some other basic operators and functions are ^, sqrt, log, and exp. Some demonstrations are listed below.

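(4+5)^2    # the operator '^' raises a number to a power
## [1] 81
sqrt(16)   # the function 'sqrt' computes the square root
## [1] 4
(2+2)/2    # parentheses control the order of evaluation
## [1] 2
log10(100)  # log10 is the common (base-10) logarithm
## [1] 2
exp(10)     # exp is the exponential function, here e^10
## [1] 22026.47
exp(1)      # the constant e
## [1] 2.718282
log(exp(1)) # log is the natural logarithm, the inverse of exp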
## [1] 1
log(exp(2))
## [1] 2
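The logarithm functions and the order in which operators are evaluated are easy to mix up, so here is a minimal sketch of both. The names res1 and res2 are hypothetical and are used only to hold the results.

```r
# Logarithms in different bases
log(100)           # natural logarithm (base e), about 4.61
log10(100)         # common logarithm (base 10): 2
log2(8)            # binary logarithm (base 2): 3
log(8, base = 2)   # any base can be given through the 'base' argument: 3

# Operator precedence: '^' is evaluated before '*' and '/',
# which are evaluated before '+' and '-'
res1 <- 4 + 5^2    # 5^2 is computed first, so the result is 29
res2 <- (4 + 5)^2  # parentheses force the addition first, so the result is 81
res1
res2
```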

Defining a Variable

In the previous section we learnt how to use R as a basic calculator. Now we shall learn to define a variable in R. A variable is a named container in which data can be stored so that it can be used later on. For example, let \(x\) be a variable in which the number 5 is stored. The assignment of 5 to \(x\) can be written in several ways. The commonly used forms are:

x = 5
# or
x <- 5
# or
5 -> x
x  # this prints the value stored in 'x'
## [1] 5

Similarly, let us assign 6 to \(y\):

y=6
y  
## [1] 6

Let us perform some basic operations between x and y. Some examples are:

x+y     # note that 5 is stored in 'x' and 6 in 'y'; therefore, x+y should result in 11.
## [1] 11
x*y
## [1] 30
x^2+y^3
## [1] 241

and so on.
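The idea of storing a value for later use can be sketched a little further. In the example below, the name total and the upper-case X are hypothetical and are introduced only for illustration.

```r
x <- 5
y <- 6

total <- x + y   # store the result of an operation under a new name
total            # 11

x <- 10          # assigning again overwrites the old value of x
x + y            # 16
total            # still 11: it was computed before x changed

# Names are case-sensitive, so 'x' and 'X' are different variables
X <- 1
x == X           # FALSE
```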

Vector assignment

In the previous section we learnt about defining variables and operations between them. There, \(x\) was used to store the single number 5 and \(y\) to store 6. Now we shall learn how a vector is defined in R. A vector is a collection of objects of the same kind.

Name=c("ram","hari","sita","geeta","bir","shyam","laxman","neelam","ramesh")   # character values are enclosed in quotes
Gender=c("boy","boy","girl","girl","boy","boy","boy","girl","boy")
Age=c(25,18,25,20,20,31,28,25,22)   # numeric values need no quotes

The function c is used to combine (or concatenate) values into a vector. Note that quotation marks (“ ”) are required for character values (Name and Gender in this case), while they are not needed for numeric values (Age). Note also that every element of Name is of the same kind (character), and likewise for Gender. To find out the class of a vector, we can use the class function:

class(Name)
## [1] "character"
class(Gender)
## [1] "character"
class(Age)
## [1] "numeric"
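Because a vector holds objects of one kind only, it helps to see what happens when kinds are mixed, and how individual elements are reached. The following is a minimal sketch; the vector mixed is a hypothetical example that does not appear in the tutorial data.

```r
Age <- c(25, 18, 25, 20, 20, 31, 28, 25, 22)

length(Age)      # number of elements: 9
Age[1]           # first element (R counts from 1): 25
Age[c(2, 4)]     # second and fourth elements: 18 20
Age + 1          # arithmetic is applied to every element

# Mixing numbers and text in one vector: everything is coerced to character
mixed <- c(1, "ram", 3)
class(mixed)     # "character"
```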

How to change class?

Let us change the class of the variable Gender from character to factor using the function as.factor. [Note that categorical variables are termed factors in R. Since Gender is a categorical variable with two categories, boy and girl, it should be converted into a factor.]

Gender= as.factor(Gender)
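To confirm that the conversion worked, one can inspect the class and the levels of the result. The following is a minimal sketch that rebuilds Gender so that it runs on its own.

```r
Gender <- c("boy", "boy", "girl", "girl", "boy", "boy", "boy", "girl", "boy")
Gender <- as.factor(Gender)

class(Gender)     # "factor"
levels(Gender)    # the distinct categories, in alphabetical order: "boy" "girl"
nlevels(Gender)   # number of categories: 2
```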

How to make a dataframe?

Learners are advised to study the different data structures, such as vector, list, matrix, array, and data frame, in detail; we shall not discuss all of them here. The data frame is an important data structure in data analysis, because we routinely encounter vectors of different classes. Roughly speaking, a data frame is a table whose columns are vectors of equal length, possibly of different classes. Now we shall discuss how to make a data frame by combining two or more vectors of different classes, using the function data.frame. Let us create a data frame called “Students” by combining the three vectors ‘Name’, ‘Gender’ and ‘Age’:

Students=data.frame(Name,Gender,Age)   # Note that 'Name', 'Gender' and 'Age' are of different classes.
print(Students)
##     Name Gender Age
## 1    ram    boy  25
## 2   hari    boy  18
## 3   sita   girl  25
## 4  geeta   girl  20
## 5    bir    boy  20
## 6  shyam    boy  31
## 7 laxman    boy  28
## 8 neelam   girl  25
## 9 ramesh    boy  22

The output shows the data frame “Students”, which was created by combining the three vectors ‘Name’, ‘Gender’ and ‘Age’.
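Once a data frame exists, a few base R functions give a quick overview of its shape and contents. The sketch below rebuilds “Students” so that it is self-contained. The output of str is not shown because the class it reports for Name can differ between R versions.

```r
Name   <- c("ram", "hari", "sita", "geeta", "bir", "shyam", "laxman", "neelam", "ramesh")
Gender <- as.factor(c("boy", "boy", "girl", "girl", "boy", "boy", "boy", "girl", "boy"))
Age    <- c(25, 18, 25, 20, 20, 31, 28, 25, 22)
Students <- data.frame(Name, Gender, Age)

nrow(Students)      # number of rows (students): 9
ncol(Students)      # number of columns (variables): 3
names(Students)     # column names: "Name" "Gender" "Age"
str(Students)       # compact summary of each column and its class
head(Students, 3)   # first three rows
Students$Age        # a single column, returned as a vector
```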

How to count (frequency)?

In exploratory data analysis, counting the frequency of each value in a vector is very important. The function table is used for counting:

table(Gender)  # counts the number of boys and girls (in our data, 6 boys and 3 girls)
## Gender
##  boy girl 
##    6    3
table(Name)     # not informative here, since every name occurs exactly once
## Name
##    bir  geeta   hari laxman neelam    ram ramesh  shyam   sita 
##      1      1      1      1      1      1      1      1      1
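Counts are often easier to read as proportions or percentages. A minimal sketch using the base R function prop.table follows; the names counts and props are hypothetical.

```r
Gender <- as.factor(c("boy", "boy", "girl", "girl", "boy", "boy", "boy", "girl", "boy"))

counts <- table(Gender)        # frequencies: 6 boys, 3 girls
props  <- prop.table(counts)   # proportions that sum to 1
props
round(100 * props, 1)          # percentages: boy 66.7, girl 33.3
```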

Computation of Statistical Measures

In this section we shall learn how to compute some basic statistical measures of a continuous variable: the arithmetic mean, the standard deviation, and the coefficient of variation. We shall compute them for the vector ‘Age’.

How to compute mean?

The arithmetic mean and standard deviation can easily be computed by using mean and sd functions respectively.

mean(Age) #gives the mean of the variable "Age"
## [1] 23.77778
sd(Age)   #gives the standard deviation of "Age"
## [1] 4.176655

We see that the mean age of all students is 23.78 years, with a standard deviation of 4.18 years. Now let us compute the coefficient of variation (CV) of ‘Age’. Recall the formula: \(\text{CV} = (\text{standard deviation}/\text{mean}) \times 100\).

CV = (sd(Age)/mean(Age))*100  
CV
## [1] 17.56537
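If the CV is needed for several variables, the formula can be wrapped in a small function. This is only a sketch: coef_var is a hypothetical name, not a base R function. Note that sd computes the sample standard deviation (with n - 1 in the denominator), so the CV here is based on that.

```r
# Coefficient of variation in percent; 'coef_var' is a hypothetical helper
coef_var <- function(x) {
  (sd(x) / mean(x)) * 100
}

Age <- c(25, 18, 25, 20, 20, 31, 28, 25, 22)
coef_var(Age)   # 17.56537, the same value as computed step by step
```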

Computation of mean age of students gender wise

Now the reader may be interested in computing the average age of boys and girls separately. We shall compute the mean age of students by gender by subsetting the data frame ‘Students’ into two data frames, ‘Boys’ and ‘Girls’.

How to subset a data frame?

For this, let us subset the data for boys and girls using the subset function.

Boys=subset(Students,Gender=="boy", select =c(Name,Age))  # note that == sign is used to indicate equal to in R.
Boys
##     Name Age
## 1    ram  25
## 2   hari  18
## 5    bir  20
## 6  shyam  31
## 7 laxman  28
## 9 ramesh  22
Girls= subset(Students,Gender=="girl",select = c(Name,Age))
Girls
##     Name Age
## 3   sita  25
## 4  geeta  20
## 8 neelam  25

After subsetting for boys and girls, we can compute their average ages separately:

mean(Boys$Age) # note that the $ sign extracts the column Age from the data frame Boys.
## [1] 24
mean(Girls$Age)
## [1] 23.33333

The above result shows that the average age of boys is 24 years and that of girls is 23.33 years. Learners can go on practising in the same way with their own data sets. Note that R often offers several alternative functions for a given operation, as you will discover later.
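As one example of such an alternative, the group means can be obtained without creating separate data frames, using the base R functions tapply or aggregate. The sketch below rebuilds “Students” so that it runs on its own; the name group_means is hypothetical.

```r
Students <- data.frame(
  Name   = c("ram", "hari", "sita", "geeta", "bir", "shyam", "laxman", "neelam", "ramesh"),
  Gender = as.factor(c("boy", "boy", "girl", "girl", "boy", "boy", "boy", "girl", "boy")),
  Age    = c(25, 18, 25, 20, 20, 31, 28, 25, 22)
)

# Apply mean() to Age within each level of Gender
group_means <- tapply(Students$Age, Students$Gender, mean)
group_means   # boy 24, girl 23.33333

# The same computation, returned as a data frame
aggregate(Age ~ Gender, data = Students, FUN = mean)
```

Both calls give the same means as the subset approach, and they extend naturally to variables with more than two categories.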