R is a software environment for data analysis, statistical computing and graphics. R was initially developed by two statisticians, Robert Gentleman and Ross Ihaka, at the University of Auckland, New Zealand, in the early 1990s. R is one of the most powerful programming languages for data analysis. One reason is that its scripts are simple enough that even a non-programmer can run them comfortably after several weeks of practice. At the beginning R may seem intimidating, but as you go on executing commands, it can become addictive.
R is an environment within which many classical and modern statistical techniques have been implemented. A few of these are built into the base R environment, but many are supplied as packages. R consists of thousands of functions, and thousands of packages have been contributed to CRAN by hundreds of contributors. R can be downloaded from http://r-project.org; follow the installation instructions for your system. RStudio, the most commonly used IDE for R programming, can be downloaded from http://rstudio.com.
R originated from the statistical programming language “S”, which was developed by Richard A. Becker, John Chambers and Allan R. Wilks at Bell Laboratories in the 1970s. The first version of R was developed by Robert Gentleman and Ross Ihaka at the University of Auckland in the early 1990s. They wanted better statistical software to use in their Macintosh laboratory and decided to create their own. They also released it as an open source alternative to “S” and encouraged others to download it and help develop the software further.
R is free and open source, and is available for Mac, Windows and Linux. It is one of the most widely used programming languages, reportedly with more than 20 million users around the world. This means that new features are being developed all the time and there are lots of community resources. Additionally, R makes it easy to rerun previous work and to make adjustments. R also has good graphics and visualization capabilities. There are many choices of data analysis software available today. In addition to R, some popular examples are SAS, Stata, SPSS, MS Excel, Excel add-ins, MATLAB, Minitab and pandas. Now let us try to understand some of the basic functions of R, for those who are willing to learn it.
R can be used as a basic calculator. The operators +, -, *, / are used in R for addition, subtraction, multiplication and division respectively. Here are some illustrations:
4+5 # to add 4 and 5
## [1] 9
4-2 #to subtract 2 from 4
## [1] 2
5*6 # multiplication between 5 and 6
## [1] 30
15/5 # to divide 15 by 5
## [1] 3
Likewise, some other basic operators and functions are ^, sqrt, log, exp etc. Note that log is the natural logarithm, while log10 is the common (base 10) logarithm. Some demonstrations are listed below.
(4+5)^2 # the operator '^' raises a number to a power
## [1] 81
sqrt(16) # the function 'sqrt' computes the square root
## [1] 4
(2+2)/2
## [1] 2
log10(100) # log10 is common logarithm
## [1] 2
exp(10) # exp is the exponential function, e raised to the given power
## [1] 22026.47
exp(1)
## [1] 2.718282
log(exp(1))
## [1] 1
log(exp(2))
## [1] 2
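The relationship between log, log10 and exp shown above can be explored a little further. The following is a minimal sketch using only base R functions; the variable names `a`, `b` and `d` are hypothetical and chosen just for this illustration.

```r
# log() takes an optional 'base' argument; without it, the natural log is used
a <- log(100, base = 10)  # same value as log10(100)
b <- log2(8)              # base 2 logarithm of 8
d <- exp(log(5))          # exp() and log() undo each other

a
b
d
```

Here `a` should agree with the result of log10(100) shown earlier, and `d` should return the number we started with, up to floating point precision.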
In the previous section we learnt how to use R for basic calculator operations. Now we shall learn to define a variable in R. A variable is a named container in which data can be stored so that it can be used later on. For example, let \(x\) be a variable in which the number 5 is stored. The assignment of 5 to \(x\) can be written in several ways; the commonly used forms are:
x=5
#or
x<- 5
#or
5->x
x # this prints the value stored in 'x'
## [1] 5
Similarly, let us assign 6 to \(y\):
y=6
y
## [1] 6
Let us perform some basic operations between x and y. Some examples are:
x+y # note that 5 is stored in 'x' and 6 in 'y'; therefore, x+y should result in 11.
## [1] 11
x*y
## [1] 30
x^2+y^3
## [1] 241
and so on.
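The result of an operation can itself be stored in a variable, and a variable can be overwritten later. The following is a small sketch of this idea; the name `z` is hypothetical and not part of the examples above.

```r
x <- 5
y <- 6

z <- x + y   # store the result of the addition in a new variable
z            # 11

x <- 10      # overwrite x with a new value
x + y        # 16, because the new value of x is used
z            # still 11: z was computed before x changed
```

Note that `z` is not updated automatically when `x` changes; it keeps the value computed at the time of assignment.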
In the previous section we learnt about defining variables and operations between them. Note that \(x\) stores the single number 5 and \(y\) stores 6. Now we shall learn how a vector is defined in R. A vector is a collection of objects of the same kind.
Name=c("ram","hari","sita", "geeta","bir","shyam","laxman","neelam","ramesh") # character values are enclosed in quotes.
Gender=c("boy","boy","girl","girl","boy","boy","boy","girl","boy")
Age =c(25,18,25,20,20,31,28,25,22)
The function ‘c’ is used to combine (or concatenate) values. Note that quotation marks (“”) are required for character values (Name and Gender in this case) but not for numerical values (Age). Note also that each vector holds objects of a single kind: Name and Gender contain only characters, and Age contains only numbers. To find the class of a vector, we can use the class function:
class(Name)
## [1] "character"
class(Gender)
## [1] "character"
class(Age)
## [1] "numeric"
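Because a vector holds objects of one kind only, R converts mixed input to a common class. The sketch below illustrates this with a hypothetical vector called `mixed`, and also shows how the length and a single element of ‘Age’ can be obtained.

```r
Age <- c(25, 18, 25, 20, 20, 31, 28, 25, 22)

# Mixing a number, a character string and a logical value:
# everything is converted to character
mixed <- c(25, "ram", TRUE)
class(mixed)   # "character"
mixed          # "25" "ram" "TRUE"

length(Age)    # number of elements in Age: 9
Age[2]         # the second element of Age: 18
```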
Let us change the class of the variable Gender from character to factor using the function as.factor. [Note that categorical variables are termed factors in R. Since Gender is a categorical variable (two categories: boy and girl), it should be converted into a factor.]
Gender= as.factor(Gender)
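To check that the conversion worked, the class and the categories (levels) of the factor can be inspected. This is a minimal sketch using the base R functions class, levels and nlevels.

```r
Gender <- c("boy", "boy", "girl", "girl", "boy", "boy", "boy", "girl", "boy")
Gender <- as.factor(Gender)

class(Gender)       # "factor"
levels(Gender)      # the categories, sorted alphabetically: "boy" "girl"
nlevels(Gender)     # number of categories: 2
as.integer(Gender)  # internal integer codes: 1 for "boy", 2 for "girl"
```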
Learners are advised to study the different data structures, such as vector, list, matrix, array and data frame, in detail; we shall not discuss all of them here. The data frame is an important data structure in data analysis, because we often encounter vectors of different classes. Roughly speaking, a data frame is a table whose columns are vectors of the same length, possibly of different classes. Now we shall discuss how to make a data frame by combining two or more vectors of different classes, using the function data.frame. Let us create a data frame called “Students” by combining the three vectors ‘Name’, ‘Gender’ and ‘Age’:
Students=data.frame(Name,Gender,Age) # Note that 'Name', 'Gender' and 'Age' are of different classes.
print(Students)
##     Name Gender Age
## 1    ram    boy  25
## 2   hari    boy  18
## 3   sita   girl  25
## 4  geeta   girl  20
## 5    bir    boy  20
## 6  shyam    boy  31
## 7 laxman    boy  28
## 8 neelam   girl  25
## 9 ramesh    boy  22
The output shows the data frame “Students”, created by combining the three vectors ‘Name’, ‘Gender’ and ‘Age’.
In exploratory data analysis, counting the frequency of each value in a vector is very important. The function table is used for this.
table(Gender) # counts the number of boys and girls (in our data, 6 boys and 3 girls)
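Once a data frame exists, a few base R functions help to inspect it. The sketch below rebuilds the same data frame and looks at its size, structure and columns. One point to be aware of: in R versions before 4.0, data.frame converted character vectors to factors by default, so the class reported for ‘Name’ may differ between versions.

```r
Name <- c("ram", "hari", "sita", "geeta", "bir", "shyam", "laxman", "neelam", "ramesh")
Gender <- as.factor(c("boy", "boy", "girl", "girl", "boy", "boy", "boy", "girl", "boy"))
Age <- c(25, 18, 25, 20, 20, 31, 28, 25, 22)
Students <- data.frame(Name, Gender, Age)

dim(Students)      # number of rows and columns: 9 3
names(Students)    # column names: "Name" "Gender" "Age"
str(Students)      # compact summary of each column and its class
head(Students, 3)  # first three rows
Students$Age       # a single column, returned as a vector
```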
## Gender
##  boy girl
##    6    3
table(Name) # not informative in this case, since every name occurs exactly once
## Name
##    bir  geeta   hari laxman neelam    ram ramesh  shyam   sita
##      1      1      1      1      1      1      1      1      1
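Counts are often easier to read as proportions or percentages. A minimal sketch, assuming the same ‘Gender’ data as above, uses the base R function prop.table; the names `counts`, `props` and `percent` are hypothetical.

```r
Gender <- as.factor(c("boy", "boy", "girl", "girl", "boy", "boy", "boy", "girl", "boy"))

counts <- table(Gender)              # frequencies: 6 boys, 3 girls
props <- prop.table(counts)          # proportions that sum to 1
percent <- round(100 * props, 1)     # percentages rounded to one decimal place

counts
props
percent
```

With 6 boys and 3 girls out of 9 students, the proportions are two thirds and one third respectively.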
In this section we shall learn how to compute some basic statistical measures, namely the arithmetic mean, standard deviation and coefficient of variation, of a continuous variable. We shall compute the mean and standard deviation of the vector ‘Age’.
The arithmetic mean and standard deviation can easily be computed using the mean and sd functions respectively.
mean(Age) #gives the mean of the variable "Age"
## [1] 23.77778
sd(Age) #gives the standard deviation of "Age"
## [1] 4.176655
We see that the mean age of all students is 23.78 years and the standard deviation is 4.18 years. Now let us compute the coefficient of variation (CV) of ‘Age’. First recall the formula: \(\text{CV} = \frac{\text{standard deviation}}{\text{mean}} \times 100\).
CV = (sd(Age)/mean(Age))*100
CV
## [1] 17.56537
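If the CV is needed for several variables, the formula can be wrapped in a small function. The following is only a sketch; the function name `cv` is hypothetical and is not a built-in R function.

```r
Age <- c(25, 18, 25, 20, 20, 31, 28, 25, 22)

# Hypothetical helper: coefficient of variation in percent
cv <- function(v) {
  (sd(v) / mean(v)) * 100
}

cv(Age)            # same value as the CV computed step by step above
round(cv(Age), 2)  # rounded to two decimal places
```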
Now the reader may be interested in computing the average age of boys and girls separately. We shall compute the mean age of students gender-wise by subsetting the data frame ‘Students’ into two data frames, namely ‘Boys’ and ‘Girls’.
For this, let us subset the data for boys and girls using the subset function.
Boys=subset(Students,Gender=="boy", select =c(Name,Age)) # note that == tests for equality in R.
Boys
##     Name Age
## 1    ram  25
## 2   hari  18
## 5    bir  20
## 6  shyam  31
## 7 laxman  28
## 9 ramesh  22
Girls= subset(Students,Gender=="girl",select = c(Name,Age))
Girls
##     Name Age
## 3   sita  25
## 4  geeta  20
## 8 neelam  25
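The same subsets can also be obtained with square bracket indexing, where a logical condition selects the rows and a character vector selects the columns. This is a sketch of that alternative; the names `Boys2` and `Girls2` are hypothetical and used only to avoid overwriting the data frames created above.

```r
Name <- c("ram", "hari", "sita", "geeta", "bir", "shyam", "laxman", "neelam", "ramesh")
Gender <- as.factor(c("boy", "boy", "girl", "girl", "boy", "boy", "boy", "girl", "boy"))
Age <- c(25, 18, 25, 20, 20, 31, 28, 25, 22)
Students <- data.frame(Name, Gender, Age)

# Rows where the condition is TRUE, columns "Name" and "Age"
Boys2 <- Students[Students$Gender == "boy", c("Name", "Age")]
Girls2 <- Students[Students$Gender == "girl", c("Name", "Age")]

nrow(Boys2)   # 6
nrow(Girls2)  # 3
```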
After subsetting for boys and girls, we can now compute their average ages separately.
mean(Boys$Age) # note that the $ sign extracts the column Age from the data frame Boys.
## [1] 24
mean(Girls$Age)
## [1] 23.33333
The above result shows that the average age of the boys is 24 years and that of the girls is 23.33 years. Learners can go on practising in the same way with their own data sets. Note that R often offers several alternative functions for a given operation.
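As one illustration of such alternatives, the group-wise means can be computed in a single step without creating separate data frames. The sketch below uses the base R functions tapply and aggregate; the names `by_gender` and `agg` are hypothetical.

```r
Name <- c("ram", "hari", "sita", "geeta", "bir", "shyam", "laxman", "neelam", "ramesh")
Gender <- as.factor(c("boy", "boy", "girl", "girl", "boy", "boy", "boy", "girl", "boy"))
Age <- c(25, 18, 25, 20, 20, 31, 28, 25, 22)
Students <- data.frame(Name, Gender, Age)

# tapply applies a function to Age within each level of Gender
by_gender <- tapply(Students$Age, Students$Gender, mean)
by_gender

# aggregate returns the same group means as a data frame
agg <- aggregate(Age ~ Gender, data = Students, FUN = mean)
agg
```

Both calls should reproduce the means obtained from the ‘Boys’ and ‘Girls’ subsets.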