Introduction

R is a programming language and software environment for statistical analysis, graphical representation and reporting. R was created by Ross Ihaka and Robert Gentleman at the University of Auckland, New Zealand, and is currently developed by the R Development Core Team.

The only requirement for most of the R tutorials is a PC with a recent version of the free, open-source R software installed. R has extensive documentation and active online community support, which makes it a good environment to get started in statistical computing.

Features of R

  • R is a well-developed, simple and effective programming language which includes conditionals, loops, user-defined recursive functions, and input and output facilities.

  • R has an effective data handling and storage facility

  • R provides a suite of operators for calculations on arrays, lists, vectors and matrices

  • R provides a large, coherent and integrated collection of tools for data analysis.

  • R provides graphical facilities for data analysis and display, either directly on screen or printed on paper.

R - Basic Syntax

As a convention, we will start learning R programming by writing a “Hello, World!” program.

# My first program in R Programming
myString <- "Hello, World!"
print(myString)
## [1] "Hello, World!"

R - Data Types

In most programming languages, you use variables to store information. A variable is a name for a reserved memory location that holds a value, so creating a variable reserves some space in memory.

Unlike languages such as C and Java, R does not require variables to be declared with a data type. Instead, a variable is assigned an R object, and the data type of that object becomes the data type of the variable. There are many types of R objects. The most frequently used ones are:
* Vectors
* Lists
* Matrices
* Arrays
* Factors
* Data Frames
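
To make the list above concrete, here is a minimal sketch that creates one small object of each kind and asks R what it is. All object names and values are made up for illustration.

```r
# One small object of each kind; names and values are illustrative only
obj_vector <- c(1.5, 2.5, 3.5)
obj_list   <- list(1, "a", TRUE)
obj_matrix <- matrix(1:6, nrow = 2, ncol = 3)
obj_array  <- array(1:24, dim = c(2, 3, 4))
obj_factor <- factor(c("low", "high", "low"))
obj_frame  <- data.frame(id = 1:2, score = c(90, 85))

# class() reports the kind of object; [1] keeps only the first class name
kinds <- sapply(
  list(obj_vector, obj_list, obj_matrix, obj_array, obj_factor, obj_frame),
  function(obj) class(obj)[1]
)
kinds
```

Note that the variable itself was never declared with a type: the type comes from the object assigned to it.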

Setting up the working directory (WD)

setwd("D:/Analytics/BACP-Dec2017/R_Programming")
getwd()
## [1] "D:/Analytics/BACP-Dec2017/R_Programming"

Create a vector x of length 100, with elements 1, 2, 3, ..., 100

x<-1:100
x
##   [1]   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17
##  [18]  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34
##  [35]  35  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51
##  [52]  52  53  54  55  56  57  58  59  60  61  62  63  64  65  66  67  68
##  [69]  69  70  71  72  73  74  75  76  77  78  79  80  81  82  83  84  85
##  [86]  86  87  88  89  90  91  92  93  94  95  96  97  98  99 100

R as a calculator

12+5
## [1] 17
(12+5)*(39-13)/45
## [1] 9.822222

Variables

A variable allows you to store a value (e.g. 6) or an object (e.g. a function description) in R. You can later use the variable's name to access the value or object stored in it.

my_var<-29
my_var
## [1] 29
my_oranges<-6
my_apples<-6
my_fruits<-my_apples+my_oranges
my_fruits
## [1] 12

Basic data types in R

  • Decimal values like 3.5 are called numerics.
  • Natural numbers like 3 are called integers. Integers are also numerics.
  • Boolean values (TRUE or FALSE) are called logical.
  • Text (or string) values are called characters.

class(5)
## [1] "numeric"
class('six')
## [1] "character"
class(1.2)
## [1] "numeric"
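
One detail worth checking at the console: a number typed without a suffix is stored as a numeric (double), even when it is a whole number. A short sketch of how to request a true integer and how to test types:

```r
class(3)         # "numeric": a plain 3 is stored as a double
class(3L)        # "integer": the L suffix requests an integer
class(TRUE)      # "logical"
is.numeric(3L)   # TRUE: integers also count as numeric
is.integer(3)    # FALSE: no L suffix, so not an integer
```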

R Objects

Vectors

* R operates on named data structures.
* The simplest such structure is the numeric vector, which is a single entity consisting of an ordered collection of numbers.

x<-c(3,4,6,7,8)     #Assignment with <-; c() is the combine function
c(3,4,6,7,8) ->x1   #Rightward assignment with -> is also possible
x1
## [1] 3 4 6 7 8
1/x                 #gives the reciprocal
## [1] 0.3333333 0.2500000 0.1666667 0.1428571 0.1250000
C<-2*x              #Vector arithmetic
C
## [1]  6  8 12 14 16

Character vectors

Character vectors hold text strings, each delimited by double quote characters, e.g. "x-values", "New iteration results".

names<-c("ram","shyam","john")

type<- c("Compact","Minivan","SUV","Roadster","Pickup Truck")
mileage<-c(1256,237,6780,1000,12000)
names
## [1] "ram"   "shyam" "john"
type
## [1] "Compact"      "Minivan"      "SUV"          "Roadster"    
## [5] "Pickup Truck"
mileage
## [1]  1256   237  6780  1000 12000

Vector selection

mileage[3]
## [1] 6780
type[2]
## [1] "Minivan"
type[5]
## [1] "Pickup Truck"
mileage[4]
## [1] 1000
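
Square brackets also accept a vector of positions, negative positions, or a logical condition. A minimal sketch using the same type and mileage vectors (repeated here so the block runs on its own):

```r
type <- c("Compact", "Minivan", "SUV", "Roadster", "Pickup Truck")
mileage <- c(1256, 237, 6780, 1000, 12000)

mileage[c(1, 3)]                  # elements 1 and 3
type[2:4]                         # elements 2 through 4
mileage[-1]                       # everything except the first element
high_mileage <- type[mileage > 1000]   # types whose mileage exceeds 1000
high_mileage
## [1] "Compact"      "SUV"          "Pickup Truck"
```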

Matrix

M<-matrix(1:9, 3, 3) 
M
##      [,1] [,2] [,3]
## [1,]    1    4    7
## [2,]    2    5    8
## [3,]    3    6    9

M is the name of the matrix
matrix() is the function that creates the matrix
1:9 is the data (1,2,3,4,5,6,7,8,9), which is filled into the matrix column by column
3,3 are the number of rows and the number of columns

To select 1st row and 2nd column

M[1,2]
## [1] 4

Use negative subscripts to remove elements: M[-1,-2] removes the 1st row and the 2nd column

m<-matrix(data=c(2,3,4,5), nrow=2, ncol=2)
m
##      [,1] [,2]
## [1,]    2    4
## [2,]    3    5
m2 <-matrix(c(2,3,4,5),2,2)
m2
##      [,1] [,2]
## [1,]    2    4
## [2,]    3    5
x<-c(1,2,3,4)
m3<-matrix(x,2,2)
m3
##      [,1] [,2]
## [1,]    1    3
## [2,]    2    4
m3[1,2]
## [1] 3
m4<-m3[-1,-2]
m3
##      [,1] [,2]
## [1,]    1    3
## [2,]    2    4
m4
## [1] 2
m<-matrix(data=c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16), nrow=4, ncol=4)
m
##      [,1] [,2] [,3] [,4]
## [1,]    1    5    9   13
## [2,]    2    6   10   14
## [3,]    3    7   11   15
## [4,]    4    8   12   16
m5<-m[1,]
m5
## [1]  1  5  9 13
m6<-m[,1]
m6
## [1] 1 2 3 4
m7<-m[1:3,2:4]
m7
##      [,1] [,2] [,3]
## [1,]    5    9   13
## [2,]    6   10   14
## [3,]    7   11   15
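
As the m4, m5 and m6 results show, selecting a single row or column returns a plain vector rather than a matrix. If a matrix result is needed, the drop = FALSE argument keeps the dimensions. A minimal sketch (the variable names are illustrative):

```r
m <- matrix(1:16, nrow = 4, ncol = 4)

row_as_vector <- m[1, ]                 # dimensions are dropped
row_as_matrix <- m[1, , drop = FALSE]   # stays a 1 x 4 matrix

dim(row_as_vector)   # NULL, because a plain vector has no dimensions
dim(row_as_matrix)   # 1 4
```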

Calculation using a matrix

m<-matrix(data=c(2,3,4,5), nrow=2, ncol=2)
m
##      [,1] [,2]
## [1,]    2    4
## [2,]    3    5
m_rsum<-rowSums(m)
m_rsum
## [1] 6 8
m_csum<-colSums(m)
m_csum
## [1] 5 9

Insert new rows and columns in a matrix

m
##      [,1] [,2]
## [1,]    2    4
## [2,]    3    5
a<-c(8,9)
m_newc<-cbind(m,a)    #use cbind to insert new column
m_newc
##          a
## [1,] 2 4 8
## [2,] 3 5 9
b<-c(6,7)
m_newr<-rbind(m,b)    #use rbind to insert new row
m_newr
##   [,1] [,2]
##      2    4
##      3    5
## b    6    7

Data Frames

type<- c("Compact","Minivan","SUV","Roadster","Pickup Truck")
mileage<-c(1256,237,6780,1000,12000)
price<-c(36790,3445,6678,2455,76889)
no.cyl<-c(3,4,4,4,4)
cars<-data.frame(type,price,mileage,no.cyl)
cars
##           type price mileage no.cyl
## 1      Compact 36790    1256      3
## 2      Minivan  3445     237      4
## 3          SUV  6678    6780      4
## 4     Roadster  2455    1000      4
## 5 Pickup Truck 76889   12000      4

Select from data frames

cars[1,2]
## [1] 36790
cars[1,]
##      type price mileage no.cyl
## 1 Compact 36790    1256      3
cars[,1]
## [1] Compact      Minivan      SUV          Roadster     Pickup Truck
## Levels: Compact Minivan Pickup Truck Roadster SUV
cars[2:4,1:3]
##       type price mileage
## 2  Minivan  3445     237
## 3      SUV  6678    6780
## 4 Roadster  2455    1000
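
Columns can also be selected by name, and rows by a condition, which is often easier to read than numeric positions. A minimal sketch that rebuilds the same cars data frame so it runs on its own; stringsAsFactors = FALSE is set explicitly here so that type is kept as plain text:

```r
type <- c("Compact", "Minivan", "SUV", "Roadster", "Pickup Truck")
mileage <- c(1256, 237, 6780, 1000, 12000)
price <- c(36790, 3445, 6678, 2455, 76889)
no.cyl <- c(3, 4, 4, 4, 4)
cars <- data.frame(type, price, mileage, no.cyl, stringsAsFactors = FALSE)

cars$price                       # one column, by name, as a vector
cars[, c("type", "price")]       # several columns, by name

# Rows where no.cyl is 4, keeping only two columns
four_cyl <- cars[cars$no.cyl == 4, c("type", "price")]
four_cyl
```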

List

my_list <- list(component1, component2 …)

my_vector<-1:10
my_matrix<-matrix(1:9,ncol=3)
my_list<-list(my_vector,my_matrix)
my_list
## [[1]]
##  [1]  1  2  3  4  5  6  7  8  9 10
## 
## [[2]]
##      [,1] [,2] [,3]
## [1,]    1    4    7
## [2,]    2    5    8
## [3,]    3    6    9
my_list[[2]]
##      [,1] [,2] [,3]
## [1,]    1    4    7
## [2,]    2    5    8
## [3,]    3    6    9
my_list[[1]]
##  [1]  1  2  3  4  5  6  7  8  9 10
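
List components can be given names, which allows access with $ as well as with [[ ]]. A minimal sketch; the component names numbers and grid are made up for this example:

```r
named_list <- list(numbers = 1:10, grid = matrix(1:9, ncol = 3))

named_list$numbers       # access a component by name with $
named_list[["grid"]]     # access a component by name with [[ ]]
names(named_list)        # "numbers" "grid"

# Single brackets return a shorter list; double brackets return the component
class(named_list[1])     # "list"
class(named_list[[1]])   # "integer"
```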

Packages

install.packages("package name")
library("package name")

install.packages("psych",repos = "http://cran.us.r-project.org")
## Installing package into 'C:/Users/Ranvir Kumar/Documents/R/win-library/3.4'
## (as 'lib' is unspecified)
## package 'psych' successfully unpacked and MD5 sums checked
## 
## The downloaded binary packages are in
##  C:\Users\Ranvir Kumar\AppData\Local\Temp\RtmpURLW4n\downloaded_packages
library("psych")
## Warning: package 'psych' was built under R version 3.4.3
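
A package only needs to be installed once, but library() must be called in every new session. One common pattern is to install only when the package is missing. The sketch below wraps this in a hypothetical helper, load_or_install (not part of R), and assumes the CRAN mirror https://cloud.r-project.org is reachable when a download is needed:

```r
# Hypothetical helper: install a package only if it is missing, then attach it
load_or_install <- function(pkg, repos = "https://cloud.r-project.org") {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg, repos = repos)
  }
  library(pkg, character.only = TRUE)   # pkg is a string, so character.only is needed
  invisible(TRUE)
}

# "tools" ships with R, so this call never triggers a download;
# load_or_install("psych") would follow the same path
load_or_install("tools")
```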

Practice Exercise

A<-c(1:10,21:35); B<-matrix(A,5,5); C<-B[1:3,1:3]

A<-c(1:10,21:35)
B<-matrix(A,5,5)
C<-B[1:3,1:3]
A
##  [1]  1  2  3  4  5  6  7  8  9 10 21 22 23 24 25 26 27 28 29 30 31 32 33
## [24] 34 35
B
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    1    6   21   26   31
## [2,]    2    7   22   27   32
## [3,]    3    8   23   28   33
## [4,]    4    9   24   29   34
## [5,]    5   10   25   30   35
C
##      [,1] [,2] [,3]
## [1,]    1    6   21
## [2,]    2    7   22
## [3,]    3    8   23

Data Transformation

x<- c(1:10)
xsq<-x^2
xsq
##  [1]   1   4   9  16  25  36  49  64  81 100
logx<-log(x)
logx
##  [1] 0.0000000 0.6931472 1.0986123 1.3862944 1.6094379 1.7917595 1.9459101
##  [8] 2.0794415 2.1972246 2.3025851
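
Note that log() computes the natural logarithm by default. Other bases and other common transformations work the same way on a whole vector. A minimal sketch (the result names are illustrative):

```r
x <- 1:10

sqrt_x  <- sqrt(x)             # square root of every element
log10_x <- log10(x)            # base-10 logarithm
log2_8  <- log(8, base = 2)    # any base can be given with the base argument
back    <- exp(log(x))         # exp() undoes the natural log

log10_x[10]   # 1, since 10^1 = 10
log2_8        # 3, since 2^3 = 8
```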

Importing Data

File formats:
* Text (.txt)
* CSV (.csv)
* Excel (.xls)
* SPSS ( .sav)
* STATA (.dta)
* SAS (.ssd)
* For more formats, see http://cran.r-project.org/doc/manuals/R-data.pdf, which also explains how to import image files.

Text Files

Comma Delimited Text Files:
data1<- read.table("C:/Users/xyz/Desktop/folderX/mydata.txt",header=TRUE, sep=",")

Space as the separator:
data1<- read.table("C:/Users/xyz/Desktop/folderX/mydata.txt", header=TRUE)

Another (easier) way is to set your working directory first; the command is then:
data1<- read.table("mydata.txt", header=TRUE)
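
The commands above need a file on disk. To try read.table without one, the following sketch first writes a tiny comma-delimited file to a temporary location and then reads it back. The file contents and the names tmp_txt and demo_txt are made up for illustration:

```r
# Write a small comma-delimited text file (made-up data) to a temporary path
tmp_txt <- tempfile(fileext = ".txt")
writeLines(c("name,score", "ram,81", "shyam,67", "john,90"), tmp_txt)

# header = TRUE takes the column names from the first line
demo_txt <- read.table(tmp_txt, header = TRUE, sep = ",")
str(demo_txt)

unlink(tmp_txt)   # remove the temporary file
```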

CSV Files

Similarly, use read.csv instead of read.table.
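
read.csv is read.table with defaults suited to CSV files (header = TRUE and sep = ","), so those arguments can be left out. A minimal round-trip sketch using a temporary file and made-up data:

```r
# Made-up data, written to a temporary CSV file and read back
scores <- data.frame(name = c("ram", "shyam", "john"), score = c(81, 67, 90))
tmp_csv <- tempfile(fileext = ".csv")
write.csv(scores, tmp_csv, row.names = FALSE)

demo_csv <- read.csv(tmp_csv)   # header and separator are handled by the defaults
demo_csv

unlink(tmp_csv)                 # remove the temporary file
```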

Excel Files

Use read.xls (this needs the gdata package; run library(gdata) after installing it).

SPSS

Requires the foreign package. Use the command read.spss.

STATA

Requires the foreign package. Use the command read.dta.

SAS

Requires the foreign package. Use the command read.ssd.

Importing houseprice data set from the working directory

library(readr)
houseprices<-read_csv("D:/Analytics/BACP-Dec2017/02_Introduction_To_R/ClassMaterial/houseprices.csv")
## Parsed with column specification:
## cols(
##   Price = col_integer(),
##   LivingArea = col_integer(),
##   Bathrooms = col_double(),
##   Bedrooms = col_integer(),
##   LotSize = col_double(),
##   Age = col_integer(),
##   Fireplace = col_integer()
## )
houseprices
## # A tibble: 1,047 x 7
##     Price LivingArea Bathrooms Bedrooms LotSize   Age Fireplace
##     <int>      <int>     <dbl>    <int>   <dbl> <int>     <int>
##  1 142212       1982      1.00        3  2.00     133         0
##  2 134865       1676      1.50        3  0.380     14         1
##  3 118007       1694      2.00        3  0.960     15         1
##  4 138297       1800      1.00        2  0.480     49         1
##  5 129470       2088      1.00        3  1.84      29         1
##  6 206512       1456      2.00        3  0.980     10         0
##  7  50709        960      1.50        2  0.0100    12         0
##  8 108794       1464      1.00        2  0.110     87         0
##  9  68353       1216      1.00        2  0.610    101         0
## 10 123266       1632      1.50        3  0.230     14         0
## # ... with 1,037 more rows
View(houseprices)    #View dataset in a separate window
head(houseprices)    #First six rows of the dataset
## # A tibble: 6 x 7
##    Price LivingArea Bathrooms Bedrooms LotSize   Age Fireplace
##    <int>      <int>     <dbl>    <int>   <dbl> <int>     <int>
## 1 142212       1982      1.00        3   2.00    133         0
## 2 134865       1676      1.50        3   0.380    14         1
## 3 118007       1694      2.00        3   0.960    15         1
## 4 138297       1800      1.00        2   0.480    49         1
## 5 129470       2088      1.00        3   1.84     29         1
## 6 206512       1456      2.00        3   0.980    10         0
head(houseprices,10) #First 10 rows of the dataset  
## # A tibble: 10 x 7
##     Price LivingArea Bathrooms Bedrooms LotSize   Age Fireplace
##     <int>      <int>     <dbl>    <int>   <dbl> <int>     <int>
##  1 142212       1982      1.00        3  2.00     133         0
##  2 134865       1676      1.50        3  0.380     14         1
##  3 118007       1694      2.00        3  0.960     15         1
##  4 138297       1800      1.00        2  0.480     49         1
##  5 129470       2088      1.00        3  1.84      29         1
##  6 206512       1456      2.00        3  0.980     10         0
##  7  50709        960      1.50        2  0.0100    12         0
##  8 108794       1464      1.00        2  0.110     87         0
##  9  68353       1216      1.00        2  0.610    101         0
## 10 123266       1632      1.50        3  0.230     14         0
tail(houseprices)    #Last six rows of the dataset
## # A tibble: 6 x 7
##    Price LivingArea Bathrooms Bedrooms LotSize   Age Fireplace
##    <int>      <int>     <dbl>    <int>   <dbl> <int>     <int>
## 1 206480       2310      2.50        3   1.00     18         0
## 2 107695       1802      2.00        4   0.970    56         1
## 3 236737       3239      3.50        4   2.50      1         1
## 4 154829       1440      2.00        2   0.610    66         1
## 5 179492       2030      2.50        3   1.00      3         1
## 6 189108       2097      2.50        3   1.93     10         1
tail(houseprices,10) #Last 10 rows of the dataset
## # A tibble: 10 x 7
##     Price LivingArea Bathrooms Bedrooms LotSize   Age Fireplace
##     <int>      <int>     <dbl>    <int>   <dbl> <int>     <int>
##  1 107973       1388      1.00        3   0.230    60         0
##  2 119875       1512      1.50        4   1.00     61         1
##  3  66027       1653      2.00        3   0.480    79         0
##  4 182649       1758      2.50        3   0.270     1         1
##  5 206480       2310      2.50        3   1.00     18         0
##  6 107695       1802      2.00        4   0.970    56         1
##  7 236737       3239      3.50        4   2.50      1         1
##  8 154829       1440      2.00        2   0.610    66         1
##  9 179492       2030      2.50        3   1.00      3         1
## 10 189108       2097      2.50        3   1.93     10         1
dim(houseprices)     # Dimension of the dataset
## [1] 1047    7
summary(houseprices) # Summary of the dataset
##      Price          LivingArea     Bathrooms        Bedrooms    
##  Min.   : 16858   Min.   : 672   Min.   :1.000   Min.   :1.000  
##  1st Qu.:112014   1st Qu.:1336   1st Qu.:1.500   1st Qu.:3.000  
##  Median :151917   Median :1672   Median :2.000   Median :3.000  
##  Mean   :163862   Mean   :1807   Mean   :1.918   Mean   :3.183  
##  3rd Qu.:205235   3rd Qu.:2206   3rd Qu.:2.500   3rd Qu.:4.000  
##  Max.   :446436   Max.   :4534   Max.   :4.500   Max.   :6.000  
##     LotSize            Age           Fireplace     
##  Min.   :0.0000   Min.   :  0.00   Min.   :0.0000  
##  1st Qu.:0.2100   1st Qu.:  6.00   1st Qu.:0.0000  
##  Median :0.3900   Median : 18.00   Median :1.0000  
##  Mean   :0.5696   Mean   : 28.06   Mean   :0.5931  
##  3rd Qu.:0.6000   3rd Qu.: 34.00   3rd Qu.:1.0000  
##  Max.   :9.0000   Max.   :247.00   Max.   :1.0000

Variable calling

summary(houseprices$Bedrooms)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   3.000   3.000   3.183   4.000   6.000
table(houseprices$Bedrooms)
## 
##   1   2   3   4   5   6 
##   3 176 522 321  22   3
summary(houseprices$Bathrooms)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   1.500   2.000   1.918   2.500   4.500
table(houseprices$Bathrooms)
## 
##   1 1.5   2 2.5   3 3.5   4 4.5 
## 198 261 172 373  23  15   3   2
mean(houseprices$Price)
## [1] 163862.1
var(houseprices$Price)
## [1] 4576733424
sd(houseprices$Price)
## [1] 67651.56
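
The three statistics above are related: sd() is the square root of var(), and var() uses the sample formula with n - 1 in the denominator. A minimal sketch on a small made-up vector (not taken from the houseprices data):

```r
prices <- c(120000, 150000, 180000, 210000)   # made-up values

mean(prices)   # 165000
var(prices)    # sample variance, dividing by n - 1

# The same variance computed by hand
manual_var <- sum((prices - mean(prices))^2) / (length(prices) - 1)

all.equal(var(prices), manual_var)         # TRUE
all.equal(sd(prices), sqrt(var(prices)))   # TRUE
```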

Subset of the dataset

Subset1<-data[which(data$Price<=amount),]

housepricessubset1<-houseprices[which(houseprices$Price<=50000),]
housepricessubset1
## # A tibble: 12 x 7
##    Price LivingArea Bathrooms Bedrooms LotSize   Age Fireplace
##    <int>      <int>     <dbl>    <int>   <dbl> <int>     <int>
##  1 45004        960      1.00        2  0.540     11         0
##  2 45904       1328      1.00        4  0.190    103         0
##  3 44674       1214      1.00        3  0.140    103         0
##  4 16858       1629      1.00        3  0.760    180         0
##  5 26130        822      1.00        2  0.560    173         0
##  6 47630       1235      1.00        3  0.320     84         1
##  7 40932       1320      1.00        3  0.170     90         0
##  8 26049       1344      2.00        3  0.920     13         0
##  9 49211        800      1.00        2  0.460     55         0
## 10 31113       1540      1.00        2  0.0400   115         0
## 11 44873        882      1.50        3  0.180     71         0
## 12 49564       1363      2.00        3  2.40      39         0
housepricessubset2<-houseprices[which(houseprices$Price<=50000 & houseprices$Bedrooms==3),]
housepricessubset2
## # A tibble: 7 x 7
##   Price LivingArea Bathrooms Bedrooms LotSize   Age Fireplace
##   <int>      <int>     <dbl>    <int>   <dbl> <int>     <int>
## 1 44674       1214      1.00        3   0.140   103         0
## 2 16858       1629      1.00        3   0.760   180         0
## 3 47630       1235      1.00        3   0.320    84         1
## 4 40932       1320      1.00        3   0.170    90         0
## 5 26049       1344      2.00        3   0.920    13         0
## 6 44873        882      1.50        3   0.180    71         0
## 7 49564       1363      2.00        3   2.40     39         0
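
Base R also provides subset(), which expresses the same filter without repeating the data frame name. A minimal sketch on a small made-up data frame (homes is hypothetical, not the houseprices data):

```r
# Made-up data for illustration
homes <- data.frame(
  Price    = c(45000, 52000, 26000, 49500),
  Bedrooms = c(2, 3, 3, 3)
)

# Bracket indexing with which(), as in the examples above
cheap <- homes[which(homes$Price <= 50000), ]

# The same idea with subset(); column names are used directly
cheap3 <- subset(homes, Price <= 50000 & Bedrooms == 3)

nrow(cheap)    # 3
nrow(cheap3)   # 2
```

Both forms drop rows where the condition is NA, so they give the same rows for the same condition.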

Scatter Plots

x<-c(2:200)
y<-2*x-8
plot(x,y)

Plot

plot(houseprices$Price,houseprices$Bedrooms)

plot(houseprices$Price,houseprices$LotSize)

Barplot

barplot(table(houseprices$Bedrooms),main = "No. of Bedrooms")

barplot(table(houseprices$Price), main = "Prices")

barplot(table(houseprices$LotSize), main = "LotSize")

Histogram

hist(houseprices$Bedrooms, main = "Bedrooms")

hist(houseprices$Price, main = "Prices",xlab = "Home Prices", ylab = "frequency of prices")

hist(houseprices$Age, main = "Age of the House", xlab = "House Age", ylab = "frequency of the house age",col = "blue")

Boxplot

boxplot(houseprices$Bedrooms, horizontal = TRUE)

boxplot(houseprices$Price, horizontal = TRUE)

boxplot(houseprices$Age, horizontal = TRUE,main="House Age", xlab="Frequency", ylab="Age",col="red")

Paneling Graphics

par(mfrow=c(2,2))
plot(houseprices$Price,houseprices$Bedrooms,main = "Prices vs No. of Bedrooms")
hist(houseprices$Age, main = "Age of the House", xlab = "House Age", ylab = "frequency of the house age",col ="blue")
boxplot(houseprices$Age, horizontal = TRUE,main="House Age", xlab="Frequency", ylab="Age",col="green")
barplot(table(houseprices$Bedrooms),main = "No. of Bedrooms", col = heat.colors(10))

Reset the panel layout by closing the graphics device

dev.off()
## null device 
##           1
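
Instead of closing the device, the previous layout can be saved and restored, because par() returns the old settings when it changes them. A minimal sketch, assuming an off-screen device (pdf(NULL)) so that it runs without opening a window; the plotted values are made up:

```r
pdf(NULL)                          # off-screen device; no file is written

old_par <- par(mfrow = c(1, 2))    # switch to 1 row x 2 columns, keep old settings
plot(1:10, main = "Left panel")
hist(c(1, 2, 2, 3, 3, 3), main = "Right panel")

par(old_par)                       # restore the saved layout
restored <- par("mfrow")
restored                           # 1 1

invisible(dev.off())               # close the off-screen device
```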