1. What are the measures of central tendency and variation of data?
The mean (the average value) and the median (the middle value of the sorted data) are the measures of central tendency.
Although the mean is the basic measure of central tendency, it is sensitive to outliers. When outliers are present, the median or a trimmed mean (the mean after dropping extreme values) is a more robust measure.
Variation of data is measured by:
* Variance (the average of the squared deviations, where a deviation is the difference between an observed value and the mean)
* Standard deviation (the square root of the variance)
* Mean absolute deviation (the average of the absolute differences between each observed value and the mean)
All of these measures are sensitive to outliers.
A robust estimate of variability can be obtained with any of the following:
* MAD (median absolute deviation: the median of the absolute differences between each observed value and the median)
* IQR (interquartile range: the difference between the 75th percentile and the 25th percentile)
* Trimmed standard deviation (the standard deviation after dropping outliers)
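The measures listed above can be tried out in base R. The following is a minimal sketch on a made-up sample `x` (hypothetical data with one deliberate outlier). Note that R's `var()` and `sd()` divide by n - 1, and that `mad()` rescales its result by default, so `constant = 1` is passed to get the plain median absolute deviation.

```r
# Hypothetical sample with one deliberate outlier (100)
x <- c(2, 4, 4, 5, 7, 9, 100)

# Central tendency
mean(x)                  # pulled upward by the outlier
median(x)                # 5, unaffected by the outlier
mean(x, trim = 0.2)      # 5.8, trimmed mean: drops 20% of the values at each end

# Variation (sensitive to outliers)
var(x)                   # sample variance (divides by n - 1)
sd(x)                    # square root of the variance
mean(abs(x - mean(x)))   # mean absolute deviation

# Robust variation
mad(x, constant = 1)     # 2, median absolute deviation without rescaling
IQR(x)                   # 4, 75th percentile minus 25th percentile
```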
2. What are the different ways to create a vector in R?
#1) Vector creation for Numeric values
A <- c(100,101,102)
A <- c(C1=100,C2=101,C3=102) # Vector creation with element names
A
## C1 C2 C3
## 100 101 102
B <- 1:10
B
## [1] 1 2 3 4 5 6 7 8 9 10
C <- seq(1,10,by=0.5)
C
## [1] 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0 5.5 6.0 6.5 7.0 7.5
## [15] 8.0 8.5 9.0 9.5 10.0
#2) Vector creation for String values
D <- c("red","blue","green")
D
## [1] "red" "blue" "green"
#3) Vector creation by reading csv, txt, xlsx, or fwf files (the result is a data frame whose columns are vectors)
# 3a) Example of reading from csv files
data <- url("http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data")
iris <- read.csv(data, header=FALSE)
head(iris)
## V1 V2 V3 V4 V5
## 1 5.1 3.5 1.4 0.2 Iris-setosa
## 2 4.9 3.0 1.4 0.2 Iris-setosa
## 3 4.7 3.2 1.3 0.2 Iris-setosa
## 4 4.6 3.1 1.5 0.2 Iris-setosa
## 5 5.0 3.6 1.4 0.2 Iris-setosa
## 6 5.4 3.9 1.7 0.4 Iris-setosa
#4) Reading data saved in a workspace file, using the load() function
#5) Reading Apache-style log files, using the read_log() function from the readr package
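A few more base R routes to a vector can be sketched alongside the list above: `rep()`, pre-allocation with `numeric()`, extracting a data frame column, and a `save()`/`load()` round trip. The object names (`E`, `Z`, `G`, `df`) and the data are illustrative assumptions, not part of the original exercise.

```r
# Repeating values
E <- rep(c("a", "b"), times = 3)      # "a" "b" "a" "b" "a" "b"

# Pre-allocating a numeric vector of a given length
Z <- numeric(5)                       # 0 0 0 0 0

# Pulling one column of a data frame out as a vector (hypothetical data)
df <- data.frame(id = 1:3, score = c(90, 85, 77))
G <- df$score

# Saving a vector to a workspace file and loading it back
path <- tempfile(fileext = ".RData")
save(G, file = path)
rm(G)
load(path)                            # restores the object named G
G
```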
3. Create the following vector and check the class: ('x', 'x', 'x', 1, 3, 5, 7, 9, 2, 4, 6, 8, 10)
A <- c('x','x', 'x', 1,3,5,7,9,2,4,6,8,10)
class(A)
## [1] "character"
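The class is "character" because an atomic vector holds a single type, so `c()` coerces mixed inputs to the most general type present. A short sketch of that coercion order (logical, then integer, then numeric, then character), using small made-up vectors:

```r
class(c(TRUE, 2L))     # "integer":   logical is promoted to integer
class(c(2L, 3.5))      # "numeric":   integer is promoted to double
class(c(3.5, "x"))     # "character": numbers are converted to strings

# The numbers in a mixed vector are stored as strings ...
B <- c('x', 1, 3)
B                      # "x" "1" "3"

# ... and can be converted back once the non-numeric entries are dropped
nums <- as.numeric(B[-1])
nums                   # 1 3
```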
4. Create a vector of positive odd integers less than 100
# 4. Creating odd integers vector
A <- 1:100
A <- A[A %% 2 != 0]
A
## [1] 1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 33 35 37 39 41 43 45
## [24] 47 49 51 53 55 57 59 61 63 65 67 69 71 73 75 77 79 81 83 85 87 89 91
## [47] 93 95 97 99
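The same vector can probably be built more directly. As a sketch, here are three equivalent alternatives to filtering with `%%`; the names `odd1`, `odd2`, and `odd3` are illustrative only.

```r
# Step through the range two at a time
odd1 <- seq(1, 99, by = 2)

# Recycle a logical mask that keeps every other element of 1:100
odd2 <- (1:100)[c(TRUE, FALSE)]

# Use the formula 2k - 1 for k = 1..50
odd3 <- 2 * (1:50) - 1

length(odd1)           # 50
all(odd1 == odd2)      # TRUE
all(odd1 == odd3)      # TRUE
```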
5. Remove the values greater than 60 and less than 80
# 5. Keeping only the values greater than 60 and less than 80 (dropping everything else)
A <- A[A > 60]
A <- A[A < 80]
A
## [1] 61 63 65 67 69 71 73 75 77 79
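The two filtering steps can be combined into one logical condition with `&`. The question can also be read the other way (dropping the values between 60 and 80), so this sketch shows both readings; `odds`, `kept`, and `dropped` are illustrative names.

```r
odds <- seq(1, 99, by = 2)        # the odd integers from Question 4

# Reading used in the answer above: keep values strictly between 60 and 80
kept <- odds[odds > 60 & odds < 80]
kept                              # 61 63 65 67 69 71 73 75 77 79

# Literal reading of the question: remove those values and keep the rest
dropped <- odds[!(odds > 60 & odds < 80)]
length(dropped)                   # 40
```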
6. Write a function to return standard deviation, mean, and median of the vector from Question 5.
# 6. Calculating standard deviation, mean, and median of the vector
meanMedianStdDev <- function(A1){
  c(Mean=mean(A1), Median=median(A1), Standard_Deviation=sd(A1))
}
meanMedianStdDev(A)
## Mean Median Standard_Deviation
## 70.000000 70.000000 6.055301
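If the input might contain missing values, the function above would return `NA` for every statistic. One possible variant, sketched here under the assumption that dropping `NA`s is acceptable, adds an `na.rm` argument and an input check; `summaryStats` is a hypothetical name.

```r
summaryStats <- function(x, na.rm = TRUE) {
  # Fail early on non-numeric input instead of returning a confusing result
  stopifnot(is.numeric(x))
  c(Mean = mean(x, na.rm = na.rm),
    Median = median(x, na.rm = na.rm),
    Standard_Deviation = sd(x, na.rm = na.rm))
}

# Same values as in Question 5, plus one missing value
res <- summaryStats(c(seq(61, 79, by = 2), NA))
res
```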
7. Create matrices from each of the given sets of numbers in two ways (filled by column and filled by row): X1 = {2,3,7,1,6,2,3,5,1} and X2 = {3,2,9,0,7,8,5,8,2}
X1 = c(2,3,7,1,6,2,3,5,1)
X2 = c(3,2,9,0,7,8,5,8,2)
#Matrix for X1 created by populating column first
matX1 = matrix(X1,3)
matX1
## [,1] [,2] [,3]
## [1,] 2 1 3
## [2,] 3 6 5
## [3,] 7 2 1
#Matrix for X1 created by populating row first
matX1 = matrix(X1,3, byrow=TRUE)
matX1
## [,1] [,2] [,3]
## [1,] 2 3 7
## [2,] 1 6 2
## [3,] 3 5 1
#Matrix for X2 created by populating column first
matX2 = matrix(X2,3)
matX2
## [,1] [,2] [,3]
## [1,] 3 0 5
## [2,] 2 7 8
## [3,] 9 8 2
#Matrix for X2 created by populating row first
matX2 = matrix(X2,3, byrow=TRUE)
matX2
## [,1] [,2] [,3]
## [1,] 3 2 9
## [2,] 0 7 8
## [3,] 5 8 2
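The two fill orders are related by a transpose: filling by row gives the transpose of filling by column. A small sketch checking this for X1, and showing a third way to build the column-filled matrix by setting `dim`; `byCol`, `byRow`, and `M` are illustrative names.

```r
X1 <- c(2, 3, 7, 1, 6, 2, 3, 5, 1)

byCol <- matrix(X1, nrow = 3)                # filled column by column
byRow <- matrix(X1, nrow = 3, byrow = TRUE)  # filled row by row

# The row-filled matrix is the transpose of the column-filled one
identical(byRow, t(byCol))                   # TRUE

# Assigning dimensions to a plain vector also fills column by column
M <- X1
dim(M) <- c(3, 3)
identical(M, byCol)                          # TRUE
```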
8. Find the matrix product
# Matrix multiplication
matX1 %*% matX2
## [,1] [,2] [,3]
## [1,] 41 81 56
## [2,] 13 60 61
## [3,] 14 49 69
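To see what `%*%` computes, one entry of the product can be checked by hand: entry [1,1] is the dot product of row 1 of the first matrix with column 1 of the second. The sketch below rebuilds the row-filled matrices so that it runs on its own, and contrasts `%*%` with the element-wise `*` operator; `P` is an illustrative name.

```r
matX1 <- matrix(c(2, 3, 7, 1, 6, 2, 3, 5, 1), 3, byrow = TRUE)
matX2 <- matrix(c(3, 2, 9, 0, 7, 8, 5, 8, 2), 3, byrow = TRUE)

P <- matX1 %*% matX2

# Row 1 of matX1 times column 1 of matX2: 2*3 + 3*0 + 7*5 = 41
sum(matX1[1, ] * matX2[, 1])     # 41
P[1, 1]                          # 41

# '*' multiplies element by element and is not the matrix product
(matX1 * matX2)[1, 1]            # 6

# Matrix multiplication is not commutative in general
identical(P, matX2 %*% matX1)    # FALSE
```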
9. Find the class of the ‘iris’ data frame, find the class of each of its columns, and get the summary. Get the row names and the column names. Get the number of rows and the number of columns.
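# Remove the copy of iris read from the URL above so that the built-in dataset is used
remove(iris)
# iris ships with the base datasets package; plyr is not required for the steps below
#install.packages("plyr")
library("plyr")
#head(iris)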
# Class of the iris object as a whole
class(iris)
## [1] "data.frame"
# Class for each column of iris
paste( "class for colnames:"
,class(iris$Sepal.Length)
,class(iris$Sepal.Width)
,class(iris$Petal.Length)
,class(iris$Petal.Width)
,class(iris$Species))
## [1] "class for colnames: numeric numeric numeric numeric factor"
#Summary for iris
summary(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
## 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
## Median :5.800 Median :3.000 Median :4.350 Median :1.300
## Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
## 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
## Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
## Species
## setosa :50
## versicolor:50
## virginica :50
##
##
##
#Column names for iris
colnames(iris)
## [1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width"
## [5] "Species"
#Row names for iris
rownames(iris)
## [1] "1" "2" "3" "4" "5" "6" "7" "8" "9" "10" "11"
## [12] "12" "13" "14" "15" "16" "17" "18" "19" "20" "21" "22"
## [23] "23" "24" "25" "26" "27" "28" "29" "30" "31" "32" "33"
## [34] "34" "35" "36" "37" "38" "39" "40" "41" "42" "43" "44"
## [45] "45" "46" "47" "48" "49" "50" "51" "52" "53" "54" "55"
## [56] "56" "57" "58" "59" "60" "61" "62" "63" "64" "65" "66"
## [67] "67" "68" "69" "70" "71" "72" "73" "74" "75" "76" "77"
## [78] "78" "79" "80" "81" "82" "83" "84" "85" "86" "87" "88"
## [89] "89" "90" "91" "92" "93" "94" "95" "96" "97" "98" "99"
## [100] "100" "101" "102" "103" "104" "105" "106" "107" "108" "109" "110"
## [111] "111" "112" "113" "114" "115" "116" "117" "118" "119" "120" "121"
## [122] "122" "123" "124" "125" "126" "127" "128" "129" "130" "131" "132"
## [133] "133" "134" "135" "136" "137" "138" "139" "140" "141" "142" "143"
## [144] "144" "145" "146" "147" "148" "149" "150"
# Number of Rows
NROW(iris)
## [1] 150
# Number of Columns
NCOL(iris)
## [1] 5
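The per-column classes and the dimensions can also be obtained without naming each column. A brief sketch, referring to the built-in dataset explicitly as `datasets::iris` so that it does not depend on what is in the workspace; `cls` and `d` are illustrative names.

```r
# Class of every column in one call (returns a named character vector)
cls <- sapply(datasets::iris, class)
cls

# Rows and columns together: 150 rows, 5 columns
d <- dim(datasets::iris)
d                          # 150 5

# Compact overview of the structure, types, and first values
str(datasets::iris)
```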
10. Get the last two rows of the last two columns of the iris dataset.
iris[149:150,4:5]
## Petal.Width Species
## 149 2.3 virginica
## 150 1.8 virginica
#Last 2 rows and columns when number of rows and columns are unknown
iris[(NROW(iris)-1):NROW(iris),(NCOL(iris)-1):NCOL(iris)]
## Petal.Width Species
## 149 2.3 virginica
## 150 1.8 virginica
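Another way to avoid hard-coded positions is `tail()`, which works on both rows and index vectors. This is a sketch of that alternative on the built-in dataset; `df`, `lastCols`, and `res` are illustrative names.

```r
df <- datasets::iris

# Positions of the last two columns, whatever the column count is
lastCols <- tail(seq_along(df), 2)     # 4 5

# Last two rows of those columns
res <- tail(df[, lastCols], 2)
res
```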