Basic Data Exploration

Topics

Basic exploration

Basic Exploration

  • Getting a list of datasets currently available
data()
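
The listing can also be narrowed to one package. A minimal sketch, assuming the datasets package that ships with R; the variable name ds is only illustrative.

```r
# data() returns an object whose $results matrix has one row per dataset
ds <- data(package = "datasets")$results
# show the name and title of the first few datasets
head(ds[, c("Item", "Title")])
```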

? or help()

  • Getting background info about a dataset, for example iris
?iris 

str() function

  • Getting info about structure
str(iris)
'data.frame':   150 obs. of  5 variables:
 $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
 $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
 $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
 $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
 $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

dim() function

  • Getting info about dimensions of a data set
dim(iris)
[1] 150   5
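
The two dimensions can also be read one at a time. A small sketch; n_rows and n_cols are illustrative names.

```r
# number of rows (observations)
n_rows <- nrow(iris)
# number of columns (variables)
n_cols <- ncol(iris)
c(rows = n_rows, cols = n_cols)
```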

attributes() function

  • Getting info about attributes
attributes(iris)
$names
[1] "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width" 
[5] "Species"     

$row.names
  [1]   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17
 [18]  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34
 [35]  35  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51
 [52]  52  53  54  55  56  57  58  59  60  61  62  63  64  65  66  67  68
 [69]  69  70  71  72  73  74  75  76  77  78  79  80  81  82  83  84  85
 [86]  86  87  88  89  90  91  92  93  94  95  96  97  98  99 100 101 102
[103] 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119
[120] 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136
[137] 137 138 139 140 141 142 143 144 145 146 147 148 149 150

$class
[1] "data.frame"

class() function

  • Getting info about class type
class(iris)
[1] "data.frame"
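
class() also works on individual columns. A sketch that applies it to every column at once with sapply(); col_classes is an illustrative name.

```r
# apply class() to each column; the result is a named character vector
col_classes <- sapply(iris, class)
col_classes
```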

head() function

  • Getting the first few rows of data (six by default)
head(iris)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa

tail() function

  • Getting the last few rows of data (six by default)
tail(iris)
    Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
145          6.7         3.3          5.7         2.5 virginica
146          6.7         3.0          5.2         2.3 virginica
147          6.3         2.5          5.0         1.9 virginica
148          6.5         3.0          5.2         2.0 virginica
149          6.2         3.4          5.4         2.3 virginica
150          5.9         3.0          5.1         1.8 virginica
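
Both head() and tail() accept an n argument to change how many rows are returned. A short sketch; first3 and last3 are illustrative names.

```r
# first three rows
first3 <- head(iris, n = 3)
# last three rows; the original row names are kept
last3 <- tail(iris, n = 3)
first3
last3
```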

summary() function

  • Getting a summary of each column: the five-number summary plus the mean for numeric columns, counts for factors
summary(iris)
  Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
 Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
 1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
 Median :5.800   Median :3.000   Median :4.350   Median :1.300  
 Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
 3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
 Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
       Species  
 setosa    :50  
 versicolor:50  
 virginica :50  



quantile() function

  • Getting the minimum, the three quartiles, and the maximum of a data series
quantile(iris$Sepal.Length)
  0%  25%  50%  75% 100% 
 4.3  5.1  5.8  6.4  7.9 
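
Other cut points can be requested through the probs argument, and IQR() gives the spread between the first and third quartiles. A minimal sketch; the chosen probabilities are only an example.

```r
x <- iris$Sepal.Length
# 10th and 90th percentiles instead of the default quartiles
deciles <- quantile(x, probs = c(0.1, 0.9))
# interquartile range: third quartile minus first quartile
iqr_x <- IQR(x)
deciles
iqr_x
```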

Exploring single variable

var() function

  • Computing variance
var(iris$Sepal.Length)
[1] 0.6856935
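
var() returns the sample variance, which divides by n - 1. A sketch that recomputes it by hand and derives the standard deviation; manual_var and sd_x are illustrative names.

```r
x <- iris$Sepal.Length
# sample variance: sum of squared deviations divided by (n - 1)
manual_var <- sum((x - mean(x))^2) / (length(x) - 1)
# standard deviation is the square root of the variance
sd_x <- sqrt(var(x))
manual_var
sd_x
```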

hist() function

  • Creating a graph of the frequency distribution (histogram)
#histogram with raw frequency
hist(iris$Sepal.Length)

hist() function

  • Histogram with density on the y-axis
#histogram with density on y-axis
hist(iris$Sepal.Length, freq=FALSE)
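
With density on the y-axis, the bar areas sum to 1, so a density curve can be drawn on the same scale. A sketch under that assumption; the title, axis label, and line width are arbitrary choices.

```r
x <- iris$Sepal.Length
# hist() invisibly returns the bin breaks and densities it plotted
h <- hist(x, freq = FALSE, main = "Sepal.Length", xlab = "Sepal.Length")
# overlay a kernel density estimate on the same axes
lines(density(x), lwd = 2)
# total area of the bars (height times width)
sum(h$density * diff(h$breaks))
```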

Plotting a kernel density estimate

plot(density(iris$Sepal.Length))

Tabulating data from categorical variables

  • Categorical variables are non-numerical
  • They indicate categories.
table(iris$Species)

    setosa versicolor  virginica 
        50         50         50 
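
Counts can be turned into proportions with prop.table(). A short sketch; counts and shares are illustrative names.

```r
counts <- table(iris$Species)
# divide each count by the total so the entries sum to 1
shares <- prop.table(counts)
# the same shares as percentages
round(100 * shares, 1)
```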

Creating a pie chart from a table

  • First, there should be a table
tab1=table(iris$Species)
pie(tab1)

Creating a bar chart from a table

  • First, there should be a table
tab1=table(iris$Species)
barplot(tab1)

Exploring multiple variables

Measuring linear association with cov() & cor()

  • Between two variables
cov(iris$Sepal.Length, iris$Petal.Length)
[1] 1.274315
cor(iris$Sepal.Length, iris$Petal.Length)
[1] 0.8717538
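
The two measures are related: the Pearson correlation is the covariance divided by the product of the two standard deviations. A sketch checking that relationship; r_manual is an illustrative name.

```r
x <- iris$Sepal.Length
y <- iris$Petal.Length
# rescale the covariance by both standard deviations
r_manual <- cov(x, y) / (sd(x) * sd(y))
r_manual
# agrees with cor() up to floating-point tolerance
all.equal(r_manual, cor(x, y))
```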

Measuring correlations with cor()

  • Among all numerical variables
cor(iris[,1:4])
             Sepal.Length Sepal.Width Petal.Length Petal.Width
Sepal.Length    1.0000000  -0.1175698    0.8717538   0.8179411
Sepal.Width    -0.1175698   1.0000000   -0.4284401  -0.3661259
Petal.Length    0.8717538  -0.4284401    1.0000000   0.9628654
Petal.Width     0.8179411  -0.3661259    0.9628654   1.0000000

Computing stats with aggregate()

  • Suppose we want summary statistics of Sepal.Length for each Species
aggregate(Sepal.Length ~ Species, summary, data=iris)
     Species Sepal.Length.Min. Sepal.Length.1st Qu. Sepal.Length.Median
1     setosa             4.300                4.800               5.000
2 versicolor             4.900                5.600               5.900
3  virginica             4.900                6.225               6.500
  Sepal.Length.Mean Sepal.Length.3rd Qu. Sepal.Length.Max.
1             5.006                5.200             5.800
2             5.936                6.300             7.000
3             6.588                6.900             7.900
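
Any function that takes a numeric vector can be passed as FUN. A sketch that keeps only the mean per group; species_means is an illustrative name.

```r
# one row per Species, with the group mean of Sepal.Length
species_means <- aggregate(Sepal.Length ~ Species, data = iris, FUN = mean)
species_means
```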

Boxplot grouped by each Species

boxplot(Sepal.Length ~Species, data=iris)

Scatterplot grouped by each Species

with(iris, plot(Sepal.Length,Sepal.Width,col=Species,pch=as.numeric(Species)))

Creating a scatterplot matrix

pairs(iris)

Advanced plots

3D Scatter plot

##Ensure that the package scatterplot3d is installed already
library(scatterplot3d)
scatterplot3d(iris$Petal.Width,iris$Sepal.Length,iris$Sepal.Width)

Another type of 3D plot

##Ensure that the package rgl is installed already
library(rgl)
plot3d(iris$Petal.Width,iris$Sepal.Length,iris$Sepal.Width)

Heatmap from stats package

##The stats package ships with R and is attached by default, so this call is optional
library(stats)
distmatrix=as.matrix(dist(iris[,1:4]))
heatmap(distmatrix)

Heatmap using mtcars dataset

distmatrix=as.matrix(dist(mtcars))
heatmap(distmatrix)

Parallel Coordinates map

  • Ensure the MASS package is already installed
library(MASS)
parcoord(iris[1:4],col=iris$Species)
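#one legend colour per species level, matching the colour codes used by col=iris$Species
legend("topleft",levels(iris$Species),lty=1,col=seq_along(levels(iris$Species)))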

Parallel plot with lattice package

  • Ensure the lattice package is already installed
library(lattice)
parallelplot(~ iris[1:4]| Species, data=iris)

Saving charts into files

  • Use functions pdf(), bmp(), jpeg(), png()
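png("C:/Users/Gokul/desktop/chart1.png")
library(lattice)
#lattice plots are drawn when printed; the explicit print() also works inside scripts and functions
print(parallelplot(~ iris[1:4]| Species, data=iris))
#close the png device so the file is written
dev.off()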
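
The same open, plot, close pattern applies to the other devices. A minimal sketch using pdf(), assuming a temporary directory as the output location; out_file and the page size are hypothetical choices.

```r
# hypothetical output location inside the session's temporary directory
out_file <- file.path(tempdir(), "chart1.pdf")
# open the pdf device; width and height are in inches
pdf(out_file, width = 7, height = 5)
hist(iris$Sepal.Length)
# close the device so the file is completed
dev.off()
file.exists(out_file)
```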