Exploring and Understanding Data with R

Exploring and understanding data

Data exploration example using used car data

Get the working dir

getwd()
[1] "/cloud/project"

Load the dataset

usedcars <- read.csv("usedcars.csv", stringsAsFactors = FALSE)
usedcars

Get structure of the dataset using the function str() arg is dataset name

str(usedcars)
'data.frame':   150 obs. of  6 variables:
 $ year        : int  2011 2011 2011 2011 2012 2010 2011 2010 2011 2010 ...
 $ model       : chr  "SEL" "SEL" "SEL" "SEL" ...
 $ price       : int  21992 20995 19995 17809 17500 17495 17000 16995 16995 16995 ...
 $ mileage     : int  7413 10926 7351 11613 8367 25125 27393 21026 32655 36116 ...
 $ color       : chr  "Yellow" "Gray" "Silver" "Gray" ...
 $ transmission: chr  "AUTO" "AUTO" "AUTO" "AUTO" ...

Exploring numeric variables

Summarize by using the function summary()

summary(usedcars)
      year         model               price          mileage          color          
 Min.   :2000   Length:150         Min.   : 3800   Min.   :  4867   Length:150        
 1st Qu.:2008   Class :character   1st Qu.:10995   1st Qu.: 27200   Class :character  
 Median :2009   Mode  :character   Median :13592   Median : 36385   Mode  :character  
 Mean   :2009                      Mean   :12962   Mean   : 44261                     
 3rd Qu.:2010                      3rd Qu.:14904   3rd Qu.: 55124                     
 Max.   :2012                      Max.   :21992   Max.   :151479                     
 transmission      
 Length:150        
 Class :character  
 Mode  :character  
                   
                   
                   

You can get the summary of particular columns

summary(usedcars[c("price","mileage")])
     price          mileage      
 Min.   : 3800   Min.   :  4867  
 1st Qu.:10995   1st Qu.: 27200  
 Median :13592   Median : 36385  
 Mean   :12962   Mean   : 44261  
 3rd Qu.:14904   3rd Qu.: 55124  
 Max.   :21992   Max.   :151479  
summary(usedcars[c("model")])
    model          
 Length:150        
 Class :character  
 Mode  :character  

You can get mean, median by mean(c(arg1,arg2,arg3))

mean(c(36000, 44000, 56000))
[1] 45333.33

The min/max of used car prices

range(usedcars$year)
[1] 2000 2012
range(usedcars$model)
[1] "SE"  "SES"
range(usedcars$mileage)
[1]   4867 151479

The difference of the range

diff(range(usedcars$price))
[1] 18192

Inter Quartile Range (IQR)

IQR tells you how wide the middle of your data is

it’s calculated by: IQR = Q3 − Q1

IQR(usedcars$price)
[1] 3909.5

Use quantile to calculate five-number summary

quantile(usedcars$price)
     0%     25%     50%     75%    100% 
 3800.0 10995.0 13591.5 14904.5 21992.0 

The 99th percentile

this lets us filter the outliers by specifying the probability range

quantile(usedcars$price, probs = c(0.01,0.99))

quintiles

quantile(usedcars$price, seq(from = 0, to = 1, by = 0.2))
     0%     20%     40%     60%     80%    100% 
 3800.0 10759.4 12993.8 13992.0 14999.0 21992.0 

boxplot of used car prices and mileage

boxplot(usedcars$price, main="Boxplot of Used Car Prices", 
        ylab="Price ($)") 

histograms of used car prices and mileage

hist(usedcars$price, main = "Histogram of Used Car Prices",
     xlab = "Price ($)")

variance and standard deviation with var() and std()

sd(usedcars$price)
[1] 3122.482
var(usedcars$mileage)
[1] 728033954

Exploring numeric Variable

one-way tables for the used car data

table(usedcars$year)

2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 
   3    1    1    1    3    2    6   11   14   42   49   16    1 

compute table proportions

model_table <- table(usedcars$model)
prop.table(model_table)

       SE       SEL       SES 
0.5200000 0.1533333 0.3266667 

round the data

color_table <- table(usedcars$color)
color_pct <- prop.table(color_table) * 100
round(color_pct, digits = 1)

 Black   Blue   Gold   Gray  Green    Red Silver  White Yellow 
  23.3   11.3    0.7   10.7    3.3   16.7   21.3   10.7    2.0 

Exploring relationships between variables

scatterplot of price vs. mileage

plot(x = usedcars$mileage, y = usedcars$price,
     main = "Scatterplot of Price vs. Mileage",
     xlab = "Used Car Odometer (mi.)",
     ylab = "Used Car Price ($)")

new variable indicating conservative colors

usedcars$conservative <-
  usedcars$color %in% c("Black", "Gray", "Silver", "White")

# checking our variable
table(usedcars$conservative)

FALSE  TRUE 
   51    99 

installing gmodels package

install.packages("gmodels")
trying URL 'http://rspm/default/__linux__/focal/latest/src/contrib/gtools_3.9.5.tar.gz'
trying URL 'http://rspm/default/__linux__/focal/latest/src/contrib/gdata_3.0.1.tar.gz'
trying URL 'http://rspm/default/__linux__/focal/latest/src/contrib/gmodels_2.19.1.tar.gz'

The downloaded source packages are in
    ‘/tmp/Rtmp12cPec/downloaded_packages’

Crosstab of conservative by model

gmodels is mainly used for data analysis helpers

CrossTable() for contingency tables (way nicer than table())

Odds ratios, proportions, chi-square summaries, etc

library(gmodels)
CrossTable(x = usedcars$model, y = usedcars$conservative)

 
   Cell Contents
|-------------------------|
|                       N |
| Chi-square contribution |
|           N / Row Total |
|           N / Col Total |
|         N / Table Total |
|-------------------------|

 
Total Observations in Table:  150 

 
               | usedcars$conservative 
usedcars$model |     FALSE |      TRUE | Row Total | 
---------------|-----------|-----------|-----------|
            SE |        27 |        51 |        78 | 
               |     0.009 |     0.004 |           | 
               |     0.346 |     0.654 |     0.520 | 
               |     0.529 |     0.515 |           | 
               |     0.180 |     0.340 |           | 
---------------|-----------|-----------|-----------|
           SEL |         7 |        16 |        23 | 
               |     0.086 |     0.044 |           | 
               |     0.304 |     0.696 |     0.153 | 
               |     0.137 |     0.162 |           | 
               |     0.047 |     0.107 |           | 
---------------|-----------|-----------|-----------|
           SES |        17 |        32 |        49 | 
               |     0.007 |     0.004 |           | 
               |     0.347 |     0.653 |     0.327 | 
               |     0.333 |     0.323 |           | 
               |     0.113 |     0.213 |           | 
---------------|-----------|-----------|-----------|
  Column Total |        51 |        99 |       150 | 
               |     0.340 |     0.660 |           | 
---------------|-----------|-----------|-----------|

 

The contribution of True is much more significant than False,

which means that more cars have conservative colors (grey, black, white and Silver)

Now, we will try more args in the Crosstable() function.

CrossTable(x = usedcars$model, y = usedcars$conservative, 
           chisq = TRUE, 
           expected = TRUE, 
           prop.r = TRUE, 
           prop.c = TRUE, 
           prop.t = TRUE, 
           prop.chisq = TRUE,
           )

 
   Cell Contents
|-------------------------|
|                       N |
|              Expected N |
| Chi-square contribution |
|           N / Row Total |
|           N / Col Total |
|         N / Table Total |
|-------------------------|

 
Total Observations in Table:  150 

 
               | usedcars$conservative 
usedcars$model |     FALSE |      TRUE | Row Total | 
---------------|-----------|-----------|-----------|
            SE |        27 |        51 |        78 | 
               |    26.520 |    51.480 |           | 
               |     0.009 |     0.004 |           | 
               |     0.346 |     0.654 |     0.520 | 
               |     0.529 |     0.515 |           | 
               |     0.180 |     0.340 |           | 
---------------|-----------|-----------|-----------|
           SEL |         7 |        16 |        23 | 
               |     7.820 |    15.180 |           | 
               |     0.086 |     0.044 |           | 
               |     0.304 |     0.696 |     0.153 | 
               |     0.137 |     0.162 |           | 
               |     0.047 |     0.107 |           | 
---------------|-----------|-----------|-----------|
           SES |        17 |        32 |        49 | 
               |    16.660 |    32.340 |           | 
               |     0.007 |     0.004 |           | 
               |     0.347 |     0.653 |     0.327 | 
               |     0.333 |     0.323 |           | 
               |     0.113 |     0.213 |           | 
---------------|-----------|-----------|-----------|
  Column Total |        51 |        99 |       150 | 
               |     0.340 |     0.660 |           | 
---------------|-----------|-----------|-----------|

 
Statistics for All Table Factors


Pearson's Chi-squared test 
------------------------------------------------------------
Chi^2 =  0.1539564     d.f. =  2     p =  0.92591 


 

Now we will try the assocstats()

It returns chi-squared test, Cramer’s V, Contingency Coeff, and Phi)

assocplot(table(x = usedcars$model,y=usedcars$conservative))

now we can use ci() for confidence intervals for proportions

ci(x=usedcars$price, confidence = 0.95)
  Estimate   CI lower   CI upper Std. Error 
12961.9333 12458.1496 13465.7170   254.9496 
LS0tCnRpdGxlOiAiWW91c3NlZiAtIEFjdGl2aXR5IDQiCm91dHB1dDogaHRtbF9ub3RlYm9vawotLS0KCiMgRXhwbG9yaW5nIGFuZCBVbmRlcnN0YW5kaW5nIERhdGEgd2l0aCBSCiMjIyMgRXhwbG9yaW5nIGFuZCB1bmRlcnN0YW5kaW5nIGRhdGEKCiMjIERhdGEgZXhwbG9yYXRpb24gZXhhbXBsZSB1c2luZyB1c2VkIGNhciBkYXRhCgojIyMgR2V0IHRoZSB3b3JraW5nIGRpcgpgYGB7cn0KZ2V0d2QoKQpgYGAKCiMjIyBMb2FkIHRoZSBkYXRhc2V0IApgYGB7cn0KdXNlZGNhcnMgPC0gcmVhZC5jc3YoInVzZWRjYXJzLmNzdiIsIHN0cmluZ3NBc0ZhY3RvcnMgPSBGQUxTRSkKdXNlZGNhcnMKYGBgCgojIyMgR2V0IHN0cnVjdHVyZSBvZiB0aGUgZGF0YXNldCB1c2luZyB0aGUgZnVuY3Rpb24gc3RyKCkgYXJnIGlzIGRhdGFzZXQgbmFtZSAKYGBge3J9CnN0cih1c2VkY2FycykKYGBgCiMjIEV4cGxvcmluZyBudW1lcmljIHZhcmlhYmxlcyAKCiMjIyBTdW1tYXJpemUgYnkgdXNpbmcgdGhlIGZ1bmN0aW9uIHN1bW1hcnkoKQpgYGB7cn0Kc3VtbWFyeSh1c2VkY2FycykKYGBgCiMjIyBZb3UgY2FuIGdldCB0aGUgc3VtbWFyeSBvZiBwYXJ0aWN1bGFyIGNvbHVtbnMgCmBgYHtyfQpzdW1tYXJ5KHVzZWRjYXJzW2MoInByaWNlIiwibWlsZWFnZSIpXSkKc3VtbWFyeSh1c2VkY2Fyc1tjKCJtb2RlbCIpXSkKCmBgYAojIyMgWW91IGNhbiBnZXQgbWVhbiwgbWVkaWFuIGJ5IG1lYW4oYyhhcmcxLGFyZzIsYXJnMykpCmBgYHtyfQptZWFuKGMoMzYwMDAsIDQ0MDAwLCA1NjAwMCkpCm1lZGlhbihjKDM2MDAwLCA0NDAwMCwgNTYwMDApKQoKYGBgCgojIyMgVGhlIG1pbi9tYXggb2YgdXNlZCBjYXIgcHJpY2VzIApgYGB7cn0KcmFuZ2UodXNlZGNhcnMkeWVhcikKcmFuZ2UodXNlZGNhcnMkbW9kZWwpCnJhbmdlKHVzZWRjYXJzJG1pbGVhZ2UpCmBgYAoKIyMjIFRoZSBkaWZmZXJlbmNlIG9mIHRoZSByYW5nZQpgYGB7cn0KZGlmZihyYW5nZSh1c2VkY2FycyRwcmljZSkpCgpgYGAKCiMjIyBJbnRlciBRdWFydGlsZSBSYW5nZSAoSVFSKSAKIyMjIElRUiB0ZWxscyB5b3UgaG93IHdpZGUgdGhlIG1pZGRsZSBvZiB5b3VyIGRhdGEgaXMKIyMjIGl0J3MgY2FsY3VsYXRlZCBieTogSVFSID0gUTMg4oiSIFExCmBgYHtyfQpJUVIodXNlZGNhcnMkcHJpY2UpCmBgYAoKIyMjIFVzZSBxdWFudGlsZSB0byBjYWxjdWxhdGUgZml2ZS1udW1iZXIgc3VtbWFyeQpgYGB7cn0KcXVhbnRpbGUodXNlZGNhcnMkcHJpY2UpCmBgYAoKIyMjIFRoZSA5OXRoIHBlcmNlbnRpbGUgCiMjIyB0aGlzIGxldHMgdXMgZmlsdGVyIHRoZSBvdXRsaWVycyBieSBzcGVjaWZ5aW5nIHRoZSBwcm9iYWJpbGl0eSByYW5nZSAKYGBge3J9CnF1YW50aWxlKHVzZWRjYXJzJHByaWNlLCBwcm9icyA9IGMoMC4wMSwwLjk5KSkKYGBgCgojIyMgcXVpbnRpbGVzIApgYGB7cn0KcXVhbnRpbGUodXNlZGNhcnMkcHJpY2UsIHNlcShmcm9tID0gMCwgdG8gPSAxLCBieSA9IDAuMikpCmBgYAoKIyMjIGJveHBsb3Qgb2YgdXNlZCBjYXIgcHJpY2VzIGFuZCBtaWxlYWdlIAoKYGBge3J9CmJveHBsb3QodXNlZGNhcnMkcHJpY2UsIG1haW49IkJveHBsb3Qgb2YgVXNlZCBDYXIgUHJpY2VzIiwgCiAgICAgICAgeWxhYj0iUHJpY2UgKCQpIikgCmBgYAoKCiMjIyBoaXN0b2dyYW1zIG9mIHVzZWQgY2FyIHByaWNlcyBhbmQgbWlsZWFnZQoKYGBge3J9Cmhpc3QodXNlZGNhcnMkcHJpY2UsIG1haW4gPSAiSGlzdG9ncmFtIG9mIFVzZWQgQ2FyIFByaWNlcyIsCiAgICAgeGxhYiA9ICJQcmljZSAoJCkiKQpgYGAKCiMjIyB2YXJpYW5jZSBhbmQgc3RhbmRhcmQgZGV2aWF0aW9uIHdpdGggdmFyKCkgYW5kIHN0ZCgpCmBgYHtyfQpzZCh1c2VkY2FycyRwcmljZSkKdmFyKHVzZWRjYXJzJG1pbGVhZ2UpCmBgYAoKIyMgRXhwbG9yaW5nIG51bWVyaWMgVmFyaWFibGUgCgojIyMgb25lLXdheSB0YWJsZXMgZm9yIHRoZSB1c2VkIGNhciBkYXRhIApgYGB7cn0KdGFibGUodXNlZGNhcnMkeWVhcikKYGBgCgojIyMgY29tcHV0ZSB0YWJsZSBwcm9wb3J0aW9ucyAKYGBge3J9Cm1vZGVsX3RhYmxlIDwtIHRhYmxlKHVzZWRjYXJzJG1vZGVsKQpwcm9wLnRhYmxlKG1vZGVsX3RhYmxlKQpgYGAKCiMjIyByb3VuZCB0aGUgZGF0YSAKYGBge3J9CmNvbG9yX3RhYmxlIDwtIHRhYmxlKHVzZWRjYXJzJGNvbG9yKQpjb2xvcl9wY3QgPC0gcHJvcC50YWJsZShjb2xvcl90YWJsZSkgKiAxMDAKcm91bmQoY29sb3JfcGN0LCBkaWdpdHMgPSAxKQpgYGAKCgojIyBFeHBsb3JpbmcgcmVsYXRpb25zaGlwcyBiZXR3ZWVuIHZhcmlhYmxlcyAKCiMjIyBzY2F0dGVycGxvdCBvZiBwcmljZSB2cy4gbWlsZWFnZQpgYGB7cn0KcGxvdCh4ID0gdXNlZGNhcnMkbWlsZWFnZSwgeSA9IHVzZWRjYXJzJHByaWNlLAogICAgIG1haW4gPSAiU2NhdHRlcnBsb3Qgb2YgUHJpY2UgdnMuIE1pbGVhZ2UiLAogICAgIHhsYWIgPSAiVXNlZCBDYXIgT2RvbWV0ZXIgKG1pLikiLAogICAgIHlsYWIgPSAiVXNlZCBDYXIgUHJpY2UgKCQpIikKYGBgCgojIyMgbmV3IHZhcmlhYmxlIGluZGljYXRpbmcgY29uc2VydmF0aXZlIGNvbG9ycwpgYGB7cn0KdXNlZGNhcnMkY29uc2VydmF0aXZlIDwtCiAgdXNlZGNhcnMkY29sb3IgJWluJSBjKCJCbGFjayIsICJHcmF5IiwgIlNpbHZlciIsICJXaGl0ZSIpCgojIGNoZWNraW5nIG91ciB2YXJpYWJsZQp0YWJsZSh1c2VkY2FycyRjb25zZXJ2YXRpdmUpCgpgYGAKCgojIyMgaW5zdGFsbGluZyBnbW9kZWxzIHBhY2thZ2UgCmBgYHtyfQppbnN0YWxsLnBhY2thZ2VzKCJnbW9kZWxzIikKCmBgYAojIyMgQ3Jvc3N0YWIgb2YgY29uc2VydmF0aXZlIGJ5IG1vZGVsCiMjIyBnbW9kZWxzIGlzIG1haW5seSB1c2VkIGZvciBkYXRhIGFuYWx5c2lzIGhlbHBlcnMKIyMjIENyb3NzVGFibGUoKSBmb3IgY29udGluZ2VuY3kgdGFibGVzICh3YXkgbmljZXIgdGhhbiB0YWJsZSgpKQojIyMgT2RkcyByYXRpb3MsIHByb3BvcnRpb25zLCBjaGktc3F1YXJlIHN1bW1hcmllcywgZXRjCmBgYHtyfQpsaWJyYXJ5KGdtb2RlbHMpCkNyb3NzVGFibGUoeCA9IHVzZWRjYXJzJG1vZGVsLCB5ID0gdXNlZGNhcnMkY29uc2VydmF0aXZlKQoKCmBgYAoKCiMjIyBUaGUgY29udHJpYnV0aW9uIG9mIFRydWUgaXMgbXVjaCBtb3JlIHNpZ25pZmljYW50IHRoYW4gRmFsc2UsIAojIyMgd2hpY2ggbWVhbnMgdGhhdCBtb3JlIGNhcnMgaGF2ZSBjb25zZXJ2YXRpdmUgY29sb3JzIChncmV5LCBibGFjaywgd2hpdGUgYW5kIFNpbHZlcikgCgoKCiMjIyBOb3csIHdlIHdpbGwgdHJ5IG1vcmUgYXJncyBpbiB0aGUgQ3Jvc3N0YWJsZSgpIGZ1bmN0aW9uLiAKYGBge3J9CkNyb3NzVGFibGUoeCA9IHVzZWRjYXJzJG1vZGVsLCB5ID0gdXNlZGNhcnMkY29uc2VydmF0aXZlLCAKICAgICAgICAgICBjaGlzcSA9IFRSVUUsIAogICAgICAgICAgIGV4cGVjdGVkID0gVFJVRSwgCiAgICAgICAgICAgcHJvcC5yID0gVFJVRSwgCiAgICAgICAgICAgcHJvcC5jID0gVFJVRSwgCiAgICAgICAgICAgcHJvcC50ID0gVFJVRSwgCiAgICAgICAgICAgcHJvcC5jaGlzcSA9IFRSVUUsCiAgICAgICAgICAgKQpgYGAKIyMjIE5vdyB3ZSB3aWxsIHRyeSB0aGUgYXNzb2NzdGF0cygpCiMjIyBJdCByZXR1cm5zIGNoaS1zcXVhcmVkIHRlc3QsIENyYW1lcidzIFYsIENvbnRpbmdlbmN5IENvZWZmLCBhbmQgUGhpKSAKCmBgYHtyfQphc3NvY3Bsb3QodGFibGUoeCA9IHVzZWRjYXJzJG1vZGVsLHk9dXNlZGNhcnMkY29uc2VydmF0aXZlKSkKYGBgCgojIyMgbm93IHdlIGNhbiB1c2UgY2koKSBmb3IgY29uZmlkZW5jZSBpbnRlcnZhbHMgZm9yIHByb3BvcnRpb25zIApgYGB7cn0KY2koeD11c2VkY2FycyRwcmljZSwgY29uZmlkZW5jZSA9IDAuOTUpCmBgYAoK