Explore the Data Set

Wendy Sarrett
February 20, 2017

What is "Explore the Data Set"?

The purpose of Explore the Data Set is to show a basic analysis of the built-in datasets that ship with R and are available in RStudio (the datasets package), provided they meet the following criteria:

  • The data set is a data frame or a time series
  • The data set has at least 4 columns

With these datasets you can quickly see how the columns are related. After selecting four columns (your x, y, z, and color), you can view a plot_ly graph that visualizes the relationships.
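The four-column selection maps naturally onto a 3-D scatter plot. The following is a minimal sketch, not the app's actual plotting code: the helper name plot3d_sketch and its arguments are hypothetical, and it assumes the plotly package is installed.

```r
# Hypothetical helper (not taken from the app): draw a 3-D scatter plot
# from four column names chosen by the user.
plot3d_sketch <- function(df, x, y, z, color) {
  # Fail early if any requested column is missing from the data frame.
  stopifnot(all(c(x, y, z, color) %in% names(df)))
  if (!requireNamespace("plotly", quietly = TRUE)) {
    stop("the plotly package is required for this sketch")
  }
  # plot_ly() takes one-sided formulas; backticks guard unusual names.
  as_f <- function(nm) stats::as.formula(paste0("~`", nm, "`"))
  plotly::plot_ly(
    data = df,
    x = as_f(x), y = as_f(y), z = as_f(z), color = as_f(color),
    type = "scatter3d", mode = "markers"
  )
}
```

For example, plot3d_sketch(mtcars, "wt", "hp", "mpg", "cyl") would plot weight, horsepower, and mpg, colored by cylinder count.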

How the List of Data Sets Is Obtained

The trickiest thing about this app is getting the list of datasets:

  ## Keep only data frames or time series with >= 4 columns
  getDataSets<-function() {
    ds<-data(package="datasets")
    res<-ds$results
    dsNames<-res[,3]   ## the "Item" column
    choise<-c()
    dataLst<-list()
    for(i in seq_along(dsNames)) {
      ## Items such as "beaver1 (beavers)" name the object first
      wd<-strsplit(dsNames[[i]]," ")[[1]]
      dsNames[[i]]<-wd[1]
      xo<-get(wd[1])
      if(is.data.frame(xo)){
        if(ncol(xo) >= 4) {
          choise<-c(choise,wd[1])
          dataLst[[wd[1]]]<-xo
        }
      } else if(inherits(xo,"ts")){
        ## Multivariate time series become one column per series
        xoDF <- as.data.frame(xo)
        if(ncol(xoDF) >= 4) {
          choise<-c(choise,wd[1])
          dataLst[[wd[1]]]<-xoDF
        }
      }
    }
    ## note: commented out ui element update for this demo
    dataLst
  }
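To see why the function splits each name on a space and converts time series, it may help to inspect the raw listing directly. This is a small sketch that relies only on base R and the datasets package; the variable names are illustrative.

```r
# Column 3 of the results matrix returned by data() is named "Item".
items <- data(package = "datasets")$results[, "Item"]

# Some items carry, in parentheses, the name to pass to data(),
# e.g. "beaver1 (beavers)"; the object name is the part before the space.
obj_names <- sub(" .*$", "", items)

# A multivariate time series converts to one data-frame column per series.
stocks <- as.data.frame(EuStockMarkets)
ncol(stocks)  # 4 (DAX, SMI, CAC, FTSE)
```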

The Server Calculation

  ## Reactive method takes the selected dataset and calculates the lm,
  ## which is then displayed.
  plotdata<- reactive({
    shinyjs::hideElement("pPlot")
    datasel <- input$visData
    data2<-dataLst[[datasel]]
    ## is.data.frame() is safe when class() has more than one element
    if(is.data.frame(data2)) {
      newdata<-data2
    } else {
      newdata<-as.data.frame(data2)
    }
    choise<-names(newdata)
    updateSelectInput(session, "colA",
                      choices = choise)
    updateSelectInput(session, "colB",
                      choices = choise)
    updateSelectInput(session, "colC",
                      choices = choise)
    updateSelectInput(session, "colD",
                      choices = choise)
    ## Avoiding issue with y = factor variable:
    ## use the first non-factor column as the response
    col<-0
    for(i in seq_len(ncol(newdata))) {
      if(!is.factor(newdata[,i]) && col == 0) {
        col<-i
      }
    }
    if(col > 0) {
      x<-summary(lm(newdata[,col] ~., data = newdata))
    } else {
      x<- "no non-factor columns .... lm is not valid"
    }
    x
  })
  ## End of reactive method
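The search for the first non-factor column can also be written without an explicit loop. The sketch below is one possible alternative, not the app's code; the helper name first_non_factor is hypothetical, and it returns 0 when every column is a factor, matching the convention used above.

```r
# Hypothetical helper: index of the first non-factor column, or 0 if none.
first_non_factor <- function(df) {
  # vapply() tests each column; which() keeps the non-factor positions.
  idx <- which(!vapply(df, is.factor, logical(1)))
  if (length(idx) == 0) 0L else unname(idx[1])
}

first_non_factor(iris)                                    # Sepal.Length is column 1
first_non_factor(data.frame(g = factor(c("a", "b")), v = 1:2))  # skips the factor column g
```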

Executing the Server Calculation

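If mtcars were selected, the main server calculation would look as follows. For display purposes, we'll set newdata = mtcars and col = 1. Because the response column (mpg) also appears among the predictors through the dot in the formula, the fit is trivially perfect: the mpg coefficient is 1, the other coefficients are numerically zero, and R-squared is 1: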

   x<-summary(lm(newdata[,col] ~., data = newdata))
   x

Call:
lm(formula = newdata[, col] ~ ., data = newdata)

Residuals:
       Min         1Q     Median         3Q        Max 
-2.740e-15 -4.193e-16  3.000e-19  2.972e-16  6.276e-15 

Coefficients:
              Estimate Std. Error    t value Pr(>|t|)    
(Intercept)  0.000e+00  1.222e-14  0.000e+00    1.000    
mpg          1.000e+00  1.410e-16  7.094e+15   <2e-16 ***
cyl          7.826e-17  6.753e-16  1.160e-01    0.909    
disp        -2.380e-18  1.169e-17 -2.040e-01    0.841    
hp          -1.379e-17  1.438e-17 -9.590e-01    0.349    
drat         1.459e-16  1.062e-15  1.370e-01    0.892    
wt          -2.701e-16  1.331e-15 -2.030e-01    0.841    
qsec         2.251e-16  4.861e-16  4.630e-01    0.648    
vs          -1.473e-15  1.360e-15 -1.083e+00    0.292    
am           1.026e-15  1.375e-15  7.460e-01    0.465    
gear        -4.885e-16  9.691e-16 -5.040e-01    0.620    
carb         3.681e-16  5.361e-16  6.870e-01    0.500    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.712e-15 on 20 degrees of freedom
Multiple R-squared:      1, Adjusted R-squared:      1 
F-statistic: 3.492e+31 on 11 and 20 DF,  p-value: < 2.2e-16
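The perfect fit above arises because mpg is regressed on itself. One way to get a more informative model is to leave the response out of the predictors. This is a sketch of a possible correction, not what the app currently does; it uses reformulate() from the stats package to build the formula from column names.

```r
newdata <- mtcars
col <- 1

# Name of the response column (here "mpg").
response <- names(newdata)[col]

# Build `mpg ~ cyl + disp + ...` from every column except the response.
fml <- reformulate(setdiff(names(newdata), response), response = response)

fit <- lm(fml, data = newdata)
summary(fit)
```

With the response excluded, the coefficients and R-squared describe how the remaining columns relate to mpg rather than reproducing it exactly.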

Benefits and Possible Enhancements of "Explore the Data Set"

The benefits of this application are the following:

  • Allows one to quickly get a sense of how the columns are related by calculating a linear model
  • When a dataset is selected, a reactive method calculates the linear model and fills the column lists based on the dataset selected
  • Allows one to quickly get a sense of what the data looks like by using a 3-D plot_ly plot

Current Limitations

  • The graph uses a fixed number of columns (four)
  • Only one type of graph is available

There are enhancements that might be made to improve this:

  • Allow options for other calculations, such as predictive functions (i.e. machine learning)
  • Expand the number of datasets available
  • Allow a dataset to be loaded from a URL
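
As a rough illustration of the last enhancement, a CSV file could be read from a URL and checked against the same four-column criterion. The helper name load_dataset and the assumption that the remote file is a CSV are hypothetical; read.csv() accepts either a URL or a local path.

```r
# Hypothetical helper, not part of the app: read a CSV from a URL (or a
# local path) and accept it only if it meets the app's column criterion.
load_dataset <- function(location, min_cols = 4) {
  df <- utils::read.csv(location, stringsAsFactors = TRUE)
  if (ncol(df) < min_cols) {
    stop("dataset needs at least ", min_cols, " columns")
  }
  df
}
```

The returned data frame could then be added to dataLst under a name of the user's choosing.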