Explore the Data Set

Wendy Sarrett
February 20, 2017

What is "Explore the Data Set"?

The purpose of Explore the Data Set is to give a quick, basic analysis of the built-in datasets that ship with R (the datasets package) and meet the following criteria:

The data set is a data frame or a time series (time series are converted to data frames)
The data set has at least 4 columns

With these datasets you can quickly see how the columns are related. After selecting four columns (x, y, z, and color), you get a plot_ly graph that visualizes the relationships among them.
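As a rough illustration of the four-column plot, here is a minimal sketch. The helper name plotFour, the choice of a 3-D scatter trace, and the mtcars columns are assumptions made for this example, not the app's actual code.

```r
# Hypothetical helper: 3-D scatter of three columns, colored by a fourth.
plotFour <- function(df, x, y, z, color) {
  cols <- c(x, y, z, color)
  # Fail early if any requested column is missing.
  stopifnot(all(cols %in% names(df)))
  plotly::plot_ly(
    x = df[[x]], y = df[[y]], z = df[[z]], color = df[[color]],
    type = "scatter3d", mode = "markers"
  )
}

# Build the plot only when plotly is installed; print(p) displays it.
if (requireNamespace("plotly", quietly = TRUE)) {
  p <- plotFour(mtcars, x = "wt", y = "hp", z = "mpg", color = "cyl")
}
```

Passing the columns as vectors rather than formulas keeps the helper usable with column names chosen at run time, such as values coming from select inputs.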

How the List of Data Sets Is Obtained

The trickiest thing about this app is getting the list of datasets:

  # Keep only data frames and time series with >= 4 columns
  getDataSets <- function() {
    ds <- data(package = "datasets")
    dsNames <- ds$results[, "Item"]
    choices <- c()
    dataLst <- list()
    for (i in seq_along(dsNames)) {
      # Items can look like "beaver1 (beavers)"; keep the object name only
      name <- strsplit(dsNames[[i]], " ")[[1]][1]
      xo <- get(name)
      # Time series are converted so the column test applies to both types
      if (is.ts(xo)) {
        xo <- as.data.frame(xo)
      }
      if (is.data.frame(xo) && ncol(xo) >= 4) {
        choices <- c(choices, name)
        dataLst[[name]] <- xo
      }
    }
    ## note: commented out ui element update for this demo
    dataLst
  }
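The same filtering can also be written without an explicit loop. The following is a sketch of one possible variant, not the app's code; the helper name asExplorable is hypothetical.

```r
# Item names can look like "beaver1 (beavers)"; keep only the object name.
items <- data(package = "datasets")$results[, "Item"]
dsNames <- unique(sub(" .*$", "", items))

# Hypothetical helper: return a data frame if the object qualifies, else NULL.
asExplorable <- function(name) {
  obj <- get(name)
  if (is.ts(obj)) obj <- as.data.frame(obj)
  if (is.data.frame(obj) && ncol(obj) >= 4) obj else NULL
}

# Named list of qualifying datasets; NULL entries are dropped.
dataLst <- Filter(Negate(is.null),
                  sapply(dsNames, asExplorable, simplify = FALSE))

# The names are what a dataset drop-down would offer.
head(names(dataLst))
```

Both data frames such as mtcars and multivariate time series such as EuStockMarkets pass the test, the latter after conversion.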

The Server Calculation

  ## Reactive method takes the selected dataset and calculates the lm,
  ## which is then displayed.
  plotdata <- reactive({
    shinyjs::hideElement("pPlot")
    datasel <- input$visData
    data2 <- dataLst[[datasel]]
    ## is.data.frame() is safe even when class() returns several values
    if (is.data.frame(data2)) {
      newdata <- data2
    } else {
      newdata <- as.data.frame(data2)
    }
    choices <- names(newdata)
    updateSelectInput(session, "colA", choices = choices)
    updateSelectInput(session, "colB", choices = choices)
    updateSelectInput(session, "colC", choices = choices)
    updateSelectInput(session, "colD", choices = choices)
    ## Avoid a factor response: use the first non-factor column as y
    col <- 0
    for (i in seq_len(ncol(newdata))) {
      if (!is.factor(newdata[, i]) && col == 0) {
        col <- i
      }
    }
    if (col > 0) {
      x <- summary(lm(newdata[, col] ~ ., data = newdata))
    } else {
      x <- "no non-factor columns .... lm is not valid"
    }
    x
  })
  ## End of reactive method
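The search for the first non-factor column can be pulled out into a small function that is testable outside Shiny. This is a sketch; the name firstNonFactor is hypothetical, and it returns 0 when every column is a factor, mirroring the col == 0 convention above.

```r
# Hypothetical helper: index of the first non-factor column, or 0 if none.
firstNonFactor <- function(df) {
  idx <- which(!vapply(df, is.factor, logical(1)))
  if (length(idx) == 0) 0L else idx[[1]]
}

firstNonFactor(iris)                                 # Sepal.Length is numeric: 1
firstNonFactor(data.frame(a = factor("x"), b = 1))   # skips the factor: 2
firstNonFactor(data.frame(a = factor("x")))          # only factors: 0
```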

Executing the Server Calculation

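The main server calculation for a dataset would look as follows if mtcars were selected. Note that for display purposes we'll set newdata = mtcars and col = 1. Because the response is written as newdata[, col] rather than by name, the dot on the right-hand side still includes mpg as a predictor, which is why the fit below is exact (an mpg coefficient of 1 and an R-squared of 1):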

   newdata <- mtcars
   col <- 1
   x <- summary(lm(newdata[, col] ~ ., data = newdata))
   x


Call:
lm(formula = newdata[, col] ~ ., data = newdata)

Residuals:
       Min         1Q     Median         3Q        Max 
-2.740e-15 -4.193e-16  3.000e-19  2.972e-16  6.276e-15 

Coefficients:
              Estimate Std. Error    t value Pr(>|t|)    
(Intercept)  0.000e+00  1.222e-14  0.000e+00    1.000    
mpg          1.000e+00  1.410e-16  7.094e+15   <2e-16 ***
cyl          7.826e-17  6.753e-16  1.160e-01    0.909    
disp        -2.380e-18  1.169e-17 -2.040e-01    0.841    
hp          -1.379e-17  1.438e-17 -9.590e-01    0.349    
drat         1.459e-16  1.062e-15  1.370e-01    0.892    
wt          -2.701e-16  1.331e-15 -2.030e-01    0.841    
qsec         2.251e-16  4.861e-16  4.630e-01    0.648    
vs          -1.473e-15  1.360e-15 -1.083e+00    0.292    
am           1.026e-15  1.375e-15  7.460e-01    0.465    
gear        -4.885e-16  9.691e-16 -5.040e-01    0.620    
carb         3.681e-16  5.361e-16  6.870e-01    0.500    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.712e-15 on 20 degrees of freedom
Multiple R-squared:      1, Adjusted R-squared:      1 
F-statistic: 3.492e+31 on 11 and 20 DF,  p-value: < 2.2e-16
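One way to keep the response out of the predictors is to build the formula from the column name, so that the dot expands to every other column. This is a sketch of a possible change, not what the app currently does; reformulate() is from the stats package.

```r
newdata <- mtcars
col <- 1

# Name the response so "." expands to all remaining columns only.
response <- names(newdata)[col]
f <- reformulate(".", response = response)   # mpg ~ .

fit <- lm(f, data = newdata)

# mpg is no longer among the predictors.
names(coef(fit))
summary(fit)$r.squared
```

Column names that are not syntactically valid R names would need backticks around the response before being passed to reformulate().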

Benefits and Possible Enhancements of "Explore the Data Set"

The benefits of this application are the following:

Allows one to quickly get a sense of how the data is related by calculating a linear model
When a dataset is selected, a reactive method calculates the linear model and fills the column lists based on the data set selected
Allows one to quickly get a sense of what the data looks like by using a 3-D plotly plot

Current Limitations

Fixed number of columns for the graph
Only one type of graph

There are enhancements that might be done to improve this:

Allow options of other calculations such as predictive functions (e.g. machine learning)
Expand the number of datasets available
Allow a dataset to be loaded from a URL
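
As a sketch of the last enhancement, a loader could read a CSV and apply the same four-column rule. The function name loadDataSet and the CSV format are assumptions. read.csv() accepts a URL as well as a file path, so a temporary file stands in for the URL here.

```r
# Hypothetical loader: read a CSV from a path or URL and enforce the
# same "at least 4 columns" rule used for the built-in datasets.
loadDataSet <- function(src) {
  df <- utils::read.csv(src, stringsAsFactors = TRUE)
  if (ncol(df) < 4) {
    stop("need at least 4 columns, got ", ncol(df))
  }
  df
}

# Usage with a local file; a URL string would be passed the same way.
tmp <- tempfile(fileext = ".csv")
write.csv(mtcars, tmp, row.names = FALSE)
loaded <- loadDataSet(tmp)
dim(loaded)
```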