Instructions

You recently joined a data science team.

There are two datasets, junk1.txt and junk2.csv. The team has two options: (1) go back to the client and ask for more data to remedy problems with the data, or (2) accept the data and undertake a major analytics exercise.

The team is relying on your data science skills to determine how they should proceed.

Can you explore the data and recommend actions for each file, enumerating the reasons?

grade – 20

Work

First, we read the datasets into dataframes.

junk1 <- read.delim("junk1.txt", header=TRUE, sep=" ")
junk2 <- read.csv("junk2.csv")

Obviously it’d be great to know what this data actually represents. Both junk datasets contain two undescribed continuous numeric variables, a and b, and a class variable. Presumably a and b are meant to be used in a classification exercise to predict class. Might junk1 be a test dataset for a model built from junk2? Probably not, but it’s possible.

Let’s do some quick exploratory data analysis (EDA) to look at the two junk datasets.

str(junk1)
## 'data.frame':    100 obs. of  3 variables:
##  $ a    : num  1.62 1.434 2.477 0.528 1.005 ...
##  $ b    : num  3.004 0.785 0.937 0.12 0.787 ...
##  $ class: int  1 1 1 1 1 1 1 1 1 1 ...
str(junk2)
## 'data.frame':    4000 obs. of  3 variables:
##  $ a    : num  3.189 0.822 0.815 -1.507 0.443 ...
##  $ b    : num  0.9292 0.0476 0.0291 3.1323 2.8494 ...
##  $ class: int  0 0 0 0 0 0 0 0 0 0 ...

Same data types and structure. That the preview shows only class = 1 in junk1 and only class = 0 in junk2 is something to keep in mind.
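
Because str() only previews the first ten labels, a full tally is a quick way to see which class values each file really contains and how they are split. The following is a minimal sketch, not part of the original analysis: `count_classes` is a hypothetical helper and `toy` is a made-up stand-in for the real data.

```r
# Hypothetical helper: count every value of the class column (NAs included)
# and report each label's share of the rows.
count_classes <- function(df) {
  counts <- table(df$class, useNA = "ifany")
  data.frame(label = names(counts),
             n     = as.integer(counts),
             share = as.numeric(counts) / sum(counts))
}

# Made-up stand-in data, NOT the real junk files.
toy <- data.frame(a     = c(0.1, -0.4, 1.2, 0.7),
                  b     = c(1.0, 0.3, -0.8, 0.2),
                  class = c(1L, 1L, 2L, 1L))
tally <- count_classes(toy)
print(tally)
```

On the real data this would be called as count_classes(junk1) and count_classes(junk2).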

library(psych)
describe(junk1)
##       vars   n mean   sd median trimmed  mad   min  max range skew
## a        1 100 0.05 1.27  -0.05    0.03 1.47 -2.30 3.01  5.30 0.13
## b        2 100 0.01 1.45  -0.07   -0.02 1.49 -3.17 3.10  6.27 0.12
## class    3 100 1.50 0.50   1.50    1.50 0.74  1.00 2.00  1.00 0.00
##       kurtosis   se
## a        -0.80 0.13
## b        -0.53 0.14
## class    -2.02 0.05
describe(junk2)
##       vars    n  mean   sd median trimmed  mad   min  max range  skew
## a        1 4000 -0.05 1.30   0.09   -0.02 1.40 -4.17 4.63  8.79 -0.17
## b        2 4000  0.06 1.31  -0.08    0.03 1.39 -3.90 4.31  8.22  0.21
## class    3 4000  0.06 0.24   0.00    0.00 0.00  0.00 1.00  1.00  3.61
##       kurtosis   se
## a        -0.34 0.02
## b        -0.35 0.02
## class    11.06 0.00

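The glaring issue here appears to be the coding of the classes: class ranges from 1 to 2 in junk1 but from 0 to 1 in junk2. Perhaps there are three classes (0, 1, and 2), and for some reason junk1 includes only classes 1 and 2 while junk2 includes only classes 0 and 1. Or it’s possible there are only two classes but the data preparer(s) used inconsistent coding. The class means also differ sharply: 1.50 in junk1 implies an even split between its two labels, while 0.06 in junk2 implies that only about 6% of its observations are class 1.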
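
If the client confirmed the second explanation (two classes, inconsistently coded), the labels could be harmonized before any comparison. The sketch below is only an illustration under that unconfirmed assumption: the mapping of 1 to 0 and 2 to 1 is a guess, and `recode_class` is a hypothetical helper.

```r
# Hypothetical recode: ASSUMES junk1's labels {1, 2} correspond to junk2's {0, 1}.
# This mapping is a guess that the client would have to confirm.
recode_class <- function(x, map = c("1" = 0L, "2" = 1L)) {
  out <- unname(map[as.character(x)])
  # Fail loudly instead of silently producing NAs for unexpected labels.
  if (anyNA(out)) stop("unmapped class label found")
  out
}

recoded <- recode_class(c(1L, 2L, 2L, 1L))
print(recoded)
```

Stopping on an unmapped label is deliberate: if a third class does exist, the recode should not hide it.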

Other than the classes, I see no obvious issues in the summary statistics. The dataset with 4000 observations shows a wider range of a and b values, but that’s to be expected given that it is 40 times larger.
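
The "no obvious issues" reading could be backed up with a couple of explicit checks for missing values and duplicated rows. This is a rough sketch rather than a complete data-quality audit: `quality_report` is a hypothetical helper and `toy` is made-up data.

```r
# Hypothetical helper: count NAs per column and fully duplicated rows.
quality_report <- function(df) {
  list(
    missing_per_column = colSums(is.na(df)),
    duplicate_rows     = sum(duplicated(df))
  )
}

# Made-up stand-in data, NOT the real junk files:
# one NA in column a, and row 4 repeats row 1.
toy <- data.frame(a     = c(0.5, NA, -1.1, 0.5),
                  b     = c(1.2, 0.4, 0.0, 1.2),
                  class = c(0L, 0L, 1L, 0L))
report <- quality_report(toy)
print(report)
```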

After converting the class variable from numerical to categorical, let’s do some EDA visualization:

junk1$class <- factor(junk1$class)
junk2$class <- factor(junk2$class)
library(ggplot2)
## 
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
## 
##     %+%, alpha
ggplot(data = junk1) +
  geom_point(mapping = aes(x = a, y = b, color = class))

ggplot(data = junk2) +
  geom_point(mapping = aes(x = a, y = b, color = class)) 

In junk2, one class appears to be concentrated in a particular region of the scatterplot, whereas in junk1 the class labels look much more randomly distributed. In junk1, class = 1 may follow a vaguely linear pattern, whereas class = 2 could be said to show a somewhat negative linear association: as a increases, b decreases.
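
The visual impression of a linear association within each class could be checked numerically with a per-class Pearson correlation between a and b. A minimal sketch, where `cor_by_class` is a hypothetical helper and `toy` is made-up data built to have one perfectly increasing and one perfectly decreasing class:

```r
# Hypothetical helper: Pearson correlation of a and b within each class.
cor_by_class <- function(df) {
  groups <- split(df, df$class, drop = TRUE)
  sapply(groups, function(g) cor(g$a, g$b))
}

# Made-up stand-in data, NOT the real junk files.
toy <- data.frame(a     = c(1, 2, 3, 1, 2, 3),
                  b     = c(2, 4, 6, 3, 2, 1),
                  class = factor(c(1, 1, 1, 2, 2, 2)))
cors <- cor_by_class(toy)
print(cors)
```

Values near zero on the real data would suggest that the apparent patterns are mostly noise.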

ggplot(junk1, aes(class, a)) + geom_boxplot()

ggplot(junk1, aes(class, b)) + geom_boxplot()

ggplot(junk2, aes(class, a)) + geom_boxplot()

ggplot(junk2, aes(class, b)) + geom_boxplot()

As we saw with the scatterplots, the differences between classes appear more discernible in the junk2 dataset than in junk1.
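
The boxplot comparison can also be expressed as a small table of per-class medians, which makes the size of any difference explicit. The sketch below uses base R's aggregate(); `class_medians` is a hypothetical helper and `toy` is made-up data.

```r
# Hypothetical helper: median of a and b within each class,
# i.e. the center lines drawn in the boxplots.
class_medians <- function(df) {
  aggregate(cbind(a, b) ~ class, data = df, FUN = median)
}

# Made-up stand-in data, NOT the real junk files:
# the classes differ clearly in a but not in b.
toy <- data.frame(a     = c(1, 2, 3, 10, 11, 12),
                  b     = c(0, 1, 2, 0, 1, 2),
                  class = factor(c(0, 0, 0, 1, 1, 1)))
meds <- class_medians(toy)
print(meds)
```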

In conclusion, we definitely cannot proceed with any analytics or data science exercise with this data without a much more solid understanding of what it represents. We then need to understand the business problem we’re trying to solve. Once that is established, we can determine whether the data provided is sufficient.

Just receiving data and throwing it into a model is a recipe for disaster.