You recently joined a data science team.

There are two datasets, junk1.txt and junk2.csv. The team has two options:
1. They can go back to the client and ask for more data to remedy problems with the data.
2. They can accept the data and undertake a major analytics exercise.

The team is relying on your data science skills to determine how they should proceed.

Can you explore the data and recommend actions for each file, enumerating the reasons?

library(ggplot2)
library(cowplot)
junk1_ds <- read.table("https://raw.githubusercontent.com/deepakmongia/Data622/master/Homework-1/Data/junk1.txt",header = TRUE)

head(junk1_ds)
##           a         b class
## 1 1.6204214 3.0036241     1
## 2 1.4340220 0.7852487     1
## 3 2.4766615 0.9367761     1
## 4 0.5283093 0.1196222     1
## 5 1.0054081 0.7872866     1
## 6 1.1032636 0.7330594     1
summary(junk1_ds)
##        a                  b                class    
##  Min.   :-2.29854   Min.   :-3.17174   Min.   :1.0  
##  1st Qu.:-0.85014   1st Qu.:-1.04712   1st Qu.:1.0  
##  Median :-0.04754   Median :-0.07456   Median :1.5  
##  Mean   : 0.04758   Mean   : 0.01324   Mean   :1.5  
##  3rd Qu.: 1.09109   3rd Qu.: 1.05342   3rd Qu.:2.0  
##  Max.   : 3.00604   Max.   : 3.10230   Max.   :2.0

Let us look at the first file, junk1.

unique(junk1_ds$class)
## [1] 1 2
junk1_ds$class = as.factor(junk1_ds$class)

summary(junk1_ds)
##        a                  b            class 
##  Min.   :-2.29854   Min.   :-3.17174   1:50  
##  1st Qu.:-0.85014   1st Qu.:-1.04712   2:50  
##  Median :-0.04754   Median :-0.07456         
##  Mean   : 0.04758   Mean   : 0.01324         
##  3rd Qu.: 1.09109   3rd Qu.: 1.05342         
##  Max.   : 3.00604   Max.   : 3.10230
str(junk1_ds)
## 'data.frame':    100 obs. of  3 variables:
##  $ a    : num  1.62 1.434 2.477 0.528 1.005 ...
##  $ b    : num  3.004 0.785 0.937 0.12 0.787 ...
##  $ class: Factor w/ 2 levels "1","2": 1 1 1 1 1 1 1 1 1 1 ...
apply(junk1_ds[,1:2], 2, sd)
##        a        b 
## 1.267740 1.446067

Plotting the boxplots to see if there are any outliers:

boxplot(junk1_ds[,1:2])

As the boxplots show, there are no outliers. Verifying this numerically:

OutVals_a_1 = boxplot(junk1_ds$a, plot=FALSE)$out
print(OutVals_a_1)
## numeric(0)
OutVals_b_1 = boxplot(junk1_ds$b, plot=FALSE)$out
print(OutVals_b_1)
## numeric(0)

So, it is confirmed that there are no outliers in the data.
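
For reference, `boxplot(...)$out` relies on `boxplot.stats()`, which by default flags values lying more than 1.5 times the hinge spread beyond the lower or upper hinge. A minimal sketch of that rule follows. The helper name `iqr_outliers` and the simulated vector `x` are illustrative assumptions and are not part of the original analysis.

```r
# Hypothetical helper: flag values outside the boxplot whiskers,
# using the same hinge-based rule as boxplot.stats() (coef = 1.5).
iqr_outliers <- function(x, coef = 1.5) {
  x <- x[!is.na(x)]
  hinges <- stats::fivenum(x)[c(2, 4)]   # lower and upper hinge
  spread <- diff(hinges)                 # hinge spread (close to the IQR)
  x[x < hinges[1] - coef * spread | x > hinges[2] + coef * spread]
}

# Simulated stand-in data (assumption): 100 standard normal draws
# plus one deliberately extreme value
set.seed(1)
x <- c(rnorm(100), 10)

out_manual  <- iqr_outliers(x)
out_boxplot <- boxplot(x, plot = FALSE)$out

print(out_manual)
print(identical(out_manual, out_boxplot))
```

Applied to the real data, `iqr_outliers(junk1_ds$a)` should return the same values as `OutVals_a_1`.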

Also, the class counts in the summary show that both classes have the same number of observations (50 each).

So, we don’t see any issues with the junk1 data, and it can be used for the analytics exercise without any further changes.

Now, let us import the second file, junk2, and do some exploratory data analysis on it.

junk2_ds <- read.csv("https://raw.githubusercontent.com/deepakmongia/Data622/master/Homework-1/Data/junk2.csv",
                     header = TRUE)

Doing some basic EDA:

dim(junk2_ds)
## [1] 4000    3
head(junk2_ds)
##            a           b class
## 1  3.1886481  0.92917735     0
## 2  0.8224527  0.04760314     0
## 3  0.8147247  0.02910931     0
## 4 -1.5065362  3.13231360     0
## 5  0.4426887  2.84942822     0
## 6  0.8564405 -0.66143851     0
summary(junk2_ds)
##        a                  b                class       
##  Min.   :-4.16505   Min.   :-3.90472   Min.   :0.0000  
##  1st Qu.:-1.01447   1st Qu.:-0.89754   1st Qu.:0.0000  
##  Median : 0.08754   Median :-0.08358   Median :0.0000  
##  Mean   :-0.05126   Mean   : 0.05624   Mean   :0.0625  
##  3rd Qu.: 0.89842   3rd Qu.: 1.00354   3rd Qu.:0.0000  
##  Max.   : 4.62647   Max.   : 4.31052   Max.   :1.0000

Converting the class feature into a factor:

junk2_ds$class <- as.factor(junk2_ds$class)

table(junk2_ds$class)
## 
##    0    1 
## 3750  250

So, the dataset is imbalanced: class ‘1’ has only 250 of the 4000 observations (6.25%).
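
The degree of imbalance can be quantified with `prop.table()`. The sketch below rebuilds the class labels from the counts reported above (3750 and 250) rather than re-reading the file, so the vector `class_labels` is only a stand-in for `junk2_ds$class`.

```r
# Stand-in for junk2_ds$class, rebuilt from the counts shown in the table above
class_labels <- factor(rep(c(0, 1), times = c(3750, 250)))

class_counts    <- table(class_labels)
class_share     <- prop.table(class_counts)              # proportion of each class
imbalance_ratio <- max(class_counts) / min(class_counts) # majority-to-minority ratio

print(round(100 * class_share, 2))   # percentage of rows per class
print(imbalance_ratio)
```

With the real data loaded, `class_labels` can be replaced by `junk2_ds$class`.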

Plotting the boxplots to see the data ranges and see if there are any outliers.

boxplot(junk2_ds[,1:2])

There appear to be outliers in both independent features, a and b. Checking the outlier values:

OutVals_a_2 = boxplot(junk2_ds$a, plot=FALSE)$out
print(OutVals_a_2)
## [1]  4.385471  4.626473  4.400435  3.963930 -4.033136 -3.925369 -4.165048
## [8] -4.109807 -4.075896
OutVals_b_2 = boxplot(junk2_ds$b, plot=FALSE)$out
print(OutVals_b_2)
## [1]  3.942222  4.310516  3.898998  4.131511  3.901890  3.962441 -3.773330
## [8] -3.904721

So, there are outliers in the data: 9 in a and 8 in b.
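
To put the outlier values in context, it can help to express them as a share of the 4000 rows. The sketch below uses the values printed above. The names `out_a`, `out_b`, and `outlier_summary` are illustrative.

```r
# Outlier values copied from the boxplot output above
out_a <- c(4.385471, 4.626473, 4.400435, 3.963930, -4.033136,
           -3.925369, -4.165048, -4.109807, -4.075896)
out_b <- c(3.942222, 4.310516, 3.898998, 4.131511, 3.901890,
           3.962441, -3.773330, -3.904721)
n_rows <- 4000  # from dim(junk2_ds)

# Count and percentage of flagged rows per feature
outlier_summary <- data.frame(
  feature     = c("a", "b"),
  n_outliers  = c(length(out_a), length(out_b)),
  pct_of_rows = 100 * c(length(out_a), length(out_b)) / n_rows
)
print(outlier_summary)
```

With the data loaded, the same figures follow from `length(OutVals_a_2) / nrow(junk2_ds)` and `length(OutVals_b_2) / nrow(junk2_ds)`.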

For the junk2 dataset: 1) there are outliers in the data, and 2) the data is highly imbalanced.

So we should go back to the source team to get more samples of the class with fewer observations, which is class ‘1’.