You recently joined a data science team.
There are two datasets, junk1.txt and junk2.csv. The team has two options:
1. They can go back to the client and ask for more data to remedy problems with the data.
2. They can accept the data and undertake a major analytics exercise.
The team is relying on your data science skills to determine how they should proceed.
Can you explore the data and recommend actions for each file, enumerating the reasons?
library(ggplot2)
library(cowplot)
junk1_ds <- read.table("https://raw.githubusercontent.com/deepakmongia/Data622/master/Homework-1/Data/junk1.txt",header = TRUE)
head(junk1_ds)
## a b class
## 1 1.6204214 3.0036241 1
## 2 1.4340220 0.7852487 1
## 3 2.4766615 0.9367761 1
## 4 0.5283093 0.1196222 1
## 5 1.0054081 0.7872866 1
## 6 1.1032636 0.7330594 1
summary(junk1_ds)
## a b class
## Min. :-2.29854 Min. :-3.17174 Min. :1.0
## 1st Qu.:-0.85014 1st Qu.:-1.04712 1st Qu.:1.0
## Median :-0.04754 Median :-0.07456 Median :1.5
## Mean : 0.04758 Mean : 0.01324 Mean :1.5
## 3rd Qu.: 1.09109 3rd Qu.: 1.05342 3rd Qu.:2.0
## Max. : 3.00604 Max. : 3.10230 Max. :2.0
Let us look more closely at the first file, junk1.
unique(junk1_ds$class)
## [1] 1 2
junk1_ds$class = as.factor(junk1_ds$class)
summary(junk1_ds)
## a b class
## Min. :-2.29854 Min. :-3.17174 1:50
## 1st Qu.:-0.85014 1st Qu.:-1.04712 2:50
## Median :-0.04754 Median :-0.07456
## Mean : 0.04758 Mean : 0.01324
## 3rd Qu.: 1.09109 3rd Qu.: 1.05342
## Max. : 3.00604 Max. : 3.10230
str(junk1_ds)
## 'data.frame': 100 obs. of 3 variables:
## $ a : num 1.62 1.434 2.477 0.528 1.005 ...
## $ b : num 3.004 0.785 0.937 0.12 0.787 ...
## $ class: Factor w/ 2 levels "1","2": 1 1 1 1 1 1 1 1 1 1 ...
apply(junk1_ds[,1:2], 2, sd)
## a b
## 1.267740 1.446067
Plotting the boxplots to see if there are any outliers:
boxplot(junk1_ds[,1:2])
The boxplots show no points beyond the whiskers. Verifying numerically that there are no outliers:
OutVals_a_1 = boxplot(junk1_ds$a, plot=FALSE)$out
print(OutVals_a_1)
## numeric(0)
OutVals_b_1 = boxplot(junk1_ds$b, plot=FALSE)$out
print(OutVals_b_1)
## numeric(0)
So, it is confirmed that there are no outliers in the data.
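For reference, the values returned in `boxplot(...)$out` are those lying more than 1.5 times the interquartile range beyond the hinges (Tukey's rule, with the default `coef = 1.5`). The following is a minimal sketch of that rule. It uses synthetic data rather than junk1, and the helper name `tukey_outliers` is purely illustrative, not part of the original analysis.

```r
# Illustrative helper (an assumption, not from the original analysis):
# return the values lying more than coef * IQR beyond the lower or upper hinge.
tukey_outliers <- function(x, coef = 1.5) {
  x <- x[!is.na(x)]
  hinges <- fivenum(x)[c(2, 4)]        # lower and upper hinge
  spread <- diff(hinges)               # hinge-based interquartile range
  lower <- hinges[1] - coef * spread   # lower fence
  upper <- hinges[2] + coef * spread   # upper fence
  x[x < lower | x > upper]
}

# Synthetic example: 100 standard-normal draws plus one planted extreme value.
set.seed(1)
x <- c(rnorm(100), 10)
out_manual <- tukey_outliers(x)
out_builtin <- boxplot.stats(x)$out    # the routine boxplot() calls internally
```

An empty result, as obtained for both columns of junk1, means that no value falls outside these fences.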
Also, the class summary above shows that both classes have the same number of observations (50 each).
So, we don’t see any issues with the junk1 data, and it can be used for the analytics exercise without any further changes.
Now, let us import the second file, junk2, and do some exploratory data analysis on it.
junk2_ds <- read.csv("https://raw.githubusercontent.com/deepakmongia/Data622/master/Homework-1/Data/junk2.csv",
header = TRUE)
Doing some basic EDA
dim(junk2_ds)
## [1] 4000 3
head(junk2_ds)
## a b class
## 1 3.1886481 0.92917735 0
## 2 0.8224527 0.04760314 0
## 3 0.8147247 0.02910931 0
## 4 -1.5065362 3.13231360 0
## 5 0.4426887 2.84942822 0
## 6 0.8564405 -0.66143851 0
summary(junk2_ds)
## a b class
## Min. :-4.16505 Min. :-3.90472 Min. :0.0000
## 1st Qu.:-1.01447 1st Qu.:-0.89754 1st Qu.:0.0000
## Median : 0.08754 Median :-0.08358 Median :0.0000
## Mean :-0.05126 Mean : 0.05624 Mean :0.0625
## 3rd Qu.: 0.89842 3rd Qu.: 1.00354 3rd Qu.:0.0000
## Max. : 4.62647 Max. : 4.31052 Max. :1.0000
Converting class feature into a factor
junk2_ds$class <- as.factor(junk2_ds$class)
table(junk2_ds$class)
##
## 0 1
## 3750 250
So, the dataset is imbalanced: class ‘1’ has only 250 of the 4000 observations (6.25%).
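The degree of imbalance can be quantified with `prop.table()`. The sketch below rebuilds the class column from the counts printed above instead of re-reading the file, so the vector `cls` is only a stand-in for `junk2_ds$class`.

```r
# Stand-in for junk2_ds$class, rebuilt from the printed counts (3750 and 250).
cls <- factor(rep(c(0, 1), times = c(3750, 250)))

class_counts <- table(cls)               # observations per class
class_props <- prop.table(class_counts)  # share of each class

# Majority-to-minority ratio.
n_majority <- sum(cls == "0")
n_minority <- sum(cls == "1")
imbalance_ratio <- n_majority / n_minority

print(class_props)
print(imbalance_ratio)
```

With these counts, class ‘0’ outnumbers class ‘1’ by 15 to 1.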
Plotting the boxplots to see the data ranges and see if there are any outliers.
boxplot(junk2_ds[,1:2])
It looks like there are outliers in both independent features, a and b. Checking the outlier values:
OutVals_a_2 = boxplot(junk2_ds$a, plot=FALSE)$out
print(OutVals_a_2)
## [1] 4.385471 4.626473 4.400435 3.963930 -4.033136 -3.925369 -4.165048
## [8] -4.109807 -4.075896
OutVals_b_2 = boxplot(junk2_ds$b, plot=FALSE)$out
print(OutVals_b_2)
## [1] 3.942222 4.310516 3.898998 4.131511 3.901890 3.962441 -3.773330
## [8] -3.904721
So, there are outliers in the data.
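How much weight these flagged values deserve depends partly on how many would be expected by chance. As a rough yardstick only, and assuming the features are approximately normal (an assumption that the analysis above does not test), the share of points expected beyond the boxplot whiskers can be compared with the observed share. The counts 9 and 8 are read off the printed outlier vectors.

```r
# Observed share of flagged values in junk2 (counts taken from the printed output).
n_obs <- 4000
n_out <- c(a = 9, b = 8)
observed_share <- n_out / n_obs

# Expected share under a normal distribution (assumption, not verified here):
# quartiles sit at +/- qnorm(0.75), so the IQR is 2 * qnorm(0.75) and the
# upper fence is Q3 + 1.5 * IQR; the lower fence is symmetric.
q3 <- qnorm(0.75)
fence <- q3 + 1.5 * (2 * q3)
expected_share <- 2 * pnorm(-fence)

print(observed_share)
print(expected_share)
```

If the observed shares are at or below the expected share (about 0.7% for normal data), the flagged points may simply be the tails of the distribution rather than data errors. That reading rests entirely on the normality assumption.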
For the junk2 dataset:
1) there are outliers in the data (9 values of a and 8 values of b fall outside the boxplot whiskers)
2) the data is highly imbalanced (3750 observations of class ‘0’ versus 250 of class ‘1’)
So we should go back to the client and ask for more samples of the minority class, class ‘1’.