Part a) Graphical Techniques for Univariate

Read the data from the file data1.txt and assign it to the vector x.

x <- scan("data1.txt")

Looking at the raw data, it is difficult to make any intelligent comments. The .txt file contains a single column of 1003 numbers. The values typically range from about 6 to 27 and are not integers (they have 4 or 5 decimal places). I noticed an apparent outlier around 50. Other than that, there is not much else to say without investigating further in R.
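
A few quick checks back up these first impressions before summarizing. This is just a sketch of the sort of commands I ran, assuming data1.txt has already been read into x as above:

# Quick numeric look at the raw data
length(x)   # number of observations (1003)
head(x)     # first few values, 4-5 decimal places
range(x)    # smallest and largest values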

summary(x)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   4.466   9.895  14.881  16.192  22.224 100.400

Now I see the minimum is actually about 4.5, and the maximum (100.4) appears to be an outlier I missed at first glance.

# Sequence Plot
plot(x, main="Run Sequence Plot")

If the fixed location assumption held, the run sequence would be flat and non-drifting. This is not the case, so the fixed location assumption does not hold. There appear to be two populations here. We will run a lag plot to better visualize how many clusters are present.

The fixed variation assumption does not hold either: the vertical spread of the run sequence plot is not the same across the entire horizontal axis.
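
As a rough numeric sketch of the same point (splitting the series into halves is my own choice, not part of the original analysis), we can compare the location and spread of the first and second halves of the sequence:

# Compare location and spread across the two halves of the run sequence
n <- length(x)
first  <- x[1:(n %/% 2)]
second <- x[(n %/% 2 + 1):n]
c(mean(first), mean(second))   # location in each half
c(sd(first), sd(second))       # spread in each half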

# Lag Plot
lag.plot(x, main="Lag Plot")

The randomness assumption does not hold; the data are not structureless and random. There appear to be two clusters present.

# Histogram
hist(x)

This is the default histogram. R uses only about 10 bars, each covering an interval of width 10, and only 4 of them are visible because the frequencies are essentially zero for x > 40.

These bars are far too wide. Let’s slice these bars up to better represent the variability of our data.

# A "refined" histogram 
hist(x, br=100, xlim =c(0,100))

Much better. Now we can visualize the two clusters we saw in the run sequence and lag plots, and the variability of the data comes through much more clearly. I see 3 outliers, one of which is extremely large; it is fairly distracting.

Let’s generate a histogram that ignores these outliers. Additionally, I’m going to add more bars and color for flair.

# An exquisite Histogram
hist(x, br=300, xlim =c(3,32), col= "orange", main="Histogram ignoring outliers, more bars")

The data appear to be bimodal. This histogram uses about 300 bars across the full range of the data (roughly 0 to 100), with the x axis limited to (3, 32) so that only the two clusters are shown. This makes the data appear a bit more granular and noisy. Note: there are 3 outliers with values of x > 35 that are not shown here.
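
A kernel density estimate (a supplementary sketch, not one of the original plots) shows the same two modes without having to tune the number of bars; the cutoff of 35 for excluding the outliers matches the note above:

# Kernel density estimate as an alternative view of the two modes,
# with the 3 outliers (x > 35) excluded
plot(density(x[x < 35]), main = "Density estimate, outliers excluded")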

# normal Probability plot
qqnorm(x)
qqline(x, col="red")

If the fixed distribution assumption held, the histogram would be bell-shaped and the normal probability plot would be linear. Neither is the case. Each of the 2 clusters might very well be normally distributed (bell-shaped) when analyzed on its own; taken together, however, the fixed distribution assumption does not hold. The normal probability plot (black points) is not linear. A theoretical normal QQ line (red) was added for comparison: if the data followed a single fixed distribution, the points would track this line much more closely.

Therefore, the data in data1.txt are not normally distributed. Further investigation and/or transformation is necessary before we can make any useful conclusions.
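
To follow up on the idea that each cluster might be normal on its own, here is a rough sketch that splits the data at a cut point between the two modes (x = 16 is a guess from the histogram, not a value from the original analysis) and draws a normal probability plot for each piece:

# Per-cluster normality check; the cut point of 16 is a guessed value
# between the two modes, and the outliers (x > 35) are dropped
low  <- x[x < 16]
high <- x[x >= 16 & x < 35]
par(mfrow = c(1, 2))
qqnorm(low,  main = "Lower cluster"); qqline(low,  col = "red")
qqnorm(high, main = "Upper cluster"); qqline(high, col = "red")
par(mfrow = c(1, 1))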

Furthermore, there appeared to be 3 outliers in the data, but they were hard to see in the previous plots. A box plot displays the outliers better.

boxplot(x, ylab="x", main = "Boxplot!!!")

We see three outliers present that will need further investigation.
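
boxplot.stats() lists the points the boxplot flags as outliers, so we can read off their values directly (a small addition of mine, not part of the original report):

# Values flagged as outliers by the boxplot rule (beyond 1.5 * IQR from the hinges)
boxplot.stats(x)$out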

Test for Time Dependency, Graphically

Autocorrelation Plot

#Autocorrelation Plots
acf(x, main="Autocorrelation plots")

An autocorrelation plot was used to test whether there is any time dependency in the data. There appears to be strong positive autocorrelation, so the sequence of data points is not random.
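
A Box-Ljung test (my addition, not part of the original analysis) is a quantitative companion to the autocorrelation plot; a small p-value would confirm that the autocorrelation is statistically significant:

# Box-Ljung test of the null hypothesis that the first 10 autocorrelations are zero
Box.test(x, lag = 10, type = "Ljung-Box")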


Part b) Quantitative EDA techniques for Univariate

New data are used in data2.txt. Read the data and assign it to y.

y <- scan("data2.txt")
x. <- 1:195

Similar to part (a), this is a basic text file with one column of 195 non-integer numbers. This time the data will be indexed by pairing the variable y with the index vector x. (the values 1 to 195).

#Linear model. y= a*x. + b
model = lm(y~x.)
summary(model)
## 
## Call:
## lm(formula = y ~ x.)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.060590 -0.014735  0.000943  0.013639  0.063579 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  9.267e+00  3.253e-03 2848.98   <2e-16 ***
## x.          -5.641e-05  2.878e-05   -1.96   0.0514 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.02262 on 193 degrees of freedom
## Multiple R-squared:  0.01952,    Adjusted R-squared:  0.01444 
## F-statistic: 3.842 on 1 and 193 DF,  p-value: 0.05143

A linear regression model was created. Essentially, it pairs the values from data2.txt (y) with the index values 1:195 (x.).

Note that the slope parameter has a t value of -1.96, with a p-value of 0.0514, just above the 0.05 cutoff. The estimated slope is essentially 0 (-0.000056). This indicates there is no change in the location parameter across the dataset: y does not vary significantly with its position in the dataset (index x.). This is a crude way to check for constant location.
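
Plotting y against the index with the fitted line overlaid (a quick sketch using the model above) shows how flat the fitted slope is:

# Run sequence plot of data2 with the fitted regression line;
# a visibly flat line supports the "no drift in location" reading
plot(x., y, xlab = "Index", ylab = "y", main = "Run Sequence Plot of data2")
abline(model, col = "red")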

Test for Constant Scale

Let’s use the Bartlett test to verify there is no change in variation. It tests the assumption of constant scale by dividing the data into several equal-sized intervals and comparing their variances.

# Split up the data into several (4) equal intervals
195/4 # with decimal, doesn't divide evenly
## [1] 48.75
195%/%4 # round down to nearest integer
## [1] 48
195%%4 # remainder after dividing
## [1] 3

Generate a grouping factor: the first 48 observations get “1”, observations 49-96 get “2”, and so on.

g = rep(1:4, each = 48)
g
##   [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##  [38] 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
##  [75] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [112] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 4 4 4 4
## [149] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
## [186] 4 4 4 4 4 4 4

But we’re missing group labels for some of the data; there are only 192 values here because 195 did not divide evenly and 3 were left over. Let’s add 3 more “4”s so we have 195 in total, then run the Bartlett test.

g <- c(g,4,4,4) #Adds three 4s to the end.
g # verify these were added to correct place.
##   [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##  [38] 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
##  [75] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [112] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 4 4 4 4
## [149] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
## [186] 4 4 4 4 4 4 4 4 4 4
bartlett.test(y , g) # Run Bartlett's test for variation
## 
##  Bartlett test of homogeneity of variances
## 
## data:  y and g
## Bartlett's K-squared = 3.034, df = 3, p-value = 0.3864

Because the p-value is greater than 0.05, we conclude that the standard deviations are not significantly different across the 4 intervals. So the assumption of constant scale is valid.
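
As a sanity check (my own addition), the standard deviation within each interval can be listed directly; the four values should be of similar magnitude:

# Standard deviation of y within each of the 4 intervals defined by g
tapply(y, g, sd)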

Test of Normality

library(nortest) #nortest package contains several methods to test for normality

ad.test(y) #Anderson-Darling normality Test
## 
##  Anderson-Darling normality test
## 
## data:  y
## A = 0.12648, p-value = 0.985

The Anderson-Darling test does not reject the normality assumption.

cvm.test(y) #Cramer-von Mises Normality Test
## 
##  Cramer-von Mises normality test
## 
## data:  y
## W = 0.020672, p-value = 0.9619

The Cramer-von Mises test does not reject the normality assumption.

lillie.test(y)
## 
##  Lilliefors (Kolmogorov-Smirnov) normality test
## 
## data:  y
## D = 0.032481, p-value = 0.8842

The Lilliefors (Kolmogorov-Smirnov) test does not reject the normality assumption.

pearson.test(y)
## 
##  Pearson chi-square normality test
## 
## data:  y
## P = 9.2615, p-value = 0.8139

The Pearson chi-square test does not reject the normality assumption.

sf.test(y)
## 
##  Shapiro-Francia normality test
## 
## data:  y
## W = 0.99806, p-value = 0.9923

The Shapiro-Francia test does not reject the normality assumption.

None of these tests provide evidence to reject the normality assumption.
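
A normal probability plot of y (not shown in the original write-up, added here as a graphical cross-check of the tests above) should be close to linear:

# Graphical cross-check of normality for data2
qqnorm(y, main = "Normal Probability Plot of data2")
qqline(y, col = "red")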

Test for Outliers

Grubbs’ test can detect the presence of an outlier, but it is based on the assumption of normality. Our data have already passed the tests for normality.

library(outliers)
grubbs.test(y)
## 
##  Grubbs test for one outlier
## 
## data:  y
## G = 2.91864, U = 0.95586, p-value = 0.3121
## alternative hypothesis: highest value 9.327973 is an outlier

Null hypothesis: there are no outliers in the data set. Alternative hypothesis: there is at least one outlier in the data set.

Because the p-value is 0.312 (> 0.05), we fail to reject the null hypothesis. There is no significant evidence that the highest value (9.327973) is an outlier, so we conclude that no outliers are present in this data set.

Test of Randomness

library(lawstat)
runs.test(y)
## 
##  Runs Test - Two sided
## 
## data:  y
## Standardized Runs Statistic = -3.2306, p-value = 0.001235
acf(y)

Since the p-value is 0.001235 (< 0.05), we reject the null hypothesis of randomness and conclude that the data are not random. This quantitative runs test from the “lawstat” package can be cross-checked against the autocorrelation plot of y generated above.
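
A lag plot of y (the same graphical tool used in part (a), added here as a sketch) is another way to look for structure in the sequence:

# Lag plot of data2; visible structure would be consistent with the runs test result
lag.plot(y, main = "Lag Plot of data2")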