To start, we imported the Google share price dataset from Kaggle into RStudio and selected Google's opening stock price as the key variable for this test. We will use the ggplot2 package to create a histogram of this variable.
library(ggplot2)
library(ggpubr)
DataGoogle <- read.csv(file = "Data/GOOG.csv",header = TRUE)
DataG <- as.data.frame(DataGoogle)
# Histogram of the opening price; solid blue line = mean, dashed black line = median
ggplot(data = DataG, mapping = aes(x = Open)) +
  geom_histogram(color = "Black", fill = "Green", alpha = 0.2, binwidth = 12) +
  geom_vline(xintercept = mean(DataG$Open), colour = "Blue") +
  geom_vline(xintercept = median(DataG$Open),
             linetype = "dashed", colour = "black")
In a perfectly symmetrical distribution, the mean and the median are the same. The histogram above gives an idea of how the dataset is distributed. When the mean and the median coincide, the values are spread evenly on either side of the center; when they differ, the data is likely not symmetrical. In this case the mean (solid line) is greater than the median (dashed line), so the distribution is right-skewed (positively skewed), with a long tail toward the higher prices.
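The mean-versus-median comparison can be turned into a quick check. The following is a minimal sketch, not part of the original analysis: `skew_direction` is a hypothetical helper name, the tolerance is an arbitrary assumption, and the toy vector is made up for illustration. The comparison is only a rule of thumb for the direction of skew.

```r
# Hypothetical helper: classify the direction of skew by comparing
# the mean with the median (a rule of thumb, not a formal test).
skew_direction <- function(x, tol = 1e-8) {
  m  <- mean(x, na.rm = TRUE)
  md <- median(x, na.rm = TRUE)
  if (abs(m - md) <= tol) {
    "roughly symmetric"
  } else if (m > md) {
    "right-skewed"
  } else {
    "left-skewed"
  }
}

# Toy values for illustration only: one large value pulls the mean upward.
toy <- c(1, 2, 2, 3, 3, 4, 20)
skew_direction(toy)
```

With the real data, the same call would be `skew_direction(DataG$Open)`.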
Standard deviation is a measure of how spread out a set of data is: it tells you how closely the values are gathered around the mean. The shape of a normal distribution is determined by its mean and its standard deviation. The narrower and taller the bell curve, the smaller the standard deviation; if the data is spread far apart, the bell curve is much flatter, meaning the standard deviation is large. To study this, we will calculate the mean, median, and standard deviation, and plot the density of Google's opening share price.
summary( DataG$Open)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 49.4 244.8 398.9 633.1 929.0 2919.0
print(paste("SD =", sd(DataG$Open)))
## [1] "SD = 555.618089985143"
ggdensity(DataG$Open)
Looking at the density plot for the share price, the curve is flat and wide rather than bell-shaped, which suggests the data is not normally distributed. The standard deviation is also relatively high: about 555.6, against a mean of 633.1.
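For reference, the value returned by `sd()` is the sample standard deviation, which divides the sum of squared deviations by n - 1. The sketch below recomputes it by hand on a made-up toy vector (not taken from the Google dataset) to show what the statistic measures.

```r
# Toy values for illustration only (not taken from the Google dataset).
x <- c(2, 4, 4, 4, 5, 5, 7, 9)

n <- length(x)
# Sample standard deviation: root of the summed squared deviations over n - 1.
manual_sd <- sqrt(sum((x - mean(x))^2) / (n - 1))

# Both expressions should agree up to floating-point error.
all.equal(manual_sd, sd(x))
```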
In a normal distribution, about 68% of the data lies within ±1 standard deviation of the mean, and about 95% lies within ±2 standard deviations. To test our data, we will calculate the proportion of observations that fall within 1 standard deviation and within 2 standard deviations of the mean.
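# Bounds one standard deviation either side of the mean
lower <- mean(DataG$Open) - sd(DataG$Open)
upper <- mean(DataG$Open) + sd(DataG$Open)
# TRUE for rows whose opening price falls strictly inside the bounds
index1 <- DataG$Open > lower & DataG$Open < upper
y <- DataG[index1, ]
# Proportion of observations within 1 SD of the mean
x <- sum(index1) / nrow(DataG)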
print(paste("1 SD =", x))
## [1] "1 SD = 0.852873030583874"
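# Bounds two standard deviations either side of the mean
lower2 <- mean(DataG$Open) - 2 * sd(DataG$Open)
upper2 <- mean(DataG$Open) + 2 * sd(DataG$Open)
index2 <- DataG$Open > lower2 & DataG$Open < upper2
# Proportion of observations within 2 SD of the mean
x <- sum(index2) / nrow(DataG)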
print(paste("2 SD =", x))
## [1] "2 SD = 0.94902687673772"
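For comparison, the 68% and 95% benchmarks can be recovered from the standard normal CDF. This is a minimal sketch; `within_k_sd` is a hypothetical helper name, not part of the original analysis.

```r
# Probability mass of a normal distribution within k standard deviations
# of its mean, computed from the standard normal CDF.
within_k_sd <- function(k) pnorm(k) - pnorm(-k)

round(within_k_sd(1), 4)  # 0.6827
round(within_k_sd(2), 4)  # 0.9545
```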
gghist <- ggplot(data = DataG, mapping = aes(x = Open)) +
  geom_histogram(color = "Black", fill = "Green", alpha = 0.2, binwidth = 12)
# Blue = mean, yellow = mean ± 1 SD, purple = mean ± 2 SD, dashed red = median
gghist +
  geom_vline(xintercept = mean(DataG$Open), colour = "Blue") +
  geom_vline(xintercept = c(lower, upper), colour = "Yellow") +
  geom_vline(xintercept = c(lower2, upper2), colour = "Purple") +
  geom_vline(xintercept = median(DataG$Open),
             linetype = "dashed", colour = "Red")
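Our dataset does not follow the 68% and 95% pattern: about 85% of the observations lie within 1 standard deviation of the mean, well above the 68% expected for a normal distribution, even though the figure for 2 standard deviations (about 95%) is close to the expected value. We can therefore conclude that the data is not normally distributed.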
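The same check can be wrapped in a small reusable function. The sketch below is illustrative only: `prop_within_sd` is a hypothetical helper name, the samples are simulated rather than taken from the Google dataset, and the seed is arbitrary. It contrasts an approximately normal sample with a strongly right-skewed one.

```r
# Hypothetical helper: share of observations strictly within k sample
# standard deviations of the sample mean.
prop_within_sd <- function(x, k = 1) {
  m <- mean(x)
  s <- sd(x)
  mean(x > m - k * s & x < m + k * s)
}

# Simulated data for illustration only; the seed is arbitrary.
set.seed(1)
normal_sample <- rnorm(10000)  # approximately normal
skewed_sample <- rexp(10000)   # strongly right-skewed (exponential)

# Proportions within 1 and 2 standard deviations for each sample.
sapply(c(1, 2), function(k) prop_within_sd(normal_sample, k))
sapply(c(1, 2), function(k) prop_within_sd(skewed_sample, k))
```

For an exponential distribution the theoretical proportions are roughly 0.865 and 0.950, so a right-skewed sample can sit far from the 68% benchmark while still landing near 95% at 2 standard deviations. With the real data, the call would be `prop_within_sd(DataG$Open, 1)`.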
We can visualise the data better with the geom_boxplot() and geom_jitter() functions, which show where the data is more concentrated. In the graph below, the data is concentrated toward the left side of the graph, at the lower prices.
ggplot( data = DataG,
mapping = aes(x = Open, y = 0))+
geom_jitter(mapping = aes(x= Open),data = DataG) +
geom_boxplot(alpha = .2,fill= "Green")
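To help read the box plot, the sketch below works out the whisker limits from the quartiles in the summary() output earlier in this report. It assumes the default geom_boxplot() rule, in which whiskers extend at most 1.5 times the interquartile range beyond the box. The quartiles are the rounded values printed above, so the fences are approximate.

```r
# Quartiles copied from the summary() output earlier in this report
# (rounded values, so the fences are approximate).
q1 <- 244.8
q3 <- 929.0
iqr <- q3 - q1

# Points beyond these fences are drawn individually as outliers.
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

c(IQR = iqr, lower_fence = lower_fence, upper_fence = upper_fence)
```

The upper fence comes out near 1955, below the maximum of 2919, so some high prices fall beyond the upper whisker. The lower fence is negative, below the minimum of 49.4, so no points fall beyond the lower whisker.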
***
To further test the normality of the dataset, we used the ggqqplot() function. The data is considered normally distributed if the points lie along the reference line, which represents a perfectly normal dataset.
ggqqplot(DataG$Open)
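To illustrate the idea behind a normal Q-Q plot, the sketch below builds one by hand in base R rather than with ggpubr. It uses simulated right-skewed data (not the Google dataset) and an arbitrary seed, and it is not meant to reproduce the exact output of ggqqplot().

```r
# Simulated data for illustration only; the seed is arbitrary.
set.seed(42)
x <- rexp(500)  # right-skewed sample

# Sample quantiles: the sorted observations.
sample_q <- sort(x)
# Theoretical quantiles: standard normal quantiles at evenly spaced probabilities.
theo_q <- qnorm(ppoints(length(x)))

# Points close to the reference line suggest approximate normality.
plot(theo_q, sample_q, xlab = "Theoretical", ylab = "Sample")
qqline(x)  # reference line through the first and third quartiles
```

For a skewed sample like this one, the points bend away from the line in the tails instead of following it.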
Based on our observations, we conclude that the opening share price in the Google dataset is not normally distributed.