telco_data <- read.csv ("telco.csv", stringsAsFactors = TRUE)
It is important to recognize the shape of our data distribution, because it determines which statistical methods are appropriate for analyzing the data.
Generally, we check whether our dataset is:
i. Symmetrical (bell-shaped). Also known as the Normal distribution.
ii. Skewed to the right (the majority of the data values fall to the left of the mean and cluster at the lower end of the distribution; the tail is to the right).
iii. Skewed to the left (the majority of the data values fall to the right of the mean and cluster at the upper end of the distribution; the tail is to the left).
It may not be a good idea to use the mean as the measure of central tendency if the data are not symmetrical, because the mean is pulled toward the tail. The median is a more appropriate measure of central tendency in that case.
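The effect of skewness on the mean can be seen in a small simulation. The sketch below uses made-up, right-skewed data drawn with rexp() (it is not taken from telco.csv) and compares the mean with the median. The name skewed_usage and the rate value are assumptions for illustration only.

```r
# Simulated right-skewed data (an assumption for illustration, not telco.csv)
set.seed(123)
skewed_usage <- rexp(1000, rate = 1 / 10)

mean_value   <- mean(skewed_usage)    # pulled toward the long right tail
median_value <- median(skewed_usage)  # stays near the bulk of the data

c(mean = mean_value, median = median_value)
```

For right-skewed data like this, the mean should come out larger than the median, which is why the median is usually the safer summary.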
We will plot the distribution of Usage_GB in telco_data using:
1. Histogram
2. Boxplot
3. Density plot
4. q-q (quantile-quantile) plot
1. Histogram
hist (telco_data$Usage_GB, prob = TRUE)
lines(density(telco_data$Usage_GB),lwd=4,col="red")
Comment: The Usage_GB variable is symmetrical and has a bell-shaped distribution.
2. Boxplot
A boxplot describes the distribution of the data based on the five-number summary (minimum, first quartile (Q1), median, third quartile (Q3), and maximum).
The first quartile (Q1) is the 25th percentile: the middle value between the smallest value and the median of the dataset.
The median is the middle value of the dataset (50th percentile).
The third quartile (Q3) is the 75th percentile: the middle value between the median and the largest value.
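The quartiles defined above can be computed directly in R. This is a minimal sketch on a small hypothetical vector x, chosen only so the arithmetic is easy to check by hand; it is not the telco data.

```r
# Hypothetical values, chosen only to make the arithmetic easy to follow
x <- c(2, 4, 6, 8, 10, 12, 14)

# Q1, median and Q3 as the 25th, 50th and 75th percentiles
q <- quantile(x, probs = c(0.25, 0.50, 0.75))

# Five-number summary: minimum, lower hinge, median, upper hinge, maximum
five <- fivenum(x)

q
five
```

For this vector both functions agree (Q1 = 5, median = 8, Q3 = 11). For other sample sizes quantile() and fivenum() can differ slightly because they use different conventions for the quartiles.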
boxplot(telco_data$Usage_GB,xlab="Internet Quota Usage in GB",
main="Figure 2", horizontal=TRUE)
Comment: The median of Usage_GB is near the centre of the box, and the left and right whiskers are about the same length, so the distribution of Usage_GB is approximately symmetric.
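The visual check of whisker lengths can also be done numerically. The sketch below uses boxplot.stats(), which returns the values a boxplot draws, on two hypothetical vectors (not the telco data). The helper whisker_lengths is an illustrative name, not part of base R.

```r
# Hypothetical samples, not taken from telco.csv
symmetric_x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9)
skewed_x    <- c(1, 2, 2, 3, 3, 4, 5, 8, 14)

# stats = lower whisker, lower hinge, median, upper hinge, upper whisker
whisker_lengths <- function(x) {
  s <- boxplot.stats(x)$stats
  c(left = s[2] - s[1], right = s[5] - s[4])
}

sym_whiskers  <- whisker_lengths(symmetric_x)
skew_whiskers <- whisker_lengths(skewed_x)

sym_whiskers
skew_whiskers
```

Equal whisker lengths point to symmetry, while a clearly longer right whisker points to right skew.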
3. Density Plot
plot(density(telco_data$Usage_GB),main="Density Estimate of Usage_GB")
Comment: The Usage_GB variable is symmetrical and has a bell-shaped distribution.
4. q-q plot
qqnorm(telco_data$Usage_GB) #high-level function to plot a q-q plot
qqline(telco_data$Usage_GB,col="red",lwd=3) #low-level function for qqline
Comment: The points fall close to a straight line and there are no obvious outliers, so we can consider Usage_GB to come from a Normal distribution.
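To see what a q-q plot actually compares, the sketch below rebuilds the plotted coordinates by hand: the sorted data against the matching theoretical Normal quantiles. It uses a simulated sample of 45 values; the mean and standard deviation passed to rnorm() are assumptions, not estimates from telco.csv.

```r
set.seed(42)
sample_x <- rnorm(45, mean = 18, sd = 6)   # hypothetical sample of 45 values

# Coordinates that qqnorm() would plot, without drawing anything
qq <- qqnorm(sample_x, plot.it = FALSE)

# Rebuild them by hand: theoretical Normal quantiles vs sorted data
theoretical <- qnorm(ppoints(length(sample_x)))
observed    <- sort(sample_x)

head(cbind(theoretical, observed))
```

If the data are approximately Normal, the (theoretical, observed) pairs lie close to a straight line, which is what qqline() helps to judge.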
Normality Test using the Kolmogorov-Smirnov and Shapiro-Wilk Tests
It is best not to rely only on graphical presentation to check the assumption of normality. We can also test the hypothesis that our data come from a normal distribution by using a test such as the Kolmogorov-Smirnov test and/or the Shapiro-Wilk test. The hypotheses of these tests are:
H0: The data come from a normally distributed population.
H1: The data do not come from a normally distributed population.
Reject the null hypothesis if p-value < 0.05.
Conclusion: Since p-value < 0.05, we reject H0; the data are not normally distributed;
or
Conclusion: Since p-value > 0.05, we fail to reject H0; the data are consistent with a normal distribution.
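The decision rule above can be written as a small helper and tried on simulated data. This is only a sketch: normality_decision is a hypothetical function name, and both samples are simulated rather than taken from telco.csv.

```r
set.seed(2020)
normal_sample <- rnorm(45, mean = 18, sd = 6)  # simulated, roughly bell-shaped
skewed_sample <- rexp(200, rate = 1 / 10)       # simulated, right-skewed

# Hypothetical helper: apply the decision rule to a p-value
normality_decision <- function(p_value, alpha = 0.05) {
  if (p_value < alpha) "Reject H0" else "Fail to reject H0"
}

p_normal <- shapiro.test(normal_sample)$p.value
p_skewed <- shapiro.test(skewed_sample)$p.value

normality_decision(p_normal)
normality_decision(p_skewed)
```

The strongly skewed sample should lead to rejection of H0. For the simulated Normal sample the test usually, but not always, fails to reject, since about 5% of truly Normal samples are rejected at alpha = 0.05.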
1. Kolmogorov-Smirnov test
ks.test (telco_data$Usage_GB, "pnorm", mean = mean(telco_data$Usage_GB),
sd = sd(telco_data$Usage_GB))
## Warning in ks.test(telco_data$Usage_GB, "pnorm", mean =
## mean(telco_data$Usage_GB), : ties should not be present for the Kolmogorov-
## Smirnov test
##
## One-sample Kolmogorov-Smirnov test
##
## data: telco_data$Usage_GB
## D = 0.076064, p-value = 0.957
## alternative hypothesis: two-sided
H0: Usage_GB comes from a normally distributed population.
H1: Usage_GB does not come from a normally distributed population.
Reject the null hypothesis if p-value < 0.05.
Conclusion: Since p-value = 0.957 > 0.05, we fail to reject H0; Usage_GB is consistent with a normal distribution.
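One caveat with the call above is that the mean and standard deviation are estimated from the same data being tested, which tends to make the standard KS p-value too large. A possible cross-check, sketched here as an assumption rather than as part of the original analysis, is to simulate the distribution of the D statistic under normality with the parameters re-estimated each time. The function names ks_d and mc_ks_pvalue are hypothetical, and the example data are simulated.

```r
# D statistic of the KS test against a Normal with estimated mean and sd
ks_d <- function(v) {
  unname(suppressWarnings(   # hides the ties warning for rounded data
    ks.test(v, "pnorm", mean = mean(v), sd = sd(v))$statistic
  ))
}

# Monte Carlo p-value: share of simulated Normal samples with D >= observed D
mc_ks_pvalue <- function(x, n_sim = 1000) {
  d_obs  <- ks_d(x)
  d_null <- replicate(n_sim, ks_d(rnorm(length(x))))
  mean(d_null >= d_obs)
}

set.seed(1)
skewed_sample <- rexp(100, rate = 1 / 10)  # simulated, clearly non-Normal
p_mc <- mc_ks_pvalue(skewed_sample)
p_mc
```

With the real data this would be called as mc_ks_pvalue(telco_data$Usage_GB). Simulating from a standard Normal is enough here because the D statistic does not change when the data are shifted or rescaled and the parameters are re-estimated.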
2. Shapiro-Wilk Test
shapiro.test (telco_data$Usage_GB)
##
## Shapiro-Wilk normality test
##
## data: telco_data$Usage_GB
## W = 0.9878, p-value = 0.9119
H0: Usage_GB comes from a normally distributed population.
H1: Usage_GB does not come from a normally distributed population.
Reject the null hypothesis if p-value < 0.05.
Conclusion: Since p-value = 0.9119 > 0.05, we fail to reject H0; Usage_GB is consistent with a normal distribution.
Normality Conclusion
The normality tests (KS and SW) give no evidence that the distribution of Usage_GB differs from a Normal distribution, which agrees with what we saw in the histogram, boxplot, density plot and q-q plot.
Terms in hypothesis testing
Steps to do Hypothesis Testing
Step 1: State the hypotheses and identify the claim
Step 2: State the level of significance, alpha (commonly 0.1, 0.05 or 0.01)
Step 3: Find the p-value
Step 4: Make the decision
Reject H0 if p-value < alpha
Step 5: Summarize the results
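The five steps can be sketched as a single function for a two-sided one-sample t-test. The helper name five_step_t_test and the simulated data are assumptions for illustration; the function simply wraps the base R t.test() call.

```r
# Hypothetical helper that walks through the five steps
five_step_t_test <- function(x, mu0, alpha = 0.05) {
  # Step 1: state the hypotheses
  hypotheses <- c(H0 = paste("Population mean is equal to", mu0),
                  H1 = paste("Population mean is not equal to", mu0))
  # Step 2: the level of significance (alpha) is supplied by the caller
  # Step 3: find the p-value
  p_value <- t.test(x, mu = mu0, alternative = "two.sided")$p.value
  # Step 4: make the decision
  decision <- if (p_value < alpha) "Reject H0" else "Fail to reject H0"
  # Step 5: collect everything needed to summarize the result
  list(hypotheses = hypotheses, alpha = alpha,
       p_value = p_value, decision = decision)
}

set.seed(7)
simulated_x <- rnorm(45, mean = 18, sd = 6)  # simulated data, not telco.csv
result <- five_step_t_test(simulated_x, mu0 = 15)
result$decision
```

Returning a list keeps the hypotheses, alpha, p-value and decision together, so the summary in Step 5 can be written from one object.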
Example: Hypothesis Test for One Sample (t-test)
Question: There was a claim that the average internet quota usage by students differs from 15 GB. A study was conducted to investigate the claim, and 45 students were selected at random. Test the claim at the 5% level of significance.
Instruction: Use telco_data from slide 7.
Step 1: State the hypotheses and identify the claim
H0: Population mean is equal to 15 GB.
H1: Population mean is not equal to 15 GB (claim).
Step 2: State the level of significance
α = 0.05 (5% level of significance, 95% confidence level)
Step 3: Find the p-value
t.test (telco_data$Usage_GB, alternative = "two.sided", conf.level = 0.95,
mu = 15)
##
## One Sample t-test
##
## data: telco_data$Usage_GB
## t = 3.2358, df = 44, p-value = 0.002306
## alternative hypothesis: true mean is not equal to 15
## 95 percent confidence interval:
## 16.18011 20.07766
## sample estimates:
## mean of x
## 18.12889
Step 4: Make the decision
Reject H0 if p-value < alpha. Since p-value = 0.0023 < alpha = 0.05, reject H0.
Step 5: Summarize the result
At the 5% significance level, there is enough evidence to support the claim that the mean internet usage by the students is different from 15 GB.
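The p-value in the output can be reproduced from the reported test statistic. For a one-sample t-test the statistic is t = (sample mean - 15) / (s / sqrt(n)) with n - 1 = 44 degrees of freedom, and the two-sided p-value is the probability of a |t| at least that large under H0. The sketch below uses only the numbers printed by t.test(), so it runs without telco.csv.

```r
# Values copied from the t.test() output
t_stat <- 3.2358
df     <- 44
alpha  <- 0.05

# Two-sided p-value: probability of a |t| at least this large under H0
p_value <- 2 * pt(-abs(t_stat), df = df)

# Decision rule from Step 4
reject_h0 <- p_value < alpha

p_value
reject_h0
```

The result should agree with the reported p-value of 0.002306 up to rounding of the t statistic.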
Confidence Interval
t.test (telco_data$Usage_GB, alternative = "two.sided", conf.level = 0.95,
mu = 15)
##
## One Sample t-test
##
## data: telco_data$Usage_GB
## t = 3.2358, df = 44, p-value = 0.002306
## alternative hypothesis: true mean is not equal to 15
## 95 percent confidence interval:
## 16.18011 20.07766
## sample estimates:
## mean of x
## 18.12889
We are 95% confident that the mean internet usage by the students is between 16.1801 and 20.0777 GB.
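The interval can be rebuilt from the numbers in the output as sample mean ± critical t value × standard error. In the sketch below the standard error is not read from the data; it is backed out from the reported mean and t statistic, which is an assumption made so the example runs without telco.csv.

```r
# Values copied from the t.test() output
x_bar  <- 18.12889
t_stat <- 3.2358
df     <- 44
mu0    <- 15

# Standard error backed out from t = (x_bar - mu0) / se
se <- (x_bar - mu0) / t_stat

# 95% interval: x_bar +/- critical t value * se
t_crit <- qt(0.975, df = df)
ci <- c(lower = x_bar - t_crit * se, upper = x_bar + t_crit * se)

# Check whether the hypothesized mean falls outside the interval
mu0_outside <- mu0 < ci["lower"] || mu0 > ci["upper"]

ci
mu0_outside
```

The hypothesized value of 15 GB lies outside the interval, which is consistent with rejecting H0 in the two-sided test at the 5% level.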
Bluman, A. G. (2009). Elementary statistics: A step by step approach. New York, NY: McGraw-Hill Higher Education.
edited : 27th April 2020
by: Muhammad Asmui Abdul Rahim
email: asmui@tmsk.uitm.edu.my
created using: rmarkdown