Comparing Production of Barley over Decades

Assignment 3

* Anamica Raj Suresh - S3790068
* Avisha Warnakulasooriya - S3808119
* Shriya Bagwe - S3803954

Last updated: 27 October, 2019

Introduction

In this investigation we aim to determine whether barley production in Australia increased between two decades (1991-2000 and 2001-2010). Because we are comparing production across two separate decades, we use a two-sample t-test for independent samples: the crops harvested in the second decade are different from those harvested in the first.

We started the investigation by gathering the production figures for the crop in both decades. The main purpose of this report is to analyse barley production for the years “1991-2000” and “2001-2010”; the data were compiled in Excel. There are 4 variables, namely State, Year, Year range and Production.

Introduction continuation

There are 278 observations in the data sample. We first grouped them by decade and then calculated the summary statistics for both decades. We used boxplots to check for outliers, and histograms and Q-Q plots to visualise the data. The Shapiro-Wilk test was used to test for normality.

The null hypothesis is that there is no difference in production between the two decades, and the alternative hypothesis is that barley production increased from 1991-2000 to 2001-2010. The one-sided test gave a p-value of 0.0029, which is less than 0.05, so we reject the null hypothesis that barley production did not change between the decades. The 95% confidence interval for the difference in means also excludes 0. The result is statistically significant.

Problem Statement

Data

Data Cont.

Pre-processing

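library(readxl)  # read_excel()
library(dplyr)   # %>%, group_by(), mutate(), select(), summarise()
library(tidyr)   # spread()

Barley <- read_excel("C:/Users/SHRIYA SALIL BAGWE/Desktop/statistics/Asssignment 3/Barley.xlsx")

# Number the rows within each decade so that every row has a unique key for spread()
Barley <- Barley %>% 
  group_by(Year_range) %>% 
  mutate(grouped_id = row_number())

# One column per decade; the helper id is dropped afterwards
Barley1 <- Barley %>% 
  spread(Year_range,Production) %>% 
  select(-grouped_id)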
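
Grouping by decade can also be done without widening the table. The base-R sketch below keeps the data in long format and splits the production values by decade. The data frame `toy_barley`, its state labels and its production values are all made up for illustration; they are not taken from Barley.xlsx.

```r
# Hypothetical long-format data (NOT the real Barley.xlsx values)
toy_barley <- data.frame(
  State      = rep(c("A", "B"), each = 4),
  Year       = rep(c(1991, 1992, 2001, 2002), times = 2),
  Year_range = rep(c("1991-2000", "1991-2000", "2001-2010", "2001-2010"), times = 2),
  Production = c(10, 12, 15, 18, 20, 22, 27, 30)
)

# One numeric vector per decade, with no NA padding
by_decade <- split(toy_barley$Production, toy_barley$Year_range)

# Mean production per decade
decade_means <- sapply(by_decade, mean)
decade_means
```

Each element of `by_decade` can then be passed directly to functions such as `boxplot()`, `hist()` or `shapiro.test()`.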

Descriptive Statistics

str(Barley1)
## Classes 'tbl_df', 'tbl' and 'data.frame':    280 obs. of  4 variables:
##  $ State    : chr  "New South Wales(b)" "New South Wales(b)" "New South Wales(b)" "New South Wales(b)" ...
##  $ Year     : num  1991 1991 1992 1992 1993 ...
##  $ 1991-2000: num  463300 822500 517500 748700 559700 ...
##  $ 2001-2010: num  NA NA NA NA NA NA NA NA NA NA ...
Barley1 %>% summarise(
  MEANI = mean(`1991-2000`, na.rm = TRUE),
  MEANII = mean(`2001-2010`, na.rm = TRUE),
  STDI = sd(`1991-2000`, na.rm = TRUE),
  STDII = sd(`2001-2010`, na.rm = TRUE),
  MinI = min(`1991-2000`, na.rm = TRUE),
  MinII = min(`2001-2010`, na.rm = TRUE),
  MaxI = max(`1991-2000`, na.rm = TRUE),
  MaxII = max(`2001-2010`, na.rm = TRUE),
  QI  = quantile(`1991-2000`, probs = 0.25, na.rm = TRUE),
  QII  = quantile(`2001-2010`, probs = 0.25, na.rm = TRUE),
  MedianI = median(`1991-2000`, na.rm = TRUE),
  MedianII = median(`2001-2010`, na.rm = TRUE),
  Q3I = quantile(`1991-2000`, probs = 0.75, na.rm = TRUE),
  Q3II = quantile(`2001-2010`, probs = 0.75, na.rm = TRUE),
  # Separate names per decade; a repeated name would overwrite the first IQR
  IQRI = IQR(`1991-2000`, na.rm = TRUE),
  IQRII = IQR(`2001-2010`, na.rm = TRUE),
  # NA cells across both decade columns (each row holds a value for one decade only)
  MissingTotal = sum(is.na(`1991-2000`)) + sum(is.na(`2001-2010`))
) -> table1

knitr::kable(table1)
MEANI     MEANII    STDI      STDII   MinI  MinII  MaxI     MaxII
596702.9  834597.9  571257.1  834923  0     0      2242400  3169900

QI     QII    MedianI  MedianII  Q3I     Q3II     IQRI    IQRII    MissingTotal
31575  27450  542000   795900    981275  1307175  949700  1279725  280

Checking for outliers - Boxplot

par(mfrow = c(1,2))
boxplot(Barley1$`1991-2000`, main = "Boxplot Of Barley Production from 1991-2000")
boxplot(Barley1$`2001-2010`, main = "Boxplot Of Barley Production from 2001-2010")
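
As a rough numerical companion to the boxplots, the usual 1.5 × IQR rule can be applied to the quartiles, minima and maxima printed in the summary table. This is only a sketch: `boxplot()` computes its hinges with `fivenum()`, which can differ slightly from the `quantile()` values used here, so the fences below are approximate.

```r
# Quartiles, minima and maxima as printed in the summary table
q1 <- c("1991-2000" = 31575,   "2001-2010" = 27450)
q3 <- c("1991-2000" = 981275,  "2001-2010" = 1307175)
mn <- c("1991-2000" = 0,       "2001-2010" = 0)
mx <- c("1991-2000" = 2242400, "2001-2010" = 3169900)

iqr <- q3 - q1

# Fences of the 1.5 * IQR rule
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

# TRUE if the smallest or largest value lies outside the fences
has_outliers <- mn < lower_fence | mx > upper_fence
rbind(lower_fence, upper_fence, has_outliers)
```

With these figures both maxima lie inside the upper fences and both minima inside the lower fences, so the rule flags no outliers in either decade.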

Visualisation - Histogram

We used histograms, drawn with the hist() function, which show the frequencies of a variable's values grouped into ranges (bins). Here there are two histograms: barley production for 1991-2000 and barley production for 2001-2010.

par(mfrow = c(1,2))
hist(Barley1$`1991-2000`, main = "Histogram Of Barley Production from 1991-2000")
hist(Barley1$`2001-2010`, main = "Histogram Of Barley Production from 2001-2010")

Visualisation - QQNorm

For visualisation we also used qqnorm(), a generic function whose default method produces a normal Q-Q plot of the values in y. We drew one plot for 1991-2000 and one for 2001-2010. The companion function qqline() adds a reference line to a normal Q-Q plot; by default the line passes through the first and third quartiles.

par(mfrow = c(1,2))
Barley1$`1991-2000` %>% qqnorm(main="Barley Production from 1991-2000")
qqline(Barley1$`1991-2000`)  # reference line through the first and third quartiles
Barley1$`2001-2010` %>% qqnorm(main="Barley Production from 2001-2010")
qqline(Barley1$`2001-2010`)

Hypothesis Testing

Before testing the hypothesis, we checked the normality of both samples with the shapiro.test() function, which performs the Shapiro-Wilk test, a widely used test of normality. We applied it to barley production for both 1991-2000 and 2001-2010.

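The p-value is 1.369e-08 for “1991-2000” and 1.522e-09 for “2001-2010”. Both are far smaller than 0.05, so we reject the hypothesis of normality: the distributions of barley production for “1991-2000” and “2001-2010” both differ significantly from a normal distribution. Because the samples are large (the reshaped data has 280 rows across the two decades), the sample means are approximately normally distributed by the Central Limit Theorem, so we still proceed with the t-test.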

#Testing for normality

shapiro.test(Barley1$`1991-2000`)
## 
##  Shapiro-Wilk normality test
## 
## data:  Barley1$`1991-2000`
## W = 0.89326, p-value = 1.369e-08
shapiro.test(Barley1$`2001-2010`)
## 
##  Shapiro-Wilk normality test
## 
## data:  Barley1$`2001-2010`
## W = 0.87416, p-value = 1.522e-09
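
As a hedged illustration of how a Shapiro-Wilk p-value is read against the 0.05 level, the sketch below applies the decision rule to the two p-values printed above and to a simulated right-skewed sample. The helper `rejects_normality()` and the simulated data are assumptions introduced for this example only.

```r
# Hypothetical helper: TRUE when the Shapiro-Wilk p-value is below alpha,
# i.e. when the null hypothesis of normality is rejected
rejects_normality <- function(p_value, alpha = 0.05) {
  p_value < alpha
}

# p-values printed in the output above
rejects_normality(1.369e-08)  # 1991-2000
rejects_normality(1.522e-09)  # 2001-2010

# Simulated right-skewed sample (NOT the barley data), to show the same rule
set.seed(1)
skewed <- rexp(140, rate = 1)
skewed_test <- shapiro.test(skewed)
rejects_normality(skewed_test$p.value)
```

A small p-value is evidence against normality; a large p-value only means that the test found no evidence of a departure from it.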

T Test

For the t-test we compared two variables, barley production for “1991-2000” (x) and for “2001-2010” (y). We set the alternative hypothesis to “less” because we want to show that the mean of x is less than the mean of y. We set the confidence level to 0.95 because our significance level is 0.05.

# t.test() drops missing values itself; it has no na.rm argument
t.test(Barley1$`1991-2000`, Barley1$`2001-2010`,
        alternative = "less",
        conf.level = 0.95) -> test_results
test_results
## 
##  Welch Two Sample t-test
## 
## data:  Barley1$`1991-2000` and Barley1$`2001-2010`
## t = -2.7824, df = 245.75, p-value = 0.002907
## alternative hypothesis: true difference in means is less than 0
## 95 percent confidence interval:
##       -Inf -96728.13
## sample estimates:
## mean of x mean of y 
##  596702.9  834597.9
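
The printed statistic can be cross-checked from the summary table alone. The sketch below recomputes the Welch t statistic, its degrees of freedom, the one-sided p-value and the upper confidence bound from the printed means and standard deviations. The group sizes `n1` and `n2` are an assumption (140 per decade, that is, the 280 rows split evenly); they are not printed anywhere in the output.

```r
# Means and standard deviations as printed in the summary table
m1 <- 596702.9; s1 <- 571257.1   # 1991-2000
m2 <- 834597.9; s2 <- 834923     # 2001-2010

# ASSUMPTION: 140 observations per decade (not stated in the report)
n1 <- 140; n2 <- 140

# Per-group variance of the mean and the standard error of the difference
v1 <- s1^2 / n1
v2 <- s2^2 / n2
se <- sqrt(v1 + v2)

# Welch t statistic and Welch-Satterthwaite degrees of freedom
t_stat <- (m1 - m2) / se
welch_df <- (v1 + v2)^2 / (v1^2 / (n1 - 1) + v2^2 / (n2 - 1))

# One-sided ("less") p-value and upper bound of the 95% interval
p_less <- pt(t_stat, welch_df)
upper_bound <- (m1 - m2) + qt(0.95, welch_df) * se

round(c(t = t_stat, df = welch_df, p = p_less, upper = upper_bound), 4)
```

Under that assumption the recomputed t and df should agree with the printed values (t = -2.7824, df = 245.75) up to rounding of the inputs.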

Hypothesis Testing Cont.

Let \(\mu_1\) and \(\mu_2\) denote mean barley production in 1991-2000 and 2001-2010 respectively.

\[H_0: \mu_1 = \mu_2 \]

\[H_A: \mu_1 < \mu_2\]

Discussion

Having rejected the null hypothesis, we have statistical evidence in support of the alternative hypothesis. It can therefore be said that barley production increased between these two decades. The increase could be due to various factors:

* Development in the agricultural sector
* Improvement in climatic conditions
* More effective machinery over the years
* Increase in agricultural land

Limitations

The reasons suggested for the increase in production are only possible explanations; this test provides no statistical evidence for them. We are simply assuming that these general factors contributed to the increased production. A linear regression analysis could be used to test them, although any factors not included in the model would remain assumptions.
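
A minimal sketch of what such a regression could look like is given below, fitting production against year with `lm()`. The data frame `toy` is entirely hypothetical (a made-up upward trend with alternating noise) and stands in for the real data; additional explanatory variables such as land area or rainfall would have to be collected separately.

```r
# Hypothetical yearly production figures (NOT the Barley.xlsx data)
toy <- data.frame(Year = 1991:2010)
toy$Production <- 500 + 15 * (toy$Year - 1991) + rep(c(-20, 20), times = 10)

# Simple linear regression of production on year
fit <- lm(Production ~ Year, data = toy)

# Estimated change in production per year
slope <- coef(fit)[["Year"]]
slope

# Coefficient table with standard errors, t values and p-values
summary(fit)$coefficients
```

A positive and statistically significant slope would indicate an upward trend over time, but it would not by itself identify which factor caused the trend.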

FINAL CONCLUSION

To conclude, our investigation indicates that barley production increased from 1991-2000 to 2001-2010, which supports the alternative hypothesis.

References