* Anamica Raj Suresh - S3790068
* Avisha Warnakulasooriya - S3808119
* Shriya Bagwe - S3803954
Last updated: 27 October, 2019
In this investigation we ask whether barley production in Australia increased between two decades, 1991-2000 and 2001-2010. Because we are comparing production across the two decades, we use a two-sample t-test for independent samples, since the crops harvested in the second decade are different from those harvested in the first.
We started the investigation by gathering barley production figures for both decades. The main purpose of this report is to analyse barley production for the years “1991-2000” and “2001-2010”, which we first compiled in Excel. There are 4 variables: State, Year, Year range and Production.
There are 278 observations in the data sample. We first grouped them by decade and then calculated summary statistics for both decades. We used boxplots to check for outliers, and histograms and Q-Q plots to visualise the data. The Shapiro-Wilk test was used to test normality.
The null hypothesis is that there is no difference in barley production between the two decades, and the alternative hypothesis is that production increased from 1991-2000 to 2001-2010. The one-sided test gave a p-value of 0.0029, which is less than 0.05, so we reject the null hypothesis that barley production did not change. The 95% confidence interval for the difference in means also excludes 0. The result is statistically significant.
The main question for this investigation is whether barley production increased from one decade to the next
We use the t.test() function in R to obtain the p-value for this investigation
We collected the data from the Agriculture census data set published by the State Government of Victoria: https://data.vic.gov.au
We have only taken one data set into consideration for this investigation
The data set covers the production of many crops, but we have only taken barley into consideration
We have taken 20 years of data: 10 years for the first period and 10 years for the second
We have 4 variables: State (the state of Australia where the production was measured), Year (the recorded year), Year range (the grouping variable, 1991-2000 or 2001-2010) and Production
The production variable is measured in tonnes
We used the tidyr function spread() to clean the data and get it ready for analysis. Since the year range is a grouping variable, we used it to separate the production variable into two new variables, one per decade.
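As a rough illustration of what spread() does in this step, the sketch below reshapes a small made-up table. The data frame `barley_long`, the column names `YearRange` and `Production`, and all values are assumptions for illustration only, not the real data set.

```r
library(tidyr)

# Hypothetical long-format data: one row per state and year
barley_long <- data.frame(
  State      = c("A", "A", "B", "B"),
  Year       = c(1995, 2005, 1995, 2005),
  YearRange  = c("1991-2000", "2001-2010", "1991-2000", "2001-2010"),
  Production = c(100, 150, 200, 260),
  stringsAsFactors = FALSE
)

# spread() turns each distinct YearRange value into its own column,
# filled with Production; cells with no matching row become NA
barley_wide <- spread(barley_long, key = YearRange, value = Production)
print(barley_wide)
```

Each row keeps a value in only one of the two decade columns, which is why the structure printed next shows NA in the column for the other decade.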
## Classes 'tbl_df', 'tbl' and 'data.frame': 280 obs. of 4 variables:
## $ State : chr "New South Wales(b)" "New South Wales(b)" "New South Wales(b)" "New South Wales(b)" ...
## $ Year : num 1991 1991 1992 1992 1993 ...
## $ 1991-2000: num 463300 822500 517500 748700 559700 ...
## $ 2001-2010: num NA NA NA NA NA NA NA NA NA NA ...
library(dplyr)

# Summary statistics for each decade (I = 1991-2000, II = 2001-2010)
Barley1 %>% summarise(
  MEANI = mean(`1991-2000`, na.rm = TRUE),
  MEANII = mean(`2001-2010`, na.rm = TRUE),
  STDI = sd(`1991-2000`, na.rm = TRUE),
  STDII = sd(`2001-2010`, na.rm = TRUE),
  MinI = min(`1991-2000`, na.rm = TRUE),
  MinII = min(`2001-2010`, na.rm = TRUE),
  MaxI = max(`1991-2000`, na.rm = TRUE),
  MaxII = max(`2001-2010`, na.rm = TRUE),
  QI = quantile(`1991-2000`, probs = 0.25, na.rm = TRUE),
  QII = quantile(`2001-2010`, probs = 0.25, na.rm = TRUE),
  MedianI = median(`1991-2000`, na.rm = TRUE),
  MedianII = median(`2001-2010`, na.rm = TRUE),
  Q3I = quantile(`1991-2000`, probs = 0.75, na.rm = TRUE),
  Q3II = quantile(`2001-2010`, probs = 0.75, na.rm = TRUE),
  # Distinct names so the second IQR does not overwrite the first
  IQRI = IQR(`1991-2000`, na.rm = TRUE),
  IQRII = IQR(`2001-2010`, na.rm = TRUE),
  # Count missing values separately for each decade
  MissingI = sum(is.na(`1991-2000`)),
  MissingII = sum(is.na(`2001-2010`))
) -> table1
knitr::kable(table1)
| MEANI | MEANII | STDI | STDII | MinI | MinII | MaxI | MaxII | QI | QII | MedianI | MedianII | Q3I | Q3II | IQRII | Missing (both columns) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 596702.9 | 834597.9 | 571257.1 | 834923 | 0 | 0 | 2242400 | 3169900 | 31575 | 27450 | 542000 | 795900 | 981275 | 1307175 | 1279725 | 280 |
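A minimal sketch, on made-up numbers, of two details behind this table: the IQR is the gap between the third and first quartiles, and missing values can be counted per column with colSums(is.na(...)). The data frame `toy` and its values are hypothetical.

```r
# Hypothetical wide data with the same column layout as Barley1
toy <- data.frame(
  `1991-2000` = c(10, 20, 30, 40, NA, NA),
  `2001-2010` = c(NA, NA, NA, NA, 50, 70),
  check.names = FALSE
)

# IQR computed by hand as Q3 - Q1, using R's default quantile method
q <- quantile(toy$`1991-2000`, probs = c(0.25, 0.75), na.rm = TRUE)
iqr_by_hand <- unname(q[2] - q[1])

# Built-in IQR() for comparison
iqr_builtin <- IQR(toy$`1991-2000`, na.rm = TRUE)

# Number of missing values in each column separately
missing_per_column <- colSums(is.na(toy))

print(c(by_hand = iqr_by_hand, builtin = iqr_builtin))
print(missing_per_column)
```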
par(mfrow = c(1,2))
boxplot(Barley1$`1991-2000`, main = "Boxplot Of Barley Production from 1991-2000")
boxplot(Barley1$`2001-2010`, main = "Boxplot Of Barley Production from 2001-2010")
We also used histograms, drawn with the hist() function, which shows the frequencies of a variable's values bucketed into ranges. In this case we drew two histograms: barley production from 1991-2000 and barley production from 2001-2010.
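To make the bucketing concrete, here is a small sketch on made-up values (the vector `x` and the break points are assumptions). With `plot = FALSE`, hist() returns the bin counts instead of drawing.

```r
# Hypothetical values to bucket
x <- c(1, 2, 2, 3, 7, 8, 9)

# Two bins: (0, 5] and (5, 10]; plot = FALSE returns the histogram object
h <- hist(x, breaks = c(0, 5, 10), plot = FALSE)

# Number of values falling into each bin
print(h$counts)
```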
par(mfrow = c(1,2))
hist(Barley1$`1991-2000`, main = "Histogram Of Barley Production from 1991-2000")
hist(Barley1$`2001-2010`, main = "Histogram Of Barley Production from 2001-2010")
For visualisation we also used qqnorm(), a generic function whose default method produces a normal Q-Q plot of the values in y. We applied it to both 1991-2000 and 2001-2010. The companion function qqline() adds a line to a “theoretical” (by default normal) quantile-quantile plot, passing through the probs quantiles, which by default are the first and third quartiles.
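A minimal sketch of how qqnorm() and qqline() work together, using simulated right-skewed data rather than the barley figures (the vector `skewed` is an assumption for illustration):

```r
set.seed(1)

# Simulated right-skewed sample, standing in for real production data
skewed <- rexp(100)

# qqnorm() plots sample quantiles against theoretical normal quantiles
# and invisibly returns the plotted coordinates
qq <- qqnorm(skewed, main = "Normal Q-Q plot of simulated skewed data")

# qqline() adds the reference line through the first and third quartiles
qqline(skewed)
```

Points that bend away from the reference line, as they do for skewed data, suggest a departure from normality.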
par(mfrow = c(1,2))
Barley1$`1991-2000` %>% qqnorm(main = "Barley Production from 1991-2000")
Barley1$`2001-2010` %>% qqnorm(main = "Barley Production from 2001-2010")
When testing the hypothesis, we first used the shapiro.test() function, which is widely used in statistics to test for normality. We applied it to barley production in both decades, 1991-2000 and 2001-2010.
The p-value is 1.369e-08 for “1991-2000” and 1.522e-09 for “2001-2010”. Both are far smaller than 0.05, so we conclude that the distribution of barley production in both “1991-2000” and “2001-2010” differs significantly from a normal distribution. Because each decade contributes a large number of observations, we rely on the central limit theorem and proceed with the t-test.
##
## Shapiro-Wilk normality test
##
## data: Barley1$`1991-2000`
## W = 0.89326, p-value = 1.369e-08
##
## Shapiro-Wilk normality test
##
## data: Barley1$`2001-2010`
## W = 0.87416, p-value = 1.522e-09
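As a generic illustration of how a Shapiro-Wilk p-value is read: the null hypothesis of the test is that the data are normally distributed, so a p-value below the significance level is evidence against normality. The sketch uses simulated skewed data; `skewed` and `alpha` are illustrative names, not part of the original analysis.

```r
set.seed(42)

# Simulated right-skewed sample (clearly non-normal by construction)
skewed <- rexp(140)

res <- shapiro.test(skewed)
alpha <- 0.05

# TRUE means the sample differs significantly from a normal distribution
reject_normality <- res$p.value < alpha

print(res$p.value)
print(reject_normality)
```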
For the t-test we compared two variables, barley production in “1991-2000” and in “2001-2010”. We chose “less” as the alternative hypothesis because we want to show that the mean of the first sample (x) is less than the mean of the second sample (y). We set the confidence level to 0.95 because our level of significance is 0.05.
# Welch two-sample t-test; t.test() drops missing values itself, so na.rm is not needed
t.test(Barley1$`1991-2000`, Barley1$`2001-2010`,
       alternative = "less",
       conf.level = 0.95) -> test_results
test_results
##
## Welch Two Sample t-test
##
## data: Barley1$`1991-2000` and Barley1$`2001-2010`
## t = -2.7824, df = 245.75, p-value = 0.002907
## alternative hypothesis: true difference in means is less than 0
## 95 percent confidence interval:
## -Inf -96728.13
## sample estimates:
## mean of x mean of y
## 596702.9 834597.9
\[H_0: \mu_1 = \mu_2 \]
\[H_A: \mu_1 < \mu_2\]
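To show what the one-sided Welch test computes, the sketch below reproduces the t statistic, degrees of freedom and p-value by hand on simulated data and compares them with t.test(). The samples `x` and `y` are made up for illustration and are unrelated to the barley figures.

```r
set.seed(123)

# Simulated samples with unequal variances and sizes
x <- rnorm(40, mean = 10, sd = 2)
y <- rnorm(50, mean = 12, sd = 4)

# Squared standard errors of each sample mean
se2_x <- var(x) / length(x)
se2_y <- var(y) / length(y)

# Welch t statistic
t_stat <- (mean(x) - mean(y)) / sqrt(se2_x + se2_y)

# Welch-Satterthwaite degrees of freedom
df <- (se2_x + se2_y)^2 /
  (se2_x^2 / (length(x) - 1) + se2_y^2 / (length(y) - 1))

# One-sided p-value for H_A: mu_1 < mu_2 (lower tail)
p_less <- pt(t_stat, df)

# Reference result from the built-in test
ref <- t.test(x, y, alternative = "less")
print(c(t = t_stat, df = df, p = p_less))
print(ref)
```

A fractional df, such as the 245.75 reported above, is characteristic of the Welch correction for unequal variances.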
Since we rejected the null hypothesis, we have statistical evidence to support the alternative hypothesis. It can therefore be said that barley production increased between these two decades. The increase in production could be due to various reasons:
* Development in the agricultural sector
* Improvement in climatic conditions
* More effective machinery over the years
* Increase in agricultural land
These reasons are possible explanations only. This test provides no statistical evidence for them; we are simply suggesting general factors that may have contributed to the increased production. A linear regression analysis could be used to examine the trend further. However, the other factors would still be assumptions.
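A minimal sketch of what such a regression could look like, fitted to simulated yearly totals. The vectors `year` and `production`, and the trend and noise used to generate them, are assumptions for illustration and not results from the barley data.

```r
set.seed(7)

# Simulated yearly production with an upward trend plus noise
year <- 1991:2010
production <- 500000 + 20000 * (year - 1991) +
  rnorm(length(year), sd = 50000)

# Simple linear regression of production on year
fit <- lm(production ~ year)

# The slope estimates the average change in production per year
slope <- unname(coef(fit)["year"])
print(summary(fit)$coefficients)
```

A slope significantly greater than zero would indicate an upward trend over time, but it would not by itself identify which of the listed factors caused it.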
FINAL CONCLUSION
To conclude, our investigation indicates that barley production increased from 1991-2000 to 2001-2010, which provides evidence in favour of the alternative hypothesis.