Introduction

In this investigation we try to determine whether barley production in Australia increased between two decades, 1991-2000 and 2001-2010. Because we are comparing the production of the crop across two decades, we use a two-sample t-test for independent samples: the harvests in the second decade are different crops from those in the first, so the two groups of observations are not paired.

We started the investigation by gathering the production figures for the crop in both decades. The main purpose of this report is to analyse barley production for the years 1991-2000 and 2001-2010, which we compiled in Excel. The data has 4 variables: State, Year, Year range and Production.

Introduction continuation

There are 278 observations in the data sample. We first grouped them by decade and then calculated the summary statistics for both decades. We used boxplots to check for outliers, and histograms and QQ plots to visualise the data. The Shapiro-Wilk test was used to test for normality.

The null hypothesis is that there is no difference in the mean production of the crop between the two decades, and the alternative hypothesis is that barley production increased from 1991-2000 to 2001-2010. The one-sided test gave a p-value of 0.003, which is less than the 0.05 significance level, so we reject the null hypothesis that barley production did not change. The 95% confidence interval for the difference in means also excludes 0. The result is statistically significant.

Problem Statement

The main question for this investigation is whether barley production increased from one decade to the next.
We use the t.test() function in R to obtain the p-value for this comparison.

Data

We collected the data from the Agriculture census data set published by the State Government of Victoria: https://data.vic.gov.au
We used only this one data set for the investigation.
The data set covers the production of many crops, but we considered only barley.
We took 20 years of data: 10 years for the first decade and 10 years for the second.

Data Cont.

We have 4 variables: State (the Australian state where the production was measured), Year (the recorded year), Year range (the grouping variable, 1991-2000 or 2001-2010) and Production.
The Production variable is measured in tonnes.
We used the tidyr function spread() to clean up the data and get it ready for analysis. Since the grouping id is just a variable, we used it to group the data and separate the Production variable into two new variables, one per decade.

Pre-processing

We grouped the data by year range and then used mutate() to add an ID called grouped_id.
Then we used spread() to produce two separate variables, one per decade.

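library(readxl)  # read_excel()
library(dplyr)   # group_by(), mutate(), select(), %>%
library(tidyr)   # spread()

Barley <- read_excel("C:/Users/SHRIYA SALIL BAGWE/Desktop/statistics/Asssignment 3/Barley.xlsx")

# Number the rows within each decade so that every row has a unique key
Barley <- Barley %>% 
  group_by(Year_range) %>% 
  mutate(grouped_id = row_number())

# One production column per decade, then drop the helper id
Barley1 <- Barley %>% 
  spread(Year_range,Production) %>% 
  select(-grouped_id)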
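
As an alternative to reshaping, the two decades could also be pulled apart with base R's split(), which avoids the NA placeholders that spread() leaves in each decade column. The sketch below is only an illustration: toy_barley and its values are invented and are not the census data.

```r
# Toy long-format data; the values are invented for illustration only.
toy_barley <- data.frame(
  Year       = c(1991, 1992, 2001, 2002),
  Year_range = c("1991-2000", "1991-2000", "2001-2010", "2001-2010"),
  Production = c(400, 500, 700, 900)
)

# split() returns a named list with one numeric vector per decade,
# so no NA padding is introduced.
by_decade <- split(toy_barley$Production, toy_barley$Year_range)

names(by_decade)    # the two decade labels
lengths(by_decade)  # number of observations per decade
```

Each list element can then be passed directly to functions such as mean(), boxplot() or t.test().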

Descriptive Statistics

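We produced the descriptive statistics for each decade: mean, median, standard deviation, minimum, maximum and quartiles.
We also counted the missing values. The production data itself has none: the only NA entries are the placeholders that spread() creates, one per row, in the column of the other decade, which is why every statistic is computed with na.rm = TRUE.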

str(Barley1)

## Classes 'tbl_df', 'tbl' and 'data.frame':    280 obs. of  4 variables:
##  $ State    : chr  "New South Wales(b)" "New South Wales(b)" "New South Wales(b)" "New South Wales(b)" ...
##  $ Year     : num  1991 1991 1992 1992 1993 ...
##  $ 1991-2000: num  463300 822500 517500 748700 559700 ...
##  $ 2001-2010: num  NA NA NA NA NA NA NA NA NA NA ...

table1 <- Barley1 %>% summarise(
  MEANI = mean(`1991-2000`, na.rm = TRUE),
  MEANII = mean(`2001-2010`, na.rm = TRUE),
  STDI = sd(`1991-2000`, na.rm = TRUE),
  STDII = sd(`2001-2010`, na.rm = TRUE),
  MinI = min(`1991-2000`, na.rm = TRUE),
  MinII = min(`2001-2010`, na.rm = TRUE),
  MaxI = max(`1991-2000`, na.rm = TRUE),
  MaxII = max(`2001-2010`, na.rm = TRUE),
  QI  = quantile(`1991-2000`, probs = 0.25, na.rm = TRUE),
  QII  = quantile(`2001-2010`, probs = 0.25, na.rm = TRUE),
  MedianI = median(`1991-2000`, na.rm = TRUE),
  MedianII = median(`2001-2010`, na.rm = TRUE),
  Q3I = quantile(`1991-2000`, probs = 0.75, na.rm = TRUE),
  Q3II = quantile(`2001-2010`, probs = 0.75, na.rm = TRUE),
  # Distinct names, so the second IQR no longer overwrites the first
  IQRI = IQR(`1991-2000`, na.rm = TRUE),
  IQRII = IQR(`2001-2010`, na.rm = TRUE),
  # NA placeholders left by spread(), counted across both decade columns
  MissingTotal = sum(is.na(`1991-2000`)) + sum(is.na(`2001-2010`))
)

knitr::kable(table1)

MEANI	MEANII	STDI	STDII	MinI	MinII	MaxI	MaxII	QI	QII	MedianI	MedianII	Q3I	Q3II	IQRI	IQRII	MissingTotal
596702.9	834597.9	571257.1	834923	0	0	2242400	3169900	31575	27450	542000	795900	981275	1307175	949700	1279725	280
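
The same statistics can be laid out with one column per decade by applying a small helper to each group. This is a minimal sketch under assumptions: describe() is a hypothetical helper written for this illustration, and the toy values are invented.

```r
# Toy long-format data; the values are invented for illustration only.
toy <- data.frame(
  Year_range = rep(c("1991-2000", "2001-2010"), each = 5),
  Production = c(10, 20, 30, 40, 50, 20, 40, 60, 80, 100)
)

# Hypothetical helper: the descriptive statistics used in this report.
describe <- function(x) {
  c(mean   = mean(x),
    sd     = sd(x),
    min    = min(x),
    q1     = unname(quantile(x, 0.25)),
    median = median(x),
    q3     = unname(quantile(x, 0.75)),
    max    = max(x),
    iqr    = IQR(x))
}

# One column per decade, one row per statistic.
summary_table <- sapply(split(toy$Production, toy$Year_range), describe)
summary_table
```

Because IQR() and quantile() use the same default quantile definition, the iqr row equals q3 minus q1 in every column.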

Checking for outliers - Boxplot

We used boxplots to check for outliers. Neither decade shows outliers that would distort the test statistic.

par(mfrow = c(1,2))
boxplot(Barley1$`1991-2000`, main = "Boxplot Of Barley Production from 1991-2000")
boxplot(Barley1$`2001-2010`, main = "Boxplot Of Barley Production from 2001-2010")
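
The points a boxplot would draw as outliers can also be listed numerically. The sketch below uses boxplot.stats(), which applies the same 1.5 x IQR rule as boxplot(); toy_production is an invented vector with one deliberately extreme value, not the barley data.

```r
# Invented values with one deliberately extreme observation.
toy_production <- c(120, 150, 160, 175, 180, 200, 950)

# $out holds the values lying beyond 1.5 times the hinge spread,
# i.e. the points boxplot() would plot individually.
flagged <- boxplot.stats(toy_production)$out
flagged

# An empty result would mean that no point is flagged as an outlier.
length(flagged)
```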

Visualisation - Histogram

We used histograms, drawn with the hist() function, which show the frequencies of a variable's values bucketed into ranges. Here we draw two histograms: barley production for 1991-2000 and barley production for 2001-2010.

par(mfrow = c(1,2))
hist(Barley1$`1991-2000`, main = "Histogram of Barley Production from 1991-2000")
hist(Barley1$`2001-2010`, main = "Histogram of Barley Production from 2001-2010")

Visualisation - QQNorm

For visualisation we used qqnorm(), a generic function whose default method produces a normal QQ plot of the values in y. We call it once for 1991-2000 and once for 2001-2010. qqline() adds a reference line to a theoretical (by default normal) quantile-quantile plot; the line passes through the probs quantiles, by default the first and third quartiles.

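par(mfrow = c(1,2))
# qqnorm() always compares against the normal distribution, so no dist argument is needed
Barley1$`1991-2000` %>% qqnorm(main="Barley Production from 1991-2000")
qqline(Barley1$`1991-2000`)  # reference line through the first and third quartiles
Barley1$`2001-2010` %>% qqnorm(main="Barley Production from 2001-2010")
qqline(Barley1$`2001-2010`)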
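
To show what the quartile-based reference line amounts to, the sketch below computes its slope and intercept by hand, following the default behaviour of qqline() described above. The vector y is an invented, evenly spaced sample from a normal distribution with mean 10 and standard deviation 2, not the barley data.

```r
# Invented sample: evenly spaced quantiles of a normal(10, 2) distribution.
y <- qnorm(ppoints(50), mean = 10, sd = 2)

probs <- c(0.25, 0.75)                    # first and third quartiles
y_q   <- quantile(y, probs, names = FALSE) # sample quartiles
x_q   <- qnorm(probs)                      # theoretical normal quartiles

# Line through the two (theoretical, sample) quartile points.
slope     <- diff(y_q) / diff(x_q)
intercept <- y_q[1] - slope * x_q[1]

c(intercept = intercept, slope = slope)
```

For data that is close to normal, the intercept approximates the mean and the slope approximates the standard deviation; points that bend away from this line indicate skewness or heavy tails.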

Hypothesis Testing

For the hypothesis test we first checked normality with the shapiro.test() function, which runs the Shapiro-Wilk test, a widely used test for normality. We applied it to both decades, 1991-2000 and 2001-2010.

The p-values, 1.369e-08 for 1991-2000 and 1.522e-09 for 2001-2010, are far smaller than 0.05, so we reject normality: the production distribution in both decades differs significantly from a normal distribution. With 280 rows of data across the two decades, the samples are large, so we rely on the central limit theorem, under which the sampling distribution of the mean is approximately normal, and proceed with the t-test.

#Testing for normality

shapiro.test(Barley1$`1991-2000`)

## 
##  Shapiro-Wilk normality test
## 
## data:  Barley1$`1991-2000`
## W = 0.89326, p-value = 1.369e-08

shapiro.test(Barley1$`2001-2010`)

## 
##  Shapiro-Wilk normality test
## 
## data:  Barley1$`2001-2010`
## W = 0.87416, p-value = 1.522e-09
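
The direction of the decision is easy to get backwards, so here is a small sketch of how the Shapiro-Wilk p-value is read. The null hypothesis of the test is that the data is normal, so a small p-value is evidence against normality. Both vectors are invented: skewed is an evenly spaced exponential sample and bell is an evenly spaced normal sample.

```r
# Invented samples built from evenly spaced quantiles (no randomness).
skewed <- qexp(ppoints(100))   # strongly right-skewed
bell   <- qnorm(ppoints(100))  # close to perfectly normal

p_skewed <- shapiro.test(skewed)$p.value
p_bell   <- shapiro.test(bell)$p.value

alpha <- 0.05
# TRUE means normality is rejected at the 5% level.
c(skewed_rejects_normality = p_skewed < alpha,
  bell_rejects_normality   = p_bell < alpha)
```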

T Test

For the t-test we compare two variables, barley production in 1991-2000 (x) and in 2001-2010 (y). We set the alternative hypothesis to “less” because we want to show that the mean of x is less than the mean of y. We set the confidence level to 0.95 because our significance level is 0.05. By default t.test() runs Welch's version of the test, which does not assume equal variances in the two groups.

# t.test() drops NA values itself; it has no na.rm argument
test_results <- t.test(Barley1$`1991-2000`, Barley1$`2001-2010`,
        alternative = "less",
        conf.level = 0.95)
test_results

## 
##  Welch Two Sample t-test
## 
## data:  Barley1$`1991-2000` and Barley1$`2001-2010`
## t = -2.7824, df = 245.75, p-value = 0.002907
## alternative hypothesis: true difference in means is less than 0
## 95 percent confidence interval:
##       -Inf -96728.13
## sample estimates:
## mean of x mean of y 
##  596702.9  834597.9
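
To make the printed statistic less of a black box, the sketch below recomputes Welch's t, the Welch-Satterthwaite degrees of freedom and the lower-tail p-value by hand and compares them with t.test(). The vectors x and y are invented toy values, not the barley data.

```r
# Invented toy samples standing in for the two decades.
x <- c(12, 15, 11, 18, 14, 16)
y <- c(19, 22, 17, 25, 21, 20, 24)

# Squared standard error of each sample mean.
se2_x <- var(x) / length(x)
se2_y <- var(y) / length(y)

# Welch's t statistic: difference in means over its standard error.
t_manual <- (mean(x) - mean(y)) / sqrt(se2_x + se2_y)

# Welch-Satterthwaite approximation to the degrees of freedom.
df_manual <- (se2_x + se2_y)^2 /
  (se2_x^2 / (length(x) - 1) + se2_y^2 / (length(y) - 1))

# alternative = "less" uses the lower tail of the t distribution.
p_manual <- pt(t_manual, df_manual)

fit <- t.test(x, y, alternative = "less")
c(t = t_manual, df = df_manual, p = p_manual)
fit
```

A negative t with a small lower-tail p-value is what supports the claim that the first mean is smaller than the second.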

Hypothesis Testing Cont.

\[H_0: \mu_1 = \mu_2 \]

\[H_A: \mu_1 < \mu_2\]

Discussion

Having rejected the null hypothesis, we have statistical evidence in support of the alternative hypothesis. We can therefore say that barley production increased between these two decades. The increase could be due to various reasons:

-Development in the agricultural sector
-Improvement in climatic conditions
-More effective machinery over the years
-Increase in agricultural land

Limitations

The reasons given for the increase in production are possible explanations only; this test provides no statistical evidence for them. We are assuming that these general factors contributed to the increased production. A linear regression analysis could be used to test them. However, any factors not included in that analysis would still be assumptions.
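
As a rough idea of what such a regression could look like, the sketch below fits production against year with lm(). It is only a sketch under assumptions: the data frame toy is invented, and Year is used as the sole predictor because the data set contains no measurements of the suggested factors. A positive slope would describe a trend over time, not its cause.

```r
# Invented yearly totals with an upward trend plus a fixed wobble.
toy <- data.frame(
  Year       = 1991:2010,
  Production = 500 + 12 * (0:19) + rep(c(-20, 15, 5, -10), times = 5)
)

# Simple linear regression of production on year.
fit <- lm(Production ~ Year, data = toy)

slope <- coef(fit)[["Year"]]  # estimated change in production per year
slope

# p-value for the null hypothesis that the slope is zero.
summary(fit)$coefficients["Year", "Pr(>|t|)"]
```

Variables such as cultivated area or rainfall could be added to the formula in the same way if they were collected.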

FINAL CONCLUSION

To conclude, our investigation found that barley production increased from 1991-2000 to 2001-2010, which supports the alternative hypothesis.

Comparing Production of Barley over Decades

Assignment 3
