PREDICTING THE WEATHER STABILITY OF AUSTRALIA
Intro to Statistics : ASSIGNMENT 3
RAFEED SULTAAN (s3763175), VISHAL BENIWAL (s3759790), JEWEL JAMES (s3763905)
Last updated: 02 June, 2019
Introduction
Changing weather conditions affect the daily lives of people living in Australia. When the weather changes, many people catch common illnesses such as colds and flu.
A common belief among residents of Australia is that the weather changes drastically in a matter of seconds, and that weather conditions are therefore not stable. Most people who make this claim about unpredictability, however, offer no evidence for it. It is unclear whether the episodes of horrible weather are merely extreme cases.
In short, we cannot claim that the weather in Australia is unpredictable without statistical evidence to support it.
Problem Statement
The problem is that residents of Australia claim that weather conditions such as temperature and wind speed change drastically within a short period. If we can settle this question, all Australian residents, rather than only a few, may pay closer attention to the weather report.
Our main investigation checks whether the factors that determine weather conditions (temperature, pressure, humidity and wind speed) stay the same between 9 am and 3 pm on the same day. To answer this, we use a paired-samples t-test for each factor to see whether its mean at 3 pm differs statistically significantly from its mean at 9 am.
Our secondary investigation checks whether the weather on two consecutive days is related in terms of rain. To answer this, we use a chi-square test of association to check whether rain today and rain tomorrow have a statistically significant association.
Data
The dataset, called “Rain in Australia”, was downloaded from Kaggle. It was donated by Joe Young. The observations were drawn from several weather stations in Australia and collected by the Bureau of Meteorology of the Australian government. The dataset is licensed for study and research purposes.
The dataset has many variables, but we narrowed them down to the ones used in this investigation: Humidity9am, Humidity3pm, Temp9am, Temp3pm, Pressure9am, Pressure3pm, WindSpeed9am, WindSpeed3pm, RainToday, and RainTomorrow.
The variables Humidity9am, Humidity3pm, Temp9am, Temp3pm, Pressure9am, Pressure3pm, WindSpeed9am, and WindSpeed3pm are continuous. The variables RainToday and RainTomorrow are categorical.
Since the selected continuous variables were approximately normally distributed, the standard z-score method was an appropriate choice for removing their outliers. About 95% of a normally distributed variable lies within 1.96 standard deviations of its mean, so every observation with a z-score outside ±1.96 was dropped from the dataset.
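The z-score rule described above can be sketched as follows. This is a minimal illustration in Python using only the standard library, not the code used for this report; the function name `drop_outliers_by_zscore` and the sample readings are hypothetical, and the sketch assumes the rows contain no missing values.

```python
import statistics

def drop_outliers_by_zscore(rows, columns, z_limit=1.96):
    """Keep rows whose value in every listed column lies within
    z_limit sample standard deviations of that column's mean."""
    # Mean and standard deviation are computed once, on the unfiltered data.
    stats = {}
    for col in columns:
        values = [row[col] for row in rows]
        stats[col] = (statistics.mean(values), statistics.stdev(values))

    kept = []
    for row in rows:
        inside = all(
            abs(row[col] - stats[col][0]) <= z_limit * stats[col][1]
            for col in columns
        )
        if inside:
            kept.append(row)
    return kept

# Hypothetical readings: ten ordinary values and one extreme value.
readings = [{"Temp9am": t} for t in [15, 16, 17, 16, 15, 17, 16, 15, 17, 16, 60]]
cleaned = drop_outliers_by_zscore(readings, ["Temp9am"])
print(len(readings), "rows before,", len(cleaned), "rows after")
```

Note that a single extreme value inflates the standard deviation it is judged against, so the cut-off depends on the outliers themselves; the sketch makes no attempt to correct for that.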
The columns Evaporation, Sunshine, Cloud9am, and Cloud3pm were dropped because they contained only null values in this weather dataset. Since these variables were not selected for this investigation, it was safe to drop them. Finally, we dropped every row that had a null value in any of the remaining variables, because we do not have enough domain knowledge to replace nulls with meaningful values.
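The two clean-up steps above (dropping unused columns, then dropping incomplete rows) might look like this in outline. It is a sketch under stated assumptions: rows are represented as plain dictionaries, nulls as `None`, and the helper names and sample values are hypothetical.

```python
def drop_columns(rows, columns):
    """Return copies of the rows without the named columns."""
    return [{k: v for k, v in row.items() if k not in columns} for row in rows]

def drop_incomplete_rows(rows):
    """Keep only rows in which every remaining value is present."""
    return [row for row in rows if all(v is not None for v in row.values())]

# Hypothetical rows; None stands in for a null value.
rows = [
    {"Temp9am": 16.8, "Temp3pm": 21.5, "Sunshine": None},
    {"Temp9am": None, "Temp3pm": 19.0, "Sunshine": None},
    {"Temp9am": 12.8, "Temp3pm": 17.3, "Sunshine": None},
]

trimmed = drop_columns(rows, {"Sunshine"})
complete = drop_incomplete_rows(trimmed)
print(len(complete), "complete rows remain")
```

Dropping the empty columns first matters: if the order were reversed, every row would be discarded because each one has a null in `Sunshine`.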
Data (contd.)
The variables are explained below:
- Humidity9am: Humidity (percent) at 9am
- Humidity3pm: Humidity (percent) at 3pm
- Temp9am: Temperature (degrees C) at 9am
- Temp3pm: Temperature (degrees C) at 3pm
- Pressure9am: Atmospheric pressure (hPa) reduced to mean sea level at 9am
- Pressure3pm: Atmospheric pressure (hPa) reduced to mean sea level at 3pm
- WindSpeed9am: Wind speed (km/hr) averaged over 10 minutes prior to 9am
- WindSpeed3pm: Wind speed (km/hr) averaged over 10 minutes prior to 3pm
- RainToday: Boolean: 1 if precipitation (mm) in the 24 hours to 9am exceeds 1mm, otherwise 0
- RainTomorrow: Boolean: whether it rains on the following day
Descriptive Statistics | WindSpeed
- The mean of WindSpeed9am (14) is less than the mean of WindSpeed3pm (18.5).
- The median of WindSpeed9am (13) is less than the median of WindSpeed3pm (19). Half of the 9 am readings are at or below 13 km/hr, whereas half of the 3 pm readings are at or below 19 km/hr.
- The standard deviation of WindSpeed9am (6.99) is less than that of WindSpeed3pm (7.27), so the 9 am readings are slightly less spread out than the 3 pm readings.
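## [1] "WindSpeed at 9 AM"
| Min | Q1 | Median | Q3 | IQR | Max | Range | Mean | SD |
|---|---|---|---|---|---|---|---|---|
| 2 | 9 | 13 | 19 | 10 | 31 | 29 | 14 | 6.99 |
## [1] "WindSpeed at 3 PM"
| Min | Q1 | Median | Q3 | IQR | Max | Range | Mean | SD |
|---|---|---|---|---|---|---|---|---|
| 2 | 13 | 19 | 24 | 11 | 35 | 33 | 18.5 | 7.27 |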
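A summary of this kind (minimum, quartiles, interquartile range, maximum, range, mean and standard deviation) can be computed as in the sketch below. It uses Python's standard `statistics` module; the function name `summarise` and the sample values are hypothetical, and the "inclusive" quantile method is an assumption, so quartiles may differ slightly from those produced by other software.

```python
import statistics

def summarise(values):
    """Return the descriptive statistics used in the summary tables."""
    # quantiles(n=4) returns the three cut points Q1, median and Q3.
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    low, high = min(values), max(values)
    return {
        "Min": low,
        "Q1": q1,
        "Median": median,
        "Q3": q3,
        "IQR": q3 - q1,
        "Max": high,
        "Range": high - low,
        "Mean": statistics.mean(values),
        "SD": statistics.stdev(values),  # sample standard deviation
    }

# Hypothetical wind-speed readings in km/hr.
summary = summarise([2, 4, 6, 8, 10])
for name, value in summary.items():
    print(f"{name}: {value}")
```

Calling `summarise` once per variable and time of day would give one table row per call.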
Descriptive Statistics | Humidity
- The mean of Humidity9am (69.3) is greater than the mean of Humidity3pm (51.4).
- The median of Humidity9am (69) is greater than the median of Humidity3pm (52). Half of the 9 am readings are at or above 69%, whereas half of the 3 pm readings are at or above 52%.
- The standard deviation of Humidity9am (15.81) is less than that of Humidity3pm (18.06), so the 9 am readings are less spread out than the 3 pm readings.
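## [1] "Humidity at 9 AM"
| Min | Q1 | Median | Q3 | IQR | Max | Range | Mean | SD |
|---|---|---|---|---|---|---|---|---|
| 32 | 58 | 69 | 81 | 23 | 100 | 68 | 69.3 | 15.81 |
## [1] "Humidity at 3 PM"
| Min | Q1 | Median | Q3 | IQR | Max | Range | Mean | SD |
|---|---|---|---|---|---|---|---|---|
| 11 | 38 | 52 | 64 | 26 | 92 | 81 | 51.4 | 18.06 |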
Descriptive Statistics | Temperature
- The mean of Temp9am (17.1) is less than the mean of Temp3pm (21.7).
- The median of Temp9am (16.8) is less than the median of Temp3pm (21.5). Half of the 9 am readings are at or below 16.8 degrees C, whereas half of the 3 pm readings are at or below 21.5 degrees C.
- The standard deviation of Temp9am (5.66) is less than that of Temp3pm (5.88), so the 9 am readings are slightly less spread out than the 3 pm readings.
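## [1] "Temperature at 9 AM"
| Min | Q1 | Median | Q3 | IQR | Max | Range | Mean | SD |
|---|---|---|---|---|---|---|---|---|
| 4.3 | 12.8 | 16.8 | 21.2 | 8.4 | 29.7 | 25.4 | 17.1 | 5.66 |
## [1] "Temperature at 3 PM"
| Min | Q1 | Median | Q3 | IQR | Max | Range | Mean | SD |
|---|---|---|---|---|---|---|---|---|
| 8.1 | 17.3 | 21.5 | 26.1 | 8.8 | 35.2 | 27.1 | 21.7 | 5.88 |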
Descriptive Statistics | Pressure
- The mean of Pressure9am (1018.1) is slightly greater than the mean of Pressure3pm (1015.7).
- The median of Pressure9am (1018.1) is slightly greater than the median of Pressure3pm (1015.7). The middle of the 9 am readings therefore sits about 2.4 hPa above the middle of the 3 pm readings.
- The standard deviation of Pressure9am (5.84) is approximately equal to that of Pressure3pm (5.78), so the spread of the readings is about the same at both times.
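## [1] "Pressure at 9 AM"
| Min | Q1 | Median | Q3 | IQR | Max | Range | Mean | SD |
|---|---|---|---|---|---|---|---|---|
| 1003.8 | 1013.9 | 1018.1 | 1022.3 | 8.4 | 1031.5 | 27.7 | 1018.1 | 5.84 |
## [1] "Pressure at 3 PM"
| Min | Q1 | Median | Q3 | IQR | Max | Range | Mean | SD |
|---|---|---|---|---|---|---|---|---|
| 1001.5 | 1011.5 | 1015.7 | 1019.9 | 8.4 | 1029 | 27.5 | 1015.7 | 5.78 |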
Visualization | RainFall
We used pie charts to describe the categorical variables RainToday and RainTomorrow. The two pie charts are nearly identical, with almost the same proportions in each category.


Hypothesis Testing
- Weather conditions depend on several factors, such as wind speed, humidity, pressure, temperature, and rainfall. We therefore conducted a hypothesis test on each factor to determine whether the change in Australian weather conditions during a day is statistically significant. Because the data were recorded at two times of day (9 am and 3 pm) at the same place, a paired-samples t-test was conducted for each factor except rainfall.
- The first paired-samples t-test checked whether there was a significant difference between WindSpeed at 9 am and WindSpeed at 3 pm. The null hypothesis (H0) of the first test is that wind speed remains unchanged during the day. The alternative hypothesis (HA) of the first test is that wind speed changes with time. Here \(\mu_1\) and \(\mu_2\) denote the mean at 3 pm and the mean at 9 am respectively.
\[H_0: \mu_1 - \mu_2 = 0\]
\[H_A: \mu_1 - \mu_2 \ne 0\]
##
## Paired t-test
##
## data: weatherAUS$WindSpeed3pm and weatherAUS$WindSpeed9am
## t = 173.52, df = 89490, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 4.398652 4.499159
## sample estimates:
## mean of the differences
## 4.448905
- Results: The mean difference was 4.45. The paired-samples t-test found a statistically significant mean difference between wind speed at 9 am and 3 pm, t(df = 89490) = 173.52, p < 0.001, 95% CI [4.398652, 4.499159]. Wind speed increased significantly from 9 am to 3 pm.
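The output above appears to come from R's `t.test`. A paired-samples t-test reduces to a one-sample test on the per-day differences, which the following sketch makes explicit. It is a Python illustration using only the standard library, not the code used for this report. The function name and sample readings are hypothetical, and the p-value and confidence interval use a normal approximation to the t distribution, which is an assumption that is reasonable only for large samples.

```python
import math
import statistics

def paired_t_test(after, before, confidence=0.95):
    """Paired-samples t statistic for the differences after - before."""
    if len(after) != len(before):
        raise ValueError("paired samples must have the same length")
    diffs = [a - b for a, b in zip(after, before)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    std_err = statistics.stdev(diffs) / math.sqrt(n)
    t_stat = mean_diff / std_err

    # Assumption: normal approximation instead of the exact t distribution.
    z = statistics.NormalDist().inv_cdf(0.5 + confidence / 2)
    ci = (mean_diff - z * std_err, mean_diff + z * std_err)
    p_two_sided = math.erfc(abs(t_stat) / math.sqrt(2))

    return {"t": t_stat, "df": n - 1, "mean_diff": mean_diff,
            "ci": ci, "p_approx": p_two_sided}

# Hypothetical wind speeds (km/hr) for six days at 3 pm and 9 am.
wind_3pm = [19, 24, 13, 20, 22, 17]
wind_9am = [13, 19, 9, 17, 15, 14]
result = paired_t_test(wind_3pm, wind_9am)
print(result)
```

With roughly 89,000 pairs, as in this report, the normal approximation is practically indistinguishable from the exact t distribution; for the six-pair example it is only a rough guide.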
Hypothesis Testing Cont.
- The second paired-samples t-test checked whether there was a significant difference between Humidity at 9 am and Humidity at 3 pm. The null hypothesis (H0) of the second test is that humidity remains unchanged during the day. The alternative hypothesis (HA) of the second test is that humidity changes with time. Based on the results, the null hypothesis of the second test was rejected.
\[H_0: \mu_1 - \mu_2 =0\]
\[H_A: \mu_1 - \mu_2 \ne 0\]
##
## Paired t-test
##
## data: weatherAUS$Humidity3pm and weatherAUS$Humidity9am
## t = -352.78, df = 89490, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -18.01054 -17.81152
## sample estimates:
## mean of the differences
## -17.91103
- Results: The mean difference was -17.91. The paired-samples t-test found a statistically significant mean difference between humidity at 9 am and 3 pm, t(df = 89490) = -352.78, p < 0.001, 95% CI [-18.01054, -17.81152]. Humidity decreased significantly from 9 am to 3 pm.
Hypothesis Testing Cont.
- The third paired-samples t-test checked whether there was a significant difference between Pressure at 9 am and Pressure at 3 pm. The null hypothesis (H0) of the third test is that pressure remains unchanged during the day. The alternative hypothesis (HA) of the third test is that pressure changes with time. Based on the results, the null hypothesis of the third test was rejected.
\[H_0: \mu_1 - \mu_2 =0\]
\[H_A: \mu_1 - \mu_2 \ne 0\]
##
## Paired t-test
##
## data: weatherAUS$Pressure3pm and weatherAUS$Pressure9am
## t = -388.22, df = 89490, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -2.384078 -2.360126
## sample estimates:
## mean of the differences
## -2.372102
- Results: The mean difference was -2.37. The paired-samples t-test found a statistically significant mean difference between pressure at 9 am and 3 pm, t(df = 89490) = -388.22, p < 0.001, 95% CI [-2.384078, -2.360126]. Pressure decreased significantly from 9 am to 3 pm.
Hypothesis Testing Cont.
- The fourth paired-samples t-test checked whether there was a significant difference between Temperature at 9 am and Temperature at 3 pm. The null hypothesis (H0) of the fourth test is that temperature remains unchanged during the day. The alternative hypothesis (HA) of the fourth test is that temperature changes with time. Based on the results, the null hypothesis of the fourth test was rejected.
\[H_0: \mu_1 - \mu_2 =0\]
\[H_A: \mu_1 - \mu_2 \ne 0\]
##
## Paired t-test
##
## data: weatherAUS$Temp3pm and weatherAUS$Temp9am
## t = 424.46, df = 89490, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 4.662130 4.705386
## sample estimates:
## mean of the differences
## 4.683758
- Results: The mean difference was 4.68. The paired-samples t-test found a statistically significant mean difference between temperature at 9 am and 3 pm, t(df = 89490) = 424.46, p < 0.001, 95% CI [4.662130, 4.705386]. Temperature increased significantly from 9 am to 3 pm.
Hypothesis Testing Cont.
- A chi-square test of association was used to test for a statistically significant association between rain today and rain tomorrow. The null hypothesis (H0) is that there is no association between rain on two consecutive days (independence). The alternative hypothesis (HA) is that there is an association between rain on two consecutive days (dependence). Based on the results, the null hypothesis of the chi-square test was rejected.
\[H_0: \text{There is no association of rain between two consecutive days (independence)}\]
\[H_A: \text{There is an association of rain between two consecutive days (dependence)}\]
- Results: The chi-square test of association found a statistically significant association between rain today and rain tomorrow, χ2 = 6850.5, p < .001. The results of this study suggest that rain on a particular day increases the chance of rain on the next day.
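For a 2×2 table such as RainToday by RainTomorrow, the chi-square statistic compares each observed count with the count expected under independence. The sketch below is a Python illustration using only the standard library. The function name and the counts are hypothetical and are not the report's data. It applies no continuity correction, whereas R's `chisq.test` applies Yates' correction to 2×2 tables by default, so the two can give slightly different values.

```python
import math

def chi_square_2x2(table):
    """Pearson chi-square test of association for a 2x2 table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)

    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if the row and column variables were independent.
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (observed - expected) ** 2 / expected

    # With 1 degree of freedom, P(X > chi2) = erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

# Hypothetical counts. Rows: no rain today, rain today.
# Columns: no rain tomorrow, rain tomorrow.
counts = [[60, 15],
          [15, 10]]
chi2, p_value = chi_square_2x2(counts)
print(f"chi-square = {chi2:.2f}, p = {p_value:.4f}")
```

The statistic alone shows that an association exists, not its direction; comparing the share of rainy tomorrows after rainy and after dry days (here 10/25 against 15/75) shows which way it runs.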
Discussion
- In this project we tested whether the factors that determine weather conditions (humidity, temperature, pressure, and wind speed) change statistically significantly over a 6-hour span within a day. Fluctuation in these factors indicates how stable the Australian weather is. We also checked whether the weather on two consecutive days is related in terms of rainfall.
- The first step of our analysis was pre-processing, that is, the transformations applied to the data before any further computation. The major difficulty in this phase was the number of null values in the dataset. Because we lack the domain knowledge to impute meaningful values, we dropped the columns that were unnecessary for our research goal and dominated by nulls, and we also dropped the rows containing nulls, to obtain a clean and reliable dataset.
- The next challenge was outliers. Because the period under consideration, 9 am to 3 pm, is short, even a few extreme values can distort the estimated change. To address this, we kept only the values within 1.96 standard deviations of the mean, the range that covers about 95% of a normal distribution. Four factors were analysed: temperature, pressure, humidity, and wind speed.
- The paired-samples t-tests rejected the null hypothesis for every chosen factor (wind speed, humidity, pressure, and temperature), which supports the alternative hypothesis that each factor changes between 9 am and 3 pm. The chi-square test showed an association between rain on consecutive days: rain on a given day increases the chance of rain on the next day. We therefore conclude, with statistically significant evidence, that weather conditions in Australia change within a short time span, and that people in Australia should check the weather report before leaving the house.