require(fpp) # the fpp package needs to be loaded once at the beginning of every R session
econsumption
    Mwh temp
1  16.3 29.3
2  16.8 21.7
3  15.5 23.7
4  18.2 10.4
5  15.2 29.7
6  17.5 11.9
7  19.8  9.0
8  19.0 23.4
9  17.5 17.8
10 16.0 30.0
11 19.6  8.6
12 18.0 11.8
Plotting the data and finding the regression model:
plot(Mwh ~ temp, data = econsumption)
(fit = lm(Mwh ~ temp, data = econsumption))
Call:
lm(formula = Mwh ~ temp, data = econsumption)

Coefficients:
(Intercept)         temp
    20.1995      -0.1452
# the parentheses around the whole expression make R print the result
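The statistical significance of the temperature coefficient can be inspected with standard R summaries; a minimal sketch (output not shown):
summary(fit)  # t-statistics, p-values and R-squared of the fitted model
confint(fit)  # 95% confidence intervals for the intercept and the slope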
There is a negative relationship between temperature and electricity consumption.
For the temperature range involved, and during the period when the data were observed, a lot of electricity was probably used for heating. Since low temperatures induce higher demand for heating, this produces a negative relationship between temperature and electricity consumption.
In addition, lower temperatures encourage people to stay indoors, where they naturally consume more electricity than when they are outside.
Residual plot:
plot(residuals(fit) ~ temp, data = econsumption)
abline(0, 0, col="grey")
# a horizontal line at zero is useful for assessment of the residuals
The residuals do not show obvious patterns or heteroscedasticity (i.e. non-constant variance), so the model can be considered adequate, although the small number of observations does not allow a conclusive assessment.
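Heteroscedasticity can also be checked formally with the Breusch–Pagan test from the lmtest package (loaded together with fpp); a minimal sketch, output not shown:
bptest(fit)  # studentized Breusch-Pagan test; a large p-value is consistent with constant variance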
There is at least one outlier. It is observation number 8: consumption 19.0 Mwh and temperature 23.4\(^\circ\).
There are no obvious influential observations, although the outlier is a good candidate for one.
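One way to locate the outlier programmatically is to pick the observation with the largest absolute residual; a minimal sketch:
i = which.max(abs(residuals(fit)))  # index of the largest absolute residual
econsumption[i, ]                   # observation 8: Mwh = 19.0, temp = 23.4
residuals(fit)[i]                   # its residual, about +2.2 Mwh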
Prediction of the electricity consumption for a day with maximum temperature 10\(^\circ\) and a day with maximum temperature 35\(^\circ\):
forecast(fit, newdata=data.frame(temp=c(10,35)))
  Point Forecast    Lo 80    Hi 80    Lo 95    Hi 95
1       18.74795 17.27010 20.22579 16.34824 21.14766
2       15.11902 13.50469 16.73335 12.49768 17.74035
Plot of the training set and the forecast:
plot(forecast(fit, newdata=data.frame(temp=c(10,35))))
The forecast is more believable for temperature 10\(^\circ\) than for 35\(^\circ\), because the second value lies outside the range of temperatures observed in the training set.
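As a quick sanity check, the point forecasts can be reproduced by hand from the estimated coefficients:
coef(fit)[1] + coef(fit)[2] * c(10, 35)
# 20.1995 - 0.1452 * 10 = 18.75 Mwh and 20.1995 - 0.1452 * 35 = 15.12 Mwh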
\newpage
The original data set:
olympic
   Year  time
1  1896 54.20
2  1900 49.40
3  1904 49.20
4  1908 50.00
5  1912 48.20
6  1920 49.60
7  1924 47.60
8  1928 47.80
9  1932 46.20
10 1936 46.50
11 1948 46.20
12 1952 45.90
13 1956 46.70
14 1960 44.90
15 1964 45.10
16 1968 43.80
17 1972 44.66
18 1976 44.26
19 1980 44.60
20 1984 44.27
21 1988 43.87
22 1992 43.50
23 1996 43.49
To update the dataset with the winning times from the last few Olympics, the results for the 2000–2012 Games are added:
olympic = rbind(olympic, data.frame(Year = c(2000, 2004, 2008, 2012),
                                    time = c(43.84, 44.00, 43.75, 43.94)))
tail(olympic)
   Year  time
22 1992 43.50
23 1996 43.49
24 2000 43.84
25 2004 44.00
26 2008 43.75
27 2012 43.94
Plot of the updated dataset with the winning times from the last few Olympics:
plot(time ~ Year, data = olympic)
The scatterplot shows a linear relationship between Year and time for the years from 1900 to 1976. The data from 1980 to 2012 show that this linear relationship no longer holds. The year 1896 is an outlier.
Plot of the data with the fitted regression line:
(fit = lm(time ~ Year, data = olympic))
Call:
lm(formula = time ~ Year, data = olympic)

Coefficients:
(Intercept)         Year
  174.95440     -0.06585
plot(time ~ Year, data = olympic)
abline(fit, col = "red")  # abline() extracts the intercept and slope from the fitted model
The winning times have been decreasing at an average rate of about 0.066 seconds per year.
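Since the Summer Olympics are held every four years, the slope can also be expressed per Games; a small check:
coef(fit)["Year"]      # about -0.066 seconds per year
4 * coef(fit)["Year"]  # about -0.26 seconds per Games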
Plot of the residuals against the year:
plot(fit$residuals ~ Year, data = olympic)
abline(h = 0, col = "grey")
The fitted line might be suitable for the period from 1990 to 2000. The data contain an outlier at the year 1896 and a structural break, or the onset of nonlinearity, somewhere around the years 1980–2000.
Prediction of the winning times for the men's 400 meters final in the 2000, 2004, 2008 and 2012 Olympics:
trainingSet = olympic[1:23,]
testSet = olympic[24:27,]
fitOverTheTrainingSet = lm(time ~ Year, data = trainingSet)
plot(forecast(fitOverTheTrainingSet, newdata = data.frame(Year = testSet$Year)))
# naming the newdata column avoids a warning from forecast.lm about unspecified column names
lines(time ~ Year, data = testSet, col = "red", type = "p")
The above calculations assume that the men's 400 meters Olympic final results change on average at a constant rate. The forecasts are not very good, although the actual results lie inside the 95% prediction intervals.
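The forecast errors on the test set can be quantified with accuracy() from the forecast package (loaded with fpp); a minimal sketch, output not shown:
fc = forecast(fitOverTheTrainingSet, newdata = data.frame(Year = testSet$Year))
accuracy(fc, testSet$time)  # test-set error measures such as RMSE and MAE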
\newpage
The log-log model is written as: \(\log(y) = \beta_0 + \beta_1 \log(x) + \varepsilon\).
Taking the conditional expectation of both sides of the equation, and using \(\mathrm{E}(\varepsilon \mid x) = 0\):
\(\mathrm{E}(\log(y)\mid x) = \mathrm{E}(\beta_0 + \beta_1 \log(x) + \varepsilon\mid x) = \beta_0 + \beta_1\log(x)\).
Differentiating both sides of the last equation with respect to \(x\), and using \(\frac{d}{dx}\log(y) = \frac{y'}{y}\) and \(\frac{d}{dx}\log(x) = \frac{1}{x}\), we get:
\(\frac{y'}{y} = \frac{\beta_1}{x}\), and then \(\beta_1 = \frac{y' x}{y}\).
This is exactly what we needed to prove: since \(y' = \frac{dy}{dx}\), we have \(\beta_1 = \frac{dy}{dx}\cdot\frac{x}{y} = \frac{dy/y}{dx/x}\), i.e. \(\beta_1\) is the elasticity of \(y\) with respect to \(x\).
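The elasticity interpretation can also be verified numerically by simulating from a log-log model with a known coefficient; an illustrative sketch:
set.seed(42)
x = runif(1000, 1, 100)                            # simulated regressor
y = exp(2 + 0.5 * log(x) + rnorm(1000, sd = 0.1))  # true elasticity is 0.5
coef(lm(log(y) ~ log(x)))["log(x)"]                # estimate should be close to 0.5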