Tourism spending. The Association of Turkish Travel Agencies reports the number of foreign tourists visiting Turkey and tourist spending by year. Three plots are provided: scatterplot showing the relationship between these two variables along with the least squares fit, residuals plot, and histogram of residuals.
sourced: http://www.stat.wmich.edu/s216/book/node126.html
The least squares regression line is represented by the equation…
\[ \mbox{PREDICTED } Y = a + b X \]
…where the slope b and intercept a are calculated in the following order:
\[\begin{array}{ll} b = r \frac{ \mbox{SD}_Y}{\mbox{SD}_X}, & \;\;a = \overline{Y} - b \overline{X} \end{array}\]
So, to get started, I’ll walk through this manually, and then show you the R shortcut. First, we can load the data…
tourism <- read.csv("https://raw.githubusercontent.com/RobertSellers/R/master/data/tourism.csv")
…and then calculate the standard deviation and mean of the Y variable (tourist spending) and the X variable (visitor count).
SDy<- sd(tourism$tourist_spending)
SDx<- sd(tourism$visitor_count_tho)
MEANY<- mean(tourism$tourist_spending)
MEANX<- mean(tourism$visitor_count_tho)
\[\mbox{SD}_Y = 4877.672\] \[\mbox{SD}_X = 7356.59\] \[\overline{Y} = 3825.596\] \[\overline{X} = 6371.766\]
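Pearson’s r is a measure of the strength and direction of the linear relationship between two variables. Notice that the x and y in the formula are lowercase: they denote deviations from the means, so x is X minus the mean of X, and y is Y minus the mean of Y.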
\[ r = \frac{\Sigma{xy}}{\sqrt{\Sigma{x^2}\Sigma{y^2}}} \]
We must therefore calculate x, y, xy, x-squared, and y-squared from the raw X and Y values…
tourism$x <- tourism$visitor_count_tho - MEANX
tourism$y <- tourism$tourist_spending- MEANY
tourism$xy <- tourism$x*tourism$y
SIGMAxy<- sum(as.numeric(tourism$xy))
SIGMAxsquared<-sum(tourism$x^2)
SIGMAysquared<-sum(tourism$y^2)
head(tourism)
## year visitor_count_tho tourist_spending x y xy
## 1 1963 198 7 -6173.766 -3818.596 23575116
## 2 1964 229 8 -6142.766 -3817.596 23450597
## 3 1965 361 13 -6010.766 -3812.596 22916621
## 4 1966 449 12 -5922.766 -3813.596 22587035
## 5 1967 574 13 -5797.766 -3812.596 22104538
## 6 1968 602 24 -5769.766 -3801.596 21934318
…and then populate the Pearson’s R equation with our values.
\[ \Sigma{xy} = 1640659327 \] \[ \Sigma{x^2} = 2489493284 \] \[ \Sigma{y^2} = 1094417617 \]
\[ r = \frac{\Sigma{xy}}{\sqrt{\Sigma{x^2}\Sigma{y^2}}} = \frac{1640659327}{\sqrt{2489493284 \times 1094417617}} \]
r <- SIGMAxy/sqrt(SIGMAxsquared*SIGMAysquared)
An r value close to +1 signifies a strong positive linear correlation between the two variables.
\[ r = 0.9939657 \]
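As a cross-check on the hand calculation, base R's cor() computes Pearson's r directly. The sketch below is only an illustration: it uses just the six rows printed by head(tourism) above, and `sample6`, `r_manual`, and `r_builtin` are names introduced here. Its r therefore differs from the full-data value; the point is only that the manual formula and cor() agree.

```r
# First six rows of the data, copied from the head(tourism) output above
sample6 <- data.frame(
  visitor_count_tho = c(198, 229, 361, 449, 574, 602),
  tourist_spending  = c(7, 8, 13, 12, 13, 24)
)

# Deviations from the sample means (the lowercase x and y in the formula)
x <- sample6$visitor_count_tho - mean(sample6$visitor_count_tho)
y <- sample6$tourist_spending - mean(sample6$tourist_spending)

# Manual Pearson's r, following the formula above
r_manual <- sum(x * y) / sqrt(sum(x^2) * sum(y^2))

# Built-in equivalent
r_builtin <- cor(sample6$visitor_count_tho, sample6$tourist_spending)

# TRUE when the two calculations match
all.equal(r_manual, r_builtin)
```

On the full data set, the same comparison would be cor(tourism$visitor_count_tho, tourism$tourist_spending) against the r computed above.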
Alternatively, we can visualize this relationship in R by plotting both variables against year on a single chart, with spending on the left axis and the number of tourists on the right.
par(mar = c(5, 5, 2, 5))  # widen the right margin to make room for a second y-axis
with(tourism, plot(tourist_spending ~ year, type = "l", col = "red3", ylab = "Spending (in million $)", xlab = "Year"))
par(new = TRUE)  # draw the next plot on top of the current one
with(tourism, plot(visitor_count_tho ~ year, axes = FALSE, xlab = NA, ylab = NA, type = "l", col = "blue"))
axis(side = 4)  # right-hand axis for the visitor counts
mtext(side = 4, line = 3, "Number of Tourists")
We can safely answer the first two questions at this point.
There is a strong positive correlation.
Explanatory variable: number of tourists
Response variable: spending
In other words, spending is hypothesized to be the result of the number of tourists, but not vice versa.
We now have each of the values needed to compute the slope and intercept of the linear regression model. We will calculate both in R.
\[\begin{array}{ll} b = r \frac{ \mbox{SD}_Y}{\mbox{SD}_X}, & \;\;a = \overline{Y} - b \overline{X} \end{array}\]
b <- r*(SDy/SDx)
a <- MEANY - (b*MEANX)
\[ \mbox{PREDICTED } Y = a + b X \] \[ \mbox{PREDICTED } Y = -373.6111 + 0.6590334 X \]
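As a quick arithmetic check, substituting the rounded values reported earlier into the two formulas reproduces these coefficients up to rounding. The means used here are the ones implied by the deviation columns in the head(tourism) output, namely a mean of 6371.766 for X and 3825.596 for Y.

\[ b = 0.9939657 \times \frac{4877.672}{7356.59} \approx 0.6590 \]
\[ a = 3825.596 - 0.6590334 \times 6371.766 \approx -373.61 \]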
Or we could just run R’s lm() function. The coef() function will then extract these exact values for you.
tourism_spending_regression<-lm(tourism$tourist_spending~tourism$visitor_count_tho)
coef(tourism_spending_regression)
## (Intercept) tourism$visitor_count_tho
## -373.6111020 0.6590334
From this we can reproduce the first scatterplot, with the least squares fit, from question 7.23.
frame()
smoothScatter(tourism$visitor_count_tho,tourism$tourist_spending, pch=21,
ylab="Spending (in million $)", main = "Least Squares Fitting", xlab="Number of Tourists (in thousands)")
abline(lm(tourism$tourist_spending~tourism$visitor_count_tho), col="red")
The Residuals Plot, with added years
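library(maptools)  # provides pointLabel() for non-overlapping point labels
tourism.lm <- lm(tourism$tourist_spending~tourism$visitor_count_tho)
tourism.res <- resid(tourism.lm)  # residuals: observed minus fitted spending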
plot(tourism$visitor_count_tho,tourism.res, xlab="Number of Tourists (in thousands)", ylab="Residuals" ,main = "Spending vs Tourism Residual Plot",ylim= c(-1500,1500), pch="+", col="blue")
pointLabel(tourism$visitor_count_tho,tourism.res,labels=as.character(tourism$year),cex=0.5)
abline(h=0, col = "gray60")
Residuals Histogram
hist(tourism.res, col="blue", xlab="Residuals", main="Histogram of Residuals", breaks=10, xlim=c(-2000,1500))
Remaining Questions
Forecasting revenue from tourism can help influence the allocation of tourism funding.
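As a rough illustration of such a forecast, the fitted line can be evaluated at a chosen visitor count. The sketch below hard-codes the coefficients reported by coef() above; `predict_spending` is a hypothetical helper introduced here, and the input of 10000 (thousand visitors) is an arbitrary example, not a figure taken from the data.

```r
# Coefficients as reported by coef() above
a <- -373.6111020  # intercept
b <- 0.6590334     # slope: change in spending per additional thousand visitors

# Hypothetical helper: predicted spending for a visitor count given in thousands
predict_spending <- function(visitors_tho) {
  a + b * visitors_tho
}

# Arbitrary illustrative input, in thousands of visitors
predict_spending(10000)
```

The result is in the units of the tourist_spending column, and it is only as trustworthy as the linear model is within the observed range of visitor counts.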
Each of the plots provided helps us see the relationship from a different perspective. The first scatterplot shows the positive linear relationship, while the residuals plot shows how the observations scatter around the fitted line.
Because the residuals plot removes the fitted trend, its vertical scale is not constrained by the response variable, spending. This emphasizes the difference between the observed and predicted values, and can help us recognize outliers. Supporting this outlier analysis, the histogram gives the frequency distribution of the residuals. The high r value further supports a linear rather than curved fit.
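The residuals discussed here are simply the observed values minus the fitted values. The sketch below shows that relationship on the six rows printed by head(tourism) earlier, so the numbers are illustrative rather than the full-data residuals; `sample6`, `fit`, and `res_manual` are names introduced here.

```r
# First six rows of the data, copied from the head(tourism) output
sample6 <- data.frame(
  visitor_count_tho = c(198, 229, 361, 449, 574, 602),
  tourist_spending  = c(7, 8, 13, 12, 13, 24)
)

# Least squares fit on the six-row sample
fit <- lm(tourist_spending ~ visitor_count_tho, data = sample6)

# Residual = observed value minus fitted value
res_manual <- sample6$tourist_spending - fitted(fit)

# TRUE when the manual residuals match resid()
all.equal(unname(res_manual), unname(resid(fit)))

# With an intercept in the model, the residuals sum to (numerically) zero
sum(res_manual)
```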
We can conclude by agreeing that the data is suitable for this regression method, and that given similar circumstances, this is a good basis to keep in mind when doing further statistical analysis.