7.23

Tourism spending. The Association of Turkish Travel Agencies reports the number of foreign tourists visiting Turkey and tourist spending by year. Three plots are provided: a scatterplot showing the relationship between these two variables along with the least squares fit, a residuals plot, and a histogram of the residuals.

Calculating Least Squares Regression Line

Source: http://www.stat.wmich.edu/s216/book/node126.html

The least squares regression line is represented by the equation…

\[ \mbox{PREDICTED } Y = a + b X \]

…where the slope b and intercept a are calculated in the following order:

\[\begin{array}{ll} b = r \frac{ \mbox{SD}_Y}{\mbox{SD}_X}, & \;\;a = \overline{Y} - b \overline{X} \end{array}\]
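Before applying these formulas to the tourism data, it can help to confirm them on a tiny example. The sketch below uses made-up toy vectors (`toy_X` and `toy_Y` are hypothetical, not taken from the tourism file) and checks that the slope and intercept from the formulas agree with what `lm()` returns. It relies on `cor()` for r, which is derived by hand further on.

```r
# Hypothetical toy data, used only to check the formulas (not the tourism data).
toy_X <- c(1, 2, 3, 4, 5)
toy_Y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

toy_r <- cor(toy_X, toy_Y)                 # Pearson correlation
toy_b <- toy_r * sd(toy_Y) / sd(toy_X)     # slope: b = r * SD_Y / SD_X
toy_a <- mean(toy_Y) - toy_b * mean(toy_X) # intercept: a = mean(Y) - b * mean(X)

# Compare against R's built-in least squares fit.
manual_coefs <- c(toy_a, toy_b)
lm_coefs <- unname(coef(lm(toy_Y ~ toy_X)))
print(manual_coefs)
print(lm_coefs)
```

If the two printed vectors match, the formulas and `lm()` describe the same line.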

So, to get started, I’ll walk through this manually, and then show you the R shortcut. First, we can load the data…

tourism <- read.csv("https://raw.githubusercontent.com/RobertSellers/R/master/data/tourism.csv")

…and then calculate the standard deviation and mean of the Y variable (tourist spending) and the X variable (visitor count, in thousands).

SDy<- sd(tourism$tourist_spending)
SDx<- sd(tourism$visitor_count_tho)
MEANY<- mean(tourism$tourist_spending)
MEANX<- mean(tourism$visitor_count_tho)

\[\mbox{SD}_Y = 4877.672\] \[\mbox{SD}_X = 7356.59\] \[\overline{X} = 6371.766\] \[\overline{Y} = 3825.596\]

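Pearson’s r is a measure of the strength of the linear relationship between two variables. Notice that the x and y in the formula are lower case: they denote deviations from the mean, x = X minus the mean of X and y = Y minus the mean of Y.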

\[ r = \frac{\Sigma{xy}}{\sqrt{\Sigma{x^2}\Sigma{y^2}}} \]

We must calculate the deviations x and y from the raw values X and Y, and then the products xy and the squares x-squared and y-squared…

tourism$x <- tourism$visitor_count_tho - MEANX
tourism$y <- tourism$tourist_spending- MEANY
tourism$xy <- tourism$x*tourism$y
SIGMAxy<- sum(as.numeric(tourism$xy))
SIGMAxsquared<-sum(tourism$x^2)
SIGMAysquared<-sum(tourism$y^2)
head(tourism)
##   year visitor_count_tho tourist_spending         x         y       xy
## 1 1963               198                7 -6173.766 -3818.596 23575116
## 2 1964               229                8 -6142.766 -3817.596 23450597
## 3 1965               361               13 -6010.766 -3812.596 22916621
## 4 1966               449               12 -5922.766 -3813.596 22587035
## 5 1967               574               13 -5797.766 -3812.596 22104538
## 6 1968               602               24 -5769.766 -3801.596 21934318

…and then populate the Pearson’s r equation with our values.

\[ \Sigma{xy} = 1640659327 \] \[ \Sigma{x^2} = 2489493284 \] \[ \Sigma{y^2} = 1094417617 \]

\[ r = \frac{\Sigma{xy}}{\sqrt{\Sigma{x^2}\Sigma{y^2}}} = \frac{1640659327}{\sqrt{2489493284 \times 1094417617}} \]

r <- SIGMAxy/sqrt(SIGMAxsquared*SIGMAysquared)

An r value close to 1 signifies a strong positive linear correlation between the two variables.

\[ r = 0.9939657 \]
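As a sanity check on the hand calculation, base R's `cor()` should return the same number as the deviation formula. The sketch below shows this on hypothetical toy vectors (`toy_X`, `toy_Y` are made up for illustration); for the tourism data, the equivalent call would be `cor(tourism$visitor_count_tho, tourism$tourist_spending)`.

```r
# Hypothetical toy data (not the tourism data).
toy_X <- c(1, 2, 3, 4, 5)
toy_Y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

# Deviations from the mean (the lower-case x and y in the formula).
dev_x <- toy_X - mean(toy_X)
dev_y <- toy_Y - mean(toy_Y)

# Pearson's r from the deviation formula.
r_manual <- sum(dev_x * dev_y) / sqrt(sum(dev_x^2) * sum(dev_y^2))

# Built-in equivalent.
r_builtin <- cor(toy_X, toy_Y)

print(c(manual = r_manual, builtin = r_builtin))
```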

We can also visualize how closely the two variables move together by plotting both series against year in a single plot, with spending on the left axis and the number of tourists on the right axis.

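par(mar = c(5, 5, 2, 5))  # widen the right margin to make room for a second axis
# Spending over time (left axis)
with(tourism, plot(year, tourist_spending, type = "l", col = "red3", ylab = "Spending (in million $)", xlab = "Year"))
par(new = TRUE)  # draw the next plot on top of the current one
# Visitor counts over time (right axis)
with(tourism, plot(year, visitor_count_tho, axes = FALSE, xlab = NA, ylab = NA, type = "l", col = "blue"))
axis(side = 4)
mtext(side = 4, line = 3, "Number of Tourists (in thousands)")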

We can safely answer the first two questions at this point.

(a) Describe the relationship between number of tourists and spending.

There is a strong, positive, linear relationship: as the number of tourists increases, spending increases (r ≈ 0.994).

(b) What are the explanatory and response variables?

Explanatory variable: number of tourists

Response variable: spending

In other words, spending is hypothesized to be the result of the number of tourists, but not vice versa.

We now have each of the values needed for the linear regression model. We will calculate these in R.

\[\begin{array}{ll} b = r \frac{ \mbox{SD}_Y}{\mbox{SD}_X}, & \;\;a = \overline{Y} - b \overline{X} \end{array}\]

b <- r*(SDy/SDx)
a <- MEANY - (b*MEANX)

\[ \mbox{PREDICTED } Y = a + b X \] \[ \mbox{PREDICTED } Y = -373.6111 + 0.6590334 X \]
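To see what the fitted equation means in practice, we can plug in a visitor count. The sketch below hard-codes the coefficients reported above and defines a small helper, `predict_spending()`, which is a hypothetical name introduced only for this illustration. The input of 10,000 (thousand visitors) is an arbitrary example value, assumed to lie within the range of the observed data.

```r
# Coefficients as reported in the text.
a_hat <- -373.6111
b_hat <- 0.6590334

# Hypothetical helper: predicted spending (in million $)
# for a visitor count given in thousands.
predict_spending <- function(visitors_tho) {
  a_hat + b_hat * visitors_tho
}

predicted <- predict_spending(10000)
print(predicted)
```

The slope can be read the same way: each additional thousand visitors is associated with roughly 0.66 million dollars more in predicted spending.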

Or we could just run R’s lm() function. The coef() function will then extract these exact values for you.

tourism_spending_regression<-lm(tourism$tourist_spending~tourism$visitor_count_tho)
coef(tourism_spending_regression)
##               (Intercept) tourism$visitor_count_tho 
##              -373.6111020                 0.6590334
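One caveat with the call above: because the formula refers to `tourism$...` columns directly, `predict()` generally cannot substitute new values through `newdata`. A common alternative is to name the columns in the formula and pass the data frame via `data =`. The sketch below illustrates this on a hypothetical toy data frame (`toy`, with made-up `visitors` and `spending` columns), not on the tourism file.

```r
# Hypothetical toy data frame (not the tourism data).
toy <- data.frame(
  visitors = c(1, 2, 3, 4, 5),
  spending = c(2.1, 3.9, 6.2, 7.8, 10.1)
)

# Fit using column names plus data =, so predict() can find the predictor.
toy_fit <- lm(spending ~ visitors, data = toy)

# Predict at new predictor values.
new_points <- data.frame(visitors = c(2.5, 4.5))
toy_pred <- predict(toy_fit, newdata = new_points)
print(toy_pred)
```

For the tourism data, the analogous fit would be `lm(tourist_spending ~ visitor_count_tho, data = tourism)`, which gives the same coefficients under shorter names.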

From this we can reproduce the first plot from exercise 7.23: the scatterplot with the least squares fit.

frame()
smoothScatter(tourism$visitor_count_tho,tourism$tourist_spending, pch=21,
ylab="Spending (in million $)", main = "Least Squares Fitting", xlab="Number of Tourists (in thousands)")
abline(lm(tourism$tourist_spending~tourism$visitor_count_tho), col="red")

The Residuals Plot, with added years

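# pointLabel() comes from maptools, which has since been archived on CRAN;
# if it is unavailable, text(..., pos = 3) from base R is a simple fallback.
library(maptools)
tourism.lm <- lm(tourism$tourist_spending ~ tourism$visitor_count_tho)
tourism.res <- resid(tourism.lm)  # residual = observed spending - predicted spending
plot(tourism$visitor_count_tho, tourism.res, xlab = "Number of Tourists (in thousands)", ylab = "Residuals", main = "Spending vs Tourism Residual Plot", ylim = c(-1500, 1500), pch = "+", col = "blue")
pointLabel(tourism$visitor_count_tho, tourism.res, labels = as.character(tourism$year), cex = 0.5)
abline(h = 0, col = "gray60")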
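For readers unsure what `resid()` returns, the sketch below checks on hypothetical toy vectors (`toy_X`, `toy_Y` are made up) that a residual is simply the observed value minus the fitted value, and that least squares residuals sum to zero when the model includes an intercept.

```r
# Hypothetical toy data (not the tourism data).
toy_X <- c(1, 2, 3, 4, 5)
toy_Y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

toy_fit <- lm(toy_Y ~ toy_X)

# Residuals from R, and the same quantity computed by hand.
toy_res <- resid(toy_fit)
manual_res <- toy_Y - fitted(toy_fit)

print(round(toy_res, 4))
print(sum(toy_res))  # numerically zero, up to floating point error
```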

Residuals Histogram

hist(tourism.res, col="blue", xlab="Residuals", main="Histogram of Residuals", breaks=10, xlim=c(-2000,1500))
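A normal Q-Q plot is a common companion to the histogram when judging whether residuals look roughly normal. The sketch below is one way to draw it, shown on hypothetical toy data (`toy_X`, `toy_Y` are made up); with the tourism fit, the same two calls would take `tourism.res` instead of `toy_res`.

```r
# Hypothetical toy data (not the tourism data).
toy_X <- c(1, 2, 3, 4, 5, 6, 7, 8)
toy_Y <- c(2.3, 3.8, 6.4, 7.7, 10.2, 11.9, 14.3, 15.8)

toy_res <- resid(lm(toy_Y ~ toy_X))

# Sample quantiles of the residuals against theoretical normal quantiles.
qq <- qqnorm(toy_res, main = "Normal Q-Q Plot of Residuals")
qqline(toy_res, col = "red")  # reference line through the quartiles
```

Points that stay close to the reference line suggest approximately normal residuals; systematic bends or heavy tails suggest otherwise.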

Remaining Questions

(c) Why might we want to fit a regression line to these data?

Forecasting revenue from tourism can help influence the allocation of tourism funding.

(d) Do the data meet the conditions required for fitting a least squares line?

Each of the provided plots shows the relationship from a different perspective. The scatterplot shows the positive linear trend, while the residuals plot shows how the points vary around the fitted line.

The residuals plot removes the fitted trend and shows only the difference between observed and predicted spending, which makes outliers and changes in spread easier to recognize. Supporting this outlier analysis, the histogram shows the distribution of the residuals. The high r value is consistent with a strong linear fit, although r alone cannot rule out curvature, which is why the residuals plot matters.

We can conclude that the data are suitable for this regression method, and that, given similar circumstances, this is a good basis to keep in mind when doing further statistical analysis.