The data can be found in the companion package for OpenIntro resources, openintro. Let’s load the packages.
library(tidyverse)
library(openintro)
library(DATA606)
##
## Welcome to CUNY DATA606 Statistics and Probability for Data Analytics
## This package is designed to support this course. The text book used
## is OpenIntro Statistics, 3rd Edition. You can read this by typing
## vignette('os3') or visit www.OpenIntro.org.
##
## The getLabs() function will return a list of the labs available.
##
## The demo(package='DATA606') will list the demos that are available.
The data is a summary of tourism in Turkey. It contains 47 observations on the following 3 variables.
* year: numeric vector
* visitor_count_tho: numeric vector
* tourist_spending: numeric vector
glimpse(tourism)
Rows: 47
Columns: 3
$ year
Questions
(a) Describe the relationship between number of tourists and spending.
Ans. The scatter plot shows a strong, positive linear relationship between visitor count and tourist spending. The Pearson correlation between the two variables is approximately 0.994.
ggplot(tourism, aes(x=visitor_count_tho, y=tourist_spending)) +
geom_point()+
geom_smooth(method=lm)
## `geom_smooth()` using formula 'y ~ x'
cor.test(tourism$visitor_count_tho, tourism$tourist_spending, method = "pearson")
Pearson's product-moment correlation
data: tourism$visitor_count_tho and tourism$tourist_spending
t = 60.786, df = 45, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.9891302 0.9966537
sample estimates:
cor
0.9939657
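The t statistic in this output can be recovered from the correlation and the sample size with the standard relation t = r * sqrt(n - 2) / sqrt(1 - r^2). The following is a minimal sketch that uses only the values printed above (r = 0.9939657, n = 47). It is a cross-check of the reported numbers, not a replacement for `cor.test()`.

```r
# Values copied from the cor.test() output above
r <- 0.9939657   # sample Pearson correlation
n <- 47          # number of observations, so df = n - 2 = 45

# t statistic for the null hypothesis that the true correlation is 0
t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)

# Two-sided p-value from the t distribution with n - 2 degrees of freedom
p_value <- 2 * pt(abs(t_stat), df = n - 2, lower.tail = FALSE)

# Coefficient of determination: share of the variance in spending
# accounted for by a linear relationship with visitor count
r_squared <- r^2

round(t_stat, 2)
round(r_squared, 3)
```

The computed t statistic should agree with the reported t = 60.786 up to rounding of r, and r squared is about 0.988.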
Ans. In this case, the explanatory variable is visitor_count_tho, the number of foreign tourists visiting Turkey each year. The response variable is tourist_spending, the tourist spending each year.
Ans. The relationship between the number of foreign tourists visiting Turkey and the tourist spending by year appears linear in the scatter plot. Hence, it makes sense to fit a regression line to the data.
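As an illustration of what fitting a regression line involves, the least squares slope and intercept can be computed from summary statistics: the slope is r * s_y / s_x, and the line passes through the point of means. The sketch below uses hypothetical simulated data (not the tourism data set), and the names `x`, `y`, `b0`, and `b1` are chosen only for this example. It compares the hand-computed values with `lm()`.

```r
# Hypothetical toy data for illustration only (NOT the tourism data set)
set.seed(1)
x <- seq(1, 20)
y <- 3 + 0.5 * x + rnorm(20, sd = 0.4)

# Least squares slope and intercept from summary statistics
b1 <- cor(x, y) * sd(y) / sd(x)   # slope = r * s_y / s_x
b0 <- mean(y) - b1 * mean(x)      # line passes through (mean(x), mean(y))

# The same line fitted with lm() for comparison
fit <- lm(y ~ x)
coef(fit)
c(intercept = b0, slope = b1)
```

Both approaches should give the same coefficients up to floating-point error. With the tourism data, the analogous call would be `lm(tourist_spending ~ visitor_count_tho, data = tourism)`.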
Ans. The conditions for regression are:
* Linearity: Condition met
* Nearly normal residuals: Need to check
* Constant variability: Need to check
* Independent observations: Assumed here, on the premise that the number of tourists in one year does not depend on previous years. Because the observations are consecutive yearly totals, this assumption is questionable and should be treated with caution.
The histogram of the residuals shows a nearly normal distribution, so the second condition is met.
m1 <- lm(tourist_spending ~ visitor_count_tho, data = tourism)
ggplot(data = m1, aes(x = .resid)) +
geom_histogram(binwidth = 300) +
xlab("Residuals") +
ggtitle("Distribution of Residuals")
The plot of residuals against fitted values does not show a random scatter around zero. Hence the third condition (constant variability, or homoscedasticity) is not met.
ggplot(data = m1, aes(x = .fitted, y = .resid)) +
geom_point() +
geom_hline(yintercept = 0, linetype = "dashed") +
xlab("Fitted values") +
ylab("Residuals")
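A rough numeric companion to the residual plot is to compare the spread of the residuals for low and high fitted values. The sketch below is an informal check, not a formal test. It uses hypothetical simulated data whose noise grows with `x` (not the tourism data set), and the names `lower`, `upper`, and `sd_ratio` are introduced only for this example.

```r
# Hypothetical simulated data with noise that grows with x
# (for illustration only, NOT the tourism data set)
set.seed(42)
x <- seq(1, 100)
y <- 2 * x + rnorm(100, sd = 0.2 * x)

fit <- lm(y ~ x)
res <- resid(fit)
fitted_vals <- fitted(fit)

# Split the residuals at the median fitted value
lower <- res[fitted_vals <= median(fitted_vals)]
upper <- res[fitted_vals > median(fitted_vals)]

# Under constant variability this ratio should be close to 1;
# a ratio well above or below 1 suggests the spread changes with the fit
sd_ratio <- sd(upper) / sd(lower)
sd_ratio
```

The same comparison could be applied to `resid(m1)` and `fitted(m1)` from the model above. A ratio far from 1 would be consistent with the pattern seen in the residual plot.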