DATA 606 Presentation

Load packages

The data can be found in the companion package for OpenIntro resources, openintro. Let’s load the packages.

library(tidyverse)
library(openintro)
library(DATA606)

## 
## Welcome to CUNY DATA606 Statistics and Probability for Data Analytics 
## This package is designed to support this course. The text book used 
## is OpenIntro Statistics, 3rd Edition. You can read this by typing 
## vignette('os3') or visit www.OpenIntro.org. 
##  
## The getLabs() function will return a list of the labs available. 
##  
## The demo(package='DATA606') will list the demos that are available.

The data is a summary of tourism in Turkey. It contains 47 observations on the following 3 variables.

year: numeric vector

visitor_count_tho: numeric vector

tourist_spending: numeric vector

glimpse(tourism)

Rows: 47
Columns: 3
$ year              1963, 1964, 1965, 1966, 1967, 1968, 1969, 1970, 1…
$ visitor_count_tho 198, 229, 361, 449, 574, 602, 694, 724, 926, 1034…
$ tourist_spending  7, 8, 13, 12, 13, 24, 36, 51, 62, 103, 171, 193, …

Questions

(a) Describe the relationship between number of tourists and spending.

Ans. The scatter plot shows a strong, positive linear relationship between visitor count and tourist spending. The Pearson correlation between the two variables is about 0.994.

ggplot(tourism, aes(x = visitor_count_tho, y = tourist_spending)) +
  geom_point() +
  geom_smooth(method = "lm")

## `geom_smooth()` using formula 'y ~ x'

cor.test(tourism$visitor_count_tho, tourism$tourist_spending, method = "pearson")

Pearson's product-moment correlation

data:  tourism$visitor_count_tho and tourism$tourist_spending
t = 60.786, df = 45, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.9891302 0.9966537
sample estimates:
      cor
0.9939657
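As a quick cross-check of the output above, the test statistic can be recovered from the reported correlation and degrees of freedom with t = r * sqrt(n - 2) / sqrt(1 - r^2). The sketch below only reuses the two numbers printed by cor.test(); the variable names are illustrative and not part of the original analysis.

```r
# Values copied from the cor.test() output above
r  <- 0.9939657   # sample Pearson correlation
df <- 45          # degrees of freedom, n - 2 with n = 47

# Test statistic for H0: true correlation is 0
t_stat <- r * sqrt(df) / sqrt(1 - r^2)

# Share of the variance in spending explained by a linear fit on visitor count
r_squared <- r^2

round(t_stat, 2)      # about 60.79, in line with the reported t = 60.786
round(r_squared, 3)   # about 0.988
```

The squared correlation is what a simple linear regression on these two variables reports as R-squared.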

(b) What are the explanatory and response variables?

Ans. In this case, the explanatory variable is visitor_count_tho, the number of foreign tourists visiting Turkey each year (in thousands). The response variable is tourist_spending, the tourist spending in that year.

(c) Why might we want to fit a regression line to these data?

Ans. The relationship between the number of foreign tourists visiting Turkey and the tourist spending by year appears linear in the scatter plot. Hence, it makes sense to fit a regression line to the data, which summarizes how spending changes with the number of visitors.
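To make "fitting a regression line" concrete, here is a minimal sketch of the least squares estimates, slope b1 = r * s_y / s_x and intercept b0 = mean(y) - b1 * mean(x), compared with lm(). It uses only the first eight rows visible in the glimpse() output, so the coefficients are illustrative and are not those of the full 47-row fit; the vector names are assumptions made for this example.

```r
# First eight rows as shown in the glimpse() output (not the full data set)
visitors <- c(198, 229, 361, 449, 574, 602, 694, 724)  # visitor_count_tho
spending <- c(7, 8, 13, 12, 13, 24, 36, 51)            # tourist_spending

# Least squares estimates from summary statistics
r  <- cor(visitors, spending)
b1 <- r * sd(spending) / sd(visitors)        # slope
b0 <- mean(spending) - b1 * mean(visitors)   # intercept

# The same line fitted with lm()
fit <- lm(spending ~ visitors)

c(intercept = b0, slope = b1)
coef(fit)
```

Both approaches should give the same intercept and slope, since lm() minimizes the same sum of squared residuals.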

(d) Do the data meet the conditions required for fitting a least squares line? In addition to the scatterplot, use the residual plot and histogram to answer this question.

Ans. The conditions for regression are:

* Linearity: Condition met
* Nearly normal residuals: Need to check
* Constant variability: Need to check
* Independent observations: We assume the observations are independent of each other, on the grounds that the number of tourists in a year may not depend on the previous years

The histogram of the residuals shows a nearly normal distribution, so the second condition is met.

m1 <- lm(tourist_spending ~ visitor_count_tho, data = tourism)
ggplot(data = m1, aes(x = .resid)) +
  geom_histogram(binwidth = 300) +
  xlab("Residuals") +
  ggtitle("Distribution of Residuals")
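A histogram depends on the chosen bin width, so a normal Q-Q plot can serve as a second look at the same condition. The sketch below is an optional complement and not part of the original analysis; it refits the same model as m1 and assumes the openintro package is installed.

```r
library(openintro)  # provides the tourism data set

# Same model as m1 in the text
m1 <- lm(tourist_spending ~ visitor_count_tho, data = tourism)
res <- resid(m1)

# Points close to the dashed reference line suggest nearly normal residuals
qqnorm(res, main = "Normal Q-Q Plot of Residuals")
qqline(res, lty = 2)
```

Systematic curvature away from the line, especially in the tails, would weaken the case for near normality.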

The plot of residuals against fitted values does not show a random scatter of residuals around zero. Hence the third condition, constant variability (homoscedasticity), is not met.

ggplot(data = m1, aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed") +
  xlab("Fitted values") +
  ylab("Residuals")
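The independence condition listed earlier is assumed rather than checked. Because the observations are yearly, one hedged way to probe it is to look at the residuals in time order. The sketch below is an optional addition, not part of the original analysis; it assumes the openintro package is installed and that the rows of tourism are sorted by year, as the glimpse() output suggests.

```r
library(openintro)  # provides the tourism data set

# Same model as m1 in the text
m1 <- lm(tourist_spending ~ visitor_count_tho, data = tourism)
res <- resid(m1)
n <- length(res)

# Residuals in time order: long runs above or below zero hint at dependence
plot(tourism$year, res, type = "b", xlab = "Year", ylab = "Residuals")
abline(h = 0, lty = 2)

# Correlation between each residual and the previous year's residual
lag1_cor <- cor(res[-1], res[-n])
lag1_cor
```

A lag-1 correlation near zero would be consistent with the independence assumption, while a value far from zero would suggest that consecutive years are related.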

Bharani Nittala
