Assignment 7

The Association of Turkish Travel Agencies reports the number of foreign tourists visiting Turkey and tourist spending by year. Three plots are provided: a scatterplot showing the relationship between these two variables along with the least squares fit, a residuals plot, and a histogram of residuals.

Describe the relationship between number of tourists and spending.
What are the explanatory and response variables?
Why might we want to fit a regression line to these data?
Do the data meet the conditions required for fitting a least squares line? In addition to the scatterplot, use the residual plot and histogram to answer this question.

#a The relationship between number of tourists and spending is strong, positive, and linear.
#b The explanatory variable is number of tourists and the response variable is spending. 

library(tidyverse)
tourism <- read.csv('https://raw.githubusercontent.com/jbryer/DATA606Fall2018/master/data/os3_data/Ch%207%20Exercise%20Data/tourism.csv')
tourism %>%
  ggplot(aes(visitor_count_tho, tourist_spending)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(title = "Tourism Spending", 
       x = "Number of tourists (in thousands)", 
       y = "Spending (in million $)") +
  theme_minimal()

#c We might want to fit a regression line to these data to predict tourist spending from the number of visitors. The model summary below shows that about 98.8% of the variability in tourist spending is explained by the number of visitors.
q1 <- lm(tourism$tourist_spending ~ tourism$visitor_count_tho, data = tourism)
summary(q1)

## 
## Call:
## lm(formula = tourism$tourist_spending ~ tourism$visitor_count_tho, 
##     data = tourism)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1618.04  -254.34   -20.88   234.81  1147.75 
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               -373.61110  104.87320  -3.563 0.000883 ***
## tourism$visitor_count_tho    0.65903    0.01084  60.786  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 541 on 45 degrees of freedom
## Multiple R-squared:  0.988,  Adjusted R-squared:  0.9877 
## F-statistic:  3695 on 1 and 45 DF,  p-value: < 2.2e-16

#d The data appear to meet the conditions required for fitting a least squares line. The scatterplot shows a linear trend, and the histogram of residuals is nearly normal. At first glance the variability of the residuals does not seem constant, but that impression comes from a few high-leverage points that influence the fit.
plot(q1$residuals ~ tourism$visitor_count_tho)
abline(h = 0, lty = 3)

hist(q1$residuals)
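
The checks used in #d (residuals against the predictor or fitted values, a histogram, and a normal Q-Q plot) can be bundled into one helper. This is only a sketch: `check_ls_conditions` is a hypothetical name, and R's built-in `cars` data stands in for the tourism data so that the example runs without a download.

```r
# Hypothetical helper: draw the usual least squares diagnostics for a fitted
# model and return the fitted values and residuals for further inspection.
check_ls_conditions <- function(model) {
  res <- resid(model)
  fit <- fitted(model)

  op <- par(mfrow = c(1, 3))   # three panels side by side
  on.exit(par(op))             # restore the plotting layout afterwards

  plot(fit, res, xlab = "Fitted values", ylab = "Residuals")
  abline(h = 0, lty = 3)       # residuals should scatter evenly around 0
  hist(res, main = "Histogram of residuals", xlab = "Residuals")
  qqnorm(res)
  qqline(res)                  # points near the line suggest near-normality

  invisible(data.frame(fitted = fit, residual = res))
}

# Stand-in data (assumption): built-in `cars`, not the tourism data.
demo_fit <- lm(dist ~ speed, data = cars)
demo_diag <- check_ls_conditions(demo_fit)
head(demo_diag)
```

Passing `q1` instead of `demo_fit` would produce the same three panels for the tourism model.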

Exercise 7.13 introduces data on the Coast Starlight Amtrak train that runs from Seattle to Los Angeles. The mean travel time from one stop to the next on the Coast Starlight is 129 mins, with a standard deviation of 113 minutes. The mean distance traveled from one stop to the next is 108 miles with a standard deviation of 99 miles. The correlation between travel time and distance is 0.636.

Write the equation of the regression line for predicting travel time.
Interpret the slope and the intercept in this context.
Calculate R^2 of the regression line for predicting travel time from distance traveled for the Coast Starlight, and interpret R^2 in the context of the application.
The distance between Santa Barbara and Los Angeles is 103 miles. Use the model to estimate the time it takes for the Starlight to travel between these two cities.
It actually takes the Coast Starlight about 168 mins to travel from Santa Barbara to Los Angeles. Calculate the residual and explain the meaning of this residual value.
Suppose Amtrak is considering adding a stop to the Coast Starlight 500 miles away from Los Angeles. Would it be appropriate to use this linear model to predict the travel time from Los Angeles to this point?

#a The equation for predicting travel time is y = 50.59855 + 0.7259394 * x
# Explanatory variable - distance
# Response variable - travel time
Mx <- 108
Sx <- 99
My <- 129
Sy <- 113
R <- 0.636

#b The slope tells us how much longer the trip takes, on average, for each additional mile traveled: about 0.726 minutes per mile.
#b The intercept is the expected travel time for a distance of zero miles: about 51 minutes. Since the dataset has no stops with zero distance traveled, the intercept has no practical meaning here.
b1 <- (Sy / Sx) * R # slope
b1

## [1] 0.7259394

b0 <- My - (b1 * Mx) # intercept
b0

## [1] 50.59855
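
The two steps above (slope from the correlation and the standard deviations, then intercept from the means) can be wrapped in a small function. This is a minimal sketch; `line_from_summary` and its argument names are hypothetical, not part of the exercise.

```r
# Hypothetical helper: least squares line from summary statistics alone.
# slope     = r * sd_y / sd_x
# intercept = mean_y - slope * mean_x  (the line passes through the means)
line_from_summary <- function(mean_x, sd_x, mean_y, sd_y, r) {
  slope <- r * sd_y / sd_x
  intercept <- mean_y - slope * mean_x
  c(intercept = intercept, slope = slope)
}

# Coast Starlight values from the exercise: x = distance, y = travel time.
coast <- line_from_summary(mean_x = 108, sd_x = 99,
                           mean_y = 129, sd_y = 113, r = 0.636)
round(coast, 3)
```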

#c r-squared of the regression line for predicting travel time from distance traveled for the Coast Starlight is 0.404
#c R-squared tells us what percent of the variability in travel time is explained by the model. Here, about 40% of the variability in travel time is accounted for by distance traveled.
r2 <- R^2
r2

## [1] 0.404496

#d The predicted travel time for the Starlight between Santa Barbara and Los Angeles is 125.37 minutes.
y <- b0 + (b1 * 103) #103 is the distance between these two cities
y

## [1] 125.3703

#e The residual is 42.63 minutes.
#e The residual is the observed travel time minus the predicted travel time, so its sign shows whether the model overestimated or underestimated. Here the residual is positive, which means the model underestimated the travel time by 42.63 minutes.
168 - y #168 - actual travel time, y - predicted travel time

## [1] 42.6297
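
The prediction and the residual can also be written as two small functions, which makes the sign convention (observed minus predicted) explicit. A sketch only: `predict_time` and `residual_for` are hypothetical names, and the coefficients are copied from the output above.

```r
# Coefficients copied from the fitted line above.
b0 <- 50.59855   # intercept, in minutes
b1 <- 0.7259394  # slope, in minutes per mile

# Hypothetical helpers: predicted travel time and residual for one leg.
predict_time <- function(distance_miles) b0 + b1 * distance_miles
residual_for <- function(observed_minutes, distance_miles) {
  observed_minutes - predict_time(distance_miles)  # observed minus predicted
}

# Santa Barbara to Los Angeles: 103 miles, about 168 minutes observed.
sb_la <- residual_for(observed_minutes = 168, distance_miles = 103)
sb_la

# A positive residual means the model's prediction was too low.
if (sb_la > 0) "model underestimated" else "model overestimated"
```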

#f It would probably not be appropriate. A distance of 500 miles is about four standard deviations above the mean distance between stops ((500 - 108) / 99 is roughly 3.96), so it would be an extreme outlier relative to the data used to fit the model, and predicting travel time there would be extrapolation. The model would be appropriate only if the observed distances actually extended to around 500 miles.
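
One rough way to judge how unusual a 500-mile leg would be is to express it in standard deviations from the mean distance. This is only a heuristic, since the summary statistics do not give the actual range of the observed distances; `z_score` is a hypothetical helper.

```r
# Summary statistics for distance between stops, from the exercise.
mean_distance <- 108
sd_distance <- 99

# Hypothetical helper: number of standard deviations a value lies from the mean.
z_score <- function(x, mean, sd) (x - mean) / sd

z_500 <- z_score(500, mean_distance, sd_distance)
z_500
```

A value this far from the mean suggests the proposed stop lies well beyond most of the data the line was fitted to.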

The following regression output is for predicting annual murders per million from percentage living in poverty in a random sample of 20 metropolitan areas.

Write out the linear model.
Interpret the intercept.
Interpret the slope.
Interpret R^2.
Calculate the correlation coefficient.

#a The linear model is y = b0 + b1 * x
#a y: prediction, b0: intercept, b1: slope, x: explanatory
#a y = -29.901 + 2.559 * % in poverty
#b The intercept means that metropolitan areas with 0% poverty are predicted to have -29.9 annual murders per million on average. A negative murder rate is impossible, and since there are no metropolitan areas with 0% poverty, the intercept has no practical meaning.
#c The slope means that for each additional percentage point in the poverty rate, we would expect annual murders per million to be higher on average by 2.559.
#d The R-squared is 70.52%. It tells us that 70.52% of the variability in annual murders per million is explained by the percentage living in poverty.
#e The correlation coefficient is 0.8397619, the square root of R-squared, taken as positive because the slope is positive. It indicates a strong positive linear relationship between annual murders per million and percent in poverty.
sqrt(70.52/100)

## [1] 0.8397619
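
Writing the linear model from #a as a function makes the intercept and slope interpretations easy to check. A sketch only: `predict_murders` is a hypothetical name, and the poverty rates 20 and 21 are arbitrary illustrative inputs, not values from the sample.

```r
# Linear model from #a: annual murders per million as a function of
# percentage living in poverty.
predict_murders <- function(pct_poverty) -29.901 + 2.559 * pct_poverty

# Intercept: the prediction at 0% poverty, which is negative and so not meaningful.
predict_murders(0)

# Slope: the change in the prediction for a one percentage point increase.
predict_murders(21) - predict_murders(20)
```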

Exercise 7.33 gives a scatterplot displaying the relationship between the percent of families that own their home and the percent of the population living in urban areas. Below is a similar scatterplot, excluding District of Columbia, as well as the residuals plot. There were 51 cases.

For these data, R^2 = 0.28. What is the correlation? How can you tell if it is positive or negative?
Examine the residual plot. What do you observe? Is a simple least squares fit appropriate for these data?

urban_owner <- read.csv("https://raw.githubusercontent.com/jbryer/DATA606Fall2018/master/data/os3_data/Ch%207%20Exercise%20Data/urban_owner.csv")
urban_owner_2010 <- select(urban_owner, pct_owner_occupied, poppct_urban)
ggplot(urban_owner_2010, aes(poppct_urban, pct_owner_occupied)) +
  geom_point() +
  xlab("% urban population") +
  ylab("% who own home") +
  geom_smooth(method = "lm", se = FALSE)

#a The correlation is -0.5291503. The square root of R-squared gives the magnitude, 0.5291503, and the sign is negative because the regression line slopes downward. The relationship between the two variables is therefore moderately negative.
sqrt(0.28)

## [1] 0.5291503
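
Because sqrt() always returns a non-negative number, the sign of the correlation has to come from the direction of the fitted line. A minimal sketch, with `signed_r` as a hypothetical helper and -1 standing in for any downward slope (the fitted slope itself is not reported above):

```r
# Hypothetical helper: recover the correlation from R-squared and the
# direction of the regression line.
signed_r <- function(r_squared, slope) sign(slope) * sqrt(r_squared)

# Urban population vs. home ownership: R^2 = 0.28, line slopes downward.
r_urban <- signed_r(0.28, slope = -1)
r_urban
```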

#b Examining the residuals, we observe that their distribution is right-skewed rather than nearly normal, and that the points do not scatter evenly around the trend line. There are also some unusual points that do not follow the trend of the rest of the data. A simple least squares fit is therefore not appropriate for these data.
q39 <- lm(pct_owner_occupied ~ poppct_urban, data = urban_owner_2010)
hist(q39$residuals)

qqnorm(q39$residuals)
qqline(q39$residuals)

Saayed Alam

November 6, 2018