Simple Regression Line

The Numbeo website (www.numbeo.com) provides access to a variety of data. One table lists prices of certain items in selected cities around the world. Numbeo also reports an overall cost-of-living index for each city, which compares the costs of hundreds of items with their costs in New York City. For example, London at 110.69 is 10.69% more expensive than New York. The data file Cost_of_living_2013.txt includes the Cost of Living Index, a Rent Index, a Groceries Index, a Restaurant Price Index, and a Local Purchasing Power Index that measures the ability of the average wage earner in a city to buy goods and services. All indices are measured relative to New York City, which is scored 100.

setwd("C:/Users/cris-/Desktop/r111")
x <- read.table("Cost_of_living_2013.txt", sep = '\t', header = TRUE)
names(x) # inspect the column names of the dataset

## [1] "City"                         "Cost.of.Living.Index"        
## [3] "Rent.Index"                   "Groceries.Index"             
## [5] "Restaurant.Price.Index"       "Local.Purchasing.Power.Index"

Scatterplots

Cost of Living Index vs. Rent Index

Moderate positive linear relationship

plot(Cost.of.Living.Index ~ Rent.Index, data = x)

Cost of Living Index vs. Groceries Index

Strong positive linear relationship

plot(Cost.of.Living.Index ~ Groceries.Index, data = x)

Cost of Living Index vs. Restaurant Price Index

Strong positive linear relationship

plot(Cost.of.Living.Index ~ Restaurant.Price.Index, data = x)

Cost of Living Index vs. Local Purchasing Power Index

Weak linear relationship

plot(Cost.of.Living.Index ~ Local.Purchasing.Power.Index, data = x)

Correlations

Cost of Living Index vs. Rent Index

cor(x$Cost.of.Living.Index, x$Rent.Index)

## [1] 0.7722926

Cost of Living Index vs. Groceries Index

cor(x$Cost.of.Living.Index, x$Groceries.Index)

## [1] 0.9538616

Cost of Living Index vs. Restaurant Price Index

cor(x$Cost.of.Living.Index, x$Restaurant.Price.Index)

## [1] 0.9493554

Cost of Living Index vs. Local Purchasing Power Index

cor(x$Cost.of.Living.Index, x$Local.Purchasing.Power.Index)

## [1] 0.525902

Simple Linear Regression

Model 1

m1 <- lm(Cost.of.Living.Index ~ Rent.Index, data = x)
coef(m1)

## (Intercept)  Rent.Index 
##   45.232600    1.024624

Model 2

m2 <- lm(Cost.of.Living.Index ~ Groceries.Index, data = x)
coef(m2)

##     (Intercept) Groceries.Index 
##       9.2178364       0.9529463

Model 3

m3 <- lm(Cost.of.Living.Index ~ Restaurant.Price.Index, data = x)
coef(m3)

##            (Intercept) Restaurant.Price.Index 
##             24.6635984              0.8033304

Model 4

m4 <- lm(Cost.of.Living.Index ~ Local.Purchasing.Power.Index, data = x)
coef(m4)

##                  (Intercept) Local.Purchasing.Power.Index 
##                   48.9974246                    0.3761637

Which one is the best predictor?

R-squared

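To compare the models, look at R-squared: the fraction of the variability in the response variable (here, the Cost of Living Index) that is accounted for by the regression model. The closer it is to 1, the better the predictor explains the response.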
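As a minimal sketch of that definition, the block below computes R-squared by hand as 1 - SS_res / SS_tot and checks it against the value reported by summary(). It uses R's built-in mtcars data as a stand-in (an assumption for illustration, not the cost-of-living file), and the names fit, ss_res, ss_tot, r2_manual, and r2_from_cor are hypothetical.

```r
# Stand-in data: built-in mtcars, NOT the cost-of-living file
fit <- lm(mpg ~ wt, data = mtcars)

# Residual sum of squares: variation left unexplained by the line
ss_res <- sum(residuals(fit)^2)

# Total sum of squares: variation of the response around its mean
ss_tot <- sum((mtcars$mpg - mean(mtcars$mpg))^2)

r2_manual <- 1 - ss_res / ss_tot

# With a single predictor, R-squared equals the squared correlation
r2_from_cor <- cor(mtcars$mpg, mtcars$wt)^2

all.equal(r2_manual, summary(fit)$r.squared)
all.equal(r2_manual, r2_from_cor)
```

The same identity links the correlations reported earlier to the R-squared values of the four models: for example, 0.9538616 squared is about 0.9099.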

Model 1

summary(m1)$r.squared

## [1] 0.5964358

Model 2

summary(m2)$r.squared

## [1] 0.909852

Model 3

summary(m3)$r.squared

## [1] 0.9012757

Model 4

summary(m4)$r.squared

## [1] 0.2765729

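The best model is Model 2 (Groceries Index), since it has the highest correlation (r = 0.954) and the highest R-squared (0.910). The worst is Model 4 (Local Purchasing Power Index), with an R-squared of only 0.277.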
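The model-by-model comparison can also be collected in one step. The sketch below is one possible way to do it, again on the built-in mtcars data as a stand-in (an assumption, not the cost-of-living file). The names predictors, r2, and best are hypothetical.

```r
# Stand-in data: built-in mtcars, NOT the cost-of-living file
predictors <- c("wt", "hp", "disp", "qsec")

# Fit one simple regression per predictor and keep its R-squared
r2 <- sapply(predictors, function(p) {
  f <- reformulate(p, response = "mpg")   # builds the formula mpg ~ p
  summary(lm(f, data = mtcars))$r.squared
})

# Highest R-squared first
sort(r2, decreasing = TRUE)

# Name of the predictor with the largest R-squared
best <- names(which.max(r2))
best
```

For the data in this report, predictors would hold the four index column names and the response would be "Cost.of.Living.Index".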

summary(m2)

## 
## Call:
## lm(formula = Cost.of.Living.Index ~ Groceries.Index, data = x)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -26.2714  -6.2766   0.4478   5.2780  20.7336 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      9.21784    1.31039   7.034 1.22e-11 ***
## Groceries.Index  0.95295    0.01677  56.831  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.356 on 320 degrees of freedom
## Multiple R-squared:  0.9099, Adjusted R-squared:  0.9096 
## F-statistic:  3230 on 1 and 320 DF,  p-value: < 2.2e-16

Predicting Values and their residuals

Fitted value: what cost of living does Model 2 predict for Beijing?

predicting <- which(x$City == 'Beijing, China')  # row index of Beijing
m2$fitted.values[predicting]

##      172 
## 88.85556

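Residual for the prediction

The cost of living index for Beijing, as predicted from the Groceries Index, is 88.86 (11.14% less expensive than New York). The residual is observed minus fitted, and here it is negative, so the model overestimates: the observed index is 88.86 - 11.67 = 77.19.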

m2$residuals[predicting]

##       172 
## -11.66556
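The relationship between observed value, fitted value, and residual can be checked directly. The sketch below does so on the built-in mtcars data as a stand-in (an assumption, not the cost-of-living file), with row names playing the role of City. The names i, observed, fitted_i, resid_i, and b are hypothetical.

```r
# Stand-in data: built-in mtcars, NOT the cost-of-living file
fit <- lm(mpg ~ wt, data = mtcars)

# Pick one observation by its label (row names stand in for City)
i <- which(rownames(mtcars) == "Fiat 128")

observed <- mtcars$mpg[i]
fitted_i <- unname(fitted(fit)[i])
resid_i  <- unname(residuals(fit)[i])

# A residual is observed minus fitted
all.equal(resid_i, observed - fitted_i)

# The fitted value is intercept + slope * predictor
b <- coef(fit)
all.equal(fitted_i, unname(b[1] + b[2] * mtcars$wt[i]))

# Negative residual: the line overestimates; positive: it underestimates
if (resid_i < 0) "overestimate" else "underestimate"
```

Applied to the Beijing output above, observed = fitted + residual = 88.85556 + (-11.66556) = 77.19.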