Source file ⇒ The_Analytics_Edge_edX_MIT062015.Rmd

Unit 1: An Introduction to Analytics

Preliminaries

IMPORTANT NOTE

If you downloaded and installed R in a location other than the United States, you might encounter some formatting issues later in this class due to language differences. To fix this, you will need to type in your R console:

Sys.setlocale("LC_ALL", "C")

This will only change the locale for your current R session, so please make a note to run this command when you are working on any lectures or exercises that might depend on the English language (for example, the names for the days of the week).
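For example, once the locale is set you can check that date-related names come out in English (a minimal sketch; the date used here is only illustrative):

Sys.setlocale("LC_ALL", "C")
weekdays(as.Date("2015-06-15"))
months(as.Date("2015-06-15"))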

Working with data_An introduction to R

Please download and install R for free from the following webpage: http://www.cran.r-project.org

QUICK QUESTIONS FOR INSTALLING R

When you start R, you should see a window titled “R Console”. In this window, there is some text, and then at the bottom there should be a > symbol (greater than symbol), followed by a blinking cursor. At the cursor, type:

sd(c(5,8,12))

and then hit enter. You should see [1] followed by a number. What is this number?

sd(c(5,8,12))
## [1] 3.511885

Ans:3.511885

Now type:

which.min(c(4,1,6))

at the cursor, and hit enter. You should again see [1], followed by a number. What is this number?

which.min(c(4,1,6))
## [1] 2

Ans:2

which.min returns the position (index) of the minimum value in a vector.

Q:At which university was the first version of R developed?

Ans:University of Auckland

EXPLANATION:The first version of R was developed by Robert Gentleman and Ross Ihaka at the University of Auckland in the mid-1990s.

Q:Which of the following are recommended variable names in R? (Select all that apply.)

Ans:SquareRoot2

Square2.Root

EXPLANATION:SquareRoot2 and Square2.Root are recommended variable names. The second option is not recommended because it has a space, and the fourth option is not recommended because it starts with a number.
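As a quick illustration (a sketch with arbitrary values), the recommended names can be assigned directly at the console, while a name containing a space or starting with a digit is not valid:

SquareRoot2 = sqrt(2)   # valid
Square2.Root = sqrt(2)  # valid
# a name with a space, or one that starts with a number, will not work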

Q:If you want to add new observations to a data frame, what should you use?

Ans:The rbind function.

EXPLANATION:To add new observations to a data frame with the same variable values, you should use rbind.

Q:If you want to combine two vectors into a data frame, what should you use?

Ans:The data.frame function.

EXPLANATION:To combine two vectors into a data frame, you should use data.frame.

Q:If you want to add a variable to your data frame, what should you use?

Ans:The dollar sign notation.

EXPLANATION:To add a variable to your data frame, you should use the dollar sign notation.
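Tying the last three questions together, a minimal sketch (with made-up values) of building a data frame, adding a variable with the dollar sign notation, and appending new observations with rbind:

df = data.frame(Name = c("A", "B"), Value = c(1, 2))            # combine two vectors into a data frame
df$Double = df$Value * 2                                        # add a new variable
df = rbind(df, data.frame(Name = "C", Value = 3, Double = 6))   # add a new observation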

Q:If you want to know the mean value of a numerical variable, which function could you use?

Ans:summary

EXPLANATION:If you use the summary function (in the video, we typed summary(WHO) in our R console), you can see a statistical summary of each variable. For numerical variables, the mean value is listed.

Q:If you want to know the standard deviation of a numerical variable, which function could you use?

Ans:Neither str nor summary

EXPLANATION:Neither the str function nor the summary function provides the standard deviation value of a variable. We’ll see how to compute this value in the next video.

#importing the WHO dataset
WHO<-read.csv("WHO.csv")
str(WHO)
## 'data.frame':    194 obs. of  13 variables:
##  $ Country                      : Factor w/ 194 levels "Afghanistan",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ Region                       : Factor w/ 6 levels "Africa","Americas",..: 3 4 1 4 1 2 2 4 6 4 ...
##  $ Population                   : int  29825 3162 38482 78 20821 89 41087 2969 23050 8464 ...
##  $ Under15                      : num  47.4 21.3 27.4 15.2 47.6 ...
##  $ Over60                       : num  3.82 14.93 7.17 22.86 3.84 ...
##  $ FertilityRate                : num  5.4 1.75 2.83 NA 6.1 2.12 2.2 1.74 1.89 1.44 ...
##  $ LifeExpectancy               : int  60 74 73 82 51 75 76 71 82 81 ...
##  $ ChildMortality               : num  98.5 16.7 20 3.2 163.5 ...
##  $ CellularSubscribers          : num  54.3 96.4 99 75.5 48.4 ...
##  $ LiteracyRate                 : num  NA NA NA NA 70.1 99 97.8 99.6 NA NA ...
##  $ GNI                          : num  1140 8820 8310 NA 5230 ...
##  $ PrimarySchoolEnrollmentMale  : num  NA NA 98.2 78.4 93.1 91.1 NA NA 96.9 NA ...
##  $ PrimarySchoolEnrollmentFemale: num  NA NA 96.4 79.4 78.2 84.5 NA NA 97.5 NA ...
summary(WHO)
##                 Country                      Region     Population     
##  Afghanistan        :  1   Africa               :46   Min.   :      1  
##  Albania            :  1   Americas             :35   1st Qu.:   1696  
##  Algeria            :  1   Eastern Mediterranean:22   Median :   7790  
##  Andorra            :  1   Europe               :53   Mean   :  36360  
##  Angola             :  1   South-East Asia      :11   3rd Qu.:  24535  
##  Antigua and Barbuda:  1   Western Pacific      :27   Max.   :1390000  
##  (Other)            :188                                               
##     Under15          Over60      FertilityRate   LifeExpectancy 
##  Min.   :13.12   Min.   : 0.81   Min.   :1.260   Min.   :47.00  
##  1st Qu.:18.72   1st Qu.: 5.20   1st Qu.:1.835   1st Qu.:64.00  
##  Median :28.65   Median : 8.53   Median :2.400   Median :72.50  
##  Mean   :28.73   Mean   :11.16   Mean   :2.941   Mean   :70.01  
##  3rd Qu.:37.75   3rd Qu.:16.69   3rd Qu.:3.905   3rd Qu.:76.00  
##  Max.   :49.99   Max.   :31.92   Max.   :7.580   Max.   :83.00  
##                                  NA's   :11                     
##  ChildMortality    CellularSubscribers  LiteracyRate        GNI       
##  Min.   :  2.200   Min.   :  2.57      Min.   :31.10   Min.   :  340  
##  1st Qu.:  8.425   1st Qu.: 63.57      1st Qu.:71.60   1st Qu.: 2335  
##  Median : 18.600   Median : 97.75      Median :91.80   Median : 7870  
##  Mean   : 36.149   Mean   : 93.64      Mean   :83.71   Mean   :13321  
##  3rd Qu.: 55.975   3rd Qu.:120.81      3rd Qu.:97.85   3rd Qu.:17558  
##  Max.   :181.600   Max.   :196.41      Max.   :99.80   Max.   :86440  
##                    NA's   :10          NA's   :91      NA's   :32     
##  PrimarySchoolEnrollmentMale PrimarySchoolEnrollmentFemale
##  Min.   : 37.20              Min.   : 32.50               
##  1st Qu.: 87.70              1st Qu.: 87.30               
##  Median : 94.70              Median : 95.10               
##  Mean   : 90.85              Mean   : 89.63               
##  3rd Qu.: 98.10              3rd Qu.: 97.90               
##  Max.   :100.00              Max.   :100.00               
##  NA's   :93                  NA's   :93
#Q:What is the mean value of the "Over60" variable?
mean(WHO$Over60)
## [1] 11.16366
#Ans:11.16366
#EXPLANATION:You can compute this value by either typing mean(WHO$Over60) in your R console, or by typing summary(WHO$Over60) in your R console. The output is 11.16.

#Q:Which country has the smallest percentage of the population over 60?
which.min(WHO$Over60)
## [1] 183
WHO$Country[183]
## [1] United Arab Emirates
## 194 Levels: Afghanistan Albania Algeria Andorra ... Zimbabwe
#Ans:United Arab Emirates
#EXPLANATION:To get this value, you should type which.min(WHO$Over60) in your R console. The output is 183. Then, to see the name of the 183rd country in your data frame, type WHO$Country[183] in your R console. The output is United Arab Emirates.

#Q:Which country has the largest literacy rate?
which.max(WHO$LiteracyRate)
## [1] 44
WHO$Country[44]
## [1] Cuba
## 194 Levels: Afghanistan Albania Algeria Andorra ... Zimbabwe
#Ans:Cuba
#EXPLANATION:To get this value, you should type which.max(WHO$LiteracyRate) in your R console. The output is 44. Then, to see the name of the 44th country in your data frame, type WHO$Country[44] in your R console. The output is Cuba.

#Q:Use the tapply function to find the average child mortality rate of countries in each region.
#Which region has the lowest average child mortality rate across all countries in that region?
tapply(WHO$ChildMortality, WHO$Region, mean)
##                Africa              Americas Eastern Mediterranean 
##              84.03696              19.32286              40.25000 
##                Europe       South-East Asia       Western Pacific 
##              10.05094              35.04545              24.71111
#Ans:Europe

Working with data_An introduction to R (R lecture videos reproduced)

# VIDEO 2

# Basic Calculations
8*6
## [1] 48
2^16
## [1] 65536
2^
8*6
## [1] 1536
8*10
## [1] 80
# Functions
sqrt(2)
## [1] 1.414214
abs(-65)
## [1] 65
#?sqrt

# Variables
SquareRoot2 = sqrt(2)
SquareRoot2
## [1] 1.414214
HoursYear <- 365*24
HoursYear
## [1] 8760
ls()
## [1] "HoursYear"   "SquareRoot2" "WHO"
# VIDEO 3 

# Vectors
c(2,3,5,8,13)
## [1]  2  3  5  8 13
Country = c("Brazil", "China", "India","Switzerland","USA")
LifeExpectancy = c(74,76,65,83,79)
Country
## [1] "Brazil"      "China"       "India"       "Switzerland" "USA"
LifeExpectancy
## [1] 74 76 65 83 79
Country[1]
## [1] "Brazil"
LifeExpectancy[3]
## [1] 65
Sequence = seq(0,100,2)
Sequence
##  [1]   0   2   4   6   8  10  12  14  16  18  20  22  24  26  28  30  32
## [18]  34  36  38  40  42  44  46  48  50  52  54  56  58  60  62  64  66
## [35]  68  70  72  74  76  78  80  82  84  86  88  90  92  94  96  98 100
# Data Frames
CountryData = data.frame(Country, LifeExpectancy)
CountryData
##       Country LifeExpectancy
## 1      Brazil             74
## 2       China             76
## 3       India             65
## 4 Switzerland             83
## 5         USA             79
CountryData$Population = c(199000,1390000,1240000,7997,318000)
CountryData
##       Country LifeExpectancy Population
## 1      Brazil             74     199000
## 2       China             76    1390000
## 3       India             65    1240000
## 4 Switzerland             83       7997
## 5         USA             79     318000
Country = c("Australia","Greece")
LifeExpectancy = c(82,81)
Population = c(23050,11125)
NewCountryData = data.frame(Country, LifeExpectancy, Population)
NewCountryData
##     Country LifeExpectancy Population
## 1 Australia             82      23050
## 2    Greece             81      11125
AllCountryData = rbind(CountryData, NewCountryData)
AllCountryData
##       Country LifeExpectancy Population
## 1      Brazil             74     199000
## 2       China             76    1390000
## 3       India             65    1240000
## 4 Switzerland             83       7997
## 5         USA             79     318000
## 6   Australia             82      23050
## 7      Greece             81      11125
# VIDEO 4

# Loading csv files
WHO<-read.csv("WHO.csv")
str(WHO)
## 'data.frame':    194 obs. of  13 variables:
##  $ Country                      : Factor w/ 194 levels "Afghanistan",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ Region                       : Factor w/ 6 levels "Africa","Americas",..: 3 4 1 4 1 2 2 4 6 4 ...
##  $ Population                   : int  29825 3162 38482 78 20821 89 41087 2969 23050 8464 ...
##  $ Under15                      : num  47.4 21.3 27.4 15.2 47.6 ...
##  $ Over60                       : num  3.82 14.93 7.17 22.86 3.84 ...
##  $ FertilityRate                : num  5.4 1.75 2.83 NA 6.1 2.12 2.2 1.74 1.89 1.44 ...
##  $ LifeExpectancy               : int  60 74 73 82 51 75 76 71 82 81 ...
##  $ ChildMortality               : num  98.5 16.7 20 3.2 163.5 ...
##  $ CellularSubscribers          : num  54.3 96.4 99 75.5 48.4 ...
##  $ LiteracyRate                 : num  NA NA NA NA 70.1 99 97.8 99.6 NA NA ...
##  $ GNI                          : num  1140 8820 8310 NA 5230 ...
##  $ PrimarySchoolEnrollmentMale  : num  NA NA 98.2 78.4 93.1 91.1 NA NA 96.9 NA ...
##  $ PrimarySchoolEnrollmentFemale: num  NA NA 96.4 79.4 78.2 84.5 NA NA 97.5 NA ...
summary(WHO)
##                 Country                      Region     Population     
##  Afghanistan        :  1   Africa               :46   Min.   :      1  
##  Albania            :  1   Americas             :35   1st Qu.:   1696  
##  Algeria            :  1   Eastern Mediterranean:22   Median :   7790  
##  Andorra            :  1   Europe               :53   Mean   :  36360  
##  Angola             :  1   South-East Asia      :11   3rd Qu.:  24535  
##  Antigua and Barbuda:  1   Western Pacific      :27   Max.   :1390000  
##  (Other)            :188                                               
##     Under15          Over60      FertilityRate   LifeExpectancy 
##  Min.   :13.12   Min.   : 0.81   Min.   :1.260   Min.   :47.00  
##  1st Qu.:18.72   1st Qu.: 5.20   1st Qu.:1.835   1st Qu.:64.00  
##  Median :28.65   Median : 8.53   Median :2.400   Median :72.50  
##  Mean   :28.73   Mean   :11.16   Mean   :2.941   Mean   :70.01  
##  3rd Qu.:37.75   3rd Qu.:16.69   3rd Qu.:3.905   3rd Qu.:76.00  
##  Max.   :49.99   Max.   :31.92   Max.   :7.580   Max.   :83.00  
##                                  NA's   :11                     
##  ChildMortality    CellularSubscribers  LiteracyRate        GNI       
##  Min.   :  2.200   Min.   :  2.57      Min.   :31.10   Min.   :  340  
##  1st Qu.:  8.425   1st Qu.: 63.57      1st Qu.:71.60   1st Qu.: 2335  
##  Median : 18.600   Median : 97.75      Median :91.80   Median : 7870  
##  Mean   : 36.149   Mean   : 93.64      Mean   :83.71   Mean   :13321  
##  3rd Qu.: 55.975   3rd Qu.:120.81      3rd Qu.:97.85   3rd Qu.:17558  
##  Max.   :181.600   Max.   :196.41      Max.   :99.80   Max.   :86440  
##                    NA's   :10          NA's   :91      NA's   :32     
##  PrimarySchoolEnrollmentMale PrimarySchoolEnrollmentFemale
##  Min.   : 37.20              Min.   : 32.50               
##  1st Qu.: 87.70              1st Qu.: 87.30               
##  Median : 94.70              Median : 95.10               
##  Mean   : 90.85              Mean   : 89.63               
##  3rd Qu.: 98.10              3rd Qu.: 97.90               
##  Max.   :100.00              Max.   :100.00               
##  NA's   :93                  NA's   :93
# Subsetting
WHO_Europe = subset(WHO, Region == "Europe")
str(WHO_Europe)
## 'data.frame':    53 obs. of  13 variables:
##  $ Country                      : Factor w/ 194 levels "Afghanistan",..: 2 4 8 10 11 16 17 22 26 42 ...
##  $ Region                       : Factor w/ 6 levels "Africa","Americas",..: 4 4 4 4 4 4 4 4 4 4 ...
##  $ Population                   : int  3162 78 2969 8464 9309 9405 11060 3834 7278 4307 ...
##  $ Under15                      : num  21.3 15.2 20.3 14.5 22.2 ...
##  $ Over60                       : num  14.93 22.86 14.06 23.52 8.24 ...
##  $ FertilityRate                : num  1.75 NA 1.74 1.44 1.96 1.47 1.85 1.26 1.51 1.48 ...
##  $ LifeExpectancy               : int  74 82 71 81 71 71 80 76 74 77 ...
##  $ ChildMortality               : num  16.7 3.2 16.4 4 35.2 5.2 4.2 6.7 12.1 4.7 ...
##  $ CellularSubscribers          : num  96.4 75.5 103.6 154.8 108.8 ...
##  $ LiteracyRate                 : num  NA NA 99.6 NA NA NA NA 97.9 NA 98.8 ...
##  $ GNI                          : num  8820 NA 6100 42050 8960 ...
##  $ PrimarySchoolEnrollmentMale  : num  NA 78.4 NA NA 85.3 NA 98.9 86.5 99.3 94.8 ...
##  $ PrimarySchoolEnrollmentFemale: num  NA 79.4 NA NA 84.1 NA 99.2 88.4 99.7 97 ...
# Writing csv files
write.csv(WHO_Europe, "WHO_Europe.csv")

# Removing variables
rm(WHO_Europe)


# VIDEO 5

# Basic data analysis 

mean(WHO$Under15)
## [1] 28.73242
sd(WHO$Under15)
## [1] 10.53457
summary(WHO$Under15)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   13.12   18.72   28.65   28.73   37.75   49.99
which.min(WHO$Under15)
## [1] 86
WHO$Country[86]
## [1] Japan
## 194 Levels: Afghanistan Albania Algeria Andorra ... Zimbabwe
which.max(WHO$Under15)
## [1] 124
WHO$Country[124]
## [1] Niger
## 194 Levels: Afghanistan Albania Algeria Andorra ... Zimbabwe
# Scatterplot
plot(WHO$GNI, WHO$FertilityRate)

# Subsetting
Outliers = subset(WHO, GNI > 10000 & FertilityRate > 2.5) 
nrow(Outliers)
## [1] 7
Outliers[c("Country","GNI","FertilityRate")]
##               Country   GNI FertilityRate
## 23           Botswana 14550          2.71
## 56  Equatorial Guinea 25620          5.04
## 63              Gabon 13740          4.18
## 83             Israel 27110          2.92
## 88         Kazakhstan 11250          2.52
## 131            Panama 14510          2.52
## 150      Saudi Arabia 24700          2.76
# VIDEO 6

# Histograms
hist(WHO$CellularSubscribers)

# Boxplot
boxplot(WHO$LifeExpectancy ~ WHO$Region)

boxplot(WHO$LifeExpectancy ~ WHO$Region, xlab = "", ylab = "Life Expectancy", main = "Life Expectancy of Countries by Region")

# Summary Tables
table(WHO$Region)
## 
##                Africa              Americas Eastern Mediterranean 
##                    46                    35                    22 
##                Europe       South-East Asia       Western Pacific 
##                    53                    11                    27
tapply(WHO$Over60, WHO$Region, mean)
##                Africa              Americas Eastern Mediterranean 
##              5.220652             10.943714              5.620000 
##                Europe       South-East Asia       Western Pacific 
##             19.774906              8.769091             10.162963
tapply(WHO$LiteracyRate, WHO$Region, min)
##                Africa              Americas Eastern Mediterranean 
##                    NA                    NA                    NA 
##                Europe       South-East Asia       Western Pacific 
##                    NA                    NA                    NA
tapply(WHO$LiteracyRate, WHO$Region, min, na.rm=TRUE)
##                Africa              Americas Eastern Mediterranean 
##                  31.1                  75.2                  63.9 
##                Europe       South-East Asia       Western Pacific 
##                  95.2                  56.8                  60.6

Understanding Food_Nutritional Education with Data (recitation)

# Video 2 - Reading in the Dataset

# Get the current directory
  getwd()
## [1] "J:/rstudio files/Rstudio files_Analytics Edge_MIT"
# Read the csv file
  USDA<-read.csv("USDA.csv")
  
# Structure of the dataset
  str(USDA)
## 'data.frame':    7058 obs. of  16 variables:
##  $ ID          : int  1001 1002 1003 1004 1005 1006 1007 1008 1009 1010 ...
##  $ Description : Factor w/ 7054 levels "ABALONE,MIXED SPECIES,RAW",..: 1303 1302 1298 2303 2304 2305 2306 2307 2308 2309 ...
##  $ Calories    : int  717 717 876 353 371 334 300 376 403 387 ...
##  $ Protein     : num  0.85 0.85 0.28 21.4 23.24 ...
##  $ TotalFat    : num  81.1 81.1 99.5 28.7 29.7 ...
##  $ Carbohydrate: num  0.06 0.06 0 2.34 2.79 0.45 0.46 3.06 1.28 4.78 ...
##  $ Sodium      : int  714 827 2 1395 560 629 842 690 621 700 ...
##  $ SaturatedFat: num  51.4 50.5 61.9 18.7 18.8 ...
##  $ Cholesterol : int  215 219 256 75 94 100 72 93 105 103 ...
##  $ Sugar       : num  0.06 0.06 0 0.5 0.51 0.45 0.46 NA 0.52 NA ...
##  $ Calcium     : int  24 24 4 528 674 184 388 673 721 643 ...
##  $ Iron        : num  0.02 0.16 0 0.31 0.43 0.5 0.33 0.64 0.68 0.21 ...
##  $ Potassium   : int  24 26 5 256 136 152 187 93 98 95 ...
##  $ VitaminC    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ VitaminE    : num  2.32 2.32 2.8 0.25 0.26 0.24 0.21 NA 0.29 NA ...
##  $ VitaminD    : num  1.5 1.5 1.8 0.5 0.5 0.5 0.4 NA 0.6 NA ...
# Statistical summary
  summary(USDA)
##        ID       
##  Min.   : 1001  
##  1st Qu.: 8387  
##  Median :13294  
##  Mean   :14260  
##  3rd Qu.:18337  
##  Max.   :93600  
##                 
##                                                        Description  
##  BEEF,CHUCK,UNDER BLADE CNTR STEAK,BNLESS,DENVER CUT,LN,0" FA:   2  
##  CAMPBELL,CAMPBELL'S SEL MICROWAVEABLE BOWLS,HEA             :   2  
##  OIL,INDUSTRIAL,PALM KERNEL (HYDROGENATED),CONFECTION FAT    :   2  
##  POPCORN,OIL-POPPED,LOFAT                                    :   2  
##  ABALONE,MIXED SPECIES,RAW                                   :   1  
##  ABALONE,MXD SP,CKD,FRIED                                    :   1  
##  (Other)                                                     :7048  
##     Calories        Protein         TotalFat       Carbohydrate   
##  Min.   :  0.0   Min.   : 0.00   Min.   :  0.00   Min.   :  0.00  
##  1st Qu.: 85.0   1st Qu.: 2.29   1st Qu.:  0.72   1st Qu.:  0.00  
##  Median :181.0   Median : 8.20   Median :  4.37   Median :  7.13  
##  Mean   :219.7   Mean   :11.71   Mean   : 10.32   Mean   : 20.70  
##  3rd Qu.:331.0   3rd Qu.:20.43   3rd Qu.: 12.70   3rd Qu.: 28.17  
##  Max.   :902.0   Max.   :88.32   Max.   :100.00   Max.   :100.00  
##  NA's   :1       NA's   :1       NA's   :1        NA's   :1       
##      Sodium         SaturatedFat     Cholesterol          Sugar       
##  Min.   :    0.0   Min.   : 0.000   Min.   :   0.00   Min.   : 0.000  
##  1st Qu.:   37.0   1st Qu.: 0.172   1st Qu.:   0.00   1st Qu.: 0.000  
##  Median :   79.0   Median : 1.256   Median :   3.00   Median : 1.395  
##  Mean   :  322.1   Mean   : 3.452   Mean   :  41.55   Mean   : 8.257  
##  3rd Qu.:  386.0   3rd Qu.: 4.028   3rd Qu.:  69.00   3rd Qu.: 7.875  
##  Max.   :38758.0   Max.   :95.600   Max.   :3100.00   Max.   :99.800  
##  NA's   :84        NA's   :301      NA's   :288       NA's   :1910    
##     Calcium             Iron           Potassium          VitaminC       
##  Min.   :   0.00   Min.   :  0.000   Min.   :    0.0   Min.   :   0.000  
##  1st Qu.:   9.00   1st Qu.:  0.520   1st Qu.:  135.0   1st Qu.:   0.000  
##  Median :  19.00   Median :  1.330   Median :  250.0   Median :   0.000  
##  Mean   :  73.53   Mean   :  2.828   Mean   :  301.4   Mean   :   9.436  
##  3rd Qu.:  56.00   3rd Qu.:  2.620   3rd Qu.:  348.0   3rd Qu.:   3.100  
##  Max.   :7364.00   Max.   :123.600   Max.   :16500.0   Max.   :2400.000  
##  NA's   :136       NA's   :123       NA's   :409       NA's   :332       
##     VitaminE          VitaminD       
##  Min.   :  0.000   Min.   :  0.0000  
##  1st Qu.:  0.120   1st Qu.:  0.0000  
##  Median :  0.270   Median :  0.0000  
##  Mean   :  1.488   Mean   :  0.5769  
##  3rd Qu.:  0.710   3rd Qu.:  0.1000  
##  Max.   :149.400   Max.   :250.0000  
##  NA's   :2720      NA's   :2834
# Video 3 - Basic Data Analysis

# Vector notation
  head(USDA$Sodium,40)
##  [1]  714  827    2 1395  560  629  842  690  621  700  604  364  344  330
## [15]  330  406  321  965 1116  800  600  819  336  800  536  627  415  619
## [29]  652  628  334 1529 1602  534  876   84  125 1200 1809  192
# Finding the index of the food with highest sodium levels
  which.max(USDA$Sodium)
## [1] 265
# Get names of variables in the dataset
  names(USDA)
##  [1] "ID"           "Description"  "Calories"     "Protein"     
##  [5] "TotalFat"     "Carbohydrate" "Sodium"       "SaturatedFat"
##  [9] "Cholesterol"  "Sugar"        "Calcium"      "Iron"        
## [13] "Potassium"    "VitaminC"     "VitaminE"     "VitaminD"
# Get the name of the food with highest sodium levels
  USDA$Description[265]
## [1] SALT,TABLE
## 7054 Levels: ABALONE,MIXED SPECIES,RAW ... ZWIEBACK
# Create a subset of the foods with sodium content above 10,000mg
  HighSodium<-subset(USDA, Sodium>10000)
  
# Count the number of rows, or observations
  nrow(HighSodium)
## [1] 10
# Output names of the foods with high sodium content
  HighSodium$Description
##  [1] SALT,TABLE                                             
##  [2] SOUP,BF BROTH OR BOUILLON,PDR,DRY                      
##  [3] SOUP,BEEF BROTH,CUBED,DRY                              
##  [4] SOUP,CHICK BROTH OR BOUILLON,DRY                       
##  [5] SOUP,CHICK BROTH CUBES,DRY                             
##  [6] GRAVY,AU JUS,DRY                                       
##  [7] ADOBO FRESCO                                           
##  [8] LEAVENING AGENTS,BAKING PDR,DOUBLE-ACTING,NA AL SULFATE
##  [9] LEAVENING AGENTS,BAKING SODA                           
## [10] DESSERTS,RENNIN,TABLETS,UNSWTND                        
## 7054 Levels: ABALONE,MIXED SPECIES,RAW ... ZWIEBACK
# Finding the index of CAVIAR in the dataset
  match("CAVIAR", USDA$Description)
## [1] 4154
# Find amount of sodium in caviar
  USDA$Sodium[4154]
## [1] 1500
# Doing it in one command!
  USDA$Sodium[match("CAVIAR", USDA$Description)]
## [1] 1500
# Summary function over Sodium vector
  summary(USDA$Sodium)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##     0.0    37.0    79.0   322.1   386.0 38760.0      84
# Standard deviation
  sd(USDA$Sodium, na.rm = TRUE)
## [1] 1045.417
# Video 4 - Plots

# Scatter Plots
  plot(USDA$Protein, USDA$TotalFat)

# Add xlabel, ylabel and title
  plot(USDA$Protein, USDA$TotalFat, xlab="Protein", ylab = "Fat", main = "Protein vs Fat", col = "red")

# Creating a histogram
  hist(USDA$VitaminC, xlab = "Vitamin C (mg)", main = "Histogram of Vitamin C")

# Add limits to x-axis (zooming on)
  hist(USDA$VitaminC, xlab = "Vitamin C (mg)", main = "Histogram of Vitamin C", xlim = c(0,100))

# Specify breaks of histogram
  hist(USDA$VitaminC, xlab = "Vitamin C (mg)", main = "Histogram of Vitamin C", xlim = c(0,100), breaks=100)

  hist(USDA$VitaminC, xlab = "Vitamin C (mg)", main = "Histogram of Vitamin C", xlim = c(0,100), breaks=2000)

# Boxplots
  boxplot(USDA$Sugar, ylab = "Sugar (g)", main = "Boxplot of Sugar")

# Video 5 - Adding a variable

# Creating a variable that takes value 1 if the food has higher sodium than average, 0 otherwise
  HighSodium = as.numeric(USDA$Sodium > mean(USDA$Sodium, na.rm=TRUE))
  str(HighSodium)
##  num [1:7058] 1 1 0 1 1 1 1 1 1 1 ...
# Adding the variable to the dataset
  USDA$HighSodium = as.numeric(USDA$Sodium > mean(USDA$Sodium, na.rm=TRUE))
  str(USDA) # to see if the new variable has been added to USDA dataset
## 'data.frame':    7058 obs. of  17 variables:
##  $ ID          : int  1001 1002 1003 1004 1005 1006 1007 1008 1009 1010 ...
##  $ Description : Factor w/ 7054 levels "ABALONE,MIXED SPECIES,RAW",..: 1303 1302 1298 2303 2304 2305 2306 2307 2308 2309 ...
##  $ Calories    : int  717 717 876 353 371 334 300 376 403 387 ...
##  $ Protein     : num  0.85 0.85 0.28 21.4 23.24 ...
##  $ TotalFat    : num  81.1 81.1 99.5 28.7 29.7 ...
##  $ Carbohydrate: num  0.06 0.06 0 2.34 2.79 0.45 0.46 3.06 1.28 4.78 ...
##  $ Sodium      : int  714 827 2 1395 560 629 842 690 621 700 ...
##  $ SaturatedFat: num  51.4 50.5 61.9 18.7 18.8 ...
##  $ Cholesterol : int  215 219 256 75 94 100 72 93 105 103 ...
##  $ Sugar       : num  0.06 0.06 0 0.5 0.51 0.45 0.46 NA 0.52 NA ...
##  $ Calcium     : int  24 24 4 528 674 184 388 673 721 643 ...
##  $ Iron        : num  0.02 0.16 0 0.31 0.43 0.5 0.33 0.64 0.68 0.21 ...
##  $ Potassium   : int  24 26 5 256 136 152 187 93 98 95 ...
##  $ VitaminC    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ VitaminE    : num  2.32 2.32 2.8 0.25 0.26 0.24 0.21 NA 0.29 NA ...
##  $ VitaminD    : num  1.5 1.5 1.8 0.5 0.5 0.5 0.4 NA 0.6 NA ...
##  $ HighSodium  : num  1 1 0 1 1 1 1 1 1 1 ...
# Similarly for HighProtein, HighCarbs, HighFat
  USDA$HighCarbs = as.numeric(USDA$Carbohydrate > mean(USDA$Carbohydrate, na.rm=TRUE))
  USDA$HighProtein = as.numeric(USDA$Protein > mean(USDA$Protein, na.rm=TRUE))
  USDA$HighFat = as.numeric(USDA$TotalFat > mean(USDA$TotalFat, na.rm=TRUE))
  str(USDA) # to see if the new variable has been added to USDA dataset
## 'data.frame':    7058 obs. of  20 variables:
##  $ ID          : int  1001 1002 1003 1004 1005 1006 1007 1008 1009 1010 ...
##  $ Description : Factor w/ 7054 levels "ABALONE,MIXED SPECIES,RAW",..: 1303 1302 1298 2303 2304 2305 2306 2307 2308 2309 ...
##  $ Calories    : int  717 717 876 353 371 334 300 376 403 387 ...
##  $ Protein     : num  0.85 0.85 0.28 21.4 23.24 ...
##  $ TotalFat    : num  81.1 81.1 99.5 28.7 29.7 ...
##  $ Carbohydrate: num  0.06 0.06 0 2.34 2.79 0.45 0.46 3.06 1.28 4.78 ...
##  $ Sodium      : int  714 827 2 1395 560 629 842 690 621 700 ...
##  $ SaturatedFat: num  51.4 50.5 61.9 18.7 18.8 ...
##  $ Cholesterol : int  215 219 256 75 94 100 72 93 105 103 ...
##  $ Sugar       : num  0.06 0.06 0 0.5 0.51 0.45 0.46 NA 0.52 NA ...
##  $ Calcium     : int  24 24 4 528 674 184 388 673 721 643 ...
##  $ Iron        : num  0.02 0.16 0 0.31 0.43 0.5 0.33 0.64 0.68 0.21 ...
##  $ Potassium   : int  24 26 5 256 136 152 187 93 98 95 ...
##  $ VitaminC    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ VitaminE    : num  2.32 2.32 2.8 0.25 0.26 0.24 0.21 NA 0.29 NA ...
##  $ VitaminD    : num  1.5 1.5 1.8 0.5 0.5 0.5 0.4 NA 0.6 NA ...
##  $ HighSodium  : num  1 1 0 1 1 1 1 1 1 1 ...
##  $ HighCarbs   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ HighProtein : num  0 0 0 1 1 1 1 1 1 1 ...
##  $ HighFat     : num  1 1 1 1 1 1 1 1 1 1 ...
# Video 6 - Summary Tables

# How many foods have higher sodium level than average?
  table(USDA$HighSodium)
## 
##    0    1 
## 4884 2090
# How many foods have both high sodium and high fat?
  table(USDA$HighSodium, USDA$HighFat)
##    
##        0    1
##   0 3529 1355
##   1 1378  712
# Average amount of iron sorted by high and low protein?
  tapply(USDA$Iron, USDA$HighProtein, mean, na.rm=TRUE)
##        0        1 
## 2.558945 3.197294
# Maximum level of Vitamin C in foods with high and low carbs?
  tapply(USDA$VitaminC, USDA$HighCarbs, max, na.rm=TRUE)
##      0      1 
## 1677.6 2400.0
# Using summary function with tapply
  tapply(USDA$VitaminC, USDA$HighCarbs, summary, na.rm=TRUE)
## $`0`
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.     NA's 
##    0.000    0.000    0.000    6.364    2.800 1678.000      248 
## 
## $`1`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00    0.00    0.20   16.31    4.50 2400.00      83

Assignment 1: AN ANALYTICAL DETECTIVE Solution

AN ANALYTICAL DETECTIVE

Crime is an international concern, but it is documented and handled in very different ways in different countries. In the United States, violent crimes and property crimes are recorded by the Federal Bureau of Investigation (FBI). Additionally, each city documents crime, and some cities release data regarding crime rates. The city of Chicago, Illinois releases crime data from 2001 onward online.

Chicago is the third most populous city in the United States, with a population of over 2.7 million people. The city of Chicago is shown in the map below, with the state of Illinois highlighted in red. There are two main types of crimes: violent crimes, and property crimes. In this problem, we’ll focus on one specific type of property crime, called “motor vehicle theft” (sometimes referred to as grand theft auto). This is the act of stealing, or attempting to steal, a car. In this problem, we’ll use some basic data analysis in R to understand the motor vehicle thefts in Chicago.

#PROBLEM 1.1 - LOADING THE DATA
  
#Let's import the dataset
mvt<-read.csv("mvtWeek1.csv") 
str(mvt)
## 'data.frame':    191641 obs. of  11 variables:
##  $ ID                 : int  8951354 8951141 8952745 8952223 8951608 8950793 8950760 8951611 8951802 8950706 ...
##  $ Date               : Factor w/ 131680 levels "1/1/01 0:01",..: 42824 42823 42823 42823 42822 42821 42820 42819 42817 42816 ...
##  $ LocationDescription: Factor w/ 78 levels "ABANDONED BUILDING",..: 72 72 62 72 72 72 72 72 72 72 ...
##  $ Arrest             : logi  FALSE FALSE FALSE FALSE FALSE TRUE ...
##  $ Domestic           : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ Beat               : int  623 1213 1622 724 211 2521 423 231 1021 1215 ...
##  $ District           : int  6 12 16 7 2 25 4 2 10 12 ...
##  $ CommunityArea      : int  69 24 11 67 35 19 48 40 29 24 ...
##  $ Year               : int  2012 2012 2012 2012 2012 2012 2012 2012 2012 2012 ...
##  $ Latitude           : num  41.8 41.9 42 41.8 41.8 ...
##  $ Longitude          : num  -87.6 -87.7 -87.8 -87.7 -87.6 ...
summary(mvt)
##        ID                      Date       
##  Min.   :1310022   5/16/08 0:00  :    11  
##  1st Qu.:2832144   10/17/01 22:00:    10  
##  Median :4762956   4/13/04 21:00 :    10  
##  Mean   :4968629   9/17/05 22:00 :    10  
##  3rd Qu.:7201878   10/12/01 22:00:     9  
##  Max.   :9181151   10/13/01 22:00:     9  
##                    (Other)       :191582  
##                      LocationDescription   Arrest         Domestic      
##  STREET                        :156564   Mode :logical   Mode :logical  
##  PARKING LOT/GARAGE(NON.RESID.): 14852   FALSE:176105    FALSE:191226   
##  OTHER                         :  4573   TRUE :15536     TRUE :415      
##  ALLEY                         :  2308   NA's :0         NA's :0        
##  GAS STATION                   :  2111                                  
##  DRIVEWAY - RESIDENTIAL        :  1675                                  
##  (Other)                       :  9558                                  
##       Beat         District     CommunityArea        Year     
##  Min.   : 111   Min.   : 1.00   Min.   : 0      Min.   :2001  
##  1st Qu.: 722   1st Qu.: 6.00   1st Qu.:22      1st Qu.:2003  
##  Median :1121   Median :10.00   Median :32      Median :2006  
##  Mean   :1259   Mean   :11.82   Mean   :38      Mean   :2006  
##  3rd Qu.:1733   3rd Qu.:17.00   3rd Qu.:60      3rd Qu.:2009  
##  Max.   :2535   Max.   :31.00   Max.   :77      Max.   :2012  
##                 NA's   :43056   NA's   :24616                 
##     Latitude       Longitude     
##  Min.   :41.64   Min.   :-87.93  
##  1st Qu.:41.77   1st Qu.:-87.72  
##  Median :41.85   Median :-87.68  
##  Mean   :41.84   Mean   :-87.68  
##  3rd Qu.:41.92   3rd Qu.:-87.64  
##  Max.   :42.02   Max.   :-87.52  
##  NA's   :2276    NA's   :2276
#How many rows of data (observations) are in this dataset?
nrow(mvt)
## [1] 191641
#Ans:191641   #If you type str(mvt) in the R console, the first row of output says that this is a data frame with 191,641 observations.

#How many variables are in this dataset?
ncol(mvt)
## [1] 11
#Ans:11      #If you type str(mvt) in the R console, the first row of output says that this is a data frame with 11 variables.


#Using the "max" function, what is the maximum value of the variable "ID"?
max(mvt$ID)
## [1] 9181151
#Ans:9181151

#What is the minimum value of the variable "Beat"?
min(mvt$Beat)   # using summary(mvt$Beat)
## [1] 111
#Ans:111

#How many observations have value TRUE in the Arrest variable (this is the number of crimes for which an arrest was made)?
table(mvt$Arrest)  # or 
## 
##  FALSE   TRUE 
## 176105  15536
summary(mvt$Arrest)
##    Mode   FALSE    TRUE    NA's 
## logical  176105   15536       0
#A more detailed summary using the describe function from the Hmisc package
library(Hmisc)
## Loading required package: survival
## Loading required package: Formula
## 
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:dplyr':
## 
##     combine, src, summarize
## The following objects are masked from 'package:base':
## 
##     format.pval, round.POSIXt, trunc.POSIXt, units
describe(mvt$Arrest)
## mvt$Arrest 
##       n missing  unique 
##  191641       0       2 
## 
## FALSE (176105, 92%), TRUE (15536, 8%)
#Ans:15536

#How many observations have a LocationDescription value of ALLEY?
table(mvt$LocationDescription=="ALLEY")  # or summary(mvt)
## 
##  FALSE   TRUE 
## 189333   2308
#Ans:2308

#######################################################

#PROBLEM 2.1 - UNDERSTANDING DATES IN R

#In many datasets, like this one, you have a date field. Unfortunately, R does not automatically recognize entries that look like dates. We need to use a function in R to extract the date and time. Take a look at the first entry of Date (remember to use square brackets when looking at a certain entry of a variable).

#In what format are the entries in the variable Date?
mvt$Date[1]
## [1] 12/31/12 23:15
## 131680 Levels: 1/1/01 0:01 1/1/01 0:05 1/1/01 0:30 1/1/01 1:17 ... 9/9/12 9:50
#Ans:Month/Day/Year Hour:Minute      (12/31/12 23:15)

#Now, let's convert these characters into a Date object in R. In your R console, type
DateConvert<-as.Date(strptime(mvt$Date, "%m/%d/%y %H:%M"))
#This converts the variable "Date" into a Date object in R. Take a look at the variable DateConvert using the summary function.
summary(DateConvert)
##         Min.      1st Qu.       Median         Mean      3rd Qu. 
## "2001-01-01" "2003-07-10" "2006-05-21" "2006-08-23" "2009-10-24" 
##         Max. 
## "2012-12-31"
#What is the month and year of the median date in our dataset? Enter your answer as "Month Year", without the quotes. (Ex: if the answer was 2008-03-28, you would give the answer "March 2008", without the quotes.)
#Ans: May 2006

#Now, let's extract the month and the day of the week, and add these variables to our data frame mvt. We can do this with two simple functions. Type the following commands in R:
mvt$Month<-months(DateConvert)
mvt$Weekday<-weekdays(DateConvert)
#This creates two new variables in our data frame, Month and Weekday, and sets them equal to the month and weekday values that we can extract from the Date object. Lastly, replace the old Date variable with DateConvert by typing:
mvt$Date<-DateConvert

str(mvt)  # see the new var added to the mvt dataset
## 'data.frame':    191641 obs. of  13 variables:
##  $ ID                 : int  8951354 8951141 8952745 8952223 8951608 8950793 8950760 8951611 8951802 8950706 ...
##  $ Date               : Date, format: "2012-12-31" "2012-12-31" ...
##  $ LocationDescription: Factor w/ 78 levels "ABANDONED BUILDING",..: 72 72 62 72 72 72 72 72 72 72 ...
##  $ Arrest             : logi  FALSE FALSE FALSE FALSE FALSE TRUE ...
##  $ Domestic           : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ Beat               : int  623 1213 1622 724 211 2521 423 231 1021 1215 ...
##  $ District           : int  6 12 16 7 2 25 4 2 10 12 ...
##  $ CommunityArea      : int  69 24 11 67 35 19 48 40 29 24 ...
##  $ Year               : int  2012 2012 2012 2012 2012 2012 2012 2012 2012 2012 ...
##  $ Latitude           : num  41.8 41.9 42 41.8 41.8 ...
##  $ Longitude          : num  -87.6 -87.7 -87.8 -87.7 -87.6 ...
##  $ Month              : chr  "December" "December" "December" "December" ...
##  $ Weekday            : chr  "Monday" "Monday" "Monday" "Monday" ...
#Using the table command, answer the following questions.

#In which month did the fewest motor vehicle thefts occur?
min(table(mvt$Month))
## [1] 13511
table(mvt$Month)
## 
##     April    August  December  February   January      July      June 
##     15280     16572     16426     13511     16047     16801     16002 
##     March       May  November   October September 
##     15758     16035     16063     17086     16060
#Ans:February

#On which weekday did the most motor vehicle thefts occur?
max(table(mvt$Weekday))
## [1] 29284
table(mvt$Weekday)
## 
##    Friday    Monday  Saturday    Sunday  Thursday   Tuesday Wednesday 
##     29284     27397     27118     26316     27319     26791     27416
#Ans:Friday

#Each observation in the dataset represents a motor vehicle theft, and the Arrest variable indicates whether an arrest was later made for this theft. Which month has the largest number of motor vehicle thefts for which an arrest was made?
max(table(mvt$Month[mvt$Arrest==TRUE]))  
## [1] 1435
table(mvt$Month[mvt$Arrest==TRUE])    # or
## 
##     April    August  December  February   January      July      June 
##      1252      1329      1397      1238      1435      1324      1230 
##     March       May  November   October September 
##      1298      1187      1256      1342      1248
table(mvt$Arrest,mvt$Month)   
##        
##         April August December February January  July  June March   May
##   FALSE 14028  15243    15029    12273   14612 15477 14772 14460 14848
##   TRUE   1252   1329     1397     1238    1435  1324  1230  1298  1187
##        
##         November October September
##   FALSE    14807   15744     14812
##   TRUE      1256    1342      1248
#Ans:January
#########################################################

#PROBLEM 3.1 - VISUALIZING CRIME TRENDS

#Now, let's make some plots to help us better understand how crime has changed over time in Chicago.
#First, let's make a histogram of the variable Date. We'll add an extra argument, to specify the number of bars we want in our histogram. In your R console, type
hist(mvt$Date, breaks=100)

#In general, does it look like crime increases or decreases from 2002 - 2012?
#Ans:decreases
#While there is not a clear trend, it looks like crime generally decreases.

#In general, does it look like crime increases or decreases from 2005 - 2008?
#Ans:Decreases 
#In this time period, there is a clear downward trend in crime.

#In general, does it look like crime increases or decreases from 2009 - 2011?
#Ans:Increases
#In this time period, there is a clear upward trend in crime.

#Now, let's see how arrests have changed over time. Create a boxplot of the variable "Date", sorted by the variable "Arrest".
#Does it look like there were more crimes for which arrests were made in the first half of the time period or the second half of the time period? (Note that the time period is from 2001 to 2012, so the middle of the time period is the beginning of 2007.)
boxplot(mvt$Date~mvt$Arrest,xlab="Arrests",ylab="Date in Year")

#Ans:First half
#If you look at the boxplot, the one for Arrest=TRUE is definitely skewed towards the bottom of the plot, meaning that there were more crimes for which arrests were made in the first half of the time period.

#For what proportion of motor vehicle thefts in 2001 was an arrest made?
prop.table(table(mvt$Arrest[mvt$Year=="2001"]))
## 
##     FALSE      TRUE 
## 0.8958827 0.1041173
#Ans:0.1041173  
#EXPLANATION:If you create a table using the command table(mvt$Arrest, mvt$Year), the column for 2001 has 2152 observations with Arrest=TRUE and 18517 observations with Arrest=FALSE. The fraction of motor vehicle thefts in 2001 for which an arrest was made is thus 2152/(2152+18517) = 0.1041173.

#For what proportion of motor vehicle thefts in 2007 was an arrest made?
prop.table(table(mvt$Arrest[mvt$Year=="2007"]))
## 
##      FALSE       TRUE 
## 0.91512605 0.08487395
#Ans:0.08487395  
#EXPLANATION:If you create a table using the command table(mvt$Arrest, mvt$Year), the column for 2007 has 1212 observations with Arrest=TRUE and 13068 observations with Arrest=FALSE. The fraction of motor vehicle thefts in 2007 for which an arrest was made is thus 1212/(1212+13068) = 0.08487395.

#For what proportion of motor vehicle thefts in 2012 was an arrest made?
prop.table(table(mvt$Arrest[mvt$Year=="2012"]))
## 
##      FALSE       TRUE 
## 0.96097076 0.03902924
#Ans:0.03902924
#EXPLANATION:If you create a table using the command table(mvt$Arrest, mvt$Year), the column for 2012 has 550 observations with Arrest=TRUE and 13542 observations with Arrest=FALSE. The fraction of motor vehicle thefts in 2012 for which an arrest was made is thus 550/(550+13542) = 0.03902924.
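#The explanations above describe an alternative: build one two-way table of Arrest by Year and divide the counts for the year of interest (a sketch; the table output is omitted here):
table(mvt$Arrest, mvt$Year)
#e.g. for 2001:
2152/(2152+18517)
## [1] 0.1041173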

#########################################

#PROBLEM 4.1 - POPULAR LOCATIONS

#Analyzing this data could be useful to the Chicago Police Department when deciding where to allocate resources. If they want to increase the number of arrests that are made for motor vehicle thefts, where should they focus their efforts? We want to find the top five locations where motor vehicle thefts occur. If you create a table of the LocationDescription variable, it is unfortunately very hard to read since there are 78 different locations in the data set. By using the sort function, we can view this same table, but sorted by the number of observations in each category. In your R console, type:
sort(table(mvt$LocationDescription))
## 
##     AIRPORT BUILDING NON-TERMINAL - SECURE AREA 
##                                               1 
##                  AIRPORT EXTERIOR - SECURE AREA 
##                                               1 
##                                 ANIMAL HOSPITAL 
##                                               1 
##                                 APPLIANCE STORE 
##                                               1 
##                                       CTA TRAIN 
##                                               1 
##                         JAIL / LOCK-UP FACILITY 
##                                               1 
##                                       NEWSSTAND 
##                                               1 
##                                          BRIDGE 
##                                               2 
##               COLLEGE/UNIVERSITY RESIDENCE HALL 
##                                               2 
##                               CURRENCY EXCHANGE 
##                                               2 
##                                   BOWLING ALLEY 
##                                               3 
##                                  CLEANING STORE 
##                                               3 
##                           MEDICAL/DENTAL OFFICE 
##                                               3 
##                              ABANDONED BUILDING 
##                                               4 
## AIRPORT BUILDING NON-TERMINAL - NON-SECURE AREA 
##                                               4 
##                                      BARBERSHOP 
##                                               4 
##                  LAKEFRONT/WATERFRONT/RIVERBANK 
##                                               4 
##                                         LIBRARY 
##                                               4 
##                                SAVINGS AND LOAN 
##                                               4 
##  AIRPORT TERMINAL UPPER LEVEL - NON-SECURE AREA 
##                                               5 
##                                   CHA APARTMENT 
##                                               5 
##                                 DAY CARE CENTER 
##                                               5 
##                                    FIRE STATION 
##                                               5 
##                                 FOREST PRESERVE 
##                                               6 
##                                            BANK 
##                                               7 
##                               CONVENIENCE STORE 
##                                               7 
##                                      DRUG STORE 
##                                               8 
##                 OTHER COMMERCIAL TRANSPORTATION 
##                                               8 
##                                   ATHLETIC CLUB 
##                                               9 
##                   AIRPORT VENDING ESTABLISHMENT 
##                                              10 
##                             AIRPORT PARKING LOT 
##                                              11 
##                       SCHOOL, PRIVATE, BUILDING 
##                                              14 
##                             TAVERN/LIQUOR STORE 
##                                              14 
##                  FACTORY/MANUFACTURING BUILDING 
##                                              16 
##                                   BAR OR TAVERN 
##                                              17 
##                                       WAREHOUSE 
##                                              17 
##                             MOVIE HOUSE/THEATER 
##                                              18 
##                         RESIDENCE PORCH/HALLWAY 
##                                              18 
##                    NURSING HOME/RETIREMENT HOME 
##                                              21 
##                                         TAXICAB 
##                                              21 
##                                DEPARTMENT STORE 
##                                              22 
##                              HIGHWAY/EXPRESSWAY 
##                                              22 
##                        SCHOOL, PRIVATE, GROUNDS 
##                                              23 
##                              VEHICLE-COMMERCIAL 
##                                              23 
##              AIRPORT EXTERIOR - NON-SECURE AREA 
##                                              24 
##               OTHER RAILROAD PROP / TRAIN DEPOT 
##                                              28 
##                              SMALL RETAIL STORE 
##                                              33 
##                               CONSTRUCTION SITE 
##                                              35 
##                                        CAR WASH 
##                                              44 
##                      COLLEGE/UNIVERSITY GROUNDS 
##                                              47 
##                    GOVERNMENT BUILDING/PROPERTY 
##                                              48 
##                                      RESTAURANT 
##                                              49 
##               CHURCH/SYNAGOGUE/PLACE OF WORSHIP 
##                                              56 
##                              GROCERY FOOD STORE 
##                                              80 
##                       HOSPITAL BUILDING/GROUNDS 
##                                             101 
##                        SCHOOL, PUBLIC, BUILDING 
##                                             114 
##                                     HOTEL/MOTEL 
##                                             124 
##                    COMMERCIAL / BUSINESS OFFICE 
##                                             126 
##                     CTA GARAGE / OTHER PROPERTY 
##                                             148 
##                            SPORTS ARENA/STADIUM 
##                                             166 
##                                       APARTMENT 
##                                             184 
##                         SCHOOL, PUBLIC, GROUNDS 
##                                             206 
##                                   PARK PROPERTY 
##                                             255 
##                 POLICE FACILITY/VEH PARKING LOT 
##                                             266 
##                                AIRPORT/AIRCRAFT 
##                                             363 
##                         CHA PARKING LOT/GROUNDS 
##                                             405 
##                                        SIDEWALK 
##                                             462 
##                          VEHICLE NON-COMMERCIAL 
##                                             817 
##                                 VACANT LOT/LAND 
##                                             985 
##                                RESIDENCE-GARAGE 
##                                            1176 
##                                       RESIDENCE 
##                                            1302 
##                   RESIDENTIAL YARD (FRONT/BACK) 
##                                            1536 
##                          DRIVEWAY - RESIDENTIAL 
##                                            1675 
##                                     GAS STATION 
##                                            2111 
##                                           ALLEY 
##                                            2308 
##                                           OTHER 
##                                            4573 
##                  PARKING LOT/GARAGE(NON.RESID.) 
##                                           14852 
##                                          STREET 
##                                          156564
#Which locations are the top five locations for motor vehicle thefts, excluding the "Other" category? You should select 5 of the following options.
#Ans:Street, Parking Lot/Garage (Non. Resid.), Alley, Gas Station, and Driveway - Residential

#Create a subset of your data, only taking observations for which the theft happened in one of these five locations, and call this new data set "Top5". To do this, you can use the | symbol. In lecture, we used the & symbol to use two criteria to make a subset of the data. To only take observations that have a certain value in one variable or the other, the | character can be used in place of the & symbol. This is also called a logical "or" operation.

#How many observations are in Top5?
Top5<-mvt[mvt$LocationDescription %in%  c("STREET","ALLEY","PARKING LOT/GARAGE(NON.RESID.)","GAS STATION","DRIVEWAY - RESIDENTIAL"),]

#or 

TopLocations<-c("STREET", "PARKING LOT/GARAGE(NON.RESID.)", "ALLEY", "GAS STATION", "DRIVEWAY - RESIDENTIAL")

Top5<-subset(mvt, LocationDescription %in% TopLocations)

#or 

Top5<-subset(mvt, LocationDescription=="STREET" | LocationDescription=="PARKING LOT/GARAGE(NON.RESID.)" | LocationDescription=="ALLEY" | LocationDescription=="GAS STATION" | LocationDescription=="DRIVEWAY - RESIDENTIAL")

str(Top5)
## 'data.frame':    177510 obs. of  13 variables:
##  $ ID                 : int  8951354 8951141 8952223 8951608 8950793 8950760 8951611 8951802 8950706 8951585 ...
##  $ Date               : Date, format: "2012-12-31" "2012-12-31" ...
##  $ LocationDescription: Factor w/ 78 levels "ABANDONED BUILDING",..: 72 72 72 72 72 72 72 72 72 72 ...
##  $ Arrest             : logi  FALSE FALSE FALSE FALSE TRUE FALSE ...
##  $ Domestic           : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ Beat               : int  623 1213 724 211 2521 423 231 1021 1215 1011 ...
##  $ District           : int  6 12 7 2 25 4 2 10 12 10 ...
##  $ CommunityArea      : int  69 24 67 35 19 48 40 29 24 29 ...
##  $ Year               : int  2012 2012 2012 2012 2012 2012 2012 2012 2012 2012 ...
##  $ Latitude           : num  41.8 41.9 41.8 41.8 41.9 ...
##  $ Longitude          : num  -87.6 -87.7 -87.7 -87.6 -87.8 ...
##  $ Month              : chr  "December" "December" "December" "December" ...
##  $ Weekday            : chr  "Monday" "Monday" "Monday" "Monday" ...
margin.table(sort(table(Top5$LocationDescription)))
## [1] 177510
#Ans:177510
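#Equivalently, since Top5 contains only these five locations, the count is just the number of rows:
nrow(Top5)
## [1] 177510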

#####################################################

#R will remember the other categories of the LocationDescription variable from the original dataset, so running table(Top5$LocationDescription) will have a lot of unnecessary output. To make our tables a bit nicer to read, we can refresh this factor variable. In your R console, type:

Top5$LocationDescription<-factor(Top5$LocationDescription)

#If you run the str or table function on Top5 now, you should see that LocationDescription now only has 5 values, as we expect.
str(Top5)
## 'data.frame':    177510 obs. of  13 variables:
##  $ ID                 : int  8951354 8951141 8952223 8951608 8950793 8950760 8951611 8951802 8950706 8951585 ...
##  $ Date               : Date, format: "2012-12-31" "2012-12-31" ...
##  $ LocationDescription: Factor w/ 5 levels "ALLEY","DRIVEWAY - RESIDENTIAL",..: 5 5 5 5 5 5 5 5 5 5 ...
##  $ Arrest             : logi  FALSE FALSE FALSE FALSE TRUE FALSE ...
##  $ Domestic           : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ Beat               : int  623 1213 724 211 2521 423 231 1021 1215 1011 ...
##  $ District           : int  6 12 7 2 25 4 2 10 12 10 ...
##  $ CommunityArea      : int  69 24 67 35 19 48 40 29 24 29 ...
##  $ Year               : int  2012 2012 2012 2012 2012 2012 2012 2012 2012 2012 ...
##  $ Latitude           : num  41.8 41.9 41.8 41.8 41.9 ...
##  $ Longitude          : num  -87.6 -87.7 -87.7 -87.6 -87.8 ...
##  $ Month              : chr  "December" "December" "December" "December" ...
##  $ Weekday            : chr  "Monday" "Monday" "Monday" "Monday" ...
table(Top5$LocationDescription)
## 
##                          ALLEY         DRIVEWAY - RESIDENTIAL 
##                           2308                           1675 
##                    GAS STATION PARKING LOT/GARAGE(NON.RESID.) 
##                           2111                          14852 
##                         STREET 
##                         156564
#Use the Top5 data frame to answer the remaining questions.

#One of the locations has a much higher arrest rate than the other locations. Which is it? Please enter the text in exactly the same way as how it looks in the answer options for Problem 4.1.
table(Top5$LocationDescription, Top5$Arrest)
##                                 
##                                   FALSE   TRUE
##   ALLEY                            2059    249
##   DRIVEWAY - RESIDENTIAL           1543    132
##   GAS STATION                      1672    439
##   PARKING LOT/GARAGE(NON.RESID.)  13249   1603
##   STREET                         144969  11595
prop.table(table(Top5$LocationDescription, Top5$Arrest),1)#calculation prop. row-wise
##                                 
##                                       FALSE       TRUE
##   ALLEY                          0.89211438 0.10788562
##   DRIVEWAY - RESIDENTIAL         0.92119403 0.07880597
##   GAS STATION                    0.79204169 0.20795831
##   PARKING LOT/GARAGE(NON.RESID.) 0.89206841 0.10793159
##   STREET                         0.92594083 0.07405917
#Ans:Gas Station
#EXPLANATION:If you create a table of LocationDescription compared to Arrest, table(Top5$LocationDescription, Top5$Arrest), you can then compute the fraction of motor vehicle thefts that resulted in arrests at each location. Gas Station has by far the highest percentage of arrests, with over 20% of motor vehicle thefts resulting in an arrest.

#On which day of the week do the most motor vehicle thefts at gas stations happen?
max(table(Top5$Weekday[Top5$LocationDescription=="GAS STATION"])) # to find the max value
## [1] 338
table(Top5$Weekday[Top5$LocationDescription=="GAS STATION"])
## 
##    Friday    Monday  Saturday    Sunday  Thursday   Tuesday Wednesday 
##       332       280       338       336       282       270       273
#Ans:Saturday
#EXPLANATION:This can be read from table(Top5$LocationDescription, Top5$Weekday).
table(Top5$LocationDescription,Top5$Weekday)
##                                 
##                                  Friday Monday Saturday Sunday Thursday
##   ALLEY                             385    320      341    307      315
##   DRIVEWAY - RESIDENTIAL            257    255      202    221      263
##   GAS STATION                       332    280      338    336      282
##   PARKING LOT/GARAGE(NON.RESID.)   2331   2128     2199   1936     2082
##   STREET                          23773  22305    22175  21756    22296
##                                 
##                                  Tuesday Wednesday
##   ALLEY                              323       317
##   DRIVEWAY - RESIDENTIAL             243       234
##   GAS STATION                        270       273
##   PARKING LOT/GARAGE(NON.RESID.)    2073      2103
##   STREET                           21888     22371
#On which day of the week do the fewest motor vehicle thefts in residential driveways happen?
min(table(Top5$Weekday[Top5$LocationDescription=="DRIVEWAY - RESIDENTIAL"])) # to find the min value
## [1] 202
table(Top5$Weekday[Top5$LocationDescription=="DRIVEWAY - RESIDENTIAL"])
## 
##    Friday    Monday  Saturday    Sunday  Thursday   Tuesday Wednesday 
##       257       255       202       221       263       243       234
#Ans:Saturday

Assignment 1: STOCK DYNAMICS Solution

A stock market is where buyers and sellers trade shares of a company, and it is one of the most popular ways for individuals and companies to invest money. The size of the world stock market is now estimated to be in the trillions. The largest stock market in the world is the New York Stock Exchange (NYSE), located in New York City. About 2,800 companies are listed on the NYSE. In this problem, we'll look at the monthly stock prices of five of these companies: IBM, General Electric (GE), Procter and Gamble, Coca-Cola, and Boeing. The data used in this problem comes from Infochimps.

In this problem, we’ll take a look at how the stock dynamics of these companies have changed over time.

#Let's import the datasets

IBM<-read.csv("IBMStock.csv")
GE<-read.csv("GEStock.csv")
ProcterGamble<-read.csv("ProcterGambleStock.csv")
CocaCola<-read.csv("CocaColaStock.csv")
Boeing<-read.csv("BoeingStock.csv")

#PROBLEM 1.1 - SUMMARY STATISTICS

#Before working with these data sets, we need to convert the dates into a format that R can understand. Take a look at the structure of one of the datasets using the str function. Right now, the date variable is stored as a factor. 
str(IBM)
## 'data.frame':    480 obs. of  2 variables:
##  $ Date      : Factor w/ 480 levels "1/1/00","1/1/01",..: 11 171 211 251 291 331 371 411 451 51 ...
##  $ StockPrice: num  360 347 327 320 270 ...
head(IBM)
##     Date StockPrice
## 1 1/1/70   360.3190
## 2 2/1/70   346.7237
## 3 3/1/70   327.3457
## 4 4/1/70   319.8527
## 5 5/1/70   270.3752
## 6 6/1/70   267.2050
IBM$Date[1]
## [1] 1/1/70
## 480 Levels: 1/1/00 1/1/01 1/1/02 1/1/03 1/1/04 1/1/05 1/1/06 ... 9/1/99
#We can convert this to a "Date" object in R by using the following five commands (one for each data set):

IBM$Date<-as.Date(IBM$Date, "%m/%d/%y")

GE$Date<-as.Date(GE$Date, "%m/%d/%y")

CocaCola$Date<-as.Date(CocaCola$Date, "%m/%d/%y")

ProcterGamble$Date<-as.Date(ProcterGamble$Date, "%m/%d/%y")

Boeing$Date<-as.Date(Boeing$Date, "%m/%d/%y")

#Our five datasets all have the same number of observations. How many observations are there in each data set?
str(GE)
## 'data.frame':    480 obs. of  2 variables:
##  $ Date      : Date, format: "1970-01-01" "1970-02-01" ...
##  $ StockPrice: num  74.3 70 72.2 74.3 66.7 ...
#Ans:480
#EXPLANATION:Using the str function, we can see that each data set has 480 observations. We have monthly data for 40 years, so there are 12*40 = 480 observations.
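#As a quick sanity check (a small sketch, not part of the graded solution), nrow() should report 480 for every data frame:
sapply(list(IBM=IBM, GE=GE, CocaCola=CocaCola, ProcterGamble=ProcterGamble, Boeing=Boeing), nrow)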

#What is the earliest year in our datasets?
summary(IBM)   # or head(sort(IBM$Date))
##       Date              StockPrice    
##  Min.   :1970-01-01   Min.   : 43.40  
##  1st Qu.:1979-12-24   1st Qu.: 88.34  
##  Median :1989-12-16   Median :112.11  
##  Mean   :1989-12-15   Mean   :144.38  
##  3rd Qu.:1999-12-08   3rd Qu.:165.41  
##  Max.   :2009-12-01   Max.   :438.90
#Ans:1970

#What is the latest year in our datasets?
summary(GE)
##       Date              StockPrice     
##  Min.   :1970-01-01   Min.   :  9.294  
##  1st Qu.:1979-12-24   1st Qu.: 44.214  
##  Median :1989-12-16   Median : 55.812  
##  Mean   :1989-12-15   Mean   : 59.303  
##  3rd Qu.:1999-12-08   3rd Qu.: 72.226  
##  Max.   :2009-12-01   Max.   :156.844
#Ans:2009
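#A more compact way to read off both the earliest and the latest year at once (a minimal sketch): range() returns the minimum and maximum Date, and format(..., "%Y") extracts the year, giving "1970" and "2009".
format(range(IBM$Date), "%Y")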

#What is the mean stock price of IBM over this time period?
mean(IBM$StockPrice)   # or  summary(IBM)
## [1] 144.375
#Ans:144.375

#What is the minimum stock price of General Electric (GE) over this time period?
min(GE$StockPrice)  #or summary(GE)
## [1] 9.293636
#Ans:9.293636

#What is the maximum stock price of Coca-Cola over this time period?
max(CocaCola$StockPrice)  #or summary(CocaCola)
## [1] 146.5843
#Ans:146.5843

#What is the median stock price of Boeing over this time period?
median(Boeing$StockPrice)  #or summary(Boeing)
## [1] 44.8834
#Ans:44.8834

#What is the standard deviation of the stock price of Procter & Gamble over this time period?
sd(ProcterGamble$StockPrice)
## [1] 18.19414
#Ans:18.19414

####################################################

#PROBLEM 2.1 - VISUALIZING STOCK DYNAMICS

#Let's plot the stock prices to see if we can visualize trends in stock prices during this time period. Using the plot function, plot the Date on the x-axis and the StockPrice on the y-axis, for Coca-Cola.

plot(CocaCola$Date,CocaCola$StockPrice)

#This plots our observations as points, but we would really like to see a line instead, since this is a continuous time period. To do this, add the argument type="l" to your plot command, and re-generate the plot (the character in quotes is the letter l, for line). You should now see a line plot of the Coca-Cola stock price.

plot(CocaCola$Date,CocaCola$StockPrice,type="l")

#Around what year did Coca-Cola have its highest stock price in this time period?
CocaCola$Date[which.max(CocaCola$StockPrice)]
## [1] "1973-01-01"
#Ans:1973


#Around what year did Coca-Cola have its lowest stock price in this time period?
CocaCola$Date[which.min(CocaCola$StockPrice)]
## [1] "1980-03-01"
#Ans:1980

#Now, let's add the line for Procter & Gamble too. You can add a line to a plot in R by using the lines function instead of the plot function. Keeping the plot for Coca-Cola open, type in your R console:

lines(ProcterGamble$Date, ProcterGamble$StockPrice)

#Unfortunately, it's hard to tell which line is which. Let's fix this by giving each line a color. First, re-run the plot command for Coca-Cola, but add the argument col="red". 
plot(CocaCola$Date,CocaCola$StockPrice,type="l",col="red")
#You should see the plot for Coca-Cola show up again, but this time in red. Now, let's add the Procter & Gamble line (using the lines function like we did before), adding the argument col="blue". You should now see in your plot the Coca-Cola stock price in red, and the Procter & Gamble stock price in blue.
lines(ProcterGamble$Date, ProcterGamble$StockPrice,col="blue")

#As an alternative choice to changing the colors, you could instead change the line type of the Procter & Gamble line by adding the argument lty=2. This will make the Procter & Gamble line dashed.
plot(CocaCola$Date,CocaCola$StockPrice,type="l",col="red")
lines(ProcterGamble$Date, ProcterGamble$StockPrice,lty=2)
abline(v=as.Date(c("2000-03-01")), lwd=2)

#In March of 2000, the technology bubble burst, and a stock market crash occurred. According to this plot, which company's stock dropped more?
#Ans:Procter and Gamble

#Around 1983, the stock for one of these companies (Coca-Cola or Procter and Gamble) was going up, while the other was going down. Which one was going up?
abline(v=as.Date(c("1983-03-01")), lwd=2)
#Ans:Coca-Cola

#In the time period shown in the plot, which stock generally has lower values?
abline(h=mean(CocaCola$StockPrice),col="green")
abline(h=mean(ProcterGamble$StockPrice),col="blue")

#Ans:Coca-Cola

##############################################

#PROBLEM 3.1 - VISUALIZING STOCK DYNAMICS 1995-2005  

#Let's take a look at how the stock prices changed from 1995-2005 for all five companies. In your R console, start by typing the following plot command:
plot(CocaCola$Date[301:432], CocaCola$StockPrice[301:432], type="l", col="red", ylim=c(0,210),lwd=3)
#This will plot the CocaCola stock prices from 1995 through 2005, which are the observations numbered from 301 to 432. The additional argument, ylim=c(0,210), makes the y-axis range from 0 to 210. This will allow us to see all of the stock values when we add in the other companies.

#Now, use the lines function to add in the other four companies, remembering to only plot the observations from 1995 to 2005, or [301:432]. You don't need the "type" or "ylim" arguments for the lines function, but remember to make each company a different color so that you can tell them apart. Some color options are "red", "blue", "green", "purple", "orange", and "black". To see all of the color options in R, type colors() in your R console.
lines(ProcterGamble$Date[301:432], ProcterGamble$StockPrice[301:432], lty=2, col="blue")
lines(GE$Date[301:432], GE$StockPrice[301:432], lty=3, col="green", lwd=2)
lines(IBM$Date[301:432], IBM$StockPrice[301:432], col="purple", lty=4)
lines(Boeing$Date[301:432], Boeing$StockPrice[301:432], col="orange", lty=5, lwd=3)
#If you prefer to change the type of the line instead of the color, here are some options for changing the line type: lty=2 will make the line dashed, lty=3 will make the line dotted, lty=4 will make the line alternate between dashes and dots, and lty=5 will make the line long-dashed.
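#With five lines on one plot it helps to add a legend; a minimal sketch matching the colors and line types used above:
legend("topright", legend=c("Coca-Cola","Procter & Gamble","GE","IBM","Boeing"),
       col=c("red","blue","green","purple","orange"), lty=1:5, lwd=2)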

#Which stock fell the most right after the technology bubble burst in March 2000?
abline(v=as.Date(c("2000-03-01")), lwd=2)
#Ans:GE

#Which stock reaches the highest value in the time period 1995-2005?
#Ans:IBM

#In October of 1997, there was a global stock market crash that was caused by an economic crisis in Asia. Comparing September 1997 to November 1997, which companies saw a decreasing trend in their stock price? (Select all that apply.)
abline(v=as.Date(c("1997-09-01")), lwd=2,col="steelblue")
abline(v=as.Date(c("1997-11-01")), lwd=2,col="steelblue")
#Ans:Boeing and Procter & Gamble

#In the last two years of this time period (2004 and 2005) which stock seems to be performing the best, in terms of increasing stock price?
abline(v=as.Date(c("2004-01-01")), lwd=2,col="steelblue")
abline(v=as.Date(c("2005-01-01")), lwd=2,col="steelblue")

#Ans:Boeing

###########################################

#PROBLEM 4.1 - MONTHLY TRENDS

#Lastly, let's see if stocks tend to be higher or lower during certain months. Use the tapply command to calculate the mean stock price of IBM, sorted by months. To sort by months, use months(IBM$Date) as the second argument of the tapply function.
tapply(IBM$StockPrice,months(IBM$Date),mean)
##     April    August  December  February   January      July      June 
##  152.1168  140.1455  140.7593  152.6940  150.2384  139.0670  139.0907 
##     March       May  November   October September 
##  152.4327  151.5022  138.0187  137.3466  139.0885
#For IBM, compare the monthly averages to the overall average stock price. In which months has IBM historically had a higher stock price (on average)? Select all that apply.
tapply(IBM$StockPrice,months(IBM$Date),mean)>mean(IBM$StockPrice)
##     April    August  December  February   January      July      June 
##      TRUE     FALSE     FALSE      TRUE      TRUE     FALSE     FALSE 
##     March       May  November   October September 
##      TRUE      TRUE     FALSE     FALSE     FALSE
#Ans:January to May

#Repeat the tapply function from the previous problem for each of the other four companies, and use the output to answer the remaining questions.
tapply(GE$StockPrice,months(GE$Date),mean)
##     April    August  December  February   January      July      June 
##  64.48009  56.50315  59.10217  62.52080  62.04511  56.73349  56.46844 
##     March       May  November   October September 
##  63.15055  60.87135  57.28879  56.23897  56.23913
tapply(ProcterGamble$StockPrice,months(ProcterGamble$Date),mean)
##     April    August  December  February   January      July      June 
##  77.68671  76.82266  78.29661  79.02575  79.61798  76.64556  77.39275 
##     March       May  November   October September 
##  77.34761  77.85958  78.45610  76.67903  76.62385
tapply(Boeing$StockPrice,months(Boeing$Date),mean)
##     April    August  December  February   January      July      June 
##  47.04686  46.86311  46.17315  46.89223  46.51097  46.55360  47.38525 
##     March       May  November   October September 
##  46.88208  48.13716  45.14990  45.21603  46.30485
tapply(CocaCola$StockPrice,months(CocaCola$Date),mean)
##     April    August  December  February   January      July      June 
##  62.68888  58.88014  59.73223  60.73475  60.36849  58.98346  60.81208 
##     March       May  November   October September 
##  62.07135  61.44358  59.10268  57.93887  57.60024
#General Electric and Coca-Cola both have their highest average stock price in the same month. Which month is this?
which.max(tapply(GE$StockPrice,months(GE$Date),mean))
## April 
##     1
which.max(tapply(CocaCola$StockPrice,months(CocaCola$Date),mean))
## April 
##     1
#Ans:April

#For the months of December and January, every company's average stock is higher in one month and lower in the other. In which month are the stock prices lower?
#Ans:December
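#A compact way to compare December and January side by side for all five companies (a small sketch; the values are the same as in the tapply outputs above):
sapply(list(IBM=IBM, GE=GE, ProcterGamble=ProcterGamble, Boeing=Boeing, CocaCola=CocaCola),
       function(stock) tapply(stock$StockPrice, months(stock$Date), mean)[c("December","January")])
#For every company the January average exceeds the December average, so December is the lower month.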

Assignment 1: DEMOGRAPHICS AND EMPLOYMENT IN THE UNITED STATES Solution

In the wake of the Great Recession of 2009, there has been a good deal of focus on employment statistics, one of the most important metrics policymakers use to gauge the overall strength of the economy. In the United States, the government measures unemployment using the Current Population Survey (CPS), which collects demographic and employment information from a wide range of Americans each month. In this exercise, we will employ the topics reviewed in the lectures as well as a few new techniques using the September 2013 version of this rich, nationally representative dataset (available online).

The observations in the dataset represent people surveyed in the September 2013 CPS who actually completed a survey. While the full dataset has 385 variables, in this exercise we will use a more compact version of the dataset, CPSData.csv

#PROBLEM 1.1 - LOADING AND SUMMARIZING THE DATASET

#Let's import the dataset
CPS<-read.csv("CPSData.csv")
str(CPS)
## 'data.frame':    131302 obs. of  14 variables:
##  $ PeopleInHousehold : int  1 3 3 3 3 3 3 2 2 2 ...
##  $ Region            : Factor w/ 4 levels "Midwest","Northeast",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ State             : Factor w/ 51 levels "Alabama","Alaska",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ MetroAreaCode     : int  26620 13820 13820 13820 26620 26620 26620 33660 33660 26620 ...
##  $ Age               : int  85 21 37 18 52 24 26 71 43 52 ...
##  $ Married           : Factor w/ 5 levels "Divorced","Married",..: 5 3 3 3 5 3 3 1 1 3 ...
##  $ Sex               : Factor w/ 2 levels "Female","Male": 1 2 1 2 1 2 2 1 2 2 ...
##  $ Education         : Factor w/ 8 levels "Associate degree",..: 1 4 4 6 1 2 4 4 4 2 ...
##  $ Race              : Factor w/ 6 levels "American Indian",..: 6 3 3 3 6 6 6 6 6 6 ...
##  $ Hispanic          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ CountryOfBirthCode: int  57 57 57 57 57 57 57 57 57 57 ...
##  $ Citizenship       : Factor w/ 3 levels "Citizen, Native",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ EmploymentStatus  : Factor w/ 5 levels "Disabled","Employed",..: 4 5 1 3 2 2 2 2 3 2 ...
##  $ Industry          : Factor w/ 14 levels "Agriculture, forestry, fishing, and hunting",..: NA 11 NA NA 11 4 14 4 NA 12 ...
summary(CPS)
##  PeopleInHousehold       Region               State       MetroAreaCode  
##  Min.   : 1.000    Midwest  :30684   California  :11570   Min.   :10420  
##  1st Qu.: 2.000    Northeast:25939   Texas       : 7077   1st Qu.:21780  
##  Median : 3.000    South    :41502   New York    : 5595   Median :34740  
##  Mean   : 3.284    West     :33177   Florida     : 5149   Mean   :35075  
##  3rd Qu.: 4.000                      Pennsylvania: 3930   3rd Qu.:41860  
##  Max.   :15.000                      Illinois    : 3912   Max.   :79600  
##                                      (Other)     :94069   NA's   :34238  
##       Age                 Married          Sex       
##  Min.   : 0.00   Divorced     :11151   Female:67481  
##  1st Qu.:19.00   Married      :55509   Male  :63821  
##  Median :39.00   Never Married:30772                 
##  Mean   :38.83   Separated    : 2027                 
##  3rd Qu.:57.00   Widowed      : 6505                 
##  Max.   :85.00   NA's         :25338                 
##                                                      
##                    Education                   Race       
##  High school            :30906   American Indian :  1433  
##  Bachelor's degree      :19443   Asian           :  6520  
##  Some college, no degree:18863   Black           : 13913  
##  No high school diploma :16095   Multiracial     :  2897  
##  Associate degree       : 9913   Pacific Islander:   618  
##  (Other)                :10744   White           :105921  
##  NA's                   :25338                            
##     Hispanic      CountryOfBirthCode               Citizenship    
##  Min.   :0.0000   Min.   : 57.00     Citizen, Native     :116639  
##  1st Qu.:0.0000   1st Qu.: 57.00     Citizen, Naturalized:  7073  
##  Median :0.0000   Median : 57.00     Non-Citizen         :  7590  
##  Mean   :0.1393   Mean   : 82.68                                  
##  3rd Qu.:0.0000   3rd Qu.: 57.00                                  
##  Max.   :1.0000   Max.   :555.00                                  
##                                                                   
##            EmploymentStatus                               Industry    
##  Disabled          : 5712   Educational and health services   :15017  
##  Employed          :61733   Trade                             : 8933  
##  Not in Labor Force:15246   Professional and business services: 7519  
##  Retired           :18619   Manufacturing                     : 6791  
##  Unemployed        : 4203   Leisure and hospitality           : 6364  
##  NA's              :25789   (Other)                           :21618  
##                             NA's                              :65060
names(CPS)
##  [1] "PeopleInHousehold"  "Region"             "State"             
##  [4] "MetroAreaCode"      "Age"                "Married"           
##  [7] "Sex"                "Education"          "Race"              
## [10] "Hispanic"           "CountryOfBirthCode" "Citizenship"       
## [13] "EmploymentStatus"   "Industry"
#How many interviewees are in the dataset?
#Ans:131302

#Among the interviewees with a value reported for the Industry variable, what is the most common industry of employment? Please enter the name exactly how you see it.
head(sort(table(CPS$Industry),decreasing=TRUE))
## 
##    Educational and health services                              Trade 
##                              15017                               8933 
## Professional and business services                      Manufacturing 
##                               7519                               6791 
##            Leisure and hospitality                       Construction 
##                               6364                               4387
#Ans:Educational and health services 
#EXPLANATION:The output of summary(CPS) orders the levels of a factor variable like Industry from largest to smallest, so we can see that "Educational and health services" is the most common Industry. table(CPS$Industry) would have provided the breakdown across all industries.

#Recall from the homework assignment "The Analytical Detective" that you can call the sort() function on the output of the table() function to obtain a sorted breakdown of a variable. For instance, sort(table(CPS$Region)) sorts the regions by the number of interviewees from that region.
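#For example (a small illustration; the regional counts also appear in summary(CPS) above), this lists Northeast (25939), Midwest (30684), West (33177), and South (41502) in increasing order:
sort(table(CPS$Region))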

#Which state has the fewest interviewees?
head(sort(table(CPS$State),decreasing=F))
## 
##    New Mexico       Montana   Mississippi       Alabama West Virginia 
##          1102          1214          1230          1376          1409 
##      Arkansas 
##          1421
#Ans:New Mexico 

#Which state has the largest number of interviewees?
head(sort(table(CPS$State),decreasing=T))
## 
##   California        Texas     New York      Florida Pennsylvania 
##        11570         7077         5595         5149         3930 
##     Illinois 
##         3912
#Ans:  California 

#What proportion of interviewees are citizens of the United States?
prop.table(table(CPS$Citizenship))
## 
##      Citizen, Native Citizen, Naturalized          Non-Citizen 
##           0.88832615           0.05386818           0.05780567
0.88832615+0.05386818
## [1] 0.9421943
#Ans:0.9421943
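#The same figure can be obtained without adding entries by hand (a minimal sketch): citizens are all interviewees whose Citizenship is not "Non-Citizen", and the mean of that logical vector is again 0.9421943.
mean(CPS$Citizenship != "Non-Citizen")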

#The CPS differentiates between race (with possible values American Indian, Asian, Black, Pacific Islander, White, or Multiracial) and ethnicity. A number of interviewees are of Hispanic ethnicity, as captured by the Hispanic variable. For which races are there at least 250 interviewees in the CPS dataset of Hispanic ethnicity? (Select all that apply.)
table(CPS$Race, CPS$Hispanic)
##                   
##                        0     1
##   American Indian   1129   304
##   Asian             6407   113
##   Black            13292   621
##   Multiracial       2449   448
##   Pacific Islander   541    77
##   White            89190 16731
#or
table(CPS$Hispanic==T)
## 
##  FALSE   TRUE 
## 113008  18294
plus250<-table(CPS$Race[CPS$Hispanic==T])
plus250
## 
##  American Indian            Asian            Black      Multiracial 
##              304              113              621              448 
## Pacific Islander            White 
##               77            16731
plus250>=250
## 
##  American Indian            Asian            Black      Multiracial 
##             TRUE            FALSE             TRUE             TRUE 
## Pacific Islander            White 
##            FALSE             TRUE
#Ans:American Indian, Black, Multiracial, White

##############################################

#PROBLEM 2.1 - EVALUATING MISSING VALUES  

#Which variables have at least one interviewee with a missing (NA) value? (Select all that apply.)
colnames(CPS)[colSums(is.na(CPS)) > 0]
## [1] "MetroAreaCode"    "Married"          "Education"       
## [4] "EmploymentStatus" "Industry"
#or
apply(is.na(CPS), 2, any)
##  PeopleInHousehold             Region              State 
##              FALSE              FALSE              FALSE 
##      MetroAreaCode                Age            Married 
##               TRUE              FALSE               TRUE 
##                Sex          Education               Race 
##              FALSE               TRUE              FALSE 
##           Hispanic CountryOfBirthCode        Citizenship 
##              FALSE              FALSE              FALSE 
##   EmploymentStatus           Industry 
##               TRUE               TRUE
#or
colnames(CPS)[apply(is.na(CPS), 2, any)]
## [1] "MetroAreaCode"    "Married"          "Education"       
## [4] "EmploymentStatus" "Industry"
#or
which(apply(is.na(CPS), 2, sum)>0) # gives the column names as well as the column numbers
##    MetroAreaCode          Married        Education EmploymentStatus 
##                4                6                8               13 
##         Industry 
##               14
#or
summary(CPS)  # crude way
##  PeopleInHousehold       Region               State       MetroAreaCode  
##  Min.   : 1.000    Midwest  :30684   California  :11570   Min.   :10420  
##  1st Qu.: 2.000    Northeast:25939   Texas       : 7077   1st Qu.:21780  
##  Median : 3.000    South    :41502   New York    : 5595   Median :34740  
##  Mean   : 3.284    West     :33177   Florida     : 5149   Mean   :35075  
##  3rd Qu.: 4.000                      Pennsylvania: 3930   3rd Qu.:41860  
##  Max.   :15.000                      Illinois    : 3912   Max.   :79600  
##                                      (Other)     :94069   NA's   :34238  
##       Age                 Married          Sex       
##  Min.   : 0.00   Divorced     :11151   Female:67481  
##  1st Qu.:19.00   Married      :55509   Male  :63821  
##  Median :39.00   Never Married:30772                 
##  Mean   :38.83   Separated    : 2027                 
##  3rd Qu.:57.00   Widowed      : 6505                 
##  Max.   :85.00   NA's         :25338                 
##                                                      
##                    Education                   Race       
##  High school            :30906   American Indian :  1433  
##  Bachelor's degree      :19443   Asian           :  6520  
##  Some college, no degree:18863   Black           : 13913  
##  No high school diploma :16095   Multiracial     :  2897  
##  Associate degree       : 9913   Pacific Islander:   618  
##  (Other)                :10744   White           :105921  
##  NA's                   :25338                            
##     Hispanic      CountryOfBirthCode               Citizenship    
##  Min.   :0.0000   Min.   : 57.00     Citizen, Native     :116639  
##  1st Qu.:0.0000   1st Qu.: 57.00     Citizen, Naturalized:  7073  
##  Median :0.0000   Median : 57.00     Non-Citizen         :  7590  
##  Mean   :0.1393   Mean   : 82.68                                  
##  3rd Qu.:0.0000   3rd Qu.: 57.00                                  
##  Max.   :1.0000   Max.   :555.00                                  
##                                                                   
##            EmploymentStatus                               Industry    
##  Disabled          : 5712   Educational and health services   :15017  
##  Employed          :61733   Trade                             : 8933  
##  Not in Labor Force:15246   Professional and business services: 7519  
##  Retired           :18619   Manufacturing                     : 6791  
##  Unemployed        : 4203   Leisure and hospitality           : 6364  
##  NA's              :25789   (Other)                           :21618  
##                             NA's                              :65060
table(is.na(CPS)) # total number of NAs in the data frame
## 
##   FALSE    TRUE 
## 1662465  175763
#Ans:MetroAreaCode,Married,Education,EmploymentStatus,Industry 


#Often when evaluating a new dataset, we try to identify if there is a pattern in the missing values in the dataset. We will try to determine if there is a pattern in the missing values of the Married variable. The function is.na(CPS$Married) returns a vector of TRUE/FALSE values for whether the Married variable is missing. We can see the breakdown of whether Married is missing based on the reported value of the Region variable with the function table(CPS$Region, is.na(CPS$Married)). Which is the most accurate:
table(CPS$Region, is.na(CPS$Married))
##            
##             FALSE  TRUE
##   Midwest   24609  6075
##   Northeast 21432  4507
##   South     33535  7967
##   West      26388  6789
table(CPS$Sex, is.na(CPS$Married))
##         
##          FALSE  TRUE
##   Female 55264 12217
##   Male   50700 13121
table(CPS$Age,is.na(CPS$Married))
##     
##      FALSE TRUE
##   0      0 1283
##   1      0 1559
##   2      0 1574
##   3      0 1693
##   4      0 1695
##   5      0 1795
##   6      0 1721
##   7      0 1681
##   8      0 1729
##   9      0 1748
##   10     0 1750
##   11     0 1721
##   12     0 1797
##   13     0 1802
##   14     0 1790
##   15  1795    0
##   16  1751    0
##   17  1764    0
##   18  1596    0
##   19  1517    0
##   20  1398    0
##   21  1525    0
##   22  1536    0
##   23  1638    0
##   24  1627    0
##   25  1604    0
##   26  1643    0
##   27  1657    0
##   28  1736    0
##   29  1645    0
##   30  1854    0
##   31  1762    0
##   32  1790    0
##   33  1804    0
##   34  1653    0
##   35  1716    0
##   36  1663    0
##   37  1531    0
##   38  1530    0
##   39  1542    0
##   40  1571    0
##   41  1673    0
##   42  1711    0
##   43  1819    0
##   44  1764    0
##   45  1749    0
##   46  1665    0
##   47  1647    0
##   48  1791    0
##   49  1989    0
##   50  1966    0
##   51  1931    0
##   52  1935    0
##   53  1994    0
##   54  1912    0
##   55  1895    0
##   56  1935    0
##   57  1827    0
##   58  1874    0
##   59  1758    0
##   60  1746    0
##   61  1735    0
##   62  1595    0
##   63  1596    0
##   64  1519    0
##   65  1569    0
##   66  1577    0
##   67  1227    0
##   68  1130    0
##   69  1062    0
##   70  1195    0
##   71  1031    0
##   72   941    0
##   73   896    0
##   74   842    0
##   75   763    0
##   76   729    0
##   77   698    0
##   78   659    0
##   79   661    0
##   80  2664    0
##   85  2446    0
table(CPS$Citizenship, is.na(CPS$Married))
##                       
##                        FALSE  TRUE
##   Citizen, Native      91956 24683
##   Citizen, Naturalized  6910   163
##   Non-Citizen           7098   492
#Ans:The Married variable being missing is related to the Age value for the interviewee. 
#EXPLANATION:For each possible value of Region, Sex, and Citizenship, there are both interviewees with missing and non-missing Married values. However, Married is missing for all interviewees aged 0-14 and is present for all interviewees aged 15 and older. This is because the CPS does not ask about marriage status for interviewees 14 and younger.


#As mentioned in the variable descriptions, MetroAreaCode is missing if an interviewee does not live in a metropolitan area. Using the same technique as in the previous question, answer the following questions about people who live in non-metropolitan areas.

#How many states had all interviewees living in a non-metropolitan area (aka they have a missing MetroAreaCode value)? For this question, treat the District of Columbia as a state (even though it is not technically a state).
table(CPS$State, is.na(CPS$MetroAreaCode))
##                       
##                        FALSE  TRUE
##   Alabama               1020   356
##   Alaska                   0  1590
##   Arizona               1327   201
##   Arkansas               724   697
##   California           11333   237
##   Colorado              2545   380
##   Connecticut           2593   243
##   Delaware              1696   518
##   District of Columbia  1791     0
##   Florida               4947   202
##   Georgia               2250   557
##   Hawaii                1576   523
##   Idaho                  761   757
##   Illinois              3473   439
##   Indiana               1420   584
##   Iowa                  1297  1231
##   Kansas                1234   701
##   Kentucky               908   933
##   Louisiana             1216   234
##   Maine                  909  1354
##   Maryland              2978   222
##   Massachusetts         1858   129
##   Michigan              2517   546
##   Minnesota             2150   989
##   Mississippi            376   854
##   Missouri              1440   705
##   Montana                199  1015
##   Nebraska               816  1133
##   Nevada                1609   247
##   New Hampshire         1148  1514
##   New Jersey            2567     0
##   New Mexico             832   270
##   New York              5144   451
##   North Carolina        1642   977
##   North Dakota           432  1213
##   Ohio                  2754   924
##   Oklahoma              1024   499
##   Oregon                1519   424
##   Pennsylvania          3245   685
##   Rhode Island          2209     0
##   South Carolina        1139   519
##   South Dakota           595  1405
##   Tennessee             1149   635
##   Texas                 6060  1017
##   Utah                  1455   387
##   Vermont                657  1233
##   Virginia              2367   586
##   Washington            1937   429
##   West Virginia          344  1065
##   Wisconsin             1882   804
##   Wyoming                  0  1624
#Ans:2
#How many states had all interviewees living in a metropolitan area? Again, treat the District of Columbia as a state.
#Ans:3
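#Both counts can be computed directly from the table above (a minimal sketch; state_metro is a helper name introduced here):
state_metro <- table(CPS$State, is.na(CPS$MetroAreaCode))
sum(state_metro[,"FALSE"]==0)  #states where every interviewee is non-metropolitan: 2 (Alaska, Wyoming)
sum(state_metro[,"TRUE"]==0)   #states where every interviewee is metropolitan: 3 (District of Columbia, New Jersey, Rhode Island)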


#Which region of the United States has the largest proportion of interviewees living in a non-metropolitan area?
prop.table(table(CPS$Region,is.na(CPS$MetroAreaCode)),1) #row wise proportions
##            
##                 FALSE      TRUE
##   Midwest   0.6521314 0.3478686
##   Northeast 0.7837619 0.2162381
##   South     0.7621560 0.2378440
##   West      0.7563372 0.2436628
#Ans:Midwest
#EXPLANATION:34.8% in the Midwest, 21.6% in the Northeast, 23.8% in the South, and 24.4% in the West.


#While we were able to use the table() command to compute the proportion of interviewees from each region not living in a metropolitan area, it was somewhat tedious (it involved manually computing the proportion for each region) and isn't something you would want to do if there were a larger number of options. It turns out there is a less tedious way to compute the proportion of values that are TRUE. The mean() function, which takes the average of the values passed to it, will treat TRUE as 1 and FALSE as 0, meaning it returns the proportion of values that are true. For instance, mean(c(TRUE, FALSE, TRUE, TRUE)) returns 0.75. Knowing this, use tapply() with the mean function to answer the following questions:

#Which state has a proportion of interviewees living in a non-metropolitan area closest to 30%?
tapply(is.na(CPS$MetroAreaCode),CPS$State,mean)
##              Alabama               Alaska              Arizona 
##           0.25872093           1.00000000           0.13154450 
##             Arkansas           California             Colorado 
##           0.49049965           0.02048401           0.12991453 
##          Connecticut             Delaware District of Columbia 
##           0.08568406           0.23396567           0.00000000 
##              Florida              Georgia               Hawaii 
##           0.03923092           0.19843249           0.24916627 
##                Idaho             Illinois              Indiana 
##           0.49868248           0.11221881           0.29141717 
##                 Iowa               Kansas             Kentucky 
##           0.48694620           0.36227390           0.50678979 
##            Louisiana                Maine             Maryland 
##           0.16137931           0.59832081           0.06937500 
##        Massachusetts             Michigan            Minnesota 
##           0.06492199           0.17825661           0.31506849 
##          Mississippi             Missouri              Montana 
##           0.69430894           0.32867133           0.83607908 
##             Nebraska               Nevada        New Hampshire 
##           0.58132376           0.13308190           0.56874530 
##           New Jersey           New Mexico             New York 
##           0.00000000           0.24500907           0.08060769 
##       North Carolina         North Dakota                 Ohio 
##           0.37304315           0.73738602           0.25122349 
##             Oklahoma               Oregon         Pennsylvania 
##           0.32764281           0.21821925           0.17430025 
##         Rhode Island       South Carolina         South Dakota 
##           0.00000000           0.31302774           0.70250000 
##            Tennessee                Texas                 Utah 
##           0.35594170           0.14370496           0.21009772 
##              Vermont             Virginia           Washington 
##           0.65238095           0.19844226           0.18131868 
##        West Virginia            Wisconsin              Wyoming 
##           0.75585522           0.29932986           1.00000000
sort(tapply(is.na(CPS$MetroAreaCode), CPS$State, mean))
## District of Columbia           New Jersey         Rhode Island 
##           0.00000000           0.00000000           0.00000000 
##           California              Florida        Massachusetts 
##           0.02048401           0.03923092           0.06492199 
##             Maryland             New York          Connecticut 
##           0.06937500           0.08060769           0.08568406 
##             Illinois             Colorado              Arizona 
##           0.11221881           0.12991453           0.13154450 
##               Nevada                Texas            Louisiana 
##           0.13308190           0.14370496           0.16137931 
##         Pennsylvania             Michigan           Washington 
##           0.17430025           0.17825661           0.18131868 
##              Georgia             Virginia                 Utah 
##           0.19843249           0.19844226           0.21009772 
##               Oregon             Delaware           New Mexico 
##           0.21821925           0.23396567           0.24500907 
##               Hawaii                 Ohio              Alabama 
##           0.24916627           0.25122349           0.25872093 
##              Indiana            Wisconsin       South Carolina 
##           0.29141717           0.29932986           0.31302774 
##            Minnesota             Oklahoma             Missouri 
##           0.31506849           0.32764281           0.32867133 
##            Tennessee               Kansas       North Carolina 
##           0.35594170           0.36227390           0.37304315 
##                 Iowa             Arkansas                Idaho 
##           0.48694620           0.49049965           0.49868248 
##             Kentucky        New Hampshire             Nebraska 
##           0.50678979           0.56874530           0.58132376 
##                Maine              Vermont          Mississippi 
##           0.59832081           0.65238095           0.69430894 
##         South Dakota         North Dakota        West Virginia 
##           0.70250000           0.73738602           0.75585522 
##              Montana               Alaska              Wyoming 
##           0.83607908           1.00000000           1.00000000
#or
prop.table(table(CPS$State,is.na(CPS$MetroAreaCode)),1)
##                       
##                             FALSE       TRUE
##   Alabama              0.74127907 0.25872093
##   Alaska               0.00000000 1.00000000
##   Arizona              0.86845550 0.13154450
##   Arkansas             0.50950035 0.49049965
##   California           0.97951599 0.02048401
##   Colorado             0.87008547 0.12991453
##   Connecticut          0.91431594 0.08568406
##   Delaware             0.76603433 0.23396567
##   District of Columbia 1.00000000 0.00000000
##   Florida              0.96076908 0.03923092
##   Georgia              0.80156751 0.19843249
##   Hawaii               0.75083373 0.24916627
##   Idaho                0.50131752 0.49868248
##   Illinois             0.88778119 0.11221881
##   Indiana              0.70858283 0.29141717
##   Iowa                 0.51305380 0.48694620
##   Kansas               0.63772610 0.36227390
##   Kentucky             0.49321021 0.50678979
##   Louisiana            0.83862069 0.16137931
##   Maine                0.40167919 0.59832081
##   Maryland             0.93062500 0.06937500
##   Massachusetts        0.93507801 0.06492199
##   Michigan             0.82174339 0.17825661
##   Minnesota            0.68493151 0.31506849
##   Mississippi          0.30569106 0.69430894
##   Missouri             0.67132867 0.32867133
##   Montana              0.16392092 0.83607908
##   Nebraska             0.41867624 0.58132376
##   Nevada               0.86691810 0.13308190
##   New Hampshire        0.43125470 0.56874530
##   New Jersey           1.00000000 0.00000000
##   New Mexico           0.75499093 0.24500907
##   New York             0.91939231 0.08060769
##   North Carolina       0.62695685 0.37304315
##   North Dakota         0.26261398 0.73738602
##   Ohio                 0.74877651 0.25122349
##   Oklahoma             0.67235719 0.32764281
##   Oregon               0.78178075 0.21821925
##   Pennsylvania         0.82569975 0.17430025
##   Rhode Island         1.00000000 0.00000000
##   South Carolina       0.68697226 0.31302774
##   South Dakota         0.29750000 0.70250000
##   Tennessee            0.64405830 0.35594170
##   Texas                0.85629504 0.14370496
##   Utah                 0.78990228 0.21009772
##   Vermont              0.34761905 0.65238095
##   Virginia             0.80155774 0.19844226
##   Washington           0.81868132 0.18131868
##   West Virginia        0.24414478 0.75585522
##   Wisconsin            0.70067014 0.29932986
##   Wyoming              0.00000000 1.00000000
#Ans:Wisconsin
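#The same answer can be found programmatically (a minimal sketch reusing the tapply output above): Wisconsin, at about 29.9% non-metropolitan, is closest to 30%, with Indiana (about 29.1%) the next closest.
which.min(abs(tapply(is.na(CPS$MetroAreaCode), CPS$State, mean) - 0.30))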

#Which state has the largest proportion of non-metropolitan interviewees, ignoring states where all interviewees were non-metropolitan?
#Ans:Montana
###################################################

#PROBLEM 3.1 - INTEGRATING METROPOLITAN AREA DATA

#Codes like MetroAreaCode and CountryOfBirthCode are a compact way to encode factor variables with text as their possible values, and they are therefore quite common in survey datasets. In fact, all but one of the variables in this dataset were actually stored by a numeric code in the original CPS datafile.

#When analyzing a variable stored by a numeric code, we will often want to convert it into the values the codes represent. To do this, we will use a dictionary, which maps the code to the actual value of the variable. We have provided dictionaries MetroAreaCodes.csv and CountryCodes.csv, which respectively map MetroAreaCode and CountryOfBirthCode into their true values. Read these two dictionaries into data frames MetroAreaMap and CountryMap.
MetroAreaMap<-read.csv("MetroAreaCodes.csv")
CountryMap<-read.csv("CountryCodes.csv")

#How many observations (codes for metropolitan areas) are there in MetroAreaMap?
str(MetroAreaMap)
## 'data.frame':    271 obs. of  2 variables:
##  $ Code     : int  460 3000 3160 3610 3720 6450 10420 10500 10580 10740 ...
##  $ MetroArea: Factor w/ 271 levels "Akron, OH","Albany-Schenectady-Troy, NY",..: 12 92 97 117 122 195 1 3 2 4 ...
#or
nrow(MetroAreaMap)
## [1] 271
#Ans:271

#How many observations (codes for countries) are there in CountryMap?
str(CountryMap)
## 'data.frame':    149 obs. of  2 variables:
##  $ Code   : int  57 66 73 78 96 100 102 103 104 105 ...
##  $ Country: Factor w/ 149 levels "Afghanistan",..: 139 57 105 135 97 3 11 18 24 37 ...
#or
nrow(CountryMap)
## [1] 149
#Ans:149

#To merge in the metropolitan areas, we want to connect the field MetroAreaCode from the CPS data frame with the field Code in MetroAreaMap. The following command merges the two data frames on these columns, overwriting the CPS data frame with the result:
CPS<-merge(CPS, MetroAreaMap, by.x="MetroAreaCode", by.y="Code", all.x=TRUE)

#The first two arguments determine the data frames to be merged (they are called "x" and "y", respectively, in the subsequent parameters to the merge function). by.x="MetroAreaCode" means we're matching on the MetroAreaCode variable from the "x" data frame (CPS), while by.y="Code" means we're matching on the Code variable from the "y" data frame (MetroAreaMap). Finally, all.x=TRUE means we want to keep all rows from the "x" data frame (CPS), even if some of the rows' MetroAreaCode doesn't match any codes in MetroAreaMap (for those familiar with database terminology, this parameter makes the operation a left outer join instead of an inner join).
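#A toy illustration of the all.x=TRUE behavior (hypothetical example data, not from the CPS files): "x" plays the role of CPS and "y" plays the role of MetroAreaMap.
x <- data.frame(Code=c(1,2,3), Value=c("a","b","c"))
y <- data.frame(Code=c(1,2), Label=c("one","two"))
merge(x, y, by="Code")               #inner join: the row with Code 3 is dropped
merge(x, y, by="Code", all.x=TRUE)   #left outer join: Code 3 is kept, with Label set to NA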

#Review the new version of the CPS data frame with the summary() and str() functions. What is the name of the variable that was added to the data frame by the merge() operation?
str(CPS)
## 'data.frame':    131302 obs. of  15 variables:
##  $ MetroAreaCode     : int  10420 10420 10420 10420 10420 10420 10420 10420 10420 10420 ...
##  $ PeopleInHousehold : int  4 4 2 4 1 3 4 4 2 3 ...
##  $ Region            : Factor w/ 4 levels "Midwest","Northeast",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ State             : Factor w/ 51 levels "Alabama","Alaska",..: 36 36 36 36 36 36 36 36 36 36 ...
##  $ Age               : int  2 9 73 40 63 19 30 6 60 32 ...
##  $ Married           : Factor w/ 5 levels "Divorced","Married",..: NA NA 2 2 3 3 2 NA 2 2 ...
##  $ Sex               : Factor w/ 2 levels "Female","Male": 2 2 1 1 2 1 1 1 1 2 ...
##  $ Education         : Factor w/ 8 levels "Associate degree",..: NA NA 8 4 6 4 2 NA 4 4 ...
##  $ Race              : Factor w/ 6 levels "American Indian",..: 6 6 6 6 6 6 2 6 6 6 ...
##  $ Hispanic          : int  0 0 0 0 0 0 0 1 0 0 ...
##  $ CountryOfBirthCode: int  57 57 57 362 57 57 203 57 57 57 ...
##  $ Citizenship       : Factor w/ 3 levels "Citizen, Native",..: 1 1 1 2 1 1 3 1 1 1 ...
##  $ EmploymentStatus  : Factor w/ 5 levels "Disabled","Employed",..: NA NA 4 3 1 2 3 NA 2 2 ...
##  $ Industry          : Factor w/ 14 levels "Agriculture, forestry, fishing, and hunting",..: NA NA NA NA NA 7 NA NA 4 13 ...
##  $ MetroArea         : Factor w/ 271 levels "Akron, OH","Albany-Schenectady-Troy, NY",..: 1 1 1 1 1 1 1 1 1 1 ...
summary(CPS)
##  MetroAreaCode   PeopleInHousehold       Region               State      
##  Min.   :10420   Min.   : 1.000    Midwest  :30684   California  :11570  
##  1st Qu.:21780   1st Qu.: 2.000    Northeast:25939   Texas       : 7077  
##  Median :34740   Median : 3.000    South    :41502   New York    : 5595  
##  Mean   :35075   Mean   : 3.284    West     :33177   Florida     : 5149  
##  3rd Qu.:41860   3rd Qu.: 4.000                      Pennsylvania: 3930  
##  Max.   :79600   Max.   :15.000                      Illinois    : 3912  
##  NA's   :34238                                       (Other)     :94069  
##       Age                 Married          Sex       
##  Min.   : 0.00   Divorced     :11151   Female:67481  
##  1st Qu.:19.00   Married      :55509   Male  :63821  
##  Median :39.00   Never Married:30772                 
##  Mean   :38.83   Separated    : 2027                 
##  3rd Qu.:57.00   Widowed      : 6505                 
##  Max.   :85.00   NA's         :25338                 
##                                                      
##                    Education                   Race       
##  High school            :30906   American Indian :  1433  
##  Bachelor's degree      :19443   Asian           :  6520  
##  Some college, no degree:18863   Black           : 13913  
##  No high school diploma :16095   Multiracial     :  2897  
##  Associate degree       : 9913   Pacific Islander:   618  
##  (Other)                :10744   White           :105921  
##  NA's                   :25338                            
##     Hispanic      CountryOfBirthCode               Citizenship    
##  Min.   :0.0000   Min.   : 57.00     Citizen, Native     :116639  
##  1st Qu.:0.0000   1st Qu.: 57.00     Citizen, Naturalized:  7073  
##  Median :0.0000   Median : 57.00     Non-Citizen         :  7590  
##  Mean   :0.1393   Mean   : 82.68                                  
##  3rd Qu.:0.0000   3rd Qu.: 57.00                                  
##  Max.   :1.0000   Max.   :555.00                                  
##                                                                   
##            EmploymentStatus                               Industry    
##  Disabled          : 5712   Educational and health services   :15017  
##  Employed          :61733   Trade                             : 8933  
##  Not in Labor Force:15246   Professional and business services: 7519  
##  Retired           :18619   Manufacturing                     : 6791  
##  Unemployed        : 4203   Leisure and hospitality           : 6364  
##  NA's              :25789   (Other)                           :21618  
##                             NA's                              :65060  
##                                               MetroArea    
##  New York-Northern New Jersey-Long Island, NY-NJ-PA: 5409  
##  Washington-Arlington-Alexandria, DC-VA-MD-WV      : 4177  
##  Los Angeles-Long Beach-Santa Ana, CA              : 4102  
##  Philadelphia-Camden-Wilmington, PA-NJ-DE          : 2855  
##  Chicago-Naperville-Joliet, IN-IN-WI               : 2772  
##  (Other)                                           :77749  
##  NA's                                              :34238
#Ans:MetroArea  #this variable comes from the second data frame (MetroAreaMap) and is added to the first data frame (CPS) by the merge

#How many interviewees have a missing value for the new metropolitan area variable? Note that all of these interviewees would have been removed from the merged data frame if we did not include the all.x=TRUE parameter.
#Ans:34238
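#This can also be read off directly (a minimal sketch), and matches the NA count for MetroArea in summary(CPS):
sum(is.na(CPS$MetroArea))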

#Which of the following metropolitan areas has the largest number of interviewees?
head(sort(table(CPS$MetroArea),decreasing=T))
## 
## New York-Northern New Jersey-Long Island, NY-NJ-PA 
##                                               5409 
##       Washington-Arlington-Alexandria, DC-VA-MD-WV 
##                                               4177 
##               Los Angeles-Long Beach-Santa Ana, CA 
##                                               4102 
##           Philadelphia-Camden-Wilmington, PA-NJ-DE 
##                                               2855 
##                Chicago-Naperville-Joliet, IN-IN-WI 
##                                               2772 
##               Providence-Fall River-Warwick, MA-RI 
##                                               2284
#Ans:Boston-Cambridge-Quincy, MA-NH (the question's answer options were restricted to a handful of specific metropolitan areas; among those options, Boston-Cambridge-Quincy, MA-NH has the largest number of interviewees, even though it does not appear in the overall top six shown above)
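#The count for any particular metropolitan area can be looked up by name (a small sketch):
table(CPS$MetroArea)["Boston-Cambridge-Quincy, MA-NH"]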

#Which metropolitan area has the highest proportion of interviewees of Hispanic ethnicity? Hint: Use tapply() with mean, as in the previous subproblem. Calling sort() on the output of tapply() could also be helpful here.
head(sort(tapply(CPS$Hispanic, CPS$MetroArea, mean),decreasing=T))
##                 Laredo, TX McAllen-Edinburg-Pharr, TX 
##                  0.9662921                  0.9487179 
##  Brownsville-Harlingen, TX                El Paso, TX 
##                  0.7974684                  0.7909836 
##              El Centro, CA            San Antonio, TX 
##                  0.6868687                  0.6441516
#Ans:Laredo, TX

#Remembering that CPS$Race == "Asian" returns a TRUE/FALSE vector of whether an interviewee is Asian, determine the number of metropolitan areas in the United States from which at least 20% of interviewees are Asian.
head(sort(tapply(CPS$Race == "Asian", CPS$MetroArea, mean),decreasing = T))
##                       Honolulu, HI  San Francisco-Oakland-Fremont, CA 
##                          0.5019036                          0.2467532 
## San Jose-Sunnyvale-Santa Clara, CA              Vallejo-Fairfield, CA 
##                          0.2417910                          0.2030075 
##                         Fresno, CA                  Warner Robins, GA 
##                          0.1848185                          0.1666667
#Ans:4
#We can read from the sorted output that Honolulu, HI; San Francisco-Oakland-Fremont, CA; San Jose-Sunnyvale-Santa Clara, CA; and Vallejo-Fairfield, CA had at least 20% of their interviewees of the Asian race.
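#The number of qualifying areas can also be counted directly (a minimal sketch; na.rm=TRUE skips metropolitan areas with no interviewees):
sum(tapply(CPS$Race == "Asian", CPS$MetroArea, mean) >= 0.2, na.rm=TRUE)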

#We can look at the sorted proportion of interviewees from each metropolitan area who have not received a high school diploma with the command (note the na.rm=TRUE argument, discussed below):

head(sort(tapply(CPS$Education == "No high school diploma", CPS$MetroArea, mean,na.rm=TRUE)))
##            Iowa City, IA        Bowling Green, KY    Kalamazoo-Portage, MI 
##               0.02912621               0.03703704               0.05050505 
##     Champaign-Urbana, IL Bremerton-Silverdale, WA             Lawrence, KS 
##               0.05154639               0.05405405               0.05952381
#Without na.rm=TRUE, the mean would be reported as NA for each metropolitan area, because none of the interviewees aged 14 and younger have an education value reported. To get mean (and related functions, like sum) to ignore missing values, you can pass the parameter na.rm=TRUE, which tapply forwards to mean. Using this output, determine which metropolitan area has the smallest proportion of interviewees who have received no high school diploma.
#Ans:Iowa City, IA 
#We can see that Iowa City, IA had 2.9% of interviewees not finish high school, the smallest value of any metropolitan area.

######################################################

#PROBLEM 4.1 - INTEGRATING COUNTRY OF BIRTH DATA

#Just as we did with the metropolitan area information, merge in the country of birth information from the CountryMap data frame, replacing the CPS data frame with the result. If you accidentally overwrite CPS with the wrong values, remember that you can restore it by re-loading the data frame from CPSData.csv and then merging in the metropolitan area information using the command provided in the previous subproblem.
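#For reference, the recovery would simply repeat commands shown earlier (left commented out so they are not re-run here):
#CPS <- read.csv("CPSData.csv")
#CPS <- merge(CPS, MetroAreaMap, by.x="MetroAreaCode", by.y="Code", all.x=TRUE)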
str(CPS)
## 'data.frame':    131302 obs. of  15 variables:
##  $ MetroAreaCode     : int  10420 10420 10420 10420 10420 10420 10420 10420 10420 10420 ...
##  $ PeopleInHousehold : int  4 4 2 4 1 3 4 4 2 3 ...
##  $ Region            : Factor w/ 4 levels "Midwest","Northeast",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ State             : Factor w/ 51 levels "Alabama","Alaska",..: 36 36 36 36 36 36 36 36 36 36 ...
##  $ Age               : int  2 9 73 40 63 19 30 6 60 32 ...
##  $ Married           : Factor w/ 5 levels "Divorced","Married",..: NA NA 2 2 3 3 2 NA 2 2 ...
##  $ Sex               : Factor w/ 2 levels "Female","Male": 2 2 1 1 2 1 1 1 1 2 ...
##  $ Education         : Factor w/ 8 levels "Associate degree",..: NA NA 8 4 6 4 2 NA 4 4 ...
##  $ Race              : Factor w/ 6 levels "American Indian",..: 6 6 6 6 6 6 2 6 6 6 ...
##  $ Hispanic          : int  0 0 0 0 0 0 0 1 0 0 ...
##  $ CountryOfBirthCode: int  57 57 57 362 57 57 203 57 57 57 ...
##  $ Citizenship       : Factor w/ 3 levels "Citizen, Native",..: 1 1 1 2 1 1 3 1 1 1 ...
##  $ EmploymentStatus  : Factor w/ 5 levels "Disabled","Employed",..: NA NA 4 3 1 2 3 NA 2 2 ...
##  $ Industry          : Factor w/ 14 levels "Agriculture, forestry, fishing, and hunting",..: NA NA NA NA NA 7 NA NA 4 13 ...
##  $ MetroArea         : Factor w/ 271 levels "Akron, OH","Albany-Schenectady-Troy, NY",..: 1 1 1 1 1 1 1 1 1 1 ...
CPS<-merge(CPS, CountryMap, by.x="CountryOfBirthCode", by.y="Code", all.x=TRUE)
summary(CPS)
##  CountryOfBirthCode MetroAreaCode   PeopleInHousehold       Region     
##  Min.   : 57.00     Min.   :10420   Min.   : 1.000    Midwest  :30684  
##  1st Qu.: 57.00     1st Qu.:21780   1st Qu.: 2.000    Northeast:25939  
##  Median : 57.00     Median :34740   Median : 3.000    South    :41502  
##  Mean   : 82.68     Mean   :35075   Mean   : 3.284    West     :33177  
##  3rd Qu.: 57.00     3rd Qu.:41860   3rd Qu.: 4.000                     
##  Max.   :555.00     Max.   :79600   Max.   :15.000                     
##                     NA's   :34238                                      
##           State            Age                 Married          Sex       
##  California  :11570   Min.   : 0.00   Divorced     :11151   Female:67481  
##  Texas       : 7077   1st Qu.:19.00   Married      :55509   Male  :63821  
##  New York    : 5595   Median :39.00   Never Married:30772                 
##  Florida     : 5149   Mean   :38.83   Separated    : 2027                 
##  Pennsylvania: 3930   3rd Qu.:57.00   Widowed      : 6505                 
##  Illinois    : 3912   Max.   :85.00   NA's         :25338                 
##  (Other)     :94069                                                       
##                    Education                   Race       
##  High school            :30906   American Indian :  1433  
##  Bachelor's degree      :19443   Asian           :  6520  
##  Some college, no degree:18863   Black           : 13913  
##  No high school diploma :16095   Multiracial     :  2897  
##  Associate degree       : 9913   Pacific Islander:   618  
##  (Other)                :10744   White           :105921  
##  NA's                   :25338                            
##     Hispanic                    Citizenship               EmploymentStatus
##  Min.   :0.0000   Citizen, Native     :116639   Disabled          : 5712  
##  1st Qu.:0.0000   Citizen, Naturalized:  7073   Employed          :61733  
##  Median :0.0000   Non-Citizen         :  7590   Not in Labor Force:15246  
##  Mean   :0.1393                                 Retired           :18619  
##  3rd Qu.:0.0000                                 Unemployed        : 4203  
##  Max.   :1.0000                                 NA's              :25789  
##                                                                           
##                                Industry    
##  Educational and health services   :15017  
##  Trade                             : 8933  
##  Professional and business services: 7519  
##  Manufacturing                     : 6791  
##  Leisure and hospitality           : 6364  
##  (Other)                           :21618  
##  NA's                              :65060  
##                                               MetroArea    
##  New York-Northern New Jersey-Long Island, NY-NJ-PA: 5409  
##  Washington-Arlington-Alexandria, DC-VA-MD-WV      : 4177  
##  Los Angeles-Long Beach-Santa Ana, CA              : 4102  
##  Philadelphia-Camden-Wilmington, PA-NJ-DE          : 2855  
##  Chicago-Naperville-Joliet, IN-IN-WI               : 2772  
##  (Other)                                           :77749  
##  NA's                                              :34238  
##           Country      
##  United States:115063  
##  Mexico       :  3921  
##  Philippines  :   839  
##  India        :   770  
##  China        :   581  
##  (Other)      :  9952  
##  NA's         :   176
#What is the name of the variable added to the CPS data frame by this merge operation?
#Ans:Country

#How many interviewees have a missing value for the new country of birth variable?
#Ans:  176 

#Among all interviewees born outside of North America, which country was the most common place of birth?
head(sort(table(CPS$Country),decreasing=T))  #or
## 
## United States        Mexico   Philippines         India         China 
##        115063          3921           839           770           581 
##   Puerto Rico 
##           518
head(summary(CPS$Country))
## United States        Mexico   Philippines         India         China 
##        115063          3921           839           770           581 
##   Puerto Rico 
##           518
#Ans:Philippines

#What proportion of the interviewees from the "New York-Northern New Jersey-Long Island, NY-NJ-PA" metropolitan area have a country of birth that is not the United States? For this computation, don't include people from this metropolitan area who have a missing country of birth.
tapply(CPS$Country!= "United States",CPS$MetroArea == "New York-Northern New Jersey-Long Island, NY-NJ-PA",mean,na.rm=TRUE)
##     FALSE      TRUE 
## 0.1392772 0.3086603
#or
table(CPS$MetroArea =="New York-Northern New Jersey-Long Island, NY-NJ-PA", CPS$Country!="United States")
##        
##         FALSE  TRUE
##   FALSE 78757 12744
##   TRUE   3736  1668
#Ans:0.3086603
#EXPLANATION:From table(CPS$MetroArea == "New York-Northern New Jersey-Long Island, NY-NJ-PA", CPS$Country != "United States"), we can see that 1668 of interviewees from this metropolitan area were born outside the United States and 3736 were born in the United States (it turns out an additional 5 have a missing country of origin). Therefore, the proportion is 1668/(1668+3736)=0.309.


#Which metropolitan area has the largest number (note -- not proportion) of interviewees with a country of birth in India? Hint -- remember to include na.rm=TRUE if you are using tapply() to answer this question.

#To obtain the number of TRUE values in a vector of TRUE/FALSE values, you can use the sum() function. For instance, sum(c(TRUE, FALSE, TRUE, TRUE)) is 3. Therefore, we can obtain counts of people born in a particular country living in a particular metropolitan area with:
head(sort(tapply(CPS$Country == "India", CPS$MetroArea, sum, na.rm=TRUE),decreasing = T))
## New York-Northern New Jersey-Long Island, NY-NJ-PA 
##                                                 96 
##       Washington-Arlington-Alexandria, DC-VA-MD-WV 
##                                                 50 
##           Philadelphia-Camden-Wilmington, PA-NJ-DE 
##                                                 32 
##                Chicago-Naperville-Joliet, IN-IN-WI 
##                                                 31 
##                         Detroit-Warren-Livonia, MI 
##                                                 30 
##                 Atlanta-Sandy Springs-Marietta, GA 
##                                                 27
#Ans:New York-Northern New Jersey-Long Island, NY-NJ-PA

#In Brazil?
head(sort(tapply(CPS$Country == "Brazil", CPS$MetroArea, sum, na.rm=TRUE),decreasing = T))
##                     Boston-Cambridge-Quincy, MA-NH 
##                                                 18 
##              Miami-Fort Lauderdale-Miami Beach, FL 
##                                                 16 
##               Los Angeles-Long Beach-Santa Ana, CA 
##                                                  9 
##       Washington-Arlington-Alexandria, DC-VA-MD-WV 
##                                                  8 
##                    Bridgeport-Stamford-Norwalk, CT 
##                                                  7 
## New York-Northern New Jersey-Long Island, NY-NJ-PA 
##                                                  7
#Ans:Boston-Cambridge-Quincy, MA-NH

#In Somalia?
head(sort(tapply(CPS$Country == "Somalia", CPS$MetroArea, sum, na.rm=TRUE),decreasing = T))
## Minneapolis-St Paul-Bloomington, MN-WI 
##                                     17 
##            Phoenix-Mesa-Scottsdale, AZ 
##                                      7 
##            Seattle-Tacoma-Bellevue, WA 
##                                      7 
##                          St. Cloud, MN 
##                                      7 
##                           Columbus, OH 
##                                      5 
##                           Fargo, ND-MN 
##                                      5
#Ans:Minneapolis-St Paul-Bloomington, MN-WI  

Assignment 1: INTERNET PRIVACY POLL (OPTIONAL)

Internet privacy has gained widespread attention in recent years. To measure the degree to which people are concerned about hot-button issues like Internet privacy, social scientists conduct polls in which they interview a large number of people about the topic. In this assignment, we will analyze data from a July 2013 Pew Internet and American Life Project poll on Internet anonymity and privacy, which involved interviews across the United States. While the full polling data can be found here, we will use a more limited version of the results, available in AnonymityPoll.csv.

#Let's load the data
poll<-read.csv("AnonymityPoll.csv")

#PROBLEM 1.1 - LOADING AND SUMMARIZING THE DATASET  

#How many people participated in the poll?
nrow(poll)  #or str(poll)
## [1] 1002
#Ans:1002

#Let's look at the breakdown of the number of people with smartphones using the table() and summary() commands on the Smartphone variable. (HINT: These three numbers should sum to 1002.)
#How many interviewees responded that they use a smartphone?
table(poll$Smartphone)  #&
## 
##   0   1 
## 472 487
summary(poll)
##   Internet.Use      Smartphone         Sex           Age       
##  Min.   :0.0000   Min.   :0.0000   Female:505   Min.   :18.00  
##  1st Qu.:1.0000   1st Qu.:0.0000   Male  :497   1st Qu.:37.00  
##  Median :1.0000   Median :1.0000                Median :55.00  
##  Mean   :0.7742   Mean   :0.5078                Mean   :52.37  
##  3rd Qu.:1.0000   3rd Qu.:1.0000                3rd Qu.:66.00  
##  Max.   :1.0000   Max.   :1.0000                Max.   :96.00  
##  NA's   :1        NA's   :43                    NA's   :27     
##           State           Region    Conservativeness Info.On.Internet
##  California  :103   Midwest  :239   Min.   :1.000    Min.   : 0.000  
##  Texas       : 72   Northeast:166   1st Qu.:3.000    1st Qu.: 2.000  
##  New York    : 60   South    :359   Median :3.000    Median : 4.000  
##  Pennsylvania: 45   West     :238   Mean   :3.277    Mean   : 3.795  
##  Florida     : 42                   3rd Qu.:4.000    3rd Qu.: 6.000  
##  Ohio        : 38                   Max.   :5.000    Max.   :11.000  
##  (Other)     :642                   NA's   :62       NA's   :210     
##  Worry.About.Info Privacy.Importance Anonymity.Possible
##  Min.   :0.0000   Min.   :  0.00     Min.   :0.0000    
##  1st Qu.:0.0000   1st Qu.: 41.43     1st Qu.:0.0000    
##  Median :0.0000   Median : 68.75     Median :0.0000    
##  Mean   :0.4886   Mean   : 62.85     Mean   :0.3692    
##  3rd Qu.:1.0000   3rd Qu.: 88.89     3rd Qu.:1.0000    
##  Max.   :1.0000   Max.   :100.00     Max.   :1.0000    
##  NA's   :212      NA's   :215        NA's   :249       
##  Tried.Masking.Identity Privacy.Laws.Effective
##  Min.   :0.0000         Min.   :0.0000        
##  1st Qu.:0.0000         1st Qu.:0.0000        
##  Median :0.0000         Median :0.0000        
##  Mean   :0.1633         Mean   :0.2617        
##  3rd Qu.:0.0000         3rd Qu.:1.0000        
##  Max.   :1.0000         Max.   :1.0000        
##  NA's   :218            NA's   :108
#Ans:487

#How many interviewees responded that they don't use a smartphone?
#Ans:472

#How many interviewees did not respond to the question, resulting in a missing value, or NA, in the summary() output?
#Ans:43
#EXPLANATION:From the output of table(poll$Smartphone), we can read that 487 interviewees use a smartphone and 472 do not. From the summary(poll$Smartphone) output, we see that another 43 had missing values. As a sanity check, 487+472+43=1002, the total number of interviewees.
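#Another way to see the missing values directly in the same table is the useNA argument of table() (an alternative to reading NA's off the summary output):
table(poll$Smartphone, useNA="ifany")   #shows counts for 0, 1, and NA (472, 487, and 43)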

#By using the table() function on two variables, we can tell how they are related. To use table() on two variables, just put the two variable names inside the parentheses, separated by a comma (don't forget to add poll$ before each variable name). In the output, the possible values of the first variable are listed on the left, and the possible values of the second variable are listed on the top. Each entry of the table counts the number of observations that take the row's value for the first variable and the column's value for the second variable.
#Which of the following are states in the Midwest census region? (Select all that apply.)
mid<-poll[poll$Region=="Midwest",]
mid$Region<-factor(mid$Region)
table(mid$State,mid$Region)
##                       
##                        Midwest
##   Alabama                    0
##   Arizona                    0
##   Arkansas                   0
##   California                 0
##   Colorado                   0
##   Connecticut                0
##   Delaware                   0
##   District of Columbia       0
##   Florida                    0
##   Georgia                    0
##   Idaho                      0
##   Illinois                  32
##   Indiana                   27
##   Iowa                      14
##   Kansas                    14
##   Kentucky                   0
##   Louisiana                  0
##   Maine                      0
##   Maryland                   0
##   Massachusetts              0
##   Michigan                  31
##   Minnesota                 15
##   Mississippi                0
##   Missouri                  26
##   Montana                    0
##   Nebraska                  11
##   Nevada                     0
##   New Hampshire              0
##   New Jersey                 0
##   New Mexico                 0
##   New York                   0
##   North Carolina             0
##   North Dakota               5
##   Ohio                      38
##   Oklahoma                   0
##   Oregon                     0
##   Pennsylvania               0
##   Rhode Island               0
##   South Carolina             0
##   South Dakota               3
##   Tennessee                  0
##   Texas                      0
##   Utah                       0
##   Vermont                    0
##   Virginia                   0
##   Washington                 0
##   West Virginia              0
##   Wisconsin                 23
##   Wyoming                    0
#or
MidwestInterviewees<-subset(poll, Region=="Midwest")
table(MidwestInterviewees$State)
## 
##              Alabama              Arizona             Arkansas 
##                    0                    0                    0 
##           California             Colorado          Connecticut 
##                    0                    0                    0 
##             Delaware District of Columbia              Florida 
##                    0                    0                    0 
##              Georgia                Idaho             Illinois 
##                    0                    0                   32 
##              Indiana                 Iowa               Kansas 
##                   27                   14                   14 
##             Kentucky            Louisiana                Maine 
##                    0                    0                    0 
##             Maryland        Massachusetts             Michigan 
##                    0                    0                   31 
##            Minnesota          Mississippi             Missouri 
##                   15                    0                   26 
##              Montana             Nebraska               Nevada 
##                    0                   11                    0 
##        New Hampshire           New Jersey           New Mexico 
##                    0                    0                    0 
##             New York       North Carolina         North Dakota 
##                    0                    0                    5 
##                 Ohio             Oklahoma               Oregon 
##                   38                    0                    0 
##         Pennsylvania         Rhode Island       South Carolina 
##                    0                    0                    0 
##         South Dakota            Tennessee                Texas 
##                    3                    0                    0 
##                 Utah              Vermont             Virginia 
##                    0                    0                    0 
##           Washington        West Virginia            Wisconsin 
##                    0                    0                   23 
##              Wyoming 
##                    0
#Ans:Kansas ,Missouri,Ohio

#Which was the state in the South census region with the largest number of interviewees?
sot<-poll[poll$Region=="South",]
sot$Region<-factor(sot$Region)#this step removes unnecessary levels in the factor variable
table(sot$State,sot$Region)
##                       
##                        South
##   Alabama                 11
##   Arizona                  0
##   Arkansas                10
##   California               0
##   Colorado                 0
##   Connecticut              0
##   Delaware                 6
##   District of Columbia     2
##   Florida                 42
##   Georgia                 34
##   Idaho                    0
##   Illinois                 0
##   Indiana                  0
##   Iowa                     0
##   Kansas                   0
##   Kentucky                25
##   Louisiana               17
##   Maine                    0
##   Maryland                18
##   Massachusetts            0
##   Michigan                 0
##   Minnesota                0
##   Mississippi             11
##   Missouri                 0
##   Montana                  0
##   Nebraska                 0
##   Nevada                   0
##   New Hampshire            0
##   New Jersey               0
##   New Mexico               0
##   New York                 0
##   North Carolina          32
##   North Dakota             0
##   Ohio                     0
##   Oklahoma                14
##   Oregon                   0
##   Pennsylvania             0
##   Rhode Island             0
##   South Carolina          12
##   South Dakota             0
##   Tennessee               17
##   Texas                   72
##   Utah                     0
##   Vermont                  0
##   Virginia                31
##   Washington               0
##   West Virginia            5
##   Wisconsin                0
##   Wyoming                  0
#or
SouthInterviewees<-subset(poll, Region=="South")
table(SouthInterviewees$State)
## 
##              Alabama              Arizona             Arkansas 
##                   11                    0                   10 
##           California             Colorado          Connecticut 
##                    0                    0                    0 
##             Delaware District of Columbia              Florida 
##                    6                    2                   42 
##              Georgia                Idaho             Illinois 
##                   34                    0                    0 
##              Indiana                 Iowa               Kansas 
##                    0                    0                    0 
##             Kentucky            Louisiana                Maine 
##                   25                   17                    0 
##             Maryland        Massachusetts             Michigan 
##                   18                    0                    0 
##            Minnesota          Mississippi             Missouri 
##                    0                   11                    0 
##              Montana             Nebraska               Nevada 
##                    0                    0                    0 
##        New Hampshire           New Jersey           New Mexico 
##                    0                    0                    0 
##             New York       North Carolina         North Dakota 
##                    0                   32                    0 
##                 Ohio             Oklahoma               Oregon 
##                    0                   14                    0 
##         Pennsylvania         Rhode Island       South Carolina 
##                    0                    0                   12 
##         South Dakota            Tennessee                Texas 
##                    0                   17                   72 
##                 Utah              Vermont             Virginia 
##                    0                    0                   31 
##           Washington        West Virginia            Wisconsin 
##                    0                    5                    0 
##              Wyoming 
##                    0
#Ans:Texas

####################################################

#PROBLEM 2.1 - INTERNET AND SMARTPHONE USERS  

#As mentioned in the introduction to this problem, many of the response variables (Info.On.Internet, Worry.About.Info, Privacy.Importance, Anonymity.Possible, and Tried.Masking.Identity) were not collected if an interviewee does not use the Internet or a smartphone, meaning the variables will have missing values for these interviewees.

#How many interviewees reported not having used the Internet and not having used a smartphone?
table(poll$Internet.Use,poll$Smartphone)
##    
##       0   1
##   0 186  17
##   1 285 470
#Ans:186

#How many interviewees reported having used the Internet and having used a smartphone?
#Ans:470

#How many interviewees reported having used the Internet but not having used a smartphone?
#Ans:285

#How many interviewees reported having used a smartphone but not having used the Internet?
#Ans:17

#How many interviewees have a missing value for their Internet use?
summary(poll$Internet.Use)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  0.0000  1.0000  1.0000  0.7742  1.0000  1.0000       1
#Ans:1

#How many interviewees have a missing value for their smartphone use?
summary(poll$Smartphone)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  0.0000  0.0000  1.0000  0.5078  1.0000  1.0000      43
#Ans:43

#Use the subset function to obtain a data frame called "limited", which is limited to interviewees who reported Internet use or who reported smartphone use. In lecture, we used the & symbol to use two criteria to make a subset of the data. To only take observations that have a certain value in one variable or the other, the | character can be used in place of the & symbol. This is also called a logical "or" operation.
limited<-subset(poll,Internet.Use == 1 | Smartphone == 1)
nrow(limited)
## [1] 792
#How many interviewees are in the new data frame?
#Ans:792

#########################################

#PROBLEM 3.1 - SUMMARIZING OPINIONS ABOUT INTERNET PRIVACY  

#Which variables have missing values in the limited data frame? (Select all that apply.)
colnames(limited)[colSums(is.na(limited)) > 0]
## [1] "Smartphone"             "Age"                   
## [3] "Conservativeness"       "Worry.About.Info"      
## [5] "Privacy.Importance"     "Anonymity.Possible"    
## [7] "Tried.Masking.Identity" "Privacy.Laws.Effective"
#or
apply(is.na(limited), 2, any)
##           Internet.Use             Smartphone                    Sex 
##                  FALSE                   TRUE                  FALSE 
##                    Age                  State                 Region 
##                   TRUE                  FALSE                  FALSE 
##       Conservativeness       Info.On.Internet       Worry.About.Info 
##                   TRUE                  FALSE                   TRUE 
##     Privacy.Importance     Anonymity.Possible Tried.Masking.Identity 
##                   TRUE                   TRUE                   TRUE 
## Privacy.Laws.Effective 
##                   TRUE
#or
colnames(limited)[apply(is.na(limited), 2, any)]
## [1] "Smartphone"             "Age"                   
## [3] "Conservativeness"       "Worry.About.Info"      
## [5] "Privacy.Importance"     "Anonymity.Possible"    
## [7] "Tried.Masking.Identity" "Privacy.Laws.Effective"
#or
which(apply(is.na(limited), 2, sum)>0) # gives col names as well as col no
##             Smartphone                    Age       Conservativeness 
##                      2                      4                      7 
##       Worry.About.Info     Privacy.Importance     Anonymity.Possible 
##                      9                     10                     11 
## Tried.Masking.Identity Privacy.Laws.Effective 
##                     12                     13
#or
summary(limited) #crude way
##   Internet.Use      Smartphone         Sex           Age       
##  Min.   :0.0000   Min.   :0.0000   Female:392   Min.   :18.00  
##  1st Qu.:1.0000   1st Qu.:0.0000   Male  :400   1st Qu.:33.00  
##  Median :1.0000   Median :1.0000                Median :51.00  
##  Mean   :0.9785   Mean   :0.6308                Mean   :48.57  
##  3rd Qu.:1.0000   3rd Qu.:1.0000                3rd Qu.:62.00  
##  Max.   :1.0000   Max.   :1.0000                Max.   :93.00  
##                   NA's   :20                    NA's   :22     
##             State           Region    Conservativeness Info.On.Internet
##  California    : 89   Midwest  :172   Min.   :1.000    Min.   : 0.000  
##  Texas         : 57   Northeast:128   1st Qu.:3.000    1st Qu.: 2.000  
##  New York      : 45   South    :288   Median :3.000    Median : 4.000  
##  Pennsylvania  : 33   West     :204   Mean   :3.237    Mean   : 3.795  
##  Florida       : 32                   3rd Qu.:4.000    3rd Qu.: 6.000  
##  North Carolina: 28                   Max.   :5.000    Max.   :11.000  
##  (Other)       :508                   NA's   :45                       
##  Worry.About.Info Privacy.Importance Anonymity.Possible
##  Min.   :0.0000   Min.   :  0.00     Min.   :0.0000    
##  1st Qu.:0.0000   1st Qu.: 41.43     1st Qu.:0.0000    
##  Median :0.0000   Median : 68.75     Median :0.0000    
##  Mean   :0.4886   Mean   : 62.85     Mean   :0.3692    
##  3rd Qu.:1.0000   3rd Qu.: 88.89     3rd Qu.:1.0000    
##  Max.   :1.0000   Max.   :100.00     Max.   :1.0000    
##  NA's   :2        NA's   :5          NA's   :39        
##  Tried.Masking.Identity Privacy.Laws.Effective
##  Min.   :0.0000         Min.   :0.0000        
##  1st Qu.:0.0000         1st Qu.:0.0000        
##  Median :0.0000         Median :0.0000        
##  Mean   :0.1633         Mean   :0.2559        
##  3rd Qu.:0.0000         3rd Qu.:1.0000        
##  Max.   :1.0000         Max.   :1.0000        
##  NA's   :8              NA's   :65
#Ans:Smartphone, Age, Conservativeness, Worry.About.Info, Privacy.Importance, Anonymity.Possible, Tried.Masking.Identity, Privacy.Laws.Effective

#What is the average number of pieces of personal information on the Internet, according to the Info.On.Internet variable?
mean(limited$Info.On.Internet) #or summary(limited$Info.On.Internet)
## [1] 3.795455
#Ans:3.795455

#How many interviewees reported a value of 0 for Info.On.Internet?
table(limited$Info.On.Internet)
## 
##   0   1   2   3   4   5   6   7   8   9  10  11 
## 105  84  95 101 104  94  67  63  40  18  13   8
#Ans:105

#How many interviewees reported the maximum value of 11 for Info.On.Internet?
#Ans:8

#What proportion of interviewees who answered the Worry.About.Info question worry about how much information is available about them on the Internet? Note that to compute this proportion you will be dividing by the number of people who answered the Worry.About.Info question, not the total number of people in the data frame.
table(limited$Worry.About.Info)
## 
##   0   1 
## 404 386
prop.table(table(limited$Worry.About.Info))
## 
##         0         1 
## 0.5113924 0.4886076
#or crudely
table(limited$Worry.About.Info)/margin.table(table(limited$Worry.About.Info))
## 
##         0         1 
## 0.5113924 0.4886076
#or crudely
table(limited$Worry.About.Info,limited$Info.On.Internet)
##    
##      0  1  2  3  4  5  6  7  8  9 10 11
##   0 71 50 44 48 47 38 34 32 17 12  7  4
##   1 34 33 51 53 57 55 33 31 23  6  6  4
sum(limited$Worry.About.Info=="1",na.rm=T)
## [1] 386
sum(limited$Worry.About.Info=="0",na.rm=T)
## [1] 404
386/(404+386)
## [1] 0.4886076
#Ans:0.4886076 

#What proportion of interviewees who answered the Anonymity.Possible question think it is possible to be completely anonymous on the Internet?
prop.table(table(limited$Anonymity.Possible))    #or
## 
##         0         1 
## 0.6308101 0.3691899
summary(limited$Anonymity.Possible) #as the mean gives the prop of 1s
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  0.0000  0.0000  0.0000  0.3692  1.0000  1.0000      39
#Ans:0.3691899

#What proportion of interviewees who answered the Tried.Masking.Identity question have tried masking their identity on the Internet?
prop.table(table(limited$Tried.Masking.Identity)) #or 
## 
##         0         1 
## 0.8367347 0.1632653
summary(limited$Tried.Masking.Identity)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  0.0000  0.0000  0.0000  0.1633  0.0000  1.0000       8
#Ans:0.1632653 

#What proportion of interviewees who answered the Privacy.Laws.Effective question find United States privacy laws effective?
prop.table(table(limited$Privacy.Laws.Effective)) #or 
## 
##         0         1 
## 0.7441541 0.2558459
summary(limited$Privacy.Laws.Effective)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  0.0000  0.0000  0.0000  0.2558  1.0000  1.0000      65
#Ans:0.2558459 

##############################################

#PROBLEM 4.1 - RELATING DEMOGRAPHICS TO POLLING RESULTS  

#Often, we are interested in whether certain characteristics of interviewees (e.g. their age or political opinions) affect their opinions on the topic of the poll (in this case, opinions on privacy). In this section, we will investigate the relationship between the characteristics Age and Smartphone and outcome variables Info.On.Internet and Tried.Masking.Identity, again using the limited data frame we built in an earlier section of this problem.

#Build a histogram of the age of interviewees. What is the best represented age group in the population?
hist(limited$Age)

#Ans:People aged about 60 years old 

#Both Age and Info.On.Internet are variables that take on many values, so a good way to observe their relationship is through a graph. We learned in lecture that we can plot Age against Info.On.Internet with the command plot(limited$Age, limited$Info.On.Internet). However, because Info.On.Internet takes on a small number of values, multiple points can be plotted in exactly the same location on this graph.
plot(limited$Age, limited$Info.On.Internet)

#What is the largest number of interviewees that have exactly the same value in their Age variable AND the same value in their Info.On.Internet variable? In other words, what is the largest number of overlapping points in the plot plot(limited$Age, limited$Info.On.Internet)? (HINT: Use the table function to compare the number of observations with different values of Age and Info.On.Internet.)
max(table(limited$Age, limited$Info.On.Internet))
## [1] 6
#Ans:6

#To avoid points covering each other up, we can use the jitter() function on the values we pass to the plot function. Experimenting with the command jitter(c(1, 2, 3)), what appears to be the functionality of the jitter command?
jitter(c(1, 2, 3))
## [1] 1.016522 2.188608 3.189527
#Ans:jitter adds or subtracts a small amount of random noise to the values passed to it, and two runs will yield different results
#EXPLANATION:By running the command jitter(c(1, 2, 3)) multiple times, we can see that the jitter function randomly adds or subtracts a small value from each number, and two runs will yield different results.

#Now, plot Age against Info.On.Internet with plot(jitter(limited$Age), jitter(limited$Info.On.Internet)). What relationship do you observe between Age and Info.On.Internet?
plot(jitter(limited$Age), jitter(limited$Info.On.Internet))
abline(lm(limited$Info.On.Internet~limited$Age,limited))

#Ans:Older age seems moderately associated with a smaller value for Info.On.Internet
#EXPLANATION:For younger people aged 18-30, the average value of Info.On.Internet appears to be roughly 5, while most people aged 60 and older have a value less than 5. Therefore, older age appears to be associated with a smaller value of Info.On.Internet, but from the spread of dots on the image, it's clear the association is not particularly strong.

#Use the tapply() function to obtain the summary of the Info.On.Internet value, broken down by whether an interviewee is a smartphone user.
tapply(limited$Info.On.Internet,limited$Smartphone,summary)
## $`0`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   1.000   3.000   2.923   5.000  11.000 
## 
## $`1`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   2.000   4.000   4.368   6.000  11.000
#What is the average Info.On.Internet value for smartphone users?
#Ans:4.368  #The average for smartphone users can be read from the summary output labeled with 1.

#What is the average Info.On.Internet value for non-smartphone users?
#Ans:2.923  #We can read the average for non-smartphone users from the summary output labeled with 0 

#Similarly use tapply to break down the Tried.Masking.Identity variable for smartphone and non-smartphone users.
tapply(limited$Tried.Masking.Identity,limited$Smartphone,summary)
## $`0`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  0.0000  0.0000  0.0000  0.1174  0.0000  1.0000       4 
## 
## $`1`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  0.0000  0.0000  0.0000  0.1925  0.0000  1.0000       4
#Note: when we want the proportion of 1s in a variable coded with 0s and 1s, all we need is the mean from the summary function. The mean gives the proportion of TRUEs (or 1s). For example:
mean(c(TRUE,TRUE,FALSE,TRUE))
## [1] 0.75
#What proportion of smartphone users who answered the Tried.Masking.Identity question have tried masking their identity when using the Internet?
#Ans:0.1925

#What proportion of non-smartphone users who answered the Tried.Masking.Identity question have tried masking their identity when using the Internet?
#Ans:0.1174

#EXPLANATION:We can get the breakdown for smartphone and non-smartphone users with:
#tapply(limited$Tried.Masking.Identity, limited$Smartphone, table)
#Among smartphone users, 93 tried masking their identity and 390 did not, resulting in proportion 93/(93+390)=0.1925. Among non-smartphone users, 33 tried masking their identity and 248 did not, resulting in proportion 33/(33+248)=0.1174.This could have also been read from tapply(limited$Tried.Masking.Identity, limited$Smartphone, summary).
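#A quick sketch of the arithmetic behind those proportions (counts taken from the explanation above):
93/(93+390)    #smartphone users who tried masking their identity: 0.1925
33/(33+248)    #non-smartphone users who tried masking their identity: 0.1174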

Unit 2: Linear Regression

The statistical sommelier_An introduction to linear regression

Video1:PREDICTING THE QUALITY OF WINE

The plots below show the relationship between two of the independent variables considered by Ashenfelter and the price of wine.

[Figure Wine_QQ1_Plot1: scatterplots of average growing season temperature vs. wine price and harvest rain vs. wine price]

Q:What is the correct relationship between harvest rain, average growing season temperature, and wine prices?

Ans:More harvest rain is associated with a lower price, and higher temperatures are associated with a higher price

EXPLANATION:The plots show a positive trend between average growing season temperature and the wine price. While the trend is less clear between harvest rain and price, there is a slight negative association.

Video2:ONE-VARIABLE LINEAR REGRESSION

The following figure shows three data points and the best fit line

y = 3x + 2.

The x-coordinate, or “x”, is our independent variable and the y-coordinate, or “y”, is our dependent variable.

[Figure Wine_QQ2: three data points and the best fit line y = 3x + 2]

#let's create the above scatter plot
x<-c(0,1,1)
y<-c(2,2,8)

plot(x,y,xlim = c(-3,3),ylim = c(0,10),pch=19)

#Baseline prediction
abline(h=mean(y),col="red",lwd=2)
text(-2,5,labels="Y=4,Baseline Prediction",cex=0.7)

#To get the SSE

#first fit the Linear model
fit1<-lm(y~x)
abline(lm(y~x),lwd=2,col="blue")
text(1.5,9,labels="Regression Line",cex = 0.7)

anova(fit1)
## Analysis of Variance Table
## 
## Response: y
##           Df Sum Sq Mean Sq F value Pr(>F)
## x          1      6       6  0.3333 0.6667
## Residuals  1     18      18
summary(fit1)
## 
## Call:
## lm(formula = y ~ x)
## 
## Residuals:
##         1         2         3 
## -2.22e-16 -3.00e+00  3.00e+00 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)    2.000      4.243   0.471    0.720
## x              3.000      5.196   0.577    0.667
## 
## Residual standard error: 4.243 on 1 degrees of freedom
## Multiple R-squared:   0.25,  Adjusted R-squared:   -0.5 
## F-statistic: 0.3333 on 1 and 1 DF,  p-value: 0.6667
#Let's use the SDSFoundations package, which reproduces all of the above steps, producing the scatter plot plus a basic summary output
library(SDSFoundations)
## 
## Attaching package: 'SDSFoundations'
## The following object is masked from 'package:lattice':
## 
##     histogram
linFit(x,y)

## Linear Fit 
##  Intercept =  2 
##  Slope =  3 
##  R-squared =  0.25

Q:What is the baseline prediction?

Ans:4

EXPLANATION:The baseline prediction is the average value of the dependent variable. Since our dependent variable takes values 2, 2, and 8 in our data set, the average is (2+2+8)/3 = 4.

Q:What is the Sum of Squared Errors (SSE) ?

Ans:18

EXPLANATION:The SSE is computed by summing the squared errors between the actual values and our predictions. For each value of the independent variable (x), our best fit line makes the following predictions:

If x = 0, y = 3(0) + 2 = 2,

If x = 1, y = 3(1) + 2 = 5.

Thus we make an error of 0 for the data point (0,2), an error of 3 for the data point (1,2), and an error of 3 for the data point (1,8). So we have

SSE = 0² + 3² + 3² = 18.
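
The same arithmetic can be checked in R (a small sketch using the three data points above):

x<-c(0,1,1)
y<-c(2,2,8)
pred<-3*x + 2              #predictions from the best fit line y = 3x + 2
sum((y - pred)^2)          #SSE = 0 + 9 + 9 = 18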

Q:What is the Total Sum of Squares (SST) ?

Ans:24

EXPLANATION:The SST is computed by summing the squared errors between the actual values and the baseline prediction. From the first question, we computed the baseline prediction to be 4. Thus the SST is:

SST = (2 - 4)² + (2 - 4)² + (8 - 4)² = 24.

Q:What is the R² of the model?

Ans:0.25

EXPLANATION: The R² formula is:R² = 1 - SSE/SST

Thus using our answers to the previous questions, we have that

R² = 1 - 18/24 = 0.25.
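
Again, this can be verified with a couple of lines of R (a sketch reusing the same three points):

y<-c(2,2,8)
SSE<-18                    #from the SSE question above
SST<-sum((y - mean(y))^2)  #(2-4)^2 + (2-4)^2 + (8-4)^2 = 24
1 - SSE/SST                #R-squared = 0.25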

Video3:MULTIPLE LINEAR REGRESSION

Q:Suppose we add another variable, Average Winter Temperature, to our model to predict wine price. Is it possible for the model’s R² value to go down from 0.83 to 0.80?

Ans:No, the model’s R² value can not decrease at all by adding new variables.

EXPLANATION:The model’s R² value can never decrease from adding new variables to the model. This is due to the fact that it is always possible to set the coefficient for the new variable to zero in the new model. However, this would be the same as the old model. So the only reason to make the coefficient non-zero is if it improves the R² value of the model, since linear regression picks the coefficients to minimize the error terms, which is the same as maximizing the R².
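
To illustrate (a small sketch with synthetic data, not the wine model itself), adding even a pure-noise variable never lowers the multiple R-squared:

set.seed(1)
toy<-data.frame(y=rnorm(25), x1=rnorm(25), x2=rnorm(25))
summary(lm(y ~ x1, data=toy))$r.squared        #R-squared of the smaller model
summary(lm(y ~ x1 + x2, data=toy))$r.squared   #never smaller than the value above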

Video4:LINEAR REGRESSION IN R

A script file containing all of the R commands used in this lecture is reproduced below

# VIDEO 4

# Read in data
wine<-read.csv("wine.csv")
str(wine)
## 'data.frame':    25 obs. of  7 variables:
##  $ Year       : int  1952 1953 1955 1957 1958 1959 1960 1961 1962 1963 ...
##  $ Price      : num  7.5 8.04 7.69 6.98 6.78 ...
##  $ WinterRain : int  600 690 502 420 582 485 763 830 697 608 ...
##  $ AGST       : num  17.1 16.7 17.1 16.1 16.4 ...
##  $ HarvestRain: int  160 80 130 110 187 187 290 38 52 155 ...
##  $ Age        : int  31 30 28 26 25 24 23 22 21 20 ...
##  $ FrancePop  : num  43184 43495 44218 45152 45654 ...
summary(wine)
##       Year          Price         WinterRain         AGST      
##  Min.   :1952   Min.   :6.205   Min.   :376.0   Min.   :14.98  
##  1st Qu.:1960   1st Qu.:6.519   1st Qu.:536.0   1st Qu.:16.20  
##  Median :1966   Median :7.121   Median :600.0   Median :16.53  
##  Mean   :1966   Mean   :7.067   Mean   :605.3   Mean   :16.51  
##  3rd Qu.:1972   3rd Qu.:7.495   3rd Qu.:697.0   3rd Qu.:17.07  
##  Max.   :1978   Max.   :8.494   Max.   :830.0   Max.   :17.65  
##   HarvestRain         Age         FrancePop    
##  Min.   : 38.0   Min.   : 5.0   Min.   :43184  
##  1st Qu.: 89.0   1st Qu.:11.0   1st Qu.:46584  
##  Median :130.0   Median :17.0   Median :50255  
##  Mean   :148.6   Mean   :17.2   Mean   :49694  
##  3rd Qu.:187.0   3rd Qu.:23.0   3rd Qu.:52894  
##  Max.   :292.0   Max.   :31.0   Max.   :54602
# Linear Regression (one variable)
model1<-lm(Price ~ AGST, data=wine)
summary(model1)
## 
## Call:
## lm(formula = Price ~ AGST, data = wine)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.78450 -0.23882 -0.03727  0.38992  0.90318 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -3.4178     2.4935  -1.371 0.183710    
## AGST          0.6351     0.1509   4.208 0.000335 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4993 on 23 degrees of freedom
## Multiple R-squared:  0.435,  Adjusted R-squared:  0.4105 
## F-statistic: 17.71 on 1 and 23 DF,  p-value: 0.000335
# Sum of Squared Errors
model1$residuals
##           1           2           3           4           5           6 
##  0.04204258  0.82983774  0.21169394  0.15609432 -0.23119140  0.38991701 
##           7           8           9          10          11          12 
## -0.48959140  0.90318115  0.45372410  0.14887461 -0.23882157 -0.08974238 
##          13          14          15          16          17          18 
##  0.66185660 -0.05211511 -0.62726647 -0.74714947  0.42113502 -0.03727441 
##          19          20          21          22          23          24 
##  0.10685278 -0.78450270 -0.64017590 -0.05508720 -0.67055321 -0.22040381 
##          25 
##  0.55866518
SSE<-sum(model1$residuals^2)
SSE
## [1] 5.734875
# Linear Regression (two variables)
model2<-lm(Price ~ AGST + HarvestRain, data=wine)
summary(model2)
## 
## Call:
## lm(formula = Price ~ AGST + HarvestRain, data = wine)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.88321 -0.19600  0.06178  0.15379  0.59722 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2.20265    1.85443  -1.188 0.247585    
## AGST         0.60262    0.11128   5.415 1.94e-05 ***
## HarvestRain -0.00457    0.00101  -4.525 0.000167 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3674 on 22 degrees of freedom
## Multiple R-squared:  0.7074, Adjusted R-squared:  0.6808 
## F-statistic: 26.59 on 2 and 22 DF,  p-value: 1.347e-06
# Sum of Squared Errors
SSE<-sum(model2$residuals^2)
SSE
## [1] 2.970373
# Linear Regression (all variables)
model3<-lm(Price ~ AGST + HarvestRain + WinterRain + Age + FrancePop, data=wine)
summary(model3)
## 
## Call:
## lm(formula = Price ~ AGST + HarvestRain + WinterRain + Age + 
##     FrancePop, data = wine)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.48179 -0.24662 -0.00726  0.22012  0.51987 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -4.504e-01  1.019e+01  -0.044 0.965202    
## AGST         6.012e-01  1.030e-01   5.836 1.27e-05 ***
## HarvestRain -3.958e-03  8.751e-04  -4.523 0.000233 ***
## WinterRain   1.043e-03  5.310e-04   1.963 0.064416 .  
## Age          5.847e-04  7.900e-02   0.007 0.994172    
## FrancePop   -4.953e-05  1.667e-04  -0.297 0.769578    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3019 on 19 degrees of freedom
## Multiple R-squared:  0.8294, Adjusted R-squared:  0.7845 
## F-statistic: 18.47 on 5 and 19 DF,  p-value: 1.044e-06
# Sum of Squared Errors
SSE<-sum(model3$residuals^2)
SSE
## [1] 1.732113
####################################################

#In R, use the dataset wine.csv to create a linear regression model to predict Price using HarvestRain and WinterRain as independent variables. Using the summary output of this model, answer the following questions:
fit2<-lm(Price ~ HarvestRain + WinterRain, data=wine)
summary(fit2)
## 
## Call:
## lm(formula = Price ~ HarvestRain + WinterRain, data = wine)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.0933 -0.3222 -0.1012  0.3871  1.1877 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  7.865e+00  6.616e-01  11.888 4.76e-11 ***
## HarvestRain -4.971e-03  1.601e-03  -3.105  0.00516 ** 
## WinterRain  -9.848e-05  9.007e-04  -0.109  0.91392    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5611 on 22 degrees of freedom
## Multiple R-squared:  0.3177, Adjusted R-squared:  0.2557 
## F-statistic: 5.122 on 2 and 22 DF,  p-value: 0.01492
#What is the "Multiple R-squared" value of your model?
#Ans: 0.3177

#What is the coefficient for HarvestRain?
#Ans:-4.971e-03   i.e. -4.971 x 10^-3

#What is the intercept coefficient?
#Ans:7.865
#EXPLANATION:In R, create the model by typing the following line into your R console:
#modelQQ4 = lm(Price ~ HarvestRain + WinterRain, data=wine)
#Then, look at the output of summary(modelQQ4). The Multiple R-squared is listed at the bottom of the output, and the coefficients can be found in the coefficients table.

Video5:UNDERSTANDING THE MODEL

A script file containing all of the R commands used in this lecture is reproduced below

# VIDEO 5
wine<-read.csv("wine.csv")

model3<-lm(Price ~ AGST + HarvestRain + WinterRain + Age + FrancePop, data=wine)
summary(model3)
## 
## Call:
## lm(formula = Price ~ AGST + HarvestRain + WinterRain + Age + 
##     FrancePop, data = wine)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.48179 -0.24662 -0.00726  0.22012  0.51987 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -4.504e-01  1.019e+01  -0.044 0.965202    
## AGST         6.012e-01  1.030e-01   5.836 1.27e-05 ***
## HarvestRain -3.958e-03  8.751e-04  -4.523 0.000233 ***
## WinterRain   1.043e-03  5.310e-04   1.963 0.064416 .  
## Age          5.847e-04  7.900e-02   0.007 0.994172    
## FrancePop   -4.953e-05  1.667e-04  -0.297 0.769578    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3019 on 19 degrees of freedom
## Multiple R-squared:  0.8294, Adjusted R-squared:  0.7845 
## F-statistic: 18.47 on 5 and 19 DF,  p-value: 1.044e-06
# Remove FrancePop
model4<-lm(Price ~ AGST + HarvestRain + WinterRain + Age, data=wine)
summary(model4)
## 
## Call:
## lm(formula = Price ~ AGST + HarvestRain + WinterRain + Age, data = wine)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.45470 -0.24273  0.00752  0.19773  0.53637 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -3.4299802  1.7658975  -1.942 0.066311 .  
## AGST         0.6072093  0.0987022   6.152  5.2e-06 ***
## HarvestRain -0.0039715  0.0008538  -4.652 0.000154 ***
## WinterRain   0.0010755  0.0005073   2.120 0.046694 *  
## Age          0.0239308  0.0080969   2.956 0.007819 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.295 on 20 degrees of freedom
## Multiple R-squared:  0.8286, Adjusted R-squared:  0.7943 
## F-statistic: 24.17 on 4 and 20 DF,  p-value: 2.036e-07
#########################################

#Use the dataset wine.csv to create a linear regression model to predict Price using HarvestRain and WinterRain as independent variables, like you did in the previous quick question. Using the summary output of this model, answer the following questions:
fit2<-lm(Price ~ HarvestRain + WinterRain, data=wine)
summary(fit2)
## 
## Call:
## lm(formula = Price ~ HarvestRain + WinterRain, data = wine)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.0933 -0.3222 -0.1012  0.3871  1.1877 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  7.865e+00  6.616e-01  11.888 4.76e-11 ***
## HarvestRain -4.971e-03  1.601e-03  -3.105  0.00516 ** 
## WinterRain  -9.848e-05  9.007e-04  -0.109  0.91392    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5611 on 22 degrees of freedom
## Multiple R-squared:  0.3177, Adjusted R-squared:  0.2557 
## F-statistic: 5.122 on 2 and 22 DF,  p-value: 0.01492
#Is the coefficient for HarvestRain significant?
#Ans:Yes

#Is the coefficient for WinterRain significant?
#Ans:No

Video6:CORRELATION AND MULTICOLLINEARITY

A script file containing all of the R commands used in this lecture is reproduced below

# VIDEO 6

wine<-read.csv("wine.csv")

# Correlations
cor(wine$WinterRain, wine$Price)
## [1] 0.1366505
cor(wine$Age, wine$FrancePop)
## [1] -0.9944851
cor(wine) #correlation matrix 
##                    Year      Price   WinterRain        AGST HarvestRain
## Year         1.00000000 -0.4477679  0.016970024 -0.24691585  0.02800907
## Price       -0.44776786  1.0000000  0.136650547  0.65956286 -0.56332190
## WinterRain   0.01697002  0.1366505  1.000000000 -0.32109061 -0.27544085
## AGST        -0.24691585  0.6595629 -0.321090611  1.00000000 -0.06449593
## HarvestRain  0.02800907 -0.5633219 -0.275440854 -0.06449593  1.00000000
## Age         -1.00000000  0.4477679 -0.016970024  0.24691585 -0.02800907
## FrancePop    0.99448510 -0.4668616 -0.001621627 -0.25916227  0.04126439
##                     Age    FrancePop
## Year        -1.00000000  0.994485097
## Price        0.44776786 -0.466861641
## WinterRain  -0.01697002 -0.001621627
## AGST         0.24691585 -0.259162274
## HarvestRain -0.02800907  0.041264394
## Age          1.00000000 -0.994485097
## FrancePop   -0.99448510  1.000000000
# Remove Age and FrancePop as they are highly correlated with each other
model5<-lm(Price ~ AGST + HarvestRain + WinterRain, data=wine)
summary(model5)
## 
## Call:
## lm(formula = Price ~ AGST + HarvestRain + WinterRain, data = wine)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.67472 -0.12958  0.01973  0.20751  0.63846 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -4.3016263  2.0366743  -2.112 0.046831 *  
## AGST         0.6810242  0.1117011   6.097 4.75e-06 ***
## HarvestRain -0.0039481  0.0009987  -3.953 0.000726 ***
## WinterRain   0.0011765  0.0005920   1.987 0.060097 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.345 on 21 degrees of freedom
## Multiple R-squared:  0.7537, Adjusted R-squared:  0.7185 
## F-statistic: 21.42 on 3 and 21 DF,  p-value: 1.359e-06
#Removing both of the correlated IVs (Age & FrancePop) lowers the R-squared and Adjusted R-squared. Hence we should finally remove only FrancePop, since intuitively the age of a wine is a more important predictor of its price.

##########################################

#Using the data set wine.csv, what is the correlation between HarvestRain and WinterRain?
cor(wine$HarvestRain,wine$WinterRain)
## [1] -0.2754409
#Ans:-0.2754409

Video7:MAKING PREDICTIONS

A script file containing all of the R commands used in this lecture is reproduced below

# VIDEO 7

# Read in test set
wineTest<-read.csv("wine_test.csv")
str(wineTest)
## 'data.frame':    2 obs. of  7 variables:
##  $ Year       : int  1979 1980
##  $ Price      : num  6.95 6.5
##  $ WinterRain : int  717 578
##  $ AGST       : num  16.2 16
##  $ HarvestRain: int  122 74
##  $ Age        : int  4 3
##  $ FrancePop  : num  54836 55110
# Make test set predictions using predict()
predictTest<-predict(model4, newdata=wineTest)
predictTest
##        1        2 
## 6.768925 6.684910
# Compute R-squared
SSE<-sum((wineTest$Price - predictTest)^2)
SST<-sum((wineTest$Price - mean(wine$Price))^2)
1 - SSE/SST
## [1] 0.7944278

Q:Which of the following are NOT valid values for an out-of-sample (test set) R² ? Select all that apply.

Ans:2.4

EXPLANATION:The formula for R² is

R² = 1 - SSE/SST, where SST is calculated using the average value of the dependent variable on the training set. Since SSE and SST are sums of squared terms, both must be non-negative, so SSE/SST is greater than or equal to zero. This means it is not possible to have an out-of-sample R² value of 2.4. However, all the other values are valid (even the negative ones!), since SSE can be larger or smaller than SST: this is an out-of-sample R², not a model R².
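
For instance (purely hypothetical numbers), if the test-set predictions are worse than simply using the training-set mean, SSE exceeds SST and the out-of-sample R² becomes negative:

SSE<-12        #hypothetical test-set sum of squared errors
SST<-10        #hypothetical total sum of squares, based on the training-set mean
1 - SSE/SST    #-0.2, a valid (negative) out-of-sample R-squared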

Moneyball_The power of sports analytics

VIDEO 2: MAKING IT TO THE PLAYOFFS

A script file containing all of the R commands used in this lecture is reproduced below

# VIDEO 2

# Read in data
baseball<-read.csv("baseball.csv")
str(baseball)
## 'data.frame':    1232 obs. of  15 variables:
##  $ Team        : Factor w/ 39 levels "ANA","ARI","ATL",..: 2 3 4 5 7 8 9 10 11 12 ...
##  $ League      : Factor w/ 2 levels "AL","NL": 2 2 1 1 2 1 2 1 2 1 ...
##  $ Year        : int  2012 2012 2012 2012 2012 2012 2012 2012 2012 2012 ...
##  $ RS          : int  734 700 712 734 613 748 669 667 758 726 ...
##  $ RA          : int  688 600 705 806 759 676 588 845 890 670 ...
##  $ W           : int  81 94 93 69 61 85 97 68 64 88 ...
##  $ OBP         : num  0.328 0.32 0.311 0.315 0.302 0.318 0.315 0.324 0.33 0.335 ...
##  $ SLG         : num  0.418 0.389 0.417 0.415 0.378 0.422 0.411 0.381 0.436 0.422 ...
##  $ BA          : num  0.259 0.247 0.247 0.26 0.24 0.255 0.251 0.251 0.274 0.268 ...
##  $ Playoffs    : int  0 1 1 0 0 0 1 0 0 1 ...
##  $ RankSeason  : int  NA 4 5 NA NA NA 2 NA NA 6 ...
##  $ RankPlayoffs: int  NA 5 4 NA NA NA 4 NA NA 2 ...
##  $ G           : int  162 162 162 162 162 162 162 162 162 162 ...
##  $ OOBP        : num  0.317 0.306 0.315 0.331 0.335 0.319 0.305 0.336 0.357 0.314 ...
##  $ OSLG        : num  0.415 0.378 0.403 0.428 0.424 0.405 0.39 0.43 0.47 0.402 ...
# Subset to only include moneyball years before 2002
moneyball<-subset(baseball, Year < 2002)
str(moneyball)
## 'data.frame':    902 obs. of  15 variables:
##  $ Team        : Factor w/ 39 levels "ANA","ARI","ATL",..: 1 2 3 4 5 7 8 9 10 11 ...
##  $ League      : Factor w/ 2 levels "AL","NL": 1 2 2 1 1 2 1 2 1 2 ...
##  $ Year        : int  2001 2001 2001 2001 2001 2001 2001 2001 2001 2001 ...
##  $ RS          : int  691 818 729 687 772 777 798 735 897 923 ...
##  $ RA          : int  730 677 643 829 745 701 795 850 821 906 ...
##  $ W           : int  75 92 88 63 82 88 83 66 91 73 ...
##  $ OBP         : num  0.327 0.341 0.324 0.319 0.334 0.336 0.334 0.324 0.35 0.354 ...
##  $ SLG         : num  0.405 0.442 0.412 0.38 0.439 0.43 0.451 0.419 0.458 0.483 ...
##  $ BA          : num  0.261 0.267 0.26 0.248 0.266 0.261 0.268 0.262 0.278 0.292 ...
##  $ Playoffs    : int  0 1 1 0 0 0 0 0 1 0 ...
##  $ RankSeason  : int  NA 5 7 NA NA NA NA NA 6 NA ...
##  $ RankPlayoffs: int  NA 1 3 NA NA NA NA NA 4 NA ...
##  $ G           : int  162 162 162 162 161 162 162 162 162 162 ...
##  $ OOBP        : num  0.331 0.311 0.314 0.337 0.329 0.321 0.334 0.341 0.341 0.35 ...
##  $ OSLG        : num  0.412 0.404 0.384 0.439 0.393 0.398 0.427 0.455 0.417 0.48 ...
# Compute Run Difference and add this new var to the dataframe
moneyball$RD<-moneyball$RS - moneyball$RA
str(moneyball)
## 'data.frame':    902 obs. of  16 variables:
##  $ Team        : Factor w/ 39 levels "ANA","ARI","ATL",..: 1 2 3 4 5 7 8 9 10 11 ...
##  $ League      : Factor w/ 2 levels "AL","NL": 1 2 2 1 1 2 1 2 1 2 ...
##  $ Year        : int  2001 2001 2001 2001 2001 2001 2001 2001 2001 2001 ...
##  $ RS          : int  691 818 729 687 772 777 798 735 897 923 ...
##  $ RA          : int  730 677 643 829 745 701 795 850 821 906 ...
##  $ W           : int  75 92 88 63 82 88 83 66 91 73 ...
##  $ OBP         : num  0.327 0.341 0.324 0.319 0.334 0.336 0.334 0.324 0.35 0.354 ...
##  $ SLG         : num  0.405 0.442 0.412 0.38 0.439 0.43 0.451 0.419 0.458 0.483 ...
##  $ BA          : num  0.261 0.267 0.26 0.248 0.266 0.261 0.268 0.262 0.278 0.292 ...
##  $ Playoffs    : int  0 1 1 0 0 0 0 0 1 0 ...
##  $ RankSeason  : int  NA 5 7 NA NA NA NA NA 6 NA ...
##  $ RankPlayoffs: int  NA 1 3 NA NA NA NA NA 4 NA ...
##  $ G           : int  162 162 162 162 161 162 162 162 162 162 ...
##  $ OOBP        : num  0.331 0.311 0.314 0.337 0.329 0.321 0.334 0.341 0.341 0.35 ...
##  $ OSLG        : num  0.412 0.404 0.384 0.439 0.393 0.398 0.427 0.455 0.417 0.48 ...
##  $ RD          : int  -39 141 86 -142 27 76 3 -115 76 17 ...
# Scatterplot to check for linear relationship
plot(moneyball$RD, moneyball$W)

# Regression model to predict wins
WinsReg<-lm(W ~ RD, data=moneyball)
summary(WinsReg)
## 
## Call:
## lm(formula = W ~ RD, data = moneyball)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.2662  -2.6509   0.1234   2.9364  11.6570 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 80.881375   0.131157  616.67   <2e-16 ***
## RD           0.105766   0.001297   81.55   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.939 on 900 degrees of freedom
## Multiple R-squared:  0.8808, Adjusted R-squared:  0.8807 
## F-statistic:  6651 on 1 and 900 DF,  p-value: < 2.2e-16
##########################################

#If a baseball team scores 713 runs and allows 614 runs, how many games do we expect the team to win?

#Using the linear regression model constructed during the lecture, enter the number of games we expect the team to win:

library(SDSFoundations)
rdiff<-713-614
linFitPred(moneyball$RD,moneyball$W,rdiff)

#Ans:91.352
#EXPLANATION:Our linear regression model was 
#Wins = 80.88 + 0.1058*(Run Difference)
#Here, the run difference is 99, so our prediction is
#Wins = 80.88 + 0.1058*99 = 91 games.
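#We can also verify this with the predict() function, using the WinsReg model fitted earlier in this video (a quick sketch, assuming WinsReg is still in the workspace):
predict(WinsReg, newdata=data.frame(RD=713-614))   #about 91.35 wins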

VIDEO 3: PREDICTING RUNS

A script file containing all of the R commands used in this lecture is reproduced below

# VIDEO 3

# Read in data
baseball<-read.csv("baseball.csv")

# Subset to only include moneyball years before 2002
moneyball<-subset(baseball, Year < 2002)
str(moneyball)
## 'data.frame':    902 obs. of  15 variables:
##  $ Team        : Factor w/ 39 levels "ANA","ARI","ATL",..: 1 2 3 4 5 7 8 9 10 11 ...
##  $ League      : Factor w/ 2 levels "AL","NL": 1 2 2 1 1 2 1 2 1 2 ...
##  $ Year        : int  2001 2001 2001 2001 2001 2001 2001 2001 2001 2001 ...
##  $ RS          : int  691 818 729 687 772 777 798 735 897 923 ...
##  $ RA          : int  730 677 643 829 745 701 795 850 821 906 ...
##  $ W           : int  75 92 88 63 82 88 83 66 91 73 ...
##  $ OBP         : num  0.327 0.341 0.324 0.319 0.334 0.336 0.334 0.324 0.35 0.354 ...
##  $ SLG         : num  0.405 0.442 0.412 0.38 0.439 0.43 0.451 0.419 0.458 0.483 ...
##  $ BA          : num  0.261 0.267 0.26 0.248 0.266 0.261 0.268 0.262 0.278 0.292 ...
##  $ Playoffs    : int  0 1 1 0 0 0 0 0 1 0 ...
##  $ RankSeason  : int  NA 5 7 NA NA NA NA NA 6 NA ...
##  $ RankPlayoffs: int  NA 1 3 NA NA NA NA NA 4 NA ...
##  $ G           : int  162 162 162 162 161 162 162 162 162 162 ...
##  $ OOBP        : num  0.331 0.311 0.314 0.337 0.329 0.321 0.334 0.341 0.341 0.35 ...
##  $ OSLG        : num  0.412 0.404 0.384 0.439 0.393 0.398 0.427 0.455 0.417 0.48 ...
# Compute Run Difference and add this new var to the dataframe
moneyball$RD<-moneyball$RS - moneyball$RA
str(moneyball)
## 'data.frame':    902 obs. of  16 variables:
##  $ Team        : Factor w/ 39 levels "ANA","ARI","ATL",..: 1 2 3 4 5 7 8 9 10 11 ...
##  $ League      : Factor w/ 2 levels "AL","NL": 1 2 2 1 1 2 1 2 1 2 ...
##  $ Year        : int  2001 2001 2001 2001 2001 2001 2001 2001 2001 2001 ...
##  $ RS          : int  691 818 729 687 772 777 798 735 897 923 ...
##  $ RA          : int  730 677 643 829 745 701 795 850 821 906 ...
##  $ W           : int  75 92 88 63 82 88 83 66 91 73 ...
##  $ OBP         : num  0.327 0.341 0.324 0.319 0.334 0.336 0.334 0.324 0.35 0.354 ...
##  $ SLG         : num  0.405 0.442 0.412 0.38 0.439 0.43 0.451 0.419 0.458 0.483 ...
##  $ BA          : num  0.261 0.267 0.26 0.248 0.266 0.261 0.268 0.262 0.278 0.292 ...
##  $ Playoffs    : int  0 1 1 0 0 0 0 0 1 0 ...
##  $ RankSeason  : int  NA 5 7 NA NA NA NA NA 6 NA ...
##  $ RankPlayoffs: int  NA 1 3 NA NA NA NA NA 4 NA ...
##  $ G           : int  162 162 162 162 161 162 162 162 162 162 ...
##  $ OOBP        : num  0.331 0.311 0.314 0.337 0.329 0.321 0.334 0.341 0.341 0.35 ...
##  $ OSLG        : num  0.412 0.404 0.384 0.439 0.393 0.398 0.427 0.455 0.417 0.48 ...
##  $ RD          : int  -39 141 86 -142 27 76 3 -115 76 17 ...
# Regression model to predict runs scored
RunsReg<-lm(RS ~ OBP + SLG + BA, data=moneyball)
summary(RunsReg) # the BA coefficient is negative, implying that a team would score more runs with a lower batting average, which is counterintuitive; this is likely due to multicollinearity, so let's remove BA from the model
## 
## Call:
## lm(formula = RS ~ OBP + SLG + BA, data = moneyball)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -70.941 -17.247  -0.621  16.754  90.998 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -788.46      19.70 -40.029  < 2e-16 ***
## OBP          2917.42     110.47  26.410  < 2e-16 ***
## SLG          1637.93      45.99  35.612  < 2e-16 ***
## BA           -368.97     130.58  -2.826  0.00482 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 24.69 on 898 degrees of freedom
## Multiple R-squared:  0.9302, Adjusted R-squared:   0.93 
## F-statistic:  3989 on 3 and 898 DF,  p-value: < 2.2e-16
RunsReg<-lm(RS ~ OBP + SLG, data=moneyball)
summary(RunsReg)
## 
## Call:
## lm(formula = RS ~ OBP + SLG, data = moneyball)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -70.838 -17.174  -1.108  16.770  90.036 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -804.63      18.92  -42.53   <2e-16 ***
## OBP          2737.77      90.68   30.19   <2e-16 ***
## SLG          1584.91      42.16   37.60   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 24.79 on 899 degrees of freedom
## Multiple R-squared:  0.9296, Adjusted R-squared:  0.9294 
## F-statistic:  5934 on 2 and 899 DF,  p-value: < 2.2e-16
#This is a simple (parsimonious) model with significant coefficients and a high R-squared. Also, since OBP and SLG are measured on the same scale, the larger OBP coefficient implies that OBP is probably worth more than SLG.
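#A rough, hedged illustration of that comparison: the predicted effect of an identical 0.010 increase in each statistic (values taken from the coefficients above)
coef(RunsReg)["OBP"] * 0.010  # roughly 27 extra runs
coef(RunsReg)["SLG"] * 0.010  # roughly 16 extra runs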

#########################################

#If a baseball team's OBP is 0.311 and SLG is 0.405, how many runs do we expect the team to score?

#Using the linear regression model constructed during the lecture (the one that uses OBP and SLG as independent variables), enter the number of runs we expect the team to score:
summary(RunsReg)
## 
## Call:
## lm(formula = RS ~ OBP + SLG, data = moneyball)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -70.838 -17.174  -1.108  16.770  90.036 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -804.63      18.92  -42.53   <2e-16 ***
## OBP          2737.77      90.68   30.19   <2e-16 ***
## SLG          1584.91      42.16   37.60   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 24.79 on 899 degrees of freedom
## Multiple R-squared:  0.9296, Adjusted R-squared:  0.9294 
## F-statistic:  5934 on 2 and 899 DF,  p-value: < 2.2e-16
-804.63 + (0.311*2737.77) + (0.405*1584.91) # from the summary output
## [1] 688.705
#Ans:689
#EXPLANATION:Our linear regression model was:
#Runs Scored = -804.63 + 2737.77*(OBP) + 1584.91*(SLG)
#Here, OBP is 0.311 and SLG is 0.405, so our prediction is:
#Runs Scored = -804.63 + 2737.77*0.311 + 1584.91*0.405 = 689 runs
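#Equivalently, a hedged check with predict() on the fitted RunsReg object:
#predict(RunsReg, newdata=data.frame(OBP=0.311, SLG=0.405))  # about 688.7 runs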


#If a baseball team's opponents' OBP (OOBP) is 0.297 and opponents' SLG (OSLG) is 0.370, how many runs do we expect the team to allow?

#Using the linear regression model discussed during the lecture (the one on the last slide of the previous video), enter the number of runs we expect the team to allow:
-837.38+( 0.297*2913.60)+(0.370*1514.29)
## [1] 588.2465
#Ans:588
#EXPLANATION:Our linear regression model was:
#Runs Allowed = -837.38 + 2913.60*(OOBP) + 1514.29*(OSLG)
#Here, OOBP is 0.297 and OSLG is 0.370, so our prediction is:
#Runs Allowed = -837.38 + 2913.60*(.297) + 1514.29*(.370) = 588 runs
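#The Runs Allowed model above is taken from the lecture slide and is not re-fit in this script. A minimal sketch of how it could be reproduced (OOBP and OSLG are missing before 1999, and lm() drops incomplete rows by default, so the coefficients should come out close to the slide values):
#RunsAllowedReg<-lm(RA ~ OOBP + OSLG, data=moneyball)
#predict(RunsAllowedReg, newdata=data.frame(OOBP=0.297, OSLG=0.370))  # roughly 588 runs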

VIDEO 4: USING THE MODELS TO MAKE PREDICTIONS

Q:Suppose you are the General Manager of a baseball team, and you are selecting TWO players for your team. You have a budget of $1,500,000, and you have the choice between the following players:

Player Name       OBP     SLG     Salary
Eric Chavez       0.338   0.540   $1,400,000
Jeremy Giambi     0.391   0.450   $1,065,000
Frank Menechino   0.369   0.374   $295,000
Greg Myers        0.313   0.447   $800,000
Carlos Pena       0.361   0.500   $300,000

Given your budget and the player statistics, which TWO players would you select?

Ans:Jeremy Giambi & Carlos Pena

EXPLANATION:We would select Jeremy Giambi and Carlos Pena, since they give the highest contribution to Runs Scored. We would not select Eric Chavez, since his salary consumes our entire budget, and although he has the highest SLG, there are players with better OBP. We would not select Frank Menechino since even though he has a high OBP, his SLG is low. We would not select Greg Myers since he is dominated by Carlos Pena in OBP and SLG, but has a much higher salary.
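A quick, hedged way to make this comparison concrete is to score each player with the Runs Scored coefficients from the previous video (the data frame below is typed in from the table above, and the contribution formula simply reuses the OBP and SLG coefficients):

players<-data.frame(Name=c("Eric Chavez","Jeremy Giambi","Frank Menechino","Greg Myers","Carlos Pena"),
                    OBP=c(0.338,0.391,0.369,0.313,0.361),
                    SLG=c(0.540,0.450,0.374,0.447,0.500),
                    Salary=c(1400000,1065000,295000,800000,300000))
players$RunContribution<-2737.77*players$OBP + 1584.91*players$SLG
players[order(-players$RunContribution),]  # rank players by predicted run contribution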

VIDEO 5: WINNING THE WORLD SERIES

In 2012 and 2013, there were 10 teams in the MLB playoffs: the six teams that had the most wins in each baseball division, and four “wild card” teams. The playoffs start between the four wild card teams - the two teams that win proceed in the playoffs (8 teams remaining). Then, these teams are paired off and play a series of games. The four teams that win are then paired and play to determine who will play in the World Series.

We can assign rankings to the teams as follows:

Rank 1: the team that won the World Series
Rank 2: the team that lost the World Series
Rank 3: the two teams that lost to the teams in the World Series
Rank 4: the four teams that made it past the wild card round, but lost to the above four teams
Rank 5: the two teams that lost the wild card round

In your R console, create a corresponding rank vector by typing teamRank<-c(1,2,3,3,4,4,4,4,5,5)

In this quick question, we’ll see how well these rankings correlate with the regular season wins of the teams. In 2012, the ranking of the teams and their regular season wins were as follows:

Rank 1: San Francisco Giants (Wins = 94)
Rank 2: Detroit Tigers (Wins = 88)
Rank 3: New York Yankees (Wins = 95), and St. Louis Cardinals (Wins = 88)
Rank 4: Baltimore Orioles (Wins = 93), Oakland A's (Wins = 94), Washington Nationals (Wins = 98), Cincinnati Reds (Wins = 97)
Rank 5: Texas Rangers (Wins = 93), and Atlanta Braves (Wins = 94)

Create a vector in R called wins2012, that has the wins of each team in 2012, in order of rank (the vector should have 10 numbers).

In 2013, the ranking of the teams and their regular season wins were as follows:

Rank 1: Boston Red Sox (Wins = 97)
Rank 2: St. Louis Cardinals (Wins = 97)
Rank 3: Los Angeles Dodgers (Wins = 92), and Detroit Tigers (Wins = 93)
Rank 4: Tampa Bay Rays (Wins = 92), Oakland A's (Wins = 96), Pittsburgh Pirates (Wins = 94), and Atlanta Braves (Wins = 96)
Rank 5: Cleveland Indians (Wins = 92), and Cincinnati Reds (Wins = 90)

Create another vector in R called wins2013, that has the wins of each team in 2013, in order of rank (the vector should have 10 numbers).

teamRank<-c(1,2,3,3,4,4,4,4,5,5)
wins2012<-c(94,88,95,88,93,94,98,97,93,94)
wins2013<-c(97,97,92,93,92,96,94,96,92,90)

#What is the correlation between teamRank and wins2012?
cor(teamRank,wins2012)
## [1] 0.3477129
#Ans:0.3477129


#What is the correlation between teamRank and wins2013?
cor(teamRank,wins2013)
## [1] -0.6556945
#Ans:-0.6556945

#Since one of the correlations is positive and the other is negative, this means that there does not seem to be a pattern between regular season wins and winning the playoffs. We wouldn't feel comfortable making a bet for this year given this data!

VIDEO 6: THE ANALYTICS EDGE IN SPORTS

Q:Which of the following is MOST LIKELY to be a topic of Sabermetric research?

Ans:Predicting how many home runs the Oakland A’s will hit next year

EXPLANATION:Sabermetric research tries to take a quantitative approach to baseball. Predicting how many home runs the Oakland A’s will hit next year is a very quantitative problem. While the other two topics could be an area of Sabermetric research, they are more qualitative.

Playing moneyball in the NBA(Recitation)

In this recitation, we’ll apply some of the ideas from Moneyball to data from the National Basketball Association (NBA). Please download the datasets NBA_train.csv and NBA_test.csv, and save them to a location on your computer that you will remember. This data comes from Basketball-Reference.com.

VIDEO 1: THE DATA

# VIDEO 1

# Read in the data
NBA<-read.csv("NBA_train.csv")
str(NBA)
## 'data.frame':    835 obs. of  20 variables:
##  $ SeasonEnd: int  1980 1980 1980 1980 1980 1980 1980 1980 1980 1980 ...
##  $ Team     : Factor w/ 37 levels "Atlanta Hawks",..: 1 2 5 6 8 9 10 11 12 13 ...
##  $ Playoffs : int  1 1 0 0 0 0 0 1 0 1 ...
##  $ W        : int  50 61 30 37 30 16 24 41 37 47 ...
##  $ PTS      : int  8573 9303 8813 9360 8878 8933 8493 9084 9119 8860 ...
##  $ oppPTS   : int  8334 8664 9035 9332 9240 9609 8853 9070 9176 8603 ...
##  $ FG       : int  3261 3617 3362 3811 3462 3643 3527 3599 3639 3582 ...
##  $ FGA      : int  7027 7387 6943 8041 7470 7596 7318 7496 7689 7489 ...
##  $ X2P      : int  3248 3455 3292 3775 3379 3586 3500 3495 3551 3557 ...
##  $ X2PA     : int  6952 6965 6668 7854 7215 7377 7197 7117 7375 7375 ...
##  $ X3P      : int  13 162 70 36 83 57 27 104 88 25 ...
##  $ X3PA     : int  75 422 275 187 255 219 121 379 314 114 ...
##  $ FT       : int  2038 1907 2019 1702 1871 1590 1412 1782 1753 1671 ...
##  $ FTA      : int  2645 2449 2592 2205 2539 2149 1914 2326 2333 2250 ...
##  $ ORB      : int  1369 1227 1115 1307 1311 1226 1155 1394 1398 1187 ...
##  $ DRB      : int  2406 2457 2465 2381 2524 2415 2437 2217 2326 2429 ...
##  $ AST      : int  1913 2198 2152 2108 2079 1950 2028 2149 2148 2123 ...
##  $ STL      : int  782 809 704 764 746 783 779 782 900 863 ...
##  $ BLK      : int  539 308 392 342 404 562 339 373 530 356 ...
##  $ TOV      : int  1495 1539 1684 1370 1533 1742 1492 1565 1517 1439 ...

VIDEO 2: PLAYOFFS AND WINS

# VIDEO 2

# How many wins to make the playoffs?
table(NBA$W, NBA$Playoffs)
##     
##       0  1
##   11  2  0
##   12  2  0
##   13  2  0
##   14  2  0
##   15 10  0
##   16  2  0
##   17 11  0
##   18  5  0
##   19 10  0
##   20 10  0
##   21 12  0
##   22 11  0
##   23 11  0
##   24 18  0
##   25 11  0
##   26 17  0
##   27 10  0
##   28 18  0
##   29 12  0
##   30 19  1
##   31 15  1
##   32 12  0
##   33 17  0
##   34 16  0
##   35 13  3
##   36 17  4
##   37 15  4
##   38  8  7
##   39 10 10
##   40  9 13
##   41 11 26
##   42  8 29
##   43  2 18
##   44  2 27
##   45  3 22
##   46  1 15
##   47  0 28
##   48  1 14
##   49  0 17
##   50  0 32
##   51  0 12
##   52  0 20
##   53  0 17
##   54  0 18
##   55  0 24
##   56  0 16
##   57  0 23
##   58  0 13
##   59  0 14
##   60  0  8
##   61  0 10
##   62  0 13
##   63  0  7
##   64  0  3
##   65  0  3
##   66  0  2
##   67  0  4
##   69  0  1
##   72  0  1
# Compute Points Difference
NBA$PTSdiff<-NBA$PTS - NBA$oppPTS

# Check for linear relationship
plot(NBA$PTSdiff, NBA$W)

# Linear regression model for wins
WinsReg = lm(W ~ PTSdiff, data=NBA)
summary(WinsReg)
## 
## Call:
## lm(formula = W ~ PTSdiff, data = NBA)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.7393 -2.1018 -0.0672  2.0265 10.6026 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 4.100e+01  1.059e-01   387.0   <2e-16 ***
## PTSdiff     3.259e-02  2.793e-04   116.7   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.061 on 833 degrees of freedom
## Multiple R-squared:  0.9423, Adjusted R-squared:  0.9423 
## F-statistic: 1.361e+04 on 1 and 833 DF,  p-value: < 2.2e-16
#lets write the linear model

#W = 41 + 0.0326 * PTSdiff >= 42
#from the table, a team would like to win about 42 games to make it to the playoffs
#therefore PTSdiff >= (42 - 41)/0.0326 = 30.67, i.e. about 31
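#A hedged cross-check using the fitted coefficients rather than rounded values:
#(42 - coef(WinsReg)[1]) / coef(WinsReg)[2]  # about 30.7, i.e. a points difference of roughly +31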

VIDEO 3: POINTS SCORED

# VIDEO 3

# Linear regression model for points scored
PointsReg = lm(PTS ~ X2PA + X3PA + FTA + AST + ORB + DRB + TOV + STL + BLK, data=NBA)
summary(PointsReg)
## 
## Call:
## lm(formula = PTS ~ X2PA + X3PA + FTA + AST + ORB + DRB + TOV + 
##     STL + BLK, data = NBA)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -527.40 -119.83    7.83  120.67  564.71 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2.051e+03  2.035e+02 -10.078   <2e-16 ***
## X2PA         1.043e+00  2.957e-02  35.274   <2e-16 ***
## X3PA         1.259e+00  3.843e-02  32.747   <2e-16 ***
## FTA          1.128e+00  3.373e-02  33.440   <2e-16 ***
## AST          8.858e-01  4.396e-02  20.150   <2e-16 ***
## ORB         -9.554e-01  7.792e-02 -12.261   <2e-16 ***
## DRB          3.883e-02  6.157e-02   0.631   0.5285    
## TOV         -2.475e-02  6.118e-02  -0.405   0.6859    
## STL         -1.992e-01  9.181e-02  -2.169   0.0303 *  
## BLK         -5.576e-02  8.782e-02  -0.635   0.5256    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 185.5 on 825 degrees of freedom
## Multiple R-squared:  0.8992, Adjusted R-squared:  0.8981 
## F-statistic: 817.3 on 9 and 825 DF,  p-value: < 2.2e-16
# Sum of Squared Errors
head(PointsReg$residuals)
##           1           2           3           4           5           6 
##   38.572271  142.872004  -92.895718   -8.391347 -258.470561  171.460833
SSE = sum(PointsReg$residuals^2)
SSE #difficult to interpret, hence we calc RMSE
## [1] 28394314
# Root mean squared error
RMSE = sqrt(SSE/nrow(NBA))
RMSE
## [1] 184.4049
# Average number of points in a season
mean(NBA$PTS)
## [1] 8370.24
# Remove insignificant variables
summary(PointsReg)
## 
## Call:
## lm(formula = PTS ~ X2PA + X3PA + FTA + AST + ORB + DRB + TOV + 
##     STL + BLK, data = NBA)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -527.40 -119.83    7.83  120.67  564.71 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2.051e+03  2.035e+02 -10.078   <2e-16 ***
## X2PA         1.043e+00  2.957e-02  35.274   <2e-16 ***
## X3PA         1.259e+00  3.843e-02  32.747   <2e-16 ***
## FTA          1.128e+00  3.373e-02  33.440   <2e-16 ***
## AST          8.858e-01  4.396e-02  20.150   <2e-16 ***
## ORB         -9.554e-01  7.792e-02 -12.261   <2e-16 ***
## DRB          3.883e-02  6.157e-02   0.631   0.5285    
## TOV         -2.475e-02  6.118e-02  -0.405   0.6859    
## STL         -1.992e-01  9.181e-02  -2.169   0.0303 *  
## BLK         -5.576e-02  8.782e-02  -0.635   0.5256    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 185.5 on 825 degrees of freedom
## Multiple R-squared:  0.8992, Adjusted R-squared:  0.8981 
## F-statistic: 817.3 on 9 and 825 DF,  p-value: < 2.2e-16
PointsReg2 = lm(PTS ~ X2PA + X3PA + FTA + AST + ORB + DRB + STL + BLK, data=NBA) # remove TOV since it has the highest non-significant p-value
summary(PointsReg2)
## 
## Call:
## lm(formula = PTS ~ X2PA + X3PA + FTA + AST + ORB + DRB + STL + 
##     BLK, data = NBA)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -526.79 -121.09    6.37  120.74  565.94 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2.077e+03  1.931e+02 -10.755   <2e-16 ***
## X2PA         1.044e+00  2.951e-02  35.366   <2e-16 ***
## X3PA         1.263e+00  3.703e-02  34.099   <2e-16 ***
## FTA          1.125e+00  3.308e-02  34.023   <2e-16 ***
## AST          8.861e-01  4.393e-02  20.173   <2e-16 ***
## ORB         -9.581e-01  7.758e-02 -12.350   <2e-16 ***
## DRB          3.892e-02  6.154e-02   0.632   0.5273    
## STL         -2.068e-01  8.984e-02  -2.301   0.0216 *  
## BLK         -5.863e-02  8.749e-02  -0.670   0.5029    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 185.4 on 826 degrees of freedom
## Multiple R-squared:  0.8991, Adjusted R-squared:  0.8982 
## F-statistic: 920.4 on 8 and 826 DF,  p-value: < 2.2e-16
PointsReg3 = lm(PTS ~ X2PA + X3PA + FTA + AST + ORB + STL + BLK, data=NBA)
summary(PointsReg3) # remove DRB since it has the highest non-significant p-value
## 
## Call:
## lm(formula = PTS ~ X2PA + X3PA + FTA + AST + ORB + STL + BLK, 
##     data = NBA)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -523.79 -121.64    6.07  120.81  573.64 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2.015e+03  1.670e+02 -12.068  < 2e-16 ***
## X2PA         1.048e+00  2.852e-02  36.753  < 2e-16 ***
## X3PA         1.271e+00  3.475e-02  36.568  < 2e-16 ***
## FTA          1.128e+00  3.270e-02  34.506  < 2e-16 ***
## AST          8.909e-01  4.326e-02  20.597  < 2e-16 ***
## ORB         -9.702e-01  7.519e-02 -12.903  < 2e-16 ***
## STL         -2.276e-01  8.356e-02  -2.724  0.00659 ** 
## BLK         -3.882e-02  8.165e-02  -0.475  0.63462    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 185.4 on 827 degrees of freedom
## Multiple R-squared:  0.8991, Adjusted R-squared:  0.8982 
## F-statistic:  1053 on 7 and 827 DF,  p-value: < 2.2e-16
PointsReg4 = lm(PTS ~ X2PA + X3PA + FTA + AST + ORB + STL, data=NBA)
summary(PointsReg4) # remove BLK since it has the highest non-significant p-value
## 
## Call:
## lm(formula = PTS ~ X2PA + X3PA + FTA + AST + ORB + STL, data = NBA)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -523.33 -122.02    6.93  120.68  568.26 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2.033e+03  1.629e+02 -12.475  < 2e-16 ***
## X2PA         1.050e+00  2.829e-02  37.117  < 2e-16 ***
## X3PA         1.273e+00  3.441e-02  37.001  < 2e-16 ***
## FTA          1.127e+00  3.260e-02  34.581  < 2e-16 ***
## AST          8.884e-01  4.292e-02  20.701  < 2e-16 ***
## ORB         -9.743e-01  7.465e-02 -13.051  < 2e-16 ***
## STL         -2.268e-01  8.350e-02  -2.717  0.00673 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 185.3 on 828 degrees of freedom
## Multiple R-squared:  0.8991, Adjusted R-squared:  0.8983 
## F-statistic:  1229 on 6 and 828 DF,  p-value: < 2.2e-16
# now we have a simple model with all significant var

# Compute SSE and RMSE for new model
SSE_4 = sum(PointsReg4$residuals^2)
RMSE_4 = sqrt(SSE_4/nrow(NBA))
SSE_4 #difficult to interpret, hence we calc RMSE
## [1] 28421465
RMSE_4
## [1] 184.493
# Now we have a simpler model with almost the same RMSE

VIDEO 4: MAKING PREDICTIONS

# VIDEO 4

# Read in test set
NBA_test = read.csv("NBA_test.csv")

# Make predictions on test set using the linear model from the training data
PointsPredictions = predict(PointsReg4, newdata=NBA_test)

# Compute out-of-sample R^2 #how well the model predicts on test data
SSE = sum((PointsPredictions - NBA_test$PTS)^2)
SST = sum((mean(NBA$PTS) - NBA_test$PTS)^2)
R2 = 1 - SSE/SST
R2
## [1] 0.8127142
# Compute the RMSE
RMSE = sqrt(SSE/nrow(NBA_test))
RMSE
## [1] 196.3723
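# The same out-of-sample calculation wrapped as a small helper function (a sketch; the
# name oosR2 is not from the course, it is just a convenience for reuse on other test sets):
oosR2<-function(pred, actual, trainMean) {
  SSE<-sum((pred - actual)^2)
  SST<-sum((trainMean - actual)^2)
  c(R2 = 1 - SSE/SST, RMSE = sqrt(SSE/length(actual)))
}
oosR2(PointsPredictions, NBA_test$PTS, mean(NBA$PTS))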

Assignment 2: CLIMATE CHANGE

There have been many studies documenting that the average global temperature has been increasing over the last century. The consequences of a continued rise in global temperature will be dire. Rising sea levels and an increased frequency of extreme weather events will affect billions of people.

In this problem, we will attempt to study the relationship between average global temperature and several other factors.

The file climate_change.csv contains climate data from May 1983 to December 2008. The available variables include:

Year: the observation year.

Month: the observation month.

Temp: the difference in degrees Celsius between the average global temperature in that period and a reference value. This data comes from the Climatic Research Unit at the University of East Anglia.

CO2, N2O, CH4, CFC.11, CFC.12: atmospheric concentrations of carbon dioxide (CO2), nitrous oxide (N2O), methane (CH4), trichlorofluoromethane (CCl3F; commonly referred to as CFC-11) and dichlorodifluoromethane (CCl2F2; commonly referred to as CFC-12), respectively. This data comes from the ESRL/NOAA Global Monitoring Division. CO2, N2O and CH4 are expressed in ppmv (parts per million by volume – i.e., 397 ppmv of CO2 means that CO2 constitutes 397 millionths of the total volume of the atmosphere); CFC.11 and CFC.12 are expressed in ppbv (parts per billion by volume).

Aerosols: the mean stratospheric aerosol optical depth at 550 nm. This variable is linked to volcanoes, as volcanic eruptions result in new particles being added to the atmosphere, which affect how much of the sun's energy is reflected back into space. This data is from the Goddard Institute for Space Studies at NASA.

TSI: the total solar irradiance (TSI) in W/m2 (the rate at which the sun's energy is deposited per unit area). Due to sunspots and other solar phenomena, the amount of energy that is given off by the sun varies substantially with time. This data is from the SOLARIS-HEPPA project website.

MEI: multivariate El Nino Southern Oscillation index (MEI), a measure of the strength of the El Nino/La Nina-Southern Oscillation (a weather effect in the Pacific Ocean that affects global temperatures). This data comes from the ESRL/NOAA Physical Sciences Division.

PROBLEM 1.1 - CREATING OUR FIRST MODEL

We are interested in how changes in these variables affect future temperatures, as well as how well these variables explain temperature changes so far. To do this, first read the dataset climate_change.csv into R.

Then, split the data into a training set, consisting of all the observations up to and including 2006, and a testing set consisting of the remaining years (hint: use subset). A training set refers to the data that will be used to build the model (this is the data we give to the lm() function), and a testing set refers to the data we will use to test our predictive ability.

Next, build a linear regression model to predict the dependent variable Temp, using MEI, CO2, CH4, N2O, CFC.11, CFC.12, TSI, and Aerosols as independent variables (Year and Month should NOT be used in the model). Use the training set to build the model.

climate<-read.csv("climate_change.csv")
str(climate)
## 'data.frame':    308 obs. of  11 variables:
##  $ Year    : int  1983 1983 1983 1983 1983 1983 1983 1983 1984 1984 ...
##  $ Month   : int  5 6 7 8 9 10 11 12 1 2 ...
##  $ MEI     : num  2.556 2.167 1.741 1.13 0.428 ...
##  $ CO2     : num  346 346 344 342 340 ...
##  $ CH4     : num  1639 1634 1633 1631 1648 ...
##  $ N2O     : num  304 304 304 304 304 ...
##  $ CFC.11  : num  191 192 193 194 194 ...
##  $ CFC.12  : num  350 352 354 356 357 ...
##  $ TSI     : num  1366 1366 1366 1366 1366 ...
##  $ Aerosols: num  0.0863 0.0794 0.0731 0.0673 0.0619 0.0569 0.0524 0.0486 0.0451 0.0416 ...
##  $ Temp    : num  0.109 0.118 0.137 0.176 0.149 0.093 0.232 0.078 0.089 0.013 ...
#splitting the dataset into train & test dataset

#method1
train<-subset(climate,Year<=2006)
str(train)
## 'data.frame':    284 obs. of  11 variables:
##  $ Year    : int  1983 1983 1983 1983 1983 1983 1983 1983 1984 1984 ...
##  $ Month   : int  5 6 7 8 9 10 11 12 1 2 ...
##  $ MEI     : num  2.556 2.167 1.741 1.13 0.428 ...
##  $ CO2     : num  346 346 344 342 340 ...
##  $ CH4     : num  1639 1634 1633 1631 1648 ...
##  $ N2O     : num  304 304 304 304 304 ...
##  $ CFC.11  : num  191 192 193 194 194 ...
##  $ CFC.12  : num  350 352 354 356 357 ...
##  $ TSI     : num  1366 1366 1366 1366 1366 ...
##  $ Aerosols: num  0.0863 0.0794 0.0731 0.0673 0.0619 0.0569 0.0524 0.0486 0.0451 0.0416 ...
##  $ Temp    : num  0.109 0.118 0.137 0.176 0.149 0.093 0.232 0.078 0.089 0.013 ...
dim(train)
## [1] 284  11
plot(train) #scatter plot matrix

test<-subset(climate,Year>2006)
str(test)
## 'data.frame':    24 obs. of  11 variables:
##  $ Year    : int  2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 ...
##  $ Month   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ MEI     : num  0.974 0.51 0.074 -0.049 0.183 ...
##  $ CO2     : num  383 384 385 386 387 ...
##  $ CH4     : num  1800 1803 1803 1802 1796 ...
##  $ N2O     : num  321 321 321 321 320 ...
##  $ CFC.11  : num  248 248 248 248 247 ...
##  $ CFC.12  : num  539 539 539 539 538 ...
##  $ TSI     : num  1366 1366 1366 1366 1366 ...
##  $ Aerosols: num  0.0054 0.0051 0.0045 0.0045 0.0041 0.004 0.004 0.0041 0.0042 0.0041 ...
##  $ Temp    : num  0.601 0.498 0.435 0.466 0.372 0.382 0.394 0.358 0.402 0.362 ...
dim(test)
## [1] 24 11
#method2
train_index<-(climate$Year<=2006)

train1<-subset(climate,train_index)
str(train1)
## 'data.frame':    284 obs. of  11 variables:
##  $ Year    : int  1983 1983 1983 1983 1983 1983 1983 1983 1984 1984 ...
##  $ Month   : int  5 6 7 8 9 10 11 12 1 2 ...
##  $ MEI     : num  2.556 2.167 1.741 1.13 0.428 ...
##  $ CO2     : num  346 346 344 342 340 ...
##  $ CH4     : num  1639 1634 1633 1631 1648 ...
##  $ N2O     : num  304 304 304 304 304 ...
##  $ CFC.11  : num  191 192 193 194 194 ...
##  $ CFC.12  : num  350 352 354 356 357 ...
##  $ TSI     : num  1366 1366 1366 1366 1366 ...
##  $ Aerosols: num  0.0863 0.0794 0.0731 0.0673 0.0619 0.0569 0.0524 0.0486 0.0451 0.0416 ...
##  $ Temp    : num  0.109 0.118 0.137 0.176 0.149 0.093 0.232 0.078 0.089 0.013 ...
dim(train1)
## [1] 284  11
test1<-climate[!train_index,]
str(test1)
## 'data.frame':    24 obs. of  11 variables:
##  $ Year    : int  2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 ...
##  $ Month   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ MEI     : num  0.974 0.51 0.074 -0.049 0.183 ...
##  $ CO2     : num  383 384 385 386 387 ...
##  $ CH4     : num  1800 1803 1803 1802 1796 ...
##  $ N2O     : num  321 321 321 321 320 ...
##  $ CFC.11  : num  248 248 248 248 247 ...
##  $ CFC.12  : num  539 539 539 539 538 ...
##  $ TSI     : num  1366 1366 1366 1366 1366 ...
##  $ Aerosols: num  0.0054 0.0051 0.0045 0.0045 0.0041 0.004 0.004 0.0041 0.0042 0.0041 ...
##  $ Temp    : num  0.601 0.498 0.435 0.466 0.372 0.382 0.394 0.358 0.402 0.362 ...
dim(test1)
## [1] 24 11
#lets now build the linear regression model on the training dataset

climatelm<-lm(Temp ~ MEI+ CO2 + CH4 + N2O + CFC.11 + CFC.12 + TSI + Aerosols, data=train)
summary(climatelm) 
## 
## Call:
## lm(formula = Temp ~ MEI + CO2 + CH4 + N2O + CFC.11 + CFC.12 + 
##     TSI + Aerosols, data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.25888 -0.05913 -0.00082  0.05649  0.32433 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.246e+02  1.989e+01  -6.265 1.43e-09 ***
## MEI          6.421e-02  6.470e-03   9.923  < 2e-16 ***
## CO2          6.457e-03  2.285e-03   2.826  0.00505 ** 
## CH4          1.240e-04  5.158e-04   0.240  0.81015    
## N2O         -1.653e-02  8.565e-03  -1.930  0.05467 .  
## CFC.11      -6.631e-03  1.626e-03  -4.078 5.96e-05 ***
## CFC.12       3.808e-03  1.014e-03   3.757  0.00021 ***
## TSI          9.314e-02  1.475e-02   6.313 1.10e-09 ***
## Aerosols    -1.538e+00  2.133e-01  -7.210 5.41e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.09171 on 275 degrees of freedom
## Multiple R-squared:  0.7509, Adjusted R-squared:  0.7436 
## F-statistic: 103.6 on 8 and 275 DF,  p-value: < 2.2e-16
#using the visreg package to plot regression models; default plots contain a confidence band, prediction line, and partial residuals
library(visreg)
par(mfrow=c(2,4))
visreg(climatelm)

#Enter the model R2 (the "Multiple R-squared" value):
summary(climatelm)$r.squared # see ?summary.lm for the components that summary() returns
## [1] 0.7508933
#Ans:0.7508933

##################################

#PROBLEM 1.2 - CREATING OUR FIRST MODEL

#Q:Which variables are significant in the model? We will consider a variable significant only if the p-value is below 0.05. (Select all that apply.)
#Ans:MEI,CO2,CFC.11,CFC.12,TSI,Aerosols  #from summary(climatelm)

####################################

#PROBLEM 2.1 - UNDERSTANDING THE MODEL

#Current scientific opinion is that nitrous oxide and CFC-11 are greenhouse gases: gases that are able to trap heat from the sun and contribute to the heating of the Earth. However, the regression coefficients of both the N2O and CFC-11 variables are negative, indicating that increasing atmospheric concentrations of either of these two compounds is associated with lower global temperatures.

#Which of the following is the simplest correct explanation for this contradiction?
#Ans:All of the gas concentration variables reflect human development - N2O and CFC.11 are correlated with other variables in the data set.
#EXPLANATION:The linear correlations of N2O and CFC.11 with other variables in the data set are quite large. The first explanation does not seem correct, as the warming effects of nitrous oxide and CFC-11 are well documented, and our regression analysis is not enough to disprove them. The second explanation is unlikely, as we have estimated eight coefficients and the intercept from 284 observations.

#############################################

#PROBLEM 2.2 - UNDERSTANDING THE MODEL  

#Compute the correlations between all the variables in the training set. Which of the following independent variables is N2O highly correlated with (absolute correlation greater than 0.7)? Select all that apply.
#cor(train) # correlation matrix not required
cor(train$N2O,train)
##           Year      Month         MEI       CO2       CH4 N2O    CFC.11
## [1,] 0.9938452 0.01363153 -0.05081978 0.9767198 0.8998386   1 0.5224773
##         CFC.12       TSI   Aerosols      Temp
## [1,] 0.8679308 0.1997567 -0.3370546 0.7786389
#Ans:CO2,CH4,CFC.12 

#Which of the following independent variables is CFC.11 highly correlated with? Select all that apply.
cor(train$CFC.11,train)
##           Year       Month        MEI       CO2      CH4       N2O CFC.11
## [1,] 0.5691064 -0.01311122 0.06900044 0.5140597 0.779904 0.5224773      1
##         CFC.12      TSI   Aerosols      Temp
## [1,] 0.8689852 0.272046 -0.0439212 0.4077103
#Ans: CH4,CFC.12

library(sjPlot)
#?sjp.corr
sjp.corr(train,type = "tile",show.legend =FALSE) #plotting the correlation matrix
## Computing correlation using spearman-method with listwise-deletion...

#########################################

#PROBLEM 3 - SIMPLIFYING THE MODEL

#Given that the correlations are so high, let us focus on the N2O variable and build a model with only MEI, TSI, Aerosols and N2O as independent variables. Remember to use the training set to build the model.

climatelm2<-lm(Temp ~ MEI+ N2O + TSI + Aerosols, data=train)
summary(climatelm2)
## 
## Call:
## lm(formula = Temp ~ MEI + N2O + TSI + Aerosols, data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.27916 -0.05975 -0.00595  0.05672  0.34195 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.162e+02  2.022e+01  -5.747 2.37e-08 ***
## MEI          6.419e-02  6.652e-03   9.649  < 2e-16 ***
## N2O          2.532e-02  1.311e-03  19.307  < 2e-16 ***
## TSI          7.949e-02  1.487e-02   5.344 1.89e-07 ***
## Aerosols    -1.702e+00  2.180e-01  -7.806 1.19e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.09547 on 279 degrees of freedom
## Multiple R-squared:  0.7261, Adjusted R-squared:  0.7222 
## F-statistic: 184.9 on 4 and 279 DF,  p-value: < 2.2e-16
par(mfrow =c(2, 2))
plot(climatelm2) #Plot Diagnostics for an lm Object

#Another excellent way to plot the LM diagnostics
#Plotting Diagnostics for LM with ggplot2 and ggfortify
library(ggfortify)
par(mfrow = c(1, 2))
autoplot(climatelm2, which = 1:6, ncol = 3,colour="black",size=2)

#help(autoplot.lm)


#Testing the model assumptions
library(sjPlot)
#?sjp.lm
sjp.lm(climatelm2,type="ma",completeDiagnostic=T)
## Removed 5 cases during 1 step(s).
## R^2 / adj. R^2 of original model: 0.726132 / 0.722206
## R^2 / adj. R^2 of updated model:  0.751744 / 0.748120
## AIC of original model: -521.289225 
## AIC of updated model:  -563.091626

##  lag Autocorrelation D-W Statistic p-value
##    1       0.5784039     0.8424936       0
##  Alternative hypothesis: rho != 0

## Non-constant Variance Score Test 
## Variance formula: ~ fitted.values 
## Chisquare = 0.8193595    Df = 1     p = 0.3653675 
## 
##  studentized Breusch-Pagan test
## 
## data:  linreg
## BP = 1.4215, df = 4, p-value = 0.8405

## 
## Suggested power transformation:  1.000072
#Enter the coefficient of N2O in this reduced model:
#Ans:0.02532

#Enter the model R2:
summary(climatelm2)$r.squared
## [1] 0.7261321
#Ans:0.7261321

#EXPLANATION:We have observed that, for this problem, when we remove many variables the sign of N2O flips. The model has not lost a lot of explanatory power (the model R2 is 0.7261 compared to 0.7509 previously) despite removing many variables. As discussed in lecture, this type of behavior is typical when building a model where many of the independent variables are highly correlated with each other. In this particular problem many of the variables (CO2, CH4, N2O, CFC.11 and CFC.12) are highly correlated, since they are all driven by human industrial development.
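#A quick, hedged check of the sign flip described above:
#coef(climatelm)["N2O"]   # negative in the full model
#coef(climatelm2)["N2O"]  # positive in the reduced model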

#################################################

#PROBLEM 4 - AUTOMATICALLY BUILDING THE MODEL

#We have many variables in this problem, and as we have seen above, dropping some from the model does not decrease model quality. R provides a function, step, that will automate the procedure of trying different combinations of variables to find a good compromise of model simplicity and R2. This trade-off is formalized by the Akaike information criterion (AIC) - it can be informally thought of as the quality of the model with a penalty for the number of variables in the model.

#The step function has one argument - the name of the initial model. It returns a simplified model. Use the step function in R to derive a new model, with the full model as the initial model (HINT: If your initial full model was called "climateLM", you could create a new model with the step function by typing step(climateLM). Be sure to save your new model to a variable name so that you can look at the summary. For more information about the step function, type ?step in your R console.)

StepModel<-step(climatelm)
## Start:  AIC=-1348.16
## Temp ~ MEI + CO2 + CH4 + N2O + CFC.11 + CFC.12 + TSI + Aerosols
## 
##            Df Sum of Sq    RSS     AIC
## - CH4       1   0.00049 2.3135 -1350.1
## <none>                  2.3130 -1348.2
## - N2O       1   0.03132 2.3443 -1346.3
## - CO2       1   0.06719 2.3802 -1342.0
## - CFC.12    1   0.11874 2.4318 -1335.9
## - CFC.11    1   0.13986 2.4529 -1333.5
## - TSI       1   0.33516 2.6482 -1311.7
## - Aerosols  1   0.43727 2.7503 -1301.0
## - MEI       1   0.82823 3.1412 -1263.2
## 
## Step:  AIC=-1350.1
## Temp ~ MEI + CO2 + N2O + CFC.11 + CFC.12 + TSI + Aerosols
## 
##            Df Sum of Sq    RSS     AIC
## <none>                  2.3135 -1350.1
## - N2O       1   0.03133 2.3448 -1348.3
## - CO2       1   0.06672 2.3802 -1344.0
## - CFC.12    1   0.13023 2.4437 -1336.5
## - CFC.11    1   0.13938 2.4529 -1335.5
## - TSI       1   0.33500 2.6485 -1313.7
## - Aerosols  1   0.43987 2.7534 -1302.7
## - MEI       1   0.83118 3.1447 -1264.9
summary(StepModel)
## 
## Call:
## lm(formula = Temp ~ MEI + CO2 + N2O + CFC.11 + CFC.12 + TSI + 
##     Aerosols, data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.25770 -0.05994 -0.00104  0.05588  0.32203 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.245e+02  1.985e+01  -6.273 1.37e-09 ***
## MEI          6.407e-02  6.434e-03   9.958  < 2e-16 ***
## CO2          6.402e-03  2.269e-03   2.821 0.005129 ** 
## N2O         -1.602e-02  8.287e-03  -1.933 0.054234 .  
## CFC.11      -6.609e-03  1.621e-03  -4.078 5.95e-05 ***
## CFC.12       3.868e-03  9.812e-04   3.942 0.000103 ***
## TSI          9.312e-02  1.473e-02   6.322 1.04e-09 ***
## Aerosols    -1.540e+00  2.126e-01  -7.244 4.36e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.09155 on 276 degrees of freedom
## Multiple R-squared:  0.7508, Adjusted R-squared:  0.7445 
## F-statistic: 118.8 on 7 and 276 DF,  p-value: < 2.2e-16
coef(StepModel)
##   (Intercept)           MEI           CO2           N2O        CFC.11 
## -1.245152e+02  6.406779e-02  6.401495e-03 -1.602113e-02 -6.609351e-03 
##        CFC.12           TSI      Aerosols 
##  3.867565e-03  9.311551e-02 -1.540206e+00
#fitted(StepModel)
#resid(StepModel)
AIC(StepModel)
## [1] -542.1426
#If you look at the summary of the model with summary(StepModel), you can see that the R-squared value is 0.75, and only CH4 was removed.
#It is interesting to note that the step function does not address the collinearity of the variables, except that adding highly correlated variables will not improve the R2 significantly. The consequence of this is that the step function will not necessarily produce a very interpretable model - just a model that has balanced quality and simplicity for a particular weighting of quality and simplicity (AIC).

###################################

#PROBLEM 5 - TESTING ON UNSEEN DATA

#We have developed an understanding of how well we can fit a linear regression to the training data, but does the model quality hold when applied to unseen data?

#Using the model produced from the step function, calculate temperature predictions for the testing data set, using the predict function.
climateStep<-lm(Temp ~ MEI + CO2 + N2O + CFC.11 + CFC.12 + TSI + Aerosols, data = train)
tempPredict<- predict(climateStep, newdata=test)
summary(tempPredict)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3142  0.3418  0.3771  0.3832  0.4245  0.4678
#Enter the testing set R2:
# Compute out-of-sample R^2 #how well the model predicts on test data
SSE = sum((tempPredict - test$Temp)^2)
SST = sum((mean(train$Temp) - test$Temp)^2)
R2 = 1 - SSE/SST
R2
## [1] 0.6286051
# Compute the RMSE
RMSE = sqrt(SSE/nrow(test))
RMSE
## [1] 0.09522876
#Lets do variable selection using leaps package with regsubsets()

library(leaps)
regfit.bwd<-regsubsets(Temp~.-Year-Month,data = train,nvmax=8,nbest=1,method = "backward")
reg.summary<-summary(regfit.bwd)
reg.summary
## Subset selection object
## Call: regsubsets.formula(Temp ~ . - Year - Month, data = train, nvmax = 8, 
##     nbest = 1, method = "backward")
## 8 Variables  (and intercept)
##          Forced in Forced out
## MEI          FALSE      FALSE
## CO2          FALSE      FALSE
## CH4          FALSE      FALSE
## N2O          FALSE      FALSE
## CFC.11       FALSE      FALSE
## CFC.12       FALSE      FALSE
## TSI          FALSE      FALSE
## Aerosols     FALSE      FALSE
## 1 subsets of each size up to 8
## Selection Algorithm: backward
##          MEI CO2 CH4 N2O CFC.11 CFC.12 TSI Aerosols
## 1  ( 1 ) " " " " " " " " " "    "*"    " " " "     
## 2  ( 1 ) "*" "*" " " " " " "    " "    " " " "     
## 3  ( 1 ) "*" " " " " " " "*"    "*"    " " " "     
## 4  ( 1 ) "*" " " " " " " "*"    "*"    " " "*"     
## 5  ( 1 ) "*" " " " " " " "*"    "*"    "*" "*"     
## 6  ( 1 ) "*" "*" " " " " "*"    "*"    "*" "*"     
## 7  ( 1 ) "*" "*" " " "*" "*"    "*"    "*" "*"     
## 8  ( 1 ) "*" "*" "*" "*" "*"    "*"    "*" "*"
#An asterisk indicates that a given variable is included in the corresponding model
#?regsubsets
reg.summary$rsq
## [1] 0.4727354 0.6638403 0.6666184 0.7085996 0.7434239 0.7474671 0.7508409
## [8] 0.7508933
reg.summary$cp # Mallows' Cp; the lowest value corresponds to the 7-variable model
## [1] 302.070882  93.101659  92.034807  47.689848  11.245733   8.782349
## [7]   7.057826   9.000000
reg.summary$adjr2 # Adjusted R-squared; the highest value corresponds to the 7-variable model
## [1] 0.4708656 0.6614477 0.6630464 0.7044218 0.7388093 0.7419970 0.7445216
## [8] 0.7436465
reg.summary$bic # BIC; the lowest value indicates the preferred model (here the 5-variable model)
## [1] -170.4770 -292.6611 -289.3689 -321.9433 -352.4399 -351.3018 -349.4727
## [8] -343.8834
reg.summary$which #A logical matrix indicating which elements are in each model
##   (Intercept)   MEI   CO2   CH4   N2O CFC.11 CFC.12   TSI Aerosols
## 1        TRUE FALSE FALSE FALSE FALSE  FALSE   TRUE FALSE    FALSE
## 2        TRUE  TRUE  TRUE FALSE FALSE  FALSE  FALSE FALSE    FALSE
## 3        TRUE  TRUE FALSE FALSE FALSE   TRUE   TRUE FALSE    FALSE
## 4        TRUE  TRUE FALSE FALSE FALSE   TRUE   TRUE FALSE     TRUE
## 5        TRUE  TRUE FALSE FALSE FALSE   TRUE   TRUE  TRUE     TRUE
## 6        TRUE  TRUE  TRUE FALSE FALSE   TRUE   TRUE  TRUE     TRUE
## 7        TRUE  TRUE  TRUE FALSE  TRUE   TRUE   TRUE  TRUE     TRUE
## 8        TRUE  TRUE  TRUE  TRUE  TRUE   TRUE   TRUE  TRUE     TRUE
#lets use in built plot() in regsubsets()
#The darker the shading, the better the fit; a gap means that the variable is not included in that model.
par(mfrow=c(2,2))
plot(regfit.bwd,scale="adjr2")
#Looking at the plot, at the highest adjusted R2 the CH4 column shows a gap, implying that the variable should be dropped; hence the 7-variable model is selected under the highest-adjR2 criterion

# we can draw the same plot using other accuracy criteria
#?plot.regsubsets
plot(regfit.bwd,scale="r2") #here look for highest values
plot(regfit.bwd,scale="Cp") #here look for lowest values
plot(regfit.bwd,scale="bic") #here look for lowest values

#from reg.summary and the plots, we can see that the 7-variable model fits best, matching the model chosen by step() earlier

#plotting RSS, adjusted R-squared, Cp, and BIC for all models at once will also help us decide which model to select
par(mfrow=c(2,2))
plot(reg.summary$rss,xlab="Number of variables",ylab="RSS",type="l")
which.min(reg.summary$rss)
## [1] 8
points(8,reg.summary$rss[8],col="red",cex=2,pch=20) # mark the point with the minimum RSS


plot(reg.summary$adjr2,xlab="Number of variables",ylab="Adjusted R2",type="l")
which.max(reg.summary$adjr2)
## [1] 7
points(7,reg.summary$adjr2[7],col="red",cex=2,pch=20) # adding the point to the plot which shows the max adjusted R2

#similarily for other measures
plot(reg.summary$cp,xlab="Number of variables",ylab="Cp",type="l")
which.min(reg.summary$cp)
## [1] 7
points(7,reg.summary$cp[7],col="red",cex=2,pch=20) # adding the point to the plot which shows the min Cp

plot(reg.summary$bic,xlab="Number of variables",ylab="BIC",type="l")
which.min(reg.summary$bic)
## [1] 5
points(5,reg.summary$bic[5],col="red",cex=2,pch=20) # mark the point with the minimum BIC

Let's see the sjPlot package for journal-style summary output

#let's compare the two fitted models
#let's get the summary output as a better-presented table
library(sjPlot)
sjt.lm(climatelm,climatelm2) 
              Temp (climatelm)                     Temp (climatelm2)
              B        CI                 p        B        CI                 p
(Intercept)   -124.59  -163.74 – -85.44   <.001    -116.23  -156.04 – -76.42   <.001
MEI           0.06     0.05 – 0.08        <.001    0.06     0.05 – 0.08        <.001
CO2           0.01     0.00 – 0.01        .005
CH4           0.00     -0.00 – 0.00       .810
N2O           -0.02    -0.03 – 0.00       .055     0.03     0.02 – 0.03        <.001
CFC.11        -0.01    -0.01 – -0.00      <.001
CFC.12        0.00     0.00 – 0.01        <.001
TSI           0.09     0.06 – 0.12        <.001    0.08     0.05 – 0.11        <.001
Aerosols      -1.54    -1.96 – -1.12      <.001    -1.70    -2.13 – -1.27      <.001
Observations  284                                  284
R2 / adj. R2  .751 / .744                          .726 / .722

Assignment 2: READING TEST SCORES

READING TEST SCORES

The Programme for International Student Assessment (PISA) is a test given every three years to 15-year-old students from around the world to evaluate their performance in mathematics, reading, and science. This test provides a quantitative way to compare the performance of students from different parts of the world. In this homework assignment, we will predict the reading scores of students from the United States of America on the 2009 PISA exam.

The datasets pisa2009train.csv and pisa2009test.csv contain information about the demographics and schools for American students taking the exam, derived from 2009 PISA Public-Use Data Files distributed by the United States National Center for Education Statistics (NCES). While the datasets are not supposed to contain identifying information about students taking the test, by using the data you are bound by the NCES data use agreement, which prohibits any attempt to determine the identity of any student in the datasets.

Each row in the datasets pisa2009train.csv and pisa2009test.csv represents one student taking the exam.

#lets import the data set
pisaTrain<-read.csv("pisa2009train.csv")
pisaTest<-read.csv("pisa2009test.csv")
str(pisaTrain)
## 'data.frame':    3663 obs. of  24 variables:
##  $ grade                : int  11 11 9 10 10 10 10 10 9 10 ...
##  $ male                 : int  1 1 1 0 1 1 0 0 0 1 ...
##  $ raceeth              : Factor w/ 7 levels "American Indian/Alaska Native",..: NA 7 7 3 4 3 2 7 7 5 ...
##  $ preschool            : int  NA 0 1 1 1 1 0 1 1 1 ...
##  $ expectBachelors      : int  0 0 1 1 0 1 1 1 0 1 ...
##  $ motherHS             : int  NA 1 1 0 1 NA 1 1 1 1 ...
##  $ motherBachelors      : int  NA 1 1 0 0 NA 0 0 NA 1 ...
##  $ motherWork           : int  1 1 1 1 1 1 1 0 1 1 ...
##  $ fatherHS             : int  NA 1 1 1 1 1 NA 1 0 0 ...
##  $ fatherBachelors      : int  NA 0 NA 0 0 0 NA 0 NA 0 ...
##  $ fatherWork           : int  1 1 1 1 0 1 NA 1 1 1 ...
##  $ selfBornUS           : int  1 1 1 1 1 1 0 1 1 1 ...
##  $ motherBornUS         : int  0 1 1 1 1 1 1 1 1 1 ...
##  $ fatherBornUS         : int  0 1 1 1 0 1 NA 1 1 1 ...
##  $ englishAtHome        : int  0 1 1 1 1 1 1 1 1 1 ...
##  $ computerForSchoolwork: int  1 1 1 1 1 1 1 1 1 1 ...
##  $ read30MinsADay       : int  0 1 0 1 1 0 0 1 0 0 ...
##  $ minutesPerWeekEnglish: int  225 450 250 200 250 300 250 300 378 294 ...
##  $ studentsInEnglish    : int  NA 25 28 23 35 20 28 30 20 24 ...
##  $ schoolHasLibrary     : int  1 1 1 1 1 1 1 1 0 1 ...
##  $ publicSchool         : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ urban                : int  1 0 0 1 1 0 1 0 1 0 ...
##  $ schoolSize           : int  673 1173 1233 2640 1095 227 2080 1913 502 899 ...
##  $ readingScore         : num  476 575 555 458 614 ...
#PROBLEM 1.1 - DATASET SIZE  

#How many students are there in the training set?
nrow(pisaTrain)
## [1] 3663
#Ans:3663
#We can then access the number of rows in the training set with str(pisaTrain) or nrow(pisaTrain).

################################

#PROBLEM 1.2 - SUMMARIZING THE DATASET  

#Using tapply() on pisaTrain, what is the average reading test score of males?
tapply(pisaTrain$readingScore,pisaTrain$male,mean,na.rm=T)
##        0        1 
## 512.9406 483.5325
#Ans:483.5325

#Of females?
#Ans:512.9406

#################################

#PROBLEM 1.3 - LOCATING MISSING VALUES  

#Which variables are missing data in at least one observation in the training set? Select all that apply.
colnames(pisaTrain)[colSums(is.na(pisaTrain)) > 0] #or colnames(pisaTrain)[apply(is.na(pisaTrain), 2, any)]
##  [1] "raceeth"               "preschool"            
##  [3] "expectBachelors"       "motherHS"             
##  [5] "motherBachelors"       "motherWork"           
##  [7] "fatherHS"              "fatherBachelors"      
##  [9] "fatherWork"            "selfBornUS"           
## [11] "motherBornUS"          "fatherBornUS"         
## [13] "englishAtHome"         "computerForSchoolwork"
## [15] "read30MinsADay"        "minutesPerWeekEnglish"
## [17] "studentsInEnglish"     "schoolHasLibrary"     
## [19] "schoolSize"
#or apply(is.na(pisaTrain), 2, any)
#or
which(apply(is.na(pisaTrain), 2, sum)>0) # gives col names as well as col no
##               raceeth             preschool       expectBachelors 
##                     3                     4                     5 
##              motherHS       motherBachelors            motherWork 
##                     6                     7                     8 
##              fatherHS       fatherBachelors            fatherWork 
##                     9                    10                    11 
##            selfBornUS          motherBornUS          fatherBornUS 
##                    12                    13                    14 
##         englishAtHome computerForSchoolwork        read30MinsADay 
##                    15                    16                    17 
## minutesPerWeekEnglish     studentsInEnglish      schoolHasLibrary 
##                    18                    19                    20 
##            schoolSize 
##                    23
#Ans:[1]"raceeth"               "preschool"             "expectBachelors"      
# [4] "motherHS"              "motherBachelors"       "motherWork"           
# [7] "fatherHS"              "fatherBachelors"       "fatherWork"           
#[10] "selfBornUS"            "motherBornUS"          "fatherBornUS"         
#[13] "englishAtHome"         "computerForSchoolwork" "read30MinsADay"       
#[16] "minutesPerWeekEnglish" "studentsInEnglish"     "schoolHasLibrary"     
#[19] "schoolSize"           
#EXPLANATION:We can read which variables have missing values from summary(pisaTrain). Because most variables are collected from study participants via survey, it is expected that most questions will have at least one missing value.

##########################

# Problem 1.4 - Removing missing values

#Linear regression discards observations with missing data, so we will remove all such observations from the training and testing sets. Later in the course, we will learn about imputation, which deals with missing data by filling in missing values with plausible information.
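#Before dropping rows, a quick (optional) count of how many observations contain at least one missing value, using complete.cases():
sum(!complete.cases(pisaTrain))
sum(!complete.cases(pisaTest))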

pisaTrain<-na.omit(pisaTrain)

pisaTest<-na.omit(pisaTest)

#How many observations are now in the training set?
nrow(pisaTrain)
## [1] 2414
#Ans:2414

#How many observations are now in the testing set?
nrow(pisaTest)
## [1] 990
#Ans:990

##############################

# Problem 2.1 - Factor variables 

#Factor variables are variables that take on a discrete set of values, like the "Region" variable in the WHO dataset from the second lecture of Unit 1. This is an unordered factor because there isn't any natural ordering between the levels. An ordered factor has a natural ordering between the levels (an example would be the classifications "large," "medium," and "small").
sapply(pisaTrain,class) #to get classes of all var
##                 grade                  male               raceeth 
##             "integer"             "integer"              "factor" 
##             preschool       expectBachelors              motherHS 
##             "integer"             "integer"             "integer" 
##       motherBachelors            motherWork              fatherHS 
##             "integer"             "integer"             "integer" 
##       fatherBachelors            fatherWork            selfBornUS 
##             "integer"             "integer"             "integer" 
##          motherBornUS          fatherBornUS         englishAtHome 
##             "integer"             "integer"             "integer" 
## computerForSchoolwork        read30MinsADay minutesPerWeekEnglish 
##             "integer"             "integer"             "integer" 
##     studentsInEnglish      schoolHasLibrary          publicSchool 
##             "integer"             "integer"             "integer" 
##                 urban            schoolSize          readingScore 
##             "integer"             "integer"             "numeric"
#Which of the following variables is an unordered factor with at least 3 levels? (Select all that apply.)
levels(pisaTrain$raceeth)
## [1] "American Indian/Alaska Native"         
## [2] "Asian"                                 
## [3] "Black"                                 
## [4] "Hispanic"                              
## [5] "More than one race"                    
## [6] "Native Hawaiian/Other Pacific Islander"
## [7] "White"
#Ans:raceeth

#Which of the following variables is an ordered factor with at least 3 levels? (Select all that apply.)
#Ans:grade

#Explanation:Male only has 2 levels (1 and 0). There is no natural ordering between the different values of raceeth, so it is an unordered factor. Meanwhile, we can order grades (8, 9, 10, 11, 12), so it is an ordered factor.

########################################

# Problem 2.2 - Unordered factors in regression models 

#To include unordered factors in a linear regression model, we define one level as the "reference level" and add a binary variable for each of the remaining levels. In this way, a factor with n levels is replaced by n-1 binary variables. The reference level is typically selected to be the most frequently occurring level in the dataset.

#As an example, consider the unordered factor variable "color", with levels "red", "green", and "blue". If "green" were the reference level, then we would add binary variables "colorred" and "colorblue" to a linear regression problem. All red examples would have colorred=1 and colorblue=0. All blue examples would have colorred=0 and colorblue=1. All green examples would have colorred=0 and colorblue=0.

#Now, consider the variable "raceeth" in our problem, which has levels "American Indian/Alaska Native", "Asian", "Black", "Hispanic", "More than one race", "Native Hawaiian/Other Pacific Islander", and "White". Because it is the most common in our population, we will select White as the reference level.

#Which binary variables will be included in the regression model? (Select all that apply.)
#Ans:raceethAmerican Indian/Alaska Native,raceethAsian,raceethBlack ,raceethHispanic ,raceethMore than one race,raceethNative Hawaiian/Other Pacific Islander
#Explanation:We create a binary variable for each level except the reference level, so we would create all these variables except for raceethWhite.
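#A minimal sketch of the dummy coding described above, using the toy "color" factor (hypothetical data, not from the PISA files):
color<-factor(c("red","green","blue"))
color<-relevel(color, ref="green")  # make "green" the reference level
model.matrix(~ color)               # adds colorblue and colorred; the green row is 0 in both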

#########################################

# Problem 2.3 - Example unordered factors 

#Consider again adding our unordered factor race to the regression model with reference level "White".

#For a student who is Asian, which binary variables would be set to 0? All remaining variables will be set to 1. (Select all that apply.)
#Ans:raceethAmerican Indian/Alaska Native,raceethBlack ,raceethHispanic ,raceethMore than one race,raceethNative Hawaiian/Other Pacific Islander

#For a student who is white, which binary variables would be set to 0? All remaining variables will be set to 1. (Select all that apply.)
#Ans:raceethAmerican Indian/Alaska Native,raceethAsian,raceethBlack ,raceethHispanic ,raceethMore than one race,raceethNative Hawaiian/Other Pacific Islander

#Explanation:An Asian student will have raceethAsian set to 1 and all other raceeth binary variables set to 0. Because "White" is the reference level, a white student will have all raceeth binary variables set to 0.

######################################

# Problem 3.1 - Building a model 

#Because the race variable takes on text values, it was loaded as a factor variable when we read in the dataset with read.csv() -- you can see this when you run str(pisaTrain) or str(pisaTest). However, by default R selects the first level alphabetically ("American Indian/Alaska Native") as the reference level of our factor instead of the most common level ("White"). Set the reference level of the factor by typing the following two lines in your R console:

pisaTrain$raceeth<-relevel(pisaTrain$raceeth, "White")
pisaTest$raceeth<-relevel(pisaTest$raceeth, "White")

#Now, build a linear regression model (call it lmScore) using the training set to predict readingScore using all the remaining variables.

#It would be time-consuming to type all the variables, but R provides the shorthand notation "readingScore ~ ." to mean "predict readingScore using all the other variables in the data frame." The period is used to replace listing out all of the independent variables. As an example, if your dependent variable is called "Y", your independent variables are called "X1", "X2", and "X3", and your training data set is called "Train", instead of the regular notation:
#LinReg = lm(Y ~ X1 + X2 + X3, data = Train)
#You would use the following command to build your model:
#LinReg = lm(Y ~ ., data = Train)

lmScore<-lm(readingScore ~ .,data=pisaTrain)
summary(lmScore)
## 
## Call:
## lm(formula = readingScore ~ ., data = pisaTrain)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -247.44  -48.86    1.86   49.77  217.18 
## 
## Coefficients:
##                                                 Estimate Std. Error
## (Intercept)                                   143.766333  33.841226
## grade                                          29.542707   2.937399
## male                                          -14.521653   3.155926
## raceethAmerican Indian/Alaska Native          -67.277327  16.786935
## raceethAsian                                   -4.110325   9.220071
## raceethBlack                                  -67.012347   5.460883
## raceethHispanic                               -38.975486   5.177743
## raceethMore than one race                     -16.922522   8.496268
## raceethNative Hawaiian/Other Pacific Islander  -5.101601  17.005696
## preschool                                      -4.463670   3.486055
## expectBachelors                                55.267080   4.293893
## motherHS                                        6.058774   6.091423
## motherBachelors                                12.638068   3.861457
## motherWork                                     -2.809101   3.521827
## fatherHS                                        4.018214   5.579269
## fatherBachelors                                16.929755   3.995253
## fatherWork                                      5.842798   4.395978
## selfBornUS                                     -3.806278   7.323718
## motherBornUS                                   -8.798153   6.587621
## fatherBornUS                                    4.306994   6.263875
## englishAtHome                                   8.035685   6.859492
## computerForSchoolwork                          22.500232   5.702562
## read30MinsADay                                 34.871924   3.408447
## minutesPerWeekEnglish                           0.012788   0.010712
## studentsInEnglish                              -0.286631   0.227819
## schoolHasLibrary                               12.215085   9.264884
## publicSchool                                  -16.857475   6.725614
## urban                                          -0.110132   3.962724
## schoolSize                                      0.006540   0.002197
##                                               t value Pr(>|t|)    
## (Intercept)                                     4.248 2.24e-05 ***
## grade                                          10.057  < 2e-16 ***
## male                                           -4.601 4.42e-06 ***
## raceethAmerican Indian/Alaska Native           -4.008 6.32e-05 ***
## raceethAsian                                   -0.446  0.65578    
## raceethBlack                                  -12.271  < 2e-16 ***
## raceethHispanic                                -7.528 7.29e-14 ***
## raceethMore than one race                      -1.992  0.04651 *  
## raceethNative Hawaiian/Other Pacific Islander  -0.300  0.76421    
## preschool                                      -1.280  0.20052    
## expectBachelors                                12.871  < 2e-16 ***
## motherHS                                        0.995  0.32001    
## motherBachelors                                 3.273  0.00108 ** 
## motherWork                                     -0.798  0.42517    
## fatherHS                                        0.720  0.47147    
## fatherBachelors                                 4.237 2.35e-05 ***
## fatherWork                                      1.329  0.18393    
## selfBornUS                                     -0.520  0.60331    
## motherBornUS                                   -1.336  0.18182    
## fatherBornUS                                    0.688  0.49178    
## englishAtHome                                   1.171  0.24153    
## computerForSchoolwork                           3.946 8.19e-05 ***
## read30MinsADay                                 10.231  < 2e-16 ***
## minutesPerWeekEnglish                           1.194  0.23264    
## studentsInEnglish                              -1.258  0.20846    
## schoolHasLibrary                                1.318  0.18749    
## publicSchool                                   -2.506  0.01226 *  
## urban                                          -0.028  0.97783    
## schoolSize                                      2.977  0.00294 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 73.81 on 2385 degrees of freedom
## Multiple R-squared:  0.3251, Adjusted R-squared:  0.3172 
## F-statistic: 41.04 on 28 and 2385 DF,  p-value: < 2.2e-16
#What is the Multiple R-squared value of lmScore on the training set?
#Ans:0.3251
#Note that this R-squared is lower than the ones for the models we saw in the lectures and recitation. This does not necessarily imply that the model is of poor quality. More often than not, it simply means that the prediction problem at hand (predicting a student's test score based on demographic and school-related variables) is more difficult than other prediction problems (like predicting a team's number of wins from their runs scored and allowed, or predicting the quality of wine from weather conditions).

###############################

#Problem 3.2 - Computing the root-mean squared error of the model 

#What is the training-set root-mean squared error (RMSE) of lmScore?

# Compute SSE and RMSE for new model
SSE = sum(lmScore$residuals^2) 
SSE
## [1] 12993365
RMSE = sqrt(SSE / nrow(pisaTrain))
RMSE
## [1] 73.36555
# or calculate the RMSE directly
sqrt(mean(lmScore$residuals^2))
## [1] 73.36555
#Ans:73.36555

#Using the qpcR package to calculate the RMSE
library(qpcR)
## Loading required package: MASS
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
## Loading required package: minpack.lm
## Loading required package: rgl
## Loading required package: robustbase
## 
## Attaching package: 'robustbase'
## The following object is masked from 'package:survival':
## 
##     heart
## Loading required package: Matrix
## 
## Attaching package: 'Matrix'
## The following object is masked from 'package:tidyr':
## 
##     expand
RMSE(lmScore, which = NULL)
## [1] 73.36555
#############################

# Problem 3.3 - Comparing predictions for similar students

#Consider two students A and B. They have all variable values the same, except that student A is in grade 11 and student B is in grade 9. What is the predicted reading score of student A minus the predicted reading score of student B?
29.542707*(11-9) # grade coefficient times the difference in grade
## [1] 59.08541
#Ans:59.08541
#Explanation:The coefficient 29.54 on grade is the difference in reading score between two students who are identical other than having a difference in grade of 1. Because A and B have a difference in grade of 2, the model predicts that student A has a reading score that is 2*29.54 larger.

################################

# Problem 3.4 - Interpreting model coefficients

#What is the meaning of the coefficient associated with variable raceethAsian?
coef(summary(lmScore))["raceethAsian","Estimate"]
## [1] -4.110325
summary(lmScore)$coefficients["raceethAsian","Estimate"]
## [1] -4.110325
#Ans:Predicted difference in the reading score between an Asian student and a white student who is otherwise identical 
#Explanation:The only difference between an Asian student and white student with otherwise identical variables is that the former has raceethAsian=1 and the latter has raceethAsian=0. The predicted reading score for these two students will differ by the coefficient on the variable raceethAsian.

##############################

# Problem 3.5 - Identifying variables lacking statistical significance

#Based on the significance codes, which variables are candidates for removal from the model? Select all that apply. (We'll assume that the factor variable raceeth should only be removed if none of its levels are significant.)
#Ans:preschool,motherHS,motherWork,fatherHS,fatherWork ,selfBornUS,motherBornUS,fatherBornUS,englishAtHome,minutesPerWeekEnglish,studentsInEnglish,schoolHasLibrary,urban
#Explanation:From summary(lmScore), we can see which variables were significant at the 0.05 level. Because several of the binary variables generated from the race factor variable are significant, we should not remove this variable.
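#One way to pull these out programmatically (a sketch, not part of the assignment) is to filter the p-value column of the coefficient table; note this treats each raceeth dummy separately, whereas the answer keeps the raceeth factor as a whole because some of its levels are significant.
pvals <- coef(summary(lmScore))[, 4]   # Pr(>|t|) column
names(pvals)[pvals >= 0.05]            # terms not significant at the 0.05 level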

########################

# Problem 4.1 - Predicting on unseen data

#Using the "predict" function and supplying the "newdata" argument, use the lmScore model to predict the reading scores of students in pisaTest. Call this vector of predictions "predTest". Do not change the variables in the model (for example, do not remove variables that we found were not significant in the previous part of this problem). Use the summary function to describe the test set predictions.

#What is the range between the maximum and minimum predicted reading score on the test set?
predTest<-predict(lmScore,newdata = pisaTest)
range(predTest) # we get the min & max values
## [1] 353.2231 637.6914
353.2231-637.6914
## [1] -284.4683
#Ans:284.4683
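#Equivalently (a small convenience, not required by the problem), diff(range()) returns max minus min in one step:
diff(range(predTest))  # 637.6914 - 353.2231 = 284.4683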

################################

# Problem 4.2 - Test set SSE and RMSE

#What is the sum of squared errors (SSE) of lmScore on the testing set?

#Method 1: compute it by hand from the formula
SSE = sum((predTest - pisaTest$readingScore)^2)
#Ans:5762082

# Compute the RMSE
sqrt(mean((predTest-pisaTest$readingScore)^2)) # or RMSE = sqrt(SSE / nrow(pisaTest))
## [1] 76.29079
#Ans:76.29079

################################

# Problem 4.3 - Baseline prediction and test-set SSE 

#What is the predicted test score used in the baseline model? Remember to compute this value using the training set and not the test set.
baseline = mean(pisaTrain$readingScore)
baseline 
## [1] 517.9629
#Ans:517.9629

#What is the sum of squared errors of the baseline model on the testing set? HINT: We call the sum of squared errors for the baseline model the total sum of squares (SST)
SST<-sum((pisaTest$readingScore-mean(pisaTrain$readingScore))^2)
SST
## [1] 7802354
#or
sum((baseline-pisaTest$readingScore)^2)
## [1] 7802354
#Ans:7802354

###########################

# Problem 4.4 - Test-set R-squared 

#What is the test-set R-squared value of lmScore?
1-5762082/7802354
## [1] 0.2614944
#Ans:0.2614944
#Explanation:The test-set R^2 is defined as 1-SSE/SST, where SSE is the sum of squared errors of the model on the test set and SST is the sum of squared errors of the baseline model. For this model, the R^2 is then computed to be 1-5762082/7802354.
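#The same value can be computed directly from the objects above, avoiding the rounded SSE and SST:
1 - sum((predTest - pisaTest$readingScore)^2) /
  sum((mean(pisaTrain$readingScore) - pisaTest$readingScore)^2)  # 0.2614944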

Assignment 2: DETECTING FLU EPIDEMICS VIA SEARCH ENGINE QUERY DATA

Flu epidemics constitute a major public health concern causing respiratory illnesses, hospitalizations, and deaths. According to the National Vital Statistics Reports published in October 2012, influenza ranked as the eighth leading cause of death in 2011 in the United States. Each year, 250,000 to 500,000 deaths are attributed to influenza-related diseases throughout the world.

The U.S. Centers for Disease Control and Prevention (CDC) and the European Influenza Surveillance Scheme (EISS) detect influenza activity through virologic and clinical data, including Influenza-like Illness (ILI) physician visits. National and regional data, however, are published with a 1-2 week lag.

The Google Flu Trends project was initiated to see if faster reporting can be made possible by considering flu-related online search queries – data that is available almost immediately.

PROBLEM 1.1 - UNDERSTANDING THE DATA
We would like to estimate influenza-like illness (ILI) activity using Google web search logs. Fortunately, one can easily access this data online:

ILI Data - The CDC publishes on its website the official regional and state-level percentage of patient visits to healthcare providers for ILI purposes on a weekly basis.

Google Search Queries - Google Trends allows public retrieval of weekly counts for every query searched by users around the world. For each location, the counts are normalized by dividing the count for each query in a particular week by the total number of online search queries submitted in that location during the week. Then, the values are adjusted to be between 0 and 1.

The csv file FluTrain.csv aggregates this data from January 1, 2004 until December 31, 2011 as follows:

“Week” - The range of dates represented by this observation, in year/month/day format.

“ILI” - This column lists the percentage of ILI-related physician visits for the corresponding week.

“Queries” - This column lists the fraction of queries that are ILI-related for the corresponding week, adjusted to be between 0 and 1 (higher values correspond to more ILI-related search queries).

#lets load the data
FluTrain<-read.csv("FluTrain.csv")
str(FluTrain)
## 'data.frame':    417 obs. of  3 variables:
##  $ Week   : Factor w/ 417 levels "2004-01-04 - 2004-01-10",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ ILI    : num  2.42 1.81 1.71 1.54 1.44 ...
##  $ Queries: num  0.238 0.22 0.226 0.238 0.224 ...
#Looking at the time period 2004-2011, which week corresponds to the highest percentage of ILI-related physician visits? Select the day of the month corresponding to the start of this week.
subset(FluTrain, ILI == max(ILI)) #or FluTrain$Week[which.max(FluTrain$ILI)]
##                        Week      ILI Queries
## 303 2009-10-18 - 2009-10-24 7.618892       1
#Ans:2009-10-18

#Which week corresponds to the highest percentage of ILI-related query fraction?
subset(FluTrain,Queries==max(Queries)) #or FluTrain$Week[which.max(FluTrain$Queries)]
##                        Week      ILI Queries
## 303 2009-10-18 - 2009-10-24 7.618892       1
#Ans:2009-10-18

#######################################

#PROBLEM 1.2 - UNDERSTANDING THE DATA  

#Let us now understand the data at an aggregate level. Plot the histogram of the dependent variable, ILI. What best describes the distribution of values of ILI?
hist(FluTrain$ILI)

#Ans:Most of the ILI values are small, with a relatively small number of much larger values (in statistics, this sort of data is called "skew right").
# the data is skew right

########################################

#PROBLEM 1.3 - UNDERSTANDING THE DATA  

#When handling a skewed dependent variable, it is often useful to predict the logarithm of the dependent variable instead of the dependent variable itself -- this prevents the small number of unusually large or small observations from having an undue influence on the sum of squared errors of predictive models. In this problem, we will predict the natural log of the ILI variable, which can be computed in R using the log() function.
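#As a quick sanity check (not asked for by the problem), the histogram of log(ILI) should look far less skewed than the histogram of ILI plotted above:
hist(log(FluTrain$ILI))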

#Plot the natural logarithm of ILI versus Queries. What does the plot suggest?.
plot(FluTrain$Queries,log(FluTrain$ILI))

#Ans:There is a positive, linear relationship between log(ILI) and Queries.

###########################################

#PROBLEM 2.1 - LINEAR REGRESSION MODEL  

#Based on the plot we just made, it seems that a linear regression model could be a good modeling choice. Based on our understanding of the data from the previous subproblem, which model best describes our estimation problem?
FluTrend1<-lm(log(ILI)~Queries,data=FluTrain)
summary(FluTrend1)
## 
## Call:
## lm(formula = log(ILI) ~ Queries, data = FluTrain)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.76003 -0.19696 -0.01657  0.18685  1.06450 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.49934    0.03041  -16.42   <2e-16 ***
## Queries      2.96129    0.09312   31.80   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2995 on 415 degrees of freedom
## Multiple R-squared:  0.709,  Adjusted R-squared:  0.7083 
## F-statistic:  1011 on 1 and 415 DF,  p-value: < 2.2e-16
#Ans:log(ILI) = intercept + coefficient x Queries, where the coefficient is positive
#EXPLANATION:From the previous subproblem, we are predicting log(ILI) using the Queries variable. From the plot in the previous subproblem, we expect the coefficient on Queries to be positive.

#######################################

#PROBLEM 2.2 - LINEAR REGRESSION MODEL

#Let's call the regression model from the previous problem (Problem 2.1) FluTrend1 and run it in R. Hint: to take the logarithm of a variable Var in a regression equation, you simply use log(Var) when specifying the formula to the lm() function.
FluTrend1<-lm(log(ILI)~Queries,data=FluTrain)
summary(FluTrend1)
## 
## Call:
## lm(formula = log(ILI) ~ Queries, data = FluTrain)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.76003 -0.19696 -0.01657  0.18685  1.06450 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.49934    0.03041  -16.42   <2e-16 ***
## Queries      2.96129    0.09312   31.80   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2995 on 415 degrees of freedom
## Multiple R-squared:  0.709,  Adjusted R-squared:  0.7083 
## F-statistic:  1011 on 1 and 415 DF,  p-value: < 2.2e-16
#What is the training set R-squared value for FluTrend1 model (the "Multiple R-squared")?
#Ans:0.709

##########################################

#PROBLEM 2.3 - LINEAR REGRESSION MODEL  

#For a single variable linear regression model, there is a direct relationship between the R-squared and the correlation between the independent and the dependent variables. What is the relationship we infer from our problem? (Don't forget that you can use the cor function to compute the correlation between two variables.)
Correlation<-cor(FluTrain$Queries,log(FluTrain$ILI))
Correlation^2
## [1] 0.7090201
#Ans:R-squared = Correlation^2
#To test these hypotheses, we first need to compute the correlation between the independent variable used in the model (Queries) and the dependent variable (log(ILI)).
log(1/Correlation) 
## [1] 0.1719357
exp(-0.5*Correlation) 
## [1] 0.6563792
#It appears that Correlation^2 is equal to the R-squared value. It can be proved that this is always the case.

#########################################

#PROBLEM 3.1 - PERFORMANCE ON THE TEST SET  

#The csv file FluTest.csv provides the 2012 weekly data of the ILI-related search queries and the observed weekly percentage of ILI-related physician visits.

FluTest<-read.csv("FluTest.csv")

#Normally, we would obtain test-set predictions from the model FluTrend1 using the code

#PredTest1 = predict(FluTrend1, newdata=FluTest)

#However, the dependent variable in our model is log(ILI), so PredTest1 would contain predictions of the log(ILI) value. We are instead interested in obtaining predictions of the ILI value. We can convert from predictions of log(ILI) to predictions of ILI via exponentiation, or the exp() function. The new code, which predicts the ILI value, is

PredTest1 = exp(predict(FluTrend1, newdata=FluTest))

#What is our estimate for the percentage of ILI-related physician visits for the week of March 11, 2012? (HINT: You can either just output FluTest$Week to find which element corresponds to March 11, 2012, or you can use the "which" function in R. To learn more about the which function, type ?which in your R console.)
PredTest1[which(FluTest$Week=="2012-03-11 - 2012-03-17")]
##       11 
## 2.187378
#Ans:2.187378

###########################################

#PROBLEM 3.2 - PERFORMANCE ON THE TEST SET  

#What is the relative error between the estimate (our prediction) and the observed value for the week of March 11, 2012? Note that the relative error is calculated as
#(Observed ILI - Estimated ILI)/Observed ILI
FluTest$ILI[which(FluTest$Week=="2012-03-11 - 2012-03-17")]
## [1] 2.293422
(2.293422-2.187383)/2.293422
## [1] 0.04623615
#Ans:0.04623615
#EXPLANATION:From the previous problem, we know the predicted value is 2.187378. The actual value is the 11th testing set ILI value or FluTest$ILI[11], which has value 2.293422. Finally we compute the relative error to be (2.293422 - 2.187378)/2.293422.

######################################

#PROBLEM 3.3 - PERFORMANCE ON THE TEST SET  

#What is the Root Mean Square Error (RMSE) between our estimates and the actual observations for the percentage of ILI-related physician visits, on the test set?
#Let's compute it by hand from the formula
SSE = sum((PredTest1 - FluTest$ILI)^2)
# Compute the RMSE
sqrt(mean((PredTest1-FluTest$ILI)^2)) # or RMSE = sqrt(SSE / nrow(FluTest))
## [1] 0.7490645
#Ans: 0.7490645

############################################

#PROBLEM 4.1 - TRAINING A TIME SERIES MODEL 

#The observations in this dataset are consecutive weekly measurements of the dependent and independent variables. This sort of dataset is called a "time series." Often, statistical models can be improved by predicting the current value of the dependent variable using the value of the dependent variable from earlier weeks. In our models, this means we will predict the ILI variable in the current week using values of the ILI variable from previous weeks.

#First, we need to decide the amount of time to lag the observations. Because the ILI variable is reported with a 1- or 2-week lag, a decision maker cannot rely on the previous week's ILI value to predict the current week's value. Instead, the decision maker will only have data available from 2 or more weeks ago. We will build a variable called ILILag2 that contains the ILI value from 2 weeks before the current observation.

#To do so, we will use the "zoo" package, which provides a number of helpful methods for time series models. While many functions are built into R, you need to add new packages to use some functions. New packages can be installed and loaded easily in R, and we will do this many times in this class. Run the following two commands to install and load the zoo package. In the first command, you will be prompted to select a CRAN mirror to use for your download. Select a mirror near you geographically.

library(zoo)
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
#After installing and loading the zoo package, run the following commands to create the ILILag2 variable in the training set:

ILILag2 = lag(zoo(FluTrain$ILI), k=-2, na.pad=TRUE)

FluTrain$ILILag2 = coredata(ILILag2)

#In these commands, the value of -2 passed to lag means to return 2 observations before the current one; a positive value would have returned future observations. The parameter na.pad=TRUE means to add missing values for the first two weeks of our dataset, where we can't compute the data from 2 weeks earlier.

#How many values are missing in the new ILILag2 variable?
sum(is.na(FluTrain$ILILag2))  #or summary(FluTrain$ILILag2)
## [1] 1
#Ans:2
#Note: the output above shows 1, most likely because dplyr's lag() (attached earlier in this session) masks stats::lag(); calling stats::lag(zoo(FluTrain$ILI), k=-2, na.pad=TRUE) explicitly shifts the series by two weeks and gives the intended two missing values.
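#If you want to sidestep the lag()/zoo route entirely, a base-R sketch builds the same two-week lag by shifting the vector (ILILag2.alt is just an illustrative name):
ILILag2.alt <- c(NA, NA, head(FluTrain$ILI, -2))  # pad the first two weeks with NA
sum(is.na(ILILag2.alt))                           # 2, as intended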

#################################

#PROBLEM 4.2 - TRAINING A TIME SERIES MODEL 

#Use the plot() function to plot the log of ILILag2 against the log of ILI. Which best describes the relationship between these two variables?
plot(log(FluTrain$ILILag2),log(FluTrain$ILI))

#Ans:There is a strong positive relationship between log(ILILag2) and log(ILI)

###################################

#PROBLEM 4.3 - TRAINING A TIME SERIES MODEL  

#Train a linear regression model on the FluTrain dataset to predict the log of the ILI variable using the Queries variable as well as the log of the ILILag2 variable. Call this model FluTrend2.

FluTrend2<-lm(log(ILI)~Queries + log(ILILag2),data=FluTrain)
summary(FluTrend2)
## 
## Call:
## lm(formula = log(ILI) ~ Queries + log(ILILag2), data = FluTrain)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.38495 -0.07746 -0.01194  0.06849  0.66338 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -0.11980    0.01546  -7.749 7.25e-14 ***
## Queries       0.63227    0.06605   9.573  < 2e-16 ***
## log(ILILag2)  0.82279    0.01879  43.787  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1256 on 413 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.949,  Adjusted R-squared:  0.9487 
## F-statistic:  3839 on 2 and 413 DF,  p-value: < 2.2e-16
#Which coefficients are significant at the p=0.05 level in this regression model? (Select all that apply.)
#Ans:Intercept ,Queries,log(ILILag2)

#What is the R^2 value of the FluTrend2 model?
#Ans:0.9063
#(The summary above reports 0.949 because of the lag() masking issue noted in Problem 4.1, which shifted ILILag2 by one week instead of two; with the correct two-week lag the training R^2 is 0.9063.)

#########################################

#PROBLEM 4.4 - TRAINING A TIME SERIES MODEL  

#On the basis of R-squared value and significance of coefficients, which statement is the most accurate?
#Ans:FluTrend2 is a stronger model than FluTrend1 on the training set.
#EXPLANATION:Moving from FluTrend1 to FluTrend2, in-sample R^2 improved from 0.709 to 0.9063, and the new variable is highly significant. As a result, there is no sign of overfitting, and FluTrend2 is superior to FluTrend1 on the training set.

################

#PROBLEM 5.1 - EVALUATING THE TIME SERIES MODEL IN THE TEST SET 

#So far, we have only added the ILILag2 variable to the FluTrain data frame. To make predictions with our FluTrend2 model, we will also need to add ILILag2 to the FluTest data frame (note that adding variables before splitting into a training and testing set can prevent this duplication of effort).

#Modify the code from the previous subproblem to add an ILILag2 variable to the FluTest data frame. How many missing values are there in this new variable?

ILILag2 = lag(zoo(FluTest$ILI), k=-2, na.pad=TRUE)
#??lag()

FluTest$ILILag2 = coredata(ILILag2)
summary(FluTest$ILILag2)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  0.9018  1.1450  1.3580  1.5780  1.8150  4.5470       1
#Ans:2
#(As in Problem 4.1, the summary above reports only 1 NA because of the lag() masking issue; the intended two-week lag leaves two missing values.)

#############################

#PROBLEM 5.2 - EVALUATING THE TIME SERIES MODEL IN THE TEST SET 

#In this problem, the training and testing sets are split sequentially -- the training set contains all observations from 2004-2011 and the testing set contains all observations from 2012. There is no time gap between the two datasets, meaning the first observation in FluTest was recorded one week after the last observation in FluTrain. From this, we can identify how to fill in the missing values for the ILILag2 variable in FluTest.

#Which value should be used to fill in the ILILag2 variable for the first observation in FluTest?
#Ans:The ILI value of the second-to-last observation in the FluTrain data frame
#EXPLANATION:The time two weeks before the first week of 2012 is the second-to-last week of 2011. This corresponds to the second-to-last observation in FluTrain.

#Which value should be used to fill in the ILILag2 variable for the second observation in FluTest?
#Ans:The ILI value of the last observation in the FluTrain data frame.
#EXPLANATION:The time two weeks before the second week of 2012 is the last week of 2011. This corresponds to the last observation in FluTrain.

################################

#PROBLEM 5.3 - EVALUATING THE TIME SERIES MODEL IN THE TEST SET 

#Fill in the missing values for ILILag2 in FluTest. In terms of syntax, you could set the value of ILILag2 in row "x" of the FluTest data frame to the value of ILI in row "y" of the FluTrain data frame with "FluTest$ILILag2[x] = FluTrain$ILI[y]". Use the answer to the previous questions to determine the appropriate values of "x" and "y". It may be helpful to check the total number of rows in FluTrain using str(FluTrain) or nrow(FluTrain).
nrow(FluTrain)
## [1] 417
#What is the new value of the ILILag2 variable in the first row of FluTest?
FluTest$ILILag2[1] = FluTrain$ILI[416]
FluTest$ILILag2[1]
## [1] 1.852736
#Ans: 1.852736

#What is the new value of the ILILag2 variable in the second row of FluTest?
FluTest$ILILag2[2] = FluTrain$ILI[417]
FluTest$ILILag2[2]
## [1] 2.12413
#Ans:2.12413

####################################

#PROBLEM 5.4 - EVALUATING THE TIME SERIES MODEL IN THE TEST SET

#Obtain test set predictions of the ILI variable from the FluTrend2 model, again remembering to call the exp() function on the result of the predict() function to obtain predictions for ILI instead of log(ILI).
PredTest2 = exp(predict(FluTrend2, newdata=FluTest))

#What is the test-set RMSE of the FluTrend2 model?
#Let's compute it by hand from the formula
SSE = sum((PredTest2 - FluTest$ILI)^2)
# Compute the RMSE
sqrt(mean((PredTest2-FluTest$ILI)^2)) # or RMSE = sqrt(SSE / nrow(FluTest))
## [1] 0.2229195
#Ans:0.2942029
#(The value above differs from the expected answer because of the ILILag2 lag issue noted in Problem 4.1; with the correct two-week lag the test-set RMSE comes out to 0.2942029, as referenced in Problem 5.5.)

########################

#PROBLEM 5.5 - EVALUATING THE TIME SERIES MODEL IN THE TEST SET 

#Which model obtained the best test-set RMSE?
#Ans:FluTrend2
#EXPLANATION:The test-set RMSE of FluTrend2 is 0.294, as opposed to the 0.749 value obtained by the FluTrend1 model.

#In this problem, we used a simple time series model with a single lag term. ARIMA models are a more general form of the model we built; they can include multiple lag terms as well as more complicated combinations of previous values of the dependent variable.
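#As a taste of that idea (purely illustrative, not part of the assignment), base R's arima() can fit an ARIMA(2,0,0) -- i.e. AR(2) -- model to log(ILI), using two autoregressive lag terms instead of one:
ar2 <- arima(log(FluTrain$ILI), order = c(2, 0, 0))
ar2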
#Let's compare the two models from Problem 4.4 above using the sjPlot package
library(sjPlot)
sjt.lm(FluTrend1,FluTrend2)
                 FluTrend1: log(ILI)              FluTrend2: log(ILI)
                 B      CI              p         B      CI              p
(Intercept)     -0.50  -0.56 – -0.44  <.001      -0.12  -0.15 – -0.09  <.001
Queries          2.96   2.78 –  3.14  <.001       0.63   0.50 –  0.76  <.001
log(ILILag2)                                      0.82   0.79 –  0.86  <.001
Observations     417                              416
R2 / adj. R2     .709 / .708                      .949 / .949

Assignment 2: STATE DATA (OPTIONAL)

We often take data for granted. However, one of the hardest parts about analyzing a problem you’re interested in can be to find good data to answer the questions you want to ask. As you’re learning R, though, there are many datasets that R has built in that you can take advantage of.

In this problem, we will be examining the “state” dataset, which has data from the 1970s on all fifty US states. For each state, the dataset includes the population, per capita income, illiteracy rate, murder rate, high school graduation rate, average number of frost days, area, latitude and longitude, division the state belongs to, region the state belongs to, and two-letter abbreviation.

#lets load the data set
data(state) #built-in data set containing six vectors and one matrix (state.x77). Let's combine them into a single data frame
statedata = cbind(data.frame(state.x77), state.abb, state.area, state.center,  state.division, state.name, state.region)  # or import statedata.csv
str(statedata)
## 'data.frame':    50 obs. of  15 variables:
##  $ Population    : num  3615 365 2212 2110 21198 ...
##  $ Income        : num  3624 6315 4530 3378 5114 ...
##  $ Illiteracy    : num  2.1 1.5 1.8 1.9 1.1 0.7 1.1 0.9 1.3 2 ...
##  $ Life.Exp      : num  69 69.3 70.5 70.7 71.7 ...
##  $ Murder        : num  15.1 11.3 7.8 10.1 10.3 6.8 3.1 6.2 10.7 13.9 ...
##  $ HS.Grad       : num  41.3 66.7 58.1 39.9 62.6 63.9 56 54.6 52.6 40.6 ...
##  $ Frost         : num  20 152 15 65 20 166 139 103 11 60 ...
##  $ Area          : num  50708 566432 113417 51945 156361 ...
##  $ state.abb     : Factor w/ 50 levels "AK","AL","AR",..: 2 1 4 3 5 6 7 8 9 10 ...
##  $ state.area    : num  51609 589757 113909 53104 158693 ...
##  $ x             : num  -86.8 -127.2 -111.6 -92.3 -119.8 ...
##  $ y             : num  32.6 49.2 34.2 34.7 36.5 ...
##  $ state.division: Factor w/ 9 levels "New England",..: 4 9 8 5 9 8 1 3 3 3 ...
##  $ state.name    : Factor w/ 50 levels "Alabama","Alaska",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ state.region  : Factor w/ 4 levels "Northeast","South",..: 2 4 4 2 4 4 1 2 2 2 ...
#PROBLEM 1.1 - DATA EXPLORATION 

#We begin by exploring the data. Plot all of the states' centers with latitude on the y axis (the "y" variable in our dataset) and longitude on the x axis (the "x" variable in our dataset). The shape of the plot should look like the outline of the United States! Note that Alaska and Hawaii have had their coordinates adjusted to appear just off of the west coast.
plot(statedata$x,statedata$y)

#In the R command you used to generate this plot, which variable name did you use as the first argument?
#Ans:statedata$x (longitude is the first argument, latitude the second)

####################################

#PROBLEM 1.2 - DATA EXPLORATION  

#Using the tapply command, determine which region of the US (West, North Central, South, or Northeast) has the highest average high school graduation rate of all the states in the region:
tapply(statedata$HS.Grad,statedata$state.region,mean)
##     Northeast         South North Central          West 
##      53.96667      44.34375      54.51667      62.00000
#Ans:West

########################

#PROBLEM 1.3 - DATA EXPLORATION  

#Now, make a boxplot of the murder rate by region (for more information about creating boxplots in R, type ?boxplot in your console).
plot(statedata$Murder~state.region) #or boxplot(statedata$Murder ~ statedata$state.region)

#Which region has the highest median murder rate?
#Ans:South

#########################

#PROBLEM 1.4 - DATA EXPLORATION  

#You should see that there is an outlier in the Northeast region of the boxplot you just generated. Which state does this correspond to? (Hint: There are many ways to find the answer to this question, but one way is to use the subset command to only look at the Northeast data.)
NE<-subset(statedata,statedata$state.region=="Northeast")
NE$state.name[NE$Murder==max(NE$Murder)]
## [1] New York
## 50 Levels: Alabama Alaska Arizona Arkansas California ... Wyoming
#Ans:New York

#################################

#PROBLEM 2.1 - PREDICTING LIFE EXPECTANCY - AN INITIAL MODEL

#We would like to build a model to predict life expectancy by state using the state statistics we have in our dataset.

#Build the model with all potential variables included (Population, Income, Illiteracy, Murder, HS.Grad, Frost, and Area). Note that you should use the variable "Area" in your model, NOT the variable "state.area".
LinReg<-lm(Life.Exp~Population+Income+Illiteracy+Murder+HS.Grad+Frost+Area,data=statedata)
summary(LinReg)
## 
## Call:
## lm(formula = Life.Exp ~ Population + Income + Illiteracy + Murder + 
##     HS.Grad + Frost + Area, data = statedata)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.48895 -0.51232 -0.02747  0.57002  1.49447 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  7.094e+01  1.748e+00  40.586  < 2e-16 ***
## Population   5.180e-05  2.919e-05   1.775   0.0832 .  
## Income      -2.180e-05  2.444e-04  -0.089   0.9293    
## Illiteracy   3.382e-02  3.663e-01   0.092   0.9269    
## Murder      -3.011e-01  4.662e-02  -6.459 8.68e-08 ***
## HS.Grad      4.893e-02  2.332e-02   2.098   0.0420 *  
## Frost       -5.735e-03  3.143e-03  -1.825   0.0752 .  
## Area        -7.383e-08  1.668e-06  -0.044   0.9649    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7448 on 42 degrees of freedom
## Multiple R-squared:  0.7362, Adjusted R-squared:  0.6922 
## F-statistic: 16.74 on 7 and 42 DF,  p-value: 2.534e-10
#Let's plot the coefficients using the sjPlot library
library(sjPlot)
#?sjp.lm
sjp.lm(LinReg,type="lm")

#What is the coefficient for "Income" in your linear regression model?
coef(summary(LinReg))["Income","Estimate"]  #or look in summary(LinReg)
## [1] -2.180424e-05
#Ans:-2.180424e-05  (or -0.0000218)

#########################################

#PROBLEM 2.2 - PREDICTING LIFE EXPECTANCY - AN INITIAL MODEL 

#Call the coefficient for income x (the answer to Problem 2.1). What is the interpretation of the coefficient x?
#Ans:For a one unit increase in income, predicted life expectancy decreases by |x|
#EXPLANATION:If we increase income by one unit, then our model's prediction will increase by the coefficient of income, x. Because x is negative, this is the same as predicted life expectancy decreasing by |x|.

#######################################

#PROBLEM 2.3 - PREDICTING LIFE EXPECTANCY - AN INITIAL MODEL 

#Now plot a graph of life expectancy vs. income using the command:
plot(statedata$Income, statedata$Life.Exp)

#Visually observe the plot. What appears to be the relationship?
#Ans:Life expectancy is somewhat positively correlated with income
#EXPLANATION:Although the point in the lower right hand corner of the plot appears to be an outlier, we observe a positive linear relationship in the plot.

#####################################

#PROBLEM 2.4 - PREDICTING LIFE EXPECTANCY - AN INITIAL MODEL  

#The model we built does not display the relationship we saw from the plot of life expectancy vs. income. Which of the following explanations seems the most reasonable?
#Ans:Multicollinearity
#EXPLANATION:Although income is an insignificant variable in the model, this does not mean that there is no association between income and life expectancy. However, in the presence of all of the other variables, income does not add statistically significant explanatory power to the model. This means that multicollinearity is probably the issue.
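#One quick way to see the multicollinearity (a sketch, not required by the problem) is to look at pairwise correlations among the predictors; Income tends to be correlated with variables such as HS.Grad, so it adds little once those are in the model:
round(cor(statedata[, c("Income", "HS.Grad", "Illiteracy", "Murder", "Life.Exp")]), 2)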

###########################################

#PROBLEM 3.1 - PREDICTING LIFE EXPECTANCY - REFINING THE MODEL AND ANALYZING PREDICTIONS

#Recall that we discussed the principle of simplicity: that is, a model with fewer variables is preferable to a model with many unnecessary variables. Experiment with removing independent variables from the original model. Remember to use the significance of the coefficients to decide which variables to remove (remove the one with the largest "p-value" first, or the one with the "t value" closest to zero), and to remove them one at a time (this is called "backwards variable selection"). This is important due to multicollinearity issues - removing one insignificant variable may make another previously insignificant variable become significant.

#You should be able to find a good model with only 4 independent variables, instead of the original 7. Which variables does this model contain?

#Let's do variable selection using regsubsets() from the leaps package

library(leaps)
regfit.bwd1<-regsubsets(Life.Exp~Population+Income+Illiteracy+Murder+HS.Grad+Frost+Area,data = statedata,nvmax=7,nbest=1,method = "backward")
reg.summary<-summary(regfit.bwd1)
reg.summary ##An asterisk indicates that a given variable is included in the corresponding model
## Subset selection object
## Call: regsubsets.formula(Life.Exp ~ Population + Income + Illiteracy + 
##     Murder + HS.Grad + Frost + Area, data = statedata, nvmax = 7, 
##     nbest = 1, method = "backward")
## 7 Variables  (and intercept)
##            Forced in Forced out
## Population     FALSE      FALSE
## Income         FALSE      FALSE
## Illiteracy     FALSE      FALSE
## Murder         FALSE      FALSE
## HS.Grad        FALSE      FALSE
## Frost          FALSE      FALSE
## Area           FALSE      FALSE
## 1 subsets of each size up to 7
## Selection Algorithm: backward
##          Population Income Illiteracy Murder HS.Grad Frost Area
## 1  ( 1 ) " "        " "    " "        "*"    " "     " "   " " 
## 2  ( 1 ) " "        " "    " "        "*"    "*"     " "   " " 
## 3  ( 1 ) " "        " "    " "        "*"    "*"     "*"   " " 
## 4  ( 1 ) "*"        " "    " "        "*"    "*"     "*"   " " 
## 5  ( 1 ) "*"        "*"    " "        "*"    "*"     "*"   " " 
## 6  ( 1 ) "*"        "*"    "*"        "*"    "*"     "*"   " " 
## 7  ( 1 ) "*"        "*"    "*"        "*"    "*"     "*"   "*"
#Let's look at the fit statistics for each model size
reg.summary$rsq
## [1] 0.6097201 0.6628461 0.7126624 0.7360328 0.7361014 0.7361440 0.7361563
reg.summary$rss  # residual sum of squares; always decreases as variables are added
## [1] 34.46133 29.77036 25.37162 23.30804 23.30198 23.29822 23.29714
reg.summary$adjr2 #Adjusted R-squared; we pick the model with the highest value, which here is the 4-variable model
## [1] 0.6015893 0.6484991 0.6939230 0.7125690 0.7061129 0.6993268 0.6921823
reg.summary$cp #Mallows' Cp; we pick the model with the lowest value, which here is the 4-variable model
## [1] 16.126760  9.669894  3.739878  2.019659  4.008737  6.001959  8.000000
reg.summary$bic #BIC; we pick the model with the lowest value, which here is the 4-variable model
## [1] -39.22051 -42.62472 -46.70678 -47.03640 -43.13738 -39.23342 -35.32373
#Let's use the built-in plot() method for regsubsets objects
#The darker the shading, the better the fit; a gap means that the variable is not included in that model.
par(mfrow=c(2,2))
plot(regfit.bwd1,scale="adjr2")
plot(regfit.bwd1,scale="r2")

plot(regfit.bwd1,scale="Cp")
plot(regfit.bwd1,scale="bic")

##Plotting RSS, adjusted R-squared, Cp and BIC for all model sizes at once helps us decide which model to select
par(mfrow=c(2,2))
plot(reg.summary$rss,xlab="Number of variables",ylab="RSS",type="l")
which.min(reg.summary$rss)
## [1] 7
points(7,reg.summary$rss[7],col="red",cex=2,pch=20) # adding the point to the plot which shows the min RSS

plot(reg.summary$adjr2,xlab="Number of variables",ylab="Adjusted R2",type="l")
which.max(reg.summary$adjr2)
## [1] 4
points(4,reg.summary$adjr2[4],col="red",cex=2,pch=20) # adding the point to the plot which shows the max adjusted R2

plot(reg.summary$cp,xlab="Number of variables",ylab="Cp",type="l")
which.min(reg.summary$cp)
## [1] 4
points(4,reg.summary$cp[4],col="red",cex=2,pch=20) # adding the point to the plot which shows the min Cp

plot(reg.summary$bic,xlab="Number of variables",ylab="BIC",type="l")
which.min(reg.summary$bic)
## [1] 4
points(4,reg.summary$bic[4],col="red",cex=2,pch=20) # adding the point to the plot which shows the min BIC

#So the 4-variable model is the best model according to backward selection, based on Cp, adjusted R-squared and BIC
reg.summary$which #A logical matrix indicating which variables are in each model
##   (Intercept) Population Income Illiteracy Murder HS.Grad Frost  Area
## 1        TRUE      FALSE  FALSE      FALSE   TRUE   FALSE FALSE FALSE
## 2        TRUE      FALSE  FALSE      FALSE   TRUE    TRUE FALSE FALSE
## 3        TRUE      FALSE  FALSE      FALSE   TRUE    TRUE  TRUE FALSE
## 4        TRUE       TRUE  FALSE      FALSE   TRUE    TRUE  TRUE FALSE
## 5        TRUE       TRUE   TRUE      FALSE   TRUE    TRUE  TRUE FALSE
## 6        TRUE       TRUE   TRUE       TRUE   TRUE    TRUE  TRUE FALSE
## 7        TRUE       TRUE   TRUE       TRUE   TRUE    TRUE  TRUE  TRUE
#Ans:Population,Murder,HS.Grad,Frost

#EXPLANATION:

#Let's walk through backward selection by hand: start with the full model and eliminate variables one at a time, removing the variable with the highest p-value (the least significant) at each step.
LinReg<-lm(Life.Exp~Population+Income+Illiteracy+Murder+HS.Grad+Frost+Area,data=statedata) #full model
summary(LinReg)
## 
## Call:
## lm(formula = Life.Exp ~ Population + Income + Illiteracy + Murder + 
##     HS.Grad + Frost + Area, data = statedata)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.48895 -0.51232 -0.02747  0.57002  1.49447 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  7.094e+01  1.748e+00  40.586  < 2e-16 ***
## Population   5.180e-05  2.919e-05   1.775   0.0832 .  
## Income      -2.180e-05  2.444e-04  -0.089   0.9293    
## Illiteracy   3.382e-02  3.663e-01   0.092   0.9269    
## Murder      -3.011e-01  4.662e-02  -6.459 8.68e-08 ***
## HS.Grad      4.893e-02  2.332e-02   2.098   0.0420 *  
## Frost       -5.735e-03  3.143e-03  -1.825   0.0752 .  
## Area        -7.383e-08  1.668e-06  -0.044   0.9649    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7448 on 42 degrees of freedom
## Multiple R-squared:  0.7362, Adjusted R-squared:  0.6922 
## F-statistic: 16.74 on 7 and 42 DF,  p-value: 2.534e-10
#We would eliminate the variable "Area" first (since it has the highest p-value, or probability, with a value of 0.9649), by adjusting our lm command to the following:
LinReg = lm(Life.Exp ~ Population + Income + Illiteracy + Murder + HS.Grad + Frost, data=statedata)
summary(LinReg)
## 
## Call:
## lm(formula = Life.Exp ~ Population + Income + Illiteracy + Murder + 
##     HS.Grad + Frost, data = statedata)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.49047 -0.52533 -0.02546  0.57160  1.50374 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  7.099e+01  1.387e+00  51.165  < 2e-16 ***
## Population   5.188e-05  2.879e-05   1.802   0.0785 .  
## Income      -2.444e-05  2.343e-04  -0.104   0.9174    
## Illiteracy   2.846e-02  3.416e-01   0.083   0.9340    
## Murder      -3.018e-01  4.334e-02  -6.963 1.45e-08 ***
## HS.Grad      4.847e-02  2.067e-02   2.345   0.0237 *  
## Frost       -5.776e-03  2.970e-03  -1.945   0.0584 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7361 on 43 degrees of freedom
## Multiple R-squared:  0.7361, Adjusted R-squared:  0.6993 
## F-statistic: 19.99 on 6 and 43 DF,  p-value: 5.362e-11
#Looking at summary(LinReg) now, we would choose to eliminate "Illiteracy" since it now has the highest p-value of 0.9340, using the following command:
LinReg = lm(Life.Exp ~ Population + Income + Murder + HS.Grad + Frost, data=statedata)
summary(LinReg)
## 
## Call:
## lm(formula = Life.Exp ~ Population + Income + Murder + HS.Grad + 
##     Frost, data = statedata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.4892 -0.5122 -0.0329  0.5645  1.5166 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  7.107e+01  1.029e+00  69.067  < 2e-16 ***
## Population   5.115e-05  2.709e-05   1.888   0.0657 .  
## Income      -2.477e-05  2.316e-04  -0.107   0.9153    
## Murder      -3.000e-01  3.704e-02  -8.099 2.91e-10 ***
## HS.Grad      4.776e-02  1.859e-02   2.569   0.0137 *  
## Frost       -5.910e-03  2.468e-03  -2.395   0.0210 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7277 on 44 degrees of freedom
## Multiple R-squared:  0.7361, Adjusted R-squared:  0.7061 
## F-statistic: 24.55 on 5 and 44 DF,  p-value: 1.019e-11
#Looking at summary(LinReg) again, we would next choose to eliminate "Income", since it has a p-value of 0.9153. This gives the following four variable model:
LinReg = lm(Life.Exp ~ Population + Murder + HS.Grad + Frost, data=statedata)
summary(LinReg)
## 
## Call:
## lm(formula = Life.Exp ~ Population + Murder + HS.Grad + Frost, 
##     data = statedata)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.47095 -0.53464 -0.03701  0.57621  1.50683 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  7.103e+01  9.529e-01  74.542  < 2e-16 ***
## Population   5.014e-05  2.512e-05   1.996  0.05201 .  
## Murder      -3.001e-01  3.661e-02  -8.199 1.77e-10 ***
## HS.Grad      4.658e-02  1.483e-02   3.142  0.00297 ** 
## Frost       -5.943e-03  2.421e-03  -2.455  0.01802 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7197 on 45 degrees of freedom
## Multiple R-squared:  0.736,  Adjusted R-squared:  0.7126 
## F-statistic: 31.37 on 4 and 45 DF,  p-value: 1.696e-12
#This model with 4 variables is a good model. However, we can see that the variable "Population" is not quite significant. In practice, it would be up to you whether or not to keep the variable "Population" or eliminate it for a 3-variable model. Population does not add much statistical significance in the presence of murder, high school graduation rate, and frost days. However, for the remainder of this question, we will analyze the 4-variable model.

# This entire backward selection process can also be carried out with regsubsets() from the leaps package, as shown above; a sketch using base R's step() follows.
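# A sketch of an automated alternative: base R's step() performs backward elimination using AIC rather than p-values ("full" and "reduced" are just illustrative names). It should arrive at the same four variables here, though the AIC and p-value criteria can disagree in general.
full <- lm(Life.Exp ~ Population + Income + Illiteracy + Murder + HS.Grad + Frost + Area, data = statedata)
reduced <- step(full, direction = "backward", trace = 0)
summary(reduced)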

##################################

#PROBLEM 3.2 - PREDICTING LIFE EXPECTANCY - REFINING THE MODEL AND ANALYZING PREDICTIONS

#Removing insignificant variables changes the Multiple R-squared value of the model. By looking at the summary output for both the initial model (all independent variables) and the simplified model (only 4 independent variables) and using what you learned in class, which of the following correctly explains the change in the Multiple R-squared value?
#Ans:We expect the "Multiple R-squared" value of the simplified model to be slightly worse than that of the initial model. It can't be better than the "Multiple R-squared" value of the initial model.

#EXPLANATION:When we remove insignificant variables, the "Multiple R-squared" will always be worse, but only slightly worse. This is due to the nature of a linear regression model. It is always possible for the regression model to make a coefficient zero, which would be the same as removing the variable from the model. The fact that the coefficient is not zero in the initial model means it must be helping the R-squared value, even if it is only a very small improvement. So when we force the variable to be removed, it will decrease the R-squared a little bit. However, this small decrease is worth it to have a simpler model.
#By contrast, when we remove insignificant variables, the "Adjusted R-squared" will frequently be better. This value accounts for the complexity of the model, and thus tends to increase as insignificant variables are removed, and decrease as insignificant variables are added.
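#The adjustment can be checked by hand (a sketch using the rounded R-squared values from the summaries above): adjusted R-squared = 1 - (1 - R2) * (n - 1) / (n - k - 1), where n is the number of observations and k the number of predictors.
n <- nrow(statedata)                       # 50 states
1 - (1 - 0.7362) * (n - 1) / (n - 7 - 1)   # full model (7 predictors): ~0.692, matches summary
1 - (1 - 0.7360) * (n - 1) / (n - 4 - 1)   # 4-variable model: ~0.713, matches summary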

##########################################

#PROBLEM 3.3 - PREDICTING LIFE EXPECTANCY - REFINING THE MODEL AND ANALYZING PREDICTIONS

#Using the simplified 4 variable model that we created, we'll now take a look at how our predictions compare to the actual values.

#Take a look at the vector of predictions by using the predict function (since we are just looking at predictions on the training set, you don't need to pass a "newdata" argument to the predict function).

#Which state do we predict to have the lowest life expectancy? (Hint: use the sort function)

predTest<-predict(LinReg)
sort(predTest)
##        Alabama        Georgia    Mississippi South Carolina      Louisiana 
##       68.48112       68.63694       69.00535       69.06109       69.15045 
##       Kentucky North Carolina      Tennessee         Nevada       Arkansas 
##       69.24418       69.28624       69.46583       69.52482       69.57374 
##         Alaska       Michigan          Texas     New Mexico       Missouri 
##       69.85740       69.86893       69.97886       70.03119       70.10610 
##       Virginia       Illinois  West Virginia       Maryland        Florida 
##       70.14691       70.19244       70.44983       70.51852       70.61539 
##       New York        Wyoming        Indiana        Vermont           Ohio 
##       70.62937       70.87679       70.90159       71.06135       71.08549 
##       Colorado       Delaware       Oklahoma   Pennsylvania        Montana 
##       71.10354       71.12647       71.15860       71.38046       71.40025 
##        Arizona          Idaho     New Jersey  New Hampshire   Rhode Island 
##       71.41416       71.49989       71.59612       71.72636       71.76007 
##     California          Maine   North Dakota         Kansas      Wisconsin 
##       71.79565       71.86095       71.87649       71.90352       72.00996 
##   South Dakota    Connecticut           Utah         Hawaii       Nebraska 
##       72.01161       72.03459       72.05753       72.09317       72.17032 
##      Minnesota           Iowa         Oregon  Massachusetts     Washington 
##       72.26560       72.39653       72.41445       72.44105       72.68272
#Ans:Alabama 

#Which state actually has the lowest life expectancy? (Hint: use the which.min function)
statedata$state.name[which.min(statedata$Life.Exp)]
## [1] South Carolina
## 50 Levels: Alabama Alaska Arizona Arkansas California ... Wyoming
#Ans: South Carolina

#################################

#PROBLEM 3.4 - PREDICTING LIFE EXPECTANCY - REFINING THE MODEL AND ANALYZING PREDICTIONS

#Which state do we predict to have the highest life expectancy?
sort(predTest,decreasing = T)
##     Washington  Massachusetts         Oregon           Iowa      Minnesota 
##       72.68272       72.44105       72.41445       72.39653       72.26560 
##       Nebraska         Hawaii           Utah    Connecticut   South Dakota 
##       72.17032       72.09317       72.05753       72.03459       72.01161 
##      Wisconsin         Kansas   North Dakota          Maine     California 
##       72.00996       71.90352       71.87649       71.86095       71.79565 
##   Rhode Island  New Hampshire     New Jersey          Idaho        Arizona 
##       71.76007       71.72636       71.59612       71.49989       71.41416 
##        Montana   Pennsylvania       Oklahoma       Delaware       Colorado 
##       71.40025       71.38046       71.15860       71.12647       71.10354 
##           Ohio        Vermont        Indiana        Wyoming       New York 
##       71.08549       71.06135       70.90159       70.87679       70.62937 
##        Florida       Maryland  West Virginia       Illinois       Virginia 
##       70.61539       70.51852       70.44983       70.19244       70.14691 
##       Missouri     New Mexico          Texas       Michigan         Alaska 
##       70.10610       70.03119       69.97886       69.86893       69.85740 
##       Arkansas         Nevada      Tennessee North Carolina       Kentucky 
##       69.57374       69.52482       69.46583       69.28624       69.24418 
##      Louisiana South Carolina    Mississippi        Georgia        Alabama 
##       69.15045       69.06109       69.00535       68.63694       68.48112
#Ans: Washington

#Which state actually has the highest life expectancy?
statedata$state.name[which.max(statedata$Life.Exp)]
## [1] Hawaii
## 50 Levels: Alabama Alaska Arizona Arkansas California ... Wyoming
#Ans: Hawaii

##################################

#PROBLEM 3.5 - PREDICTING LIFE EXPECTANCY - REFINING THE MODEL AND ANALYZING PREDICTIONS

#Take a look at the vector of residuals (the difference between the predicted and actual values).
sort(abs(LinReg$residuals)) #or sort(abs(predTest-statedata$Life.Exp))
##        Indiana        Florida       Illinois       Virginia   South Dakota 
##     0.02158526     0.04460505     0.05244160     0.06691392     0.06839119 
## North Carolina       New York     California        Georgia   Rhode Island 
##     0.07624179     0.07937149     0.08564599     0.09694227     0.13992982 
##           Iowa       Oklahoma           Ohio         Oregon     New Mexico 
##     0.16347124     0.26139958     0.26548767     0.28445333     0.28880945 
##       Maryland          Idaho      Louisiana       Nebraska    Connecticut 
##     0.29851996     0.37010714     0.39044846     0.42967691     0.44541028 
##      Wisconsin         Nevada  New Hampshire         Alaska        Alabama 
##     0.47004324     0.49482393     0.49635615     0.54740399     0.56888134 
##        Vermont       Missouri        Wyoming  Massachusetts      Tennessee 
##     0.57865019     0.58389969     0.58678863     0.61105391     0.64416651 
##     New Jersey         Kansas      Minnesota       Michigan        Montana 
##     0.66612086     0.67648037     0.69440380     0.76106640     0.84024805 
##           Utah       Kentucky        Arizona   North Dakota    Mississippi 
##     0.84246817     0.85582067     0.86415671     0.90350550     0.91535384 
##          Texas   Pennsylvania       Colorado     Washington  West Virginia 
##     0.92114057     0.95045527     0.95645816     0.96272426     0.96982588 
##       Delaware       Arkansas South Carolina          Maine         Hawaii 
##     1.06646884     1.08626119     1.10109172     1.47095411     1.50683146
#For which state do we make the smallest absolute error?
#Ans:Indiana 

#For which state do we make the largest absolute error?
sort(abs(LinReg$residuals),decreasing = T ) #or sort(abs(predTest-statedata$Life.Exp),decreasing = T)
##         Hawaii          Maine South Carolina       Arkansas       Delaware 
##     1.50683146     1.47095411     1.10109172     1.08626119     1.06646884 
##  West Virginia     Washington       Colorado   Pennsylvania          Texas 
##     0.96982588     0.96272426     0.95645816     0.95045527     0.92114057 
##    Mississippi   North Dakota        Arizona       Kentucky           Utah 
##     0.91535384     0.90350550     0.86415671     0.85582067     0.84246817 
##        Montana       Michigan      Minnesota         Kansas     New Jersey 
##     0.84024805     0.76106640     0.69440380     0.67648037     0.66612086 
##      Tennessee  Massachusetts        Wyoming       Missouri        Vermont 
##     0.64416651     0.61105391     0.58678863     0.58389969     0.57865019 
##        Alabama         Alaska  New Hampshire         Nevada      Wisconsin 
##     0.56888134     0.54740399     0.49635615     0.49482393     0.47004324 
##    Connecticut       Nebraska      Louisiana          Idaho       Maryland 
##     0.44541028     0.42967691     0.39044846     0.37010714     0.29851996 
##     New Mexico         Oregon           Ohio       Oklahoma           Iowa 
##     0.28880945     0.28445333     0.26548767     0.26139958     0.16347124 
##   Rhode Island        Georgia     California       New York North Carolina 
##     0.13992982     0.09694227     0.08564599     0.07937149     0.07624179 
##   South Dakota       Virginia       Illinois        Florida        Indiana 
##     0.06839119     0.06691392     0.05244160     0.04460505     0.02158526
#Ans: Hawaii

Assignment 2: FORECASTING ELANTRA SALES (OPTIONAL)

An important application of linear regression is understanding sales. Consider a company that produces and sells a product. In a given period, if the company produces more units than consumers will buy, it will not earn money on the unsold units and will incur additional costs to store them in inventory before they can be sold. If it produces fewer units than consumers will buy, it will earn less than it potentially could have. Being able to predict consumer sales is therefore of first-order importance to the company.

In this problem, we will try to predict monthly sales of the Hyundai Elantra in the United States. The Hyundai Motor Company is a major automobile manufacturer based in South Korea. The Elantra is a car model that has been produced by Hyundai since 1990 and is sold all over the world, including the United States. We will build a linear regression model to predict monthly sales using economic indicators of the United States as well as Google search queries.

The file elantra.csv contains data for the problem. Each observation is a month, from January 2010 to February 2014.

#let's import the data set
Elantra<-read.csv("elantra.csv")
str(Elantra)
## 'data.frame':    50 obs. of  7 variables:
##  $ Month       : int  1 1 1 1 1 2 2 2 2 2 ...
##  $ Year        : int  2010 2011 2012 2013 2014 2010 2011 2012 2013 2014 ...
##  $ ElantraSales: int  7690 9659 10900 12174 15326 7966 12289 13820 16219 16393 ...
##  $ Unemployment: num  9.7 9.1 8.2 7.9 6.6 9.8 9 8.3 7.7 6.7 ...
##  $ Queries     : int  153 259 354 230 232 130 266 296 239 240 ...
##  $ CPI_energy  : num  213 229 244 243 248 ...
##  $ CPI_all     : num  217 221 228 231 235 ...
#PROBLEM 1 - LOADING THE DATA 

#let's split the data set into training and test sets
 
#Load the data set. Split the data set into training and testing sets as follows: place all observations for 2012 and earlier in the training set, and all observations for 2013 and 2014 into the testing set.

ElantraTrain<-subset(Elantra,Year<=2012)
dim(ElantraTrain)
## [1] 36  7
ElantraTest<-subset(Elantra,Year>2012) #or ElantraTest<-subset(Elantra,Year==2013 | Year==2014)
dim(ElantraTest)
## [1] 14  7
#How many observations are in the training set?
#Ans:36

##################################################

#PROBLEM 2.1 - A LINEAR REGRESSION MODEL 

#Build a linear regression model to predict monthly Elantra sales using Unemployment, CPI_all, CPI_energy and Queries as the independent variables. Use all of the training set data to do this.

ElantraLM<-lm(ElantraSales~Unemployment+CPI_all+CPI_energy +Queries,data=ElantraTrain)
summary(ElantraLM)
## 
## Call:
## lm(formula = ElantraSales ~ Unemployment + CPI_all + CPI_energy + 
##     Queries, data = ElantraTrain)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6785.2 -2101.8  -562.5  2901.7  7021.0 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept)   95385.36  170663.81   0.559    0.580
## Unemployment  -3179.90    3610.26  -0.881    0.385
## CPI_all        -297.65     704.84  -0.422    0.676
## CPI_energy       38.51     109.60   0.351    0.728
## Queries          19.03      11.26   1.690    0.101
## 
## Residual standard error: 3295 on 31 degrees of freedom
## Multiple R-squared:  0.4282, Adjusted R-squared:  0.3544 
## F-statistic: 5.803 on 4 and 31 DF,  p-value: 0.00132
#let's visualise the coefficients using the sjPlot package
library(sjPlot)
sjp.lm(ElantraLM)

#What is the model R-squared? Note: In this problem, we will always be asking for the "Multiple R-Squared" of the model.
#Ans: 0.4282

###############################################

#PROBLEM 2.2 - SIGNIFICANT VARIABLES


#How many variables are significant, or have levels that are significant? Use 0.10 as your p-value cutoff.
#Ans:0
#EXPLANATION:After obtaining the output of the model summary, simply look at the p-values of all of the variables in the output (the right-most column, labeled "Pr(>|t|)"). It turns out that none of them are significant.
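
#a quick programmatic check (a sketch, not part of the original solution):
#pull the p-value column out of the coefficient table and count values below the cutoff
pvals <- summary(ElantraLM)$coefficients[-1, "Pr(>|t|)"]  #drop the intercept row
sum(pvals < 0.10)   #should return 0, consistent with the answer above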



########################################

#PROBLEM 2.3 - COEFFICIENTS  

#What is the coefficient of the Unemployment variable?
summary(ElantraLM)$coefficients["Unemployment","Estimate"]
## [1] -3179.9
#or coef(summary(ElantraLM))["Unemployment","Estimate"]
#Ans:-3179.9

####################################

#PROBLEM 2.4 - INTERPRETING THE COEFFICIENT 

#What is the interpretation of this coefficient?
#Ans:For an increase of 1 in Unemployment, the prediction of Elantra sales decreases by approximately 3000.
#EXPLANATION:The second choice is the correct answer; the coefficient is defined as the change in the prediction of the dependent variable (ElantraSales) per unit change in the independent variable in question (Unemployment). The first choice is therefore not correct; it also does not make intuitive sense since Unemployment is the percentage unemployment rate, which is bounded to be between 0 and 100.
#The third choice is not correct because the coefficient indicates how the prediction changes, not how the actual sales change, and this option asserts that actual sales change, i.e., there is a causal effect.
#The fourth choice is not correct because the statistical significance indicates how likely it is that, by chance, the true coefficient is not different from zero. However, the estimated coefficient still has a (non-zero) value, and our prediction will change for different values of Unemployment; therefore, the sales prediction cannot stay the same.
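
#a minimal numerical illustration (a sketch, not part of the original solution):
#bump Unemployment by 1 on any one training row and compare the two predictions
row1 <- ElantraTrain[1, ]
row2 <- row1
row2$Unemployment <- row2$Unemployment + 1
predict(ElantraLM, newdata = row2) - predict(ElantraLM, newdata = row1)
#the difference equals the Unemployment coefficient, roughly -3180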

#########################################

#PROBLEM 3.1 - MODELING SEASONALITY 

#Our model R-Squared is relatively low, so we would now like to improve our model. In modeling demand and sales, it is often useful to model seasonality. Seasonality refers to the fact that demand is often cyclical/periodic in time. For example, in countries with different seasons, demand for warm outerwear (like jackets and coats) is higher in fall/autumn and winter (due to the colder weather) than in spring and summer. (In contrast, demand for swimsuits and sunscreen is higher in the summer than in the other seasons.) Another example is the "back to school" period in North America: demand for stationery (pencils, notebooks and so on) in late July and all of August is higher than during the rest of the year due to the start of the school year in September.

#In our problem, since our data includes the month of the year in which the units were sold, it is feasible for us to incorporate monthly seasonality. From a modeling point of view, it may be reasonable to expect that the month has an effect on how many Elantra units are sold.

#To incorporate the seasonal effect due to the month, build a new linear regression model that predicts monthly Elantra sales using Month as well as Unemployment, CPI_all, CPI_energy and Queries. Do not modify the training and testing data frames before building the model.
ElantraLM1<-lm(ElantraSales~Month+Unemployment+CPI_all+CPI_energy +Queries,data=ElantraTrain)
summary(ElantraLM1)
## 
## Call:
## lm(formula = ElantraSales ~ Month + Unemployment + CPI_all + 
##     CPI_energy + Queries, data = ElantraTrain)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6416.6 -2068.7  -597.1  2616.3  7183.2 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)  
## (Intercept)  148330.49  195373.51   0.759   0.4536  
## Month           110.69     191.66   0.578   0.5679  
## Unemployment  -4137.28    4008.56  -1.032   0.3103  
## CPI_all        -517.99     808.26  -0.641   0.5265  
## CPI_energy       54.18     114.08   0.475   0.6382  
## Queries          21.19      11.98   1.769   0.0871 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3331 on 30 degrees of freedom
## Multiple R-squared:  0.4344, Adjusted R-squared:  0.3402 
## F-statistic: 4.609 on 5 and 30 DF,  p-value: 0.003078
#What is the model R-Squared?
#Ans:  0.4344

##################################

#PROBLEM 3.2 - EFFECT OF ADDING A NEW VARIABLE  

#Which of the following best describes the effect of adding Month?
#let's compare the two fitted models
#let's get the summary output as a better-presented table
library(sjPlot)
sjt.lm(ElantraLM,ElantraLM1) 
               ElantraSales (ElantraLM)                        ElantraSales (ElantraLM1, with Month)
               B           CI                         p        B           CI                         p
(Intercept)    95385.36    -252685.78 – 443456.51     .580     148330.49   -250675.44 – 547336.42     .454
Unemployment   -3179.90    -10543.08 – 4183.28        .385     -4137.28    -12323.85 – 4049.28        .310
CPI_all        -297.65     -1735.17 – 1139.88         .676     -517.99     -2168.68 – 1132.69         .526
CPI_energy     38.51       -185.03 – 262.04           .728     54.18       -178.79 – 287.16           .638
Queries        19.03       -3.93 – 41.99              .101     21.19       -3.28 – 45.65              .087
Month                                                          110.69      -280.73 – 502.10           .568
Observations   36                                              36
R2 / adj. R2   .428 / .354                                     .434 / .340
#Ans:The model is not better because the adjusted R-squared has gone down and none of the variables (including the new one) are very significant.
#EXPLANATION:The first option is incorrect because (ordinary) R-Squared always increases (or at least stays the same) when you add new variables. This does not make the model better, and in fact, may hurt the ability of the model to generalize to new, unseen data (overfitting).

#The second option is correct: the adjusted R-Squared is the R-Squared but adjusted to take into account the number of variables. If the adjusted R-Squared is lower, then this indicates that our model is not better and in fact may be worse. Furthermore, if none of the variables have become significant, then this also indicates that the model is not better.

#The third option is not correct because as stated above, the adjusted R-Squared has become worse. Although the variables have come closer to being significant, this doesn't make it a better model.

#The fourth option is not correct. Although it is desirable to have models that are parsimonious (fewer variables), we are ultimately interested in models that have high explanatory power (as measured in training R-Squared) and out of sample predictive power (as measured in testing R-Squared). Adding a key variable may significantly improve the predictive power of the model, and we should thus not dismiss the model simply because it has more variables.
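
#to see the drop in adjusted R-squared directly (a small sketch; both fits are from above):
summary(ElantraLM)$adj.r.squared    #model without Month, about 0.354
summary(ElantraLM1)$adj.r.squared   #model with numeric Month, about 0.340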

############################

#PROBLEM 3.3 - UNDERSTANDING THE MODEL  

#Let us try to understand our model.

#In the new model, given two monthly periods that are otherwise identical in Unemployment, CPI_all, CPI_energy and Queries, what is the absolute difference in predicted Elantra sales given that one period is in January and one is in March?
#Ans:221.38
#EXPLANATION:The coefficient for Month is 110.69 (look at the summary output of the model). For the first question, January is coded numerically as 1, while March is coded numerically as 3; the difference in the prediction is therefore
#110.69 * (3 - 1) = 110.69 * 2 = 221.38

#In the new model, given two monthly periods that are otherwise identical in Unemployment, CPI_all, CPI_energy and Queries, what is the absolute difference in predicted Elantra sales given that one period is in January and one is in May?
#Ans:442.76
#EXPLANATION: May is numerically coded as 5, while January is 1, so the difference in predicted sales is
#110.69 * (5 - 1) = 110.69 * 4 = 442.76
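
#the same differences computed from the fitted coefficient (a sketch):
coef(ElantraLM1)["Month"] * (3 - 1)   #January vs March, about 221.38
coef(ElantraLM1)["Month"] * (5 - 1)   #January vs May, about 442.76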

#########################

#PROBLEM 3.4 - NUMERIC VS. FACTORS

#You may be experiencing an uneasy feeling that there is something not quite right in how we have modeled the effect of the calendar month on the monthly sales of Elantras. If so, you are right. In particular, we added Month as a variable, but Month is an ordinary numeric variable. In fact, we must convert Month to a factor variable before adding it to the model.

#What is the best explanation for why we must do this?
#Ans:By modeling Month as a factor variable, the effect of each calendar month is not restricted to be linear in the numerical coding of the month. 
#EXPLANATION:The second choice is the correct answer. The previous subproblem essentially showed that for every month that we move into the future (e.g., from January to February, from February to March, etc.), our predicted sales go up by 110.69. This isn't right, because the effect of the month should not be affected by the numerical coding, and by modeling Month as a numeric variable, we cannot capture more complex effects. For example, suppose that when the other variables are fixed, an additional 500 units are sold from June to December, relative to the other months. This type of relationship between the boost to the sales and the Month variable would look like a step function at Month = 6, which cannot be modeled as a linear function of Month.

#The first choice is not right. As we have discussed before, increasing the number of coefficients will never cause the model's R-Squared to decrease, but if the increase is small, then we have not really improved the predictive power of our model, and converting Month to a factor variable is not justified.

#The third choice is also not correct. Month is stored as an ordinary number, so there cannot be any issues due to the Date format.
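
#to see what converting Month to a factor does (a sketch, not part of the original solution):
#the factor expands into 11 dummy variables, with January (Month = 1) as the reference level
head(model.matrix(~ as.factor(Month), data = ElantraTrain))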

################################

#PROBLEM 4.1 - A NEW MODEL 

#Re-run the regression with the Month variable modeled as a factor variable. (Create a new variable that models the Month as a factor (using the as.factor function) instead of overwriting the current Month variable. We'll still use the numeric version of Month later in the problem.)
ElantraTrain$MonthFactor = as.factor(ElantraTrain$Month) #new variable 'MonthFactor' added to the ElantraTrain data set

ElantraTest$MonthFactor = as.factor(ElantraTest$Month) #new variable 'MonthFactor' added to the ElantraTest data set

#Then, you want to rebuild the model using the lm function:
ElantraLM = lm(ElantraSales ~ Unemployment + Queries + CPI_energy + CPI_all + MonthFactor, data=ElantraTrain)
summary(ElantraLM)
## 
## Call:
## lm(formula = ElantraSales ~ Unemployment + Queries + CPI_energy + 
##     CPI_all + MonthFactor, data = ElantraTrain)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3865.1 -1211.7   -77.1  1207.5  3562.2 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   312509.280 144061.867   2.169 0.042288 *  
## Unemployment   -7739.381   2968.747  -2.607 0.016871 *  
## Queries           -4.764     12.938  -0.368 0.716598    
## CPI_energy       288.631     97.974   2.946 0.007988 ** 
## CPI_all        -1343.307    592.919  -2.266 0.034732 *  
## MonthFactor2    2254.998   1943.249   1.160 0.259540    
## MonthFactor3    6696.557   1991.635   3.362 0.003099 ** 
## MonthFactor4    7556.607   2038.022   3.708 0.001392 ** 
## MonthFactor5    7420.249   1950.139   3.805 0.001110 ** 
## MonthFactor6    9215.833   1995.230   4.619 0.000166 ***
## MonthFactor7    9929.464   2238.800   4.435 0.000254 ***
## MonthFactor8    7939.447   2064.629   3.845 0.001010 ** 
## MonthFactor9    5013.287   2010.745   2.493 0.021542 *  
## MonthFactor10   2500.184   2084.057   1.200 0.244286    
## MonthFactor11   3238.932   2397.231   1.351 0.191747    
## MonthFactor12   5293.911   2228.310   2.376 0.027621 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2306 on 20 degrees of freedom
## Multiple R-squared:  0.8193, Adjusted R-squared:  0.6837 
## F-statistic: 6.044 on 15 and 20 DF,  p-value: 0.0001469
#What is the model R-Squared?
#Ans:0.8193

##############################

#PROBLEM 4.2 - SIGNIFICANT VARIABLES 

#Which variables are significant, or have levels that are significant? Use 0.10 as your p-value cutoff. (Select all that apply.)
#Ans: Month (the factor version), CPI_all, CPI_energy, Unemployment

#let's plot the coefficients to see which are significant, using the sjPlot library
library(sjPlot)
#?sjp.lm
sjp.lm(ElantraLM,type="lm")

#################################

#PROBLEM 5.1 - MULTICOLLINEARITY 

#Another peculiar observation about the regression is that the sign of the Queries variable has changed. In particular, when we naively modeled Month as a numeric variable, Queries had a positive coefficient. Now, Queries has a negative coefficient. Furthermore, CPI_energy has a positive coefficient -- as the overall price of energy increases, we expect Elantra sales to increase, which seems counter-intuitive (if the price of energy increases, we'd expect consumers to have less funds to purchase automobiles, leading to lower Elantra sales).

#As we have seen before, changes in coefficient signs and signs that are counter to our intuition may be due to a multicollinearity problem. To check, compute the correlations of the variables in the training set.
cor(ElantraTrain[c("Unemployment","Month","Queries","CPI_energy","CPI_all")])  #or
##              Unemployment      Month    Queries CPI_energy    CPI_all
## Unemployment    1.0000000 -0.2036029 -0.6411093 -0.8007188 -0.9562123
## Month          -0.2036029  1.0000000  0.0158443  0.1760198  0.2667883
## Queries        -0.6411093  0.0158443  1.0000000  0.8328381  0.7536732
## CPI_energy     -0.8007188  0.1760198  0.8328381  1.0000000  0.9132259
## CPI_all        -0.9562123  0.2667883  0.7536732  0.9132259  1.0000000
cor(ElantraTrain$CPI_energy,ElantraTrain[,1:7]) #the last column (MonthFactor) is a factor variable and is excluded, since cor() only works on numeric variables
##          Month      Year ElantraSales Unemployment   Queries CPI_energy
## [1,] 0.1760198 0.8316052    0.5916491   -0.8007188 0.8328381          1
##        CPI_all
## [1,] 0.9132259
#Which of the following variables is CPI_energy highly correlated with? Select all that apply. (Include only variables where the absolute value of the correlation exceeds 0.6. For the purpose of this question, treat Month as a numeric variable, not a factor variable.)
#Ans:Unemployment,Queries,CPI_all 

#plotting the correlation matrix
library(sjPlot)
#?sjp.corr
sjp.corr(ElantraTrain[,c("Unemployment","Month","Queries","CPI_energy","CPI_all")],type = "tile",show.legend=FALSE) 
## Computing correlation using spearman-method with listwise-deletion...

#Testing the model assumptions
library(sjPlot)
#?sjp.lm
sjp.lm(ElantraLM,type="ma",completeDiagnostic=T) 
## Removed 0 cases during 0 step(s).
## R^2 / adj. R^2 of original model: 0.819264 / 0.683712
## R^2 / adj. R^2 of updated model:  0.819264 / 0.683712
## AIC of original model: 672.515747 
## AIC of updated model:  672.515747

##  lag Autocorrelation D-W Statistic p-value
##    1      -0.4653687      2.795112   0.476
##  Alternative hypothesis: rho != 0

## Non-constant Variance Score Test 
## Variance formula: ~ fitted.values 
## Chisquare = 2.04248    Df = 1     p = 0.1529601 
## 
##  studentized Breusch-Pagan test
## 
## data:  linreg
## BP = 18.094, df = 15, p-value = 0.2577

## 
## Suggested power transformation:  0.6030641
#the diagnostic plots give a strong indication that multicollinearity is present
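
#another common numeric check for multicollinearity is the variance inflation factor (VIF);
#a sketch using the car package (an assumption; car is not used in the original solution):
library(car)
vif(ElantraLM1)   #large VIFs (say, above 5-10) flag predictors that are highly collinear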

#########################################

#PROBLEM 5.2 - CORRELATIONS 

#Which of the following variables is Queries highly correlated with? Again, compute the correlations on the training set. Select all that apply. (Include only variables where the absolute value of the correlation exceeds 0.6. For the purpose of this question, treat Month as a numeric variable, not a factor variable.)
cor(ElantraTrain[c("Unemployment","Month","Queries","CPI_energy","CPI_all")])  #or
##              Unemployment      Month    Queries CPI_energy    CPI_all
## Unemployment    1.0000000 -0.2036029 -0.6411093 -0.8007188 -0.9562123
## Month          -0.2036029  1.0000000  0.0158443  0.1760198  0.2667883
## Queries        -0.6411093  0.0158443  1.0000000  0.8328381  0.7536732
## CPI_energy     -0.8007188  0.1760198  0.8328381  1.0000000  0.9132259
## CPI_all        -0.9562123  0.2667883  0.7536732  0.9132259  1.0000000
cor(ElantraTrain$Queries,ElantraTrain[,1:7]) #again excluding the last column (MonthFactor), which is a factor variable
##          Month     Year ElantraSales Unemployment Queries CPI_energy
## [1,] 0.0158443 0.726531    0.6100645   -0.6411093       1  0.8328381
##        CPI_all
## [1,] 0.7536732
#Ans:Unemployment,CPI_energy,CPI_all

#plotting the correlation matrix
library(sjPlot)
#?sjp.corr
sjp.corr(ElantraTrain[,c("Unemployment","Month","Queries","CPI_energy","CPI_all")],type = "tile",show.legend =FALSE) 
## Computing correlation using spearman-method with listwise-deletion...

##############################

#PROBLEM 6.1 - A REDUCED MODEL  

#Let us now simplify our model (the model using the factor version of the Month variable). We will do this by iteratively removing variables, one at a time. Remove the variable with the highest p-value (i.e., the least statistically significant variable) from the model. Repeat this until there are no variables that are insignificant or variables for which all of the factor levels are insignificant. Use a threshold of 0.10 to determine whether a variable is significant.

#Method 1: automatically building the reduced model with step()
StepModel<-step(ElantraLM)
## Start:  AIC=568.35
## ElantraSales ~ Unemployment + Queries + CPI_energy + CPI_all + 
##     MonthFactor
## 
##                Df Sum of Sq       RSS    AIC
## - Queries       1    720829 107064906 566.60
## <none>                      106344077 568.35
## - CPI_all       1  27292499 133636576 574.58
## - Unemployment  1  36136773 142480850 576.88
## - CPI_energy    1  46147903 152491980 579.33
## - MonthFactor  11 230125737 336469814 587.82
## 
## Step:  AIC=566.6
## ElantraSales ~ Unemployment + CPI_energy + CPI_all + MonthFactor
## 
##                Df Sum of Sq       RSS    AIC
## <none>                      107064906 566.60
## - CPI_all       1  29428416 136493322 573.34
## - Unemployment  1  40143086 147207993 576.06
## - CPI_energy    1  59057377 166122283 580.41
## - MonthFactor  11 260411263 367476169 588.99
summary(StepModel)
## 
## Call:
## lm(formula = ElantraSales ~ Unemployment + CPI_energy + CPI_all + 
##     MonthFactor, data = ElantraTrain)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3866.0 -1283.3  -107.2  1098.3  3650.1 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   325709.15  136627.85   2.384 0.026644 *  
## Unemployment   -7971.34    2840.79  -2.806 0.010586 *  
## CPI_energy       268.03      78.75   3.403 0.002676 ** 
## CPI_all        -1377.58     573.39  -2.403 0.025610 *  
## MonthFactor2    2410.91    1857.10   1.298 0.208292    
## MonthFactor3    6880.09    1888.15   3.644 0.001517 ** 
## MonthFactor4    7697.36    1960.21   3.927 0.000774 ***
## MonthFactor5    7444.64    1908.48   3.901 0.000823 ***
## MonthFactor6    9223.13    1953.64   4.721 0.000116 ***
## MonthFactor7    9602.72    2012.66   4.771 0.000103 ***
## MonthFactor8    7919.50    2020.99   3.919 0.000789 ***
## MonthFactor9    5074.29    1962.23   2.586 0.017237 *  
## MonthFactor10   2724.24    1951.78   1.396 0.177366    
## MonthFactor11   3665.08    2055.66   1.783 0.089062 .  
## MonthFactor12   5643.19    1974.36   2.858 0.009413 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2258 on 21 degrees of freedom
## Multiple R-squared:  0.818,  Adjusted R-squared:  0.6967 
## F-statistic: 6.744 on 14 and 21 DF,  p-value: 5.73e-05
#the stepwise iterations show that Queries is the only variable removed

#Which variables, and in what order, are removed by this process?
#Method 2: manually remove the variable with the highest p-value and rebuild the model using the lm function

#the full model, before removing anything
ElantraLM = lm(ElantraSales ~ Unemployment + Queries + CPI_energy + CPI_all + MonthFactor, data=ElantraTrain)
summary(ElantraLM)
## 
## Call:
## lm(formula = ElantraSales ~ Unemployment + Queries + CPI_energy + 
##     CPI_all + MonthFactor, data = ElantraTrain)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3865.1 -1211.7   -77.1  1207.5  3562.2 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   312509.280 144061.867   2.169 0.042288 *  
## Unemployment   -7739.381   2968.747  -2.607 0.016871 *  
## Queries           -4.764     12.938  -0.368 0.716598    
## CPI_energy       288.631     97.974   2.946 0.007988 ** 
## CPI_all        -1343.307    592.919  -2.266 0.034732 *  
## MonthFactor2    2254.998   1943.249   1.160 0.259540    
## MonthFactor3    6696.557   1991.635   3.362 0.003099 ** 
## MonthFactor4    7556.607   2038.022   3.708 0.001392 ** 
## MonthFactor5    7420.249   1950.139   3.805 0.001110 ** 
## MonthFactor6    9215.833   1995.230   4.619 0.000166 ***
## MonthFactor7    9929.464   2238.800   4.435 0.000254 ***
## MonthFactor8    7939.447   2064.629   3.845 0.001010 ** 
## MonthFactor9    5013.287   2010.745   2.493 0.021542 *  
## MonthFactor10   2500.184   2084.057   1.200 0.244286    
## MonthFactor11   3238.932   2397.231   1.351 0.191747    
## MonthFactor12   5293.911   2228.310   2.376 0.027621 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2306 on 20 degrees of freedom
## Multiple R-squared:  0.8193, Adjusted R-squared:  0.6837 
## F-statistic: 6.044 on 15 and 20 DF,  p-value: 0.0001469
#let's remove the Queries variable and rebuild the model
ElantraLMnew = lm(ElantraSales ~ Unemployment + CPI_energy + CPI_all + MonthFactor, data=ElantraTrain)
summary(ElantraLMnew)
## 
## Call:
## lm(formula = ElantraSales ~ Unemployment + CPI_energy + CPI_all + 
##     MonthFactor, data = ElantraTrain)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3866.0 -1283.3  -107.2  1098.3  3650.1 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   325709.15  136627.85   2.384 0.026644 *  
## Unemployment   -7971.34    2840.79  -2.806 0.010586 *  
## CPI_energy       268.03      78.75   3.403 0.002676 ** 
## CPI_all        -1377.58     573.39  -2.403 0.025610 *  
## MonthFactor2    2410.91    1857.10   1.298 0.208292    
## MonthFactor3    6880.09    1888.15   3.644 0.001517 ** 
## MonthFactor4    7697.36    1960.21   3.927 0.000774 ***
## MonthFactor5    7444.64    1908.48   3.901 0.000823 ***
## MonthFactor6    9223.13    1953.64   4.721 0.000116 ***
## MonthFactor7    9602.72    2012.66   4.771 0.000103 ***
## MonthFactor8    7919.50    2020.99   3.919 0.000789 ***
## MonthFactor9    5074.29    1962.23   2.586 0.017237 *  
## MonthFactor10   2724.24    1951.78   1.396 0.177366    
## MonthFactor11   3665.08    2055.66   1.783 0.089062 .  
## MonthFactor12   5643.19    1974.36   2.858 0.009413 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2258 on 21 degrees of freedom
## Multiple R-squared:  0.818,  Adjusted R-squared:  0.6967 
## F-statistic: 6.744 on 14 and 21 DF,  p-value: 5.73e-05
#Ans:Queries
#EXPLANATION:The variable with the highest p-value is "Queries". After removing it and looking at the model summary again, we can see that no remaining variables are insignificant at the 0.10 level. Note that Month has a few levels that are insignificant, but we don't want to remove it because many of its levels are very significant.

#####################################

#PROBLEM 6.2 - TEST SET PREDICTIONS  

predTest<-predict(ElantraLMnew ,newdata = ElantraTest)

#Using the model from Problem 6.1, make predictions on the test set. What is the sum of squared errors of the model on the test set?
SSE = sum((predTest - ElantraTest$ElantraSales)^2)
SSE
## [1] 190757747
#Ans:190757747
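
#a related quantity (a sketch, not asked in the problem): the test-set RMSE
sqrt(SSE/nrow(ElantraTest))   #roughly 3691 units per month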

##################################

#PROBLEM 6.3 - COMPARING TO A BASELINE 

#What would the baseline method predict for all observations in the test set? Remember that the baseline method we use predicts the average outcome of all observations in the training set.
baseline = mean(ElantraTrain$ElantraSales)
baseline 
## [1] 14462.25
#Ans:14462.25
#EXPLANATION:The baseline method that is used in the R-Squared calculation (to compute SST, the total sum of squares) simply predicts the mean of ElantraSales in the training set for every observation (i.e., without regard to any of the independent variables).


#################################

#PROBLEM 6.4 - TEST SET R-SQUARED 

#What is the test set R-Squared?
1-190757747/701375142
## [1] 0.7280232
#Ans:0.7280232
#EXPLANATION:You can compute the SST as the sum of the squared differences between ElantraSales in the testing set and the mean of ElantraSales in the training set:
SST<-sum((ElantraTest$ElantraSales-mean(ElantraTrain$ElantraSales))^2)
SST
## [1] 701375142
#then R-Squared=1-SSE/SST
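
#equivalently, using the SSE and SST objects computed above (a small sketch):
1 - SSE/SST   #about 0.728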

##############################

#PROBLEM 6.5 - ABSOLUTE ERRORS

#What is the largest absolute error that we make in our test set predictions?
sort(abs(predTest - ElantraTest$ElantraSales),decreasing = TRUE)
##        14        22        18        34         4        38        30 
## 7491.4877 6903.3078 5997.2221 4392.9279 3088.6588 2482.2349 2245.0544 
##         9        26        50        42        46         5        10 
## 2187.5530 2021.6654  934.0051  658.5309  631.7978  490.4278  394.2148
#or
max(abs(predTest- ElantraTest$ElantraSales))
## [1] 7491.488
#Ans:7491.488

########################

#PROBLEM 6.6 - MONTH OF LARGEST ERROR 

#In which period (Month,Year pair) do we make the largest absolute error in our prediction?
ElantraTest$Month[which.max(abs(predTest - ElantraTest$ElantraSales))]
## [1] 3
#Ans:03/2013 
#EXPLANATION:which.max(abs(predTest - ElantraTest$ElantraSales)) returns 5, the row number in ElantraTest with the largest absolute error; the Month and Year of that row are 3 and 2013, i.e. March 2013.
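
#to pull the full period directly (a sketch, not part of the original solution):
ElantraTest[which.max(abs(predTest - ElantraTest$ElantraSales)), c("Month", "Year")]
#should show Month 3 and Year 2013, i.e. March 2013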