Source file ⇒ The_Analytics_Edge_edX_MIT062015.Rmd
These are my notes for the lectures of "The_Analytics_Edge_edX_MIT15.071x_June2015" by Professor Dimitris Bertsimas. The goal of these notes is to provide reproducible R code for all the lectures.
The notes are written in Rmarkdown, which is based on knitr. The R code appears in darker grey boxes, while the output appears below in white boxes (the output appears here prefixed with ##, but that’s not really part of R output; see the Rmarkdown and knitr references for details). The important thing is that you can copy the code from the grey boxes, paste it into R and execute it directly.
IMPORTANT NOTE
If you downloaded and installed R in a location other than the United States, you might encounter some formatting issues later in this class due to language differences. To fix this, you will need to type in your R console:
Sys.setlocale("LC_ALL", "C")
This will only change the locale for your current R session, so please make a note to run this command when you are working on any lectures or exercises that might depend on the English language (for example, the names for the days of the week).
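As a quick check that the change took effect, you can ask R for a weekday name; with the "C" locale it should come back in English (a minimal sketch; the date used here is arbitrary):
weekdays(as.Date("2015-06-01"))   # should print "Monday"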
Please download and install R for free from the following webpage: http://www.cran.r-project.org
When you start R, you should see a window titled “R Console”. In this window, there is some text, and then at the bottom there should be a > symbol (greater than symbol), followed by a blinking cursor. At the cursor, type:
sd(c(5,8,12))
and then hit enter. You should see [1] followed by a number. What is this number?
sd(c(5,8,12))
## [1] 3.511885
Ans:3.511885
Now type:
which.min(c(4,1,6))
at the cursor, and hit enter. You should again see [1], followed by a number. What is this number?
which.min(c(4,1,6))
## [1] 2
Ans:2
which.min
gives the position of the min value of the data set/vector
Q:At which university was the first version of R developed?
Ans:University of Auckland
EXPLANATION:The first version of R was developed by Robert Gentleman and Ross Ihaka at the University of Auckland in the mid-1990s.
Q:Which of the following are recommended variable names in R? (Select all that apply.)
Ans:SquareRoot2
Square2.Root
EXPLANATION:SquareRoot2 and Square2.Root are recommended variable names. The second option is not recommended because it has a space, and the fourth option is not recommended because it starts with a number.
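For example (a minimal sketch; the last line is commented out because it is not a legal name and would produce an error):
SquareRoot2 = sqrt(2)   # valid: starts with a letter
Square2.Root = sqrt(2)  # valid: dots are allowed in names
# 2SquareRoot = sqrt(2) # invalid: a name cannot start with a number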
Q:If you want to add new observations to a data frame, what should you use?
Ans:The rbind function.
EXPLANATION:To add new observations (rows with the same variables) to a data frame, you should use rbind.
Q:If you want to combine two vectors into a data frame, what should you use?
Ans:The data.frame function.
EXPLANATION:To combine two vectors into a data frame, you should use data.frame.
Q:If you want to add a variable to your data frame, what should you use?
Ans:The dollar sign notation.
EXPLANATION:To add a variable to your data frame, you should use the dollar sign notation.
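All three operations are shown on real data later in these notes; as a compact sketch with made-up values:
df = data.frame(Name = c("A","B"), Score = c(1,2))                # combine two vectors into a data frame
df$Passed = c(TRUE, FALSE)                                        # add a variable with the dollar sign notation
df = rbind(df, data.frame(Name = "C", Score = 3, Passed = TRUE))  # add a new observation with rbind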
Q:If you want to know the mean value of a numerical variable, which function could you use?
Ans:summary
EXPLANATION:If you use the summary function (in the video, we typed summary(WHO) in our R console) you can see a statistical summary of each variable. For numerical variables, the mean value is listed.
Q:If you want to know the standard deviation of a numerical variable, which function could you use?
Ans:Neither str nor summary.
EXPLANATION:Neither the str function nor the summary function provides the standard deviation value of a variable. We’ll see how to compute this value in the next video.
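Once the WHO data frame has been loaded (as in the next code chunk), a minimal sketch of computing it with the sd function; use na.rm=TRUE for variables that contain missing values:
sd(WHO$Under15)
sd(WHO$FertilityRate, na.rm = TRUE)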
#importing the WHO dataset
WHO<-read.csv("WHO.csv")
str(WHO)
## 'data.frame': 194 obs. of 13 variables:
## $ Country : Factor w/ 194 levels "Afghanistan",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ Region : Factor w/ 6 levels "Africa","Americas",..: 3 4 1 4 1 2 2 4 6 4 ...
## $ Population : int 29825 3162 38482 78 20821 89 41087 2969 23050 8464 ...
## $ Under15 : num 47.4 21.3 27.4 15.2 47.6 ...
## $ Over60 : num 3.82 14.93 7.17 22.86 3.84 ...
## $ FertilityRate : num 5.4 1.75 2.83 NA 6.1 2.12 2.2 1.74 1.89 1.44 ...
## $ LifeExpectancy : int 60 74 73 82 51 75 76 71 82 81 ...
## $ ChildMortality : num 98.5 16.7 20 3.2 163.5 ...
## $ CellularSubscribers : num 54.3 96.4 99 75.5 48.4 ...
## $ LiteracyRate : num NA NA NA NA 70.1 99 97.8 99.6 NA NA ...
## $ GNI : num 1140 8820 8310 NA 5230 ...
## $ PrimarySchoolEnrollmentMale : num NA NA 98.2 78.4 93.1 91.1 NA NA 96.9 NA ...
## $ PrimarySchoolEnrollmentFemale: num NA NA 96.4 79.4 78.2 84.5 NA NA 97.5 NA ...
summary(WHO)
## Country Region Population
## Afghanistan : 1 Africa :46 Min. : 1
## Albania : 1 Americas :35 1st Qu.: 1696
## Algeria : 1 Eastern Mediterranean:22 Median : 7790
## Andorra : 1 Europe :53 Mean : 36360
## Angola : 1 South-East Asia :11 3rd Qu.: 24535
## Antigua and Barbuda: 1 Western Pacific :27 Max. :1390000
## (Other) :188
## Under15 Over60 FertilityRate LifeExpectancy
## Min. :13.12 Min. : 0.81 Min. :1.260 Min. :47.00
## 1st Qu.:18.72 1st Qu.: 5.20 1st Qu.:1.835 1st Qu.:64.00
## Median :28.65 Median : 8.53 Median :2.400 Median :72.50
## Mean :28.73 Mean :11.16 Mean :2.941 Mean :70.01
## 3rd Qu.:37.75 3rd Qu.:16.69 3rd Qu.:3.905 3rd Qu.:76.00
## Max. :49.99 Max. :31.92 Max. :7.580 Max. :83.00
## NA's :11
## ChildMortality CellularSubscribers LiteracyRate GNI
## Min. : 2.200 Min. : 2.57 Min. :31.10 Min. : 340
## 1st Qu.: 8.425 1st Qu.: 63.57 1st Qu.:71.60 1st Qu.: 2335
## Median : 18.600 Median : 97.75 Median :91.80 Median : 7870
## Mean : 36.149 Mean : 93.64 Mean :83.71 Mean :13321
## 3rd Qu.: 55.975 3rd Qu.:120.81 3rd Qu.:97.85 3rd Qu.:17558
## Max. :181.600 Max. :196.41 Max. :99.80 Max. :86440
## NA's :10 NA's :91 NA's :32
## PrimarySchoolEnrollmentMale PrimarySchoolEnrollmentFemale
## Min. : 37.20 Min. : 32.50
## 1st Qu.: 87.70 1st Qu.: 87.30
## Median : 94.70 Median : 95.10
## Mean : 90.85 Mean : 89.63
## 3rd Qu.: 98.10 3rd Qu.: 97.90
## Max. :100.00 Max. :100.00
## NA's :93 NA's :93
#Q:What is the mean value of the "Over60" variable?
mean(WHO$Over60)
## [1] 11.16366
#Ans:11.16366
#EXPLANATION:You can compute this value by either typing mean(WHO$Over60) in your R console, or by typing summary(WHO$Over60) in your R console. The output is 11.16.
#Q:Which country has the smallest percentage of the population over 60?
which.min(WHO$Over60)
## [1] 183
WHO$Country[183]
## [1] United Arab Emirates
## 194 Levels: Afghanistan Albania Algeria Andorra ... Zimbabwe
#Ans:United Arab Emirates
#EXPLANATION:To get this value, you should type which.min(WHO$Over60) in your R console. The output is 183. Then, to see the name of the 183rd country in your data frame, type WHO$Country[183] in your R console. The output is United Arab Emirates.
#Q:Which country has the largest literacy rate?
which.max(WHO$LiteracyRate)
## [1] 44
WHO$Country[44]
## [1] Cuba
## 194 Levels: Afghanistan Albania Algeria Andorra ... Zimbabwe
#Ans:Cuba
#EXPLANATION:To get this value, you should type which.max(WHO$LiteracyRate) in your R console. The output is 44. Then, to see the name of the 44th country in your data frame, type WHO$Country[44] in your R console. The output is Cuba.
#Q:Use the tapply function to find the average child mortality rate of countries in each region.
#Which region has the lowest average child mortality rate across all countries in that region?
tapply(WHO$ChildMortality, WHO$Region, mean)
## Africa Americas Eastern Mediterranean
## 84.03696 19.32286 40.25000
## Europe South-East Asia Western Pacific
## 10.05094 35.04545 24.71111
#Ans:Europe
# VIDEO 2
# Basic Calculations
8*6
## [1] 48
2^16
## [1] 65536
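# The next command is intentionally incomplete: R sees 2^ and waits for the rest on the following line, so the two lines together evaluate 2^8*6 = 1536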
2^
8*6
## [1] 1536
8*10
## [1] 80
# Functions
sqrt(2)
## [1] 1.414214
abs(-65)
## [1] 65
#?sqrt
# Variables
SquareRoot2 = sqrt(2)
SquareRoot2
## [1] 1.414214
HoursYear <- 365*24
HoursYear
## [1] 8760
ls()
## [1] "HoursYear" "SquareRoot2" "WHO"
# VIDEO 3
# Vectors
c(2,3,5,8,13)
## [1] 2 3 5 8 13
Country = c("Brazil", "China", "India","Switzerland","USA")
LifeExpectancy = c(74,76,65,83,79)
Country
## [1] "Brazil" "China" "India" "Switzerland" "USA"
LifeExpectancy
## [1] 74 76 65 83 79
Country[1]
## [1] "Brazil"
LifeExpectancy[3]
## [1] 65
Sequence = seq(0,100,2)
Sequence
## [1] 0 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30 32
## [18] 34 36 38 40 42 44 46 48 50 52 54 56 58 60 62 64 66
## [35] 68 70 72 74 76 78 80 82 84 86 88 90 92 94 96 98 100
# Data Frames
CountryData = data.frame(Country, LifeExpectancy)
CountryData
## Country LifeExpectancy
## 1 Brazil 74
## 2 China 76
## 3 India 65
## 4 Switzerland 83
## 5 USA 79
CountryData$Population = c(199000,1390000,1240000,7997,318000)
CountryData
## Country LifeExpectancy Population
## 1 Brazil 74 199000
## 2 China 76 1390000
## 3 India 65 1240000
## 4 Switzerland 83 7997
## 5 USA 79 318000
Country = c("Australia","Greece")
LifeExpectancy = c(82,81)
Population = c(23050,11125)
NewCountryData = data.frame(Country, LifeExpectancy, Population)
NewCountryData
## Country LifeExpectancy Population
## 1 Australia 82 23050
## 2 Greece 81 11125
AllCountryData = rbind(CountryData, NewCountryData)
AllCountryData
## Country LifeExpectancy Population
## 1 Brazil 74 199000
## 2 China 76 1390000
## 3 India 65 1240000
## 4 Switzerland 83 7997
## 5 USA 79 318000
## 6 Australia 82 23050
## 7 Greece 81 11125
# VIDEO 4
# Loading csv files
WHO<-read.csv("WHO.csv")
str(WHO)
## 'data.frame': 194 obs. of 13 variables:
## $ Country : Factor w/ 194 levels "Afghanistan",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ Region : Factor w/ 6 levels "Africa","Americas",..: 3 4 1 4 1 2 2 4 6 4 ...
## $ Population : int 29825 3162 38482 78 20821 89 41087 2969 23050 8464 ...
## $ Under15 : num 47.4 21.3 27.4 15.2 47.6 ...
## $ Over60 : num 3.82 14.93 7.17 22.86 3.84 ...
## $ FertilityRate : num 5.4 1.75 2.83 NA 6.1 2.12 2.2 1.74 1.89 1.44 ...
## $ LifeExpectancy : int 60 74 73 82 51 75 76 71 82 81 ...
## $ ChildMortality : num 98.5 16.7 20 3.2 163.5 ...
## $ CellularSubscribers : num 54.3 96.4 99 75.5 48.4 ...
## $ LiteracyRate : num NA NA NA NA 70.1 99 97.8 99.6 NA NA ...
## $ GNI : num 1140 8820 8310 NA 5230 ...
## $ PrimarySchoolEnrollmentMale : num NA NA 98.2 78.4 93.1 91.1 NA NA 96.9 NA ...
## $ PrimarySchoolEnrollmentFemale: num NA NA 96.4 79.4 78.2 84.5 NA NA 97.5 NA ...
summary(WHO)
## Country Region Population
## Afghanistan : 1 Africa :46 Min. : 1
## Albania : 1 Americas :35 1st Qu.: 1696
## Algeria : 1 Eastern Mediterranean:22 Median : 7790
## Andorra : 1 Europe :53 Mean : 36360
## Angola : 1 South-East Asia :11 3rd Qu.: 24535
## Antigua and Barbuda: 1 Western Pacific :27 Max. :1390000
## (Other) :188
## Under15 Over60 FertilityRate LifeExpectancy
## Min. :13.12 Min. : 0.81 Min. :1.260 Min. :47.00
## 1st Qu.:18.72 1st Qu.: 5.20 1st Qu.:1.835 1st Qu.:64.00
## Median :28.65 Median : 8.53 Median :2.400 Median :72.50
## Mean :28.73 Mean :11.16 Mean :2.941 Mean :70.01
## 3rd Qu.:37.75 3rd Qu.:16.69 3rd Qu.:3.905 3rd Qu.:76.00
## Max. :49.99 Max. :31.92 Max. :7.580 Max. :83.00
## NA's :11
## ChildMortality CellularSubscribers LiteracyRate GNI
## Min. : 2.200 Min. : 2.57 Min. :31.10 Min. : 340
## 1st Qu.: 8.425 1st Qu.: 63.57 1st Qu.:71.60 1st Qu.: 2335
## Median : 18.600 Median : 97.75 Median :91.80 Median : 7870
## Mean : 36.149 Mean : 93.64 Mean :83.71 Mean :13321
## 3rd Qu.: 55.975 3rd Qu.:120.81 3rd Qu.:97.85 3rd Qu.:17558
## Max. :181.600 Max. :196.41 Max. :99.80 Max. :86440
## NA's :10 NA's :91 NA's :32
## PrimarySchoolEnrollmentMale PrimarySchoolEnrollmentFemale
## Min. : 37.20 Min. : 32.50
## 1st Qu.: 87.70 1st Qu.: 87.30
## Median : 94.70 Median : 95.10
## Mean : 90.85 Mean : 89.63
## 3rd Qu.: 98.10 3rd Qu.: 97.90
## Max. :100.00 Max. :100.00
## NA's :93 NA's :93
# Subsetting
WHO_Europe = subset(WHO, Region == "Europe")
str(WHO_Europe)
## 'data.frame': 53 obs. of 13 variables:
## $ Country : Factor w/ 194 levels "Afghanistan",..: 2 4 8 10 11 16 17 22 26 42 ...
## $ Region : Factor w/ 6 levels "Africa","Americas",..: 4 4 4 4 4 4 4 4 4 4 ...
## $ Population : int 3162 78 2969 8464 9309 9405 11060 3834 7278 4307 ...
## $ Under15 : num 21.3 15.2 20.3 14.5 22.2 ...
## $ Over60 : num 14.93 22.86 14.06 23.52 8.24 ...
## $ FertilityRate : num 1.75 NA 1.74 1.44 1.96 1.47 1.85 1.26 1.51 1.48 ...
## $ LifeExpectancy : int 74 82 71 81 71 71 80 76 74 77 ...
## $ ChildMortality : num 16.7 3.2 16.4 4 35.2 5.2 4.2 6.7 12.1 4.7 ...
## $ CellularSubscribers : num 96.4 75.5 103.6 154.8 108.8 ...
## $ LiteracyRate : num NA NA 99.6 NA NA NA NA 97.9 NA 98.8 ...
## $ GNI : num 8820 NA 6100 42050 8960 ...
## $ PrimarySchoolEnrollmentMale : num NA 78.4 NA NA 85.3 NA 98.9 86.5 99.3 94.8 ...
## $ PrimarySchoolEnrollmentFemale: num NA 79.4 NA NA 84.1 NA 99.2 88.4 99.7 97 ...
# Writing csv files
write.csv(WHO_Europe, "WHO_Europe.csv")
# Removing variables
rm(WHO_Europe)
# VIDEO 5
# Basic data analysis
mean(WHO$Under15)
## [1] 28.73242
sd(WHO$Under15)
## [1] 10.53457
summary(WHO$Under15)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 13.12 18.72 28.65 28.73 37.75 49.99
which.min(WHO$Under15)
## [1] 86
WHO$Country[86]
## [1] Japan
## 194 Levels: Afghanistan Albania Algeria Andorra ... Zimbabwe
which.max(WHO$Under15)
## [1] 124
WHO$Country[124]
## [1] Niger
## 194 Levels: Afghanistan Albania Algeria Andorra ... Zimbabwe
# Scatterplot
plot(WHO$GNI, WHO$FertilityRate)
# Subsetting
Outliers = subset(WHO, GNI > 10000 & FertilityRate > 2.5)
nrow(Outliers)
## [1] 7
Outliers[c("Country","GNI","FertilityRate")]
## Country GNI FertilityRate
## 23 Botswana 14550 2.71
## 56 Equatorial Guinea 25620 5.04
## 63 Gabon 13740 4.18
## 83 Israel 27110 2.92
## 88 Kazakhstan 11250 2.52
## 131 Panama 14510 2.52
## 150 Saudi Arabia 24700 2.76
# VIDEO 6
# Histograms
hist(WHO$CellularSubscribers)
# Boxplot
boxplot(WHO$LifeExpectancy ~ WHO$Region)
boxplot(WHO$LifeExpectancy ~ WHO$Region, xlab = "", ylab = "Life Expectancy", main = "Life Expectancy of Countries by Region")
# Summary Tables
table(WHO$Region)
##
## Africa Americas Eastern Mediterranean
## 46 35 22
## Europe South-East Asia Western Pacific
## 53 11 27
tapply(WHO$Over60, WHO$Region, mean)
## Africa Americas Eastern Mediterranean
## 5.220652 10.943714 5.620000
## Europe South-East Asia Western Pacific
## 19.774906 8.769091 10.162963
tapply(WHO$LiteracyRate, WHO$Region, min)
## Africa Americas Eastern Mediterranean
## NA NA NA
## Europe South-East Asia Western Pacific
## NA NA NA
tapply(WHO$LiteracyRate, WHO$Region, min, na.rm=TRUE)
## Africa Americas Eastern Mediterranean
## 31.1 75.2 63.9
## Europe South-East Asia Western Pacific
## 95.2 56.8 60.6
# Video 2 - Reading in the Dataset
# Get the current directory
getwd()
## [1] "J:/rstudio files/Rstudio files_Analytics Edge_MIT"
# Read the csv file
USDA<-read.csv("USDA.csv")
# Structure of the dataset
str(USDA)
## 'data.frame': 7058 obs. of 16 variables:
## $ ID : int 1001 1002 1003 1004 1005 1006 1007 1008 1009 1010 ...
## $ Description : Factor w/ 7054 levels "ABALONE,MIXED SPECIES,RAW",..: 1303 1302 1298 2303 2304 2305 2306 2307 2308 2309 ...
## $ Calories : int 717 717 876 353 371 334 300 376 403 387 ...
## $ Protein : num 0.85 0.85 0.28 21.4 23.24 ...
## $ TotalFat : num 81.1 81.1 99.5 28.7 29.7 ...
## $ Carbohydrate: num 0.06 0.06 0 2.34 2.79 0.45 0.46 3.06 1.28 4.78 ...
## $ Sodium : int 714 827 2 1395 560 629 842 690 621 700 ...
## $ SaturatedFat: num 51.4 50.5 61.9 18.7 18.8 ...
## $ Cholesterol : int 215 219 256 75 94 100 72 93 105 103 ...
## $ Sugar : num 0.06 0.06 0 0.5 0.51 0.45 0.46 NA 0.52 NA ...
## $ Calcium : int 24 24 4 528 674 184 388 673 721 643 ...
## $ Iron : num 0.02 0.16 0 0.31 0.43 0.5 0.33 0.64 0.68 0.21 ...
## $ Potassium : int 24 26 5 256 136 152 187 93 98 95 ...
## $ VitaminC : num 0 0 0 0 0 0 0 0 0 0 ...
## $ VitaminE : num 2.32 2.32 2.8 0.25 0.26 0.24 0.21 NA 0.29 NA ...
## $ VitaminD : num 1.5 1.5 1.8 0.5 0.5 0.5 0.4 NA 0.6 NA ...
# Statistical summary
summary(USDA)
## ID
## Min. : 1001
## 1st Qu.: 8387
## Median :13294
## Mean :14260
## 3rd Qu.:18337
## Max. :93600
##
## Description
## BEEF,CHUCK,UNDER BLADE CNTR STEAK,BNLESS,DENVER CUT,LN,0" FA: 2
## CAMPBELL,CAMPBELL'S SEL MICROWAVEABLE BOWLS,HEA : 2
## OIL,INDUSTRIAL,PALM KERNEL (HYDROGENATED),CONFECTION FAT : 2
## POPCORN,OIL-POPPED,LOFAT : 2
## ABALONE,MIXED SPECIES,RAW : 1
## ABALONE,MXD SP,CKD,FRIED : 1
## (Other) :7048
## Calories Protein TotalFat Carbohydrate
## Min. : 0.0 Min. : 0.00 Min. : 0.00 Min. : 0.00
## 1st Qu.: 85.0 1st Qu.: 2.29 1st Qu.: 0.72 1st Qu.: 0.00
## Median :181.0 Median : 8.20 Median : 4.37 Median : 7.13
## Mean :219.7 Mean :11.71 Mean : 10.32 Mean : 20.70
## 3rd Qu.:331.0 3rd Qu.:20.43 3rd Qu.: 12.70 3rd Qu.: 28.17
## Max. :902.0 Max. :88.32 Max. :100.00 Max. :100.00
## NA's :1 NA's :1 NA's :1 NA's :1
## Sodium SaturatedFat Cholesterol Sugar
## Min. : 0.0 Min. : 0.000 Min. : 0.00 Min. : 0.000
## 1st Qu.: 37.0 1st Qu.: 0.172 1st Qu.: 0.00 1st Qu.: 0.000
## Median : 79.0 Median : 1.256 Median : 3.00 Median : 1.395
## Mean : 322.1 Mean : 3.452 Mean : 41.55 Mean : 8.257
## 3rd Qu.: 386.0 3rd Qu.: 4.028 3rd Qu.: 69.00 3rd Qu.: 7.875
## Max. :38758.0 Max. :95.600 Max. :3100.00 Max. :99.800
## NA's :84 NA's :301 NA's :288 NA's :1910
## Calcium Iron Potassium VitaminC
## Min. : 0.00 Min. : 0.000 Min. : 0.0 Min. : 0.000
## 1st Qu.: 9.00 1st Qu.: 0.520 1st Qu.: 135.0 1st Qu.: 0.000
## Median : 19.00 Median : 1.330 Median : 250.0 Median : 0.000
## Mean : 73.53 Mean : 2.828 Mean : 301.4 Mean : 9.436
## 3rd Qu.: 56.00 3rd Qu.: 2.620 3rd Qu.: 348.0 3rd Qu.: 3.100
## Max. :7364.00 Max. :123.600 Max. :16500.0 Max. :2400.000
## NA's :136 NA's :123 NA's :409 NA's :332
## VitaminE VitaminD
## Min. : 0.000 Min. : 0.0000
## 1st Qu.: 0.120 1st Qu.: 0.0000
## Median : 0.270 Median : 0.0000
## Mean : 1.488 Mean : 0.5769
## 3rd Qu.: 0.710 3rd Qu.: 0.1000
## Max. :149.400 Max. :250.0000
## NA's :2720 NA's :2834
# Video 3 - Basic Data Analysis
# Vector notation
head(USDA$Sodium,40)
## [1] 714 827 2 1395 560 629 842 690 621 700 604 364 344 330
## [15] 330 406 321 965 1116 800 600 819 336 800 536 627 415 619
## [29] 652 628 334 1529 1602 534 876 84 125 1200 1809 192
# Finding the index of the food with highest sodium levels
which.max(USDA$Sodium)
## [1] 265
# Get names of variables in the dataset
names(USDA)
## [1] "ID" "Description" "Calories" "Protein"
## [5] "TotalFat" "Carbohydrate" "Sodium" "SaturatedFat"
## [9] "Cholesterol" "Sugar" "Calcium" "Iron"
## [13] "Potassium" "VitaminC" "VitaminE" "VitaminD"
# Get the name of the food with highest sodium levels
USDA$Description[265]
## [1] SALT,TABLE
## 7054 Levels: ABALONE,MIXED SPECIES,RAW ... ZWIEBACK
# Create a subset of the foods with sodium content above 10,000mg
HighSodium<-subset(USDA, Sodium>10000)
# Count the number of rows, or observations
nrow(HighSodium)
## [1] 10
# Output names of the foods with high sodium content
HighSodium$Description
## [1] SALT,TABLE
## [2] SOUP,BF BROTH OR BOUILLON,PDR,DRY
## [3] SOUP,BEEF BROTH,CUBED,DRY
## [4] SOUP,CHICK BROTH OR BOUILLON,DRY
## [5] SOUP,CHICK BROTH CUBES,DRY
## [6] GRAVY,AU JUS,DRY
## [7] ADOBO FRESCO
## [8] LEAVENING AGENTS,BAKING PDR,DOUBLE-ACTING,NA AL SULFATE
## [9] LEAVENING AGENTS,BAKING SODA
## [10] DESSERTS,RENNIN,TABLETS,UNSWTND
## 7054 Levels: ABALONE,MIXED SPECIES,RAW ... ZWIEBACK
# Finding the index of CAVIAR in the dataset
match("CAVIAR", USDA$Description)
## [1] 4154
# Find amount of sodium in caviar
USDA$Sodium[4154]
## [1] 1500
# Doing it in one command!
USDA$Sodium[match("CAVIAR", USDA$Description)]
## [1] 1500
# Summary function over Sodium vector
summary(USDA$Sodium)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.0 37.0 79.0 322.1 386.0 38760.0 84
# Standard deviation
sd(USDA$Sodium, na.rm = TRUE)
## [1] 1045.417
# Video 4 - Plots
# Scatter Plots
plot(USDA$Protein, USDA$TotalFat)
# Add xlabel, ylabel and title
plot(USDA$Protein, USDA$TotalFat, xlab="Protein", ylab = "Fat", main = "Protein vs Fat", col = "red")
# Creating a histogram
hist(USDA$VitaminC, xlab = "Vitamin C (mg)", main = "Histogram of Vitamin C")
# Add limits to x-axis (zooming in)
hist(USDA$VitaminC, xlab = "Vitamin C (mg)", main = "Histogram of Vitamin C", xlim = c(0,100))
# Specify breaks of histogram
hist(USDA$VitaminC, xlab = "Vitamin C (mg)", main = "Histogram of Vitamin C", xlim = c(0,100), breaks=100)
hist(USDA$VitaminC, xlab = "Vitamin C (mg)", main = "Histogram of Vitamin C", xlim = c(0,100), breaks=2000)
# Boxplots
boxplot(USDA$Sugar, ylab = "Sugar (g)", main = "Boxplot of Sugar")
# Video 5 - Adding a variable
# Creating a variable that takes value 1 if the food has higher sodium than average, 0 otherwise
HighSodium = as.numeric(USDA$Sodium > mean(USDA$Sodium, na.rm=TRUE))
str(HighSodium)
## num [1:7058] 1 1 0 1 1 1 1 1 1 1 ...
# Adding the variable to the dataset
USDA$HighSodium = as.numeric(USDA$Sodium > mean(USDA$Sodium, na.rm=TRUE))
str(USDA) # to see if the new variable has been added to USDA dataset
## 'data.frame': 7058 obs. of 17 variables:
## $ ID : int 1001 1002 1003 1004 1005 1006 1007 1008 1009 1010 ...
## $ Description : Factor w/ 7054 levels "ABALONE,MIXED SPECIES,RAW",..: 1303 1302 1298 2303 2304 2305 2306 2307 2308 2309 ...
## $ Calories : int 717 717 876 353 371 334 300 376 403 387 ...
## $ Protein : num 0.85 0.85 0.28 21.4 23.24 ...
## $ TotalFat : num 81.1 81.1 99.5 28.7 29.7 ...
## $ Carbohydrate: num 0.06 0.06 0 2.34 2.79 0.45 0.46 3.06 1.28 4.78 ...
## $ Sodium : int 714 827 2 1395 560 629 842 690 621 700 ...
## $ SaturatedFat: num 51.4 50.5 61.9 18.7 18.8 ...
## $ Cholesterol : int 215 219 256 75 94 100 72 93 105 103 ...
## $ Sugar : num 0.06 0.06 0 0.5 0.51 0.45 0.46 NA 0.52 NA ...
## $ Calcium : int 24 24 4 528 674 184 388 673 721 643 ...
## $ Iron : num 0.02 0.16 0 0.31 0.43 0.5 0.33 0.64 0.68 0.21 ...
## $ Potassium : int 24 26 5 256 136 152 187 93 98 95 ...
## $ VitaminC : num 0 0 0 0 0 0 0 0 0 0 ...
## $ VitaminE : num 2.32 2.32 2.8 0.25 0.26 0.24 0.21 NA 0.29 NA ...
## $ VitaminD : num 1.5 1.5 1.8 0.5 0.5 0.5 0.4 NA 0.6 NA ...
## $ HighSodium : num 1 1 0 1 1 1 1 1 1 1 ...
# Similarly for HighProtein, HigCarbs, HighFat
USDA$HighCarbs = as.numeric(USDA$Carbohydrate > mean(USDA$Carbohydrate, na.rm=TRUE))
USDA$HighProtein = as.numeric(USDA$Protein > mean(USDA$Protein, na.rm=TRUE))
USDA$HighFat = as.numeric(USDA$TotalFat > mean(USDA$TotalFat, na.rm=TRUE))
str(USDA) # to see if the new variable has been added to USDA dataset
## 'data.frame': 7058 obs. of 20 variables:
## $ ID : int 1001 1002 1003 1004 1005 1006 1007 1008 1009 1010 ...
## $ Description : Factor w/ 7054 levels "ABALONE,MIXED SPECIES,RAW",..: 1303 1302 1298 2303 2304 2305 2306 2307 2308 2309 ...
## $ Calories : int 717 717 876 353 371 334 300 376 403 387 ...
## $ Protein : num 0.85 0.85 0.28 21.4 23.24 ...
## $ TotalFat : num 81.1 81.1 99.5 28.7 29.7 ...
## $ Carbohydrate: num 0.06 0.06 0 2.34 2.79 0.45 0.46 3.06 1.28 4.78 ...
## $ Sodium : int 714 827 2 1395 560 629 842 690 621 700 ...
## $ SaturatedFat: num 51.4 50.5 61.9 18.7 18.8 ...
## $ Cholesterol : int 215 219 256 75 94 100 72 93 105 103 ...
## $ Sugar : num 0.06 0.06 0 0.5 0.51 0.45 0.46 NA 0.52 NA ...
## $ Calcium : int 24 24 4 528 674 184 388 673 721 643 ...
## $ Iron : num 0.02 0.16 0 0.31 0.43 0.5 0.33 0.64 0.68 0.21 ...
## $ Potassium : int 24 26 5 256 136 152 187 93 98 95 ...
## $ VitaminC : num 0 0 0 0 0 0 0 0 0 0 ...
## $ VitaminE : num 2.32 2.32 2.8 0.25 0.26 0.24 0.21 NA 0.29 NA ...
## $ VitaminD : num 1.5 1.5 1.8 0.5 0.5 0.5 0.4 NA 0.6 NA ...
## $ HighSodium : num 1 1 0 1 1 1 1 1 1 1 ...
## $ HighCarbs : num 0 0 0 0 0 0 0 0 0 0 ...
## $ HighProtein : num 0 0 0 1 1 1 1 1 1 1 ...
## $ HighFat : num 1 1 1 1 1 1 1 1 1 1 ...
# Video 6 - Summary Tables
# How many foods have higher sodium level than average?
table(USDA$HighSodium)
##
## 0 1
## 4884 2090
# How many foods have both high sodium and high fat?
table(USDA$HighSodium, USDA$HighFat)
##
## 0 1
## 0 3529 1355
## 1 1378 712
# Average amount of iron sorted by high and low protein?
tapply(USDA$Iron, USDA$HighProtein, mean, na.rm=TRUE)
## 0 1
## 2.558945 3.197294
# Maximum level of Vitamin C in foods with high and low carbs?
tapply(USDA$VitaminC, USDA$HighCarbs, max, na.rm=TRUE)
## 0 1
## 1677.6 2400.0
# Using summary function with tapply
tapply(USDA$VitaminC, USDA$HighCarbs, summary, na.rm=TRUE)
## $`0`
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000 0.000 0.000 6.364 2.800 1678.000 248
##
## $`1`
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 0.00 0.20 16.31 4.50 2400.00 83
AN ANALYTICAL DETECTIVE
Crime is an international concern, but it is documented and handled in very different ways in different countries. In the United States, violent crimes and property crimes are recorded by the Federal Bureau of Investigation (FBI). Additionally, each city documents crime, and some cities release data regarding crime rates. The city of Chicago, Illinois releases crime data from 2001 onward online.
Chicago is the third most populous city in the United States, with a population of over 2.7 million people. The city of Chicago is shown in the map below, with the state of Illinois highlighted in red. There are two main types of crimes: violent crimes, and property crimes. In this problem, we’ll focus on one specific type of property crime, called “motor vehicle theft” (sometimes referred to as grand theft auto). This is the act of stealing, or attempting to steal, a car. In this problem, we’ll use some basic data analysis in R to understand the motor vehicle thefts in Chicago.
#PROBLEM 1.1 - LOADING THE DATA
#lets import the dataset
mvt<-read.csv("mvtWeek1.csv")
str(mvt)
## 'data.frame': 191641 obs. of 11 variables:
## $ ID : int 8951354 8951141 8952745 8952223 8951608 8950793 8950760 8951611 8951802 8950706 ...
## $ Date : Factor w/ 131680 levels "1/1/01 0:01",..: 42824 42823 42823 42823 42822 42821 42820 42819 42817 42816 ...
## $ LocationDescription: Factor w/ 78 levels "ABANDONED BUILDING",..: 72 72 62 72 72 72 72 72 72 72 ...
## $ Arrest : logi FALSE FALSE FALSE FALSE FALSE TRUE ...
## $ Domestic : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ Beat : int 623 1213 1622 724 211 2521 423 231 1021 1215 ...
## $ District : int 6 12 16 7 2 25 4 2 10 12 ...
## $ CommunityArea : int 69 24 11 67 35 19 48 40 29 24 ...
## $ Year : int 2012 2012 2012 2012 2012 2012 2012 2012 2012 2012 ...
## $ Latitude : num 41.8 41.9 42 41.8 41.8 ...
## $ Longitude : num -87.6 -87.7 -87.8 -87.7 -87.6 ...
summary(mvt)
## ID Date
## Min. :1310022 5/16/08 0:00 : 11
## 1st Qu.:2832144 10/17/01 22:00: 10
## Median :4762956 4/13/04 21:00 : 10
## Mean :4968629 9/17/05 22:00 : 10
## 3rd Qu.:7201878 10/12/01 22:00: 9
## Max. :9181151 10/13/01 22:00: 9
## (Other) :191582
## LocationDescription Arrest Domestic
## STREET :156564 Mode :logical Mode :logical
## PARKING LOT/GARAGE(NON.RESID.): 14852 FALSE:176105 FALSE:191226
## OTHER : 4573 TRUE :15536 TRUE :415
## ALLEY : 2308 NA's :0 NA's :0
## GAS STATION : 2111
## DRIVEWAY - RESIDENTIAL : 1675
## (Other) : 9558
## Beat District CommunityArea Year
## Min. : 111 Min. : 1.00 Min. : 0 Min. :2001
## 1st Qu.: 722 1st Qu.: 6.00 1st Qu.:22 1st Qu.:2003
## Median :1121 Median :10.00 Median :32 Median :2006
## Mean :1259 Mean :11.82 Mean :38 Mean :2006
## 3rd Qu.:1733 3rd Qu.:17.00 3rd Qu.:60 3rd Qu.:2009
## Max. :2535 Max. :31.00 Max. :77 Max. :2012
## NA's :43056 NA's :24616
## Latitude Longitude
## Min. :41.64 Min. :-87.93
## 1st Qu.:41.77 1st Qu.:-87.72
## Median :41.85 Median :-87.68
## Mean :41.84 Mean :-87.68
## 3rd Qu.:41.92 3rd Qu.:-87.64
## Max. :42.02 Max. :-87.52
## NA's :2276 NA's :2276
#How many rows of data (observations) are in this dataset?
nrow(mvt)
## [1] 191641
#Ans:191641 #If you type str(mvt) in the R console, the first row of output says that this is a data frame with 191,641 observations.
#How many variables are in this dataset?
ncol(mvt)
## [1] 11
#Ans:11 #If you type str(mvt) in the R console, the first row of output says that this is a data frame with 11 variables.
#Using the "max" function, what is the maximum value of the variable "ID"?
max(mvt$ID)
## [1] 9181151
#Ans:9181151
#What is the minimum value of the variable "Beat"?
min(mvt$Beat) # using summary(mvt$Beat)
## [1] 111
#Ans:111
#How many observations have value TRUE in the Arrest variable (this is the number of crimes for which an arrest was made)?
table(mvt$Arrest) # or
##
## FALSE TRUE
## 176105 15536
summary(mvt$Arrest)
## Mode FALSE TRUE NA's
## logical 176105 15536 0
#better summary function
library(Hmisc)
## Loading required package: survival
## Loading required package: Formula
##
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:dplyr':
##
## combine, src, summarize
## The following objects are masked from 'package:base':
##
## format.pval, round.POSIXt, trunc.POSIXt, units
describe(mvt$Arrest)
## mvt$Arrest
## n missing unique
## 191641 0 2
##
## FALSE (176105, 92%), TRUE (15536, 8%)
#Ans:15536
#How many observations have a LocationDescription value of ALLEY?
table(mvt$LocationDescription=="ALLEY") # or summary(mvt)
##
## FALSE TRUE
## 189333 2308
#Ans:2308
#######################################################
#PROBLEM 2.1 - UNDERSTANDING DATES IN R
#In many datasets, like this one, you have a date field. Unfortunately, R does not automatically recognize entries that look like dates. We need to use a function in R to extract the date and time. Take a look at the first entry of Date (remember to use square brackets when looking at a certain entry of a variable).
#In what format are the entries in the variable Date?
mvt$Date[1]
## [1] 12/31/12 23:15
## 131680 Levels: 1/1/01 0:01 1/1/01 0:05 1/1/01 0:30 1/1/01 1:17 ... 9/9/12 9:50
#Ans:Month/Day/Year Hour:Minute (12/31/12 23:15)
#Now, let's convert these characters into a Date object in R. In your R console, type
DateConvert<-as.Date(strptime(mvt$Date, "%m/%d/%y %H:%M"))
#This converts the variable "Date" into a Date object in R. Take a look at the variable DateConvert using the summary function.
summary(DateConvert)
## Min. 1st Qu. Median Mean 3rd Qu.
## "2001-01-01" "2003-07-10" "2006-05-21" "2006-08-23" "2009-10-24"
## Max.
## "2012-12-31"
#What is the month and year of the median date in our dataset? Enter your answer as "Month Year", without the quotes. (Ex: if the answer was 2008-03-28, you would give the answer "March 2008", without the quotes.)
#Ans: May 2006
#Now, let's extract the month and the day of the week, and add these variables to our data frame mvt. We can do this with two simple functions. Type the following commands in R:
mvt$Month<-months(DateConvert)
mvt$Weekday<-weekdays(DateConvert)
#This creates two new variables in our data frame, Month and Weekday, and sets them equal to the month and weekday values that we can extract from the Date object. Lastly, replace the old Date variable with DateConvert by typing:
mvt$Date<-DateConvert
str(mvt) # see the new var added to the mvt dataset
## 'data.frame': 191641 obs. of 13 variables:
## $ ID : int 8951354 8951141 8952745 8952223 8951608 8950793 8950760 8951611 8951802 8950706 ...
## $ Date : Date, format: "2012-12-31" "2012-12-31" ...
## $ LocationDescription: Factor w/ 78 levels "ABANDONED BUILDING",..: 72 72 62 72 72 72 72 72 72 72 ...
## $ Arrest : logi FALSE FALSE FALSE FALSE FALSE TRUE ...
## $ Domestic : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ Beat : int 623 1213 1622 724 211 2521 423 231 1021 1215 ...
## $ District : int 6 12 16 7 2 25 4 2 10 12 ...
## $ CommunityArea : int 69 24 11 67 35 19 48 40 29 24 ...
## $ Year : int 2012 2012 2012 2012 2012 2012 2012 2012 2012 2012 ...
## $ Latitude : num 41.8 41.9 42 41.8 41.8 ...
## $ Longitude : num -87.6 -87.7 -87.8 -87.7 -87.6 ...
## $ Month : chr "December" "December" "December" "December" ...
## $ Weekday : chr "Monday" "Monday" "Monday" "Monday" ...
#Using the table command, answer the following questions.
#In which month did the fewest motor vehicle thefts occur?
min(table(mvt$Month))
## [1] 13511
table(mvt$Month)
##
## April August December February January July June
## 15280 16572 16426 13511 16047 16801 16002
## March May November October September
## 15758 16035 16063 17086 16060
#Ans:February
#On which weekday did the most motor vehicle thefts occur?
max(table(mvt$Weekday))
## [1] 29284
table(mvt$Weekday)
##
## Friday Monday Saturday Sunday Thursday Tuesday Wednesday
## 29284 27397 27118 26316 27319 26791 27416
#Ans:Friday
#Each observation in the dataset represents a motor vehicle theft, and the Arrest variable indicates whether an arrest was later made for this theft. Which month has the largest number of motor vehicle thefts for which an arrest was made?
max(table(mvt$Month[mvt$Arrest==TRUE]))
## [1] 1435
table(mvt$Month[mvt$Arrest==TRUE]) # or
##
## April August December February January July June
## 1252 1329 1397 1238 1435 1324 1230
## March May November October September
## 1298 1187 1256 1342 1248
table(mvt$Arrest,mvt$Month)
##
## April August December February January July June March May
## FALSE 14028 15243 15029 12273 14612 15477 14772 14460 14848
## TRUE 1252 1329 1397 1238 1435 1324 1230 1298 1187
##
## November October September
## FALSE 14807 15744 14812
## TRUE 1256 1342 1248
#########################################################
#PROBLEM 3.1 - VISUALIZING CRIME TRENDS
#Now, let's make some plots to help us better understand how crime has changed over time in Chicago.
#First, let's make a histogram of the variable Date. We'll add an extra argument, to specify the number of bars we want in our histogram. In your R console, type
hist(mvt$Date, breaks=100)
#In general, does it look like crime increases or decreases from 2002 - 2012?
#Ans:decreases
#While there is not a clear trend, it looks like crime generally decreases.
#In general, does it look like crime increases or decreases from 2005 - 2008?
#Ans:Decreases
#In this time period, there is a clear downward trend in crime.
#In general, does it look like crime increases or decreases from 2009 - 2011?
#Ans:Increases
#In this time period, there is a clear upward trend in crime.
#Now, let's see how arrests have changed over time. Create a boxplot of the variable "Date", sorted by the variable "Arrest".
#Does it look like there were more crimes for which arrests were made in the first half of the time period or the second half of the time period? (Note that the time period is from 2001 to 2012, so the middle of the time period is the beginning of 2007.)
boxplot(mvt$Date~mvt$Arrest,xlab="Arrests",ylab="Date in Year")
#Ans:First half
#If you look at the boxplot, the one for Arrest=TRUE is definitely skewed towards the bottom of the plot, meaning that there were more crimes for which arrests were made in the first half of the time period.
#For what proportion of motor vehicle thefts in 2001 was an arrest made?
prop.table(table(mvt$Arrest[mvt$Year=="2001"]))
##
## FALSE TRUE
## 0.8958827 0.1041173
#Ans:0.1041173
#EXPLANATION:If you create a table using the command table(mvt$Arrest, mvt$Year), the column for 2001 has 2152 observations with Arrest=TRUE and 18517 observations with Arrest=FALSE. The fraction of motor vehicle thefts in 2001 for which an arrest was made is thus 2152/(2152+18517) = 0.1041173.
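#Alternatively, following the explanation above, you can cross-tabulate Arrest against Year and compute the proportion by hand (a minimal sketch; the 2001 counts come from that table):
table(mvt$Arrest, mvt$Year)   # the 2001 column has FALSE = 18517, TRUE = 2152
2152/(2152 + 18517)           # 0.1041173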
#For what proportion of motor vehicle thefts in 2007 was an arrest made?
prop.table(table(mvt$Arrest[mvt$Year=="2007"]))
##
## FALSE TRUE
## 0.91512605 0.08487395
#Ans:0.08487395
#EXPLANATION:If you create a table using the command table(mvt$Arrest, mvt$Year), the column for 2007 has 1212 observations with Arrest=TRUE and 13068 observations with Arrest=FALSE. The fraction of motor vehicle thefts in 2007 for which an arrest was made is thus 1212/(1212+13068) = 0.08487395.
#For what proportion of motor vehicle thefts in 2012 was an arrest made?
prop.table(table(mvt$Arrest[mvt$Year=="2012"]))
##
## FALSE TRUE
## 0.96097076 0.03902924
#Ans:0.03902924
#EXPLANATION:If you create a table using the command table(mvt$Arrest, mvt$Year), the column for 2012 has 550 observations with Arrest=TRUE and 13542 observations with Arrest=FALSE. The fraction of motor vehicle thefts in 2012 for which an arrest was made is thus 550/(550+13542) = 0.03902924.
#########################################
#PROBLEM 4.1 - POPULAR LOCATIONS
#Analyzing this data could be useful to the Chicago Police Department when deciding where to allocate resources. If they want to increase the number of arrests that are made for motor vehicle thefts, where should they focus their efforts? We want to find the top five locations where motor vehicle thefts occur. If you create a table of the LocationDescription variable, it is unfortunately very hard to read since there are 78 different locations in the data set. By using the sort function, we can view this same table, but sorted by the number of observations in each category. In your R console, type:
sort(table(mvt$LocationDescription))
##
## AIRPORT BUILDING NON-TERMINAL - SECURE AREA
## 1
## AIRPORT EXTERIOR - SECURE AREA
## 1
## ANIMAL HOSPITAL
## 1
## APPLIANCE STORE
## 1
## CTA TRAIN
## 1
## JAIL / LOCK-UP FACILITY
## 1
## NEWSSTAND
## 1
## BRIDGE
## 2
## COLLEGE/UNIVERSITY RESIDENCE HALL
## 2
## CURRENCY EXCHANGE
## 2
## BOWLING ALLEY
## 3
## CLEANING STORE
## 3
## MEDICAL/DENTAL OFFICE
## 3
## ABANDONED BUILDING
## 4
## AIRPORT BUILDING NON-TERMINAL - NON-SECURE AREA
## 4
## BARBERSHOP
## 4
## LAKEFRONT/WATERFRONT/RIVERBANK
## 4
## LIBRARY
## 4
## SAVINGS AND LOAN
## 4
## AIRPORT TERMINAL UPPER LEVEL - NON-SECURE AREA
## 5
## CHA APARTMENT
## 5
## DAY CARE CENTER
## 5
## FIRE STATION
## 5
## FOREST PRESERVE
## 6
## BANK
## 7
## CONVENIENCE STORE
## 7
## DRUG STORE
## 8
## OTHER COMMERCIAL TRANSPORTATION
## 8
## ATHLETIC CLUB
## 9
## AIRPORT VENDING ESTABLISHMENT
## 10
## AIRPORT PARKING LOT
## 11
## SCHOOL, PRIVATE, BUILDING
## 14
## TAVERN/LIQUOR STORE
## 14
## FACTORY/MANUFACTURING BUILDING
## 16
## BAR OR TAVERN
## 17
## WAREHOUSE
## 17
## MOVIE HOUSE/THEATER
## 18
## RESIDENCE PORCH/HALLWAY
## 18
## NURSING HOME/RETIREMENT HOME
## 21
## TAXICAB
## 21
## DEPARTMENT STORE
## 22
## HIGHWAY/EXPRESSWAY
## 22
## SCHOOL, PRIVATE, GROUNDS
## 23
## VEHICLE-COMMERCIAL
## 23
## AIRPORT EXTERIOR - NON-SECURE AREA
## 24
## OTHER RAILROAD PROP / TRAIN DEPOT
## 28
## SMALL RETAIL STORE
## 33
## CONSTRUCTION SITE
## 35
## CAR WASH
## 44
## COLLEGE/UNIVERSITY GROUNDS
## 47
## GOVERNMENT BUILDING/PROPERTY
## 48
## RESTAURANT
## 49
## CHURCH/SYNAGOGUE/PLACE OF WORSHIP
## 56
## GROCERY FOOD STORE
## 80
## HOSPITAL BUILDING/GROUNDS
## 101
## SCHOOL, PUBLIC, BUILDING
## 114
## HOTEL/MOTEL
## 124
## COMMERCIAL / BUSINESS OFFICE
## 126
## CTA GARAGE / OTHER PROPERTY
## 148
## SPORTS ARENA/STADIUM
## 166
## APARTMENT
## 184
## SCHOOL, PUBLIC, GROUNDS
## 206
## PARK PROPERTY
## 255
## POLICE FACILITY/VEH PARKING LOT
## 266
## AIRPORT/AIRCRAFT
## 363
## CHA PARKING LOT/GROUNDS
## 405
## SIDEWALK
## 462
## VEHICLE NON-COMMERCIAL
## 817
## VACANT LOT/LAND
## 985
## RESIDENCE-GARAGE
## 1176
## RESIDENCE
## 1302
## RESIDENTIAL YARD (FRONT/BACK)
## 1536
## DRIVEWAY - RESIDENTIAL
## 1675
## GAS STATION
## 2111
## ALLEY
## 2308
## OTHER
## 4573
## PARKING LOT/GARAGE(NON.RESID.)
## 14852
## STREET
## 156564
#Which locations are the top five locations for motor vehicle thefts, excluding the "Other" category? You should select 5 of the following options.
#Ans:Street, Parking Lot/Garage (Non. Resid.), Alley, Gas Station, and Driveway - Residential
#Create a subset of your data, only taking observations for which the theft happened in one of these five locations, and call this new data set "Top5". To do this, you can use the | symbol. In lecture, we used the & symbol to use two criteria to make a subset of the data. To only take observations that have a certain value in one variable or the other, the | character can be used in place of the & symbol. This is also called a logical "or" operation.
#How many observations are in Top5?
Top5<-mvt[mvt$LocationDescription %in% c("STREET","ALLEY","PARKING LOT/GARAGE(NON.RESID.)","GAS STATION","DRIVEWAY - RESIDENTIAL"),]
#or
TopLocations<-c("STREET", "PARKING LOT/GARAGE(NON.RESID.)", "ALLEY", "GAS STATION", "DRIVEWAY - RESIDENTIAL")
Top5<-subset(mvt, LocationDescription %in% TopLocations)
#or
Top5<-subset(mvt, LocationDescription=="STREET" | LocationDescription=="PARKING LOT/GARAGE(NON.RESID.)" | LocationDescription=="ALLEY" | LocationDescription=="GAS STATION" | LocationDescription=="DRIVEWAY - RESIDENTIAL")
str(Top5)
## 'data.frame': 177510 obs. of 13 variables:
## $ ID : int 8951354 8951141 8952223 8951608 8950793 8950760 8951611 8951802 8950706 8951585 ...
## $ Date : Date, format: "2012-12-31" "2012-12-31" ...
## $ LocationDescription: Factor w/ 78 levels "ABANDONED BUILDING",..: 72 72 72 72 72 72 72 72 72 72 ...
## $ Arrest : logi FALSE FALSE FALSE FALSE TRUE FALSE ...
## $ Domestic : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ Beat : int 623 1213 724 211 2521 423 231 1021 1215 1011 ...
## $ District : int 6 12 7 2 25 4 2 10 12 10 ...
## $ CommunityArea : int 69 24 67 35 19 48 40 29 24 29 ...
## $ Year : int 2012 2012 2012 2012 2012 2012 2012 2012 2012 2012 ...
## $ Latitude : num 41.8 41.9 41.8 41.8 41.9 ...
## $ Longitude : num -87.6 -87.7 -87.7 -87.6 -87.8 ...
## $ Month : chr "December" "December" "December" "December" ...
## $ Weekday : chr "Monday" "Monday" "Monday" "Monday" ...
margin.table(sort(table(Top5$LocationDescription)))
## [1] 177510
#Ans:177510
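#A simpler, equivalent way to count the observations (a minimal sketch) is nrow, as used earlier:
nrow(Top5)   # 177510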
#####################################################
#R will remember the other categories of the LocationDescription variable from the original dataset, so running table(Top5$LocationDescription) will have a lot of unnecessary output. To make our tables a bit nicer to read, we can refresh this factor variable. In your R console, type:
Top5$LocationDescription<-factor(Top5$LocationDescription)
#If you run the str or table function on Top5 now, you should see that LocationDescription now only has 5 values, as we expect.
str(Top5)
## 'data.frame': 177510 obs. of 13 variables:
## $ ID : int 8951354 8951141 8952223 8951608 8950793 8950760 8951611 8951802 8950706 8951585 ...
## $ Date : Date, format: "2012-12-31" "2012-12-31" ...
## $ LocationDescription: Factor w/ 5 levels "ALLEY","DRIVEWAY - RESIDENTIAL",..: 5 5 5 5 5 5 5 5 5 5 ...
## $ Arrest : logi FALSE FALSE FALSE FALSE TRUE FALSE ...
## $ Domestic : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ Beat : int 623 1213 724 211 2521 423 231 1021 1215 1011 ...
## $ District : int 6 12 7 2 25 4 2 10 12 10 ...
## $ CommunityArea : int 69 24 67 35 19 48 40 29 24 29 ...
## $ Year : int 2012 2012 2012 2012 2012 2012 2012 2012 2012 2012 ...
## $ Latitude : num 41.8 41.9 41.8 41.8 41.9 ...
## $ Longitude : num -87.6 -87.7 -87.7 -87.6 -87.8 ...
## $ Month : chr "December" "December" "December" "December" ...
## $ Weekday : chr "Monday" "Monday" "Monday" "Monday" ...
table(Top5$LocationDescription)
##
## ALLEY DRIVEWAY - RESIDENTIAL
## 2308 1675
## GAS STATION PARKING LOT/GARAGE(NON.RESID.)
## 2111 14852
## STREET
## 156564
#Use the Top5 data frame to answer the remaining questions.
#One of the locations has a much higher arrest rate than the other locations. Which is it? Please enter the text in exactly the same way as how it looks in the answer options for Problem 4.1.
table(Top5$LocationDescription, Top5$Arrest)
##
## FALSE TRUE
## ALLEY 2059 249
## DRIVEWAY - RESIDENTIAL 1543 132
## GAS STATION 1672 439
## PARKING LOT/GARAGE(NON.RESID.) 13249 1603
## STREET 144969 11595
prop.table(table(Top5$LocationDescription, Top5$Arrest),1)#calculation prop. row-wise
##
## FALSE TRUE
## ALLEY 0.89211438 0.10788562
## DRIVEWAY - RESIDENTIAL 0.92119403 0.07880597
## GAS STATION 0.79204169 0.20795831
## PARKING LOT/GARAGE(NON.RESID.) 0.89206841 0.10793159
## STREET 0.92594083 0.07405917
#Ans:Gas Station
#EXPLANATION:If you create a table of LocationDescription compared to Arrest, table(Top5$LocationDescription, Top5$Arrest), you can then compute the fraction of motor vehicle thefts that resulted in arrests at each location. Gas Station has by far the highest percentage of arrests, with over 20% of motor vehicle thefts resulting in an arrest.
#On which day of the week do the most motor vehicle thefts at gas stations happen?
max(table(Top5$Weekday[Top5$LocationDescription=="GAS STATION"])) # to find the max value
## [1] 338
table(Top5$Weekday[Top5$LocationDescription=="GAS STATION"])
##
## Friday Monday Saturday Sunday Thursday Tuesday Wednesday
## 332 280 338 336 282 270 273
#Ans:Saturday
#EXPLANATION:This can be read from table(Top5$LocationDescription, Top5$Weekday).
table(Top5$LocationDescription,Top5$Weekday)
##
## Friday Monday Saturday Sunday Thursday
## ALLEY 385 320 341 307 315
## DRIVEWAY - RESIDENTIAL 257 255 202 221 263
## GAS STATION 332 280 338 336 282
## PARKING LOT/GARAGE(NON.RESID.) 2331 2128 2199 1936 2082
## STREET 23773 22305 22175 21756 22296
##
## Tuesday Wednesday
## ALLEY 323 317
## DRIVEWAY - RESIDENTIAL 243 234
## GAS STATION 270 273
## PARKING LOT/GARAGE(NON.RESID.) 2073 2103
## STREET 21888 22371
#On which day of the week do the fewest motor vehicle thefts in residential driveways happen?
min(table(Top5$Weekday[Top5$LocationDescription=="DRIVEWAY - RESIDENTIAL"])) # to find the min value
## [1] 202
table(Top5$Weekday[Top5$LocationDescription=="DRIVEWAY - RESIDENTIAL"])
##
## Friday Monday Saturday Sunday Thursday Tuesday Wednesday
## 257 255 202 221 263 243 234
#Ans:Saturday
A stock market is where buyers and sellers trade shares of a company, and is one of the most popular ways for individuals and companies to invest money. The size of the world stock market is now estimated to be in the trillions. The largest stock market in the world is the New York Stock Exchange (NYSE), located in New York City. About 2,800 companies are listed on the NYSE. In this problem, we’ll look at the monthly stock prices of five of these companies: IBM, General Electric (GE), Procter and Gamble, Coca Cola, and Boeing. The data used in this problem comes from Infochimps.
In this problem, we’ll take a look at how the stock dynamics of these companies have changed over time.
#Lets import the datasets
IBM<-read.csv("IBMStock.csv")
GE<-read.csv("GEStock.csv")
ProcterGamble<-read.csv("ProcterGambleStock.csv")
CocaCola<-read.csv("CocaColaStock.csv")
Boeing<-read.csv("BoeingStock.csv")
#PROBLEM 1.1 - SUMMARY STATISTICS
#Before working with these data sets, we need to convert the dates into a format that R can understand. Take a look at the structure of one of the datasets using the str function. Right now, the date variable is stored as a factor.
str(IBM)
## 'data.frame': 480 obs. of 2 variables:
## $ Date : Factor w/ 480 levels "1/1/00","1/1/01",..: 11 171 211 251 291 331 371 411 451 51 ...
## $ StockPrice: num 360 347 327 320 270 ...
head(IBM)
## Date StockPrice
## 1 1/1/70 360.3190
## 2 2/1/70 346.7237
## 3 3/1/70 327.3457
## 4 4/1/70 319.8527
## 5 5/1/70 270.3752
## 6 6/1/70 267.2050
IBM$Date[1]
## [1] 1/1/70
## 480 Levels: 1/1/00 1/1/01 1/1/02 1/1/03 1/1/04 1/1/05 1/1/06 ... 9/1/99
#We can convert this to a "Date" object in R by using the following five commands (one for each data set):
IBM$Date<-as.Date(IBM$Date, "%m/%d/%y")
GE$Date<-as.Date(GE$Date, "%m/%d/%y")
CocaCola$Date<-as.Date(CocaCola$Date, "%m/%d/%y")
ProcterGamble$Date<-as.Date(ProcterGamble$Date, "%m/%d/%y")
Boeing$Date<-as.Date(Boeing$Date, "%m/%d/%y")
#Our five datasets all have the same number of observations. How many observations are there in each data set?
str(GE)
## 'data.frame': 480 obs. of 2 variables:
## $ Date : Date, format: "1970-01-01" "1970-02-01" ...
## $ StockPrice: num 74.3 70 72.2 74.3 66.7 ...
#Ans:480
#EXPLANATION:Using the str function, we can see that each data set has 480 observations. We have monthly data for 40 years, so there are 12*40 = 480 observations.
#What is the earliest year in our datasets?
summary(IBM) # or head(sort(IBM$Date))
## Date StockPrice
## Min. :1970-01-01 Min. : 43.40
## 1st Qu.:1979-12-24 1st Qu.: 88.34
## Median :1989-12-16 Median :112.11
## Mean :1989-12-15 Mean :144.38
## 3rd Qu.:1999-12-08 3rd Qu.:165.41
## Max. :2009-12-01 Max. :438.90
#Ans:1970
#What is the latest year in our datasets?
summary(GE)
## Date StockPrice
## Min. :1970-01-01 Min. : 9.294
## 1st Qu.:1979-12-24 1st Qu.: 44.214
## Median :1989-12-16 Median : 55.812
## Mean :1989-12-15 Mean : 59.303
## 3rd Qu.:1999-12-08 3rd Qu.: 72.226
## Max. :2009-12-01 Max. :156.844
#Ans:2009
#What is the mean stock price of IBM over this time period?
mean(IBM$StockPrice) # or summary(IBM)
## [1] 144.375
#Ans:144.375
#What is the minimum stock price of General Electric (GE) over this time period?
min(GE$StockPrice) #or summary(GE)
## [1] 9.293636
#Ans:9.293636
#What is the maximum stock price of Coca-Cola over this time period?
max(CocaCola$StockPrice) #or summary(CocaCola)
## [1] 146.5843
#Ans:146.5843
#What is the median stock price of Boeing over this time period?
median((Boeing$StockPrice)) #or summary(Boeing)
## [1] 44.8834
#Ans:44.8834
#What is the standard deviation of the stock price of Procter & Gamble over this time period?
sd(ProcterGamble$StockPrice)
## [1] 18.19414
#Ans:18.19414
####################################################
#PROBLEM 2.1 - VISUALIZING STOCK DYNAMICS
#Let's plot the stock prices to see if we can visualize trends in stock prices during this time period. Using the plot function, plot the Date on the x-axis and the StockPrice on the y-axis, for Coca-Cola.
plot(CocaCola$Date,CocaCola$StockPrice)
#This plots our observations as points, but we would really like to see a line instead, since this is a continuous time period. To do this, add the argument type="l" to your plot command, and re-generate the plot (the character in quotes is the letter l, for line). You should now see a line plot of the Coca-Cola stock price.
plot(CocaCola$Date,CocaCola$StockPrice,type="l")
#Around what year did Coca-Cola have its highest stock price in this time period?
CocaCola$Date[which.max(CocaCola$StockPrice)]
## [1] "1973-01-01"
#Ans:1973
#Around what year did Coca-Cola have its lowest stock price in this time period?
CocaCola$Date[which.min(CocaCola$StockPrice)]
## [1] "1980-03-01"
#Ans:1980
#Now, let's add the line for Procter & Gamble too. You can add a line to a plot in R by using the lines function instead of the plot function. Keeping the plot for Coca-Cola open, type in your R console:
lines(ProcterGamble$Date, ProcterGamble$StockPrice)
#Unfortunately, it's hard to tell which line is which. Let's fix this by giving each line a color. First, re-run the plot command for Coca-Cola, but add the argument col="red".
plot(CocaCola$Date,CocaCola$StockPrice,type="l",col="red")
#You should see the plot for Coca-Cola show up again, but this time in red. Now, let's add the Procter & Gamble line (using the lines function like we did before), adding the argument col="blue". You should now see in your plot the Coca-Cola stock price in red, and the Procter & Gamble stock price in blue.
lines(ProcterGamble$Date, ProcterGamble$StockPrice,col="blue")
#As an alternative choice to changing the colors, you could instead change the line type of the Procter & Gamble line by adding the argument lty=2. This will make the Procter & Gamble line dashed.
plot(CocaCola$Date,CocaCola$StockPrice,type="l",col="red")
lines(ProcterGamble$Date, ProcterGamble$StockPrice,lty=2)
abline(v=as.Date(c("2000-03-01")), lwd=2)
#In March of 2000, the technology bubble burst, and a stock market crash occurred. According to this plot, which company's stock dropped more?
#Ans:Procter and Gamble
#Around 1983, the stock for one of these companies (Coca-Cola or Procter and Gamble) was going up, while the other was going down. Which one was going up?
abline(v=as.Date(c("1983-03-01")), lwd=2)
#Ans:Coca-Cola
#In the time period shown in the plot, which stock generally has lower values?
abline(h=mean(CocaCola$StockPrice),col="green")
abline(h=mean(ProcterGamble$StockPrice),col="blue")
#Ans:Coca-Cola
##############################################
#PROBLEM 3.1 - VISUALIZING STOCK DYNAMICS 1995-2005
#Let's take a look at how the stock prices changed from 1995-2005 for all five companies. In your R console, start by typing the following plot command:
plot(CocaCola$Date[301:432], CocaCola$StockPrice[301:432], type="l", col="red", ylim=c(0,210),lwd=3)
#This will plot the CocaCola stock prices from 1995 through 2005, which are the observations numbered from 301 to 432. The additional argument, ylim=c(0,210), makes the y-axis range from 0 to 210. This will allow us to see all of the stock values when we add in the other companies.
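#If you want to verify which observation numbers correspond to this window, a minimal sketch: the data are monthly starting in January 1970, so January 1995 is observation (1995-1970)*12+1 = 301 and December 2005 is observation (2005-1970)*12+12 = 432.
which(CocaCola$Date==as.Date("1995-01-01"))   # 301
which(CocaCola$Date==as.Date("2005-12-01"))   # 432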
#Now, use the lines function to add in the other four companies, remembering to only plot the observations from 1995 to 2005, or [301:432]. You don't need the "type" or "ylim" arguments for the lines function, but remember to make each company a different color so that you can tell them apart. Some color options are "red", "blue", "green", "purple", "orange", and "black". To see all of the color options in R, type colors() in your R console.
lines(ProcterGamble$Date, ProcterGamble$StockPrice,lty=2,col="blue")
lines(GE$Date,GE$StockPrice,lty=3,col="green",lwd=2)
lines(IBM$Date,IBM$StockPrice,col="purple",lty=4)
lines(Boeing$Date,Boeing$StockPrice,col="orange",lty=5,lwd=3)
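#Note that the lines commands above pass the full series rather than the [301:432] subset; because the x-axis range was fixed by the initial plot command (1995 through 2005), the portions outside that window are simply clipped from the display.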
#If you prefer to change the type of the line instead of the color, here are some options for changing the line type: lty=2 will make the line dashed, lty=3 will make the line dotted, lty=4 will make the line alternate between dashes and dots, and lty=5 will make the line long-dashed.
#Which stock fell the most right after the technology bubble burst in March 2000?
abline(v=as.Date(c("2000-03-01")), lwd=2)
#Ans:GE
#Which stock reaches the highest value in the time period 1995-2005?
#Ans:IBM
#In October of 1997, there was a global stock market crash that was caused by an economic crisis in Asia. Comparing September 1997 to November 1997, which companies saw a decreasing trend in their stock price? (Select all that apply.)
abline(v=as.Date(c("1997-09-01")), lwd=2,col="steelblue")
abline(v=as.Date(c("1997-11-01")), lwd=2,col="steelblue")
#Ans:Boeing and Procter & Gamble
#In the last two years of this time period (2004 and 2005) which stock seems to be performing the best, in terms of increasing stock price?
abline(v=as.Date(c("2004-01-01")), lwd=2,col="steelblue")
abline(v=as.Date(c("2005-01-01")), lwd=2,col="steelblue")
#Ans:Boeing
###########################################
#PROBLEM 4.1 - MONTHLY TRENDS
#Lastly, let's see if stocks tend to be higher or lower during certain months. Use the tapply command to calculate the mean stock price of IBM, sorted by months. To sort by months, use months(IBM$Date) as the second argument of the tapply function.
tapply(IBM$StockPrice,months(IBM$Date),mean)
## April August December February January July June
## 152.1168 140.1455 140.7593 152.6940 150.2384 139.0670 139.0907
## March May November October September
## 152.4327 151.5022 138.0187 137.3466 139.0885
#For IBM, compare the monthly averages to the overall average stock price. In which months has IBM historically had a higher stock price (on average)? Select all that apply.
tapply(IBM$StockPrice,months(IBM$Date),mean)>mean(IBM$StockPrice)
## April August December February January July June
## TRUE FALSE FALSE TRUE TRUE FALSE FALSE
## March May November October September
## TRUE TRUE FALSE FALSE FALSE
#Ans:January to May
#Repeat the tapply function from the previous problem for each of the other four companies, and use the output to answer the remaining questions.
tapply(GE$StockPrice,months(GE$Date),mean)
## April August December February January July June
## 64.48009 56.50315 59.10217 62.52080 62.04511 56.73349 56.46844
## March May November October September
## 63.15055 60.87135 57.28879 56.23897 56.23913
tapply(ProcterGamble$StockPrice,months(ProcterGamble$Date),mean)
## April August December February January July June
## 77.68671 76.82266 78.29661 79.02575 79.61798 76.64556 77.39275
## March May November October September
## 77.34761 77.85958 78.45610 76.67903 76.62385
tapply(Boeing$StockPrice,months(Boeing$Date),mean)
## April August December February January July June
## 47.04686 46.86311 46.17315 46.89223 46.51097 46.55360 47.38525
## March May November October September
## 46.88208 48.13716 45.14990 45.21603 46.30485
tapply(CocaCola$StockPrice,months(CocaCola$Date),mean)
## April August December February January July June
## 62.68888 58.88014 59.73223 60.73475 60.36849 58.98346 60.81208
## March May November October September
## 62.07135 61.44358 59.10268 57.93887 57.60024
#General Electric and Coca-Cola both have their highest average stock price in the same month. Which month is this?
which.max(tapply(GE$StockPrice,months(GE$Date),mean))
## April
## 1
which.max(tapply(CocaCola$StockPrice,months(CocaCola$Date),mean))
## April
## 1
#Ans:April
#For the months of December and January, every company's average stock is higher in one month and lower in the other. In which month are the stock prices lower?
#Ans:December
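#A quick way to compare the December and January averages for all five stocks at once (a sketch, assuming the five data frames loaded earlier):
sapply(list(IBM=IBM, GE=GE, CocaCola=CocaCola, ProcterGamble=ProcterGamble, Boeing=Boeing),
       function(d) tapply(d$StockPrice, months(d$Date), mean)[c("December", "January")])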
DEMOGRAPHICS AND EMPLOYMENT IN THE UNITED STATES
In the wake of the Great Recession of 2009, there has been a good deal of focus on employment statistics, one of the most important metrics policymakers use to gauge the overall strength of the economy. In the United States, the government measures unemployment using the Current Population Survey (CPS), which collects demographic and employment information from a wide range of Americans each month. In this exercise, we will employ the topics reviewed in the lectures as well as a few new techniques using the September 2013 version of this rich, nationally representative dataset (available online).
The observations in the dataset represent people surveyed in the September 2013 CPS who actually completed a survey. While the full dataset has 385 variables, in this exercise we will use a more compact version of the dataset, CPSData.csv.
#PROBLEM 1.1 - LOADING AND SUMMARIZING THE DATASET
#Let's import the dataset
CPS<-read.csv("CPSData.csv")
str(CPS)
## 'data.frame': 131302 obs. of 14 variables:
## $ PeopleInHousehold : int 1 3 3 3 3 3 3 2 2 2 ...
## $ Region : Factor w/ 4 levels "Midwest","Northeast",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ State : Factor w/ 51 levels "Alabama","Alaska",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ MetroAreaCode : int 26620 13820 13820 13820 26620 26620 26620 33660 33660 26620 ...
## $ Age : int 85 21 37 18 52 24 26 71 43 52 ...
## $ Married : Factor w/ 5 levels "Divorced","Married",..: 5 3 3 3 5 3 3 1 1 3 ...
## $ Sex : Factor w/ 2 levels "Female","Male": 1 2 1 2 1 2 2 1 2 2 ...
## $ Education : Factor w/ 8 levels "Associate degree",..: 1 4 4 6 1 2 4 4 4 2 ...
## $ Race : Factor w/ 6 levels "American Indian",..: 6 3 3 3 6 6 6 6 6 6 ...
## $ Hispanic : int 0 0 0 0 0 0 0 0 0 0 ...
## $ CountryOfBirthCode: int 57 57 57 57 57 57 57 57 57 57 ...
## $ Citizenship : Factor w/ 3 levels "Citizen, Native",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ EmploymentStatus : Factor w/ 5 levels "Disabled","Employed",..: 4 5 1 3 2 2 2 2 3 2 ...
## $ Industry : Factor w/ 14 levels "Agriculture, forestry, fishing, and hunting",..: NA 11 NA NA 11 4 14 4 NA 12 ...
summary(CPS)
## PeopleInHousehold Region State MetroAreaCode
## Min. : 1.000 Midwest :30684 California :11570 Min. :10420
## 1st Qu.: 2.000 Northeast:25939 Texas : 7077 1st Qu.:21780
## Median : 3.000 South :41502 New York : 5595 Median :34740
## Mean : 3.284 West :33177 Florida : 5149 Mean :35075
## 3rd Qu.: 4.000 Pennsylvania: 3930 3rd Qu.:41860
## Max. :15.000 Illinois : 3912 Max. :79600
## (Other) :94069 NA's :34238
## Age Married Sex
## Min. : 0.00 Divorced :11151 Female:67481
## 1st Qu.:19.00 Married :55509 Male :63821
## Median :39.00 Never Married:30772
## Mean :38.83 Separated : 2027
## 3rd Qu.:57.00 Widowed : 6505
## Max. :85.00 NA's :25338
##
## Education Race
## High school :30906 American Indian : 1433
## Bachelor's degree :19443 Asian : 6520
## Some college, no degree:18863 Black : 13913
## No high school diploma :16095 Multiracial : 2897
## Associate degree : 9913 Pacific Islander: 618
## (Other) :10744 White :105921
## NA's :25338
## Hispanic CountryOfBirthCode Citizenship
## Min. :0.0000 Min. : 57.00 Citizen, Native :116639
## 1st Qu.:0.0000 1st Qu.: 57.00 Citizen, Naturalized: 7073
## Median :0.0000 Median : 57.00 Non-Citizen : 7590
## Mean :0.1393 Mean : 82.68
## 3rd Qu.:0.0000 3rd Qu.: 57.00
## Max. :1.0000 Max. :555.00
##
## EmploymentStatus Industry
## Disabled : 5712 Educational and health services :15017
## Employed :61733 Trade : 8933
## Not in Labor Force:15246 Professional and business services: 7519
## Retired :18619 Manufacturing : 6791
## Unemployed : 4203 Leisure and hospitality : 6364
## NA's :25789 (Other) :21618
## NA's :65060
names(CPS)
## [1] "PeopleInHousehold" "Region" "State"
## [4] "MetroAreaCode" "Age" "Married"
## [7] "Sex" "Education" "Race"
## [10] "Hispanic" "CountryOfBirthCode" "Citizenship"
## [13] "EmploymentStatus" "Industry"
#How many interviewees are in the dataset?
#Ans:131302
#Among the interviewees with a value reported for the Industry variable, what is the most common industry of employment? Please enter the name exactly how you see it.
head(sort(table(CPS$Industry),decreasing=TRUE))
##
## Educational and health services Trade
## 15017 8933
## Professional and business services Manufacturing
## 7519 6791
## Leisure and hospitality Construction
## 6364 4387
#Ans:Educational and health services
#EXPLANATION:The output of summary(CPS) orders the levels of a factor variable like Industry from largest to smallest, so we can see that "Educational and health services" is the most common Industry. table(CPS$Industry) would have provided the breakdown across all industries.
#Recall from the homework assignment "The Analytical Detective" that you can call the sort() function on the output of the table() function to obtain a sorted breakdown of a variable. For instance, sort(table(CPS$Region)) sorts the regions by the number of interviewees from that region.
#Which state has the fewest interviewees?
head(sort(table(CPS$State),decreasing=F))
##
## New Mexico Montana Mississippi Alabama West Virginia
## 1102 1214 1230 1376 1409
## Arkansas
## 1421
#Ans:New Mexico
#Which state has the largest number of interviewees?
head(sort(table(CPS$State),decreasing=T))
##
## California Texas New York Florida Pennsylvania
## 11570 7077 5595 5149 3930
## Illinois
## 3912
#Ans: California
#What proportion of interviewees are citizens of the United States?
prop.table(table(CPS$Citizenship))
##
## Citizen, Native Citizen, Naturalized Non-Citizen
## 0.88832615 0.05386818 0.05780567
0.88832615+0.05386818
## [1] 0.9421943
#Ans:0.9421943
#The CPS differentiates between race (with possible values American Indian, Asian, Black, Pacific Islander, White, or Multiracial) and ethnicity. A number of interviewees are of Hispanic ethnicity, as captured by the Hispanic variable. For which races are there at least 250 interviewees in the CPS dataset of Hispanic ethnicity? (Select all that apply.)
table(CPS$Race, CPS$Hispanic)
##
## 0 1
## American Indian 1129 304
## Asian 6407 113
## Black 13292 621
## Multiracial 2449 448
## Pacific Islander 541 77
## White 89190 16731
#or
table(CPS$Hispanic==T)
##
## FALSE TRUE
## 113008 18294
plus250<-table(CPS$Race[CPS$Hispanic==T])
plus250
##
## American Indian Asian Black Multiracial
## 304 113 621 448
## Pacific Islander White
## 77 16731
plus250>=250
##
## American Indian Asian Black Multiracial
## TRUE FALSE TRUE TRUE
## Pacific Islander White
## FALSE TRUE
#Ans:American Indian,Black,Multiracial ,White
##############################################
#PROBLEM 2.1 - EVALUATING MISSING VALUES
#Which variables have at least one interviewee with a missing (NA) value? (Select all that apply.)
colnames(CPS)[colSums(is.na(CPS)) > 0]
## [1] "MetroAreaCode" "Married" "Education"
## [4] "EmploymentStatus" "Industry"
#or
apply(is.na(CPS), 2, any)
## PeopleInHousehold Region State
## FALSE FALSE FALSE
## MetroAreaCode Age Married
## TRUE FALSE TRUE
## Sex Education Race
## FALSE TRUE FALSE
## Hispanic CountryOfBirthCode Citizenship
## FALSE FALSE FALSE
## EmploymentStatus Industry
## TRUE TRUE
#or
colnames(CPS)[apply(is.na(CPS), 2, any)]
## [1] "MetroAreaCode" "Married" "Education"
## [4] "EmploymentStatus" "Industry"
#or
which(apply(is.na(CPS), 2, sum)>0) # gives col names as well as col no
## MetroAreaCode Married Education EmploymentStatus
## 4 6 8 13
## Industry
## 14
#or
summary(CPS) # crude way
## PeopleInHousehold Region State MetroAreaCode
## Min. : 1.000 Midwest :30684 California :11570 Min. :10420
## 1st Qu.: 2.000 Northeast:25939 Texas : 7077 1st Qu.:21780
## Median : 3.000 South :41502 New York : 5595 Median :34740
## Mean : 3.284 West :33177 Florida : 5149 Mean :35075
## 3rd Qu.: 4.000 Pennsylvania: 3930 3rd Qu.:41860
## Max. :15.000 Illinois : 3912 Max. :79600
## (Other) :94069 NA's :34238
## Age Married Sex
## Min. : 0.00 Divorced :11151 Female:67481
## 1st Qu.:19.00 Married :55509 Male :63821
## Median :39.00 Never Married:30772
## Mean :38.83 Separated : 2027
## 3rd Qu.:57.00 Widowed : 6505
## Max. :85.00 NA's :25338
##
## Education Race
## High school :30906 American Indian : 1433
## Bachelor's degree :19443 Asian : 6520
## Some college, no degree:18863 Black : 13913
## No high school diploma :16095 Multiracial : 2897
## Associate degree : 9913 Pacific Islander: 618
## (Other) :10744 White :105921
## NA's :25338
## Hispanic CountryOfBirthCode Citizenship
## Min. :0.0000 Min. : 57.00 Citizen, Native :116639
## 1st Qu.:0.0000 1st Qu.: 57.00 Citizen, Naturalized: 7073
## Median :0.0000 Median : 57.00 Non-Citizen : 7590
## Mean :0.1393 Mean : 82.68
## 3rd Qu.:0.0000 3rd Qu.: 57.00
## Max. :1.0000 Max. :555.00
##
## EmploymentStatus Industry
## Disabled : 5712 Educational and health services :15017
## Employed :61733 Trade : 8933
## Not in Labor Force:15246 Professional and business services: 7519
## Retired :18619 Manufacturing : 6791
## Unemployed : 4203 Leisure and hospitality : 6364
## NA's :25789 (Other) :21618
## NA's :65060
table(is.na(CPS)) # to know the total no of NAs
##
## FALSE TRUE
## 1662465 175763
#Ans:MetroAreaCode,Married,Education,EmploymentStatus,Industry
#Often when evaluating a new dataset, we try to identify if there is a pattern in the missing values in the dataset. We will try to determine if there is a pattern in the missing values of the Married variable. The function is.na(CPS$Married) returns a vector of TRUE/FALSE values for whether the Married variable is missing. We can see the breakdown of whether Married is missing based on the reported value of the Region variable with the function table(CPS$Region, is.na(CPS$Married)). Which is the most accurate:
table(CPS$Region, is.na(CPS$Married))
##
## FALSE TRUE
## Midwest 24609 6075
## Northeast 21432 4507
## South 33535 7967
## West 26388 6789
table(CPS$Sex, is.na(CPS$Married))
##
## FALSE TRUE
## Female 55264 12217
## Male 50700 13121
table(CPS$Age,is.na(CPS$Married))
##
## FALSE TRUE
## 0 0 1283
## 1 0 1559
## 2 0 1574
## 3 0 1693
## 4 0 1695
## 5 0 1795
## 6 0 1721
## 7 0 1681
## 8 0 1729
## 9 0 1748
## 10 0 1750
## 11 0 1721
## 12 0 1797
## 13 0 1802
## 14 0 1790
## 15 1795 0
## 16 1751 0
## 17 1764 0
## 18 1596 0
## 19 1517 0
## 20 1398 0
## 21 1525 0
## 22 1536 0
## 23 1638 0
## 24 1627 0
## 25 1604 0
## 26 1643 0
## 27 1657 0
## 28 1736 0
## 29 1645 0
## 30 1854 0
## 31 1762 0
## 32 1790 0
## 33 1804 0
## 34 1653 0
## 35 1716 0
## 36 1663 0
## 37 1531 0
## 38 1530 0
## 39 1542 0
## 40 1571 0
## 41 1673 0
## 42 1711 0
## 43 1819 0
## 44 1764 0
## 45 1749 0
## 46 1665 0
## 47 1647 0
## 48 1791 0
## 49 1989 0
## 50 1966 0
## 51 1931 0
## 52 1935 0
## 53 1994 0
## 54 1912 0
## 55 1895 0
## 56 1935 0
## 57 1827 0
## 58 1874 0
## 59 1758 0
## 60 1746 0
## 61 1735 0
## 62 1595 0
## 63 1596 0
## 64 1519 0
## 65 1569 0
## 66 1577 0
## 67 1227 0
## 68 1130 0
## 69 1062 0
## 70 1195 0
## 71 1031 0
## 72 941 0
## 73 896 0
## 74 842 0
## 75 763 0
## 76 729 0
## 77 698 0
## 78 659 0
## 79 661 0
## 80 2664 0
## 85 2446 0
table(CPS$Citizenship, is.na(CPS$Married))
##
## FALSE TRUE
## Citizen, Native 91956 24683
## Citizen, Naturalized 6910 163
## Non-Citizen 7098 492
#Ans:The Married variable being missing is related to the Age value for the interviewee.
#EXPLANATION:For each possible value of Region, Sex, and Citizenship, there are both interviewees with missing and non-missing Married values. However, Married is missing for all interviewees aged 0-14 and is present for all interviewees aged 15 and older. This is because the CPS does not ask about marriage status for interviewees 14 and younger.
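#A compact way to confirm this pattern (a quick check: the 25,338 NA values for Married shown in summary(CPS) above correspond exactly to the interviewees aged 0-14):
table(CPS$Age <= 14, is.na(CPS$Married))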
#As mentioned in the variable descriptions, MetroAreaCode is missing if an interviewee does not live in a metropolitan area. Using the same technique as in the previous question, answer the following questions about people who live in non-metropolitan areas.
#How many states had all interviewees living in a non-metropolitan area (aka they have a missing MetroAreaCode value)? For this question, treat the District of Columbia as a state (even though it is not technically a state).
table(CPS$State, is.na(CPS$MetroAreaCode))
##
## FALSE TRUE
## Alabama 1020 356
## Alaska 0 1590
## Arizona 1327 201
## Arkansas 724 697
## California 11333 237
## Colorado 2545 380
## Connecticut 2593 243
## Delaware 1696 518
## District of Columbia 1791 0
## Florida 4947 202
## Georgia 2250 557
## Hawaii 1576 523
## Idaho 761 757
## Illinois 3473 439
## Indiana 1420 584
## Iowa 1297 1231
## Kansas 1234 701
## Kentucky 908 933
## Louisiana 1216 234
## Maine 909 1354
## Maryland 2978 222
## Massachusetts 1858 129
## Michigan 2517 546
## Minnesota 2150 989
## Mississippi 376 854
## Missouri 1440 705
## Montana 199 1015
## Nebraska 816 1133
## Nevada 1609 247
## New Hampshire 1148 1514
## New Jersey 2567 0
## New Mexico 832 270
## New York 5144 451
## North Carolina 1642 977
## North Dakota 432 1213
## Ohio 2754 924
## Oklahoma 1024 499
## Oregon 1519 424
## Pennsylvania 3245 685
## Rhode Island 2209 0
## South Carolina 1139 519
## South Dakota 595 1405
## Tennessee 1149 635
## Texas 6060 1017
## Utah 1455 387
## Vermont 657 1233
## Virginia 2367 586
## Washington 1937 429
## West Virginia 344 1065
## Wisconsin 1882 804
## Wyoming 0 1624
#Ans:2
#How many states had all interviewees living in a metropolitan area? Again, treat the District of Columbia as a state.
#Ans:3
#Which region of the United States has the largest proportion of interviewees living in a non-metropolitan area?
prop.table(table(CPS$Region,is.na(CPS$MetroAreaCode)),1) #row wise proportions
##
## FALSE TRUE
## Midwest 0.6521314 0.3478686
## Northeast 0.7837619 0.2162381
## South 0.7621560 0.2378440
## West 0.7563372 0.2436628
#Ans:Midwest
#EXPLANATION:34.8% in the Midwest, 21.6% in the Northeast, 23.8% in the South, and 24.4% in the West.
#While we were able to use the table() command to compute the proportion of interviewees from each region not living in a metropolitan area, it was somewhat tedious (it involved manually computing the proportion for each region) and isn't something you would want to do if there were a larger number of options. It turns out there is a less tedious way to compute the proportion of values that are TRUE. The mean() function, which takes the average of the values passed to it, will treat TRUE as 1 and FALSE as 0, meaning it returns the proportion of values that are true. For instance, mean(c(TRUE, FALSE, TRUE, TRUE)) returns 0.75. Knowing this, use tapply() with the mean function to answer the following questions:
#Which state has a proportion of interviewees living in a non-metropolitan area closest to 30%?
tapply(is.na(CPS$MetroAreaCode),CPS$State,mean)
## Alabama Alaska Arizona
## 0.25872093 1.00000000 0.13154450
## Arkansas California Colorado
## 0.49049965 0.02048401 0.12991453
## Connecticut Delaware District of Columbia
## 0.08568406 0.23396567 0.00000000
## Florida Georgia Hawaii
## 0.03923092 0.19843249 0.24916627
## Idaho Illinois Indiana
## 0.49868248 0.11221881 0.29141717
## Iowa Kansas Kentucky
## 0.48694620 0.36227390 0.50678979
## Louisiana Maine Maryland
## 0.16137931 0.59832081 0.06937500
## Massachusetts Michigan Minnesota
## 0.06492199 0.17825661 0.31506849
## Mississippi Missouri Montana
## 0.69430894 0.32867133 0.83607908
## Nebraska Nevada New Hampshire
## 0.58132376 0.13308190 0.56874530
## New Jersey New Mexico New York
## 0.00000000 0.24500907 0.08060769
## North Carolina North Dakota Ohio
## 0.37304315 0.73738602 0.25122349
## Oklahoma Oregon Pennsylvania
## 0.32764281 0.21821925 0.17430025
## Rhode Island South Carolina South Dakota
## 0.00000000 0.31302774 0.70250000
## Tennessee Texas Utah
## 0.35594170 0.14370496 0.21009772
## Vermont Virginia Washington
## 0.65238095 0.19844226 0.18131868
## West Virginia Wisconsin Wyoming
## 0.75585522 0.29932986 1.00000000
sort(tapply(is.na(CPS$MetroAreaCode), CPS$State, mean))
## District of Columbia New Jersey Rhode Island
## 0.00000000 0.00000000 0.00000000
## California Florida Massachusetts
## 0.02048401 0.03923092 0.06492199
## Maryland New York Connecticut
## 0.06937500 0.08060769 0.08568406
## Illinois Colorado Arizona
## 0.11221881 0.12991453 0.13154450
## Nevada Texas Louisiana
## 0.13308190 0.14370496 0.16137931
## Pennsylvania Michigan Washington
## 0.17430025 0.17825661 0.18131868
## Georgia Virginia Utah
## 0.19843249 0.19844226 0.21009772
## Oregon Delaware New Mexico
## 0.21821925 0.23396567 0.24500907
## Hawaii Ohio Alabama
## 0.24916627 0.25122349 0.25872093
## Indiana Wisconsin South Carolina
## 0.29141717 0.29932986 0.31302774
## Minnesota Oklahoma Missouri
## 0.31506849 0.32764281 0.32867133
## Tennessee Kansas North Carolina
## 0.35594170 0.36227390 0.37304315
## Iowa Arkansas Idaho
## 0.48694620 0.49049965 0.49868248
## Kentucky New Hampshire Nebraska
## 0.50678979 0.56874530 0.58132376
## Maine Vermont Mississippi
## 0.59832081 0.65238095 0.69430894
## South Dakota North Dakota West Virginia
## 0.70250000 0.73738602 0.75585522
## Montana Alaska Wyoming
## 0.83607908 1.00000000 1.00000000
#or
prop.table(table(CPS$State,is.na(CPS$MetroAreaCode)),1)
##
## FALSE TRUE
## Alabama 0.74127907 0.25872093
## Alaska 0.00000000 1.00000000
## Arizona 0.86845550 0.13154450
## Arkansas 0.50950035 0.49049965
## California 0.97951599 0.02048401
## Colorado 0.87008547 0.12991453
## Connecticut 0.91431594 0.08568406
## Delaware 0.76603433 0.23396567
## District of Columbia 1.00000000 0.00000000
## Florida 0.96076908 0.03923092
## Georgia 0.80156751 0.19843249
## Hawaii 0.75083373 0.24916627
## Idaho 0.50131752 0.49868248
## Illinois 0.88778119 0.11221881
## Indiana 0.70858283 0.29141717
## Iowa 0.51305380 0.48694620
## Kansas 0.63772610 0.36227390
## Kentucky 0.49321021 0.50678979
## Louisiana 0.83862069 0.16137931
## Maine 0.40167919 0.59832081
## Maryland 0.93062500 0.06937500
## Massachusetts 0.93507801 0.06492199
## Michigan 0.82174339 0.17825661
## Minnesota 0.68493151 0.31506849
## Mississippi 0.30569106 0.69430894
## Missouri 0.67132867 0.32867133
## Montana 0.16392092 0.83607908
## Nebraska 0.41867624 0.58132376
## Nevada 0.86691810 0.13308190
## New Hampshire 0.43125470 0.56874530
## New Jersey 1.00000000 0.00000000
## New Mexico 0.75499093 0.24500907
## New York 0.91939231 0.08060769
## North Carolina 0.62695685 0.37304315
## North Dakota 0.26261398 0.73738602
## Ohio 0.74877651 0.25122349
## Oklahoma 0.67235719 0.32764281
## Oregon 0.78178075 0.21821925
## Pennsylvania 0.82569975 0.17430025
## Rhode Island 1.00000000 0.00000000
## South Carolina 0.68697226 0.31302774
## South Dakota 0.29750000 0.70250000
## Tennessee 0.64405830 0.35594170
## Texas 0.85629504 0.14370496
## Utah 0.78990228 0.21009772
## Vermont 0.34761905 0.65238095
## Virginia 0.80155774 0.19844226
## Washington 0.81868132 0.18131868
## West Virginia 0.24414478 0.75585522
## Wisconsin 0.70067014 0.29932986
## Wyoming 0.00000000 1.00000000
#Ans:Wisconsin
#Which state has the largest proportion of non-metropolitan interviewees, ignoring states where all interviewees were non-metropolitan?
#ANs:Montana
###################################################
#PROBLEM 3.1 - INTEGRATING METROPOLITAN AREA DATA
#Codes like MetroAreaCode and CountryOfBirthCode are a compact way to encode factor variables with text as their possible values, and they are therefore quite common in survey datasets. In fact, all but one of the variables in this dataset were actually stored by a numeric code in the original CPS datafile.
#When analyzing a variable stored by a numeric code, we will often want to convert it into the values the codes represent. To do this, we will use a dictionary, which maps the code to the actual value of the variable. We have provided dictionaries MetroAreaCodes.csv and CountryCodes.csv, which respectively map MetroAreaCode and CountryOfBirthCode into their true values. Read these two dictionaries into data frames MetroAreaMap and CountryMap.
MetroAreaMap<-read.csv("MetroAreaCodes.csv")
CountryMap<-read.csv("CountryCodes.csv")
#How many observations (codes for metropolitan areas) are there in MetroAreaMap?
str(MetroAreaMap)
## 'data.frame': 271 obs. of 2 variables:
## $ Code : int 460 3000 3160 3610 3720 6450 10420 10500 10580 10740 ...
## $ MetroArea: Factor w/ 271 levels "Akron, OH","Albany-Schenectady-Troy, NY",..: 12 92 97 117 122 195 1 3 2 4 ...
#or
nrow(MetroAreaMap)
## [1] 271
#Ans:271
#How many observations (codes for countries) are there in CountryMap?
str(CountryMap)
## 'data.frame': 149 obs. of 2 variables:
## $ Code : int 57 66 73 78 96 100 102 103 104 105 ...
## $ Country: Factor w/ 149 levels "Afghanistan",..: 139 57 105 135 97 3 11 18 24 37 ...
#or
nrow(CountryMap)
## [1] 149
#Ans:149
#To merge in the metropolitan areas, we want to connect the field MetroAreaCode from the CPS data frame with the field Code in MetroAreaMap. The following command merges the two data frames on these columns, overwriting the CPS data frame with the result:
CPS<-merge(CPS, MetroAreaMap, by.x="MetroAreaCode", by.y="Code", all.x=TRUE)
#The first two arguments determine the data frames to be merged (they are called "x" and "y", respectively, in the subsequent parameters to the merge function). by.x="MetroAreaCode" means we're matching on the MetroAreaCode variable from the "x" data frame (CPS), while by.y="Code" means we're matching on the Code variable from the "y" data frame (MetroAreaMap). Finally, all.x=TRUE means we want to keep all rows from the "x" data frame (CPS), even if some of the rows' MetroAreaCode doesn't match any codes in MetroAreaMap (for those familiar with database terminology, this parameter makes the operation a left outer join instead of an inner join).
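#A minimal sketch of this behavior, using small made-up data frames (not part of the CPS files):
toyX <- data.frame(id = c(1, 2, 3), value = c("a", "b", "c"))
toyY <- data.frame(id = c(1, 2), label = c("one", "two"))
merge(toyX, toyY, by = "id")                # inner join: the row with id=3 is dropped
merge(toyX, toyY, by = "id", all.x = TRUE)  # left outer join: id=3 is kept, with label = NA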
#Review the new version of the CPS data frame with the summary() and str() functions. What is the name of the variable that was added to the data frame by the merge() operation?
str(CPS)
## 'data.frame': 131302 obs. of 15 variables:
## $ MetroAreaCode : int 10420 10420 10420 10420 10420 10420 10420 10420 10420 10420 ...
## $ PeopleInHousehold : int 4 4 2 4 1 3 4 4 2 3 ...
## $ Region : Factor w/ 4 levels "Midwest","Northeast",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ State : Factor w/ 51 levels "Alabama","Alaska",..: 36 36 36 36 36 36 36 36 36 36 ...
## $ Age : int 2 9 73 40 63 19 30 6 60 32 ...
## $ Married : Factor w/ 5 levels "Divorced","Married",..: NA NA 2 2 3 3 2 NA 2 2 ...
## $ Sex : Factor w/ 2 levels "Female","Male": 2 2 1 1 2 1 1 1 1 2 ...
## $ Education : Factor w/ 8 levels "Associate degree",..: NA NA 8 4 6 4 2 NA 4 4 ...
## $ Race : Factor w/ 6 levels "American Indian",..: 6 6 6 6 6 6 2 6 6 6 ...
## $ Hispanic : int 0 0 0 0 0 0 0 1 0 0 ...
## $ CountryOfBirthCode: int 57 57 57 362 57 57 203 57 57 57 ...
## $ Citizenship : Factor w/ 3 levels "Citizen, Native",..: 1 1 1 2 1 1 3 1 1 1 ...
## $ EmploymentStatus : Factor w/ 5 levels "Disabled","Employed",..: NA NA 4 3 1 2 3 NA 2 2 ...
## $ Industry : Factor w/ 14 levels "Agriculture, forestry, fishing, and hunting",..: NA NA NA NA NA 7 NA NA 4 13 ...
## $ MetroArea : Factor w/ 271 levels "Akron, OH","Albany-Schenectady-Troy, NY",..: 1 1 1 1 1 1 1 1 1 1 ...
summary(CPS)
## MetroAreaCode PeopleInHousehold Region State
## Min. :10420 Min. : 1.000 Midwest :30684 California :11570
## 1st Qu.:21780 1st Qu.: 2.000 Northeast:25939 Texas : 7077
## Median :34740 Median : 3.000 South :41502 New York : 5595
## Mean :35075 Mean : 3.284 West :33177 Florida : 5149
## 3rd Qu.:41860 3rd Qu.: 4.000 Pennsylvania: 3930
## Max. :79600 Max. :15.000 Illinois : 3912
## NA's :34238 (Other) :94069
## Age Married Sex
## Min. : 0.00 Divorced :11151 Female:67481
## 1st Qu.:19.00 Married :55509 Male :63821
## Median :39.00 Never Married:30772
## Mean :38.83 Separated : 2027
## 3rd Qu.:57.00 Widowed : 6505
## Max. :85.00 NA's :25338
##
## Education Race
## High school :30906 American Indian : 1433
## Bachelor's degree :19443 Asian : 6520
## Some college, no degree:18863 Black : 13913
## No high school diploma :16095 Multiracial : 2897
## Associate degree : 9913 Pacific Islander: 618
## (Other) :10744 White :105921
## NA's :25338
## Hispanic CountryOfBirthCode Citizenship
## Min. :0.0000 Min. : 57.00 Citizen, Native :116639
## 1st Qu.:0.0000 1st Qu.: 57.00 Citizen, Naturalized: 7073
## Median :0.0000 Median : 57.00 Non-Citizen : 7590
## Mean :0.1393 Mean : 82.68
## 3rd Qu.:0.0000 3rd Qu.: 57.00
## Max. :1.0000 Max. :555.00
##
## EmploymentStatus Industry
## Disabled : 5712 Educational and health services :15017
## Employed :61733 Trade : 8933
## Not in Labor Force:15246 Professional and business services: 7519
## Retired :18619 Manufacturing : 6791
## Unemployed : 4203 Leisure and hospitality : 6364
## NA's :25789 (Other) :21618
## NA's :65060
## MetroArea
## New York-Northern New Jersey-Long Island, NY-NJ-PA: 5409
## Washington-Arlington-Alexandria, DC-VA-MD-WV : 4177
## Los Angeles-Long Beach-Santa Ana, CA : 4102
## Philadelphia-Camden-Wilmington, PA-NJ-DE : 2855
## Chicago-Naperville-Joliet, IN-IN-WI : 2772
## (Other) :77749
## NA's :34238
#Ans:MetroArea #this variable comes from the second data frame (MetroAreaMap) and is added to the first data frame (CPS) by the merge operation
#How many interviewees have a missing value for the new metropolitan area variable? Note that all of these interviewees would have been removed from the merged data frame if we did not include the all.x=TRUE parameter.
#Ans:34238
#Which of the following metropolitan areas has the largest number of interviewees?
head(sort(table(CPS$MetroArea),decreasing=T))
##
## New York-Northern New Jersey-Long Island, NY-NJ-PA
## 5409
## Washington-Arlington-Alexandria, DC-VA-MD-WV
## 4177
## Los Angeles-Long Beach-Santa Ana, CA
## 4102
## Philadelphia-Camden-Wilmington, PA-NJ-DE
## 2855
## Chicago-Naperville-Joliet, IN-IN-WI
## 2772
## Providence-Fall River-Warwick, MA-RI
## 2284
#Ans:Boston-Cambridge-Quincy, MA-NH (the largest among the answer choices offered in the problem; overall, New York-Northern New Jersey-Long Island, NY-NJ-PA has the most interviewees)
#Which metropolitan area has the highest proportion of interviewees of Hispanic ethnicity? Hint: Use tapply() with mean, as in the previous subproblem. Calling sort() on the output of tapply() could also be helpful here.
head(sort(tapply(CPS$Hispanic, CPS$MetroArea, mean),decreasing=T))
## Laredo, TX McAllen-Edinburg-Pharr, TX
## 0.9662921 0.9487179
## Brownsville-Harlingen, TX El Paso, TX
## 0.7974684 0.7909836
## El Centro, CA San Antonio, TX
## 0.6868687 0.6441516
#Ans:Laredo, TX
#Remembering that CPS$Race == "Asian" returns a TRUE/FALSE vector of whether an interviewee is Asian, determine the number of metropolitan areas in the United States from which at least 20% of interviewees are Asian.
head(sort(tapply(CPS$Race == "Asian", CPS$MetroArea, mean),decreasing = T))
## Honolulu, HI San Francisco-Oakland-Fremont, CA
## 0.5019036 0.2467532
## San Jose-Sunnyvale-Santa Clara, CA Vallejo-Fairfield, CA
## 0.2417910 0.2030075
## Fresno, CA Warner Robins, GA
## 0.1848185 0.1666667
#Ans:4
#We can read from the sorted output that Honolulu, HI; San Francisco-Oakland-Fremont, CA; San Jose-Sunnyvale-Santa Clara, CA; and Vallejo-Fairfield, CA had at least 20% of their interviewees of the Asian race.
#Normally, we would look at the sorted proportion of interviewees from each metropolitan area who have not received a high school diploma with the command:
head(sort(tapply(CPS$Education == "No high school diploma", CPS$MetroArea, mean,na.rm=TRUE)))
## Iowa City, IA Bowling Green, KY Kalamazoo-Portage, MI
## 0.02912621 0.03703704 0.05050505
## Champaign-Urbana, IL Bremerton-Silverdale, WA Lawrence, KS
## 0.05154639 0.05405405 0.05952381
#Without the na.rm=TRUE argument, the mean value would be reported as NA for each metropolitan area, because none of the interviewees aged 14 and younger have an education value reported. Passing na.rm=TRUE to the tapply call makes mean (and related functions, like sum) ignore missing values. Using the sorted output above, determine which metropolitan area has the smallest proportion of interviewees who have received no high school diploma.
#Ans:Iowa City, IA
#We can see that Iowa City, IA had 2.9% of interviewees not finish high school, the smallest value of any metropolitan area.
######################################################
#PROBLEM 4.1 - INTEGRATING COUNTRY OF BIRTH DATA
#Just as we did with the metropolitan area information, merge in the country of birth information from the CountryMap data frame, replacing the CPS data frame with the result. If you accidentally overwrite CPS with the wrong values, remember that you can restore it by re-loading the data frame from CPSData.csv and then merging in the metropolitan area information using the command provided in the previous subproblem.
str(CPS)
## 'data.frame': 131302 obs. of 15 variables:
## $ MetroAreaCode : int 10420 10420 10420 10420 10420 10420 10420 10420 10420 10420 ...
## $ PeopleInHousehold : int 4 4 2 4 1 3 4 4 2 3 ...
## $ Region : Factor w/ 4 levels "Midwest","Northeast",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ State : Factor w/ 51 levels "Alabama","Alaska",..: 36 36 36 36 36 36 36 36 36 36 ...
## $ Age : int 2 9 73 40 63 19 30 6 60 32 ...
## $ Married : Factor w/ 5 levels "Divorced","Married",..: NA NA 2 2 3 3 2 NA 2 2 ...
## $ Sex : Factor w/ 2 levels "Female","Male": 2 2 1 1 2 1 1 1 1 2 ...
## $ Education : Factor w/ 8 levels "Associate degree",..: NA NA 8 4 6 4 2 NA 4 4 ...
## $ Race : Factor w/ 6 levels "American Indian",..: 6 6 6 6 6 6 2 6 6 6 ...
## $ Hispanic : int 0 0 0 0 0 0 0 1 0 0 ...
## $ CountryOfBirthCode: int 57 57 57 362 57 57 203 57 57 57 ...
## $ Citizenship : Factor w/ 3 levels "Citizen, Native",..: 1 1 1 2 1 1 3 1 1 1 ...
## $ EmploymentStatus : Factor w/ 5 levels "Disabled","Employed",..: NA NA 4 3 1 2 3 NA 2 2 ...
## $ Industry : Factor w/ 14 levels "Agriculture, forestry, fishing, and hunting",..: NA NA NA NA NA 7 NA NA 4 13 ...
## $ MetroArea : Factor w/ 271 levels "Akron, OH","Albany-Schenectady-Troy, NY",..: 1 1 1 1 1 1 1 1 1 1 ...
CPS<-merge(CPS, CountryMap, by.x="CountryOfBirthCode", by.y="Code", all.x=TRUE)
summary(CPS)
## CountryOfBirthCode MetroAreaCode PeopleInHousehold Region
## Min. : 57.00 Min. :10420 Min. : 1.000 Midwest :30684
## 1st Qu.: 57.00 1st Qu.:21780 1st Qu.: 2.000 Northeast:25939
## Median : 57.00 Median :34740 Median : 3.000 South :41502
## Mean : 82.68 Mean :35075 Mean : 3.284 West :33177
## 3rd Qu.: 57.00 3rd Qu.:41860 3rd Qu.: 4.000
## Max. :555.00 Max. :79600 Max. :15.000
## NA's :34238
## State Age Married Sex
## California :11570 Min. : 0.00 Divorced :11151 Female:67481
## Texas : 7077 1st Qu.:19.00 Married :55509 Male :63821
## New York : 5595 Median :39.00 Never Married:30772
## Florida : 5149 Mean :38.83 Separated : 2027
## Pennsylvania: 3930 3rd Qu.:57.00 Widowed : 6505
## Illinois : 3912 Max. :85.00 NA's :25338
## (Other) :94069
## Education Race
## High school :30906 American Indian : 1433
## Bachelor's degree :19443 Asian : 6520
## Some college, no degree:18863 Black : 13913
## No high school diploma :16095 Multiracial : 2897
## Associate degree : 9913 Pacific Islander: 618
## (Other) :10744 White :105921
## NA's :25338
## Hispanic Citizenship EmploymentStatus
## Min. :0.0000 Citizen, Native :116639 Disabled : 5712
## 1st Qu.:0.0000 Citizen, Naturalized: 7073 Employed :61733
## Median :0.0000 Non-Citizen : 7590 Not in Labor Force:15246
## Mean :0.1393 Retired :18619
## 3rd Qu.:0.0000 Unemployed : 4203
## Max. :1.0000 NA's :25789
##
## Industry
## Educational and health services :15017
## Trade : 8933
## Professional and business services: 7519
## Manufacturing : 6791
## Leisure and hospitality : 6364
## (Other) :21618
## NA's :65060
## MetroArea
## New York-Northern New Jersey-Long Island, NY-NJ-PA: 5409
## Washington-Arlington-Alexandria, DC-VA-MD-WV : 4177
## Los Angeles-Long Beach-Santa Ana, CA : 4102
## Philadelphia-Camden-Wilmington, PA-NJ-DE : 2855
## Chicago-Naperville-Joliet, IN-IN-WI : 2772
## (Other) :77749
## NA's :34238
## Country
## United States:115063
## Mexico : 3921
## Philippines : 839
## India : 770
## China : 581
## (Other) : 9952
## NA's : 176
#What is the name of the variable added to the CPS data frame by this merge operation?
#Ans:Country
#How many interviewees have a missing value for the new country of birth variable?
#Ans: 176
#Among all interviewees born outside of North America, which country was the most common place of birth?
head(sort(table(CPS$Country),decreasing=T)) #or
##
## United States Mexico Philippines India China
## 115063 3921 839 770 581
## Puerto Rico
## 518
head(summary(CPS$Country))
## United States Mexico Philippines India China
## 115063 3921 839 770 581
## Puerto Rico
## 518
#Ans:Philippines
#What proportion of the interviewees from the "New York-Northern New Jersey-Long Island, NY-NJ-PA" metropolitan area have a country of birth that is not the United States? For this computation, don't include people from this metropolitan area who have a missing country of birth.
tapply(CPS$Country!= "United States",CPS$MetroArea == "New York-Northern New Jersey-Long Island, NY-NJ-PA",mean,na.rm=TRUE)
## FALSE TRUE
## 0.1392772 0.3086603
#or
table(CPS$MetroArea =="New York-Northern New Jersey-Long Island, NY-NJ-PA", CPS$Country!="United States")
##
## FALSE TRUE
## FALSE 78757 12744
## TRUE 3736 1668
#Ans:0.3086603
#EXPLANATION:From table(CPS$MetroArea == "New York-Northern New Jersey-Long Island, NY-NJ-PA", CPS$Country != "United States"), we can see that 1668 of interviewees from this metropolitan area were born outside the United States and 3736 were born in the United States (it turns out an additional 5 have a missing country of origin). Therefore, the proportion is 1668/(1668+3736)=0.309.
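#Checking the arithmetic from the explanation directly:
1668/(1668+3736)
## [1] 0.3086603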
#Which metropolitan area has the largest number (note -- not proportion) of interviewees with a country of birth in India? Hint -- remember to include na.rm=TRUE if you are using tapply() to answer this question.
#To obtain the number of TRUE values in a vector of TRUE/FALSE values, you can use the sum() function. For instance, sum(c(TRUE, FALSE, TRUE, TRUE)) is 3. Therefore, we can obtain counts of people born in a particular country living in a particular metropolitan area with:
head(sort(tapply(CPS$Country == "India", CPS$MetroArea, sum, na.rm=TRUE),decreasing = T))
## New York-Northern New Jersey-Long Island, NY-NJ-PA
## 96
## Washington-Arlington-Alexandria, DC-VA-MD-WV
## 50
## Philadelphia-Camden-Wilmington, PA-NJ-DE
## 32
## Chicago-Naperville-Joliet, IN-IN-WI
## 31
## Detroit-Warren-Livonia, MI
## 30
## Atlanta-Sandy Springs-Marietta, GA
## 27
#Ans:New York-Northern New Jersey-Long Island, NY-NJ-PA
#In Brazil?
head(sort(tapply(CPS$Country == "Brazil", CPS$MetroArea, sum, na.rm=TRUE),decreasing = T))
## Boston-Cambridge-Quincy, MA-NH
## 18
## Miami-Fort Lauderdale-Miami Beach, FL
## 16
## Los Angeles-Long Beach-Santa Ana, CA
## 9
## Washington-Arlington-Alexandria, DC-VA-MD-WV
## 8
## Bridgeport-Stamford-Norwalk, CT
## 7
## New York-Northern New Jersey-Long Island, NY-NJ-PA
## 7
#Ans:Boston-Cambridge-Quincy, MA-NH
#In Somalia?
head(sort(tapply(CPS$Country == "Somalia", CPS$MetroArea, sum, na.rm=TRUE),decreasing = T))
## Minneapolis-St Paul-Bloomington, MN-WI
## 17
## Phoenix-Mesa-Scottsdale, AZ
## 7
## Seattle-Tacoma-Bellevue, WA
## 7
## St. Cloud, MN
## 7
## Columbus, OH
## 5
## Fargo, ND-MN
## 5
#Ans:Minneapolis-St Paul-Bloomington, MN-WI
INTERNET PRIVACY POLL (OPTIONAL)
Internet privacy has gained widespread attention in recent years. To measure the degree to which people are concerned about hot-button issues like Internet privacy, social scientists conduct polls in which they interview a large number of people about the topic. In this assignment, we will analyze data from a July 2013 Pew Internet and American Life Project poll on Internet anonymity and privacy, which involved interviews across the United States. While the full polling data can be found here, we will use a more limited version of the results, available in AnonymityPoll.csv.
#Lets load the data
poll<-read.csv("AnonymityPoll.csv")
#PROBLEM 1.1 - LOADING AND SUMMARIZING THE DATASET
#How many people participated in the poll?
nrow(poll) #or str(poll)
## [1] 1002
#Ans:1002
#Let's look at the breakdown of the number of people with smartphones using the table() and summary() commands on the Smartphone variable. (HINT: These three numbers should sum to 1002.)
#How many interviewees responded that they use a smartphone?
table(poll$Smartphone) #&
##
## 0 1
## 472 487
summary(poll)
## Internet.Use Smartphone Sex Age
## Min. :0.0000 Min. :0.0000 Female:505 Min. :18.00
## 1st Qu.:1.0000 1st Qu.:0.0000 Male :497 1st Qu.:37.00
## Median :1.0000 Median :1.0000 Median :55.00
## Mean :0.7742 Mean :0.5078 Mean :52.37
## 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:66.00
## Max. :1.0000 Max. :1.0000 Max. :96.00
## NA's :1 NA's :43 NA's :27
## State Region Conservativeness Info.On.Internet
## California :103 Midwest :239 Min. :1.000 Min. : 0.000
## Texas : 72 Northeast:166 1st Qu.:3.000 1st Qu.: 2.000
## New York : 60 South :359 Median :3.000 Median : 4.000
## Pennsylvania: 45 West :238 Mean :3.277 Mean : 3.795
## Florida : 42 3rd Qu.:4.000 3rd Qu.: 6.000
## Ohio : 38 Max. :5.000 Max. :11.000
## (Other) :642 NA's :62 NA's :210
## Worry.About.Info Privacy.Importance Anonymity.Possible
## Min. :0.0000 Min. : 0.00 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.: 41.43 1st Qu.:0.0000
## Median :0.0000 Median : 68.75 Median :0.0000
## Mean :0.4886 Mean : 62.85 Mean :0.3692
## 3rd Qu.:1.0000 3rd Qu.: 88.89 3rd Qu.:1.0000
## Max. :1.0000 Max. :100.00 Max. :1.0000
## NA's :212 NA's :215 NA's :249
## Tried.Masking.Identity Privacy.Laws.Effective
## Min. :0.0000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.0000
## Median :0.0000 Median :0.0000
## Mean :0.1633 Mean :0.2617
## 3rd Qu.:0.0000 3rd Qu.:1.0000
## Max. :1.0000 Max. :1.0000
## NA's :218 NA's :108
#Ans:487
#How many interviewees responded that they don't use a smartphone?
#Ans:472
#How many interviewees did not respond to the question, resulting in a missing value, or NA, in the summary() output?
#Ans:43
#EXPLANATION:From the output of table(poll$Smartphone), we can read that 487 interviewees use a smartphone and 472 do not. From the summary(poll$Smartphone) output, we see that another 43 had missing values. As a sanity check, 487+472+43=1002, the total number of interviewees.
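#A quick sanity check of these counts:
487+472+43
## [1] 1002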
#By using the table() function on two variables, we can tell how they are related. To use the table() function on two variables, just put the two variable names inside the parentheses, separated by a comma (don't forget to add poll$ before each variable name). In the output, the possible values of the first variable will be listed on the left, and the possible values of the second variable will be listed on the top. Each entry of the table counts the number of observations in the data set that have the value of the first variable in that row and the value of the second variable in that column.
#Which of the following are states in the Midwest census region? (Select all that apply.)
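#The most direct approach is to cross-tabulate State and Region (a sketch; output omitted here, each row is a state and each column a census region):
table(poll$State, poll$Region)
#Alternatively, subset to the Midwest first: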
mid<-poll[poll$Region=="Midwest",]
mid$Region<-factor(mid$Region)
table(mid$State,mid$Region)
##
## Midwest
## Alabama 0
## Arizona 0
## Arkansas 0
## California 0
## Colorado 0
## Connecticut 0
## Delaware 0
## District of Columbia 0
## Florida 0
## Georgia 0
## Idaho 0
## Illinois 32
## Indiana 27
## Iowa 14
## Kansas 14
## Kentucky 0
## Louisiana 0
## Maine 0
## Maryland 0
## Massachusetts 0
## Michigan 31
## Minnesota 15
## Mississippi 0
## Missouri 26
## Montana 0
## Nebraska 11
## Nevada 0
## New Hampshire 0
## New Jersey 0
## New Mexico 0
## New York 0
## North Carolina 0
## North Dakota 5
## Ohio 38
## Oklahoma 0
## Oregon 0
## Pennsylvania 0
## Rhode Island 0
## South Carolina 0
## South Dakota 3
## Tennessee 0
## Texas 0
## Utah 0
## Vermont 0
## Virginia 0
## Washington 0
## West Virginia 0
## Wisconsin 23
## Wyoming 0
#or
MidwestInterviewees<-subset(poll, Region=="Midwest")
table(MidwestInterviewees$State)
##
## Alabama Arizona Arkansas
## 0 0 0
## California Colorado Connecticut
## 0 0 0
## Delaware District of Columbia Florida
## 0 0 0
## Georgia Idaho Illinois
## 0 0 32
## Indiana Iowa Kansas
## 27 14 14
## Kentucky Louisiana Maine
## 0 0 0
## Maryland Massachusetts Michigan
## 0 0 31
## Minnesota Mississippi Missouri
## 15 0 26
## Montana Nebraska Nevada
## 0 11 0
## New Hampshire New Jersey New Mexico
## 0 0 0
## New York North Carolina North Dakota
## 0 0 5
## Ohio Oklahoma Oregon
## 38 0 0
## Pennsylvania Rhode Island South Carolina
## 0 0 0
## South Dakota Tennessee Texas
## 3 0 0
## Utah Vermont Virginia
## 0 0 0
## Washington West Virginia Wisconsin
## 0 0 23
## Wyoming
## 0
#Ans:Kansas, Missouri, Ohio
#Which was the state in the South census region with the largest number of interviewees?
sot<-poll[poll$Region=="South",]
sot$Region<-factor(sot$Region) #this step drops the unused levels from the factor variable
table(sot$State,sot$Region)
##
## South
## Alabama 11
## Arizona 0
## Arkansas 10
## California 0
## Colorado 0
## Connecticut 0
## Delaware 6
## District of Columbia 2
## Florida 42
## Georgia 34
## Idaho 0
## Illinois 0
## Indiana 0
## Iowa 0
## Kansas 0
## Kentucky 25
## Louisiana 17
## Maine 0
## Maryland 18
## Massachusetts 0
## Michigan 0
## Minnesota 0
## Mississippi 11
## Missouri 0
## Montana 0
## Nebraska 0
## Nevada 0
## New Hampshire 0
## New Jersey 0
## New Mexico 0
## New York 0
## North Carolina 32
## North Dakota 0
## Ohio 0
## Oklahoma 14
## Oregon 0
## Pennsylvania 0
## Rhode Island 0
## South Carolina 12
## South Dakota 0
## Tennessee 17
## Texas 72
## Utah 0
## Vermont 0
## Virginia 31
## Washington 0
## West Virginia 5
## Wisconsin 0
## Wyoming 0
#or
SouthInterviewees<-subset(poll, Region=="South")
table(SouthInterviewees$State)
##
## Alabama Arizona Arkansas
## 11 0 10
## California Colorado Connecticut
## 0 0 0
## Delaware District of Columbia Florida
## 6 2 42
## Georgia Idaho Illinois
## 34 0 0
## Indiana Iowa Kansas
## 0 0 0
## Kentucky Louisiana Maine
## 25 17 0
## Maryland Massachusetts Michigan
## 18 0 0
## Minnesota Mississippi Missouri
## 0 11 0
## Montana Nebraska Nevada
## 0 0 0
## New Hampshire New Jersey New Mexico
## 0 0 0
## New York North Carolina North Dakota
## 0 32 0
## Ohio Oklahoma Oregon
## 0 14 0
## Pennsylvania Rhode Island South Carolina
## 0 0 12
## South Dakota Tennessee Texas
## 0 17 72
## Utah Vermont Virginia
## 0 0 31
## Washington West Virginia Wisconsin
## 0 5 0
## Wyoming
## 0
#Ans:Texas
####################################################
#PROBLEM 2.1 - INTERNET AND SMARTPHONE USERS
#As mentioned in the introduction to this problem, many of the response variables (Info.On.Internet, Worry.About.Info, Privacy.Importance, Anonymity.Possible, and Tried.Masking.Identity) were not collected if an interviewee does not use the Internet or a smartphone, meaning the variables will have missing values for these interviewees.
#How many interviewees reported not having used the Internet and not having used a smartphone?
table(poll$Internet.Use,poll$Smartphone)
##
## 0 1
## 0 186 17
## 1 285 470
#Ans:186
#How many interviewees reported having used the Internet and having used a smartphone?
#Ans:470
#How many interviewees reported having used the Internet but not having used a smartphone?
#Ans:285
#How many interviewees reported having used a smartphone but not having used the Internet?
#Ans:17
#How many interviewees have a missing value for their Internet use?
summary(poll$Internet.Use)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.0000 1.0000 1.0000 0.7742 1.0000 1.0000 1
#Ans:1
#How many interviewees have a missing value for their smartphone use?
summary(poll$Smartphone)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.0000 0.0000 1.0000 0.5078 1.0000 1.0000 43
#Ans:43
#Use the subset function to obtain a data frame called "limited", which is limited to interviewees who reported Internet use or who reported smartphone use. In lecture, we used the & symbol to use two criteria to make a subset of the data. To only take observations that have a certain value in one variable or the other, the | character can be used in place of the & symbol. This is also called a logical "or" operation.
limited<-subset(poll,Internet.Use == 1 | Smartphone == 1)
nrow(limited)
## [1] 792
#How many interviewees are in the new data frame?
#Ans:792
#########################################
#PROBLEM 3.1 - SUMMARIZING OPINIONS ABOUT INTERNET PRIVACY
#Which variables have missing values in the limited data frame? (Select all that apply.)
colnames(limited)[colSums(is.na(limited)) > 0]
## [1] "Smartphone" "Age"
## [3] "Conservativeness" "Worry.About.Info"
## [5] "Privacy.Importance" "Anonymity.Possible"
## [7] "Tried.Masking.Identity" "Privacy.Laws.Effective"
#or
apply(is.na(limited), 2, any)
## Internet.Use Smartphone Sex
## FALSE TRUE FALSE
## Age State Region
## TRUE FALSE FALSE
## Conservativeness Info.On.Internet Worry.About.Info
## TRUE FALSE TRUE
## Privacy.Importance Anonymity.Possible Tried.Masking.Identity
## TRUE TRUE TRUE
## Privacy.Laws.Effective
## TRUE
#or
colnames(limited)[apply(is.na(limited), 2, any)]
## [1] "Smartphone" "Age"
## [3] "Conservativeness" "Worry.About.Info"
## [5] "Privacy.Importance" "Anonymity.Possible"
## [7] "Tried.Masking.Identity" "Privacy.Laws.Effective"
#or
which(apply(is.na(limited), 2, sum)>0) # gives col names as well as col no
## Smartphone Age Conservativeness
## 2 4 7
## Worry.About.Info Privacy.Importance Anonymity.Possible
## 9 10 11
## Tried.Masking.Identity Privacy.Laws.Effective
## 12 13
#or
summary(limited) #crude way
## Internet.Use Smartphone Sex Age
## Min. :0.0000 Min. :0.0000 Female:392 Min. :18.00
## 1st Qu.:1.0000 1st Qu.:0.0000 Male :400 1st Qu.:33.00
## Median :1.0000 Median :1.0000 Median :51.00
## Mean :0.9785 Mean :0.6308 Mean :48.57
## 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:62.00
## Max. :1.0000 Max. :1.0000 Max. :93.00
## NA's :20 NA's :22
## State Region Conservativeness Info.On.Internet
## California : 89 Midwest :172 Min. :1.000 Min. : 0.000
## Texas : 57 Northeast:128 1st Qu.:3.000 1st Qu.: 2.000
## New York : 45 South :288 Median :3.000 Median : 4.000
## Pennsylvania : 33 West :204 Mean :3.237 Mean : 3.795
## Florida : 32 3rd Qu.:4.000 3rd Qu.: 6.000
## North Carolina: 28 Max. :5.000 Max. :11.000
## (Other) :508 NA's :45
## Worry.About.Info Privacy.Importance Anonymity.Possible
## Min. :0.0000 Min. : 0.00 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.: 41.43 1st Qu.:0.0000
## Median :0.0000 Median : 68.75 Median :0.0000
## Mean :0.4886 Mean : 62.85 Mean :0.3692
## 3rd Qu.:1.0000 3rd Qu.: 88.89 3rd Qu.:1.0000
## Max. :1.0000 Max. :100.00 Max. :1.0000
## NA's :2 NA's :5 NA's :39
## Tried.Masking.Identity Privacy.Laws.Effective
## Min. :0.0000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.0000
## Median :0.0000 Median :0.0000
## Mean :0.1633 Mean :0.2559
## 3rd Qu.:0.0000 3rd Qu.:1.0000
## Max. :1.0000 Max. :1.0000
## NA's :8 NA's :65
#Ans:Smartphone, Age, Conservativeness, Worry.About.Info, Privacy.Importance, Anonymity.Possible, Tried.Masking.Identity, Privacy.Laws.Effective
#What is the average number of pieces of personal information on the Internet, according to the Info.On.Internet variable?
mean(limited$Info.On.Internet) #or summary(limited$Info.On.Internet)
## [1] 3.795455
#Ans:3.795455
#How many interviewees reported a value of 0 for Info.On.Internet?
table(limited$Info.On.Internet)
##
## 0 1 2 3 4 5 6 7 8 9 10 11
## 105 84 95 101 104 94 67 63 40 18 13 8
#Ans:105
#How many interviewees reported the maximum value of 11 for Info.On.Internet?
#Ans:8
#What proportion of interviewees who answered the Worry.About.Info question worry about how much information is available about them on the Internet? Note that to compute this proportion you will be dividing by the number of people who answered the Worry.About.Info question, not the total number of people in the data frame.
table(limited$Worry.About.Info)
##
## 0 1
## 404 386
prop.table(table(limited$Worry.About.Info))
##
## 0 1
## 0.5113924 0.4886076
#or crudely
table(limited$Worry.About.Info)/margin.table(table(limited$Worry.About.Info))
##
## 0 1
## 0.5113924 0.4886076
#or crudely
table(limited$Worry.About.Info,limited$Info.On.Internet)
##
## 0 1 2 3 4 5 6 7 8 9 10 11
## 0 71 50 44 48 47 38 34 32 17 12 7 4
## 1 34 33 51 53 57 55 33 31 23 6 6 4
sum(limited$Worry.About.Info=="1",na.rm=T)
## [1] 386
sum(limited$Worry.About.Info=="0",na.rm=T)
## [1] 404
386/(386+404)
## [1] 0.4886076
#Ans:0.4886076
#What proportion of interviewees who answered the Anonymity.Possible question think it is possible to be completely anonymous on the Internet?
prop.table(table(limited$Anonymity.Possible)) #or
##
## 0 1
## 0.6308101 0.3691899
summary(limited$Anonymity.Possible) #as the mean gives the prop of 1s
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.0000 0.0000 0.0000 0.3692 1.0000 1.0000 39
#Ans:0.3691899
#What proportion of interviewees who answered the Tried.Masking.Identity question have tried masking their identity on the Internet?
prop.table(table(limited$Tried.Masking.Identity)) #or
##
## 0 1
## 0.8367347 0.1632653
summary(limited$Tried.Masking.Identity)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.0000 0.0000 0.0000 0.1633 0.0000 1.0000 8
#Ans:0.1632653
#What proportion of interviewees who answered the Privacy.Laws.Effective question find United States privacy laws effective?
prop.table(table(limited$Privacy.Laws.Effective)) #or
##
## 0 1
## 0.7441541 0.2558459
summary(limited$Privacy.Laws.Effective)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.0000 0.0000 0.0000 0.2558 1.0000 1.0000 65
#Ans:0.2558459
##############################################
#PROBLEM 4.1 - RELATING DEMOGRAPHICS TO POLLING RESULTS
#Often, we are interested in whether certain characteristics of interviewees (e.g. their age or political opinions) affect their opinions on the topic of the poll (in this case, opinions on privacy). In this section, we will investigate the relationship between the characteristics Age and Smartphone and outcome variables Info.On.Internet and Tried.Masking.Identity, again using the limited data frame we built in an earlier section of this problem.
#Build a histogram of the age of interviewees. What is the best represented age group in the population?
hist(limited$Age)
#Ans:People aged about 60 years old
#Both Age and Info.On.Internet are variables that take on many values, so a good way to observe their relationship is through a graph. We learned in lecture that we can plot Age against Info.On.Internet with the command plot(limited$Age, limited$Info.On.Internet). However, because Info.On.Internet takes on a small number of values, multiple points can be plotted in exactly the same location on this graph.
plot(limited$Age, limited$Info.On.Internet)
#What is the largest number of interviewees that have exactly the same value in their Age variable AND the same value in their Info.On.Internet variable? In other words, what is the largest number of overlapping points in the plot plot(limited$Age, limited$Info.On.Internet)? (HINT: Use the table function to compare the number of observations with different values of Age and Info.On.Internet.)
max(table(limited$Age, limited$Info.On.Internet))
## [1] 6
#Ans:6
#To avoid points covering each other up, we can use the jitter() function on the values we pass to the plot function. Experimenting with the command jitter(c(1, 2, 3)), what appears to be the functionality of the jitter command?
jitter(c(1, 2, 3))
## [1] 1.016522 2.188608 3.189527
#Ans:jitter adds or subtracts a small amount of random noise to the values passed to it, and two runs will yield different results
#EXPLANATION:By running the command jitter(c(1, 2, 3)) multiple times, we can see that the jitter function randomly adds or subtracts a small value from each number, and two runs will yield different results.
#Now, plot Age against Info.On.Internet with plot(jitter(limited$Age), jitter(limited$Info.On.Internet)). What relationship do you observe between Age and Info.On.Internet?
plot(jitter(limited$Age), jitter(limited$Info.On.Internet))
abline(lm(limited$Info.On.Internet~limited$Age,limited))
#Ans:Older age seems moderately associated with a smaller value for Info.On.Internet
#EXPLANATION:For younger people aged 18-30, the average value of Info.On.Internet appears to be roughly 5, while most people aged 60 and older have a value less than 5. Therefore, older age appears to be associated with a smaller value of Info.On.Internet, but from the spread of dots on the image, it's clear the association is not particularly strong.
#Use the tapply() function to obtain the summary of the Info.On.Internet value, broken down by whether an interviewee is a smartphone user.
tapply(limited$Info.On.Internet,limited$Smartphone,summary)
## $`0`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 1.000 3.000 2.923 5.000 11.000
##
## $`1`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 2.000 4.000 4.368 6.000 11.000
#What is the average Info.On.Internet value for smartphone users?
#Ans:4.368 #The average for smartphone users from the summary output labeled with 1.
#What is the average Info.On.Internet value for non-smartphone users?
#Ans:2.923 #We can read the average for non-smartphone users from the summary output labeled with 0
#Similarly use tapply to break down the Tried.Masking.Identity variable for smartphone and non-smartphone users.
tapply(limited$Tried.Masking.Identity,limited$Smartphone,summary)
## $`0`
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.0000 0.0000 0.0000 0.1174 0.0000 1.0000 4
##
## $`1`
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.0000 0.0000 0.0000 0.1925 0.0000 1.0000 4
#Note: when we want the proportion of 1s in a 0/1 variable, all we need is the mean from the summary function, since the mean gives the proportion of TRUEs (1s). For example:
mean(c(TRUE,TRUE,FALSE,TRUE))
## [1] 0.75
#What proportion of smartphone users who answered the Tried.Masking.Identity question have tried masking their identity when using the Internet?
#Ans:0.1925
#What proportion of non-smartphone users who answered the Tried.Masking.Identity question have tried masking their identity when using the Internet?
#Ans:0.1174
#EXPLANATION:We can get the breakdown for smartphone and non-smartphone users with:
#tapply(limited$Tried.Masking.Identity, limited$Smartphone, table)
#Among smartphone users, 93 tried masking their identity and 390 did not, resulting in proportion 93/(93+390)=0.1925. Among non-smartphone users, 33 tried masking their identity and 248 did not, resulting in proportion 33/(33+248)=0.1174.This could have also been read from tapply(limited$Tried.Masking.Identity, limited$Smartphone, summary).
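#Checking these proportions directly:
93/(93+390)
## [1] 0.1925466
33/(33+248)
## [1] 0.1174377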
Video1:PREDICTING THE QUALITY OF WINE
The plots below show the relationship between two of the independent variables considered by Ashenfelter and the price of wine.
Wine_QQ1_Plot1
Q:What is the correct relationship between harvest rain, average growing season temperature, and wine prices?
Ans:More harvest rain is associated with a lower price, and higher temperatures are associated with a higher price
EXPLANATION:The plots show a positive trend between average growing season temperature and the wine price. While the trend is less clear between harvest rain and price, there is a slight negative association.
Video2:ONE-VARIABLE LINEAR REGRESSION
The following figure shows three data points and the best fit line
y = 3x + 2.
The x-coordinate, or “x”, is our independent variable and the y-coordinate, or “y”, is our dependent variable.
Wine_QQ2
#lets create the above scatter plot
x<-c(0,1,1)
y<-c(2,2,8)
plot(x,y,xlim = c(-3,3),ylim = c(0,10),pch=19)
#Baseline prediction
abline(h=mean(y),col="red",lwd=2)
text(-2,5,labels="Y=4,Baseline Prediction",cex=0.7)
#To get the SSE
#first fit the Linear model
fit1<-lm(y~x)
abline(lm(y~x),lwd=2,col="blue")
text(1.5,9,labels="Regression Line",cex = 0.7)
anova(fit1)
## Analysis of Variance Table
##
## Response: y
## Df Sum Sq Mean Sq F value Pr(>F)
## x 1 6 6 0.3333 0.6667
## Residuals 1 18 18
summary(fit1)
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
## 1 2 3
## -2.22e-16 -3.00e+00 3.00e+00
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.000 4.243 0.471 0.720
## x 3.000 5.196 0.577 0.667
##
## Residual standard error: 4.243 on 1 degrees of freedom
## Multiple R-squared: 0.25, Adjusted R-squared: -0.5
## F-statistic: 0.3333 on 1 and 1 DF, p-value: 0.6667
#Let's use the SDSFoundations package, which reproduces all of the above in a single call, producing the scatter plot plus a basic summary output
library(SDSFoundations)
##
## Attaching package: 'SDSFoundations'
## The following object is masked from 'package:lattice':
##
## histogram
linFit(x,y)
## Linear Fit
## Intercept = 2
## Slope = 3
## R-squared = 0.25
Q:What is the baseline prediction?
Ans:4
EXPLANATION:The baseline prediction is the average value of the dependent variable. Since our dependent variable takes values 2, 2, and 8 in our data set, the average is (2+2+8)/3 = 4.
Q:What is the Sum of Squared Errors (SSE) ?
Ans:18
EXPLANATION:The SSE is computed by summing the squared errors between the actual values and our predictions. For each value of the independent variable (x), our best fit line makes the following predictions:
If x = 0, y = 3(0) + 2 = 2,
If x = 1, y = 3(1) + 2 = 5.
Thus we make an error of 0 for the data point (0,2), an error of 3 for the data point (1,2), and an error of 3 for the data point (1,8). So we have
SSE = 0² + 3² + 3² = 18.
Q:What is the Total Sum of Squares (SST) ?
Ans:24
EXPLANATION:The SST is computed by summing the squared errors between the actual values and the baseline prediction. From the first question, we computed the baseline prediction to be 4. Thus the SST is:
SST = (2 - 4)² + (2 - 4)² + (8 - 4)² = 24.
Q:What is the R² of the model?
Ans:0.25
EXPLANATION: The R² formula is:R² = 1 - SSE/SST
Thus using our answers to the previous questions, we have that
R² = 1 - 18/24 = 0.25.
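The same quantities can be computed directly in R. A small sketch using the three toy points (x and y) defined above:
pred<-3*x + 2                       #predictions from the best fit line y = 3x + 2
SSE<-sum((y - pred)^2)              #sum of squared errors
SST<-sum((y - mean(y))^2)           #total sum of squares around the baseline prediction
c(SSE=SSE, SST=SST, R2=1 - SSE/SST) #18, 24 and 0.25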
Video3:MULTIPLE LINEAR REGRESSION
Q:Suppose we add another variable, Average Winter Temperature, to our model to predict wine price. Is it possible for the model’s R² value to go down from 0.83 to 0.80?
Ans:No, the model’s R² value can not decrease at all by adding new variables.
EXPLANATION:The model’s R² value can never decrease from adding new variables to the model. This is due to the fact that it is always possible to set the coefficient for the new variable to zero in the new model. However, this would be the same as the old model. So the only reason to make the coefficient non-zero is if it improves the R² value of the model, since linear regression picks the coefficients to minimize the error terms, which is the same as maximizing the R².
Video4:LINEAR REGRESSION IN R
A script file containing all of the R commands used in this lecture is reproduced below
# VIDEO 4
# Read in data
wine<-read.csv("wine.csv")
str(wine)
## 'data.frame': 25 obs. of 7 variables:
## $ Year : int 1952 1953 1955 1957 1958 1959 1960 1961 1962 1963 ...
## $ Price : num 7.5 8.04 7.69 6.98 6.78 ...
## $ WinterRain : int 600 690 502 420 582 485 763 830 697 608 ...
## $ AGST : num 17.1 16.7 17.1 16.1 16.4 ...
## $ HarvestRain: int 160 80 130 110 187 187 290 38 52 155 ...
## $ Age : int 31 30 28 26 25 24 23 22 21 20 ...
## $ FrancePop : num 43184 43495 44218 45152 45654 ...
summary(wine)
## Year Price WinterRain AGST
## Min. :1952 Min. :6.205 Min. :376.0 Min. :14.98
## 1st Qu.:1960 1st Qu.:6.519 1st Qu.:536.0 1st Qu.:16.20
## Median :1966 Median :7.121 Median :600.0 Median :16.53
## Mean :1966 Mean :7.067 Mean :605.3 Mean :16.51
## 3rd Qu.:1972 3rd Qu.:7.495 3rd Qu.:697.0 3rd Qu.:17.07
## Max. :1978 Max. :8.494 Max. :830.0 Max. :17.65
## HarvestRain Age FrancePop
## Min. : 38.0 Min. : 5.0 Min. :43184
## 1st Qu.: 89.0 1st Qu.:11.0 1st Qu.:46584
## Median :130.0 Median :17.0 Median :50255
## Mean :148.6 Mean :17.2 Mean :49694
## 3rd Qu.:187.0 3rd Qu.:23.0 3rd Qu.:52894
## Max. :292.0 Max. :31.0 Max. :54602
# Linear Regression (one variable)
model1<-lm(Price ~ AGST, data=wine)
summary(model1)
##
## Call:
## lm(formula = Price ~ AGST, data = wine)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.78450 -0.23882 -0.03727 0.38992 0.90318
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.4178 2.4935 -1.371 0.183710
## AGST 0.6351 0.1509 4.208 0.000335 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4993 on 23 degrees of freedom
## Multiple R-squared: 0.435, Adjusted R-squared: 0.4105
## F-statistic: 17.71 on 1 and 23 DF, p-value: 0.000335
# Sum of Squared Errors
model1$residuals
## 1 2 3 4 5 6
## 0.04204258 0.82983774 0.21169394 0.15609432 -0.23119140 0.38991701
## 7 8 9 10 11 12
## -0.48959140 0.90318115 0.45372410 0.14887461 -0.23882157 -0.08974238
## 13 14 15 16 17 18
## 0.66185660 -0.05211511 -0.62726647 -0.74714947 0.42113502 -0.03727441
## 19 20 21 22 23 24
## 0.10685278 -0.78450270 -0.64017590 -0.05508720 -0.67055321 -0.22040381
## 25
## 0.55866518
SSE<-sum(model1$residuals^2)
SSE
## [1] 5.734875
# Linear Regression (two variables)
model2<-lm(Price ~ AGST + HarvestRain, data=wine)
summary(model2)
##
## Call:
## lm(formula = Price ~ AGST + HarvestRain, data = wine)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.88321 -0.19600 0.06178 0.15379 0.59722
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.20265 1.85443 -1.188 0.247585
## AGST 0.60262 0.11128 5.415 1.94e-05 ***
## HarvestRain -0.00457 0.00101 -4.525 0.000167 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3674 on 22 degrees of freedom
## Multiple R-squared: 0.7074, Adjusted R-squared: 0.6808
## F-statistic: 26.59 on 2 and 22 DF, p-value: 1.347e-06
# Sum of Squared Errors
SSE<-sum(model2$residuals^2)
SSE
## [1] 2.970373
# Linear Regression (all variables)
model3<-lm(Price ~ AGST + HarvestRain + WinterRain + Age + FrancePop, data=wine)
summary(model3)
##
## Call:
## lm(formula = Price ~ AGST + HarvestRain + WinterRain + Age +
## FrancePop, data = wine)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.48179 -0.24662 -0.00726 0.22012 0.51987
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.504e-01 1.019e+01 -0.044 0.965202
## AGST 6.012e-01 1.030e-01 5.836 1.27e-05 ***
## HarvestRain -3.958e-03 8.751e-04 -4.523 0.000233 ***
## WinterRain 1.043e-03 5.310e-04 1.963 0.064416 .
## Age 5.847e-04 7.900e-02 0.007 0.994172
## FrancePop -4.953e-05 1.667e-04 -0.297 0.769578
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3019 on 19 degrees of freedom
## Multiple R-squared: 0.8294, Adjusted R-squared: 0.7845
## F-statistic: 18.47 on 5 and 19 DF, p-value: 1.044e-06
# Sum of Squared Errors
SSE<-sum(model3$residuals^2)
SSE
## [1] 1.732113
####################################################
#In R, use the dataset wine.csv to create a linear regression model to predict Price using HarvestRain and WinterRain as independent variables. Using the summary output of this model, answer the following questions:
fit2<-lm(Price ~ HarvestRain + WinterRain, data=wine)
summary(fit2)
##
## Call:
## lm(formula = Price ~ HarvestRain + WinterRain, data = wine)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.0933 -0.3222 -0.1012 0.3871 1.1877
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.865e+00 6.616e-01 11.888 4.76e-11 ***
## HarvestRain -4.971e-03 1.601e-03 -3.105 0.00516 **
## WinterRain -9.848e-05 9.007e-04 -0.109 0.91392
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5611 on 22 degrees of freedom
## Multiple R-squared: 0.3177, Adjusted R-squared: 0.2557
## F-statistic: 5.122 on 2 and 22 DF, p-value: 0.01492
#What is the "Multiple R-squared" value of your model?
#Ans: 0.3177
#What is the coefficient for HarvestRain?
#Ans:-4.971e-03, i.e. -4.971 x 10^-3
#What is the intercept coefficient?
#Ans:7.865
#EXPLANATION:In R, create the model by typing the following line into your R console:
#modelQQ4 = lm(Price ~ HarvestRain + WinterRain, data=wine)
#Then, look at the output of summary(modelQQ4). The Multiple R-squared is listed at the bottom of the output, and the coefficients can be found in the coefficients table.
Video5:UNDERSTANDING THE MODEL
A script file containing all of the R commands used in this lecture is reproduced below
# VIDEO 5
wine<-read.csv("wine.csv")
model3<-lm(Price ~ AGST + HarvestRain + WinterRain + Age + FrancePop, data=wine)
summary(model3)
##
## Call:
## lm(formula = Price ~ AGST + HarvestRain + WinterRain + Age +
## FrancePop, data = wine)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.48179 -0.24662 -0.00726 0.22012 0.51987
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.504e-01 1.019e+01 -0.044 0.965202
## AGST 6.012e-01 1.030e-01 5.836 1.27e-05 ***
## HarvestRain -3.958e-03 8.751e-04 -4.523 0.000233 ***
## WinterRain 1.043e-03 5.310e-04 1.963 0.064416 .
## Age 5.847e-04 7.900e-02 0.007 0.994172
## FrancePop -4.953e-05 1.667e-04 -0.297 0.769578
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3019 on 19 degrees of freedom
## Multiple R-squared: 0.8294, Adjusted R-squared: 0.7845
## F-statistic: 18.47 on 5 and 19 DF, p-value: 1.044e-06
# Remove FrancePop
model4<-lm(Price ~ AGST + HarvestRain + WinterRain + Age, data=wine)
summary(model4)
##
## Call:
## lm(formula = Price ~ AGST + HarvestRain + WinterRain + Age, data = wine)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.45470 -0.24273 0.00752 0.19773 0.53637
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.4299802 1.7658975 -1.942 0.066311 .
## AGST 0.6072093 0.0987022 6.152 5.2e-06 ***
## HarvestRain -0.0039715 0.0008538 -4.652 0.000154 ***
## WinterRain 0.0010755 0.0005073 2.120 0.046694 *
## Age 0.0239308 0.0080969 2.956 0.007819 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.295 on 20 degrees of freedom
## Multiple R-squared: 0.8286, Adjusted R-squared: 0.7943
## F-statistic: 24.17 on 4 and 20 DF, p-value: 2.036e-07
#########################################
#Use the dataset wine.csv to create a linear regression model to predict Price using HarvestRain and WinterRain as independent variables, like you did in the previous quick question. Using the summary output of this model, answer the following questions:
fit2<-lm(Price ~ HarvestRain + WinterRain, data=wine)
summary(fit2)
##
## Call:
## lm(formula = Price ~ HarvestRain + WinterRain, data = wine)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.0933 -0.3222 -0.1012 0.3871 1.1877
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.865e+00 6.616e-01 11.888 4.76e-11 ***
## HarvestRain -4.971e-03 1.601e-03 -3.105 0.00516 **
## WinterRain -9.848e-05 9.007e-04 -0.109 0.91392
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5611 on 22 degrees of freedom
## Multiple R-squared: 0.3177, Adjusted R-squared: 0.2557
## F-statistic: 5.122 on 2 and 22 DF, p-value: 0.01492
#Is the coefficient for HarvestRain significant?
#Ans:Yes
#Is the coefficient for WinterRain significant?
#Ans:No
Video6:CORRELATION AND MULTICOLLINEARITY
A script file containing all of the R commands used in this lecture is reproduced below
# VIDEO 6
wine<-read.csv("wine.csv")
# Correlations
cor(wine$WinterRain, wine$Price)
## [1] 0.1366505
cor(wine$Age, wine$FrancePop)
## [1] -0.9944851
cor(wine) #correlation matrix
## Year Price WinterRain AGST HarvestRain
## Year 1.00000000 -0.4477679 0.016970024 -0.24691585 0.02800907
## Price -0.44776786 1.0000000 0.136650547 0.65956286 -0.56332190
## WinterRain 0.01697002 0.1366505 1.000000000 -0.32109061 -0.27544085
## AGST -0.24691585 0.6595629 -0.321090611 1.00000000 -0.06449593
## HarvestRain 0.02800907 -0.5633219 -0.275440854 -0.06449593 1.00000000
## Age -1.00000000 0.4477679 -0.016970024 0.24691585 -0.02800907
## FrancePop 0.99448510 -0.4668616 -0.001621627 -0.25916227 0.04126439
## Age FrancePop
## Year -1.00000000 0.994485097
## Price 0.44776786 -0.466861641
## WinterRain -0.01697002 -0.001621627
## AGST 0.24691585 -0.259162274
## HarvestRain -0.02800907 0.041264394
## Age 1.00000000 -0.994485097
## FrancePop -0.99448510 1.000000000
# Remove Age and FrancePop as they were highly correlated
model5<-lm(Price ~ AGST + HarvestRain + WinterRain, data=wine)
summary(model5)
##
## Call:
## lm(formula = Price ~ AGST + HarvestRain + WinterRain, data = wine)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.67472 -0.12958 0.01973 0.20751 0.63846
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.3016263 2.0366743 -2.112 0.046831 *
## AGST 0.6810242 0.1117011 6.097 4.75e-06 ***
## HarvestRain -0.0039481 0.0009987 -3.953 0.000726 ***
## WinterRain 0.0011765 0.0005920 1.987 0.060097 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.345 on 21 degrees of freedom
## Multiple R-squared: 0.7537, Adjusted R-squared: 0.7185
## F-statistic: 21.42 on 3 and 21 DF, p-value: 1.359e-06
#Removing both of the correlated IVs (Age & FrancePop) lowers R-squared and adjusted R-squared. Hence we should ultimately remove only FrancePop, since intuitively the age of a wine is an important predictor of its price.
##########################################
#Using the data set wine.csv, what is the correlation between HarvestRain and WinterRain?
cor(wine$HarvestRain,wine$WinterRain)
## [1] -0.2754409
#Ans:-0.2754409
Video7:MAKING PREDICTIONS
A script file containing all of the R commands used in this lecture is reproduced below
# VIDEO 7
# Read in test set
wineTest<-read.csv("wine_test.csv")
str(wineTest)
## 'data.frame': 2 obs. of 7 variables:
## $ Year : int 1979 1980
## $ Price : num 6.95 6.5
## $ WinterRain : int 717 578
## $ AGST : num 16.2 16
## $ HarvestRain: int 122 74
## $ Age : int 4 3
## $ FrancePop : num 54836 55110
# Make test set predictions using predict()
predictTest<-predict(model4, newdata=wineTest)
predictTest
## 1 2
## 6.768925 6.684910
# Compute R-squared
SSE<-sum((wineTest$Price - predictTest)^2)
SST<-sum((wineTest$Price - mean(wine$Price))^2)
1 - SSE/SST
## [1] 0.7944278
Q:Which of the following are NOT valid values for an out-of-sample (test set) R² ? Select all that apply.
Ans:2.4
EXPLANATION:The formula for R² is
R² = 1 - SSE/SST,where SST is calculated using the average value of the dependent variable on the training set.Since SSE and SST are the sums of squared terms, we know that both will be positive. Thus SSE/SST must be greater than or equal to zero. This means it is not possible to have an out-of-sample R² value of 2.4.However, all other values are valid (even the negative ones!), since SSE can be more or less than SST, due to the fact that this is an out-of-sample R², not a model R².
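For intuition, a toy sketch with made-up sums of squares: whenever the test-set SSE exceeds SST, the out-of-sample R² is negative, but since SSE/SST is never negative, R² can never exceed 1.
SSE<-30; SST<-24      #hypothetical test-set sums of squares
1 - SSE/SST           #-0.25, a perfectly valid (negative) out-of-sample R-squared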
VIDEO 2: MAKING IT TO THE PLAYOFFS
A script file containing all of the R commands used in this lecture is reproduced below
# VIDEO 2
# Read in data
baseball<-read.csv("baseball.csv")
str(baseball)
## 'data.frame': 1232 obs. of 15 variables:
## $ Team : Factor w/ 39 levels "ANA","ARI","ATL",..: 2 3 4 5 7 8 9 10 11 12 ...
## $ League : Factor w/ 2 levels "AL","NL": 2 2 1 1 2 1 2 1 2 1 ...
## $ Year : int 2012 2012 2012 2012 2012 2012 2012 2012 2012 2012 ...
## $ RS : int 734 700 712 734 613 748 669 667 758 726 ...
## $ RA : int 688 600 705 806 759 676 588 845 890 670 ...
## $ W : int 81 94 93 69 61 85 97 68 64 88 ...
## $ OBP : num 0.328 0.32 0.311 0.315 0.302 0.318 0.315 0.324 0.33 0.335 ...
## $ SLG : num 0.418 0.389 0.417 0.415 0.378 0.422 0.411 0.381 0.436 0.422 ...
## $ BA : num 0.259 0.247 0.247 0.26 0.24 0.255 0.251 0.251 0.274 0.268 ...
## $ Playoffs : int 0 1 1 0 0 0 1 0 0 1 ...
## $ RankSeason : int NA 4 5 NA NA NA 2 NA NA 6 ...
## $ RankPlayoffs: int NA 5 4 NA NA NA 4 NA NA 2 ...
## $ G : int 162 162 162 162 162 162 162 162 162 162 ...
## $ OOBP : num 0.317 0.306 0.315 0.331 0.335 0.319 0.305 0.336 0.357 0.314 ...
## $ OSLG : num 0.415 0.378 0.403 0.428 0.424 0.405 0.39 0.43 0.47 0.402 ...
# Subset to only include moneyball years before 2002
moneyball<-subset(baseball, Year < 2002)
str(moneyball)
## 'data.frame': 902 obs. of 15 variables:
## $ Team : Factor w/ 39 levels "ANA","ARI","ATL",..: 1 2 3 4 5 7 8 9 10 11 ...
## $ League : Factor w/ 2 levels "AL","NL": 1 2 2 1 1 2 1 2 1 2 ...
## $ Year : int 2001 2001 2001 2001 2001 2001 2001 2001 2001 2001 ...
## $ RS : int 691 818 729 687 772 777 798 735 897 923 ...
## $ RA : int 730 677 643 829 745 701 795 850 821 906 ...
## $ W : int 75 92 88 63 82 88 83 66 91 73 ...
## $ OBP : num 0.327 0.341 0.324 0.319 0.334 0.336 0.334 0.324 0.35 0.354 ...
## $ SLG : num 0.405 0.442 0.412 0.38 0.439 0.43 0.451 0.419 0.458 0.483 ...
## $ BA : num 0.261 0.267 0.26 0.248 0.266 0.261 0.268 0.262 0.278 0.292 ...
## $ Playoffs : int 0 1 1 0 0 0 0 0 1 0 ...
## $ RankSeason : int NA 5 7 NA NA NA NA NA 6 NA ...
## $ RankPlayoffs: int NA 1 3 NA NA NA NA NA 4 NA ...
## $ G : int 162 162 162 162 161 162 162 162 162 162 ...
## $ OOBP : num 0.331 0.311 0.314 0.337 0.329 0.321 0.334 0.341 0.341 0.35 ...
## $ OSLG : num 0.412 0.404 0.384 0.439 0.393 0.398 0.427 0.455 0.417 0.48 ...
# Compute Run Difference and add this new var to the dataframe
moneyball$RD<-moneyball$RS - moneyball$RA
str(moneyball)
## 'data.frame': 902 obs. of 16 variables:
## $ Team : Factor w/ 39 levels "ANA","ARI","ATL",..: 1 2 3 4 5 7 8 9 10 11 ...
## $ League : Factor w/ 2 levels "AL","NL": 1 2 2 1 1 2 1 2 1 2 ...
## $ Year : int 2001 2001 2001 2001 2001 2001 2001 2001 2001 2001 ...
## $ RS : int 691 818 729 687 772 777 798 735 897 923 ...
## $ RA : int 730 677 643 829 745 701 795 850 821 906 ...
## $ W : int 75 92 88 63 82 88 83 66 91 73 ...
## $ OBP : num 0.327 0.341 0.324 0.319 0.334 0.336 0.334 0.324 0.35 0.354 ...
## $ SLG : num 0.405 0.442 0.412 0.38 0.439 0.43 0.451 0.419 0.458 0.483 ...
## $ BA : num 0.261 0.267 0.26 0.248 0.266 0.261 0.268 0.262 0.278 0.292 ...
## $ Playoffs : int 0 1 1 0 0 0 0 0 1 0 ...
## $ RankSeason : int NA 5 7 NA NA NA NA NA 6 NA ...
## $ RankPlayoffs: int NA 1 3 NA NA NA NA NA 4 NA ...
## $ G : int 162 162 162 162 161 162 162 162 162 162 ...
## $ OOBP : num 0.331 0.311 0.314 0.337 0.329 0.321 0.334 0.341 0.341 0.35 ...
## $ OSLG : num 0.412 0.404 0.384 0.439 0.393 0.398 0.427 0.455 0.417 0.48 ...
## $ RD : int -39 141 86 -142 27 76 3 -115 76 17 ...
# Scatterplot to check for linear relationship
plot(moneyball$RD, moneyball$W)
# Regression model to predict wins
WinsReg<-lm(W ~ RD, data=moneyball)
summary(WinsReg)
##
## Call:
## lm(formula = W ~ RD, data = moneyball)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.2662 -2.6509 0.1234 2.9364 11.6570
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 80.881375 0.131157 616.67 <2e-16 ***
## RD 0.105766 0.001297 81.55 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.939 on 900 degrees of freedom
## Multiple R-squared: 0.8808, Adjusted R-squared: 0.8807
## F-statistic: 6651 on 1 and 900 DF, p-value: < 2.2e-16
##########################################
#If a baseball team scores 713 runs and allows 614 runs, how many games do we expect the team to win?
#Using the linear regression model constructed during the lecture, enter the number of games we expect the team to win:
library(SDSFoundations)
rdiff<-713-614
linFitPred(moneyball$RD,moneyball$W,rdiff)
#Ans:91.352
#EXPLANATION:Our linear regression model was
#Wins = 80.88 + 0.1058*(Run Difference)
#Here, the run difference is 99, so our prediction is
#Wins = 80.88 + 0.1058*99 = 91 games.
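#The same prediction can be obtained with predict() (a sketch, using the WinsReg model fitted above):
predict(WinsReg, newdata=data.frame(RD=99))   #about 91.35 wins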
VIDEO 3: PREDICTING RUNS
A script file containing all of the R commands used in this lecture is reproduced below
# VIDEO 3
# Read in data
baseball<-read.csv("baseball.csv")
# Subset to only include moneyball years before 2002
moneyball<-subset(baseball, Year < 2002)
str(moneyball)
## 'data.frame': 902 obs. of 15 variables:
## $ Team : Factor w/ 39 levels "ANA","ARI","ATL",..: 1 2 3 4 5 7 8 9 10 11 ...
## $ League : Factor w/ 2 levels "AL","NL": 1 2 2 1 1 2 1 2 1 2 ...
## $ Year : int 2001 2001 2001 2001 2001 2001 2001 2001 2001 2001 ...
## $ RS : int 691 818 729 687 772 777 798 735 897 923 ...
## $ RA : int 730 677 643 829 745 701 795 850 821 906 ...
## $ W : int 75 92 88 63 82 88 83 66 91 73 ...
## $ OBP : num 0.327 0.341 0.324 0.319 0.334 0.336 0.334 0.324 0.35 0.354 ...
## $ SLG : num 0.405 0.442 0.412 0.38 0.439 0.43 0.451 0.419 0.458 0.483 ...
## $ BA : num 0.261 0.267 0.26 0.248 0.266 0.261 0.268 0.262 0.278 0.292 ...
## $ Playoffs : int 0 1 1 0 0 0 0 0 1 0 ...
## $ RankSeason : int NA 5 7 NA NA NA NA NA 6 NA ...
## $ RankPlayoffs: int NA 1 3 NA NA NA NA NA 4 NA ...
## $ G : int 162 162 162 162 161 162 162 162 162 162 ...
## $ OOBP : num 0.331 0.311 0.314 0.337 0.329 0.321 0.334 0.341 0.341 0.35 ...
## $ OSLG : num 0.412 0.404 0.384 0.439 0.393 0.398 0.427 0.455 0.417 0.48 ...
# Compute Run Difference and add this new var to the dataframe
moneyball$RD<-moneyball$RS - moneyball$RA
str(moneyball)
## 'data.frame': 902 obs. of 16 variables:
## $ Team : Factor w/ 39 levels "ANA","ARI","ATL",..: 1 2 3 4 5 7 8 9 10 11 ...
## $ League : Factor w/ 2 levels "AL","NL": 1 2 2 1 1 2 1 2 1 2 ...
## $ Year : int 2001 2001 2001 2001 2001 2001 2001 2001 2001 2001 ...
## $ RS : int 691 818 729 687 772 777 798 735 897 923 ...
## $ RA : int 730 677 643 829 745 701 795 850 821 906 ...
## $ W : int 75 92 88 63 82 88 83 66 91 73 ...
## $ OBP : num 0.327 0.341 0.324 0.319 0.334 0.336 0.334 0.324 0.35 0.354 ...
## $ SLG : num 0.405 0.442 0.412 0.38 0.439 0.43 0.451 0.419 0.458 0.483 ...
## $ BA : num 0.261 0.267 0.26 0.248 0.266 0.261 0.268 0.262 0.278 0.292 ...
## $ Playoffs : int 0 1 1 0 0 0 0 0 1 0 ...
## $ RankSeason : int NA 5 7 NA NA NA NA NA 6 NA ...
## $ RankPlayoffs: int NA 1 3 NA NA NA NA NA 4 NA ...
## $ G : int 162 162 162 162 161 162 162 162 162 162 ...
## $ OOBP : num 0.331 0.311 0.314 0.337 0.329 0.321 0.334 0.341 0.341 0.35 ...
## $ OSLG : num 0.412 0.404 0.384 0.439 0.393 0.398 0.427 0.455 0.417 0.48 ...
## $ RD : int -39 141 86 -142 27 76 3 -115 76 17 ...
# Regression model to predict runs scored
RunsReg<-lm(RS ~ OBP + SLG + BA, data=moneyball)
summary(RunsReg) #The BA coefficient is negative, implying that a team would score more runs with a lower BA, which is counterintuitive. This must be due to multicollinearity, so let's remove BA from the model.
##
## Call:
## lm(formula = RS ~ OBP + SLG + BA, data = moneyball)
##
## Residuals:
## Min 1Q Median 3Q Max
## -70.941 -17.247 -0.621 16.754 90.998
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -788.46 19.70 -40.029 < 2e-16 ***
## OBP 2917.42 110.47 26.410 < 2e-16 ***
## SLG 1637.93 45.99 35.612 < 2e-16 ***
## BA -368.97 130.58 -2.826 0.00482 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 24.69 on 898 degrees of freedom
## Multiple R-squared: 0.9302, Adjusted R-squared: 0.93
## F-statistic: 3989 on 3 and 898 DF, p-value: < 2.2e-16
RunsReg<-lm(RS ~ OBP + SLG, data=moneyball)
summary(RunsReg)
##
## Call:
## lm(formula = RS ~ OBP + SLG, data = moneyball)
##
## Residuals:
## Min 1Q Median 3Q Max
## -70.838 -17.174 -1.108 16.770 90.036
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -804.63 18.92 -42.53 <2e-16 ***
## OBP 2737.77 90.68 30.19 <2e-16 ***
## SLG 1584.91 42.16 37.60 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 24.79 on 899 degrees of freedom
## Multiple R-squared: 0.9296, Adjusted R-squared: 0.9294
## F-statistic: 5934 on 2 and 899 DF, p-value: < 2.2e-16
#This is a simple (parsimonious) model with significant coefficients and a high R-squared. Also, since OBP and SLG are on the same scale, OBP's much larger coefficient implies that OBP is probably worth more than SLG.
#########################################
#If a baseball team's OBP is 0.311 and SLG is 0.405, how many runs do we expect the team to score?
#Using the linear regression model constructed during the lecture (the one that uses OBP and SLG as independent variables), enter the number of runs we expect the team to score:
summary(RunsReg)
##
## Call:
## lm(formula = RS ~ OBP + SLG, data = moneyball)
##
## Residuals:
## Min 1Q Median 3Q Max
## -70.838 -17.174 -1.108 16.770 90.036
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -804.63 18.92 -42.53 <2e-16 ***
## OBP 2737.77 90.68 30.19 <2e-16 ***
## SLG 1584.91 42.16 37.60 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 24.79 on 899 degrees of freedom
## Multiple R-squared: 0.9296, Adjusted R-squared: 0.9294
## F-statistic: 5934 on 2 and 899 DF, p-value: < 2.2e-16
-804.63+(0.311* 2737.77)+(0.405*1584.91)#from summary o/p
## [1] 688.705
#Ans:689
#EXPLANATION:Our linear regression model was:
#Runs Scored = -804.63 + 2737.77*(OBP) + 1584.91*(SLG)
#Here, OBP is 0.311 and SLG is 0.405, so our prediction is:
#Runs Scored = -804.63 + 2737.77*0.311 + 1584.91*0.405 = 689 runs
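#Equivalently, using predict() with the fitted model from above:
predict(RunsReg, newdata=data.frame(OBP=0.311, SLG=0.405))   #about 689 runs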
#If a baseball team's opponents' OBP (OOBP) is 0.297 and opponents' SLG (OSLG) is 0.370, how many runs do we expect the team to allow?
#Using the linear regression model discussed during the lecture (the one on the last slide of the previous video), enter the number of runs we expect the team to allow:
-837.38+( 0.297*2913.60)+(0.370*1514.29)
## [1] 588.2465
#Ans:588
#EXPLANATION:Our linear regression model was:
#Runs Allowed = -837.38 + 2913.60*(OOBP) + 1514.29*(OSLG)
#Here, OOBP is 0.297 and OSLG is 0.370, so our prediction is:
#Runs Allowed = -837.38 + 2913.60*(.297) + 1514.29*(.370) = 588 runs
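#The coefficients above come from the Runs Allowed model on the lecture slide. A sketch of how it can be fit from the moneyball data (lm() drops the rows where OOBP or OSLG is missing):
RunsAllowedReg<-lm(RA ~ OOBP + OSLG, data=moneyball)
coef(RunsAllowedReg)   #intercept about -837.4, OOBP about 2913.6, OSLG about 1514.3
predict(RunsAllowedReg, newdata=data.frame(OOBP=0.297, OSLG=0.370))   #about 588 runs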
VIDEO 4: USING THE MODELS TO MAKE PREDICTIONS
Q:Suppose you are the General Manager of a baseball team, and you are selecting TWO players for your team. You have a budget of $1,500,000, and you have the choice between the following players:
Player Name       OBP    SLG    Salary
Eric Chavez       0.338  0.540  $1,400,000
Jeremy Giambi     0.391  0.450  $1,065,000
Frank Menechino   0.369  0.374  $295,000
Greg Myers        0.313  0.447  $800,000
Carlos Pena       0.361  0.500  $300,000
Given your budget and the player statistics, which TWO players would you select?
Ans:Jeremy Giambi & Carlos Pena
EXPLANATION:We would select Jeremy Giambi and Carlos Pena, since they give the highest contribution to Runs Scored.We would not select Eric Chavez, since his salary consumes our entire budget, and although he has the highest SLG, there are players with better OBP.We would not select Frank Menechino since even though he has a high OBP, his SLG is low.We would not select Greg Myers since he is dominated by Carlos Pena in OBP and SLG, but has a much higher salary.
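A quick sketch of this comparison using the Runs Scored coefficients from earlier (the intercept is omitted since it is the same for every player, so it does not affect the ranking):
players<-data.frame(Name=c("Eric Chavez","Jeremy Giambi","Frank Menechino","Greg Myers","Carlos Pena"),
                    OBP=c(0.338,0.391,0.369,0.313,0.361),
                    SLG=c(0.540,0.450,0.374,0.447,0.500),
                    Salary=c(1400000,1065000,295000,800000,300000))
players$RunsContrib<-2737.77*players$OBP + 1584.91*players$SLG
players[order(-players$RunsContrib),]   #Giambi, Chavez and Pena contribute the most, but Chavez's salary breaks the budget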
VIDEO 5: WINNING THE WORLD SERIES
In 2012 and 2013, there were 10 teams in the MLB playoffs: the six teams that had the most wins in each baseball division, and four “wild card” teams. The playoffs start between the four wild card teams - the two teams that win proceed in the playoffs (8 teams remaining). Then, these teams are paired off and play a series of games. The four teams that win are then paired and play to determine who will play in the World Series.
We can assign rankings to the teams as follows:
Rank 1: the team that won the World Series
Rank 2: the team that lost the World Series
Rank 3: the two teams that lost to the teams in the World Series
Rank 4: the four teams that made it past the wild card round, but lost to the above four teams
Rank 5: the two teams that lost the wild card round
In your R console, create a corresponding rank vector by typing teamRank<-c(1,2,3,3,4,4,4,4,5,5)
In this quick question, we’ll see how well these rankings correlate with the regular season wins of the teams. In 2012, the ranking of the teams and their regular season wins were as follows:
Rank 1: San Francisco Giants (Wins = 94)
Rank 2: Detroit Tigers (Wins = 88)
Rank 3: New York Yankees (Wins = 95), and St. Louis Cardinals (Wins = 88)
Rank 4: Baltimore Orioles (Wins = 93), Oakland A’s (Wins = 94), Washington Nationals (Wins = 98), Cincinnati Reds (Wins = 97)
Rank 5: Texas Rangers (Wins = 93), and Atlanta Braves (Wins = 94)
Create a vector in R called wins2012, that has the wins of each team in 2012, in order of rank (the vector should have 10 numbers).
In 2013, the ranking of the teams and their regular season wins were as follows:
Rank 1: Boston Red Sox (Wins = 97)
Rank 2: St. Louis Cardinals (Wins = 97)
Rank 3: Los Angeles Dodgers (Wins = 92), and Detroit Tigers (Wins = 93)
Rank 4: Tampa Bay Rays (Wins = 92), Oakland A’s (Wins = 96), Pittsburgh Pirates (Wins = 94), and Atlanta Braves (Wins = 96)
Rank 5: Cleveland Indians (Wins = 92), and Cincinnati Reds (Wins = 90)
Create another vector in R called wins2013, that has the wins of each team in 2013, in order of rank (the vector should have 10 numbers).
teamRank<-c(1,2,3,3,4,4,4,4,5,5)
wins2012<-c(94,88,95,88,93,94,98,97,93,94)
wins2013<-c(97,97,92,93,92,96,94,96,92,90)
#What is the correlation between teamRank and wins2012?
cor(teamRank,wins2012)
## [1] 0.3477129
#Ans:0.3477129
#What is the correlation between teamRank and wins2013?
cor(teamRank,wins2013)
## [1] -0.6556945
#Ans:-0.6556945
#Since one of the correlations is positive and the other is negative, this means that there does not seem to be a pattern between regular season wins and winning the playoffs. We wouldn't feel comfortable making a bet for this year given this data!
VIDEO 6: THE ANALYTICS EDGE IN SPORTS
Q:Which of the following is MOST LIKELY to be a topic of Sabermetric research?
Ans:Predicting how many home runs the Oakland A’s will hit next year
EXPLANATION:Sabermetric research tries to take a quantitative approach to baseball. Predicting how many home runs the Oakland A’s will hit next year is a very quantitative problem. While the other two topics could be an area of Sabermetric research, they are more qualitative.
In this recitation, we’ll apply some of the ideas from Moneyball to data from the National Basketball Association (NBA). Please download the datasets NBA_train.csv and NBA_test.csv, and save them to a location on your computer that you will remember. This data comes from Basketball-Reference.com.
VIDEO 1: THE DATA
# VIDEO 1
# Read in the data
NBA<-read.csv("NBA_train.csv")
str(NBA)
## 'data.frame': 835 obs. of 20 variables:
## $ SeasonEnd: int 1980 1980 1980 1980 1980 1980 1980 1980 1980 1980 ...
## $ Team : Factor w/ 37 levels "Atlanta Hawks",..: 1 2 5 6 8 9 10 11 12 13 ...
## $ Playoffs : int 1 1 0 0 0 0 0 1 0 1 ...
## $ W : int 50 61 30 37 30 16 24 41 37 47 ...
## $ PTS : int 8573 9303 8813 9360 8878 8933 8493 9084 9119 8860 ...
## $ oppPTS : int 8334 8664 9035 9332 9240 9609 8853 9070 9176 8603 ...
## $ FG : int 3261 3617 3362 3811 3462 3643 3527 3599 3639 3582 ...
## $ FGA : int 7027 7387 6943 8041 7470 7596 7318 7496 7689 7489 ...
## $ X2P : int 3248 3455 3292 3775 3379 3586 3500 3495 3551 3557 ...
## $ X2PA : int 6952 6965 6668 7854 7215 7377 7197 7117 7375 7375 ...
## $ X3P : int 13 162 70 36 83 57 27 104 88 25 ...
## $ X3PA : int 75 422 275 187 255 219 121 379 314 114 ...
## $ FT : int 2038 1907 2019 1702 1871 1590 1412 1782 1753 1671 ...
## $ FTA : int 2645 2449 2592 2205 2539 2149 1914 2326 2333 2250 ...
## $ ORB : int 1369 1227 1115 1307 1311 1226 1155 1394 1398 1187 ...
## $ DRB : int 2406 2457 2465 2381 2524 2415 2437 2217 2326 2429 ...
## $ AST : int 1913 2198 2152 2108 2079 1950 2028 2149 2148 2123 ...
## $ STL : int 782 809 704 764 746 783 779 782 900 863 ...
## $ BLK : int 539 308 392 342 404 562 339 373 530 356 ...
## $ TOV : int 1495 1539 1684 1370 1533 1742 1492 1565 1517 1439 ...
VIDEO 2: PLAYOFFS AND WINS
# VIDEO 2
# How many wins to make the playoffs?
table(NBA$W, NBA$Playoffs)
##
## 0 1
## 11 2 0
## 12 2 0
## 13 2 0
## 14 2 0
## 15 10 0
## 16 2 0
## 17 11 0
## 18 5 0
## 19 10 0
## 20 10 0
## 21 12 0
## 22 11 0
## 23 11 0
## 24 18 0
## 25 11 0
## 26 17 0
## 27 10 0
## 28 18 0
## 29 12 0
## 30 19 1
## 31 15 1
## 32 12 0
## 33 17 0
## 34 16 0
## 35 13 3
## 36 17 4
## 37 15 4
## 38 8 7
## 39 10 10
## 40 9 13
## 41 11 26
## 42 8 29
## 43 2 18
## 44 2 27
## 45 3 22
## 46 1 15
## 47 0 28
## 48 1 14
## 49 0 17
## 50 0 32
## 51 0 12
## 52 0 20
## 53 0 17
## 54 0 18
## 55 0 24
## 56 0 16
## 57 0 23
## 58 0 13
## 59 0 14
## 60 0 8
## 61 0 10
## 62 0 13
## 63 0 7
## 64 0 3
## 65 0 3
## 66 0 2
## 67 0 4
## 69 0 1
## 72 0 1
# Compute Points Difference
NBA$PTSdiff<-NBA$PTS - NBA$oppPTS
# Check for linear relationship
plot(NBA$PTSdiff, NBA$W)
# Linear regression model for wins
WinsReg = lm(W ~ PTSdiff, data=NBA)
summary(WinsReg)
##
## Call:
## lm(formula = W ~ PTSdiff, data = NBA)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.7393 -2.1018 -0.0672 2.0265 10.6026
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.100e+01 1.059e-01 387.0 <2e-16 ***
## PTSdiff 3.259e-02 2.793e-04 116.7 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.061 on 833 degrees of freedom
## Multiple R-squared: 0.9423, Adjusted R-squared: 0.9423
## F-statistic: 1.361e+04 on 1 and 833 DF, p-value: < 2.2e-16
#Let's write out the fitted linear model:
#W = 41 + 0.0326 * PTSdiff
#From the table, a team needs about 42 wins to make it to the playoffs,
#so we need 41 + 0.0326 * PTSdiff >= 42, i.e. PTSdiff >= (42 - 41)/0.0326 = 30.67, or about 31
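#The threshold can also be computed directly from the fitted coefficients (a sketch, using WinsReg from this video):
(42 - coef(WinsReg)[1]) / coef(WinsReg)[2]   #points difference needed for about 42 wins, roughly 31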
VIDEO 3: POINTS SCORED
# VIDEO 3
# Linear regression model for points scored
PointsReg = lm(PTS ~ X2PA + X3PA + FTA + AST + ORB + DRB + TOV + STL + BLK, data=NBA)
summary(PointsReg)
##
## Call:
## lm(formula = PTS ~ X2PA + X3PA + FTA + AST + ORB + DRB + TOV +
## STL + BLK, data = NBA)
##
## Residuals:
## Min 1Q Median 3Q Max
## -527.40 -119.83 7.83 120.67 564.71
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.051e+03 2.035e+02 -10.078 <2e-16 ***
## X2PA 1.043e+00 2.957e-02 35.274 <2e-16 ***
## X3PA 1.259e+00 3.843e-02 32.747 <2e-16 ***
## FTA 1.128e+00 3.373e-02 33.440 <2e-16 ***
## AST 8.858e-01 4.396e-02 20.150 <2e-16 ***
## ORB -9.554e-01 7.792e-02 -12.261 <2e-16 ***
## DRB 3.883e-02 6.157e-02 0.631 0.5285
## TOV -2.475e-02 6.118e-02 -0.405 0.6859
## STL -1.992e-01 9.181e-02 -2.169 0.0303 *
## BLK -5.576e-02 8.782e-02 -0.635 0.5256
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 185.5 on 825 degrees of freedom
## Multiple R-squared: 0.8992, Adjusted R-squared: 0.8981
## F-statistic: 817.3 on 9 and 825 DF, p-value: < 2.2e-16
# Sum of Squared Errors
head(PointsReg$residuals)
## 1 2 3 4 5 6
## 38.572271 142.872004 -92.895718 -8.391347 -258.470561 171.460833
SSE = sum(PointsReg$residuals^2)
SSE #difficult to interpret, hence we calc RMSE
## [1] 28394314
# Root mean squared error
RMSE = sqrt(SSE/nrow(NBA))
RMSE
## [1] 184.4049
# Average number of points in a season
mean(NBA$PTS)
## [1] 8370.24
# Remove insignificant variables
summary(PointsReg)
##
## Call:
## lm(formula = PTS ~ X2PA + X3PA + FTA + AST + ORB + DRB + TOV +
## STL + BLK, data = NBA)
##
## Residuals:
## Min 1Q Median 3Q Max
## -527.40 -119.83 7.83 120.67 564.71
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.051e+03 2.035e+02 -10.078 <2e-16 ***
## X2PA 1.043e+00 2.957e-02 35.274 <2e-16 ***
## X3PA 1.259e+00 3.843e-02 32.747 <2e-16 ***
## FTA 1.128e+00 3.373e-02 33.440 <2e-16 ***
## AST 8.858e-01 4.396e-02 20.150 <2e-16 ***
## ORB -9.554e-01 7.792e-02 -12.261 <2e-16 ***
## DRB 3.883e-02 6.157e-02 0.631 0.5285
## TOV -2.475e-02 6.118e-02 -0.405 0.6859
## STL -1.992e-01 9.181e-02 -2.169 0.0303 *
## BLK -5.576e-02 8.782e-02 -0.635 0.5256
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 185.5 on 825 degrees of freedom
## Multiple R-squared: 0.8992, Adjusted R-squared: 0.8981
## F-statistic: 817.3 on 9 and 825 DF, p-value: < 2.2e-16
PointsReg2 = lm(PTS ~ X2PA + X3PA + FTA + AST + ORB + DRB + STL + BLK, data=NBA) # remove var TOV as it has highest non significant p value
summary(PointsReg2)
##
## Call:
## lm(formula = PTS ~ X2PA + X3PA + FTA + AST + ORB + DRB + STL +
## BLK, data = NBA)
##
## Residuals:
## Min 1Q Median 3Q Max
## -526.79 -121.09 6.37 120.74 565.94
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.077e+03 1.931e+02 -10.755 <2e-16 ***
## X2PA 1.044e+00 2.951e-02 35.366 <2e-16 ***
## X3PA 1.263e+00 3.703e-02 34.099 <2e-16 ***
## FTA 1.125e+00 3.308e-02 34.023 <2e-16 ***
## AST 8.861e-01 4.393e-02 20.173 <2e-16 ***
## ORB -9.581e-01 7.758e-02 -12.350 <2e-16 ***
## DRB 3.892e-02 6.154e-02 0.632 0.5273
## STL -2.068e-01 8.984e-02 -2.301 0.0216 *
## BLK -5.863e-02 8.749e-02 -0.670 0.5029
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 185.4 on 826 degrees of freedom
## Multiple R-squared: 0.8991, Adjusted R-squared: 0.8982
## F-statistic: 920.4 on 8 and 826 DF, p-value: < 2.2e-16
PointsReg3 = lm(PTS ~ X2PA + X3PA + FTA + AST + ORB + STL + BLK, data=NBA)
summary(PointsReg3)# remove var DRB as it has highest non significant p value
##
## Call:
## lm(formula = PTS ~ X2PA + X3PA + FTA + AST + ORB + STL + BLK,
## data = NBA)
##
## Residuals:
## Min 1Q Median 3Q Max
## -523.79 -121.64 6.07 120.81 573.64
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.015e+03 1.670e+02 -12.068 < 2e-16 ***
## X2PA 1.048e+00 2.852e-02 36.753 < 2e-16 ***
## X3PA 1.271e+00 3.475e-02 36.568 < 2e-16 ***
## FTA 1.128e+00 3.270e-02 34.506 < 2e-16 ***
## AST 8.909e-01 4.326e-02 20.597 < 2e-16 ***
## ORB -9.702e-01 7.519e-02 -12.903 < 2e-16 ***
## STL -2.276e-01 8.356e-02 -2.724 0.00659 **
## BLK -3.882e-02 8.165e-02 -0.475 0.63462
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 185.4 on 827 degrees of freedom
## Multiple R-squared: 0.8991, Adjusted R-squared: 0.8982
## F-statistic: 1053 on 7 and 827 DF, p-value: < 2.2e-16
PointsReg4 = lm(PTS ~ X2PA + X3PA + FTA + AST + ORB + STL, data=NBA)
summary(PointsReg4)# remove var BLK as it has the highest non-significant p-value
##
## Call:
## lm(formula = PTS ~ X2PA + X3PA + FTA + AST + ORB + STL, data = NBA)
##
## Residuals:
## Min 1Q Median 3Q Max
## -523.33 -122.02 6.93 120.68 568.26
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.033e+03 1.629e+02 -12.475 < 2e-16 ***
## X2PA 1.050e+00 2.829e-02 37.117 < 2e-16 ***
## X3PA 1.273e+00 3.441e-02 37.001 < 2e-16 ***
## FTA 1.127e+00 3.260e-02 34.581 < 2e-16 ***
## AST 8.884e-01 4.292e-02 20.701 < 2e-16 ***
## ORB -9.743e-01 7.465e-02 -13.051 < 2e-16 ***
## STL -2.268e-01 8.350e-02 -2.717 0.00673 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 185.3 on 828 degrees of freedom
## Multiple R-squared: 0.8991, Adjusted R-squared: 0.8983
## F-statistic: 1229 on 6 and 828 DF, p-value: < 2.2e-16
# now we have a simple model with all significant var
# Compute SSE and RMSE for new model
SSE_4 = sum(PointsReg4$residuals^2)
RMSE_4 = sqrt(SSE_4/nrow(NBA))
SSE_4 #difficult to interpret, hence we calc RMSE
## [1] 28421465
RMSE_4
## [1] 184.493
# Now we have a simpler model with almost the same RMSE
VIDEO 4: MAKING PREDICTIONS
# VIDEO 4
# Read in test set
NBA_test = read.csv("NBA_test.csv")
# Make predictions on test set using the linear model from the training data
PointsPredictions = predict(PointsReg4, newdata=NBA_test)
# Compute out-of-sample R^2 #how well the model predicts on test data
SSE = sum((PointsPredictions - NBA_test$PTS)^2)
SST = sum((mean(NBA$PTS) - NBA_test$PTS)^2)
R2 = 1 - SSE/SST
R2
## [1] 0.8127142
# Compute the RMSE
RMSE = sqrt(SSE/nrow(NBA_test))
RMSE
## [1] 196.3723
There have been many studies documenting that the average global temperature has been increasing over the last century. The consequences of a continued rise in global temperature will be dire. Rising sea levels and an increased frequency of extreme weather events will affect billions of people.
In this problem, we will attempt to study the relationship between average global temperature and several other factors.
The file climate_change.csv contains climate data from May 1983 to December 2008. The available variables include:
Year: the observation year.
Month: the observation month.
Temp: the difference in degrees Celsius between the average global temperature in that period and a reference value. This data comes from the Climatic Research Unit at the University of East Anglia.
CO2, N2O, CH4, CFC.11, CFC.12: atmospheric concentrations of carbon dioxide (CO2), nitrous oxide (N2O), methane (CH4), trichlorofluoromethane (CCl3F; commonly referred to as CFC-11) and dichlorodifluoromethane (CCl2F2; commonly referred to as CFC-12), respectively. This data comes from the ESRL/NOAA Global Monitoring Division. CO2, N2O and CH4 are expressed in ppmv (parts per million by volume, i.e. 397 ppmv of CO2 means that CO2 constitutes 397 millionths of the total volume of the atmosphere); CFC.11 and CFC.12 are expressed in ppbv (parts per billion by volume).
Aerosols: the mean stratospheric aerosol optical depth at 550 nm. This variable is linked to volcanoes, as volcanic eruptions result in new particles being added to the atmosphere, which affect how much of the sun’s energy is reflected back into space. This data is from the Goddard Institute for Space Studies at NASA.
TSI: the total solar irradiance (TSI) in W/m2 (the rate at which the sun’s energy is deposited per unit area). Due to sunspots and other solar phenomena, the amount of energy that is given off by the sun varies substantially with time. This data is from the SOLARIS-HEPPA project website.
MEI: multivariate El Nino Southern Oscillation index (MEI), a measure of the strength of the El Nino/La Nina-Southern Oscillation (a weather effect in the Pacific Ocean that affects global temperatures). This data comes from the ESRL/NOAA Physical Sciences Division.
PROBLEM 1.1 - CREATING OUR FIRST MODEL
We are interested in how changes in these variables affect future temperatures, as well as how well these variables explain temperature changes so far. To do this, first read the dataset climate_change.csv into R.
Then, split the data into a training set, consisting of all the observations up to and including 2006, and a testing set consisting of the remaining years (hint: use subset). A training set refers to the data that will be used to build the model (this is the data we give to the lm() function), and a testing set refers to the data we will use to test our predictive ability.
Next, build a linear regression model to predict the dependent variable Temp, using MEI, CO2, CH4, N2O, CFC.11, CFC.12, TSI, and Aerosols as independent variables (Year and Month should NOT be used in the model). Use the training set to build the model.
climate<-read.csv("climate_change.csv")
str(climate)
## 'data.frame': 308 obs. of 11 variables:
## $ Year : int 1983 1983 1983 1983 1983 1983 1983 1983 1984 1984 ...
## $ Month : int 5 6 7 8 9 10 11 12 1 2 ...
## $ MEI : num 2.556 2.167 1.741 1.13 0.428 ...
## $ CO2 : num 346 346 344 342 340 ...
## $ CH4 : num 1639 1634 1633 1631 1648 ...
## $ N2O : num 304 304 304 304 304 ...
## $ CFC.11 : num 191 192 193 194 194 ...
## $ CFC.12 : num 350 352 354 356 357 ...
## $ TSI : num 1366 1366 1366 1366 1366 ...
## $ Aerosols: num 0.0863 0.0794 0.0731 0.0673 0.0619 0.0569 0.0524 0.0486 0.0451 0.0416 ...
## $ Temp : num 0.109 0.118 0.137 0.176 0.149 0.093 0.232 0.078 0.089 0.013 ...
#splitting the dataset into train & test dataset
#method1
train<-subset(climate,Year<=2006)
str(train)
## 'data.frame': 284 obs. of 11 variables:
## $ Year : int 1983 1983 1983 1983 1983 1983 1983 1983 1984 1984 ...
## $ Month : int 5 6 7 8 9 10 11 12 1 2 ...
## $ MEI : num 2.556 2.167 1.741 1.13 0.428 ...
## $ CO2 : num 346 346 344 342 340 ...
## $ CH4 : num 1639 1634 1633 1631 1648 ...
## $ N2O : num 304 304 304 304 304 ...
## $ CFC.11 : num 191 192 193 194 194 ...
## $ CFC.12 : num 350 352 354 356 357 ...
## $ TSI : num 1366 1366 1366 1366 1366 ...
## $ Aerosols: num 0.0863 0.0794 0.0731 0.0673 0.0619 0.0569 0.0524 0.0486 0.0451 0.0416 ...
## $ Temp : num 0.109 0.118 0.137 0.176 0.149 0.093 0.232 0.078 0.089 0.013 ...
dim(train)
## [1] 284 11
plot(train) #scatter plot matrix
test<-subset(climate,Year>2006)
str(test)
## 'data.frame': 24 obs. of 11 variables:
## $ Year : int 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 ...
## $ Month : int 1 2 3 4 5 6 7 8 9 10 ...
## $ MEI : num 0.974 0.51 0.074 -0.049 0.183 ...
## $ CO2 : num 383 384 385 386 387 ...
## $ CH4 : num 1800 1803 1803 1802 1796 ...
## $ N2O : num 321 321 321 321 320 ...
## $ CFC.11 : num 248 248 248 248 247 ...
## $ CFC.12 : num 539 539 539 539 538 ...
## $ TSI : num 1366 1366 1366 1366 1366 ...
## $ Aerosols: num 0.0054 0.0051 0.0045 0.0045 0.0041 0.004 0.004 0.0041 0.0042 0.0041 ...
## $ Temp : num 0.601 0.498 0.435 0.466 0.372 0.382 0.394 0.358 0.402 0.362 ...
dim(test)
## [1] 24 11
#method2
train_index<-(climate$Year<=2006)
train1<-subset(climate,train_index)
str(train1)
## 'data.frame': 284 obs. of 11 variables:
## $ Year : int 1983 1983 1983 1983 1983 1983 1983 1983 1984 1984 ...
## $ Month : int 5 6 7 8 9 10 11 12 1 2 ...
## $ MEI : num 2.556 2.167 1.741 1.13 0.428 ...
## $ CO2 : num 346 346 344 342 340 ...
## $ CH4 : num 1639 1634 1633 1631 1648 ...
## $ N2O : num 304 304 304 304 304 ...
## $ CFC.11 : num 191 192 193 194 194 ...
## $ CFC.12 : num 350 352 354 356 357 ...
## $ TSI : num 1366 1366 1366 1366 1366 ...
## $ Aerosols: num 0.0863 0.0794 0.0731 0.0673 0.0619 0.0569 0.0524 0.0486 0.0451 0.0416 ...
## $ Temp : num 0.109 0.118 0.137 0.176 0.149 0.093 0.232 0.078 0.089 0.013 ...
dim(train1)
## [1] 284 11
test1<-climate[!train_index,]
str(test1)
## 'data.frame': 24 obs. of 11 variables:
## $ Year : int 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 ...
## $ Month : int 1 2 3 4 5 6 7 8 9 10 ...
## $ MEI : num 0.974 0.51 0.074 -0.049 0.183 ...
## $ CO2 : num 383 384 385 386 387 ...
## $ CH4 : num 1800 1803 1803 1802 1796 ...
## $ N2O : num 321 321 321 321 320 ...
## $ CFC.11 : num 248 248 248 248 247 ...
## $ CFC.12 : num 539 539 539 539 538 ...
## $ TSI : num 1366 1366 1366 1366 1366 ...
## $ Aerosols: num 0.0054 0.0051 0.0045 0.0045 0.0041 0.004 0.004 0.0041 0.0042 0.0041 ...
## $ Temp : num 0.601 0.498 0.435 0.466 0.372 0.382 0.394 0.358 0.402 0.362 ...
dim(test1)
## [1] 24 11
#Let's now build the linear regression model on the training dataset
climatelm<-lm(Temp ~ MEI+ CO2 + CH4 + N2O + CFC.11 + CFC.12 + TSI + Aerosols, data=train)
summary(climatelm)
##
## Call:
## lm(formula = Temp ~ MEI + CO2 + CH4 + N2O + CFC.11 + CFC.12 +
## TSI + Aerosols, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.25888 -0.05913 -0.00082 0.05649 0.32433
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.246e+02 1.989e+01 -6.265 1.43e-09 ***
## MEI 6.421e-02 6.470e-03 9.923 < 2e-16 ***
## CO2 6.457e-03 2.285e-03 2.826 0.00505 **
## CH4 1.240e-04 5.158e-04 0.240 0.81015
## N2O -1.653e-02 8.565e-03 -1.930 0.05467 .
## CFC.11 -6.631e-03 1.626e-03 -4.078 5.96e-05 ***
## CFC.12 3.808e-03 1.014e-03 3.757 0.00021 ***
## TSI 9.314e-02 1.475e-02 6.313 1.10e-09 ***
## Aerosols -1.538e+00 2.133e-01 -7.210 5.41e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.09171 on 275 degrees of freedom
## Multiple R-squared: 0.7509, Adjusted R-squared: 0.7436
## F-statistic: 103.6 on 8 and 275 DF, p-value: < 2.2e-16
#Using the visreg package to plot regression models. Default plots contain a confidence band, prediction line, and partial residuals.
library(visreg)
par(mfrow=c(2,4))
visreg(climatelm)
#Enter the model R2 (the "Multiple R-squared" value):
summary(climatelm)$r.squared #?summary.lm #to see the values that can be passed to summary()
## [1] 0.7508933
#Ans:0.7508933
##################################
#PROBLEM 1.2 - CREATING OUR FIRST MODEL
#Q:Which variables are significant in the model? We will consider a variable signficant only if the p-value is below 0.05. (Select all that apply.)
#Ans:MEI,CO2,CFC.11,CFC.12,TSI,Aerosols #from summary(climatelm)
####################################
#PROBLEM 2.1 - UNDERSTANDING THE MODEL
#Current scientific opinion is that nitrous oxide and CFC-11 are greenhouse gases: gases that are able to trap heat from the sun and contribute to the heating of the Earth. However, the regression coefficients of both the N2O and CFC-11 variables are negative, indicating that increasing atmospheric concentrations of either of these two compounds is associated with lower global temperatures.
#Which of the following is the simplest correct explanation for this contradiction?
#Ans:All of the gas concentration variables reflect human development - N2O and CFC.11 are correlated with other variables in the data set.
#EXPLANATION:The linear correlation of N2O and CFC.11 with other variables in the data set is quite large. The first explanation does not seem correct, as the warming effect of nitrous oxide and CFC-11 are well documented, and our regression analysis is not enough to disprove it. The second explanation is unlikely, as we have estimated eight coefficients and the intercept from 284 observations.
#############################################
#PROBLEM 2.2 - UNDERSTANDING THE MODEL
#Compute the correlations between all the variables in the training set. Which of the following independent variables is N2O highly correlated with (absolute correlation greater than 0.7)? Select all that apply.
#cor(train) # correlation matrix not required
cor(train$N2O,train)
## Year Month MEI CO2 CH4 N2O CFC.11
## [1,] 0.9938452 0.01363153 -0.05081978 0.9767198 0.8998386 1 0.5224773
## CFC.12 TSI Aerosols Temp
## [1,] 0.8679308 0.1997567 -0.3370546 0.7786389
#Ans:CO2,CH4,CFC.12
#Which of the following independent variables is CFC.11 highly correlated with? Select all that apply.
cor(train$CFC.11,train)
## Year Month MEI CO2 CH4 N2O CFC.11
## [1,] 0.5691064 -0.01311122 0.06900044 0.5140597 0.779904 0.5224773 1
## CFC.12 TSI Aerosols Temp
## [1,] 0.8689852 0.272046 -0.0439212 0.4077103
#Ans: CH4,CFC.12
library(sjPlot)
#?sjp.corr
sjp.corr(train,type = "tile",show.legend =FALSE) #plotting the correlation matrix
## Computing correlation using spearman-method with listwise-deletion...
#########################################
#PROBLEM 3 - SIMPLIFYING THE MODEL
#Given that the correlations are so high, let us focus on the N2O variable and build a model with only MEI, TSI, Aerosols and N2O as independent variables. Remember to use the training set to build the model.
climatelm2<-lm(Temp ~ MEI+ N2O + TSI + Aerosols, data=train)
summary(climatelm2)
##
## Call:
## lm(formula = Temp ~ MEI + N2O + TSI + Aerosols, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.27916 -0.05975 -0.00595 0.05672 0.34195
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.162e+02 2.022e+01 -5.747 2.37e-08 ***
## MEI 6.419e-02 6.652e-03 9.649 < 2e-16 ***
## N2O 2.532e-02 1.311e-03 19.307 < 2e-16 ***
## TSI 7.949e-02 1.487e-02 5.344 1.89e-07 ***
## Aerosols -1.702e+00 2.180e-01 -7.806 1.19e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.09547 on 279 degrees of freedom
## Multiple R-squared: 0.7261, Adjusted R-squared: 0.7222
## F-statistic: 184.9 on 4 and 279 DF, p-value: < 2.2e-16
par(mfrow =c(2, 2))
plot(climatelm2) #Plot Diagnostics for an lm Object
#Another excellent way to plot the LM diagnostics
#Plotting Diagnostics for LM with ggplot2 and ggfortify
library(ggfortify)
par(mfrow = c(1, 2))
autoplot(climatelm2, which = 1:6, ncol = 3,colour="black",size=2)
#help(autoplot.lm)
#Testing the model assumptions
library(sjPlot)
#?sjp.lm
sjp.lm(climatelm2,type="ma",completeDiagnostic=T)
## Removed 5 cases during 1 step(s).
## R^2 / adj. R^2 of original model: 0.726132 / 0.722206
## R^2 / adj. R^2 of updated model: 0.751744 / 0.748120
## AIC of original model: -521.289225
## AIC of updated model: -563.091626
## lag Autocorrelation D-W Statistic p-value
## 1 0.5784039 0.8424936 0
## Alternative hypothesis: rho != 0
## Non-constant Variance Score Test
## Variance formula: ~ fitted.values
## Chisquare = 0.8193595 Df = 1 p = 0.3653675
##
## studentized Breusch-Pagan test
##
## data: linreg
## BP = 1.4215, df = 4, p-value = 0.8405
##
## Suggested power transformation: 1.000072
#Enter the coefficient of N2O in this reduced model:
#Ans:0.02532
#Enter the model R2:
summary(climatelm2)$r.squared
## [1] 0.7261321
#Ans:0.7261321
#EXPLANATION:We have observed that, for this problem, when we remove many variables the sign of N2O flips. The model has not lost a lot of explanatory power (the model R2 is 0.7261 compared to 0.7509 previously) despite removing many variables. As discussed in lecture, this type of behavior is typical when building a model where many of the independent variables are highly correlated with each other. In this particular problem many of the variables (CO2, CH4, N2O, CFC.11 and CFC.12) are highly correlated, since they are all driven by human industrial development.
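#A quick check of the sign flip described above (using climatelm and climatelm2 from this problem):
coef(climatelm)["N2O"]    #negative in the full model
coef(climatelm2)["N2O"]   #positive once the highly correlated gases are removed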
#################################################
#PROBLEM 4 - AUTOMATICALLY BUILDING THE MODEL
#We have many variables in this problem, and as we have seen above, dropping some from the model does not decrease model quality. R provides a function, step, that will automate the procedure of trying different combinations of variables to find a good compromise of model simplicity and R2. This trade-off is formalized by the Akaike information criterion (AIC) - it can be informally thought of as the quality of the model with a penalty for the number of variables in the model.
#The step function has one argument - the name of the initial model. It returns a simplified model. Use the step function in R to derive a new model, with the full model as the initial model (HINT: If your initial full model was called "climateLM", you could create a new model with the step function by typing step(climateLM). Be sure to save your new model to a variable name so that you can look at the summary. For more information about the step function, type ?step in your R console.)
StepModel<-step(climatelm)
## Start: AIC=-1348.16
## Temp ~ MEI + CO2 + CH4 + N2O + CFC.11 + CFC.12 + TSI + Aerosols
##
## Df Sum of Sq RSS AIC
## - CH4 1 0.00049 2.3135 -1350.1
## <none> 2.3130 -1348.2
## - N2O 1 0.03132 2.3443 -1346.3
## - CO2 1 0.06719 2.3802 -1342.0
## - CFC.12 1 0.11874 2.4318 -1335.9
## - CFC.11 1 0.13986 2.4529 -1333.5
## - TSI 1 0.33516 2.6482 -1311.7
## - Aerosols 1 0.43727 2.7503 -1301.0
## - MEI 1 0.82823 3.1412 -1263.2
##
## Step: AIC=-1350.1
## Temp ~ MEI + CO2 + N2O + CFC.11 + CFC.12 + TSI + Aerosols
##
## Df Sum of Sq RSS AIC
## <none> 2.3135 -1350.1
## - N2O 1 0.03133 2.3448 -1348.3
## - CO2 1 0.06672 2.3802 -1344.0
## - CFC.12 1 0.13023 2.4437 -1336.5
## - CFC.11 1 0.13938 2.4529 -1335.5
## - TSI 1 0.33500 2.6485 -1313.7
## - Aerosols 1 0.43987 2.7534 -1302.7
## - MEI 1 0.83118 3.1447 -1264.9
summary(StepModel)
##
## Call:
## lm(formula = Temp ~ MEI + CO2 + N2O + CFC.11 + CFC.12 + TSI +
## Aerosols, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.25770 -0.05994 -0.00104 0.05588 0.32203
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.245e+02 1.985e+01 -6.273 1.37e-09 ***
## MEI 6.407e-02 6.434e-03 9.958 < 2e-16 ***
## CO2 6.402e-03 2.269e-03 2.821 0.005129 **
## N2O -1.602e-02 8.287e-03 -1.933 0.054234 .
## CFC.11 -6.609e-03 1.621e-03 -4.078 5.95e-05 ***
## CFC.12 3.868e-03 9.812e-04 3.942 0.000103 ***
## TSI 9.312e-02 1.473e-02 6.322 1.04e-09 ***
## Aerosols -1.540e+00 2.126e-01 -7.244 4.36e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.09155 on 276 degrees of freedom
## Multiple R-squared: 0.7508, Adjusted R-squared: 0.7445
## F-statistic: 118.8 on 7 and 276 DF, p-value: < 2.2e-16
coef(StepModel)
## (Intercept) MEI CO2 N2O CFC.11
## -1.245152e+02 6.406779e-02 6.401495e-03 -1.602113e-02 -6.609351e-03
## CFC.12 TSI Aerosols
## 3.867565e-03 9.311551e-02 -1.540206e+00
#fitted(StepModel)
#resid(StepModel)
AIC(StepModel)
## [1] -542.1426
#If you look at the summary of the model with summary(StepModel), you can see that the R-squared value is 0.75, and only CH4 was removed.
#It is interesting to note that the step function does not address the collinearity of the variables, except that adding highly correlated variables will not improve the R2 significantly. The consequence of this is that the step function will not necessarily produce a very interpretable model - just a model that has balanced quality and simplicity for a particular weighting of quality and simplicity (AIC).
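#The AIC trade-off can also be checked by hand (a sketch, using climatelm and StepModel from above): the model chosen by step() has the lower AIC.
AIC(climatelm)    #full model
AIC(StepModel)    #CH4 removed: the lower (better) AIC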
###################################
#PROBLEM 5 - TESTING ON UNSEEN DATA
#We have developed an understanding of how well we can fit a linear regression to the training data, but does the model quality hold when applied to unseen data?
#Using the model produced from the step function, calculate temperature predictions for the testing data set, using the predict function.
climateStep<-lm(Temp ~ MEI + CO2 + N2O + CFC.11 + CFC.12 + TSI + Aerosols, data = train)
tempPredict<- predict(climateStep, newdata=test)
summary(tempPredict)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.3142 0.3418 0.3771 0.3832 0.4245 0.4678
#Enter the testing set R2:
# Compute out-of-sample R^2 #how well the model predicts on test data
SSE = sum((tempPredict - test$Temp)^2)
SST = sum((mean(train$Temp) - test$Temp)^2)
R2 = 1 - SSE/SST
R2
## [1] 0.6286051
# Compute the RMSE
RMSE = sqrt(SSE/nrow(test))
RMSE
## [1] 0.09522876
#Let's do variable selection using the leaps package with regsubsets()
library(leaps)
regfit.bwd<-regsubsets(Temp~.-Year-Month,data = train,nvmax=8,nbest=1,method = "backward")
reg.summary<-summary(regfit.bwd)
reg.summary
## Subset selection object
## Call: regsubsets.formula(Temp ~ . - Year - Month, data = train, nvmax = 8,
## nbest = 1, method = "backward")
## 8 Variables (and intercept)
## Forced in Forced out
## MEI FALSE FALSE
## CO2 FALSE FALSE
## CH4 FALSE FALSE
## N2O FALSE FALSE
## CFC.11 FALSE FALSE
## CFC.12 FALSE FALSE
## TSI FALSE FALSE
## Aerosols FALSE FALSE
## 1 subsets of each size up to 8
## Selection Algorithm: backward
## MEI CO2 CH4 N2O CFC.11 CFC.12 TSI Aerosols
## 1 ( 1 ) " " " " " " " " " " "*" " " " "
## 2 ( 1 ) "*" "*" " " " " " " " " " " " "
## 3 ( 1 ) "*" " " " " " " "*" "*" " " " "
## 4 ( 1 ) "*" " " " " " " "*" "*" " " "*"
## 5 ( 1 ) "*" " " " " " " "*" "*" "*" "*"
## 6 ( 1 ) "*" "*" " " " " "*" "*" "*" "*"
## 7 ( 1 ) "*" "*" " " "*" "*" "*" "*" "*"
## 8 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*"
#An asterisk indicates that a given variable is included in the corresponding model
#?regsubsets
reg.summary$rsq
## [1] 0.4727354 0.6638403 0.6666184 0.7085996 0.7434239 0.7474671 0.7508409
## [8] 0.7508933
reg.summary$cp #Mallows' Cp: select the model with the lowest value, which here corresponds to the 7-variable model
## [1] 302.070882 93.101659 92.034807 47.689848 11.245733 8.782349
## [7] 7.057826 9.000000
reg.summary$adjr2 #Adjusted R-squared: select the model with the highest value, which here corresponds to the 7-variable model
## [1] 0.4708656 0.6614477 0.6630464 0.7044218 0.7388093 0.7419970 0.7445216
## [8] 0.7436465
reg.summary$bic #BIC: select the model with the lowest value (here the 5-variable model)
## [1] -170.4770 -292.6611 -289.3689 -321.9433 -352.4399 -351.3018 -349.4727
## [8] -343.8834
reg.summary$which #A logical matrix indicating which elements are in each model
## (Intercept) MEI CO2 CH4 N2O CFC.11 CFC.12 TSI Aerosols
## 1 TRUE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE
## 2 TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
## 3 TRUE TRUE FALSE FALSE FALSE TRUE TRUE FALSE FALSE
## 4 TRUE TRUE FALSE FALSE FALSE TRUE TRUE FALSE TRUE
## 5 TRUE TRUE FALSE FALSE FALSE TRUE TRUE TRUE TRUE
## 6 TRUE TRUE TRUE FALSE FALSE TRUE TRUE TRUE TRUE
## 7 TRUE TRUE TRUE FALSE TRUE TRUE TRUE TRUE TRUE
## 8 TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
#Let's use the built-in plot() method for regsubsets objects
#In these plots, a shaded cell means the variable is included in that model; a gap means it is excluded. Models are ordered by the chosen criterion.
par(mfrow=c(2,2))
plot(regfit.bwd,scale="adjr2")
#Looking at the plot, at the highest adjusted R2 value the variable CH4 shows a gap (it is excluded), implying it should be dropped; hence the 7-variable model is selected under the highest adjusted R2 criterion
#We can draw the same plot for the other model selection criteria
#?plot.regsubsets
plot(regfit.bwd,scale="r2") #here look for highest values
plot(regfit.bwd,scale="Cp") #here look for lowest values
plot(regfit.bwd,scale="bic") #here look for lowest values
#From reg.summary and the plots, we can see that the 7-variable model fits best, which matches the model chosen by step() earlier
#Let's also use conventional line plots of RSS, adjusted R-squared, Cp and BIC for all models at once to help decide which model to select
par(mfrow=c(2,2))
plot(reg.summary$rss,xlab="Number of variables",ylab="RSS",type="l")
which.min(reg.summary$rss)
## [1] 8
points(8,reg.summary$rss[8],col="red",cex=2,pch=20) # mark the point with the minimum RSS
plot(reg.summary$adjr2,xlab="Number of variables",ylab="Adjusted R2",type="l")
which.max(reg.summary$adjr2)
## [1] 7
points(7,reg.summary$adjr2[7],col="red",cex=2,pch=20) # adding the point to the plot which shows the max adjusted R2
#similarily for other measures
plot(reg.summary$cp,xlab="Number of variables",ylab="Cp",type="l")
which.min(reg.summary$cp)
## [1] 7
points(7,reg.summary$cp[7],col="red",cex=2,pch=20) # adding the point to the plot which shows the min Cp
plot(reg.summary$bic,xlab="Number of variables",ylab="BIC",type="l")
which.min(reg.summary$bic)
## [1] 5
points(5,reg.summary$bic[5],col="red",cex=2,pch=20) # mark the point with the minimum BIC
Let's use the sjPlot package for journal-style summary output.
#lets compare the two fitted models
#and get the summary output in a better-presented table
library(sjPlot)
sjt.lm(climatelm,climatelm2)
Dependent variable: Temp (both models)

| Predictor    | climatelm B | CI               | p     | climatelm2 B | CI               | p     |
|--------------|-------------|------------------|-------|--------------|------------------|-------|
| (Intercept)  | -124.59     | -163.74 – -85.44 | <.001 | -116.23      | -156.04 – -76.42 | <.001 |
| MEI          | 0.06        | 0.05 – 0.08      | <.001 | 0.06         | 0.05 – 0.08      | <.001 |
| CO2          | 0.01        | 0.00 – 0.01      | .005  |              |                  |       |
| CH4          | 0.00        | -0.00 – 0.00     | .810  |              |                  |       |
| N2O          | -0.02       | -0.03 – 0.00     | .055  | 0.03         | 0.02 – 0.03      | <.001 |
| CFC.11       | -0.01       | -0.01 – -0.00    | <.001 |              |                  |       |
| CFC.12       | 0.00        | 0.00 – 0.01      | <.001 |              |                  |       |
| TSI          | 0.09        | 0.06 – 0.12      | <.001 | 0.08         | 0.05 – 0.11      | <.001 |
| Aerosols     | -1.54       | -1.96 – -1.12    | <.001 | -1.70        | -2.13 – -1.27    | <.001 |
| Observations | 284         |                  |       | 284          |                  |       |
| R2 / adj. R2 | .751 / .744 |                  |       | .726 / .722  |                  |       |
READING TEST SCORES
The Programme for International Student Assessment (PISA) is a test given every three years to 15-year-old students from around the world to evaluate their performance in mathematics, reading, and science. This test provides a quantitative way to compare the performance of students from different parts of the world. In this homework assignment, we will predict the reading scores of students from the United States of America on the 2009 PISA exam.
The datasets pisa2009train.csv and pisa2009test.csv contain information about the demographics and schools for American students taking the exam, derived from 2009 PISA Public-Use Data Files distributed by the United States National Center for Education Statistics (NCES). While the datasets are not supposed to contain identifying information about students taking the test, by using the data you are bound by the NCES data use agreement, which prohibits any attempt to determine the identity of any student in the datasets.
Each row in the datasets pisa2009train.csv and pisa2009test.csv represents one student taking the exam.
#lets import the data set
pisaTrain<-read.csv("pisa2009train.csv")
pisaTest<-read.csv("pisa2009test.csv")
str(pisaTrain)
## 'data.frame': 3663 obs. of 24 variables:
## $ grade : int 11 11 9 10 10 10 10 10 9 10 ...
## $ male : int 1 1 1 0 1 1 0 0 0 1 ...
## $ raceeth : Factor w/ 7 levels "American Indian/Alaska Native",..: NA 7 7 3 4 3 2 7 7 5 ...
## $ preschool : int NA 0 1 1 1 1 0 1 1 1 ...
## $ expectBachelors : int 0 0 1 1 0 1 1 1 0 1 ...
## $ motherHS : int NA 1 1 0 1 NA 1 1 1 1 ...
## $ motherBachelors : int NA 1 1 0 0 NA 0 0 NA 1 ...
## $ motherWork : int 1 1 1 1 1 1 1 0 1 1 ...
## $ fatherHS : int NA 1 1 1 1 1 NA 1 0 0 ...
## $ fatherBachelors : int NA 0 NA 0 0 0 NA 0 NA 0 ...
## $ fatherWork : int 1 1 1 1 0 1 NA 1 1 1 ...
## $ selfBornUS : int 1 1 1 1 1 1 0 1 1 1 ...
## $ motherBornUS : int 0 1 1 1 1 1 1 1 1 1 ...
## $ fatherBornUS : int 0 1 1 1 0 1 NA 1 1 1 ...
## $ englishAtHome : int 0 1 1 1 1 1 1 1 1 1 ...
## $ computerForSchoolwork: int 1 1 1 1 1 1 1 1 1 1 ...
## $ read30MinsADay : int 0 1 0 1 1 0 0 1 0 0 ...
## $ minutesPerWeekEnglish: int 225 450 250 200 250 300 250 300 378 294 ...
## $ studentsInEnglish : int NA 25 28 23 35 20 28 30 20 24 ...
## $ schoolHasLibrary : int 1 1 1 1 1 1 1 1 0 1 ...
## $ publicSchool : int 1 1 1 1 1 1 1 1 1 1 ...
## $ urban : int 1 0 0 1 1 0 1 0 1 0 ...
## $ schoolSize : int 673 1173 1233 2640 1095 227 2080 1913 502 899 ...
## $ readingScore : num 476 575 555 458 614 ...
#PROBLEM 1.1 - DATASET SIZE
#How many students are there in the training set?
nrow(pisaTrain)
## [1] 3663
#Ans:3663
#We can then access the number of rows in the training set with str(pisaTrain) or nrow(pisaTrain).
################################
#PROBLEM 1.2 - SUMMARIZING THE DATASET
#Using tapply() on pisaTrain, what is the average reading test score of males?
tapply(pisaTrain$readingScore,pisaTrain$male,mean,na.rm=T)
## 0 1
## 512.9406 483.5325
#Ans:483.5325
#Of females?
#Ans:512.9406
#################################
#PROBLEM 1.3 - LOCATING MISSING VALUES
#Which variables are missing data in at least one observation in the training set? Select all that apply.
colnames(pisaTrain)[colSums(is.na(pisaTrain)) > 0] #or colnames(pisaTrain)[apply(is.na(pisaTrain), 2, any)]
## [1] "raceeth" "preschool"
## [3] "expectBachelors" "motherHS"
## [5] "motherBachelors" "motherWork"
## [7] "fatherHS" "fatherBachelors"
## [9] "fatherWork" "selfBornUS"
## [11] "motherBornUS" "fatherBornUS"
## [13] "englishAtHome" "computerForSchoolwork"
## [15] "read30MinsADay" "minutesPerWeekEnglish"
## [17] "studentsInEnglish" "schoolHasLibrary"
## [19] "schoolSize"
#or apply(is.na(pisaTrain), 2, any)
#or
which(apply(is.na(pisaTrain), 2, sum)>0) # gives col names as well as col no
## raceeth preschool expectBachelors
## 3 4 5
## motherHS motherBachelors motherWork
## 6 7 8
## fatherHS fatherBachelors fatherWork
## 9 10 11
## selfBornUS motherBornUS fatherBornUS
## 12 13 14
## englishAtHome computerForSchoolwork read30MinsADay
## 15 16 17
## minutesPerWeekEnglish studentsInEnglish schoolHasLibrary
## 18 19 20
## schoolSize
## 23
#Ans:[1]"raceeth" "preschool" "expectBachelors"
# [4] "motherHS" "motherBachelors" "motherWork"
# [7] "fatherHS" "fatherBachelors" "fatherWork"
#[10] "selfBornUS" "motherBornUS" "fatherBornUS"
#[13] "englishAtHome" "computerForSchoolwork" "read30MinsADay"
#[16] "minutesPerWeekEnglish" "studentsInEnglish" "schoolHasLibrary"
#[19] "schoolSize"
#EXPLANATION:We can read which variables have missing values from summary(pisaTrain). Because most variables are collected from study participants via survey, it is expected that most questions will have at least one missing value.
##########################
# Problem 1.4 - Removing missing values
#Linear regression discards observations with missing data, so we will remove all such observations from the training and testing sets. Later in the course, we will learn about imputation, which deals with missing data by filling in missing values with plausible information.
pisaTrain<-na.omit(pisaTrain)
pisaTest<-na.omit(pisaTest)
#How many observations are now in the training set?
nrow(pisaTrain)
## [1] 2414
#Ans:2414
#How many observations are now in the testing set?
nrow(pisaTest)
## [1] 990
#Ans:990
##############################
# Problem 2.1 - Factor variables
#Factor variables are variables that take on a discrete set of values, like the "Region" variable in the WHO dataset from the second lecture of Unit 1. This is an unordered factor because there isn't any natural ordering between the levels. An ordered factor has a natural ordering between the levels (an example would be the classifications "large," "medium," and "small").
sapply(pisaTrain,class) #to get classes of all var
## grade male raceeth
## "integer" "integer" "factor"
## preschool expectBachelors motherHS
## "integer" "integer" "integer"
## motherBachelors motherWork fatherHS
## "integer" "integer" "integer"
## fatherBachelors fatherWork selfBornUS
## "integer" "integer" "integer"
## motherBornUS fatherBornUS englishAtHome
## "integer" "integer" "integer"
## computerForSchoolwork read30MinsADay minutesPerWeekEnglish
## "integer" "integer" "integer"
## studentsInEnglish schoolHasLibrary publicSchool
## "integer" "integer" "integer"
## urban schoolSize readingScore
## "integer" "integer" "numeric"
#Which of the following variables is an unordered factor with at least 3 levels? (Select all that apply.)
levels(pisaTrain$raceeth)
## [1] "American Indian/Alaska Native"
## [2] "Asian"
## [3] "Black"
## [4] "Hispanic"
## [5] "More than one race"
## [6] "Native Hawaiian/Other Pacific Islander"
## [7] "White"
#Ans:raceeth
#Which of the following variables is an ordered factor with at least 3 levels? (Select all that apply.)
#Ans:grade
#Explanation:Male only has 2 levels (1 and 0). There is no natural ordering between the different values of raceeth, so it is an unordered factor. Meanwhile, we can order grades (8, 9, 10, 11, 12), so it is an ordered factor.
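#A small illustrative sketch of the distinction (toy data, not from the assignment):
sizes <- factor(c("large", "small", "medium", "small"))   # unordered factor: no ranking among levels
sizes.ord <- factor(c("large", "small", "medium", "small"),
                    levels = c("small", "medium", "large"), ordered = TRUE)   # ordered: small < medium < large
levels(sizes.ord)
sizes.ord[1] > sizes.ord[2]   # comparisons like this are only meaningful for ordered factors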
########################################
# Problem 2.2 - Unordered factors in regression models
#To include unordered factors in a linear regression model, we define one level as the "reference level" and add a binary variable for each of the remaining levels. In this way, a factor with n levels is replaced by n-1 binary variables. The reference level is typically selected to be the most frequently occurring level in the dataset.
#As an example, consider the unordered factor variable "color", with levels "red", "green", and "blue". If "green" were the reference level, then we would add binary variables "colorred" and "colorblue" to a linear regression problem. All red examples would have colorred=1 and colorblue=0. All blue examples would have colorred=0 and colorblue=1. All green examples would have colorred=0 and colorblue=0.
#Now, consider the variable "raceeth" in our problem, which has levels "American Indian/Alaska Native", "Asian", "Black", "Hispanic", "More than one race", "Native Hawaiian/Other Pacific Islander", and "White". Because it is the most common in our population, we will select White as the reference level.
#Which binary variables will be included in the regression model? (Select all that apply.)
#Ans:raceethAmerican Indian/Alaska Native,raceethAsian,raceethBlack ,raceethHispanic ,raceethMore than one race,raceethNative Hawaiian/Other Pacific Islander
#Explanation:We create a binary variable for each level except the reference level, so we would create all these variables except for raceethWhite.
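#A hedged illustration of this encoding using the toy "color" factor described above (not part of the assignment):
color <- factor(c("red", "green", "blue", "green"))
color <- relevel(color, ref = "green")   # make "green" the reference level
model.matrix(~ color)                    # one binary column per non-reference level (colorblue, colorred)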
#########################################
# Problem 2.3 - Example unordered factors
#Consider again adding our unordered factor race to the regression model with reference level "White".
#For a student who is Asian, which binary variables would be set to 0? All remaining variables will be set to 1. (Select all that apply.)
#Ans:raceethAmerican Indian/Alaska Native,raceethBlack ,raceethHispanic ,raceethMore than one race,raceethNative Hawaiian/Other Pacific Islander
#For a student who is white, which binary variables would be set to 0? All remaining variables will be set to 1. (Select all that apply.)
#Ans:raceethAmerican Indian/Alaska Native,raceethAsian,raceethBlack ,raceethHispanic ,raceethMore than one race,raceethNative Hawaiian/Other Pacific Islander
#Explanation:An Asian student will have raceethAsian set to 1 and all other raceeth binary variables set to 0. Because "White" is the reference level, a white student will have all raceeth binary variables set to 0.
######################################
# Problem 3.1 - Building a model
#Because the race variable takes on text values, it was loaded as a factor variable when we read in the dataset with read.csv() -- you can see this when you run str(pisaTrain) or str(pisaTest). However, by default R selects the first level alphabetically ("American Indian/Alaska Native") as the reference level of our factor instead of the most common level ("White"). Set the reference level of the factor by typing the following two lines in your R console:
pisaTrain$raceeth<-relevel(pisaTrain$raceeth, "White")
pisaTest$raceeth<-relevel(pisaTest$raceeth, "White")
#Now, build a linear regression model (call it lmScore) using the training set to predict readingScore using all the remaining variables.
#It would be time-consuming to type all the variables, but R provides the shorthand notation "readingScore ~ ." to mean "predict readingScore using all the other variables in the data frame." The period is used to replace listing out all of the independent variables. As an example, if your dependent variable is called "Y", your independent variables are called "X1", "X2", and "X3", and your training data set is called "Train", instead of the regular notation:
#LinReg = lm(Y ~ X1 + X2 + X3, data = Train)
#You would use the following command to build your model:
#LinReg = lm(Y ~ ., data = Train)
lmScore<-lm(readingScore ~ .,data=pisaTrain)
summary(lmScore)
##
## Call:
## lm(formula = readingScore ~ ., data = pisaTrain)
##
## Residuals:
## Min 1Q Median 3Q Max
## -247.44 -48.86 1.86 49.77 217.18
##
## Coefficients:
## Estimate Std. Error
## (Intercept) 143.766333 33.841226
## grade 29.542707 2.937399
## male -14.521653 3.155926
## raceethAmerican Indian/Alaska Native -67.277327 16.786935
## raceethAsian -4.110325 9.220071
## raceethBlack -67.012347 5.460883
## raceethHispanic -38.975486 5.177743
## raceethMore than one race -16.922522 8.496268
## raceethNative Hawaiian/Other Pacific Islander -5.101601 17.005696
## preschool -4.463670 3.486055
## expectBachelors 55.267080 4.293893
## motherHS 6.058774 6.091423
## motherBachelors 12.638068 3.861457
## motherWork -2.809101 3.521827
## fatherHS 4.018214 5.579269
## fatherBachelors 16.929755 3.995253
## fatherWork 5.842798 4.395978
## selfBornUS -3.806278 7.323718
## motherBornUS -8.798153 6.587621
## fatherBornUS 4.306994 6.263875
## englishAtHome 8.035685 6.859492
## computerForSchoolwork 22.500232 5.702562
## read30MinsADay 34.871924 3.408447
## minutesPerWeekEnglish 0.012788 0.010712
## studentsInEnglish -0.286631 0.227819
## schoolHasLibrary 12.215085 9.264884
## publicSchool -16.857475 6.725614
## urban -0.110132 3.962724
## schoolSize 0.006540 0.002197
## t value Pr(>|t|)
## (Intercept) 4.248 2.24e-05 ***
## grade 10.057 < 2e-16 ***
## male -4.601 4.42e-06 ***
## raceethAmerican Indian/Alaska Native -4.008 6.32e-05 ***
## raceethAsian -0.446 0.65578
## raceethBlack -12.271 < 2e-16 ***
## raceethHispanic -7.528 7.29e-14 ***
## raceethMore than one race -1.992 0.04651 *
## raceethNative Hawaiian/Other Pacific Islander -0.300 0.76421
## preschool -1.280 0.20052
## expectBachelors 12.871 < 2e-16 ***
## motherHS 0.995 0.32001
## motherBachelors 3.273 0.00108 **
## motherWork -0.798 0.42517
## fatherHS 0.720 0.47147
## fatherBachelors 4.237 2.35e-05 ***
## fatherWork 1.329 0.18393
## selfBornUS -0.520 0.60331
## motherBornUS -1.336 0.18182
## fatherBornUS 0.688 0.49178
## englishAtHome 1.171 0.24153
## computerForSchoolwork 3.946 8.19e-05 ***
## read30MinsADay 10.231 < 2e-16 ***
## minutesPerWeekEnglish 1.194 0.23264
## studentsInEnglish -1.258 0.20846
## schoolHasLibrary 1.318 0.18749
## publicSchool -2.506 0.01226 *
## urban -0.028 0.97783
## schoolSize 2.977 0.00294 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 73.81 on 2385 degrees of freedom
## Multiple R-squared: 0.3251, Adjusted R-squared: 0.3172
## F-statistic: 41.04 on 28 and 2385 DF, p-value: < 2.2e-16
#What is the Multiple R-squared value of lmScore on the training set?
#Ans:0.3251
#Note that this R-squared is lower than the ones for the models we saw in the lectures and recitation. This does not necessarily imply that the model is of poor quality. More often than not, it simply means that the prediction problem at hand (predicting a student's test score based on demographic and school-related variables) is more difficult than other prediction problems (like predicting a team's number of wins from their runs scored and allowed, or predicting the quality of wine from weather conditions).
###############################
#Problem 3.2 - Computing the root-mean squared error of the model
#What is the training-set root-mean squared error (RMSE) of lmScore?
# Compute SSE and RMSE for new model
SSE = sum(lmScore$residuals^2)
SSE
## [1] 12993365
RMSE = sqrt(SSE / nrow(pisaTrain))
RMSE
## [1] 73.36555
# or calculate RMSE directly
sqrt(mean(lmScore$residuals^2))
## [1] 73.36555
#Ans:73.36555
#using the qpcR package to calculate RMSE
library(qpcR)
## Loading required package: MASS
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
## Loading required package: minpack.lm
## Loading required package: rgl
## Loading required package: robustbase
##
## Attaching package: 'robustbase'
## The following object is masked from 'package:survival':
##
## heart
## Loading required package: Matrix
##
## Attaching package: 'Matrix'
## The following object is masked from 'package:tidyr':
##
## expand
RMSE(lmScore, which = NULL)
## [1] 73.36555
#############################
# Problem 3.3 - Comparing predictions for similar students
#Consider two students A and B. They have all variable values the same, except that student A is in grade 11 and student B is in grade 9. What is the predicted reading score of student A minus the predicted reading score of student B?
29.542707*(11-9) # grade var coeff times the diff in grade
## [1] 59.08541
#59.08541
#Explanation:The coefficient 29.54 on grade is the difference in reading score between two students who are identical other than having a difference in grade of 1. Because A and B have a difference in grade of 2, the model predicts that student A has a reading score that is 2*29.54 larger.
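#A quick hedged check of the same answer via predict() (reuses lmScore and pisaTrain from above; the choice of row 1 is arbitrary):
studentA <- pisaTrain[1, ]; studentA$grade <- 11
studentB <- pisaTrain[1, ]; studentB$grade <- 9
predict(lmScore, newdata = studentA) - predict(lmScore, newdata = studentB)   # equals 2*coef(lmScore)["grade"], about 59.09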
################################
# Problem 3.4 - Interpreting model coefficients
#What is the meaning of the coefficient associated with variable raceethAsian?
coef(summary(lmScore))["raceethAsian","Estimate"]
## [1] -4.110325
summary(lmScore)$coefficients["raceethAsian","Estimate"]
## [1] -4.110325
#Ans:Predicted difference in the reading score between an Asian student and a white student who is otherwise identical
#Explanation:The only difference between an Asian student and white student with otherwise identical variables is that the former has raceethAsian=1 and the latter has raceethAsian=0. The predicted reading score for these two students will differ by the coefficient on the variable raceethAsian.
##############################
# Problem 3.5 - Identifying variables lacking statistical significance
#Based on the significance codes, which variables are candidates for removal from the model? Select all that apply. (We'll assume that the factor variable raceeth should only be removed if none of its levels are significant.)
#Ans:preschool,motherHS,motherWork,fatherHS,fatherWork ,selfBornUS,motherBornUS,fatherBornUS,englishAtHome,minutesPerWeekEnglish,studentsInEnglish,schoolHasLibrary,urban
#Explanation:From summary(lmScore), we can see which variables were significant at the 0.05 level. Because several of the binary variables generated from the race factor variable are significant, we should not remove this variable.
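#A hedged way to list these candidates programmatically (individual raceeth dummies also show up here, but we keep the raceeth factor because several of its levels are significant):
pvals <- coef(summary(lmScore))[, "Pr(>|t|)"]
sort(names(pvals)[pvals > 0.05])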
########################
# Problem 4.1 - Predicting on unseen data
#Using the "predict" function and supplying the "newdata" argument, use the lmScore model to predict the reading scores of students in pisaTest. Call this vector of predictions "predTest". Do not change the variables in the model (for example, do not remove variables that we found were not significant in the previous part of this problem). Use the summary function to describe the test set predictions.
#What is the range between the maximum and minimum predicted reading score on the test set?
predTest<-predict(lmScore,newdata = pisaTest)
range(predTest) # we get the min & max values
## [1] 353.2231 637.6914
353.2231-637.6914
## [1] -284.4683
#Ans:284.4683
################################
# Problem 4.2 - Test set SSE and RMSE
#What is the sum of squared errors (SSE) of lmScore on the testing set?
#method1:lets use long hand method (a.k.a using formula)
SSE = sum((predTest - pisaTest$readingScore)^2)
#Ans:5762082
# Compute the RMSE
sqrt(mean((predTest-pisaTest$readingScore)^2)) # orRMSE = sqrt(SSE / nrow(pisaTest))
## [1] 76.29079
#Ans:76.29079
################################
# Problem 4.3 - Baseline prediction and test-set SSE
#What is the predicted test score used in the baseline model? Remember to compute this value using the training set and not the test set.
baseline = mean(pisaTrain$readingScore)
baseline
## [1] 517.9629
#Ans:517.9629
#What is the sum of squared errors of the baseline model on the testing set? HINT: We call the sum of squared errors for the baseline model the total sum of squares (SST)
SST<-sum((pisaTest$readingScore-mean(pisaTrain$readingScore))^2)
SST
## [1] 7802354
#or
sum((baseline-pisaTest$readingScore)^2)
## [1] 7802354
#Ans:7802354
###########################
# Problem 4.4 - Test-set R-squared
#What is the test-set R-squared value of lmScore?
1-5762082/7802354
## [1] 0.2614944
#Ans:0.2614944
#Explanation:The test-set R^2 is defined as 1-SSE/SST, where SSE is the sum of squared errors of the model on the test set and SST is the sum of squared errors of the baseline model. For this model, the R^2 is then computed to be 1-5762082/7802354.
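#Equivalently, reusing the SSE and SST objects computed above:
1 - SSE/SST   # about 0.2615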
Flu epidemics constitute a major public health concern, causing respiratory illnesses, hospitalizations, and deaths. According to the National Vital Statistics Reports published in October 2012, influenza ranked as the eighth leading cause of death in 2011 in the United States. Each year, 250,000 to 500,000 deaths are attributed to influenza-related diseases throughout the world.
The U.S. Centers for Disease Control and Prevention (CDC) and the European Influenza Surveillance Scheme (EISS) detect influenza activity through virologic and clinical data, including Influenza-like Illness (ILI) physician visits. These national and regional data, however, are published with a 1-2 week lag.
The Google Flu Trends project was initiated to see if faster reporting can be made possible by considering flu-related online search queries – data that is available almost immediately.
PROBLEM 1.1 - UNDERSTANDING THE DATA
We would like to estimate influenza-like illness (ILI) activity using Google web search logs. Fortunately, one can easily access this data online:
ILI Data - The CDC publishes on its website the official regional and state-level percentage of patient visits to healthcare providers for ILI purposes on a weekly basis.
Google Search Queries - Google Trends allows public retrieval of weekly counts for every query searched by users around the world. For each location, the counts are normalized by dividing the count for each query in a particular week by the total number of online search queries submitted in that location during the week. Then, the values are adjusted to be between 0 and 1.
The csv file FluTrain.csv aggregates this data from January 1, 2004 until December 31, 2011 as follows:
“Week” - The range of dates represented by this observation, in year/month/day format.
“ILI” - This column lists the percentage of ILI-related physician visits for the corresponding week.
“Queries” - This column lists the fraction of queries that are ILI-related for the corresponding week, adjusted to be between 0 and 1 (higher values correspond to more ILI-related search queries).
#lets load the data
FluTrain<-read.csv("FluTrain.csv")
str(FluTrain)
## 'data.frame': 417 obs. of 3 variables:
## $ Week : Factor w/ 417 levels "2004-01-04 - 2004-01-10",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ ILI : num 2.42 1.81 1.71 1.54 1.44 ...
## $ Queries: num 0.238 0.22 0.226 0.238 0.224 ...
#Looking at the time period 2004-2011, which week corresponds to the highest percentage of ILI-related physician visits? Select the day of the month corresponding to the start of this week.
subset(FluTrain, ILI == max(ILI)) #or FluTrain$Week[which.max(FluTrain$ILI)]
## Week ILI Queries
## 303 2009-10-18 - 2009-10-24 7.618892 1
#Ans:2009-10-18
#Which week corresponds to the highest percentage of ILI-related query fraction?
subset(FluTrain,Queries==max(Queries)) #or FluTrain$Week[which.max(FluTrain$Queries)]
## Week ILI Queries
## 303 2009-10-18 - 2009-10-24 7.618892 1
#Ans:2009-10-18
#######################################
#PROBLEM 1.2 - UNDERSTANDING THE DATA
#Let us now understand the data at an aggregate level. Plot the histogram of the dependent variable, ILI. What best describes the distribution of values of ILI?
hist(FluTrain$ILI)
#Ans:Most of the ILI values are small, with a relatively small number of much larger values (in statistics, this sort of data is called "skew right").
# the data is skew right
########################################
#PROBLEM 1.3 - UNDERSTANDING THE DATA
#When handling a skewed dependent variable, it is often useful to predict the logarithm of the dependent variable instead of the dependent variable itself -- this prevents the small number of unusually large or small observations from having an undue influence on the sum of squared errors of predictive models. In this problem, we will predict the natural log of the ILI variable, which can be computed in R using the log() function.
#Plot the natural logarithm of ILI versus Queries. What does the plot suggest?.
plot(FluTrain$Queries,log(FluTrain$ILI))
#Ans:There is a positive, linear relationship between log(ILI) and Queries.
###########################################
#PROBLEM 2.1 - LINEAR REGRESSION MODEL
#Based on the plot we just made, it seems that a linear regression model could be a good modeling choice. Based on our understanding of the data from the previous subproblem, which model best describes our estimation problem?
FluTrend1<-lm(log(ILI)~Queries,data=FluTrain)
summary(FluTrend1)
##
## Call:
## lm(formula = log(ILI) ~ Queries, data = FluTrain)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.76003 -0.19696 -0.01657 0.18685 1.06450
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.49934 0.03041 -16.42 <2e-16 ***
## Queries 2.96129 0.09312 31.80 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2995 on 415 degrees of freedom
## Multiple R-squared: 0.709, Adjusted R-squared: 0.7083
## F-statistic: 1011 on 1 and 415 DF, p-value: < 2.2e-16
#Ans:log(ILI) = intercept + coefficient x Queries, where the coefficient is positive
#EXPLANATION:From the previous subproblem, we are predicting log(ILI) using the Queries variable. From the plot in the previous subproblem, we expect the coefficient on Queries to be positive.
#######################################
#PROBLEM 2.2 - LINEAR REGRESSION MODEL
#Let's call the regression model from the previous problem (Problem 2.1) FluTrend1 and run it in R. Hint: to take the logarithm of a variable Var in a regression equation, you simply use log(Var) when specifying the formula to the lm() function.
FluTrend1<-lm(log(ILI)~Queries,data=FluTrain)
summary(FluTrend1)
##
## Call:
## lm(formula = log(ILI) ~ Queries, data = FluTrain)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.76003 -0.19696 -0.01657 0.18685 1.06450
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.49934 0.03041 -16.42 <2e-16 ***
## Queries 2.96129 0.09312 31.80 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2995 on 415 degrees of freedom
## Multiple R-squared: 0.709, Adjusted R-squared: 0.7083
## F-statistic: 1011 on 1 and 415 DF, p-value: < 2.2e-16
#What is the training set R-squared value for FluTrend1 model (the "Multiple R-squared")?
#Ans:0.709
##########################################
#PROBLEM 2.3 - LINEAR REGRESSION MODEL
#For a single variable linear regression model, there is a direct relationship between the R-squared and the correlation between the independent and the dependent variables. What is the relationship we infer from our problem? (Don't forget that you can use the cor function to compute the correlation between two variables.)
Correlation<-cor(FluTrain$Queries,log(FluTrain$ILI))
Correlation^2
## [1] 0.7090201
#Ans:R-squared = Correlation^2
#To test these hypotheses, we first need to compute the correlation between the independent variable used in the model (Queries) and the dependent variable (log(ILI)).
log(1/Correlation)
## [1] 0.1719357
exp(-0.5*Correlation)
## [1] 0.6563792
#It appears that Correlation^2 is equal to the R-squared value. It can be proved that this is always the case.
#########################################
#PROBLEM 3.1 - PERFORMANCE ON THE TEST SET
#The csv file FluTest.csv provides the 2012 weekly data of the ILI-related search queries and the observed weekly percentage of ILI-related physician visits.
FluTest<-read.csv("FluTest.csv")
#Normally, we would obtain test-set predictions from the model FluTrend1 using the code
#PredTest1 = predict(FluTrend1, newdata=FluTest)
#However, the dependent variable in our model is log(ILI), so PredTest1 would contain predictions of the log(ILI) value. We are instead interested in obtaining predictions of the ILI value. We can convert from predictions of log(ILI) to predictions of ILI via exponentiation, or the exp() function. The new code, which predicts the ILI value, is
PredTest1 = exp(predict(FluTrend1, newdata=FluTest))
#What is our estimate for the percentage of ILI-related physician visits for the week of March 11, 2012? (HINT: You can either just output FluTest$Week to find which element corresponds to March 11, 2012, or you can use the "which" function in R. To learn more about the which function, type ?which in your R console.)
PredTest1[which(FluTest$Week=="2012-03-11 - 2012-03-17")]
## 11
## 2.187378
#Ans:2.187383
###########################################
#PROBLEM 3.2 - PERFORMANCE ON THE TEST SET
#What is the relative error betweeen the estimate (our prediction) and the observed value for the week of March 11, 2012? Note that the relative error is calculated as
#(Observed ILI - Estimated ILI)/Observed ILI
FluTest$ILI[which(FluTest$Week=="2012-03-11 - 2012-03-17")]
## [1] 2.293422
(2.293422-2.187383)/2.293422
## [1] 0.04623615
#Ans:0.04623615
#EXPLANATION:From the previous problem, we know the predicted value is 2.187378. The actual value is the 11th testing set ILI value or FluTest$ILI[11], which has value 2.293422. Finally we compute the relative error to be (2.293422 - 2.187378)/2.293422.
######################################
#PROBLEM 3.3 - PERFORMANCE ON THE TEST SET
#What is the Root Mean Square Error (RMSE) between our estimates and the actual observations for the percentage of ILI-related physician visits, on the test set?
#lets use long hand method (a.k.a using formula)
SSE = sum((PredTest1 - FluTest$ILI)^2)
# Compute the RMSE
sqrt(mean((PredTest1-FluTest$ILI)^2)) # orRMSE = sqrt(SSE / nrow(FluTest))
## [1] 0.7490645
#Ans: 0.7490645
############################################
#PROBLEM 4.1 - TRAINING A TIME SERIES MODEL
#The observations in this dataset are consecutive weekly measurements of the dependent and independent variables. This sort of dataset is called a "time series." Often, statistical models can be improved by predicting the current value of the dependent variable using the value of the dependent variable from earlier weeks. In our models, this means we will predict the ILI variable in the current week using values of the ILI variable from previous weeks.
#First, we need to decide the amount of time to lag the observations. Because the ILI variable is reported with a 1- or 2-week lag, a decision maker cannot rely on the previous week's ILI value to predict the current week's value. Instead, the decision maker will only have data available from 2 or more weeks ago. We will build a variable called ILILag2 that contains the ILI value from 2 weeks before the current observation.
#To do so, we will use the "zoo" package, which provides a number of helpful methods for time series models. While many functions are built into R, you need to add new packages to use some functions. New packages can be installed and loaded easily in R, and we will do this many times in this class. Run the following two commands to install and load the zoo package. In the first command, you will be prompted to select a CRAN mirror to use for your download. Select a mirror near you geographically.
library(zoo)
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
#After installing and loading the zoo package, run the following commands to create the ILILag2 variable in the training set:
ILILag2 = lag(zoo(FluTrain$ILI), k=-2, na.pad=TRUE)
FluTrain$ILILag2 = coredata(ILILag2)
#In these commands, the value of -2 passed to lag means to return 2 observations before the current one; a positive value would have returned future observations. The parameter na.pad=TRUE means to add missing values for the first two weeks of our dataset, where we can't compute the data from 2 weeks earlier.
#How many values are missing in the new ILILag2 variable?
sum(is.na(FluTrain$ILILag2)) #or summary(FluTrain$ILILag2)
## [1] 1
#Ans:2
#Note: this run reports only 1 missing value above, most likely because dplyr is attached and dplyr::lag masks stats::lag (ignoring the k and na.pad arguments). Calling stats::lag() explicitly, or detaching dplyr, reproduces the expected 2 missing values and the course answers below.
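#A tiny illustration on made-up numbers, calling stats::lag() explicitly to avoid the masking issue (a sketch, not from the assignment):
x <- zoo(c(10, 20, 30, 40, 50))
stats::lag(x, k = -2, na.pad = TRUE)   # should print NA NA 10 20 30 -- each value comes from 2 positions earlier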
#################################
#PROBLEM 4.2 - TRAINING A TIME SERIES MODEL
#Use the plot() function to plot the log of ILILag2 against the log of ILI. Which best describes the relationship between these two variables?
plot(log(FluTrain$ILILag2),log(FluTrain$ILI))
#Ans:There is a strong positive relationship between log(ILILag2) and log(ILI)
###################################
#PROBLEM 4.3 - TRAINING A TIME SERIES MODEL
#Train a linear regression model on the FluTrain dataset to predict the log of the ILI variable using the Queries variable as well as the log of the ILILag2 variable. Call this model FluTrend2.
FluTrend2<-lm(log(ILI)~Queries + log(ILILag2),data=FluTrain)
summary(FluTrend2)
##
## Call:
## lm(formula = log(ILI) ~ Queries + log(ILILag2), data = FluTrain)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.38495 -0.07746 -0.01194 0.06849 0.66338
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.11980 0.01546 -7.749 7.25e-14 ***
## Queries 0.63227 0.06605 9.573 < 2e-16 ***
## log(ILILag2) 0.82279 0.01879 43.787 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1256 on 413 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.949, Adjusted R-squared: 0.9487
## F-statistic: 3839 on 2 and 413 DF, p-value: < 2.2e-16
#Which coefficients are significant at the p=0.05 level in this regression model? (Select all that apply.)
#Ans:Intercept ,Queries,log(ILILag2)
#What is the R^2 value of the FluTrend2 model?
#Ans:0.9063 (the course answer; this run shows 0.949 because of the lag issue noted in Problem 4.1)
#########################################
#PROBLEM 4.4 - TRAINING A TIME SERIES MODEL
#On the basis of R-squared value and significance of coefficients, which statement is the most accurate?
#Ans:FluTrend2 is a stronger model than FluTrend1 on the training set.
#EXPLANATION:Moving from FluTrend1 to FluTrend2, in-sample R^2 improved from 0.709 to 0.9063, and the new variable is highly significant. As a result, there is no sign of overfitting, and FluTrend2 is superior to FluTrend1 on the training set.
################
#PROBLEM 5.1 - EVALUATING THE TIME SERIES MODEL IN THE TEST SET
#So far, we have only added the ILILag2 variable to the FluTrain data frame. To make predictions with our FluTrend2 model, we will also need to add ILILag2 to the FluTest data frame (note that adding variables before splitting into a training and testing set can prevent this duplication of effort).
#Modify the code from the previous subproblem to add an ILILag2 variable to the FluTest data frame. How many missing values are there in this new variable?
ILILag2 = lag(zoo(FluTest$ILI), k=-2, na.pad=TRUE)
#??lag()
FluTest$ILILag2 = coredata(ILILag2)
summary(FluTest$ILILag2)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.9018 1.1450 1.3580 1.5780 1.8150 4.5470 1
#Ans:2 (again this run shows only 1 NA, for the same reason as in Problem 4.1)
#############################
#PROBLEM 5.2 - EVALUATING THE TIME SERIES MODEL IN THE TEST SET
#In this problem, the training and testing sets are split sequentially -- the training set contains all observations from 2004-2011 and the testing set contains all observations from 2012. There is no time gap between the two datasets, meaning the first observation in FluTest was recorded one week after the last observation in FluTrain. From this, we can identify how to fill in the missing values for the ILILag2 variable in FluTest.
#Which value should be used to fill in the ILILag2 variable for the first observation in FluTest?
#Ans:The ILI value of the second-to-last observation in the FluTrain data frame
#EXPLANATION:The time two weeks before the first week of 2012 is the second-to-last week of 2011. This corresponds to the second-to-last observation in FluTrain.
#Which value should be used to fill in the ILILag2 variable for the second observation in FluTest?
#Ans:The ILI value of the last observation in the FluTrain data frame.
#EXPLANATION:The time two weeks before the second week of 2012 is the last week of 2011. This corresponds to the last observation in FluTrain.
################################
#PROBLEM 5.3 - EVALUATING THE TIME SERIES MODEL IN THE TEST SET
#Fill in the missing values for ILILag2 in FluTest. In terms of syntax, you could set the value of ILILag2 in row "x" of the FluTest data frame to the value of ILI in row "y" of the FluTrain data frame with "FluTest$ILILag2[x] = FluTrain$ILI[y]". Use the answer to the previous questions to determine the appropriate values of "x" and "y". It may be helpful to check the total number of rows in FluTrain using str(FluTrain) or nrow(FluTrain).
nrow(FluTrain)
## [1] 417
#What is the new value of the ILILag2 variable in the first row of FluTest?
FluTest$ILILag2[1] = FluTrain$ILI[416]
FluTest$ILILag2[1]
## [1] 1.852736
#Ans: 1.852736
#What is the new value of the ILILag2 variable in the second row of FluTest?
FluTest$ILILag2[2] = FluTrain$ILI[417]
FluTest$ILILag2[2]
## [1] 2.12413
#Ans:2.12413
####################################
#PROBLEM 5.4 - EVALUATING THE TIME SERIES MODEL IN THE TEST SET
#Obtain test set predictions of the ILI variable from the FluTrend2 model, again remembering to call the exp() function on the result of the predict() function to obtain predictions for ILI instead of log(ILI).
PredTest2 = exp(predict(FluTrend2, newdata=FluTest))
#What is the test-set RMSE of the FluTrend2 model?
#lets use long hand method (a.k.a using formula)
SSE = sum((PredTest2 - FluTest$ILI)^2)
# Compute the RMSE
sqrt(mean((PredTest2-FluTest$ILI)^2)) # orRMSE = sqrt(SSE / nrow(FluTest))
## [1] 0.2229195
#Ans:0.2942029 (the course answer; this run shows 0.2229 because of the lag issue noted in Problem 4.1)
########################
#PROBLEM 5.5 - EVALUATING THE TIME SERIES MODEL IN THE TEST SET
#Which model obtained the best test-set RMSE?
#Ans:FluTrend2
#EXPLANATION:The test-set RMSE of FluTrend2 is 0.294, as opposed to the 0.749 value obtained by the FluTrend1 model.
#In this problem, we used a simple time series model with a single lag term. ARIMA models are a more general form of the model we built, which can include multiple lag terms as well as more complicated combinations of previous values of the dependent variable.
#Let's compare the models from PROBLEM 4.4 above using the sjPlot package
library(sjPlot)
sjt.lm(FluTrend1,FluTrend2)
Dependent variable: log(ILI) (both models)

| Predictor    | FluTrend1 B | CI            | p     | FluTrend2 B | CI            | p     |
|--------------|-------------|---------------|-------|-------------|---------------|-------|
| (Intercept)  | -0.50       | -0.56 – -0.44 | <.001 | -0.12       | -0.15 – -0.09 | <.001 |
| Queries      | 2.96        | 2.78 – 3.14   | <.001 | 0.63        | 0.50 – 0.76   | <.001 |
| log(ILILag2) |             |               |       | 0.82        | 0.79 – 0.86   | <.001 |
| Observations | 417         |               |       | 416         |               |       |
| R2 / adj. R2 | .709 / .708 |               |       | .949 / .949 |               |       |
We often take data for granted. However, one of the hardest parts about analyzing a problem you’re interested in can be to find good data to answer the questions you want to ask. As you’re learning R, though, there are many datasets that R has built in that you can take advantage of.
In this problem, we will be examining the “state” dataset, which has data from the 1970s on all fifty US states. For each state, the dataset includes the population, per capita income, illiteracy rate, murder rate, high school graduation rate, average number of frost days, area, latitude and longitude, division the state belongs to, region the state belongs to, and two-letter abbreviation.
#lets load the data set
data(state) #built-in dataset consisting of several vectors and the matrix state.x77; let's combine them into a single data frame
statedata = cbind(data.frame(state.x77), state.abb, state.area, state.center, state.division, state.name, state.region) # or import statedata.csv
str(statedata)
## 'data.frame': 50 obs. of 15 variables:
## $ Population : num 3615 365 2212 2110 21198 ...
## $ Income : num 3624 6315 4530 3378 5114 ...
## $ Illiteracy : num 2.1 1.5 1.8 1.9 1.1 0.7 1.1 0.9 1.3 2 ...
## $ Life.Exp : num 69 69.3 70.5 70.7 71.7 ...
## $ Murder : num 15.1 11.3 7.8 10.1 10.3 6.8 3.1 6.2 10.7 13.9 ...
## $ HS.Grad : num 41.3 66.7 58.1 39.9 62.6 63.9 56 54.6 52.6 40.6 ...
## $ Frost : num 20 152 15 65 20 166 139 103 11 60 ...
## $ Area : num 50708 566432 113417 51945 156361 ...
## $ state.abb : Factor w/ 50 levels "AK","AL","AR",..: 2 1 4 3 5 6 7 8 9 10 ...
## $ state.area : num 51609 589757 113909 53104 158693 ...
## $ x : num -86.8 -127.2 -111.6 -92.3 -119.8 ...
## $ y : num 32.6 49.2 34.2 34.7 36.5 ...
## $ state.division: Factor w/ 9 levels "New England",..: 4 9 8 5 9 8 1 3 3 3 ...
## $ state.name : Factor w/ 50 levels "Alabama","Alaska",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ state.region : Factor w/ 4 levels "Northeast","South",..: 2 4 4 2 4 4 1 2 2 2 ...
#PROBLEM 1.1 - DATA EXPLORATION
#We begin by exploring the data. Plot all of the states' centers with latitude on the y axis (the "y" variable in our dataset) and longitude on the x axis (the "x" variable in our dataset). The shape of the plot should look like the outline of the United States! Note that Alaska and Hawaii have had their coordinates adjusted to appear just off of the west coast.
plot(statedata$x,statedata$y)
#In the R command you used to generate this plot, which variable name did you use as the first argument?
#Ans:statedata$x (longitude goes on the horizontal axis, so it is the first argument to plot)
####################################
#PROBLEM 1.2 - DATA EXPLORATION
#Using the tapply command, determine which region of the US (West, North Central, South, or Northeast) has the highest average high school graduation rate of all the states in the region:
tapply(statedata$HS.Grad,statedata$state.region,mean)
## Northeast South North Central West
## 53.96667 44.34375 54.51667 62.00000
#Ans:West
########################
#PROBLEM 1.3 - DATA EXPLORATION
#Now, make a boxplot of the murder rate by region (for more information about creating boxplots in R, type ?boxplot in your console).
plot(statedata$Murder~state.region) #or boxplot(statedata$Murder ~ statedata$state.region)
#Which region has the highest median murder rate?
#Ans:South
#########################
#PROBLEM 1.4 - DATA EXPLORATION
#You should see that there is an outlier in the Northeast region of the boxplot you just generated. Which state does this correspond to? (Hint: There are many ways to find the answer to this question, but one way is to use the subset command to only look at the Northeast data.)
NE<-subset(statedata,statedata$state.region=="Northeast")
NE$state.name[NE$Murder==max(NE$Murder)]
## [1] New York
## 50 Levels: Alabama Alaska Arizona Arkansas California ... Wyoming
#Ans:New York
#################################
#PROBLEM 2.1 - PREDICTING LIFE EXPECTANCY - AN INITIAL MODEL
#We would like to build a model to predict life expectancy by state using the state statistics we have in our dataset.
#Build the model with all potential variables included (Population, Income, Illiteracy, Murder, HS.Grad, Frost, and Area). Note that you should use the variable "Area" in your model, NOT the variable "state.area".
LinReg<-lm(Life.Exp~Population+Income+Illiteracy+Murder+HS.Grad+Frost+Area,data=statedata)
summary(LinReg)
##
## Call:
## lm(formula = Life.Exp ~ Population + Income + Illiteracy + Murder +
## HS.Grad + Frost + Area, data = statedata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.48895 -0.51232 -0.02747 0.57002 1.49447
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.094e+01 1.748e+00 40.586 < 2e-16 ***
## Population 5.180e-05 2.919e-05 1.775 0.0832 .
## Income -2.180e-05 2.444e-04 -0.089 0.9293
## Illiteracy 3.382e-02 3.663e-01 0.092 0.9269
## Murder -3.011e-01 4.662e-02 -6.459 8.68e-08 ***
## HS.Grad 4.893e-02 2.332e-02 2.098 0.0420 *
## Frost -5.735e-03 3.143e-03 -1.825 0.0752 .
## Area -7.383e-08 1.668e-06 -0.044 0.9649
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7448 on 42 degrees of freedom
## Multiple R-squared: 0.7362, Adjusted R-squared: 0.6922
## F-statistic: 16.74 on 7 and 42 DF, p-value: 2.534e-10
#lets plot the coeff using sjPlot library
library(sjPlot)
#?sjp.lm
sjp.lm(LinReg,type="lm")
#What is the coefficient for "Income" in your linear regression model?
coef(summary(LinReg))["Income","Estimate"] #or look in summary(LinReg)
## [1] -2.180424e-05
#Ans:-2.180424e-05 (or -0.0000218)
#########################################
#PROBLEM 2.2 - PREDICTING LIFE EXPECTANCY - AN INITIAL MODEL
#Call the coefficient for income x (the answer to Problem 2.1). What is the interpretation of the coefficient x?
#Ans:For a one unit increase in income, predicted life expectancy decreases by |x|
#EXPLANATION:If we increase income by one unit, then our model's prediction will increase by the coefficient of income, x. Because x is negative, this is the same as predicted life expectancy decreasing by |x|.
#######################################
#PROBLEM 2.3 - PREDICTING LIFE EXPECTANCY - AN INITIAL MODEL
#Now plot a graph of life expectancy vs. income using the command:
plot(statedata$Income, statedata$Life.Exp)
#Visually observe the plot. What appears to be the relationship?
#Ans:Life expectancy is somewhat positively correlated with income
#EXPLANATION:Although the point in the lower right hand corner of the plot appears to be an outlier, we observe a positive linear relationship in the plot.
#####################################
#PROBLEM 2.4 - PREDICTING LIFE EXPECTANCY - AN INITIAL MODEL
#The model we built does not display the relationship we saw from the plot of life expectancy vs. income. Which of the following explanations seems the most reasonable?
#Ans:Multicollinearity
#EXPLANATION:Although income is an insignificant variable in the model, this does not mean that there is no association between income and life expectancy. However, in the presence of all of the other variables, income does not add statistically significant explanatory power to the model. This means that multicollinearity is probably the issue.
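#A hedged way to see the multicollinearity: look at the correlations between Income and the other predictors
cor(statedata[, c("Life.Exp", "Income", "HS.Grad", "Illiteracy", "Murder")])
#Income is fairly strongly correlated with HS.Grad (and negatively with Illiteracy), so in the presence of those variables it adds little independent explanatory power.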
###########################################
#PROBLEM 3.1 - PREDICTING LIFE EXPECTANCY - REFINING THE MODEL AND ANALYZING PREDICTIONS
#Recall that we discussed the principle of simplicity: that is, a model with fewer variables is preferable to a model with many unnecessary variables. Experiment with removing independent variables from the original model. Remember to use the significance of the coefficients to decide which variables to remove (remove the one with the largest "p-value" first, or the one with the "t value" closest to zero), and to remove them one at a time (this is called "backwards variable selection"). This is important due to multicollinearity issues - removing one insignificant variable may make another previously insignificant variable become significant.
#You should be able to find a good model with only 4 independent variables, instead of the original 7. Which variables does this model contain?
#Lets do variable selection using leaps package with regsubsets()
library(leaps)
regfit.bwd1<-regsubsets(Life.Exp~Population+Income+Illiteracy+Murder+HS.Grad+Frost+Area,data = statedata,nvmax=7,nbest=1,method = "backward")
reg.summary<-summary(regfit.bwd1)
reg.summary ##An asterisk indicates that a given variable is included in the corresponding model
## Subset selection object
## Call: regsubsets.formula(Life.Exp ~ Population + Income + Illiteracy +
## Murder + HS.Grad + Frost + Area, data = statedata, nvmax = 7,
## nbest = 1, method = "backward")
## 7 Variables (and intercept)
## Forced in Forced out
## Population FALSE FALSE
## Income FALSE FALSE
## Illiteracy FALSE FALSE
## Murder FALSE FALSE
## HS.Grad FALSE FALSE
## Frost FALSE FALSE
## Area FALSE FALSE
## 1 subsets of each size up to 7
## Selection Algorithm: backward
## Population Income Illiteracy Murder HS.Grad Frost Area
## 1 ( 1 ) " " " " " " "*" " " " " " "
## 2 ( 1 ) " " " " " " "*" "*" " " " "
## 3 ( 1 ) " " " " " " "*" "*" "*" " "
## 4 ( 1 ) "*" " " " " "*" "*" "*" " "
## 5 ( 1 ) "*" "*" " " "*" "*" "*" " "
## 6 ( 1 ) "*" "*" "*" "*" "*" "*" " "
## 7 ( 1 ) "*" "*" "*" "*" "*" "*" "*"
#lets calc the errors for the data
reg.summary$rsq
## [1] 0.6097201 0.6628461 0.7126624 0.7360328 0.7361014 0.7361440 0.7361563
reg.summary$rss #RSS always decreases as variables are added, so it is not a selection criterion on its own
## [1] 34.46133 29.77036 25.37162 23.30804 23.30198 23.29822 23.29714
reg.summary$adjr2 #Adjusted R-squared: select the model with the highest value (here the 4-variable model)
## [1] 0.6015893 0.6484991 0.6939230 0.7125690 0.7061129 0.6993268 0.6921823
reg.summary$cp #Mallows' Cp: select the model with the lowest value (here the 4-variable model)
## [1] 16.126760 9.669894 3.739878 2.019659 4.008737 6.001959 8.000000
reg.summary$bic #BIC: select the model with the lowest value (here the 4-variable model)
## [1] -39.22051 -42.62472 -46.70678 -47.03640 -43.13738 -39.23342 -35.32373
#Let's use the built-in plot() method for regsubsets objects
#In these plots, a shaded cell means the variable is included in that model; a gap means it is excluded. Models are ordered by the chosen criterion.
par(mfrow=c(2,2))
plot(regfit.bwd1,scale="adjr2")
plot(regfit.bwd1,scale="r2")
plot(regfit.bwd1,scale="Cp")
plot(regfit.bwd1,scale="bic")
#Let's also use conventional line plots of RSS, adjusted R-squared, Cp and BIC for all models at once to help decide which model to select
par(mfrow=c(2,2))
plot(reg.summary$rss,xlab="Number of variables",ylab="RSS",type="l")
which.min(reg.summary$rss)
## [1] 7
points(7,reg.summary$rss[7],col="red",cex=2,pch=20) # adding the point to the plot which shows the min RSS
plot(reg.summary$adjr2,xlab="Number of variables",ylab="Adjusted R2",type="l")
which.max(reg.summary$adjr2)
## [1] 4
points(4,reg.summary$adjr2[4],col="red",cex=2,pch=20) # adding the point to the plot which shows the max adjusted R2
plot(reg.summary$cp,xlab="Number of variables",ylab="Cp",type="l")
which.min(reg.summary$cp)
## [1] 4
points(4,reg.summary$cp[4],col="red",cex=2,pch=20) # adding the point to the plot which shows the min Cp
plot(reg.summary$bic,xlab="Number of variables",ylab="BIC",type="l")
which.min(reg.summary$bic)
## [1] 4
points(4,reg.summary$bic[4],col="red",cex=2,pch=20) # mark the point with the minimum BIC
#So the 4-variable model is the best model per the backward selection method based on Cp, adjusted R-squared and BIC
reg.summary$which #A logical matrix indicating which elements are in each model
## (Intercept) Population Income Illiteracy Murder HS.Grad Frost Area
## 1 TRUE FALSE FALSE FALSE TRUE FALSE FALSE FALSE
## 2 TRUE FALSE FALSE FALSE TRUE TRUE FALSE FALSE
## 3 TRUE FALSE FALSE FALSE TRUE TRUE TRUE FALSE
## 4 TRUE TRUE FALSE FALSE TRUE TRUE TRUE FALSE
## 5 TRUE TRUE TRUE FALSE TRUE TRUE TRUE FALSE
## 6 TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE
## 7 TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
#Ans:Population,Murder,HS.Grad,Frost
#EXPLANATION:
#Let's walk through backward selection manually: start with the full model and eliminate variables one at a time, removing the least significant variable (the one with the highest p-value) at each step.
LinReg<-lm(Life.Exp~Population+Income+Illiteracy+Murder+HS.Grad+Frost+Area,data=statedata) #full model
summary(LinReg)
##
## Call:
## lm(formula = Life.Exp ~ Population + Income + Illiteracy + Murder +
## HS.Grad + Frost + Area, data = statedata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.48895 -0.51232 -0.02747 0.57002 1.49447
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.094e+01 1.748e+00 40.586 < 2e-16 ***
## Population 5.180e-05 2.919e-05 1.775 0.0832 .
## Income -2.180e-05 2.444e-04 -0.089 0.9293
## Illiteracy 3.382e-02 3.663e-01 0.092 0.9269
## Murder -3.011e-01 4.662e-02 -6.459 8.68e-08 ***
## HS.Grad 4.893e-02 2.332e-02 2.098 0.0420 *
## Frost -5.735e-03 3.143e-03 -1.825 0.0752 .
## Area -7.383e-08 1.668e-06 -0.044 0.9649
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7448 on 42 degrees of freedom
## Multiple R-squared: 0.7362, Adjusted R-squared: 0.6922
## F-statistic: 16.74 on 7 and 42 DF, p-value: 2.534e-10
#We would eliminate the variable "Area" first (since it has the highest p-value, or probability, with a value of 0.9649), by adjusting our lm command to the following:
LinReg = lm(Life.Exp ~ Population + Income + Illiteracy + Murder + HS.Grad + Frost, data=statedata)
summary(LinReg)
##
## Call:
## lm(formula = Life.Exp ~ Population + Income + Illiteracy + Murder +
## HS.Grad + Frost, data = statedata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.49047 -0.52533 -0.02546 0.57160 1.50374
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.099e+01 1.387e+00 51.165 < 2e-16 ***
## Population 5.188e-05 2.879e-05 1.802 0.0785 .
## Income -2.444e-05 2.343e-04 -0.104 0.9174
## Illiteracy 2.846e-02 3.416e-01 0.083 0.9340
## Murder -3.018e-01 4.334e-02 -6.963 1.45e-08 ***
## HS.Grad 4.847e-02 2.067e-02 2.345 0.0237 *
## Frost -5.776e-03 2.970e-03 -1.945 0.0584 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7361 on 43 degrees of freedom
## Multiple R-squared: 0.7361, Adjusted R-squared: 0.6993
## F-statistic: 19.99 on 6 and 43 DF, p-value: 5.362e-11
#Looking at summary(LinReg) now, we would choose to eliminate "Illiteracy" since it now has the highest p-value of 0.9340, using the following command:
LinReg = lm(Life.Exp ~ Population + Income + Murder + HS.Grad + Frost, data=statedata)
summary(LinReg)
##
## Call:
## lm(formula = Life.Exp ~ Population + Income + Murder + HS.Grad +
## Frost, data = statedata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.4892 -0.5122 -0.0329 0.5645 1.5166
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.107e+01 1.029e+00 69.067 < 2e-16 ***
## Population 5.115e-05 2.709e-05 1.888 0.0657 .
## Income -2.477e-05 2.316e-04 -0.107 0.9153
## Murder -3.000e-01 3.704e-02 -8.099 2.91e-10 ***
## HS.Grad 4.776e-02 1.859e-02 2.569 0.0137 *
## Frost -5.910e-03 2.468e-03 -2.395 0.0210 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7277 on 44 degrees of freedom
## Multiple R-squared: 0.7361, Adjusted R-squared: 0.7061
## F-statistic: 24.55 on 5 and 44 DF, p-value: 1.019e-11
#Looking at summary(LinReg) again, we would next choose to eliminate "Income", since it has a p-value of 0.9153. This gives the following four variable model:
LinReg = lm(Life.Exp ~ Population + Murder + HS.Grad + Frost, data=statedata)
summary(LinReg)
##
## Call:
## lm(formula = Life.Exp ~ Population + Murder + HS.Grad + Frost,
## data = statedata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.47095 -0.53464 -0.03701 0.57621 1.50683
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.103e+01 9.529e-01 74.542 < 2e-16 ***
## Population 5.014e-05 2.512e-05 1.996 0.05201 .
## Murder -3.001e-01 3.661e-02 -8.199 1.77e-10 ***
## HS.Grad 4.658e-02 1.483e-02 3.142 0.00297 **
## Frost -5.943e-03 2.421e-03 -2.455 0.01802 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7197 on 45 degrees of freedom
## Multiple R-squared: 0.736, Adjusted R-squared: 0.7126
## F-statistic: 31.37 on 4 and 45 DF, p-value: 1.696e-12
#This model with 4 variables is a good model. However, we can see that the variable "Population" is not quite significant. In practice, it would be up to you whether to keep the variable "Population" or to drop it for a 3-variable model: Population does not add much statistical significance in the presence of murder, high school graduation rate, and frost days. However, for the remainder of this question, we will analyze the 4-variable model.
#This entire backward selection process can also be done automatically with regsubsets() from the leaps library, as sketched below.
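#A minimal sketch, assuming the leaps package is installed and statedata is loaded as above;
#regsubsets() with method="backward" reproduces the backward search in one call:
library(leaps)
regBwd <- regsubsets(Life.Exp ~ Population + Income + Illiteracy + Murder + HS.Grad + Frost + Area,
                     data = statedata, method = "backward")
bwd.summary <- summary(regBwd)
which.min(bwd.summary$bic) #size of the model minimizing BIC
bwd.summary$which          #logical matrix showing which variables enter each model size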
##################################
#PROBLEM 3.2 - PREDICTING LIFE EXPECTANCY - REFINING THE MODEL AND ANALYZING PREDICTIONS
#Removing insignificant variables changes the Multiple R-squared value of the model. By looking at the summary output for both the initial model (all independent variables) and the simplified model (only 4 independent variables) and using what you learned in class, which of the following correctly explains the change in the Multiple R-squared value?
#Ans:We expect the "Multiple R-squared" value of the simplified model to be slightly worse than that of the initial model. It can't be better than the "Multiple R-squared" value of the initial model.
#EXPLANATION:When we remove insignificant variables, the "Multiple R-squared" will always be worse, but only slightly worse. This is due to the nature of a linear regression model. It is always possible for the regression model to make a coefficient zero, which would be the same as removing the variable from the model. The fact that the coefficient is not zero in the initial model means it must be helping the R-squared value, even if it is only a very small improvement. So when we force the variable to be removed, it will decrease the R-squared a little bit. However, this small decrease is worth it to have a simpler model.
#On the contrary, when we remove insignificant variables, the "Adjusted R-squared" will frequently be better. This value accounts for the complexity of the model, and thus tends to increase as insignificant variables are removed, and decrease as insignificant variables are added. See the quick check below.
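#A quick numerical check of the claim above (a sketch; the two models are simply refitted as shown earlier):
fullModel <- lm(Life.Exp ~ Population + Income + Illiteracy + Murder + HS.Grad + Frost + Area, data = statedata)
smallModel <- lm(Life.Exp ~ Population + Murder + HS.Grad + Frost, data = statedata)
summary(fullModel)$r.squared       #0.7362 - multiple R-squared of the full model
summary(smallModel)$r.squared      #0.736  - slightly lower after removing the three variables
summary(fullModel)$adj.r.squared   #0.6922
summary(smallModel)$adj.r.squared  #0.7126 - adjusted R-squared improves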
##########################################
#PROBLEM 3.3 - PREDICTING LIFE EXPECTANCY - REFINING THE MODEL AND ANALYZING PREDICTIONS
#Using the simplified 4 variable model that we created, we'll now take a look at how our predictions compare to the actual values.
#Take a look at the vector of predictions by using the predict function (since we are just looking at predictions on the training set, you don't need to pass a "newdata" argument to the predict function).
#Which state do we predict to have the lowest life expectancy? (Hint: use the sort function)
predTest<-predict(LinReg) #predictions on the training set (no newdata argument needed)
sort(predTest)
## Alabama Georgia Mississippi South Carolina Louisiana
## 68.48112 68.63694 69.00535 69.06109 69.15045
## Kentucky North Carolina Tennessee Nevada Arkansas
## 69.24418 69.28624 69.46583 69.52482 69.57374
## Alaska Michigan Texas New Mexico Missouri
## 69.85740 69.86893 69.97886 70.03119 70.10610
## Virginia Illinois West Virginia Maryland Florida
## 70.14691 70.19244 70.44983 70.51852 70.61539
## New York Wyoming Indiana Vermont Ohio
## 70.62937 70.87679 70.90159 71.06135 71.08549
## Colorado Delaware Oklahoma Pennsylvania Montana
## 71.10354 71.12647 71.15860 71.38046 71.40025
## Arizona Idaho New Jersey New Hampshire Rhode Island
## 71.41416 71.49989 71.59612 71.72636 71.76007
## California Maine North Dakota Kansas Wisconsin
## 71.79565 71.86095 71.87649 71.90352 72.00996
## South Dakota Connecticut Utah Hawaii Nebraska
## 72.01161 72.03459 72.05753 72.09317 72.17032
## Minnesota Iowa Oregon Massachusetts Washington
## 72.26560 72.39653 72.41445 72.44105 72.68272
#Ans:Alabama
#Which state actually has the lowest life expectancy? (Hint: use the which.min function)
statedata$state.name[which.min(statedata$Life.Exp)]
## [1] South Carolina
## 50 Levels: Alabama Alaska Arizona Arkansas California ... Wyoming
#Ans: South Carolina
#################################
#PROBLEM 3.4 - PREDICTING LIFE EXPECTANCY - REFINING THE MODEL AND ANALYZING PREDICTIONS
#Which state do we predict to have the highest life expectancy?
sort(predTest,decreasing = T)
## Washington Massachusetts Oregon Iowa Minnesota
## 72.68272 72.44105 72.41445 72.39653 72.26560
## Nebraska Hawaii Utah Connecticut South Dakota
## 72.17032 72.09317 72.05753 72.03459 72.01161
## Wisconsin Kansas North Dakota Maine California
## 72.00996 71.90352 71.87649 71.86095 71.79565
## Rhode Island New Hampshire New Jersey Idaho Arizona
## 71.76007 71.72636 71.59612 71.49989 71.41416
## Montana Pennsylvania Oklahoma Delaware Colorado
## 71.40025 71.38046 71.15860 71.12647 71.10354
## Ohio Vermont Indiana Wyoming New York
## 71.08549 71.06135 70.90159 70.87679 70.62937
## Florida Maryland West Virginia Illinois Virginia
## 70.61539 70.51852 70.44983 70.19244 70.14691
## Missouri New Mexico Texas Michigan Alaska
## 70.10610 70.03119 69.97886 69.86893 69.85740
## Arkansas Nevada Tennessee North Carolina Kentucky
## 69.57374 69.52482 69.46583 69.28624 69.24418
## Louisiana South Carolina Mississippi Georgia Alabama
## 69.15045 69.06109 69.00535 68.63694 68.48112
#Ans: Washington
#Which state actually has the highest life expectancy?
statedata$state.name[which.max(statedata$Life.Exp)]
## [1] Hawaii
## 50 Levels: Alabama Alaska Arizona Arkansas California ... Wyoming
#Ans: Hawaii
##################################
#PROBLEM 3.5 - PREDICTING LIFE EXPECTANCY - REFINING THE MODEL AND ANALYZING PREDICTIONS
#Take a look at the vector of residuals (the difference between the predicted and actual values).
sort(abs(LinReg$residuals)) #or sort(abs(predTest-statedata$Life.Exp))
## Indiana Florida Illinois Virginia South Dakota
## 0.02158526 0.04460505 0.05244160 0.06691392 0.06839119
## North Carolina New York California Georgia Rhode Island
## 0.07624179 0.07937149 0.08564599 0.09694227 0.13992982
## Iowa Oklahoma Ohio Oregon New Mexico
## 0.16347124 0.26139958 0.26548767 0.28445333 0.28880945
## Maryland Idaho Louisiana Nebraska Connecticut
## 0.29851996 0.37010714 0.39044846 0.42967691 0.44541028
## Wisconsin Nevada New Hampshire Alaska Alabama
## 0.47004324 0.49482393 0.49635615 0.54740399 0.56888134
## Vermont Missouri Wyoming Massachusetts Tennessee
## 0.57865019 0.58389969 0.58678863 0.61105391 0.64416651
## New Jersey Kansas Minnesota Michigan Montana
## 0.66612086 0.67648037 0.69440380 0.76106640 0.84024805
## Utah Kentucky Arizona North Dakota Mississippi
## 0.84246817 0.85582067 0.86415671 0.90350550 0.91535384
## Texas Pennsylvania Colorado Washington West Virginia
## 0.92114057 0.95045527 0.95645816 0.96272426 0.96982588
## Delaware Arkansas South Carolina Maine Hawaii
## 1.06646884 1.08626119 1.10109172 1.47095411 1.50683146
#For which state do we make the smallest absolute error?
#Ans:Indiana
#For which state do we make the largest absolute error?
sort(abs(LinReg$residuals),decreasing = T ) #or sort(abs(predTest-statedata$Life.Exp),decreasing = T)
## Hawaii Maine South Carolina Arkansas Delaware
## 1.50683146 1.47095411 1.10109172 1.08626119 1.06646884
## West Virginia Washington Colorado Pennsylvania Texas
## 0.96982588 0.96272426 0.95645816 0.95045527 0.92114057
## Mississippi North Dakota Arizona Kentucky Utah
## 0.91535384 0.90350550 0.86415671 0.85582067 0.84246817
## Montana Michigan Minnesota Kansas New Jersey
## 0.84024805 0.76106640 0.69440380 0.67648037 0.66612086
## Tennessee Massachusetts Wyoming Missouri Vermont
## 0.64416651 0.61105391 0.58678863 0.58389969 0.57865019
## Alabama Alaska New Hampshire Nevada Wisconsin
## 0.56888134 0.54740399 0.49635615 0.49482393 0.47004324
## Connecticut Nebraska Louisiana Idaho Maryland
## 0.44541028 0.42967691 0.39044846 0.37010714 0.29851996
## New Mexico Oregon Ohio Oklahoma Iowa
## 0.28880945 0.28445333 0.26548767 0.26139958 0.16347124
## Rhode Island Georgia California New York North Carolina
## 0.13992982 0.09694227 0.08564599 0.07937149 0.07624179
## South Dakota Virginia Illinois Florida Indiana
## 0.06839119 0.06691392 0.05244160 0.04460505 0.02158526
#Ans: Hawaii
FORECASTING ELANTRA SALES (OPTIONAL)
An important application of linear regression is understanding sales. Consider a company that produces and sells a product. In a given period, if the company produces more units than consumers will buy, it will not earn money on the unsold units and will incur additional costs from having to store those units in inventory before they can be sold. If it produces fewer units than consumers will buy, it will earn less than it potentially could have. Being able to predict consumer sales, therefore, is of first-order importance to the company.
In this problem, we will try to predict monthly sales of the Hyundai Elantra in the United States. The Hyundai Motor Company is a major automobile manufacturer based in South Korea. The Elantra is a car model that has been produced by Hyundai since 1990 and is sold all over the world, including the United States. We will build a linear regression model to predict monthly sales using economic indicators of the United States as well as Google search queries.
The file elantra.csv contains data for the problem. Each observation is a month, from January 2010 to February 2014.
#lets import the data set
Elantra<-read.csv("elantra.csv")
str(Elantra)
## 'data.frame': 50 obs. of 7 variables:
## $ Month : int 1 1 1 1 1 2 2 2 2 2 ...
## $ Year : int 2010 2011 2012 2013 2014 2010 2011 2012 2013 2014 ...
## $ ElantraSales: int 7690 9659 10900 12174 15326 7966 12289 13820 16219 16393 ...
## $ Unemployment: num 9.7 9.1 8.2 7.9 6.6 9.8 9 8.3 7.7 6.7 ...
## $ Queries : int 153 259 354 230 232 130 266 296 239 240 ...
## $ CPI_energy : num 213 229 244 243 248 ...
## $ CPI_all : num 217 221 228 231 235 ...
#PROBLEM 1 - LOADING THE DATA
#let's split the data set into training and test sets
#Load the data set. Split the data set into training and testing sets as follows: place all observations for 2012 and earlier in the training set, and all observations for 2013 and 2014 into the testing set.
ElantraTrain<-subset(Elantra,Year<=2012)
dim(ElantraTrain)
## [1] 36 7
ElantraTest<-subset(Elantra,Year>2012) #or ElantraTest<-subset(Elantra,Year==2013 | Year==2014)
dim(ElantraTest)
## [1] 14 7
#How many observations are in the training set?
#Ans:36
##################################################
#PROBLEM 2.1 - A LINEAR REGRESSION MODEL
#Build a linear regression model to predict monthly Elantra sales using Unemployment, CPI_all, CPI_energy and Queries as the independent variables. Use all of the training set data to do this.
ElantraLM<-lm(ElantraSales~Unemployment+CPI_all+CPI_energy +Queries,data=ElantraTrain)
summary(ElantraLM)
##
## Call:
## lm(formula = ElantraSales ~ Unemployment + CPI_all + CPI_energy +
## Queries, data = ElantraTrain)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6785.2 -2101.8 -562.5 2901.7 7021.0
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 95385.36 170663.81 0.559 0.580
## Unemployment -3179.90 3610.26 -0.881 0.385
## CPI_all -297.65 704.84 -0.422 0.676
## CPI_energy 38.51 109.60 0.351 0.728
## Queries 19.03 11.26 1.690 0.101
##
## Residual standard error: 3295 on 31 degrees of freedom
## Multiple R-squared: 0.4282, Adjusted R-squared: 0.3544
## F-statistic: 5.803 on 4 and 31 DF, p-value: 0.00132
#let's visualise the coefficients using the sjPlot package
library(sjPlot)
sjp.lm(ElantraLM)
#What is the model R-squared? Note: In this problem, we will always be asking for the "Multiple R-Squared" of the model.
#Ans: 0.4282
###############################################
#PROBLEM 2.2 - SIGNIFICANT VARIABLES
#How many variables are significant, or have levels that are significant? Use 0.10 as your p-value cutoff.
#Ans:0
#EXPLANATION:After obtaining the output of the model summary, simply look at the p-values of all of the variables in the output (the right-most column, labeled "Pr(>|t|)"). It turns out that none of them are significant.
########################################
#PROBLEM 2.3 - COEFFICIENTS
#What is the coefficient of the Unemployment variable?
summary(ElantraLM)$coefficients["Unemployment","Estimate"]
## [1] -3179.9
#or coef(summary(ElantraLM))["Unemployment","Estimate"]
#Ans:-3179.9
####################################
#PROBLEM 2.4 - INTERPRETING THE COEFFICIENT
#What is the interpretation of this coefficient?
#Ans:For an increase of 1 in Unemployment, the prediction of Elantra sales decreases by approximately 3000.
#EXPLANATION:The second choice is the correct answer; the coefficient is defined as the change in the prediction of the dependent variable (ElantraSales) per unit change in the independent variable in question (Unemployment). The first choice is therefore not correct; it also does not make intuitive sense since Unemployment is the percentage unemployment rate, which is bounded to be between 0 and 100.
#The third choice is not correct because the coefficient indicates how the prediction changes, not how the actual sales change, and this option asserts that actual sales change, i.e., there is a causal effect.
#The fourth choice is not correct because the statistical significance indicates how likely it is that, by chance, the true coefficient is not different from zero. However, the estimated coefficient still has a (non-zero) value, and our prediction will change for different values of Unemployment; therefore, the sales prediction cannot stay the same.
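#A small sketch to confirm the interpretation: two hypothetical observations that differ only in Unemployment
#(the other values below are made up purely for illustration, they are not taken from the data set)
twoObs <- data.frame(Unemployment = c(8, 9), CPI_all = 225, CPI_energy = 240, Queries = 250)
diff(predict(ElantraLM, newdata = twoObs)) #equals the Unemployment coefficient, about -3179.9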
#########################################
#PROBLEM 3.1 - MODELING SEASONALITY
#Our model R-Squared is relatively low, so we would now like to improve our model. In modeling demand and sales, it is often useful to model seasonality. Seasonality refers to the fact that demand is often cyclical/periodic in time. For example, in countries with different seasons, demand for warm outerwear (like jackets and coats) is higher in fall/autumn and winter (due to the colder weather) than in spring and summer. (In contrast, demand for swimsuits and sunscreen is higher in the summer than in the other seasons.) Another example is the "back to school" period in North America: demand for stationery (pencils, notebooks and so on) in late July and all of August is higher than the rest of the year due to the start of the school year in September.
#In our problem, since our data includes the month of the year in which the units were sold, it is feasible for us to incorporate monthly seasonality. From a modeling point of view, it may be reasonable that the month has an effect on how many Elantra units are sold.
#To incorporate the seasonal effect due to the month, build a new linear regression model that predicts monthly Elantra sales using Month as well as Unemployment, CPI_all, CPI_energy and Queries. Do not modify the training and testing data frames before building the model.
ElantraLM1<-lm(ElantraSales~Month+Unemployment+CPI_all+CPI_energy +Queries,data=ElantraTrain)
summary(ElantraLM1)
##
## Call:
## lm(formula = ElantraSales ~ Month + Unemployment + CPI_all +
## CPI_energy + Queries, data = ElantraTrain)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6416.6 -2068.7 -597.1 2616.3 7183.2
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 148330.49 195373.51 0.759 0.4536
## Month 110.69 191.66 0.578 0.5679
## Unemployment -4137.28 4008.56 -1.032 0.3103
## CPI_all -517.99 808.26 -0.641 0.5265
## CPI_energy 54.18 114.08 0.475 0.6382
## Queries 21.19 11.98 1.769 0.0871 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3331 on 30 degrees of freedom
## Multiple R-squared: 0.4344, Adjusted R-squared: 0.3402
## F-statistic: 4.609 on 5 and 30 DF, p-value: 0.003078
#What is the model R-Squared?
#Ans: 0.4344
##################################
#PROBLEM 3.2 - EFFECT OF ADDING A NEW VARIABLE
#Which of the following best describes the effect of adding Month?
#let's compare the two fitted models
#let's get the summary output as a nicely formatted table (sjt.lm renders an HTML table; it is reproduced as plain text below)
library(sjPlot)
sjt.lm(ElantraLM,ElantraLM1)
                 Dependent variable: ElantraSales
                 ElantraLM (without Month)                      ElantraLM1 (with Month)
                 B           CI                        p        B           CI                        p
(Intercept)      95385.36    -252685.78 – 443456.51    .580     148330.49   -250675.44 – 547336.42    .454
Unemployment     -3179.90    -10543.08 – 4183.28       .385     -4137.28    -12323.85 – 4049.28       .310
CPI_all          -297.65     -1735.17 – 1139.88        .676     -517.99     -2168.68 – 1132.69        .526
CPI_energy       38.51       -185.03 – 262.04          .728     54.18       -178.79 – 287.16          .638
Queries          19.03       -3.93 – 41.99             .101     21.19       -3.28 – 45.65             .087
Month                                                           110.69      -280.73 – 502.10          .568
Observations     36                                             36
R2 / adj. R2     .428 / .354                                    .434 / .340
#Ans:The model is not better because the adjusted R-squared has gone down and none of the variables (including the new one) are very significant.
#EXPLANATION:The first option is incorrect because (ordinary) R-Squared always increases (or at least stays the same) when you add new variables. This does not make the model better, and in fact, may hurt the ability of the model to generalize to new, unseen data (overfitting).
#The second option is correct: the adjusted R-Squared is the R-Squared but adjusted to take into account the number of variables. If the adjusted R-Squared is lower, then this indicates that our model is not better and in fact may be worse. Furthermore, if none of the variables have become significant, then this also indicates that the model is not better.
#The third option is not correct because as stated above, the adjusted R-Squared has become worse. Although the variables have come closer to being significant, this doesn't make it a better model.
#The fourth option is not correct. Although it is desirable to have models that are parsimonious (fewer variables), we are ultimately interested in models that have high explanatory power (as measured in training R-Squared) and out of sample predictive power (as measured in testing R-Squared). Adding a key variable may significantly improve the predictive power of the model, and we should thus not dismiss the model simply because it has more variables.
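#A quick numerical check of this point (the values match the two model summaries above):
summary(ElantraLM)$r.squared       #0.4282 - without Month
summary(ElantraLM1)$r.squared      #0.4344 - adding Month can only increase the multiple R-squared
summary(ElantraLM)$adj.r.squared   #0.3544
summary(ElantraLM1)$adj.r.squared  #0.3402 - the adjusted R-squared drops, so the model is not better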
############################
#PROBLEM 3.3 - UNDERSTANDING THE MODEL
#Let us try to understand our model.
#In the new model, given two monthly periods that are otherwise identical in Unemployment, CPI_all, CPI_energy and Queries, what is the absolute difference in predicted Elantra sales given that one period is in January and one is in March?
#Ans:221.38
#EXPLANATION:The coefficient for Month is 110.69 (look at the summary output of the model). For the first question, January is coded numerically as 1, while March is coded numerically as 3; the difference in the prediction is therefore
#110.69 * (3 - 1) = 110.69 * 2 = 221.38
#In the new model, given two monthly periods that are otherwise identical in Unemployment, CPI_all, CPI_energy and Queries, what is the absolute difference in predicted Elantra sales given that one period is in January and one is in May?
#Ans:442.76
#EXPLANATION: May is numerically coded as 5, while January is 1, so the difference in predicted sales is
#110.69 * (5 - 1) = 110.69 * 4 = 442.76
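#The same arithmetic pulled directly from the fitted model (using ElantraLM1 from above):
monthCoef <- coef(ElantraLM1)["Month"]
monthCoef * (3 - 1) #January vs March: about 221.38
monthCoef * (5 - 1) #January vs May: about 442.76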
#########################
#PROBLEM 3.4 - NUMERIC VS. FACTORS
#You may be experiencing an uneasy feeling that there is something not quite right in how we have modeled the effect of the calendar month on the monthly sales of Elantras. If so, you are right. In particular, we added Month as a variable, but Month is an ordinary numeric variable. In fact, we must convert Month to a factor variable before adding it to the model.
#What is the best explanation for why we must do this?
#Ans:By modeling Month as a factor variable, the effect of each calendar month is not restricted to be linear in the numerical coding of the month.
#EXPLANATION:The second choice is the correct answer. The previous subproblem essentially showed that for every month that we move into the future (e.g., from January to February, from February to March, etc.), our predicted sales go up by 110.69. This isn't right, because the effect of the month should not be affected by the numerical coding, and by modeling Month as a numeric variable, we cannot capture more complex effects. For example, suppose that when the other variables are fixed, an additional 500 units are sold from June to December, relative to the other months. This type of relationship between the boost to the sales and the Month variable would look like a step function at Month = 6, which cannot be modeled as a linear function of Month.
#The first choice is not right. As we have discussed before, increasing the number of coefficients will never cause the model's R-Squared to decrease, but if the increase is small, then we have not really improved the predictive power of our model, and converting Month to a factor variable is not justified.
#The third choice is also not correct. Month is stored as an ordinary number, so there cannot be any issues due to the Date format.
################################
#PROBLEM 4.1 - A NEW MODEL
#Re-run the regression with the Month variable modeled as a factor variable. (Create a new variable that models the Month as a factor (using the as.factor function) instead of overwriting the current Month variable. We'll still use the numeric version of Month later in the problem.)
ElantraTrain$MonthFactor = as.factor(ElantraTrain$Month) #new var 'MonthFactor'added to the ElantraTrain dataset
ElantraTest$MonthFactor = as.factor(ElantraTest$Month) #new var 'MonthFactor'added to the ElantraTest dataset
#Then, you want to rebuild the model using the lm function:
ElantraLM = lm(ElantraSales ~ Unemployment + Queries + CPI_energy + CPI_all + MonthFactor, data=ElantraTrain)
summary(ElantraLM)
##
## Call:
## lm(formula = ElantraSales ~ Unemployment + Queries + CPI_energy +
## CPI_all + MonthFactor, data = ElantraTrain)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3865.1 -1211.7 -77.1 1207.5 3562.2
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 312509.280 144061.867 2.169 0.042288 *
## Unemployment -7739.381 2968.747 -2.607 0.016871 *
## Queries -4.764 12.938 -0.368 0.716598
## CPI_energy 288.631 97.974 2.946 0.007988 **
## CPI_all -1343.307 592.919 -2.266 0.034732 *
## MonthFactor2 2254.998 1943.249 1.160 0.259540
## MonthFactor3 6696.557 1991.635 3.362 0.003099 **
## MonthFactor4 7556.607 2038.022 3.708 0.001392 **
## MonthFactor5 7420.249 1950.139 3.805 0.001110 **
## MonthFactor6 9215.833 1995.230 4.619 0.000166 ***
## MonthFactor7 9929.464 2238.800 4.435 0.000254 ***
## MonthFactor8 7939.447 2064.629 3.845 0.001010 **
## MonthFactor9 5013.287 2010.745 2.493 0.021542 *
## MonthFactor10 2500.184 2084.057 1.200 0.244286
## MonthFactor11 3238.932 2397.231 1.351 0.191747
## MonthFactor12 5293.911 2228.310 2.376 0.027621 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2306 on 20 degrees of freedom
## Multiple R-squared: 0.8193, Adjusted R-squared: 0.6837
## F-statistic: 6.044 on 15 and 20 DF, p-value: 0.0001469
#What is the model R-Squared?
#Ans:0.8193
##############################
#PROBLEM 4.2 - SIGNIFICANT VARIABLES
#Which variables are significant, or have levels that are significant? Use 0.10 as your p-value cutoff. (Select all that apply.)
#Ans:Month (the factor version) ,CPI_all,CPI_energy ,Unemployment
#let's plot the coefficients to see which are significant, using the sjPlot library
library(sjPlot)
#?sjp.lm
sjp.lm(ElantraLM,type="lm")
#################################
#PROBLEM 5.1 - MULTICOLLINEARITY
#Another peculiar observation about the regression is that the sign of the Queries variable has changed. In particular, when we naively modeled Month as a numeric variable, Queries had a positive coefficient. Now, Queries has a negative coefficient. Furthermore, CPI_energy has a positive coefficient -- as the overall price of energy increases, we expect Elantra sales to increase, which seems counter-intuitive (if the price of energy increases, we'd expect consumers to have less funds to purchase automobiles, leading to lower Elantra sales).
#As we have seen before, changes in coefficient signs and signs that are counter to our intuition may be due to a multicollinearity problem. To check, compute the correlations of the variables in the training set.
cor(ElantraTrain[c("Unemployment","Month","Queries","CPI_energy","CPI_all")]) #or
## Unemployment Month Queries CPI_energy CPI_all
## Unemployment 1.0000000 -0.2036029 -0.6411093 -0.8007188 -0.9562123
## Month -0.2036029 1.0000000 0.0158443 0.1760198 0.2667883
## Queries -0.6411093 0.0158443 1.0000000 0.8328381 0.7536732
## CPI_energy -0.8007188 0.1760198 0.8328381 1.0000000 0.9132259
## CPI_all -0.9562123 0.2667883 0.7536732 0.9132259 1.0000000
cor(ElantraTrain$CPI_energy,ElantraTrain[,1:7]) #the last column, MonthFactor, is a factor and is excluded because cor() only accepts numeric variables
## Month Year ElantraSales Unemployment Queries CPI_energy
## [1,] 0.1760198 0.8316052 0.5916491 -0.8007188 0.8328381 1
## CPI_all
## [1,] 0.9132259
#Which of the following variables is CPI_energy highly correlated with? Select all that apply. (Include only variables where the absolute value of the correlation exceeds 0.6. For the purpose of this question, treat Month as a numeric variable, not a factor variable.)
#Ans:Unemployment,Queries,CPI_all
#plotting the correlation matrix
library(sjPlot)
#?sjp.corr
sjp.corr(ElantraTrain[,c("Unemployment","Month","Queries","CPI_energy","CPI_all")],type = "tile",show.legend=FALSE)
## Computing correlation using spearman-method with listwise-deletion...
#Testing the model assumptions
library(sjPlot)
#?sjp.lm
sjp.lm(ElantraLM,type="ma",completeDiagnostic=T)
## Removed 0 cases during 0 step(s).
## R^2 / adj. R^2 of original model: 0.819264 / 0.683712
## R^2 / adj. R^2 of updated model: 0.819264 / 0.683712
## AIC of original model: 672.515747
## AIC of updated model: 672.515747
## lag Autocorrelation D-W Statistic p-value
## 1 -0.4653687 2.795112 0.476
## Alternative hypothesis: rho != 0
## Non-constant Variance Score Test
## Variance formula: ~ fitted.values
## Chisquare = 2.04248 Df = 1 p = 0.1529601
##
## studentized Breusch-Pagan test
##
## data: linreg
## BP = 18.094, df = 15, p-value = 0.2577
##
## Suggested power transformation: 0.6030641
#these diagnostic plots, together with the correlation matrix above, strongly suggest multicollinearity among the predictors
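#An alternative multicollinearity check, assuming the car package is installed:
#variance inflation factors well above 5-10 would corroborate the problem
library(car)
vif(ElantraLM) #returns generalized VIFs (GVIF) because MonthFactor is a factor variable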
#########################################
#PROBLEM 5.2 - CORRELATIONS
#Which of the following variables is Queries highly correlated with? Again, compute the correlations on the training set. Select all that apply. (Include only variables where the absolute value of the correlation exceeds 0.6. For the purpose of this question, treat Month as a numeric variable, not a factor variable.)
cor(ElantraTrain[c("Unemployment","Month","Queries","CPI_energy","CPI_all")]) #or
## Unemployment Month Queries CPI_energy CPI_all
## Unemployment 1.0000000 -0.2036029 -0.6411093 -0.8007188 -0.9562123
## Month -0.2036029 1.0000000 0.0158443 0.1760198 0.2667883
## Queries -0.6411093 0.0158443 1.0000000 0.8328381 0.7536732
## CPI_energy -0.8007188 0.1760198 0.8328381 1.0000000 0.9132259
## CPI_all -0.9562123 0.2667883 0.7536732 0.9132259 1.0000000
cor(ElantraTrain$Queries,ElantraTrain[,1:7]) #again excluding the factor column MonthFactor
## Month Year ElantraSales Unemployment Queries CPI_energy
## [1,] 0.0158443 0.726531 0.6100645 -0.6411093 1 0.8328381
## CPI_all
## [1,] 0.7536732
#Ans:Unemployment,CPI_energy,CPI_all
#plotting the correlation matrix
library(sjPlot)
#?sjp.corr
sjp.corr(ElantraTrain[,c("Unemployment","Month","Queries","CPI_energy","CPI_all")],type = "tile",show.legend =FALSE)
## Computing correlation using spearman-method with listwise-deletion...
##############################
#PROBLEM 6.1 - A REDUCED MODEL
#Let us now simplify our model (the model using the factor version of the Month variable). We will do this by iteratively removing variables, one at a time. Remove the variable with the highest p-value (i.e., the least statistically significant variable) from the model. Repeat this until there are no variables that are insignificant or variables for which all of the factor levels are insignificant. Use a threshold of 0.10 to determine whether a variable is significant.
#Method 1: automatically building the model with step() (note that step() drops variables by AIC rather than by p-value, but here it removes the same variable, Queries)
StepModel<-step(ElantraLM)
## Start: AIC=568.35
## ElantraSales ~ Unemployment + Queries + CPI_energy + CPI_all +
## MonthFactor
##
## Df Sum of Sq RSS AIC
## - Queries 1 720829 107064906 566.60
## <none> 106344077 568.35
## - CPI_all 1 27292499 133636576 574.58
## - Unemployment 1 36136773 142480850 576.88
## - CPI_energy 1 46147903 152491980 579.33
## - MonthFactor 11 230125737 336469814 587.82
##
## Step: AIC=566.6
## ElantraSales ~ Unemployment + CPI_energy + CPI_all + MonthFactor
##
## Df Sum of Sq RSS AIC
## <none> 107064906 566.60
## - CPI_all 1 29428416 136493322 573.34
## - Unemployment 1 40143086 147207993 576.06
## - CPI_energy 1 59057377 166122283 580.41
## - MonthFactor 11 260411263 367476169 588.99
summary(StepModel)
##
## Call:
## lm(formula = ElantraSales ~ Unemployment + CPI_energy + CPI_all +
## MonthFactor, data = ElantraTrain)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3866.0 -1283.3 -107.2 1098.3 3650.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 325709.15 136627.85 2.384 0.026644 *
## Unemployment -7971.34 2840.79 -2.806 0.010586 *
## CPI_energy 268.03 78.75 3.403 0.002676 **
## CPI_all -1377.58 573.39 -2.403 0.025610 *
## MonthFactor2 2410.91 1857.10 1.298 0.208292
## MonthFactor3 6880.09 1888.15 3.644 0.001517 **
## MonthFactor4 7697.36 1960.21 3.927 0.000774 ***
## MonthFactor5 7444.64 1908.48 3.901 0.000823 ***
## MonthFactor6 9223.13 1953.64 4.721 0.000116 ***
## MonthFactor7 9602.72 2012.66 4.771 0.000103 ***
## MonthFactor8 7919.50 2020.99 3.919 0.000789 ***
## MonthFactor9 5074.29 1962.23 2.586 0.017237 *
## MonthFactor10 2724.24 1951.78 1.396 0.177366
## MonthFactor11 3665.08 2055.66 1.783 0.089062 .
## MonthFactor12 5643.19 1974.36 2.858 0.009413 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2258 on 21 degrees of freedom
## Multiple R-squared: 0.818, Adjusted R-squared: 0.6967
## F-statistic: 6.744 on 14 and 21 DF, p-value: 5.73e-05
#the stepwise iterations show that Queries is the only variable removed
#Which variables, and in what order, are removed by this process?
#Method 2: manual removal using lm. Start by refitting the full model (with MonthFactor):
ElantraLM = lm(ElantraSales ~ Unemployment + Queries + CPI_energy + CPI_all + MonthFactor, data=ElantraTrain)
summary(ElantraLM)
##
## Call:
## lm(formula = ElantraSales ~ Unemployment + Queries + CPI_energy +
## CPI_all + MonthFactor, data = ElantraTrain)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3865.1 -1211.7 -77.1 1207.5 3562.2
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 312509.280 144061.867 2.169 0.042288 *
## Unemployment -7739.381 2968.747 -2.607 0.016871 *
## Queries -4.764 12.938 -0.368 0.716598
## CPI_energy 288.631 97.974 2.946 0.007988 **
## CPI_all -1343.307 592.919 -2.266 0.034732 *
## MonthFactor2 2254.998 1943.249 1.160 0.259540
## MonthFactor3 6696.557 1991.635 3.362 0.003099 **
## MonthFactor4 7556.607 2038.022 3.708 0.001392 **
## MonthFactor5 7420.249 1950.139 3.805 0.001110 **
## MonthFactor6 9215.833 1995.230 4.619 0.000166 ***
## MonthFactor7 9929.464 2238.800 4.435 0.000254 ***
## MonthFactor8 7939.447 2064.629 3.845 0.001010 **
## MonthFactor9 5013.287 2010.745 2.493 0.021542 *
## MonthFactor10 2500.184 2084.057 1.200 0.244286
## MonthFactor11 3238.932 2397.231 1.351 0.191747
## MonthFactor12 5293.911 2228.310 2.376 0.027621 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2306 on 20 degrees of freedom
## Multiple R-squared: 0.8193, Adjusted R-squared: 0.6837
## F-statistic: 6.044 on 15 and 20 DF, p-value: 0.0001469
#Queries has the highest p-value (0.7166), so let's remove it and rebuild the model
ElantraLMnew = lm(ElantraSales ~ Unemployment + CPI_energy + CPI_all + MonthFactor, data=ElantraTrain)
summary(ElantraLMnew)
##
## Call:
## lm(formula = ElantraSales ~ Unemployment + CPI_energy + CPI_all +
## MonthFactor, data = ElantraTrain)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3866.0 -1283.3 -107.2 1098.3 3650.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 325709.15 136627.85 2.384 0.026644 *
## Unemployment -7971.34 2840.79 -2.806 0.010586 *
## CPI_energy 268.03 78.75 3.403 0.002676 **
## CPI_all -1377.58 573.39 -2.403 0.025610 *
## MonthFactor2 2410.91 1857.10 1.298 0.208292
## MonthFactor3 6880.09 1888.15 3.644 0.001517 **
## MonthFactor4 7697.36 1960.21 3.927 0.000774 ***
## MonthFactor5 7444.64 1908.48 3.901 0.000823 ***
## MonthFactor6 9223.13 1953.64 4.721 0.000116 ***
## MonthFactor7 9602.72 2012.66 4.771 0.000103 ***
## MonthFactor8 7919.50 2020.99 3.919 0.000789 ***
## MonthFactor9 5074.29 1962.23 2.586 0.017237 *
## MonthFactor10 2724.24 1951.78 1.396 0.177366
## MonthFactor11 3665.08 2055.66 1.783 0.089062 .
## MonthFactor12 5643.19 1974.36 2.858 0.009413 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2258 on 21 degrees of freedom
## Multiple R-squared: 0.818, Adjusted R-squared: 0.6967
## F-statistic: 6.744 on 14 and 21 DF, p-value: 5.73e-05
#Ans:Queries
#EXPLANATION:The variable with the highest p-value is "Queries". After removing it and looking at the model summary again, we can see that no remaining variables are insignificant at the 0.10 level. Note that MonthFactor has a few levels that are insignificant, but we don't remove it because many of its levels are highly significant.
#####################################
#PROBLEM 6.2 - TEST SET PREDICTIONS
predTest<-predict(ElantraLMnew ,newdata = ElantraTest)
#Using the model from Problem 6.1, make predictions on the test set. What is the sum of squared errors of the model on the test set?
SSE = sum((predTest - ElantraTest$ElantraSales)^2)
SSE
## [1] 190757747
#Ans:190757747
##################################
#PROBLEM 6.3 - COMPARING TO A BASELINE
#What would the baseline method predict for all observations in the test set? Remember that the baseline method we use predicts the average outcome of all observations in the training set.
baseline = mean(ElantraTrain$ElantraSales)
baseline
## [1] 14462.25
#Ans:14462.25
#EXPLANATION:The baseline method that is used in the R-Squared calculation (to compute SST, the total sum of squares) simply predicts the mean of ElantraSales in the training set for every observation (i.e., without regard to any of the independent variables).
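#A small check: the baseline's sum of squared errors on the test set is exactly the SST used in Problem 6.4
baselineSSE <- sum((baseline - ElantraTest$ElantraSales)^2)
baselineSSE #701375142, the same value as SST below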
#################################
#PROBLEM 6.4 - TEST SET R-SQUARED
#What is the test set R-Squared?
1-190757747/701375142
## [1] 0.7280232
#Ans:0.7280232
#EXPLANATION:You can compute the SST as the sum of the squared differences between ElantraSales in the testing set and the mean of ElantraSales in the training set:
SST<-sum((ElantraTest$ElantraSales-mean(ElantraTrain$ElantraSales))^2)
SST
## [1] 701375142
#then R-Squared=1-SSE/SST
##############################
#PROBLEM 6.5 - ABSOLUTE ERRORS
#What is the largest absolute error that we make in our test set predictions?
sort(abs(predTest - ElantraTest$ElantraSales),decreasing = TRUE)
## 14 22 18 34 4 38 30
## 7491.4877 6903.3078 5997.2221 4392.9279 3088.6588 2482.2349 2245.0544
## 9 26 50 42 46 5 10
## 2187.5530 2021.6654 934.0051 658.5309 631.7978 490.4278 394.2148
#or
max(abs(predTest- ElantraTest$ElantraSales))
## [1] 7491.488
#Ans:7491.488
########################
#PROBLEM 6.6 - MONTH OF LARGEST ERROR
#In which period (Month,Year pair) do we make the largest absolute error in our prediction?
ElantraTest$Month[which.max(abs(predTest - ElantraTest$ElantraSales))]
## [1] 3
#Ans:03/2013
#EXPLANATION:which.max(abs(predTest - ElantraTest$ElantraSales)) returns 5, the row number in ElantraTest for which we make the largest absolute error; ElantraTest$Month[5] is 3 and ElantraTest$Year[5] is 2013, i.e. March 2013.
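#To see both parts of the period at once (a small convenience sketch using the same index):
idx <- which.max(abs(predTest - ElantraTest$ElantraSales))
ElantraTest[idx, c("Month", "Year")] #Month 3, Year 2013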