2) Explain whether each scenario is a classification or regression problem, and indicate whether we are most interested in inference or prediction.
Finally, provide n and p. p: The number of predictors in a dataset. n: The number of samples in a dataset.
2a) We collect a set of data on the top 500 firms in the US. For each firm we record profit, number of employees, industry and the CEO salary. We are interested in understanding which factors affect CEO salary.
ANSWER 2a)
This is a regression problem, since the response (CEO salary) is numerical. The keyword is "understanding", which makes interpretability of the model a requirement, so we are most interested in inference. p=3 (profit, number of employees, industry)
n=500
2b) We are considering launching a new product and wish to know whether it will be a success or a failure. We collect data on 20 similar products that were previously launched. For each product we have recorded whether it was a success or failure, price charged for the product, marketing budget, competition price, and ten other variables.
ANSWER 2b)
This is a classification problem, since the response is categorical (success or failure). We only care whether the product will succeed, not why, so interpretability is not a requirement and we are most interested in prediction. p=13 (price, marketing budget, competition price, and the ten other variables)
n=20
2c) We are interested in predicting the % change in the USD/Euro exchange rate in relation to the weekly changes in the world stock markets. Hence we collect weekly data for all of 2012. For each week we record the % change in the USD/Euro, the % change in the US market, the % change in the British market, and the % change in the German market.
ANSWER 2c)
This is a regression problem, since the response (the % change in USD/Euro) is numerical. Interpretability of the model is not a requirement, so we are most interested in prediction. p=3 (the % changes in the US, British, and German markets)
n=52 (one observation per week for all of 2012)
ANSWER 5)
Advantages of less flexible (restrictive) approaches:
If we are mainly interested in inference, then restrictive (less flexible) models are much more interpretable.
Disadvantages of less flexible approaches:
An inflexible method such as linear regression can only generate linear functions and a small range of shapes to estimate f, so it may fit the true f poorly.
Advantages of more flexible approaches:
Flexible methods can fit a much wider range of possible shapes for f, which can give more accurate predictions when the true f is far from linear.
Disadvantages of more flexible approaches:
Flexible methods are less interpretable and, with limited data, more prone to overfitting.
Under what circumstances might a more flexible approach be preferred to a less flexible approach?
When the interpretability of the predictive model is simply not of interest and we are only interested in prediction. For instance, if we seek to develop an algorithm to predict the price of a stock, our sole requirement for the algorithm is that it predict accurately—interpretability is not a concern.
When might a less flexible approach be preferred?
When inference or interpretability is a must, we choose a more restrictive (less flexible) model.
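As an illustration (a minimal sketch on simulated data, not part of the assignment), the trade-off can be seen by varying model flexibility directly:
set.seed(1)
x <- runif(100)
y <- sin(2 * pi * x) + rnorm(100, sd = 0.3)
fit_rigid <- lm(y ~ x)            # inflexible: fits only a straight line; a single slope to interpret
fit_flex <- lm(y ~ poly(x, 10))   # flexible: a degree-10 polynomial bends to the data, but its coefficients have no direct interpretation
The rigid fit is easy to explain but misses the curvature; the flexible fit tracks the curve at the cost of interpretability.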
ANSWER 6)
Parametric:
Parametric methods involve a two-step model-based approach. First, we make an assumption about the functional form of f; one very simple assumption is that f is linear in X. Second, after a model has been selected, we need a procedure that uses the training data to fit or train the model.
Non-parametric:
Non-parametric methods do not make explicit assumptions about the functional form of f. Instead they seek an estimate of f that gets as close to the data points as possible without being too rough or wiggly.
What are the advantages of a parametric approach to regression or classification (as opposed to a nonparametric approach)?
Parametric Advantages:
Assuming a parametric form for f simplifies the problem of estimating f because it is generally much easier to estimate a set of parameters, such as β0, β1, ..., βp in the linear model (2.4), than it is to fit an entirely arbitrary function f.
Non-parametric Advantages:
Such approaches can have a major advantage over parametric approaches: by avoiding the assumption of a particular functional form for f, they have the potential to accurately fit a wider range of possible shapes for f.
What are its disadvantages?
Parametric:
The potential disadvantage of a parametric approach is that the model we choose will usually not match the true unknown form of f. If the chosen model is too far from the true f, then our estimate will be poor.
Non-Parametric:
But non-parametric approaches do suffer from a major disadvantage: since they do not reduce the problem of estimating f to a small number of parameters, a very large number of observations (far more than is typically needed for a parametric approach) is required in order to obtain an accurate estimate for f.
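A minimal R sketch of the contrast (simulated data; loess() is used here as a representative non-parametric smoother):
set.seed(2)
x <- runif(200)
y <- x^2 + rnorm(200, sd = 0.1)
para <- lm(y ~ x)        # parametric: assume f is linear in x, then estimate just two coefficients from the training data
nonpara <- loess(y ~ x)  # non-parametric: no functional form assumed; the fit follows the data, so more observations are needed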
8a) Use the read.csv() function to read the data into R. Call the loaded data college. Make sure that you have the directory set to the correct location for the data.
library(ISLR)
#setwd("C:\Users\user\Desktop\Summer 2021\Data Analytics Algorithms II\Homework\Week 1")
college=read.csv("college.csv", header=TRUE) #Exercise 8a
(8b) Look at the data using the fix() function. You should notice that the first column is just the name of each university. We don’t really want R to treat this as data. However, it may be handy to have these names for later. Try the following commands:
rownames(college)=college[,1]
fix(college) #Exercise 8b
(8c) i. Use the summary() function to produce a numerical summary of the variables in the data set.
(8c)ii. Use the pairs() function to produce a scatterplot matrix of the first ten columns or variables of the data. Recall that you can reference the first ten columns of a matrix A using A[,1:10].
(8c)iii. Use the plot() function to produce side-by-side boxplots of Outstate versus Private.
(8c)iv. Create a new qualitative variable, called Elite, by binning the Top10perc variable. We are going to divide universities into two groups based on whether or not the proportion of students coming from the top 10% of their high school classes exceeds 50%.
#> Elite=rep("No", nrow(college))
#> Elite[college$Top10perc > 50]="Yes"
#> Elite=as.factor(Elite)
#> college=data.frame(college, Elite)
Use the summary() function to see how many elite universities there are. Now use the plot() function to produce side-by-side boxplots of Outstate versus Elite.
(8c)v. Use the hist() function to produce some histograms with differing numbers of bins for a few of the quantitative variables.
You may find the command par(mfrow=c(2,2)) useful: it will divide the print window into four regions so that four plots can be made simultaneously. Modifying the arguments to this function will divide the screen in other ways.
(8c)vi. Continue exploring the data, and provide a brief summary of what you discover.
ANSWER 8c(i-vi):
summary(college) #Exercise 8c-i
## X Private Apps Accept
## Length:777 Length:777 Min. : 81 Min. : 72
## Class :character Class :character 1st Qu.: 776 1st Qu.: 604
## Mode :character Mode :character Median : 1558 Median : 1110
## Mean : 3002 Mean : 2019
## 3rd Qu.: 3624 3rd Qu.: 2424
## Max. :48094 Max. :26330
## Enroll Top10perc Top25perc F.Undergrad
## Min. : 35 Min. : 1.00 Min. : 9.0 Min. : 139
## 1st Qu.: 242 1st Qu.:15.00 1st Qu.: 41.0 1st Qu.: 992
## Median : 434 Median :23.00 Median : 54.0 Median : 1707
## Mean : 780 Mean :27.56 Mean : 55.8 Mean : 3700
## 3rd Qu.: 902 3rd Qu.:35.00 3rd Qu.: 69.0 3rd Qu.: 4005
## Max. :6392 Max. :96.00 Max. :100.0 Max. :31643
## P.Undergrad Outstate Room.Board Books
## Min. : 1.0 Min. : 2340 Min. :1780 Min. : 96.0
## 1st Qu.: 95.0 1st Qu.: 7320 1st Qu.:3597 1st Qu.: 470.0
## Median : 353.0 Median : 9990 Median :4200 Median : 500.0
## Mean : 855.3 Mean :10441 Mean :4358 Mean : 549.4
## 3rd Qu.: 967.0 3rd Qu.:12925 3rd Qu.:5050 3rd Qu.: 600.0
## Max. :21836.0 Max. :21700 Max. :8124 Max. :2340.0
## Personal PhD Terminal S.F.Ratio
## Min. : 250 Min. : 8.00 Min. : 24.0 Min. : 2.50
## 1st Qu.: 850 1st Qu.: 62.00 1st Qu.: 71.0 1st Qu.:11.50
## Median :1200 Median : 75.00 Median : 82.0 Median :13.60
## Mean :1341 Mean : 72.66 Mean : 79.7 Mean :14.09
## 3rd Qu.:1700 3rd Qu.: 85.00 3rd Qu.: 92.0 3rd Qu.:16.50
## Max. :6800 Max. :103.00 Max. :100.0 Max. :39.80
## perc.alumni Expend Grad.Rate
## Min. : 0.00 Min. : 3186 Min. : 10.00
## 1st Qu.:13.00 1st Qu.: 6751 1st Qu.: 53.00
## Median :21.00 Median : 8377 Median : 65.00
## Mean :22.74 Mean : 9660 Mean : 65.46
## 3rd Qu.:31.00 3rd Qu.:10830 3rd Qu.: 78.00
## Max. :64.00 Max. :56233 Max. :118.00
#college[,1:10] ## commented out: the printed output is too long
pairs(College[,1:10]) #Exercise 8c-ii (uses the ISLR College data set, where Private is already a factor; the college read in from CSV keeps it as character)
plot(College$Private, College$Outstate, xlab = "Private", xlim = c(0,2.5), ylab = "Outstate", main = "Outstate vs Private") #Exercise 8c-iii
Elite=rep("No",nrow(college)) #Exercise 8c-iv
Elite[college$Top10perc >50]="Yes"
Elite=as.factor(Elite)
college=data.frame(college ,Elite)
summary(college$Elite)
## No Yes
## 699 78
plot(college$Elite, college$Outstate, xlab = "Elite", xlim = c(0,2.5), ylab = "Outstate", main = "Outstate vs Elite")
par(mfrow = c(2,2)) #Exercise 8c-v
hist(college$Enroll, col=10, xlab = "Enroll", ylab = "Count")
hist(college$Top10perc, col = 10, xlab = "Top10", ylab = "Count")
hist(college$Personal, col = 5, xlab = "Personal", ylab = "Count")
hist(college$Grad.Rate, col = 5, xlab = "Graduation Rate", ylab = "Count")
#Exercise 8c-vi
summary(college$Enroll)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 35 242 434 780 902 6392
#The mean enrollment is 780 students.
summary(college$Outstate)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2340 7320 9990 10441 12925 21700
#Outstate is out-of-state tuition, not a student count: the mean out-of-state tuition is $10,441.
summary(college$Grad.Rate)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 10.00 53.00 65.00 65.46 78.00 118.00
#The mean graduation rate is 65.46%. Note the maximum of 118%, which is impossible for a percentage and points to a data-entry error.
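To isolate the record behind the impossible rate, one could filter on it (a sketch; since the row names were set to the school names in 8b, the output identifies the school):
subset(college, Grad.Rate > 100, select = Grad.Rate)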
(9a) Which of the predictors are quantitative, and which are qualitative?
Auto = read.csv("Auto.csv", header=TRUE, na.strings = "?")
Auto = na.omit(Auto)
str(Auto) #Exercise 9A
## 'data.frame': 392 obs. of 9 variables:
## $ mpg : num 18 15 18 16 17 15 14 14 14 15 ...
## $ cylinders : int 8 8 8 8 8 8 8 8 8 8 ...
## $ displacement: num 307 350 318 304 302 429 454 440 455 390 ...
## $ horsepower : int 130 165 150 150 140 198 220 215 225 190 ...
## $ weight : int 3504 3693 3436 3433 3449 4341 4354 4312 4425 3850 ...
## $ acceleration: num 12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
## $ year : int 70 70 70 70 70 70 70 70 70 70 ...
## $ origin : int 1 1 1 1 1 1 1 1 1 1 ...
## $ name : chr "chevrolet chevelle malibu" "buick skylark 320" "plymouth satellite" "amc rebel sst" ...
## - attr(*, "na.action")= 'omit' Named int [1:5] 33 127 331 337 355
## ..- attr(*, "names")= chr [1:5] "33" "127" "331" "337" ...
ANSWER 9a)
As shown above, all of the variables are quantitative except "name". The "?" values in "horsepower" were read as NA (via na.strings = "?") and removed with na.omit(), so horsepower is numeric; "origin" and "cylinders" are stored as integers but are arguably qualitative codes.
(9b) What is the range of each quantitative predictor? You can answer this using the range() function.
ANSWER 9B)
df = subset(Auto, select = -c(horsepower, name))
#df
summary(df) # the Min. and Max. rows give the range of each predictor
## mpg cylinders displacement weight acceleration
## Min. : 9.00 Min. :3.000 Min. : 68.0 Min. :1613 Min. : 8.00
## 1st Qu.:17.00 1st Qu.:4.000 1st Qu.:105.0 1st Qu.:2225 1st Qu.:13.78
## Median :22.75 Median :4.000 Median :151.0 Median :2804 Median :15.50
## Mean :23.45 Mean :5.472 Mean :194.4 Mean :2978 Mean :15.54
## 3rd Qu.:29.00 3rd Qu.:8.000 3rd Qu.:275.8 3rd Qu.:3615 3rd Qu.:17.02
## Max. :46.60 Max. :8.000 Max. :455.0 Max. :5140 Max. :24.80
## year origin
## Min. :70.00 Min. :1.000
## 1st Qu.:73.00 1st Qu.:1.000
## Median :76.00 Median :1.000
## Mean :75.98 Mean :1.577
## 3rd Qu.:79.00 3rd Qu.:2.000
## Max. :82.00 Max. :3.000
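The range() function mentioned in the question can also be applied to every column at once (a sketch using the df defined above):
sapply(df, range) # row 1 = minimum, row 2 = maximum for each quantitative predictor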
(9c) What is the mean and standard deviation of each quantitative predictor?
ANSWER 9C)
sapply(df,mean)
## mpg cylinders displacement weight acceleration year
## 23.445918 5.471939 194.411990 2977.584184 15.541327 75.979592
## origin
## 1.576531
sapply(df, sd)
## mpg cylinders displacement weight acceleration year
## 7.8050075 1.7057832 104.6440039 849.4025600 2.7588641 3.6837365
## origin
## 0.8055182
(9d) Now remove the 10th through 85th observations. What is the range, mean, and standard deviation of each predictor in the subset of the data that remains?
ANSWER 9D)
df2 <- df[-(10:85), ] # the comma matters: this drops rows 10 through 85. Without it, df[-c(10:85)] drops (nonexistent) columns 10-85 and returns the data unchanged, which is what produced the full-data summaries below.
#df2 # commented out: the remaining data frame is too long to print
sapply(df2, range)
## mpg cylinders displacement weight acceleration year origin
## [1,] 9.0 3 68 1613 8.0 70 1
## [2,] 46.6 8 455 5140 24.8 82 3
sapply(df2, mean)
## mpg cylinders displacement weight acceleration year
## 23.445918 5.471939 194.411990 2977.584184 15.541327 75.979592
## origin
## 1.576531
sapply(df2, sd)
## mpg cylinders displacement weight acceleration year
## 7.8050075 1.7057832 104.6440039 849.4025600 2.7588641 3.6837365
## origin
## 0.8055182
(9e) Using the full data set, investigate the predictors graphically, using scatterplots or other tools of your choice. Create some plots highlighting the relationships among the predictors. Comment on your findings.
ANSWER 9E)
Auto$cylinders <- as.factor(Auto$cylinders)
Auto$year <- as.factor(Auto$year)
Auto$origin <- as.factor(Auto$origin)
Auto$name <- as.factor(Auto$name)
pairs(Auto[,1:9])
There are strong positive, roughly linear relationships between displacement and horsepower, and between weight and horsepower.
(9f) Suppose that we wish to predict gas mileage (mpg) on the basis of the other variables. Do your plots suggest that any of the other variables might be useful in predicting mpg? Justify your answer.
ANSWER 9F)
Auto$horsepower = as.numeric(Auto$horsepower)
cor(Auto$weight, Auto$horsepower)
## [1] 0.8645377
cor(Auto$displacement, Auto$horsepower)
## [1] 0.897257
Yes. Based on the plots above, cylinders, year, origin, and horsepower all appear useful for predicting mpg. Displacement and weight are highly correlated with horsepower (r ≈ 0.90 and 0.86, respectively), so these two are omitted to avoid redundancy.
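A correlation matrix of the continuous columns makes the same point compactly (a sketch; cylinders, year, and origin were converted to factors above, so only the continuous variables are included):
round(cor(Auto[, c("mpg", "displacement", "horsepower", "weight", "acceleration")]), 2)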
(10a) To begin, load in the Boston data set. The Boston data set is part of the MASS library in R.
Read about the data set:
ANSWER 10A)
library("MASS")
#Now the data set is contained in the object Boston.
#Read about the data set:
# Boston # commented out: the full data set is too long to print
#How many rows are in this data set?
nrow(Boston)
## [1] 506
# How many columns?
ncol(Boston)
## [1] 14
# What do the rows and columns represent?
#Each row represents one of 506 suburbs (towns) of Boston; each column is one of 14 variables recorded for each suburb (per-capita crime rate, tax rate, average number of rooms, median home value, and so on).
(10b) Make some pairwise scatterplots of the predictors (columns) in this data set. Describe your findings.
par(mfrow = c(2, 2))
plot(Boston$age, Boston$crim)
plot(Boston$dis, Boston$crim)
ANSWER 10B) There appears to be a positive association between crim and age, and a negative (inverse) association between crim and dis.
(10c) Are any of the predictors associated with per capita crime rate? If so, explain the relationship.
ANSWER 10C)
hist(Boston$crim,col=30, breaks = 25)
#Yes. The distribution of crim is heavily right-skewed: the bulk of suburbs have crime rates between 0 and 25, so low crime rates dominate the suburbs.
pairs(Boston[Boston$crim < 25, ])
#head(Boston)
# There is a positive relationship between crim and age.
# There is a negative (inverse) relationship between crim and dis.
# There is a negative (inverse) relationship between crim and medv.
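The same associations can be quantified directly (a sketch; all 14 columns of Boston are numeric, so cor() applies to the whole data frame):
sort(round(cor(Boston)[, "crim"], 2), decreasing = TRUE)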
(10d) Do any of the suburbs of Boston appear to have particularly high crime rates? Tax rates? Pupil-teacher ratios? Comment on the range of each predictor.
ANSWER 10D)
hist(Boston$crim,col=30, breaks = 25)
nrow(Boston[Boston$crim > 20, ])
## [1] 18
#Tax rates?
hist(Boston$tax,col=30, breaks = 25)
nrow(Boston[Boston$tax == 666, ])
## [1] 132
#Pupil-teacher ratios?
hist(Boston$ptratio, col=30, breaks = 25)
nrow(Boston[Boston$ptratio > 20, ])
## [1] 201
#Comment on the range of each predictor.
#crim spans roughly 0 to 89, but the histogram mass is concentrated near 0; only 18 suburbs exceed 20.
#tax spans 187 to 711 and is bimodal: a large cluster of 132 suburbs sits at exactly 666.
#ptratio spans roughly 12.6 to 22, with a large group (201 suburbs) above 20.
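The exact ranges can be read off directly rather than from the histograms (a sketch):
sapply(Boston[, c("crim", "tax", "ptratio")], range) # row 1 = min, row 2 = max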
(10e) How many of the suburbs in this data set bound the Charles river?
ANSWER 10E)
nrow(Boston[Boston$chas == 1, ])
## [1] 35
(10f) What is the median pupil-teacher ratio among the towns in this data set?
ANSWER 10F)
median(Boston$ptratio)
## [1] 19.05
(10g) Which suburb of Boston has lowest median value of owner-occupied homes? What are the values of the other predictors for that suburb, and how do those values compare to the overall ranges for those predictors? Comment on your findings.
ANSWER 10G)
row.names(Boston[which.min(Boston$medv), ]) # which.min() gives the row index of the smallest medv; min() would return the value 5 and wrongly index row 5
## [1] "399"
range(Boston$tax)
## [1] 187 711
Boston[which.min(Boston$medv), ]$tax
## [1] 666
Boston[which.min(Boston$medv), ]$age
## [1] 100
range(Boston$age)
## [1] 2.9 100.0
Boston[which.min(Boston$medv), ]$rm
## [1] 5.453
range(Boston$rm)
## [1] 3.561 8.780
#The suburb with the lowest medv (row 399) has tax = 666. The tax predictor ranges from 187 to 711,
#so this suburb sits at the very top of the tax scale.
#Its age value is 100, the maximum of the 2.9-100 range: all of its owner-occupied homes predate 1940.
#Its rm value is 5.453, on the low side of the 3.561-8.780 range.
#In short, the cheapest suburb combines a high property-tax rate, uniformly old housing stock, and relatively small dwellings.
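All of the suburb's predictors can be compared against the data set's extremes in one table (a sketch extending the same idea to every column):
rbind(suburb = unlist(Boston[which.min(Boston$medv), ]),
      min = sapply(Boston, min),
      max = sapply(Boston, max))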
(10h) In this data set, how many of the suburbs average more than seven rooms per dwelling? More than eight rooms per dwelling?Comment on the suburbs that average more than eight rooms per dwelling.
ANSWER 10H)
nrow(Boston[Boston$rm > 7, ])
## [1] 64
#More than eight rooms per dwelling?
nrow(Boston[Boston$rm > 8, ])
## [1] 13
#Comment on the suburbs that average more than eight rooms per dwelling.
hist(Boston$rm, col=30, breaks = 25)
#Only 13 suburbs average more than eight rooms per dwelling; they form the thin right tail of the histogram, and the largest average is 8.78 rooms.
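To characterize these 13 suburbs further, their column means can be compared with the overall means (a sketch; one would expect, for example, higher medv and lower lstat in the rm > 8 group):
round(rbind(all = colMeans(Boston), rm_gt_8 = colMeans(Boston[Boston$rm > 8, ])), 2)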