For each of parts (a) through (d), indicate whether we would generally expect the performance of a flexible statistical learning method to be better or worse than an inflexible method. Justify your answer.
(a) The sample size n is extremely large, and the number of predictors p is small.
Ans: A flexible statistical learning method would generally perform better in this case. With an extremely large sample and few predictors, it can learn the underlying pattern from the data with little risk of overfitting, so its lower bias outweighs the small increase in variance.
(b) The number of predictors p is extremely large, and the number of observations n is small.
Ans: A flexible statistical learning method would generally perform worse here. With few observations and many predictors it runs a high risk of overfitting: it would fit the noise in the training data rather than the underlying pattern.
(c) The relationship between the predictors and response is highly non-linear.
Ans: A flexible statistical learning method would generally perform better in this case, as it has the freedom needed to capture a highly non-linear relationship, whereas an inflexible method would be heavily biased.
(d) The variance of the error terms, i.e. σ² = Var(ϵ), is extremely high.
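Ans: A flexible statistical learning method would generally perform worse in this case. When the variance of the error terms is very high, a flexible method tends to fit the noise, which increases its variance without any compensating reduction in bias.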
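The trade-offs in parts (a) to (d) can be explored with a small simulation. The sketch below is not part of the original exercise: it assumes a made-up true function f(x) = sin(2x) with Gaussian noise, treats a straight-line fit as the inflexible method and a degree-10 polynomial as the flexible one, and the helper name `test_mse` is hypothetical.

```r
# Minimal sketch (assumptions: true f(x) = sin(2x), Gaussian noise,
# degree 1 = inflexible method, degree 10 = flexible method).
test_mse <- function(n, sigma, degree, n_test = 2000) {
  f <- function(x) sin(2 * x)
  # simulate a training set and an independent test set
  train <- data.frame(x = runif(n, 0, 3))
  train$y <- f(train$x) + rnorm(n, sd = sigma)
  test <- data.frame(x = runif(n_test, 0, 3))
  test$y <- f(test$x) + rnorm(n_test, sd = sigma)
  # fit a polynomial of the requested degree and score it on the test set
  fit <- lm(y ~ poly(x, degree), data = train)
  mean((test$y - predict(fit, newdata = test))^2)
}

set.seed(1)
# (a) large n, moderate noise
mse_large_n <- c(inflexible = test_mse(5000, 0.5, 1),
                 flexible   = test_mse(5000, 0.5, 10))
# small n, moderate noise
mse_small_n <- c(inflexible = test_mse(30, 0.5, 1),
                 flexible   = test_mse(30, 0.5, 10))
# (d) small n, very noisy response
mse_noisy <- c(inflexible = test_mse(30, 3, 1),
               flexible   = test_mse(30, 3, 10))
print(rbind(mse_large_n, mse_small_n, mse_noisy))
```

With a large n the flexible fit should show the lower test MSE. The small-n and high-noise rows vary from seed to seed, so it is worth rerunning them a few times before drawing conclusions.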
Explain whether each scenario is a classification or regression problem, and indicate whether we are most
interested in inference or prediction. Finally, provide n and p.
a) We collect a set of data on the top 500 firms in the US. For each firm we record profit, number of employees, industry and the CEO salary. We are interested in understanding which factors affect CEO salary.
Ans: This is a regression problem, as the response is the CEO's salary, which is a continuous variable. The aim is to understand which of the three explanatory variables (“profit”, “number of employees”, and “industry”) affect the CEO's salary. Statistically speaking, we are trying to infer which of the three explanatory variables explain the variation in CEO salary well. Hence, we are most interested in inference.
*n = 500*
*p = 3* (profit, number of employees, and industry; CEO salary is the response)
b) We are considering launching a new product and wish to know whether it will be a success or a failure. We collect data on 20 similar products that were previously launched. For each product we have recorded whether it was a success or failure, price charged for the product, marketing budget, competition price, and ten other variables.
Ans: As the response variable is discrete with two categories, “success” and “failure”, this is a
classification problem. We are most interested in prediction, as we want to know whether the new product will be successful.
*n = 20*
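*p = 13* (price, marketing budget, competition price, and ten other variables; success or failure is the response)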
c) We are interested in predicting the % change in the USD/Euro exchange rate in relation to the weekly changes in the world stock markets. Hence we collect weekly data for all of 2012. For each week we record the % change in the USD/Euro, the% change in the US market, the % change in the British market, and the % change in the German market.
Ans: This is a regression problem, as the response (the % change in the USD/Euro exchange rate) is continuous. The problem statement makes it clear that we are most interested in prediction.
*n = 52*
*p = 3* (% change in the US, British, and German markets; % change in the USD/Euro is the response)
4. You will now think of some real-life applications for statistical learning.
(a) Describe three real-life applications in which classification might be useful. Describe the response, as well as the predictors. Is the goal of each application inference or prediction? Explain your answer.
Ans: The three real-life applications of classification are as below:
(i) Financial institutions that lend consumer products such as home loans, personal loans, and auto loans use probability-of-default models to classify whether a customer who has applied for a loan is likely to default on the equated monthly instalments. This is a risk management mechanism for the financial institution. Since the institution is trying to predict whether the customer will default, the goal is prediction. The response variable is discrete with two classes, “default” and “not default”. Many predictor variables are used, such as demographics, financial details of the customer, credit rating, type of vehicle owned, type of immovable asset owned, and any other pre-existing loans.
(ii) Optical Character Recognition (OCR) is an example of a classification problem. The objective is to recognize characters from images. Handwritten letters and digits are difficult to recognize because handwriting varies widely across the population. If we are trying to identify digits, it is a 10-class problem; if we are trying to identify letters of the alphabet, it is a 26-class problem. So in this case the response variable is discrete with more than two categories. Predictors recorded to classify such responses include the angle at which the letter or digit tilts, the size of the handwritten character, etc. The MNIST data set is a well-known example of such a problem. Here, the goal is prediction.
(iii) Face recognition is a classification problem. Facial recognition is generally used for security and identification purposes. Here the response variable is the identity (class) of the person to be recognized from the image. This is an extremely challenging problem because a face is three-dimensional, and the angle at which the image was captured, the exposure to light, etc. add complexity. Image processing relies on linear algebra and calculus. Every image is composed of pixels, each with a numerical value. In a grayscale image, the pixel values typically range from 0 to 255. These pixels are arranged in a two-dimensional table and used as predictors. Here, the goal is prediction.
(b) Describe three real-life applications in which regression might be useful. Describe the response, as well as
the predictors. Is the goal of each application inference or prediction? Explain your answer.
Ans: The three real life applications on regression are as below:
(i) In agriculture, regression is used to study the effect of water and fertilizer on crop yield. Here the goal of regression is inference. The effect is studied by applying different amounts of fertilizer and water to different fields and recording the crop yield. The response variable is the crop yield, and the predictors are the amounts of water and fertilizer used.
(ii) Passenger airlines use regression to plan for maximum utilization of aircraft seat capacity. They use regression as a prediction model to estimate how many seats will be booked on a given flight, so that they can plan accordingly and advertise aggressively or offer discounts to sell the vacant seats. Here the response variable is the number of seats sold, and the predictor can be time alone (month/week/day) or time together with other variables such as the price of the ticket, duration of the flight, route, etc.
(iii) Regression also finds applications in sports management, where the goal is inference that leads to recommendations. Sports managers try to understand the effect of different physical training plans on player performance. For example, they might analyze how the number of cardio sessions and weight training sessions affects the number of boundaries that a player scores. A regression model can be fit with the number of cardio sessions and weight training sessions as predictors and the runs scored through boundaries as the response variable. Depending on the results, the number of cardio and weight training sessions is then recommended to the players.
(c) Describe three real-life applications in which cluster analysis might be useful.
Ans: The three real-life applications of cluster analysis are as below:
(i) Cluster analysis can be used to identify fake news. The content of each news item is tokenized into words, and the items are then clustered on the basis of similarity. News items that contain a high percentage of sensationalizing words get clustered together and can be flagged as fake news with a higher probability.
(ii) Banks use cluster analysis to identify the right set of customers in their existing customer base to whom to offer suitable financial products. When a bank launches a new product such as a new credit card, it cannot offer the card to every customer in its customer base. The card addresses certain needs and provides a set of benefits that would be valuable to only some customers. To identify who those customers could be, the bank uses or creates features such as spend categories, amount spent, income, expenses, savings, investments, etc. On the basis of these features, the bank identifies groups of similar customers to whom it could offer the new credit card.
(iii) Cluster analysis can also be used to identify clusters of customers that use health insurance in a specific way. On the basis of the average age of the household, number of doctor visits per household, types of illness per household, and size of the household, clusters of similar households are formed. This helps the health insurance provider set premiums according to how each cluster uses its health insurance.
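As a rough illustration of the household clustering idea in (iii), the sketch below runs k-means on simulated data. Everything here is an assumption made for illustration: the two features, their values, and the choice of two clusters are invented, and k-means is only one of several clustering methods that could be used.

```r
# Minimal sketch with simulated households (all numbers are made up).
set.seed(1)
households <- data.frame(
  avg_age       = c(rnorm(50, mean = 30, sd = 3), rnorm(50, mean = 65, sd = 3)),
  doctor_visits = c(rnorm(50, mean = 2, sd = 0.5), rnorm(50, mean = 10, sd = 0.5))
)

# Standardize the features so both contribute on the same scale,
# then run k-means with several random starts
km <- kmeans(scale(households), centers = 2, nstart = 20)

# Cluster sizes and the average profile of each cluster
print(table(km$cluster))
print(aggregate(households, by = list(cluster = km$cluster), FUN = mean))
```

The per-cluster averages are the kind of summary an insurer could inspect when deciding whether the clusters differ enough to justify different premiums.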
(a) Compute the Euclidean distance between each observation and the test point, X1 = X2 = X3 = 0.
# creating a data frame for the given dataset
data <- data.frame("Obs" = c(1,2,3,4,5,6),
X1 = c(0,2,0,0,-1,1),
X2 = c(3,0,1,1,0,1),
X3 = c(0,0,3,2,1,1),
Y = c('Red','Red','Red','Green','Green','Red'))
# creating test dataframe
test <- data.frame(X1 = 0,
X2 = 0,
X3 = 0)
# user-defined function for Euclidean distance
# (named euclid so that it does not mask stats::dist)
euclid <- function(m, n) sqrt(sum((m - n)^2))
# Euclidean distance from every training observation to the test point
d <- unname(apply(data[, c("X1", "X2", "X3")], 1, euclid, n = unlist(test)))
# final data frame with the Euclidean distance for each training observation
edist <- data.frame(data,
"Euclidean_Distance" = d)
edist
## Obs X1 X2 X3 Y Euclidean_Distance
## 1 1 0 3 0 Red 3.000000
## 2 2 2 0 0 Red 2.000000
## 3 3 0 1 3 Red 3.162278
## 4 4 0 1 2 Green 2.236068
## 5 5 -1 0 1 Green 1.414214
## 6 6 1 1 1 Red 1.732051
(b) What is our prediction with K = 1? Why?
Ans: With K = 1, the prediction is “Green”, as the test point is closest to observation 5 in the training data set, at the minimum distance of 1.414214, and observation 5 is “Green”.
(c) What is our prediction with K = 3? Why?
Ans: With K = 3, the nearest neighbours of the test point are observations 5, 6, and 2 in order of increasing distance, with classes “Green”, “Red”, and “Red” respectively. As the majority class is “Red”, the test point is classified as “Red”.
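The majority vote behind the K = 1 and K = 3 answers can be sketched as a small function. This is an illustrative implementation rather than a library routine: the name `knn_predict` is hypothetical, and the six training observations are re-entered so that the block runs on its own.

```r
# Training observations from the exercise and the test point (0, 0, 0)
train <- data.frame(X1 = c(0, 2, 0, 0, -1, 1),
                    X2 = c(3, 0, 1, 1, 0, 1),
                    X3 = c(0, 0, 3, 2, 1, 1),
                    Y  = c("Red", "Red", "Red", "Green", "Green", "Red"))
x0 <- c(0, 0, 0)

# Hypothetical helper: classify x0 by majority vote among its k nearest neighbours
knn_predict <- function(train, x0, k) {
  X <- as.matrix(train[, c("X1", "X2", "X3")])
  d <- sqrt(rowSums(sweep(X, 2, x0)^2))   # Euclidean distance to x0
  nearest <- order(d)[seq_len(k)]         # indices of the k closest observations
  votes <- table(train$Y[nearest])        # class counts among those neighbours
  names(votes)[which.max(votes)]          # most frequent class
}

print(knn_predict(train, x0, k = 1))
print(knn_predict(train, x0, k = 3))
```

Ties are broken in favour of the class that comes first alphabetically, which is a simplification; it does not matter for K = 1 or K = 3 here.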
(d) If the Bayes decision boundary in this problem is highly non linear, then would we expect the best value for K to be large or small? Why?
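Ans: In this case we would expect the best value of K to be small. A small K gives a more flexible classifier whose decision boundary can follow a highly non-linear Bayes decision boundary, whereas a large K averages over many neighbours and produces a boundary that is close to linear. A K that is too small can still overfit, so K should be small but not necessarily 1.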
This exercise involves the Boston housing data set.
(a) To begin, load in the Boston data set. The Boston data set is part of the ISLR2 library. How many rows are in this data set? How many columns? What do the rows and columns represent?
library(ISLR2)
dat = Boston
p = dim(dat)
print(p)
## [1] 506 13
Ans:
Total number of rows in the Boston data set: 506
Total number of columns in the Boston data set: 13
The Boston housing data contains housing values for 506 suburbs of Boston. Each row is one suburb (observation) and each column is one variable, so there are 506 observations and 13 variables. The variables describe the crime rate, residential and commercial land coverage, socio-economic status of the population, pupil-teacher ratio by town, property tax rate, average number of rooms per dwelling, and nitrogen oxide concentration. There is one discrete variable with two classes, which indicates whether the tract bounds the Charles River.
(b) Make some pairwise scatter plots of the predictors (columns) in this data set. Describe your findings.
plot(dat)
Ans:
(i) Property tax rate and index of accessibility to radial highways are strongly correlated. This indicates that areas with higher property tax rates are better connected to radial highways. However, the scatter plot suggests that the correlation could be driven by outliers. It would be ideal to validate the correlation using a partial correlation coefficient.
(ii) Lower status of the population and median value of owner-occupied homes are strongly and negatively correlated. The relationship seems to be non-linear. Tracts with a larger share of lower-status population tend to have lower home values.
(iii) Average number of rooms per dwelling is positively correlated with median value of owner-occupied homes. This means that higher-value areas tend to have larger dwellings.
(iv) Median value of owner-occupied homes is poorly correlated with the Charles River dummy variable and with the distance to five Boston employment centres.
(v) Nitrogen oxides concentration and age (proportion of owner-occupied units built prior to 1940) are positively correlated, and the relationship is non-linear.
(c) Are any of the predictors associated with per capita crime rate? If so, explain the relationship.
Ans: Crime rate by town seems to have weak to moderate correlations with the other predictors. In the scatter plots, the high crime rates are concentrated at the largest values of the index of accessibility to radial highways and of the property tax rate, so a gradual trend is hard to read for these two predictors even though the overall association is positive. Distance from the Boston employment centres has a negative association with crime rate, although it seems to be weak.
The number of rooms per dwelling is also weakly and negatively correlated with crime rate. Lower status of the population is positively correlated with crime rate; possibly a larger share of lower-status population pushes the crime rate up. Crime rate and the proportion of owner-occupied units built prior to 1940 are also positively correlated: crime rate tends to be higher in tracts with older housing. Crime rate also tends to be higher where the proportion of non-retail business acres is high.
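Impressions read off a scatter plot matrix can be checked against the correlation coefficients. The sketch below assumes the ISLR2 package is installed; it uses Pearson correlation, which only captures linear association, so it complements the plots rather than replacing them.

```r
library(ISLR2)  # provides the Boston data frame used in this exercise

# Pearson correlation of crim with every other variable
crim_cor <- cor(Boston)["crim", ]
crim_cor <- crim_cor[names(crim_cor) != "crim"]

# Sort by absolute correlation, strongest first
print(round(crim_cor[order(abs(crim_cor), decreasing = TRUE)], 2))
```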
(d) Do any of the census tracts of Boston appear to have particularly high crime rates? Tax rates? Pupil-teacher ratios? Comment on the range of each predictor.
par(mfrow=c(1,3))
boxplot(dat$crim, xlab = "crim")
boxplot(dat$tax, xlab = "tax")
boxplot(dat$ptratio, xlab = "ptratio")
Ans: In the above plot, the first boxplot represents “per capita crime rate by town”, the second boxplot represents “full-value property-tax rate per $10,000”, and the third boxplot represents “pupil-teacher ratio by town”.
(i) “per capita crime rate by town” - This variable has many outliers at the upper extreme. The majority of the towns have a very low crime rate, roughly between zero and five. Some areas have very high crime rates above 70. The data for crime rate is positively skewed. Although the crime rate outliers range from roughly 10 to above 80, many outlier towns do not have extremely high crime rates. The data ranges from about 0.006 to about 89.
(ii) “full-value property-tax rate per $10,000” - There are no outliers in the property tax rates. However, a median value of 330 indicates that the data for tax is also skewed, as the data ranges from 187 to 711.
(iii) “pupil-teacher ratio by town” - This variable has outliers at the lower extreme of the boxplot. The data ranges from 12.6 to 22. The median value of the pupil-teacher ratio is around 19.
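The ranges and outlier counts read off the boxplots can also be computed directly. This sketch assumes the ISLR2 package is installed and uses the same 1.5 × IQR rule that `boxplot()` applies by default.

```r
library(ISLR2)  # provides the Boston data frame used in this exercise

vars <- c("crim", "tax", "ptratio")

# Minimum and maximum of each variable
ranges <- sapply(Boston[, vars], range)
rownames(ranges) <- c("min", "max")
print(ranges)

# Number of points that the default boxplot rule (1.5 * IQR) flags as outliers
n_outliers <- sapply(Boston[, vars], function(v) length(boxplot.stats(v)$out))
print(n_outliers)
```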
(e) How many of the census tracts in this data set bound the Charles river?
table(dat$chas)
##
## 0 1
## 471 35
Ans: 35 census tracts bound the Charles river.
(f) What is the median pupil-teacher ratio among the towns in this data set?
median(dat$ptratio)
## [1] 19.05
Ans: The median pupil-teacher ratio among the towns is 19.05.
(g) Which census tract of Boston has lowest median value of owner occupied homes? What are the values of the other predictors for that census tract, and how do those values compare to the overall ranges for those predictors? Comment on your findings.
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5 v purrr 0.3.4
## v tibble 3.1.6 v dplyr 1.0.7
## v tidyr 1.1.4 v stringr 1.4.0
## v readr 2.1.1 v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
dat <- data.frame("obs" = c(1:length(dat$crim)), dat)
dat %>% filter(medv == min(medv))
## obs crim zn indus chas nox rm age dis rad tax ptratio lstat medv
## 1 399 38.3518 0 18.1 0 0.693 5.453 100 1.4896 24 666 20.2 30.59 5
## 2 406 67.9208 0 18.1 0 0.693 5.683 100 1.4254 24 666 20.2 22.98 5
Ans: The above two census tracts (observations 399 and 406) have the lowest median value of owner-occupied homes, which is 5 (medv is recorded in $1000s).
summary(dat[,2:14])
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08205 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio lstat
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 1.73
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.: 6.95
## Median : 5.000 Median :330.0 Median :19.05 Median :11.36
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :12.65
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:16.95
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :37.97
## medv
## Min. : 5.00
## 1st Qu.:17.02
## Median :21.20
## Mean :22.53
## 3rd Qu.:25.00
## Max. :50.00
Ans:
(i) “crim” - The crime rates of both tracts (38.35 and 67.92) are outlier values. They are very high compared to the median crime rate of 0.26.
(ii) “zn” - The proportion of residential land zoned for lots over 25,000 sq.ft. is zero for both tracts, which is the same as the median value of the variable. Hence, there are no large residential lots in these tracts.
(iii) “indus” - For both tracts the proportion of non-retail business acres is 18.1, which is well above the median and equal to the third quartile, indicating a relatively high share of non-retail business land.
(iv) “chas” - The two tracts do not bound the Charles river.
(v) “nox” - Nitrogen oxides concentration lies beyond the third quartile for both tracts. Hence, the concentration levels are relatively high. This could be because the tracts are near highways.
(vi) “rm” - Average number of rooms per dwelling lies in the first quartile for both tracts, indicating relatively small dwellings.
(vii) “age” - The proportion of owner-occupied units built prior to 1940 is 100 for both tracts, the maximum of the variable, indicating very old housing.
(viii) “dis” - The weighted mean of distances to five Boston employment centres lies in the lowest quartile for both tracts, indicating that the tracts are close to the employment centres.
(ix) “rad” - The index of accessibility to radial highways takes its maximum value for both tracts, indicating that the tracts have very good access to highways.
(x) “tax” - The full-value property-tax rate per $10,000 is 666 for both tracts, which equals the third quartile, indicating that people pay high tax rates for smaller houses. Probably the relationship between tax rate and dwelling size is not linear.
(xi) “ptratio” - The pupil-teacher ratio is 20.2 for both tracts, which equals the third quartile, indicating relatively many pupils per teacher.
(xii) “lstat” - Lower status of the population lies in the upper quartile for both tracts, indicating a large share of lower-income households.
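The comparison of the two lowest-value tracts against the overall distribution of each predictor can be made more systematic with percentile ranks. This sketch assumes the ISLR2 package is installed; the empirical CDF is one convenient way to express "where in the range" a value sits, and it is an addition to the original analysis.

```r
library(ISLR2)  # provides the Boston data frame used in this exercise

# Tracts with the lowest median home value
low <- Boston[Boston$medv == min(Boston$medv), ]
predictors <- setdiff(names(Boston), "medv")

# For each predictor: fraction of all tracts with a value <= the tract's value
pct_rank <- sapply(predictors, function(v) ecdf(Boston[[v]])(low[[v]]))

# One row per predictor, one column per low-value tract
print(round(t(pct_rank), 2))
```

Values near 1 mean the tract sits at the top of that predictor's range, and values near 0 mean it sits at the bottom.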
(h) In this data set, how many of the census tracts average more than seven rooms per dwelling? More than eight rooms per dwelling? Comment on the census tracts that average more than eight rooms per dwelling.
k = dat %>% filter(rm > 7)
j = dat %>% filter(rm > 8)
data.frame("more_than_7_rooms" = c(length(k$rm)),
"more_than_8_rooms" = c(length(j$rm)))
## more_than_7_rooms more_than_8_rooms
## 1 64 13
boxplot(j[,2:14])
Ans: The 13 tracts that average more than 8 rooms per dwelling have a very low crime rate, a pupil-teacher ratio that is within the overall range of the variable, and a very high average number of rooms per dwelling. The property tax rate is low, with the exception of one tract. The proportion of non-retail business acres per town is very low except for two tracts, indicating mostly residential areas. The tracts seem to be away from highways. The majority of the houses were built before 1940, except for a few.
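To put numbers on these observations, the medians for the tracts with more than eight rooms can be placed next to the medians for all tracts. This sketch assumes the ISLR2 package is installed; the median is chosen here because several of the variables are skewed.

```r
library(ISLR2)  # provides the Boston data frame used in this exercise

# Tracts that average more than eight rooms per dwelling
big <- Boston[Boston$rm > 8, ]

# Median of every variable: large-dwelling tracts vs. all tracts
comparison <- rbind(
  more_than_8_rooms = sapply(big, median),
  all_tracts        = sapply(Boston, median)
)
print(round(comparison, 2))
```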