Data Analysis

(2)

Explain whether each scenario is a classification or regression problem, and indicate whether we are most interested in inference or prediction. Finally, provide n and p where n is sample size and p is the number of predictors.

(a)

We collect a set of data on the top 500 firms in the US. For each firm we record profit, number of employees, industry and the CEO salary. We are interested in understanding which factors affect CEO salary.

Solution: This is a regression problem because the response variable, CEO salary, is continuous. We are interested in inference because we are trying to understand how CEO salary relates to the other variables.

n = 500 and p = 3 (the predictors are profit, number of employees, and industry).

(b) We are considering launching a new product and wish to know whether it will be a success or a failure. We collect data on 20 similar products that were previously launched. For each product we have recorded whether it was a success or failure, price charged for the product, marketing budget, competition price, and ten other variables.

Solution:

This is a classification problem because the response variable is categorical with two categories (success or failure). We are interested in prediction because we want to predict whether the new product will be a success or a failure.

Here, n = 20 and p = 13 (price, marketing budget, competition price, and ten other variables).

(c)

We are interested in predicting the % change in the USD/Euro exchange rate in relation to the weekly changes in the world stock markets. Hence we collect weekly data for all of 2012. For each week we record the % change in the USD/Euro, the % change in the US market, the % change in the British market, and the % change in the German market.

Solution:

This is a regression problem because the response variable is numeric (% change in the USD/Euro exchange rate). Here our interest is in prediction.

n = 52 (the number of weeks in 2012) and p = 3 (% change in the US, British, and German markets).

(3)

We now revisit the bias-variance decomposition.

(a) Provide a sketch of typical (squared) bias, variance, training error, test error, and Bayes (or irreducible) error curves, on a single plot, as we go from less flexible statistical learning methods towards more flexible approaches. The x-axis should represent the amount of flexibility in the method, and the y-axis should represent the values for each curve. There should be five curves. Make sure to label each one.

Solution:
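The five curves can be drawn with a short R sketch. The functional forms and constants below are made up purely to reproduce the typical shapes; they are assumptions for illustration and are not estimated from any data set.

```r
# Illustrative (made-up) curve shapes; not estimated from any data
flexibility <- seq(1, 10, length.out = 100)
bias_sq     <- 4 / flexibility^1.5               # falls as flexibility grows
variance    <- 0.03 * flexibility^2              # rises as flexibility grows
bayes_error <- rep(1, length(flexibility))       # constant irreducible error
test_error  <- bias_sq + variance + bayes_error  # sum of the three: U-shaped
train_error <- 4 * exp(-0.3 * flexibility)       # keeps falling

curves <- cbind(bias_sq, variance, bayes_error, test_error, train_error)
labels <- c("Squared bias", "Variance", "Bayes error",
            "Test error", "Training error")

# One line per column of `curves`, each with its own colour and line type
matplot(flexibility, curves, type = "l", lty = 1:5, col = 1:5, lwd = 2,
        xlab = "Flexibility", ylab = "Error")
legend("top", legend = labels, lty = 1:5, col = 1:5, lwd = 2)
```

With these assumed shapes the test error bottoms out at an intermediate flexibility, while the training error eventually falls below the Bayes error line.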

(b) Explain why each of the five curves has the shape displayed in part (a).

Solution:

Training error:

As flexibility increases, the training error decreases. A more flexible model can fit the training data more closely, so the training error keeps falling as flexibility grows.

Test error:

The test error first decreases and then, after a certain point, increases again. The reason is that the expected test mean squared error is the sum of the squared bias, the variance, and the Bayes error (the irreducible error, which remains constant). At the beginning, the squared bias decreases faster than the variance increases, so the test mean squared error falls. Beyond a certain point, the variance increases faster than the squared bias decreases, so the test mean squared error rises. The test mean squared error curve is therefore roughly U-shaped.

Bayes error:

The Bayes error is the irreducible error. It is due to the noise in the data, not the model, and it stays the same regardless of how good or bad the model is. That is why the Bayes error curve is a horizontal straight line.

Squared bias:

Squared bias decreases as the flexibility of the model increases. A more flexible model can capture the complexity of the data, whereas a less flexible model cannot. Squared bias is therefore a decreasing curve.

Variance:

A highly flexible model follows the individual training points closely, so its fit changes substantially when the training data change. Variance is therefore an increasing curve.
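The test-error argument rests on the standard bias-variance decomposition of the expected test mean squared error at a point $x_0$. The symbols $y_0$, $\hat{f}$, and $\varepsilon$ are notation assumed here (test response, fitted model, and noise term); they do not appear elsewhere in this report.

$$E\left[\left(y_0 - \hat{f}(x_0)\right)^2\right] = \mathrm{Var}\left(\hat{f}(x_0)\right) + \left[\mathrm{Bias}\left(\hat{f}(x_0)\right)\right]^2 + \mathrm{Var}(\varepsilon)$$

The first two terms move in opposite directions as flexibility increases, and the last term is the constant Bayes (irreducible) error.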

(7)

The table below provides a training data set containing six observations, three predictors, and one qualitative response variable.

Suppose we wish to use this data set to make a prediction for Y when $X_1 =X_2 =X_3 =0$ using K-nearest neighbors.

(a)

Compute the Euclidean distance between each observation and the test point, $X_1 =X_2 =X_3 =0$.

Solution:

euclidean_distance<- function(x1,x2,x3){
  d <- sqrt(((x1-0)^2)+((x2-0)^2)+((x3-0)^2))
  return(d)
}

# Euclidean distance between (0,3,0) and (0,0,0)
euclidean_distance(0,3,0)

## [1] 3

# Euclidean distance between (2,0,0) and (0,0,0)
euclidean_distance(2,0,0)

## [1] 2

# Euclidean distance between (0,1,3) and (0,0,0)
euclidean_distance(0,1,3)

## [1] 3.162278

# Euclidean distance between (0,1,2) and (0,0,0)
euclidean_distance(0,1,2)

## [1] 2.236068

# Euclidean distance between (-1,0,1) and (0,0,0)
euclidean_distance(-1,0,1)

## [1] 1.414214

# Euclidean distance between (1,1,1) and (0,0,0)
euclidean_distance(1,1,1)

## [1] 1.732051

(b)

What is our prediction with K = 1? Why?

Solution: Based on the Euclidean distances calculated in part (a), the closest neighbor to the test point (0,0,0) is the point (-1,0,1), at a distance of 1.414214. This point belongs to the Green class. So, with K = 1, we predict that the test point (0,0,0) is Green.

(c)

What is our prediction with K =3? Why?

Solution: For K = 3, based on the distances calculated in part (a), the three nearest neighbors of the test point (0,0,0) are (-1,0,1) Green, (1,1,1) Red, and (2,0,0) Red. Two of these three neighbors belong to the class “Red”. Therefore, the test point (0,0,0) is predicted to be Red.
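The distances and the two votes can be reproduced in a few lines of R. This is a minimal sketch: the predictor values are the six points used in part (a), the class labels are assumed to follow the table in the textbook exercise (the table itself is not reproduced in this report), and `knn_vote` is a hypothetical helper written for illustration.

```r
# Six training points from part (a); labels assumed from the exercise's table
train <- data.frame(
  X1 = c(0, 2, 0, 0, -1, 1),
  X2 = c(3, 0, 1, 1, 0, 1),
  X3 = c(0, 0, 3, 2, 1, 1),
  Y  = c("Red", "Red", "Red", "Green", "Green", "Red")
)
test_point <- c(0, 0, 0)

# Euclidean distance from every training row to the test point
X <- as.matrix(train[, c("X1", "X2", "X3")])
dists <- sqrt(rowSums(sweep(X, 2, test_point)^2))

# knn_vote (hypothetical helper): majority class among the k nearest rows
knn_vote <- function(k) {
  nearest <- order(dists)[seq_len(k)]
  votes <- table(train$Y[nearest])
  names(votes)[which.max(votes)]
}

knn_vote(1)
knn_vote(3)
```

Ties are not handled specially here; with K = 1 and K = 3 on this data set none arise.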

(d)

If the Bayes decision boundary in this problem is highly nonlinear, then would we expect the best value for K to be large or small? Why?

Solution:

If the Bayes decision boundary is highly nonlinear, the true relationship between the predictors and the class is complex. We would therefore expect the best value of K to be small. A small K gives a flexible decision boundary with enough twists and turns to follow the nonlinear shape, whereas a large K averages over many neighbors, produces a smoother boundary, and would underfit.

(9)

This exercise involves the Auto data set studied in the lab. Make sure that the missing values have been removed from the data.

(a)

Which of the predictors are quantitative, and which are qualitative?

library(ISLR)
data <- Auto
head(data)

Solution: cylinders, displacement, horsepower, weight, acceleration, and year are numeric (quantitative) predictors. origin is categorical (qualitative) because 1 refers to American, 2 to European, and 3 to Japanese. Likewise, name is qualitative.

mpg is the response variable.

Data Cleaning:

We need to clean the data before doing the analysis.

# Check whether there are any missing values
sum(is.na(data))

## [1] 0

There are no missing values.

#Check whether there are any duplicate rows
sum(duplicated(data))

## [1] 0

There are no duplicate rows. So, the data is clean and ready for analysis.
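The checks above only count missing values and duplicates; nothing had to be removed here. If a data set did contain them, the removal step could look like the following sketch. The data frame `toy` is a made-up example for illustration, not part of the Auto data.

```r
# Hypothetical toy data: one missing horsepower value and one duplicated row
toy <- data.frame(
  mpg        = c(18, 25, 30, 30),
  horsepower = c(130, NA, 70, 70)
)

sum(is.na(toy))        # number of missing cells
sum(duplicated(toy))   # number of rows that repeat an earlier row

toy_clean <- na.omit(toy)                       # drop rows containing any NA
toy_clean <- toy_clean[!duplicated(toy_clean), ] # drop repeated rows
nrow(toy_clean)
```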

(b)

What is the range of each quantitative predictor?

#Preview the first few rows of the data
head(data)

To find the ranges of the numeric predictors, we first remove the response variable mpg (column 1) and the categorical predictors (columns 8 and 9).

#Remove response variable and categorical predictors
data1 <-data[,-c(1,8,9)]

#Find the ranges of the numeric predictor variables
apply(data1,2,function(x){
  c(Range = max(x) - min(x))
})

##    cylinders displacement   horsepower       weight acceleration         year 
##          5.0        387.0        184.0       3527.0         16.8         12.0

Alternative method of getting maximum and minimum value for the range:

attach(data1)
#After attaching data1 we don't need to write data1$weight; we can use weight directly
#The following outputs min and max values whose difference is range
range(cylinders)

## [1] 3 8

range(displacement)

## [1]  68 455

range(horsepower)

## [1]  46 230

range(weight)

## [1] 1613 5140

range(acceleration)

## [1]  8.0 24.8

range(year)

## [1] 70 82
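The per-column calls to range() can also be collapsed into one call without attach(), which avoids masking other objects in the workspace. The sketch below uses a small made-up data frame named `toy` as a stand-in; the same `sapply` call should work on `data1`.

```r
# Hypothetical stand-in for data1 (made-up rows, for illustration only)
toy <- data.frame(
  cylinders = c(4, 8, 6),
  weight    = c(2130, 4354, 3329)
)

# sapply applies range() to each column and binds the results into a matrix
ranges <- sapply(toy, range)
rownames(ranges) <- c("min", "max")
ranges
```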

(c)

What is the mean and standard deviation of each quantitative predictor?

Solution:

Mean of the numeric predictors (excluding the response variable and the categorical predictors):

#Mean of numeric predictors
apply(data1,2,mean)

##    cylinders displacement   horsepower       weight acceleration         year 
##     5.471939   194.411990   104.469388  2977.584184    15.541327    75.979592

Standard deviation of numeric predictors:

apply(data1,2,sd)

##    cylinders displacement   horsepower       weight acceleration         year 
##     1.705783   104.644004    38.491160   849.402560     2.758864     3.683737

Alternative Method:

apply(data1,2,function(x){
  c(Mean=mean(x),Standard_deviation=sd(x))
})

##                    cylinders displacement horsepower    weight acceleration
## Mean                5.471939      194.412  104.46939 2977.5842    15.541327
## Standard_deviation  1.705783      104.644   38.49116  849.4026     2.758864
##                         year
## Mean               75.979592
## Standard_deviation  3.683737

(d)

Now remove the 10th through 85th observations. What is the range, mean, and standard deviation of each predictor in the subset of the data that remains?

#Removing the 10th to 85th observations (data1 already excludes mpg and the categorical columns 8 and 9)
reduced_data <- data1[-c(10:85),]

apply(reduced_data,2,function(x){
  c(Range=max(x)-min(x),Mean=mean(x),Standard_deviation=sd(x))
})

##                    cylinders displacement horsepower    weight acceleration
## Range               5.000000    387.00000  184.00000 3348.0000    16.300000
## Mean                5.373418    187.24051  100.72152 2935.9715    15.726899
## Standard_deviation  1.654179     99.67837   35.70885  811.3002     2.693721
##                         year
## Range              12.000000
## Mean               77.145570
## Standard_deviation  3.106217

(e)

Using the full data set, investigate the predictors graphically, using scatter plots or other tools of your choice. Create some plots highlighting the relationships among the predictors. Comment on your findings.

pairs(~ mpg+cylinders + displacement + horsepower + weight + acceleration + year,data = data, col="red",main = "Scatter Plot Matrix")

Interactive heatmap using plotly:

#install.packages("plotly")
library(plotly)

#Create correlation matrix excluding the categorical variable
correlation_matrix <-cor(data[,c(-8,-9)])

#Make interactive heatmap of the correlation matrix
plot_ly(x=colnames(correlation_matrix),y=rownames(correlation_matrix),z=correlation_matrix,type="heatmap")

Findings: The scatter plot matrix and the heatmap show that displacement, horsepower, and weight have a strong negative correlation with the response variable mpg. This suggests that vehicles with larger displacement, more horsepower (a more powerful engine), and more weight need more fuel to move, resulting in lower mileage (lower fuel efficiency). Likewise, mpg and cylinders have a negative correlation, i.e. vehicles with more cylinders consume more fuel. On the other hand, mpg and year have a positive correlation, suggesting that newer vehicles are more fuel efficient.

We also observed that the predictors are correlated with each other. In particular, displacement and horsepower, and displacement and weight, have a strong positive correlation. This multicollinearity needs to be taken into consideration during statistical analysis.

(f)

Suppose that we wish to predict gas mileage (mpg) on the basis of the other variables. Do your plots suggest that any of the other variables might be useful in predicting mpg? Justify your answer.

Solution: As can be seen from the plot and the heatmap, acceleration does not have a strong correlation with mpg, so I don’t think it will be very helpful in predicting mpg. All the other variables, namely cylinders, displacement, horsepower, weight, and year, are correlated with mpg, so these 5 variables could be useful in predicting mpg. However, there is multicollinearity among displacement, horsepower, and weight, so careful study is needed to decide which of them to include in the model.
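Reading the strength of each relationship off a plot can be backed up numerically by sorting the predictors by the absolute value of their correlation with the response. The sketch below is one way to do that; `rank_by_correlation` is a hypothetical helper and `toy` is made-up data used only to show the call.

```r
# rank_by_correlation (hypothetical helper): correlations with the response,
# sorted from strongest to weakest in absolute value
rank_by_correlation <- function(df, response) {
  r <- cor(df)[, response]
  r <- r[names(r) != response]            # drop the response's self-correlation
  r[order(abs(r), decreasing = TRUE)]
}

# Made-up toy data, for illustration only
toy <- data.frame(
  mpg    = c(30, 25, 20, 15, 10),
  weight = c(2000, 2500, 3100, 3600, 4200),
  noise  = c(1, 5, 2, 4, 3)
)
rank_by_correlation(toy, "mpg")
```

On the Auto data the equivalent call would be something like `rank_by_correlation(data[, 1:7], "mpg")`, assuming columns 1 to 7 are the numeric ones as described above.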

(10)

This exercise involves the Boston housing data set.

(a)

To begin, load in the Boston data set. The Boston data set is part of the MASS library in R. How many rows are in this data set? How many columns? What do the rows and columns represent?

library(MASS)
my_data <- Boston
head(my_data)

dim(my_data)

## [1] 506  14

The data contains 506 rows and 14 columns. The rows represent different census tracts in the Boston metropolitan area. The columns represent the variables associated with Boston housing. The column names represent the following:

crim = per capita crime rate by town
zn = proportion of residential land zoned for lots over 25,000 sq. ft.
indus = proportion of non-retail business acres per town
chas = Charles River dummy variable (= 1 if tract bounds river, 0 otherwise)
nox = nitrogen oxides concentration (parts per 10 million)
rm = average number of rooms per dwelling
age = proportion of owner-occupied units built prior to 1940
dis = weighted mean distance to five Boston employment centres
rad = index of accessibility to radial highways
tax = full-value property tax rate per $10,000
ptratio = pupil-teacher ratio by town
black = 1000(Bk - 0.63)^2, where Bk is the proportion of blacks by town
lstat = lower status of the population (percent)
medv = median value of owner-occupied homes in $1000s

Here medv is the response variable.

Data Cleaning:

#Check if there are any missing values
sum(is.na(my_data))

## [1] 0

There are no missing values.

#Check if there are duplicate rows
sum(duplicated(my_data))

## [1] 0

The result shows that there are no duplicate rows. So, overall the data is clean and ready for analysis.

(b)

Make some pairwise scatter plots of the predictors (columns) in this data set. Describe your findings.

Solution: The column “chas” (the 4th column) is a categorical variable, so we remove it before plotting.

pairs(my_data[,-c(4)],col="red")

Heatmap:

library(plotly)

#Exclude the categorical column "chas" which is 4th column
cor_matrix <- cor(my_data[,-c(4)])
plot_ly(x = colnames(cor_matrix), y = rownames(cor_matrix), z = cor_matrix, type = "heatmap")

Findings: From the scatterplot matrix and the heatmap, we can see that the median house price (medv) has a strong positive correlation with the number of rooms (rm): the higher the number of rooms, the higher the price of the house. On the other hand, in places with a larger share of lower-status population (lstat), the house prices are lower.

The scatter plot shows a strong correlation between nox (nitrogen oxides concentration) and dis (weighted mean distance to five Boston employment centres). Likewise, dis and age are correlated, as are rm (number of rooms) and lstat (lower status). This multicollinearity needs to be addressed while building our model.

(c)

Are any of the predictors associated with per capita crime rate? If so, explain the relationship.

Solution:

From the scatter plot and heatmap we can see that the crime rate “crim” has a positive correlation with “rad”, the index of accessibility to radial highways. This means that places that are easily accessible from highways tend to have a higher crime rate. Likewise, “crim” has a positive correlation with “tax” (tax rate), suggesting that areas with a higher tax rate have a higher crime rate. “crim” and “dis” (mean distance to five main employment centres) have a negative correlation, but the relation is weak. The plot also shows some positive correlation between crime rate and age, but it is not that strong.

(d)

Do any of the suburbs of Boston appear to have particularly high crime rates? Tax rates? Pupil-teacher ratios? Comment on the range of each predictor.

#smallest and largest crime rate
range(my_data$crim)

## [1]  0.00632 88.97620

From the range of the crime rate, we observe that some suburbs of Boston have a very high crime rate, up to about 88.98.

#Smallest and largest tax rates
range(my_data$tax)

## [1] 187 711

Regarding tax rates, the range goes from 187 to 711. The maximum of 711 suggests that some suburbs have high property tax rates.

#Smallest and largest pupil_teacher ratio
range(my_data$ptratio)

## [1] 12.6 22.0

The pupil-teacher ratio has less variation than the crime rate and the tax rate. No suburb has an extremely high pupil-teacher ratio; the maximum is 22.

(e)

How many of the suburbs in this data set bound the Charles river?

Solution:

#chas value 1 means the tract bounds the Charles river and 0 means it does not.
boolean_mask <-my_data$chas==1
sum(boolean_mask)

## [1] 35

Result: 35 suburbs bound the Charles river.

(f)

What is the median pupil-teacher ratio among the towns in this data set?

Solution:

median(my_data$ptratio)

## [1] 19.05

Result: median pupil-teacher ratio is 19.05.

(g)

Which suburb of Boston has lowest median value of owner occupied homes? What are the values of the other predictors for that suburb, and how do those values compare to the overall ranges for those predictors? Comment on your findings.

#Remove the categorical column 4
data_subset <- my_data[,-c(4)]

minimum_median_value <-min(data_subset$medv)
print(c("minimum median value of the house is:",minimum_median_value,"in $1000s"))

## [1] "minimum median value of the house is:"
## [2] "5"                                    
## [3] "in $1000s"

mask <-data_subset$medv == minimum_median_value

#Extracting only the rows whose "medv" value equals the minimum (5, in $1000s)
data_subset[mask,]

Result: The result shows that the suburbs in rows 399 and 406 have the lowest median value of owner-occupied homes ($5000). The values of the other predictors are displayed above. Comparing these values with their ranges, we can see that both suburbs have a high crime rate and a high nitrogen oxides concentration (nox = 0.693). Both suburbs have older housing (age = 100, i.e. all owner-occupied units in these suburbs were built prior to 1940). Likewise, both suburbs have a high property tax rate (666). These could be contributing factors to the low housing prices in the two suburbs.
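The comparison against the overall ranges can be made more concrete by asking, for each variable, what share of all tracts has a value at or below the value in the lowest-medv rows. This is only a sketch of one possible approach; it assumes the MASS package is installed and uses the Boston data frame directly.

```r
library(MASS)  # provides the Boston data set

# Row indices of the tracts with the lowest median home value
lowest <- which(Boston$medv == min(Boston$medv))

# For each lowest-medv row and each variable: the share of all tracts
# whose value is less than or equal to that row's value
percentile <- t(sapply(lowest, function(i) {
  sapply(names(Boston), function(v) mean(Boston[[v]] <= Boston[i, v]))
}))
rownames(percentile) <- lowest
round(percentile, 2)
```

A value close to 1 means the tract sits near the top of that variable's distribution, and a value close to 0 means it sits near the bottom.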

(h)

In this data set, how many of the suburbs average more than seven rooms per dwelling? More than eight rooms per dwelling? Comment on the suburbs that average more than eight rooms per dwelling.

Number of suburbs with more than 7 rooms per dwelling:

mask2 <- my_data$rm > 7
subset2 <- my_data[mask2,]
dim(subset2)

## [1] 64 14

This shows that 64 suburbs average more than seven rooms per dwelling.

Number of suburbs with more than 8 rooms per dwelling:

mask3 <- my_data$rm > 8
subset3 <- my_data[mask3,]
dim(subset3)

## [1] 13 14

The result shows that 13 suburbs average more than eight rooms per dwelling.

subset3

apply(my_data,2,range)

##          crim  zn indus chas   nox    rm   age     dis rad tax ptratio  black
## [1,]  0.00632   0  0.46    0 0.385 3.561   2.9  1.1296   1 187    12.6   0.32
## [2,] 88.97620 100 27.74    1 0.871 8.780 100.0 12.1265  24 711    22.0 396.90
##      lstat medv
## [1,]  1.73    5
## [2,] 37.97   50

As can be seen from the data, the suburbs that average more than 8 rooms per dwelling have a low crime rate. Most of these suburbs do not bound the Charles River. They have comparatively lower levels of nitrogen oxides pollution, and their property tax rate is relatively lower. They have a high socio-economic status, or equivalently, the percentage of people with low socio-economic status is lower. These suburbs are somewhat far from radial highways. The median value of the houses in these suburbs is high, which suggests that wealthy people live there.

NAME: RAM Chandra Dhungana

2025-02-04