Assignment 1

Exercise 2:

Explain whether each scenario is a classification or regression problem, and indicate whether we are most interested in inference or prediction. Finally, provide n and p.

Exercise 2a:

We collect a set of data on the top 500 firms in the US. For each firm we record profit, number of employees, industry and the CEO salary. We are interested in understanding which factors affect CEO salary.

This is a regression problem because the response, CEO salary, is quantitative. We are most interested in inference, since the goal is to understand how a firm's characteristics relate to CEO salary rather than to predict it.
Here, n = 500 (the number of firms on which data was collected)
p = 3 (profit, number of employees, industry)

Exercise 2b:

We are considering launching a new product and wish to know whether it will be a success or a failure. We collect data on 20 similar products that were previously launched. For each product we have recorded whether it was a success or failure, price charged for the product, marketing budget, competition price, and ten other variables.

This is a classification problem because the response (success or failure) is qualitative. We are most interested in prediction, since we want to know whether the new product will succeed or fail.
Here, n = 20 (the number of previous products on which data was collected)
p = 13 (price charged for the product, marketing budget, competition price, and ten other variables)

Exercise 2c:

We are interested in predicting the % change in the USD/Euro exchange rate in relation to the weekly changes in the world stock markets. Hence we collect weekly data for all of 2012. For each week we record the % change in the USD/Euro, the % change in the US market, the % change in the British market, and the % change in the German market.

This is a regression problem because the response, the % change in the USD/Euro exchange rate, is quantitative. We are most interested in prediction, since the stated goal is to predict that change from the weekly changes in the world stock markets.
Here, n = 52 (one observation per week for all of 2012)
p = 3 (% change in the US market, the % change in the British market, and the % change in the German market)

Exercise 5:

What are the advantages and disadvantages of a very flexible (versus a less flexible) approach for regression or classification? Under what circumstances might a more flexible approach be preferred to a less flexible approach? When might a less flexible approach be preferred?

The advantages of a very flexible approach are that it can fit non-linear relationships more closely, has lower bias, and can capture more complex systems.
The disadvantages are that it requires estimating a larger number of parameters, can overfit the training data by following the noise too closely, and has higher variance.
A more flexible approach is preferable when a simpler model underfits, when the relationship is clearly non-linear, or when there are many observations relative to the number of predictors. A less flexible approach is preferable when the data set has few observations, when interpretability matters, or when the relationship is close to linear.
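The trade-off can be illustrated with a small simulation. This is only a sketch on synthetic data: the true function, the noise level, the sample sizes and the seed are assumptions chosen for illustration and are not part of the exercise. A linear fit and a degree-10 polynomial are trained on the same 30 points and then compared on fresh test data.

```r
set.seed(1)                      # arbitrary seed, for reproducibility only

# Assumed truth for illustration: a linear signal plus Gaussian noise
f <- function(x) 2 + 3 * x
train <- data.frame(x = runif(30))
train$y <- f(train$x) + rnorm(30, sd = 1)
test <- data.frame(x = runif(1000))
test$y <- f(test$x) + rnorm(1000, sd = 1)

# Mean squared error of a fitted model on a given data set
mse <- function(fit, data) mean((data$y - predict(fit, newdata = data))^2)

fit_rigid    <- lm(y ~ x, data = train)            # less flexible
fit_flexible <- lm(y ~ poly(x, 10), data = train)  # very flexible

results <- data.frame(
  model     = c("linear", "degree-10 polynomial"),
  train_mse = c(mse(fit_rigid, train), mse(fit_flexible, train)),
  test_mse  = c(mse(fit_rigid, test),  mse(fit_flexible, test))
)
print(results)
```

The flexible fit cannot have a larger training MSE than the linear fit, because the linear model is nested inside it. When the truth really is close to linear and n is small, the test MSE will often go the other way, which is the overfitting described above.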

Exercise 6:

Describe the differences between a parametric and a non-parametric statistical learning approach. What are the advantages of a parametric approach to regression or classification (as opposed to a nonparametric approach)? What are its disadvantages?

Differences: A parametric approach assumes a functional form for f, which reduces the problem of estimating f to one of estimating a set of parameters. A non-parametric approach makes no explicit assumption about the form of f, so it needs a very large number of observations to estimate f accurately.
Advantages: Estimating a handful of parameters is easier than estimating an arbitrary function, so a parametric approach needs fewer observations than a non-parametric one.
Disadvantages: The chosen model will usually not match the true unknown form of f. If it is too far from the true function, the estimate will be poor.
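As a rough illustration of the difference, the sketch below fits a parametric model (lm(), which assumes f is linear) and a non-parametric one (loess(), local regression) to the same synthetic data. The sine-shaped truth, noise level and seed are assumptions made for the example. Because the assumed truth is non-linear, this setup shows the disadvantage of the parametric approach.

```r
set.seed(2)                          # arbitrary seed
x <- seq(0, 2 * pi, length.out = 200)
y <- sin(x) + rnorm(200, sd = 0.3)   # assumed non-linear truth plus noise

fit_param    <- lm(y ~ x)     # parametric: assumes f is linear in x
fit_nonparam <- loess(y ~ x)  # non-parametric: no global form assumed for f

length(coef(fit_param))       # number of parameters estimated by the linear model

# Mean squared error of each fit against the known true function
err_param    <- mean((sin(x) - fitted(fit_param))^2)
err_nonparam <- mean((sin(x) - fitted(fit_nonparam))^2)
c(parametric = err_param, nonparametric = err_nonparam)
```

With far fewer observations, or with a truth that really is linear, the comparison would be expected to favour the parametric fit.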

Exercise 8:

This exercise relates to the College data set, which can be found in the file College.csv. It contains a number of variables for 777 different universities and colleges in the US.

Exercise 8a:

Use the read.csv() function to read the data into R. Call the loaded data college. Make sure that you have the directory set to the correct location for the data

# Assumes College.csv is in the working directory (check with getwd())
college <- read.csv("College.csv")

Exercise 8b:

Look at the data using the fix() function. You should notice that the first column is just the name of each university. We don’t really want R to treat this as data. However, it may be handy to have these names for later.

head(college[, 1:5])

rownames <- college[, 1]
college <- college[, -1]
head(college[, 1:5])
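An alternative to keeping the names in a separate vector is to store them as the row names of the data frame, so each row stays labelled after subsetting. The sketch below uses a small made-up data frame (college_demo, with invented values) as a stand-in for College.csv.

```r
# Toy stand-in for College.csv (names and values are made up for illustration)
college_demo <- data.frame(
  X       = c("College A", "College B", "College C"),
  Private = c("Yes", "No", "Yes"),
  Apps    = c(1660, 2186, 1428)
)

rownames(college_demo) <- college_demo[, 1]  # keep the names as row labels
college_demo <- college_demo[, -1]           # drop the name column from the data

college_demo["College B", ]                  # rows can now be looked up by name
```

With this layout, the names of any filtered subset are available directly through rownames() on that subset.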

Exercise 8c:

1. Use the summary() function to produce a numerical summary of the variables in the data set.

summary(college)

##    Private               Apps           Accept          Enroll    
##  Length:777         Min.   :   81   Min.   :   72   Min.   :  35  
##  Class :character   1st Qu.:  776   1st Qu.:  604   1st Qu.: 242  
##  Mode  :character   Median : 1558   Median : 1110   Median : 434  
##                     Mean   : 3002   Mean   : 2019   Mean   : 780  
##                     3rd Qu.: 3624   3rd Qu.: 2424   3rd Qu.: 902  
##                     Max.   :48094   Max.   :26330   Max.   :6392  
##    Top10perc       Top25perc      F.Undergrad     P.Undergrad     
##  Min.   : 1.00   Min.   :  9.0   Min.   :  139   Min.   :    1.0  
##  1st Qu.:15.00   1st Qu.: 41.0   1st Qu.:  992   1st Qu.:   95.0  
##  Median :23.00   Median : 54.0   Median : 1707   Median :  353.0  
##  Mean   :27.56   Mean   : 55.8   Mean   : 3700   Mean   :  855.3  
##  3rd Qu.:35.00   3rd Qu.: 69.0   3rd Qu.: 4005   3rd Qu.:  967.0  
##  Max.   :96.00   Max.   :100.0   Max.   :31643   Max.   :21836.0  
##     Outstate       Room.Board       Books           Personal   
##  Min.   : 2340   Min.   :1780   Min.   :  96.0   Min.   : 250  
##  1st Qu.: 7320   1st Qu.:3597   1st Qu.: 470.0   1st Qu.: 850  
##  Median : 9990   Median :4200   Median : 500.0   Median :1200  
##  Mean   :10441   Mean   :4358   Mean   : 549.4   Mean   :1341  
##  3rd Qu.:12925   3rd Qu.:5050   3rd Qu.: 600.0   3rd Qu.:1700  
##  Max.   :21700   Max.   :8124   Max.   :2340.0   Max.   :6800  
##       PhD            Terminal       S.F.Ratio      perc.alumni   
##  Min.   :  8.00   Min.   : 24.0   Min.   : 2.50   Min.   : 0.00  
##  1st Qu.: 62.00   1st Qu.: 71.0   1st Qu.:11.50   1st Qu.:13.00  
##  Median : 75.00   Median : 82.0   Median :13.60   Median :21.00  
##  Mean   : 72.66   Mean   : 79.7   Mean   :14.09   Mean   :22.74  
##  3rd Qu.: 85.00   3rd Qu.: 92.0   3rd Qu.:16.50   3rd Qu.:31.00  
##  Max.   :103.00   Max.   :100.0   Max.   :39.80   Max.   :64.00  
##      Expend        Grad.Rate     
##  Min.   : 3186   Min.   : 10.00  
##  1st Qu.: 6751   1st Qu.: 53.00  
##  Median : 8377   Median : 65.00  
##  Mean   : 9660   Mean   : 65.46  
##  3rd Qu.:10830   3rd Qu.: 78.00  
##  Max.   :56233   Max.   :118.00

2. Use the pairs() function to produce a scatterplot matrix of the first ten columns or variables of the data. Recall that you can reference the first ten columns of a matrix A using A[,1:10].

college$Private <- as.factor(college$Private)
pairs(college[, 1:10])

3. Use the plot() function to produce side-by-side boxplots of Outstate versus Private.

plot(college$Private, college$Outstate, 
     xlab = "Private University", 
     ylab ="Out of State tuition in USD", 
     main = "Outstate Tuition Plot")

4. Create a new qualitative variable, called Elite, by binning the Top10perc variable. Use the summary() function to see how many elite universities there are.

Elite <- rep("No", nrow(college))
Elite[college$Top10perc > 50] <- "Yes"
Elite <- as.factor(Elite)
college$Elite <- Elite
summary(college$Elite)

##  No Yes 
## 699  78
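The same binning can be written in one step with ifelse() and factor(). This is a sketch on a made-up vector of Top10perc values, not on the College data; the 50% threshold is the one used above.

```r
top10perc <- c(23, 16, 60, 89, 50, 51)   # made-up values for illustration

# "Yes" when more than 50% of students come from the top 10% of their class
elite <- factor(ifelse(top10perc > 50, "Yes", "No"), levels = c("No", "Yes"))
table(elite)
```

Setting the levels explicitly keeps the order "No", "Yes" even if one of the two groups happens to be empty.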

Now use the plot() function to produce side-by-side boxplots of Outstate versus Elite.

plot(college$Elite, college$Outstate, 
     xlab = "Elite University", 
     ylab ="Out of State tuition in USD", 
     main = "Outstate Tuition Plot")

5. Use the hist() function to produce some histograms with differing numbers of bins for a few of the quantitative variables.

par(mfrow = c(2,2))
hist(college$Books, xlab = "Books", ylab = "Count")
hist(college$PhD, xlab = "PhD", ylab = "Count")
hist(college$Grad.Rate, xlab = "Grad Rate", ylab = "Count")
hist(college$perc.alumni, xlab = "% alumni", ylab = "Count")
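The number of bins can be varied with the breaks argument of hist(). The sketch below uses a synthetic vector as a stand-in for a column such as Grad.Rate; the distribution and seed are assumptions for illustration. Note that hist() treats a single number passed to breaks as a suggestion, so the number of bins actually drawn can differ from it.

```r
set.seed(3)                              # arbitrary seed
x <- rnorm(500, mean = 65, sd = 17)      # synthetic stand-in for one column

par(mfrow = c(1, 3))
h_coarse  <- hist(x, breaks = 5,  main = "breaks = 5",  xlab = "x")
h_default <- hist(x,              main = "default",     xlab = "x")
h_fine    <- hist(x, breaks = 40, main = "breaks = 40", xlab = "x")

# Number of bins actually used in each histogram
c(length(h_coarse$counts), length(h_default$counts), length(h_fine$counts))
```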

6. Continue exploring the data, and provide a brief summary of what you discover.

summary(college$PhD)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.00   62.00   75.00   72.66   85.00  103.00

The maximum of PhD is 103, which is impossible for a percentage of faculty and is most likely a data entry error. Let us see how many universities have this value, and which ones.

faculty.phd <- college[college$PhD == 103, ]
nrow(faculty.phd)

## [1] 1

rownames[as.numeric(rownames(faculty.phd))]

## [1] "Texas A&M University at Galveston"

Exercise 9:

This exercise involves the Auto data set studied in the lab. Make sure that the missing values have been removed from the data.

Exercise 9a:

Which of the predictors are quantitative, and which are qualitative?

auto <- read.csv("Auto.csv", na.strings = "?")
auto <- na.omit(auto)
str(auto)

## 'data.frame':    392 obs. of  9 variables:
##  $ mpg         : num  18 15 18 16 17 15 14 14 14 15 ...
##  $ cylinders   : int  8 8 8 8 8 8 8 8 8 8 ...
##  $ displacement: num  307 350 318 304 302 429 454 440 455 390 ...
##  $ horsepower  : int  130 165 150 150 140 198 220 215 225 190 ...
##  $ weight      : int  3504 3693 3436 3433 3449 4341 4354 4312 4425 3850 ...
##  $ acceleration: num  12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
##  $ year        : int  70 70 70 70 70 70 70 70 70 70 ...
##  $ origin      : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ name        : chr  "chevrolet chevelle malibu" "buick skylark 320" "plymouth satellite" "amc rebel sst" ...
##  - attr(*, "na.action")= 'omit' Named int [1:5] 33 127 331 337 355
##   ..- attr(*, "names")= chr [1:5] "33" "127" "331" "337" ...

Quantitative: displacement, horsepower, weight, acceleration and year

Qualitative: cylinders, origin and name (cylinders and origin are stored as numbers, but they take only a few distinct values and are treated here as categories, so they should be converted to factors)

Exercise 9b:

What is the range of each quantitative predictor? You can answer this using the range() function.

range(auto$mpg)

## [1]  9.0 46.6

range(auto$cylinders)

## [1] 3 8

range(auto$displacement)

## [1]  68 455

range(auto$weight)

## [1] 1613 5140

range(auto$acceleration)

## [1]  8.0 24.8

range(auto$year)

## [1] 70 82

range(auto$origin)

## [1] 1 3

Exercise 9c:

What is the mean and standard deviation of each quantitative predictor?

sapply(auto[, -c(4, 9)], mean)

##          mpg    cylinders displacement       weight acceleration         year 
##    23.445918     5.471939   194.411990  2977.584184    15.541327    75.979592 
##       origin 
##     1.576531

sapply(auto[, -c(4, 9)], sd)

##          mpg    cylinders displacement       weight acceleration         year 
##    7.8050075    1.7057832  104.6440039  849.4025600    2.7588641    3.6837365 
##       origin 
##    0.8055182
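The range, mean and standard deviation can also be collected in one table by selecting the numeric columns by type instead of by position. The sketch below uses a small made-up data frame (auto_demo) as a stand-in for the Auto data. Selecting by type keeps every numeric column, so on the real data it would also include horsepower, cylinders and origin.

```r
# Toy stand-in for the Auto data (values are made up for illustration)
auto_demo <- data.frame(
  mpg        = c(18, 15, 26, 31),
  horsepower = c(130L, 165L, 95L, 70L),
  name       = c("car a", "car b", "car c", "car d")
)

quantitative <- sapply(auto_demo, is.numeric)   # TRUE for numeric and integer columns

# One column of summary statistics per quantitative variable
stats <- sapply(auto_demo[, quantitative], function(col) {
  c(min = min(col), max = max(col), mean = mean(col), sd = sd(col))
})
stats
```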

Exercise 9d:

Now remove the 10th through 85th observations. What is the range, mean, and standard deviation of each predictor in the subset of the data that remains?

subset <- auto[-c(10:85), -c(4,9)]
sapply(subset, range)

##       mpg cylinders displacement weight acceleration year origin
## [1,] 11.0         3           68   1649          8.5   70      1
## [2,] 46.6         8          455   4997         24.8   82      3

sapply(subset, mean)

##          mpg    cylinders displacement       weight acceleration         year 
##    24.404430     5.373418   187.240506  2935.971519    15.726899    77.145570 
##       origin 
##     1.601266

sapply(subset, sd)

##          mpg    cylinders displacement       weight acceleration         year 
##     7.867283     1.654179    99.678367   811.300208     2.693721     3.106217 
##       origin 
##     0.819910

Exercise 9e:

Using the full data set, investigate the predictors graphically, using scatterplots or other tools of your choice. Create some plots highlighting the relationships among the predictors. Comment on your findings.

auto$cylinders <- as.factor(auto$cylinders)
auto$year <- as.factor(auto$year)
auto$origin <- as.factor(auto$origin)
auto$horsepower <- as.factor(auto$horsepower)
auto$name <- as.factor(auto$name)
pairs(auto)

Four-cylinder vehicles tend to have higher mpg than vehicles with other cylinder counts. mpg decreases as weight, displacement and horsepower increase. mpg also rises over the model years, almost doubling over the decade or so covered by the data.

Exercise 9f:

Suppose that we wish to predict gas mileage (mpg) on the basis of the other variables. Do your plots suggest that any of the other variables might be useful in predicting mpg? Justify your answer.

From the plots above, mpg is negatively correlated with weight, displacement and horsepower, so all three are likely to be useful predictors.

cor(auto$mpg, auto$weight)

## [1] -0.8322442

cor(auto$mpg, auto$displacement)

## [1] -0.8051269

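# horsepower was converted to a factor above; go through as.character() so that
# the original values, not the factor level codes, are recovered
auto$horsepower <- as.numeric(as.character(auto$horsepower))
cor(auto$mpg, auto$horsepower)

## [1] -0.7784268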
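Converting a factor back to numbers needs some care, because as.numeric() applied to a factor returns the internal level codes, not the original values. A minimal sketch with made-up horsepower values:

```r
hp <- c(130, 165, 95, 95, 70)          # made-up horsepower values
hp_factor <- as.factor(hp)             # levels are "70", "95", "130", "165"

as.numeric(hp_factor)                  # level codes: 3 4 2 2 1
as.numeric(as.character(hp_factor))    # original values: 130 165 95 95 70
```

A correlation computed on the level codes behaves like a correlation with the ranks of the distinct values, so it can differ noticeably from the correlation with the values themselves.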

Exercise 10:

This exercise involves the Boston housing data set

Exercise 10a:

To begin, load in the Boston data set. The Boston data set is part of the MASS library in R.

library(MASS)
Boston$chas <- as.factor(Boston$chas)
nrow(Boston)

## [1] 506

ncol(Boston)

## [1] 14

Exercise 10b:

Make some pairwise scatterplots of the predictors (columns) in this data set. Describe your findings.

par(mfrow = c(2, 2))
plot(Boston$nox, Boston$crim)
plot(Boston$rm, Boston$crim)
plot(Boston$age, Boston$crim)
plot(Boston$dis, Boston$crim)

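In these plots, crime rates tend to be higher in suburbs with higher nox and a larger share of old housing (age), and lower in suburbs that are farther from the employment centres (larger dis). The relationship with rm is weaker.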
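Visual impressions like these can be cross-checked numerically. The sketch below computes the correlation of crim with every other column, using a fresh copy of the data from MASS so that all columns are still numeric. The object names boston and crim_cor are introduced here for illustration.

```r
library(MASS)                 # provides the Boston data set

boston <- MASS::Boston        # fresh copy with every column still numeric

# Correlation of the per capita crime rate with each other variable
crim_cor <- sapply(boston[, names(boston) != "crim"],
                   function(col) cor(boston$crim, col))
round(sort(crim_cor), 2)
```

Correlation only measures linear association, and crim is heavily skewed, so these numbers are best read alongside the plots and not as a replacement for them.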

Exercise 10c:

Are any of the predictors associated with per capita crime rate? If so, explain the relationship.

hist(Boston$crim, breaks = 50)

Most suburbs have a very low crime rate: only 18 of the 506 suburbs (under 4%) have crim > 20.

pairs(Boston[Boston$crim < 20, ])

There may be a relationship between crim and nox, rm, age, dis, lstat and medv.

Exercise 10d:

Do any of the suburbs of Boston appear to have particularly high crime rates? Tax rates? Pupil-teacher ratios? Comment on the range of each predictor.

hist(Boston$crim, breaks = 50)

nrow(Boston[Boston$crim > 20, ])

## [1] 18

hist(Boston$tax, breaks = 50)

nrow(Boston[Boston$tax == 666, ])

## [1] 132

hist(Boston$ptratio, breaks = 50)

nrow(Boston[Boston$ptratio > 20, ])

## [1] 201
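To comment on the ranges, the minimum and maximum of the three variables can be put side by side. This is a short sketch; the object name ranges is introduced here for illustration.

```r
boston <- MASS::Boston   # Boston data from the MASS package

# Row 1 is the minimum and row 2 the maximum of each variable
ranges <- sapply(boston[, c("crim", "tax", "ptratio")], range)
ranges
```

Comparing each maximum with the bulk of the corresponding histogram shows how far the extreme suburbs sit from the rest.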

Exercise 10e:

How many of the suburbs in this data set bound the Charles river?

nrow(Boston[Boston$chas == 1, ])

## [1] 35

Exercise 10f:

What is the median pupil-teacher ratio among the towns in this data set?

median(Boston$ptratio)

## [1] 19.05

Exercise 10g:

Which suburb of Boston has lowest median value of owner occupied homes? What are the values of the other predictors for that suburb, and how do those values compare to the overall ranges for those predictors? Comment on your findings.

row.names(Boston[which.min(Boston$medv), ])

## [1] "399"

low.med = Boston[order(Boston$medv),] #order in the ascending order
low.med[1,]

Suburb 399 has the lowest median value of owner-occupied homes in the data set (medv = 5, recorded in $1000s).

range(Boston$tax)

## [1] 187 711

Boston[which.min(Boston$medv), ]$tax

## [1] 666
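Note that min() returns the smallest value of medv, while which.min() or which() returns the row position where it occurs. The sketch below finds every row tied for the minimum and then, for the first of them, shows where each of its values falls among all suburbs. The percentile rank used here (the share of suburbs with a value less than or equal to it) is a choice made for this example.

```r
boston <- MASS::Boston   # fresh copy from the MASS package

# Row positions of all suburbs tied for the lowest medv
lowest <- which(boston$medv == min(boston$medv))
lowest

# For the first such suburb: share of suburbs with a value <= its value
pct_rank <- sapply(names(boston),
                   function(v) mean(boston[[v]] <= boston[lowest[1], v]))
round(pct_rank, 2)
```

Values near 1 mean the suburb is at the top of the range for that predictor, and values near 0 mean it is at the bottom.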

Exercise 10h:

In this data set, how many of the suburbs average more than seven rooms per dwelling? More than eight rooms per dwelling? Comment on the suburbs that average more than eight rooms per dwelling.

nrow(Boston[Boston$rm > 7, ])

## [1] 64

64 of the suburbs average more than seven rooms per dwelling.

nrow(Boston[Boston$rm > 8, ])

## [1] 13

13 of the suburbs average more than eight rooms per dwelling.
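To comment on the suburbs with more than eight rooms per dwelling, their typical values can be compared with those of all suburbs. This is a sketch; the choice of crim, lstat and medv, and the use of the median, are assumptions made for illustration.

```r
boston <- MASS::Boston             # fresh copy from the MASS package
big <- boston[boston$rm > 8, ]     # suburbs averaging more than eight rooms
nrow(big)

vars <- c("crim", "lstat", "medv")
comparison <- rbind(
  more_than_8_rooms = sapply(big[, vars], median),
  all_suburbs       = sapply(boston[, vars], median)
)
comparison
```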