2) Explain whether each scenario is a classification or regression problem, and indicate whether we are most interested in inference or prediction.
Finally, provide n and p. p: The number of predictors in a dataset. n: The number of samples in a dataset.
2a) We collect a set of data on the top 500 firms in the US. For each firm we record profit, number of employees, industry and the CEO salary. We are interested in understanding which factors affect CEO salary.
ANSWER 2a)
This is a regression problem, since the response (CEO salary) is numerical. The keyword is "understanding", which makes interpretability of the model a requirement, so we are most interested in inference. p=3 (profit, number of employees, industry)
n=500
2b) We are considering launching a new product and wish to know whether it will be a success or a failure. We collect data on 20 similar products that were previously launched. For each product we have recorded whether it was a success or failure, price charged for the product, marketing budget, competition price, and ten other variables.
ANSWER 2b)
This is a classification problem, since the response is categorical (success or failure). We only care whether the product will succeed, not why, so interpretability is not a requirement and we are most interested in prediction. p=13 (price, marketing budget, competition price, and the ten other variables)
n=20
2c) We are interested in predicting the % change in the USD/Euro exchange rate in relation to the weekly changes in the world stock markets. Hence we collect weekly data for all of 2012. For each week we record the % change in the USD/Euro, the % change in the US market, the % change in the British market, and the % change in the German market.
ANSWER 2c)
This is a regression problem, since the response (the % change in USD/Euro) is numerical. Interpretability of the model is not a requirement, so we are most interested in prediction. p=3 (the % changes in the US, British, and German markets)
n=52 (one observation per week for all of 2012)
ANSWER 5)
Advantages of less flexible (restrictive) approaches:
If we are mainly interested in inference, then restrictive (less flexible) models are much more interpretable.
Disadvantages of less flexible approaches:
An inflexible method such as linear regression can only generate linear functions and a small range of shapes to estimate f, so it may fit the true f poorly.
Advantages of more flexible approaches:
Flexible methods can fit a much wider range of possible shapes for f, which can give more accurate predictions when the true f is far from linear.
Disadvantages of more flexible approaches:
Flexible methods are less interpretable and, with limited data, more prone to overfitting.
Under what circumstances might a more flexible approach be preferred to a less flexible approach?
When the interpretability of the predictive model is simply not of interest and we are only interested in prediction. For instance, if we seek to develop an algorithm to predict the price of a stock, our sole requirement for the algorithm is that it predict accurately—interpretability is not a concern.
When might a less flexible approach be preferred?
When inference or interpretability is a must, we choose a more restrictive (less flexible) model.
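As an illustration (a minimal sketch on simulated data, not part of the assignment), the trade-off can be seen by varying model flexibility directly:
set.seed(1)
x <- runif(100)
y <- sin(2 * pi * x) + rnorm(100, sd = 0.3)
fit_rigid <- lm(y ~ x)            # inflexible: fits only a straight line; a single slope to interpret
fit_flex <- lm(y ~ poly(x, 10))   # flexible: a degree-10 polynomial bends to the data, but its coefficients have no direct interpretation
The rigid fit is easy to explain but misses the curvature; the flexible fit tracks the curve at the cost of interpretability.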
ANSWER 6)
Parametric:
Parametric methods involve a two-step model-based approach. First, we make an assumption about the functional form of f; one very simple assumption is that f is linear in X. Second, after a model has been selected, we need a procedure that uses the training data to fit or train the model.
Non-parametric:
Non-parametric methods do not make explicit assumptions about the functional form of f. Instead they seek an estimate of f that gets as close to the data points as possible without being too rough or wiggly.
What are the advantages of a parametric approach to regression or classification (as opposed to a nonparametric approach)?
Parametric Advantages:
Assuming a parametric form for f simplifies the problem of estimating f because it is generally much easier to estimate a set of parameters, such as β0, β1, ..., βp in the linear model (2.4), than it is to fit an entirely arbitrary function f.
Non-parametric Advantages:
Such approaches can have a major advantage over parametric approaches: by avoiding the assumption of a particular functional form for f, they have the potential to accurately fit a wider range of possible shapes for f.
What are its disadvantages?
Parametric:
The potential disadvantage of a parametric approach is that the model we choose will usually not match the true unknown form of f. If the chosen model is too far from the true f, then our estimate will be poor.
Non-Parametric:
But non-parametric approaches do suffer from a major disadvantage: since they do not reduce the problem of estimating f to a small number of parameters, a very large number of observations (far more than is typically needed for a parametric approach) is required in order to obtain an accurate estimate for f.
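A minimal R sketch of the contrast (simulated data; loess() is used here as a representative non-parametric smoother):
set.seed(2)
x <- runif(200)
y <- x^2 + rnorm(200, sd = 0.1)
para <- lm(y ~ x)        # parametric: assume f is linear in x, then estimate just two coefficients from the training data
nonpara <- loess(y ~ x)  # non-parametric: no functional form assumed; the fit follows the data, so more observations are needed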
8a) Use the read.csv() function to read the data into R. Call the loaded data college. Make sure that you have the directory set to the correct location for the data.
library(ISLR)
#setwd("C:\Users\user\Desktop\Summer 2021\Data Analytics Algorithms II\Homework\Week 1")
college=read.csv("college.csv", header=TRUE) #Exercise 8a
(8b) Look at the data using the fix() function. You should notice that the first column is just the name of each university. We don’t really want R to treat this as data. However, it may be handy to have these names for later. Try the following commands:
rownames(college)=college[,1]
fix(college) #Exercise 8b
(8c) i. Use the summary() function to produce a numerical summary of the variables in the data set.
(8c)ii. Use the pairs() function to produce a scatterplot matrix of the first ten columns or variables of the data. Recall that you can reference the first ten columns of a matrix A using A[,1:10].
(8c)iii. Use the plot() function to produce side-by-side boxplots of Outstate versus Private.
(8c)iv. Create a new qualitative variable, called Elite, by binning the Top10perc variable. We are going to divide universities into two groups based on whether or not the proportion of students coming from the top 10% of their high school classes exceeds 50%.
#> Elite=rep("No", nrow(college))
#> Elite[college$Top10perc > 50]="Yes"
#> Elite=as.factor(Elite)
#> college=data.frame(college, Elite)
Use the summary() function to see how many elite universities there are. Now use the plot() function to produce side-by-side boxplots of Outstate versus Elite.
(8c)v. Use the hist() function to produce some histograms with differing numbers of bins for a few of the quantitative variables.
You may find the command par(mfrow=c(2,2)) useful: it will divide the print window into four regions so that four plots can be made simultaneously. Modifying the arguments to this function will divide the screen in other ways.
(8c)vi. Continue exploring the data, and provide a brief summary of what you discover.
ANSWER 8c(i-vi):
summary(college) #Exercise 8c-i
## X Private Apps Accept
## Length:777 Length:777 Min. : 81 Min. : 72
## Class :character Class :character 1st Qu.: 776 1st Qu.: 604
## Mode :character Mode :character Median : 1558 Median : 1110
## Mean : 3002 Mean : 2019
## 3rd Qu.: 3624 3rd Qu.: 2424
## Max. :48094 Max. :26330
## Enroll Top10perc Top25perc F.Undergrad
## Min. : 35 Min. : 1.00 Min. : 9.0 Min. : 139
## 1st Qu.: 242 1st Qu.:15.00 1st Qu.: 41.0 1st Qu.: 992
## Median : 434 Median :23.00 Median : 54.0 Median : 1707
## Mean : 780 Mean :27.56 Mean : 55.8 Mean : 3700
## 3rd Qu.: 902 3rd Qu.:35.00 3rd Qu.: 69.0 3rd Qu.: 4005
## Max. :6392 Max. :96.00 Max. :100.0 Max. :31643
## P.Undergrad Outstate Room.Board Books
## Min. : 1.0 Min. : 2340 Min. :1780 Min. : 96.0
## 1st Qu.: 95.0 1st Qu.: 7320 1st Qu.:3597 1st Qu.: 470.0
## Median : 353.0 Median : 9990 Median :4200 Median : 500.0
## Mean : 855.3 Mean :10441 Mean :4358 Mean : 549.4
## 3rd Qu.: 967.0 3rd Qu.:12925 3rd Qu.:5050 3rd Qu.: 600.0
## Max. :21836.0 Max. :21700 Max. :8124 Max. :2340.0
## Personal PhD Terminal S.F.Ratio
## Min. : 250 Min. : 8.00 Min. : 24.0 Min. : 2.50
## 1st Qu.: 850 1st Qu.: 62.00 1st Qu.: 71.0 1st Qu.:11.50
## Median :1200 Median : 75.00 Median : 82.0 Median :13.60
## Mean :1341 Mean : 72.66 Mean : 79.7 Mean :14.09
## 3rd Qu.:1700 3rd Qu.: 85.00 3rd Qu.: 92.0 3rd Qu.:16.50
## Max. :6800 Max. :103.00 Max. :100.0 Max. :39.80
## perc.alumni Expend Grad.Rate
## Min. : 0.00 Min. : 3186 Min. : 10.00
## 1st Qu.:13.00 1st Qu.: 6751 1st Qu.: 53.00
## Median :21.00 Median : 8377 Median : 65.00
## Mean :22.74 Mean : 9660 Mean : 65.46
## 3rd Qu.:31.00 3rd Qu.:10830 3rd Qu.: 78.00
## Max. :64.00 Max. :56233 Max. :118.00
#college[,1:10] ## commented out: the printed output is too long
pairs(College[,1:10]) #Exercise 8c-ii (uses the ISLR College data set, where Private is already a factor; the college read in from CSV keeps it as character)
plot(College$Private, College$Outstate, xlab = "Private", xlim = c(0,2.5), ylab = "Outstate", main = "Outstate vs Private") #Exercise 8c-iii
Elite=rep("No",nrow(college)) #Exercise 8c-iv
Elite[college$Top10perc >50]="Yes"
Elite=as.factor(Elite)
college=data.frame(college ,Elite)
summary(college$Elite)
## No Yes
## 699 78
plot(college$Elite, college$Outstate, xlab = "Elite", xlim = c(0,2.5), ylab = "Outstate", main = "Outstate vs Elite")
par(mfrow = c(2,2)) #Exercise 8c-v
hist(college$Enroll, col=10, xlab = "Enroll", ylab = "Count")
hist(college$Top10perc, col = 10, xlab = "Top10", ylab = "Count")
hist(college$Personal, col = 5, xlab = "Personal", ylab = "Count")
hist(college$Grad.Rate, col = 5, xlab = "Graduation Rate", ylab = "Count")
#Exercise 8c-vi
summary(college$Enroll)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 35 242 434 780 902 6392
#The mean enrollment is 780 students.
summary(college$Outstate)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2340 7320 9990 10441 12925 21700
#Outstate is out-of-state tuition, not a student count: the mean out-of-state tuition is $10,441.
summary(college$Grad.Rate)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 10.00 53.00 65.00 65.46 78.00 118.00
#The mean graduation rate is 65.46%. Note the maximum of 118%, which is impossible for a percentage and points to a data-entry error.
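To isolate the record behind the impossible rate, one could filter on it (a sketch; since the row names were set to the school names in 8b, the output identifies the school):
subset(college, Grad.Rate > 100, select = Grad.Rate)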
(9a) Which of the predictors are quantitative, and which are qualitative?
Auto = read.csv("Auto.csv", header=TRUE, na.strings = "?")
Auto = na.omit(Auto)
str(Auto) #Exercise 9A
## 'data.frame': 392 obs. of 9 variables:
## $ mpg : num 18 15 18 16 17 15 14 14 14 15 ...
## $ cylinders : int 8 8 8 8 8 8 8 8 8 8 ...
## $ displacement: num 307 350 318 304 302 429 454 440 455 390 ...
## $ horsepower : int 130 165 150 150 140 198 220 215 225 190 ...
## $ weight : int 3504 3693 3436 3433 3449 4341 4354 4312 4425 3850 ...
## $ acceleration: num 12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
## $ year : int 70 70 70 70 70 70 70 70 70 70 ...
## $ origin : int 1 1 1 1 1 1 1 1 1 1 ...
## $ name : chr "chevrolet chevelle malibu" "buick skylark 320" "plymouth satellite" "amc rebel sst" ...
## - attr(*, "na.action")= 'omit' Named int [1:5] 33 127 331 337 355
## ..- attr(*, "names")= chr [1:5] "33" "127" "331" "337" ...
ANSWER 9a)
As shown above, all of the variables are quantitative except "name". The "?" values in "horsepower" were read as NA (via na.strings = "?") and removed with na.omit(), so horsepower is numeric; "origin" and "cylinders" are stored as integers but are arguably qualitative codes.
(9b) What is the range of each quantitative predictor? You can answer this using the range() function.
ANSWER 9B)
df = subset(Auto, select = -c(horsepower, name))
#df
summary(df) # the Min. and Max. rows give the range of each predictor
## mpg cylinders displacement weight acceleration
## Min. : 9.00 Min. :3.000 Min. : 68.0 Min. :1613 Min. : 8.00
## 1st Qu.:17.00 1st Qu.:4.000 1st Qu.:105.0 1st Qu.:2225 1st Qu.:13.78
## Median :22.75 Median :4.000 Median :151.0 Median :2804 Median :15.50
## Mean :23.45 Mean :5.472 Mean :194.4 Mean :2978 Mean :15.54
## 3rd Qu.:29.00 3rd Qu.:8.000 3rd Qu.:275.8 3rd Qu.:3615 3rd Qu.:17.02
## Max. :46.60 Max. :8.000 Max. :455.0 Max. :5140 Max. :24.80
## year origin
## Min. :70.00 Min. :1.000
## 1st Qu.:73.00 1st Qu.:1.000
## Median :76.00 Median :1.000
## Mean :75.98 Mean :1.577
## 3rd Qu.:79.00 3rd Qu.:2.000
## Max. :82.00 Max. :3.000
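The range() function mentioned in the question can also be applied to every column at once (a sketch using the df defined above):
sapply(df, range) # row 1 = minimum, row 2 = maximum for each quantitative predictor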
(9c) What is the mean and standard deviation of each quantitative predictor?
ANSWER 9C)
sapply(df,mean)
## mpg cylinders displacement weight acceleration year
## 23.445918 5.471939 194.411990 2977.584184 15.541327 75.979592
## origin
## 1.576531
sapply(df, sd)
## mpg cylinders displacement weight acceleration year
## 7.8050075 1.7057832 104.6440039 849.4025600 2.7588641 3.6837365
## origin
## 0.8055182
(9d) Now remove the 10th through 85th observations. What is the range, mean, and standard deviation of each predictor in the subset of the data that remains?
ANSWER 9D)
df2 <- df[-(10:85), ] # the comma matters: this drops rows 10 through 85. Without it, df[-c(10:85)] drops (nonexistent) columns 10-85 and returns the data unchanged, which is what produced the full-data summaries below.
#df2 # commented out: the remaining data frame is too long to print
sapply(df2, range)
## mpg cylinders displacement weight acceleration year origin
## [1,] 9.0 3 68 1613 8.0 70 1
## [2,] 46.6 8 455 5140 24.8 82 3
sapply(df2, mean)
## mpg cylinders displacement weight acceleration year
## 23.445918 5.471939 194.411990 2977.584184 15.541327 75.979592
## origin
## 1.576531
sapply(df2, sd)
## mpg cylinders displacement weight acceleration year
## 7.8050075 1.7057832 104.6440039 849.4025600 2.7588641 3.6837365
## origin
## 0.8055182
(9e) Using the full data set, investigate the predictors graphically, using scatterplots or other tools of your choice. Create some plots highlighting the relationships among the predictors. Comment on your findings.
ANSWER 9E)
Auto$cylinders <- as.factor(Auto$cylinders)
Auto$year <- as.factor(Auto$year)
Auto$origin <- as.factor(Auto$origin)
Auto$name <- as.factor(Auto$name)
pairs(Auto[,1:9])
There are strong positive, roughly linear relationships between displacement and horsepower, and between weight and horsepower.
(9f) Suppose that we wish to predict gas mileage (mpg) on the basis of the other variables. Do your plots suggest that any of the other variables might be useful in predicting mpg? Justify your answer.
ANSWER 9F)
Auto$horsepower = as.numeric(Auto$horsepower)
cor(Auto$weight, Auto$horsepower)
## [1] 0.8645377
cor(Auto$displacement, Auto$horsepower)
## [1] 0.897257
Yes. Based on the plots above, cylinders, year, origin, and horsepower all appear useful for predicting mpg. Displacement and weight are highly correlated with horsepower (r ≈ 0.90 and 0.86, respectively), so these two are omitted to avoid redundancy.
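A correlation matrix of the continuous columns makes the same point compactly (a sketch; cylinders, year, and origin were converted to factors above, so only the continuous variables are included):
round(cor(Auto[, c("mpg", "displacement", "horsepower", "weight", "acceleration")]), 2)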
(10a) To begin, load in the Boston data set. The Boston data set is part of the MASS library in R.
Read about the data set:
ANSWER 10A)
library("MASS")
#Now the data set is contained in the object Boston.
#Read about the data set:
# Boston # commented out: the full data set is too long to print
#How many rows are in this data set?
nrow(Boston)
## [1] 506
# How many columns?
ncol(Boston)
## [1] 14
# What do the rows and columns represent?
#Each row represents one of 506 suburbs (towns) of Boston; each column is one of 14 variables recorded for each suburb (per-capita crime rate, tax rate, average number of rooms, median home value, and so on).
(10b) Make some pairwise scatterplots of the predictors (columns) in this data set. Describe your findings.
par(mfrow = c(2, 2))
plot(Boston$age, Boston$crim)
plot(Boston$dis, Boston$crim)
ANSWER 10B) There appears to be a positive association between crim and age, and a negative (inverse) association between crim and dis.
(10c) Are any of the predictors associated with per capita crime rate? If so, explain the relationship.
ANSWER 10C)
hist(Boston$crim,col=30, breaks = 25)
#Yes. The distribution of crim is heavily right-skewed: the bulk of suburbs have crime rates between 0 and 25, so low crime rates dominate the suburbs.
pairs(Boston[Boston$crim < 25, ])
#head(Boston)
# There is a positive relationship between crim and age.
# There is a negative (inverse) relationship between crim and dis.
# There is a negative (inverse) relationship between crim and medv.
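The same associations can be quantified directly (a sketch; all 14 columns of Boston are numeric, so cor() applies to the whole data frame):
sort(round(cor(Boston)[, "crim"], 2), decreasing = TRUE)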
(10d) Do any of the suburbs of Boston appear to have particularly high crime rates? Tax rates? Pupil-teacher ratios? Comment on the range of each predictor.
ANSWER 10D)
hist(Boston$crim,col=30, breaks = 25)
nrow(Boston[Boston$crim > 20, ])
## [1] 18
#Tax rates?
hist(Boston$tax,col=30, breaks = 25)
nrow(Boston[Boston$tax == 666, ])
## [1] 132
#Pupil-teacher ratios?
hist(Boston$ptratio, col=30, breaks = 25)
nrow(Boston[Boston$ptratio > 20, ])
## [1] 201
#Comment on the range of each predictor.
#crim spans roughly 0 to 89, but the histogram mass is concentrated near 0; only 18 suburbs exceed 20.
#tax spans 187 to 711 and is bimodal: a large cluster of 132 suburbs sits at exactly 666.
#ptratio spans roughly 12.6 to 22, with a large group (201 suburbs) above 20.
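The exact ranges can be read off directly rather than from the histograms (a sketch):
sapply(Boston[, c("crim", "tax", "ptratio")], range) # row 1 = min, row 2 = max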
(10e) How many of the suburbs in this data set bound the Charles river?
ANSWER 10E)
nrow(Boston[Boston$chas == 1, ])
## [1] 35
(10f) What is the median pupil-teacher ratio among the towns in this data set?
ANSWER 10F)
median(Boston$ptratio)
## [1] 19.05
(10g) Which suburb of Boston has lowest median value of owner-occupied homes? What are the values of the other predictors for that suburb, and how do those values compare to the overall ranges for those predictors? Comment on your findings.
ANSWER 10G)
row.names(Boston[which.min(Boston$medv), ]) # which.min() gives the row index of the smallest medv; min() would return the value 5 and wrongly index row 5
## [1] "399"
range(Boston$tax)
## [1] 187 711
Boston[which.min(Boston$medv), ]$tax
## [1] 666
Boston[which.min(Boston$medv), ]$age
## [1] 100
range(Boston$age)
## [1] 2.9 100.0
Boston[which.min(Boston$medv), ]$rm
## [1] 5.453
range(Boston$rm)
## [1] 3.561 8.780
#The suburb with the lowest medv (row 399) has tax = 666. The tax predictor ranges from 187 to 711,
#so this suburb sits at the very top of the tax scale.
#Its age value is 100, the maximum of the 2.9-100 range: all of its owner-occupied homes predate 1940.
#Its rm value is 5.453, on the low side of the 3.561-8.780 range.
#In short, the cheapest suburb combines a high property-tax rate, uniformly old housing stock, and relatively small dwellings.
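All of the suburb's predictors can be compared against the data set's extremes in one table (a sketch extending the same idea to every column):
rbind(suburb = unlist(Boston[which.min(Boston$medv), ]),
      min = sapply(Boston, min),
      max = sapply(Boston, max))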
(10h) In this data set, how many of the suburbs average more than seven rooms per dwelling? More than eight rooms per dwelling?Comment on the suburbs that average more than eight rooms per dwelling.
ANSWER 10H)
nrow(Boston[Boston$rm > 7, ])
## [1] 64
#More than eight rooms per dwelling?
nrow(Boston[Boston$rm > 8, ])
## [1] 13
#Comment on the suburbs that average more than eight rooms per dwelling.
hist(Boston$rm, col=30, breaks = 25)
#Only 13 suburbs average more than eight rooms per dwelling; they form the thin right tail of the histogram, and the largest average is 8.78 rooms.
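To characterize these 13 suburbs further, their column means can be compared with the overall means (a sketch; one would expect, for example, higher medv and lower lstat in the rm > 8 group):
round(rbind(all = colMeans(Boston), rm_gt_8 = colMeans(Boston[Boston$rm > 8, ])), 2)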