Assignment #1

Chapter 02 (page 54): 2, 5, 6, 8-10

Question 2

Explain whether each scenario is a classification or regression problem, and indicate whether we are most interested in inference or prediction. Finally, provide n and p.

We collect a set of data on the top 500 firms in the US. For each firm we record profit, number of employees, industry and the CEO salary. We are interested in understanding which factors affect CEO salary.

*** A : Regression and inference n=500 and p=3 ***

We are considering launching a new product and wish to know whether it will be a success or a failure. We collect data on 20 similar products that were previously launched. For each product we have recorded whether it was a success or failure, price charged for the product, marketing budget, competition price, and ten other variables.

*** A: Classification and prediction n=20 and p=13 ***

We are interested in predicting the % change in the USD/Euro exchange rate in relation to the weekly changes in the world stock markets. Hence we collect weekly data for all of 2012. For each week we record the % change in the USD/Euro, the % change in the US market, the % change in the British market, and the % change in the German market.

*** A: Regression and prediction n=52 and p=3 ***

Question 5

What are the advantages and disadvantages of a very flexible (versus a less flexible) approach for regression or classification? Under what circumstances might a more flexible approach be preferred to a less flexible approach? When might a less flexible approach be preferred?

*** Flexible approach advantages: lower bias and a better fit when the true relationship is non-linear. Flexible approach disadvantages: more parameters must be estimated, and the fit can follow the noise too closely (overfitting), which increases the variance. ***

*** A more flexible approach is preferred when we are interested in prediction rather than interpretability, particularly when the sample is large and the true relationship is non-linear. ***

*** A less flexible approach is preferred when we are interested in inference and interpretability, or when the sample is small or the noise is high, since a flexible fit would then tend to overfit. ***
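
The trade-off described above can be illustrated with a small simulation. This is only a sketch on made-up data: the true function sin(2x), the noise level, the sample sizes, and the polynomial degree of 10 are all assumptions chosen for illustration and are not part of the exercise.

```r
set.seed(1)

# Hypothetical true function and simulated training / test data
f <- function(x) sin(2 * x)
train <- data.frame(x = runif(50, 0, 3))
train$y <- f(train$x) + rnorm(50, sd = 0.3)
test <- data.frame(x = runif(500, 0, 3))
test$y <- f(test$x) + rnorm(500, sd = 0.3)

# Mean squared error of a fitted model on a given data set
mse <- function(fit, data) mean((data$y - predict(fit, newdata = data))^2)

rigid <- lm(y ~ x, data = train)               # less flexible: straight line
flexible <- lm(y ~ poly(x, 10), data = train)  # more flexible: degree-10 polynomial

results <- data.frame(
  model = c("linear", "degree-10 polynomial"),
  train_mse = c(mse(rigid, train), mse(flexible, train)),
  test_mse = c(mse(rigid, test), mse(flexible, test))
)
print(results)
```

The flexible fit never has a larger training MSE than the linear fit, because the linear model is nested within it. Whether its test MSE is also better depends on the sample size, the noise level, and how non-linear the true function is.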

Question 6

Describe the differences between a parametric and a non-parametric statistical learning approach. What are the advantages of a parametric approach to regression or classification (as opposed to a nonparametric approach)? What are its disadvantages?

A parametric approach assumes a functional form for f, which reduces the problem of estimating f to one of estimating a set of parameters.

A non-parametric approach does not assume a particular form for f, so it can fit a wider range of shapes but requires a very large number of observations to estimate f accurately.

Advantages: - Modeling f reduces to estimating a few parameters, so fewer observations are needed

Disadvantages: - The estimate of f is poor if the assumed form is wrong - Choosing a more flexible form to compensate can lead to overfitting the observations
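
As a rough illustration of the difference, the sketch below fits a parametric model (lm() with an assumed straight-line form) and a non-parametric one (loess() from base R) to the same simulated data. The quadratic truth and noise level are assumptions made up for this example.

```r
set.seed(2)

# Simulated data; the quadratic truth is an assumption for illustration
x <- seq(0, 10, length.out = 100)
y <- 2 + 0.5 * x^2 + rnorm(100, sd = 3)
sim <- data.frame(x = x, y = y)

# Parametric: assume f(x) = b0 + b1 * x, so only two numbers are estimated
param_fit <- lm(y ~ x, data = sim)
print(coef(param_fit))

# Non-parametric: loess() assumes no global form and fits local regressions
nonparam_fit <- loess(y ~ x, data = sim)

# In-sample error: the assumed linear form is wrong here, so it fits worse
param_mse <- mean(residuals(param_fit)^2)
nonparam_mse <- mean(residuals(nonparam_fit)^2)
print(c(parametric = param_mse, nonparametric = nonparam_mse))
```

The parametric fit is summarised by two coefficients, which is what makes it simple and interpretable, but it pays for the wrong assumed form with a larger error.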

Question 8

Use the read.csv() function to read the data into R. Call the loaded data college. Make sure that you have the directory set to the correct location for the data.

library(ISLR)

## Warning: package 'ISLR' was built under R version 4.0.5

setwd("C:/Users/brend/OneDrive/Desktop/UTSA/Summer 2021/Algo 2/Assignment 1")

college <- read.csv("College.csv")

head(college[, 1:5])

##                              X Private Apps Accept Enroll
## 1 Abilene Christian University     Yes 1660   1232    721
## 2           Adelphi University     Yes 2186   1924    512
## 3               Adrian College     Yes 1428   1097    336
## 4          Agnes Scott College     Yes  417    349    137
## 5    Alaska Pacific University     Yes  193    146     55
## 6            Albertson College     Yes  587    479    158

Look at the data using the fix() function. You should notice that the first column is just the name of each university. We don’t really want R to treat this as data. However, it may be handy to have these names for later. Try the following commands:

fix(college)

You should see that there is now a row.names column with the name of each university recorded. This means that R has given each row a name corresponding to the appropriate university. R will not try to perform calculations on the row names. However, we still need to eliminate the first column in the data where the names are stored.

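# Store the university names in a separate vector called rownames
# (the row names of the data frame itself stay 1, 2, ...)
rownames = college[, 1]
fix(college)
# Drop the first column, which holds the names
college = college[, -1]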
head(college[, 1:5])

##   Private Apps Accept Enroll Top10perc
## 1     Yes 1660   1232    721        23
## 2     Yes 2186   1924    512        16
## 3     Yes 1428   1097    336        22
## 4     Yes  417    349    137        60
## 5     Yes  193    146     55        16
## 6     Yes  587    479    158        38
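
For comparison, here is a minimal sketch of attaching the names as actual row names of the data frame. It runs on a tiny made-up CSV (the school names and values are hypothetical), so it does not depend on College.csv being present.

```r
# Tiny made-up CSV standing in for College.csv (names and values are hypothetical)
csv_text <- "X,Private,Apps
A College,Yes,100
B University,No,250"

# Option 1: assign the first column as row names, then drop it
toy <- read.csv(text = csv_text)
rownames(toy) <- toy[, 1]
toy <- toy[, -1]
print(toy)

# Option 2: let read.csv() do the same in one step
toy2 <- read.csv(text = csv_text, row.names = 1)
print(toy2)
```

With either option, R keeps the names as labels and does not try to perform calculations on them.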

1. Use the summary() function to produce a numerical summary of the variables in the data set.

summary(college)

##    Private               Apps           Accept          Enroll    
##  Length:777         Min.   :   81   Min.   :   72   Min.   :  35  
##  Class :character   1st Qu.:  776   1st Qu.:  604   1st Qu.: 242  
##  Mode  :character   Median : 1558   Median : 1110   Median : 434  
##                     Mean   : 3002   Mean   : 2019   Mean   : 780  
##                     3rd Qu.: 3624   3rd Qu.: 2424   3rd Qu.: 902  
##                     Max.   :48094   Max.   :26330   Max.   :6392  
##    Top10perc       Top25perc      F.Undergrad     P.Undergrad     
##  Min.   : 1.00   Min.   :  9.0   Min.   :  139   Min.   :    1.0  
##  1st Qu.:15.00   1st Qu.: 41.0   1st Qu.:  992   1st Qu.:   95.0  
##  Median :23.00   Median : 54.0   Median : 1707   Median :  353.0  
##  Mean   :27.56   Mean   : 55.8   Mean   : 3700   Mean   :  855.3  
##  3rd Qu.:35.00   3rd Qu.: 69.0   3rd Qu.: 4005   3rd Qu.:  967.0  
##  Max.   :96.00   Max.   :100.0   Max.   :31643   Max.   :21836.0  
##     Outstate       Room.Board       Books           Personal   
##  Min.   : 2340   Min.   :1780   Min.   :  96.0   Min.   : 250  
##  1st Qu.: 7320   1st Qu.:3597   1st Qu.: 470.0   1st Qu.: 850  
##  Median : 9990   Median :4200   Median : 500.0   Median :1200  
##  Mean   :10441   Mean   :4358   Mean   : 549.4   Mean   :1341  
##  3rd Qu.:12925   3rd Qu.:5050   3rd Qu.: 600.0   3rd Qu.:1700  
##  Max.   :21700   Max.   :8124   Max.   :2340.0   Max.   :6800  
##       PhD            Terminal       S.F.Ratio      perc.alumni   
##  Min.   :  8.00   Min.   : 24.0   Min.   : 2.50   Min.   : 0.00  
##  1st Qu.: 62.00   1st Qu.: 71.0   1st Qu.:11.50   1st Qu.:13.00  
##  Median : 75.00   Median : 82.0   Median :13.60   Median :21.00  
##  Mean   : 72.66   Mean   : 79.7   Mean   :14.09   Mean   :22.74  
##  3rd Qu.: 85.00   3rd Qu.: 92.0   3rd Qu.:16.50   3rd Qu.:31.00  
##  Max.   :103.00   Max.   :100.0   Max.   :39.80   Max.   :64.00  
##      Expend        Grad.Rate     
##  Min.   : 3186   Min.   : 10.00  
##  1st Qu.: 6751   1st Qu.: 53.00  
##  Median : 8377   Median : 65.00  
##  Mean   : 9660   Mean   : 65.46  
##  3rd Qu.:10830   3rd Qu.: 78.00  
##  Max.   :56233   Max.   :118.00

2. Use the pairs() function to produce a scatterplot matrix of the first ten columns or variables of the data. Recall that you can reference the first ten columns of a matrix A using A[,1:10].

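# Use the loaded college data frame rather than ISLR's built-in College.
# read.csv() in R >= 4.0 reads Private as character, so convert it to a factor first.
college$Private <- as.factor(college$Private)
pairs(college[, 1:10])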

3. Use the plot() function to produce side-by-side boxplots of Outstate versus Private.

plot(as.factor(college$Private), college$Outstate, xlab = "Private", xlim = c(0, 2.5), ylab = "Outstate", main = "Outstate vs Private")

4. Create a new qualitative variable, called Elite, by binning the Top10perc variable. We are going to divide universities into two groups based on whether or not the proportion of students coming from the top 10% of their high school classes exceeds 50%.

Elite=rep("No",nrow(college))
Elite[college$Top10perc >50]="Yes"
Elite=as.factor(Elite)
college=data.frame(college ,Elite)

summary(college$Elite)

##  No Yes 
## 699  78

plot(college$Elite, college$Outstate, xlab = "Elite", xlim = c(0, 2.5), ylab = "Outstate", main = "Outstate vs Elite")
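
The binning step can also be written as a single vectorised expression. The sketch below uses a few made-up Top10perc values (hypothetical, not taken from the data set) to show the idea.

```r
# Made-up Top10perc values (hypothetical, for illustration only)
top10perc <- c(23, 16, 60, 50, 89)

# Vectorised equivalent of the rep() / index-assignment approach
elite <- factor(ifelse(top10perc > 50, "Yes", "No"), levels = c("No", "Yes"))
print(table(elite))
```

Because the comparison is strict, a school with exactly 50% falls in the "No" group, which matches the wording "exceeds 50%".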

5. Use the hist() function to produce some histograms with differing numbers of bins for a few of the quantitative variables. You may find the command par(mfrow=c(2,2)) useful: it will divide the print window into four regions so that four plots can be made simultaneously. Modifying the arguments to this function will divide the screen in other ways.

par(mfrow = c(2,2)) 

hist(college$Enroll, col = 10, xlab = "Enroll", ylab = "Count")
hist(college$Top10perc, col = 10, xlab = "Top10perc", ylab = "Count")
hist(college$Personal, col = 5, xlab = "Personal", ylab = "Count")
hist(college$Grad.Rate, col = 5, xlab = "Grad.Rate", ylab = "Count")
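
Since the exercise asks for differing numbers of bins, here is a minimal sketch of how the breaks argument of hist() can be varied. It uses simulated values as a stand-in for a column such as college$Enroll, so the data and the chosen bin counts are assumptions for illustration.

```r
set.seed(3)

# Simulated stand-in for a quantitative column such as college$Enroll
enroll <- rexp(777, rate = 1 / 780)

par(mfrow = c(2, 2))
for (b in c(5, 10, 25, 50)) {
  # breaks is a suggestion; hist() may pick a nearby "pretty" number of bins
  hist(enroll, breaks = b, col = 10, xlab = "Enroll", ylab = "Count",
       main = paste("breaks =", b))
}
par(mfrow = c(1, 1))

# The bin counts can also be inspected without drawing anything
h <- hist(enroll, breaks = 25, plot = FALSE)
print(length(h$counts))
```

Fewer bins give a smoother overall shape, while more bins reveal finer detail at the cost of a noisier picture.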

6. Continue exploring the data, and provide a brief summary of what you discover.

# summary(college$Room.Board)
summary(college$Grad.Rate)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   10.00   53.00   65.00   65.46   78.00  118.00

gradrate <- college[college$Grad.Rate > 99,]
nrow(gradrate)

## [1] 11

rownames[as.numeric(rownames(gradrate))]

##  [1] "Amherst College"                 "Cazenovia College"              
##  [3] "College of Mount St. Joseph"     "Grove City College"             
##  [5] "Harvard University"              "Harvey Mudd College"            
##  [7] "Lindenwood College"              "Missouri Southern State College"
##  [9] "Santa Clara University"          "Siena College"                  
## [11] "University of Richmond"

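I thought it was interesting that these 11 schools report graduation rates above 99%. The maximum of 118% is impossible for a rate, so at least one of these values is likely a data-entry error. I am curious to see how their curricula are structured.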

Question 9

This exercise involves the Auto data set studied in the lab. Make sure that the missing values have been removed from the data.

Which of the predictors are quantitative, and which are qualitative?

data("Auto")


summary(complete.cases(Auto))

##    Mode    TRUE 
## logical     392

sapply(Auto, class)

##          mpg    cylinders displacement   horsepower       weight acceleration 
##    "numeric"    "numeric"    "numeric"    "numeric"    "numeric"    "numeric" 
##         year       origin         name 
##    "numeric"    "numeric"     "factor"

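All variables except name are stored as numeric, so name is the only one R treats as qualitative. Note that origin is a numeric code for the region of origin (1 = American, 2 = European, 3 = Japanese), so it is arguably qualitative as well.

What is the range of each quantitative predictor? You can answer this using the range() function.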

quant <- sapply(Auto, is.numeric)
quant

##          mpg    cylinders displacement   horsepower       weight acceleration 
##         TRUE         TRUE         TRUE         TRUE         TRUE         TRUE 
##         year       origin         name 
##         TRUE         TRUE        FALSE

sapply(Auto[, quant], range)

##       mpg cylinders displacement horsepower weight acceleration year origin
## [1,]  9.0         3           68         46   1613          8.0   70      1
## [2,] 46.6         8          455        230   5140         24.8   82      3

Range: mpg: 9 - 46.6; cylinders: 3 - 8; displacement: 68 - 455; horsepower: 46 - 230; weight: 1613 - 5140; acceleration: 8.0 - 24.8; year: 70 - 82; origin: 1 - 3

What is the mean and standard deviation of each quantitative predictor?

sapply(Auto[, quant], function(x) signif(c(mean(x), sd(x)), 2))

##       mpg cylinders displacement horsepower weight acceleration year origin
## [1,] 23.0       5.5          190        100   3000         16.0 76.0   1.60
## [2,]  7.8       1.7          100         38    850          2.8  3.7   0.81

Row 1 is the mean; row 2 is the standard deviation (both rounded to two significant digits).

Now remove the 10th through 85th observations. What is the range, mean, and standard deviation of each predictor in the subset of the data that remains?

delete_obs <- sapply(Auto[-10:-85, quant], function(x) round(c(range(x), mean(x), sd(x)), 2))
rownames(delete_obs) <- c("min", "max", "mean", "sd")
delete_obs

##        mpg cylinders displacement horsepower  weight acceleration  year origin
## min  11.00      3.00        68.00      46.00 1649.00         8.50 70.00   1.00
## max  46.60      8.00       455.00     230.00 4997.00        24.80 82.00   3.00
## mean 24.40      5.37       187.24     100.72 2935.97        15.73 77.15   1.60
## sd    7.87      1.65        99.68      35.71  811.30         2.69  3.11   0.82
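
The index -10:-85 used above may look unusual, so here is a short sketch of how it is interpreted. The stand-in data frame is hypothetical; it only has the same number of rows (392) as Auto.

```r
# -10:-85 is parsed as (-10):(-85), the same index vector as -(10:85)
print(identical(-10:-85, -(10:85)))

# Stand-in data frame with 392 rows, the size of Auto after removing NAs
stand_in <- data.frame(id = 1:392)
remaining <- stand_in[-(10:85), , drop = FALSE]
print(nrow(remaining))
```

Rows 10 through 85 are 76 observations, so 392 - 76 = 316 rows remain in the subset.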

Using the full data set, investigate the predictors graphically, using scatterplots or other tools of your choice. Create some plots highlighting the relationships among the predictors. Comment on your findings.

pairs(Auto)

In the pairs plot, horsepower and weight have a strongly linear relationship. Horsepower and displacement are also linearly related, and so are displacement and weight, though with somewhat more scatter.

Suppose that we wish to predict gas mileage (mpg) on the basis of the other variables. Do your plots suggest that any of the other variables might be useful in predicting mpg? Justify your answer.

The plots suggest that displacement, horsepower, and weight have the clearest relationships with mpg (mpg falls as each of them increases), so they would likely be the most useful predictors of mpg.
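
A visual impression like this can be backed up with correlations. The sketch below uses mtcars, which ships with base R, purely as a stand-in for Auto: its columns disp, hp, and wt play the roles of displacement, horsepower, and weight. With the Auto data loaded, the analogous call would use those column names instead.

```r
# mtcars is used here only as a stand-in for Auto (an assumption for illustration)
mpg_cor <- cor(mtcars$mpg, mtcars[, c("disp", "hp", "wt")])
print(round(mpg_cor, 2))
```

Correlation only measures linear association, so a curved relationship in the scatterplot can be stronger than the coefficient alone suggests.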

Question 10

To begin, load in the Boston data set. The Boston data set is part of the MASS library in R

library(MASS)

#Boston

#?Boston

How many rows are in this data set? How many columns? What do the rows and columns represent?

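The Boston data frame has 506 rows and 14 columns. Each row represents a suburb (town) of Boston, and each column is a variable measured for that suburb, such as the per capita crime rate (crim) or the median home value (medv).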

Make some pairwise scatterplots of the predictors (columns) in this data set. Describe your findings.

pairs(Boston)

There appears to be some relationship between crim and both rm and age.

Are any of the predictors associated with per capita crime rate? If so, explain the relationship.

summary(Boston$crim)

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##  0.00632  0.08204  0.25651  3.61352  3.67708 88.97620

cor(Boston)

##                crim          zn       indus         chas         nox
## crim     1.00000000 -0.20046922  0.40658341 -0.055891582  0.42097171
## zn      -0.20046922  1.00000000 -0.53382819 -0.042696719 -0.51660371
## indus    0.40658341 -0.53382819  1.00000000  0.062938027  0.76365145
## chas    -0.05589158 -0.04269672  0.06293803  1.000000000  0.09120281
## nox      0.42097171 -0.51660371  0.76365145  0.091202807  1.00000000
## rm      -0.21924670  0.31199059 -0.39167585  0.091251225 -0.30218819
## age      0.35273425 -0.56953734  0.64477851  0.086517774  0.73147010
## dis     -0.37967009  0.66440822 -0.70802699 -0.099175780 -0.76923011
## rad      0.62550515 -0.31194783  0.59512927 -0.007368241  0.61144056
## tax      0.58276431 -0.31456332  0.72076018 -0.035586518  0.66802320
## ptratio  0.28994558 -0.39167855  0.38324756 -0.121515174  0.18893268
## black   -0.38506394  0.17552032 -0.35697654  0.048788485 -0.38005064
## lstat    0.45562148 -0.41299457  0.60379972 -0.053929298  0.59087892
## medv    -0.38830461  0.36044534 -0.48372516  0.175260177 -0.42732077
##                  rm         age         dis          rad         tax    ptratio
## crim    -0.21924670  0.35273425 -0.37967009  0.625505145  0.58276431  0.2899456
## zn       0.31199059 -0.56953734  0.66440822 -0.311947826 -0.31456332 -0.3916785
## indus   -0.39167585  0.64477851 -0.70802699  0.595129275  0.72076018  0.3832476
## chas     0.09125123  0.08651777 -0.09917578 -0.007368241 -0.03558652 -0.1215152
## nox     -0.30218819  0.73147010 -0.76923011  0.611440563  0.66802320  0.1889327
## rm       1.00000000 -0.24026493  0.20524621 -0.209846668 -0.29204783 -0.3555015
## age     -0.24026493  1.00000000 -0.74788054  0.456022452  0.50645559  0.2615150
## dis      0.20524621 -0.74788054  1.00000000 -0.494587930 -0.53443158 -0.2324705
## rad     -0.20984667  0.45602245 -0.49458793  1.000000000  0.91022819  0.4647412
## tax     -0.29204783  0.50645559 -0.53443158  0.910228189  1.00000000  0.4608530
## ptratio -0.35550149  0.26151501 -0.23247054  0.464741179  0.46085304  1.0000000
## black    0.12806864 -0.27353398  0.29151167 -0.444412816 -0.44180801 -0.1773833
## lstat   -0.61380827  0.60233853 -0.49699583  0.488676335  0.54399341  0.3740443
## medv     0.69535995 -0.37695457  0.24992873 -0.381626231 -0.46853593 -0.5077867
##               black      lstat       medv
## crim    -0.38506394  0.4556215 -0.3883046
## zn       0.17552032 -0.4129946  0.3604453
## indus   -0.35697654  0.6037997 -0.4837252
## chas     0.04878848 -0.0539293  0.1752602
## nox     -0.38005064  0.5908789 -0.4273208
## rm       0.12806864 -0.6138083  0.6953599
## age     -0.27353398  0.6023385 -0.3769546
## dis      0.29151167 -0.4969958  0.2499287
## rad     -0.44441282  0.4886763 -0.3816262
## tax     -0.44180801  0.5439934 -0.4685359
## ptratio -0.17738330  0.3740443 -0.5077867
## black    1.00000000 -0.3660869  0.3334608
## lstat   -0.36608690  1.0000000 -0.7376627
## medv     0.33346082 -0.7376627  1.0000000

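Looking at the correlations between crim and the other variables: rad (0.63) and tax (0.58) have the strongest positive association with crim, followed by lstat (0.46), nox (0.42), indus (0.41), age (0.35), and ptratio (0.29). medv (-0.39), black (-0.39), dis (-0.38), rm (-0.22), and zn (-0.20) are negatively associated with crim, while chas (-0.06) shows essentially no linear association. These values are correlation coefficients rather than p-values; a test such as cor.test() would be needed to assess significance at the .05 level.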
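
The full correlation matrix is hard to scan, so one option is to pull out only the crim row and sort it by absolute value. This is a minimal sketch, assuming the MASS package is installed; the choice of rad for the significance test is just an example.

```r
library(MASS)  # provides the Boston data frame

# Correlation of crim with every other variable, strongest first
cor_mat <- cor(Boston)
crim_cor <- cor_mat["crim", colnames(cor_mat) != "crim"]
crim_cor <- crim_cor[order(abs(crim_cor), decreasing = TRUE)]
print(round(crim_cor, 2))

# A correlation coefficient is not a p-value; cor.test() provides one
rad_test <- cor.test(Boston$crim, Boston$rad)
print(rad_test$p.value)
```

Sorting by absolute value puts strong negative associations next to strong positive ones, which makes it easier to see which predictors matter most.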

Do any of the suburbs of Boston appear to have particularly high crime rates? Tax rates? Pupil-teacher ratios? Comment on the range of each predictor.

summary(Boston$crim)

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##  0.00632  0.08204  0.25651  3.61352  3.67708 88.97620

#?Boston

library(ggplot2)
qplot(Boston$crim, binwidth=5 , xlab = "Crime rate", ylab="Number of Suburbs" )

Most suburbs have a low crime rate (median 0.26), but the distribution is heavily right-skewed: a few suburbs have very high rates, up to a maximum of 88.98.

summary(Boston$tax)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   187.0   279.0   330.0   408.2   666.0   711.0

qplot(Boston$tax, binwidth=50 , xlab = "Full-value property-tax rate per $10,000", ylab="Number of Suburbs")

A little over 125 suburbs have a high tax rate, above 600 per $10,000. The range is 187 to 711.

summary(Boston$ptratio)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   12.60   17.40   19.05   18.46   20.20   22.00

qplot(Boston$ptratio, binwidth=5, xlab ="Pupil-teacher ratio by town", ylab="Number of Suburbs")

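According to the histogram, more than 150 suburbs fall in the bin covering pupil-teacher ratios of roughly 17.5 to 22.5, so fairly high ratios are common. The overall range is narrow, 12.6 to 22.0, so no suburb stands out as extreme.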
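
Reading counts off a histogram is approximate, so a numeric check can help. The sketch below counts suburbs above the upper boxplot fence, Q3 + 1.5 * IQR. That cutoff is an assumed definition of "particularly high" chosen for illustration, not one given in the exercise.

```r
library(MASS)  # provides the Boston data frame

# Count values above the upper boxplot fence, Q3 + 1.5 * IQR (an assumed cutoff)
count_high <- function(x) {
  fence <- quantile(x, 0.75) + 1.5 * IQR(x)
  sum(x > fence)
}

high_counts <- sapply(Boston[, c("crim", "tax", "ptratio")], count_high)
print(high_counts)
```

By this rule only crim has unusually high values: the maxima of tax (711) and ptratio (22.0) lie below their respective fences.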

How many of the suburbs in this data set bound the Charles river?

nrow(subset(Boston, chas ==1))

## [1] 35

35 suburbs bound the Charles River.

What is the median pupil-teacher ratio among the towns in this data set?

summary(Boston$ptratio)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   12.60   17.40   19.05   18.46   20.20   22.00

The median pupil-teacher ratio is 19.05.

Which suburb of Boston has lowest median value of owner-occupied homes? What are the values of the other predictors for that suburb, and how do those values compare to the overall ranges for those predictors? Comment on your findings.

least_occupied <- Boston[order(Boston$medv),]
least_occupied[1,]

##        crim zn indus chas   nox    rm age    dis rad tax ptratio black lstat
## 399 38.3518  0  18.1    0 0.693 5.453 100 1.4896  24 666    20.2 396.9 30.59
##     medv
## 399    5

Suburb 399 has the lowest median value of owner-occupied homes (medv = 5, in units of $1000s).

summary(Boston)

##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08205   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00

Comparing suburb 399 with the summary of Boston as a whole: age is at the maximum of 100, and medv is at the minimum of 5. The tax rate is high (666, equal to the third quartile), and the pupil-teacher ratio of 20.2 is also high, close to the maximum of 22. crim (38.35) is far above the third quartile of 3.68, lstat (30.59) is near the maximum of 37.97, and nox (0.693) is above the third quartile of 0.624. Taken together, these values suggest that suburb 399 is not a suburb where people want to live.
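
Two refinements to this comparison can be sketched in code, assuming the MASS package is installed. First, order() shows only the first row when several suburbs share the minimum, so it is worth listing every suburb at the minimum medv. Second, the position of suburb 399 within each predictor can be expressed as the share of suburbs with a value at or below its own.

```r
library(MASS)  # provides the Boston data frame

# All suburbs tied for the lowest medv (order() alone shows only the first)
lowest <- Boston[Boston$medv == min(Boston$medv), ]
print(rownames(lowest))

# For suburb 399: share of suburbs with a value at or below its value
suburb <- Boston["399", ]
pct_rank <- sapply(names(Boston), function(v) mean(Boston[[v]] <= suburb[[v]]))
print(round(pct_rank, 2))
```

A share near 1 means the suburb is close to the top of the range for that variable, and a share near 0 means it is close to the bottom.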

In this data set, how many of the suburbs average more than seven rooms per dwelling? More than eight rooms per dwelling? Comment on the suburbs that average more than eight rooms per dwelling

rm_7 <- subset(Boston, rm>7)
nrow(rm_7)

## [1] 64

64 suburbs average more than 7 rooms per dwelling.

rm_8 <- subset(Boston, rm>8)
nrow(rm_8)

## [1] 13

13 suburbs average more than 8 rooms per dwelling.
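
To comment on the suburbs that average more than eight rooms, one possible approach is to compare their medians with those of all suburbs. This is a sketch assuming the MASS package is installed; using the median as the summary statistic is a choice made here for illustration.

```r
library(MASS)  # provides the Boston data frame

rm_8 <- subset(Boston, rm > 8)

# Median of each variable: suburbs with rm > 8 versus all suburbs
comparison <- rbind(
  rm_over_8 = sapply(rm_8, median),
  all_suburbs = sapply(Boston, median)
)
print(round(comparison, 2))
```

Columns where the two rows differ markedly point to the characteristics that set these 13 suburbs apart.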