MS4373.1

Amanda Wallen

2

Explain whether each scenario is a classification or regression problem, and indicate whether we are most interested in inference or prediction. Finally, provide n and p.

(a) We collect a set of data on the top 500 firms in the US. For each firm we record profit, number of employees, industry and the CEO salary. We are interested in understanding which factors affect CEO salary.

The scenario is a regression problem, because the response, CEO salary, is quantitative. We are most interested in inference, since the goal is to understand which factors affect CEO salary rather than to predict it. The predictors are profit, number of employees, and industry; the first two are quantitative and industry is qualitative. n=500 and p=3.

(b) We are considering launching a new product and wish to know whether it will be a success or a failure. We collect data on 20 similar products that were previously launched. For each product we have recorded whether it was a success or failure, price charged for the product, marketing budget, competition price, and ten other variables.

The scenario is a classification problem, because the response has two categories, success or failure, so it is qualitative rather than quantitative. We are most interested in prediction, since we want to know how the new product will turn out. The predictors are price, marketing budget, competition price, and the ten other variables. n=20 and p=13.

(c) We are interested in predicting the % change in the USD/Euro exchange rate in relation to the weekly changes in the world stock markets. Hence we collect weekly data for all of 2012. For each week we record the % change in the USD/Euro, the % change in the US market, the % change in the British market, and the % change in the German market.

The scenario is a regression problem, because the response, the percent change in the exchange rate, is quantitative. We are most interested in prediction, as the scenario itself states. The predictors are the % changes in the US, British, and German markets. With one observation per week of 2012, n=52 and p=3.
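
As a small illustration of how n and p are counted, the sketch below builds a data frame shaped like scenario (a). The column names and all values are hypothetical placeholders, not real firm data; the point is only that n is the number of rows and p is the number of columns other than the response.

```r
# Hypothetical data shaped like scenario (a): one row per firm.
# All values are simulated placeholders, not real firm data.
set.seed(1)
firms <- data.frame(
  profit     = rnorm(500),
  employees  = rpois(500, lambda = 1000),
  industry   = factor(sample(c("tech", "retail", "energy"), 500, replace = TRUE)),
  ceo_salary = rnorm(500, mean = 1e6, sd = 2e5)
)

response <- "ceo_salary"
n <- nrow(firms)                              # number of observations
p <- length(setdiff(names(firms), response))  # number of predictors

# A quantitative response points to regression, a qualitative one to classification.
problem_type <- if (is.numeric(firms[[response]])) "regression" else "classification"
c(n = n, p = p)
problem_type
```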

5

What are the advantages and disadvantages of a very flexible (versus a less flexible) approach for regression or classification? Under what circumstances might a more flexible approach be preferred to a less flexible approach? When might a less flexible approach be preferred?

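The advantage of a very flexible approach over a less flexible one is that it can fit a wider range of shapes for f, so it has lower bias and is more useful when the true relationship is non-linear. The disadvantages are that it has higher variance and can overfit, following the noise in the training data so that test error rises even as training error falls; it also needs more observations to estimate and is harder to interpret. A more flexible approach would be preferred when the relationship is non-linear, the sample size is large, and the main goal is prediction. A less flexible approach would be preferred when the results need to be interpreted (inference), when the sample size is small, or when the error terms have high variance, because a flexible method would then fit the noise.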
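
The trade-off can be seen in a small simulation. This is a minimal sketch under assumed conditions: the sine-shaped truth, the noise level, and the use of polynomial degree as a stand-in for flexibility are all choices made for illustration. Training error can only fall as the degree grows, while test error typically falls and then rises again.

```r
set.seed(42)

# Simulated data with a non-linear true relationship (an assumption for illustration).
simulate <- function(n) {
  x <- runif(n, -3, 3)
  data.frame(x = x, y = sin(x) + rnorm(n, sd = 0.3))
}
train <- simulate(200)
test  <- simulate(200)

# Mean squared error of a fitted model on a data set.
mse <- function(fit, data) mean((data$y - predict(fit, newdata = data))^2)

# Polynomial degree stands in for flexibility: 1 = rigid, 15 = very flexible.
degrees <- c(1, 3, 15)
results <- t(sapply(degrees, function(d) {
  fit <- lm(y ~ poly(x, d), data = train)
  c(degree = d, train_mse = mse(fit, train), test_mse = mse(fit, test))
}))
print(round(results, 3))
```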

6

Describe the differences between a parametric and a non-parametric statistical learning approach. What are the advantages of a parametric approach to regression or classification (as opposed to a nonparametric approach)? What are its disadvantages?

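A parametric statistical learning approach first assumes a functional form for f (for example, that f is linear in the predictors) and then uses the training data to estimate the parameters of that form. A non-parametric approach makes no explicit assumption about the form of f, so it can be applied more broadly and its complexity can grow with the sample size. The advantages of a parametric approach are that estimating a few parameters is simpler than estimating an arbitrary function, it needs fewer observations, and the fitted model is easier to interpret. Its disadvantages are that the assumed form may be far from the true f, which gives a poor fit, and that choosing a more flexible form to compensate can lead to overfitting.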
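
To make the contrast concrete, here is a minimal sketch on simulated data. The logarithmic truth, the noise level, and the choice of `loess` as the non-parametric method are assumptions for illustration. The linear model is summarised by two coefficients, whereas the local regression has no global formula and is tuned through `span`.

```r
set.seed(7)
x <- sort(runif(150, 0, 10))
y <- log1p(x) + rnorm(150, sd = 0.2)   # simulated, non-linear truth
dat <- data.frame(x = x, y = y)

# Parametric: assume f(x) = b0 + b1 * x, so only two coefficients are estimated.
param_fit <- lm(y ~ x, data = dat)
coef(param_fit)

# Non-parametric: local regression (loess) assumes no global form for f;
# `span` controls how local, and therefore how flexible, the fit is.
nonparam_fit <- loess(y ~ x, data = dat, span = 0.5)

# Compare fitted values on a grid inside the observed range.
grid <- data.frame(x = seq(min(x), max(x), length.out = 50))
head(cbind(grid,
           linear = predict(param_fit, newdata = grid),
           loess  = predict(nonparam_fit, newdata = grid)))
```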

8

College <- read.csv(file="C:/Users/amand/Downloads/College.csv")
head(College[ , 1:4])

##                              X Private Apps Accept
## 1 Abilene Christian University     Yes 1660   1232
## 2           Adelphi University     Yes 2186   1924
## 3               Adrian College     Yes 1428   1097
## 4          Agnes Scott College     Yes  417    349
## 5    Alaska Pacific University     Yes  193    146
## 6            Albertson College     Yes  587    479

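# Keep the college names in a plain vector called `rownames` (this does not set
# the data frame's row names); it is used below to look up colleges by row index.
rownames=College[,1]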
fix(College)
College=College[,-1]
fix(College)
summary(College)

##    Private               Apps           Accept          Enroll    
##  Length:777         Min.   :   81   Min.   :   72   Min.   :  35  
##  Class :character   1st Qu.:  776   1st Qu.:  604   1st Qu.: 242  
##  Mode  :character   Median : 1558   Median : 1110   Median : 434  
##                     Mean   : 3002   Mean   : 2019   Mean   : 780  
##                     3rd Qu.: 3624   3rd Qu.: 2424   3rd Qu.: 902  
##                     Max.   :48094   Max.   :26330   Max.   :6392  
##    Top10perc       Top25perc      F.Undergrad     P.Undergrad     
##  Min.   : 1.00   Min.   :  9.0   Min.   :  139   Min.   :    1.0  
##  1st Qu.:15.00   1st Qu.: 41.0   1st Qu.:  992   1st Qu.:   95.0  
##  Median :23.00   Median : 54.0   Median : 1707   Median :  353.0  
##  Mean   :27.56   Mean   : 55.8   Mean   : 3700   Mean   :  855.3  
##  3rd Qu.:35.00   3rd Qu.: 69.0   3rd Qu.: 4005   3rd Qu.:  967.0  
##  Max.   :96.00   Max.   :100.0   Max.   :31643   Max.   :21836.0  
##     Outstate       Room.Board       Books           Personal   
##  Min.   : 2340   Min.   :1780   Min.   :  96.0   Min.   : 250  
##  1st Qu.: 7320   1st Qu.:3597   1st Qu.: 470.0   1st Qu.: 850  
##  Median : 9990   Median :4200   Median : 500.0   Median :1200  
##  Mean   :10441   Mean   :4358   Mean   : 549.4   Mean   :1341  
##  3rd Qu.:12925   3rd Qu.:5050   3rd Qu.: 600.0   3rd Qu.:1700  
##  Max.   :21700   Max.   :8124   Max.   :2340.0   Max.   :6800  
##       PhD            Terminal       S.F.Ratio      perc.alumni   
##  Min.   :  8.00   Min.   : 24.0   Min.   : 2.50   Min.   : 0.00  
##  1st Qu.: 62.00   1st Qu.: 71.0   1st Qu.:11.50   1st Qu.:13.00  
##  Median : 75.00   Median : 82.0   Median :13.60   Median :21.00  
##  Mean   : 72.66   Mean   : 79.7   Mean   :14.09   Mean   :22.74  
##  3rd Qu.: 85.00   3rd Qu.: 92.0   3rd Qu.:16.50   3rd Qu.:31.00  
##  Max.   :103.00   Max.   :100.0   Max.   :39.80   Max.   :64.00  
##      Expend        Grad.Rate     
##  Min.   : 3186   Min.   : 10.00  
##  1st Qu.: 6751   1st Qu.: 53.00  
##  Median : 8377   Median : 65.00  
##  Mean   : 9660   Mean   : 65.46  
##  3rd Qu.:10830   3rd Qu.: 78.00  
##  Max.   :56233   Max.   :118.00

# Private is stored as text; convert it to numeric codes so pairs() can plot it.
College[,1]=as.numeric(factor(College[,1]))
pairs(College[, 1:10])

College=read.csv(file="C:/Users/amand/Downloads/College.csv",header=T)
Elite=rep("No",nrow(College))
Elite[College$Top10perc>50]="Yes"
Elite=as.factor(Elite)
College=data.frame(College,Elite)
summary(College$Elite)

##  No Yes 
## 699  78

par(mfrow = c(2,2))
hist(College$Top25perc, col= 2,xlab = "Top25perc",ylab = "Count")
hist(College$PhD, col = 3,xlab = "PhD",ylab = "Count")
hist(College$Grad.Rate, col = 4,xlab = "Grad Rate",ylab = "Count")
hist(College$Expend,col = 5,xlab = "Expend",ylab = "Count")

summary(College$Top25perc)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     9.0    41.0    54.0    55.8    69.0   100.0

summary(College$PhD)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.00   62.00   75.00   72.66   85.00  103.00

summary(College$Grad.Rate)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   10.00   53.00   65.00   65.46   78.00  118.00

summary(College$Expend)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    3186    6751    8377    9660   10830   56233

weird.phd=College[College$PhD==103, ]
nrow(weird.phd)

## [1] 1

rownames[as.numeric(rownames(weird.phd))]

## [1] "Texas A&M University at Galveston"

# The impossible graduation rate is the maximum from the summary above (118).
weird.grad.rate=College[College$Grad.Rate==118, ]
nrow(weird.grad.rate)

## [1] 1

rownames[as.numeric(rownames(weird.grad.rate))]

## [1] "Cazenovia College"
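
The lookups above test for one hard-coded value at a time. A sketch of an alternative is to flag every percentage above 100 directly. The toy data frame and its `name` column are hypothetical stand-ins for the College data.

```r
# Toy stand-in for the College data; names and values are hypothetical.
toy <- data.frame(
  name      = c("College A", "College B", "College C"),
  PhD       = c(75, 103, 62),
  Grad.Rate = c(65, 78, 118),
  stringsAsFactors = FALSE
)

# Percentages above 100 are impossible, so flag them directly
# instead of testing for one specific value.
pct_cols <- c("PhD", "Grad.Rate")
flagged <- lapply(pct_cols, function(col) toy$name[toy[[col]] > 100])
names(flagged) <- pct_cols
flagged
```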

9

Auto=read.csv(file="C:/Users/amand/Downloads/Auto.csv")
str(Auto)

## 'data.frame':    397 obs. of  9 variables:
##  $ mpg         : num  18 15 18 16 17 15 14 14 14 15 ...
##  $ cylinders   : int  8 8 8 8 8 8 8 8 8 8 ...
##  $ displacement: num  307 350 318 304 302 429 454 440 455 390 ...
##  $ horsepower  : chr  "130" "165" "150" "150" ...
##  $ weight      : int  3504 3693 3436 3433 3449 4341 4354 4312 4425 3850 ...
##  $ acceleration: num  12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
##  $ year        : int  70 70 70 70 70 70 70 70 70 70 ...
##  $ origin      : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ name        : chr  "chevrolet chevelle malibu" "buick skylark 320" "plymouth satellite" "amc rebel sst" ...

summary(Auto[,-c(4,9)])

##       mpg          cylinders      displacement       weight      acceleration  
##  Min.   : 9.00   Min.   :3.000   Min.   : 68.0   Min.   :1613   Min.   : 8.00  
##  1st Qu.:17.50   1st Qu.:4.000   1st Qu.:104.0   1st Qu.:2223   1st Qu.:13.80  
##  Median :23.00   Median :4.000   Median :146.0   Median :2800   Median :15.50  
##  Mean   :23.52   Mean   :5.458   Mean   :193.5   Mean   :2970   Mean   :15.56  
##  3rd Qu.:29.00   3rd Qu.:8.000   3rd Qu.:262.0   3rd Qu.:3609   3rd Qu.:17.10  
##  Max.   :46.60   Max.   :8.000   Max.   :455.0   Max.   :5140   Max.   :24.80  
##       year           origin     
##  Min.   :70.00   Min.   :1.000  
##  1st Qu.:73.00   1st Qu.:1.000  
##  Median :76.00   Median :1.000  
##  Mean   :75.99   Mean   :1.574  
##  3rd Qu.:79.00   3rd Qu.:2.000  
##  Max.   :82.00   Max.   :3.000

sapply(Auto[,-c(4,9)],mean)

##          mpg    cylinders displacement       weight acceleration         year 
##    23.515869     5.458438   193.532746  2970.261965    15.555668    75.994962 
##       origin 
##     1.574307

sapply(Auto[,-c(4,9)],sd)

##          mpg    cylinders displacement       weight acceleration         year 
##    7.8258039    1.7015770  104.3795833  847.9041195    2.7499953    3.6900049 
##       origin 
##    0.8025495

newauto=Auto[-c(10:85),-c(4,9)]
sapply(newauto,mean)

##          mpg    cylinders displacement       weight acceleration         year 
##    24.438629     5.370717   187.049844  2933.962617    15.723053    77.152648 
##       origin 
##     1.598131

sapply(newauto,sd)

##          mpg    cylinders displacement       weight acceleration         year 
##    7.9081842    1.6534857   99.6353853  810.6429384    2.6805138    3.1112298 
##       origin 
##    0.8161627
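
In the `str(Auto)` output above, `horsepower` is read as character, which happens when a column contains a non-numeric marker; in the ISLR copy of Auto.csv missing values are written as `?`. Rather than dropping the column, a sketch of one alternative is to declare the marker when reading the file. The rows below are made up, in the same shape as the file, so the example is self-contained.

```r
# A few made-up rows in the same shape as Auto.csv, with "?" marking a
# missing horsepower value (the data below are illustrative only).
csv_text <- "mpg,cylinders,horsepower,name
18,8,130,car a
25,4,?,car b
30,4,70,car c"

# Tell read.csv that "?" means NA so horsepower is parsed as a number.
auto_toy <- read.csv(text = csv_text, na.strings = "?")
auto_toy <- na.omit(auto_toy)   # drop rows with any missing value

# Range, mean and standard deviation of each quantitative column.
quant <- auto_toy[, sapply(auto_toy, is.numeric)]
sapply(quant, range)
sapply(quant, mean)
sapply(quant, sd)
```

With the real file the same `na.strings = "?"` argument would be passed to the `read.csv` call that loads Auto.csv, and `horsepower` could then be kept in the summaries.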

10

library(ISLR2)

## 
## Attaching package: 'ISLR2'

## The following objects are masked _by_ '.GlobalEnv':
## 
##     Auto, College

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

nrow(Boston)

## [1] 506

ncol(Boston)

## [1] 13

pairs(Boston)

par(mfrow=c(1,3))
boxplot(Boston$crim, xlab = "per capita crime rate by town", main = "Box-plot of crime rate")
boxplot(Boston$tax, xlab = "full-value property-tax rate per $10,000", main = "Box-plot of tax-rate")
boxplot(Boston$ptratio, xlab = "pupil-teacher ratio by town", main = "Box-plot of Pupil-teacher ratio")

table(Boston$chas)

## 
##   0   1 
## 471  35

median(Boston$ptratio)

## [1] 19.05

Boston %>% mutate(Census_tracts = seq_len(nrow(Boston))) %>% filter(medv == min(medv))

##      crim zn indus chas   nox    rm age    dis rad tax ptratio lstat medv
## 1 38.3518  0  18.1    0 0.693 5.453 100 1.4896  24 666    20.2 30.59    5
## 2 67.9208  0  18.1    0 0.693 5.683 100 1.4254  24 666    20.2 22.98    5
##   Census_tracts
## 1           399
## 2           406
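
To judge how unusual the lowest-value tracts are, their predictor values can be placed within the overall distribution. The sketch below does this in base R on a toy data frame; the values are made up and only `crim` is shown, but the same idea applies to any column.

```r
# Toy stand-in for Boston; the values are made up for illustration.
toy <- data.frame(
  crim = c(0.1, 38.4, 67.9, 0.5),
  medv = c(24, 5, 5, 30)
)

# Row indices of every tract that attains the minimum medv (ties included).
lowest <- which(toy$medv == min(toy$medv))
toy[lowest, ]

# Where do those tracts sit within the overall distribution of crim?
# ecdf() returns the fraction of observations at or below each value.
crim_percentile <- ecdf(toy$crim)(toy$crim[lowest])
crim_percentile
```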

summary(Boston)

##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08205   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          lstat      
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   : 1.73  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.: 6.95  
##  Median : 5.000   Median :330.0   Median :19.05   Median :11.36  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :12.65  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:16.95  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :37.97  
##       medv      
##  Min.   : 5.00  
##  1st Qu.:17.02  
##  Median :21.20  
##  Mean   :22.53  
##  3rd Qu.:25.00  
##  Max.   :50.00

more7 = Boston %>% filter(rm > 7)
more8 = Boston %>% filter(rm > 8)
data.frame("more_than_7_rooms" = c(length(more7$rm)),
           "more_than_8_rooms" = c(length(more8$rm)))

##   more_than_7_rooms more_than_8_rooms
## 1                64                13

boxplot(more8, main="Boxplot of data which have more than 8 rooms per dwelling")

2022-08-29
