#Victor Ramos Hw 2
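2a)
Regression, inference. n = 500 firms, p = 3 predictors (profit, number of employees, industry); the response is CEO salary.
2b)
Classification, prediction. n = 20 products, p = 13 predictors (price charged, marketing budget, competition price, ten other variables); the response is whether the product succeeds or fails.
2c)
Regression, prediction. n = 52 weeks, p = 3 predictors (% change in US, % change in British, % change in German markets); the response is the % change in USD/Euro.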
An inflexible method has lower variance in its estimate of f and is easier to interpret, but it has higher bias and can fit poorly when the true f is far from the assumed shape. A flexible method can fit a wider range of shapes for f, so it has lower bias, but the model is more complicated, has higher variance, and can overfit unless there are many observations.
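The trade-off between inflexible and flexible methods can be illustrated with a small simulation. This is a minimal sketch and not part of the assignment: the true function, the noise level, the sample sizes, and the polynomial degree are all made-up assumptions.

```r
# Simulated data: the true f and the noise level are assumptions for illustration.
set.seed(1)
f <- function(x) sin(2 * x)
train <- data.frame(x = runif(50, 0, 3))
train$y <- f(train$x) + rnorm(50, sd = 0.3)
test <- data.frame(x = runif(500, 0, 3))
test$y <- f(test$x) + rnorm(500, sd = 0.3)

# Inflexible fit (straight line) versus a more flexible fit (degree-8 polynomial).
fit_line <- lm(y ~ x, data = train)
fit_poly <- lm(y ~ poly(x, 8), data = train)

# Mean squared error of a fitted model on a given data set.
mse <- function(fit, data) mean((data$y - predict(fit, newdata = data))^2)

results <- c(
  line_train = mse(fit_line, train),
  line_test  = mse(fit_line, test),
  poly_train = mse(fit_poly, train),
  poly_test  = mse(fit_poly, test)
)
print(round(results, 3))
```

The training MSE of the polynomial can never exceed that of the line, because the line is a special case of the polynomial. The test MSE is the number that shows whether the extra flexibility helped or overfit.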
A parametric approach reduces the problem of estimating f to estimating a set of parameters, which is easier than estimating an arbitrary function f. Parametric models are less flexible, so they are easier to interpret. The disadvantage is that the assumed form may not match the true f, in which case the model fits the data poorly.
A non-parametric approach makes no assumption about the form of f, so it is flexible and can accurately fit a wider range of shapes for f by getting as close as possible to the data points. The disadvantage is that a very large number of observations is required to obtain an accurate estimate of f; without many observations the model will not be accurate.
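A small simulated comparison may make the parametric versus non-parametric distinction concrete. This is only a sketch: the quadratic truth, the noise level, and the choice of lm() and loess() as representatives of each approach are assumptions made for illustration.

```r
# Simulated data; the quadratic truth and sample size are assumptions.
set.seed(2)
x <- seq(0, 10, length.out = 80)
y <- 1 + 0.5 * x - 0.08 * x^2 + rnorm(80, sd = 0.4)
d <- data.frame(x = x, y = y)

# Parametric: assume f is linear, so estimating f means estimating 2 numbers.
fit_param <- lm(y ~ x, data = d)
n_params <- length(coef(fit_param))

# Non-parametric: loess makes no global assumption about the shape of f.
fit_nonparam <- loess(y ~ x, data = d)

# Residual sum of squares on the training data for each fit.
rss_param    <- sum(residuals(fit_param)^2)
rss_nonparam <- sum(residuals(fit_nonparam)^2)
print(c(n_params = n_params, rss_param = rss_param, rss_nonparam = rss_nonparam))
```

Here the linear form is wrong for the simulated data, so the parametric fit pays for its simplicity with a larger residual sum of squares.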
8a)
college <- read.csv("https://www.statlearning.com/s/College.csv")
fix(college)
rownames(college)=college[,1]
fix(college)
8b)
college=college[,-1]
fix(college)
summary(college)
## Private Apps Accept Enroll
## Length:777 Min. : 81 Min. : 72 Min. : 35
## Class :character 1st Qu.: 776 1st Qu.: 604 1st Qu.: 242
## Mode :character Median : 1558 Median : 1110 Median : 434
## Mean : 3002 Mean : 2019 Mean : 780
## 3rd Qu.: 3624 3rd Qu.: 2424 3rd Qu.: 902
## Max. :48094 Max. :26330 Max. :6392
## Top10perc Top25perc F.Undergrad P.Undergrad
## Min. : 1.00 Min. : 9.0 Min. : 139 Min. : 1.0
## 1st Qu.:15.00 1st Qu.: 41.0 1st Qu.: 992 1st Qu.: 95.0
## Median :23.00 Median : 54.0 Median : 1707 Median : 353.0
## Mean :27.56 Mean : 55.8 Mean : 3700 Mean : 855.3
## 3rd Qu.:35.00 3rd Qu.: 69.0 3rd Qu.: 4005 3rd Qu.: 967.0
## Max. :96.00 Max. :100.0 Max. :31643 Max. :21836.0
## Outstate Room.Board Books Personal
## Min. : 2340 Min. :1780 Min. : 96.0 Min. : 250
## 1st Qu.: 7320 1st Qu.:3597 1st Qu.: 470.0 1st Qu.: 850
## Median : 9990 Median :4200 Median : 500.0 Median :1200
## Mean :10441 Mean :4358 Mean : 549.4 Mean :1341
## 3rd Qu.:12925 3rd Qu.:5050 3rd Qu.: 600.0 3rd Qu.:1700
## Max. :21700 Max. :8124 Max. :2340.0 Max. :6800
## PhD Terminal S.F.Ratio perc.alumni
## Min. : 8.00 Min. : 24.0 Min. : 2.50 Min. : 0.00
## 1st Qu.: 62.00 1st Qu.: 71.0 1st Qu.:11.50 1st Qu.:13.00
## Median : 75.00 Median : 82.0 Median :13.60 Median :21.00
## Mean : 72.66 Mean : 79.7 Mean :14.09 Mean :22.74
## 3rd Qu.: 85.00 3rd Qu.: 92.0 3rd Qu.:16.50 3rd Qu.:31.00
## Max. :103.00 Max. :100.0 Max. :39.80 Max. :64.00
## Expend Grad.Rate
## Min. : 3186 Min. : 10.00
## 1st Qu.: 6751 1st Qu.: 53.00
## Median : 8377 Median : 65.00
## Mean : 9660 Mean : 65.46
## 3rd Qu.:10830 3rd Qu.: 78.00
## Max. :56233 Max. :118.00
8c)
pairs(college[,2:11])
attach(college)
I was getting this error for the plot() part:
Error in plot.window(…) : need finite ‘ylim’ values
In addition: Warning messages:
1: In xy.coords(x, y, xlabel, ylabel, log) : NAs introduced by coercion
2: In min(x) : no non-missing arguments to min; returning Inf
3: In max(x) : no non-missing arguments to max; returning -Inf
I'm a noob, so the only fix I could think of was as.factor. Private is read in as a character column (see the summary above), and plot() needs a factor to draw the boxplots.
Private=as.factor(Private)
plot(Private, Outstate)
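The character-versus-factor issue can be reproduced on a toy data frame. A minimal sketch, assuming R 4.0 or later (where strings are no longer converted to factors by default); the toy values are made up and only stand in for the College data.

```r
# Toy stand-in for the College data; the values are made up for illustration.
toy <- data.frame(
  Private  = c("Yes", "No", "Yes", "Yes", "No", "No"),
  Outstate = c(12000, 6500, 15000, 11000, 7000, 8200),
  stringsAsFactors = FALSE
)
class(toy$Private)            # "character": plot() cannot place these on an axis

# Converting to a factor makes plot(x, y) draw side-by-side boxplots.
toy$Private <- as.factor(toy$Private)

pdf(NULL)                     # null graphics device so this also runs without a display
plot(toy$Private, toy$Outstate, xlab = "Private", ylab = "Outstate")
invisible(dev.off())
```

An alternative is to pass stringsAsFactors = TRUE to read.csv(), which converts every character column to a factor at load time.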
Elite =rep("No",nrow(college ))
Elite[college$Top10perc>50]="Yes"
Elite=as.factor(Elite)
college=data.frame(college, Elite)
summary(Elite)
##  No Yes
## 699  78
plot(Elite, Outstate)
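The Elite variable can also be built in one step. This is a sketch of an alternative, not the method the exercise prescribes; the vector top10 is a hypothetical stand-in for college$Top10perc with made-up values.

```r
# Toy percentages standing in for college$Top10perc (made-up values).
top10 <- c(12, 55, 48, 90, 23, 51, 50)

# Label a school "Yes" when more than 50% of its students come from the
# top 10% of their high school class, "No" otherwise.
elite <- factor(ifelse(top10 > 50, "Yes", "No"), levels = c("No", "Yes"))
table(elite)
```

Passing levels explicitly fixes the order in which summary() or table() prints the counts.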
par(mfrow=c(2,2))
hist(Top10perc)
hist(Top25perc)
hist(Room.Board)
hist(Grad.Rate)
detach(college)
Most colleges charge around 4,000 for room and board (the median is 4,200). Graduation rates for many colleges are around 65% (the median is 65).
##Question 9
auto <- read.csv("https://www.statlearning.com/s/Auto.csv")
auto=na.omit(auto)
attach(auto)
horsepower = as.numeric(horsepower)
## Warning: NAs introduced by coercion
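The coercion warning arises because missing horsepower values are stored as "?" in the file, so na.omit() has nothing to drop and as.numeric() turns those entries into NA. A minimal sketch of one way to handle this at load time, using a few made-up rows instead of the real file and assuming R 4.0 or later:

```r
# A few made-up rows in the same layout as Auto.csv; "?" marks a missing value.
csv_text <- "mpg,horsepower,weight
18,130,3504
25,?,2046
21,?,2875
32,65,1985"

# Without na.strings, the "?" entries force horsepower to be read as text.
raw <- read.csv(text = csv_text)
class(raw$horsepower)

# With na.strings = "?", the column is numeric and na.omit() can drop the rows.
clean <- na.omit(read.csv(text = csv_text, na.strings = "?"))
range(clean$horsepower)
```

With the real file, the same na.strings argument would go into the read.csv() call; the summary statistics would then be computed on the rows that remain.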
The following uses the auto data
9a)Which are Quantitative?
mpg, displacement, weight, acceleration, horsepower, year
9b)Range of each Quantitative?
Mpg range 9, 46.6
Weight range 1613, 5140
Acceleration range 8, 24.8
Displacement range 68, 455
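Horsepower range is NA, NA (the column contains "?" entries, which as.numeric() turns into NA; range(horsepower, na.rm = TRUE) would skip them)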
Year range is 70, 82
9c)Mean and Std dev of each quantitative
mean of mpg is 23.515869 and deviation 7.8258039
mean of weight is 2970.2619647 and deviation 847.9041195
mean of accel is 15.5556675 and deviation 2.7499953
mean of displ is 193.5327456 and deviation 104.3795833
mean of horsepower is NA and deviation NA (because of the NA values in the column)
mean of year is 75.9949622 and deviation 3.6900049
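The ranges, means, and standard deviations can be produced in one pass rather than variable by variable. A minimal sketch, using the built-in mtcars data as a stand-in for Auto (an assumption made only so the example runs without a download); the name stats is hypothetical.

```r
# mtcars ships with R and stands in here for the Auto data.
quant <- mtcars[, c("mpg", "disp", "hp", "wt")]

# One row per variable: min, max, mean, and standard deviation.
# na.rm = TRUE skips missing values instead of returning NA.
stats <- t(sapply(quant, function(v) {
  c(min = min(v, na.rm = TRUE), max = max(v, na.rm = TRUE),
    mean = mean(v, na.rm = TRUE), sd = sd(v, na.rm = TRUE))
}))
print(round(stats, 2))
```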
9d)remove the tenth through 85th row
auto.r<-auto[-c(10:85),]
range(auto.r$mpg); mean(auto.r$mpg); sd(auto.r$mpg)
## [1] 11.0 46.6
## [1] 24.43863
## [1] 7.908184
range(auto.r$weight); mean(auto.r$weight); sd(auto.r$weight)
## [1] 1649 4997
## [1] 2933.963
## [1] 810.6429
range(auto.r$acceleration); mean(auto.r$acceleration); sd(auto.r$acceleration)
## [1] 8.5 24.8
## [1] 15.72305
## [1] 2.680514
range(auto.r$displacement); mean(auto.r$displacement); sd(auto.r$displacement)
## [1] 68 455
## [1] 187.0498
## [1] 99.63539
range(auto.r$horsepower); mean(auto.r$horsepower); sd(auto.r$horsepower)
## [1] "?" "98"
## Warning in mean.default(auto.r$horsepower): argument is not numeric or logical:
## returning NA
## [1] NA
## Warning in var(if (is.vector(x) || is.factor(x)) x else as.double(x), na.rm =
## na.rm): NAs introduced by coercion
## [1] NA
range(auto.r$year); mean(auto.r$year); sd(auto.r$year)
## [1] 70 82
## [1] 77.15265
## [1] 3.11123
9e)using the full data set, create plots and comment findings
# The heavier the car, the lower its mpg, as shown below
plot(mpg, weight)
# In general, the more cylinders a car has, the lower its mpg, as shown below
plot(mpg, cylinders)
# In general, the newer the car, the more fuel efficient it is, as shown below
plot(mpg, year)
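When mpg is the quantity to be predicted, it is conventional to put it on the vertical axis, and a correlation can back up what the plot shows. A sketch under the assumption that the built-in mtcars data is an acceptable stand-in for Auto; its columns wt and cyl play the role of weight and cylinders.

```r
# mtcars (built into R) is used as a stand-in for the Auto data.
pdf(NULL)  # null graphics device so the sketch runs without a display
plot(mpg ~ wt, data = mtcars, xlab = "Weight (1000 lbs)", ylab = "Miles per gallon")
invisible(dev.off())

# A correlation summarizes the direction and strength of each linear relationship.
cors <- sapply(mtcars[, c("wt", "cyl")], function(v) cor(mtcars$mpg, v))
print(round(cors, 2))
```

The formula interface, y ~ x, makes the role of each variable explicit and avoids relying on attach().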
9f) Yes. The plots above suggest that weight, cylinders, and year are all related to mpg, so I think there is enough information to predict mpg.
10a)
library(MASS)
boston <- Boston
summary(boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08205 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
detach(auto)
attach(boston)
The data set has 506 rows and 14 columns. Each row is a Boston suburb, and the columns are qualitative and quantitative variables, ranging from the crime rate and the proportion of residential land zoned to the proportion of owner-occupied units built prior to 1940.
10b) make pairwise scatterplot and describe findings
pairs(boston)
pairs(~ crim + zn + indus + dis + tax + rad + medv, boston)
10c)Any related to crime as a good predictor?
Suburbs with a larger proportion of older homes tend to have higher crime rates.
plot(age, crim)
There appears to be a strong connection between the distance to employment centers and the crime rate: crime is higher in suburbs that are closer to the centers.
plot(dis, crim)
Crime rates tend to be higher where property tax rates are high.
plot(tax, crim)
A higher percentage of lower-status population is associated with a higher crime rate.
plot(lstat, crim)
The lower the median value of owner-occupied homes, the higher the crime rate tends to be.
plot(medv, crim)
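The scatterplots can be complemented by a table of correlations with crim. This is a rough sketch rather than a full analysis: correlation only measures linear association, and several of these relationships are visibly non-linear.

```r
library(MASS)  # provides the Boston data frame

# Correlation of every other variable with the per-capita crime rate.
crim_cor <- cor(Boston)["crim", ]
crim_cor <- crim_cor[names(crim_cor) != "crim"]

# Sort by absolute size so the strongest linear associations come first.
print(round(crim_cor[order(abs(crim_cor), decreasing = TRUE)], 2))
```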
10d)Do any of the suburbs of Boston appear to have particularly high crime rates? Tax rates? Pupil-teacher ratios? Comment on the range of each predictor.
A few suburbs have very high crime rates (crim goes up to 88.98), while most are below 10 per capita.
hist(crim, breaks=15)
Many suburbs have low property tax rates, but some have high rates, with the maximum passing 700 (tax ranges from 187 to 711).
hist(tax, breaks=35)
Some suburbs have a higher pupil-teacher ratio (ptratio ranges from 12.6 to 22.0), probably linked to property taxes.
hist(ptratio, breaks=10)
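The ranges discussed in this part can be pulled out directly instead of being read off the histograms. A minimal sketch; the cut-offs of 10 for crim and 600 for tax are illustrative assumptions, not values given by the exercise.

```r
library(MASS)  # provides the Boston data frame

# Minimum and maximum of each predictor of interest.
ranges <- sapply(Boston[, c("crim", "tax", "ptratio")], range)
rownames(ranges) <- c("min", "max")
print(ranges)

# Number of suburbs above an illustrative cut-off (the thresholds are assumptions).
n_high_crim <- sum(Boston$crim > 10)
n_high_tax  <- sum(Boston$tax > 600)
print(c(high_crim = n_high_crim, high_tax = n_high_tax))
```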
10e) How many suburbs bound the Charles River?
dim(subset(boston, chas == 1))
## [1] 35 14
35 suburbs border the Charles River
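The same count can be obtained without building a subset, by summing a logical vector. A short sketch that reads the Boston data straight from MASS rather than from the attached copy:

```r
library(MASS)  # provides the Boston data frame

# chas is 1 when the tract bounds the river, so summing the comparison counts them.
n_river <- sum(Boston$chas == 1)
print(n_river)

# table() shows both groups at once.
print(table(Boston$chas))
```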
10f) median student teacher ratio among towns in this data set
summary(ptratio)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 12.60 17.40 19.05 18.46 20.20 22.00
Median is 19.05
10g)
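Two suburbs (rows 399 and 406) tie for the lowest median value (medv = 5). Neither is by the river (chas = 0), tax is high (666, the third quartile of the data set), and the crime rate is far above the overall median (38.35 and 67.92 versus 0.26).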
t(subset(boston, medv == min(boston$medv)))
## 399 406
## crim 38.3518 67.9208
## zn 0.0000 0.0000
## indus 18.1000 18.1000
## chas 0.0000 0.0000
## nox 0.6930 0.6930
## rm 5.4530 5.6830
## age 100.0000 100.0000
## dis 1.4896 1.4254
## rad 24.0000 24.0000
## tax 666.0000 666.0000
## ptratio 20.2000 20.2000
## black 396.9000 384.9700
## lstat 30.5900 22.9800
## medv 5.0000 5.0000
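To judge how extreme these suburbs are, each value can be expressed as a percentile of the whole data set. A hedged sketch: the choice of the four variables and the names lowest, vars, and pct are assumptions made for illustration.

```r
library(MASS)  # provides the Boston data frame

# Row numbers of the suburbs with the lowest median home value.
lowest <- which(Boston$medv == min(Boston$medv))

# Share of suburbs with a value at or below that of each lowest-medv suburb.
vars <- c("crim", "tax", "ptratio", "lstat")
pct <- sapply(vars, function(v) {
  sapply(lowest, function(i) mean(Boston[[v]] <= Boston[[v]][i]))
})
rownames(pct) <- lowest
print(round(100 * pct, 1))
```

A value near 100 means that almost every other suburb has a lower value for that variable.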
10h)In this data set, how many of the suburbs average more than seven rooms per dwelling? More than eight rooms per dwelling? Comment on the suburbs that average more than eight rooms per dwelling.
dim(subset(boston, rm > 7))
## [1] 64 14
64 suburbs average more than 7 rooms per dwelling
dim(subset(boston, rm > 8))
## [1] 13 14
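13 suburbs average more than 8 rooms per dwelling
Below is a comparison of the suburbs averaging more than 8 rooms with the full data set
The median crime rate for these suburbs is higher than the median for Boston as a whole (0.52 vs 0.26), although the mean is much lower (0.72 vs 3.61). They have a lower proportion of non-retail business (median indus 6.2 vs 9.69), a much lower lstat (median 4.14 vs 11.36), and a much higher medv (median 48.3 vs 21.2).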
summary(subset(boston, rm > 8))
## crim zn indus chas
## Min. :0.02009 Min. : 0.00 Min. : 2.680 Min. :0.0000
## 1st Qu.:0.33147 1st Qu.: 0.00 1st Qu.: 3.970 1st Qu.:0.0000
## Median :0.52014 Median : 0.00 Median : 6.200 Median :0.0000
## Mean :0.71879 Mean :13.62 Mean : 7.078 Mean :0.1538
## 3rd Qu.:0.57834 3rd Qu.:20.00 3rd Qu.: 6.200 3rd Qu.:0.0000
## Max. :3.47428 Max. :95.00 Max. :19.580 Max. :1.0000
## nox rm age dis
## Min. :0.4161 Min. :8.034 Min. : 8.40 Min. :1.801
## 1st Qu.:0.5040 1st Qu.:8.247 1st Qu.:70.40 1st Qu.:2.288
## Median :0.5070 Median :8.297 Median :78.30 Median :2.894
## Mean :0.5392 Mean :8.349 Mean :71.54 Mean :3.430
## 3rd Qu.:0.6050 3rd Qu.:8.398 3rd Qu.:86.50 3rd Qu.:3.652
## Max. :0.7180 Max. :8.780 Max. :93.90 Max. :8.907
## rad tax ptratio black
## Min. : 2.000 Min. :224.0 Min. :13.00 Min. :354.6
## 1st Qu.: 5.000 1st Qu.:264.0 1st Qu.:14.70 1st Qu.:384.5
## Median : 7.000 Median :307.0 Median :17.40 Median :386.9
## Mean : 7.462 Mean :325.1 Mean :16.36 Mean :385.2
## 3rd Qu.: 8.000 3rd Qu.:307.0 3rd Qu.:17.40 3rd Qu.:389.7
## Max. :24.000 Max. :666.0 Max. :20.20 Max. :396.9
## lstat medv
## Min. :2.47 Min. :21.9
## 1st Qu.:3.32 1st Qu.:41.7
## Median :4.14 Median :48.3
## Mean :4.31 Mean :44.2
## 3rd Qu.:5.12 3rd Qu.:50.0
## Max. :7.44 Max. :50.0
summary(boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08205 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
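Rather than reading the two summary() printouts side by side, the medians can be placed in one table. A minimal sketch; the names big and med and the choice of which rows to print are assumptions.

```r
library(MASS)  # provides the Boston data frame

# Suburbs averaging more than 8 rooms per dwelling.
big <- subset(Boston, rm > 8)

# Median of every variable for that group next to the full data set.
med <- cbind(rm_gt_8 = sapply(big, median), all = sapply(Boston, median))
print(round(med[c("crim", "indus", "lstat", "medv"), ], 2))
```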
#The End