##Problem 1: Auto Data
#Part A
Auto <- read.table("http://faculty.marshall.usc.edu/gareth-james/ISL/Auto.data",
header=TRUE,
na.strings = "?")
str(Auto)
## 'data.frame': 397 obs. of 9 variables:
## $ mpg : num 18 15 18 16 17 15 14 14 14 15 ...
## $ cylinders : int 8 8 8 8 8 8 8 8 8 8 ...
## $ displacement: num 307 350 318 304 302 429 454 440 455 390 ...
## $ horsepower : num 130 165 150 150 140 198 220 215 225 190 ...
## $ weight : num 3504 3693 3436 3433 3449 ...
## $ acceleration: num 12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
## $ year : int 70 70 70 70 70 70 70 70 70 70 ...
## $ origin : int 1 1 1 1 1 1 1 1 1 1 ...
## $ name : Factor w/ 304 levels "amc ambassador brougham",..: 49 36 231 14 161 141 54 223 241 2 ...
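The predictors mpg, cylinders, displacement, horsepower, weight, acceleration, and year are quantitative. The predictor name is qualitative. The predictor origin is stored as an integer code (1, 2, or 3), so it is treated as quantitative in the summaries that follow, although it is arguably qualitative because the codes label a region of origin rather than measure a quantity.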
#Part B
#Range of each quantitative predictor:
range(Auto$mpg)
## [1] 9.0 46.6
range(Auto$cylinders)
## [1] 3 8
range(Auto$displacement)
## [1] 68 455
range(Auto$horsepower, na.rm=TRUE)
## [1] 46 230
range(Auto$weight)
## [1] 1613 5140
range(Auto$acceleration)
## [1] 8.0 24.8
range(Auto$year)
## [1] 70 82
range(Auto$origin)
## [1] 1 3
#Part C
#Mean of each quantitative predictor:
mean(Auto$mpg)
## [1] 23.51587
mean(Auto$cylinders)
## [1] 5.458438
mean(Auto$displacement)
## [1] 193.5327
mean(Auto$horsepower, na.rm=TRUE)
## [1] 104.4694
mean(Auto$weight)
## [1] 2970.262
mean(Auto$acceleration)
## [1] 15.55567
mean(Auto$year)
## [1] 75.99496
mean(Auto$origin)
## [1] 1.574307
#Standard deviation of each quantitative predictor:
sd(Auto$mpg)
## [1] 7.825804
sd(Auto$cylinders)
## [1] 1.701577
sd(Auto$displacement)
## [1] 104.3796
sd(Auto$horsepower, na.rm=TRUE)
## [1] 38.49116
sd(Auto$weight)
## [1] 847.9041
sd(Auto$acceleration)
## [1] 2.749995
sd(Auto$year)
## [1] 3.690005
sd(Auto$origin)
## [1] 0.8025495
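The per-column calls in Parts B and C can also be written as one pass over the numeric columns. The following is a minimal sketch, assuming a small made-up data frame `toy` as a stand-in for `Auto` (its values are illustrative, not taken from the data set). `na.rm = TRUE` is passed for every column so that a missing horsepower value does not turn the result into `NA`.

```r
# Toy stand-in for Auto; the values are made up for illustration
toy <- data.frame(
  mpg        = c(18, 15, 36, 27),
  horsepower = c(130, NA, 60, 90),
  weight     = c(3504, 3693, 1800, 2500),
  name       = c("a", "b", "c", "d")
)

# Keep only the numeric (quantitative) columns; this drops name
quant <- toy[sapply(toy, is.numeric)]

# One call per statistic instead of one call per column
ranges <- sapply(quant, range, na.rm = TRUE)  # row 1 = min, row 2 = max
means  <- sapply(quant, mean,  na.rm = TRUE)
sds    <- sapply(quant, sd,    na.rm = TRUE)

ranges
means
sds
```

Applied to the real data, the same three `sapply()` calls on `Auto` should reproduce the numbers listed above, with the caveat that origin is included only because it is stored as an integer.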
#Part D
#Subset of data with the 10th through 85th observations removed.
Auto2<-Auto[-c(10:85),]
#Range of each quantitative predictor in the data subset:
range(Auto2$mpg)
## [1] 11.0 46.6
range(Auto2$cylinders)
## [1] 3 8
range(Auto2$displacement)
## [1] 68 455
range(Auto2$horsepower, na.rm=TRUE)
## [1] 46 230
range(Auto2$weight)
## [1] 1649 4997
range(Auto2$acceleration)
## [1] 8.5 24.8
range(Auto2$year)
## [1] 70 82
range(Auto2$origin)
## [1] 1 3
#Mean of each quantitative predictor in the data subset:
mean(Auto2$mpg)
## [1] 24.43863
mean(Auto2$cylinders)
## [1] 5.370717
mean(Auto2$displacement)
## [1] 187.0498
mean(Auto2$horsepower, na.rm=TRUE)
## [1] 100.9558
mean(Auto2$weight)
## [1] 2933.963
mean(Auto2$acceleration)
## [1] 15.72305
mean(Auto2$year)
## [1] 77.15265
mean(Auto2$origin)
## [1] 1.598131
#Standard deviation of each quantitative predictor in the data subset:
sd(Auto2$mpg)
## [1] 7.908184
sd(Auto2$cylinders)
## [1] 1.653486
sd(Auto2$displacement)
## [1] 99.63539
sd(Auto2$horsepower, na.rm=TRUE)
## [1] 35.89557
sd(Auto2$weight)
## [1] 810.6429
sd(Auto2$acceleration)
## [1] 2.680514
sd(Auto2$year)
## [1] 3.11123
sd(Auto2$origin)
## [1] 0.8161627
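Negative indexing drops rows rather than selecting them, so `Auto[-c(10:85), ]` removes `length(10:85)` = 76 rows and leaves 397 - 76 = 321. A small sketch of the same idea, assuming a hypothetical 100-row data frame `toy` in place of `Auto`:

```r
# Hypothetical 100-row data frame standing in for Auto
toy <- data.frame(id = 1:100, value = seq(0.5, 50, by = 0.5))

# A negative index removes rows 10 through 85 and keeps the rest
toy2 <- toy[-(10:85), ]

length(10:85)      # number of rows removed: 76
nrow(toy2)         # 100 - 76 = 24
head(toy2$id, 10)  # ids 1 to 9, then 86
```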
#Part E
plot(Auto$mpg, Auto$weight)
The scatterplot comparing mpg and weight suggests that as the weight of a car decreases, its miles per gallon increases. This makes sense because a heavier car requires more gas to accelerate its mass than a lighter car does.
plot(Auto$year, Auto$mpg)
The scatterplot comparing the model year of the car and mpg suggests that mpg, and therefore fuel efficiency, has increased over the years as newer methods of building cars were developed and implemented.
plot(Auto$cylinders, Auto$horsepower)
The scatterplot comparing the number of cylinders that a car has and the car’s horsepower shows a trend suggesting that the greater the number of cylinders of a car, the greater the horsepower of that car.
#Part F
pairs(Auto)
In the pairwise scatterplots between mpg and the other variables, the plots comparing mpg and displacement, mpg and horsepower, and mpg and weight each show a trend in which mpg decreases as the x variable increases. These findings suggest that if you knew the displacement, horsepower, and weight of a car, it could be possible to predict its gas mileage. The plots comparing mpg and cylinders and mpg and origin have similar structures, suggesting that these variables may also help predict mpg. In general, though, it would be difficult to predict the mpg of a car precisely given only these variables, because several confounding variables can affect the prediction.
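The visual trends in the scatterplot matrix can be checked numerically with a correlation matrix. The sketch below is hedged: it uses R's built-in `mtcars` data as a stand-in for `Auto` (an assumption made so the example runs without a download; its columns `disp`, `hp`, and `wt` roughly play the roles of displacement, horsepower, and weight), so the numbers it prints are not the Auto results.

```r
# mtcars ships with R; it is used here only as a stand-in for Auto
vars <- c("mpg", "disp", "hp", "wt")

# Pairwise Pearson correlations between the selected columns
cors <- cor(mtcars[, vars])

# Correlation of mpg with each variable; negative values match a
# downward trend in the corresponding scatterplot
round(cors["mpg", ], 2)

# For Auto, horsepower contains missing values, so one would likely
# need cor(..., use = "complete.obs") to avoid NA results.
```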
##Problem 2: Working with vectors and matrices
#Data
#Box office Star Wars (in millions!)
new_hope <- c(460.998, 314.4)
empire_strikes <- c(290.475, 247.900)
return_jedi <- c(309.306, 165.8)
#Vectors region and titles, used for naming
region <-c("US", "non-US")
titles <- c("A New Hope", "The Empire Stikes Back", "Return of the Jedi")
#Part A
#Construct a matrix
starWars <- matrix(c(new_hope, empire_strikes, return_jedi), nrow=3, byrow = TRUE)
starWars
## [,1] [,2]
## [1,] 460.998 314.4
## [2,] 290.475 247.9
## [3,] 309.306 165.8
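The `byrow = TRUE` argument matters here because `matrix()` fills column by column by default. A minimal sketch contrasting the two fill orders, using the same six figures as above (the names `vals`, `by_row`, and `by_col` are introduced only for this illustration):

```r
# The six box office figures in the order they were concatenated
vals <- c(460.998, 314.4, 290.475, 247.9, 309.306, 165.8)

# byrow = TRUE: each consecutive pair becomes one row (one movie)
by_row <- matrix(vals, nrow = 3, byrow = TRUE)

# Default (byrow = FALSE): values run down the first column first,
# which would mix US and non-US figures within a column
by_col <- matrix(vals, nrow = 3)

by_row
by_col
```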
#Part B
#Rename the row and columns of the matrix
colnames(starWars) <- region
rownames(starWars) <- titles
starWars
## US non-US
## A New Hope 460.998 314.4
## The Empire Stikes Back 290.475 247.9
## Return of the Jedi 309.306 165.8
#Part C
#Worldwide box office figures for each movie
worldwide_figures<- rowSums(starWars)
worldwide_figures
## A New Hope The Empire Stikes Back Return of the Jedi
## 775.398 538.375 475.106
#Part D
#Add column to matrix for worldwide sales
worldwide_wars <- cbind(starWars, worldwide_figures)
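`worldwide_wars` is created above but never printed. A short self-contained sketch of what the `cbind()` step produces, rebuilding the matrix from the figures in the Data section. The column name `worldwide` is my own choice for illustration; in the code above the new column would instead take the name `worldwide_figures`.

```r
# Rebuild the original trilogy matrix (figures in millions)
starWars <- matrix(
  c(460.998, 314.4, 290.475, 247.9, 309.306, 165.8),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    c("A New Hope", "The Empire Strikes Back", "Return of the Jedi"),
    c("US", "non-US")
  )
)

# Append the row totals as a third, explicitly named column
worldwide_wars <- cbind(starWars, worldwide = rowSums(starWars))
worldwide_wars
```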
#Prequel Data
phantom_menace <- c(474.5, 552.5)
attack_clones <- c(310.7, 338.7)
revenge_sith <- c(380.3, 468.5)
titles2<-c("The Phantom Menace", "Attack of the Clones", "Revent of the Sith")
#Part E
#Create a matrix for the prequels with row and column names
starWars2 <- matrix(c(phantom_menace, attack_clones, revenge_sith),
nrow=3, byrow = TRUE)
colnames(starWars2) <- region
rownames(starWars2) <- titles2
starWars2
## US non-US
## The Phantom Menace 474.5 552.5
## Attack of the Clones 310.7 338.7
## Revent of the Sith 380.3 468.5
#Part F
#Combine starWars and starWars2 into one matrix
allStarWars <- rbind(starWars,starWars2)
allStarWars
## US non-US
## A New Hope 460.998 314.4
## The Empire Stikes Back 290.475 247.9
## Return of the Jedi 309.306 165.8
## The Phantom Menace 474.500 552.5
## Attack of the Clones 310.700 338.7
## Revent of the Sith 380.300 468.5
#Part G
#Total non-US revenue for all movies
colSums(allStarWars)
## US non-US
## 2226.279 2087.800
The total non-US revenue for all the movies is $2,087.8 million.
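`colSums()` returns both totals at once; to pull out only the non-US figure, the column can be selected by name. A minimal sketch, rebuilding the two columns from the values shown above (figures in millions; the vectors `us` and `non_us` are introduced only for this illustration):

```r
# US and non-US figures for all six movies, in the order printed above
us     <- c(460.998, 290.475, 309.306, 474.5, 310.7, 380.3)
non_us <- c(314.4, 247.9, 165.8, 552.5, 338.7, 468.5)
allStarWars <- cbind(US = us, "non-US" = non_us)

# Select the single total by column name
colSums(allStarWars)["non-US"]

# Equivalent: sum just that column
sum(allStarWars[, "non-US"])
```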
##Problem 3: College data
#Part A
#Reading data into R:
college<-read.csv("http://faculty.marshall.usc.edu/gareth-james/ISL/College.csv",header=TRUE)
#Part B
#Use the first column (the name of each university) as the row names
rownames(college)<- college[,1]
#Drop the name column so that the first data column becomes 'Private'
college <- college[,-1]
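Parts A and B can also be done in one step, since `read.csv()` accepts a `row.names` argument giving the column to use as row names. A minimal sketch, assuming a tiny inline CSV with made-up schools and values (these are not rows from College.csv):

```r
# Made-up two-row CSV standing in for College.csv
csv_text <- "Name,Private,Apps
School A,Yes,1660
School B,No,2186"

# row.names = 1 uses the first column as row names and removes it
# from the data columns in the same call
toy <- read.csv(text = csv_text, row.names = 1)

rownames(toy)
names(toy)
```

With the real file, the equivalent call would presumably be `read.csv(<url>, header = TRUE, row.names = 1)`, which avoids the separate assignment and column drop.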
#Part C
#Numerical summary of the variables in the data set
summary(college)
## Private Apps Accept Enroll Top10perc
## No :212 Min. : 81 Min. : 72 Min. : 35 Min. : 1.00
## Yes:565 1st Qu.: 776 1st Qu.: 604 1st Qu.: 242 1st Qu.:15.00
## Median : 1558 Median : 1110 Median : 434 Median :23.00
## Mean : 3002 Mean : 2019 Mean : 780 Mean :27.56
## 3rd Qu.: 3624 3rd Qu.: 2424 3rd Qu.: 902 3rd Qu.:35.00
## Max. :48094 Max. :26330 Max. :6392 Max. :96.00
## Top25perc F.Undergrad P.Undergrad Outstate
## Min. : 9.0 Min. : 139 Min. : 1.0 Min. : 2340
## 1st Qu.: 41.0 1st Qu.: 992 1st Qu.: 95.0 1st Qu.: 7320
## Median : 54.0 Median : 1707 Median : 353.0 Median : 9990
## Mean : 55.8 Mean : 3700 Mean : 855.3 Mean :10441
## 3rd Qu.: 69.0 3rd Qu.: 4005 3rd Qu.: 967.0 3rd Qu.:12925
## Max. :100.0 Max. :31643 Max. :21836.0 Max. :21700
## Room.Board Books Personal PhD
## Min. :1780 Min. : 96.0 Min. : 250 Min. : 8.00
## 1st Qu.:3597 1st Qu.: 470.0 1st Qu.: 850 1st Qu.: 62.00
## Median :4200 Median : 500.0 Median :1200 Median : 75.00
## Mean :4358 Mean : 549.4 Mean :1341 Mean : 72.66
## 3rd Qu.:5050 3rd Qu.: 600.0 3rd Qu.:1700 3rd Qu.: 85.00
## Max. :8124 Max. :2340.0 Max. :6800 Max. :103.00
## Terminal S.F.Ratio perc.alumni Expend
## Min. : 24.0 Min. : 2.50 Min. : 0.00 Min. : 3186
## 1st Qu.: 71.0 1st Qu.:11.50 1st Qu.:13.00 1st Qu.: 6751
## Median : 82.0 Median :13.60 Median :21.00 Median : 8377
## Mean : 79.7 Mean :14.09 Mean :22.74 Mean : 9660
## 3rd Qu.: 92.0 3rd Qu.:16.50 3rd Qu.:31.00 3rd Qu.:10830
## Max. :100.0 Max. :39.80 Max. :64.00 Max. :56233
## Grad.Rate
## Min. : 10.00
## 1st Qu.: 53.00
## Median : 65.00
## Mean : 65.46
## 3rd Qu.: 78.00
## Max. :118.00
#Scatterplot matrix of the first ten variables in the data
pairs(college[,1:10])
#Boxplot of Outstate vs Private
plot(college$Private, college$Outstate, main = "Outstate vs Private", xlab = "Private", ylab = "Outstate")
#Create Elite variable based on whether the proportion of students coming from the top 10% of their high school class exceeds 50%
Elite <- rep("No", nrow(college))
Elite[college$Top10perc > 50] <- "Yes"
Elite <- as.factor(Elite)
college <- data.frame(college, Elite)
summary(college)
## Private Apps Accept Enroll Top10perc
## No :212 Min. : 81 Min. : 72 Min. : 35 Min. : 1.00
## Yes:565 1st Qu.: 776 1st Qu.: 604 1st Qu.: 242 1st Qu.:15.00
## Median : 1558 Median : 1110 Median : 434 Median :23.00
## Mean : 3002 Mean : 2019 Mean : 780 Mean :27.56
## 3rd Qu.: 3624 3rd Qu.: 2424 3rd Qu.: 902 3rd Qu.:35.00
## Max. :48094 Max. :26330 Max. :6392 Max. :96.00
## Top25perc F.Undergrad P.Undergrad Outstate
## Min. : 9.0 Min. : 139 Min. : 1.0 Min. : 2340
## 1st Qu.: 41.0 1st Qu.: 992 1st Qu.: 95.0 1st Qu.: 7320
## Median : 54.0 Median : 1707 Median : 353.0 Median : 9990
## Mean : 55.8 Mean : 3700 Mean : 855.3 Mean :10441
## 3rd Qu.: 69.0 3rd Qu.: 4005 3rd Qu.: 967.0 3rd Qu.:12925
## Max. :100.0 Max. :31643 Max. :21836.0 Max. :21700
## Room.Board Books Personal PhD
## Min. :1780 Min. : 96.0 Min. : 250 Min. : 8.00
## 1st Qu.:3597 1st Qu.: 470.0 1st Qu.: 850 1st Qu.: 62.00
## Median :4200 Median : 500.0 Median :1200 Median : 75.00
## Mean :4358 Mean : 549.4 Mean :1341 Mean : 72.66
## 3rd Qu.:5050 3rd Qu.: 600.0 3rd Qu.:1700 3rd Qu.: 85.00
## Max. :8124 Max. :2340.0 Max. :6800 Max. :103.00
## Terminal S.F.Ratio perc.alumni Expend
## Min. : 24.0 Min. : 2.50 Min. : 0.00 Min. : 3186
## 1st Qu.: 71.0 1st Qu.:11.50 1st Qu.:13.00 1st Qu.: 6751
## Median : 82.0 Median :13.60 Median :21.00 Median : 8377
## Mean : 79.7 Mean :14.09 Mean :22.74 Mean : 9660
## 3rd Qu.: 92.0 3rd Qu.:16.50 3rd Qu.:31.00 3rd Qu.:10830
## Max. :100.0 Max. :39.80 Max. :64.00 Max. :56233
## Grad.Rate Elite
## Min. : 10.00 No :699
## 1st Qu.: 53.00 Yes: 78
## Median : 65.00
## Mean : 65.46
## 3rd Qu.: 78.00
## Max. :118.00
There are 78 elite universities.
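The count of elite universities can also be read off directly instead of scanning the `summary()` output. A minimal sketch, assuming a made-up vector `top10` in place of `college$Top10perc`; it builds the factor with `ifelse()` rather than `rep()` plus assignment, which is an alternative to the approach above and not the original method:

```r
# Made-up Top10perc values; note that exactly 50 does not count as elite
top10 <- c(23, 16, 60, 85, 50, 51)

# Vectorised construction of the Yes/No factor
elite <- factor(ifelse(top10 > 50, "Yes", "No"), levels = c("No", "Yes"))

# Counts per level, and the number of elite schools on its own
table(elite)
sum(top10 > 50)
```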
#Side-by-side boxplots of Outstate vs Elite
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.2.1 ✓ purrr 0.3.3
## ✓ tibble 2.1.3 ✓ dplyr 0.8.3
## ✓ tidyr 1.0.2 ✓ stringr 1.4.0
## ✓ readr 1.3.1 ✓ forcats 0.4.0
## ── Conflicts ────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
ggplot(college, aes(x=Elite, y=Outstate, fill=Elite))+
geom_boxplot()
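The same side-by-side boxplots can be drawn without loading the tidyverse, using base R's formula interface to `boxplot()`. A sketch assuming a small made-up data frame `toy` in place of `college` (the values are illustrative only):

```r
# Made-up stand-in for the Outstate and Elite columns of college
toy <- data.frame(
  Outstate = c(7000, 9000, 11000, 18000, 20000, 21000),
  Elite    = factor(c("No", "No", "No", "Yes", "Yes", "Yes"))
)

# Outstate ~ Elite draws one box per level of Elite; the summary
# statistics used for the boxes are returned invisibly
b <- boxplot(Outstate ~ Elite, data = toy,
             main = "Outstate vs Elite", xlab = "Elite", ylab = "Outstate")

b$names  # group labels, in plotting order
b$stats  # one column of five-number summaries per group
```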