This exercise involves the “AUTO” data set we studied during lab. Make sure that the missing values have been removed from the data.
auto<-read.table("http://faculty.marshall.usc.edu/gareth-james/ISL/Auto.data",
header=TRUE,
na.strings = "?")
library(tidyverse)
## -- Attaching packages --------------------------------------------------------------------------------------------------------------------------------------------- tidyverse 1.2.1 --
## v ggplot2 3.2.1 v purrr 0.3.2
## v tibble 2.1.3 v dplyr 0.8.3
## v tidyr 1.0.0 v stringr 1.4.0
## v readr 1.3.1 v forcats 0.4.0
## -- Conflicts ------------------------------------------------------------------------------------------------------------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
str(auto)
## 'data.frame': 397 obs. of 9 variables:
## $ mpg : num 18 15 18 16 17 15 14 14 14 15 ...
## $ cylinders : int 8 8 8 8 8 8 8 8 8 8 ...
## $ displacement: num 307 350 318 304 302 429 454 440 455 390 ...
## $ horsepower : num 130 165 150 150 140 198 220 215 225 190 ...
## $ weight : num 3504 3693 3436 3433 3449 ...
## $ acceleration: num 12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
## $ year : int 70 70 70 70 70 70 70 70 70 70 ...
## $ origin : int 1 1 1 1 1 1 1 1 1 1 ...
## $ name : Factor w/ 304 levels "amc ambassador brougham",..: 49 36 231 14 161 141 54 223 241 2 ...
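The prompt asks for missing values to be removed, while the calls below keep all 397 rows and pass `na.rm = TRUE` for horsepower instead. A minimal sketch of the alternative, `na.omit()`, on a toy data frame (the `toy` values are made up for illustration and are not from the Auto data):

```r
# Toy stand-in for the Auto data; the values are illustrative only.
toy <- data.frame(
  mpg        = c(18, 15, 25, 21),
  horsepower = c(130, NA, 90, NA)
)

# Drop every row that contains at least one NA.
toy_complete <- na.omit(toy)

nrow(toy)                       # rows before
nrow(toy_complete)              # rows after
mean(toy_complete$horsepower)   # no na.rm needed any more
```

Applied to the real data this would be `auto <- na.omit(auto)`. Because whole rows are dropped, statistics for the other columns may shift slightly compared with the values reported here.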
The quantitative predictors are mpg, cylinders, displacement, horsepower, weight, and acceleration. Year could also be treated as quantitative.
The clearly qualitative predictor is name. Origin, year, and cylinders could also be treated as qualitative, since their numeric values can be read as category labels.
range(auto$mpg)
## [1] 9.0 46.6
range(auto$cylinders)
## [1] 3 8
range(auto$displacement)
## [1] 68 455
range(auto$horsepower,
na.rm = TRUE)
## [1] 46 230
range(auto$weight)
## [1] 1613 5140
range(auto$acceleration)
## [1] 8.0 24.8
range(auto$year)
## [1] 70 82
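The per-column `range()` calls above could also be collapsed into one call with `sapply()`. This is only a sketch on a toy data frame (the `toy` values are hypothetical), not the code used to produce the output above:

```r
# Toy stand-in with one missing value; the numbers are illustrative only.
toy <- data.frame(
  mpg        = c(18, 15, 25, 21),
  horsepower = c(130, NA, 90, 110),
  weight     = c(3504, 3693, 2200, 2600)
)

# Apply range() to every column; na.rm = TRUE is passed through to range().
ranges <- sapply(toy, range, na.rm = TRUE)
rownames(ranges) <- c("min", "max")
ranges
```

The result is a matrix with one column per variable, which is easier to scan than a series of separate outputs.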
mean(auto$mpg)
## [1] 23.51587
mean(auto$cylinders)
## [1] 5.458438
mean(auto$displacement)
## [1] 193.5327
mean(auto$horsepower,
na.rm = TRUE)
## [1] 104.4694
mean(auto$weight)
## [1] 2970.262
mean(auto$acceleration)
## [1] 15.55567
The mean for mpg: 23.52
cylinders: 5.458
displacement: 193.5
horsepower: 104.5
weight: 2970
acceleration: 15.56
sd(auto$mpg)
## [1] 7.825804
sd(auto$cylinders)
## [1] 1.701577
sd(auto$displacement)
## [1] 104.3796
sd(auto$horsepower,
na.rm = TRUE)
## [1] 38.49116
sd(auto$weight)
## [1] 847.9041
sd(auto$acceleration)
## [1] 2.749995
The standard deviation for mpg: 7.826
cylinders: 1.702
displacement: 104.4
horsepower: 38.49
weight: 847.9
acceleration: 2.750
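Copying values from the console by hand invites transcription slips, so a summary table could instead be built directly in R. A small sketch, assuming a toy data frame (the `toy` values and the `stats` name are hypothetical):

```r
# Toy stand-in; the numbers are illustrative only.
toy <- data.frame(
  mpg    = c(18, 15, 25, 22),
  weight = c(3500, 3700, 2200, 2600)
)

# For each column, return a named vector holding its mean and standard deviation.
stats <- sapply(toy, function(x) {
  c(mean = mean(x, na.rm = TRUE), sd = sd(x, na.rm = TRUE))
})

# Round only for display; the unrounded values stay in 'stats'.
round(stats, 2)
```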
# Drop rows 10 through 85 (76 rows), leaving 321 observations
auto2<-auto[-(10:85),]
str(auto2)
## 'data.frame': 321 obs. of 9 variables:
## $ mpg : num 18 15 18 16 17 15 14 14 14 13 ...
## $ cylinders : int 8 8 8 8 8 8 8 8 8 8 ...
## $ displacement: num 307 350 318 304 302 429 454 440 455 350 ...
## $ horsepower : num 130 165 150 150 140 198 220 215 225 175 ...
## $ weight : num 3504 3693 3436 3433 3449 ...
## $ acceleration: num 12 11.5 11 12 10.5 10 9 8.5 10 13 ...
## $ year : int 70 70 70 70 70 70 70 70 70 73 ...
## $ origin : int 1 1 1 1 1 1 1 1 1 1 ...
## $ name : Factor w/ 304 levels "amc ambassador brougham",..: 49 36 231 14 161 141 54 223 241 25 ...
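Negative indices tell R which rows to leave out. The sketch below, on a hypothetical one-column data frame with 397 rows, checks that removing rows 10 through 85 leaves 321 rows and that `c(-10:-85)` and `-(10:85)` build the same index vector:

```r
# Hypothetical data frame with the same number of rows as the Auto data.
toy <- data.frame(id = 1:397)

# Negative indices drop rows; drop = FALSE keeps the result a data frame.
kept <- toy[-(10:85), , drop = FALSE]

nrow(kept)       # 397 - 76 rows remain
kept$id[9:10]    # row 9 is now followed directly by original row 86

# Both spellings produce the same vector of negative indices.
identical(c(-10:-85), -(10:85))
```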
# Finding range of values
range(auto2$mpg)
## [1] 11.0 46.6
range(auto2$cylinders)
## [1] 3 8
range(auto2$displacement)
## [1] 68 455
range(auto2$horsepower,
na.rm = TRUE)
## [1] 46 230
range(auto2$weight)
## [1] 1649 4997
range(auto2$acceleration)
## [1] 8.5 24.8
range(auto2$year)
## [1] 70 82
range(auto2$origin)
## [1] 1 3
# Finding Mean of Values
mean(auto2$mpg)
## [1] 24.43863
mean(auto2$cylinders)
## [1] 5.370717
mean(auto2$displacement)
## [1] 187.0498
mean(auto2$horsepower,
na.rm = TRUE)
## [1] 100.9558
mean(auto2$weight)
## [1] 2933.963
mean(auto2$acceleration)
## [1] 15.72305
# Finding standard deviation of values
sd(auto2$mpg)
## [1] 7.908184
sd(auto2$cylinders)
## [1] 1.653486
sd(auto2$displacement)
## [1] 99.63539
sd(auto2$horsepower,
na.rm = TRUE)
## [1] 35.89557
sd(auto2$weight)
## [1] 810.6429
sd(auto2$acceleration)
## [1] 2.680514
auto$cylinders<-as.factor(auto$cylinders)
plot(auto$cylinders, auto$mpg, col="red")
Here, “x” represents the Cylinders column, and “y” represents the MPG column. Comparing Cylinders to MPG shows that gas mileage tends to be higher among cars with 4 or 5 cylinders.
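When the x argument is a factor and y is numeric, `plot()` draws one boxplot per factor level. The group medians behind such boxes can be printed with `tapply()`. The sketch below uses the built-in `mtcars` data purely as a stand-in for the Auto data, so its numbers will differ from the plot above:

```r
# mtcars ships with base R; it is used here only as a stand-in.
cyl <- as.factor(mtcars$cyl)

# Median mpg within each cylinder group (the centre line of each box).
medians <- tapply(mtcars$mpg, cyl, median)
medians
```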
# Keep horsepower numeric so that plot() draws a scatterplot rather than one boxplot per distinct value
plot(auto$horsepower, auto$mpg, col="red")
Here, “x” is Horsepower and “y” is MPG. This shows that cars with more horsepower tend to get lower gas mileage.
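The strength of a trend like this can be summarised with a correlation coefficient. A hedged sketch, again using the built-in `mtcars` data as a stand-in (its `hp` and `mpg` columns are not the Auto columns):

```r
# mtcars ships with base R and is used here only as a stand-in for the Auto data.
hp  <- mtcars$hp
mpg <- mtcars$mpg

# Pearson correlation; use = "complete.obs" drops pairs with a missing value,
# which the real horsepower column would need.
r <- cor(hp, mpg, use = "complete.obs")
r
```

A negative value indicates that the two variables move in opposite directions. It describes an association only and does not by itself show that horsepower causes lower mileage.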
plot(auto$cylinders, auto$acceleration, col="red")
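Here, “x” is Cylinders and “y” is Acceleration. Cars with a mid-range number of cylinders (around 6) appear to have the highest acceleration values. Note that acceleration in this data set is the time in seconds to go from 0 to 60 mph, so a higher value means a slower car.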
I believe that MPG could be predicted from any of the other variables. The plots above already point to a trend in which the best gas mileage goes with moderate values of acceleration, horsepower, and cylinders.
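One way to test whether other variables can predict MPG is to fit a linear model. The following is only a sketch on the built-in `mtcars` data as a stand-in; the choice of predictors (`hp`, `wt`) and the `new_car` values are assumptions made for illustration, not part of the exercise:

```r
# mtcars is a stand-in; in it, wt is weight in units of 1000 lbs.
fit <- lm(mpg ~ hp + wt, data = mtcars)

# Estimated intercept and slopes.
coef(fit)

# Predicted mpg for a hypothetical car.
new_car <- data.frame(hp = 110, wt = 2.8)
pred <- predict(fit, newdata = new_car)
pred
```

With the Auto data the analogous call would use `horsepower` and `weight` as column names, and `summary(fit)` would report how much of the variation in mpg the predictors explain.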
The following data are the box office sales of the Star Wars movies.
# Box office Star Wars (in millions!)
new_hope <- c(460.998, 314.4)
empire_strikes <- c(290.475, 247.900)
return_jedi <- c(309.306, 165.8)
# Vectors region and titles, used for naming
region <- c("US", "non-US")
titles <- c("A New Hope", "The Empire Strikes Back", "Return of
the Jedi")
Provide the code and output for each of the following tasks:
starWars<-matrix(c(new_hope, empire_strikes, return_jedi), nrow = 3, byrow = TRUE)
starWars
## [,1] [,2]
## [1,] 460.998 314.4
## [2,] 290.475 247.9
## [3,] 309.306 165.8
colnames(starWars)<-region
rownames(starWars)<-titles
starWars
## US non-US
## A New Hope 460.998 314.4
## The Empire Strikes Back 290.475 247.9
## Return of\nthe Jedi 309.306 165.8
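The matrix and its labels can also be created in a single call through the `dimnames` argument. Note that a line break inside a quoted string is kept as a newline character, which is why the third row label prints as `Return of\nthe Jedi`; writing the title on one line avoids that. A sketch using the same figures (the name `box_office` is introduced here for illustration):

```r
# byrow = TRUE fills the matrix one film at a time: US first, then non-US.
box_office <- matrix(
  c(460.998, 314.4,
    290.475, 247.9,
    309.306, 165.8),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    c("A New Hope", "The Empire Strikes Back", "Return of the Jedi"),
    c("US", "non-US")
  )
)

# Cells can then be looked up by name instead of by position.
box_office["Return of the Jedi", "non-US"]
```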
world_wide_box_office<-rowSums(starWars)
world_wide_box_office
## A New Hope The Empire Strikes Back Return of\nthe Jedi
## 775.398 538.375 475.106
# To do this, I bind the original matrix with the Worldwide vector.
cbind(starWars, world_wide_box_office)
## US non-US world_wide_box_office
## A New Hope 460.998 314.4 775.398
## The Empire Strikes Back 290.475 247.9 538.375
## Return of\nthe Jedi 309.306 165.8 475.106
# Prequels
phantom_menace <- c(474.5, 552.5)
attack_clones <- c(310.7, 338.7)
revenge_sith <- c(380.3, 468.5)
titles2<- c("The Phantom Menace", "Attack of the Clones",
"Revenge of the Sith")
starWars2<-matrix(c(phantom_menace, attack_clones, revenge_sith), nrow = 3, byrow = TRUE)
colnames(starWars2)<-region
rownames(starWars2)<-titles2
starWars2
## US non-US
## The Phantom Menace 474.5 552.5
## Attack of the Clones 310.7 338.7
## Revenge of the Sith 380.3 468.5
allStarWars<-rbind(starWars, starWars2)
allStarWars
## US non-US
## A New Hope 460.998 314.4
## The Empire Strikes Back 290.475 247.9
## Return of\nthe Jedi 309.306 165.8
## The Phantom Menace 474.500 552.5
## Attack of the Clones 310.700 338.7
## Revenge of the Sith 380.300 468.5
colSums(allStarWars)
## US non-US
## 2226.279 2087.800
sum(allStarWars[, 2])
## [1] 2087.8
# How do I use colSums() for only non-US? I tried it, but could only find the answer using sum().
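Two ways `colSums()` can return only the non-US total are sketched below. Selecting a single column with `[, "non-US"]` yields a plain vector, and `colSums()` requires at least two dimensions, so `drop = FALSE` is needed to keep the matrix shape. The matrix is rebuilt here from the figures above under the hypothetical name `all_sw`:

```r
# Same six films as above, US and non-US, one row per film.
all_sw <- matrix(
  c(460.998, 314.4,
    290.475, 247.9,
    309.306, 165.8,
    474.5,   552.5,
    310.7,   338.7,
    380.3,   468.5),
  ncol = 2, byrow = TRUE,
  dimnames = list(NULL, c("US", "non-US"))
)

# Option 1: sum every column, then pick the one you need by name.
non_us_1 <- colSums(all_sw)["non-US"]

# Option 2: keep the single column as a one-column matrix before summing.
non_us_2 <- colSums(all_sw[, "non-US", drop = FALSE])

non_us_1
non_us_2
```

Both give the same total as `sum(allStarWars[, 2])`.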
college<-read.csv("http://faculty.marshall.usc.edu/gareth-james/ISL/College.csv",
header=TRUE)
The first column of the data is just the name of each university. We don’t really want R to treat this as a variable. However, it may be handy to have these names for later. Try the following commands:
view(college)
rownames(college)<-college[, 1]
view(college)
You should see that there is now a row.names column with the name of each university recorded. This means that R has given each row a name corresponding to the appropriate university. R will not try to perform calculations on row names. However, we still need to eliminate the first column in the data where the names are stored. Try:
college <- college[,-1]
view(college)
Now you should see that the first data column is “Private”.
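The two steps above (copying the first column into the row names, then dropping it) can also be done while reading the file, using the `row.names` argument of `read.csv()`. A sketch on a small inline CSV; the college names and numbers in `csv_text` are made up for illustration:

```r
# Hypothetical three-line CSV standing in for College.csv.
csv_text <- "Name,Private,Apps
College A,Yes,1660
College B,No,2186"

# row.names = 1 uses the first column as row names and removes it from the data.
toy <- read.csv(text = csv_text, header = TRUE, row.names = 1)

rownames(toy)
names(toy)
```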
summary(college)
## Private Apps Accept Enroll Top10perc
## No :212 Min. : 81 Min. : 72 Min. : 35 Min. : 1.00
## Yes:565 1st Qu.: 776 1st Qu.: 604 1st Qu.: 242 1st Qu.:15.00
## Median : 1558 Median : 1110 Median : 434 Median :23.00
## Mean : 3002 Mean : 2019 Mean : 780 Mean :27.56
## 3rd Qu.: 3624 3rd Qu.: 2424 3rd Qu.: 902 3rd Qu.:35.00
## Max. :48094 Max. :26330 Max. :6392 Max. :96.00
## Top25perc F.Undergrad P.Undergrad Outstate
## Min. : 9.0 Min. : 139 Min. : 1.0 Min. : 2340
## 1st Qu.: 41.0 1st Qu.: 992 1st Qu.: 95.0 1st Qu.: 7320
## Median : 54.0 Median : 1707 Median : 353.0 Median : 9990
## Mean : 55.8 Mean : 3700 Mean : 855.3 Mean :10441
## 3rd Qu.: 69.0 3rd Qu.: 4005 3rd Qu.: 967.0 3rd Qu.:12925
## Max. :100.0 Max. :31643 Max. :21836.0 Max. :21700
## Room.Board Books Personal PhD
## Min. :1780 Min. : 96.0 Min. : 250 Min. : 8.00
## 1st Qu.:3597 1st Qu.: 470.0 1st Qu.: 850 1st Qu.: 62.00
## Median :4200 Median : 500.0 Median :1200 Median : 75.00
## Mean :4358 Mean : 549.4 Mean :1341 Mean : 72.66
## 3rd Qu.:5050 3rd Qu.: 600.0 3rd Qu.:1700 3rd Qu.: 85.00
## Max. :8124 Max. :2340.0 Max. :6800 Max. :103.00
## Terminal S.F.Ratio perc.alumni Expend
## Min. : 24.0 Min. : 2.50 Min. : 0.00 Min. : 3186
## 1st Qu.: 71.0 1st Qu.:11.50 1st Qu.:13.00 1st Qu.: 6751
## Median : 82.0 Median :13.60 Median :21.00 Median : 8377
## Mean : 79.7 Mean :14.09 Mean :22.74 Mean : 9660
## 3rd Qu.: 92.0 3rd Qu.:16.50 3rd Qu.:31.00 3rd Qu.:10830
## Max. :100.0 Max. :39.80 Max. :64.00 Max. :56233
## Grad.Rate
## Min. : 10.00
## 1st Qu.: 53.00
## Median : 65.00
## Mean : 65.46
## 3rd Qu.: 78.00
## Max. :118.00
pairs(college[, 1:10])
college$Private<-as.factor(college$Private)
plot(college$Private, college$Outstate, col="yellow")
Elite<-rep("No", nrow(college))
Elite[college$Top10perc>50]<-"Yes"
Elite<-as.factor(Elite)
college<-data.frame(college, Elite)
summary(Elite)
## No Yes
## 699 78
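The same binning can be written in one step with `ifelse()`. This is an alternative sketch on a made-up vector (`top10` is hypothetical), not the code that produced the counts above:

```r
# Hypothetical Top10perc values; 50 itself is not above the cutoff.
top10 <- c(23, 16, 60, 89, 50)

# Label each school, fixing the level order so "No" is listed first.
elite <- factor(ifelse(top10 > 50, "Yes", "No"), levels = c("No", "Yes"))

table(elite)
```

Setting `levels` explicitly keeps both categories in the table even if one of them happens to be empty.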
plot(college$Elite, college$Outstate, col="forest green")