Using the str() function, I identified the quantitative predictors: mpg, cylinders, displacement, horsepower, weight, and acceleration.
Auto <- read.table("http://faculty.marshall.usc.edu/gareth-james/ISL/Auto.data",
header=TRUE,
na.strings = "?")
str(Auto)
## 'data.frame': 397 obs. of 9 variables:
## $ mpg : num 18 15 18 16 17 15 14 14 14 15 ...
## $ cylinders : int 8 8 8 8 8 8 8 8 8 8 ...
## $ displacement: num 307 350 318 304 302 429 454 440 455 390 ...
## $ horsepower : num 130 165 150 150 140 198 220 215 225 190 ...
## $ weight : num 3504 3693 3436 3433 3449 ...
## $ acceleration: num 12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
## $ year : int 70 70 70 70 70 70 70 70 70 70 ...
## $ origin : int 1 1 1 1 1 1 1 1 1 1 ...
## $ name : Factor w/ 304 levels "amc ambassador brougham",..: 49 36 231 14 161 141 54 223 241 2 ...
Using a matrix, I found the range of each quantitative predictor.
# Drop rows with missing values before computing summaries
Auto <- na.omit(Auto)
ranges <- matrix(c(range(Auto$mpg),
range(Auto$cylinders),
range(Auto$displacement),
range(Auto$horsepower),
range(Auto$weight),
range(Auto$acceleration)),
nrow=2, ncol=6, byrow=FALSE)
colnames(ranges)<- c("mpg","cylinders","displacement","horsepower","weight","acceleration")
rownames(ranges)<- c("low", "high")
ranges
## mpg cylinders displacement horsepower weight acceleration
## low 9.0 3 68 46 1613 8.0
## high 46.6 8 455 230 5140 24.8
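The same kind of table can be built more compactly by applying range() to every column with sapply(). The sketch below is only an illustration: it uses R's built-in mtcars data as a stand-in so that it runs without downloading Auto, and the names quant_cols and ranges_alt are introduced here, not taken from the analysis above.

```r
# Stand-in data: mtcars ships with R. With the Auto data the call
# would be sapply(Auto[, 1:6], range) instead (assumption).
quant_cols <- c("mpg", "cyl", "disp", "hp", "wt", "qsec")

# sapply() applies range() to each column and binds the results
# into a 2 x 6 matrix, one column per variable.
ranges_alt <- sapply(mtcars[, quant_cols], range)
rownames(ranges_alt) <- c("low", "high")
ranges_alt
```

Because sapply() keeps the column names, no separate colnames() call is needed.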
Using matrices, I found the mean and standard deviation of each quantitative predictor.
means<- matrix(c(mean(Auto$mpg),
mean(Auto$cylinders),
mean(Auto$displacement),
mean(Auto$horsepower),
mean(Auto$weight),
mean(Auto$acceleration)),
nrow=6, ncol=1, byrow=TRUE)
SDs<- matrix(c(sd(Auto$mpg),
sd(Auto$cylinders),
sd(Auto$displacement),
sd(Auto$horsepower),
sd(Auto$weight),
sd(Auto$acceleration)),
nrow=6, ncol=1, byrow =TRUE)
Mean_SD<-cbind(means,SDs)
rownames(Mean_SD)<- c("mpg","cylinders","displacement","horsepower","weight","acceleration")
colnames(Mean_SD)<- c("Mean","SD")
Mean_SD
## Mean SD
## mpg 23.445918 7.805007
## cylinders 5.471939 1.705783
## displacement 194.411990 104.644004
## horsepower 104.469388 38.491160
## weight 2977.584184 849.402560
## acceleration 15.541327 2.758864
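A shorter route to a mean and SD table is to return both statistics from one function inside sapply(). This is a sketch under assumptions: mtcars (built into R) stands in for Auto, and mean_sd_alt is a name made up for the example.

```r
# Stand-in data: mtcars ships with R (assumption; swap in Auto[, 1:6]).
quant_cols <- c("mpg", "cyl", "disp", "hp", "wt", "qsec")

# For each column return a named pair; sapply() gives a 2 x 6 matrix,
# and t() flips it so that variables are rows and statistics are columns.
mean_sd_alt <- t(sapply(mtcars[, quant_cols],
                        function(x) c(Mean = mean(x), SD = sd(x))))
mean_sd_alt
```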
Ignoring the 10th through 85th observations, I found the range, mean, and standard deviation of each quantitative predictor in the remaining observations.
Range
apply(Auto[-c(10:85), 1:6], 2 , range)
## mpg cylinders displacement horsepower weight acceleration
## [1,] 11.0 3 68 46 1649 8.5
## [2,] 46.6 8 455 230 4997 24.8
Mean
apply(Auto[-c(10:85), 1:6], 2 , mean)
## mpg cylinders displacement horsepower weight acceleration
## 24.404430 5.373418 187.240506 100.721519 2935.971519 15.726899
SD
apply(Auto[-c(10:85), 1:6], 2 , sd)
## mpg cylinders displacement horsepower weight acceleration
## 7.867283 1.654179 99.678367 35.708853 811.300208 2.693721
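Negative indexing drops rows by position, not by row name. If na.omit() has already removed rows, positions 10 to 85 of the cleaned data need not be the same observations as rows 10 to 85 of the raw file. The sketch below illustrates this with a made-up data frame; toy_df and its columns are hypothetical.

```r
# Hypothetical toy data: 100 rows, with one value set to NA.
toy_df <- data.frame(id = 1:100, value = as.numeric(1:100))
toy_df$value[33] <- NA

clean_df  <- na.omit(toy_df)        # drops row 33, leaving 99 rows
subset_df <- clean_df[-(10:85), ]   # drops positions 10..85 (76 rows)

nrow(subset_df)

# Smallest and largest id among the dropped rows: the last one is 86,
# not 85, because id 33 was already removed by na.omit().
dropped_ids <- setdiff(clean_df$id, subset_df$id)
range(dropped_ids)
```

If the intent is to exclude specific original rows, filtering on row names or an id column before calling na.omit() is one way to make that explicit.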
Using the complete data set, I created pairwise scatterplots.
pairs(Auto)
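The scatterplots above show a negative correlation between horsepower and acceleration: as acceleration increases, horsepower decreases. This looks unexpected at first, but acceleration in this data set is recorded as the time in seconds to go from 0 to 60 mph, so more powerful cars have lower values.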
plot(y=Auto$horsepower, x=Auto$acceleration, xlab = "Acceleration", ylab = "Horsepower")
There is a positive correlation between weight and displacement: as weight increases, displacement increases.
plot(y=Auto$displacement, x=Auto$weight, xlab = "Weight", ylab = "Displacement")
There also appears to be a positive correlation between weight and horsepower: as horsepower increases, so does weight.
plot(y=Auto$weight, x=Auto$horsepower, xlab = "Horsepower", ylab = "Weight")
Supposing we want to predict mpg from the other variables, we might want to look at the relationship between mpg and horsepower.
plot(y=Auto$mpg, x=Auto$horsepower, xlab = "Horsepower", ylab = "mpg")
There is a clear negative correlation between the two variables: as horsepower increases, mpg decreases. Weight has a similar relationship to mpg.
plot(y=Auto$mpg, x=Auto$weight, xlab = "Weight", ylab = "mpg")
When weight increases, mpg decreases. By understanding these relationships we might be able to predict a car's mpg from its weight and horsepower.
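The direction and strength of relationships like these can be checked numerically with cor(), and a linear model is one simple way to turn them into a prediction. The sketch below is an illustration under assumptions: it uses the built-in mtcars data (columns mpg, hp, wt) as a stand-in for Auto, and the linear form of the model is an assumption rather than something established by the plots.

```r
# Stand-in data: mtcars ships with R. With the Auto data the columns
# would be Auto$mpg, Auto$horsepower and Auto$weight (assumption).
cor_mpg_hp <- cor(mtcars$mpg, mtcars$hp)  # the sign gives the direction
cor_mpg_wt <- cor(mtcars$mpg, mtcars$wt)
c(hp = cor_mpg_hp, wt = cor_mpg_wt)

# A simple linear model of mpg on weight and horsepower.
fit <- lm(mpg ~ wt + hp, data = mtcars)
coef(fit)

# Predicted mpg for a hypothetical car (wt is in 1000 lbs in mtcars).
predict(fit, newdata = data.frame(wt = 3, hp = 120))
```

A negative correlation means the two variables move in opposite directions; a value near zero would suggest the scatterplot trend is weak.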
I created a matrix of the Star Wars films in which each row represents one movie.
new_hope <- c(460.998, 314.4)
empire_strikes <- c(290.475, 247.900)
return_jedi <- c(309.306, 165.8)
region <- c("US", "non-US")
titles <- c("A New Hope", "The Empire Strikes Back", "Return of the Jedi")
StarWars <- matrix(c(new_hope, empire_strikes, return_jedi), nrow = 3, byrow = TRUE)
I named the columns and rows.
region <- c("US", "non-US")
titles <- c("A New Hope", "The Empire Strikes Back", "Return of the Jedi")
rownames(StarWars) <- titles
colnames(StarWars) <- region
StarWars
## US non-US
## A New Hope 460.998 314.4
## The Empire Strikes Back 290.475 247.9
## Return of the Jedi 309.306 165.8
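The row and column names can also be supplied when the matrix is created, through the dimnames argument of matrix(). This is a minimal sketch using the same figures; box_office and star_wars_named are names chosen for the example.

```r
# Box office figures (US, non-US), copied from the vectors above.
box_office <- c(460.998, 314.4,
                290.475, 247.9,
                309.306, 165.8)

# dimnames takes a list: row names first, then column names.
star_wars_named <- matrix(
  box_office, nrow = 3, byrow = TRUE,
  dimnames = list(
    c("A New Hope", "The Empire Strikes Back", "Return of the Jedi"),
    c("US", "non-US")
  )
)
star_wars_named
```

With names in place, cells can be read by label, for example star_wars_named["A New Hope", "US"].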
Using the rowSums() function, I found the worldwide box office figures.
worldwide_vector <- rowSums(StarWars)
I added a column to the matrix for worldwide sales using cbind().
all_wars_matrix <- cbind(StarWars, worldwide_vector)
all_wars_matrix
## US non-US worldwide_vector
## A New Hope 460.998 314.4 775.398
## The Empire Strikes Back 290.475 247.9 538.375
## Return of the Jedi 309.306 165.8 475.106
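The new column inherits the variable name worldwide_vector as its header. Passing a named argument to cbind() sets a cleaner header directly. A small sketch, rebuilding the matrix so it runs on its own (star_wars and with_total are illustrative names):

```r
# Rebuild the original-trilogy matrix so the sketch is self-contained.
star_wars <- matrix(
  c(460.998, 314.4, 290.475, 247.9, 309.306, 165.8),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    c("A New Hope", "The Empire Strikes Back", "Return of the Jedi"),
    c("US", "non-US")
  )
)

# Naming the argument in cbind() sets the new column's name.
with_total <- cbind(star_wars, worldwide = rowSums(star_wars))
with_total
```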
I created a second matrix for the prequel movies.
phantom_menace <- c(474.5, 552.5)
attack_clones <- c(310.7, 338.7)
revenge_sith <- c(380.3, 468.5)
titles2 <- c("The Phantom Menace", "Attack of the Clones", "Revenge of the Sith")
StarWars2 <- matrix(c(phantom_menace, attack_clones, revenge_sith), nrow=3, byrow = TRUE)
StarWars2
## [,1] [,2]
## [1,] 474.5 552.5
## [2,] 310.7 338.7
## [3,] 380.3 468.5
rownames(StarWars2) <- titles2
colnames(StarWars2) <- region
I constructed a new matrix that includes all six Star Wars films.
AllWars <- rbind(StarWars, StarWars2)
I found the total revenue of all the Star Wars films by region.
Total_revenue <- colSums(AllWars)
Total_revenue
## US non-US
## 2226.279 2087.800
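Beyond the per-region totals, the same figures give a grand total and each region's share of it. The sketch below rebuilds the six-film data so that it runs on its own; all_wars, grand_total and region_share are names introduced here for illustration.

```r
# US and non-US figures for the six films, in the order used above.
us     <- c(460.998, 290.475, 309.306, 474.5, 310.7, 380.3)
non_us <- c(314.4, 247.9, 165.8, 552.5, 338.7, 468.5)
all_wars <- cbind(US = us, "non-US" = non_us)

region_totals <- colSums(all_wars)        # total per region
grand_total   <- sum(region_totals)       # total across both regions
region_share  <- region_totals / grand_total

round(region_share, 3)
```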
I used the summary() function to produce a numerical summary of the college data.
summary(college)
## Private Apps Accept Enroll Top10perc
## No :212 Min. : 81 Min. : 72 Min. : 35 Min. : 1.00
## Yes:565 1st Qu.: 776 1st Qu.: 604 1st Qu.: 242 1st Qu.:15.00
## Median : 1558 Median : 1110 Median : 434 Median :23.00
## Mean : 3002 Mean : 2019 Mean : 780 Mean :27.56
## 3rd Qu.: 3624 3rd Qu.: 2424 3rd Qu.: 902 3rd Qu.:35.00
## Max. :48094 Max. :26330 Max. :6392 Max. :96.00
## Top25perc F.Undergrad P.Undergrad Outstate
## Min. : 9.0 Min. : 139 Min. : 1.0 Min. : 2340
## 1st Qu.: 41.0 1st Qu.: 992 1st Qu.: 95.0 1st Qu.: 7320
## Median : 54.0 Median : 1707 Median : 353.0 Median : 9990
## Mean : 55.8 Mean : 3700 Mean : 855.3 Mean :10441
## 3rd Qu.: 69.0 3rd Qu.: 4005 3rd Qu.: 967.0 3rd Qu.:12925
## Max. :100.0 Max. :31643 Max. :21836.0 Max. :21700
## Room.Board Books Personal PhD
## Min. :1780 Min. : 96.0 Min. : 250 Min. : 8.00
## 1st Qu.:3597 1st Qu.: 470.0 1st Qu.: 850 1st Qu.: 62.00
## Median :4200 Median : 500.0 Median :1200 Median : 75.00
## Mean :4358 Mean : 549.4 Mean :1341 Mean : 72.66
## 3rd Qu.:5050 3rd Qu.: 600.0 3rd Qu.:1700 3rd Qu.: 85.00
## Max. :8124 Max. :2340.0 Max. :6800 Max. :103.00
## Terminal S.F.Ratio perc.alumni Expend
## Min. : 24.0 Min. : 2.50 Min. : 0.00 Min. : 3186
## 1st Qu.: 71.0 1st Qu.:11.50 1st Qu.:13.00 1st Qu.: 6751
## Median : 82.0 Median :13.60 Median :21.00 Median : 8377
## Mean : 79.7 Mean :14.09 Mean :22.74 Mean : 9660
## 3rd Qu.: 92.0 3rd Qu.:16.50 3rd Qu.:31.00 3rd Qu.:10830
## Max. :100.0 Max. :39.80 Max. :64.00 Max. :56233
## Grad.Rate
## Min. : 10.00
## 1st Qu.: 53.00
## Median : 65.00
## Mean : 65.46
## 3rd Qu.: 78.00
## Max. :118.00
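The summary above relies on a data frame named college that is not created anywhere in the code shown. One possible way to obtain it is sketched below, assuming the data is the College data set shipped with the ISLR package; that is an assumption, since the original loading step is not shown. The guard has_islr is a name made up for the example, and the block does nothing if the package is not installed.

```r
# Assumption: the ISLR package is installed and its College data
# matches the data summarised above. Skipped quietly otherwise.
has_islr <- requireNamespace("ISLR", quietly = TRUE)

if (has_islr) {
  college <- ISLR::College
  print(dim(college))   # number of rows and columns
}
```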
I used the pairs() function to create a scatterplot matrix of the first ten columns.
pairs(college[,1:10])
I used ggplot to produce side-by-side boxplots of Outstate versus Private.
library(tidyverse)
ggplot(college, aes(y=Outstate, fill= Private))+ geom_boxplot()+
theme_minimal()
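Part D
I binned the Top10perc variable to create a new variable, Elite, which flags colleges where more than 50% of new students come from the top 10% of their high school class.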
Elite <- rep("No", nrow(college))
Elite[college$Top10perc > 50] = "Yes"
Elite <- as.factor(Elite)
college <- data.frame(college, Elite)
summary(Elite)
## No Yes
## 699 78
ggplot(college, aes(y=Outstate, fill= Elite))+ geom_boxplot()+
theme_minimal()
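The same binning can be written in one step with ifelse() and factor(). This is a sketch on made-up values, not the college data; top10perc and elite are hypothetical names, and the threshold of 50 is taken from the code above.

```r
# Hypothetical Top10perc values for six colleges.
top10perc <- c(23, 51, 50, 96, 10, 75)

# ifelse() picks a label per element; factor() fixes the level order
# so that "No" comes first, matching the summary shown above.
elite <- factor(ifelse(top10perc > 50, "Yes", "No"),
                levels = c("No", "Yes"))
table(elite)
```

Note that a value of exactly 50 falls in the "No" group because the comparison is strict.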