Example:
# load the mvehicles data set and keep every vehicle that is not a truck
library(DataAnalytics)
data(mvehicles)
cars=mvehicles[mvehicles$bodytype != "Truck",]
Display the first 50 values of cars$make == "Ford" to verify that it is a logical vector.
fords_logic=(cars$make=="Ford")
fords_f50=head(fords_logic,n=50)
fords_f50
## [1] FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE FALSE
## [12] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [23] FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE FALSE
## [34] TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
## [45] FALSE FALSE FALSE FALSE FALSE FALSE
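The check above can also be made without reading the printed values. As a minimal sketch on a small made-up data frame (toy and its contents are hypothetical, not taken from mvehicles), is.logical() confirms the type and sum() counts the TRUE entries:

```r
# Hypothetical toy data standing in for the cars data frame
toy <- data.frame(make = c("Ford", "Kia", "Ford", "BMW"),
                  emv  = c(30000, 18000, 42000, 60000),
                  stringsAsFactors = FALSE)

is_ford <- toy$make == "Ford"   # element-wise comparison, one result per row
is.logical(is_ford)             # TRUE: the comparison returns a logical vector
sum(is_ford)                    # 2: each TRUE counts as 1, so this counts the Fords
```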
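Subset the cars data frame by a two-step process to only the “Ford” make. That is, create the row selection logical vector in one statement and select observations from the cars data frame in the second.
cars=mvehicles[mvehicles$bodytype != "Truck",]
# step 1: logical vector that is TRUE for the Ford rows
fords_logic=(cars$make=="Ford")
# step 2: use the logical vector as the row index
fords_car=cars[fords_logic,]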
head(fords_car)
## make year model style origin bodytype
## 6 Ford 2011 Edge Limited 4dr SUV (3.5L 6cyl 6A) America SUV
## 7 Ford 2011 Edge Limited 4dr SUV AWD (3.5L 6cyl 6A) America SUV
## 8 Ford 2011 Edge SE 4dr SUV (3.5L 6cyl 6A) America SUV
## 9 Ford 2011 Edge SEL 4dr SUV (3.5L 6cyl 6A) America SUV
## 10 Ford 2011 Edge SEL 4dr SUV AWD (3.5L 6cyl 6A) America SUV
## 29 Ford 2011 Explorer 4dr SUV (3.5L 6cyl 6A) America SUV
## emv seats roominess mpg warranty appearance luxury sporty
## 6 35441.33 5 0.3917259 22 0.3333333 0.6205152 0.4171626 0.3805899
## 7 38441.82 5 0.3917259 20 0.3333333 0.6205152 0.4171626 0.3805899
## 8 26999.12 5 0.3917259 21 0.3333333 0.6205152 0.4171626 0.3805899
## 9 30960.75 5 0.3917259 22 0.3333333 0.6205152 0.4171626 0.3805899
## 10 32619.84 5 0.3917259 20 0.3333333 0.6205152 0.4171626 0.3805899
## 29 27579.96 7 0.4912934 20 0.3333333 0.6020550 0.4169447 0.4211533
## speed technology sales
## 6 0.6533921 0.8125000 21424
## 7 0.6327483 0.8125000 16496
## 8 0.6533921 0.1875000 15617
## 9 0.6533921 0.6562500 23186
## 10 0.6327483 0.6562500 13485
## 29 0.6544853 0.1354167 9784
How many Kias are in the cars data frame? Hint: nrow() tells you how many rows are in a data frame.
cars=mvehicles[mvehicles$bodytype != "Truck",]
Kias_car=cars[cars$make=="Kia",]
num1=nrow(Kias_car)
num1
## [1] 43
cars=mvehicles[mvehicles$bodytype != "Truck",]
emv10=cars[cars$emv>100000,]
num2=nrow(emv10)
num2
## [1] 37
var1b=mvehicles[mvehicles$origin == "Europe" & mvehicles$emv > 75000,]
var1bnum=mean(var1b$sales)
var1bnum
## [1] 626.6957
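The selections above share one pattern: build a logical vector, then use it as the row index. Conditions can be combined with & (both must hold) or | (either may hold). A rough sketch on hypothetical toy data (all names and values are invented for illustration):

```r
# Hypothetical toy data; not the mvehicles data set
toy <- data.frame(origin = c("Europe", "Asia", "Europe", "America"),
                  emv    = c(80000, 90000, 40000, 120000),
                  sales  = c(500, 700, 3000, 200),
                  stringsAsFactors = FALSE)

# & keeps rows where both conditions are TRUE
pricey_euro <- toy[toy$origin == "Europe" & toy$emv > 75000, ]
nrow(pricey_euro)          # 1
mean(pricey_euro$sales)    # 500

# | keeps rows where at least one condition is TRUE
asia_or_lux <- toy[toy$origin == "Asia" | toy$emv > 100000, ]
nrow(asia_or_lux)          # 2
```

Note the trailing comma inside the brackets: it leaves the column index empty, so all columns are kept.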
In many data sets, there are long text fields which describe an observation. These fields are not formatted in any consistent way, so it is difficult to use simple comparison methods to fetch observations. However, we can use regular expressions to find the observations for which a given variable contains some character pattern. Regular expressions are complicated in full generality, but we can get a lot of use out of a very simple expression.
The style variable in cars is a general text description variable. We can find the rows whose style contains a given string with the command grepl("string",column,ignore.case=TRUE). For example, grepl("hybrid",cars$style,ignore.case=TRUE) creates a logical vector (TRUE or FALSE) that marks the rows corresponding to hybrids, and cars[grepl("hybrid",cars$style,ignore.case=TRUE),] fetches only the hybrids.
var1c1=nrow(cars[grepl("4dr",cars$style,ignore.case=TRUE),])
var1c1
## [1] 1105
test=cars[grepl("4dr",cars$style,ignore.case=TRUE),]
var1c2=nrow(test[test$bodytype=='Sedan',])
var1c2
## [1] 432
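To see what grepl() returns and what ignore.case changes, here is a small sketch on a made-up character vector (the style strings are hypothetical, not taken from cars$style):

```r
# Hypothetical style descriptions
style <- c("Hybrid 4dr Sedan", "2dr Coupe", "LE 4dr SUV", "hybrid 2dr Hatchback")

is_hybrid <- grepl("hybrid", style, ignore.case = TRUE)
is_hybrid                 # TRUE FALSE FALSE TRUE
style[is_hybrid]          # only the descriptions that contain "hybrid"

# Without ignore.case the match is case-sensitive, so "Hybrid" is missed
grepl("hybrid", style)    # FALSE FALSE FALSE TRUE
```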
R is very careful about distinguishing various types of numbers, including missing values (NA), infinity (Inf), and Not a Number (NaN).
Generate Inf or -Inf by mathematical operations.
2/0
## [1] Inf
log(0)
## [1] -Inf
0/0
## [1] NaN
log(-1)
## Warning in log(-1): NaNs produced
## [1] NaN
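These special values can also be tested for directly. The sketch below uses base R's is.na(), is.nan(), is.finite() and is.infinite() on a small vector made up for illustration:

```r
x <- c(1, NA, Inf, -Inf, NaN)

is.na(x)        # FALSE  TRUE FALSE FALSE  TRUE   (NaN also counts as missing)
is.nan(x)       # FALSE FALSE FALSE FALSE  TRUE   (NA is not NaN)
is.finite(x)    #  TRUE FALSE FALSE FALSE FALSE   (only ordinary numbers)
is.infinite(x)  # FALSE FALSE  TRUE  TRUE FALSE
```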
Define vec=c(rnorm(10),NA). Why is mean(vec) missing? How do you avoid this?
Ans: mean(vec) is missing because any arithmetic that involves an NA returns NA, so a single missing element makes the whole average NA. To avoid this, drop the missing value when calculating the mean by setting na.rm=TRUE.
vec=c(rnorm(10),NA)
mean(vec,na.rm=TRUE)
## [1] -0.08487253
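Because rnorm() gives different numbers on every run, the effect of na.rm is easier to check on a fixed vector. A minimal sketch with made-up values:

```r
vec <- c(2, 4, 6, NA)

mean(vec)                 # NA: one missing value makes the result missing
mean(vec, na.rm = TRUE)   # 4: average of the three observed values
mean(vec[!is.na(vec)])    # 4: same result, dropping the NA explicitly
sum(is.na(vec))           # 1: number of missing values
```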
Read Chapter 6 of Lander.
Download slots.txt and slots.csv from the course web site. Open both files in RStudio. This is data on plays of slot machines. B refers to a “bar”, BB two bars, BBB three bars, DD double diamonds, C three cherries and 7 means three sevens. Write an R script to read in slots.txt and convert it to a data.frame that is identical to slots.csv. Use read.delim() or read.table() to read in the files and convert the meaningless numbers to symbols. Don’t forget to name the variables.
slots=read.table("~/Documents/2015 Fall/Data Analytics/HW1/slots.txt", quote="\"")
colnames(slots)<-c("w1","w2","w3","prize","night")
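# recode the three wheel columns (w1, w2, w3); codes 0 and 7 are left unchanged
for(i in 1:3)
{
  slots[slots[,i]==1,i]='B'     # one bar
  slots[slots[,i]==2,i]='BB'    # two bars
  slots[slots[,i]==3,i]='BBB'   # three bars
  slots[slots[,i]==5,i]='DD'    # double diamonds
  slots[slots[,i]==6,i]='C'     # cherries
}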
head(slots)
## w1 w2 w3 prize night
## 1 BB 0 0 0 1
## 2 0 DD B 0 1
## 3 0 0 0 0 1
## 4 BB 0 0 0 1
## 5 0 0 0 0 1
## 6 0 0 B 0 1
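An alternative to the repeated assignments is a named lookup vector, which recodes a whole column in one step. This is only a sketch on a made-up vector of codes (w and symbols are hypothetical names), and it assumes the same mapping as the loop above, with 0 and 7 kept as they are:

```r
# Hypothetical numeric codes for one wheel
w <- c(2, 0, 5, 1, 7, 6, 3)

# Named lookup vector: names are the codes, values are the symbols
symbols <- c("0" = "0", "1" = "B", "2" = "BB", "3" = "BBB",
             "5" = "DD", "6" = "C", "7" = "7")

# Index the lookup vector by the codes (as character) and drop the names
recoded <- unname(symbols[as.character(w)])
recoded   # "BB" "0" "DD" "B" "7" "C" "BBB"
```

A code with no entry in symbols would come back as NA, which makes unexpected values easy to spot.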