237M,1 Data Analytics | P. Rossi

Problem Set 1

Example:

# load the mvehicles data set and keep every vehicle except trucks as the cars data frame
library(DataAnalytics)
data(mvehicles)
cars=mvehicles[mvehicles$bodytype != "Truck",] 

Question 1 : More on Subsetting Observations

Q1, Part A

  1. Display the contents of the first 50 elements of the vector, cars$make == "Ford", to verify that it is a logical vector.
fords_logic=(cars$make=="Ford")
fords_f50=head(fords_logic,n=50)
fords_f50
##  [1] FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE
## [12] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [23] FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE FALSE
## [34]  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE
## [45] FALSE FALSE FALSE FALSE FALSE FALSE
  1. Subset the cars data frame to only the “Ford” make in a two-step process. That is, create the row-selection logical vector in one statement and select observations from the cars data frame in the second.
fords_logic=(cars$make=="Ford")
fords_car=cars[fords_logic,]
head(fords_car)
##    make year    model                              style  origin bodytype
## 6  Ford 2011     Edge     Limited 4dr SUV (3.5L 6cyl 6A) America      SUV
## 7  Ford 2011     Edge Limited 4dr SUV AWD (3.5L 6cyl 6A) America      SUV
## 8  Ford 2011     Edge          SE 4dr SUV (3.5L 6cyl 6A) America      SUV
## 9  Ford 2011     Edge         SEL 4dr SUV (3.5L 6cyl 6A) America      SUV
## 10 Ford 2011     Edge     SEL 4dr SUV AWD (3.5L 6cyl 6A) America      SUV
## 29 Ford 2011 Explorer             4dr SUV (3.5L 6cyl 6A) America      SUV
##         emv seats roominess mpg  warranty appearance    luxury    sporty
## 6  35441.33     5 0.3917259  22 0.3333333  0.6205152 0.4171626 0.3805899
## 7  38441.82     5 0.3917259  20 0.3333333  0.6205152 0.4171626 0.3805899
## 8  26999.12     5 0.3917259  21 0.3333333  0.6205152 0.4171626 0.3805899
## 9  30960.75     5 0.3917259  22 0.3333333  0.6205152 0.4171626 0.3805899
## 10 32619.84     5 0.3917259  20 0.3333333  0.6205152 0.4171626 0.3805899
## 29 27579.96     7 0.4912934  20 0.3333333  0.6020550 0.4169447 0.4211533
##        speed technology sales
## 6  0.6533921  0.8125000 21424
## 7  0.6327483  0.8125000 16496
## 8  0.6533921  0.1875000 15617
## 9  0.6533921  0.6562500 23186
## 10 0.6327483  0.6562500 13485
## 29 0.6544853  0.1354167  9784
  1. How many Kia observations are there in the cars data frame? Hint: nrow() tells you how many rows are in a data frame.
cars=mvehicles[mvehicles$bodytype != "Truck",] 
Kias_car=cars[cars$make=="Kia",]
num1=nrow(Kias_car)
num1
## [1] 43
  1. How many cars have a price (emv) that is greater than $100,000?
cars=mvehicles[mvehicles$bodytype != "Truck",] 
emv10=cars[cars$emv>100000,]
num2=nrow(emv10)
num2
## [1] 37
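
Because TRUE counts as 1 and FALSE as 0 in arithmetic, counts like the ones above can also be obtained by summing the logical vector directly, without building an intermediate data frame. The following is a minimal sketch on a small made-up data frame; toy_cars and all of its values are hypothetical and are not taken from mvehicles.

```r
# Hypothetical toy data; the values are invented for illustration only
toy_cars = data.frame(
  make = c("Ford", "Kia", "Kia", "BMW", "Ford"),
  emv  = c(30000, 18000, 21000, 120000, 105000),
  stringsAsFactors = FALSE
)

# Step 1: build the logical row selector
is_kia = (toy_cars$make == "Kia")

# Step 2: use it to select rows, then count them
n_kia_rows = nrow(toy_cars[is_kia, ])

# Equivalent shortcut: TRUE is treated as 1, so sum() counts the matches
n_kia_sum = sum(is_kia)

# Same idea for a numeric comparison
n_expensive = sum(toy_cars$emv > 100000)
```

Both approaches give the same count; the sum() form simply skips the intermediate subset.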

Q1, Part B

  1. What is the average sales for all cars made in Europe with price above $75,000?
var1b=mvehicles[mvehicles$origin == "Europe" & mvehicles$emv > 75000,] 
var1bnum=mean(var1b$sales)
var1bnum
## [1] 626.6957
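
The selection in this part combines two comparisons with the element-wise & operator. As a rough sketch on hypothetical toy data (the data frame toy and its values are invented), the same kind of selection can be written with bracket indexing or with base R's subset().

```r
# Hypothetical toy data; the values are invented for illustration only
toy = data.frame(
  origin = c("Europe", "Europe", "Asia", "Europe"),
  emv    = c(80000, 50000, 90000, 100000),
  sales  = c(500, 9000, 700, 300),
  stringsAsFactors = FALSE
)

# Element-wise AND: a row is kept only if both conditions are TRUE
keep = toy$origin == "Europe" & toy$emv > 75000

# Average sales over the selected rows, using bracket indexing
avg_bracket = mean(toy[keep, ]$sales)

# subset() evaluates the condition using the data frame's own column names
avg_subset = mean(subset(toy, origin == "Europe" & emv > 75000)$sales)
```

Here rows 1 and 4 satisfy both conditions, so both versions average the same two sales values.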

In many data sets, there are long text fields which describe an observation. These fields are not formatted in any consistent way, so it is difficult to use simple comparison methods to fetch observations. However, we can use regular expressions to find any observations for which a given variable contains some character pattern. Regular expressions are complicated to use in full generality, but we can get a lot of use out of a very simple expression.

The style variable in cars is a general text description. We can find the rows whose style contains a given string with the command grepl("string",column,ignore.case=TRUE). For example, grepl("hybrid",cars$style,ignore.case=TRUE) creates a logical vector (TRUE or FALSE) that selects the rows corresponding to hybrids, so cars[grepl("hybrid",cars$style,ignore.case=TRUE),] will fetch only hybrids.
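
To see what grepl() returns, here is a minimal sketch on a short character vector. The strings in styles are hypothetical and only imitate the format of the style variable.

```r
# Hypothetical style strings, invented for illustration
styles = c("SE 4dr Sedan (2.5L 4cyl)",
           "4dr Sedan hybrid (2.4L 4cyl CVT)",
           "2dr Coupe (3.7L 6cyl 6M)",
           "HYBRID 4dr SUV AWD")

# TRUE wherever the pattern occurs anywhere in the string, regardless of case
is_hybrid = grepl("hybrid", styles, ignore.case = TRUE)

# Without ignore.case, only the exact lower-case spelling matches
is_hybrid_exact = grepl("hybrid", styles)

# The logical vector can be used directly to pick out the matching elements
hybrids = styles[is_hybrid]
```

In this sketch the upper-case "HYBRID" entry is matched only when ignore.case=TRUE, which is why the option is worth setting for free-form text fields.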

Q1, Part C

  1. How many four door vehicles are in cars?
var1c1=nrow(cars[grepl("4dr",cars$style,ignore.case=TRUE),]) 
var1c1
## [1] 1105
  1. How many four door sedans are in cars?
test=cars[grepl("4dr",cars$style,ignore.case=TRUE),]
var1c2=nrow(test[test$bodytype=='Sedan',])
var1c2
## [1] 432

Question 2 : More on R data types

R is very careful about distinguishing various types of numbers, including missing values (NA), Infinity (Inf), and Not a Number (NaN).

  1. Give two examples in which you create Inf or -Inf by mathematical operations.
2/0
## [1] Inf
log(0)
## [1] -Inf
  1. Give two examples in which you create NaNs.
0/0
## [1] NaN
log(-1)
## Warning in log(-1): NaNs produced
## [1] NaN
  1. Create a vector with some missing values in it, e.g. vec=c(rnorm(10),NA). Why is mean(vec) missing? How do you avoid this? Ans: mean(vec) is NA because any arithmetic that involves an NA returns NA, so a single missing value makes the whole mean missing. To avoid this, omit the missing values when calculating the mean by setting na.rm=TRUE.
vec=c(rnorm(10),NA)
mean(vec,na.rm=TRUE)
## [1] -0.08487253
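
The special values in this question can be told apart with base R's is.na(), is.nan(), is.infinite() and is.finite(). The sketch below uses a hypothetical vector x built only for illustration.

```r
# One ordinary number followed by NA, NaN, Inf and -Inf
x = c(1, NA, 0/0, 1/0, -1/0)

na_flags       = is.na(x)        # TRUE for both NA and NaN
nan_flags      = is.nan(x)       # TRUE only for NaN
infinite_flags = is.infinite(x)  # TRUE only for Inf and -Inf
finite_flags   = is.finite(x)    # TRUE only for ordinary numbers

# na.rm=TRUE removes the missing value before averaging
m_all = mean(c(1, 2, NA), na.rm = TRUE)

# Keeping only finite entries also discards Inf and -Inf
m_finite = mean(x[is.finite(x)])
```

Note that na.rm=TRUE removes NA and NaN but not Inf, so filtering with is.finite() is the stricter choice when infinite values may be present.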

Question 3 : Cleaning data

Read Chapter 6 of Lander.

Download slots.txt and slots.csv from the course web site. Open both files in RStudio. This is data on plays of slot machines. B refers to a “bar”, BB two bars, BBB three bars, DD double diamonds, C three cherries and 7 means three sevens. Write an R script to read in slots.txt and convert it to a data.frame that is identical to slots.csv. Use read.delim() or read.table() to read in the files and convert the meaningless numbers to symbols. Don’t forget to name the variables.

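# read the raw file; adjust the path to wherever slots.txt was saved
slots=read.table("~/Documents/2015 Fall/Data Analytics/HW1/slots.txt", quote="\"")
colnames(slots)<-c("w1","w2","w3","prize","night")
# recode the three wheel columns from numbers to symbols; 0 and 7 are left unchanged
for(i in 1:3)
{
  slots[slots[,i]==1,i]='B'
  slots[slots[,i]==2,i]='BB'
  slots[slots[,i]==3,i]='BBB'
  slots[slots[,i]==5,i]='DD'
  slots[slots[,i]==6,i]='C'
}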
head(slots)
##   w1 w2 w3 prize night
## 1 BB  0  0     0     1
## 2  0 DD  B     0     1
## 3  0  0  0     0     1
## 4 BB  0  0     0     1
## 5  0  0  0     0     1
## 6  0  0  B     0     1
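
An alternative to the repeated assignments in the loop is a named lookup vector that maps each numeric code to its symbol. This is only a sketch: the three rows read in below are a hypothetical stand-in for slots.txt, and the names raw and symbols are introduced here for illustration. The code-to-symbol mapping follows the loop above, with 0 and 7 mapped to themselves.

```r
# Hypothetical stand-in for slots.txt: five whitespace-separated numeric columns
raw = read.table(text = "2 0 0 0 1
0 5 1 0 1
7 7 7 80 0", header = FALSE)
colnames(raw) = c("w1", "w2", "w3", "prize", "night")

# Named lookup vector: names are the numeric codes, values are the symbols
symbols = c("0" = "0", "1" = "B", "2" = "BB", "3" = "BBB",
            "5" = "DD", "6" = "C", "7" = "7")

for (col in c("w1", "w2", "w3")) {
  # Index the lookup by the code converted to a string; unname() drops the labels
  raw[[col]] = unname(symbols[as.character(raw[[col]])])
}
```

One consequence of this design is that any code missing from the lookup becomes NA, which makes unexpected values in the raw file easy to spot.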