QMBE 3740 - Data Mining Problem Set 1

Question 1 - ToyotaCorolla.csv
- Part A:
- Part B:
Question 2 - RidingMowers.csv
- Question:
Question 3 - LaptopSalesJanuary2008.csv:
- Barplot
- A note to Brett Devine:

Question 1 - ToyotaCorolla.csv

Read the data into R Studio:

df = read.csv(file="ToyotaCorolla.csv", header=TRUE, sep=",")

This allows R Markdown to access the file, ToyotaCorolla.csv

Part A:

Which pairs among the variables seem to be correlated?

Plot the pairwise relationship between two variables:

#little.df = data.table(df.nums)
#little.df = little.df[sample(.N, 20)]
#plot(little.df)

plot(df$KM, df$Price, xlab="Kilometers Driven", ylab="Price", main="Plot 1")

#PLOT 1
cor(df$KM, df$Price)

## [1] -0.5699602

plot(df$Quarterly_Tax, df$Age_08_04, xlab="Quarterly Tax", ylab="Age of Car", main="Plot 2")

#PLOT 2
cor(df$Quarterly_Tax, df$Age_08_04)

## [1] -0.1984305

plot(df$HP, df$Price, xlab="Horsepower", ylab="Price", main = "Plot 3")

#PLOT 3
cor(df$HP, df$Price)

## [1] 0.3149898

Note that cor(df$var1, df$var2) returns the correlation between two variables, var1 and var2.

I could do this for every combination of variables, but doing it one pair at a time would be tedious.
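One way around the tedium is to compute the whole correlation matrix at once and rank the pairs. This is only a sketch: ranked_correlations is a hypothetical helper name, and the built-in mtcars dataset stands in for df so the example runs on its own.

```r
# Hypothetical helper: list every pair of numeric variables, strongest correlation first
ranked_correlations <- function(data) {
  nums <- data[sapply(data, is.numeric)]            # keep numeric columns only
  cm <- cor(nums, use = "pairwise.complete.obs")    # full correlation matrix
  idx <- which(upper.tri(cm), arr.ind = TRUE)       # each pair once, no diagonal
  pairs <- data.frame(
    var1 = rownames(cm)[idx[, "row"]],
    var2 = colnames(cm)[idx[, "col"]],
    r    = cm[idx]
  )
  pairs[order(-abs(pairs$r)), ]                     # sort by absolute correlation
}

# mtcars is a built-in dataset, used here only as a stand-in for df
res <- ranked_correlations(mtcars)
head(res)
```

Calling ranked_correlations(df) on the Corolla data should work the same way, though a column with no variation would produce NA correlations.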

Part B:

Describe how you would convert these to binary variables (dummy variables)

Use R’s ifelse statement to create binary dummy variables from categorical variables

Example:

df$Fuel_Type.Dummy = ifelse(df$Fuel_Type == "Petrol", 1, 0)
summary(df$Fuel_Type.Dummy)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  1.0000  1.0000  0.8802  1.0000  1.0000
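When a categorical variable has more than two levels, one ifelse per level gets repetitive. A possible shortcut is model.matrix, which builds one 0/1 column per level. The fuel vector below is a toy stand-in for a column such as df$Fuel_Type; its values are illustrative, not taken from the data.

```r
# Toy stand-in for a categorical column (values are illustrative)
fuel <- factor(c("Petrol", "Diesel", "Petrol", "CNG"))

# "- 1" drops the intercept so every level gets its own 0/1 column
dummies <- model.matrix(~ fuel - 1)
colnames(dummies) <- levels(fuel)   # model.matrix orders columns by factor level
dummies
```

For a regression model, one of the columns would usually be dropped to serve as the reference level.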

Question 2 - RidingMowers.csv

Read the data into R Studio:

df = read.csv(file="RidingMowers.csv", header=TRUE, sep=",")
summary(df)

##      Income          Lot_Size        Ownership 
##  Min.   : 33.00   Min.   :14.00   Nonowner:12  
##  1st Qu.: 52.35   1st Qu.:17.50   Owner   :12  
##  Median : 64.80   Median :19.00                
##  Mean   : 68.44   Mean   :18.95                
##  3rd Qu.: 83.10   3rd Qu.:20.80                
##  Max.   :110.10   Max.   :23.60

This allows R Markdown to access the file, RidingMowers.csv

Question:

[classify] households as prospective `owners` or `non_owners` on the basis of `Income` (in $1000s) and `Lot Size` (in 1000s of square feet)

Scatter plot of Income against Lot_Size, where color represents the binary categorical variable Ownership:

Load the package plotly first with library(plotly). (The plot and legend calls below are base R graphics and also run without it.)

plot(df$Lot_Size, df$Income, col=df$Ownership)
legend('topright', legend = levels(df$Ownership), col = seq_along(levels(df$Ownership)), cex = 0.8, pch = 1)

We can group the households with a k-means cluster analysis, using the kmeans command:

il = c(df$Income, df$Lot_Size)
df.il = data.frame(il)

variation = c()
for(i in 1:5) {
  km = kmeans(df.il, i)
  avg_var = mean(km$withinss)
  variation = append(variation, avg_var)
}

plot(variation, type = "l")

variation.df = data.frame(variation) 

library(DataCombine)  # provides slide()
variation.lag = slide(variation.df, Var = "variation", slideBy = -1)

## 
## Remember to put variation.df in time order before running.

## 
## Lagging variation by 1 time units.

variation.lag$ROC = abs(variation.lag$variation - variation.lag$`variation-1`)
print(variation.lag)

##    variation variation-1        ROC
## 1 38534.4481          NA         NA
## 2  3907.6278  38534.4481 34626.8203
## 3   777.5662   3907.6278  3130.0617
## 4   373.0103    777.5662   404.5558
## 5   267.6722    373.0103   105.3381

max(variation.lag$ROC, na.rm = TRUE)

## [1] 34626.82

The “elbow”, taken here as the largest drop in within-cluster variation (34626.82), occurs between k=1 and k=2, so 2 is chosen as the optimal number of clusters.
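One thing to watch: c(df$Income, df$Lot_Size) stacks the two columns into a single vector, so df.il has one column of 48 values rather than two columns of 24. A minimal sketch of the same elbow check on a two-column data frame follows. Assumptions: elbow_curve is a hypothetical helper, the built-in faithful dataset stands in for df[, c("Income", "Lot_Size")], and the scaling, nstart = 10, and tot.withinss are my choices rather than part of the analysis above.

```r
# Hypothetical helper: total within-cluster sum of squares for k = 1..k_max
elbow_curve <- function(data, k_max = 5) {
  scaled <- scale(data)   # put both variables on a common scale
  sapply(1:k_max, function(k) {
    kmeans(scaled, centers = k, nstart = 10)$tot.withinss
  })
}

set.seed(1)                    # k-means starts from random centers
# faithful is a built-in two-column dataset, used only as a stand-in
wss <- elbow_curve(faithful)
plot(wss, type = "b", xlab = "k", ylab = "Total within-cluster SS")
```

Without set.seed the values can change slightly from run to run, because kmeans picks its starting centers at random.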

Question 3 - LaptopSalesJanuary2008.csv:

df = read.csv(file="LaptopSalesJanuary2008.csv", header=TRUE, sep = ",")
summary(df)

##               Date      Configuration   Customer.Postcode  Store.Postcode
##  1/28/2008 16:01:   4   Min.   :  1.0   W1T 1DG :  21     SW1P 3AU:1604  
##  1/28/2008 23:10:   4   1st Qu.: 77.0   EC4V 2BA:  19     SE1 2BN :1232  
##  1/1/2008 10:06 :   3   Median :209.5   SE16 2HB:  19     SW1V 4QQ:1145  
##  1/1/2008 12:24 :   3   Mean   :207.2   SW7 4TE :  19     NW5 2QH : 870  
##  1/11/2008 2:02 :   3   3rd Qu.:315.0   E2 8QY  :  18     E2 0RY  : 856  
##  1/12/2008 15:38:   3   Max.   :368.0   SE1 0LH :  18     SE8 3JD : 450  
##  (Other)        :7936                   (Other) :7842     (Other) :1799  
##   Retail.Price   Screen.Size..Inches. Battery.Life..Hours.    RAM..GB.    
##  Min.   :300.0   Min.   :15           Min.   :4.000        Min.   :1.000  
##  1st Qu.:455.0   1st Qu.:15           1st Qu.:4.000        1st Qu.:1.000  
##  Median :490.0   Median :15           Median :5.000        Median :2.000  
##  Mean   :487.9   Mean   :15           Mean   :5.139        Mean   :1.548  
##  3rd Qu.:525.0   3rd Qu.:15           3rd Qu.:6.000        3rd Qu.:2.000  
##  Max.   :665.0   Max.   :15           Max.   :6.000        Max.   :2.000  
##                                                                           
##  Processor.Speeds..GHz. Integrated.Wireless.  HD.Size..GB.  
##  Min.   :1.500          No :3866             Min.   : 40.0  
##  1st Qu.:1.500          Yes:4090             1st Qu.: 80.0  
##  Median :2.000                               Median :120.0  
##  Mean   :1.758                               Mean   :150.4  
##  3rd Qu.:2.000                               3rd Qu.:300.0  
##  Max.   :2.000                               Max.   :300.0  
##                                                             
##  Bundled.Applications. OS.X.Customer    OS.Y.Customer      OS.X.Store    
##  No :3565              Min.   :512253   Min.   :164886   Min.   :517917  
##  Yes:4391              1st Qu.:529208   1st Qu.:178716   1st Qu.:528924  
##                        Median :531151   Median :181106   Median :529902  
##                        Mean   :530868   Mean   :179886   Mean   :530748  
##                        3rd Qu.:533130   3rd Qu.:182060   3rd Qu.:534057  
##                        Max.   :549065   Max.   :199846   Max.   :541428  
##                                                          NA's   :4       
##    OS.Y.Store     CustomerStoreDistance
##  Min.   :168302   Min.   :    0        
##  1st Qu.:178440   1st Qu.: 2422        
##  Median :179641   Median : 3382        
##  Mean   :179808   Mean   : 3680        
##  3rd Qu.:182961   3rd Qu.: 4346        
##  Max.   :190628   Max.   :19892        
##  NA's   :4        NA's   :4

Create a sub-dataframe of the mean Retail Price at each store postcode:

agg = aggregate(df[, 5], list(df$Store.Postcode), mean)
agg

##     Group.1        x
## 1   CR7 8LE 488.6190
## 2    E2 0RY 483.1717
## 3    E7 8NW 494.3814
## 4   KT2 5AU 493.9048
## 5   N17 6QA 494.6341
## 6    N3 1DH 487.3684
## 7   NW5 2QH 486.5805
## 8   S1P 3AU 486.2500
## 9   SE1 2BN 486.6802
## 10  SE8 3JD 492.1778
## 11 SW12 9HD 485.2957
## 12 SW18 1NN 493.0389
## 13 SW1P 3AU 488.5069
## 14 SW1V 4QQ 489.3450
## 15  W10 6HQ 489.8667
## 16   W4 3PH 481.0063
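The column names Group.1 and x come from aggregate's default interface. A sketch of the formula interface, which keeps the original column names, is shown below. The sales data frame is a made-up stand-in for df (store labels and prices are illustrative only).

```r
# Made-up stand-in for df; only the two relevant columns are included
sales <- data.frame(
  Store.Postcode = c("store_A", "store_A", "store_B"),
  Retail.Price   = c(400, 500, 470)
)

# Mean Retail.Price per Store.Postcode, with readable column names
agg2 <- aggregate(Retail.Price ~ Store.Postcode, data = sales, FUN = mean)
agg2
```

With this form the result has columns Store.Postcode and Retail.Price, so later code would refer to agg2$Retail.Price instead of agg$x.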

Barplot

Make a bar plot showing the mean price at each store postcode:

barplot(height = agg$x, names.arg = agg$Group.1, cex.names = 0.25, xlab = "Store Code", main = "Average Price by Store")

Yeah, these are tiny labels…

Boxplot

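# cex.xlab is not a graphical parameter, so it is left out here;
# cex.axis is the one that controls tick-label size
boxplot(Retail.Price~Store.Postcode,
        data=df,
        xlab="Store")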

Bigger, but fewer labels… (at the default size R skips axis labels that would overlap).
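A possible fix for the label trade-off is to rotate the axis labels with las = 2 and widen the bottom margin, so they can stay readable without overlapping. This is a sketch with made-up stand-in data (agg_demo and its values are illustrative); the margin and cex.names values are guesses that would need tuning for 16 postcodes.

```r
# Made-up stand-in for agg (store labels and prices are illustrative)
agg_demo <- data.frame(
  Group.1 = c("store_A", "store_B", "store_C"),
  x       = c(450, 470, 490)
)

old_par <- par(mar = c(7, 4, 2, 1))   # extra bottom margin for rotated labels
mids <- barplot(height = agg_demo$x,
                names.arg = agg_demo$Group.1,
                las = 2,              # draw axis labels perpendicular to the axis
                cex.names = 0.8,
                main = "Average Price by Store")
par(old_par)                          # restore the previous margins
```

The same las = 2 argument can be passed to boxplot to rotate its store labels.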