Read the data into RStudio:
df = read.csv(file="ToyotaCorolla.csv", header=TRUE, sep=",")
This loads ToyotaCorolla.csv into the data frame df, so the rest of the R Markdown document can use it.
Which pairs among the variables seem to be correlated?
Plot the pairwise relationship between two variables:
#little.df = data.table(df.nums)
#little.df = little.df[sample(.N, 20)]
#plot(little.df)
plot(df$KM, df$Price, xlab="Kilometers Driven", ylab="Price", main="Plot 1")
#PLOT 1
cor(df$KM, df$Price)
## [1] -0.5699602
plot(df$Quarterly_Tax, df$Age_08_04, xlab="Quarterly Tax", ylab="Age of Car", main="Plot 2")
#PLOT 2
cor(df$Quarterly_Tax, df$Age_08_04)
## [1] -0.1984305
plot(df$HP, df$Price, xlab="Horsepower", ylab="Price", main = "Plot 3")
#PLOT 3
cor(df$HP, df$Price)
## [1] 0.3149898
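Note that cor(df$var1, df$var2) returns the Pearson correlation coefficient of two variables, var1 and var2.
Of the three pairs plotted, KM and Price show the strongest relationship (r ≈ -0.57); HP and Price (r ≈ 0.31) and Quarterly_Tax and Age_08_04 (r ≈ -0.20) are weaker.
I could do this for every combination of variables, but that would be tedious.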
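Rather than calling cor() pair by pair, the whole correlation matrix can be computed in one call. The following is a minimal sketch, assuming a small made-up data frame `toy` that stands in for df; its column names and values are hypothetical and are not taken from ToyotaCorolla.csv.

```r
# Hypothetical stand-in for df; values are invented for illustration only.
toy <- data.frame(
  Price = c(13500, 13750, 13950, 14950, 12950, 16900),
  KM    = c(46986, 72937, 41711, 48000, 94612, 19700),
  HP    = c(90, 90, 90, 90, 69, 110),
  Fuel  = c("Diesel", "Diesel", "Petrol", "Petrol", "Diesel", "Petrol")
)

# Keep only numeric columns, since cor() rejects character columns.
nums <- toy[sapply(toy, is.numeric)]

# Pearson correlation for every pair of numeric columns at once.
cor.mat <- cor(nums)
print(round(cor.mat, 2))

# List each pair once (upper triangle), strongest absolute correlation first.
idx <- which(upper.tri(cor.mat), arr.ind = TRUE)
pairs.df <- data.frame(
  var1 = rownames(cor.mat)[idx[, "row"]],
  var2 = colnames(cor.mat)[idx[, "col"]],
  r    = cor.mat[idx]
)
pairs.df <- pairs.df[order(-abs(pairs.df$r)), ]
print(pairs.df)
```

On the real data the same pattern would apply to df directly, and pairs(nums) would draw all the scatter plots in a single grid.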
Describe how you would convert these to binary variables (dummy variables)
Use R’s ifelse function to create binary dummy variables from categorical variables: each row gets 1 if it matches the chosen category and 0 otherwise.
Example:
df$Fuel_Type.Dummy = ifelse(df$Fuel_Type == "Petrol", 1, 0)
summary(df$Fuel_Type.Dummy)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 1.0000 1.0000 0.8802 1.0000 1.0000
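The ifelse approach builds one dummy at a time. When a categorical variable has more than two levels, model.matrix can build one indicator column per level in a single step. This is only a sketch: the data frame `toy` and its three level names are assumptions made for illustration.

```r
# Hypothetical fuel types; the level names are assumed for illustration.
toy <- data.frame(Fuel_Type = c("Petrol", "Diesel", "Petrol", "CNG", "Petrol"))
toy$Fuel_Type <- factor(toy$Fuel_Type)

# "~ Fuel_Type - 1" drops the intercept, so every level gets its own 0/1 column.
dummies <- model.matrix(~ Fuel_Type - 1, data = toy)
print(dummies)

# For regression, drop one column (the reference level) to avoid
# perfect collinearity among the dummies.
dummies.ref <- dummies[, -1, drop = FALSE]
print(colnames(dummies.ref))
```

Each row of `dummies` contains exactly one 1, which is a quick sanity check that the encoding is complete.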
Read the data into RStudio:
df = read.csv(file="RidingMowers.csv", header=TRUE, sep=",")
summary(df)
## Income Lot_Size Ownership
## Min. : 33.00 Min. :14.00 Nonowner:12
## 1st Qu.: 52.35 1st Qu.:17.50 Owner :12
## Median : 64.80 Median :19.00
## Mean : 68.44 Mean :18.95
## 3rd Qu.: 83.10 3rd Qu.:20.80
## Max. :110.10 Max. :23.60
This loads RidingMowers.csv into the data frame df, so the rest of the R Markdown document can use it.
[classify] households as prospective `owners` or `non_owners` on the basis of `Income` (in $1000s) and `Lot Size` (in 1000s of square feet)
Scatter plot of Income against Lot_Size, where color represents the binary categorical variable Ownership:
Load the plotly package first with library(plotly); the plot and legend calls below are base R graphics and also run without it.
plot(df$Lot_Size, df$Income, col=df$Ownership)
legend('topright', legend = levels(df$Ownership), col = seq_along(levels(df$Ownership)), cex = 0.8, pch = 1)
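Coloring by Ownership relies on that column being a factor. Since R 4.0.0, read.csv keeps strings as character by default, so an explicit conversion may be needed. The sketch below assumes a few invented rows in a data frame `toy`; they are not the RidingMowers.csv values.

```r
# Hypothetical rows standing in for RidingMowers.csv; values are invented.
toy <- data.frame(
  Income    = c(60.0, 85.5, 64.8, 43.2, 75.0, 52.8),
  Lot_Size  = c(18.4, 16.8, 21.6, 20.4, 19.6, 20.8),
  Ownership = c("Owner", "Owner", "Owner", "Nonowner", "Nonowner", "Nonowner")
)

# Convert explicitly so the column has levels regardless of R version.
toy$Ownership <- factor(toy$Ownership)
lv <- levels(toy$Ownership)

# as.integer() on a factor gives the level codes 1, 2, ..., used here as colors.
plot(toy$Lot_Size, toy$Income, col = as.integer(toy$Ownership),
     xlab = "Lot Size (1000s sq ft)", ylab = "Income ($1000s)")

# One legend entry per level, with colors matching the level codes.
legend("topright", legend = lv, col = seq_along(lv), pch = 1, cex = 0.8)
```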
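We can classify this through a k-means cluster analysis using the kmeans function. Note that c(df$Income, df$Lot_Size) below stacks the two columns into a single 48-value vector, so kmeans clusters one mixed column rather than (Income, Lot_Size) pairs; df[, c("Income", "Lot_Size")] would keep them as two separate features. The output that follows comes from the stacked version: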
il = c(df$Income, df$Lot_Size)
df.il = data.frame(il)
variation = c()
for(i in 1:5) {
km = kmeans(df.il, i)
avg_var = mean(km$withinss)
variation = append(variation, avg_var)
}
plot(variation, type = "l")
library(DataCombine)  # provides slide()
variation.df = data.frame(variation)
variation.lag = slide(variation.df, Var = "variation", slideBy = -1)
##
## Remember to put variation.df in time order before running.
##
## Lagging variation by 1 time units.
variation.lag$ROC = abs(variation.lag$variation - variation.lag$`variation-1`)
print(variation.lag)
## variation variation-1 ROC
## 1 38534.4481 NA NA
## 2 3907.6278 38534.4481 34626.8203
## 3 777.5662 3907.6278 3130.0617
## 4 373.0103 777.5662 404.5558
## 5 267.6722 373.0103 105.3381
max(variation.lag$ROC, na.rm = TRUE)
## [1] 34626.82
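The “elbow”, where the rate of change is largest, occurs at k=2: going from 1 to 2 clusters cuts the average within-cluster variation by about 34,627, and every later step gains far less. This suggests that 2 is the optimal number of clusters.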
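The elbow search can also be sketched with the two features kept as separate columns. The version below is an illustration under stated assumptions, not the analysis above: the data frame `toy` is synthetic, the features are standardized with scale(), tot.withinss is used in place of the mean of withinss, and base R's diff() replaces the lag column.

```r
set.seed(1)  # kmeans() starts from random centers; fix the seed for repeatability

# Synthetic two-feature data (invented): two loose groups of 12 points each.
toy <- data.frame(
  Income   = c(rnorm(12, mean = 55, sd = 8), rnorm(12, mean = 85, sd = 8)),
  Lot_Size = c(rnorm(12, mean = 17, sd = 1), rnorm(12, mean = 21, sd = 1))
)

# Standardize so Income's larger scale does not dominate the distances.
toy.scaled <- scale(toy)

# Total within-cluster sum of squares for k = 1..5.
wss <- sapply(1:5, function(k) {
  kmeans(toy.scaled, centers = k, nstart = 10)$tot.withinss
})

plot(1:5, wss, type = "b", xlab = "k", ylab = "Total within-cluster SS")

# Drop in WSS when moving from k-1 to k clusters.
drops <- -diff(wss)
names(drops) <- paste0("k=", 2:5)
print(drops)
```

The k with the largest drop, followed by much smaller drops, marks the elbow. Whether to standardize is a design choice: without it, the variable with the larger numeric range largely determines the clusters.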
df = read.csv(file="LaptopSalesJanuary2008.csv", header=TRUE, sep = ",")
summary(df)
## Date Configuration Customer.Postcode Store.Postcode
## 1/28/2008 16:01: 4 Min. : 1.0 W1T 1DG : 21 SW1P 3AU:1604
## 1/28/2008 23:10: 4 1st Qu.: 77.0 EC4V 2BA: 19 SE1 2BN :1232
## 1/1/2008 10:06 : 3 Median :209.5 SE16 2HB: 19 SW1V 4QQ:1145
## 1/1/2008 12:24 : 3 Mean :207.2 SW7 4TE : 19 NW5 2QH : 870
## 1/11/2008 2:02 : 3 3rd Qu.:315.0 E2 8QY : 18 E2 0RY : 856
## 1/12/2008 15:38: 3 Max. :368.0 SE1 0LH : 18 SE8 3JD : 450
## (Other) :7936 (Other) :7842 (Other) :1799
## Retail.Price Screen.Size..Inches. Battery.Life..Hours. RAM..GB.
## Min. :300.0 Min. :15 Min. :4.000 Min. :1.000
## 1st Qu.:455.0 1st Qu.:15 1st Qu.:4.000 1st Qu.:1.000
## Median :490.0 Median :15 Median :5.000 Median :2.000
## Mean :487.9 Mean :15 Mean :5.139 Mean :1.548
## 3rd Qu.:525.0 3rd Qu.:15 3rd Qu.:6.000 3rd Qu.:2.000
## Max. :665.0 Max. :15 Max. :6.000 Max. :2.000
##
## Processor.Speeds..GHz. Integrated.Wireless. HD.Size..GB.
## Min. :1.500 No :3866 Min. : 40.0
## 1st Qu.:1.500 Yes:4090 1st Qu.: 80.0
## Median :2.000 Median :120.0
## Mean :1.758 Mean :150.4
## 3rd Qu.:2.000 3rd Qu.:300.0
## Max. :2.000 Max. :300.0
##
## Bundled.Applications. OS.X.Customer OS.Y.Customer OS.X.Store
## No :3565 Min. :512253 Min. :164886 Min. :517917
## Yes:4391 1st Qu.:529208 1st Qu.:178716 1st Qu.:528924
## Median :531151 Median :181106 Median :529902
## Mean :530868 Mean :179886 Mean :530748
## 3rd Qu.:533130 3rd Qu.:182060 3rd Qu.:534057
## Max. :549065 Max. :199846 Max. :541428
## NA's :4
## OS.Y.Store CustomerStoreDistance
## Min. :168302 Min. : 0
## 1st Qu.:178440 1st Qu.: 2422
## Median :179641 Median : 3382
## Mean :179808 Mean : 3680
## 3rd Qu.:182961 3rd Qu.: 4346
## Max. :190628 Max. :19892
## NA's :4 NA's :4
Create a sub data frame of the mean Retail Price at each store postcode:
agg = aggregate(df$Retail.Price, list(df$Store.Postcode), mean)
agg
## Group.1 x
## 1 CR7 8LE 488.6190
## 2 E2 0RY 483.1717
## 3 E7 8NW 494.3814
## 4 KT2 5AU 493.9048
## 5 N17 6QA 494.6341
## 6 N3 1DH 487.3684
## 7 NW5 2QH 486.5805
## 8 S1P 3AU 486.2500
## 9 SE1 2BN 486.6802
## 10 SE8 3JD 492.1778
## 11 SW12 9HD 485.2957
## 12 SW18 1NN 493.0389
## 13 SW1P 3AU 488.5069
## 14 SW1V 4QQ 489.3450
## 15 W10 6HQ 489.8667
## 16 W4 3PH 481.0063
Make a bar plot showing the mean price at each store code:
barplot(height = agg$x, names.arg = agg$Group.1, cex.names = 0.25, xlab = "Store Code", main = "Average Price by Store")
Yeah, these are tiny labels…
### Boxplot
boxplot(Retail.Price~Store.Postcode,
data=df,
xlab="Store")
The labels are bigger, but R skips some of them to avoid overlap…
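One way to get readable labels for every store is to rotate the tick labels and widen the bottom margin. The sketch below assumes a synthetic data frame `toy` with 16 made-up store codes and invented prices; only the plotting arguments (las, cex.axis, and par(mar = ...)) are the point.

```r
# Hypothetical data: 16 made-up store codes with invented prices.
set.seed(42)
stores <- sprintf("ST%02d 1AA", 1:16)
toy <- data.frame(
  Store.Postcode = factor(rep(stores, each = 20)),
  Retail.Price   = round(rnorm(16 * 20, mean = 488, sd = 40))
)

# Widen the bottom margin so rotated labels are not clipped.
old.par <- par(mar = c(7, 4, 2, 1))

# las = 2 draws tick labels perpendicular to the axis, which usually
# leaves room for every label; cex.axis shrinks the tick labels only.
boxplot(Retail.Price ~ Store.Postcode, data = toy,
        las = 2, cex.axis = 0.7, xlab = "", ylab = "Retail Price")

par(old.par)  # restore the previous margins
```

The same las and cex.names arguments can be passed to barplot for the bar chart of mean prices.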