For iris, heart and groceries data sets:
iris Data Set
iris <- read.csv(file = "Datasets/iris.csv", header = TRUE, sep = ",")
To explore the variables, use head(), str() and summary():
head(iris,4)
## Type PW PL SW SL
## 1 0 2 14 33 50
## 2 1 24 56 31 67
## 3 1 23 51 31 69
## 4 0 2 10 36 46
str(iris)
## 'data.frame': 150 obs. of 5 variables:
## $ Type: int 0 1 1 0 1 1 2 2 1 2 ...
## $ PW : int 2 24 23 2 20 19 13 16 17 14 ...
## $ PL : int 14 56 51 10 52 51 45 47 45 47 ...
## $ SW : int 33 31 31 36 30 27 28 33 25 32 ...
## $ SL : int 50 67 69 46 65 58 57 63 49 70 ...
summary(iris)
## Type PW PL SW
## Min. :0 Min. : 1.00 Min. :10.00 Min. :20.00
## 1st Qu.:0 1st Qu.: 3.00 1st Qu.:16.00 1st Qu.:28.00
## Median :1 Median :13.00 Median :44.00 Median :30.00
## Mean :1 Mean :11.93 Mean :37.79 Mean :30.55
## 3rd Qu.:2 3rd Qu.:18.00 3rd Qu.:51.00 3rd Qu.:33.00
## Max. :2 Max. :25.00 Max. :69.00 Max. :44.00
## SL
## Min. :43.00
## 1st Qu.:51.00
## Median :58.00
## Mean :58.45
## 3rd Qu.:64.00
## Max. :79.00
In this case the variables are:
| Variable Name | Type of Data | Measurement |
|---|---|---|
| Type, (presumably species) | Categorical | The type of flower (i.e. species) |
| PW, (petal width) | Quantitative | Linear Distance |
| SW, (sepal width) | Quantitative | Linear Distance |
| PL, (petal length) | Quantitative | Linear Distance |
| SL, (sepal length) | Quantitative | Linear Distance |
A possible research question could be:
This research question would be an example of supervised learning because the plant species are known and the model can be trained using already-known output values.
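To make the supervised framing concrete, here is a minimal sketch rather than the analysis used in this report. It assumes that R's built-in `iris` data (with columns such as `Sepal.Length` and `Species`) can stand in for `Datasets/iris.csv`, and it uses a nearest-centroid rule only because that needs no extra packages. The names `iris_builtin` and `predict_centroid` are hypothetical.

```r
# Assumption: the built-in iris data stands in for Datasets/iris.csv
iris_builtin <- datasets::iris

# Hold out 50 flowers so the model is scored on data it has not seen
set.seed(1)
train_idx <- sample(nrow(iris_builtin), 100)
train <- iris_builtin[train_idx, ]
test  <- iris_builtin[-train_idx, ]

# "Training": the mean of each measurement within each known species
centroids <- aggregate(. ~ Species, data = train, FUN = mean)

# Hypothetical helper: assign each flower to the species with the closest centroid
predict_centroid <- function(newdata, centroids) {
  feats <- as.matrix(newdata[, names(centroids)[-1]])
  cents <- as.matrix(centroids[, -1])
  nearest <- apply(feats, 1, function(flower) {
    # Squared Euclidean distance from this flower to every centroid
    which.min(colSums((t(cents) - flower)^2))
  })
  centroids$Species[nearest]
}

pred <- predict_centroid(test, centroids)

# Share of held-out flowers whose known species was predicted correctly
accuracy <- mean(as.character(pred) == as.character(test$Species))
accuracy
```

The known species labels are used twice: once to compute the centroids and once to score the predictions. That is what makes the problem supervised.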
heart dataset
heart <- read.csv(file = "Datasets/heart.csv", header = TRUE, sep = ",")
To explore the variables, use head(), str() and summary():
head(heart,4)
## X Age Sex ChestPain RestBP Chol Fbs RestECG MaxHR ExAng Oldpeak Slope
## 1 1 63 1 typical 145 233 1 2 150 0 2.3 3
## 2 2 67 1 asymptomatic 160 286 0 2 108 1 1.5 2
## 3 3 67 1 asymptomatic 120 229 0 2 129 1 2.6 2
## 4 4 37 1 nonanginal 130 250 0 0 187 0 3.5 3
## Ca Thal AHD
## 1 0 fixed 0
## 2 3 normal 1
## 3 2 reversable 1
## 4 0 normal 0
str(heart)
## 'data.frame': 303 obs. of 15 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Age : int 63 67 67 37 41 56 62 57 63 53 ...
## $ Sex : int 1 1 1 1 0 1 0 0 1 1 ...
## $ ChestPain: Factor w/ 4 levels "asymptomatic",..: 4 1 1 2 3 3 1 1 1 1 ...
## $ RestBP : int 145 160 120 130 130 120 140 120 130 140 ...
## $ Chol : int 233 286 229 250 204 236 268 354 254 203 ...
## $ Fbs : int 1 0 0 0 0 0 0 0 0 1 ...
## $ RestECG : int 2 2 2 0 2 0 2 0 2 2 ...
## $ MaxHR : int 150 108 129 187 172 178 160 163 147 155 ...
## $ ExAng : int 0 1 1 0 0 0 0 1 0 1 ...
## $ Oldpeak : num 2.3 1.5 2.6 3.5 1.4 0.8 3.6 0.6 1.4 3.1 ...
## $ Slope : int 3 2 2 3 1 1 3 1 2 3 ...
## $ Ca : int 0 3 2 0 0 0 2 0 1 0 ...
## $ Thal : Factor w/ 3 levels "fixed","normal",..: 1 2 3 2 2 2 2 2 3 3 ...
## $ AHD : int 0 1 1 0 0 0 1 0 1 1 ...
summary(heart)
## X Age Sex ChestPain
## Min. : 1.0 Min. :29.00 Min. :0.0000 asymptomatic:144
## 1st Qu.: 76.5 1st Qu.:48.00 1st Qu.:0.0000 nonanginal : 86
## Median :152.0 Median :56.00 Median :1.0000 nontypical : 50
## Mean :152.0 Mean :54.44 Mean :0.6799 typical : 23
## 3rd Qu.:227.5 3rd Qu.:61.00 3rd Qu.:1.0000
## Max. :303.0 Max. :77.00 Max. :1.0000
##
## RestBP Chol Fbs RestECG
## Min. : 94.0 Min. :126.0 Min. :0.0000 Min. :0.0000
## 1st Qu.:120.0 1st Qu.:211.0 1st Qu.:0.0000 1st Qu.:0.0000
## Median :130.0 Median :241.0 Median :0.0000 Median :1.0000
## Mean :131.7 Mean :246.7 Mean :0.1485 Mean :0.9901
## 3rd Qu.:140.0 3rd Qu.:275.0 3rd Qu.:0.0000 3rd Qu.:2.0000
## Max. :200.0 Max. :564.0 Max. :1.0000 Max. :2.0000
##
## MaxHR ExAng Oldpeak Slope
## Min. : 71.0 Min. :0.0000 Min. :0.00 Min. :1.000
## 1st Qu.:133.5 1st Qu.:0.0000 1st Qu.:0.00 1st Qu.:1.000
## Median :153.0 Median :0.0000 Median :0.80 Median :2.000
## Mean :149.6 Mean :0.3267 Mean :1.04 Mean :1.601
## 3rd Qu.:166.0 3rd Qu.:1.0000 3rd Qu.:1.60 3rd Qu.:2.000
## Max. :202.0 Max. :1.0000 Max. :6.20 Max. :3.000
##
## Ca Thal AHD
## Min. :0.0000 fixed : 18 Min. :0.0000
## 1st Qu.:0.0000 normal :166 1st Qu.:0.0000
## Median :0.0000 reversable:117 Median :0.0000
## Mean :0.6722 NA's : 2 Mean :0.4587
## 3rd Qu.:1.0000 3rd Qu.:1.0000
## Max. :3.0000 Max. :1.0000
## NA's :4
In this case the variables are:
| Variable Name | Type of Data | Measurement |
|---|---|---|
| X | Categorical (\(\mathbb{N}\)) | Presumably the observation number |
| Age | Quantitative | The age of the individual |
| Sex | Categorical | The individual's sex |
| ChestPain | Categorical | A classification of the type of chest pain |
| RestBP | Quantitative | A measurement of systolic blood pressure at rest |
| Chol | Quantitative | Cholesterol level |
| Fbs | Categorical | An indicator of whether fasting blood sugar is above a threshold |
| RestECG | Categorical | An indicator of the type of ECG result |
| MaxHR | Quantitative | A measurement of the maximum heart rate |
| ExAng | Categorical | An indicator of whether the individual suffered exercise-induced angina |
| Oldpeak | Quantitative | A measurement of ECG change induced by exercise |
| Slope | Categorical | An indicator of the slope of the ST segment of an ECG graph |
| Ca | Categorical (because it takes values in \(\mathbb{N}\)) | An indicator of how many of the three major blood vessels are revealed by fluoroscopy |
| AHD | Categorical | An indicator of whether the individual suffered atherosclerotic heart disease |
A possible research question could be:
This research question would be an example of supervised learning because the incidence of AHD is known and the model can be trained using already-known output values.
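The summary() output above reports NA's in Ca and Thal, and modelling functions typically either drop such rows or refuse them. The following is a minimal sketch of how the missing values might be counted and removed before training. `heart_toy` is a hypothetical four-row stand-in, with the last two rows invented so that they contain missing values; it is not the real data.

```r
# Hypothetical stand-in for heart; rows 3 and 4 are invented to contain NAs
heart_toy <- data.frame(
  Age  = c(63, 67, 50, 45),
  Ca   = c(0, 3, NA, 1),
  Thal = c("fixed", "normal", "reversable", NA),
  AHD  = c(0, 1, 1, 0)
)

# Number of missing values in each column
na_counts <- colSums(is.na(heart_toy))
na_counts

# Keep only the rows with no missing values in any column
heart_complete <- heart_toy[complete.cases(heart_toy), ]
nrow(heart_complete)

# Integer codes such as AHD are categorical, so store them as factors
heart_complete$AHD <- factor(heart_complete$AHD, levels = c(0, 1))
```

Dropping incomplete rows is only one option; whether it is appropriate depends on how many rows are lost and why the values are missing.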
groceries dataset
groceries <- read.csv(file = "Datasets/groceries(1).csv", header = TRUE, sep = ",")
To explore the variables, use head(), str() and summary():
head(groceries[,1:5], 4)
## frankfurter sausage liver.loaf ham meat
## 1 0 0 0 0 0
## 2 0 0 0 0 0
## 3 0 0 0 0 0
## 4 0 0 0 0 0
#Restrict the columns of groceries to fit on the page
str(groceries[,1:5])
## 'data.frame': 9835 obs. of 5 variables:
## $ frankfurter: int 0 0 0 0 0 0 0 0 0 0 ...
## $ sausage : int 0 0 0 0 0 0 0 0 0 0 ...
## $ liver.loaf : int 0 0 0 0 0 0 0 0 0 0 ...
## $ ham : int 0 0 0 0 0 0 0 0 0 0 ...
## $ meat : int 0 0 0 0 0 0 0 0 0 0 ...
summary(groceries[,1:5])
## frankfurter sausage liver.loaf ham
## Min. :0.00000 Min. :0.00000 Min. :0.000000 Min. :0.00000
## 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.000000 1st Qu.:0.00000
## Median :0.00000 Median :0.00000 Median :0.000000 Median :0.00000
## Mean :0.05897 Mean :0.09395 Mean :0.005084 Mean :0.02603
## 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:0.000000 3rd Qu.:0.00000
## Max. :1.00000 Max. :1.00000 Max. :1.000000 Max. :1.00000
## meat
## Min. :0.00000
## 1st Qu.:0.00000
## Median :0.00000
## Mean :0.02583
## 3rd Qu.:0.00000
## Max. :1.00000
It is necessary to check whether every value is a 0/1 indicator or some other number. We can check this with:
# Every cell must be exactly 0 or 1 for the columns to be indicators
if (all(as.matrix(groceries) %in% c(0, 1))) {
  print("The values are Boolean")
} else {
  print("The values could be categorical or continuous")
}
## [1] "The values are Boolean"
In this case the variables are:
| Variable Name | Type of Data | Measurement |
|---|---|---|
| food item | categorical | whether or not the item needs to be purchased |
Each observation (row) could represent the groceries needed in a given week.
A possible research question could be:
This research question would be an example of unsupervised learning because the pattern of food consumption is not known and the algorithm must ‘learn’ what the patterns are.
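As a hedged illustration of what "learning the patterns" could mean for 0/1 data like this, the sketch below counts how often pairs of items appear together. The `baskets` matrix is a hypothetical four-row example, not the groceries data, and co-occurrence counting is just one simple choice among many unsupervised techniques.

```r
# Hypothetical 0/1 basket matrix: rows are observations, columns are items
baskets <- matrix(
  c(1, 1, 0,
    1, 1, 0,
    0, 1, 1,
    1, 0, 0),
  ncol = 3, byrow = TRUE,
  dimnames = list(NULL, c("sausage", "ham", "meat"))
)

# t(baskets) %*% baskets: entry [i, j] counts the rows containing both items
co_occurrence <- crossprod(baskets)

# Proportion of rows containing each pair (the diagonal gives single items)
support <- co_occurrence / nrow(baskets)
support
```

No output variable is supplied anywhere in this calculation; the structure comes entirely from the inputs, which is the defining feature of the unsupervised setting.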
Import the Advertising Data:
adv <- read.csv(file = "Datasets/Advertising(2).csv", header = TRUE, sep = ",")
Use head() and str() to get an understanding of the data:
head(adv)
## TV Radio Newspaper Sales
## 1 230.1 37.8 69.2 22.1
## 2 44.5 39.3 45.1 10.4
## 3 17.2 45.9 69.3 9.3
## 4 151.5 41.3 58.5 18.5
## 5 180.8 10.8 58.4 12.9
## 6 8.7 48.9 75.0 7.2
str(adv)
## 'data.frame': 200 obs. of 4 variables:
## $ TV : num 230.1 44.5 17.2 151.5 180.8 ...
## $ Radio : num 37.8 39.3 45.9 41.3 10.8 48.9 32.8 19.6 2.1 2.6 ...
## $ Newspaper: num 69.2 45.1 69.3 58.5 58.4 75 23.5 11.6 1 21.2 ...
## $ Sales : num 22.1 10.4 9.3 18.5 12.9 7.2 11.8 13.2 4.8 10.6 ...
The data appear to be quantitative measurements of advertising across various media, together with sales, so let’s get a summary:
summary(adv)
## TV Radio Newspaper Sales
## Min. : 0.70 Min. : 0.000 Min. : 0.30 Min. : 1.60
## 1st Qu.: 74.38 1st Qu.: 9.975 1st Qu.: 12.75 1st Qu.:10.38
## Median :149.75 Median :22.900 Median : 25.75 Median :12.90
## Mean :147.04 Mean :23.264 Mean : 30.55 Mean :14.02
## 3rd Qu.:218.82 3rd Qu.:36.525 3rd Qu.: 45.10 3rd Qu.:17.40
## Max. :296.40 Max. :49.600 Max. :114.00 Max. :27.00
corrplot
library(corrplot)
cormatadv <- cor(adv)
corrplot(corr = cormatadv, method = 'ellipse', type = 'lower')
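The correlations can also be read numerically rather than from the ellipses. The sketch below is illustrative only: `adv_head` holds just the six rows printed by head(adv), so its values will differ from those computed on all 200 observations. The variable names are hypothetical.

```r
# Only the six rows shown by head(adv); not the full data set
adv_head <- data.frame(
  TV        = c(230.1, 44.5, 17.2, 151.5, 180.8, 8.7),
  Radio     = c(37.8, 39.3, 45.9, 41.3, 10.8, 48.9),
  Newspaper = c(69.2, 45.1, 69.3, 58.5, 58.4, 75.0),
  Sales     = c(22.1, 10.4, 9.3, 18.5, 12.9, 7.2)
)

# Correlation of every column with Sales
sales_cor <- cor(adv_head)[, "Sales"]

# Drop the trivial Sales-Sales entry and rank the remaining variables
sales_cor <- sort(sales_cor[names(sales_cor) != "Sales"], decreasing = TRUE)
sales_cor
```

Replacing `adv_head` with the full `adv` data frame gives the figures that the correlation plot displays.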
Looking at this plot, TV appears to be the variable most strongly correlated with Sales:
library(ggplot2)
# Mean of the advertising columns (every column except Sales)
adv$MeanAdvertising <- rowMeans(adv[, names(adv) != "Sales"])
ggplot(data = adv, aes(x = TV, y = Sales, col = MeanAdvertising)) +
  geom_point() +
  theme_bw() +
  stat_smooth(method = 'lm', formula = y ~ poly(x, 2, raw = TRUE), se = FALSE) +
  # stat_smooth(method = 'lm', formula = y ~ log(x), se = FALSE) +
  labs(col = "Mean Advertising", x = "TV Advertising")
ggplot(data = adv, aes(x = Radio, y = Sales, col = MeanAdvertising)) +
  geom_point() +
  theme_bw() +
  labs(col = "Mean Advertising", x = "Radio Advertising") +
  geom_smooth(method = 'lm')
padv <- ggplot(data = adv, aes(x = Newspaper, y = Sales, col = MeanAdvertising)) +
  geom_point() +
  theme_bw() +
  labs(col = "Mean Advertising", x = "Newspaper Advertising")
padv
# This could be turned into an interactive graph by wrapping it in ggplotly(padv) from the plotly package
It appears that TV advertising is positively correlated with sales; however, the relationship shows diminishing returns, and the increase in sales becomes less certain as advertising grows. Radio advertising behaves in much the same way, but with less certainty than TV advertising. There appears to be no correlation between newspaper advertising and sales.
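One way the diminishing-returns impression could be checked is by comparing a straight-line fit with a quadratic one, mirroring the poly(x, 2, raw = TRUE) smoother used in the TV plot. The sketch below runs on synthetic data generated with a deliberately concave shape, because the real file is not reproduced here. The data frame `sim` and its coefficients are assumptions made for illustration only.

```r
# Synthetic data with a concave (diminishing-returns) shape; not the real adv data
set.seed(42)
sim <- data.frame(TV = seq(1, 300, length.out = 200))
sim$Sales <- 2 + 1.2 * sqrt(sim$TV) + rnorm(nrow(sim), sd = 1)

# Straight line versus quadratic in TV
fit_linear    <- lm(Sales ~ TV, data = sim)
fit_quadratic <- lm(Sales ~ poly(TV, 2, raw = TRUE), data = sim)

# A negative coefficient on the squared term indicates a flattening curve
quad_coef <- unname(coef(fit_quadratic)[3])
quad_coef

# F-test of whether the squared term improves on the straight line
anova(fit_linear, fit_quadratic)
```

Substituting `adv` for `sim` would apply the same comparison to the advertising data.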
The iris data has already been imported and assigned to the variable iris.
The iris data set can be explored using head() and str():
head(iris)
## Type PW PL SW SL
## 1 0 2 14 33 50
## 2 1 24 56 31 67
## 3 1 23 51 31 69
## 4 0 2 10 36 46
## 5 1 20 52 30 65
## 6 1 19 51 27 58
str(iris)
## 'data.frame': 150 obs. of 5 variables:
## $ Type: int 0 1 1 0 1 1 2 2 1 2 ...
## $ PW : int 2 24 23 2 20 19 13 16 17 14 ...
## $ PL : int 14 56 51 10 52 51 45 47 45 47 ...
## $ SW : int 33 31 31 36 30 27 28 33 25 32 ...
## $ SL : int 50 67 69 46 65 58 57 63 49 70 ...
The data set can be described using the summary() function
summary(iris)
## Type PW PL SW
## Min. :0 Min. : 1.00 Min. :10.00 Min. :20.00
## 1st Qu.:0 1st Qu.: 3.00 1st Qu.:16.00 1st Qu.:28.00
## Median :1 Median :13.00 Median :44.00 Median :30.00
## Mean :1 Mean :11.93 Mean :37.79 Mean :30.55
## 3rd Qu.:2 3rd Qu.:18.00 3rd Qu.:51.00 3rd Qu.:33.00
## Max. :2 Max. :25.00 Max. :69.00 Max. :44.00
## SL
## Min. :43.00
## 1st Qu.:51.00
## Median :58.00
## Mean :58.45
## 3rd Qu.:64.00
## Max. :79.00
# Exclude the species code; in this file the column is named Type, not Species
coriris <- cor(iris[, names(iris) != "Type"])
corrplot(corr = coriris, method = 'ellipse', type = 'lower')
Because sepal width and sepal length appear to be the least correlated, we will fit a linear regression of sepal length on sepal width:
# Convert the species code to a factor
iris$Type <- as.factor(iris$Type)
ggplot(data = iris, aes(x = SW, y = SL, col = Type, shape = Type)) +
  geom_point(size = 2.5, alpha = 0.8) +
  theme_bw() +
  labs(col = "Plant\nSpecies", shape = "Plant\nSpecies", x = "Sepal Width", y = "Sepal Length") +
  geom_smooth(method = 'lm', se = FALSE, lwd = 0.5) +
  # It is necessary to use scale_shape_discrete() in order to change the labels.
  # Make sure the legend names match; if they match, the legends are merged.
  # Assumption: the petal sizes printed by head(iris) suggest that
  # code 1 is Virginica (largest petals) and code 2 is Versicolor.
  scale_shape_discrete(name = "Plant\nSpecies",
                       breaks = c("0", "1", "2"),
                       labels = c("Setosa", "Virginica", "Versicolor")) +
  scale_color_discrete(name = "Plant\nSpecies",
                       breaks = c("0", "1", "2"),
                       labels = c("Setosa", "Virginica", "Versicolor"))
# This could also have been done with the built-in iris data set, which is where the legend labels come from
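The per-species regression lines drawn by geom_smooth() can also be computed directly with lm(). The sketch below assumes that the built-in `iris` data (columns `Sepal.Length`, `Sepal.Width`, `Species`) can stand in for the CSV used above; `iris_builtin`, `pooled_slope` and `species_slopes` are hypothetical names.

```r
# Assumption: the built-in iris data stands in for Datasets/iris.csv
iris_builtin <- datasets::iris

# One regression of sepal length on sepal width, ignoring species
pooled_fit   <- lm(Sepal.Length ~ Sepal.Width, data = iris_builtin)
pooled_slope <- unname(coef(pooled_fit)[2])

# One regression per species, matching the separate lines in the plot
species_slopes <- sapply(
  split(iris_builtin, iris_builtin$Species),
  function(d) unname(coef(lm(Sepal.Length ~ Sepal.Width, data = d))[2])
)

pooled_slope
species_slopes
```

With the built-in data the pooled slope is negative while every within-species slope is positive, which is why grouping by species before fitting matters for this pair of variables.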