Exercises (Part 1)

For iris, heart and groceries data sets:

  1. Explore the variables
  2. List the variables and classify as quantitative or qualitative
  3. Provide a Research Question and identify a target variable
  4. Identify whether or not this would exemplify supervised or unsupervised learning.

iris Data Set

iris <- read.csv(file = "Datasets/iris.csv", header = TRUE, sep = ",")

(1) Explore the variables

To explore the variables, use str(), summary() and head():

head(iris,4)
##   Type PW PL SW SL
## 1    0  2 14 33 50
## 2    1 24 56 31 67
## 3    1 23 51 31 69
## 4    0  2 10 36 46
str(iris)
## 'data.frame':    150 obs. of  5 variables:
##  $ Type: int  0 1 1 0 1 1 2 2 1 2 ...
##  $ PW  : int  2 24 23 2 20 19 13 16 17 14 ...
##  $ PL  : int  14 56 51 10 52 51 45 47 45 47 ...
##  $ SW  : int  33 31 31 36 30 27 28 33 25 32 ...
##  $ SL  : int  50 67 69 46 65 58 57 63 49 70 ...
summary(iris)
##       Type         PW              PL              SW       
##  Min.   :0   Min.   : 1.00   Min.   :10.00   Min.   :20.00  
##  1st Qu.:0   1st Qu.: 3.00   1st Qu.:16.00   1st Qu.:28.00  
##  Median :1   Median :13.00   Median :44.00   Median :30.00  
##  Mean   :1   Mean   :11.93   Mean   :37.79   Mean   :30.55  
##  3rd Qu.:2   3rd Qu.:18.00   3rd Qu.:51.00   3rd Qu.:33.00  
##  Max.   :2   Max.   :25.00   Max.   :69.00   Max.   :44.00  
##        SL       
##  Min.   :43.00  
##  1st Qu.:51.00  
##  Median :58.00  
##  Mean   :58.45  
##  3rd Qu.:64.00  
##  Max.   :79.00

(2) List the Variables

In this case the variables are:

Variable Name               Type of Data    Measurement
Type (presumably species)   Categorical     The type of flower (i.e. species)
PW (petal width)            Quantitative    Linear distance
SW (sepal width)            Quantitative    Linear distance
PL (petal length)           Quantitative    Linear distance
SL (sepal length)           Quantitative    Linear distance

(3) Provide a Research Question

A possible research question could be:

  • Can plant species be predicted from the Sepal Length and Petal Width?

(4) Is this Supervised or Unsupervised Learning

This research question would be an example of supervised learning because the plant species are known and the model can be trained using already-known output values.
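
To make the supervised setting concrete, the sketch below trains a nearest-centroid classifier on a random subset of flowers and scores it on the remainder. This classifier is a deliberately simple stand-in chosen for illustration, not a method prescribed by the exercise. The sketch assumes R's built-in datasets::iris (columns Sepal.Length, Petal.Width and Species) in place of Datasets/iris.csv. The names flowers, centroids and predict_species are hypothetical.

```r
# Assumption: the built-in iris data stands in for Datasets/iris.csv.
flowers    <- datasets::iris
predictors <- c("Sepal.Length", "Petal.Width")

# Hold out a third of the rows so the model is scored on unseen flowers.
set.seed(1)
train_idx <- sample(nrow(flowers), 100)
train <- flowers[train_idx, ]
test  <- flowers[-train_idx, ]

# "Training": the mean of each predictor within each known species.
centroids <- aggregate(train[, predictors],
                       by = list(Species = train$Species), FUN = mean)

# Prediction: assign each flower to the species with the closest centroid.
predict_species <- function(newdata) {
  x   <- as.matrix(newdata[, predictors])
  cen <- as.matrix(centroids[, predictors])
  # Squared Euclidean distance from every row of x to every centroid.
  d <- sapply(seq_len(nrow(cen)),
              function(k) rowSums(sweep(x, 2, cen[k, ])^2))
  d <- matrix(d, nrow = nrow(x))
  as.character(centroids$Species)[apply(d, 1, which.min)]
}

# Share of held-out flowers whose species was predicted correctly.
accuracy <- mean(predict_species(test) == as.character(test$Species))
accuracy
```

The known Species column is what makes this supervised: it is used both to compute the centroids and to score the predictions.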

heart dataset

heart <- read.csv(file = "Datasets/heart.csv", header = TRUE, sep = ",")

(1) Explore the variables

To explore the variables, use str(), summary() and head():

head(heart,4)
##   X Age Sex    ChestPain RestBP Chol Fbs RestECG MaxHR ExAng Oldpeak Slope
## 1 1  63   1      typical    145  233   1       2   150     0     2.3     3
## 2 2  67   1 asymptomatic    160  286   0       2   108     1     1.5     2
## 3 3  67   1 asymptomatic    120  229   0       2   129     1     2.6     2
## 4 4  37   1   nonanginal    130  250   0       0   187     0     3.5     3
##   Ca       Thal AHD
## 1  0      fixed   0
## 2  3     normal   1
## 3  2 reversable   1
## 4  0     normal   0
str(heart)
## 'data.frame':    303 obs. of  15 variables:
##  $ X        : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Age      : int  63 67 67 37 41 56 62 57 63 53 ...
##  $ Sex      : int  1 1 1 1 0 1 0 0 1 1 ...
##  $ ChestPain: Factor w/ 4 levels "asymptomatic",..: 4 1 1 2 3 3 1 1 1 1 ...
##  $ RestBP   : int  145 160 120 130 130 120 140 120 130 140 ...
##  $ Chol     : int  233 286 229 250 204 236 268 354 254 203 ...
##  $ Fbs      : int  1 0 0 0 0 0 0 0 0 1 ...
##  $ RestECG  : int  2 2 2 0 2 0 2 0 2 2 ...
##  $ MaxHR    : int  150 108 129 187 172 178 160 163 147 155 ...
##  $ ExAng    : int  0 1 1 0 0 0 0 1 0 1 ...
##  $ Oldpeak  : num  2.3 1.5 2.6 3.5 1.4 0.8 3.6 0.6 1.4 3.1 ...
##  $ Slope    : int  3 2 2 3 1 1 3 1 2 3 ...
##  $ Ca       : int  0 3 2 0 0 0 2 0 1 0 ...
##  $ Thal     : Factor w/ 3 levels "fixed","normal",..: 1 2 3 2 2 2 2 2 3 3 ...
##  $ AHD      : int  0 1 1 0 0 0 1 0 1 1 ...
summary(heart)
##        X              Age             Sex                ChestPain  
##  Min.   :  1.0   Min.   :29.00   Min.   :0.0000   asymptomatic:144  
##  1st Qu.: 76.5   1st Qu.:48.00   1st Qu.:0.0000   nonanginal  : 86  
##  Median :152.0   Median :56.00   Median :1.0000   nontypical  : 50  
##  Mean   :152.0   Mean   :54.44   Mean   :0.6799   typical     : 23  
##  3rd Qu.:227.5   3rd Qu.:61.00   3rd Qu.:1.0000                     
##  Max.   :303.0   Max.   :77.00   Max.   :1.0000                     
##                                                                     
##      RestBP           Chol            Fbs            RestECG      
##  Min.   : 94.0   Min.   :126.0   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:120.0   1st Qu.:211.0   1st Qu.:0.0000   1st Qu.:0.0000  
##  Median :130.0   Median :241.0   Median :0.0000   Median :1.0000  
##  Mean   :131.7   Mean   :246.7   Mean   :0.1485   Mean   :0.9901  
##  3rd Qu.:140.0   3rd Qu.:275.0   3rd Qu.:0.0000   3rd Qu.:2.0000  
##  Max.   :200.0   Max.   :564.0   Max.   :1.0000   Max.   :2.0000  
##                                                                   
##      MaxHR           ExAng           Oldpeak         Slope      
##  Min.   : 71.0   Min.   :0.0000   Min.   :0.00   Min.   :1.000  
##  1st Qu.:133.5   1st Qu.:0.0000   1st Qu.:0.00   1st Qu.:1.000  
##  Median :153.0   Median :0.0000   Median :0.80   Median :2.000  
##  Mean   :149.6   Mean   :0.3267   Mean   :1.04   Mean   :1.601  
##  3rd Qu.:166.0   3rd Qu.:1.0000   3rd Qu.:1.60   3rd Qu.:2.000  
##  Max.   :202.0   Max.   :1.0000   Max.   :6.20   Max.   :3.000  
##                                                                 
##        Ca                 Thal          AHD        
##  Min.   :0.0000   fixed     : 18   Min.   :0.0000  
##  1st Qu.:0.0000   normal    :166   1st Qu.:0.0000  
##  Median :0.0000   reversable:117   Median :0.0000  
##  Mean   :0.6722   NA's      :  2   Mean   :0.4587  
##  3rd Qu.:1.0000                    3rd Qu.:1.0000  
##  Max.   :3.0000                    Max.   :1.0000  
##  NA's   :4

(2) List the Variables

In this case the variables are:

Variable Name   Type of Data    Measurement
X               Categorical     Presumably the observation number (a natural number, \(\mathbb{N}\))
Age             Quantitative    The age of the individual
Sex             Categorical     The individual's gender
ChestPain       Categorical     A classification of the type of chest pain
RestBP          Quantitative    A measurement of systolic blood pressure at rest
Chol            Quantitative    Cholesterol level
Fbs             Categorical     An indicator of whether or not fasting blood sugar is above a threshold
RestECG         Categorical     An indicator of the type of ECG result
MaxHR           Quantitative    A measurement of the maximum heart rate
ExAng           Categorical     An indicator of whether or not the individual suffered exercise-induced angina
Oldpeak         Quantitative    A measurement of ECG change induced by exercise
Slope           Categorical     An indicator of the slope of the ST segment of an ECG graph
Ca              Categorical     How many of the three major blood vessels are revealed by fluoroscopy (a count in \(\mathbb{N}\), treated here as categorical)
AHD             Categorical     An indicator of whether or not the individual suffered Atherosclerotic Heart Disease
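
Many of the categorical variables above are stored as integers, so str() alone does not reveal them. The sketch below shows one rough way to flag such columns by counting distinct values. It uses only the first ten values of a few columns, copied from the str() output above. The name heart_head, the helper guess_type and its max_levels threshold are hypothetical, and the threshold is a heuristic rather than a rule.

```r
# First ten values of selected heart columns, copied from the str() output.
heart_head <- data.frame(
  Age     = c(63, 67, 67, 37, 41, 56, 62, 57, 63, 53),
  Sex     = c(1, 1, 1, 1, 0, 1, 0, 0, 1, 1),
  MaxHR   = c(150, 108, 129, 187, 172, 178, 160, 163, 147, 155),
  Oldpeak = c(2.3, 1.5, 2.6, 3.5, 1.4, 0.8, 3.6, 0.6, 1.4, 3.1),
  Ca      = c(0, 3, 2, 0, 0, 0, 2, 0, 1, 0),
  AHD     = c(0, 1, 1, 0, 0, 0, 1, 0, 1, 1)
)

# Heuristic (an assumption): a numeric column with at most `max_levels`
# distinct values is probably a coded category.
guess_type <- function(column, max_levels = 4) {
  if (!is.numeric(column)) return("categorical")
  if (length(unique(column)) <= max_levels) "categorical" else "quantitative"
}

types <- vapply(heart_head, guess_type, character(1))
types

# Coded categories are better stored as factors before modelling or plotting.
coded <- names(types)[types == "categorical"]
heart_head[coded] <- lapply(heart_head[coded], factor)
str(heart_head)
```

A guess like this still needs checking against the data dictionary, since a small sample can make a genuinely quantitative variable look categorical.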

(3) Provide a Research Question

A possible research question could be:

  • Does MaxHR predict Atherosclerotic Heart Disease independently of age?

(4) Is this Supervised or Unsupervised Learning

This research question would be an example of supervised learning because the incidence of AHD is known and the model can be trained using already-known output values.
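
One way to approach this question is a logistic regression of AHD on MaxHR with Age included as a covariate, so that the MaxHR coefficient is adjusted for age. The sketch below is illustrative only: it uses just the ten rows visible in the str() output above, far too few for any real conclusion, and the names heart_sample, fit and p_hat are hypothetical.

```r
# First ten rows of three heart columns, copied from the str() output.
heart_sample <- data.frame(
  Age   = c(63, 67, 67, 37, 41, 56, 62, 57, 63, 53),
  MaxHR = c(150, 108, 129, 187, 172, 178, 160, 163, 147, 155),
  AHD   = c(0, 1, 1, 0, 0, 0, 1, 0, 1, 1)
)

# Logistic regression: the known AHD values are the training labels.
fit <- glm(AHD ~ MaxHR + Age, data = heart_sample, family = binomial)

# Coefficient table; the MaxHR row is the age-adjusted effect (log-odds scale).
coef(summary(fit))

# Fitted probability of AHD for each individual in the sample.
p_hat <- predict(fit, type = "response")
round(p_hat, 2)
```

With the full data frame, the same call with data = heart would use all 303 observations.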

groceries dataset

groceries <- read.csv(file = "Datasets/groceries(1).csv", header = TRUE, sep = ",")

(1) Explore the variables

To explore the variables, use str(), summary() and head():

head(groceries[,1:5], 4)
##   frankfurter sausage liver.loaf ham meat
## 1           0       0          0   0    0
## 2           0       0          0   0    0
## 3           0       0          0   0    0
## 4           0       0          0   0    0
  #Restrict the columns of groceries to fit on the page
str(groceries[,1:5])
## 'data.frame':    9835 obs. of  5 variables:
##  $ frankfurter: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ sausage    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ liver.loaf : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ ham        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ meat       : int  0 0 0 0 0 0 0 0 0 0 ...
summary(groceries[,1:5])
##   frankfurter         sausage          liver.loaf            ham         
##  Min.   :0.00000   Min.   :0.00000   Min.   :0.000000   Min.   :0.00000  
##  1st Qu.:0.00000   1st Qu.:0.00000   1st Qu.:0.000000   1st Qu.:0.00000  
##  Median :0.00000   Median :0.00000   Median :0.000000   Median :0.00000  
##  Mean   :0.05897   Mean   :0.09395   Mean   :0.005084   Mean   :0.02603  
##  3rd Qu.:0.00000   3rd Qu.:0.00000   3rd Qu.:0.000000   3rd Qu.:0.00000  
##  Max.   :1.00000   Max.   :1.00000   Max.   :1.000000   Max.   :1.00000  
##       meat        
##  Min.   :0.00000  
##  1st Qu.:0.00000  
##  Median :0.00000  
##  Mean   :0.02583  
##  3rd Qu.:0.00000  
##  Max.   :1.00000

(2) List the Variables

It is necessary to check whether the values are all 1/0 indicators or more general numbers. We can check this with:

if(sum(groceries>1)==0){
  print("The values are Boolean")
} else{
  print("The values could be categorical or continuous")
}
## [1] "The values are Boolean"
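
The check above only confirms that no value exceeds 1, so fractional, negative or missing values would slip through. A stricter variant is sketched below on small made-up data frames. The names ok, bad and is_indicator are hypothetical.

```r
# Hypothetical toy data: `ok` is a valid indicator table, `bad` is not.
ok  <- data.frame(ham = c(0, 1, 0), meat = c(1, 1, 0))
bad <- data.frame(ham = c(0, -1, 0), meat = c(0.5, 1, NA))

# TRUE only if every entry of every column is exactly 0 or 1.
# NA counts as a failure because NA %in% c(0, 1) is FALSE.
is_indicator <- function(df) {
  all(vapply(df, function(column) all(column %in% c(0, 1)), logical(1)))
}

is_indicator(ok)
is_indicator(bad)

# The looser test still passes for `bad`, which is why it can mislead.
sum(bad > 1, na.rm = TRUE) == 0
```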

In this case the variables are:

Variable Name   Type of Data    Measurement
food item       Categorical     Whether or not the item needs to be purchased

Each observation (row) could represent the groceries needed in a given week.

(3) provide a research Question

A possible research question could be:

  • Are certain food items more common during holiday periods?
    • For example, is consumption of processed meats more common? This could be a public health enquiry.

(4) Is this Supervised or Unsupervised Learning

This research question would be an example of unsupervised learning because the pattern of food consumption is not known and the algorithm must ‘learn’ what the patterns are.
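
As a minimal illustration of looking for patterns without a target variable, the sketch below counts how often pairs of items occur in the same row of a 0/1 table. The four-row baskets matrix is made up for the example and only borrows three column names from groceries. Co-occurrence counting is one simple starting point, not necessarily the method intended by the exercise.

```r
# Hypothetical 0/1 table: one row per observation, one column per item.
baskets <- matrix(c(1, 1, 0,
                    1, 1, 0,
                    0, 0, 1,
                    1, 0, 1),
                  ncol = 3, byrow = TRUE,
                  dimnames = list(NULL, c("sausage", "ham", "meat")))

# t(baskets) %*% baskets: entry [i, j] is the number of rows containing
# both item i and item j; the diagonal holds each item's own count.
co_occurrence <- crossprod(baskets)
co_occurrence

# Dividing by the number of rows gives the share of rows with both items.
support <- co_occurrence / nrow(baskets)
support["sausage", "ham"]
```

No column is singled out as an outcome here; the structure is read off the data itself.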

Exercises (Part 2)

Advertising Data

Import the Advertising Data

Import the Advertising Data:

adv <- read.csv(file = "Datasets/Advertising(2).csv", header = TRUE, sep = ",")

Explore and describe the data set

Use head() and str() to get an understanding of the data:

head(adv)
##      TV Radio Newspaper Sales
## 1 230.1  37.8      69.2  22.1
## 2  44.5  39.3      45.1  10.4
## 3  17.2  45.9      69.3   9.3
## 4 151.5  41.3      58.5  18.5
## 5 180.8  10.8      58.4  12.9
## 6   8.7  48.9      75.0   7.2
str(adv)
## 'data.frame':    200 obs. of  4 variables:
##  $ TV       : num  230.1 44.5 17.2 151.5 180.8 ...
##  $ Radio    : num  37.8 39.3 45.9 41.3 10.8 48.9 32.8 19.6 2.1 2.6 ...
##  $ Newspaper: num  69.2 45.1 69.3 58.5 58.4 75 23.5 11.6 1 21.2 ...
##  $ Sales    : num  22.1 10.4 9.3 18.5 12.9 7.2 11.8 13.2 4.8 10.6 ...

These look like quantitative measurements of advertising across various media, together with sales, so let’s get a summary:

summary(adv)
##        TV             Radio          Newspaper          Sales      
##  Min.   :  0.70   Min.   : 0.000   Min.   :  0.30   Min.   : 1.60  
##  1st Qu.: 74.38   1st Qu.: 9.975   1st Qu.: 12.75   1st Qu.:10.38  
##  Median :149.75   Median :22.900   Median : 25.75   Median :12.90  
##  Mean   :147.04   Mean   :23.264   Mean   : 30.55   Mean   :14.02  
##  3rd Qu.:218.82   3rd Qu.:36.525   3rd Qu.: 45.10   3rd Qu.:17.40  
##  Max.   :296.40   Max.   :49.600   Max.   :114.00   Max.   :27.00

Create Scatter Plots of the Data

First consider all pairwise correlations using corrplot:

library(corrplot)

cormatadv <- cor(adv)
corrplot(corr = cormatadv, method = 'ellipse', type = 'lower')

Looking at this plot, it appears that TV is the variable most correlated with Sales:
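
The numbers behind such a plot can also be read directly from the correlation matrix. The sketch below does this for the six rows shown by head(adv) above, so the values are only indicative of the full data set. The names adv_head, sales_cor and strongest are hypothetical.

```r
# The six rows printed by head(adv), copied from the output above.
adv_head <- data.frame(
  TV        = c(230.1, 44.5, 17.2, 151.5, 180.8, 8.7),
  Radio     = c(37.8, 39.3, 45.9, 41.3, 10.8, 48.9),
  Newspaper = c(69.2, 45.1, 69.3, 58.5, 58.4, 75.0),
  Sales     = c(22.1, 10.4, 9.3, 18.5, 12.9, 7.2)
)

# Pearson correlations between every pair of columns.
cor_mat <- cor(adv_head)

# Keep only the correlations of each advertising channel with Sales.
sales_cor <- cor_mat["Sales", c("TV", "Radio", "Newspaper")]
sort(sales_cor, decreasing = TRUE)

# Channel with the largest absolute correlation with Sales.
strongest <- names(which.max(abs(sales_cor)))
strongest
```

Replacing adv_head with adv applies the same steps to all 200 observations.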

Sales vs TV Advertising

library(ggplot2)

# Mean of the three advertising columns (every column except Sales)
adv$MeanAdvertising <- rowMeans(adv[, names(adv) != "Sales"])

ggplot(data = adv, aes(x = TV, y = Sales, col = MeanAdvertising)) +
  geom_point() + 
  theme_bw() +
  stat_smooth(method = 'lm', formula = y ~ poly(x, 2, raw = TRUE), se = FALSE) +
 # stat_smooth(method = 'lm', formula = y ~ log(x), se = FALSE) +
  labs(col = "Mean Advertising", x= "TV Advertising") 

 ggplot(data = adv, aes(x = Radio, y = Sales, col = MeanAdvertising)) +
   geom_point() + 
   theme_bw() +
   labs(col = "Mean Advertising", x= "Radio Advertising") + 
   geom_smooth(method = 'lm')

padv <- ggplot(data = adv, aes(x = Newspaper, y = Sales, col = MeanAdvertising)) +
  geom_point() + 
  theme_bw() +
  labs(col = "Mean Advertising", x= "Newspaper Advertising")

padv

# These plots could be made interactive by wrapping each in plotly's ggplotly(), e.g. ggplotly(padv)

It appears that TV advertising is positively correlated with sales; however, higher levels of TV advertising are associated with more variable sales and diminishing returns. Radio advertising shows much the same pattern but with more scatter than TV advertising. There appears to be no correlation between newspaper advertising and sales.
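
The idea of diminishing returns can be checked numerically by comparing a straight-line fit with the quadratic used in the TV plot above. The sketch below does this on the six rows shown by head(adv), which is far too little data to draw conclusions from. The names adv_six, fit_linear and fit_quad are hypothetical.

```r
# TV and Sales for the six rows printed by head(adv) above.
adv_six <- data.frame(
  TV    = c(230.1, 44.5, 17.2, 151.5, 180.8, 8.7),
  Sales = c(22.1, 10.4, 9.3, 18.5, 12.9, 7.2)
)

# Straight line versus the same quadratic formula used in stat_smooth().
fit_linear <- lm(Sales ~ TV, data = adv_six)
fit_quad   <- lm(Sales ~ poly(TV, 2, raw = TRUE), data = adv_six)

# A negative coefficient on the squared term would mean the curve flattens.
coef(fit_quad)

# Residual sum of squares: the quadratic can never fit worse than the line.
rss <- c(linear = deviance(fit_linear), quadratic = deviance(fit_quad))
rss

# F-test of whether the squared term improves the fit.
anova(fit_linear, fit_quad)
```

Running the same comparison with data = adv would show whether the curvature seen in the plot is supported by all 200 observations.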

iris Data

Import the iris Data

The iris data has already been imported and assigned to the variable iris.

Explore and describe the data set

The iris data set can be explored using head() and str()

head(iris)
##   Type PW PL SW SL
## 1    0  2 14 33 50
## 2    1 24 56 31 67
## 3    1 23 51 31 69
## 4    0  2 10 36 46
## 5    1 20 52 30 65
## 6    1 19 51 27 58
str(iris)
## 'data.frame':    150 obs. of  5 variables:
##  $ Type: int  0 1 1 0 1 1 2 2 1 2 ...
##  $ PW  : int  2 24 23 2 20 19 13 16 17 14 ...
##  $ PL  : int  14 56 51 10 52 51 45 47 45 47 ...
##  $ SW  : int  33 31 31 36 30 27 28 33 25 32 ...
##  $ SL  : int  50 67 69 46 65 58 57 63 49 70 ...

The data set can be described using the summary() function

summary(iris)
##       Type         PW              PL              SW       
##  Min.   :0   Min.   : 1.00   Min.   :10.00   Min.   :20.00  
##  1st Qu.:0   1st Qu.: 3.00   1st Qu.:16.00   1st Qu.:28.00  
##  Median :1   Median :13.00   Median :44.00   Median :30.00  
##  Mean   :1   Mean   :11.93   Mean   :37.79   Mean   :30.55  
##  3rd Qu.:2   3rd Qu.:18.00   3rd Qu.:51.00   3rd Qu.:33.00  
##  Max.   :2   Max.   :25.00   Max.   :69.00   Max.   :44.00  
##        SL       
##  Min.   :43.00  
##  1st Qu.:51.00  
##  Median :58.00  
##  Mean   :58.45  
##  3rd Qu.:64.00  
##  Max.   :79.00

Create Scatter Plots of the Data

# Drop the class label (named Type here, not Species) so only the measurements are correlated
coriris <- cor(iris[, names(iris) != "Type"])
corrplot(corr = coriris, method = 'ellipse', type = 'lower')

Because sepal width and sepal length appear to be the least correlated pair, we fit a linear regression of sepal length on sepal width for each species:

# Convert the class label to a factor so that ggplot treats it as discrete
iris$Type <- as.factor(iris$Type)

ggplot(data = iris, aes(x = SW, y = SL, col = Type, shape = Type)) +
  geom_point(size = 2.5, alpha = 0.8) +
  theme_bw() +
  labs(col = "Plant\nSpecies", shape = "Plant\nSpecies", x = "Sepal Width", y = "Sepal Length") +
  geom_smooth(method = 'lm', se = FALSE, lwd = 0.5) +
  # Relabel both scales. The legend names and labels must match exactly,
  # otherwise the colour and shape legends are not merged into one.
  # Type 1 has petal widths above 18 (e.g. 24, 23), which only Virginica
  # reaches, so 1 = Virginica and 2 = Versicolor.
  scale_shape_discrete(name = "Plant\nSpecies",
                       breaks = c("0", "1", "2"),
                       labels = c("Setosa", "Virginica", "Versicolor")) +
  scale_color_discrete(name = "Plant\nSpecies",
                       breaks = c("0", "1", "2"),
                       labels = c("Setosa", "Virginica", "Versicolor"))

# This could also have been done with the built-in iris data set, which is where the legend labels come from.
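
The per-species lines drawn by geom_smooth() can be reproduced numerically by fitting one regression per group. The sketch below uses R's built-in datasets::iris (columns Sepal.Length, Sepal.Width and Species) as a stand-in for the CSV file. The names flowers, overall_slope and species_slopes are hypothetical. In the built-in data the pooled slope is negative while each within-species slope is positive, which is why colouring by species matters.

```r
# Assumption: the built-in iris data stands in for Datasets/iris.csv.
flowers <- datasets::iris

# Slope of sepal length on sepal width, ignoring species.
overall_slope <- coef(lm(Sepal.Length ~ Sepal.Width,
                         data = flowers))[["Sepal.Width"]]

# The same regression fitted separately within each species.
species_slopes <- vapply(
  split(flowers, flowers$Species),
  function(d) coef(lm(Sepal.Length ~ Sepal.Width, data = d))[["Sepal.Width"]],
  numeric(1)
)

overall_slope
species_slopes
```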