DATA ANALYSIS WITH VISUALISATION: PROJECT - BREAST CANCER ANALYSIS

DONE BY: SHOBIKA.S(2019)

DESCRIPTION:

Breast cancer (BC) is one of the most common cancers among women worldwide and accounts for a large share of new cancer cases and cancer-related deaths according to global statistics, which makes it a significant public health problem. Classification and data mining methods are an effective way to classify data, especially in the medical field, where they are widely used to support diagnosis and decision making.

DATASET EXPLANATION:

Features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. They describe characteristics of the cell nuclei present in the image. A few of the images can be found at [Web Link]

The separating plane described above was obtained using the Multisurface Method-Tree (MSM-T) [K. P. Bennett, “Decision Tree Construction Via Linear Programming.” Proceedings of the 4th Midwest Artificial Intelligence and Cognitive Science Society, pp. 97-101, 1992], a classification method which uses linear programming to construct a decision tree. Relevant features were selected using an exhaustive search in the space of 1-4 features and 1-3 separating planes.

The actual linear program used to obtain the separating plane in the 3-dimensional space is that described in: [K. P. Bennett and O. L. Mangasarian: “Robust Linear Programming Discrimination of Two Linearly Inseparable Sets”, Optimization Methods and Software 1, 1992, 23-34].

This database is also available through the UW CS ftp server:

ftp ftp.cs.wisc.edu
cd math-prog/cpo-dataset/machine-learn/WDBC/

ATTRIBUTE EXPLANATION:

  1. ID number
  2. Diagnosis (M = malignant, B = benign)
  3-32. Thirty real-valued features, described below

Ten real-valued features are computed for each cell nucleus:

  1. radius (mean of distances from center to points on the perimeter)
  2. texture (standard deviation of gray-scale values)
  3. perimeter
  4. area
  5. smoothness (local variation in radius lengths)
  6. compactness (perimeter² / area - 1.0)
  7. concavity (severity of concave portions of the contour)
  8. concave points (number of concave portions of the contour)
  9. symmetry
  10. fractal dimension (“coastline approximation” - 1)

The mean, standard error and "worst" (largest) value of each of these features are recorded for each image, which gives the 30 feature columns (for example radius_mean, radius_se and radius_worst).
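
As a small illustration of how the feature columns relate to the ten measurements, the sketch below rebuilds the column names seen in the data set. It assumes each measurement appears once per statistic (mean, se, worst) and uses the names that read.csv produces (for example concave.points_mean). The object names features, stats and feature_cols are illustrative only.

```r
# Ten measurements taken on each cell nucleus (names as they appear after read.csv)
features <- c("radius", "texture", "perimeter", "area", "smoothness",
              "compactness", "concavity", "concave.points", "symmetry",
              "fractal_dimension")
# Three statistics reported per measurement
stats <- c("mean", "se", "worst")
# outer() pairs every measurement with every statistic; as.vector() reads the
# result column by column, so all *_mean names come first, then *_se, then *_worst
feature_cols <- as.vector(outer(features, stats, paste, sep = "_"))
length(feature_cols)
head(feature_cols, 3)
```

A check such as all(feature_cols %in% names(data)) could then confirm that the loaded file has the expected columns.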

OBJECTIVES:

This analysis aims to observe which features are most helpful in predicting malignant or benign cancer and to see general trends that may aid us in model selection and hyperparameter selection. The goal is to classify whether the breast cancer is benign or malignant. To achieve this I have used machine learning classification methods to fit a function that can predict the discrete class of new input.

Load libraries

Descriptive statistics

The first step is to visually inspect the data set.

#DATA EXPLORATION
#Load dataset

data <- read.csv("data.csv")
library("dplyr")
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
View(head(data))
glimpse(data)
## Observations: 569
## Variables: 33
## $ id                      <int> 842302, 842517, 84300903, 84348301, 84...
## $ diagnosis               <fct> M, M, M, M, M, M, M, M, M, M, M, M, M,...
## $ radius_mean             <dbl> 17.990, 20.570, 19.690, 11.420, 20.290...
## $ texture_mean            <dbl> 10.38, 17.77, 21.25, 20.38, 14.34, 15....
## $ perimeter_mean          <dbl> 122.80, 132.90, 130.00, 77.58, 135.10,...
## $ area_mean               <dbl> 1001.0, 1326.0, 1203.0, 386.1, 1297.0,...
## $ smoothness_mean         <dbl> 0.11840, 0.08474, 0.10960, 0.14250, 0....
## $ compactness_mean        <dbl> 0.27760, 0.07864, 0.15990, 0.28390, 0....
## $ concavity_mean          <dbl> 0.30010, 0.08690, 0.19740, 0.24140, 0....
## $ concave.points_mean     <dbl> 0.14710, 0.07017, 0.12790, 0.10520, 0....
## $ symmetry_mean           <dbl> 0.2419, 0.1812, 0.2069, 0.2597, 0.1809...
## $ fractal_dimension_mean  <dbl> 0.07871, 0.05667, 0.05999, 0.09744, 0....
## $ radius_se               <dbl> 1.0950, 0.5435, 0.7456, 0.4956, 0.7572...
## $ texture_se              <dbl> 0.9053, 0.7339, 0.7869, 1.1560, 0.7813...
## $ perimeter_se            <dbl> 8.589, 3.398, 4.585, 3.445, 5.438, 2.2...
## $ area_se                 <dbl> 153.40, 74.08, 94.03, 27.23, 94.44, 27...
## $ smoothness_se           <dbl> 0.006399, 0.005225, 0.006150, 0.009110...
## $ compactness_se          <dbl> 0.049040, 0.013080, 0.040060, 0.074580...
## $ concavity_se            <dbl> 0.05373, 0.01860, 0.03832, 0.05661, 0....
## $ concave.points_se       <dbl> 0.015870, 0.013400, 0.020580, 0.018670...
## $ symmetry_se             <dbl> 0.03003, 0.01389, 0.02250, 0.05963, 0....
## $ fractal_dimension_se    <dbl> 0.006193, 0.003532, 0.004571, 0.009208...
## $ radius_worst            <dbl> 25.38, 24.99, 23.57, 14.91, 22.54, 15....
## $ texture_worst           <dbl> 17.33, 23.41, 25.53, 26.50, 16.67, 23....
## $ perimeter_worst         <dbl> 184.60, 158.80, 152.50, 98.87, 152.20,...
## $ area_worst              <dbl> 2019.0, 1956.0, 1709.0, 567.7, 1575.0,...
## $ smoothness_worst        <dbl> 0.1622, 0.1238, 0.1444, 0.2098, 0.1374...
## $ compactness_worst       <dbl> 0.6656, 0.1866, 0.4245, 0.8663, 0.2050...
## $ concavity_worst         <dbl> 0.71190, 0.24160, 0.45040, 0.68690, 0....
## $ concave.points_worst    <dbl> 0.26540, 0.18600, 0.24300, 0.25750, 0....
## $ symmetry_worst          <dbl> 0.4601, 0.2750, 0.3613, 0.6638, 0.2364...
## $ fractal_dimension_worst <dbl> 0.11890, 0.08902, 0.08758, 0.17300, 0....
## $ X                       <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
#structure of the dataset
str(data)
## 'data.frame':    569 obs. of  33 variables:
##  $ id                     : int  842302 842517 84300903 84348301 84358402 843786 844359 84458202 844981 84501001 ...
##  $ diagnosis              : Factor w/ 2 levels "B","M": 2 2 2 2 2 2 2 2 2 2 ...
##  $ radius_mean            : num  18 20.6 19.7 11.4 20.3 ...
##  $ texture_mean           : num  10.4 17.8 21.2 20.4 14.3 ...
##  $ perimeter_mean         : num  122.8 132.9 130 77.6 135.1 ...
##  $ area_mean              : num  1001 1326 1203 386 1297 ...
##  $ smoothness_mean        : num  0.1184 0.0847 0.1096 0.1425 0.1003 ...
##  $ compactness_mean       : num  0.2776 0.0786 0.1599 0.2839 0.1328 ...
##  $ concavity_mean         : num  0.3001 0.0869 0.1974 0.2414 0.198 ...
##  $ concave.points_mean    : num  0.1471 0.0702 0.1279 0.1052 0.1043 ...
##  $ symmetry_mean          : num  0.242 0.181 0.207 0.26 0.181 ...
##  $ fractal_dimension_mean : num  0.0787 0.0567 0.06 0.0974 0.0588 ...
##  $ radius_se              : num  1.095 0.543 0.746 0.496 0.757 ...
##  $ texture_se             : num  0.905 0.734 0.787 1.156 0.781 ...
##  $ perimeter_se           : num  8.59 3.4 4.58 3.44 5.44 ...
##  $ area_se                : num  153.4 74.1 94 27.2 94.4 ...
##  $ smoothness_se          : num  0.0064 0.00522 0.00615 0.00911 0.01149 ...
##  $ compactness_se         : num  0.049 0.0131 0.0401 0.0746 0.0246 ...
##  $ concavity_se           : num  0.0537 0.0186 0.0383 0.0566 0.0569 ...
##  $ concave.points_se      : num  0.0159 0.0134 0.0206 0.0187 0.0188 ...
##  $ symmetry_se            : num  0.03 0.0139 0.0225 0.0596 0.0176 ...
##  $ fractal_dimension_se   : num  0.00619 0.00353 0.00457 0.00921 0.00511 ...
##  $ radius_worst           : num  25.4 25 23.6 14.9 22.5 ...
##  $ texture_worst          : num  17.3 23.4 25.5 26.5 16.7 ...
##  $ perimeter_worst        : num  184.6 158.8 152.5 98.9 152.2 ...
##  $ area_worst             : num  2019 1956 1709 568 1575 ...
##  $ smoothness_worst       : num  0.162 0.124 0.144 0.21 0.137 ...
##  $ compactness_worst      : num  0.666 0.187 0.424 0.866 0.205 ...
##  $ concavity_worst        : num  0.712 0.242 0.45 0.687 0.4 ...
##  $ concave.points_worst   : num  0.265 0.186 0.243 0.258 0.163 ...
##  $ symmetry_worst         : num  0.46 0.275 0.361 0.664 0.236 ...
##  $ fractal_dimension_worst: num  0.1189 0.089 0.0876 0.173 0.0768 ...
##  $ X                      : logi  NA NA NA NA NA NA ...
#dimension of data set
dim(data)
## [1] 569  33
#summary of data set
summary(data)
##        id            diagnosis  radius_mean      texture_mean  
##  Min.   :     8670   B:357     Min.   : 6.981   Min.   : 9.71  
##  1st Qu.:   869218   M:212     1st Qu.:11.700   1st Qu.:16.17  
##  Median :   906024             Median :13.370   Median :18.84  
##  Mean   : 30371831             Mean   :14.127   Mean   :19.29  
##  3rd Qu.:  8813129             3rd Qu.:15.780   3rd Qu.:21.80  
##  Max.   :911320502             Max.   :28.110   Max.   :39.28  
##  perimeter_mean     area_mean      smoothness_mean   compactness_mean 
##  Min.   : 43.79   Min.   : 143.5   Min.   :0.05263   Min.   :0.01938  
##  1st Qu.: 75.17   1st Qu.: 420.3   1st Qu.:0.08637   1st Qu.:0.06492  
##  Median : 86.24   Median : 551.1   Median :0.09587   Median :0.09263  
##  Mean   : 91.97   Mean   : 654.9   Mean   :0.09636   Mean   :0.10434  
##  3rd Qu.:104.10   3rd Qu.: 782.7   3rd Qu.:0.10530   3rd Qu.:0.13040  
##  Max.   :188.50   Max.   :2501.0   Max.   :0.16340   Max.   :0.34540  
##  concavity_mean    concave.points_mean symmetry_mean   
##  Min.   :0.00000   Min.   :0.00000     Min.   :0.1060  
##  1st Qu.:0.02956   1st Qu.:0.02031     1st Qu.:0.1619  
##  Median :0.06154   Median :0.03350     Median :0.1792  
##  Mean   :0.08880   Mean   :0.04892     Mean   :0.1812  
##  3rd Qu.:0.13070   3rd Qu.:0.07400     3rd Qu.:0.1957  
##  Max.   :0.42680   Max.   :0.20120     Max.   :0.3040  
##  fractal_dimension_mean   radius_se        texture_se      perimeter_se   
##  Min.   :0.04996        Min.   :0.1115   Min.   :0.3602   Min.   : 0.757  
##  1st Qu.:0.05770        1st Qu.:0.2324   1st Qu.:0.8339   1st Qu.: 1.606  
##  Median :0.06154        Median :0.3242   Median :1.1080   Median : 2.287  
##  Mean   :0.06280        Mean   :0.4052   Mean   :1.2169   Mean   : 2.866  
##  3rd Qu.:0.06612        3rd Qu.:0.4789   3rd Qu.:1.4740   3rd Qu.: 3.357  
##  Max.   :0.09744        Max.   :2.8730   Max.   :4.8850   Max.   :21.980  
##     area_se        smoothness_se      compactness_se      concavity_se    
##  Min.   :  6.802   Min.   :0.001713   Min.   :0.002252   Min.   :0.00000  
##  1st Qu.: 17.850   1st Qu.:0.005169   1st Qu.:0.013080   1st Qu.:0.01509  
##  Median : 24.530   Median :0.006380   Median :0.020450   Median :0.02589  
##  Mean   : 40.337   Mean   :0.007041   Mean   :0.025478   Mean   :0.03189  
##  3rd Qu.: 45.190   3rd Qu.:0.008146   3rd Qu.:0.032450   3rd Qu.:0.04205  
##  Max.   :542.200   Max.   :0.031130   Max.   :0.135400   Max.   :0.39600  
##  concave.points_se   symmetry_se       fractal_dimension_se
##  Min.   :0.000000   Min.   :0.007882   Min.   :0.0008948   
##  1st Qu.:0.007638   1st Qu.:0.015160   1st Qu.:0.0022480   
##  Median :0.010930   Median :0.018730   Median :0.0031870   
##  Mean   :0.011796   Mean   :0.020542   Mean   :0.0037949   
##  3rd Qu.:0.014710   3rd Qu.:0.023480   3rd Qu.:0.0045580   
##  Max.   :0.052790   Max.   :0.078950   Max.   :0.0298400   
##   radius_worst   texture_worst   perimeter_worst    area_worst    
##  Min.   : 7.93   Min.   :12.02   Min.   : 50.41   Min.   : 185.2  
##  1st Qu.:13.01   1st Qu.:21.08   1st Qu.: 84.11   1st Qu.: 515.3  
##  Median :14.97   Median :25.41   Median : 97.66   Median : 686.5  
##  Mean   :16.27   Mean   :25.68   Mean   :107.26   Mean   : 880.6  
##  3rd Qu.:18.79   3rd Qu.:29.72   3rd Qu.:125.40   3rd Qu.:1084.0  
##  Max.   :36.04   Max.   :49.54   Max.   :251.20   Max.   :4254.0  
##  smoothness_worst  compactness_worst concavity_worst  concave.points_worst
##  Min.   :0.07117   Min.   :0.02729   Min.   :0.0000   Min.   :0.00000     
##  1st Qu.:0.11660   1st Qu.:0.14720   1st Qu.:0.1145   1st Qu.:0.06493     
##  Median :0.13130   Median :0.21190   Median :0.2267   Median :0.09993     
##  Mean   :0.13237   Mean   :0.25427   Mean   :0.2722   Mean   :0.11461     
##  3rd Qu.:0.14600   3rd Qu.:0.33910   3rd Qu.:0.3829   3rd Qu.:0.16140     
##  Max.   :0.22260   Max.   :1.05800   Max.   :1.2520   Max.   :0.29100     
##  symmetry_worst   fractal_dimension_worst    X          
##  Min.   :0.1565   Min.   :0.05504         Mode:logical  
##  1st Qu.:0.2504   1st Qu.:0.07146         NA's:569      
##  Median :0.2822   Median :0.08004                       
##  Mean   :0.2901   Mean   :0.08395                       
##  3rd Qu.:0.3179   3rd Qu.:0.09208                       
##  Max.   :0.6638   Max.   :0.20750
#remove column 33 (X), which contains only NA values
data<-data[-33]
summary(data)
##        id            diagnosis  radius_mean      texture_mean  
##  Min.   :     8670   B:357     Min.   : 6.981   Min.   : 9.71  
##  1st Qu.:   869218   M:212     1st Qu.:11.700   1st Qu.:16.17  
##  Median :   906024             Median :13.370   Median :18.84  
##  Mean   : 30371831             Mean   :14.127   Mean   :19.29  
##  3rd Qu.:  8813129             3rd Qu.:15.780   3rd Qu.:21.80  
##  Max.   :911320502             Max.   :28.110   Max.   :39.28  
##  perimeter_mean     area_mean      smoothness_mean   compactness_mean 
##  Min.   : 43.79   Min.   : 143.5   Min.   :0.05263   Min.   :0.01938  
##  1st Qu.: 75.17   1st Qu.: 420.3   1st Qu.:0.08637   1st Qu.:0.06492  
##  Median : 86.24   Median : 551.1   Median :0.09587   Median :0.09263  
##  Mean   : 91.97   Mean   : 654.9   Mean   :0.09636   Mean   :0.10434  
##  3rd Qu.:104.10   3rd Qu.: 782.7   3rd Qu.:0.10530   3rd Qu.:0.13040  
##  Max.   :188.50   Max.   :2501.0   Max.   :0.16340   Max.   :0.34540  
##  concavity_mean    concave.points_mean symmetry_mean   
##  Min.   :0.00000   Min.   :0.00000     Min.   :0.1060  
##  1st Qu.:0.02956   1st Qu.:0.02031     1st Qu.:0.1619  
##  Median :0.06154   Median :0.03350     Median :0.1792  
##  Mean   :0.08880   Mean   :0.04892     Mean   :0.1812  
##  3rd Qu.:0.13070   3rd Qu.:0.07400     3rd Qu.:0.1957  
##  Max.   :0.42680   Max.   :0.20120     Max.   :0.3040  
##  fractal_dimension_mean   radius_se        texture_se      perimeter_se   
##  Min.   :0.04996        Min.   :0.1115   Min.   :0.3602   Min.   : 0.757  
##  1st Qu.:0.05770        1st Qu.:0.2324   1st Qu.:0.8339   1st Qu.: 1.606  
##  Median :0.06154        Median :0.3242   Median :1.1080   Median : 2.287  
##  Mean   :0.06280        Mean   :0.4052   Mean   :1.2169   Mean   : 2.866  
##  3rd Qu.:0.06612        3rd Qu.:0.4789   3rd Qu.:1.4740   3rd Qu.: 3.357  
##  Max.   :0.09744        Max.   :2.8730   Max.   :4.8850   Max.   :21.980  
##     area_se        smoothness_se      compactness_se      concavity_se    
##  Min.   :  6.802   Min.   :0.001713   Min.   :0.002252   Min.   :0.00000  
##  1st Qu.: 17.850   1st Qu.:0.005169   1st Qu.:0.013080   1st Qu.:0.01509  
##  Median : 24.530   Median :0.006380   Median :0.020450   Median :0.02589  
##  Mean   : 40.337   Mean   :0.007041   Mean   :0.025478   Mean   :0.03189  
##  3rd Qu.: 45.190   3rd Qu.:0.008146   3rd Qu.:0.032450   3rd Qu.:0.04205  
##  Max.   :542.200   Max.   :0.031130   Max.   :0.135400   Max.   :0.39600  
##  concave.points_se   symmetry_se       fractal_dimension_se
##  Min.   :0.000000   Min.   :0.007882   Min.   :0.0008948   
##  1st Qu.:0.007638   1st Qu.:0.015160   1st Qu.:0.0022480   
##  Median :0.010930   Median :0.018730   Median :0.0031870   
##  Mean   :0.011796   Mean   :0.020542   Mean   :0.0037949   
##  3rd Qu.:0.014710   3rd Qu.:0.023480   3rd Qu.:0.0045580   
##  Max.   :0.052790   Max.   :0.078950   Max.   :0.0298400   
##   radius_worst   texture_worst   perimeter_worst    area_worst    
##  Min.   : 7.93   Min.   :12.02   Min.   : 50.41   Min.   : 185.2  
##  1st Qu.:13.01   1st Qu.:21.08   1st Qu.: 84.11   1st Qu.: 515.3  
##  Median :14.97   Median :25.41   Median : 97.66   Median : 686.5  
##  Mean   :16.27   Mean   :25.68   Mean   :107.26   Mean   : 880.6  
##  3rd Qu.:18.79   3rd Qu.:29.72   3rd Qu.:125.40   3rd Qu.:1084.0  
##  Max.   :36.04   Max.   :49.54   Max.   :251.20   Max.   :4254.0  
##  smoothness_worst  compactness_worst concavity_worst  concave.points_worst
##  Min.   :0.07117   Min.   :0.02729   Min.   :0.0000   Min.   :0.00000     
##  1st Qu.:0.11660   1st Qu.:0.14720   1st Qu.:0.1145   1st Qu.:0.06493     
##  Median :0.13130   Median :0.21190   Median :0.2267   Median :0.09993     
##  Mean   :0.13237   Mean   :0.25427   Mean   :0.2722   Mean   :0.11461     
##  3rd Qu.:0.14600   3rd Qu.:0.33910   3rd Qu.:0.3829   3rd Qu.:0.16140     
##  Max.   :0.22260   Max.   :1.05800   Max.   :1.2520   Max.   :0.29100     
##  symmetry_worst   fractal_dimension_worst
##  Min.   :0.1565   Min.   :0.05504        
##  1st Qu.:0.2504   1st Qu.:0.07146        
##  Median :0.2822   Median :0.08004        
##  Mean   :0.2901   Mean   :0.08395        
##  3rd Qu.:0.3179   3rd Qu.:0.09208        
##  Max.   :0.6638   Max.   :0.20750

DATA ANALYSIS

Number of benign and malignant diagnoses

data %>% count(diagnosis)
## # A tibble: 2 x 2
##   diagnosis     n
##   <fct>     <int>
## 1 B           357
## 2 M           212

Percentage of benign and malignant diagnoses

data %>% count(diagnosis) %>% group_by(diagnosis) %>%
  summarize(perc_dx = round((n / nrow(data)) * 100, 2))
## # A tibble: 2 x 2
##   diagnosis perc_dx
##   <fct>       <dbl>
## 1 B            62.7
## 2 M            37.3
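
The same percentages can be obtained in base R without naming the sample size at all. The sketch below is illustrative: it rebuilds the diagnosis column from the counts reported above (357 benign, 212 malignant) so that it runs without data.csv. On the real data, data$diagnosis would be used instead.

```r
# Class counts taken from the count(diagnosis) output above
diagnosis <- factor(rep(c("B", "M"), times = c(357, 212)))
# prop.table() divides each count by the total, so no sample size is hard-coded
perc_dx <- round(prop.table(table(diagnosis)) * 100, 2)
perc_dx
```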

DATA VISUALIZATION

Frequency of cancer diagnosis

diagnosis.table <- table(data$diagnosis)
colors <- terrain.colors(2) 
# Create a pie chart 
diagnosis.prop.table <- prop.table(diagnosis.table)*100
diagnosis.prop.df <- as.data.frame(diagnosis.prop.table)
pielabels <- sprintf("%s - %3.1f%s", diagnosis.prop.df[,1], diagnosis.prop.table, "%")
pie(diagnosis.prop.table,
  labels=pielabels,  
  clockwise=TRUE,
  col=colors,
  border="gainsboro",
  radius=0.8,
  cex=0.8, 
  main="frequency of cancer diagnosis")
legend(1, .4, legend=diagnosis.prop.df[,1], cex = 0.7, fill = colors)

Comparing the radius, area and concavity of benign and malignant cases

library(ggplot2)
# a constant fill colour belongs in the geom, not inside aes()
ggplot(data=data,aes(x=diagnosis,y=radius_mean))+geom_boxplot(fill="pink")+ggtitle("radius of Benign Vs Malignant")

ggplot(data=data,aes(x=diagnosis,y=area_mean))+geom_boxplot()+ggtitle("area of Benign Vs Malignant")

ggplot(data=data,aes(x=diagnosis,y=concavity_mean))+geom_boxplot()+ggtitle("concavity of  Benign Vs Malignant")

The box plots show that malignant cases tend to have a higher mean radius, area and concavity than benign cases.
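
The same comparison can be made numerically by computing a summary per diagnosis group. The sketch below is only an illustration on a toy data frame with made-up values (toy and group_medians are hypothetical names). On the real data, the data argument would be the data frame loaded earlier.

```r
# Toy data with made-up values, for illustration only (not the real data set)
toy <- data.frame(
  diagnosis   = factor(c("B", "B", "B", "M", "M", "M")),
  radius_mean = c(11.4, 12.5, 13.1, 18.0, 20.6, 19.7),
  area_mean   = c(386, 480, 530, 1001, 1326, 1203)
)
# Median of each feature within each diagnosis group
group_medians <- aggregate(cbind(radius_mean, area_mean) ~ diagnosis,
                           data = toy, FUN = median)
group_medians
```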

Bar plot of the number of benign and malignant cases

# fill by diagnosis: a continuous variable such as texture_mean cannot colour whole bars
ggplot(data,aes(x=diagnosis,fill=diagnosis))+geom_bar()+ggtitle("number of benign and malignant cases")

Cases selected from the box plot analysis: mean radius between 10 and 15 and mean compactness above 0.1

sel_data=data[data$radius_mean>10&
                data$radius_mean<15&
                data$compactness_mean>0.1,]
ggplot(sel_data,aes(x=diagnosis,y=radius_mean,fill=diagnosis))+geom_col()+ggtitle("cases with mean radius 10-15 and mean compactness > 0.1")

Density plot based on texture mean

ggplot(data,aes(x=texture_mean,fill=as.factor(diagnosis)))+geom_density(alpha=0.4)+ggtitle(" texture mean  for benign vs malignant")

Analysing perimeter mean for benign and malignant cases

ggplot(data,aes(x=as.factor(diagnosis),y=perimeter_mean))+geom_violin()+ggtitle(" perimeter mean  for benign vs malignant")

Analysing concavity mean for benign and malignant cases

data1=data%>%filter(concavity_mean>0.2)
ggplot(data1,aes(x=concavity_mean,y=diagnosis,size=perimeter_se))+geom_point()+ggtitle("concavity mean  for benign vs malignant")

Bar plot for area_se >15

ggplot(data, aes(x = area_se>15, fill = diagnosis)) +geom_bar(position = "fill")+ggtitle("area se for benign vs malignant")

DISTRIBUTION OF DATA VIA HISTOGRAMS

# binwidth chosen to suit the range of concavity_mean (0 to about 0.43)
ggplot(data,aes(x=concavity_mean,fill=diagnosis))+geom_histogram(binwidth=0.02)+ggtitle("concavity mean for benign vs malignant")

# texture_se ranges from about 0.36 to 4.9, so a binwidth of 10 would give a single bar
ggplot(data, aes(x = texture_se)) +
  geom_histogram(binwidth=0.25) +
  facet_wrap(~ diagnosis)+ggtitle("texture se for benign vs malignant")

ggplot(data, aes(x = perimeter_mean)) +
  geom_histogram(binwidth=10) +
  facet_wrap(~ diagnosis)+ggtitle("perimeter mean for benign vs malignant")

Applying machine learning models

In this section I will:

1. Train the algorithm on the first part of the data (the training set),

2. Make predictions on the second part (the test set) and

3. Evaluate the predictions against the expected results.
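
The three steps above can be sketched end to end in base R. This is a minimal illustration on simulated data, not the analysis itself: the data frame sim, its columns x and y, and the simple random split (instead of the stratified split used below) are all assumptions made for the example.

```r
set.seed(1)
# Simulated data: one predictor and a binary outcome (illustration only)
n <- 200
x <- rnorm(n)
y <- rbinom(n, 1, plogis(1.5 * x))
sim <- data.frame(x = x, y = y)

# Step 1: train on the first part (65% of the rows, chosen at random)
train_idx <- sample(seq_len(n), size = round(0.65 * n))
train <- sim[train_idx, ]
test  <- sim[-train_idx, ]
fit <- glm(y ~ x, family = binomial(), data = train)

# Step 2: predict on the second part
prob <- predict(fit, newdata = test, type = "response")
pred <- ifelse(prob > 0.5, 1, 0)

# Step 3: compare the predictions with the known outcomes
tab <- table(actual = test$y, predicted = pred)
acc <- mean(pred == test$y)
tab
acc
```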

LOGISTIC REGRESSION

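Split the data into training and testing sets

library(caTools)
# recode the diagnosis as a factor with labels 0 (benign) and 1 (malignant)
data$diagnosis<-factor(data$diagnosis,levels=c("B","M"),labels=c(0,1))
set.seed(123)
# sample.split keeps the class proportions the same in both parts
split=sample.split(data$diagnosis,SplitRatio=0.65)
# column 33 (X) was already removed above, so it is not dropped again here
training_set<-subset(data,split==TRUE)
View(training_set)
test_set<-subset(data,split==FALSE)
View(test_set)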

Normalisation process

training_set[,3:32]<-scale(training_set[,3:32])
View(training_set)
test_set[,3:32]<-scale(test_set[,3:32])
View(test_set)
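
The code above standardises the training and test sets separately, each with its own mean and standard deviation. A common alternative, shown here only as a sketch and not used for the results in this report, is to reuse the training set's parameters on the test set so that both are on exactly the same scale. The toy data frames train_x and test_x hold made-up values and stand in for columns 3:32.

```r
# Toy numeric features with made-up values, standing in for columns 3:32
train_x <- data.frame(a = c(1, 2, 3, 4), b = c(10, 20, 30, 40))
test_x  <- data.frame(a = c(2, 5),       b = c(15, 35))

# Standardise the training features and keep the parameters scale() used
train_scaled <- scale(train_x)
centre <- attr(train_scaled, "scaled:center")
spread <- attr(train_scaled, "scaled:scale")

# Apply the training mean and standard deviation to the test features
test_scaled <- scale(test_x, center = centre, scale = spread)
test_scaled
```

With this approach a test value equal to the training mean maps to 0, whatever the other test values happen to be.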

Create a model

# diagnosis ~ . uses every other column as a predictor, including id
reg<-glm(formula=diagnosis~ .,family=quasibinomial(),data=training_set)
## Warning: glm.fit: algorithm did not converge
summary(reg)
## 
## Call:
## glm(formula = diagnosis ~ ., family = quasibinomial(), data = training_set)
## 
## Deviance Residuals: 
##        Min          1Q      Median          3Q         Max  
## -9.429e-05  -2.100e-08  -2.100e-08   2.100e-08   1.208e-04  
## 
## Coefficients:
##                           Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)             -3.449e+00  8.343e-01   -4.134 4.50e-05 ***
## id                       1.082e-07  1.980e-09   54.641  < 2e-16 ***
## radius_mean             -3.229e+02  2.003e+01  -16.120  < 2e-16 ***
## texture_mean             4.486e+01  5.274e-01   85.049  < 2e-16 ***
## perimeter_mean           5.784e+02  1.267e+01   45.658  < 2e-16 ***
## area_mean               -3.233e+02  1.237e+01  -26.145  < 2e-16 ***
## smoothness_mean          3.435e+01  4.869e-01   70.548  < 2e-16 ***
## compactness_mean        -1.725e+02  1.612e+00 -107.061  < 2e-16 ***
## concavity_mean          -2.358e+01  2.002e+00  -11.777  < 2e-16 ***
## concave.points_mean      1.083e+02  3.078e+00   35.178  < 2e-16 ***
## symmetry_mean           -4.042e+01  7.221e-01  -55.976  < 2e-16 ***
## fractal_dimension_mean   2.238e+00  6.437e-01    3.477 0.000574 ***
## radius_se                1.242e+02  3.505e+00   35.431  < 2e-16 ***
## texture_se              -1.402e+00  4.364e-01   -3.213 0.001439 ** 
## perimeter_se            -3.235e+01  2.090e+00  -15.477  < 2e-16 ***
## area_se                 -3.336e+01  4.455e+00   -7.489 6.09e-13 ***
## smoothness_se           -2.412e+01  4.684e-01  -51.503  < 2e-16 ***
## compactness_se           5.370e+01  8.409e-01   63.855  < 2e-16 ***
## concavity_se            -9.459e+01  9.038e-01 -104.654  < 2e-16 ***
## concave.points_se        1.017e+02  9.083e-01  111.962  < 2e-16 ***
## symmetry_se             -7.320e+00  5.140e-01  -14.241  < 2e-16 ***
## fractal_dimension_se    -6.932e+01  9.562e-01  -72.497  < 2e-16 ***
## radius_worst            -2.509e+02  1.202e+01  -20.874  < 2e-16 ***
## texture_worst            1.240e+01  6.738e-01   18.404  < 2e-16 ***
## perimeter_worst          6.154e+01  7.704e+00    7.988 2.17e-14 ***
## area_worst               4.062e+02  9.885e+00   41.093  < 2e-16 ***
## smoothness_worst         1.854e+01  5.165e-01   35.893  < 2e-16 ***
## compactness_worst       -4.216e+01  1.384e+00  -30.453  < 2e-16 ***
## concavity_worst          1.342e+02  1.768e+00   75.941  < 2e-16 ***
## concave.points_worst    -4.460e+01  2.720e+00  -16.400  < 2e-16 ***
## symmetry_worst           5.770e+01  1.080e+00   53.450  < 2e-16 ***
## fractal_dimension_worst  4.384e+01  9.737e-01   45.031  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for quasibinomial family taken to be 4.936336e-10)
## 
##     Null deviance: 4.8878e+02  on 369  degrees of freedom
## Residual deviance: 1.2279e-07  on 338  degrees of freedom
## AIC: NA
## 
## Number of Fisher Scoring iterations: 25

Predict on the test set

prob_pred<-predict(object=reg,type="response",newdata=test_set[-2])
View(prob_pred)

Convert the predicted probabilities to class labels

y_pred<-ifelse(prob_pred>0.5,1,0)
View(y_pred)

confusion matrix for verifying prediction

tab<-table(test_set[,2],y_pred)
tab
##    y_pred
##       0   1
##   0 121   4
##   1   5  69

Accuracy

acc<-sum(diag(tab))/sum(tab) 
acc
## [1] 0.9547739

Error

err<-1-acc
err
## [1] 0.04522613
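
Accuracy is only one summary of the confusion matrix. As a sketch, the code below re-enters the matrix printed above by hand (so that it runs on its own) and also derives sensitivity and specificity, treating class 1 (malignant) as the positive class. The variable names are illustrative.

```r
# Confusion matrix copied from the output above:
# rows are the actual classes, columns are the predicted classes
tab <- matrix(c(121, 5, 4, 69), nrow = 2,
              dimnames = list(actual = c("0", "1"), predicted = c("0", "1")))

accuracy    <- sum(diag(tab)) / sum(tab)        # (121 + 69) / 199
sensitivity <- tab["1", "1"] / sum(tab["1", ])  # malignant cases correctly flagged: 69 / 74
specificity <- tab["0", "0"] / sum(tab["0", ])  # benign cases correctly cleared: 121 / 125
round(c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity), 3)
```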

CONCLUSION:

The feature analysis shows that a few features, such as the mean radius, area and concavity, separate malignant from benign cases more clearly than others. We fitted a logistic regression model to the standardised data and obtained good results on the test set: an accuracy of 0.955, with a sensitivity of 69/74 ≈ 0.932 according to the confusion matrix.