Data Preparation

library(tidyverse)
library(psych)

bc_data <- read.csv("wisc_bc_data.csv", stringsAsFactors = TRUE)

# Create a two-level bucket variable for radius_mean:
# "bottom half" if radius_mean < 19.7, otherwise "top half"
bc_data <- bc_data %>% 
    mutate(radius_mean_size = ifelse(radius_mean < 19.7, "bottom half", "top half"))
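
The 19.7 cutoff is hard-coded, and because the third quartile of radius_mean is 15.78, it places well over half of the observations in the "bottom half" group. If an even split is what is intended (an assumption about intent, not something the write-up states), one option is to cut at the sample median. The sketch below uses a small toy data frame; `toy` and `cutoff` are illustrative names, and the last radius value is made up.

```r
library(dplyr)

# Toy stand-in for bc_data: first five values as printed by str(),
# plus one made-up value so the example has an even number of rows.
toy <- data.frame(radius_mean = c(18.0, 20.6, 19.7, 11.4, 20.3, 12.4))

# Derive the cutoff from the data instead of hard-coding it
cutoff <- median(toy$radius_mean)

toy <- toy %>%
  mutate(radius_mean_size = if_else(radius_mean < cutoff,
                                    "bottom half", "top half"))

table(toy$radius_mean_size)
```

On the real data the same pattern would use `median(bc_data$radius_mean)` as the cutoff.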

str(bc_data)
## 'data.frame':    569 obs. of  33 variables:
##  $ id                     : int  842302 842517 84300903 84348301 84358402 843786 844359 84458202 844981 84501001 ...
##  $ diagnosis              : Factor w/ 2 levels "B","M": 2 2 2 2 2 2 2 2 2 2 ...
##  $ radius_mean            : num  18 20.6 19.7 11.4 20.3 ...
##  $ texture_mean           : num  10.4 17.8 21.2 20.4 14.3 ...
##  $ perimeter_mean         : num  122.8 132.9 130 77.6 135.1 ...
##  $ area_mean              : num  1001 1326 1203 386 1297 ...
##  $ smoothness_mean        : num  0.1184 0.0847 0.1096 0.1425 0.1003 ...
##  $ compactness_mean       : num  0.2776 0.0786 0.1599 0.2839 0.1328 ...
##  $ concavity_mean         : num  0.3001 0.0869 0.1974 0.2414 0.198 ...
##  $ concave.points_mean    : num  0.1471 0.0702 0.1279 0.1052 0.1043 ...
##  $ symmetry_mean          : num  0.242 0.181 0.207 0.26 0.181 ...
##  $ fractal_dimension_mean : num  0.0787 0.0567 0.06 0.0974 0.0588 ...
##  $ radius_se              : num  1.095 0.543 0.746 0.496 0.757 ...
##  $ texture_se             : num  0.905 0.734 0.787 1.156 0.781 ...
##  $ perimeter_se           : num  8.59 3.4 4.58 3.44 5.44 ...
##  $ area_se                : num  153.4 74.1 94 27.2 94.4 ...
##  $ smoothness_se          : num  0.0064 0.00522 0.00615 0.00911 0.01149 ...
##  $ compactness_se         : num  0.049 0.0131 0.0401 0.0746 0.0246 ...
##  $ concavity_se           : num  0.0537 0.0186 0.0383 0.0566 0.0569 ...
##  $ concave.points_se      : num  0.0159 0.0134 0.0206 0.0187 0.0188 ...
##  $ symmetry_se            : num  0.03 0.0139 0.0225 0.0596 0.0176 ...
##  $ fractal_dimension_se   : num  0.00619 0.00353 0.00457 0.00921 0.00511 ...
##  $ radius_worst           : num  25.4 25 23.6 14.9 22.5 ...
##  $ texture_worst          : num  17.3 23.4 25.5 26.5 16.7 ...
##  $ perimeter_worst        : num  184.6 158.8 152.5 98.9 152.2 ...
##  $ area_worst             : num  2019 1956 1709 568 1575 ...
##  $ smoothness_worst       : num  0.162 0.124 0.144 0.21 0.137 ...
##  $ compactness_worst      : num  0.666 0.187 0.424 0.866 0.205 ...
##  $ concavity_worst        : num  0.712 0.242 0.45 0.687 0.4 ...
##  $ concave.points_worst   : num  0.265 0.186 0.243 0.258 0.163 ...
##  $ symmetry_worst         : num  0.46 0.275 0.361 0.664 0.236 ...
##  $ fractal_dimension_worst: num  0.1189 0.089 0.0876 0.173 0.0768 ...
##  $ radius_mean_size       : chr  "bottom half" "top half" "bottom half" "bottom half" ...

Research question

Are any of the variables or any combination of the variables predictors for a malignant or benign diagnosis?

Cases

What are the cases, and how many are there?

Each case represents a sample from a biopsy of a breast mass. There are 569 observations in the data set.

Data collection

Describe the method of data collection.

“Features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. They describe characteristics of the cell nuclei present in the image.”

Type of study

What type of study is this (observational/experiment)?

This is an observational study of biopsied breast masses; the features were measured from images of the samples, and no treatment or intervention was assigned.

Data Source

If you collected the data, state self-collected. If not, provide a citation/link.

This database is also available through the UW CS ftp server: ftp ftp.cs.wisc.edu, then cd math-prog/cpo-dataset/machine-learn/WDBC/

It can also be found on the UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29

Dua, D. and Karra Taniskidou, E. (2017). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.

Creators:

  1. Dr. William H. Wolberg, General Surgery Dept. University of Wisconsin, Clinical Sciences Center Madison, WI 53792 wolberg ‘@’ eagle.surgery.wisc.edu

  2. W. Nick Street, Computer Sciences Dept. University of Wisconsin, 1210 West Dayton St., Madison, WI 53706 street ‘@’ cs.wisc.edu 608-262-6619

  3. Olvi L. Mangasarian, Computer Sciences Dept. University of Wisconsin, 1210 West Dayton St., Madison, WI 53706 olvi ‘@’ cs.wisc.edu

Dependent Variable

What is the response variable, and what type is it (numerical/categorical)?

The response variable is diagnosis, a binary categorical variable with levels B (benign) and M (malignant).

Independent Variable

What is the explanatory variable, and what type is it (numerical/categorical)? You should have two independent variables, one quantitative and one qualitative.

There are 30 quantitative independent variables, plus one qualitative variable, radius_mean_size, computed from radius_mean. It labels each mass as “bottom half” or “top half” according to whether radius_mean falls below the 19.7 cutoff set in Data Preparation. The 30 quantitative variables are the mean, standard error, and worst-case value of each of the ten features listed below.

Ten real-valued features are computed for each cell nucleus:

  1. radius (mean of distances from center to points on the perimeter)
  2. texture (standard deviation of gray-scale values)
  3. perimeter
  4. area
  5. smoothness (local variation in radius lengths)
  6. compactness (perimeter^2 / area - 1.0)
  7. concavity (severity of concave portions of the contour)
  8. concave points (number of concave portions of the contour)
  9. symmetry
  10. fractal dimension (“coastline approximation” - 1)
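
Because each feature appears three times in the data (with the suffixes `_mean`, `_se`, and `_worst`), the column groups can be pulled out by suffix. The following is a minimal sketch on a toy data frame that reuses a few real column names and the first three rows printed by `str()`; the object names `toy`, `mean_vars`, `se_vars`, and `worst_vars` are illustrative.

```r
library(dplyr)

# Toy stand-in for bc_data with four of the real column names
toy <- data.frame(
  id           = 1:3,
  radius_mean  = c(18.0, 20.6, 19.7),
  radius_se    = c(1.095, 0.543, 0.746),
  radius_worst = c(25.4, 25.0, 23.6),
  texture_mean = c(10.4, 17.8, 21.2)
)

# Select each family of measurements by its column-name suffix
mean_vars  <- toy %>% select(ends_with("_mean"))
se_vars    <- toy %>% select(ends_with("_se"))
worst_vars <- toy %>% select(ends_with("_worst"))

names(mean_vars)
```

Note that on the full data set `ends_with("_mean")` would not match `radius_mean_size`, since that name ends in `_size`.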

Relevant summary statistics

Provide summary statistics for each of the variables. Also include appropriate visualizations related to your research question (e.g. scatter plots, boxplots, etc.). This step requires the use of R, hence a code chunk is provided below. Insert more code chunks as needed.

summary(bc_data)
##        id            diagnosis  radius_mean      texture_mean  
##  Min.   :     8670   B:357     Min.   : 6.981   Min.   : 9.71  
##  1st Qu.:   869218   M:212     1st Qu.:11.700   1st Qu.:16.17  
##  Median :   906024             Median :13.370   Median :18.84  
##  Mean   : 30371831             Mean   :14.127   Mean   :19.29  
##  3rd Qu.:  8813129             3rd Qu.:15.780   3rd Qu.:21.80  
##  Max.   :911320502             Max.   :28.110   Max.   :39.28  
##  perimeter_mean     area_mean      smoothness_mean   compactness_mean 
##  Min.   : 43.79   Min.   : 143.5   Min.   :0.05263   Min.   :0.01938  
##  1st Qu.: 75.17   1st Qu.: 420.3   1st Qu.:0.08637   1st Qu.:0.06492  
##  Median : 86.24   Median : 551.1   Median :0.09587   Median :0.09263  
##  Mean   : 91.97   Mean   : 654.9   Mean   :0.09636   Mean   :0.10434  
##  3rd Qu.:104.10   3rd Qu.: 782.7   3rd Qu.:0.10530   3rd Qu.:0.13040  
##  Max.   :188.50   Max.   :2501.0   Max.   :0.16340   Max.   :0.34540  
##  concavity_mean    concave.points_mean symmetry_mean   
##  Min.   :0.00000   Min.   :0.00000     Min.   :0.1060  
##  1st Qu.:0.02956   1st Qu.:0.02031     1st Qu.:0.1619  
##  Median :0.06154   Median :0.03350     Median :0.1792  
##  Mean   :0.08880   Mean   :0.04892     Mean   :0.1812  
##  3rd Qu.:0.13070   3rd Qu.:0.07400     3rd Qu.:0.1957  
##  Max.   :0.42680   Max.   :0.20120     Max.   :0.3040  
##  fractal_dimension_mean   radius_se        texture_se      perimeter_se   
##  Min.   :0.04996        Min.   :0.1115   Min.   :0.3602   Min.   : 0.757  
##  1st Qu.:0.05770        1st Qu.:0.2324   1st Qu.:0.8339   1st Qu.: 1.606  
##  Median :0.06154        Median :0.3242   Median :1.1080   Median : 2.287  
##  Mean   :0.06280        Mean   :0.4052   Mean   :1.2169   Mean   : 2.866  
##  3rd Qu.:0.06612        3rd Qu.:0.4789   3rd Qu.:1.4740   3rd Qu.: 3.357  
##  Max.   :0.09744        Max.   :2.8730   Max.   :4.8850   Max.   :21.980  
##     area_se        smoothness_se      compactness_se      concavity_se    
##  Min.   :  6.802   Min.   :0.001713   Min.   :0.002252   Min.   :0.00000  
##  1st Qu.: 17.850   1st Qu.:0.005169   1st Qu.:0.013080   1st Qu.:0.01509  
##  Median : 24.530   Median :0.006380   Median :0.020450   Median :0.02589  
##  Mean   : 40.337   Mean   :0.007041   Mean   :0.025478   Mean   :0.03189  
##  3rd Qu.: 45.190   3rd Qu.:0.008146   3rd Qu.:0.032450   3rd Qu.:0.04205  
##  Max.   :542.200   Max.   :0.031130   Max.   :0.135400   Max.   :0.39600  
##  concave.points_se   symmetry_se       fractal_dimension_se
##  Min.   :0.000000   Min.   :0.007882   Min.   :0.0008948   
##  1st Qu.:0.007638   1st Qu.:0.015160   1st Qu.:0.0022480   
##  Median :0.010930   Median :0.018730   Median :0.0031870   
##  Mean   :0.011796   Mean   :0.020542   Mean   :0.0037949   
##  3rd Qu.:0.014710   3rd Qu.:0.023480   3rd Qu.:0.0045580   
##  Max.   :0.052790   Max.   :0.078950   Max.   :0.0298400   
##   radius_worst   texture_worst   perimeter_worst    area_worst    
##  Min.   : 7.93   Min.   :12.02   Min.   : 50.41   Min.   : 185.2  
##  1st Qu.:13.01   1st Qu.:21.08   1st Qu.: 84.11   1st Qu.: 515.3  
##  Median :14.97   Median :25.41   Median : 97.66   Median : 686.5  
##  Mean   :16.27   Mean   :25.68   Mean   :107.26   Mean   : 880.6  
##  3rd Qu.:18.79   3rd Qu.:29.72   3rd Qu.:125.40   3rd Qu.:1084.0  
##  Max.   :36.04   Max.   :49.54   Max.   :251.20   Max.   :4254.0  
##  smoothness_worst  compactness_worst concavity_worst  concave.points_worst
##  Min.   :0.07117   Min.   :0.02729   Min.   :0.0000   Min.   :0.00000     
##  1st Qu.:0.11660   1st Qu.:0.14720   1st Qu.:0.1145   1st Qu.:0.06493     
##  Median :0.13130   Median :0.21190   Median :0.2267   Median :0.09993     
##  Mean   :0.13237   Mean   :0.25427   Mean   :0.2722   Mean   :0.11461     
##  3rd Qu.:0.14600   3rd Qu.:0.33910   3rd Qu.:0.3829   3rd Qu.:0.16140     
##  Max.   :0.22260   Max.   :1.05800   Max.   :1.2520   Max.   :0.29100     
##  symmetry_worst   fractal_dimension_worst radius_mean_size  
##  Min.   :0.1565   Min.   :0.05504         Length:569        
##  1st Qu.:0.2504   1st Qu.:0.07146         Class :character  
##  Median :0.2822   Median :0.08004         Mode  :character  
##  Mean   :0.2901   Mean   :0.08395                           
##  3rd Qu.:0.3179   3rd Qu.:0.09208                           
##  Max.   :0.6638   Max.   :0.20750
describe(bc_data %>% select(-id, -diagnosis))
##                         vars   n   mean     sd median trimmed    mad
## radius_mean                1 569  14.13   3.52  13.37   13.82   2.82
## texture_mean               2 569  19.29   4.30  18.84   19.04   4.17
## perimeter_mean             3 569  91.97  24.30  86.24   89.74  18.84
## area_mean                  4 569 654.89 351.91 551.10  606.13 227.28
## smoothness_mean            5 569   0.10   0.01   0.10    0.10   0.01
## compactness_mean           6 569   0.10   0.05   0.09    0.10   0.05
## concavity_mean             7 569   0.09   0.08   0.06    0.08   0.06
## concave.points_mean        8 569   0.05   0.04   0.03    0.04   0.03
## symmetry_mean              9 569   0.18   0.03   0.18    0.18   0.03
## fractal_dimension_mean    10 569   0.06   0.01   0.06    0.06   0.01
## radius_se                 11 569   0.41   0.28   0.32    0.36   0.16
## texture_se                12 569   1.22   0.55   1.11    1.16   0.47
## perimeter_se              13 569   2.87   2.02   2.29    2.51   1.14
## area_se                   14 569  40.34  45.49  24.53   31.69  13.63
## smoothness_se             15 569   0.01   0.00   0.01    0.01   0.00
## compactness_se            16 569   0.03   0.02   0.02    0.02   0.01
## concavity_se              17 569   0.03   0.03   0.03    0.03   0.02
## concave.points_se         18 569   0.01   0.01   0.01    0.01   0.01
## symmetry_se               19 569   0.02   0.01   0.02    0.02   0.01
## fractal_dimension_se      20 569   0.00   0.00   0.00    0.00   0.00
## radius_worst              21 569  16.27   4.83  14.97   15.73   3.65
## texture_worst             22 569  25.68   6.15  25.41   25.39   6.42
## perimeter_worst           23 569 107.26  33.60  97.66  103.42  25.01
## area_worst                24 569 880.58 569.36 686.50  788.02 319.65
## smoothness_worst          25 569   0.13   0.02   0.13    0.13   0.02
## compactness_worst         26 569   0.25   0.16   0.21    0.23   0.13
## concavity_worst           27 569   0.27   0.21   0.23    0.25   0.20
## concave.points_worst      28 569   0.11   0.07   0.10    0.11   0.07
## symmetry_worst            29 569   0.29   0.06   0.28    0.28   0.05
## fractal_dimension_worst   30 569   0.08   0.02   0.08    0.08   0.01
## radius_mean_size*         31 569    NaN     NA     NA     NaN     NA
##                            min     max   range skew kurtosis    se
## radius_mean               6.98   28.11   21.13 0.94     0.81  0.15
## texture_mean              9.71   39.28   29.57 0.65     0.73  0.18
## perimeter_mean           43.79  188.50  144.71 0.99     0.94  1.02
## area_mean               143.50 2501.00 2357.50 1.64     3.59 14.75
## smoothness_mean           0.05    0.16    0.11 0.45     0.82  0.00
## compactness_mean          0.02    0.35    0.33 1.18     1.61  0.00
## concavity_mean            0.00    0.43    0.43 1.39     1.95  0.00
## concave.points_mean       0.00    0.20    0.20 1.17     1.03  0.00
## symmetry_mean             0.11    0.30    0.20 0.72     1.25  0.00
## fractal_dimension_mean    0.05    0.10    0.05 1.30     2.95  0.00
## radius_se                 0.11    2.87    2.76 3.07    17.45  0.01
## texture_se                0.36    4.88    4.52 1.64     5.26  0.02
## perimeter_se              0.76   21.98   21.22 3.43    21.12  0.08
## area_se                   6.80  542.20  535.40 5.42    48.59  1.91
## smoothness_se             0.00    0.03    0.03 2.30    10.32  0.00
## compactness_se            0.00    0.14    0.13 1.89     5.02  0.00
## concavity_se              0.00    0.40    0.40 5.08    48.24  0.00
## concave.points_se         0.00    0.05    0.05 1.44     5.04  0.00
## symmetry_se               0.01    0.08    0.07 2.18     7.78  0.00
## fractal_dimension_se      0.00    0.03    0.03 3.90    25.94  0.00
## radius_worst              7.93   36.04   28.11 1.10     0.91  0.20
## texture_worst            12.02   49.54   37.52 0.50     0.20  0.26
## perimeter_worst          50.41  251.20  200.79 1.12     1.04  1.41
## area_worst              185.20 4254.00 4068.80 1.85     4.32 23.87
## smoothness_worst          0.07    0.22    0.15 0.41     0.49  0.00
## compactness_worst         0.03    1.06    1.03 1.47     2.98  0.01
## concavity_worst           0.00    1.25    1.25 1.14     1.57  0.01
## concave.points_worst      0.00    0.29    0.29 0.49    -0.55  0.00
## symmetry_worst            0.16    0.66    0.51 1.43     4.37  0.00
## fractal_dimension_worst   0.06    0.21    0.15 1.65     5.16  0.00
## radius_mean_size*          Inf    -Inf    -Inf   NA       NA    NA
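
The NaN, Inf, and -Inf entries in the radius_mean_size row arise because that column is character, so numeric summaries are not meaningful for it. One way to avoid the row, sketched here on a toy data frame with illustrative object names, is to keep only the numeric columns before summarising and to tabulate the categorical variable separately. The call to `psych::describe()` is guarded so the example still runs if psych is not installed.

```r
library(dplyr)

# Toy stand-in for bc_data (first three rows as printed by str())
toy <- data.frame(
  id               = c(842302L, 842517L, 84300903L),
  diagnosis        = factor(c("M", "M", "M")),
  radius_mean      = c(18.0, 20.6, 19.7),
  radius_mean_size = c("bottom half", "top half", "bottom half"),
  stringsAsFactors = FALSE
)

# Keep numeric columns only, then drop the identifier
numeric_vars <- toy %>% select(where(is.numeric), -id)
names(numeric_vars)

# Numeric summaries for the numeric columns
if (requireNamespace("psych", quietly = TRUE)) {
  print(psych::describe(numeric_vars))
}

# Counts for the categorical bucket variable
table(toy$radius_mean_size)
```
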
table(bc_data$diagnosis)
## 
##   B   M 
## 357 212
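
The counts can also be expressed as proportions, which makes the class imbalance easier to read. The sketch below rebuilds the diagnosis vector from the counts printed above rather than reading the file, so `diagnosis`, `counts`, and `props` are illustrative objects.

```r
# Rebuild the diagnosis vector from the printed counts (357 B, 212 M)
diagnosis <- factor(rep(c("B", "M"), times = c(357, 212)))

counts <- table(diagnosis)
props  <- round(prop.table(counts), 3)
props
# B = 0.627, M = 0.373
```

On the real data the equivalent call would be `prop.table(table(bc_data$diagnosis))`.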

Boxplots of the 10 mean variables vs. diagnosis:

boxplot(radius_mean ~ diagnosis, data = bc_data)

boxplot(texture_mean ~ diagnosis, data = bc_data)

boxplot(perimeter_mean ~ diagnosis, data = bc_data)

boxplot(area_mean ~ diagnosis, data = bc_data)

boxplot(smoothness_mean ~ diagnosis, data = bc_data)

boxplot(compactness_mean ~ diagnosis, data = bc_data)

boxplot(concavity_mean ~ diagnosis, data = bc_data)

boxplot(concave.points_mean ~ diagnosis, data = bc_data)

boxplot(symmetry_mean ~ diagnosis, data = bc_data)

boxplot(fractal_dimension_mean ~ diagnosis, data = bc_data)
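
The ten calls above differ only in the response variable, so the same plots could be drawn in a single faceted figure. This is a sketch of that alternative, not the method used above, and it runs on a toy data frame whose values are made up for illustration; `toy`, `long`, and `p` are hypothetical names.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Toy stand-in for bc_data; values are illustrative only
toy <- data.frame(
  diagnosis    = factor(c("M", "M", "B", "B", "M", "B")),
  radius_mean  = c(18.0, 20.6, 11.4, 12.5, 19.7, 13.1),
  texture_mean = c(10.4, 17.8, 20.4, 15.2, 21.2, 14.9)
)

# Reshape to long form: one row per (observation, feature) pair
long <- toy %>%
  pivot_longer(ends_with("_mean"),
               names_to = "feature", values_to = "value")

# One boxplot panel per feature, each with its own y-axis scale
p <- ggplot(long, aes(x = diagnosis, y = value)) +
  geom_boxplot() +
  facet_wrap(~ feature, scales = "free_y")
print(p)
```

Free y-axis scales matter here because the features are measured on very different scales (area_mean runs into the thousands while smoothness_mean stays below 1).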