library(tidyverse)
library(psych)
bc_data <- read.csv("wisc_bc_data.csv")
# Create a two-level bucket variable from radius_mean: values below 19.7 are
# labelled "bottom half", all other values "top half"
bc_data <- bc_data %>%
  mutate(radius_mean_size = ifelse(radius_mean < 19.7, "bottom half", "top half"))
str(bc_data)
## 'data.frame': 569 obs. of 33 variables:
## $ id : int 842302 842517 84300903 84348301 84358402 843786 844359 84458202 844981 84501001 ...
## $ diagnosis : Factor w/ 2 levels "B","M": 2 2 2 2 2 2 2 2 2 2 ...
## $ radius_mean : num 18 20.6 19.7 11.4 20.3 ...
## $ texture_mean : num 10.4 17.8 21.2 20.4 14.3 ...
## $ perimeter_mean : num 122.8 132.9 130 77.6 135.1 ...
## $ area_mean : num 1001 1326 1203 386 1297 ...
## $ smoothness_mean : num 0.1184 0.0847 0.1096 0.1425 0.1003 ...
## $ compactness_mean : num 0.2776 0.0786 0.1599 0.2839 0.1328 ...
## $ concavity_mean : num 0.3001 0.0869 0.1974 0.2414 0.198 ...
## $ concave.points_mean : num 0.1471 0.0702 0.1279 0.1052 0.1043 ...
## $ symmetry_mean : num 0.242 0.181 0.207 0.26 0.181 ...
## $ fractal_dimension_mean : num 0.0787 0.0567 0.06 0.0974 0.0588 ...
## $ radius_se : num 1.095 0.543 0.746 0.496 0.757 ...
## $ texture_se : num 0.905 0.734 0.787 1.156 0.781 ...
## $ perimeter_se : num 8.59 3.4 4.58 3.44 5.44 ...
## $ area_se : num 153.4 74.1 94 27.2 94.4 ...
## $ smoothness_se : num 0.0064 0.00522 0.00615 0.00911 0.01149 ...
## $ compactness_se : num 0.049 0.0131 0.0401 0.0746 0.0246 ...
## $ concavity_se : num 0.0537 0.0186 0.0383 0.0566 0.0569 ...
## $ concave.points_se : num 0.0159 0.0134 0.0206 0.0187 0.0188 ...
## $ symmetry_se : num 0.03 0.0139 0.0225 0.0596 0.0176 ...
## $ fractal_dimension_se : num 0.00619 0.00353 0.00457 0.00921 0.00511 ...
## $ radius_worst : num 25.4 25 23.6 14.9 22.5 ...
## $ texture_worst : num 17.3 23.4 25.5 26.5 16.7 ...
## $ perimeter_worst : num 184.6 158.8 152.5 98.9 152.2 ...
## $ area_worst : num 2019 1956 1709 568 1575 ...
## $ smoothness_worst : num 0.162 0.124 0.144 0.21 0.137 ...
## $ compactness_worst : num 0.666 0.187 0.424 0.866 0.205 ...
## $ concavity_worst : num 0.712 0.242 0.45 0.687 0.4 ...
## $ concave.points_worst : num 0.265 0.186 0.243 0.258 0.163 ...
## $ symmetry_worst : num 0.46 0.275 0.361 0.664 0.236 ...
## $ fractal_dimension_worst: num 0.1189 0.089 0.0876 0.173 0.0768 ...
## $ radius_mean_size : chr "bottom half" "top half" "bottom half" "bottom half" ...
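The hard-coded cutoff of 19.7 lies above the third quartile of radius_mean (15.78 in the summary output), so the two buckets are unlikely to be equal halves of the sample. If a median split is what was intended (an assumption, the source does not say), a minimal sketch looks like this. The vector below is a small made-up stand-in for `bc_data$radius_mean`, not the real data.

```r
# Toy stand-in for bc_data$radius_mean (illustrative values only)
radius_mean <- c(11.4, 12.5, 13.4, 14.1, 18.0, 20.6)

# Cut at the sample median instead of a hard-coded threshold
cutoff <- median(radius_mean)
radius_mean_size <- ifelse(radius_mean < cutoff, "bottom half", "top half")

# Each bucket should now hold about half of the observations
table(radius_mean_size)
```

On the real data the same idea would be `ifelse(radius_mean < median(radius_mean), "bottom half", "top half")` inside the existing `mutate()` call.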
Are any of the variables, or any combination of them, predictors of a malignant or benign diagnosis?
What are the cases, and how many are there?
Each case represents a sample from a biopsy of a breast mass. There are 569 observations in the data set.
Describe the method of data collection.
“Features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. They describe characteristics of the cell nuclei present in the image.”
What type of study is this (observational/experiment)?
This is an observational study of biopsied breast tissue masses.
If you collected the data, state self-collected. If not, provide a citation/link.
This database is available through the UW CS FTP server: ftp ftp.cs.wisc.edu, then cd math-prog/cpo-dataset/machine-learn/WDBC/
It can also be found on the UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29
Dua, D. and Karra Taniskidou, E. (2017). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
Creators:
Dr. William H. Wolberg, General Surgery Dept. University of Wisconsin, Clinical Sciences Center Madison, WI 53792 wolberg ‘@’ eagle.surgery.wisc.edu
W. Nick Street, Computer Sciences Dept. University of Wisconsin, 1210 West Dayton St., Madison, WI 53706 street ‘@’ cs.wisc.edu 608-262-6619
Olvi L. Mangasarian, Computer Sciences Dept. University of Wisconsin, 1210 West Dayton St., Madison, WI 53706 olvi ‘@’ cs.wisc.edu
What is the response variable, and what type is it (numerical/categorical)?
The response variable is diagnosis, a binary categorical (qualitative) variable with two levels: B (benign) and M (malignant).
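The `str()` output shows diagnosis as a factor, which is what `read.csv()` produced by default before R 4.0.0. On newer R versions the column is read as character, so it may need an explicit conversion. A minimal sketch, using a made-up five-element vector in place of `bc_data$diagnosis` and assuming the descriptive labels "benign" and "malignant" are wanted:

```r
# Toy stand-in for bc_data$diagnosis as read with stringsAsFactors = FALSE
diagnosis <- c("M", "M", "B", "B", "B")

# Fix the level order explicitly (B first, then M) and attach readable labels
diagnosis <- factor(diagnosis,
                    levels = c("B", "M"),
                    labels = c("benign", "malignant"))

levels(diagnosis)   # the two categories of the response
table(diagnosis)    # counts per category
```

Setting `levels` explicitly keeps the reference category stable regardless of the order in which values appear in the file.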
What is the explanatory variable, and what type is it (numerical/categorical)? You should have two independent variables, one quantitative and one qualitative.
There are 30 quantitative independent variables, plus one qualitative variable, radius_mean_size, which is computed from radius_mean and labels each mass as "bottom half" or "top half" according to its mean radius. All of the variables describe aspects of the tissue samples: for each measured feature, the data set records the mean, the standard error, and the worst case.
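Ten real-valued features are computed for each cell nucleus: radius, texture, perimeter, area, smoothness, compactness, concavity, concave points, symmetry, and fractal dimension.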
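The column names in the `str()` output follow a feature_suffix pattern, with the suffixes mean, se, and worst. The sketch below rebuilds the 30 predictor names from that pattern and shows how one group can be picked out by suffix. The helper names (`features`, `suffixes`, `predictor_names`, `mean_vars`) are illustrative and not part of the original analysis.

```r
# The ten measured features, spelled as they appear in the column names
features <- c("radius", "texture", "perimeter", "area", "smoothness",
              "compactness", "concavity", "concave.points", "symmetry",
              "fractal_dimension")

# Each feature is recorded three ways
suffixes <- c("mean", "se", "worst")

# All feature/suffix combinations, in the same order as the data frame columns
predictor_names <- as.vector(outer(features, suffixes, paste, sep = "_"))
length(predictor_names)

# Select one group by its suffix, e.g. the ten *_mean variables
mean_vars <- grep("_mean$", predictor_names, value = TRUE)
mean_vars
```

With the data loaded, the same `grep()` call can be applied to `names(bc_data)`. The pattern `"_mean$"` does not match radius_mean_size because it is anchored to the end of the name.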
Provide summary statistics for each of the variables. Also include appropriate visualizations related to your research question (e.g. scatter plots, boxplots, etc.). This step requires the use of R, hence a code chunk is provided below. Insert more code chunks as needed.
summary(bc_data)
## id diagnosis radius_mean texture_mean
## Min. : 8670 B:357 Min. : 6.981 Min. : 9.71
## 1st Qu.: 869218 M:212 1st Qu.:11.700 1st Qu.:16.17
## Median : 906024 Median :13.370 Median :18.84
## Mean : 30371831 Mean :14.127 Mean :19.29
## 3rd Qu.: 8813129 3rd Qu.:15.780 3rd Qu.:21.80
## Max. :911320502 Max. :28.110 Max. :39.28
## perimeter_mean area_mean smoothness_mean compactness_mean
## Min. : 43.79 Min. : 143.5 Min. :0.05263 Min. :0.01938
## 1st Qu.: 75.17 1st Qu.: 420.3 1st Qu.:0.08637 1st Qu.:0.06492
## Median : 86.24 Median : 551.1 Median :0.09587 Median :0.09263
## Mean : 91.97 Mean : 654.9 Mean :0.09636 Mean :0.10434
## 3rd Qu.:104.10 3rd Qu.: 782.7 3rd Qu.:0.10530 3rd Qu.:0.13040
## Max. :188.50 Max. :2501.0 Max. :0.16340 Max. :0.34540
## concavity_mean concave.points_mean symmetry_mean
## Min. :0.00000 Min. :0.00000 Min. :0.1060
## 1st Qu.:0.02956 1st Qu.:0.02031 1st Qu.:0.1619
## Median :0.06154 Median :0.03350 Median :0.1792
## Mean :0.08880 Mean :0.04892 Mean :0.1812
## 3rd Qu.:0.13070 3rd Qu.:0.07400 3rd Qu.:0.1957
## Max. :0.42680 Max. :0.20120 Max. :0.3040
## fractal_dimension_mean radius_se texture_se perimeter_se
## Min. :0.04996 Min. :0.1115 Min. :0.3602 Min. : 0.757
## 1st Qu.:0.05770 1st Qu.:0.2324 1st Qu.:0.8339 1st Qu.: 1.606
## Median :0.06154 Median :0.3242 Median :1.1080 Median : 2.287
## Mean :0.06280 Mean :0.4052 Mean :1.2169 Mean : 2.866
## 3rd Qu.:0.06612 3rd Qu.:0.4789 3rd Qu.:1.4740 3rd Qu.: 3.357
## Max. :0.09744 Max. :2.8730 Max. :4.8850 Max. :21.980
## area_se smoothness_se compactness_se concavity_se
## Min. : 6.802 Min. :0.001713 Min. :0.002252 Min. :0.00000
## 1st Qu.: 17.850 1st Qu.:0.005169 1st Qu.:0.013080 1st Qu.:0.01509
## Median : 24.530 Median :0.006380 Median :0.020450 Median :0.02589
## Mean : 40.337 Mean :0.007041 Mean :0.025478 Mean :0.03189
## 3rd Qu.: 45.190 3rd Qu.:0.008146 3rd Qu.:0.032450 3rd Qu.:0.04205
## Max. :542.200 Max. :0.031130 Max. :0.135400 Max. :0.39600
## concave.points_se symmetry_se fractal_dimension_se
## Min. :0.000000 Min. :0.007882 Min. :0.0008948
## 1st Qu.:0.007638 1st Qu.:0.015160 1st Qu.:0.0022480
## Median :0.010930 Median :0.018730 Median :0.0031870
## Mean :0.011796 Mean :0.020542 Mean :0.0037949
## 3rd Qu.:0.014710 3rd Qu.:0.023480 3rd Qu.:0.0045580
## Max. :0.052790 Max. :0.078950 Max. :0.0298400
## radius_worst texture_worst perimeter_worst area_worst
## Min. : 7.93 Min. :12.02 Min. : 50.41 Min. : 185.2
## 1st Qu.:13.01 1st Qu.:21.08 1st Qu.: 84.11 1st Qu.: 515.3
## Median :14.97 Median :25.41 Median : 97.66 Median : 686.5
## Mean :16.27 Mean :25.68 Mean :107.26 Mean : 880.6
## 3rd Qu.:18.79 3rd Qu.:29.72 3rd Qu.:125.40 3rd Qu.:1084.0
## Max. :36.04 Max. :49.54 Max. :251.20 Max. :4254.0
## smoothness_worst compactness_worst concavity_worst concave.points_worst
## Min. :0.07117 Min. :0.02729 Min. :0.0000 Min. :0.00000
## 1st Qu.:0.11660 1st Qu.:0.14720 1st Qu.:0.1145 1st Qu.:0.06493
## Median :0.13130 Median :0.21190 Median :0.2267 Median :0.09993
## Mean :0.13237 Mean :0.25427 Mean :0.2722 Mean :0.11461
## 3rd Qu.:0.14600 3rd Qu.:0.33910 3rd Qu.:0.3829 3rd Qu.:0.16140
## Max. :0.22260 Max. :1.05800 Max. :1.2520 Max. :0.29100
## symmetry_worst fractal_dimension_worst radius_mean_size
## Min. :0.1565 Min. :0.05504 Length:569
## 1st Qu.:0.2504 1st Qu.:0.07146 Class :character
## Median :0.2822 Median :0.08004 Mode :character
## Mean :0.2901 Mean :0.08395
## 3rd Qu.:0.3179 3rd Qu.:0.09208
## Max. :0.6638 Max. :0.20750
describe(bc_data %>% select(-id, -diagnosis))
## vars n mean sd median trimmed mad
## radius_mean 1 569 14.13 3.52 13.37 13.82 2.82
## texture_mean 2 569 19.29 4.30 18.84 19.04 4.17
## perimeter_mean 3 569 91.97 24.30 86.24 89.74 18.84
## area_mean 4 569 654.89 351.91 551.10 606.13 227.28
## smoothness_mean 5 569 0.10 0.01 0.10 0.10 0.01
## compactness_mean 6 569 0.10 0.05 0.09 0.10 0.05
## concavity_mean 7 569 0.09 0.08 0.06 0.08 0.06
## concave.points_mean 8 569 0.05 0.04 0.03 0.04 0.03
## symmetry_mean 9 569 0.18 0.03 0.18 0.18 0.03
## fractal_dimension_mean 10 569 0.06 0.01 0.06 0.06 0.01
## radius_se 11 569 0.41 0.28 0.32 0.36 0.16
## texture_se 12 569 1.22 0.55 1.11 1.16 0.47
## perimeter_se 13 569 2.87 2.02 2.29 2.51 1.14
## area_se 14 569 40.34 45.49 24.53 31.69 13.63
## smoothness_se 15 569 0.01 0.00 0.01 0.01 0.00
## compactness_se 16 569 0.03 0.02 0.02 0.02 0.01
## concavity_se 17 569 0.03 0.03 0.03 0.03 0.02
## concave.points_se 18 569 0.01 0.01 0.01 0.01 0.01
## symmetry_se 19 569 0.02 0.01 0.02 0.02 0.01
## fractal_dimension_se 20 569 0.00 0.00 0.00 0.00 0.00
## radius_worst 21 569 16.27 4.83 14.97 15.73 3.65
## texture_worst 22 569 25.68 6.15 25.41 25.39 6.42
## perimeter_worst 23 569 107.26 33.60 97.66 103.42 25.01
## area_worst 24 569 880.58 569.36 686.50 788.02 319.65
## smoothness_worst 25 569 0.13 0.02 0.13 0.13 0.02
## compactness_worst 26 569 0.25 0.16 0.21 0.23 0.13
## concavity_worst 27 569 0.27 0.21 0.23 0.25 0.20
## concave.points_worst 28 569 0.11 0.07 0.10 0.11 0.07
## symmetry_worst 29 569 0.29 0.06 0.28 0.28 0.05
## fractal_dimension_worst 30 569 0.08 0.02 0.08 0.08 0.01
## radius_mean_size* 31 569 NaN NA NA NaN NA
## min max range skew kurtosis se
## radius_mean 6.98 28.11 21.13 0.94 0.81 0.15
## texture_mean 9.71 39.28 29.57 0.65 0.73 0.18
## perimeter_mean 43.79 188.50 144.71 0.99 0.94 1.02
## area_mean 143.50 2501.00 2357.50 1.64 3.59 14.75
## smoothness_mean 0.05 0.16 0.11 0.45 0.82 0.00
## compactness_mean 0.02 0.35 0.33 1.18 1.61 0.00
## concavity_mean 0.00 0.43 0.43 1.39 1.95 0.00
## concave.points_mean 0.00 0.20 0.20 1.17 1.03 0.00
## symmetry_mean 0.11 0.30 0.20 0.72 1.25 0.00
## fractal_dimension_mean 0.05 0.10 0.05 1.30 2.95 0.00
## radius_se 0.11 2.87 2.76 3.07 17.45 0.01
## texture_se 0.36 4.88 4.52 1.64 5.26 0.02
## perimeter_se 0.76 21.98 21.22 3.43 21.12 0.08
## area_se 6.80 542.20 535.40 5.42 48.59 1.91
## smoothness_se 0.00 0.03 0.03 2.30 10.32 0.00
## compactness_se 0.00 0.14 0.13 1.89 5.02 0.00
## concavity_se 0.00 0.40 0.40 5.08 48.24 0.00
## concave.points_se 0.00 0.05 0.05 1.44 5.04 0.00
## symmetry_se 0.01 0.08 0.07 2.18 7.78 0.00
## fractal_dimension_se 0.00 0.03 0.03 3.90 25.94 0.00
## radius_worst 7.93 36.04 28.11 1.10 0.91 0.20
## texture_worst 12.02 49.54 37.52 0.50 0.20 0.26
## perimeter_worst 50.41 251.20 200.79 1.12 1.04 1.41
## area_worst 185.20 4254.00 4068.80 1.85 4.32 23.87
## smoothness_worst 0.07 0.22 0.15 0.41 0.49 0.00
## compactness_worst 0.03 1.06 1.03 1.47 2.98 0.01
## concavity_worst 0.00 1.25 1.25 1.14 1.57 0.01
## concave.points_worst 0.00 0.29 0.29 0.49 -0.55 0.00
## symmetry_worst 0.16 0.66 0.51 1.43 4.37 0.00
## fractal_dimension_worst 0.06 0.21 0.15 1.65 5.16 0.00
## radius_mean_size* Inf -Inf -Inf NA NA NA
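The radius_mean_size* row contains only NaN, NA, and Inf because that column is character, so numeric summaries are undefined for it. One way to avoid the row is to keep only numeric columns before summarising. The sketch below uses base R on a three-row stand-in built from a few rounded values shown in the `str()` output; the object names `toy`, `numeric_only`, and `stats` are illustrative.

```r
# Small stand-in for bc_data: two numeric columns plus the character bucket
toy <- data.frame(
  radius_mean      = c(11.4, 18.0, 20.6),
  texture_mean     = c(20.4, 10.4, 17.8),
  radius_mean_size = c("bottom half", "bottom half", "top half"),
  stringsAsFactors = FALSE
)

# Keep only the numeric columns; character and factor columns are dropped
numeric_only <- Filter(is.numeric, toy)
names(numeric_only)

# Per-column summary statistics, one column of the result per variable
stats <- sapply(numeric_only, function(x) {
  c(mean = mean(x), sd = sd(x), median = median(x))
})
stats
```

On the full data, passing `Filter(is.numeric, bc_data %>% select(-id))` to `describe()` should give the same table without the radius_mean_size* row, since the factor diagnosis and the character bucket are both filtered out.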
table(bc_data$diagnosis)
##
## B M
## 357 212
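The class balance is easier to read as proportions. The sketch below starts from the two counts printed above rather than from the data file, so it runs on its own; `counts` and `props` are illustrative names.

```r
# Counts copied from the table(bc_data$diagnosis) output above
counts <- c(B = 357, M = 212)

# Share of each diagnosis among the 569 cases, rounded to three decimals
props <- round(prop.table(counts), 3)
props
```

With the data loaded, `prop.table(table(bc_data$diagnosis))` gives the same shares directly: roughly 63% benign and 37% malignant.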
Boxplots of the 10 mean variables vs. diagnosis:
boxplot(radius_mean ~ diagnosis, data = bc_data)
boxplot(texture_mean ~ diagnosis, data = bc_data)
boxplot(perimeter_mean ~ diagnosis, data = bc_data)
boxplot(area_mean ~ diagnosis, data = bc_data)
boxplot(smoothness_mean ~ diagnosis, data = bc_data)
boxplot(compactness_mean ~ diagnosis, data = bc_data)
boxplot(concavity_mean ~ diagnosis, data = bc_data)
boxplot(concave.points_mean ~ diagnosis, data = bc_data)
boxplot(symmetry_mean ~ diagnosis, data = bc_data)
boxplot(fractal_dimension_mean ~ diagnosis, data = bc_data)
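The ten calls above differ only in the variable name, so they can be generated in a loop. This is a sketch of that idea, not the original code: it runs on a small stand-in data frame with illustrative values, and `toy`, `mean_vars`, and `plots` are hypothetical names.

```r
# Toy stand-in for bc_data (illustrative values only)
toy <- data.frame(
  diagnosis    = factor(c("B", "B", "B", "M", "M", "M")),
  radius_mean  = c(11.4, 12.1, 13.0, 18.0, 20.6, 19.7),
  texture_mean = c(20.4, 15.2, 17.9, 10.4, 17.8, 21.2)
)

# Every column whose name ends in "_mean"
mean_vars <- grep("_mean$", names(toy), value = TRUE)

# One boxplot per variable, split by diagnosis; reformulate() builds the
# formula `<variable> ~ diagnosis` from the variable name
plots <- lapply(mean_vars, function(v) {
  boxplot(reformulate("diagnosis", response = v),
          data = toy, main = v, ylab = v)
})
```

Replacing `toy` with `bc_data` would draw all ten boxplots, since the `_mean$` pattern matches exactly the ten mean variables and skips radius_mean_size.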