Breast cancer is one of the most common types of cancer; it accounts for 25% of cancer cases among women.
It starts when breast cells begin to grow uncontrollably and form a tumor
that can either be seen on x-ray tests or felt as a lump in the chest area.
Our main goal is to analyse the data set and extract useful insights that can lead to data-driven decisions.
The scope is to use the correlations between the features of the breast cancer data set
to predict whether a patient has a benign (not harmful) tumor or a malignant (cancerous) tumor.
Libraries used:
library(tidyverse) #For tidying the data
library(skimr) #It is designed to provide summary statistics about variables in data frames
library(ggplot2) #For plotting
library(RColorBrewer) #For aesthetics
library(tidymodels) #For splitting the data and training
library(caret) #For machine learning
library(party) #The core of the package is ctree()
library(yardstick) #For plotting the confusion matrix
df <- read.csv("/Users/salahkaf/Desktop/breast_cancer_1.csv")
Title: Breast Cancer Wisconsin (Diagnostic) Data Set
Last Update: 2021-12-29
Purpose: The analysis of this data set aims to help protect women from breast cancer.
It uses classification and regression models
to predict the cancer early, so that the necessary measures can be taken in time.
Link: https://www.kaggle.com/uciml/breast-cancer-wisconsin-data
dim(df) # dimension of the dataset
[1] 569 32
names(df) # column names
[1] "id" "diagnosis"
[3] "radius_mean" "texture_mean"
[5] "perimeter_mean" "area_mean"
[7] "smoothness_mean" "compactness_mean"
[9] "concavity_mean" "concave.points_mean"
[11] "symmetry_mean" "fractal_dimension_mean"
[13] "radius_se" "texture_se"
[15] "perimeter_se" "area_se"
[17] "smoothness_se" "compactness_se"
[19] "concavity_se" "concave.points_se"
[21] "symmetry_se" "fractal_dimension_se"
[23] "radius_worst" "texture_worst"
[25] "perimeter_worst" "area_worst"
[27] "smoothness_worst" "compactness_worst"
[29] "concavity_worst" "concave.points_worst"
[31] "symmetry_worst" "fractal_dimension_worst"
str(df) # structure of the data
'data.frame': 569 obs. of 32 variables:
$ id : int 842302 842517 84300903 84348301 84358402 843786 844359 84458202 844981 84501001 ...
$ diagnosis : chr NA "M" "M" "M" ...
$ radius_mean : num NA 20.6 19.7 11.4 20.3 ...
$ texture_mean : num 10.4 17.8 21.2 20.4 14.3 ...
$ perimeter_mean : num 122.8 132.9 130 77.6 135.1 ...
$ area_mean : num 1001 1326 1203 386 1297 ...
$ smoothness_mean : num 0.1184 0.0847 0.1096 0.1425 0.1003 ...
$ compactness_mean : num 0.2776 0.0786 0.1599 0.2839 0.1328 ...
$ concavity_mean : num 0.3001 0.0869 0.1974 0.2414 0.198 ...
$ concave.points_mean : num NA 0.0702 0.1279 0.1052 0.1043 ...
$ symmetry_mean : num 0.242 0.181 0.207 0.26 0.181 ...
$ fractal_dimension_mean : num 0.0787 0.0567 0.06 0.0974 0.0588 ...
$ radius_se : num 1.095 0.543 0.746 0.496 0.757 ...
$ texture_se : num 0.905 0.734 0.787 1.156 0.781 ...
$ perimeter_se : num 8.59 3.4 4.58 3.44 5.44 ...
$ area_se : num 153.4 74.1 94 27.2 94.4 ...
$ smoothness_se : num NA 0.00522 0.00615 0.00911 0.01149 ...
$ compactness_se : num 0.049 0.0131 0.0401 0.0746 0.0246 ...
$ concavity_se : num 0.0537 0.0186 0.0383 0.0566 0.0569 ...
$ concave.points_se : num 0.0159 0.0134 0.0206 0.0187 0.0188 ...
$ symmetry_se : num 0.03 0.0139 0.0225 0.0596 0.0176 ...
$ fractal_dimension_se : num 0.00619 0.00353 0.00457 0.00921 0.00511 ...
$ radius_worst : num 25.4 25 23.6 14.9 22.5 ...
$ texture_worst : num NA 23.4 25.5 26.5 16.7 ...
$ perimeter_worst : num 184.6 158.8 152.5 98.9 152.2 ...
$ area_worst : num 2019 1956 1709 568 1575 ...
$ smoothness_worst : num 0.162 0.124 0.144 0.21 0.137 ...
$ compactness_worst : num 0.666 0.187 0.424 0.866 0.205 ...
$ concavity_worst : num 0.712 0.242 0.45 0.687 0.4 ...
$ concave.points_worst : num 0.265 0.186 0.243 0.258 0.163 ...
$ symmetry_worst : num NA 0.275 0.361 0.664 0.236 ...
$ fractal_dimension_worst: num 0.1189 0.089 0.0876 0.173 0.0768 ...
summary(df) # Summary of each column in the data
id diagnosis radius_mean
Min. : 8670 Length:569 Min. : 6.981
1st Qu.: 869218 Class :character 1st Qu.:11.685
Median : 906024 Mode :character Median :13.300
Mean : 30371831 Mean :14.107
3rd Qu.: 8813129 3rd Qu.:15.765
Max. :911320502 Max. :28.110
NA's :6
texture_mean perimeter_mean area_mean smoothness_mean
Min. : 9.71 Min. : 43.79 Min. : 143.5 Min. :0.05263
1st Qu.:16.17 1st Qu.: 75.17 1st Qu.: 420.3 1st Qu.:0.08637
Median :18.84 Median : 86.24 Median : 551.1 Median :0.09587
Mean :19.29 Mean : 91.97 Mean : 654.9 Mean :0.09636
3rd Qu.:21.80 3rd Qu.:104.10 3rd Qu.: 782.7 3rd Qu.:0.10530
Max. :39.28 Max. :188.50 Max. :2501.0 Max. :0.16340
compactness_mean concavity_mean concave.points_mean
Min. :0.01938 Min. :0.00000 Min. :0.00000
1st Qu.:0.06492 1st Qu.:0.02956 1st Qu.:0.02029
Median :0.09263 Median :0.06154 Median :0.03326
Mean :0.10434 Mean :0.08880 Mean :0.04852
3rd Qu.:0.13040 3rd Qu.:0.13070 3rd Qu.:0.07352
Max. :0.34540 Max. :0.42680 Max. :0.20120
NA's :6
symmetry_mean fractal_dimension_mean radius_se
Min. :0.1060 Min. :0.04996 Min. :0.1115
1st Qu.:0.1619 1st Qu.:0.05770 1st Qu.:0.2324
Median :0.1792 Median :0.06154 Median :0.3242
Mean :0.1812 Mean :0.06280 Mean :0.4052
3rd Qu.:0.1957 3rd Qu.:0.06612 3rd Qu.:0.4789
Max. :0.3040 Max. :0.09744 Max. :2.8730
texture_se perimeter_se area_se
Min. :0.3602 Min. : 0.757 Min. : 6.802
1st Qu.:0.8339 1st Qu.: 1.606 1st Qu.: 17.850
Median :1.1080 Median : 2.287 Median : 24.530
Mean :1.2169 Mean : 2.866 Mean : 40.337
3rd Qu.:1.4740 3rd Qu.: 3.357 3rd Qu.: 45.190
Max. :4.8850 Max. :21.980 Max. :542.200
smoothness_se compactness_se concavity_se
Min. :0.001713 Min. :0.002252 Min. :0.00000
1st Qu.:0.005163 1st Qu.:0.013080 1st Qu.:0.01509
Median :0.006380 Median :0.020450 Median :0.02589
Mean :0.007050 Mean :0.025478 Mean :0.03189
3rd Qu.:0.008182 3rd Qu.:0.032450 3rd Qu.:0.04205
Max. :0.031130 Max. :0.135400 Max. :0.39600
NA's :6
concave.points_se symmetry_se fractal_dimension_se
Min. :0.000000 Min. :0.007882 Min. :0.0008948
1st Qu.:0.007638 1st Qu.:0.015160 1st Qu.:0.0022480
Median :0.010930 Median :0.018730 Median :0.0031870
Mean :0.011796 Mean :0.020542 Mean :0.0037949
3rd Qu.:0.014710 3rd Qu.:0.023480 3rd Qu.:0.0045580
Max. :0.052790 Max. :0.078950 Max. :0.0298400
radius_worst texture_worst perimeter_worst area_worst
Min. : 7.93 Min. :12.02 Min. : 50.41 Min. : 185.2
1st Qu.:13.01 1st Qu.:21.09 1st Qu.: 84.11 1st Qu.: 515.3
Median :14.97 Median :25.40 Median : 97.66 Median : 686.5
Mean :16.27 Mean :25.67 Mean :107.26 Mean : 880.6
3rd Qu.:18.79 3rd Qu.:29.80 3rd Qu.:125.40 3rd Qu.:1084.0
Max. :36.04 Max. :49.54 Max. :251.20 Max. :4254.0
NA's :6
smoothness_worst compactness_worst concavity_worst
Min. :0.07117 Min. :0.02729 Min. :0.0000
1st Qu.:0.11660 1st Qu.:0.14720 1st Qu.:0.1145
Median :0.13130 Median :0.21190 Median :0.2267
Mean :0.13237 Mean :0.25427 Mean :0.2722
3rd Qu.:0.14600 3rd Qu.:0.33910 3rd Qu.:0.3829
Max. :0.22260 Max. :1.05800 Max. :1.2520
concave.points_worst symmetry_worst fractal_dimension_worst
Min. :0.00000 Min. :0.1565 Min. :0.05504
1st Qu.:0.06493 1st Qu.:0.2503 1st Qu.:0.07146
Median :0.09993 Median :0.2822 Median :0.08004
Mean :0.11461 Mean :0.2899 Mean :0.08395
3rd Qu.:0.16140 3rd Qu.:0.3177 3rd Qu.:0.09208
Max. :0.29100 Max. :0.6638 Max. :0.20750
NA's :6
The summary shows that the data set contains missing values, so we need to tidy this part first.
any(is.na(df)) # TRUE if the data set contains any missing value
[1] TRUE
colnames(df) <- str_to_lower(colnames(df)) # Changing all columns names to lowercase
# Finding the columns that have NA values
colSums(is.na(df))
id diagnosis
0 6
radius_mean texture_mean
6 0
perimeter_mean area_mean
0 0
smoothness_mean compactness_mean
0 0
concavity_mean concave.points_mean
0 6
symmetry_mean fractal_dimension_mean
0 0
radius_se texture_se
0 0
perimeter_se area_se
0 0
smoothness_se compactness_se
6 0
concavity_se concave.points_se
0 0
symmetry_se fractal_dimension_se
0 0
radius_worst texture_worst
0 6
perimeter_worst area_worst
0 0
smoothness_worst compactness_worst
0 0
concavity_worst concave.points_worst
0 0
symmetry_worst fractal_dimension_worst
6 0
# Replacing the missing values in these columns with the column mean
df$radius_mean[is.na(df$radius_mean)] <- mean(df$radius_mean, na.rm = TRUE)
df$concave.points_mean[is.na(df$concave.points_mean)] <- mean(df$concave.points_mean,na.rm = TRUE)
df$smoothness_se[is.na(df$smoothness_se)] <- mean(df$smoothness_se, na.rm = TRUE)
df$texture_worst[is.na(df$texture_worst)] <- mean(df$texture_worst, na.rm = TRUE)
df$symmetry_worst[is.na(df$symmetry_worst)] <- mean(df$symmetry_worst, na.rm = TRUE)
# checking again whether there are any NA values left in the numeric columns
colSums(is.na(df))
id diagnosis
0 6
radius_mean texture_mean
0 0
perimeter_mean area_mean
0 0
smoothness_mean compactness_mean
0 0
concavity_mean concave.points_mean
0 0
symmetry_mean fractal_dimension_mean
0 0
radius_se texture_se
0 0
perimeter_se area_se
0 0
smoothness_se compactness_se
0 0
concavity_se concave.points_se
0 0
symmetry_se fractal_dimension_se
0 0
radius_worst texture_worst
0 0
perimeter_worst area_worst
0 0
smoothness_worst compactness_worst
0 0
concavity_worst concave.points_worst
0 0
symmetry_worst fractal_dimension_worst
0 0
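The five mean-imputation assignments above repeat one pattern, so they could be collapsed into a single helper applied to every numeric column. The following is a minimal sketch on an invented data frame; `toy` and `impute_mean()` are hypothetical names used only for illustration and are not part of the original analysis.

```r
# Toy data frame standing in for `df` (all values are invented)
toy <- data.frame(
  diagnosis     = c("M", "B", NA, "B"),
  radius_mean   = c(20.0, NA, 12.0, 10.0),
  texture_worst = c(NA, 23.0, 25.0, 27.0),
  stringsAsFactors = FALSE
)

# Replace the missing entries of a numeric vector with the mean of the observed ones
impute_mean <- function(x) {
  x[is.na(x)] <- mean(x, na.rm = TRUE)
  x
}

# Apply the helper to numeric columns only; character columns are left untouched
num_cols <- vapply(toy, is.numeric, logical(1))
toy[num_cols] <- lapply(toy[num_cols], impute_mean)

# Only the character column should still report missing values
colSums(is.na(toy))
```

Because the helper skips non-numeric columns, the `diagnosis` column still needs separate handling, as in the next step.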
# the only character column is "diagnosis"
# this is a function to find the mode in a column
calc_mode <- function(x)
{
  # Ignore missing values so that NA can never be returned as the mode
  x <- x[!is.na(x)]
  # List the distinct / unique values
  distinct_values <- unique(x)
  # Count the occurrence of each distinct value
  distinct_tabulate <- tabulate(match(x, distinct_values))
  # Return the value with the highest occurrence
  distinct_values[which.max(distinct_tabulate)]
}
# replacing the NA values with the mode
df <- df %>%
  mutate(diagnosis = if_else(is.na(diagnosis),
                             calc_mode(diagnosis),
                             diagnosis))
# checking again for NA values in the data.
any(is.na(df)) #Should Be FALSE
[1] FALSE
Q1- How can we predict whether a breast tumor is malignant or benign? (Classification problem)
Q2- How can we predict the mean of the lobes radius of a breast cancer tumor? (Regression problem)
Making a copy of the data set for the classification task
df_c <- df
skim_without_charts(df_c) # extracting useful summary statistics
| Name | df_c |
| Number of rows | 569 |
| Number of columns | 32 |
| _______________________ | |
| Column type frequency: | |
| character | 1 |
| numeric | 31 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| diagnosis | 0 | 1 | 1 | 1 | 0 | 2 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 |
|---|---|---|---|---|---|---|---|---|---|
| id | 0 | 1 | 30371831.43 | 125020585.61 | 8670.00 | 869218.00 | 906024.00 | 8813129.00 | 911320502.00 |
| radius_mean | 0 | 1 | 14.11 | 3.51 | 6.98 | 11.70 | 13.38 | 15.75 | 28.11 |
| texture_mean | 0 | 1 | 19.29 | 4.30 | 9.71 | 16.17 | 18.84 | 21.80 | 39.28 |
| perimeter_mean | 0 | 1 | 91.97 | 24.30 | 43.79 | 75.17 | 86.24 | 104.10 | 188.50 |
| area_mean | 0 | 1 | 654.89 | 351.91 | 143.50 | 420.30 | 551.10 | 782.70 | 2501.00 |
| smoothness_mean | 0 | 1 | 0.10 | 0.01 | 0.05 | 0.09 | 0.10 | 0.11 | 0.16 |
| compactness_mean | 0 | 1 | 0.10 | 0.05 | 0.02 | 0.06 | 0.09 | 0.13 | 0.35 |
| concavity_mean | 0 | 1 | 0.09 | 0.08 | 0.00 | 0.03 | 0.06 | 0.13 | 0.43 |
| concave.points_mean | 0 | 1 | 0.05 | 0.04 | 0.00 | 0.02 | 0.03 | 0.07 | 0.20 |
| symmetry_mean | 0 | 1 | 0.18 | 0.03 | 0.11 | 0.16 | 0.18 | 0.20 | 0.30 |
| fractal_dimension_mean | 0 | 1 | 0.06 | 0.01 | 0.05 | 0.06 | 0.06 | 0.07 | 0.10 |
| radius_se | 0 | 1 | 0.41 | 0.28 | 0.11 | 0.23 | 0.32 | 0.48 | 2.87 |
| texture_se | 0 | 1 | 1.22 | 0.55 | 0.36 | 0.83 | 1.11 | 1.47 | 4.88 |
| perimeter_se | 0 | 1 | 2.87 | 2.02 | 0.76 | 1.61 | 2.29 | 3.36 | 21.98 |
| area_se | 0 | 1 | 40.34 | 45.49 | 6.80 | 17.85 | 24.53 | 45.19 | 542.20 |
| smoothness_se | 0 | 1 | 0.01 | 0.00 | 0.00 | 0.01 | 0.01 | 0.01 | 0.03 |
| compactness_se | 0 | 1 | 0.03 | 0.02 | 0.00 | 0.01 | 0.02 | 0.03 | 0.14 |
| concavity_se | 0 | 1 | 0.03 | 0.03 | 0.00 | 0.02 | 0.03 | 0.04 | 0.40 |
| concave.points_se | 0 | 1 | 0.01 | 0.01 | 0.00 | 0.01 | 0.01 | 0.01 | 0.05 |
| symmetry_se | 0 | 1 | 0.02 | 0.01 | 0.01 | 0.02 | 0.02 | 0.02 | 0.08 |
| fractal_dimension_se | 0 | 1 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.03 |
| radius_worst | 0 | 1 | 16.27 | 4.83 | 7.93 | 13.01 | 14.97 | 18.79 | 36.04 |
| texture_worst | 0 | 1 | 25.67 | 6.12 | 12.02 | 21.18 | 25.45 | 29.66 | 49.54 |
| perimeter_worst | 0 | 1 | 107.26 | 33.60 | 50.41 | 84.11 | 97.66 | 125.40 | 251.20 |
| area_worst | 0 | 1 | 880.58 | 569.36 | 185.20 | 515.30 | 686.50 | 1084.00 | 4254.00 |
| smoothness_worst | 0 | 1 | 0.13 | 0.02 | 0.07 | 0.12 | 0.13 | 0.15 | 0.22 |
| compactness_worst | 0 | 1 | 0.25 | 0.16 | 0.03 | 0.15 | 0.21 | 0.34 | 1.06 |
| concavity_worst | 0 | 1 | 0.27 | 0.21 | 0.00 | 0.11 | 0.23 | 0.38 | 1.25 |
| concave.points_worst | 0 | 1 | 0.11 | 0.07 | 0.00 | 0.06 | 0.10 | 0.16 | 0.29 |
| symmetry_worst | 0 | 1 | 0.29 | 0.06 | 0.16 | 0.25 | 0.28 | 0.32 | 0.66 |
| fractal_dimension_worst | 0 | 1 | 0.08 | 0.02 | 0.06 | 0.07 | 0.08 | 0.09 | 0.21 |
diagnosis n
1 B 361
2 M 208
# setting diagnosis column to factor for plotting and modeling
df_c$diagnosis <- as.factor(df_c$diagnosis)
# plotting the bar plot
ggplot(df_c, aes(x=diagnosis, fill= diagnosis)) +
geom_bar(stat="count") +
theme_classic() +
scale_y_continuous(breaks = seq(0, 400, by = 25)) +
labs(title="Distribution of Diagnosis")
fig2 <- df_c[c("radius_mean", "diagnosis")]
# plotting the box plot
ggplot(fig2, aes(diagnosis, radius_mean, fill = diagnosis)) +
  geom_boxplot() +
  # the boxes are colored through the fill aesthetic, so the legend title is set with fill
  labs(title = "Distribution of lobes radius mean by diagnosis",
       y = "Lobes radius mean", fill = "Type of The Tumor") +
  # changing the color of the boxplots
  scale_fill_manual(values = c("dodgerblue1", "red3"))
ggplot(df_c, aes(radius_mean, concavity_mean)) +
  geom_point(aes(color = diagnosis)) +
  labs(title = "Radius Mean vs. Concavity Mean",
       y = "Mean of Concavity", x = "Radius of Lobes",
       col = "Type of The Tumor") +
  scale_colour_manual(labels = c("Benign", "Malignant"),
                      values = c("dodgerblue1", "red2")) +
  theme_bw()
# diagnosis is the dependent (target) variable; it must be a factor so that we can fit a classification model
df_c$diagnosis <- as.factor(df_c$diagnosis)
df_c <- df_c[, -1] # removing the id column, as it carries no diagnostic information
split=0.80 # define an 80%/20% train/test split of the dataset
trainIndex <- createDataPartition(df_c$diagnosis, p=split, list=FALSE)
data_train <- df_c[ trainIndex,]
data_test <- df_c[-trainIndex,]
#use train() and trainControl()
train_control <- trainControl(method="cv", number=10)
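The call that fits `model1` and produces `predictions1` does not appear in this section. A minimal sketch of what such a step could look like with caret's `train()` is given here. The choice of `method = "knn"`, the centering and scaling, `tuneLength = 10`, the seed, and the synthetic `toy` data are all assumptions made for illustration and are not the original settings.

```r
library(caret)

set.seed(42)  # assumed seed so the sketch is reproducible; the original seed is unknown

# Synthetic stand-in for `df_c`: two numeric features and a two-level diagnosis
n <- 200
toy <- data.frame(
  diagnosis      = factor(rep(c("B", "M"), each = n / 2)),
  radius_mean    = c(rnorm(n / 2, 12, 1.5), rnorm(n / 2, 18, 2.5)),
  concavity_mean = c(rnorm(n / 2, 0.05, 0.02), rnorm(n / 2, 0.16, 0.05))
)

# Stratified 80%/20% split, as in the text
idx        <- createDataPartition(toy$diagnosis, p = 0.80, list = FALSE)
data_train <- toy[idx, ]
data_test  <- toy[-idx, ]

# 10-fold cross-validation on the training set, as in the text
train_control <- trainControl(method = "cv", number = 10)

# KNN is distance based, so the features are centered and scaled before fitting;
# tuneLength = 10 asks caret to try 10 candidate values of k
model1 <- train(diagnosis ~ ., data = data_train,
                method     = "knn",
                trControl  = train_control,
                preProcess = c("center", "scale"),
                tuneLength = 10)

# Predicted class for each held-out row, then the share of correct predictions
predictions1 <- predict(model1, newdata = data_test)
mean(predictions1 == data_test$diagnosis)
```

With a fitted `train` object, `plot(model1)` shows the cross-validated accuracy for each candidate value of k.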
plot(model1)
cm1<-confusionMatrix(predictions1,data_test$diagnosis) # using confusion matrix for accuracy
cm1
Confusion Matrix and Statistics
Reference
Prediction B M
B 69 5
M 3 36
Accuracy : 0.9292
95% CI : (0.8653, 0.9689)
No Information Rate : 0.6372
P-Value [Acc > NIR] : 4.941e-13
Kappa : 0.8453
Mcnemar's Test P-Value : 0.7237
Sensitivity : 0.9583
Specificity : 0.8780
Pos Pred Value : 0.9324
Neg Pred Value : 0.9231
Prevalence : 0.6372
Detection Rate : 0.6106
Detection Prevalence : 0.6549
Balanced Accuracy : 0.9182
'Positive' Class : B
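The headline statistics in this output follow directly from the four cell counts. As a short worked check, the sketch below recomputes accuracy, sensitivity and specificity by hand from the KNN confusion matrix; the object name `cm` is hypothetical.

```r
# Counts copied from the KNN confusion matrix above
# (rows = prediction, columns = reference; matrix() fills column by column)
cm <- matrix(c(69, 3, 5, 36), nrow = 2,
             dimnames = list(Prediction = c("B", "M"),
                             Reference  = c("B", "M")))

# Accuracy: correct predictions (the diagonal) over all test cases
accuracy <- sum(diag(cm)) / sum(cm)

# The positive class is B, so sensitivity is the share of benign cases predicted benign
sensitivity <- cm["B", "B"] / sum(cm[, "B"])

# Specificity is then the share of malignant cases predicted malignant
specificity <- cm["M", "M"] / sum(cm[, "M"])

round(c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity), 4)
```

Because caret treats B as the positive class here, the figure that matters most clinically, the share of malignant tumors that are caught, is reported as specificity: 36 of 41, with 5 malignant cases predicted as benign. Passing `positive = "M"` to `confusionMatrix()` would swap the two labels.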
# making a data frame for the predicted and actual values
cm1_plot <- data.frame(predictions1,data_test$diagnosis )
# renaming the columns
names(cm1_plot) <- c("Predicted", "Actual")
cm1_plot <- conf_mat(cm1_plot, Actual,Predicted)
autoplot(cm1_plot, type = "heatmap") +
scale_fill_gradient(low="#D6EAF8",high = "#2E86C1") +
theme(legend.position = "right") + labs(title = "Confusion matrix for KNN model")
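The code that fits the decision tree (DT) and creates `predictions2` is likewise not shown in this section. The sketch below illustrates one way to do it with `ctree()` from the party package loaded earlier. Whether the original used `ctree()` directly or through caret is not known, so the call, the seed, and the synthetic `toy` data are assumptions.

```r
library(party)

set.seed(42)  # assumed seed, for reproducibility of the sketch only

# Synthetic stand-in for `df_c`: two numeric features and a two-level diagnosis
n <- 200
toy <- data.frame(
  diagnosis      = factor(rep(c("B", "M"), each = n / 2)),
  radius_mean    = c(rnorm(n / 2, 12, 1.5), rnorm(n / 2, 18, 2.5)),
  concavity_mean = c(rnorm(n / 2, 0.05, 0.02), rnorm(n / 2, 0.16, 0.05))
)

# Simple random 80%/20% split
train_rows <- sample(seq_len(n), size = 0.8 * n)
data_train <- toy[train_rows, ]
data_test  <- toy[-train_rows, ]

# Conditional inference tree with diagnosis as the response and all other columns as predictors
model2 <- ctree(diagnosis ~ ., data = data_train)

# Predicted class for each held-out row, then the share of correct predictions
predictions2 <- predict(model2, newdata = data_test)
mean(predictions2 == data_test$diagnosis)
```

A tree does not rely on distances between observations, so unlike KNN it needs no centering or scaling of the features.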
cm2<-confusionMatrix(predictions2,data_test$diagnosis)
cm2
Confusion Matrix and Statistics
Reference
Prediction B M
B 67 0
M 5 41
Accuracy : 0.9558
95% CI : (0.8998, 0.9855)
No Information Rate : 0.6372
P-Value [Acc > NIR] : 6.935e-16
Kappa : 0.9068
Mcnemar's Test P-Value : 0.07364
Sensitivity : 0.9306
Specificity : 1.0000
Pos Pred Value : 1.0000
Neg Pred Value : 0.8913
Prevalence : 0.6372
Detection Rate : 0.5929
Detection Prevalence : 0.5929
Balanced Accuracy : 0.9653
'Positive' Class : B
# making a data frame for the predicted and actual values
cm2_plot <- data.frame(predictions2,data_test$diagnosis )
# renaming the columns
names(cm2_plot) <- c("Predicted", "Actual")
cm2_plot <- conf_mat(cm2_plot, Actual,Predicted)
autoplot(cm2_plot, type = "heatmap") +
scale_fill_gradient(low="#D6EAF8",high = "#2E86C1") +
theme(legend.position = "right") + labs(title = "Confusion matrix for DT model")
# Picking the columns that are highly correlated with the mean of lobes radius
df_R <- df[c("perimeter_mean", "area_mean", "concave.points_mean", "radius_worst",
             "perimeter_worst", "area_worst", "radius_mean")]
cor(df_R)[, "radius_mean"]
perimeter_mean area_mean concave.points_mean
0.9924522 0.9831023 0.8222758
radius_worst perimeter_worst area_worst
0.9626530 0.9569919 0.9340473
radius_mean
1.0000000
ggplot(df_R, aes(radius_mean, area_mean)) +
  # alpha is a fixed setting, so it goes outside aes() and creates no legend
  geom_point(alpha = 0.1) + geom_smooth(method = "lm") +
  labs(title = "The mean of lobes radius vs. the mean of lobes area",
       y = "The mean of lobes area", x = "The mean of lobes radius") + theme_bw()
ggplot(df_R, aes(radius_mean, perimeter_mean)) +
  geom_point(alpha = 0.1) + geom_smooth(method = "lm") +
  labs(title = "The mean of lobes radius vs. the mean of lobes perimeter",
       y = "The mean of lobes perimeter", x = "The mean of lobes radius") + theme_bw()
split=0.80 # define an 80%/20% train/test split of the dataset
trainIndex <- createDataPartition(df_R$radius_mean, p=split, list=FALSE)
data_train_R <- df_R[ trainIndex,]
data_test_R <- df_R[-trainIndex,]
model3 <- lm(radius_mean ~ perimeter_mean + area_mean + concave.points_mean + radius_worst +
               perimeter_worst + area_worst, data = data_train_R)
summary(model3)$coefficient
Estimate Std. Error t value
(Intercept) 1.452766700 0.2556453324 5.68274291
perimeter_mean 0.087434825 0.0113911580 7.67567485
area_mean 0.004246361 0.0007016029 6.05237018
concave.points_mean 0.044078471 0.9869770440 0.04466008
radius_worst 0.498508836 0.0494352482 10.08407673
perimeter_worst -0.042312350 0.0056869708 -7.44022626
area_worst -0.001986717 0.0003754594 -5.29143073
Pr(>|t|)
(Intercept) 2.384706e-08
perimeter_mean 1.032026e-13
area_mean 3.009551e-09
concave.points_mean 9.643981e-01
radius_worst 1.071058e-21
perimeter_worst 5.140728e-13
area_worst 1.899586e-07
prediction_R = predict(model3, newdata =data_test_R, interval = "prediction")
head(prediction_R) #Exploring the head of our prediction table
fit lwr upr
6 12.56525 11.79421 13.33629
15 13.59875 12.82401 14.37348
23 15.17628 14.40568 15.94688
42 11.00685 10.23759 11.77612
53 11.87405 11.10747 12.64064
55 14.98148 14.21370 15.74927
# binding the predicted values with the actual ones
act_pred <- as.data.frame(cbind(data_test_R$radius_mean,prediction_R ))
# taking only the predicted and actual values without the upper and lower boundaries
act_pred <- act_pred[,1:2]
#renaming the columns
act_pred <- act_pred %>% rename( Actual= V1, predicted= fit )
#assigning row names to Null to get rid of the unorganized index
row.names(act_pred) <- NULL
head(act_pred)
Actual predicted
1 12.45 12.56525
2 13.73 13.59875
3 15.34 15.17628
4 10.95 11.00685
5 11.94 11.87405
6 15.10 14.98148
ggplot(act_pred, aes(Actual, predicted)) +
  geom_point(alpha = 0.4) +
  labs(title = "Predicted Values vs. Actual Values for Radius Mean",
       x = "Actual Values of The Test Data",
       y = "Predicted Values of The Test Data") + theme_gray()
There is a strong, nearly linear relationship between the actual and the predicted values.
# The error rate can be estimated by dividing the residual standard error (RSE) by the mean of the outcome variable;
# the smaller the error rate, the more accurate the model is.
sigma(model3)/mean(data_test_R$radius_mean)
[1] 0.02775463
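Note that `sigma(model3)` is the residual standard error measured on the training data, while the mean above is taken over the test data. A complementary check is the root mean squared error (RMSE) computed on the held-out rows only. The sketch below illustrates this on synthetic data; `toy`, `fit` and the single predictor are assumptions for illustration and are not the original model.

```r
set.seed(7)  # assumed seed, for reproducibility of the sketch only

# Synthetic stand-in for `df_R`: perimeter is the circle perimeter of the radius plus noise
n <- 200
radius <- runif(n, 7, 28)
toy <- data.frame(
  radius_mean    = radius,
  perimeter_mean = 2 * pi * radius + rnorm(n, sd = 3)
)

# Hold out 20% of the rows for testing
test_rows <- sample(seq_len(n), size = 40)
train <- toy[-test_rows, ]
test  <- toy[test_rows, ]

# Fit on the training rows, predict on the held-out rows
fit  <- lm(radius_mean ~ perimeter_mean, data = train)
pred <- predict(fit, newdata = test)

# RMSE on the test set, and the same error relative to the mean test outcome
rmse          <- sqrt(mean((test$radius_mean - pred)^2))
relative_rmse <- rmse / mean(test$radius_mean)
c(rmse = rmse, relative_rmse = relative_rmse)
```

In the original analysis the same two lines could be applied to the `Actual` and `predicted` columns of `act_pred`.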
Data science and statistics are powerful tools that can help predict disasters
before they happen. Breast cancer is a devastating disease with a strong negative impact on
society. In this project, we developed supervised learning models for two tasks.
For classification, we used two techniques: the KNN method, which reached about 93% accuracy on the test set,
and the DT (decision tree) method, which reached about 96% in the run reported above. For regression, we built a multiple linear regression model
that predicts the mean of the lobes radius from six predictors, with a relative error (RSE divided by the mean outcome) of about 0.03.