1. Importing the Dataset

a. Import the dataset

library(readr)
breast_cancer <- read_csv("breast cancer.csv", 
    col_types = cols(id = col_character(), 
        diagnosis = col_factor(levels = c("B", 
            "M"))))

b. Recode the “diagnosis” column

library(magrittr)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
breast_cancer_clean <-
  breast_cancer %>% mutate(diagnosis = case_match(
    diagnosis,
    "B" ~ "benign",
    "M" ~ "malignant",
    # .ptype only needs an empty factor that carries the target levels
    .ptype = factor(levels = c("benign", "malignant"))
  ))

c. Set the reference level to “benign”

# install.packages("forcats")  # run once if forcats is not installed yet
library(forcats)
breast_cancer_clean <- breast_cancer_clean %>%
  mutate(diagnosis = fct_relevel(diagnosis, "benign", "malignant"))
levels(breast_cancer_clean$diagnosis)
## [1] "benign"    "malignant"
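The level order matters for the model fitted later in this report: for a binomial glm() with a factor outcome, R treats the first level as the "failure" and models the probability of the other level, so putting benign first means the model predicts the probability of malignant. As a minimal base R sketch on a made-up vector (`raw` is hypothetical and not the real data), the recode and the level order can also be set in a single factor() call:

```r
# Hypothetical toy vector standing in for the raw diagnosis column
raw <- c("M", "B", "B", "M", "B")

# levels = the codes as stored, labels = the names to display;
# the first level ("benign") becomes the reference level
diagnosis <- factor(raw, levels = c("B", "M"), labels = c("benign", "malignant"))

levels(diagnosis)   # the reference level is listed first
table(diagnosis)    # counts per recoded level
```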

2. Exploratory Data Analysis (EDA)

a. Check the dimensions

Response: There are 569 rows and 32 columns in this dataset. The 33rd column was removed before importing. Nothing seems unusual here.

dim(breast_cancer_clean)
## [1] 569  32

b. Preview the first few rows

Response: Nothing seems unusual here. id is a character column and diagnosis is a factor, as intended. The diagnosis column has been successfully recoded. This preview (six rows, with only the first six columns displayed) doesn’t contain enough information to identify any interesting patterns.

head(breast_cancer_clean)
## # A tibble: 6 × 32
##   id       diagnosis radius_mean texture_mean perimeter_mean area_mean
##   <chr>    <fct>           <dbl>        <dbl>          <dbl>     <dbl>
## 1 842302   malignant        18.0         10.4          123.      1001 
## 2 842517   malignant        20.6         17.8          133.      1326 
## 3 84300903 malignant        19.7         21.2          130       1203 
## 4 84348301 malignant        11.4         20.4           77.6      386.
## 5 84358402 malignant        20.3         14.3          135.      1297 
## 6 843786   malignant        12.4         15.7           82.6      477.
## # ℹ 26 more variables: smoothness_mean <dbl>, compactness_mean <dbl>,
## #   concavity_mean <dbl>, `concave points_mean` <dbl>, symmetry_mean <dbl>,
## #   fractal_dimension_mean <dbl>, radius_se <dbl>, texture_se <dbl>,
## #   perimeter_se <dbl>, area_se <dbl>, smoothness_se <dbl>,
## #   compactness_se <dbl>, concavity_se <dbl>, `concave points_se` <dbl>,
## #   symmetry_se <dbl>, fractal_dimension_se <dbl>, radius_worst <dbl>,
## #   texture_worst <dbl>, perimeter_worst <dbl>, area_worst <dbl>, …

c. Summarize the dataset

Response: I’m happy to see that there are no NAs, unlike in the original dataset; I was able to remove them successfully! The quartiles show an interesting pattern: for most variables the maximum is far above the third quartile (for example, area_mean has a third quartile of 782.7 and a maximum of 2501.0), which suggests right skew. The minimum values are also below the first quartiles, but by much less.

summary(breast_cancer_clean)
##       id                diagnosis    radius_mean      texture_mean  
##  Length:569         benign   :357   Min.   : 6.981   Min.   : 9.71  
##  Class :character   malignant:212   1st Qu.:11.700   1st Qu.:16.17  
##  Mode  :character                   Median :13.370   Median :18.84  
##                                     Mean   :14.127   Mean   :19.29  
##                                     3rd Qu.:15.780   3rd Qu.:21.80  
##                                     Max.   :28.110   Max.   :39.28  
##  perimeter_mean     area_mean      smoothness_mean   compactness_mean 
##  Min.   : 43.79   Min.   : 143.5   Min.   :0.05263   Min.   :0.01938  
##  1st Qu.: 75.17   1st Qu.: 420.3   1st Qu.:0.08637   1st Qu.:0.06492  
##  Median : 86.24   Median : 551.1   Median :0.09587   Median :0.09263  
##  Mean   : 91.97   Mean   : 654.9   Mean   :0.09636   Mean   :0.10434  
##  3rd Qu.:104.10   3rd Qu.: 782.7   3rd Qu.:0.10530   3rd Qu.:0.13040  
##  Max.   :188.50   Max.   :2501.0   Max.   :0.16340   Max.   :0.34540  
##  concavity_mean    concave points_mean symmetry_mean    fractal_dimension_mean
##  Min.   :0.00000   Min.   :0.00000     Min.   :0.1060   Min.   :0.04996       
##  1st Qu.:0.02956   1st Qu.:0.02031     1st Qu.:0.1619   1st Qu.:0.05770       
##  Median :0.06154   Median :0.03350     Median :0.1792   Median :0.06154       
##  Mean   :0.08880   Mean   :0.04892     Mean   :0.1812   Mean   :0.06280       
##  3rd Qu.:0.13070   3rd Qu.:0.07400     3rd Qu.:0.1957   3rd Qu.:0.06612       
##  Max.   :0.42680   Max.   :0.20120     Max.   :0.3040   Max.   :0.09744       
##    radius_se        texture_se      perimeter_se       area_se       
##  Min.   :0.1115   Min.   :0.3602   Min.   : 0.757   Min.   :  6.802  
##  1st Qu.:0.2324   1st Qu.:0.8339   1st Qu.: 1.606   1st Qu.: 17.850  
##  Median :0.3242   Median :1.1080   Median : 2.287   Median : 24.530  
##  Mean   :0.4052   Mean   :1.2169   Mean   : 2.866   Mean   : 40.337  
##  3rd Qu.:0.4789   3rd Qu.:1.4740   3rd Qu.: 3.357   3rd Qu.: 45.190  
##  Max.   :2.8730   Max.   :4.8850   Max.   :21.980   Max.   :542.200  
##  smoothness_se      compactness_se      concavity_se     concave points_se 
##  Min.   :0.001713   Min.   :0.002252   Min.   :0.00000   Min.   :0.000000  
##  1st Qu.:0.005169   1st Qu.:0.013080   1st Qu.:0.01509   1st Qu.:0.007638  
##  Median :0.006380   Median :0.020450   Median :0.02589   Median :0.010930  
##  Mean   :0.007041   Mean   :0.025478   Mean   :0.03189   Mean   :0.011796  
##  3rd Qu.:0.008146   3rd Qu.:0.032450   3rd Qu.:0.04205   3rd Qu.:0.014710  
##  Max.   :0.031130   Max.   :0.135400   Max.   :0.39600   Max.   :0.052790  
##   symmetry_se       fractal_dimension_se  radius_worst   texture_worst  
##  Min.   :0.007882   Min.   :0.0008948    Min.   : 7.93   Min.   :12.02  
##  1st Qu.:0.015160   1st Qu.:0.0022480    1st Qu.:13.01   1st Qu.:21.08  
##  Median :0.018730   Median :0.0031870    Median :14.97   Median :25.41  
##  Mean   :0.020542   Mean   :0.0037949    Mean   :16.27   Mean   :25.68  
##  3rd Qu.:0.023480   3rd Qu.:0.0045580    3rd Qu.:18.79   3rd Qu.:29.72  
##  Max.   :0.078950   Max.   :0.0298400    Max.   :36.04   Max.   :49.54  
##  perimeter_worst    area_worst     smoothness_worst  compactness_worst
##  Min.   : 50.41   Min.   : 185.2   Min.   :0.07117   Min.   :0.02729  
##  1st Qu.: 84.11   1st Qu.: 515.3   1st Qu.:0.11660   1st Qu.:0.14720  
##  Median : 97.66   Median : 686.5   Median :0.13130   Median :0.21190  
##  Mean   :107.26   Mean   : 880.6   Mean   :0.13237   Mean   :0.25427  
##  3rd Qu.:125.40   3rd Qu.:1084.0   3rd Qu.:0.14600   3rd Qu.:0.33910  
##  Max.   :251.20   Max.   :4254.0   Max.   :0.22260   Max.   :1.05800  
##  concavity_worst  concave points_worst symmetry_worst   fractal_dimension_worst
##  Min.   :0.0000   Min.   :0.00000      Min.   :0.1565   Min.   :0.05504        
##  1st Qu.:0.1145   1st Qu.:0.06493      1st Qu.:0.2504   1st Qu.:0.07146        
##  Median :0.2267   Median :0.09993      Median :0.2822   Median :0.08004        
##  Mean   :0.2722   Mean   :0.11461      Mean   :0.2901   Mean   :0.08395        
##  3rd Qu.:0.3829   3rd Qu.:0.16140      3rd Qu.:0.3179   3rd Qu.:0.09208        
##  Max.   :1.2520   Max.   :0.29100      Max.   :0.6638   Max.   :0.20750
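The skew suggested by the summary can be checked with a little arithmetic. The sketch below reuses the area_mean figures printed above (the variable names are made up for illustration) and compares the length of the two tails and the mean against the median. It is only a rough check, not a formal skewness test.

```r
# Values copied from the area_mean column of the summary() output above
area_min    <- 143.5
area_q1     <- 420.3
area_median <- 551.1
area_avg    <- 654.9
area_q3     <- 782.7
area_max    <- 2501.0

lower_tail <- area_q1 - area_min    # distance from the minimum to the 1st quartile
upper_tail <- area_max - area_q3    # distance from the 3rd quartile to the maximum

upper_tail / lower_tail             # well above 1 points to a long right tail
area_avg > area_median              # TRUE when large values pull the mean upward
```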

c. Plotting to check the shape

Response: As expected, most of the data falls into the bins around the median. The distribution is concentrated on the left with few values on the right, i.e. it is right-skewed. There appears to be an empty bin.

library(ggplot2)
ggplot(data = breast_cancer_clean, aes(x = radius_mean)) +
  geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
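The message above asks for an explicit bin width. One common rule of thumb (not something the original analysis used) is the Freedman-Diaconis rule, bin width = 2 * IQR / n^(1/3). A minimal sketch using the radius_mean quartiles and the row count printed earlier in this report:

```r
# Quartiles and sample size copied from the output earlier in this report
q1 <- 11.700   # 1st quartile of radius_mean
q3 <- 15.780   # 3rd quartile of radius_mean
n  <- 569      # number of rows in the dataset

# Freedman-Diaconis rule: bin width = 2 * IQR / n^(1/3)
binwidth_fd <- 2 * (q3 - q1) / n^(1/3)
binwidth_fd
```

The result could then be passed on as geom_histogram(binwidth = binwidth_fd). Whether it gives a better picture than the default 30 bins is a judgment call.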

c. Check for missing values

Response: No missing values so far.

# is.na() on its own prints a 569 x 32 logical matrix, so count per column instead
colSums(is.na(breast_cancer_clean))
sum(is.na(breast_cancer_clean))
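For comparison, this is roughly what the same check reports when a value is missing. The data frame `toy` is a made-up example, not part of the breast cancer data:

```r
# Hypothetical toy data frame with one missing value in column b
toy <- data.frame(a = c(1, 2, 3), b = c(4, NA, 6))

na_total   <- sum(is.na(toy))       # total number of missing cells
na_per_col <- colSums(is.na(toy))   # number of missing cells in each column

na_total
na_per_col
```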

d. Identify collinear variables

d. Distance matrix

# 1 - correlation: 0 = perfectly positively correlated, 2 = perfectly negatively correlated
distancematrix <- 1 - cor(select(breast_cancer_clean, -c(id, diagnosis)))

Response: Cluster dendrogram: items joined at the lowest heights are the most similar. I’m guessing texture_mean and texture_worst are very similar (highly correlated). They are less similar to texture_se, smoothness_se, and symmetry_se.

plot(hclust(as.dist(distancematrix)))

Note: the closer the distance is to 0, the more strongly (positively) correlated the two variables are.
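To make the 1 - correlation distance concrete, here is a small sketch on made-up variables (`toy` and its columns are hypothetical). It shows a distance of 0 for a perfectly positively correlated pair, a distance of 2 for a perfectly negatively correlated pair, and that hclust() joins the closest pair first.

```r
# Hypothetical toy variables: y rises with x, z falls with x, w is unrelated
x <- 1:10
toy <- data.frame(
  x = x,
  y = 2 * x + 1,                          # perfectly positively correlated with x
  z = -x,                                 # perfectly negatively correlated with x
  w = c(3, 1, 4, 1, 5, 9, 2, 6, 5, 3)
)

toy_dist <- 1 - cor(toy)   # same transformation as distancematrix
round(toy_dist, 2)

# Variables at the smallest distance are merged first (lowest in the dendrogram)
toy_tree <- hclust(as.dist(toy_dist))
toy_tree$merge[1, ]        # negative entries index the original variables (here x and y)
```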

Response: Heatmap: every variable is perfectly correlated with itself, so the diagonal has distance 0. The concavity variables seem to be highly correlated with each other. The least correlated pairs seem to be the perimeter, area, and radius variables on one side and the texture, smoothness, and symmetry variables on the other.

ComplexHeatmap::Heatmap(distancematrix)

ComplexHeatmap::pheatmap(distancematrix)

e. Create a boxplot for each column except for id

e. Create a new dataset without the id variable, create a “long” dataframe

breast_cancer_new <- breast_cancer_clean %>% select(-id)
breast_cancer_long <- breast_cancer_new %>%
  tidyr::pivot_longer(!diagnosis, names_to = "attributes", values_to = "values")

Response: facet_wrap(): I assume that the variables whose medians differ most between the two groups are the ones most strongly associated with the diagnosis. No matter how I adjust the width and height, I can’t see the boxplots properly on my end, so I will use Lisa’s boxplot for reference. Overall, malignant tumors tend to have higher values than benign ones, but for some variables the two groups differ very little. I see large differences in the medians for area_worst, area_mean, compactness_mean, and many more.

library(ggplot2)
ggplot(breast_cancer_long, aes(x = diagnosis, y = values)) +
  geom_boxplot() +
  facet_wrap(vars(attributes), scales = "free")  # scales is an argument of facet_wrap(), not vars()

3. Building a Logistic Regression Model

a. Fit a univariate logistic regression model using area_mean as the predictor and diagnosis as the outcome

fit <- glm(formula = diagnosis ~ area_mean, data = breast_cancer_clean, family = "binomial")

b & c. Print the summary of the model

Response: The coefficients are on the log-odds scale, so they are not directly interpretable as probabilities.

Response: Before exponentiating: when area_mean is hypothetically zero, the log odds of a malignant diagnosis are -7.97. For every one-unit increase in area_mean, the log odds of a malignant diagnosis increase by 0.01177.

Response: After exponentiating: for a one-unit increase in area_mean, the odds of a malignant diagnosis are multiplied by 1.0118, i.e. they increase by about 1.18%.

summary(fit)
## 
## Call:
## glm(formula = diagnosis ~ area_mean, family = "binomial", data = breast_cancer_clean)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -7.97409    0.68286  -11.68   <2e-16 ***
## area_mean    0.01177    0.00109   10.80   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 751.44  on 568  degrees of freedom
## Residual deviance: 325.66  on 567  degrees of freedom
## AIC: 329.66
## 
## Number of Fisher Scoring iterations: 7

Intercept after exponentiating: odds = exp(-7.97409) = 0.00034

area_mean coefficient after exponentiating: odds ratio = exp(0.01177) = 1.0118

Percentage change in the odds: (1.0118 - 1) * 100 = 1.18%
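The arithmetic above can be reproduced in a few lines. This sketch hard-codes the rounded estimates from the summary(fit) output so that it runs on its own. With the fitted model at hand, exp(coef(fit)) would give the first two numbers directly. The 100-unit odds ratio is an extra illustration, added because a single unit of area_mean is a very small step on a scale that runs from 143.5 to 2501.

```r
# Coefficients copied (rounded) from the summary(fit) output above
b0 <- -7.97409   # intercept: log odds of malignant when area_mean = 0
b1 <-  0.01177   # change in log odds per one-unit increase in area_mean

odds_at_zero <- exp(b0)                  # odds of malignant at area_mean = 0
or_per_unit  <- exp(b1)                  # odds ratio for a 1-unit increase
or_per_100   <- exp(100 * b1)            # odds ratio for a 100-unit increase
pct_per_unit <- (or_per_unit - 1) * 100  # percentage change in odds per unit

round(c(odds_at_zero = odds_at_zero, or_per_unit = or_per_unit,
        or_per_100 = or_per_100, pct_per_unit = pct_per_unit), 5)
```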

4. Making Predictions

a. Create an xy plot with area_mean on the x axis and diagnosis on the y axis. Use geom_jitter(width = 0)

Response: The area_mean values for malignant diagnoses are shifted to the right, but the malignant group also spans a much wider range.

# geom_jitter() already draws the points, so an extra geom_point() would plot each one twice
ggplot(breast_cancer_clean, aes(x = area_mean, y = diagnosis)) +
  geom_jitter(width = 0)

b. Add to this plot the predicted probabilities for each observed area_mean, using the predict function with argument type = "response"

# Add the fitted probability of a malignant diagnosis as a new column
breast_cancer_clean <- breast_cancer_clean %>%
  mutate(prediction = predict(fit, type = "response"))

b. Plot the predictions

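# diagnosis is a factor, so benign sits at y = 1 and malignant at y = 2;
# adding 1 to the probability maps 0 onto benign and 1 onto malignant
ggplot(breast_cancer_clean, aes(x = area_mean, y = diagnosis)) +
  geom_jitter(width = 0) +
  geom_line(aes(y = prediction + 1), color = "pink", linewidth = 2)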

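c. What do you notice about the data points where the predicted probabilities are 0, 0.5, and 1?

Response: Where the predicted probability is near 0 the points are benign, and where it is near 1 they are malignant. Around 0.5 the two groups overlap, so the diagnosis is difficult to tell from area_mean alone.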
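The area_mean at which the model is exactly undecided can be worked out from the coefficients: the predicted probability is 0.5 when the log odds are 0, i.e. when area_mean = -intercept / slope. A minimal sketch, using the rounded estimates from the summary(fit) output (the variable names are made up for illustration):

```r
# Coefficients copied (rounded) from the summary(fit) output earlier in this report
b0 <- -7.97409
b1 <-  0.01177

# The predicted probability is 0.5 where the log odds b0 + b1 * x equal 0
area_at_half <- -b0 / b1
area_at_half

# Check: plogis() is the inverse logit, so this should return 0.5
plogis(b0 + b1 * area_at_half)
```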

d. What are the predicted probabilities of a malignant diagnosis when area_mean is 300, 500, 700, 900, and 1100?

Response: area_mean = 300: 0.0116; 500: 0.1101; 700: 0.5655; 900: 0.9320; 1100: 0.9931

predict(fit,newdata=data.frame(area_mean=seq(300,1100,by=200)),type="response")
##          1          2          3          4          5 
## 0.01161568 0.11005978 0.56548526 0.93195019 0.99310900