Introduction

The purpose of this report is to predict acute aquatic toxicity towards the fish Pimephales promelas (fathead minnow) for a set of 908 chemicals. The dataset that I'll be working with is called QSAR fish toxicity.

Link to original data source: https://archive.ics.uci.edu/ml/datasets/QSAR+fish+toxicity

Introduction to the variables: Response variable: LC50, the concentration that causes death in 50% of test fish over a test duration of 96 hours, is used as the model response. Predictor variables: the model uses 6 molecular descriptors: MLOGP (molecular properties), CIC0 (information indices), GATS1i (2D autocorrelations), NdssC (atom-type counts), NdsCH (atom-type counts), SM1_Dz(Z) (2D matrix-based descriptors).

library(tidyverse)
## -- Attaching packages ---------------------------------------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.5     v purrr   0.3.4
## v tibble  3.0.3     v dplyr   1.0.2
## v tidyr   1.1.2     v stringr 1.4.0
## v readr   1.3.1     v forcats 0.5.0
## Warning: package 'ggplot2' was built under R version 4.0.5
## -- Conflicts ------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
toxicity <- read_csv2("D:\\NCSU\\Spring2022_sophomore\\ST308\\FinalProject\\qsar_fish_toxicity.csv",
                     col_names = FALSE,
                     col_types = NULL)
## Using ',' as decimal and '.' as grouping mark. Use read_delim() for more control.
## Parsed with column specification:
## cols(
##   X1 = col_character(),
##   X2 = col_character(),
##   X3 = col_character(),
##   X4 = col_double(),
##   X5 = col_double(),
##   X6 = col_character(),
##   X7 = col_character()
## )
# Assign the result so the new column names are kept, then print the tibble
toxicity <- toxicity %>%
  rename(CIC0 = X1, `SM1_Dz(Z)` = X2, GATS1i = X3, NdsCH = X4, NdssC = X5, MLOGP = X6, LC50 = X7)
toxicity
## # A tibble: 908 x 7
##    CIC0  `SM1_Dz(Z)` GATS1i NdsCH NdssC MLOGP LC50 
##    <chr> <chr>       <chr>  <dbl> <dbl> <chr> <chr>
##  1 3.26  0.829       1.676      0     1 1.453 3.770
##  2 2.189 0.58        0.863      0     0 1.348 3.115
##  3 2.125 0.638       0.831      0     0 1.348 3.531
##  4 3.027 0.331       1.472      1     0 1.807 3.510
##  5 2.094 0.827       0.86       0     0 1.886 5.390
##  6 3.222 0.331       2.177      0     0 0.706 1.819
##  7 3.179 0           1.063      0     0 2.942 3.947
##  8 3     0           0.938      1     0 2.851 3.513
##  9 2.62  0.499       0.99       0     0 2.942 4.402
## 10 2.834 0.134       0.95       0     0 1.591 3.021
## # ... with 898 more rows
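
Note that the column specification above shows most columns being read as character rather than numeric. A likely cause is that read_csv2() assumes ',' as the decimal mark, while the values in this file use '.'. Below is a minimal sketch of one way to get numeric columns, assuming the file is semicolon-delimited with no header row. The inline sample reuses the first three rows printed above in place of the real file, and base R's read.table() is swapped in for readr so the sketch runs without extra packages; the names col_names, sample_text and toxicity_num are illustrative only.

```r
# Column names in file order, matching the rename() call above.
col_names <- c("CIC0", "SM1_Dz(Z)", "GATS1i", "NdsCH", "NdssC", "MLOGP", "LC50")

# Assumption: semicolon-delimited, '.' as decimal mark, no header row.
# These three records reuse the first rows printed above.
sample_text <- "3.26;0.829;1.676;0;1;1.453;3.770
2.189;0.58;0.863;0;0;1.348;3.115
2.125;0.638;0.831;0;0;1.348;3.531"

toxicity_num <- read.table(
  text = sample_text,
  sep = ";",
  dec = ".",
  header = FALSE,
  col.names = col_names,
  check.names = FALSE  # keep "SM1_Dz(Z)" exactly as written
)

# Every column should now be numeric (integer or double), not character.
str(toxicity_num)
```

With the real file, the same idea in readr would probably be read_delim(path, delim = ";", col_names = FALSE), where path is a placeholder for the file location; read_delim() keeps '.' as the decimal mark by default.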

EDA

library(GGally)
## Warning: package 'GGally' was built under R version 4.0.5
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
# PairPlot <- ggpairs(toxicity)
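
A pairs plot needs numeric columns, so it only becomes informative once the descriptors are no longer stored as character. Here is a rough sketch using base R's cor() and pairs() as a stand-in for ggpairs(), on the ten rows printed in the introduction. The data frame name eda_sample is hypothetical, SM1_Dz(Z) is shortened to SM1_Dz for convenience, and ten rows are far too few to draw conclusions from.

```r
# Ten rows copied from the tibble printed in the introduction.
eda_sample <- data.frame(
  CIC0   = c(3.26, 2.189, 2.125, 3.027, 2.094, 3.222, 3.179, 3, 2.62, 2.834),
  SM1_Dz = c(0.829, 0.58, 0.638, 0.331, 0.827, 0.331, 0, 0, 0.499, 0.134),
  GATS1i = c(1.676, 0.863, 0.831, 1.472, 0.86, 2.177, 1.063, 0.938, 0.99, 0.95),
  NdsCH  = c(0, 0, 0, 1, 0, 0, 0, 1, 0, 0),
  NdssC  = c(1, 0, 0, 0, 0, 0, 0, 0, 0, 0),
  MLOGP  = c(1.453, 1.348, 1.348, 1.807, 1.886, 0.706, 2.942, 2.851, 2.942, 1.591),
  LC50   = c(3.770, 3.115, 3.531, 3.510, 5.390, 1.819, 3.947, 3.513, 4.402, 3.021)
)

# Pairwise Pearson correlations between the response and the descriptors.
cor_matrix <- cor(eda_sample)
print(round(cor_matrix, 2))

# Scatterplot matrix of all seven variables.
pairs(eda_sample)
```

With GGally loaded and numeric columns, ggpairs(eda_sample) should produce a comparable scatterplot matrix, by default adding density plots on the diagonal and correlation coefficients in the upper panels.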

Multiple Linear Regression Models