The purpose of this report is to predict acute aquatic toxicity towards the fish Pimephales promelas (fathead minnow) for a set of 908 chemicals. The dataset I will be working with is called QSAR fish toxicity.
Link to original data source: https://archive.ics.uci.edu/ml/datasets/QSAR+fish+toxicity
Introduction to the variables:
Response variable: LC50, the concentration that causes death in 50% of test fish over a test duration of 96 hours, is used as the model response.
Explanatory variables: the model comprises 6 molecular descriptors: MLOGP (molecular properties), CIC0 (information indices), GATS1i (2D autocorrelations), NdssC (atom-type counts), NdsCH (atom-type counts), and SM1_Dz (2D matrix-based descriptors).
library(tidyverse)
## -- Attaching packages ---------------------------------------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.5 v purrr 0.3.4
## v tibble 3.0.3 v dplyr 1.0.2
## v tidyr 1.1.2 v stringr 1.4.0
## v readr 1.3.1 v forcats 0.5.0
## Warning: package 'ggplot2' was built under R version 4.0.5
## -- Conflicts ------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
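# Note: read_csv2() expects ';' as the delimiter and ',' as the decimal mark.
# The values in this file use '.', so the decimal-valued columns are parsed as character
# (see the column specification below).
toxicity <- read_csv2("D:\\NCSU\\Spring2022_sophomore\\ST308\\FinalProject\\qsar_fish_toxicity.csv",
                      col_names = FALSE,
                      col_types = NULL)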
## Using ',' as decimal and '.' as grouping mark. Use read_delim() for more control.
## Parsed with column specification:
## cols(
## X1 = col_character(),
## X2 = col_character(),
## X3 = col_character(),
## X4 = col_double(),
## X5 = col_double(),
## X6 = col_character(),
## X7 = col_character()
## )
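# Assign the result so the descriptive column names persist, then print the tibble.
toxicity <- toxicity %>%
  rename("CIC0" = "X1", "SM1_Dz(Z)" = "X2", "GATS1i" = "X3", "NdsCH" = "X4", "NdssC" = "X5", "MLOGP" = "X6", "LC50" = "X7")
toxicity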
## # A tibble: 908 x 7
## CIC0 `SM1_Dz(Z)` GATS1i NdsCH NdssC MLOGP LC50
## <chr> <chr> <chr> <dbl> <dbl> <chr> <chr>
## 1 3.26 0.829 1.676 0 1 1.453 3.770
## 2 2.189 0.58 0.863 0 0 1.348 3.115
## 3 2.125 0.638 0.831 0 0 1.348 3.531
## 4 3.027 0.331 1.472 1 0 1.807 3.510
## 5 2.094 0.827 0.86 0 0 1.886 5.390
## 6 3.222 0.331 2.177 0 0 0.706 1.819
## 7 3.179 0 1.063 0 0 2.942 3.947
## 8 3 0 0.938 1 0 2.851 3.513
## 9 2.62 0.499 0.99 0 0 2.942 4.402
## 10 2.834 0.134 0.95 0 0 1.591 3.021
## # ... with 898 more rows
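The printed tibble shows the decimal-valued descriptors and LC50 stored as `<chr>`, which follows from read_csv2() treating ',' as the decimal mark. A minimal sketch of one possible workaround is to call read_delim() with an explicit delimiter, assuming the file is ';'-delimited with '.' decimals (as the parse above suggests). The inline text `sample_lines` is only a stand-in for the real file and holds three rows copied from the printed output; `toxicity_num` is a hypothetical name, not one used elsewhere in this report.

```r
library(readr)
library(dplyr)

# Stand-in for qsar_fish_toxicity.csv: three rows copied from the printed tibble,
# written with ';' as delimiter and '.' as decimal mark (assumed file layout).
sample_lines <- paste0(
  "3.26;0.829;1.676;0;1;1.453;3.770\n",
  "2.189;0.58;0.863;0;0;1.348;3.115\n",
  "2.125;0.638;0.831;0;0;1.348;3.531\n"
)

# read_delim() keeps the default '.' decimal mark; col_types asks for doubles
# in every column so a parsing problem would surface instead of yielding text.
toxicity_num <- read_delim(sample_lines, delim = ";", col_names = FALSE,
                           col_types = cols(.default = col_double())) %>%
  rename(CIC0 = X1, `SM1_Dz(Z)` = X2, GATS1i = X3, NdsCH = X4,
         NdssC = X5, MLOGP = X6, LC50 = X7)

# Check that all seven columns are numeric
sapply(toxicity_num, is.numeric)
```

With the real data, `sample_lines` would be replaced by the path to the CSV file; the rest of the call stays the same.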
library(GGally)
## Warning: package 'GGally' was built under R version 4.0.5
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
# Pair plot left disabled for now:
# PairPlot <- ggpairs(toxicity)
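The pair plot is left commented out here. Because most columns of `toxicity` were parsed as character, ggpairs() would treat them as discrete variables, and with this many distinct values it may stop with a cardinality-threshold error. As a rough sketch, assuming the character columns are first converted to numeric, a pair plot could be built as shown. The small tibble `toxicity_chr` is a hypothetical stand-in holding five rows copied from the printed output, and the column is named `SM1_Dz` (without the `(Z)` suffix) purely to keep the name syntactic in this example.

```r
library(dplyr)
library(tibble)
library(GGally)

# Hypothetical stand-in for the parsed data: five rows copied from the printed
# tibble, with the same mix of character and numeric columns.
toxicity_chr <- tribble(
  ~CIC0,   ~SM1_Dz, ~GATS1i, ~NdsCH, ~NdssC, ~MLOGP,  ~LC50,
  "3.26",  "0.829", "1.676", 0,      1,      "1.453", "3.770",
  "2.189", "0.58",  "0.863", 0,      0,      "1.348", "3.115",
  "2.125", "0.638", "0.831", 0,      0,      "1.348", "3.531",
  "3.027", "0.331", "1.472", 1,      0,      "1.807", "3.510",
  "2.094", "0.827", "0.86",  0,      0,      "1.886", "5.390"
)

# Convert every character column to numeric; numeric columns are left untouched.
toxicity_num <- toxicity_chr %>%
  mutate(across(where(is.character), as.numeric))

# Build the scatterplot matrix; the object is drawn when it is printed.
pair_plot <- ggpairs(toxicity_num)
```

Calling `print(pair_plot)` renders the matrix of scatterplots, densities, and pairwise correlations. On the full 908-row data the same two steps would apply to `toxicity` once its columns are numeric.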