Selected molecular descriptors from the Dragon chemoinformatics application were used to predict bioconcentration factors for 779 chemicals in order to evaluate quantitative structure-activity relationship (QSAR) models. This dataset was obtained from the UCI Machine Learning Repository.
The dataset consists of 779 observations of 10 attributes: 9 predictors and the response variable (logBCF). A brief description of each is given below:
nHM - number of heavy atoms (integer)
piPC09 - molecular multiple path count (numeric)
PCD - difference between multiple path count and path count (numeric)
X2Av - average valence connectivity (numeric)
MLOGP - Moriguchi octanol-water partition coefficient (numeric)
ON1V - overall modified Zagreb index by valence vertex degrees (numeric)
N.072 - frequency of RCO-N< / >N-X=X fragments (integer)
B02[C-N] - presence/absence of C-N atom pairs (binary)
F04[C-O] - frequency of C-O atom pairs (integer)
logBCF - bioconcentration factor in log units (numeric)
Note that all predictors except B02[C-N] are quantitative. For the purpose of this assignment, DO NOT CONVERT B02[C-N] to a factor. Leave the data in its original format, which is numeric in R.
Please load the dataset "Bio_pred" and then split it into a training set and a test set in an 80:20 ratio. Use the training set to build the models in Questions 1-6. Use the test set to help evaluate model performance in Question 7. Please make sure that you are using R version 3.6.X or above (version 4.X is also acceptable).
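One possible way to carry out the load-and-split step is sketched below. This is only an illustration, not a required approach: the helper name split_train_test, the seed value 100, the file name "Bio_pred.csv", and the object names trainData and testData are all assumptions that do not come from the assignment text. Adjust them to match the file and instructions you were actually given.

```r
# Hypothetical helper (not part of the assignment): randomly split a data
# frame into training and test sets by sampling row indices.
split_train_test <- function(data, train_frac = 0.8, seed = 100) {
  set.seed(seed)  # arbitrary seed (an assumption), used only for reproducibility
  n <- nrow(data)
  train_idx <- sample(seq_len(n), size = floor(train_frac * n))
  list(
    train = data[train_idx, , drop = FALSE],
    test  = data[-train_idx, , drop = FALSE]
  )
}

# Assumed file name and location; change the path if your copy differs.
if (file.exists("Bio_pred.csv")) {
  bio <- read.csv("Bio_pred.csv", header = TRUE)

  # Every column, including B02[C-N], should remain numeric (no factors).
  stopifnot(all(sapply(bio, is.numeric)))

  parts     <- split_train_test(bio)
  trainData <- parts$train
  testData  <- parts$test
}
```

With 779 observations and an 80:20 ratio, this sketch places floor(0.8 * 779) = 623 rows in the training set and the remaining 156 rows in the test set.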