load_biodegradability

getml.datasets.load_biodegradability(roles=True, as_pandas=False, as_dict=False)

Regression dataset on molecule weight prediction

The QSAR biodegradation dataset was built by the Milano Chemometrics and QSAR Research Group (Università degli Studi Milano-Bicocca, Milano, Italy). The data have been used to develop QSAR (Quantitative Structure-Activity Relationship) models for studying the relationship between the chemical structure of molecules and their biodegradation. Experimental biodegradation values for 1055 chemicals were collected from the webpage of the National Institute of Technology and Evaluation of Japan (NITE).

The original publication is: Mansouri, K., Ringsted, T., Ballabio, D., Todeschini, R., Consonni, V. (2013). Quantitative Structure-Activity Relationship models for ready biodegradability of chemicals. Journal of Chemical Information and Modeling, 53, 867-878.

The dataset was collected through the UCI Machine Learning Repository <https://archive.ics.uci.edu/ml/datasets/QSAR+biodegradation>

It contains information on 1309 molecules with 6166 bonds. It consists of 5 tables.

The population table is split into a training set (50%), a testing set (25%) and a validation set (25%).
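The 50/25/25 proportions can be illustrated with a minimal sketch. It is not getML's own splitting procedure, which this page does not describe: the function name split_population, the seed, and the shuffle-then-slice approach are all assumptions made for illustration. Only the proportions and the count of 1309 molecules come from the text.

```python
import random


def split_population(row_ids, seed=0):
    """Shuffle row ids and cut them into 50% / 25% / 25% subsets.

    Hypothetical helper for illustration only; not part of getML.
    """
    ids = list(row_ids)
    random.Random(seed).shuffle(ids)  # reproducible shuffle
    n_train = len(ids) // 2           # 50% for training
    n_test = len(ids) // 4            # 25% for testing
    return {
        "train": ids[:n_train],
        "test": ids[n_train:n_train + n_test],
        "validation": ids[n_train + n_test:],  # the remainder, roughly 25%
    }


splits = split_population(range(1309))
print({name: len(ids) for name, ids in splits.items()})
# {'train': 654, 'test': 327, 'validation': 328}
```

Because 1309 is not divisible by four, the subsets can only approximate the stated percentages; in this sketch the remainder goes to the validation set.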

Args:
as_pandas (bool):

Return data as pandas.DataFrames

roles (bool):

Return data with roles set

as_dict (bool):

Return data as a dict with the df.name of each DataFrame as key and the DataFrame itself as value.

Returns:
tuple:

Tuple containing the data as getml.DataFrames, or as pandas.DataFrames if as_pandas is True, sorted alphabetically by df.name; or

dict:

If as_dict is True: dictionary containing the data as getml.DataFrames, or as pandas.DataFrames if as_pandas is True. The keys correspond to the names of the DataFrames on the engine.

The following DataFrames are returned:

  • molecule

  • atom

  • bond

  • gmember

  • group
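Since the tuple is sorted alphabetically by name, the position of each table in it can be worked out ahead of time. The sketch below does this for the five names listed above. It assumes plain lexicographic ordering and that the names on the engine match this list exactly (the example in this reference uses the key "molecule_train", so they may differ). The helpers tuple_order and to_dict are hypothetical and not part of getML.

```python
# Table names as listed in this reference; names on the engine may differ.
TABLE_NAMES = ["molecule", "atom", "bond", "gmember", "group"]


def tuple_order(names):
    """Return the names in the order a name-sorted tuple would use."""
    return tuple(sorted(names))


def to_dict(names, frames):
    """Pair a name-sorted tuple of frames with its names (hypothetical helper)."""
    return dict(zip(tuple_order(names), frames))


order = tuple_order(TABLE_NAMES)
print(order)
# ('atom', 'bond', 'gmember', 'group', 'molecule')
```

Under these assumptions, a tuple return value would unpack in the order atom, bond, gmember, group, molecule. Passing as_dict=True avoids relying on position altogether.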

Examples:
>>> biodegradability = getml.datasets.load_biodegradability(as_dict=True)
>>> type(biodegradability["molecule_train"])
... getml.data.data_frame.DataFrame

For a full analysis of the biodegradability dataset, including all necessary preprocessing steps, please refer to getml-examples (forthcoming).

Note:

Roles can be set ad hoc by supplying the respective flag. If roles is False, all columns in the returned DataFrames have the role unused_string or unused_float. This dataset contains no units. Before the DataFrames can be used in an analysis, a data model needs to be constructed using Placeholders.