load_atherosclerosis

getml.datasets.load_atherosclerosis(roles=False, as_pandas=False)

Binary classification dataset on the lethality of atherosclerosis

The atherosclerosis dataset is a medical dataset from the CTU Prague Relational Learning Repository. It contains information from a longitudinal study on 1417 middle-aged men observed over the course of 20 years. After preprocessing, it consists of 2 tables with 76 and 66 columns:

  • population: Data on the study’s participants

  • contr: Data on control dates

The population table is split into a training set (70%), a testing set (15%) and a validation set (15%).
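As a rough illustration of what a 70/15/15 split means in absolute numbers, the sketch below computes approximate set sizes. It assumes one row per study participant (1417 rows) and simple truncation, with the validation set taking the remainder. Neither assumption is confirmed by this page, and split_sizes is a hypothetical helper, not part of getml, so the actual row counts may differ.

```python
# Rough set sizes for a 70/15/15 partition.
# Assumption: one row per participant (1417 rows in total); the row counts
# actually returned by load_atherosclerosis may differ.
N_ROWS = 1417


def split_sizes(n_rows, train_frac=0.70, test_frac=0.15):
    """Return (train, test, validation) sizes; validation takes the remainder."""
    n_train = int(n_rows * train_frac)   # truncate towards zero
    n_test = int(n_rows * test_frac)
    n_validation = n_rows - n_train - n_test
    return n_train, n_test, n_validation


train, test, validation = split_sizes(N_ROWS)
print(train, test, validation)
```

Letting the validation set absorb the remainder guarantees that the three sizes always add up to the total number of rows.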

Args:

as_pandas (bool):

Return data as pandas.DataFrames

roles (bool):

Return data with roles set

Returns:

dict:

Dictionary containing the data as DataFrames or pandas.DataFrames (if as_pandas is True). The keys correspond to the names of the DataFrames on the engine. The dictionary contains the following DataFrames:

  • population_train

  • population_test

  • population_validation

  • contr
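Because the tables are returned in a plain dictionary, it can be useful to confirm that all four keys are present before building on them. The following is a minimal sketch. missing_tables is a hypothetical helper, not part of getml, and the dictionary here is a stand-in so the snippet runs without a getML engine. In real use, data would be the return value of getml.datasets.load_atherosclerosis().

```python
# Keys listed in the Returns section of this page.
EXPECTED_TABLES = (
    "population_train",
    "population_test",
    "population_validation",
    "contr",
)


def missing_tables(data):
    """Return the expected table names that are absent from `data`."""
    return [name for name in EXPECTED_TABLES if name not in data]


# Stand-in for the dictionary returned by load_atherosclerosis();
# the values would normally be DataFrames, not None.
data = {name: None for name in EXPECTED_TABLES}

missing = missing_tables(data)
if missing:
    raise KeyError(f"Missing tables: {missing}")
print("All expected tables are present.")
```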

Examples:

>>> df_getml = getml.datasets.load_atherosclerosis()
>>> type(df_getml["population_train"])
<class 'getml.data.data_frame.DataFrame'>

For a full analysis of the atherosclerosis dataset, including all necessary preprocessing steps, please refer to getml-examples.

Note:

Roles can be set ad hoc by passing roles=True. If roles is False, all columns in the returned DataFrames have the role unused_string or unused_float. This dataset contains no units. Before the DataFrames can be used in an analysis, a data model needs to be constructed using Placeholders.