make_same_units_categorical
- getml.datasets.make_same_units_categorical(n_rows_population: int = 500, n_rows_peripheral: int = 125000, random_state: Optional[int] = None, population_name: str = '', peripheral_name: str = '', aggregation: str = 'COUNT') -> Tuple[DataFrame, DataFrame]
Generate a random dataset with categorical variables.
The dataset consists of a population table and one peripheral table.
The peripheral table has 3 columns:
column_01: random categorical variable between ‘0’ and ‘9’
join_key: random integer in the range from 0 to n_rows_population
time_stamp: random number between 0 and 1
The population table has 4 columns:
column_01: random categorical variable between ‘0’ and ‘9’
join_key: unique integer in the range from 0 to n_rows_population
time_stamp: random number between 0 and 1
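targets: target variable. Defined as the result of applying aggregation (by default COUNT, the number of entries) to the matching entries in the peripheral table for which time_stamp_peripheral <= time_stamp_population and the category in the peripheral table equals the category in the population table, as expressed by the following query: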
SELECT aggregation( column_02 )
FROM POPULATION_TABLE t1
LEFT JOIN PERIPHERAL_TABLE t2
ON t1.join_key = t2.join_key
WHERE ( ( t1.column_01 == t2.column_01 ) )
AND t2.time_stamp <= t1.time_stamp
GROUP BY t1.join_key, t1.time_stamp;
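The query above can be mirrored in plain pandas to see how the target comes about. The following is only a minimal sketch, not getML's implementation: it assumes the aggregation is COUNT, follows the WHERE clause of the query (same category, peripheral time stamp not later than the population one), and uses a hypothetical function name, `sketch_same_units_categorical`, along with suffixed column names that are chosen purely for illustration.

```python
import numpy as np
import pandas as pd


def sketch_same_units_categorical(n_rows_population=500,
                                  n_rows_peripheral=125000,
                                  random_state=0):
    """Illustrative pandas re-creation of the documented tables (not getML code)."""
    rng = np.random.RandomState(random_state)

    # Population table: one row per unique join_key.
    population = pd.DataFrame({
        "column_01": rng.randint(0, 10, n_rows_population).astype(str),
        "join_key": np.arange(n_rows_population),
        "time_stamp": rng.rand(n_rows_population),
    })

    # Peripheral table: join_key is drawn at random, so keys repeat.
    peripheral = pd.DataFrame({
        "column_01": rng.randint(0, 10, n_rows_peripheral).astype(str),
        "join_key": rng.randint(0, n_rows_population, n_rows_peripheral),
        "time_stamp": rng.rand(n_rows_peripheral),
    })

    # LEFT JOIN on join_key; the suffixes tell the two tables' columns apart.
    joined = population.merge(
        peripheral,
        on="join_key",
        how="left",
        suffixes=("_population", "_peripheral"),
    )

    # WHERE clause: same category, peripheral time stamp not after the population one.
    keep = (
        (joined["column_01_population"] == joined["column_01_peripheral"])
        & (joined["time_stamp_peripheral"] <= joined["time_stamp_population"])
    )

    # COUNT per population row; rows without any match get a target of 0.
    counts = joined[keep].groupby("join_key").size()
    population["targets"] = (
        population["join_key"].map(counts).fillna(0).astype(int)
    )
    return population, peripheral


population, peripheral = sketch_same_units_categorical(50, 2000, random_state=1)
print(population.head())
```

Because the join key is unique in the population table, grouping by `join_key` alone is equivalent to the query's grouping by `join_key` and `time_stamp`.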
- Args:
- n_rows_population (int, optional):
Number of rows in the population table.
- n_rows_peripheral (int, optional):
Number of rows in the peripheral table.
- random_state (Optional[int], optional):
Seed to initialize the random number generator used for the dataset creation. If set to None, the seed will be the ‘microsecond’ component of datetime.datetime.now().
- population_name (string, optional):
Name assigned to the created DataFrame holding the population table. If set to a name already existing on the getML engine, the corresponding DataFrame will be overwritten. If set to an empty string, a unique name will be generated by concatenating make_same_units_categorical_population_ and the seed of the random number generator.
- peripheral_name (string, optional):
Name assigned to the created DataFrame holding the peripheral table. If set to a name already existing on the getML engine, the corresponding DataFrame will be overwritten. If set to an empty string, a unique name will be generated by concatenating make_same_units_categorical_peripheral_ and the seed of the random number generator.
- aggregation (string, optional):
Aggregation (one of the getML aggregations) used to generate the ‘targets’ column.
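The defaults described for random_state, population_name and peripheral_name can be illustrated with a small sketch. The helper `resolve_seed_and_name` is hypothetical and not part of getML; it only restates the documented rules (seed taken from the microsecond component of the current time, empty name replaced by a prefix plus the seed).

```python
import datetime
from typing import Optional, Tuple


def resolve_seed_and_name(random_state: Optional[int],
                          name: str,
                          prefix: str) -> Tuple[int, str]:
    """Hypothetical helper mirroring the documented defaults (not part of getML)."""
    # Fall back to the microsecond component of the current time if no seed is given.
    if random_state is None:
        seed = datetime.datetime.now().microsecond
    else:
        seed = random_state
    # An empty name is replaced by the prefix followed by the seed.
    resolved = name if name else prefix + str(seed)
    return seed, resolved


seed, population_name = resolve_seed_and_name(
    42, "", "make_same_units_categorical_population_"
)
print(population_name)  # make_same_units_categorical_population_42
```

Passing an explicit random_state therefore gives a reproducible default name, whereas leaving it as None yields a name that changes from call to call.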
- Returns:
- tuple:
tuple containing:
population (getml.DataFrame): Population table
peripheral (getml.DataFrame): Peripheral table