make_same_units_categorical

getml.datasets.make_same_units_categorical(n_rows_population=500, n_rows_peripheral=125000, random_state=None, population_name='', peripheral_name='', aggregation='COUNT')

Generate a random dataset with categorical variables.

The dataset consists of a population table and one peripheral table.

The peripheral table has 3 columns:

  • column_01: random categorical variable between ‘0’ and ‘9’

  • join_key: random integer in the range from 0 to n_rows_population

  • time_stamp: random number between 0 and 1

The population table has 4 columns:

  • column_01: random categorical variable between ‘0’ and ‘9’

  • join_key: unique integer in the range from 0 to n_rows_population

  • time_stamp: random number between 0 and 1

  • targets: target variable. Defined as the aggregation (by default, the count) over the matching entries in the peripheral table for which the peripheral time_stamp is not later than the population time_stamp and the peripheral column_01 equals the population column_01:

SELECT aggregation( column_02 )
FROM POPULATION_TABLE t1
LEFT JOIN PERIPHERAL_TABLE t2
ON t1.join_key = t2.join_key
WHERE (
   ( t1.column_01 = t2.column_01 )
) AND t2.time_stamp <= t1.time_stamp
GROUP BY t1.join_key,
     t1.time_stamp;

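
The query above can be read as a per-row filter followed by an aggregation. The following is a minimal pure-Python sketch of that logic for the default COUNT aggregation. It does not call getML; the helper names make_tables and count_targets are hypothetical, the tables are plain lists of dicts, and it assumes that a population row without any matching peripheral row gets a target of 0.

```python
import random


def make_tables(n_rows_population=5, n_rows_peripheral=50, seed=0):
    """Build toy tables shaped like the ones described above (illustrative only)."""
    rng = random.Random(seed)
    population = [
        {
            "column_01": str(rng.randrange(10)),  # category '0'..'9'
            "join_key": key,                      # unique per population row
            "time_stamp": rng.random(),           # number in [0, 1)
        }
        for key in range(n_rows_population)
    ]
    peripheral = [
        {
            "column_01": str(rng.randrange(10)),
            "join_key": rng.randrange(n_rows_population),
            "time_stamp": rng.random(),
        }
        for _ in range(n_rows_peripheral)
    ]
    return population, peripheral


def count_targets(population, peripheral):
    """COUNT peripheral rows with the same join_key and category that are not later."""
    targets = []
    for pop in population:
        matches = [
            per
            for per in peripheral
            if per["join_key"] == pop["join_key"]
            and per["column_01"] == pop["column_01"]
            and per["time_stamp"] <= pop["time_stamp"]
        ]
        # Assumption: rows without any match get 0 rather than being dropped.
        targets.append(len(matches))
    return targets


population, peripheral = make_tables()
targets = count_targets(population, peripheral)
print(targets)
```

Because the category has to match in both tables, a learner needs to compare column_01 across tables rather than against a fixed set of values.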
Args:
n_rows_population (int, optional):

Number of rows in the population table.

n_rows_peripheral (int, optional):

Number of rows in the peripheral table.

random_state (Union[int, None], optional):

Seed to initialize the random number generator used for the dataset creation. If set to None, the seed will be the ‘microsecond’ component of datetime.datetime.now().
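
The fallback described above can be sketched with the standard library alone. This is an illustration of the documented behaviour, not getML's actual code; the helper name resolve_seed is hypothetical.

```python
import datetime
import random


def resolve_seed(random_state=None):
    """Return random_state, or the current microsecond component when it is None."""
    if random_state is None:
        # Integer in the range 0..999999, so two calls rarely share a seed.
        return datetime.datetime.now().microsecond
    return random_state


seed = resolve_seed(None)
rng = random.Random(seed)  # generator seeded with the resolved value
print(seed)
```

Passing an explicit integer makes the generated data reproducible, whereas None yields a different seed on almost every call.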

population_name (string, optional):

Name assigned to the created DataFrame holding the population table. If set to a name already existing on the getML engine, the corresponding DataFrame will be overwritten. If set to an empty string, a unique name will be generated by concatenating make_same_units_categorical_population_ and the seed of the random number generator.

peripheral_name (string, optional):

Name assigned to the created DataFrame holding the peripheral table. If set to a name already existing on the getML engine, the corresponding DataFrame will be overwritten. If set to an empty string, a unique name will be generated by concatenating make_same_units_categorical_peripheral_ and the seed of the random number generator.
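
The naming rule for empty strings can be illustrated as follows. This is a sketch under assumptions: the helper resolve_name is hypothetical, and the prefix and seed are assumed to be joined without any extra separator, as the description suggests.

```python
def resolve_name(name, prefix, seed):
    """Return name unchanged, or prefix + seed when name is the empty string."""
    if name == "":
        return prefix + str(seed)
    return name


population_name = resolve_name("", "make_same_units_categorical_population_", 1234)
peripheral_name = resolve_name("", "make_same_units_categorical_peripheral_", 1234)
print(population_name)
print(peripheral_name)
```

Since the generated name depends on the seed, reusing a fixed random_state with empty names would produce the same names again and, per the description above, overwrite the existing DataFrames.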

aggregation (string, optional):

Aggregation used to generate the ‘targets’ column.

Returns:
tuple:

tuple containing: