make_snowflake
getml.datasets.make_snowflake(n_rows_population=500, n_rows_peripheral1=5000, n_rows_peripheral2=125000, random_state=None, population_name='', peripheral_name1='', peripheral_name2='', aggregation1='SUM', aggregation2='COUNT')
Generate a random dataset with continuous numerical variables.
The dataset consists of a population table and two peripheral tables.
The first peripheral table has 4 columns:
column_01: random number between -1 and 1
join_key: random integer in the range from 0 to n_rows_population
join_key2: unique integer in the range from 0 to n_rows_peripheral1
time_stamp: random number between 0 and 1
The second peripheral table has 3 columns:
column_01: random number between -1 and 1
join_key2: random integer in the range from 0 to n_rows_peripheral1
time_stamp: random number between 0 and 1
The population table has 4 columns:
column_01: random number between -1 and 1
join_key: unique integer in the range from 0 to n_rows_population
time_stamp: random number between 0 and 1
targets: target variable as defined by the SQL block below:
    SELECT aggregation1( feature_1_1 )
    FROM POPULATION t1
    LEFT JOIN (
        SELECT aggregation2( t4.column_01 ) AS feature_1_1
        FROM PERIPHERAL t3
        LEFT JOIN PERIPHERAL2 t4
        ON t3.join_key2 = t4.join_key2
        WHERE (
           ( t3.time_stamp - t4.time_stamp <= 0.5 )
        ) AND t4.time_stamp <= t3.time_stamp
        GROUP BY t3.join_key,
                 t3.time_stamp
    ) t2
    ON t1.join_key = t2.join_key
    WHERE t2.time_stamp <= t1.time_stamp
    GROUP BY t1.join_key,
             t1.time_stamp;
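The table layout and the target definition above can be sketched in plain Python. This is an illustrative approximation, not getML's implementation: the helper names `make_tables` and `add_targets` are hypothetical, the aggregations are fixed to the defaults (`SUM` over `COUNT`), integer ranges are assumed to exclude the upper bound, and population rows without any matching peripheral rows are assumed to receive a target of 0.0 (the SQL itself would simply omit them).

```python
import random


def make_tables(n_pop=5, n_per1=20, n_per2=100, seed=0):
    """Build three small tables (lists of dicts) with the documented columns."""
    rng = random.Random(seed)
    population = [
        {"column_01": rng.uniform(-1, 1), "join_key": i, "time_stamp": rng.random()}
        for i in range(n_pop)  # join_key is unique
    ]
    peripheral1 = [
        {
            "column_01": rng.uniform(-1, 1),
            "join_key": rng.randrange(n_pop),  # random link to the population
            "join_key2": i,                    # unique key for the second join
            "time_stamp": rng.random(),
        }
        for i in range(n_per1)
    ]
    peripheral2 = [
        {
            "column_01": rng.uniform(-1, 1),
            "join_key2": rng.randrange(n_per1),  # random link to peripheral1
            "time_stamp": rng.random(),
        }
        for _ in range(n_per2)
    ]
    return population, peripheral1, peripheral2


def add_targets(population, peripheral1, peripheral2):
    """Add a 'targets' column: SUM over peripheral1 of COUNT over peripheral2."""
    # Inner query: one feature per peripheral1 row (time stamps are assumed
    # distinct, so grouping by join_key and time_stamp yields one group per row).
    features = []
    for t3 in peripheral1:
        count = sum(
            1
            for t4 in peripheral2
            if t4["join_key2"] == t3["join_key2"]
            and t4["time_stamp"] <= t3["time_stamp"]
            and t3["time_stamp"] - t4["time_stamp"] <= 0.5
        )
        if count:  # the WHERE clause drops rows without a match
            features.append((t3["join_key"], t3["time_stamp"], count))

    # Outer query: sum the features that precede each population row in time.
    for t1 in population:
        t1["targets"] = float(
            sum(
                value
                for key, stamp, value in features
                if key == t1["join_key"] and stamp <= t1["time_stamp"]
            )
        )
    return population


population, peripheral1, peripheral2 = make_tables(seed=1)
add_targets(population, peripheral1, peripheral2)
```

Both time-stamp conditions matter: a row of the second peripheral table only counts if it lies at most 0.5 before the matching row of the first peripheral table, and that row in turn must not lie after the population row.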
- Parameters
n_rows_population (int, optional) – Number of rows in the population table.
n_rows_peripheral1 (int, optional) – Number of rows in the first peripheral table.
n_rows_peripheral2 (int, optional) – Number of rows in the second peripheral table.
random_state (Union[int, None], optional) – Seed to initialize the random number generator used for the dataset creation. If set to None, the seed will be the ‘microsecond’ component of datetime.datetime.now().
population_name (string, optional) – Name assigned to the created DataFrame holding the population table. If set to a name already existing on the getML engine, the corresponding DataFrame will be overwritten. If set to an empty string, a unique name will be generated by concatenating snowflake_population_ and the seed of the random number generator.
peripheral_name1 (string, optional) – Name assigned to the created DataFrame holding the first peripheral table. If set to a name already existing on the getML engine, the corresponding DataFrame will be overwritten. If set to an empty string, a unique name will be generated by concatenating snowflake_peripheral_1_ and the seed of the random number generator.
peripheral_name2 (string, optional) – Name assigned to the created DataFrame holding the second peripheral table. If set to a name already existing on the getML engine, the corresponding DataFrame will be overwritten. If set to an empty string, a unique name will be generated by concatenating snowflake_peripheral_2_ and the seed of the random number generator.
aggregation1 (string, optional) – Aggregation (see aggregations) applied to the first peripheral table when generating the ‘targets’ column.
aggregation2 (string, optional) – Aggregation (see aggregations) applied to the second peripheral table when generating the ‘targets’ column.
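The naming rule described for random_state and the three name parameters can be sketched as follows. The helper `resolve_names` is hypothetical and only mirrors the documented behaviour (fall back to the microsecond component of the current time as the seed, and append the seed to a fixed prefix when a name is left empty); it is not the library's code.

```python
import datetime


def resolve_names(random_state=None, population_name="",
                  peripheral_name1="", peripheral_name2=""):
    """Return the seed and the three table names following the documented rule."""
    # With no explicit seed, use the microsecond component of the current time.
    if random_state is None:
        seed = datetime.datetime.now().microsecond
    else:
        seed = random_state

    # An empty string means: generate a name from a fixed prefix plus the seed.
    names = (
        population_name or f"snowflake_population_{seed}",
        peripheral_name1 or f"snowflake_peripheral_1_{seed}",
        peripheral_name2 or f"snowflake_peripheral_2_{seed}",
    )
    return seed, names


seed, names = resolve_names(random_state=42, peripheral_name1="my_peripheral")
```

Passing a fixed random_state therefore makes the generated names reproducible, which also means a second call with the same seed targets the same names and would overwrite the corresponding DataFrames on the engine.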
- Returns
tuple containing:
population (getml.data.DataFrame): Population table
peripheral (getml.data.DataFrame): First peripheral table
peripheral_2 (getml.data.DataFrame): Second peripheral table
- Return type
tuple