make_snowflake

getml.datasets.make_snowflake(n_rows_population: int = 500, n_rows_peripheral1: int = 5000, n_rows_peripheral2: int = 125000, random_state: Optional[int] = None, population_name: str = '', peripheral_name1: str = '', peripheral_name2: str = '', aggregation1: str = 'SUM', aggregation2: str = 'COUNT') -> Tuple[DataFrame, DataFrame, DataFrame]

Generate a random dataset with continuous numerical variables.

The dataset consists of a population table and two peripheral tables.

The first peripheral table has 4 columns:

  • column_01: random number between -1 and 1

  • join_key: random integer in the range from 0 to n_rows_population

  • join_key2: unique integer in the range from 0 to n_rows_peripheral1

  • time_stamp: random number between 0 and 1

The second peripheral table has 3 columns:

  • column_01: random number between -1 and 1

  • join_key2: random integer in the range from 0 to n_rows_peripheral1

  • time_stamp: random number between 0 and 1
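The column layout of the two peripheral tables can be illustrated with a minimal pure-Python sketch. It is not getML's implementation: the helper name make_peripheral_tables, the use of the standard random module, and the half-open integer ranges (0 up to but excluding the upper bound) are assumptions made for illustration only.

```python
import random


def make_peripheral_tables(n_rows_population, n_rows_peripheral1,
                           n_rows_peripheral2, seed=0):
    """Hypothetical helper: build both peripheral tables as lists of dicts."""
    rng = random.Random(seed)

    # First peripheral table: 4 columns. join_key2 is unique per row
    # (assumed here to be the row index).
    peripheral1 = [
        {
            "column_01": rng.uniform(-1.0, 1.0),
            "join_key": rng.randrange(n_rows_population),
            "join_key2": i,
            "time_stamp": rng.random(),
        }
        for i in range(n_rows_peripheral1)
    ]

    # Second peripheral table: 3 columns. join_key2 points at a random
    # row of the first peripheral table.
    peripheral2 = [
        {
            "column_01": rng.uniform(-1.0, 1.0),
            "join_key2": rng.randrange(n_rows_peripheral1),
            "time_stamp": rng.random(),
        }
        for _ in range(n_rows_peripheral2)
    ]
    return peripheral1, peripheral2


peripheral1, peripheral2 = make_peripheral_tables(5, 20, 100, seed=42)
```

The two join keys are what make the schema a snowflake: join_key links the first peripheral table to the population table, and join_key2 links the second peripheral table to the first.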

The population table has 4 columns:

  • column_01: random number between -1 and 1

  • join_key: unique integer in the range from 0 to n_rows_population

  • time_stamp: random number between 0 and 1

  • targets: target variable as defined by the SQL block below:

SELECT aggregation1( feature_1_1 )
FROM POPULATION t1
LEFT JOIN (
    SELECT t3.join_key,
           t3.time_stamp,
           aggregation2( t4.column_01 ) AS feature_1_1
    FROM PERIPHERAL t3
    LEFT JOIN PERIPHERAL2 t4
    ON t3.join_key2 = t4.join_key2
    WHERE (
       ( t3.time_stamp - t4.time_stamp <= 0.5 )
    ) AND t4.time_stamp <= t3.time_stamp
    GROUP BY t3.join_key,
         t3.time_stamp
) t2
ON t1.join_key = t2.join_key
WHERE t2.time_stamp <= t1.time_stamp
GROUP BY t1.join_key,
     t1.time_stamp;
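
The two-level aggregation in the SQL block can be traced in plain Python. This is a sketch under stated assumptions, not the library's code: the function snowflake_targets and the AGGREGATIONS mapping are hypothetical, each row of the first peripheral table is treated as its own group (join_key2 is unique there), and population rows without any match receive None.

```python
# Hypothetical mapping from aggregation names to Python callables.
AGGREGATIONS = {
    "SUM": sum,
    "COUNT": len,
    "AVG": lambda values: sum(values) / len(values),
    "MIN": min,
    "MAX": max,
}


def snowflake_targets(population, peripheral1, peripheral2,
                      aggregation1="SUM", aggregation2="COUNT"):
    agg1 = AGGREGATIONS[aggregation1]
    agg2 = AGGREGATIONS[aggregation2]

    # Index the second peripheral table by join_key2.
    by_key2 = {}
    for t4 in peripheral2:
        by_key2.setdefault(t4["join_key2"], []).append(t4)

    # Inner query: aggregate t4.column_01 for every t3 row, keeping only
    # t4 rows that are at most 0.5 older than t3 and not newer than t3.
    features = {}
    for t3 in peripheral1:
        values = [
            t4["column_01"]
            for t4 in by_key2.get(t3["join_key2"], [])
            if 0.0 <= t3["time_stamp"] - t4["time_stamp"] <= 0.5
        ]
        if values:  # rows without a match drop out, as in the WHERE clause
            features.setdefault(t3["join_key"], []).append(
                (t3["time_stamp"], agg2(values))
            )

    # Outer query: aggregate feature_1_1 over all t3 rows that share the
    # join_key and are not newer than the population row.
    targets = []
    for t1 in population:
        values = [
            feature
            for time_stamp, feature in features.get(t1["join_key"], [])
            if time_stamp <= t1["time_stamp"]
        ]
        targets.append(agg1(values) if values else None)
    return targets


population = [
    {"join_key": 0, "time_stamp": 0.9},
    {"join_key": 1, "time_stamp": 0.1},
]
peripheral1 = [
    {"join_key": 0, "join_key2": 0, "time_stamp": 0.8},
    {"join_key": 0, "join_key2": 1, "time_stamp": 0.95},  # newer than t1
    {"join_key": 1, "join_key2": 2, "time_stamp": 0.5},   # newer than t1
]
peripheral2 = [
    {"join_key2": 0, "column_01": 0.3, "time_stamp": 0.7},   # counted
    {"join_key2": 0, "column_01": -0.2, "time_stamp": 0.4},  # counted
    {"join_key2": 0, "column_01": 0.9, "time_stamp": 0.1},   # too old
    {"join_key2": 0, "column_01": 0.5, "time_stamp": 0.85},  # newer than t3
    {"join_key2": 1, "column_01": 0.1, "time_stamp": 0.9},
]

targets = snowflake_targets(population, peripheral1, peripheral2)
# targets == [2, None]
```

With the defaults, the first population row counts two qualifying rows of the second peripheral table and sums that single count; the second population row has no qualifying rows at all.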
Args:
n_rows_population (int, optional):

Number of rows in the population table.

n_rows_peripheral1 (int, optional):

Number of rows in the first peripheral table.

n_rows_peripheral2 (int, optional):

Number of rows in the second peripheral table.

random_state (Union[int, None], optional):

Seed to initialize the random number generator used for the dataset creation. If set to None, the seed will be the ‘microsecond’ component of datetime.datetime.now().
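
The fallback described for random_state can be sketched as follows. The helper name resolve_seed is hypothetical and only mirrors the documented behaviour; it is not taken from the getML source.

```python
import datetime
from typing import Optional


def resolve_seed(random_state: Optional[int] = None) -> int:
    """Hypothetical helper: return the seed described for random_state."""
    if random_state is None:
        # The microsecond component is an integer in [0, 999999].
        return datetime.datetime.now().microsecond
    return random_state


fixed_seed = resolve_seed(7)   # explicit seed, reproducible
clock_seed = resolve_seed()    # derived from the current time
```

Because the fallback depends on the clock, passing an explicit integer is the way to obtain the same dataset twice.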

population_name (string, optional):

Name assigned to the created DataFrame holding the population table. If set to a name that already exists on the getML engine, the corresponding DataFrame will be overwritten. If set to an empty string, a unique name will be generated by concatenating snowflake_population_ and the seed of the random number generator.

peripheral_name1 (string, optional):

Name assigned to the created DataFrame holding the first peripheral table. If set to a name that already exists on the getML engine, the corresponding DataFrame will be overwritten. If set to an empty string, a unique name will be generated by concatenating snowflake_peripheral_1_ and the seed of the random number generator.

peripheral_name2 (string, optional):

Name assigned to the created DataFrame holding the second peripheral table. If set to a name that already exists on the getML engine, the corresponding DataFrame will be overwritten. If set to an empty string, a unique name will be generated by concatenating snowflake_peripheral_2_ and the seed of the random number generator.
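
The naming rule shared by the three name arguments can be illustrated with a short sketch. The function default_names is hypothetical; it assumes the seed is appended in its plain decimal form, which the text above does not specify.

```python
def default_names(seed, population_name="", peripheral_name1="",
                  peripheral_name2=""):
    """Hypothetical helper: fill in empty names from prefix plus seed."""
    return (
        population_name or f"snowflake_population_{seed}",
        peripheral_name1 or f"snowflake_peripheral_1_{seed}",
        peripheral_name2 or f"snowflake_peripheral_2_{seed}",
    )


names = default_names(42, peripheral_name1="my_peripheral")
# names == ("snowflake_population_42", "my_peripheral",
#           "snowflake_peripheral_2_42")
```

Two calls with the same seed and empty names would resolve to the same names, so the later call would overwrite the DataFrames created by the earlier one.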

aggregation1 (string, optional):

Aggregation applied to the first peripheral table when generating the targets column. It takes the place of aggregation1 in the SQL block above.

aggregation2 (string, optional):

Aggregation applied to the second peripheral table when generating the targets column. It takes the place of aggregation2 in the SQL block above.

Returns:
tuple:

tuple containing: