make_snowflake

getml.datasets.make_snowflake(
    n_rows_population: int = 500,
    n_rows_peripheral1: int = 5000,
    n_rows_peripheral2: int = 125000,
    random_state: Optional[int] = None,
    population_name: str = '',
    peripheral_name1: str = '',
    peripheral_name2: str = '',
    aggregation1: Literal[VARIATION COEFFICIENT, VAR, TREND, TIME SINCE LAST MINIMUM, TIME SINCE LAST MAXIMUM, TIME SINCE FIRST MINIMUM, TIME SINCE FIRST MAXIMUM, SUM, STDDEV, SKEW, Q99, Q95, Q90, Q75, Q25, Q10, Q5, Q1, NUM MIN, NUM MAX, MODE, MIN, MEDIAN, MAX, LAST, KURTOSIS, FIRST, EWMA_365D, EWMA_90D, EWMA_30D, EWMA_7D, EWMA_1D, EWMA_1H, EWMA_1M, EWMA_1S, COUNT MINUS COUNT DISTINCT, COUNT DISTINCT OVER COUNT, COUNT DISTINCT, COUNT BELOW MEAN, COUNT ABOVE MEAN, COUNT, AVG] = 'SUM',
    aggregation2: Literal[VARIATION COEFFICIENT, VAR, TREND, TIME SINCE LAST MINIMUM, TIME SINCE LAST MAXIMUM, TIME SINCE FIRST MINIMUM, TIME SINCE FIRST MAXIMUM, SUM, STDDEV, SKEW, Q99, Q95, Q90, Q75, Q25, Q10, Q5, Q1, NUM MIN, NUM MAX, MODE, MIN, MEDIAN, MAX, LAST, KURTOSIS, FIRST, EWMA_365D, EWMA_90D, EWMA_30D, EWMA_7D, EWMA_1D, EWMA_1H, EWMA_1M, EWMA_1S, COUNT MINUS COUNT DISTINCT, COUNT DISTINCT OVER COUNT, COUNT DISTINCT, COUNT BELOW MEAN, COUNT ABOVE MEAN, COUNT, AVG] = 'COUNT'
) → Tuple[getml.data.data_frame.DataFrame, getml.data.data_frame.DataFrame, getml.data.data_frame.DataFrame]

Generate a random dataset with continuous numerical variables.
The dataset consists of a population table and two peripheral tables.
The first peripheral table has 4 columns:
- column_01: random number between -1 and 1
- join_key: random integer in the range from 0 to n_rows_population
- join_key2: unique integer in the range from 0 to n_rows_peripheral1
- time_stamp: random number between 0 and 1
The second peripheral table has 3 columns:
- column_01: random number between -1 and 1
- join_key2: random integer in the range from 0 to n_rows_peripheral1
- time_stamp: random number between 0 and 1
The population table has 4 columns:
- column_01: random number between -1 and 1
- join_key: unique integer in the range from 0 to n_rows_population
- time_stamp: random number between 0 and 1
- targets: target variable as defined by the SQL block below:

    SELECT aggregation1( feature_1_1 )
    FROM POPULATION t1
    LEFT JOIN (
        SELECT aggregation2( t4.column_01 ) AS feature_1_1
        FROM PERIPHERAL t3
        LEFT JOIN PERIPHERAL2 t4
        ON t3.join_key2 = t4.join_key2
        WHERE (
           ( t3.time_stamp - t4.time_stamp <= 0.5 )
        ) AND t4.time_stamp <= t3.time_stamp
        GROUP BY t3.join_key,
                 t3.time_stamp
    ) t2
    ON t1.join_key = t2.join_key
    WHERE t2.time_stamp <= t1.time_stamp
    GROUP BY t1.join_key,
             t1.time_stamp;
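
The table layout and the nested aggregation can be sketched in plain Python. This is an illustration only, not getML's implementation: the helper names make_tables and add_targets are hypothetical, the tables are lists of dicts instead of getml DataFrames, and the aggregations are fixed to the defaults ('COUNT' for the inner step, 'SUM' for the outer step). It also assumes that random keys are drawn from the half-open range [0, n), and it assigns 0.0 to population rows without any match, whereas the SQL above would drop them.

```python
import random
from collections import defaultdict


def make_tables(n_population=5, n_peripheral1=20, n_peripheral2=200, seed=0):
    """Build three small tables as lists of dicts (illustrative only)."""
    rng = random.Random(seed)
    population = [
        {"column_01": rng.uniform(-1, 1), "join_key": i, "time_stamp": rng.random()}
        for i in range(n_population)
    ]
    peripheral1 = [
        {
            "column_01": rng.uniform(-1, 1),
            # Assumption: random keys are drawn from [0, n_population).
            "join_key": rng.randrange(n_population),
            "join_key2": i,  # unique per row
            "time_stamp": rng.random(),
        }
        for i in range(n_peripheral1)
    ]
    peripheral2 = [
        {
            "column_01": rng.uniform(-1, 1),
            # Assumption: random keys are drawn from [0, n_peripheral1).
            "join_key2": rng.randrange(n_peripheral1),
            "time_stamp": rng.random(),
        }
        for _ in range(n_peripheral2)
    ]
    return population, peripheral1, peripheral2


def add_targets(population, peripheral1, peripheral2, window=0.5):
    """Add a 'targets' entry to each population row (COUNT inside, SUM outside)."""
    # Index the second peripheral table by join_key2 for the inner join.
    per2_by_key = defaultdict(list)
    for r4 in peripheral2:
        per2_by_key[r4["join_key2"]].append(r4)

    # Inner step: one feature per row of the first peripheral table.
    # COUNT the matching rows that are not in the future and at most
    # `window` time units in the past.
    features_by_key = defaultdict(list)
    for r3 in peripheral1:
        count = sum(
            1
            for r4 in per2_by_key[r3["join_key2"]]
            if 0 <= r3["time_stamp"] - r4["time_stamp"] <= window
        )
        features_by_key[r3["join_key"]].append((r3["time_stamp"], count))

    # Outer step: SUM the features of all first-peripheral rows that share
    # the join_key and are not newer than the population row.
    for r1 in population:
        r1["targets"] = float(
            sum(
                feature
                for time_stamp, feature in features_by_key[r1["join_key"]]
                if time_stamp <= r1["time_stamp"]
            )
        )
    return population


population, peripheral1, peripheral2 = make_tables()
add_targets(population, peripheral1, peripheral2)
print(population[0])
```

Both time-stamp conditions matter: a row of the second peripheral table only contributes if it lies at most 0.5 time units before the matching row of the first peripheral table, and that row in turn only contributes if it is not newer than the population row.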
- Args:
- n_rows_population (int, optional):
Number of rows in the population table.
- n_rows_peripheral1 (int, optional):
Number of rows in the first peripheral table.
- n_rows_peripheral2 (int, optional):
Number of rows in the second peripheral table.
- random_state (Union[int, None], optional):
Seed to initialize the random number generator used for the dataset creation. If set to None, the seed will be the ‘microsecond’ component of datetime.datetime.now().
- population_name (string, optional):
Name assigned to the created DataFrame holding the population table. If set to a name already existing on the getML engine, the corresponding DataFrame will be overwritten. If set to an empty string, a unique name will be generated by concatenating snowflake_population_ and the seed of the random number generator.
- peripheral_name1 (string, optional):
Name assigned to the created DataFrame holding the first peripheral table. If set to a name already existing on the getML engine, the corresponding DataFrame will be overwritten. If set to an empty string, a unique name will be generated by concatenating snowflake_peripheral_1_ and the seed of the random number generator.
- peripheral_name2 (string, optional):
Name assigned to the created DataFrame holding the second peripheral table. If set to a name already existing on the getML engine, the corresponding DataFrame will be overwritten. If set to an empty string, a unique name will be generated by concatenating snowflake_peripheral_2_ and the seed of the random number generator.
- aggregation1 (string, optional):
One of the aggregations, used to generate the ‘target’ column in the first peripheral table (the outer aggregation in the SQL block above).
- aggregation2 (string, optional):
One of the aggregations, used to generate the ‘target’ column in the second peripheral table (the inner aggregation in the SQL block above).
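
The documented defaults for random_state and the three name arguments can be sketched as follows. This only mirrors the behaviour described in the argument list; the helper names resolve_seed and resolve_name are hypothetical and are not part of the getml API.

```python
import datetime
from typing import Optional


def resolve_seed(random_state: Optional[int] = None) -> int:
    """Return the given seed, or the current microsecond if none was given."""
    if random_state is None:
        return datetime.datetime.now().microsecond
    return random_state


def resolve_name(name: str, prefix: str, seed: int) -> str:
    """Return the given name, or prefix + seed if the name is empty."""
    return name if name else f"{prefix}{seed}"


seed = resolve_seed(42)
print(resolve_name("", "snowflake_population_", seed))
print(resolve_name("", "snowflake_peripheral_1_", seed))
print(resolve_name("my_peripheral_2", "snowflake_peripheral_2_", seed))
```

Because the microsecond component takes only a million distinct values, generated names can collide between calls. Passing an explicit random_state or explicit names avoids overwriting an existing DataFrame by accident.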
- Returns:
- tuple:
tuple containing:
- population (getml.DataFrame): Population table
- peripheral (getml.DataFrame): First peripheral table
- peripheral_2 (getml.DataFrame): Second peripheral table