make_same_units_categorical
getml.datasets.make_same_units_categorical(n_rows_population: int = 500, n_rows_peripheral: int = 125000, random_state: Optional[int] = None, population_name: str = '', peripheral_name: str = '', aggregation: Literal[VARIATION COEFFICIENT, VAR, TREND, TIME SINCE LAST MINIMUM, TIME SINCE LAST MAXIMUM, TIME SINCE FIRST MINIMUM, TIME SINCE FIRST MAXIMUM, SUM, STDDEV, SKEW, Q99, Q95, Q90, Q75, Q25, Q10, Q5, Q1, NUM MIN, NUM MAX, MODE, MIN, MEDIAN, MAX, LAST, KURTOSIS, FIRST, EWMA_365D, EWMA_90D, EWMA_30D, EWMA_7D, EWMA_1D, EWMA_1H, EWMA_1M, EWMA_1S, COUNT MINUS COUNT DISTINCT, COUNT DISTINCT OVER COUNT, COUNT DISTINCT, COUNT BELOW MEAN, COUNT ABOVE MEAN, COUNT, AVG] = 'COUNT') → Tuple[getml.data.data_frame.DataFrame, getml.data.data_frame.DataFrame]
Generate a random dataset with categorical variables.
The dataset consists of a population table and one peripheral table.
The peripheral table has 3 columns:
column_01: random categorical variable between ‘0’ and ‘9’
join_key: random integer in the range from 0 to n_rows_population
time_stamp: random number between 0 and 1
The population table has 4 columns:
column_01: random categorical variable between ‘0’ and ‘9’
join_key: unique integer in the range from 0 to n_rows_population
time_stamp: random number between 0 and 1
targets: target variable. Defined as the number of matching entries in the peripheral table for which time_stamp_peripheral < time_stamp_population and the category in the peripheral table is not 1, 2 or 9
SELECT aggregation( column_02 )
FROM POPULATION_TABLE t1
LEFT JOIN PERIPHERAL_TABLE t2
ON t1.join_key = t2.join_key
WHERE (
   ( t1.column_01 == t2.column_01 )
) AND t2.time_stamp <= t1.time_stamp
GROUP BY t1.join_key,
         t1.time_stamp;
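To make the join logic in the query above concrete, here is a minimal standard-library sketch of how such a target could be computed for the default 'COUNT' aggregation. It is not getML's implementation: the helper names make_tables and add_count_targets are hypothetical, the upper bound of the join_key range is assumed to be exclusive, and the matching conditions follow the query (same join key, same category, peripheral time stamp not later than the population time stamp).

```python
import random


def make_tables(n_rows_population=5, n_rows_peripheral=50, seed=0):
    """Build toy population and peripheral tables as lists of dicts.

    Hypothetical helper for illustration only; getML returns DataFrames.
    """
    rng = random.Random(seed)
    peripheral = [
        {
            "column_01": str(rng.randint(0, 9)),  # category '0'..'9'
            # Assumption: join keys lie in [0, n_rows_population).
            "join_key": rng.randrange(n_rows_population),
            "time_stamp": rng.random(),  # number between 0 and 1
        }
        for _ in range(n_rows_peripheral)
    ]
    population = [
        {
            "column_01": str(rng.randint(0, 9)),
            "join_key": key,  # unique per population row
            "time_stamp": rng.random(),
        }
        for key in range(n_rows_population)
    ]
    return population, peripheral


def add_count_targets(population, peripheral):
    """Add a 'targets' entry: COUNT of peripheral rows matching the query."""
    for row in population:
        row["targets"] = sum(
            1
            for p in peripheral
            if p["join_key"] == row["join_key"]
            and p["column_01"] == row["column_01"]
            and p["time_stamp"] <= row["time_stamp"]
        )
    return population


population, peripheral = make_tables()
population = add_count_targets(population, peripheral)
```

Other aggregations would replace the sum over matching rows with the corresponding statistic; the filtering step stays the same.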
- Args:
- n_rows_population (int, optional):
Number of rows in the population table.
- n_rows_peripheral (int, optional):
Number of rows in the peripheral table.
- random_state (Optional[int], optional):
Seed to initialize the random number generator used for the dataset creation. If set to None, the seed will be the ‘microsecond’ component of datetime.datetime.now().
- population_name (string, optional):
Name assigned to the created DataFrame holding the population table. If set to a name already existing on the getML engine, the corresponding DataFrame will be overwritten. If set to an empty string, a unique name will be generated by concatenating make_same_units_categorical_population_ and the seed of the random number generator.
- peripheral_name (string, optional):
Name assigned to the created DataFrame holding the peripheral table. If set to a name already existing on the getML engine, the corresponding DataFrame will be overwritten. If set to an empty string, a unique name will be generated by concatenating make_same_units_categorical_peripheral_ and the seed of the random number generator.
- aggregation (string, optional):
Aggregation used to generate the ‘targets’ column.
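The defaults for random_state and the table names described in the Args can be sketched as follows. The helpers resolve_seed and resolve_name are hypothetical and exist only for this illustration; they mirror the documented behaviour and are not part of the getML API.

```python
import datetime
from typing import Optional


def resolve_seed(random_state: Optional[int] = None) -> int:
    """Return the given seed, or the microsecond component of the current time."""
    if random_state is None:
        return datetime.datetime.now().microsecond
    return random_state


def resolve_name(name: str, prefix: str, seed: int) -> str:
    """Return the given name, or prefix + seed if the name is an empty string."""
    return name if name else f"{prefix}{seed}"


seed = resolve_seed(42)
population_name = resolve_name("", "make_same_units_categorical_population_", seed)
peripheral_name = resolve_name("", "make_same_units_categorical_peripheral_", seed)
```

Because the fallback seed has only microsecond resolution, passing an explicit random_state is the simplest way to get reproducible tables and predictable names.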
- Returns:
- tuple:
tuple containing:
population (getml.DataFrame): Population table
peripheral (getml.DataFrame): Peripheral table