group_by

DataFrame.group_by(key, name, aggregations)

Creates a new DataFrame by grouping over a key.

This function splits the DataFrame into groups that share the same value for key, applies an aggregation function to one or more columns in each group, and combines the results into a new DataFrame. The aggregation function is defined for each column individually, which allows applying different aggregations to different columns. In pandas this is known as named aggregation.
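The split, apply, and combine steps described above can be sketched in plain Python. This is only an illustration of the idea, not getML's implementation: the helper `group_by_sketch` and the `(column, function)` format of its `aggregations` argument are hypothetical and do not exist in the library.

```python
from collections import defaultdict
from statistics import mean

# Example rows, using the same fruit data as the Examples section.
rows = [
    {"join_key": "0", "fruit": "banana", "price": 2.4},
    {"join_key": "1", "fruit": "apple", "price": 3.0},
    {"join_key": "2", "fruit": "cherry", "price": 1.2},
    {"join_key": "2", "fruit": "cherry", "price": 1.4},
    {"join_key": "3", "fruit": "melon", "price": 3.4},
    {"join_key": "3", "fruit": "pineapple", "price": 3.4},
]


def group_by_sketch(rows, key, aggregations):
    """Hypothetical helper: aggregations maps alias -> (column, function)."""
    # Split: collect the rows that share the same value for `key`.
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)

    # Apply and combine: one output row per group, one column per alias.
    result = []
    for key_value, members in groups.items():
        out = {key: key_value}
        for alias, (column, func) in aggregations.items():
            out[alias] = func([member[column] for member in members])
        result.append(out)
    return result


grouped = group_by_sketch(
    rows,
    "join_key",
    {
        "avg price": ("price", mean),
        "total price": ("price", sum),
        "unique items": ("fruit", lambda values: len(set(values))),
    },
)
```

Each alias names one output column and is tied to exactly one input column and one function, which is what lets different columns receive different aggregations.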

Args:
key (str): Name of the key to group by. If the key is a join key, the group_by will be faster, because join keys already have an index, whereas all other columns need to have an index built for the group_by.

name (str): Name of the new DataFrame.

aggregations (List[Aggregation]):

Aggregations to apply to the groups.

Raises:

TypeError: If any of the input arguments is of the wrong type.

Returns:

DataFrame:

Handle to the newly generated DataFrame.

Examples:

Generate example data:

import getml

data = dict(
    fruit=["banana", "apple", "cherry", "cherry", "melon", "pineapple"],
    price=[2.4, 3.0, 1.2, 1.4, 3.4, 3.4],
    join_key=["0", "1", "2", "2", "3", "3"]
)
df = getml.data.DataFrame.from_dict(
    data,
    name="fruits",
    roles={"categorical": ["fruit"], "join_key": ["join_key"], "numerical": ["price"]}
)

df
| join_key | fruit       | price     |
| join key | categorical | numerical |
--------------------------------------
| 0        | banana      | 2.4       |
| 1        | apple       | 3         |
| 2        | cherry      | 1.2       |
| 2        | cherry      | 1.4       |
| 3        | melon       | 3.4       |
| 3        | pineapple   | 3.4       |

Group the DataFrame using join_key. Aggregate the resulting groups by averaging and summing over the price column and counting the distinct entries in the fruit column:

df_grouped = df.group_by("join_key", "fruits_grouped", [
    df["price"].avg(alias="avg price"),
    df["price"].sum(alias="total price"),
    df["fruit"].count_distinct(alias="unique items"),
])

df_grouped
| join_key | avg price | total price | unique items |
| join key | unused    | unused      | unused       |
-----------------------------------------------------
| 3        | 3.4       | 6.8         | 2            |
| 2        | 1.3       | 2.6         | 1            |
| 0        | 2.4       | 2.4         | 1            |
| 1        | 3         | 3           | 1            |
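For comparison, the pandas named aggregation mentioned in the description might look roughly like this on the same data. It is a sketch for orientation only. The output column names use underscores because pandas takes them as keyword arguments, and pandas sorts the groups by key by default, so the row order can differ from the table above.

```python
import pandas as pd

data = dict(
    fruit=["banana", "apple", "cherry", "cherry", "melon", "pineapple"],
    price=[2.4, 3.0, 1.2, 1.4, 3.4, 3.4],
    join_key=["0", "1", "2", "2", "3", "3"],
)
pdf = pd.DataFrame(data)

# Named aggregation: output_column=(input_column, aggregation)
pdf_grouped = pdf.groupby("join_key").agg(
    avg_price=("price", "mean"),
    total_price=("price", "sum"),
    unique_items=("fruit", "nunique"),
)

print(pdf_grouped)
```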