join

DataFrame.join(name, other, join_key, other_join_key=None, cols=None, other_cols=None, how='inner', where=None)

Create a new DataFrame by joining the current instance with another DataFrame.

Args:

name (str): The name of the new DataFrame.

other (DataFrame): The other DataFrame.

join_key (str):

Name of the column containing the join key in the current instance.

other_join_key (str, optional):

Name of the join key in the other DataFrame. If set to None, join_key will be used for both the current instance and other.

cols (List[Union[FloatColumn, StringColumn]], optional):

Columns in the current instance to be included in the resulting DataFrame. If set to None, all columns will be used.

other_cols (List[Union[FloatColumn, StringColumn]], optional):

Columns in other to be included in the resulting DataFrame. If set to None, all columns will be used.

how (str, optional):

Type of the join.

Supported options:

  • ‘left’

  • ‘inner’

  • ‘right’

where (VirtualBooleanColumn, optional):

Boolean column indicating which rows are to be included in the resulting DataFrame. If set to None, all rows will be used.

It imposes a SQL-like WHERE condition on the join.
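The how options and the join_key / other_join_key fallback can be illustrated without getML. The following is a minimal plain-Python sketch, assuming standard SQL join semantics; the helper join_pairs and the column names "id" and "ref" are hypothetical and not part of the getML API, and this section does not specify how getML represents unmatched rows.

```python
from typing import Dict, List, Optional, Tuple

Row = Dict[str, str]
Pair = Tuple[Optional[Row], Optional[Row]]


def join_pairs(
    left: List[Row],
    right: List[Row],
    join_key: str,
    other_join_key: Optional[str] = None,
    how: str = "inner",
) -> List[Pair]:
    """Return (left_row, right_row) pairs; a missing side is None.

    Hypothetical helper that mimics SQL join semantics on lists of dicts.
    """
    if how not in ("left", "inner", "right"):
        raise ValueError(f"unsupported join type: {how!r}")

    # As documented above: if other_join_key is None, join_key is used for both sides.
    if other_join_key is None:
        other_join_key = join_key

    pairs: List[Pair] = []
    matched_right = set()

    for l_row in left:
        hits = [
            (i, r_row)
            for i, r_row in enumerate(right)
            if r_row[other_join_key] == l_row[join_key]
        ]
        for i, r_row in hits:
            pairs.append((l_row, r_row))
            matched_right.add(i)
        # A left join keeps left rows that have no partner.
        if not hits and how == "left":
            pairs.append((l_row, None))

    # A right join keeps right rows that have no partner.
    if how == "right":
        for i, r_row in enumerate(right):
            if i not in matched_right:
                pairs.append((None, r_row))

    return pairs


left = [{"id": "0"}, {"id": "1"}, {"id": "5"}]
right = [{"ref": "0"}, {"ref": "1"}, {"ref": "1"}, {"ref": "4"}]

inner = join_pairs(left, right, "id", "ref", how="inner")
left_join = join_pairs(left, right, "id", "ref", how="left")
right_join = join_pairs(left, right, "id", "ref", how="right")
```

In this sketch the inner join keeps only key-matched pairs, the left join additionally keeps the left row with key "5", and the right join additionally keeps the right row with key "4".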

Raises:

TypeError: If any of the input arguments is of the wrong type.

Returns:

DataFrame:

Handler of the newly created DataFrame object.

Examples:

Create DataFrame

import getml

data_df = dict(
    colors=["blue", "green", "yellow", "orange"],
    numbers=[2.4, 3.0, 1.2, 1.4],
    join_key=["0", "1", "2", "3"]
)

df = getml.data.DataFrame.from_dict(
    data_df, name="df_1",
    roles=dict(join_key=["join_key"], numerical=["numbers"], categorical=["colors"]))

df
| join_key | colors      | numbers   |
| join key | categorical | numerical |
--------------------------------------
| 0        | blue        | 2.4       |
| 1        | green       | 3         |
| 2        | yellow      | 1.2       |
| 3        | orange      | 1.4       |

Create the other DataFrame

data_other = dict(
    colors=["blue", "green", "yellow", "black", "orange", "white"],
    numbers=[2.4, 3.0, 1.2, 1.4, 3.4, 2.2],
    join_key=["0", "1", "2", "2", "3", "4"])

other = getml.data.DataFrame.from_dict(
    data_other, name="df_2",
    roles=dict(join_key=["join_key"], numerical=["numbers"], categorical=["colors"]))

other
| join_key | colors      | numbers   |
| join key | categorical | numerical |
--------------------------------------
| 0        | blue        | 2.4       |
| 1        | green       | 3         |
| 2        | yellow      | 1.2       |
| 2        | black       | 1.4       |
| 3        | orange      | 3.4       |
| 4        | white       | 2.2       |

Left join the two DataFrames on their join key, keeping the columns ‘colors’ and ‘numbers’ from the first one and the column ‘colors’ as ‘other_color’ from the second one. As an additional condition, only rows in which the two ‘numbers’ columns are equal are selected.

joined_df = df.join(
    name="joined_df",
    other=other,
    how="left",
    join_key="join_key",
    cols=[df["colors"], df["numbers"]],
    other_cols=[other["colors"].alias("other_color")],
    where=(df["numbers"] == other["numbers"]))

joined_df
| colors      | other_color | numbers   |
| categorical | categorical | numerical |
-----------------------------------------
| blue        | blue        | 2.4       |
| green       | green       | 3         |
| yellow      | yellow      | 1.2       |
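
The three-row result can be traced by hand. The following plain-Python sketch does not use getML; it assumes that the where condition is applied to the key-matched row pairs, like a SQL WHERE clause, which is consistent with the output shown. The tuples simply restate the example data as (join_key, colors, numbers).

```python
# Example data restated as (join_key, colors, numbers) tuples.
df_rows = [
    ("0", "blue", 2.4),
    ("1", "green", 3.0),
    ("2", "yellow", 1.2),
    ("3", "orange", 1.4),
]
other_rows = [
    ("0", "blue", 2.4),
    ("1", "green", 3.0),
    ("2", "yellow", 1.2),
    ("2", "black", 1.4),
    ("3", "orange", 3.4),
    ("4", "white", 2.2),
]

# Step 1: pair rows whose join keys are equal. Every key in df_rows has a
# partner, so a left join and an inner join yield the same pairs here.
pairs = [(l, r) for l in df_rows for r in other_rows if l[0] == r[0]]

# Step 2: apply the WHERE-like condition (numbers equal on both sides) and
# keep colors, other_color, and numbers, as in the example.
kept = [(l[1], r[1], l[2]) for l, r in pairs if l[2] == r[2]]

for row in kept:
    print(row)
```

Under this assumption, five pairs match on the key, and two are filtered out: yellow/black (1.2 vs. 1.4) and orange/orange (1.4 vs. 3.4). That is why ‘orange’ is absent from the result even though how="left".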