View

class getml.data.View(base, name: Optional[str] = None, subselection: Optional[Union[BooleanColumnView, FloatColumn, FloatColumnView]] = None, added=None, dropped: Optional[List[str]] = None)

A view is a lazily evaluated, immutable representation of a DataFrame.

There are important differences between a DataFrame and a view:

  • Views are lazily evaluated. They do not contain any data themselves; they only refer to an underlying data frame. If the underlying data frame changes, so does the view (although such a change results in a warning).

  • Views are immutable. In-place operations on a view are not possible. Any operation on a view will result in a new view.

  • Views have no direct representation in the getML engine, so they do not need an identifying name.
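
The first two points can be illustrated with a small pure-Python model. This is only a sketch of the semantics described above, not getML's implementation: ToyFrame, ToyView, set_column, to_dict and the version counter are hypothetical names invented for the illustration.

```python
import warnings


class ToyFrame:
    """Hypothetical stand-in for a data frame: owns the data and a change counter."""

    def __init__(self, columns):
        self.columns = {name: list(values) for name, values in columns.items()}
        self.version = 0  # bumped on every in-place change

    def set_column(self, name, values):
        self.columns[name] = list(values)
        self.version += 1


class ToyView:
    """Hypothetical lazy, immutable view: stores a reference and a recipe, no data."""

    def __init__(self, base, dropped=()):
        self._base = base
        self._dropped = frozenset(dropped)
        self._version_at_creation = base.version

    def drop(self, cols):
        # Never modifies self; returns a new view with an extended recipe.
        return ToyView(self._base, self._dropped | set(cols))

    def to_dict(self):
        # Data is read from the base only now, when the view is materialized.
        if self._base.version != self._version_at_creation:
            warnings.warn(
                "The underlying data frame changed after the view was created.",
                stacklevel=2,
            )
        return {
            name: list(values)
            for name, values in self._base.columns.items()
            if name not in self._dropped
        }


frame = ToyFrame({"col1": [1, 2], "col2": [3, 4]})
view = ToyView(frame)
narrow = view.drop(["col1"])       # a new view; `view` itself is unchanged

frame.set_column("col2", [30, 40])  # in-place change on the base

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = narrow.to_dict()       # sees the new data and triggers a warning

print(result)
print(len(caught))
```

Because the toy view keeps only a reference to its base plus a set of dropped column names, creating it copies no data, and the change made to the base afterwards is visible the next time the view is materialized.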

Args:
base (DataFrame or View):

A data frame or view used as the basis for this view.

name (str):

The name assigned to this view.

subselection (BooleanColumnView, FloatColumnView or FloatColumn):

Indicates which rows we would like to keep.

added (dict):

A dictionary that describes a new column that has been added to the view.

dropped (List[str]):

A list of columns that have been dropped.
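
To make the roles of these arguments concrete, the following pure-Python sketch shows one way such a recipe could be resolved into data. It is an illustration under stated assumptions, not getML's implementation: ViewRecipe and materialize are hypothetical names, the shape of the added dictionary ({"name": ..., "values": ...}) is assumed, and so is the order in which the steps are applied (added, then dropped, then subselection).

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Union

Table = Dict[str, List[float]]


@dataclass(frozen=True)
class ViewRecipe:
    """Hypothetical recipe that mirrors the argument names above."""

    base: Union[Table, "ViewRecipe"]
    subselection: Optional[List[bool]] = None  # one flag per row of the base
    added: Optional[dict] = None  # assumed shape: {"name": str, "values": list}
    dropped: Optional[List[str]] = None


def materialize(view: Union[Table, ViewRecipe]) -> Table:
    """Resolve a chain of recipes into concrete columns."""
    if not isinstance(view, ViewRecipe):
        # A plain table: copy it so the caller's data is never modified.
        return {name: list(values) for name, values in view.items()}

    table = materialize(view.base)  # the base may itself be a recipe

    if view.added is not None:
        table[view.added["name"]] = list(view.added["values"])

    for name in view.dropped or []:
        table.pop(name, None)

    if view.subselection is not None:
        # Keep only the rows whose flag is True.
        table = {
            name: [v for v, keep in zip(values, view.subselection) if keep]
            for name, values in table.items()
        }

    return table


frame = {"col1": [1.0, 2.0, 3.0], "col2": [4.0, 5.0, 6.0]}
first = ViewRecipe(base=frame, dropped=["col1"])
second = ViewRecipe(
    base=first,
    subselection=[True, False, True],
    added={"name": "col3", "values": [7.0, 8.0, 9.0]},
)

print(materialize(second))
```

Since base accepts either a table or another recipe, recipes can be stacked, which mirrors the statement above that base may be a DataFrame or a View.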

Examples:

You will rarely create views directly. More often, you will encounter them as the result of an operation on a DataFrame:

import getml

# data_frame is assumed to be an existing getml.data.DataFrame.

# Creates a view on the first 100 rows.
view1 = data_frame[:100]

# Creates a view without some columns.
view2 = data_frame.drop(["col1", "col2"])

# Creates a view in which some roles are reassigned.
view3 = data_frame.with_role(["col1", "col2"], getml.data.roles.categorical)

A recommended pattern is to assign ‘baseline roles’ to your data frames and then use views to tweak them:

# Assign baseline roles
data_frame.set_role(["jk"], getml.data.roles.join_key)
data_frame.set_role(["col1", "col2"], getml.data.roles.categorical)
data_frame.set_role(["col3", "col4"], getml.data.roles.numerical)
data_frame.set_role(["col5"], getml.data.roles.target)

# Make the data frame immutable, so in-place operations are
# no longer possible.
data_frame.freeze()

# Save the data frame.
data_frame.save()

# I suspect that col1 leads to overfitting, so I will drop it.
view = data_frame.drop(["col1"])

# Insert the view into a container.
container = getml.data.Container(...)
container.add(some_alias=view)
container.save()

The advantage of this pattern is that you can always retrace your entire pipeline without creating a deep copy of the data frame for every small change, such as the one in this example. Note that the pipeline records which Container you used.

Methods

check()

Checks whether the underlying data frame has been changed after the creation of the view.

drop(cols)

Returns a new View that has one or several columns removed.

ncols()

Returns the number of columns in the current instance.

nrows([force])

Returns the number of rows in the current instance.

refresh()

Aligns meta-information of the current instance with the corresponding data frame in the getML engine.

to_arrow()

Creates a pyarrow.Table from the view.

to_csv(fname[, quotechar, sep, batch_size])

Writes the underlying data into a newly created CSV file.

to_db(table_name[, conn])

Writes the underlying data into a newly created table in the database.

to_df(name)

Creates a DataFrame from the view.

to_json()

Creates a JSON string from the current instance.

to_pandas()

Creates a pandas.DataFrame from the view.

to_parquet(fname[, compression])

Writes the underlying data into a newly created parquet file.

to_placeholder([name])

Generates a Placeholder from the current View.

to_pyspark(spark[, name])

Creates a pyspark.sql.DataFrame from the current instance.

to_s3(bucket, key, region[, sep, batch_size])

Writes the underlying data into a newly created CSV file located in an S3 bucket.

where(index)

Extracts a subset of rows.

with_column(col, name[, role, unit, ...])

Returns a new View that contains an additional column.

with_name(name)

Returns a new View with a new name.

with_role(names, role[, time_formats])

Returns a new View with modified roles.

with_subroles(names, subroles[, append])

Returns a new View with one or several new subroles on one or more columns.

with_unit(names, unit[, comparison_only])

Returns a new View that contains a new unit on one or more columns.

Attributes

added

The column that has been added to the view.

base

The basis on which the view is created.

colnames

List of the names of all columns.

columns

Alias for colnames.

dropped

The names of the columns that have been dropped.

last_change

A string describing the last time this data frame was changed.

name

The name of the view.

roles

The roles of the columns included in this View.

rowid

The rowids for this view.

shape

A tuple containing the number of rows and columns of the View.

subselection

The subselection that is applied to this view.