View

class getml.data.View(base, name: Optional[str] = None, subselection: Optional[Union[BooleanColumnView, FloatColumn, FloatColumnView]] = None, added=None, dropped: Optional[List[str]] = None)

A view is a lazily evaluated, immutable representation of a DataFrame.

There are important differences between a DataFrame and a view:

Views are lazily evaluated. A view contains no data of its own; it only refers to an underlying data frame. If the underlying data frame changes, so does the view, although such a change results in a warning.

Views are immutable. In-place operations on a view are not possible. Any operation on a view will result in a new view.

Views have no direct representation on the getML engine and therefore they do not need to have an identifying name.
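
The first two differences can be illustrated with a small, self-contained sketch. It does not use getML and does not reflect how getML implements views: `ToyFrame` and `ToyView` are hypothetical stand-ins written only to show what "lazily evaluated" and "immutable" mean in practice.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


class ToyFrame:
    """Hypothetical stand-in for a data frame: it owns the column data."""

    def __init__(self, columns: Dict[str, List[float]]):
        self.columns = columns


@dataclass(frozen=True)
class ToyView:
    """Hypothetical view: a reference to a base plus the names of dropped columns."""

    base: ToyFrame
    dropped: Tuple[str, ...] = ()

    def drop(self, cols: List[str]) -> "ToyView":
        # Immutability: return a new view instead of modifying this one.
        return ToyView(self.base, self.dropped + tuple(cols))

    def colnames(self) -> List[str]:
        # Lazy evaluation: the names are read from the base on every call.
        return [c for c in self.base.columns if c not in self.dropped]


frame = ToyFrame({"col1": [1.0, 2.0], "col2": [3.0, 4.0]})
view = ToyView(frame)
smaller = view.drop(["col1"])

print(view.colnames())     # the original view still lists col1
print(smaller.colnames())  # the new view hides col1

# Neither view holds data, so a change to the base is visible through both.
frame.columns["col3"] = [5.0, 6.0]
print(smaller.colnames())
```

Because `ToyView` is a frozen dataclass, assigning to `view.dropped` raises an error; the only way to obtain a different selection is to derive a new view, which mirrors the behaviour described above.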

Args:

base (DataFrame or View): A data frame or view used as the basis for this view.
name (str): The name assigned to this view.
subselection (BooleanColumnView, FloatColumnView or FloatColumn): Indicates which rows we would like to keep.
added (dict): A dictionary that describes a new column that has been added to the view.
dropped (List[str]): A list of columns that have been dropped.

Examples:

You will rarely create views directly. Instead, you are more likely to encounter them as the result of some operation on a DataFrame:

# Creates a view on the first 100 rows.
view1 = data_frame[:100]

# Creates a view without some columns.
view2 = data_frame.drop(["col1", "col2"])

# Creates a view in which some roles are reassigned.
view3 = data_frame.with_role(["col1", "col2"], getml.data.roles.categorical)

A recommended pattern is to assign ‘baseline roles’ to your data frames and then use views to tweak them:

# Assign baseline roles
data_frame.set_role(["jk"], getml.data.roles.join_key)
data_frame.set_role(["col1", "col2"], getml.data.roles.categorical)
data_frame.set_role(["col3", "col4"], getml.data.roles.numerical)
data_frame.set_role(["col5"], getml.data.roles.target)

# Make the data frame immutable, so in-place operations are
# no longer possible.
data_frame.freeze()

# Save the data frame.
data_frame.save()

# I suspect that col1 leads to overfitting, so I will drop it.
view = data_frame.drop(["col1"])

# Insert the view into a container.
container = getml.data.Container(...)
container.add(some_alias=view)
container.save()

The advantage of this pattern is that you can always retrace your entire pipeline without creating a deep copy of a data frame for every small change, such as the one in this example. Note that the pipeline will record which Container you have used.

Methods

`check`()	Checks whether the underlying data frame has been changed after the creation of the view.
`drop`(cols)	Returns a new `View` that has one or several columns removed.
`ncols`()	Number of columns in the current instance.
`nrows`([force])	Returns the number of rows in the current instance.
`refresh`()	Aligns meta-information of the current instance with the corresponding data frame in the getML engine.
`to_arrow`()	Creates a `pyarrow.Table` from the view.
`to_csv`(fname[, quotechar, sep, batch_size])	Writes the underlying data into a newly created CSV file.
`to_db`(table_name[, conn])	Writes the underlying data into a newly created table in the database.
`to_df`(name)	Creates a `DataFrame` from the view.
`to_json`()	Creates a JSON string from the current instance.
`to_pandas`()	Creates a `pandas.DataFrame` from the view.
`to_parquet`(fname[, compression])	Writes the underlying data into a newly created parquet file.
`to_placeholder`([name])	Generates a `Placeholder` from the current `View`.
`to_pyspark`(spark[, name])	Creates a `pyspark.sql.DataFrame` from the current instance.
`to_s3`(bucket, key, region[, sep, batch_size])	Writes the underlying data into a newly created CSV file located in an S3 bucket.
`where`(index)	Extract a subset of rows.
`with_column`(col, name[, role, unit, ...])	Returns a new `View` that contains an additional column.
`with_name`(name)	Returns a new `View` with a new name.
`with_role`(names, role[, time_formats])	Returns a new `View` with modified roles.
`with_subroles`(names, subroles[, append])	Returns a new view with one or several new subroles on one or more columns.
`with_unit`(names, unit[, comparison_only])	Returns a view that contains a new unit on one or more columns.
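
Since `drop` and `with_role` each return a new view, the calls can be chained. The following is a minimal sketch under that assumption: `tweak_roles` is a hypothetical helper, not part of getML, and it relies only on the `drop(cols)` and `with_role(names, role)` signatures listed in the table above.

```python
from typing import Any, List


def tweak_roles(base: Any, drop_cols: List[str], role_cols: List[str], role: Any) -> Any:
    """Hypothetical helper: derive a new view from `base` without modifying it.

    `base` is assumed to be a DataFrame or View, i.e. any object that offers
    `drop(cols)` and `with_role(names, role)` and returns a new view from each.
    """
    # Each call returns a new view, so the result of one feeds the next.
    return base.drop(drop_cols).with_role(role_cols, role)
```

With a real data frame this might be called as `tweak_roles(data_frame, ["col1"], ["col2"], getml.data.roles.categorical)`; the column names are placeholders.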

Attributes

`added`	The column that has been added to the view.
`base`	The basis on which the view is created.
`colnames`	List of the names of all columns.
`columns`	Alias for `colnames`.
`dropped`	The names of the columns that have been dropped.
`last_change`	A string describing the last time this data frame has been changed.
`name`	The name of the view.
`roles`	The roles of the columns included in this View.
`rowid`	The rowids for this view.
`shape`	A tuple containing the number of rows and columns of the View.
`subselection`	The subselection that is applied to this view.