View
class getml.data.View(base, name=None, subselection=None, added=None, dropped=None)
A view is a lazily evaluated, immutable representation of a DataFrame.
There are important differences between a DataFrame and a view:
- Views are lazily evaluated. That means that views do not contain any data themselves. Instead, they just refer to an underlying data frame. If the underlying data frame changes, so will the view (but such behavior will result in a warning).
- Views are immutable. In-place operations on a view are not possible. Any operation on a view will result in a new view.
- Views have no direct representation on the getML engine and therefore they do not need to have an identifying name.
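The two properties above, laziness and immutability, can be illustrated with a small toy model. This is only a sketch in plain Python: `ToyFrame` and `ToyView` are hypothetical names invented for illustration and do not reflect how getML implements views internally.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


class ToyFrame:
    """Hypothetical stand-in for a data frame: a mutable dict of columns."""

    def __init__(self, columns: Dict[str, List[float]]) -> None:
        self.columns = columns


@dataclass(frozen=True)
class ToyView:
    """Hypothetical lazy view: a reference to its base plus a list of dropped columns."""

    base: ToyFrame
    dropped: Tuple[str, ...] = ()

    def drop(self, cols: List[str]) -> "ToyView":
        # Immutable: return a new view instead of modifying this one.
        return ToyView(self.base, self.dropped + tuple(cols))

    def colnames(self) -> List[str]:
        # Lazy: the column list is computed from the base at call time.
        return [c for c in self.base.columns if c not in self.dropped]


frame = ToyFrame({"col1": [1.0], "col2": [2.0]})
view = ToyView(frame)
smaller = view.drop(["col1"])

print(view.colnames())     # ['col1', 'col2'] (the original view is unchanged)
print(smaller.colnames())  # ['col2']

# A change to the base is visible through every view that refers to it.
frame.columns["col3"] = [3.0]
print(smaller.colnames())  # ['col2', 'col3']
```

Because the toy view stores no data of its own, the last call reflects the column that was added to the base after the view was created.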
- Args:
  - base (DataFrame or View): A data frame or view used as the basis for this view.
  - name (str): The name assigned to this view.
  - subselection (BooleanColumnView, FloatColumnView or FloatColumn): Indicates which rows we would like to keep.
  - added (dict): A dictionary that describes a new column that has been added to the view.
  - dropped (List[str]): A list of columns that have been dropped.
- Examples:
You hardly ever directly create views. Instead, it is more likely that you will encounter them as a result of some operation on a DataFrame:

    # Creates a view on the first 100 lines
    view1 = data_frame[:100]

    # Creates a view without some columns.
    view2 = data_frame.drop(["col1", "col2"])

    # Creates a view in which some roles are reassigned.
    view3 = data_frame.with_role(["col1", "col2"], getml.data.roles.categorical)

A recommended pattern is to assign 'baseline roles' to your data frames and then use views to tweak them:

    # Assign baseline roles
    data_frame.set_role(["jk"], getml.data.roles.join_key)
    data_frame.set_role(["col1", "col2"], getml.data.roles.categorical)
    data_frame.set_role(["col3", "col4"], getml.data.roles.numerical)
    data_frame.set_role(["col5"], getml.data.roles.target)

    # Make the data frame immutable, so in-place operations are
    # no longer possible.
    data_frame.freeze()

    # Save the data frame.
    data_frame.save()

    # I suspect that col1 leads to overfitting, so I will drop it.
    view = data_frame.drop(["col1"])

    # Insert the view into a container.
    container = getml.data.Container(...)
    container.add(some_alias=view)
    container.save()

The advantage of using such a pattern is that it enables you to always completely retrace your entire pipeline without creating deep copies of the data frames whenever you have made a small change like the one in our example. Note that the pipeline will record which Container you have used.
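To make the "baseline roles plus views" idea concrete, the following toy model layers role overrides on top of a shared base without copying any data. It is a sketch under stated assumptions: `BaselineFrame` and `RoleView` are hypothetical classes written for illustration, the role names are plain strings, and none of this is getML's actual implementation.

```python
from types import MappingProxyType
from typing import Dict, List, Mapping, Optional


class BaselineFrame:
    """Hypothetical frozen data frame holding data and baseline roles."""

    def __init__(self, data: Dict[str, List[float]], roles: Dict[str, str]) -> None:
        self.data = data
        # Read-only mapping, loosely mimicking a frozen data frame.
        self.roles: Mapping[str, str] = MappingProxyType(dict(roles))


class RoleView:
    """Hypothetical view: a reference to the base plus a few role overrides."""

    def __init__(self, base: BaselineFrame, overrides: Optional[Dict[str, str]] = None) -> None:
        self.base = base
        self.overrides = dict(overrides or {})

    def with_role(self, names: List[str], role: str) -> "RoleView":
        # Return a new view; neither this view nor the base is modified.
        new_overrides = {**self.overrides, **{name: role for name in names}}
        return RoleView(self.base, new_overrides)

    @property
    def roles(self) -> Dict[str, str]:
        # Baseline roles first, then the overrides recorded by the view.
        return {**self.base.roles, **self.overrides}


frame = BaselineFrame(
    data={"col1": [1.0, 2.0], "col2": [3.0, 4.0]},
    roles={"col1": "categorical", "col2": "numerical"},
)
tweaked = RoleView(frame).with_role(["col1"], "numerical")

print(tweaked.roles)          # {'col1': 'numerical', 'col2': 'numerical'}
print(dict(frame.roles))      # {'col1': 'categorical', 'col2': 'numerical'}
print(tweaked.base is frame)  # True: the view shares the base, no deep copy
```

In this toy model each tweak costs only a small dictionary of overrides, which is the property that makes it cheap to keep the baseline and every variation around.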
Methods
- check(): Checks whether the underlying data frame has been changed after the creation of the view.
- drop(cols): Returns a new View that has one or several columns removed.
- ncols(): Number of columns in the current instance.
- nrows([force]): Returns the number of rows in the current instance.
- refresh(): Aligns meta-information of the current instance with the corresponding data frame in the getML engine.
- to_arrow(): Creates a pyarrow.Table from the view.
- to_csv(fname[, quotechar, sep, batch_size]): Writes the underlying data into a newly created CSV file.
- to_db(table_name[, conn]): Writes the underlying data into a newly created table in the database.
- to_df(name): Creates a DataFrame from the view.
- to_json(): Creates a JSON string from the current instance.
- to_pandas(): Creates a pandas.DataFrame from the view.
- to_parquet(fname[, compression]): Writes the underlying data into a newly created parquet file.
- to_placeholder([name]): Generates a Placeholder from the current View.
- to_pyspark(spark[, name]): Creates a pyspark.sql.DataFrame from the current instance.
- to_s3(bucket, key, region[, sep, batch_size]): Writes the underlying data into a newly created CSV file located in an S3 bucket.
- where(index): Extracts a subset of rows.
- with_column(col, name[, role, unit, …]): Returns a new View that contains an additional column.
- with_name(name): Returns a new View with a new name.
- with_role(names, role[, time_formats]): Returns a new View with modified roles.
- with_subroles(names, subroles[, append]): Returns a new view with one or several new subroles on one or more columns.
- with_unit(names, unit[, comparison_only]): Returns a view that contains a new unit on one or more columns.
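The idea behind check(), detecting that the underlying data frame changed after the view was created, can be sketched with a toy model. The mechanism shown here is an assumption made for illustration: the view remembers a change stamp of its base at creation time and compares it later. `VersionedFrame`, `CheckedView` and `is_stale` are hypothetical names, the stamp is an integer counter rather than a time string, and getML's actual check may work differently.

```python
from typing import Dict, Iterable, List


class VersionedFrame:
    """Hypothetical data frame that bumps a change stamp on every modification."""

    def __init__(self) -> None:
        self.columns: Dict[str, List[float]] = {}
        self.last_change = 0  # integer stand-in for a change timestamp

    def add_column(self, name: str, values: Iterable[float]) -> None:
        self.columns[name] = list(values)
        self.last_change += 1


class CheckedView:
    """Hypothetical view that remembers the state of its base at creation."""

    def __init__(self, base: VersionedFrame) -> None:
        self.base = base
        self._seen = base.last_change

    def is_stale(self) -> bool:
        # True if the base was modified after this view was created.
        return self.base.last_change != self._seen


frame = VersionedFrame()
frame.add_column("col1", [1.0, 2.0, 3.0])

view = CheckedView(frame)
print(view.is_stale())  # False: nothing has changed since the view was created

frame.add_column("col2", [4.0, 5.0, 6.0])
print(view.is_stale())  # True: the base changed after the view was created
```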
Attributes
- added: The column that has been added to the view.
- base: The basis on which the view is created.
- colnames: List of the names of all columns.
- columns: Alias for colnames().
- dropped: The names of the columns that have been dropped.
- last_change: A string describing the last time this data frame has been changed.
- name: The name of the view.
- roles: The roles of the columns included in this View.
- rowid: The rowids for this view.
- shape: A tuple containing the number of rows and columns of the View.
- subselection: The subselection that is applied to this view.