icon_data_frames Data frames

In this view, you can access information related to your imported DataFrames.

Top-level view

../../../_images/screenshot_data_frames.png

The top-level view of the ‘icon_data_frames Data Frames’ tab contains two plots and a table. The first plot compares the number of rows of the data frames. The second plot compares the memory they use. The ‘Data Frames’ table gives a general overview of all imported DataFrames.

In addition, four icons at the end of each row enable you to do the following:

  • icon_zoom_in Get more detailed information about a data frame via the data frame view.

  • icon_save Write the content of the corresponding data frame to disk.

  • icon_delete Remove the data frame from RAM. If you save its content to disk first, the data frame can be restored by loading it into memory again (see Lifecycle of a DataFrame).

  • icon_delete_forever Delete a data frame object from both RAM and disk (Warning: this step cannot be undone).

Note that the memory consumption statistics cover only the raw data. They do not include the indices that are automatically created on all join keys.

Data frame view

../../../_images/screenshot_data_frame_view.png

The data frame view can be accessed from the top-level view of the ‘icon_data_frames Data Frames’ tab by clicking either the name of a data frame or its icon_zoom_in icon.

The body of the table contains the raw content of the data frame. The table head contains additional information about each column. The icon_zoom_in icon redirects to the column view. Below the icon you can see the role and unit of the column.

Column view

../../../_images/screenshot_column_view.png

The column view can be accessed by clicking either the name of a column or the icon_zoom_in icon in the heading of a column in the data frame view. It contains information on the column:

  • The ‘Summary’ table, which displays various summary statistics of the column.

  • The density plot and the relation plot.

Density plot

The binning in the frequency plot behaves differently for numerical and categorical data. If a column’s role is numerical, time_stamp, target, or unused_float, it contains numerical data. If a column’s role is categorical, join_key, or unused_string, it contains categorical data.

For numerical data, the range between the minimum and maximum value is split into bins. The number of bins is set by the “number of bins” input in the top bar. Bins containing no values are not displayed in the plot.

The x-axis of the frequency plot represents the average of all values within the bin (as opposed to the minimum, midpoint, or maximum of the bin itself).

The y-axis of the frequency plot represents the share of values within the bin.

Thus, all resulting points are located on the normalized version of the empirical PDF (probability density function).
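The binning for numerical data described above can be sketched roughly as follows. This is only an illustration, not the monitor's actual implementation: the function name `numerical_density` is hypothetical, and the sketch assumes equal-width bins with the maximum value assigned to the last bin.

```python
def numerical_density(values, num_bins):
    """Return one (x, y) point per non-empty bin.

    x is the average of the values that fall into the bin,
    y is the share of all values that fall into the bin.
    """
    lo, hi = min(values), max(values)
    # Assumption: the range is split into bins of equal width.
    width = (hi - lo) / num_bins
    bins = [[] for _ in range(num_bins)]
    for v in values:
        # Assumption: the maximum value is placed in the last bin.
        idx = 0 if width == 0 else min(int((v - lo) / width), num_bins - 1)
        bins[idx].append(v)
    total = len(values)
    # Empty bins are skipped, so they never show up in the plot.
    return [(sum(b) / len(b), len(b) / total) for b in bins if b]


points = numerical_density([1.0, 2.0, 3.0, 10.0], num_bins=3)
```

In this example the middle bin is empty, so only two points are produced: one for the values 1, 2, and 3, and one for the value 10.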

For categorical data, each category is used as a bin (containing only a single value) if the number in “number of bins” is greater than or equal to the total number of unique categories. The resulting bins are then sorted according to the frequency of the corresponding category.

If the number of bins is smaller than the total number of categories, the categories are sorted by frequency and then distributed as evenly as possible into the bins.

The x-axis represents the categories or bins.

The y-axis represents the frequency.
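A minimal sketch of the categorical binning, under stated assumptions: categories are sorted by descending frequency, "as evenly as possible" is read as giving each bin nearly the same number of categories, and the frequency of a merged bin is the sum of its categories' frequencies. None of these details are confirmed by the text above, and the function name `categorical_bins` is hypothetical.

```python
from collections import Counter


def categorical_bins(values, num_bins):
    """Return a list of (categories, frequency) pairs, one per bin."""
    # Assumption: sort by frequency in descending order.
    counts = Counter(values).most_common()
    if num_bins >= len(counts):
        # Enough bins: every category gets its own bin.
        return [([cat], n) for cat, n in counts]
    # Assumption: spread the categories so that bin sizes differ by at most one.
    base, extra = divmod(len(counts), num_bins)
    bins, start = [], 0
    for i in range(num_bins):
        size = base + (1 if i < extra else 0)
        chunk = counts[start:start + size]
        bins.append(([cat for cat, _ in chunk], sum(n for _, n in chunk)))
        start += size
    return bins


values = ["a"] * 4 + ["b"] * 3 + ["c"] * 2 + ["d"]
merged = categorical_bins(values, num_bins=2)
separate = categorical_bins(values, num_bins=10)
```

With two bins, the four categories are merged pairwise. With ten bins, each category keeps its own bin.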

Relation plot

You can use the relation plot to plot the current column against another column containing numerical data (numerical, time_stamp, target, or unused_float). The other column is selected in the drop-down menu in the “Settings” box.

For numerical columns, the curve is calculated as follows:

The binning is the same as in the previous section. The y-axis represents the average value of the other column for all values within the bins.
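As a rough illustration of the numerical case, the sketch below bins the current column and averages the other column within each bin. It assumes equal-width bins and assumes that the x-coordinate is the average of the current column within the bin, as in the density plot. The function name `numerical_relation` is hypothetical.

```python
def numerical_relation(xs, ys, num_bins):
    """Bin `xs` and return (mean of xs, mean of ys) per non-empty bin."""
    lo, hi = min(xs), max(xs)
    # Assumption: equal-width bins over the range of the current column.
    width = (hi - lo) / num_bins
    bins = [[] for _ in range(num_bins)]
    for x, y in zip(xs, ys):
        idx = 0 if width == 0 else min(int((x - lo) / width), num_bins - 1)
        bins[idx].append((x, y))
    # y-axis: average value of the other column for all rows in the bin.
    return [
        (sum(x for x, _ in b) / len(b), sum(y for _, y in b) / len(b))
        for b in bins
        if b
    ]


curve = numerical_relation(
    [1.0, 2.0, 3.0, 10.0], [10.0, 20.0, 30.0, 100.0], num_bins=3
)
```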

For categorical columns, the curve is calculated as follows:

For each unique value, the algorithm calculates the average value of the other column. The values are then sorted by these averages. If the number of unique values is greater than the number of bins, the values are distributed as evenly as possible into the bins.

The plot displays the accumulated frequencies of the sorted values on its x-axis and the average values on the y-axis.

This gives you an indication of how well the categorical column separates the other column.
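The categorical case can be sketched in the same spirit. The assumptions here are that values are sorted by ascending average, that merged bins hold nearly equal numbers of unique values, and that the accumulated frequency is expressed as a share of all rows. The function name `categorical_relation` is hypothetical.

```python
from collections import defaultdict


def categorical_relation(cats, ys, num_bins):
    """Return (accumulated share of rows, average of ys) per value or bin."""
    groups = defaultdict(list)
    for c, y in zip(cats, ys):
        groups[c].append(y)
    # Sort the unique values by the average of the other column
    # (assumption: ascending order).
    ordered = sorted(groups.values(), key=lambda g: sum(g) / len(g))
    if len(ordered) > num_bins:
        # Assumption: merge neighbours so bin sizes differ by at most one value.
        base, extra = divmod(len(ordered), num_bins)
        merged, start = [], 0
        for i in range(num_bins):
            size = base + (1 if i < extra else 0)
            merged.append([y for g in ordered[start:start + size] for y in g])
            start += size
        ordered = merged
    total = len(ys)
    points, seen = [], 0
    for g in ordered:
        seen += len(g)  # accumulated frequency up to and including this bin
        points.append((seen / total, sum(g) / len(g)))
    return points


cats = ["a", "a", "b", "c"]
ys = [1.0, 3.0, 10.0, 0.0]
curve = categorical_relation(cats, ys, num_bins=5)
binned = categorical_relation(cats, ys, num_bins=2)
```

A curve that rises steeply from left to right suggests that the categories differ strongly in their average value of the other column. A flat curve suggests little separation.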