from_csv

classmethod DataFrame.from_csv(fnames, name, num_lines_sniffed=1000, num_lines_read=0, quotechar='"', sep=',', skip=0, colnames=None, roles=None, ignore=False, dry=False, verbose=True) → DataFrame

Create a DataFrame from CSV files.

The getML engine constructs a data frame object, fills it with the data read from the CSV file(s), and returns a corresponding DataFrame handle.

Args:

fnames (List[str]):

CSV file paths to be read.

name (str):

Name of the data frame to be created.

num_lines_sniffed (int, optional):

Number of lines analyzed by the sniffer.

num_lines_read (int, optional):

Number of lines read from each file. Set to 0 to read in the entire file.

quotechar (str, optional):

The character used to wrap strings.

sep (str, optional):

The separator used for separating fields.

skip (int, optional):

Number of lines to skip at the beginning of each file.

colnames (List[str] or None, optional):

The first line of a CSV file usually contains the column names. When this is not the case, you need to explicitly pass them.

roles (dict[str, List[str]] or Roles, optional):

Maps the roles to the column names (see colnames()).

The roles dictionary is expected to have the following format:

roles = {getml.data.role.numeric: ["colname1", "colname2"],
         getml.data.role.target: ["colname3"]}

Alternatively, you can use the Roles class.

ignore (bool, optional):

Only relevant when roles is not None. Determines what happens to any columns not mentioned in roles: they are either ignored (True) or read in as unused columns (False).

dry (bool, optional):

If set to True, then the data will not actually be read. Instead, the method will only return the roles it would have used. This can be used to hard-code roles when setting up a pipeline.

verbose (bool, optional):

If True and fnames are URLs, the file names are printed to stdout during the download.

Returns:

DataFrame:

Handle to the underlying data.

Note:

Unless colnames is passed, it is assumed that the first line of each CSV file contains a header with the column names.
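Before importing, it can help to check what the first line of a file actually contains, so that you know whether colnames is needed. The following is a minimal sketch that uses only the standard library csv module. The helper peek_first_row and the temporary demo file are hypothetical and not part of getML; the sep and quotechar arguments merely mirror the parameters described above.

```python
import csv
import os
import tempfile


def peek_first_row(fname, sep=",", quotechar='"'):
    """Return the fields of the first line of a CSV file (hypothetical helper)."""
    with open(fname, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle, delimiter=sep, quotechar=quotechar)
        # An empty file yields an empty list instead of raising StopIteration.
        return next(reader, [])


# Self-contained demo on a throwaway file whose first line is a header.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "file1.csv")
    with open(demo_path, "w", newline="", encoding="utf-8") as handle:
        handle.write('col1;col2;col3\n"a";"b";1.5\n')
    first_row = peek_first_row(demo_path, sep=";")

print(first_row)
```

If the first row turns out to hold data rather than names, the column names would have to be supplied through colnames.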

Examples:

Let's assume you have two CSV files, file1.csv and file2.csv, in the current working directory. You can import their data into the getML engine using:

>>> from getml import data
>>> df_expd = data.DataFrame.from_csv(
...     fnames=["file1.csv", "file2.csv"],
...     name="MY DATA FRAME",
...     sep=';',
...     quotechar='"'
... )

However, the CSV format lacks type safety. If you want to build a reliable pipeline, it is a good idea to hard-code the roles:

>>> roles = {"categorical": ["col1", "col2"], "target": ["col3"]}
>>> df_expd = data.DataFrame.from_csv(
...     fnames=["file1.csv", "file2.csv"],
...     name="MY DATA FRAME",
...     sep=';',
...     quotechar='"',
...     roles=roles
... )
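When roles are hard-coded, the ignore argument decides what happens to columns that roles does not mention. The sketch below illustrates that documented behavior in plain Python. It is not getML's implementation: the helper split_columns, the extra column col4, and the result keys "assigned", "unused", and "dropped" are assumptions made purely for illustration.

```python
def split_columns(colnames, roles, ignore=False):
    """Illustrate the documented effect of `ignore` (not getML's actual code)."""
    # Collect every column name that appears under some role.
    mentioned = {name for names in roles.values() for name in names}
    leftover = [name for name in colnames if name not in mentioned]
    return {
        "assigned": [name for name in colnames if name in mentioned],
        # ignore=False keeps the leftover columns as unused columns.
        "unused": [] if ignore else leftover,
        # ignore=True leaves the leftover columns out entirely.
        "dropped": leftover if ignore else [],
    }


# Hypothetical header with one column that the roles do not mention.
colnames = ["col1", "col2", "col3", "col4"]
roles = {"categorical": ["col1", "col2"], "target": ["col3"]}

kept = split_columns(colnames, roles, ignore=False)
skipped = split_columns(colnames, roles, ignore=True)
print(kept)
print(skipped)
```

In this sketch, col4 ends up under "unused" with ignore=False and under "dropped" with ignore=True.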

If you think that typing out all of the roles by hand is too cumbersome, you can use a dry run:

>>> roles = data.DataFrame.from_csv(
...     fnames=["file1.csv", "file2.csv"],
...     name="MY DATA FRAME",
...     sep=';',
...     quotechar='"',
...     dry=True
... )

This returns the roles dictionary that would have been used. You can now hard-code it in your pipeline.
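One way to hard-code the result is to render it as a Python literal that can be pasted into your script. The following is a minimal sketch using the standard library pprint module. It assumes the dry run returned a plain dictionary of lists; the roles value below is a stand-in, since the real result depends on your files.

```python
import ast
import pprint

# Stand-in for the value returned by the dry run (assumed to be a plain dict).
roles = {"categorical": ["col1", "col2"], "target": ["col3"]}

# Render the dictionary as source code that can be pasted into a script.
literal = "roles = " + pprint.pformat(roles, width=60)
print(literal)

# Sanity check: parsing the rendered literal gives back the same dictionary.
round_tripped = ast.literal_eval(literal.split("=", 1)[1].strip())
```

If the returned object is not a plain dictionary, it would need to be converted to one before this approach applies.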