from_csv

classmethod DataFrame.from_csv(fnames, name, num_lines_sniffed=1000, num_lines_read=0, quotechar='"', sep=',', skip=0, colnames=None, roles=None, ignore=False, dry=False)

Create a DataFrame from CSV files.

The fastest way to import data into the getML engine is to read it directly from CSV files. This method constructs a data frame object in the engine, fills it with the data read from the CSV file(s), and returns a corresponding DataFrame handle.

Args:

fnames (List[str]): CSV file paths to be read.

name (str): Name of the data frame to be created.

num_lines_sniffed (int, optional): Number of lines analyzed by the sniffer.

num_lines_read (int, optional): Number of lines read from each file. Set to 0 to read in the entire file.

quotechar (str, optional): The character used to wrap strings.

sep (str, optional): The separator used for separating fields.

skip (int, optional): Number of lines to skip at the beginning of each file.

colnames (List[str] or None, optional): The first line of a CSV file usually contains the column names. When this is not the case, you need to pass them explicitly.

roles (dict[str, List[str]], optional): A dictionary mapping the roles to the column names. If this is not passed, the roles will be sniffed from the CSV files. The roles dictionary should be in the following format:

>>> roles = {"role1": ["colname1", "colname2"], "role2": ["colname3"]}

ignore (bool, optional): Only relevant when roles is not None. Determines what happens to any colnames not mentioned in roles: they are either ignored (True) or read in as unused columns (False).

dry (bool, optional): If set to True, the data will not actually be read. Instead, the method will only return the roles it would have used. This can be used to hard-code roles when setting up a pipeline.
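To make the interplay of roles and ignore more concrete, the following sketch partitions a list of column names as the descriptions above suggest. The helper partition_columns is hypothetical and not part of getML; it only models the documented behavior in plain Python, and the engine may differ in detail.

```python
from typing import Dict, List, Tuple


def partition_columns(
    colnames: List[str],
    roles: Dict[str, List[str]],
    ignore: bool = False,
) -> Tuple[List[str], List[str]]:
    """Hypothetical helper (not part of getML).

    Returns (unused, dropped): the columns that would be kept as unused
    columns and the columns that would be ignored entirely.
    """
    # Every column that appears under some role is considered assigned.
    assigned = {name for names in roles.values() for name in names}
    # Columns not mentioned in roles, in their original order.
    leftover = [name for name in colnames if name not in assigned]
    if ignore:
        return [], leftover
    return leftover, []


colnames = ["colname1", "colname2", "colname3", "colname4"]
roles = {"categorical": ["colname1", "colname2"], "target": ["colname3"]}

unused_kept, dropped_kept = partition_columns(colnames, roles, ignore=False)
unused_ign, dropped_ign = partition_columns(colnames, roles, ignore=True)
```

With ignore=False, colname4 ends up in the list of unused columns; with ignore=True, it is dropped.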

Raises:

TypeError: If any of the input arguments is of a wrong type.

ValueError: If one of the keys provided in roles does not match a role defined in getml.data.roles.

Returns:

DataFrame: Handle to the underlying data.

Note:

Unless colnames is passed, it is assumed that the first line of each CSV file contains a header with the column names.
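Before importing, it can help to confirm that a file really starts with a header and that sep and quotechar are chosen correctly. The sketch below does this with Python's standard csv module; it is independent of getML, the helper name peek_csv is made up for illustration, and the sample data is invented.

```python
import csv
import io
from typing import Iterable, List, Tuple


def peek_csv(
    lines: Iterable[str], sep: str = ",", quotechar: str = '"'
) -> Tuple[List[str], List[str]]:
    """Hypothetical helper: return the header and the first data row."""
    reader = csv.reader(lines, delimiter=sep, quotechar=quotechar)
    header = next(reader)     # first line, expected to hold column names
    first_row = next(reader)  # first line of actual data
    return header, first_row


# In-memory stand-in for a file such as file1.csv (invented sample data).
sample = io.StringIO('colname1;colname2;colname3\n"a;b";x;1.0\n')
header, first_row = peek_csv(sample, sep=";", quotechar='"')
print(header)
print(first_row)
```

The same helper accepts an open file object, for example one obtained from open("file1.csv", newline=""). If the header comes back as a single long string, the separator is probably wrong.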

Examples:

Suppose you have two CSV files, file1.csv and file2.csv, in the current working directory. You can import their data into the getML engine using:

>>> df_expd = data.DataFrame.from_csv(
...     fnames=["file1.csv", "file2.csv"],
...     name="MY DATA FRAME",
...     sep=';',
...     quotechar='"'
... )

However, the CSV format lacks type safety. If you want to build a reliable pipeline, it is a good idea to hard-code the roles:

>>> roles = {"categorical": ["colname1", "colname2"], "target": ["colname3"]}
>>>
>>> df_expd = data.DataFrame.from_csv(
...     fnames=["file1.csv", "file2.csv"],
...     name="MY DATA FRAME",
...     sep=';',
...     quotechar='"',
...     roles=roles
... )

If you think that typing out all of the roles by hand is too cumbersome, you can use a dry run:

>>> roles = data.DataFrame.from_csv(
...     fnames=["file1.csv", "file2.csv"],
...     name="MY DATA FRAME",
...     sep=';',
...     quotechar='"',
...     dry=True
... )

This returns the roles dictionary that would have been used, which you can then hard-code in your pipeline.
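One way to turn the result of a dry run into something you can paste into a script is to format it as a Python literal. This is a minimal sketch that assumes the returned object behaves like a plain dictionary of lists of strings; the dictionary below is a stand-in built from the placeholder column names used on this page, not actual dry-run output.

```python
from pprint import pformat

# Stand-in for the dictionary returned by the dry run (placeholder names).
roles = {"categorical": ["colname1", "colname2"], "target": ["colname3"]}

# Build an assignment statement that can be copied into pipeline code.
snippet = "roles = " + pformat(roles, width=60)
print(snippet)
```

The printed assignment can be copied into your pipeline script and passed as roles=roles on subsequent calls.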