from_s3

classmethod DataFrame.from_s3(bucket, keys, region, name, num_lines_sniffed=1000, num_lines_read=0, sep=',', skip=0, colnames=None, roles=None, ignore=False, dry=False)

Create a DataFrame from CSV files located in an S3 bucket.

Note: S3 is not supported on Windows.

This classmethod will construct a data frame object in the engine, fill it with the data read from the CSV file(s), and return a corresponding DataFrame handle.

Args:
bucket (str):

The bucket from which to read the files.

keys (List[str]):

The list of keys (files in the bucket) to be read.

region (str):

The region in which the bucket is located.

name (str):

Name of the data frame to be created.

num_lines_sniffed (int, optional):

Number of lines analyzed by the sniffer when inferring the column types.

num_lines_read (int, optional):

Number of lines read from each file. Set to 0 to read in the entire file.

sep (str, optional):

The separator used for separating fields.

skip (int, optional):

Number of lines to skip at the beginning of each file.

colnames (List[str] or None, optional):

The first line of a CSV file usually contains the column names. If that is not the case, you need to pass them explicitly.

roles (Dict[str, List[str]] or Roles, optional):

Maps the roles to the column names (see colnames()).

The roles dictionary is expected to have the following format:

roles = {getml.data.role.numeric: ["colname1", "colname2"],
         getml.data.role.target: ["colname3"]}

Alternatively, you can pass a Roles object (a minimal sketch follows this parameter list).

ignore (bool, optional):

Only relevant when roles is not None. Determines what happens to any columns not mentioned in roles: if True, they are ignored; if False, they are read in as unused columns.

dry (bool, optional):

If set to True, the data will not actually be read. Instead, the method only returns the roles it would have used. This can be used to hard-code roles when setting up a pipeline (see the sketch at the end of this section).
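As a minimal sketch of the Roles alternative mentioned under roles, assuming that the Roles constructor accepts the role names as keyword arguments (check the API reference of your getML version):

roles = getml.data.Roles(
    numeric=["colname1", "colname2"],
    target=["colname3"]
)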

Returns:

DataFrame:

Handle to the underlying data.

Example:

Let’s assume you have two CSV files - file1.csv and file2.csv - in the bucket. You can import their data into the getML engine using the following commands:

>>> getml.engine.set_s3_access_key_id("YOUR-ACCESS-KEY-ID")
>>>
>>> getml.engine.set_s3_secret_access_key("YOUR-SECRET-ACCESS-KEY")
>>>
>>> data_frame_expd = data.DataFrame.from_s3(
...     bucket="your-bucket-name",
...     keys=["file1.csv", "file2.csv"],
...     region="us-east-2",
...     name="MY DATA FRAME",
...     sep=';'
... )

You can also set the access credentials as environment variables before you launch the getML engine.

Also refer to the documentation on from_csv() for further information on overriding the CSV sniffer for greater type safety.
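For example, a sketch of the dry workflow described above, assuming that the roles returned by a dry run can be passed back unchanged via the roles argument:

>>> roles = data.DataFrame.from_s3(
...     bucket="your-bucket-name",
...     keys=["file1.csv", "file2.csv"],
...     region="us-east-2",
...     name="MY DATA FRAME",
...     sep=';',
...     dry=True
... )
>>>
>>> # Inspect and adjust the sniffed roles for greater type
>>> # safety, then hard-code the result in your pipeline:
>>> data_frame_expd = data.DataFrame.from_s3(
...     bucket="your-bucket-name",
...     keys=["file1.csv", "file2.csv"],
...     region="us-east-2",
...     name="MY DATA FRAME",
...     sep=';',
...     roles=roles
... )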