getml.database.sniff_s3(name, bucket, keys, region, num_lines_sniffed=1000, sep=',', skip=0, colnames=None, conn=None)

Sniffs a list of CSV files located in an S3 bucket.


name (str): Name of the table in which the data is to be inserted.

bucket (str): The bucket from which to read the files.

keys (List[str]): The list of keys (files in the bucket) to be read.

region (str): The region in which the bucket is located.

num_lines_sniffed (int, optional): Number of lines analysed by the sniffer.

sep (str, optional): The character used for separating fields.

skip (int, optional): Number of lines to skip at the beginning of each file.

colnames (List[str] or None, optional): The first line of a CSV file usually contains the column names. When this is not the case, you need to explicitly pass them.

conn (Connection, optional): The database connection to be used. If you don’t explicitly pass a connection, the engine will use the default connection.


str: Appropriate CREATE TABLE statement.


Let’s assume you have two CSV files, file1.csv and file2.csv, in the bucket. You can sniff them and obtain an appropriate CREATE TABLE statement using the following commands:

>>> import getml
>>> getml.engine.set_s3_access_key_id("YOUR-ACCESS-KEY-ID")
>>> getml.engine.set_s3_secret_access_key("YOUR-SECRET-ACCESS-KEY")
>>> stmt = getml.database.sniff_s3(
...         bucket="your-bucket-name",
...         keys=["file1.csv", "file2.csv"],
...         region="us-east-2",
...         name="MY_TABLE",
...         sep=';'
... )

You can also set the access credentials as environment variables before you launch the getML engine.
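To make the roles of num_lines_sniffed, sep, skip and colnames more concrete, here is a minimal sketch of what sniffing a CSV file might involve. It is not the getML engine's implementation: the helper sniff_csv_text is hypothetical, it works on an in-memory string instead of an S3 bucket, and the type-inference rules and SQL type names are assumptions chosen for illustration.

```python
import csv


def sniff_csv_text(name, text, num_lines_sniffed=1000, sep=",", skip=0, colnames=None):
    """Hypothetical helper: guess column types from CSV text and
    return a CREATE TABLE statement. Not part of the getML API."""
    # Drop the first `skip` lines, then parse the rest with the given separator.
    lines = text.splitlines()[skip:]
    rows = list(csv.reader(lines, delimiter=sep))

    # Without explicit column names, treat the first remaining line as the header.
    if colnames is None:
        colnames, rows = rows[0], rows[1:]

    # Only the first `num_lines_sniffed` data rows are inspected.
    rows = rows[:num_lines_sniffed]

    def infer(values):
        # Assumed rule: try the narrowest type first, fall back to TEXT.
        if not values:
            return "TEXT"
        for sql_type, cast in (("INTEGER", int), ("DOUBLE PRECISION", float)):
            try:
                for value in values:
                    cast(value)
                return sql_type
            except ValueError:
                continue
        return "TEXT"

    columns = []
    for i, col in enumerate(colnames):
        # Ignore empty cells and rows that are too short for this column.
        values = [row[i] for row in rows if i < len(row) and row[i] != ""]
        columns.append(f'    "{col}" {infer(values)}')

    return f'CREATE TABLE "{name}" (\n' + ",\n".join(columns) + "\n);"


sample = "id;price;city\n1;2.5;Berlin\n2;3;Paris\n"
stmt = sniff_csv_text("MY_TABLE", sample, sep=";")
print(stmt)
```

In this sketch a column is typed as an integer only if every sniffed value parses as one, which is why inspecting more lines (a larger num_lines_sniffed) tends to give a more reliable guess at the cost of reading more data.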