read_s3

DataFrame.read_s3(bucket, keys, region, append=False, sep=',', num_lines_read=0, skip=0, colnames=None, time_formats=None)

Read CSV files from an S3 bucket.

Note that S3 is not supported on Windows.

It is assumed that the first line of each CSV file contains a header with the column names.

Args:
bucket (str):

The bucket from which to read the files.

keys (List[str]):

The list of keys (files in the bucket) to be read.

region (str):

The region in which the bucket is located.

append (bool, optional):

If a data frame object with the same name is already present in the getML engine, should the content of the CSV files in keys be appended to the existing data or replace it?

sep (str, optional):

The separator used for separating fields.

num_lines_read (int, optional):

Number of lines read from each file. Set to 0 to read in the entire file.

skip (int, optional):

Number of lines to skip at the beginning of each file.

colnames (List[str] or None, optional):

The first line of a CSV file usually contains the column names. When this is not the case, you need to pass the column names explicitly.

time_formats (List[str], optional):

The list of formats tried when parsing time stamps.

The formats are allowed to contain the following special characters:

  • %w - abbreviated weekday (Mon, Tue, …)

  • %W - full weekday (Monday, Tuesday, …)

  • %b - abbreviated month (Jan, Feb, …)

  • %B - full month (January, February, …)

  • %d - zero-padded day of month (01 .. 31)

  • %e - day of month (1 .. 31)

  • %f - space-padded day of month ( 1 .. 31)

  • %m - zero-padded month (01 .. 12)

  • %n - month (1 .. 12)

  • %o - space-padded month ( 1 .. 12)

  • %y - year without century (70)

  • %Y - year with century (1970)

  • %H - hour (00 .. 23)

  • %h - hour (00 .. 12)

  • %a - am/pm

  • %A - AM/PM

  • %M - minute (00 .. 59)

  • %S - second (00 .. 59)

  • %s - seconds and microseconds (equivalent to %S.%F)

  • %i - millisecond (000 .. 999)

  • %c - centisecond (0 .. 9)

  • %F - fractional seconds/microseconds (000000 - 999999)

  • %z - time zone differential in ISO 8601 format (Z or +NN.NN)

  • %Z - time zone differential in RFC format (GMT or +NNNN)

  • %% - percent sign
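The idea of passing several formats can be sketched in plain Python. This is only an illustration, not the engine's parser: it uses Python's own `datetime.strptime`, assumes the formats are tried in the order given, and restricts itself to specifiers (%Y, %m, %d, %H, %M, %S) that mean the same thing in the list above and in Python. The helper name `parse_first_match` is hypothetical.

```python
from datetime import datetime


def parse_first_match(value, time_formats):
    """Return the datetime for the first format that matches, else None.

    Illustration only: this mimics "try each format in turn" with
    Python's strptime, not with the getML engine's parser.
    """
    for fmt in time_formats:
        try:
            return datetime.strptime(value, fmt)
        except ValueError:
            # This format does not fit the string; try the next one.
            continue
    return None


# Hypothetical list, shaped like a value for the time_formats argument.
time_formats = ["%Y-%m-%d %H:%M:%S", "%Y-%m-%d", "%d.%m.%Y"]

full_stamp = parse_first_match("2021-03-04 05:06:07", time_formats)
date_only = parse_first_match("2021-03-04", time_formats)
dotted = parse_first_match("04.03.2021", time_formats)
no_match = parse_first_match("March 4th", time_formats)
```

Specifiers such as %e, %n or %i from the list above have no direct counterpart with the same meaning in Python's `strptime`, so they are left out of this sketch.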

Returns:
DataFrame:

Handler of the underlying data.
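A call might look like the following minimal sketch. It assumes `df` is an already constructed data frame object that exposes `read_s3` with the signature documented above; the bucket name, keys, and region are placeholders, and the wrapper function `load_csvs_from_s3` is hypothetical.

```python
def load_csvs_from_s3(df):
    """Read two CSV files from an S3 bucket into `df` and return the result.

    `df` is assumed to provide read_s3 as documented above. All argument
    values below are placeholders for illustration.
    """
    return df.read_s3(
        bucket="my-example-bucket",  # hypothetical bucket name
        keys=["data/part_1.csv", "data/part_2.csv"],  # hypothetical keys
        region="us-east-2",  # hypothetical region
        append=False,  # replace existing data rather than append to it
        sep=",",
        num_lines_read=0,  # 0 reads each file in full
        skip=0,
        colnames=None,  # column names come from each file's first line
        time_formats=["%Y-%m-%d %H:%M:%S", "%Y-%m-%d"],
    )
```

Since the optional arguments shown here (apart from `time_formats`) equal their defaults, they could be omitted; they are spelled out only to make the mapping to the parameter list visible.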