read_s3
- DataFrame.read_s3(bucket: str, keys: List[str], region: str, append: bool = False, sep: str = ',', num_lines_read: int = 0, skip: int = 0, colnames: Optional[List[str]] = None, time_formats: Optional[List[str]] = None)
Read CSV files from an S3 bucket.
It is assumed that the first line of each CSV file contains a header with the column names.
- Args:
- bucket (str):
The bucket from which to read the files.
- keys (List[str]):
The list of keys (files in the bucket) to be read.
- region (str):
The region in which the bucket is located.
- append (bool, optional):
If a data frame object with the same name is already present in the getML engine, whether the content of the CSV files in keys is appended to the existing data (True) or replaces it (False).
- sep (str, optional):
The separator used for separating fields.
- num_lines_read (int, optional):
Number of lines read from each file. Set to 0 to read in the entire file.
- skip (int, optional):
Number of lines to skip at the beginning of each file.
- colnames (List[str] or None, optional):
The first line of a CSV file usually contains the column names. When this is not the case, you need to explicitly pass them.
- time_formats (List[str], optional):
The list of formats tried when parsing time stamps.
The formats are allowed to contain the following special characters:
%w - abbreviated weekday (Mon, Tue, …)
%W - full weekday (Monday, Tuesday, …)
%b - abbreviated month (Jan, Feb, …)
%B - full month (January, February, …)
%d - zero-padded day of month (01 .. 31)
%e - day of month (1 .. 31)
%f - space-padded day of month ( 1 .. 31)
%m - zero-padded month (01 .. 12)
%n - month (1 .. 12)
%o - space-padded month ( 1 .. 12)
%y - year without century (70)
%Y - year with century (1970)
%H - hour (00 .. 23)
%h - hour (00 .. 12)
%a - am/pm
%A - AM/PM
%M - minute (00 .. 59)
%S - second (00 .. 59)
%s - seconds and microseconds (equivalent to %S.%F)
%i - millisecond (000 .. 999)
%c - centisecond (0 .. 9)
%F - fractional seconds/microseconds (000000 - 999999)
%z - time zone differential in ISO 8601 format (Z or +NN.NN)
%Z - time zone differential in RFC format (GMT or +NNNN)
%% - percent sign
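To get a feel for which timestamps a given entry in time_formats would match, the sketch below translates a small subset of the specifiers listed above into Python `strptime` directives and returns the first format in a list that parses a sample value. This is only a local approximation: the helper names `to_strptime` and `first_matching_format` are hypothetical and not part of getML, the mapping covers only some specifiers, Python's parsing rules may differ in detail from those of the getML engine, and trying the formats in list order is an assumption.

```python
from datetime import datetime
from typing import List, Optional

# Hypothetical mapping from a subset of the specifiers listed above to
# Python strptime directives. This table is not part of getML.
_TO_STRPTIME = {
    "Y": "%Y",  # year with century
    "y": "%y",  # year without century
    "m": "%m",  # zero-padded month
    "d": "%d",  # zero-padded day of month
    "H": "%H",  # hour (00 .. 23)
    "M": "%M",  # minute
    "S": "%S",  # second
    "b": "%b",  # abbreviated month
    "B": "%B",  # full month
    "w": "%a",  # abbreviated weekday
    "W": "%A",  # full weekday
    "%": "%%",  # percent sign
}


def to_strptime(fmt: str) -> str:
    """Translate a format string that uses the covered specifiers."""
    out = []
    i = 0
    while i < len(fmt):
        ch = fmt[i]
        if ch != "%":
            out.append(ch)
            i += 1
            continue
        if i + 1 >= len(fmt):
            raise ValueError("dangling '%' at end of format")
        spec = fmt[i + 1]
        if spec not in _TO_STRPTIME:
            raise ValueError(f"specifier %{spec} is not covered by this sketch")
        out.append(_TO_STRPTIME[spec])
        i += 2
    return "".join(out)


def first_matching_format(value: str, time_formats: List[str]) -> Optional[str]:
    """Return the first format that parses `value`, or None if none does."""
    for fmt in time_formats:
        py_fmt = to_strptime(fmt)  # raises for specifiers outside the subset
        try:
            datetime.strptime(value, py_fmt)
        except ValueError:
            continue
        return fmt
    return None
```

For example, calling `first_matching_format` with a sample value copied from one of the CSV files and the candidate list intended for time_formats shows which entry, if any, fits that value under these assumptions.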
- Returns:
DataFrame: Handler of the underlying data.
- Note:
Not supported in the getML community edition.
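A call might look like the following minimal sketch. The wrapper `load_csvs_from_s3` is a hypothetical helper, not part of getML; it only passes the documented keyword arguments through to `read_s3`. The bucket name, keys, and region in the usage comment are placeholders, and `df` is assumed to be an existing getML DataFrame connected to a running engine in an edition that supports this method.

```python
from typing import Any, List, Optional


def load_csvs_from_s3(
    df: Any,
    bucket: str,
    keys: List[str],
    region: str,
    time_formats: Optional[List[str]] = None,
) -> Any:
    """Hypothetical wrapper that spells out every read_s3 argument."""
    return df.read_s3(
        bucket=bucket,              # bucket holding the CSV files
        keys=keys,                  # files in the bucket to read
        region=region,              # region in which the bucket is located
        append=False,               # replace existing data instead of appending
        sep=",",                    # field separator
        num_lines_read=0,           # 0 reads each file in full
        skip=0,                     # no lines skipped at the start of each file
        colnames=None,              # take column names from the header line
        time_formats=time_formats,  # formats tried when parsing time stamps
    )


# Usage with placeholder values (assumptions, adjust to your own setup):
#
# df = load_csvs_from_s3(
#     df,
#     bucket="my-example-bucket",
#     keys=["sales_2019.csv", "sales_2020.csv"],
#     region="us-east-1",
#     time_formats=["%Y-%m-%d %H:%M:%S"],
# )
```

Passing every argument by keyword keeps the call readable next to the signature and makes it explicit which defaults are being relied on.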