Staging Data Directly

First steps

For the complete script, please refer to example_01a_stage_data_directly.py.

getML features various ways to stage data. This section will illustrate how to stage data by using the getml.engine.DataFrames. This is the most efficient way to stage data, sqlite3 being a close second.

First of all, we need to tell Python where to find the raw data. This obviously depends on where you have decided to put them.

# The folder that contains expd151.csv
RAW_DATA_FOLDER = os.getenv("HOME") + "/Downloads/diary15/"

Then, we set the project:

engine.set_project("CE")

Loading the data

Now, we want to load the data.

getML provides a sophisticated CSV sniffer. However, no CSV sniffer is perfect. For instance, the UCC and NEWID look like integers, but they are actually strings. Reading them in as integers might lead to data loss.

You would get similar problems with just about any CSV sniffer we know - the underlying problem is that CSV sniffers don’t really have a real understanding about the columns are and can only use simple heuristics.

Luckily, we can easily override the suggestions of the sniffer:

expd_fnames = [
    RAW_DATA_FOLDER + "expd151.csv",
    RAW_DATA_FOLDER + "expd152.csv",
    RAW_DATA_FOLDER + "expd153.csv",
    RAW_DATA_FOLDER + "expd154.csv"
]

expd_roles = {"unused_string": ["UCC", "NEWID"]}

df_expd = data.DataFrame.from_csv(
    fnames=expd_fnames,
    name="EXPD",
    roles=expd_roles
)

We need to do the same for FMLD and MEMD.

Staging EXPD

We want to turn EXPNYR, EXPNMO and COST into numerical columns:

df_expd.set_role(["EXPNYR", "EXPNMO", "COST"], roles.numerical)

df_expd.set_unit(["EXPNMO"], "month")
df_expd.set_unit(["COST"], "cost")

We want to turn NEWID into a join key:

df_expd.set_role("NEWID", roles.join_key)

Sometimes the year entry (EXPNYR) or the month entry (EXPNMO) contain missing data. These do us no good, so we get rid of them:

expnyr = df_expd["EXPNYR"]
expnmo = df_expd["EXPNMO"]

not_nan = (expnyr.is_nan() | expnmo.is_nan()).is_false()

df_expd = df_expd.where("EXPD", not_nan)

We want to transform the EXPNYR and EXPNMO entries into proper time stamps.

expnyr = df_expd["EXPNYR"]
expnmo = df_expd["EXPNMO"]

ts = (expnyr.as_str() + "/" + expnmo.as_str()).as_ts(["%Y/%n"])

df_expd.add(ts, "TIME_STAMP", role.time_stamp)

As we explained, the UCC is based on a hierarchical system of classification. We make use of that fact and create a bunch of substrings. UCC1 contains only the first digit, UCC2 the first two and so on. This will also create units with the same name as the column name:

ucc = df_expd["UCC"]

for i in range(5):
    substr = ucc.substr(0, i+1)
    df_expd.add(
            substr,
            name="UCC" + str(i+1),
            role=roles.categorical,
            unit="UCC" + str(i+1))

df_expd.set_role("UCC", roles.categorical)
df_expd.set_unit("UCC", "UCC")

We want to predict whether the item was purchased as a gift. Strangely enough, the GIFT flag is set to 2 when the item is a gift. We want to change that to 0.0 and 1.0:

target = (df_expd["GIFT"] == 1)

df_expd.add(target, "TARGET", roles.target)

Staging MEMD

The preparation for MEMD is simpler: We just have to assign the appropriate role.

df_memd.set_role([
    "MARITAL",
    "SEX",
    "EMPLTYPE",
    "OCCULIST",
    "WHYNOWRK",
    "EDUCA",
    "MEDICARE",
    "PAYPERD",
    "RC_WHITE",
    "RC_BLACK",
    "RC_ASIAN",
    "RC_OTHER",
    "WKSTATUS"], roles.categorical)

df_memd.set_role(["AGE", "WAGEX"], roles.numerical)

df_memd.set_role("NEWID", roles.join_key)

time_stamp = df_memd.string_column("2015/01/01").as_ts(["%Y/%m/%d"])

df_memd.add(time_stamp, "TIME_STAMP", roles.time_stamp)

df_memd.save()

Staging POPULATION

We create the POPULATION table by joining EXPD and FMLD.

# -----------------------------------------------------------------------------
# Separate EXPD in training, testing, validation set

random = df_expd.random()

df_population_training = df_expd.where("POPULATION_TRAINING", random <= 0.7)

df_population_validation = df_expd.where("POPULATION_VALIDATION", (random <= 0.85) & (random > 0.7))

df_population_testing = df_expd.where("POPULATION_TESTING", random > 0.85)

# -----------------------------------------------------------------------------------------------
# NEWID in FMLD is unique - therefore, we can just LEFT JOIN it onto the POPULATION tables.

income_ranks = [
    "INC_RANK",
    "INC_RNK1",
    "INC_RNK2",
    "INC_RNK3",
    "INC_RNK4",
    "INC_RNK5",
    "INC_RNKM"
]

df_fmld.set_role(income_ranks, roles.numerical)

for inc in income_ranks:
    df_fmld.set_unit(inc, inc)

df_fmld.set_role("NEWID", roles.join_key)

df_population_training = df_population_training.join(
        name="POPULATION_TRAINING",
        other=df_fmld,
        join_key="NEWID",
        other_cols=[
            df_fmld["INC_RANK"],
            df_fmld["INC_RNK1"],
            df_fmld["INC_RNK2"],
            df_fmld["INC_RNK3"],
            df_fmld["INC_RNK4"],
            df_fmld["INC_RNK5"],
            df_fmld["INC_RNKM"]
        ]
)

df_population_validation = df_population_validation.join(
        name="POPULATION_VALIDATION",
        other=df_fmld,
        join_key="NEWID",
        other_cols=[
            df_fmld["INC_RANK"],
            df_fmld["INC_RNK1"],
            df_fmld["INC_RNK2"],
            df_fmld["INC_RNK3"],
            df_fmld["INC_RNK4"],
            df_fmld["INC_RNK5"],
            df_fmld["INC_RNKM"]
        ]
)

df_population_testing = df_population_testing.join(
        name="POPULATION_TESTING",
        other=df_fmld,
        join_key="NEWID",
        other_cols=[
            df_fmld["INC_RANK"],
            df_fmld["INC_RNK1"],
            df_fmld["INC_RNK2"],
            df_fmld["INC_RNK3"],
            df_fmld["INC_RNK4"],
            df_fmld["INC_RNK5"],
            df_fmld["INC_RNKM"]
        ]
)

# -----------------------------------------------------------------------------------------------

df_population_training.save()

df_population_validation.save()

df_population_testing.save()