Staging Data Using Pandas¶

Loading the data¶

For the complete script, please refer to example_01b_stage_data_using_pandas.py.

getML features various ways to stage data. This section will illustrate how to stage data using pandas. Most data scientists should be familiar with pandas - however it is not very efficient.

First of all, we need to tell Python where to find the raw data. This obviously depends on where you have decided to put them.

# The folder that contains expd151.csv
RAW_DATA_FOLDER = os.getenv("HOME") + "/Downloads/diary15"

Then, we set the project:

engine.set_project("CE")

Now we read the data into a pandas.DataFrame:

os.chdir(RAW_DATA_FOLDER)

# -----------------------------------------------------------------------------
# Load EXPD

expd = pd.read_csv("expd151.csv")
expd = expd.append(pd.read_csv("expd152.csv"))
expd = expd.append(pd.read_csv("expd153.csv"))
expd = expd.append(pd.read_csv("expd154.csv"))

# -----------------------------------------------------------------------------
# Load FMLD

fmld = pd.read_csv("fmld151.csv")
fmld = fmld.append(pd.read_csv("fmld152.csv"))
fmld = fmld.append(pd.read_csv("fmld153.csv"))
fmld = fmld.append(pd.read_csv("fmld154.csv"))

# -----------------------------------------------------------------------------
# Load MEMD

memd = pd.read_csv("memd151.csv")
memd = fmld.append(pd.read_csv("memd152.csv"))
memd = fmld.append(pd.read_csv("memd153.csv"))
memd = fmld.append(pd.read_csv("memd154.csv"))

Staging EXPD¶

We want to predict whether the item was purchased as a gift. Strangely enough, the GIFT flag is set to 2 when the item is a gift. We want to change that to 0.0 and 1.0:

expd["TARGET"] = [0.0 if elem == 2 else 1.0 for elem in expd["GIFT"]]

Sometimes the year entry (EXPNYR) or the month entry (EXPNMO) contain missing data. These do us no good, so we get rid of them:

expd = expd[
  (expd["EXPNYR"] == expd["EXPNYR"]) & (expd["EXPNMO"] == expd["EXPNMO"])
]

We want to transform the EXPNYR and EXPNMO entries into proper time stamps.

expd["TIME_STAMP"] = [
    datetime.datetime(int(year), int(month), 1) for year, month in zip(expd["EXPNYR"], expd["EXPNMO"])
]

As we explained, the UCC is based on a hierarchical system of classification. We make use of that fact and create a bunch of substrings. UCC1 contains only the first digit, UCC2 the first two and so on:

ucc = np.asarray(expd["UCC"]).astype(str)

expd["UCC1"] = [elem[:1] for elem in ucc]
expd["UCC2"] = [elem[:2] for elem in ucc]
expd["UCC3"] = [elem[:3] for elem in ucc]
expd["UCC4"] = [elem[:4] for elem in ucc]
expd["UCC5"] = [elem[:5] for elem in ucc]

Staging POPULATION¶

We create the POPULATION table by joining EXPD and FMLD.

FMLD_COLS = [
    "INC_RANK",
    "INC_RNK1",
    "INC_RNK2",
    "INC_RNK3",
    "INC_RNK4",
    "INC_RNK5",
    "INC_RNKM",
    "NEWID"
]

population_all = expd.merge(
    fmld[FMLD_COLS],
    on="NEWID"
)

Loading the data into the getML engine¶

Currently, the data is held in pandas. But we want it in the getML engine.

EXPD_CATEGORICAL = [
    "UCC",
    "UCC1",
    "UCC2",
    "UCC3",
    "UCC4",
    "UCC5"
]

EXPD_JOIN_KEYS = [
    "NEWID"
]

EXPD_NUMERICAL = [
    "COST",
    "EXPNYR",
    "EXPNMO"
]

EXPD_TARGETS = [
    "TARGET"
]

EXPD_TIME_STAMPS = [
    "TIME_STAMP"
]

expd_roles = {
    "join_key": EXPD_JOIN_KEYS,
    "time_stamp": EXPD_TIME_STAMPS,
    "categorical": EXPD_CATEGORICAL,
    "numerical": EXPD_NUMERICAL,
    "target": EXPD_TARGETS
}

df_expd = data.DataFrame.from_pandas(
    pandas_df=expd,
    name="EXPD",
    roles=expd_roles,
    ignore=True)

df_expd.set_unit("UCC1", "UCC1")
df_expd.set_unit("UCC2", "UCC2")
df_expd.set_unit("UCC3", "UCC3")
df_expd.set_unit("UCC4", "UCC4")
df_expd.set_unit("UCC5", "UCC5")
df_expd.set_unit("UCC", "UCC")
df_expd.set_unit("EXPNMO", "month")

df_expd.save()

The code for POPULATION_ALL and MEMD looks similarly.

Separating the data¶

Finally, we want to separate the data into a training, validation and testing set:

random = df_expd.random()

df_population_training = df_expd.where("POPULATION_TRAINING", random <= 0.7)

df_population_validation = df_expd.where("POPULATION_VALIDATION", (random <= 0.85) & (random > 0.7))

df_population_testing = df_expd.where("POPULATION_TESTING", random > 0.85)