Welcome to the getML technical documentation. This document is written for data scientists who want to use the getML software suite for their projects. For general information about getML visit https://getml.com. You can also contact us with any questions or inquiries.
getML in one minute
getML is an innovative tool for the end-to-end automation of data science projects. It covers everything from convenient data loading procedures to the deployment of trained models.
Most notably, getML includes advanced algorithms for automated feature engineering (feature learning) on relational data and time series. Feature engineering on relational data is defined as the creation of a flat table by merging and aggregating data. It is sometimes also referred to as data wrangling. Feature engineering is necessary if your data is distributed over more than one data table.
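To make "merging and aggregating" concrete, here is a minimal sketch of manual feature engineering in plain Python. It does not use getML; the toy tables, the `amount` column, and the `build_flat_table` helper are hypothetical and exist only for illustration.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical toy data: one row per entity in the population table ...
population = [
    {"join_key": 1, "target": 250.0},
    {"join_key": 2, "target": 40.0},
]
# ... and any number of related rows in the peripheral table.
peripheral = [
    {"join_key": 1, "amount": 100.0},
    {"join_key": 1, "amount": 150.0},
    {"join_key": 2, "amount": 40.0},
]


def build_flat_table(population, peripheral):
    """Merge on join_key, then aggregate each group into scalar features."""
    # Merge step: collect the peripheral values that belong to each key.
    groups = defaultdict(list)
    for row in peripheral:
        groups[row["join_key"]].append(row["amount"])

    # Aggregate step: reduce each group to one value per feature.
    flat = []
    for row in population:
        amounts = groups.get(row["join_key"], [])
        flat.append({
            **row,
            "count_amount": len(amounts),
            "sum_amount": sum(amounts),
            "avg_amount": mean(amounts) if amounts else 0.0,
        })
    return flat


flat_table = build_flat_table(population, peripheral)
```

Each row of `flat_table` now holds the target plus three hand-written features. Choosing which columns to aggregate, and how, is the manual work that automated feature engineering aims to take over.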
Automated feature engineering:
- Saves up to 90% of the time spent on a data science project
- Increases the prediction accuracy over manual feature engineering
Andrew Ng, Professor at Stanford University and co-founder of Google Brain, described manual feature engineering as follows:
Coming up with features is difficult, time-consuming, requires expert knowledge. “Applied machine learning” is basically feature engineering.
The main purpose of getML is to automate this “difficult, time-consuming” process as much as possible.
getML comes with a high-performance engine written in C++ and an intuitive Python API. Completing a data science project with getML consists of seven simple steps.
import getml

getml.engine.set_project('one_minute_to_getml')
Load the data into the engine
population = getml.data.DataFrame.from_csv('data_population.csv', name='population_table')
peripheral = getml.data.DataFrame.from_csv('data_peripheral.csv', name='peripheral_table')
Annotate the data
population.set_role('target', getml.data.role.target)
population.set_role('join_key', getml.data.role.join_key)
...
Define the data model
population_placeholder = getml.data.Placeholder("POPULATION")
peripheral_placeholder = getml.data.Placeholder("PERIPHERAL")
population_placeholder.join(peripheral_placeholder, join_key="join_key")
Train the feature learning algorithm and the predictor
pipe = getml.pipeline.Pipeline(
    population=population_placeholder,
    peripheral=[peripheral_placeholder],
    feature_learners=getml.feature_learning.MultirelModel(num_features=100),
    predictors=getml.predictors.LinearRegression()
)

pipe.fit(
    population=population,
    peripheral=[peripheral]
)
# Evaluate the trained pipeline on data it has not seen during training
pipe.score(
    population=population_unseen,
    peripheral=[peripheral_unseen]
)
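Scoring on unseen data means comparing the pipeline's predictions with known target values. The following plain-Python sketch shows three common regression metrics. It is an illustration only: the function name, the dictionary keys, and the sample numbers are hypothetical, and the exact metrics and definitions getML reports may differ.

```python
from math import sqrt


def regression_scores(y_true, y_pred):
    """Return MAE, RMSE and R^2 (coefficient of determination)."""
    n = len(y_true)
    errors = [pred - true for true, pred in zip(y_true, y_pred)]

    mae = sum(abs(e) for e in errors) / n
    rmse = sqrt(sum(e * e for e in errors) / n)

    # R^2 = 1 - (residual sum of squares / total sum of squares)
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((true - mean_true) ** 2 for true in y_true)
    r2 = 1.0 - ss_res / ss_tot

    return {"mae": mae, "rmse": rmse, "r2": r2}


# Hypothetical targets and predictions for four unseen rows.
scores = regression_scores([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 6.0])
```

Lower MAE and RMSE and a higher R^2 indicate a better fit. Computing them on rows that were not used for training gives an estimate of how the pipeline behaves on new data.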
# Generate predictions for unseen data
pipe.predict(
    population=population_unseen,
    peripheral=[peripheral_unseen]
)
# Allow the pipeline to respond to HTTP requests
pipe.deploy(True)
Check out the rest of this documentation to find out how getML achieves top performance on real-world data science projects with many tables and complex data schemas.
How to use this guide
If you are looking for more detailed information, other sections of this documentation are more suitable. There are three major parts:
The tutorials section contains examples of how to use getML in real-world projects. All tutorials are based on public data sets so that you can follow along. If you are looking for an intuitive introduction to getML, the tutorials section is the right place to go. The code examples are also explicitly intended to serve as templates for your own projects.
The user guide explains all conceptual details behind getML in depth. It can serve as a reference guide for experienced users, but it is also suitable for first-day users who want a deeper understanding of how getML works. Each chapter in the user guide represents one step of a typical data science project.
The API documentation covers everything related to the Python interface to the getML engine. Each module comes with a dedicated section that contains concrete code examples.
You can also check out our other resources