Getting Started
presto-query-predictor is a Python module introducing machine learning techniques to the Presto ecosystem. It contains a machine learning pipeline for the model training/evaluation and a query predictor web service to predict CPU and memory usages of Presto queries.
Installation
After cloning the GitHub repository,
pip3 install -e . # Installs the presto-query-predictor package locally
pip3 install -r requirements.txt # Installs dependencies
An alternative way is to install the package from PyPi,
pip3 install presto-query-predictor
Note
We recommend installing the package in a Python virtual environment instead of installing it globally.
Examples
The query_predictor/
folder contains the core of the package. We have prepared
some examples in the example/
folder, including
load_data.py
- An example to load the embedded fake TPCH-based dataset.transform.py
- An example to transform datasets for further training.train.py
- An example to train CPU and memory models.tune.py
- An example to tune classification algorithms.app.py
- An example to create a query predictor web service.
Training
A simple way to get a sense of the CPU and memory model training is running the
examples in the example/
folder.
cd examples
python3 transform.py
python3 train.py
Warning
The presto-query-predictor package can only be executed in a Python 3 environment. It does not support Python 2.
Afterward, the trained models should be generated in the models
folder, including
models/
vec-cpu.bin
vec-memory.bin
model-cpu.bin
model-memory.bin
By default, the vectorizers are trained from the TF-IDF algorithm, and the models are trained from XGBoost classifiers. The dataset used for training is a faked dataset based on the TPC-H benchmark with only 22 samples.
Serving
After running
python3 app.py
A Flask web application should be created at http://0.0.0.0:8000/. There is a web UI for the application where you can fill in the form with a query for resources prediction.