ML Pipeline
Overview
A typical machine learning pipeline may include data loading, data transformation, splitting the data into training (train-dev) and testing datasets, vectorization, and classification or regression.
In the presto-query-predictor package, the Pipeline class provides a high-level interface for creating a model training pipeline without programming each step by hand. A Pipeline instance takes the dataset path or the dataset itself, the transformer config path, and the trainer config path as input parameters. The example below uses the class to create a CPU model training pipeline.
import pprint
from pathlib import Path

from query_predictor.predictor.pipeline import Pipeline

# Root of the presto-query-predictor checkout. Assumed here to be the current
# working directory; adjust it to your setup.
parent_dir = Path.cwd()

# Path of the embedded fake dataset.
# Replace it with your own dataset in practice.
data_path = parent_dir / "query_predictor/datasets/data/tpch.csv"

# Paths of the default transformer and trainer configs.
# Replace them with your own configs in practice.
transformer_config_path = (
    parent_dir / "query_predictor/conf/transformer.yaml"
)
cpu_trainer_config_path = (
    parent_dir / "query_predictor/conf/trainer-cpu.yaml"
)

# Runs the pipeline to train a model that predicts CPU time.
cpu_pipeline = Pipeline(
    data_path=data_path,
    transformation_required=True,
    transformer_config_path=transformer_config_path,
    trainer_config_path=cpu_trainer_config_path,
)
cpu_pipeline.exec()

# Prints the report produced by the run.
pp = pprint.PrettyPrinter()
pp.pprint(cpu_pipeline.report)
Datasets
The package contains a fake dataset created from TPC-H SQL queries. It has 22 samples with the columns query_id, user_, source, environment, catalog, query_state, query, peak_memory_bytes, and cpu_time_ms. The dataset can be loaded with the load_tpch function.
from query_predictor.datasets import load_tpch
data = load_tpch()
print(data)
Warning
The fake dataset is for demo purposes only. For production use, train models on request logs from your own Presto deployment.
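Before swapping in your own request logs, it can help to check that they expose the same columns as the bundled dataset. The sketch below is a minimal standard-library example, assuming the pipeline expects the column names listed above; the missing_columns helper and the sample log are hypothetical and not part of the package.

```python
import csv
import io

# Columns of the bundled TPC-H dataset, as listed in this section.
EXPECTED_COLUMNS = [
    "query_id", "user_", "source", "environment", "catalog",
    "query_state", "query", "peak_memory_bytes", "cpu_time_ms",
]


def missing_columns(csv_file):
    """Returns the expected columns that are absent from the CSV header."""
    header = next(csv.reader(csv_file), [])
    return [name for name in EXPECTED_COLUMNS if name not in header]


# A tiny in-memory stand-in for an exported request log (made-up values).
sample_log = io.StringIO(
    "query_id,user_,source,query_state,query,cpu_time_ms\n"
    "q1,alice,presto-cli,FINISHED,select 1,1200\n"
)
print(missing_columns(sample_log))
# prints ['environment', 'catalog', 'peak_memory_bytes']
```

For a real log, pass an open file handle instead of the in-memory sample.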
Data Transformation
After loading a raw Presto request log dataset, we need to transform it, e.g. convert SQL queries to lowercase and create prediction labels. The package provides a DataTransformer class for data transformation. The required transformations are specified in a transformer configuration file.
transformers: # The transformations executed on the dataset
- drop_failed_queries # Drops failed queries whose query state is FAILURE
- create_labels # Creates prediction labels for CPU time and peak memory bytes
- to_lower_queries # Converts SQL queries to lowercase
- select_training_columns # Removes unnecessary columns
persist: true # Whether the dataset after transformations should be persisted or not
persist_path: clean.csv # Persistence path
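The configuration above names transformations that run in order. The following sketch illustrates that idea with plain Python dictionaries; it is not the DataTransformer implementation. The function bodies, the FAILED state value, the label cutoff, and the kept columns are all assumptions made for illustration.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]

# Assumed values, not taken from the package.
FAILED_STATE = "FAILED"
CPU_TIME_CUTOFF_MS = 30_000


def drop_failed_queries(rows: List[Row]) -> List[Row]:
    # Keeps only the rows whose query did not fail.
    return [row for row in rows if row["query_state"] != FAILED_STATE]


def create_labels(rows: List[Row]) -> List[Row]:
    # Adds a binary label: 1 if CPU time reaches the cutoff, else 0.
    return [
        {**row, "cpu_time_label": int(int(row["cpu_time_ms"]) >= CPU_TIME_CUTOFF_MS)}
        for row in rows
    ]


def to_lower_queries(rows: List[Row]) -> List[Row]:
    # Converts the SQL text to lowercase.
    return [{**row, "query": str(row["query"]).lower()} for row in rows]


def select_training_columns(rows: List[Row]) -> List[Row]:
    # Keeps only the feature and label columns.
    keep = ("query", "cpu_time_label")
    return [{name: row[name] for name in keep} for row in rows]


TRANSFORMERS: Dict[str, Callable[[List[Row]], List[Row]]] = {
    "drop_failed_queries": drop_failed_queries,
    "create_labels": create_labels,
    "to_lower_queries": to_lower_queries,
    "select_training_columns": select_training_columns,
}


def apply_transformers(rows: List[Row], names: List[str]) -> List[Row]:
    # Applies the named transformations in the order given by the config.
    for name in names:
        rows = TRANSFORMERS[name](rows)
    return rows


# Same order as the transformers list in the configuration file.
config = {"transformers": list(TRANSFORMERS)}
raw = [
    {"query_state": "FINISHED", "query": "SELECT 1", "cpu_time_ms": 1200},
    {"query_state": "FAILED", "query": "SELECT x FROM", "cpu_time_ms": 10},
    {"query_state": "FINISHED", "query": "SELECT * FROM lineitem", "cpu_time_ms": 45000},
]
clean = apply_transformers(raw, config["transformers"])
print(clean)
```

The order matters in this sketch: select_training_columns has to run after create_labels, because it keeps a column that create_labels adds.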
Model Training
We apply data vectorization to the query strings in the transformed dataset. The package currently supports two vectorizers, both based on scikit-learn:
- DataCountVectorizer: token count approach
- DataTfidfVectorizer: TF-IDF (term frequency-inverse document frequency) approach
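To make the difference between the two approaches concrete, here is a minimal standard-library sketch. It does not use DataCountVectorizer or DataTfidfVectorizer. The sample queries, the tokenize helper, and the weighting (a smoothed IDF followed by L2 normalization, assumed to approximate scikit-learn's defaults) are illustrative only.

```python
import math
import re
from collections import Counter

# Made-up sample queries.
queries = [
    "select count(*) from lineitem",
    "select sum(l_quantity) from lineitem where l_quantity > 10",
    "select o_orderkey from orders",
]


def tokenize(text):
    # Lowercases the text and keeps word tokens of two or more characters.
    return re.findall(r"\b\w\w+\b", text.lower())


# Token count approach: each query maps tokens to raw counts.
counts = [Counter(tokenize(query)) for query in queries]

# Number of queries in which each token appears.
n_docs = len(queries)
doc_freq = Counter(token for count in counts for token in count)


def tfidf(count):
    # TF-IDF approach: counts are scaled down for tokens common to many queries.
    weights = {
        token: tf * (math.log((1 + n_docs) / (1 + doc_freq[token])) + 1.0)
        for token, tf in count.items()
    }
    norm = math.sqrt(sum(weight * weight for weight in weights.values()))
    return {token: weight / norm for token, weight in weights.items()}


tfidf_vectors = [tfidf(count) for count in counts]

print(counts[0])
print(tfidf_vectors[0])
```

In the first query every token has a count of 1, but under TF-IDF the token count gets a larger weight than select, since select appears in every query and so carries little information.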
After vectorization, we split the dataset into training and testing datasets and apply one of the following classifiers:
- RandomForestClassifier: a random forest classifier based on the scikit-learn package.
- LogisticRegressionClassifier: a logistic regression classifier based on the scikit-learn package.
- XGBoostClassifier: an XGBoost classifier based on the xgboost package.
Contributions of additional classifiers are welcome!
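The training/testing split mentioned before the classifier list can be sketched as follows. This is a standard-library illustration rather than the package's own code: the split_train_test helper, the fixed seed, and the choice to round the test share up are assumptions.

```python
import math
import random


def split_train_test(samples, test_size, seed=0):
    """Shuffles the samples and holds out a test_size fraction for testing."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_test = math.ceil(len(shuffled) * test_size)
    return shuffled[n_test:], shuffled[:n_test]


# 22 placeholder samples, the size of the bundled dataset.
samples = list(range(22))
train, test = split_train_test(samples, test_size=0.2)
print(len(train), len(test))
# prints 17 5
```

With only 22 samples, the testing dataset is very small, so evaluation metrics on the demo data should be read with caution.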
Both the vectorizer's and the classifier's parameters can be provided through a trainer configuration file. An example configuration for training a CPU model is shown below.
label: cpu_time_label # Prediction label: cpu_time_label or peak_memory_label
feature: query # Feature column
vectorizer:
  type: tfidf # Vectorizer type: tfidf or count
  params: # Params for the vectorizer, following scikit-learn parameters.
    max_features: 100
    min_df: 1
    max_df: 0.8
  persist: true # Whether the trained vectorizer should be persisted or not
  persist_path: models/vec-cpu.bin # Persistence path
test_size: 0.2 # Testing dataset proportion during splitting
classifier:
  type: XGBoost # Classifier type
  params: # Params for the classifier
    max_depth: 2
    objective: 'binary:logistic'
  persist: true # Whether the trained model should be persisted or not
  persist_path: models/model-cpu.bin # Persistence path
After training, a CPU model should be generated in the models/ folder. This model can be used to predict the CPU usage of future Presto requests.
Info
The vectorizer's and classifier's parameters are for demo purposes and are not optimized. They usually need to be tuned again for a different dataset.
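One reason the vectorizer parameters are dataset dependent is that min_df, max_df, and max_features decide which tokens enter the vocabulary at all. The sketch below illustrates this with the standard library, assuming the parameters keep their scikit-learn meaning (min_df as a minimum document count, max_df as a maximum proportion of documents, max_features as a cap on the most frequent tokens). The build_vocabulary helper and the sample queries are hypothetical.

```python
import re
from collections import Counter


def build_vocabulary(docs, min_df=1, max_df=1.0, max_features=None):
    """Returns the sorted tokens that pass the document-frequency filters."""
    tokenized = [re.findall(r"\b\w\w+\b", doc.lower()) for doc in docs]
    # Number of documents containing each token.
    doc_freq = Counter(token for tokens in tokenized for token in set(tokens))
    # Total occurrences of each token across all documents.
    term_freq = Counter(token for tokens in tokenized for token in tokens)
    max_doc_count = max_df * len(docs)
    kept = [
        token for token, df in doc_freq.items()
        if min_df <= df <= max_doc_count
    ]
    # Keeps the most frequent tokens first when max_features is set.
    kept.sort(key=lambda token: (-term_freq[token], token))
    return sorted(kept[:max_features])


# Made-up sample queries.
queries = [
    "select count(*) from lineitem",
    "select sum(l_quantity) from lineitem",
    "select o_orderkey from orders",
    "select c_name from customer",
    "select n_name from nation",
]

print(build_vocabulary(queries, max_df=1.0))
print(build_vocabulary(queries, max_df=0.8))
```

With max_df set to 0.8, the tokens select and from drop out because they appear in every query. On a log with different query shapes, the same threshold could remove tokens that are informative, which is why the values usually need revisiting.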