Using Config Files to Run Machine Learning Experiments

Three key advantages of using config files

It’s one thing to do machine learning research as a solo practitioner; it’s another thing to do research at scale across a large team, building machine learning products. A team effort can lead to faster experiments, but can also bring bottlenecks and frustrations if things aren’t coordinated well. You might find yourself waiting for a compute job to start before you can change your code to run the next experiment. Or spending hours trying to reproduce someone’s experiment before you get comparable results.

This is where configuration, or config, files come in handy. Configuration files are files used to configure the parameters and settings of a program or application. Config files allow you to separate the code from the parameters of the machine learning pipeline to help produce repeatable outcomes. They describe how to read the data, what model to use, how to train the model, and how to evaluate model performance, all without intervention from an operator. It’s not easy to create great config files, however, and it takes time to understand how to build them properly. But they provide a set of advantages that make them worth considering in your machine learning workflows.

Config files come in many formats: YAML, JSON, INI, TOML, XML, and others. Each format has its strengths and weaknesses, so you should choose the format that best fits your needs. Tools like Hydra make it easier to manage large config files: Hydra lets you compose a hierarchical configuration and override it through config files and the command line.
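To make the idea of command-line overrides concrete, here is a minimal sketch in plain Python. It is not Hydra's API; the apply_overrides helper and the sample keys are hypothetical, and the sketch only illustrates how a dotted key=value string can change one value in a nested configuration.

```python
import ast
import copy


def apply_overrides(config, overrides):
    """Return a copy of config with 'a.b.c=value' overrides applied."""
    result = copy.deepcopy(config)
    for item in overrides:
        dotted_key, _, raw_value = item.partition("=")
        *parents, leaf = dotted_key.split(".")
        node = result
        for key in parents:
            # Create intermediate sections that do not exist yet.
            node = node.setdefault(key, {})
        try:
            # Interpret numbers, booleans, lists, and similar literals.
            value = ast.literal_eval(raw_value)
        except (ValueError, SyntaxError):
            # Anything else stays a plain string.
            value = raw_value
        node[leaf] = value
    return result


base = {
    "model": {"num_layers": 3, "dropout": 0.1},
    "optimizer": {"name": "SGD", "learning_rate": 0.001},
}

# In a real script these strings would typically come from sys.argv[1:].
updated = apply_overrides(base, ["model.num_layers=5", "optimizer.name=Adam"])
print(updated)
```

The base configuration is copied rather than modified, so the defaults stay intact for the next experiment.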

How to create a config file

Creating a config file is relatively straightforward, but requires a good understanding of the configuration framework. Here are some guidelines to do so successfully:

Identify all mutable parameters in your machine learning project and write them in the config file. Consider writing the model parameters, the loss function, the dataset, the data loader, the metric, the optimizer, and the learning rate scheduler. The config file will help you quickly explore several values and change parameters without changing the code. Identifying mutable parameters can be challenging because project direction can change quickly.
Design the config file based on your task or problem. Each project has its own needs, and config files provide a lot of flexibility.
Make the config file easy to read. Try to mirror the code structure so it’s easy to understand the matching between the code and the config (i.e., where each parameter is used in the code). Keep the parameters of the same module in the same place so the user can easily find them.
In the config, write the parameters that are specific to one machine so you can use the same code on another machine. For example, writing into the config file the path of the dataset, or the path to where to write the model checkpoint, allows you to easily run an experiment on another machine without a code change.
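The last guideline, keeping machine-specific values out of the code, can be sketched as a shared config merged with a small per-machine config. This is only one possible approach; the deep_merge helper, the checkpoint section, and the cluster paths are placeholders invented for illustration.

```python
import copy


def deep_merge(base, override):
    """Recursively merge override into a copy of base; override wins."""
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# Shared settings, versioned alongside the code.
shared = {
    "dataset": {"path": "/path/to/my/dataset"},
    "model": {"hidden_size": 128, "num_layers": 3},
}

# Hypothetical per-machine settings; both paths are placeholders.
cluster = {
    "dataset": {"path": "/cluster/path/to/my/dataset"},
    "checkpoint": {"path": "/cluster/path/to/checkpoints"},
}

config = deep_merge(shared, cluster)
print(config["dataset"]["path"])
```

Only the paths change between machines; the model section comes through untouched, so the same code and the same experiment definition run in both places.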

If you want to train a model on a given dataset, you can use a config file with the following structure. This is only one example, and you can add more parameters. A user can easily see there are four main modules: dataset, model, metric, optimizer. Each module provides the parameters you want to be able to change. For example, you can easily increase the number of layers in your model from three to five by writing num_layers: 5.

 dataset:
    path: /path/to/my/dataset

 model:
    input_size: 112
    hidden_size: 128
    output_size: 42
    dropout: 0.1
    num_layers: 3

 metric: accuracy

 optimizer:
    name: SGD
    learning_rate: 0.001
    weight_decay: 1e-5
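On the code side, one way to mirror this structure is a small class per module. The sketch below assumes the file has already been parsed into a dictionary (with PyYAML, for example, yaml.safe_load would do this); the dictionary is written inline here so the example has no dependencies. The class names ModelConfig and OptimizerConfig are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ModelConfig:
    input_size: int
    hidden_size: int
    output_size: int
    dropout: float
    num_layers: int


@dataclass
class OptimizerConfig:
    name: str
    learning_rate: float
    weight_decay: float = 0.0

    def __post_init__(self):
        # Some YAML parsers return values such as 1e-5 as strings,
        # so numeric fields are coerced explicitly.
        self.learning_rate = float(self.learning_rate)
        self.weight_decay = float(self.weight_decay)


# The parsed config from above, written inline for this sketch.
raw = {
    "dataset": {"path": "/path/to/my/dataset"},
    "model": {
        "input_size": 112,
        "hidden_size": 128,
        "output_size": 42,
        "dropout": 0.1,
        "num_layers": 3,
    },
    "metric": "accuracy",
    "optimizer": {"name": "SGD", "learning_rate": 0.001, "weight_decay": "1e-5"},
}

model_cfg = ModelConfig(**raw["model"])
optimizer_cfg = OptimizerConfig(**raw["optimizer"])

# A misspelled key fails at startup rather than deep inside training.
try:
    ModelConfig(**{**raw["model"], "num_layer": 5})
    typo_rejected = False
except TypeError:
    typo_rejected = True
```

The design choice here is to fail early: an unknown or missing parameter raises an error when the config is loaded, before any compute time is spent.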

Advantage 1: Reproducibility

Reproducibility is critical for deploying machine learning products. Teams need to be able to run an algorithm on different datasets and obtain the same (or very similar) results before putting an algorithm into production. Reproducibility speeds up production pipelines because it reduces errors and ambiguity when projects move from development to production. It also helps to create trust and credibility.

To reproduce something, you need to take a snapshot of it. Config files help your colleagues reproduce your experiment more easily in the future, which is good practice for both computer science and software engineering. Teams can spend more time making enhancements and improvements to your work instead of decoding the work you’ve already done. This helps you make progress faster.

Config files also work well with version control tools like Git. It’s easy to keep config files in the same repository as the code and to version them together. Version control keeps a record of what changed, and the commit message records why. If a change has unintended side effects, you can review the history to find the change that caused them.
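One way to take the snapshot described above is to store the exact config next to the results of each run, under an identifier derived from its content. The following is a minimal sketch under that assumption; the function names, the config.json file name, and the 12-character identifier are arbitrary choices for illustration.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def config_fingerprint(config):
    """Hash a canonical JSON form so identical configs get identical IDs."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


def save_snapshot(config, output_dir):
    """Write the exact config used for a run into its own directory."""
    run_dir = Path(output_dir) / config_fingerprint(config)
    run_dir.mkdir(parents=True, exist_ok=True)
    snapshot_path = run_dir / "config.json"
    snapshot_path.write_text(json.dumps(config, sort_keys=True, indent=2))
    return snapshot_path


config = {
    "model": {"num_layers": 3, "dropout": 0.1},
    "optimizer": {"name": "SGD", "learning_rate": 0.001},
}

# Same values, different key order: should map to the same fingerprint.
reordered = {
    "optimizer": {"learning_rate": 0.001, "name": "SGD"},
    "model": {"dropout": 0.1, "num_layers": 3},
}

# A temporary directory stands in for a real results folder.
with tempfile.TemporaryDirectory() as tmp:
    path = save_snapshot(config, tmp)
    restored = json.loads(path.read_text())

print(config_fingerprint(config) == config_fingerprint(reordered))
```

Because keys are sorted before hashing, two colleagues who write the same parameters in a different order still end up with the same run identifier.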

Advantage 2: Simultaneous Collaboration and Parallel Experiments

Ideally, two or more people should be able to work on the same model at the same time to maximize efficiency. Config files make this possible. It’s time-consuming to write out all the parameters you want to use, and a shared set of experiment configs lets team members simply run experiments instead of recreating them.

If you develop a machine learning pipeline entirely in code, with every parameter hardcoded, it’s hard to change one part of the pipeline without risking breaking the rest. But if you use config files to hold the settings for the different parts of your model, you can work with others on the same model in parallel without breaking the code for everyone else. One developer can run their experiment without affecting the model itself or their colleagues’ experiments. For example, two people can train the same model on the same data but with different optimizers:

 # First experiment: SGD
 dataset:
    path: /path/to/my/dataset

 model:
    input_size: 112
    hidden_size: 128
    output_size: 42
    dropout: 0.1
    num_layers: 3

 metric: accuracy

 optimizer:
    name: SGD
    learning_rate: 0.001
    weight_decay: 1e-5

 # Second experiment: Adam
 dataset:
    path: /path/to/my/dataset

 model:
    input_size: 112
    hidden_size: 128
    output_size: 42
    dropout: 0.1
    num_layers: 3

 metric: accuracy

 optimizer:
    name: Adam
    learning_rate: 0.0001
    betas: [0.9, 0.999]
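When two experiments are described by configs like these, finding out how they differ becomes a simple comparison. Here is a minimal sketch, assuming both configs have been parsed into dictionaries; the flatten and diff_configs helpers are hypothetical names, and the values are copied from the SGD and Adam examples.

```python
def flatten(config, prefix=""):
    """Turn nested sections into dotted keys, e.g. 'optimizer.name'."""
    flat = {}
    for key, value in config.items():
        name = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, name))
        else:
            flat[name] = value
    return flat


def diff_configs(left, right):
    """Map each differing dotted key to its (left, right) values.

    A key that is absent on one side is reported as None on that side.
    """
    flat_left, flat_right = flatten(left), flatten(right)
    keys = sorted(set(flat_left) | set(flat_right))
    return {
        key: (flat_left.get(key), flat_right.get(key))
        for key in keys
        if flat_left.get(key) != flat_right.get(key)
    }


shared_part = {
    "dataset": {"path": "/path/to/my/dataset"},
    "model": {"input_size": 112, "hidden_size": 128, "output_size": 42,
              "dropout": 0.1, "num_layers": 3},
    "metric": "accuracy",
}
sgd_run = {**shared_part, "optimizer": {
    "name": "SGD", "learning_rate": 0.001, "weight_decay": 1e-5}}
adam_run = {**shared_part, "optimizer": {
    "name": "Adam", "learning_rate": 0.0001, "betas": [0.9, 0.999]}}

differences = diff_configs(sgd_run, adam_run)
for key, (left_value, right_value) in differences.items():
    print(f"{key}: {left_value} -> {right_value}")
```

Only the optimizer entries show up in the result, which confirms that the two runs share the same data and model.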

Running a new experiment should not break previous experiments. Config files help you follow best practices for continuous integration and continuous delivery. Because the parameters of each experiment are in a config file, it’s easy to run several different experiments in parallel. This is particularly useful when running experiments on High-Performance Computing (HPC) clusters, where machines are shared with other teams and you have to wait for resources to free up. In this case, you often don’t know when the job with your experiment will run. If your experiment parameters are hardcoded, you need to wait for the experiment to start before moving to another experiment. Config files help here because the parameters are decoupled from the code: you can make progress on a new experiment while waiting for the first job to start without breaking anything. Your team members can also keep adding new features to the code while your job waits in the queue.
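Because each experiment is fully described by its own file, preparing several of them in advance takes only a few lines. The sketch below writes one config per parameter combination; the file naming, the swept values, and the use of JSON (chosen so the example runs with the standard library alone) are assumptions for illustration.

```python
import copy
import itertools
import json
import tempfile
from pathlib import Path

base = {
    "dataset": {"path": "/path/to/my/dataset"},
    "model": {"hidden_size": 128, "num_layers": 3},
    "optimizer": {"name": "SGD", "learning_rate": 0.001},
}

# Values to explore; every combination becomes its own experiment.
learning_rates = [0.01, 0.001]
num_layers_options = [3, 5]


def write_sweep(base_config, output_dir):
    """Write one config file per combination and return the file paths."""
    paths = []
    combinations = itertools.product(learning_rates, num_layers_options)
    for index, (learning_rate, num_layers) in enumerate(combinations):
        config = copy.deepcopy(base_config)
        config["optimizer"]["learning_rate"] = learning_rate
        config["model"]["num_layers"] = num_layers
        path = Path(output_dir) / f"experiment_{index}.json"
        path.write_text(json.dumps(config, indent=2))
        paths.append(path)
    return paths


# A temporary directory stands in for a real experiments folder.
with tempfile.TemporaryDirectory() as tmp:
    paths = write_sweep(base, tmp)
    configs = [json.loads(path.read_text()) for path in paths]

print(len(configs))
```

Each file can then be submitted as a separate job, and later edits to the code or to the base config do not alter the files already waiting in the queue.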

Config files also help make your work more easily available to others. Perhaps a colleague wants to incorporate your work into their work or to compare your version of the pipeline with theirs to see any differences. Config files make this much easier. And it never hurts to be able to share work that benefits others on your team.

Advantage 3: Simplicity, Efficiency & Adoption

Config files can help structure projects, which simplifies the project itself and helps new team members get up to speed quickly.

If your domain is complicated, it’s difficult to work from the code files alone. Config files provide a compartmentalized view of a project. They let you reuse pieces of your configuration and make the project easier to understand, because colleagues don’t need to scroll through many pages of code to see the big picture.

Config files help researchers with project alignment and standardization, giving flexibility and easier code reuse across projects. Moving from one project or product to another can be a lot of work, and config files provide a neat way to package work already done in one space for reuse in another.

The benefits here aren’t only functional: config files help make projects look simpler and boost efficiency. Simplicity increases adoption and motivation, increasing return on the time and energy invested in creating products. If you make something easy to use, more people will use it.
