Solving an ML Research Problem

machine-learning, research, problem-solving

Study the Problem #

Define the Metric #

Study the Data #

“Data is the King of Kings”

List Solutions #

Note: Read and think a lot; be a researcher. Build expert intuition. I’m sure you’ll get stuck at some point without it.

Create Evaluation Pipeline #

At this point we know the literature, the possible solutions, the data, and the metric. Now we need a pipeline to evaluate the solutions.

We do this before starting the experiments, while we still have a clear outlook on the problem and before we become biased by the effort we will later put into particular experiments.
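
Here is a minimal sketch of such a pipeline. The scikit-learn-style fit/predict interface, the macro-F1 metric, and the synthetic data below are all stand-ins for whatever your problem actually uses; the point is that every candidate solution is scored by the same function on the same fixed split.

```python
# A minimal sketch of a shared evaluation harness (assumed interfaces, not a
# definitive implementation): fix the metric and the split once, score every
# candidate solution through the same code path.
from dataclasses import dataclass

import numpy as np
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split


@dataclass
class EvalResult:
    name: str
    score: float


def evaluate(name, model, X_train, y_train, X_val, y_val) -> EvalResult:
    """Train one candidate solution and score it on the fixed validation split."""
    model.fit(X_train, y_train)
    preds = model.predict(X_val)
    # The metric is decided up front; every solution goes through this exact call.
    return EvalResult(name=name, score=f1_score(y_val, preds, average="macro"))


# Hypothetical data; replace with your own loading code. The split is created
# once and reused for every experiment so the numbers stay comparable.
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 16))
y = (X[:, 0] + 0.5 * rng.normal(size=2000) > 0).astype(int)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)
```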

Establish Baseline #

Do not trust papers (sorry). Reported numbers are often misleading in practical settings.
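
Establish the baseline yourself, on your own data and through your own pipeline. A small sketch, assuming a classification problem and reusing the `evaluate()` helper and splits from the pipeline sketch above: a majority-class predictor and a plain logistic regression give you two cheap reference points before anything more ambitious.

```python
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression

# Two cheap baselines, measured with the same evaluate() pipeline that every
# later solution will go through.
baselines = {
    "majority class": DummyClassifier(strategy="most_frequent"),
    "logistic regression": LogisticRegression(max_iter=1000),
}

for name, model in baselines.items():
    result = evaluate(name, model, X_train, y_train, X_val, y_val)
    print(f"{result.name}: {result.score:.3f}")
```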

Pick the Best ROI Solution #

Experiment #

Use a comparison split. It is hard to compare successive training runs when data collection happens in parallel. Create a specific data split that stays constant (the comparison split) and run evaluation on it alongside your validation set. This lets you compare models fairly and justify what works as you progress. Replace the comparison split when the validation curve and the comparison curve diverge, which signals a change in the data distribution, and re-run all previously compared models on the new split.
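
One way to implement this, assuming your examples live in arrays; the file name and helper names below are hypothetical. Freeze the comparison split's indices on disk the first time you need them, then score every run on both the current validation set and the frozen comparison set.

```python
import json
from pathlib import Path

import numpy as np

COMPARISON_IDS = Path("comparison_split_v1.json")  # hypothetical location


def get_comparison_indices(n_examples: int, size: int = 2000, seed: int = 0) -> np.ndarray:
    """Load the frozen comparison split, creating it once if it does not exist yet."""
    if COMPARISON_IDS.exists():
        return np.array(json.loads(COMPARISON_IDS.read_text()))
    size = min(size, n_examples)
    idx = np.random.default_rng(seed).choice(n_examples, size=size, replace=False)
    COMPARISON_IDS.write_text(json.dumps(idx.tolist()))
    return idx


def evaluate_run(model, X, y, X_val, y_val, metric):
    """Score a trained model on both the current validation set and the frozen comparison set."""
    comp_idx = get_comparison_indices(len(X))
    return {
        "val": metric(y_val, model.predict(X_val)),
        "comparison": metric(y[comp_idx], model.predict(X[comp_idx])),
    }
```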

Don’t become obsessed with a particular solution. The goal is to solve the problem, not to make your favorite solution work.

YOLO runs: big models with big changes. Sometimes you can’t go incremental because of budget, deadlines, time, etc. In my experience, the success of YOLO runs correlates with the experience and programming skills of the person running them.

Codebase. Make sure you have a codebase that is modular, easy to work with, and robust enough to support all of the tasks above.

Automate as much as possible. We are lazy and we get lazier as we spend more time on a project. Therefore, automate any step you take as soon as possible to reduce the cognitive load of repetitive TODOs.
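
For example, instead of copying numbers into a spreadsheet by hand, append each run's metrics to a shared results file the moment evaluation finishes. A small sketch with hypothetical file and field names:

```python
import csv
import datetime
from pathlib import Path

RESULTS = Path("results.csv")  # hypothetical log file, one row per run


def log_run(run_name: str, metrics: dict) -> None:
    """Append a run's metrics to the shared results file so no run goes unrecorded."""
    row = {"timestamp": datetime.datetime.now().isoformat(), "run": run_name, **metrics}
    write_header = not RESULTS.exists()
    with RESULTS.open("a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(row))
        if write_header:
            writer.writeheader()
        writer.writerow(row)


# Called right after evaluation, e.g.:
# log_run("baseline-logreg", {"val": 0.71, "comparison": 0.69})
```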

Iterate #

Always think about the data before changing your model. Getting more high-quality data is the most effective way to improve performance, unless the limitation is inherently one of model capacity.
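
A cheap way to check which one is the bottleneck is a learning curve: train the same model on growing fractions of the data and watch the validation metric. If the curve is still climbing at the largest size, more data will likely help; if it has flattened, look at the model. A sketch using scikit-learn's learning_curve with stand-in data and model:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

# Hypothetical stand-in data and model; swap in your own.
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 16))
y = (X[:, 0] + 0.5 * rng.normal(size=2000) > 0).astype(int)

# Train on growing fractions of the data, always scoring with cross-validation.
sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=1000),
    X,
    y,
    train_sizes=np.linspace(0.1, 1.0, 5),
    cv=5,
    scoring="f1_macro",
)

for n, score in zip(sizes, val_scores.mean(axis=1)):
    print(f"{n:5d} training examples -> validation F1 {score:.3f}")
# Still climbing at the largest size? Collect more data. Flat? Look at the model.
```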