7 Painful Pitfalls Of Machine Learning And How To Avoid Them

Machine learning is a powerful tool, but like any other tool, it is easy to misuse, and misuse leads to poor results. Here are seven common mistakes I'd like you to avoid.

Pitfall 1: Using ML When You Don't Need It

If you have a hammer, everything looks like a nail. Don't run around looking for ways to apply machine learning. Instead, do the opposite:

Identify relevant business problems first and see if ML could be a good fit later.

Pitfall 2: Not Having A Baseline

Double-check that you have an existing baseline to beat, such as simple heuristics or hard-coded rules.

Use ML only when you need to and can beat the baseline.
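Here is a minimal sketch of such a comparison, assuming a binary classification task. The labels and `model_predictions` are made-up illustration data, and the majority-class rule is only a stand-in for whatever heuristic you already have.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def majority_baseline(y_train, n):
    """Predict the most frequent training label for every example."""
    most_common_label = Counter(y_train).most_common(1)[0][0]
    return [most_common_label] * n

# Hypothetical data: 1 = customer churned, 0 = customer stayed.
y_train = [0, 0, 0, 1, 0, 0, 1, 0, 0, 0]
y_test = [0, 0, 1, 0, 0, 0, 0, 1, 0, 0]
model_predictions = [0, 0, 1, 0, 0, 1, 0, 0, 0, 0]  # output of some trained model

baseline_acc = accuracy(y_test, majority_baseline(y_train, len(y_test)))
model_acc = accuracy(y_test, model_predictions)

print(f"baseline: {baseline_acc:.2f}, model: {model_acc:.2f}")
if model_acc <= baseline_acc:
    print("The model does not beat the baseline; keep the simple rule.")
```

In this toy data the model's 80% accuracy sounds respectable until you notice that always predicting "stayed" scores exactly the same.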

Pitfall 3: Being Too Greedy

Chasing every last bit of accuracy on your training data can lead to poor performance of your model on new data.

Be more conservative in your training and choose simpler algorithms over complex ones, even if their accuracy is slightly lower.
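To make the trade-off concrete, here is a small sketch on synthetic data. Everything in it is an assumption made for illustration: the data generator, the 20% label noise, the fixed threshold of the simple rule, and the use of a 1-nearest-neighbor "memorizer" as the complex model.

```python
import random

random.seed(0)

def make_data(n):
    """One feature x in [0, 1); label is 1 when x > 0.5, with 20% of labels flipped."""
    data = []
    for _ in range(n):
        x = random.random()
        label = int(x > 0.5)
        if random.random() < 0.2:
            label = 1 - label  # label noise
        data.append((x, label))
    return data

def simple_rule(x):
    """Simple model: a single fixed threshold (assumed known to keep the sketch short)."""
    return int(x > 0.5)

def make_memorizer(train):
    """Complex model: 1-nearest-neighbor, which memorizes every training point."""
    def predict(x):
        return min(train, key=lambda point: abs(point[0] - x))[1]
    return predict

def accuracy(predict, data):
    """Fraction of examples for which predict(x) equals the label."""
    return sum(predict(x) == y for x, y in data) / len(data)

train, test = make_data(200), make_data(200)
memorizer = make_memorizer(train)

for name, model in [("simple rule", simple_rule), ("memorizer", memorizer)]:
    print(f"{name}: train={accuracy(model, train):.2f}, test={accuracy(model, test):.2f}")
```

The memorizer is perfect on the training set by construction, because every training point is its own nearest neighbor. The number to compare is the one on the held-out set, where a model that has memorized noisy labels typically loses its advantage.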

Pitfall 4: Building Too Complex Models

Don't throw neural networks at every problem just to gain high accuracy scores. Complex models have two disadvantages: First, they are harder to maintain and debug. Second, they are more likely to overfit (perform well during training but poorly on new data).

Constantly monitor your model's performance on new data as well as the model's complexity.
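Monitoring can start very small. The sketch below assumes that the true outcomes for your predictions arrive eventually, so you can compare the two; the class name `AccuracyMonitor`, the window size, and the alert threshold are hypothetical choices, not a standard API.

```python
from collections import deque

class AccuracyMonitor:
    """Tracks accuracy over the most recent `window` labeled predictions."""

    def __init__(self, window=100, alert_below=0.8):
        self.results = deque(maxlen=window)  # old results drop out automatically
        self.alert_below = alert_below

    def record(self, prediction, actual):
        """Store whether a single prediction turned out to be correct."""
        self.results.append(prediction == actual)

    def accuracy(self):
        """Accuracy over the current window, or None if nothing was recorded yet."""
        return sum(self.results) / len(self.results) if self.results else None

    def needs_attention(self):
        """True when the windowed accuracy has dropped below the alert threshold."""
        acc = self.accuracy()
        return acc is not None and acc < self.alert_below

monitor = AccuracyMonitor(window=5, alert_below=0.8)
for prediction, actual in [(1, 1), (0, 0), (1, 0), (0, 1), (1, 1)]:
    monitor.record(prediction, actual)

print(monitor.accuracy(), monitor.needs_attention())
```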

Pitfall 5: Data Labeling Galore

Perfectly labeled data isn't just lying around; it comes at a cost. Experience shows that most ML algorithms at some point hit a plateau where additional training examples no longer result in significant accuracy gains.

To avoid wasting money on collecting more and more data, set your targets first: What is the minimum accuracy you need?
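One way to put this into practice is to track accuracy as the labeled dataset grows and stop once the target is met or the gains flatten out. The sketch below assumes you have already measured validation accuracy at a few dataset sizes; the numbers in `learning_curve`, the helper names, and the `min_gain` threshold are invented for illustration.

```python
def first_sufficient_size(learning_curve, target_accuracy):
    """Return the smallest training-set size that reaches the target, or None."""
    for size, acc in learning_curve:
        if acc >= target_accuracy:
            return size
    return None

def has_plateaued(learning_curve, min_gain=0.005):
    """True when the latest increase in data improved accuracy by less than min_gain."""
    if len(learning_curve) < 2:
        return False
    (_, previous_acc), (_, latest_acc) = learning_curve[-2:]
    return latest_acc - previous_acc < min_gain

# Hypothetical measurements: (number of labeled examples, validation accuracy).
learning_curve = [(500, 0.71), (1000, 0.80), (2000, 0.86), (4000, 0.88), (8000, 0.882)]

print(first_sufficient_size(learning_curve, target_accuracy=0.85))
print(has_plateaued(learning_curve))
```

With a target of 85%, labeling could have stopped at 2,000 examples in this made-up curve; doubling from 4,000 to 8,000 bought almost nothing.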

Pitfall 6: Ignoring Outliers

Outliers are data points that lie far away from the bulk of your dataset. There may be only a few of them, but their impact can be huge for many algorithms, especially those working on regression tasks.

Pay close attention to outliers in your dataset!
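A quick way to see both the problem and a possible check is sketched below. It uses the interquartile range (IQR) rule, which is one common convention for flagging outliers and not the only one; the delivery times and the factor `k=1.5` are assumptions chosen for illustration.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return the values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Hypothetical delivery times in minutes; 95 looks suspicious.
delivery_minutes = [10, 12, 11, 13, 12, 11, 10, 12, 95]

outliers = iqr_outliers(delivery_minutes)
cleaned = [v for v in delivery_minutes if v not in outliers]

print("outliers:", outliers)
print("mean with / without outliers:",
      statistics.mean(delivery_minutes), statistics.mean(cleaned))
print("median with / without outliers:",
      statistics.median(delivery_minutes), statistics.median(cleaned))
```

A single extreme value pulls the mean far away from the typical delivery time, while the median barely moves. Whether to drop, cap, or keep a flagged point is a judgment call that depends on whether it is a measurement error or a real, rare event.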

Pitfall 7: The Curse Of Dimensionality

More features aren't always better and can even be counterproductive: The attribute "US ZIP Code" will not increase the dimensionality by 1 but by something on the order of 41,692, because many ML algorithms require categorical variables to be one-hot encoded, that is, expanded into one binary column per distinct value. The result is very sparse data, and ML algorithms struggle to extract patterns from it. To avoid this:

Be careful when adding new features, and don't hesitate to drop redundant or irrelevant ones.
Reduce the dependencies between attributes by encoding them as a single attribute. For example, use "price per square foot" instead of "price" and "size".
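Both points can be sketched in a few lines. The listings below are invented, and the hand-written `one_hot` helper is a hypothetical stand-in for whatever encoder your ML library provides.

```python
def one_hot(values):
    """Encode a categorical column as one binary column per distinct value."""
    categories = sorted(set(values))
    rows = [[int(value == category) for category in categories] for value in values]
    return categories, rows

# Hypothetical listings: (ZIP code, price in USD, size in square feet).
listings = [
    ("10001", 900_000, 750),
    ("94105", 1_200_000, 1_000),
    ("60601", 450_000, 900),
    ("10001", 1_500_000, 1_250),
]

# One categorical attribute turns into as many columns as there are distinct values.
categories, encoded = one_hot([zip_code for zip_code, _, _ in listings])
print(f"1 ZIP code attribute -> {len(categories)} columns")
for row in encoded:
    print(row)

# Two dependent attributes merged into a single feature.
price_per_sqft = [price / size for _, price, size in listings]
print(price_per_sqft)
```

Each encoded row contains exactly one 1, no matter how many columns there are. With only three distinct ZIP codes this is harmless; with tens of thousands, almost every cell is 0, which is the sparsity problem described above.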