Machine learning is a powerful tool, but like any tool, it is easy to misuse, and misuse leads to poor results. Here are some common mistakes I'd like you to avoid.
Pitfall 1: Using ML When You Don't Need It
If you have a hammer, everything looks like a nail. Don't run around looking for ways to apply machine learning. Instead, do the opposite:
Identify relevant business problems first, then check whether ML is a good fit for them.
Pitfall 2: Not Having A Baseline
Double-check that you have an existing baseline to beat, such as simple heuristics or hard-coded rules.
Use ML only when you need to and can beat the baseline.
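As a minimal sketch of such a check, the following compares a model's accuracy against a majority-class baseline. The function names, the toy labels, and the min_gain threshold are assumptions made for illustration; pick a baseline and a margin that fit your own problem.

```python
from collections import Counter


def accuracy(predictions, labels):
    """Fraction of predictions that match the true labels."""
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


def majority_class(train_labels):
    """Simplest possible baseline: always predict the most frequent label."""
    return Counter(train_labels).most_common(1)[0][0]


def beats_baseline(model_accuracy, baseline_accuracy, min_gain=0.02):
    """True only if the model wins by at least min_gain (hypothetical threshold)."""
    return model_accuracy - baseline_accuracy >= min_gain


# Toy labels, made up for illustration.
train_labels = ["ham"] * 8 + ["spam"] * 2
test_labels = ["ham", "ham", "spam", "ham", "ham"]

baseline_label = majority_class(train_labels)
baseline_accuracy = accuracy([baseline_label] * len(test_labels), test_labels)
print(baseline_label, baseline_accuracy)  # ham 0.8
```

Under these assumptions, a model that reaches 0.81 on the same test set would not clear the margin, so the hard-coded rule would stay in place.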
Pitfall 3: Being Too Greedy
Chasing the last fraction of accuracy on your training data can make your model perform poorly on new data.
Be more conservative in your training and prefer simpler algorithms over complex ones, even if their accuracy is slightly lower.
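One way to make "slightly lower" concrete is to pick the simplest candidate whose validation accuracy is within a fixed tolerance of the best one. The sketch below assumes you have already ranked your candidates by complexity; the model names, ranks, scores, and the tolerance are all made-up placeholders.

```python
def pick_model(candidates, tolerance=0.01):
    """Return the name of the simplest model within `tolerance` of the best score.

    candidates: list of (name, complexity_rank, validation_accuracy),
    where a lower complexity_rank means a simpler model.
    """
    best_accuracy = max(acc for _, _, acc in candidates)
    good_enough = [c for c in candidates if best_accuracy - c[2] <= tolerance]
    return min(good_enough, key=lambda c: c[1])[0]


# Hypothetical validation results, for illustration only.
candidates = [
    ("logistic regression", 1, 0.912),
    ("gradient boosting", 2, 0.918),
    ("deep neural network", 3, 0.920),
]

print(pick_model(candidates))                   # logistic regression
print(pick_model(candidates, tolerance=0.005))  # gradient boosting
```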
Pitfall 4: Building Too Complex Models
Don't reach for neural networks everywhere just to gain high accuracy scores. Complex models have two disadvantages: First, they are harder to maintain and debug. Second, they are more likely to overfit (perform well during training but poorly on new data).
Constantly monitor your model's performance on new data as well as the model's complexity.
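A monitoring check can be as simple as comparing the training accuracy with the accuracy on each new batch of data and flagging batches where the gap grows too large. This is only a sketch: the function name, the max_gap threshold, and the scores are assumptions, not a standard.

```python
def flag_degraded_batches(train_accuracy, batch_accuracies, max_gap=0.05):
    """Return the indices of batches whose accuracy trails training by more than max_gap."""
    return [
        index
        for index, batch_accuracy in enumerate(batch_accuracies)
        if train_accuracy - batch_accuracy > max_gap
    ]


# Hypothetical accuracies on four consecutive batches of new data.
print(flag_degraded_batches(0.95, [0.93, 0.92, 0.88, 0.85]))  # [2, 3]
```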
Pitfall 5: Data Labeling Galore
Perfectly labeled data isn't just lying around; it comes at a cost. Experience shows that most ML algorithms eventually hit a plateau where additional training examples no longer yield significant accuracy gains.
To avoid wasting money on collecting more and more data, set your target first: What is the minimum accuracy you need?
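The decision can be sketched as a small rule applied to a learning curve, that is, the accuracy measured at increasing training set sizes. The function name, the min_gain threshold, and the curve values below are assumptions chosen for illustration.

```python
def labeling_decision(curve, target_accuracy, min_gain=0.005):
    """Decide whether labeling more data is still worthwhile.

    curve: list of (n_examples, accuracy), ordered by n_examples.
    """
    latest_accuracy = curve[-1][1]
    if latest_accuracy >= target_accuracy:
        return "target reached"
    if len(curve) >= 2 and latest_accuracy - curve[-2][1] < min_gain:
        return "plateau"
    return "keep labeling"


# Hypothetical learning curve: accuracy after each doubling of the training set.
curve = [(1000, 0.80), (2000, 0.86), (4000, 0.89), (8000, 0.892)]

print(labeling_decision(curve[:3], target_accuracy=0.95))  # keep labeling
print(labeling_decision(curve, target_accuracy=0.95))      # plateau
print(labeling_decision(curve, target_accuracy=0.89))      # target reached
```

A "plateau" result before the target is reached suggests that more labels alone will not get you there, and that the features or the model need attention instead.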
Pitfall 6: Ignoring Outliers
Outliers are data points that lie far from the bulk of your dataset. There may be only a few of them, but their impact can be huge for many algorithms, especially those working on regression tasks.
Pay close attention to outliers in your dataset!
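A common rule of thumb, used here as an assumption rather than the only valid definition, flags values that lie more than 1.5 interquartile ranges outside the first or third quartile. The sketch below uses only the Python standard library; the sample values are made up.

```python
import statistics


def find_outliers(values, k=1.5):
    """Return the values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


values = [10, 12, 11, 13, 12, 11, 95]

print(find_outliers(values))              # [95]
print(statistics.median(values))          # 12
print(round(statistics.mean(values), 2))  # 23.43
```

The single value 95 roughly doubles the mean while leaving the median untouched, which is why methods built on averages and squared errors are so sensitive to outliers.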
Pitfall 7: The Curse Of Dimensionality
More features aren't always better and can even be counterproductive. The attribute "US ZIP Code" will not increase the dimensionality by 1 but by something on the order of 41,692, because many ML algorithms require categorical variables to be encoded with one column per category. ML algorithms struggle to extract patterns from very sparse data. To avoid this:
Be careful when adding new features, and don't hesitate to drop redundant or irrelevant ones.
Reduce the dependencies between attributes by combining them into a single attribute. For example, use "price per square foot" instead of "price" and "size".
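To illustrate both points, the sketch below first one-hot encodes a categorical column, which creates one new column per distinct value, and then merges two dependent attributes into a single ratio. The helper name one_hot, the field names, and all values are hypothetical.

```python
def one_hot(values):
    """Encode a categorical column as one 0/1 column per distinct category."""
    categories = sorted(set(values))
    column_of = {category: i for i, category in enumerate(categories)}
    rows = []
    for value in values:
        row = [0] * len(categories)
        row[column_of[value]] = 1
        rows.append(row)
    return categories, rows


# One categorical attribute becomes as many columns as it has distinct values.
zip_codes = ["10001", "94105", "60601", "10001"]
categories, encoded = one_hot(zip_codes)
print(len(categories))  # 3
print(encoded[0])       # [1, 0, 0]

# Two dependent attributes merged into a single derived attribute.
listings = [
    {"price": 300000, "size_sqft": 1500},
    {"price": 450000, "size_sqft": 1800},
]
price_per_sqft = [item["price"] / item["size_sqft"] for item in listings]
print(price_per_sqft)   # [200.0, 250.0]
```

With only three distinct ZIP codes the encoding stays small, but the column count grows with every new category, and almost every entry in each row is zero.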