A few months ago, I set out to build a model to predict the number of COVID cases in the NYC area. My dataset was a "mixed bag" of variables: the number of hospitalizations, deaths, and confirmed cases, plus the weekly popularity percentage of the search term "COVID" from 2020 to 2025. Before I could train my models, I had to scale my data, since the variables were not all on the same scale. Initially, I used Min-Max Normalization. Later, I switched to Standardization (z-score) and saw a 33% improvement in my RMSE (Root Mean Square Error). That prompted me to ask: why? Why did changing my scaling method change the outcome, and how can I tell when to use either method? The answer came down to the difference between bounded and unbounded data, and to recognizing what in the data signals which method to use.
What is bounded data?
So, let's define bounded data. Bounded data is data with a limit on both its upper and lower values. In short, it has a known maximum and minimum. For example, suppose you are tasked with predicting the height of a college basketball player given his age, and your dataset consists of the ages and heights of college basketball players on rosters across the U.S. for a given year. The predictor, age, has a more or less fixed lower and upper limit, typically between 20 and 22. That predictor is therefore said to be bounded.
What is unbounded data?
On the other hand, unbounded data is data whose limits are not fixed in advance. The maximum and/or minimum are open-ended. In the case I described earlier, my data included the number of daily hospitalizations, confirmed cases, and deaths due to COVID within one geographic area, NYC. Any of these predictors can be as small as 0 and, in theory, as large as the current population of that area. Strictly speaking, then, the upper limit is bounded by the population. From the model's point of view, however, the number can swing from zero to tens of thousands on the discovery of a new COVID variant or a change in public policy, far beyond anything seen in the data so far. For practical purposes, the upper limit in this case is unbounded.
How does this impact the scaling of your data?
Both Min-Max Normalization and Standardization (z-score) are well-known methods of putting your data's features onto the same scale. However, each method comes with caveats. The formulas for Min-Max Normalization and Standardization (z-score) are as follows:
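Min-Max Normalization:
$$x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$$
Standardization (z-score):
$$z = \frac{x - \mu}{\sigma}$$
Here $x$ is a single value of the feature, $x_{\min}$ and $x_{\max}$ are the feature's minimum and maximum, and $\mu$ and $\sigma$ are its mean and standard deviation.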
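To make the two formulas concrete, here is a minimal sketch in plain Python. The function names and the sample ages are made up for illustration, and I assume the population standard deviation for the z-score; in a real project you would more likely reach for a library scaler.

```python
from statistics import fmean, pstdev


def min_max_normalize(values):
    """Rescale values to [0, 1] using the observed minimum and maximum."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("min-max is undefined when all values are equal")
    return [(x - lo) / (hi - lo) for x in values]


def standardize(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mu, sigma = fmean(values), pstdev(values)
    if sigma == 0:
        raise ValueError("z-score is undefined when all values are equal")
    return [(x - mu) / sigma for x in values]


# Hypothetical ages, chosen only to illustrate a bounded feature.
ages = [20, 21, 22, 21, 20]
ages_minmax = min_max_normalize(ages)  # every value lands in [0, 1]
ages_z = standardize(ages)             # centered on 0, spread of 1
```

Note that the min-max result is pinned to a fixed window, while the z-score result has no fixed minimum or maximum.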
Min-Max Normalization, one of the more commonly used methods, is sensitive to outliers. Its scale is set entirely by the minimum and maximum values in the formula above, so a single extreme value stretches the range and squeezes the rest of the data into a narrow band. Conversely, if your data has a clearly defined minimum and maximum and every value falls within that range, it is a good fit for Min-Max Normalization. As a rule of thumb, when scaling data that is bounded, it is best to use normalization.
In contrast, Standardization (z-score) is more resistant to the effects of outliers than Min-Max Normalization, though not immune to them. Its scale is set by the mean and standard deviation, which are computed from every observation rather than from the two most extreme ones, so outliers at either end of your data distort it less. Therefore, if your data has outliers and/or an unbounded range, it is best to use Standardization (z-score) to scale your data.
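One way to see this difference in sensitivity is to compare how much each method's denominator grows when a single outlier is added. The sketch below uses made-up numbers (100 ordinary values between 10 and 13, plus one spike of 500) and assumes the population standard deviation. Both methods are linear rescalings, so the outlier affects both; the point is only how strongly each scale factor reacts.

```python
from statistics import pstdev

# Made-up data: 100 "ordinary" values between 10 and 13, then one spike.
typical = [10 + (i % 4) for i in range(100)]
with_outlier = typical + [500]


def value_range(values):
    """Denominator of min-max normalization: max minus min."""
    return max(values) - min(values)


# How many times larger does each denominator get because of one outlier?
range_inflation = value_range(with_outlier) / value_range(typical)
std_inflation = pstdev(with_outlier) / pstdev(typical)

print(f"range grew {range_inflation:.0f}x, std dev grew {std_inflation:.0f}x")
```

With these numbers the range grows several times more than the standard deviation does, and the gap widens as the dataset gets larger, because one point carries less weight in the standard deviation while it still fully determines the range.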
What if my predictors are a mix of both?
With large datasets, it is rare for the predictors to be all bounded or all unbounded. If your dataset has a mix of both bounded and unbounded predictors, it is best to use Standardization (z-score).
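As a rough sketch of what that looks like for a mixed dataset, the snippet below standardizes a bounded column and an unbounded column with the same procedure, then applies the fitted parameters to a new observation. The column names and all numbers are hypothetical, loosely modeled on the COVID example, and are not my actual data.

```python
from statistics import fmean, pstdev

# Hypothetical training data: one bounded and one unbounded predictor.
train = {
    "search_popularity_pct": [12, 35, 60, 48, 20],  # bounded: 0 to 100
    "daily_cases": [150, 900, 4200, 2600, 400],     # no practical ceiling
}

# Fit the mean and (population) standard deviation per column,
# using the training rows only.
params = {name: (fmean(col), pstdev(col)) for name, col in train.items()}


def scale(name, x):
    """Standardize one value with the parameters fitted for its column."""
    mu, sigma = params[name]
    return (x - mu) / sigma


scaled_train = {
    name: [scale(name, x) for x in col] for name, col in train.items()
}

# A new observation during a surge, beyond anything in the training data.
new_row = {"search_popularity_pct": 95, "daily_cases": 12000}
scaled_new = {name: scale(name, x) for name, x in new_row.items()}
```

The surge value simply comes out as a large z-score. Nothing in the procedure relies on new values staying inside the range seen during training, which is why it copes with both kinds of column.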
Summary
To recap, bounded data is data where the features have known maximum and minimum limits. Unbounded data, on the other hand, has no fixed limit on its maximum and/or minimum. Min-Max Normalization is best used on bounded data, whereas Standardization (z-score) is best used on unbounded data or on data that has a mix of both. Understanding the nature of your dataset is the key to scaling your data properly and getting the most out of your model.