Most problems in Data Science fall into one of two categories: Classification or Regression. It is important to understand the distinction, as it ultimately determines which models are suitable for a task and how they should be evaluated. But what is a classification problem, and what is a regression problem? How do they differ, and how does the difference affect our evaluation of models? I will attempt to explain this below.
Classification
A classification problem is one where the outcome is made up of discrete results. These are items that are distinguishable and do not overlap. An example would be distinguishing between items of different colors - red, green, blue, black. The colors have specific boundaries, and they do not overlap. Another simple example is determining whether something is True or False, or On or Off.
Regression
On the other hand, a regression problem is one where the result is continuous. The outcome of different data points is part of one continuous range. For example, let's assume that all the shades we can represent fall in a range of numbers from 0 to 255, where 0 represents black and 255 represents white. We could have 1.2 for a near-black, 127.5 for a mid-gray, 253.8 for an off-white, and so on. The outcome is part of one continuous range.
Evaluation
Classification
Classification problems tend to be the easiest to evaluate, and we have developed different metrics for determining how well a model is able to predict the outcome.
Methods
Some methods for assessing the performance of classification models include accuracy, precision, recall, F1-score, and AUC (Area Under the Curve). These metrics are normally presented as percentages or as numbers between 0 and 1. The higher the percentage, or the closer the value is to 1, the better the model's predictions match the actual values.
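As a minimal sketch of how the first four of these metrics relate to each other, they can be computed by hand for a binary problem in plain Python (the label lists below are made up for illustration):

```python
# Hypothetical actual and predicted binary labels (1 = positive, 0 = negative).
actual    = [1, 0, 1, 1, 0, 1, 0, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0]

# Count true positives, false positives, false negatives, and true negatives.
tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
tn = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 0)

accuracy  = (tp + tn) / len(actual)   # fraction of all predictions that were correct
precision = tp / (tp + fp)            # of the predicted positives, how many were right
recall    = tp / (tp + fn)            # of the actual positives, how many were found
f1        = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
```

In practice a library such as scikit-learn provides these metrics directly, but the hand computation makes it clear that all four values fall between 0 and 1.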
Regression
Unlike classification problems, where the goal is to predict a label, regression problems focus on predicting a numeric value. Examples include temperature, height, and velocity. These numeric values can be plotted on a graph and compared to the actual values. The residual, or error, is the difference between the predicted and actual values at each point.
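A minimal sketch of this idea, using made-up actual and predicted values:

```python
# Hypothetical actual and predicted numeric values.
actual    = [10.0, 12.0, 14.5, 18.0]
predicted = [ 9.5, 12.5, 14.0, 19.0]

# The residual (error) at each point is actual minus predicted.
residuals = [a - p for a, p in zip(actual, predicted)]
# residuals: [0.5, -0.5, 0.5, -1.0]
```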
Methods
Two of the more common methods for evaluating a regression model are the Mean Squared Error (MSE) and the Root Mean Squared Error (RMSE). The RMSE is often preferred because it is expressed in the same units as the target, so it can be read as a typical deviation from the actual values. The smaller the RMSE, the closer the predicted curve is to the actual curve.
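Both metrics follow directly from the residuals. A sketch in plain Python, reusing the same made-up values:

```python
import math

# Hypothetical actual and predicted numeric values.
actual    = [10.0, 12.0, 14.5, 18.0]
predicted = [ 9.5, 12.5, 14.0, 19.0]

# Mean Squared Error: the average of the squared residuals.
mse = sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

# Root Mean Squared Error: the square root of the MSE,
# expressed in the same units as the target values.
rmse = math.sqrt(mse)
```

Squaring penalizes large residuals more heavily than small ones, which is why a single badly predicted point can noticeably increase both metrics.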
As a side note, a regression problem can be recast as a classification problem if the results are binned. For example, values can be grouped into ranges, and a value that falls within a range is classified as belonging to that bin. However, it is always important to determine what the expected result of training a model should be - will it be a value or a label? Having determined this, you can choose between models used for classification or regression. Finally, once you know what sort of model you are going to use, you can determine how best to evaluate it, e.g. accuracy, precision, recall, and F1-score for classification, or MSE and RMSE for regression.
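A minimal sketch of this binning idea, using the 0-255 gray-level range from earlier; the thresholds and label names here are arbitrary choices for illustration:

```python
def bin_gray_level(value):
    """Map a continuous gray level (0-255) to a coarse discrete label.

    The thresholds below are hypothetical; any binning scheme that
    suits the problem could be used instead.
    """
    if value < 85:
        return "dark"
    elif value < 170:
        return "medium"
    else:
        return "light"

# Continuous regression-style values become discrete classification labels.
labels = [bin_gray_level(v) for v in [1.2, 127.5, 253.8]]
# labels: ["dark", "medium", "light"]
```

Once the values are binned, classification metrics such as accuracy or F1-score apply, rather than MSE or RMSE.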