Outliers in data and how to deal with them

Term explanation

In image processing, as an area of signal processing, modelling the data and the expected values is important in almost every application: the data represent the problem that needs to be addressed. That is why we need to know what kind of data to expect, and which values are the result of measurement errors, faulty data, erroneous procedures, or simply of areas where a certain theory might not be valid. To improve the model and obtain better results from our applications, we must recognize and deal with outliers in the data.

What is an outlier?

In statistics, an outlier is a data point that differs significantly from other observations. Outliers in the data can be very dangerous, since they change the classical data statistics, such as the mean and the variance of the data. This affects the results of algorithms of any kind (image processing, machine learning, deep learning…). So, when modeling, it is extremely important to clean the data sample to ensure that the observations best represent the problem.
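To see how much damage even a single outlier can do, here is a minimal Python sketch. The measurement values and the faulty reading are made up purely for illustration: one bad point among ten shifts the mean and inflates the standard deviation dramatically.

```python
import numpy as np

# Nine "typical" measurements and one faulty reading (illustrative values).
clean = np.array([4.8, 5.1, 4.9, 5.0, 5.2, 4.9, 5.1, 5.0, 4.9])
with_outlier = np.append(clean, 50.0)   # a single faulty reading

print("mean without outlier:", clean.mean())              # ~5.0
print("mean with outlier:   ", with_outlier.mean())       # ~9.5
print("std  without outlier:", clean.std(ddof=1))         # ~0.13
print("std  with outlier:   ", with_outlier.std(ddof=1))  # ~14.2
```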

How to deal with outliers in the data

The one thing we know about outliers is that they do not fit the model we assumed; we do not know anything else about them, such as when they will appear or what values they will have. We just know that we must stop them from messing with our results. But how?

  • The first step in determining the outliers is getting to know the data for the specific application, so we must have some test dataset to start from.
  • The next step is to find the data distribution (based on the available dataset), which can sometimes be a tricky task. Let us assume that the data follow a normal (Gaussian) distribution.
[Figure: the normal (Gaussian) distribution]
  • Once we are familiar with the distribution of the data, we can identify outliers more easily. There is no precise way to define and identify outliers in general, so we must know how to define them for our specific application.
  • We can now use statistical methods to identify observations that appear to be rare or unlikely given the available data. Outliers can occur by chance in any distribution, but they often indicate either measurement error or that the population has a heavy-tailed distribution.
  • In the former case one wishes to discard them or use statistics that are robust to outliers, while in the latter case they indicate that the distribution has high skewness and that one should be very cautious in using tools or intuitions that assume a normal distribution.
  • In most larger samplings of data, some data points will be further away from the sample mean than what is deemed reasonable. This can be due to incidental systematic error or flaws in the theory that generated an assumed family of probability distributions, or it may be that some observations are far from the center of the data. In large samples, a small number of outliers is to be expected (and not due to any anomalous condition).
  • Now we can deal with the outliers. In offline applications we can simply remove them from the dataset, as sketched in the example below. In real-time online processing, on the other hand, we must use procedures that make the application itself more robust.
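As a concrete illustration of the "identify, then remove" workflow under the Gaussian assumption, here is a sketch in Python using the common 3-sigma (z-score) rule. The synthetic data, the threshold of 3 and the injected "glitch" values are assumptions chosen for the example, not part of any particular application.

```python
import numpy as np

def flag_outliers_zscore(x, threshold=3.0):
    """Flag points whose absolute z-score exceeds the threshold.
    Assumes the bulk of the data is roughly Gaussian; the 3-sigma
    cut-off is a common convention, not a universal rule."""
    z = np.abs((x - x.mean()) / x.std(ddof=1))
    return z > threshold

rng = np.random.default_rng(0)
typical = rng.normal(loc=5.0, scale=0.5, size=200)   # "good" measurements
faulty = np.array([20.0, -7.0, 18.5])                # e.g. sensor glitches
data = np.concatenate([typical, faulty])

mask = flag_outliers_zscore(data)
cleaned = data[~mask]          # offline setting: simply drop the flagged points
print("number flagged:", mask.sum())   # expect the 3 injected glitches
```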

Remark:

One might think that a simple way to handle outliers is to detect them and remove them from the data set. Deleting an outlier, although better than doing nothing, still poses a number of problems:

  • When is deletion justified? Deletion requires a subjective decision.
  • When is an observation “outlying enough” to be deleted?
  • The user or the author of the data may think that “an observation is an observation” (i.e., observations should speak for themselves) and hence feel uneasy about deleting them.
  • Since there is generally some uncertainty as to whether an observation is really atypical, there is a risk of deleting “good” observations, which results in underestimating data variability.

Since the results depend on the user’s subjective decisions, it is difficult to determine the statistical behavior of the complete procedure.

Robust Statistics

Let’s say something about the normal distribution assumption. It is very common to assume a Gaussian distribution in many kinds of engineering problems. The most widely used model formalization is the assumption that the observed data have a normal (Gaussian) distribution. This assumption has been present in statistics, as well as in engineering, for two centuries and has been the framework for all the classical methods in regression, analysis of variance and multivariate analysis. The main justification for assuming a normal distribution is that it gives an approximate representation of many real data sets, and at the same time it is theoretically quite convenient, because it allows one to derive explicit formulas for optimal statistical methods such as maximum likelihood estimates, likelihood ratio tests, etc. We refer to such methods as classical statistical methods and note that they rely on the assumption that normality holds exactly. The classical statistics are, by modern computing standards, quite easy to compute.

Unfortunately, theoretical and computational convenience does not always deliver an adequate tool for the practice of statistics and data analysis. It often happens in practice that an assumed normal distribution model (e.g., the standard Kalman filter) holds approximately, in that it describes the majority of observations, but some observations follow a different pattern or no pattern at all.

Now we know that such atypical data are called outliers, and even a single outlier can have a large distorting influence on a classical statistical method that is optimal under the assumption of normality or linearity. The kind of “approximately” normal distribution that gives rise to outliers is one that has a normal shape in the central region but has tails that are heavier, or “fatter”, than those of a normal distribution. One might naively expect that if such approximate normality holds, then the results of normal distribution theory would also hold approximately. This is unfortunately not the case.

The robust approach to statistical modeling and data analysis aims at deriving methods that produce reliable parameter estimates and associated tests and confidence intervals, not only when the data follow a given distribution exactly, but also when this happens only approximately in the sense just described.

Robust methods fit the bulk of the data well: if the data contain no outliers the robust method gives approximately the same results as the classical method, while if a small proportion of outliers are present the robust method gives approximately the same results as the classical method applied to the “typical” data. As a consequence of fitting the bulk of the data well, robust methods provide a very reliable method of detecting outliers, even in high-dimensional multivariate situations.
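The following Python sketch illustrates this behaviour with two simple robust estimators: the median (for location) and the normalized median absolute deviation, MAD (for scale). The sample size, contamination level and random seed are arbitrary choices made only for illustration.

```python
import numpy as np

def mad(x):
    """Median absolute deviation, scaled by 1.4826 so that it is
    comparable to the standard deviation under a normal model."""
    return 1.4826 * np.median(np.abs(x - np.median(x)))

rng = np.random.default_rng(1)
clean = rng.normal(loc=0.0, scale=1.0, size=500)
contaminated = clean.copy()
contaminated[:25] += 15.0          # 5% of the points pushed far away

for name, x in [("clean", clean), ("5% outliers", contaminated)]:
    print(f"{name:12s}  mean={x.mean():6.2f}  std={x.std(ddof=1):5.2f}  "
          f"median={np.median(x):6.2f}  MAD={mad(x):5.2f}")
# The classical mean/std drift noticeably under contamination,
# while the median/MAD stay close to the values for the clean bulk.
```

On the clean sample the classical and robust estimates roughly agree, which is exactly the property described above: robustness costs little when the data are well behaved.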

We note that one approach to dealing with outliers is the diagnostic approach. Diagnostics are statistics generally based on classical estimates that aim at giving numerical or graphical clues for the detection of data departures from the assumed model. There is a considerable literature on outlier diagnostics, and a good outlier diagnostic is clearly better than doing nothing. However, these methods present two drawbacks. One is that they are in general not as reliable for detecting outliers as examining departures from a robust fit to the data. The other is that, once suspicious observations have been flagged, the actions to be taken with them remain the analyst’s personal decision, and thus there is no objective way to establish the properties of the result of the overall procedure.

Robust methods have a long history that can be traced back at least to the end of the nineteenth century. But the first great steps forward occurred in the 1960s and early 1970s, with the fundamental work of John Tukey (1960, 1962), Peter Huber (1964, 1967) and Frank Hampel (1971, 1974). The applicability of the new robust methods proposed by these researchers was made possible by the increased speed and accessibility of computers.

In this post we will not go any further into Robust Statistics. If you want to find out more, a new post will be published soon, and you can also consult the references given at the end. This was just a beginning and a warm-up for those who want to get started with designing more robust applications.