**Term explanation**

In image processing, as an area of signal processing, **modelling the data** and their expected values is very important in all kinds of applications: **the data represent the problem** that needs to be addressed. That is why we need to know **what kind of data to expect**, which values are the result of **measurement errors, faulty data, or erroneous procedures**, and which are the **areas where a certain theory might not be valid**. So, **to improve the model** and obtain better results in our applications, we must **recognize and deal with outliers** in the data.

In statistics, an **outlier is a data point that differs significantly from other observations**. **Outliers** in the data can be very **dangerous**, since they **change** the classical data statistics, such as the **mean value and variance of the data**. This **affects the results** of any algorithm (image processing, machine learning, deep learning…). So, when modeling, it is extremely **important to clean the data sample** to ensure that the observations best represent the problem.
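To see how much a single outlier can shift the classical statistics, here is a small illustrative example (the numbers are made up for demonstration):

```python
# One faulty reading shifts the mean and inflates the standard
# deviation of an otherwise well-behaved sample.
from statistics import mean, pstdev

clean = [9.8, 10.1, 10.0, 9.9, 10.2]   # typical measurements
with_outlier = clean + [100.0]          # one erroneous measurement

print(mean(clean), pstdev(clean))                 # ~10.0, ~0.14
print(mean(with_outlier), pstdev(with_outlier))   # ~25.0, ~33.5
```

A single bad point moved the mean from 10 to 25 and inflated the standard deviation by a factor of more than 200.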

**How to deal with outliers in the data**

The one thing we know about **outliers** is that they **do not fit the model we assumed**, but we know little else about them: when they will appear, or what values they will have. We just know that we must **stop them from distorting our results**. **But how?**

- The first step in determining the outliers is **getting to know the data for the specific application**. So, we must have some test **dataset** and start from there.
- The next step is **to find the data distribution** (according to the available dataset), which can sometimes be a **tricky task**. Let us **assume** that **the data have a normal (Gaussian) distribution**.
- Once we are familiar with the distribution of the data, we can **identify outliers more easily**. There is no precise way to define and identify outliers in general, so we must know how to **define them for our specific application**.
- We can now use statistical methods to identify observations that appear **rare or unlikely given the available data**. **Outliers** can occur by chance in any distribution, but they often **indicate either measurement error** or that **the population has a heavy-tailed distribution**.
- In the former case one wishes to **discard them or use statistics that are robust to outliers**, while in the latter case they indicate that the distribution has high skewness and that one should be **very cautious in using tools or intuitions that assume a normal distribution**.
- In most larger samplings of data, some data points will be further away from the sample mean than what is deemed reasonable. This can be due to **incidental systematic error** or **flaws in the theory that generated an assumed family of probability distributions**, or it may be that some observations are far from the center of the data. In large samples, a small number of outliers is to be expected (and not due to any anomalous condition).
- Now we can deal with the outliers. We can **remove** them from our dataset if we are dealing with **offline** applications. But if we are dealing with **real-time online processing**, then we must use procedures that make our application more robust.
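The steps above can be sketched in a few lines of code. Under the Gaussian assumption, a common (though arbitrary) rule flags any point more than three standard deviations from the mean; the function name, threshold, and data below are illustrative choices, not a prescription:

```python
# A minimal sketch of z-score outlier detection, assuming the data
# are approximately Gaussian. The 3-sigma threshold is a common but
# application-dependent choice.
from statistics import mean, pstdev

def split_outliers(data, threshold=3.0):
    """Return (inliers, outliers) using the classical z-score rule."""
    mu = mean(data)
    sigma = pstdev(data)
    inliers = [x for x in data if abs(x - mu) <= threshold * sigma]
    outliers = [x for x in data if abs(x - mu) > threshold * sigma]
    return inliers, outliers

data = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 9.7,
        10.3, 10.1, 9.9, 10.0, 10.0, 55.0]
good, bad = split_outliers(data)   # bad == [55.0]
```

Note that this rule computes the mean and standard deviation from the contaminated sample itself, so in small samples a large outlier can inflate sigma enough to hide itself; this "masking" effect is one motivation for the robust methods discussed below.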

*Remark:*

One might think that a **simple way to handle outliers is to detect them and remove them from the data set**.
Deleting an outlier, although better than doing nothing, still poses **a number of problems**:

- When is deletion justified? Deletion requires a subjective decision.
- When is an observation “outlying enough” to be deleted?
- The user or the author of the data may think that “an observation is an observation” (i.e., observations should speak for themselves) and hence feel uneasy about deleting them.
- Since there is generally some uncertainty as to whether an observation is really atypical, there is a risk of deleting “good” observations, which results in underestimating data variability.

Since the results depend on the user’s subjective decisions, it is difficult to determine the statistical behavior of the complete procedure.

**Robust Statistics**

Let’s say something about the **normal distribution assumption**. It is very common to assume a Gaussian
distribution in many kinds of **engineering
problems**. The **most widely used model
formalization** is the assumption that the observed data have a **normal (Gaussian) distribution**. This
assumption has been present in statistics as well as engineering for two
centuries and has been the framework for all the classical methods in
regression, analysis of variance and multivariate analysis. **The main justification** for assuming a
normal distribution is that it gives an **approximate
representation** to many real data sets, and at the same time is **theoretically quite convenient** because
it allows one to derive explicit formulas for optimal statistical methods such
as maximum likelihood, likelihood ratio tests, etc. We refer to such methods as
**classical statistical methods** and
note that they rely on the assumption that normality holds exactly. The **classical statistics** are by modern
computing standards **quite easy to
compute**. Unfortunately, theoretical and computational convenience does not always
deliver an adequate tool for the practice of statistics and data analysis. It **often happens in practice** that an
assumed normal distribution model (e.g., **Standard
Kalman filter**) **holds approximately in
that it describes the majority of observations**, but some observations
follow a different pattern or no pattern at all.

Now, we know that such **atypical
data** are called **outliers**, and **even a single outlier can have a large
distorting influence** on a classical statistical method that is optimal
under the assumption of normality or linearity. The kind of “approximately”
normal distribution that gives rise to outliers is one that has a normal shape
in the central region but has tails that are heavier or “fatter” than those of
a normal distribution. **One might naively
expect that if such approximate normality holds, then the results of using a
normal distribution theory would also hold approximately. This is unfortunately
not the case.**

**The robust approach** to statistical modeling and data analysis aims
at deriving methods that produce reliable parameter estimates and associated
tests and confidence intervals, not only when the data follow a given
distribution exactly, but also when this happens only approximately in the
sense just described.

**Robust methods fit the bulk
of the data well: if the data contain no outliers the robust method gives
approximately the same results as the classical method, while if a small
proportion of outliers are present the robust method gives approximately the
same results as the classical method applied to the “typical” data**. As a consequence of fitting the bulk of the
data well, robust methods provide a very reliable method of detecting outliers,
even in high-dimensional multivariate situations.
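Two classic robust estimates of location and scale are the median and the MAD (median absolute deviation). The sketch below, on made-up data, shows how they barely move when an outlier is added, while the mean and standard deviation jump:

```python
# Robust vs. classical estimates on a contaminated sample:
# the median and MAD stay close to the "typical" data, while
# the mean and standard deviation are pulled toward the outlier.
from statistics import mean, median, pstdev

def mad(data):
    """Median absolute deviation, a robust scale estimate."""
    m = median(data)
    return median(abs(x - m) for x in data)

clean = [9.8, 10.1, 10.0, 9.9, 10.2]
contaminated = clean + [100.0]

print(mean(contaminated), pstdev(contaminated))   # ~25.0, ~33.5
print(median(contaminated), mad(contaminated))    # ~10.05, ~0.15
```

Points that lie many MADs away from the median stand out clearly, which is why departures from a robust fit make a reliable outlier detector.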

We note that one approach to dealing with outliers is the **diagnostic approach**. Diagnostics are
statistics generally based on classical estimates that aim at giving numerical
or graphical clues for the detection of data departures from the assumed model.
There is a considerable literature on outlier diagnostics, and a good outlier diagnostic
is clearly better than doing nothing. However, these methods present **two drawbacks**. **One** is that they are **in
general not as reliable for detecting outliers as examining departures from a
robust fit to the data**. The other is that, once suspicious observations
have been flagged, the actions to be taken with them remain the analyst’s personal
decision, and thus there is **no objective
way to establish the properties of the result of the overall procedure**.

**Robust methods have a long
history** that can be traced back at
least to the end of the nineteenth century. But the first great steps forward
occurred in the 1960s, and the early 1970s with the fundamental work of **John Tukey** (1960, 1962), **Peter Huber** (1964, 1967) and **Frank Hampel** (1971, 1974). The
applicability of the new robust methods proposed by these researchers was made possible
by the increased speed and accessibility of computers.

That is all we will say about Robust Statistics in this post. If you want to find out more, a new post will be published soon, or you can consult the references given at the end. This was just a beginning and a warm-up for those who want to get started designing more robust applications.