This advice comes from **Yann LeCun** (Director of AI Research at Facebook and a founding father of convolutional networks); see section 4.3 of this paper:

Convergence [of backprop] is usually faster if the average of each input variable over the training set is close to zero. Among other reasons, when **the neural network tries to correct the error made in a prediction,** it updates the network by an amount proportional to the input vector, which is bad if the input is large.

Another example is clustering with k-means, where the goal is that **data in each cluster shares some common characteristics**. This algorithm performs two steps:

- Assign the center of each cluster to some point in space (random on the first iteration; the centroid of each cluster afterwards).
- Associate each point with the closest center.
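The two steps above can be sketched in a few lines of numpy (made-up toy points, and fixed initial centers so the example is deterministic; a real implementation would pick them at random):

```python
import numpy as np

# Toy 2-D points forming two well-separated groups (made-up data).
points = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.3],
                   [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]])

# Step 1: place the cluster centers somewhere (random in practice,
# fixed here for determinism).
centers = points[:2].copy()

for _ in range(10):
    # Step 2: associate each point with the closest center (Euclidean distance).
    dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # Step 1, subsequent passes: move each center to its cluster's centroid.
    centers = np.array([points[labels == k].mean(axis=0) for k in range(2)])

print(labels)  # the first three points share one label, the last three the other
```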

How do we measure "closest"? Usually with the **Minkowski distance** (commonly the famous Euclidean distance). Each feature has the same weight in the calculation, so features measured over large ranges influence the result more than those measured over small ranges; e.g., the same feature would have more influence if measured in millimeters than in kilometers (because the numbers would be bigger). So the scales of the features must be in comparable ranges.
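A quick illustration of the units problem (made-up numbers; the second feature, distance to the exchange, is hypothetical):

```python
import numpy as np

# Two houses described by (connection speed in Mbit/s, distance to exchange).
# Same physical data, with the distance expressed in different units.
a_km = np.array([30.0, 1.2])    # distance in kilometres
b_km = np.array([31.0, 1.5])
a_mm = np.array([30.0, 1.2e6])  # the very same distances, in millimetres
b_mm = np.array([31.0, 1.5e6])

print(np.linalg.norm(a_km - b_km))  # ≈ 1.04: both features contribute
print(np.linalg.norm(a_mm - b_mm))  # ≈ 300000: the distance feature dominates
```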

**A couple of ways to normalize data:**

**Feature scaling**

Figure 1, normalization formula
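In code, the min-max formula is a one-liner (a hand-rolled sketch; sklearn.preprocessing.MinMaxScaler applies the same idea per feature):

```python
def feature_scale(x, x_min, x_max):
    """Min-max feature scaling: map x linearly into [0, 1]."""
    return (x - x_min) / (x_max - x_min)

print(feature_scale(5.0, 0.0, 10.0))  # 0.5: halfway between min and max
```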

**One example:** imagine we have Internet data from a particular house and we want to build a model to predict something (maybe the price to charge). One of our hypothetical features could be the bandwidth of the fiber optic connection. Suppose the house purchased a 30 Mbit/s Internet connection, so the bit rate is approximately the same every time we measure it (lucky guy).

Figure 2, connection speed over 50 days

Let's normalize this signal with the **feature scaling method** (sklearn.preprocessing.MinMaxScaler).

Figure 3, connection speed per day, scaled to the range 0-1

After scaling, the **data is distorted**. What was an almost flat signal now looks like a connection with a lot of variation. This tells us that feature scaling is not adequate for nearly constant signals.
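We can reproduce the effect with a synthetic stand-in for the 50-day series (made-up values; only the shape matters):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Nominally 30 Mbit/s with tiny measurement noise (synthetic stand-in data).
rng = np.random.default_rng(42)
speed = 30.0 + rng.normal(0.0, 0.05, size=(50, 1))

scaled = MinMaxScaler().fit_transform(speed)

print(float(speed.max() - speed.min()))    # tiny spread in the raw signal
print(float(scaled.max() - scaled.min()))  # 1.0: the noise now fills [0, 1]
```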

**Standard scaler**

Figure 4, standard scaling formula
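In code, the standard scaling (z-score) formula is (a hand-rolled sketch; sklearn.preprocessing.StandardScaler estimates the mean and standard deviation per feature):

```python
def standard_scale(x, mean, std):
    """Standard scaling: subtract the mean, divide by the standard deviation."""
    return (x - mean) / std

print(standard_scale(12.0, 10.0, 2.0))  # 1.0: one standard deviation above the mean
```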

The standard scaler removes the mean and scales the data to unit variance. But when the data contains big outliers, **the outliers may attenuate the non-outlier part of the data.**

Imagine now that our Internet traffic follows a **sine wave**, with lows during the weekdays and highs on weekends. It also has big outliers after "Halloween" and similar dates. We have idealized this situation with the next data set (3 parties in 50 days. Not bad).

**We check the basic parameters of standardization.**

Figure 6, standard scaling of the above data is not a good choice

What happened? First, we were not able to scale the data between 0 and 1. Second, we now have negative numbers, which is not a dead end, but complicates the analysis. And third, now we are unable to clearly distinguish the differences between weekdays and weekends (all close to 0), because **outliers have interfered with the data.**
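All three problems can be reproduced with a synthetic stand-in for the idealized series (a weekly sine with three "party" outliers; the numbers are made up, only the shape matters):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: weekly oscillation around 10 plus three huge outliers.
days = np.arange(50)
traffic = 10.0 + np.sin(2 * np.pi * days / 7.0)  # lows on weekdays, highs on weekends
traffic[[10, 24, 38]] = 100.0                    # the three parties

z = StandardScaler().fit_transform(traffic.reshape(-1, 1)).ravel()

print(z.min())                             # negative numbers appear
print(z[[10, 24, 38]])                     # the outliers dominate the scale
print(np.ptp(np.delete(z, [10, 24, 38])))  # the weekly pattern is squashed near 0
```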

From very promising data, we now have almost irrelevant data. One solution could be to pre-process the data and eliminate the outliers (things change once the outliers are gone).

**Scaling over the maximum value**

The next idea that comes to mind is to scale the data by dividing it by its maximum value. Let's see how it behaves with our data sets (sklearn.preprocessing.MaxAbsScaler).

Figure 7, data divided by its maximum value

Figure 8, data scaled over the maximum

Good! Our data is in the range [0, 1]… But wait. What happened to the differences between weekdays and weekends? They are all close to zero! As in the case of standardization, **outliers flatten the differences among the data when scaling over the maximum.**
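Again with the synthetic stand-in series (made-up data), scaling over the maximum shows the same flattening:

```python
import numpy as np
from sklearn.preprocessing import MaxAbsScaler

# Synthetic stand-in: weekly sine around 10 plus three "party" outliers.
days = np.arange(50)
traffic = 10.0 + np.sin(2 * np.pi * days / 7.0)
traffic[[10, 24, 38]] = 100.0

scaled = MaxAbsScaler().fit_transform(traffic.reshape(-1, 1)).ravel()

print(scaled.max())                             # the outliers become exactly 1.0...
print(np.ptp(np.delete(scaled, [10, 24, 38])))  # ...and squash the weekly swing
```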

**Normalizer**

The next tool in the box of the data scientist is to normalize samples individually to unit norm (check this if you don’t remember what a norm is).

Figure 9, samples individually normalized to unit norm

This data rings a bell, right? Let's normalize it (here by hand, but it is also available as sklearn.preprocessing.Normalizer).

Figure 10, the normalized data

At this point in the post you know the story, but this case is worse than the previous one: we don't even get the highest outlier as 1. It is scaled to 0.74, which flattens the rest of the data even more.
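A sketch with the synthetic stand-in series (so the exact numbers differ from the post's 0.74, but the flattening effect is the same):

```python
import numpy as np
from sklearn.preprocessing import Normalizer

# Synthetic stand-in: weekly sine around 10 plus three "party" outliers.
days = np.arange(50)
traffic = 10.0 + np.sin(2 * np.pi * days / 7.0)
traffic[[10, 24, 38]] = 100.0

# Normalizer works per sample (per row), so the whole series is one sample here.
unit = Normalizer(norm='l2').fit_transform(traffic.reshape(1, -1)).ravel()

print(np.linalg.norm(unit))  # 1.0: the series now has unit L2 norm
print(unit.max())            # even the biggest outlier stays well below 1
```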

**Robust scaler**

The last option we are going to evaluate is Robust scaler. This method **removes the median and scales the data according to the Interquartile Range (IQR)**. It is supposed to be robust to outliers.

Figure 11, data with the median removed, scaled by the IQR

Figure 12, use of the Robust scaler

You may not see it in the plot (but you can see it in the output): this scaler introduced negative numbers and did not limit the data to the range [0, 1]. (OK, I quit).
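Both problems are easy to verify on the synthetic stand-in series (made-up data; sklearn's RobustScaler with its default 25-75 quantile range):

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Synthetic stand-in: weekly sine around 10 plus three "party" outliers.
days = np.arange(50)
traffic = 10.0 + np.sin(2 * np.pi * days / 7.0)
traffic[[10, 24, 38]] = 100.0

robust = RobustScaler().fit_transform(traffic.reshape(-1, 1)).ravel()

print(robust.min())  # negative values appear (the median maps to 0)
print(robust.max())  # and the data is not confined to [0, 1]
```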

There are other methods to normalize your data (based on PCA, taking into account possible physical boundaries, etc.), but now you know how to evaluate whether an algorithm is going to influence your data negatively.

**Things to remember (basically, know your data):**

Normalization may (sometimes dangerously) distort your data. **There is no ideal method to normalize or scale all data sets.** Thus it is the job of the data scientist to know how the data is distributed, **know whether outliers exist, check the ranges, know the physical limits** (if any), and so on. With this knowledge, one can select the best technique to normalize each feature, probably using a **different method for each feature.**

If you know nothing about your data, I would recommend you to first **check the existence of outliers** (remove them if necessary) and then **scale over the maximum of each feature** (while crossing your fingers).

*Written by Santiago Morante, PhD, Data Scientist at LUCA Consulting Analytics*