“Mean Square Error”, abbreviated as MSE, is a ubiquitous term in texts on estimation theory. Have you ever wondered what this term actually means and why it is used so often in estimation theory?
Any communication system has a transmitter, a channel or medium to communicate over, and a receiver. Given the channel impulse response and the channel noise, the goal of a receiver is to decipher what was sent from the transmitter. A simple channel is usually characterized by a channel response \( h \) and an additive noise term \( n \). In the time domain, this can be written as
$$ y = h \circledast x + n $$
Here, \( \circledast \) is the convolution operation. Convolution in the time domain is equivalent to multiplication in the frequency domain, and vice versa, so the same relation can be written in the frequency domain as
$$ Y = HX + N $$
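The equivalence between time-domain convolution and frequency-domain multiplication can be checked numerically. The sketch below is a minimal illustration for the noise-free part of the equation, assuming NumPy and made-up random sequences `x` and `h` (they are not values from this post). Note that the DFT pairs multiplication with circular convolution, so the time-domain side is computed circularly here.

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed so the run is repeatable
N = 8
x = rng.standard_normal(N)       # hypothetical transmitted samples
h = rng.standard_normal(N)       # hypothetical channel impulse response

# Time domain: circular convolution, y[n] = sum_k h[k] * x[(n - k) mod N]
y_time = np.array([
    sum(h[k] * x[(n - k) % N] for k in range(N))
    for n in range(N)
])

# Frequency domain: Y = H X, then transform back to the time domain
Y = np.fft.fft(h) * np.fft.fft(x)
y_freq = np.fft.ifft(Y).real

conv_matches = np.allclose(y_time, y_freq)
print(conv_matches)
```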
Remember! Capital letters indicate the frequency domain representation and lowercase letters indicate the time domain representation. The frequency domain equation looks simple, and the only spoiler is the noise term. The receiver receives the information over the channel corrupted by noise and tries to decipher what was sent from the transmitter, \(X\). If the noise is cancelled out in the receiver, \(N = 0\), the observed spectrum at the receiver will look like
$$ Y = HX $$
Now, to know \( X \), the receiver has to know \(H\). Then it can simply divide the observed spectrum \(Y\) by the channel frequency response \(H\) to get \(X\). Unfortunately, things are not that easy. Cancelling the noise from the received samples/spectrum is the hardest part. Complete nullification of noise at the receiver is rarely achievable, and much of communication system design revolves around reducing this noise to a level low enough for acceptable performance.
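To make the division step concrete, here is a small sketch, assuming NumPy, a single made-up flat channel gain `H`, and hypothetical ±1 symbols for `X` (none of these values come from this post). It shows that dividing by \(H\) recovers \(X\) exactly when there is no noise, and leaves a residual of \(N/H\) when noise is present.

```python
import numpy as np

rng = np.random.default_rng(1)
H = 0.8 * np.exp(1j * 0.3)                  # assumed flat channel gain (hypothetical)
X = rng.choice([-1.0, 1.0], size=16) + 0j   # assumed +/-1 symbols (hypothetical)

# Noise-free case: Y = H X, so X = Y / H exactly
Y_clean = H * X
X_from_clean = Y_clean / H

# Noisy case: Y = H X + N, so Y / H = X + N / H
N = 0.1 * (rng.standard_normal(16) + 1j * rng.standard_normal(16))
Y_noisy = H * X + N
X_from_noisy = Y_noisy / H

residual = np.max(np.abs(X_from_noisy - X))  # largest leftover error after division
print(np.allclose(X_from_clean, X), residual)
```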
Given the noise term, how do we find \(H\) from the observed/received spectrum \(Y\)? This is a classical estimation problem.
Usually a known sequence (a pilot sequence in OFDM, a training sequence in GSM, etc.) is sent across the channel, and from it the channel response \(H\) or the impulse response \(h\) is estimated. This estimated channel response is then used to decipher the transmitted spectrum/sequence when receiving the actual data. This type of estimation is useful only if the channel response remains constant across the frequency band of interest (the channel is flat across the band of interest: “flat fading”).
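The pilot-based workflow described above can be sketched as follows. This is only an illustration under stated assumptions: NumPy, a made-up flat channel `H_true`, ±1 pilot and data symbols, an arbitrary noise level `sigma`, and averaging the per-pilot ratios as one simple choice of estimator (the post does not prescribe a particular one).

```python
import numpy as np

rng = np.random.default_rng(2)
H_true = 0.8 * np.exp(1j * 0.3)   # assumed flat channel, unknown to the receiver
sigma = 0.05                      # assumed noise standard deviation per component

def channel(X):
    """Apply the flat channel and add complex Gaussian noise."""
    noise = sigma * (rng.standard_normal(X.shape)
                     + 1j * rng.standard_normal(X.shape))
    return H_true * X + noise

# 1) Send symbols that the receiver already knows (pilots)
pilots = rng.choice([-1.0, 1.0], size=64) + 0j
Y_pilots = channel(pilots)

# 2) Estimate H by averaging the per-pilot ratios Y / X
H_hat = np.mean(Y_pilots / pilots)

# 3) Use the estimate to decipher unknown data sent over the same channel
data = rng.choice([-1.0, 1.0], size=32) + 0j
Y_data = channel(data)
data_hat = np.sign((Y_data / H_hat).real)

print(abs(H_hat - H_true), np.array_equal(data_hat, data.real))
```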
Okay !!! To estimate \(H\) in the presence of noise, we need some metric to quantify the accuracy of the estimation.
Line equations:
Consider a generic line equation \(y = mx+c\), where \(m\) is the slope of the line and \(c\) is the intercept. Both \(m\) and \(c\) are constants. Since \(x\) is a first-order term (the highest degree of \(x\) is one), this is called a linear equation. If the equation looks like \(y = mx^2+kx+c\), the highest degree of \(x\) is 2 and it becomes a quadratic equation. Similarly, the term \(x^3\) in a polynomial equation gives rise to the name cubic equation, \(x^4\) to quartic equation, and so on.
Naming a univariate polynomial equation
Highest degree of \(x\) | Name |
---|---|
1 | Linear |
2 | Quadratic |
3 | Cubic |
4 | Quartic |
5 | Quintic |
Coming back to the linear-line equation, \(y = mx + c\), to simplify things, let’s assign \(m=2\) and \(c=0\) and generate values for \(y\) by varying \(x\) from \(0\) to \(9\) in steps of 1.
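Generating these values takes only a few lines. The sketch below assumes NumPy; the variable names simply mirror the symbols in the equation.

```python
import numpy as np

m, c = 2, 0            # slope and intercept chosen in the text
x = np.arange(10)      # x = 0, 1, ..., 9
y = m * x + c          # ideal output of the line equation

print(y.tolist())      # [0, 2, 4, 6, 8, 10, 12, 14, 16, 18]
```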
In the above equation, the input \(x\) gets transformed into output \(y\) by the transformation \(y = mx\). This can be considered analogous to a communication system in frequency domain, where the input \(X\) is transmitted, it gets transformed by a channel \(H\) and gives the output \(Y\).
$$ Y = HX $$
Frequency domain is considered because it has the same structure as the linear equation, whereas, in time domain the output of the channel is the convolution of the channel impulse response \(h\) and the input \(x\).
We can now consider that the channel frequency response \(H\) is equal to the constant \(m\) (flat fading assumption).
To make the channel look closer to a real one, we will add Additive White Gaussian Noise (AWGN) to the channel.
$$Y = HX + N $$
To represent this scenario in our line fitting problem, the noise is modelled as a set of uniformly distributed random numbers, \(n\). Adding it to the output of the line equation gives what we call the “observed data”.
$$ y_1 = mx + n $$
Note: The term \(n\) in the above equation is not a constant but a random variable, whereas the term \( c \) is a constant (it can be considered a DC bias in the observed data, if present). I have generated the following table for illustration. For convenience, and to illustrate the importance of the term MSE, the noise terms in the following table are not drawn from a uniform set of random numbers; instead, they are chosen by hand so that the total error is zero.
The first column is the input \( x \), the second column is the ideal (actual) output \( y \) that follows the equation \( y = mx + c \), with \( c \) set to \( 0 \). The third column is the noise term. The fourth column is the observed samples at the receiver after the ideal samples are corrupted by the noise term. The fourth column represents the equation \( y_1 = mx + n \).
Now, our job is to estimate the constant \( m \) in the presence of noise. Can you think of a possible metric which could aid in this estimation problem? Given that known data \( x \) is transmitted, the obvious choice is to measure the average error between the observed sequence of data and the actual data and to use a brute-force search for \( m \): plug in various values in place of \(m\) and choose the one that gives the minimum error.
Selecting the “error” as a metric seems to be a great and simple approach. But there is a basic flaw in it. Consider the fifth column in the table, which measures the error between the observed and actual data. The noise terms in the third column are chosen such that the measured average error becomes zero. Even though the average error is zero, it is obvious that the observed data is far from the ideal one. This is a big drawback of the error metric, and it arises because the positive and negative errors cancel out. This can happen in a real scenario too, where the errors across all samples of observed data can cancel each other out.
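The cancellation effect is easy to reproduce. The sketch below assumes NumPy and uses a hand-picked noise vector `n` that sums to zero; these values are hypothetical stand-ins for illustration and are not the entries of the table in this post.

```python
import numpy as np

m = 2
x = np.arange(10)                                   # column 1: input
y = m * x                                           # column 2: ideal output (c = 0)
n = np.array([1, -1, 2, -2, 3, -3, 1, -1, 2, -2])   # column 3: hypothetical noise, sums to zero
y1 = y + n                                          # column 4: observed data
error = y1 - y                                      # column 5: observed minus ideal

print(" x   y   n  y1  err")
for row in zip(x.tolist(), y.tolist(), n.tolist(), y1.tolist(), error.tolist()):
    print("{:2d} {:3d} {:3d} {:3d} {:4d}".format(*row))

mean_error = error.mean()              # positive and negative errors cancel
max_abs_error = np.abs(error).max()    # yet individual samples are clearly off
print(mean_error, max_abs_error)
```

The average error comes out as zero even though individual samples are off by as much as 3, which is exactly the weakness described above.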
To circumvent this problem, let’s square the error terms (sixth column) and average them out. This metric is called the Mean Squared Error. Now, no matter what the sign of an error is, the squaring operation always yields a non-negative value, so errors of opposite sign can no longer cancel each other. An estimation approach that attempts to minimize the Mean Square Error is called a Minimum Mean Square Error (MMSE) estimator.
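As a rough sketch of how this metric can drive the brute-force search for \(m\), the code below evaluates \( \frac{1}{N}\sum_i (y_{1,i} - \hat{m} x_i)^2 \) for a grid of candidate slopes \( \hat{m} \) and keeps the one with the smallest value. It assumes NumPy, the same hypothetical zero-sum noise vector used for illustration (not the values from the table in this post), and an arbitrary search grid from 0 to 4 in steps of 0.01.

```python
import numpy as np

x = np.arange(10)
n = np.array([1, -1, 2, -2, 3, -3, 1, -1, 2, -2])   # hypothetical noise, sums to zero
y1 = 2 * x + n                                      # observed data for true m = 2

def mse(m_candidate):
    """Mean squared error between the observed data and m_candidate * x."""
    return np.mean((y1 - m_candidate * x) ** 2)

# MSE of the observed data against the ideal line (m = 2): no cancellation now
mse_vs_ideal = mse(2.0)                 # 3.8 for this noise vector

# Brute-force search: try each candidate slope, keep the one with minimum MSE
candidates = np.linspace(0, 4, 401)     # assumed grid, step 0.01
scores = np.array([mse(mc) for mc in candidates])
best_m = candidates[np.argmin(scores)]

print(mse_vs_ideal, round(best_m, 2))   # best_m is about 1.97
```

The selected slope is close to, but not exactly, 2: the search fits the noisy observations, so the noise pulls the estimate slightly away from the true value.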
I hope that this text has helped in understanding the logic behind using Mean Square Error as a metric for estimation problems. Comments/suggestions for improvements are welcome. The next post will focus on the Ordinary Least Squares (OLS) algorithm (using the mean square error metric) applied to a linear-line fitting problem.
Why can't we use the modulus (absolute value) of the error instead of squaring it?
Please refer to this analysis:
https://medium.com/human-in-a-machine-world/mae-and-rmse-which-metric-is-better-e60ac3bde13d