Maximum Likelihood Estimation (MLE) : Understand with example

Key focus: Understand maximum likelihood estimation (MLE) using a hands-on example. Learn the importance of the log-likelihood function and its use in estimation problems.

Maximum Likelihood Estimation (MLE) is a statistical method used to estimate the parameters of a statistical model. The core idea behind MLE is to find the parameter values that maximize the likelihood of observing the given data under the assumed statistical model.

Likelihood Function:

The likelihood function measures how likely it is to observe the given data for different parameter values. The MLE seeks the parameter values that make the observed data most likely.

Suppose \(X = \left(x_1, x_2, \cdots, x_n \right)\) is the observed data, consisting of \(n\) independent and identically distributed (i.i.d.) samples. \(\theta\) represents the parameter (or vector of parameters) we want to estimate from the observed data. Each sample is assumed to be drawn from a probability density function (PDF) or probability mass function (PMF) \( f(x|\theta) \) parameterized by \(\theta\). The likelihood function is given by

$$ L(\theta|X) = \prod_{i=1}^{n} f(x_i|\theta) $$

The above expression is numerically the same as the joint density of the data, but the interpretation differs: in the joint density, the samples are treated as random variables for a fixed \(\theta\), whereas in the likelihood function the observed data \(X\) are held fixed and \(\theta\) is the quantity to be varied and estimated.

Maximum Likelihood Estimation

The maximum likelihood estimate is obtained by solving:

$$ \hat{\theta} = \underset{\theta}{\mathrm{argmax}} \; L (\theta | X) $$

In practice, it is often easier to work with the log-likelihood function, which simplifies computations by converting products into sums:

$$ l (\theta|X) = \log L(\theta|X) = \sum_{i=1}^{n} \log f(x_i|\theta) $$

This is particularly useful when implementing the likelihood metric in digital signal processors.
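
As a minimal illustration of this point (with hypothetical sample values and a single candidate parameter, not part of the example that follows), the sketch below compares the raw likelihood product with the equivalent sum of log terms. For a large number of samples the product underflows to zero in double precision, while the log-likelihood remains a well-behaved number.

% Minimal sketch: why the log-likelihood is preferred numerically
% Hypothetical data: N i.i.d. samples from a normal distribution with mean 1.3, variance 1
N = 2000;                        %number of samples (assumed)
x = 1.3 + randn(1,N);            %observed samples
mu = 1.3; s = 1;                 %candidate parameter and known standard deviation

pdfVals = (1/(sqrt(2*pi)*s))*exp(-(x-mu).^2/(2*s^2)); %per-sample densities
L  = prod(pdfVals);              %direct likelihood: underflows to 0 for large N
ll = sum(log(pdfVals));          %log-likelihood: a finite (negative) number
disp([L ll]);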

Example: Estimating Parameters of a Normal Distribution

Suppose we have a dataset that we believe follows a normal distribution with unknown mean (\(\mu\)) and known variance (\(\sigma^2\)).

The first step is to choose a statistical model that describes how the data is generated. The PDF (statistical model) of the normal distribution is:

$$ f(x | \mu) = \frac{1}{\sqrt{2 \pi \sigma^2}} e^{-\frac{ \left( x - \mu \right)^2}{2 \sigma^2}} $$

The likelihood function is given by

$$ L(\mu | X) = \prod_{i=1}^{n} f(x_i|\mu) $$

and the log-likelihood function is

$$ \begin{align} l (\mu|X) &= \log L(\mu|X) \\ &= \sum_{i=1}^{n} \log f(x_i|\mu)\\ &= - \frac{n}{2} \log(2 \pi \sigma^2) - \frac{1}{2 \sigma^2} \sum_{i=1}^{n} (x_i - \mu)^2 \end{align} $$

After constructing the log-likelihood function, we use calculus or optimization techniques to find the parameter values that maximize it. Because the logarithm is a monotonically increasing function, the parameter value that maximizes the log-likelihood also maximizes the likelihood.

To maximize this log-likelihood with respect to \( \mu \), take its derivative and set it to zero:

$$ \frac{\partial \, l(\mu|X)}{\partial \mu} = \frac{1}{\sigma^2} \sum_{i=1}^{n} \left( x_i - \mu \right) = 0 $$

Solving gives:

$$ \hat{\mu} = \frac{1}{n} \sum_{i=1}^{n} x_i $$

Thus, the maximum likelihood estimate for \(\mu\) is simply the sample mean.
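
As a quick numerical check of this result (using made-up observations, not the data from the DC estimation example below), the sketch evaluates the Gaussian log-likelihood over a grid of candidate means and confirms that the grid maximizer agrees, up to the grid resolution, with the sample mean.

% Minimal sketch: the sample mean maximizes the Gaussian log-likelihood (known variance)
x  = [2.1 0.4 1.8 1.1 0.9 1.6]; %assumed observations
s  = 1;                          %known standard deviation
mu = -1:0.01:3;                  %grid of candidate means

ll = zeros(size(mu));
for k = 1:length(mu)
    %log-likelihood at each candidate mean
    ll(k) = sum(-0.5*log(2*pi*s^2) - (x - mu(k)).^2/(2*s^2));
end
[~,idx] = max(ll);
disp([mu(idx) mean(x)]);         %grid maximizer vs. sample mean (both approx. 1.32)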

Example:

Consider the DC estimation problem presented in the previous article, where a transmitter sends a continuous stream of data samples representing a constant value \(A\). The samples travel through a communication channel and are corrupted by additive white Gaussian noise \(w[n] \sim \mathcal{N} (\mu = 0, \sigma^2 = 1)\). The receiver observes the noisy samples, and its goal is to estimate the actual DC component \(A\) in the presence of noise.

Figure 1: The problem of DC estimation

Likelihood as an Estimation Metric:

Let’s use the likelihood function as the estimation metric. The estimation of \(A\) depends on the PDF of the underlying noise \(w[n]\), and the estimation accuracy depends on the noise variance: the higher the variance, the lower the accuracy, and vice versa.
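
To illustrate this dependence (with assumed values, in the spirit of the script further below), the following sketch plots the normalized likelihood for the same noise realization at two different noise levels. The larger the noise standard deviation, the flatter the likelihood curve around the true value, and hence the less sharply defined the estimate.

% Minimal sketch: effect of noise variance on the sharpness of the likelihood function
A = 1.3; N = 10;
n = randn(1,N);                  %one fixed noise realization
rangeA = -2:0.1:5;               %candidate values of A
for s = [0.5 2]                  %two noise standard deviations to compare
    x = A + s*n;                 %received samples at this noise level
    L = zeros(1,length(rangeA));
    for i = 1:length(rangeA)
        L(i) = exp(-sum((x-rangeA(i)).^2)/(2*s^2));
    end
    plot(rangeA, L/max(L)); hold on; %normalize so the curves are comparable
end
legend('\sigma = 0.5','\sigma = 2'); xlabel('A'); ylabel('Normalized likelihood');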

Let’s fix \(A = 1.3\) and generate 10 samples from the above model (use the Matlab script given below to test this; you may get a different set of numbers). Now we pretend that we do not know anything about the model, and all we want to do is estimate the DC component (the parameter to be estimated is \(\theta = A\)) from the observed samples:

$$X=(3.8754,2.1966,0.4770,2.8353,3.1025,0.8082,0.5228,-1.2273,1.0023,0.7236)$$

Assuming a variance of 1 for the underlying PDF, we will try a range of values for \(A\) from \(-2.0\) to \(+5.0\) in steps of \(0.1\) and calculate the likelihood function for each value of \(A\).

Matlab script:

% Demonstration of Maximum Likelihood Estimation in Matlab
%   Author: Mathuranathan (https://www.gaussianwaves.com)
%   License : creative commons : Attribution-NonCommercial-ShareAlike 3.0 Unported

A=1.3;
N=10; %Number of Samples to collect
x=A+randn(1,N);

s=1; %Assume standard deviation s=1

rangeA=-2:0.1:5; %Range of values of estimation parameter to test
L=zeros(1,length(rangeA)); %Place holder for likelihoods

for i=1:length(rangeA)
    %Calculate the likelihood for each parameter value in the range.
    %The constant term (1/(sqrt(2*pi)*s))^N is neglected, as it would pull
    %the likelihood value towards zero for increasing N.
    L(i) = exp(-sum((x-rangeA(i)).^2)/(2*s^2));
end

[maxL,index]=max(L); %Select the parameter value with Maximum Likelihood
display('Maximum Likelihood of A');
display(rangeA(index));

%Plotting Commands
plot(rangeA,L);hold on;
stem(rangeA(index),L(index),'r'); %Point the Maximum Likelihood Estimate
displayText=['\leftarrow Likelihood of A=' num2str(rangeA(index))];
title('Maximum Likelihood Estimation of unknown Parameter A');
xlabel('\leftarrow A');
ylabel('Likelihood');
text(rangeA(index),L(index)/3,displayText,'HorizontalAlignment','left');

figure(2);
plot(rangeA,log(L));hold on;
YL = ylim;YMIN = YL(1);
plot([rangeA(index) rangeA(index)],[YMIN log(L(index))] ,'r'); %Point the Maximum Likelihood Estimate
title('Log Likelihood Function');
xlabel('\leftarrow A');
ylabel('Log Likelihood');
text([rangeA(index)],YMIN/2,displayText,'HorizontalAlignment','left');

Simulation Result:

For the above-mentioned 10 samples, the likelihood function over the range \(-2\) to \(+5\) of candidate DC component values is plotted below. The maximum of the likelihood occurs at \(A = 1.4\), as shown in the figure, so the estimated value is \(\hat{A} = 1.4\).

The estimation accuracy improves as the number of observed samples increases. Try the simulation with the number of samples \(N\) set to \(5000\) or \(10000\) and observe the estimated value of \(A\) for each run.

Figure 2: Maximum likelihood estimation of unknown parameter A
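
To see this effect concretely, the sketch below (a variation of the script above, with hypothetical noise realizations) repeats the grid search for a few sample sizes. It works with the log-likelihood, up to an additive constant, because the raw likelihood used in the script underflows to zero for large \(N\); the estimate tightens around the true value \(A = 1.3\) as \(N\) grows.

% Minimal sketch: estimation accuracy improves with the number of samples N
A = 1.3; s = 1;
rangeA = -2:0.1:5;               %candidate values of A
for N = [10 100 5000]
    x = A + s*randn(1,N);        %generate N noisy samples
    LL = zeros(1,length(rangeA));
    for i = 1:length(rangeA)
        LL(i) = -sum((x-rangeA(i)).^2)/(2*s^2); %log-likelihood up to a constant
    end
    [~,idx] = max(LL);
    fprintf('N = %5d  estimated A = %.1f\n', N, rangeA(idx));
end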

It is often useful to compute the log-likelihood instead, since it reduces the product of several terms in the above equation to a series of additions.

The corresponding plot is given below

Figure 3: Maximum likelihood estimation using log likelihood function

Advantages of Maximum Likelihood Estimation:

  • Asymptotically efficient – under regularity conditions, MLEs achieve the minimum variance among unbiased estimators as the sample size grows [2]
  • Asymptotically normal – the distribution of the MLE approaches a normal distribution as the sample size grows large
  • Asymptotically consistent – MLEs converge to the true parameter values as the sample size increases
  • MLE can handle a wide variety of statistical models and distributions [3]
  • Estimation requires no prior information about the parameter
  • The estimates agree closely with the observed data

Challenges in Maximum Likelihood Estimation:

  • For complex models or large datasets, maximizing the likelihood can be computationally expensive [4]
  • Non-linear models may lead to multiple local maxima in the likelihood function, making global optimization challenging
  • MLE assumes that the chosen model correctly represents how the data is generated; incorrect model assumptions can lead to biased estimates
  • MLE does not utilize any prior information about the parameter. In real-world scenarios we often do have such prior information, and it is usually worth exploiting it even though doing so introduces bias into the estimates


For further reading

[1] Steven M. Kay, "Fundamentals of Statistical Signal Processing, Volume I: Estimation Theory", Prentice Hall, 1st Edition, 1993, ISBN: 978-0133457117.

[2] Casella, G., & Berger, R. L. (2002). Statistical Inference. Springer Science & Business Media.

[3] Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer.

[4] Murphy, K. P. (2012). Machine Learning: A Probabilistic Perspective. MIT Press.

Related topics

[1]An Introduction to Estimation Theory
[2]Bias of an Estimator
[3]Minimum Variance Unbiased Estimators (MVUE)
[4]Maximum Likelihood Estimation
[5]Maximum Likelihood Decoding
[6]Probability and Random Process
[7]Likelihood Function and Maximum Likelihood Estimation (MLE)
[8]Score, Fisher Information and Estimator Sensitivity
[9]Introduction to Cramer Rao Lower Bound (CRLB)
[10]Cramer Rao Lower Bound for Scalar Parameter Estimation
[11]Applying Cramer Rao Lower Bound (CRLB) to find a Minimum Variance Unbiased Estimator (MVUE)
[12]Efficient Estimators and CRLB
[13]Cramer Rao Lower Bound for Phase Estimation
[14]Normalized CRLB - an alternate form of CRLB and its relation to estimator sensitivity
[15]Cramer Rao Lower Bound (CRLB) for Vector Parameter Estimation
[16]The Mean Square Error – Why do we use it for estimation problems
[17]How to estimate unknown parameters using Ordinary Least Squares (OLS)
[18]Essential Preliminary Matrix Algebra for Signal Processing
[19]Why Cholesky Decomposition ? A sample case:
[20]Tests for Positive Definiteness of a Matrix
[21]Solving a Triangular Matrix using Forward & Backward Substitution
[22]Cholesky Factorization - Matlab and Python
[23]LTI system models for random signals – AR, MA and ARMA models
[24]Comparing AR and ARMA model - minimization of squared error
[25]Yule Walker Estimation
[26]AutoCorrelation (Correlogram) and persistence – Time series analysis
[27]Linear Models - Least Squares Estimator (LSE)
[28]Best Linear Unbiased Estimator (BLUE)

Books by the author

Wireless Communication Systems in Matlab, Second Edition (PDF)
Digital Modulations using Python (PDF ebook)
Digital Modulations using Matlab (PDF ebook)

Hand-picked Best books on Communication Engineering
Best books on Signal Processing
