Statistical and Mathematical Methods for Artificial Intelligence
2022-2023
Homework 4: MLE and MAP
Maximum Likelihood Estimation (MLE) and Maximum a Posteriori (MAP)

Suppose you are given a dataset $D = \{X, Y\}$, where
$$X = [x_1\, x_2\, \dots\, x_N] \in \mathbb{R}^N, \qquad Y = [y_1\, y_2\, \dots\, y_N] \in \mathbb{R}^N,$$

with $x_i \in \mathbb{R}$ and $y_i \in \mathbb{R}$ for all $i = 1, \dots, N$. Moreover, suppose that:

$$y_i = \theta_1 + \theta_2 x_i + \cdots + \theta_K x_i^{K-1} + e_i \qquad \forall i = 1, \dots, N \tag{1}$$
where $e_i \sim \mathcal{N}(0, \sigma^2)$ is random Gaussian noise and $\theta = (\theta_1, \dots, \theta_K)^T$. It is known that the Maximum Likelihood Estimation (MLE) approach works by defining the conditional probability of $y$ given $x$, $p_\theta(y|x)$, and then optimizing the parameters $\theta$ to maximize this probability distribution over $D$. Moreover, it is also known that this approach can be made equivalent to the deterministic approach to such problems (the Least Squares method) by taking the negative log of $p_\theta(y|x)$. Indeed, by assuming that the noise $e_i$ is Gaussian for each $i = 1, \dots, N$, we have

$$p_\theta(y_i | x_i) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(y_i - f_\theta(x_i))^2}{2\sigma^2} \right), \qquad f_\theta(x_i) := \theta_1 + \theta_2 x_i + \cdots + \theta_K x_i^{K-1}.$$

Thus

$$p_\theta(Y|X) = \prod_{i=1}^{N} p_\theta(y_i | x_i) \implies$$

$$\theta_{MLE} = \arg\max_{\theta \in \mathbb{R}^K} p_\theta(Y|X) = \arg\min_{\theta \in \mathbb{R}^K} -\log p_\theta(Y|X) = \arg\min_{\theta \in \mathbb{R}^K} -\log \prod_{i=1}^{N} p_\theta(y_i|x_i) = \arg\min_{\theta \in \mathbb{R}^K} \sum_{i=1}^{N} -\log p_\theta(y_i|x_i) = \arg\min_{\theta \in \mathbb{R}^K} \frac{1}{2\sigma^2} \sum_{i=1}^{N} \big( y_i - f_\theta(x_i) \big)^2.$$
If $f_\theta(x) = \sum_{j=1}^{K} \theta_j \phi_j(x)$, with $\phi_j(x) = x^{j-1}$, we have already shown that the problem above becomes

$$\theta_{MLE} = \arg\min_{\theta \in \mathbb{R}^K} \frac{1}{2} \| \Phi(X)\theta - Y \|_2^2,$$
where $\Phi(X)$ is the $N \times K$ matrix that contains as rows the powers from $0$ to $K-1$ of the input datapoints $x_i$ (the Vandermonde matrix associated with $(x_1, \dots, x_N)^T$), $\theta = (\theta_1, \dots, \theta_K)^T$ and $Y = (y_1, \dots, y_N)^T$.

Note that the above equation is equivalent to the function you optimized in Exercise 3 of Lab 3 with GD, with $A := \Phi(X)$, $x := \theta$ and $b := Y$.
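For concreteness, this least-squares formulation can be solved in a few lines of NumPy; the following is a minimal sketch (the function name is our own choice, not prescribed by the assignment):

import numpy as np

def mle_gaussian(x, y, K):
    # Phi(X): classical Vandermonde matrix with columns x**0, ..., x**(K-1)
    Phi = np.vander(x, N=K, increasing=True)
    # Least-squares solution of Phi @ theta ~ y, i.e. the Gaussian MLE
    theta, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    return theta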
When it is unclear how to set the parameter K and the error plot cannot be used, it is necessary to resort to the Maximum a Posteriori (MAP) approach.
To show how it works, suppose we know that the parameters are normally distributed, $\theta \sim \mathcal{N}(0, \sigma_\theta^2 I)$. Then we can use Bayes' Theorem to express the a posteriori probability of $\theta$ given the data $X, Y$ as

$$p(\theta | X, Y) = \frac{p(Y | X, \theta)\, p(\theta)}{p(Y | X)} \propto p(Y | X, \theta)\, p(\theta).$$
The MAP solution searches for a set of parameters θ that maximizes p(θ|X,Y ). Following the same reasoning as before,
$$\theta_{MAP} = \arg\max_{\theta} p(\theta|X,Y) = \arg\min_{\theta} -\log p(\theta|X,Y) = \arg\min_{\theta} \sum_{i=1}^{N} -\log p(y_i|x_i,\theta) - \log p(\theta).$$
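Explicitly, under the Gaussian assumptions above the two negative-log terms expand (up to additive constants $c_1, c_2$ independent of $\theta$) to

$$\sum_{i=1}^{N} -\log p(y_i|x_i,\theta) = \frac{1}{2\sigma^2} \| \Phi(X)\theta - Y \|_2^2 + c_1, \qquad -\log p(\theta) = \frac{1}{2\sigma_\theta^2} \| \theta \|_2^2 + c_2,$$

so that

$$\theta_{MAP} = \arg\min_{\theta \in \mathbb{R}^K} \frac{1}{2} \| \Phi(X)\theta - Y \|_2^2 + \frac{\lambda}{2} \| \theta \|_2^2, \qquad \lambda := \frac{\sigma^2}{\sigma_\theta^2},$$

which is the regularized least-squares problem whose parameter $\lambda > 0$ appears in point 3 below.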
Given the two optimization problems above, you are required to implement a program that compares the two solutions, which we will refer to as $\theta_{MLE}$ and $\theta_{MAP}$.
1. Define a test problem in the following way:
• Let the user fix an integer K > 0 and define $\theta_{true} = (1, 1, \dots, 1)^T \in \mathbb{R}^K$ (you can also consider a different $\theta_{true}$);
• Define an input dataset

$$X = [x_1\, x_2\, \dots\, x_N] \in \mathbb{R}^N,$$

where the $x_i$ are $N$ uniformly distributed datapoints in the interval $[a, b]$, and $a < b$ are values that the user can select;
• Given a set of functions $\{\phi_1, \phi_2, \dots, \phi_K\}$, define the generalized Vandermonde matrix $\Phi(X) \in \mathbb{R}^{N \times K}$, whose element in position $(i, j)$ is $\phi_j(x_i)$. In particular, write a function building the classical Vandermonde matrix, where $\phi_j(x) = x^{j-1}$;
• Given a variance $\sigma^2 > 0$ defined by the user, compute

$$Y = \Phi(X)\theta_{true} + e,$$

where $e \sim \mathcal{N}(0, \sigma^2 I)$ is Gaussian-distributed noise with variance $\sigma^2$. Note that the test problem defined in this way closely mirrors the test problem we constructed in the first Lab.
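A minimal NumPy sketch of this construction (the fixed seed and function names are illustrative choices, not part of the assignment):

import numpy as np

rng = np.random.default_rng(42)  # illustrative fixed seed, for reproducibility only

def vandermonde(x, K):
    # Classical Vandermonde matrix: entry (i, j) is x_i ** j, for j = 0, ..., K-1
    return np.vander(x, N=K, increasing=True)

def test_problem(N, K, a, b, sigma2):
    # N uniformly distributed datapoints in [a, b]
    x = rng.uniform(a, b, size=N)
    theta_true = np.ones(K)
    # Y = Phi(X) @ theta_true + e, with e ~ N(0, sigma2 * I)
    y = vandermonde(x, K) @ theta_true + rng.normal(0.0, np.sqrt(sigma2), size=N)
    return x, y, theta_true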
2. We have now built a dataset $D = \{(X, Y)\}$ such that $\theta_{true} = (1, 1, \dots, 1)^T \in \mathbb{R}^K$ is the best solution to the least squares problem

$$\Phi(X)\theta \approx Y.$$
3. Pretend not to know the correct value of K. The first task is to try to guess it and use it to approximate the true solution θtrue by MLE and MAP. To do that:
• Write a function that takes as input the training data $D = (X, Y)$ and $K$, and returns the MLE solution (under the Gaussian assumption) $\theta_{MLE} \in \mathbb{R}^K$ for that problem. Note that the loss function can be optimized by GD, SGD or the Normal Equations (see the sketch after this list).
• Write a function that takes as input a $K$-dimensional parameter vector $\theta$ and a test set $TE = \{(X_{test}, Y_{test})\}$, and returns the average absolute error of the polynomial regressor $f_\theta(x)$ over $X_{test}$, computed as:

$$\mathrm{Err}(\theta) = \frac{1}{N_{test}} \sum_{i=1}^{N_{test}} \left| f_\theta(x_i^{test}) - y_i^{test} \right|.$$
• For different values of $K$, plot the training datapoints and the test datapoints in different colors, and visualize (as a continuous line) the learnt regression model $f_{\theta_{MLE}}(x)$. Comment on the results.
• For increasing values of $K$, use the functions defined above to compute the training and test errors, where the test set is generated by sampling $N_{test}$ new points on the same interval $[a, b]$ as the training set and generating the corresponding $Y_{test}$ with the same procedure as for the training set. Plot the two errors with respect to $K$. Comment on the results.
• Write a function that takes as input the training data $D = (X, Y)$, $K$ and $\lambda > 0$, and returns the MAP solution (under the Gaussian assumption) $\theta_{MAP} \in \mathbb{R}^K$ for that problem. Note that the loss function can be optimized by GD, SGD or the Normal Equations (see the sketch after this list).
• For $K$ lower than, equal to and greater than the correct degree of the test polynomial, plot the training datapoints and the test datapoints in different colors, and visualize (as a continuous line) the learnt regression model $f_{\theta_{MAP}}(x)$ for different values of $\lambda$. Comment on the results.
• For $K$ way greater than the correct degree of the polynomial, compute the MLE and MAP solutions. Compare the test errors of the two for different values of $\lambda$ (in the case of MAP).
• For $K$ greater than the true degree of the polynomial, define the parameter error

$$\mathrm{Err}(\theta) = \| \theta - \theta_{true} \|_2,$$

where $\theta_{true}$ has been padded with zeros to match the shape of $\theta$. Compute $\mathrm{Err}(\theta_{MLE})$ and $\mathrm{Err}(\theta_{MAP})$ for increasing values of $K$ and different values of $\lambda$.
• Compare the results obtained by increasing the number N of datapoints.
• Compare the results obtained by the three algorithms GD, SGD and Normal Equations.
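As referenced in the list above, here is a compact sketch of the MAP solver and the test-error routine under the Gaussian assumption (the Normal-Equations route and the names are one possible choice; GD and SGD are equally valid, as noted in the text):

import numpy as np

def map_gaussian(x, y, K, lam):
    Phi = np.vander(x, N=K, increasing=True)
    # Regularized normal equations: (Phi^T Phi + lam * I) theta = Phi^T y
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(K), Phi.T @ y)

def avg_abs_error(theta, x_test, y_test):
    Phi = np.vander(x_test, N=len(theta), increasing=True)
    # Average absolute error of f_theta over the test set
    return np.mean(np.abs(Phi @ theta - y_test))

Setting lam = 0 in map_gaussian recovers the MLE normal equations (provided $\Phi(X)^T \Phi(X)$ is invertible), which is one way to reuse the same code for both estimators.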
4. Repeat the first point of the homework by setting
$$y_i \sim \mathrm{Poi}\left( y \,\middle|\, \theta_{true,1} + \theta_{true,2} x_i + \cdots + \theta_{true,K} x_i^{K-1} \right) \qquad \forall i = 1, \dots, N \tag{2}$$

where $\mathrm{Poi}(y | \lambda)$ is the Poisson distribution with parameter $\lambda$ (you can find the Python function to sample from it), such that

$$\mathrm{Poi}(y | \lambda) = \frac{\lambda^y e^{-\lambda}}{y!},$$

and consequently,

$$p_\theta(y | x_i) = \mathrm{Poi}\left( y \,\middle|\, \theta_1 + \theta_2 x_i + \cdots + \theta_K x_i^{K-1} \right).$$
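A possible NumPy sketch of this sampling step (the positivity check is our own addition: a Poisson rate must be strictly positive, so $[a, b]$ and $\theta_{true}$ have to be chosen accordingly):

import numpy as np

rng = np.random.default_rng(42)  # illustrative fixed seed

def poisson_data(x, theta_true):
    # Rate lambda_i = theta_1 + theta_2 * x_i + ... + theta_K * x_i**(K-1)
    Phi = np.vander(x, N=len(theta_true), increasing=True)
    lam = Phi @ theta_true
    if np.any(lam <= 0):
        raise ValueError("Poisson rates must be positive: shrink [a, b] or change theta_true")
    return rng.poisson(lam)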
5. Compute the MLE and MAP losses for this choice of pθ(y|x) and repeat the experiments in the previous points.
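For reference, with $f_\theta(x_i) = \theta_1 + \theta_2 x_i + \cdots + \theta_K x_i^{K-1}$, the Poisson density above gives the negative log-likelihood

$$-\log p_\theta(y_i | x_i) = f_\theta(x_i) - y_i \log f_\theta(x_i) + \log(y_i!),$$

so that, dropping the $\theta$-independent term $\log(y_i!)$ and keeping the Gaussian prior of point 3 (here $\lambda$ again denotes the regularization parameter, not the Poisson rate):

$$\theta_{MLE} = \arg\min_{\theta} \sum_{i=1}^{N} \big( f_\theta(x_i) - y_i \log f_\theta(x_i) \big), \qquad \theta_{MAP} = \arg\min_{\theta} \sum_{i=1}^{N} \big( f_\theta(x_i) - y_i \log f_\theta(x_i) \big) + \frac{\lambda}{2} \| \theta \|_2^2.$$

Unlike the Gaussian case, these losses have no closed-form minimizer, so GD or SGD is the natural choice here.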