## Description

1: Independent Events and Bayes Theorem

1.

We aim to prove that

$$P(A \mid B) = \frac{P(B \mid A)P(A)}{P(B \mid A)P(A) + P(B \mid \neg A)P(\neg A)}$$

First, from the definition of conditional probability, $P(A \mid B) = \frac{P(A \cap B)}{P(B)}$ and $P(B \mid A) = \frac{P(B \cap A)}{P(A)}$.

Note that the numerator of both expressions is the same, as the intersection of sets is commutative. Therefore,

P(A ∩ B) = P(A∣B)P(B) = P(B∣A)P(A)

Using the right two terms above:

$$P(A \mid B)P(B) = P(B \mid A)P(A)$$

$$P(A \mid B) = \frac{P(B \mid A)P(A)}{P(B)}$$

The denominator of the above statement can be expanded. A ∪ ¬A contains all outcomes, as the union of an event with its complement is the whole sample space. Since A and ¬A are also disjoint, B splits into the two disjoint pieces A ∩ B and ¬A ∩ B.

Therefore, P(B) = P(A ∩ B) + P(¬A ∩ B). From the definition of conditional probability, this can be written as P(B∣A)P(A) + P(B∣¬A)P(¬A). If we replace P(B) in the denominator of the above expression, we are left with

$$P(A \mid B) = \frac{P(B \mid A)P(A)}{P(B \mid A)P(A) + P(B \mid \neg A)P(\neg A)}$$

This is what we aimed to prove. ■
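As a numerical sanity check of the identity just proved, the sketch below evaluates both sides on a small joint distribution. The four probabilities are hypothetical values chosen for illustration; they do not come from the assignment.

```python
# Hypothetical joint probabilities over (A or not A) and (B or not B).
# These values are illustrative only and sum to 1.
p_a_and_b = 0.12
p_a_and_not_b = 0.18
p_not_a_and_b = 0.28
p_not_a_and_not_b = 0.42

# Marginals obtained by summing the relevant outcomes.
p_a = p_a_and_b + p_a_and_not_b
p_not_a = p_not_a_and_b + p_not_a_and_not_b
p_b = p_a_and_b + p_not_a_and_b

# Conditionals from the definition of conditional probability.
p_b_given_a = p_a_and_b / p_a
p_b_given_not_a = p_not_a_and_b / p_not_a

# Left-hand side: P(A | B) straight from the definition.
lhs = p_a_and_b / p_b
# Right-hand side: Bayes' theorem with the expanded denominator.
rhs = (p_b_given_a * p_a) / (p_b_given_a * p_a + p_b_given_not_a * p_not_a)

print(f"definition: {lhs:.6f}, Bayes: {rhs:.6f}")
```

Any other valid joint distribution with P(A), P(¬A), and P(B) all non-zero should give matching values as well.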

2a

If X is independent of Y, then P(X ∩ Y) = P(X)P(Y). In this example, we find the probability of each event by summing the probabilities of all outcomes included in that event.

P(X ∩ Y) = 0.1 + 0.175 = 0.275

P(X) = 0.05 + 0.1 + 0.1 + 0.175 = 0.425

P(Y) = 0.2 + 0.1 + 0.175 + 0.175 = 0.65

P(X)P(Y) = 0.27625 ≠ 0.275 = P(X ∩ Y)

Therefore, X and Y are not independent.
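The comparison in part 2a can be reproduced with a few lines of Python. This is a minimal sketch that uses only the sums quoted above; the variable names are illustrative.

```python
import math

# Sums quoted in the solution to part 2a.
p_x_and_y = 0.1 + 0.175
p_x = 0.05 + 0.1 + 0.1 + 0.175
p_y = 0.2 + 0.1 + 0.175 + 0.175

# X and Y are independent only if P(X)P(Y) equals P(X and Y).
product = p_x * p_y
independent = math.isclose(product, p_x_and_y, abs_tol=1e-9)

print(f"P(X)P(Y) = {product:.5f}")
print(f"P(X and Y) = {p_x_and_y:.5f}")
print(f"independent: {independent}")
```

A tolerance-based comparison is used instead of `==` so that floating-point rounding cannot affect the verdict.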

2b

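To test conditional independence of X and Y given Z, we use the same process as above, summing only the outcomes where Z = 1. We also divide each sum by the probability that Z occurs, which is P(Z) = 0.1 + 0.1 + 0.175 + 0.175 = 0.55.

$$P(X \cap Y \mid Z) = \frac{0.175}{0.55} \approx 0.31818$$

$$P(X \mid Z) = \frac{0.1 + 0.175}{0.55} = 0.5$$

$$P(Y \mid Z) = \frac{0.175 + 0.175}{0.55} \approx 0.63636$$

$$P(X \mid Z)P(Y \mid Z) \approx 0.31818 = P(X \cap Y \mid Z)$$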

Therefore, X is conditionally independent of Y given Z.

2c

First, we calculate that P(Z = 0) = 0.45, which we get by summing the probability of all outcomes where Z = 0.

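$$P(X \neq Y \mid Z = 0) = P(X = 0, Y = 1 \mid Z = 0) + P(X = 1, Y = 0 \mid Z = 0)$$

Now, we compute the two probabilities.

$$P(X = 0, Y = 1 \mid Z = 0) = \frac{0.2}{0.45} \approx 0.4444$$

$$P(X = 1, Y = 0 \mid Z = 0) = \frac{0.05}{0.45} \approx 0.1111$$

Therefore, $P(X \neq Y \mid Z = 0) \approx 0.5556$.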
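The conditional calculations in parts 2b and 2c can be checked from a joint table. In the sketch below the table is an assumption: it is reconstructed from the sums quoted in this solution, not copied from the original problem statement. The helper names `prob` and `cond_prob` are also illustrative.

```python
import math

# Joint distribution P(X=x, Y=y, Z=z), keyed by (x, y, z).
# ASSUMPTION: reconstructed from the sums quoted in the solution,
# not taken from the original problem statement.
joint = {
    (0, 0, 0): 0.1, (0, 0, 1): 0.1,
    (0, 1, 0): 0.2, (0, 1, 1): 0.175,
    (1, 0, 0): 0.05, (1, 0, 1): 0.1,
    (1, 1, 0): 0.1, (1, 1, 1): 0.175,
}

def prob(event):
    """Sum the probability of every outcome (x, y, z) for which event is true."""
    return sum(p for outcome, p in joint.items() if event(*outcome))

def cond_prob(event, given):
    """P(event | given) = P(event and given) / P(given)."""
    both = prob(lambda x, y, z: event(x, y, z) and given(x, y, z))
    return both / prob(given)

def z_is_1(x, y, z):
    return z == 1

def z_is_0(x, y, z):
    return z == 0

# Part 2b: compare P(X, Y | Z) with P(X | Z) P(Y | Z).
p_xy_z = cond_prob(lambda x, y, z: x == 1 and y == 1, z_is_1)
p_x_z = cond_prob(lambda x, y, z: x == 1, z_is_1)
p_y_z = cond_prob(lambda x, y, z: y == 1, z_is_1)
cond_independent = math.isclose(p_xy_z, p_x_z * p_y_z)
print(f"P(X,Y|Z) = {p_xy_z:.5f}, P(X|Z)P(Y|Z) = {p_x_z * p_y_z:.5f}")

# Part 2c: probability that X and Y differ, given Z = 0.
p_diff_z0 = cond_prob(lambda x, y, z: x != y, z_is_0)
print(f"P(X != Y | Z = 0) = {p_diff_z0:.4f}")
```

Writing events as predicates keeps each probability close to its set-notation form, at the cost of a little extra code.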

2: Maximum Likelihood Estimation

1

The log likelihood function is

$$l(\hat{\theta}) = \log L(\hat{\theta}) = \log P(X_1, \dots, X_n \mid \hat{\theta})$$

We are told that each value is drawn from a single Bernoulli distribution with parameter θ, so $X_i$ can only take the values 0 or 1. We can therefore define

$$P(X_i \mid \hat{\theta}) = \hat{\theta}^{X_i}(1 - \hat{\theta})^{1 - X_i}$$

The likelihood of all of the values being drawn is the product of all of the probabilities of each event occurring, as we consider the events independent.

$$L(\hat{\theta}) = P(X_1, \dots, X_n \mid \hat{\theta}) = \prod_{i=1}^{n} \hat{\theta}^{X_i}(1 - \hat{\theta})^{1 - X_i}$$

Therefore, the log likelihood function is

$$l(\hat{\theta}) = \log \prod_{i=1}^{n} \hat{\theta}^{X_i}(1 - \hat{\theta})^{1 - X_i}$$

While the above equation is correct, we can simplify further using log properties.

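$$\begin{aligned}
l(\hat{\theta}) &= \sum_{i=1}^{n} \log\left(\hat{\theta}^{X_i}(1 - \hat{\theta})^{1 - X_i}\right) \\
&= \sum_{i=1}^{n} \left[ X_i \log(\hat{\theta}) + (1 - X_i)\log(1 - \hat{\theta}) \right]
\end{aligned}$$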

Because the log likelihood is a sum, the order of the random variables will not matter, as addition is commutative.
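The order-invariance claim can be checked numerically. The sketch below implements the simplified log likelihood and evaluates it on a sample and on a shuffled copy of that sample. The sample values and the trial value of θ are hypothetical, chosen only for illustration.

```python
import math
import random

def bernoulli_log_likelihood(theta, xs):
    """Sum of x*log(theta) + (1 - x)*log(1 - theta) over xs, for 0 < theta < 1."""
    return sum(x * math.log(theta) + (1 - x) * math.log(1 - theta) for x in xs)

# Hypothetical sample of 0s and 1s (illustrative only).
sample = [1, 0, 1, 1, 0, 1, 0, 1, 1, 0]

# A shuffled copy holds the same values in a different order.
shuffled = sample[:]
random.Random(0).shuffle(shuffled)

theta = 0.3  # arbitrary trial value inside (0, 1)
original = bernoulli_log_likelihood(theta, sample)
permuted = bernoulli_log_likelihood(theta, shuffled)

print(math.isclose(original, permuted))
```

`math.isclose` is used because floating-point addition can differ in the last bits when terms are summed in a different order, even though the exact sums are equal.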

2

In order to find the maximum likelihood estimate, we first find the derivative of the above function.

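$$l'(\hat{\theta}) = \frac{\partial}{\partial \hat{\theta}} \sum_{i=1}^{n} \left[ X_i \log(\hat{\theta}) + (1 - X_i)\log(1 - \hat{\theta}) \right]$$

Now, we calculate the partial derivative on the right-hand side.

$$\begin{aligned}
l'(\hat{\theta}) &= \sum_{i=1}^{n} \left[ \frac{X_i}{\hat{\theta}} - \frac{1 - X_i}{1 - \hat{\theta}} \right] \\
&= \sum_{i=1}^{n} \left[ \frac{X_i(1 - \hat{\theta})}{\hat{\theta}(1 - \hat{\theta})} - \frac{\hat{\theta}(1 - X_i)}{\hat{\theta}(1 - \hat{\theta})} \right] \\
&= \sum_{i=1}^{n} \frac{X_i - \hat{\theta}}{\hat{\theta}(1 - \hat{\theta})}
\end{aligned}$$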

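Using this notation, we can make an important observation. $X_i$ is always either 0 or 1. When $X_i$ is 1, the term inside the sum is $\frac{1}{\hat{\theta}}$, and when $X_i$ is 0, the term is $-\frac{1}{1 - \hat{\theta}}$. This observation makes it much easier to compute the best value of $\hat{\theta}$. The sample data has six 1s and four 0s. Therefore, the sum is

$$l'(\hat{\theta}) = \frac{6}{\hat{\theta}} - \frac{4}{1 - \hat{\theta}}$$

We want to find the value of $\hat{\theta}$ at which this expression equals 0.

$$\begin{aligned}
\frac{6}{\hat{\theta}} - \frac{4}{1 - \hat{\theta}} &= 0 \\
\frac{6(1 - \hat{\theta})}{\hat{\theta}(1 - \hat{\theta})} - \frac{4\hat{\theta}}{\hat{\theta}(1 - \hat{\theta})} &= 0 \\
\frac{6 - 6\hat{\theta} - 4\hat{\theta}}{\hat{\theta}(1 - \hat{\theta})} &= 0 \\
\frac{6 - 10\hat{\theta}}{\hat{\theta}(1 - \hat{\theta})} &= 0
\end{aligned}$$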

This function is zero when the numerator is 0 and the denominator is not zero. Therefore, the function is zero at $\hat{\theta} = 0.6$.

Because $l'(0.6) = 0$, this point must be a local maximum or minimum of $l(\hat{\theta})$. Recall that $\hat{\theta}$ must be between 0 and 1. Therefore, the maximum and minimum of $l(\hat{\theta})$ must fall at 0, 1, or 0.6.

We will now calculate these values. (Done programmatically to save time)

From these results, we can see that the maximum occurs at 0.6. Therefore, the maximum likelihood estimate is $\hat{\theta}_{MLE} = 0.6$.
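The comparison of the three candidate points could be carried out along the following lines. This is a sketch under the stated counts (six 1s and four 0s), not the original notebook code. The function name and the extra grid check are illustrative additions.

```python
import math

def log_likelihood(theta, ones=6, zeros=4):
    """Bernoulli log likelihood for a sample with the given counts of 1s and 0s.

    Assumes both counts are positive, so the likelihood is 0 at theta = 0
    and at theta = 1, and the log likelihood there is -infinity.
    """
    if theta <= 0.0 or theta >= 1.0:
        return -math.inf
    return ones * math.log(theta) + zeros * math.log(1.0 - theta)

# The two endpoints and the critical point found above.
candidates = [0.0, 0.6, 1.0]
values = {theta: log_likelihood(theta) for theta in candidates}
best = max(values, key=values.get)

for theta, value in values.items():
    print(f"l({theta}) = {value}")
print(f"best candidate: {best}")

# Cross-check on a fine grid over the open interval (0, 1).
grid = [i / 100 for i in range(1, 100)]
grid_best = max(grid, key=log_likelihood)
print(f"best grid point: {grid_best}")
```

The endpoints are handled explicitly because `math.log(0)` raises an error instead of returning negative infinity.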

3

As before, the log likelihood function is

$$l(\hat{\theta}) = \log L(\hat{\theta}) = \log P(Y_1, \dots, Y_m \mid \hat{\theta})$$

where $P(Y_1, \dots, Y_m \mid \hat{\theta})$ is the probability of seeing the m i.i.d. random variables $Y_1, \dots, Y_m$ when drawn from a Binomial distribution $B(n, \theta)$.

First, we look to define $P(Y_1, \dots, Y_m \mid \hat{\theta})$. Because each $Y_i$ is drawn from a Binomial distribution,

$$P(Y_i = k) = \frac{n!}{k!(n - k)!} \cdot \hat{\theta}^{k}(1 - \hat{\theta})^{n - k}$$

Because $Y_i = k$, we can substitute $Y_i$ for k on the right-hand side.

$$P(Y_i \mid \hat{\theta}) = \frac{n!}{Y_i!(n - Y_i)!} \cdot \hat{\theta}^{Y_i}(1 - \hat{\theta})^{n - Y_i}$$

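Now, since we assume the random variables to be independent, $P(Y_1, \dots, Y_m \mid \hat{\theta})$ is the product of $P(Y_i \mid \hat{\theta})$ for all i from 1 to m.

$$P(Y_1, \dots, Y_m \mid \hat{\theta}) = \prod_{i=1}^{m} \frac{n!}{Y_i!(n - Y_i)!} \cdot \hat{\theta}^{Y_i}(1 - \hat{\theta})^{n - Y_i}$$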

Now, we can create an expression for the log likelihood.

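$$\begin{aligned}
l(\hat{\theta}) &= \log L(\hat{\theta}) = \log P(Y_1, \dots, Y_m \mid \hat{\theta}) \\
&= \log \prod_{i=1}^{m} \frac{n!}{Y_i!(n - Y_i)!} \cdot \hat{\theta}^{Y_i}(1 - \hat{\theta})^{n - Y_i}
\end{aligned}$$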

Now, we simplify using log properties.

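The product becomes a sum, and the log of the factors multiplied together becomes a sum of logs.

$$l(\hat{\theta}) = \sum_{i=1}^{m} \left[ \log\frac{n!}{Y_i!(n - Y_i)!} + \log\left(\hat{\theta}^{Y_i}\right) + \log\left((1 - \hat{\theta})^{n - Y_i}\right) \right]$$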

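Now, exponents are moved in front of the logs, and the log of the fraction on the left is changed to a subtraction.

$$\begin{aligned}
l(\hat{\theta}) &= \sum_{i=1}^{m} \left[ \log(n!) - \log(Y_i!(n - Y_i)!) + Y_i \log(\hat{\theta}) + (n - Y_i)\log(1 - \hat{\theta}) \right] \\
&= \sum_{i=1}^{m} \left[ \log(n!) - \log(Y_i!) - \log((n - Y_i)!) + Y_i \log(\hat{\theta}) + (n - Y_i)\log(1 - \hat{\theta}) \right]
\end{aligned}$$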

4

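To find the maximum likelihood estimate, we need to find a value of $\hat{\theta}$ that maximizes $l(\hat{\theta})$. To do that, we find the first derivative $l'(\hat{\theta})$, then find the values of $\hat{\theta}$ where $l'(\hat{\theta}) = 0$. This gives the local extrema, which are the candidates for the maximum. The factorial terms do not depend on $\hat{\theta}$, so they vanish when we differentiate.

$$\begin{aligned}
l'(\hat{\theta}) &= \frac{\partial}{\partial \hat{\theta}} \sum_{i=1}^{m} \left[ \log(n!) - \log(Y_i!) - \log((n - Y_i)!) + Y_i \log(\hat{\theta}) + (n - Y_i)\log(1 - \hat{\theta}) \right] \\
&= \frac{\partial}{\partial \hat{\theta}} \sum_{i=1}^{m} \left[ Y_i \log(\hat{\theta}) + (n - Y_i)\log(1 - \hat{\theta}) \right] \\
&= \sum_{i=1}^{m} \left[ \frac{Y_i}{\hat{\theta}} - \frac{n - Y_i}{1 - \hat{\theta}} \right] \\
&= \sum_{i=1}^{m} \left[ \frac{Y_i(1 - \hat{\theta})}{\hat{\theta}(1 - \hat{\theta})} - \frac{\hat{\theta}(n - Y_i)}{\hat{\theta}(1 - \hat{\theta})} \right] \\
&= \sum_{i=1}^{m} \frac{Y_i - n\hat{\theta}}{\hat{\theta}(1 - \hat{\theta})}
\end{aligned}$$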

For this problem, we know n = 5, so we can make that substitution. We now look to determine what value of $\hat{\theta}$ makes this expression 0.

$$\sum_{i=1}^{m} \frac{Y_i - 5\hat{\theta}}{\hat{\theta}(1 - \hat{\theta})} = 0$$

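We only have 2 Y values, so it is easy to write the whole left-hand side without the sigma.

$$\frac{Y_1 - 5\hat{\theta}}{\hat{\theta}(1 - \hat{\theta})} + \frac{Y_2 - 5\hat{\theta}}{\hat{\theta}(1 - \hat{\theta})} = \frac{Y_1 + Y_2 - 10\hat{\theta}}{\hat{\theta}(1 - \hat{\theta})} = 0$$

We can find the solution to this equation by finding the values of $\hat{\theta}$ where the numerator is equal to 0 and the denominator is not equal to 0. This occurs at $\hat{\theta} = \frac{Y_1 + Y_2}{10}$, which for the observed values is 0.6.

Because $l'(0.6) = 0$, this point must be a local maximum or minimum of $l(\hat{\theta})$. Recall that $\hat{\theta}$ must be between 0 and 1. Therefore, the maximum and minimum of $l(\hat{\theta})$ must fall at 0, 1, or 0.6.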

We will now calculate these values. (Done programmatically to save time)

From these results, we can see that the maximum occurs at 0.6. Therefore, the maximum likelihood estimate is $\hat{\theta}_{MLE} = 0.6$.
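A sketch of the same check for the Binomial case is shown below. The observations `ys = [3, 3]` are an assumption: they are hypothetical values that sum to 6 and so agree with the estimate of 0.6, but they are not necessarily the original data. The closed form used here, the sum of the $Y_i$ divided by mn, comes from setting the numerator of the derivative to zero.

```python
import math

def binomial_log_likelihood(theta, ys, n):
    """Sum over y of log C(n, y) + y*log(theta) + (n - y)*log(1 - theta)."""
    return sum(
        math.log(math.comb(n, y))
        + y * math.log(theta)
        + (n - y) * math.log(1.0 - theta)
        for y in ys
    )

# ASSUMPTION: hypothetical observations with n = 5 trials each.
# They sum to 6, consistent with the estimate above, but are not the original data.
ys = [3, 3]
n = 5

# Closed form from setting the derivative to zero: sum(Y_i) / (m * n).
theta_closed_form = sum(ys) / (len(ys) * n)

# Cross-check on a grid over the open interval (0, 1), where the logs are defined.
grid = [i / 1000 for i in range(1, 1000)]
theta_grid = max(grid, key=lambda t: binomial_log_likelihood(t, ys, n))

print(f"closed form: {theta_closed_form}, grid search: {theta_grid}")
```

The `log C(n, y)` term is kept for completeness, although it does not depend on θ and therefore does not move the maximizer.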

5

My log likelihood functions for part 1 and part 3 were very similar. This is unsurprising, as the definition of the Binomial distribution contains the definition of the Bernoulli distribution with the addition of the “choose” term. After the logarithm was applied, both ended up as a sum, and the last two terms were almost identical, except that part 3 had (n − Yi) where part 1 had (1 − Xi). This is purely a consequence of Yi being able to take any integer value between 0 and n, whereas Xi is either 1 or 0.

This also provides some extra evidence to the claim that order does not matter for Part 1. In the calculation for part 4, the order of values pulled from the Bernoulli distribution was never used, only the fact that 3 of the 5 were 1. Y1 and Y2 could have come from (1, 1, 1, 0, 0) and (1, 1, 1, 0, 0) respectively, and the same answer would have been reached.

3: Implementing Naive Bayes

To implement Naive Bayes, I used scikit-learn’s MultinomialNB and ComplementNB. The full documentation for the classifiers can be found at https://scikit-learn.org/stable/modules/naive_bayes.html.

Both classifiers make the naive assumption of conditional independence between each pair of features given the label, and both use MAP estimation under the hood. Additionally, both models take a smoothing parameter α that adds pseudo-counts, so that features not present in the training data for a class do not receive zero probability. I chose to set α = 1 (Laplace smoothing).
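To make the roles of the independence assumption, the MAP decision, and the smoothing parameter α concrete, here is a minimal pure-Python sketch of a multinomial Naive Bayes classifier. It is not scikit-learn's implementation and it is not the code used for this assignment. The function names and the toy corpus are hypothetical.

```python
import math
from collections import Counter, defaultdict

def train_multinomial_nb(docs, labels, alpha=1.0):
    """Estimate log priors and smoothed log likelihoods from tokenized documents."""
    vocab = sorted({word for doc in docs for word in doc})
    class_doc_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc)

    log_prior = {c: math.log(k / len(docs)) for c, k in class_doc_counts.items()}
    log_lik = {}
    for c in class_doc_counts:
        total = sum(word_counts[c].values()) + alpha * len(vocab)
        # Additive smoothing: each vocabulary word gets alpha pseudo-counts,
        # so a word never seen with class c keeps a non-zero probability.
        log_lik[c] = {w: math.log((word_counts[c][w] + alpha) / total) for w in vocab}
    return log_prior, log_lik

def predict(doc, log_prior, log_lik):
    """Return the class with the highest posterior score (MAP decision)."""
    def score(c):
        # Naive assumption: given the class, per-word log probabilities add up.
        # Words outside the training vocabulary are ignored.
        return log_prior[c] + sum(log_lik[c][w] for w in doc if w in log_lik[c])
    return max(log_prior, key=score)

# Hypothetical toy corpus, for illustration only.
docs = [
    ["free", "prize", "now"],
    ["meeting", "at", "noon"],
    ["free", "offer"],
    ["lunch", "meeting"],
]
labels = ["spam", "ham", "spam", "ham"]

log_prior, log_lik = train_multinomial_nb(docs, labels, alpha=1.0)
prediction = predict(["free", "meeting", "prize"], log_prior, log_lik)
print(prediction)
```

With α = 0, any word that never occurred with a class would have probability 0 and `math.log` would fail, which is the situation the smoothing parameter guards against.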

I selected MultinomialNB specifically because the documentation indicated that it performs well on data represented as word vector counts.

I then tried ComplementNB as it is a variation of MultinomialNB that uses the complement and argmin to calculate the weights. The documentation also said that this method regularly outperforms the above method, so it was worth a try.

See below for screenshots of how each model performed. The source itself will also be attached as a jupyter notebook.

Here are the results for MultinomialNB:

Here are the results for ComplementNB:

As we can see, both models took similar (very small) amounts of time to train. They ended up with training accuracies within 0.01% of each other, and identical test accuracy. With an accuracy of above 98%, both classifiers are very good at labeling data.
