Tuesday, November 22, 2011

A* Search

Algorithms form an important part of the problem-solving process in Artificial Intelligence. One useful algorithm is A* Search, which is an extension of another useful algorithm, Dijkstra's algorithm. This video is a short introduction to the subject.

So, the key part of this algorithm is the evaluation function $$f(\text{node})=g(\text{node})+h(\text{node})$$ (note that in the video "state" is treated as a "node"), where
- $g(\text{node})$ is the cost from the "Start" node to the current node and
- $h(\text{node})$ is the heuristic (estimated) cost from the current node to the "Goal".
The algorithm works by always expanding the path which has the minimum value of $f(\text{node})$ first (cheapest first). The search terminates when the "Goal" node is chosen for expansion or there are no more nodes to expand.

For the heuristic function to be admissible (or optimistic, which means the same), it is important that for any node $h(\text{node}) \le$ the actual cost from that node to the "Goal". If this is true, then A* Search will always find the lowest-cost path from the "Start" to the "Goal"; this video explains the principle in more detail.

As you may have noticed, the last video uses another term, the "search frontier". It is a "characteristic" of the algorithm, or the way it progresses while expanding the nodes. For example:
- this article shows how the search frontier looks for the Breadth-First Search algorithm and
- this article shows how the search frontier looks for the Depth-First Search algorithm.

An implementation of this algorithm would use a priority queue, where priority is dictated by the function f.
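
Here is a minimal sketch of such an implementation in Python (the `neighbors` and `h` arguments are illustrative names of my own, not taken from the videos):

```python
import heapq
from itertools import count

def a_star(start, goal, neighbors, h):
    """A* search: 'neighbors(node)' yields (next_node, step_cost) pairs,
    'h(node)' is the heuristic estimate of the cost from node to the goal."""
    tie = count()                     # breaks ties so nodes themselves are never compared
    frontier = [(h(start), next(tie), 0, start, [start])]   # priority is f = g + h
    expanded = set()
    while frontier:
        f, _, g, node, path = heapq.heappop(frontier)       # cheapest f first
        if node in expanded:
            continue
        expanded.add(node)
        if node == goal:              # the "Goal" is chosen for expansion, so we are done
            return path, g
        for nxt, cost in neighbors(node):
            if nxt not in expanded:
                heapq.heappush(frontier, (g + cost + h(nxt), next(tie), g + cost, nxt, path + [nxt]))
    return None, float("inf")         # no more nodes to expand
```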

Now, let's consolidate this material with an exercise:

For the heuristic function h and a cost of 10 per action (per step), find the order (1, 2, 3, ...) in which the nodes are expanded (a node is expanded when it is removed from the queue). Start with "1" at the "Start" state at the top. Indicate the nodes that will never be expanded. Is the heuristic function admissible?

First of all, the heuristic function is not admissible because h(B5)=20 > 10, where 10 is the actual cost from B5 to the "Goal".

Now, let's start with the expansion. The frontier is at the "Start".

1. "Start" is the first node to be expanded. f(A1)=10+11=21, f(A2)=18, f(A3)=17, f(A4)=16, f(A5)=20. The frontier moves to A1, A2, A3, A4, A5 (this is the content of the queue).

2. A4 is the node (in the queue) with the minimum f (=16), so it's the second node to be expanded (removed from the queue). f(B4)=10+10+5=25, f(B5)=40. The frontier moves to A1, A2, A3, B4, B5, A5.

3. A3 is the next node with the minimum f (=17), so it's the third node to be expanded. But there is nothing to add to the queue. The frontier now is A1, A2, B4, B5, A5.

4. A2 is the next node with the minimum f (=18), so it's the fourth node to be expanded. f(B2)=10+10+3=23, f(B3)=29. The frontier moves to A1, B2, B3, B4, B5, A5.

5. A5 is the next node (again, in the queue) with the minimum f (=20), so it's the fifth node to be expanded. f("Goal")=10+10+0=20. The frontier moves to A1, B2, B3, B4, B5, "Goal".

6. The "Goal" is the next node (from the queue) with the minimum f (=20), so it's the sixth node to be expanded and, because it is the "Goal", it is also the last one to be expanded (i.e. end of the search).

Here is the expansion order: "Start"-1, A4-2, A3-3, A2-4, A5-5, "Goal"-6.

The nodes which were never expanded: A1, B1, B2, B3, B4 and B5.
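
For completeness, the same exercise can be replayed in code, assuming the graph structure and heuristic values implied by the walkthrough above (the heuristic at the "Start" and at B1 is not given and does not affect the result, so arbitrary values are used):

```python
import heapq

# Graph and heuristic as implied by the walkthrough; B1 is assumed to be a child of A1.
edges = {
    "Start": ["A1", "A2", "A3", "A4", "A5"],
    "A1": ["B1"], "A2": ["B2", "B3"], "A3": [],
    "A4": ["B4", "B5"], "A5": ["Goal"],
}
h = {"Start": 0, "A1": 11, "A2": 8, "A3": 7, "A4": 6, "A5": 10,
     "B1": 99, "B2": 3, "B3": 9, "B4": 5, "B5": 20, "Goal": 0}
COST = 10                                      # cost of every action (per step)

frontier = [(h["Start"], 0, "Start")]          # entries are (f, g, node)
expanded = []
while frontier:
    f, g, node = heapq.heappop(frontier)
    if node in expanded:
        continue
    expanded.append(node)
    if node == "Goal":
        break
    for nxt in edges.get(node, []):
        heapq.heappush(frontier, (g + COST + h[nxt], g + COST, nxt))

print(expanded)  # ['Start', 'A4', 'A3', 'A2', 'A5', 'Goal']
```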

Thursday, November 17, 2011

Dimensionality Reduction

Another interesting Machine Learning algorithm (unsupervised learning this time) is Dimensionality Reduction.

Here is a short video explaining the theory. In the example presented in this video, the mean ($\mu$ vector) is the simple arithmetic average per column of the matrix $X$. The $\Sigma$ (covariance) matrix is simply $\Sigma=\frac{1}{M}\cdot(X-\mu)^{T}\cdot(X-\mu)$ (where $X-\mu$ is a per-column operation, i.e. the mean is subtracted from each element of the relevant column, $T$ is the transpose operator and $M$ is the number of rows in the matrix $X$). The eigenvalues and eigenvectors are those of the $\Sigma$ matrix, i.e. they satisfy $\Sigma\cdot v=\lambda\cdot v$, where $\lambda$ is a scalar value.
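
Here is a minimal NumPy sketch of that computation (the toy matrix X is made up purely for illustration):

```python
import numpy as np

X = np.array([[1.0, 2.0],
              [3.0, 6.0],
              [5.0, 10.0],
              [7.0, 14.0]])            # toy data: M rows, 2 columns

M = X.shape[0]
mu = X.mean(axis=0)                    # mean per column of X
Sigma = (X - mu).T @ (X - mu) / M      # covariance matrix (1/M)*(X-mu)^T*(X-mu)

# Eigenvalues/eigenvectors of Sigma (Sigma*v = lambda*v); eigh handles symmetric matrices.
eigenvalues, eigenvectors = np.linalg.eigh(Sigma)

# Project onto the eigenvector with the largest eigenvalue: 2 dimensions down to 1.
top = eigenvectors[:, np.argmax(eigenvalues)]
X_reduced = (X - mu) @ top
print(eigenvalues, X_reduced)
```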

And here is another video explaining the algorithm's applicability.

Monday, November 14, 2011

Linear Regression

This post will be another short Machine Learning lesson (or a set of materials to be more precise). Particularly, it will be about Linear Regression, which is a method of supervised machine learning (when there is a training set).

First of all, please watch these videos explaining the theoretical material: video 1, video 2 and video 3.

Basically, it is a way to "predict" (or diagnose) a result from an input, after training the model with the training set.

The mathematics behind this is explained here; the result is also known as the best-fitting curve.

In order to consolidate this material (at least for myself :)), here is an easy exercise. Given the training set:

x:  0  1  2  3  4
y:  3  6  7  8  11
Find the hypothesis function $f(x)=w_{1}\cdot x + w_{0}$.

Using the formulas from the video material:
$$w_{0}=\frac{1}{M}\cdot\sum y_{i} - \frac{w_{1}}{M} \cdot\sum x_{i}$$
$$w_{1}=\frac{M\cdot \sum x_{i}\cdot y_{i} - \left(\sum x_{i}\right)\cdot\left(\sum y_{i}\right)}{M\cdot \sum x^2_{i} - \left(\sum x_{i}\right)^2}$$
and the fact that M=5, we have:

$$w_{1}=\\ \frac{5\cdot(0\cdot3+1\cdot6+2\cdot7+3\cdot8+4\cdot11)-(0+1+2+3+4)\cdot(3+6+7+8+11)}{5\cdot(0+1+4+9+16)-(0+1+2+3+4)^2}\\ =\frac{5\cdot 88-350}{5\cdot 30-100}=\frac{9}{5}$$

$$w_{0}=\frac{1}{5} \cdot (3+6+7+8+11) - \frac{9}{5} \cdot \frac{1}{5} \cdot (0+1+2+3+4) =\\ 7 - \frac{18}{5} = \frac{17}{5}$$

So $f(x)=1.8\cdot x + 3.4$
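
The same numbers can be checked with a few lines of Python applying the formulas above:

```python
from fractions import Fraction

xs = [0, 1, 2, 3, 4]
ys = [3, 6, 7, 8, 11]
M = len(xs)

sum_x, sum_y = sum(xs), sum(ys)
sum_xy = sum(x * y for x, y in zip(xs, ys))
sum_x2 = sum(x * x for x in xs)

w1 = Fraction(M * sum_xy - sum_x * sum_y, M * sum_x2 - sum_x ** 2)
w0 = Fraction(sum_y, M) - w1 * Fraction(sum_x, M)

print(w1, w0)                 # 9/5 17/5
print(float(w1), float(w0))   # 1.8 3.4
```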

Thursday, November 3, 2011

Naive Bayes and Laplace Smoothing

Don't panic if you don't understand my writing below. I encourage you to watch this video for more details; alternatively, subscribe to the following online course. So, we are going to do a bit of Machine Learning.

Let's take the following inputs (a training set with 2 categories):

MOVIE               SONG
A PERFECT WORLD     A PERFECT DAY
MY PERFECT WORLD    ELECTRIC STORM
PRETTY WOMAN        ANOTHER RAINY DAY

Vocabulary of this training set is:

A, PERFECT, WORLD, MY, WOMAN, PRETTY, DAY, ELECTRIC, STORM, ANOTHER, RAINY

Vocabulary size is 11 words (in the smoothing formulas below, the vocabulary words are counted the same way categories are; more details below).

The Laplace smoothing parameter is K=1.

P(SONG) = P(MOVIE) = (3+1) ⁄ (6+1⋅2) = 1 ⁄ 2
3 sentences in category, 6 - sentences altogether, 1 - is K, 2 - number of categories (SONG and MOVIE).

P("PERFECT"|MOVIE) = (2+1) ⁄ (8+1⋅11) = 3 ⁄ 19
2 occurrences in the MOVIE category, 1 - is K, 8 words in the MOVIE category, 11 - vocabulary size.

P("PERFECT"|SONG) = (1+1) ⁄ (8+1⋅11) = 2 ⁄ 19
1 occurrence in the SONG category, 1 - is K, 8 words in the SONG category, 11 - vocabulary size.

P("STORM"|MOVIE) = (0+1) ⁄ (8+1⋅11) = 1 ⁄ 19
0 occurrences in the MOVIE category, 1 - is K, 8 words in the MOVIE category, 11 - vocabulary size.

P("STORM"|SONG) = (1+1) ⁄ (8+1⋅11) = 2 ⋅ 19
1 occurrence in the SONG category, 1 - is K, 8 words in the SONG category, 11 - vocabulary size.

Applying Bayes' Rule we can calculate P(MOVIE|M), where M = {"PERFECT STORM"} (i.e. the probability that "PERFECT STORM" is a MOVIE).

P(MOVIE|M) = P(M|MOVIE)⋅P(MOVIE) ⁄ [P(M|MOVIE)⋅P(MOVIE)+P(M|SONG)⋅P(SONG)] = (3⁄19 ⋅ 1⁄19 ⋅ 1⁄2)/[3⁄19 ⋅ 1⁄19 ⋅ 1⁄2 + 2⁄19 ⋅ 2⁄19 ⋅ 1⁄2] = 3⁄(3+4) = 3⁄7
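
Finally, here is a small Python sketch that reproduces the whole calculation (the helper function names are mine, purely for illustration):

```python
from collections import Counter

training = {
    "MOVIE": ["A PERFECT WORLD", "MY PERFECT WORLD", "PRETTY WOMAN"],
    "SONG":  ["A PERFECT DAY", "ELECTRIC STORM", "ANOTHER RAINY DAY"],
}
K = 1                                                        # Laplace smoothing
vocab = {w for titles in training.values() for t in titles for w in t.split()}
total_titles = sum(len(titles) for titles in training.values())

def p_category(cat):
    # P(category), smoothed over the number of categories
    return (len(training[cat]) + K) / (total_titles + K * len(training))

def p_word_given(word, cat):
    # P(word | category), smoothed over the vocabulary size
    counts = Counter(w for t in training[cat] for w in t.split())
    return (counts[word] + K) / (sum(counts.values()) + K * len(vocab))

def p_category_given(words, cat):
    # Bayes' rule: P(cat | words) = P(words | cat) * P(cat) / normalizer
    def joint(c):
        p = p_category(c)
        for w in words:
            p *= p_word_given(w, c)
        return p
    return joint(cat) / sum(joint(c) for c in training)

print(p_category_given(["PERFECT", "STORM"], "MOVIE"))  # 3/7 ≈ 0.428571...
```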