NEURAL NETWORKS
Ivan F Wilde
Mathematics Department
King’s College London
London, WC2R 2LS, UK
Contents

1 Matrix Memory
2 Adaptive Linear Combiner
3 Artificial Neural Networks
4 The Perceptron
5 Multilayer Feedforward Networks
6 Radial Basis Functions
7 Recurrent Neural Networks
8 Singular Value Decomposition
Bibliography
Chapter 1
Matrix Memory
We wish to construct a system which possesses so-called associative memory. This is definable generally as a process by which an input to a memory system, considered as a “key”, is able to evoke, in a highly selective fashion, a specific response associated with that key at the system output. The signal-response association should be “robust”, that is, a “noisy” or “incomplete” input signal should nonetheless invoke the correct response, or at least an acceptable response. Such a system is also called a content addressable memory.
Figure 1.1: A content addressable memory (a stimulus is mapped to a response).
The idea is that the association should not be defined so much between the individual stimulus-response pairs, but rather embodied in the collection of such input-output patterns as a whole: the system is a distributed associative memory (the input-output pairs are “distributed” throughout the system memory rather than being somehow represented individually in various different parts of the system). To attempt to realize such a system, we shall suppose that the input key (or prototype) patterns are coded as vectors in R^n, say, and that the responses are coded as vectors in R^m. For example, the input might be a digitized photograph comprising a picture with 100 × 100 pixels, each of which may assume one of eight levels of greyness (from white (= 0) to black
(= 7)). In this case, by mapping the screen to a vector, via raster order, say, the input is a vector in R^10000 whose components take values in the set {0, . . . , 7}. The desired output might correspond to the name of the person in the photograph. If we wish to recognize up to 50 people, say, then we could give each a binary code name of 6 digits, which allows up to 2^6 = 64 different names. Then the output can be considered as an element of R^6.
Now, for any pair of vectors x ∈ Rn , y ∈ Rm , we can effect the map x → y via the action of the m × n matrix
M^{(x,y)} = y x^T,
where x is considered as an n × 1 (column) matrix and y as an m × 1 matrix. Indeed,
M^{(x,y)} x = y x^T x = α y,

where α = x^T x = ‖x‖², the squared Euclidean norm of x. The matrix y x^T is called the outer product of x and y. This suggests a model for our “associative system”.
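As a minimal illustration (my own sketch, not part of the original notes), the following numpy fragment builds the outer-product matrix y x^T and checks that it maps x to a multiple of y; the dimensions are arbitrary choices.

```python
import numpy as np

n, m = 5, 3                      # input and output dimensions (arbitrary)
rng = np.random.default_rng(0)
x = rng.standard_normal(n)       # key pattern in R^n
y = rng.standard_normal(m)       # response pattern in R^m

M = np.outer(y, x)               # M^{(x,y)} = y x^T, an m x n matrix

alpha = x @ x                    # squared Euclidean norm of x
assert np.allclose(M @ x, alpha * y)   # M x = (x^T x) y
```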
Suppose that we wish to consider p input-output pattern pairs, (x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), . . . , (x^{(p)}, y^{(p)}). Form the m × n matrix

M = Σ_{i=1}^{p} y^{(i)} x^{(i)T}.

M is called the correlation memory matrix (corresponding to the given pattern pairs). Note that if we let X = (x^{(1)} · · · x^{(p)}) and Y = (y^{(1)} · · · y^{(p)}) be the n × p and m × p matrices with columns given by the vectors x^{(1)}, . . . , x^{(p)} and y^{(1)}, . . . , y^{(p)}, respectively, then the matrix Σ_{i=1}^{p} y^{(i)} x^{(i)T} is just Y X^T.
Indeed, the jk-element of Y X^T is

(Y X^T)_{jk} = Σ_{i=1}^{p} Y_{ji} (X^T)_{ik} = Σ_{i=1}^{p} Y_{ji} X_{ki} = Σ_{i=1}^{p} y_j^{(i)} x_k^{(i)},

which is precisely the jk-element of M.
When presented with the input signal x^{(j)}, the output is

M x^{(j)} = Σ_{i=1}^{p} y^{(i)} x^{(i)T} x^{(j)} = y^{(j)} x^{(j)T} x^{(j)} + Σ_{i=1, i≠j}^{p} (x^{(i)T} x^{(j)}) y^{(i)}.
In particular, if we agree to “normalize” the key input signals so that x^{(i)T} x^{(i)} = ‖x^{(i)}‖² = 1, for all 1 ≤ i ≤ p, then the first term on the right hand side above is just y^{(j)}, the desired response signal. The second term on the right hand side is called the “cross-talk” since it involves overlaps (i.e., inner products) of the various input signals.
If the input signals are pairwise orthogonal vectors, as well as being normalized, then x^{(i)T} x^{(j)} = 0 for all i ≠ j. In this case, we get

M x^{(j)} = y^{(j)},

that is, perfect recall. Note that R^n contains at most n mutually orthogonal vectors.
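To make the construction concrete, here is a small numpy sketch (an illustration of mine, not from the notes) that builds the correlation memory matrix from an orthonormal set of keys and verifies perfect recall; taking standard basis vectors as keys is just a convenient way of obtaining orthonormal patterns.

```python
import numpy as np

n, m, p = 6, 4, 3
rng = np.random.default_rng(1)

# Orthonormal key patterns: here simply the first p standard basis vectors of R^n.
X = np.eye(n)[:, :p]                 # n x p, columns x^(1), ..., x^(p)
Y = rng.standard_normal((m, p))      # m x p, columns y^(1), ..., y^(p)

M = Y @ X.T                          # correlation memory matrix, sum_i y^(i) x^(i)^T

for j in range(p):
    assert np.allclose(M @ X[:, j], Y[:, j])   # M x^(j) = y^(j): perfect recall
```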
Operationally, one can imagine the system organized as indicated in the figure.
Figure 1.2: An operational view of the correlation memory matrix (the key patterns are loaded at the beginning; an input signal in R^n then produces an output in R^m).

• start with M = 0,
• load the key input patterns one by one:

M ← M + y^{(i)} x^{(i)T},   i = 1, . . . , p,
• finally, present any input signal and observe the response.

Note that additional signal-response patterns can simply be “added in” at any time, or even removed, by adding in −y^{(j)} x^{(j)T}. After the second stage above, the system has “learned” the signal-response pattern pairs. The collection of pattern pairs (x^{(1)}, y^{(1)}), . . . , (x^{(p)}, y^{(p)}) is called the training set.
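The incremental loading and unloading amounts to rank-one updates of M; the fragment below is a schematic illustration (the helper names are my own, not anything prescribed in the notes).

```python
import numpy as np

def load(M, x, y):
    """Add the pattern pair (x, y) to the memory: M <- M + y x^T."""
    return M + np.outer(y, x)

def unload(M, x, y):
    """Remove a previously stored pair by adding -y x^T."""
    return M - np.outer(y, x)

# Usage: start with M = 0 and load the key pattern pairs one by one.
n, m = 6, 4
M = np.zeros((m, n))
```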
Remark 1.1. In general, the system is a heteroassociative memory, x^{(i)} ↦ y^{(i)}, 1 ≤ i ≤ p. If the output is the prototype input itself, then the system is said to be an autoassociative memory.
We wish, now, to consider a quantitative account of the robustness of the autoassociative memory matrix. For this purpose, we shall suppose that the prototype patterns are bipolar vectors in R^n, i.e., the components of the x^{(i)} each belong to {−1, 1}. Then ‖x^{(i)}‖² = Σ_{j=1}^{n} x_j^{(i)2} = n, for each 1 ≤ i ≤ p, so that (1/√n) x^{(i)} is normalized. Suppose, further, that the prototype vectors are pairwise orthogonal (this requires that n be even). The correlation memory matrix is

M = (1/n) Σ_{i=1}^{p} x^{(i)} x^{(i)T}
and we have seen that M has perfect recall, M x(j ) = x(j ) for all 1 ≤ j ≤ p. We would like to know what happens if M is presented with x, a corrupted version of one of the x(j ) . In order to obtain a bipolar vector as output, we process the output vector M x as follows:
M x ↦ Φ(M x),

where Φ : R^n → {−1, 1}^n is defined by

Φ(z)_k = 1 if z_k ≥ 0, and Φ(z)_k = −1 if z_k < 0,
for 1 ≤ k ≤ n and z ∈ Rn . Thus, the matrix output is passed through a (bipolar) signal quantizer, Φ. To proceed, we introduce the notion of Hamming distance between pairs of bipolar vectors.
Let a = (a_1, . . . , a_n) and b = (b_1, . . . , b_n) be elements of {−1, 1}^n, i.e., bipolar vectors. The set {−1, 1}^n consists of the 2^n vertices of a hypercube in R^n. Then

a^T b = Σ_{i=1}^{n} a_i b_i = α − β,

where α is the number of components of a and b which are the same, and β is the number of differing components (a_i b_i = 1 if and only if a_i = b_i, and a_i b_i = −1 if and only if a_i ≠ b_i). Clearly, α + β = n and so a^T b = n − 2β.

Definition 1.2. The Hamming distance between the bipolar vectors a, b, denoted ρ(a, b), is defined to be

ρ(a, b) = (1/2) Σ_{i=1}^{n} |a_i − b_i|.
Evidently (thanks to the factor 1/2), ρ(a, b) is just the total number of mismatches between the components of a and b, i.e., it is equal to β, above. Hence

a^T b = n − 2ρ(a, b).

Note that the Hamming distance defines a metric on the set of bipolar vectors. Indeed, ρ(a, b) = (1/2) ‖a − b‖_1, where ‖·‖_1 is the ℓ¹-norm defined on R^n by ‖z‖_1 = Σ_{i=1}^{n} |z_i|, for z = (z_1, . . . , z_n) ∈ R^n. The ℓ¹-norm is also known as the Manhattan norm: the distance between two locations is the sum of the lengths of the east-west and north-south contributions to the journey, inasmuch as diagonal travel is not possible in Manhattan.
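As a quick sanity check (my own illustration, not from the notes), the bipolar quantizer Φ and the Hamming distance can be coded in a few lines, and the identity a^T b = n − 2ρ(a, b) verified on random bipolar vectors.

```python
import numpy as np

def quantize(z):
    """Bipolar quantizer Phi: +1 where z_k >= 0, -1 where z_k < 0."""
    return np.where(z >= 0, 1, -1)

def hamming(a, b):
    """Hamming distance between bipolar vectors a and b (number of mismatches)."""
    return int(np.sum(a != b))

rng = np.random.default_rng(2)
n = 16
a = rng.choice([-1, 1], size=n)
b = rng.choice([-1, 1], size=n)
assert a @ b == n - 2 * hamming(a, b)    # a^T b = n - 2 rho(a, b)
```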
Hence, using x^{(i)T} x = n − 2ρ(x^{(i)}, x), we have

M x = (1/n) Σ_{i=1}^{p} x^{(i)} x^{(i)T} x = (1/n) Σ_{i=1}^{p} (n − 2ρ_i(x)) x^{(i)},

where ρ_i(x) = ρ(x^{(i)}, x), the Hamming distance between the input vector x and the prototype pattern vector x^{(i)}.
Given x, we wish to know when x ↦ x^{(m)}, that is, when x ↦ M x ↦ Φ(M x) = x^{(m)}. According to our bipolar quantization rule, it will certainly be true that Φ(M x) = x^{(m)} whenever the corresponding components of M x and x^{(m)} have the same sign. This will be the case when (M x)_j x_j^{(m)} > 0, that is, whenever

(1/n)(n − 2ρ_m(x)) x_j^{(m)} x_j^{(m)} + (1/n) Σ_{i=1, i≠m}^{p} (n − 2ρ_i(x)) x_j^{(i)} x_j^{(m)} > 0
for all 1 ≤ j ≤ n. This holds if
| Σ_{i=1, i≠m}^{p} (n − 2ρ_i(x)) x_j^{(i)} x_j^{(m)} | < n − 2ρ_m(x)      (∗)

for all 1 ≤ j ≤ n (we have used the fact that if s > |t| then certainly s + t > 0).
We wish to find conditions which ensure that the inequality (∗) holds. By the triangle inequality, we get
| Σ_{i=1, i≠m}^{p} (n − 2ρ_i(x)) x_j^{(i)} x_j^{(m)} | ≤ Σ_{i=1, i≠m}^{p} |n − 2ρ_i(x)|      (∗∗)
since |x_j^{(i)} x_j^{(m)}| = 1 for all 1 ≤ j ≤ n. Furthermore, using the orthogonality of x^{(m)} and x^{(i)}, for i ≠ m, we have

0 = x^{(i)T} x^{(m)} = n − 2ρ(x^{(i)}, x^{(m)}),

so that

|n − 2ρ_i(x)| = |2ρ(x^{(i)}, x^{(m)}) − 2ρ_i(x)| = 2|ρ(x^{(i)}, x^{(m)}) − ρ(x^{(i)}, x)| ≤ 2ρ(x^{(m)}, x),
where the inequality above follows from the pair of inequalities

ρ(x^{(i)}, x^{(m)}) ≤ ρ(x^{(i)}, x) + ρ(x, x^{(m)})

and

ρ(x^{(i)}, x) ≤ ρ(x^{(i)}, x^{(m)}) + ρ(x^{(m)}, x).
Hence, we have

|n − 2ρ_i(x)| ≤ 2ρ_m(x)      (∗∗∗)

for all i ≠ m. This, together with (∗∗), gives

| Σ_{i=1, i≠m}^{p} (n − 2ρ_i(x)) x_j^{(i)} x_j^{(m)} | ≤ Σ_{i=1, i≠m}^{p} |n − 2ρ_i(x)| ≤ 2(p − 1)ρ_m(x).
It follows that whenever 2(p − 1)ρ_m(x) < n − 2ρ_m(x), then (∗) holds, which means that Φ(M x) = x^{(m)}. The condition 2(p − 1)ρ_m(x) < n − 2ρ_m(x) is just that 2pρ_m(x) < n, i.e., the condition that ρ_m(x) < n/2p. Now, we observe that if ρ_m(x) < n/2p, then, for any i ≠ m,

n − 2ρ_i(x) ≤ 2ρ_m(x) < n/p,   by (∗∗∗), above,

and so n − 2ρ_i(x) < n/p. Thus

ρ_i(x) > (1/2)(n − n/p) = (n/2)·((p − 1)/p) ≥ n/2p,

assuming that p ≥ 2, so that p − 1 ≥ 1. In other words, if x is within Hamming distance n/2p of x^{(m)}, then its Hamming distance to every other prototype input vector is greater than (or equal to) n/2p. We have thus proved the following theorem (L. Personnaz, I. Guyon and G. Dreyfus, Phys. Rev. A 34, 4217–4228 (1986)).
Theorem 1.3. Suppose that {x(1) , x(2) , . . . , x(p) } is a given set of mutually orthogonal bipolar patterns in {−1, 1}n . If x ∈ {−1, 1}n lies within Hamming distance (n/2p) of a particular prototype vector x(m) , say, then x(m) is the nearest prototype vector to x.
Furthermore, if the autoassociative matrix memory based on the patterns {x(1) , x(2) , . . . , x(p) } is augmented by subsequent bipolar quantization, then the input vector x invokes x(m) as the corresponding output. This means that the combined memory matrix and quantization system can correctly recognize (slightly) corrupted input patterns. The nonlinearity (induced by the bipolar quantizer) has enhanced the system performance— small background “noise” has been removed. Note that it could happen that the output response to x is still x(m) even if x is further than (n/2p) from x(m) . In other words, the theorem only gives sufficient conditions for x to recall x(m) .
As an example, suppose that we store 4 patterns built from a grid of 8 × 8 pixels, so that p = 4, n = 8² = 64 and n/2p = 64/8 = 8. Each of the 4 patterns can then be correctly recalled even when presented with up to 7 incorrect pixels.
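The following numpy sketch illustrates this example (my own demonstration, with orthogonal bipolar patterns obtained from a Hadamard-type construction rather than any particular image data): 7 of the 64 bits of a stored pattern are flipped, and bipolar quantization of the memory output restores the original.

```python
import numpy as np

def hadamard(n):
    """Sylvester construction of an n x n Hadamard matrix (n a power of 2)."""
    H = np.array([[1]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H

n, p = 64, 4
X = hadamard(n)[:p].T            # n x p matrix of mutually orthogonal bipolar patterns
M = (X @ X.T) / n                # autoassociative correlation memory matrix

rng = np.random.default_rng(3)
x = X[:, 0].copy()
flip = rng.choice(n, size=7, replace=False)   # corrupt 7 of the 64 bits
x[flip] *= -1

recalled = np.where(M @ x >= 0, 1, -1)        # bipolar quantization of the output
assert np.array_equal(recalled, X[:, 0])      # the stored pattern is recovered
```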
Remark 1.4. If x is close to −x^{(m)}, then the output from the combined autoassociative correlation memory matrix and bipolar quantizer is −x^{(m)}.

Proof. The autoassociative correlation memory matrix for the patterns −x^{(1)}, . . . , −x^{(p)}, namely (1/n) Σ_{i=1}^{p} (−x^{(i)})(−x^{(i)})^T, is clearly exactly the same as that for the patterns x^{(1)}, . . . , x^{(p)}, i.e., it is just M. Applying the above theorem to the system with the negative patterns, we conclude that Φ(M x) = −x^{(m)} whenever x is within Hamming distance n/2p of −x^{(m)}, as claimed.
A memory matrix, also known as a linear associator, can be pictured as a network as in the figure.
Figure 1.3: The memory matrix (linear associator) as a network: the input vector (n components x_1, . . . , x_n) is connected to the output vector (m components y_1, . . . , y_m) via connection weights M_{11}, . . . , M_{mn}.
“Weights” are assigned to the connections. Since y_i = Σ_j M_{ij} x_j, this suggests that we assign the weight M_{ij} to the connection joining input node j to output node i; M_{ij} = weight(j → i).
The correlation memory matrix trained on the pattern pairs (x^{(1)}, y^{(1)}), . . . , (x^{(p)}, y^{(p)}) is given by M = Σ_{m=1}^{p} y^{(m)} x^{(m)T}, which has typical term

M_{ij} = Σ_{m=1}^{p} (y^{(m)} x^{(m)T})_{ij} = Σ_{m=1}^{p} y_i^{(m)} x_j^{(m)}.
Now, Hebb’s law (1949) for “real”, i.e., biological, brains says that if the excitation of cell j is involved in the excitation of cell i, then continued excitation of cell j causes an increase in its efficiency to excite cell i. To encapsulate a crude version of this idea mathematically, we might hypothesise that the weight between the two nodes be proportional to the excitation values of the nodes. Thus, for pattern label m, we would postulate that the weight, weight(input j → output i), be proportional to x_j^{(m)} y_i^{(m)}.
We see that Mij is a sum, over all patterns, of such terms. For this reason, the assignment of the correlation memory matrix to a content addressable memory system is sometimes referred to as generalized Hebbian learning, or one says that the memory matrix is given by the generalized Hebbian rule.
Capacity of autoassociative Hebbian learning
We have seen that the correlation memory matrix has perfect recall provided that the input patterns are pairwise orthogonal vectors. Clearly, there can be at most n of these. In practice, this orthogonality requirement may not be satisfied, so it is natural to ask for some kind of guide as to the number of patterns that can be stored and effectively recovered. In other words, how many patterns can there be before the cross-talk term becomes so large that it destroys the recovery of the key patterns? Experiment confirms that, indeed, there is a problem here. To give some indication of what might be reasonable, consider the autoassociative correlation memory matrix based on p bipolar pattern vectors x^{(1)}, . . . , x^{(p)} ∈ {−1, 1}^n, followed by bipolar quantization, Φ. On presentation of pattern x^{(m)}, the system output is

Φ(M x^{(m)}) = Φ( (1/n) Σ_{i=1}^{p} x^{(i)} x^{(i)T} x^{(m)} ).
Consider the k-th bit. Then Φ(M x^{(m)})_k = x_k^{(m)} whenever x_k^{(m)} (M x^{(m)})_k > 0, that is, whenever

(1/n) x_k^{(m)} x_k^{(m)} x^{(m)T} x^{(m)} + (1/n) Σ_{i=1, i≠m}^{p} x_k^{(m)} x_k^{(i)} x^{(i)T} x^{(m)} > 0.

Writing C_k for the second term on the left hand side (the first term is equal to 1), this reads

1 + C_k > 0.
In order to consider a “typical” situation, we suppose that the patterns are random. Thus, x(1) , . . . , x(p) are selected at random from {−1, 1}n , with all pn bits being chosen independently and equally likely. The output bit Φ(M x(m) )k is therefore incorrect if 1 + Ck < 0, i.e., if Ck < −1. We shall estimate the probability of this happening.
We see that C_k is a sum of many terms,

C_k = (1/n) Σ_{i=1, i≠m}^{p} Σ_{j=1}^{n} X_{m,k,i,j},

where X_{m,k,i,j} = x_k^{(m)} x_k^{(i)} x_j^{(i)} x_j^{(m)}. We note firstly that, with j = k,

X_{m,k,i,k} = x_k^{(m)} x_k^{(i)} x_k^{(i)} x_k^{(m)} = 1.

Next, we see that, for j ≠ k, each X_{m,k,i,j} takes the values ±1 with equal probability, namely 1/2, and that these different Xs form an independent family. Therefore, we may write C_k as

C_k = (p − 1)/n + (1/n) S,

where S is a sum of (n − 1)(p − 1) independent random variables, each taking the values ±1 with probability 1/2. Each of the Xs has mean 0 and variance σ² = 1. By the Central Limit Theorem, it follows that the random variable (S/(n − 1)(p − 1)) / (σ/√((n − 1)(p − 1))) = S/√((n − 1)(p − 1)) has an approximate standard normal distribution (for large n).
Hence, if we denote by Z a standard normal random variable,

Prob(C_k < −1) = Prob( (p − 1)/n + S/n < −1 )
               = Prob( S < −n − (p − 1) )
               = Prob( S/√((n − 1)(p − 1)) < (−n − (p − 1))/√((n − 1)(p − 1)) )
               ∼ Prob( Z < (−n − (p − 1))/√((n − 1)(p − 1)) )
               ∼ Prob( Z < −√(n/p) ).
Suppose that we require that the probability of an incorrect bit be no greater than 0.01 (or 1%). Then, from statistical tables, we find that Prob(Z > √(n/p)) ≤ 0.01 requires that √(n/p) ≥ 2.326. That is, we require n/p ≥ (2.326)² ≈ 5.41, or p/n ≤ 0.185. Now, to say that any particular bit is incorrectly recalled with probability 0.01 is to say that the average number of incorrect bits (from a large sample) is 1% of the total. We have therefore shown that if we are prepared to accept up to 1% bad bits in our recalled patterns (on average) then we can expect to be able to store no more than p = 0.185n patterns in our autoassociative system. That is, the storage capacity (with a 1% error tolerance) is 0.185n.
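A quick simulation (my own sketch; the value of n is an arbitrary choice) can be used to check the 1% figure: for random bipolar patterns with p ≈ 0.185 n, roughly 1% of the recalled bits should be wrong on average.

```python
import numpy as np

n = 400
p = int(0.185 * n)                       # capacity estimate at 1% error tolerance
rng = np.random.default_rng(4)

X = rng.choice([-1, 1], size=(n, p))     # random bipolar prototype patterns as columns
M = (X @ X.T) / n                        # autoassociative correlation memory matrix

recalled = np.where(M @ X >= 0, 1, -1)   # recall every stored pattern, then quantize
error_rate = np.mean(recalled != X)
print(f"fraction of incorrect bits: {error_rate:.3f}")   # should be close to 0.01
```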
Generalized inverse matrix memory
We have seen that the success of the correlation memory matrix, or Hebbian learning, is limited by the appearance of the cross-talk term. We shall derive an alternative system based on the idea of minimization of the output distortion or error.
Let us start again (and with a change of notation). We wish to construct an associative memory system which matches input patterns a(1) ,. . . , a(p) (from Rn ) with output pattern vectors b(1) ,. . . , b(p) (in Rm ), respectively. The question is whether or not we can find a matrix M ∈ Rm×n , the set of m × n real matrices, such that
M a(i) = b(i)
for all 1 ≤ i ≤ p. Let A ∈ R^{n×p} be the matrix whose columns are the vectors a^{(1)}, . . . , a^{(p)}, i.e., A = (a^{(1)} · · · a^{(p)}), and let B ∈ R^{m×p} be the matrix with columns given by the b^{(i)}s, B = (b^{(1)} · · · b^{(p)}), thus A_{ij} = a_i^{(j)} and B_{ij} = b_i^{(j)}. Then it is easy to see that M a^{(i)} = b^{(i)}, for all i, is equivalent to M A = B. The problem, then, is to solve the matrix equation
M A = B,
for M ∈ Rm×n , for given matrices A ∈ Rn×p and B ∈ Rm×p . First, we observe that for a solution to exist, the matrices A and B cannot be arbitrary. Indeed, if A = 0, then so is M A no matter what M is—so the equation will not hold unless B also has all zero entries. Suppose next, slightly more subtly, that there is some non-zero vector v ∈ Rp such that Av = 0. Then, for any M , M Av = 0. In general, it need not be true that Bv = 0.
Suppose then that there is no such non-zero v ∈ Rp such that Av = 0, i.e., we are supposing that Av = 0 implies that v = 0. What does this mean? We have
(Av)_i = Σ_{j=1}^{p} A_{ij} v_j = v_1 A_{i1} + v_2 A_{i2} + · · · + v_p A_{ip} = v_1 a_i^{(1)} + v_2 a_i^{(2)} + · · · + v_p a_i^{(p)} = i-th component of (v_1 a^{(1)} + · · · + v_p a^{(p)}).

In other words,

Av = v_1 a^{(1)} + · · · + v_p a^{(p)}.
The vector Av is a linear combination of the columns of A, considered as elements of Rn .
Now, the statement that Av = 0 if and only if v = 0 is equivalent to the statement that v1 a(1) + · · · + vp a(p) = 0 if and only if v1 = v2 = · · · = vp = 0 which, in turn, is equivalent to the statement that a(1) ,. . . ,a(p) are linearly independent vectors in Rn .
Thus, the statement, Av = 0 if and only if v = 0, is true if and only if the columns of A are linearly independent vectors in Rn .
Proposition 1.5. For any A ∈ R^{n×p}, the p × p matrix A^T A is invertible if and only if the columns of A are linearly independent in R^n.

Proof. The square matrix A^T A is invertible if and only if the equation A^T A v = 0 has the unique solution v = 0, v ∈ R^p. (Certainly the invertibility of A^T A implies the uniqueness of the zero solution to A^T A v = 0. For the converse, first note that the uniqueness of this zero solution implies that
ATA is a one-one linear mapping from Rp to Rp . Moreover, using linearity, one readily checks that the collection ATAu1 , . . . , ATAup is a linearly independent set for any basis u1 , . . . , up of Rp . This means that it is a basis and so ATA maps Rp onto itself. Hence ATA has an inverse. Alternatively, one can argue that since ATA is symmetric it can be diagonalized via some orthogonal transformation. But a diagonal matrix is invertible if and only if every diagonal entry is non-zero. In this case, these entries are precisely the eigenvalues of ATA. So ATA is invertible if and only if none of its eigenvalues are zero.)
Suppose that the columns of A are linearly independent and that A^T A v = 0. Then it follows that v^T A^T A v = 0 and so Av = 0, since v^T A^T A v = Σ_{i=1}^{n} (Av)_i² = ‖Av‖_2², the square of the Euclidean length of the n-dimensional vector Av. By the argument above, Av is a linear combination of the columns of A, and we deduce that v = 0. Hence A^T A is invertible.
On the other hand, if ATA is invertible, then Av = 0 implies that ATAv = 0 and so v = 0. Hence the columns of A are linearly independent. We can now derive the result of interest here.
Theorem 1.6. Let A be any n × p matrix whose columns are linearly independent. Then for any m × p matrix B , there is an m × n matrix M such that M A = B .
Proof. Let

M = B (A^T A)^{−1} A^T,

where B is m × p, (A^T A)^{−1} is p × p and A^T is p × n. Then M is well-defined since A^T A ∈ R^{p×p} is invertible, by the proposition. Direct substitution shows that M A = B.
So, with this choice of M we get perfect recall, provided that the input pattern vectors are linearly independent.
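A small numpy check of this construction (my own illustration; the sizes are arbitrary): with linearly independent input patterns, M = B (A^T A)^{−1} A^T reproduces every stored pair exactly.

```python
import numpy as np

n, m, p = 8, 3, 5                         # pattern sizes (p <= n, chosen arbitrarily)
rng = np.random.default_rng(5)

A = rng.standard_normal((n, p))           # columns a^(1), ..., a^(p); almost surely independent
B = rng.standard_normal((m, p))           # desired outputs b^(1), ..., b^(p)

M = B @ np.linalg.inv(A.T @ A) @ A.T      # M = B (A^T A)^{-1} A^T
assert np.allclose(M @ A, B)              # perfect recall: M a^(i) = b^(i) for every i
```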
Note that, in general, the solution above is not unique. Indeed, for any matrix C ∈ R^{m×n}, the m × n matrix M′ = C (I_n − A(A^T A)^{−1} A^T) satisfies

M′ A = C (A − A(A^T A)^{−1} A^T A) = C (A − A) = 0 ∈ R^{m×p}.

Hence M + M′ satisfies (M + M′)A = B.
Can we see what M looks like in terms of the patterns a^{(i)}, b^{(i)}? The answer is “yes and no”. We have A = (a^{(1)} · · · a^{(p)}) and B = (b^{(1)} · · · b^{(p)}). Then

(A^T A)_{ij} = Σ_{k=1}^{n} (A^T)_{ik} A_{kj} = Σ_{k=1}^{n} A_{ki} A_{kj} = Σ_{k=1}^{n} a_k^{(i)} a_k^{(j)},
which gives ATA directly in terms of the a(i) s. Let Q = ATA ∈ Rp×p . Then M = BQ−1 AT , so that
M_{ij} = Σ_{k,ℓ=1}^{p} B_{ik} (Q^{−1})_{kℓ} (A^T)_{ℓj} = Σ_{k,ℓ=1}^{p} b_i^{(k)} (Q^{−1})_{kℓ} a_j^{(ℓ)},   since (A^T)_{ℓj} = A_{jℓ}.
This formula for M , valid for linearly independent input patterns, expresses M more or less in terms of the patterns. The appearance of the inverse, Q−1 , somewhat lessens its appeal, however.
To discuss the case where the columns of A are not necessarily linearly independent, we need to consider the notion of generalized inverse.

Definition 1.7. For any given matrix A ∈ R^{m×n}, the matrix X ∈ R^{n×m} is said to be a generalized inverse of A if
(i) AXA = A,
(ii) XAX = X ,
(iii) (AX )T = AX ,
(iv) (XA)T = XA.
The terms pseudoinverse or Moore-Penrose inverse are also commonly used for such an X .
Examples 1.8.

1. If A ∈ R^{n×n} is invertible, then A^{−1} is the generalized inverse of A.

2. If A = α ∈ R^{1×1}, then X = 1/α is the generalized inverse provided α ≠ 0. If α = 0, then X = 0 is the generalized inverse.

3. The generalized inverse of A = 0 ∈ R^{m×n} is X = 0 ∈ R^{n×m}.

4. If A = u ∈ R^{m×1}, u ≠ 0, then one checks that X = u^T/(u^T u) is a generalized inverse of u.
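In numpy, the generalized (Moore-Penrose) inverse is available as np.linalg.pinv; the sketch below (my own check, not from the notes) verifies the four defining properties numerically for a random rectangular matrix.

```python
import numpy as np

rng = np.random.default_rng(6)
A = rng.standard_normal((4, 7))          # an arbitrary m x n matrix
X = np.linalg.pinv(A)                    # its generalized inverse A#

assert np.allclose(A @ X @ A, A)         # (i)   A X A = A
assert np.allclose(X @ A @ X, X)         # (ii)  X A X = X
assert np.allclose((A @ X).T, A @ X)     # (iii) (A X)^T = A X
assert np.allclose((X @ A).T, X @ A)     # (iv)  (X A)^T = X A
```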
The following result is pertinent to the theory.
Theorem 1.9. Every matrix possesses a unique generalized inverse.

Proof. We postpone discussion of existence (which can be established via the Singular Value Decomposition) and just show uniqueness. This follows
by repeated use of the defining properties (i), . . . , (iv). Let A ∈ R^{m×n} be given and suppose that X, Y ∈ R^{n×m} are generalized inverses of A. Then

X = XAX,                  by (ii),
  = X(AX)^T,              by (iii),
  = X X^T A^T
  = X X^T A^T Y^T A^T,     by (i)^T,
  = X X^T A^T A Y,         by (iii),
  = X A X A Y,             by (iii),
  = X A Y,                 by (ii),
  = X A Y A Y,             by (i),
  = X A A^T Y^T Y,         by (iv),
  = A^T X^T A^T Y^T Y,     by (iv),
  = A^T Y^T Y,             by (i)^T,
  = Y A Y,                 by (iv),
  = Y,                     by (ii),

as required.
Notation For given A ∈ Rm×n , we denote its generalized inverse by A# . It is also often written as Ag , A+ or A† .
Proposition 1.10. For any A ∈ Rm×n , AA# is the orthogonal projection onto ran A, the linear span in Rm of the columns of A, i.e., if P = AA# ∈ Rm×m , then P = P T = P 2 and P maps Rm onto ran A.
Proof. The defining property (iii) of the generalized inverse A^# is precisely the statement that P = AA^# is symmetric. Furthermore,

P² = A A^# A A^# = A A^# = P,   by condition (i),

so P is idempotent. Thus P is an orthogonal projection.
For any x ∈ R^m, we have that P x = AA^# x ∈ ran A, so that P : R^m → ran A. On the other hand, if x ∈ ran A, there is some z ∈ R^n such that x = Az. Hence P x = P Az = A A^# A z = Az = x, where we have used condition (i) in the penultimate step. Hence P maps R^m onto ran A.

Proposition 1.11. Let A ∈ R^{m×n}.
(i) If rank A = n, then A# = (ATA)−1 AT .
(ii) If rank A = m, then A# = AT (AAT )−1 .
Proof. If rank A = n, then A has linearly independent columns and we know that this implies that ATA is invertible (in Rn×n ). It is now a straightforward matter to verify that (ATA)−1 AT satisfies the four defining properties of the generalized inverse, which completes the proof of (i).
If rank A = m, we simply consider the transpose instead. Let B = A^T. Then rank B = m, since A and A^T have the same rank, and so, by the argument above, B^# = (B^T B)^{−1} B^T. However, (A^T)^# = (A^#)^T, as is easily checked (again from the defining conditions). Hence

A^# = ((A^#)^T)^T = ((A^T)^#)^T = (B^#)^T = B (B^T B)^{−1} = A^T (A A^T)^{−1},

which establishes (ii).
Definition 1.12. The ‖·‖_F-norm on R^{m×n} is defined by

‖A‖_F² = Tr(A^T A),   for A ∈ R^{m×n},

where Tr(B) is the trace of the square matrix B; Tr(B) = Σ_i B_{ii}. This norm is called the Frobenius norm and is sometimes denoted ‖·‖_2.

We see that

‖A‖_F² = Tr(A^T A) = Σ_{i=1}^{n} (A^T A)_{ii} = Σ_{i=1}^{n} Σ_{j=1}^{m} (A^T)_{ij} A_{ji} = Σ_{i=1}^{m} Σ_{j=1}^{n} A_{ij}²,

since (A^T)_{ij} = A_{ji}. Hence

‖A‖_F = √(sum of squares of all entries of A).

We also note, here, that clearly ‖A‖_F = ‖A^T‖_F.

Suppose that A = u ∈ R^{m×1}, an m-component vector. Then ‖A‖_F² = Σ_{i=1}^{m} u_i², that is, ‖A‖_F is the usual Euclidean norm in this case. Thus the notation ‖·‖_2 for this norm is consistent. Note that, generally, ‖A‖_F is just the Euclidean norm of A ∈ R^{m×n} when A is “taken apart” row by row and considered as a vector in R^{mn} via the correspondence A ↔ (A_{11}, A_{12}, . . . , A_{1n}, A_{21}, . . . , A_{mn}).
The notation ‖A‖_2 is sometimes used in numerical analysis texts (and in the computer algebra software package Maple) to mean the norm of
A as a linear map from R^n into R^m, that is, the value sup{‖Ax‖ : x ∈ R^n with ‖x‖ = 1}. One can show that this value is equal to the square root of the largest eigenvalue of A^T A, whereas ‖A‖_F is equal to the square root of the sum of the eigenvalues of A^T A.
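The two norms can be compared numerically; this short numpy check (my own illustration) computes the Frobenius norm, the operator norm, and the corresponding eigenvalue expressions for a random matrix.

```python
import numpy as np

rng = np.random.default_rng(7)
A = rng.standard_normal((4, 6))

eigs = np.linalg.eigvalsh(A.T @ A)                 # eigenvalues of A^T A (all >= 0)

fro = np.linalg.norm(A, 'fro')                     # Frobenius norm ||A||_F
op = np.linalg.norm(A, 2)                          # operator norm sup{||Ax|| : ||x|| = 1}

assert np.isclose(fro, np.sqrt(eigs.sum()))        # sqrt of the sum of eigenvalues of A^T A
assert np.isclose(op, np.sqrt(eigs.max()))         # sqrt of the largest eigenvalue of A^T A
```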
Remark 1.13. Let A ∈ Rm×n , B ∈ Rn×m , C ∈ Rm×n , and X ∈ Rp×p . Then it is easy to see that
(i) Tr(AB ) = Tr(BA),
(ii) Tr(X ) = Tr(X T ),
(iii) Tr(AC T ) = Tr(C T A) = Tr(AT C ) = Tr(CAT ).
The equalities in (iii) can each be verified directly, or alternatively, one notices that (iii) is a consequence of (i) and (ii) (replacing B by C^T).

Lemma 1.14. For A ∈ R^{m×n}, A^# A A^T = A^T.
Proof. We have

(A^# A) A^T = (A^# A)^T A^T,   by condition (iv),
            = (A (A^# A))^T
            = A^T,             by condition (i),

as required.
Theorem 1.15. Let A ∈ R^{n×p} and B ∈ R^{m×p} be given. Then X = BA^# is an element of R^{m×n} which minimizes the quantity ‖XA − B‖_F.

Proof. We have

‖XA − B‖_F² = ‖(X − BA^#)A + B(A^# A − I_p)‖_F²
            = ‖(X − BA^#)A‖_F² + ‖B(A^# A − I_p)‖_F² + 2 Tr( A^T (X − BA^#)^T B (A^# A − I_p) )
            = ‖(X − BA^#)A‖_F² + ‖B(A^# A − I_p)‖_F² + 2 Tr( (X − BA^#)^T B (A^# A − I_p) A^T ),

and (A^# A − I_p) A^T = 0 by the lemma. Hence

‖XA − B‖_F² = ‖(X − BA^#)A‖_F² + ‖B(A^# A − I_p)‖_F²,

which achieves its minimum, ‖B(A^# A − I_p)‖_F², when X = BA^#.
Note that any X satisfying XA = BA^# A gives a minimum solution. If A^T has full column rank (or, equivalently, A^T has no kernel) then AA^T is invertible. Multiplying on the right by A^T(AA^T)^{−1} gives X = BA^#. So under this condition on A^T, we see that there is a unique solution X = BA^# minimizing ‖XA − B‖_F.
In general, one can show that BA^# is the element with minimal ‖·‖_F-norm which minimizes ‖XA − B‖_F, i.e., if Y ≠ BA^# and ‖Y A − B‖_F = ‖B A^# A − B‖_F, then ‖BA^#‖_F < ‖Y‖_F.
Now let us return to our problem of finding a memory matrix which stores the input-output pattern pairs (a^{(i)}, b^{(i)}), 1 ≤ i ≤ p, with each a^{(i)} ∈ R^n and each b^{(i)} ∈ R^m. In general, it may not be possible to find a matrix M ∈ R^{m×n} such that M a^{(i)} = b^{(i)}, for each i. Whatever our choice of M, the system output corresponding to the input a^{(i)} is just M a^{(i)}. So, failing equality M a^{(i)} = b^{(i)}, we would at least like to minimize the error b^{(i)} − M a^{(i)}. A measure of such an error is ‖b^{(i)} − M a^{(i)}‖_2², the squared Euclidean norm of the difference. Taking all p patterns into account, the total system recall error is taken to be

Σ_{i=1}^{p} ‖b^{(i)} − M a^{(i)}‖_2².

Let A = (a^{(1)} · · · a^{(p)}) ∈ R^{n×p} and B = (b^{(1)} · · · b^{(p)}) ∈ R^{m×p} be the matrices whose columns are given by the pattern vectors a^{(i)} and b^{(i)}, respectively. Then the total system recall error, above, is just

‖B − MA‖_F².
We have seen that this is minimized by the choice M = BA# , where A# is the generalized inverse of A. The memory matrix M = BA# is called the optimal linear associative memory (OLAM) matrix.
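A numerical sketch of the OLAM construction (my own example; the pattern count deliberately exceeds the input dimension so that exact recall is impossible and the least-squares behaviour is visible): the memory M = B A^# minimizes the Frobenius recall error over all matrices.

```python
import numpy as np

n, m, p = 5, 3, 9                    # more patterns than input dimensions: exact recall impossible
rng = np.random.default_rng(8)
A = rng.standard_normal((n, p))      # input patterns as columns
B = rng.standard_normal((m, p))      # desired outputs as columns

M_olam = B @ np.linalg.pinv(A)       # OLAM matrix M = B A#

def recall_error(M):
    return np.linalg.norm(M @ A - B, 'fro') ** 2

# Any perturbation of the OLAM matrix does not decrease the recall error.
M_other = M_olam + 0.1 * rng.standard_normal((m, n))
assert recall_error(M_olam) <= recall_error(M_other) + 1e-9
```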
Remark 1.16. If the patterns {a^{(1)}, . . . , a^{(p)}} constitute an orthonormal family, then A has independent columns and so A^# = (A^T A)^{−1} A^T = A^T (since A^T A = I_p), so that the OLAM matrix is BA^# = BA^T, which is exactly the correlation memory matrix.
In the autoassociative case, b(i) = a(i) , so that B = A and the OLAM matrix is given as
M = AA# .
We have seen that AA# is precisely the projection onto the range of A, i.e., onto the subspace of Rn spanned by the prototype patterns. In this case, we say that M is given by the projection rule.
Any input x ∈ R^n can be written as

x = AA^# x + (I − AA^#)x,

where AA^# x is the OLAM system output and (I − AA^#)x is the “novelty”. Kohonen calls I − AA^# the novelty filter and has applied these ideas to image-subtraction problems such as tumor detection in brain scans. Non-null novelty vectors may indicate disorders or anomalies.
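A short numpy illustration of the projection rule and the novelty filter (my own sketch, with random prototype patterns): the input splits into a component in the span of the prototypes and a residual “novelty” component.

```python
import numpy as np

n, p = 10, 3
rng = np.random.default_rng(9)
A = rng.standard_normal((n, p))          # prototype patterns as columns

P = A @ np.linalg.pinv(A)                # projection rule: P = A A# projects onto ran A
N = np.eye(n) - P                        # novelty filter I - A A#

x = rng.standard_normal(n)
assert np.allclose(x, P @ x + N @ x)     # x = (OLAM system output) + (novelty)
assert np.allclose(N @ A, 0)             # stored prototypes produce zero novelty
```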
Pattern classification
We have discussed the distributed associative memory (DAM) matrix as an autoassociative or as a heteroassociative memory model. The first is mathematically just a special case of the second. Another special case is that of so-called classification. The idea is that one simply wants an input signal to elicit a response “tag”, typically coded as one of a collection of orthogonal unit vectors, such as given by the standard basis vectors of R^m.

• In operation, the input x induces output M x, which is then associated with the tag vector corresponding to its maximum component. In other words, if (M x)_j is the maximum component of M x, then the output M x is associated with the j-th tag.
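In code, the tag-readout step is just an argmax over the output components; the following sketch (an illustration with random data of mine, not Kohonen's experiment) trains an OLAM on inputs paired with standard basis tags and classifies a noisy input.

```python
import numpy as np

n, classes = 20, 4
rng = np.random.default_rng(10)

A = rng.standard_normal((n, classes))        # one prototype input per class
B = np.eye(classes)                          # tags: standard basis vectors

M = B @ np.linalg.pinv(A)                    # OLAM classifier

x = A[:, 2] + 0.1 * rng.standard_normal(n)   # noisy version of prototype 2
predicted = int(np.argmax(M @ x))            # tag with the maximum output component
assert predicted == 2
```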
Examples of various pattern classification tasks have been given by T. Kohonen, P. Lehtiö, E. Oja, A. Kortekangas and K. Mäkisara, Demonstration of pattern processing properties of the optimal associative mappings, Proceedings of the International Conference on Cybernetics and Society, Washington, D.C., 581–585 (1977). (See also the article “Storage and Processing of Information in Distributed Associative Memory Systems” by T. Kohonen, P. Lehtiö and E. Oja in “Parallel Models of Associative Memory”, edited by G. Hinton and J. Anderson, published by Lawrence Erlbaum Associates, (updated edition) 1989.)
In one such experiment, ten people were each photographed from five different angles, ranging from 45◦ to −45◦ , with 0◦ corresponding to a fully frontal face. These were then digitized to produce pattern vectors with eight possible intensity levels for each pixel. A distinct unit vector, a tag, was associated with each person, giving a total of ten tags, and fifty patterns. The OLAM matrix was constructed from this data.
The memory matrix was then presented with a digitized photograph of one of the ten people, but taken from a different angle to any of the original five prototypes. The output was then classified according to the tag associated with its largest component. This was found to give correct identification.
The OLAM matrix was also found to perform well with autoassociation. Pattern vectors corresponding to one hundred digitized photographs were
used to construct the autoassociative memory via the projection rule. When presented with incomplete or fuzzy versions of the original patterns, the OLAM matrix satisfactorily reconstructed the correct image.
In another autoassociative recall experiment, twenty one different prototype images were used to construct the OLAM matrix. These were each composed of three similarly placed copies of a subimage. New pattern images, consisting of just one part of the usual triple features, were presented to the OLAM matrix. The output images consisted of slightly fuzzy versions of the single part but triplicated so as to mimic the subimage positioning learned from the original twenty one prototypes.
An analysis comparing the performance of the correlation memory matrix with that of the generalized inverse matrix memory has been offered by Cherkassky, Fassett and Vassilas (IEEE Trans. on Computers, 40, 1429 (1991)). Their conclusion is that the generalized inverse memory matrix performs better than the correlation memory matrix for autoassociation, but that the correlation memory matrix is better for classification. This is contrary to the widespread belief that the generalized inverse memory matrix is the superior model.
Chapter 2
Adaptive Linear Combiner
We wish to consider a memory matrix for the special case of one-dimensional output vectors. Thus, we consider input pattern vectors x(1) , . . . , x(p) ∈ Rℓ , say, with corresponding desired outputs y (1) , . . . , y (p) ∈ R and we seek a memory matrix M ∈ R1×ℓ such that
M x(i) = y (i) ,
for 1 ≤ i ≤ p. Since M ∈ R1×ℓ , we can think of M as a row vector M = (m1 , . . . , mℓ ). The output corresponding to the input x = (x1 , . . . , xℓ ) ∈ Rℓ is just
y = M x = Σ_{i=1}^{ℓ} m_i x_i.
Such a system is known as the adaptive linear combiner (ALC).

Figure 2.1: The Adaptive Linear Combiner: the input signal components x_1, . . . , x_ℓ are multiplied by the weights m_1, . . . , m_ℓ on the connections, and an “adder” forms the output y = Σ_i m_i x_i.
We have seen that we may not be able to find M which satisfies the exact input-output relationship M x^{(i)} = y^{(i)}, for each i. The idea is to look for an M which is in a certain sense optimal. To do this, we seek m_1, . . . , m_ℓ
such that (one half) the average mean-squared error

E ≡ (1/2p) Σ_{i=1}^{p} |y^{(i)} − z^{(i)}|²

is minimized, where y^{(i)} is the desired output corresponding to the input vector x^{(i)} and z^{(i)} = M x^{(i)} is the actual system output. We already know, from the results in the last chapter, that this is achieved by the OLAM matrix based on the input-output pattern pairs, but we wish here to develop an algorithmic approach to the construction of the appropriate memory matrix. We can write out E in terms of the m_i as follows:
E = (1/2p) Σ_{i=1}^{p} | y^{(i)} − Σ_{j=1}^{ℓ} m_j x_j^{(i)} |²
  = (1/2p) Σ_{i=1}^{p} ( Σ_{j,k=1}^{ℓ} m_j m_k x_j^{(i)} x_k^{(i)} − 2 Σ_{j=1}^{ℓ} y^{(i)} m_j x_j^{(i)} + y^{(i)2} )
  = (1/2) Σ_{j,k=1}^{ℓ} m_j A_{jk} m_k − Σ_{j=1}^{ℓ} b_j m_j + (1/2) c,

where A_{jk} = (1/p) Σ_{i=1}^{p} x_j^{(i)} x_k^{(i)}, b_j = (1/p) Σ_{i=1}^{p} y^{(i)} x_j^{(i)} and c = (1/p) Σ_{i=1}^{p} y^{(i)2}.
Note that A = (Ajk ) ∈ Rℓ×ℓ is symmetric. The error E is a non-negative quadratic function of the mi . For a minimum, we investigate the equalities ∂ E /∂mi = 0, that is,
0 = ∂E/∂m_i = Σ_{k=1}^{ℓ} A_{ik} m_k − b_i,
for 1 ≤ i ≤ ℓ, where we have used the symmetry of (Aik ) here. We thus obtain the so-called Gauss-normal or Wiener-Hopf equations
Σ_{k=1}^{ℓ} A_{ik} m_k = b_i,   for 1 ≤ i ≤ ℓ,
or, in matrix form,
Am = b ,
with A = (Aik ) ∈ Rℓ×ℓ , m = (m1 , . . . , mℓ ) ∈ Rℓ and b = (b1 , . . . , bℓ ) ∈ Rℓ . If A is invertible, then m = A−1 b is the unique solution. In general, there may be many solutions. For example, if A is diagonal with A11 = 0, then necessarily b1 = 0 (otherwise E could not be non-negative as a function of the mi ) and so we see that m1 is arbitrary. To relate this to the OLAM matrix, write E as
E = (1/2p) ‖M X − Y‖_F²,
where X = (x^{(1)} · · · x^{(p)}) ∈ R^{ℓ×p} and Y = (y^{(1)} · · · y^{(p)}) ∈ R^{1×p}. This, we know, is minimized by M = Y X^# ∈ R^{1×ℓ}. Therefore m = M^T must be a solution to the Wiener-Hopf equations above. We can write A, b and c in terms of the matrices X and Y. One finds that A = (1/p) X X^T, b^T = (1/p) Y X^T and c = (1/p) Y Y^T. The equation Am = b then becomes X X^T m = X Y^T, giving m = (X X^T)^{−1} X Y^T, provided that A is invertible. This gives M = m^T = Y X^T (X X^T)^{−1} = Y X^#, as above.
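As a concrete check (my own sketch with synthetic data; sizes are arbitrary), the optimal ALC weights can be obtained either by solving the Wiener-Hopf equations Am = b or directly via the pseudoinverse, and the two agree.

```python
import numpy as np

ell, p = 4, 50
rng = np.random.default_rng(11)
X = rng.standard_normal((ell, p))               # input patterns x^(1), ..., x^(p) as columns
Y = rng.standard_normal((1, p))                 # desired scalar outputs as a row

A = (X @ X.T) / p                               # A_jk = (1/p) sum_i x_j^(i) x_k^(i)
b = (X @ Y.T) / p                               # b_j  = (1/p) sum_i y^(i) x_j^(i)

m_wiener = np.linalg.solve(A, b)                # solve the Wiener-Hopf equations A m = b
M_olam = Y @ np.linalg.pinv(X)                  # M = Y X#

assert np.allclose(m_wiener.ravel(), M_olam.ravel())
```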
One method of attack for finding a vector m∗ minimizing E is that of gradient-descent. The idea is to think of E (m1 , . . . , mℓ ) as a bowl-shaped surface above the ℓ-dimensional m1 , . . . , mℓ -space. Pick any value for m. The vector grad E , when evaluated at m, points in the direction of maximum increase of E in the neighbourhood of m. That is to say, for small α (and a vector v of given length), E (m+αv )−E (m) is maximized when v points in the same direction as grad E (as is seen by Taylor’s theorem). Now, rather than increasing E , we wish to minimize it. So the idea is to move a small distance from m to m − α grad E , thus inducing maximal “downhill” movement on the error surface. By repeating this process, we hope to eventually reach a value of m which minimizes E .
The strategy, then, is to consider a sequence of vectors m(n) given algorithmically by

m(n + 1) = m(n) − α grad E,   for n = 1, 2, . . . ,

with m(1) arbitrary and where the parameter α is called the learning rate. If we substitute for grad E, we find

m(n + 1) = m(n) + α (b − A m(n)).
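The iteration can be written in a few lines; the sketch below (my own illustration, with an arbitrarily chosen learning rate and iteration count) converges towards the Wiener-Hopf solution when α is small enough.

```python
import numpy as np

def gradient_descent(A, b, alpha=0.05, iterations=2000):
    """Iterate m(n+1) = m(n) + alpha * (b - A m(n)), starting from m(1) = 0."""
    m = np.zeros_like(b)
    for _ in range(iterations):
        m = m + alpha * (b - A @ m)
    return m

# Usage with the statistics A, b computed from training data as in the previous sketch:
# m_gd = gradient_descent(A, b)
# np.allclose(m_gd, np.linalg.solve(A, b))   # approaches the Wiener-Hopf solution
```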
Now, A is symmetric and so can be diagonalized. There is an orthogonal matrix U ∈ Rℓ×ℓ such that
U AU T = D = diag(λ1 , . . . , λℓ )
and we may assume t...