
NEURAL NETWORKS

Ivan F Wilde
Mathematics Department
King’s College London
London, WC2R 2LS, UK

[email protected]


Contents

1 Matrix Memory
2 Adaptive Linear Combiner
3 Artificial Neural Networks
4 The Perceptron
5 Multilayer Feedforward Networks
6 Radial Basis Functions
7 Recurrent Neural Networks
8 Singular Value Decomposition
Bibliography


Chapter 1
Matrix Memory

We wish to construct a system which possesses so-called associative memory. This is definable generally as a process by which an input, considered as a “key”, to a memory system is able to evoke, in a highly selective fashion, a specific response associated with that key, at the system output. The signal-response association should be “robust”, that is, a “noisy” or “incomplete” input signal should none the less invoke the correct response—or at least an acceptable response. Such a system is also called a content addressable memory.

Figure 1.1: A content addressable memory: stimulus → mapping → response.
The idea is that the association should not be defined so much between the individual stimulus-response pairs, but rather embodied as a whole collection of such input-output patterns—the system is a distributive associative memory (the input-output pairs are “distributed” throughout the system memory rather than the particular input-output pairs being somehow represented individually in various different parts of the system).

To attempt to realize such a system, we shall suppose that the input key (or prototype) patterns are coded as vectors in R^n, say, and that the responses are coded as vectors in R^m. For example, the input might be a digitized photograph comprising a picture with 100 × 100 pixels, each of which may assume one of eight levels of greyness (from white (= 0) to black (= 7)). In this case, by mapping the screen to a vector, via raster order, say, the input is a vector in R^10000 whose components take values in the set {0, . . . , 7}. The desired output might correspond to the name of the person in the photograph. If we wish to recognize up to 50 people, say, then we could give each a binary code name of 6 digits—which allows up to 2^6 = 64 different names. Then the output can be considered as an element of R^6.

Now, for any pair of vectors x ∈ R^n, y ∈ R^m, we can effect the map x → y via the action of the m × n matrix

M^(x,y) = y x^T,

where x is considered as an n × 1 (column) matrix and y as an m × 1 matrix. Indeed,

M^(x,y) x = y x^T x = α y,

where α = x^T x = ‖x‖^2, the squared Euclidean norm of x. The matrix y x^T is called the outer product of x and y. This suggests a model for our “associative system”.
Suppose that we wish to consider p input-output pattern pairs, (x^(1), y^(1)), (x^(2), y^(2)), . . . , (x^(p), y^(p)). Form the m × n matrix

M = Σ_{i=1}^p y^(i) x^(i)T.

M is called the correlation memory matrix (corresponding to the given pattern pairs). Note that if we let X = (x^(1) · · · x^(p)) and Y = (y^(1) · · · y^(p)) be the n × p and m × p matrices with columns given by the vectors x^(1), . . . , x^(p) and y^(1), . . . , y^(p), respectively, then the matrix Σ_{i=1}^p y^(i) x^(i)T is just Y X^T.

Indeed, the jk-element of Y X^T is

(Y X^T)_jk = Σ_{i=1}^p Y_ji (X^T)_ik = Σ_{i=1}^p Y_ji X_ki = Σ_{i=1}^p y_j^(i) x_k^(i),

which is precisely the jk-element of M.
When presented with the input signal x^(j), the output is

M x^(j) = Σ_{i=1}^p y^(i) x^(i)T x^(j)
        = y^(j) x^(j)T x^(j) + Σ_{i=1, i≠j}^p (x^(i)T x^(j)) y^(i).

In particular, if we agree to “normalize” the key input signals so that x^(i)T x^(i) = ‖x^(i)‖^2 = 1, for all 1 ≤ i ≤ p, then the first term on the right hand side above is just y^(j), the desired response signal. The second term on the right hand side is called the “cross-talk” since it involves overlaps (i.e., inner products) of the various input signals.

If the input signals are pairwise orthogonal vectors, as well as being normalized, then x^(i)T x^(j) = 0 for all i ≠ j. In this case, we get

M x^(j) = y^(j),

that is, perfect recall. Note that R^n contains at most n mutually orthogonal vectors.
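To make this concrete, here is a minimal numerical sketch (in Python with NumPy, which the notes themselves do not use; the array names and sizes are my own) of building the correlation memory matrix from orthonormal keys and checking perfect recall.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, p = 8, 3, 4                      # input dim, output dim, number of pairs

# Orthonormal key patterns: take p columns of a random orthogonal matrix.
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
X = Q[:, :p]                           # n x p, columns x^(1), ..., x^(p)
Y = rng.standard_normal((m, p))        # m x p, columns y^(1), ..., y^(p)

# Correlation memory matrix  M = sum_i y^(i) x^(i)T = Y X^T.
M = Y @ X.T

# With orthonormal keys the cross-talk term vanishes: M x^(j) = y^(j).
print(np.allclose(M @ X, Y))           # True
```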
Operationally, one can imagine the system organized as indicated in the figure.
Figure 1.2: An operational view of the correlation memory matrix: the key patterns are loaded at the beginning, an input signal in R^n is presented, and an output in R^m is produced.

• start with M = 0,

• load the key input patterns one by one: M ← M + y^(i) x^(i)T, for i = 1, . . . , p,

• finally, present any input signal and observe the response.

Note that additional signal-response patterns can simply be “added in” at any time, or even removed—by adding in −y^(j) x^(j)T. After the second stage above, the system has “learned” the signal-response pattern pairs. The collection of pattern pairs (x^(1), y^(1)), . . . , (x^(p), y^(p)) is called the training set.


Remark 1.1. In general, the system is a heteroassociative memory x^(i) → y^(i), 1 ≤ i ≤ p. If the output is the prototype input itself, then the system is said to be an autoassociative memory.

We wish, now, to consider a quantitative account of the robustness of the autoassociative memory matrix. For this purpose, we shall suppose that the prototype patterns are bipolar vectors in R^n, i.e., the components of the x^(i) each belong to {−1, 1}. Then ‖x^(i)‖^2 = Σ_{j=1}^n (x_j^(i))^2 = n, for each 1 ≤ i ≤ p, so that (1/√n) x^(i) is normalized. Suppose, further, that the prototype vectors are pairwise orthogonal (this requires that n be even). The correlation memory matrix is

M = (1/n) Σ_{i=1}^p x^(i) x^(i)T

and we have seen that M has perfect recall, M x^(j) = x^(j) for all 1 ≤ j ≤ p. We would like to know what happens if M is presented with x, a corrupted version of one of the x^(j). In order to obtain a bipolar vector as output, we process the output vector M x as follows:

M x → Φ(M x),

where Φ : R^n → {−1, 1}^n is defined by

Φ(z)_k = 1 if z_k ≥ 0, and Φ(z)_k = −1 if z_k < 0,

for 1 ≤ k ≤ n and z ∈ R^n. Thus, the matrix output is passed through a (bipolar) signal quantizer, Φ. To proceed, we introduce the notion of Hamming distance between pairs of bipolar vectors.
Let a = (a_1, . . . , a_n) and b = (b_1, . . . , b_n) be elements of {−1, 1}^n, i.e., bipolar vectors. The set {−1, 1}^n consists of the 2^n vertices of a hypercube in R^n. Then

a^T b = Σ_{i=1}^n a_i b_i = α − β,

where α is the number of components of a and b which are the same, and β is the number of differing components (a_i b_i = 1 if and only if a_i = b_i, and a_i b_i = −1 if and only if a_i ≠ b_i). Clearly, α + β = n and so a^T b = n − 2β.

Definition 1.2. The Hamming distance between the bipolar vectors a, b, denoted ρ(a, b), is defined to be

ρ(a, b) = (1/2) Σ_{i=1}^n |a_i − b_i|.


Evidently (thanks to the factor 1/2), ρ(a, b) is just the total number of mismatches between the components of a and b, i.e., it is equal to β, above. Hence

a^T b = n − 2ρ(a, b).

Note that the Hamming distance defines a metric on the set of bipolar vectors. Indeed, ρ(a, b) = (1/2) ‖a − b‖_1, where ‖·‖_1 is the ℓ1-norm defined on R^n by ‖z‖_1 = Σ_{i=1}^n |z_i|, for z = (z_1, . . . , z_n) ∈ R^n. The ℓ1-norm is also known as the Manhattan norm—the distance between two locations is the sum of lengths of the east-west and north-south contributions to the journey, inasmuch as diagonal travel is not possible in Manhattan.
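As a small illustrative sketch (Python/NumPy, not part of the notes), the quantizer Φ and the Hamming distance ρ can be written as:

```python
import numpy as np

def quantize(z):
    """Bipolar quantizer Phi: +1 where z_k >= 0, -1 where z_k < 0."""
    return np.where(z >= 0, 1, -1)

def hamming(a, b):
    """Hamming distance between bipolar vectors: number of differing components."""
    return int(np.sum(a != b))

a = np.array([1, -1, 1, 1])
b = np.array([1, 1, 1, -1])
print(hamming(a, b))                         # 2
print(a @ b == len(a) - 2 * hamming(a, b))   # True: a^T b = n - 2 rho(a, b)
```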

Hence, using x^(i)T x = n − 2ρ(x^(i), x), we have

M x = (1/n) Σ_{i=1}^p x^(i) x^(i)T x = (1/n) Σ_{i=1}^p (n − 2ρ_i(x)) x^(i),

where ρ_i(x) = ρ(x^(i), x), the Hamming distance between the input vector x and the prototype pattern vector x^(i).
Given x, we wish to know when x → x^(m), that is, when M x → Φ(M x) = x^(m). According to our bipolar quantization rule, it will certainly be true that Φ(M x) = x^(m) whenever the corresponding components of M x and x^(m) have the same sign. This will be the case when (M x)_j x_j^(m) > 0, that is, whenever

(1/n)(n − 2ρ_m(x)) x_j^(m) x_j^(m) + (1/n) Σ_{i=1, i≠m}^p (n − 2ρ_i(x)) x_j^(i) x_j^(m) > 0

for all 1 ≤ j ≤ n. This holds if

| Σ_{i=1, i≠m}^p (n − 2ρ_i(x)) x_j^(i) x_j^(m) | < n − 2ρ_m(x)    (∗)

for all 1 ≤ j ≤ n (we have used the fact that if s > |t| then certainly s + t > 0).
We wish to find conditions which ensure that the inequality (∗) holds. By the triangle inequality, we get

| Σ_{i=1, i≠m}^p (n − 2ρ_i(x)) x_j^(i) x_j^(m) | ≤ Σ_{i=1, i≠m}^p |n − 2ρ_i(x)|    (∗∗)

since |x_j^(i) x_j^(m)| = 1 for all 1 ≤ j ≤ n. Furthermore, using the orthogonality of x^(m) and x^(i), for i ≠ m, we have

0 = x^(i)T x^(m) = n − 2ρ(x^(i), x^(m)),

so that

|n − 2ρ_i(x)| = |2ρ(x^(i), x^(m)) − 2ρ_i(x)| = 2|ρ(x^(i), x^(m)) − ρ(x^(i), x)| ≤ 2ρ(x^(m), x),

where the inequality above follows from the pair of inequalities

ρ(x^(i), x^(m)) ≤ ρ(x^(i), x) + ρ(x, x^(m))   and   ρ(x^(i), x) ≤ ρ(x^(i), x^(m)) + ρ(x^(m), x).

Hence, we have

|n − 2ρ_i(x)| ≤ 2ρ_m(x)    (∗∗∗)

for all i ≠ m. This, together with (∗∗) gives
| Σ_{i=1, i≠m}^p (n − 2ρ_i(x)) x_j^(i) x_j^(m) | ≤ Σ_{i=1, i≠m}^p |n − 2ρ_i(x)| ≤ 2(p − 1)ρ_m(x).

It follows that whenever 2(p − 1)ρ_m(x) < n − 2ρ_m(x) then (∗) holds, which means that Φ(M x) = x^(m). The condition 2(p − 1)ρ_m(x) < n − 2ρ_m(x) is just that 2pρ_m(x) < n, i.e., the condition that ρ_m(x) < n/2p. Now, we observe that if ρ_m(x) < n/2p, then, for any i ≠ m,

n − 2ρ_i(x) ≤ 2ρ_m(x) < n/p,   by (∗∗∗), above,

and so n − 2ρ_i(x) < n/p. Thus

ρ_i(x) > (1/2)(n − n/p) = (n/2)·(p − 1)/p ≥ n/2p,

assuming that p ≥ 2, so that p − 1 ≥ 1. In other words, if x is within Hamming distance n/2p of x^(m), then its Hamming distance to every other prototype input vector is greater than (or equal to) n/2p. We have thus proved the following theorem (L. Personnaz, I. Guyon and G. Dreyfus, Phys. Rev. A 34, 4217–4228 (1986)).


Theorem 1.3. Suppose that {x(1) , x(2) , . . . , x(p) } is a given set of mutually orthogonal bipolar patterns in {−1, 1}n . If x ∈ {−1, 1}n lies within Hamming distance (n/2p) of a particular prototype vector x(m) , say, then x(m) is the nearest prototype vector to x.

Furthermore, if the autoassociative matrix memory based on the patterns {x(1) , x(2) , . . . , x(p) } is augmented by subsequent bipolar quantization, then the input vector x invokes x(m) as the corresponding output. This means that the combined memory matrix and quantization system can correctly recognize (slightly) corrupted input patterns. The nonlinearity (induced by the bipolar quantizer) has enhanced the system performance— small background “noise” has been removed. Note that it could happen that the output response to x is still x(m) even if x is further than (n/2p) from x(m) . In other words, the theorem only gives sufficient conditions for x to recall x(m) .

As an example, suppose that we store 4 patterns built from a grid of 8 × 8 pixels, so that p = 4, n = 8^2 = 64 and n/2p = 64/8 = 8. Each of the 4 patterns can then be correctly recalled even when presented with up to 7 incorrect pixels.
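As a quick numerical check of this example (a Python/NumPy sketch, not part of the notes; the Hadamard construction is just one convenient way to obtain mutually orthogonal bipolar patterns):

```python
import numpy as np

n, p = 64, 4

# Sylvester construction of a 64 x 64 Hadamard matrix: orthogonal +-1 rows.
H = np.array([[1]])
for _ in range(6):
    H = np.block([[H, H], [H, -H]])
X = H[:p]                                   # p x n, rows are the prototypes

M = (X.T @ X) / n                           # M = (1/n) sum_i x^(i) x^(i)T

rng = np.random.default_rng(1)
x = X[2].copy()
flip = rng.choice(n, size=7, replace=False) # corrupt 7 < n/2p = 8 pixels
x[flip] *= -1

recalled = np.where(M @ x >= 0, 1, -1)      # bipolar quantizer Phi
print(np.array_equal(recalled, X[2]))       # True: the corrupted pattern is restored
```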

Remark 1.4. If x is close to −x^(m), then the output from the combined autocorrelation matrix memory and bipolar quantizer is −x^(m).

Proof. Let M̃ = (1/n) Σ_{i=1}^p (−x^(i))(−x^(i))T. Then clearly M̃ = M, i.e., the autoassociative correlation memory matrix for the patterns −x^(1), . . . , −x^(p) is exactly the same as that for the patterns x^(1), . . . , x^(p). Applying the above theorem to the system with the negative patterns, we get that Φ(M̃ x) = −x^(m) whenever x is within Hamming distance n/2p of −x^(m). But M̃ = M, so Φ(M x) = −x^(m), as claimed.

A memory matrix, also known as a linear associator, can be pictured as a network as in the figure.
Figure 1.3: The memory matrix (linear associator) as a network: an input vector (n components x_1, . . . , x_n) is connected to an output vector (m components y_1, . . . , y_m) through connections with weights M_11, . . . , M_mn.


“Weights” are assigned to the connections. Since y_i = Σ_j M_ij x_j, this suggests that we assign the weight M_ij to the connection joining input node j to output node i; M_ij = weight(j → i).

The correlation memory matrix trained on the pattern pairs (x^(1), y^(1)), . . . , (x^(p), y^(p)) is given by M = Σ_{m=1}^p y^(m) x^(m)T, which has typical term

M_ij = Σ_{m=1}^p (y^(m) x^(m)T)_ij = Σ_{m=1}^p y_i^(m) x_j^(m).

Now, Hebb's law (1949) for “real”, i.e., biological, brains says that if the excitation of cell j is involved in the excitation of cell i, then continued excitation of cell j causes an increase in its efficiency to excite cell i. To encapsulate a crude version of this idea mathematically, we might hypothesise that the weight between the two nodes be proportional to the excitation values of the nodes. Thus, for pattern label m, we would postulate that the weight, weight(input j → output i), be proportional to x_j^(m) y_i^(m).
We see that Mij is a sum, over all patterns, of such terms. For this reason, the assignment of the correlation memory matrix to a content addressable memory system is sometimes referred to as generalized Hebbian learning, or one says that the memory matrix is given by the generalized Hebbian rule.

Capacity of autoassociative Hebbian learning
We have seen that the correlation memory matrix has perfect recall provided that the input patterns are pairwise orthogonal vectors. Clearly, there can be at most n of these. In practice, this orthogonality requirement may not be satisfied, so it is natural to ask for some kind of guide as to the number of patterns that can be stored and effectively recovered. In other words, how many patterns can there be before the cross-talk term becomes so large that it destroys the recovery of the key patterns? Experiment confirms that, indeed, there is a problem here. To give some indication of what might be reasonable, consider the autoassociative correlation memory matrix based on p bipolar pattern vectors x^(1), . . . , x^(p) ∈ {−1, 1}^n, followed by bipolar quantization, Φ. On presentation of pattern x^(m), the system output is

Φ(M x^(m)) = Φ( (1/n) Σ_{i=1}^p x^(i) x^(i)T x^(m) ).

Consider the k-th bit. Then Φ(M x^(m))_k = x_k^(m) whenever x_k^(m) (M x^(m))_k > 0, that is, whenever

(1/n) x_k^(m) x_k^(m) x^(m)T x^(m) + (1/n) Σ_{i=1, i≠m}^p x_k^(m) x_k^(i) x^(i)T x^(m) > 0.

The first term is equal to 1 and, writing C_k for the second (cross-talk) term, the condition becomes

1 + C_k > 0.
In order to consider a “typical” situation, we suppose that the patterns are random. Thus, x(1) , . . . , x(p) are selected at random from {−1, 1}n , with all pn bits being chosen independently and equally likely. The output bit Φ(M x(m) )k is therefore incorrect if 1 + Ck < 0, i.e., if Ck < −1. We shall estimate the probability of this happening.

We see that C_k is a sum of many terms,

C_k = (1/n) Σ_{i=1, i≠m}^p Σ_{j=1}^n X_{m,k,i,j},   where X_{m,k,i,j} = x_k^(m) x_k^(i) x_j^(i) x_j^(m).

We note firstly that, with j = k,

X_{m,k,i,k} = x_k^(m) x_k^(i) x_k^(i) x_k^(m) = 1.

Next, we see that, for j ≠ k, each X_{m,k,i,j} takes the values ±1 with equal probability, namely 1/2, and that these different X's form an independent family. Therefore, we may write C_k as

C_k = (p − 1)/n + (1/n) S,

where S is a sum of (n − 1)(p − 1) independent random variables, each taking the values ±1 with probability 1/2. Each of the X's has mean 0 and variance σ^2 = 1. By the Central Limit Theorem, it follows that the random variable (S/((n − 1)(p − 1)))/(σ/√((n − 1)(p − 1))) = S/√((n − 1)(p − 1)) has an approximate standard normal distribution (for large n).

Hence, if we denote by Z a standard normal random variable,

Prob(C_k < −1) = Prob( (p − 1)/n + S/n < −1 )
              = Prob( S < −n − (p − 1) )
              = Prob( S/√((n − 1)(p − 1)) < (−n − (p − 1))/√((n − 1)(p − 1)) )
              ∼ Prob( Z < (−n − (p − 1))/√((n − 1)(p − 1)) )
              ∼ Prob( Z < −√(n/p) ).

Suppose that we require that the probability of an incorrect bit be no greater than 0.01 (or 1%). Then, from statistical tables, we find that Prob(Z > √(n/p)) ≤ 0.01 requires that √(n/p) ≥ 2.326. That is, we require n/p ≥ (2.326)^2, or p/n ≤ 0.185. Now, to say that any particular bit is incorrectly recalled with probability 0.01 is to say that the average number of incorrect bits (from a large sample) is 1% of the total. We have therefore shown that if we are prepared to accept up to 1% bad bits in our recalled patterns (on average) then we can expect to be able to store no more than p = 0.185n patterns in our autoassociative system. That is, the storage capacity (with a 1% error tolerance) is 0.185n.
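A quick arithmetic check of these figures (a sketch using Python's statistics.NormalDist; the particular n and p are illustrative choices of my own):

```python
from math import sqrt
from statistics import NormalDist

nd = NormalDist()

# Prob(Z > sqrt(n/p)) <= 0.01 needs sqrt(n/p) >= z_0.99 ~ 2.326,
# i.e. n/p >= z^2, i.e. p/n <= 1/z^2 ~ 0.185.
z = nd.inv_cdf(0.99)
print(round(z, 3), round(1 / z**2, 3))   # 2.326 0.185

# Predicted bit-error probability at this capacity, e.g. n = 1000, p = 185:
n, p = 1000, 185
print(round(nd.cdf(-sqrt(n / p)), 4))    # ~ 0.01
```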

Generalized inverse matrix memory
We have seen that the success of the correlation memory matrix, or Hebbian learning, is limited by the appearance of the cross-talk term. We shall derive an alternative system based on the idea of minimization of the output distortion or error.

Let us start again (and with a change of notation). We wish to construct an associative memory system which matches input patterns a(1) ,. . . , a(p) (from Rn ) with output pattern vectors b(1) ,. . . , b(p) (in Rm ), respectively. The question is whether or not we can find a matrix M ∈ Rm×n , the set of m × n real matrices, such that

M a(i) = b(i)

for all 1 ≤ i ≤ p. Let A ∈ R^{n×p} be the matrix whose columns are the vectors a^(1), . . . , a^(p), i.e., A = (a^(1) · · · a^(p)), and let B ∈ R^{m×p} be the matrix with columns given by the b^(i)'s, B = (b^(1) · · · b^(p)); thus A_ij = a_i^(j) and B_ij = b_i^(j). Then it is easy to see that M a^(i) = b^(i), for all i, is equivalent to M A = B. The problem, then, is to solve the matrix equation

M A = B,
for M ∈ Rm×n , for given matrices A ∈ Rn×p and B ∈ Rm×p . First, we observe that for a solution to exist, the matrices A and B cannot be arbitrary. Indeed, if A = 0, then so is M A no matter what M is—so the equation will not hold unless B also has all zero entries. Suppose next, slightly more subtly, that there is some non-zero vector v ∈ Rp such that Av = 0. Then, for any M , M Av = 0. In general, it need not be true that Bv = 0.

Suppose then that there is no such non-zero v ∈ Rp such that Av = 0, i.e., we are supposing that Av = 0 implies that v = 0. What does this mean? We have
(Av)_i = Σ_{j=1}^p A_ij v_j
       = v_1 A_i1 + v_2 A_i2 + · · · + v_p A_ip
       = v_1 a_i^(1) + v_2 a_i^(2) + · · · + v_p a_i^(p)
       = i-th component of (v_1 a^(1) + · · · + v_p a^(p)).

In other words,

Av = v_1 a^(1) + · · · + v_p a^(p).
The vector Av is a linear combination of the columns of A, considered as elements of Rn .
Now, the statement that Av = 0 if and only if v = 0 is equivalent to the statement that v1 a(1) + · · · + vp a(p) = 0 if and only if v1 = v2 = · · · = vp = 0 which, in turn, is equivalent to the statement that a(1) ,. . . ,a(p) are linearly independent vectors in Rn .

Thus, the statement, Av = 0 if and only if v = 0, is true if and only if the columns of A are linearly independent vectors in Rn .
Proposition 1.5. For any A ∈ R^{n×p}, the p × p matrix ATA is invertible if and only if the columns of A are linearly independent in R^n.

Proof. The square matrix ATA is invertible if and only if the equation ATAv = 0 has the unique solution v = 0, v ∈ R^p. (Certainly the invertibility of ATA implies the uniqueness of the zero solution to ATAv = 0. For the converse, first note that the uniqueness of this zero solution implies that


ATA is a one-one linear mapping from Rp to Rp . Moreover, using linearity, one readily checks that the collection ATAu1 , . . . , ATAup is a linearly independent set for any basis u1 , . . . , up of Rp . This means that it is a basis and so ATA maps Rp onto itself. Hence ATA has an inverse. Alternatively, one can argue that since ATA is symmetric it can be diagonalized via some orthogonal transformation. But a diagonal matrix is invertible if and only if every diagonal entry is non-zero. In this case, these entries are precisely the eigenvalues of ATA. So ATA is invertible if and only if none of its eigenvalues are zero.)

Suppose that the columns of A are linearly independent and that ATAv = 0. Then it follows that v^T ATAv = 0 and so Av = 0, since v^T ATAv = Σ_{i=1}^n (Av)_i^2 = ‖Av‖_2^2, the square of the Euclidean length of the n-dimensional vector Av. By the argument above, Av is a linear combination of the columns of A, and we deduce that v = 0. Hence ATA is invertible.

On the other hand, if ATA is invertible, then Av = 0 implies that ATAv = 0 and so v = 0. Hence the columns of A are linearly independent. We can now derive the result of interest here.
Theorem 1.6. Let A be any n × p matrix whose columns are linearly independent. Then for any m × p matrix B , there is an m × n matrix M such that M A = B .
Proof. Let

M = B (A^T A)^{−1} A^T,

where B is m × p, (A^T A)^{−1} is p × p and A^T is p × n, so that M is m × n. Then M is well-defined since A^T A ∈ R^{p×p} is invertible, by the proposition. Direct substitution shows that M A = B.
So, with this choice of M we get perfect recall, provided that the input pattern vectors are linearly independent.
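A brief numerical sketch (Python/NumPy; the data and names are my own invention) of this construction for linearly independent, but not necessarily orthogonal, key patterns:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, p = 6, 3, 4

A = rng.standard_normal((n, p))     # columns a^(1),...,a^(p); independent with prob. 1
B = rng.standard_normal((m, p))     # desired responses b^(1),...,b^(p)

# M = B (A^T A)^{-1} A^T gives exact recall M A = B when A has full column rank.
M = B @ np.linalg.inv(A.T @ A) @ A.T
print(np.allclose(M @ A, B))        # True
```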
Note that, in general, the solution above is not unique. Indeed, for any matrix C ∈ R^{m×n}, the m × n matrix M′ = C(1_n − A(A^T A)^{−1}A^T) satisfies

M′A = C(A − A(A^T A)^{−1}A^T A) = C(A − A) = 0 ∈ R^{m×p}.

Hence M + M′ satisfies (M + M′)A = B.
Can we see what M looks like in terms of the patterns a(i) , b(i) ? The answer is “yes and no”. We have A = (a(1) · · · a(p) ) and B = (b(1) · · · b(p) ). Then
(A^T A)_ij = Σ_{k=1}^n (A^T)_ik A_kj = Σ_{k=1}^n A_ki A_kj = Σ_{k=1}^n a_k^(i) a_k^(j),


which gives A^T A directly in terms of the a^(i)'s. Let Q = A^T A ∈ R^{p×p}. Then M = BQ^{−1}A^T, so that

M_ij = Σ_{k,ℓ=1}^p B_ik (Q^{−1})_kℓ (A^T)_ℓj = Σ_{k,ℓ=1}^p b_i^(k) (Q^{−1})_kℓ a_j^(ℓ),

since (A^T)_ℓj = A_jℓ.

This formula for M , valid for linearly independent input patterns, expresses M more or less in terms of the patterns. The appearance of the inverse, Q−1 , somewhat lessens its appeal, however.
To discuss the case where the columns of A are not necessarily linearly independent, we need to consider the notion of generalized inverse.

Definition 1.7. For any given matrix A ∈ R^{m×n}, the matrix X ∈ R^{n×m} is said to be a generalized inverse of A if

(i) AXA = A,
(ii) XAX = X ,
(iii) (AX )T = AX ,
(iv) (XA)T = XA.
The terms pseudoinverse or Moore-Penrose inverse are also commonly used for such an X .
Examples 1.8.

1. If A ∈ R^{n×n} is invertible, then A^{−1} is the generalized inverse of A.

2. If A = α ∈ R^{1×1}, then X = 1/α is the generalized inverse provided α ≠ 0. If α = 0, then X = 0 is the generalized inverse.

3. The generalized inverse of A = 0 ∈ R^{m×n} is X = 0 ∈ R^{n×m}.

4. If A = u ∈ R^{m×1}, u ≠ 0, then one checks that X = u^T/(u^T u) is a generalized inverse of u.
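NumPy's np.linalg.pinv computes a Moore-Penrose inverse; a small sketch (not from the notes, with an arbitrary rank-deficient matrix) verifying the four defining conditions numerically:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3)) @ rng.standard_normal((3, 4))  # rank-deficient 5 x 4
X = np.linalg.pinv(A)                                          # candidate A^#

checks = [
    np.allclose(A @ X @ A, A),          # (i)   AXA = A
    np.allclose(X @ A @ X, X),          # (ii)  XAX = X
    np.allclose((A @ X).T, A @ X),      # (iii) AX symmetric
    np.allclose((X @ A).T, X @ A),      # (iv)  XA symmetric
]
print(all(checks))                       # True
```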
The following result is pertinent to the theory.
Theorem 1.9. Every matrix possesses a unique generalized inverse.

Proof. We postpone discussion of existence (which can be established via the Singular Value Decomposition) and just show uniqueness. This follows


by repeated use of the defining properties (i), . . . , (iv). Let A ∈ R^{m×n} be given and suppose that X, Y ∈ R^{n×m} are generalized inverses of A. Then

X = XAX,                  by (ii),
  = X(AX)^T,              by (iii),
  = XX^T A^T
  = XX^T A^T Y^T A^T,     by (i)^T,
  = XX^T A^T AY,          by (iii),
  = XAXAY,                by (iii),
  = XAY,                  by (ii),
  = XAYAY,                by (i),
  = XAA^T Y^T Y,          by (iv),
  = A^T X^T A^T Y^T Y,    by (iv),
  = A^T Y^T Y,            by (i)^T,
  = YAY,                  by (iv),
  = Y,                    by (ii),

as required.
Notation For given A ∈ Rm×n , we denote its generalized inverse by A# . It is also often written as Ag , A+ or A† .
Proposition 1.10. For any A ∈ R^{m×n}, AA^# is the orthogonal projection onto ran A, the linear span in R^m of the columns of A, i.e., if P = AA^# ∈ R^{m×m}, then P = P^T = P^2 and P maps R^m onto ran A.
Proof. The defining property (iii) of the generalized inverse A# is precisely the statement that P = AA# is symmetric. Furthermore,
P^2 = AA^# AA^# = AA^#,   by condition (i),
    = P,

so P is idempotent. Thus P is an orthogonal projection.

For any x ∈ R^m, we have that P x = AA^# x ∈ ran A, so that P : R^m → ran A. On the other hand, if x ∈ ran A, there is some z ∈ R^n such that x = Az. Hence P x = P Az = AA^# Az = Az = x, where we have used condition (i) in the penultimate step. Hence P maps R^m onto ran A.

Proposition 1.11. Let A ∈ R^{m×n}.

(i) If rank A = n, then A# = (ATA)−1 AT .
(ii) If rank A = m, then A# = AT (AAT )−1 .

Proof. If rank A = n, then A has linearly independent columns and we know that this implies that ATA is invertible (in Rn×n ). It is now a straightforward matter to verify that (ATA)−1 AT satisfies the four defining properties of the generalized inverse, which completes the proof of (i).

If rank A = m, we simply consider the transpose instead. Let B = A^T. Then rank B = m, since A and A^T have the same rank, and so, by the argument above, B^# = (B^T B)^{−1} B^T. However, (A^T)^# = (A^#)^T, as is easily checked (again from the defining conditions). Hence

A^# = ((A^#)^T)^T = ((A^T)^#)^T = (B^#)^T = B(B^T B)^{−1} = A^T(AA^T)^{−1},
which establishes (ii).
Definition 1.12. The ‖·‖_F norm on R^{m×n} is defined by

‖A‖_F^2 = Tr(A^T A)   for A ∈ R^{m×n},

where Tr(B) is the trace of the square matrix B: Tr(B) = Σ_i B_ii. This norm is called the Frobenius norm and is sometimes denoted ‖·‖_2.
n

A

2
F

(ATA)ii

T

= Tr(A A) =
i=1
n

m

AT Aji
ij

=
i=1 j =1
m
n

A2
ij

=
i=1 j =1

since AT = Aji . Hence
ij
A

F

=

(sum of squares of all entries of A) .

We also note, here, that clearly A F = AT F .
Suppose that A = u ∈ Rm×1 , an m-component vector. Then A 2 = F
m
2
i=1 ui , that is, A F is the usual Euclidean norm in this case. Thus the notation · 2 for this norm is consistent. Note that, generally, A F is just the Euclidean norm of A ∈ Rm×n when A is “taken apart” row by row and considered as a vector in Rmn via the correspondence A ↔ (A11 , A12 , . . . , A1n , A21 , . . . , Amn ).

The notation ‖A‖_2 is sometimes used in numerical analysis texts (and in the computer algebra software package Maple) to mean the norm of


A as a linear map from R^n into R^m, that is, the value sup{ ‖Ax‖ : x ∈ R^n with ‖x‖ = 1 }. One can show that this value is equal to the square root of the largest eigenvalue of A^T A, whereas ‖A‖_F is equal to the square root of the sum of the eigenvalues of A^T A.
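A small numerical sketch (Python/NumPy, not part of the notes) of the relationship between these two norms and the eigenvalues of A^T A:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 3))

eigs = np.linalg.eigvalsh(A.T @ A)              # eigenvalues of A^T A (all >= 0)

fro = np.linalg.norm(A, 'fro')                  # Frobenius norm
op2 = np.linalg.norm(A, 2)                      # operator 2-norm (largest singular value)

print(np.isclose(fro, np.sqrt(eigs.sum())))     # True: ||A||_F^2 = sum of eigenvalues
print(np.isclose(op2, np.sqrt(eigs.max())))     # True: ||A||_2^2 = largest eigenvalue
```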

Remark 1.13. Let A ∈ Rm×n , B ∈ Rn×m , C ∈ Rm×n , and X ∈ Rp×p . Then it is easy to see that
(i) Tr(AB ) = Tr(BA),
(ii) Tr(X ) = Tr(X T ),
(iii) Tr(AC T ) = Tr(C T A) = Tr(AT C ) = Tr(CAT ).
The equalities in (iii) can each be verified directly, or alternatively, one notices that (iii) is a consequence of (i) and (ii) (replacing B by C^T).

Lemma 1.14. For A ∈ R^{m×n}, A^# A A^T = A^T.
Proof. We have
(A^# A)A^T = (A^# A)^T A^T,   by condition (iv),
           = (A(A^# A))^T
           = A^T,              by condition (i),

as required.
Theorem 1.15. Let A ∈ R^{n×p} and B ∈ R^{m×p} be given. Then X = BA^# is an element of R^{m×n} which minimizes the quantity ‖XA − B‖_F.

Proof. We have

‖XA − B‖_F^2 = ‖(X − BA^#)A + B(A^#A − 1_p)‖_F^2
            = ‖(X − BA^#)A‖_F^2 + ‖B(A^#A − 1_p)‖_F^2 + 2 Tr( A^T (X − BA^#)^T B(A^#A − 1_p) )
            = ‖(X − BA^#)A‖_F^2 + ‖B(A^#A − 1_p)‖_F^2 + 2 Tr( (X − BA^#)^T B(A^#A − 1_p)A^T ),

and (A^#A − 1_p)A^T = A^#AA^T − A^T = 0 by the lemma. Hence

‖XA − B‖_F^2 = ‖(X − BA^#)A‖_F^2 + ‖B(A^#A − 1_p)‖_F^2,

which achieves its minimum, ‖B(A^#A − 1_p)‖_F^2, when X = BA^#.


Note that any X satisfying XA = BA^#A gives a minimum solution. If A^T has full column rank (or, equivalently, A^T has no kernel) then AA^T is invertible. Multiplying on the right by A^T(AA^T)^{−1} gives X = BA^#. So under this condition on A^T, we see that there is a unique solution X = BA^# minimizing ‖XA − B‖_F.

In general, one can show that BA^# is the element with minimal ‖·‖_F norm which minimizes ‖XA − B‖_F, i.e., if Y ≠ BA^# and ‖YA − B‖_F = ‖BA^#A − B‖_F, then ‖BA^#‖_F < ‖Y‖_F.
Now let us return to our problem of finding a memory matrix which stores the input-output pattern pairs (a^(i), b^(i)), 1 ≤ i ≤ p, with each a^(i) ∈ R^n and each b^(i) ∈ R^m. In general, it may not be possible to find a matrix M ∈ R^{m×n} such that M a^(i) = b^(i), for each i. Whatever our choice of M, the system output corresponding to the input a^(i) is just M a^(i). So, failing equality M a^(i) = b^(i), we would at least like to minimize the error b^(i) − M a^(i). A measure of such an error is ‖b^(i) − M a^(i)‖_2^2, the squared Euclidean norm of the difference. Taking all p patterns into account, the total system recall error is taken to be

Σ_{i=1}^p ‖b^(i) − M a^(i)‖_2^2.

Let A = (a^(1) · · · a^(p)) ∈ R^{n×p} and B = (b^(1) · · · b^(p)) ∈ R^{m×p} be the matrices whose columns are given by the pattern vectors a^(i) and b^(i), respectively. Then the total system recall error, above, is just

‖B − MA‖_F^2.

We have seen that this is minimized by the choice M = BA# , where A# is the generalized inverse of A. The memory matrix M = BA# is called the optimal linear associative memory (OLAM) matrix.
Remark 1.16. If the patterns {a^(1), . . . , a^(p)} constitute an orthonormal family, then A has independent columns and so A^# = (A^T A)^{−1}A^T = 1_p A^T = A^T, so that the OLAM matrix is BA^# = BA^T, which is exactly the correlation memory matrix.
In the autoassociative case, b(i) = a(i) , so that B = A and the OLAM matrix is given as
M = AA# .
We have seen that AA# is precisely the projection onto the range of A, i.e., onto the subspace of Rn spanned by the prototype patterns. In this case, we say that M is given by the projection rule.
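In NumPy terms, np.linalg.pinv computes A^#, so the OLAM matrix and the projection rule can be sketched as follows (an illustration with invented data, not from the notes):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, p = 10, 4, 6

A = rng.standard_normal((n, p))        # key patterns a^(1),...,a^(p) as columns
B = rng.standard_normal((m, p))        # desired responses as columns

M = B @ np.linalg.pinv(A)              # OLAM matrix M = B A^#
print(np.allclose(M @ A, B))           # True here: independent columns give exact recall

P = A @ np.linalg.pinv(A)              # autoassociative case: projection onto ran A
print(np.allclose(P @ P, P), np.allclose(P, P.T))   # idempotent and symmetric
```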


Any input x ∈ R^n can be written as

x = AA^# x + (1_n − AA^#)x,

where AA^# x is the OLAM system output and (1_n − AA^#)x is the “novelty”. Kohonen calls 1_n − AA^# the novelty filter and has applied these ideas to image-subtraction problems such as tumor detection in brain scans. Non-null novelty vectors may indicate disorders or anomalies.
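A tiny sketch (again NumPy, with made-up data) of splitting an input into its OLAM output and its novelty component:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((10, 3))       # stored prototype patterns as columns

P = A @ np.linalg.pinv(A)              # projection onto the span of the prototypes
x = rng.standard_normal(10)            # an arbitrary input

recall  = P @ x                        # OLAM system output (familiar part)
novelty = x - recall                   # (1 - AA^#) x, the novelty component

print(np.allclose(recall + novelty, x))        # True
print(np.allclose(A.T @ novelty, np.zeros(3))) # novelty is orthogonal to every prototype
```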

Pattern classification
We have discussed the distributed associative memory (DAM) matrix as an autoassociative or as a heteroassociative memory model. The first is mathematically just a special case of the second. Another special case is that of so-called classification. The idea is that one simply wants an input signal to elicit a response “tag”, typically coded as one of a collection of orthogonal unit vectors, such as given by the standard basis vectors of R^m.

• In operation, the input x induces output M x, which is then associated with that tag vector corresponding to its maximum component. In other words, if (M x)_j is the maximum component of M x, then the output M x is associated with the j-th tag.
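In code, the classification step is just an argmax over the output components (a sketch; the tag encoding and data here are my own illustration):

```python
import numpy as np

def classify(M, x):
    """Assign the input to the tag with the largest output component."""
    return int(np.argmax(M @ x))

rng = np.random.default_rng(0)
n, p = 20, 5
A = rng.standard_normal((n, p))            # one key pattern per class (columns)
B = np.eye(p)                              # tags: standard basis vectors e_1,...,e_p

M = B @ np.linalg.pinv(A)                  # OLAM matrix used as a classifier
x = A[:, 3] + 0.1 * rng.standard_normal(n) # a noisy version of pattern number 3
print(classify(M, x))                      # expected: 3
```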

Examples of various pattern classification tasks have been given by T. Kohonen, P. Lehtiö, E. Oja, A. Kortekangas and K. Mäkisara, Demonstration of pattern processing properties of the optimal associative mappings, Proceedings of the International Conference on Cybernetics and Society, Washington, D. C., 581–585 (1977). (See also the article “Storage and Processing of Information in Distributed Associative Memory Systems” by T. Kohonen, P. Lehtiö and E. Oja in “Parallel Models of Associative Memory”, edited by G. Hinton and J. Anderson, published by Lawrence Erlbaum Associates, (updated edition) 1989.)
In one such experiment, ten people were each photographed from five different angles, ranging from 45◦ to −45◦ , with 0◦ corresponding to a fully frontal face. These were then digitized to produce pattern vectors with eight possible intensity levels for each pixel. A distinct unit vector, a tag, was associated with each person, giving a total of ten tags, and fifty patterns. The OLAM matrix was constructed from this data.

The memory matrix was then presented with a digitized photograph of one of the ten people, but taken from a different angle to any of the original five prototypes. The output was then classified according to the tag associated with its largest component. This was found to give correct identification.

The OLAM matrix was also found to perform well with autoassociation. Pattern vectors corresponding to one hundred digitized photographs were


used to construct the autoassociative memory via the projection rule. When presented with incomplete or fuzzy versions of the original patterns, the OLAM matrix satisfactorily reconstructed the correct image.

In another autoassociative recall experiment, twenty one different prototype images were used to construct the OLAM matrix. These were each composed of three similarly placed copies of a subimage. New pattern images, consisting of just one part of the usual triple features, were presented to the OLAM matrix. The output images consisted of slightly fuzzy versions of the single part but triplicated so as to mimic the subimage positioning learned from the original twenty one prototypes.

An analysis comparing the performance of the correlation memory matrix with that of the generalized inverse matrix memory has been offered by Cherkassky, Fassett and Vassilas (IEEE Trans. on Computers, 40, 1429 (1991)). Their conclusion is that the generalized inverse memory matrix performs better than the correlation memory matrix for autoassociation, but that the correlation memory matrix is better for classification. This is contrary to the widespread belief that the generalized inverse memory matrix is the superior model.


Chapter 2
Adaptive Linear Combiner

We wish to consider a memory matrix for the special case of one-dimensional output vectors. Thus, we consider input pattern vectors x(1) , . . . , x(p) ∈ Rℓ , say, with corresponding desired outputs y (1) , . . . , y (p) ∈ R and we seek a memory matrix M ∈ R1×ℓ such that

M x(i) = y (i) ,
for 1 ≤ i ≤ p. Since M ∈ R1×ℓ , we can think of M as a row vector M = (m1 , . . . , mℓ ). The output corresponding to the input x = (x1 , . . . , xℓ ) ∈ Rℓ is just

y = M x = Σ_{i=1}^ℓ m_i x_i.

Such a system is known as the adaptive linear combiner (ALC).

Figure 2.1: The Adaptive Linear Combiner: the input signal components are multiplied by the weights on the connections and an “adder” forms the output y = Σ_i m_i x_i.
We have seen that we may not be able to find M which satisfies the exact input-output relationship M x^(i) = y^(i), for each i. The idea is to look for an M which is in a certain sense optimal. To do this, we seek m_1, . . . , m_ℓ


such that (one half) the average mean-squared error
E ≡ (1/2p) Σ_{i=1}^p |y^(i) − z^(i)|^2

is minimized—where y (i) is the desired output corresponding to the input vector x(i) and z (i) = M x(i) is the actual system output. We already know, from the results in the last chapter, that this is achieved by the OLAM matrix based on the input-output pattern pairs, but we wish here to develop an algorithmic approach to the construction of the appropriate memory matrix. We can write out E in terms of the mi as follows

E = (1/2p) Σ_{i=1}^p | y^(i) − Σ_{j=1}^ℓ m_j x_j^(i) |^2
  = (1/2p) Σ_{i=1}^p ( Σ_{j,k=1}^ℓ m_j m_k x_j^(i) x_k^(i) − 2 Σ_{j=1}^ℓ y^(i) m_j x_j^(i) + (y^(i))^2 )
  = (1/2) Σ_{j,k=1}^ℓ m_j A_jk m_k − Σ_{j=1}^ℓ b_j m_j + (1/2) c,

where A_jk = (1/p) Σ_{i=1}^p x_j^(i) x_k^(i), b_j = (1/p) Σ_{i=1}^p y^(i) x_j^(i) and c = (1/p) Σ_{i=1}^p (y^(i))^2.
Note that A = (Ajk ) ∈ Rℓ×ℓ is symmetric. The error E is a non-negative quadratic function of the mi . For a minimum, we investigate the equalities ∂ E /∂mi = 0, that is,

0 = ∂E/∂m_i = Σ_{k=1}^ℓ A_ik m_k − b_i,

for 1 ≤ i ≤ ℓ, where we have used the symmetry of (A_ik) here. We thus obtain the so-called Gauss-normal or Wiener-Hopf equations

Σ_{k=1}^ℓ A_ik m_k = b_i,   for 1 ≤ i ≤ ℓ,

or, in matrix form,
Am = b ,
with A = (Aik ) ∈ Rℓ×ℓ , m = (m1 , . . . , mℓ ) ∈ Rℓ and b = (b1 , . . . , bℓ ) ∈ Rℓ . If A is invertible, then m = A−1 b is the unique solution. In general, there may be many solutions. For example, if A is diagonal with A11 = 0, then necessarily b1 = 0 (otherwise E could not be non-negative as a function of the mi ) and so we see that m1 is arbitrary. To relate this to the OLAM matrix, write E as

E = (1/2p) ‖M X − Y‖_F^2,

where X = (x^(1) · · · x^(p)) ∈ R^{ℓ×p} and Y = (y^(1) · · · y^(p)) ∈ R^{1×p}. This, we know, is minimized by M = Y X^# ∈ R^{1×ℓ}. Therefore m = M^T must be a solution to the Wiener-Hopf equations above. We can write A, b and c in terms of the matrices X and Y. One finds that A = (1/p) XX^T, b^T = (1/p) Y X^T and c = (1/p) Y Y^T. The equation Am = b then becomes XX^T m = XY^T, giving m = (XX^T)^{−1} XY^T, provided that A is invertible. This gives M = m^T = Y X^T (XX^T)^{−1} = Y X^#, as above.
One method of attack for finding a vector m∗ minimizing E is that of gradient-descent. The idea is to think of E (m1 , . . . , mℓ ) as a bowl-shaped surface above the ℓ-dimensional m1 , . . . , mℓ -space. Pick any value for m. The vector grad E , when evaluated at m, points in the direction of maximum increase of E in the neighbourhood of m. That is to say, for small α (and a vector v of given length), E (m+αv )−E (m) is maximized when v points in the same direction as grad E (as is seen by Taylor’s theorem). Now, rather than increasing E , we wish to minimize it. So the idea is to move a small distance from m to m − α grad E , thus inducing maximal “downhill” movement on the error surface. By repeating this process, we hope to eventually reach a value of m which minimizes E .

The strategy, then, is to consider a sequence of vectors m(n) given algorithmically by

m(n + 1) = m(n) − α grad E,   for n = 1, 2, . . . ,

with m(1) arbitrary and where the parameter α is called the learning rate. If we substitute for grad E, we find

m(n + 1) = m(n) + α (b − Am(n)).
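A minimal sketch of this iteration in Python/NumPy (the data, learning rate and iteration count are illustrative choices of my own; the notes go on to analyse when the iteration converges):

```python
import numpy as np

rng = np.random.default_rng(0)
l, p = 4, 50                                     # ell inputs, p training pairs

X = rng.standard_normal((l, p))                  # input patterns as columns
m_true = rng.standard_normal(l)
Y = m_true @ X + 0.05 * rng.standard_normal(p)   # noisy desired outputs

A = X @ X.T / p                                  # A_jk = (1/p) sum_i x_j^(i) x_k^(i)
b = X @ Y / p                                    # b_j  = (1/p) sum_i y^(i) x_j^(i)

m = np.zeros(l)                                  # m(1) arbitrary
alpha = 0.1                                      # learning rate
for _ in range(2000):
    m = m + alpha * (b - A @ m)                  # m(n+1) = m(n) + alpha (b - A m(n))

print(np.allclose(A @ m, b, atol=1e-6))          # converged to the Wiener-Hopf solution
```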
Now, A is symmetric and so can be diagonalized. There is an orthogonal matrix U ∈ Rℓ×ℓ such that
U A U^T = D = diag(λ_1, . . . , λ_ℓ)
and we may assume t...
