Matrix Computations (4th Edition), sample: 1.1 Basic Algorithms and Notation
1.1 Basic Algorithms and Notation
1.2 Structure and Efficiency
1.3 Block Matrices and Algorithms
1.4 Fast Matrix-Vector Products
1.5 Vectorization and Locality
1.6 Parallel Matrix Multiplication

The study of matrix computations properly begins with the study of various matrix multiplication problems. Although simple mathematically, these calculations are sufficiently rich to develop a wide range of essential algorithmic skills. In §1.1 we examine several formulations of the matrix multiplication update problem C = C + AB. Partitioned matrices are introduced and used to identify linear algebraic “levels” of computation.

If a matrix has special properties, then various economies are generally possible. For example, a symmetric matrix can be stored in half the space of a general matrix. A matrix-vector product may require much less time to execute if the matrix has many zero entries. These matters are considered in §1.2.

A block matrix is a matrix whose entries are themselves matrices. The “language” of block matrices is developed in §1.3. It supports the easy derivation of matrix factorizations by enabling us to spot patterns in a computation that are obscured at the scalar level. Algorithms phrased at the block level are typically rich in matrix-matrix multiplication, the operation of choice in many high-performance computing environments.

Sometimes the block structure of a matrix is recursive, meaning that the block entries have an exploitable resemblance to the overall matrix. This type of connection is the foundation for “fast” matrix-vector product algorithms such as various fast Fourier transforms, trigonometric transforms, and wavelet transforms. These calculations are among the most important in all of scientific computing and are discussed in §1.4. They provide an excellent opportunity to develop a facility with block matrices and recursion.

The last two sections set the stage for effective, “large-n” matrix computations.
In this context, data locality affects efficiency more than the volume of actual arithmetic. Having an ability to reason about memory hierarchies and multiprocessor computation is essential. Our goal in §1.5 and §1.6 is to build an appreciation for the attendant issues without getting into system-dependent details.

Reading Notes

The sections within this chapter depend upon each other as follows: [dependency diagram omitted]. Before proceeding to later chapters, §1.1, §1.2, and §1.3 are essential. The fast transform ideas in §1.4 are utilized in §4.8 and parts of Chapters 11 and 12. The reading of §1.5 and §1.6 can be deferred until high-performance linear equation solving or eigenvalue computation becomes a topic of concern.

1.1 Basic Algorithms and Notation

Matrix computations are built upon a hierarchy of linear algebraic operations. Dot products involve the scalar operations of addition and multiplication. Matrix-vector multiplication is made up of dot products. Matrix-matrix multiplication amounts to a collection of matrix-vector products. All of these operations can be described in algorithmic form or in the language of linear algebra. One of our goals is to show how these two styles of expression complement each other. Along the way we pick up notation and acquaint the reader with the kind of thinking that underpins the matrix computation area. The discussion revolves around the matrix multiplication problem, a computation that can be organized in several ways.

1.1.1 Matrix Notation

Let IR designate the set of real numbers. We denote the vector space of all m-by-n real matrices by IR^{m×n}:

    A ∈ IR^{m×n}  ⟺  A = (a_ij),  a_ij ∈ IR,  i = 1:m, j = 1:n.

If a capital letter is used to denote a matrix (e.g., A, B, Δ), then the corresponding lower case letter with subscript ij refers to the (i, j) entry (e.g., a_ij, b_ij, δ_ij). Sometimes we designate the elements of a matrix with the notation [A]_ij or A(i, j).
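Not part of the text: in the Python sketches used alongside this section, a matrix can be stored as a list of rows, with the book's 1-based A(i, j) entry notation translated to Python's 0-based indexing by a small hypothetical helper.

```python
# Minimal sketch (not from the book): store A as a list of rows and
# translate the 1-based A(i, j) notation into 0-based indexing.

def entry(A, i, j):
    # A(i, j) in the book's notation, with i = 1:m and j = 1:n.
    return A[i - 1][j - 1]

A = [[1.0, 2.0],
     [3.0, 4.0],
     [5.0, 6.0]]
# entry(A, 2, 1) is a_21 = 3.0
```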
1.1.2 Matrix Operations

Basic matrix operations include transposition (IR^{m×n} → IR^{n×m}),

    B = A^T  ⟹  b_ij = a_ji,

addition (IR^{m×n} × IR^{m×n} → IR^{m×n}),

    C = A + B  ⟹  c_ij = a_ij + b_ij,

scalar-matrix multiplication (IR × IR^{m×n} → IR^{m×n}),

    C = aA  ⟹  c_ij = a·a_ij,

and matrix-matrix multiplication (IR^{m×p} × IR^{p×n} → IR^{m×n}),

    C = AB  ⟹  c_ij = Σ_{k=1}^{p} a_ik b_kj.

Pointwise matrix operations are occasionally useful, especially pointwise multiplication (IR^{m×n} × IR^{m×n} → IR^{m×n}),

    C = A .∗ B  ⟹  c_ij = a_ij b_ij,

and pointwise division (IR^{m×n} × IR^{m×n} → IR^{m×n}),

    C = A ./ B  ⟹  c_ij = a_ij / b_ij.

Of course, for pointwise division to make sense, the “denominator matrix” must have nonzero entries.

1.1.3 Vector Notation

Let IR^n denote the vector space of real n-vectors:

    x ∈ IR^n  ⟺  x = [x_1; ⋮ ; x_n],  x_i ∈ IR.

We refer to x_i as the ith component of x. Depending upon context, the alternative notations [x]_i and x(i) are sometimes used. Notice that we are identifying IR^n with IR^{n×1} and so the members of IR^n are column vectors. On the other hand, the elements of IR^{1×n} are row vectors:

    x ∈ IR^{1×n}  ⟺  x = [x_1, … , x_n].

If x is a column vector, then y = x^T is a row vector.

1.1.4 Vector Operations

Assume that a ∈ IR, x ∈ IR^n, and y ∈ IR^n. Basic vector operations include scalar-vector multiplication,

    z = ax  ⟹  z_i = a·x_i,

vector addition,

    z = x + y  ⟹  z_i = x_i + y_i,

and the inner product (or dot product),

    c = x^T y = Σ_{i=1}^{n} x_i y_i.

A particularly important operation, which we write in update form, is the saxpy:

    y = ax + y  ⟹  y_i = a·x_i + y_i.

Here, the symbol “=” is used to denote assignment, not mathematical equality. The vector y is being updated. The name “saxpy” is used in LAPACK, a software package that implements many of the algorithms in this book. “Saxpy” is a mnemonic for “scalar a x plus y.” See LAPACK.

Pointwise vector operations are also useful, including vector multiplication,

    z = x .∗ y  ⟹  z_i = x_i y_i,

and vector division,

    z = x ./ y  ⟹  z_i = x_i / y_i.

1.1.5 The Computation of Dot Products and Saxpys

Algorithms in the text are expressed using a stylized version of the MATLAB language. Here is our first example:

Algorithm 1.1.1 (Dot Product) If x, y ∈ IR^n, then this algorithm computes their dot product c = x^T y.
    c = 0
    for i = 1:n
        c = c + x(i)y(i)
    end

It is clear from the summation that the dot product of two n-vectors involves n multiplications and n additions. The dot product operation is an “O(n)” operation, meaning that the amount of work scales linearly with the dimension. The saxpy computation is also O(n):

Algorithm 1.1.2 (Saxpy) If x, y ∈ IR^n and a ∈ IR, then this algorithm overwrites y with y + ax.

    for i = 1:n
        y(i) = y(i) + ax(i)
    end

We stress that the algorithms in this book are encapsulations of important computational ideas and are not to be regarded as “production codes.”

1.1.6 Matrix-Vector Multiplication and the Gaxpy

Suppose A ∈ IR^{m×n} and that we wish to compute the update y = y + Ax where x ∈ IR^n and y ∈ IR^m are given. This generalized saxpy operation is referred to as a gaxpy. A standard way that this computation proceeds is to update the components one at a time:

    y_i = y_i + Σ_{j=1}^{n} a_ij x_j,    i = 1:m.

This gives the following algorithm:

Algorithm 1.1.3 (Row-Oriented Gaxpy) If A ∈ IR^{m×n}, x ∈ IR^n, and y ∈ IR^m, then this algorithm overwrites y with Ax + y.

    for i = 1:m
        for j = 1:n
            y(i) = y(i) + A(i, j)x(j)
        end
    end

Note that this involves O(mn) work. If each dimension of A is doubled, then the amount of arithmetic increases by a factor of 4. An alternative algorithm results if we regard Ax as a linear combination of A’s columns:

    Ax = x_1 a_1 + ⋯ + x_n a_n,  where a_j denotes the jth column of A.

Algorithm 1.1.4 (Column-Oriented Gaxpy) If A ∈ IR^{m×n}, x ∈ IR^n, and y ∈ IR^m, then this algorithm overwrites y with Ax + y.

    for j = 1:n
        for i = 1:m
            y(i) = y(i) + A(i, j)•x(j)
        end
    end

Note that the inner loop in either gaxpy algorithm carries out a saxpy operation. The column version is derived by rethinking what matrix-vector multiplication “means” at the vector level, but it could also have been obtained simply by interchanging the order of the loops in the row version.

1.1.7 Partitioning a Matrix into Rows and Columns

Algorithms 1.1.3 and 1.1.4 access the data in A by row and by column, respectively.
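The two gaxpy orderings can be transcribed into plain Python (a sketch, not part of the text) to confirm that Algorithms 1.1.3 and 1.1.4 produce the same y:

```python
# Sketch (not from the book): row- and column-oriented gaxpy,
# y = y + Ax, with A stored as a list of rows.

def gaxpy_row(A, x, y):
    # Row-oriented: y(i) accumulates the dot product of row i with x.
    m, n = len(A), len(A[0])
    for i in range(m):
        for j in range(n):
            y[i] += A[i][j] * x[j]
    return y

def gaxpy_col(A, x, y):
    # Column-oriented: y undergoes a saxpy with each column of A.
    m, n = len(A), len(A[0])
    for j in range(n):
        for i in range(m):
            y[i] += A[i][j] * x[j]
    return y

A = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
x = [7.0, 8.0]
y1 = gaxpy_row(A, x, [0.0, 0.0, 0.0])
y2 = gaxpy_col(A, x, [0.0, 0.0, 0.0])
# Both orderings produce y = Ax = [23, 53, 83].
```

Only the order in which A is traversed differs; the arithmetic is identical.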
To highlight these orientations more clearly, we introduce the idea of a partitioned matrix. From one point of view, a matrix is a stack of row vectors:

    A ∈ IR^{m×n}  ⟺  A = [r_1^T; ⋮ ; r_m^T],  r_k ∈ IR^n.        (1.1.1)

This is called a row partition of A. Thus, if we row partition the 3-by-2 matrix

    A = [a_11 a_12; a_21 a_22; a_31 a_32],

then we are choosing to think of A as a collection of rows with

    r_1^T = [a_11 a_12],  r_2^T = [a_21 a_22],  r_3^T = [a_31 a_32].

With the row partitioning (1.1.1), Algorithm 1.1.3 can be expressed as follows:

    for i = 1:m
        y(i) = y(i) + r_i^T x
    end

Alternatively, a matrix is a collection of column vectors:

    A ∈ IR^{m×n}  ⟺  A = [c_1 | ⋯ | c_n],  c_k ∈ IR^m.        (1.1.2)

We refer to this as a column partition of A. In the 3-by-2 example above, we thus would set c_1 and c_2 to be the first and second columns of A, respectively:

    c_1 = [a_11; a_21; a_31],  c_2 = [a_12; a_22; a_32].

With (1.1.2) we see that Algorithm 1.1.4 is a saxpy procedure that accesses A by columns:

    for j = 1:n
        y = y + x_j c_j
    end

In this formulation, we appreciate y as a running vector sum that undergoes repeated saxpy updates.

1.1.8 The Colon Notation

A handy way to specify a column or row of a matrix is with the “colon” notation. If A ∈ IR^{m×n}, then A(k, :) designates the kth row, i.e.,

    A(k, :) = [a_k1, … , a_kn].

The kth column is specified by

    A(:, k) = [a_1k; ⋮ ; a_mk].

With these conventions we can rewrite Algorithms 1.1.3 and 1.1.4 as

    for i = 1:m
        y(i) = y(i) + A(i, :)•x
    end

and

    for j = 1:n
        y = y + x(j)•A(:, j)
    end

respectively. By using the colon notation, we are able to suppress inner loop details and encourage vector-level thinking.

1.1.9 The Outer Product Update

As a preliminary application of the colon notation, we use it to understand the outer product update

    A = A + xy^T,    A ∈ IR^{m×n}, x ∈ IR^m, y ∈ IR^n.

The outer product operation xy^T “looks funny” but is perfectly legal, e.g.,

    [1; 2; 3][4  5] = [4 5; 8 10; 12 15].

This is because xy^T is the product of two “skinny” matrices and the number of columns in the left matrix x equals the number of rows in the right matrix y^T. The entries in the outer product update are prescribed by

    for i = 1:m
        for j = 1:n
            a_ij = a_ij + x_i y_j
        end
    end

This involves O(mn) arithmetic operations.
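The doubly nested loop above translates directly into Python (a sketch, not part of the text); applying it to a zero matrix recovers the outer product xy^T itself:

```python
# Sketch (not from the book): the outer product update A = A + x y^T.

def outer_update(A, x, y):
    m, n = len(A), len(A[0])
    for i in range(m):
        for j in range(n):
            A[i][j] += x[i] * y[j]
    return A

# Updating a zero matrix yields x y^T alone:
A = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
outer_update(A, [1.0, 2.0, 3.0], [4.0, 5.0])
# A is now [[4, 5], [8, 10], [12, 15]].
```

Each inner iteration is a single multiply-add, so the update costs O(mn) arithmetic, matching the count in the text.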
The mission of the j loop is to add a multiple of y^T to the ith row of A, i.e.,

    for i = 1:m
        A(i, :) = A(i, :) + x(i)•y^T
    end

On the other hand, if we make the i-loop the inner loop, then its task is to add a multiple of x to the jth column of A:

    for j = 1:n
        A(:, j) = A(:, j) + y(j)•x
    end

Note that both implementations amount to a set of saxpy computations.

1.1.10 Matrix-Matrix Multiplication

Consider the 2-by-2 matrix-matrix multiplication problem. In the dot product formulation, each entry is computed as a dot product:

    AB = [ a_11 b_11 + a_12 b_21 , a_11 b_12 + a_12 b_22 ; a_21 b_11 + a_22 b_21 , a_21 b_12 + a_22 b_22 ].

In the saxpy version, each column in the product is regarded as a linear combination of left-matrix columns:

    AB = [ b_11 [a_11; a_21] + b_21 [a_12; a_22]  |  b_12 [a_11; a_21] + b_22 [a_12; a_22] ].

Finally, in the outer product version, the result is regarded as the sum of outer products:

    AB = [a_11; a_21][b_11  b_12] + [a_12; a_22][b_21  b_22].

Although equivalent mathematically, it turns out that these versions of matrix multiplication can have very different levels of performance because of their memory traffic properties. This matter is pursued in §1.5. For now, it is worth detailing the various approaches to matrix multiplication because it gives us a chance to review notation and to practice thinking at different linear algebraic levels. To fix the discussion, we focus on the matrix-matrix update computation:

    C = C + AB,    C ∈ IR^{m×n}, A ∈ IR^{m×r}, B ∈ IR^{r×n}.

The update C = C + AB is considered instead of just C = AB because it is the more typical situation in practice.

1.1.11 Scalar-Level Specifications

The starting point is the familiar triply nested loop algorithm:

Algorithm 1.1.5 (ijk Matrix Multiplication) If A ∈ IR^{m×r}, B ∈ IR^{r×n}, and C ∈ IR^{m×n} are given, then this algorithm overwrites C with C + AB.

    for i = 1:m
        for j = 1:n
            for k = 1:r
                C(i, j) = C(i, j) + A(i, k)•B(k, j)
            end
        end
    end

This computation involves O(mnr) arithmetic. If the dimensions are doubled, then work increases by a factor of 8. Each loop index in Algorithm 1.1.5 has a particular role. (The subscript i names the row, j names the column, and k handles the dot product.) Nevertheless, the ordering of the loops is arbitrary.
Here is the (mathematically equivalent) jki variant:

    for j = 1:n
        for k = 1:r
            for i = 1:m
                C(i, j) = C(i, j) + A(i, k)B(k, j)
            end
        end
    end

Altogether, there are six (= 3!) possibilities: ijk, jik, ikj, jki, kij, kji. Each features an inner loop operation (dot product or saxpy) and each has its own pattern of data flow. For example, in the ijk variant, the inner loop oversees a dot product that requires access to a row of A and a column of B. The jki variant involves a saxpy that requires access to a column of C and a column of A. These attributes are summarized in Table 1.1.1 together with an interpretation of what is going on when the middle and inner loops are considered together. Each variant involves the same amount of arithmetic, but accesses the A, B, and C data differently. The ramifications of this are discussed in §1.5.

Table 1.1.1. Matrix multiplication: loop orderings and properties

    Loop Order | Inner Loop | Inner Two Loops        | Inner Loop Data Access
    ijk        | dot        | vector × matrix        | A by row, B by column
    jik        | dot        | matrix × vector        | A by row, B by column
    ikj        | saxpy      | row gaxpy              | B by row, C by row
    jki        | saxpy      | column gaxpy           | A by column, C by column
    kij        | saxpy      | row outer product      | B by row, C by row
    kji        | saxpy      | column outer product   | A by column, C by column

1.1.12 A Dot Product Formulation

The usual matrix multiplication procedure regards AB as an array of dot products to be computed one at a time in left-to-right, top-to-bottom order. This is the idea behind Algorithm 1.1.5 which we rewrite using the colon notation to highlight the mission of the innermost loop:

Algorithm 1.1.6 (Dot Product Matrix Multiplication) If A ∈ IR^{m×r}, B ∈ IR^{r×n}, and C ∈ IR^{m×n} are given, then this algorithm overwrites C with C + AB.

    for i = 1:m
        for j = 1:n
            C(i, j) = C(i, j) + A(i, :)•B(:, j)
        end
    end

In the language of partitioned matrices, if

    A = [a_1^T; ⋮ ; a_m^T]  and  B = [b_1 | ⋯ | b_n],

then Algorithm 1.1.6 has this interpretation:

    for i = 1:m
        for j = 1:n
            c_ij = c_ij + a_i^T b_j
        end
    end

Note that the purpose of the j-loop is to compute the ith row of the update. To emphasize this we could write

    for i = 1:m
        c_i^T = c_i^T + a_i^T B
    end

where

    C = [c_1^T; ⋮ ; c_m^T]

is a row partitioning of C. To say the same thing with the colon notation we write

    for i = 1:m
        C(i, :) = C(i, :) + A(i, :)•B
    end

Either way we see that the inner two loops of the ijk variant define a transposed gaxpy operation.
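Not from the text: the equivalence of the six loop orderings can be checked in a short Python sketch that also counts executions of the 2-flop inner statement, anticipating the flop discussion later in this section.

```python
# Sketch (not from the book): one triply nested matrix multiplication
# update where `order` selects which of i, j, k is the outermost,
# middle, and innermost loop. All six orderings compute the same
# C = C + AB and execute the inner statement exactly mnr times.
from itertools import permutations

def matmul_update(C, A, B, order):
    m, n, r = len(A), len(B[0]), len(B)
    extent = {'i': range(m), 'j': range(n), 'k': range(r)}
    count = 0
    for p in extent[order[0]]:
        for q in extent[order[1]]:
            for s in extent[order[2]]:
                ix = dict(zip(order, (p, q, s)))
                i, j, k = ix['i'], ix['j'], ix['k']
                C[i][j] += A[i][k] * B[k][j]   # the 2-flop statement
                count += 1
    return C, count

A = [[1.0, 2.0], [3.0, 4.0]]
B = [[5.0, 6.0], [7.0, 8.0]]
for o in permutations('ijk'):
    C, count = matmul_update([[0.0, 0.0], [0.0, 0.0]], A, B, ''.join(o))
    assert C == [[19.0, 22.0], [43.0, 50.0]]   # same C + AB every time
    assert count == 2 * 2 * 2                  # mnr inner statements
```

Only the data-access pattern changes from ordering to ordering; the arithmetic, and hence the flop count, does not.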
1.1.13 A Saxpy Formulation

Suppose A and C are column-partitioned as follows:

    A = [a_1 | ⋯ | a_r],    C = [c_1 | ⋯ | c_n].

By comparing jth columns in C = C + AB we see that

    c_j = c_j + Σ_{k=1}^{r} b_kj a_k,    j = 1:n.

These vector sums can be put together with a sequence of saxpy updates.

Algorithm 1.1.7 (Saxpy Matrix Multiplication) If the matrices A ∈ IR^{m×r}, B ∈ IR^{r×n}, and C ∈ IR^{m×n} are given, then this algorithm overwrites C with C + AB.

    for j = 1:n
        for k = 1:r
            C(:, j) = C(:, j) + A(:, k)•B(k, j)
        end
    end

Note that the k-loop oversees a gaxpy operation:

    for j = 1:n
        C(:, j) = C(:, j) + A•B(:, j)
    end

1.1.14 An Outer Product Formulation

Consider the kij variant of Algorithm 1.1.5:

    for k = 1:r
        for j = 1:n
            for i = 1:m
                C(i, j) = C(i, j) + A(i, k)•B(k, j)
            end
        end
    end

The inner two loops oversee the outer product update

    C = C + a_k b_k^T,

where

    A = [a_1 | ⋯ | a_r]  and  B = [b_1^T; ⋮ ; b_r^T]        (1.1.3)

with a_k ∈ IR^m and b_k ∈ IR^n. This renders the following implementation:

Algorithm 1.1.8 (Outer Product Matrix Multiplication) If the matrices A ∈ IR^{m×r}, B ∈ IR^{r×n}, and C ∈ IR^{m×n} are given, then this algorithm overwrites C with C + AB.

    for k = 1:r
        C = C + A(:, k)•B(k, :)
    end

Matrix-matrix multiplication is a sum of outer products.

1.1.15 Flops

One way to quantify the volume of work associated with a computation is to count flops. A flop is a floating point add, subtract, multiply, or divide. The number of flops in a given matrix computation is usually obtained by summing the amount of arithmetic associated with the most deeply nested statements. For matrix-matrix multiplication, e.g., Algorithm 1.1.5, this is the 2-flop statement

    C(i, j) = C(i, j) + A(i, k)•B(k, j).

If A ∈ IR^{m×r}, B ∈ IR^{r×n}, and C ∈ IR^{m×n}, then this statement is executed mnr times. Table 1.1.2 summarizes the number of flops that are required for the common operations detailed above.

Table 1.1.2. Important flop counts

    Operation                  Dimensions                                     Flops
    c = x^T y   (dot product)  x, y ∈ IR^n                                    2n
    y = y + ax  (saxpy)        a ∈ IR, x, y ∈ IR^n                            2n
    y = y + Ax  (gaxpy)        A ∈ IR^{m×n}, x ∈ IR^n, y ∈ IR^m               2mn
    A = A + xy^T (outer product update)  A ∈ IR^{m×n}, x ∈ IR^m, y ∈ IR^n     2mn
    C = C + AB  (matrix multiplication)  A ∈ IR^{m×r}, B ∈ IR^{r×n}           2mnr

1.1.16 Big-Oh Notation/Perspective

In certain settings it is handy to use the “Big-Oh” notation when an order-of-magnitude assessment of work suffices. (We did this in §1.1.5.)
Dot products are O(n), matrix-vector products are O(n²), and matrix-matrix products are O(n³). Thus, to make efficient an algorithm that involves a mix of these operations, the focus should typically be on the highest order operations that are involved as they tend to dominate the overall computation.

1.1.17 The Notion of “Level” and the BLAS

The dot product and saxpy operations are examples of level-1 operations. Level-1 operations involve an amount of data and an amount of arithmetic that are linear in the dimension of the operation. An m-by-n outer product update or a gaxpy operation involves a quadratic amount of data (O(mn)) and a quadratic amount of work (O(mn)). These are level-2 operations. The matrix multiplication update C = C + AB is a level-3 operation. Level-3 operations are quadratic in data and cubic in work.

Important level-1, level-2, and level-3 operations are encapsulated in the “BLAS,” an acronym that stands for Basic Linear Algebra Subprograms. See LAPACK. The design of matrix algorithms that are rich in level-3 BLAS operations is a major preoccupation of the field for reasons that have to do with data reuse (§1.5).

1.1.18 Verifying a Matrix Equation

In striving to understand matrix multiplication via outer products, we essentially established the matrix equation

    AB = Σ_{k=1}^{r} a_k b_k^T,

where the a_k and b_k are defined by the partitionings in (1.1.3). Numerous matrix equations are developed in subsequent chapters. Sometimes they are established algorithmically as above and other times they are proved at the ij-component level, e.g.,

    [ Σ_{k=1}^{r} a_k b_k^T ]_ij = Σ_{k=1}^{r} [a_k b_k^T]_ij = Σ_{k=1}^{r} a_ik b_kj = [AB]_ij.

Scalar-level verifications such as this usually provide little insight. However, they are sometimes the only way to proceed.

1.1.19 Complex Matrices

On occasion we shall be concerned with computations that involve complex matrices. The vector space of m-by-n complex matrices is designated by C^{m×n}. The scaling, addition, and multiplication of complex matrices correspond exactly to the real case.
However, transposition becomes conjugate transposition:

    B = A^H  ⟹  b_ij = ā_ji.

The vector space of complex n-vectors is designated by C^n. The dot product of complex n-vectors x and y is prescribed by

    s = x^H y = Σ_{i=1}^{n} x̄_i y_i.

If A = B + iC ∈ C^{m×n}, then we designate the real and imaginary parts of A by Re(A) = B and Im(A) = C, respectively. The conjugate of A is the matrix Ā = (ā_ij).

Problems

P1.1.1 Suppose A ∈ IR^{n×n} and x ∈ IR^r are given. Give an algorithm for computing the first column of M = (A − x_1 I) ⋯ (A − x_r I).

P1.1.2 In a conventional 2-by-2 matrix multiplication C = AB, there are eight multiplications: a_11 b_11, a_11 b_12, a_21 b_11, a_21 b_12, a_12 b_21, a_12 b_22, a_22 b_21, and a_22 b_22. Make a table that indicates the order in which these multiplications are performed for the ijk, jik, kij, ikj, jki, and kji matrix multiplication algorithms.

P1.1.3 Give an O(n²) algorithm for computing C = (xy^T)^k where x and y are n-vectors.

P1.1.4 Suppose D = ABC where A ∈ IR^{m×n}, B ∈ IR^{n×p}, and C ∈ IR^{p×q}. Compare the flop count of an algorithm that computes D via the formula D = (AB)C versus the flop count for an algorithm that computes D using D = A(BC). Under what conditions is the former procedure more flop-efficient than the latter?

P1.1.5 Suppose we have real n-by-n matrices C, D, E, and F. Show how to compute real n-by-n matrices A and B with just three real n-by-n matrix multiplications so that A + iB = (C + iD)(E + iF). Hint: Compute W = (C + D)(E − F).

P1.1.6 Suppose W ∈ IR^{n×n} is defined by

    w_ij = Σ_{k=1}^{n} Σ_{l=1}^{n} x_ik y_kl z_lj,

where X, Y, Z ∈ IR^{n×n}. If we use this formula for each w_ij then it would require O(n⁴) operations to set up W. On the other hand,

    w_ij = Σ_{k=1}^{n} x_ik ( Σ_{l=1}^{n} y_kl z_lj ) = Σ_{k=1}^{n} x_ik u_kj,

where U = YZ. Thus, W = XU = XYZ and only O(n³) operations are required. Use this methodology to develop an O(n³) procedure for computing the n-by-n matrix A defined by [formula not shown in this excerpt] where E, F, G, H ∈ IR^{n×n}. Hint: Transposes and pointwise products are involved.

Notes and References for §1.1

For an appreciation of the BLAS and their foundational role, see:

C.L. Lawson, R.J. Hanson, D.R. Kincaid, and F.T. Krogh (1979). “Basic Linear Algebra Subprograms for FORTRAN Usage,” ACM Trans. Math. Softw. 5, 308–323.

J.J. Dongarra, J. Du Croz, S. Hammarling, and R.J. Hanson (1988). “An Extended Set of Fortran Basic Linear Algebra Subprograms,” ACM Trans. Math. Softw. 14, 1–17.

J.J. Dongarra, J. Du Croz, I.S. Duff, and S.J. Hammarling (1990). “A Set of Level 3 Basic Linear Algebra Subprograms,” ACM Trans. Math. Softw. 16, 1–17.

B. Kågström, P. Ling, and C. Van Loan (1991). “High-Performance Level-3 BLAS: Sample Routines for Double Precision Real Data,” in High Performance Computing II, M. Durand and F. El Dabaghi (eds.), North-Holland, Amsterdam, 269–281.

L.S. Blackford, J. Demmel, J. Dongarra, I. Duff, S. Hammarling, G. Henry, M. Heroux, L. Kaufman, A. Lumsdaine, A. Petitet, R. Pozo, K. Remington, and R.C. Whaley (2002). “An Updated Set of Basic Linear Algebra Subprograms (BLAS),” ACM Trans. Math. Softw. 28, 135–151.

The order in which the operations in the matrix product A_1 ⋯ A_r are carried out affects the flop count if the matrices vary in dimension. (See P1.1.4.) Optimization in this regard requires dynamic programming, see:

T.H. Cormen, C.E. Leiserson, R.L. Rivest, and C. Stein (2001). Introduction to Algorithms, MIT Press and McGraw-Hill, 331–339.