Excerpts from 《深入理解计算机系统》 (Computer Systems: A Programmer's Perspective)

  • Loop unrolling can improve performance in two ways. First, it reduces the number of operations that do not contribute directly to the program result, such as loop indexing and conditional branching. Second, it exposes ways in which we can further transform the code to reduce the number of operations in the critical paths of the overall computation.
    晨星 2012-10-31 14:57:40
    —— Quoted from page 509
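The first benefit the excerpt names can be sketched in C (the summing functions and their names are illustrative, not from the book):

```c
/* Plain loop: one index update and one conditional branch per element. */
long sum_plain(const long *a, long n) {
    long acc = 0;
    for (long i = 0; i < n; i++)
        acc += a[i];
    return acc;
}

/* Unrolled by a factor of 2: roughly half the indexing and branching
   overhead for the same useful work. */
long sum_unrolled2(const long *a, long n) {
    long acc = 0;
    long i;
    for (i = 0; i + 1 < n; i += 2)
        acc = acc + a[i] + a[i + 1];
    for (; i < n; i++)      /* finish any leftover element */
        acc += a[i];
    return acc;
}
```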
  • We have seen in Chapter 2 that two’s-complement arithmetic is commutative and associative, even when overflow occurs.
    晨星 2012-11-01 10:30:19
    —— Quoted from page 517
  • Many compilers do loop unrolling automatically, but relatively few then introduce this form of parallelism (use different variables or registers for different unrolled operations to break dependency).
    晨星 2012-11-01 10:30:39
    —— Quoted from page 518
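A sketch of the parallelism the excerpt says few compilers introduce on their own: unrolling by 2 with two separate accumulators, so the two additions per iteration fall on independent dependency chains (function and variable names are mine):

```c
/* Two accumulators split the single sequential chain acc -> acc -> ...
   into two independent chains that the hardware can run in parallel. */
long sum_unrolled2x2(const long *a, long n) {
    long acc0 = 0, acc1 = 0;
    long i;
    for (i = 0; i + 1 < n; i += 2) {
        acc0 += a[i];       /* chain 0 */
        acc1 += a[i + 1];   /* chain 1, independent of chain 0 */
    }
    for (; i < n; i++)      /* leftover element */
        acc0 += a[i];
    return acc0 + acc1;     /* combine the chains once at the end */
}
```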
  • On the other hand, floating-point multiplication and addition are not associative. … In most real-life applications, however, such patterns are unlikely. Since most physical phenomena are continuous, numerical data tend to be reasonably smooth and well-behaved.
    晨星 2012-11-01 10:31:06
    —— Quoted from page 518
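A small demonstration of the non-associativity, assuming IEEE 754 doubles (the example values are mine): 3.14 is far below the rounding granularity of 1e20, so it vanishes entirely if it is added to -1e20 first.

```c
/* Same three operands, different grouping, different results. */
double grouped_left(void)  { return (1e20 + -1e20) + 3.14; } /* 3.14 */
double grouped_right(void) { return 1e20 + (-1e20 + 3.14); } /* 0.0  */
```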
  • In summary, a reassociation transformation can reduce the number of operations along the critical path in a computation, resulting in better performance by better utilizing the pipelining capabilities of the functional units. Most compilers will not attempt any reassociations of floating-point operations, since these operations are not guaranteed to be associative. Current versions of GCC do perform reassociations of integer operations, but not always with good effects. In general, we have found that unrolling a loop and accumulating multiple values in parallel is a more reliable way to achieve improved program performance.
    晨星 2012-11-02 09:45:01
    —— Quoted from page 522
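The reassociation transformation can be sketched on an integer product loop (the example is mine, not the book's): writing `acc * (a[i] * a[i+1])` instead of `(acc * a[i]) * a[i+1]` moves the element-pair multiply off the accumulator's critical path.

```c
/* Unrolled by 2 with reassociated multiplies: a[i] * a[i+1] does not
   depend on acc, so only one multiply per iteration sits on the
   acc-to-acc critical path. */
long prod_reassoc(const long *a, long n) {
    long acc = 1;
    long i;
    for (i = 0; i + 1 < n; i += 2)
        acc = acc * (a[i] * a[i + 1]);
    for (; i < n; i++)      /* leftover element */
        acc *= a[i];
    return acc;
}
```

Note this is safe here only because two’s-complement multiplication is associative, as the earlier excerpt from page 517 points out; the same rewrite on floating point can change the result.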
  • GCC supports extensions to the C language that let programmers express a program in terms of vector operations that can be compiled into the SIMD instructions of SSE.
    晨星 2012-11-02 09:45:18
    —— Quoted from page 524
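A minimal sketch of this extension using GCC's `vector_size` attribute (the type and function names are mine; this is a compiler extension, not standard C):

```c
/* v4si: four 32-bit ints packed in one 16-byte vector. Ordinary operators
   act element-wise and can compile down to SSE instructions. */
typedef int v4si __attribute__((vector_size(16)));

int hsum_v4si(v4si v) {
    /* GCC also allows subscripting individual vector elements. */
    return v[0] + v[1] + v[2] + v[3];
}
```

With this, `v4si c = a + b;` adds all four lanes in one vector operation.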
  • We have seen that the critical path in a data-flow graph representation of a program indicates a fundamental lower bound on the time required to execute a program. … We have also seen that the throughput bounds of the functional units also impose a lower bound on the execution time for a program.
    晨星 2012-11-02 09:45:33
    —— Quoted from page 525
  • Register Spilling: If we have a degree of parallelism p that exceeds the number of available registers, then the compiler will resort to spilling, storing some of the temporary values on the stack.
    晨星 2012-11-02 09:45:51
    —— Quoted from page 526
  • The basic idea for translating into conditional moves is to compute the values along both branches of a conditional expression or statement, and then use conditional moves to select the desired value. … There is no need to guess whether or not the condition will hold, and hence no penalty for guessing incorrectly.
    晨星 2012-11-05 11:01:14
    —— Quoted from page 527
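The compute-both-then-select pattern can be sketched as follows (a hypothetical absolute-value routine, not from the book):

```c
/* Both branch values are computed unconditionally; the final selection has
   no side effects, so the compiler can emit a conditional move instead of
   a branch -- nothing to predict, no misprediction penalty. */
long abs_selected(long x) {
    long pos = x;     /* value if x >= 0 */
    long neg = -x;    /* value if x < 0  */
    return x < 0 ? neg : pos;
}
```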
  • Intel Core i7 has a 44 clock-cycle misprediction penalty.
    晨星 2012-11-05 11:01:37
    —— Quoted from page 527
  • That the effect of a mispredicted branch can be very high does not mean that all program branches will slow a program down. In fact, the branch prediction logic found in modern processors is very good at discerning regular patterns and long-term trends for the different branch instructions.
    晨星 2012-11-05 11:02:37
    —— Quoted from page 527
  • We have found that gcc is able to generate conditional moves for code written in a more “functional” style, where we use conditional operations to compute values and then update the program state with these values, as opposed to a more “imperative” style, where we use conditionals to selectively update program state.
    晨星 2012-11-05 11:02:52
    —— Quoted from page 529
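The contrast can be sketched with a pair of routines that put the smaller of each element pair into one array and the larger into the other (modeled on CSAPP's minmax example; the exact names here are mine):

```c
/* "Imperative" style: conditionally mutate state. GCC tends to compile
   the if into a branch, which may predict poorly on random data. */
void minmax_imperative(long a[], long b[], long n) {
    for (long i = 0; i < n; i++) {
        if (a[i] > b[i]) {
            long t = a[i];
            a[i] = b[i];
            b[i] = t;
        }
    }
}

/* "Functional" style: compute both values, then update unconditionally.
   GCC can implement the two selections with conditional moves. */
void minmax_functional(long a[], long b[], long n) {
    for (long i = 0; i < n; i++) {
        long lo = a[i] < b[i] ? a[i] : b[i];
        long hi = a[i] < b[i] ? b[i] : a[i];
        a[i] = lo;
        b[i] = hi;
    }
}
```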
  • Not all conditional behavior can be implemented with conditional data transfers, and so there are inevitably cases where programmers cannot avoid writing code that will lead to conditional branches for which the processor will do poorly with its branch prediction.
    晨星 2012-11-05 11:03:05
    —— Quoted from page 530
  • Modern processors have dedicated functional units to perform load and store operations, and these units have internal buffers to hold sets of outstanding requests for memory operations.
    晨星 2012-11-05 11:03:22
    —— Quoted from page 531
  • A final word of advice to the reader is to be vigilant to avoid introducing errors as you rewrite programs in the interest of efficiency.
    晨星 2012-11-05 11:03:34
    —— Quoted from page 539
  • The profiler helps us focus our attention on the most time-consuming parts of the program and also provides useful information about the procedure call structure.
    晨星 2012-11-07 09:57:40
    —— Quoted from page 545
  • Profiling is a useful tool to have in the toolbox, but it should not be the only one. The timing measurements are imperfect. More significantly, the results apply only to the particular data tested.
    晨星 2012-11-07 09:58:05
    —— Quoted from page 545
  • In general, profiling can help us optimize for typical cases, assuming we run the program on representative data, but we should also make sure the program will have respectable performance for all possible cases. This mainly involves avoiding algorithms and bad programming practices that yield poor asymptotic performance.
    晨星 2012-11-07 09:58:18
    —— Quoted from page 545
  • The main idea of Amdahl’s Law is that when we speed up one part of a system, the effect on the overall system performance depends on both how significant this part was and how much it sped up.
    晨星 2012-11-07 09:58:30
    —— Quoted from page 545
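The relationship the excerpt summarizes is usually written as S = 1 / ((1 - α) + α/k), where α is the fraction of the original run time affected and k is the factor by which that part speeds up; a small sketch:

```c
/* Amdahl's Law: overall speedup when a fraction alpha of the original
   execution time is accelerated by a factor k. */
double amdahl_speedup(double alpha, double k) {
    return 1.0 / ((1.0 - alpha) + alpha / k);
}
```

For example, speeding up 60% of a system by 3x gives only 1 / (0.4 + 0.2) ≈ 1.67x overall, which is the "how significant this part was" half of the observation.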
  • The major insight of Amdahl’s law—to significantly speed up the entire system, we must improve the speed of a very large fraction of the overall system.
    晨星 2012-11-07 09:58:43
    —— Quoted from page 546