Excerpts from Programming Collective Intelligence (《集体智慧编程》)

  • Next, get a list of random people to make up the dataset. Fortunately, Hot or Not provides an API call that returns a list of people with specified criteria. In this example, the only criteria will be that the people have “meet me” profiles, since only from these profiles can you get other information like location and interests. Add this function to hotornot.py:
    candy 2011-04-06 14:23:24
    —— Quoted from page 162
  • What Does This Have to Do with the Articles Matrix? So far, what you have is a matrix of articles with word counts. The goal is to factorize this matrix, which means finding two smaller matrices that can be multiplied together to reconstruct this one. The two smaller matrices are:
    The features matrix: This matrix has a row for each feature and a column for each word. The values indicate how important a word is to a feature. Each feature should represent a theme that emerged from a set of articles, so you might expect an article about a new TV show to have a high weight for the word “television.”
    The weights matrix: This matrix maps the features to the articles matrix. Each row is an article and each column is a feature. The values state how much each feature applies to each articl...
    candy 2011-04-07 16:25:32
    —— Quoted from page 234
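The factorization described in that excerpt can be sketched in pure Python. This is a minimal illustration of multiplicative-update non-negative matrix factorization, not the book's actual code; the function names, iteration count, and epsilon are illustrative choices.

```python
import random

def matmul(a, b):
    # Multiply two matrices stored as lists of rows.
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(m):
    return [list(row) for row in zip(*m)]

def factorize(v, pc=2, iters=500):
    """Factor v (articles x words) into a weights matrix (articles x pc)
    and a features matrix (pc x words), so that weights * features
    approximately reconstructs v. Uses multiplicative update rules."""
    rows, cols = len(v), len(v[0])
    random.seed(1)
    w = [[random.random() for _ in range(pc)] for _ in range(rows)]
    h = [[random.random() for _ in range(cols)] for _ in range(pc)]
    for _ in range(iters):
        # Update features: h <- h * (w^T v) / (w^T w h)
        hn = matmul(transpose(w), v)
        hd = matmul(matmul(transpose(w), w), h)
        h = [[h[i][j] * hn[i][j] / (hd[i][j] + 1e-9)
              for j in range(cols)] for i in range(pc)]
        # Update weights: w <- w * (v h^T) / (w h h^T)
        wn = matmul(v, transpose(h))
        wd = matmul(matmul(w, h), transpose(h))
        w = [[w[i][j] * wn[i][j] / (wd[i][j] + 1e-9)
              for j in range(pc)] for i in range(rows)]
    return w, h
```

Because both factors stay non-negative, each row of the features matrix can be read as an additive "theme" over words, which is what makes the result interpretable.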
  • Another feature that applies more evenly to a couple of companies is this one:
    Feature 2
    (46151801.813632453, 'GOOG')
    (24298994.720555616, 'YHOO')
    (10606419.91092159, 'PG')
    (7711296.6887903402, 'CVX')
    (4711899.0067871698, 'BIIB')
    (4423180.7694432881, 'XOM')
    (3430492.5096612777, 'DNA')
    (2882726.8877627672, 'EXPE')
    (2232928.7181202639, 'CL')
    (2043732.4392455407, 'AVP')
    (1934010.2697886101, 'BP')
    (1801256.8664912341, 'AMGN')
    [(2.9757765047938824, '20-Jan-06'),
     (2.8627791325829448, '28-Feb-06'),
     (2.356157903021133, '31-Mar-06'),
    This feature represents large spikes in Google’s trading volume, which in the top three cases were due to news events. The strongest day, January 20th, was the day that Google announced it would not give information about its search engine usage to the government. ...
    candy 2011-04-11 11:12:26
    —— Quoted from page 271
  • Because new connections are only created when necessary, this method has to return a default value if there are no connections. For links from words to the hidden layer, the default value will be –0.2 so that, by default, extra words will have a slightly negative effect on the activation level of a hidden node. For links from the hidden layer to URLs, the method will return a default value of 0.
    candy 2011-04-13 15:50:32
    —— Quoted from page 80
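The default-strength rule in that excerpt can be sketched as a small class. This is an illustrative stand-in, with a plain dict replacing the book's SQLite tables; the class and method names are assumptions.

```python
class SearchNet:
    """Sketch of per-layer default connection strengths."""

    def __init__(self):
        # (fromid, toid, layer) -> strength; connections are created lazily
        self.strengths = {}

    def get_strength(self, fromid, toid, layer):
        # Layer 0 (word -> hidden): default -0.2, so unseen extra words
        # slightly suppress a hidden node's activation.
        # Layer 1 (hidden -> URL): default 0, a neutral starting point.
        default = -0.2 if layer == 0 else 0
        return self.strengths.get((fromid, toid, layer), default)

    def set_strength(self, fromid, toid, layer, strength):
        self.strengths[(fromid, toid, layer)] = strength
```

The lazy-creation scheme means the network never stores a weight until training actually touches that connection, which keeps storage proportional to the queries seen.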
  • Pearson Correlation Score A slightly more sophisticated way to determine the similarity between people’s interests is to use a Pearson correlation coefficient. The correlation coefficient is a measure of how well two sets of data fit on a straight line. The formula for this is more complicated than the Euclidean distance score, but it tends to give better results in situations where the data isn’t well normalized—for example, if critics’ movie rankings are routinely more harsh than average.
    candy 2011-04-18 14:02:27
    —— Quoted from page 11
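A sketch of the Pearson score over a nested dict of ratings, close in spirit to the book's sim_pearson; the prefs layout (person -> item -> rating) follows the book's critics dictionary, but treat the details here as illustrative.

```python
def sim_pearson(prefs, p1, p2):
    # Items rated by both people
    shared = [item for item in prefs[p1] if item in prefs[p2]]
    n = len(shared)
    if n == 0:
        return 0  # no overlap, no basis for comparison

    # Sums and sums of squares for each person over the shared items
    sum1 = sum(prefs[p1][it] for it in shared)
    sum2 = sum(prefs[p2][it] for it in shared)
    sum1sq = sum(prefs[p1][it] ** 2 for it in shared)
    sum2sq = sum(prefs[p2][it] ** 2 for it in shared)
    psum = sum(prefs[p1][it] * prefs[p2][it] for it in shared)

    # Pearson score: covariance over the product of standard deviations
    num = psum - (sum1 * sum2 / n)
    den = ((sum1sq - sum1 ** 2 / n) * (sum2sq - sum2 ** 2 / n)) ** 0.5
    if den == 0:
        return 0
    return num / den
```

A critic who is uniformly harsher still scores 1.0 against a critic with the same relative rankings, which is exactly the normalization advantage the excerpt describes.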
  • Simulated annealing is an optimization method inspired by physics. Annealing is the process of heating up an alloy and then cooling it down slowly. Because the atoms are first made to jump around a lot and then gradually settle into a low energy state, the atoms can find a low energy configuration.
    candy 2011-04-27 15:47:28
    —— Quoted from page 95
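The heat-then-cool process maps onto code roughly as follows. This is a sketch in the spirit of the book's annealing optimizer, with `domain` as per-variable (low, high) bounds and the starting temperature and cooling rate chosen for illustration.

```python
import math
import random

def annealing_optimize(domain, costf, T=10000.0, cool=0.95, step=1):
    """Minimize costf over integer vectors bounded by domain."""
    # Start from a random solution
    vec = [random.randint(lo, hi) for lo, hi in domain]
    while T > 0.1:
        # Nudge one randomly chosen variable, clamped to its bounds
        i = random.randint(0, len(domain) - 1)
        newvec = vec[:]
        newvec[i] = max(domain[i][0],
                        min(domain[i][1], newvec[i] + random.randint(-step, step)))
        ea, eb = costf(vec), costf(newvec)
        # Always accept improvements; accept worse moves with probability
        # exp(-(eb - ea) / T), which shrinks as the temperature cools.
        # Early on (high T) the search jumps around; later it settles.
        if eb < ea or random.random() < math.exp(-(eb - ea) / T):
            vec = newvec
        T *= cool
    return vec
```

The willingness to accept worse solutions early is what lets the search escape shallow local minima before the "atoms settle".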
  • The flight scheduling example works because moving a person from the second to the third flight of the day would probably change the overall cost by a smaller amount than moving that person to the eighth flight of the day would. If the flights were in random order, the optimization methods would work no better than a random search—in fact, there’s no optimization method that will consistently work better than a random search in that case.
    candy 2011-04-27 17:22:34
    —— Quoted from page 100
  • Squaring the numbers is common practice because it makes large differences count for even more. This means an algorithm that is very close most of the time but far off occasionally will fare worse than an algorithm that is always somewhat close. This is often desired behavior, but there are situations in which making a big mistake is occasionally acceptable if accuracy is very high the rest of the time. When this is the case, you can modify the function to just add up the absolute values of the differences.
    candy 2011-05-03 11:46:51
    —— Quoted from page 177
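The two cost functions being contrasted can be written side by side; a minimal sketch, with the function names invented here for illustration.

```python
def sum_squared_error(actual, predicted):
    # Squaring makes large misses dominate the total
    return sum((a - p) ** 2 for a, p in zip(actual, predicted))

def sum_absolute_error(actual, predicted):
    # Absolute differences tolerate the occasional big miss
    return sum(abs(a - p) for a, p in zip(actual, predicted))
```

With actual values of [0, 0, 0, 0], a predictor that is always off by 2 scores 16 on squared error but only 8 on absolute error, while one that is exact except for a single miss of 6 scores 36 and 6: the two metrics rank the same pair of algorithms in opposite orders, which is the trade-off the excerpt describes.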
  • When you were only making comparisons based on people’s ages, it was fine to keep the data as it was and to use averages and distances, since it makes sense to compare variables that mean the same thing. However, now you’ve introduced some new variables that aren’t really comparable to age, since their values are much smaller. Having differing opinions about children—a gap of 2, between 1 and –1—may be much more significant in reality than having an age gap of six years, but if you used the data as is, the age difference would count for three times as much. To resolve this issue, it’s a good idea to put all the data on a common scale so that differences are comparable in every variable. You can do this by determining the lowest and highest values for every variable, and scalin...
    candy 2011-05-03 14:25:20
    —— Quoted from page 209
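The min-max scaling the excerpt starts to describe can be sketched as a small helper; this is a simplified, hypothetical `rescale`, not the book's exact function (which also returns the scaling factors for later use).

```python
def rescale(data):
    """Scale every column of data (a list of equal-length rows) to the
    range 0..1 using that column's min and max, so that no single
    variable dominates distance calculations."""
    cols = len(data[0])
    lows = [min(row[i] for row in data) for i in range(cols)]
    highs = [max(row[i] for row in data) for i in range(cols)]
    return [[(row[i] - lows[i]) / (highs[i] - lows[i])
             if highs[i] > lows[i] else 0.0
             for i in range(cols)] for row in data]
```

After rescaling, an age gap and an opinion gap that each span half of their variable's observed range contribute equally to a distance, which is the point of the excerpt.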
  • The biggest downside to naïve Bayesian classifiers is their inability to deal with outcomes that change based on combinations of features. Imagine the following scenario in which you are trying to distinguish spam from nonspam email: let’s say your job is building web applications, so the word “online” frequently appears in your work-related email. Your best friend works at a pharmacy and likes sending you funny stories about things that happen to him at work. Also, like most people who haven’t closely guarded their email addresses, you occasionally receive spam containing the words “online pharmacy.” You can probably see the dilemma here already—the classifier is constantly being told that “online” and “pharmacy” exist in nonspam email messages, so their pro...
    candy 2011-05-10 15:26:11
    —— Quoted from page 296
  • As the temperature decreases, the difference between the high cost and the low cost becomes more important.
    36° 2011-10-01 12:20:20
    —— Quoted from page 95
  • the acceptance probability function P(e, e', T) was defined as 1 if e' < e, and exp((e − e') / T) otherwise.
    36° 2011-10-01 12:20:20
    —— Quoted from page 95
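That definition translates directly into code; a one-to-one sketch of the quoted P(e, e', T), with the function name chosen here for readability.

```python
import math

def acceptance_probability(e, e_new, T):
    """P(e, e', T): probability of accepting a move from cost e to e_new
    at temperature T."""
    # A lower-cost solution is always accepted
    if e_new < e:
        return 1.0
    # A worse solution is accepted with probability exp((e - e') / T):
    # near 1 when T is large, near 0 as T cools toward zero
    return math.exp((e - e_new) / T)
```

This is also why, as the preceding excerpt notes, the cost gap matters more at low temperature: the same gap divided by a smaller T gives a much smaller acceptance probability.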
  • random.randint(a, b) Return a random integer N such that a <= N <= b.
    36° 2011-10-01 15:05:59
    —— Quoted from page 98
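A quick check of those inclusive endpoints, worth noting because `random.randrange` and `range` exclude the stop value while `randint` does not:

```python
import random

random.seed(42)
# Simulate die rolls: randint(1, 6) can return 6, unlike randrange(1, 6)
rolls = {random.randint(1, 6) for _ in range(1000)}
```

Over enough draws every face from 1 through 6 appears, confirming both endpoints are reachable.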
  • Random search is not a very good optimization algorithm, but it makes it easy to understand what all the algorithms are really trying to do, and it gives us a baseline against which to judge the other algorithms.
    红色有角F叔 2017-08-03 22:23:04
    —— Quoted from chapter: Optimization
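The baseline can be sketched in a few lines; an illustrative version in the spirit of the book's random optimizer, with `domain` as per-variable (low, high) bounds.

```python
import random

def random_optimize(domain, costf, guesses=1000):
    """Baseline: try many random solutions and keep the cheapest one."""
    best_cost, best_sol = None, None
    for _ in range(guesses):
        sol = [random.randint(lo, hi) for lo, hi in domain]
        cost = costf(sol)
        if best_cost is None or cost < best_cost:
            best_cost, best_sol = cost, sol
    return best_sol
```

Because each guess is independent, the method learns nothing from good solutions it has already found, which is exactly the inefficiency the next excerpt points out.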
  • Hill climbing. Randomly trying solutions is very inefficient because that approach makes no use of the good solutions that have already been discovered.
    红色有角F叔 2017-08-03 22:23:04
    —— Quoted from chapter: Optimization
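Hill climbing fixes that inefficiency by always moving from the current solution to its best neighbor; a minimal sketch over integer vectors, with names and the neighborhood definition (each variable nudged by one) chosen for illustration.

```python
import random

def hillclimb(domain, costf):
    """Start from a random solution; repeatedly move to the cheapest
    neighbor until no neighbor improves (a local minimum)."""
    sol = [random.randint(lo, hi) for lo, hi in domain]
    while True:
        # Build all one-step neighbors within the bounds
        neighbors = []
        for i in range(len(domain)):
            if sol[i] > domain[i][0]:
                neighbors.append(sol[:i] + [sol[i] - 1] + sol[i + 1:])
            if sol[i] < domain[i][1]:
                neighbors.append(sol[:i] + [sol[i] + 1] + sol[i + 1:])
        best = min(neighbors, key=costf)
        if costf(best) >= costf(sol):
            return sol  # no neighbor is better: local minimum reached
        sol = best
```

On a landscape with many local minima this gets stuck where random search would keep exploring, which is the gap that simulated annealing's temperature schedule is designed to bridge.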
  • Whether an optimization method works depends largely on the problem itself. Simulated annealing, genetic algorithms, and most other optimization methods rely on the fact that, for most problems, the best solution lies close to other good solutions.
    红色有角F叔 2017-08-03 22:23:04
    —— Quoted from chapter: Optimization