Organizations large and small are adopting Apache Hadoop to deal with huge application datasets. Hadoop: The Definitive Guide provides you with the key for unlocking the wealth this data holds. Hadoop is ideal for storing and processing massive amounts of data, but until now, information on this open-source project has been lacking -- especially with regard to best practices. This comprehensive resource demonstrates how to use Hadoop to build reliable, scalable, distributed systems. Programmers will find details for analyzing large datasets with Hadoop, and administrators will learn how to set up and run Hadoop clusters.
With case studies that illustrate how Hadoop solves specific problems, this book helps you:
* Learn the Hadoop Distributed File System (HDFS), including ways to use its many APIs to transfer data
* Write distributed computations with MapReduce, Hadoop's most vital component
* Become familiar with Hadoop's data and I/O building blocks for compression, data integrity, serialization, and persistence
* Learn the common pitfalls and advanced features for writing real-world MapReduce programs
* Design, build, and administer a dedicated Hadoop cluster
* Use HBase, Hadoop's database for structured and semi-structured data
And more. Hadoop: The Definitive Guide is still in progress, but you can get started on this technology with the Rough Cuts edition, which lets you read the book online or download it in PDF format as the manuscript evolves.
Chinese edition, p. 412: "So in theory, anything can be represented in binary form and then converted into a long-integer string, or a data structure can be serialized directly, to serve as the key value." Original edition, p. 460: "..., so theoretically anything can serve as row key, from strings to binary representations of long or even serialized ..."
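A minimal sketch of the point in that passage, assuming the older HTable/Put client API from the book's era and a hypothetical table name: an HBase row key is just a byte array, so a string, a long, or a serialized structure can all serve as the key.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class RowKeySketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "observations");          // hypothetical table name

        byte[] stringKey = Bytes.toBytes("station-011990-99999"); // a string row key
        byte[] longKey = Bytes.toBytes(1329345600000L);           // an 8-byte long row key

        Put put = new Put(longKey);                                // any byte[] works as the row key
        put.add(Bytes.toBytes("info"), Bytes.toBytes("station"), stringKey);
        table.put(put);
        table.close();
    }
}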
HDFS is a filesystem designed for storing very large files with streaming data access patterns (write-once, read-many-times), running on clusters of commodity hardware.
HDFS blocks (64 MB by default) are large compared to disk blocks, and the reason is to minimize the cost of seeks. Map tasks in MapReduce normally operate on one block at a time, so if you have too few tasks (fewer than nodes in the cluster), your jobs will run slower than they could otherwise.
An HDFS cluster has two types of node operating in a master-worker pattern: a namenode (the master) and a number of datanodes (workers). The namenode manages the filesystem namespace. It maintains the filesystem tree and the metadata for all the files and directories in the tree. Datanodes are the workhorses of the filesystem. They store and retrieve blocks when they are told to (by clients or the namenode), and they report back to the namenode periodically with lists of blocks that they are storing.
After a successful return from sync(), HDFS guarantees that the data written up to that point in the file is persisted and visible to all new readers.
Hadoop Archives, or HAR files, are a file archiving facility that packs files into HDFS blocks more efficiently, thereby reducing namenode memory usage while still allowing transparent access to files. (quoted from: HDFS)
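A minimal sketch of the sync() guarantee in code (not from the book; the URI and path below are made up): the stream returned by FileSystem.create() exposes sync() in the API generation the book covers, renamed hflush() in later releases.
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HflushExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode/"), conf);
        FSDataOutputStream out = fs.create(new Path("/tmp/visibility-demo.txt"));
        out.writeUTF("first record");
        out.hflush();   // data written so far is now visible to new readers (sync() in older APIs)
        out.writeUTF("second record");
        out.close();    // close() flushes the remaining data and finalizes the file
    }
}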
Data Flow - Read
[Figure: read data flow of HDFS]
One important aspect of this design is that the client contacts datanodes directly to retrieve data and is guided by the namenode to the best datanode for each block. This design allows HDFS to scale to a large number of concurrent clients, since the data traffic is spread across all the datanodes in the cluster.
Hadoop takes a simple approach in which the network is represented as a tree and the distance between two nodes is the sum of their distances to their closest common ancestor. (quoted from: HDFS)
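That rule is easy to see with a tiny, purely hypothetical helper (not Hadoop's own NetworkTopology class), assuming node locations are written as paths like /datacenter/rack/node:
public class NetworkDistance {
    // locations are paths such as "/d1/r1/n1" (data center / rack / node)
    static int distance(String a, String b) {
        String[] pa = a.substring(1).split("/");
        String[] pb = b.substring(1).split("/");
        int common = 0;
        while (common < pa.length && common < pb.length && pa[common].equals(pb[common])) {
            common++;
        }
        // steps from each node up to the closest common ancestor, summed
        return (pa.length - common) + (pb.length - common);
    }

    public static void main(String[] args) {
        System.out.println(distance("/d1/r1/n1", "/d1/r1/n1")); // 0: same node
        System.out.println(distance("/d1/r1/n1", "/d1/r1/n2")); // 2: same rack
        System.out.println(distance("/d1/r1/n1", "/d1/r2/n3")); // 4: same data center, different rack
        System.out.println(distance("/d1/r1/n1", "/d2/r3/n4")); // 6: different data centers
    }
}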
Data Flow - Write
[Figure: write data flow of HDFS]
As the client writes data (step 3), DFSOutputStream splits it into packets, which it writes to an internal queue, called the data queue. The data queue is consumed by the DataStreamer, whose responsibility it is to ask the namenode to allocate new blocks by picking a list of suitable datanodes to store the replicas. The list of datanodes forms a pipeline—we’ll assume the replication level is three, so there are three nodes in the pipeline. The DataStreamer streams the packets to the first datanode in the pipeline, which stores the packet and forwards it to the second datanode in the pipeline.
DFSOutputStream also maintains an internal queue of packets that are waiting to be acknowledged by datanodes, called the ack queue. A packet is removed from the ack queue only when it has been acknowledged by all the datanodes in the pipeline (step 5). If a datanode fails while data is being written to it, then the following actions are taken, which are transparent to the client writing the data. First the pipeline is closed, and any packets in the ack queue are added to the front of the data queue so that datanodes that are downstream from the failed node will not miss any packets. The current block on the good datanodes is given a new identity, which is communicated to the namenode, so that the partial block on the failed datanode will be deleted if the failed datanode recovers later on. The failed datanode is removed from the pipeline and the remainder of the block’s data is written to the two good datanodes in the pipeline. The namenode notices that the block is under-replicated, and it arranges for a further replica to be created on another node. Subsequent blocks are then treated as normal.
Hadoop’s default strategy is to place the first replica on the same node as the client (for clients running outside the cluster, a node is chosen at random, although the system tries not to pick nodes that are too full or too busy). The second replica is placed on a different rack from the first (off-rack), chosen at random. The third replica is placed on the same rack as the second, but on a different node chosen at random. Further replicas are placed on random nodes on the cluster, although the system tries to avoid placing too many replicas on the same rack. (quoted from: HDFS)
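The strategy reads fairly directly as code. The sketch below is only an illustration of the rules just quoted — it is not Hadoop's BlockPlacementPolicy, the Node type is invented, it assumes a cluster with at least two racks and at least two nodes per rack, and it ignores the "too full / too busy" checks.
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class ReplicaPlacementSketch {
    static class Node {
        String name; String rack;
        Node(String name, String rack) { this.name = name; this.rack = rack; }
    }

    static List<Node> chooseTargets(Node client, List<Node> cluster, Random rnd) {
        List<Node> targets = new ArrayList<Node>();
        // 1st replica: the client's own node, or a random node for external clients
        Node first = (client != null) ? client : cluster.get(rnd.nextInt(cluster.size()));
        targets.add(first);
        // 2nd replica: a random node on a different rack (off-rack)
        Node second;
        do { second = cluster.get(rnd.nextInt(cluster.size())); }
        while (second.rack.equals(first.rack));
        targets.add(second);
        // 3rd replica: a different node on the same rack as the second
        Node third;
        do { third = cluster.get(rnd.nextInt(cluster.size())); }
        while (!third.rack.equals(second.rack) || third.name.equals(second.name));
        targets.add(third);
        return targets;
    }
}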
Additions in the third edition:
HDFS Federation, introduced in the 0.23 release series, allows a cluster to scale by adding namenodes, each of which manages a portion of the filesystem namespace. For example, one namenode might manage all the files rooted under /user , say, and a second namenode might handle files under /share .
The 0.23 release series of Hadoop remedies this situation (the namenode as a single point of failure) by adding support for HDFS high availability (HA). In this implementation there is a pair of namenodes in an active-standby configuration. (quoted from: HDFS)
Why can’t we use databases with lots of disks to do large-scale batch analysis? Why is MapReduce needed?
The answer to these questions comes from another trend in disk drives: seek time is improving more slowly than transfer rate. Seeking is the process of moving the disk’s head to a particular place on the disk to read or write data. It characterizes the latency of a disk operation, whereas the transfer rate corresponds to a disk’s bandwidth.
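A back-of-the-envelope comparison makes the point concrete. The numbers below (10 ms per seek, 100 MB/s transfer, a 1 TB dataset of 100-byte records, updating 1% of them) are illustrative assumptions, not figures from the book.
public class SeekVsTransfer {
    public static void main(String[] args) {
        double seekMs = 10.0;                                    // assumed time per random seek
        double transferMBps = 100.0;                             // assumed sequential transfer rate
        double datasetMB = 1000000.0;                            // assumed 1 TB dataset
        long records = (long) (datasetMB * 1000000 / 100);       // assumed 100-byte records
        long updated = records / 100;                            // update 1% of the records

        double randomHours = updated * seekMs / 1000 / 3600;     // seek once per updated record
        double streamHours = datasetMB / transferMBps / 3600;    // read/rewrite the whole dataset

        System.out.printf("Seeking to 1%% of the records one at a time: %.1f hours%n", randomHours);
        System.out.printf("Streaming through the entire dataset: %.1f hours%n", streamHours);
    }
}
Under these assumptions the seek-driven update takes a couple of hundred hours while a full sequential pass takes a few hours, which is why batch analysis favors streaming the whole dataset over random access.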
MapReduce is a programming model for data processing. MapReduce works by breaking the processing into two phases: the map phase and the reduce phase. Each phase has key-value pairs as input and output, the types of which may be chosen by the programmer. The programmer also specifies two functions: the map function and the reduce function. (quoted from: Map Reduce)
The map phase processes each line of input and produces <key, value> pairs; the shuffle phase then groups the data by key into <key, Iterable<value>>; finally, the reduce phase processes each key together with its group of values.
MapReduce programs run on two kinds of nodes: a jobtracker and a number of tasktrackers. The jobtracker coordinates and schedules jobs, divides them into tasks that run on different machines, and keeps a record of the overall progress of each job. The tasktrackers run the tasks.
Hadoop divides the input to a MapReduce job into fixed-size pieces called input splits, or just splits. Hadoop creates one map task for each split, which runs the user-defined map function for each record in the split. Hadoop does its best to run the map task on a node where the input data resides in HDFS. This is called the data locality optimization.
When there are multiple reducers, the map tasks partition their output, each creating one partition for each reduce task. There can be many keys (and their associated values) in each partition, but the records for every key are all in a single partition. The partitioning can be controlled by a user-defined partitioning function, but normally the default partitioner—which buckets keys using a hash function—works very well. (quoted from: Map Reduce)
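For reference, the default hash partitioning described in that passage amounts to something like the sketch below; the class name is made up, and the body simply mirrors the behavior of the stock HashPartitioner.
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class YearHashPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        // mask off the sign bit, then bucket by hash so that every record
        // with the same key lands in the same reduce partition
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}
A custom partitioner like this would be registered on the job with job.setPartitionerClass(...).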
The shuffle in MapReduce
The shuffle proceeds as follows: it covers both sorting the output on the mapper side and merging the data on the reducer side. When a mapper writes output, it actually writes into a circular memory buffer, 100 MB by default; once the buffer is more than 80% full, the mapper starts spilling the data to disk (in a round-robin fashion). Before writing, if there are multiple reducers, the data is divided into partitions, one per reducer. Before the mapper finishes, the data spilled to disk is merged and sorted within each partition. If a combiner is configured, it runs once a certain number of spill files have accumulated. After all the data has been spilled, merged, and sorted, the partitions become available over HTTP. The mapper then notifies the jobtracker that it has completed its task. For a given job, the jobtracker knows the mapping between map outputs and tasktrackers; each reducer periodically asks the jobtracker for map output locations until it has retrieved them all.
The reducer then copies all of the map outputs in parallel, merging and sorting the data as it is written to its local disk.
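The buffer size and spill threshold mentioned above correspond to configuration properties. The sketch below uses the pre-YARN (Hadoop 1.x era) property names, which is an assumption worth checking against the release you run; later releases renamed them (e.g. mapreduce.task.io.sort.mb).
import org.apache.hadoop.conf.Configuration;

public class ShuffleTuning {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        conf.setInt("io.sort.mb", 200);                  // map-side sort buffer size, MB (default 100)
        conf.setFloat("io.sort.spill.percent", 0.90f);   // start spilling at 90% full (default 0.80)
        conf.setInt("min.num.spills.for.combine", 3);    // run the combiner when at least 3 spill files exist
        // pass conf to the Job when constructing it so these settings take effect
    }
}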
A sample implementation of the minimum-temperature MapReduce job:
package org.apache.hadoop.book.ch02.min;

import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class NewMinTemperature {

    // Mapper: parse one NCDC record per line and emit (year, temperature)
    static class NewMinTemperatureMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {

        private static final int MISSING = 9999; // NCDC code for a missing reading

        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();
            String year = line.substring(15, 19);
            int airTemperature;
            if (line.charAt(87) == '+') {
                // parseInt doesn't accept a leading plus sign, so skip it
                airTemperature = Integer.parseInt(line.substring(88, 92));
            } else {
                airTemperature = Integer.parseInt(line.substring(87, 92));
            }
            String quality = line.substring(92, 93);
            if (airTemperature != MISSING && quality.matches("[01459]")) {
                context.write(new Text(year), new IntWritable(airTemperature));
            }
        }
    }

    // Reducer: take the minimum of all temperatures seen for a year
    static class NewMinTemperatureReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int minValue = Integer.MAX_VALUE;
            for (IntWritable value : values) {
                minValue = Math.min(minValue, value.get());
            }
            context.write(key, new IntWritable(minValue));
        }
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("Usage: Min Temperature <input path> <output path>");
            System.exit(-1);
        }

        Job job = new Job(); // pre-2.x API; newer code would use Job.getInstance(conf)
        job.setJarByClass(NewMinTemperature.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        job.setMapperClass(NewMinTemperatureMapper.class);
        // Optional: the reducer can double as a combiner, since min is associative and commutative
        job.setCombinerClass(NewMinTemperatureReducer.class);
        job.setReducerClass(NewMinTemperatureReducer.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
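Assuming the class is packaged into a jar (the jar name below is hypothetical), the job can be launched with: hadoop jar min-temperature.jar org.apache.hadoop.book.ch02.min.NewMinTemperature <input path> <output path>. The output directory must not already exist, or FileOutputFormat will refuse to run the job.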
0 found useful  [account deleted]  2012-08-05
If Hadoop is lame, it must be Java's fault!
0 found useful  Luffy Lee  2011-11-18
Read this a year ago and forgot to update my status.
0 found useful  yang_bigarm  2011-08-08
An authoritative work.
0 found useful  akito  2012-07-26
Read the HDFS parts because I needed them for a presentation.
0 found useful  袜落  2012-03-20
The most salacious technical book I've ever read, even though the configuration covered in the third edition is already out of date.
0 found useful  Валия  2020-06-28
Third edition.
0 found useful  herihe  2020-06-11
Can the English and Chinese editions be rated separately? The end of an era.
0 found useful  memex  2020-03-29
Read it around 2013; what I learned ended up being useful in my later work.
0 found useful  ren  2019-11-29
Roughly skimmed it. The open-source implementation of Google's three foundational papers, explained in more detail than the papers themselves.
0 found useful  对我就是那个谁  2019-08-02
An encyclopedic reference. You will learn a lot more by reading it alongside the Hadoop source code.