Publisher: O'Reilly Media
Subtitle: Lightning-Fast Data Analytics
Publication date: 2020-07-28
Pages: 400
Price: USD 35.99
Binding: Paperback
ISBN: 9781492050049
About the Book
Data is getting bigger, arriving faster, and coming in varied formats—and it all needs to be processed at scale for analytics or machine learning. How can you process such varied data workloads efficiently? Enter Apache Spark.
Updated to include Spark 3.0, this second edition shows data engineers and data scientists why structure and unification in Spark matters. Specifically, this book explains how to perform simple and complex data analytics and employ machine-learning algorithms. Through discourse, code snippets, and notebooks, you'll be able to:
Learn Python, SQL, Scala, or Java high-level APIs: DataFrames and Datasets
Peek under the hood of the Spark SQL engine to understand Spark transformations and performance
Inspect, tune, and debug your Spark operations with Spark configurations and Spark UI
Connect to data sources: JSON, Parquet, CSV, Avro, ORC, Hive, S3, or Kafka
Perform analytics on batch and streaming data using Structured Streaming
Build reliable data pipelines with open source Delta Lake and Spark
Develop machine learning pipelines with MLlib and productionize models using MLflow
Use Koalas, the open source pandas API on Spark, for data transformation and feature engineering
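Several of the bullets above (understanding Spark transformations, peeking under the hood of the engine) rest on one core idea the book returns to repeatedly: transformations are lazy and only an action triggers computation. The toy classes below are a hypothetical, stdlib-only sketch of that model for readers who have not yet installed Spark; `ToyRDD` and its methods are invented for illustration and are not Spark's real API.

```python
# Toy sketch of Spark's transformation/action model (NOT Spark's real API):
# transformations only record a plan; an action replays the plan and computes.

class ToyRDD:
    """A tiny single-machine stand-in for a distributed dataset."""

    def __init__(self, data, plan=None):
        self._data = list(data)
        self._plan = plan or []  # recorded transformations, not yet executed

    # --- transformations: lazy, they just extend the plan ------------------
    def map(self, fn):
        return ToyRDD(self._data, self._plan + [("map", fn)])

    def filter(self, pred):
        return ToyRDD(self._data, self._plan + [("filter", pred)])

    # --- action: eager, walks the whole recorded plan ----------------------
    def collect(self):
        rows = self._data
        for kind, fn in self._plan:
            if kind == "map":
                rows = [fn(r) for r in rows]
            else:  # "filter"
                rows = [r for r in rows if fn(r)]
        return rows

rdd = ToyRDD(range(6)).map(lambda x: x * x).filter(lambda x: x % 2 == 0)
# Nothing has run yet; only the action below does the work.
print(rdd.collect())  # [0, 4, 16]
```

Deferring execution this way is what lets a real engine such as Spark's Catalyst optimizer see the whole pipeline before running it and rearrange or fuse steps, which is the "structure" theme the blurb refers to.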
About the Authors
Jules S. Damji is a senior developer advocate at Databricks and an MLflow contributor. He is a hands-on developer with over 20 years of experience and has worked as a software engineer at leading companies such as Sun Microsystems, Netscape, @Home, Loudcloud/Opsware, Verisign, ProQuest, and Hortonworks, building large scale distributed systems. He holds a B.Sc. and an M.Sc. in computer science and an MA in political advocacy and communication from Oregon State University, Cal State, and Johns Hopkins University, respectively.
Brooke Wenig is a machine learning practice lead at Databricks. She leads a team of data scientists who develop large-scale machine learning pipelines for customers, as well as teaching courses on distributed machine learning best practices. Previously, she was a principal data science consultant at Databricks. She holds an M.S. in computer science from UCLA with a focus on distributed machine learning.
Tathagata Das is a staff software engineer at Databricks, an Apache Spark committer, and a member of the Apache Spark Project Management Committee (PMC). He is one of the original developers of Apache Spark, the lead developer of Spark Streaming (DStreams), and is currently one of the core developers of Structured Streaming and Delta Lake. Tathagata holds an M.S. in computer science from UC Berkeley.
Denny Lee is a staff developer advocate at Databricks who has been working with Apache Spark since 0.6. He is a hands-on distributed systems and data sciences engineer with extensive experience developing internet-scale infrastructure, data platforms, and predictive analytics systems for both on-premises and cloud environments. He also has an M.S. in biomedical informatics from Oregon Health and Sciences University and has architected and implemented powerful data solutions for enterprise healthcare customers.
Table of Contents
1. Introduction to Apache Spark: A Unified Analytics Engine
The Genesis of Big Data and Distributed Computing at Google
Hadoop at Yahoo!
Spark’s Early Years at AMPLab
What is Apache Spark?
Speed
Ease of Use
Modularity
Extensibility
Why Unified Analytics?
Apache Spark Components as a Unified Stack
Apache Spark’s Distributed Execution and Concepts
Developer’s Experience
Who Uses Spark, and for What?
Data Science Tasks
Data Engineering Tasks
Machine Learning or Deep Learning Tasks
Community Adoption and Expansion
2. Downloading Apache Spark and Getting Started
Step 1: Download Apache Spark
Spark’s Directories and Files
Step 2: Use Scala Shell or PySpark Shell
Using Local Machine
Step 3: Understand Spark Application Concepts
Spark Application and SparkSession
Spark Jobs
Spark Stages
Spark Tasks
Transformations, Actions, and Lazy Evaluation
Spark UI
Databricks Community Edition
First Standalone Application
Using Local Machine
Counting M&Ms for the Cookie Monster
Building Standalone Applications in Scala
Summary
3. Apache Spark’s Structured APIs
A Bit of History…
Unstructured Spark: What’s Underneath an RDD?
Structuring Spark
Key Merits and Benefits
Structured APIs: DataFrames and Datasets APIs
DataFrames API
Common DataFrame Operations
Datasets API
DataFrames vs Datasets
What about RDDs?
Spark SQL and the Underlying Engine
Catalyst Optimizer
Summary
4. Spark SQL and DataFrames — Introduction to Built-in Data Sources
Using Spark SQL in Spark Applications
Basic Query Example
SQL Tables and Views
Data Sources for DataFrames and SQL Tables
DataFrameReader
DataFrameWriter
Parquet
JSON
CSV
Avro
ORC
Image
Summary
5. Spark SQL and Datasets
Single API for Java and Scala
Scala Case Classes and JavaBeans for Datasets
Working with Datasets
Creating Sample Data
Transforming Sample Data
Memory Management for Datasets and DataFrames
Dataset Encoders
Spark’s Internal Format vs Java Object Format
Serialization and Deserialization (SerDe)
Costs of Using Datasets
Strategies to Mitigate Costs
Summary
6. Loading and Saving Your Data
Motivation for Data Sources
File Formats: Revisited
Text Files
Organizing Data for Efficient I/O
Partitioning
Bucketing
Compression Schemes
Saving as Parquet Files
Delta Lake Storage Format
Delta Lake Table
Summary
Reviews ( 7 total )
Still a bit abstruse for complete beginners
The book is somewhat dated by now
A good book for getting started with Spark
Spark快速大数据分析
Big Data Analytics with Python and Spark (Part 1)
Other Editions ( 5 total )
- 人民邮电出版社 (2015): rated 7.9, 474 readers
- 人民邮电出版社 (2021): rated 8.1, 32 readers
- O'Reilly Media, Inc (2015): no rating yet, 5 readers
- 东南大学出版社 (2015): no rating yet, 3 readers
1 useful · Lo · 2022-02-04 09:39:16
Rereading Spark DS & optimization in 2nd edition. 1st edition was read in 2017 - https://book.douban.com/subject/22139960/
0 useful · 有一个这样的人 · 2021-11-26 21:02:31
A mix of deep and shallow material; pretty good.
0 useful · 文杨 · 2023-04-03 08:36:12 · United States
Gained a much better understanding of where the few tricks I actually use at work come from. Some of the code in the book still feels overly convoluted, though.
0 useful · 流光 · 2023-02-22 01:23:22 · United States
Read this while preparing for interviews last year; a decent introductory book. Before reading it I hadn't worked out the difference between Spark and Ray, beyond knowing that one is built on RDDs and the other is a general distributed computing framework. Thinking it through: although Spark has moved past the traditional MapReduce framework, the newer APIs only add high-level interfaces, and under the hood each API call still effectively runs a full map-reduce pass, whereas Ray is more like generalizing single-machine multiprocessing to a cluster: lower level, but also more flexible.
0 useful · 3点一直线 · 2023-02-21 06:29:57 · United States
Fairly up to date; works well as a book on the newer APIs.
0 useful · o` · 2024-06-06 03:22:26 · Canada
Good for newcomers; the content is basic and still some distance from production use.