By providing distributed data storage and a parallel computing framework, Hadoop has evolved from an abstraction over cluster computing into an operating system for big data. This book offers a readable, intuitive introduction to Hadoop cluster computing and analytics from a data scientist's perspective, paving the way for deeper insight into specific subject areas. It is divided into two parts: the first introduces distributed computing at a high level and discusses how to run computations on a cluster; the second focuses on the tools and techniques data scientists should know to power a variety of analyses and large-scale data management.
This book is suitable for practitioners in the field of data science, as well as researchers interested in data analysis.
Big data has become a buzzword. People use it to describe exciting new tools and technologies in data-driven applications, applications that are bringing us new ways of computing. To the chagrin of statisticians, the term seems to be used haphazardly; its scope even includes applying well-known statistical techniques to large data sets. Yet behind the buzzword lies a reality: modern distributed computing techniques can analyze data sets far larger than the "typical" ones of the past, and with far more impressive results.
However, distributed computing alone is not the same as data science. The Internet has brought rapidly growing data sets that in turn drive models ("more data is better than better algorithms"1), and data products have become a new economic paradigm. Striking successes in modeling large, heterogeneous, cross-domain data sets (for example, Nate Silver's almost magically accurate prediction of the 2008 U.S. presidential election using big data techniques) have made the value of data science widely recognized and attracted a large number of practitioners to the field.
By providing a distributed data store and a parallel computing framework, Hadoop has evolved from an abstraction over cluster computing into a big data operating system. Spark was built with this in mind, making cluster computing even more accessible to data scientists. Even so, data scientists and analysts who don't understand distributed computing may feel that these tools are for programmers rather than analysts. That is because using them requires a fundamental shift in how we think about managing and computing on data: a move from serial to parallel.
This book is designed to help data scientists make this mindset shift by providing a readable and intuitive introduction to cluster computing and analytics. We'll cover the many concepts, tools, and techniques involved in distributed computing for data analytics, paving the way for a deeper understanding of specific domains.
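The serial-to-parallel shift described above can be sketched in a few lines of Python. This is a hypothetical teaching example, not code from the book: the same word count is expressed first as a single serial loop, then as independent map and reduce steps in the MapReduce style the book covers.

```python
# A sketch of the serial-to-parallel mindset shift: word count
# written serially, then restated as map and reduce steps.
# (Illustrative example only; the data and function names are invented.)
from collections import Counter
from itertools import groupby
from operator import itemgetter

lines = ["big data is big", "data products need data"]

# Serial thinking: one loop walks over all the data.
serial = Counter(word for line in lines for word in line.split())

# Parallel thinking: a mapper emits (key, 1) pairs per record.
# Each record can be mapped independently on a different machine.
def mapper(line):
    return [(word, 1) for word in line.split()]

# A reducer sums the values that were shuffled together by key.
def reducer(key, values):
    return key, sum(values)

# Simulate the shuffle/sort phase, then reduce each key group.
pairs = sorted(p for line in lines for p in mapper(line))
parallel = dict(
    reducer(key, [v for _, v in group])
    for key, group in groupby(pairs, key=itemgetter(0))
)

assert parallel == dict(serial)  # both approaches agree
```

The point is not the tiny data set but the structure: once the computation is phrased as independent map and reduce steps, a framework like Hadoop or Spark can distribute it across a cluster.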
Contents
Preface.
Part 1: Introduction to Distributed Computing.
Chapter 1: The Era of Data Products.
1.1 What is a data product.
1.2 Build large-scale data products with Hadoop.
1.2.1 Leverage large datasets.
1.2.2 Hadoop in data products
1.3 Data science pipelines and the Hadoop ecosystem.
Big data workflows.
1.4 Summary.
Chapter 2 Big Data Operating Systems.
2.1 Basic Concepts.
2.2 Hadoop architecture.
2.2.1 Hadoop cluster.
2.2.2 HDFS
2.2.3 YARN
2.3 Use a distributed file system.
2.3.1 Basic file system operations.
2.3.2 HDFS file permissions.
2.3.3 Other HDFS interfaces.
2.4 Use distributed computing.
2.4.1 MapReduce: Functional programming model.
2.4.2 MapReduce: Implementation on a cluster.
2.4.3 Beyond MapReduce: Job chaining.
2.5 Submitting a MapReduce job to YARN.
2.6 Summary.
Chapter 3: Python Frameworks and Hadoop Streaming
3.1 Hadoop Streaming
3.1.1 Use streaming to run calculations on CSV data.
3.1.2 Execute the streaming job.
3.2 Python's MapReduce framework.
3.2.1 Phrase count.
3.2.2 Other frameworks.
3.3 MapReduce Advanced.
3.3.1 Combiners.
3.3.2 Partitioners.
3.3.3 Job chaining.
3.4 Summary.
Chapter 4 Spark In-Memory Computing.
4.1 Spark Basics.
4.1.1 Spark Stack.
4.1.2 RDDs.
4.1.3 Programming with RDDs.
4.2 Interactive Spark based on PySpark
4.3 Write a Spark application.
Visualize flight delays with Spark.
4.4 Summary.
Chapter 5 Distributed Analytics and Patterns.
5.1 Computing with keys.
5.1.1 Compound keys.
5.1.2 Keyspace patterns.
5.1.3 Pairs and stripes.
5.2 Design Patterns.
5.2.1 Summarization.
5.2.2 Indexing.
5.2.3 Filtering.
5.3 Towards last-mile analytics.
5.3.1 Model fitting.
5.3.2 Model validation.
5.4 Summary.
Part 2: Workflows and Tools for Big Data Science.
Chapter 6 Data Mining and Data Warehousing.
6.1 Structured data queries with Hive.
6.1.1 Hive Command Line Interface (CLI).
6.1.2 Hive Query Language.
6.1.3 Hive Data Analysis.
6.2 HBase
6.2.1 NoSQL vs. Columnar Databases.
6.2.2 HBase real-time analytics.
6.3 Summary.
Chapter 7 Data Collection.
7.1 Importing relational data with Sqoop.
7.1.1 Importing from MySQL into HDFS
7.1.2 Importing from MySQL into Hive
7.1.3 Importing from MySQL into HBase
7.2 Use Flume to get streaming data.
7.2.1 Flume data stream.
7.2.2 Use Flume to get product impression data.
7.3 Summary.
Chapter 8 Using Advanced APIs for Analysis.
8.1 Pig
8.1.1 Pig Latin
8.1.2 Data Types.
8.1.3 Relational operators.
8.1.4 User-defined functions.
8.1.5 Pig summary.
8.2 Spark's higher-level APIs
8.2.1 Spark SQL
8.2.2 DataFrames
8.3 Summary.
Chapter 9 Machine Learning.
9.1 Scalable machine learning with Spark.
9.1.1 Collaborative filtering.
9.1.2 Classification.
9.1.3 Clustering.
9.2 Summary.
Chapter 10 Summary: Distributed Data Science in Action.
10.1 Data Product Lifecycle.
10.1.1 Data Lake.
10.1.2 Data Collection.
10.1.3 Computational data storage.
10.2 Machine Learning Lifecycle.
10.3 Summary.
Appendix A Creating a Hadoop Pseudo-Distributed Development Environment.
A.1 Get started quickly.
A.2 Set up your Linux environment.
A.2.1 Create a Hadoop user.
A.2.2 Configure SSH
A.2.3 Install Java
A.2.4 Disable IPv6
A.3 Install Hadoop
A.3.1 Unpack.
A.3.2 Environment.
A.3.3 Hadoop configuration.
A.3.4 Format the NameNode
A.3.5 Start Hadoop
A.3.6 Restart Hadoop
Appendix B Installing Hadoop Ecosystem Products.
B.1 Packaged Hadoop distributions.
B.2 Install Apache Hadoop ecosystem products yourself.
B.2.1 Basic installation and configuration steps.
B.2.2 Sqoop-specific configurations.
B.2.3 Hive-specific configurations.
B.2.4 HBase-specific configurations.
B.2.5 Install Spark
Glossary.
About the Author.
About the Cover.
Connect with Turing.