Fundamentals of Data Engineering: Plan and Build Robust Data Systems by Joe Reis, ISBN-13: 978-1098108304

Original price was: $50.00. Current price is: $14.99.

Description

[PDF eBook eTextbook] – Available Instantly

  • Publisher: O’Reilly Media; 1st edition (July 26, 2022)
  • Language: English
  • 447 pages
  • ISBN-10: 1098108302
  • ISBN-13: 978-1098108304

Data engineering has grown rapidly in the past decade, leaving many software engineers, data scientists, and analysts looking for a comprehensive view of this practice. With this practical book, you’ll learn how to plan and build systems to serve the needs of your organization and customers by evaluating the best technologies available through the framework of the data engineering lifecycle.

Authors Joe Reis and Matt Housley walk you through the data engineering lifecycle and show you how to stitch together a variety of cloud technologies to serve the needs of downstream data consumers. You’ll understand how to apply the concepts of data generation, ingestion, orchestration, transformation, storage, and governance that are critical in any data environment regardless of the underlying technology.

This book will help you:

  • Get a concise overview of the entire data engineering landscape
  • Assess data engineering problems using an end-to-end framework of best practices
  • Cut through marketing hype when choosing data technologies, architecture, and processes
  • Use the data engineering lifecycle to design and build a robust architecture
  • Incorporate data governance and security across the data engineering lifecycle

Table of Contents:

Preface

What This Book Isn’t

What This Book Is About

Who Should Read This Book

Prerequisites

What You’ll Learn and How It Will Improve Your Abilities

Navigating This Book

Conventions Used in This Book

How to Contact Us

Acknowledgments

I. Foundation and Building Blocks

1. Data Engineering Described

What Is Data Engineering?

Data Engineering Defined

The Data Engineering Lifecycle

Evolution of the Data Engineer

Data Engineering and Data Science

Data Engineering Skills and Activities

Data Maturity and the Data Engineer

The Background and Skills of a Data Engineer

Business Responsibilities

Technical Responsibilities

The Continuum of Data Engineering Roles, from A to B

Data Engineers Inside an Organization

Internal-Facing Versus External-Facing Data Engineers

Data Engineers and Other Technical Roles

Data Engineers and Business Leadership

Conclusion

Additional Resources

2. The Data Engineering Lifecycle

What Is the Data Engineering Lifecycle?

The Data Lifecycle Versus the Data Engineering Lifecycle

Generation: Source Systems

Storage

Ingestion

Transformation

Serving Data

Major Undercurrents Across the Data Engineering Lifecycle

Security

Data Management

DataOps

Data Architecture

Orchestration

Software Engineering

Conclusion

Additional Resources

3. Designing Good Data Architecture

What Is Data Architecture?

Enterprise Architecture Defined

Data Architecture Defined

“Good” Data Architecture

Principles of Good Data Architecture

Principle 1: Choose Common Components Wisely

Principle 2: Plan for Failure

Principle 3: Architect for Scalability

Principle 4: Architecture Is Leadership

Principle 5: Always Be Architecting

Principle 6: Build Loosely Coupled Systems

Principle 7: Make Reversible Decisions

Principle 8: Prioritize Security

Principle 9: Embrace FinOps

Major Architecture Concepts

Domains and Services

Distributed Systems, Scalability, and Designing for Failure

Tight Versus Loose Coupling: Tiers, Monoliths, and Microservices

User Access: Single Versus Multitenant

Event-Driven Architecture

Brownfield Versus Greenfield Projects

Examples and Types of Data Architecture

Data Warehouse

Data Lake

Convergence, Next-Generation Data Lakes, and the Data Platform

Modern Data Stack

Lambda Architecture

Kappa Architecture

The Dataflow Model and Unified Batch and Streaming

Architecture for IoT

Data Mesh

Other Data Architecture Examples

Who’s Involved with Designing a Data Architecture?

Conclusion

Additional Resources

4. Choosing Technologies Across the Data Engineering Lifecycle

Team Size and Capabilities

Speed to Market

Interoperability

Cost Optimization and Business Value

Total Cost of Ownership

Total Opportunity Cost of Ownership

FinOps

Today Versus the Future: Immutable Versus Transitory Technologies

Our Advice

Location

On Premises

Cloud

Hybrid Cloud

Multicloud

Decentralized: Blockchain and the Edge

Our Advice

Cloud Repatriation Arguments

Build Versus Buy

Open Source Software

Proprietary Walled Gardens

Our Advice

Monolith Versus Modular

Monolith

Modularity

The Distributed Monolith Pattern

Our Advice

Serverless Versus Servers

Serverless

Containers

How to Evaluate Server Versus Serverless

Our Advice

Optimization, Performance, and the Benchmark Wars

Big Data…for the 1990s

Nonsensical Cost Comparisons

Asymmetric Optimization

Caveat Emptor

Undercurrents and Their Impacts on Choosing Technologies

Data Management

DataOps

Data Architecture

Orchestration Example: Airflow

Software Engineering

Conclusion

Additional Resources

II. The Data Engineering Lifecycle in Depth

5. Data Generation in Source Systems

Sources of Data: How Is Data Created?

Source Systems: Main Ideas

Files and Unstructured Data

APIs

Application Databases (OLTP Systems)

Online Analytical Processing System

Change Data Capture

Logs

Database Logs

CRUD

Insert-Only

Messages and Streams

Types of Time

Source System Practical Details

Databases

APIs

Data Sharing

Third-Party Data Sources

Message Queues and Event-Streaming Platforms

Whom You’ll Work With

Undercurrents and Their Impact on Source Systems

Security

Data Management

DataOps

Data Architecture

Orchestration

Software Engineering

Conclusion

Additional Resources

6. Storage

Raw Ingredients of Data Storage

Magnetic Disk Drive

Solid-State Drive

Random Access Memory

Networking and CPU

Serialization

Compression

Caching

Data Storage Systems

Single Machine Versus Distributed Storage

Eventual Versus Strong Consistency

File Storage

Block Storage

Object Storage

Cache and Memory-Based Storage Systems

The Hadoop Distributed File System

Streaming Storage

Indexes, Partitioning, and Clustering

Data Engineering Storage Abstractions

The Data Warehouse

The Data Lake

The Data Lakehouse

Data Platforms

Stream-to-Batch Storage Architecture

Big Ideas and Trends in Storage

Data Catalog

Data Sharing

Schema

Separation of Compute from Storage

Data Storage Lifecycle and Data Retention

Single-Tenant Versus Multitenant Storage

Whom You’ll Work With

Undercurrents

Security

Data Management

DataOps

Data Architecture

Orchestration

Software Engineering

Conclusion

Additional Resources

7. Ingestion

What Is Data Ingestion?

Key Engineering Considerations for the Ingestion Phase

Bounded Versus Unbounded Data

Frequency

Synchronous Versus Asynchronous Ingestion

Serialization and Deserialization

Throughput and Scalability

Reliability and Durability

Payload

Push Versus Pull Versus Poll Patterns

Batch Ingestion Considerations

Snapshot or Differential Extraction

File-Based Export and Ingestion

ETL Versus ELT

Inserts, Updates, and Batch Size

Data Migration

Message and Stream Ingestion Considerations

Schema Evolution

Late-Arriving Data

Ordering and Multiple Delivery

Replay

Time to Live

Message Size

Error Handling and Dead-Letter Queues

Consumer Pull and Push

Location

Ways to Ingest Data

Direct Database Connection

Change Data Capture

APIs

Message Queues and Event-Streaming Platforms

Managed Data Connectors

Moving Data with Object Storage

EDI

Databases and File Export

Practical Issues with Common File Formats

Shell

SSH

SFTP and SCP

Webhooks

Web Interface

Web Scraping

Transfer Appliances for Data Migration

Data Sharing

Whom You’ll Work With

Upstream Stakeholders

Downstream Stakeholders

Undercurrents

Security

Data Management

DataOps

Orchestration

Software Engineering

Conclusion

Additional Resources

8. Queries, Modeling, and Transformation

Queries

What Is a Query?

The Life of a Query

The Query Optimizer

Improving Query Performance

Queries on Streaming Data

Data Modeling

What Is a Data Model?

Conceptual, Logical, and Physical Data Models

Normalization

Techniques for Modeling Batch Analytical Data

Modeling Streaming Data

Transformations

Batch Transformations

Materialized Views, Federation, and Query Virtualization

Streaming Transformations and Processing

Whom You’ll Work With

Upstream Stakeholders

Downstream Stakeholders

Undercurrents

Security

Data Management

DataOps

Data Architecture

Orchestration

Software Engineering

Conclusion

Additional Resources

9. Serving Data for Analytics, Machine Learning, and Reverse ETL

General Considerations for Serving Data

Trust

What’s the Use Case, and Who’s the User?

Data Products

Self-Service or Not?

Data Definitions and Logic

Data Mesh

Analytics

Business Analytics

Operational Analytics

Embedded Analytics

Machine Learning

What a Data Engineer Should Know About ML

Ways to Serve Data for Analytics and ML

File Exchange

Databases

Streaming Systems

Query Federation

Data Sharing

Semantic and Metrics Layers

Serving Data in Notebooks

Reverse ETL

Whom You’ll Work With

Undercurrents

Security

Data Management

DataOps

Data Architecture

Orchestration

Software Engineering

Conclusion

Additional Resources

III. Security, Privacy, and the Future of Data Engineering

10. Security and Privacy

People

The Power of Negative Thinking

Always Be Paranoid

Processes

Security Theater Versus Security Habit

Active Security

The Principle of Least Privilege

Shared Responsibility in the Cloud

Always Back Up Your Data

An Example Security Policy

Technology

Patch and Update Systems

Encryption

Logging, Monitoring, and Alerting

Network Access

Security for Low-Level Data Engineering

Conclusion

Additional Resources

11. The Future of Data Engineering

The Data Engineering Lifecycle Isn’t Going Away

The Decline of Complexity and the Rise of Easy-to-Use Data Tools

The Cloud-Scale Data OS and Improved Interoperability

“Enterprisey” Data Engineering

Titles and Responsibilities Will Morph…

Moving Beyond the Modern Data Stack, Toward the Live Data Stack

The Live Data Stack

Streaming Pipelines and Real-Time Analytical Databases

The Fusion of Data with Applications

The Tight Feedback Between Applications and ML

Dark Matter Data and the Rise of…Spreadsheets?!

Conclusion

A. Serialization and Compression Technical Details

Serialization Formats

Row-Based Serialization

Columnar Serialization

Hybrid Serialization

Database Storage Engines

Compression: gzip, bzip2, Snappy, Etc.

B. Cloud Networking

Cloud Network Topology

Data Egress Charges

Availability Zones

Regions

GCP-Specific Networking and Multiregional Redundancy

Direct Network Connections to the Clouds

CDNs

The Future of Data Egress Fees

Index

About the Authors

Joe Reis is a business-minded data nerd who’s worked in the data industry for 20 years, with responsibilities ranging from statistical modeling, forecasting, and machine learning to data engineering, data architecture, and almost everything else in between. Joe is the CEO and cofounder of Ternary Data, a data engineering and architecture consulting firm based in Salt Lake City, Utah. In addition, he volunteers with several technology groups and teaches at the University of Utah. In his spare time, Joe likes to rock climb, produce electronic music, and take his kids on crazy adventures.

Matt Housley is a data engineering consultant and cloud specialist. After some early programming experience with Logo, Basic, and 6502 assembly, he completed a PhD in mathematics at the University of Utah. Matt then began working in data science, eventually specializing in cloud-based data engineering. He cofounded Ternary Data with Joe Reis, where he leverages his teaching experience to train future data engineers and advise teams on robust data architecture. Matt and Joe also pontificate on all things data on The Monday Morning Data Chat.

