Joe Reis and Matt Housley :: Fundamentals of Data Engineering

An O’Reilly book from 2022. I haven’t heard as much praise for it (or much of anything at all, compared to the old standards), but since most of my career has been in data, I thought I’d give it a skim.

Fundamentals of Data Engineering

Part I. Foundation and Building Blocks

Chapter 1. Data Engineering Described

What Is Data Engineering?

Data Engineering Defined

The Data Engineering Lifecycle

Evolution of the Data Engineer

Data Engineering and Data Science

Data Engineering Skills and Activities

Data Maturity and the Data Engineer

The Background and Skills of a Data Engineer

Business Responsibilities

Technical Responsibilities

The Continuum of Data Engineering Roles, from A to B

Data Engineers Inside an Organization

Internal-Facing Versus External-Facing Data Engineers

Data Engineers and Other Technical Roles

Data Engineers and Business Leadership

Conclusion

Additional Resources

Chapter 2. The Data Engineering Lifecycle

What Is the Data Engineering Lifecycle?

The Data Lifecycle Versus the Data Engineering Lifecycle

Generation: Source Systems

Storage

Ingestion

Transformation

Serving Data

Major Undercurrents Across the Data Engineering Lifecycle

Security

Data Management

DataOps

Data Architecture

Orchestration

Software Engineering

Conclusion

Additional Resources

Chapter 3. Designing Good Data Architecture

What Is Data Architecture?

Enterprise Architecture Defined

Data Architecture Defined

“Good” Data Architecture

Principles of Good Data Architecture

Principle 1: Choose Common Components Wisely

Principle 2: Plan for Failure

Principle 3: Architect for Scalability

Principle 4: Architecture Is Leadership

Principle 5: Always Be Architecting

Principle 6: Build Loosely Coupled Systems

Principle 7: Make Reversible Decisions

Principle 8: Prioritize Security

Principle 9: Embrace FinOps

Major Architecture Concepts

Domains and Services

Distributed Systems, Scalability, and Designing for Failure

Tight Versus Loose Coupling: Tiers, Monoliths, and Microservices

User Access: Single Versus Multitenant

Event-Driven Architecture

Brownfield Versus Greenfield Projects

Examples and Types of Data Architecture

Data Warehouse

Data Lake

Convergence, Next-Generation Data Lakes, and the Data Platform

Modern Data Stack

Lambda Architecture

Kappa Architecture

The Dataflow Model and Unified Batch and Streaming

Architecture for IoT

Data Mesh

Other Data Architecture Examples

Who’s Involved with Designing a Data Architecture?

Conclusion

Additional Resources

Chapter 4. Choosing Technologies Across the Data Engineering Lifecycle

Team Size and Capabilities

Speed to Market

Interoperability

Cost Optimization and Business Value

Total Cost of Ownership

Total Opportunity Cost of Ownership

FinOps

Today Versus the Future: Immutable Versus Transitory Technologies

Our Advice

Location

On Premises

Cloud

Hybrid Cloud

Multicloud

Decentralized: Blockchain and the Edge

Our Advice

Cloud Repatriation Arguments

Build Versus Buy

Open Source Software

Proprietary Walled Gardens

Our Advice

Monolith Versus Modular

Monolith

Modularity

The Distributed Monolith Pattern

Our Advice

Serverless Versus Servers

Serverless

Containers

How to Evaluate Server Versus Serverless

Our Advice

Optimization, Performance, and the Benchmark Wars

Big Data…for the 1990s

Nonsensical Cost Comparisons

Asymmetric Optimization

Caveat Emptor

Undercurrents and Their Impacts on Choosing Technologies

Data Management

DataOps

Data Architecture

Orchestration Example: Airflow

Software Engineering

Conclusion

Additional Resources

Part II. The Data Engineering Lifecycle in Depth

Chapter 5. Data Generation in Source Systems

Sources of Data: How Is Data Created?

Source Systems: Main Ideas

Files and Unstructured Data

APIs

Application Databases (OLTP Systems)

Online Analytical Processing System

Change Data Capture

Logs

Database Logs

CRUD

Insert-Only

Messages and Streams

Types of Time

Source System Practical Details

Databases

APIs

Third-Party Data Sources

Message Queues and Event-Streaming Platforms

Whom You’ll Work With

Undercurrents and Their Impact on Source Systems

Security

Data Management

DataOps

Data Architecture

Orchestration

Software Engineering

Conclusion

Additional Resources

Chapter 6. Storage

Raw Ingredients of Data Storage

Magnetic Disk Drive

Solid-State Drive

Random Access Memory

Networking and CPU

Serialization

Compression

Caching

Data Storage Systems

Single Machine Versus Distributed Storage

Eventual Versus Strong Consistency

File Storage

Block Storage

Object Storage

Cache and Memory-Based Storage Systems

The Hadoop Distributed File System

Streaming Storage

Indexes, Partitioning, and Clustering

Data Engineering Storage Abstractions

The Data Warehouse

The Data Lake

The Data Lakehouse

Data Platforms

Stream-to-Batch Storage Architecture

Big Ideas and Trends in Storage

Data Catalog

Schema

Separation of Compute from Storage

Data Storage Lifecycle and Data Retention

Single-Tenant Versus Multitenant Storage

Whom You’ll Work With

Undercurrents

Security

Data Management

DataOps

Data Architecture

Orchestration

Software Engineering

Conclusion

Additional Resources

Chapter 7. Ingestion

What Is Data Ingestion?

Key Engineering Considerations for the Ingestion Phase

Bounded Versus Unbounded Data

Frequency

Synchronous Versus Asynchronous Ingestion

Serialization and Deserialization

Throughput and Scalability

Reliability and Durability

Payload

Push Versus Pull Versus Poll Patterns

Batch Ingestion Considerations

Snapshot or Differential Extraction

File-Based Export and Ingestion

ETL Versus ELT

Inserts, Updates, and Batch Size

Data Migration

Message and Stream Ingestion Considerations

Schema Evolution

Late-Arriving Data

Ordering and Multiple Delivery

Replay

Time to Live

Message Size

Error Handling and Dead-Letter Queues

Consumer Pull and Push

Location

Ways to Ingest Data

Direct Database Connection

Change Data Capture

APIs

Message Queues and Event-Streaming Platforms

Managed Data Connectors

Moving Data with Object Storage

EDI

Databases and File Export

Practical Issues with Common File Formats

Shell

SSH

SFTP and SCP

Webhooks

Web Interface

Web Scraping

Transfer Appliances for Data Migration

Whom You’ll Work With

Upstream Stakeholders

Downstream Stakeholders

Undercurrents

Security

Data Management

DataOps

Orchestration

Software Engineering

Conclusion

Additional Resources

Chapter 8. Queries, Modeling, and Transformation

Queries

What Is a Query?

The Life of a Query

The Query Optimizer

Improving Query Performance

Queries on Streaming Data

Data Modeling

What Is a Data Model?

Conceptual, Logical, and Physical Data Models

Normalization

Techniques for Modeling Batch Analytical Data

Modeling Streaming Data

Transformations

Batch Transformations

Materialized Views, Federation, and Query Virtualization

Streaming Transformations and Processing

Whom You’ll Work With

Upstream Stakeholders

Downstream Stakeholders

Undercurrents

Security

Data Management

DataOps

Data Architecture

Orchestration

Software Engineering

Conclusion

Additional Resources

Chapter 9. Serving Data for Analytics, Machine Learning, and Reverse ETL

General Considerations for Serving Data

Trust

What’s the Use Case, and Who’s the User?

Data Products

Self-Service or Not?

Data Definitions and Logic

Data Mesh

Analytics

Business Analytics

Operational Analytics

Embedded Analytics

Machine Learning

What a Data Engineer Should Know About ML

Ways to Serve Data for Analytics and ML

File Exchange

Databases

Streaming Systems

Query Federation

Semantic and Metrics Layers

Serving Data in Notebooks

Reverse ETL

Whom You’ll Work With

Undercurrents

Security

Data Management

DataOps

Data Architecture

Orchestration

Software Engineering

Conclusion

Additional Resources

Part III. Security, Privacy, and the Future of Data Engineering

Chapter 10. Security and Privacy

People

The Power of Negative Thinking

Always Be Paranoid

Processes

Security Theater Versus Security Habit

Active Security

The Principle of Least Privilege

Shared Responsibility in the Cloud

Always Back Up Your Data

An Example Security Policy

Technology

Patch and Update Systems

Encryption

Logging, Monitoring, and Alerting

Network Access

Security for Low-Level Data Engineering

Conclusion

Additional Resources

Chapter 11. The Future of Data Engineering

The Data Engineering Lifecycle Isn’t Going Away

The Decline of Complexity and the Rise of Easy-to-Use Data Tools

The Cloud-Scale Data OS and Improved Interoperability

“Enterprisey” Data Engineering

Titles and Responsibilities Will Morph…

Moving Beyond the Modern Data Stack, Toward the Live Data Stack

The Live Data Stack

Streaming Pipelines and Real-Time Analytical Databases

The Fusion of Data with Applications

The Tight Feedback Between Applications and ML

Dark Matter Data and the Rise of…Spreadsheets?!

Conclusion

Appendix A. Serialization and Compression Technical Details

Serialization Formats

Row-Based Serialization

Columnar Serialization

Hybrid Serialization

Database Storage Engines

Compression: gzip, bzip2, Snappy, Etc.

Appendix B. Cloud Networking

Cloud Network Topology

Data Egress Charges

Availability Zones

Regions

GCP-Specific Networking and Multiregional Redundancy

Direct Network Connections to the Clouds

CDNs

The Future of Data Egress Fees

Notes

Links
