An O’Reilly book from 2022. I haven’t heard as much praise for it (or much of anything at all, compared to the old standards), but since most of my career has been in data, I thought I’d give it a skim.

Fundamentals of Data Engineering

Part I. Foundation and Building Blocks

Chapter 1. Data Engineering Described

What Is Data Engineering?
Data Engineering Defined
The Data Engineering Lifecycle
Evolution of the Data Engineer
Data Engineering and Data Science
Data Engineering Skills and Activities
Data Maturity and the Data Engineer
The Background and Skills of a Data Engineer
Business Responsibilities
Technical Responsibilities
The Continuum of Data Engineering Roles, from A to B
Data Engineers Inside an Organization
Internal-Facing Versus External-Facing Data Engineers
Data Engineers and Other Technical Roles
Data Engineers and Business Leadership
Conclusion
Additional Resources

Chapter 2. The Data Engineering Lifecycle

What Is the Data Engineering Lifecycle?
The Data Lifecycle Versus the Data Engineering Lifecycle
Generation: Source Systems
Storage
Ingestion
Transformation
Serving Data
Major Undercurrents Across the Data Engineering Lifecycle
Security
Data Management
DataOps
Data Architecture
Orchestration
Software Engineering
Conclusion
Additional Resources

Chapter 3. Designing Good Data Architecture

What Is Data Architecture?
Enterprise Architecture Defined
Data Architecture Defined
“Good” Data Architecture
Principles of Good Data Architecture
Principle 1: Choose Common Components Wisely
Principle 2: Plan for Failure
Principle 3: Architect for Scalability
Principle 4: Architecture Is Leadership
Principle 5: Always Be Architecting
Principle 6: Build Loosely Coupled Systems
Principle 7: Make Reversible Decisions
Principle 8: Prioritize Security
Principle 9: Embrace FinOps
Major Architecture Concepts
Domains and Services
Distributed Systems, Scalability, and Designing for Failure
Tight Versus Loose Coupling: Tiers, Monoliths, and Microservices
User Access: Single Versus Multitenant
Event-Driven Architecture
Brownfield Versus Greenfield Projects
Examples and Types of Data Architecture
Data Warehouse
Data Lake
Convergence, Next-Generation Data Lakes, and the Data Platform
Modern Data Stack
Lambda Architecture
Kappa Architecture
The Dataflow Model and Unified Batch and Streaming
Architecture for IoT
Data Mesh
Other Data Architecture Examples
Who’s Involved with Designing a Data Architecture?
Conclusion
Additional Resources

Chapter 4. Choosing Technologies Across the Data Engineering Lifecycle

Team Size and Capabilities
Speed to Market
Interoperability
Cost Optimization and Business Value
Total Cost of Ownership
Total Opportunity Cost of Ownership
FinOps
Today Versus the Future: Immutable Versus Transitory Technologies
Our Advice
Location
On Premises
Cloud
Hybrid Cloud
Multicloud
Decentralized: Blockchain and the Edge
Our Advice
Cloud Repatriation Arguments
Build Versus Buy
Open Source Software
Proprietary Walled Gardens
Our Advice
Monolith Versus Modular
Monolith
Modularity
The Distributed Monolith Pattern
Our Advice
Serverless Versus Servers
Serverless
Containers
How to Evaluate Server Versus Serverless
Our Advice
Optimization, Performance, and the Benchmark Wars
Big Data…for the 1990s
Nonsensical Cost Comparisons
Asymmetric Optimization
Caveat Emptor
Undercurrents and Their Impacts on Choosing Technologies
Data Management
DataOps
Data Architecture
Orchestration Example: Airflow
Software Engineering
Conclusion
Additional Resources

Part II. The Data Engineering Lifecycle in Depth

Chapter 5. Data Generation in Source Systems

Sources of Data: How Is Data Created?
Source Systems: Main Ideas
Files and Unstructured Data
APIs
Application Databases (OLTP Systems)
Online Analytical Processing System
Change Data Capture
Logs
Database Logs
CRUD
Insert-Only
Messages and Streams
Types of Time
Source System Practical Details
Databases
APIs
Data Sharing
Third-Party Data Sources
Message Queues and Event-Streaming Platforms
Whom You’ll Work With
Undercurrents and Their Impact on Source Systems
Security
Data Management
DataOps
Data Architecture
Orchestration
Software Engineering
Conclusion
Additional Resources

Chapter 6. Storage

Raw Ingredients of Data Storage
Magnetic Disk Drive
Solid-State Drive
Random Access Memory
Networking and CPU
Serialization
Compression
Caching
Data Storage Systems
Single Machine Versus Distributed Storage
Eventual Versus Strong Consistency
File Storage
Block Storage
Object Storage
Cache and Memory-Based Storage Systems
The Hadoop Distributed File System
Streaming Storage
Indexes, Partitioning, and Clustering
Data Engineering Storage Abstractions
The Data Warehouse
The Data Lake
The Data Lakehouse
Data Platforms
Stream-to-Batch Storage Architecture
Data Catalog
Data Sharing
Schema
Separation of Compute from Storage
Data Storage Lifecycle and Data Retention
Single-Tenant Versus Multitenant Storage
Whom You’ll Work With
Undercurrents
Security
Data Management
DataOps
Data Architecture
Orchestration
Software Engineering
Conclusion
Additional Resources

Chapter 7. Ingestion

What Is Data Ingestion?
Key Engineering Considerations for the Ingestion Phase
Bounded Versus Unbounded Data
Frequency
Synchronous Versus Asynchronous Ingestion
Serialization and Deserialization
Throughput and Scalability
Reliability and Durability
Payload
Push Versus Pull Versus Poll Patterns
Batch Ingestion Considerations
Snapshot or Differential Extraction
File-Based Export and Ingestion
ETL Versus ELT
Inserts, Updates, and Batch Size
Data Migration
Message and Stream Ingestion Considerations
Schema Evolution
Late-Arriving Data
Ordering and Multiple Delivery
Replay
Time to Live
Message Size
Error Handling and Dead-Letter Queues
Consumer Pull and Push
Location
Ways to Ingest Data
Direct Database Connection
Change Data Capture
APIs
Message Queues and Event-Streaming Platforms
Managed Data Connectors
Moving Data with Object Storage
EDI
Databases and File Export
Practical Issues with Common File Formats
Shell
SSH
SFTP and SCP
Webhooks
Web Interface
Web Scraping
Transfer Appliances for Data Migration
Data Sharing
Whom You’ll Work With
Upstream Stakeholders
Downstream Stakeholders
Undercurrents
Security
Data Management
DataOps
Orchestration
Software Engineering
Conclusion
Additional Resources

Chapter 8. Queries, Modeling, and Transformation

Queries
What Is a Query?
The Life of a Query
The Query Optimizer
Improving Query Performance
Queries on Streaming Data
Data Modeling
What Is a Data Model?
Conceptual, Logical, and Physical Data Models
Normalization
Techniques for Modeling Batch Analytical Data
Modeling Streaming Data
Transformations
Batch Transformations
Materialized Views, Federation, and Query Virtualization
Streaming Transformations and Processing
Whom You’ll Work With
Upstream Stakeholders
Downstream Stakeholders
Undercurrents
Security
Data Management
DataOps
Data Architecture
Orchestration
Software Engineering
Conclusion
Additional Resources

Chapter 9. Serving Data for Analytics, Machine Learning, and Reverse ETL

General Considerations for Serving Data
Trust
What’s the Use Case, and Who’s the User?
Data Products
Self-Service or Not?
Data Definitions and Logic
Data Mesh
Analytics
Business Analytics
Operational Analytics
Embedded Analytics
Machine Learning
What a Data Engineer Should Know About ML
Ways to Serve Data for Analytics and ML
File Exchange
Databases
Streaming Systems
Query Federation
Data Sharing
Semantic and Metrics Layers
Serving Data in Notebooks
Reverse ETL
Whom You’ll Work With
Undercurrents
Security
Data Management
DataOps
Data Architecture
Orchestration
Software Engineering
Conclusion
Additional Resources

Part III. Security, Privacy, and the Future of Data Engineering

Chapter 10. Security and Privacy

People
The Power of Negative Thinking
Always Be Paranoid
Processes
Security Theater Versus Security Habit
Active Security
The Principle of Least Privilege
Shared Responsibility in the Cloud
Always Back Up Your Data
An Example Security Policy
Technology
Patch and Update Systems
Encryption
Logging, Monitoring, and Alerting
Network Access
Security for Low-Level Data Engineering
Conclusion
Additional Resources

Chapter 11. The Future of Data Engineering

The Data Engineering Lifecycle Isn’t Going Away
The Decline of Complexity and the Rise of Easy-to-Use Data Tools
The Cloud-Scale Data OS and Improved Interoperability
“Enterprisey” Data Engineering
Titles and Responsibilities Will Morph…
Moving Beyond the Modern Data Stack, Toward the Live Data Stack
The Live Data Stack
Streaming Pipelines and Real-Time Analytical Databases
The Fusion of Data with Applications
The Tight Feedback Between Applications and ML
Dark Matter Data and the Rise of…Spreadsheets?!
Conclusion

Appendix A. Serialization and Compression Technical Details

Serialization Formats

Row-Based Serialization
Columnar Serialization
Hybrid Serialization

Database Storage Engines

Compression: gzip, bzip2, Snappy, Etc.

Appendix B. Cloud Networking

Cloud Network Topology

Data Egress Charges
Availability Zones
Regions
GCP-Specific Networking and Multiregional Redundancy
Direct Network Connections to the Clouds

CDNs

The Future of Data Egress Fees
