
Massively Parallel Processing (MPP) is a computing architecture where a large number of independent processors (or "nodes") work together to solve a single problem.

Instead of one powerful computer trying to crunch all the data, an MPP system breaks a massive task into hundreds or thousands of smaller pieces and executes them simultaneously.

How MPP Works: The "Shared Nothing" Concept

The defining feature of MPP is its shared-nothing architecture. In a traditional computer, multiple processors might share the same memory (RAM) and disk, which creates a "traffic jam" (bottleneck) as they fight for resources.

In an MPP system:

  • Each Node is Independent: Every node has its own CPU, dedicated memory, and its own local storage.

  • Divide and Conquer: When you run a query (e.g., "Find the total sales for 2024"), a "Leader Node" splits the data and the task. Each "Worker Node" processes its own specific slice of data independently.

  • Aggregation: Once the worker nodes finish, they send their results back to be stitched together into the final answer.

 

How is data "sharded" across these nodes?

 

In a Massively Parallel Processing (MPP) system, sharding is the method used to split a massive dataset across many independent nodes. Without sharding, you’d just have a bunch of powerful computers with nothing to do.

Think of sharding as the "slicing" of the data, while MPP is the "simultaneous cooking" of those slices.

How Sharding Works in MPP

When you load a table into an MPP database, the system doesn't store the whole table in one place. Instead, it uses a Shard Key (a specific column, like Customer_ID or Transaction_Date) to decide which row goes to which node.

There are three primary strategies for doing this:

| Strategy | How it Works | Best For... |
| --- | --- | --- |
| Hash Sharding | A mathematical formula (hash) is applied to the shard key. If the result is "1," the data goes to Node 1. This ensures a very even, random-looking distribution. | Large tables where you want to avoid "hotspots" (one node doing all the work). |
| Round Robin | The system simply rotates through nodes: Row 1 goes to Node A, Row 2 to Node B, and so on. It is the simplest and fastest way to load data. | Staging data or tables where you don't have a clear "key" to filter by. |
| Replication | The entire table is copied onto every single node. This uses more storage but makes queries lightning-fast. | Small "lookup" tables (e.g., a list of Country Codes) that need to be joined with giant transaction tables. |
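
To make the three strategies concrete, here is a rough sketch using Amazon Redshift-style DDL (the tables and columns are made up for illustration): DISTKEY, DISTSTYLE EVEN, and DISTSTYLE ALL correspond roughly to hash sharding, round robin, and replication.

-- Hash sharding: rows are placed by hashing customer_id
CREATE TABLE sales (
    sale_id     INT,
    customer_id INT,
    amount      DECIMAL(12,2)
)
DISTSTYLE KEY DISTKEY (customer_id);

-- Round-robin-style even spread: rows are dealt out across nodes in turn
CREATE TABLE staging_events (
    event_id INT,
    payload  VARCHAR(1000)
)
DISTSTYLE EVEN;

-- Replication: the full table is copied to every node
CREATE TABLE country_codes (
    country_code CHAR(2),
    country_name VARCHAR(100)
)
DISTSTYLE ALL;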

The Goal: Avoiding the "Data Shuffle"

The "holy grail" of MPP performance is Data Locality.

If you want to join two massive tables (like Sales and Customers), and both are sharded using the same Customer_ID key, then the data for Customer #500 will live on Node A for both tables. Node A can join that data instantly without talking to any other node.

If the data is sharded poorly, the system has to perform a Data Shuffle—physically moving millions of rows across the network between nodes during the query. This is the #1 performance killer in MPP systems.
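
As a sketch of co-location (Redshift-style DDL again, assuming the sales table above is distributed by customer_id), giving both tables the same distribution key lets the join below run without a shuffle:

-- customers gets the same distribution key as sales, so matching rows land on the same node
CREATE TABLE customers (
    customer_id INT,
    region      VARCHAR(50)
)
DISTSTYLE KEY DISTKEY (customer_id);

-- Each node joins its own slice of sales and customers; no rows cross the network for the join
SELECT c.region, SUM(s.amount) AS total_revenue
FROM sales s
JOIN customers c ON c.customer_id = s.customer_id
GROUP BY c.region;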

 

Would you like to see a specific example of how a SQL query is rewritten by the "Leader Node" to run across these shards?

 

Yes, I would.

 

When you submit a SQL query to an MPP system, the Leader Node (the "brain") doesn't just execute the code. It acts like a general manager, rewriting your query into a distributed execution plan.

Here is how a standard query is broken down and executed across shards.

The Scenario

Imagine you have a Sales table with 100 million rows, sharded across 4 nodes. You run this query:

SELECT Store_ID, SUM(Revenue) 
FROM Sales 
GROUP BY Store_ID;

Step 1: Query Decomposition (The Rewriting)

The Leader Node realizes it cannot calculate the total sum on its own. It breaks your query into two distinct phases:

  1. The Local Phase: Each worker node calculates the sum for the slice of data it owns locally.

  2. The Global Phase: The Leader Node aggregates those subtotals into the final answer (sketched below).
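
Conceptually, the rewrite looks something like this. This is a simplified sketch, not literal engine output: the "local" query runs against each worker's own shard, and worker_partial_results is a hypothetical name for the collected intermediate sets.

-- Local phase: each worker runs this against only its own slice of Sales
SELECT Store_ID, SUM(Revenue) AS partial_revenue
FROM Sales               -- the worker's local shard
GROUP BY Store_ID;

-- Global phase: the Leader Node combines the partial results
SELECT Store_ID, SUM(partial_revenue) AS total_revenue
FROM worker_partial_results   -- hypothetical intermediate result set
GROUP BY Store_ID;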


Step 2: Parallel Execution

The Leader Node sends a modified version of the query to the Worker Nodes. The workers execute this simultaneously:

  • Node 1: Scans its shard → "Store A: $500, Store B: $200"

  • Node 2: Scans its shard → "Store A: $300, Store C: $400"

  • Node 3: Scans its shard → "Store B: $100, Store C: $100"

  • Node 4: Scans its shard → "Store A: $200, Store B: $300"

Step 3: The Aggregation (Final Result)

The Leader Node receives these small "intermediate" result sets. It does the final math:

  • Store A: $500 + $300 + $200 = **$1,000**

  • Store B: $200 + $100 + $300 = **$600**

  • Store C: $400 + $100 = **$500**


Why this is faster

In a traditional database, one CPU would have to read 100 million rows and keep a massive counter in memory.

In the MPP version:

  1. Each node only reads 25 million rows (a quarter of the table).

  2. The heavy lifting (the math) happens on 4 CPUs at once.

  3. The Leader Node only has to process 4 tiny result sets instead of 100 million raw records.

The "Skew" Trap

The only way this fails is if your sharding is uneven. If 90% of your sales are in "Store A" and you sharded by Store_ID, Node 1 would be doing all the work while the others sit idle. This is called Data Skew, and it's the most common reason an MPP system slows down.
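
A quick, engine-agnostic way to sanity-check a candidate shard key before committing to it is to look at how its values are distributed (a sketch using the example table above):

-- If a handful of Store_IDs dominate the row count, sharding by Store_ID will skew
SELECT Store_ID, COUNT(*) AS row_count
FROM Sales
GROUP BY Store_ID
ORDER BY row_count DESC
LIMIT 10;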

 

Is the 'Shard Key' automatically chosen by the MPP database or manually selected by the coder?

 

The answer depends entirely on which modern database you are using. In the early days of MPP (like older versions of Amazon Redshift), it was a manual task for the coder. Today, the industry is split into two "schools of thought": Manual Control and Automated Sharding.

1. The Manual School (e.g., Amazon Redshift)

In traditional MPP databases, the coder manually selects a "Distribution Key" (Shard Key) when they create a table.

  • The Pro: You have total control. If you know exactly how your data will be joined, you can force the database to be incredibly fast by co-locating data.

  • The Con: If you choose a bad key (like a column with too many nulls), you create Data Skew, where one node does all the work while others sit idle. Changing a key usually requires rebuilding the entire table.

-- Example of manual sharding in Redshift
CREATE TABLE sales (
    sale_id     INT,
    customer_id INT
)
DISTSTYLE KEY DISTKEY (customer_id); -- Manually choosing the shard key

2. The Automated School (e.g., Snowflake & BigQuery)

Modern "cloud-native" MPP systems have largely moved away from manual shard keys to make life easier for developers.

  • Snowflake: Uses Micro-partitioning. It automatically divides your data into small groups (50–500MB) as it arrives. You don't pick a "shard key"; the system manages the distribution behind the scenes (though you can still give it a hint, as sketched below).

  • Google BigQuery: Completely abstracts sharding. It uses a "dynamic" approach where it might re-shuffle data in the middle of a query if it notices one node is struggling.
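
Even in the automated school you can nudge the engine. Snowflake, for instance, lets you declare a clustering key; this is not a node-level shard key but a hint about how rows should be grouped into micro-partitions, which improves pruning (sketch with a hypothetical table):

-- Snowflake: no shard key to pick, but you may suggest how rows are co-located
ALTER TABLE sales CLUSTER BY (customer_id);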

| Feature | Manual (Redshift/Greenplum) | Automated (Snowflake/BigQuery) |
| --- | --- | --- |
| Who chooses? | The Developer/DBA | The Database Engine |
| Complexity | High (Requires planning) | Low (Zero-management) |
| Risk | High (Potential for "Data Skew") | Low (System self-optimizes) |
| Performance | Can be faster if tuned perfectly | Consistently fast for most cases |

Which is better?

If you are a Performance Specialist working with a predictable, massive dataset on AWS, you might prefer Manual keys to squeeze every millisecond of speed out of the hardware.

However, most Modern Data Teams prefer Automated systems. They would rather pay a little extra for the database to "figure it out" than spend dozens of hours a week manually tuning shard keys and managing data skew.

 

I'd like to know which school ClickHouse and Doris are in.

 

In the world of MPP databases, ClickHouse and Apache Doris represent two very different approaches to data sharding.

ClickHouse is firmly in the "Manual School," prioritizing control and transparency, while Apache Doris has evolved into a "Hybrid School" that leans heavily toward automation.


1. ClickHouse: The Manual School

ClickHouse is a "low-level" power tool. It gives the coder total control over the architecture, but this means you have to build the sharding logic yourself.

  • Explicit Setup: In ClickHouse, you don't just create a table; you create a Local Table (on each individual node) and then a Distributed Table (a "view" that sits on top of them).

  • Manual Shard Key: When you define the Distributed Table, you must explicitly provide a sharding_key (e.g., cityHash64(user_id)). If you don't define this, the database won't know how to split your data.

  • No Automatic Rebalancing: If you add 5 new nodes to a ClickHouse cluster, the database will not automatically move your old data to the new nodes. You have to manually "reshard" the data, which is a complex operation involving moving partitions.


2. Apache Doris: The Hybrid School

Apache Doris is designed to be much "friendlier" and automated, similar to modern cloud warehouses like Snowflake, but it still allows manual overrides.

  • Two-Level Sharding: Doris uses a unique "Partition + Bucket" system.

    • Partitions (Range/List): Usually based on time (e.g., one partition per day).

    • Buckets (Hash): The actual shards distributed across the nodes.

  • Automation (Auto-Partitioning): As of version 2.1, Doris supports Auto-Partitioning. You don't have to pre-create partitions for tomorrow's data; as data flows in, Doris automatically creates the necessary partitions on the fly.

  • Automatic Management: Unlike ClickHouse, Doris handles Tablet Scheduling and Load Balancing automatically. If a node fails or a new one is added, the system automatically moves "tablets" (shards) around to keep the cluster balanced without the coder's intervention.


Summary Table

| Feature | ClickHouse | Apache Doris |
| --- | --- | --- |
| Philosophy | "Provide the tools, user builds the engine." | "Provide the engine, system manages the data." |
| Setup Complexity | High: Manual local/distributed tables. | Low: Single CREATE TABLE statement. |
| Scaling | Manual: Coder must move data to new nodes. | Automatic: System rebalances shards itself. |
| Best For... | Hardcore performance tuning and custom architectures. | Rapid development and "worry-free" scaling. |

Would you like to see the specific SQL syntax difference between creating a sharded table in ClickHouse versus Apache Doris?

 

Yes, I would.

 

To see the difference in their "schools of thought," look at how much code is required to simply get data spread across multiple nodes.

1. ClickHouse (Manual School)

In ClickHouse, you have to be very explicit. You first create the "skeleton" (the local table) on every machine, and then create a "controller" (the distributed table) to tell ClickHouse how to split the data using a shard key.

Step A: The Local Table (Created on every node)

CREATE TABLE sales_local ON CLUSTER 'my_cluster' (
    order_id    UInt64,
    customer_id UInt64,
    amount      Float64
)
ENGINE = MergeTree()
ORDER BY order_id;

Step B: The Distributed Table (The Shard Logic)

CREATE TABLE sales_distributed ON CLUSTER 'my_cluster' AS sales_local
ENGINE = Distributed('my_cluster', 'default', 'sales_local', rand());
-- 'rand()' is the manual sharding expression (random, roughly even distribution)

Note: If you want to scale this, you have to manage those local tables manually.
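
Once both tables exist, reads and writes usually go through the distributed table, which fans them out to the local tables (a sketch continuing the hypothetical cluster above):

-- The insert is routed to one shard according to the sharding expression (rand() here)
INSERT INTO sales_distributed VALUES (1001, 42, 19.99);

-- The query fans out to every node's sales_local and merges the results
SELECT customer_id, SUM(amount) AS total_amount
FROM sales_distributed
GROUP BY customer_id;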


2. Apache Doris (Hybrid/Automated School)

In Doris, the syntax looks much more like a traditional database. You define the sharding logic (Bucketing) inside a single command. Doris handles the creation of "local" storage behind the scenes.

The Single Table Approach

CREATE TABLE sales (
    order_id    BIGINT,
    customer_id BIGINT,
    amount      DOUBLE
)
DUPLICATE KEY(order_id)
PARTITION BY RANGE(order_id) ()  -- Optional: empty partition list; partitions can be added or auto-created later
DISTRIBUTED BY HASH(customer_id) BUCKETS AUTO;
-- 'BUCKETS AUTO' tells Doris to decide how many shards are needed
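
Tying back to the Auto-Partitioning feature mentioned earlier, here is a sketch of a time-partitioned table in the Doris 2.1 style (hypothetical names; exact options vary by version):

CREATE TABLE sales_events (
    event_time DATETIME NOT NULL,
    order_id   BIGINT,
    amount     DOUBLE
)
DUPLICATE KEY(event_time)
AUTO PARTITION BY RANGE (date_trunc(event_time, 'day')) ()  -- partitions are created as data arrives
DISTRIBUTED BY HASH(order_id) BUCKETS AUTO;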

Key Takeaways from the Syntax

  • ClickHouse gives you the "Gears": You see the Distributed engine and the Local engine. You are responsible for the "plumbing" between the nodes. This is why it’s popular for engineers who want to know exactly where every byte of data is sitting.

  • Doris gives you the "Dashboard": You use DISTRIBUTED BY HASH and BUCKETS AUTO. You tell the system your intent (shard by customer ID), and the Doris "Frontend" (FE) manages the "Backends" (BE) for you.

Summary Comparison

| Task | ClickHouse | Apache Doris |
| --- | --- | --- |
| Defining Shards | Two-step process (Local + Distributed). | Single-step process. |
| Shard Key | Required for distribution. | Required (as Buckets). |
| Adding Nodes | Requires manual config & data movement. | Data automatically migrates to new nodes. |
| Complexity | High (Powerful but "Heavy"). | Low (User-friendly "Cloud" feel). |

 
