
MATCH_RECOGNIZE is a powerful ANSI SQL standard feature (introduced in SQL:2016) used for Complex Event Processing (CEP). It allows you to search for patterns across multiple rows of data, similar to how Regular Expressions work for strings, but applied to rows in a table.

It is most commonly used in streaming engines like Apache Flink, and is also supported by databases such as Oracle and Snowflake.


How it Works: The Structure

The clause follows a specific sequence to define how rows should be grouped, ordered, and matched.

SELECT *
FROM Ticker
MATCH_RECOGNIZE (
    PARTITION BY symbol          -- Group data (e.g., by stock ticker)
    ORDER BY tstamp              -- Sort data (usually by time)
    MEASURES                     -- Define the output columns
        FIRST(START_ROW.tstamp) AS start_tstamp,
        LAST(PRICE_DOWN.tstamp) AS bottom_tstamp,
        LAST(PRICE_UP.tstamp)   AS end_tstamp
    ONE ROW PER MATCH            -- Output summary or all rows
    AFTER MATCH SKIP TO LAST PRICE_UP          -- Where to start the next search
    PATTERN (START_ROW PRICE_DOWN+ PRICE_UP+)  -- The Regex pattern
    DEFINE                       -- Define what the variables mean
        PRICE_DOWN AS price < PREV(price),
        PRICE_UP   AS price > PREV(price)
)

Core Components Explained

  • PARTITION BY & ORDER BY: You cannot find a pattern if the data is jumbled. This organizes the stream into logical partitions and ensures they are chronologically sorted.

  • PATTERN: This is where you define the sequence using Regex-style quantifiers:

    • A B: A followed by B.

    • A+: One or more A.

    • A*: Zero or more A.

    • A?: Zero or one A.

  • DEFINE: This is the logic for your variables. In the example above, PRICE_UP is only met if the current row's price is higher than the PREV (previous) row's price.

  • MEASURES: Since a match spans multiple rows, MEASURES allows you to calculate aggregate values for the entire match (like the start price vs. end price).
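To make these pieces concrete, here is a hypothetical run of the skeleton query above (the symbol, timestamps, and prices are invented for illustration):

symbol  tstamp  price   matched as
------  ------  -----   ----------
ACME    09:00   12.0    START_ROW
ACME    09:01   11.0    PRICE_DOWN
ACME    09:02   10.0    PRICE_DOWN
ACME    09:03   11.0    PRICE_UP
ACME    09:04   13.0    PRICE_UP

With ONE ROW PER MATCH, the query emits a single summary row: start_tstamp = 09:00, bottom_tstamp = 09:02 (the LAST PRICE_DOWN row), and end_tstamp = 09:04. AFTER MATCH SKIP TO LAST PRICE_UP then resumes the search at the 09:04 row.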


Common Use Cases

  • V-Shape Recovery: A decrease in price followed by an immediate increase.

  • Double Bottom: Price drops, rises slightly, drops again to the same level, then rises.

  • Fraud Detection: A "Login" event followed by "Change Password" and "Large Withdrawal" within a short time.

  • Sensor Jitter: A temperature sensor fluctuating above a threshold three or more times in a row.

Why use this over Standard SQL?

Without MATCH_RECOGNIZE, you would have to use complex Self-Joins or Window Functions (like LAG and LEAD), which become unreadable and perform poorly as the pattern length increases. MATCH_RECOGNIZE is processed in a single pass over the data, making it highly efficient for real-time streaming.
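For contrast, here is a minimal window-function sketch of the same V-shape idea, limited to a fixed-length V (exactly one falling tick followed by one rising tick). It reuses the Ticker table from above; every extra tick you want to allow requires another hard-coded LAG column:

SELECT symbol, tstamp
FROM (
    SELECT symbol, tstamp, price,
           -- One LAG column per step of the pattern
           LAG(price, 1) OVER (PARTITION BY symbol ORDER BY tstamp) AS prev_price,
           LAG(price, 2) OVER (PARTITION BY symbol ORDER BY tstamp) AS prev2_price
    FROM Ticker
) t
WHERE prev2_price > prev_price   -- one falling step
  AND price > prev_price;        -- one rising step

This version cannot express "one or more falling ticks" at all; you would need a recursive CTE or one self-join per additional step, which is exactly what PATTERN (START_ROW PRICE_DOWN+ PRICE_UP+) expresses in a single line.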

Would you like me to create a specific SQL example for a use case like fraud detection or website clickstream analysis?

 

Yes, I would. Please create a specific SQL example for each of the use cases you listed in Common Use Cases.

 

Here are specific ANSI SQL MATCH_RECOGNIZE examples for each of the use cases mentioned. These examples use syntax compatible with Apache Flink SQL and Oracle.


1. V-Shape Recovery (Market Analysis)

This pattern detects a "dip" in price followed by an immediate recovery. We define the "Bottom" of the V as the point where the price stops falling and starts rising.

SELECT *
FROM StockTicks
MATCH_RECOGNIZE (
    PARTITION BY symbol
    ORDER BY rowtime
    MEASURES
        START_ROW.price AS start_price,
        BOTTOM.price    AS bottom_price,
        LAST(UP.price)  AS recovery_price
    ONE ROW PER MATCH
    PATTERN (START_ROW DOWN+ BOTTOM UP+)
    DEFINE
        DOWN   AS price < PREV(price),
        -- BOTTOM is the last falling tick before the recovery begins
        BOTTOM AS price < PREV(price),
        UP     AS price > PREV(price)
);

2. Double Bottom (Technical Trading)

A "Double Bottom" looks like a "W". The price drops, hits a floor, rises, drops back to that same floor, and then breaks out upward.

 
SELECT *
FROM StockTicks
MATCH_RECOGNIZE (
    PARTITION BY symbol
    ORDER BY rowtime
    MEASURES
        FIRST(B1.price) AS first_low,
        FIRST(B2.price) AS second_low
    ONE ROW PER MATCH
    PATTERN (START_ROW DOWN+ B1 UP+ B2 UP+)
    DEFINE
        DOWN AS price < PREV(price),
        B1   AS price < PREV(price),
        UP   AS price > PREV(price),
        -- The second bottom must be within 1% of the first bottom
        B2   AS price < PREV(price)
             AND ABS(price - FIRST(B1.price)) < (FIRST(B1.price) * 0.01)
);

3. Fraud Detection (Security)

Detecting a high-risk sequence: A user logs in, immediately changes their password, and then initiates a withdrawal larger than $10,000, all within a short timeframe.

SELECT *
FROM AccountEvents
MATCH_RECOGNIZE (
    PARTITION BY account_id
    ORDER BY event_time
    MEASURES
        A.event_time AS login_time,
        C.amount     AS withdrawal_amount
    ONE ROW PER MATCH
    PATTERN (A B C)
    DEFINE
        A AS event_type = 'LOGIN',
        B AS event_type = 'PASSWORD_CHANGE',
        C AS event_type = 'WITHDRAWAL' AND amount > 10000
);
-- Note: In Flink, you'd typically add WITHIN INTERVAL '1' HOUR to the
-- PATTERN to ensure the sequence happens within 1 hour (see below)

4. Sensor Jitter (IoT/Manufacturing)

This pattern identifies a malfunctioning sensor that is "jittering": fluctuating rapidly above a certain threshold (e.g., 100°C) three or more times in a row.

SELECT *
FROM SensorReadings
MATCH_RECOGNIZE (
    PARTITION BY sensor_id
    ORDER BY ts
    MEASURES
        FIRST(JITTER.ts)   AS jitter_start,
        LAST(JITTER.ts)    AS jitter_end,
        COUNT(JITTER.temp) AS spike_count
    ONE ROW PER MATCH
    PATTERN (JITTER{3,}) -- Matches 3 or more consecutive JITTER events
    DEFINE
        -- Jitter is defined as a reading > 100 that is also more than
        -- 10 degrees away from the previous reading
        JITTER AS temp > 100 AND ABS(temp - PREV(temp)) > 10
);

Key Takeaways for Writing Patterns

  • Quantifiers: Notice {3,} in the Sensor Jitter example? That means "at least 3." You can also use + (1 or more) or * (0 or more).

  • Strict Contiguity: By default, MATCH_RECOGNIZE looks for rows that are strictly next to each other in the sorted order.

  • Logical Offsets: PREV() is your best friend here. It allows you to compare the current row to the one right before it to detect trends (rising/falling).
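As a quick sketch of the last two points, here is a hedged variant of the jitter query (table and columns reuse the Sensor Jitter example; the bounded quantifier and the two-row PREV offset are the parts being demonstrated, and support for PREV offsets varies by engine):

SELECT *
FROM SensorReadings
MATCH_RECOGNIZE (
    PARTITION BY sensor_id
    ORDER BY ts
    MEASURES
        FIRST(RISE.ts) AS rise_start,
        LAST(RISE.ts)  AS rise_end
    ONE ROW PER MATCH
    -- RISE{2,5} means "between 2 and 5 consecutive RISE rows"
    PATTERN (RISE{2,5})
    DEFINE
        -- PREV takes an optional offset: PREV(temp, 2) is the reading two rows back
        RISE AS temp > PREV(temp) AND temp > PREV(temp, 2)
);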

Would you like to see how to handle "Late Data" in these SQL queries using Flink's Watermarks?

 

Yes, I would.

 

In Apache Flink, combining MATCH_RECOGNIZE with watermarks is essential because the SQL engine needs to know when it can "give up" waiting for a pattern to complete and move on.

When you use event-time (the timestamp inside your data), Flink uses the Watermark as a signal that no more old data will arrive. This allows the system to clean up memory and finalize matches.

How Watermarks Impact the Query

Without a watermark, a pattern like "Login followed by Logout" might stay in Flink's memory forever if the "Logout" event never arrives. To prevent this, we use a WITHIN clause or windowing logic.

Example: Fraud Detection with Watermarks

Here is how you would write a robust Flink SQL query that handles late data and out-of-order events using a 1-hour time constraint.

-- 1. Create the table with a Watermark strategy
CREATE TABLE AccountEvents (
    account_id INT,
    event_time TIMESTAMP(3),
    event_type STRING,
    amount DOUBLE,
    -- Extract the timestamp and allow 5 seconds of 'lateness'
    WATERMARK FOR event_time AS event_time - INTERVAL '5' SECOND
) WITH (
    'connector' = 'kafka',
    ...
);

-- 2. Run the MATCH_RECOGNIZE query
SELECT *
FROM AccountEvents
MATCH_RECOGNIZE (
    PARTITION BY account_id
    ORDER BY event_time
    MEASURES
        A.event_time AS login_time,
        C.event_time AS withdrawal_time
    ONE ROW PER MATCH
    -- We add a time constraint to the pattern itself
    PATTERN (A B C) WITHIN INTERVAL '1' HOUR
    DEFINE
        A AS event_type = 'LOGIN',
        B AS event_type = 'PASSWORD_CHANGE',
        C AS event_type = 'WITHDRAWAL' AND amount > 10000
);

What happens behind the scenes?

  1. Buffering: As rows arrive, Flink buffers them in state, sorted by event_time.

  2. Watermark Arrival: When a Watermark for 12:00:00 arrives, Flink assumes that no more rows with timestamps at or before 12:00:00 will arrive, so matches up to that point can be finalized.

  3. Pattern Matching: Flink attempts to match the sequence A -> B -> C.

  4. Pruning: Because of the WITHIN INTERVAL '1' HOUR clause, if an A (Login) happened at 10:00:00 and the Watermark passes 11:00:00 without a C occurring, Flink automatically drops that partial match from memory to save space.
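A hypothetical timeline makes the pruning step visible (timestamps invented for illustration):

event_time   event                 watermark    state after processing
10:00:00     LOGIN (A)             09:59:55     partial match {A} buffered
10:20:00     PASSWORD_CHANGE (B)   10:19:55     partial match {A, B} buffered
11:10:00     (unrelated event)     11:09:55     watermark is past 10:00:00 + 1 hour,
                                                so partial match {A, B} is pruned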

Handling "Late" Data

If a record arrives after the watermark has already passed its timestamp (e.g., a record from 12:00:00 arrives when the watermark is at 12:00:10), Flink considers it "Late Data".

  • In SQL, late data is typically dropped by default to maintain correctness.

  • If you need to keep it, you would need to increase the "out-of-orderness" interval in your WATERMARK definition (changing '5' SECOND to something larger).
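For example, a hedged tweak to the table definition above (the 2-minute value is arbitrary; larger values keep more late data but increase latency and state size):

-- Tolerate up to 2 minutes of out-of-orderness instead of 5 seconds
WATERMARK FOR event_time AS event_time - INTERVAL '2' MINUTE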

Would you like to see how to capture these "timed-out" partial matches (e.g., catching everyone who logged in but didn't log out) using OMIT EMPTY MATCHES or similar logic?

 
