Description
Company: Perplexity
Problem Requirements
You are given a stream of text characters. This stream might be infinite, or just too large to hold in your computer's memory at once. You are also given a list of "stop words" (sensitive words).
Your task is to return all the text that appears before the first time any stop word shows up.
Rules:
- Save Memory: You cannot load the whole text at once. You must read it in small parts (chunks).
- Use Python Generators: You must use the yield keyword to process the text as it comes in.
- Handle Split Words: A stop word might get cut in half between two chunks. Your code must still find it.
Example:
```python
stop_words = ["<stop>", "<end>"]
stream_chunks = ["This is a te", "st<st", "op> message"]
# Expected output: "This is a test"
# Reason: The word "<stop>" is split between chunk 2 ("st<st") and chunk 3 ("op> message")
```
Part 1: How to Solve It
The Main Challenge: Split Words
The hardest part of this problem is finding a stop word when it is cut across two different chunks.
Look at this example:
- Chunk 1: "...te"
- Chunk 2: "st<st"
- Chunk 3: "op>..."
The word <stop> is hidden. It starts at the end of Chunk 2 and finishes at the start of Chunk 3. If we check each chunk one by one, we will miss it.
The Solution: The Buffer Method
The Plan:
- Keep a small buffer. This holds characters from the end of the previous chunk.
- When a new chunk arrives, add the buffer to the front of it.
- After checking for stop words, save the last few characters to the buffer for the next time.
- This connects the end of one chunk to the start of the next.
How big should the buffer be?
We use max_stop_word_length - 1.
- Imagine the longest stop word is 6 letters long.
- We only need to save the last 5 letters from the current chunk.
- When we combine those 5 letters with the first letter of the new chunk, we have 6 letters. This is enough to find the word.
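A minimal sketch of just this seam-reconnection step (the full solution in Part 2 also yields the safe prefix as it goes; the variable names here are ours):

```python
max_len = len("<stop>")  # longest stop word: 6 characters
buffer = ""
combined_views = []

for chunk in ["This is a te", "st<st", "op> message"]:
    text = buffer + chunk           # reconnect the seam with the previous chunk
    combined_views.append(text)
    buffer = text[-(max_len - 1):]  # keep at most 5 trailing characters

print(combined_views[2])  # prints "st<stop> message" -- the split word is whole
```

Each combined view overlaps the previous chunk by up to 5 characters, so the third view contains the full "<stop>" even though no single chunk did.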
Part 2: Python Solution
Code Implementation
```python
from typing import Iterator, List


def process_stream_with_stopwords(
    stream: Iterator[str],
    stop_words: List[str]
) -> Iterator[str]:
    """
    Read a stream of text and yield characters until a stop word is found.

    Args:
        stream: An iterator that gives us text chunks
        stop_words: A list of words to look for

    Yields:
        Characters found before the first stop word
    """
    if not stop_words:
        # If there are no stop words, just return the whole stream
        for chunk in stream:
            yield chunk
        return

    # Find the length of the longest stop word
    max_stop_len = max(len(word) for word in stop_words)

    # A buffer to save the end of the previous chunk
    buffer = ""

    for chunk in stream:
        # Join the buffer with the new chunk
        text = buffer + chunk

        # Look for stop words in this combined text
        earliest_pos = len(text)  # Sentinel: past the end means "not found"
        found_stop_word = None

        for stop_word in stop_words:
            pos = text.find(stop_word)
            # If we found a word, and it appears earlier than any previous find
            if pos != -1 and pos < earliest_pos:
                earliest_pos = pos
                found_stop_word = stop_word

        if found_stop_word:
            # We found a stop word! Yield the text before it and stop.
            if earliest_pos > 0:
                yield text[:earliest_pos]
            return

        # No stop word found yet.
        # We can safely yield everything except the last few characters.
        # We need to keep (max_stop_len - 1) characters for the next check.
        safe_to_yield_len = max(0, len(text) - (max_stop_len - 1))
        if safe_to_yield_len > 0:
            yield text[:safe_to_yield_len]
            # Save the rest of the text to the buffer
            buffer = text[safe_to_yield_len:]
        else:
            # If the text is too short, keep it all in the buffer
            buffer = text

    # If the stream ends and we still have text in the buffer, yield it
    if buffer:
        yield buffer


def extract_text_before_stopword(
    stream: Iterator[str],
    stop_words: List[str]
) -> str:
    """
    A helper function to get the final string easily.

    Args:
        stream: An iterator giving string chunks
        stop_words: A list of stop words

    Returns:
        The full string before the first stop word
    """
    return ''.join(process_stream_with_stopwords(stream, stop_words))
```
Examples
```python
# Example 1: Stop word is split between chunks
def create_stream_1():
    chunks = ["This is a te", "st<st", "op> message"]
    for chunk in chunks:
        yield chunk

stop_words = ["<stop>", "<end>"]
result = extract_text_before_stopword(create_stream_1(), stop_words)
print(result)  # Output: "This is a test"

# Example 2: Stop word is inside one chunk
def create_stream_2():
    chunks = ["Hello world", " <stop> more", " text"]
    for chunk in chunks:
        yield chunk

result = extract_text_before_stopword(create_stream_2(), stop_words)
print(result)  # Output: "Hello world "

# Example 3: No stop word at all
def create_stream_3():
    chunks = ["This is ", "a normal ", "text"]
    for chunk in chunks:
        yield chunk

result = extract_text_before_stopword(create_stream_3(), stop_words)
print(result)  # Output: "This is a normal text"
```
Part 3: Tricky Cases & Speed Improvements
Common Edge Cases
- Empty Stream: The input has no data. The code should return an empty string.
- Stop Word at Start: If the stream starts with "<stop>", the result should be empty.
- Overlapping Words: If two stop words both appear in the text, the code must stop at the one that occurs earliest in the text, not the one listed first.
- Tiny Chunks: If chunks are 1 letter each (like "<", "s", "t", "o", "p", ">"), the buffer logic must still catch the word.
- Long Stop Words: If a stop word is longer than the chunk size, the buffer (which holds max_stop_len - 1 characters) carries text across several chunks, so the word is still found.
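To exercise these edge cases quickly, here is a condensed restatement of the Part 2 generator (the name before_first_stop is ours) run against the tiny-chunk and start-of-stream cases:

```python
def before_first_stop(stream, stop_words):
    """Condensed restatement of the Part 2 generator, for testing."""
    if not stop_words:
        yield from stream
        return
    max_len = max(len(w) for w in stop_words)
    buffer = ""
    for chunk in stream:
        text = buffer + chunk
        hits = [p for p in (text.find(w) for w in stop_words) if p != -1]
        if hits:
            yield text[:min(hits)]   # text before the earliest stop word
            return
        safe = max(0, len(text) - (max_len - 1))
        yield text[:safe]
        buffer = text[safe:]
    if buffer:
        yield buffer

# Tiny chunks: the stop word arrives one character at a time.
tiny = iter(["a", "b", "<", "s", "t", "o", "p", ">", "c"])
print("".join(before_first_stop(tiny, ["<stop>"])))  # prints "ab"

# Stop word at the very start: the result is empty.
print("".join(before_first_stop(iter(["<stop>tail"]), ["<stop>"])))  # prints ""
```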
Making It Faster (Optimization)
- Use a Trie (Prefix Tree)
  - If you have thousands of stop words, checking them one by one is slow. A Trie is a tree-like data structure that lets you search for all words at the same time.
  - Why use it?
    - Simple Search: Complexity is O(n * m * k) (slow with many words).
    - Trie Search: Complexity is closer to O(n * k) (much faster).
- Aho-Corasick Algorithm
  - This is a famous algorithm used in production systems. It is the best way to find multiple patterns in text at once. It has a time complexity of O(n + m + z), which is very efficient.
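As a rough illustration of the Trie idea (a plain dict-of-dicts prefix tree, not the full Aho-Corasick automaton with failure links; the function names are ours), one pass per starting position checks all stop words at once:

```python
def build_trie(words):
    """Build a dict-of-dicts trie. Assumes '$' does not appear in the words."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node["$"] = word  # terminal marker stores the full word
    return root


def earliest_match(text, trie):
    """Return (start, word) for the earliest stop word in text, or None."""
    for start in range(len(text)):
        node = trie
        for ch in text[start:]:
            if ch not in node:
                break
            node = node[ch]
            if "$" in node:  # a complete stop word ends here
                return start, node["$"]
    return None


trie = build_trie(["<stop>", "<end>"])
print(earliest_match("This is a test<stop> more", trie))  # prints (14, '<stop>')
```

Each starting position walks the trie at most k steps (k = longest stop word), matching the O(n * k) bound above regardless of how many words the trie holds.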
Memory Usage
- Buffer Space: O(L), where L is the length of the longest stop word.
- Space for Stop Words: O(m * k), where m is the number of words and k is their average length.
- Total Space: It stays small, O(L + m * k). It does not grow with the size of the stream. This is why it works for terabytes of data.
Part 4: Interview Discussion Topics
1. Why use Generators?
- To save memory.
- Generators process data "lazily" (only when asked).
- Using yield keeps only one chunk in memory at a time.
- If you tried to load a 100GB file into a variable, your program would crash. Generators prevent this.
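A quick way to see the memory difference (the exact byte counts are CPython-specific, so treat the numbers as illustrative):

```python
import sys

# A generator holds no data up front; a list materializes everything.
lazy = (ch for ch in "x" * 1_000_000)
eager = list("x" * 1_000_000)

print(sys.getsizeof(lazy) < sys.getsizeof(eager))  # True: bytes vs. megabytes
```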
2. Why is the buffer size max_stop_len - 1?
The Logic:
- Let's say the stop word is "<stop>" (6 chars).
- To find this word, we need to see 6 letters in a row.
- If the word is split, some letters are in the old chunk, and some are in the new one.
- By keeping the last 5 letters (6 - 1) of the old chunk, and adding the first letter of the new chunk, we have 6 letters. We can now see the whole word.
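A small counter-check (the helper name and chunk values are ours): keeping fewer than max_len - 1 trailing characters can miss a split stop word.

```python
def finds_split_word(keep):
    """Return True if "<stop>" is found when keeping `keep` trailing chars."""
    buffer = ""
    for chunk in ["abc<st", "op>def"]:
        text = buffer + chunk
        if "<stop>" in text:
            return True
        buffer = text[-keep:] if keep else ""
    return False

print(finds_split_word(5))  # keep max_len - 1 = 5 chars -> True (found)
print(finds_split_word(2))  # keep only 2 chars          -> False (missed)
```

With keep=2 the buffer holds only "st", so the second combined text is "stop>def", which never contains the full "<stop>".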
3. Python Tips for Production
If an interviewer asks, "How would you run this in a real app?", mention these:
- collections.deque: Good for buffers if you need to add/remove items frequently.
- Regex (re module): You can use Python's regex to find patterns, which might be cleaner to write but sometimes slower to run.
- Encoding: Always remember to handle UTF-8 properly, especially if the text contains emojis or foreign languages.
4. Another Way: Using Regex
You can use Regular Expressions to solve this.
```python
import re


def process_with_regex(stream, stop_words):
    # Make the stop words safe for regex
    escaped = [re.escape(word) for word in stop_words]
    # Create a pattern like "word1|word2|word3"
    pattern = '|'.join(escaped)
    regex = re.compile(pattern)

    buffer = ""
    max_stop_len = max(len(word) for word in stop_words)

    for chunk in stream:
        text = buffer + chunk
        match = regex.search(text)

        if match:
            # Yield text up to the match
            yield text[:match.start()]
            return

        # Same buffer logic as before
        safe_len = max(0, len(text) - (max_stop_len - 1))
        if safe_len > 0:
            yield text[:safe_len]
            buffer = text[safe_len:]
        else:
            buffer = text

    if buffer:
        yield buffer
```
Similar Practice Problems
- LeetCode 76: Minimum Window Substring (Uses sliding windows).
- LeetCode 438: Find All Anagrams in a String.
- System Design: How to parse huge log files or analyze real-time chats.