Description
Company: Perplexity
Problem Requirements
You are given a stream of text characters. This stream might be infinite, or just too large to hold in your computer's memory at once. You are also given a list of "stop words" (sensitive words).
Your task is to return all the text that appears before the first time any stop word shows up.
Rules:
- Save Memory: You cannot load the whole text at once. You must read it in small parts (chunks).
- Use Python Generators: You must use the yield keyword to process the text as it comes in.
- Handle Split Words: A stop word might get cut in half between two chunks. Your code must still find it.
Example:
```python
stop_words = ["<stop>", "<end>"]
stream_chunks = ["This is a te", "st<st", "op> message"]
# Expected output: "This is a test"
# Reason: The word "<stop>" is split between chunk 2 ("st<st") and chunk 3 ("op> message")
```
Part 1: How to Solve It
The Main Challenge: Split Words
The hardest part of this problem is finding a stop word when it is cut across two different chunks.
Look at this example:
- Chunk 1: "...te"
- Chunk 2: "st<st"
- Chunk 3: "op>..."
The word <stop> is hidden. It starts at the end of Chunk 2 and finishes at the start of Chunk 3. If we check each chunk one by one, we will miss it.
The Solution: The Buffer Method
The Plan:
- Keep a small buffer. This holds characters from the end of the previous chunk.
- When a new chunk arrives, add the buffer to the front of it.
- After checking for stop words, save the last few characters to the buffer for the next time.
- This connects the end of one chunk to the start of the next.
How big should the buffer be?
We use max_stop_word_length - 1.
- Imagine the longest stop word is 6 letters long.
- We only need to save the last 5 letters from the current chunk.
- When we combine those 5 letters with the first letter of the new chunk, we have 6 letters. This is enough to find the word.
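A minimal sketch of just this seam-reconnection step (the full solution in Part 2 also yields the safe prefix as it goes; the variable names here are ours):

```python
max_len = len("<stop>")  # longest stop word: 6 characters
buffer = ""
combined_views = []

for chunk in ["This is a te", "st<st", "op> message"]:
    text = buffer + chunk           # reconnect the seam with the previous chunk
    combined_views.append(text)
    buffer = text[-(max_len - 1):]  # keep at most 5 trailing characters

print(combined_views[2])  # prints "st<stop> message" -- the split word is whole
```

Each combined view overlaps the previous chunk by up to 5 characters, so the third view contains the full "<stop>" even though no single chunk did.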
Part 2: Python Solution
Code Implementation
```python
from typing import Iterator, List


def process_stream_with_stopwords(
    stream: Iterator[str],
    stop_words: List[str]
) -> Iterator[str]:
    """
    Read a stream of text and yield characters until a stop word is found.

    Args:
        stream: An iterator that gives us text chunks
        stop_words: A list of words to look for

    Yields:
        Characters found before the first stop word
    """
    if not stop_words:
        # If there are no stop words, just return the whole stream
        for chunk in stream:
            yield chunk
        return

    # Find the length of the longest stop word
    max_stop_len = max(len(word) for word in stop_words)

    # A buffer to save the end of the previous chunk
    buffer = ""

    for chunk in stream:
        # Join the buffer with the new chunk
        text = buffer + chunk

        # Look for stop words in this combined text
        earliest_pos = len(text)  # Sentinel: past the end means "not found"
        found_stop_word = None

        for stop_word in stop_words:
            pos = text.find(stop_word)
            # If we found a word, and it appears earlier than any previous find
            if pos != -1 and pos < earliest_pos:
                earliest_pos = pos
                found_stop_word = stop_word

        if found_stop_word:
            # We found a stop word! Yield the text before it and stop.
            if earliest_pos > 0:
                yield text[:earliest_pos]
            return

        # No stop word found yet.
        # We can safely yield everything except the last few characters.
        # We need to keep (max_stop_len - 1) characters for the next check.
        safe_to_yield_len = max(0, len(text) - (max_stop_len - 1))
        if safe_to_yield_len > 0:
            yield text[:safe_to_yield_len]
            # Save the rest of the text to the buffer
            buffer = text[safe_to_yield_len:]
        else:
            # If the text is too short, keep it all in the buffer
            buffer = text

    # If the stream ends and we still have text in the buffer, yield it
    if buffer:
        yield buffer


def extract_text_before_stopword(
    stream: Iterator[str],
    stop_words: List[str]
) -> str:
    """
    A helper function to get the final string easily.

    Args:
        stream: An iterator giving string chunks
        stop_words: A list of stop words

    Returns:
        The full string before the first stop word
    """
    return ''.join(process_stream_with_stopwords(stream, stop_words))
```
Examples
```python
# Example 1: Stop word is split between chunks
def create_stream_1():
    chunks = ["This is a te", "st<st", "op> message"]
    for chunk in chunks:
        yield chunk

stop_words = ["<stop>", "<end>"]
result = extract_text_before_stopword(create_stream_1(), stop_words)
print(result)  # Output: "This is a test"

# Example 2: Stop word is inside one chunk
def create_stream_2():
    chunks = ["Hello world", " <stop> more", " text"]
    for chunk in chunks:
        yield chunk

result = extract_text_before_stopword(create_stream_2(), stop_words)
print(result)  # Output: "Hello world "

# Example 3: No stop word at all
def create_stream_3():
    chunks = ["This is ", "a normal ", "text"]
    for chunk in chunks:
        yield chunk

result = extract_text_before_stopword(create_stream_3(), stop_words)
print(result)  # Output: "This is a normal text"
```
Part 3: Tricky Cases & Speed Improvements
Common Edge Cases
- Empty Stream: The input has no data. The code should return an empty string.
- Stop Word at Start: If the stream starts with "<stop>", the result should be empty.
- Overlapping Words: If two stop words both appear in the text, the code must stop at the one that occurs earliest in the text, not the one listed first.
- Tiny Chunks: If chunks are 1 letter each (like "<", "s", "t", "o", "p", ">"), the buffer logic must still catch the word.
- Long Stop Words: If a stop word is longer than the chunk size, the buffer (which holds max_stop_len - 1 characters) carries text across several chunks, so the word is still found.
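To exercise these edge cases quickly, here is a condensed restatement of the Part 2 generator (the name before_first_stop is ours) run against the tiny-chunk and start-of-stream cases:

```python
def before_first_stop(stream, stop_words):
    """Condensed restatement of the Part 2 generator, for testing."""
    if not stop_words:
        yield from stream
        return
    max_len = max(len(w) for w in stop_words)
    buffer = ""
    for chunk in stream:
        text = buffer + chunk
        hits = [p for p in (text.find(w) for w in stop_words) if p != -1]
        if hits:
            yield text[:min(hits)]   # text before the earliest stop word
            return
        safe = max(0, len(text) - (max_len - 1))
        yield text[:safe]
        buffer = text[safe:]
    if buffer:
        yield buffer

# Tiny chunks: the stop word arrives one character at a time.
tiny = iter(["a", "b", "<", "s", "t", "o", "p", ">", "c"])
print("".join(before_first_stop(tiny, ["<stop>"])))  # prints "ab"

# Stop word at the very start: the result is empty.
print("".join(before_first_stop(iter(["<stop>tail"]), ["<stop>"])))  # prints ""
```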
Making It Faster (Optimization)
- Use a Trie (Prefix Tree)
  - If you have thousands of stop words, checking them one by one is slow. A Trie is a tree-like data structure that lets you search for all words at the same time.
  - Why use it?
    - Simple Search: Complexity is O(n * m * k) (slow with many words).
    - Trie Search: Complexity is closer to O(n * k) (much faster).
- Aho-Corasick Algorithm
  - This is a famous algorithm used in production systems. It is the best way to find multiple patterns in text at once. It has a time complexity of O(n + m + z), which is very efficient.
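As a rough illustration of the Trie idea (a plain dict-of-dicts prefix tree, not the full Aho-Corasick automaton with failure links; the function names are ours), one pass per starting position checks all stop words at once:

```python
def build_trie(words):
    """Build a dict-of-dicts trie. Assumes '$' does not appear in the words."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node["$"] = word  # terminal marker stores the full word
    return root


def earliest_match(text, trie):
    """Return (start, word) for the earliest stop word in text, or None."""
    for start in range(len(text)):
        node = trie
        for ch in text[start:]:
            if ch not in node:
                break
            node = node[ch]
            if "$" in node:  # a complete stop word ends here
                return start, node["$"]
    return None


trie = build_trie(["<stop>", "<end>"])
print(earliest_match("This is a test<stop> more", trie))  # prints (14, '<stop>')
```

Each starting position walks the trie at most k steps (k = longest stop word), matching the O(n * k) bound above regardless of how many words the trie holds.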
Memory Usage
- Buffer Space: O(L), where L is the length of the longest stop word.
- Space for Stop Words: O(m * k), where m is the number of words and k is their average length.
- Total Space: It stays small, O(L + m * k). It does not grow with the size of the stream. This is why it works for terabytes of data.
Part 4: Interview Discussion Topics
1. Why use Generators?
- To save memory.
- Generators process data "lazily" (only when asked).
- Using yield keeps only one chunk in memory at a time.
- If you tried to load a 100GB file into a variable, your program would crash. Generators prevent this.
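A quick way to see the memory difference (the exact byte counts are CPython-specific, so treat the numbers as illustrative):

```python
import sys

# A generator holds no data up front; a list materializes everything.
lazy = (ch for ch in "x" * 1_000_000)
eager = list("x" * 1_000_000)

print(sys.getsizeof(lazy) < sys.getsizeof(eager))  # True: bytes vs. megabytes
```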
2. Why is the buffer size max_stop_len - 1?
The Logic:
- Let's say the stop word is "<stop>" (6 chars).
- To find this word, we need to see 6 letters in a row.
- If the word is split, some letters are in the old chunk, and some are in the new one.
- By keeping the last 5 letters (6 - 1) of the old chunk, and adding the first letter of the new chunk, we have 6 letters. We can now see the whole word.
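A small counter-check (the helper name and chunk values are ours): keeping fewer than max_len - 1 trailing characters can miss a split stop word.

```python
def finds_split_word(keep):
    """Return True if "<stop>" is found when keeping `keep` trailing chars."""
    buffer = ""
    for chunk in ["abc<st", "op>def"]:
        text = buffer + chunk
        if "<stop>" in text:
            return True
        buffer = text[-keep:] if keep else ""
    return False

print(finds_split_word(5))  # keep max_len - 1 = 5 chars -> True (found)
print(finds_split_word(2))  # keep only 2 chars          -> False (missed)
```

With keep=2 the buffer holds only "st", so the second combined text is "stop>def", which never contains the full "<stop>".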
3. Python Tips for Production
If an interviewer asks, "How would you run this in a real app?", mention these:
- collections.deque: Good for buffers if you need to add/remove items frequently.
- Regex (re module): You can use Python's regex to find patterns, which might be cleaner to write but sometimes slower to run.
- Encoding: Always remember to handle UTF-8 properly, especially if the text contains emojis or foreign languages.
4. Another Way: Using Regex
You can use Regular Expressions to solve this.
```python
import re


def process_with_regex(stream, stop_words):
    # Make the stop words safe for regex
    escaped = [re.escape(word) for word in stop_words]
    # Create a pattern like "word1|word2|word3"
    pattern = '|'.join(escaped)
    regex = re.compile(pattern)

    buffer = ""
    max_stop_len = max(len(word) for word in stop_words)

    for chunk in stream:
        text = buffer + chunk
        match = regex.search(text)

        if match:
            # Yield text up to the match
            yield text[:match.start()]
            return

        # Same buffer logic as before
        safe_len = max(0, len(text) - (max_stop_len - 1))
        if safe_len > 0:
            yield text[:safe_len]
            buffer = text[safe_len:]
        else:
            buffer = text

    if buffer:
        yield buffer
```
Similar Practice Problems
- LeetCode 76: Minimum Window Substring (Uses sliding windows).
- LeetCode 438: Find All Anagrams in a String.
- System Design: How to parse huge log files or analyze real-time chats.