Java Regular Expressions
Regular expressions (regex) are a powerful pattern-matching language used for searching, validating, and manipulating text. Java's java.util.regex package provides a robust regex engine based on Perl-style patterns. Regex is a frequent topic in interviews — interviewers test both your ability to write patterns and your understanding of the underlying engine.
Core Classes: Pattern and Matcher
Java's regex engine revolves around two classes:
| Class | Purpose |
|---|---|
| Pattern | Compiled representation of a regex. Immutable and thread-safe. |
| Matcher | Stateful engine that performs match operations against a CharSequence. |
// 1. Compile the pattern (expensive — do once, reuse)
Pattern pattern = Pattern.compile("\\d{3}-\\d{4}");
// 2. Create a matcher against input
Matcher matcher = pattern.matcher("Call 555-1234 now");
// 3. Use matcher methods
if (matcher.find()) {
    System.out.println(matcher.group()); // "555-1234"
    System.out.println(matcher.start()); // 5
    System.out.println(matcher.end()); // 13
}
Key Matcher Methods
| Method | Description |
|---|---|
| matches() | Entire input must match the pattern |
| find() | Finds the next subsequence that matches |
| lookingAt() | Input must match from the beginning (but need not consume all) |
| group() | Returns the matched subsequence |
| start() / end() | Start and end indices of the match |
| replaceAll(String) | Replaces every match with the replacement |
| reset(CharSequence) | Reuses the matcher with new input |
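The difference between the three matching methods is easiest to see side by side. The sketch below is illustrative only: the class name MatcherMethodsDemo and its helper methods are hypothetical, not part of java.util.regex.

```java
import java.util.regex.Pattern;

class MatcherMethodsDemo {
    private static final Pattern DIGITS = Pattern.compile("\\d+");

    // matches(): the whole input must consist of digits
    static boolean wholeInput(String s) {
        return DIGITS.matcher(s).matches();
    }

    // lookingAt(): the input must start with digits, the rest is ignored
    static boolean prefix(String s) {
        return DIGITS.matcher(s).lookingAt();
    }

    // find(): digits may occur anywhere in the input
    static boolean anywhere(String s) {
        return DIGITS.matcher(s).find();
    }

    public static void main(String[] args) {
        for (String s : new String[] {"123", "123abc", "abc123"}) {
            System.out.printf("%-7s matches=%b lookingAt=%b find=%b%n",
                    s, wholeInput(s), prefix(s), anywhere(s));
        }
    }
}
```

Only "123" satisfies all three checks; "123abc" fails matches() and "abc123" fails both matches() and lookingAt().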
Compilation Flags
Pass flags to Pattern.compile() to alter matching behavior.
| Flag | Constant | Effect |
|---|---|---|
| (?i) | Pattern.CASE_INSENSITIVE | Case-insensitive matching (ASCII only) |
| (?m) | Pattern.MULTILINE | ^ and $ match line boundaries, not just input boundaries |
| (?s) | Pattern.DOTALL | . matches everything including \n |
| (?x) | Pattern.COMMENTS | Allows whitespace and comments in the pattern |
| (?u) | Pattern.UNICODE_CASE | Unicode-aware case folding (use with CASE_INSENSITIVE) |
| (?d) | Pattern.UNIX_LINES | Only \n is recognized as a line terminator |
// Combine flags with bitwise OR
Pattern p = Pattern.compile("hello world",
    Pattern.CASE_INSENSITIVE | Pattern.MULTILINE);
// Or embed flags inline
Pattern p2 = Pattern.compile("(?im)hello world");
MULTILINE vs DOTALL
MULTILINE changes what ^ and $ mean — they match at line boundaries instead of only at the start/end of the entire input. DOTALL changes what . matches — it includes newline characters. These are independent and frequently confused in interviews.
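To make the independence of the two flags concrete, the following sketch runs one anchor-based pattern and one dot-based pattern under different flags. The class LineFlagsDemo, its sample text, and its method names are assumptions made for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LineFlagsDemo {
    static final String TEXT = "first\nsecond";

    // Counts matches of ^\w+ ; only MULTILINE changes this result.
    static int countLineStarts(int flags) {
        Matcher m = Pattern.compile("^\\w+", flags).matcher(TEXT);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    // Checks whether "." can match the newline between the words;
    // only DOTALL changes this result.
    static boolean dotCrossesNewline(int flags) {
        return Pattern.compile("first.second", flags).matcher(TEXT).matches();
    }

    public static void main(String[] args) {
        System.out.println(countLineStarts(0));                  // 1
        System.out.println(countLineStarts(Pattern.MULTILINE));  // 2
        System.out.println(dotCrossesNewline(0));                // false
        System.out.println(dotCrossesNewline(Pattern.DOTALL));   // true
    }
}
```

Passing DOTALL to countLineStarts, or MULTILINE to dotCrossesNewline, leaves the default result unchanged, which is the point of the comparison.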
Character Classes
| Syntax | Matches | Example |
|---|---|---|
| [abc] | Any of a, b, or c | [aeiou] matches vowels |
| [^abc] | Anything except a, b, c | [^0-9] matches non-digits |
| [a-z] | Range a through z | [A-Za-z] matches letters |
| [a-z&&[^m-p]] | Intersection (a-z minus m-p) | Java-specific syntax |
| . | Any character (except \n by default) | |
| \d | Digit [0-9] | |
| \D | Non-digit [^0-9] | |
| \w | Word character [a-zA-Z0-9_] | |
| \W | Non-word character | |
| \s | Whitespace [ \t\n\x0B\f\r] | |
| \S | Non-whitespace | |
| \b | Word boundary | |
| \B | Non-word boundary | |
Double Backslash in Java
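Java string literals treat the backslash as an escape character, so every regex backslash must be doubled: write \\d in code to represent the regex \d, and \\\\ to match one literal backslash. Forgetting this is a common source of regex bugs in Java.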
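A short sketch of the escaping rules in practice. The class EscapingDemo and its methods are hypothetical helpers; Pattern.quote is the standard way to treat arbitrary text as a literal.

```java
import java.util.regex.Pattern;

class EscapingDemo {
    // Java literal "\\d" is the two-character regex \d
    static final String DIGIT_REGEX = "\\d";
    // Java literal "\\\\" is the two-character regex \\ (one literal backslash)
    static final String BACKSLASH_REGEX = "\\\\";

    static boolean isSingleDigit(String s) {
        return s.matches(DIGIT_REGEX);
    }

    // Replaces every backslash in a Windows-style path with a forward slash.
    static String toForwardSlashes(String path) {
        return path.replaceAll(BACKSLASH_REGEX, "/");
    }

    // Pattern.quote escapes metacharacters so the text is matched literally.
    static boolean containsLiteral(String text, String literal) {
        return Pattern.compile(Pattern.quote(literal)).matcher(text).find();
    }

    public static void main(String[] args) {
        System.out.println(DIGIT_REGEX.length());                    // 2
        System.out.println(toForwardSlashes("C:\\temp\\a.txt"));     // C:/temp/a.txt
        System.out.println(containsLiteral("price (a+b)", "(a+b)")); // true
    }
}
```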
Quantifiers
Greedy, Reluctant, and Possessive
| Greedy | Reluctant | Possessive | Meaning |
|---|---|---|---|
| X? | X?? | X?+ | Zero or one |
| X* | X*? | X*+ | Zero or more |
| X+ | X+? | X++ | One or more |
| X{n} | X{n}? | X{n}+ | Exactly n |
| X{n,} | X{n,}? | X{n,}+ | At least n |
| X{n,m} | X{n,m}? | X{n,m}+ | Between n and m |
String input = "<b>bold</b> and <i>italic</i>";
// Greedy: matches as much as possible, then backtracks
Pattern greedy = Pattern.compile("<.+>");
// Matches: "<b>bold</b> and <i>italic</i>" (one huge match)
// Reluctant: matches as little as possible
Pattern reluctant = Pattern.compile("<.+?>");
// Matches: "<b>", "</b>", "<i>", "</i>" (four separate matches)
// Possessive: matches as much as possible, NO backtracking
Pattern possessive = Pattern.compile("<.++>");
// No match at all! Consumes everything, won't give back the ">"
When to Use Possessive Quantifiers
Possessive quantifiers never give back characters they have consumed, so the engine cannot backtrack into them. Use them when you know the consumed characters will never be needed by a later part of the pattern; this can improve performance considerably and helps prevent catastrophic backtracking (ReDoS).
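As a sketch of a case where a possessive quantifier is safe, consider a quoted string: the character class excludes the very character that must follow it, so giving characters back could never help. The class PossessiveDemo and its method names are illustrative assumptions.

```java
import java.util.regex.Pattern;

class PossessiveDemo {
    // [^"]*+ stops at the first quote and never needs to give anything back,
    // because the next token (") is exactly what the class excludes.
    private static final Pattern QUOTED = Pattern.compile("\"[^\"]*+\"");

    static boolean isQuoted(String s) {
        return QUOTED.matcher(s).matches();
    }

    // Counter-example: here the trailing "a" needs a character that a*+ already took.
    static boolean possessiveStarThenA(String s) {
        return Pattern.matches("a*+a", s);
    }

    // The greedy version backtracks one character and succeeds.
    static boolean greedyStarThenA(String s) {
        return Pattern.matches("a*a", s);
    }

    public static void main(String[] args) {
        System.out.println(isQuoted("\"hello\""));       // true
        System.out.println(isQuoted("\"hello"));         // false, fails without backtracking
        System.out.println(possessiveStarThenA("aaa"));  // false
        System.out.println(greedyStarThenA("aaa"));      // true
    }
}
```

The counter-example shows the flip side: a possessive quantifier changes the result whenever a later token depends on characters it has already consumed.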
Anchors
| Anchor | Meaning |
|---|---|
| ^ | Start of input (or line in MULTILINE mode) |
| $ | End of input (or line in MULTILINE mode) |
| \b | Word boundary |
| \B | Non-word boundary |
| \A | Absolute start of input (ignores MULTILINE) |
| \Z | End of input, before final terminator |
| \z | Absolute end of input |
// \b ensures we match whole words only
Pattern p = Pattern.compile("\\bcat\\b");
Matcher m = p.matcher("concatenate the cat in the catalog");
// Finds only "cat" at index 16, not inside "concatenate" or "catalog"
Capturing Groups and Named Groups
Parentheses () create capturing groups. Groups are numbered left-to-right starting at 1. Group 0 is always the entire match.
Pattern datePattern = Pattern.compile("(\\d{4})-(\\d{2})-(\\d{2})");
Matcher m = datePattern.matcher("Date: 2025-07-15");
if (m.find()) {
    System.out.println(m.group(0)); // "2025-07-15" (full match)
    System.out.println(m.group(1)); // "2025" (year)
    System.out.println(m.group(2)); // "07" (month)
    System.out.println(m.group(3)); // "15" (day)
}
Named Groups (Java 7+)
Use (?<name>...) to assign names to groups. Improves readability significantly.
Pattern p = Pattern.compile(
    "(?<year>\\d{4})-(?<month>\\d{2})-(?<day>\\d{2})");
Matcher m = p.matcher("2025-07-15");
if (m.find()) {
    System.out.println(m.group("year")); // "2025"
    System.out.println(m.group("month")); // "07"
    System.out.println(m.group("day")); // "15"
}
Non-Capturing Groups
Use (?:...) when you need grouping for alternation or quantifiers but do not need to capture.
// Non-capturing: groups for alternation without creating a capture
Pattern p = Pattern.compile("(?:http|https)://(.+)");
Matcher m = p.matcher("https://example.com");
if (m.find()) {
    System.out.println(m.group(1)); // "example.com" is group 1, not 2
}
Backreferences
Backreferences refer to previously captured groups within the same pattern. \1 refers to group 1, \k<name> refers to a named group.
// Find repeated words
Pattern p = Pattern.compile("\\b(\\w+)\\s+\\1\\b", Pattern.CASE_INSENSITIVE);
Matcher m = p.matcher("The the quick brown fox fox jumped");
while (m.find()) {
    System.out.println("Duplicate: " + m.group());
    // "The the"
    // "fox fox"
}
// Named backreference
Pattern p = Pattern.compile("(?<tag>\\w+)=\\k<tag>");
Matcher m = p.matcher("test=test value=value bad=wrong");
while (m.find()) {
    System.out.println(m.group()); // "test=test", "value=value"
}
Lookahead and Lookbehind Assertions
Lookarounds assert that a pattern exists (or does not exist) at a position without consuming characters.
| Syntax | Name | Meaning |
|---|---|---|
| (?=X) | Positive lookahead | Followed by X |
| (?!X) | Negative lookahead | NOT followed by X |
| (?<=X) | Positive lookbehind | Preceded by X |
| (?<!X) | Negative lookbehind | NOT preceded by X |
// Password validation: at least 8 chars, one uppercase, one digit, one special
Pattern strongPassword = Pattern.compile(
    "^(?=.*[A-Z])(?=.*\\d)(?=.*[@#$%^&+=!]).{8,}$"
);
System.out.println(strongPassword.matcher("Passw0rd!").matches()); // true
System.out.println(strongPassword.matcher("weakpass").matches()); // false
// Lookbehind: extract amounts after a dollar sign
Pattern p = Pattern.compile("(?<=\\$)\\d+\\.\\d{2}");
Matcher m = p.matcher("Price: $49.99 and $12.50");
while (m.find()) {
    System.out.println(m.group()); // "49.99", "12.50" (without the $)
}
// Negative lookahead: find "foo" NOT followed by "bar"
Pattern p = Pattern.compile("foo(?!bar)");
Matcher m = p.matcher("foobar foobaz foo");
while (m.find()) {
    System.out.println(m.start()); // 7 ("foobaz"), 14 ("foo")
}
Lookbehind Limitations in Java
Java's lookbehind support has traditionally required a bounded, obvious maximum length. Unbounded forms such as (?<=a+) are rejected by many JDK versions, so use a bounded form such as (?<=a{1,10}) or restructure with a lookahead instead.
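A minimal sketch of a bounded lookbehind, assuming a made-up input format in which an amount follows "USD" and one to three whitespace characters. The class BoundedLookbehindDemo and the method firstUsdAmount are hypothetical names.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BoundedLookbehindDemo {
    // \s{1,3} has an explicit upper bound, so the lookbehind length is bounded.
    private static final Pattern AMOUNT = Pattern.compile("(?<=USD\\s{1,3})\\d+");

    // Returns the first run of digits preceded by "USD" and 1 to 3 whitespace characters.
    static Optional<String> firstUsdAmount(String text) {
        Matcher m = AMOUNT.matcher(text);
        return m.find() ? Optional.of(m.group()) : Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(firstUsdAmount("Total: USD  42")); // Optional[42]
        System.out.println(firstUsdAmount("Total: EUR 42"));  // Optional.empty
    }
}
```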
String Convenience Methods
Java's String class provides regex-powered methods for quick operations.
String input = "Hello, World!";
// matches() — entire string must match
input.matches("[A-Za-z, !]+"); // true
// split() — split on a pattern
"one:two::three".split(":"); // ["one", "two", "", "three"]
"one:two::three".split(":", 3); // ["one", "two", ":three"] (limit = 3 parts)
"one:two::three".split(":", -1); // ["one", "two", "", "three"] (keep trailing empties)
// replaceAll() — replace all matches
"2025-07-15".replaceAll("(\\d{4})-(\\d{2})-(\\d{2})", "$2/$3/$1");
// Result: "07/15/2025"
// replaceFirst() — replace first match only
"aaa bbb aaa".replaceFirst("aaa", "ccc"); // "ccc bbb aaa"
Pattern.compile() vs String.matches() Performance
This is a classic interview question. String.matches() recompiles the pattern every single call.
// BAD: recompiles the pattern on every iteration, O(n * (compile_cost + match_cost))
for (String email : emails) {
    if (email.matches("[\\w.]+@[\\w.]+\\.\\w+")) {
        // ...
    }
}
// GOOD: compile once, reuse, O(compile_cost + n * match_cost)
private static final Pattern EMAIL_PATTERN =
    Pattern.compile("[\\w.]+@[\\w.]+\\.\\w+");
for (String email : emails) {
    if (EMAIL_PATTERN.matcher(email).matches()) {
        // ...
    }
}
Performance Rule of Thumb
If you use a regex more than once, always pre-compile with Pattern.compile() and store it in a static final field. The Pattern object is immutable and thread-safe, making this pattern safe for concurrent access.
Common Patterns
These are frequently asked in interviews. Understand the trade-offs — production-grade validation often requires more than regex alone.
// Email (simplified; RFC 5322 is much more complex)
Pattern EMAIL = Pattern.compile(
    "^[\\w.%+-]+@[\\w.-]+\\.[a-zA-Z]{2,}$"
);
// US Phone Number (flexible format)
Pattern PHONE = Pattern.compile(
    "^\\(?\\d{3}\\)?[-.\\s]?\\d{3}[-.\\s]?\\d{4}$"
);
// IPv4 Address
Pattern IPV4 = Pattern.compile(
    "^((25[0-5]|2[0-4]\\d|[01]?\\d\\d?)\\.){3}(25[0-5]|2[0-4]\\d|[01]?\\d\\d?)$"
);
// URL (http/https)
Pattern URL = Pattern.compile(
    "^https?://[\\w.-]+(?:\\.[a-zA-Z]{2,})(?:/[\\w./?%&=+-]*)?$"
);
// Java identifier
Pattern IDENTIFIER = Pattern.compile(
    "^[a-zA-Z_$][a-zA-Z0-9_$]*$"
);
// ISO date (YYYY-MM-DD)
Pattern ISO_DATE = Pattern.compile(
    "^\\d{4}-(0[1-9]|1[0-2])-(0[1-9]|[12]\\d|3[01])$"
);
Regex is Not a Validator
These patterns check format, not validity. For example, the date pattern accepts 2025-02-31 which does not exist. For production validation, use proper libraries (InternetAddress for email, InetAddress for IP, LocalDate.parse() for dates).
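The gap between format and validity can be shown with the date case. The sketch below pairs the ISO date regex with LocalDate.parse(), which rejects impossible calendar dates; the class DateValidationDemo and its method names are assumptions for illustration.

```java
import java.time.LocalDate;
import java.time.format.DateTimeParseException;
import java.util.regex.Pattern;

class DateValidationDemo {
    private static final Pattern ISO_DATE = Pattern.compile(
            "^\\d{4}-(0[1-9]|1[0-2])-(0[1-9]|[12]\\d|3[01])$");

    // Format check only: knows nothing about month lengths or leap years.
    static boolean looksLikeDate(String s) {
        return ISO_DATE.matcher(s).matches();
    }

    // Semantic check: LocalDate.parse rejects dates that do not exist.
    static boolean isRealDate(String s) {
        try {
            LocalDate.parse(s);
            return true;
        } catch (DateTimeParseException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(looksLikeDate("2025-02-31")); // true
        System.out.println(isRealDate("2025-02-31"));    // false
        System.out.println(isRealDate("2025-07-15"));    // true
    }
}
```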
ReDoS: Catastrophic Backtracking
ReDoS (Regular Expression Denial of Service) occurs when a regex engine takes exponential time on certain inputs due to excessive backtracking.
The Problem
// VULNERABLE: nested quantifiers create exponential backtracking
Pattern bad = Pattern.compile("(a+)+b");
// Benign input: matches quickly
bad.matcher("aaaaab").matches(); // true, fast
// Malicious input: no match, but engine tries every combination
bad.matcher("aaaaaaaaaaaaaaaaaaaaa!").matches();
// Work can roughly double with each extra 'a'; longer runs may take seconds, minutes, or effectively hang
The engine tries a+ matching 1 char, then the outer + repeating... then backtracks and tries a+ matching 2 chars, and so on. For n characters, this is O(2^n).
Vulnerable Patterns to Avoid
| Pattern | Problem |
|---|---|
| (a+)+ | Nested quantifiers |
| (a\|a)* | Alternation inside repetition with overlap |
| (.*a){n} | Wildcard with suffix inside repetition |
| (\w+\s?)+ | Overlapping possibilities for each position |
How to Prevent ReDoS
- Use possessive quantifiers: (a++)b prevents backtracking entirely
- Use atomic groups: (?>a+)b (Java supports this)
- Avoid nested quantifiers: Flatten (a+)+ to a+
- Anchor patterns: Use ^ and $ to constrain matching
- Set timeouts: Limit regex execution time
- Test with tools: Use regex analysis tools to detect vulnerable patterns
// SAFE: possessive quantifier prevents backtracking
Pattern safe = Pattern.compile("(a++)b");
safe.matcher("aaaaaaaaaaaaaaaaaaaaa!").matches(); // false, instant
// SAFE: atomic group (equivalent to possessive)
Pattern atomic = Pattern.compile("(?>a+)b");
atomic.matcher("aaaaaaaaaaaaaaaaaaaaa!").matches(); // false, instant
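java.util.regex has no built-in timeout, so limiting execution time needs a workaround. One commonly used approach, sketched here under that assumption, wraps the input in a CharSequence that checks a deadline on every character read. The class DeadlineCharSequence and its factory method are hypothetical names, not a JDK API.

```java
import java.util.regex.Pattern;

final class DeadlineCharSequence implements CharSequence {
    private final CharSequence inner;
    private final long deadlineNanos; // compared against System.nanoTime()

    DeadlineCharSequence(CharSequence inner, long deadlineNanos) {
        this.inner = inner;
        this.deadlineNanos = deadlineNanos;
    }

    static DeadlineCharSequence withTimeoutMillis(CharSequence inner, long millis) {
        return new DeadlineCharSequence(inner, System.nanoTime() + millis * 1_000_000L);
    }

    @Override
    public char charAt(int index) {
        // The matcher reads input through charAt, so a runaway match ends up here.
        if (System.nanoTime() - deadlineNanos > 0) {
            throw new IllegalStateException("regex deadline exceeded");
        }
        return inner.charAt(index);
    }

    @Override
    public int length() {
        return inner.length();
    }

    @Override
    public CharSequence subSequence(int start, int end) {
        return new DeadlineCharSequence(inner.subSequence(start, end), deadlineNanos);
    }

    @Override
    public String toString() {
        return inner.toString();
    }

    public static void main(String[] args) {
        Pattern risky = Pattern.compile("(a+)+b");
        CharSequence input = withTimeoutMillis("a".repeat(30) + "!", 200);
        try {
            System.out.println(risky.matcher(input).matches());
        } catch (IllegalStateException e) {
            System.out.println("aborted: " + e.getMessage());
        }
    }
}
```

Checking the clock on every character adds overhead, so this is a safety net for untrusted input rather than a substitute for fixing the pattern. Whether the demo in main finishes or aborts depends on the JDK and the machine.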
Interview Red Flag
If an interviewer asks you to write a regex for user-facing input validation, always mention ReDoS as a concern. It shows security awareness and production-mindset thinking.
Java 9+ Regex Enhancements
| Version | Feature |
|---|---|
| Java 9 | Matcher.results() returns a Stream<MatchResult> |
| Java 9 | Scanner.findAll() returns a Stream<MatchResult> |
| Java 11 | Pattern.asMatchPredicate() returns a Predicate<String> with full-match semantics (the older asPredicate(), from Java 8, uses find() semantics) |
// Java 9+: Stream-based matching
Pattern p = Pattern.compile("\\b\\w{5}\\b");
List<String> fiveLetterWords = p.matcher("Hello brave new world today")
    .results()
    .map(MatchResult::group)
    .collect(Collectors.toList());
// ["Hello", "brave", "world", "today"]
// Java 11+: Use as a full-match predicate for filtering
Pattern emailPattern = Pattern.compile("[\\w.]+@[\\w.]+\\.\\w+");
Predicate<String> isEmail = emailPattern.asMatchPredicate();
List<String> validEmails = candidates.stream()
    .filter(isEmail)
    .collect(Collectors.toList());
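Scanner.findAll() from the table can be used in much the same way as Matcher.results(). A minimal sketch follows; the class ScannerFindAllDemo and the method numbersIn are illustrative names.

```java
import java.util.List;
import java.util.Scanner;
import java.util.regex.MatchResult;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class ScannerFindAllDemo {
    private static final Pattern NUMBER = Pattern.compile("\\d+");

    // Collects every run of digits found by the scanner (Java 9+).
    static List<String> numbersIn(String text) {
        try (Scanner scanner = new Scanner(text)) {
            // The stream is consumed before the scanner is closed.
            return scanner.findAll(NUMBER)
                    .map(MatchResult::group)
                    .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) {
        System.out.println(numbersIn("a1 b22 c333")); // [1, 22, 333]
    }
}
```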
Interview Questions
What is the difference between matches(), find(), and lookingAt()?
- matches() requires the entire input to match the pattern. Roughly equivalent to wrapping the pattern with ^...$.
- find() searches for the next subsequence anywhere in the input that matches. Can be called repeatedly to find all matches.
- lookingAt() checks if the input starts with the pattern, but does not require it to consume the entire input.
Why should you avoid String.matches() in a loop?
String.matches() internally calls Pattern.compile() every invocation. Pattern compilation involves parsing the regex, building an NFA, and optimizing it — this is expensive. In a loop processing thousands of strings, you pay this cost every iteration.
The correct approach is to pre-compile the Pattern into a static final field and create new Matcher instances per input. Pattern is immutable and thread-safe; Matcher is not.
Explain greedy vs reluctant vs possessive quantifiers with an example.
Given input "<b>bold</b>" and pattern <.+>:
- Greedy (<.+>): .+ first consumes as much as possible (b>bold</b>, everything up to the end of the input), then backtracks one character at a time until > can match. Result: <b>bold</b> (one match spanning the whole string).
- Reluctant (<.+?>): .+? consumes as little as possible (b), then checks if > matches. Result: <b> and </b> (two separate matches).
- Possessive (<.++>): .++ consumes as much as possible and refuses to backtrack. Since it consumed the final >, the pattern cannot match. Result: no match.
Possessive quantifiers are useful for performance — they fail fast when no match is possible and prevent catastrophic backtracking.
What is the difference between a capturing group and a non-capturing group?
- Capturing group (X): Matches X and remembers the match. Accessible via matcher.group(n). Each capturing group is assigned a number (left-to-right by opening parenthesis).
- Non-capturing group (?:X): Matches X but does not remember it. Used for grouping alternation or applying quantifiers without the overhead of capturing.
Use non-capturing groups when you only need logical grouping. This reduces memory usage and keeps group numbering clean.
How do lookahead and lookbehind work? Give a practical example.
Lookarounds are zero-width assertions — they check for a condition at a position without consuming any characters.
- (?=X) Positive lookahead: succeeds if X matches ahead
- (?!X) Negative lookahead: succeeds if X does NOT match ahead
- (?<=X) Positive lookbehind: succeeds if X matches behind
- (?<!X) Negative lookbehind: succeeds if X does NOT match behind
Practical example — extract numbers that are preceded by $:
Pattern p = Pattern.compile("(?<=\\$)\\d+\\.?\\d*");
Matcher m = p.matcher("Items: $50, 30 units, $19.99");
// Finds: "50", "19.99" (without the dollar sign)
Practical example — password must contain at least one digit and one uppercase letter:
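A minimal sketch of such a check follows. The exact pattern and the names PasswordLookaheadDemo and hasDigitAndUpper are illustrative assumptions, not a snippet from the original text.

```java
import java.util.regex.Pattern;

class PasswordLookaheadDemo {
    // Two lookaheads, both evaluated at the start of the input:
    // (?=.*\d) requires a digit somewhere, (?=.*[A-Z]) requires an uppercase letter.
    private static final Pattern DIGIT_AND_UPPER =
            Pattern.compile("^(?=.*\\d)(?=.*[A-Z]).+$");

    static boolean hasDigitAndUpper(String s) {
        return DIGIT_AND_UPPER.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(hasDigitAndUpper("Secret1")); // true
        System.out.println(hasDigitAndUpper("secret1")); // false, no uppercase letter
        System.out.println(hasDigitAndUpper("SECRET"));  // false, no digit
    }
}
```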
Each lookahead checks a condition at position 0 without advancing the cursor.
What is ReDoS and how do you prevent it?
ReDoS (Regular Expression Denial of Service) is a vulnerability where a malicious input causes the regex engine to enter catastrophic backtracking, consuming exponential time.
It occurs with patterns that have:
- Nested quantifiers: (a+)+
- Overlapping alternation: (a|a)+
- Ambiguous repetition: (\w+\s?)+
Prevention strategies:
- Use possessive quantifiers (a++) or atomic groups ((?>a+)) to prevent backtracking
- Avoid nested quantifiers: flatten them
- Anchor patterns with ^ and $
- Set a timeout or run regex in a separate thread with a deadline
- Use static analysis tools to detect vulnerable patterns before deployment
What does the MULTILINE flag actually do? How is it different from DOTALL?
MULTILINE ((?m)) changes the behavior of ^ and $:
- Without it: ^ matches start of input, $ matches end of input
- With it: ^ also matches after \n, $ also matches before \n
DOTALL ((?s)) changes the behavior of .:
- Without it: . matches any character except \n
- With it: . matches any character including \n
They are independent flags. You can use both, one, or neither.
String input = "line1\nline2\nline3";
// Without MULTILINE: ^ only matches start of entire input
Pattern.compile("^\\w+").matcher(input).find(); // true; the only match is "line1"
// With MULTILINE: ^ matches start of each line
Pattern.compile("(?m)^\\w+").matcher(input).results()
    .map(MatchResult::group).toList(); // ["line1", "line2", "line3"] (Stream.toList() needs Java 16+)
How would you extract all key-value pairs from a config string like 'host=localhost;port=8080;debug=true'?
Use named capturing groups for clarity:
Pattern p = Pattern.compile("(?<key>\\w+)=(?<value>[^;]+)");
Matcher m = p.matcher("host=localhost;port=8080;debug=true");
Map<String, String> config = new LinkedHashMap<>();
while (m.find()) {
    config.put(m.group("key"), m.group("value"));
}
// {host=localhost, port=8080, debug=true}
In Java 9+, you can use matcher.results() with streams for a more functional approach:
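A sketch of that stream-based variant is shown here. It uses numbered groups because name-based access on MatchResult is not available on every JDK version that has results(); the class ConfigParserDemo and the method parse are hypothetical names.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class ConfigParserDemo {
    // Group 1 is the key, group 2 is the value (everything up to the next ';').
    private static final Pattern PAIR = Pattern.compile("(\\w+)=([^;]+)");

    static Map<String, String> parse(String config) {
        return PAIR.matcher(config).results()
                .collect(Collectors.toMap(
                        r -> r.group(1),
                        r -> r.group(2),
                        (first, second) -> second, // on duplicate keys, the last value wins
                        LinkedHashMap::new));      // keeps insertion order
    }

    public static void main(String[] args) {
        System.out.println(parse("host=localhost;port=8080;debug=true"));
        // {host=localhost, port=8080, debug=true}
    }
}
```

The merge function is a design choice made for this sketch; without one, Collectors.toMap throws on duplicate keys.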
Is Pattern thread-safe? Is Matcher thread-safe?
Pattern is immutable and thread-safe. You should compile it once and store it in a static final field.
Matcher is stateful and NOT thread-safe. It maintains internal state (current position, group captures, etc.). Each thread must create its own Matcher instance via pattern.matcher(input).
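The usual arrangement, sketched below, shares one compiled Pattern and creates a fresh Matcher inside each call, so no Matcher ever crosses a thread boundary. The class SharedPatternDemo and its methods are assumptions for illustration; the parallel stream is only a convenient way to involve multiple threads.

```java
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SharedPatternDemo {
    // Safe to share: Pattern is immutable.
    private static final Pattern WORD = Pattern.compile("\\w+");

    // Each call creates its own Matcher, so its state stays on the calling thread.
    static long countWords(String line) {
        Matcher m = WORD.matcher(line);
        long count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    // Lines may be processed on different worker threads.
    static long countWordsInParallel(List<String> lines) {
        return lines.parallelStream()
                .mapToLong(SharedPatternDemo::countWords)
                .sum();
    }

    public static void main(String[] args) {
        System.out.println(countWordsInParallel(List.of("a b", "c d e", "f"))); // 6
    }
}
```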
Write a regex to validate an IPv4 address.
Each octet must be 0-255. This requires careful range handling:
Pattern IPV4 = Pattern.compile(
    "^((25[0-5]|2[0-4]\\d|[01]?\\d\\d?)\\.){3}" +
    "(25[0-5]|2[0-4]\\d|[01]?\\d\\d?)$"
);
IPV4.matcher("192.168.1.1").matches(); // true
IPV4.matcher("255.255.255.255").matches(); // true
IPV4.matcher("256.1.1.1").matches(); // false
IPV4.matcher("1.2.3").matches(); // false
Breakdown of each octet: 25[0-5] matches 250-255, 2[0-4]\d matches 200-249, [01]?\d\d? matches 0-199. The order matters because the regex engine tries alternatives left-to-right.
In production, prefer InetAddress.getByName() with validation — regex cannot catch semantic issues like reserved ranges.