Remove Accents and Diacritics from String
When indexing, you need to pre-process your data so that search queries are fast and reliable. One of these steps is normalizing your data. This includes, but is not limited to, removing accents and diacritics from text, which is language- and charset-specific. The same normalization is applied to the query at search time, so that both sides match. This tutorial shows you how to remove accents and diacritics from a string and convert them to plain letters.
Another use case for converting accented characters to regular letters is creating bookmarkable URLs or file names, or displaying a plain ASCII representation of text.
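As a rough illustration of the URL use case, the sketch below builds a lowercase, hyphen-separated slug from a title. The class name SlugExample and the helper toSlug are hypothetical names chosen for this example, and the rules for what counts as a separator are an assumption, not a standard.

```java
import java.text.Normalizer;
import java.util.Locale;

class SlugExample {

    // Hypothetical helper: turns a title into a lowercase, hyphen-separated ASCII slug.
    static String toSlug(String title) {
        if (title == null) {
            return null;
        }
        // Decompose accented letters, then drop the combining marks.
        String ascii = Normalizer.normalize(title, Normalizer.Form.NFD)
                .replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
        return ascii.toLowerCase(Locale.ROOT)
                .replaceAll("[^a-z0-9]+", "-")   // collapse every other run of characters into one hyphen
                .replaceAll("^-+|-+$", "");      // trim leading and trailing hyphens
    }

    public static void main(String[] args) {
        System.out.println(toSlug("Crème Brûlée à la Carte!")); // creme-brulee-a-la-carte
    }
}
```

Letters that have no decomposition (for example ø) are not converted by this approach; here they would be treated as separators.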
Strip Accents from String
Since Java 6, you can use the java.text.Normalizer class. Its normalize method transforms Unicode text into an equivalent composed or decomposed form, allowing for easier sorting and searching of text. Once the text is decomposed (Normalizer.Form.NFD), each accent becomes a separate combining character, which you can remove using one of the following regular expressions:
"\\p{InCombiningDiacriticalMarks}+" matches all characters in the Combining Diacritical Marks block (U+0300 to U+036F).
"[\\p{M}]" matches characters intended to be combined with another character (e.g. accents, umlauts, enclosing boxes, etc.).
"[^\\p{ASCII}]" matches all non-ASCII characters.
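The three expressions differ in how much they remove. The following sketch applies each one to the same decomposed sample; the class RegexComparison and the helper stripWith are hypothetical names used only for this comparison.

```java
import java.text.Normalizer;

class RegexComparison {

    // Decomposes the input (NFD) and removes whatever the given regex matches.
    static String stripWith(String input, String regex) {
        return Normalizer.normalize(input, Normalizer.Form.NFD).replaceAll(regex, "");
    }

    public static void main(String[] args) {
        // "e" with acute, "a" followed by COMBINING ENCLOSING CIRCLE (U+20DD), "o" with stroke
        String sample = "\u00e9 a\u20dd \u00f8";

        // Removes the acute accent only; the enclosing circle lies outside this block.
        System.out.println(stripWith(sample, "\\p{InCombiningDiacriticalMarks}+"));

        // Removes every combining mark, including the enclosing circle.
        System.out.println(stripWith(sample, "[\\p{M}]"));

        // Removes everything outside ASCII, including the "o" with stroke,
        // which has no decomposition into a base letter plus mark.
        System.out.println(stripWith(sample, "[^\\p{ASCII}]"));
    }
}
```

The block-based expression is the most conservative and the ASCII one the most aggressive, since it also deletes letters that were never accented versions of an ASCII letter.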
import java.text.Normalizer;

public static String stripAccents(String input){
    return input == null ? null :
            Normalizer.normalize(input, Normalizer.Form.NFD)
                    .replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
}
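To connect this back to the search scenario from the introduction, here is a minimal sketch that folds both the stored text and the query in the same way before comparing them. The class FoldedSearchExample, its fold and search methods, and the linear scan over a list are all assumptions made for illustration; a real search engine would apply the folding inside its analyzer.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

class FoldedSearchExample {

    // Strips accents and lowercases, so "Café" and "CAFE" fold to the same key.
    static String fold(String input) {
        String stripped = Normalizer.normalize(input, Normalizer.Form.NFD)
                .replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
        return stripped.toLowerCase(Locale.ROOT);
    }

    // Returns the original documents whose folded text contains the folded query.
    static List<String> search(List<String> documents, String query) {
        String foldedQuery = fold(query);
        List<String> hits = new ArrayList<>();
        for (String document : documents) {
            if (fold(document).contains(foldedQuery)) {
                hits.add(document);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> documents = Arrays.asList("Crème brûlée", "Apple pie", "Café au lait");
        System.out.println(search(documents, "creme")); // [Crème brûlée]
        System.out.println(search(documents, "CAFÉ"));  // [Café au lait]
    }
}
```

The important point is that the same fold function runs on both sides; the stored text itself is returned unchanged.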
Walkthrough
Previously, we saw how to convert accented Unicode characters to plain letters. Here, we walk through the process step by step. We start by converting the whole Unicode text, which contains accents and diacritics, to normalized text without them.
Next, we loop over each character individually and print every stage to the console. The first column contains the original Unicode character, the second column contains its decomposed form, and the last column contains the character with the accents removed.
package com.memorynotfound;

import java.text.Normalizer;

public class RemoveAccents {

    static final String original = "Tĥïŝ ĩš â fůňķŷ Šťŕĭńġ";

    public static void main(String... args){
        System.out.println("Stripped:");
        System.out.println(stripAccents(original));

        System.out.println("\nDebugged:");
        debugStrippingAccents();
    }

    public static String stripAccents(String input){
        return input == null ? null :
                Normalizer.normalize(input, Normalizer.Form.NFD)
                        .replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
    }

    public static void debugStrippingAccents(){
        for (char c : original.toCharArray()){
            String text = String.valueOf(c);
            // normalize each character individually
            String decomposed = Normalizer.normalize(text, Normalizer.Form.NFD);
            // remove everything outside ASCII, which includes the combining marks
            String removed = decomposed.replaceAll("[^\\p{ASCII}]", "");
            // print per character
            System.out.println(text + " (" + asUnicode(text) + ") -> "
                    + decomposed + " (" + asUnicode(decomposed) + ") -> "
                    + removed + " (" + asUnicode(removed) + ")");
        }
    }

    // converts every char of the string to its escaped form, separated by spaces
    // (returns an empty string for empty input instead of throwing)
    public static String asUnicode(String input){
        StringBuilder result = new StringBuilder();
        for (char c : input.toCharArray()){
            if (result.length() > 0){
                result.append(' ');
            }
            result.append(asUnicode(c));
        }
        return result.toString();
    }

    // formats a single char as a backslash, "u" and four zero-padded lowercase hex digits
    public static String asUnicode(char input){
        return String.format("\\u%04x", (int) input);
    }
}
Output
Stripped:
This is a funky String
Debugged:
T (\u0054) -> T (\u0054) -> T (\u0054)
ĥ (\u0125) -> h ̂ (\u0068 \u0302) -> h (\u0068)
ï (\u00ef) -> i ̈ (\u0069 \u0308) -> i (\u0069)
ŝ (\u015d) -> s ̂ (\u0073 \u0302) -> s (\u0073)
(\u0020) -> (\u0020) -> (\u0020)
ĩ (\u0129) -> i ̃ (\u0069 \u0303) -> i (\u0069)
š (\u0161) -> s ̌ (\u0073 \u030c) -> s (\u0073)
(\u0020) -> (\u0020) -> (\u0020)
â (\u00e2) -> a ̂ (\u0061 \u0302) -> a (\u0061)
(\u0020) -> (\u0020) -> (\u0020)
f (\u0066) -> f (\u0066) -> f (\u0066)
ů (\u016f) -> u ̊ (\u0075 \u030a) -> u (\u0075)
ň (\u0148) -> n ̌ (\u006e \u030c) -> n (\u006e)
ķ (\u0137) -> ķ (\u006b \u0327) -> k (\u006b)
ŷ (\u0177) -> y ̂ (\u0079 \u0302) -> y (\u0079)
(\u0020) -> (\u0020) -> (\u0020)
Š (\u0160) -> S ̌ (\u0053 \u030c) -> S (\u0053)
ť (\u0165) -> t ̌ (\u0074 \u030c) -> t (\u0074)
ŕ (\u0155) -> r ́ (\u0072 \u0301) -> r (\u0072)
ĭ (\u012d) -> i ̆ (\u0069 \u0306) -> i (\u0069)
ń (\u0144) -> n ́ (\u006e \u0301) -> n (\u006e)
ġ (\u0121) -> g ̇ (\u0067 \u0307) -> g (\u0067)
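The second column of the output shows that each accented letter decomposes into a base letter followed by a combining mark. As a small, hedged illustration of the composed and decomposed forms, the sketch below round-trips one character between Normalizer.Form.NFD and Normalizer.Form.NFC; the class NormalizationForms and its helper methods are hypothetical names introduced for this example.

```java
import java.text.Normalizer;

class NormalizationForms {

    // Canonical decomposition: base letter followed by combining marks.
    static String decompose(String input) {
        return Normalizer.normalize(input, Normalizer.Form.NFD);
    }

    // Canonical composition: combines base letter and marks where a precomposed form exists.
    static String compose(String input) {
        return Normalizer.normalize(input, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        String composed = "\u0125"; // "h" with circumflex as a single char
        String decomposed = decompose(composed);

        System.out.println(composed.length());                    // 1
        System.out.println(decomposed.length());                  // 2
        System.out.println(compose(decomposed).equals(composed)); // true
        System.out.println(Normalizer.isNormalized(composed, Normalizer.Form.NFD)); // false
    }
}
```

Both forms render the same on screen, which is why comparing or stripping text only works reliably after normalizing it to one form first.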