Remove Accents and Diacritics from String

While indexing, you need to pre-process your data so that search queries are fast and reliable. One of these steps is normalizing your data. This includes, but is not limited to, removing accents and diacritics from text, a step that is language- and charset-specific. At search time, the same normalization is applied to the query so that it matches the indexed text. This tutorial shows you how to remove accents and diacritics from a string and convert the affected characters to plain letters.

Another use case for converting accented characters to regular letters is creating bookmarkable URLs or file names, or displaying a plain ASCII representation of a text.
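A minimal sketch of the URL use case, assuming the Normalizer-based stripping that the next section explains. The class name SlugExample and the helper toSlug are hypothetical and not part of the original tutorial, and the rules for what a slug may contain (lowercase ASCII letters, digits and hyphens) are an assumption as well.

```java
import java.text.Normalizer;
import java.util.Locale;

class SlugExample {

    // Hypothetical helper: turns arbitrary text into a URL- or file-name-friendly slug.
    static String toSlug(String input) {
        if (input == null) {
            return null;
        }
        return Normalizer.normalize(input, Normalizer.Form.NFD)
                .replaceAll("\\p{InCombiningDiacriticalMarks}+", "") // drop the accents
                .toLowerCase(Locale.ROOT)
                .replaceAll("[^a-z0-9]+", "-")                        // everything else becomes a hyphen
                .replaceAll("^-+|-+$", "");                           // trim leading and trailing hyphens
    }

    public static void main(String[] args) {
        System.out.println(toSlug("Crème Brûlée à la Carte!")); // creme-brulee-a-la-carte
    }
}
```

Letters without a canonical decomposition (for example ß or ł) are replaced by a hyphen in this sketch; a real slug generator would probably need an explicit transliteration table for them.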

Strip Accents from String

Since Java 6, you can use the java.text.Normalizer class. Its normalize method transforms Unicode text into an equivalent composed or decomposed form, allowing for easier sorting and searching of text. When you decompose the text with Normalizer.Form.NFD, each accented letter becomes a base letter followed by one or more combining marks. You can then remove those marks using one of the following regular expressions:

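  • "\\p{InCombiningDiacriticalMarks}+" matches characters in the Combining Diacritical Marks Unicode block (U+0300 to U+036F).
  • "[\\p{M}]" matches any character in the Unicode Mark category, i.e. characters intended to be combined with another character (e.g. accents, umlauts, enclosing boxes, etc.).
  • "[^\\p{ASCII}]" matches all non-ASCII characters, so it also removes letters that have no decomposition (e.g. ß or ł).

import java.text.Normalizer;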

public static String stripAccents(String input){
    return input == null ? null :
            Normalizer.normalize(input, Normalizer.Form.NFD)
                    .replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
}
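To make the difference between the three expressions concrete, here is a small comparison sketch. The class PatternComparison and its strip helper are hypothetical names introduced only for this illustration; the sample strings are assumptions chosen to contain letters with and without a canonical decomposition.

```java
import java.text.Normalizer;

class PatternComparison {

    // Decomposes the input (NFD) and then deletes whatever the given regex matches.
    static String strip(String input, String regex) {
        return Normalizer.normalize(input, Normalizer.Form.NFD).replaceAll(regex, "");
    }

    public static void main(String[] args) {
        // The accented o, z and e decompose into a base letter plus a combining mark.
        // The letters Ł and ß have no canonical decomposition, so NFD leaves them unchanged.
        String input = "Łódź Straße café";

        System.out.println(strip(input, "\\p{InCombiningDiacriticalMarks}+")); // Łodz Straße cafe
        System.out.println(strip(input, "\\p{M}+"));                           // Łodz Straße cafe
        System.out.println(strip(input, "[^\\p{ASCII}]"));                     // odz Strae cafe

        // U+20DD (combining enclosing circle) is a mark outside the U+0300..U+036F block,
        // so the block-based pattern keeps it and only the Mark category pattern removes it.
        String circled = "A\u20dd";
        System.out.println(strip(circled, "\\p{InCombiningDiacriticalMarks}+").length()); // 2
        System.out.println(strip(circled, "\\p{M}+").length());                           // 1
    }
}
```

Which pattern fits best depends on the goal: the first two keep undecomposable letters intact, while the ASCII pattern guarantees pure ASCII output at the cost of silently dropping such letters.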

Walkthrough

Previously, we saw how to convert accented Unicode characters to plain letters. Here, we walk through the process step by step. We start by converting the whole Unicode text, which contains accents and diacritics, to a text without them.

Next, we loop over each character individually and print the intermediate results to the console. The first column contains the original Unicode character. The second column contains the decomposed character, i.e. the base letter followed by its combining mark. The last column contains what remains after every non-ASCII character, here the combining mark, has been removed.

package com.memorynotfound;

import java.text.Normalizer;

public class RemoveAccents {

    static final String original = "Tĥïŝ ĩš â fůňķŷ Šťŕĭńġ";

    public static void main(String... args){
        System.out.println("Stripped:");
        System.out.println(stripAccents(original));

        System.out.println("\nDebugged:");
        debugStrippingAccents();
    }

    public static String stripAccents(String input){
        return input == null ? null :
            Normalizer.normalize(input, Normalizer.Form.NFD)
                    .replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
    }

    public static void debugStrippingAccents(){
        for (char c : original.toCharArray()){
            String text = String.valueOf(c);

            // normalize each character individually
            String decomposed = Normalizer.normalize(text, Normalizer.Form.NFD);

            // remove every non-ASCII character; after decomposition that is the combining mark
            String removed = decomposed.replaceAll("[^\\p{ASCII}]", "");

            // print per character
            System.out.println(text + " (" + asUnicode(text) + ") -> "
                    + decomposed + " (" + asUnicode(decomposed) + ") -> "
                    + removed + " (" + asUnicode(removed) + ")");
        }
    }

    // formats every char of the string as a four-digit hex escape, separated by spaces
    // (an empty string yields an empty result instead of throwing an exception)
    public static String asUnicode(String input){
        StringBuilder result = new StringBuilder();
        for (char c : input.toCharArray()){
            if (result.length() > 0){
                result.append(' ');
            }
            result.append(asUnicode(c));
        }
        return result.toString();
    }

    // formats a single char as a zero-padded, four-digit hex escape
    public static String asUnicode(char input){
        return String.format("\\u%04x", (int) input);
    }
}

Output

Stripped:
This is a funky String

Debugged:
T (\u0054) -> T  (\u0054) -> T (\u0054)
ĥ (\u0125) -> h ̂ (\u0068 \u0302) -> h (\u0068)
ï (\u00ef) -> i ̈ (\u0069 \u0308) -> i (\u0069)
ŝ (\u015d) -> s ̂ (\u0073 \u0302) -> s (\u0073)
  (\u0020) ->    (\u0020) ->   (\u0020)
ĩ (\u0129) -> i ̃ (\u0069 \u0303) -> i (\u0069)
š (\u0161) -> s ̌ (\u0073 \u030c) -> s (\u0073)
  (\u0020) ->    (\u0020) ->   (\u0020)
â (\u00e2) -> a ̂ (\u0061 \u0302) -> a (\u0061)
  (\u0020) ->    (\u0020) ->   (\u0020)
f (\u0066) -> f  (\u0066) -> f (\u0066)
ů (\u016f) -> u ̊ (\u0075 \u030a) -> u (\u0075)
ň (\u0148) -> n ̌ (\u006e \u030c) -> n (\u006e)
ķ (\u0137) -> ķ  (\u006b \u0327) -> k (\u006b)
ŷ (\u0177) -> y ̂ (\u0079 \u0302) -> y (\u0079)
  (\u0020) ->    (\u0020) ->   (\u0020)
Š (\u0160) -> S ̌ (\u0053 \u030c) -> S (\u0053)
ť (\u0165) -> t ̌ (\u0074 \u030c) -> t (\u0074)
ŕ (\u0155) -> r ́ (\u0072 \u0301) -> r (\u0072)
ĭ (\u012d) -> i ̆ (\u0069 \u0306) -> i (\u0069)
ń (\u0144) -> n ́ (\u006e \u0301) -> n (\u006e)
ġ (\u0121) -> g ̇ (\u0067 \u0307) -> g (\u0067)
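
The walkthrough iterates over char values, which is sufficient for the sample string because each of its characters fits in a single char. For text that may contain characters outside the Basic Multilingual Plane, iterating by code point avoids splitting surrogate pairs. The following is a hedged sketch of such a variant; the class CodePointDebug and its methods are hypothetical names, and it assumes Java 8 or later for String.codePoints().

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.List;

class CodePointDebug {

    // Formats one code point in the U+XXXX notation (at least four hex digits).
    static String asCodePoint(int codePoint) {
        return String.format("U+%04X", codePoint);
    }

    // Returns one line per code point: the original followed by its NFD decomposition.
    static List<String> describe(String input) {
        List<String> lines = new ArrayList<>();
        input.codePoints().forEach(cp -> {
            String text = new String(Character.toChars(cp));
            String decomposed = Normalizer.normalize(text, Normalizer.Form.NFD);
            StringBuilder line = new StringBuilder(asCodePoint(cp)).append(" ->");
            decomposed.codePoints().forEach(d -> line.append(' ').append(asCodePoint(d)));
            lines.add(line.toString());
        });
        return lines;
    }

    public static void main(String[] args) {
        // U+1F600 lies outside the BMP and occupies two chars (a surrogate pair) in a Java String.
        describe("\u00e9\uD83D\uDE00").forEach(System.out::println);
        // U+00E9 -> U+0065 U+0301
        // U+1F600 -> U+1F600
    }
}
```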
