What does the term “canonical form” or “canonical representation” in Java mean?

By | July 9, 2019


I have often heard this term being used, but I have never really understood it.

What does it mean, and can anyone give some examples/point me to some links?

EDIT: Thanks to everyone for the replies. Can you also tell me how the canonical representation is useful in equals() performance, as stated in Effective Java?


Wikipedia covers this under the term canonicalization, which it describes as:

A process for converting data that has more than one possible representation into a “standard” canonical representation. This can be done to compare different representations for equivalence, to count the number of distinct data structures, to improve the efficiency of various algorithms by eliminating repeated calculations, or to make it possible to impose a meaningful sorting order.
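To make that definition concrete, here is a minimal sketch using fractions, where 1/2, 2/4 and -3/-6 are different representations of the same value. The `Fraction` class and its fields are hypothetical names chosen for illustration, not anything from the JDK. The constructor canonicalizes by reducing to lowest terms and keeping the denominator positive, so comparing for equivalence and counting distinct values both become plain field comparisons.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical value class: every instance is stored in canonical form
// (lowest terms, positive denominator).
final class Fraction {
    final long num;
    final long den;

    Fraction(long num, long den) {
        if (den == 0) throw new IllegalArgumentException("zero denominator");
        long g = gcd(Math.abs(num), Math.abs(den));
        if (den < 0) g = -g;          // dividing by a negative g flips both signs
        this.num = num / g;
        this.den = den / g;
    }

    private static long gcd(long a, long b) {
        return b == 0 ? a : gcd(b, a % b);
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof Fraction)) return false;
        Fraction f = (Fraction) o;
        return num == f.num && den == f.den;   // valid only because both are canonical
    }

    @Override
    public int hashCode() {
        return 31 * Long.hashCode(num) + Long.hashCode(den);
    }
}

final class FractionDemo {
    public static void main(String[] args) {
        Set<Fraction> distinct = new HashSet<>();
        distinct.add(new Fraction(1, 2));
        distinct.add(new Fraction(2, 4));
        distinct.add(new Fraction(-3, -6));
        distinct.add(new Fraction(1, 3));
        System.out.println(distinct.size());   // prints 2
    }
}
```

Without the reduction step in the constructor, `equals` would have to cross-multiply (or reduce) on every call, and the set above would report four elements.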

The Unicode example made the most sense to me:

Variable-length encodings in the Unicode standard, in particular UTF-8, have more than one possible encoding for most common characters. This makes string validation more complicated, since every possible encoding of each string character must be considered. A software implementation which does not consider all character encodings runs the risk of accepting strings considered invalid in the application design, which could cause bugs or allow attacks. The solution is to allow a single encoding for each character. Canonicalization is then the process of translating every string character to its single allowed encoding. An alternative is for software to determine whether a string is canonicalized, and then reject it if it is not. In this case, in a client/server context, the canonicalization would be the responsibility of the client.
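The passage above is about byte-level encodings. A closely related case that is easy to try in Java is Unicode normalization, where the same visible character can be written as different sequences of code points. The sketch below uses the standard `java.text.Normalizer` class; the `canonical` helper and the class name are my own, and picking NFC as the canonical form is an assumption (NFD would work equally well as long as it is applied consistently).

```java
import java.text.Normalizer;

final class UnicodeCanonicalDemo {
    // Hypothetical helper: map a string to its NFC (canonical composed) form.
    static String canonical(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        String composed = "\u00E9";     // "é" as a single code point
        String decomposed = "e\u0301";  // "e" followed by a combining acute accent

        // The raw strings differ even though they render identically.
        System.out.println(composed.equals(decomposed));                        // prints false

        // After canonicalization they compare equal.
        System.out.println(canonical(composed).equals(canonical(decomposed)));  // prints true
    }
}
```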

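In summary, a canonical form is a single standard representation for data that could otherwise be written in several equivalent ways. Once values are in that form, equivalent ones can be compared directly, and you can still convert from it to any other representation you may need.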
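On the follow-up about equals(): as I understand the advice in Effective Java, when comparing a field is more expensive than a simple equality test, an immutable class can compute a canonical form of that field once and let equals() do a cheap exact comparison on it. The sketch below is loosely modeled on the book's case-insensitive string example, but the class name, the field names, and the choice of `toLowerCase(Locale.ROOT)` as the canonical form are my assumptions, and lower-casing is not guaranteed to match `equalsIgnoreCase` for every Unicode edge case.

```java
import java.util.Locale;
import java.util.Objects;

// Hypothetical immutable class that compares names case-insensitively.
final class CaseInsensitiveName {
    private final String original;
    private final String canonical;   // computed once, at construction

    CaseInsensitiveName(String s) {
        this.original = Objects.requireNonNull(s);
        this.canonical = s.toLowerCase(Locale.ROOT);
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof CaseInsensitiveName)) return false;
        // Exact comparison on the precomputed canonical form,
        // with no per-call case folding.
        return canonical.equals(((CaseInsensitiveName) o).canonical);
    }

    @Override
    public int hashCode() {
        return canonical.hashCode();   // consistent with equals
    }

    @Override
    public String toString() {
        return original;
    }
}

final class CaseInsensitiveNameDemo {
    public static void main(String[] args) {
        CaseInsensitiveName a = new CaseInsensitiveName("Polish");
        CaseInsensitiveName b = new CaseInsensitiveName("polish");
        System.out.println(a.equals(b));   // prints true
        System.out.println(a);             // prints Polish
    }
}
```

The trade-off is one extra field per object in exchange for cheaper comparisons. This only works cleanly because the class is immutable; a mutable class would have to keep the canonical form up to date on every change.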
