Transfer Special Characters (Unicode) From Mobile Browser to Server

I recently took on a project: a POI (point of interest) search service. The user visits the website of a city guide on a mobile phone and searches for POIs within some distance of the current position, or looks up a specific POI; the search function then tells the user the direction and the distance from that position.

mobil.FRANKFURT.de

The client is a set of HTML + XML web pages viewed in the browser of a mobile phone; on the server side I used a J2EE servlet to do the work. During development I ran into one big issue. The project is a tourist guide for Frankfurt, a German city, and some POI and street names contain special German characters (umlauts and ß): ä, ö, ü, ß. The request form in the HTML page uses the POST method to send the parameters.

First, because of compatibility problems among mobile browsers, we cannot use these special characters directly in the HTML page: they are displayed incorrectly if the browser's encoding is not set for the de-DE locale. It would be awful if only German users could use this service.
To solve this problem, we replace the special characters with HTML codes (numeric codes from the ISO 8859-1 table). For example, in the search page, the value and the display string of the select-list entries "Ämter und Institutionen" and "Börse" are written as "&#196;mter und Institutionen" and "B&#246;rse".

This is a table of the special German characters and their HTML codes:

Char	Code
Ä	&#196;
Ö	&#214;
Ü	&#220;
ß	&#223;
ä	&#228;
ö	&#246;
ü	&#252;

You can find more special characters here: http://www.usf.uni-osnabrueck.de/infoservice/html/zeichen.de.html

I cannot say that every mobile browser supports "ä, ö, ü, ß" directly, but most browsers that support HTML also support these HTML codes. So the first problem, the display, is solved.
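The replacement can also be done in code instead of by hand. The following is only a minimal sketch, not the code used in this project: the class name HtmlCodes and the method toHtmlCodes are made up for illustration. It turns every non-ASCII character into its numeric HTML code and leaves markup characters such as & and < untouched.

```java
final class HtmlCodes {
    private HtmlCodes() {}

    /** Replaces every non-ASCII character with a numeric HTML code such as &#246;. */
    static String toHtmlCodes(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); ) {
            int codePoint = text.codePointAt(i);
            if (codePoint > 127) {
                // Outside ASCII: write the decimal code point as &#NNN;
                out.append("&#").append(codePoint).append(';');
            } else {
                out.append((char) codePoint);
            }
            i += Character.charCount(codePoint);
        }
        return out.toString();
    }
}
```

For example, HtmlCodes.toHtmlCodes("B\u00f6rse") gives "B&#246;rse". Writing the test input with a Java Unicode escape keeps the example independent of the encoding of the source file.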

There is another important reason to use the ISO 8859-1 HTML codes: they are transferred as parameters to the server-side servlet through the POST method of the form without transcoding mistakes.

This is the HTML form in the search page:
<?xml version="1.0" encoding="iso8859-1"?>
...............
<form action="servlet/SimpleQueryPOI" method="post">
................
<select name="RUBIK">
<option selected="selected" value="&#196;mter und Institutionen">&#196;mter und Institutionen</option>
<option value="Bibliotheken">Bibliotheken</option>
<option value="B&#252;hnen">B&#252;hnen</option>

...................

This is the Java code in SimpleQueryPOI:

public void doPost(HttpServletRequest request, HttpServletResponse response)
throws ServletException, IOException {

response.setContentType("text/xhtml");
response.setCharacterEncoding("UTF-8");

/*
* If the page already uses the ISO codes in place of the German umlauts, set this to true.
* For example: "&#246;" -> "ö"
*
* If not, set it to false.
* Don't change this value unless you know what you are doing.
*/
useisocodealready = true;

Enumeration paramNames = request.getParameterNames();
while(paramNames.hasMoreElements()) {
String paramName = (String)paramNames.nextElement();

String[] paramValues = request.getParameterValues(paramName);

if(paramName.equalsIgnoreCase("RUBIK")){
strRUBIK = paramValues[0].trim();
CommonModel.debug("RUBIK org: "+ strRUBIK);
}
..........

}
.......

}

If you print out the RUBIK parameter received from the browser client, you will find that the right value is printed. That means the servlet received the parameters, written as HTML codes in the form, without problems. For example, "&#196;mter und Institutionen" and "B&#246;rse" arrive as "Ämter und Institutionen" and "Börse".


But we were not happy during testing. With most mobile phone browsers (Nokia, SE) everything seemed OK, but with the BenQ-Siemens EF18 and some Samsung phones the search function returned no results whenever the parameters contained German special characters. Why?

By the way, the server-side database is MySQL with UTF-8 encoding. I checked the parameters transferred from the client and found that, for the EF18, they were transcoded wrongly:
"&#196;mter und Institutionen" and "B&#246;rse" had turned into "Ã„mter und Institutionen" and "BÃ¶rse".

I did not know what those strange codes were, but they were certainly some form of Unicode, which I did not want. Something must have gone wrong during the transcoding.

First I tried setting the encoding of the incoming request to "UTF-8"; perhaps the request encoding has to be set before the parameters are parsed.
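One plausible explanation, stated here as an assumption and not verified on the EF18 itself: the phone sent the form data as UTF-8 bytes, and the servlet container decoded them with ISO-8859-1, the default that the servlet specification prescribes when the request names no encoding. The sketch below reproduces that effect with plain Java; the class name MojibakeDemo is made up for illustration.

```java
import java.nio.charset.StandardCharsets;

final class MojibakeDemo {
    private MojibakeDemo() {}

    /** Encodes the text as UTF-8 and then wrongly decodes the bytes as ISO-8859-1. */
    static String misdecode(String original) {
        byte[] utf8Bytes = original.getBytes(StandardCharsets.UTF_8);
        return new String(utf8Bytes, StandardCharsets.ISO_8859_1);
    }
}
```

Here "ö" (U+00F6) is sent as the two bytes C3 B6 and comes back as the two characters U+00C3 U+00B6, which a console shows as "Ã¶". For "Ä" the second character is U+0084, a control character in ISO-8859-1 that Windows-1252 displays as "„".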

public void doPost(HttpServletRequest request, HttpServletResponse response)
throws ServletException, IOException {
request.setCharacterEncoding("UTF-8");

response.setContentType("text/xhtml");
response.setCharacterEncoding("UTF-8");

.............................

OK, with the BenQ-Siemens EF18's browser everything works now, but now most of the other mobile browsers do not work! Strange; I did not know the reason.

I had a simple idea: could I replace the strange codes while parsing the parameters? I wrote this function:

public String tranStringSpecial(String in){
    // Chain the replacements on tem. Calling in.replaceAll(...) on every line
    // would throw away all replacements except the last one.
    String tem = in;
    tem = tem.replace("Ã¶", "\u00f6"); // \u00f6 is the Unicode escape inside Java for ö
    tem = tem.replace("Ã–", "\u00d6"); // Ö
    tem = tem.replace("Ã¤", "\u00e4"); // ä
    tem = tem.replace("Ã„", "\u00c4"); // Ä
    tem = tem.replace("Ã¼", "\u00fc"); // ü
    tem = tem.replace("Ãœ", "\u00dc"); // Ü
    tem = tem.replace("ÃŸ", "\u00df"); // ß

    return tem;
}


...............................
if(paramName.equalsIgnoreCase("RUBIK")){
strRUBIK = paramValues[0].trim();
CommonModel.debug("RUBIK org: "+ strRUBIK);
strRUBIK =
tranStringSpecial(strRUBIK);
CommonModel.debug("RUBIK: "+ strRUBIK);
}

...............................
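An alternative to a hand-written replacement table is to undo the wrong decoding in one step. This is a sketch under the assumption that the container decoded UTF-8 bytes as ISO-8859-1; the class name ParameterRepair and the method repairIfMisdecoded are made up for illustration. The value is re-decoded only when its bytes form valid UTF-8, so parameters that arrived correctly are left alone.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class ParameterRepair {
    private ParameterRepair() {}

    static String repairIfMisdecoded(String value) {
        // Characters above U+00FF cannot come from an ISO-8859-1 decoding.
        for (int i = 0; i < value.length(); i++) {
            if (value.charAt(i) > 0xFF) {
                return value;
            }
        }
        // Recover the bytes the client sent, then decode them strictly as UTF-8.
        byte[] raw = value.getBytes(StandardCharsets.ISO_8859_1);
        CharsetDecoder strictUtf8 = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            return strictUtf8.decode(ByteBuffer.wrap(raw)).toString();
        } catch (CharacterCodingException e) {
            // Not valid UTF-8, so the value was most likely decoded correctly.
            return value;
        }
    }
}
```

The guess can still be wrong: a user who really typed the two characters "Ã¶" would see them turned into "ö". Setting the request encoding correctly before the parameters are parsed remains the cleaner fix.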

OK, I continued testing. Still a strange problem: now most of the mobile browsers work fine, but the EF18 sometimes works and sometimes does not. Still, that is better; at least it works sometimes :P. That is not enough, of course.
Why? I needed to know which encoding the client uses to send the parameters. Fortunately, we can check such information in the request headers from the client.

Enumeration headerNames = request.getHeaderNames();
while (headerNames.hasMoreElements()) {
headerName = (String) headerNames.nextElement();
if (headerName != null) {
CommonModel.debug("headers "+ headerName+" :"+request.getHeader(headerName));
}
}

But the result let me down: none of the mobile browsers, and not even IE or Firefox, send the encoding information in the header:

-=Debug=- getCharacterEncoding is:null
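The encoding, when a client does send it, travels as the charset parameter of the Content-Type header, and request.getCharacterEncoding() returns null when that parameter is absent. The following sketch shows the idea with plain string handling; the class name ContentTypeCharset and the method charsetOf are made up for illustration, and this is not the container's own parser.

```java
final class ContentTypeCharset {
    private ContentTypeCharset() {}

    /** Returns the charset named in a Content-Type header value, or null if there is none. */
    static String charsetOf(String contentType) {
        if (contentType == null) {
            return null;
        }
        for (String part : contentType.split(";")) {
            String trimmed = part.trim();
            // Parameter names are case-insensitive, so compare ignoring case.
            if (trimmed.regionMatches(true, 0, "charset=", 0, 8)) {
                String value = trimmed.substring(8).trim();
                if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
                    value = value.substring(1, value.length() - 1);
                }
                return value.isEmpty() ? null : value;
            }
        }
        return null;
    }
}
```

A typical form post carries only "application/x-www-form-urlencoded", for which charsetOf returns null, matching the debug output above.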

I googled and found this useful information:
Currently, many browsers do not send character
encoding information in the Content-Type header of an HTTP request. If
an encoding has not been specified by the client request, the container
uses a default encoding to parse request parameters. If the client
hasn't set character encoding and the request parameters are encoded
with a different encoding than the default, the parameters will be
parsed incorrectly. You can use the method setCharacterEncoding in the ServletRequest
interface to set the encoding. Since this method must be called prior
to parsing any post data or reading any input from the request, this
function is a prime application for filters.
Such a filter is contained in the examples distributed with the Tomcat 4.0
web container. The filter sets the character encoding from a filter
initialization parameter. This filter could easily be extended to set
the encoding based on characteristics of the incoming request, such as
the values of the Accept-Language and User-Agent headers, or a value
saved in the current user's session.

public void doFilter(ServletRequest request,
        ServletResponse response, FilterChain chain)
        throws IOException, ServletException {
    String encoding = selectEncoding(request);
    if (encoding != null)
        request.setCharacterEncoding(encoding);
    chain.doFilter(request, response);
}

public void init(FilterConfig filterConfig) throws ServletException {
    this.filterConfig = filterConfig;
    this.encoding = filterConfig.getInitParameter("encoding");
}

protected String selectEncoding(ServletRequest request) {
    return (this.encoding);
}

OK, clear: we need a filter that sets the encoding of the incoming request, and first we declare the HTML page as UTF-8:

<?xml version="1.0" encoding="utf-8"?>
...............
<form action="servlet/SimpleQueryPOI" method="post">

Download the source code of the Set Character Encoding Filter (thanks to Craig McClanahan for his work!) and create the same class (SetCharacterEncodingFilter) in the project.

Code:


/*
* Copyright 2004 The Apache Software Foundation
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/

package filters;

import java.io.IOException;
import javax.servlet.Filter;
import javax.servlet.FilterChain;
import javax.servlet.FilterConfig;
import javax.servlet.ServletException;
import javax.servlet.ServletRequest;
import javax.servlet.ServletResponse;
import javax.servlet.UnavailableException;

/**
* <p>Example filter that sets the character encoding to be used in parsing the
* incoming request, either unconditionally or only if the client did not
* specify a character encoding. Configuration of this filter is based on
* the following initialization parameters:</p>
* <ul>
* <li><strong>encoding</strong> - The character encoding to be configured
* for this request, either conditionally or unconditionally based on
* the <code>ignore</code> initialization parameter. This parameter
* is required, so there is no default.</li>
* <li><strong>ignore</strong> - If set to "true", any character encoding
* specified by the client is ignored, and the value returned by the
* <code>selectEncoding()</code> method is set. If set to "false,
* <code>selectEncoding()</code> is called <strong>only</strong> if the
* client has not already specified an encoding. By default, this
* parameter is set to "true".</li>
* </ul>
*
* <p>Although this filter can be used unchanged, it is also easy to
* subclass it and make the <code>selectEncoding()</code> method more
* intelligent about what encoding to choose, based on characteristics of
* the incoming request (such as the values of the <code>Accept-Language</code>
* and <code>User-Agent</code> headers, or a value stashed in the current
* user's session.</p>
*
* @author Craig McClanahan
* @version $Revision: 267129 $ $Date: 2004-03-18 11:40:35 -0500 (Thu, 18 Mar 2004) $
*/

public class SetCharacterEncodingFilter implements Filter {

// ----------------------------------------------------- Instance Variables

/**
* The default character encoding to set for requests that pass through
* this filter.
*/
protected String encoding = null;

/**
* The filter configuration object we are associated with. If this value
* is null, this filter instance is not currently configured.
*/
protected FilterConfig filterConfig = null;

/**
* Should a character encoding specified by the client be ignored?
*/
protected boolean ignore = true;

// --------------------------------------------------------- Public Methods

/**
* Take this filter out of service.
*/
public void destroy() {

this.encoding = null;
this.filterConfig = null;

}

/**
* Select and set (if specified) the character encoding to be used to
* interpret request parameters for this request.
*
* @param request The servlet request we are processing
* @param result The servlet response we are creating
* @param chain The filter chain we are processing
*
* @exception IOException if an input/output error occurs
* @exception ServletException if a servlet error occurs
*/
public void doFilter(ServletRequest request, ServletResponse response,
FilterChain chain)
throws IOException, ServletException {

// Conditionally select and set the character encoding to be used
if (ignore || (request.getCharacterEncoding() == null)) {
String encoding = selectEncoding(request);
if (encoding != null)
request.setCharacterEncoding(encoding);
}

// Pass control on to the next filter
chain.doFilter(request, response);

}

/**
* Place this filter into service.
*
* @param filterConfig The filter configuration object
*/
public void init(FilterConfig filterConfig) throws ServletException {

this.filterConfig = filterConfig;
this.encoding = filterConfig.getInitParameter("encoding");
String value = filterConfig.getInitParameter("ignore");
if (value == null)
this.ignore = true;
else if (value.equalsIgnoreCase("true"))
this.ignore = true;
else if (value.equalsIgnoreCase("yes"))
this.ignore = true;
else
this.ignore = false;

}

// ------------------------------------------------------ Protected Methods

/**
* Select an appropriate character encoding to be used, based on the
* characteristics of the current request and/or filter initialization
* parameters. If no character encoding should be set, return
* <code>null</code>.
* <p>
* The default implementation unconditionally returns the value configured
* by the <strong>encoding</strong> initialization parameter for this
* filter.
*
* @param request The servlet request we are processing
*/
protected String selectEncoding(ServletRequest request) {

return (this.encoding);

}

}


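The next step is very important: open the web.xml file of the project (Projectname/WebRoot/WEB-INF) and add the following code (poiQuery is the package folder of the class, so the package declaration at the top of the class has to be changed from filters to poiQuery to match):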

<filter>
<filter-name>SetCharacterEncoding</filter-name>
<filter-class>poiQuery.SetCharacterEncodingFilter</filter-class>
<init-param>
<param-name>encoding</param-name>
<param-value>UTF-8</param-value>
</init-param>
</filter>
<filter-mapping>
<filter-name>SetCharacterEncoding</filter-name>
<url-pattern>/*</url-pattern>
</filter-mapping>

Compile the project and restart Tomcat. We went on testing the service and, phew, everything works fine, including the BenQ-Siemens EF18. Cool filter!

But two days later I needed to add an intermediate page that displays the possible street names when the user has entered an incomplete street name, so that the user can select one and continue the search. When I used a form with a select list to let the user choose, everything was OK; but my boss insisted that the street names should be links, not select options, for better usability: the user clicks only once to continue the search (with a select list, the user must first select and then click to send the request).

For example, if you input "kal", the service gives you an intermediate page like this:
[Screenshot: intermediate page listing the possible street names]

The parameters are included in the link and sent to the server as a URL GET request, BUT the filter does NOT work with GET requests! The setCharacterEncoding method only applies to the body of the request, and Tomcat decodes the parameters of a GET URL separately from those of a POST body. We need to configure Tomcat so that it can decode the parameters in the GET URL:
Find server.xml in Tomcat, open it and look for this code:

<!-- Define a non-SSL HTTP/1.1 Connector on port 80-->
<Connector port="8080" maxHttpHeaderSize="8192"
maxThreads="150" minSpareThreads="25" maxSpareThreads="75"
enableLookups="false" redirectPort="8443" acceptCount="100"
connectionTimeout="20000" disableUploadTimeout="true" URIEncoding="UTF-8"/>

Note that the port can be whatever value you have set, maybe 80, maybe 8090; just add URIEncoding="UTF-8" to the connector. Save the file and restart Tomcat. Now Tomcat can process GET requests correctly.
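The effect of URIEncoding can be imitated outside Tomcat with java.net.URLDecoder. This is only an illustration of the principle, not Tomcat's own decoding code; the class name QueryDecoding and the method decode are made up for the example.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

final class QueryDecoding {
    private QueryDecoding() {}

    /** Decodes a percent-encoded query value with the given charset. */
    static String decode(String rawValue, String charsetName) {
        try {
            return URLDecoder.decode(rawValue, charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + charsetName, e);
        }
    }
}
```

Decoding "Kallestra%C3%9Fe" as UTF-8 gives "Kallestraße", while decoding the same bytes as ISO-8859-1 gives the two characters U+00C3 U+009F in place of ß, which is the kind of garbage seen earlier with POST.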

I was about to finish this article when I met a new problem: the links containing special letters still did not work. Well, every bug should have a solution. Checking the source code of the intermediate page, I found that HTML codes were used in the links that send the request.

For example: for Kallestraße, the link is: 

<a href="/gc/frankfurt_poi/servlet/SimpleQueryPOI?METHOD=1&amp;STRASSE=Kallestra&#223;e&amp;isAccurate=true">Kallestra&#223;e</a>

What did I say about GET encoding? Tomcat can now parse a GET URL encoded as UTF-8, but not one containing HTML codes.

 
&#223; is the HTML code for ß
%C3%9F is the UTF-8 code for ß, as written in a URL

So the links in the intermediate page should be encoded as UTF-8, not as HTML codes. This is the table of UTF-8 codes for the German characters:

ü    %C3%BC
Ü    %C3%9C
ö    %C3%B6
Ö    %C3%96
ä    %C3%A4
Ä    %C3%84
ß    %C3%9F

The link should be like this:

<a href="/gc/frankfurt_poi/servlet/SimpleQueryPOI?METHOD=1&amp;STRASSE=Kallestra%C3%9Fe&amp;isAccurate=true">Kallestra&#223;e</a>

The string for display still uses the HTML code.
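Such links do not have to be written by hand. The sketch below builds one with java.net.URLEncoder for the query value and numeric HTML codes for the visible text; the class name StreetLink and the method build are made up for illustration, and the path and parameter names are simply copied from the example link above.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

final class StreetLink {
    private StreetLink() {}

    static String build(String street) {
        String queryValue;
        try {
            // UTF-8 percent-encoding for the URL, e.g. %C3%9F for the letter sz
            queryValue = URLEncoder.encode(street, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
        // Numeric HTML codes for the visible text, e.g. &#223;
        StringBuilder label = new StringBuilder();
        for (int i = 0; i < street.length(); ) {
            int cp = street.codePointAt(i);
            if (cp > 127 || cp == '&' || cp == '<' || cp == '>' || cp == '"') {
                label.append("&#").append(cp).append(';');
            } else {
                label.append((char) cp);
            }
            i += Character.charCount(cp);
        }
        return "<a href=\"/gc/frankfurt_poi/servlet/SimpleQueryPOI?METHOD=1&amp;STRASSE="
                + queryValue + "&amp;isAccurate=true\">" + label + "</a>";
    }
}
```

Note that URLEncoder writes a space as "+", which is fine inside a query string such as this one.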

 Now I can have a rest. lol

 

 

Ref:
http://www.javafaq.nu/java-example-code-235.html
http://java.sun.com/products/servlet/Filters.html
http://dev.csdn.net/author/lin_bei/66de56da81df44ee9ac379099a0600cf.html
http://java.sun.com/developer/technicalArticles/Intl/HTTPCharset/

