Decoding URLs and strings can be a common task, especially when working with web data. This is easy to do in a language like Java or Python, but what about in Hive? Luckily, this is fairly easy as well.
Decoding URLs in Hive with Reflection
The first and easiest approach is to use the reflect()
UDF that comes with Hive. The reflect()
UDF uses Java reflection to instantiate and call methods of objects. It can also call static functions. The method used in reflect() must return a Java primitive or a type that Hive knows how to serialize (like String).
Here is an example using this reflection approach to decode encoded URLs.
SELECT
reflect("java.net.URLDecoder", "decode", encoded_url, "UTF-8")
FROM
encoded_urls;
You can see in the Hive statement above that we use the URLDecoder
class and the decode()
method. The decode()
method takes two arguments which are the input string and the character encoding. The input string is our Hive column to decode (in this case “encoded_url”) and the character encoding is “UTF-8” (recommended encoding for non-ASCII characters in URLs).
Documentation for the Java URLDecoder.decode
method can be found here, and documentation on the Hive reflect()
UDF can be found here.
Decoding URLS with Custom Hive UDF
Another way to decode URLs in Hive is to create a custom UDF (User Defined Function). This takes more time to set up, but is generally more convenient to use afterwards.
The simplest UDFs extend the UDF
class and implement an evaluate()
method. This evaluate()
method can take and return basic Java types and primitives (String, int, double, etc.) or the associated Hadoop types (Text, IntWritable, DoubleWritable, etc.).
Here is a simple Hive UDF that will decode URLs:
package net.bigdatums.hive.udf;
import org.apache.hadoop.hive.ql.exec.Description;
import org.apache.hadoop.hive.ql.exec.UDF;
import java.net.URLDecoder;
@Description(name="DecodeUrl", value="Decodes a URL String. If decoding is unsuccessful null is returned")
public class DecodeUrl extends UDF {
public String evaluate(String input) {
//return null if input is null
if(input == null) return null;
//return decoded string. if exception return null
try {return URLDecoder.decode(input, "UTF-8");}
catch (Exception e) {return null;}
}
}
This UDF checks to ensure that the input object is not null. If the input is not null, the UDF attempts to decode the input URL. If there are issues/thrown Exceptions decoding the input, null is returned.
Hive UDFs are generally written in Java, compiled, and packaged into a JAR file. This JAR file must be made available to Hive (added to Hive’s class path). After this is done, a new function must be created within Hive.
Here is an example of explicitly adding a JAR containing our custom UDF, creating a temporary function, and using it on a Hive table to decode web URLs.
ADD JAR /home/user1/bigdatums-hive-udf-1.0-SNAPSHOT.jar;
CREATE TEMPORARY FUNCTION decode_url as 'net.bigdatums.hive.udf.DecodeURL';
SELECT
decode_url(encoded_url)
FROM
encoded_urls;