What is SerDe in Hive

Serialization

SerDe is short for Serializer/Deserializer: the component Hive uses to serialize and deserialize data.

Serialization converts an object into a sequence of bytes.
Deserialization recovers objects from byte sequences.
Serialization of objects has two main uses: object persistence, i.e. the object is converted into a byte sequence and stored in a file; and transmitting object data over the network.
Beyond these two uses, Hive relies on deserialization to turn each stored key/value record into the values of the individual columns of a Hive table. Because the SerDe interprets the data when it is read, Hive can load data into a table without converting it first. This can save a lot of time when processing large amounts of data.

A SerDe defines how Hive processes a data set through two functions, serialize and deserialize. Serialize converts the Java objects used by Hive into a sequence of bytes that can be written to HDFS, or into a stream that other systems can recognize. Deserialize converts a string or binary stream into Java objects that Hive can work with. For example, a SELECT statement uses deserialize to parse the data read from HDFS, while an INSERT statement uses serialize, because the data must be serialized before it is written to HDFS.
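As a rough illustration of the two directions, the sketch below assumes a hypothetical table named page_views that uses Hive's default text format, and a Hive version that supports INSERT ... VALUES (0.14 or later). The table and column names are made up for the example.

```sql
-- Hypothetical table for illustration; with no ROW FORMAT clause Hive falls back to its default SerDe.
CREATE TABLE page_views (user_id INT, url STRING);

-- Write path: Hive serializes each row object into bytes before storing it in HDFS.
INSERT INTO TABLE page_views VALUES (1, 'https://example.com/');

-- Read path: Hive deserializes the stored bytes back into row objects, then returns the requested columns.
SELECT user_id, url FROM page_views;
```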

When you create a table in Hive, you specify how its data is serialized and deserialized, either through a custom SerDe or through one of Hive's built-in SerDes.

CREATE [EXTERNAL] TABLE [IF NOT EXISTS] table_name
[(col_name data_type [COMMENT col_comment], ...)]
[COMMENT table_comment]
[PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)]
[CLUSTERED BY (col_name, col_name, ...)
[SORTED BY (col_name [ASC | DESC], ...)]
INTO num_buckets BUCKETS]
[ROW FORMAT row_format]
[STORED AS file_format]
[LOCATION hdfs_path]

In the CREATE TABLE statement above, the ROW FORMAT clause specifies which SerDe the table uses.
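A minimal sketch of a ROW FORMAT clause that names a SerDe explicitly, here the built-in CSV SerDe. The table name, columns, and HDFS location are hypothetical; the columns are declared as STRING because this SerDe reads every column as a string.

```sql
-- Hypothetical table and location, for illustration only.
CREATE EXTERNAL TABLE IF NOT EXISTS csv_orders (
  order_id STRING,
  customer STRING,
  amount   STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
  "separatorChar" = ",",   -- field delimiter
  "quoteChar"     = "\"",  -- character that wraps quoted fields
  "escapeChar"    = "\\"   -- character that escapes special characters
)
STORED AS TEXTFILE
LOCATION '/data/csv_orders';
```

Numeric columns such as amount would then need a CAST at query time, for example CAST(amount AS DOUBLE).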

Built-in SerDe types

Avro
ORC
RegEx
Thrift
Parquet
CSV
JsonSerDe
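For file formats such as ORC, Parquet, and Avro, the SerDe usually does not have to be named directly: a STORED AS clause selects the matching SerDe along with its input and output formats. A small sketch, assuming a hypothetical table named events_orc:

```sql
-- Hypothetical table; STORED AS ORC selects the ORC SerDe and its input/output formats.
CREATE TABLE events_orc (id BIGINT, payload STRING)
STORED AS ORC;

-- The "SerDe Library" row of the output shows which SerDe class the table uses.
DESCRIBE FORMATTED events_orc;
```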

Custom type

Steps for custom types:

  • Define a class that extends the abstract class AbstractSerDe and implements its methods, including initialize and deserialize
  • Package the custom SerDe class into a JAR and add the JAR to the Hive session

hive> add jar MySerDe.jar;

  • Specify the custom SerDe class in the ROW FORMAT clause when creating the table
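Put together, the last two steps might look like the sketch below. The JAR path, the class name com.example.hive.MySerDe, and the table definition are all hypothetical placeholders; substitute the fully qualified name of your own SerDe class.

```sql
-- Path and class name are hypothetical; substitute your own JAR and SerDe class.
ADD JAR /path/to/MySerDe.jar;

-- ROW FORMAT SERDE takes the fully qualified class name of the custom SerDe.
CREATE TABLE custom_format_table (id INT, name STRING)
ROW FORMAT SERDE 'com.example.hive.MySerDe'
STORED AS TEXTFILE;
```

ADD JAR only affects the current session, so the JAR has to be added again (or configured permanently) before the table is queried from a new session.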