最近在做 Rust RDMA RPC 框架,需要一套二进制序列化库。因为对 Rust 编程中的范型和宏不是太熟悉,所以刚好自己做一套用以练手。项目 repo 是 SF-Zhou/derse,本文简述一下该序列化库的设计方案。
序列化的二进制格式参考我原先做的 C++ 库,大概原则是这样的:
举例:
struct A {
x: String,
y: i32
}
struct B {
a: A,
b: u64,
}
// 以下是一个 B 类型对象的序列化结果
data = [0x1a, 0x11, 0x0c, 0x68, 0x65, 0x6c, 0x6c, 0x6f, 0x20, 0x77, 0x6f, 0x72, 0x6c, 0x64, 0x21, 0x20, 0x00, 0x00, 0x00, 0xe9, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00];
data = [
0x1a, // B 类型各个字段序列化的总长度为 26
// 第一个字段:结构体 a
0x11, // A 类型各个字段序列化的总长度为 17
// 第一个字段:String
0x0c, // 字符串的长度为 12
// 解析字符串,得到 "hello world!"
0x68, 0x65, 0x6c, 0x6c, 0x6f, 0x20, 0x77, 0x6f, 0x72, 0x6c, 0x64, 0x21,
// 第二个字段:i32
0x20, 0x00, 0x00, 0x00, // 小端解析得到数字 32
// 第二个字段:u64
0xe9, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, // 小端解析得到数字 233
]
实现这样的序列化似乎并不复杂:
&x.to_le_bytes()
获得 &[u8]
x.as_bytes()
获得 &[u8]
&[u8]
,再拼接上长度Vec<Item>
这样的容器,按顺序将每个 Item
转为 &[u8]
所以需要一个容器,用以存储序列化的结果。并且该容器最好是可以用户自定义的,故这里定义一个 Serializer
的 trait 用以表征承载序列化结果的容器。考虑上一节中提到的第 5 条,因为结构体序列化的长度是记录在最前方的,而该长度又是变长的,我们无法在序列化完成前轻松地获得该长度,也不希望先序列化字段到某个 buffer 后再复制,所以这里定义一个反向增长的 prepend
接口,会将当前传入的 &[u8]
复制到 serializer
的前方。序列化结构体时,先逆序地序列化所有字段,最后再序列化长度。最终 Serializer
的定义如下:
pub trait Serializer {
fn prepend(&mut self, data: impl AsRef<[u8]>) -> Result<()>;
fn len(&self) -> usize;
}
对于序列化操作本身,定义一个 Serialize
的 trait,需要用户自行实现的接口函数是 serialize_to
,即将某个对象序列化到现有的 serializer
上:
pub trait Serialize {
fn serialize<S: Serializer + Default>(&self) -> Result<S> {
let mut serializer = S::default();
self.serialize_to(&mut serializer)?;
Ok(serializer)
}
fn serialize_to<S: Serializer>(&self, serializer: &mut S) -> Result<()>;
}
基础类型的 Serialize
实现举例:
// 1. bool 类型序列化结果占用一个字节
impl Serialize for bool {
fn serialize_to<S: Serializer>(&self, serializer: &mut S) -> Result<()> {
serializer.prepend([*self as u8])
}
}
// 2. 各类数值类型使用宏避免重复代码
macro_rules! impl_serialize_trait {
($($t:ty),*) => {
$(impl Serialize for $t {
fn serialize_to<S: Serializer>(&self, serializer: &mut S) -> Result<()> {
serializer.prepend(&self.to_le_bytes())
}
}
};
}
impl_serialize_trait! {i8, i16, i32, i64, i128, u8, u16, u32, u64, u128, f32, f64}
// 3. 对于 usize 这种按平台有不同大小的类型,统一转为 u64 保证统一
impl Serialize for usize {
fn serialize_to<S: Serializer>(&self, serializer: &mut S) -> Result<()> {
(*self as u64).serialize_to(serializer)
}
}
// 4. 对于字符串,先序列化字符串的内容,再序列化它的长度
impl Serialize for str {
fn serialize_to<S: Serializer>(&self, serializer: &mut S) -> Result<()> {
serializer.prepend(self.as_bytes())?;
VarInt64(self.len() as u64).serialize_to(serializer)
}
}
impl Serialize for String {
fn serialize_to<S: Serializer>(&self, serializer: &mut S) -> Result<()> {
self.as_str().serialize_to(serializer)
}
}
// 5. 对 Vec 亦然,并且序列化 Item 时需要逆序
impl<Item: Serialize> Serialize for Vec<Item> {
fn serialize_to<S: Serializer>(&self, serializer: &mut S) -> Result<()> {
for item in self.iter().rev() {
item.serialize_to(serializer)?;
}
VarInt64(self.len() as u64).serialize_to(serializer)
}
}
// 6. 对于 Option,需要额外的一个字节记录是否存在值
impl<Item: Serialize> Serialize for Option<Item> {
fn serialize_to<S: Serializer>(&self, serializer: &mut S) -> Result<()> {
if let Some(item) = self {
item.serialize_to(serializer)?;
true.serialize_to(serializer)
} else {
false.serialize_to(serializer)
}
}
}
// 7. 对于不同长度的 tuple,同样借助宏生成代码
macro_rules! tuple_serialization {
(($($name:ident),+), ($($idx:tt),+)) => {
impl<$($name),+> Serialize for ($($name,)+)
where
$($name: Serialize),+
{
fn serialize_to<S: Serializer>(&self, serializer: &mut S) -> Result<()> {
$((self.$idx.serialize_to(serializer))?;)+
Ok(())
}
}
};
}
tuple_serialization!((H), (0));
tuple_serialization!((H, I), (1, 0));
tuple_serialization!((H, I, J), (2, 1, 0));
tuple_serialization!((H, I, J, K), (3, 2, 1, 0));
tuple_serialization!((H, I, J, K, L), (4, 3, 2, 1, 0));
tuple_serialization!((H, I, J, K, L, M), (5, 4, 3, 2, 1, 0));
tuple_serialization!((H, I, J, K, L, M, N), (6, 5, 4, 3, 2, 1, 0));
根据 Serialize
,可以设计出对称的 Deserialize
接口:
pub trait Deserialize<'a> {
fn deserialize<D: Deserializer<'a>>(mut der: D) -> Result<Self>
where
Self: Sized,
{
Self::deserialize_from(&mut der)
}
fn deserialize_from<D: Deserializer<'a>>(buf: &mut D) -> Result<Self>
where
Self: Sized;
}
带有生命周期是为了返回值可以直接引用 deserializer
中的一些内容以减少内存复制。依照序列化的格式,Deserializer
需要提供如下接口:
pub trait Deserializer<'a> {
// 检查 Deserilizer 是否为空,维持兼容性时使用
fn is_empty(&self) -> bool;
// 向前跳过指定长度,同时返回跳过的这段
fn advance(&mut self, len: usize) -> Result<Self>
where
Self: Sized;
// pop 出指定长度的 &[u8] 用于反序列化
fn pop(&mut self, len: usize) -> Result<Cow<'a, [u8]>>;
}
// 对 &[u8] 实现 Deserializer
impl<'a> Deserializer<'a> for &'a [u8] {
/// Checks if the byte slice is empty.
fn is_empty(&self) -> bool {
<[u8]>::is_empty(self)
}
/// Advances the byte slice by the specified length.
fn advance(&mut self, len: usize) -> Result<Self>
where
Self: Sized,
{
if len <= self.len() {
let (front, back) = self.split_at(len);
*self = back;
Ok(front)
} else {
Err(Error::DataIsShort {
expect: len,
actual: self.len(),
})
}
}
/// Pops the specified length of data from the byte slice.
fn pop(&mut self, len: usize) -> Result<Cow<'a, [u8]>> {
if len <= self.len() {
let (front, back) = self.split_at(len);
*self = back;
Ok(Cow::Borrowed(front))
} else {
Err(Error::DataIsShort {
expect: len,
actual: self.len(),
})
}
}
}
基础类型的 Deserialize
实现举例:
// 1. bool 类型,仅 0u8 和 1u8 是合法值
impl<'a> Deserialize<'a> for bool {
fn deserialize_from<D: Deserializer<'a>>(buf: &mut D) -> Result<Self>
where
Self: Sized,
{
let front = buf.pop(1)?;
match front[0] {
0 => Ok(false),
1 => Ok(true),
v => Err(Error::InvalidBool(v)),
}
}
}
// 2. 各类数值类型
macro_rules! impl_deserialize_trait {
($($t:ty),*) => {
impl<'a> Deserialize<'a> for $t {
fn deserialize_from<D: Deserializer<'a>>(buf: &mut D) -> Result<Self>
where
Self: Sized,
{
let front = buf.pop(std::mem::size_of::<Self>())?;
Ok(Self::from_le_bytes(front.as_ref().try_into().unwrap()))
}
})*
};
}
impl_deserialize_trait! {i8, i16, i32, i64, i128, u8, u16, u32, u64, u128, f32, f64}
// 3. 对于字符串,支持反序列化得到 Cow<'a, str> 以减少复制
impl<'a> Deserialize<'a> for Cow<'a, str> {
fn deserialize_from<D: Deserializer<'a>>(buf: &mut D) -> Result<Self>
where
Self: Sized,
{
let len = VarInt64::deserialize_from(buf)?.0 as usize;
let front = buf.pop(len)?;
match front {
Cow::Borrowed(borrowed) => match std::str::from_utf8(borrowed) {
Ok(str) => Ok(Cow::Borrowed(str)),
Err(_) => Err(Error::InvalidString(Vec::from(borrowed))),
},
Cow::Owned(owned) => match std::str::from_utf8(&owned) {
Ok(_) => Ok(Cow::Owned(unsafe { String::from_utf8_unchecked(owned) })),
Err(_) => Err(Error::InvalidString(owned)),
},
}
}
}
impl<'a> Deserialize<'a> for String {
fn deserialize_from<D: Deserializer<'a>>(buf: &mut D) -> Result<Self>
where
Self: Sized,
{
Ok(Cow::<str>::deserialize_from(buf)?.into_owned())
}
}
// 4. Vec 的反序列化也很简单、直接
impl<'a, Item: Deserialize<'a>> Deserialize<'a> for Vec<Item> {
fn deserialize_from<D: Deserializer<'a>>(buf: &mut D) -> Result<Self>
where
Self: Sized,
{
let len = VarInt64::deserialize_from(buf)?.0 as usize;
let mut out = Vec::with_capacity(len);
for _ in 0..len {
out.push(Item::deserialize_from(buf)?);
}
Ok(out)
}
}
对于结构体,序列化和反序列化的思路已经有了,但肯定不会对每个结构体都手写代码实现 Serialize
/ Deserialize
接口。这里使用 derive 过程宏,类似 #[derive(Default)]
,帮助生成这两个 trait 的 impl 代码。代码略长这里就不列了,生成的代码举例:
#[derive(derse::Deserialize, derse::Serialize)]
struct A {
x: u64,
y: String,
}
impl derse::Serialize for A {
fn serialize_to<Serializer: derse::Serializer>(
&self,
serializer: &mut Serializer,
) -> derse::Result<()> {
let start = serializer.len();
self.y.serialize_to(serializer)?;
self.x.serialize_to(serializer)?;
let len = serializer.len() - start;
derse::VarInt64(len as u64).serialize_to(serializer)
}
}
impl<'derse> derse::Deserialize<'derse> for A {
fn deserialize_from<Deserializer: derse::Deserializer<'derse>>(
buf: &mut Deserializer,
) -> derse::Result<Self>
where
Self: Sized,
{
use derse::DetailedDeserialize;
let len = derse::VarInt64::deserialize_from(buf)?.0 as usize;
let mut buf = buf.advance(len)?;
let result = Self {
x: if buf.is_empty() {
Default::default()
} else {
derse::Deserialize::deserialize_from(&mut buf)?
},
y: if buf.is_empty() {
Default::default()
} else {
derse::Deserialize::deserialize_from(&mut buf)?
},
};
Ok(result)
}
}
注意在反序列化时对兼容性的考量。序列化到某个字段时,如果 buf
恰好为空,说明序列化的代码中并没有该字段,所以直接使用 Default::default()
默认初始化;如果反序列化完 buf
仍然有内容,说明序列化的代码中还有更新的字段,这里会直接丢掉。支持增加字段对 RPC 类型的应用非常有用。