How Large-Model Robots Work: From Google's RT/RT-2 to Stanford's Mobile ALOHA

2024-01-07 19:36:04

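Preface

In July 2023, I commented on Google's RT-2 in my WeChat Moments:

"Large models are reshaping every field. Very cool: with a large model, a robot can not only understand 'human language' but also reason over it and turn it into instructions the robot can execute, completing a task in stages. I will read the paper closely later."
I was deeply impressed by large-model robots at the time and kept meaning to study them properly, but my team and I were busy with projects such as a paper-review GPT and enterprise knowledge-base Q&A, so I never found the time to dig in.

Then, a few days ago, Stanford's cooking robot went viral and once again stunned everyone, myself included. I posted on Moments again:

Multimodality + large models + AI agents can empower robots across the board
A year ago I resolved to explain the principles behind ChatGPT thoroughly
A year ago, enormous curiosity about the technology behind ChatGPT, strong enthusiasm for sharing, and the determination to write the most comprehensive, in-depth, and detailed article on how it works completely changed the course of my past year
In the end, the blog demonstrated technical research ability, the courses demonstrated teaching ability, and the projects demonstrated the ability to lead development

Today, a year later, I have resolved to study robotics thoroughly
Q1 of this year already includes a small AI agent project; from Q2 on I hope to take on this larger robot agent project, ideally in cooperation with a university lab or an investor

So, straight to work:

First, I have formed a Stanford robot replication group to reproduce Stanford's cooking and housework robot
Second, I will go through the history of large-model robots and all the key technical details involved (news articles only give a rough picture; a precise understanding requires reading the underlying papers)

After the Spring Festival, we will assemble a prototype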

Part 1

// To be updated

Part 2

// To be updated

Part 3 Stanford's Mobile ALOHA robot: cooking and housework, all covered

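3.1 Overall training pipeline of Mobile ALOHA

In robotics, imitation learning from human demonstrations has achieved impressive results. However, most results so far focus on tabletop manipulation and lack the mobility and dexterity needed for general tasks

Recently, a Stanford research team developed a system called Mobile ALOHA (paper, project page, technical documentation). Because it can do all kinds of housework, such as cooking and washing dishes, it went viral as soon as it was released

Stanford's housework robot Mobile ALOHA

The system targets bimanual mobile manipulation tasks that need whole-body control ("In this work, we develop a system for imitating mobile manipulation tasks that are bi-manual and require whole-body control")

First, the authors present Mobile ALOHA as a low-cost, whole-body teleoperation system for collecting data; it adds a mobile base and a whole-body teleoperation interface to the original ALOHA system
"We first present Mobile ALOHA, a low-cost and whole-body teleoperation system for data collection. It augments the ALOHA system [104] with a mobile base, and a whole-body teleoperation interface."
They then run supervised behavior cloning on the data collected with Mobile ALOHA, co-trained with the existing static ALOHA datasets
"Using data collected with Mobile ALOHA, we then perform supervised behavior cloning and find that co-training with existing static ALOHA datasets boosts performance on mobile manipulation tasks."
Each task uses 50 demonstrations (put simply, a human demonstrates first and the robot then learns from the human). Co-training raises success rates by up to 90%, which lets Mobile ALOHA autonomously complete complex mobile manipulation tasks such as sautéing shrimp, opening a two-door wall cabinet to store heavy cooking pots, calling and entering an elevator, and lightly rinsing a used pan under a kitchen faucet.
"With 50 demonstrations for each task, co-training can increase success rates by up to 90%, allowing Mobile ALOHA to autonomously complete complex mobile manipulation tasks such as sauteing and serving a piece of shrimp, opening a two-door wall cabinet to store heavy cooking pots, calling and entering an elevator, and lightly rinsing a used pan using a kitchen faucet."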

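3.2 Mobile ALOHA hardware

3.2.1 Hardware overview

Before this work, plug-and-play whole-body teleoperation hardware was expensive: robots such as PR2 and TIAGo typically cost more than USD 200,000. Earlier robots also could not perform the complex, dexterous manipulation that requires two hands working together; human fingers are remarkably dexterous, after all

Mobile ALOHA, by contrast, is a low-cost mobile manipulator that can perform a wide range of household tasks. It inherits the strengths of the original ALOHA system, namely a low-cost, dexterous, and repairable bimanual teleoperation setup, while extending its capabilities beyond tabletop manipulation. The design focuses on four points:

Mobility: it moves at a speed comparable to human walking, about 1.42 m/s
Stability: it stays stable when manipulating heavy household objects such as pots and cabinets
Whole-body teleoperation: all degrees of freedom, including both arms and the mobile base, can be teleoperated simultaneously
Untethered: it carries onboard power and compute (all computation during data collection and inference runs on a consumer-grade laptop with an Nvidia 3070 Ti GPU (8GB VRAM) and an Intel i7-12800H)

As shown in the figure above:

The left part of the figure ("Mobile ALOHA has two wrist cameras and one top camera, with onboard power and compute")
shows the design the researchers found to be the simplest and most direct solution: tethering the operator's waist to the mobile base
The middle part of the figure ("Middle: The teleoperation setup can be removed and only two ViperX 300 [3] are used during autonomous execution. Both arms can reach a min/max height of 65cm/200cm, and extends 100cm from the base") gives the following numbers:
the end effectors reach a vertical height of 65 cm to 200 cm above the ground, extend 100 cm from the base, can lift 1.5 kg objects, and can exert a pulling force of 100 N at a height of 1.5 m
This design lets Mobile ALOHA handle many tasks, including real cooking, housekeeping, and human-robot interaction
The right part of the figure ("Right: Technical specifications of Mobile ALOHA") lists further technical specifications of Mobile ALOHA
Besides using off-the-shelf robots, the researchers open-sourced all software and hardware components and provide detailed tutorials covering 3D printing, assembly, and software installation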

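3.2.2 Bill of materials and build steps

First, prepare the hardware. Some context on how the parts are used:

The onboard laptop receives streams from three Logitech C922x RGB webcams at 480 × 640 resolution and 50 Hz
Two cameras are mounted on the wrists of the follower robots, and the third faces forward
The laptop also receives proprioception streams from all 4 arms over USB serial ports, and from the Tracer mobile base over the CAN bus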

| Part | Quantity | Link | Price (per unit) |
|---|---|---|---|
| Robots | | | |
| ViperX 300 Robot Arm 6DOF | 2 | ViperX 300 Robot Arm 6DOF | $5,695.95 |
| WidowX 250 Robot Arm 6DOF | 2 | WidowX 250 Robot Arm 6DOF - X-Series Robotic Arm | $3,295.95 |
| Tracer AGV | 1 | AgileX Tracer AGV | $8,999.95 |
| Onboard Compute | | | |
| Lambda Labs Tensorbook | 1 | Deep Learning Laptop - RTX 3080 Max-Q \| Razer x Lambda Tensorbook | $2,399.00 |
| Robot Frame | | | |
| 4040 800mm x 8 | 4 | Amazon.com (2 pcs) | $42.29 |
| 4040 500mm x 6 | 2 | Amazon.com (4 pcs) | $58.99 |
| 4040 400mm x 2 | 2 | Amazon.com (1 pcs) | $22.99 |
| 4040 300mm x 7 | 2 | Amazon.com (4 pcs) | $59.99 |
| 4040 L-shape connectors x 28 | 5 | Amazon.com (6 pcs) | $32.99 |
| 4040 T-shape connectors x 4 | 1 | Amazon.com (6 pcs) | $30.99 |
| 4040 45-degree corner connectors | 1 | Amazon.com | $21.99 |
| 4040 Corner Bracket and T-Slot Sliding Nuts | 2 | Amazon.com | $24.99 |
| 4040 caps | 2 | Amazon.com | $9.81 |
| M6 20mm (for mounting robot) | 1 | Amazon.com | $9.99 |
| M6 T nuts for 4040 (for mounting robot) | 2 | Amazon.com | $14.16 |
| Camera setup | | | |
| Logitech C922x Pro Stream Webcam | 4 | Amazon.com | $98.35 |
| USB Hub | 2 | Amazon.com | $19.99 |
| Power | | | |
| Battery Pack | 1 | Amazon.com | $699.00 |
| 600W DC Supply | 1 | Amazon.com | $59.00 |
| 12V DC Cable | 5 | Amazon.com | $15.99 |
| Fork Spade Connectors | 1 | Amazon.com | $13.69 |
| USB-A to Micro USB Cable | 4 | Amazon.com | $17.87 |
| Wheel Odometry | | | |
| DYNAMIXEL XL430-W250-T | 2 | DYNAMIXEL XL430-W250-T - ROBOTIS | $49.90 |
| U2D2 | 1 | U2D2 - ROBOTIS | $32.10 |
| U2D2 Power Hub Board Set | 1 | U2D2 Power Hub Board Set - ROBOTIS | $19.00 |
| Jumper Wire | 1 | Amazon.com | $9.99 |
| Weights | 1 | Amazon.com: ACCRETION 1 Oz Grey Adhesive Backed Wheel Weights (24 Oz Pack) : Automotive | $14.65 |
| Misc | | | |
| Rubber Band | 1 | Amazon.com | $9.99 |
| Gripping Tape | 1 | Amazon.com | $54.14 |
| Common equipment | | | |
| Allen keys | | | |
| Hot glue gun | | | |
| Total | | | $31,757.86 |
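As a quick sanity check on the bill of materials, the sketch below multiplies each quantity by its unit price and sums the results with exact decimal arithmetic. The rows are copied from the table above; the variable and function names are illustrative only. It also prints the subtotal for the four arms, which is the bulk of the cost of the non-mobile ALOHA setup.

```python
from decimal import Decimal

# (part, quantity, unit price in USD), copied from the bill of materials above.
BOM = [
    ("ViperX 300 Robot Arm 6DOF", 2, "5695.95"),
    ("WidowX 250 Robot Arm 6DOF", 2, "3295.95"),
    ("Tracer AGV", 1, "8999.95"),
    ("Lambda Labs Tensorbook", 1, "2399.00"),
    ("4040 800mm x 8", 4, "42.29"),
    ("4040 500mm x 6", 2, "58.99"),
    ("4040 400mm x 2", 2, "22.99"),
    ("4040 300mm x 7", 2, "59.99"),
    ("4040 L-shape connectors x 28", 5, "32.99"),
    ("4040 T-shape connectors x 4", 1, "30.99"),
    ("4040 45-degree corner connectors", 1, "21.99"),
    ("4040 Corner Bracket and T-Slot Sliding Nuts", 2, "24.99"),
    ("4040 caps", 2, "9.81"),
    ("M6 20mm", 1, "9.99"),
    ("M6 T nuts for 4040", 2, "14.16"),
    ("Logitech C922x Pro Stream Webcam", 4, "98.35"),
    ("USB Hub", 2, "19.99"),
    ("Battery Pack", 1, "699.00"),
    ("600W DC Supply", 1, "59.00"),
    ("12V DC Cable", 5, "15.99"),
    ("Fork Spade Connectors", 1, "13.69"),
    ("USB-A to Micro USB Cable", 4, "17.87"),
    ("DYNAMIXEL XL430-W250-T", 2, "49.90"),
    ("U2D2", 1, "32.10"),
    ("U2D2 Power Hub Board Set", 1, "19.00"),
    ("Jumper Wire", 1, "9.99"),
    ("Weights", 1, "14.65"),
    ("Rubber Band", 1, "9.99"),
    ("Gripping Tape", 1, "54.14"),
]

def bom_total(rows):
    """Sum quantity * unit price using exact decimal arithmetic."""
    return sum((Decimal(price) * qty for _, qty, price in rows), Decimal("0"))

total = bom_total(BOM)
arms = bom_total(BOM[:2])  # the two ViperX and two WidowX arms
print(total)  # 31757.86
print(arms)   # 17983.80
```

The computed total matches the $31,757.86 listed in the table, and the four arms alone come to $17,983.80.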

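Once all the hardware is ready, follow these steps in order:

Install ALOHA end-effectors
Build ALOHA in 6 steps following the ALOHA 🏖️ Tutorial. Building this ALOHA alone, which does not yet have any mobility, already accounts for about USD 19,000 of the roughly USD 30,000 total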
Build the robot frame

Mount the robots and the cameras
Cable connections

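3.3 Co-training with static ALOHA data

3.3.1 Composition of the static ALOHA data

Data is a major problem when training robots

The typical way to solve real-world robot tasks with imitation learning relies on datasets collected on a specific robot hardware platform for a target task. The approach is straightforward, but data collection is lengthy, because human operators have to collect demonstrations from scratch for every task on that specific platform
"The typical approach for using imitation learning to solve real-world robotics tasks relies on using the datasets that are collected on a specific robot hardware platform for a targeted task. This straightforward approach, however, suffers from lengthy data collection processes where human operators collect demonstration data from scratch for every task on the a specific robot hardware platform."

Moreover, because visual diversity in these specialized datasets is limited, policies trained on them are often not robust to perceptual perturbations such as distractors and lighting changes
"The policies trained on these specialized datasets are often not robust to the perceptual perturbations (e.g. distractors and lighting changes) due to the limited visual diversity in these datasets [95]"
Fortunately, co-training on diverse real-world datasets collected from different but similar types of robots has recently shown promising results for single-arm manipulation and for navigation
"Recently, co-training on diverse real-world datasets collected from different but similar types of robots have shown promising results on single-arm manipulation [11, 20, 31, 61], and on navigation [79]."

The Stanford researchers adopt co-training in this work, using the existing static ALOHA datasets to improve imitation learning for mobile manipulation, especially the bimanual arm actions

The static ALOHA datasets contain 825 demonstrations in total, covering tasks such as sealing a zip-lock bag, picking up a fork, wrapping candy, tearing a paper towel, opening a lidded plastic bottle, playing with a ping-pong ball, dispensing tape, using a coffee machine, handing over a pencil, and operating a screwdriver
Note that all static ALOHA data was collected on a black tabletop with the two arms fixed facing each other
This differs from Mobile ALOHA, where the background changes as the base moves and the two arms are mounted in parallel, facing forward
In co-training, the researchers apply no special data processing to the RGB observations or the bimanual actions in the static ALOHA data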

3.3.2 Training loss over the two datasets (static ALOHA data and Mobile ALOHA data)

The training objective of the mobile manipulation policy $\pi^{m}$ for task $m$ is to minimize the imitation loss $L$:

$\mathbb{E}_{\left(o^{i}, a_{\text{arms}}^{i}, a_{\text{base}}^{i}\right) \sim D_{\text{mobile}}^{m}}\left[L\left(a_{\text{arms}}^{i}, a_{\text{base}}^{i}, \pi^{m}\left(o^{i}\right)\right)\right]+\mathbb{E}_{\left(o^{i}, a_{\text{arms}}^{i}\right) \sim D_{\text{static}}}\left[L\left(a_{\text{arms}}^{i},[0,0], \pi^{m}\left(o^{i}\right)\right)\right]$

where $o^{i}$ is the observation, consisting of two wrist camera RGB observations, one egocentric top camera RGB observation mounted between the arms, and the joint positions of the arms, as shown in the top-left of the figure below

We sample with equal probability from the static ALOHA data $D_{\text{static}}$ and the Mobile ALOHA data $D_{\text{mobile}}^{m}$, and set the batch size to 16

Static ALOHA data points have no mobile base actions, so the action labels are zero-padded to give actions from both datasets the same dimension. The front camera in the static ALOHA data is also ignored, so that both datasets have 3 cameras
"Since static ALOHA datapoints have no mobile base actions, we zero-pad the action labels so actions from both datasets have the same dimension. We also ignore the front camera in the static ALOHA data so that both datasets have 3 cameras."
In addition, every action is normalized using the statistics of the Mobile ALOHA dataset $D_{\text{mobile}}^{m}$ alone
"We normalize every action based on the statistics of the Mobile ALOHA dataset $D_{\text{mobile}}^{m}$ alone"
In the experiments, this co-training recipe is combined with several base imitation learning methods: ACT [Learning fine-grained bimanual manipulation with low-cost hardware], Diffusion Policy [Diffusion policy: Visuomotor policy learning via action diffusion], and VINN [The surprising effectiveness of representation learning for visual imitation]
"In our experiments, we combine this co-training recipe with multiple base imitation learning approaches, including ACT [104], Diffusion Policy [18], and VINN [63]"
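The data side of this recipe (equal-probability sampling, zero-padding, mobile-only normalization) can be sketched as follows. This is a minimal illustration on action labels only, with images and the policy itself omitted. The 14 + 2 action layout, the function names, and the random toy data are assumptions made for the example, not details taken from the released code.

```python
import random
from statistics import mean, pstdev

ARM_DIM, BASE_DIM = 14, 2  # assumed: 7 values per arm x 2 arms, plus 2 base velocities

def zero_pad(arm_action):
    """Append zero base actions so a static sample matches the mobile action dimension."""
    return list(arm_action) + [0.0] * BASE_DIM

def mobile_stats(mobile_actions):
    """Per-dimension mean and standard deviation, from the mobile dataset only."""
    columns = list(zip(*mobile_actions))
    return [mean(c) for c in columns], [pstdev(c) for c in columns]

def sample_batch(mobile_actions, static_actions, batch_size=16, rng=random):
    """Pick each sample's dataset with probability 0.5, zero-pad static actions,
    and normalize everything with mobile-only statistics."""
    mu, sigma = mobile_stats(mobile_actions)
    batch = []
    for _ in range(batch_size):
        if rng.random() < 0.5:
            action, source = rng.choice(mobile_actions), "mobile"
        else:
            action, source = zero_pad(rng.choice(static_actions)), "static"
        normalized = [(a - m) / s for a, m, s in zip(action, mu, sigma)]
        batch.append((source, normalized))
    return batch

# Toy data: 50 mobile demonstrations' worth of actions and 825 static ones.
rng = random.Random(0)
mobile = [[rng.gauss(0, 1) for _ in range(ARM_DIM + BASE_DIM)] for _ in range(50)]
static = [[rng.gauss(0, 1) for _ in range(ARM_DIM)] for _ in range(825)]
batch = sample_batch(mobile, static, rng=rng)
print(len(batch), len(batch[0][1]))  # 16 16
```

Because each sample is drawn from either dataset with probability 0.5, the expected batch loss is proportional to the sum of the two expectations in the objective above.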

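The team finally chose 7 tasks that cover the range of capabilities, objects, and interactions likely to appear in real applications: Wipe Wine, Cook Shrimp, Rinse Pan, Use Cabinet, Call Elevator, Push Chairs, and High Five

The figure below shows the robot's navigation trajectories while performing the tasks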

3.4 Experiments

The experiments aim to answer two core questions, quoted from the paper:

Can Mobile ALOHA acquire complex mobile manipulation skills with co-training and a small amount of mobile manipulation data?
Can Mobile ALOHA work with different types of imitation learning methods, including ACT [104], Diffusion Policy [18], and retrieval-based VINN [63]?

As a preliminary, all methods examined here use "action chunking", in which a policy predicts a sequence of future actions instead of a single action at each time step. ACT and Diffusion Policy already use it, and it can easily be added to VINN ("As a preliminary, all methods we will examine employ 'action chunking' [104], where a policy predicts a sequence of future actions instead of one action at each time step")

Action chunking proved crucial for manipulation: it improves the coherence of the generated trajectory and reduces the latency of per-step policy inference
"We found action chunking to be crucial for manipulation, improving the coherence of generated trajectory and reducing the latency from per-step policy inference."
There is a delay between the target and actual velocities of the mobile base, whereas the delay of the position-controlled arms is much smaller. To account for a base delay of $d$ steps, the robot executes the first $k-d$ arm actions and the last $k-d$ base actions of an action chunk of length $k$
"We observe a delay between target and actual velocities of our mobile base, while the delay for position-controlled arms is much smaller. To account for a delay of d steps of the mobile base, our robot executes the first k - d arm actions and last k - d base actions of an action chunk of length k."
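One way to read the delay compensation: at executed step t, the arms use the action predicted for step t, while the base uses the action predicted for step t + d, so only k - d steps of each chunk are executed. The sketch below implements that reading. The step-by-step pairing, the 14 + 2 action layout, and the function name `schedule_chunk` are assumptions made for illustration.

```python
ARM_DIM, BASE_DIM = 14, 2  # assumed layout: 14 arm targets first, 2 base velocities last

def schedule_chunk(chunk, d):
    """Pair arm and base commands from one predicted chunk of length k.

    Executed step t uses the arm action predicted for step t and the base
    action predicted for step t + d, so only k - d steps are executed:
    the first k - d arm actions and the last k - d base actions.
    """
    k = len(chunk)
    if not 0 <= d < k:
        raise ValueError("delay d must satisfy 0 <= d < k")
    arm_actions = [step[:ARM_DIM] for step in chunk[:k - d]]
    base_actions = [step[ARM_DIM:] for step in chunk[d:]]
    return list(zip(arm_actions, base_actions))

# Toy chunk of length k = 5: step i has arm values i and base values 100 + i.
chunk = [[float(i)] * ARM_DIM + [100.0 + i] * BASE_DIM for i in range(5)]
steps = schedule_chunk(chunk, d=2)
print(len(steps), steps[0][1])  # 3 [102.0, 102.0]
```

With k = 5 and d = 2, three steps are executed, and the first base command is the one predicted for step 2, sent early so that it takes effect roughly when the matching arm motion does.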

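3.4.1 Co-training improves performance

The study finds that co-training improves ACT performance. Across the 7 challenging mobile manipulation tasks, co-training with the static ALOHA datasets consistently improves ACT's success rate

This matters most for sub-tasks such as pressing the button when taking the elevator or turning on the faucet when rinsing the pan, where precise manipulation is the bottleneck

3.4.2 Compatibility with ACT, Diffusion Policy, and VINN

In addition to ACT, two recent imitation learning methods, Diffusion Policy [18] and VINN [63], are trained with Mobile ALOHA ("We train two recent imitation learning methods, Diffusion Policy [18] and VINN [63], with Mobile ALOHA in addition to ACT.")

Diffusion Policy trains a neural network to gradually refine the action prediction. To speed up inference, the DDIM scheduler [85] is used, and data augmentation is applied to the image observations to prevent overfitting. The co-training data pipeline is the same as for ACT; more training details are in Appendix A.3
"Diffusion policy trains a neural network to gradually refine the action prediction. We use the DDIM scheduler [85] to improve inference speed, and apply data augmentation to image observations to prevent overfitting. The co-training data pipeline is the same as ACT, and we include more training details in the Appendix A.3."
VINN uses BYOL [Bootstrap your own latent: a new approach to self-supervised learning] to train a visual representation model (the BYOL encoder is simply co-trained on the combination of mobile and static data) and uses that model to retrieve actions from the demonstration dataset by nearest-neighbor search. VINN retrieval is augmented with proprioception features, and the relative weight is tuned to balance the importance of visual and proprioception features
"VINN trains a visual representation model, BYOL [37] and uses it to retrieve actions from the demonstration dataset with nearest neighbors. We augment VINN retrieval with proprioception features and tune the relative weight to balance visual and proprioception feature importance"

In addition, an action chunk is retrieved instead of a single action, which yields a significant performance improvement similar to that reported by Zhao et al.
"We also retrieve an action chunk instead of a single action and find significant performance improvement similar to Zhao et al."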
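The retrieval step described above can be illustrated with a small nearest-neighbor sketch. It assumes the visual embeddings have already been produced by some encoder (BYOL in the paper) and are plain lists of floats here. The combined distance, the softmax-style weighting over the k nearest neighbors, and all names are assumptions of this sketch, not the authors' exact implementation.

```python
import math

def l2(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def retrieve_chunk(visual, proprio, demos, proprio_weight=1.0, k=3):
    """Return a distance-weighted average of the k nearest action chunks.

    Each demo is a dict with 'visual' (embedding), 'proprio' (joint state)
    and 'chunk' (list of actions). proprio_weight balances the two distances.
    """
    scored = [
        (l2(visual, d["visual"]) + proprio_weight * l2(proprio, d["proprio"]), d["chunk"])
        for d in demos
    ]
    nearest = sorted(scored, key=lambda pair: pair[0])[:k]
    weights = [math.exp(-dist) for dist, _ in nearest]  # closer demos count more
    total = sum(weights)
    steps, dims = len(nearest[0][1]), len(nearest[0][1][0])
    return [
        [sum(w * chunk[t][j] for w, (_, chunk) in zip(weights, nearest)) / total
         for j in range(dims)]
        for t in range(steps)
    ]

# Two toy demonstrations, each with a 2-step action chunk.
demos = [
    {"visual": [0.0, 0.0], "proprio": [0.0], "chunk": [[1.0, 1.0], [2.0, 2.0]]},
    {"visual": [5.0, 5.0], "proprio": [1.0], "chunk": [[9.0, 9.0], [8.0, 8.0]]},
]
print(retrieve_chunk([0.0, 0.0], [0.0], demos, k=1))  # [[1.0, 1.0], [2.0, 2.0]]
```

Raising `proprio_weight` makes the retrieval favor demonstrations whose joint state is close to the current one, and lowering it makes visual similarity dominate.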

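In summary, VINN with chunking, Diffusion Policy, and ACT all perform well on Mobile ALOHA, and all benefit from co-training with static ALOHA

Moreover, on the Wipe Wine task co-training reaches a 95% success rate, far better than the 40% achieved with pre-training

In the end, on a budget of only USD 32,000 and through imitation learning co-trained with static ALOHA data, Mobile ALOHA needs only 20 to 50 demonstrations to learn a variety of complex tasks.

Stanford's Mobile ALOHA shows everyone the potential of robots across many application scenarios, and because it is open source, anyone can replicate it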

Part 4 Google's housework robots

// To be updated


Source: https://blog.csdn.net/v_JULY_v/article/details/135429156