SkyPile-150B data download links
2023-12-14 03:09:04
https://hf-mirror.com/datasets/Skywork/SkyPile-150B/resolve/main/data/2020-40_zh_head_0000.jsonl
https://hf-mirror.com/datasets/Skywork/SkyPile-150B/resolve/main/data/2020-40_zh_head_0001.jsonl
https://hf-mirror.com/datasets/Skywork/SkyPile-150B/resolve/main/data/2020-40_zh_head_0002.jsonl
https://hf-mirror.com/datasets/Skywork/SkyPile-150B/resolve/main/data/2020-40_zh_head_0003.jsonl
https://hf-mirror.com/datasets/Skywork/SkyPile-150B/resolve/main/data/2020-40_zh_head_0004.jsonl
https://hf-mirror.com/datasets/Skywork/SkyPile-150B/resolve/main/data/2020-40_zh_head_0005.jsonl
https://hf-mirror.com/datasets/Skywork/SkyPile-150B/resolve/main/data/2020-40_zh_head_0006.jsonl
https://hf-mirror.com/datasets/Skywork/SkyPile-150B/resolve/main/data/2020-40_zh_head_0007.jsonl
https://hf-mirror.com/datasets/Skywork/SkyPile-150B/resolve/main/data/2020-40_zh_head_0008.jsonl
https://hf-mirror.com/datasets/Skywork/SkyPile-150B/resolve/main/data/2020-40_zh_head_0009.jsonl
https://hf-mirror.com/datasets/Skywork/SkyPile-150B/resolve/main/data/2020-40_zh_head_0010.jsonl
https://hf-mirror.com/datasets/Skywork/SkyPile-150B/resolve/main/data/2020-40_zh_head_0011.jsonl
https://hf-mirror.com/datasets/Skywork/SkyPile-150B/resolve/main/data/2020-40_zh_head_0012.jsonl
https://hf-mirror.com/datasets/Skywork/SkyPile-150B/resolve/main/data/2020-40_zh_head_0013.jsonl
https://hf-mirror.com/datasets/Skywork/SkyPile-150B/resolve/main/data/2020-40_zh_head_0014.jsonl
https://hf-mirror.com/datasets/Skywork/SkyPile-150B/resolve/main/data/2020-40_zh_head_0015.jsonl
https://hf-mirror.com/datasets/Skywork/SkyPile-150B/resolve/main/data/2020-40_zh_head_0016.jsonl
https://hf-mirror.com/datasets/Skywork/SkyPile-150B/resolve/main/data/2020-40_zh_head_0017.jsonl
https://hf-mirror.com/datasets/Skywork/SkyPile-150B/resolve/main/data/2020-40_zh_middle_0000.jsonl
https://hf-mirror.com/datasets/Skywork/SkyPile-150B/resolve/main/data/2020-40_zh_middle_0001.jsonl
https://hf-mirror.com/datasets/Skywork/SkyPile-150B/resolve/main/data/2020-40_zh_middle_0002.jsonl
https://hf-mirror.com/datasets/Skywork/SkyPile-150B/resolve/main/data/2020-40_zh_middle_0003.jsonl
https://hf-mirror.com/datasets/Skywork/SkyPile-150B/resolve/main/data/2020-40_zh_middle_0004.jsonl
https://hf-mirror.com/datasets/Skywork/SkyPile-150B/resolve/main/data/2020-40_zh_middle_0005.jsonl
https://hf-mirror.com/datasets/Skywork/SkyPile-150B/resolve/main/data/2020-40_zh_middle_0006.jsonl
https://hf-mirror.com/datasets/Skywork/SkyPile-150B/resolve/main/data/2020-40_zh_middle_0007.jsonl
https://hf-mirror.com/datasets/Skywork/SkyPile-150B/resolve/main/data/2020-40_zh_middle_0008.jsonl
https://hf-mirror.com/datasets/Skywork/SkyPile-150B/resolve/main/data/2020-40_zh_middle_0009.jsonl
https://hf-mirror.com/datasets/Skywork/SkyPile-150B/resolve/main/data/2020-40_zh_middle_0010.jsonl
https://hf-mirror.com/datasets/Skywork/SkyPile-150B/resolve/main/data/2020-40_zh_middle_0011.jsonl
https://hf-mirror.com/datasets/Skywork/SkyPile-150B/resolve/main/data/2020-40_zh_middle_0012.jsonl
https://hf-mirror.com/datasets/Skywork/SkyPile-150B/resolve/main/data/2020-40_zh_middle_0013.jsonl
https://hf-mirror.com/datasets/Skywork/SkyPile-150B/resolve/main/data/2020-40_zh_middle_0014.jsonl
https://hf-mirror.com/datasets/Skywork/SkyPile-150B/resolve/main/data/2020-40_zh_middle_0015.jsonl
https://hf-mirror.com/datasets/Skywork/SkyPile-150B/resolve/main/data/2020-40_zh_middle_0016.jsonl
https://hf-mirror.com/datasets/Skywork/SkyPile-150B/resolve/main/data/2020-40_zh_middle_0017.jsonl
https://hf-mirror.com/datasets/Skywork/SkyPile-150B/resolve/main/data/2020-40_zh_middle_0018.jsonl
https://hf-mirror.com/datasets/Skywork/SkyPile-150B/resolve/main/data/2020-40_zh_middle_0019.jsonl
https://hf-mirror.com/datasets/Skywork/SkyPile-150B/resolve/main/data/2020-40_zh_middle_0020.jsonl
https://hf-mirror.com/datasets/Skywork/SkyPile-150B/resolve/main/data/2020-40_zh_middle_0021.jsonl
https://hf-mirror.com/datasets/Skywork/SkyPile-150B/resolve/main/data/2020-45_zh_head_0000.jsonl
https://hf-mirror.com/datasets/Skywork/SkyPile-150B/resolve/main/data/2020-45_zh_head_0001.jsonl
https://hf-mirror.com/datasets/Skywork/SkyPile-150B/resolve/main/data/2020-45_zh_head_0002.jsonl
https://hf-mirror.com/datasets/Skywork/SkyPile-150B/resolve/main/data/2020-45_zh_head_0003.jsonl
https://hf-mirror.com/datasets/Skywork/SkyPile-150B/resolve/main/data/2020-45_zh_head_0004.jsonl
https://hf-mirror.com/datasets/Skywork/SkyPile-150B/resolve/main/data/2020-45_zh_head_0005.jsonl
https://hf-mirror.com/datasets/Skywork/SkyPile-150B/resolve/main/data/2020-45_zh_head_0006.jsonl
https://hf-mirror.com/datasets/Skywork/SkyPile-150B/resolve/main/data/2020-45_zh_head_0007.jsonl
https://hf-mirror.com/datasets/Skywork/SkyPile-150B/resolve/main/data/2020-45_zh_head_0008.jsonl
https://hf-mirror.com/datasets/Skywork/SkyPile-150B/resolve/main/data/2020-45_zh_head_0009.jsonl
https://hf-mirror.com/datasets/Skywork/SkyPile-150B/resolve/main/data/2020-45_zh_head_0010.jsonl
https://hf-mirror.com/datasets/Skywork/SkyPile-150B/resolve/main/data/2020-45_zh_middle_0000.jsonl
https://hf-mirror.com/datasets/Skywork/SkyPile-150B/resolve/main/data/2020-45_zh_middle_0001.jsonl
https://hf-mirror.com/datasets/Skywork/SkyPile-150B/resolve/main/data/2020-45_zh_middle_0002.jsonl
https://hf-mirror.com/datasets/Skywork/SkyPile-150B/resolve/main/data/2020-45_zh_middle_0003.jsonl
https://hf-mirror.com/datasets/Skywork/SkyPile-150B/resolve/main/data/2020-45_zh_middle_0004.jsonl
https://hf-mirror.com/datasets/Skywork/SkyPile-150B/resolve/main/data/2020-45_zh_middle_0005.jsonl
https://hf-mirror.com/datasets/Skywork/SkyPile-150B/resolve/main/data/2020-45_zh_middle_0006.jsonl
https://hf-mirror.com/datasets/Skywork/SkyPile-150B/resolve/main/data/2020-45_zh_middle_0007.jsonl
https://hf-mirror.com/datasets/Skywork/SkyPile-150B/resolve/main/data/2020-45_zh_middle_0008.jsonl
https://hf-mirror.com/datasets/Skywork/SkyPile-150B/resolve/main/data/2020-45_zh_middle_0009.jsonl
https://hf-mirror.com/datasets/Skywork/SkyPile-150B/resolve/main/data/2020-45_zh_middle_0010.jsonl
https://hf-mirror.com/datasets/Skywork/SkyPile-150B/resolve/main/data/2020-45_zh_middle_0011.jsonl
https://hf-mirror.com/datasets/Skywork/SkyPile-150B/resolve/main/data/2020-45_zh_middle_0012.jsonl
https://hf-mirror.com/datasets/Skywork/SkyPile-150B/resolve/main/data/2020-45_zh_middle_0013.jsonl
https://hf-mirror.com/datasets/Skywork/SkyPile-150B/resolve/main/data/2020-50_zh_head_0000.jsonl
https://hf-mirror.com/datasets/Skywork/SkyPile-150B/resolve/main/data/2020-50_zh_head_0001.jsonl
https://hf-mirror.com/datasets/Skywork/SkyPile-150B/resolve/main/data/2020-50_zh_head_0002.jsonl
https://hf-mirror.com/datasets/Skywork/SkyPile-150B/resolve/main/data/2020-50_zh_head_0003.jsonl
https://hf-mirror.com/datasets/Skywork/SkyPile-150B/resolve/main/data/2020-50_zh_head_0004.jsonl
https://hf-mirror.com/datasets/Skywork/SkyPile-150B/resolve/main/data/2020-50_zh_head_0005.jsonl
https://hf-mirror.com/datasets/Skywork/SkyPile-150B/resolve/main/data/2020-50_zh_head_0006.jsonl
https://hf-mirror.com/datasets/Skywork/SkyPile-150B/resolve/main/data/2020-50_zh_head_0007.jsonl
https://hf-mirror.com/datasets/Skywork/SkyPile-150B/resolve/main/data/2020-50_zh_head_0008.jsonl
https://hf-mirror.com/datasets/Skywork/SkyPile-150B/resolve/main/data/2020-50_zh_head_0009.jsonl
https://hf-mirror.com/datasets/Skywork/SkyPile-150B/resolve/main/data/2020-50_zh_middle_0000.jsonl
https://hf-mirror.com/datasets/Skywork/SkyPile-150B/resolve/main/data/2020-50_zh_middle_0001.jsonl
https://hf-mirror.com/datasets/Skywork/SkyPile-150B/resolve/main/data/2020-50_zh_middle_0002.jsonl
https://hf-mirror.com/datasets/Skywork/SkyPile-150B/resolve/main/data/2020-50_zh_middle_0003.jsonl
https://hf-mirror.com/datasets/Skywork/SkyPile-150B/resolve/main/data/2020-50_zh_middle_0004.jsonl
https://hf-mirror.com/datasets/Skywork/SkyPile-150B/resolve/main/data/2020-50_zh_middle_0005.jsonl
https://hf-mirror.com/datasets/Skywork/SkyPile-150B/resolve/main/data/2020-50_zh_middle_0006.jsonl
https://hf-mirror.com/datasets/Skywork/SkyPile-150B/resolve/main/data/2020-50_zh_middle_0007.jsonl
https://hf-mirror.com/datasets/Skywork/SkyPile-150B/resolve/main/data/2020-50_zh_middle_0008.jsonl
https://hf-mirror.com/datasets/Skywork/SkyPile-150B/resolve/main/data/2020-50_zh_middle_0009.jsonl
https://hf-mirror.com/datasets/Skywork/SkyPile-150B/resolve/main/data/2020-50_zh_middle_0010.jsonl
https://hf-mirror.com/datasets/Skywork/SkyPile-150B/resolve/main/data/2020-50_zh_middle_0011.jsonl
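The links above follow a regular naming pattern, so they can be generated rather than copied by hand. The following is a minimal sketch, not an official download tool: the function names `shard_url` and `download`, the parameter names, and the `skypile` output directory are assumptions made for illustration, and the pattern is inferred only from the URLs listed here.

```python
import urllib.request
from pathlib import Path

# Common prefix of every link in the list above.
BASE_URL = "https://hf-mirror.com/datasets/Skywork/SkyPile-150B/resolve/main/data"


def shard_url(snapshot: str, part: str, index: int) -> str:
    """Build one shard URL, e.g. .../2020-40_zh_head_0000.jsonl.

    snapshot is a prefix such as "2020-40"; part is "head" or "middle",
    the two values that appear in the list above (hypothetical parameter names).
    """
    return f"{BASE_URL}/{snapshot}_zh_{part}_{index:04d}.jsonl"


def download(url: str, dest_dir: str = "skypile") -> Path:
    """Download one shard into dest_dir, skipping files already present."""
    out = Path(dest_dir) / url.rsplit("/", 1)[-1]
    out.parent.mkdir(parents=True, exist_ok=True)
    if not out.exists():
        # Write to a temporary name first so an interrupted transfer
        # is not mistaken for a finished file on the next run.
        tmp = out.with_name(out.name + ".part")
        urllib.request.urlretrieve(url, tmp)
        tmp.rename(out)
    return out


# The 18 "head" shards of snapshot 2020-40 listed above (0000 through 0017).
urls = [shard_url("2020-40", "head", i) for i in range(18)]
print(urls[0])

# To fetch them, call download(u) for each u in urls (requires network access).
```

Shard counts differ per snapshot in the list above (for example, 2020-45 has 11 head shards and 14 middle shards), so the range has to be adjusted for each group.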
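Dataset summary
SkyPile-150B is a comprehensive, large-scale Chinese-language dataset built specifically for pre-training large language models. It is derived from a broad range of publicly accessible Chinese web pages. To ensure quality, we applied rigorous filtering, extensive deduplication, and thorough sensitive-data filtering. We also used tools such as fastText and BERT to filter out low-quality data.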
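To make the deduplication step concrete, here is a minimal sketch of exact-match deduplication by content hash. It is only an illustration of the general idea: the actual deduplication pipeline used for SkyPile-150B is not described here, and the function name `dedup_exact` and the whitespace normalization are assumptions.

```python
import hashlib
from typing import Iterable, Iterator


def dedup_exact(texts: Iterable[str]) -> Iterator[str]:
    """Yield each text once, dropping later copies with identical content.

    Assumption for illustration: two texts count as duplicates when they
    are equal after collapsing runs of whitespace.
    """
    seen = set()
    for text in texts:
        normalized = " ".join(text.split())
        key = hashlib.sha256(normalized.encode("utf-8")).digest()
        if key not in seen:
            seen.add(key)
            yield text


pages = ["a b", "a  b", "c", "a b"]
print(list(dedup_exact(pages)))  # keeps the first copy of each distinct page
```

Storing a fixed-size hash instead of the full text keeps memory use bounded per document. Near-duplicate detection, which web-scale corpora usually also need, requires a different technique and is not shown.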
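The publicly released portion of SkyPile-150B contains approximately 233 million unique web pages, each with more than 1,000 Chinese characters on average. In total, the dataset comprises approximately 150 billion tokens and 620 GB of plain text.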
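For rough capacity planning, the headline figures can be turned into per-page and per-token averages. This is a back-of-the-envelope sketch only: the inputs are the approximate numbers quoted above, and treating 1 GB as 10^9 bytes is an assumption.

```python
# Approximate figures quoted in the summary above.
pages = 233_000_000          # about 233 million web pages
tokens = 150_000_000_000     # about 150 billion tokens
size_bytes = 620 * 10**9     # 620 GB, assuming 1 GB = 10^9 bytes

tokens_per_page = tokens / pages
bytes_per_token = size_bytes / tokens

print(f"average tokens per page: {tokens_per_page:.0f}")
print(f"average bytes per token: {bytes_per_token:.2f}")
```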
Language
SkyPile-150B consists entirely of Chinese-language data.
Data fields
text: the processed and cleaned text extracted from each page.
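Each downloaded file is in JSON Lines format (one JSON object per line), as the `.jsonl` extension indicates. The sketch below reads the page text from such a file. It assumes the field key is the English name `text`, and the helper name `iter_texts` is hypothetical.

```python
import json
from typing import Iterable, Iterator


def iter_texts(lines: Iterable[str]) -> Iterator[str]:
    """Yield the page text from each JSON Lines record.

    Assumes every record is a JSON object with a "text" key.
    Blank lines are skipped.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        yield record["text"]


# Works on any iterable of lines, including an open file object:
#   with open("2020-40_zh_head_0000.jsonl", encoding="utf-8") as f:
#       for text in iter_texts(f):
#           ...
sample = ['{"text": "你好"}\n', "\n", '{"text": "abc"}']
print(list(iter_texts(sample)))
```

Reading line by line keeps memory use flat, which matters for shards of this size.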
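Dataset safety
We used more than 2 million (200w) rules together with a BERT-based model to identify sensitive data in the dataset, and then removed every harmful entry we detected.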
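Sensitive information and bias
Despite our best efforts, SkyPile-150B is built from publicly available web pages and may therefore contain sensitive information such as email addresses, phone numbers, or IP addresses. We have tried to minimize this through deduplication and low-quality filtering, but users of SkyPile-150B should remain vigilant.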
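Users who want to check their own copy for the kinds of sensitive information mentioned above could start with a simple regular-expression scan. The sketch below is a rough heuristic, not the filter used to build the dataset: the patterns, the `find_pii` name, and the choice to match only 11-digit mainland China mobile numbers are all assumptions, and the patterns will produce both false positives and misses.

```python
import re

# Deliberately simple patterns, for illustration only.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
OCTET = r"(?:25[0-5]|2[0-4]\d|1?\d?\d)"
IPV4_RE = re.compile(rf"(?<![\d.])(?:{OCTET}\.){{3}}{OCTET}(?![\d.])")
# 11-digit numbers starting with 1 and a second digit of 3-9.
CN_MOBILE_RE = re.compile(r"(?<!\d)1[3-9]\d{9}(?!\d)")


def find_pii(text: str) -> dict:
    """Return candidate email addresses, IPv4 addresses, and phone numbers."""
    return {
        "email": EMAIL_RE.findall(text),
        "ipv4": IPV4_RE.findall(text),
        "phone": CN_MOBILE_RE.findall(text),
    }


sample = "联系 a.b@example.com 或 13812345678,服务器 192.168.1.10"
print(find_pii(sample))
```

Matches from a scan like this are best treated as candidates for review or masking rather than as a definitive list.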
Source: https://blog.csdn.net/weixin_32759777/article/details/134862170