5.Pandas分组_分箱常用操作

2024-01-02 18:45:13
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings

sns.set(style='whitegrid')
warnings.filterwarnings('ignore')
pd.show_versions()
INSTALLED VERSIONS
------------------
commit              : a671b5a8bf5dd13fb19f0e88edc679bc9e15c673
python              : 3.11.5.final.0
python-bits         : 64
OS                  : Windows
OS-release          : 10
Version             : 10.0.22621
machine             : AMD64
processor           : Intel64 Family 6 Model 154 Stepping 3, GenuineIntel
byteorder           : little
LC_ALL              : None
LANG                : None
LOCALE              : Chinese (Simplified)_China.936

pandas              : 2.1.4
numpy               : 1.24.3
pytz                : 2023.3.post1
dateutil            : 2.8.2
setuptools          : 68.2.2
pip                 : 23.3.1
Cython              : None
pytest              : 7.4.0
hypothesis          : None
sphinx              : 5.0.2
blosc               : None
feather             : None
xlsxwriter          : None
lxml.etree          : 4.9.3
html5lib            : 1.1
pymysql             : 1.0.2
psycopg2            : None
jinja2              : 3.1.2
IPython             : 8.15.0
pandas_datareader   : None
bs4                 : 4.12.2
bottleneck          : 1.3.5
dataframe-api-compat: None
fastparquet         : None
fsspec              : 2023.10.0
gcsfs               : None
matplotlib          : 3.8.0
numba               : 0.58.1
numexpr             : 2.8.7
odfpy               : None
openpyxl            : 3.0.10
pandas_gbq          : None
pyarrow             : 11.0.0
pyreadstat          : None
pyxlsb              : None
s3fs                : 2023.10.0
scipy               : 1.11.4
sqlalchemy          : 2.0.21
tables              : 3.8.0
tabulate            : 0.9.0
xarray              : 2023.6.0
xlrd                : 2.0.1
zstandard           : 0.19.0
tzdata              : 2023.3
qtpy                : 2.4.1
pyqt5               : None

1. 数据分组

分组的核心操作分为三步:

  1. 分组:选择分组的方式,创建GroupBy对象;
  2. 应用:使用聚合、转换、过滤等类型的应用函数;
  3. 合并:将各应用函数结果合并。
df_english=pd.DataFrame(
     {
         "student": ["John", "James", "Jennifer"],
         "gender": ["male", "male", "female"],
         "score": [20, 30, 30],
         "subject": "english"
     }
 )
 
df_math=pd.DataFrame(
     {
         "student": ["John", "James", "Jennifer"],
         "gender": ["male", "male", "female"],
         "score": [90, 100, 95],
         "subject": "math"
     }
 )
df = pd.concat([df_english, df_math],ignore_index=True)
df = df.sort_values(['student','subject']).reset_index(drop=True)
df
studentgenderscoresubject
0Jamesmale30english
1Jamesmale100math
2Jenniferfemale30english
3Jenniferfemale95math
4Johnmale20english
5Johnmale90math

1.1 列筛选

# 不指定列
df.groupby(by='student').count()
genderscoresubject
student
James222
Jennifer222
John222
# 指定列
df.groupby(by='student')['score'].count()
student
James       2
Jennifer    2
John        2
Name: score, dtype: int64

1.2 排序

groupby()默认有排序,如果不想要排序可以设置sort=False关闭。

df.groupby(by=['student', 'score'])['subject'].agg(['count', 'size'])
countsize
studentscore
James3011
10011
Jennifer3011
9511
John2011
9011
df.groupby(by=['student', 'score'], sort=False)['subject'].agg(['count', 'size'])
countsize
studentscore
James3011
10011
Jennifer3011
9511
John2011
9011

1.3 索引重置

groupby()会把分组列作为繁华dataframe的索引,如果不想这样,可以设置as_index=False参数。

df.groupby(by='student')['score'].agg(['mean', 'max', 'min'])
meanmaxmin
student
James65.010030
Jennifer62.59530
John55.09020
df.groupby(by='student', as_index=False)['score'].agg(['mean', 'max', 'min'])
studentmeanmaxmin
0James65.010030
1Jennifer62.59530
2John55.09020

1.4 空值统计

groupby()分组时默认忽略空值,但实际中有时需要对变量空值进行统计,可设置参数dropna=False

dd = pd.DataFrame({
      'A':['a1','a1','a2','a3'],
      'B':['b1',None,'b2','b3'],
      'C':[1,2,3,4],
      'D':[5,None,9,10]})
dd
ABCD
0a1b115.0
1a1None2NaN
2a2b239.0
3a3b3410.0
# 默认忽略空值
dd.groupby('B').sum()
ACD
B
b1a115.0
b2a239.0
b3a3410.0
# 分组空值加入统计
dd.groupby('B', dropna=False).sum()
ACD
B
b1a115.0
b2a239.0
b3a3410.0
NaNa120.0

1.5 多级索引

除数据列外,我们也可以通过索引层级来实现分组。

df = df.set_index(['student', 'subject'])
df
genderscore
studentsubject
Jamesenglishmale30
mathmale100
Jenniferenglishfemale30
mathfemale95
Johnenglishmale20
mathmale90
# 指定level=0分组
df.groupby(level=0).count()
genderscore
student
James22
Jennifer22
John22
# 指定level=1分组
df.groupby(level=1).count()
genderscore
subject
english33
math33

1.6 分组聚合

即在分组对象后应用聚合函数,有内置方法和自定义方法两种。

方法概述
any()计算组中的任何数是否为真
all()计算组中的任何数是否全都为真
count()计算组中非NA的数量
cov()*计算组的协方差
first()计算每组第一个出现的值
idxmax()*最大值的索引
idxmin()*最小值的索引
last()每组最后一个出现的值
max()每组的最大值
min()每组的最小值
mean()每组的平均数
median()每组的中位数
nunique()每组中唯一值的数量
prod()每组的乘积
quantile()每组中值的给定分位数
sem()计算每组中值的平均值的标准差
size()值的数量
skew()*值的偏差
std()值的标准偏差
sum()值的总和
var()值的方差
# 集中趋势
# 按学生分组计算平均分数
df.groupby(['student'])['score'].mean()
# 按学生分组计算中位数分数
df.groupby(['student'])['score'].median()
# 按学生分组计算指定分位数
df.groupby(['student'])['score'].quantile()
student
James       65.0
Jennifer    62.5
John        55.0
Name: score, dtype: float64
# 离散趋势
# 按学生分组计算最大分数
df.groupby(['student'])['score'].max()
# 按学生分组计算最小分数
df.groupby(['student'])['score'].min()
# 按学生分组计算方差
df.groupby(['student'])['score'].var()
# 按学生分组计算协方差
# df.groupby(['student'])['score'].cov()
# 按学生分组计算偏差
df.groupby(['student'])['score'].skew()
# 按学生分组计算标准差
df.groupby(['student'])['score'].std()
student
James       49.497475
Jennifer    45.961941
John        49.497475
Name: score, dtype: float64
# 计数
# 按学生分组计算非空数量
df.groupby(['student'])['score'].count()
# 按学生分组计算所有值数量
df.groupby(['student'])['score'].size()
# 按学生分组计算唯一值数量
df.groupby(['student'])['score'].nunique()
# 按学生分组计算求和
df.groupby(['student'])['score'].sum()
# 按学生分组计算求乘积
df.groupby(['student'])['score'].prod()
student
James       3000
Jennifer    2850
John        1800
Name: score, dtype: int64
# 排序
# 按学生分组返回第一个score
df.groupby(['student'])['score'].first()
# 按学生分组返回最后一个score
df.groupby(['student'])['score'].last()
# 按学生分组返回第n个score
df.groupby(['student'])['score'].nth(1)
student   subject
James     math       100
Jennifer  math        95
John      math        90
Name: score, dtype: int64

2. 分箱

分箱时将连续数值变量变为离散分类变量的一种方法,常用于特征工程,这对于组的划分有更高的要求,而不是简单的将相同值划分为同一组。

分箱的分为很多种,简单的有等频和等距,如果涉及到算法的,常见的有决策树、卡方、最优分箱等,这里只给出最简单的能由pandas实现的等频和等距两种方法。

df = pd.read_csv('data/titanic.csv')
df.sample(5)
PassengerIdSurvivedPclassNameSexAgeSibSpParchTicketFareCabinEmbarked
41441513Sundman, Mr. Johan Julianmale44.000STON/O 2. 31012697.925NaNS
72872902Bryhl, Mr. Kurt Arnold Gottfridmale25.01023685326.000NaNS
18618713O'Brien, Mrs. Thomas (Johanna "Hannah" Godfrey)femaleNaN1037036515.500NaNQ
44945011Peuchen, Major. Arthur Godfreymale52.00011378630.500C104S
38238303Tikkanen, Mr. Juhomale32.000STON/O 2. 31012937.925NaNS

2.1 等距分箱

即按相同距离将连续的数值划分成不同的箱体,每个箱体的间距相等,但箱体内的数据并不一定相等,用pd.cut()实现。

pd.cut(df['Age'], bins=6)
0      (13.683, 26.947]
1       (26.947, 40.21]
2      (13.683, 26.947]
3       (26.947, 40.21]
4       (26.947, 40.21]
             ...       
886     (26.947, 40.21]
887    (13.683, 26.947]
888                 NaN
889    (13.683, 26.947]
890     (26.947, 40.21]
Name: Age, Length: 891, dtype: category
Categories (6, interval[float64, right]): [(0.34, 13.683] < (13.683, 26.947] < (26.947, 40.21] < (40.21, 53.473] < (53.473, 66.737] < (66.737, 80.0]]
pd.cut(df['Age'], bins=6).value_counts().sort_index()
Age
(0.34, 13.683]       71
(13.683, 26.947]    248
(26.947, 40.21]     245
(40.21, 53.473]     100
(53.473, 66.737]     43
(66.737, 80.0]        7
Name: count, dtype: int64

2.2 自定义分箱

bins为标量序列时,cut会按照我们自定义的边界点进行分箱,不再是等距分箱。

pd.cut(df['Age'], bins=[0, 10, 30, 50, 100])
0      (10.0, 30.0]
1      (30.0, 50.0]
2      (10.0, 30.0]
3      (30.0, 50.0]
4      (30.0, 50.0]
           ...     
886    (10.0, 30.0]
887    (10.0, 30.0]
888             NaN
889    (10.0, 30.0]
890    (30.0, 50.0]
Name: Age, Length: 891, dtype: category
Categories (4, interval[int64, right]): [(0, 10] < (10, 30] < (30, 50] < (50, 100]]
pd.cut(df['Age'], bins=[0, 10, 30, 50, 100]).value_counts().sort_index()
Age
(0, 10]       64
(10, 30]     345
(30, 50]     241
(50, 100]     64
Name: count, dtype: int64

2.3 等频分箱

等频要求分组内的数量相等,在pandas中用qcut实现。

pd.qcut(df['Age'], q=10).value_counts().sort_index()
Age
(0.419, 14.0]    77
(14.0, 19.0]     87
(19.0, 22.0]     67
(22.0, 25.0]     70
(25.0, 28.0]     61
(28.0, 31.8]     66
(31.8, 36.0]     91
(36.0, 41.0]     53
(41.0, 50.0]     78
(50.0, 80.0]     64
Name: count, dtype: int64

2.4 分箱合并

2.4.1 分类合并

当分类型变量有很多值但分布比较分散的时候,我们有时需要做合并的处理。

df['Embarked'].value_counts(dropna=False)
Embarked
S      644
C      168
Q       77
NaN      2
Name: count, dtype: int64
where条件筛选

where可接受的筛选条件必须是布尔型的,基本判断逻辑是:如果对象不满足匹配条件,就被赋值为默认空值或other参数指定的值,满足条件的保持原值不变。

cond = df['Embarked'].isin(['S','C'])
df['Embarked'].where(cond, other='other', inplace=True)
df['Embarked'].value_counts(dropna=False)
Embarked
S        644
C        168
other     79
Name: count, dtype: int64
loc筛选反
df.loc[~df['Embarked'].isin(['S','C']), 'Embarked'] = 'other'
df['Embarked'].value_counts(dropna=False)
Embarked
S        644
C        168
other     79
Name: count, dtype: int64

2.4.2 数值合并

有时候我们会对分箱结果不满意,例如其中连着的两个分箱内数量太少不满足统计最小要求,因此选择合并。

# 先分箱
a, b = pd.cut(df['Age'], bins=8, retbins=True)
print(b)
print('-'*8)
a.value_counts(dropna=False).sort_index()
[ 0.34042 10.3675  20.315   30.2625  40.21    50.1575  60.105   70.0525
 80.     ]
--------





Age
(0.34, 10.368]       64
(10.368, 20.315]    115
(20.315, 30.263]    230
(30.263, 40.21]     155
(40.21, 50.158]      86
(50.158, 60.105]     42
(60.105, 70.052]     17
(70.052, 80.0]        5
NaN                 177
Name: count, dtype: int64
# (60.105, 70.052]和(60.105, 70.052]箱体数量太少,选择合并
bins = [0.34042,10.3675,20.315,30.2625,40.21,50.1575,60.105,80]
a, b = pd.cut(df['Age'], bins=bins, retbins=True)
print(b)
print('-'*8)
a.value_counts(dropna=False).sort_index()
[ 0.34042 10.3675  20.315   30.2625  40.21    50.1575  60.105   80.     ]
--------





Age
(0.34, 10.368]       64
(10.368, 20.315]    115
(20.315, 30.262]    230
(30.262, 40.21]     155
(40.21, 50.158]      86
(50.158, 60.105]     42
(60.105, 80.0]       22
NaN                 177
Name: count, dtype: int64

文章来源:https://blog.csdn.net/PyDarren/article/details/135346872
本文来自互联网用户投稿,该文观点仅代表作者本人,不代表本站立场。本站仅提供信息存储空间服务,不拥有所有权,不承担相关法律责任。