
Ensemble-happiness

1. Problem

1.1 Background

Happiness is an ancient and profound topic and a goal that humanity has pursued for generations. Thousands of factors relate to happiness; among these intricate factors, can we find common threads and catch a glimpse of what happiness really means?

This case study takes up the classic problem of happiness prediction, hoping to add algorithmic approaches alongside existing social-science research, combine the strengths of multiple disciplines, uncover latent influencing factors, and discover more explainable, understandable correlations.

Concretely, this case is a baseline for a data-mining competition on happiness prediction. We use 139 dimensions of information, including individual variables (gender, age, region, occupation, health, marital status, political affiliation, etc.), family variables (parents, spouse, children, family capital, etc.), and social attitudes (fairness, trust, public services, etc.), to predict their influence on happiness.

The data come from the official Chinese General Social Survey (CGSS), so the source is reliable.

Data download: Data of case

1.2 Data information

Using the 139 features above and more than 8,000 samples, the task is to predict individual happiness (the target takes the values 1, 2, 3, 4, 5, where 1 means the lowest happiness and 5 the highest). Because there are many variables and the relationships among some of them are complex, the data are provided in two versions: a complete one and an abridged one. An accompanying index file gives the questionnaire item for each variable and the meaning of each value; a survey file contains the original questionnaire as background for understanding the problem.
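Although the scoring rule is not restated here, this kind of ordinal prediction task is commonly evaluated by mean squared error between the predicted and true happiness values. A minimal sketch of such a scorer, under that assumption (not confirmed by the write-up above):

import numpy as np

def mse_score(y_true, y_pred):
    # Mean squared error between true and predicted happiness (1-5); lower is better.
    # Assumed metric for this task, not taken from the original write-up.
    y_true, y_pred = np.asarray(y_true, dtype=float), np.asarray(y_pred, dtype=float)
    return np.mean((y_true - y_pred) ** 2)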

2. Data processing

2.1 Load data

import pandas as pd

train = pd.read_csv("train.csv", parse_dates=['survey_time'], encoding='latin-1')
test = pd.read_csv("test.csv", parse_dates=['survey_time'], encoding='latin-1')
# latin-1 is backward compatible with ASCII
  • Deleting the rows whose happiness value equals -8

    # Delete the rows whose happiness value equals -8
    target_name = 'happiness'
    train_copy = train.copy() # keep a full copy of the training set for the later concatenation
  • Train and target set

    train = train[train[target_name] != -8].reset_index(drop = True)
    target = train_copy[target_name] # Target
  • Concatenation

    del train_copy[target_name] # Drop the target column
    # Store the train set (without happiness values)

    data = pd.concat([train_copy, test], axis = 0, ignore_index = True)
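    When modeling later, the combined frame can be split back into its train and test parts by row count. A minimal illustrative sketch (the names train_feat and test_feat are not from the original code):

    n_train = len(train_copy) # the training rows come first in `data`
    train_feat = data.iloc[:n_train] # features aligned with `target`
    test_feat = data.iloc[n_train:].reset_index(drop = True)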

2.2 Information

  • describe

    train.happiness.describe()

    Results:

    count    7988.000000
    mean        3.867927
    std         0.818717
    min         1.000000
    25%         4.000000
    50%         4.000000
    75%         4.000000
    max         5.000000
    Name: happiness, dtype: float64
  • info

    train.info()
    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 8000 entries, 0 to 7999
    Columns: 140 entries, id to public_service_9
    dtypes: datetime64[ns](1), float64(25), int64(111), object(3)
    memory usage: 8.5+ MB

2.3 Data preparation

2.3.1 Missing values

  • Number of columns with missing values

    The following code identifies which columns contain missing values:

    data_nomiss = data.dropna(axis=1, inplace = False) # columns with no missing values

    name_data = data.columns
    name_miss = name_data.drop(data_nomiss.columns) # names of the columns that contain missing values

    data_miss = data.loc[:, name_miss] # subset with the missing values, for easy inspection
    >>> data_miss.shape

    (10968, 25)

    We find that 25 columns of the dataset contain missing values; the column names are as follows:

    >>> name_miss

    Index(['edu_other', 'edu_status', 'edu_yr', 'join_party', 'property_other', 'hukou_loc', 'social_neighbor', 'social_friend', 'work_status', 'work_yr', 'work_type', 'work_manage', 'family_income', 'invest_other', 'minor_child', 'marital_1st', 's_birth', 'marital_now', 's_edu', 's_political', 's_hukou', 's_income', 's_work_exper', 's_work_status', 's_work_type'], dtype='object')
  • Processing

    Take a closer look at the missing data:

    >>> data_miss.info()
    
    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 10968 entries, 0 to 10967
    Data columns (total 25 columns):
     #   Column           Non-Null Count  Dtype  
    ---  ------           --------------  -----  
     0   edu_other        6 non-null      object
     1   edu_status       9399 non-null   float64
     2   edu_yr           8212 non-null   float64
     3   join_party       1126 non-null   float64
     4   property_other   89 non-null     object
     5   hukou_loc        10964 non-null  float64
     6   social_neighbor  9871 non-null   float64
     7   social_friend    9871 non-null   float64
     8   work_status      4029 non-null   float64
     9   work_yr          4029 non-null   float64
     10  work_type        4030 non-null   float64
     11  work_manage      4030 non-null   float64
     12  family_income    10967 non-null  float64
     13  invest_other     45 non-null     object
     14  minor_child      9520 non-null   float64
     15  marital_1st      9839 non-null   float64
     16  s_birth          8601 non-null   float64
     17  marital_now      8521 non-null   float64
     18  s_edu            8601 non-null   float64
     19  s_political      8601 non-null   float64
     20  s_hukou          8601 non-null   float64
     21  s_income         8601 non-null   float64
     22  s_work_exper     8601 non-null   float64
     23  s_work_status    3524 non-null   float64
     24  s_work_type      3524 non-null   float64
    dtypes: float64(22), object(3)
    memory usage: 2.1+ MB
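    The same information can also be read as fractions of the sample, which makes the heavily missing columns stand out. A quick illustrative check:

    # Fraction of missing entries per column, most missing first
    data_miss.isnull().mean().sort_values(ascending = False).head(4)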

    The four columns edu_other, join_party, property_other, and invest_other are each missing in over 90% of the samples, so after consideration we remove these four variables from the analysis.

    name_del = ['edu_other', 'join_party', 'property_other', 'invest_other']
    name_data = name_data.drop(name_del)
    name_miss = name_miss.drop(name_del) # drop the four variables above

    data_miss = data_miss.loc[:, name_miss]
    data = data.loc[:, name_data] # drop the four columns from the full dataset as well

  • Fillna

    From the info() output above, some columns have many missing values while others have only a few. We therefore fill them as follows:

    • columns missing fewer than 20% of the total (i.e. fewer than 2193.6 entries) are filled with the column mode;
    • columns missing more than that are filled with 0, to avoid distorting the data with heavy-handed imputation.

    num_nulls = data_miss.isnull().sum() # per-column missing counts
    for name in name_miss:
        if num_nulls[name] < 0.2 * data.shape[0]:
            data[name] = data[name].fillna(data_miss[name].mode()[0]) # fill with the column mode
        else:
            data[name] = data[name].fillna(0) # fill with 0
    data_miss = data.loc[:, name_miss]

    Re-running the per-column count now confirms that no missing values remain:

    >>> data_miss.isnull().sum()

    edu_status         0
    edu_yr             0
    hukou_loc          0
    social_neighbor    0
    social_friend      0
    work_status        0
    work_yr            0
    work_type          0
    work_manage        0
    family_income      0
    minor_child        0
    marital_1st        0
    s_birth            0
    marital_now        0
    s_edu              0
    s_political        0
    s_hukou            0
    s_income           0
    s_work_exper       0
    s_work_status      0
    s_work_type        0
    dtype: int64

At this point, the missing-value handling is complete.

2.3.2 Negative values

We now handle the negative values that appear throughout the data. Since the only negative values present are -1, -2, -3, and -8, we proceed as follows:

  • Self-defined counting functions

    # Add 5 new count features
    # The data contain negative values (-1, -2, -3, -8); we treat them as
    # problematic entries but do not delete them
    def getres1(row):
        return len([x for x in row.values if type(x) == int and x < 0])

    def getres2(row):
        return len([x for x in row.values if type(x) == int and x == -8])

    def getres3(row):
        return len([x for x in row.values if type(x) == int and x == -1])

    def getres4(row):
        return len([x for x in row.values if type(x) == int and x == -2])

    def getres5(row):
        return len([x for x in row.values if type(x) == int and x == -3])
    # Build the per-row count features
    data['neg1'] = data[data.columns].apply(lambda row: getres1(row), axis=1)
    data.loc[data['neg1'] > 20, 'neg1'] = 20 # smooth: cap the count at 20
    data['neg2'] = data[data.columns].apply(lambda row: getres2(row), axis=1)
    data['neg3'] = data[data.columns].apply(lambda row: getres3(row), axis=1)
    data['neg4'] = data[data.columns].apply(lambda row: getres4(row), axis=1)
    data['neg5'] = data[data.columns].apply(lambda row: getres5(row), axis=1)
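    As an aside, the row-wise apply above is slow on roughly 11,000 rows by 140 columns. A vectorized sketch of the same idea is shown below; it is not from the original code and behaves slightly differently, since it also counts negatives in float columns, which the type(x) == int check above skips:

    num_cols = data.select_dtypes(include = 'number') # numeric columns only
    neg1_fast = (num_cols < 0).sum(axis = 1).clip(upper = 20) # analogue of data['neg1']
    neg2_fast = (num_cols == -8).sum(axis = 1) # analogue of data['neg2']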
  • Process

    Record the names of the columns that contain negative values, then fill them by consulting the value meanings listed in index.csv.

    # First reduce the survey timestamp to a year
    data['survey_time'] = pd.to_datetime(data['survey_time'], format = '%Y-%m-%d',
                                         errors = 'coerce') # coerce invalid dates to NaT
    data['survey_time'] = data['survey_time'].dt.year # keep only the year, used to compute age
    # Look at all columns containing negative values
    neg_name = data.min()[data.min() < 0].index # columns whose minimum value is below 0

    # Use index.csv to look up what the negative values mean
    data_index = pd.read_csv("index.csv", encoding='utf-8')
    data_index.set_index('变量名', inplace = True) # '变量名' is the variable-name column
    data_index_column = data_index.columns

    print('There are {} columns with negative values'.format(len(neg_name)))
    There are 100 columns with negative values

    So 100 columns of the full table contain negative values.

  • Inspect the field details

    data_index.loc[neg_name, data_index_column[1:3]].head()

    Variable        Question                                                       Value meanings
    nationality     Your ethnicity                                                 1 = Han; 2 = Mongolian; 3 = Manchu; 4 = Hui; 5 = Tibetan; 6 = Zhuang; 7 = ...
    religion        Your religious belief: does not practice a religion            0 = no; 1 = yes
    religion_freq   How often you take part in religious activities                1 = never; 2 = less than once a year; 3 = about once or twice a year; 4 = a few times a ...
    edu             Your highest education level (including currently enrolled)    1 = no schooling; 2 = sishu / literacy class; 3 = primary school; 4 = junior high school; 5 = ...
    edu_status      Status of your highest education level                         1 = currently enrolled; 2 = dropped out; 3 = attended without graduating; 4 = graduated

  • Filling

    Based on the actual meaning of each field, the negative values are filled in one of the following ways (the code follows this list):

    • fill based on common sense
    • fill with the mode of the non-negative values
    • fill with the mean of the non-negative values
    • fill with 0 (used for year fields such as marriage and birth years)
    # List the columns that contain more than 1000 negative entries
    for col in neg_name:
        if data[col].loc[data[col] < 0].count() > 1000:
            print(col, data[col].loc[data[col] < 0].count(), sep = ' | ')
    edu_yr | 1679
    marital_1st | 1901
    f_birth | 4664
    m_birth | 4448
    inc_ability | 1325
    inc_exp | 1501
    trust_3 | 1399
    trust_4 | 1413
    trust_6 | 1562
    trust_8 | 1381
    trust_10 | 2451
    trust_11 | 5456
    trust_12 | 3382

    The output above shows that only 5 of these columns contain more than 2,000 negative values.

    # ===================== Common sense ========================
    # Ethnicity
    data.loc[data['nationality']<0,'nationality'] = 1 # 1 = Han

    # Religion
    data.loc[data['religion']<0,'religion'] = 1 # 1 = no religious belief
    data.loc[data['religion_freq']<0,'religion_freq'] = 1 # 1 = never attended religious activities

    # ======================= Mode ============================
    # Education
    data.loc[data['edu']<0,'edu'] = data['edu'].mode()[0] # mode is 4, junior high school
    data.loc[data['edu_status']<0,'edu_status'] = data['edu_status'].mode()[0] # mode is 4, graduated
    data.loc[data['edu_yr']<0,'edu_yr'] = 0 # graduation year; unknown set to 0

    # Political affiliation
    data.loc[data['political']<0,'political'] = data['political'].mode()[0] # mode is 1, ordinary citizen

    # Health
    data.loc[data['health']<0,'health'] = data['health'].mode()[0] # mode is 4, fairly healthy
    data.loc[data['health_problem']<0,'health_problem'] = data['health_problem'].mode()[0] # mode is 4

    # Depression
    data.loc[data['depression']<0,'depression'] = data['depression'].mode()[0] # mode is 4, rarely

    # Media: newspaper, magazine, radio, TV, internet, mobile news
    for i in range(6):
        col = 'media_' + str(i+1)
        data.loc[data[col]<0,col] = data[col].mode()[0] # fill with the mode
    '''
    data.loc[data['media_1']<0,'media_1'] = data['media_1'].mode().values # mode is 1, never
    data.loc[data['media_2']<0,'media_2'] = data['media_2'].mode().values # mode is 1, never
    data.loc[data['media_3']<0,'media_3'] = data['media_3'].mode().values # mode is 1, never
    data.loc[data['media_4']<0,'media_4'] = data['media_4'].mode().values # mode is 4, often
    data.loc[data['media_5']<0,'media_5'] = data['media_5'].mode().values # mode is 1, never
    data.loc[data['media_6']<0,'media_6'] = data['media_6'].mode().values # mode is 1, often
    '''

    # Leisure: watching DVDs, going to the movies, shopping, reading, cultural activities, etc. (12 items)
    for i in range(12):
        col = 'leisure_' + str(i+1)
        data.loc[data[col]<0,col] = data[col].mode()[0] # fill with the mode

    '''
    Columns 29-48 of neg_name:
    socialize, relax, learn, social_neighbor,
    social_friend, socia_outing, equity, class, class_10_before,
    class_10_after, class_14, work_status, work_yr, work_type, work_manage,
    insur_1, insur_2, insur_3, insur_4
    '''
    for i in range(29, 48):
        col = neg_name[i]
        data.loc[data[col]<0,col] = data[col].mode()[0] # fill with the mode
    data.loc[data['socialize']<0,'socialize'] = data['socialize'].mode()[0]

    # Income status
    for i in range(49, 56):
        col = neg_name[i]
        data.loc[data[col]<0,col] = data[col].mode()[0] # fill with the mode

    for i in range(58, 64):
        col = neg_name[i]
        data.loc[data[col]<0,col] = data[col].mode()[0]

    # Trust
    for i in range(77, 100):
        col = neg_name[i]
        # trust contains many negative entries, so take the mode of the positive values only
        data.loc[data[col]<0,col] = data[col].loc[data[col]>0].mode()[0]

    # ========================= Mean ==========================
    # Income
    data.loc[data['income']<0,'income'] = data['income'].mean() # fill income with the mean

    # Family income (column 49)
    data.loc[data['family_income']<0,'family_income'] = data['family_income'].mean()

    # Expected income (column 76)
    data.loc[data['inc_exp']<0,'inc_exp'] = data['inc_exp'].mean()

    # ======================== Zero fill ===========================
    # Marriage (columns 56-57)
    data.loc[data['marital_1st']<0,'marital_1st'] = 0 # year of first marriage
    data.loc[data['marital_now']<0,'marital_now'] = 0 # year of current marriage

    # Father (column 64)
    data.loc[data['f_birth']<0,'f_birth'] = 0 # birth year

    for i in range(65, 68):
        col = neg_name[i]
        data.loc[data[col]<0,col] = data[col].mode()[0] # mode

    # Mother (column 68)
    data.loc[data['m_birth']<0,'m_birth'] = 0 # birth year

    for i in range(69, 76):
        col = neg_name[i]
        data.loc[data[col]<0,col] = data[col].mode()[0] # mode
  • Confirm that the negative values are handled

    Scan every column to confirm that all negative values have been processed.

    neg_count = len(data.min()[data.min() < 0].index) # number of columns still containing negative values

    if neg_count != 0:
        print('{} columns still have Negative values!'.format(neg_count))
    else:
        print('Congratulations! No columns have negative values!')
    Congratulations! No columns have negative values!

All negative values have now been handled.
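As a final, illustrative sanity check combining both cleaning steps (not part of the original code):

# No missing values and no negative entries should remain at this point
assert data.isnull().sum().sum() == 0
assert (data.select_dtypes(include='number').min() >= 0).all()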

Special information processing

  • Age stratification

    Since age has an important influence on income level, education level, and more, we compute each respondent's age from the survey_time and birth columns and then stratify it into bands.

    # Calculate age from the survey year and the birth year
    data['age'] = data['survey_time'] - data['birth']

    # Age stratification
    bins = [0, 17, 26, 34, 50, 63, 100]
    data['age_bin'] = pd.cut(data['age'], bins, labels = [0, 1, 2, 3, 4, 5]) # six age bands
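    A quick illustrative check that the computed ages and bands look sensible (not in the original code):

    print(data['age'].describe()) # ages should fall within (0, 100]
    print(data['age_bin'].value_counts(sort = False)) # respondents per age band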
  • Data augmentation

    To mine more information, we further analyze the relationships between the features and augment the data. After careful analysis, the following features are added (see the code after this list):

    • age at first marriage
    • age at most recent marriage
    • whether remarried
    • spouse's age
    • age gap with spouse
    • various income ratios (own vs. spouse's income, expected income in 10 years vs. current income, etc.)
    • income-to-floor-area ratios (including expected income in 10 years, etc.)
    • social class differences (vs. class in 10 years, class 10 years ago, class at age 14)
    • leisure index
    • public-service satisfaction index
    • trust index
    # Age at first marriage (feature 143)
    new_col = 'marital_1stbir'
    data[new_col] = data['marital_1st'] - data['birth']

    # Age at most recent marriage (feature 144)
    new_col = 'marital_nowtbir'
    data[new_col] = data['marital_now'] - data['birth']

    # Whether remarried (feature 145): nonzero means the current marriage is not the first
    new_col = 'mar'
    data[new_col] = data['marital_nowtbir'] - data['marital_1stbir']

    # Spouse's age at the current marriage (feature 146)
    new_col = 'marital_sbir'
    data[new_col] = data['marital_now'] - data['s_birth']

    # Age gap with spouse (feature 147)
    new_col = 'age__'
    data[new_col] = data['marital_nowtbir'] - data['marital_sbir']

    # Income ratios (147 + 7 = 154)
    data['income/s_income'] = data['income']/(data['s_income']+1)
    data['income+s_income'] = data['income']+(data['s_income']+1)
    data['income/family_income'] = data['income']/(data['family_income']+1)
    data['all_income/family_income'] = (data['income']+data['s_income'])/(data['family_income']+1)
    data['income/inc_exp'] = data['income']/(data['inc_exp']+1)
    data['family_income/m'] = data['family_income']/(data['family_m']+0.01)
    data['income/m'] = data['income']/(data['family_m']+0.01)

    # Income-to-floor-area ratios (154 + 4 = 158)
    data['income/floor_area'] = data['income']/(data['floor_area']+0.01)
    data['all_income/floor_area'] = (data['income']+data['s_income'])/(data['floor_area']+0.01)
    data['family_income/floor_area'] = data['family_income']/(data['floor_area']+0.01)
    data['floor_area/m'] = data['floor_area']/(data['family_m']+0.01)

    # Class (158 + 3 = 161)
    data['class_10_diff'] = (data['class_10_after'] - data['class'])
    data['class_diff'] = data['class'] - data['class_10_before']
    data['class_14_diff'] = data['class'] - data['class_14']

    # Leisure index (feature 162)
    leisure_fea_lis = ['leisure_'+str(i) for i in range(1,13)]
    data['leisure_sum'] = data[leisure_fea_lis].sum(axis=1) # skew
    # Public-service satisfaction index (feature 163)
    public_service_fea_lis = ['public_service_'+str(i) for i in range(1,10)]
    data['public_service_sum'] = data[public_service_fea_lis].sum(axis=1) # skew
    # Trust index (feature 164)
    trust_fea_lis = ['trust_'+str(i) for i in range(1,14)]
    data['trust_sum'] = data[trust_fea_lis].sum(axis=1) # skew

    # Province means (164 + 13 = 177)
    data['province_income_mean'] = data.groupby(['province'])['income'].transform('mean').values
    data['province_family_income_mean'] = data.groupby(['province'])['family_income'].transform('mean').values
    data['province_equity_mean'] = data.groupby(['province'])['equity'].transform('mean').values
    data['province_depression_mean'] = data.groupby(['province'])['depression'].transform('mean').values
    data['province_floor_area_mean'] = data.groupby(['province'])['floor_area'].transform('mean').values
    data['province_health_mean'] = data.groupby(['province'])['health'].transform('mean').values
    data['province_class_10_diff_mean'] = data.groupby(['province'])['class_10_diff'].transform('mean').values
    data['province_class_mean'] = data.groupby(['province'])['class'].transform('mean').values
    data['province_health_problem_mean'] = data.groupby(['province'])['health_problem'].transform('mean').values
    data['province_family_status_mean'] = data.groupby(['province'])['family_status'].transform('mean').values
    data['province_leisure_sum_mean'] = data.groupby(['province'])['leisure_sum'].transform('mean').values
    data['province_public_service_sum_mean'] = data.groupby(['province'])['public_service_sum'].transform('mean').values
    data['province_trust_sum_mean'] = data.groupby(['province'])['trust_sum'].transform('mean').values

    # City means (177 + 13 = 190)
    data['city_income_mean'] = data.groupby(['city'])['income'].transform('mean').values
    data['city_family_income_mean'] = data.groupby(['city'])['family_income'].transform('mean').values
    data['city_equity_mean'] = data.groupby(['city'])['equity'].transform('mean').values
    data['city_depression_mean'] = data.groupby(['city'])['depression'].transform('mean').values
    data['city_floor_area_mean'] = data.groupby(['city'])['floor_area'].transform('mean').values
    data['city_health_mean'] = data.groupby(['city'])['health'].transform('mean').values
    data['city_class_10_diff_mean'] = data.groupby(['city'])['class_10_diff'].transform('mean').values
    data['city_class_mean'] = data.groupby(['city'])['class'].transform('mean').values
    data['city_health_problem_mean'] = data.groupby(['city'])['health_problem'].transform('mean').values
    data['city_family_status_mean'] = data.groupby(['city'])['family_status'].transform('mean').values
    data['city_leisure_sum_mean'] = data.groupby(['city'])['leisure_sum'].transform('mean').values
    data['city_public_service_sum_mean'] = data.groupby(['city'])['public_service_sum'].transform('mean').values
    data['city_trust_sum_mean'] = data.groupby(['city'])['trust_sum'].transform('mean').values

    # County means (190 + 13 = 203)
    data['county_income_mean'] = data.groupby(['county'])['income'].transform('mean').values
    data['county_family_income_mean'] = data.groupby(['county'])['family_income'].transform('mean').values
    data['county_equity_mean'] = data.groupby(['county'])['equity'].transform('mean').values
    data['county_depression_mean'] = data.groupby(['county'])['depression'].transform('mean').values
    data['county_floor_area_mean'] = data.groupby(['county'])['floor_area'].transform('mean').values
    data['county_health_mean'] = data.groupby(['county'])['health'].transform('mean').values
    data['county_class_10_diff_mean'] = data.groupby(['county'])['class_10_diff'].transform('mean').values
    data['county_class_mean'] = data.groupby(['county'])['class'].transform('mean').values
    data['county_health_problem_mean'] = data.groupby(['county'])['health_problem'].transform('mean').values
    data['county_family_status_mean'] = data.groupby(['county'])['family_status'].transform('mean').values
    data['county_leisure_sum_mean'] = data.groupby(['county'])['leisure_sum'].transform('mean').values
    data['county_public_service_sum_mean'] = data.groupby(['county'])['public_service_sum'].transform('mean').values
    data['county_trust_sum_mean'] = data.groupby(['county'])['trust_sum'].transform('mean').values

    # Ratios relative to the same province (203 + 13 = 216)
    data['income/province'] = data['income']/(data['province_income_mean'])
    data['family_income/province'] = data['family_income']/(data['province_family_income_mean'])
    data['equity/province'] = data['equity']/(data['province_equity_mean'])

------------- This blog is over! Thanks for reading! -------------