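1 Overview
The main goal of this course is to use real data to walk through the data analysis workflow in a hands-on way and to become familiar with the basic Python operations used in data analysis. With that goal in mind, we now begin the hands-on part: completing the Titanic task on Kaggle and working through the full data analysis process. Two resources are used throughout: the textbook "Python for Data Analysis", and baidu.com & google.com (make good use of search engines).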
2 Loading the data and first look
2.1 Getting the data
Dataset download: https://www.kaggle.com/c/titanic/overview
>>> import numpy as np
>>> import pandas as pd
2.2 Loading the data
>>> # Load using a relative path
>>> data = pd.read_csv('./train.csv', encoding='utf-8')
>>> # Load using an absolute path
>>> data = pd.read_csv(r"D:\Demo\University\XMU\Python\Artificial-intelligence\Data analysis\第一单元项目集合\train.csv", encoding='utf-8')
>>> # read_table reads the same file once the separator is set to a comma
>>> data = pd.read_table('./train.csv', sep=',')
>>> # Relative paths are resolved against the current working directory
>>> import os
>>> print(os.getcwd())
D:\Demo\University\XMU\Python\Artificial-intelligence\Data analysis\第一单元项目集合
>>> # Read the file in chunks of 500 rows
>>> chunker = pd.read_csv('train.csv', chunksize=500)
>>> for df in chunker:
...     print(type(df), df.shape)
...
<class 'pandas.core.frame.DataFrame'> (500, 12)
<class 'pandas.core.frame.DataFrame'> (391, 12)
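Chunked reading is mostly useful when a file is too large to hold in memory at once. The following is a minimal sketch of aggregating across chunks; the `csv_text` string is made up for illustration and merely stands in for `train.csv`.

```python
import io
import pandas as pd

# Made-up stand-in for train.csv: 7 rows, 2 columns.
csv_text = "PassengerId,Fare\n" + "\n".join(f"{i},{i * 1.5}" for i in range(1, 8))

total_rows = 0
fare_sum = 0.0
# With chunksize=3, read_csv yields DataFrames of at most 3 rows each.
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=3):
    total_rows += len(chunk)
    fare_sum += chunk["Fare"].sum()

# The mean is computed from running totals, never from the full table.
mean_fare = fare_sum / total_rows
print(total_rows, mean_fare)
```

Only running totals are kept between iterations, so memory use is bounded by the chunk size rather than by the file size.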
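2.3 Data preprocessing
2.3.1 Renaming the headers
Rename the columns to Chinese and use the passenger ID as the index (for English-language material, translating the headers can make the data easier to read at a glance):
PassengerId => 乘客ID (passenger ID)
Survived => 是否幸存 (survived or not)
Pclass => 乘客等级(1/2/3等舱位) (ticket class: 1st/2nd/3rd)
Name => 乘客姓名 (passenger name)
Sex => 性别 (sex)
Age => 年龄 (age)
SibSp => 堂兄弟/妹个数 (number of siblings/spouses aboard; the Chinese label literally reads "number of cousins")
Parch => 父母与小孩个数 (number of parents/children aboard)
Ticket => 船票信息 (ticket information)
Fare => 票价 (fare)
Cabin => 客舱 (cabin)
Embarked => 登船港口 (port of embarkation)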
>>> # Method 1: map old names to new names with rename()
>>> maps = {'PassengerId': '乘客ID', 'Survived': '是否幸存', 'Pclass': '乘客等级(1/2/3等舱位)',
...         'Name': '乘客姓名', 'Sex': '性别', 'Age': '年龄', 'SibSp': '堂兄弟/妹个数',
...         'Parch': '父母与小孩个数', 'Ticket': '船票信息', 'Fare': '票价',
...         'Cabin': '客舱', 'Embarked': '登船港口'}
>>> data.rename(columns=maps, inplace=True)
>>> # Method 2: reload the file and assign the full list of new column names
>>> data = pd.read_csv('./train.csv', encoding='utf-8')
>>> columns = ['乘客ID', '是否幸存', '仓位等级', '姓名', '性别', '年龄', '兄弟姐妹个数',
...            '父母子女个数', '船票信息', '票价', '客舱', '登船港口']
>>> data.columns = columns
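>>> # Method 3: set the names and the index while reading; header=0 replaces the original header row
>>> df = pd.read_csv('train.csv',
...                  names=['乘客ID', '是否幸存', '仓位等级', '姓名', '性别', '年龄',
...                         '兄弟姐妹个数', '父母子女个数', '船票信息', '票价', '客舱', '登船港口'],
...                  index_col='乘客ID', header=0)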
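The three renaming approaches above can be compared side by side on a toy table. This is a minimal sketch, not the course's own code: the two-row `raw` frame is made up, and `set_index` is used here as a stand-in for what `index_col` does at read time.

```python
import pandas as pd

raw = pd.DataFrame({"PassengerId": [1, 2], "Survived": [0, 1]})  # toy data

# Option 1: rename() maps old names to new ones and returns a new frame.
renamed = raw.rename(columns={"PassengerId": "乘客ID", "Survived": "是否幸存"})

# Option 2: assign a complete list to .columns (its length must match the column count).
assigned = raw.copy()
assigned.columns = ["乘客ID", "是否幸存"]

# Option 3: move the ID column into the index, similar to index_col in read_csv.
indexed = renamed.set_index("乘客ID")

print(renamed.columns.tolist(), indexed.index.name)
```

`rename` is the safer choice when only some columns change, because unmapped columns keep their names; assigning to `.columns` silently depends on the column order.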
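2.3.2 First look
After importing the data, you will usually want an overview of its overall structure and a few sample rows: for example the size of the data, how many columns it has, the type of each column, and whether it contains null values.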
>>> data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype
---  ------        --------------  -----
 0   乘客ID        891 non-null    int64
 1   是否幸存      891 non-null    int64
 2   仓位等级      891 non-null    int64
 3   姓名          891 non-null    object
 4   性别          891 non-null    object
 5   年龄          714 non-null    float64
 6   兄弟姐妹个数  891 non-null    int64
 7   父母子女个数  891 non-null    int64
 8   船票信息      891 non-null    object
 9   票价          891 non-null    float64
 10  客舱          204 non-null    object
 11  登船港口      889 non-null    object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
The first 10 rows, as shown by data.head(10):
   乘客ID  是否幸存  仓位等级  姓名                                               性别    年龄  兄弟姐妹个数  父母子女个数  船票信息          票价     客舱  登船港口
0  1       0         3         Braund, Mr. Owen Harris                            male    22.0  1             0             A/5 21171         7.2500   NaN   S
1  2       1         1         Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0  1             0             PC 17599          71.2833  C85   C
2  3       1         3         Heikkinen, Miss. Laina                             female  26.0  0             0             STON/O2. 3101282  7.9250   NaN   S
3  4       1         1         Futrelle, Mrs. Jacques Heath (Lily May Peel)       female  35.0  1             0             113803            53.1000  C123  S
4  5       0         3         Allen, Mr. William Henry                           male    35.0  0             0             373450            8.0500   NaN   S
5  6       0         3         Moran, Mr. James                                   male    NaN   0             0             330877            8.4583   NaN   Q
6  7       0         1         McCarthy, Mr. Timothy J                            male    54.0  0             0             17463             51.8625  E46   S
7  8       0         3         Palsson, Master. Gosta Leonard                     male    2.0   3             1             349909            21.0750  NaN   S
8  9       1         3         Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)  female  27.0  0             2             347742            11.1333  NaN   S
9  10      1         2         Nasser, Mrs. Nicholas (Adele Achem)                female  14.0  1             0             237736            30.0708  NaN   C
The last 15 rows, as shown by data.tail(15):
     乘客ID  是否幸存  仓位等级  姓名                                           性别    年龄  兄弟姐妹个数  父母子女个数  船票信息          票价     客舱  登船港口
876  877     0         3         Gustafsson, Mr. Alfred Ossian                  male    20.0  0             0             7534              9.8458   NaN   S
877  878     0         3         Petroff, Mr. Nedelio                           male    19.0  0             0             349212            7.8958   NaN   S
878  879     0         3         Laleff, Mr. Kristo                             male    NaN   0             0             349217            7.8958   NaN   S
879  880     1         1         Potter, Mrs. Thomas Jr (Lily Alexenia Wilson)  female  56.0  0             1             11767             83.1583  C50   C
880  881     1         2         Shelley, Mrs. William (Imanita Parrish Hall)   female  25.0  0             1             230433            26.0000  NaN   S
881  882     0         3         Markun, Mr. Johann                             male    33.0  0             0             349257            7.8958   NaN   S
882  883     0         3         Dahlberg, Miss. Gerda Ulrika                   female  22.0  0             0             7552              10.5167  NaN   S
883  884     0         2         Banfield, Mr. Frederick James                  male    28.0  0             0             C.A./SOTON 34068  10.5000  NaN   S
884  885     0         3         Sutehall, Mr. Henry Jr                         male    25.0  0             0             SOTON/OQ 392076   7.0500   NaN   S
885  886     0         3         Rice, Mrs. William (Margaret Norton)           female  39.0  0             5             382652            29.1250  NaN   Q
886  887     0         2         Montvila, Rev. Juozas                          male    27.0  0             0             211536            13.0000  NaN   S
887  888     1         1         Graham, Miss. Margaret Edith                   female  19.0  0             0             112053            30.0000  B42   S
888  889     0         3         Johnston, Miss. Catherine Helen "Carrie"       female  NaN   1             2             W./C. 6607        23.4500  NaN   S
889  890     1         1         Behr, Mr. Karl Howell                          male    26.0  0             0             111369            30.0000  C148  C
890  891     0         3         Dooley, Mr. Patrick                            male    32.0  0             0             370376            7.7500   NaN   Q
>>> # Check which cells are null (first five rows)
>>> data.isnull().head()
   乘客ID  是否幸存  仓位等级  姓名   性别   年龄   兄弟姐妹个数  父母子女个数  船票信息  票价   客舱   登船港口
0  False   False     False     False  False  False  False         False         False     False  True   False
1  False   False     False     False  False  False  False         False         False     False  False  False
2  False   False     False     False  False  False  False         False         False     False  True   False
3  False   False     False     False  False  False  False         False         False     False  False  False
4  False   False     False     False  False  False  False         False         False     False  True   False
Summary statistics of the numeric columns, as shown by data.describe():
       乘客ID      是否幸存    仓位等级    年龄        兄弟姐妹个数  父母子女个数  票价
count  891.000000  891.000000  891.000000  714.000000  891.000000    891.000000    891.000000
mean   446.000000  0.383838    2.308642    29.699118   0.523008      0.381594      32.204208
std    257.353842  0.486592    0.836071    14.526497   1.102743      0.806057      49.693429
min    1.000000    0.000000    1.000000    0.420000    0.000000      0.000000      0.000000
25%    223.500000  0.000000    2.000000    20.125000   0.000000      0.000000      7.910400
50%    446.000000  0.000000    3.000000    28.000000   0.000000      0.000000      14.454200
75%    668.500000  1.000000    3.000000    38.000000   1.000000      0.000000      31.000000
max    891.000000  1.000000    3.000000    80.000000   8.000000      6.000000      512.329200
Duplicates
>>> # Flag rows that are exact copies of an earlier row
>>> data.duplicated()
0      False
1      False
2      False
3      False
4      False
       ...
886    False
887    False
888    False
889    False
890    False
Length: 891, dtype: bool
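Scanning 891 booleans by eye is impractical, so the null and duplicate checks are usually reduced to counts. A minimal sketch on made-up data (the `toy` frame below is hypothetical, not taken from the Titanic file):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Age": [22.0, np.nan, 22.0, 35.0],
    "Cabin": ["B42", None, "B42", None],
})

missing_per_column = toy.isnull().sum()   # number of missing cells in each column
duplicate_rows = toy.duplicated().sum()   # rows identical to an earlier row
deduplicated = toy.drop_duplicates()      # keeps the first occurrence of each row

print(missing_per_column.to_dict(), duplicate_rows, len(deduplicated))
```

Summing works because True counts as 1 and False as 0.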
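2.4 Saving the data
>>> # 'ansi' is a Windows-only codec alias for the system code page; it is not available on other platforms
>>> df.to_csv('train_chinese.csv', encoding='ansi')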
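[Summary] This covers loading the data and getting started. Next we turn to operating on the data itself, focusing on how numpy and pandas are used in work and project settings.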
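3 Exploratory data analysis
Review: the previous chapter covered the basics of Pandas, including reading CSV data and adding, deleting, querying, and modifying it. This chapter covers exploratory data analysis: sorting and arithmetic with Pandas, and the use of the descriptive-statistics function describe().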
3.1 Loading the data
>>> data = pd.read_csv('train_chinese.csv')
>>> data_en = pd.read_csv('train.csv')
>>> data.head(1)
   �˿�ID  �Ƿ��Ҵ�  ��λ�ȼ�  ����                     �Ա�  ����.1  �ֵܽ��ø���  ��ĸ��Ů����  ��Ʊ��Ϣ   Ʊ��  �Ͳ�  �Ǵ��ۿ�
0  1       0         3         Braund, Mr. Owen Harris  male  22.0    1             0             A/5 21171  7.25  NaN   S
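Reading the file directly produces garbled headers, because it was saved with encoding='ansi' while read_csv assumes UTF-8 by default. The encoding therefore has to be specified explicitly.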
>>> # Read with the same encoding the file was written with
>>> data = pd.read_csv('train_chinese.csv', encoding='ansi')
>>> data.head()
   乘客ID  是否幸存  仓位等级  姓名                                               性别    年龄  兄弟姐妹个数  父母子女个数  船票信息          票价     客舱  登船港口
0  1       0         3         Braund, Mr. Owen Harris                            male    22.0  1             0             A/5 21171         7.2500   NaN   S
1  2       1         1         Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0  1             0             PC 17599          71.2833  C85   C
2  3       1         3         Heikkinen, Miss. Laina                             female  26.0  0             0             STON/O2. 3101282  7.9250   NaN   S
3  4       1         1         Futrelle, Mrs. Jacques Heath (Lily May Peel)       female  35.0  1             0             113803            53.1000  C123  S
4  5       0         3         Allen, Mr. William Henry                           male    35.0  0             0             373450            8.0500   NaN   S
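The garbling can be reproduced without touching the disk. This is a minimal sketch under one stated assumption: that 'ansi' resolved to GBK on the author's machine, as it does on a Simplified-Chinese Windows installation. The explicit 'gbk' codec is used because it is available on every platform.

```python
import io
import pandas as pd

df = pd.DataFrame({"乘客ID": [1, 2], "票价": [7.25, 71.2833]})

# Encode the CSV text as GBK (assumption: this is what 'ansi' meant on the author's system).
raw_bytes = df.to_csv(index=False).encode("gbk")

# Decoding the same bytes as UTF-8 turns the Chinese headers into replacement characters.
garbled_header = raw_bytes.decode("utf-8", errors="replace").splitlines()[0]

# Passing the matching encoding to read_csv recovers the original column names.
restored = pd.read_csv(io.BytesIO(raw_bytes), encoding="gbk")

print(garbled_header)
print(restored.columns.tolist())
```

Depending on the pandas version, reading such a file without the right `encoding` may raise a UnicodeDecodeError instead of producing garbled text; either way, the fix is to pass the encoding the file was written with.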
3.2 Getting to know the data
For details, see Chapter 5 of the textbook "Python for Data Analysis".
3.2.1 Sorting
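>>> # Build a small DataFrame with unordered row and column labels
>>> frame = pd.DataFrame(np.arange(8).reshape((2, 4)),
...                      index=['2', '1'],
...                      columns=['d', 'a', 'b', 'c'])
>>> frame
   d  a  b  c
2  0  1  2  3
1  4  5  6  7
Sorted by row index (for example with frame.sort_index()):
   d  a  b  c
1  4  5  6  7
2  0  1  2  3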
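Note:
axis: controls which index is sorted. The default, axis=0, sorts by the row index; set axis=1 to sort by the column labels instead.
ascending: controls ascending or descending order. The default, ascending=True, sorts in ascending order.
In addition, sort_values accepts a by parameter that can name several columns to sort by, for example by=['a', 'c'].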
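The parameters in the note can be tried out on the small `frame` from this section. A minimal sketch (the result variable names are chosen here for illustration):

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(8).reshape((2, 4)),
                     index=['2', '1'], columns=['d', 'a', 'b', 'c'])

by_row_label = frame.sort_index()               # axis=0: rows ordered '1', '2'
by_col_label = frame.sort_index(axis=1)         # axis=1: columns ordered a, b, c, d
desc_rows = frame.sort_index(ascending=False)   # rows ordered '2', '1'
# Sort by the values in column 'a', using 'c' to break ties, largest first.
by_values = frame.sort_values(by=['a', 'c'], ascending=False)

print(by_col_label)
print(by_values)
```

`sort_index` orders by labels, while `sort_values` orders by the data in the named columns; both return a new DataFrame and leave `frame` unchanged.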
>>> # Sort by fare, then by age, both descending, and show the top three rows
>>> data.sort_values(by=['票价', '年龄'], ascending=False).head(3)
     乘客ID  是否幸存  仓位等级  姓名                                性别    年龄  兄弟姐妹个数  父母子女个数  船票信息  票价      客舱         登船港口
679  680     1         1         Cardeza, Mr. Thomas Drake Martinez  male    36.0  0             1             PC 17755  512.3292  B51 B53 B55  C
258  259     1         1         Ward, Miss. Anna                    female  35.0  0             0             PC 17755  512.3292  NaN          C
737  738     1         1         Lesurer, Mr. Gustave J              male    35.0  0             0             PC 17755  512.3292  B101         C
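After sorting, focus on the fare and age columns. Common sense suggests that a higher fare means a better cabin, and the sorted data shows that as many as 14 of the 20 passengers who paid the highest fares survived.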
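A count like "survivors among the top-N fares" can be computed directly rather than read off the sorted table. The sketch below uses a made-up five-row table, so its numbers say nothing about the real data; only the pattern carries over.

```python
import pandas as pd

# Made-up fares and survival flags, for illustration only.
toy = pd.DataFrame({
    "票价": [512.3, 7.25, 263.0, 8.05, 151.55],
    "是否幸存": [1, 0, 1, 0, 0],
})

top_n = 3
top = toy.sort_values(by="票价", ascending=False).head(top_n)
survivors_in_top = int(top["是否幸存"].sum())   # 1 = survived, so the sum is a count

print(f"{survivors_in_top} of the top {top_n} fares survived")
```

On the real data, the same pattern with `top_n = 20` applied to `data` would give the count quoted in the text.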
3.2.2 Arithmetic
For details, see the section on arithmetic and data alignment in Chapter 5 of "Python for Data Analysis" (《利用Python进行数据分析》).
>>> # Two frames whose row and column labels only partly overlap
>>> frame1_a = pd.DataFrame(np.arange(9.).reshape(3, 3),
...                         columns=['a', 'b', 'c'],
...                         index=['one', 'two', 'three'])
>>> frame1_b = pd.DataFrame(np.arange(12.).reshape(4, 3),
...                         columns=['a', 'e', 'c'],
...                         index=['first', 'one', 'two', 'second'])
>>> frame1_a
         a    b    c
one    0.0  1.0  2.0
two    3.0  4.0  5.0
three  6.0  7.0  8.0
>>> frame1_b
          a     e     c
first   0.0   1.0   2.0
one     3.0   4.0   5.0
two     6.0   7.0   8.0
second  9.0  10.0  11.0
>>> # Addition aligns on both labels; cells present in only one frame become NaN
>>> frame1_a + frame1_b
          a   b     c   e
first   NaN NaN   NaN NaN
one     3.0 NaN   7.0 NaN
second  NaN NaN   NaN NaN
three   NaN NaN   NaN NaN
two     9.0 NaN  13.0 NaN
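When the NaN cells are unwanted, the `add` method with `fill_value` is one alternative to the `+` operator. A minimal sketch reusing the two frames from this section:

```python
import numpy as np
import pandas as pd

frame1_a = pd.DataFrame(np.arange(9.).reshape(3, 3),
                        columns=['a', 'b', 'c'], index=['one', 'two', 'three'])
frame1_b = pd.DataFrame(np.arange(12.).reshape(4, 3),
                        columns=['a', 'e', 'c'], index=['first', 'one', 'two', 'second'])

plain_sum = frame1_a + frame1_b                     # NaN wherever either side is missing
filled_sum = frame1_a.add(frame1_b, fill_value=0)   # a cell missing on one side counts as 0

print(filled_sum)
```

A cell that is missing from both frames, such as row 'first' in column 'b', is still NaN in `filled_sum`, because there is no value on either side to add to the fill value.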
3.2.3 Extreme values
Find how many relatives the largest family aboard has ('兄弟姐妹个数' + '父母子女个数').
>>> # Largest combined count of siblings/spouses and parents/children
>>> max(data['兄弟姐妹个数'] + data['父母子女个数'])
10
>>> # Summary statistics for a single column
>>> data['父母子女个数'].describe()
count    891.000000
mean       0.381594
std        0.806057
min        0.000000
25%        0.000000
50%        0.000000
75%        0.000000
max        6.000000
Name: 父母子女个数, dtype: float64
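Besides the maximum itself, it is often useful to know which row it belongs to. A minimal sketch on a made-up three-row table (the names "A", "B", "C" are placeholders):

```python
import pandas as pd

toy = pd.DataFrame({
    "姓名": ["A", "B", "C"],            # placeholder names
    "兄弟姐妹个数": [1, 8, 0],
    "父母子女个数": [0, 2, 5],
})

family = toy["兄弟姐妹个数"] + toy["父母子女个数"]   # element-wise sum per passenger
largest = family.max()                               # the extreme value
who = toy.loc[family.idxmax(), "姓名"]               # idxmax gives the row label of that value

print(largest, who)
```

`Series.max()` is the vectorized equivalent of the built-in `max()` used above and skips missing values by default.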
4 Pandas basics
>>> # A Series built from a dict: keys become the index
>>> sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
>>> example_1 = pd.Series(sdata)
>>> example_1
Ohio      35000
Texas     71000
Oregon    16000
Utah       5000
dtype: int64
>>> # A DataFrame built from a dict of equal-length lists: keys become the columns
>>> temp = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
...         'year': [2000, 2001, 2002, 2001, 2002, 2003],
...         'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}
>>> example_2 = pd.DataFrame(temp)
>>> example_2
    state  year  pop
0    Ohio  2000  1.5
1    Ohio  2001  1.7
2    Ohio  2002  3.6
3  Nevada  2001  2.4
4  Nevada  2002  2.9
5  Nevada  2003  3.2
4.1 Inspecting the columns
>>> # List the column names
>>> data_en.columns
Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')
>>> # Select one column with bracket notation
>>> data_en['Cabin'].head(3)
0    NaN
1    C85
2    NaN
Name: Cabin, dtype: object
>>> # Select the same column with attribute notation
>>> data_en.Cabin.head(3)
0    NaN
1    C85
2    NaN
Name: Cabin, dtype: object
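The two notations are not always interchangeable. A minimal sketch using a shortened version of the `example_2` table from earlier, whose 'pop' column happens to share its name with a DataFrame method:

```python
import pandas as pd

example_2 = pd.DataFrame({'state': ['Ohio', 'Nevada'],
                          'year': [2000, 2001],
                          'pop': [1.5, 2.4]})

by_bracket = example_2['state']
by_attribute = example_2.state       # same data as the bracket form

# 'pop' collides with DataFrame.pop, so attribute access returns the method, not the column.
pop_attribute_is_method = callable(example_2.pop)
pop_column = example_2['pop']        # bracket access still returns the column

print(pop_attribute_is_method, pop_column.tolist())
```

Bracket notation also works for names containing spaces or non-ASCII characters, which makes it the safer default.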
4.2 Deleting redundant columns
>>> test_1 = pd.read_csv('test_1.csv')
>>> test_1.head(3)
   Unnamed: 0  PassengerId  Survived  Pclass  Name                                               Sex     Age   SibSp  Parch  Ticket            Fare     Cabin  Embarked  a
0  0           1            0         3       Braund, Mr. Owen Harris                            male    22.0  1      0      A/5 21171         7.2500   NaN    S         100
1  1           2            1         1       Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0  1      0      PC 17599          71.2833  C85    C         100
2  2           3            1         3       Heikkinen, Miss. Laina                             female  26.0  0      0      STON/O2. 3101282  7.9250   NaN    S         100
Inspecting the test set test_1.csv shows that two of its columns ('Unnamed: 0' and 'a') are redundant and should be removed.
>>> # Method 1: delete the column in place with del
>>> del test_1['a']
>>> test_1.head(3)
   Unnamed: 0  PassengerId  Survived  Pclass  Name                                               Sex     Age   SibSp  Parch  Ticket            Fare     Cabin  Embarked
0  0           1            0         3       Braund, Mr. Owen Harris                            male    22.0  1      0      A/5 21171         7.2500   NaN    S
1  1           2            1         1       Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0  1      0      PC 17599          71.2833  C85    C
2  2           3            1         3       Heikkinen, Miss. Laina                             female  26.0  0      0      STON/O2. 3101282  7.9250   NaN    S
>>> # Method 2: an alternative to del (requires that 'a' has not already been deleted)
>>> test_1.drop('a', axis=1, inplace=True)
>>> test_1.head(3)
   Unnamed: 0  PassengerId  Survived  Pclass  Name                                               Sex     Age   SibSp  Parch  Ticket            Fare     Cabin  Embarked
0  0           1            0         3       Braund, Mr. Owen Harris                            male    22.0  1      0      A/5 21171         7.2500   NaN    S
1  1           2            1         1       Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0  1      0      PC 17599          71.2833  C85    C
2  2           3            1         3       Heikkinen, Miss. Laina                             female  26.0  0      0      STON/O2. 3101282  7.9250   NaN    S
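Note:
inplace: whether to modify the original data in place. The default is False, which leaves the original unchanged; in that case, assign the result back to the same name or store it in a new variable.
axis: whether the operation applies along the rows (axis=0) or the columns (axis=1).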
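>>> # Method 3, on the freshly loaded 14-column test_1: keep all but the first ('Unnamed: 0') and last ('a') columns
>>> test_1 = test_1.iloc[:, 1:-1]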
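In other words, loc and iloc can be used to select the wanted part by label or by position, and the selection can then replace the original data.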
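The ways of removing columns can be compared on a small stand-in for `test_1`. This is a minimal sketch: the four-column `test_like` frame is made up, and the `columns=` and `errors='ignore'` arguments of `drop` are additions not used in the text above.

```python
import pandas as pd

# Made-up stand-in for test_1 with the same two redundant columns.
test_like = pd.DataFrame({'Unnamed: 0': [0, 1], 'PassengerId': [1, 2],
                          'Embarked': ['S', 'C'], 'a': [100, 100]})

# By name: drop() returns a new frame; errors='ignore' tolerates labels that are already gone.
by_name = test_like.drop(columns=['Unnamed: 0', 'a'], errors='ignore')

# By position: keep everything except the first and the last column.
by_position = test_like.iloc[:, 1:-1]

# By label range: loc slices include both end labels.
by_label_range = test_like.loc[:, 'PassengerId':'Embarked']

print(by_name.columns.tolist())
```

Dropping by name is the most robust of the three, since positional slices silently select the wrong columns if the column order ever changes.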
4.3 Hiding information
>>> # Hide four columns and show the rest; data_en itself is not modified
>>> data_en.drop(['PassengerId', 'Name', 'Age', 'Ticket'], axis=1).head(3)
   Survived  Pclass     Sex  SibSp  Parch     Fare Cabin Embarked
0         0       3    male      1      0   7.2500   NaN        S
1         1       1  female      1      0  71.2833   C85        C
2         1       3  female      0      0   7.9250   NaN        S
Conversely, the following code drops every other column, so that only those four columns are shown:
>>> # Index.drop returns the labels that remain after removing the four names
>>> columns = data_en.columns.drop(['PassengerId', 'Name', 'Age', 'Ticket'])
>>> data_en.drop(columns, axis=1).head(3)
   PassengerId  Name                                               Age   Ticket
0  1            Braund, Mr. Owen Harris                            22.0  A/5 21171
1  2            Cumings, Mrs. John Bradley (Florence Briggs Th...  38.0  PC 17599
2  3            Heikkinen, Miss. Laina                             26.0  STON/O2. 3101282
Note:
The drop method removes the given labels from the data. Unless inplace=True is set, it leaves the original unchanged and returns the result as a new object.
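4.3.1 Filtering data
One of the most important capabilities for tabular data is filtering: selecting the information you need and discarding what you do not.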
>>> # Passengers younger than 10
>>> data_en[data_en["Age"] < 10].head(3)
    PassengerId  Survived  Pclass  Name                             Sex     Age  SibSp  Parch  Ticket   Fare    Cabin  Embarked
7   8            0         3       Palsson, Master. Gosta Leonard   male    2.0  3      1      349909   21.075  NaN    S
10  11           1         3       Sandstrom, Miss. Marguerite Rut  female  4.0  1      1      PP 9549  16.700  G6     S
16  17           0         3       Rice, Master. Eugene             male    2.0  4      1      382652   29.125  NaN    Q
>>> # Passengers older than 10 and younger than 50; each condition needs its own parentheses
>>> midage = data_en[(data_en["Age"] > 10) & (data_en["Age"] < 50)]
>>> midage.head(3)
   PassengerId  Survived  Pclass  Name                                               Sex     Age   SibSp  Parch  Ticket            Fare     Cabin  Embarked
0  1            0         3       Braund, Mr. Owen Harris                            male    22.0  1      0      A/5 21171         7.2500   NaN    S
1  2            1         1       Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0  1      0      PC 17599          71.2833  C85    C
2  3            1         3       Heikkinen, Miss. Laina                             female  26.0  0      0      STON/O2. 3101282  7.9250   NaN    S
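Boolean filters have a few edge cases worth seeing once: operator precedence, inclusive versus strict bounds, and missing values. A minimal sketch on made-up ages (the `toy` frame is hypothetical):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"Age": [2.0, 10.0, 26.0, np.nan, 50.0, 61.0],
                    "Sex": ["male", "female", "female", "male", "male", "female"]})

# Each comparison needs its own parentheses because & binds tighter than < and >.
mid = toy[(toy["Age"] > 10) & (toy["Age"] < 50)]

# between() includes both bounds by default, unlike the strict comparisons above.
mid_inclusive = toy[toy["Age"].between(10, 50)]

# ~ negates a mask; a missing Age fails every comparison, so that row ends up here.
not_mid = toy[~((toy["Age"] > 10) & (toy["Age"] < 50))]

print(mid.index.tolist(), mid_inclusive.index.tolist(), not_mid.index.tolist())
```

Note that the filtered frames keep the row labels of the original table, which is why the next section resets the index before selecting "row 100".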
4.3.2 Extracting data
Display the "Pclass" and "Sex" values of row 100 of midage.
>>> # Renumber the rows 0, 1, 2, ... so that label 100 is the row at position 100
>>> midage = midage.reset_index(drop=True)
>>> midage.head(3)
   PassengerId  Survived  Pclass  Name                                               Sex     Age   SibSp  Parch  Ticket            Fare     Cabin  Embarked
0  1            0         3       Braund, Mr. Owen Harris                            male    22.0  1      0      A/5 21171         7.2500   NaN    S
1  2            1         1       Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0  1      0      PC 17599          71.2833  C85    C
2  3            1         3       Heikkinen, Miss. Laina                             female  26.0  0      0      STON/O2. 3101282  7.9250   NaN    S
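Note:
reset_index() resets the index to 0, 1, 2, ...; the drop parameter controls whether the original index values are discarded (drop=True) or kept as a new column (drop=False, the default).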
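The effect of the drop parameter is easiest to see on a filtered toy table. A minimal sketch with made-up ages:

```python
import pandas as pd

toy = pd.DataFrame({"Age": [5, 30, 40, 70]})
subset = toy[toy["Age"] > 10]               # keeps the original labels 1, 2, 3

dropped = subset.reset_index(drop=True)     # labels become 0, 1, 2; the old ones are discarded
kept = subset.reset_index()                 # the old labels are preserved in a new 'index' column

print(dropped.index.tolist(), kept.columns.tolist())
```

Keeping the old labels can be handy when filtered rows later need to be matched back to the original table.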
>>> # Select by label: row 100, columns 'Pclass' and 'Sex'
>>> midage.loc[[100], ['Pclass', 'Sex']]
>>> # Several rows and columns by label
>>> midage.loc[[100, 105, 108], ['Pclass', 'Name', 'Sex']]
     Pclass  Name                               Sex
100  2       Byles, Rev. Thomas Roussel Davids  male
105  3       Cribb, Mr. John Hatfield           male
108  3       Calic, Mr. Jovo                    male
>>> # The same selection by integer position
>>> midage.iloc[[100, 105, 108], [2, 3, 4]]
     Pclass  Name                               Sex
100  2       Byles, Rev. Thomas Roussel Davids  male
105  3       Cribb, Mr. John Hatfield           male
108  3       Calic, Mr. Jovo                    male
[Summary]: loc selects by label, whereas iloc selects by integer position.
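In addition, the two behave differently when assigning values into a DataFrame. With loc, assigning to a label that does not exist in the table creates a new column by default. iloc can only overwrite positions that already exist, so the position must stay within the current shape.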
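The difference in assignment behaviour can be checked on a toy table. A minimal sketch; the `IsChild` column name is made up for illustration:

```python
import pandas as pd

toy = pd.DataFrame({"Pclass": [1, 3], "Sex": ["female", "male"]})

# loc with an unknown column label enlarges the frame by creating that column.
toy.loc[:, "IsChild"] = False

# iloc addresses positions only; position 5 does not exist in a 3-column frame.
try:
    toy.iloc[:, 5] = 0
    iloc_error = None
except IndexError as exc:
    iloc_error = exc

print(toy.columns.tolist(), type(iloc_error).__name__)
```

Because loc creates columns silently, a typo in a label during assignment adds a new column instead of raising an error, so it is worth checking `toy.columns` after bulk assignments.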