Analysis-xml

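1. Parsing and processing XML files with Python

1.1 SAX basics

SAX (Simple API for XML) consists of a parser and an event handler:

  • Parser: reads the XML document and sends events to the event handler, such as element-start and element-end events.
  • Event handler: responds to those events and processes the XML data passed to it.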

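The xml.sax module parses XML mainly through a ContentHandler, which provides the following callback methods:

  • startDocument(): called when the document starts.
  • endDocument(): called when the parser reaches the end of the document.
  • startElement(name, attrs): called on an XML start tag; name is the tag name, and attrs is a dictionary-like Attributes object holding the tag's attributes.
  • endElement(name): called on an XML end tag.
  • characters(content): handles text content. The parser calls this method to report each chunk of character data. A SAX parser may return all contiguous character data in a single chunk or split it into several chunks; however, all of the characters in any single event must come from the same external entity so that the Locator provides useful information.

Two further calls are used below. They belong to the xml.sax module and the parser object, not to ContentHandler:

  • xml.sax.make_parser(): creates and returns a parser (XMLReader) object.
  • parse(source): method of the parser object that parses the XML.

ref: python 使用sax 解析xml 文件 (Parsing XML files with SAX in Python)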
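
As a rough illustration of the parser/handler split described above, the sketch below records every callback in the order the parser reports it. The class name EventLogger and the tiny inline document are made up for this example; only xml.sax.ContentHandler and xml.sax.parseString come from the standard library.

```python
import xml.sax


class EventLogger(xml.sax.ContentHandler):
    """Hypothetical handler that records each SAX event in arrival order."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startDocument(self):
        self.events.append("startDocument")

    def startElement(self, name, attrs):
        # attrs is a mapping-like Attributes object; copy it into a plain dict
        self.events.append(("startElement", name, dict(attrs.items())))

    def characters(self, content):
        # Ignore whitespace-only chunks to keep the log short
        if content.strip():
            self.events.append(("characters", content))

    def endElement(self, name):
        self.events.append(("endElement", name))

    def endDocument(self):
        self.events.append("endDocument")


handler = EventLogger()
xml.sax.parseString(b'<root><item id="1">hi</item></root>', handler)
for event in handler.events:
    print(event)
```

The printed list should mirror the document order: the document start, each start tag with its attributes, the text, the matching end tags, and finally the document end.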

The help text for the ContentHandler class is shown below:

import xml.sax
help(xml.sax.ContentHandler)
Help on class ContentHandler in module xml.sax.handler:

class ContentHandler(builtins.object)
| Interface for receiving logical document content events.
|
| This is the main callback interface in SAX, and the one most
| important to applications. The order of events in this interface
| mirrors the order of the information in the document.
|
| Methods defined here:
|
| __init__(self)
| Initialize self. See help(type(self)) for accurate signature.
|
| characters(self, content)
| Receive notification of character data.
|
| The Parser will call this method to report each chunk of
| character data. SAX parsers may return all contiguous
| character data in a single chunk, or they may split it into
| several chunks; however, all of the characters in any single
| event must come from the same external entity so that the
| Locator provides useful information.
|
| endDocument(self)
| Receive notification of the end of a document.
|
| The SAX parser will invoke this method only once, and it will
| be the last method invoked during the parse. The parser shall
| not invoke this method until it has either abandoned parsing
| (because of an unrecoverable error) or reached the end of
| input.
|
| endElement(self, name)
| Signals the end of an element in non-namespace mode.
|
| The name parameter contains the name of the element type, just
| as with the startElement event.
|
| endElementNS(self, name, qname)
| Signals the end of an element in namespace mode.
|
| The name parameter contains the name of the element type, just
| as with the startElementNS event.
|
| endPrefixMapping(self, prefix)
| End the scope of a prefix-URI mapping.
|
| See startPrefixMapping for details. This event will always
| occur after the corresponding endElement event, but the order
| of endPrefixMapping events is not otherwise guaranteed.
|
| ignorableWhitespace(self, whitespace)
| Receive notification of ignorable whitespace in element content.
|
| Validating Parsers must use this method to report each chunk
| of ignorable whitespace (see the W3C XML 1.0 recommendation,
| section 2.10): non-validating parsers may also use this method
| if they are capable of parsing and using content models.
|
| SAX parsers may return all contiguous whitespace in a single
| chunk, or they may split it into several chunks; however, all
| of the characters in any single event must come from the same
| external entity, so that the Locator provides useful
| information.
|
| processingInstruction(self, target, data)
| Receive notification of a processing instruction.
|
| The Parser will invoke this method once for each processing
| instruction found: note that processing instructions may occur
| before or after the main document element.
|
| A SAX parser should never report an XML declaration (XML 1.0,
| section 2.8) or a text declaration (XML 1.0, section 4.3.1)
| using this method.
|
| setDocumentLocator(self, locator)
| Called by the parser to give the application a locator for
| locating the origin of document events.
|
| SAX parsers are strongly encouraged (though not absolutely
| required) to supply a locator: if it does so, it must supply
| the locator to the application by invoking this method before
| invoking any of the other methods in the DocumentHandler
| interface.
|
| The locator allows the application to determine the end
| position of any document-related event, even if the parser is
| not reporting an error. Typically, the application will use
| this information for reporting its own errors (such as
| character content that does not match an application's
| business rules). The information returned by the locator is
| probably not sufficient for use with a search engine.
|
| Note that the locator will return correct information only
| during the invocation of the events in this interface. The
| application should not attempt to use it at any other time.
|
| skippedEntity(self, name)
| Receive notification of a skipped entity.
|
| The Parser will invoke this method once for each entity
| skipped. Non-validating processors may skip entities if they
| have not seen the declarations (because, for example, the
| entity was declared in an external DTD subset). All processors
| may skip external entities, depending on the values of the
| http://xml.org/sax/features/external-general-entities and the
| http://xml.org/sax/features/external-parameter-entities
| properties.
|
| startDocument(self)
| Receive notification of the beginning of a document.
|
| The SAX parser will invoke this method only once, before any
| other methods in this interface or in DTDHandler (except for
| setDocumentLocator).
|
| startElement(self, name, attrs)
| Signals the start of an element in non-namespace mode.
|
| The name parameter contains the raw XML 1.0 name of the
| element type as a string and the attrs parameter holds an
| instance of the Attributes class containing the attributes of
| the element.
|
| startElementNS(self, name, qname, attrs)
| Signals the start of an element in namespace mode.
|
| The name parameter contains the name of the element type as a
| (uri, localname) tuple, the qname parameter the raw XML 1.0
| name used in the source document, and the attrs parameter
| holds an instance of the Attributes class containing the
| attributes of the element.
|
| The uri part of the name tuple is None for elements which have
| no namespace.
|
| startPrefixMapping(self, prefix, uri)
| Begin the scope of a prefix-URI Namespace mapping.
|
| The information from this event is not necessary for normal
| Namespace processing: the SAX XML reader will automatically
| replace prefixes for element and attribute names when the
| http://xml.org/sax/features/namespaces feature is true (the
| default).
|
| There are cases, however, when applications need to use
| prefixes in character data or in attribute values, where they
| cannot safely be expanded automatically; the
| start/endPrefixMapping event supplies the information to the
| application to expand prefixes in those contexts itself, if
| necessary.
|
| Note that start/endPrefixMapping events are not guaranteed to
| be properly nested relative to each-other: all
| startPrefixMapping events will occur before the corresponding
| startElement event, and all endPrefixMapping events will occur
| after the corresponding endElement event, but their order is
| not guaranteed.
|
| ----------------------------------------------------------------------
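
The help text mentions setDocumentLocator, which lets a handler ask where in the document the current event came from. A minimal sketch, assuming the default expat-based parser (which does supply a locator); the class name LineTracker and the sample document are hypothetical:

```python
import xml.sax


class LineTracker(xml.sax.ContentHandler):
    """Hypothetical handler that remembers the line of each start tag."""

    def __init__(self):
        super().__init__()
        self.locator = None
        self.lines = {}

    def setDocumentLocator(self, locator):
        # Called by the parser before any other event
        self.locator = locator

    def startElement(self, name, attrs):
        # The locator is only reliable while an event is being delivered
        if self.locator is not None:
            self.lines[name] = self.locator.getLineNumber()


tracker = LineTracker()
xml.sax.parseString(b"<a>\n  <b/>\n</a>", tracker)
print(tracker.lines)
```

This is typically useful for error messages of the application's own, for example reporting which line held an unexpected value.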

1.2 Demo: parsing XML

1.2.1 Reading XML that contains only tags

Create a file named config1.xml with the following content (the path values are kept in Chinese; "a的路径" means "path of a"):

<?xml version="1.0" encoding="UTF-8"?>
<config_content>
    <lib name="a" path="a的路径"/>
    <lib name="b" path="b的路径"/>
    <lib name="c" path="c的路径"/>
</config_content>
  • Set the data path

data_dir = './data/test_data/'
data_test_path = data_dir + 'config1.xml'

  • Preview the file

# Open the XML file as UTF-8 text and print its content
with open(data_test_path, 'r', encoding='utf-8') as f:
    xml_data = f.read()

print(xml_data)

Results:

<?xml version="1.0" encoding="UTF-8"?>
<config_content>
    <lib name="a" path="a的路径"/>
    <lib name="b" path="b的路径"/>
    <lib name="c" path="c的路径"/>
</config_content>
  • Parse it with xml.sax:

import xml.sax


class ConfigHandler(xml.sax.ContentHandler):

    def __init__(self):
        super().__init__()
        self.tag = ""
        self.name = ""
        self.path = ""

    # Called once when the document starts
    def startDocument(self):
        print("******解析配置文件开始******")  # "config file parsing started"

    # Called for each XML start tag
    def startElement(self, name, attributes):
        self.tag = name
        if name == "lib":
            self.name = attributes["name"]
            self.path = attributes["path"]
            print(self.name)
            print(self.path)

    # Called for character data inside an element
    def characters(self, content):
        pass

    # Called for each XML end tag
    def endElement(self, name):
        pass

    # Called once when the document ends
    def endDocument(self):
        print("******配置文件解析结束******")  # "config file parsing finished"


if __name__ == "__main__":
    # Create an XMLReader
    parser = xml.sax.make_parser()

    # Turn off namespaces
    parser.setFeature(xml.sax.handler.feature_namespaces, 0)

    # Install the custom ContentHandler
    handler = ConfigHandler()
    parser.setContentHandler(handler)

    # Parse the XML; pass the path of the XML file (data_test_path, set earlier)
    parser.parse(data_test_path)

Results:

******解析配置文件开始******
a
a的路径
b
b的路径
c
c的路径
******配置文件解析结束******
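
The demo indexes the attributes with attributes["name"], which raises KeyError if a tag lacks that attribute. As a hedged sketch of a more forgiving variant, the Attributes object also supports the in operator and get() with a default. The class name LibHandler, the "<no path>" placeholder and the inline sample are assumptions made for this example:

```python
import xml.sax


class LibHandler(xml.sax.ContentHandler):
    """Hypothetical handler that tolerates missing attributes."""

    def __init__(self):
        super().__init__()
        self.libs = []

    def startElement(self, name, attrs):
        if name != "lib":
            return
        if "name" not in attrs:
            return  # skip entries without a name instead of raising KeyError
        # get() returns the fallback when the attribute is absent
        self.libs.append((attrs["name"], attrs.get("path", "<no path>")))


handler = LibHandler()
xml.sax.parseString(
    b'<config_content>'
    b'<lib name="a" path="/opt/a"/>'
    b'<lib name="b"/>'
    b'<lib/>'
    b'</config_content>',
    handler,
)
print(handler.libs)
```

Whether to skip, default, or fail on a missing attribute depends on how strict the configuration format is meant to be.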

Because this XML contains only tags, the character-data and end-tag callbacks do nothing. To use the parsed data afterwards, store it in lists or a dictionary, as follows:

import xml.sax


class ConfigHandler(xml.sax.ContentHandler):

    def __init__(self):
        super().__init__()
        self.tag = ""
        self.name = ""
        self.path = ""
        # Per-instance containers, so separate handlers do not share state
        self.config_map = {}
        self.config_name_list = []
        self.config_path_list = []

    # Called once when the document starts
    def startDocument(self):
        print("******解析配置文件开始******")  # "config file parsing started"

    # Called for each XML start tag
    def startElement(self, name, attributes):
        self.tag = name
        if name == "lib":
            self.name = attributes["name"]
            self.path = attributes["path"]
            self.config_name_list.append(self.name)
            print(self.config_name_list)
            self.config_path_list.append(self.path)
            print(self.config_path_list)
            self.config_map.update({self.name: self.path})
            print(self.config_map)

    # Called for character data inside an element
    def characters(self, content):
        pass

    # Called for each XML end tag
    def endElement(self, name):
        pass

    # Called once when the document ends
    def endDocument(self):
        print("******配置文件解析结束******")  # "config file parsing finished"


if __name__ == "__main__":
    # Create an XMLReader
    parser = xml.sax.make_parser()

    # Turn off namespaces
    parser.setFeature(xml.sax.handler.feature_namespaces, 0)

    # Install the custom ContentHandler
    handler = ConfigHandler()
    parser.setContentHandler(handler)

    # Parse the XML; pass the path of the XML file (data_test_path, set earlier)
    parser.parse(data_test_path)

Results:

******解析配置文件开始******
['a']
['a的路径']
{'a': 'a的路径'}
['a', 'b']
['a的路径', 'b的路径']
{'a': 'a的路径', 'b': 'b的路径'}
['a', 'b', 'c']
['a的路径', 'b的路径', 'c的路径']
{'a': 'a的路径', 'b': 'b的路径', 'c': 'c的路径'}
******配置文件解析结束******
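
Once the data is collected in a dictionary, it is often convenient to hide the handler behind a function that simply returns the result. A minimal sketch under that assumption; LibMapHandler and load_lib_map are hypothetical names, and the sample is an inline copy of config1.xml rather than the file on disk:

```python
import xml.sax


class LibMapHandler(xml.sax.ContentHandler):
    """Hypothetical handler that maps each lib name to its path."""

    def __init__(self):
        super().__init__()
        self.config_map = {}

    def startElement(self, name, attrs):
        if name == "lib":
            self.config_map[attrs["name"]] = attrs["path"]


def load_lib_map(xml_bytes):
    """Parse XML given as bytes and return a {name: path} dictionary."""
    handler = LibMapHandler()  # a fresh handler per call keeps results separate
    xml.sax.parseString(xml_bytes, handler)
    return handler.config_map


sample = (
    '<config_content>'
    '<lib name="a" path="a的路径"/>'
    '<lib name="b" path="b的路径"/>'
    '<lib name="c" path="c的路径"/>'
    '</config_content>'
).encode("utf-8")

lib_map = load_lib_map(sample)
print(lib_map["b"])
```

Creating the handler inside the function means two calls never see each other's entries.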

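1.2.2 Reading elements with the same tag but different content

Create a file named config2.xml with the following content (the values are kept in Chinese: 年级 is "grade"; 体育, 语文 and 数学 are PE, Chinese and math; 优秀, 良好 and 一般 are "excellent", "good" and "average"):

<?xml version="1.0" encoding="UTF-8"?>
<config_content>
    <type class="3年级">
        <lib name="体育">优秀</lib>
        <lib name="语文">一般</lib>
        <lib name="数学">优秀</lib>
    </type>
    <type class="5年级">
        <lib name="体育">一般</lib>
        <lib name="语文">优秀</lib>
        <lib name="数学">良好</lib>
    </type>
</config_content>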
  • Set the data path

data_test_path = data_dir + 'config2.xml'

  • Parse the file

import xml.sax


class ConfigHandler(xml.sax.ContentHandler):

    def __init__(self):
        super().__init__()
        self.tag = ""
        self.name = ""
        self.label = ""
        self.content = ""

    # Called once when the document starts
    def startDocument(self):
        print("******解析配置文件开始******")  # "config file parsing started"

    # Called for each XML start tag
    def startElement(self, name, attributes):
        self.tag = name
        self.content = ""  # reset the text buffer for the new element
        if name == "type":
            self.name = attributes["class"]
            print(self.name)
        if name == "lib":
            self.label = attributes["name"]
            print(self.label)

    # Called for character data; the parser may deliver it in several chunks,
    # so append instead of overwriting
    def characters(self, content):
        self.content += content

    # Called for each XML end tag
    def endElement(self, name):
        if name == "lib":
            print(self.content.strip())

    # Called once when the document ends
    def endDocument(self):
        print("******配置文件解析结束******")  # "config file parsing finished"


if __name__ == "__main__":
    # Create an XMLReader
    parser = xml.sax.make_parser()

    # Turn off namespaces
    parser.setFeature(xml.sax.handler.feature_namespaces, 0)

    # Install the custom ContentHandler
    handler = ConfigHandler()
    parser.setContentHandler(handler)

    # Parse the XML; pass the path of the XML file (data_test_path, set earlier)
    parser.parse(data_test_path)

Results:

******解析配置文件开始******
3年级
体育
优秀
语文
一般
数学
优秀
5年级
体育
一般
语文
优秀
数学
良好
******配置文件解析结束******
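
As noted in section 1.1, the parser may report an element's text in several characters() calls, so a handler that wants the full text should join the chunks. The sketch below provokes this on purpose by feeding the document in two pieces. It assumes the default expat-based parser returned by make_parser(), which supports incremental feed() and close(); TextCollector and the sample text are hypothetical.

```python
import xml.sax


class TextCollector(xml.sax.ContentHandler):
    """Hypothetical handler that joins all text chunks of each <lib>."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self.texts = []

    def startElement(self, name, attrs):
        self.chunks = []  # start a fresh buffer for every element

    def characters(self, content):
        self.chunks.append(content)  # may be called more than once per element

    def endElement(self, name):
        if name == "lib":
            self.texts.append("".join(self.chunks).strip())


collector = TextCollector()
parser = xml.sax.make_parser()
parser.setContentHandler(collector)

# Feed the document in two pieces, splitting the text in the middle
parser.feed(b"<lib>exc")
parser.feed(b"ellent</lib>")
parser.close()

print(collector.texts)
```

However many chunks the parser chooses to deliver, the joined text is the complete element content.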

1.2.3 Reading the same tags repeated under multiple parents

Create a file named config3.xml with the following content (第六中学 and 第九中学 are school names, "No. 6 Middle School" and "No. 9 Middle School"; 年级 is "grade"; 优秀 and 一般 are "excellent" and "average"):

<?xml version="1.0" encoding="UTF-8"?>
<config_content>
    <school name="第六中学">
        <type class="2年级">
            <Language>优秀</Language>
            <Math>一般</Math>
            <English>优秀</English>
        </type>
        <type class="5年级">
            <Language>优秀</Language>
            <Math>一般</Math>
            <English>优秀</English>
        </type>
    </school>
    <school name="第九中学">
        <type class="1年级">
            <Language>优秀</Language>
            <Math>一般</Math>
            <English>优秀</English>
        </type>
        <type class="3年级">
            <Language>优秀</Language>
            <Math>一般</Math>
            <English>优秀</English>
        </type>
    </school>
</config_content>
  • Set the data path

data_test_path = data_dir + 'config3.xml'

  • Parse the file

import xml.sax


class ConfigHandler(xml.sax.ContentHandler):

    def __init__(self):
        super().__init__()
        self.tag = ""
        self.name = ""
        self.label = ""
        self.content = ""

    # Called once when the document starts
    def startDocument(self):
        print("******解析配置文件开始******")  # "config file parsing started"

    # Called for each XML start tag
    def startElement(self, name, attributes):
        self.tag = name
        self.content = ""  # reset the text buffer for the new element
        if name == "school":
            self.name = attributes["name"]
            print(self.name)
        if name == "type":
            self.label = attributes["class"]
            print(self.label)

    # Called for character data; the parser may deliver it in several chunks,
    # so append instead of overwriting
    def characters(self, content):
        self.content += content

    # Called for each XML end tag
    def endElement(self, name):
        if name in ("Language", "Math", "English"):
            print(self.content.strip())

    # Called once when the document ends
    def endDocument(self):
        print("******配置文件解析结束******")  # "config file parsing finished"


if __name__ == "__main__":
    # Create an XMLReader
    parser = xml.sax.make_parser()
    # Turn off namespaces
    parser.setFeature(xml.sax.handler.feature_namespaces, 0)
    # Install the custom ContentHandler
    handler = ConfigHandler()
    parser.setContentHandler(handler)
    # Parse the XML; pass the path of the XML file (data_test_path, set earlier)
    parser.parse(data_test_path)

Results:

******解析配置文件开始******
第六中学
2年级
优秀
一般
优秀
5年级
优秀
一般
优秀
第九中学
1年级
优秀
一般
优秀
3年级
优秀
一般
优秀
******配置文件解析结束******
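
The printed output loses the information about which subject each grade belongs to. One possible way to keep it, sketched here under assumptions, is to build a nested dictionary of school, class and subject while parsing. SchoolHandler, SUBJECTS and the shortened inline sample are hypothetical; the tag and attribute names follow config3.xml.

```python
import xml.sax

# Subject tags used in config3.xml
SUBJECTS = {"Language", "Math", "English"}


class SchoolHandler(xml.sax.ContentHandler):
    """Hypothetical handler building {school: {class: {subject: grade}}}."""

    def __init__(self):
        super().__init__()
        self.result = {}
        self._school = None
        self._class = None
        self._chunks = []

    def startElement(self, name, attrs):
        self._chunks = []  # new element, new text buffer
        if name == "school":
            self._school = attrs["name"]
            self.result[self._school] = {}
        elif name == "type":
            self._class = attrs["class"]
            self.result[self._school][self._class] = {}

    def characters(self, content):
        self._chunks.append(content)

    def endElement(self, name):
        if name in SUBJECTS:
            grade = "".join(self._chunks).strip()
            self.result[self._school][self._class][name] = grade


# Shortened copy of config3.xml, kept inline so the sketch is self-contained
sample = (
    '<config_content>'
    '<school name="第六中学">'
    '<type class="2年级">'
    '<Language>优秀</Language><Math>一般</Math>'
    '</type>'
    '</school>'
    '</config_content>'
).encode("utf-8")

handler = SchoolHandler()
xml.sax.parseString(sample, handler)
print(handler.result)
```

The handler tracks the current school and class as state because SAX delivers events one at a time and never hands over a parent element.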