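Parsing XML Files with Python
1. Introduction to XML
XML (eXtensible Markup Language) is a markup language designed for transporting and storing data. It has become central to many newer technologies and is used across many different fields. It emerged naturally as the web matured: it keeps the core features of SGML, retains the simplicity of HTML, and adds properties of its own, such as being explicit and well-structured.
The test.xml file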
<?xml version="1.0" encoding="utf-8"?>
<catalog>
    <maxid>4</maxid>
    <login username="pytest" passwd='123456'>
        <caption>Python</caption>
        <item id="4">
            <caption>测试</caption>
        </item>
    </login>
    <item id="2">
        <caption>Zope</caption>
    </item>
</catalog>
For a detailed introduction to XML, see: http://www.w3school.com.cn/xmldom/dom_nodetype.asp
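2. Parsing XML files
There are three common ways to parse XML in Python. The first is the xml.dom.* modules, an implementation of the W3C DOM API; they are a good fit if you need to work with the DOM API. The second is the xml.sax.* modules, an implementation of the SAX API; they trade convenience for speed and low memory use. SAX is an event-based API, which means it can process very large documents "on the fly" without loading them fully into memory. The third is the xml.etree.ElementTree module (ET for short), which provides a lightweight, Pythonic API. ET is much faster than DOM and has a pleasant API; compared with SAX, ET's ET.iterparse also offers "on the fly" processing, so the whole document does not have to be loaded into memory. ET's average performance is about the same as SAX, but its API is higher-level and more convenient to use.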
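The "on the fly" processing that ET.iterparse offers can be sketched as follows. This is only a minimal illustration under some assumptions: the XML is held in an in-memory bytes object (SAMPLE_XML, a simplified ASCII copy of test.xml) instead of a large file on disk, and the helper name iter_captions is hypothetical.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical sample data modelled on test.xml; real code would pass a file path.
SAMPLE_XML = b"""<?xml version="1.0" encoding="utf-8"?>
<catalog>
  <maxid>4</maxid>
  <login username="pytest" passwd="123456">
    <caption>Python</caption>
    <item id="4"><caption>Test</caption></item>
  </login>
  <item id="2"><caption>Zope</caption></item>
</catalog>"""


def iter_captions(source):
    """Yield the text of each <caption> element once it has been fully parsed."""
    # "end" events fire when an element's closing tag has been read.
    for event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "caption":
            yield elem.text
            elem.clear()  # drop the element's content to keep memory use low


captions = list(iter_captions(io.BytesIO(SAMPLE_XML)))
print(captions)
```

Because each element is handled and cleared as soon as its end tag is seen, the captions are processed without keeping their content around, which is the property that matters for very large inputs.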
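2.1 xml.dom.*
The Document Object Model (DOM) is the standard programming interface recommended by the W3C for processing XML. When a DOM parser parses an XML document, it reads the whole document in one pass and stores all of its elements in a tree structure in memory. You can then use the functions DOM provides to read or modify the content and structure of the document, and write the modified content back to an XML file. In Python, xml.dom.minidom is used to parse XML files.
a. Get child tags
b. Distinguish tags that share the same tag name
c. Get a tag's attribute values
d. Get the data between a pair of tags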
# coding=utf-8
# Parse an XML file with minidom
import xml.dom.minidom as xmldom
import os
'''
XML file being read
<?xml version="1.0" encoding="utf-8"?>
<catalog>
    <maxid>4</maxid>
    <login username="pytest" passwd='123456'>dasdas
        <caption>Python</caption>
        <item id="4">
            <caption> 测试 </caption>
        </item>
    </login>
    <item id="2">
        <caption>Zope</caption>
    </item>
</catalog>
'''
xmlfilepath = os.path.abspath("test.xml")
print("xml file path:", xmlfilepath)

# Get the document object
domobj = xmldom.parse(xmlfilepath)
print("xmldom.parse:", type(domobj))
# Get the root element object
elementobj = domobj.documentElement
print("domobj.documentElement:", type(elementobj))

# Get child tags
subElementObj = elementobj.getElementsByTagName("login")
print("getElementsByTagName:", type(subElementObj))
print(len(subElementObj))
# Get the tag's attribute values
print("username:", subElementObj[0].getAttribute("username"))
print("passwd:", subElementObj[0].getAttribute("passwd"))

# Distinguish tags that share the same tag name
subElementObj1 = elementobj.getElementsByTagName("caption")
for i in range(len(subElementObj1)):
    print("subElementObj1[i]:", type(subElementObj1[i]))
    print(subElementObj1[i].firstChild.data)  # show the data between the tag pair
Output:
>>> D:\Pystu>python xml_instance.py
>>> xml file path: D:\Pystu\test.xml
>>> xmldom.parse: <class 'xml.dom.minidom.Document'>
>>> domobj.documentElement: <class 'xml.dom.minidom.Element'>
>>> getElementsByTagName: <class 'xml.dom.minicompat.NodeList'>
>>> username: pytest
>>> passwd: 123456
>>> subElementObj1[i]: <class 'xml.dom.minidom.Element'>
>>> Python
>>> subElementObj1[i]: <class 'xml.dom.minidom.Element'>
>>> 测试
>>> subElementObj1[i]: <class 'xml.dom.minidom.Element'>
>>> Zope
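The DOM description in this section also mentions modifying a document and writing it back, which the listing does not show. A minimal sketch of that step with minidom might look like this; the sample string, the change_passwd helper, and the added <note> element are all hypothetical and serve only as illustration.

```python
import xml.dom.minidom as xmldom

# Hypothetical sample data: a reduced version of test.xml kept in a string.
SAMPLE_XML = (
    '<catalog><login username="pytest" passwd="123456">'
    "<caption>Python</caption></login></catalog>"
)


def change_passwd(xml_text, new_passwd):
    """Return the XML text with the passwd attribute of the first <login> replaced."""
    dom = xmldom.parseString(xml_text)
    login = dom.documentElement.getElementsByTagName("login")[0]
    # Modify an attribute in place
    login.setAttribute("passwd", new_passwd)
    # Add a new child element that holds a text node
    note = dom.createElement("note")
    note.appendChild(dom.createTextNode("changed"))
    login.appendChild(note)
    # Serialize the whole tree back to a string
    return dom.toxml()


updated = change_passwd(SAMPLE_XML, "999999")
print(updated)
```

To write the result to a file instead of returning a string, the document object's writexml method can be given an open file object.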
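2.2 xml.etree.ElementTree
ElementTree was built for processing XML. Historically the Python standard library shipped two implementations: a pure-Python one, xml.etree.ElementTree, and a faster C one, xml.etree.cElementTree. Note: the C implementation is faster and uses less memory, but xml.etree.cElementTree has been deprecated since Python 3.3 and was removed in Python 3.9. From Python 3.3 on, xml.etree.ElementTree uses the C accelerator automatically whenever it is available, so importing xml.etree.ElementTree is enough; the explicit cElementTree import only matters on older versions.
a. Iterate over the level directly below the root node
b. Access tags, attributes, and text by index
c. Find specified tags under root
d. Traverse the XML file
e. Modify the XML file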
# coding=utf-8
# Parse an XML file with ElementTree
'''
try:
    import xml.etree.cElementTree as ET
except ImportError:
    import xml.etree.ElementTree as ET
Since Python 3.3 the ElementTree module automatically looks for an available C library to speed things up
'''
import xml.etree.ElementTree as ET
import os
import sys
'''
XML file being read
<?xml version="1.0" encoding="utf-8"?>
<catalog>
    <maxid>4</maxid>
    <login username="pytest" passwd='123456'>dasdas
        <caption>Python</caption>
        <item id="4">
            <caption> 测试 </caption>
        </item>
    </login>
    <item id="2">
        <caption>Zope</caption>
    </item>
</catalog>
'''

# Traverse the XML file recursively
def traverseXml(element):
    #print (len(element))
    if len(element) > 0:
        for child in element:
            print(child.tag, "----", child.attrib)
            traverseXml(child)
    #else:
        #print (element.tag, "----", element.attrib)

if __name__ == "__main__":
    xmlFilePath = os.path.abspath("test.xml")
    print(xmlFilePath)
    try:
        tree = ET.parse(xmlFilePath)
        print("tree type:", type(tree))

        # Get the root node
        root = tree.getroot()
    except Exception as e:  # catches every exception except those tied to program exit, such as sys.exit()
        print("parse test.xml fail!")
        sys.exit()
    print("root type:", type(root))
    print(root.tag, "----", root.attrib)

    # Iterate over the level directly below root
    for child in root:
        print("level below root:", child.tag, "----", child.attrib)

    # Access by index
    print(root[0].text)
    print(root[1][1][0].text)

    print(20 * "*")
    # Traverse the XML file
    traverseXml(root)
    print(20 * "*")

    # Find all tags under root by tag name
    captionList = root.findall("item")  # searches only the direct children of the current element
    print(len(captionList))
    for caption in captionList:
        print(caption.tag, "----", caption.attrib, "----", caption.text)

    # Modify the XML: change passwd to 999999
    login = root.find("login")
    passwdValue = login.get("passwd")
    print("not modify passwd:", passwdValue)
    login.set("passwd", "999999")  # modify; to change the text instead, assign to login.text
    print("modify passwd:", login.get("passwd"))
Output:
>>> D:\Pystu\test.xml
>>> tree type: <class 'xml.etree.ElementTree.ElementTree'>
>>> root type: <class 'xml.etree.ElementTree.Element'>
>>> catalog ---- {}
>>> level below root: maxid ---- {}
>>> level below root: login ---- {'username': 'pytest', 'passwd': '123456'}
>>> level below root: item ---- {'id': '2'}
>>> 4
>>> 测试
>>> ********************
>>> maxid ---- {}
>>> login ---- {'username': 'pytest', 'passwd': '123456'}
>>> caption ---- {}
>>> item ---- {'id': '4'}
>>> caption ---- {}
>>> item ---- {'id': '2'}
>>> caption ---- {}
>>> ********************
>>> 1
>>> item ---- {'id': '2'} ----
>>> not modify passwd: 123456
>>> modify passwd: 999999
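As the output shows, findall("item") on the root returns only one element, because it looks at direct children only. A short sketch of how nested elements can be reached with ElementTree's limited XPath support and with iter() follows. The sample string is a simplified ASCII copy of test.xml and is an assumption made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data modelled on test.xml.
SAMPLE_XML = """<catalog>
  <maxid>4</maxid>
  <login username="pytest" passwd="123456">
    <caption>Python</caption>
    <item id="4"><caption>Test</caption></item>
  </login>
  <item id="2"><caption>Zope</caption></item>
</catalog>"""

root = ET.fromstring(SAMPLE_XML)

# A plain tag name matches direct children of root only.
direct_items = [item.get("id") for item in root.findall("item")]

# The ".//" prefix matches descendants at any depth, in document order.
all_items = [item.get("id") for item in root.findall(".//item")]

# iter() walks the whole subtree and filters by tag name.
all_captions = [caption.text for caption in root.iter("caption")]

# An attribute predicate selects one specific element.
item4 = root.find(".//item[@id='4']")

print(direct_items)   # ['2']
print(all_items)      # ['4', '2']
print(all_captions)   # ['Python', 'Test', 'Zope']
print(item4.find("caption").text)  # Test
```

Which form to use depends on intent: a plain tag name keeps the search local to one level, while ".//" and iter() are the usual choices when the depth of the target is unknown.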
Appendix:
# coding=utf-8
'''
XML parsing class
@function - add, delete, modify, and query nodes
'''
import xml.etree.ElementTree as ET
import sys
import os.path

class XmlParse:
    def __init__(self, file_path):
        self.tree = None
        self.root = None
        self.xml_file_path = file_path

    def ReadXml(self):
        try:
            print("xmlfile:", self.xml_file_path)
            self.tree = ET.parse(self.xml_file_path)
            self.root = self.tree.getroot()
        except Exception as e:
            print("parse xml failed!")
            sys.exit()
        else:
            print("parse xml success!")
        # Return outside of a finally clause: a return inside finally
        # would swallow the SystemExit raised by sys.exit() above.
        return self.tree

    def CreateNode(self, tag, attrib, text):
        element = ET.Element(tag, attrib)
        element.text = text
        print("tag:%s;attrib:%s;text:%s" % (tag, attrib, text))
        return element

    def AddNode(self, Parent, tag, attrib, text):
        element = self.CreateNode(tag, attrib, text)
        # Compare with None explicitly: an Element without children is falsy.
        if Parent is not None:
            Parent.append(element)
            # Look up the first child with this tag to confirm it is in the tree
            # (the original searched for a hard-coded "lizhi" tag, which test.xml does not contain).
            el = Parent.find(tag)
            print(el.tag, "----", el.attrib, "----", el.text)
        else:
            print("parent is none")

    def WriteXml(self, destfile):
        dest_xml_file = os.path.abspath(destfile)
        self.tree.write(dest_xml_file, encoding="utf-8", xml_declaration=True)

if __name__ == "__main__":
    xml_file = os.path.abspath("test.xml")
    parse = XmlParse(xml_file)
    tree = parse.ReadXml()
    root = tree.getroot()
    print(root)
    parse.AddNode(root, "Python", {"age": "22", "hello": "world"}, "YES")

    parse.WriteXml("testtest.xml")
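The appendix class is described as covering adding, deleting, modifying, and querying nodes, but only adding is shown. Deleting and modifying could be sketched with plain ElementTree calls as follows. The helper names remove_children and set_child_text and the sample string are hypothetical and are not part of the original class.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data, written without whitespace so the result is easy to read.
SAMPLE_XML = (
    "<catalog><maxid>4</maxid>"
    '<item id="4"><caption>Test</caption></item>'
    '<item id="2"><caption>Zope</caption></item></catalog>'
)


def remove_children(parent, tag):
    """Remove every direct child of parent with the given tag; return the count."""
    children = parent.findall(tag)  # take a snapshot, do not mutate while iterating
    for child in children:
        parent.remove(child)
    return len(children)


def set_child_text(parent, tag, text):
    """Set the text of the first direct child with the given tag; return True if found."""
    child = parent.find(tag)
    if child is None:
        return False
    child.text = text
    return True


root = ET.fromstring(SAMPLE_XML)
removed = remove_children(root, "item")
changed = set_child_text(root, "maxid", "5")
result = ET.tostring(root, encoding="unicode")
print(removed, changed, result)
```

Note that Element.remove only works on direct children, so a nested node has to be removed through its own parent.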
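2.3 xml.sax.*
SAX is an event-driven API. Parsing XML with SAX involves two parts: the parser and the event handler.
The parser reads the XML document and sends events to the event handler, such as element-start and element-end events.
The event handler responds to those events and processes the XML data passed to it.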
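The split between parser and event handler can be illustrated with a small sketch. It assumes the goal is to collect the text of every <caption> element; the CaptionHandler class and the in-memory sample (a simplified ASCII copy of test.xml) are hypothetical and only meant to show the shape of the API.

```python
import xml.sax

# Hypothetical sample data modelled on test.xml.
SAMPLE_XML = b"""<catalog>
  <maxid>4</maxid>
  <login username="pytest" passwd="123456">
    <caption>Python</caption>
    <item id="4"><caption>Test</caption></item>
  </login>
  <item id="2"><caption>Zope</caption></item>
</catalog>"""


class CaptionHandler(xml.sax.ContentHandler):
    """Event handler that collects the text of every <caption> element."""

    def __init__(self):
        super().__init__()
        self.captions = []
        self._in_caption = False
        self._buffer = []

    def startElement(self, name, attrs):
        # Element-start event sent by the parser
        if name == "caption":
            self._in_caption = True
            self._buffer = []

    def characters(self, content):
        # May be called several times for one text node, so accumulate
        if self._in_caption:
            self._buffer.append(content)

    def endElement(self, name):
        # Element-end event: the caption text is complete at this point
        if name == "caption":
            self.captions.append("".join(self._buffer))
            self._in_caption = False


handler = CaptionHandler()
xml.sax.parseString(SAMPLE_XML, handler)  # the parser drives the handler
print(handler.captions)
```

For a file on disk, xml.sax.parse accepts a filename or file object in place of the in-memory bytes, and the handler stays the same.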
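Common use cases:
(1) Processing large files
(2) Needing only part of a file's content, or only specific information from it
(3) Building your own object model
A fuller treatment of event-driven SAX parsing of XML content will be added later!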