someone1
Member No. 1 of the AI Professional Developers Community
AI Developer Community • 5 months ago

Parsing XML Files in Python

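1. Introduction to XML

XML (eXtensible Markup Language) is a markup language designed for transporting and storing data. It has become a foundation for many newer technologies and is applied in different ways across many fields. It is a natural product of the web's evolution: it keeps the core features of SGML, retains the simplicity of HTML, and adds properties of its own, such as being explicit and well structured.

The test.xml file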

<?xml version="1.0" encoding="utf-8"?>
<catalog>
    <maxid>4</maxid>
    <login username="pytest" passwd='123456'>
        <caption>Python</caption>
        <item id="4">
            <caption>测试</caption>
        </item>
    </login>
    <item id="2">
        <caption>Zope</caption>
    </item>
</catalog>

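For a more detailed introduction to XML, see: http://www.w3school.com.cn/xmldom/dom_nodetype.asp

2. Parsing XML Files

There are three common ways to parse XML in Python. The first is the xml.dom.* modules, an implementation of the W3C DOM API; they are a good fit when you need to work with the DOM API. The second is the xml.sax.* modules, an implementation of the SAX API; they trade convenience for speed and low memory use. SAX is an event-based API, which means it can process very large documents "on the fly" without loading them fully into memory. The third is the xml.etree.ElementTree module (ET for short), which offers a lightweight, Pythonic API. ET is much faster than DOM and has a more pleasant API. Like SAX, ET can also process a document on the fly through ET.iterparse, so the whole document does not have to be loaded into memory. ET's average performance is roughly on par with SAX, but its API is higher level and more convenient to use.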
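As an illustration of the "on the fly" processing mentioned above, here is a minimal sketch of ET.iterparse. The helper name collect_item_ids and the in-memory XML_DATA are assumptions made for this example; with a real large file you would pass a file path or an open binary file instead.

```python
import io
import xml.etree.ElementTree as ET

# Sample document kept in memory for the example (same content as test.xml).
XML_DATA = """<?xml version="1.0" encoding="utf-8"?>
<catalog>
    <maxid>4</maxid>
    <login username="pytest" passwd="123456">
        <caption>Python</caption>
        <item id="4">
            <caption>测试</caption>
        </item>
    </login>
    <item id="2">
        <caption>Zope</caption>
    </item>
</catalog>
""".encode("utf-8")


def collect_item_ids(source):
    """Stream through the document and return the id of every <item>."""
    ids = []
    # An "end" event fires once an element and all its children are parsed.
    for event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "item":
            ids.append(elem.get("id"))
            # Drop the element's children, text, and attributes to free memory.
            elem.clear()
    return ids


item_ids = collect_item_ids(io.BytesIO(XML_DATA))
print(item_ids)  # ['4', '2'] (the nested item closes first)
```

Calling elem.clear() after each processed element is what keeps memory use low on large inputs; without it, iterparse still builds the full tree.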

2.1 xml.dom.*

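The Document Object Model (DOM) is the standard programming interface recommended by the W3C for processing XML. When a DOM parser parses an XML document, it reads the whole document in one pass and stores every element in an in-memory tree. You can then use the functions DOM provides to read or modify the document's content and structure, and write the modified content back to an XML file. In Python, xml.dom.minidom is used to parse XML files this way.

a. Get child tags

b. Distinguish tags that share the same tag name

c. Get a tag's attribute values

d. Get the data between a pair of tags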

# coding=utf-8
# Parse an XML file with minidom
import xml.dom.minidom as xmldom
import os

'''
XML file being read (test.xml):
<?xml version="1.0" encoding="utf-8"?>
<catalog>
    <maxid>4</maxid>
    <login username="pytest" passwd='123456'>
        <caption>Python</caption>
        <item id="4">
            <caption>测试</caption>
        </item>
    </login>
    <item id="2">
        <caption>Zope</caption>
    </item>
</catalog>
'''
xmlfilepath = os.path.abspath("test.xml")
print("xml file path:", xmlfilepath)
# Get the document object
domobj = xmldom.parse(xmlfilepath)
print("xmldom.parse:", type(domobj))
# Get the root element object
elementobj = domobj.documentElement
print("domobj.documentElement:", type(elementobj))
# Get child tags by name
subElementObj = elementobj.getElementsByTagName("login")
print("getElementsByTagName:", type(subElementObj))
print(len(subElementObj))
# Get attribute values
print("username:", subElementObj[0].getAttribute("username"))
print("passwd:", subElementObj[0].getAttribute("passwd"))
# Distinguish tags that share the same tag name by indexing the node list
subElementObj1 = elementobj.getElementsByTagName("caption")
for i in range(len(subElementObj1)):
    print("subElementObj1[i]:", type(subElementObj1[i]))
    print(subElementObj1[i].firstChild.data)  # data between the tag pair

Output:

>>> D:\Pystu>python xml_instance.py
>>> xml file path: D:\Pystu\test.xml
>>> xmldom.parse: <class 'xml.dom.minidom.Document'>
>>> domobj.documentElement: <class 'xml.dom.minidom.Element'>
>>> getElementsByTagName: <class 'xml.dom.minicompat.NodeList'>
>>> 1
>>> username: pytest
>>> passwd: 123456
>>> subElementObj1[i]: <class 'xml.dom.minidom.Element'>
>>> Python
>>> subElementObj1[i]: <class 'xml.dom.minidom.Element'>
>>> 测试
>>> subElementObj1[i]: <class 'xml.dom.minidom.Element'>
>>> Zope
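
The DOM description earlier in this section also mentions modifying a document and writing it back. A minimal sketch of that with minidom could look like the following; the shortened XML_TEXT and the new values "999999" and "5" are assumptions chosen for the example.

```python
import xml.dom.minidom as xmldom

# A shortened version of test.xml, kept in memory for the example.
XML_TEXT = """<catalog>
    <maxid>4</maxid>
    <login username="pytest" passwd="123456">
        <caption>Python</caption>
    </login>
</catalog>"""

dom = xmldom.parseString(XML_TEXT)

# Change an attribute value in the in-memory tree.
login = dom.getElementsByTagName("login")[0]
login.setAttribute("passwd", "999999")

# Replace the text between <maxid> and </maxid>.
maxid = dom.getElementsByTagName("maxid")[0]
maxid.firstChild.data = "5"

# Serialize the modified tree back to an XML string.
modified_xml = dom.toxml()
print(modified_xml)
```

To persist the change, the string returned by toxml() can be written to a file opened with encoding="utf-8".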

2.2 xml.etree.ElementTree

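ElementTree was built for handling XML. The Python standard library historically shipped two implementations: the pure-Python xml.etree.ElementTree and the faster, C-accelerated xml.etree.cElementTree. The C implementation is faster and uses less memory. Note: since Python 3.3, xml.etree.ElementTree uses the C accelerator automatically whenever it is available, and xml.etree.cElementTree was removed in Python 3.9, so on current Python versions importing xml.etree.ElementTree is all you need.

a. Iterate over the direct children of the root node

b. Access tags, attributes, and text by index

c. Find a given tag under root

d. Traverse the XML file

e. Modify the XML file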

# coding=utf-8
# Parse an XML file with ElementTree
# Since Python 3.3, xml.etree.ElementTree picks up the C accelerator
# automatically when it is available, so the old
# "try: import xml.etree.cElementTree" fallback is no longer needed.
import xml.etree.ElementTree as ET
import os
import sys

'''
XML file being read (test.xml):
<?xml version="1.0" encoding="utf-8"?>
<catalog>
    <maxid>4</maxid>
    <login username="pytest" passwd='123456'>
        <caption>Python</caption>
        <item id="4">
            <caption>测试</caption>
        </item>
    </login>
    <item id="2">
        <caption>Zope</caption>
    </item>
</catalog>
'''


# Recursively traverse the XML tree
def traverseXml(element):
    for child in element:
        print(child.tag, "----", child.attrib)
        traverseXml(child)


if __name__ == "__main__":
    xmlFilePath = os.path.abspath("test.xml")
    print(xmlFilePath)
    try:
        tree = ET.parse(xmlFilePath)
        print("tree type:", type(tree))
        # Get the root node
        root = tree.getroot()
    except Exception as e:  # catches everything except exit-related exceptions such as SystemExit
        print("parse test.xml fail!")
        sys.exit(1)
    print("root type:", type(root))
    print(root.tag, "----", root.attrib)

    # Iterate over the direct children of root
    for child in root:
        print("child of root:", child.tag, "----", child.attrib)

    # Access by index
    print(root[0].text)
    print(root[1][1][0].text)

    print(20 * "*")
    # Traverse the whole XML file
    traverseXml(root)
    print(20 * "*")

    # Find all tags with the given name under root
    itemList = root.findall("item")  # searches only the direct children of root
    print(len(itemList))
    for item in itemList:
        print(item.tag, "----", item.attrib, "----", item.text)

    # Modify the tree in memory: change passwd to 999999
    login = root.find("login")
    passwdValue = login.get("passwd")
    print("not modify passwd:", passwdValue)
    login.set("passwd", "999999")  # to change the text instead, assign to login.text
    print("modify passwd:", login.get("passwd"))

Output:

>>> D:\Pystu\test.xml
>>> tree type: <class 'xml.etree.ElementTree.ElementTree'>
>>> root type: <class 'xml.etree.ElementTree.Element'>
>>> catalog ---- {}
>>> child of root: maxid ---- {}
>>> child of root: login ---- {'username': 'pytest', 'passwd': '123456'}
>>> child of root: item ---- {'id': '2'}
>>> 4
>>> 测试
>>> ********************
>>> maxid ---- {}
>>> login ---- {'username': 'pytest', 'passwd': '123456'}
>>> caption ---- {}
>>> item ---- {'id': '4'}
>>> caption ---- {}
>>> item ---- {'id': '2'}
>>> caption ---- {}
>>> ********************
>>> 1
>>> item ---- {'id': '2'} ----
>>> not modify passwd: 123456
>>> modify passwd: 999999
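
The findall("item") call above reports only one element because it looks at the direct children of root and misses the item nested inside login. As a sketch of how nested tags can be reached, the example below uses the ".//" path prefix, Element.iter(), and an attribute predicate; the in-memory XML_DATA and the variable names are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# Same content as test.xml, kept in memory for the example.
XML_DATA = """<?xml version="1.0" encoding="utf-8"?>
<catalog>
    <maxid>4</maxid>
    <login username="pytest" passwd="123456">
        <caption>Python</caption>
        <item id="4">
            <caption>测试</caption>
        </item>
    </login>
    <item id="2">
        <caption>Zope</caption>
    </item>
</catalog>
""".encode("utf-8")

root = ET.fromstring(XML_DATA)

# Direct children only: finds the top-level <item id="2">.
direct_items = [item.get("id") for item in root.findall("item")]

# ".//" searches the whole subtree, in document order.
all_items = [item.get("id") for item in root.findall(".//item")]

# iter() walks every descendant with the given tag.
captions = [caption.text for caption in root.iter("caption")]

# An attribute predicate selects one specific element.
nested_caption = root.find(".//item[@id='4']/caption").text

print(direct_items)    # ['2']
print(all_items)       # ['4', '2']
print(captions)        # ['Python', '测试', 'Zope']
print(nested_caption)  # 测试
```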

Appendix:

# coding=utf-8
'''
XML parsing class
@purpose - add, delete, modify, and query nodes
'''
import xml.etree.ElementTree as ET
import sys
import os.path


class XmlParse:
    def __init__(self, file_path):
        self.tree = None
        self.root = None
        self.xml_file_path = file_path

    def ReadXml(self):
        try:
            print("xmlfile:", self.xml_file_path)
            self.tree = ET.parse(self.xml_file_path)
            self.root = self.tree.getroot()
        except Exception as e:
            print("parse xml failed!", e)
            sys.exit(1)
        else:
            print("parse xml success!")
        # Return outside the try statement: a return inside "finally"
        # would swallow the SystemExit raised by sys.exit()
        return self.tree

    def CreateNode(self, tag, attrib, text):
        element = ET.Element(tag, attrib)
        element.text = text
        print("tag:%s;attrib:%s;text:%s" % (tag, attrib, text))
        return element

    def AddNode(self, Parent, tag, attrib, text):
        element = self.CreateNode(tag, attrib, text)
        # Compare with None: an Element without children evaluates as false
        if Parent is not None:
            Parent.append(element)
            # Look up a child with the new tag to confirm the append
            el = Parent.find(tag)
            print(el.tag, "----", el.attrib, "----", el.text)
        else:
            print("parent is none")

    def WriteXml(self, destfile):
        dest_xml_file = os.path.abspath(destfile)
        self.tree.write(dest_xml_file, encoding="utf-8", xml_declaration=True)


if __name__ == "__main__":
    xml_file = os.path.abspath("test.xml")
    parse = XmlParse(xml_file)
    tree = parse.ReadXml()
    root = tree.getroot()
    print(root)
    parse.AddNode(root, "Python", {"age": "22", "hello": "world"}, "YES")
    parse.WriteXml("testtest.xml")
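
The class above reads, adds, and writes nodes, while its description also mentions deleting and modifying them. A minimal sketch of those two operations with plain ElementTree calls might look like this; the in-memory XML_TEXT and the new maxid value are assumptions for the example, not part of the original class.

```python
import xml.etree.ElementTree as ET

# ASCII-only sample in the shape of test.xml, kept in memory for the example.
XML_TEXT = """<catalog>
    <maxid>4</maxid>
    <login username="pytest" passwd="123456">
        <caption>Python</caption>
        <item id="4">
            <caption>test</caption>
        </item>
    </login>
    <item id="2">
        <caption>Zope</caption>
    </item>
</catalog>"""

root = ET.fromstring(XML_TEXT)

# Modify: assign new text to an existing node.
root.find("maxid").text = "5"

# Delete: remove() only works on a direct child, so locate the parent first.
login = root.find("login")
nested_item = login.find("item")
login.remove(nested_item)

# Serialize the result to a string to inspect it.
result = ET.tostring(root, encoding="unicode")
print(result)
```

Because Element.remove() needs the direct parent, a delete method on the class would have to take both the parent and the child, or search for the parent itself.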


2.3 xml.sax.*

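SAX is an event-driven API. Parsing XML with SAX involves two parts: a parser and an event handler.

The parser reads the XML document and sends events to the event handler, such as element-start and element-end events.

The event handler responds to those events and processes the XML data passed to it.

Typical use cases:

(1) Processing large files

(2) Needing only part of a file, or only specific information from it

(3) Building your own object model

More details on event-driven SAX parsing of XML will be added later!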
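
In the meantime, the parser and event handler split described above can be sketched with the standard library's xml.sax module. The handler class name CaptionHandler and the in-memory XML_DATA are assumptions for this example; it collects the text of every caption tag and the username attribute of login.

```python
import xml.sax

# Same content as test.xml, kept in memory for the example.
XML_DATA = """<?xml version="1.0" encoding="utf-8"?>
<catalog>
    <maxid>4</maxid>
    <login username="pytest" passwd="123456">
        <caption>Python</caption>
        <item id="4">
            <caption>测试</caption>
        </item>
    </login>
    <item id="2">
        <caption>Zope</caption>
    </item>
</catalog>
""".encode("utf-8")


class CaptionHandler(xml.sax.ContentHandler):
    """Event handler that records caption texts and the login username."""

    def __init__(self):
        super().__init__()
        self.captions = []
        self.login_user = None
        self._in_caption = False
        self._buffer = []

    def startElement(self, name, attrs):
        # Called by the parser for every opening tag.
        if name == "caption":
            self._in_caption = True
            self._buffer = []
        elif name == "login":
            self.login_user = attrs.getValue("username")

    def characters(self, content):
        # Text may arrive in several chunks, so collect it in a buffer.
        if self._in_caption:
            self._buffer.append(content)

    def endElement(self, name):
        # Called by the parser for every closing tag.
        if name == "caption":
            self.captions.append("".join(self._buffer))
            self._in_caption = False


handler = CaptionHandler()
xml.sax.parseString(XML_DATA, handler)
print(handler.login_user)  # pytest
print(handler.captions)    # ['Python', '测试', 'Zope']
```

For a file on disk, xml.sax.parse("test.xml", handler) drives the same handler without loading the whole document into memory.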
