Background / what I want to achieve

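(I have completely rewritten this question based on the 6/21 comments.) I cannot extract and output just one specific element from the XML file shown below, in this case only the value of material inside Attributes (soil). The real file contains records for many ids in the same format and runs to tens of thousands of lines. Every record has an id, but the material entry inside Attributes may or may not be present. When the entry exists I want to output its value as is; when it does not, I want to either skip the record or output missing, together with the id, tab-separated (the final goal is shown below).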

[output file]
SAMD00000001^ soil
SAMD00000002^ missing
SAMD00000003^ water
… and so on.

Problem / error message

As shown below, every value under Attributes is printed. I also cannot output that information together with the id.

Commands in question

xmllint --xpath "/BioSample/Ids/Id/text()" testbiosample.xml
# Result: the id SAMD00000001 is extracted
xmllint --xpath "/BioSample/Attributes/Attribute/text()" testbiosample.xml
# Result: every value under Attributes is printed, as follows
DOA9
grassland
N/A
grassland
Thailand
N/A
soil
Genetic diversity of Bradyrhizobium strains isolated from root nodules of Aeschynomene americana
22752179
2
22752179
symbiont
Aeschynomene americana
PRJDB1640
BDOA9

testbiosample.xml

<BioSample access="public" publication_date="2014-04-07T00:00:00+09:00" last_update="2014-09-25T09:58:01+09:00">
                <Ids>
                        <Id is_primary="1" namespace="BioSample">SAMD00000001</Id>
                </Ids>
                <Description>
                        <SampleName>Bradyrhizobium sp. DOA9</SampleName>
                        <Title>MIGS Cultured Bacterial/Archaeal sample from Bradyrhizobium sp. DOA9</Title>
                        <Organism taxonomy_id="1126627">
                                <OrganismName>Bradyrhizobium sp. DOA9</OrganismName>
                        </Organism>
                </Description>
                <Owner>
                        <Name url="-----">Tokyo University of Agriculture and Technology</Name>
                </Owner>
                <Models>
                        <Model>MIGS.ba</Model>
                </Models>
                <Attributes>
                        <Attribute attribute_name="strain">DOA9</Attribute>
                        <Attribute attribute_name="biome">grassland</Attribute>
                        <Attribute attribute_name="collection_date">N/A</Attribute>
                        <Attribute attribute_name="feature">grassland</Attribute>
                        <Attribute attribute_name="geo_loc_name">Thailand</Attribute>
                        <Attribute attribute_name="lat_lon">N/A</Attribute>
                        <Attribute attribute_name="material">soil</Attribute>
                        <Attribute attribute_name="project_name">Genetic diversity of Bradyrhizobium strains isolated from root nodules of Aeschynomene americana</Attribute>
                        <Attribute attribute_name="isol_growth_condt">22752179</Attribute>
                        <Attribute attribute_name="num_replicons">2</Attribute>
                        <Attribute attribute_name="ref_biomaterial">22752179</Attribute>
                        <Attribute attribute_name="biotic_relationship">symbiont</Attribute>
                        <Attribute attribute_name="specific_host">Aeschynomene americana</Attribute>
                        <Attribute attribute_name="bioproject_id">PRJDB1640</Attribute>
                        <Attribute attribute_name="locus_tag_prefix">BDOA9</Attribute>
                </Attributes>
                <Links>
                        <Link label="pubmed" type="db_xref">22752179</Link>
                </Links>
        </BioSample>

# In the real file, the records for the next ids follow below.


1 answer

Best answer

First, that XML file cannot be processed because it is not well-formed.

The closing tag that matches the root tag <BioSampleSet> is missing.

xmllint is, first and foremost, a tool for checking the format of XML data.

xmllint sample.xml

Run it like this. If the XML data is printed without problems, all is well, but with the data as posted you should get an error message. Fix that first.
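If xmllint is not at hand, the same kind of well-formedness check can be sketched with Python's standard library. This is only an illustration: the helper name `is_well_formed` and the two tiny inline documents are made up for this example and are not part of the posted data.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return (True, None) if xml_text parses, else (False, parser message)."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as err:
        return False, str(err)
    return True, None


# Hypothetical excerpt whose root element is never closed,
# similar in spirit to the problem described above.
broken = "<BioSampleSet><BioSample><Ids><Id>id1</Id></Ids></BioSample>"
# The same text with the missing closing tag restored.
fixed = broken + "</BioSampleSet>"

broken_ok, broken_error = is_well_formed(broken)
fixed_ok, fixed_error = is_well_formed(fixed)

print("broken:", broken_ok, broken_error)
print("fixed: ", fixed_ok, fixed_error)
```

The parser message for the broken text includes a line and column, which helps locate where an excerpt was cut off.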

Working out the XPath

For convenience, let's work with the following test data, whose contents have been simplified.

xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- test200622a.xml -->
<BioSampleSet>
  <BioSample> <!-- the ideal node -->
    <Ids>
      <Id>id1</Id>
    </Ids>
    <Attributes>
      <Attribute attribute_name="strain">DOA9</Attribute>
      <Attribute attribute_name="material">soil</Attribute>
    </Attributes>
  </BioSample>
  <BioSample> <!-- same id as above, but no material -->
    <Ids>
      <Id>id1</Id>
    </Ids>
    <Attributes>
      <Attribute attribute_name="strain">XXXX</Attribute>
    </Attributes>
  </BioSample>
  <BioSample> <!-- different id from above, with material -->
    <Ids>
      <Id>id2</Id>
    </Ids>
    <Attributes>
      <Attribute attribute_name="strain">YYYY</Attribute>
      <Attribute attribute_name="material">soil</Attribute>
    </Attributes>
  </BioSample>
</BioSampleSet>

My environment has the xpath tool that ships with Perl's XML::XPath module, so I will use that, but the XPath syntax is the same in xmllint.

Here, let's start by looking for the nodes with Id = "id1".

% xpath -e '//Id[text()="id1"]' test200622.xml                                                                  
Found 2 nodes in test200622.xml:
-- NODE --
<Id>id1</Id>
-- NODE --
<Id>id1</Id>

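Next, check how the Attribute node we ultimately want to reach is positioned relative to the starting point, Id. The common ancestor of the two nodes is BioSample, which is two generations above Id. So we write an XPath that accesses the node two levels up from Id.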

xpath -e '//Id[text()="id1"]/../..' test200622.xml
Found 2 nodes in test200622.xml:
-- NODE --
<BioSample> <!-- the ideal node -->
    <Ids>
      <Id>id1</Id>
    </Ids>
    <Attributes>
      <Attribute attribute_name="strain">DOA9</Attribute>
      <Attribute attribute_name="material">soil</Attribute>
    </Attributes>
  </BioSample>
-- NODE --
<BioSample> <!-- same id as above, but no material -->
    <Ids>
      <Id>id1</Id>
    </Ids>
    <Attributes>
      <Attribute attribute_name="strain">XXXX</Attribute>
    </Attributes>
  </BioSample>

Then add a step that descends to Attribute. We only care about the Attribute whose attribute_name is "material", so add a predicate for that.

xpath -e '//Id[text()="id1"]/../..//Attribute[@attribute_name="material"]' test200622.xml
Found 1 nodes in test200622.xml:
-- NODE --
<Attribute attribute_name="material">soil</Attribute>

We only need the content of the tag, so:

xpath -e '//Id[text()="id1"]/../..//Attribute[@attribute_name="material"]/text()' test200622.xml
Found 1 nodes in test200622.xml:
-- NODE --
soil
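For readers who want to try the same three steps without the xpath tool, here is a rough equivalent using Python's `xml.etree.ElementTree`. It is a sketch under some assumptions: ElementTree implements only a subset of XPath, so `text()` is not available and the Id test is written as `[.='id1']` (supported since Python 3.7), and the test data is inlined as a string rather than read from test200622.xml.

```python
import xml.etree.ElementTree as ET

# Inline copy of the simplified test data (comments omitted).
XML_TEXT = """\
<BioSampleSet>
  <BioSample>
    <Ids><Id>id1</Id></Ids>
    <Attributes>
      <Attribute attribute_name="strain">DOA9</Attribute>
      <Attribute attribute_name="material">soil</Attribute>
    </Attributes>
  </BioSample>
  <BioSample>
    <Ids><Id>id1</Id></Ids>
    <Attributes>
      <Attribute attribute_name="strain">XXXX</Attribute>
    </Attributes>
  </BioSample>
  <BioSample>
    <Ids><Id>id2</Id></Ids>
    <Attributes>
      <Attribute attribute_name="strain">YYYY</Attribute>
      <Attribute attribute_name="material">soil</Attribute>
    </Attributes>
  </BioSample>
</BioSampleSet>
"""

root = ET.fromstring(XML_TEXT)

# Step 1: every Id element whose text is "id1".
ids = root.findall(".//Id[.='id1']")

# Step 2: climb two levels (Id -> Ids -> BioSample).
samples = root.findall(".//Id[.='id1']/../..")

# Step 3: descend to the material Attribute and keep only its text.
materials = [
    attr.text
    for sample in samples
    for attr in sample.findall(".//Attribute[@attribute_name='material']")
]

print(len(ids), len(samples), materials)
```

Two Id nodes and two BioSample nodes match, but only one of them carries a material attribute, which mirrors the node counts reported by the xpath tool above.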

To process multiple IDs and print the output so that each result can be matched to its ID, you will probably need to build some mechanism outside XPath.
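One possible shape for such a mechanism is sketched below in Python. It is not the only way to do it, and several things are assumptions made for illustration: the function name `material_rows`, the choice to take the first Id under Ids as the record's id, and the inline test data. A record without a material attribute (or with an empty one) is reported as missing, as the question asks.

```python
import xml.etree.ElementTree as ET

# Inline copy of the simplified test data (comments omitted).
XML_TEXT = """\
<BioSampleSet>
  <BioSample>
    <Ids><Id>id1</Id></Ids>
    <Attributes>
      <Attribute attribute_name="strain">DOA9</Attribute>
      <Attribute attribute_name="material">soil</Attribute>
    </Attributes>
  </BioSample>
  <BioSample>
    <Ids><Id>id1</Id></Ids>
    <Attributes>
      <Attribute attribute_name="strain">XXXX</Attribute>
    </Attributes>
  </BioSample>
  <BioSample>
    <Ids><Id>id2</Id></Ids>
    <Attributes>
      <Attribute attribute_name="strain">YYYY</Attribute>
      <Attribute attribute_name="material">soil</Attribute>
    </Attributes>
  </BioSample>
</BioSampleSet>
"""


def material_rows(root, missing="missing"):
    """Return one 'id<TAB>material' string per BioSample element."""
    rows = []
    # iter() also yields root itself when the root is a lone BioSample.
    for sample in root.iter("BioSample"):
        # Assumption: the first Id under Ids identifies the record.
        sample_id = sample.findtext("Ids/Id", default="").strip()
        material = sample.findtext(
            "Attributes/Attribute[@attribute_name='material']"
        )
        # Absent or empty material is reported with the placeholder.
        value = material.strip() if material and material.strip() else missing
        rows.append(f"{sample_id}\t{value}")
    return rows


rows = material_rows(ET.fromstring(XML_TEXT))
print("\n".join(rows))
```

For a file on disk, `ET.parse("testbiosample.xml").getroot()` could be passed in place of the inline string. For very large inputs, a streaming parser such as `ET.iterparse` may be a better fit, since it avoids holding the whole tree in memory.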

Posted 2020/06/20 12:43

Edited 2020/06/21 19:38

KojiDoi

Total score 13727

kakuko

2020/06/21 05:06 (edited)

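Thank you again. I was originally working with the biosample data stored at ftp://ftp.ncbi.nlm.nih.gov/biosample/ but because the full set is too large to compute on or to show above, I had posted an excerpt. I think that is how errors such as the missing closing tag crept in. I apologize. As for the data format, the "Example of original XML file" at https://github.com/dbcls/bh14/wiki/BioSample has the closing tags in place. I have started practicing with that, but it is still not working. Even though all I need is to extract the ID and the contents of isolation-source, I am finding it difficult because I still have a lot to learn.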

kakuko

2020/06/21 11:23

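Thank you for yesterday. Since then I have gone through the materials you pointed me to and can now extract several child elements. Based on where I am now, I have corrected the inaccurate parts of the question and organized the remaining problems. I am getting closer to what I want to achieve, and I would be grateful for any further comments.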

kakuko

2020/06/22 12:40

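Thank you for the careful explanation. I am now running the same steps on the test XML, and I understand how the commands correspond to the element structure. I will mark this as the best answer and keep working toward my goal.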
