Python: what is the lightest-weight way to extract the number from a string that contains a number in exactly one place?

Suppose we have a Python string s that contains a number in exactly one place.
So these are valid values of s:

'1yen'
'23dogs'
'there is 45 dogs'

These are not valid values of s:

'1 2'
'1 and 2 and 3 and 4'
'1or5'

I want to extract only the number from such an s. In this case, is a regular expression the lightest-weight approach, or is there an even lighter way?


4 answers

Best answer

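I can't say whether it is the fastest, but a regular expression is simple and should give you practical speed.

Patterns such as [0-9]+ or \d+ are easy to understand, but you need to pin down the detailed requirements first (there are many things to consider, such as full-width digits and kanji numerals).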
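
As a minimal sketch of why those details matter: in Python 3, `\d` in a `str` pattern also matches full-width digits, while `[0-9]` does not, and neither matches kanji numerals. The helper name `first_number` and the sample strings are assumptions made up for illustration; they are not part of the question.

```python
import re

ASCII_DIGITS = re.compile(r"[0-9]+")
UNICODE_DIGITS = re.compile(r"\d+")  # on str patterns, \d matches any Unicode decimal digit


def first_number(s, pattern=ASCII_DIGITS):
    """Return the first digit run in s as an int, or None if there is none."""
    m = pattern.search(s)
    return int(m.group()) if m else None


print(first_number("there is 45 dogs"))                   # 45
print(first_number("there is ４５ dogs"))                  # None: full-width digits are not in [0-9]
print(first_number("there is ４５ dogs", UNICODE_DIGITS))  # 45: int() accepts full-width digits
print(first_number("四十五 dogs", UNICODE_DIGITS))         # None: kanji numerals are not decimal digits
```

If only ASCII digits should count, `[0-9]+` (or `\d+` compiled with `re.ASCII`) keeps the behaviour explicit.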

Posted 2020/04/11 13:46

hayataka2049

Total score 30933

My previous answer misread the question, so I am reposting just the code.

take_s: splits the string into digit and non-digit groups with itertools.groupby.
take_s_re: uses a regular expression.

python
import re
import string
import itertools

# To also accept full-width digits:
# key=str.isdigit
#
# Half-width (ASCII) digits only:
# key=string.digits.__contains__


def take_s(s, key=string.digits.__contains__,
           groupby=itertools.groupby,
           result=None, strjoin="".join):
    """
    >>> take_s('1yen')
    1
    >>> take_s('23dogs')
    23
    >>> take_s('there is 45 dogs')
    45
    >>> take_s('zero 0')
    0

    >>> take_s('1 2')
    >>> take_s('1 and 2 and 3 and 4')
    >>> take_s('1or5')
    >>> take_s('zero 0 zero 0')
    """

    for isdigit, digits in groupby(s, key):
        if isdigit:  # a group of digits
            if result is not None:  # a second digit group was found: return None
                return
            else:
                result = int(strjoin(digits))
    return result


def take_s_re(s, match=re.compile(r"^\D*(\d+)\D*$").match):
    """
    >>> take_s_re('1yen')
    1
    >>> take_s_re('23dogs')
    23
    >>> take_s_re('there is 45 dogs')
    45
    >>> take_s_re('zero 0')
    0

    >>> take_s_re('1 2')
    >>> take_s_re('1 and 2 and 3 and 4')
    >>> take_s_re('1or5')
    >>> take_s_re('zero 0 zero 0')
    """
    # py3.8) if m := match(s):
    m = match(s)
    if m:
        return int(m.group(1))


if __name__ == "__main__":
    import doctest
    doctest.testmod()

Addendum: things I have not tried

JIT compilation with PyPy, Numba, and the like.
Regular expression vs. str.isdigit when full-width digits are also targeted.
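
Separately from speed, one point that may be worth checking before benchmarking the second item is that `str.isdigit` and `\d` do not accept exactly the same characters. The sketch below is only an illustration with made-up sample characters; `to_int_or_none` is a hypothetical helper, not part of the answer's code.

```python
import re


def to_int_or_none(text):
    """Convert text with int(), returning None when int() rejects it."""
    try:
        return int(text)
    except ValueError:
        return None


# ASCII five, full-width five, superscript two, kanji five
samples = ["5", "５", "²", "五"]

for ch in samples:
    print(
        repr(ch),
        ch.isdigit(),                       # True, True, True, False
        ch.isdecimal(),                     # True, True, False, False
        bool(re.fullmatch(r"\d", ch)),      # True, True, False, False
        to_int_or_none(ch),                 # 5, 5, None, None
    )
```

So with `key=str.isdigit`, a character such as `²` would be grouped as a digit but then rejected by `int()`; `str.isdecimal` lines up with what `\d` and `int()` accept for these samples.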

Posted 2020/04/11 21:42

Edited 2020/04/11 22:21

teamikl

Total score 8664

I don't know whether it is lightweight, but here is a version using a list comprehension.

python
>>> def digit(s):
...     return ''.join([c for c in s if c.isdigit()])
...
>>> digit('1yen')
'1'
>>> digit('23dogs')
'23'
>>> digit('there is 45 dogs')
'45'

A generator expression works too, but my impression is that the list comprehension is faster.

python
>>> def digit(s):
...     return ''.join(c for c in s if c.isdigit())
...
>>> digit('1yen')
'1'
>>> digit('23dogs')
'23'
>>> digit('there is 45 dogs')
'45'
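
To check that impression rather than guess, something like the following `timeit` sketch could be used. The function names `digit_list` and `digit_gen`, the sample string, and the repetition count are assumptions chosen for illustration, and the timings will vary by machine and Python version.

```python
import timeit


def digit_list(s):
    # List comprehension passed to join
    return "".join([c for c in s if c.isdigit()])


def digit_gen(s):
    # Generator expression passed to join
    return "".join(c for c in s if c.isdigit())


sample = "there is 45 dogs"
for func in (digit_list, digit_gen):
    seconds = timeit.timeit(lambda: func(sample), number=100_000)
    print(f"{func.__name__}: {seconds:.3f} s per 100,000 calls")

# Caveat: joining every digit also merges separate runs of digits.
print(digit_list("1or5"))  # 15
```

Note that both variants return a result even for strings with more than one number, so they only fit the case where the input is guaranteed to contain a single run of digits.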

Posted 2020/04/11 19:15

Edited 2020/04/11 19:19

shiracamus

Total score 5406

There are ways to do this without regular expressions, but I don't know whether they are lighter.

Python
def match(s):
    a = -1              # start index of the digit run (-1: not found yet)
    b = n = len(s)      # end index of the digit run (n: not closed yet)
    for i in range(n):
        if '0' <= s[i] <= '9':
            if a < 0:
                a = i
            elif b < n:
                # a second digit run starts after the first one ended
                return None
        elif a >= 0 and b == n:
            b = i
    return None if a < 0 else s[a:b]

a = [ '1yen', '23dogs', 'there is 45 dogs', '1 2',
    '1 and 2 and 3 and 4', '1or5' ]
r = [ match(s) for s in a ]
print(a)
print(r)

With a regular expression:

Python
import re

class Dig:
    pat = re.compile(r"^\D*(\d+)\D*$")

    @staticmethod
    def match(s):
        mat = Dig.pat.match(s)
        return mat.group(1) if mat else None

a = [ '1yen', '23dogs', 'there is 45 dogs', '1 2',
    '1 and 2 and 3 and 4', '1or5' ]
r = [ Dig.match(s) for s in a ]
print(a)
print(r)

Please measure for yourself which one is lighter (faster?).
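
One way to take that measurement is the standard `timeit` module. The sketch below restates the two approaches under the hypothetical names `scan_match` and `regex_match`; the sample list and repetition count are also assumptions, and the numbers it prints depend on the machine and interpreter.

```python
import re
import timeit

PATTERN = re.compile(r"^\D*(\d+)\D*$")

SAMPLES = ['1yen', '23dogs', 'there is 45 dogs', '1 2',
           '1 and 2 and 3 and 4', '1or5']


def regex_match(s):
    # Regular-expression version: exactly one digit run, else None
    m = PATTERN.match(s)
    return m.group(1) if m else None


def scan_match(s):
    # Manual scan version (ASCII digits only): exactly one digit run, else None
    start = -1
    end = n = len(s)
    for i, ch in enumerate(s):
        if "0" <= ch <= "9":
            if start < 0:
                start = i
            elif end < n:
                return None  # a second digit run was found
        elif start >= 0 and end == n:
            end = i
    return None if start < 0 else s[start:end]


for func in (scan_match, regex_match):
    seconds = timeit.timeit(lambda: [func(s) for s in SAMPLES], number=20_000)
    print(f"{func.__name__}: {seconds:.3f} s for 20,000 passes over SAMPLES")
```

Results on short strings like these may not carry over to long inputs, so it is worth timing with data that resembles the real workload.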

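Addendum
I had misunderstood the question as "if the given string contains exactly one run of digits, return that run;
if it contains no run of digits, or two or more, return nothing."

So the given string is guaranteed to contain exactly one run of digits.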

Python
import re

pat = re.compile(r"\D")

a = [ '1yen', '23dogs', 'there is 45 dogs' ]
r = [ pat.sub("", s) for s in a ]
print(a)
print(r)

Without a regular expression, by finding where the run of digits starts and ends:

Python
def match(s):
    a = -1              # start index of the digit run
    b = n = len(s)      # end index of the digit run
    for i in range(n):
        if s[i].isdigit():
            if a < 0:
                a = i
        elif a >= 0:
            # first non-digit after the run: stop scanning
            b = i
            break
    return s[a:b]

a = [ '1yen', '23dogs', 'there is 45 dogs' ]
r = [ match(s) for s in a ]
print(a)
print(r)
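
Under the same guarantee, another option is `re.search`, which returns the first run of digits it finds. This is only a sketch of an alternative, not the code from the answer; `first_digits` is a hypothetical name. Unlike the loop above, which returns the last character of the string when no digit is present (because `s[-1:n]` is evaluated), this version returns None in that case.

```python
import re

DIGITS = re.compile(r"\d+")


def first_digits(s):
    """Return the first run of digits in s as a string, or None if there is none."""
    m = DIGITS.search(s)
    return m.group() if m else None


a = ['1yen', '23dogs', 'there is 45 dogs']
print([first_digits(s) for s in a])  # ['1', '23', '45']
print(first_digits('no digits'))     # None
```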