What I want to do

AA1111
BB222CC
DDD250E

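From this data, I want to insert separators at the boundaries between digits and letters, as in:

<Column 1>　<Column 2>　<Column 3>　<Column 4> <Column 5>　<Column 6>　<Column 7>　<Column 8> <Column 9>　<Column 10>・・・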
AA1111　AA●1111　AA■1111　AA 11・・・
BB222CC　BB●222●CC　BB●222CC　BB222●CC　BB■222■CC　BB■222CC　BB222■CC　BB 222 CC　BB 222CC　BB222 CC・・・
DDD250E　DDD●250●E　DDD●250E　DDD250●E　DDD■250■E　DDD■250E　DDD250■E　DDD 250 E　DDD 250E　DDD250 E・・・

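That is, after inserting the separators, I want to collect the resulting combinations in a df and output it, where the number of filled columns differs from row to row.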

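To increase the number of separator symbols, I am trying to apply a function to the DataFrame in the form df.apply(func, axis=1) to achieve the above. However, duplicates appear, and it looks as if some combinations are not being applied correctly.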

Code I wrote

# ★★
import pandas as pd
import re
import itertools

def func(row):

    default = row[0]
    # Insert a space at each boundary between a digit and a non-digit
    space = re.sub(r'((?<=\d)\D|(?<=\D)\d)', r' \1', default)
    # Split on spaces
    lst = space.split(' ')
    # Insert ●
    ret = []
    for tpl in itertools.product(['','●'],repeat=len(lst)-1):
        p = lst[0]
        for t,l in zip(tpl,lst[1:]):
            p += t + l
        ret.append(p)
    # Insert a blank (half-width space)
    for tpl in itertools.product(['',' '],repeat=len(lst)-1):
        p = lst[0]
        for t,l in zip(tpl,lst[1:]):
            p += t + l
        ret.append(p)
    # Insert ■
    for tpl in itertools.product(['','■'],repeat=len(lst)-1):
        p = lst[0]
        for t,l in zip(tpl,lst[1:]):
            p += t + l
        ret.append(p)
    
    # Remove duplicates, if any
    ret = list(set(ret))

    # Add columns
    for idx,val in enumerate(ret):
        if idx > 0: # the first column is not needed
            row[idx] = val

    return row

# Data
data = pd.read_csv('test.csv', header=0)
print(data)  # ['AA1111','BB222C','DD25EE']

# Apply the function
df = data.apply(func,axis=1)
print(df)  # IndexError: ('index 1 is out of bounds for axis 0 with size 1', 'occurred at index 0')

## Applying it to the df did not work, so I tried the following.

# Convert to a list
dflist = data.values.tolist()

# Flatten with a list comprehension
res = [ flatten for inner in dflist for flatten in inner ]

# df2
df2 = pd.DataFrame(res)
df2.rename(columns={"model":"0"})

# Apply the function
df2 = df2.apply(func,axis=1)
print(df2)

Result

	0	1	2	3	4	5	6	7	8	9
0	AA1111	AA1111	AA 1111	AA■1111	NaN	NaN	NaN	NaN	NaN	NaN
1	BB222CC	BB 222 CC	BB 222CC	BB■222■CC	BB●222CC	BB222 CC	BB222CC	BB●222●CC	BB222■CC	BB222●CC
2	DDD250E	DDD■250■E	DDD250 E	DDD250E	DDD 250 E	DDD■250E	DDD●250E	DDD●250●E	DDD250■E	DDD 250E

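Observations

At first glance it looks as if it worked, but 'AA●1111' is missing from df row 1, 'BB★222CC' and DDD250E are missing from row 2, 'AA■1111' is missing from df row 2, and 'DDD●250E' is missing from df row 3. I also found that these gaps coincide with the places where duplicates appear.

Which part of the code causes the combinations to break down so that the same value appears more than once?
I would be grateful if you could tell me how the code should be changed to obtain the combinations correctly.
Thank you very much in advance.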

1 answer

Best answer

Which part of the code causes the combinations to break down so that the same value appears more than once?

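There was a bug in my previous answer. My apologies.
I have rewritten the processing with the fix applied; see the Python code in this answer. Note that the IndexError did not occur in my environment.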
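As for which part breaks the combinations, one likely reading of the question's code is as follows. Each of the three loops also produces the string with no separator at all, so it ends up in the list three times. `set()` then removes the repeats but leaves the elements in arbitrary order, and the `if idx > 0` check skips whichever element happens to land at index 0. The sketch below reproduces that in isolation; `combos` is a hypothetical helper written for this illustration, not part of the original code.

```python
import itertools

def combos(parts, sep):
    """Join parts, putting either '' or sep at each boundary (all combinations)."""
    return [parts[0] + ''.join(c + p for c, p in zip(choice, parts[1:]))
            for choice in itertools.product(['', sep], repeat=len(parts) - 1)]

parts = ['BB', '222', 'CC']
ret = []
for sep in ['●', ' ', '■']:
    ret += combos(parts, sep)

# The unseparated string is generated once per separator.
print(ret.count('BB222CC'))    # 3

# set() removes the repeats, but the resulting order is arbitrary.
unique = list(set(ret))
print(len(ret), len(unique))   # 12 10

# Skipping index 0 of an arbitrarily ordered list drops an arbitrary variant,
# and 'BB222CC' stays in the output unless it happens to be at index 0.
kept = unique[1:]
```

Under this reading, the fix is to leave the unseparated string out of each batch before combining them, so that neither deduplication nor index skipping is needed.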

Python
import pandas as pd
import re
import itertools

# Return a list with the given separator inserted in every combination.
# Elements that contain no separator (such as "AA1111") are excluded.
def insert(lst,char):
    ret = []
    for tpl in itertools.product(['',char],repeat=len(lst)-1):
        p = lst[0]
        for t,l in zip(tpl,lst[1:]):
            p += t + l
        ret.append(p)
    return ret[1:] # exclude the element that contains no separator

def func(row):
    default = row[0]
    # Insert a space at each boundary between a digit and a non-digit
    space = re.sub(r'((?<=\d)\D|(?<=\D)\d)', r' \1', default)
    # Split on spaces
    lst = space.split(' ')

    ret = []
    ret += insert(lst, '●') # insert ●
    ret += insert(lst, ' ')  # insert a blank (half-width space)
    ret += insert(lst, '■') # insert ■

    # Add columns
    for idx,val in enumerate(ret):
        row[idx+1] = val # keep the first column, which has no separator

    return row

# Data
#data = pd.read_csv('hondaBIKEtest.csv', header=0)
data = pd.DataFrame(['AA1111','BB222C','DD25EE'])

# Apply the function
df = data.apply(func,axis=1)
print(df)
"""
        0        1        2         3        4        5         6        7        8         9
0  AA1111  AA●1111  AA 1111   AA■1111      NaN      NaN       NaN      NaN      NaN       NaN
1  BB222C  BB222●C  BB●222C  BB●222●C  BB222 C  BB 222C  BB 222 C  BB222■C  BB■222C  BB■222■C
2  DD25EE  DD25●EE  DD●25EE  DD●25●EE  DD25 EE  DD 25EE  DD 25 EE  DD25■EE  DD■25EE  DD■25■EE
"""
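
Regarding the IndexError in the question: this is a guess, but it may come from the column label. With `header=0` the CSV column has a string label (the `rename` call in the question suggests "model"), so `row[1] = val` may be treated as a positional write into a length-1 row instead of adding a new label. One way to sidestep this is to build each row as a plain list and construct the DataFrame once at the end, instead of growing `row` inside `apply`. The following is a minimal sketch under that assumption; `SEPARATORS`, `split_parts`, `variants`, and `expand` are hypothetical names introduced here.

```python
import itertools
import re

import pandas as pd

SEPARATORS = ['●', ' ', '■']  # the same separators as in the code above

def split_parts(text):
    # Split into runs of digits and runs of non-digits, e.g. 'BB222C' -> ['BB', '222', 'C'].
    return re.findall(r'\d+|\D+', text)

def variants(text, separators=SEPARATORS):
    """Return [text, ...] followed by every separated variant, without duplicates."""
    parts = split_parts(text)
    out = [text]
    if len(parts) < 2:
        return out  # nothing to separate
    for sep in separators:
        for choice in itertools.product(['', sep], repeat=len(parts) - 1):
            if not any(choice):
                continue  # skip the combination that has no separator at all
            out.append(parts[0] + ''.join(c + p for c, p in zip(choice, parts[1:])))
    return out

def expand(series):
    # Rows of unequal length are padded with missing values by the DataFrame constructor.
    return pd.DataFrame([variants(s) for s in series], index=series.index)

data = pd.DataFrame(['AA1111', 'BB222C', 'DD25EE'])
df = expand(data.iloc[:, 0])  # select the first column by position, whatever its label
print(df)
```

Because the column is selected with `iloc`, this should behave the same whether the data comes from `pd.DataFrame([...])` or from `read_csv` with a header row.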