スプライン近似する際の節点の配置を探索するプログラムを作りたい。

実現したいこと

データをスプライン近似する際、MSEが小さくなるような節点の配置を探索したいです。

正規乱数のデータを50個用意してそれらのデータのいくつかを節点としてスプライン近似を行う。
(例)データの個数を50個、スプライン節点数を3と仮定して、②～④の操作をMSEができるだけ小さくなるまで繰り返す。
①データのx座標をx[0]~x[49]、節点knotsの座標をknots[0]=x[10], knots[1]=x[20], knots[2]=x[30]のように設定(最初の節点配置は適当に設定)
②スプライン近似
③②で求めたスプライン関数の値とデータとのMSEを計算
④MSEが小さくなるように各節点knots[i]を少しずつ移動させていく（knots[1]=x[20]→knots[1]=x[21]のように一つ隣のデータを節点とする）

節点更新の方法として、
①各節点間隔ごとにMSEを計算。
②MSEが最も大きい節点間隔の節点を移動させ節点更新
③再度スプライン近似
④MSEが減少しなくなるまで①~③を繰り返す
という方法のアドバイスを以前いただいたのですが、自分の力では実装できませんでした。
実装する際に困った点として、
・各節点間隔のMSEを求めた際に最大MSEの節点を動かす必要があるが、どちらの節点をどっち方向に動かせばよいかわからない。
(例えば画像のMSE[1]が最大として、knots[0]とknots[1]のどちらを動かせばよいのか。またどちらの方向に動かすのか。）
・最初と最後の節点更新の実装ができない。例えば、MSEが最大となる節点間隔が画像のMSE[3]だとする。あらかじめ、最大節点間隔の右側の節点を動かす、とルールを定めていても、MSE[3]の右側の節点は存在しないのでそもそも動かせない。

どのようにしてMSEが小さくなる節点を探索するプログラムを実装すればよいでしょうか。ご助言お願いいたします。
以下に実装したい箇所の補足説明を加えます。ソースコード中のcalc_mse関数とupdate_knots関数を実装したいです。

該当のソースコード

Python
1def generate_dataset():
2    rnd = norm.rvs(size=100000)
3    freq, cv = np.histogram(rnd, bins=50)
4    diff = (cv[1] - cv[0]) / 2
5    for i in range(len(cv)):
6        cv[i] += diff
7    cv = np.delete(cv, -1)
8    return cv, freq
9
10def update_knots(x, y, num, mse, knots):
11    # ここで最大MSEの節点配置を更新する処理(knotsを更新)
12    model, _ = get_natural_cubic_spline_model(x, y, minval=min(x), max_val=max(x), n_knots=num, knots=list(knots))  # 更新後の節点を使って再度スプライン近似
13    model.predict(x)
14    # ここで全データとスプライン関数(model.predict(x))のMSEを計算。減少し続けるまで節点の更新を繰り返す
15
16np.random.seed(1234)
17# 正規乱数を発生させ、ヒストグラムを作成。階級値をx, 度数をyに格納
18x, y = generate_dataset()
19
20
21num = 3  # 節点数
22knots[0], knots[1], knots[2] = x[10], x[20], x[30]  # 初期の節点を適当に配置(データのいずれかをスプラインの節点とする)
23model, _ = model = get_natural_cubic_spline_model(x, y, minval=min(x), max_val=max(x), n_knots=num, knots=list(knots))  # データを節点数num、節点の位置をknotsとして、スプライン近似。
24mse = calc_mse()  # calc_mse関数で各節点間隔のmseを計算。まだ実装できていない。
25update_knots(mse, knots) # 実装したい関数。節点の配置を更新し、MSEが小さくなるような節点の配置を探索する関数。

スプライン近似のコード

役に立つかはわかりませんがここに載せておきます。

Python
1def get_natural_cubic_spline_model(x, y, minval=None, maxval=None, n_knots=None, knots=None):
2    """
3    Get a natural cubic spline model for the data.
4
5    For the knots, give (a) `knots` (as an array) or (b) minval, maxval and n_knots.
6
7    If the knots are not directly specified, the resulting knots are equally
8    space within the *interior* of (max, min).  That is, the endpoints are
9    *not* included as knots.
10
11    Parameters
12    ----------
13    x: np.array of float
14        The input data
15    y: np.array of float
16        The outpur data
17    minval: float 
18        Minimum of interval containing the knots.
19    maxval: float 
20        Maximum of the interval containing the knots.
21    n_knots: positive integer 
22        The number of knots to create.
23    knots: array or list of floats 
24        The knots.
25
26    Returns
27    --------
28    model: a model object
29        The returned model will have following method:
30        - predict(x):
31            x is a numpy array. This will return the predicted y-values.
32    """
33
34    if knots:
35        spline = NaturalCubicSpline(knots=knots)
36    else:
37        spline = NaturalCubicSpline(max=maxval, min=minval, n_knots=n_knots)
38
39    p = Pipeline([
40        ('nat_cubic', spline),
41        ('regression', LinearRegression(fit_intercept=True))
42    ])
43
44    p.fit(x, y)
45
46    return p, spline.knots
47
48
49class AbstractSpline(BaseEstimator, TransformerMixin):
50    """Base class for all spline basis expansions."""
51
52    def __init__(self, max=None, min=None, n_knots=None, n_params=None, knots=None):
53        if knots is None:
54            if not n_knots:
55                n_knots = self._compute_n_knots(n_params)
56            knots = np.linspace(min, max, num=(n_knots + 2))[1:-1]
57            max, min = np.max(knots), np.min(knots)
58        self.knots = np.asarray(knots)
59
60    @property
61    def n_knots(self):
62        return len(self.knots)
63
64    def fit(self, *args, **kwargs):
65        return self
66
67
68class NaturalCubicSpline(AbstractSpline):
69    """Apply a natural cubic basis expansion to an array.
70    The features created with this basis expansion can be used to fit a
71    piecewise cubic function under the constraint that the fitted curve is
72    linear *outside* the range of the knots..  The fitted curve is continuously
73    differentiable to the second order at all of the knots.
74    This transformer can be created in two ways:
75      - By specifying the maximum, minimum, and number of knots.
76      - By specifying the cutpoints directly.  
77
78    If the knots are not directly specified, the resulting knots are equally
79    space within the *interior* of (max, min).  That is, the endpoints are
80    *not* included as knots.
81    Parameters
82    ----------
83    min: float 
84        Minimum of interval containing the knots.
85    max: float 
86        Maximum of the interval containing the knots.
87    n_knots: positive integer 
88        The number of knots to create.
89    knots: array or list of floats 
90        The knots.
91    """
92
93    def _compute_n_knots(self, n_params):
94        return n_params
95
96    @property
97    def n_params(self):
98        return self.n_knots - 1
99
100    def transform(self, X, **transform_params):
101        X_spl = self._transform_array(X)
102        if isinstance(X, pd.Series):
103            col_names = self._make_names(X)
104            X_spl = pd.DataFrame(X_spl, columns=col_names, index=X.index)
105        return X_spl
106
107    def _make_names(self, X):
108        first_name = "{}_spline_linear".format(X.name)
109        rest_names = ["{}_spline_{}".format(X.name, idx)
110                      for idx in range(self.n_knots - 2)]
111        return [first_name] + rest_names
112
113    def _transform_array(self, X, **transform_params):
114        X = X.squeeze()
115        try:
116            X_spl = np.zeros((X.shape[0], self.n_knots - 1))
117        except IndexError: # For arrays with only one element
118            X_spl = np.zeros((1, self.n_knots - 1))
119        X_spl[:, 0] = X.squeeze()
120
121        def d(knot_idx, x):
122            def ppart(t): return np.maximum(0, t)
123
124            def cube(t): return t*t*t
125            numerator = (cube(ppart(x - self.knots[knot_idx]))
126                         - cube(ppart(x - self.knots[self.n_knots - 1])))
127            denominator = self.knots[self.n_knots - 1] - self.knots[knot_idx]
128            return numerator / denominator
129
130        for i in range(0, self.n_knots - 2):
131            X_spl[:, i+1] = (d(i, X) - d(self.n_knots - 2, X)).squeeze()
132        return X_spl

補足情報

わからないことがありましたらすぐに補足いたします。よろしくお願いいたします。

fana

2022/11/24 10:27

素人なので頓珍漢なことを言うかもしれませんが……そのような操作の結果として得られる解が最良であるという理屈が成り立つ問題なのでしょうか？（＝枝刈が可能な探索問題なのか？　不可能なら結局全部やってみるしかないのでは…？）最良でなくてもある程度良い解が得られればOKという話だとしたら… 一般的に(?)近すぎる点群（例えば x[1], x[3], x[6] の３点だとか）だけを選んだ結果がとても良いという可能性は低そうなので，「ある程度互いに離れた選択の仕方」の範囲だけで全探索する（途中で満足いくのを見つけたら打ち切る）みたいな話が思いつくけども…