Is there a way to solve this problem?



For each observation there are 207 variables (binary variables indicating whether a given "symptom" is present), and the class variable is also binary.

For each variable/symptom there is a weight (currently set by hand, ranging from -5 to 50), and each observation has an associated cutoff line (there are 3 different cutoff lines). The matrix of dummy variables is multiplied by the weights and the result is summed across the columns for each observation, giving a score. If this score is above the cutoff line associated with the observation, the result is 1, otherwise 0.
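For concreteness, here is a minimal NumPy sketch of that scoring step (the names `dummies`, `weights`, `cutoffs` and all example values are purely illustrative, not my real data):

```python
import numpy as np

# Illustrative shapes only: N observations, 207 binary symptom variables.
N, V = 1000, 207
rng = np.random.default_rng(0)

dummies = rng.integers(0, 2, size=(N, V))         # N x 207 matrix of 0/1 symptoms
weights = rng.uniform(-5, 50, size=V)             # one weight per symptom (hand-set in my case)
cutoffs = np.array([40.0, 60.0, 80.0])            # the 3 cutoff lines (example values)
cut_idx = rng.integers(0, 3, size=N)              # which cutoff line applies to each observation

scores = dummies @ weights                        # cumulative weight per observation
preds = (scores >= cutoffs[cut_idx]).astype(int)  # 1 if the score clears its cutoff line, else 0
```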

The problem is how to set those weights and cutoff lines optimally. I obviously have a dataset, so I can see which symptoms typically correspond to a "1" outcome.

To me this looks like an optimization problem, although the prediction could obviously also be done with machine learning; I am looking for other resources.

The question is: do you know of any area of OR, or can you point me to some keywords, for how to tackle this kind of problem? I am comfortable with Python, so I would be glad if you could recommend some packages. The only thing I have come up with so far is to randomly generate weights in the (-5, 50) interval for a load of trials, in the hope of finding the weights that give the best accuracy (the aim being to keep the false positive rate as low as possible).
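Roughly what I mean by that trial-and-error idea, continuing the sketch above (`y` stands for the known outcomes and is again just a placeholder):

```python
def false_positive_rate(y_true, y_pred):
    fp = np.sum((y_pred == 1) & (y_true == 0))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    return fp / max(fp + tn, 1)

y = rng.integers(0, 2, size=N)               # known outcomes (placeholder)

best_acc, best_w = -1.0, None
for _ in range(10_000):                      # number of random trials
    w = rng.uniform(-5, 50, size=V)          # candidate weights drawn from (-5, 50)
    preds = (dummies @ w >= cutoffs[cut_idx]).astype(int)
    acc = np.mean(preds == y)
    if acc > best_acc:
        best_acc, best_w = acc, w

best_preds = (dummies @ best_w >= cutoffs[cut_idx]).astype(int)
print(best_acc, false_positive_rate(y, best_preds))
```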

Thanks!

-EDIT 20.07

My current formulation is as follows:

\begin{align*}
\max \quad & \sum_{i=1}^{N} t_i \, s_i \\
\text{s.t.} \quad & (Mx')_i \ge L_i \;\Rightarrow\; s_i = 1 \\
& (Mx')_i < L_i \;\Rightarrow\; s_i = 0 \\
& \sum_{i=1}^{N} s_i \le 0.06\,N
\end{align*}

where $N$ is the number of observations and $M$ the number of variables, $x$ is the vector of weights, and $M$ is the $N \times M$ matrix of dummy variables in which each row represents one observation, so $Mx'$ gives the $N \times 1$ vector of cumulative weights, one entry per observation.

As I mentioned in the comments, the optimal cutoff lines $L = [L_1, \dots, L_n]$ are also part of the problem. The vector of true assignments $t$ is known. The point is that, once the cutoff lines and the weights have been obtained, the system will use them on new observations.

I also do not want the $s_i$ to be positive; this is another constraint of the problem.
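To make this concrete, this is roughly how I imagine the indicator logic could be encoded as a mixed-integer program with big-M constraints, e.g. in PuLP. It is only a toy sketch under simplifying assumptions (a single cutoff line $L$ instead of three, random stand-in data, and arbitrary `BIG`/`EPS` constants):

```python
import random
import pulp

# Toy stand-ins for the real data: N observations, V binary symptom variables.
random.seed(0)
N, V = 50, 10
M = [[random.randint(0, 1) for _ in range(V)] for _ in range(N)]  # dummy-variable matrix
t = [random.randint(0, 1) for _ in range(N)]                      # true assignments

BIG = 1e4   # big-M constant (assumption: larger than any achievable |score - L|)
EPS = 1e-3  # small tolerance standing in for the strict inequality

prob = pulp.LpProblem("weights_and_cutoff", pulp.LpMaximize)
x = [pulp.LpVariable(f"x_{j}", lowBound=-5, upBound=50) for j in range(V)]
L = pulp.LpVariable("L", lowBound=0, upBound=50 * V)   # one cutoff line, bounded by the max score
s = [pulp.LpVariable(f"s_{i}", cat="Binary") for i in range(N)]

prob += pulp.lpSum(t[i] * s[i] for i in range(N))      # max sum_i t_i * s_i

for i in range(N):
    score_i = pulp.lpSum(M[i][j] * x[j] for j in range(V))
    prob += score_i - L >= -BIG * (1 - s[i])   # s_i = 1  =>  score_i >= L
    prob += score_i - L <= BIG * s[i] - EPS    # s_i = 0  =>  score_i <  L

prob += pulp.lpSum(s) <= 0.06 * N                      # at most 6% flagged positive

prob.solve(pulp.PULP_CBC_CMD(msg=False))
best_weights = [v.value() for v in x]
best_cutoff = L.value()
```

With three cutoff lines one would add one $L_k$ variable per line and select the appropriate one for each observation; solvers with native indicator constraints (e.g. Gurobi, CPLEX) can also express the implications directly.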

Thanks for all the comments; I am new to Stack Exchange, so please bear with me.



There are multiple ways to solve this problem; in my opinion it is more of an ML problem, but you can also do it with linear programming.

Let $a_i$ be the array of features for element $i$. Assuming you have a sample where, given $a_i$, you are told the class it belongs to ($S_0$ or $S_1$), let $x$ be the vector of weights and let $b\in[0,1]$ be a scalar. We establish that \begin{equation} a_i'x \geq b \Longleftrightarrow a_i \in S_0 \end{equation} \begin{equation} a_i'x \lt b \Longleftrightarrow a_i \in S_1 \end{equation}

Then, we could say the given sample should be classified correctly: \begin{equation*} a_i'x \ge b, \hspace{10mm} i\in S_0 \\ a_i'x \lt b, \hspace{10mm} i\in S_1 \end{equation*}

There is no need for an objective function, although you may want one in case the problem is infeasible (i.e. there is no linear separation). In that case your objective could be to maximize the accuracy of your predictions, recall, F1-score, etc.; it depends on the problem.
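A minimal sketch of that feasibility problem with `scipy.optimize.linprog`, using toy data and a small `eps` margin in place of the strict inequality (all names and values are illustrative):

```python
import numpy as np
from scipy.optimize import linprog

# Toy data: n0 samples known to be in S0, n1 in S1, d binary features each.
rng = np.random.default_rng(0)
d, n0, n1 = 10, 30, 30
A0 = rng.integers(0, 2, size=(n0, d))
A1 = rng.integers(0, 2, size=(n1, d))
eps = 1e-3  # margin standing in for the strict inequality

# Decision vector z = [x_1, ..., x_d, b]; pure feasibility, so a zero objective.
c = np.zeros(d + 1)

# S0:  a_i'x >= b   ->  -a_i'x + b <= 0
# S1:  a_i'x <  b   ->   a_i'x - b <= -eps
A_ub = np.vstack([
    np.hstack([-A0, np.ones((n0, 1))]),
    np.hstack([A1, -np.ones((n1, 1))]),
])
b_ub = np.concatenate([np.zeros(n0), -eps * np.ones(n1)])

bounds = [(None, None)] * d + [(0, 1)]  # free weights, b in [0, 1]
res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")

if res.success:
    x, b = res.x[:d], res.x[d]
else:
    print("No linear separation; add slacks / an objective as discussed above.")
```

If the sample is not linearly separable the LP is infeasible, which is exactly where the slack variables or the accuracy/recall-style objective mentioned above come in.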



> Given weights I can easily calculate how good these weights are for predicting but how can I determine weights?

From the answer above, $x$ would represent the weights and $b$ the cut point used to decide whether a sample belongs to $S_0$ or $S_1$; those are the two decision variables in the OR problem. $a$ represents the observations from the sample. Solving that linear program would give you both the weights and the cut point.



This sure sounds like you guys are taking the long road to Logistic Regression...

You have a bunch of observations, presumably with outcomes to do the training or calculate the model, right?

Each observation has 207 data elements that are numerical. (Some/many of those will likely be dropped in the final model)

And you want to make a model from that to use on new data to predict 1/0 outcomes?

This is classic logistic regression, which should be your starting point (easiest) and then maybe some ML model, but this is not optimization unless you consider the calculation of weights for logistic regression an optimization problem.
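A minimal scikit-learn sketch of that starting point, with random toy data standing in for the real 207-column matrix (the 0.5 decision threshold can later be tuned to trade off false positives):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy stand-ins: X is the N x 207 binary symptom matrix, y the 0/1 outcomes.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(500, 207))
y = rng.integers(0, 2, size=500)

model = LogisticRegression(max_iter=1000)  # plain logistic regression as a baseline
model.fit(X, y)

coef = model.coef_[0]                      # per-symptom weights, analogous to the hand-set weights
probs = model.predict_proba(X)[:, 1]
preds = (probs >= 0.5).astype(int)         # the threshold plays the role of the cutoff line
```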