# How can I apply machine learning algorithms to the stock market?

There seems to be a basic fallacy that someone can come along and learn some machine learning or AI algorithms, set them up as a black box, hit go, and sit back while they retire.

Learn statistics and machine learning first, then worry about how to apply them to a given problem. There is no free lunch here. Data analysis is hard work. Read "The Elements of Statistical Learning" (the pdf is available for free on the website), and don't start trying to build a model until you understand at least the first 8 chapters.

Once you understand the statistics and machine learning, you then need to learn how to backtest and build a trading model, accounting for transaction costs, etc., which is a whole other area.

After you have a handle on both the analysis and the finance, then it will be somewhat obvious how to apply it. The entire point of these algorithms is trying to find a way to fit a model to data and produce low bias and variance in prediction (i.e. that the training and test prediction error will be low and similar). Here is an example of a trading system using a support vector machine in R, but just keep in mind that you will be doing yourself a huge disservice if you don't spend the time to understand the basics before trying to apply something esoteric.

[Edit:]

Just to add an entertaining update: I recently came across this master's thesis: "A Novel Algorithmic Trading Framework Applying Evolution and Machine Learning for Portfolio Optimization" (2012). It's an extensive review of different machine learning approaches compared against buy-and-hold. After almost 200 pages, they reach the basic conclusion: "No trading system was able to outperform the benchmark when using transaction costs." Needless to say, this does not mean that it can't be done (I haven't spent any time reviewing their methods to check the validity of the approach), but it certainly provides some more evidence in favor of the no-free-lunch theorem.

One basic application is predicting financial distress.

Get a bunch of data with some companies that have defaulted, and others that haven't, with a variety of financial information and ratios.

Use a machine learning method such as SVM to see if you can predict which companies will default and which will not.

Use that SVM going forward to short companies with a high predicted probability of default and go long companies with a low predicted probability, funding the long positions with the proceeds of the short sales.
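The steps above could be sketched as follows. Everything here is a hypothetical illustration: the "financial ratios" are synthetic random data with a toy default rule baked in, and scikit-learn's `SVC` stands in for whatever learner you choose:

```python
# Sketch of the default-prediction idea with scikit-learn's SVC.
# Features and labels are synthetic placeholders, not real financials.
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
n = 500
# Hypothetical ratios: leverage, current ratio, interest coverage
X = rng.normal(size=(n, 3))
# Toy rule: high leverage and low coverage raise default probability
y = ((X[:, 0] - X[:, 2] + rng.normal(scale=0.5, size=n)) > 1.0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = make_pipeline(StandardScaler(), SVC(probability=True))
model.fit(X_train, y_train)

# Rank companies by predicted default probability:
# short the top ten, go long the bottom ten.
proba = model.predict_proba(X_test)[:, 1]
shorts = np.argsort(proba)[-10:]   # most likely to default
longs = np.argsort(proba)[:10]     # least likely to default
```

On real data the hard part is everything this sketch skips: choosing informative ratios, handling the severe class imbalance of defaults, and validating out of sample.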

I echo much of what @Shane wrote. In addition to reading ESL, I would suggest an even more fundamental study of statistics first. Beyond that, the problems I outlined in another question on this exchange are highly relevant. In particular, the problem of data-mining bias is a serious roadblock to any machine-learning-based strategy.

There are several Machine Learning/Artificial Intelligence (ML/AI) branches out there:
http://www-formal.stanford.edu/jmc/whatisai/node2.html

I have only tried genetic programming and some neural networks, and I personally think that the "learning from experience" branch seems to have the most potential. GP/GA and neural nets seem to be the most commonly explored methodologies for the purpose of stock market predictions, but if you do some data mining on Predict Wall Street, you might be able to do some sentiment analysis too.

Spend some time learning about the various ML/AI techniques, find some market data, and try to implement some of those algorithms. Each one will have its strengths and weaknesses, but you may be able to combine the predictions of each algorithm into a composite prediction (similar to what the winners of the Netflix Prize did).

Some Resources:
Here are some resources that you might want to look into:

The Chatter:
The general consensus amongst traders is that Artificial Intelligence is a voodoo science: you can't make a computer predict stock prices, and you're sure to lose your money if you try. Nonetheless, the same people will tell you that just about the only way to make money on the stock market is to build and improve your own trading strategy and follow it closely (which is not actually a bad idea).

The idea of AI algorithms is not to build Chip and let him trade for you, but to automate the process of creating strategies. It's a very tedious process and by no means is it easy :).

Minimizing Overfitting:
As we've heard before, a fundamental issue with AI algorithms is overfitting (aka datamining bias): given a set of data, your AI algorithm may find a pattern that is particularly relevant to the training set, but it may not be relevant in the test set.

There are several ways to minimize overfitting:

1. Use a validation set: it doesn't give feedback to the algorithm, but it allows you to detect when your algorithm is potentially beginning to overfit (i.e. you can stop training if you're overfitting too much).
2. Use online machine learning: it largely eliminates the need for back-testing and it is very applicable for algorithms that attempt to make market predictions.
3. Ensemble Learning: provides you with a way to take multiple machine learning algorithms and combine their predictions. The assumption is that various algorithms may have overfit the data in some area, but the "correct" combination of their predictions will have better predictive power.
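Point 1 can be made concrete with a small sketch (all data here is synthetic, and a decision tree's depth stands in for "amount of training" as the complexity knob): keep a validation set out of training and stop increasing complexity once the validation score stops improving.

```python
# Minimal illustration of using a held-out validation set to pick
# model complexity before overfitting sets in. Synthetic data; in
# practice these would be market features and labels.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 5))
y = (X[:, 0] + 0.3 * rng.normal(size=1000) > 0).astype(int)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=1)

best_depth, best_score = None, -np.inf
for depth in range(1, 15):
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_tr, y_tr)
    score = tree.score(X_val, y_val)   # validation set gives no feedback
    if score > best_score:             # to fitting, only to selection
        best_depth, best_score = depth, score
# Deep trees fit the training noise; the validation score reveals it.
```

The validation set never touches `fit`, so a drop in `score` at higher depths is exactly the overfitting signal described above.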

The short and brutal answer is: you don't. First, because ML and statistics are not something you can command well in one or two years. My recommended time horizon for learning anything non-trivial is 10 years. ML is not a recipe to make money, but just another means to observe reality. Second, because any good statistician knows that understanding the data and the problem domain is 80% of the work. That's why you have statisticians focusing on physics data analysis, on genomics, on sabermetrics, etc. For the record, Jerome Friedman, co-author of ESL quoted above, is a physicist and still holds a courtesy position at SLAC.

So, study Statistics and Finance for a few years. Be patient. Go your own way.

Mileage may vary.

I'm currently working on this task: applying machine learning to stock trading. However, the concerns raised in other answers are major obstacles, so I'm taking a different tack.

My strategy is more akin to teaching a car to drive: the machine learning is based not on the underlying data, but on the driver's reaction to the data. So, based on what the road looks like, it learns the steering position of the wheel. The machine observes "correct" driving and can very quickly mimic the driving actions. I think this is referred to as "supervised learning" (I'm very new to formal machine learning; I'm taking the Stanford class on iTunes U).

To apply this tack to stock trading, you take the factors that you personally consider when trading stocks (price, moving average, volume, whatever) and make those measures available as inputs to your machine learning algorithm. Then, for a series of data points, you enter the "right" answer, which I prefer to organize as LONG/SHORT/FLAT. Of course, this doesn't help if you are bad at trading stocks, but it does help create an agent who can do whatever you would do.

The overall idea isn't to create a millionaire black box, but rather to free up your time from watching the market closely, or to allow you to apply your strategies to more stocks than you would otherwise be able to follow.
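A hedged sketch of this labeling idea: the features below are synthetic, a toy momentum rule stands in for the trader's hand-entered LONG/SHORT/FLAT labels, and a random forest is just one possible supervised learner for mimicking them.

```python
# Supervised "mimic the trader" sketch: features the trader already
# watches, paired with LONG/SHORT/FLAT decisions as training labels.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(2)
n = 600
price_change = rng.normal(size=n)
ma_gap = rng.normal(size=n)       # price minus moving average
volume_z = rng.normal(size=n)     # volume z-score

# Stand-in for hand-entered labels: this toy "trader" goes long on
# momentum, short on the reverse, and stays flat otherwise.
signal = price_change + ma_gap
labels = np.where(signal > 1, "LONG",
                  np.where(signal < -1, "SHORT", "FLAT"))

X = np.column_stack([price_change, ma_gap, volume_z])
agent = RandomForestClassifier(n_estimators=100, random_state=0)
agent.fit(X, labels)

# The fitted "agent" mimics the labeled decisions on new data points.
decision = agent.predict([[1.5, 0.8, 0.0]])[0]
```

As the answer notes, the agent can only be as good as the labels it imitates; the ML does the mimicking, not the strategy design.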

If anyone would like to collaborate with me, please feel free to contact. I'm currently implementing the above as an iOS & Mac app.

Two aspects of statistical learning are useful for trading:

1. First, the ones mentioned earlier: some statistical methods are focused on working with live datasets. That means you know you are observing only a sample of the data and you want to extrapolate. You thus have to deal with in-sample vs. out-of-sample issues, overfitting, and so on. From this viewpoint, data mining is more focused on dead datasets (i.e. you can see almost all the data, so you have an in-sample-only problem) than statistical learning is.

Because statistical learning is about working on live datasets, the applied maths that deals with them has to focus on a two-scale problem:

$$\left\{\begin{array}{lcl} X_{n+1} &=& F_\theta(X_n,\xi_{n+1})\\ {\hat\theta}_{n+1} &=& L(\pi(X_n),{\hat\theta}_n) \end{array}\right.$$ where $X$ is the (multidimensional) state space to study (it contains your explanatory variables and the ones to predict), and $F$ contains the dynamics of $X$, which need some parameters $\theta$. The randomness of $X$ comes from the innovation $\xi$, which is i.i.d.

The goal of statistical learning is to build a methodology $L$ that takes as input a partial observation $\pi$ of $X$ and progressively adjusts an estimate $\hat\theta$ of $\theta$, so that we eventually know all that is needed about $X$.

If you think about using statistical learning to find the parameters of a linear regression, we can model the state space like this: $$\underbrace{\left( \begin{array}{c} y_{n+1}\\ x_{n+1} \end{array}\right)}_{X_{n+1}} = \left[ \begin{array}{ccc} a & b & 1\\ 1 & 0 & 0\\ \end{array}\right] \cdot \underbrace{\left( \begin{array}{c} x_{n}\\1\\ \epsilon_{n+1} \end{array}\right)}_{\xi_{n+1}}$$ which allows us to observe $(y,x)_n$ at any $n$; here $\theta=(a,b)$.

Then you need a way to build an estimator of $\theta$ progressively from our observations. Why not use gradient descent on the L2 distance between $y$ and the regression: $$C(\hat a, \hat b)_n = \sum_{k\leq n} (y_k - (\hat a \, x_k + \hat b))^2$$

So we can build these dynamics: $${\hat a}_{n+1} = {\hat a}_n - \gamma_{n+1} \,\frac{\partial\, C({\hat a}_n, {\hat b}_n)_{n+1}}{\partial\, {\hat a}_n}$$ and similarly for $\hat b$.

Here $\gamma$ is a weighting scheme.

Usually a nice way to build an estimator is to write properly the criteria to minimize and implement a gradient descent that will produce the learning scheme $L$.
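A minimal numerical sketch of this learning scheme $L$ on synthetic data, taking one stochastic gradient step on the squared error for each new observation; the step-size sequence $\gamma_n \propto 1/n$ is an assumed (standard Robbins-Monro-style) choice, not something fixed by the derivation above:

```python
# Online gradient descent recovering theta = (a, b) of a linear
# regression from a stream of observations, one point at a time.
import numpy as np

rng = np.random.default_rng(3)
a_true, b_true = 2.0, -1.0   # the unknown theta = (a, b)

a_hat, b_hat = 0.0, 0.0
for n in range(1, 20001):
    x = rng.normal()                               # observed regressor
    y = a_true * x + b_true + 0.1 * rng.normal()   # innovation xi_{n+1}
    gamma = 1.0 / (n + 10)                         # decreasing steps
    err = y - (a_hat * x + b_hat)
    # gradient of (y - (a x + b))^2 is -2 x err wrt a, -2 err wrt b
    a_hat += gamma * 2 * x * err
    b_hat += gamma * 2 * err
```

The estimates drift toward $(a,b)$ as observations accumulate, which is exactly the coupled-dynamics picture: the state evolves, and the estimator is updated from each partial observation.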

Going back to our original generic problem: we need some applied maths to know when coupled dynamical systems in $(X,\hat\theta)$ converge, and we need to know how to build estimating schemes $L$ that converge towards the original $\theta$.

To give you pointers on such mathematical results:

Now we can go back to the second aspect of statistical learning that is very interesting for quant traders/strategists:

2. The results used to prove the efficiency of statistical learning methods can be used to prove the efficiency of trading algorithms. To see this, it is enough to re-read the coupled dynamical system that defines statistical learning: $$\left\{\begin{array}{lcl} M_{n+1} &=& F_\rho(M_n,\xi_{n+1})\\ {\hat\rho}_{n+1} &=& L(\pi(M_n),{\hat\rho}_n) \end{array}\right.$$

Now $M$ are market variables, $\rho$ is the underlying PnL, and $L$ is a trading strategy. Just replace minimizing a criterion with maximizing the PnL.

See for instance "Optimal split of orders across liquidity pools: a stochastic algorithm approach" by Gilles Pagès, Sophie Laruelle, and Charles-Albert Lehalle. In this paper, the authors show how to use this approach to split an order optimally across different dark pools, simultaneously learning the capability of the pools to provide liquidity and using the results to trade.

The statistical learning tools can be used to build iterative trading strategies (most of them are iterative) and prove their efficiency.

People seem to think that using ML will circumvent the process of actually learning to trade. It doesn't. ML can be used to refine trading ideas, but it doesn't generate them; you need to use your brain for that.

One possibility worth exploring is to use the support vector machine learning tool on the Metatrader 5 platform. Firstly, if you're not familiar with it, Metatrader 5 is a platform developed for users to implement algorithmic trading in forex and CFD markets (I'm not sure if the platform can be extended to stocks and other markets). It is typically used for technical analysis based strategies (i.e. using indicators based on historical data) and is used by people looking to automate their trading.

The "Support Vector Machine Learning Tool" has been developed by one of the community of users to allow support vector machines to be applied to technical indicators and advise on trades. A free demo version of the tool can be downloaded here if you want to investigate further.

As I understand it, the tool uses historical price data to assess whether hypothetical trades in the past would have been successful. It then takes this data, along with the historical values from a number of customisable indicators (MACD, oscillators, etc.), and uses it to train a support vector machine. The trained support vector machine is then used to signal future buy/sell trades. A better description can be found at the link.

I have played around with it a little with some very interesting results, but as with all algorithmic trading strategies I recommend solid back/forward testing before taking it to the live market.

Sorry, but despite being used as a popular example in machine learning, no one has ever achieved reliable stock market prediction.

It does not work for several reasons (see the random-walk literature by Fama and quite a few others, the rational-decision-making fallacy, wrong assumptions, ...), but the most compelling one is this: if it worked, someone would be able to become insanely rich within months, basically owning all the world. As this is not happening (and you can be sure all the banks have tried it), we have good evidence that it just does not work.

Besides: how do you think you will achieve what tens of thousands of professionals have failed to achieve, using the same methods they have, plus limited resources and only basic versions of their methods?

Blair Hull offers one idea: http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2609814

He says he sold his automated trading firm to Goldman for \$300 million.

You can try this course on Udacity.

Look at the seasonal chart for crude oil over 30 years from seasonalcharts.com:

(source: seasonalcharts.com)

and the seasonal chart for heating oil over 29 years:

(source: seasonalcharts.com)

Traders know what to do even without using machine learning. For $k$-fold cross-validation, it will help to split the input data by years.
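One way to sketch the year-wise split (the dates and values below are synthetic placeholders; the point is only that each fold contains whole calendar years, and hence whole seasonal cycles):

```python
# Build cross-validation folds by calendar year, so seasonal
# patterns are never split across train and test sets.
import numpy as np
import pandas as pd

dates = pd.date_range("1990-01-01", "2019-12-31", freq="D")
rng = np.random.default_rng(4)
df = pd.DataFrame({"date": dates, "value": rng.normal(size=len(dates))})

years = df["date"].dt.year
unique_years = sorted(years.unique())   # 30 years of data

k = 5                                    # 5 folds of 6 years each
folds = np.array_split(unique_years, k)
for fold_years in folds:
    test_df = df[years.isin(fold_years)]     # held-out years
    train_df = df[~years.isin(fold_years)]   # remaining years
    # fit on train_df, evaluate on test_df ...
```

Splitting by contiguous years (rather than shuffling rows) also avoids leaking information from a day into its immediate neighbors across the train/test boundary.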

Perhaps it's better to create a stock screener (a classification problem) rather than trying to predict the stock price (a non-deterministic regression problem). Making money that way seems much easier, and we can use SVCs and reinforcement learning to achieve it. I am presently trying to build one using the CAN SLIM model of stock picking, identifying small-cap stocks that can give multibagger returns in 5-10 years' time. Any other tips or pointers will be gladly appreciated.

When considering how to approach a finance problem with a machine learning algorithm, consider the following aspects. They are usually stated plainly in the documentation of any generic learning model, since they describe well-known behaviors that each model possesses in one form or another:

1. Supervised or unsupervised: What inputs from the market are at your disposal to feed into the algorithm? If historical data is available (so-called training or in-sample data) and you would like to feed it into the algorithm to get back output parameters that, when applied to unseen, future data (called testing or out-of-sample data), make quantitative predictions, then this is called "supervised learning" and you would want to look for supervised learning models to solve the problem. If no training data is available, so that estimates must be formed without the aid of data currently available on the market, then the task is an "unsupervised learning" problem that requires an unsupervised learning algorithm for model construction.

2. Classifier or regressor: What outputs do you expect the algorithm to return? Foremost, you'd want the results to be easily interpretable and to mesh well with the intuition of financial theory. From a more practical standpoint, you need to know whether the problem is a classification task, in which the output is given binary labels (1 or 0, True or False) or categorical labels, or a regression task, which usually means you want the answer as a real-valued number, for example a decimal between 0 and 1 or a large whole number. Depending on the situation, you would then want a machine learning algorithm that is either a classifier or a regressor.

3. Reduce overfitting or underfitting: Do existing approaches to the problem using techniques outside of machine learning suggest that prevailing estimates are prone to bias (inaccuracy with respect to the true parameters) or to variance (imprecision around the true parameters)? Knowing this will help you decide whether a machine learning model known to reduce overfitting should be the aim of your search for an appropriate algorithm, or whether one that deals with underfitting would be the better choice. Remember that overfitting is the case where the model clings too closely to the training data's signal and noise, while underfitting is where the model fits the training data only loosely, and therefore predicts poorly on both the training data and unseen test data.
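The bias/variance contrast in point 3 can be seen in a few lines by fitting polynomials of increasing degree to noisy, synthetic linear data: too low a degree underfits (poor everywhere), too high a degree overfits (excellent train error, worse test error).

```python
# Underfitting vs. overfitting: train/test MSE of polynomial fits
# to noisy data whose true signal is linear.
import numpy as np

rng = np.random.default_rng(5)
x_train = np.linspace(-1, 1, 20)
x_test = np.linspace(-1, 1, 200)
f = lambda x: 1.5 * x               # true signal is linear
y_train = f(x_train) + 0.2 * rng.normal(size=x_train.size)
y_test = f(x_test) + 0.2 * rng.normal(size=x_test.size)

def mse(deg):
    """Train and test mean squared error of a degree-`deg` fit."""
    coeffs = np.polyfit(x_train, y_train, deg)
    return (np.mean((np.polyval(coeffs, x_train) - y_train) ** 2),
            np.mean((np.polyval(coeffs, x_test) - y_test) ** 2))

train0, test0 = mse(0)      # underfit: poor on both sets
train1, test1 = mse(1)      # about right for a linear signal
train15, test15 = mse(15)   # overfit: tiny train error, large gap
```

Train error always falls as the degree grows, but only the test error tells you which model actually generalizes.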

There should always be specific reasons why the machine learning model you are trying was selected in the first place, so the main points above serve as guidelines for algorithm selection. You will notice that machine learning textbooks and documentation for Python packages also categorize algorithms under these headings. Replacing the word "algorithm" with "black box" makes clear that machine learning is, at bottom, the process of building a reliable function $f(x)$ that takes an input $x$ and gives back an output as your solution. But "black box" is misleading: many machine learning algorithms have been found to possess very tangible and controllable statistical properties, and the solutions they generate are derived from actual equations, or series of equations, that can be worked through by hand. When thoroughly understood by the user, a well-chosen algorithm can be harnessed and tailored to the uncertain nature of financial markets and data, extrapolating patterns and signals while methodically filtering out noise and errors.

After trying a model on a variety of real and artificially generated datasets of varying $T\times N$ shapes, resampling techniques such as cross-validation, the bootstrap, and Monte Carlo simulation, along with any performance criteria specific to the model you chose, can then be used to verify its appropriateness empirically.