Downloading Financial Data With R and Quantmod Package

In my last post I played a little with R and stock quotes. I downloaded Ford quotes from Yahoo Finance as a CSV file and loaded them with the read.csv function. Fortunately, doing it manually is not the only way: the quantmod package for R makes this easier:

library(quantmod)
getSymbols("F", src = "yahoo", from = "2013-01-01", to = "2014-06-06")
head(F)
##            F.Open F.High F.Low F.Close  F.Volume F.Adjusted
## 2013-01-02  13.23  13.28 13.00   13.20  75274700      12.66
## 2013-01-03  13.24  13.70 13.05   13.46 121284700      12.91
## 2013-01-04  13.51  13.61 13.35   13.57  54669900      13.01
## 2013-01-07  13.52  13.58 13.35   13.43  43482400      12.88
## 2013-01-08  13.38  13.43 13.20   13.35  46336200      12.80
## 2013-01-09  13.40  13.60 13.39   13.47  36973900      12.92

The code below converts the daily quotes to weekly ones and draws a chart.

wFord <- to.weekly(F)
plot(wFord, main = "Ford Motor Co")
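Under the hood, to.weekly() aggregates daily OHLC bars: the weekly open is the first daily open, the high is the maximum of the daily highs, the low is the minimum of the daily lows, and the close is the last daily close. A minimal base-R sketch of the same idea, on made-up data (all the numbers below are fabricated for illustration):

```r
# toy daily data: two weeks of fabricated OHLC bars
daily <- data.frame(
  Date  = as.Date("2013-01-07") + 0:9,
  Open  = c(13.2, 13.3, 13.1, 13.4, 13.5, 13.6, 13.4, 13.3, 13.5, 13.7),
  High  = c(13.4, 13.5, 13.3, 13.6, 13.7, 13.8, 13.6, 13.5, 13.7, 13.9),
  Low   = c(13.0, 13.1, 12.9, 13.2, 13.3, 13.4, 13.2, 13.1, 13.3, 13.5),
  Close = c(13.3, 13.2, 13.2, 13.5, 13.6, 13.5, 13.4, 13.4, 13.6, 13.8))

week <- format(daily$Date, "%Y-%U")            # year-week label (Sunday start)
weekly <- do.call(rbind, lapply(split(daily, week), function(w) {
  data.frame(Open  = w$Open[1],                # first open of the week
             High  = max(w$High),              # highest high
             Low   = min(w$Low),               # lowest low
             Close = w$Close[length(w$Close)]) # last close
}))
weekly
```

This is only a sketch; to.weekly() does the same on xts objects and also aggregates volume.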

You may also use the charting functions from the quantmod package.

chartSeries(wFord)


ROC Curve and Stock Prices Forecasting

I recently enrolled in the Practical Machine Learning course on coursera.org, and it has inspired me to do some experiments with stock price forecasting. It is also a good opportunity to explain what a ROC curve is.

ROC Curve

One of the common tasks in machine learning is to classify situations into one of two cases. For instance, given some medical data, guess whether the patient is sick or healthy; or given some financial data, guess whether the price of a stock will increase or not. There are many methods to build such classifiers from historical data, but these classifiers often produce a numeric output, like 0.7342, and we have to translate it into a decision, for instance: "when the output > 0.5, classify the patient as sick; otherwise classify the patient as healthy."

Different cutoff values result in different numbers of false positives and false negatives. A false positive is a negative case wrongly classified as positive; a false negative is a positive case wrongly classified as negative. Sometimes it is better to have fewer false negatives, sometimes fewer false positives. Examples:

  • If the patient is sick but classified as healthy, he may die without the necessary treatment.
  • If the stock price decreases when you predicted growth and you opened a position, you lose money.

The ROC curve illustrates this trade-off for a specific classifier: the larger the area under the curve, the better the classifier.
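To make the rates concrete, here is a toy example (all scores are made up): eight patients, four actually sick, a cutoff of 0.5, and the resulting true and false positive rates.

```r
# made-up classifier scores for 4 sick (positive) and 4 healthy (negative) patients
scores <- c(0.9, 0.8, 0.7, 0.4,    # positives
            0.6, 0.3, 0.2, 0.1)    # negatives
truth  <- c(TRUE, TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE)

cutoff <- 0.5
pred <- scores > cutoff
tpr <- sum(pred & truth)  / sum(truth)    # fraction of sick patients caught
fpr <- sum(pred & !truth) / sum(!truth)   # fraction of healthy patients flagged
c(tpr = tpr, fpr = fpr)
```

Sweeping the cutoff from 1 down to 0 and plotting (fpr, tpr) for each value traces out the ROC curve for this classifier.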

Stock Price Forecasting Experiment

Let’s play. I will use R for that. I downloaded Ford Motor Co quotes (1 January 2013 – 6 June 2014) from Yahoo Finance.

#read from csv
ford <- read.csv(file = "ford.csv", as.is = TRUE)
ford$Date <- strptime(ford$Date, "%Y-%m-%d")
#plot closing prices
plot(ford$Date, ford$Close, type = "l",
     ylab = "Closing price [USD]", xlab = "Date", main = "Ford Motor Co")

I will compute the closing/opening price ratio for each quote; if the price is higher at the end of the day, this ratio will be greater than 1. I will also compute the differences between the closing prices of day x and day x+1. Maybe there is a rule that if the price increased during the day, then it will also be higher at the end of the next day?

len <- length(ford$Close)
d <- 1                               #predict one day ahead
predictor <- ford$Close / ford$Open  #intraday closing/opening ratio
#growth signal: next-day close higher by more than 0.05 (because of transaction fees)
results <- diff(ford$Close, lag = d) > 0.05
range <- 1:(len - d)                 #align each ratio with the next-day outcome
i <- data.frame(x = predictor[range], y = results[range])

plot(density(i$x[i$y == TRUE]),
     col = "blue", main = "", xlab = "Closing to Opening Price Ratio")
lines(density(i$x[i$y == FALSE]), col = "red")
lines(density(i$x), col = "green")
legend(1.015, 55, cex = 0.5,
       c("Growths", "Falls", "All"),
       lty = c(1, 1, 1),
       lwd = c(2.5, 2.5, 2.5),
       col = c("blue", "red", "green"))

The chart shows the contrary: when the price decreased during the day, it was more likely to grow the next day. So let's use that fact to predict growth: if the closing/opening price ratio is less than some value, expect growth the next day. Let's test some thresholds.

#true positive rate: the fraction of actual growths the prediction catches
tpRate <- function(prediction) {
  sum(i$y & prediction) / sum(i$y)
}
#false positive rate: the fraction of non-growths wrongly flagged
fpRate <- function(prediction) {
  sum(!i$y & prediction) / sum(!i$y)
}
#generate thresholds to test
thresholds <- seq(from = 0.965, to = 1.05, by = 0.001)

#compute the true positive rate and false positive rate for each threshold
res <- data.frame(thresholds = thresholds,
                  tpRates = sapply(thresholds, function(threshold) {
                    tpRate(i$x < threshold)
                  }),
                  fpRates = sapply(thresholds, function(threshold) {
                    fpRate(i$x < threshold)
                  }))

#sort results by fpRates ascending
res <- res[order(res$fpRates, res$tpRates), ]
plot(res$fpRates, res$tpRates, type = "s",
     xlim = c(0, 1), xlab = "False Positives Rate", ylab = "True Positives Rate",
     main = "ROC Curve")
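The "area under the curve" mentioned earlier can be approximated by applying the trapezoid rule to such sorted (fpRate, tpRate) points. A small sketch on made-up points (a perfect classifier would give an area of 1, a random one about 0.5):

```r
# made-up ROC points, sorted by false positive rate
fpr <- c(0, 0.1, 0.3, 0.6, 1)
tpr <- c(0, 0.5, 0.7, 0.9, 1)
# trapezoid rule: sum of slice widths times average slice heights
auc <- sum(diff(fpr) * (head(tpr, -1) + tail(tpr, -1)) / 2)
auc
```

The same one-liner applied to res$fpRates and res$tpRates would estimate the area under the curve plotted above.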

When you inspect ‘res’, you may find a threshold (0.986) for which the true positive rate was 0.297 and the false positive rate was about 0.009. That's interesting: it recognises almost 30% of the opportunities to earn, and it seldom produces false signals…

table(prediction=i$x<0.986,expected=i$y)
#          expected
#prediction FALSE TRUE
#     FALSE   228   90
#     TRUE      2   38
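The rates quoted above can be read straight off this matrix (38 true positives, 2 false positives, 90 false negatives, 228 true negatives):

```r
tp <- 38; fp <- 2; fn <- 90; tn <- 228  # values from the confusion matrix above
tpr <- tp / (tp + fn)   # 38/128, about 0.297
fpr <- fp / (fp + tn)   # 2/230,  about 0.009
round(c(tpr = tpr, fpr = fpr), 3)
```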

So let’s estimate how much we can earn using this strategy.

library(zoo)
#compound the day-over-day price ratios on the days we traded,
#charging a 0.4% transaction fee per executed trade
earn <- function(prediction, closings) {
  rels <- rollapply(closings, 2, function(x) x[2] / x[1])
  prod(rels[prediction]) * 0.996 ^ length(rels[prediction])
}
earn(i$x < 0.986, ford$Close)
#[1] 2.052437
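To see what earn() is compounding, here is a tiny hand-computable example with made-up closing prices: each TRUE in the signal vector means "buy at this close, sell at the next close", and each executed trade pays a 0.4% fee.

```r
closes <- c(10.0, 10.5, 10.4, 11.0)             # made-up closing prices
rels   <- closes[-1] / closes[-length(closes)]  # day-over-day price ratios
signal <- c(TRUE, FALSE, TRUE)                  # trade on the 1st and 3rd day
gross  <- prod(rels[signal])                    # (10.5/10) * (11.0/10.4)
net    <- gross * 0.996 ^ sum(signal)           # 0.4% fee per executed trade
```

Here gross is about 1.11, and the two fees shave it down a little; a final value above 1 means the strategy ends with more money than it started with.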

That means 2.05 times the initial money at the end of the investment. Seems good, so the next question is…

Will this method make me rich?

No. This rule probably gives good results only for Ford stock and only in this time range. Let's verify the method on PKO BP quotes (from the Warsaw Stock Exchange, GPW) from 2014-11-10 to 2014-06-06.

#read the data
pkobp <- read.csv(file = "pko_d.csv", as.is = TRUE)

#use the model to predict buy signals
signals <- pkobp$Close / pkobp$Open < 0.986
signals <- signals[1:(length(signals) - 1)] #we don't need the last prediction

#show the confusion matrix
expected <- diff(pkobp$Close, lag = 1) > 0.06
table(prediction = signals, expected = expected)

##           expected
## prediction FALSE TRUE
##      FALSE  1082  907
##      TRUE    207  202

The model produced 409 buy signals, but 207 of them were false. That is not a good score… But maybe it is enough to earn some money?

earn(signals,pkobp$Close)
## [1] 0.2435

Nope. This strategy would lose about 76% of the money.

But let’s check one more: Alior (2012-12-14 to 2014-06-06).

alr <- read.csv(file = "alr_d.csv", as.is = TRUE)
signals_a <- alr$Close / alr$Open < 0.986
signals_a <- signals_a[1:(length(signals_a) - 1)]
earn(signals_a, alr$Close)
## [1] 0.7344

It loses again.

What is wrong with my forecast?

The problem is that I found a relationship that held for the “training” data set but was false for other cases. As a result, my “model” has no generalization ability; it caught some noise. To prevent problems like that (overfitting), methods like cross-validation can be used, but that is a topic for a whole post.
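A minimal sketch of that idea on fabricated data: choose the threshold on a training window only, then judge the rule on a held-out test window. (All the numbers below are invented for illustration; the fabricated data has no real relationship, so the honest test estimate should hover near chance level.)

```r
set.seed(42)
# fabricated ratios and next-day outcomes with no real relationship
ratio   <- runif(500, 0.96, 1.04)
outcome <- runif(500) < 0.3

train <- 1:350
test  <- 351:500
thresholds <- seq(0.96, 1.04, by = 0.005)

# pick the threshold that looks best on the training part only
acc  <- sapply(thresholds, function(t) mean((ratio[train] < t) == outcome[train]))
best <- thresholds[which.max(acc)]

# the honest performance estimate comes from the untouched test part
mean((ratio[test] < best) == outcome[test])
```

Full k-fold cross-validation repeats this split several times and averages the test scores, which makes the estimate less dependent on one particular split.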

Are You a Strong Player?

Bob Martin, in his book “Agile Software Development: Principles, Patterns, and Practices”, wrote:

People are the most important ingredient of success. A good process will not save the project from failure if the team doesn’t have strong players, but a bad process can make even the strongest of players ineffective. Even a group of strong players can fail badly if they don’t work as a team.

A strong player is not necessarily an ace programmer. A strong player may be an average programmer, but someone who works well with others. Working well with others, communicating and interacting, is more important than raw programming talent. A team of average programmers who communicate well is more likely to succeed than a group of superstars who fail to interact as a team.

Many developers think of people skills as innate or personality-related. Some may say: those ‘others’ are born leaders or speakers, not me. But the truth is that everybody was born crying and covered with mucus, unable to walk, talk, or even write Java code. All these skills had to be learned. The same applies to communication skills.

If you have worked for some time as a software developer, learning a new language or framework is something you are comfortable with. You can read a book or a tutorial, sit in front of your computer, and experiment. But learning people skills is different: you can rarely do it alone, just by reading a book, and it usually means getting out of your comfort zone.

During the first year of my career, most of the things I had to learn were not technical. I already knew Java, Spring, and Hibernate, maybe not very well, but well enough to complete my tasks. I had to learn about being part of a team, talking to a client, prioritizing, estimating risks, saying no… And I still have a lot to improve.

But one thing is motivating: knowledge about frameworks and technology may become obsolete quickly, but people skills will probably pay off for an entire lifetime.

Two Types of Problems Indicated by Low Test Coverage

There are two types of problems indicated by low test coverage, and they do not exclude each other. The first one is obvious: there are not enough tests. Yeah, everybody says that. But there is a second one: there is too much source code.

Why do I think that? If somebody says there is no sense in testing getters and setters, I agree. The problem is when it makes no sense to test 80% of the code, because it just translates one Data Transfer Object into another, and so on.

And it is not an easy problem to fix. It is easier to write the next unit test with dozens of mocks than to remove an unnecessary layer from the code. We don’t have the time and budget for that… But if your project manager or product owner loves numbers and indicators, like code coverage, maybe you can use that fact to convince them to plan some time for the necessary refactoring?