6  R and risk forecasting

There are many resources for learning R, and we do not wish to duplicate that here. See discussion in Section 2.4.

However, we use particular conventions, and some parts of R are especially useful for our purposes, so we provide an overview below. We also suggest being aware of the issues raised by Patrick Burns in The R Inferno.

6.1 Some relevant issues

6.1.1 Special characters

R uses both single quotes ' and double quotes " for strings, and you can use either. That is particularly useful if you have to include a quotation mark inside a string, like

s = 'This is a quote character " in the middle of a string\n'
cat(s)
This is a quote character " in the middle of a string

The special character \n means a new line, quite handy for printing.

6.1.2 Assignment: = or <-

By convention, R uses <- for assignment rather than the equal sign =. Both work in the vast majority of cases, but there is one infrequent exception where <- is needed: inside a function call, = names an argument instead of assigning a variable. We generally prefer =.

x = 3
y <- 3
x == y
TRUE
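
One case where the two operators differ, and plausibly the exception meant above, is inside a function call. The sketch below uses a hypothetical function f and hypothetical variable names purely for illustration.

```r
# Hypothetical helper used only for this illustration
f = function(x) x * 2

r1 = f(x = 10)    # inside a call, = only names the argument x; nothing is assigned
r2 = f(y <- 10)   # <- assigns y in the calling environment, then passes its value to f

r1
r2
y                 # y exists after the call because <- was used
```

Both calls return the same value, but only the second leaves a new variable behind.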

6.1.3 Global assignment <<-

Global variables are seen as undesirable, for good reason. Sometimes they cannot be avoided without making the code more complex, so there is a tricky tradeoff. We use <<- to put something into the global namespace; inside a function, it assigns to a variable outside the function rather than creating a local one.

GlobalVariable <<- 123.456
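
A minimal sketch of <<- inside a function, assuming a hypothetical counter variable and a hypothetical function increment(); neither is used elsewhere in the book.

```r
counter = 0                  # hypothetical variable defined outside the function

increment = function() {
  counter <<- counter + 1    # updates the existing counter outside the function
  invisible(counter)
}

increment()
increment()
counter                      # the two calls have changed the outer variable
```

Had we written counter = counter + 1 inside the function, only a local copy would have changed.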

6.1.4 Printing: cat() vs. print()

There are two ways one can print to the screen and to text files in R, cat() and print(). The former allows for all sorts of text formatting while the latter simply dumps something on the screen. They both have their uses.

cat("This is the answer. x=",x,", and y=",y,".\n")
print(x)
This is the answer. x= 3 , and y= 3 .
[1] 3

6.1.5 Some useful functions

R comes with a large number of functions. Below is a list of those most widely used in this book.

  • head
  • tail
  • cbind
  • rbind
  • cat
  • print
  • paste and paste0
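
The functions listed above can be tried out on a small example. This is only a sketch, and the matrix m and the labels are made up for illustration.

```r
m = cbind(a = 1:6, b = (1:6)^2)   # bind two columns into a 6 x 2 matrix

first_two = head(m, 2)            # first two rows
last_two = tail(m, 2)             # last two rows
m2 = rbind(m, c(7, 49))           # append one row at the bottom

label = paste("rows:", nrow(m2))     # paste() separates the pieces with a space
label0 = paste0("rows:", nrow(m2))   # paste0() uses no separator

print(first_two)
cat(label, "\n")
cat(label0, "\n")
```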

6.2 Statistical distributions

R provides functions for just about every distribution imaginable. We will use several of those: the normal, Student-t (and skewed Student-t), chi-square, binomial and Bernoulli. Each provides the PDF, CDF, inverse CDF and random numbers. The first letter of the function name indicates which of these four it is (d for the density, p for the CDF, q for the inverse CDF or quantile, and r for random numbers), and the remainder of the name is the distribution, for example:

  • dnorm, pnorm, qnorm, rnorm;
  • dt, pt, qt, rt.
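
To make the naming convention concrete, here is a short sketch that evaluates a few of these functions. The inputs are arbitrary and chosen only for illustration.

```r
dnorm(0)             # density of the standard normal at 0, equal to 1/sqrt(2*pi)
pnorm(1.96)          # probability of an outcome below 1.96, roughly 0.975
qnorm(0.975)         # the quantile with 97.5% of the probability below it, roughly 1.96
qnorm(pnorm(1.5))    # q and p are inverses of each other, so this returns 1.5
qt(0.975, df = 5)    # same quantile for a Student-t with 5 degrees of freedom,
                     # which is larger because of the fatter tails
```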

6.2.1 Distributions and densities

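par(mar=c(4,4,0.2,0.1))   # margins: bottom, left, top, right
x=seq(-3,3,length=100)    # grid of outcomes
z=seq(0,1,length=100)     # grid of probabilities
plot(x,dnorm(x))          # density (PDF)
plot(x,pnorm(x))          # distribution function (CDF)
plot(z,qnorm(z))          # inverse CDF (quantiles)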

6.2.2 Random numbers

We can easily simulate random numbers and do that quite frequently. One should always set the seed with set.seed() so that the results can be reproduced: the same seed gives the same sequence of random numbers.

rnorm(1)
rnorm(3)
rnorm(3)
set.seed(666)
rnorm(3)
set.seed(666)
rnorm(3)
0.349416872092314
  1. 1.41146382624218
  2. 0.0124550371497289
  3. 0.977193850464427
  1. 0.097939546648519
  2. 1.55616627015026
  3. 0.567697284487257
  1. 0.753311046217783
  2. 2.01435466569865
  3. -0.355134460371891
  1. 0.753311046217783
  2. 2.01435466569865
  3. -0.355134460371891
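
The effect of the seed can also be checked directly rather than by eye. A minimal sketch, with an arbitrary seed value of 42:

```r
set.seed(42)
a = rnorm(5)

set.seed(42)         # resetting the seed restarts the same sequence
b = rnorm(5)

later = rnorm(5)     # drawn without resetting, so the sequence continues

identical(a, b)      # TRUE, the two draws match exactly
identical(b, later)  # FALSE, these are the next numbers in the sequence
```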

6.3 Plotting

R has several ways to plot data. The simplest, known as base plots, is what we will use for the remainder of this book. You can make better-looking plots with the ggplot2 package, which is used by the BBC and the New York Times for their plots. Other packages exist, like plotly, which is especially useful for plots viewed in a browser.

There are five reasons why we generally use base plots.

  1. They are simpler to use than the alternatives;
  2. The alternatives are more buggy;
  3. Base plots can be included in LaTeX documents with the same font as the document itself, and also allow LaTeX equations;
  4. It is really hard to make sub-tickmarks with ggplot2;
  5. We aim to use as much of what is supplied with R as possible, without relying on other libraries.

6.3.1 Base plotting

The default R plot is very ugly.

par(mar=c(4,4,0.2,0.1))
plot(x,dnorm(x))

But there are many ways it can be made more visually appealing. Furthermore, it is always helpful to control the plot margins, and quite possibly set other characteristics.

The margins are set by par(mar=c(bottom, left side, top, right side)).

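par(mar=c(4,4,1,0))      # margins: bottom, left, top, right
plot(x,dnorm(x),
    type='l',            # draw a line instead of points
    lwd=1.5,             # line width
    col="blue",
    las=1,               # horizontal axis labels
    bty='l',             # only draw the left and bottom sides of the box
    xlab="Outcomes",
    ylab="Probability",
    main="The normal density"
    )
w=seq(-3,3,by=0.5)
axis(1,at=w,labels=FALSE,tcl=-0.3)   # extra, unlabelled tick marks on the x-axis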
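
Building on the settings above, the following sketch overlays a second density and adds a legend, writing the result to a PDF file. The temporary file name, the choice of a Student-t with 3 degrees of freedom and the figure size are assumptions made only for this example.

```r
out = tempfile(fileext = ".pdf")       # hypothetical output file
pdf(out, width = 6, height = 4)        # open a PDF graphics device

par(mar = c(4, 4, 1, 0))
x = seq(-3, 3, length.out = 100)
plot(x, dnorm(x), type = "l", lwd = 1.5, col = "blue", las = 1, bty = "l",
     xlab = "Outcomes", ylab = "Density")
lines(x, dt(x, df = 3), col = "red", lty = 2)    # add the Student-t density
legend("topright", legend = c("Normal", "Student-t(3)"),
       col = c("blue", "red"), lty = c(1, 2), bty = "n")

invisible(dev.off())                   # close the device so the file is written
```

When working interactively, the pdf() and dev.off() lines can be dropped and the plot appears on screen.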

6.4 Packages/libraries

As standard, R comes with a lot of functionality, but the strength of R is in all the packages available for it. The ecosystem is much richer than for any other language when it comes to statistics. Some of these packages come with R, but most have to be downloaded separately, either using the install.packages() command or a menu in RStudio.

We load the packages using the library() command. Some of them come with annoying start-up messages, which can be suppressed by the suppressPackageStartupMessages() command.

Best practice is to load all the packages used in a code file at the top of that file.
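
A minimal sketch of what the top of a code file might look like. We use the parallel package here only because it ships with R and therefore needs no installation.

```r
# Load every package at the top of the file, silencing start-up messages
suppressPackageStartupMessages(library(parallel))

n_cores = detectCores()   # a function provided by the parallel package
n_cores
```

Packages that do not come with R would first need a one-off install.packages("name") before library() can load them.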

Here are the packages we make most use of in this book.

  • reshape2: re-shape data frames. Very useful when data is arranged in an unfriendly way;
  • moments: skewness and kurtosis;
  • tseries: time series analysis;
  • zoo: time series objects;
  • lubridate: date manipulation;
  • car: QQ plots;
  • parallel: multi-core calculations;
  • nloptr: optimisation algorithms;
  • rugarch: univariate volatility models;
  • rmgarch: multivariate volatility models.

6.5 Variables

Objects in R can be of different classes. For example, we can have a vector, which is an ordered array of observations of data of the same type (numbers, characters, logicals). We can also have a matrix, which is a rectangular arrangement of elements of the same data type. You can check what class an object is by running class(object).

6.5.1 Integers, reals and strings

The most used variable type is a real number, followed by integers and strings.

x = 1
y = 5.3
w=0.034553
z = "Lean R"
x+y
class(x)
class(y)
class(z)
6.3
'numeric'
'numeric'
'character'
x+z  # will not work because z is a string
ERROR: Error in x + z: non-numeric argument to binary operator

To print variables we can use cat() or print(). To turn numbers into strings we can use paste() or paste0(). In strings, a new line is \n.

cat(x)
cat(y)
cat('\n',x,'\n')
cat(y,'\n')
cat(w,'\n')
cat("Important number is x=",x,"and y=",y,"\n")
s=paste0("The return is ",round(100*w,1),"%")
cat(s,"\n")
1
5.3

 1 
5.3 
0.034553 
Important number is x= 1 and y= 5.3 
The return is 3.5% 

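6.5.2 Missing values: NA

We use the special value NA ("not available") to indicate a missing value, i.e. we don’t know what the value is. Strictly speaking, not-a-number is a separate value, NaN, which R returns for undefined calculations such as 0/0. NA becomes useful in backtesting.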

a=NA
a
<NA>
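
A short sketch of how NA behaves in calculations, using a hypothetical vector of returns. NaN, which R uses for undefined results such as 0/0, is shown for comparison.

```r
returns = c(0.01, NA, -0.02, 0.03)   # hypothetical returns with one missing value

is.na(returns)                # TRUE only for the missing element
mean(returns)                 # NA, since one of the inputs is unknown
mean(returns, na.rm = TRUE)   # average of the three known values

0/0                           # NaN
is.na(0/0)                    # TRUE, NaN is also treated as missing
```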

6.5.3 TRUE and FALSE

We often use logical variables, which are either TRUE or FALSE. Note that both are spelt in uppercase.

W = TRUE
r = !W
W
r
TRUE
FALSE

6.5.4 Vectors

R comes with vectors. Note that R does not know if they are column vectors or row vectors, which becomes important in matrix algebra.

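v = vector(length=4)   # logical vector, initialised to FALSE
v 
v[] = NA               # set every element to NA
v
v[2:3] = 2             # set elements 2 and 3
v
v=seq(1,5)             # integers from 1 to 5
v
v=seq(-1,2,by=0.5)     # from -1 to 2 in steps of 0.5
v
v=c(1,3,7,3,0.4)*3     # combine values, then multiply each by 3
v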
  1. FALSE
  2. FALSE
  3. FALSE
  4. FALSE
  1. <NA>
  2. <NA>
  3. <NA>
  4. <NA>
  1. <NA>
  2. 2
  3. 2
  4. <NA>
  1. 1
  2. 2
  3. 3
  4. 4
  5. 5
  1. -1
  2. -0.5
  3. 0
  4. 0.5
  5. 1
  6. 1.5
  7. 2
  1. 3
  2. 9
  3. 21
  4. 9
  5. 1.2
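
The point that a vector is neither a row nor a column can be verified with dim(). A small sketch, with an arbitrary vector chosen for illustration:

```r
v = c(1, 3, 7)

dim(v)              # NULL, a plain vector has no dimensions
dim(t(v))           # 1 3, transposing turns it into a 1 x 3 matrix
dim(as.matrix(v))   # 3 1, as.matrix() turns it into a 3 x 1 matrix

inner = v %*% v     # R picks the orientations so that this is an inner product
inner
```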

6.5.5 Matrices

R can create two- and three-dimensional matrices. We usually only use the two-dimensional type, but will encounter the three-dimensional type in the multivariate volatility models.

Matrices have column names which can be quite useful.

m=matrix(ncol=2,nrow=3)
m
m=matrix(3,ncol=2,nrow=3)
m
m=cbind(v,v)
m 
m=rbind(v,v)
m 
A matrix: 3 × 2 of type lgl
NA  NA
NA  NA
NA  NA
A matrix: 3 × 2 of type dbl
3  3
3  3
3  3
A matrix: 5 × 2 of type dbl
   v     v
 3.0   3.0
 9.0   9.0
21.0  21.0
 9.0   9.0
 1.2   1.2
A matrix: 2 × 5 of type dbl
v  3  9  21  9  1.2
v  3  9  21  9  1.2

We can access individual elements of matrices and vectors.

m[1,2]
m[,2]
m[2,]
m[1,3:5]
v[2:3]
v: 9
v: 9  v: 9
  1. 3
  2. 9
  3. 21
  4. 9
  5. 1.2
  1. 21
  2. 9
  3. 1.2
  1. 9
  2. 21

We can name the columns with colnames(). The same function also works for the data frames discussed below, although names() is often used there as well.

m=cbind(rnorm(4),rnorm(4))
m
colnames(m)=c("Stock A","Stock B")
m
A matrix: 4 × 2 of type dbl
 2.0281678  -0.80251957
-2.2168745  -1.79224083
 0.7583962  -0.04203245
-1.3061853   2.15004262
A matrix: 4 × 2 of type dbl
   Stock A      Stock B
 2.0281678  -0.80251957
-2.2168745  -1.79224083
 0.7583962  -0.04203245
-1.3061853   2.15004262
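
The three-dimensional case mentioned at the start of this section can be created with array(). The sketch below is only illustrative; the dimensions are arbitrary, and one could think of the third dimension as holding one 2 × 2 matrix per time period.

```r
a = array(0, dim = c(2, 2, 3))   # three 2 x 2 matrices stacked along the third dimension
a[, , 1] = diag(2)               # fill the first slice with an identity matrix

dim(a)
a[, , 1]      # the first 2 x 2 slice
a[1, 2, 3]    # row 1, column 2 of the third slice
```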

6.5.6 Lists

We quite often need to keep track of a lot of variables that belong together, and then the R list is very useful. It allows us to group multiple variables in one list.

l=list()
l$a=2
l$b="R is great"
l=list(l=c(2,3),b="Risk")
w=list()
w$q="my list"
w$l = l 
w$df=data.frame(cbind(c(1,2),c("VaR","ES")))
w
$q
'my list'
$l
$l
  1. 2
  2. 3
$b
'Risk'
$df
A data.frame: 2 × 2
X1     X2
<chr>  <chr>
1      VaR
2      ES
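
A common use of lists is to return several results from one function. The sketch below assumes a hypothetical function risk_summary() and made-up data; it is not a function used elsewhere in the book.

```r
# Hypothetical function that returns several results in one list
risk_summary = function(x) {
  list(mean = mean(x), sd = sd(x), n = length(x))
}

res = risk_summary(c(0.01, -0.02, 0.03, 0.00))

names(res)       # the names of the elements in the list
res$n            # access an element with $
res[["mean"]]    # or with double brackets and the name as a string
```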

6.5.7 NULL

R is an old language that has evolved erratically over time. This means it has many undesirable features that can lead to difficult bugs. One is a variable type called NULL, meaning nothing. While it can be useful, the problem is that NULL is not only used inconsistently, it can be outright dangerous.

1+What # variable What is not defined
ERROR: Error in eval(expr, envir, enclos): object 'What' not found

which makes sense. But consider this

l=list()
l$What
NULL

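The variable What does not exist in the list l, but we can still access it with l$What, and we get NULL instead of an error.

l$What+3

And when doing math with it, it fails silently: the result is an empty vector, numeric(0), with no error or warning.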
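
One way to protect against this is to test for NULL explicitly with is.null() before using the element. A minimal sketch, with a made-up list and message:

```r
l = list(a = 2)

res = l$What + 3      # no error, NULL + 3 gives an empty numeric vector
length(res)           # 0

# Guard against the missing element before doing any calculations
if (is.null(l$What)) {
  msg = "What is not in the list"
} else {
  msg = paste("What + 3 =", l$What + 3)
}
msg
```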

When we need to delete columns from a data frame or an element from a list, we assign NULL to it.

df$DeleteMe = NULL

I know, not very intuitive.
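
A self-contained sketch of deleting by assigning NULL, using a made-up data frame and list:

```r
df = data.frame(Keep = 1:3, DeleteMe = c("a", "b", "c"))
df$DeleteMe = NULL     # removes the column
colnames(df)           # only "Keep" remains

l = list(a = 1, b = 2)
l$b = NULL             # removes the element b
names(l)               # only "a" remains
```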

6.6 Matrix algebra

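When dealing with vectors and matrices, * is element-by-element multiplication, while %*% is matrix multiplication. This becomes important when dealing with portfolios. Note that R vectors only have one dimension; they are not row or column vectors. When a vector is multiplied element-by-element with a larger matrix, its values are recycled down the columns of the matrix.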

weight = c(0.3,0.7)
prices=cbind(runif(5)*100,runif(5)*100)
weight
prices
  1. 0.3
  2. 0.7
A matrix: 5 × 2 of type dbl
 3.834435  61.21745
14.149569  55.33484
80.638553  85.35008
26.668568  46.97785
 4.270205  39.76166
weight * prices # element-by-element multiplication
A matrix: 5 × 2 of type dbl
 1.150330  42.85222
 9.904698  16.60045
24.191566  59.74505
18.667997  14.09336
 1.281062  27.83316
weight %*% prices # matrix multiplication
ERROR: Error in weight %*% prices: non-conformable arguments
weight %*% t(prices) # matrix multiplication
A matrix: 1 × 5 of type dbl
44.00255  42.97926  83.93662  40.88507  29.11422
prices %*% weight # matrix multiplication
A matrix: 5 × 1 of type dbl
44.00255
42.97926
83.93662
40.88507
29.11422
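
The difference between the two operators matters in practice. In the output above, weight * prices does not multiply the first column by 0.3 and the second by 0.7; the weights are recycled down the columns instead. The sketch below uses small made-up prices to show this, and uses sweep() as one possible way to weight each column separately.

```r
weight = c(0.3, 0.7)
prices = cbind(c(10, 20, 30), c(100, 200, 300))   # made-up prices for two assets

recycled = weight * prices                  # weights alternate down each column
by_asset = sweep(prices, 2, weight, "*")    # column j multiplied by weight[j]
portfolio = prices %*% weight               # weighted sum across the two assets

recycled[2, 1]     # 0.7 * 20, probably not what was intended
by_asset[2, 1]     # 0.3 * 20
portfolio          # 5 x 1 would be the size with the data above; here it is 3 x 1
```

The row sums of by_asset equal the matrix product prices %*% weight, which is usually the portfolio quantity we are after.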

6.7 Data frames

Matrices have some limitations. All elements of a matrix must be of the same type; for example, we can’t have one column with strings and another with numbers. To deal with that, R comes with data frames, which can be thought of as more flexible matrices. We usually have to use both. For example, it is quite costly to insert new data into a data frame, perhaps by df[3,4]=42, but not to do the same for a matrix.

A data frame is a two-dimensional structure in which each column contains the values of one variable and each row contains one set of values, or “observation”, one from each column. It is perhaps the most common way of storing data in R and the one we will use the most.

One of the main advantages of a data frame in comparison to a matrix is that each column can have a different data type. For example, you can have one column with numbers, one with text, one with dates, and one with logicals, whereas a matrix limits you to only one data type. Keep in mind that a data frame needs all its columns to be of the same length.

6.7.1 Accessing the data from columns

We can access data from columns by number, like df[,3] but since all the columns have names, it is usually much better to access them by column name, like df$returns.
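
A small sketch of the different ways to access a column, using a made-up data frame with hypothetical column names date and returns:

```r
df = data.frame(date = c("2024-01-02", "2024-01-03", "2024-01-04"),
                returns = c(0.012, -0.004, 0.007))

df[, 2]            # by column number
df$returns         # by column name
df[["returns"]]    # by column name given as a string
df$returns[2]      # second observation of the returns column
```

Accessing by name keeps working if columns are later added or reordered, which is why we prefer it.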

6.7.2 Creating a data frame from scratch

There are several different ways to create a data frame. One is loading from a file, which we do later. Alternatively, we might want to create a data frame from a list of vectors. This can easily be done with the data.frame() function:

df <- data.frame(col1 = 1:3,
                 col2 = c("A", "B", "C"),
                 col3 = c(TRUE, TRUE, FALSE),
                 col4 = c(1.0, 2.2, 3.3))

You have to specify the name of each column, and what goes inside it. Note that all vectors need to be the same length. We can now check the structure:

str(df)

dim(df)
colnames(df)
'data.frame':   3 obs. of  4 variables:
 $ col1: int  1 2 3
 $ col2: chr  "A" "B" "C"
 $ col3: logi  TRUE TRUE FALSE
 $ col4: num  1 2.2 3.3
  1. 3
  2. 4
  1. 'col1'
  2. 'col2'
  3. 'col3'
  4. 'col4'

6.7.3 Transforming a different object into a Data Frame

You might want to transform a matrix into a data frame (or vice versa). For example, you need an object to be a matrix to perform linear algebra operations, but you would like to keep the results as a data frame after the operations. You can easily switch from matrix to data frame using as.data.frame() (and analogously, from data frame to matrix with as.matrix(); however, remember that all columns need to have the same data type to be a matrix).

For example, let’s say we have the matrix:

my_matrix <- matrix(1:10, nrow = 5, ncol = 2, byrow = TRUE)
class(my_matrix)
my_matrix
  1. 'matrix'
  2. 'array'
A matrix: 5 × 2 of type int
1  2
3  4
5  6
7  8
9 10

We can now transform it into a data frame:

df = as.data.frame(my_matrix)
class(df)
df
str(df)
'data.frame'
A data.frame: 5 × 2
V1     V2
<int>  <int>
1      2
3      4
5      6
7      8
9      10
'data.frame':   5 obs. of  2 variables:
 $ V1: int  1 3 5 7 9
 $ V2: int  2 4 6 8 10

And we can change the column names:

colnames(df) = c("Odd", "Even")
df
A data.frame: 5 × 2
Odd    Even
<int>  <int>
1      2
3      4
5      6
7      8
9      10
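
Going in the other direction, as.matrix() has to find one common type for all columns. The sketch below, with made-up data frames, shows what happens when the columns are mixed.

```r
num_df = data.frame(a = c(1, 2), b = c(3, 4))        # all columns numeric
mix_df = data.frame(a = c(1, 2), b = c("x", "y"))    # numbers and text

m_num = as.matrix(num_df)
m_mix = as.matrix(mix_df)

is.numeric(m_num)      # TRUE, the matrix can be used for linear algebra
is.character(m_mix)    # TRUE, the numbers have been converted to strings
```

No error or warning is given in the mixed case, so it is worth checking the type before doing calculations.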

6.7.4 Alternatives to data frames

The R data frames suffer from having been proposed decades ago and therefore lack some very useful features one might expect, and they can be very slow. In response, there are two alternatives, each with their own pros and cons. We will not use either of those in this book because we want to use base R packages wherever possible.

6.7.4.1 Data.table

The data.table class is designed for performance and features. It is by far the fastest when using large datasets, and it also has very useful features built in that really facilitate data work. In our work we use data.table.

6.7.4.2 Tidy

The other alternative is the tibble, part of the tidyverse. The tidyverse has a lot of useful features, with the richest data manipulation tools in R.

6.8 Source files

It can be very useful to include other R files in some R code. The function to do that is source('file.r').

We make use of source when we include an R file called functions.r containing useful functions that are used frequently.
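
A self-contained sketch of how source() works. To keep it runnable, it writes a small helper file to a temporary location first; the function annualise() and the assumption of 250 trading days are hypothetical and are not the contents of functions.r.

```r
# Write a small file of helper functions to a temporary location
helper_file = tempfile(fileext = ".r")
writeLines("annualise = function(daily_vol) daily_vol * sqrt(250)", helper_file)

source(helper_file)    # runs the file, which defines annualise()
annualise(0.01)        # the function is now available in our session
```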