7 Variables in R

This chapter outlines the basic elements of R that support the later risk-forecasting material. It reviews assignment and the use of vectors, matrices, lists and dataframes. The discussion is framed with reference to typical portfolio data such as S&P 500 prices and benchmark weights.

7.1 Assignment

In R, we use the equal sign, =, to assign a value to a variable. The variable’s name is on the left, and the value to be stored is on the right.

R also supports the left arrow operator <- for assignment, which is often preferred by R programmers and appears in many R style guides. Both operators work identically for most purposes:

# These are equivalent
x = 3
y <- 4

We use = throughout this notebook because it is more compatible with other programming languages and may be more familiar to those with experience in Python, Julia or similar languages.

x = 3
y = 4
x == y  # Two equal signs test for equality

[1] FALSE

Note that two equal signs, ==, are used to test for equality, not assignment.
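
The statement above that = and <- work identically for most purposes has one notable exception: inside a function call, = names an argument, while <- still performs an assignment. A minimal sketch, where the variable name z is only for illustration:

```r
mean(x = c(1, 2, 3))     # = names the argument x of mean(); no variable is stored
mean(z <- c(4, 5, 6))    # <- creates the variable z, then passes it to mean()
z                        # z now exists in the workspace
```

This is one reason some style guides prefer <-, although it rarely matters for the code in these notes.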

7.2 R data structures

R provides several built-in data structures for organising and manipulating data. Understanding these structures is fundamental to working effectively with R.

7.2.1 Vectors

The most basic data structure in R is the vector. Note that an R vector is neither a column vector nor a row vector, which becomes important in matrix algebra.

v = vector(length=4)
v

[1] FALSE FALSE FALSE FALSE

v[] = NA
v

[1] NA NA NA NA

v[2:3] = 2
v

[1] NA  2  2 NA

v=seq(1,5)
v

[1] 1 2 3 4 5

v=seq(-1,2,by=0.5)
v

[1] -1.0 -0.5  0.0  0.5  1.0  1.5  2.0

v=c(1,3,7,3,0.4)*3
v

[1]  3.0  9.0 21.0  9.0  1.2

One way to create vectors is with c(), which combines its arguments into a single vector.

x=c(1,4,0.9,"ss")
x

[1] "1"   "4"   "0.9" "ss"

Here, we mixed numbers and strings, and every element became a string, because a vector can only hold values of one type.

x=c(1,4,0.9)
x

[1] 1.0 4.0 0.9

Here, by contrast, we only have numbers, and they stay numbers.
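
The type conversion described above can be checked with class(). The following is a small sketch with illustrative values only; the variable name converted is an assumption made for this example.

```r
class(c(1, 4, 0.9))          # "numeric"
class(c(1, 4, 0.9, "ss"))    # "character": the numbers were turned into strings
class(c(TRUE, FALSE, 2))     # "numeric": logicals become 1 and 0

# Converting back: strings that are not numbers become NA (with a warning)
converted = suppressWarnings(as.numeric(c("1", "4", "0.9", "ss")))
converted
```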

7.2.2 Matrices

R matrices are two-dimensional, and R arrays extend them to three or more dimensions. We usually only work with the two-dimensional type, but we will encounter the three-dimensional type in the multivariate volatility models.
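
As a brief illustration of the three-dimensional case, the sketch below uses the base R function array(). The dimensions and values are arbitrary and chosen only for this example.

```r
a = array(0, dim = c(2, 2, 3))   # three 2x2 slices, all zeros
dim(a)
a[, , 1] = diag(2)               # fill the first slice with an identity matrix
a[1, 1, 1]                       # a single element
is.matrix(a[, , 1])              # each slice is an ordinary matrix
```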

Matrices can have column names, which can be quite useful.

m=matrix(ncol=1,nrow=3)
m

     [,1]
[1,]   NA
[2,]   NA
[3,]   NA

m=matrix(ncol=2,nrow=3)
m

     [,1] [,2]
[1,]   NA   NA
[2,]   NA   NA
[3,]   NA   NA

m=matrix(3,ncol=2,nrow=3)
m

     [,1] [,2]
[1,]    3    3
[2,]    3    3
[3,]    3    3

We can also make, or add to, matrices with cbind and rbind, something we do quite often in these notes.

v=c(1,3,7,3,0.4)*3
m=cbind(v,v)
m

        v    v
[1,]  3.0  3.0
[2,]  9.0  9.0
[3,] 21.0 21.0
[4,]  9.0  9.0
[5,]  1.2  1.2

m=rbind(v,v)
m

  [,1] [,2] [,3] [,4] [,5]
v    3    9   21    9  1.2
v    3    9   21    9  1.2

We can access individual elements of matrices and vectors.

m[1,2]

v 
9

m[,2]

v v 
9 9

m[2,]

[1]  3.0  9.0 21.0  9.0  1.2

m[1,3:5]

[1] 21.0  9.0  1.2

v[2:3]

[1]  9 21

We can name the columns with colnames(). The same function also works on the dataframes below, where names() is a common alternative.

m=cbind(rnorm(4),rnorm(4))
m

           [,1]      [,2]
[1,]  1.1744566 -1.352977
[2,]  0.6810622  1.027312
[3,] -0.8922425  0.491728
[4,]  0.2452143  1.888752

colnames(m)=c("Stock A","Stock B")
m

        Stock A   Stock B
[1,]  1.1744566 -1.352977
[2,]  0.6810622  1.027312
[3,] -0.8922425  0.491728
[4,]  0.2452143  1.888752

7.2.3 Lists

We often need to keep track of many variables that belong together, and the R list object is very useful. It allows us to group multiple variables in one list.

l=list()
l$a=2
l$b= "R is great."
l=list(l=c(2,3),b="Risk")
w=list()
w$q= "my list"
w$l = l
w$df=data.frame(cbind(c(1,2),c("VaR","ES")))
w

$q
[1] "my list"

$l
$l$l
[1] 2 3

$l$b
[1] "Risk"


$df
  X1  X2
1  1 VaR
2  2  ES

We can find out what is in a list

names(w)

[1] "q"  "l"  "df"

and access individual elements.

w$l

$l
[1] 2 3

$b
[1] "Risk"

We make extensive use of lists in these notes.
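
A typical use is returning several results from one function. The sketch below assumes a hypothetical helper, SummariseReturns(), which is not part of these notes, and made-up return data.

```r
# Hypothetical helper: collect several summary statistics in one list
SummariseReturns = function(y) {
  res = list()
  res$mean = mean(y)
  res$sd = sd(y)
  res$n = length(y)
  return(res)
}

s = SummariseReturns(c(0.01, -0.02, 0.015, 0.003))
names(s)
s$n
```

The caller then picks out what it needs by name, such as s$sd, without having to remember the order of the results.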

7.3 Dataframes

Matrices have some limitations, the most important being that all elements must be of the same type. To deal with that, R comes with dataframes, which can be thought of as more flexible matrices. We usually have to use both. For example, it is quite costly to insert new data into a dataframe, perhaps by df[3,4]=42, but not to do the same for a matrix.

Also, some functions insist on a matrix or a data frame, even if they should be able to handle both.

A dataframe is a two-dimensional structure in which each column contains values of one variable, and each row contains one set of values from each column. It is the most common way of storing data in R and the one we will use the most.

One of the main advantages of a dataframe over a matrix is that each column can have a different data type. For example, you can have one column with numbers, one with text, one with dates, and one with logicals, whereas a matrix limits you to only one data type. Keep in mind that a dataframe needs all its columns to be of the same length.

Only a few things come for free, and there are downsides to dataframes. One is that accessing elements can be very slow. For example, in code that iteratively puts numbers into a matrix, as we do in the backtesting chapter (Chapter 22), dataframes can be too slow. Then, it might be best to use matrices and subsequently convert them into dataframes. Some R functions, especially those belonging to old libraries, only accept dataframes and not matrices (or vice versa).

7.3.1 Iteratively creating data frames

It can be very tempting to iteratively increase the size of a dataframe in a for loop, something you will encounter in many applications in these notes. This approach can be quite slow, because each time the dataframe grows, R typically has to copy the whole object. If we are only doing a few increases, that may be acceptable, but for larger operations it is generally better to pre-allocate a dataframe or a matrix and then insert values into it. For speed, it is often preferable to do that with a matrix, not a dataframe.
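
The two approaches can be compared side by side. This is only a sketch with made-up data; the names grow, pre and n are assumptions for this example, and the actual speed difference depends on the size of the problem.

```r
n = 500

# Growing a dataframe one row at a time (slow for large n)
grow = data.frame()
for (i in 1:n) {
  grow = rbind(grow, data.frame(i = i, sq = i^2))
}

# Pre-allocating a matrix, filling it, and converting once at the end
m = matrix(NA, nrow = n, ncol = 2)
colnames(m) = c("i", "sq")
for (i in 1:n) {
  m[i, ] = c(i, i^2)
}
pre = as.data.frame(m)

all.equal(grow$sq, pre$sq)   # both approaches give the same numbers
```

Wrapping each loop in system.time() shows how the gap grows as n increases.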

7.3.2 Accessing the data from columns

We can access data from columns by number, like df[,3], but since all the columns have names, it is usually much better to access them by column name, like df$returns.
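
The following sketch illustrates the difference, using a small made-up dataframe whose column names are assumptions for this example.

```r
# Hypothetical example data
df = data.frame(date = c("2024-01-02", "2024-01-03", "2024-01-04"),
                price = c(100, 101, 99),
                returns = c(NA, 0.01, -0.0198))

df[, 3]          # by position: breaks if the column order changes
df$returns       # by name
df[["returns"]]  # by name, useful when the name is held in a variable
col = "price"
df[[col]]
```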

7.3.3 Creating a dataframe from scratch

There are several different ways to create a dataframe. One is loading from a file, which we will do later. Alternatively, we could create a dataframe from a list of vectors. This can easily be done with the data.frame() function:

df = data.frame(col1 = 1:3,
 col2 = c("A", "B", "C"),
 col3 = c(TRUE, TRUE, FALSE),
 col4 = c(1.0, 2.2, 3.3))

You have to specify the name of each column and what goes inside it. Note that all vectors need to be the same length. We can now check the structure:

str(df) # display the structure of the dataframe

'data.frame':   3 obs. of  4 variables:
 $ col1: int  1 2 3
 $ col2: chr  "A" "B" "C"
 $ col3: logi  TRUE TRUE FALSE
 $ col4: num  1 2.2 3.3

dim(df) # dimension

[1] 3 4

colnames(df) # column names

[1] "col1" "col2" "col3" "col4"

7.3.4 Transforming a different object into a dataframe

You might want to transform a matrix into a dataframe (or vice versa). For example, you need an object to be a matrix to perform linear algebra operations, but you would like to keep the results as a dataframe after the operations. You can easily switch from matrix to dataframe using as.data.frame() and, analogously, from dataframe to matrix with as.matrix(). Remember, however, that a matrix can only hold one data type, so all columns need to have the same type for the conversion to be useful.

For example, let’s say we have the matrix:

myMatrix = matrix(1:10, nrow = 5, ncol = 2, byrow = TRUE)
class(myMatrix)

[1] "matrix" "array"

myMatrix

     [,1] [,2]
[1,]    1    2
[2,]    3    4
[3,]    5    6
[4,]    7    8
[5,]    9   10

We can now transform it into a dataframe:

df = as.data.frame(myMatrix)
class(df)

[1] "data.frame"

df

  V1 V2
1  1  2
2  3  4
3  5  6
4  7  8
5  9 10

str(df)

'data.frame':   5 obs. of  2 variables:
 $ V1: int  1 3 5 7 9
 $ V2: int  2 4 6 8 10

And we can change the column names:

colnames(df) = c("Odd", "Even")
df

  Odd Even
1   1    2
2   3    4
3   5    6
4   7    8
5   9   10
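
Going the other way, from dataframe to matrix, the requirement that a matrix holds a single type matters. The sketch below uses two small made-up dataframes; the names mixed, nums, mm and nm are assumptions for this example.

```r
# A dataframe with mixed column types becomes a character matrix
mixed = data.frame(id = 1:2, label = c("VaR", "ES"))
mm = as.matrix(mixed)
class(mm[1, 1])     # "character": the numbers were converted to strings

# A dataframe with only numeric columns becomes a numeric matrix
nums = data.frame(a = c(1, 2), b = c(3, 4))
nm = as.matrix(nums)
nm %*% t(nm)        # linear algebra now works
```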

7.4 Alternatives to dataframes

R dataframes were designed decades ago and, therefore, lack some very useful features one might expect, and they can be very slow. In response, two alternatives have emerged, each with its pros and cons. We will not use either of those in this book because we want to use base R packages wherever possible.

We discuss one case where one of those is needed in Section 8.2.2, which deals with compressed CSV files.

7.4.1 `data.table`

The data.table class is designed for performance and features. It is typically the fastest of the three on large datasets, and it also has very useful features built in that really facilitate data work. In our own work, outside these notes, we use data.table.

7.4.2 Tidy

The other alternative is tidy data, part of the tidyverse. It has many useful features and the richest set of data manipulation tools in R.

7.4.3 Dataframes, `data.table` or tidy data?

While choice is good, it can be overwhelming. How should one choose between dataframes, data.table or tidy data? Dataframes have the advantage of being built into R; they are relatively simple, and for basic calculations that do not need a lot of performance, they might be the best choice.

data.table has the best performance, so if one has large datasets or is performing complicated data science operations on data, it is generally the best choice.

The tidyverse has the richest and most coherent way of doing data wrangling, that is, performing complicated operations on data. For many users in data science, the tidyverse is really the only thing they use in R. Consequently, the tidyverse is the best choice for applications that are mostly in data science unless one really needs performance.

If you search for opinions on dataframes vs. data.table vs. tidy, you often find very strong views in favour of one of these. While these perspectives can be informative, all three have their pros and cons, so just pick whichever you are most comfortable with and works best in your use case.

We only use dataframes in these notes because we want to keep the number of packages to a minimum and because dataframes are sufficient for our purposes.

7.5 R-specific concepts

This section covers concepts and behaviours that are specific to R and important to understand for effective programming.

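7.5.1 Missing values: NA

We use the special value NA, short for not available, to indicate a missing value, i.e. we do not know what the value is. It should not be confused with NaN, not-a-number, which R returns for undefined operations such as 0/0. NA becomes useful in backtesting in Chapter 22.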

a=NA
a

[1] NA
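
In practice, NA values propagate through calculations, so they usually need to be detected or excluded explicitly. A short sketch with made-up returns:

```r
y = c(0.01, NA, -0.02, 0.005)
is.na(y)                 # FALSE TRUE FALSE FALSE
mean(y)                  # NA: the missing value propagates
mean(y, na.rm = TRUE)    # ignore the missing value
NA == NA                 # NA, so test with is.na() rather than ==
0/0                      # NaN, which is distinct from NA
is.na(0/0)               # TRUE: is.na() also flags NaN
```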

7.5.2 NULL

R is an old language that has evolved erratically over time. This means it has some features that can lead to difficult bugs. One is a variable type called NULL, which means nothing. While it can be useful, the problem is that NULL is not only used inconsistently but can also cause unexpected behaviour.

1+What # variable What is not defined

Error: object 'What' not found

This makes sense. But consider this:

l=list()
l$What

NULL

The variable What does not exist in the list l, yet l$What returns NULL instead of an error.

l$What+3

numeric(0)

And when we do math with it, it fails silently, returning an empty numeric vector rather than an error.

When we need to delete columns from a dataframe or an element from a list, we assign NULL to it.

df$DeleteMe = NULL
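
One way to guard against the silent failure described above is to test for NULL explicitly with is.null(). The helper GetElement() below is hypothetical and only sketches the idea.

```r
l = list(a = 2)

# Hypothetical helper: fetch a list element and stop loudly if it is missing
GetElement = function(lst, name) {
  value = lst[[name]]
  if (is.null(value)) stop("element '", name, "' not found in list")
  value
}

GetElement(l, "a") + 3
res = try(GetElement(l, "What") + 3, silent = TRUE)
inherits(res, "try-error")   # TRUE: the missing element is caught
```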

7.5.3 Scope and global assignment `<<-`

Variables in every programming language have a scope, meaning which part of the code can see them. For example, if you define a variable directly in R, it can be seen everywhere in your code, but if you define it inside a function, it is only visible within the function. The former is a global variable, while the latter is a local variable.

Global variables are seen as undesirable for a good reason. They can lead to difficulty in finding bugs, which is a particular problem in R because it has a rather unfortunate way of dealing with missing variables.

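Sometimes, global variables can only be avoided by making the code more complex, so it is a tricky tradeoff. We use <<- to put something into the global namespace from inside a function. Strictly speaking, <<- looks through the enclosing environments for an existing variable of that name and creates one in the global environment only if it finds none.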

GlobalVariable <<- 123.456
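
The difference between local and global assignment is easiest to see inside a function. In the sketch below, the names Counter, AddOne and Local are hypothetical.

```r
Counter = 0

# Hypothetical function that updates a global variable as a side effect
AddOne = function() {
  Counter <<- Counter + 1   # modifies Counter outside the function
  Local = 99                # exists only while the function runs
  invisible(NULL)
}

AddOne()
AddOne()
Counter            # 2
exists("Local")    # FALSE: Local never left the function
```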

You sometimes hear very strong opinions on why one should always avoid global variables. While these concerns have merit, one should make a pragmatic choice: if something can be done efficiently and safely with a global that would otherwise require a complex workaround, then by all means use the global.

7.6 R programming essentials

7.6.1 Special characters

R uses both single quotes and double quotes for strings, and you can use either. That is particularly useful if you have to include a quotation mark inside a string, like

s= 'This is a quote character" in the middle of a string\n'
cat(s)

This is a quote character" in the middle of a string

The special character "\n" means a new line, quite handy for printing.

7.6.2 Printing: `cat()` vs. `print()`

To print variables, we can use cat() or print(). To turn numbers into strings, we can use paste() or paste0(). In strings, a new line is \n.

x=10
y=1.2
w= "risk"
cat(x)

10

cat('\n',x,w,'\n')


 10 risk

cat("Important number for",w," is x=",x,"\n")

Important number for risk  is x= 10

s=paste0("The return is ",round(100*y,1),"%")
cat(s,"\n")

The return is 120%

There are two ways to print to the screen and to text files in R: cat() and print(). The former allows for all sorts of text formatting, while the latter simply dumps something on the screen. Both have their uses.

cat("This is the answer. x=",x,", and y=",y,".\n")

This is the answer. x= 10 , and y= 1.2 .

print(x)

[1] 10
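
The extra spaces in the cat() output above come from its default separator. The sketch below shows how the sep argument, paste() and paste0() control spacing. It also uses the base R function sprintf(), which is not used elsewhere in this chapter and is included here only as an option for fixed decimal formatting.

```r
x = 10
y = 1.2
paste("x =", x)                 # "x = 10": paste() inserts a space by default
paste0("x=", x)                 # "x=10": paste0() inserts nothing
paste("a", "b", sep = "-")      # "a-b"
cat("x=", x, ", y=", y, ".\n", sep = "")   # prints x=10, y=1.2.
s = sprintf("The return is %.1f%%", 100 * y)
cat(s, "\n")                    # prints The return is 120.0%
```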

7.6.3 Some useful functions

R has many functions. Below is a list of some of the most widely used in this book.

head: return the first part of an object
tail: return the last part of an object
cbind: combine by column
rbind: combine by row
cat: concatenate and print
print: print values
paste and paste0: concatenate strings
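
A few of the functions listed above in action. This is a sketch on a made-up price series; the name p and the lag column are assumptions for this example.

```r
p = seq(100, 109)          # hypothetical price series
head(p, 3)                 # 100 101 102
tail(p, 3)                 # 107 108 109
m = cbind(p, lag = c(NA, head(p, -1)))   # add a lagged column
head(m, 2)                 # first two rows of the matrix
m = rbind(m, c(110, 109))  # append a row
dim(m)                     # 11 2
```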

7.7 Packages/libraries

R comes with a lot of functionality as standard, but its strength lies in all the packages available for it. When it comes to statistics, the ecosystem is arguably richer than that of any other language. Some of these packages come with R, but most have to be downloaded separately, either using the install.packages() command or a menu in RStudio.

We load the packages using the library() command. Some of them come with annoying start-up messages, which can be suppressed by the suppressPackageStartupMessages() command.

The best practice is to load all the packages used in the code file at the top.
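
A header along the following lines could sit at the top of a code file. It is a minimal sketch: the package list is a placeholder that uses stats and utils, which ship with R, and should be replaced with the packages your own code needs.

```r
# Placeholder package list; replace with the packages your code needs
pkgs = c("stats", "utils")
for (p in pkgs) {
  # Install only if missing (needs an internet connection and a CRAN mirror)
  if (!requireNamespace(p, quietly = TRUE)) install.packages(p)
  # Load quietly; character.only lets us pass the name as a string
  suppressPackageStartupMessages(library(p, character.only = TRUE))
}
```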

7.8 Matrix algebra

When dealing with vectors and matrices, * is element-by-element multiplication, while %*% is matrix multiplication. This becomes important when dealing with portfolios. Note that R vectors only have one dimension. They are not row or column vectors, and %*% treats a vector as whichever of the two makes the dimensions conform.

weight = c(0.3,0.7)
prices=cbind(runif(5)*100,runif(5)*100)
weight

[1] 0.3 0.7

prices

          [,1]       [,2]
[1,] 12.388995 97.4340957
[2,]  6.746049 28.3045600
[3,] 18.885729  0.5522519
[4,] 16.392659 90.0616525
[5,] 29.555027 88.4185735

weight * prices # element-by-element multiplication

          [,1]       [,2]
[1,]  3.716698 68.2038670
[2,]  4.722234  8.4913680
[3,]  5.665719  0.3865764
[4,] 11.474861 27.0184957
[5,]  8.866508 61.8930015

weight %*% prices # matrix multiplication fails: the dimensions do not conform

Error in weight %*% prices: non-conformable arguments

weight %*% t(prices) # matrix multiplication, weight acts as a 1x2 row vector

         [,1]     [,2]     [,3]     [,4]     [,5]
[1,] 71.92057 21.83701 6.052295 67.96095 70.75951

prices %*% weight # matrix multiplication, weight acts as a 2x1 column vector

          [,1]
[1,] 71.920565
[2,] 21.837007
[3,]  6.052295
[4,] 67.960954
[5,] 70.759510
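
One detail of the element-by-element example above is easy to miss: weight * prices recycles the weights down each column, so the rows alternate between 0.3 and 0.7, and the first column is not multiplied by 0.3 and the second by 0.7. If column-wise weighting is what is wanted, the base R function sweep() is one option. The sketch below uses small made-up prices so the arithmetic is easy to check.

```r
weight = c(0.3, 0.7)
prices = cbind(c(10, 20, 30), c(100, 200, 300))   # hypothetical: 3 days, 2 assets

weight * prices                  # recycled down the columns: 0.3, 0.7, 0.3, ...
sweep(prices, 2, weight, "*")    # column 1 times 0.3, column 2 times 0.7
t(t(prices) * weight)            # same result, using recycling on the transpose
prices %*% weight                # weighted sum per day: 73, 146, 219
```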

7.9 Source files — `source('functions.r')`

It can be very useful to include other R files in some R code. The function to do that is source('file.r').

Some of the code we develop below can be reused later. For that reason, we collect all the useful functions into an R source file called functions.r and include it in our code with source('functions.r'). As written, that call expects the file to be in R's current working directory, which here is the folder holding the source code, but you can keep it anywhere you want if you adjust the path in source().
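
The mechanism can be tried without touching the real functions.r. The sketch below writes a small stand-in file to a temporary folder; the function Square() is hypothetical.

```r
# Write a small stand-in functions file to a temporary folder
f = file.path(tempdir(), "functions.r")
writeLines("Square = function(x) x^2", f)

source(f)        # full path, so the working directory does not matter
Square(4)        # 16

getwd()          # a bare source('functions.r') would look for the file here
```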