Seminar 2

Before the seminar two, please make sure you have the following packaged installed: "zoo", "lubridate", "car", "tseries", "rugarch", "rmgarch" and "reshape2". You can install package using install.packages() command.

The plan for this week:

1. Install and load packages
2. Basic Data Handling
3. Save Data Frames
4. Create, customize and export plots


Installing and loading packages

One of the most useful features of R is its large library of packages. Although the base R package is very powerful, there are thousands of community-built packages to perform specific tasks that can save us the effort of programming some things from scratch. The complete list of available R packages can be found here: https://cran.r-project.org.

To install a package, type in the console: install.packages("name_of_package"), and after installed, you can load it with: library(name_of_package).

It is considered a good practice to start your .R script with loading all the libraries that will be used in your program.

In the Information pane of RStudio, you can access the list of packages that are installed in your R environment:
By clicking on a package's name, you will see all the information it contains

Installing some packages we will need for our course

Make sure you have the following packages installed:

install.packages("zoo")
install.packages("lubridate")
install.packages("tseries")
install.packages("rugarch")
install.packages("rmgarch")
install.packages("reshape2")
install.packages("car")

Loading the packages needed for today

In [30]:
library(reshape2)
library(lubridate)

Basic Data Handling

The first thing to do when working with new data is cleaning it. Let's load the data downloaded in the previous class and take a look at it:

In [2]:
data <- read.csv("crsp.csv")
head(data)
A data.frame: 6 × 7
PERMNOdateTICKERCOMNAMPRCRETCFACPR
<int><int><chr><chr><dbl><dbl><dbl>
11010719900102MSFTMICROSOFT CORP88.750 0.020115144
21010719900103MSFTMICROSOFT CORP89.250 0.005634144
31010719900104MSFTMICROSOFT CORP91.875 0.029412144
41010719900105MSFTMICROSOFT CORP89.625-0.024490144
51010719900108MSFTMICROSOFT CORP91.000 0.015342144
61010719900109MSFTMICROSOFT CORP90.750-0.002747144

The format of this table is a data.frame. This is a type of variable that allows us to store data of different nature (numbers, characters, dates, etc.)

In [3]:
class(data)
'data.frame'

We are interested in having a data frame that holds the price of each stock over time, and another one that holds the returns. We will use the current TICKER of the stocks as the identifying name for each company.

Let's start with prices. Before building our data frame, we need to adjust the prices with the Cumulative Factor to Adjust Prices, or CFACPR. The reason is that the PRC variable does not take into account stock splits, which can lead us to believe that the price of a stock halved in a day, when the reason for this drop is no more than a stock split. To adjust for this, we will divide the column PRC by CFACPR. For comparison, we will keep the unadjusted prices in a Unadjusted_PRC.

In [4]:
# Keeping the unadjusted prices in a new column
data$Unadjusted_PRC <- data$PRC

# Modifying the PRC column
data$Adjusted_PRC <- data$PRC / data$CFACPR

head(data)
A data.frame: 6 × 9
PERMNOdateTICKERCOMNAMPRCRETCFACPRUnadjusted_PRCAdjusted_PRC
<int><int><chr><chr><dbl><dbl><dbl><dbl><dbl>
11010719900102MSFTMICROSOFT CORP88.750 0.02011514488.7500.6163194
21010719900103MSFTMICROSOFT CORP89.250 0.00563414489.2500.6197917
31010719900104MSFTMICROSOFT CORP91.875 0.02941214491.8750.6380208
41010719900105MSFTMICROSOFT CORP89.625-0.02449014489.6250.6223958
51010719900108MSFTMICROSOFT CORP91.000 0.01534214491.0000.6319444
61010719900109MSFTMICROSOFT CORP90.750-0.00274714490.7500.6302083

Now that we have the correct prices, we will select the date and price columns for each stock and put them into a variable with the ticker name. Afterwards, we have to rename the price column to the ticker name. For example, for MSFT:

In [5]:
# Getting the date and Adjusted_PRC variables for Microsoft
MSFT <- data[data$PERMNO == 10107, c("date", "Adjusted_PRC")]

# Renaming Adjusted_PRC to MSFT
names(MSFT)[2] <- "MSFT"

# Now we do the same for the five others

XOM <- data[data$PERMNO==11850, c("date", "Adjusted_PRC")]
names(XOM)[2] <- "XOM"

GE <- data[data$PERMNO==12060, c("date", "Adjusted_PRC")]
names(GE)[2] <- "GE"

JPM <- data[data$PERMNO==47896, c("date", "Adjusted_PRC")]
names(JPM)[2] <- "JPM"

INTC <- data[data$PERMNO==59328, c("date", "Adjusted_PRC")]
names(INTC)[2] <- "INTC"

C <- data[data$PERMNO==70519, c("date", "Adjusted_PRC")]
names(C)[2] <- "C"

# And merge all into a single table called PRC using the merge() function
PRC <- merge(MSFT, XOM)
PRC <- merge(PRC, GE)
PRC <- merge(PRC, JPM)
PRC <- merge(PRC, INTC)
PRC <- merge(PRC, C)
head(PRC)
A data.frame: 6 × 7
dateMSFTXOMGEJPMINTCC
<int><dbl><dbl><dbl><dbl><dbl><dbl>
1199001020.616319412.500044.5000010.000001.12500022.87621
2199001030.619791712.375044.4166710.333331.09375023.16825
3199001040.638020812.250044.1666710.375001.11718822.87621
4199001050.622395812.187543.7500010.416671.10937523.07090
5199001080.631944412.375044.0000010.416671.12500023.26559
6199001090.630208312.125043.0833310.083331.15625022.97356

We got the output we wanted, but it involved several lines of basically copy-pasting the same code.

As a challenge, you can try to replicate the process using a for loop.

Also, we could have saved us this trouble by using a package. R has thousands of packages with functions that can help us easily get the output we are looking for. We are going to create another table using the dcast function from the reshape2 package.

To get an overview of meaning of aruguments for function dcast, we can check its documentation by:

In [6]:
?dcast

date ~ PERMNO means each row is data from same date and each column is data from same PERMNO, value.var is the column of the input data frame that will be filled into the new data frame.

In [7]:
# Remove the previous variable
rm(PRC)

# Create a new data frame 
PRC <- dcast(data, date ~ PERMNO, value.var = "Adjusted_PRC")
names(PRC) <- c("date", "MSFT", "XOM", "GE", "JPM", "INTC", "C")
head(PRC)
A data.frame: 6 × 7
dateMSFTXOMGEJPMINTCC
<int><dbl><dbl><dbl><dbl><dbl><dbl>
1199001020.616319412.500044.5000010.000001.12500022.87621
2199001030.619791712.375044.4166710.333331.09375023.16825
3199001040.638020812.250044.1666710.375001.11718822.87621
4199001050.622395812.187543.7500010.416671.10937523.07090
5199001080.631944412.375044.0000010.416671.12500023.26559
6199001090.630208312.125043.0833310.083331.15625022.97356

Not only did this save us time and lines of code, but it also gave us as output a data.frame object instead of a matrix, which is easier to handle.

We can now directly create the data frame for returns:

In [8]:
RET <- dcast(data, date ~ PERMNO, value.var = "RET")
names(RET) <- c("date", "MSFT", "XOM", "GE", "JPM", "INTC", "C")
head(RET)
A data.frame: 6 × 7
dateMSFTXOMGEJPMINTCC
<int><dbl><dbl><dbl><dbl><dbl><dbl>
119900102 0.020115 0.000000 0.034884 0.004184 0.043478 0.030702
219900103 0.005634-0.010000-0.001873 0.033333-0.027778 0.012766
319900104 0.029412-0.010101-0.005629 0.004032 0.021429-0.012605
419900105-0.024490-0.005102-0.009434 0.004016-0.006993 0.008511
519900108 0.015342 0.015385 0.005714 0.000000 0.014085 0.008439
619900109-0.002747-0.020202-0.020833-0.032000 0.027778-0.012552
In [9]:
length(RET$"C")
8126

Transforming simple returns to compounded

The returns in our dataset are simple returns, which, as we saw in lectures, are calculated like this:

$$R_{t}=\frac{P_{t} - P_{t-1}}{P_{t-1}}$$

We prefer to work with continuously compounded returns, which are defined as:

$$Y_{t} = \log\left(\frac{P_{t}}{P_{t-1}}\right) = \log\left(1+R_{t}\right)$$

To transform them into compound returns, we should use the log() function:

In [10]:
# We choose all the columns except the first one 
# And transform them into a new Y data frame
Y <- log(1 + RET[,2:7])
Y$date <- RET$date
head(Y)
A data.frame: 6 × 7
MSFTXOMGEJPMINTCCdate
<dbl><dbl><dbl><dbl><dbl><dbl><int>
1 0.019915366 0.00000000 0.034289343 0.004175271 0.042559364 0.03024012319900102
2 0.005618188-0.01005034-0.001874756 0.032789500-0.028171106 0.01268520219900103
3 0.028987765-0.01015236-0.005644903 0.004023893 0.021202627-0.01268511719900104
4-0.024794868-0.00511506-0.009478782 0.004007957-0.007017566 0.00847498619900105
5 0.015225502 0.01526785 0.005697737 0.000000000 0.013986728 0.00840359119900108
6-0.002750780-0.02040885-0.021053069-0.032523192 0.027399190-0.01263144219900109

Date formats

On our RET data frame, we have a column for the dates of the observations. R has a special variable type for working with dates called Date, which will make our lifes easier when trying to do plots and some analyses. However, by default the date column in our dataset is not in this format:

In [11]:
class(Y$date)
'integer'

There are several ways to transform data into the Date type. We will use the package lubridate that we installed earlier. In particular, the function ymd() that stands for year-month-day. It is a powerful function that will turn any character in that order into a Date format. For example, it can handle:

ymd("20101005")
ymd("2010-10-05")
ymd("2010/10/05")

Likewise, you could also use dmy() or mdy() for different formats.

In [12]:
# Save the original int type date to int_date
Y$int_date <- Y$date
# Use the function ymd() to transform the column into Dates
Y$date <- ymd(Y$date)
# Check if it worked
class(Y$date)
'Date'
In [13]:
# Lets do the same for PRC
PRC$int_date <- PRC$date
PRC$date <- ymd(PRC$date)

Saving data frames

After handling data, we want to make sure we do not have to do the same procedure every time we open our program. For this, we can easily save the data frame we have created as an .RData object that can be loaded the next time we open R. To do so, we need to make sure we are at the Directory where we want to save the data, and use the save() function:

In [14]:
# Saving the data frame of returns
save(Y, file = "Y.RData")

# Saving the data frame of prices
save(PRC, file = "PRC.RData")

To load a .RData file, we need to use the function load:

In [15]:
# Remove the existing data frame of returns
rm(Y)

# Load the saved file
load("Y.RData")
head(Y)
A data.frame: 6 × 8
MSFTXOMGEJPMINTCCdateint_date
<dbl><dbl><dbl><dbl><dbl><dbl><date><int>
1 0.019915366 0.00000000 0.034289343 0.004175271 0.042559364 0.0302401231990-01-0219900102
2 0.005618188-0.01005034-0.001874756 0.032789500-0.028171106 0.0126852021990-01-0319900103
3 0.028987765-0.01015236-0.005644903 0.004023893 0.021202627-0.0126851171990-01-0419900104
4-0.024794868-0.00511506-0.009478782 0.004007957-0.007017566 0.0084749861990-01-0519900105
5 0.015225502 0.01526785 0.005697737 0.000000000 0.013986728 0.0084035911990-01-0819900108
6-0.002750780-0.02040885-0.021053069-0.032523192 0.027399190-0.0126314421990-01-0919900109

Saving as .csv

We can also easily save our data frames as a .csv file using the write.csv() function:

In [16]:
write.csv(Y, file = "Y.csv")

Plotting

The base R has an easy to use plot function called plot(). In this section we will learn how to customize our plots, add titles, change axis labels, add legends, select types of lines, vary the colors, and create subplots.

To get an overview of what plot() can do, we can check its documentation by:

In [17]:
?plot

Customizing a plot

We will start plotting the returns for JP Morgan:

In [18]:
# Simple plot, if we do not specify an X variable, plot() will use an index
plot(Y$JPM)
In [19]:
# By default, plot() uses points, we can plot a line with the option "type", "l" denotes line
plot(Y$JPM, type = "l")
In [20]:
# We can add a title with the option "main"
# Change the axes labels with "xlab" and "ylab"
# Choose a color for the graph with "col"
plot(Y$JPM, type = "l", main = "Compound returns for JP Morgan", 
    ylab = "Returns", xlab = "Observation", col = "red")

We would like to have the dates in the x axis to understand our data better:

In [21]:
# The first data argument is used as the X variable
plot(Y$date, Y$JPM, type = "l", main = "Compound returns for JP Morgan", 
    ylab = "Returns", xlab = "Date", col = "red")

Also, by using the option las = 1, we will make sure that the ticks for the axes are always horizontal. This is a good practice for easier visualization:

In [22]:
# The first data argument is used as the X variable
plot(Y$date, Y$JPM, type = "l", main = "Compound returns for JP Morgan", 
    ylab = "Returns", xlab = "Date", col = "red", las = 1)

Visualizing multiple graphs

What if we wanted to plot the prices for JP Morgan and Citi, the two major banks in our dataset?

There are many options to do so. One is to plot both in the same graph, we can achieve this by using the lines() function after a plot is created, or using the function matplot():

In [23]:
# First we plot the returns of JPM
plot(PRC$date, PRC$JPM, type = "l", main = "Prices for JP Morgan and Citi", 
    ylab = "Price", xlab = "Date", col = "red")

# Then we add the returns of C
lines(PRC$date, PRC$C, col = "blue")

# And we create a legend
legend("bottomright",legend = c("JPM", "C"), col = c("red", "blue"), lty=1)

There is something wrong with this plot. Since we first plotted the prices for JPM, the limits of the plot were determined to fit this data. However, when plotting the prices for C, we see some go below the limits of our plot. We can fix this with the ylim() option:

In [24]:
plot(PRC$date, PRC$JPM, type = "l", main = "Prices for JP Morgan and Citi", 
    ylab = "Price", xlab = "Date", col = "red", ylim = c(0, 600))

lines(PRC$date, PRC$C, col = "blue")

legend("topright",legend = c("JPM", "C"), col = c("red", "blue"), lty=1)

Now that we know how to visualize two time series in the same plot, we can do a visual comparison of the adjusted and unadjusted prices for a stock:

In [25]:
# Extract Microsoft data
msft <- data[data$PERMNO == 10107,]

# Plot the adjusted prices
plot(msft$PRC, type = "l", col = 1, main = "Adjusted and unadjusted prices for MSFT",
    ylab = "USD")

# Plot the unadjusted prices
lines(msft$Unadjusted_PRC, col = 2)

# Adding a legend
legend("bottomright",legend = c("Adjusted", "Unadjusted"), col = c(1:2), lty=1)

The function matplot() can be helpful when we want to plot graphs in the same space without worrying of limits:

In [26]:
# Plotting all prices
matplot(PRC[,2:7], type = "l", main = "Prices for our stocks", 
    ylab = "Price", lty = 1)

# Adding the legend
legend("topright", legend = names(PRC[,2:7]), col = c(1:6), fil = c(1:6))

In some cases, plotting many time series in the same space can be messy, like if we wanted to plot all returns:

In [27]:
matplot(RET[,2:7], type = "l", main = "Returns for our stocks", 
    ylab = "Returns", lty = 1)

# Adding the legend
legend("bottomright", legend = names(RET[,2:7]), col = c(1:6), fil = c(1:6))

For a better visualization, we will use par(mfrow = c(a,b)). This option divides the plotting area into a grid with a rows and b columns, each which will hold a subplot:

In [28]:
# Divide the space into a 3x2 grid
par(mfrow = c(3,2))

# We will use a FOR loop, which repeats a same piece of code as many times as we indicate
for (i in 2:7) { 
    # The code will be repeated for i = 2, 3, ..., 7
    # In each instance, we will plot the i-th column of PRC
    plot(PRC$date, PRC[,i], type = "l", ylab = "Returns", xlab = "Date",
         main = paste("Prices for", names(PRC)[i]))
}

After using par(mfrow = c()), make sure you reset the grid by running par(mfrow = c(1,1)) or dev.off().

Just for reference, there is a package called ggplot2, which is one of the most popular packages for plotting in R, used by a number of organizations like The New York Times or the BBC. For our purposes, the base package has all the functionalities we need, but if you want more information on ggplot2 you can go here.

Exporting a plot

For this course you will have to create reports that include plots from R. The best way to do that, is to export the plots from RStudio. You can easily do that by clicking on Export in the Plot Viewer, and either save the plot, or copy it to your clipboard to paste it in a document. Or you can use functions such as pdf(), bmp() and pdf() for saving the figure automatically.

Recap

In this seminar we have covered:

  • How to install and load packages
  • Create a different column for each stock manually and using a package
  • Working with date formats
  • Saving data as .RData and .csv
  • Customizing plots with titles, axis labels, colors and axis limits
  • Adding legends to plots
  • Plotting several time series in the same space
  • Creating subplots
  • Exporting plots

Some new functions used:

  • library()
  • merge()
  • dcast()
  • log()
  • ymd()
  • save()
  • load()
  • write.csv()
  • lines()
  • legend()
  • matplot()
  • par(mfrow = c())

For more discussion on the material covered in this seminar, refer to Chapter 1: Financial markets, prices and risk on Financial Risk Forecasting by Jon Danielsson.

Acknowledgements: Thanks to Alvaro Aguirre and Yuyang Lin for creating these notebooks
© Jon Danielsson, 2022