Seminar 2

Before the seminar two, please make sure you have the following packaged installed: "zoo", "lubridate", "car", "tseries", "rugarch", "rmgarch" and "reshape2". You can install package using install.packages() command.

The plan for this week:

1. Install and load packages
2. Basic Data Handling
3. Save Data Frames
4. Create, customize and export plots

Installing and loading packages

One of the most useful features of R is its large library of packages. Although the base R package is very powerful, there are thousands of community-built packages to perform specific tasks that can save us the effort of programming some things from scratch. The complete list of available R packages can be found here:

To install a package, type in the console: install.packages("name_of_package"), and after installed, you can load it with: library(name_of_package).

It is considered a good practice to start your .R script with loading all the libraries that will be used in your program.

In the Information pane of RStudio, you can access the list of packages that are installed in your R environment:
By clicking on a package's name, you will see all the information it contains

Installing some packages we will need for our course

Make sure you have the following packages installed:


Loading the packages needed for today

In [30]:

Basic Data Handling

The first thing to do when working with new data is cleaning it. Let's load the data downloaded in the previous class and take a look at it:

In [2]:
data <- read.csv("crsp.csv")
A data.frame: 6 × 7
11010719900102MSFTMICROSOFT CORP88.750 0.020115144
21010719900103MSFTMICROSOFT CORP89.250 0.005634144
31010719900104MSFTMICROSOFT CORP91.875 0.029412144
41010719900105MSFTMICROSOFT CORP89.625-0.024490144
51010719900108MSFTMICROSOFT CORP91.000 0.015342144
61010719900109MSFTMICROSOFT CORP90.750-0.002747144

The format of this table is a data.frame. This is a type of variable that allows us to store data of different nature (numbers, characters, dates, etc.)

In [3]:

We are interested in having a data frame that holds the price of each stock over time, and another one that holds the returns. We will use the current TICKER of the stocks as the identifying name for each company.

Let's start with prices. Before building our data frame, we need to adjust the prices with the Cumulative Factor to Adjust Prices, or CFACPR. The reason is that the PRC variable does not take into account stock splits, which can lead us to believe that the price of a stock halved in a day, when the reason for this drop is no more than a stock split. To adjust for this, we will divide the column PRC by CFACPR. For comparison, we will keep the unadjusted prices in a Unadjusted_PRC.

In [4]:
# Keeping the unadjusted prices in a new column
data$Unadjusted_PRC <- data$PRC

# Modifying the PRC column
data$Adjusted_PRC <- data$PRC / data$CFACPR

A data.frame: 6 × 9
11010719900102MSFTMICROSOFT CORP88.750 0.02011514488.7500.6163194
21010719900103MSFTMICROSOFT CORP89.250 0.00563414489.2500.6197917
31010719900104MSFTMICROSOFT CORP91.875 0.02941214491.8750.6380208
41010719900105MSFTMICROSOFT CORP89.625-0.02449014489.6250.6223958
51010719900108MSFTMICROSOFT CORP91.000 0.01534214491.0000.6319444
61010719900109MSFTMICROSOFT CORP90.750-0.00274714490.7500.6302083

Now that we have the correct prices, we will select the date and price columns for each stock and put them into a variable with the ticker name. Afterwards, we have to rename the price column to the ticker name. For example, for MSFT:

In [5]:
# Getting the date and Adjusted_PRC variables for Microsoft
MSFT <- data[data$PERMNO == 10107, c("date", "Adjusted_PRC")]

# Renaming Adjusted_PRC to MSFT
names(MSFT)[2] <- "MSFT"

# Now we do the same for the five others

XOM <- data[data$PERMNO==11850, c("date", "Adjusted_PRC")]
names(XOM)[2] <- "XOM"

GE <- data[data$PERMNO==12060, c("date", "Adjusted_PRC")]
names(GE)[2] <- "GE"

JPM <- data[data$PERMNO==47896, c("date", "Adjusted_PRC")]
names(JPM)[2] <- "JPM"

INTC <- data[data$PERMNO==59328, c("date", "Adjusted_PRC")]
names(INTC)[2] <- "INTC"

C <- data[data$PERMNO==70519, c("date", "Adjusted_PRC")]
names(C)[2] <- "C"

# And merge all into a single table called PRC using the merge() function
PRC <- merge(MSFT, XOM)
PRC <- merge(PRC, GE)
PRC <- merge(PRC, JPM)
PRC <- merge(PRC, INTC)
PRC <- merge(PRC, C)
A data.frame: 6 × 7

We got the output we wanted, but it involved several lines of basically copy-pasting the same code.

As a challenge, you can try to replicate the process using a for loop.

Also, we could have saved us this trouble by using a package. R has thousands of packages with functions that can help us easily get the output we are looking for. We are going to create another table using the dcast function from the reshape2 package.

To get an overview of meaning of aruguments for function dcast, we can check its documentation by:

In [6]:

date ~ PERMNO means each row is data from same date and each column is data from same PERMNO, value.var is the column of the input data frame that will be filled into the new data frame.

In [7]:
# Remove the previous variable

# Create a new data frame 
PRC <- dcast(data, date ~ PERMNO, value.var = "Adjusted_PRC")
names(PRC) <- c("date", "MSFT", "XOM", "GE", "JPM", "INTC", "C")
A data.frame: 6 × 7

Not only did this save us time and lines of code, but it also gave us as output a data.frame object instead of a matrix, which is easier to handle.

We can now directly create the data frame for returns:

In [8]:
RET <- dcast(data, date ~ PERMNO, value.var = "RET")
names(RET) <- c("date", "MSFT", "XOM", "GE", "JPM", "INTC", "C")
A data.frame: 6 × 7
119900102 0.020115 0.000000 0.034884 0.004184 0.043478 0.030702
219900103 0.005634-0.010000-0.001873 0.033333-0.027778 0.012766
319900104 0.029412-0.010101-0.005629 0.004032 0.021429-0.012605
419900105-0.024490-0.005102-0.009434 0.004016-0.006993 0.008511
519900108 0.015342 0.015385 0.005714 0.000000 0.014085 0.008439
619900109-0.002747-0.020202-0.020833-0.032000 0.027778-0.012552
In [9]:

Transforming simple returns to compounded

The returns in our dataset are simple returns, which, as we saw in lectures, are calculated like this:

$$R_{t}=\frac{P_{t} - P_{t-1}}{P_{t-1}}$$

We prefer to work with continuously compounded returns, which are defined as:

$$Y_{t} = \log\left(\frac{P_{t}}{P_{t-1}}\right) = \log\left(1+R_{t}\right)$$

To transform them into compound returns, we should use the log() function:

In [10]:
# We choose all the columns except the first one 
# And transform them into a new Y data frame
Y <- log(1 + RET[,2:7])
Y$date <- RET$date
A data.frame: 6 × 7
1 0.019915366 0.00000000 0.034289343 0.004175271 0.042559364 0.03024012319900102
2 0.005618188-0.01005034-0.001874756 0.032789500-0.028171106 0.01268520219900103
3 0.028987765-0.01015236-0.005644903 0.004023893 0.021202627-0.01268511719900104
4-0.024794868-0.00511506-0.009478782 0.004007957-0.007017566 0.00847498619900105
5 0.015225502 0.01526785 0.005697737 0.000000000 0.013986728 0.00840359119900108
6-0.002750780-0.02040885-0.021053069-0.032523192 0.027399190-0.01263144219900109

Date formats

On our RET data frame, we have a column for the dates of the observations. R has a special variable type for working with dates called Date, which will make our lifes easier when trying to do plots and some analyses. However, by default the date column in our dataset is not in this format:

In [11]:

There are several ways to transform data into the Date type. We will use the package lubridate that we installed earlier. In particular, the function ymd() that stands for year-month-day. It is a powerful function that will turn any character in that order into a Date format. For example, it can handle:


Likewise, you could also use dmy() or mdy() for different formats.

In [12]:
# Save the original int type date to int_date
Y$int_date <- Y$date
# Use the function ymd() to transform the column into Dates
Y$date <- ymd(Y$date)
# Check if it worked
In [13]:
# Lets do the same for PRC
PRC$int_date <- PRC$date
PRC$date <- ymd(PRC$date)

Saving data frames

After handling data, we want to make sure we do not have to do the same procedure every time we open our program. For this, we can easily save the data frame we have created as an .RData object that can be loaded the next time we open R. To do so, we need to make sure we are at the Directory where we want to save the data, and use the save() function:

In [14]:
# Saving the data frame of returns
save(Y, file = "Y.RData")

# Saving the data frame of prices
save(PRC, file = "PRC.RData")

To load a .RData file, we need to use the function load:

In [15]:
# Remove the existing data frame of returns

# Load the saved file
A data.frame: 6 × 8
1 0.019915366 0.00000000 0.034289343 0.004175271 0.042559364 0.0302401231990-01-0219900102
2 0.005618188-0.01005034-0.001874756 0.032789500-0.028171106 0.0126852021990-01-0319900103
3 0.028987765-0.01015236-0.005644903 0.004023893 0.021202627-0.0126851171990-01-0419900104
4-0.024794868-0.00511506-0.009478782 0.004007957-0.007017566 0.0084749861990-01-0519900105
5 0.015225502 0.01526785 0.005697737 0.000000000 0.013986728 0.0084035911990-01-0819900108
6-0.002750780-0.02040885-0.021053069-0.032523192 0.027399190-0.0126314421990-01-0919900109

Saving as .csv

We can also easily save our data frames as a .csv file using the write.csv() function:

In [16]:
write.csv(Y, file = "Y.csv")


The base R has an easy to use plot function called plot(). In this section we will learn how to customize our plots, add titles, change axis labels, add legends, select types of lines, vary the colors, and create subplots.

To get an overview of what plot() can do, we can check its documentation by:

In [17]:

Customizing a plot

We will start plotting the returns for JP Morgan:

In [18]:
# Simple plot, if we do not specify an X variable, plot() will use an index
In [19]:
# By default, plot() uses points, we can plot a line with the option "type", "l" denotes line
plot(Y$JPM, type = "l")
In [20]:
# We can add a title with the option "main"
# Change the axes labels with "xlab" and "ylab"
# Choose a color for the graph with "col"
plot(Y$JPM, type = "l", main = "Compound returns for JP Morgan", 
    ylab = "Returns", xlab = "Observation", col = "red")

We would like to have the dates in the x axis to understand our data better:

In [21]:
# The first data argument is used as the X variable
plot(Y$date, Y$JPM, type = "l", main = "Compound returns for JP Morgan", 
    ylab = "Returns", xlab = "Date", col = "red")

Also, by using the option las = 1, we will make sure that the ticks for the axes are always horizontal. This is a good practice for easier visualization:

In [22]:
# The first data argument is used as the X variable
plot(Y$date, Y$JPM, type = "l", main = "Compound returns for JP Morgan", 
    ylab = "Returns", xlab = "Date", col = "red", las = 1)

Visualizing multiple graphs

What if we wanted to plot the prices for JP Morgan and Citi, the two major banks in our dataset?

There are many options to do so. One is to plot both in the same graph, we can achieve this by using the lines() function after a plot is created, or using the function matplot():

In [23]:
# First we plot the returns of JPM
plot(PRC$date, PRC$JPM, type = "l", main = "Prices for JP Morgan and Citi", 
    ylab = "Price", xlab = "Date", col = "red")

# Then we add the returns of C
lines(PRC$date, PRC$C, col = "blue")

# And we create a legend
legend("bottomright",legend = c("JPM", "C"), col = c("red", "blue"), lty=1)