Before Seminar 1, please download R and RStudio. You also need to create a student account with Wharton Research Data Services (WRDS) (https://wrds-www.wharton.upenn.edu/pages/). As approval can take a few days, please do that early.
The plan for this week:
1. Familiarize ourselves with R and RStudio
2. Learn some basic commands
3. Download, import and manipulate financial data from WRDS
4. Create a simple plot
R is a language and environment for statistical computing and graphics. It is a very powerful tool that will allow us to analyze financial data and implement models to assess and quantify risk. We will use it throughout the course for performing analyses and creating plots. No prior programming experience is required. We will go through some basics of the language that should be enough to let you start working with financial datasets.
You can download R for free from https://www.r-project.org, by following these steps:
The software is open source, meaning that it is supported by a community of developers. This is one of the main advantages of R, since it is constantly updated and offers a wide range of packages. A package is a bundle of code, data and documentation that can be easily downloaded and used in your own projects. We will use some packages for financial data.
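As a minimal sketch of how packages are handled in practice, the snippet below checks whether a package is installed before loading it. The package name zoo is only an illustrative assumption and is not prescribed by this seminar; install.packages() is left commented out so that nothing is downloaded unless you choose to.

```r
# Hypothetical package name, chosen only for illustration
pkg <- "zoo"

# requireNamespace() returns TRUE if the package is installed, FALSE otherwise
is_installed <- requireNamespace(pkg, quietly = TRUE)

if (!is_installed) {
  # Uncomment the next line to download the package from CRAN
  # install.packages(pkg)
  message("Package '", pkg, "' is not installed yet")
} else {
  # Attach the package so that its functions can be called directly
  library(pkg, character.only = TRUE)
}
```

A package only needs to be installed once, but it has to be loaded with library() in every new R session.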
RStudio is an "Integrated Development Environment" (IDE), which just means it is an application where you can write your code, execute it, visualize plots, and see the objects you have created. You can download it here: https://rstudio.com. Once you open RStudio, you will see a screen like this:
Editor: Where we write our code. Here we can create or open .R files that contain code to be executed.
Console: The output of the code executed will be shown here.
Information: Here we will see the packages we have installed, plots once we create them, and details and documentation.
R and RStudio are two different things, but they work together. R is the programming language, and RStudio is the tool that helps you create programs using R. R can work without RStudio, but not the other way around. RStudio can be thought of as a workbench around R: it lets you execute commands, create R scripts, manage R variables, reuse commands from history, debug visually, and so on. Basically, it lets you code in R easily.
To start working in RStudio, we first need to set our Working Directory. For our purposes, this should be the folder where we are going to store our data, so that we can access it easily. You can set the Working Directory by going to Session -> Set Working Directory -> Choose Directory..., or you can type setwd("~/PATH") in the console, replacing ~/PATH with the path of the directory.
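The following is a small sketch of checking and changing the working directory from code. It uses tempdir() purely as a stand-in for a folder that is guaranteed to exist; in your own script you would use the folder that holds your data.

```r
# Remember the current working directory so we can return to it
old_wd <- getwd()

# tempdir() is used here only as a folder that is guaranteed to exist;
# in practice, replace it with the folder that holds your data
setwd(tempdir())

# Files in the working directory can now be referred to by name alone
list.files()

# Go back to where we started
setwd(old_wd)
```

Calling getwd() at any time is a quick way to confirm where R will look for files such as a .csv download.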
To create your first file, choose File -> New File -> R Script. Now you are ready to write your code. To execute any part, select it and press Ctrl + Enter (Cmd + Enter on macOS), or use the Run button.
An R file (.R) is no more than a text file written in the R language. When we want to run an R file, it is executed in the R Console. You could directly type code into the console and run it, but an R file helps keep your program organized.
# This is a comment, it won't produce any output
3 + 2
14 * 2
0.94^10
In R you can store variables, which are just pieces of information like numbers. You create variables to be able to use them in other parts of your code. To assign a value to a variable, you can use either an arrow <- or an equal sign =. The former is the more conventional way in R, but I prefer using the latter. Once created, you can "call" variables by their name:
my_number <- 442
my_number
my_string <- "Hello world"
my_string
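To illustrate why storing values is useful, here is a short sketch that reuses variables in a later calculation. The names and numbers are made up for illustration.

```r
# Hypothetical values, chosen only for illustration
price <- 25.5
shares <- 200

# Variables can be combined in new expressions and the result stored again
position_value <- price * shares
position_value   # 5100

# Assigning to an existing name overwrites the old value
shares <- 300
price * shares   # 7650
```

Note that position_value keeps its old value of 5100 after shares is changed; it is only updated if the assignment is run again.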
We will work with vectors and matrices. They can only hold one type of data, either numbers, logical values or strings. Vectors and matrices are created as follows:
# This is a vector:
vec1 <- c(1,2,3)
vec2 <- c("FM", "442")
# This is a matrix:
# arguments nrow and ncol are the number of rows and columns
# argument byrow = TRUE fills the matrix row by row (the default is column by column)
mtx1 <- matrix(c(1,2,3,4), nrow = 2, ncol = 2)
mtx2 <- matrix(c("a", "b", "c", "d", "e","f"), nrow = 3, ncol = 2, byrow = TRUE)
vec1
vec2
mtx1
mtx2
# Length of a vector
length(vec1)
# Dimensions of a matrix
dim(mtx1)
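Because a vector can only hold one type of data, R silently converts mixed inputs to a common type. The sketch below shows this with class(), which reports the type of an object.

```r
# Mixing numbers, strings and logicals: everything becomes a string
mixed1 <- c(1, "a", TRUE)
class(mixed1)   # "character"
mixed1          # "1" "a" "TRUE"

# Mixing numbers and logicals: TRUE becomes 1 and FALSE becomes 0
mixed2 <- c(1, TRUE, FALSE)
class(mixed2)   # "numeric"
mixed2          # 1 1 0
```

This is worth keeping in mind when importing data: a single stray text entry in a column of numbers is enough to turn the whole column into strings.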
A logical value is a type of data in R that is either TRUE or FALSE, depending on whether a statement holds. Numerically, TRUE is equivalent to 1, and FALSE to 0. In the following example, we first compare 10000 and 5; the result of each comparison is a logical value, which we then assign to a variable.
# Logicals in R
a <- (10000 < 5)
b <- (10000 > 5)
a
b
sum(a,b)
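Since TRUE counts as 1 and FALSE as 0, summing a logical vector counts how many times a condition holds. A minimal sketch, using made-up return figures:

```r
# Hypothetical daily returns, for illustration only
returns <- c(0.01, -0.02, 0.03, -0.005, 0.012)

# Comparing a vector with a number gives one logical value per element
is_negative <- returns < 0
is_negative        # FALSE TRUE FALSE TRUE FALSE

# Number of negative returns
sum(is_negative)   # 2

# Share of negative returns
mean(is_negative)  # 0.4
```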
A useful feature about logical vectors is that we can use them to subset other objects. For example:
# Subsetting with logical vectors
logi <- c(TRUE, FALSE, TRUE, TRUE)
vec <- c(1,2,3,4)
# If we subset by "logi", we only keep the positions where logi is TRUE
vec[logi]
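In practice the logical vector is usually produced by a comparison rather than typed by hand. The sketch below, with made-up prices and stock names, places the comparison directly inside the brackets.

```r
# Hypothetical prices and stock names, for illustration only
prices <- c(42, 68, 15, 90)
stocks <- c("A", "B", "C", "D")

# Keep only the prices above 50
prices[prices > 50]   # 68 90

# The same condition can subset a different vector of equal length
stocks[prices > 50]   # "B" "D"

# which() returns the positions where the condition is TRUE
which(prices > 50)    # 2 4
```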
We can access a single element or a subset of vectors and matrices using the brackets [] next to the variable's name and specifying the index of the desired elements. For matrices we need to specify the row followed by a comma and the column. If we want an entire row/column, we can leave the space blank:
# Second element of vec2
vec2[2]
# Third element of the second column of mtx2
mtx2[3,2]
# We can also change an element this way
vec2[1] <- "Finance"
mtx2[3,2] <- "X"
vec2
mtx2
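Leaving the row or column position blank, as described above, can be sketched as follows. The matrix m is a new example object, not one used elsewhere in this seminar.

```r
# A 2 x 3 matrix filled column by column:
#      [,1] [,2] [,3]
# [1,]    1    3    5
# [2,]    2    4    6
m <- matrix(1:6, nrow = 2, ncol = 3)

# Entire first row: leave the column position blank
first_row <- m[1, ]
first_row    # 1 3 5

# Entire second column: leave the row position blank
second_col <- m[, 2]
second_col   # 3 4

# Several columns at once, by passing a vector of indices
sub <- m[, c(1, 3)]
dim(sub)     # 2 2
```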
The function seq() allows us to easily create a vector of evenly spaced numbers between a first and last element:
# You need to specify the first and last element, and the increment
seq1 <- seq(2, 10, by = 2)
# With only one input it will create an integer sequence from 1
seq2 <- seq(5)
seq1
seq2
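Two further ways of building sequences are sketched below: fixing the number of points instead of the increment, and using the colon operator as a shortcut for integer sequences.

```r
# Specify how many evenly spaced points we want instead of the increment
seq3 <- seq(0, 1, length.out = 5)
seq3   # 0.00 0.25 0.50 0.75 1.00

# The colon operator is a shortcut for integer sequences in steps of 1
seq4 <- 1:5
seq4   # 1 2 3 4 5

# A negative increment creates a decreasing sequence
seq5 <- seq(10, 2, by = -2)
seq5   # 10 8 6 4 2
```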
A data frame is similar to a matrix, but it can hold data of different types and can have column names. It is the variable type that we will use the most throughout the course. You can transform a matrix into a data frame by passing it to the function as.data.frame(), or create one from scratch by:
# Create a data frame with two variables using data.frame
# "Stock" and "Price" are the names of the columns
# In the printed output, the line (<chr> <dbl>) shows the data type of each column
x <- data.frame("Stock" = c("A", "B"), "Price" = c(42,68))
x
# Two ways for accessing a column
x[,2]
x$Price
# Creating a new column
x$Price_plus_1 <- x$Price + 1
x
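The other route mentioned above, converting a matrix with as.data.frame(), can be sketched like this. The numbers mirror the example data frame, and the object names m and df are chosen only for illustration.

```r
# A numeric matrix, filled column by column
m <- matrix(c(42, 68, 43, 69), nrow = 2, ncol = 2)

# Convert it into a data frame
df <- as.data.frame(m)

# Columns of a converted matrix get default names
names(df)   # "V1" "V2"

# Assign more descriptive column names
names(df) <- c("Price", "Price_plus_1")
df$Price    # 42 68

# str() gives a compact overview of the columns and their types
str(df)
```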
For-loops and if-statements are an essential part of programming, allowing us to automate pieces of code.
A for loop repeats a piece of code once for each element of a vector. The syntax is:
for (element in vector) {
  code to be repeated
}
Imagine we want to see the square of the first ten numbers:
for (i in 1:10) {   # For every element i in 1, 2, ... 10
  print(i^2)        # print() displays the value in the console
}
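Often we want to keep the results of a loop rather than just print them. A minimal sketch of that pattern is to create an empty vector first and fill it one position at a time:

```r
# Create a numeric vector of length 10 (filled with zeros) to hold the results
squares <- numeric(10)

for (i in 1:10) {
  # Store the square of i in position i
  squares[i] <- i^2
}

squares   # 1 4 9 16 25 36 49 64 81 100
```

For a simple calculation like this one, R can also operate on the whole vector at once: (1:10)^2 gives the same result without a loop.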
An if statement evaluates a logical claim and, based on that condition, executes one piece of code or another.
The syntax of a basic if statement is:
if (condition) {
  code to be executed if condition is TRUE
} else {
  code to be executed if condition is FALSE
}
For example:
x <- 10
if (x > 0) {
  print("x has a positive value")
} else {
  # This branch also covers x equal to 0
  print("x is zero or negative")
}
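An if statement can test more than two cases by chaining else if, and it can be placed inside a for loop to check every element of a vector. The sketch below combines both, using made-up values.

```r
# Hypothetical values, for illustration only
values <- c(-3, 0, 7)

# Empty character vector to hold one label per value
labels <- character(length(values))

for (i in seq_along(values)) {
  if (values[i] > 0) {
    labels[i] <- "positive"
  } else if (values[i] < 0) {
    labels[i] <- "negative"
  } else {
    labels[i] <- "zero"
  }
}

labels   # "negative" "zero" "positive"
```

seq_along(values) creates the index sequence 1, 2, ..., length(values), which is a safe way to loop over the positions of a vector.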
We will download data on a number of stocks and manipulate it. The database we use is provided by the Center for Research in Security Prices, and is usually known as CRSP. You will access it through a provider called Wharton Research Data Services (WRDS). If you have not done so yet, create a student account at https://wrds-www.wharton.upenn.edu/pages/. As approval can take a few days, please do that early.
There are different ways of identifying a company in CRSP, and we need to be careful with what we choose. It is very common to associate a stock with its TICKER, but the ticker can change, for example after a merger. Consider JP Morgan (Ticker: JPM): historically it was officially registered under different names before some mergers and acquisitions happened (Chemical Banking Corp, Chase Manhattan Corp, etc.), each with a different ticker. It is essentially the same company, so by specifying the ticker JPM we would be losing years of financial data. For this reason, we work with the permanent identification number that CRSP assigns to each security, or PERMNO, which is maintained over time.
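The difference between filtering by ticker and filtering by PERMNO can be sketched with a toy data frame. All identifiers, tickers and returns below are invented for illustration and are not CRSP data.

```r
# Toy data: every value here is made up, not taken from CRSP
toy <- data.frame(
  PERMNO = c(1, 1, 1, 1, 2, 2),
  TICKER = c("OLD", "OLD", "NEW", "NEW", "XYZ", "XYZ"),
  RET    = c(0.01, -0.02, 0.03, 0.00, 0.02, -0.01),
  stringsAsFactors = FALSE
)

# Filtering by the current ticker misses the rows recorded under the old ticker
by_ticker <- toy[toy$TICKER == "NEW", ]
nrow(by_ticker)            # 2

# Filtering by PERMNO keeps the full history of the security
by_permno <- toy[toy$PERMNO == 1, ]
nrow(by_permno)            # 4

# One PERMNO can be associated with several tickers over time
unique(by_permno$TICKER)   # "OLD" "NEW"
```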
For the purpose of this course, we will mostly be working with comma-separated values, or .csv, files.
Once logged on, select CRSP and go to “Stock / Security Files / Daily Stock File”, as shown in the screenshots below:
Explore the page and the help provided. Then in the steps
Click “Submit Query”
Open the output file in Excel and look for unexpected output.
Redo the exercise but this time selecting PERMNO in Step 2, using 10107 59328 12060 47896 70519 11850
Explain the differences in the two output files.
Save the output into a file in some directory as ‘crsp.csv’
All the details on the variables we download can be found in the Variable Descriptions section of WRDS. It is important to distinguish between the types of returns we are using, in particular whether they include dividends or not. For example, the description of the time series variables we have downloaded is:
Open RStudio and select the directory you chose in the last step as the Working Directory. In your R script, write and execute:
# Write this line at the beginning of your new script to clean the R environment
rm(list=ls())
# Importing the downloaded data
data <- read.csv('crsp.csv')
# Checking the dimensions
dim(data)
# First observations
head(data)
# A single column
head(data$RET)
head(data[,6])
# Names of the columns
names(data)
# Getting unique values of PERMNO
unique(data$PERMNO)
# Creating a variable for a company
# We are filtering the dataset for the rows with the Citi PERMNO
citi <- data[data$PERMNO == 70519,]
# Dimension
dim(citi)
# Check the first few elements
head(citi)
# Check different tickers for the same PERMNO
unique(citi$TICKER)
# Why does this happen?
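# Highest return for Citi (the number depends on the data you downloaded)
# na.rm = TRUE ignores missing returns, which would otherwise make max() return NA
highest_citi <- max(citi$RET, na.rm = TRUE) * 100
paste0("Highest return for Citi: ", highest_citi, "%")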
R can easily build plots to visualize data. We will use the plot command for this. To read the documentation on the command, you can type ?plot in the console and see all the options it includes.
plot(citi$PRC, type = "l", main = "Price of Citi")
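By default, plot() puts the observation number on the horizontal axis. A possible refinement is to plot against calendar dates instead. The sketch below uses a toy data frame and assumes that dates are stored in a column called date as numbers of the form YYYYMMDD; check names(data) and head(data) to see how your own download is laid out before adapting it.

```r
# Toy data: dates and prices are made up for illustration
toy <- data.frame(
  date = c(20200102, 20200103, 20200106, 20200107),
  PRC  = c(100.0, 101.5, 99.8, 102.3)
)

# Convert the YYYYMMDD numbers into R's Date type
toy$date <- as.Date(as.character(toy$date), format = "%Y%m%d")
class(toy$date)   # "Date"

# Passing the dates as the first argument puts them on the horizontal axis
plot(toy$date, toy$PRC, type = "l",
     main = "Toy price series", xlab = "Date", ylab = "Price")
```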
In this seminar we have covered:
Some new functions used:
matrix()
length()
dim()
seq()
data.frame()
print()
read.csv()
head()
names()
unique()
paste0()
plot()
For more discussion on the material covered in this seminar, refer to Chapter 1: Financial markets, prices and risk, in Financial Risk Forecasting by Jon Danielsson.
Acknowledgements: Thanks to Alvaro Aguirre and Yuyang Lin for creating these notebooks
© Jon Danielsson, 2022