6  Data used here

We use three data files in these notes, all with daily US stock prices between 2003 and 2022, both unadjusted and adjusted. The data was obtained from the data vendor we usually use, EOD, and is distributed here with their permission. The files can be found on https://www.financialriskforecasting.com/data as CSV files.

  1. SP500, daily prices of the standard and Poor’s 500 index (S&P 500), the data we used to illustrate univariate applications;
  2. SP500TR, total returns on the S&P 500;
  3. Stocks, the prices of six different stocks, chosen to represent the various sectors of the US economy, and including both winners and losers, “stocks.csv”. The file contains both unadjusted and adjusted prices.

The raw data arrives as CSV files called stocks.csv, sp500.csv and sp500tr.csv.

6.1 Libraries

library(reshape2)

6.2 Processing the sp500.csv file

We can import directly a .csv file into R as a data.frame object using the function read.csv(). The first three lines in the file are:

date,Close
20030102,909.03
20030103,908.59

And the last two:

20221229,3849.28
20221230,3839.5

sp500=read.csv('data/sp500.csv')

Now let’s check that our variable df is actually a dataframe:

class(sp500)
[1] "data.frame"

We can check the structure of a dataframe using the function str(). This can be quite useful to see what every column of the dataframe holds and what type of data it has:

str(sp500)
'data.frame':   5035 obs. of  2 variables:
 $ date : int  20030102 20030103 20030106 20030107 20030108 20030109 20030110 20030113 20030114 20030115 ...
 $ Close: num  909 909 929 923 910 ...
head(sp500,2)
date Close
20030102 909.03
20030103 908.59
tail(sp500,2)
date Close
5034 20221229 3849.28
5035 20221230 3839.50
names(sp500)
[1] "date"  "Close"

The name Close is not that convenient, so we renamed it.

names(sp500)[2]="price"
names(sp500)
[1] "date"  "price"

We can make a new column with log returns.

\[ y_t=\log P_1 - \log P_{t-1}. \]

When we do this, we will not have an observation for day 1, and since the new column needs to have the same number of rows as the dataframe, we have to add a first observation and put NA into it:

sp500$y = c(NA,diff(log(sp500$price)))
head(sp500)
date price y
20030102 909.03 NA
20030103 908.59 -0.0004841
20030106 929.01 0.0222256
20030107 922.93 -0.0065661
20030108 909.93 -0.0141857
20030109 927.57 0.0192006

It will be nicer to remove the first row. One way to do that is

sp500=sp500[2:dim(sp500)[1],]

A more elegant way is to

sp500=sp500[!is.na(sp500$y),]
head(sp500,2)
date price y
2 20030103 908.59 -0.0004841
3 20030106 929.01 0.0222256

6.2.1 Processing the sp500tr.csv file

We do the same with the SP-500 TR file

sp500tr=read.csv('data/sp500tr.csv')
names(sp500tr)[2]="price"
sp500tr$y = c(NA,diff(log(sp500tr$price)))

6.3 Processing the stocks.csv file

The start of the stocks.csv file looks like

ticker,date,Close,Adjusted_close
MCD,20030102,16.55,9.6942
MCD,20030103,16.12,9.4423

And the end

INTC,20221229,26.21,25.8945
INTC,20221230,26.43,26.1118

stocks=read.csv('data/stocks.csv')
str(stocks)
'data.frame':   30210 obs. of  4 variables:
 $ ticker        : chr  "MCD" "MCD" "MCD" "MCD" ...
 $ date          : int  20030102 20030103 20030106 20030107 20030108 20030109 20030110 20030113 20030114 20030115 ...
 $ Close         : num  16.6 16.1 16.6 16.7 16.8 ...
 $ Adjusted_close: num  9.69 9.44 9.75 9.76 9.86 ...

The dataframe has five columns and 30210 rows. The str() function shows us the names of each column, along with the data type it holds and some observations. There is a column for every stock, which holds numeric data, and a column for dates, which is of type character.

We can also check parts of the structure with different functions:

dim(stocks)
[1] 30210     4
nrow(stocks)
[1] 30210
ncol(stocks)
[1] 4
colnames(stocks)
[1] "ticker"         "date"           "Close"          "Adjusted_close"

We should rename some of the columns.

names(stocks)[3:4]=c("UnAdjustedPrice","price")
names(stocks)
[1] "ticker"          "date"            "UnAdjustedPrice" "price"          

The reason is that we will not work much with the unadjusted prices, so price is transparent and simple.

We can show the first and last row, and rbind put that into the same dataframe

rbind(head(stocks,1),tail(stocks,1))
ticker date UnAdjustedPrice price
1 MCD 20030102 16.55 9.6942
30210 INTC 20221230 26.43 26.1118

We can now see how which stocks we have:

unique(stocks$ticker)
[1] "MCD"  "DIS"  "AAPL" "GE"   "JPM"  "INTC"

6.3.0.1 Reshaping

The stocks dataframe isn’t particularly useful as it would be much better to have the prices for each stock in its column. While there are many ways to do that, the best is to use the package reshape2.

If we want to create a dataframe with the returns for every stock, we can use the dcast() function. date ~ means each row is data from the same date and each column is data from the same ticker, value.var is the column of the input dataframe that will be filled into the new dataframe.

Price = dcast(stocks, date ~ ticker, value.var = "price")
head(Price,2)
date AAPL DIS GE INTC JPM MCD
20030102 0.2250 13.7547 88.5027 9.7865 14.5077 9.6942
20030103 0.2265 13.8344 88.2248 9.6985 14.7929 9.4423
UnAdjustedPrice = dcast(stocks, date ~ ticker, value.var = "UnAdjustedPrice")

This works since all the stocks are American and from the same exchange, so they have observations on the same days.

We can now make a dataframe with returns. There are several ways to do that. Note that the first column is dates and we dont want to transform those.

Return=Price
for (i in 2:dim(Price)[2]) Return[,i]=c(NA,diff(log(Price[,i])))
head(Return,2)
date AAPL DIS GE INTC JPM MCD
20030102 NA NA NA NA NA NA
20030103 0.0066445 0.0057777 -0.003145 -0.0090327 0.0194678 -0.0263282

And to remove the NA.

Price=Price[!is.na(Return[,2]),]
UnAdjustedPrice=UnAdjustedPrice[!is.na(Return[,2]),]
Return=Return[!is.na(Return[,2]),]
head(Price,2)
date AAPL DIS GE INTC JPM MCD
2 20030103 0.2265 13.8344 88.2248 9.6985 14.7929 9.4423
3 20030106 0.2265 14.5436 90.4825 10.0738 15.9562 9.7527
head(UnAdjustedPrice,2)
date AAPL DIS GE INTC JPM MCD
2 20030103 14.9016 17.1292 25.3826 16.54 25.94 16.12
3 20030106 14.9016 18.0074 26.0322 17.18 27.98 16.65
head(Return,2)
date AAPL DIS GE INTC JPM MCD
2 20030103 0.0066445 0.0057777 -0.0031450 -0.0090327 0.0194678 -0.0263282
3 20030106 0.0000000 0.0499928 0.0252684 0.0379668 0.0757001 0.0323446

But it can be more convenient and safer to put all the dataframes into one list:

x=list(Return=Return,Price=Price,UnAdjustedPrice=UnAdjustedPrice)

6.4 Process raw data

Even better is to put all variables into the same list and load them all with ProcessRawData(). We put that into functions.r.

ProcessRawData=function(){

    sp500=read.csv('data/sp500.csv')
    names(sp500)[2]="price"
    sp500$y = c(NA,diff(log(sp500$price)))
    sp500=sp500[!is.na(sp500$y),]
    sp500$date.ts = ymd(sp500$date)
    sp500$y.ts = zoo(sp500$y,order.by=sp500$date.ts)

    sp500tr=read.csv('data/sp500tr.csv')
    names(sp500tr)[2]="price"
    sp500tr$y = c(NA,diff(log(sp500tr$price)))
    sp500tr$date.ts = ymd(sp500tr$date)
    sp500tr$y.ts = zoo(sp500tr$y,order.by=sp500tr$date.ts)

    stocks=read.csv('data/stocks.csv')
    names(stocks)[3:4]=c("UnAdjustedPrice","price")
    Price = dcast(stocks, date ~ ticker, value.var = "price")
    head(Price,2)
    UnAdjustedPrice = dcast(stocks, date ~ ticker, value.var = "UnAdjustedPrice")

    Return=Price
    for (i in 2:dim(Price)[2]) Return[,i]=c(NA,diff(log(Price[,i])))

    Price=Price[!is.na(Return[,2]),]
    UnAdjustedPrice=UnAdjustedPrice[!is.na(Return[,2]),]
    Return=Return[!is.na(Return[,2]),]

    data=list(
        Return=Return,
        Price=Price,
        UnAdjustedPrice=UnAdjustedPrice,
        sp500=sp500,
        sp500tr=sp500tr,
        Ticker=unique(stocks$ticker)
    )

    return(data)
}