library(reshape2)9 Data used here
We use three data files in these notes, all with daily US stock prices between 2003 and 2022, both unadjusted and adjusted. The data was obtained from the data vendor we usually use, EODHD, and is distributed here with their permission. The files can be found on https://www.financialriskforecasting.com/data as CSV files.
- SP500, daily prices of the Standard and Poor’s 500 index (S&P 500), the data we used to illustrate univariate applications;
- SP500TR, total returns on the S&P 500;
- Stocks, the prices of different stocks, chosen to represent the various sectors of the US economy, and including both winners and losers,
stocks.csv. The file contains both unadjusted and adjusted prices.
The raw data arrives as CSV files called stocks.csv, sp500.csv and sp500tr.csv.
9.1 Data and libraries
9.2 Processing the sp500.csv file
We can import directly a .csv file into R as a data.frame object using the function read.csv(). The first three lines in the file are:
date,Close
20030102,909.03
20030103,908.59
And the last two:
20221229,3849.28
20221230,3839.5
sp500=read.csv('data/sp500.csv')Now we check that our variable sp500 is actually a dataframe:
class(sp500)[1] "data.frame"
We can check the structure of a dataframe using the function str(). This can be quite useful to see what every column of the dataframe holds and what type of data it has:
str(sp500)'data.frame': 8939 obs. of 2 variables:
$ date : int 19900102 19900103 19900104 19900105 19900108 19900109 19900110 19900111 19900112 19900115 ...
$ Close: num 360 359 356 352 354 ...
head(sp500,2)| date | Close |
|---|---|
| 19900102 | 359.69 |
| 19900103 | 358.76 |
tail(sp500,2)| date | Close | |
|---|---|---|
| 8938 | 20250627 | 6173.07 |
| 8939 | 20250630 | 6204.95 |
names(sp500)[1] "date" "Close"
The name Close is not that convenient, so we rename it.
names(sp500)[2]="price"
names(sp500)[1] "date" "price"
We can make a new column with log returns.
\[ y_t=\log P_t - \log P_{t-1}. \]
When we do this, we will not have an observation for day 1, and since the new column needs to have the same number of rows as the dataframe, we have to add a first observation and put NA into it:
sp500$y = c(NA,diff(log(sp500$price)))
head(sp500)| date | price | y |
|---|---|---|
| 19900102 | 359.69 | NA |
| 19900103 | 358.76 | -0.0025889 |
| 19900104 | 355.67 | -0.0086503 |
| 19900105 | 352.20 | -0.0098041 |
| 19900108 | 353.79 | 0.0045043 |
| 19900109 | 349.62 | -0.0118567 |
It will be nicer to remove the first row. One way to do that is
sp500=sp500[2:dim(sp500)[1],]A more elegant way is to
sp500=sp500[!is.na(sp500$y),]
head(sp500,2)| date | price | y | |
|---|---|---|---|
| 2 | 19900103 | 358.76 | -0.0025889 |
| 3 | 19900104 | 355.67 | -0.0086503 |
9.2.1 Processing the sp500tr.csv file
We do the same with the S&P 500 TR file.
sp500tr=read.csv('data/sp500tr.csv')
names(sp500tr)[2]="price"
sp500tr$y = c(NA,diff(log(sp500tr$price)))9.3 Processing the stocks.csv file
The start of the stocks.csv file looks like:
ticker,date,Close,Adjusted_close
MCD,20030102,16.55,9.6942
MCD,20030103,16.12,9.4423
And the end
INTC,20221229,26.21,25.8945
INTC,20221230,26.43,26.1118
stocks=read.csv('data/stocks.csv')
str(stocks)'data.frame': 30210 obs. of 4 variables:
$ ticker : chr "MCD" "MCD" "MCD" "MCD" ...
$ date : int 20030102 20030103 20030106 20030107 20030108 20030109 20030110 20030113 20030114 20030115 ...
$ Close : num 16.6 16.1 16.6 16.7 16.8 ...
$ Adjusted_close: num 9.69 9.44 9.75 9.76 9.86 ...
The dataframe has four columns and 30210 rows. The str() function shows us the names of each column, along with the data type it holds and some observations. There are columns for ticker symbols, dates, and both unadjusted and adjusted stock prices.
We can also check parts of the structure with different functions:
dim(stocks)[1] 30210 4
nrow(stocks)[1] 30210
ncol(stocks)[1] 4
colnames(stocks)[1] "ticker" "date" "Close" "Adjusted_close"
We should rename some of the columns.
names(stocks)[3:4]=c("UnAdjustedPrice","price")
names(stocks)[1] "ticker" "date" "UnAdjustedPrice" "price"
The reason is that we will not work much with the unadjusted prices, so price is transparent and simple.
We can show the first and last row, and rbind puts that into the same dataframe
rbind(head(stocks,1),tail(stocks,1))| ticker | date | UnAdjustedPrice | price | |
|---|---|---|---|---|
| 1 | MCD | 20030102 | 16.55 | 9.6942 |
| 30210 | INTC | 20221230 | 26.43 | 26.1118 |
We can now see which stocks we have:
unique(stocks$ticker)[1] "MCD" "DIS" "AAPL" "GE" "JPM" "INTC"
9.3.0.1 Reshaping
The stocks dataframe is not particularly useful as it would be much better to have the prices for each stock in its column. While there are many ways to do that, the best is to use the package reshape2.
If we want to create a dataframe with the prices for every stock, we can use the dcast() function. date ~ means each row is data from the same date and each column is data from the same ticker, value.var is the column of the input dataframe that will be filled into the new dataframe.
Price = dcast(stocks, date ~ ticker, value.var = "price")
head(Price,2)| date | AAPL | DIS | GE | INTC | JPM | MCD |
|---|---|---|---|---|---|---|
| 20030102 | 0.2250 | 13.7547 | 88.5027 | 9.7865 | 14.5077 | 9.6942 |
| 20030103 | 0.2265 | 13.8344 | 88.2248 | 9.6985 | 14.7929 | 9.4423 |
UnAdjustedPrice = dcast(stocks, date ~ ticker, value.var = "UnAdjustedPrice")This works since all the stocks are American and from the same exchange, so they have observations on the same days.
We can now make a dataframe with returns. There are several ways to do that. Note that the first column is dates and we do not want to transform those.
Return=Price
for (i in 2:dim(Price)[2]) Return[,i]=c(NA,diff(log(Price[,i])))
head(Return,2)| date | AAPL | DIS | GE | INTC | JPM | MCD |
|---|---|---|---|---|---|---|
| 20030102 | NA | NA | NA | NA | NA | NA |
| 20030103 | 0.0066445 | 0.0057777 | -0.003145 | -0.0090327 | 0.0194678 | -0.0263282 |
And to remove the NA values.
Price=Price[!is.na(Return[,2]),]
UnAdjustedPrice=UnAdjustedPrice[!is.na(Return[,2]),]
Return=Return[!is.na(Return[,2]),]
head(Price,2)| date | AAPL | DIS | GE | INTC | JPM | MCD | |
|---|---|---|---|---|---|---|---|
| 2 | 20030103 | 0.2265 | 13.8344 | 88.2248 | 9.6985 | 14.7929 | 9.4423 |
| 3 | 20030106 | 0.2265 | 14.5436 | 90.4825 | 10.0738 | 15.9562 | 9.7527 |
head(UnAdjustedPrice,2)| date | AAPL | DIS | GE | INTC | JPM | MCD | |
|---|---|---|---|---|---|---|---|
| 2 | 20030103 | 14.9016 | 17.1292 | 25.3826 | 16.54 | 25.94 | 16.12 |
| 3 | 20030106 | 14.9016 | 18.0074 | 26.0322 | 17.18 | 27.98 | 16.65 |
head(Return,2)| date | AAPL | DIS | GE | INTC | JPM | MCD | |
|---|---|---|---|---|---|---|---|
| 2 | 20030103 | 0.0066445 | 0.0057777 | -0.0031450 | -0.0090327 | 0.0194678 | -0.0263282 |
| 3 | 20030106 | 0.0000000 | 0.0499928 | 0.0252684 | 0.0379668 | 0.0757001 | 0.0323446 |
But it can be more convenient and safer to put all the dataframes into one list:
x=list(Return=Return,Price=Price,UnAdjustedPrice=UnAdjustedPrice)9.4 Process raw data
Even better is to put all variables into the same list and load them fro=m functions.r. ```