library(reshape2)
7 Loading data
7.1 Libraries
7.2 CSV, RData, Excel and the rest
There is a large number of ways to store data. Perhaps the most common is Excel, but while Excel can be useful to manipulate data, we do not recommend storing the types of data we use here as an Excel spreadsheet. The reason is that Excel often changes data without telling us. It might erroneously interpret something as a date, and then automatically converting a number into a date. It can lose significant digits or transform data in other unexpected ways. Consequently, in the type of work we do here, we don’t recommend storing data as Excel spreadsheets. I know someone who’s PhD got delayed by a year because he used Excel for his data, didn’t keep a backup of the original data he hand collected, and then Excel mangled the data to the point of it becoming useless.
The best format for data is generally CSV. It is a simple text file that can be edited in any text editor. It can be read by any software package, including Excel, and so is universally accessible, and if there is problem with the data, it is easy to inspect it in a text editor.
There is an exception, since R has a native data format, RData, that allows us to save and load R variables. In what follows, we will obtain data from a data vendor in CSV format and then convert it into RData, and use that for subsequent analysis.
CSV files have some disadvantages. They are uncompressed and therefore can be quite large. Furthermore, by default, a lot of software truncates significant digits from floating point numbers before saving (both R and Python’s Pandas do that, Julia does not) and one may requires an extra option to include all digits. That can cause problems when one needs to see if numbers change.
The best alternative is the parquet format. It compresses the data, and preserves the full floating point precision. It further has the advantage that it is easy to exchange parquet files with other software packages. The disadvantage is that because it comes in a common binary form, it is hard to inspect the files and edit them. We use parquet files for the data we make for extremerisk.org since we use R, Python and Julia for that site.
7.2.1 Things to look out for
It is important to check the documentation of the functions we are using to make sure we are doing things right. For example, we need to be careful if:
- The CSV file has headers or not. A header is the title of a column. We can have a CSV file where the first row includes the column titles, or a csv with no titles;
- The separator is a comma (,), a semicolon (;), a tab ( or another symbol;
- We are getting data from a country where the decimal symbol is a comma (,) instead of a period (.).
By checking the documentation of a function, we will see how to deal with each of these specific cases and more. To do so, run ?read.csv
7.3 Data used here
There are three data sets we use below, US stock prices, the SP-500 and SP-500 TR indices. All obtained with permission from EOD.
The raw data arrives as CSV files, called stocks.csv
, sp500.csv
and sp500tr.csv
.
See Chapter 3.
7.3.1 Processing the sp500.csv
file
We can import directly a .csv
file into R as a data.frame
object using the function read.csv()
. The first three lines in the file are:
date,Close
20030102,909.03
20030103,908.59
and the last two:
20221229,3849.28
20221230,3839.5
=read.csv('data/sp500.csv') sp500
Now let’s check that our variable df
is actually a data frame:
class(sp500)
We can check the structure of a data frame using the function str()
. This can be quite useful to see what every column of the data frame holds, and what type of data it has:
str(sp500)
'data.frame': 5035 obs. of 2 variables:
$ date : int 20030102 20030103 20030106 20030107 20030108 20030109 20030110 20030113 20030114 20030115 ...
$ Close: num 909 909 929 923 910 ...
head(sp500,2)
tail(sp500,2)
date | Close | |
---|---|---|
<int> | <dbl> | |
1 | 20030102 | 909.03 |
2 | 20030103 | 908.59 |
date | Close | |
---|---|---|
<int> | <dbl> | |
5034 | 20221229 | 3849.28 |
5035 | 20221230 | 3839.50 |
names(sp500)
- 'date'
- 'Close'
The name Close
is not that convenient, so we rename it
names(sp500)[2]="price"
names(sp500)
- 'date'
- 'price'
We can make a new column with log returns
\[ y_t=\log P_1 - \log P_{t-1}. \]
When we do this we will not have an observation for day 1, and since the new column needs to have the same number of rows as the data frame, we have to add a first observation, and put NA
into it:
$y = c(NA,diff(log(sp500$price)))
sp500head(sp500)
date | price | y | |
---|---|---|---|
<int> | <dbl> | <dbl> | |
1 | 20030102 | 909.03 | NA |
2 | 20030103 | 908.59 | -0.0004841496 |
3 | 20030106 | 929.01 | 0.0222255557 |
4 | 20030107 | 922.93 | -0.0065661110 |
5 | 20030108 | 909.93 | -0.0141857185 |
6 | 20030109 | 927.57 | 0.0192005899 |
It will nicer to remove the first row. One way to do that is
=sp500[2:dim(sp500)[1],] sp500
A more elegant way is to
=sp500[!is.na(sp500$y),]
sp500head(sp500,2)
date | price | y | |
---|---|---|---|
<int> | <dbl> | <dbl> | |
2 | 20030103 | 908.59 | -0.0004841496 |
3 | 20030106 | 929.01 | 0.0222255557 |
We finally save the data frame into a RData file.
save(sp500,file="data/sp500.RData")
And can load it later by
load("data/sp500.RData")
7.3.2 Processing the sp500tr.csv
file
We do the same with the SP-500 TR file
=read.csv('data/sp500tr.csv')
sp500trnames(sp500tr)[2]="price"
$y = c(NA,diff(log(sp500tr$price)))
sp500trsave(sp500,file="data/sp500.RData")
7.3.3 Processing the stocks.csv
file
The start of stocks.csv
file looks like
ticker,date,Close,Adjusted_close
MCD,20030102,16.55,9.6942
MCD,20030103,16.12,9.4423
And the end
INTC,20221229,26.21,25.8945
INTC,20221230,26.43,26.1118
=read.csv('data/stocks.csv')
stocksstr(stocks)
'data.frame': 30210 obs. of 4 variables:
$ ticker : chr "MCD" "MCD" "MCD" "MCD" ...
$ date : int 20030102 20030103 20030106 20030107 20030108 20030109 20030110 20030113 20030114 20030115 ...
$ Close : num 16.6 16.1 16.6 16.7 16.8 ...
$ Adjusted_close: num 9.69 9.44 9.75 9.76 9.86 ...
We see the data frame has 5 columns and 30210 rows. The str()
function shows us the names of each column, along with the data type it holds, and some observations. We see there is a column for every stock, which holds numeric data, and a column for dates, which is of type character.
We can also check parts of the structure with different functions:
dim(stocks)
nrow(stocks)
ncol(stocks)
colnames(stocks)
- 30210
- 4
- 'ticker'
- 'date'
- 'Close'
- 'Adjusted_close'
It might be good to rename some of the colums
names(stocks)[3:4]=c("UnAdjustedPrice","price")
names(stocks)
- 'ticker'
- 'date'
- 'UnAdjustedPrice'
- 'price'
The reason is that we will not work much with the unadjusted prices, so price
is transarent and simple.
We can show the first and last row, and rbind
put that into the same data frame
rbind(head(stocks,1),tail(stocks,1))
ticker | date | UnAdjustedPrice | price | |
---|---|---|---|---|
<chr> | <int> | <dbl> | <dbl> | |
1 | MCD | 20030102 | 16.55 | 9.6942 |
30210 | INTC | 20221230 | 26.43 | 26.1118 |
We can now see how which stocks we have:
unique(stocks$ticker)
- 'MCD'
- 'DIS'
- 'AAPL'
- 'GE'
- 'JPM'
- 'INTC'
7.3.3.1 Reshaping
The stocks
data frame isn’t particularly useful as it, it would be much better to have the prices, for each stock in its own column. While there are many ways to do that, the best is to use the package reshape2
.
If we want to create a data frame with the returns for every stock, we can use the dcast()
function. date ~
means each row is data from same date
and each column is data from same ticker
, value.var
is the column of the input data frame that will be filled into the new data frame.
= dcast(stocks, date ~ ticker, value.var = "price")
Price head(Price,2)
= dcast(stocks, date ~ ticker, value.var = "UnAdjustedPrice") UnAdjustedPrice
date | AAPL | DIS | GE | INTC | JPM | MCD | |
---|---|---|---|---|---|---|---|
<int> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | |
1 | 20030102 | 0.2250 | 13.7547 | 88.5027 | 9.7865 | 14.5077 | 9.6942 |
2 | 20030103 | 0.2265 | 13.8344 | 88.2248 | 9.6985 | 14.7929 | 9.4423 |
This works since all the stocks are American, and all from the same exchange, so all with observations on the same days.
We can now make a data frame with returns. There are several ways to do that. Note that the first column is dates and we dont want to transform those.
=Price
Returnfor (i in 2:dim(Price)[2]) Return[,i]=c(NA,diff(log(Price[,i])))
head(Return,2)
date | AAPL | DIS | GE | INTC | JPM | MCD | |
---|---|---|---|---|---|---|---|
<int> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | |
1 | 20030102 | NA | NA | NA | NA | NA | NA |
2 | 20030103 | 0.006644543 | 0.00577766 | -0.003144957 | -0.009032651 | 0.01946779 | -0.02632817 |
And to remove the NA
.
=Price[!is.na(Return[,2]),]
Price=UnAdjustedPrice[!is.na(Return[,2]),]
UnAdjustedPrice=Return[!is.na(Return[,2]),]
Returnhead(Price,2)
head(UnAdjustedPrice,2)
head(Return,2)
date | AAPL | DIS | GE | INTC | JPM | MCD | |
---|---|---|---|---|---|---|---|
<int> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | |
2 | 20030103 | 0.2265 | 13.8344 | 88.2248 | 9.6985 | 14.7929 | 9.4423 |
3 | 20030106 | 0.2265 | 14.5436 | 90.4825 | 10.0738 | 15.9562 | 9.7527 |
date | AAPL | DIS | GE | INTC | JPM | MCD | |
---|---|---|---|---|---|---|---|
<int> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | |
2 | 20030103 | 14.9016 | 17.1292 | 25.3826 | 16.54 | 25.94 | 16.12 |
3 | 20030106 | 14.9016 | 18.0074 | 26.0322 | 17.18 | 27.98 | 16.65 |
date | AAPL | DIS | GE | INTC | JPM | MCD | |
---|---|---|---|---|---|---|---|
<int> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | |
2 | 20030103 | 0.006644543 | 0.00577766 | -0.003144957 | -0.009032651 | 0.01946779 | -0.02632817 |
3 | 20030106 | 0.000000000 | 0.04999279 | 0.025268359 | 0.037966760 | 0.07570013 | 0.03234458 |
7.4 Saving RData files
Once we have finished cleaning and handling our dataset, we can save it so we don’t have to repeat the procedure next time we want to use it.
We can save each variable in its own file, like
save(Return, file = "data/Return.RData")
To save your data frame as a .csv
file you can use the function write.csv()
:
write.csv(Return, file = "data/Return.csv")
But it can be more convenient, and safer to put all the data frames into one list:
=list(Return=Return,Price=Price,UnAdjustedPrice=UnAdjustedPrice)
xsave(stocks, file = "data/stocks.RData")
names(x)
- 'Return'
- 'Price'
- 'UnAdjustedPrice'
Even better is simply to put all variables into the same list
=list(
dataReturn=Return,
Price=Price,
UnAdjustedPrice=UnAdjustedPrice,
sp500=sp500,
sp500tr=sp500tr,
Ticker=unique(stocks$ticker)
)save(data, file = "data/data.Rdata")
names(data)
- 'Return'
- 'Price'
- 'UnAdjustedPrice'
- 'sp500'
- 'sp500tr'
- 'Ticker'