library(moments)
library(tseries)
library(car)
library(reshape2)
source("common/functions.r",chdir=TRUE)
9 Descriptive statistics
The main strength of R is in doing statistical analysis. It has considerable depth in various statistical procedures built in. Also, it comes with a number of specialised statistical packages doing the type of analysis one cannot encounter anywhere else.
9.1 Libraries
9.2 Loading data
=ProcessRawData()
data=data$Price
Price=data$Return
Return=data$sp500 sp500
9.3 Sample statistics
Mean and standard deviation, sd
is directly built in, while skewness and kurtosis need library(moments)
.
mean(sp500$y)
[1] 0.0002861977
sd(sp500$y)
[1] 0.01216104
skewness(sp500$y)
[1] -0.511112
kurtosis(sp500$y)
[1] 15.7652
We can present these results a bit nicer by putting them into a data frame.
cat('Sample statistics for the SP-500 from',
min(sp500$date),'to',
max(sp500$date),'\n'
)
Sample statistics for the SP-500 from 20030103 to 20221230
=data.frame()
stats=c('mean:',paste0(round(mean(sp500$y)*100,3),'%'))
stats=rbind(stats,c('sd:',paste0(round(sd(sp500$y)*100,2),'%')))
stats=rbind(stats,c('skewness:',round(skewness(sp500$y),1)))
stats=rbind(stats,c('kurtosis:',round(kurtosis(sp500$y),1)))
stats stats
[,1] [,2]
stats "mean:" "0.029%"
"sd:" "1.22%"
"skewness:" "-0.5"
"kurtosis:" "15.8"
Or printing them with formatting.
for(i in 1:dim(stats)[1])
cat(stats[i,1],rep("",10-nchar(stats[i,1])),stats[i,2],"\n")
mean: 0.029%
sd: 1.22%
skewness: -0.5
kurtosis: 15.8
9.4 Statistical distributions
R provides functions for just about every distribution imaginable. We will use several of those: the normal, student-t, skewed student-t, chi-square, binomial and the Bernoulli. They all provide densities and distribution functions (PDF and CDF), as well as the inverse CDF and random numbers discussed in Chapter 10.
The first letter of the function name indicates which of these four types, d
for distribution, p
for probability, q
for quantile and r
for random. The remainder of the name is the distribution, for example, for the normal and student-t:
dnorm
,pnorm
,qnorm
,rnorm
;dt
,pt
,qt
,rt
.
dnorm(1)
[1] 0.2419707
pnorm(1.164)
[1] 0.877788
qnorm(0.95)
[1] 1.644854
9.4.1 Plotting distributions
If you want to plot the distribution functions, first create a vector for the x coordinates with seq(a,b,c)
which creates a sequence from a
to b
where the space between the numbers is c
. Alternatively, we can specify the number of elements in the sequence.
=seq(-3,3,length=100)
x=seq(0,1,length=100)
zplot(x,dnorm(x), main="Normal density",type='l')
plot(x,pnorm(x), main="Normal distribution",type='l')
plot(z,qnorm(z), main="Normal quantile",type='l')
9.5 Testing
9.5.1 Jarque-Bera
The Jarque-Bera (JB) test in the tseries
library uses the third and fourth central moments of the sample data to check whether its skewness and kurtosis match a normal distribution.
jarque.bera.test(sp500$y)
Jarque Bera Test
data: sp500$y
X-squared = 34398, df = 2, p-value < 2.2e-16
The p-value presented is the inverse of the JB statistic under the CDF of the asymptotic distribution. The p-value of this test is virtually zero, which means that we have enough evidence to reject the null hypothesis that our data has been drawn from a normal distribution.
9.5.2 Kolmogorov-Smirnov
The Kolmogorov-Smirnov test, ks.test()
is based on comparing actual observations with some hypothesised distribution, like the normal:
<- rnorm(n=length(sp500$y))
x <- rnorm(n=length(sp500$y))
z ks.test(x,z)
Asymptotic two-sample Kolmogorov-Smirnov test
data: x and z
D = 0.014303, p-value = 0.6818
alternative hypothesis: two-sided
not rejecting
<- rnorm(n=length(sp500$y))
x ks.test(x,sp500$y)
Warning in ks.test.default(x, sp500$y): p-value will be approximate in the
presence of ties
Asymptotic two-sample Kolmogorov-Smirnov test
data: x and sp500$y
D = 0.4857, p-value < 2.2e-16
alternative hypothesis: two-sided
<- rt(df=3,n=length(sp500$y))
x ks.test(x,sp500$y)
Warning in ks.test.default(x, sp500$y): p-value will be approximate in the
presence of ties
Asymptotic two-sample Kolmogorov-Smirnov test
data: x and sp500$y
D = 0.48907, p-value < 2.2e-16
alternative hypothesis: two-sided
9.5.3 Ljung-Box
The Ljung-Box (LB) test checks if the presented data exhibits serial correlation. We can implement the test using the Box.test()
function and specifying type = "Ljung-Box"
:
<- rnorm(n=length(sp500$y))
x Box.test(x, type = "Ljung-Box")
Box-Ljung test
data: x
X-squared = 0.32421, df = 1, p-value = 0.5691
Box.test(sp500$y, type = "Ljung-Box")
Box-Ljung test
data: sp500$y
X-squared = 81.253, df = 1, p-value < 2.2e-16
Box.test(sp500$y^2, type = "Ljung-Box")
Box-Ljung test
data: sp500$y^2
X-squared = 551.08, df = 1, p-value < 2.2e-16
9.5.4 Autocorrelation
The autocorrelation function shows us the linear correlation between a value in our time series and its different lags. The acf
function plots the autocorrelation of an ordered vector. The horizontal lines are confidence intervals, meaning that if a value is outside of the interval, it is considered significantly different from zero.
par(mar=c(4,4,3,0))
acf(sp500$y, main = "Autocorrelation of returns")
The plots made by the acf function don’t look all that nice. While there are other versions available for R, including one for ggplot
, for now, we will simply modify the built-in acf()
. Call that myACF()
and put it into functions.r
.
=function (x, n = 50, lwd = 2, col1 = co2[2], col2 = co2[1], main = NULL)
myACF
{= acf(x, n, plot = FALSE)
a = qnorm((1 + 0.95)/2)/sqrt(sum(!is.na(x)))
significance_level barplot(a$acf[2:n, , 1], las = 1, col = col1, border = FALSE,
las = 1, xlab = "lag", ylab = "ACF", main = main)
abline(significance_level, 0, col = col2, lwd = lwd)
abline(-significance_level, 0, col = col2, lwd = lwd)
axis(side = 1)
}par(mar=c(4,4,1,0))
myACF(sp500$y, main = "Autocorrelation of SP-500 returns")
myACF(sp500$y^2, main = "Autocorrelation of squared SP-500 returns")
9.5.5 QQ-Plots
We can use the quantile-quantile (QQ) plot from library(car)
to see how our data fits particular distributions. We will use the qqPlot()
function.
Start with the normal plot, indicated by distribution = "norm"
which is the default.
par(mar=c(4,4,1,0))
=qqPlot(sp500$y, distribution = "norm", envelope = FALSE) x
We can also use QQ-Plots to see how our data fits the distributions with various degrees of freedom.
par(mfrow=c(2,2), mar=c(4,4,1,0))
=qqPlot(sp500$y, distribution = "norm", envelope = FALSE,xlab="normal")
x=qqPlot(sp500$y, distribution = "t", df = 4, envelope = FALSE,xlab="t(4)")
x=qqPlot(sp500$y, distribution = "t", df = 3.5, envelope = FALSE,xlab="t(3.5)")
x=qqPlot(sp500$y, distribution = "t", df = 3, envelope = FALSE,xlab="t(3)") x
9.6 Return distribution analysis
In many financial theories, returns are usually assumed to be normally distributed. In reality, stock returns are usually heavy-tailed and skewed.
9.7 Histogram/density
To visualise the distribution of stock returns, we can use a histogram with the hist()
command.
hist(sp500$y)
We can improve the plot a bit and superimpose the normal with the same mean and variance as the underlying data.
=seq(min(sp500$y),max(sp500$y),length=500)
xhist(sp500$y,
freq=FALSE,
main="SP-500 histogram", xlab="Return",
col= "deepskyblue3",
breaks=30,
las=1
) lines(x,dnorm(x, mean(sp500$y), sd(sp500$y)), lty=2, col="red", lwd=2)
We can zoom into the plot by using xlim=c(-0.05,0.05)
:
hist(sp500$y,
freq=FALSE,
main="SP-500 histogram", xlab="Return",
col= "deepskyblue3",
breaks=60,
las=1,
xlim=c(-0.05,0.05)
) lines(x,dnorm(x, mean(sp500$y), sd(sp500$y)), col="red", lwd=3)
freq=F
displays histograms with density instead of frequency. This is more convenient if we want to compare the empirical distributions with the normal density. main="plot_name"
sets the title of the plot, and breaks=30
sets the number of bins. To add a normal density line, use the lines
command. The mean and standard deviation are set to match the sample mean and sample volatility. From the plot, we see that although the empirical distribution has a bell shape, the returns have fat tails and are skewed. Returns around the mean are more frequent than the normal distribution, so SP-500 returns appear not normally distributed.
9.7.1 Box plot
Another way to see the return distributions is the box plot. The upper and lower edges of the box are the 75th and 25th percentile, and the thick line in the middle represents the median. All points beyond the boundary of whiskers are classified as outliers.
boxplot(sp500$y, main="Box Plot - sp500", col="deepskyblue3", ylab="Return")
We see a significant number of outlines in sp500 returns. This indicates that extremely high/low returns are not rare. To make a horizontal box plot, add horizontal=T
.
9.7.2 Empirical density — normal
ecdf
can generate the empirical cumulative distribution. Using the S&P500, we have the following plot.
plot(ecdf(sp500$y))
plot(ecdf(sp500$y),
col='blue',
main="SP-500 empirical distribution and the normal",
las=1,
ylab="Probability",
xlab="Returns"
)=seq(min(sp500$y),max(sp500$y),length=length(sp500$y))
xlines(x,pnorm(x,mean(sp500$y),sd(sp500$y)),col='red')
Zooming in with xlim
:
plot(ecdf(sp500$y),
col='blue',
main="SP-500 empirical distribution and the normal",
las=1,
ylab="Probability",
xlab="Returns",
xlim=c(-0.04,0.04)
)=seq(min(sp500$y),max(sp500$y),length=length(sp500$y))
xlines(x,pnorm(x,mean(sp500$y),sd(sp500$y)),col='red')
Zooming into the left tail with xlim
:
plot(ecdf(sp500$y),
col='blue',
main="SP-500 empirical distribution and the normal",
las=1,
ylab="Probability",
xlab="Returns",
xlim=c(-0.1,-0.03),
ylim=c(0,0.01)
)=seq(min(sp500$y),max(sp500$y),length=length(sp500$y))
xlines(x,pnorm(x,mean(sp500$y),sd(sp500$y)),col='red')
9.7.3 Empirical density — AAPL and JPM
We can also compare the distribution of Apple and JPMorgan.
plot(ecdf(Return$AAPL),
col='blue',
las=1,
main="Apple and JPMorgan"
)lines(ecdf(Return$JPM),col='red')
legend('topleft',legend=c("AAPL","JPM"),col=c("blue","red"),bty='n',lty=1)
And zooming in
plot(ecdf(Return$AAPL),
col='blue',
xlim=c(-0.05,0.05),
las=1,
main="Apple and JPMorgan"
)lines(ecdf(Return$JPM),col='red')
legend('topleft',legend=c("AAPL","JPM"),col=c("blue","red"),bty='n',lty=1)
By looking at the upper tail, we can see which of the two stocks delivers more large returns:
plot(ecdf(Return$AAPL),
col='blue',
xlim=c(0.02,0.1),
ylim=c(0.83,1),
las=1,
main="Apple and JPMorgan"
)lines(ecdf(Return$JPM),col='red')
legend('bottomright',legend=c("AAPL","JPM"),col=c("blue","red"),bty='n',lty=1)
9.8 Exercise
Using data.Rdata
, compute descriptive statistics for all stocks in the Return
data frame. Summarise and present the results in a new data frame. It should include mean, standard deviation, skewness, kurtosis, and the p-value of the Jarque-Bera test. Keep five decimal places.