9  Descriptive statistics

The main strength of R is statistical analysis. It has considerable depth in the statistical procedures built in, and it comes with a large number of specialised statistical packages for types of analysis that are hard to find anywhere else.

9.1 Libraries

library(moments)  # skewness() and kurtosis()
library(tseries)  # jarque.bera.test()
library(car)      # qqPlot()
library(reshape2) # data reshaping
source("common/functions.r",chdir=TRUE)

9.2 Loading data

data=ProcessRawData() # defined in common/functions.r
Price=data$Price
Return=data$Return
sp500=data$sp500
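
A quick check of the loaded data; sp500 is a data frame whose columns include the trading date and the daily return y used throughout this chapter:

head(sp500$date) # trading dates in yyyymmdd integer format
head(sp500$y)    # daily SP-500 returns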

9.3 Sample statistics

The mean and standard deviation functions, mean() and sd(), are built in, while skewness() and kurtosis() require library(moments).

mean(sp500$y)
[1] 0.0002861977
sd(sp500$y)
[1] 0.01216104
skewness(sp500$y)
[1] -0.511112
kurtosis(sp500$y)
[1] 15.7652

We can present these results a bit more nicely by stacking them into a small character matrix.

cat('Sample statistics for the SP-500 from',
 min(sp500$date),'to',
 max(sp500$date),'\n'
)
Sample statistics for the SP-500 from 20030103 to 20221230 
stats=c('mean:',paste0(round(mean(sp500$y)*100,3),'%'))
stats=rbind(stats,c('sd:',paste0(round(sd(sp500$y)*100,2),'%')))
stats=rbind(stats,c('skewness:',round(skewness(sp500$y),1)))
stats=rbind(stats,c('kurtosis:',round(kurtosis(sp500$y),1)))
stats
      [,1]        [,2]    
stats "mean:"     "0.029%"
      "sd:"       "1.22%" 
      "skewness:" "-0.5"  
      "kurtosis:" "15.8"  

Or printing them with formatting.

for(i in 1:dim(stats)[1])
 cat(stats[i,1],rep("",10-nchar(stats[i,1])),stats[i,2],"\n")
mean:      0.029% 
sd:        1.22% 
skewness:  -0.5 
kurtosis:  15.8 
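
An alternative, assuming the same stats matrix as above, is sprintf(), which pads the label into a fixed-width, left-justified field:

for(i in 1:nrow(stats))
 cat(sprintf("%-10s %s\n",stats[i,1],stats[i,2]))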

9.4 Statistical distributions

R provides functions for just about every distribution imaginable. We will use several of them: the normal, Student-t, skewed Student-t, chi-square, binomial and Bernoulli. Each comes with the density and distribution function (PDF and CDF), as well as the inverse CDF and random number generation, discussed in Chapter 10.

The first letter of the function name indicates the type: d for the density, p for the probability (the CDF), q for the quantile (the inverse CDF) and r for random number generation. The remainder of the name identifies the distribution, for example, for the normal and Student-t:

  • dnorm, pnorm, qnorm, rnorm;
  • dt, pt, qt, rt.
dnorm(1)
[1] 0.2419707
pnorm(1.164)
[1] 0.877788
qnorm(0.95)
[1] 1.644854
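
The Student-t versions work the same way, with the degrees of freedom as an extra argument:

dt(0,df=3)    # Student-t density at zero
qt(0.95,df=3) # 95% quantile, about 2.35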

9.4.1 Plotting distributions

If you want to plot the distribution functions, first create a vector of x coordinates with seq(a,b,c), which creates a sequence from a to b with spacing c between consecutive numbers. Alternatively, we can specify the number of elements in the sequence with the length argument.

x=seq(-3,3,length=100)
z=seq(0,1,length=100)
plot(x,dnorm(x), main="Normal density",type='l') 

plot(x,pnorm(x), main="Normal distribution",type='l') 

plot(z,qnorm(z), main="Normal quantile",type='l') 
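
The step form of seq() works the same way; for example:

x=seq(-3,3,0.1) # 61 points from -3 to 3, spaced 0.1 apart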

9.5 Testing

9.5.1 Jarque-Bera

The Jarque-Bera (JB) test in the tseries library uses the third and fourth central moments of the sample data to check whether its skewness and kurtosis match those of a normal distribution.

jarque.bera.test(sp500$y)

    Jarque Bera Test

data:  sp500$y
X-squared = 34398, df = 2, p-value < 2.2e-16

The p-value comes from evaluating the JB statistic under its asymptotic chi-squared distribution with two degrees of freedom. Here it is virtually zero, so we have ample evidence to reject the null hypothesis that the data were drawn from a normal distribution.
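
As a check, the JB statistic can be computed directly from the sample moments; a minimal sketch using the functions from library(moments):

n=length(sp500$y)
S=skewness(sp500$y)    # sample skewness
K=kurtosis(sp500$y)    # raw (not excess) kurtosis
JB=n/6*(S^2+(K-3)^2/4) # Jarque-Bera statistic
1-pchisq(JB,df=2)      # asymptotic chi-squared p-value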

9.5.2 Kolmogorov-Smirnov

The Kolmogorov-Smirnov test, ks.test(), compares the empirical distribution of a sample either with a hypothesised distribution, such as the normal, or with a second sample. We start with two simulated normal samples:

x <- rnorm(n=length(sp500$y))
z <- rnorm(n=length(sp500$y))
ks.test(x,z)

    Asymptotic two-sample Kolmogorov-Smirnov test

data:  x and z
D = 0.014303, p-value = 0.6818
alternative hypothesis: two-sided

We do not reject the null hypothesis that the two samples come from the same distribution. Comparing simulated normal and Student-t draws with the SP-500 returns, however, the null is strongly rejected:

x <- rnorm(n=length(sp500$y))
ks.test(x,sp500$y)
Warning in ks.test.default(x, sp500$y): p-value will be approximate in the
presence of ties

    Asymptotic two-sample Kolmogorov-Smirnov test

data:  x and sp500$y
D = 0.4857, p-value < 2.2e-16
alternative hypothesis: two-sided
x <- rt(df=3,n=length(sp500$y))
ks.test(x,sp500$y)
Warning in ks.test.default(x, sp500$y): p-value will be approximate in the
presence of ties

    Asymptotic two-sample Kolmogorov-Smirnov test

data:  x and sp500$y
D = 0.48907, p-value < 2.2e-16
alternative hypothesis: two-sided
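
Note that the raw Student-t draws have standard deviation sqrt(3), about 1.7, while the returns have one of about 1.2%, so the test above rejects on scale alone. A fairer comparison, as a sketch, rescales the simulated sample to the sample volatility:

x <- rt(df=3,n=length(sp500$y))*sd(sp500$y)/sqrt(3) # unit variance, then scaled
ks.test(x,sp500$y)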

9.5.3 Ljung-Box

The Ljung-Box (LB) test checks whether the data exhibit serial correlation. We implement it with the Box.test() function, specifying type = "Ljung-Box". Below we apply it to simulated noise, to the SP-500 returns and to the squared returns, where significant autocorrelation indicates volatility clustering:

x <- rnorm(n=length(sp500$y))
Box.test(x, type = "Ljung-Box")

    Box-Ljung test

data:  x
X-squared = 0.32421, df = 1, p-value = 0.5691
Box.test(sp500$y, type = "Ljung-Box")

    Box-Ljung test

data:  sp500$y
X-squared = 81.253, df = 1, p-value < 2.2e-16
Box.test(sp500$y^2, type = "Ljung-Box")

    Box-Ljung test

data:  sp500$y^2
X-squared = 551.08, df = 1, p-value < 2.2e-16
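
By default Box.test() uses a single lag, which is why df = 1 in the output above. The lag argument tests several lags jointly, for example:

Box.test(sp500$y,lag=20,type="Ljung-Box") # joint test over the first 20 lags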

9.5.4 Autocorrelation

The autocorrelation function (ACF) shows the linear correlation between a value in a time series and its lags. The acf() function plots the autocorrelation of an ordered vector. The horizontal lines are confidence bands: a value outside the band is significantly different from zero.

par(mar=c(4,4,3,0))
acf(sp500$y, main = "Autocorrelation of returns")

The plots made by the built-in acf() don’t look all that nice. While other versions are available for R, including one for ggplot, for now we simply modify the built-in acf(). Call the result myACF() and put it into functions.r.

myACF=function (x, n = 50, lwd = 2, col1 = co2[2], col2 = co2[1], main = NULL) 
{
 # col1 and col2 default to the co2 palette from functions.r
 a = acf(x, n, plot = FALSE)
 # 95% confidence band under the null of zero autocorrelation
 significance_level = qnorm((1 + 0.95)/2)/sqrt(sum(!is.na(x)))
 # drop lag zero and plot the remaining lags as bars
 barplot(a$acf[2:n, , 1], las = 1, col = col1, border = FALSE, 
  xlab = "lag", ylab = "ACF", main = main)
 abline(h = significance_level, col = col2, lwd = lwd)
 abline(h = -significance_level, col = col2, lwd = lwd)
 axis(side = 1)
}
par(mar=c(4,4,1,0))
myACF(sp500$y, main = "Autocorrelation of SP-500 returns")

myACF(sp500$y^2, main = "Autocorrelation of squared SP-500 returns")

9.5.5 QQ-Plots

We can use the quantile-quantile (QQ) plot from library(car) to see how our data fits particular distributions. We will use the qqPlot() function.

Start with the normal plot, the default, indicated by distribution = "norm". qqPlot() returns the indices of the most extreme observations, so we assign the result to x to suppress printing.

par(mar=c(4,4,1,0))
x=qqPlot(sp500$y, distribution = "norm", envelope = FALSE)

We can also use QQ-Plots to see how our data fits the Student-t distribution with various degrees of freedom.

par(mfrow=c(2,2), mar=c(4,4,1,0))
x=qqPlot(sp500$y, distribution = "norm", envelope = FALSE,xlab="normal")
x=qqPlot(sp500$y, distribution = "t", df = 4, envelope = FALSE,xlab="t(4)")
x=qqPlot(sp500$y, distribution = "t", df = 3.5, envelope = FALSE,xlab="t(3.5)")
x=qqPlot(sp500$y, distribution = "t", df = 3, envelope = FALSE,xlab="t(3)")

9.6 Return distribution analysis

Many financial theories assume that returns are normally distributed. In reality, stock returns tend to be heavy-tailed and skewed.

9.7 Histogram/density

To visualise the distribution of stock returns, we can use a histogram with the hist() command.

hist(sp500$y) 

We can improve the plot a bit and superimpose the normal with the same mean and variance as the underlying data.

x=seq(min(sp500$y),max(sp500$y),length=500)
hist(sp500$y,
 freq=FALSE, 
 main="SP-500 histogram", xlab="Return",
 col= "deepskyblue3",
 breaks=30,
 las=1
) 
lines(x,dnorm(x, mean(sp500$y), sd(sp500$y)), lty=2, col="red", lwd=2) 

We can zoom into the plot by using xlim=c(-0.05,0.05):

hist(sp500$y,
 freq=FALSE, 
 main="SP-500 histogram", xlab="Return",
 col= "deepskyblue3",
 breaks=60,
 las=1,
 xlim=c(-0.05,0.05)
) 
lines(x,dnorm(x, mean(sp500$y), sd(sp500$y)), col="red", lwd=3) 

freq=FALSE displays the histogram as a density instead of frequencies, which is more convenient when comparing the empirical distribution with the normal density. main="plot_name" sets the title of the plot, and breaks=30 sets the number of bins. To add a normal density curve, use the lines command, with the mean and standard deviation set to match the sample mean and sample volatility. From the plot, we see that although the empirical distribution has a bell shape, the returns have fat tails and are skewed, and returns around the mean are more frequent than under the normal distribution, so SP-500 returns do not appear normally distributed.
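
A kernel density estimate from density() gives a smoother view of the same empirical distribution and can be overlaid in the same way; a minimal sketch:

hist(sp500$y,
 freq=FALSE,
 main="SP-500 histogram", xlab="Return",
 col="deepskyblue3",
 breaks=60,
 las=1,
 xlim=c(-0.05,0.05)
)
lines(density(sp500$y),lwd=2)                               # kernel density estimate
lines(x,dnorm(x,mean(sp500$y),sd(sp500$y)),col="red",lwd=2) # normal benchmark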

9.7.1 Box plot

Another way to see the return distribution is the box plot. The upper and lower edges of the box are the 75th and 25th percentiles, and the thick line in the middle is the median. By default, the whiskers extend to the most extreme observations within 1.5 times the interquartile range of the box; all points beyond the whiskers are classified as outliers.

boxplot(sp500$y, main="Box Plot - sp500", col="deepskyblue3", ylab="Return")

We see a significant number of outliers in sp500 returns, indicating that extremely high or low returns are not rare. To make a horizontal box plot, add horizontal=TRUE, for example:
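
boxplot(sp500$y, main="Box Plot - sp500", col="deepskyblue3", xlab="Return", horizontal=TRUE)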

9.7.2 Empirical distribution — normal

ecdf() generates the empirical cumulative distribution function. Using the SP-500 returns, we get the following plot.

plot(ecdf(sp500$y))

We can improve the plot and superimpose the normal CDF with the same mean and standard deviation as the data.

plot(ecdf(sp500$y),
 col='blue',
 main="SP-500 empirical distribution and the normal",
 las=1,
 ylab="Probability",
 xlab="Returns"
)
x=seq(min(sp500$y),max(sp500$y),length=length(sp500$y))
lines(x,pnorm(x,mean(sp500$y),sd(sp500$y)),col='red')

Zooming in with xlim:

plot(ecdf(sp500$y),
 col='blue',
 main="SP-500 empirical distribution and the normal",
 las=1,
 ylab="Probability",
 xlab="Returns",
 xlim=c(-0.04,0.04)
)
x=seq(min(sp500$y),max(sp500$y),length=length(sp500$y))
lines(x,pnorm(x,mean(sp500$y),sd(sp500$y)),col='red')

Zooming into the left tail with xlim and ylim:

plot(ecdf(sp500$y),
 col='blue',
 main="SP-500 empirical distribution and the normal",
 las=1,
 ylab="Probability",
 xlab="Returns",
 xlim=c(-0.1,-0.03),
 ylim=c(0,0.01)
)
x=seq(min(sp500$y),max(sp500$y),length=length(sp500$y))
lines(x,pnorm(x,mean(sp500$y),sd(sp500$y)),col='red')

9.7.3 Empirical distribution — AAPL and JPM

We can also compare the return distributions of Apple and JPMorgan.

plot(ecdf(Return$AAPL),
 col='blue',
 las=1,
 main="Apple and JPMorgan"
 )
lines(ecdf(Return$JPM),col='red')
legend('topleft',legend=c("AAPL","JPM"),col=c("blue","red"),bty='n',lty=1)

And zooming in:

plot(ecdf(Return$AAPL),
 col='blue',
 xlim=c(-0.05,0.05),
 las=1,
 main="Apple and JPMorgan"
)
lines(ecdf(Return$JPM),col='red')
legend('topleft',legend=c("AAPL","JPM"),col=c("blue","red"),bty='n',lty=1)

By looking at the upper tail, we can see which of the two stocks delivers more large returns:

plot(ecdf(Return$AAPL),
 col='blue',
 xlim=c(0.02,0.1),
 ylim=c(0.83,1),
 las=1,
 main="Apple and JPMorgan"
)
lines(ecdf(Return$JPM),col='red')
legend('bottomright',legend=c("AAPL","JPM"),col=c("blue","red"),bty='n',lty=1)
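
The same comparison can be made numerically; as a sketch, the empirical frequency of daily returns above a hypothetical 2% threshold:

mean(Return$AAPL>0.02) # fraction of AAPL days with a return above 2%
mean(Return$JPM>0.02)  # fraction of JPM days with a return above 2%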

9.8 Exercise

Using data.Rdata, compute descriptive statistics for all stocks in the Return data frame. Summarise and present the results in a new data frame. It should include mean, standard deviation, skewness, kurtosis, and the p-value of the Jarque-Bera test. Keep five decimal places.