Monday, 30 June 2014
Friday, 27 June 2014
Book Statistics Corner, Part 4 – Sales Trends of Ten of the Most Popular Book Series
In a previous post, we looked at some sales statistics for one popular book series, Patrick O’Brian’s Aubrey/Maturin novels, a historical fiction series about the Royal Navy during the era of the Napoleonic Wars. Now we will extend that analysis by adding nine more of the most popular series in recent history. The series were selected from a Wikipedia article, “List of bestselling books”.
The author, series title and total sales (copies) are shown below:
Author and Series  Total Sales (copies)
J.K. Rowling – Harry Potter  447,000,000
Dan Brown – Robert Langdon  200,000,000
Stephenie Meyer – Twilight  120,000,000
Suzanne Collins – Hunger Games  50,000,000
Robert Jordan – Wheel of Time  44,000,000
Stephen King – The Dark Tower  30,000,000
G.R.R. Martin – Game of Thrones  24,000,000
Veronica Roth – Divergent  20,000,000
Douglas Adams – Hitchhiker's Guide  16,000,000
Patrick O'Brian – Aubrey/Maturin  4,000,000
Book Num  GR Reviews  GR Ratings  Sales (Wiki)  First Pub  GR Rating
1  40,793  2,580,696  107,000,000  1997  4.38
2  16,682  1,177,363  60,000,000  1998  4.28
3  18,824  1,219,695  55,000,000  1999  4.46
4  16,610  1,182,736  55,000,000  2000  4.46
5  15,789  1,136,636  55,000,000  2003  4.40
6  15,599  1,136,725  65,000,000  2005  4.48
7  37,191  1,175,133  50,000,000  2007  4.57
Total  161,488  9,608,984  447,000,000  –  4.43
As the table and graph for the Harry Potter series indicate, the number of copies sold correlates quite closely with the number of people who rated the books on Goodreads, once we have normalized the data. We normalize by setting the value of each statistic to 100 for Book 1, then expressing the following volumes relative to that index. For example, Volume 1 (Philosopher’s Stone) sold 107 million copies, while Volume 2 (Chamber of Secrets) sold 60 million copies. Since 60/107 = 0.56, Volume 2 is given the value 56, compared with 100 for Volume 1. The other books and measures are treated the same way. The correlation coefficient between copies sold and number of ratings is 0.964, which is high (a value of 1.00 would indicate a perfect positive correlation between the two variables).
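For anyone who wants to check the arithmetic, the indexing and the correlation can be sketched in a few lines of Python. This is only an illustration using the figures from the Harry Potter table, not the spreadsheet work behind the graphs; the helper names index_to_first and pearson are made up for this example.

```python
from math import sqrt

# Figures copied from the Harry Potter table (Books 1 to 7).
sales = [107_000_000, 60_000_000, 55_000_000, 55_000_000,
         55_000_000, 65_000_000, 50_000_000]
ratings = [2_580_696, 1_177_363, 1_219_695, 1_182_736,
           1_136_636, 1_136_725, 1_175_133]


def index_to_first(values):
    """Scale a series so that its first entry equals 100."""
    return [100 * v / values[0] for v in values]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


sales_index = index_to_first(sales)
ratings_index = index_to_first(ratings)
r = pearson(sales_index, ratings_index)

print("Sales index:  ", [round(v) for v in sales_index])
print("Ratings index:", [round(v) for v in ratings_index])
print("Correlation:  ", round(r, 3))
```

Because the correlation coefficient does not depend on scale, the raw counts and the indexed values give the same result; the indexing only matters for putting several measures on one graph.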
Another way to see this is to divide the number of Goodreads ratings by the number of copies sold for each volume. As you can see, the result is consistently close to 2 percent. That also shows that Goodreads has a fairly wide reach among readers, at least as far as the Harry Potter series is concerned:
Book Num  Title (Harry Potter and the…)  Ratings pct of Sales
1  Philosopher's Stone  2.4%
2  Chamber of Secrets  2.0%
3  Prisoner of Azkaban  2.2%
4  Goblet of Fire  2.2%
5  Order of the Phoenix  2.1%
6  Half-Blood Prince  1.7%
7  Deathly Hallows  2.4%
Now let’s look at the book series in detail, focusing on the number of Goodreads raters versus the position of each book within its series. We will go in order of series sales, largest to smallest.
1 – Harry Potter (J.K. Rowling)
We see that the series followed a power law quite closely, with the second and the last books departing somewhat from the best-fit curve. The median book had about half as many raters as the first book. As noted above, if we divide the number of Goodreads raters by the number of copies sold, we come up with a figure of 2.1%. This relatively low figure may reflect the fact that a substantial part of the audience did not participate on Goodreads, perhaps because they were too young.
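If you would like to try a power-law fit on your own series, here is a minimal Python sketch. It assumes the common approach of fitting a straight line to log(ratings) against log(book number) by ordinary least squares, which may not be exactly how the curves in these graphs were produced. The function name fit_power_law is hypothetical, and the ratings counts are the Harry Potter figures from the table earlier in this post.

```python
import math


def fit_power_law(values):
    """Fit values[k-1] ~ a * k**b by least squares on the log-log scale.

    Returns (a, b). Assumes every value is positive.
    """
    xs = [math.log(k) for k in range(1, len(values) + 1)]
    ys = [math.log(v) for v in values]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx                       # slope in log-log space = exponent
    a = math.exp(mean_y - b * mean_x)   # intercept converted back to a scale
    return a, b


# Goodreads ratings counts for Harry Potter Books 1 to 7.
hp_ratings = [2_580_696, 1_177_363, 1_219_695, 1_182_736,
              1_136_636, 1_136_725, 1_175_133]

a, b = fit_power_law(hp_ratings)
print(f"fitted curve: ratings ~ {a:,.0f} * book_number ** {b:.2f}")
for k, actual in enumerate(hp_ratings, start=1):
    print(k, actual, round(a * k ** b))
```

A negative exponent b means the ratings decay as the series goes on; the closer b is to zero, the flatter the curve.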
2 – Robert Langdon (Dan Brown)
In this case, we see that the data depart from the power-law form by quite a bit. That’s mostly because the second book of the series, The Da Vinci Code, was the big breakout success. In fact, many people think it is the first book in the series, which was actually Angels and Demons. But the second book caught the public’s fancy more, probably because of its implications for the church. Note that the last two books seem to have lagged the first two quite badly, at least in this data, but that level of success is hard to repeat. If we divide the number of Goodreads raters by the number of copies sold, we come up with a ratio of 1.4%. This probably reflects an older, less social-media-driven audience for this type of series (and perhaps a less enthusiastic one).
3 – Twilight (Stephenie Meyer)
This four-book series followed a relatively flat power law very closely. After the initial drop-off from Book 1, she seems to have held on to about 40% of the first book’s raters, very consistently. If we divide the number of Goodreads raters by the number of copies sold, we come up with a ratio of 4.2%, a middling-to-high figure.
4 – Hunger Games (Suzanne Collins)
This series also conformed closely to the power law, though naturally that’s easier to do with only three data points to fit. She also did a very good job of holding on to about half of the raters through the final two books of the trilogy. If we divide the number of Goodreads raters by the number of copies sold, we come up with a ratio of 10.4%. This would seem to indicate that readers of this series were very enthusiastic about sharing their ratings of the books and were very social-media aware.
5 – Wheel of Time (Robert Jordan)
This series conformed fairly well to the power law, but with some bumps along the way. From reading reviews, it seems that the series lagged somewhat in the latter middle part, then picked up again towards the end. Nonetheless, it did an excellent job of holding on to raters as the series progressed, given its length. Nearly half were still engaged for most of the latter half of a long series. If we divide the number of Goodreads raters by the number of copies sold, we come up with a ratio of 2.2%. Given the length of the series, this might also reflect an older, less social-media-driven audience.
6 – The Dark Tower (Stephen King)
This one is almost a textbook example of a power law. King also did well to hold on to a lot of raters over a long series. If we divide the number of Goodreads raters by the number of copies sold, we come up with a ratio of 2.1%. Again, given the length of the series, this might reflect an older, less social-media-driven audience.
7 – Game of Thrones (G.R.R. Martin)
Yes, I know that’s not the real name of the series (it is A Song of Ice and Fire), but that’s the name of the TV show, so I figure that’s how most people think of it. Again, it is almost a picture-perfect example of a power law. The last book has lagged a bit, but he still has two more books to go. Again, he has done a good job of holding on to nearly half of his original audience, as inferred from Goodreads raters. If we divide the number of Goodreads raters by the number of copies sold, we come up with a ratio of 7.6%. Perhaps this is at least partially due to the series having a concurrent TV adaptation, with the consequent buzz and cross-promotion.
8 – Divergent (Veronica Roth)
This is a pretty decent fit, but it has to be borne in mind that there are only three points to fit. If we divide the number of Goodreads raters by the number of copies sold, we come up with a ratio of 8.2%. As with the Twilight series, this would seem to indicate an audience that is very enthusiastic about the books and keen to share its feelings on social media.
9 – Hitchhiker’s Guide (Douglas Adams)
Again, this is a nearly perfect fit to a power law. However, it has quite a steep drop-off, with the first book in the series getting far more ratings than the later books. This seems to be a feature of older books and how they interact with Goodreads. It may be more a reflection of people’s recall of an older series than of underlying sales. However, if we divide the number of Goodreads raters by the number of copies sold, we come up with a ratio of 6.1%, which is quite high for a series whose author died quite a while back and whose audience probably skews older.
10 – Aubrey/Maturin (Patrick O’Brian)
Again, this is a very good fit to a power law, especially given the length of the series. We see a bump at Book 10 (that was the book that shared the title with the movie “The Far Side of the World”). Book 2 is also a bit low. If we divide the number of Goodreads raters by the number of copies sold, we come up with a ratio of 2.9%, which is what we might expect for a series whose author died quite a while back and whose audience probably skews older.
Some Conclusions
· It does appear that the number of Goodreads raters reflects the number of copies sold fairly accurately within a series (i.e. there is a good correlation). At any rate, that appears to be the case for the Harry Potter series, where the number of Goodreads raters was about 2% of the copies sold.
· However, the same calculation ranged from a high of about 10% for the Hunger Games series to a low of 1.4% for the Dan Brown series. So, there is considerable variability in how readily different audiences make their way to Goodreads and share their opinions by rating the books.
· Nearly all of these very popular series fit a power law very well. The main exception was the Dan Brown series, where The Da Vinci Code broke the pattern. But that book truly was exceptional, on a lot of grounds.
· Most of the newer series still had 40% to 50% of the first book’s Goodreads raters involved at the midpoint of the series. The older series had a much greater rate of drop-off, though this may also be related to the fit between the audience of the series and the membership of Goodreads.
Next time we will look at whether these findings hold for
Amazon reviews, or whether things are different in the Kindle world.
Note that I will put up the raw data that these graphs are based on a little later in the week.
Friday, 20 June 2014
Book Statistics Corner, Part 3 – The Decay Curve of a Book Series
These days, a lot of writers are writing series. There are good reasons for that: once you have built up an audience for a certain setting, cast of characters and genre, you would like to keep that audience. It seems natural that a series would be the way to go. But what might you actually expect from a series? For example, how many people will move on from Book 1 to Book 2, from Book 2 to Book 3, and so on? It seems likely that you will lose some people along the way, but is there a pattern to that? To get a feel for this, let’s look at some results for some well-known, long-running book series. Naturally, we can only look at a few “ideal type” cases, but with luck that will give us some insights that are typical of most series.
First, we will look at Patrick O’Brian’s Aubrey/Maturin historical fiction series. That is the series of books on which the 2003 movie “Master and Commander: The Far Side of the World”, featuring Russell Crowe, was based. The books are about a Royal Navy captain and a ship’s doctor/spy, and are set during the era of the Napoleonic Wars, roughly 1800 to 1815. Why did I pick this series first?
· I have read them all, so I have a good sense of how the series evolved.
· It’s a long series (20 books), so it really tests the idea of loyalty to a series.
· It has sold a lot of books and has had a lot of fans, so the statistical power of the analysis should be high (that just means that the numbers are big, so the results are probably grounded in some underlying reality, not random noise).
· It has spanned the era of print book sales in bookstores to the era of ebook sales in online stores, so we might be able to see whether the change in how stories are stocked and delivered has affected how people consume series.
To begin, we will look at how the series did on a number of Goodreads measures. As you may know, Goodreads is a site where readers can leave reviews, ratings, and recommendations of the books they have read. The measures that we will look at are Number of Reviews, Number of Ratings, Average Rating, and Number of Editions. The results, as taken from the Goodreads website, are shown below. The year that each book was first published is also shown, to give some idea of the time scale involved. There’s a reason the first four books are highlighted, which we will get to later.
Book Num  Title  GR Reviews  GR Ratings  GR Rating  Editions  First Pub
1  Master and Commander  1,700  21,161  4.08  90  1969
2  Post Captain  459  8,760  4.29  63  1972
3  HMS Surprise  319  7,879  4.40  54  1973
4  The Mauritius Command  224  6,885  4.32  53  1977
5  Desolation Island  233  6,307  4.35  50  1977
6  The Fortune of War  170  5,855  4.35  43  1978
7  The Surgeon's Mate  152  5,575  4.35  40  1980
8  Treason's Harbour  110  5,217  4.35  39  1980
9  The Ionian Mission  137  4,390  4.28  41  1981
10  The Far Side of the World  156  5,473  4.41  48  1984
11  The Reverse of the Medal  126  4,221  4.38  40  1986
12  The Letter of Marque  119  4,772  4.43  36  1988
13  The Thirteen Gun Salute  125  3,920  4.35  36  1989
14  The Nutmeg of Consolation  114  4,113  4.37  39  1991
15  Clarissa Oakes/The Truelove  107  3,778  4.33  35  1992
16  The Wine-Dark Sea  102  3,709  4.36  34  1993
17  The Commodore  99  3,626  4.37  38  1994
18  The Yellow Admiral  100  3,813  4.32  36  1996
19  The Hundred Days  93  3,327  4.31  32  1998
20  Blue at the Mizzen  128  3,213  4.34  41  1999
Total  All books  4,773  115,994  4.34  888  –
As you can see, for most measures there was a fairly steady decline from Book 1 to Book 20, though some books in the latter part of the series seem to have done better than the book that immediately preceded them. In mathematics, we would say that it is not a monotonic series, but in statistics we might say that it comes pretty close to one (it is quite well modelled by a power law, in fact).
The data is graphed above, with the various measures (Number of Editions, Number of Goodreads Ratings, and Number of Goodreads Reviews) scaled in such a way that the measures for the first book are assigned the value of 100, and the measures for the books after that are assigned numbers proportional to that initial value. So, for example, the first book had 90 editions printed, while the second book had 63 editions printed. In our scaled variable we have assigned 100 to the first book and 70 to the second book (63/90 = 0.70, so the second book is given the value 70). The reason for using these scales (it’s called normalizing) is so that we can compare the three line graphs on the same scale.
There are a lot of interesting results here. First off, we see that all of the graphs decline fairly steadily (each shows a decay curve), but they fall off at different rates. The fall-off for the number of editions is slowest. That’s interesting, since the number of editions is probably the measure that best tracks the number of books sold and read. After Book 5, the number of editions printed falls to about 40% to 50% of the number of editions printed for the first book. So, Patrick O’Brian appears to have held on to about half of his initial book purchasers as the series matured. There was an uptick at Book 10. That’s “The Far Side of the World”, which was also the title of the movie starring Russell Crowe, so the movie clearly seems to have given the book a bounce.
There was also an uptick for the final book of the series, “Blue at the Mizzen”. A reasonable hypothesis is that those extra editions represent sales to people who followed part of the series and dropped out, but who decided to buy the final book to see how it turned out. However, in some ways Book 19 was really the end of the series (Napoleon is defeated), and Book 20 could be thought of as the start of another series that featured the same main characters in a different setting (the plot moves from the Napoleonic Wars to the wars of liberation in South America). But the author died shortly after Book 20, so there was no chance for a “next generation” follow-up. So the final-book uptick might be related to people buying into a new series, or it might be related to the wrap-up of the original series. We’ll never know.
There are some other interesting features of these decay curves: first, the decay curve of Goodreads ratings falls off more rapidly than the decay curve of the number of editions, and second, the line representing the number of Goodreads reviews falls off even more sharply. So, it appears that people might be less willing to invest the time and energy in rating or reviewing the books that they read in a series as the series goes on. Also, it appears that they are more willing to invest the time in a rating than in a review. That makes sense, as a rating only takes a few seconds, while a review can take five or ten minutes, or much longer than that for those who take their reviewing very seriously indeed.
One other interesting aspect of the decay curves is that they are well modelled by our old friend the power law, of which I have written previously. The fitted lines next to the jagged data lines are these power functions. The R-squared values next to the respective lines indicate that the fits are quite robust, in a statistical sense (an R-squared of 1.00 would indicate that the data fit the power-law function perfectly, so values in the 0.85 to 0.95 range are really quite good fits).
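For the curious, here is a rough Python sketch of how an R-squared like this can be computed for a power-law fit. It assumes the fit is a straight line through log(editions) versus log(book number), with R-squared measured on that log-log scale; that is an assumption on my part about how charting tools typically do it, not a record of how the graph was made. The function name log_log_r_squared is made up for this example, and the edition counts come from the table in this post.

```python
import math

# Number of editions for Aubrey/Maturin Books 1 to 20.
editions = [90, 63, 54, 53, 50, 43, 40, 39, 41, 48,
            40, 36, 36, 39, 35, 34, 38, 36, 32, 41]


def log_log_r_squared(values):
    """R-squared of a straight-line fit to log(value) vs log(position).

    For a one-variable least-squares line, R-squared equals the squared
    Pearson correlation, which is what is computed here.
    """
    xs = [math.log(k) for k in range(1, len(values) + 1)]
    ys = [math.log(v) for v in values]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy ** 2 / (sxx * syy)


r2 = log_log_r_squared(editions)
print(f"R-squared for the editions curve: {r2:.2f}")
```

A series that followed a power law exactly would give 1.00; the bumps at Book 10 and Book 20 are the sort of thing that pulls the value below that.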
The above data also shows that after the first few books, the number of reviews and ratings correlated rather nicely with the number of editions of the book. Since we assume that the number of editions printed correlates fairly well with the number of copies sold, we can have somewhat more confidence in the notion that the total number of reviews a book gets scales fairly well with its total sales. This assumes, of course, that each edition had more or less the same number of copies printed and sold. The relationship can be seen in the graph below. Note, however, that for the initial books in the series, the number of reviews was higher than would be expected from the relationship in the graph. Again, this indicates that people may be more eager to review or rate near the beginning of a series than later on.
It is also worth noting how the average rating of the books changed as the series progressed. As you can see, the first couple of books actually had among the lowest ratings, and after that the ratings were quite consistent, at a bit under 4.4 for the most part. So, it would appear that as the series went on, the readers who dropped out were (not surprisingly) those who were less satisfied with the books, and the readers who stayed with the series were those who were more satisfied. So, the audience was smaller, but more loyal, as the series continued.
As we can see, the number of reviews in the Kindle store does not show the decay-curve pattern that was evident in the Goodreads data, which was probably based primarily on legacy print book sales. In the Kindle store, the last two books had about as many reviews as the first two, and the others had 50 or more reviews, compared with the 85 or 90 for the most-reviewed volumes. So, perhaps the always-available nature of ebooks in the Kindle store has altered the underlying sales dynamics of the series. Of course, when it comes to ebook sales for books published before 2000, we are always looking at the “long tail”, so we might be seeing the dynamics of the long tail, which are generally thought to follow a much flatter power law than initial book sales.
These are all good things to keep in mind when you evaluate the success of your own series, if you are a writer or publisher, especially if you are a self-publisher or small-scale publisher. To summarize:
· There is a power-law-like decay curve (in sales and other measures), or at least there was in the legacy system.
· The slope of that curve varies depending on the measure, with the tendency to rate or review probably falling off faster than sales as the series goes on.
· If your book gets made into a movie, you will most likely get a bump in sales :).
· There may be a bump at the end of a long series, as people who dropped out during some of the middle books come back to see how things turned out.
· Numbers of reviews and/or numbers of ratings (not average ratings) probably scale reasonably well with sales.
· The dynamics of print book series and ebook series may be quite different, with the ebook series possibly having a much flatter decay curve (or none at all).
Well, that’s just one series (a highly successful one) whose sales dynamics we have attempted to infer from publicly available Goodreads and Amazon Kindle data. In later posts we will see whether these results hold for some other series in other genres, such as Robert Jordan’s Wheel of Time series or J.K. Rowling’s Harry Potter series. We will also try to test some recent ebook-only series, to see if the dynamics of those are different.
