Wednesday 24 February 2016

It's Time for a Road Trip – On the Road with Bronco Billy




It's late February, and the sun is beginning to come on noticeably stronger in the more temperate regions. Spring is around the corner now, and that brings on thoughts of ROAD TRIP. Sure, it is still a bit early, but you can start making plans for your next road trip with the help of “On the Road with Bronco Billy”. Sit back and go on a ten-day trucking trip in a big rig, through western North America, from Alberta to Texas, and back again. Explore the countryside, learn some trucking lingo, and observe the shifting cultural norms across this great continent. Then, come spring, try it out for yourself.
                      
It’s free this week (Feb 25 to 29, 2016) on Amazon, 99 cents otherwise.






Here’s the summary:
=======================================================
What follows is an account of a ten-day journey through western North America during a working trip, delivering lumber from Edmonton, Alberta to Dallas, Texas, and returning with oilfield equipment. The writer had the opportunity to accompany a friend who is a professional truck driver, which he eagerly accepted. He works as a statistician for the University of Alberta, and is therefore generally confined to desk, chair, and computer. The chance to see the world from the cab of a truck, and be immersed in the truck driving culture, was intriguing. In early May 1997 they hit the road.

Some time has passed since this journal was written and many things have changed since the late 1990s. That renders the journey not just a geographical one, but also a historical account, which I think only increases its interest.

We were fortunate to have an eventful trip - a mechanical breakdown, a near miss from a tornado, and a large-scale flood were among these events. But even without these turns of fate, the drama of the landscape, the close-up view of the trucking lifestyle, and the opportunity to observe the cultural habits of a wide swath of western North America would have been sufficient to fill up an interesting journal.

The travelogue is about 20,000 words, about 60 to 90 minutes of reading, at typical reading speeds.
=======================================================

Tuesday 23 February 2016

One Year with Harper Lee’s To Kill a Mockingbird and Go Set a Watchman, Part 1


As most people must have heard by now, Harper Lee died last week (Feb 19, 2016). I won’t go into her biography – after all, plenty of people have done that already. Instead, I will present some statistics on her Amazon book sales over the past year or so, as I happened to have been following those intently over that time span. My original motivation was to see how the original announcement and subsequent publishing of “Go Set a Watchman” affected her sales and reviews on Amazon. I had stopped my data gathering a couple of weeks ago, but her death caused me to reconsider. Though it might seem morbid, I was curious how her death affected her sales.

1 – Sales Rank and Key Events over the Year

The graph below shows how sales of To Kill a Mockingbird (TKAM) and Go Set a Watchman (GSAW) did over the timespan from early February 2015 to late February 2016, a period of a little over a year. The graph is fairly self-explanatory, but since it may be difficult to read on some screens, I will follow up with a narrative analysis.
Also, keep in mind that on this graph, when the lines go down, sales are increasing. That can be a little confusing, rather like following golf scores at the U.S. Open – lower numbers are better.
 
The first key event was the announcement that a new book by Harper Lee was in the works, in early February 2015. That kicked off my data analysis, so I don’t have data preceding that event. As you can see, that announcement led to TKAM being ranked at or near the number 1 spot on Amazon in early February 2015. As the later trend indicates, TKAM’s baseline ranking was probably in the 500 to 1000 range, so this announcement had a huge effect on the sales for TKAM. The book stayed in the top 100 for about a month, and within the top 500 right until the summer.
The next key event was the pre-release of GSAW in late May 2015, followed by publication in early July 2015. GSAW began at about rank 500 when the pre-release became available (that means people could pre-order on Amazon, and have the book sent to their Kindle or their mailbox as soon as it was out). It then went steadily up in the ranks (down on the graph), hitting number 1 early in July. TKAM pretty much followed the same path, though it didn’t hit number 1 (its best ranking in that period was number 5).
Both books then lost those excellent rankings, drifting upwards to about the 400 to 600 range over the next few months, until early December. It should be noted, though, that GSAW’s sales ranks were better than TKAM’s during that time.
The next major event happened in early December, when GSAW won its category in the Goodreads Book of the Year (2015) rankings. That had GSAW back at rank number 1 in early December, though it drifted back up into the 500 to 1000 range by early February 2016. TKAM got a bit of a boost at this time from the success of its “sister book”, but the effect was quite small, as the graph indicates.
The larger markers on the graph show Christmas. As you can see, TKAM got a bit of a Christmas boost, dropping a couple hundred places in sales rank, from about 800 to about 600. The Christmas sales rank boost for GSAW was very modest, more of a Christmas blip, really.
Then, of course, we come to Ms. Lee’s death last week. As I noted in the intro to the blog, I had stopped gathering data in early February 2016, thinking that one year was plenty. Thus, the interruption in the time series in the two weeks before her death.
But the important finding is obvious – death can be a good career move, at least for a while. Both books are back to the top 100, with TKAM in the top 25 over the past 3 days (Feb 20 to 22) and GSAW in the 50 to 100 range. Time will tell how long that effect lasts – based on previous key events, one would expect both books to be in the 500 to 1000 rank range within two or three months.

2 – Sales and Key Events over the Year

The next graph is a bit more speculative. I used Hugh Howey and Data Guy’s sales rank to sales conversion formula. This is a power law mapping, based on crowd-sourced author data indicating how daily sales and sales rank correlate. The exact form of the “real” underlying function is unknown, so I am using Data Guy’s version of this as a “best guess”.
Sales Rank    Sales Per Day
1             7,000
5             4,000
20            3,000
35            2,000
100           1,000
200           500
350           250
500           175
750           120
1,500         100
3,000         70
5,500         25
10,000        15
50,000        5
100,000       1
100,001       -

Using the Data Guy mapping, and converting to a continuous function gives (approximately):
Sales = 10^(-0.77*log10(Rank) + 4.30)
I should note that the continuous function estimates somewhat high at the very top ranks, compared to the mapping above. But it is not unusual for there to be uncertainty in real world applications of power laws in the extreme ends of the range.
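For readers who want to try the conversion themselves, here is a minimal Python sketch. It assumes the exponent is applied to the base-10 logarithm of the rank (which is what a power law requires), with the -0.77 and 4.30 coefficients quoted above. The function name is just an illustrative choice, and the table values are Data Guy's mapping as listed above.

```python
import math

# Data Guy's rank-to-sales mapping, as tabulated above: (sales rank, sales per day).
DATA_GUY_TABLE = [
    (1, 7000), (5, 4000), (20, 3000), (35, 2000), (100, 1000),
    (200, 500), (350, 250), (500, 175), (750, 120), (1500, 100),
    (3000, 70), (5500, 25), (10000, 15), (50000, 5), (100000, 1),
]


def estimated_daily_sales(rank, slope=-0.77, intercept=4.30):
    """Power-law estimate: sales = 10 ** (slope * log10(rank) + intercept).

    Assumes the formula is meant to be applied to log10 of the rank.
    """
    if rank < 1:
        raise ValueError("sales rank must be 1 or greater")
    return 10 ** (slope * math.log10(rank) + intercept)


if __name__ == "__main__":
    # Print the tabulated value next to the continuous estimate for each rank.
    for rank, table_sales in DATA_GUY_TABLE:
        estimate = estimated_daily_sales(rank)
        print(f"rank {rank:>7,}  table {table_sales:>5,}  formula {estimate:>9,.0f}")
```

Running it side by side with the table shows where the continuous version and the mapping part ways, most visibly at rank 1.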
The key events and the responses to them are the same as in the previous graph, though the scale of the effects is very different. That’s because of the power law nature of the relationship between sales rank and sales. Sales at the best (numerically lowest) ranks are amplified by the power law, compared to sales at worse (numerically higher) ranks.
For example, we see that the Goodreads Award boost for GSAW is not nearly as prominent in this graph, as it was in the ranking graph. That’s because the award only resulted in the book going as high as 4th rank, and the boost to better than 100th rank only lasted for about a week. Similarly with the sales boost caused by Ms. Lee’s death.
The third graph, below, shows the same data as the previous graph, but uses a logarithmic scale for the y-axis, sales. That makes it easier to see the small-scale effects that are obscured by the scale differences in the original graph. It also makes clear that neither book’s sales ever dropped below 100 copies per day.
 
The final graph is similar to the second graph, but smooths the data with a 7-day moving average filter.
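A 7-day moving average of this sort can be sketched in a few lines of Python. This is only an illustration: it assumes a trailing window (the current day plus the six days before it), which may not be exactly how the graph was smoothed, and the daily sales figures in the example are made up.

```python
def moving_average(values, window=7):
    """Trailing moving average.

    Each output point is the mean of the current value and up to
    (window - 1) values before it. Early points, where fewer than
    `window` values exist, are averaged over what is available.
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    smoothed = []
    running_total = 0.0
    for i, value in enumerate(values):
        running_total += value
        if i >= window:
            # Drop the value that has just fallen out of the window.
            running_total -= values[i - window]
        smoothed.append(running_total / min(i + 1, window))
    return smoothed


if __name__ == "__main__":
    # Made-up daily sales estimates with a one-day spike, for illustration only.
    daily_sales = [120, 130, 110, 900, 140, 125, 135, 128, 122, 131]
    for day, (raw, smooth) in enumerate(zip(daily_sales, moving_average(daily_sales)), start=1):
        print(f"day {day:>2}  raw {raw:>4}  smoothed {smooth:7.1f}")
```

The trade-off is the usual one: a longer window gives a calmer line, but it also spreads a one-day spike over the following week.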

In another blog, I will analyse the trend in Amazon reviews over this time period, and make some observations about the interaction of sales rank and numbers of reviews.

Finally, of course, I have to include the traditional “call to action” – i.e. remind you that you can buy one of our books. Since Harper Lee wrote about the social and racial complexities of the American experience, I will offer up “On the Road with Bronco Billy”, a travelogue and cultural study of late 20th century America, as seen from the cab of a big rig. It also includes some observations on race and class in America, though not with so fine a literary touch as Harper Lee’s books. :)
On the Road with Bronco Billy - A Trucking Journal
Kindle Edition


 
 

Tuesday 16 February 2016

Part 3 of a Review of “Marketing Analytics – A Practical Guide to Real Marketing Science” (by Mike Grigsby, Kogan Page)



http://www.amazon.com/Marketing-Analytics-Practical-Guide-Science/dp/0749474173

A while back, I got a book from my Skillsoft learning library, with the above title. As a statistician/analyst at a university, I was curious about how the statistical techniques that I use on a routine basis are applied in the marketing world. And as someone who is involved in a small publishing venture, I was also curious about the theory and practice of marketing in general, and how it might be used to sell more novels. So, I thought I would read the book and do a write-up for the blog, to help fix ideas in my own mind and inform blog readers as well.

Naturally, if the book interests you, you should go to the source. The Amazon link is given above. The book sells for about 20 bucks, in both e-book and paperback form. Though the content gets somewhat technical, given the subject matter, the writer maintains a very readable style in my opinion.

Since the book is fairly long, a proper look at it will take at least two blogs, maybe three. I previously did a blog on Part One of the book, which was concerned with some elementary statistical ideas, as well as some fundamental concepts and strategies within the marketing world. The second blog was about Part Two of the book, which covered some fairly advanced techniques, whereby one uses one or more variables to predict another.

This third and final section deals with some other fairly advanced statistical techniques, which are less about predicting an outcome variable than about gaining a deeper understanding of a dataset, in a more holistic sense. They include:

  • Factor analysis.

  • Cluster analysis (hierarchical, k-means, latent class).

  • Decision trees.

  • He also gives some review of statistical testing in general, on matters such as experimental vs. observational studies, sample sizes, A/B testing, and the new data mining methods.

In some cases, I have inserted an example of a given technique, from my own book related research, in italics.



Part Three – Inter-relationship Techniques

Chapter 7 – Modeling Inter-relationship Techniques – What Does the Customer (Market) Look Like?

  • This part of the book focuses on techniques that relate a number of variables together, rather than predict the value of one variable from the value of others. So, for example, factor analysis can help determine which variables “go together”, in higher order concepts. These techniques also include market segmentation methods, such as cluster analysis, which determine which groups of customers are most like each other, in order to help conceptualize a broad mass of people into some higher order categories (i.e. sub-markets).
Essentially, these market segmentation methods group the dataset into subsets that are similar to each other within the group but different from other groups. Obviously, one can segment by things such as age, gender or income, but these methods take a deeper dive into the data to discover unexpected groupings. The point of this, of course, is to identify these segments and use different marketing techniques accordingly. After this is accomplished, one might run separate regression analyses for each identified market segment, to see what variables drive behavior for each segment, and how marketing can therefore be optimized. He explains a marketing notion, the four Ps (partition, probe, prioritize, position), and how segmentation methods help in executing those principles. He also notes that good segmentation should produce groups that are:
  • identifiable (e.g. by statistical scoring),
  • substantive (the group is big enough to be worthwhile developing a separate strategy for them),
  • accessible (you are able to contact them),
  • stable (the membership persists over reasonable time intervals),
  • responsive (this all makes a difference in an important metric, like sales).
The actual data used for segmentation can be from the firm's data (transactions and communications), demographics (e.g. census data), survey data, and the like. Once the segments are identified via the appropriate technique (e.g. k-means clustering), some key metrics can be run for each segment, to show how they differ in a real-world, easily interpretable way. Naming the segments is also a key part of the process – relevant, memorable segment names increase buy-in from management (but don't get too cute with trendy, playful names). Finally, of course, the segmentation must be tested and verified as effective and useful.

Chapter 8 – Segmentation Tools and Techniques

  • This chapter gets into the nuts and bolts of segmentation. First, there is a caution about using management-driven a priori rules for segmentation (e.g. big spenders versus small). Letting the data drive the segmentation is more likely to lead to fresh insights.
  • The first method looked at is the CHAID decision trees algorithm. This is really a dependent variables method (somewhat akin to logistic regression), but is often seen as a segmentation method. It determines optimal “splits” of the dataset, based upon a dependent variable of interest. It is a method that is popular in the data mining community, as it is relatively easy to understand and interpret, though it doesn't yield the sort of deeper explanations that a more traditional statistical model like logistic regression can (e.g. coefficient strengths, model R-square, etc).
  • He then briefly notes hierarchical clustering, a method that produces output somewhat akin to family trees, before giving a more in-depth treatment of k-means clustering. This method creates a Euclidean-type distance metric, based on a choice of variables given by the analyst, then creates a number of clusters, again defined by the analyst. Essentially, these clusters are based upon minimizing within-group distances and maximizing between-group distances. As noted, there are a number of subjective steps, namely choosing the number of clusters and the variables to create the metric. This can result in a plethora of alternative models, with no cut and dried decision rule for picking the best one.
Below is a quick example of k-means clustering, using my Amazon Top 100 dataset, for illustrative purposes. I used the following variables for the cluster analysis (please ignore the formatting, as the output from stats packages can be rather primitive):
  • Rank in the book’s Top 100 year (2013 or 2014, when published).
  • Rank in mid-2015
  • Writer’s age
  • Price in Top 100 year.
  • Number of reviews in Top 100 year.
  • Number of reviews in mid 2015.
  • Writer’s sex (male=0, female=1)
Final Cluster Centers
╔══════════════════╤═══════════════════════╗
║                  │        Cluster        ║
║                  ├───────┬───────┬───────╢
║                  │   1   │   2   │   3   ║
╟──────────────────┼───────┼───────┼───────╢
║ Rank             │    25 │    53 │    56 ║
║ Rank_July19_2015 │  1039 │  1309 │ 10279 ║
║ WriterAge        │    54 │    54 │    51 ║
║ Price1_Top100_Yr │  6.70 │  7.20 │  5.92 ║
║ NumRev_YR1       │  4430 │  1568 │  1385 ║
║ NumRev_July2015  │  9919 │  2434 │  1907 ║
║ D_FEMALE         │   .39 │   .71 │   .71 ║
╚══════════════════╧═══════╧═══════╧═══════╝

Number of Cases in each Cluster
╔═════════╤═══╤═════╗
║ Cluster │ 1 │  31 ║
║         │ 2 │  52 ║
║         │ 3 │ 112 ║
║ Valid   │   │ 195 ║
╚═════════╧═══╧═════╝
I set the analysis to break the dataset into three clusters. It would seem that:
  • Cluster 1 is a smallish set of books (n=31), with the highest ranking in both their publication year and in 2015. The writers were slightly older, prices were mid-way between the other two clusters, and the number of reviews in both the publishing year and mid-2015 were very high. The majority of writers were men (61%). We might give this group a name like “consistent male big sellers”.
  • Cluster 2 is a larger set of books (n=52), with middle ranks in their publishing year, and relatively high ranks in 2015 (not far behind cluster 1). The writers’ ages were the same as cluster 1, and they were somewhat higher priced. However, they had far fewer reviews than the books in cluster 1, both in their original publishing year and in 2015. The majority of writers were women (71%). We might give this group a name like “steady female mid-list sellers”.
  • Cluster 3 is the largest set of books (n=112). They had middle ranks in their publishing year, but dropped considerably by 2015. Writers were a bit younger, and prices were lower. They had the fewest reviews in both their original publishing year and in mid-2015. The majority of writers were women (71%). We might give this group a name like “female writers with inconsistent sales”.
  • A deeper dive into the dataset would probably show that much of this segmentation pertains to the writer’s genre and length of tenure as a published writer, but this will do for quickly illustrating how the technique works.
  • The author prefers a technique called LCA (latent class analysis), as it overcomes some of the shortcomings of k-means, noted above. However, SAS and SPSS don't include this method – it requires purchasing some add-ons to these programs (or using R). Two nice features of LCA are that it determines the “best” number of clusters and the most relevant set of variables to use for the segmentations, in a rigorous, statistical manner. It also provides a probability measure of class membership for each case, assisting strategists in determining just how strong a given case's association with a cluster is.
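For anyone curious about the mechanics of the k-means method described above, here is a bare-bones Python sketch. It is not the stats package routine that produced the tables above, just the basic iterate-and-reassign idea. Standardizing the variables first is an assumption made here (so that large-valued variables such as sales rank don't dominate the distance), the function names are illustrative, and the tiny dataset is invented.

```python
import math
import random


def standardize(rows):
    """Rescale each column to mean 0 and standard deviation 1."""
    columns = list(zip(*rows))
    means = [sum(col) / len(col) for col in columns]
    stdevs = [
        math.sqrt(sum((x - m) ** 2 for x in col) / len(col)) or 1.0  # avoid divide-by-zero
        for col, m in zip(columns, means)
    ]
    return [[(x - m) / s for x, m, s in zip(row, means, stdevs)] for row in rows]


def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(rows, k, max_iterations=100, seed=0):
    """Assign each row to the nearest of k centers, recompute centers, repeat.

    The analyst chooses k; starting centers are k rows picked at random.
    Returns (labels, centers).
    """
    rng = random.Random(seed)
    centers = [list(row) for row in rng.sample(rows, k)]
    labels = None
    for _ in range(max_iterations):
        new_labels = [
            min(range(k), key=lambda c: squared_distance(row, centers[c]))
            for row in rows
        ]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        for c in range(k):
            members = [row for row, label in zip(rows, labels) if label == c]
            if members:  # an emptied cluster keeps its previous center
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers


if __name__ == "__main__":
    # Invented (sales rank, number of reviews) pairs, purely for illustration.
    books = [[25, 4400], [30, 4600], [20, 4300], [55, 1400], [60, 1500], [50, 1350]]
    labels, _ = k_means(standardize(books), k=2)
    print(labels)
```

Changing k, the seed, or the choice of variables can change the clusters, which is exactly the subjectivity the author complains about.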

Chapter 9 – Marketing Research

  • This chapter deals with some distinctions between database marketing and marketing research. The former is driven more by transactional data on customers that the firm has, while the latter is often based on self-reported survey data. Survey fatigue and missing data are both issues in survey data, so the author gives some interesting ideas on how to get around these.
  • The author does a brief review of a method known as conjoint analysis. He feels that it is too contrived and lacking in grounding in actual customer decisions to be very useful, however.
  • He also gives a quick tour of structural equation modeling or path analysis, a technique used by the more quantitative sort of social scientist to model deep cause and effect analyses. This method, a variation on regression analysis, attempts to uncover “latent” variables based on “manifest” variables, and thus see what is motivating people under the surface. It is a complex subject, worthy of its own book.

Chapter 10 – Statistical Testing – Knowing What Works

  • Proper experimental design, as one would have in a medical experiment, is difficult in a corporate environment. Things such as randomization, double-blinding, and non-treatment are not always possible, for technical or business-cultural reasons.
  • It is important to get the appropriate sample size for a proper analysis, and that is often not properly computed.
  • A/B testing is a keystone of marketing. This simply means dividing a sample into two groups (randomly), giving the treatment to one group but not the other, and then measuring the difference in some relevant response. A variation on a t-test can then determine whether that difference is statistically significant. However, it can be difficult, in a business environment, to be sure that there is no cross-contamination from earlier treatments not related to the test (e.g. sales). Sometimes ANOVA or regression might be useful, with a more multi-factorial design.
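As a rough illustration of the A/B testing idea above, here is a small Python sketch. It assumes Welch's version of the t-test (which does not require equal variances in the two groups), and it approximates the p-value with the normal distribution, which is only reasonable for fairly large samples. The function names and the spending figures are made up for the example, and this is not a method taken from the book.

```python
import math
import random
from statistics import NormalDist, mean, variance


def random_split(units, seed=0):
    """Randomly divide the units into two groups of (nearly) equal size."""
    shuffled = list(units)
    random.Random(seed).shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


def welch_t_test(group_a, group_b):
    """Return (t statistic, approximate df, approximate two-sided p-value)."""
    n_a, n_b = len(group_a), len(group_b)
    # Squared standard errors of each group mean, from the sample variances.
    se_sq_a = variance(group_a) / n_a
    se_sq_b = variance(group_b) / n_b
    t_stat = (mean(group_a) - mean(group_b)) / math.sqrt(se_sq_a + se_sq_b)
    # Welch-Satterthwaite degrees of freedom (reported for reference).
    df = (se_sq_a + se_sq_b) ** 2 / (
        se_sq_a ** 2 / (n_a - 1) + se_sq_b ** 2 / (n_b - 1)
    )
    # Normal approximation to the t distribution: an assumption, fine for large n only.
    p_value = 2 * (1 - NormalDist().cdf(abs(t_stat)))
    return t_stat, df, p_value


if __name__ == "__main__":
    # Hypothetical customer IDs, split at random into treatment and control.
    treatment_ids, control_ids = random_split(range(1, 201), seed=42)
    # Made-up spending per customer: the treatment group is simulated
    # with a slightly higher average, purely to have something to test.
    rng = random.Random(1)
    treatment_spend = [rng.gauss(52, 10) for _ in treatment_ids]
    control_spend = [rng.gauss(50, 10) for _ in control_ids]
    t_stat, df, p_value = welch_t_test(treatment_spend, control_spend)
    print(f"t = {t_stat:.2f}, df = {df:.1f}, approximate p = {p_value:.3f}")
```

With small groups, an exact t distribution from a proper stats package would be the safer way to get the p-value.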

Chapter 11 – Capstone – Focusing on Digital Analytics

  • The rise of the web has led to an explosion of data, often referred to as “big data”. New techniques and algorithms have sprung up, with names such as neural nets, machine learning and so forth. These tend to be less statistically based than older techniques, less theoretically rigorous, but often seem easier to implement and interpret. The author advises caution against over-reliance on techniques that have a “black box” feel to them, and promise to take the well paid analyst out of the scenario. Sometimes, you get what you pay for. One might accuse us statistical analysts of having a vested interest in the status quo, however.
  • Social media is also an important new marketing reality. Much research is still needed to determine its impact.

Chapter 12 – Finale and Take-Away

  • Marketing Analytics is really about “quantifying causality”. Although correlation is not causality, causality can generally be reliably determined, via some general rules of thumb about sequence of events and counter-factuals.
  • The customer may not always be right, but it is always right to focus on the customer.
  • Have a plan to implement your analytical results, and get buy-in for it.
  • Remember that, while individual persons can be very unpredictable, people as a group can be surprisingly predictable. And you don't have to be 100% correct, just usually correct and usefully correct.
So, there you have it. Most of the simpler and more advanced statistical techniques used in marketing (or in social science in general) have been outlined. Note, however, that there are other techniques, more along the data mining domain, that are also popular these days.
-------------------------------------------------------
And, since this is a book themed blog, here is your chance to buy a book. This is a travelogue, featuring a statistician and a truck driver, on a long haul trip, taking lumber to Texas and oilfield equipment to Alberta. So, you get content that alludes to the theme of the blog – statisticians and markets. :).
On the Road with Bronco Billy - A Trucking Journal
Kindle Edition