2019: Google is about to retire Fusion Tables, the tool that was used to geocode the data and create the maps. There is no direct replacement, and I will not recreate the maps with another tool.
A high-level initial analysis of the Ashley Madison credit card transactions with a text editor and Excel revealed interesting geographical differences. I saw that the total amount spent in my town of roughly 12,000 residents surpassed the total spending of much larger towns in the region with 30,000 or 40,000 residents. This made me curious, and I wanted to visualize the per-capita spending on a map. I needed a new, more powerful computer before I could do this, and here are the results.
In most Massachusetts municipalities, the per-capita spending over the period covered by the leaked Ashley Madison (AM) data (from 2008 through June of 2015) was well under $1. But some towns stand out with a per-capita spending of $3 or more. This adds up to tens or even hundreds of thousands of dollars that were paid to AM by the residents of some towns. Overall, Massachusetts accounts for more than $6 million of Ashley Madison’s total revenues during this period, with Boston being the biggest spender in absolute numbers ($437,252.05), while Chesterfield and Goshen trail the pack with one $49 transaction each (I did not include towns with no spending at all). When I calculated these totals, I filtered by transaction type and included only settled credit card transactions.
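The calculation described above can be sketched in a few lines of Python. This is only an illustration, not the queries I actually ran: the field names (`town`, `amount`, `type`), the `"settled"` label and the sample figures are all made up for the example.

```python
from collections import defaultdict


def per_capita_spending(transactions, population, settled_type="settled"):
    """Sum settled amounts per town and divide by that town's population.

    The record layout and the "settled" label are hypothetical; the real
    dump uses its own column names and transaction type codes.
    """
    totals = defaultdict(float)
    for tx in transactions:
        # Keep only settled credit card transactions; skip everything else.
        if tx["type"] != settled_type:
            continue
        totals[tx["town"]] += tx["amount"]
    # Towns without population data are left out rather than guessed.
    return {
        town: total / population[town]
        for town, total in totals.items()
        if population.get(town)
    }


# Invented sample data, only to show the shape of the inputs.
sample_transactions = [
    {"town": "Town A", "amount": 49.0, "type": "settled"},
    {"town": "Town A", "amount": 19.0, "type": "refund"},
    {"town": "Town A", "amount": 51.0, "type": "settled"},
    {"town": "Town B", "amount": 49.0, "type": "settled"},
]
sample_population = {"Town A": 50, "Town B": 98}

print(per_capita_spending(sample_transactions, sample_population))
```

Towns with no settled transactions never show up in the result, which matches the decision not to list towns with no spending at all.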
I did not have population data for neighborhoods and villages, another shortcoming of my approach. Spending from the zip codes of these villages is not included in the town amounts, but their population is counted in the town totals. This affects some of the larger villages like Wellesley Hills or Assonet, and it makes the per-capita spending in towns like Wellesley and Freetown appear much lower than it actually is.
The biggest shortcoming of the card transaction dump is that there is no currency information. This prevents us from comparing spending patterns between different countries. Massachusetts residents may have had foreign currency transactions on their credit card from when they traveled abroad. I saw some evidence that this happened (for example when the originating IP address was in Japan) and it can cause large errors. I briefly considered a ZIP code map of the highest and mean spending by a single person, but because of the currency issue, the results were misleading and I decided not to publish them.
A lot has been written about this data in the past few weeks, with much of it not very well researched. It does take serious time and effort to fully analyze a data pool of this magnitude, more than what I care to spend, but I do want to provide an estimate of the accuracy of my representation, based on what I saw. Here are my observations regarding data quality.
First off, the quality of the credit card transaction dump files (a series of daily CSV files) is low. About a dozen files were corrupt, and it took some manual editing before I could load them into a database with SQL Server Integration Services.
Some addresses are mangled, with multiple zip codes appearing in a single record.
The address data itself, as it was entered by the users, contains a substantial number of flawed records. The website apparently did not require positive address verification for many of these transactions. There are definitely fake addresses in the data. I only used zip codes for this map, i.e. zip codes in the correct range with the state being Massachusetts, and ignored the rest of the address. The zip code-to-town map came from a website I don’t remember, and the population data came from the simplified 2010 census data on the state website.
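To make the zip code filter concrete, here is a minimal Python sketch of that kind of check. It is not the code I used. The function name is hypothetical, and the assumption that Massachusetts zip codes start with the prefixes 010 through 027 should be verified against an official list before relying on it.

```python
import re

# Assumption: Massachusetts zip codes start with the prefixes 010 through 027.
MA_PREFIX_MIN, MA_PREFIX_MAX = 10, 27


def massachusetts_zip(raw_zip, state):
    """Return the 5-digit zip code if the record looks like Massachusetts, else None."""
    if state.strip().upper() not in {"MA", "MASSACHUSETTS"}:
        return None
    # Accept "02145" or "02145-1234"; reject anything else, including
    # mangled fields that contain several zip codes.
    match = re.fullmatch(r"(\d{5})(?:-\d{4})?", raw_zip.strip())
    if not match:
        return None
    zip5 = match.group(1)
    prefix = int(zip5[:3])
    return zip5 if MA_PREFIX_MIN <= prefix <= MA_PREFIX_MAX else None
```

A record passes only if both the state and the zip code agree, so an Alabama address with a Massachusetts zip code, or the other way around, is dropped. A syntactically valid but mistyped zip code still gets through, which is worth keeping in mind when reading the results.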
Others have speculated that Alabama has the highest number of transactions because AL is the first state in the list and was selected by those using fake addresses. This would mean that these transactions are missing from their actual geographies, including Massachusetts.
A sizeable portion of transactions involved gift cards or pre-paid credit cards or anonymization services. We have to discount those.
In many cases, a few individuals account for much of the town’s spending. I can only assume that it must have been difficult to stop Avid Life Media from repeatedly charging the credit card every month once they had the information. Knowing what has been said about the distribution of male vs. real female users on the site and the probability of a successful connection, one can speculate that many users may not have wanted these recurring charges.
All in all, I estimate that 70% to 90% or more of the credit card data contains correct addresses. Thus, it is possible and likely that the actual spending was between 10% and 30% higher, for the reasons mentioned. This accuracy can be applied at the town level as well, at least for those where there was a statistically significant number of transactions, say with more than $10,000 spent. For those towns with less spending, a single missing transaction can cause a much larger relative error, so please be careful with interpreting these numbers.
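The point about small towns is simple arithmetic. The sketch below uses a hypothetical helper and invented totals to show how much a single missing $49 transaction shifts the result depending on how much was recorded for the town.

```python
def relative_error(missing_amount, recorded_total):
    """Return how much higher the actual total is than the recorded one, as a fraction."""
    if recorded_total <= 0:
        raise ValueError("recorded_total must be positive")
    return missing_amount / recorded_total


# Invented totals: one missing $49 transaction in a large and a small town.
for recorded in (10_000.0, 98.0):
    error = relative_error(49.0, recorded)
    print(f"recorded ${recorded:,.2f}: actual is {error:.2%} higher")
```

With $10,000 recorded, the missing transaction is a rounding error of about half a percent. With $98 recorded, it changes the total by half.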
Here is the top dozen:
Massachusetts towns with the highest per capita spending on Ashley Madison
| Town | Per-capita spending ($) |
|---|---|
| Monterey | 3.5545 |
| Wrentham | 3.4538 |
| Mansfield | 3.2881 |
| Boylston | 2.8767 |
| Cohasset | 2.8122 |
| Hopkinton | 2.7281 |
| Sherborn | 2.4755 |
| Sudbury | 2.3806 |
| Dover | 2.2676 |
| Pembroke | 2.1915 |
| Weston | 2.1501 |
| Wayland | 2.0897 |
Monterey, MA? I do have an explanation for why this town of 961 residents in the western part of the state is at the top of this list … because of a typo in the zip code of a Somerville resident (02145 vs 01245). In other words, Monterey, MA did not actually earn this distinction.
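Swapping two neighboring digits, as in 02145 vs 01245, is a common typing error, and it is easy to test for once you have a suspicious pair. The following Python sketch is not how I found this case; the function name is hypothetical and it only checks whether two codes differ by exactly one such swap.

```python
def is_adjacent_transposition(a, b):
    """Return True if b equals a with exactly one pair of neighboring characters swapped."""
    if len(a) != len(b) or a == b:
        return False
    # Positions where the two strings disagree.
    diff = [i for i in range(len(a)) if a[i] != b[i]]
    return (
        len(diff) == 2
        and diff[1] == diff[0] + 1
        and a[diff[0]] == b[diff[1]]
        and a[diff[1]] == b[diff[0]]
    )


print(is_adjacent_transposition("02145", "01245"))
```

A check like this could be used to flag outliers in small towns for manual review, by comparing the zip code on a record with zip codes of much larger places.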
If there is a lesson learned from the story, then it is this: No matter how successful an online business is, chances are it started on a shoestring budget. I’d like to imagine that the AM site was built by a bunch of inept high school drop-outs who worked for little money but didn’t know what they were doing. I did not see any code, but the pieces that I looked at for this article, i.e. a portion of the database schema, the data itself and how it is accessed and stored, are simple and yet full of inconsistencies, and that is worrisome. This was a business with $115 million in annual sales and virtually no overhead, mind you. There should have been cash around to review and audit the infrastructure and fix this stuff. If security was handled in the same lax way as the database schema, then one cannot be surprised that this happened.
So, we basically have here a fly-by-night operation that hit a gold mine, but the owners were too greedy to properly reinvest in their infrastructure – an all-too-familiar plight. In a sense, they got what they deserved, but with unfortunate collateral damage. And the lesson for the rest of us is, never, ever, Ever! entrust even the tiniest scrap of personal data to such a business, no matter how professional they look. Never! There are thousands of others just like it still in operation, in every vertical you can think of.