Data Science Toolkit

Data Science Toolkit:

A collection of the best open data sets and open-source tools for data science, wrapped in an easy-to-use REST/JSON API with command line, Python and Javascript interfaces. Available as a self-contained VM or EC2 AMI that you can deploy yourself.
This brings together existing resources and APIs into a convenient bundle. It'll be great to see this grow over time and consolidate more and more of the plumbing layer into something that can be abstracted away. The less time spent on raw input munging, the more time is freed up to explore the patterns in the data itself.
• −    − •    − • •    − • − −     − − −    • −    • − •    • − • •    •    − • − −

Institutionalizing institutional memory

A few years ago a story made the local headlines when submerged metal spikes were found at a beach in a local park:

When a wading man found metal spikes in the Green Lake muck last week, Seattle parks officials were horrified and said there couldn't be "any other explanation than malice."
It took about a week for the real story to emerge. The relevant part:
Kathy Whitman, city aquatics director, confirmed Friday that looped metal spikes were used in the early stages of milfoil control in the 1980s to hold down plastic sheeting, and the spikes found this month may be those devices.
The events demonstrated a reliance on individual, siloed knowledge and showed how such systems decay over the years. And it's not just city departments: fewer and fewer things (clothes, cars, information, even buildings!) are expected to last for decades.

Sometimes it's truly overkill to build too much sustainability into a system, and indeed too much future-proofing can actually be harmful, but that's a whole subject in itself. In moderation, though, it's rarely a bad idea to at least consider how a system will look a year, two years, five years and ten years from now. The thought exercise of revisiting something a decade later is often sufficient to overcome the temptation to take the most expedient, short-term path.

The real art is creating an environment and culture such that the default behavior considers sustainability from day 1, avoiding costly retrofits later on.

• −    − •    − • •    − • − −     − − −    • −    • − •    • − • •    •    − • − −

Set ambitious goals

On starting CNN in 1980, Ted Turner said:

"We won't be signing off until the world ends. We'll be on, and we will cover the end of the world, live, and that will be our last event... and when the end of the world comes, we'll play 'Nearer, My God, to Thee' before we sign off."
I love this quotation as a mission statement. There's no ambiguity around the ambition and commitment that Turner expects from his new network and from the people who make up the organization that runs it.

• −    − •    − • •    − • − −     − − −    • −    • − •    • − • •    •    − • − −

Succinct thoughts on monitoring and alerts

Anyone running an online service would do well to read this great post on Monitoring Theory:

Monitors can be informative, actionable, or both. By informative, I mean that the alert must tell you categorically that there is a problem. By actionable I mean that receiving the alert must prompt some kind of immediate response.
I strongly agree with the points in the post, including the thoughts on queues and consumers, which often make a complex system even more complex.

I'll add one other: put alerts in place for things you 'know' to be true. Some of the most interesting failure scenarios I've seen have been foretold by 'this'll never fire' alerts.
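
As a rough illustration of the idea, here is a minimal Python sketch of alerts on assumptions that are "known" to hold. Everything in it is hypothetical: the `InvariantAlert` structure, the metric names (`queue_depth`, `disk_used_fraction`) and the thresholds are made up for the example and do not come from the Monitoring Theory post or from any particular monitoring tool. Each alert carries a message saying what is wrong (informative) and what to do about it (actionable).

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class InvariantAlert:
    """An alert on something we 'know' to be true (hypothetical structure)."""
    name: str
    holds: Callable[[Dict[str, float]], bool]  # True while the assumption holds
    action: str  # what the responder should do if the alert fires


def evaluate(alerts: List[InvariantAlert], metrics: Dict[str, float]) -> List[str]:
    """Return one message per alert whose assumption no longer holds."""
    fired = []
    for alert in alerts:
        if not alert.holds(metrics):
            fired.append(f"{alert.name}: {alert.action}")
    return fired


# Assumed metric names and thresholds, chosen only for illustration.
ALERTS = [
    InvariantAlert(
        name="queue_depth_bounded",
        holds=lambda m: m["queue_depth"] < 10_000,
        action="check that consumers are running",
    ),
    InvariantAlert(
        name="disk_never_full",
        holds=lambda m: m["disk_used_fraction"] < 0.95,
        action="free space or add capacity",
    ),
]

if __name__ == "__main__":
    sample = {"queue_depth": 12_500, "disk_used_fraction": 0.40}
    for message in evaluate(ALERTS, sample):
        print(message)
```

In this sketch a missing metric raises a `KeyError` rather than passing silently, on the view that a metric that has stopped reporting is itself worth hearing about.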

• −    − •    − • •    − • − −     − − −    • −    • − •    • − • •    •    − • − −

Data Mining and Predictive Analytics: Statistics: The Need for Integration

Data Mining and Predictive Analytics: Statistics: The Need for Integration:

One thing which occurs to me is that many people have a tendency to think of statistics in an isolated way. This world view keeps statistics at bay, as something which is done separately from other business activities, and, importantly, which is done and understood only by the statisticians. This is very far from the ideal which I suggest, in which statistics (including data mining) are much more integrated with the business processes of which they are a part.
Couldn't agree more!
• −    − •    − • •    − • − −     − − −    • −    • − •    • − • •    •    − • − −

A Union Education - WSJ.com

I'm coming to the party a little late on this one but was interested by this story.

A Union Education - WSJ.com:

The trend is even starker if you go back a decade earlier. In 1960, 31.9% of the private work force belonged to a union, compared to only 10.8% of government workers. By 2010, the numbers had more than reversed, with 36.2% of public workers in unions but only 6.9% in the private economy.

...

unlike in the private economy, a public union has a natural monopoly over government services. An industrial union will fight for a greater share of corporate profits, but it also knows that a business must make profits or it will move or shut down.

No such check exists in this scenario. The consequences are even more extreme when the normal realities of the marketplace are removed from the equation.
• −    − •    − • •    − • − −     − − −    • −    • − •    • − • •    •    − • − −