Change has accelerated in the past decade. Earlier, systems were deployed with the expectation that they would last forever. They were not designed to look at each other’s data, and they were fairly limiting. Open source was still a new idea in the early 2000s. People began to adopt Lucene, software I had written, even though it had no institutional backing or publicity. Open source emerged as a tool for development.
A Distributed Computing Platform
Nutch started in 2003. In 2003 and 2004, Google published papers on how it builds its search engine and how it had automated the work. We started reworking Nutch along those lines in 2004. The tale of debugging is much longer. In 2006, I joined Yahoo!, where I developed Hadoop. Hadoop was named after my son’s toy elephant. It was a distributed computing platform based on Google’s ideas.
A group of people believed that Hadoop could be taken much further. Together, they formed Cloudera, and I joined them in 2009. Stepping back, my lesson from Hadoop is this: if you increase the scale and focus on flexibility, you permit people to store more data in raw form and experiment with it. They can innovate more quickly. The waterfall method inhibited progress with data, and Hadoop gave us a much more appropriate platform. Most of the data of the past was relational.
The new sources of data are events, readings recorded from sensors and the like, and they need a different class of tools. Companies can work with petabytes of data easily today. Software is also eating the world. In every industry, everywhere, the advances being made come predominantly from software. Today, a company’s growth is fuelled more and more by data. The use of data is no longer isolated. It has emerged everywhere.
There are some challenges. First, there is a new set of technologies, with over 10 open-source technologies in this space. If you have an idea or an inspiration, it is fairly easy to figure out which technologies will work in future. The larger challenges are institutional or cultural. Using data changes an organization’s structure. Much of the data that we now have concerns people, and so far we haven’t done a good job of managing people’s rights. If we want to keep ourselves from being regulated, we need to be even more responsible. Data ethics needs our full attention.