
Brief History of Digital Data Collection

Part of Datacop's Blog Series on Data Science in the Digital Economy (#3)

In the previous blog post we explored the question of why we collect data. In short, we discovered that data collection and processing act as a human sense for perceiving the abstract, such as things at scale or things over time. The creation of the internet opened a complex and abstract realm of human connection. Initially, there wasn't much activity on the internet: in 1990, global internet users numbered just 2.6 million. With thirty years of hindsight, we can see that as the number of people using the Internet grew, so did the complexity and the importance of the questions decision makers asked about what was happening on their digital platforms.

The tools to answer these questions developed accordingly, from humble beginnings as limited monitoring tools to a core strategic asset of any digital powerhouse. Today, data is key to providing clarity in chaos. Data powers new products, new services and new technologies like machine learning. In this article we will explore three major stages in the development of data science for the digital economy: the 1990s, the 2000s and the 2010s.

1990s: log files, hits and page load times

In 1993, around 600 websites were active on the Internet. By 1995, the Internet had grown to 44.4 million users. With sites like Yahoo and Amazon already attracting large traffic volumes, the first data questions appeared. Anyone who wanted to look at data about the web, such as traffic, had to examine their log files. A log file was essentially a big table of hits with several data attributes, such as the connecting IP address, the timestamp of the hit, where on the website the hit landed, and so on. By 1995, early web analytics software such as WebTrends and Analog enabled easier analysis of a website's log files. Throughout the 90s and early 00s, log file analysis became the best practice in "website analytics" in the infant digital industry. Back then, this was exclusively the responsibility of the CTO and their IT department.

The issues of the early internet shaped what the data and analytics were used for. At the time, data collection empowered early IT teams to monitor web loading speeds. Global digital infrastructure was not yet well developed, consumers didn't have fast internet connections like today, and websites would crash often. Once a website grew to hundreds or thousands of sub-links, log files proved critical for identifying fast and slow loading pages, the amount of information processed, and so on, in order to evaluate how the site was performing. Another important use of the data was distinguishing real users from bots. Most modern analytics software does this automatically; back then, "total hits" included bots. Note that websites typically had very little personal information about the users visiting them. Even in the early days of the internet, data collection proved to be a critical skill of internet firms.

An example of a log file table. These were among the earliest standardised data about internet traffic, with attributes like the timestamp and the site URL.
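To make the log file concrete, here is a minimal sketch in Python of how a single hit might be parsed out of a raw log line. It assumes the NCSA Common Log Format that 1990s web servers typically wrote; the sample line is invented for illustration.

```python
import re

# Regex for the NCSA Common Log Format used by early web servers:
# ip identd userid [timestamp] "method path protocol" status size
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d{3}) (?P<size>\S+)'
)

def parse_hit(line):
    """Extract the attributes of a single hit from a raw log line."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None

line = '203.0.113.7 - - [10/Oct/1997:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326'
hit = parse_hit(line)
print(hit["ip"], hit["path"], hit["status"])  # 203.0.113.7 /index.html 200
```

Summing up hits per path or per IP across such parsed lines is essentially what the early analytics tools automated.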

2000s: web “sessions”, UTM tags & CRM

By 2005, a mere 10 years after Internet Explorer was first released, the Internet crossed 1 billion active users. HTML, CSS and JavaScript, the foundational coding languages of the web, were becoming far more advanced to meet the exploding needs of the internet's demanding adopters. The static websites of the previous century began to be replaced with interactive web design. A decade on from commercialisation, new digital services like email, search engines, Wikipedia, e-shopping and video sharing were becoming commonplace. Websites grew far more sophisticated and far more differentiated from one another. In this environment, interpreting log files could not answer many of the crucial new questions digital firms were asking about behaviour on their sites.

How many of the hits "bounce" immediately? How long do visitors stay? Are they interacting with our web elements? How many add an item to the cart? How many achieve a conversion? Which links and other websites are the users arriving from? Do they come directly or are they referred? In particular, as websites gained depth, it wasn't enough just to see how many hits a website received. Those individual hits, and the information they contained, needed to be connected to the behaviour of a particular device.

In this environment, Google purchased an innovative analytics company in the log file space: Urchin. Urchin advanced the field of digital analytics by introducing a number of key concepts that solved the new challenges websites were facing. Firstly, Urchin's data was accessible more quickly. Back then, it would often take 24 hours for data to become available after the events occurred on the web; Urchin managed it in 15 minutes. Data started to become available quickly enough to enable tactical decision making. Secondly, Urchin developed a site-tag JavaScript infrastructure that combined with the log file data to provide a more accurate picture. In effect, this created the concept of the "session". Slowly, the industry realised it was not just collecting data about "hits" online, but data about "sessions" and "events". This enabled digital firms to study conversion rates of desired user activity: how many signed up? How many purchased? Furthermore, email was by this time the first ubiquitous online marketing channel. At first, email marketers faced the problem of not knowing which emails their clicks were coming from. Urchin defined the structure of custom URL tagging with "UTM" parameters (an acronym for 'Urchin Tracking Module'), which enabled early digital marketers to start evaluating their campaigns. These innovations were eventually integrated into what is today known as Google Analytics. With these genuine advancements, and thanks to Google's market power, GA was able to offer a basic level of service for free and soon came to dominate the industry as possibly the most widely known digital data collection software.
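The UTM scheme is simply a convention for query-string parameters appended to a landing-page URL. A short Python sketch, with invented campaign values, shows how a tagged link is built and how an analytics tool on the receiving end reads the tags back out:

```python
from urllib.parse import urlencode, urlparse, parse_qs

# Hypothetical campaign values; the utm_* parameter names are the
# standard ones Urchin introduced.
params = {
    "utm_source": "newsletter",    # which property sent the traffic
    "utm_medium": "email",         # the marketing channel
    "utm_campaign": "spring_sale", # the specific campaign
}
url = "https://example.com/landing?" + urlencode(params)

# The analytics tool parses the tags out of the incoming hit's URL:
tags = parse_qs(urlparse(url).query)
print(tags["utm_source"][0], tags["utm_medium"][0])  # newsletter email
```

Because the convention lives entirely in the URL, any email, banner or referral link could carry it, which is why it spread so quickly across marketing channels.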

As time went on, the internet kept growing in both size and sophistication. More and more industries were being redefined by the digital revolution. Increased complexity and competition kept opening new questions and issues for many in the digital industry. A relationship with a customer could now span years, requiring CRM systems. The very large number of anonymous visitors drove efforts to understand who they were and to collect their personal data. The fact that anyone from anywhere with an internet connection could become your customer created the need for analysts, marketers and IT to profile their users into segments based on behaviour, geo-location and personal information. Slowly, for organisations with many customers, data collection transformed from a mere competitive advantage into a fundamental pillar of business strategy.

With the creation of cookie identifiers, it became possible to link otherwise unassociated page hits from a single device into a session.
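The grouping of hits into sessions can be sketched in a few lines of Python. This is a simplified illustration, not any vendor's actual algorithm: it assumes each hit carries a cookie identifier and a timestamp, and uses the common industry convention that 30 minutes of inactivity ends a session.

```python
from datetime import datetime, timedelta

SESSION_TIMEOUT = timedelta(minutes=30)  # common industry convention

def sessionize(hits):
    """Count sessions per device from (cookie_id, timestamp) hits.

    A new session starts when a cookie is seen for the first time,
    or after more than 30 minutes of inactivity.
    """
    last_seen = {}  # cookie_id -> timestamp of that device's previous hit
    sessions = {}   # cookie_id -> number of sessions counted so far
    for cookie_id, ts in sorted(hits, key=lambda h: h[1]):
        prev = last_seen.get(cookie_id)
        if prev is None or ts - prev > SESSION_TIMEOUT:
            sessions[cookie_id] = sessions.get(cookie_id, 0) + 1
        last_seen[cookie_id] = ts
    return sessions

hits = [
    ("cookie_a", datetime(2005, 6, 1, 9, 0)),
    ("cookie_a", datetime(2005, 6, 1, 9, 10)),  # same session
    ("cookie_a", datetime(2005, 6, 1, 14, 0)),  # new session after a gap
    ("cookie_b", datetime(2005, 6, 1, 9, 5)),
]
print(sessionize(hits))  # {'cookie_a': 2, 'cookie_b': 1}
```

Metrics like bounce rate, session duration and conversion rate all fall out of exactly this kind of grouping, which is why the session became the industry's central unit of analysis.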

2010s: Increased Complexity and Customer Data Platforms

By 2010 the internet had crossed the 2 billion user mark. Most digital companies did not have the data collection architecture needed to meet the growing expectations of their digital customers. Many firms suffered from the fragmentation of their data into silos as they purchased piecemeal solutions or invested in custom solutions for specific areas of concern such as email data, website data and transactional data. Others, despite large investments in data collection, failed to see a return. Often they could see they had amassed a large amount of data, but the data would be useless. Even when the data had been collected well, managers wouldn't understand it.

Much of this was not their fault. Complexity in the industry was growing fast. The types of digital data widened with each marketing channel (email, web banners, push notifications, remarketing) enabled by advancing technologies. The formats of data widened with advancements in hardware, bringing images, sound and video. With the introduction of data privacy regulations like the GDPR and major scandals like Cambridge Analytica, digital firms collecting personal data added privacy as a major concern to their list of data challenges.

This gave rise to many new questions for digital firms. How can we unify our disparate sources of customer data? How do we connect the data of a single user across devices? How do we build customer profiles over time? How do we handle personal data? Can relevant members of our company easily access data? How do we create seamless data flows between data collection systems and the systems that process and apply that data? How do we start making decisions based on data across the company?

The big tech companies were able to develop proprietary internal systems to meet these needs. However, many of the other thousands of digital players could not invest their own resources in developing this kind of infrastructure. New players like Exponea (Bloomreach), Tealium, Synerise, Segment and others began offering a "Customer Data Platform" as SaaS. These sophisticated systems allow many mid-sized digital players to make far more use of their data. The explosive success of this software can be seen in its rapid growth: the CDP market was estimated at $2.4 billion in 2020 and is expected to grow to a whopping $10.3 billion by 2025.

CDPs took into account all the problems the industry was facing and packaged the answer as software as a service that unifies customer data across devices and data collection touchpoints. They handle regulatory requirements around personal data and seamlessly send and receive data between systems. Their architectures enable high levels of data integrity and prepare the data for processing, analysis and application. Thousands of digital players now have access to the same kind of technology whose applications the early tech giants pioneered.

Above is a visualisation of a typical CDP platform. It is capable of integrating various sources of data, processing them and pushing raw or processed data into other systems, while ensuring data integrity with a robust user ID and cross-device tracking architecture.

Today, more than 4.5 billion users are connected to the Internet. Seven of the ten most valuable companies in the world by market capitalisation can already be considered "data companies". Digital companies like Facebook, Google, Amazon, Alibaba and Tencent can all attribute their dominant market positions in part to navigating this complex technology and using it to its fullest potential. With enough resources and skill, an organisation can turn data into a sixth super-sense for its decision-makers at every level of the chain of power, driving productivity.



However, as many of you have probably experienced yourself, it isn't enough just to "collect a lot of data". That data needs to be processed before anything useful can be made of it. Tune in to the next article, in which we will look at the process of extracting information, knowledge and applications from data. How does this process work? How is information different from data? How is knowledge different from information? Understanding the high-level perspective of the four data processing steps will allow you to better assess whether the data you collected is valuable, and to understand what needs to happen for that data to turn into business results.

