Web Mining
Various commentators have claimed that “data is the new oil” (Economist, 2017). The analogy typically compares the economic impact of data on industry and society today with that of oil during the second industrial revolution. However, while data could be the driving force of future economic growth, its characteristics, and the skills needed to refine it into economic value, are entirely different from those that mattered during the second industrial revolution.
First, data comes in numerous shapes (variety, e.g., structured and unstructured), sizes (volume) and speeds (velocity, e.g., real-time data), the characteristics that gave rise to the term Big Data. Second, while data can be replicated at almost zero cost (no transportation cost), the cost of creating and aggregating meaningful data can be substantial. Third, the extraction of information or knowledge requires additional analytical techniques; data per se has no economic value. Fourth, and finally, the use of data raises new problems concerning privacy, ownership and trade regulations.
This course focuses on methods for aggregating textual, audio-visual and numerical data of different types from different sources, and for processing that data to extract valuable information. Typical use cases are:
1) Understanding the structure of the web as a distributed network using various protocols and standards (HTTP, SOCKS, REST, …).
2) Automatic news extraction from websites (web crawling), such as newspaper websites, data services and social networks.
3) Text analysis of PDF documents using natural language processing for classification, sentiment analysis, semantic analysis and topic modeling.
4) Cognitive data processing for visual and audio analysis, such as image and video classification, face and gesture recognition, and voice and music pattern recognition.
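To make the news-extraction use case concrete, the following is a minimal sketch of a single crawling step: pulling headlines and their links out of an HTML page. It uses only the Python standard library. The sample page, the base URL `https://news.example.com`, the class `HeadlineParser` and the assumption that headlines are links inside `<h2>` tags are all hypothetical; a real site will have its own markup, and a real crawler would first fetch the page over HTTP.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical sample page; a real crawler would download this over HTTP.
SAMPLE_HTML = """
<html><body>
  <h2><a href="/news/markets-rally">Markets rally</a></h2>
  <h2><a href="/news/rates-hold">Central bank holds rates</a></h2>
  <p><a href="/about">About us</a></p>
</body></html>
"""


class HeadlineParser(HTMLParser):
    """Collect (headline, absolute URL) pairs for links inside <h2> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.in_h2 = False
        self.current_href = None  # href of the headline link being read
        self.text_parts = []
        self.headlines = []

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self.in_h2 = True
        elif tag == "a" and self.in_h2:
            self.current_href = dict(attrs).get("href")
            self.text_parts = []

    def handle_data(self, data):
        # Only keep text that sits inside a headline link.
        if self.current_href is not None:
            self.text_parts.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self.current_href is not None:
            title = "".join(self.text_parts).strip()
            # Resolve relative links against the (assumed) site address.
            url = urljoin(self.base_url, self.current_href)
            self.headlines.append((title, url))
            self.current_href = None
        elif tag == "h2":
            self.in_h2 = False


def extract_headlines(html, base_url):
    parser = HeadlineParser(base_url)
    parser.feed(html)
    return parser.headlines


headlines = extract_headlines(SAMPLE_HTML, "https://news.example.com")
for title, url in headlines:
    print(title, "->", url)
```

The "About us" link is skipped because it does not sit inside an `<h2>` element, which illustrates how a crawler separates article links from navigation.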
The learning objectives are:
-Understanding the structure of the web as a distributed network using various protocols and standards (HTTP, SOCKS, REST, …).
-Analyzing and parsing different document exchange formats such as HTML, XML, and JSON using regular expressions.
-Retrieving web resources using Python and storing the data in relational databases or NoSQL databases.
-Working with and storing large amounts of data using cloud services.
-Parsing PDF and Word documents.
-Working with unstructured data such as videos, images and audio data from social networks and extracting semantic information from them.
-Applying natural language processing techniques for the classification and semantic analysis of documents.
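As an illustration of the parsing and storage objectives, here is a minimal sketch that parses a JSON document, uses a regular expression to pull the calendar date out of a timestamp field, and stores the result in a relational database. The JSON payload, its field names, the `articles` table and the helper functions are hypothetical. Note that the JSON structure itself is read with Python's `json` module; the regular expression is applied only to a single field. SQLite in memory stands in for whichever relational database is used in practice.

```python
import json
import re
import sqlite3

# Hypothetical payload, shaped like a response from a news data service.
RAW_JSON = """
[
  {"id": 1, "title": "Markets rally", "published": "2017-05-06T09:30:00Z"},
  {"id": 2, "title": "Central bank holds rates", "published": "2017-05-07T14:00:00Z"}
]
"""

# Matches a leading ISO-style date such as 2017-05-06.
DATE_PATTERN = re.compile(r"\d{4}-\d{2}-\d{2}")


def load_articles(raw_json):
    """Turn the JSON text into (id, title, date) tuples."""
    rows = []
    for record in json.loads(raw_json):
        match = DATE_PATTERN.match(record["published"])
        date = match.group(0) if match else None
        rows.append((record["id"], record["title"], date))
    return rows


def store_articles(conn, rows):
    """Create the (assumed) articles table and insert the rows."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS articles ("
        "id INTEGER PRIMARY KEY, title TEXT NOT NULL, published_date TEXT)"
    )
    # Parameterized statements avoid building SQL from untrusted text.
    conn.executemany("INSERT OR REPLACE INTO articles VALUES (?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
store_articles(conn, load_articles(RAW_JSON))
stored = conn.execute(
    "SELECT title, published_date FROM articles ORDER BY id"
).fetchall()
print(stored)
```

Replacing `":memory:"` with a file path would persist the table between runs; the same pattern carries over to other relational databases through their own Python drivers.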
-Lectures
-Follow-along walkthroughs of the code examples
-Coding exercises
-Live data / real-world data analysis
Individual case study (50%)