ICWE 2019

Tutorial 1. Tuesday, June 11th 09:00-10:30 (#102, 1F)

Powerful Data Analysis and Composition with the UNIX Shell

More Information : http://www.smiffy.de/icwe-2019/

by Andreas Schmidt and Steffen Scholz (Karlsruhe Institute of Technology, Germany)

Short Bios :

Prof. Dr. Schmidt(1),(2) is a professor at the Department of Computer Science and Business Information Systems of the Karlsruhe University of Applied Sciences (Germany). He lectures in the fields of database information systems, data analytics, and model-driven software development. Additionally, he is a senior research fellow in computer science at the Institute for Applied Computer Science of the Karlsruhe Institute of Technology (KIT). His research focuses on database technology, knowledge extraction from unstructured data and text, Big Data, and generative programming. Andreas Schmidt was awarded his diploma in computer science by the University of Karlsruhe in 1995 and his PhD in mechanical engineering in 2000. Dr. Schmidt has numerous publications in the fields of database technology and information extraction. He regularly gives tutorials at international conferences on Big Data related topics and model-driven software development. Prof. Schmidt has accepted sabbatical invitations from renowned institutions such as the Systems Group at ETH Zurich in Switzerland and the Database Group at the Max Planck Institute for Informatics in Saarbrücken, Germany.

Dipl.-Ing. Dr. Steffen G. Scholz(2) has more than 15 years of R&D experience in the field of polymer micro and nano replication, with a special focus on injection moulding and the relevant tool-making technologies. He is an expert in process optimization and in algorithm design and development for micro replication processes. He studied mechanical engineering with a special focus on plastics processing and micro injection moulding and obtained his degree from the University of Aachen (RWTH). He obtained his PhD from Cardiff University in the field of process monitoring and optimization in micro injection moulding and led a team in micro tool making and micro replication at Cardiff University. Dr. Scholz joined KIT in 2012, where he now leads the group for process optimization, information management and applications (PIA).

(1) Institute for Automation and Applied Informatics Karlsruhe Institute of Technology Karlsruhe, Germany email: { andreas.schmidt | steffen.scholz }@kit.edu

(2) Department of Computer Science and Business Information Systems University of Applied Sciences Karlsruhe, Germany email: andreas.schmidt@hs-karlsruhe.de Primary email contact : andreas.schmidt@kit.edu

Brief description : For data analysis and knowledge discovery, we typically load the data into a dedicated tool, such as a relational database, the statistics program R, Mathematica, or some other specialized tool, to perform our analysis. Often, however, there is another option that works on nearly any computer with enough mass storage available. Many shells, like bash, csh, …, provide a set of powerful tools to manipulate and transform data and to perform analyses such as aggregation. Besides being freely available, these tools have the advantage that they can be used immediately, without first transforming and loading the data into a target system. Another important point is that they are typically stream-based, so huge amounts of data can be processed without running out of main memory. With the additional use of gnuplot, sophisticated plots can easily be generated.
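As a rough illustration of the stream-based aggregation described above, the following sketch computes an average per group with awk. The file name, the column layout, and the sample values are hypothetical and are not taken from the tutorial material.

```shell
# Hypothetical sample data: one "city,temperature" record per line.
tmp=$(mktemp -d)
cat > "$tmp/readings.csv" <<'EOF'
karlsruhe,21
daejeon,25
karlsruhe,23
daejeon,27
cardiff,18
EOF

# awk reads the input record by record, so only the per-city
# counters are held in memory, not the whole file.
avg_by_city=$(awk -F, '
    { sum[$1] += $2; n[$1]++ }
    END { for (c in sum) printf "%s %.1f\n", c, sum[c] / n[c] }
' "$tmp/readings.csv" | sort)

echo "$avg_by_city"
rm -rf "$tmp"
```

Because awk keeps one running total per distinct city, memory use grows with the number of groups rather than with the number of input lines.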

The aim of this tutorial is to present the most useful tools, such as cat, grep, tr, sed, awk, comm, uniq, join, split, bzip2, and wget, and to give an introduction to how they can be used together. For example, many queries that would typically be formulated in SQL can also be performed with these tools, as will be shown in the tutorial. Selective data extraction from different web pages and the recombination of this information (mashups) can also easily be performed.
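To give an impression of what such SQL-style queries can look like, here is a minimal sketch of a GROUP BY count and an equi-join built from cut, sort, uniq, awk, and join. The two tables and their contents are hypothetical examples, not data from the tutorial.

```shell
tmp=$(mktemp -d)

# Hypothetical tables stored as comma-separated files.
# employees: id,name,dept_id
printf '%s\n' '1,ada,10' '2,bob,20' '3,eve,10' > "$tmp/employees.csv"
# departments: dept_id,dept_name
printf '%s\n' '10,research' '20,sales' > "$tmp/departments.csv"

# SELECT dept_id, COUNT(*) FROM employees GROUP BY dept_id;
counts=$(cut -d, -f3 "$tmp/employees.csv" | sort | uniq -c |
         awk '{ print $2 "," $1 }')

# SELECT name, dept_name FROM employees JOIN departments USING (dept_id);
# join requires both inputs to be sorted on the join field.
sort -t, -k3,3 "$tmp/employees.csv"   > "$tmp/emp.sorted"
sort -t, -k1,1 "$tmp/departments.csv" > "$tmp/dept.sorted"
joined=$(join -t, -1 3 -2 1 -o 1.2,2.2 "$tmp/emp.sorted" "$tmp/dept.sorted")

echo "$counts"
echo "$joined"
rm -rf "$tmp"
```

The sort before uniq -c plays the role of grouping, and the -o option of join selects the output columns much like a SELECT list.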

The tutorial will also include hands-on parts, in which the participants carry out a number of practical data analysis, transformation, and visualization tasks.

Target Audience: Level: Intermediate. Participants should be familiar with using a shell like bash, csh, DOS shell, …

Materials to be distributed to the attendees:

Slideset
Command refcard
Practical exercises

Duration: 3 hours

Introduction	15 min.
Commands/tools for structured data	45 min.
Hands-on Part I	30 min.
Commands/tools for unstructured data	30 min.
Visualization	30 min.
Hands-on Part II	30 min.

Software Requirements for the hands-on parts:

Unix and Mac users: none, the needed tools are already part of your distribution
Windows users: Please install cygwin on your computer (https://www.cygwin.com/).

gnuplot must be additionally selected during the cygwin installation process.
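One possible way to check a machine before the hands-on parts is sketched below. The helper name missing_tools is hypothetical, and the tool list is simply the set of commands named in the description plus gnuplot.

```shell
# Print every tool from the argument list that is not found on the PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# An empty result means all listed tools are available.
missing_tools cat grep tr sed awk comm uniq join split bzip2 wget gnuplot
```

If gnuplot shows up in the output on Windows, rerunning the cygwin installer and selecting the gnuplot package should resolve it.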