

Advancement through technology? The analysis of journalistic online content using automated tools

Jörg Haßler, Marcus Maurer & Thomas Holbach

Summary: With the increasing importance of the Internet as a communication medium, content analyses of online media are also gaining in importance. This creates opportunities, e.g. through digitization and machine readability or the availability of meta-data, but also challenges. These include the volatility and dynamism, the multimediality and hypertextuality, and the reactivity and personalization of websites. This article first discusses the advantages and disadvantages of common proposals for solving these problems and then gives an overview of new methods for the automated storage of journalistic online content.

Tags: Content analysis, online media, metadata, multimedia, database

Abstract: Because of the growing importance of the internet as a communication channel, content analyses of websites are becoming more and more central to the field of communication research. Online content offers opportunities, such as digitalization and the availability of meta-data, as well as challenges. These challenges include, e.g., the dynamics, multimediality, hypertextuality and personalization of websites. This article discusses frequently used strategies to cope with those challenges. Furthermore, the article presents five recently developed tools for the automatic storage of journalistic online content.

Keywords: Content analysis, online media, meta-data, multimediality, database

1. Introduction

The Internet has gained significantly in importance for journalism and media audiences in recent years. All traditional media brands have long been represented with their own online offerings. The news magazine Der Spiegel started its own website in 1994; the Bild newspaper followed two years later. The Internet now also plays a major role in recipients' media repertoires. Around three quarters of Germans are online, 72 percent of Internet users state that they specifically search for information on the Internet, and 55 percent read the latest news online (van Eimeren & Frees, 2013, p. 363). Journalistic news offerings remain among the most important sources of information (Hasebrink & Schmidt, 2013, p. 8).

With the growing importance of the Internet for communicators and recipients, the need for communication science to record and analyze online media content also increases. Media content analyses usually take into account media with particularly high reach, either because they are considered representative of the entire media system (diagnostic approach) or because the content analyses are intended to serve as the basis for effects analyses (prognostic approach), in which case the media used by many respondents must be selected. There is therefore much to be said for taking journalistic online offerings into account in content analyses. However, analyses of online content also involve considerable challenges, which result, for example, from the volatility, multimediality and hypertextuality of the content. These challenges affect not only the coding process but specifically the process of storing the content and making it available for coding. In this article, we first discuss the opportunities and challenges of analyzing the content of journalistic online offerings, as well as a number of common but problematic proposed solutions. We then compare five tools that are freely available online and have been developed in recent years to automate various steps in the content analysis of online media. We concentrate on database-supported tools that automatically save media content and/or serve as a coding platform. Although some of these tools also perform automated coding, this function is not in the foreground here (for an overview of tools for automated coding, see Scharkow, 2012).

2. Content analysis of online media: opportunities and challenges

Welker et al. (2010) distinguish six features of online media that have to be considered in content analysis. In the order in which they become relevant when designing content analyses, these are 1) quantity, 2) volatility, dynamism and transience, 3) digitization and machine readability, 4) mediality, multimediality and multimodality, 5) nonlinearity and hypertextuality and 6) reactivity and personalization. We want to add a seventh property, which has recently gained significantly in importance: the availability of meta-data, e.g. information on how often a post was forwarded or rated in social networks:

Quantity: The number of websites available worldwide is estimated at between 634 million (, 2013) and 3.94 trillion (de Kunder, 2013). There are two approaches to selecting offerings for analysis: first, a systematic identification of journalistic offerings, as carried out, for example, by Neuberger, Nuernbergk and Rischke (2009), and second, an orientation toward offerings with particularly wide reach. The second approach essentially corresponds to the approach used for conventional content analyses outside the Internet. Instead of circulation figures, the reach figures of the Arbeitsgemeinschaft Online-Forschung e. V. are consulted. The basic assumption here is that opinion-leading media also exist online as a mirror image of the offline media market, and that one can likewise use the journalistic spectrum as a guide in order to capture journalistic content comprehensively. This has proven to be the most practicable solution for offline content analyses, as long as long-tail analyses of specialized journals and association newspapers are dispensed with for reasons of research economy.

Volatility, dynamism and transience: Online content is constantly changing. This applies both to individual articles, which can be edited later, and to entire web offerings, where the placement of articles changes continuously (Karlsson & Strömbäck, 2010). For practical reasons and for reasons of intersubjective verifiability, the websites to be examined must therefore be archived before the analysis. Since the archives of website providers only save individual articles but not their editorial embedding, and since existing web archives are extremely incomplete, it is essential for most content analyses to develop one's own archiving tools or to use existing software (Neuberger, Nuernbergk, & Rischke, 2009). For archiving, it is crucial that websites are digital and machine-readable.

Digitization and machine readability: The fact that websites are digitized and machine-readable opens up a number of possibilities for online content analysis. Content analysis data can be collected fully automatically. Since websites use standardized programming languages and codes, huge amounts of text (big data) can in principle be recorded (Scharkow, 2012). This also enables fast and inexpensive analyses because no human coders are required. Although programs for the automatic coding of online content have been in development for some time, these methods are still very limited. With the help of word recognition programs, reporting topics can be identified (e.g. King & Lowe, 2003), and recurring sentence structures can even be recorded with the help of grammar recognition programs, so-called parsers (e.g. de Nooy & Kleinnijenhuis, 2013). More complex analyses, e.g. of evaluation tendencies or argumentation strategies at the article level, still have to be carried out manually (Lewis, Zamith, & Hermida, 2013).
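The word-recognition approach mentioned above can be illustrated with a minimal dictionary-based topic classifier. The keyword lists and function names below are illustrative assumptions, not taken from any of the cited studies:

```python
# Minimal sketch of dictionary-based topic identification.
# The keyword dictionaries are invented for illustration.
TOPIC_KEYWORDS = {
    "economy": {"market", "inflation", "unemployment"},
    "politics": {"election", "parliament", "minister"},
}

def classify_topic(text):
    """Return the topic whose keywords occur most often, or None."""
    tokens = text.lower().split()
    counts = {topic: sum(tokens.count(word) for word in words)
              for topic, words in TOPIC_KEYWORDS.items()}
    best = max(counts, key=counts.get)
    return best if counts[best] > 0 else None
```

Real systems add stemming, weighting and disambiguation on top of this principle, which is why such dictionaries remain limited to comparatively simple categories.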

Mediality, multimediality and multimodality of content: Websites consist not only of texts, but also of images, videos, audio files and interactive elements. This complicates the content analysis of web content for three reasons (e.g. Sjøvaag, Moe, & Stavelin, 2012): First, when designing codebooks, it is necessary to define precisely which elements of a website are to be examined. The question here is whether the coding should remain limited to the central text article on the website or whether multimedia elements should also be included. In the latter case, it must then be decided whether these should be treated as separate articles or as part of the text article. A distinction must also be made between multimedia elements that belong to a certain text in terms of content and those that are included in all text articles (e.g. streams of the current editions of a television news program on its website). Second, if the multimedia elements are also to be recorded, this poses a particular challenge when archiving websites, because it must be ensured that they are also saved. Ideally, the stored pages are made accessible to the coders with all multimedia elements in exactly the same form as in the online version. Third, complex analyses of websites that contain images and videos can only be carried out adequately using manual coding. With the help of complex automated processes, image and sound information can be converted into text (for an overview see Eble & Kirch, 2013). However, this does not reflect the multimedia character of online content.

Nonlinearity and hypertextuality: Texts can be linked to other texts, images, videos or interactive elements using hyperlinks. This creates an information network in which the sources of information can be referenced directly. The hyperlink structure causes two main problems when analyzing the content of websites: saving the hyperlink structure and archiving the websites. The first problem is primarily a capacity problem. In principle, it is technically possible, starting from a website, to save all websites that are referenced up to a specified link depth. With a large link depth, however, this is very time-consuming and requires a large storage capacity, because the networking increases the number of websites to be saved with each level of the link depth. The second problem - similar to multimedia - is that the hypertext structure must be preserved when the websites are archived.
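The growth in the number of pages with each additional level of link depth can be illustrated with a small sketch. The breadth-first traversal below operates on a hypothetical in-memory link graph instead of real HTTP requests, which a crawler would use in practice:

```python
from collections import deque

def pages_within_depth(link_graph, start, max_depth):
    """Breadth-first traversal up to max_depth.
    Returns the set of pages a crawler would have to archive."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        page, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not follow links beyond the set depth
        for target in link_graph.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append((target, depth + 1))
    return seen
```

Applied to a small invented graph, depth 1 already triples the number of pages, and each further level multiplies the storage requirement again, which is the capacity problem described above.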

Reactivity and personalization: Online content can be tailored to individual users by the provider with the help of special algorithms. For example, online media can use algorithms to generate individual start pages on which posts are placed more prominently the more closely they correspond to users' previous reading behavior. In extreme cases, users end up in a filter bubble that only suggests and presents content corresponding to their preferences (Pariser, 2011). The individualization of the offering is a significant problem for online content analysis because different coders may be shown different content. The problem is exacerbated in content analyses in which, for reasons of research economy, only the best-placed articles are examined, because it is no longer possible to determine the best-placed articles independently of the recipient.

Availability of meta-data: Most websites now also make so-called meta-data available. These data allow conclusions to be drawn about the interaction with and between recipients. Many websites today have comment sections, which can often even be used for individual articles. The possibility of recommending posts on Facebook, Twitter or GooglePlus is now widespread. Websites that make this data available usually also publish the number of comments, likes and recommendations. If it is possible to save this data or to analyze it automatically, a wide range of analysis options opens up for research, e.g. by linking the information with content analysis data in order to determine which article characteristics (e.g. topics or evaluation tendencies) lead recipients to disseminate an article.
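Linking meta-data with content analysis data, as suggested above, can be as simple as aggregating share counts by coded topic. The data structure below is an illustrative assumption, standing in for a merged table of coded articles and their platform metrics:

```python
def mean_shares_by_topic(coded_posts):
    """coded_posts: list of dicts with 'topic' (from manual coding)
    and 'shares' (meta-data from the platform). Returns the mean
    number of shares per coded topic."""
    totals, counts = {}, {}
    for post in coded_posts:
        topic = post["topic"]
        totals[topic] = totals.get(topic, 0) + post["shares"]
        counts[topic] = counts.get(topic, 0) + 1
    return {topic: totals[topic] / counts[topic] for topic in totals}
```

Comparing such averages across topics or evaluation tendencies is one straightforward way to relate article characteristics to dissemination by recipients.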

3. Traditional methods of storing websites

The basic requirement for solving the problems discussed above is usually that the websites are saved. To date, most studies have used one of the following methods for storing and archiving websites (Karlsson & Strömbäck, 2010): taking screenshots, saving pages as PDF files, using download programs (so-called crawlers, offline browsers or web spiders) or access via RSS feeds. In the following, we briefly discuss the advantages and disadvantages of these methods.

Screenshots and PDF files: Taking screenshots is a comparatively simple process. The websites to be examined are called up manually and saved as image files, e.g. in JPG format. This can be done either in advance of the coding or in the same work step as the coding. The process is technically not particularly complex, but it has several disadvantages: On the one hand, the time required to manually save each individual website is enormous. On the other hand, screenshots cannot adequately depict multimedia elements or hyperlinks. Hyperlinks can only be guessed at, e.g. when they are recognizable as underlined or highlighted text passages. Which content lies behind a link cannot be determined, because links in a screenshot cannot be clicked. Multimedia content is also not displayed because JPG files are static. Nor does creating screenshots bypass the personalization of the website. All of this applies in a similar way to saving websites as PDF files. However, newer PDF versions can display hyperlinks: if you move the mouse over a link, the URL to which it refers is displayed (mouse-over effect). In this way it is usually at least possible to record which page an article links to, even if that page cannot be viewed directly.

Download programs: So-called web crawlers, offline browsers or web spiders automatically call up websites and save them in various formats. Two free programs that are widely used in communication science are HTTrack and Wget. HTTrack enables the user to enter the URL of a website and define a link depth. The crawler then automatically saves all publicly accessible areas of the desired website as well as the pages of the link targets up to the set link depth. The websites can be opened offline with the same layout, content and functionality in terms of hypertextuality. However, videos and audio files have to be saved manually. The same applies to Wget. Some professional, fee-based programs such as Offline Explorer or Teleport also offer the archiving of multimedia content (Rüdiger & Welker, 2010). With regard to the personalization of pages, it cannot be ruled out that web crawlers save websites in a personalized form. The website provider can view the IP address and, for example, the operating system of the computer performing the storage. With HTTrack, some of the information that is sent to the website operator can be set and manipulated, but it cannot be assumed with certainty that the content is stored independently of the person.

Saving RSS feed messages: The only way around the problem of website reactivity and personalization is to save RSS feed messages. Many website operators provide RSS feeds, which are usually created automatically by the content management system. Their actual function is to notify users of changes to the websites. This can be done through short messages, but also by making the entire new page available via RSS feed. The articles appear in RSS feeds in reverse chronological order, the newest article first. RSS feeds therefore abstract from the (individual) view of the website. Since RSS feeds are created for all new posts available on a website, the storage of the RSS feeds is independent of whether and with what placement the posts are displayed to individual users. In effects analyses, however, this does not solve the problem that individual users are shown and use different posts. This problem also exists, e.g., in the content analysis of daily newspapers, in which all articles are coded although it is obvious that recipients only read an (individually different) part of them. The RSS messages mostly contain only the central text article and do not appear in the actual format of the website. The extent to which hypertextuality and multimediality are retained when these messages are saved depends on the website provider. There is always at least one link that leads to the website version of the article. Multimedia elements, in contrast, are usually not included.
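How RSS feeds abstract from the individual page view can be seen in their simple XML structure. The sketch below parses a minimal, invented RSS 2.0 feed with Python's standard library and returns the items in feed order, i.e. newest first:

```python
import xml.etree.ElementTree as ET

# A minimal invented RSS 2.0 feed for illustration.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0"><channel>
  <title>Example News</title>
  <item><title>Newest article</title>
        <link>https://example.org/2</link>
        <pubDate>Tue, 25 Mar 2014 10:00:00 GMT</pubDate></item>
  <item><title>Older article</title>
        <link>https://example.org/1</link>
        <pubDate>Mon, 24 Mar 2014 10:00:00 GMT</pubDate></item>
</channel></rss>"""

def feed_items(feed_xml):
    """Extract (title, link) pairs in feed order (newest first)."""
    root = ET.fromstring(feed_xml)
    return [(item.findtext("title"), item.findtext("link"))
            for item in root.iter("item")]
```

Because every new post appears as an `<item>` regardless of its placement on the start page, saving the feed sidesteps any personalization of the page layout, exactly as described above.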

4. The analysis of online offers with the help of automated processes

Comparing the advantages and disadvantages of the various traditional methods of storing websites makes clear that none of them meets all requirements. The problem of page personalization is only solved by storing RSS feeds, which, however, are formally so far removed from the original online version of the articles that, for example, multimedia elements cannot be recorded. In the following, we therefore compare five comparatively new, database-supported tools that are more or less freely available online, enable the automated storage of online content and facilitate the organization of coding: the Amsterdam Content Analysis Toolkit (AmCAT), the NewsClassifier for storing and analyzing text articles, the coding interface ANGRIST/IN-TOUCH, the Facepager for analyzing social media data and ARTICLe, a new tool for storing texts and multimedia files. Although some of the tools also perform automated coding, this function is not in the foreground here. We first discuss the possible applications and then the respective advantages and disadvantages of the individual tools. We limit ourselves to a rough overview and refer to the detailed documentation on the tools, which is usually available online, for details.

AmCAT: A tool that connects the organization and coding process of online content is the Amsterdam Content Analysis Toolkit (AmCAT)2 (van Atteveldt, 2008). AmCAT makes it possible to store a large number of articles in an SQL database and then to code them automatically or manually using various programs. In the first step, articles are loaded into the AmCAT Navigator (Fig. 1). Various file formats can be stored there and entered into a database. This database automatically records information about the articles, such as their source or the date of publication. AmCAT is designed in such a way that the articles are archived manually or by using an additional tool. Since AmCAT was specifically developed for semantic and network-based text analyses, the focus is on the storage and organization of articles in text formats such as XML, RTF or CSV. Using AmCAT alone therefore cannot guarantee that the hypertextuality and multimediality of online content will be mapped in content analyses. Rather, it depends on the preceding storage process whether it remains clear how many images, videos or hyperlinks the examined articles contain. The online content is saved for processing in AmCAT with an additional tool and then manually loaded into the Navigator in a further step. The stored data can be uploaded in full, e.g. as a ZIP archive. Bypassing the personalization of websites also depends on the archiving method selected beforehand. AmCAT cannot bypass algorithms on websites per se. If, however, RSS feeds are used in the storage process to load texts into the database, articles can be analyzed independently of the user (van Atteveldt, 2008, p. 182).

Figure 1: AmCAT online user interface

Source: (accessed on March 24, 2014)

The second step in the working process with AmCAT is the coding of the saved articles. The strength of AmCAT lies in its possibilities for automated text analysis. By linking to external tools, various analyses, for example Natural Language Processing (NLP) or part-of-speech (POS) tagging, are carried out directly by the computer. At the same time, AmCAT enables human coding via an input mask in order to check the validity of the automated coding or to generate training data for machine learning through manual coding. The iNet tool was integrated into AmCAT for the coding process. It serves as a user interface from which various analyses, e.g. coding at the article level and coding at the statement level, can be carried out. iNet enables the comprehensive organization of the entire coding process, from the allocation of articles, through the coding, to the generation and transfer of data to a statistics program (van Atteveldt, 2008, p. 185).

An analysis of the meta-data from social network sites, such as likes and shares of a post on Facebook, is only possible to a limited extent with AmCAT and depends on the way the posts are stored. If a storage method is used that makes this meta-data available in text form, it can also be analyzed with the tools integrated in AmCAT. Overall, AmCAT offers the advantage that a large number of steps in the content analysis of digital texts are simplified and linked with one another. The tool thus combines the advantages of automated and manual text analysis and enables a continuous validity check of the automated coding. In addition, it enables training data to be generated for machine learning. This is offset, however, by the high complexity of the tool. Since the individual elements can be combined in a variety of ways, extensive knowledge of programming languages and the logic of automated text analysis processes is required. In addition to the great advantages of the tool, AmCAT's design for automated analyses also entails disadvantages: Online content is integrated into the database in pure text form. A holistic representation, e.g. as an HTML file, is not possible. As a result, hypertextuality, interactivity and multimediality cannot be recorded, because elements such as hyperlinks, comment fields, images or videos are not stored in the database. In principle, however, these disadvantages can be avoided by combining AmCAT with a tool that saves websites completely and extracts the texts at the same time.
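The validity check of automated coding against manual coding mentioned above boils down to comparing the two codings unit by unit. The sketch below computes the simple share of matches; this is a generic illustration, not AmCAT's actual implementation:

```python
def percent_match(manual, automatic):
    """Share of units where the automated coding agrees with the
    manual gold-standard coding (simple accuracy)."""
    assert len(manual) == len(automatic), "codings must cover the same units"
    hits = sum(m == a for m, a in zip(manual, automatic))
    return hits / len(manual)
```

In practice, chance-corrected coefficients are preferred over raw agreement, but the underlying comparison of paired codings is the same.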

NewsClassifier: The NewsClassifier tool3 (Scharkow, 2012) was also developed to carry out the entire content analysis process automatically, from archiving to coding.

"NewsClassifier is designed as an integrated framework that ranges from automatic data collection and cleaning to sampling, the organization of field work and the implementation of reliability tests to the actual manual and/or automatic coding." (Scharkow, 2012, p. 250).

Through the automated storage of online content, the tool enables, in a first step, the continuous collection of journalistic online offerings. The time-consuming research and storage of the units of analysis is considerably reduced because the website to be examined only needs to be entered once. The websites are saved by accessing their RSS feeds. This has the advantage that algorithms that produce a personalized presentation of the websites are bypassed. Instead, all posts published within a certain period of time are saved. A four-stage process is used to save the websites: First, the URL addresses of the RSS feeds of the websites to be examined are entered. The second step is to check whether full texts are already available in the RSS feeds. If this is the case, they are entered into a relational database. If no full texts are available, the third step is to check whether the HTML pages of the respective articles are available. If so, they are imported into the relational database and cleaned, i.e. only the pure text of the page to be examined is saved in the database. If the HTML pages are not available, a check is made as to whether the website providers make print versions of the article pages available. If this is the case, these are saved in text form in the database (Scharkow, 2012, p. 260). The NewsClassifier is therefore primarily aimed at automated text analysis. The editorial environment of the articles, on the other hand, is largely excluded.
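The four-stage fallback can be sketched as a simple decision function. The dictionary keys and the crude tag-stripping helper below are illustrative assumptions, not NewsClassifier's real data model:

```python
import re

def strip_markup(html):
    """Crude tag removal standing in for real boilerplate cleaning."""
    return re.sub(r"<[^>]+>", "", html).strip()

def choose_text_source(item):
    """Return which representation of an article to store, following
    the fallback order described above: RSS full text, then cleaned
    HTML, then the provider's print version."""
    if item.get("rss_fulltext"):
        return ("rss", item["rss_fulltext"])
    if item.get("html"):
        return ("html", strip_markup(item["html"]))
    if item.get("print_version"):
        return ("print", item["print_version"])
    return (None, None)  # nothing retrievable for this item
```

Whichever branch fires, only plain text ends up in the database, which is exactly why the editorial environment of the articles is lost.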

In addition to storage, the NewsClassifier enables sampling and the allocation of articles to coders. The coding itself is also carried out by the coders via an input mask directly on the computer. A codebook can be created in the program for this purpose. For each category it can be specified which part of the text to be analyzed it relates to, e.g. the heading or the body text. In addition to manual coding, all categories can also be collected using automated processes. The manual codings provide the training data for the automatic classification: the more correct manual codings there are for a category, the more reliable the automated coding becomes. However, the coding always relates to the article as a whole; coding of individual statements or sections is not yet possible (Scharkow, 2012, p. 277). Finally, the NewsClassifier enables reliability tests to be carried out and the collected data to be easily transferred to a statistics program. The entire workflow of a content analysis of online content can in principle be organized using this tool (Scharkow, 2012, p. 268). The main disadvantage of the tool is that it is largely limited to analyzing text articles. HTML files can be archived and, in principle, video and audio files can also be analyzed. However, it is not possible to store multimedia elements and, above all, to link these elements to the articles in which they are embedded. Hyperlinks and interactive elements such as user comments must also be analyzed by human coders.

ANGRIST / IN-TOUCH: The tools ANGRIST (Adjustable Non-commercial Gadget for Relational data Input in Sequential Tasks) and IN-TOUCH (Integrated Technique for Observing and Understanding Coder Habits)4 have a somewhat different focus (Wettstein, 2012; Wettstein, Reichel, Kühne, & Wirth, 2012). The two tools only come into play after the online content has been saved and enable computer-aided manual and semi-automated coding in which the coders work directly on the computer. As a result, they are not designed to cope with the quantity and dynamics of online content, but rather build on its digitization and machine readability. ANGRIST is a script in the Python programming language that queries the individual codebook categories step by step. In a first step, the codebook and the articles to be coded are formatted for access by the ANGRIST script. The articles must be available in Unicode or ASCII formatting. This means that letters, numbers and a limited number of special characters can be displayed within the program; multimedia, hypertextual and interactive content is not displayed. Using the stored codebook, ANGRIST guides the coders from category to category. The coders work directly with an input mask. This offers the advantage that the coders do not have to memorize any numerical codes but, as in a survey, are presented with formulated items which they can answer using various selection tools such as drop-down menus or checkboxes. The codes are then already available in digital form and can easily be processed further in a statistics program. Overall, the tool makes human coding of online content much easier. Due to its design, however, it does not depict the hypertextuality and multimediality of websites. Since ANGRIST is not an archiving tool, it cannot bypass the personalization of online offerings. The collection of meta-data is only possible by human coders.
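The step-by-step category query that ANGRIST implements can be sketched as a codebook-driven loop. The categories below and the non-interactive answer dictionary are illustrative assumptions; the real tool presents the categories one after another in an interactive input mask:

```python
# An invented two-category codebook for illustration.
CODEBOOK = [
    {"name": "topic", "prompt": "Main topic of the article?",
     "options": ["politics", "economy", "sports", "other"]},
    {"name": "tone", "prompt": "Overall evaluation tendency?",
     "options": ["positive", "neutral", "negative"]},
]

def code_article(answers):
    """Walk through the codebook category by category.
    `answers` maps category names to the option the coder selected,
    standing in for the interactive input mask."""
    record = {}
    for category in CODEBOOK:
        choice = answers[category["name"]]
        if choice not in category["options"]:
            raise ValueError(f"invalid code for {category['name']}: {choice}")
        record[category["name"]] = choice
    return record
```

Because the coder picks from predefined options rather than typing numerical codes, the resulting record is immediately machine-readable and can be exported to a statistics program.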

The IN-TOUCH tool can be used as a supplement to ANGRIST and serves to monitor and evaluate the coding behavior of human coders. The main aim of the tool is to make the coding process controllable. In addition to recording coder behavior, such as coding speed and the time of day of coding, reliability tests can also be carried out. The tool offers enormous advantages for computer-aided content analysis by human coders, as very labor-intensive steps such as reliability tests, coder selection and training can be significantly simplified. Both tools can be recommended overall for computer-aided manual content analysis, but do not allow the storage of multimedia, hypertextual and interactive online content. They therefore develop their strength primarily in combination with other tools for the automated storage of online content.
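Monitoring coder behavior of the kind IN-TOUCH records can be illustrated with a small aggregation of per-unit coding times. The log format is an assumption for illustration, not the tool's actual data model:

```python
from statistics import mean

def coder_speed(log):
    """log: list of (coder_id, seconds_per_unit) tuples, e.g. derived
    from timestamps the coding interface records. Returns the mean
    coding time per unit for each coder."""
    per_coder = {}
    for coder, seconds in log:
        per_coder.setdefault(coder, []).append(seconds)
    return {coder: mean(times) for coder, times in per_coder.items()}
```

Conspicuously fast coders or strong time-of-day effects flagged by such statistics can then be followed up with targeted reliability tests or retraining.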

Facepager: A tool for the automated storage of social media offerings is the Facepager5 developed by Keyling and Jünger (2013) (Fig. 2). The Facepager is based on the respective application programming interfaces (APIs) of Facebook and Twitter and can in principle be extended to online offerings that work with the JavaScript-based data format JSON. The tool can be used to automatically save large amounts of meta-data as well as formal and content-related information from Facebook pages and Twitter channels.

Figure 2: User interface of Facepager 3.5

Source: (accessed on March 24, 2014)

The workflow begins with the manual addition of the Facebook pages, Twitter channels or other sources with a JSON-based interface that are to be saved. It is then selected which information is to be recorded from the pages concerned. The possibilities here range from recording the sheer number of "fans" and "followers" to the automated downloading of status messages and comments. The latest version of the tool also enables the automated storage of files shared on Facebook or Twitter, such as image files. The saving options in Facepager depend on which information the site operators, e.g. Facebook or Twitter, make available. In principle, however, a wide range of information can also be extracted from other platforms using custom scripts. When the desired options have been selected, the storage process begins. In order to analyze non-public areas of social network sites, a separate account in the relevant network with a user name and password is required. This is used to log in via the Facepager, and the automated collection of the desired information begins. The collected information is initially displayed within the program, but can easily be transferred to Excel or a statistics program for further processing. While multimedia files such as images are stored in their original form on the computer used, e.g. in JPG format, textual and numerical content such as fan numbers, follower numbers or status messages is extracted from the respective Facebook pages or Twitter channels and is available in text form. In this way, the content of the social media pages is stored in a machine-readable format. This is a considerable advantage for automated text analysis, which can be carried out with an additional tool, e.g. ANGRIST, AmCAT or NewsClassifier. At the same time, the availability of the information in plain text form makes manual coding more difficult.
This is because all stored information is displayed to the coders in tabular form and not in its original layout. Although it can be seen in the text how many and which hyperlinks are present in the posts and whether and how many multimedia elements are included, the corresponding multimedia elements are saved separately from their editorial context. In addition, the Facepager is not yet designed to store conventional websites but has been developed specifically for social media pages. With regard to the personalization of online content, the advantage of the Facepager lies in its access to the information via the respective programming interface of the provider. It can be assumed that the information available there is independent of the individual usage behavior of the respective coder. Overall, it can be concluded that the Facepager is ideally suited for storing social media pages and creates the conditions for automated content analysis, but is not yet designed for the coding of journalistic online offerings.
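The tabular, plain-text form in which such API data arrives can be illustrated with a small JSON example. The response structure below is invented for illustration and does not reproduce the actual Facebook or Twitter API format:

```python
import json

# An invented JSON response, loosely modeled on what a social media
# API might deliver; not the real Facebook or Twitter format.
SAMPLE_RESPONSE = json.dumps({
    "posts": [
        {"id": "1", "message": "First post", "likes": 120, "shares": 14},
        {"id": "2", "message": "Second post", "likes": 45, "shares": 3},
    ]
})

def tabulate_posts(raw_json):
    """Flatten a JSON API response into rows suitable for export
    to Excel or a statistics program."""
    data = json.loads(raw_json)
    return [(post["id"], post["likes"], post["shares"])
            for post in data["posts"]]
```

The flattening step is what makes the data machine-readable for automated analysis, and at the same time what detaches it from the original page layout that a manual coder would want to see.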

ARTICLe: The ARTICLe tool6 (Automatic Rss-crawling Tool for Internet-based Content anaLysis) focuses on the automated storage of websites as well as the preparation and organization of articles for coding. It was programmed by Thomas Holbach at the University of Jena and further developed by Christoph Uschkrat and Jörg Haßler as part of the DFG project "Digitale Wissensklüfte" ("Digital Knowledge Gaps"). In contrast to all other programs, it saves websites fully automatically, including all multimedia and hypertextual elements as well as meta-data on the distribution of articles in social networks. In addition, the database serves as a coding platform that offers various interaction options. The starting point for accessing the websites to be examined is their RSS feeds.

After the RSS feeds of all websites to be examined have been entered manually, the first automated step is to save all articles. Each article is saved via its link within the RSS feed. ARTICLe does not focus on extracting raw texts; rather, the aim is to present the websites to the coders as they actually are, or were, available online. Therefore, screenshots are saved in HTML, PDF, and JPG formats and stored in a relational database. Storage can take place at any interval, e.g. every two hours.
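The register function of an RSS feed can be sketched roughly as follows, here in Python rather than the PHP actually used by ARTICLe; the feed content and URLs are made-up examples:

```python
import xml.etree.ElementTree as ET

def article_links(rss_xml: str) -> list[str]:
    """Collect the <link> of every <item> in an RSS 2.0 feed.

    These URLs are the register from which the full articles are then
    downloaded and rendered to HTML/PDF/JPG snapshots, e.g. by a job
    scheduled every two hours.
    """
    root = ET.fromstring(rss_xml)
    return [item.findtext("link", default="")
            for item in root.iter("item")]

feed = """<rss version="2.0"><channel>
  <title>Example news</title>
  <item><title>A</title><link>https://example.org/a</link></item>
  <item><title>B</title><link>https://example.org/b</link></item>
</channel></rss>"""
print(article_links(feed))
```

Because every item in the feed is listed in the same way for every visitor, crawling these links sidesteps any personalized ordering of the start page.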

The storage of videos and audio documents contained in articles, as well as of all pages of articles that are longer than one page, is also automated. The recognition of multimedia elements and follow-up pages is based on a keyword catalog created manually in advance. If one of the keywords appears in an article, the database flags the existence of videos, audio documents, or articles with more than one page. To set this up, it is first necessary to determine which file formats are used in the source code of the websites under examination. If a website regularly embeds YouTube videos, for example, it may be sufficient for the database to search for the word "youtube" within the source code. For articles with more than one page, terms such as "page 2" or "next page" are suitable. Using a so-called regular expression (RegExp), the videos and follow-up pages are extracted and downloaded by a PHP script. All stored articles are kept, together with all the multimedia elements they contain, in reverse chronological order in a database table (Fig. 3). This database view also serves as the interface for coding the articles (see below). In addition, the database saves screenshots of the start page of each registered website from which the RSS feeds originate. This enables, for example, analyses of the dynamics of start pages, which are then independent of the RSS feeds and consequently not depersonalized.
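A minimal sketch of such keyword-plus-RegExp detection might look as follows; the concrete patterns are illustrative assumptions, not ARTICLe's actual catalog:

```python
import re

# Hypothetical keyword catalog: each entry flags one kind of element
# (embedded video, follow-up page) and extracts the URL or ID to fetch.
KEYWORDS = {
    "video": re.compile(r'youtube\.com/embed/([\w-]+)'),
    "next_page": re.compile(r'href="([^"]+)"[^>]*>\s*(?:page 2|next page)',
                            re.IGNORECASE),
}

def scan_source(html: str) -> dict[str, list[str]]:
    """Flag articles whose source code matches a keyword pattern and
    collect the matching URLs/IDs for a later download step."""
    return {name: pattern.findall(html) for name, pattern in KEYWORDS.items()}

html = ('<iframe src="https://www.youtube.com/embed/abc123"></iframe>'
        '<a href="/article?p=2">next page</a>')
hits = scan_source(html)
print(hits)
```

In ARTICLe itself the extracted matches are handed to a PHP script that performs the actual download; the sketch stops at the detection step.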

Figure 3: Database view of ARTICLe


In the second step, the coders access the archived articles on a password-protected server via the database's Internet address or internally within the university network. The database is based on MySQL and is prepared for coder access as a PHP-generated website. In these formats, the integration and placement of dynamic multimedia elements such as videos and audio documents are displayed, but the elements themselves cannot be played. For this reason, the separately stored multimedia elements are also displayed within the database in the same row as the corresponding article. After coding the text of an article, coders code any videos and audio documents it contains. When the manual coding of an item is complete, the coder marks it as coded; in this way, the progress of the coding work can be tracked in real time. If coders encounter problems or questions, they can leave a comment. In the third step, the meta-data of the articles, e.g. likes and shares on Facebook, are processed automatically and can then be imported directly into a statistics program. The database evaluates data from Facebook, Twitter, and Google+. These are presented in tabular form for each article and can be called up in the database in the corresponding column of the relevant article.
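The relational layout and the coding-status flag described above can be illustrated with a small SQLite sketch. The real platform uses MySQL behind a PHP front end, and all table and column names here are assumptions for illustration:

```python
import sqlite3

# Sketch of the relational layout (names are hypothetical).
con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE articles (
    id INTEGER PRIMARY KEY,
    url TEXT, stored_at TEXT,
    html_path TEXT, pdf_path TEXT, jpg_path TEXT,
    fb_likes INTEGER, fb_shares INTEGER,
    coded INTEGER DEFAULT 0,   -- set to 1 when a coder finishes the item
    comment TEXT               -- coder questions or remarks
)""")
con.execute("INSERT INTO articles (url, stored_at, fb_likes, fb_shares) "
            "VALUES ('https://example.org/a', '2014-03-24 10:00', 120, 14)")

# Mark an article as coded and leave a remark, as a coder would via the UI.
con.execute("UPDATE articles SET coded = 1, comment = 'video missing' "
            "WHERE url = 'https://example.org/a'")

# Real-time coding progress: share of items already coded.
done, total = con.execute(
    "SELECT SUM(coded), COUNT(*) FROM articles").fetchone()
print(f"{done}/{total} articles coded")
```

Storing snapshot paths, meta-data, and the coding status in one row is what lets the same table serve both as archive and as coding interface.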

Overall, the advantages of ARTICLe are the comprehensive, automated storage of websites including all hyperlinks, multimedia elements, and meta-data from social networks. The tool thus provides the prerequisites for a comprehensive manual analysis of online content. In addition, the user interface allows communication with the coders and monitoring of the coding process. However, the database has not yet been designed for the automated coding of texts, let alone of multimedia elements. To integrate automated analysis procedures beyond the analysis of meta-data, parsers or tools for semantic or network-based text analysis can be used in addition.

5. Summary and discussion

The increasing importance of the Internet as a medium of political communication means that content analyses of online media are also gaining in importance. However, online media differ from offline media in many respects. In particular, digitization and machine readability, as well as the possibility of collecting meta-data, represent considerable opportunities for online content analysis. On the other hand, the volatility and dynamics of online content, its multimediality and hypertextuality, and the personalization of articles pose significant problems for anyone who wants to conduct online content analyses. These problems can initially be countered above all by suitable storage of the websites to be examined. However, the methods commonly used for this to date (screenshots, web crawlers, RSS feeds) all have specific disadvantages. For this reason, several tools have been developed in recent years that automate storage, as well as numerous other steps of online content analysis, with different emphases.

Since many of these tools are designed primarily to enable automated content analysis of texts, analyses of complex websites that take hyperlinks, interactive elements, and multimedia content into account are either impossible or possible only with considerable restrictions. Other tools are optimized for storing websites and save them with hyperlinks and multimedia elements, but cannot perform automated coding. Table 1 provides an overview of the tools' functions.

Table 1: The functions of the tools in comparison

The table compares AmCAT, NewsClassifier, ANGRIST/IN-TOUCH, Facepager, and ARTICLe along the following functions:
- Automated storage of hyperlinks
- Automated storage of multimedia elements
- Avoiding personalization
- Function as coding platform / coding management
- Download and analysis of meta-data from social network sites
- Automated coding of texts
- Automated coding of multimedia elements

Overall, then, a combination of the various methods is usually advisable for the analysis of journalistic online content. Depending on the research question, the focus can lie more on the use of human coders or on automated coding. For example, Facepager can save all status updates of a selected Facebook profile. These can be exported as a CSV file and uploaded to AmCAT, where they can be analyzed with a programmed code book. Probably the most sensible approach to a comprehensive analysis of online content is to store the RSS feeds made available by website providers. However, these serve only as a starting point for saving the articles, which are then accessed and stored via the links within the feeds. If screenshots are additionally created automatically in formats such as HTML or PDF, the hypertextuality and multimedia content of the articles can be taken into account in the analysis. For automated text analysis, the raw texts are extracted in an additional step. Human coders would thus be presented with files in the layout of the actual websites, while the plain texts would be processed automatically. With this procedure, complex categories can be coded manually while meta-data on the "success" of an article with the audience are recorded and evaluated automatically. Such an approach can be realized, for example, by connecting ARTICLe with NewsClassifier, AmCAT, or ANGRIST. In this way, the advantages of both manual and automated analysis of online content can be fully exploited.
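The raw-text extraction step mentioned here, stripping an article's markup so that only the plain text is passed to automated analysis while coders keep the stored layout, can be sketched with Python's standard library:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Strip markup so the article body can be fed to automated text
    analysis, while coders still see the stored HTML/PDF layout."""

    def __init__(self):
        super().__init__()
        self._chunks = []
        self._skip = 0  # nesting depth inside <script>/<style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self._chunks.append(data.strip())

    def text(self) -> str:
        return " ".join(self._chunks)

parser = TextExtractor()
parser.feed('<p>The <b>raw</b> text.</p><script>var x=1;</script>')
print(parser.text())  # The raw text.
```

The extracted plain text can then be handed to a tool such as AmCAT or NewsClassifier, while the original HTML/PDF snapshot remains available for manual coding.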

However, even when different tools for storing and processing online content are combined, such procedures require regular manual checking. Unexpected changes, such as modifications to RSS feeds or the introduction of so-called paywalls, make storage considerably more difficult. Without manual intervention, such as creating an account for retrieving articles, paid content cannot be saved automatically. Comparable problems also occur in offline content analysis, e.g. when a library does not subscribe to a newspaper title. They can be minimized through careful preparation and planning, weighing the costs and benefits of procuring the titles, and critically assessing the effort and yield of analyzing the titles and articles concerned. Other online-specific problems, such as unexpected changes to Internet addresses, inaccessible pages, or display errors in JavaScript or Flash content, can often only be prevented by manually monitoring the storage process. Here, gradual compromises regarding the storage procedure may be necessary: if a website cannot be saved automatically, the causes must first be investigated manually. It may then be sufficient to capture the site with another tool, such as a web crawler; where even this is not possible, the only remaining option is to create screenshots manually, e.g. as PDF files. Ideally, these manually saved articles are then made available to the coders in the relational database in the same way as the automatically saved ones. The occurrence of such problems of detail can be minimized by thoroughly inspecting the material before storage begins. Those familiar with the online offerings being stored can often recognize patterns indicating where irregularities may occur and how to deal with them; for example, it may become apparent in which sections multimedia elements in Flash or JavaScript are used particularly frequently.

Careful planning of the research project, together with a sensible combination of the methods presented here, can thus in principle meet all the technical challenges discussed that arise in the analysis of journalistic online content. Only the automated coding of multimedia elements has so far remained largely impossible. Automated tools already exist for capturing specific image features, such as the gestures and facial expressions of the persons depicted (e.g., Cohn & Ekman, 2005) or the general image composition (e.g., Stommel & Müller, 2011). A detailed, automated recording of all image content, by contrast, seems hardly feasible.


Cohn, J.F., & Ekman, P. (2005). Measuring facial action. In J. A. Harrigan, R. Rosenthal, & K. S. Scherer (Eds.), The new handbook of methods in nonverbal behavior research (pp. 9-64). Oxford: Oxford University Press.

de Kunder, M. (2013). The size of the World Wide Web (The Internet). Available at (accessed on March 24, 2014).

de Nooy, W., & Kleinnijenhuis, J. (2013). Polarization in the media during an election campaign: A dynamic network model predicting support and attack among political actors. Political Communication, 30 (1), 117-138. doi:10.1080/10584609.2012.737417

Eble, M., & Kirch, S. (2013). Knowledge transfer and media development: tools for integrating multimedia content into knowledge management. Open Journal of Knowledge Management, 7 (1), pp. 42-46. Available at (accessed on March 24, 2014).

Hasebrink, U., & Schmidt, J.-H. (2013). Cross-media information repertoires. Media Perspektiven, (1), 2–12.

Karlsson, M., & Strömbäck, J. (2010). Freezing the flow of online news: Exploring approaches to the study of the liquidity of online news. Journalism Studies, 11 (1), 2-19. doi:10.1080/14616700903119784

Keyling, T., & Jünger, J. (2013). Facepager (version 3.3). An application for generic data retrieval through APIs. Available at: (accessed on March 24, 2014).

King, G., & Lowe, W. (2003). An automated information extraction tool for international conflict data with performance as good as human coders: A rare events evaluation design. International Organization, 57 (3). doi:10.1017/S0020818303573064

Lewis, S. C., Zamith, R., & Hermida, A. (2013). Content analysis in an era of big data: A hybrid approach to computational and manual methods. Journal of Broadcasting & Electronic Media, 57 (1), 34-52. doi:10.1080/08838151.2012.761702

Neuberger, C., Nuernbergk, C., & Rischke, M. (2009). Journalism - re-measured: The population of journalistic Internet offers - method and results. In C. Neuberger, C. Nuernbergk, & M. Rischke (Eds.), Journalism on the Internet. Profession, participation, mechanization (pp. 197–230). Wiesbaden: Verlag für Sozialwissenschaften / GWV Fachverlage.

Pariser, E. (2011). The filter bubble: What the Internet is hiding from you. New York: Penguin Press.

(2013). Internet 2012 in numbers. Available at (accessed on March 24, 2014).

Rüdiger, K., & Welker, M. (2010). Editorial blogs of German newspapers. A workshop report on the difficulties of analyzing the content. In M. Welker & C. Wünsch (eds.), New writings on online research: Vol. 8. The online content analysis. Research object Internet (pp. 448–468). Cologne: Herbert von Halem.

Scharkow, M. (2012). Automatic content analysis and machine learning. Berlin: epubli.

Sjøvaag, H., Moe, H., & Stavelin, E. (2012). Public service news on the web: A large-scale content analysis of the Norwegian Broadcasting Corporation’s online news. Journalism Studies, 13 (1), 90-106. doi:10.1080/1461670X.2011.578940

Stommel, M., & Müller, J. (2011). Automatic, computer-aided image recognition. In T. Petersen & C. Schwender (eds.) The decryption of the images. Methods for exploring visual communication. A manual (pp. 246-263). Cologne: Herbert von Halem.

van Atteveldt, W. (2008). Semantic network analysis: Techniques for extracting, representing and querying media content. Charleston, SC: BookSurge.

van Eimeren, B., & Frees, B. (2013). Rapid increase in Internet consumption - online users almost three hours a day on the Internet: Results of the ARD / ZDF online study 2013. Media Perspektiven, (7-8), 358–372.

Welker, M., Wünsch, C., Böcking, S., Bock, A., Friedemann, A., Herbers, M., Isermann, H., Knieper, T., Meier, S., Pentzold, C., & Schweitzer, E. J. (2010). The online content analysis: methodological challenge, but no alternative. In M. Welker & C. Wünsch (eds.), New writings on online research: Vol. 8. The online content analysis. Research object Internet (pp. 9–30). Cologne: Herbert von Halem.

Wettstein, M. (2012). Documentation and instructions for programming the encoder interface. Available at: (accessed on March 24, 2014).

Wettstein, M., Reichel, K., Kühne, R., & Wirth, W. (2012). IN-TOUCH - a new tool for checking and evaluating coder activities in content analysis. Lecture at the 13th annual meeting of the SGKM, Neuchâtel.

1 This publication was created within the framework of the research group "Political Communication in the Online World" (1381), sub-project 4, funded by the German Research Foundation (DFG).

2 Documentation of the tool is available at In addition, you can register as a user at and create projects without running your own server.

3 The tool is documented at and can be used and further developed with programming knowledge.

4 Documentation for the independent implementation of the ANGRIST tool is available at:

5 The facepager can be downloaded for free use and further development at

6 The ARTICLe user interface is available at The publication of the code on is currently in preparation.

Extended abstract

Advancement through technology? The analysis of journalistic online content by using automated tools1

Jörg Haßler, Marcus Maurer & Thomas Holbach

1. Introduction

Without any doubt, the Internet is continually gaining in significance for political communication research. At present, about 75 percent of the German population state that they use the Internet at least occasionally (van Eimeren & Frees, 2013, p. 363). All traditional mass media operate websites that provide real-time information. For citizens, these journalistic websites are the most important information sources online (Hasebrink & Schmidt, 2013, p. 8).

The growing importance of online media also has consequences for content analyses of journalistic online media coverage. Because such analyses consider wide-reaching media that are representative of the entire media system, or media that are meant to serve as the basis for effects analyses, journalistic online media nowadays have to be included in many content analyses. On the one hand, such analyses appear very promising because online media use standardized programming languages and their content is available in digitized form. On the other hand, the quantity, dynamics, multimediality, hypertextuality, and personalization of websites set boundaries for their storage and analysis. This article discusses frequently used strategies to address those challenges and presents five recently developed tools for automated storage, organization, or coding of online content.

2. Challenges of the Content Analysis of Websites

The Internet changes constantly. This holds true both for the available websites in their entirety and for individual articles within web services. For practical reasons and for reasons of intersubjective verifiability, websites have to be stored before analysis. As websites are standardized, their dynamics can be addressed by automatically storing their content. The overall quantity of online content and the dynamics of websites can thus be handled by careful selection and storage of websites. To address multimediality, it is necessary to store and code not only the text of websites but also embedded pictures, videos, and audio files. The same holds true for hyperlinks. A further challenge for online content analyses is personalization: online content can be tailored individually by algorithms to address single users. This individualization can cause enormous problems for automated tools. For example, it is nearly impossible to analyze the placement of single articles, as articles can be presented in a different order to different users. But online content does not only challenge content analysis; it also provides opportunities, such as the wide availability of meta-data like comments, likes, and shares of articles.

3. Traditional procedures to store online content

Many studies in communication research use procedures like taking screenshots, web crawling, or storing RSS feeds to archive online content for content analyses. These procedures vary in their degree of automation. Taking screenshots of websites is the easiest but also the most time-consuming way to store online content. While websites mostly appear in the same layout and style, and there are formats that keep hyperlinks usable, it is impossible to save videos and audio files with just a screenshot. Web crawlers, like screenshots, save online content in the same layout and style as it appears online. They keep hyperlinks available, but they also demand manual storage of videos and audio files. Furthermore, neither procedure bypasses personalization algorithms. A procedure that can be used to store websites regardless of personalization is access via RSS feeds. These feeds are often created automatically by the content management system and list published articles. As all articles appear in the same layout in reverse chronological order, they are not personalized. Unfortunately, RSS feeds do not per se show the articles in the layout and style in which they appear online. To store the articles in the look of their online versions, it is necessary to use the RSS feeds as a register and store the articles' online versions from there. This short overview shows that conventional procedures do not address all challenges that websites pose for reliable content analyses. It is therefore necessary to combine these procedures and to use the tools that best fit the needs of the particular research question.

4. The analysis of journalistic online content by using automated tools

To address the challenges of online content analyses, we compare five recently developed tools to store, organize, and code online content: AmCAT, NewsClassifier, the coding platforms ANGRIST and IN-TOUCH, Facepager, and ARTICLe.

AmCAT2 combines the organization and coding of online content (van Atteveldt, 2008). It allows large amounts of data to be listed in an SQL database, which can then be analyzed automatically or manually. AmCAT focuses on text formats like XML, RTF, or CSV. Thus, AmCAT alone does not address the Internet's multimediality and hypertextuality; whether it can be used for content analyses of videos, audio files, or hyperlinks depends on the procedure of data storage. The same holds true for bypassing personalization algorithms: the data has to be stored in a way that neutralizes personalization. Besides the organization of data, AmCAT supports various procedures of automated text analysis, like Natural Language Processing (NLP) or Part-of-Speech (POS) tagging. It combines the organization of the coding process, the allocation of the material, and the generation of data as well as its export to statistics software (van Atteveldt, 2008, p. 185). The opportunities to organize, code, and generate data lead to a high complexity of the tool. Coding of videos and pictures has to be done manually, as the tool is specialized for text analysis. These disadvantages can be addressed by combining AmCAT with storage tools appropriate for research questions that focus on multimediality or hypertextuality, and by combinations of automatic and manual coding.

The tool NewsClassifier3 was created to automate the whole process of content analysis, from data storage to coding (Scharkow, 2012, p. 250). It automatically stores journalistic websites, accessing the data via their RSS feeds; algorithms that personalize content can thus be bypassed. The data can be stored as HTML or text files. Like AmCAT, NewsClassifier focuses on automated text analysis. To organize the coding procedure, the tool can select a sample of data for automated or manual coding. Manually coded data can then be used as training data for automated coding. Furthermore, NewsClassifier calculates reliability tests and allows data to be exported to statistics software. The disadvantages of NewsClassifier are similar to those of AmCAT: it cannot automatically code content information from pictures, videos, or audio files. But its automatic storage makes manual coding of such data possible, thus addressing the main challenges of online content analyses.
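The idea of using manually coded material as training data for automated coding can be illustrated with a minimal naive Bayes sketch. This is a toy stand-in with invented example texts; NewsClassifier's actual machine-learning pipeline is more elaborate (cf. Scharkow, 2012):

```python
import math
from collections import Counter, defaultdict

def train(samples):
    """Learn word counts per manually coded category (multinomial NB)."""
    word_counts = defaultdict(Counter)
    class_counts = Counter()
    for text, label in samples:
        class_counts[label] += 1
        word_counts[label].update(text.lower().split())
    return word_counts, class_counts

def classify(text, word_counts, class_counts):
    """Assign the category with the highest log-posterior probability."""
    vocab = {w for c in word_counts.values() for w in c}
    total = sum(class_counts.values())
    best, best_score = None, float("-inf")
    for label in class_counts:
        score = math.log(class_counts[label] / total)
        n = sum(word_counts[label].values())
        for word in text.lower().split():
            # Laplace smoothing so unseen words do not zero the posterior.
            score += math.log((word_counts[label][word] + 1) / (n + len(vocab)))
        if score > best_score:
            best, best_score = label, score
    return best

# Manually coded training material (invented toy examples).
training = [("chancellor wins election vote", "politics"),
            ("parliament passes budget vote", "politics"),
            ("striker scores winning goal", "sports"),
            ("team wins league match", "sports")]
wc, cc = train(training)
print(classify("election vote in parliament", wc, cc))  # politics
```

The manually coded sample plays the role of the training data mentioned above; once trained, the classifier can code the remaining material automatically.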

The tools ANGRIST and IN-TOUCH4 focus on the coding process itself (Wettstein, 2012; Wettstein, Reichel, Kühne, & Wirth, 2012). They allow computer-assisted, semi-automatic coding. ANGRIST provides step-by-step coding along the categories of a programmed codebook. The texts for coding are displayed within the tool, so Unicode or ASCII formats are required. The user interface makes it unnecessary to enter raw code numbers, as it provides dropdown menus and checkboxes for coding. The tool IN-TOUCH complements ANGRIST as a tool for supervising the coding process: it provides reliability tests and monitors the progress of the project. As both are tools for the manual coding of text data, they do not per se account for multimediality and hypertextuality. Both tools can therefore only complement storage tools if the research question focuses on videos, audio files, or links.

A tool purely for data storage is the Facepager5 (Keyling & Jünger, 2013). It was developed to collect information from social network sites. It accesses the application programming interfaces (APIs) of Facebook and Twitter, but it can also be used to save information from other JSON-based platforms, like YouTube. After the Facebook pages or Twitter channels to be collected have been added, Facepager saves information such as status updates, the number of page likes, or the number of comments. Facepager collects all data that is available from each platform's API; it should thus be insensitive to personalization. To collect data, a user account at the social network site of interest is required. The collected data is shown in the user interface and can also be exported to statistics software. Multimedia data shared in status updates can likewise be saved and is copied to the local hard disk. The text information is machine readable, so the tool can be combined with one of the previously described tools for automated or semi-automated coding. As Facepager was developed for social network sites, it cannot be used to store or analyze complete websites or individual articles from websites.

ARTICLe6 was developed by Thomas Holbach, Christoph Uschkrat, and Jörg Haßler for the automated storage of articles from journalistic websites within the DFG-funded project "Digital Knowledge Gaps". In contrast to the previous tools, it stores articles from websites fully automatically, including all multimedia elements like pictures, videos, and audio files. Furthermore, it stores meta-data such as the likes and shares of an article on social network sites. A third advantage of the tool is that it serves as a coding platform for manual coding. As ARTICLe saves articles via the RSS feeds of the websites, it is able to bypass personalization algorithms. Articles are stored as they appear online, because the focus of the database is to provide a platform for manual coding. Therefore, screenshots in the formats HTML, PDF, and JPG are saved in a relational database. Besides the texts, pictures, videos, and audio files are collected automatically: the source codes of the articles are searched for keywords, and if a keyword appears, a regular expression (RegExp) extracts the videos and audio files, which a PHP script then downloads. All stored articles are saved in a table together with all embedded multimedia files and the meta-data of the articles. This table serves as a coding platform where human coders can select, edit, and comment on all stored articles. Meta-data like Facebook likes and shares can be exported to statistics software. The main advantages of ARTICLe are that it presents articles as they appear online, accounts for the multimediality and hypertextuality of websites, and bypasses the personalization of websites. In combination with tools for automated coding, ARTICLe might provide a fully automated content analysis of news websites.

5. Conclusion

The growing importance of the Internet as a political communication channel has led to a growing importance of online content analyses. To address the challenges that online content poses for scientific analyses, such as the quantity, dynamics, multimediality, hypertextuality, and personalization of websites, it is necessary to use tools for data storage. Depending on the research question, several recent tools address these challenges and allow the automation of many steps within the process of content analysis. Although there are technical obstacles, like Flash or JavaScript applications that are hardly storable, careful planning of the content analysis and mindful use of the presented tools make it possible to automate many working steps.


Hasebrink, U., & Schmidt, J.-H. (2013). Cross-media information repertoires. Media Perspektiven, (1), 2–12.

Keyling, T., & Jünger, J. (2013). Facepager (version 3.3). An application for generic data retrieval through APIs. Retrieved from: (March 24, 2014).

Scharkow, M. (2012). Automatic content analysis and machine learning. Berlin: epubli.

van Atteveldt, W. (2008). Semantic network analysis: Techniques for extracting, representing and querying media content. Charleston, SC: BookSurge.

van Eimeren, B., & Frees, B. (2013). Rapid increase in Internet consumption - online users almost three hours a day on the Internet: Results of the ARD / ZDF online study 2013. Media Perspektiven, (7-8), 358–372.

Wettstein, M. (2012). Documentation and instructions for programming the encoder interface. Retrieved from: (March 24, 2014).

Wettstein, M., Reichel, K., Kühne, R., & Wirth, W. (2012). IN-TOUCH - a new tool for checking and evaluating coder activities in content analysis. Lecture at the 13th annual meeting of the SGKM, Neuchâtel.

1 This publication was created in the context of the Research Unit “Political Communication in the Online World” (1381), subproject 4 which is funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation).

2 A documentation of the AmCAT (Amsterdam Content Analysis Toolkit) is available at Furthermore, registration to use the tool is possible at

3 The tool is available at

4 A documentation of ANGRIST (Adjustable Non-commercial Gadget for Relational data Input in Sequential Tasks) is available at

5 The Facepager can be downloaded at

6 The user interface of ARTICLe (The Automatic RSS-crawling Tool for Internet-based Content analysis) can be accessed at