Research process and further thoughts

Digital curation is a broad field full of potential. I used to think of it only in terms of the digitization projects run by archival institutions, such as the National Archives cases I summarized. Through the Digital Curation module and my own research, I gradually realized that the definition of digital archives is much broader than I expected. The emails we send and the tweets we post in daily life can all be part of a digital archive. So I tried to find out how academics use and manage these new-era archives.

During my work placement with the SAGE Ocean team, I helped my colleagues summarize the applications for archiving data from social media, including Facebook, Twitter and Weibo, and learned about some new cases by looking through academic papers and technical websites. Since western readers may not know much about Weibo, I explain it in detail in my third post. I hope it can be an inspiration for digital archivists!

In addition, some of the tools for archiving Weibo are listed below.

1. WeiboEvents by PKUVIS

Free software, available for both Windows and Mac. Developed by: a visualization analysis group from Peking University (Its official website in English:


  • Monitor the real-time data and grab raw data for deeper analysis.
  • Data visualization: each node represents a microblog post and each connection represents a forwarding relationship, so the propagation process of a post can be extracted. In the dashboard on the right, users can filter by keywords contained in earlier posts, display the selected data, and create a customized report.


A web-based tool, with a mobile app planned for the near future. Developed by: a technology company, Zhiweidata Co. Ltd


  • Archive analysis: based on exposure, the user’s total rating (a weighted average of indicators such as user activity and follower volume), emotional value (positive and negative emotions), and content analysis (propagation tracking).
  • User profile: break down users by gender, district, activity level, etc.


Individual users can use BDP for free, but the number of permitted API connections and the data storage (only 100 MB) are limited. A plan with 1 GB of storage and 8 linked APIs costs approximately £8 per month. More than 1,000 companies, such as Nielsen and Alibaba, have partnered with BDP.

Developed by: Haizhi Network Technology Co. Ltd, a start-up founded in 2013.


  • No need to write Python code or SQL: just drag and drop in the dashboard to carry out the kind of tasks a data expert would do.
  • Multi-table association: drag and drop to associate (join) multiple tables, which is simpler and more convenient than VLOOKUP.
  • Append merge: multiple tables with the same structure can be quickly combined into a new table, eliminating the need for Ctrl + C / Ctrl + V.
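For readers who are curious what a dashboard like this does behind the drag and drop, the two table operations above can be sketched in a few lines of Python. This is only an illustration of the general ideas of a join and an append, not BDP's implementation: the function names `inner_join` and `append_tables`, the column names and the sample rows are all made up for this example.

```python
def inner_join(left, right, key):
    """Join two tables (lists of dicts) on a shared key column."""
    # Index the right-hand table by key so each lookup is fast.
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)
    # Keep only left rows that have at least one match on the right.
    return [{**l, **r} for l in left for r in index.get(l[key], [])]


def append_tables(*tables):
    """Stack tables that share the same structure into one new table."""
    return [row for table in tables for row in table]


# Hypothetical sample data: posts and user profiles.
posts = [
    {"user_id": 1, "text": "hello"},
    {"user_id": 2, "text": "hi"},
    {"user_id": 3, "text": "hey"},
]
users = [
    {"user_id": 1, "gender": "F"},
    {"user_id": 2, "gender": "M"},
]

joined = inner_join(posts, users, "user_id")  # posts enriched with gender
more_posts = [{"user_id": 1, "text": "again"}]
all_posts = append_tables(posts, more_posts)  # one longer table of posts
```

Here the post from user 3 drops out of `joined` because there is no matching profile, which is the usual behavior of an inner join.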

These applications allow researchers to archive targeted posts from the whole of Weibo retrospectively, no matter when they start using them.


Extracts raw data from the whole Weibo database, covering user information, hashtags, blog post/review/like information, etc.


An easy analytics dashboard for industry intelligence on Weibo. Developed by: KAWO Co. Ltd, a start-up founded in 2013. WeiboStats is designed for digital agencies, researchers and journalists who need reports and insights on Chinese social media. Users can easily track the performance of multiple Weibo accounts for free.

For those with no coding background, these tools will be very helpful for completing digital curation tasks. Of course, Twitter and Facebook have their own tools as well; some are paid services, but many are free to use.

Digital curation covers much more than collecting and analyzing digital archives on social media. It can be challenging if you haven’t studied widely discussed languages such as Python, but you should be aware that you can still be an archivist with the help of many online applications. Also, as the cases show, archiving can have fascinating effects in practice.

Traditional cases of managing digital archives

As mentioned in the last post, the way digital archives are used on new-era media differs from the way archives are managed in museums and galleries. Here I will briefly summarize two cases of how these traditional cultural institutions manage their digital archives.

1. The Parliamentary Archives – mixed storage solutions: cloud + localhost

The Parliamentary Archives is an institution that manages historic records from the House of Lords and displays some of its collections to the public. After weighing the strengths and weaknesses of various storage frameworks, the parliamentary procurement office decided to prioritize cloud solutions and drew up an ICT policy on the digital repository in 2012.

There are two main reasons why they chose a combined approach to storing data. On the one hand, the amount of digital material, both digitized and born-digital, is considerable and has already reached 50 TB. As the quantity grows, the complexity of formats adds further difficulty to the storage task. Apart from standard types such as PDF and JPEG, the materials come in diverse forms, such as CAD, audio-visual (AV) and TIFF. On the other hand, a local storage system has advantages in data confidentiality and privacy compared with the cloud. Storing sensitive materials in the locally hosted system, Preservica Enterprise Edition, while using the public cloud to store open data is therefore a sensible way to protect private information. In the process they also manage to fulfill legal obligations such as data sovereignty and Freedom of Information, which meets ethical and legal requirements.

In terms of technical infrastructure, they use CALM as the system for archive cataloguing and for interacting with Preservica Enterprise Edition. In particular, Portcullis, a bespoke online delivery system, is used in this combined storage setup to give visitors access to the digital repository content. In this way, the end users, i.e., visitors who view the digital materials over the Internet, can consult separate copies of the original archives independently, while the electronic content held in the cloud is stored securely.

2. Tate Gallery – sharing a centralized system across four organizations

The rationale for using a centralized storage system to manage digital collections across Tate’s branches is as follows. Firstly, Tate has four branches in the United Kingdom, which started digitizing a few years ago, but the process is still at a relatively early phase. Secondly, compared with traditional galleries, Tate aims to broaden its horizons and take in more born-digital collections in the near future, e.g., audio-visual artworks or digital archives, which may challenge the management of digital assets, because it is impractical to set the same standard for all artists: the artworks will come in various forms that are impossible to predict. Thirdly, it is estimated that Tate may need 2 petabytes (2,048 terabytes) of space to accommodate its growing digital resources.

Establishing a centralized archive repository system has therefore become a necessity, in order to significantly increase the efficiency of managing the digital resources. In addition, Tate Gallery created a dedicated Digital Preservation position in 2013 to coordinate implementation and communicate with different departments.

Unlike the Parliamentary Archives, Tate Gallery does not use local hosting to preserve archival data. It hands this responsibility to an external company, Arkivum, which performs well in offering effective control over locations within the storage system. It is commonly understood that one of the most important reasons for building a cloud system for digitized files is that users can search and reuse them conveniently, which gives the files lasting value. But this poses a dilemma for born-digital assets, because technology-driven tools keep evolving, which may undermine users’ confidence in their long-term access to the digital resources. To address this, Arkivum continues to develop close cooperative relationships with open-source software projects, so that it can guarantee ‘100% data integrity’ and offer the peace of mind of an out-of-the-box solution. Assessing the metadata workflow is also included in the service provided by Arkivum.

More specifically, Tate Gallery divided its digitization work according to different types of technical task. For example, TMS is used as the gallery system for managing electronic artworks, while Axiell’s Calm is responsible for cataloguing archives.



Digital Preservation in Parliament: eservation/

National Archives:

The Museum System (TMS):

How researchers use digital archives on Chinese social media

In April, the National Library of China announced that it will archive all public posts on Weibo for non-commercial use, as part of its project to preserve Internet information. It is estimated that more than 200 billion text posts and 50 billion pictures will be stored. The library believes that this archiving program can have a profound influence on digital-heritage retention. “It’s not only the content, but the emotions matter, known as affective computing and also the social networking reflected in these Weibo posts is important,” commented the Chinese scholar Zhou Kui.

Weibo is a microblogging application similar to Twitter, launched by Sina Corporation on August 14th, 2009, on which users share, disseminate and obtain information through their user relationships. Weibo has become one of the top two social media platforms in China. As of Q3 2018, the app had nearly 450 million users (compared to Twitter’s 300 million), and it has features that enable the study of emotional states and responses to the topics being discussed or spread across the web.

Before the National Library of China launched this nationwide project, academics and research groups had already made progress on archiving Weibo. Here I will introduce several interesting cases.

1. Archiving the comments may help prevent suicides – a rich ‘digital mine’ in Zoufan’s last post

Zoufan posted her last words on Weibo on 18 March 2012. She was suffering from major depressive disorder and died by suicide shortly afterwards. Since Zoufan’s last post, other Weibo users have gradually found her account and continued to share their emotions or stories of depression as comments. There are more than one million comments now. This caught the attention of Tingshao Zhu and his colleagues from the Chinese Institute of Psychology. Earlier this year, they started investigating this case and devising a strategy for archiving the weibos in order to connect with patients and prevent other suicides.

By analyzing those digital archives, the research group found that a significant number of patients with depressive disorders express their suicidal thoughts by posting anonymously. The researchers used Python to scrape and analyze the comment text and further discovered that those who experience suicidal ideation interact with others less and are more inward-looking. Specifically, the proportion of emotionally positive words is less than 5%, and the proportion of negative words is more than 80%. They rarely express thoughts about “family” and “future”, but mention “death” and “freedom” frequently.
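The kind of word-proportion measure described above can be sketched in a few lines of Python. This is only a minimal illustration and not the researchers' actual method: the tiny English word lists, the sample comments and the function name `emotion_word_shares` are all hypothetical, and I assume the shares are taken over all words in the comments, which the study summary does not specify. Real Weibo comments would also need Chinese word segmentation first.

```python
import re

# Hypothetical toy lexicons, only for illustration. A real study would
# use an established sentiment lexicon for Chinese.
POSITIVE_WORDS = {"hope", "happy"}
NEGATIVE_WORDS = {"sad", "tired", "pain"}


def emotion_word_shares(comments, positive, negative):
    """Return (positive share, negative share) over all words in the comments."""
    tokens = []
    for comment in comments:
        # Very rough tokenizer: lowercase runs of letters and apostrophes.
        tokens.extend(re.findall(r"[a-z']+", comment.lower()))
    if not tokens:
        return 0.0, 0.0
    pos = sum(1 for t in tokens if t in positive)
    neg = sum(1 for t in tokens if t in negative)
    return pos / len(tokens), neg / len(tokens)


# Made-up comments standing in for an archived comment thread.
sample_comments = ["I feel so tired and sad", "No hope, just pain", "Sad again"]
pos_share, neg_share = emotion_word_shares(
    sample_comments, POSITIVE_WORDS, NEGATIVE_WORDS
)
print(f"positive: {pos_share:.0%}, negative: {neg_share:.0%}")
```

Counting against all words keeps the sketch simple; counting only emotion-bearing words would give different percentages, so the choice of denominator should be stated whenever such figures are reported.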

Tingshao Zhu and his colleagues then built an algorithm, trained on manually tagged data from the responses to Zoufan’s post, to recognize people at high risk of suicide among the innumerable updates on Weibo and to classify the severity automatically. The team aims to use this algorithm, combined with their training in psychology, to identify people at high risk of suicide and reach out to provide the support they need. So far they have found 4,222 users with depressive disorders and provided further advice to them, which we hope will have a profound influence on treating depression in the long run.

2. Archiving the reviews and reposts of @Yutu – the parasocial interaction on new media

Yutu literally means “jade rabbit” and refers to the pet rabbit of the Moon goddess in a Chinese myth. On Weibo, @Yutu, the official account of the Chinese moon rover, has over 730,000 followers. It continues to post updates and news of its discoveries, as well as cute cartoons about its history and general knowledge about the universe, explaining complex concepts in a visual way.

In February 2014, it briefly went quiet during the lunar night, but after recovering from some mechanical difficulties (which were actually happening to the real rover on the moon), it posted the message: “Hi, anybody there?”, “I’m the rabbit that has seen the most stars!” This post attracted more than 840,00 reviews and 151,000 likes.

The scholar Feng Xian archived the reviews and reposts and then extracted the characters relating to emotional expression as well as the emojis. He found that 60% of the users posted compliments about its joyful ‘personality’ and 19% of users were encouraging the rabbit/rover to keep going (as if it were a real person) while the rover itself was facing technical problems on the Moon. The reposting level also indicates a high penetration of Weibo content among the targeted audience. The researcher looked at six layers of reposting on the microblog: the number of direct reposts is 2,231 (40%), the number of secondary reposts is 1,780 (32%), and the next four layers are 735 (13%), 231 (4%), 111 (2%) and 490 (9%), indicating that after the original reposts by some users, their friends keep forwarding the post through their social circles, much as a virus spreads.
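To make the layer figures above concrete, here is a small Python sketch of how repost layers could be counted from an archive and turned into percentages. The six counts are the ones reported above; everything else is my own assumption for illustration, including the function names `repost_layers` and `layer_shares` and the tiny `parent_of` mapping (each repost pointing to the post it forwarded), and it is not the method used in the study.

```python
from collections import Counter


def repost_layers(parent_of, root):
    """Count reposts per layer, where layer 1 is a direct repost of the root."""
    counts = Counter()
    for post in parent_of:
        depth, node = 0, post
        # Walk up the forwarding chain until the original post is reached.
        while node != root:
            node = parent_of[node]
            depth += 1
        counts[depth] += 1
    return [counts[d] for d in sorted(counts)]


def layer_shares(counts):
    """Convert per-layer counts into rounded percentages of all reposts."""
    total = sum(counts)
    return [round(100 * c / total) for c in counts]


# Hypothetical miniature archive: a and b repost the original,
# c reposts a, and d reposts c.
toy_parent_of = {"a": "root", "b": "root", "c": "a", "d": "c"}
toy_layers = repost_layers(toy_parent_of, "root")

# The six layer counts reported for the @Yutu post.
yutu_counts = [2231, 1780, 735, 231, 111, 490]
yutu_shares = layer_shares(yutu_counts)
```

The six reported counts add up to 5,578 reposts, and rounding each layer's share of that total gives the percentages quoted in the text.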

The authors also investigated the interaction model between the rover account and social media users, to find out how to balance the personified mood with scientific knowledge about space exploration. Specifically, instead of imparting professional knowledge in the manner of an emotionless machine, the account established an equal relationship with its audience during the virtual interaction, which helped mobilize their enthusiasm for taking part in the discussion. In addition, @Yutu combines new stories about the space rover with the classic associations of the Moon in China, updating its traditional meaning and carrying the dissemination further.

There are many more studies that archive Weibo in order to analyze user behavior, communication trends and the spread of information through the network.

For example, a group of researchers from Hong Kong collected both Weibo and Twitter archives to understand the levels and spread of Ebola misinformation in 2013-2014. The researchers wrote a script to crawl Weibo data, as an API was not available at the time. They found that only 2% of their archive samples contained misinformed treatment options, compared to perhaps 50%+ reported in other studies looking at the misinformation spread of Ebola treatments in Guinea, Liberia and Nigeria during the same year.

Social media such as Weibo provide new opportunities as well as new challenges for archivists in the Internet era, since these digital archives may require different technologies and management approaches, which deserve our attention.


BBC News. (2016). China's Jade Rabbit rover dies on Moon. [online] Available at: [Accessed 1 Aug. 2019].

Fung, I., Fu, K., Chan, C., Chan, B., Cheung, C., Abraham, T. and Tse, Z. (2016). Social Media's Initial Reaction to Information and Misinformation on Ebola, August 2014: Facts and Rumors. Public Health Reports, 131(3), pp.461-473.

(2019). Prevent suicides by archiving weibo with the help of AI. [online] Available at: [Accessed 1 Aug. 2019].

Wang, M. (2019). China's national library to archive 200 billion Sina Weibo posts. [online] Available at: [Accessed 1 Aug. 2019].

Yang, S., Xu, J. and Ye, P. (2018). Review of Online Sentiment Visualization Techniques. [online] Available at: [Accessed 1 Aug. 2019].

Introduction to digital archives

Archives can offer reliable documentary evidence of the past of an individual, a community, an institution, or even a nation. With the passing of time, archives become time capsules of former memories, helping the next generation to better understand what happened before. (See the detailed definition)

With the swift changes of the Internet era, both archives and archivists have taken on a digital role. Digital archives can break the boundaries of physical storage and prolong the life of archives, which is meaningful for preservation.

“A digital archive is similar in purpose to a physical archive, but the historical documents and objects have been digitized (often by scanning or photography, unless a document was created digitally in the first place) and made available online.”

Previously, I held the stereotype that digital curation only means digitizing archives and preserving the materials online. After some research I found a broader definition: it seems to be a dynamic process covering a larger range, in which additional data is continuously brought into the archive, instead of holding the existing data only.

“Digital curation involves maintaining, preserving and adding value to digital research data throughout its lifecycle.”

Digital curation is much more than digitizing archives or collections. With the development of technology, digital archives generated from social media can provide valuable resources for academic use. In the following post I will mainly focus on how researchers make use of digital archives on Chinese social media, which differs from the methods used in museums and galleries.

About the blog


Hi and welcome! I’m Shulin Hu.

I graduated from Communication University of China and am currently a master’s student at University College London, majoring in Digital Humanities.

This blog is about the management of digital archives. More specifically, I will show how galleries and archival museums manage electronic content properly and efficiently using different approaches. Although I do not have a professional background in archives, during the research process I gradually found that these digitization projects contain many interesting ideas, which may provide inspiration for further research or work. Different institutions choose different storage systems and commercial models according to their needs. I mainly use three cases as my evidence: the Parliamentary Archives, Tate Galleries and Dorset History Centre. I analyze them separately in order to draw out the rationale for why the staff chose each method and how they adapted to the new workflow. My goal is to explain this topic in an understandable way and let people who have little background, like me, learn the details of managing and preserving digital archives.

Critically speaking, my blog may be weak in firsthand research, since I have not interviewed staff currently working in museums or galleries. Fortunately, the cases I chose are not outdated, so many findings in this project are likely to offer new perspectives for archivists who wish to move from the role of traditional manager to that of modern digital curator.

All my data and concepts come from academic papers or relevant websites. You can check the reference links if you want to know more about a certain point. Feel free to cite sentences or apply the framework of this blog. (Licensing scheme)

I hope you enjoy reading it, and I am glad to receive any questions or suggestions!