Dr Pravin Jeyaraj

Government Cats on Twitter

I have been fascinated by the way that national newspapers have covered the antics of the individual cats of various government departments (or, more specifically, of the civil servants in those departments). According to the National Archives, there is evidence of government cats going back to the 19th century, and recently released government papers show a public following for individual government cats since the 1940s. The current generation of government cats is the first to have a presence on social media, in particular Twitter. So, I chose this project because all the pieces were in place to undertake a live study of the public following of the government cat, without waiting 30 years for the relevant papers to be released.

The project consisted of a number of separate Python scripts, which made use of various standard and third-party libraries. I divided the different aspects of the project into separate programs so that I did not have to hit the Twitter API on every execution during testing.
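
For example, each data-collection script can write what it fetches to disk, so that the analysis scripts never touch the API at all. A minimal sketch of that pattern (the file name and function are illustrative, not the project's actual layout):

    import json
    import os

    CACHE_FILE = "followers_cache.json"  # illustrative name, not the project's actual layout

    def load_or_fetch(fetch_fn):
        """Return cached Twitter data if present; otherwise fetch once and cache it."""
        if os.path.exists(CACHE_FILE):
            with open(CACHE_FILE) as f:
                return json.load(f)
        data = fetch_fn()  # the only place the Twitter API is actually hit
        with open(CACHE_FILE, "w") as f:
            json.dump(data, f)
        return data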

The project can be split into the following sections:

  1. Comparison of Twitter libraries
  2. Summary and comparison of the basic visible data about each Twitter account
  3. A network analysis of the entire government cat network, comprising each government cat account and its friends and followers, including a plot of the graph and identification of the most central users
  4. A textual analysis of the tweets of each government cat, with a view to identifying common topics
  5. A textual analysis of the locations and profile descriptions of the followers of each government cat, with a view to developing an understanding of the self-identification of the community

This was my final project for an MSc in Data Science. A copy of the full dissertation can be downloaded from my Academia page.

Comparison of Twitter libraries

Whilst the oldest and most commonly used Twitter library for Python is Tweepy, there are in fact nine libraries mentioned on the Twitter Developer page, and there is no objective analysis of their performance. Since the project involved accessing data about the followers and friends of particular Twitter accounts using Twitter's API, the first task was to understand how the different libraries performed.

Due to time constraints, I ended up comparing only four libraries: Tweepy, Twython, TweetPony and TwitterAPI.

In order to avoid breaching Twitter's rate limits, I created four different apps on Twitter's Developer Platform, one for each library.

The full findings, with charts, are outlined in my blog post, So you want to access Twitter data using Python...

The code can be viewed at timeTest.py
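
As a rough illustration of the approach (not the actual timeTest.py), a timing harness for one of the four libraries might look like the following, assuming Tweepy 4.x and placeholder credentials; the other libraries would get analogous harnesses:

    import time
    import tweepy

    # Placeholder credentials; each library used its own Twitter app so that
    # the four harnesses did not share a rate limit
    auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
    auth.set_access_token("ACCESS_TOKEN", "ACCESS_SECRET")
    api = tweepy.API(auth)

    start = time.perf_counter()
    # Example handle; 200 is the API maximum per followers/list call
    followers = api.get_followers(screen_name="Number10cat", count=200)
    elapsed = time.perf_counter() - start

    print(f"Retrieved {len(followers)} followers in {elapsed:.2f}s")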

The mean and standard deviation of each attribute were calculated using the code in AnalyseLibraries.py
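
AnalyseLibraries.py holds the real calculation; a minimal standard-library equivalent, over hypothetical timings, would be:

    import statistics

    # Hypothetical timings (seconds) from repeated runs of one library's harness
    timings = [1.92, 2.10, 1.87, 2.05, 1.99]

    print("mean:", statistics.mean(timings))
    print("std dev:", statistics.stdev(timings))  # sample standard deviation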

While Twython turned out to be the fastest, this was only because it returned a small fraction of the maximum possible number of followers/friends available through the API (200). TwitterAPI was the second fastest, but it was also the most complex of the four to use. TweetPony thus offered the best trade-off between speed and simplicity, but, for a small-scale project, there was little difference between Tweepy and TweetPony on either count. So I chose to use Tweepy, as this was the most established and had the most support available.

Summary and comparison of basic visible data

The purpose of this stage was to capture and compare the visible data about each of the seven government cat Twitter accounts. By visible data, I mean the data that is easy to glean from a cursory examination of a Twitter page: the number of followers, friends, likes and tweets, and the account creation date. While it is easy to observe this data manually for an individual account, Twitter often displays it only in approximate form, and it is difficult to compare different accounts on one page. By extracting the data using the API, it is possible to obtain the number of friends, followers, likes and tweets exactly, and the creation date and time to the nearest second.
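
As an illustration of reading these exact figures through the API, a sketch assuming Tweepy 4.x (the handle is just an example; the project's actual code is linked below):

    import tweepy

    auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")  # placeholder credentials
    auth.set_access_token("ACCESS_TOKEN", "ACCESS_SECRET")
    api = tweepy.API(auth)

    user = api.get_user(screen_name="Number10cat")  # example government cat handle

    print("followers:", user.followers_count)
    print("friends:  ", user.friends_count)
    print("likes:    ", user.favourites_count)
    print("tweets:   ", user.statuses_count)
    print("created:  ", user.created_at)  # exact to the second, unlike the web display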

This stage is also important because it provides the overall context for the subsequent stages, whose analysis is limited by the fact that Twitter's API allows a maximum of 200 friends and 200 followers to be extracted per account.

The findings are outlined in the blog post Who is the most popular government cat?

The code can be viewed at DataCollect_Basic.py

After undertaking this summary, the whole dataset of followers, friends and tweets for each government cat was extracted using the code in DataCollect.py
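
A sketch of what that collection step might look like with Tweepy 4.x (the helper name is hypothetical; DataCollect.py is the authoritative version):

    import tweepy

    def collect_account(api, handle):
        """Pull the API maximum of 200 followers and friends, plus recent tweets, for one account."""
        followers = api.get_followers(screen_name=handle, count=200)
        friends = api.get_friends(screen_name=handle, count=200)
        tweets = api.user_timeline(screen_name=handle, count=200, tweet_mode="extended")
        return followers, friends, tweets

The api argument is an authenticated tweepy.API object, built as in the earlier sketches.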

Network Analysis

The Government Cat network in this instance was taken to be all the government cat accounts plus their friends and followers. It was assumed that each government cat account was at the centre of its own local network of the most recent 200 friends and followers.

I was able to determine the following:

  1. That the network of government cat accounts and their 200 most recent friends and followers was around 1% of the global Government Cat network
  2. The number of nodes with degree, in-degree and out-degree n, for n from 0 to 100
  3. A visualisation of the network, showing the most central users according to degree centrality
  4. The sets of the most central users, not just in terms of connections (degree) but also in terms of information flow (union of in-degree and out-degree); see the sketch after this list
  5. The values of attributes that indicate whether any of the central users might be "fake" accounts
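
As a sketch of the kind of degree-centrality computation involved, using NetworkX on a toy edge list (the real graph was built from the extracted friend and follower data):

    import networkx as nx

    # Toy edge list: an edge A -> B means account A follows account B
    edges = [
        ("follower_1", "cat_account"),   # a follower of the cat
        ("cat_account", "friend_1"),     # an account the cat follows
        ("follower_1", "friend_1"),
    ]
    G = nx.DiGraph(edges)

    # Degree centrality counts all connections; the in/out variants separate
    # who follows a node (in) from whom the node follows (out)
    print(sorted(nx.degree_centrality(G).items(), key=lambda kv: kv[1], reverse=True))
    print(nx.in_degree_centrality(G))
    print(nx.out_degree_centrality(G))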

The findings are outlined in the blog post Most important users in the Government cat network?

The Network Analysis code (steps 1-4) can be viewed at AnalysisFollowers.py

The code for identifying possible "fake" accounts (step 5) can be viewed at AnalysisFollowers2.py

Textual Analysis

The purpose of the final stages was to understand something about the popular topics that each government cat account tweeted about, and to learn something about the followers and friends from their profile descriptions and locations. This was done using the Python library TextBlob to identify frequently occurring 1-grams, 2-grams and 3-grams, after eliminating words in a given stop list. I have not produced blog posts on this topic, but my analysis can be viewed in the MSc report mentioned above.
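
A minimal sketch of such an n-gram count (the tweets and stop list here are illustrative; the exact pipeline is in the scripts linked below):

    from collections import Counter
    from textblob import TextBlob

    STOP_WORDS = {"the", "a", "an", "and", "to", "of", "in", "is"}  # illustrative stop list

    tweets = [
        "Patrolling the corridors of the Foreign Office again",
        "Mouse count for the Foreign Office this week: zero",
    ]  # hypothetical tweet texts

    counts = Counter()
    for text in tweets:
        # TextBlob handles tokenisation; stop words are dropped before forming n-grams
        words = [w.lower() for w in TextBlob(text).words if w.lower() not in STOP_WORDS]
        for n in (1, 2, 3):
            counts.update(tuple(words[i:i + n]) for i in range(len(words) - n + 1))

    print(counts.most_common(5))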

The code for analysing tweets can be viewed at AnalysisTweets.py

The code for analysing the profile descriptions of followers can be viewed at AnalysisDescript.py

The code for analysing the locations of followers can be viewed at AnalysisLocations.py

Back to Portfolio page


About This Page

This website was coded in HTML, CSS and JavaScript, based on a theme from Colorlib. It is hosted on GitHub. The code can be found here.

