Given these limitations, we suggest that findings in computational social science based on social media data need to be
systematically generalized and replicated. To do so, we first focus on a single platform, Twitter.com, to avoid cross-platform
variations. Twitter is a combination of social networking services and information sharing applications in which users can post
tweets, i.e., posts, with a 140-character limit. Tweets are public by default and are also broadcast directly to a user’s
followers. Users may rebroadcast a tweet by retweeting (RT) the message to their followers. Alternatively, followers may reply
directly to the author. Twitter is one of the most popular social media platforms around the world, and many studies in computational
social science are based on Twitter datasets. Therefore, our study scope is limited to only those studies using Twitter data and
related to computational social science research.
Because the complete Twitter dataset is not accessible, our approach is to collect a randomly selected and representative
Twitter dataset. Our method began by generating a list of random Twitter IDs (egos). We then collected all the egos’ alters (i.e.,
followers and followees) and the following relationships among the alters. Finally, we obtained the profiles and the timelines of the
selected users (egos and alters). Although this sampling approach is not adequate for estimating global network properties, we will
show that our dataset is sufficient, and possibly the best option, for re-examining previous propositions, because results based on
random samples can be generalized to the population level, whereas other sampling strategies usually do not have this property
(see S1 File). In addition, unlike sampling tweets, sampling users is a more appropriate approach to analyzing individual behaviors.
We synthesize existing studies into three themes to reflect the state-of-the-art research progress in computational social science in
relation to the use of Twitter data, namely usage, network structure, and information diffusion. Although these three themes
cover most findings obtained by observational research, studies using online experiments and studies combining external data (e.g.,
surveys) for predictions (e.g., of voting behavior) are not addressed in this study. We rephrase existing propositions to make them
testable at the individual level, assuming that the usage, formation of network structure, and information diffusion can be explained
by individual behaviors.
Materials and Methods
Ethics Statement
The study was approved by the Human Research Ethics Committee for Non-Clinical Faculties, The University of Hong Kong. Data
were obtained from Twitter’s REST API. Before data collection, Twitter granted developer accounts to the authors of this
study, which allowed access to the data. Indirect identifier data fields will be replaced with unidentifiable pseudo-codes at the
end of the project, after all data have been collected.
Data Collection
Instead of using the streaming API, we used Twitter’s REST APIs to collect a representative Twitter dataset. First, we employed a
method reported in Fu and Chan [11] and Zhu et al. [12] to generate random Twitter user IDs. The Twitter ID is a unique (numeric)
value that every account on Twitter has. Although an account can change its user name, it can never change its Twitter ID.
Therefore, as long as we find an approach to generating a list of valid Twitter user IDs, we find a way to generate a random sample
of Twitter users (accounts). After some exploratory experiments, we found that, as of November 2014, Twitter IDs ranged from 0 to
3,000,000,000. Therefore, we generated 3×30,000 random numbers in this range. We then looked up these numbers via the
REST API to check whether they corresponded to existing Twitter IDs. Using this method, we obtained 34,006 valid Twitter user accounts. We call them
“egos”. This random sample could represent the population of all Twitter users (see a comparison between the random sampling
and BFS sampling strategies in S1 File).
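The random-ID step described above can be sketched as follows. This is only an illustration under stated assumptions: the helper names (`draw_candidate_ids`, `keep_valid_ids`) are hypothetical, the `id_exists` callable stands in for a REST API lookup that is not reproduced here, and drawing without replacement is an assumption of the sketch, since the text does not say whether duplicate numbers were allowed.

```python
import random

ID_UPPER_BOUND = 3_000_000_000  # upper end of the ID range reported in the text
N_CANDIDATES = 3 * 30_000       # number of random draws reported in the text


def draw_candidate_ids(n, upper=ID_UPPER_BOUND, seed=None):
    """Draw n distinct integers uniformly from [0, upper].

    Sampling without replacement is an assumption of this sketch.
    """
    rng = random.Random(seed)
    return rng.sample(range(upper + 1), n)


def keep_valid_ids(candidates, id_exists):
    """Keep the candidates for which the lookup reports an existing account.

    `id_exists` is a hypothetical callable (int -> bool) standing in for
    an API lookup; it is not part of any real client library.
    """
    return [uid for uid in candidates if id_exists(uid)]


if __name__ == "__main__":
    candidates = draw_candidate_ids(N_CANDIDATES, seed=42)
    # Stand-in lookup for demonstration only: pretend IDs divisible by 3 exist.
    egos = keep_valid_ids(candidates, lambda uid: uid % 3 == 0)
    print(len(candidates), len(egos))
```

With the figures reported above, 34,006 valid accounts out of 3×30,000 = 90,000 candidate numbers corresponds to a hit rate of roughly 37.8%.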
Next, we obtained as many as possible of the egos’ user profiles, alters (followers and followees), and tweets/retweets in user timelines.
Since users’ tweets and following relationships can be protected by privacy settings, we could obtain only public users’
information. For egos, we obtained 4,702,258 tweets from 32,420 egos, of which 15,176 posted nothing. We obtained 2,484,247
unique alters of 32,702 egos, of which 13,713 have zero alters. For alters, we obtained profiles from 2,482,184 users. We further
obtained 2,378,687,333 tweets from 1,768,010 alters, of which 124,240 have zero tweets.
Next, we constructed 1.0 ego networks in which nodes are users (egos and alters) and edges are the following relationships
(without the following relationships among alters). Users without profiles were excluded. In total, the 1.0 ego networks contain 2,516,190 nodes
(including 8,472 ego users) and 3,949,275 edges. That is, there are 8,472 separate ego networks,
since only 8,472 egos satisfy the condition that an ego user should have at least one alter user and that this user’s profile information is
publicly available. Among the 8,472 egos, 6,415 users had posted at least one tweet in the past 6 months (active egos). We
further obtained the following relationships among the alters of the active egos to construct 1.5 ego networks. We used the 1.5 ego
networks to calculate the clustering coefficient and betweenness of the active egos. A flow chart of the data collection is provided in
Figure A in S1 File.
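The difference between the 1.0 and 1.5 ego networks can be illustrated with a minimal sketch. It assumes that following relationships are stored as a set of directed `(follower, followee)` pairs, and it computes the local clustering coefficient on the undirected version of the 1.5 network; both choices, as well as the function names, are assumptions made for illustration, and the formulas actually used in the study are those referenced in S1 File. Betweenness is not covered here.

```python
from itertools import combinations


def ego_network_1_0(ego, follow_edges):
    """1.0 ego network: only the following relationships that involve the ego."""
    return {(u, v) for (u, v) in follow_edges if ego in (u, v) and u != v}


def ego_network_1_5(ego, follow_edges):
    """1.5 ego network: the 1.0 network plus relationships among the alters."""
    alters = {u if v == ego else v for (u, v) in ego_network_1_0(ego, follow_edges)}
    members = alters | {ego}
    return {(u, v) for (u, v) in follow_edges
            if u in members and v in members and u != v}


def local_clustering(ego, edges):
    """Share of alter pairs that are connected, ignoring edge direction.

    Treating the network as undirected is an assumption of this sketch.
    """
    undirected = {frozenset(e) for e in edges}
    alters = {n for e in undirected if ego in e for n in e if n != ego}
    k = len(alters)
    if k < 2:
        return 0.0
    linked = sum(1 for a, b in combinations(alters, 2)
                 if frozenset((a, b)) in undirected)
    return linked / (k * (k - 1) / 2)


if __name__ == "__main__":
    # Toy data: "e" is the ego; "d" is connected only to alter "c".
    edges = {("e", "a"), ("b", "e"), ("a", "b"), ("e", "c"), ("c", "d")}
    print(sorted(ego_network_1_0("e", edges)))
    print(sorted(ego_network_1_5("e", edges)))
    print(local_clustering("e", ego_network_1_5("e", edges)))
```

In the toy data, the 1.0 network has no links among alters, so the ego's clustering coefficient can only be computed meaningfully on the 1.5 network, which is why the additional alter-to-alter relationships are needed.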
Data Analysis
We used a conceptual replication approach. This means that (1) we do not merely reproduce previous findings but replicate previous
conceptual claims using independent data, and (2) we generalize and rephrase previous claims as hypotheses and propositions
that can be tested at the individual level. In this way, all analyses in the current study were based on the random sample of ego
users. Therefore, the findings can be further generalized to the population of Twitter users. Although we also collected the 1.5 ego
networks, we used them only to calculate the egos’ network properties, which served as the egos’ attributes in the formal analyses. The induced
alters cannot be considered a representative sample (see S1 File). Further details of the calculations can be found in Table A
in S1 File.
Results
Usage
20%-80% rule of content generation.