With the rapid development of the field of AI, there has been a significant shift in the way academic results are disseminated.
Journal review cycles are long, and a method can become obsolete while you are still battling with reviewers. To protect the novelty of their results and expand their influence, many well-known large groups now release their work first on the preprint platform arXiv, and the pace of academic output keeps accelerating.
As a result, far more papers are posted to arXiv every day than anyone can read.
Against this backdrop, a group of sharers has appeared on social media. They pick out the genuinely interesting and important work in AI and ML so that everyone can follow and access research results more easily.
For example, our "Xi Xiaoyao Technology Talk" often shares interesting papers with you, hehe.
Besides helping everyone filter papers, these sharers on social media also build up their own influence!
How much influence? The conclusion of the paper introduced today is: papers shared by these big Vs (high-follower influencer accounts) receive 2-3 times as many citations as papers that were not shared!
Title
Tweets to Citations: Unveiling the Impact of Social Media Influencers on AI Research Visibility
Links
The paper focuses on two very influential users on X (formerly Twitter), AK (@akhaliq) and Aran Komatsuzaki (@arankomatsuzaki). It tracks the citation counts of the papers they share and sets up a control group for comparison. It also examines whether the sharers show a preference with respect to the authors' geography, gender, and institution.
The pictures below are screenshots of the two big Vs' profile pages on X; you can see that both have many followers.
akhaliq
arankomatsuzaki
Their posts generally take the following form: paper title + one-sentence summary + paper link + screenshot of the paper's first page, as shown in the figure below. Simple, clear and focused.
Forms of sharing.
They share a few papers every day, and each post draws more than 1,000 views, which brings a lot of exposure to the papers they share. It is therefore not hard to understand how shared papers end up with 2-3 times as many citations as unshared ones.
Of course, subjective analysis is not reliable, and we still have to let the data speak. Let's take a look at the detailed charts and the authors' analysis process.
The authors build a comprehensive dataset of more than 8,000 papers, covering all the relevant papers shared by the two influencers on platforms such as X and Hugging Face between December 2018 and October 2023.
To conduct a controlled study, the authors also constructed a control group in which each shared paper is matched one-to-one with a paper from the same publication year, the same publication venue, and a similar abstract topic. This keeps the two groups comparable in quality, ruling out the common assumption that big Vs only share "high quality" papers, which would naturally get more citations anyway.
The authors hypothesize that citation counts are primarily influenced by publication time, quality, and topic. To quantify these factors, they used the conference and year of publication as quality variables, and text embeddings of the title and abstract to approximate the topic.
The data collection process consists of three parts:
1.Collect the target set
First, take the lists of papers recommended by @akhaliq and @arankomatsuzaki, and use the Semantic Scholar API to query the title, abstract, publication year, publication venue, and citation count of each paper. Papers missing any required attribute are removed. The table below shows the five authors most frequently shared by the two users, along with their paper counts.
2.Control group
First, a large corpus was collected of papers published at the same conference and in the same year as the papers in the target set. Specifically, for each target paper published in year y at conference v, all papers published at conference v in year y were retrieved by querying the Semantic Scholar API. This yielded 247,993 unique papers, of which 124,940 had all the required attributes. These papers form the corpus from which matches for the target set are drawn.
3.Matching algorithms
Each paper in the target set was matched to a paper from the control corpus. The categorical variables (conference and year) were matched exactly, and the continuous variable (the topic embedding) was matched by Euclidean distance. A cosine-similarity cutoff of 0.6 ensures a high degree of topical similarity between the target set and the control group, while retaining 91% of AK's tweets and 96% of Komatsuzaki's tweets.
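The matching step can be sketched roughly as follows. This is a minimal illustration and not the authors' code: the record fields (`id`, `venue`, `year`, `embedding`), the function names, and the greedy one-to-one strategy are all assumptions made for the example. It ranks candidates by cosine similarity only and does not reproduce the Euclidean distance criterion mentioned in the text.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def match_controls(targets, candidates, min_similarity=0.6):
    """Greedy one-to-one matching (hypothetical strategy).

    A candidate qualifies only if it shares the target's venue and year
    exactly and its topic embedding reaches the similarity cutoff.
    Returns a dict mapping target id -> matched control id; targets
    without a qualifying candidate are left out.
    """
    used = set()
    matches = {}
    for t in targets:
        best_id, best_sim = None, min_similarity
        for c in candidates:
            if c["id"] in used:
                continue
            # Exact match on the categorical variables.
            if (c["venue"], c["year"]) != (t["venue"], t["year"]):
                continue
            sim = cosine_similarity(t["embedding"], c["embedding"])
            if sim >= best_sim:
                best_id, best_sim = c["id"], sim
        if best_id is not None:
            used.add(best_id)
            matches[t["id"]] = best_id
    return matches
```

Targets that find no candidate above the cutoff are simply dropped, which is one way a cutoff could leave a fraction of the tweets unmatched.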
Matching pairs are very similar in topic, almost always covering the same research subfield (e.g., diffusion models applied to image generation), solving the same problems, and using similar or identical approaches. As shown in the figure below:
4.Review scores
In addition, to verify that the method successfully controlled for quality, the authors also examined the review scores of the target and control groups at six major machine learning conferences.
The results showed that the two groups had similar review score distributions, indicating that their quality was almost equal and further confirming the effectiveness of the matching method.
The authors used histograms (a, b) and violin plots (c, d) to show the distribution of citations for the experimental and control groups, respectively. As shown in the figure below:
The results showed that the median number of citations for papers shared by AK was 24, compared with 14 in the control group; the median for papers shared by Komatsuzaki was 31, compared with 12 in the control group. These results show that, compared with the control group, papers shared by the big Vs received significantly more citations.
The authors also used a two-sample Q-Q plot to compare the distributions of the target and control groups at each quantile. To construct the plot, citation counts are log-scaled, normalized (z-scored) against the control group's distribution, sorted, and paired in order. The dotted line represents equal distributions; a dot above the line indicates a higher quantile for the experimental group, and vice versa. As shown in the figure below:
The plot shows that the target group's distribution is consistently higher, especially near the median. This suggests that sharing by big Vs has a significant effect on outcome variables such as a paper's citation count.
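The construction of the Q-Q points can be sketched as below. This is a hedged illustration rather than the authors' implementation: the function name `qq_points` is hypothetical, and using `log1p` (so that papers with zero citations stay defined) and the population standard deviation are assumptions, since the text does not specify them.

```python
import math
import statistics


def qq_points(target_citations, control_citations):
    """Return (control_z, target_z) pairs for a two-sample Q-Q plot.

    Both samples are log-scaled, standardized with the CONTROL group's
    mean and standard deviation, sorted, and paired in order.
    """
    if len(target_citations) != len(control_citations):
        raise ValueError("one-to-one matched samples must have equal size")
    log_t = [math.log1p(c) for c in target_citations]
    log_c = [math.log1p(c) for c in control_citations]
    mu = statistics.mean(log_c)
    sigma = statistics.pstdev(log_c)
    if sigma == 0:
        raise ValueError("control group has zero variance")
    z_t = sorted((x - mu) / sigma for x in log_t)
    z_c = sorted((x - mu) / sigma for x in log_c)
    return list(zip(z_c, z_t))
```

Plotting the pairs with the control value on the x-axis and the target value on the y-axis puts a point above the diagonal whenever the target group's quantile is higher.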
In addition, the authors used statistical tests such as the Epps-Singleton (ES), Kolmogorov-Smirnov (KS), and Mann-Whitney U (MWU) tests to establish the statistical significance of this difference; all p-values were well below the strict α = 0.001 threshold. As shown in the table below:
These tests showed a significant difference in the distribution between the experimental and control groups.
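For readers who want to run the same three tests on their own data, here is a minimal sketch using SciPy. The helper name `compare_distributions` and the synthetic log-normal samples are assumptions for illustration only; they are not the paper's data, and the two-sided alternative for the Mann-Whitney U test is likewise an assumption.

```python
import numpy as np
from scipy import stats


def compare_distributions(target, control, alpha=0.001):
    """Run ES, KS and MWU two-sample tests.

    Returns a dict mapping test name -> (p-value, p-value < alpha).
    """
    p_values = {
        "ES": stats.epps_singleton_2samp(target, control).pvalue,
        "KS": stats.ks_2samp(target, control).pvalue,
        "MWU": stats.mannwhitneyu(
            target, control, alternative="two-sided"
        ).pvalue,
    }
    return {name: (p, bool(p < alpha)) for name, p in p_values.items()}


# Synthetic citation-like samples (NOT the paper's data): the target
# group is shifted upward on the log scale.
rng = np.random.default_rng(0)
control = rng.lognormal(mean=2.5, sigma=1.0, size=500)
target = rng.lognormal(mean=3.2, sigma=1.0, size=500)

for name, (p, significant) in compare_distributions(target, control).items():
    print(f"{name}: p={p:.2e}, significant={significant}")
```

The three tests look at different things: KS at the largest gap between the empirical distribution functions, MWU at rank ordering, and ES at the empirical characteristic functions, so agreement across all three is stronger evidence than any single test.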
While acceptance at top conferences (i.e., review scores) has traditionally been the primary indicator of future citations, the study shows that the influence of big Vs' sharing on a paper should not be underestimated. This also reflects a shift in how the community discovers and reads papers.
Given the American background of AK and Aran Komatsuzaki, the authors wonder if what they share is geographically biased.
Changes in the number of papers published by country
The authors first looked at the number of AI publications by country, referring to the geographical distribution of AI repository publications in the Stanford HAI 2023 AI Index report. As shown in the figure below:
It can be seen that the United States' share of AI publications has decreased slightly, which may indicate that the field of artificial intelligence is maturing and that research is increasingly spread around the world. Meanwhile, the EU and the UK began to show moderate growth after a sustained decline from 2010 to 2017, while China's share continued to rise.
Geographic statistics for papers shared by influencers
The authors used Semantic Scholar and DBLP to collect affiliation data for all authors listed on each paper in the target set. They then used the Nominatim geocoding API to find the approximate latitude and longitude of each affiliation, and manually corrected obviously inaccurate coordinates using publicly available addresses. From these coordinates, they performed reverse geocoding with Nominatim to find the country of each affiliation, and then assigned a country to each publication by majority voting. The result is shown in the following figure:
Geographic heat map of the authors of the papers shared by the influencers, showing the distribution of their unique institutions.
From the image above, we can see that the two influencers shared papers from all over the world, with the United States and Europe especially well represented.
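The final majority-voting step, which assigns one country to each publication, can be sketched as follows. The function name `assign_country` and the use of `None` for affiliations that could not be geocoded are assumptions for illustration; the geocoding calls themselves are omitted.

```python
from collections import Counter


def assign_country(affiliation_countries):
    """Pick the most common country among a paper's author affiliations.

    Entries that are None or empty (affiliation not resolved) are
    ignored; if nothing remains, the paper is labeled "unknown".
    Ties are not handled specially in this sketch.
    """
    known = [c for c in affiliation_countries if c]
    if not known:
        return "unknown"
    return Counter(known).most_common(1)[0][0]
```

For example, a paper whose authors resolve to the United States, China, and the United States would be assigned to the United States, while a paper with no resolvable affiliation falls into the "unknown" category.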
Trends in the papers shared by influencers
Finally, the authors aggregate individual countries into the same geographic areas used in the HAI report and use a similar format for mapping.
The sharing patterns of the influencers from 2018 to 2021 are markedly different from the global trends in paper publication.
Specifically, the publications shared by AK show a sharp decline in the "unknown" category and a dramatic rise in the US share. This seems to indicate an improvement in affiliation reporting, rather than a change in AK sharing habits, as the share from other regions is relatively stable.
Komatsuzaki's data shows a continued focus on papers with U.S. affiliations, and it was only later that other geographic regions began to appear.
In general, while the global landscape of AI publications suggests increased diversity and a more even distribution of research output, the influencers' data is skewed in favor of the United States.
In addition, the authors note that the statistics are incomplete: using only the affiliations shown on the papers can inherently skew the results toward the United States. For example, many researchers affiliated with multinational organizations are assigned to the United States (where the headquarters are located), even though they work in a branch office in another region. It is also important to note the prominence of the "unknown" category in both influencers' data, which covers papers for which no affiliation was found.
Gender diversity is crucial in computer science and engineering, fields that have historically been dominated by men.
First, in order to understand the overall gender distribution in this field, the authors refer to the gender distribution of Ph.D. recipients and faculty in computer science and related fields in the United States as reported in the 2021-2022 Taulbee Survey.
The authors then took only the first author of each paper and applied the AMiner Scholar Gender Prediction API, which classifies authors as "male", "female", or "unknown" based on their name and, if available, their affiliation.
The results showed that in the @akhaliq dataset, the ratio of males to females was 80:20 among authors whose gender could be identified, while in the @arankomatsuzaki dataset, the ratio was 81:19.
These ratios roughly match the 77:23 ratio among computer science Ph.D. recipients reported by the Taulbee survey, and slightly deviate from the 76:24 ratio among faculty members.
This indicates that the number of female researchers is increasing, but there is still a large gap with the number of male researchers.
It can be seen that big Vs on social media really matter in AI and ML research. By sharing research papers, they make them more visible. The study found that papers shared by big Vs receive 2-3 times as many citations as the others! This shows that the big Vs don't just share good papers; they also help everyone notice and pay attention to important research results. Their promotional power is really strong!
But there are a few things worth considering:
There is now so much information that nobody can keep up with the papers posted to arXiv every day, and these big Vs help us pick out the genuinely interesting and important work in AI and ML so that everyone can follow and access it more easily. Still, listening only to them can make us miss out on other good work. So we need a diverse, competitive academic environment in which everyone can see more research and more ideas.
Big Vs on social media are becoming more and more influential in the AI and ML academic community. This means we may need to reconsider how papers are selected and how they are judged. It is hoped that conferences and academic institutions will keep up with this change and improve their systems and processes to ensure that high-quality research is seen and disseminated by all.
The big Vs on social media have really helped more people see research in the ML field. But the analysis in this paper found that most of the papers they shared came from the United States. While this reflects U.S. leadership in AI and ML, we should also see research from other countries.
In addition, the ratio of men to women in the ML field is not balanced. While there is no obvious gender bias in the content shared by the influencers, this gap is a reminder that we need to keep working to increase gender diversity in the field.
Nowadays, social media and academic research are becoming more and more closely linked in AI and ML. From an author's point of view, if you want to expand the influence of your work, you can also consider promoting it on social media after posting it to arXiv. After all, in this age of information overload, "even good wine fears a deep alley"!
You are also welcome to share your interesting work on "Xi Xiaoyao Technology Talk".