I have 75k Twitter accounts.
I am looking for the following data on each of them.
Category Feature
1 - Profile Commonality between screen name and user names
1 - Profile Creation Date
1 - Profile Description / Bio
1 - Profile Display Name
1 - Profile Is Profile Picture Egg (Yes/No)
1 - Profile Is Profile Picture Human? (Yes/No)
1 - Profile Is Profile Picture Stock Image (yes/no)
1 - Profile Number of Sources (mobile, computer, null)
1 - Profile Primary Language
1 - Profile Handle (@name)
1 - Profile Twitter User ID
2 - Bio / Description Does Description have a URL? (Yes/No)
2 - Bio / Description If so, does the description URL have a clone elsewhere?
2 - Bio / Description Average Word Length
2 - Bio / Description Contains URL
2 - Bio / Description Correlation with a NLP Program
2 - Bio / Description Length
2 - Bio / Description Number of Words
2 - Bio / Description Score - ARI (Automated Readability Index)
2 - Bio / Description Score - Coleman Liau index
2 - Bio / Description Score - Dale-Chall Score
2 - Bio / Description Score - Flesch Kincaid Grade level
2 - Bio / Description Score - Flesch Reading Ease
2 - Bio / Description Score - Linsear Write Formula
2 - Bio / Description Score - SMOG
3 - Activity URL Is Shortened? (yes/No)
3 - Activity # of Posts
3 - Activity # of Retweets
3 - Activity # of Tweeting @'s
3 - Activity % of Tweets Geo-enabled
3 - Activity Ave. # of Hashtags in Tweets
3 - Activity Ave. # of Links in Tweets
3 - Activity Ave. # of Special Characters in Tweets
3 - Activity Ave. # of User Mentions in Tweets
3 - Activity Average Duration between a tweet being posted and this user re-tweeting it, for all retweets (in minutes)
3 - Activity Average Duration between a tweet being posted and this user re-tweeting it, for the 10 fastest re-tweets (in minutes)
3 - Activity Average Duration between a tweet being posted and this user re-tweeting it, for the 3 fastest re-tweets (in minutes)
3 - Activity Average Tweets / Day Since Creation Date
3 - Activity Distribution of Tweets Per Hour
3 - Activity Longest No-Tweet Duration (In Days)
3 - Activity Most Compact Number of Tweets per Hour
3 - Activity Number of Languages
3 - Activity Percentage of tweets ending with punctuation, hashtag, or link
3 - Activity Number of Events / Hour Distribution - Standard Deviation
3 - Activity Number of Events / Hour Distribution - Skew
3 - Activity Number of Events / Hour Distribution - Kurtosis
3 - Activity Sentiment Score
3 - Activity Time from Last Tweet (In Days)
3 - Activity # of Followers
3 - Activity # of Following
3 - Activity # of Likes
3 - Activity Category of website Linked to
4 - Similarity Number of known bots followed by a user - a user following several known bots is more likely to be a bot.
4 - Similarity Number/Percentage of bots in the cluster that a user belongs to - if a clustering algorithm places the user in a cluster with many bots, the user is more likely to be a bot.
4 - Similarity PageRank and betweenness centrality of users in both retweet and mention networks
4 - Similarity Similarity of Profile to Known Bots
4 - Similarity Variables related to star and clique networks associated with users
5 - Outcome Is a Bot? (Yes / No)
5 - Outcome Bot Type (Spambots, Paybots, Influence Bots)
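To make a few of the features above concrete, here is a minimal Python sketch of how three of them might be computed: commonality between handle and display name, the ARI score of a bio, and the standard deviation / skew / kurtosis of the tweets-per-hour distribution. The function names are hypothetical, the string-similarity ratio is only one possible way to measure "commonality", and the word and sentence splitting for ARI is deliberately naive. Treat it as an illustration, not as the required method.

```python
import re
from datetime import datetime
from difflib import SequenceMatcher
from statistics import mean, pstdev


def name_commonality(handle, display_name):
    """Similarity in [0, 1] between handle and display name.

    Assumption: a character-level similarity ratio is an acceptable
    stand-in for "commonality"; other measures are possible.
    """
    a = re.sub(r"[^a-z0-9]", "", handle.lower())
    b = re.sub(r"[^a-z0-9]", "", display_name.lower())
    if not a or not b:
        return 0.0
    return SequenceMatcher(None, a, b).ratio()


def automated_readability_index(text):
    """ARI = 4.71 * (chars / words) + 0.5 * (words / sentences) - 21.43."""
    words = text.split()
    if not words:
        return None  # empty bio: score is undefined
    chars = sum(ch.isalnum() for ch in text)
    # Naive sentence split on ., ! and ?; count at least one sentence.
    parts = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    sentences = max(1, len(parts))
    return 4.71 * chars / len(words) + 0.5 * len(words) / sentences - 21.43


def hourly_distribution_stats(timestamps):
    """Std, skew and excess kurtosis of tweet counts over the 24 hours."""
    counts = [0] * 24
    for ts in timestamps:
        counts[ts.hour] += 1
    mu = mean(counts)
    sd = pstdev(counts)  # population standard deviation
    if sd == 0:
        # Perfectly flat distribution: higher moments are undefined, report 0.
        return {"counts": counts, "std": 0.0, "skew": 0.0, "kurtosis": 0.0}
    z = [(c - mu) / sd for c in counts]
    return {
        "counts": counts,
        "std": sd,
        "skew": mean([v ** 3 for v in z]),
        "kurtosis": mean([v ** 4 for v in z]) - 3.0,
    }
```

For example, `name_commonality("john_smith", "John Smith")` gives 1.0 because both reduce to the same string once case and punctuation are ignored. The image, sentiment, and network features would need external data and libraries and are not covered here.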
This is important, but not time sensitive. The right data miner / analyst will be given a few weeks to work on the job.
The final output for this is 2-fold.
1) A Google spreadsheet containing this info for all ~75k accounts.
2) A web-based tool where I can upload a CSV file or paste in a list of Google IDs OR Usernames to get this data for the identified accounts.
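As a rough illustration of the input side of the tool in item 2, the Python sketch below splits a pasted list into numeric IDs and usernames. The function name `split_identifiers` is hypothetical, and it assumes that IDs are purely numeric and that usernames are at most 15 letters, digits, or underscores; an all-digit username would be misread as an ID under this rule.

```python
import re

# Assumption: usernames are 1-15 characters of letters, digits, underscore.
HANDLE_RE = re.compile(r"^[A-Za-z0-9_]{1,15}$")


def split_identifiers(raw):
    """Sort pasted tokens into numeric IDs, usernames, and rejects."""
    ids, usernames, rejected = [], [], []
    # Accept commas, semicolons, spaces, and newlines as separators.
    for token in re.split(r"[\s,;]+", raw.strip()):
        token = token.lstrip("@")
        if not token:
            continue
        if token.isascii() and token.isdigit():
            ids.append(token)
        elif HANDLE_RE.match(token):
            usernames.append(token)
        else:
            rejected.append(token)
    return {"ids": ids, "usernames": usernames, "rejected": rejected}
```

The same function could be applied to one column of an uploaded CSV file; reporting the rejected tokens back to the user makes typos in the pasted list easy to spot.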
I have a strong opinion about what tools you use to build this solution.
About the recruiter: Lance Hirahon, from Jalisco, Mexico. Member since Sep 14, 2017.