§14.1
Case Study: Trump Tweet Source Classification
This case starts with a famous Washington Post observation: when the Trump account wished teams good luck, the tweet often came from an iPhone; when it attacked rivals, it often came from Android. That makes it a strong teaching case for classical NLP: the unit of analysis is small (a single tweet), the label is concrete, and the interpretation is tempting enough to be dangerous.
The file contains 5,653 tweets from January 1, 2016 through October 17, 2017. Each row has a timestamp, tweet text, and a source label: Android or iPhone. The context document frames the exercise as feature extraction plus classification: one source is treated as Trump, the other as a surrogate using the same handle.
The point of the case is not to turn NLP into political gossip. The point is to show how language, metadata, and a transparent baseline model can build an authorship-style fingerprint while still keeping the caveats visible.
The Executive Question
Can tweet text and simple metadata distinguish Android-labelled tweets from iPhone-labelled tweets, and what does that distinction actually mean?
The careful version: the model can classify source labels. It cannot prove who physically typed a given tweet, and it should not ignore that campaign operations changed over time.
Start With the Source Regime
The full corpus continues into 2017, but the main classifier uses the campaign window: January 1, 2016 through November 8, 2016. That restriction matters because the device mix changes after the election and eventually becomes an all-iPhone stream. A classifier trained across that shift could score well simply by learning which period a tweet belongs to, and a model that quietly learns a period regime is less useful as an authorship lesson.
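The window restriction can be sketched as a simple date filter. This is a minimal illustration, not the case's actual pipeline: the `YYYY-MM-DD HH:MM` timestamp format is inferred from the example tweets in this section, the window is assumed to include all of election day, and the rows shown are hypothetical placeholders rather than records from the file.

```python
from datetime import date, datetime

# Window bounds quoted in the text; treating both ends as inclusive is an assumption.
WINDOW_START = date(2016, 1, 1)
WINDOW_END = date(2016, 11, 8)


def in_campaign_window(timestamp: str) -> bool:
    """Return True if a 'YYYY-MM-DD HH:MM' timestamp falls inside the campaign window."""
    day = datetime.strptime(timestamp, "%Y-%m-%d %H:%M").date()
    return WINDOW_START <= day <= WINDOW_END


# Hypothetical rows in (timestamp, text, source) form, for illustration only.
rows = [
    ("2016-07-06 04:36", "example campaign tweet", "Android"),
    ("2016-11-08 23:10", "example election-day tweet", "iPhone"),
    ("2017-03-01 09:00", "example post-election tweet", "iPhone"),
]

# Keep only the tweets the main classifier is allowed to see.
campaign_rows = [row for row in rows if in_campaign_window(row[0])]
print(len(campaign_rows))  # 2
```

Filtering before any feature extraction keeps post-election, all-iPhone tweets out of both the training and the held-out sets.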
Device labels define two different communication streams
The first lesson is metadata-first. The timestamp and device label are not afterthoughts; they determine what the text task is allowed to claim.
| Decision | Case choice | Analytical implication |
|---|---|---|
| Document | One tweet from the @realdonaldtrump handle | The model classifies source at the tweet level, not at the day, topic, or account level. |
| Label | Android versus iPhone source label | The label is a device/source proxy. It is not direct observation of who typed every word. |
| Features | Tokens, bigrams, URLs, hashtags, mentions, punctuation, and timing cues | The analysis shows how ordinary text features combine with metadata to create a fingerprint. |
| Evaluation | 75/25 stratified held-out split, seed 11; 894 held-out tweets | A held-out score checks whether the fingerprint generalizes beyond the examples we read. |
| Interpretation | Classification evidence, not authorship proof | A source classifier can support an audit story, but it cannot settle intent or identity by itself. |
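The evaluation row can be illustrated with a small stratified split. The sketch below is an assumption-laden stand-in: it uses only the standard library, so seed 11 here will not reproduce the same partition as whatever tool produced the 894 held-out tweets, and the label list is hypothetical. The function name `stratified_split` is introduced for illustration.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.25, seed=11):
    """Return (train_idx, test_idx), holding out test_fraction of each label."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for i, label in enumerate(labels):
        by_label[label].append(i)

    train_idx, test_idx = [], []
    # Sample within each label so both sets keep the original class mix.
    for label in sorted(by_label):
        idx = by_label[label]
        rng.shuffle(idx)
        n_test = round(len(idx) * test_fraction)
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Hypothetical label mix, not the real class counts.
labels = ["Android"] * 40 + ["iPhone"] * 60
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))  # 75 25
```

Stratifying matters here because the majority-class baseline depends on the class mix; a held-out set with a different mix would make the comparison harder to read.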
What Features Carry the Fingerprint?
The strongest differences are not exotic. Android-labelled tweets use more combative cue words and more direct mentions. iPhone-labelled tweets carry more campaign-broadcast mechanics: URLs, hashtags, event language, thanks, and rally logistics. That is exactly what a manager should expect if two communication workflows share one public account.
The fingerprint is strongest when text and posting routine are read together
| Timestamp | Tweet |
|---|---|
| 2016-07-06 04:36 | Crooked Hillary Clinton is unfit to serve as President of the U.S. Her temperament is weak and her opponents are strong. BAD JUDGEMENT! |
| 2016-08-14 16:50 | Crooked Hillary Clinton is being protected by the media. She is not a talented person or politician. The dishonest media refuses to expose! |
| 2016-05-26 13:18 | The Inspector General's report on Crooked Hillary Clinton is a disaster. Such bad judgement and temperament cannot be allowed in the W.H. |
| 2016-01-28 17:32 | It is my great honor to support our Veterans with you! You can join me now. Thank you! #Trump4Vets https://t.co/UVn3kUd2DV |
| 2016-02-23 03:51 | Join me live- now in Las Vegas Nevada! We will MAKE AMERICA SAFE & GREAT AGAIN! #VoteTrumpNV #NevadaCaucus https://t.co/IW9s9noxDT |
| 2016-07-01 17:07 | Thank you for your support! We will MAKE AMERICA SAFE AND GREAT AGAIN! #ImWithYou #AmericaFirst https://t.co/ravfFT5UBE |
This is also where preprocessing matters. A naive bag-of-words model treats #draintheswamp, join live, and crooked as equivalent tokens. They are not neutral: some represent campaign distribution, some represent rhetoric, and some represent the political target mix. The analyst has to decide whether those are legitimate source cues or leakage from the campaign calendar.
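One way to keep that decision visible is to extract each feature family separately instead of feeding raw text into a single bag of words. The sketch below is illustrative only: the regular expressions and the `extract_features` name are assumptions, not the case's actual preprocessing, and real tweets would need more careful handling of emoji, quoted retweets, and truncated links.

```python
import re

# Simple patterns assumed for illustration; production tokenizers are stricter.
URL_RE = re.compile(r"https?://\S+")
TOKEN_RE = re.compile(r"[#@]?\w+(?:'\w+)?")


def extract_features(text: str) -> dict:
    """Split one tweet into word tokens, bigrams, and distribution-style cues."""
    urls = URL_RE.findall(text)
    without_urls = URL_RE.sub(" ", text)
    tokens = [t.lower() for t in TOKEN_RE.findall(without_urls)]
    return {
        "tokens": tokens,
        "bigrams": list(zip(tokens, tokens[1:])),
        "hashtags": [t for t in tokens if t.startswith("#")],
        "mentions": [t for t in tokens if t.startswith("@")],
        "n_urls": len(urls),
        "n_exclaim": text.count("!"),
    }


tweet = (
    "Thank you for your support! We will MAKE AMERICA SAFE AND GREAT AGAIN! "
    "#ImWithYou #AmericaFirst https://t.co/ravfFT5UBE"
)
features = extract_features(tweet)
print(features["hashtags"])   # ['#imwithyou', '#americafirst']
print(features["n_urls"], features["n_exclaim"])  # 1 2
```

With the families separated, the analyst can fit the model with and without hashtags or URLs and see how much of the score depends on campaign mechanics rather than wording.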
A Transparent Baseline Is Enough to Be Useful
A simple Naive Bayes model trained on unigrams and bigrams reaches 79% held-out accuracy, versus a 53% majority-class baseline. That is strong enough to show a real source signal and modest enough to keep the interpretation honest.
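The comparison between a Naive Bayes model and a majority-class baseline can be sketched in a few lines. This is a toy multinomial Naive Bayes with add-one smoothing written from scratch; it is not the case's implementation, the whitespace tokenizer is a simplification, and the training and test tweets are invented, so the printed scores say nothing about the 79% and 53% figures.

```python
import math
from collections import Counter


def ngrams(text):
    """Lowercased unigrams plus adjacent-word bigrams."""
    tokens = text.lower().split()
    return tokens + [" ".join(pair) for pair in zip(tokens, tokens[1:])]


class NaiveBayes:
    """Multinomial Naive Bayes with Laplace (add-alpha) smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.feature_counts = {c: Counter() for c in self.class_counts}
        for text, label in zip(texts, labels):
            self.feature_counts[label].update(ngrams(text))
        self.vocab = set().union(*self.feature_counts.values())
        return self

    def predict(self, text):
        n_docs = sum(self.class_counts.values())
        scores = {}
        for c, counts in self.feature_counts.items():
            total = sum(counts.values()) + self.alpha * len(self.vocab)
            score = math.log(self.class_counts[c] / n_docs)  # class prior
            for f in ngrams(text):
                if f in self.vocab:  # unseen features are ignored
                    score += math.log((counts[f] + self.alpha) / total)
            scores[c] = score
        return max(scores, key=scores.get)


# Invented toy data, for illustration only.
train_texts = ["crooked media weak", "dishonest media weak", "weak opponents",
               "join me live", "thank you join me"]
train_labels = ["Android", "Android", "Android", "iPhone", "iPhone"]
test_texts = ["media weak", "join me", "crooked weak"]
test_labels = ["Android", "iPhone", "Android"]

model = NaiveBayes().fit(train_texts, train_labels)
predictions = [model.predict(t) for t in test_texts]
accuracy = sum(p == y for p, y in zip(predictions, test_labels)) / len(test_labels)

# Majority-class baseline: always predict the most common training label.
majority = Counter(train_labels).most_common(1)[0][0]
baseline = sum(y == majority for y in test_labels) / len(test_labels)
print(accuracy, round(baseline, 2))
```

The baseline is the number to beat: a model is only informative to the extent that its held-out accuracy exceeds what always guessing the larger class would achieve.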
A transparent baseline can recover the source label, but it is not an authorship oracle
Reader-facing cue terms: counts are term hits per 100 campaign-window tweets in each source.
The cue list is more important than the algorithm name. Android-labelled tweets are more likely to carry terms such as crooked, lyin, media, and weak; iPhone-labelled tweets are more likely to carry campaign hashtags, links, join live, thank support, and event language. That contrast is the story. The classifier merely tests whether the contrast is stable enough to predict unseen tweets.
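A cue-term rate of this kind can be computed directly. The sketch below assumes that a "hit" means a tweet containing the term at least once, which is one plausible reading of "term hits per 100 tweets"; the `hits_per_100` helper and the sample tweets are hypothetical.

```python
import re


def hits_per_100(texts, term):
    """Tweets containing the term at least once, per 100 tweets."""
    if not texts:
        return 0.0
    # Match the term as a whole word or phrase, ignoring case.
    pattern = re.compile(r"(?<!\w)" + re.escape(term.lower()) + r"(?!\w)")
    hits = sum(1 for text in texts if pattern.search(text.lower()))
    return 100.0 * hits / len(texts)


# Invented examples standing in for each source's campaign-window tweets.
android_texts = ["Crooked opponent is weak", "The media is dishonest",
                 "Crooked media again", "Great crowd tonight"]
iphone_texts = ["Thank you for your support", "Join me live tonight",
                "Thank you Ohio", "Join me live in Nevada"]

for term in ["crooked", "media", "join me live"]:
    print(term, hits_per_100(android_texts, term), hits_per_100(iphone_texts, term))
```

Normalizing to a per-100 rate within each source keeps the comparison fair when one source has more tweets than the other.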
What the Case Teaches
The responsible interpretation has three layers:
- A source fingerprint exists. Text and posting features distinguish Android-labelled tweets from iPhone-labelled tweets in the campaign window.
- The fingerprint is operational. The iPhone stream looks more like a campaign broadcast channel; the Android stream looks more like direct commentary and attack language.
- The label is a proxy. Device source is not a sworn authorship record. It reflects tools, staff workflows, time periods, and campaign communication routines.