Tuesday, 10 May 2016

Voting, New and Old

Conspicuous by My Absence

I regret to announce that the long anticipated third installment of my analysis for the American primaries will not be provided here. Unfortunately, a quick look at the statistics did not really promise much of interest (not that a non-result has ever stopped me before). Additionally, with the Republican race now down to one candidate, it's unlikely there will be enough write-ins to dislodge Donald J Trump from the GOP candidacy. The Clinton-Sanders race is looking ever more decided in favour of the former, but there is still life in that contest; the only meaningful analysis on that, however, comes from my previous post on racial diversity:

Since then the following races have been decided

Burning through 6 of the 8 tossups and delivering an 11/14 (or 78.6%) accuracy rating. That's not too shabby considering that data is purely based on the racial make up of likely voters.

We will return to the 2016 United States Presidential Elections later this year, after the Australian Federal Election, the Brexit Vote and the territorial elections for both the NT and ACT.

But First

Other electoral news around the world this week includes Sadiq Khan taking office as Mayor of London on Sunday and the Philippines electing Rodrigo Duterte as president on Monday. In neither election did the former incumbent run; former London Mayor Boris Johnson did not run due to his election to the House of Commons, while former President of the Philippines Benigno Aquino III did not run due to exhausting his term limits.

Sadiq Khan is noteworthy for being the first London Mayor to be a Muslim. Rodrigo Duterte is noteworthy for being a lawyer who as Mayor of Davao City encouraged (and has been suggested to have been directly involved in) vigilante death-squads who execute criminals (or assumed criminals) without trial or due process. So, kind of an outsourcing Daredevil but without the blindness.

New Voting Measures

So, before we do any real electoral prediction or information this year, it's worth looking at the new voting rules for the Upper House. As you may (or may not – the AEC has not been publicising this nearly as much as I would have thought) know, the above- and below the line voting methods have changed.
Many people think they can vote as they used to, with the added bonus of being able to vote for multiple parties above the line. I thought this briefly too. THIS IS NOT THE CASE. (Except that it is – see the next section)

According to the latest amendments to the Commonwealth Electoral Act 1918, you must vote above the line by “writing at least the numbers 1 to 6 in the squares... printed on the ballot paper above the line” (s 239(2)(a)) or below the line by “writing at least the numbers 1 to 12 in the squares printed on the ballot paper below the line” (s 239(1)(a)).

So by law you must number AT LEAST 6 boxes above the line, or AT LEAST 12 boxes below the line. Below the line voting will work as per usual, but your vote exhausts once you stop numbering. This makes voting below the line far less tedious and, if you make an error (by skipping or repeating a number, for example), your vote will still be valid until the error, at which point it is exhausted. Therefore, your vote will go to your #1 candidate until she is (a) elected by reaching the quota or (b) excluded for having too few votes. At this point your vote goes in full (in the latter case) or in part (by your share of the excess of votes beyond the quota) to your second choice, and so on until you run out of numbers.

Voting above the line assumes you are numbering each candidate in the associated column from top to bottom (s 272):

As such, you are no longer allowed to vote above the line by marking a single box. You must mark at least six (except you don't – see the next section). If you were part of the 5% who voted below the line, though, your old method of voting by marking every box is valid, but not entirely necessary.

Since the burden of marking 6 boxes above the line is not drastically greater than 12 below the line, you might question why we keep both methods. Below the line gives you the option to skip certain candidates – for example in the last election in South Australia, there was outrage that Don Farrell would appear higher on the Labor list than Penny Wong, or that Cory Bernardi appeared on the Liberal ticket at all. In both cases, voters could have placed the unpopular candidates lower in their preferences than their second-placers.
Above the line voting allows you to prioritise the parties a little quicker if you don't care about individual candidates, but is otherwise now largely redundant.

Below the line voting also allows some manipulation of the overflow mechanics, if you accept the assumption that the majority of voters number from top to bottom. This may not be the case under the new system, where the easier below the line vote and harder above the line vote may see the usual 5:95 split change.
If we accept, however, that most votes go north to south, you may wish to number candidates in a party from south to north. This is because if you favour a candidate that is elected as a result, a portion of your vote overflows. If you vote for a loser, your whole vote overflows.

Consider a party fielding 2 candidates as your preferred party, and you don't know enough about the individuals to choose. Let's assume the quota for getting elected is 1000 votes. If you number them top to bottom, and the top candidate gets elected by 1001 votes, 1/1001th of your vote flows to candidate two. 1/1001th of every other vote also overflows. Most will go to candidate two, but maybe some will not. This gives candidate two less than one whole vote from the overflow. Worse, if candidate 1 is not elected, candidate two will have dropped out even earlier.

If, however, you number the party from bottom to top, and candidate 1 still gets elected, you ensure a full vote goes to the second candidate. If the first candidate is not elected, sooner or later the second will be eliminated and your vote will go to candidate 1 anyhow. Thus you have put that candidate at no real disadvantage.
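As a rough sketch of the arithmetic in this example (a simplification of the two-candidate scenario above, not the AEC's full counting procedure; the function name is made up for illustration), the fraction of each ballot that flows on from an elected candidate is the surplus divided by that candidate's total:

```python
from fractions import Fraction

def transfer_value(votes_for_elected: int, quota: int) -> Fraction:
    """Share of each ballot passed on once a candidate exceeds the quota."""
    surplus = votes_for_elected - quota
    if surplus < 0:
        raise ValueError("candidate has not reached the quota")
    return Fraction(surplus, votes_for_elected)

# Quota of 1000, candidate one elected on 1001 votes (figures from the example).
passed_on = transfer_value(1001, 1000)
print(passed_on)  # 1/1001
```

A ballot numbered bottom to top is sitting with candidate two at full value at that point, rather than arriving as this sliver.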

This also applies for the parties you dislike, when numbering all the way down the ticket. If you let your vote exhaust, you're saying you've reached the extent to which you care and the rest of the parties are of equal value to you. If, however, you dislike one party more than the others, you want to prevent your vote from exhausting, otherwise you have no say in whether that disliked party beats the others. Here, again, numbering from the bottom gives your less-hated parties an advantage over the more-hated ones.

This all assumes the trend of numbering from top to bottom remains. While the ease of below the line voting may undermine this slightly, most below the line voters will probably still vote top-to-bottom anyhow without thinking it through, and any above the line voters certainly will. But even if the candidates are numbered randomly by other voters, you're not actually disadvantaged. Only if the technique of voting south to north becomes popular will this strategy lose its power. That's your vote and your gamble.

But That's Not the Whole Truth

It is true that under s 239 of the Commonwealth Electoral Act 1918 you must at least mark 6 boxes above the line or 12 boxes below the line. And that's what all the information you get from the AEC will tell you and, for the latter rule, what paragraph 268(1)(b) will insist.

However, under paragraph 269(1)(b) “[a] ballot paper in a Senate election is not informal under paragraph 268(1)(b) if... the voter has marked the number 1, or the number 1 and one or more higher numbers, in squares printed on the ballot paper above the line.”

In other words, although you are required by law to number six boxes when voting, your vote will still be valid so long as you mark at least one. You would be in breach of s 239 and thus breaking the law by voting in this way, so I'm not advocating this method, but your vote would technically still count.

Similarly, for below the line voting, paragraph 268A(1)(b) states “[a] ballot paper in a Senate election is not informal under paragraph 268(1)(b) if ... the voter has consecutively numbered any of those squares from 1 to 6 (whether or not the voter has also included one or more higher numbers in those squares).”

So by law you must mark at least 6 boxes above the line or 12 below, but to cast a formal vote, you only need one box above the line or six below. To put it another way, you are legally required to cast a hyper-valid vote.

Why is this? A combination of two reasons. One mathematical, one pragmatic and forgiving. Mathematically, they want everyone to mark at least 12 candidates (a party above the line will have at least 2 candidates) so that even in a double dissolution election (like this one) and even if everyone votes identically, 12 senators can be picked. If everyone cast only one vote, and they all voted identically, there'd be one candidate elected and no way to fill the remaining 11 seats (or 5 in a normal election).

That's why the law requires you to fill that many boxes. The reason a vote with fewer will still count is the pragmatic one. People make mistakes, especially when we change the system on them. Occasionally someone'll skip a number or write one twice. But now that we're exhausting votes, we can more easily give effect to their wishes as far as they are clear. For example, if someone intends to number six boxes above the line, but actually numbers two boxes as “3”, the vote will still be valid for the first two boxes. There have always been these redundancies. If you use a tick or a cross on a Senate paper, it's read as a “1” (ss 268A(2)(a), 269(1A)(a)) and if you leave a square blank on the House of Representatives form, but otherwise number the squares consecutively, that unmarked square will be read as the last number in the sequence (s 268(1)(c)).
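The "valid until the error" idea can be sketched in a few lines. This is only a simplified reading of the rules as described in this post (it ignores ticks and crosses, and the function names are hypothetical), not how the AEC actually processes ballots:

```python
def usable_preferences(marks):
    """Count how many preferences can be read before the first error.

    `marks` holds the number written in each square (None for a blank).
    Preference k counts only if k appears exactly once and every
    number below it was also usable.
    """
    written = [m for m in marks if m is not None]
    k = 0
    while written.count(k + 1) == 1:
        k += 1
    return k

def is_formal(marks, above_the_line):
    # Savings thresholds as described in the post: 1 above, 6 below.
    needed = 1 if above_the_line else 6
    return usable_preferences(marks) >= needed

# Six boxes intended, but "3" written twice: only 1 and 2 survive.
ballot = [1, 2, 3, 3, 4, 5]
print(usable_preferences(ballot), is_formal(ballot, above_the_line=True))
```

The same marks would fall short below the line, where six clean preferences are needed before the vote is saved.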


As soon as the upper house ballots are finalised we'll be doing a run-down of the parties on your state or territory ticket and what they stand for. Other than that, and the standard prediction methods of comparing swings to margins, most of the upcoming material is a surprise (even to me) and will depend on my available spare time. Oh, and colourable maps. I'll try and get those done too, for those playing along at home on July 2.

Saturday, 12 March 2016

How Race may Shape the Race

Next in our 3-part series of presidential primaries analysed by demographics, we'll look at race. Again, these are based on the 2012 data linked to in this post despite the fact that 2014 data is available here. This is partly so we can look at the voting data from the 2012 presidential campaign rather than the 2014 mid-terms, but mostly because I've set up my spreadsheets with the 2012 data and it'd take numerous hours to unpick the restructured lists, add in the 2014 data and redo all the graphs. Basically I'm lazy.

The data we're using lets us look at the racial profile of registered voters as well as actual voters. Race is broken down into White, Black, Asian and Hispanic. For Hispanic data, we are using the Hispanic (not White) numbers to prevent double-counting individuals. We are ignoring the other racial categories, including various mixed race categories, for simplicity. This analysis is too blunt by far for such nuanced factors to be reliably included.

There is a lot of discussion on the Republican data here, and no real useful conclusion. The Democratic summary below is more succinct and has some interesting data. And then there's a TL;DR summary at the bottom if you can't even stomach that.

Registered Voter Race in the Republican Primaries

I'd like to start with the Hispanic data for registered voters, because it raises some methodological questions. Here is the graph of Trump, Cruz and Rubio support against Hispanic voter registration:

Line of best fit (Trump): y = -0.25x + 43.89
Line of best fit (Cruz): y = 0.28x + 33.20
Line of best fit (Rubio): y = -0.06x + 22.96

It is worth noting to begin with that Ted Cruz, as a sitting senator for Texas, has a home-state advantage in the Texan primary. Rubio has the same senatorial advantage in Florida, and Kasich (not considered below) a gubernatorial one in Ohio, though both of these contests are this week and so are not yet mixed into the data. Trump has no such incumbency advantage anywhere, for obvious reasons.

This is relevant because Texas is the most outlying data with 24.67% of registered voters in 2012 being recorded as Hispanic. Because this data is on the extreme of this graph it has a lot of leverage power on the line of best fit. If we ignore Texas as an outlier, the graph looks like this:

Line of best fit (Trump): y = 0.54x + 41.88
Line of best fit (Cruz): y = -0.62x + 35.50
Line of best fit (Rubio): y = 0.10x + 22.56

This inverts every trend. With Cruz's Texan support gone his positive 0.28x slope drops 0.9 to -0.62x, while Trump gains 0.79 and Rubio the remaining 0.16 (these changes sum to +0.05 due to rounding), reversing both of their negative trends. Is this more accurate for predicting the remaining states?
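To see how a single far-out point can flip a trend like this, here is a minimal sketch of an ordinary least-squares fit. The points are hypothetical, chosen only to make the effect obvious; they are NOT the primary results behind the graphs, and the post does not say which tool produced its lines of best fit.

```python
def fit_line(points):
    """Ordinary least-squares slope and intercept for (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Hypothetical (percent Hispanic, percent support) pairs, not the post's data.
cluster = [(1, 5), (2, 4), (3, 3)]   # low-x states trending downwards
outlier = (10, 12)                   # one far-right point, playing the role of Texas

slope_with, _ = fit_line(cluster + [outlier])
slope_without, _ = fit_line(cluster)
print(round(slope_with, 2), round(slope_without, 2))
```

With the outlier the slope is positive; without it the same three points give a negative slope, which is the kind of inversion described above.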

Part of the problem is data volatility. If we remove Nevada, the other outlier, the lines shift dramatically again:
Line of best fit (Trump): y = 0.34x + 42.32
Line of best fit (Cruz): y = 0.09x + 33.92
Line of best fit (Rubio): y = -0.34x + 23.53

However, not even the remaining data-points are helpful here. Part of the problem is what I call agitated data. The data is spread so far from a nice linear progression that each point sits at a large divergence from the line and thus exerts a considerable angular pull. Once the stabilising influence of Texas and Nevada is removed, there's something of a free-for-all in the data pool, and removing a single datapoint can drastically shift the line of best fit:

Texas and Nevada certainly tie this data down by exerting a disproportional influence on the data, but that raises the issue of how accurate these two points are. A consistent line of best fit is not particularly meaningful if it is consistently wrong:

Lines of best fit missing various states and (darker) with all data points
Plotted without (left) and with (right) Texas and Nevada included
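The "drop one state and refit" exercise behind these plots can be sketched as a leave-one-out loop. The five states and their values here are invented for illustration (they are not the data in the graphs), and the helper name is an assumption:

```python
def slope(points):
    """Least-squares slope of the line of best fit through (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    return sxy / sxx

# Hypothetical states A-E; E sits far out on the x axis like an outlier.
states = {"A": (0, 2), "B": (1, 1), "C": (2, 3), "D": (3, 2), "E": (10, 9)}

full = slope(list(states.values()))
# How far the slope moves when each state in turn is left out.
shift = {
    name: abs(slope([p for other, p in states.items() if other != name]) - full)
    for name in states
}
most_influential = max(shift, key=shift.get)
print(most_influential, {k: round(v, 3) for k, v in shift.items()})
```

In this toy set the far-out state moves the slope much more than any of the clustered ones, which is the sense in which an extreme point ties the line down.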

Because outliers will vary from graph to graph (a state with an unusually large Hispanic population may have an Asian population close to the national median, for example) and excluding them raises issues of where we draw the cutoff, they will remain in the data set. The exception to this rule will be Texas, because of Ted Cruz's unique standing there. Where this has only a small impact on the data, Texas will be included in the graph and the linear equation for the chart without Texas will be provided in parentheses. Where there is a more dramatic impact (determined subjectively) a second graph will be provided.

Registered Voter Race in the Republican Primaries (For Real This Time)

So, to the Hispanic data:

Line of best fit (Trump): y = -0.25x + 43.89
Line of best fit (Cruz): y = 0.28x + 33.20
Line of best fit (Rubio): y = -0.06x + 22.96

Firstly, Rubio does not do anywhere close to as well as some commentators had suggested on the Hispanic vote. Because his parents immigrated from Cuba, many expected Rubio to perform well with this minority so often maligned by the Republican Party. In fact, we see a slight negative trend as Hispanic registration increases. Initially the corresponding Trump vote makes sense, after his anti-Hispanic comments early in the campaign, which leaves Cruz to claim the Hispanic lead.

Why? Texas. This is one of the graphs where Texas, as a major outlier in the Hispanic population bell curve, really shakes up the data. If we accept that Cruz did well in Texas due to name recognition and remove this state:

Line of best fit (Trump): y = 0.54x + 41.88
Line of best fit (Cruz): y = -0.62x + 35.5
Line of best fit (Rubio): y = 0.10x + 22.56

There, that looks... wait, what? Rubio has shifted from mildly negative to mildly positive. Fair enough. Take away Texas and Cruz's support among strong Hispanic states falls. Makes sense. But Trump with a greater than 1:2 slope in the positive?

It is important to remember that Republicans don't get a huge slice of the Hispanic vote, and given the primaries have an even lower turn-out, this is probably not the result of Hispanic voters supporting Trump. What it might be, as at least one broadcaster has suggested, is the swell of anti-Hispanic sentiment from non-Hispanics in states with a high Hispanic population.

In fact, as the White population increases at the expense of racial diversity and interracial tensions presumably decrease, Trump actually loses support:
 Line of best fit (Trump): y = -0.09x + 50.04        (Excluding Texas: y = -0.15x + 55.51)
Line of best fit (Cruz): y = -0.02x + 36.23        (Excluding Texas: y = 0.03x + 30.99)
Line of best fit (Rubio): y = 0.12x + 13.81        (Excluding Texas: y = 0.12x + 13.94)

The votes lost by Trump flow on in full to Rubio, which is interesting in that he is the most moderate candidate and the racially diverse option. I'm not entirely sure why Cruz doesn't benefit more from Trump's losses, but I have heard it suggested that Cruz would be the Trump of the Republican field if Trump were not the Trump of the Republican field. Perhaps that's relevant here.

What is also interesting is that this data is very stable. Whether by coincidence or an actual pattern, the only outlier (Hawaii) places its datapoints very close to where the linear equations would lead without them:

Line of best fit (Trump): y = -0.08x + 49.22
Line of best fit (Cruz): y = 0.01x + 33.03
Line of best fit (Rubio): y = 0.08x + 16.60

Even without the Hawaiian data to stabilise the lines, the data is reasonably consistent, with Trump trending neutral to negatively, Rubio neutral to positively and Cruz with slight trends positive or negative approximating a neutral polling.

While this data finally looks stable enough to hazard a prediction off of, it has a problem that even if 120% of registered voters were White - a mathematical impossibility - Trump would still win the state, followed by Cruz, then Rubio.

A look at the Black registered voters results in a similar situation:

Line of best fit (Trump): y = 0.20x + 40.49        (Excluding Texas: y = 0.22x + 40.95)
Line of best fit (Cruz): y = -0.12x + 35.70        (Excluding Texas: y = -0.13x + 35.19)
Line of best fit (Rubio): y = -0.11x + 23.98        (Excluding Texas: y = -0.10x + 24.06)

The data is scattered evenly enough for there to be no certain outliers, so the lines should be pretty stable. Trump does well, again probably not so much off the Black vote as off non-Black voters who regularly conflict with the Black community, and Rubio and Cruz both suffer as a result. Whether the Black population forms 0% of registered voters (we're looking at you Idaho) or 100%, Trump will finish first, increasingly ahead of Cruz and then Rubio. Of course, if the explanation above is correct, that the increase in Trump support comes from non-Black voters, this trend cannot hold: by 100%, Trump should have the bulk of his vote eroded. This is, at best, what may be termed a limited-range trend: the pattern holds reasonably reliably up to a point. Just as Newtonian models of motion work perfectly until speeds approach lightspeed, or physics tends to break down at "extremes" like black-holes, very low temperatures or the early universe, the rule isn't broken - it just applies in particular situations.

So, again we have a stable graph with poor predictive power (always giving the state to Trump, which we know is not always the case). This just notes the general observation that Trump is winning many states, with Cruz in second place and Rubio in third.

So that leaves the Asian data:
Line of best fit (Trump): y = 0.13x + 42.45        (Excluding Texas: 0.13x + 43.05)
Line of best fit (Cruz): y = 0.00x + 34.28        (Excluding Texas: 0.01x + 33.59)
Line of best fit (Rubio): y = -0.12x + 23.14        (Excluding Texas: -0.12x + 23.27)

Which gives us very volatile data anchored by Hawaii out on the extreme of the graph. Just out of interest, here's the graph without Hawaii:

Line of best fit (Trump): y = 0.28x + 42.22
Line of best fit (Cruz): y = -2.39x + 38.01
Line of best fit (Rubio): y = 2.09x + 19.69

Cruz nosedives immediately into the ground, with Rubio winning any state with more than around 12% of the registered voters being of Asian descent. Either the Asian population of the United States is the most influential political demographic discovered, or the data is too chaotic for a meaningful linear equation. Hawaii suggests the latter. So do I:

The variation in predicted Trump support after deleting one state or another certainly is eccentric, with the extremes for both him and Rubio coming from the exclusion of Minnesota and Nevada. However, the trend for Rubio is reasonably solid, with him not only leading but taking over 50% of the vote before the Asian population reaches 20% of the registered voters in all scenarios. More impressively, Cruz is consistently buried head-first in the ground by between 10% and 20%.

Perhaps, then, this shows a limited range trend as explained previously, which may only hold for populations where the Asian registered voters do not exceed around 5% or so. Beyond that other factors come into play or become exaggerated and alter the data. Though maybe not, given the minimal impact such a small population is likely to have on the state as a whole.

And even if that were the case, between 0% and 5% we, again, have Trump safely in first place, with Cruz normally outperforming Rubio. In other words, this method summarises the votes so far rather than predicting the outcome.

Another issue became evident in constructing that last graph. Although the nature of the data is such that Trump + Cruz + Rubio = 100% of the 3-candidate result, there is no accounting for negative numbers. In this extreme case where Cruz bottoms out quickly, we get absurdities like those shown for Minnesota: after around 17% of the registered voters are of Asian descent, both Trump and Rubio win over 50% of the vote at the same time. This is because although Trump + Rubio > 100%, the fact that Cruz is on a negative number of votes ensures that Trump + Cruz + Rubio = 100%.

In short, any form of these graphs where a candidate drops below 0% of the vote is inherently broken.
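A quick sanity check along these lines is easy to sketch. The Minnesota-excluded equations are not quoted in this post, so the example below uses the without-Hawaii lines for Asian registered voters given earlier; the function names are made up for illustration.

```python
# Lines of best fit (slope, intercept) quoted above for the Asian
# registered-voter graph without Hawaii.
LINES = {
    "Trump": (0.28, 42.22),
    "Cruz": (-2.39, 38.01),
    "Rubio": (2.09, 19.69),
}

def predicted_shares(x):
    """Evaluate each candidate's line y = m*x + c at x percent."""
    return {name: m * x + c for name, (m, c) in LINES.items()}

def is_plausible(shares):
    # A share outside 0-100 means the line has been pushed past its useful range.
    return all(0 <= share <= 100 for share in shares.values())

for x in (5, 17):
    shares = predicted_shares(x)
    print(x, {k: round(v, 2) for k, v in shares.items()}, is_plausible(shares))
```

At 5% all three shares are sensible; at 17% Cruz's line has already gone below zero, so the other two figures at that point should not be trusted either.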

2012 Voter Race in the Republican Primaries

Here is the data for the Republican primary race, but based on voters at presidential elections rather than registered voters. Whether this is closer to or further from the demographic attending primaries can only be speculated upon at this stage. However, the trends for White, Black and Asian voters are almost identical, so the distinction is minor at best in these cases:

 Line of best fit (Trump): y = -0.11x + 51.68        (Excluding Texas: -0.17x + 56.66)
Line of best fit (Cruz): y = -0.00x + 34.53        (Excluding Texas: 0.05x + 29.68)
Line of best fit (Rubio): y = 0.11x + 13.84        (Excluding Texas: 0.11x + 14.04)

Line of best fit (Trump): y = 0.20x + 40.47        (Excluding Texas: 0.21x + 40.90)
Line of best fit (Cruz): y = -0.12x + 35.73        (Excluding Texas: -0.13x + 35.26)
Line of best fit (Rubio): y = -0.10x + 23.99        (Excluding Texas: -0.10x + 24.05)

Line of best fit (Trump): y = 0.13x + 42.47        (Excluding Texas: 0.13x + 43.07)
Line of best fit (Cruz): y = 0.01x + 34.27        (Excluding Texas: 0.02x + 33.57)
Line of best fit (Rubio): y = -0.12x + 23.13        (Excluding Texas: -0.12x + 23.26)

Both the White and Asian data also respond the same way they did previously when the Hawaiian outlier was removed:

Line of best fit (Trump): y = -0.12x + 16.65
Line of best fit (Cruz): y = 0.05x + 52.07
Line of best fit (Rubio): y = 0.08x + 30.11

Line of best fit (Trump): y = 0.22x + 42.34
Line of best fit (Cruz): y = -2.69x + 38.30
Line of best fit (Rubio): y = 2.46x + 19.29

So all of the above explanations can be directly applied here: Trump doing well on high minority participation, possibly as a result of non-minority attitudes in those states, and poorer in states with a large white population; the Asian data predicting a quick game over for Cruz, then messing up the data beyond that point with negative values; and all models predicting a Trump victory in all states.

The only slightly different plot comes from the Hispanic data:

Line of best fit (Trump): y = -0.18x + 43.57
Line of best fit (Cruz): y = 0.19x + 33.61
Line of best fit (Rubio): y = -0.03x + 22.96

And even this is largely the same, with Rubio on a slight negative slope and Cruz negating Trump thanks to a Texan skew. The only real difference is that the intersection of Cruz and Trump occurs at ~25% rather than ~20%. With the Texan data removed, the data behaves just like the registered voter data: strong Trump gain, Rubio slightly positive and Cruz heavily negative to provide these gains.

Line of best fit (Trump): y = 0.72x + 41.48
Line of best fit (Cruz): y = -0.86x + 36.06
Line of best fit (Rubio): y = 0.17x + 22.40

In essence this data is pretty much the same as the first batch. The same issues and (lack of) conclusions follow.

Registered Voter Race and 2012 Votes in the Democrat Primaries

The Democrat race is far simpler than the Republican race for many reasons. For one thing, the fact that there are only two major candidates means that any loss for one candidate is a gain for another and vice versa. In fact, if the line of best fit for Clinton is given by y = mx + c, then the fit for Sanders will be y = -mx + 100-c. Furthermore, just as with the Republican charts, there is little significant distinction between the data for registrations and voter turnout.
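That mirror-image relationship is simple enough to check mechanically. A small sketch (the function name is hypothetical), tried on two of Clinton's lines quoted further down:

```python
def sanders_from_clinton(m, c):
    """If Clinton's line is y = m*x + c, Sanders's is y = -m*x + (100 - c)."""
    return -m, 100 - c

# Two of Clinton's (slope, intercept) pairs quoted later in the post.
for m, c in [(1.41, 36.43), (-1.15, 145.45)]:
    s_m, s_c = sanders_from_clinton(m, c)
    print(f"Clinton: y = {m}x + {c}  ->  Sanders: y = {s_m}x + {s_c:.2f}")
```

Both results match the Sanders equations listed beside those Clinton lines, which also means only one of each pair ever needs to be fitted.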

As far as possible outlier states, there are none visually in the data spread. However, Sanders is a senator from Vermont and Clinton used to be a senator for New York. Exactly how a former senator's state's backing compares to that of a recent senator is uncertain, but since New York has not held its primary yet we only have to exclude Vermont (which, incidentally, was a huge win for Sanders who gained 86% of the vote and all 16 delegates). Vermont's exclusion does not drastically change any of the graphs, but the line of best fit is recalculated in parentheses for its exclusion.

There has been a lot of talk about how well Clinton has been doing among the Black vote - much to the dismay of Sanders supporters who often cite that Sanders was arrested several times in the 60s for his civil rights activism. Here is the data for that much discussed Black vote:

 Line of best fit (Clinton): y = 1.41x + 36.43        (Excluding Vermont: 1.29x + 39.27)
Line of best fit (Sanders): y = -1.41x + 63.57        (Excluding Vermont: -1.29x + 60.73)

Line of best fit (Clinton): y = 1.36x + 36.34        (Excluding Vermont: 1.25x + 39.17)
Line of best fit (Sanders): y = -1.36x + 63.66        (Excluding Vermont: -1.25x + 60.83)

There are several nice things about these graphs. Firstly, there is a solid trend. You can see just looking at the scatter of plots that there is a strong correlation between Black voters and support for Clinton. But also, the lines actually intersect - in both graphs where Black people make up around 10% of the population. Which means there is actual potential for meaningful predictions here.
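The crossover point can be read off the equations rather than the graph: set the two lines equal and solve for x. A minimal sketch using the lines quoted above (the function name is an assumption):

```python
def crossover(line_a, line_b):
    """x at which two lines y = m*x + c meet; None if they are parallel."""
    (m_a, c_a), (m_b, c_b) = line_a, line_b
    if m_a == m_b:
        return None
    return (c_b - c_a) / (m_a - m_b)

# (slope, intercept) pairs for Clinton and Sanders in the two Black-vote graphs.
first_graph = crossover((1.41, 36.43), (-1.41, 63.57))
second_graph = crossover((1.36, 36.34), (-1.36, 63.66))
print(round(first_graph, 1), round(second_graph, 1))
```

Both values come out close to 10%. Below that point the lines favour Sanders, above it Clinton.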

The flip-side to this trend is evident in the White vote:

Line of best fit (Clinton): y = -1.15x + 145.45        (Excluding Vermont: -1.01x + 136.04)
Line of best fit (Sanders): y = 1.15x - 45.45        (Excluding Vermont: 1.01x - 36.04)

Line of best fit (Clinton): y = -1.18x + 147.66        (Excluding Vermont: -1.05x + 138.77)
Line of best fit (Sanders): y = 1.18x - 47.66        (Excluding Vermont: 1.05x - 38.77)

I cannot describe the joy (and following serious reevaluation of my sanity) that I gained from these graphs. After examining the Republican data to find no meaningful data I was expecting to have to write yet another post about how my methods have once again yielded no interesting data. This series of graphs shows not only lines that provide more than a blanket "player 1 wins" statement on the contest, but discrete data with far stronger correlations. The lines of best fit are mapping actual trends, not just a mathematically random laser shot through a particulate gas!

I can only speculate on why race plays a more significant role in the Democratic primaries, but the obvious hypothesis has to be the fact that people of colour tend to support Democratic candidates, and are therefore better represented in Democratic primaries. Instead of a vague and possibly imagined causal chain for the Republicans (racial diversity leads to racial tension leads to Republican voting patterns), the correlation is direct. Black people, for whatever reason, support Hillary Clinton.

The Asian and Hispanic graphs are less informative, as the data points are scattered further from any assumed trend, as with the Republican data.

 Line of best fit (Clinton): y = -1.05x + 56.01        (Excluding Vermont: -1.46x + 58.77)
Line of best fit (Sanders): y = 1.05x + 43.99        (Excluding Vermont: 1.46x + 41.23)

Line of best fit (Clinton): y = -0.69x + 55.39        (Excluding Vermont: -1.25x + 58.33)
Line of best fit (Sanders): y = 0.69x + 44.61        (Excluding Vermont: 1.25x + 41.67)

Interestingly, Asian voters seem to slightly favor Sanders. However, predictions based on these lines are very inaccurate because of the very small Asian populations in the sampled states. Using the registered voter data, Sanders should lose states below roughly 6% Asian and win those over this threshold. The only state over this point, however, was Nevada, won by Clinton. And, of course, all 9 states won by Sanders were below this point. The voting record data is similarly off, with the % of voters being Asian for Sanders to win exceeding any state so far.

However, this does provide some insight into Hawaii where the Asian population is so large (~42%) that it leaves the Black vote (2%-3%) too low to push Clinton over her Black vote threshold and the White vote (29%-30%) below the level needed by Sanders. Although it is unreliable to extrapolate so far from this data, I would suggest Sanders has an advantage in Hawaii.

 Line of best fit (Clinton): y = 0.07x + 54.04        (Excluding Vermont: -0.15x + 57.08)
Line of best fit (Sanders): y = -0.07x + 45.96        (Excluding Vermont: 0.15x + 42.92)

Line of best fit (Clinton): y = 0.04x + 54.18        (Excluding Vermont: -0.19x + 57.20)
Line of best fit (Sanders): y = -0.04x + 45.82        (Excluding Vermont: 0.19x + 42.80)

This data, however, is very unhelpful. The data is too scattered to produce a meaningful trend line, and its predictions are a very broad "Clinton wins everything" generalisation. The removal of Sanders's safe state of Vermont, counterintuitively, slides him from a slight negative to a slight positive trend in both graphs. This is because Vermont was a strong win in a low-Hispanic state, suggesting (perhaps falsely) Sanders performs better in non-Hispanic areas.


So the Republican primary data was not very useful at all. These are its "predictions":

This was calculated by taking each line of best fit (for each racial group, using both the registration data on the left and actual voting data from 2012 on the right), plugging in the corresponding population (e.g. the percentage of registered voters recorded as White) as the x value and seeing which candidate would win. In other words, consulting the graphs for the demographic of each state and seeing where the respective candidates placed.
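In code, that lookup might look something like the sketch below. It uses the White registered-voter lines quoted earlier (Texas included); the function names and the 80% example state are invented for illustration.

```python
# White registered-voter lines (slope, intercept) quoted earlier in the post.
LINES = {
    "Trump": (-0.09, 50.04),
    "Cruz": (-0.02, 36.23),
    "Rubio": (0.12, 13.81),
}

def ranking(x, lines=LINES):
    """Candidates ordered by predicted share at x percent, highest first."""
    shares = {name: m * x + c for name, (m, c) in lines.items()}
    return sorted(shares, key=shares.get, reverse=True)

def predicted_winner(x, lines=LINES):
    return ranking(x, lines)[0]

# 80 is a made-up state; 120 is the impossible figure used earlier to show
# that this particular graph never predicts anyone but Trump.
print(predicted_winner(80), ranking(120))
```

Running the same lookup for every whole percentage from 0 to 100 returns Trump each time, which is why this graph only summarises the results so far.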

How accurate is this? Well, I personally doubt Trump will win all of those states, since his record so far has been lower. But for an actual number, we can apply the same method to the primaries already passed:

So 70% accurate in most cases, and that's with the data it's optimised for. Future data is likely to fare even worse.

The same predictive method can be used on the more informative Democratic data:

With slightly better results for the Black and White vote, which were our most telling data sets:

This is approaching a useable system, but it's still far from perfect. Next I tried something more complex - I tried to aggregate the equations for all four racial groups in hopes that the average would be closer than the parts. My formula, for those intrigued, was as follows:

Taking the linear equation in the form y = mx + c for each racial group:

Where pop = the population.

This is best demonstrated by the absolutely fictional and in no way based on a real place "Ruerto Pico". Let's say Ruerto Pico is 50% Hispanic, 30% White, 15% Black and 5% Asian. Each has a linear equation for both the registration and electoral data for each candidate in both parties. Let's assume we're using the equation from the registration data for Trump in the Republican race.

Hispanic: y = -0.25x + 43.89
White: y = -0.09x + 50.04
Black: y = 0.20x + 40.49
Asian: y = 0.13x + 42.45

To combine the lines, we take 50% of the Hispanic equation, 30% of the White equation and so on:

y = (-0.25*50% + -0.09*30% + 0.20*15% + 0.13*5%) + (43.89*0.5 + 50.04*0.3 + 40.49*0.15 + 42.45*0.05)

y = -11.55% + 45.153

y = 33.6% of the vote.
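The Ruerto Pico arithmetic can be checked with a few lines of Python. This is a sketch that mirrors the worked numbers as written (each slope multiplied by the group's population percentage, each intercept weighted by the group's population as a fraction); the variable and function names are my own assumptions.

```python
# (m, c, population %) for each group, copied from the Ruerto Pico example.
groups = {
    "Hispanic": (-0.25, 43.89, 50),
    "White": (-0.09, 50.04, 30),
    "Black": (0.20, 40.49, 15),
    "Asian": (0.13, 42.45, 5),
}


def combine(groups):
    """Return (slope term, intercept term, total) as in the worked example."""
    # Slope term: m times the population percentage, summed over groups.
    slope_term = sum(m * pop for m, c, pop in groups.values())
    # Intercept term: c weighted by the population as a fraction of the whole.
    intercept_term = sum(c * pop / 100 for m, c, pop in groups.values())
    return slope_term, intercept_term, slope_term + intercept_term


slope_term, intercept_term, total = combine(groups)
print(round(slope_term, 2), round(intercept_term, 3), round(total, 1))
```

Note that only the intercepts are actually weighted here, so a single large c value can dominate the total.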

The problem, however, is best demonstrated by the Democratic race, where Clinton's c value for the White Vote is over 140, and with this vote being the vast majority, this system was consistently granting Clinton more than 100% of the vote in each state.

So we're stuck with the data provided so far. First off, let's just accept that there is no useful information to be taken from the Republican race here. Next, let's accept that California and Hawaii are special cases, having a large Asian population which hands both states to Sanders in the Asian column of our predictions:

As shown previously
We'll come back to those later. Finally, let's accept that with these two exceptions, all of our useful data comes from the Black and White votes. These graphs were the only ones with evident trends just by looking at the data points, and their predictions were the only ones to break 80% accuracy on decided states.

So let's boil the entire Democratic race down to Black v White, by plotting candidate success against the Black:White ratio like this...

 Line of best fit: y = 0.87x + 0.39

 Line of best fit: y = 0.87x + 0.39

These two very similar trends can be applied to the undecided states to give us one half-decent chance at a prediction:

We can back-check this method on won states to find it still holds its (comparably) high accuracy:

And thus we can consolidate our predictions:

Yes, this technically exceeds the allowed quota of tossups, but hopefully future analysis will help fill these in.
Note that both California and Hawaii go to Sanders before we even need to consider the Asian vote.
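The Black:White line of best fit quoted above (y = 0.87x + 0.39) can be applied as in the rough sketch below. I am assuming x is the Black share divided by the White share and y is Clinton's share of the two-candidate vote as a fraction, so anything under 0.5 goes to Sanders. The Hawaii inputs are the midpoints of the ranges mentioned earlier (Black 2%-3%, White 29%-30%), which is my own simplification.

```python
def clinton_share_from_ratio(black_pct, white_pct, m=0.87, c=0.39):
    """Apply y = m*x + c where x is the Black:White ratio (assumed meaning)."""
    x = black_pct / white_pct
    return m * x + c


# Ratio at which the line crosses 50% (assumed to be the tipping point).
threshold = (0.5 - 0.39) / 0.87

# Hawaii, using midpoints of the ranges quoted earlier in the post.
hawaii = clinton_share_from_ratio(2.5, 29.5)
print(round(threshold, 3), round(hawaii, 3), "Sanders" if hawaii < 0.5 else "Clinton")
```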


The racial demographics provide no useful data for the Republican nomination.
The racial demographics provide some useful data on the Democrat nomination.
Relatively high Asian populations favor Sanders.
The real useful data is that as the Black : White ratio increases, so does support for Clinton.
Predictions as shown in the right-hand column of the final table.

Thursday, 3 March 2016

The Sexy Statistics

Now that Super Tuesday has passed and results have been collated, let's begin with an analysis of the demographic of sex/gender. The source data outlined in the previous post provides several number sets we could work with, but we'll pick just two: female and male registered voters, and female and male voters actually present in 2012. The former seems to be the most relevant, so we'll save that for later. However, the actual 2012 voters are not as irrelevant as might first be thought. These are likely to include the most politically active (who are more likely to attend primaries and caucuses) and those who have not become jaded in the decades since they've registered.

This post will be very cisnormative because of the nature of the source data.

2012 Voter Gender

14 states have held their Democrat primaries or caucuses. All of these states had over 50% of their voters at the last election recorded as female.

STATE MALE FEMALE % FEMALE
ALABAMA 1009 1145 53.16%
ARKANSAS 531 593 52.76%
GEORGIA 1899 2269 54.44%
IOWA 733 816 52.68%
MASSACHUSETTS 1576 1807 53.41%
NEVADA 511 536 51.19%
NEW HAMPSHIRE 322 366 53.20%
OKLAHOMA 656 775 54.16%
SOUTH CAROLINA 940 1247 57.02%
TENNESSEE 1172 1434 55.03%
TEXAS 3925 4719 54.59%
VERMONT 146 162 52.60%
VIRGINIA 1709 2069 54.76%
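For reference, the percentage column can be reproduced from the two counts in each row. The sketch below assumes the first number is male voters and the second is female voters, which is the only reading that matches the percentages shown; the function name is my own.

```python
def pct_female(male, female):
    """Female voters as a percentage of all (male + female) voters."""
    return 100 * female / (male + female)


# Three rows copied from the table above: (state, male, female).
rows = [
    ("ALABAMA", 1009, 1145),
    ("NEVADA", 511, 536),
    ("SOUTH CAROLINA", 940, 1247),
]

for state, male, female in rows:
    print(state, f"{pct_female(male, female):.2f}%")
```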

When plotting the current Democrat primary results against these numbers, we see that as the number of female voters last election increases, so does support for Hillary Clinton. This might seem unsurprising to many, but I was not expecting an increase this dramatic:

Line of best fit (Clinton): y = 5.64x - 2.46
Line of best fit (Sanders): y = -5.64x + 3.46

Of course I'm not foolish enough to believe that women are more likely to vote for Clinton because she's a woman. This slope may appear more dramatic because of a small sample size, or because the range from 51.19% to 57.02% is so narrow (only 5.83 percentage points) that a small vertical change can result in a large angular shift. Alternatively, it would not be surprising if Clinton had better cut-through and engagement with women. And, while I don't think there are many women who would vote for Clinton because she's a woman, I think there probably are a lot of men who will NOT vote for Clinton based on her sex.
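To put a number on how steep that slope is, here is a small sketch using the two lines of best fit above. I am assuming both x (female share of 2012 voters) and y (share of the two-candidate vote) are expressed as fractions, which fits the two intercepts summing to 1.

```python
def clinton_share(x):
    """Clinton's line of best fit from above: y = 5.64x - 2.46."""
    return 5.64 * x - 2.46


def sanders_share(x):
    """Sanders' line of best fit from above: y = -5.64x + 3.46."""
    return -5.64 * x + 3.46


# Female share at which the Clinton line crosses 50% of the vote.
crossover = (0.5 + 2.46) / 5.64

# Predicted change in Clinton's share across the observed range of states.
low, high = 0.5119, 0.5702
swing = clinton_share(high) - clinton_share(low)

print(round(crossover, 4), round(swing, 3))
```

So a spread of under six points in female turnout maps onto a swing of roughly a third of the vote, with the tipping point at about 52.5% female.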

(N.B. the candidate's result is recorded as a percentage of votes out of those won by the two candidates shown. Votes for other minor candidates etc. are eliminated, primarily so that in the Republican graphs we can ignore candidates who have already, or are likely to soon, drop out of the race.)

For the Republican race, the number of female voters at the last presidential election has an insignificant correlation with the candidates' success. The numbers for voters in states which have held a Republican contest already are again consistently female dominated:

STATE MALE FEMALE % FEMALE
ALABAMA 1009 1145 53.16%
ALASKA 140 149 51.56%
ARKANSAS 531 593 52.76%
GEORGIA 1899 2269 54.44%
IOWA 733 816 52.68%
MASSACHUSETTS 1576 1807 53.41%
MINNESOTA 1374 1485 51.94%
NEVADA 511 536 51.19%
NEW HAMPSHIRE 322 366 53.20%
OKLAHOMA 656 775 54.16%
SOUTH CAROLINA 940 1247 57.02%
TENNESSEE 1172 1434 55.03%
TEXAS 3925 4719 54.59%
VERMONT 146 162 52.60%
VIRGINIA 1709 2069 54.76%

And the graph comparing Republican 2016 results with this gender distribution is disappointingly (that is to say, uninterestingly) flat:
Line of best fit (Trump): y = -0.25x + 0.56
Line of best fit (Cruz): y = 0.03x + 0.28
Line of best fit (Rubio): y = 0.22x + 0.16

There's not much to be said about this except that the Trump juggernaut appears completely indifferent to sex or gender. This may be, however, because the dynamic of the Trump candidacy is so removed from the dynamic of the 2012 presidential election that there's really no correlation to be found. As will be seen, a look at registered voters tells a different story.

Registered Voters

When it comes to registered voters in already-Democrat-contested states, we again see the female population being more politically engaged.

STATE MALE FEMALE % FEMALE
NEW HAMPSHIRE 353 398 53.00%
SOUTH CAROLINA 1096 1382 55.77%

Applying the same methodology and plotting Democratic primary results against the % of registrations female, we again see far greater support for Clinton in states with more female registered voters.

Line of best fit (Clinton): y = 6.39x - 2.85
Line of best fit (Sanders): y = -6.39x + 3.85

The conclusions here are the same as those already stated for the Democratic contest. The Republican contest, however, looks very different this time:

STATE MALE FEMALE % FEMALE
ALABAMA 1201 1354 52.99%
ALASKA 181 180 49.86%
ARKANSAS 637 739 53.71%
GEORGIA 2178 2589 54.31%
IOWA 838 906 51.95%
MASSACHUSETTS 1750 2009 53.45%
MINNESOTA 1496 1589 51.51%
NEVADA 574 602 51.19%
NEW HAMPSHIRE 353 398 53.00%
OKLAHOMA 835 970 53.74%
SOUTH CAROLINA 1096 1382 55.77%
TENNESSEE 1432 1778 55.39%
TEXAS 4977 5772 53.70%
VERMONT 168 189 52.94%
VIRGINIA 1931 2279 54.13%

Interestingly, for the first time, we have found a category where male participation is greater than female: in voter registration in Alaska. The really interesting data, however, is in the resulting graph:

Line of best fit (Trump): y = 0.80x + 0.01
Line of best fit (Cruz): y = -1.22x + 0.94
Line of best fit (Rubio): y = 0.42x + 0.05

The candidate to take a beating from an increase in registered female voters is Ted Cruz, the ultra-right-wing senator from Texas. The m value of his linear equation dropped from neutral (0.03) to severely negative (-1.22) between the two Republican graphs. This is a loss of 1.25 in the m value, which equates to 1.25% of the vote lost per 1% increase in female registration as a proportion of the population.

To put that another way, in a field of 100,000 voters, if 1,000 male voters unregistered (hypothetically) and 1,000 female voters replaced them, Cruz would lose 1,250 votes. This is, of course, absurd. Even if all of the male voters supported him and none of the females did, the most votes he could lose should be capped at 1,000. There is probably some error arising from the assumptions and approximations inherent in this calculation, the small data set etc., but this also ignores the reality that states with higher female registration may also have different attitudes to various issues, which may make up the other 250 lost votes.

The key point here is that as female voter registration increases, Cruz takes a bigger and bigger hit. Most of that support goes to Trump, whose m value jumped by 1.05 (a gain of 1,050 votes in the above scenario of 1,000 deregistered men and 1,000 registered women).
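The 100,000-voter scenario can be worked through as follows. This sketch follows the post's own reading of the numbers (the change in m between the two Republican graphs, applied to a one-point rise in the female share); the function name and rounding are my own assumptions.

```python
def votes_shifted(delta_m, electorate, swapped):
    """Votes gained (+) or lost (-) when `swapped` men are replaced by women."""
    # Swapping 1,000 of 100,000 voters moves the female share by 1 percentage point.
    points = 100 * swapped / electorate
    # delta_m is read as percentage points of the vote per point of female share.
    return delta_m * points / 100 * electorate


# Slopes copied from the two Republican graphs: 2012 voters, then registrations.
cruz = votes_shifted(-1.22 - 0.03, 100_000, 1_000)
trump = votes_shifted(0.80 - (-0.25), 100_000, 1_000)

print(round(cruz), round(trump))
```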

Sexy Predictions

If we extrapolate from these lines of best fit, we can obtain some (VERY) rough estimates for the Democrat and Republican results in the remaining states:

Democrat linear equation based on 2012 voter turnout

Republican linear equation based on 2012 voter turnout

Democrat linear equation based on voter registration

Republican linear equation based on voter registration

This seems to lock in sure-fire wins for Trump and Clinton. However, as a minor point, the Republican predictions are only valid as long as Cruz and Rubio remain in the race. If one of these drops out (on the numbers most likely to be Rubio, though tactically the Republicans would prefer it was Cruz) it may result in a combined anti-Trump vote of over 50% (particularly after considering how ingrained anti-Trump sentiment must now be in those camps). On a more major point, this data is worthless. The deviations from the linear equations in all graphs are significant. The correlation is poor to non-existent. If we apply this same "predictive" model to the won states, it seems little better than guesswork:

The linear equation for Democrats based on voter registration would have called over a third of the primaries incorrectly due to errors of up to 39%. This is exceptionally poor (8/13) for a method directly based on the results it is trying to predict.

Basing predictions on voters in 2012 gets one more primary right (Iowa) but this is just pure luck.

Similarly, the Republican equations which give Trump every remaining primary also give Trump every primary so far, with very static results:

These are only accurate 9 times out of 15 (= 60% of the time) - wrong again in more than one third of all cases with errors of up to 21% of the vote.
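As a quick sanity check on those hit rates, the sketch below turns the two tallies quoted above (8 of 13 for the Democratic registration line, 9 of 15 for the Republican lines) into percentages and confirms that both miss more than a third of contests. The helper name is my own.

```python
def accuracy(correct, total):
    """Fraction of contests the model called correctly."""
    return correct / total


democrat = accuracy(8, 13)
republican = accuracy(9, 15)

for label, acc in (("Democrat", democrat), ("Republican", republican)):
    print(label, f"{acc:.1%} right,", f"{1 - acc:.1%} wrong")
```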

In Summary

TL;DR: Sex or gender demographics are a very poor means of determining voting outcomes in primaries. There is little correlation, if any, between the sex/gender of constituents and voting patterns. This is unsurprising given the male:female ratio is always close to 50:50 (it seems this is slightly but consistently skewed towards female participation) while voting patterns vary wildly from state to state.

It remains to be seen whether more variable demographics like age and race may have a stronger correlation with voting patterns.

N.B. Oklahoma's Democratic results became available during this post's writing. Statistics have not been recalculated to accommodate these numbers.