Darth Pharma: A Lesson in Frequentist Statistics

In my last blog entry, I wrote a thing about perfection and motherhood and ended up writing a bunch about my guesses for why people don’t vaccinate their kids. My goal had been to get to a punchline of something like “all you need is love” but I got tired of writing and got to where I got. The punchline that someone took away from it (and left in the comments) was that I was calling antivaxxers morons and saying that they don’t understand how science works. As I replied in the comments that day, I definitely do not think antivaxers (I kind of like it better with two x’s but is it one? I don’t really want to check) are morons. I actually heard a story about the “vaccine hesitant” on CBC radio the other day saying that antivaxxers have higher than average levels of education (interesting!).

However, I do think that they don’t get how statistics work. I think this because I think they are like most people and most people don’t get how statistics work.  I have been teaching this stuff to students nearly every year for the last 11 years or so and they definitely do not start the course understanding it.  Hell, I didn’t fully understand it when I started teaching it after taking multiple graduate level courses in social statistics and carrying out advanced statistical modeling in my dissertation and my first entirely separate publication. Understanding survey methodology and statistics is just not something one would pick up easily or from perusing a blog… until now.

So, here, I want to explain how data is collected, the concept of sampling error, and why there’s no such thing as a ground breaking study (okay, maybe there is but they are EXTREMELY rare and are more likely to use theory and ideas than statistics, I’d surmise). I’m going to try to keep this as simple as possible (even though usually this takes me about 6-8 weeks to cover in my intro stats course), not because I think you’re simpletons but because, as I say to my students, I like to make easy stuff hard and hard stuff easy (that could possibly be my life motto although, as discussed here, I’m wanting to stop making easy stuff hard). How many tangents can one woman go on? On to the lesson!

First of all, wait! Some people get very confused about why I do statistics. They say, you’re a sociologist? I thought you taught math? Or they say you teach statistics? I thought you were a sociologist! Well, ladies and gentlefriends, sociologists use statistics (especially those of us trained in the United States but all over the world too) to analyse data from surveys.

What’s a survey, you ask? (No, you didn’t b/c you know what a survey is, but bear with me for a minute). I for one, LOVE taking quizzes on BuzzFeed which are kind of like surveys. I just took one now to find out my secret power based on my witch personality. According to this ultra scientific assessment, I’m a Lorist Witch. Whoa, spoopy accurate.

[Image: Lorist Witch]

So, sociologists tend not to use BuzzFeed type quizzes to assess personality (psychologists do more of that kind of quiz… okay, NOT on BuzzFeed, psychologists stop yelling at me! I just mean personality assessments! sheesh!). Sociologists use data, sometimes collected themselves and often collected by government agencies like the Census Bureau or the Bureau of Labor Statistics in the US or, here in Canada, everything is done under one roof at Statistics Canada (most countries, I believe have one government agency for data collection, whereas the US splits it up by agency). There are THOUSANDS of surveys that are collected by these various bureaus and agencies that ask probably millions of questions about people’s lives. You can click on those links and go poke around if you want, which if you did you might find these questions from the Canadian Community Health Survey:
[Image: example questions from the Canadian Community Health Survey]

If I were to answer this survey, I’d say:

1.) yes
2.) sometimes (which isn’t an option so I’d go with most of the time so I look like a better person than I am)
3.) no.

We call my slightly wrong answer to the second question “social desirability bias” and that is one way in which research can skew towards people’s best behavior, but that’s for another lecture.

So, every year millions of people around the world tick off little boxes in surveys with hundreds of questions (actually they’re more likely to answer the phone and the surveyor ticks off boxes in a computer) and then these people’s answers are uploaded into various different software programs. Sometimes it’s just a plain text file, sometimes they get converted into SPSS or SAS or Stata (there are lots of them and quantitative researchers can get really strong feelings about which is best and only noobs use SPSS but that’s what usually is taught in intro stats classes in most social sciences except for economics because econ students typically do more stats with more advanced stuff, anyway!). Then, in Canada, if you’re affiliated with a university, eventually you can download these giant spreadsheets of people’s answers and analyse them on your computer or, if you want more detailed information, analyse them in these top secret rooms in libraries that require one to get security clearance and pledge to the QUEEN OF ENGLAND that we will not disclose any private information about the citizens who answered the questions. This is all true. I’ve done it even when I was still just an American citizen and felt a bit squeamish about the pledge as an (unofficial) Daughter of the American Revolution.

When we get the data on our computers, it looks something like this (except these are actually data from stats.nba.com b/c I could get in trouble for sharing Statistics Canada data publicly 😇):


What we can do with these data as researchers or students is then compare different outcomes for different groups on various survey questions. One way we could do that, in terms of, let’s say, oh, I don’t know, do vaccines cause autism, would be to carry out a longitudinal survey where we ask if a baby was vaccinated and then check to see if they say yes to “did your kid get autism” every so often after that. But, as discussed above in my propensity to appear like I wear my bike helmet more than I do, sometimes people could lie. Maybe people say they were vaccinated when they weren’t or vice versa. Maybe people say “my kid totally has autism” when they don’t. I’m doubtful about those. I generally don’t think most people lie about things that are so straightforward.

So, some researchers will use data that is collected at the aggregate level.  There might be general statistics collected by local health agencies that show overall rates of vaccination and overall rates of mumps without asking individuals if they themselves got the vaccine and if they themselves got the mumps. This may be one of the worst kinds of data because it can make it seem like there are relationships between things that aren’t really there. For example, we know that in areas with high ice cream sales, there are also high levels of drownings. Some might want to say, well that’s because you need to wait 30 minutes before swimming after eating ice cream. No, that is not why. (A) Because that is a myth (go ahead and swim with a full stomach, as long as you’re okay with possibly puking) and (B) because there’s a third factor that leads to both, and that is summer time! In the summer we are more likely to eat ice cream and we’re more likely to go swimming, unless you’re one of those polar bear people.
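If you want to see the ice cream problem for yourself, here is a minimal sketch in Python. Everything in it is made up for illustration: the temperature range, the "sales index" and "drowning risk index," and the `pearson` helper are my assumptions, not real data. The only point is that when a third factor (temperature) drives both things, they look strongly related overall, and the relationship mostly disappears once you compare days with about the same temperature.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is repeatable

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

# Each simulated day: temperature drives BOTH outcomes (made-up numbers).
# Ice cream never affects drowning directly in this toy world.
days = []
for _ in range(2000):
    temp = rng.uniform(0, 35)                   # degrees Celsius
    ice_cream = 10 * temp + rng.gauss(0, 30)    # hypothetical sales index
    drownings = 5 * temp + rng.gauss(0, 20)     # hypothetical risk index
    days.append((temp, ice_cream, drownings))

overall_r = pearson([d[1] for d in days], [d[2] for d in days])

# Hold the third factor roughly constant: only days between 25 and 27 degrees.
similar = [d for d in days if 25 <= d[0] < 27]
within_r = pearson([d[1] for d in similar], [d[2] for d in similar])

print(f"all days: r = {overall_r:.2f}")
print(f"similar-temperature days: r = {within_r:.2f}")
```

The first correlation should come out strong and the second close to zero, even though nothing in the simulation lets ice cream cause drowning.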

So, these aggregate data aren’t the best data but they’re often what we see first because they are the easiest to get. So, for instance, there have been measles outbreaks in places where there are high rates of immunization. This *could* lead people to say, “see, vaccines don’t work, 95% of people are immunized and still there’s an outbreak.” Well, in my very brief read about this since I am in no way a vaccine expert, part of the issue can be that vaccines don’t work 100% of the time but it’s also possible that the vaccines we get as kids eventually get less effective… this is why kids get “booster shots” to give our baby vaccines a boost to work better. So, the old people get mumps even if they were vaccinated. Regardless of what the actual explanation is, saying that just because mumps goes up in an area with high vaccine rates means that vaccines don’t work is as logical as saying that eating ice cream can make you drown. You need data that measure the people who got the measles and show what their vaccine history is.

The best data some researchers will use is health record data because we trust doctors more than lying, forgetful, sleep deprived parents. So, instead of asking parents “did you vaccinate your kid and now does your kid have autism,” we might look at the health records of the kids and create a data set that shows if the kid got a vaccine and if the kid has autism. We can then see if kids who get vaccines then get autism more or less often than the kids who don’t get vaccines.

Although these data sound awesome, as you can imagine, a lot of people don’t like nosy researchers poking around their medical history. No one wants the world to know about that time they went to the doctor convinced that they had contracted herpes from trying on a pair of used jeans at the Goodwill and then found out it was just an ingrown hair. This is embarrassing, so some people are like, uh, no thanks, let’s keep that private. Still, health researchers who work in teaching hospitals can sometimes get easier access to medical records and hope you forget about that herpes/pimple visit before you sign the consent form giving them permission to do so.

Now, one of the points made by the commenter on my Fuck Gravity post was that she doesn’t trust the government because the government is controlled by Big Pharma. Let’s explore this idea for a minute. So, let’s say that “Big Pharma” really DOES control everything in every government around the world, all from the comfort of its pharmaceutically funded Death Star.


In order to control ALL the statistics collected by all of the different agencies, Darth Pharma would need to somehow infiltrate all of the government statistical agencies and all the medical records of everyone and change all of the variables around just enough to make it appear that those who are vaccinated don’t get mumps when they really do. Or that there aren’t any real side effects of the vaccines. Personally, this seems much more expensive and difficult than actually developing a medication.

“But wait!” you say “I know that I’ve read that Big Pharma does their own research and they only publish the results that show good news for their drug and not the bad results. What about THAT?!?!” Yes, this is an issue with lots of drugs; one hears this a lot in arguments against mental health drugs. (To me, however, many of the arguments against mental health drugs have more to do with stigma around mental illness and less to do with the drugs themselves. I mean, I haven’t seen any controversial books problematizing insulin, but I’m not that well read and that’s for another post.) Anyway, so yes, pharmaceutical companies conduct their own research and use that research to get FDA approval, but they aren’t the only researchers out there and aren’t the only agencies collecting data. And so I want to explain why I don’t believe in the possibility of a ground-breaking study upsetting the vaccine applecart any time soon based on how statistics work.

Okay, here’s where it gets really complicated and hard so I’m going to way oversimplify how things work, but hopefully you’ll get the gist. If we were to take the survey or medical record data and calculate the average mumps/measles/whooping cough/rubella/whatever rates among those who were vaccinated and then compare them to the illness rates of those who were not, what that would describe would be differences in our sample (the people we surveyed or the medical records we examined). That wouldn’t tell us what was likely happening in the population. Sometimes we can pick a sample where it looks like there’s a difference when really in the population there isn’t a difference. In really rough terms, inferential statistics is all about trying to assess whether differences we see in our sample are “real.” So, how do we do that?

First, we take a random sample from the population. It has to be random where everyone has the same chance of being picked. If you only pick your friends or people who like your website, it’s not a random sample. Everyone has to have the same chance of being picked. So, let’s imagine we have a list of all the babies born in 2001 based on Census data (which by law you have to fill out or you can GO TO JAIL!! Seriously!) or we have a list of all babies from birth certificates. And, let’s say, we put this list of babies in random order and then pick the first 10 babies and sort them into two groups: vaccinated and unvaccinated. Let’s say that these babies below are our sample (and not just some random baby picture copy and pasted from Google Images):

[Image: vaccinated babies]

Because they were randomly selected from all of the babies, there’s nothing about our particular sample of babies that makes them special.

We could have picked this sample:

[Image: vaccinated babies, second sample]

Or this sample:

[Image: vaccinated babies, third sample]
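For the curious, here is a rough sketch of that "put the list in random order and pick the first 10" step in Python. The population is invented: the 1,000 babies, the 90% vaccination rate, and the `id` and `vaccinated` field names are all assumptions for illustration, not anything from a real Census file.

```python
import random

rng = random.Random(2001)  # fixed seed so the sketch is repeatable

# Hypothetical population list: 1,000 babies, roughly 90% vaccinated.
population = [{"id": i, "vaccinated": rng.random() < 0.9} for i in range(1000)]

# Put the list in random order, then take the first 10.
# Every baby has the same chance of ending up in the sample.
shuffled = population[:]
rng.shuffle(shuffled)
sample = shuffled[:10]

# Sort the sampled babies into the two groups.
vaccinated = [b for b in sample if b["vaccinated"]]
unvaccinated = [b for b in sample if not b["vaccinated"]]
print(len(vaccinated), "vaccinated,", len(unvaccinated), "unvaccinated")
```

Change the seed and you get a different set of 10 babies, which is exactly the point of the next few pictures.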

There is a nearly infinite number of different samples we could have randomly picked from all the babies born in 2001.

We could take any one of those samples and create a data set for them, where we ask if they were vaccinated and if they had the mumps and if they had autism.

[Image: data set for the vaccinated baby samples]

Then we could calculate what proportion of the vaccinated babies got sick and what proportion of the unvaccinated babies got sick for our sample. If we were to do this for multiple samples (which we wouldn’t) but if we did, we could put the values for one of the variables for all the samples into a table like so:

% of Babies with Mumps
            Overall   Vaccinated   Not Vaccinated
Sample 1    20%       0%           50%
Sample 2    25%       0%           67%
Sample 3    0%        0%           0%

I’ve made up these data 100% but if they were real data what we would find is that if we kept taking samples and then took the average of these percentages, the average of the sample averages would be the same as the population average.
[Image: mind blown]

Okay, your mind maybe isn’t blown. Maybe it’s more like, what? I don’t believe you. Okay, smarty pants, let me give you a simulation where you can see this for yourself, using basketball. Basketball is the only sport I care about, I love it for so many reasons, not the least of which is the NBA’s very awesome statistics I can use for teaching. I pulled out statistics on the current season and looked at average points per game per team and got these data, with an overall average of about 105.5 points per game for all teams.

Average Points Per Game (all teams): 105.5

Golden State Warriors 120
Brooklyn Nets 114
Washington Wizards 114
Orlando Magic 111
Indiana Pacers 111
Toronto Raptors 109
LA Clippers 109
Minnesota Timberwolves 109
Cleveland Cavaliers 108
Portland Trail Blazers 108
Philadelphia 76ers 108
Phoenix Suns 107
Houston Rockets 107
Los Angeles Lakers 106
New Orleans Pelicans 106
Denver Nuggets 106
Charlotte Hornets 105
Oklahoma City Thunder 105
Detroit Pistons 104
New York Knicks 104
Milwaukee Bucks 104
Miami Heat 104
Boston Celtics 103
Atlanta Hawks 101
Memphis Grizzlies 100
San Antonio Spurs 98.6
Utah Jazz 98
Dallas Mavericks 97.8
Sacramento Kings 93.1
Chicago Bulls 92.1

Let’s pretend that these basketball statistics don’t represent teams but they are the babies we want to know about. So, we can’t access all of the babies/teams, so we collect a random sample of 10 of these babies/teams, like this:

Houston Rockets 107
Dallas Mavericks 97.8
Indiana Pacers 111
Golden State Warriors 120
Oklahoma City Thunder 105
Miami Heat 104
Charlotte Hornets 105
Denver Nuggets 106
Memphis Grizzlies 100
Detroit Pistons 104
Average Pts Per Game 106.1

Notice that our average points per game in our sample (106.1) is close to the actual population average (105.5). If we had collected a different sample of 10 babies/teams, we would have gotten a different average, every single time. Like this:

[Image: basketball samples]

What you can see there is that some of the averages are really close to the “real” mean (like the fourth one on the top row is 105.1) but some are pretty far away (like the second to last one on the bottom row at 100.3). If we were to take the average of all of these averages, we’d get 105.0 points per game in this case, which is really close to the actual population average.


It’s still not quite exactly right but it’s closer than most of the individual samples. And, if we were to do this more than 12 times but kept doing it over and over and over again and if we used samples that were bigger than 10 teams/babies, we’d find that we’d get closer and closer to the actual population value over time.
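You don’t have to take my word for it. Below is a small Python sketch that repeats the experiment using the rounded points-per-game numbers from the team table above. The seed and the numbers of repetitions (12, 100, 10,000) are my choices for illustration. Because the table values are rounded, the population average the code computes may differ a little from the 105.5 quoted in the text.

```python
import random
import statistics

# Points per game, copied from the (rounded) team table in this post.
points = [120, 114, 114, 111, 111, 109, 109, 109, 108, 108, 108, 107, 107,
          106, 106, 106, 105, 105, 104, 104, 104, 104, 103, 101, 100,
          98.6, 98, 97.8, 93.1, 92.1]
population_mean = statistics.mean(points)

rng = random.Random(30)  # fixed seed so the sketch is repeatable

def mean_of_sample_means(n_samples, sample_size=10):
    """Draw n_samples random samples of teams and average their means."""
    means = [statistics.mean(rng.sample(points, sample_size))
             for _ in range(n_samples)]
    return statistics.mean(means)

# More samples should pull the average of the averages toward the population.
results = {n: mean_of_sample_means(n) for n in (12, 100, 10_000)}
for n, value in results.items():
    print(f"{n:>6} samples: average of sample averages = {value:.2f}")
print(f"population average = {population_mean:.2f}")
```

With only 12 samples the answer can still be off by a fair bit; with thousands it should sit almost on top of the population average.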

What we also know is that if we were to take all of these averages and put them into a graph, over time those averages would form a curve that looks like this:

[Image: basketball curve]

We call this curve a normal curve. Just looking at that we can tell that there’s a higher chance that we’d pick a sample where the average points is 105 or 104 or 106 than that we’d pick a sample where the average points is 110. What mathematicians figured out A LONG TIME AGO, was how much area is under this normal curve. If you knew calculus, which I don’t, you could calculate the area under any curve. I’m #grateful that someone already figured this out for me so I can just use a table that tells me how much area is under that curve at any particular point on the curve. A few things we know are that the curve:
a.) is symmetrical so half of the scores fall above the midpoint and half fall below it
b.) that the average of all the scores falls at the midpoint
c.) in a standard normal curve, 95% of the scores fall within almost 2 standard deviations of the mean (are you asleep yet? sorry!). A standard deviation is, roughly speaking, the typical distance from each score to the mean (i.e., some of the basketball teams scored higher than average and some below average; the standard deviation summarizes how big those distances tend to be)
d.) we can use what are called z-scores (that’s pronounced zee-scores in the US and zed-scores everywhere else, btw) to calculate particular areas under the curve.
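The table I’m #grateful for is also built into Python, so here is a short sketch of points a.) through d.) using the standard library’s `NormalDist`. The `z_score` helper and the example numbers fed into it (a score of 110, a mean of 100, a standard deviation of 5) are hypothetical, just to show the arithmetic.

```python
from statistics import NormalDist

standard_normal = NormalDist()  # mean 0, standard deviation 1

# a.) and b.) symmetry: half of the area falls below the midpoint.
below_midpoint = standard_normal.cdf(0)

# c.) about 95% of the area is within "almost 2" (1.96) standard deviations.
within_196 = standard_normal.cdf(1.96) - standard_normal.cdf(-1.96)
within_2 = standard_normal.cdf(2) - standard_normal.cdf(-2)

# d.) a z-score is how many standard deviations a score sits from the mean.
def z_score(score, mean, sd):
    return (score - mean) / sd

z = z_score(110, 100, 5)                # hypothetical numbers
area_above = 1 - standard_normal.cdf(z)  # share of scores higher than that

print(f"below midpoint: {below_midpoint:.3f}")
print(f"within 1.96 SD: {within_196:.4f}, within 2 SD: {within_2:.4f}")
print(f"z = {z:.1f}, area above = {area_above:.4f}")
```

The "within 1.96" area comes out at almost exactly 0.95, which is where the "almost 2 standard deviations" in point c.) comes from.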

[Image: bell curve]

Alright, so if we were to take all of those baby/team average scores from the samples, what we know is that the mean of the sample means would equal the mean of the population and that 95% of the cases would fall within 2 standard deviations of the mean (we also know some other things I’m going to skip over for now).

What this tells us, too, is that in our team example, the probability of getting a sample where the average score of the teams was 110 or higher is extremely low. In fact, the probability is less than 0.05 b/c 95 times out of a hundred, we’re going to get scores that are closer to the population mean than that.
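Here is a sketch that checks both claims by brute force with the rounded team numbers from the table earlier in the post: how many sample averages land within 2 standard deviations of the middle, and how often a sample of 10 teams averages 110 or more. The 10,000 repetitions and the seed are my assumptions; this is a simulation, not the z-table calculation I’d do in class.

```python
import random
import statistics

# Points per game, copied from the (rounded) team table in this post.
points = [120, 114, 114, 111, 111, 109, 109, 109, 108, 108, 108, 107, 107,
          106, 106, 106, 105, 105, 104, 104, 104, 104, 103, 101, 100,
          98.6, 98, 97.8, 93.1, 92.1]

rng = random.Random(110)  # fixed seed so the sketch is repeatable

# Average points per game for 10,000 random samples of 10 teams each.
sample_means = [statistics.mean(rng.sample(points, 10)) for _ in range(10_000)]

center = statistics.mean(sample_means)
spread = statistics.pstdev(sample_means)  # standard deviation of the averages

# Share of sample averages within 2 standard deviations of the center.
within_2sd = sum(abs(m - center) <= 2 * spread
                 for m in sample_means) / len(sample_means)

# Share of samples whose average is 110 or higher.
at_least_110 = sum(m >= 110 for m in sample_means) / len(sample_means)

print(f"within 2 SD: {within_2sd:.3f}")
print(f"average of 110 or more: {at_least_110:.4f}")
```

The first number should land near 0.95 and the second should be tiny, well under 0.05.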

So how does this explain why there can never be a ground breaking study? Well, let’s say that researchers have conducted 100 studies on the relationship between vaccines and autism. And they all do it very similarly and instead of putting average points scored in a basketball game in their table they put in the difference in proportions: the proportion of vaccinated kids with autism minus the proportion of unvaccinated kids with autism.

% of babies with autism
              Vaccinated   Not Vaccinated   Difference
Sample 1      1.5%         1.0%             0.5%
Sample 2      1.3%         1.4%             -0.1%
Sample 3      1.6%         1.7%             -0.1%
Sample 4      1.4%         1.3%             0.1%
Sample 5      1.8%         1.2%             0.6%
Sample 6      1.9%         2.0%             -0.1%
Sample 7      1.3%         1.6%             -0.3%
Sample 8      1.4%         1.6%             -0.2%
Sample 9      1.8%         1.3%             0.5%
…
Sample 100    0.6%         1.1%             -0.5%

If we were to do this an infinite number of times, we would eventually find that the average of all of those differences would equal the difference in autism rates between vaccinated and unvaccinated kids in the whole population. However, no one can ever collect an infinite number of samples. Usually, we collect one sample and so our question becomes, is that 0.5% difference in Sample 1 a “real” difference? Or in other words, how likely is it that we’d see a difference that big in a sample that came from a population where there was actually no difference between vaccinated and unvaccinated kids (Population A below)?

[Image: null vs. research distributions]

Given what we know about our sample, we can figure out if that 0.5% difference in autism rates is more likely to come from Population A or Population B.
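One common way to do that figuring is a two-proportion z-test, sketched below in Python for the made-up Sample 1 (1.5% vs. 1.0%). The table doesn’t say how many kids are in each group, so the 1,000 per group here is purely an assumption, and the function name is mine. A different sample size would give a different answer.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_z_test(p1, n1, p2, n2):
    """Two-sided z-test for a difference between two sample proportions."""
    # If there is really no difference, both groups share one pooled rate.
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # How often would a no-difference population give a gap at least this big?
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Sample 1 from the table: 1.5% vaccinated vs 1.0% unvaccinated.
# ASSUMPTION: 1,000 kids in each group (the post does not give sample sizes).
z, p_value = two_proportion_z_test(0.015, 1000, 0.010, 1000)
print(f"z = {z:.2f}, p = {p_value:.2f}")
```

Under that assumed sample size, the p-value is far above 0.05, so a 0.5% gap like the one in Sample 1 is the kind of thing Population A produces all the time.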

Let’s say we had a sample where we find 10% more vaccinated kids than unvaccinated kids have autism and we’re like “HOT DAMN! GROUND BREAKING STUDY!!!”. Depending on our calculations, we’d likely say it’s really unlikely that that difference would be found in a sample from Population A, since most samples from Population A show differences that fall between -2% and 3% around zero. When we calculate a difference that falls outside of the 95% of the probable differences in Population A above (what we call a null distribution), we say that the difference is statistically significant. This means that if the difference between the groups were actually zero (no difference, nada, zilch, nil, null), we’d get a difference this big less than 5% of the time.

In other words, even if in reality there is no relationship between autism and vaccines, if 100 researchers do a study on the relationship between autism and vaccines, most are going to find no relationship. However, over time, about five of them will find a relationship: theoretically 2.5 of them will find that autism is more likely for the vaccinated and 2.5 of them will find that autism is more likely for the unvaccinated. That’s just how sampling error works. That’s because some of the time when you randomly sample you’re going to get only the highest scoring basketball teams and some of the time you’re going to get only the lowest scoring basketball teams, but most of the time you’ll get some high and some low and some average scoring teams. So, sorry, researcher with the 10% difference, but there’s no such thing as a ground breaking study when 95% of the researchers find the opposite!
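Here is a rough simulation of that idea in Python. I build a pretend world where vaccines have NO effect on autism (the same rate in both groups), run 1,000 pretend studies, and count how many come out "statistically significant" anyway. The 1.5% rate, the 1,000 kids per group, and the number of studies are all assumptions for illustration.

```python
import random
from math import sqrt

rng = random.Random(100)  # fixed seed so the sketch is repeatable

TRUE_RATE = 0.015    # ASSUMED autism rate, identical in both groups
N_PER_GROUP = 1000   # ASSUMED number of kids per group in each study
N_STUDIES = 1000     # number of simulated studies

def run_study():
    """Simulate one study and return its z statistic (vaccinated - unvaccinated)."""
    vax_cases = sum(rng.random() < TRUE_RATE for _ in range(N_PER_GROUP))
    unvax_cases = sum(rng.random() < TRUE_RATE for _ in range(N_PER_GROUP))
    pooled = (vax_cases + unvax_cases) / (2 * N_PER_GROUP)
    se = sqrt(pooled * (1 - pooled) * 2 / N_PER_GROUP)
    if se == 0:  # no cases in either group, so no difference to test
        return 0.0
    return (vax_cases - unvax_cases) / N_PER_GROUP / se

z_scores = [run_study() for _ in range(N_STUDIES)]

# "Significant" means outside the middle 95% of the null distribution.
higher_vaccinated = sum(z > 1.96 for z in z_scores)
higher_unvaccinated = sum(z < -1.96 for z in z_scores)
false_alarm_share = (higher_vaccinated + higher_unvaccinated) / N_STUDIES

print(f"autism 'higher' for vaccinated: {higher_vaccinated} studies")
print(f"autism 'higher' for unvaccinated: {higher_unvaccinated} studies")
print(f"share of studies finding a relationship: {false_alarm_share:.3f}")
```

The share should hover around 0.05, split roughly evenly between the two directions, even though the true difference is exactly zero.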

It is for this exact reason, however, that some people are wisely hesitant about NEW vaccines or new medications that have not been tested in the wider population. It IS possible that Darth Pharma only presents the results of their studies when they find a significant effect of their product on an outcome. Because new medications are only tested while under their control, and not with decades of people taking the medicine (like with almost all of the vaccines we’re talking about), new medicines/vaccines can’t first be tested independently by researchers without a financial conflict of interest.

Darth Pharma *could* carry out 100 studies or maybe 1,000 studies and then shoot all of their data to the circular file for the 95% of the cases that showed no difference in outcomes and only publish the 5% of the studies that showed that their Drugzac works. Even then, however, I think it would be financially unsound for them to do so. Because eventually when their drug gets to market, people will know whether or not it works for them. Efficacy rates will become more precise with more and more people taking the drug or people will realize that really it doesn’t work and then stop taking it. Or even if everyone is duped by a placebo effect, academic researchers will do their own tests to control for this. If a drug company in particular becomes known for doctoring their results over and over, people aren’t going to want to use their drugs.
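To see what the circular file trick would do to the published record, here is a tiny Python sketch. It assumes a Drugzac with no real effect at all and 1,000 studies whose estimates are measured in standard-error units (both assumptions of mine), and it "publishes" only the studies that come out significantly in the drug’s favor, which under these assumptions is roughly the top 2.5% rather than the full 5%.

```python
import random
import statistics

rng = random.Random(5)  # fixed seed so the sketch is repeatable

# Each study's estimated effect, in standard-error units, for a drug with
# NO real effect: the estimates scatter around zero purely by sampling error.
estimates = [rng.gauss(0, 1) for _ in range(1000)]

# "Publish" only the studies that look significantly good for the drug;
# everything else goes in the circular file.
published = [e for e in estimates if e > 1.96]

mean_all = statistics.mean(estimates)
mean_published = statistics.mean(published)

print(f"studies run: {len(estimates)}, published: {len(published)}")
print(f"average effect across all studies: {mean_all:.2f}")
print(f"average effect in published studies: {mean_published:.2f}")
```

All the studies together average out to about zero, while the published handful look like a solid effect. That gap is why independent researchers and real-world use matter once a drug is on the market.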

Far smarter would be for drug companies to keep tinkering with their product until they find a product that seems to work at least better than not at all. I believe this to be what they do, not only because it makes rational economic sense, but because people like Sonia Shah have written about Darth Pharma’s horrifying evils using people in developing countries as guinea pigs for their drugs. Since I went to elementary school with Sonia’s sister and I’m loyal af and I read her book (or like some chapters maybe? it’s been a while), I believe her work. And why would Darth Pharma go to all the trouble of building human concentration camps in developing countries to try out drugs if they could just make up whatever results they want? Why wouldn’t they just lie about all of it and pay off the FDA?

So, beyond the fact that I have an awesome nephew on the autism spectrum whom I love and whom I would WAY rather see with autism than polio, as someone who is not an immunologist or a virologist and knows nothing about how these diseases work, based on how science as an enterprise works, I can’t see any compelling evidence to support not vaccinating kids.

I’m working on a study that challenges what a lot of research shows about breastfeeding, but I don’t claim that my study changes everything (okay, I did at first with my friends and students and colleagues in my office when I was really excited but an editor smacked that down and I realized she was right that no one study proves anything). And actually, my study adds in new variables that haven’t been included so it’s not a replication of the other studies and it actually fits with a number of other studies coming out challenging the claims to infinite superiority of breastmilk (and I do find some benefits still). Science isn’t about believing everything you read but it’s also not about rejecting everything you read either just because it was carried out by “the man”.

I heard a scientist on the radio last night talking about the difference between being a climate change skeptic and a climate change denier.  She argued that being a skeptic is good, it means you’re not sure if the evidence is strong enough but if the evidence grows strong enough, you’ll believe it. A denier is someone who refuses to believe regardless of what the preponderance of the evidence says.  A skeptic should not rely on one study to throw out all of the other studies.  So, in conclusion, dear friends, whom I love and respect, be a skeptic but get your kids vaccinated.


  1. Thanks for the stats primer! There are so many things to unpack when it comes to this issue. It seems like in general the level of research literacy is pretty low, and things like confounding factors or the concept that correlation does not equal causation appear to be foreign to a lot of people. Maybe the difference between a skeptic and a denier is an understanding of how to interpret research.


    • Totally! That’s one of the reasons why I like teaching methods & statistics to increase quantitative and research literacy. And it’s not that people are dumb, it’s just not something people would learn which is unfortunate.


  2. I love this. I wish you were my stats professor. You provide the context and perspective that my brain desperately begged for to make sense of what stats represents (or doesn’t represent). Thank you for the clarity.


  3. Maybe this is not on the subject really but the problem with the premise that autism is caused by vaccinations is that many things might be the cause of more cases of autism.

    Firstly the way autism is defined or diagnosed today compared to decades ago.

    I am going to go out on a limb and say that the biggest cause might be the age of parents. Women and men. This could easily be the largest factor in the cause of the increased prevalence of autism. Have there been studies done on this? I don’t know. The average age that women have children has gone up decade after decade. The male partners are also older. What are the odds of a 20 year old or 25 year old couple having a child with autism versus a 35 year old couple or a 40 year old couple or a 43 year old couple? I think this factor would likely be very correlated with increased autism diagnoses in children, and is likely a factor that leads to causation of increased risk of autism and is not just correlated with it for some other reason.

    The fact that some asshole wrote a fraudulent paper a decade ago linking vaccinations to autism, and that some celebrities and the popular media widely reported on the vaccination premise for a while, has so distorted the conversation that much more likely causes are never discussed.

    Someone may have done this but there are probably 20 factors that I could think of off the top of my head that have changed significantly among the mass of the general population in the last 1/2/3/4/5 decades that could be a factor in the increasing diagnoses of autism. Many of them are medications. Anti-depressants, birth control pills, obesity of parents, age of parents, food additives, lack of regular exercise among parents, smoking or not smoking among parents, breast feeding versus formula. And so many more. And these could have been caused within the parents when they were in the womb or as young children and then passed on 30 years later to children. I mean seriously boxers vs briefs vs boxer briefs could actually be a cause of increased autism in children born in recent decades.

    The problem with linking it definitively with vaccinations is that it makes logical sense. Like all conspiracy theories, there is this big kernel of truth that can make it appear to be an answer to a complicated question. 90+ % of people get vaccinated and if you have an increase in something that affects randomly everyone in a population then that is a great answer to a complicated problem. But there are dozens, maybe hundreds of possible “solutions” to the mystery of the increase in recorded autism. Things that affect huge percentages of the population just like vaccinations. And of course it could be the combinations of a few things. What if it is a certain type of birth control pills AND having a child over 34 years AND being obese AND using formula. What if it takes a combination of factors to really see a spike in the rate of autism?

    Correlation is not useful at all when you are looking at vaccination because 90 % or more people are vaccinated. You can’t really get good data in that case. The people who did not get vaccinated are going to likely be a poor sample. Mennonites and people that don’t look after their children well and some people that are just forgetful and most importantly already sick or immune compromised children that can not be vaccinated. If you don’t vaccinate your children, at least before this hysteria about autism, I think the sample is going to be poor. Autism is rare, even now, quite rare, so in order to get the sample sizes needed to really compare vaccinated versus non vaccinated children it could be nearly impossible to get a non biased sample.

    Anyway I really like your blog Phylis and thinking about this issue and statistics. It is like almost everything in the world…. very complicated. And in this age of science everyone wants “answers” to complicated questions. And mostly there are no definitive answers. Yet on the morning news there is a daily “science” story saying butter is bad for you. Wait margarine is worse for you. This causes cancer and this prevents cancer. It is all noise and some things get blared from a loudspeaker for years.

