Sampling and Data Quality Issues in Internet Surveys
Jenny Marlar of The Gallup Organization moderated this AAPOR Annual Conference session today on sampling and data quality concerns for online surveys.
The Performance of Different Calibration Models in Non-Probability Online Surveys
Julia Clark and Neale El-Dash of Ipsos Public Affairs looked at the calibration performance of online surveys conducted for the 2012 U.S. Presidential election. Julia opened by noting that average field times have declined from 5 to 6 days in the 1960s to less than a full day in 2012. The number of polls per month has also increased dramatically from 1960 to 2013.
For the 2012 election cycle, Reuters asked Ipsos for interactive visuals, speed, within-day polling and very narrow granularity (Presidential approval among Iraq veterans, for instance). Ipsos did 163,000 online interviews over 12 months with a constantly changing questionnaire, with weekly tracking by demographics. Ipsos used its Ampario river sample gathered from 300 non-panel partners with 22 million unique hits per year, to make the online survey work less like a panel.
Ipsos used Bayesian credibility intervals as an alternative to margin of error. Ipsos was among the top pollsters for accuracy, with its final poll giving Obama a 2-point lead. In fact, 2012 was a good cycle for online polling.
Post-election, Ipsos applied different calibration methods to see how accuracy could have been improved now that the final results are known. Neale discussed the original methodology, which used raking weights on demographic variables for voters and nonvoters using CPS November 2009 results, weighting for age, gender, census division and more. Their Bayesian estimator combined their prior (the market average) with the current sample estimate to obtain the final published estimate. The actual weight assigned to the prior changed over time but was, on average, 20%.
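As a rough illustration of how a prior and a sample estimate can be blended, here is a minimal sketch. It assumes a simple fixed-weight average; the presentation did not spell out the actual Ipsos estimator, so the function name, the inputs and the constant 20% weight are all assumptions made for illustration.

```python
def blend_estimate(prior: float, sample: float, prior_weight: float = 0.20) -> float:
    """Weighted average of a prior (e.g. a market average) and a sample estimate.

    prior_weight is the share given to the prior. 0.20 mirrors the average
    weight mentioned in the talk, but the real weight varied over time.
    """
    if not 0.0 <= prior_weight <= 1.0:
        raise ValueError("prior_weight must be between 0 and 1")
    return prior_weight * prior + (1.0 - prior_weight) * sample


# Hypothetical inputs: the market average shows a 1-point Democratic lead
# and the current sample shows a 3-point lead.
published = blend_estimate(prior=1.0, sample=3.0)
print(round(published, 2))
```

With a 20% weight on the prior, the published figure stays close to the sample but is pulled slightly toward the market average.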
Ipsos looked at calibration models including no weights, demographics, demographics without race, demographics within states, demographics and party ID, demographics without race but with party ID, demographics with sample source optimized, and the Bayesian estimate. For gauging performance, they used the market average by week and then the final vote. Looking at Democratic lead, the only weighting method that performed worse than no weights at all was weighting by demographics within states. The best performance came from demographics and party ID, with the Bayesian estimator among the top 3 models. The model underperformed for Hispanics and for households with incomes under $30,000.
The Bayesian estimate performs well. It does a good job of optimizing on variance and bias, but a simpler weighting scheme might ultimately perform better. Online overall is still misrepresenting minorities and the less affluent; in a blended online sample, the sample source matters but the optimal mix has yet to be determined.
How Do Different Sampling Techniques Perform in a Web-Only Survey?
Ipek Bilgen of NORC at the University of Chicago presented results from a comparison of a random-sample email blast with an address-based sampling (ABS) approach. With 71% of U.S. households using the Internet, new web-based sampling approaches are emerging. However, the online population is still younger, more educated, and of higher socioeconomic status (SES). NORC examined different sampling strategies to address this skew and to identify how response rates varied by sampling and incentive strategy. Survey results were benchmarked against the General Social Survey (GSS).
The study used four sampling methods: ABS, email blasts, Facebook and Google. The incentives were $2, $5 and $10 Amazon gift cards. For emailing, the sample frame was an InfoUSA email address list in 12 strata of 3 age groups and 4 regions. The invite was followed by two reminders. The ABS sample frame used the USPS DSF (Delivery Sequence File) for 4 strata (the geographical region). The invite was followed by two followup letters; respondents received a thank-you postcard.
The 21-question web survey used GSS questions for comparability and included demographic variables and substantive variables on computer and internet use. Results were calibrated by raking weights to the ACS on region, age, sex, etc.
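For readers unfamiliar with raking (iterative proportional fitting), here is a minimal sketch of the general technique. The toy respondents, the variable names and the target shares are invented for illustration and are not NORC's data or ACS figures.

```python
from collections import defaultdict


def rake(rows, targets, max_iter=100, tol=1e-9):
    """Return one weight per row so weighted margins match the target shares.

    rows    : list of dicts, e.g. {"sex": "F", "age": "young"}
    targets : {variable: {category: population share}}; shares sum to 1
    """
    weights = [1.0] * len(rows)
    for _ in range(max_iter):
        max_change = 0.0
        for var, shares in targets.items():
            # Current weighted total for each category of this variable.
            totals = defaultdict(float)
            for row, w in zip(rows, weights):
                totals[row[var]] += w
            total_w = sum(weights)
            # Scale each category so its weighted share hits the target.
            for i, row in enumerate(rows):
                factor = shares[row[var]] * total_w / totals[row[var]]
                weights[i] *= factor
                max_change = max(max_change, abs(factor - 1.0))
        if max_change < tol:
            break
    return weights


def weighted_share(rows, weights, var, category):
    """Weighted proportion of rows falling in the given category."""
    hit = sum(w for row, w in zip(rows, weights) if row[var] == category)
    return hit / sum(weights)


# Toy sample that skews female and young (hypothetical data).
rows = (
    [{"sex": "F", "age": "young"}] * 3
    + [{"sex": "M", "age": "young"}] * 2
    + [{"sex": "F", "age": "old"}] * 1
    + [{"sex": "M", "age": "old"}] * 2
)
# Hypothetical population margins, standing in for a benchmark such as the ACS.
targets = {"sex": {"F": 0.5, "M": 0.5}, "age": {"young": 0.4, "old": 0.6}}

weights = rake(rows, targets)
print(round(weighted_share(rows, weights, "sex", "F"), 3))
print(round(weighted_share(rows, weights, "age", "young"), 3))
```

Raking only matches the margins it is given. It cannot correct for variables left out of the targets, which is why benchmarking against an external survey still matters.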
Of 100,000 emails sent, only 197 recipients took the survey, most likely because the invite was cloud-marked as spam, resulting in low delivery rates. The ABS sample had 750 responses from 10,000 mailings, and its response rate increased with higher incentives. The Internet sample underrepresented the less educated and lower-income groups and, unsurprisingly, overestimated home Internet access and mobile Internet usage. Respondents were also more likely to get their news from the Internet than from TV, compared with what the GSS shows.
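To put the two yields side by side, the quick calculation below divides completes by invitations. This is a crude yield rather than a formal response rate (it ignores undelivered invitations and ineligible addresses), and the helper name is just for illustration.

```python
def completion_yield(completes: int, invited: int) -> float:
    """Completed surveys divided by invitations sent (a crude yield)."""
    return completes / invited


email_yield = completion_yield(197, 100_000)  # email blast figures from the talk
abs_yield = completion_yield(750, 10_000)     # ABS mailing figures from the talk

print(f"email blast: {email_yield:.3%}")
print(f"ABS mailing: {abs_yield:.1%}")
print(f"ABS yield is about {abs_yield / email_yield:.0f} times the email yield")
```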
The ABS approach reaches a different web population than the email blast, so different strategies might provide access to different populations.
Can We Effectively Sample From Social Media Sites?
Michael Stern of NORC then shared results from sampling via ads on Facebook and Google. He began by pointing out that mobile surveys might fit especially well with social media sampling, given the prevalence of social media use on smartphones. Most social media research is passive (scraping the sites), but this was active, focusing on recruiting from social media.
Two prominent examples of Facebook recruiting are Bhutta (2012) and Ramo and Prochaska (2012), both targeting low-incidence groups. The Google model of flu prediction, which analyzes searches related to flu, overpredicted occurrence this year.
On Google and Facebook, advertisers bid on clicks. To prevent people from taking the survey again and again, NORC used PINs and email addresses, but the system could still be spammed by people interested simply in earning the incentive. The click-through rate for 2 million people was 0.018%, which was great for a survey, Facebook said. Among the different ad images, the paper-and-pencil survey image did better than more technological imagery. On Facebook, NORC was able to say that the incentive was an Amazon gift card, but Google prevented "Amazon" from being used.
Facebook ads went to the general network, but Google ads were targeted by keywords. NORC chose a selection of unrelated keywords. Of people who click on the survey ad, the majority are young, but answers are much more distributed for people who complete the survey. The results from the Facebook ad are fairly random when compared to the GSS, but there was a closer correspondence between Google and the GSS. The results were not weighted.
Google outperformed the Facebook ads on results, speed of response and cost: $12 per complete vs. $29 per complete. The Google ad was touted on Slickdeals.com, which led to spam responses.
How Far Have We Come?
J. Michael Dennis of GfK Knowledge Networks discussed the lingering digital divide and its impact on the representativeness of Internet surveys. Internet adoption has slowed. How do the characteristics of the offline population differ from online?
According to Pew, Internet access is 97% for households with incomes of $75,000 and up and 94% for college graduates, but low for those with only a high school education. Has Internet penetration reached a point where the online population can adequately represent the U.S. general population? Given the persistence of the Digital Divide, should survey researchers still be concerned about sampling coverage issues?
GfK Knowledge Networks equips non-Internet households recruited through Address Based Sampling with netbooks and pays for their ISP service. The study compared a general population sample of 3,000 online-only respondents with a sample of 3,000 that included both online and "offline" households. The data were weighted. Estimates for 5 of 15 demographic variables not used for weighting were statistically different between the two samples, with absolute differences reaching 3.8 points. For 7 of 25 public affairs questions, the samples differed by an average of 1.4 percentage points (note: points, not percentage change). For 9 of 25 health variables the differences were significant, ranging from 4.5 points on being uninsured to 4.3 points on wine consumption; 11 of 15 technology variables were statistically different. Including non-Internet households affects the relationships between variables as well.
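To give a feel for what "statistically different" means at these sample sizes, here is a minimal two-proportion z-test sketch. It assumes two independent simple random samples and ignores weighting and any overlap between the samples, so it is not the test GfK ran. The 19.5% and 15.0% rates are hypothetical; only the 4.5-point gap and the sample sizes of 3,000 come from the talk.

```python
import math


def two_proportion_z(p1: float, n1: int, p2: float, n2: int):
    """Two-sided z-test for the difference between two independent proportions."""
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical rates chosen to be 4.5 points apart, with n = 3,000 per sample.
z, p_value = two_proportion_z(0.195, 3000, 0.150, 3000)
print(f"z = {z:.2f}, p = {p_value:.6f}")
```

Under these simplified assumptions a gap of that size is significant at the 5% level. Weighting typically inflates variances, so the true margin would be wider.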
This is a 2013 study updating a 2008 study, and the significant differences have persisted over time on behaviors such as recycling newspapers, recycling plastics and active participation, but gun ownership is now similar between online and offline populations.
Despite growth in Internet penetration over the years, excluding non-Internet households can still lead to over- or under-estimations for individual variables and change the magnitude of the correlations between variables.
Respondent Validation Phase II
A United Sample (uSamp) presentation studied validating respondent identity in online panels. How do we know panelists are representing their identities truthfully online? When someone signs up for an online panel, how can you tell they are who they say they are?
Validation means using procedures to verify that people responding online are "real" people: collect Personally Identifiable Information (PII), compare it to national third-party databases, and categorize respondents accordingly.
uSamp conducted 7,200 surveys during the first two weeks of January 2011: 6,000 respondents provided name, address and birthdate, and 1,200 were unwilling to. Those not validating were from the hardest-to-reach demographic groups (good news) and were 50% more likely to provide conflicting data (bad news). Validation databases do a poorer job of tracking hard-to-reach demographic groups.
The 2012 survey, of similar size, had a higher refusal rate for PII: 19% vs. 17% in 2011 (a 25% refusal rate if you include the prompt for email address). Respondents who failed verification were twice as likely to fail at least one quality check in the survey.
In conclusion, you can weed out bad actors from a volunteer panel using validation, but these same people are easily caught by quality checks. "Fair to ask whether the expected benefits of validation are sufficiently great to balance the loss of as much as a quarter of the sample for a study -- and a higher percentage among certain demographic groups?"