Surveys

Introduction

Surveys have many stages, with different people (and organisations) involved at each stage, so good communication is very important! Statistics is useful at many points during the survey, but especially for design and analysis.

Excellent background information at: amstat.org

Goal setting

Vital to define objectives, which may involve specific accuracy requirements (eg < 10% margin of error). Formality depends on:

if sponsor and researcher are separate people/organisations
how transparent survey process needs to be (very for government)
importance of survey
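
An accuracy requirement such as a margin of error translates directly into a minimum sample size. The sketch below is only an illustration and rests on assumptions not stated in these notes: the target is a proportion, the margin is absolute (eg ±0.10), the confidence level is 95% (z = 1.96), and the sample is a simple random sample. The function name and parameters are made up for the example.

```python
import math


def sample_size_for_proportion(margin, p=0.5, z=1.96, population=None):
    """Smallest simple-random-sample size that estimates a proportion
    to within +/- margin (assumed absolute) at the given z value.

    p=0.5 is the most conservative guess for the true proportion.
    If a population size is given, a finite population correction is applied.
    """
    n0 = (z ** 2) * p * (1 - p) / margin ** 2
    if population is not None:
        # Finite population correction: fewer units are needed
        # when the sample is a sizeable share of the population.
        n0 = n0 / (1 + (n0 - 1) / population)
    return math.ceil(n0)


if __name__ == "__main__":
    print(sample_size_for_proportion(0.10))
    print(sample_size_for_proportion(0.05))
    print(sample_size_for_proportion(0.10, population=1000))
```

Halving the margin roughly quadruples the required sample, which is one concrete form of the cost versus accuracy compromise.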

Survey design

Survey design involves:

choosing a data collection methodology (eg. face-to-face, phone, mail, email, WWW)
questionnaire design
sample design and analysis planning
estimating costs

Try to achieve the objectives in a cost-effective manner; this usually involves some compromise between cost, speed and accuracy.

Sample design and selection

Many design options (eg. stratification, clustering, etc.). Has major implications for analysis.
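
As one example of a design option, a stratified sample with proportional allocation can be sketched as follows. This is an illustration only: the stratum names, the frame layout (a dict of lists) and the largest-remainder rounding rule are assumptions made for the example, and other allocations (eg optimal allocation) are equally possible.

```python
import random


def proportional_allocation(stratum_sizes, n):
    """Split a total sample of n across strata in proportion to stratum size.

    Rounding uses the largest-remainder rule so the allocations sum to n.
    """
    total = sum(stratum_sizes.values())
    exact = {h: n * size / total for h, size in stratum_sizes.items()}
    alloc = {h: int(x) for h, x in exact.items()}
    leftover = n - sum(alloc.values())
    # Give the remaining units to the strata with the largest fractional parts.
    by_remainder = sorted(exact, key=lambda h: exact[h] - alloc[h], reverse=True)
    for h in by_remainder[:leftover]:
        alloc[h] += 1
    return alloc


def stratified_sample(frame, n, rng):
    """frame maps stratum name -> list of unit ids (hypothetical layout).

    Draws an independent simple random sample within each stratum.
    """
    sizes = {h: len(units) for h, units in frame.items()}
    alloc = proportional_allocation(sizes, n)
    return {h: rng.sample(frame[h], alloc[h]) for h in frame}


if __name__ == "__main__":
    frame = {
        "urban": list(range(0, 600)),
        "rural": list(range(600, 900)),
        "remote": list(range(900, 1000)),
    }
    sample = stratified_sample(frame, 50, random.Random(0))
    print({h: len(units) for h, units in sample.items()})
```

The analysis then has to use stratum-level estimates and weights, which is why the design choice carries through to the analysis stage.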

Data collection

Contact selected respondents, obtain completed questionnaires. Statistics involved here in design decisions (quotas, estimated time etc).

Data capture and cleaning

Data entry: often from paper questionnaires; typically a proportion is re-entered to assess quality.
Coding: qualitative to quantitative.
Data editing: eliminate inconsistent data, deal with outliers.

Weighting and imputation

Data analysis and tables

Many techniques available; cross-tabulation is ubiquitous. Need to estimate random sampling error.
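
A cross-tabulation is just a count of responses for each combination of two categorical variables. A minimal sketch in plain Python follows; the record layout and the variable names ("region", "answer") are invented for the example, and the counts are unweighted.

```python
from collections import Counter


def crosstab(records, row_var, col_var):
    """Return {row_value: {col_value: count}} for two categorical variables."""
    counts = Counter((r[row_var], r[col_var]) for r in records)
    table = {}
    for (row, col), n in counts.items():
        table.setdefault(row, {})[col] = n
    return table


def row_percentages(table):
    """Convert each row of counts into percentages of that row's total."""
    result = {}
    for row, cells in table.items():
        total = sum(cells.values())
        result[row] = {col: 100 * n / total for col, n in cells.items()}
    return result


if __name__ == "__main__":
    # Hypothetical responses, for illustration only.
    records = [
        {"region": "North", "answer": "yes"},
        {"region": "North", "answer": "yes"},
        {"region": "North", "answer": "no"},
        {"region": "South", "answer": "no"},
    ]
    table = crosstab(records, "region", "answer")
    print(table)
    print(row_percentages(table))
```

With a weighted survey the counts would be replaced by sums of weights, and each cell percentage still needs a sampling error attached.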

Survey sampling

Sampling methods

probability sampling (each object has known non-zero probability of being selected, calculate sampling error)
judgement based
quota sampling (appeals to idea of representativeness, but can produce substantial bias)
convenience samples
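
The advantage of probability sampling is that the sampling error can be calculated from the sample itself. A minimal sketch for the simplest case, assuming a simple random sample without replacement and a yes/no (1/0) outcome; the function name and the toy population are assumptions for the example.

```python
import math
import random


def srs_proportion_estimate(population, n, rng):
    """Draw a simple random sample of size n (without replacement) from a
    list of 0/1 values and return (estimated proportion, standard error).

    The standard error includes the finite population correction 1 - n/N.
    """
    sample = rng.sample(population, n)
    p_hat = sum(sample) / n
    fpc = 1 - n / len(population)
    se = math.sqrt(fpc * p_hat * (1 - p_hat) / (n - 1))
    return p_hat, se


if __name__ == "__main__":
    # Toy population: 30% of 1000 units have the characteristic.
    population = [1] * 300 + [0] * 700
    p_hat, se = srs_proportion_estimate(population, 100, random.Random(42))
    # An approximate 95% margin of error is 1.96 standard errors.
    print(p_hat, se, 1.96 * se)
```

No comparable calculation is available for judgement, quota or convenience samples, because the selection probabilities are unknown.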

Terminology

Units: the objects to be surveyed.
Survey population: collection of units that results should describe or explain.
Sample: subset of population that is surveyed.
Sampling frame: method of contacting selected sample units, including information needed to select them.

Sampling frames

Simple example is a list, but more generally a frame is any procedure and data that effectively enable selection of a sample. Good frames require effort to maintain, and most frames are imperfect (exhibiting undercoverage, duplicated units, out-of-date or missing data).

Sampling frames for households and individuals

No list of all occupied private dwellings in New Zealand, but some have been developed. Different frames for telephone and face-to-face surveys. Once a household has been selected, need some way of selecting people in that household (eg. Kish grid, last birthday technique).

Telephone sampling

Undercoverage a fundamental problem (only 92% of houses have landline, < 80% of Maori/PI houses). Duplicates also occur.

White pages : Telecom sells random samples (but doesn’t include unlisted numbers, ~15%), but may be cheaper to use (out-of-date) paper directories
Random digit dialing (naive approach: < 10% success; better approaches up to 60% success)

Household sampling

Multistage approach widely used – area sample from NZ Geo system, list all households in area, select random number. Many variations (eg. random route)
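
The multistage idea can be sketched as a two-stage sample: pick areas at random, then pick a fixed number of households within each chosen area. This is an illustration under assumptions that go beyond the notes: areas are selected with equal probability (real designs often select areas with probability proportional to size), and the frame is a hypothetical dict mapping each area to its listed households.

```python
import random


def two_stage_sample(areas, n_areas, n_per_area, rng):
    """areas maps area id -> list of household ids (hypothetical layout).

    Stage 1: simple random sample of areas.
    Stage 2: simple random sample of households within each chosen area.
    Returns a list of (area, household, design_weight) tuples, where the
    design weight is the inverse of the household's inclusion probability.
    """
    chosen = rng.sample(sorted(areas), n_areas)
    sample = []
    for area in chosen:
        households = areas[area]
        k = min(n_per_area, len(households))
        # P(selected) = P(area chosen) * P(household chosen | area chosen)
        prob = (n_areas / len(areas)) * (k / len(households))
        for hh in rng.sample(households, k):
            sample.append((area, hh, 1 / prob))
    return sample


if __name__ == "__main__":
    # Toy frame: 20 areas with 10 to 29 households each.
    areas = {f"A{i}": [f"A{i}-H{j}" for j in range(10 + i)] for i in range(20)}
    for row in two_stage_sample(areas, 5, 4, random.Random(1))[:5]:
        print(row)
```

Because households in large areas have a smaller chance of selection here, they carry larger design weights; ignoring those weights in the analysis would bias the results.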

Business frames

Business directory : excellent frame held by StatsNZ (280,000 enterprises), but not available for market research.
Dun & Bradstreet : few duplicates, useful auxiliary info
UBD : some auxiliary info, substantial undercoverage (~60%)
Yellow pages : more duplicates, undercoverage

Probability Sampling

See:

Non-response

Incentives very important! Weighting can help to reduce effects of unit non-response, but need population data. Post-stratification and rim-weighting (for multiple strata) are most common methods.
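
Rim-weighting (also called raking or iterative proportional fitting) adjusts the weights one variable at a time until the weighted sample totals match the known population totals for every variable. The sketch below is a minimal illustration, not a production routine: the variable names, population totals and fixed iteration count are assumptions, and it requires every target category to appear in the sample.

```python
def rake_weights(rows, targets, n_iter=50):
    """rows: list of dicts giving each respondent's category per variable.
    targets: {variable: {category: known population total}}.

    Starts from equal weights and repeatedly rescales them so that the
    weighted total in each category matches the population total.
    """
    weights = [1.0] * len(rows)
    for _ in range(n_iter):
        for var, totals in targets.items():
            # Current weighted total in each category of this variable.
            current = {}
            for w, row in zip(weights, rows):
                current[row[var]] = current.get(row[var], 0.0) + w
            # Scale every weight by (population total / weighted total).
            weights = [
                w * totals[row[var]] / current[row[var]]
                for w, row in zip(weights, rows)
            ]
    return weights


if __name__ == "__main__":
    # Hypothetical sample of 10 respondents and made-up population totals.
    rows = (
        [{"sex": "F", "age": "young"}] * 3
        + [{"sex": "F", "age": "old"}] * 2
        + [{"sex": "M", "age": "young"}] * 1
        + [{"sex": "M", "age": "old"}] * 4
    )
    targets = {
        "sex": {"F": 500, "M": 500},
        "age": {"young": 400, "old": 600},
    }
    for row, w in zip(rows, rake_weights(rows, targets)):
        print(row, round(w, 2))
```

With a single variable in `targets` the first pass already matches the totals exactly, which is ordinary post-stratification.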

Data checking and imputation

Checking: for consistency, for outliers
Editing: recontact, replace with missing, impute
Must document!

Methods for missing data:

drop any units with missing values (end up with no data!)
pairwise deletion (can be severely biased, not all methods can use)
impute (mean, mean + simulated error, mean + random residual, random hot-deck imputation, NN hotdeck imputation)
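
Two of the imputation methods listed above can be sketched in a few lines. This is a simplified illustration: missing values are represented by `None` (an assumption for the example), and the hot-deck version draws donors from the whole dataset rather than from within imputation classes of similar units.

```python
import random
import statistics


def mean_impute(values):
    """Replace each missing value (None) with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = statistics.mean(observed)
    return [mean if v is None else v for v in values]


def random_hot_deck_impute(values, rng):
    """Replace each missing value (None) with the value of a randomly
    chosen donor, ie a unit whose value was actually observed."""
    donors = [v for v in values if v is not None]
    return [rng.choice(donors) if v is None else v for v in values]


if __name__ == "__main__":
    incomes = [30.0, None, 50.0, 40.0, None]  # made-up data
    print(mean_impute(incomes))
    print(random_hot_deck_impute(incomes, random.Random(0)))
```

Mean imputation leaves the mean unchanged but shrinks the variance, whereas hot-deck imputation only ever fills in values that actually occurred in the data.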