Surveys
Introduction
A survey has many stages, with different people (and organisations) involved at each stage, so good communication is very important! Statistics is useful at many points during the survey, but especially for design and analysis.
Excellent background information at: amstat.org
Goal setting
Vital to define objectives; these may involve specific accuracy requirements (eg. < 10% margin of error). Formality depends on:
- if sponsor and researcher are separate people/organisations
- how transparent the survey process needs to be (very transparent for government surveys)
- importance of survey
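As a rough illustration of turning an accuracy requirement into a sample size, the sketch below uses the standard formula for a proportion under simple random sampling, n = z^2 p(1-p) / e^2. The 95% confidence level (z = 1.96), the worst-case p = 0.5 and the function name are assumptions, not part of these notes; design effects and non-response would increase the number needed.

```python
import math

def sample_size_for_proportion(margin_of_error, p=0.5, z=1.96):
    """Approximate sample size for estimating a proportion.

    Assumes simple random sampling from a large population, so no
    finite population correction and no design effect.
    """
    n = (z ** 2) * p * (1 - p) / margin_of_error ** 2
    return math.ceil(n)

n_10 = sample_size_for_proportion(0.10)  # 10% margin of error
n_03 = sample_size_for_proportion(0.03)  # 3% margin of error
print(n_10, n_03)
```

Halving the margin of error roughly quadruples the sample size, which is one reason the accuracy requirement needs agreeing before costs are estimated.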
Survey design
Survey design involves:
- choosing a data collection methodology (eg. face-to-face, phone, mail, email, WWW)
- questionnaire design
- sample design and analysis planning
- estimating costs
Try to achieve objectives in a cost-effective manner; this usually involves some compromise between cost, speed and accuracy.
Sample design and selection
Many design options (eg. stratification, clustering). The design chosen has major implications for analysis.
Data collection
Contact selected respondents, obtain completed questionnaires. Statistics involved here in design decisions (quotas, estimated time etc).
Data capture and cleaning
- Data entry: often from paper questionnaires; typically a proportion is re-entered to assess quality
- Coding: converting qualitative responses to quantitative codes
- Data editing: eliminate inconsistent data, deal with outliers
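A minimal sketch of the re-entry quality check, assuming the re-entered records are held in the same order and with the same field names as the originals. The record layout, field names and function name are made up for illustration.

```python
def reentry_error_rate(original, reentered):
    """Share of re-entered fields that disagree with the original entry.

    Both arguments are lists of dicts (one dict per questionnaire),
    assumed to be in the same order with the same keys.
    """
    checked = 0
    mismatches = 0
    for first, second in zip(original, reentered):
        for field in first:
            checked += 1
            if first[field] != second.get(field):
                mismatches += 1
    return mismatches / checked if checked else 0.0

# Made-up records: the second questionnaire has a transposed age.
original = [{"age": 34, "region": "Auckland"}, {"age": 51, "region": "Otago"}]
reentered = [{"age": 34, "region": "Auckland"}, {"age": 15, "region": "Otago"}]
rate = reentry_error_rate(original, reentered)
print(f"{rate:.0%} of re-entered fields differ")
```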
Weighting and imputation
Data analysis and tables
Many techniques available; cross-tabulation is ubiquitous. Need to estimate random sampling error.
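The sketch below builds a small cross-tabulation with row proportions and attaches a standard error to each cell. It assumes simple random sampling, so the standard error is sqrt(p(1-p)/n); a stratified or clustered design would need a design-based variance estimate instead. The data and function name are illustrative only.

```python
import math
from collections import Counter

def crosstab_with_se(rows, cols):
    """Row proportions for a two-way table, with a standard error per cell.

    Assumes simple random sampling (no design effect).
    Returns {(row, col): (count, proportion, standard_error)}.
    """
    cell_counts = Counter(zip(rows, cols))
    row_totals = Counter(rows)
    table = {}
    for (r, c), count in cell_counts.items():
        n = row_totals[r]
        p = count / n
        se = math.sqrt(p * (1 - p) / n)
        table[(r, c)] = (count, p, se)
    return table

# Made-up responses for eight people.
sex = ["F", "F", "F", "F", "M", "M", "M", "M"]
answer = ["yes", "yes", "yes", "no", "yes", "no", "no", "no"]
table = crosstab_with_se(sex, answer)
for (r, c), (count, p, se) in sorted(table.items()):
    print(f"{r:2} {c:4} n={count} p={p:.2f} se={se:.3f}")
```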
Survey sampling
Sampling methods
- probability sampling (each unit has a known, non-zero probability of being selected, so sampling error can be calculated)
- judgement-based
- quota sampling (appeals to idea of representativeness, but can produce substantial bias)
- convenience samples
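To show why probability sampling allows the sampling error to be calculated, here is a sketch of the simplest case: a simple random sample without replacement, where every unit has inclusion probability n/N. The population is made up, and the standard error formula (with finite population correction) applies to this design only.

```python
import math
import random

def simple_random_sample(frame, n, seed=None):
    """Draw n units without replacement; each has inclusion probability n/N."""
    rng = random.Random(seed)
    return rng.sample(frame, n)

def mean_with_se(values, population_size):
    """Sample mean and its standard error under simple random sampling."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / (n - 1)
    fpc = 1 - n / population_size  # finite population correction
    return mean, math.sqrt(fpc * variance / n)

population = list(range(1, 1001))  # made-up population of 1000 values
sample = simple_random_sample(population, 50, seed=1)
inclusion_probability = len(sample) / len(population)
mean, se = mean_with_se(sample, len(population))
print(f"inclusion probability={inclusion_probability}, mean={mean:.1f}, se={se:.1f}")
```

No equivalent calculation is available for judgement, quota or convenience samples, because the selection probabilities are unknown.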
Terminology
- Units: the objects to be surveyed
- Survey population: collection of units that the results should describe or explain
- Sample: subset of the population that is surveyed
- Sampling frame: method of contacting selected sample units, including the information needed to select them
Sampling frames
Simple example is a list, but more generally, a frame is any procedure and data that effectively enables selection of a sample. Good frames require effort to maintain, and most frames are imperfect (exhibiting undercoverage, duplicated units, out-of-date or missing data).
Sampling frames for households and individuals
There is no complete list of all occupied private dwellings in New Zealand, but some frames have been developed. Different frames are used for telephone and face-to-face surveys. Once a household has been selected, need some way of selecting people in that house (eg. Kish grid, last birthday technique).
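The last birthday technique can be sketched as follows: among the eligible household members, pick the one whose birthday fell most recently before the interview date. The household, names and function names are hypothetical, and 29 February birthdays are not handled in this sketch.

```python
from datetime import date

def days_since_last_birthday(birth_month, birth_day, today):
    """Days elapsed since the person's most recent birthday.

    Note: a 29 February birthday raises ValueError in non-leap years;
    this sketch does not handle that case.
    """
    birthday = date(today.year, birth_month, birth_day)
    if birthday > today:
        birthday = date(today.year - 1, birth_month, birth_day)
    return (today - birthday).days

def select_last_birthday(members, today):
    """members: list of (name, birth_month, birth_day) for eligible people."""
    return min(members, key=lambda m: days_since_last_birthday(m[1], m[2], today))

# Hypothetical household interviewed on 15 August 2024.
household = [("Ana", 3, 14), ("Ben", 11, 2), ("Cam", 7, 30)]
chosen = select_last_birthday(household, today=date(2024, 8, 15))
print(chosen[0])
```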
Telephone sampling
Undercoverage a fundamental problem (only 92% of houses have landline, < 80% of Maori/PI houses). Duplicates also occur.
- White pages: Telecom sells random samples (but these don’t include the unlisted 15%), though it may be cheaper to use (out-of-date) paper directories
- Random digit dialing (naive approach: < 10% success; better approaches up to 60% success)
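A sketch of the naive form of random digit dialing: take known number prefixes and append random digits. The prefixes below are invented placeholders, not real exchanges, and the function name is an assumption. Many generated numbers will not be working household lines, which is consistent with the low success rate of the naive approach.

```python
import random

def random_digit_numbers(prefixes, n, suffix_digits=4, seed=None):
    """Generate n distinct phone numbers by appending random digits to prefixes."""
    possible = len(set(prefixes)) * 10 ** suffix_digits
    if n > possible:
        raise ValueError("more numbers requested than can be generated")
    rng = random.Random(seed)
    numbers = set()  # a set avoids dialling the same number twice
    while len(numbers) < n:
        prefix = rng.choice(prefixes)
        suffix = rng.randrange(10 ** suffix_digits)
        numbers.add(f"{prefix}{suffix:0{suffix_digits}d}")
    return sorted(numbers)

prefixes = ["09-555", "04-555"]  # hypothetical prefixes for illustration
numbers = random_digit_numbers(prefixes, 5, seed=42)
print(numbers)
```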
Household sampling
Multistage approach widely used: take an area sample from NZ Geo system, list all households in each selected area, then randomly select a number of them. Many variations (eg. random route).
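The multistage approach might be sketched as a two-stage sample: select areas, then select households from the listing for each selected area. This version assumes areas are chosen with equal probability (in practice selection proportional to size is also common), and the area and household identifiers are made up.

```python
import random

def two_stage_sample(areas, n_areas, n_households, seed=None):
    """areas: dict mapping area id -> list of household ids (the listing).

    Returns (area, household, inclusion_probability) tuples.
    """
    rng = random.Random(seed)
    chosen_areas = rng.sample(sorted(areas), n_areas)  # stage 1: areas
    selected = []
    for area in chosen_areas:
        households = areas[area]
        take = min(n_households, len(households))
        # Overall probability = P(area selected) * P(household | area selected)
        prob = (n_areas / len(areas)) * (take / len(households))
        for household in rng.sample(households, take):  # stage 2: households
            selected.append((area, household, prob))
    return selected

# Made-up frame: 10 areas with 20 listed households each.
areas = {f"area{a}": [f"area{a}-hh{h}" for h in range(20)] for a in range(10)}
sample = two_stage_sample(areas, n_areas=3, n_households=5, seed=7)
print(len(sample), sample[0])
```

Keeping the inclusion probability with each selected household makes it straightforward to compute design weights (1 / probability) later.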
Business frames
- Business directory: excellent frame held by StatsNZ (280,000 enterprises), but not available for market research
- Dun & Bradstreet: few duplicates, useful auxiliary info
- UBD: some auxiliary info, substantial undercoverage (~60%)
- Yellow pages: more duplicates, undercoverage
Probability Sampling
See:
Non-response
Incentives very important! Weighting can help to reduce effects of unit non-response, but need population data. Post-stratification and rim-weighting (for multiple strata) are most common methods.
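A sketch of both weighting methods, under stated assumptions. Post-stratification gives each respondent the weight N_h / n_h for their stratum. Rim-weighting (also known as raking or iterative proportional fitting) adjusts weights to match each margin in turn and repeats until they settle. The population totals and respondents are invented, and a fixed iteration count stands in for a proper convergence check.

```python
def post_stratification_weights(sample_strata, population_counts):
    """Weight = population count / sample count for each respondent's stratum."""
    sample_counts = {}
    for stratum in sample_strata:
        sample_counts[stratum] = sample_counts.get(stratum, 0) + 1
    return [population_counts[s] / sample_counts[s] for s in sample_strata]

def rim_weights(respondents, targets, iterations=50):
    """respondents: list of dicts; targets: {variable: {category: population total}}."""
    weights = [1.0] * len(respondents)
    for _ in range(iterations):
        for variable, totals in targets.items():
            # Current weighted total in each category of this variable.
            current = {}
            for w, r in zip(weights, respondents):
                current[r[variable]] = current.get(r[variable], 0.0) + w
            # Scale weights so this margin matches its population totals.
            weights = [w * totals[r[variable]] / current[r[variable]]
                       for w, r in zip(weights, respondents)]
    return weights

ps_weights = post_stratification_weights(["a", "a", "b"], {"a": 100, "b": 50})

respondents = [
    {"sex": "F", "age": "young"}, {"sex": "F", "age": "old"},
    {"sex": "F", "age": "old"}, {"sex": "M", "age": "young"},
    {"sex": "M", "age": "young"}, {"sex": "M", "age": "old"},
]
targets = {"sex": {"F": 500, "M": 500}, "age": {"young": 400, "old": 600}}
weights = rim_weights(respondents, targets)
print(ps_weights, [round(w, 1) for w in weights])
```

Rim-weighting needs only the marginal population totals for each variable, whereas post-stratification needs a population count for every stratum (every combination of categories).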
Data checking and imputation
- Checking: for consistency, for outliers
- Editing: recontact, replace with missing, impute
Must document!
Methods for missing data:
- drop any units with missing values (can end up with no data!)
- pairwise deletion (can be severely biased, and not all analysis methods can use it)
- impute (mean, mean + simulated error, mean + random residual, random hot-deck imputation, nearest-neighbour (NN) hot-deck imputation)
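Three of the imputation methods listed can be sketched in a few lines. Missing values are represented by None, and the nearest-neighbour version assumes one numeric auxiliary variable that is observed for every unit. The data and function names are made up for illustration.

```python
import random

def mean_impute(values):
    """Replace each missing value with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]

def random_hot_deck_impute(values, seed=None):
    """Replace each missing value with a randomly chosen observed value (donor)."""
    rng = random.Random(seed)
    donors = [v for v in values if v is not None]
    return [rng.choice(donors) if v is None else v for v in values]

def nearest_neighbour_hot_deck(values, auxiliary):
    """Fill each missing value from the donor whose auxiliary value is closest."""
    donors = [(a, v) for a, v in zip(auxiliary, values) if v is not None]
    filled = []
    for a, v in zip(auxiliary, values):
        if v is None:
            v = min(donors, key=lambda d: abs(d[0] - a))[1]
        filled.append(v)
    return filled

income = [40, None, 55, 61, None]  # made-up values with two missing
age = [25, 27, 48, 52, 51]         # auxiliary variable, fully observed

by_mean = mean_impute(income)
by_hot_deck = random_hot_deck_impute(income, seed=3)
by_nn = nearest_neighbour_hot_deck(income, age)
print(by_mean, by_hot_deck, by_nn)
```

Mean imputation gives every missing unit the same value and so understates variability; the hot-deck methods reuse real observed values, which preserves more of the distribution.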