Syllabus

This course will teach you to be a data analyst. You will learn how to take a large dataset, break it up into manageable pieces, and use a range of qualitative and quantitative tools to summarise it and learn what it has to tell. You will learn the importance of scepticism and curiosity, and how to communicate your findings. Each section of the course is motivated by a particular dataset, and you will gain experience working with a wide variety of data sources varying in size and quality.

We will focus our efforts on the statistical programming environment R. You will learn how to program in R, learning both the basic syntax and a range of vocabulary to help solve common problems. The only way to become a competent programmer is practice, and many of the weekly homeworks will require substantial programming.

You will also learn other skills necessary for a practising statistician. You will become familiar with the Linux command line, learn how to use LaTeX to produce scientific publications, and learn about version control for offsite backups, tracking changes over time, and collaborating with colleagues.

There are few requirements for this course. You will need some basic statistical knowledge, particularly of the linear model and its extensions, but otherwise the course is largely self-contained. Being a skilled touch typist is a big plus!

Overview

Goals

Structure

The course is based around five data problems:

We'll also use another dataset on baseball for one project, and you'll have the opportunity to clean and compile your own data for another.

Outline

Assessment

Please hand in a physical version of your homework and projects - I will write comments on it and give it back to you. An electronic version will be accepted under duress, but please don't make a habit of it.

All grades will be posted electronically on owl-space. It's your responsibility to double-check that I have correctly entered your grade from your assignment. Please let me know if I've made a mistake.

Homeworks and projects have a nominal due date, but the late penalties are small - there is a flat 20% penalty after model answers have been posted. That said, I encourage you to hand your assignments in on the due date: it makes life easier for me, and it will prevent you from becoming swamped with work late in the course.

Grading scale

The grading will be a little different to what you are used to in statistics. Your assignments will be graded according to a rubric. Each facet of the overall grade goes from 1 to 5. To get a 5 you will typically need to go above and beyond what I have covered in class and show me something new.

The numbers of the rubric convert to letter grades in the following fashion:

(But note that last semester no one got less than a 3.8.)

There is a mixture of grads and undergrads in this class, and grading standards do differ (a B- in grad school is equivalent to an F as an undergrad). Judging from past experience, grading on a common scale is unlikely to cause any problems, but if necessary I can make exceptions.

Weekly homework

To do well in this course you will need to spend 4-5 hours a week (outside of class!), and the weekly homeworks are designed to encourage you to do that. For each homework you will need to revise the week's work, as well as synthesise some new information from the help pages or the web.

Group projects

Each member of the group is responsible for every part of the project. I know group projects can be frustrating, but I hope to introduce you to some tools that should make it less painful. More details will be provided when we start the first project, but expect to produce a 20-page report detailing the analysis of a large data set.

Model answers

Homeworks and projects are open-ended, and there are no right answers. To give you a feel for what good answers look like, I'll publish (anonymously) a few of the best ones. If you don't want yours to be published, please let me know.

Final exam

For the final exam, you will be expected to build a web page of use to other learners of R, summarising and explaining one section of the course.

Collaboration and citation

For homeworks (and obviously group projects) I encourage you to work together. Please discuss the data, code and problems with one another, but do your own exploration and write-up. I expect everyone to hand in substantially different homeworks.

Please use any resources available to you. Many homeworks will explicitly encourage you to use resources on the internet, but showing extra initiative will always be appreciated. You will find R programming tough at first, so feel free to email me questions or discuss your problems with other classmates.

Note that it is not acceptable to copy verbatim from outside sources, and in most assignments even quotes will not be appropriate. Use the ideas, not the particular details. Always give credit where credit is due, so all use of outside sources should be cited: for projects you will be expected to have a formal bibliography; for homeworks, a casual citation is fine; and for code, reference the source in a comment.

Disability statement

If you have a documented disability that will impact your work in this class, please contact me to discuss your needs. You'll also need to register with the Disability Support Services Office in the Allen Center.

Related classes

Other stat computing courses like this one: