Once we know R at a basic level, the best way to sharpen our R skills is by working on a data analysis problem head-on. In this course, we take a use-case-based approach by tackling the New York City taxi data using R. There are ample lab exercises to reinforce concepts and learn new ones.
We do not shy away from using third-party packages when doing so simplifies our work, in particular GIS packages, ggplot2 for plotting, and dplyr for data processing. However, only dplyr is central to the course and explored in depth. Data visualization and GIS packages are out of scope: a basic explanation is given and all the code is provided for users who want to explore them further on their own time.
While we do not cover Microsoft R Server (MRS) during this course, a secondary goal of the course is to prepare users for MRS and its set of tools and capabilities for scalable big-data processing and analytics. So this course can also be viewed as a prerequisite for learning to use MRS.
After completing this course, participants will be able to use R to perform a thorough data analysis task that starts with ingesting a raw flat file and continues through exploratory data analysis, with plenty of summaries and visualizations along the way. Participants will gain an appreciation for how packages such as dplyr help us set up robust and easy-to-modify data pipelines and for ggplot2 and its straightforward notation, and will learn to think more like an R programmer and write more efficient and straightforward R code.
We will follow this workflow during the course:
- Setting up the environment
- Loading data into R
- Inspecting the data: We run sanity checks on the data and get a feel for the data
- Cleaning the data: We deal with column types, especially with the factor columns
- Being more efficient: We learn how pre-processing can lead to more efficiency
- Creating new features: Starting with the raw data, we ask how we can make it more useful to the analysis by adding relevant features
- Data summary and visualization: We explore various ways to summarize the data using both base R and dplyr, and we use ggplot2 to visualize the results