I recently completed Coursera's "The Data Scientist's Toolbox" course, presented by Johns Hopkins University. This 4-week course offers a broad overview of what data science is and how to set up your "toolbox" for R programming and data analysis. Below are the takeaways that inspired me to keep learning about data science and R:
-
Data science begins with formulating the right questions and finding the right data set before bringing in math/statistics and hacking (programming) knowledge. This is part of the experimental design process.
-
A data scientist is broadly defined as someone:
“who combines the skills of software programmer, statistician and storyteller slash artist to extract the nuggets of gold hidden under mountains of data”
-
RStudio is a go-to graphical application (an integrated development environment, or IDE) for starting R projects right away.
-
RStudio plays well with GitHub.
-
R's strength is statistical computing.
-
R has an R Markdown package (rmarkdown)! It can "knit" your project together into an HTML, PDF, or Word document.
-
There are 6 general categories of data analyses:
- descriptive: summarizing a data set (U.S. census)
- exploratory: exploring data to find relationships (% of women in specific work sectors)
- inferential: generalizing from a small sample to a larger population (air pollution measured in a small area is used to infer how pollution affects U.S. residents overall)
- predictive: using historical data to predict what happens next (elections)
- causal: finding out whether changing one variable causes a change in another (randomized drug trials)
- mechanistic: understanding the exact changes in variables that lead to changes in other variables (materials science experiments)
-
Experimental design begins with choosing variables (independent and dependent) in order to formulate a hypothesis (expected outcome) about how changing the independent variable will affect the dependent variable.
-
An independent variable (factor) is usually plotted on the x-axis.
-
The dependent variable is usually plotted on the y-axis.
-
Big data is defined by volume (more data), velocity (data is being generated quickly), and variety (data is available in several formats).
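
To make a few of the takeaways above concrete, here is a minimal sketch of the difference between a descriptive and an inferential analysis, and of an independent variable (x) versus a dependent variable (y). It is written in Python with only the standard library rather than R, purely for illustration. The data set (hours studied and quiz scores) and the function names `describe`, `mean_confidence_interval`, and `fit_line` are made up for this example and do not come from the course.

```python
import math
import statistics

# Hypothetical data, invented for illustration only.
hours = [1, 2, 3, 4, 5]         # independent variable (x-axis)
scores = [52, 58, 65, 70, 75]   # dependent variable (y-axis)


def describe(values):
    """Descriptive analysis: summarize only the data in hand."""
    return {
        "n": len(values),
        "mean": statistics.mean(values),
        "stdev": statistics.stdev(values),  # sample standard deviation
    }


def mean_confidence_interval(values, z=1.96):
    """Inferential analysis: a rough 95% interval for the population mean.

    Assumes a normal approximation (z = 1.96). With a sample this small,
    a t-distribution would be the more careful choice.
    """
    mean = statistics.mean(values)
    stderr = statistics.stdev(values) / math.sqrt(len(values))
    return (mean - z * stderr, mean + z * stderr)


def fit_line(x, y):
    """Least-squares line y = slope * x + intercept."""
    x_mean = statistics.mean(x)
    y_mean = statistics.mean(y)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return slope, intercept


if __name__ == "__main__":
    print("Descriptive summary of scores:", describe(scores))
    print("Approximate 95% interval for the mean:", mean_confidence_interval(scores))
    slope, intercept = fit_line(hours, scores)
    # Predictive use: estimate the score for a value of x not in the data.
    print("Predicted score after 6 hours:", slope * 6 + intercept)
```

The summary describes only these five scores, while the interval tries to say something about a larger group they were sampled from. The fitted line treats hours as the independent variable and score as the dependent one, matching the x-axis and y-axis convention in the list.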
Important note: I did not pay to take this course. I audited it for its core content, so I was unable to take any of the module quizzes or submit a course project. This is a good course for exploring whether data science is an area you'd like to get to know.
Statistics was a weakness of mine in undergraduate coursework, but asking questions, doing math, and writing code don't scare me. So! Next up in my exploration of data science: R Programming on Coursera.