Exploratory Analysis: Hands-on

Author

Bongani Ncube

Welcome to interactive learning

This will go a long way toward helping you learn R hands-on, but beware: it can be addictive!

Note

The dataset contains:

  • ? observations
  • ? variables
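The counts above can be obtained directly from the data. The following is a minimal sketch, assuming that new_data is a copy of R's built-in mtcars dataset (the tutorial does not say where new_data comes from, so treat this as an assumption):

```r
# Assumption: new_data is a copy of the built-in mtcars dataset.
new_data <- mtcars

n_obs  <- nrow(new_data)  # number of observations (rows)
n_vars <- ncol(new_data)  # number of variables (columns)

# dim() returns both values at once: rows first, then columns
dim(new_data)
```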

Next, we examine the first five observations of the data. The rest of the observations are not shown. You can also see the types of variables:
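One possible way to do this in base R is sketched below. It again assumes that new_data is a copy of the built-in mtcars dataset:

```r
# Assumption: new_data is a copy of the built-in mtcars dataset.
new_data <- mtcars

# Show the first five observations only
first_five <- head(new_data, 5)
print(first_five)

# str() lists every variable together with its type
str(new_data)
```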

Use base R to explore the data

  • Use table() to look at frequencies for categorical data.
  • Now you try: table(new_data$vs)
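A short sketch of the frequency table, assuming new_data is a copy of the built-in mtcars dataset:

```r
# Assumption: new_data is a copy of the built-in mtcars dataset.
new_data <- mtcars

# Count how many observations fall into each level of vs
vs_counts <- table(new_data$vs)
print(vs_counts)

# prop.table() turns the counts into proportions
prop.table(vs_counts)
```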

  • How does qsec affect mpg?

  • Use the lm() or glm() function for this.
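A minimal sketch of such a model, assuming new_data is a copy of the built-in mtcars dataset. A simple linear regression of mpg on qsec is only one reasonable choice here. With its default Gaussian family, glm() gives the same coefficients as lm():

```r
# Assumption: new_data is a copy of the built-in mtcars dataset.
new_data <- mtcars

# Simple linear regression: mpg as the outcome, qsec as the predictor
fit_lm <- lm(mpg ~ qsec, data = new_data)
summary(fit_lm)

# The same model fitted with glm(); the default family is gaussian
fit_glm <- glm(mpg ~ qsec, data = new_data)
coef(fit_glm)
```

The slope for qsec in the summary is the estimated change in mpg for a one-unit increase in qsec.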

Comment on the results.

Working with tidyverse

Select variables, generate new variable and rename variable

We will work with these functions.

Select variables using dplyr::select()

When you work with large datasets with many columns, it is sometimes easier to select only the necessary columns to reduce the dataset size. This is possible by creating a smaller dataset (fewer variables). Then you can work on the initial part of data analysis with this smaller dataset. This will greatly help data exploration.

Note

However, for this exercise we will need all the variables for exploration, so we will select everything.

  • Select the mpg, cyl and qsec variables.
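Both selections can be sketched as follows, assuming the dplyr package is installed and new_data is a copy of the built-in mtcars dataset. The object names all_data and small_data are made up for illustration:

```r
library(dplyr)

# Assumption: new_data is a copy of the built-in mtcars dataset.
new_data <- mtcars

# Keep every column (everything() is a dplyr selection helper)
all_data <- select(new_data, everything())

# Keep only the three columns of interest
small_data <- select(new_data, mpg, cyl, qsec)
head(small_data)
```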

Extending the select verb

Note

Sometimes it is necessary to perform conditional selection on variables because:

  • at times you need only numerical variables for correlations
  • you may only need categorical variables for testing independence

For such cases we can use functions such as select_if().
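A minimal sketch of conditional selection, assuming dplyr is installed and new_data is a copy of the built-in mtcars dataset. Because every column of mtcars is numeric, vs is first converted to a factor so that the selection has a visible effect. That conversion is an illustrative step, not part of the original exercise:

```r
library(dplyr)

# Assumption: new_data is a copy of the built-in mtcars dataset.
new_data <- mtcars

# Illustrative step: store vs as a factor so not every column is numeric
new_data$vs <- factor(new_data$vs)

# Keep only the numeric columns
numeric_data <- select_if(new_data, is.numeric)
names(numeric_data)

# Keep only the factor (categorical) columns
factor_data <- select_if(new_data, is.factor)
names(factor_data)
```

In dplyr 1.0.0 and later, select(new_data, where(is.numeric)) is the recommended equivalent of select_if().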

The code above selects only variables of class numeric.

Generate new variable using mutate()

With mutate(), you can generate a new variable. For example, in the dataset new_data, we want to create a new variable named log_mpg, which is a log transformation of mpg.

\[log\_mpg=\log(mpg)\]

And let’s observe the first five observations:
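A minimal sketch of this step, assuming dplyr is installed and new_data is a copy of the built-in mtcars dataset. Note that log() in R is the natural logarithm:

```r
library(dplyr)

# Assumption: new_data is a copy of the built-in mtcars dataset.
new_data <- mtcars

# Add log_mpg as the natural log of mpg
new_data <- mutate(new_data, log_mpg = log(mpg))

# Look at the first five observations, showing the old and new columns
head(select(new_data, mpg, log_mpg), 5)
```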

Extending the mutate function

It is often wise to perform conditional mutations on data.

Note

Sometimes it is necessary to perform conditional mutation on variables, so that you only mutate if a certain condition is met.

We often use mutate_if(), mutate_at() and mutate_all() to achieve this.

  • Check the data types before the coming operation.
  • We note that category, servings and high_traffic are characters when in fact they should be factors. Let's change that.
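The conversion can be sketched with mutate_if(). The small data frame below is a hypothetical stand-in with made-up values, used only so the example runs on its own. Only the column names category, servings and high_traffic come from the text; the calories column is included to show that non-character columns are left alone:

```r
library(dplyr)

# Hypothetical toy data with made-up values, for illustration only
toy_data <- data.frame(
  category     = c("Pork", "Chicken", "Pork"),
  servings     = c("4", "2", "6"),
  high_traffic = c("High", "Low", "High"),
  calories     = c(350, 210, 480),
  stringsAsFactors = FALSE
)

# Check the data types before the operation
sapply(toy_data, class)

# Convert every character column to a factor; other columns are untouched
toy_factors <- mutate_if(toy_data, is.character, as.factor)
sapply(toy_factors, class)
```

In dplyr 1.0.0 and later, mutate(toy_data, across(where(is.character), as.factor)) does the same thing.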

Nice! We have turned every character variable into a factor.

Rename variable using rename()

Note

Now, we want to rename

  • variable mpg to miles_per_gallon
  • variable cyl to cylinder
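A minimal sketch of the renaming, assuming dplyr is installed and new_data is a copy of the built-in mtcars dataset. In rename(), the new name goes on the left and the old name on the right. The object name renamed_data is made up for illustration:

```r
library(dplyr)

# Assumption: new_data is a copy of the built-in mtcars dataset.
new_data <- mtcars

# rename(new_name = old_name)
renamed_data <- rename(new_data,
                       miles_per_gallon = mpg,
                       cylinder = cyl)
names(renamed_data)
```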

How about we visualise with ggplot?

  • What is the relationship between mpg and qsec?
  • Add labels.
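One possible plot is sketched below, assuming ggplot2 is installed and new_data is a copy of the built-in mtcars dataset. A scatter plot is one reasonable choice for two numeric variables, and the title and axis labels are only suggestions:

```r
library(ggplot2)

# Assumption: new_data is a copy of the built-in mtcars dataset.
new_data <- mtcars

# Scatter plot of mpg against qsec, with a title and axis labels
p <- ggplot(new_data, aes(x = qsec, y = mpg)) +
  geom_point() +
  labs(
    title = "Relationship between mpg and qsec",
    x = "qsec",
    y = "mpg"
  )

print(p)
```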

GREAT!!!!