Archive | October, 2014

Fuel Consumption Ratings for Canadian Vehicles in 2014

I’m currently following a course called Data Analysis in R on Udacity. Part of the course involves loading up a dataset found online, doing some exploratory data analysis, and publishing my results.

I used the dataset called “2014 – Fuel Consumption Ratings” available on this page:

http://data.gc.ca/data/en/dataset/98f1a129-f628-4ce4-b24d-6f16bf24dd64

As per the description: “A yearly data set of all passenger vehicles sold in Canada based on their fuel-consumption ratings, estimated carbon-dioxide emissions and annual fuel costs.”

I use this command to load the dataset:

fuel <- read.csv("http://www.nrcan.gc.ca/sites/www.nrcan.gc.ca/files/oee/files/excel/MY2014%20Fuel%20Consumption%20Ratings.csv");

Here are a few commands to clean up the data frame, renaming the columns and keeping only rows 2 to 1068:

names(fuel) <- c("Model.Year", "Manufacturer", "Model", "Vehicle.Class", "Engine.Size.L", "Cylinders", "Transmission", "Fuel.Type", "Fuel.Consumption.City.L.100km", "Fuel.Consumption.Hwy.L.100km", "Fuel.Consumption.City.Mpg", "Fuel.Consumption.Hwy.Mpg", "Fuel.Ly", "Co2.Emissions.g.km");
fuel <- fuel[2:1068, ];
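The hard-coded range 2:1068 depends on the exact layout of this year's file. As a rough sketch, assuming every real data row has 2014 in its Model.Year column (an assumption about the file, not something verified here), the same filtering could be written without row numbers. The helper name keep_data_rows and the toy data frame are made up for illustration:

```r
# Hypothetical helper: keep only rows whose Model.Year matches the given year.
# Assumes the units row and any trailing notes hold something else in that column.
keep_data_rows <- function(df, year = "2014") {
  keep <- !is.na(df$Model.Year) & as.character(df$Model.Year) == year
  df[keep, , drop = FALSE]
}

# Toy stand-in for the raw file: a units row, two cars, and a trailing note
toy <- data.frame(
  Model.Year = c("", "2014", "2014", "Notes:"),
  Model = c("", "ILX", "TL", ""),
  stringsAsFactors = FALSE
)

cleaned <- keep_data_rows(toy)
print(nrow(cleaned))
```

On the real data this would be called as fuel <- keep_data_rows(fuel), after the names() line above.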

Let’s load up the ol’ ggplot, as well as gridExtra for added flavour:

library(ggplot2);
library(grid);
library(gridExtra);
theme_set(theme_gray(base_size = 6));

First, let’s coerce some columns from factors to numeric, to make plotting easier:

fuel[, "Fuel.Ly"]                         <- as.numeric(as.character(fuel[, "Fuel.Ly"]));
fuel[, "Engine.Size.L"]                   <- as.numeric(as.character(fuel[, "Engine.Size.L"]));
fuel[, "Co2.Emissions.g.km"]              <- as.numeric(as.character(fuel[, "Co2.Emissions.g.km"]));
fuel[, "Fuel.Consumption.City.L.100km"]   <- as.numeric(as.character(fuel[, "Fuel.Consumption.City.L.100km"]));
fuel[, "Fuel.Consumption.Hwy.L.100km"]    <- as.numeric(as.character(fuel[, "Fuel.Consumption.Hwy.L.100km"]));
fuel[, "Fuel.Consumption.City.Mpg"]       <- as.numeric(as.character(fuel[, "Fuel.Consumption.City.Mpg"]));
fuel[, "Fuel.Consumption.Hwy.Mpg"]        <- as.numeric(as.character(fuel[, "Fuel.Consumption.Hwy.Mpg"]));
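The as.character() step matters: calling as.numeric() directly on a factor returns the internal level codes rather than the printed values. Here is a small illustration on a made-up factor, followed by a sketch of the same conversion done in a loop (the toy data frame and the numericCols vector are for illustration only):

```r
# A factor stores integer codes that index into its levels
f <- factor(c("10.5", "3.2", "10.5"))
codes <- as.numeric(f)                  # level codes, not the values
values <- as.numeric(as.character(f))   # the values as printed

# The same conversion applied to several columns in one loop
toy <- data.frame(
  Engine.Size.L = factor(c("2.0", "3.5")),
  Fuel.Ly = factor(c("1760", "2240"))
)
numericCols <- c("Engine.Size.L", "Fuel.Ly")
for (col in numericCols) {
  toy[[col]] <- as.numeric(as.character(toy[[col]]))
}
str(toy)
```

The loop form should behave the same as the line-by-line version; it just keeps the list of columns in one place.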

Let’s plot a few variables to get a feel for the data:

fuelLY <- qplot(
  data = fuel, 
  x = Fuel.Ly, 
  binwidth = 100,
  xlab = "Fuel (L/year)",
  ylab = "Count"
);

co2emiss <- qplot(
  data = fuel, 
  x = Co2.Emissions.g.km, 
  binwidth = 8,
  xlab = "CO2 Emissions (g/km)",
  ylab = "Count"
);

fuelconscity <- qplot(
  data = fuel, 
  x = Fuel.Consumption.City.L.100km, 
  binwidth = 0.5,
  xlab = "Fuel Consumption in City (L/100km)",
  ylab = "Count"
);

fuelconshwy <- qplot(
  data = fuel, 
  x = Fuel.Consumption.Hwy.L.100km, 
  binwidth = 0.3,
  xlab = "Fuel Consumption on Highway (L/100km)",
  ylab = "Count"
);

grid.arrange(fuelLY, co2emiss, fuelconscity, fuelconshwy, ncol = 2);

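[Figure: histograms of fuel use (L/year), CO2 emissions (g/km), and city and highway fuel consumption (L/100km), arranged in a 2×2 grid]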

It looks like these variables (fuel per year, CO2 emissions, fuel consumption per 100 km) are all roughly normally distributed, and they all have a slight skew towards the left. This is an interesting pattern: it could reflect carmakers’ recent efforts to make models “greener”, pushing overall fuel efficiency up.
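Skew is easy to misjudge by eye, so a numeric check can help. Below is a minimal sketch using the usual moment-based definition of skewness (the skewness helper and the toy vector are made up here; they are not part of the original analysis):

```r
# Sample skewness: third central moment divided by the 1.5 power of the
# second central moment. Positive means a longer tail to the right.
skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

# Made-up consumption values with a few thirsty outliers
toy <- c(6, 7, 7, 8, 8, 8, 9, 10, 14, 18)
skew <- skewness(toy)
print(skew)

# A mean above the median points the same way
print(mean(toy) > median(toy))
```

On the real data this would be called as skewness(fuel$Fuel.Ly), and likewise for the other three columns. A positive value means the long tail is on the right, with most values bunched at the low end.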

Let’s look at the median highway fuel consumption broken down by manufacturer:

fuel$Manufacturer <- factor(fuel$Manufacturer); # Rebuild the factor to drop unused levels

# Each plot is the same boxplot, zoomed to a different range of
# manufacturers on the x axis so the labels stay readable.
firstPlot <- qplot(
    data = fuel, 
    x = Manufacturer,
    y = Fuel.Consumption.Hwy.L.100km, 
    xlab = "Manufacturer",
    ylab = "Highway Fuel Consumption (L/100km)",
    geom = "boxplot"
) + coord_cartesian(xlim = c(0, 16.5));

secondPlot <- qplot(
    data = fuel, 
    x = Manufacturer,
    y = Fuel.Consumption.Hwy.L.100km, 
    xlab = "Manufacturer",
    ylab = "Highway Fuel Consumption (L/100km)",
    geom = "boxplot"
) + coord_cartesian(xlim = c(16.5, 30.5));

thirdPlot <- qplot(
    data = fuel, 
    x = Manufacturer,
    y = Fuel.Consumption.Hwy.L.100km, 
    xlab = "Manufacturer",
    ylab = "Highway Fuel Consumption (L/100km)",
    geom = "boxplot"
) + coord_cartesian(xlim = c(30.5, 40));

grid.arrange(firstPlot, secondPlot, thirdPlot, nrow = 3);

[Figures: three box plot panels of highway fuel consumption (L/100km) by manufacturer]


There are many manufacturers relative to the number of 2014 models each one has, which makes the data a bit hard to read. Of the bigger manufacturers, Mini, Mazda and Honda have the most models with low fuel consumption, with a median near 6 L/100km.
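When the boxplots get crowded, a table of medians and model counts can be easier to scan. A rough sketch with base R's tapply() and table(), run on a made-up data frame (the toy values and the summaryTable name are illustrative only, not taken from the dataset):

```r
# Toy stand-in for the cleaned data frame
toy <- data.frame(
  Manufacturer = c("MINI", "MINI", "MAZDA", "MAZDA", "FORD", "FORD", "FORD"),
  Fuel.Consumption.Hwy.L.100km = c(5.8, 6.2, 6.0, 6.4, 7.5, 9.0, 11.0)
)

# Median consumption and number of models per manufacturer
medians <- tapply(toy$Fuel.Consumption.Hwy.L.100km, toy$Manufacturer, median)
counts <- table(toy$Manufacturer)

# Combine both and sort from lowest to highest median consumption
summaryTable <- data.frame(
  Manufacturer = names(medians),
  Median = as.numeric(medians),
  Models = as.numeric(counts[names(medians)])
)
summaryTable <- summaryTable[order(summaryTable$Median), ]
print(summaryTable)
```

Swapping toy for fuel should give the same kind of table for the real data, with the model count alongside each median to show how much weight it deserves.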

So this concludes my exploratory data analysis. Hope you enjoyed it.