Regression Discussion 2

Author

Langley Burke

#install.packages("AER")
library("AER")
Loading required package: car
Loading required package: carData
Loading required package: lmtest
Loading required package: zoo

Attaching package: 'zoo'
The following objects are masked from 'package:base':

    as.Date, as.Date.numeric
Loading required package: sandwich
Loading required package: survival
data()
#Listing datasets 

Data Set #1

data("EuroEnergy")
str(EuroEnergy)
'data.frame':   20 obs. of  2 variables:
 $ gdp   : int  45451 62049 2003 34540 28388 358675 38039 1331 11788 265863 ...
 $ energy: int  30633 58894 1211 27049 26405 233907 20119 1124 11053 192453 ...
#Store data set in global environment to be able to open up data in another page to view variables and observations.

Data Description

The data set contains GDP and energy consumption for 20 European countries in 1980. Each country has one entry for each variable: GDP is real gross domestic product in million 1975 US dollars, and energy is aggregate energy consumption in million kilograms of coal equivalence.

Data Type

This is a cross-sectional data set. Every observation is taken at a single point in time (1980), so nothing is tracked over time. Each observation is an individual country with its GDP and energy consumption. Two features identify it as cross-sectional: all values refer to the same year, and the observations (countries) have no natural ordering among themselves.

head(EuroEnergy)
           gdp energy
Austria  45451  30633
Belgium  62049  58894
Cyprus    2003   1211
Denmark  34540  27049
Finland  28388  26405
France  358675 233907
tail(EuroEnergy)
               gdp energy
Spain       159602  88148
Sweden       59350  45132
Switzerland  42238  23234
Turkey       91946  32619
UK          279191 268056
WGermany    428888 352677
#Displaying the first six and last six rows
#Countries are row names currently and not a column
#Creating "Country" as a labeled column for plots

?names
ncol(EuroEnergy)
[1] 2
EuroEnergyNew <- tibble::rownames_to_column(EuroEnergy, var = "Country")

ncol(EuroEnergyNew)
[1] 3
print(EuroEnergyNew)
       Country    gdp energy
1      Austria  45451  30633
2      Belgium  62049  58894
3       Cyprus   2003   1211
4      Denmark  34540  27049
5      Finland  28388  26405
6       France 358675 233907
7       Greece  38039  20119
8      Iceland   1331   1124
9      Ireland  11788  11053
10       Italy 265863 192453
11       Malta   1251    456
12 Netherlands  82804  84416
13      Norway  27914  26086
14    Portugal  30642  12080
15       Spain 159602  88148
16      Sweden  59350  45132
17 Switzerland  42238  23234
18      Turkey  91946  32619
19          UK 279191 268056
20    WGermany 428888 352677
#Creating Scatter Plot of Country vs Energy Consumption

library(ggplot2)

ggplot(EuroEnergyNew, aes(x = Country, y = energy)) +
  geom_point() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(x = "Country", y = "Energy Consumption", title = "Energy Consumption by Country")
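
Since Country is a label rather than a measurement, the plot above works mostly as a labelled dot chart. As a rough sketch, energy can also be plotted directly against GDP to see how the two variables move together. The log scales and the axis label wording are my own assumptions here, chosen because GDP ranges from 1,251 to 428,888 in this sample.

```r
library(AER)
library(ggplot2)

data("EuroEnergy")

# Assumption: log scales on both axes, because the values span several orders of magnitude
p <- ggplot(EuroEnergy, aes(x = gdp, y = energy)) +
  geom_point() +
  scale_x_log10() +
  scale_y_log10() +
  labs(x = "GDP (million 1975 US dollars)",
       y = "Energy Consumption (million kg coal equivalent)",
       title = "Energy Consumption vs GDP")
print(p)
```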

Data Set #2

data()
data("TravelMode")
?TravelMode

str(TravelMode)
'data.frame':   840 obs. of  9 variables:
 $ individual: Factor w/ 210 levels "1","2","3","4",..: 1 1 1 1 2 2 2 2 3 3 ...
 $ mode      : Factor w/ 4 levels "air","train",..: 1 2 3 4 1 2 3 4 1 2 ...
 $ choice    : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 2 1 1 ...
 $ wait      : int  69 34 35 0 64 44 53 0 69 34 ...
 $ vcost     : int  59 31 25 10 58 31 25 11 115 98 ...
 $ travel    : int  100 372 417 180 68 354 399 255 125 892 ...
 $ gcost     : int  70 71 70 30 68 84 85 50 129 195 ...
 $ income    : int  35 35 35 35 30 30 30 30 40 40 ...
 $ size      : int  1 1 1 1 2 2 2 2 1 1 ...
head(TravelMode)
  individual  mode choice wait vcost travel gcost income size
1          1   air     no   69    59    100    70     35    1
2          1 train     no   34    31    372    71     35    1
3          1   bus     no   35    25    417    70     35    1
4          1   car    yes    0    10    180    30     35    1
5          2   air     no   64    58     68    68     30    2
6          2 train     no   44    31    354    84     30    2

Data Description

There are 9 variables in this data set, all related to each individual's choice of travel mode: the individual identifier (each individual has 4 rows, one per mode of transportation), the mode of transportation (air, train, bus or car), the choice (yes or no), the terminal wait time for each mode, the vehicle cost, the travel time in the vehicle, the generalized cost, the household income and the size of the party. The data set is meant to show which mode of transportation each individual chose.

Data Type

Because the data contain entries for each individual surveyed and there is no time unit, I believe this is cross-sectional data. The four entries per individual initially made me think it was pooled data, but no different sample is drawn at a different time. It is a single sample with one row per mode for each individual, so the “no” rows are included in the data set alongside the chosen “yes” mode.
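
The structure described above can be checked directly. This is a minimal sketch using base R only; the object names rows_per_person and yes_per_person are my own and are not part of the data set.

```r
library(AER)
data("TravelMode")

# Rows per individual: each individual should have 4 rows, one per mode
rows_per_person <- table(TravelMode$individual)
table(rows_per_person)

# "yes" rows per individual: each individual should have a single chosen mode
yes_per_person <- tapply(TravelMode$choice == "yes", TravelMode$individual, sum)
table(yes_per_person)
```

If both tables show a single value (4 rows and 1 "yes" per individual), the repeated rows come from the four alternatives and not from repeated observation over time.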

#Clean data to keep each individual's "yes" choice of mode and drop the "no" entries

filtered_travel <- TravelMode[TravelMode$choice == "yes", ]

head(filtered_travel)
   individual  mode choice wait vcost travel gcost income size
4           1   car    yes    0    10    180    30     35    1
8           2   car    yes    0    11    255    50     30    2
12          3   car    yes    0    23    720   101     40    1
16          4   car    yes    0     5    180    32     70    3
20          5   car    yes    0     8    600    99     45    2
22          6 train    yes   40    20    345    57     20    1
#making sure only the yes values are kept 

summary(filtered_travel$choice)
 no yes 
  0 210 
summary(filtered_travel$mode)
  air train   bus   car 
   58    63    30    59 
#Can see from the summary which mode is more popular amongst the sample
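
The counts can also be expressed as shares of the sample, which makes the comparison between modes easier to read. A small sketch, where mode_share is a name I made up for illustration:

```r
library(AER)
data("TravelMode")

filtered_travel <- TravelMode[TravelMode$choice == "yes", ]

# Share of the sample choosing each mode, as percentages rounded to one decimal
mode_share <- round(100 * prop.table(table(filtered_travel$mode)), 1)
mode_share
```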
# Base R plot of wait time vs travel time

plot(filtered_travel$wait,filtered_travel$travel, type = "p", xlab = "Wait Time", ylab = "Travel Time", main = "Wait Time vs Travel Time")

#No one is choosing any transportation that has a wait time greater than the travel time.
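
The observation above is read off the plot, so it may be worth confirming with a count. A minimal sketch, assuming wait and travel are measured in the same time unit; n_wait_longer is a hypothetical name:

```r
library(AER)
data("TravelMode")

filtered_travel <- TravelMode[TravelMode$choice == "yes", ]

# Number of chosen trips where terminal wait time exceeds in-vehicle travel time
n_wait_longer <- sum(filtered_travel$wait > filtered_travel$travel)
n_wait_longer
```

A result of 0 would support the statement; any positive count identifies rows worth inspecting.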
library(ggplot2)

ggplot(filtered_travel, aes(x = mode, y = income)) +
  geom_boxplot()+
  theme_grey()+
  labs(x = "Mode", y = "Income", title = "Preferred Mode of Travel vs Income")

#Created a boxplot here to compare the income distribution for each mode. The center line of each box is the median, not the average.
#Air is highest, as might be expected. Train is lower than bus here, although train's range reaches the highest income.
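
To put numbers behind the boxplot, the median and mean income can be computed for each mode. This is a sketch using base R; median_income and mean_income are names I chose for illustration.

```r
library(AER)
data("TravelMode")

filtered_travel <- TravelMode[TravelMode$choice == "yes", ]

# Median income per chosen mode (the center line of each box)
median_income <- tapply(filtered_travel$income, filtered_travel$mode, median)

# Mean income per chosen mode, for comparison with the median
mean_income <- tapply(filtered_travel$income, filtered_travel$mode, mean)

rbind(median = median_income, mean = mean_income)
```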